Originally Posted by kakik09
I don't get what the RX is supposed to stand for. It's a great compute card too? A gaming tag?
X is the Roman numeral for 10.
The previous generation was "R9", so this is effectively "R10", only they chose to label it as RX.
Originally Posted by Randomdude
If the 6144 core Vega isn't at least 70% faster than a Fury X then AMD are in trouble. Supposing that the Fury bottlenecks are removed then that means that the card is actually 45% faster given the core count than a 390x. Instead of just 25%. That would put a 50% faster Vega (bar bottlenecks) at around 217.5% 390x performance. Is that enough?
It will come down to clock speeds.
Well, with Polaris they already got 10-15% more performance per CU, and Vega should add to that. The big problem I see right now is the lack of RBEs (render back ends) on Polaris (the Z/Stencil ROPs, as Mahigan noted). We need a 4096 core part with ideally 32 RBEs, and ideally the 6144 core part with 48 RBEs.
If they can get similar clocks, it should be easily viable. Actually, more than 70% might be doable, considering Vega itself should have enhancements. Remember, Polaris is a refinement of GCN, while Vega has been billed as a new architecture. I expect plenty of similarities with Polaris, but perhaps AMD has now had the time to go back and see what works well and what doesn't.
A larger L2 cache, more RBEs, and solving the occupancy limit on GCN's Compute Units are the priorities IMO. With Polaris they have managed to close the "triangle gap" with Nvidia, assuming they can scale linearly.
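To put some rough numbers on it, here's a back-of-envelope sketch. The 6144 core count is still just the rumor, and the clock and per-CU uplift figures below are purely my own assumptions, not anything AMD has confirmed:

```python
# Back-of-envelope scaling estimate for a rumoured 6144-core Vega vs the Fury X.
# Every input here is an assumption or rumour, not a confirmed spec.

fury_x_cores  = 4096   # Fiji shader count (known)
fury_x_clock  = 1.05   # GHz (known Fury X clock)
vega_cores    = 6144   # rumoured big-Vega configuration (assumption)
vega_clock    = 1.20   # GHz, assumed 14nm FinFET clocks in Polaris territory
per_cu_uplift = 1.10   # assumed ~10% more work per CU per clock, Polaris-style

# Naive linear scaling: cores x clock x per-CU efficiency
scaling = (vega_cores / fury_x_cores) * (vega_clock / fury_x_clock) * per_cu_uplift
print(f"Naive upper bound vs Fury X: {scaling:.2f}x ({(scaling - 1) * 100:.0f}% faster)")
```

Even that naive linear estimate lands well above the 70% mark, but Fiji showed how far short of linear GCN falls when the front end and RBEs can't keep the shaders fed, which is exactly why the RBE count matters.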
Originally Posted by prjindigo
The entirety of the HBM/HBM2 doesn't communicate with the GPU over a native 4096-bit bus; it goes through a "controller chip", much the same as how there's a controller chip on large-array DDR3/4 DIMMs, except instead of changing page ranks this controller combines the total 1024 bits of 500/1000 MHz throughput of the HBM stack into a narrower but 4x or 8x faster stream to feed into the GPU itself. Each HBM/HBM2 stack transceives data with the GPU through a smaller number of channels. Basically, since we're talking an on-silicon, local-to-GPU situation with minimum distance, I'd bet that a 128/128 bit channel pair to each HBM/HBM2 stack would be sufficient. For a 1024-bit 500 MHz stack to translate, you'd be running 128 bits at 4000 MHz in and out of the GPU itself per stack. So Fiji probably only has 512 in and 512 out channels, for a total of 1024 memory pins. The real magic of HBM/HBM2 comes from these little memory interface chips under the stack of RAM that translate slow-and-wide into narrow-and-fast.
So each HBM memory stack would have a 256 bit bus consisting of 128 send and 128 receive to the GPU itself and there's no reason for the GPU to actually concatenate all this into a single internal process either... Fiji has 1024 shaders per HBM stack.
Now the Tesla P100 has 8 memory controllers connected in pairs to each HBM2 bank. nVidia is selling the Tesla P100 in 16GB and 12GB models for PCIe operation as workload accelerators; they're not really designed to make graphics but sit in 64-bit mode all day long. Graphics are 32-bit or less. But the fact that they're selling P100Ts with a bad HBM2 stack on them, with a listed 3072-bit memory width, tells us that the GPU itself is NOT operating 4096 bits wide when communicating with the RAM, but is communicating over a bus width that works for both 3072 and 4096. The more connections you have, the more likely you'll have failed chips, so it makes sense that the entire chip is operating with an aggregate of 512 send and 512 receive channels. Since the data itself is only gonna be 64 bits wide anyway, HBM2 has a dual 64-bit channel function up through the stack, allowing much faster random reads and writes that are VERY advantageous to random processing. AND having two 64/64 memory controllers talking to each HBM2 stack's interface chip doubles the chance of having a "fully functioning" P100T in the case that an error occurs in one of the controllers - it can still operate the entire 4GB/8GB stack in 64-bit transceive. The chip will be slower, but nVidia can still sell it as a 16GB, 4096-wide chip even though it's technically a "differently abled" processor.
So if we deal with standard HBM1 - 500 MHz, 1024 wide, 4 deep - we're looking at the internal-to-stack 128-bit DDR signalers per memory layer allowing it to send 256 bits per clock per layer. That goes down to the interface chip in the stack, which translates it up to 4000 MHz by 128 wide and pushes it into the GPU, or receives at 4000 MHz by 128 wide from the GPU and brings it back down to the 128-wide DDR write... etc.
Now the tricky thing is that HBM1 actually allows for 8-layer memory, but the HBM stacks would stand too tall to sit "flush" with the GPU beside them! Technically, with a very special, super-precision-milled heatsink, AMD could have made 8GB Fury cards! Like micron-milled and fitted... HBM2's layers are wider and thinner.
Fun tidbit: The back of the completed Fiji assembly interposer has 2021 balls on it (a square grid 45 in each dimension with one missing in each corner) and JEDEC says there's 3982 micro-bumps on the interposer upon which the HBM stacks and GPU itself mount - which I assume they're talking about Fiji's interposer but I have no way to qualify that information. Pretty sure the P100T has more since it also has a cross-talk bus to talk to the 3 other P100T in the DGX-1. I would assume that ALL of the P100T in the Tesla P PCIe cards have damage on the cross-talk bus.
There really isn't anything other than an xPU (GPU or CPU, whatever) pin-connection limit on the HBM1/HBM2 memory interface; at 14nm it's conceivably possible to use a 32nm interposer and produce an 8192-wide HBM2 array using 16x 64-bit memory controllers talking to the logic chips on the bottom of the HBM2 stacks... it just becomes an issue of mindless complexity.
Now the HBM/HBM2 standard allows for the use of a single, double or quadruple stack, so you could technically have an HBM2 device that operates with just two 64bit connections with a single 4 or 8 gigabyte stack on a smaller interposer.
HBM/HBM2 isn't really "new" per se - our DDR4 DIMMs are actually 4 RAM banks per slot right now at 800 to 1000 MHz clocks - it's just a different way of distributing the transaction.
The real basic core of the HBM technology is the interposer.
Interposers are made using higher-nm photolithography and are built like processors. "I heard you liked processors, so we put processors beside processors on processors!" - JEDEC
tl;dr: The AMD Fury and nVidia Tesla P100 both use a 512-bit bi-directional memory interface to communicate with the HBM/HBM2 stacks; the only difference at the processor level between HBM1 and HBM2 is that HBM1 uses solid 128-bit transceivers and HBM2 uses a pair of 64-bit transceivers, and all the badonk you've heard about HBM1 being limited to 4GB is wrong - both standards allow 8GB. The "4096 bits wide" claim is a bold-faced lie, since the HBM1 and HBM2 stacks communicate internally over a 128-bit-wide bus inside the stack. HBM1 and HBM2 are both 512-bit technologies.
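The wide-and-slow versus narrow-and-fast trade he's describing is really just a bandwidth equivalence, so it's easy to sanity-check. In the sketch below, the 1024-bit / 500 MHz DDR figures are HBM1's published per-stack numbers; the 128-bit link at 4 GHz is his guess, not anything in the JEDEC spec:

```python
# Sanity check on the "slow-wide vs narrow-fast" idea, per HBM1 stack.
# 1024-bit / 500 MHz DDR (1 Gbps per pin) comes from the HBM1 spec;
# the 128-bit "narrow" link is prjindigo's hypothetical, not a documented bus.

def bandwidth_gb_s(width_bits, gbps_per_pin):
    """Peak bandwidth in GB/s = pins * per-pin rate / 8 bits per byte."""
    return width_bits * gbps_per_pin / 8

hbm1_stack  = bandwidth_gb_s(1024, 1.0)  # 500 MHz DDR = 1 Gbps per pin -> 128 GB/s
narrow_link = bandwidth_gb_s(128, 4.0)   # his hypothetical 4000 MHz link -> 64 GB/s

print(f"HBM1 stack, 1024-bit @ 1 Gbps/pin : {hbm1_stack:.0f} GB/s")
print(f"Hypothetical 128-bit @ 4 Gbps/pin : {narrow_link:.0f} GB/s")
print(f"Fiji total, 4 stacks              : {4 * hbm1_stack:.0f} GB/s")
```

The totals only line up with Fiji's published 512 GB/s if you keep the DDR factor; a 128-bit link would have to run at 8 Gbps per pin to match a full stack, so his 4000 MHz figure only works out if you drop the DDR doubling.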
That interposer is quite an achievement technologically, but right now, for gamers, we need a GPU powerful enough to take advantage of that bandwidth. A 6144 core Vega might just be the ticket, especially at higher resolutions.
The Fury X, although a decent demo, was heavily limited by its triangle output. I suspect the deployment did give AMD some pretty valuable experience working with HBM, though.
If the cost of interposer manufacturing comes down, I could see this going into APUs, which are heavily bandwidth limited. Right now it's only an option on high-end cards because it's very pricey to manufacture. Rumor has it even GP102 may use GDDR5X instead of HBM2.
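The bandwidth gap is what makes the APU idea so tempting. A quick comparison of a hypothetical single-stack HBM2 APU against today's dual-channel DDR4 - the HBM2 figure is its rated 2 Gbps per pin, and the APU configuration itself is pure speculation on my part:

```python
# Why HBM looks so tempting on an APU: dual-channel DDR4 vs one HBM2 stack.
# The DDR4-2400 config is a typical desktop APU setup; a single-stack HBM2
# APU is pure speculation on my part.

def ddr4_dual_channel_gb_s(mt_per_s):
    """Two 64-bit channels, 8 bytes per transfer each."""
    return 2 * 8 * mt_per_s / 1000

def hbm2_stack_gb_s(gbps_per_pin=2.0, width_bits=1024):
    """One HBM2 stack at its rated 2 Gbps per pin on a 1024-bit interface."""
    return width_bits * gbps_per_pin / 8

print(f"DDR4-2400, dual channel : {ddr4_dual_channel_gb_s(2400):.1f} GB/s")
print(f"Single HBM2 stack       : {hbm2_stack_gb_s():.0f} GB/s")
```

Roughly 6-7x the bandwidth for the integrated GPU, which is exactly where current APUs choke.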
Originally Posted by Omega X
So now the vitriol has gotten SO low that there's manufactured evidence? What the hell is going on with OCN these days.
Unfortunately so. There does seem to be a lack of objectivity toward whichever side one dislikes.
Edited by CrazyElf - 7/7/16 at 5:23pm