Originally Posted by Particle
I'm not sure what you mean. Single HBM chips are 1024 bits wide. Can you explain what you mean when you talk about 256 bit buses on these cards? It doesn't seem to fit into anything that we know about HBM based products.
The HBM/HBM2 stack doesn't communicate with the GPU over a native 4096-bit bus. It goes through a "controller chip," much the same as the register chip on large-array DDR3/4 DIMMs, except that instead of switching ranks this controller combines the stack's full 1024 bits of 500/1000 MHz throughput into a narrower stream running 4x or 8x faster to feed into the GPU itself. Each HBM/HBM2 stack exchanges data with the GPU through a smaller number of channels. Basically, since we're talking an on-silicon, local-to-GPU situation with minimal trace distance, I'd bet that a 128/128-bit channel pair to each HBM/HBM2 stack would be sufficient: for a 1024-bit, 500 MHz stack to translate, you'd be running 128 bits at 4000 MHz in and out of the GPU per stack. So Fiji probably only has 512 send and 512 receive channels, for a total of 1024 memory pins. The real magic of HBM/HBM2 comes from these little memory interface chips under the stack of RAM that translate slow-and-wide into narrow-and-fast.
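A quick sanity check on the slow-wide to narrow-fast arithmetic above. This just reproduces the post's own figures under its stated assumption (raw bit rate must match on both sides of the interface chip); the 128-bit link width is the post's guess, not a published spec:

```python
# Slow-and-wide stack side vs. narrow-and-fast GPU side, per the post's model.
stack_width_bits = 1024   # internal HBM1 stack bus width
stack_clock_mhz  = 500    # HBM1 clock, per the post

gpu_width_bits   = 128    # hypothetical per-stack link into the GPU
# For the raw bit rate to match, the narrow link must clock proportionally faster.
gpu_clock_mhz    = (stack_width_bits * stack_clock_mhz) // gpu_width_bits

print(gpu_clock_mhz)  # 4000, the figure quoted above
```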
So each HBM memory stack would have a 256 bit bus consisting of 128 send and 128 receive to the GPU itself and there's no reason for the GPU to actually concatenate all this into a single internal process either... Fiji has 1024 shaders per HBM stack.
Now, the Tesla P100 has 8 memory controllers connected in pairs to each HBM2 stack. NVIDIA is selling the Tesla P100 in 16 GB and 12 GB models for PCIe operation as workload accelerators; they're not really designed to make graphics but sit in 64-bit mode all day long, and graphics work is 32-bit or less. But the fact that they're selling P100Ts with a bad HBM2 stack, listed at a 3072-bit memory width, tells us that the GPU itself is NOT operating 4096 bits wide to the RAM; it's communicating at a bus width that works for both 3072 and 4096. The more connections you have, the more likely some will fail, so it makes sense that the entire chip operates with an aggregate of 512 send and 512 receive channels. Since the data itself is only going to be 64 bits wide anyway, HBM2 has a dual 64-bit channel function up through the stack, allowing much faster random reads and writes, which is VERY advantageous for random processing. AND having two 64/64 memory controllers talking to each HBM2 stack's interface chip doubles the chance of a "fully functioning" P100T when an error occurs in one of the controllers: it can still operate the entire 4 GB/8 GB stack with 64-bit transmit/receive. The chip will be slower, but NVIDIA can still sell it as a 16 GB, 4096-wide chip even though it's technically a "differently abled" processor.
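The salvaged-P100 argument above boils down to divisibility: whatever the GPU-side bus width is, it has to serve both the 4-stack (4096-bit) and 3-stack (3072-bit) configurations. A one-liner shows the largest width that fits both, which happens to match the post's 512 send + 512 receive aggregate:

```python
import math

full = 4096  # listed width with 4 working HBM2 stacks
cut  = 3072  # listed width with one stack disabled

# Largest bus width that divides evenly into both configurations.
common_width = math.gcd(full, cut)
print(common_width)  # 1024 total pins -> consistent with 512 send + 512 receive
```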
So if we take a standard HBM1 stack, 500 MHz, 1024 bits wide and 4 layers deep, we're looking at 128-bit DDR signalling per memory layer, letting each layer send 256 bits per clock. That goes down to the interface chip in the stack, which translates it up to 4000 MHz at 128 bits wide and pushes it into the GPU, or receives at 4000 MHz by 128 wide from the GPU and steps it back down to the 128-bit-wide DDR write, and so on.
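The per-layer numbers above can be checked the same way. This follows the post's model of a 4-high stack with 128-bit DDR signalling per layer, and shows how the layers add up to the stack's 1024-bit width:

```python
# Per-layer arithmetic from the paragraph above (the post's model, not a spec).
layer_width_bits = 128  # DDR signalling width per memory layer
ddr_factor       = 2    # two transfers per clock edge pair
layers           = 4    # 4-high HBM1 stack

bits_per_clock_per_layer = layer_width_bits * ddr_factor  # 256 bits per clock
bits_per_clock_stack     = bits_per_clock_per_layer * layers

print(bits_per_clock_per_layer)  # 256
print(bits_per_clock_stack)      # 1024, matching the stack's quoted width
```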
Now, the tricky thing is that HBM1 actually allows for 8-layer stacks, but those stacks would stand too tall to sit flush with the GPU beside them! Technically, with a very special, super-precision-milled heatsink, AMD could have made 8GB Fury cards: milled and fitted to the micron. HBM2's layers are wider and thinner.
Fun tidbit: the back of the completed Fiji interposer assembly has 2021 balls on it (a 45 x 45 square grid with one ball missing at each corner), and JEDEC says there are 3982 micro-bumps on the interposer on which the HBM stacks and the GPU itself mount. I assume they're talking about Fiji's interposer, but I have no way to verify that. Pretty sure the P100T has more, since it also has a cross-talk bus to talk to the other P100Ts in the DGX-1. I would assume that ALL of the P100Ts in the Tesla P PCIe cards have damage on the cross-talk bus.
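The ball count above is easy to verify as plain arithmetic, taking the post's description of the grid at face value:

```python
# 45 x 45 BGA grid with one ball missing at each of the four corners.
grid_dim = 45
balls = grid_dim * grid_dim - 4

print(balls)  # 2021, matching the count quoted above
```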
There really isn't anything limiting the HBM1/HBM2 memory interface other than the xPU's (GPU or CPU, whatever) pin-connection count. At 14 nm it's conceivably possible to use a 32 nm interposer and produce an 8192-bit-wide HBM2 array using 16 x 64-bit memory controllers talking to the logic chips on the bottom of the HBM2 stacks... it just becomes an issue of mindless complexity.
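Scaling the post's model out shows where the 8192-bit / 16-controller figures come from; the stack count of 8 is implied by the arithmetic, not stated anywhere official:

```python
# Hypothetical wide build-out under the post's model: 8 stacks of HBM2,
# each served by a pair of 64-bit controllers.
stacks                = 8
stack_width_bits      = 1024  # nominal width per HBM2 stack
controllers_per_stack = 2     # dual 64-bit channels per stack, per the post

total_width       = stacks * stack_width_bits       # aggregate array width
total_controllers = stacks * controllers_per_stack  # 64-bit controllers needed

print(total_width)        # 8192
print(total_controllers)  # 16
```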
Now, the HBM/HBM2 standard allows for single, double, or quadruple stacks, so you could technically have an HBM2 device that operates with just two 64-bit connections to a single 4 or 8 gigabyte stack on a smaller interposer.
HBM/HBM2 isn't really "new" per se; our DDR4 DIMMs are already four RAM banks per slot at an 800 to 1000 MHz clock. It's just a different way of distributing the transaction.
The real basic core of the HBM technology is the interposer.
Interposers are made using coarser (higher-nm) photolithography and are built like processors. I heard you liked processors, so we put processors beside processors on processors! - JEDEC
tl;dr: AMD Fury and NVIDIA Tesla P100 both use a 512-bit bidirectional memory interface to communicate with the HBM/HBM2 stacks. The only processor-level difference between HBM1 and HBM2 is that HBM1 uses solid 128-bit transceivers while HBM2 uses pairs of 64-bit transceivers, and all the badonk you've heard about HBM1 being limited to 4 GB is wrong: both standards allow 8 GB. The "4096 bits wide" figure is a bald-faced lie, since the HBM1 and HBM2 stacks communicate internally over a 128-bit-wide bus inside the stack. HBM1 and HBM2 are both 512-bit technologies.