Originally Posted by Cyrious
See, I was asking because the HBM is likely going to be attached to the iGPU, which means any L2$ miss would result in a request going out through the CPU-GPU interlink, through the GPU memory controller, to the HBM, waiting for the HBM to service the request (it's still a form of DRAM), and then all the way back. There's the latency penalty from doing that, but there's the bandwidth penalty as well, as the interlink between CPU and GPU is only rumored to hit about 100GB/s, and a chunk of that is in turn going to be eaten by CPU-GPU traffic. I'd rather be able to hit ~200GB/s and 10ns latency to the local L3 cache than <100GB/s and 20-30+ns latency to the HBM, even if the HBM has orders of magnitude more capacity.
Who knows though, other than the AMD engineers who built the damn thing? I could be entirely wrong (and you right), and the HBM's relatively enormous capacity could be enough to null out the latency and bandwidth hits, or render them inconsequential.
The latency and bandwidth penalties could be made up for with capacity and streaming techniques in the target applications. It's one thing to be able to fit an 8MB dataset into a 100GB/s, 28-cycle-latency buffer, and quite another to fit a 1GB dataset into a 100GB/s, 48ns buffer (plus some interface overhead - probably ~15 cycles total, including memory controller commands).
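Rough back-of-envelope numbers, if anyone wants to play with them - the clock speed and exact figures below are just my own assumptions for illustration, not anything AMD has confirmed:

```c
#include <stdio.h>

int main(void) {
    /* Assumed core clock for converting cycles to ns -- illustrative only. */
    const double clock_ghz = 3.5;

    /* L3-like buffer: ~28-cycle load-to-use latency (the Sandy Bridge figure). */
    double l3_latency_ns = 28.0 / clock_ghz;

    /* HBM-like buffer: ~48 ns latency plus ~15 cycles of interface /
       memory-controller overhead (rough guesses, matching the post above). */
    double hbm_latency_ns = 48.0 + 15.0 / clock_ghz;

    /* Time to stream each dataset at ~100 GB/s, in ns. */
    double stream_8mb_ns = (8e6 / 100e9) * 1e9;   /* 8 MB dataset */
    double stream_1gb_ns = (1e9 / 100e9) * 1e9;   /* 1 GB dataset */

    printf("L3-like latency : %6.1f ns\n", l3_latency_ns);
    printf("HBM-like latency: %6.1f ns\n", hbm_latency_ns);
    printf("Stream 8 MB : %10.0f ns (%.2f ms)\n", stream_8mb_ns, stream_8mb_ns / 1e6);
    printf("Stream 1 GB : %10.0f ns (%.2f ms)\n", stream_1gb_ns, stream_1gb_ns / 1e6);
    return 0;
}
```

Once you're streaming anything near 1GB, the transfer time (~10ms at 100GB/s) dwarfs the extra ~40ns of access latency by several orders of magnitude, which is the whole point of the capacity argument.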
BTW, I used Intel's superior cache numbers from Sandy Bridge (96GB/s, 28-cycle latency). AMD has never proven to be capable of that level of performance with their L3 caches, but we'll just assume they managed it for Zen.
Of course, we're only talking about large-dataset workloads... which is exactly where AMD would likely target such a beast of an APU... and that's the only crowd willing to spend big money on something like this.
If we pretend the HBM is used as an L4, though, that means a 28-cycle added penalty before we can even hit up the HBM. That would be worthwhile only some of the time - in certain latency-sensitive cases... however, those very same cases also get hurt every time there is a miss in the HBM, since the access then still has to go out to main memory. To avoid stacking those penalties, the APU would need to search main memory and the HBM concurrently.
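To see why the concurrent probe matters, here's a little average-latency model - every number in it (28-cycle L3 miss at ~3.5GHz, ~52ns for the HBM-as-L4, ~70ns for main memory) is an assumption I picked for illustration:

```c
#include <stdio.h>

int main(void) {
    /* All figures are assumptions for illustration:
       l3_miss_ns : ~8 ns  (28 cycles at ~3.5 GHz before the L4 can even be probed)
       l4_ns      : ~52 ns (HBM used as an L4)
       dram_ns    : ~70 ns (regular main memory)                                   */
    const double l3_miss_ns = 8.0;
    const double l4_ns      = 52.0;
    const double dram_ns    = 70.0;

    for (double hit = 0.50; hit <= 0.951; hit += 0.15) {
        /* Serial: probe the HBM L4 first, go to main memory only on a miss. */
        double serial = l3_miss_ns + hit * l4_ns + (1.0 - hit) * (l4_ns + dram_ns);

        /* Concurrent: probe the HBM L4 and main memory together,
           so an L4 miss costs no more than a normal DRAM access. */
        double concurrent = l3_miss_ns + hit * l4_ns + (1.0 - hit) * dram_ns;

        printf("L4 hit rate %.2f: serial %.1f ns, concurrent %.1f ns\n",
               hit, serial, concurrent);
    }
    return 0;
}
```

The lower the L4 hit rate, the more the serial version pays for checking the HBM first - which is exactly what the latency-sensitive cases would be worried about.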
The next option is to allow full software control: let systems map part of the HBM as system memory and simply copy speed-sensitive data into that higher-performance memory (likely 512GB/s). An operating system's file system cache would be a fantastic use for this - it would speed up every program that touches the file system. AMD's HSA could be a real boon for them here... and this could be the first serious use of it we see.
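If the HBM really were exposed to the OS as its own memory region (say, as a separate NUMA node - pure speculation on my part), the software side could look something like this minimal libnuma sketch; the node id is hypothetical:

```c
/* Sketch of software-managed placement, assuming the HBM shows up to the
   OS as its own NUMA node (node 1 here is just a hypothetical id).
   Build with: gcc hbm_alloc.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define HBM_NODE 1  /* hypothetical node id for the HBM region */

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA policy is not supported on this system\n");
        return 1;
    }

    size_t size = 64 * 1024 * 1024;  /* 64 MB of "hot" data */

    /* Ask the kernel to back this allocation with memory on the HBM node. */
    void *hot = numa_alloc_onnode(size, HBM_NODE);
    if (hot == NULL) {
        fprintf(stderr, "allocation on node %d failed\n", HBM_NODE);
        return 1;
    }

    /* Touch the pages so they actually get faulted in on that node,
       then use the buffer for the speed-sensitive working set. */
    memset(hot, 0, size);

    numa_free(hot, size);
    return 0;
}
```

The OS's file system cache could do something similar transparently, preferring the fast region for hot pages and spilling everything else to regular DRAM.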