Originally Posted by ku4eto
The 390X OCs around 15% on core and 10% on memory. The 980 Ti does ~25% core and ~15% memory. I wouldn't say that this is an easy win for the memory at least. On core, yes, the ~10% difference seems like a win.
Remember though, Nvidia's GPUs don't seem to scale linearly once you reach a certain point. They seem to be bottlenecked somewhere. Mahigan thinks it's the VRAM. Could be. We'll know soon enough (when Pascal comes out with HBM2). I've suggested elsewhere that it could be the graphics drivers, in which case this will end with DX12.
Originally Posted by Mahigan
I think that Arctic Islands did get the desired revamp. Since AMD feels confident in claiming it is Post-GCN or GCN Next, this would indicate that, unlike GCN 1.0/1.1/1.2, Arctic Islands will in fact be based on a more efficient design. I'm not sure why, but I have this feeling that we won't necessarily see a large increase in ALUs; rather, I see an increase in the IPC throughput of several elements in both the compute and graphics pipelines.

I see AMD sticking with 8 ACEs, or maybe even dropping down to 4 (staying at 8 if there is an increase in ALUs, and dropping to 4 if there is little to no increase in ALUs but rather an increase in computational performance per ALU). The current ACE organization is not truly being tapped yet, so investing in this department probably wouldn't lead to much, if any, gain in the immediate term.

One way of achieving higher performance all around would be to increase the memory pools on GCN. Increasing the size and speed of the LDS and GDS pools would have a rather large impact on both triangle and computational throughput under parallel workloads (unless AMD's engineers reworked HyperZ and improved Z-culling, in which case they would need less of an investment in the on-die cache). More ROPs make sense, but more RBEs as well (or improved Z-culling, as mentioned before, like a new version of HyperZ that could keep unnecessary pixels from being rendered, thus boosting tessellation performance by saving memory and compute resources), and perhaps an increase in TMU efficiency (particularly as it pertains to int16 performance, by moving to 64-bit/FP16 @ 4 texels/clk, which newer games are making use of). 16nm FinFET should allow Greenland to achieve higher clocks, which would help improve the speed of the on-die cache.
AMD don't need to make drastic changes to GCN in order to have an architecture which fits the mold of what many devs will utilize because of their design wins in the console markets.
If AMD simply add:
- Improved Z-Culling (New HyperZ)
- 64-bit/FP16 @ 4 Texels/clk
- Boost RBEs and ROPs
- Improved Cache (LDS/GDS)
They'd go a long way in rectifying some of the shortcomings in their front end graphics pipeline performance. That being said, they do need to beef up the front end, and that may mean cutting down on the die space used for the computational units. Perhaps 16nm FinFET will allow AMD's engineers to retain the computational advantage and beef up the front end, but we'll see, I suppose. That is, unless AMD continues with its forward-thinking approach (which is quite foolish in terms of sales, even though it lends itself to a higher return on investment from a consumer's perspective), in which case we could see a more powerful compute pipeline while the front end gets little to no revamp.
The front end no doubt needs an update.
I would agree that this should address the issues, regardless of whether or not the ROPs are the culprit.
- Do you think it'd be worth splitting into more CEs? Right now they've got 4 CEs, with 1024 SP, 16 ROPs, and 4 RBE groups per CE. Perhaps with more CEs that could be addressed.
- HBM2, I think, will actually benefit AMD more right now, because they are less efficient with their color compression. Either way, bandwidth should not be a limit for either vendor. Greenland needs to ship with at least 8GB of VRAM, and perhaps 16GB would be ideal, if not overkill. The key is to have enough VRAM that you don't run out of memory before you run out of core, but too much can hurt performance as well.
- Compute for the professional cards (like FireGL and Quadro/Tesla) is of course different. We'll see a lot more VRAM in those, and it will be ECC HBM2. We'll also likely see FP64 performance on both Nvidia and AMD GPUs pushed back up. Both Maxwell and Fiji gimped their DP performance.
Personally, I think splitting into 8 or even 10 CEs with a higher ROP and RBE to SP ratio should address the issue: perhaps 512 SP per CE, keeping 16 ROPs and 16 RBEs per CE. With a more expanded front end, the capacity is 2 - 2.5x as good, and that's not taking into account the new HyperZ.
- 10 CEs with 5120 SP, 160 ROPs, 40 RBE groups
- HBM2, so roughly 1 TB/s (1024 GB/s) of bandwidth; at least 8GB of HBM2 and perhaps 16GB
- Front end, most importantly, has 2.5x the triangle/tessellation performance times whatever improvement HyperZ brings (let's say it doubles it), so in that case 2.5 x 2 = 5x the triangle output. If it's not double, then it's 2.5 x the HyperZ improvement.
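The back-of-the-envelope math above can be sketched out explicitly. Both factors here are assumptions from this discussion (the 2.5x front end scaling and the HyperZ doubling), not measured numbers:

```python
# Hypothetical front-end scaling using the assumed figures above.
base_triangle_rate = 1.0   # current Fiji front end, normalized to 1
ce_scaling = 2.5           # assumed gain from the wider 10-CE front end
hyperz_gain = 2.0          # assumed gain from an improved HyperZ

effective = base_triangle_rate * ce_scaling * hyperz_gain
print(f"{effective}x the baseline triangle output")  # 5.0x
```

If HyperZ delivers less than a doubling, just swap in the real factor; the multiplication is the whole model.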
You don't think the ACEs are a bottleneck? I had been advocating more for a "hyper parallel" sort of GPU, with perhaps as many as 16 or even 20 ACEs (with the 10 CE configuration), combined with a vastly improved cache. Can you go into details on the cache ideas?
The question is, how parallel can a GPU get before we reach the limits? Or is graphics work close to perfectly embarrassingly parallel?
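Amdahl's law gives one rough way to frame that question: even a small serial fraction in the workload caps the gain from adding more parallel hardware, no matter how wide the GPU gets. The 5% serial figure below is purely illustrative:

```python
# Amdahl's law: speedup from n parallel units when a fraction
# `serial` of the work cannot be parallelized.
def speedup(serial: float, n: int) -> float:
    return 1.0 / (serial + (1.0 - serial) / n)

# With just 5% serial work, 64 parallel units deliver only ~15x,
# and the ceiling is 1/0.05 = 20x regardless of unit count.
print(round(speedup(0.05, 64), 1))  # 15.4
```

A truly embarrassingly parallel workload is the serial=0 case, where speedup scales linearly with unit count; the interesting question is how close real frames get to that.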
Regardless of the method, I think though that by far the most urgent thing AMD has to do is to get the front end of that GPU vastly upgraded.
We are in agreement here.
Originally Posted by Blameless
NVIDIA has a greater marketshare than AMD, but there is almost certainly more GCN hardware in circulation than there is Maxwell 2 hardware in circulation.
I would also be astounded if Pascal wasn't vastly better at handling async compute than Maxwell 2.
Maxwell 2 is going to be the outlier product, not the status quo.
I still think it's an ROP limitation. All the memory bandwidth and color compression in the world can't change final technical limits on pixel fill rate.
My Hawaii parts, back when I was mining on them, saw about a 50% reduction in frame rate by using custom firmware that substantially improved memory performance, but cut active ROPs (and only ROPs) by half...implying that even Hawaii was close to an ROP bottleneck.
Fiji improved fill rate by 5%, but shader horsepower by ~45%. Even with its superior compression and loads of bandwidth, everything is pointing to a bottleneck in this area.
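Those two figures can be sanity-checked from the public specs: the 290X has 2816 shaders and 64 ROPs at 1000 MHz, the Fury X has 4096 shaders and the same 64 ROPs at 1050 MHz:

```python
# Sanity check of the Hawaii -> Fiji scaling figures from public specs.
hawaii = {"shaders": 2816, "rops": 64, "clock_mhz": 1000}  # R9 290X
fiji   = {"shaders": 4096, "rops": 64, "clock_mhz": 1050}  # Fury X

# Shader count alone grew ~45%; fill rate only grew with the clock,
# since the ROP count stayed flat at 64.
shader_gain = fiji["shaders"] / hawaii["shaders"] - 1
fill_gain = (fiji["rops"] * fiji["clock_mhz"]) / (hawaii["rops"] * hawaii["clock_mhz"]) - 1

print(f"shader horsepower: +{shader_gain:.0%}")  # +45%
print(f"pixel fill rate:   +{fill_gain:.0%}")    # +5%
```

The mismatch between those two growth rates is exactly the imbalance being argued about here.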
On the note of Nvidia, I think we will see gains with Pascal (it is also very Compute oriented after all), but a truly "parallel" GPU may actually have to wait until Volta. I think that they may have taped out Pascal before realizing the true extent of AMD's intentions.
On the note of AMD, they've got to address that bottleneck. Pulling up the hard specs again (from TechReport):
From what we can see:
- Front end of Fury X was very similar to 290X
- Rasterization seems unlikely to be the bottleneck (Fury X can actually support more draw calls than a 980Ti)
- That leaves the triangles on the front end, or perhaps as you've noted the ROPs
The fill rate and the triangle output did not improve from Hawaii to Fiji. It's gotta be one of those two.
Another possibility is that it is both the ROPs and the triangles at different areas. So we might both be right.
I guess the way to describe this would be, imagine a factory with different steps in manufacturing a process. First the raw materials come in.
- Step 1 does 2000 units/day.
- Step 2 does 1000 units/day.
- Step 3 does 3000 units/day.
- Step 4 does 2500 units/day.
- Step 5 does 1000 units/day.
- Step 6 does 1500 units/day.
- Step 7 does 2700 units/day.
Out goes a finished product.
Steps 2 and 5 are the bottlenecks (so in this analogy, those would be the triangle output and the ROPs). I guess you could argue that by adding more shaders, AMD has done the equivalent of upgrading step 3.
Not a perfect analogy, but I think you get what I am trying to say here.
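The factory analogy above reduces to a one-liner: steady-state throughput of a pipeline is the throughput of its slowest stage, and upgrading any other stage changes nothing. A minimal sketch using the numbers from the list:

```python
# Toy model of the factory analogy: pipeline throughput is min(stages).
steps = {
    "step1": 2000, "step2": 1000, "step3": 3000, "step4": 2500,
    "step5": 1000, "step6": 1500, "step7": 2700,
}  # units/day

throughput = min(steps.values())
bottlenecks = [name for name, rate in steps.items() if rate == throughput]
print(throughput, bottlenecks)  # 1000 ['step2', 'step5']

# The "more shaders" move: upgrading a non-bottleneck step does nothing.
steps["step3"] = 5000
print(min(steps.values()))  # still 1000
```

In this framing, Fiji's extra shaders raised step 3's capacity while steps 2 and 5 (triangle output and ROPs) stayed put.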
Regardless, I believe that adding more CEs with fewer shaders/clusters per CE should address this. If what you are saying is true, then even 704 SP per 16 ROPs may be too many; in that case the optimal (and by optimal, to use the factory analogy, I mean all steps having about the same maximum capacity) may be much lower, perhaps ~512 SP per 16 ROPs (I'm making an educated guess here; if you have any better ideas I would love for you to share them).
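For comparison, the shader-to-ROP ratios being discussed work out as follows (the 704 and 512 per-CE figures are guesses from this thread, not official configurations):

```python
# Shader-to-ROP ratios: Fiji's actual ratio vs the per-CE
# configurations being floated in this discussion.
fiji_ratio = 4096 / 64      # Fury X as shipped: 64 SP per ROP
heavy_ratio = 704 / 16      # 44 SP per ROP, "may be too many"
proposed_ratio = 512 / 16   # 32 SP per ROP, the guessed sweet spot
print(fiji_ratio, heavy_ratio, proposed_ratio)  # 64.0 44.0 32.0
```

So the ~512 SP per 16 ROPs proposal is asking for half of Fiji's shader-per-ROP loading, i.e. twice the relative ROP provisioning.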
On the note of tessellation and memory compression: that's a lesser issue, not the bottleneck per se limiting frame rates, but it's a weak point that AMD should address, assuming it has the resources. If it does, tessellation-heavy features like Hairworks would no longer hit AMD so hard. Memory bandwidth is the least important matter, I think, because HBM2 will double the bandwidth over HBM, and the 4GB VRAM bottleneck will be gone.
Anyways, I presume you've read Mahigan's post which I've quoted. The key is to build a "balanced GPU", which the Fury X is clearly not, and although we may disagree on what the causes may be, it's clear that it's being bottlenecked somewhere or we'd see a 45% increase compared to Hawaii (assuming the same core clock of course).
By balanced, I'm referring to something like each step in that factory being able to pull off, say, 2000 units a day. You're only as good as the weakest link in that chain. I think we both agree on the same goal; we just disagree at this point over where the bottlenecks are.

Edited by CrazyElf - 9/26/15 at 9:13am