Originally Posted by fateswarm
Has anyone made a good professional hardware designer-grade commentary on the depths of the design ('cause most of us have no idea how to judge it as a big picture)? Anywhere on the internet. No, I don't mean "wow, more rops!".
What do you mean? Something like this piece on compute on GPUs, but for graphics only (i.e. games)?
one of the features Nvidia introduced with Compute Capability 3.5 (only supported on the GTX Titan and the Tesla K20/K20X) is a funnel shifter. The funnel shifter can combine operations, shrinking the 3-cycle penalty significantly. We’ll look at how much performance improves momentarily, because this isn’t GK110’s only improvement over GK104. GK110 is also capable of up to 64 32-bit integer shifts per SMX (Titan has 14 SMXs). GK104, in contrast, could only handle 32 integer shifts per SMX, and had just eight SMX blocks.
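To put the shift numbers in that excerpt side by side, here is a rough sketch in Python. It only multiplies the per-SMX figures quoted above by the SMX counts; the function name is made up for illustration, and it deliberately ignores clock speed and the funnel shifter itself, so treat it as a per-clock ceiling rather than a benchmark.

```python
def peak_shifts_per_clock(shifts_per_smx: int, smx_count: int) -> int:
    """Peak 32-bit integer shifts per clock for the whole chip (hypothetical helper)."""
    return shifts_per_smx * smx_count

# Figures taken from the quoted article.
titan_gk110 = peak_shifts_per_clock(shifts_per_smx=64, smx_count=14)
gk104 = peak_shifts_per_clock(shifts_per_smx=32, smx_count=8)

# Per-clock ratio only; core clocks differ between cards and are not modeled here.
ratio = titan_gk110 / gk104

print(titan_gk110, gk104, ratio)
```

So per clock, a Titan can issue about 3.5 times as many integer shifts as a full GK104, before any gain from the funnel shifter.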
For now, we’re betting that the high number of cores per SMX (192 for Kepler, 64 for GCN) is part of the problem. Each SMX has to work harder to extract sufficient parallelism to keep the entire processor block fed, which makes peak utilization problematic. Further CUDA optimizations might improve the overall performance scenario slightly, but there’s no miracle kernel with 100% increased performance waiting in the wings.
Or do you mean something like Anandtech:
Not to be confused with the SIMD on Cayman (which is a collection of SPs), the SIMD on GCN is a true 16-wide vector SIMD. A single instruction and up to 16 data elements are fed to a vector SIMD to be processed over a single clock cycle. As with Cayman, AMD’s wavefronts are 64 instructions meaning it takes 4 cycles to actually complete a single instruction for an entire wavefront. This vector unit is combined with a 64KB register file and that composes a single SIMD in GCN.
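The "4 cycles per wavefront" figure in that excerpt falls straight out of the two sizes it gives. A small sketch, assuming nothing beyond a 64-wide wavefront issued over a 16-wide SIMD (the function name is hypothetical):

```python
def cycles_per_wavefront_instruction(wavefront_size: int = 64, simd_width: int = 16) -> int:
    """Cycles for one instruction to cover a whole wavefront on one SIMD.

    Uses ceiling division so a partially filled last pass still costs a cycle.
    """
    return -(-wavefront_size // simd_width)

gcn_cycles = cycles_per_wavefront_instruction()  # 64 elements over 16 lanes
print(gcn_cycles)
```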
As is the case with Cayman's SPs, the SIMD is capable of a number of different integer and floating point operations. AMD has not gone into fine detail yet of what those are, but we’re expecting something similar to Cayman with the possible exception of how transcendentals are handled. One thing that we do know is that FP64 performance has been radically improved: the GCN architecture is capable of FP64 performance up to ½ its FP32 performance. For home users this isn’t going to make a significant impact right away, but it’s going to help AMD get into professional markets where such precision is necessary.
Diving deeper into Tahiti, as per the GCN architecture Tahiti’s 2048 SPs are organized into 32 Compute Units. Each of these CUs contains 4 texture units and 4 SIMD units, along with a scalar unit and the appropriate cache and registers. At the 7970’s core clock of 925MHz this puts Tahiti’s theoretical FP32 compute performance at 3.79TFLOPs, while its FP64 performance is ¼ that at 947GFLOPs. As GCN’s FP64 performance can be configured for 1/16, ¼, or ½ its FP32 performance it’s not clear at this time whether the 7970’s ¼ rate was a hardware design decision for Tahiti or a software cap that’s specific to the 7970. However as it’s obvious that Tahiti is destined to end up in a FireStream card we will no doubt find out soon enough.
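The throughput numbers in that paragraph can be checked with simple arithmetic. The sketch below assumes each SP retires one fused multiply-add per clock, counted as 2 FLOPs (the usual convention for peak figures, not something the excerpt states), and that each SIMD is 16 lanes wide as described earlier. Variable names are my own.

```python
# Figures from the quoted Anandtech text.
compute_units = 32
simds_per_cu = 4
lanes_per_simd = 16          # GCN's 16-wide vector SIMD
core_clock_mhz = 925

# Assumption: one FMA per SP per clock, counted as 2 floating-point operations.
flops_per_sp_per_clock = 2

stream_processors = compute_units * simds_per_cu * lanes_per_simd
fp32_gflops = stream_processors * flops_per_sp_per_clock * core_clock_mhz / 1000
fp64_gflops = fp32_gflops / 4  # the 7970's 1/4 FP64 rate

print(stream_processors, fp32_gflops, fp64_gflops)
```

That works out to 2048 SPs, about 3.79 TFLOPs FP32 and about 947 GFLOPs FP64, matching the article.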
Meanwhile the frontend/command processor for Tahiti is composed of 2 Asynchronous Command Engines (ACEs) and 2 geometry engines. Just as with Cayman each geometry engine can dispatch 1 triangle per clock, giving Tahiti the same theoretical 2 triangle/clock rate as Cayman. As we’ll see however, in practice Tahiti will be much faster than Cayman here due to efficiency improvements.
Looking beyond the frontend and shader cores, we’ve seen a very interesting reorganization of the rest of the GPU as opposed to Cayman. Keeping in mind that AMD’s diagrams are logical diagrams rather than physical diagrams, the fact that the ROPs on Tahiti are not located near the L2 cache and memory controllers in the diagram is not an error. The ROPs have in fact been partially decoupled from the L2 cache and memory controllers, which is also why there are 8 ROP partitions but only 6 memory controllers. Traditionally the ROPs, L2 cache, and memory controllers have all been tightly integrated as ROP operations are extremely bandwidth intensive, making this a very unusual design for AMD to use.
As it turns out, there’s a very good reason that AMD went this route. ROP operations are extremely bandwidth intensive, so much so that even when pairing up ROPs with memory controllers, the ROPs are often still starved of memory bandwidth. With Cayman AMD was not able to reach their peak theoretical ROP throughput even in synthetic tests, never mind in real-world usage. With Tahiti AMD would need to improve their ROP throughput one way or another to keep pace with future games, but because of the low efficiency of their existing ROPs they didn’t need to add any more ROP hardware, they merely needed to improve the efficiency of what they already had.
The solution to that was rather counter-intuitive: decouple the ROPs from the memory controllers. By servicing the ROPs through a crossbar AMD can hold the number of ROPs constant at 32 while increasing the width of the memory bus by 50%. The end result is that the same number of ROPs perform better by having access to the additional bandwidth they need.
The big question right now, and one we don’t have an answer to, is what were the tradeoffs for decoupling the ROPs? Clearly the crossbar design has improved ROP performance through the amount of memory bandwidth they can access, but did it impact anything else? The most obvious tradeoff here would be for potentially higher latency, but there may be other aspects that we haven’t realized yet.
On that note, let’s discuss the memory controllers quickly. Tahiti’s memory controllers aren’t significantly different from Cayman’s but there are more of them, 50% more in fact, forming a 384bit memory bus. AMD has long shied away from non-power of 2 memory busses, and indeed the last time they even had a memory bus bigger than 256bits was with the ill-fated 2900XT, but at this point in time AMD has already nearly reached the practical limits of GDDR5. AMD’s ROPs needed more memory bandwidth, but even more than that AMD needed more memory bandwidth to ensure Tahiti had competitive compute performance, and as such they had little choice but to widen their memory bus to 384bits wide by adding another 2 memory controllers.
It’s worth noting though that the addition of 2 more memory controllers also improves AMD’s cache situation. With 128KB of L2 cache being tied to each memory controller, the additional controllers gave AMD 768KB of L2 cache, rather than the 512KB that a 256bit memory bus would be paired with.
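The bus width and L2 figures follow from the controller count. A quick sketch, assuming each memory controller is 64 bits wide (inferred from 384 bits over 6 controllers, not stated outright in the excerpt); the helper name is hypothetical:

```python
BITS_PER_CONTROLLER = 64        # assumption: 384-bit bus / 6 controllers
L2_KB_PER_CONTROLLER = 128      # from the quoted text

def memory_config(controllers: int) -> tuple[int, int]:
    """Return (bus width in bits, total L2 in KB) for a given controller count."""
    return controllers * BITS_PER_CONTROLLER, controllers * L2_KB_PER_CONTROLLER

tahiti_bus, tahiti_l2 = memory_config(6)      # what Tahiti shipped with
narrow_bus, narrow_l2 = memory_config(4)      # the 256-bit alternative
bus_increase_pct = (tahiti_bus - narrow_bus) / narrow_bus * 100

print(tahiti_bus, tahiti_l2, narrow_bus, narrow_l2, bus_increase_pct)
```

With the ROP count held at 32, each ROP has 50% more bus width behind it than it would on a 256-bit design.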
Pretty sure most people could tell that four times the ACEs could help boost performance a lot, and judging by Anandtech's review of the 7970, those added ROPs need more memory bandwidth.
AMD has been lacking in the TMU (texture mapping unit) area this gen.
The CU (compute unit) is the same, I think, so it's a high-level change.
This is a GK110 SMX
GK110 block diagram (with GPC separation; many sites do not have the GPCs outlined)
^ Some GTX 780s have a raster engine disabled due to having 3 disabled SMXs in the same GPC.
From the GK110 whitepaper: http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
GK104 has fewer SMXs per GPC and one fewer GPC:
GTX 760 (GK104), missing a GPC (and therefore also a raster engine):
If you're looking for anything more in depth you'll have to look for AMD's developer whitepapers or wait 3 days for their "dumbed down" presentation, which probably lacks what you're asking for. The GPU'14 presentation only had simple metrics, such as Firestrike numbers, that jibe with "gamers".
Edited by AlphaC - 10/12/13 at 5:59pm