Although the Kepler SMX design was extremely efficient for its generation, through its development NVIDIA’s GPU architects saw an opportunity for another big leap forward in architectural efficiency; the Maxwell SM is the realization of that vision. Improvements to control logic partitioning, workload balancing, clock-gating granularity, scheduling, number of instructions issued per clock cycle, and many other enhancements allow the Maxwell SM (also called “SMM”) to far exceed Kepler SMX efficiency. The new Maxwell SM architecture enabled us to increase the number of SMs to five in GM107, compared to two in GK107, with only a 25% increase in die area.
Maxwell also boasts a dramatically larger L2 cache: 2048 KB in GM107 versus 256 KB in GK107. With more cache located on-chip, fewer requests have to go out to the graphics card's DRAM, which reduces overall board power and improves performance.
The primary contributor to Maxwell’s improved efficiency is the new Maxwell SM architecture, SMM. This new SM architecture achieves much higher power efficiency and delivers 35% more performance per CUDA Core on shader-limited workloads.
The organization of the SM has also changed. Each SM is now partitioned into four separate processing blocks, each with its own instruction buffer, scheduler and 32 CUDA cores. The Kepler approach of having a non-power-of-two number of CUDA cores, with some that are shared, has been eliminated. This partitioning simplifies the design and scheduling logic, saving area and power, and reduces computation latency.
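As a quick sanity check on the partitioning described above, here is a minimal Python sketch that adds up the core counts. The four-blocks-of-32 layout and the five-versus-two SM counts come from the quoted text; the 192 cores per Kepler SMX is an assumption taken from NVIDIA's published Kepler specs, not from this post, and the helper names are made up for illustration.

```python
# Per-SM layout as described in the quoted text:
# 4 processing blocks, each with 32 CUDA cores.
BLOCKS_PER_SMM = 4
CORES_PER_BLOCK = 32

# Assumption (not stated in the post): a Kepler SMX has 192 CUDA cores.
CORES_PER_SMX = 192


def is_power_of_two(n):
    """True if n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0


def total_cores(num_sms, cores_per_sm):
    """Total CUDA cores on a chip with num_sms identical SMs."""
    return num_sms * cores_per_sm


cores_per_smm = BLOCKS_PER_SMM * CORES_PER_BLOCK
gm107_cores = total_cores(5, cores_per_smm)   # five SMs in GM107
gk107_cores = total_cores(2, CORES_PER_SMX)   # two SMs in GK107

print("cores per Maxwell SM:", cores_per_smm, "power of two:", is_power_of_two(cores_per_smm))
print("cores per Kepler SMX:", CORES_PER_SMX, "power of two:", is_power_of_two(CORES_PER_SMX))
print("GM107 total:", gm107_cores, "GK107 total:", gk107_cores)
```

Under those assumptions the Maxwell SM comes out at a power-of-two core count while the Kepler SMX does not, which matches the "non-power-of-two" remark in the quote.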
Pairs of processing blocks share four texture filtering units and a texture cache. The compute L1 cache function has now also been combined with the texture cache, and shared memory is a separate unit (similar to the approach used on G80, the first CUDA-capable GPU) that is shared across all four blocks.
Overall, with this new design, each Maxwell SM is significantly smaller while delivering about 90% of the performance of a Kepler SMX, and the smaller area enables us to implement many more SMs per GPU. Comparing total SM-related metrics for GK107 versus GM107, GM107 has five SMs versus two, 25% more peak texture performance, 1.7 times more CUDA cores, and about 2.3 times more delivered shader performance.
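The ratios quoted above can be roughly cross-checked against each other. The sketch below is only a back-of-the-envelope consistency check, not NVIDIA's methodology: it ignores clock speed differences entirely and assumes 192 CUDA cores and 16 texture units per Kepler SMX (published Kepler figures, not stated in this post). The 8 texture units per Maxwell SM follows from the quote (two pairs of blocks, four units per pair).

```python
# Figures taken from the quoted text.
GM107_SMS, GK107_SMS = 5, 2
CORES_PER_SMM = 4 * 32          # four blocks of 32 cores
TEX_UNITS_PER_SMM = 2 * 4       # two block pairs, four texture units per pair
PERF_PER_SM_VS_KEPLER = 0.90    # "about 90% of the performance of a Kepler SM"
PERF_PER_CORE_GAIN = 1.35       # "35% more performance per CUDA core"

# Assumptions (not in the post): published Kepler SMX figures.
CORES_PER_SMX = 192
TEX_UNITS_PER_SMX = 16

# Ratio of total CUDA cores, GM107 over GK107.
core_ratio = (GM107_SMS * CORES_PER_SMM) / (GK107_SMS * CORES_PER_SMX)

# Ratio of total texture units (a stand-in for peak texture rate at equal clocks).
texture_ratio = (GM107_SMS * TEX_UNITS_PER_SMM) / (GK107_SMS * TEX_UNITS_PER_SMX)

# Two independent estimates of delivered shader performance.
shader_ratio_per_sm = PERF_PER_SM_VS_KEPLER * GM107_SMS / GK107_SMS
shader_ratio_per_core = PERF_PER_CORE_GAIN * core_ratio

print(f"CUDA core ratio:            {core_ratio:.2f}x")
print(f"texture unit ratio:         {texture_ratio:.2f}x")
print(f"shader ratio (per-SM view): {shader_ratio_per_sm:.2f}x")
print(f"shader ratio (per-core):    {shader_ratio_per_core:.2f}x")
```

Both shader estimates land near 2.25x, which is in line with the "about 2.3 times" figure, and the core and texture ratios round to the quoted 1.7x and 25%.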
And that bigger cache is supposedly there to eliminate the memory bottleneck:
Edited by AlphaC - 3/14/14 at 7:48pm
For GM107, to achieve its goal of significantly higher performance with the same memory width as GK107, it was also important to invest in memory system enhancements. On-chip memory system bandwidth was increased along with improvements in efficiency of the design. In addition, the large 2MB L2 cache configuration (larger than any previous GPU design) is highly effective at reducing memory bandwidth demand and ensuring that DRAM bandwidth is not a bottleneck.
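To illustrate why a larger L2 reduces DRAM bandwidth demand, here is a toy model in Python. The cache sizes are the ones quoted in this post (2048 KB for GM107, 256 KB for GK107); the hit rates used in the example are purely hypothetical placeholders, not measured values, and the function name is made up for illustration.

```python
# L2 sizes quoted in the post, in KB.
GM107_L2_KB = 2048
GK107_L2_KB = 256


def dram_requests(l2_requests, l2_hit_rate):
    """Requests that miss in L2 and must be served by DRAM.

    Toy model: every L2 miss becomes exactly one DRAM request.
    """
    if not 0.0 <= l2_hit_rate <= 1.0:
        raise ValueError("l2_hit_rate must be between 0 and 1")
    return l2_requests * (1.0 - l2_hit_rate)


l2_size_ratio = GM107_L2_KB // GK107_L2_KB

# Hypothetical hit rates, chosen only to show the shape of the effect.
small_cache_traffic = dram_requests(1000, 0.50)
large_cache_traffic = dram_requests(1000, 0.75)

print("L2 size ratio:", l2_size_ratio)
print("DRAM requests with the lower hit rate: ", small_cache_traffic)
print("DRAM requests with the higher hit rate:", large_cache_traffic)
```

In this model, any gain in hit rate from the bigger cache translates directly into fewer DRAM requests for the same workload, which is the effect the quote is describing. Real hit rates depend heavily on the workload.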