What I expect from Nvidia’s Maxwell – speculation (draft)

GM107 would likely be reused for GTX 850 , GT 840 / GT 830 , unless there’s a refresh GM207 or GM208

GM206 (<150W , 6-pin PCI-e power) – possible SKU = GTX 850 Ti / GTX 860
^ $150-200 MSRP pricepoints , 2-way SLI

2 GPCs with 192-bit memory bus w/ fast memory = GTX 660 Ti performance at a minimum ; if power use scales nicely it’ll be < 120W , maybe even < 110W ( ~26-29 GFLOPs/W)
* 3.3Tflops FP32, ~ 104GFLOPs FP64, 104 GTexels/s , 31.2 GPixels/s at ~ 1300Mhz
* 1280 CUDA , 80 TMU, 24 ROPs
* Memory bandwidth , 192-bit @6Ghz eff = 144 GB/s ; @7Ghz eff = 168GB/s
* If 2GB VRAM is used then there’s going to be interleaving issues (see http://www.anandtech.com/show/6159/the-geforce-gtx-660-ti-review/2) but there’s no problems with 1.5GB or 3GB
* 2MB L2 Cache
* roughly under R9 280 @ 933Mhz spec when at 1300Mhz while consuming half the wattage
* Would be roughly 2.8Tflops, 88 GTexels/s , 26.4 GPixels/s at 1100Mhz (GTX 750 Ti & GTX 750 are 1085Mhz Boost)
* 10 SMM = 80 TMUs ; GTX 660 had 80 TMUs (the rationale for this)
Probable block diagram , GM206 192-bit
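
The VRAM sizes listed for each bus width can be checked with a rough sketch. It assumes each GDDR5 chip sits on its own 32-bit channel and that chips come in 256MB or 512MB sizes; the function name `symmetric_vram_gb` and both of those values are illustrative assumptions, not anything Nvidia has confirmed for these parts.

```python
def symmetric_vram_gb(bus_width_bits, chip_sizes_mb=(256, 512)):
    """VRAM totals (in GB) that load every memory channel equally.

    Assumption: one chip per 32-bit channel, all chips the same size.
    """
    channels = bus_width_bits // 32
    return [channels * size / 1024 for size in chip_sizes_mb]


for bus in (192, 256, 320, 384, 512):
    print(f"{bus}-bit -> {symmetric_vram_gb(bus)} GB")
```

Under those assumptions a 192-bit bus gives 1.5GB or 3GB, so a 2GB card has to load its channels unevenly, which is where the interleaving issue comes from.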

2 GPCs with 256-bit memory bus w/ low voltage memory = GTX 670 performance at a minimum ; if power use scales nicely it’ll be < 120W
* 3.3Tflops FP32, ~ 104 GFLOPs FP64, 104 GTexels/s , 41.6 GPixels/s at ~ 1300Mhz
* 1280 CUDA , 80 TMU, 32 ROPs
* Memory Bandwidth , 5Ghz eff = 160GB/s , 5.4Ghz eff = 173 GB/s (see Quadro K5000 with 122W TDP)
* 2GB VRAM or 4 GB VRAM
* 2MB L2 Cache
* Would be roughly 2.8Tflops, 88 GTexels/s , 35.2 GPixels/s at 1100Mhz (GTX 750 Ti & GTX 750 are 1085Mhz Boost) , but I expect having more power from PCI-e would allow higher Boost clocks
Probable block diagrams , GM206 256-bit

GM204 (< 225W , 2x 6-pin PCI-e power) – possible SKU = GTX 860 Ti / GTX 870
^ $200-400 pricepoints , 3-way or 4-way SLI

3 GPCs with 256-bit memory bus clocked high = roughly stock GTX 780 performance ; < 180W (~22-29 GFLOPs/W)
* 5Tflops FP32, ~ 160 GFLOPs FP64, 156 GTexels/s , 41.6 GPixels/s at ~ 1300Mhz
* 1920 CUDA, 120 TMU, 32 ROPs
* Memory bandwidth , 7Ghz eff = 224 GB/s (see GTX 770)
* 4 GB VRAM
* 2MB L2 Cache
* GTX 770 had 128 TMUs ; GTX 670 had 112 TMUs (the rationale for this)
* Would be roughly GTX 780 performance (GTX [email protected] 1085Mhz Boost = ~4.2TFlops, ~174 GTexels/s , ~44 GPixels/s)
* GTX 770 replacement to claim +35% faster

3 GPCs with 384-bit memory bus w/ low voltage memory = roughly stock GTX 780 performance ; < 180W
* 5Tflops FP32, ~ 160 GFLOPs FP64, 156 GTexels/s , 62.4 GPixels/s at ~ 1300Mhz
* 1920 CUDA, 120 TMU, 48 ROPs
* Memory Bandwidth , 6Ghz eff = 288 GB/s (see Quadro K6000 with 225W TDP)
* 3GB VRAM or 6GB VRAM
* 2MB L2 Cache
* roughly R9 290 spec @ 1Ghz / GTX 770 replacement

4 GPCs with 384-bit memory bus w/ low voltage memory = GTX TITAN performance at a minimum ; if power scales nicely it’ll be < 240W
* 6.7Tflops FP32, ~ 208 GFlops FP64 , 208 GTexels/s , 62.4 GPixels/s at ~ 1300Mhz
* 2560 CUDA , 160 TMU , 48 ROPs
* Memory Bandwidth , 6Ghz eff = 288 GB/s (see Quadro K6000 with 225W TDP)
* 3GB VRAM or 6GB VRAM
* 2MB L2 Cache
* R9 290 has 2560 GCN cores, 160 TMU, 64 ROP

GM200 / GM210 (<300W , 6-pin and 8-pin PCI-e power) – possible SKU = GTX 880 Ti / GTX 880 (~30-40 GFlops/W)
^ $400+ pricepoints , 4-way SLI
Different GPC layout with focus on double precision & compute

A gaming-gimped compute one could have 5 GPCs
5 GPCs , 384-bit memory bus clocked high , 48 ROPs
* 7TFlops FP32, ~219GFLOPs FP64, 220 GTexels/s , 52.8 GPixels/s at ~ 1100Mhz
* 3,200 CUDA , 200 TMU, 48 ROPs
* Memory Bandwidth , 7Ghz eff = 336 GB/s (see GTX 780 Ti , GTX TITAN Black)
* 6GB VRAM
* L2 cache? (> 1.5MB)
* GTX TITAN had 224 TMUs , GTX 780 has 192 TMU

A gaming-gimped compute one could have 5 GPCs
5 GPCs , 512-bit memory bus with low voltage memory , 64 ROPs
* 7TFlops FP32, ~219GFLOPs FP64, 220 GTexels/s , 70.4 GPixels/s at ~ 1100Mhz
* 3,200 CUDA , 200 TMU, 64 ROPs
* 4GB VRAM or 8GB VRAM
* L2 cache? (> 1.5MB)
* GTX TITAN had 224 TMUs , GTX 780 has 192 TMU
* roughly GTX 680 SLI or HD7970 Crossfire?

Tabulated based off core clock

Possible cut-down versions with 40 ROPs and 320-bit memory or something akin to the GTX 580 –> GTX 570 / GTX 560 Ti approach. I don’t believe they will repeat the GTX 670 product lineup mistake ; a cut-down version would be worth it if it is only cut 20% shader/TMU-wise.

Speculated Product Lineup before EOL of GTX 700 series , based on price:
High-end Enthusiast
GM200 / GM210 card (flagship pricing) ~ $600-1000
GTX TITAN Black (GK110 full die, full FP64) ~ $650- 800 … no less than $500 due to FP64 performance
GM200 / GM210 cut-down die (GTX 880?) ~ $500-600
GTX 780 Ti (GK110 full die) ~ $450 – 550
GTX 780 (GK110 cut-down die) ~ $350 -400
(GTX 770 Ti GK110 with 1920 CUDA would go here)
Performance Mid-range
GM204 full die (GTX 870?)–> $200 -400 , maybe $350-450 at launch if 20nm
GM204 cut-down die (GTX 860 TI?) –> $200-400 , probably $250-350 at launch
GTX 770 (GK104 full die) ~ $250-300
(GTX 760 Ti (GK104 GTX 670 rebrand possibly with higher clocked VRAM would go here))
Mid-range
GM206 full die (GTX 860?) –> $150-200 , probably $220- 250 at launch
GTX 760 (GK104 cut-down die) ~ $180-220
GM206 cut-down die (GTX 850 Ti?) –> $150-200 , probably $180-220 at launch
Entry Level
GTX 850 / GTX 750 Ti (GM107 full die) ~ $110-130
GT 840 / GTX 750 (GM107 cut-down die) ~ $90-100

NOTE:
Pixel Fill rate scales with ROPs , not cores & TMUs. Generally ROPs x 8 = memory bus width in bits.

Texture Fill Rate = (# of TMUs) x (Core Clock)
Pixel Fill Rate = (# of ROPs) x (Core Clock)
FLOPs=cores x Core clock x FLOPs per cycle
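
As a sanity check, the three formulas above can be wrapped in a small helper. This is only a sketch: the name `peak_rates` and its parameters are made up for illustration, and the default of 2 FLOPs per core per cycle assumes one fused multiply-add per clock.

```python
def peak_rates(cores, tmus, rops, core_clock_mhz, flops_per_cycle=2):
    """Theoretical peak throughput for a GPU configuration.

    Assumption: flops_per_cycle=2 means one fused multiply-add
    per CUDA core per clock.
    """
    clock_ghz = core_clock_mhz / 1000.0
    return {
        "tflops_fp32": cores * clock_ghz * flops_per_cycle / 1000.0,
        "gtexels_per_s": tmus * clock_ghz,
        "gpixels_per_s": rops * clock_ghz,
    }


# Speculated 2 GPC part: 1280 CUDA , 80 TMU , 24 ROPs at 1300Mhz
two_gpc = peak_rates(cores=1280, tmus=80, rops=24, core_clock_mhz=1300)
for name, value in two_gpc.items():
    print(f"{name}: {value:.1f}")
```

These are peak paper numbers only; real games rarely hit any of the three limits at once.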

There are two strategies to optimize power use: a wider memory bus clocked low (i.e. low voltage memory) or a narrower memory bus clocked high (with extra L2 cache), along with decreasing the time to idle / downclock power state.
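
To put numbers on the two memory strategies, peak bandwidth is just the bus width in bytes times the effective memory clock. A minimal sketch, with a hypothetical helper name `memory_bandwidth_gbs`:

```python
def memory_bandwidth_gbs(bus_width_bits, effective_clock_ghz):
    """Peak memory bandwidth in GB/s.

    bus_width_bits / 8 is the number of bytes moved per transfer;
    effective_clock_ghz is the number of transfers per nanosecond.
    """
    return bus_width_bits / 8 * effective_clock_ghz


wide_and_slow = memory_bandwidth_gbs(256, 5.0)    # wider bus , low voltage memory
narrow_and_fast = memory_bandwidth_gbs(192, 7.0)  # narrower bus , fast memory
print(wide_and_slow, narrow_and_fast)  # 160.0 168.0
```

Both routes land in the same bandwidth range, so the choice comes down to die area, board cost and power rather than raw GB/s.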

See also http://techreport.com/review/26050/nvidia-geforce-gtx-750-ti-maxwell-graphics-processor/11 , comparing Peak Rasterization / Peak Pixel Fill / Peak shader FLOPs vs FPS

The smallest unit of a Maxwell GPU, the SMM , has 128 CUDA cores. Currently GM107 has 5 SMM in one GPC. By inference the possibilities are:

512 –> GTX 750 currently (cut down GM107) , 4 SMM
640 –> GTX 750 TI currently (GM107) , 5 SMM
768
896
1024 –> Possible cut-down 2 GPC card such as a GTX 850 Ti , 8 SMM (2 SMM disabled of 10)
1152 –> Possible cut-down 2 GPC card such as a GTX 850 Ti / GTX 855 / GTX 860 LE, 9 SMM (1 SMM disabled of 10)
1280 –> 2 GPC = 10 SMM if SMMs per GPC stay the same (GTX 860?)
1408
1536 –> Possible cut-down 3 GPC card such as GTX 860 Ti , 12 SMM (3 SMM disabled of 15)
1664 –> Possible cut-down 3 GPC card such as GTX 860 Ti / GTX 865 , 13 SMM (2 SMM disabled of 15)
1792
1920 –> 3 GPC = 15 SMM if SMMs per GPC stay the same (GTX 870?)
2048
2176
2304
2432
2560 –> 4 GPC if SMMs per GPC stay the same (GTX 880?)
2688
2816
2944
3072
3200 –> 5 GPC if SMMs per GPC stay the same (GTX TITAN MAX? GTX 880 Ti Gaming card?)
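
The list above can be generated mechanically. The sketch below assumes every larger chip keeps GM107's 5 SMM per GPC, which is exactly the open question; the constant and function names are illustrative only.

```python
CORES_PER_SMM = 128  # per the GTX 750 Ti whitepaper
SMM_PER_GPC = 5      # assumption: larger chips keep GM107's layout


def core_counts(max_gpcs, min_smm=4):
    """List (cores, smm, gpcs_needed, smm_disabled) for each SMM count."""
    rows = []
    for smm in range(min_smm, max_gpcs * SMM_PER_GPC + 1):
        gpcs_needed = -(-smm // SMM_PER_GPC)  # ceiling division
        disabled = gpcs_needed * SMM_PER_GPC - smm
        rows.append((smm * CORES_PER_SMM, smm, gpcs_needed, disabled))
    return rows


for cores, smm, gpcs, disabled in core_counts(5):
    print(f"{cores:5d} cores = {smm:2d} SMM , {gpcs} GPC , {disabled} SMM disabled")
```

Rows with 0 disabled SMMs are the full-die candidates; everything else is a possible cut-down SKU.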

Nvidia claims Maxwell is 35% stronger per core vs Kepler and twice as efficient. (see whitepaper http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce-GTX-750-Ti-Whitepaper.pdf)

SMM structure (GM107 , GM200 likely has Double precision and such):

Since we know the architecture’s smallest unit is an SMM made of 128 CUDA cores, for the mid-range chips it’s all about whether Nvidia keeps the 5 SMM per GPC layout and how many GPCs and ROPs (and therefore which memory bus width) will be allowed. Big Maxwell GM200/GM210 is unlikely to keep the same number of SMMs per GPC.

WISHFUL THINKING
Another price/performance powerhouse like the 8800GT (a high end part at mid-range prices) could be had with a GM204 chip if it isn’t gimped ; a $250-300 card with GTX 780/GTX 780Ti performance and less than 180W power consumption would be astounding.

Even the GTX 560 Ti was rather good value at $250 compared to the GTX 660 Ti and GTX 650 Ti non-Boost which were both quite poor in price/performance value.

If we are extremely optimistic, then we may see 3 GPCs in GM206 , but it would likely be clocked low at ~ 1100Mhz or gimped with 192-bit memory bus in order to hit 150W power consumption.

Likewise, a 4 GPC GM204 would be extremely nice, assuming they can fit 2560 Maxwell CUDA cores in 225W. Assuming 30 GFlops/W (double the efficiency of GK104) this is doable even if it’s clocked at 1300Mhz and using a 384-bit memory bus.
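
The 225W claim is easy to check on paper. A minimal sketch, assuming 2 FLOPs per core per clock and treating the 30 GFlops/W figure as a flat whole-board number (the helper name `board_power_w` is made up for this example):

```python
def board_power_w(cores, core_clock_mhz, gflops_per_watt, flops_per_cycle=2):
    """Board power implied by a peak FP32 rate and an efficiency target."""
    gflops = cores * core_clock_mhz / 1000.0 * flops_per_cycle
    return gflops / gflops_per_watt


power = board_power_w(cores=2560, core_clock_mhz=1300, gflops_per_watt=30)
print(f"{power:.0f}W")  # just under the 225W budget
```

There is almost no headroom at 1300Mhz, so a part like this would only work if the efficiency target really holds at high clocks.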

3200 CUDA cores is the minimum I expect from GM200/GM210 but I hope for 3840 CUDA cores (6 GPCs of the current GM107 structure).

Performance relative to current cards

One thought on “What I expect from Nvidia’s Maxwell – speculation (draft)”

  1. Have I told you that I am a fan of your articles? Because I am a fan of your articles.

    Anyway, one thing I have found odd about the 700 series is the lack of GPUs available for it. There is entry-level Maxwell (750 and Ti), there is stripped-down GK104 (760) and full GK104 (770), and there are four different versions of GK110 (both Titans and both 780s). There isn't much in-between however. Wikipedia lists a 760Ti that is essentially a 670 rebrand, but it has no release date. (I predict it will never be released, given just how close a 670 was to a 680 last gen and how much cheaper it was than the flagship.) There is also a 760 with a gimped, 192-bit bus that has no release date, and this could actually serve as a pretty good competitor to AMD's low-end R9 GPUs.

    On the other hand, the 600 series is/was inundated with SKUs. There was the 650Ti, and there was the 650Ti BOOST with only eight additional ROPs. No other difference. Both of these were GK106 and were cut-down 660s, which had 192 more CUDA cores (I think that's two compute units or whatever Nvidia's term for them is? SMM?). Then there were two 660s available to OEMs, one with GK104's full 256-bit bus and the other with a gimped 192-bit bus. It (the one with the full back-end) was rebranded as the 760 and made available to consumers. Finally, the 660Ti was just a 670 with eight fewer ROPs, which was itself a 680 with 192 fewer shaders.

    So has Nvidia decided not to make this many SKUs? The Titan was released 15 months ago, and by that time last-gen, most if not all of those 600 series cards had launched. Perhaps we'll see more “entry-level” GPUs that barely beat Intel integrated graphics for the 800 series, marketed at OEMs for their terrible, terrible prebuilts, since this is a new architecture, but I am not sure about any reasonably powerful chipset receiving that same treatment.

    Anyway, since people seem to equate more VRAM with more performance, I think it's possible for some of the higher-end cards to be 320-bit and come with 2.5GB or even 5GB of VRAM stock, depending on just how high-end they are. I also really do not want to see a repeat of the Titan. “$1000? How ridiculous!” said some. Then, when the 780 was launched, they said, “Oh golly gee! Over 90% the performance and it's just $650? I'll take ten!” Even though no flagship had ever launched for quite that much money. 7970s were $550, 680s were $500, 6970s were just $370, and 580s were also $500. Not only that, all of those had a “close enough” version using the same chipset with a couple fewer compute units (is there even a universal term for that thing called SMM or GCN core, depending on the architecture?) for $100+ less and 5% less FPS. (Wikipedia lists two prices for the 6950, one of which is $110 less and the other just $70 less. My point still stands given that it was the cheapest of the four and one price is less by more than $100). 780? Spend $100 less and get the 770, which has two-thirds the everything: cores, bus, VRAM, etc.
