I'm gonna go off the deep end here with some detailed guesswork about what we might see with GM200 at 20nm. THE RESULTS AND MATH ARE THEORETICAL ONLY; THIS IS NOT BASED ON ANY SECRETLY-KNOWN INFO FOR UNRELEASED PRODUCTS.
According to this slide, first-generation Maxwell reportedly delivers the same performance level at half the wattage, even on 28nm, and a 35% gain per core (by their claims, at this point of course). So their claim is that a card that would take 300w on Kepler can be done in a 150w or smaller envelope, on 28nm even. That leaves a huge amount of extra power to continue stuffing in more cores and increasing clocks. Now add in that Big Maxwell (GM200/204) is going to be on 20nm, which improves power efficiency and usable die area even further. Now add in that GM107 is Maxwell FIRST generation, and that GM200/204 are SECOND generation by all known info.... and we have a recipe for easily doubling the performance at least, if their claims are anywhere near true.
So, time for some math based on the theoretical claims and on a real product launching in a couple of days (GTX 750 Ti).
GM107 has 5 SMM units, each containing 128 CUDA cores, in each GPC. One GPC is all GM107 is using. It uses 60 watts for performance that is only beaten out by a GTX 660 by about 12%, per the leaked benches. A GTX 660 has 960 Kepler cores, while a GM107 has 640 Maxwell cores. A GTX 660 has a TDP rating of 140 watts. See an interesting number here?
960 is 50% more cores than the 640 Maxwell is using for the 750 Ti, yet the 750 Ti is only around 12% slower.... see the magic number close by? Nvidia claims a 35% performance increase PER CORE and additionally that the cores will use only around half the power overall in TDP. Now extrapolate with some napkin math: the GM107 only has a 128-bit bus. On Kepler, bus width matters a lot: a GTX 650 Ti Boost is a huge amount above a GTX 650 Ti (192-bit vs. 128-bit bus widths and basically the same otherwise). For GM107 to get this close to a 192-bit GTX 660 on a 128-bit bus means the bandwidth efficiency is greatly improved.
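To put rough numbers on that comparison, here is a minimal Python sketch using only the figures quoted above (core counts, TDP ratings, and the ~12% gap from the leaked benches). It ignores clock speed differences, so treat the result as napkin math rather than a measurement; the variable names are just mine for illustration.

```python
# Napkin math only: inputs are the figures quoted in the post.
GTX660_CORES, GTX660_TDP_W = 960, 140
GM107_CORES, GM107_TDP_W = 640, 60
GTX660_PERF = 1.12  # GTX 660 is ~12% faster per the leaked benches
GM107_PERF = 1.00   # GM107 (GTX 750 Ti) taken as the baseline

# How many more cores the Kepler card needs.
core_ratio = GTX660_CORES / GM107_CORES

# Maxwell per-core throughput relative to Kepler (clock differences ignored).
per_core_gain = (GM107_PERF / GM107_CORES) / (GTX660_PERF / GTX660_CORES)

# Maxwell performance per watt of TDP relative to Kepler.
perf_per_watt_gain = (GM107_PERF / GM107_TDP_W) / (GTX660_PERF / GTX660_TDP_W)

print(f"core count ratio: {core_ratio:.2f}x")          # 1.50x
print(f"per-core gain:    {per_core_gain:.2f}x")       # 1.34x
print(f"perf/W gain:      {perf_per_watt_gain:.2f}x")  # 2.08x
```

With these inputs the per-core gain lands at about 34% and performance per watt at about 2.1x, which is in the same ballpark as the 35% and half-the-wattage claims.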
So it's safe to say that with 50% more bandwidth and 50% more cores, a GTX 660 only performing ~12% higher than a Maxwell part with slightly higher clocks is incredibly impressive for Maxwell. Now let's scale! We know that at 28nm GM107 is around 148 square millimeters for the die size. Let's COMPLETELY IGNORE 20NM for a second here on the power savings and size! Forget about it for a minute. Since even on 28nm the reticle limit is around 570mm2, it's extremely easy to see that they could use 3 GPC units on 28nm, taking around 420-430mm2, in an imaginary chip that would have 15 SMM units. 15 SMM units times 128 cores per unit would mean 1920 Maxwell cores.
Pretend their scheduler is great and the performance scales well with core count and clocks, and that they kept the idea of tripling everything in this hypothetical, non-existent card that is an illustration only. So we'd have a 384-bit bus with 1920 Maxwell cores, GDDR5 at probably 7GHz at least, like Kepler uses, and a TDP that fits inside of 200 watts. Now let's say that you only get about 75% scaling from core count here. Kepler scales pretty linearly, but Maxwell is a new architecture, so let's make the safe assumption. A GTX 660 performs 12% better than a GM107 with 640 cores. Triple the core count and everything else with our rough napkin math again, and by that theory-crafting you would have a card performing around the same as a fully unlocked GK110, at least, with better potential for higher clocks thanks to the lower power usage.
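The tripled 28nm chip can be sketched in a few lines of Python. Everything here comes from the figures in the post, with two things to flag as assumptions: the 780 Ti being about 2x a GTX 660 at 1080p (the figure cited further down), and the die area being a naive multiple of GM107's 148mm2. The naive tripling gives 444mm2, a bit above the 420-430mm2 guessed above, which presumably assumes not every block gets duplicated.

```python
CORES_PER_SMM = 128
SMM_PER_GPC = 5
BUS_BITS_PER_GPC = 128       # GM107's bus width, tripled along with everything else
GM107_DIE_MM2 = 148
GM107_VS_GTX660 = 1 / 1.12   # GTX 660 is ~12% faster per the leaked benches
GK110_VS_GTX660 = 2.0        # assumption: a 780 Ti is about 2x a GTX 660 at 1080p
SCALING = 0.75               # the conservative scaling factor used in the post

def scaled_chip(gpcs):
    """Naively multiply a GM107-style GPC and apply the flat scaling factor."""
    return {
        "cores": gpcs * SMM_PER_GPC * CORES_PER_SMM,
        "bus_bits": gpcs * BUS_BITS_PER_GPC,
        "naive_die_mm2": gpcs * GM107_DIE_MM2,
        "perf_vs_gtx660": GM107_VS_GTX660 * gpcs * SCALING,
    }

chip = scaled_chip(3)
perf_vs_780ti = chip["perf_vs_gtx660"] / GK110_VS_GTX660

print(chip["cores"], chip["bus_bits"], chip["naive_die_mm2"])  # 1920 384 444
print(f"{chip['perf_vs_gtx660']:.2f}x a GTX 660")              # 2.01x
print(f"{perf_vs_780ti:.2f}x a 780 Ti")                        # 1.00x
```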
However, in reality, we know they are going 20nm. This allows for even more power savings. This also allows for many more transistors per square mm on the die. So pretend they want to go for a 520mm2 chip on 20nm, keep costs down a tad for Big Maxwell and improve yields per wafer. According to released documents such as this: http://www.cadence.com/rl/Resources/overview/20nm_qa.pdf
we can expect to see possible transistor counts of 8-12 billion. GK110 is 7.1 billion transistors. Using that as a point of reference, let's scale 28nm GK110 down to 28nm Maxwell GM107 by die area: 148mm2 compared to GK110's 551mm2. GM107 needs fewer memory controllers and less pad space, so for 28nm the 148mm2 die size indicates roughly 1.7-1.8 billion transistors with the 128-bit bus. At 20nm, you will be able to fit upwards of 11-12 billion transistors for a high-end part.
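The area scaling behind that estimate is simple enough to write down. This is a sketch that assumes transistor count scales linearly with die area at the same node, which is only roughly true; the function name is mine, and the inputs are the GK110 and GM107 figures quoted above.

```python
GK110_TRANSISTORS_B = 7.1  # billions
GK110_DIE_MM2 = 551
GM107_DIE_MM2 = 148

def area_scaled_transistors(ref_transistors_b, ref_die_mm2, target_die_mm2):
    """Scale a reference transistor count linearly by die area (same node)."""
    return ref_transistors_b * target_die_mm2 / ref_die_mm2

naive_gm107_b = area_scaled_transistors(GK110_TRANSISTORS_B, GK110_DIE_MM2, GM107_DIE_MM2)
gk110_density_m_per_mm2 = GK110_TRANSISTORS_B * 1000 / GK110_DIE_MM2

print(f"naive GM107 estimate: {naive_gm107_b:.2f}B transistors")      # 1.91B
print(f"GK110 density:        {gk110_density_m_per_mm2:.1f}M / mm2")  # 12.9M / mm2
```

The straight area ratio gives about 1.9 billion; trimming that for the smaller memory interface and pad area is how you end up in the 1.7-1.8 billion neighborhood.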
So now we have a decent number here: rounding up, Maxwell at 28nm in GM107 form is taking about 2 billion transistors to perform at a level about 89% as fast as a GK106. Let's use this as a base for the next part of this thought exercise.
So 2 billion.... it's safe to say they could fit 7 GPCs at 20nm easily, since 20nm should provide roughly a 2x density improvement (half the die area per transistor), and we wouldn't need to duplicate memory controllers beyond 3x of the 28nm design's if we went for a 384-bit bus. At 20nm let's say they went for a 384-bit bus to feed 7 GPCs' worth of cores, since the architecture is clearly more bandwidth-efficient than Kepler by far. That'd give us 4480 Maxwell cores, which are already much more bandwidth-efficient, so the chip would probably be plenty well-fed on that end of things.
Power-wise we'd be looking at an envelope of 250 watts. Maxwell GM107 at 28nm takes a full-card power of 60w for 640 cores. That means probably around 45w for the GPU itself, allowing 10w for the GDDR5 and 5w for the fan and other circuitry. The move to 20nm should improve power efficiency drastically, due to the shorter gate lengths. So it's fair to say they could fit seven of those inside of that envelope, more than easily.
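As a sanity check on the power budget, this Python sketch works out how much 20nm would have to cut per-GPC power for seven GPCs to fit in 250w. The 60w/10w/5w split is the guess from above; the 35w board overhead in the second case (3x the memory power plus the same 5w for fan and circuitry) is purely my assumption for illustration.

```python
def required_power_scale(envelope_w, gpcs, gpu_w_per_gpc, board_overhead_w=0.0):
    """Fraction of the 28nm per-GPC power each GPC may draw to fit the envelope."""
    gpu_budget_w = envelope_w - board_overhead_w
    return gpu_budget_w / (gpcs * gpu_w_per_gpc)

CARD_W, MEMORY_W, OTHER_W = 60, 10, 5
gpu_w = CARD_W - MEMORY_W - OTHER_W   # ~45w for the GM107 GPU itself
naive_28nm_w = 7 * gpu_w              # seven GPCs with no 20nm savings at all

# Best case: the whole 250w envelope goes to the GPU.
best_case = required_power_scale(250, 7, gpu_w)
# Hypothetical: 35w reserved for memory, fan and other circuitry (assumption).
with_overhead = required_power_scale(250, 7, gpu_w, board_overhead_w=35)

print(f"GPU power per GPC at 28nm: {gpu_w}w")          # 45w
print(f"seven GPCs at 28nm:        {naive_28nm_w}w")   # 315w
print(f"allowed fraction, best:    {best_case:.2f}")   # 0.79
print(f"allowed fraction, 35w off: {with_overhead:.2f}")  # 0.68
```

So under these guesses 20nm would need to deliver somewhere around a 20-30% per-GPC power reduction at the same clocks for seven GPCs to fit.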
Again, let's go for a scaling factor of approximately 75% for the core count improvements... so we will take a GTX 660 (GK106) card and compare it to a GK110-based 780 Ti. A GTX 660 at 1080p is able to pull about 50% of the performance of a fully-unlocked 780 Ti (source: http://www.techpowerup.com/reviews/NVIDIA/GeForce_GTX_780_Ti/27.html).
Now we know each GPC of 5 SMMs will provide around 90% of a GTX 660's performance, and revision 2.0 (second gen) on 20nm will probably be closer to a 100% figure. So let's say each SMM-based GPC results in the real-world performance of a GTX 660. In other words, around half of a 780 Ti. Now let's conservatively also say we only see the benefit of 75% of the cores when scaling it to 7 GPCs, or 35 SMMs. 35 SMMs would be, as you recall, 4480 Maxwell cores. Multiply the performance of the 50% card by 7 and we'd have 350% of the performance of a full GK110 (or 2.5x faster). However, let's now apply the 75% rough rule and we come up with a much more reasonable 262.5% of the performance, or about 2.6x as fast as (1.6x faster than) a GTX 780 Ti.
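The arithmetic above, as a small Python sketch. The per-GPC performance (half a 780 Ti) and the flat 75% scaling factor are the assumptions from this post, not measured values, and the function name is just for illustration.

```python
CORES_PER_SMM = 128
SMM_PER_GPC = 5

def big_maxwell(gpcs, perf_per_gpc_vs_780ti=0.5, scaling=0.75):
    """Return (SMMs, cores, ideal perf, scaled perf), perf relative to a 780 Ti."""
    smms = gpcs * SMM_PER_GPC
    cores = smms * CORES_PER_SMM
    ideal = gpcs * perf_per_gpc_vs_780ti   # perfect linear scaling
    scaled = ideal * scaling               # flat 75% haircut from the post
    return smms, cores, ideal, scaled

smms, cores, ideal, scaled = big_maxwell(7)
print(f"{smms} SMMs, {cores} cores")           # 35 SMMs, 4480 cores
print(f"ideal:  {ideal:.1%} of a 780 Ti")      # 350.0%
print(f"scaled: {scaled:.1%} of a 780 Ti")     # 262.5%
```

Changing `scaling` or `perf_per_gpc_vs_780ti` shows how sensitive the final number is to those two guesses; at 0.45 per GPC (first-gen figure) and the same 75% scaling, for instance, the result drops to about 2.4x.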
My prediction, therefore, is that we will see a Big Maxwell on 20nm with at minimum twice the performance of a GTX 780 Ti, and by current rumors it is due this year. Add in that Maxwell will be able to clock higher at 20nm (I based those power figures off of the numbers above, which were for a card at 28nm with a 1085MHz GPU clock), and realistically they can probably fit a stock core speed of 1100-1150MHz on a chip this big. Thus my prediction is that we would see a 450-460mm2 die size for this hypothetical card, at 20nm, with a 250-260w TDP rating and, thanks to the 2x transistor density, approximately 9.2b transistors.
Napkin math, for sure, but worth thinking about, eh?