[Techpowerup] NVIDIA DLSS and its Surprising Resolution Limitations

post #71 of 91 (permalink) Old 02-19-2019, 07:35 PM - Thread Starter
New to Overclock.net
 
 
Join Date: Oct 2011
Posts: 3,284
Rep: 133 (Unique: 84)
Quote: Originally Posted by guttheslayer View Post
Where do you even get the figure from? The last I checked, in comparison with both GP100 and GV100, after scaling, the Tensor cores do not consume >5% of the die. It is in fact negligible.
Actually, Tensor cores take up 2/5ths of each Turing shader module: https://www.anandtech.com/show/13282...re-deep-dive/4

You can see their oversimplified graphic of die space allocation at the bottom of this article: https://www.pcworld.com/article/3305...x-2080-ti.html

So they're not consuming exactly 50% of the CUDA core space, just 40%.
post #72 of 91 (permalink) Old 02-19-2019, 08:53 PM
New to Overclock.net
 
 
Join Date: Apr 2014
Posts: 2,824
Rep: 23 (Unique: 17)
Quote: Originally Posted by ILoveHighDPI View Post
Actually, Tensor cores take up 2/5ths of each Turing shader module: https://www.anandtech.com/show/13282...re-deep-dive/4

You can see their oversimplified graphic of die space allocation at the bottom of this article: https://www.pcworld.com/article/3305...x-2080-ti.html

So they're not consuming exactly 50% of the CUDA core space, just 40%.
So 40% wasted die space instead of 50%.

Quote: Originally Posted by tpi2007 View Post
Here's the thing: it might very well be that on 12nm you couldn't have all those CUDA cores within a reasonable TDP to begin with. Nvidia most probably gets away with a 250 W TDP for the 2080 Ti (260 W for the FE) because, when you're not using RTX features, a big portion of the card is idle, and when you are, the CUDA cores are bottlenecked by the RTX hardware (RT + Tensor cores). So it was a smart way for them to manage things.
Interesting take: so Tensor cores are a way to reduce TDP and performance.

Last edited by ryan92084; 02-20-2019 at 04:54 AM.
post #73 of 91 (permalink) Old 02-20-2019, 12:51 AM
mfw
 
 
Join Date: Jul 2011
Location: Terra
Posts: 7,018
Rep: 393 (Unique: 205)
Quote: Originally Posted by 8051 View Post
Interesting take: so Tensor cores are a way to reduce TDP and performance.
The AI developed a conscience and started caring about the environment and energy expenditure. It also reduces framerate because it heard Jensen saying that amount of performance is irresponsible.

CPU: Intel 6700K
Motherboard: Asus Z170i
GPU: MSI 2080 Sea Hawk X
RAM: G.Skill Trident Z 3200CL14 8+8
Hard Drive: Samsung 850 EVO 1TB
Hard Drive: Crucial M4 256GB
Power Supply: Corsair SF600
Cooling: Noctua NH-C14S
Case: Fractal Design Core 500
Operating System: Windows 10 Education
Monitor: ViewSonic XG2703-GS
Keyboard: Ducky One 2 Mini
Mouse: Glorious Odin
Mousepad: Asus Scabbard
Audio: Fiio E17K v1.0 + Beyerdynamic DT 1990 PRO (B pads)
post #74 of 91 (permalink) Old 02-20-2019, 01:24 AM
Looking Ahead
 
 
Join Date: Dec 2008
Location: Cluain Dolcáin, Leinster (Ireland)
Posts: 13,043
Rep: 785 (Unique: 536)
Quote: Originally Posted by guttheslayer View Post
Where do you even get the figure from? The last I checked, in comparison with both GP100 and GV100, after scaling, the Tensor cores do not consume >5% of the die. It is in fact negligible.
Quote: Originally Posted by ILoveHighDPI View Post
Actually, Tensor cores take up 2/5ths of each Turing shader module: https://www.anandtech.com/show/13282...re-deep-dive/4

You can see their oversimplified graphic of die space allocation at the bottom of this article: https://www.pcworld.com/article/3305...x-2080-ti.html

So they're not consuming exactly 50% of the CUDA core space, just 40%.
Quote: Originally Posted by 8051 View Post
So 40% wasted die space instead of 50%.
No, those schematics are not drawn to scale. For one, the register file is orders of magnitude larger than the logic (the L1 cache is much larger still). When it comes to logic, FP32 units are orders of magnitude larger than INT32 units. You cannot derive resource usage from the schematics (if drawn to scale, they would be very uninformative).

At the bottom of the page, the blocks "overlaid" on top of the die shot are not even in the correct place.

 



post #75 of 91 (permalink) Old 02-20-2019, 02:12 AM - Thread Starter
New to Overclock.net
 
 
Join Date: Oct 2011
Posts: 3,284
Rep: 133 (Unique: 84)
Quote: Originally Posted by 8051 View Post
So 40% wasted die space instead of 50%.
EDIT: Commenting on INT32 cores is really above my head, however.
After a bit of reading it does seem that INT32 is generally useful; I assume it's part of being a GPGPU architecture. But the quantity of INT32 cores may be exaggerated in Turing to help feed the Tensor cores, or may be a vestigial component from the workstation design.

Quote: Originally Posted by TheBlademaster01 View Post
No, those schematics are not drawn to scale. For one, the register file is orders of magnitude larger than the logic (the L1 cache is much larger still). When it comes to logic, FP32 units are orders of magnitude larger than INT32 units. You cannot derive resource usage from the schematics (if drawn to scale, they would be very uninformative).

At the bottom of the page, the blocks "overlaid" on top of the die shot are not even in the correct place.
Right, ultimately it's Nvidia telling us how they're using the silicon.
If Nvidia wanted to tell us that ray tracing doesn't add a lot of cost to the overall design, they thoroughly failed at it, and gave us a lot of ammunition to say otherwise along the way.

Last edited by ryan92084; 02-20-2019 at 04:54 AM.
post #76 of 91 (permalink) Old 02-20-2019, 02:45 AM
Looking Ahead
 
 
Join Date: Dec 2008
Location: Cluain Dolcáin, Leinster (Ireland)
Posts: 13,043
Rep: 785 (Unique: 536)
Quote: Originally Posted by ILoveHighDPI View Post
EDIT: Commenting on INT32 cores is really above my head, however.
After a bit of reading it does seem that INT32 is generally useful; I assume it's part of being a GPGPU architecture. But the quantity of INT32 cores may be exaggerated in Turing to help feed the Tensor cores, or may be a vestigial component from the workstation design.
Was this meant in response to me?

If so: INT32 units are regular 32-bit integer units. They are used in GPGPU work, but also for graphics. Basically, simple calculations like 2 + 2 = 4 are integer operations, as is evaluating which number is bigger, doing bit shifts, etc. Some practical use cases would be pixel color manipulations, image-processing filters, and dataflow control (conditional execution). FP32 units are floating-point units and are used when you need data to be accurate to several decimal places, or simply need to represent very small and very large numbers (game physics, camera rotations, and lighting/shading fall into this category, i.e. most of the heavy lifting GPUs do).

Integer arithmetic is much simpler, since you can easily and efficiently perform operations on each individual bit. Floating-point arithmetic is more complicated because, while the number representation is efficient for covering a great range of numbers, it uses an encoding scheme (sign, exponent, mantissa) that needs to be accounted for in every calculation. That is why floating-point units are much larger, slower, and more power-hungry than integer units (especially the double-precision floating-point units integrated in Volta and Radeon VII).
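To make the encoding point concrete, here's a quick Python sketch (purely illustrative) showing the raw bit patterns the hardware has to deal with: an INT32 is just a two's-complement number, while an FP32 packs sign, exponent, and mantissa fields that every operation must decode and re-encode.

[CODE]
import struct

def int32_bits(n: int) -> str:
    # Plain two's-complement bit pattern of a 32-bit integer
    return format(n & 0xFFFFFFFF, "032b")

def fp32_bits(x: float) -> str:
    # IEEE-754 single precision: 1 sign bit, 8 exponent bits, 23 mantissa bits
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    b = format(raw, "032b")
    return f"{b[0]} | {b[1:9]} | {b[9:]}"  # sign | exponent | mantissa

print(int32_bits(4))   # 00000000000000000000000000000100
print(fp32_bits(4.0))  # 0 | 10000001 | 00000000000000000000000
[/CODE]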

Tensor cores are custom hardware that take in a batch of small (half-precision) floating-point operands (48 of them, arranged in three 4x4 grids) and perform matrix multiplication on them. It's slightly more complicated than that: technically they multiply two FP16 4x4 grids and add the result to an FP32 4x4 grid. Each one would be larger than either an FP32 or an INT32 unit, but there are not a lot of them integrated on the chip. Most of the work is still done on the FP32 units.
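As a rough numerical sketch of what one such operation computes (just numpy emulating the data flow, not the actual hardware):

[CODE]
import numpy as np

# One tensor-core style fused operation: D = A x B + C,
# where A and B are FP16 4x4 grids and C/D are FP32 4x4 grids.
A = np.random.rand(4, 4).astype(np.float16)
B = np.random.rand(4, 4).astype(np.float16)
C = np.random.rand(4, 4).astype(np.float32)

# Multiply the half-precision inputs, accumulate in single precision
D = A.astype(np.float32) @ B.astype(np.float32) + C
print(D.dtype, D.shape)  # float32 (4, 4)
[/CODE]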

E:

I do think Tensor cores and RT cores added significantly to the bill for the Turing architecture, but it's not possible to say how much of the die area went to which component (cache vs. registers vs. RT vs. Tensor vs. FP32 vs. INT32, etc.). What you can see is that TU102 is massive, so there is certainly a significant increase in hardware, but it's difficult to make an exact breakdown without specific data. These features definitely added to R&D costs, though.

 




Last edited by TheBlademaster01; 02-20-2019 at 03:39 AM.
post #77 of 91 (permalink) Old 02-20-2019, 04:23 AM - Thread Starter
New to Overclock.net
 
 
Join Date: Oct 2011
Posts: 3,284
Rep: 133 (Unique: 84)
Quote: Originally Posted by TheBlademaster01 View Post
Was this meant in response to me?

If so: INT32 units are regular 32-bit integer units. They are used in GPGPU work, but also for graphics. Basically, simple calculations like 2 + 2 = 4 are integer operations, as is evaluating which number is bigger, doing bit shifts, etc. Some practical use cases would be pixel color manipulations, image-processing filters, and dataflow control (conditional execution). FP32 units are floating-point units and are used when you need data to be accurate to several decimal places, or simply need to represent very small and very large numbers (game physics, camera rotations, and lighting/shading fall into this category, i.e. most of the heavy lifting GPUs do).

Integer arithmetic is much simpler, since you can easily and efficiently perform operations on each individual bit. Floating-point arithmetic is more complicated because, while the number representation is efficient for covering a great range of numbers, it uses an encoding scheme (sign, exponent, mantissa) that needs to be accounted for in every calculation. That is why floating-point units are much larger, slower, and more power-hungry than integer units (especially the double-precision floating-point units integrated in Volta and Radeon VII).

Tensor cores are custom hardware that take in a batch of small (half-precision) floating-point operands (48 of them, arranged in three 4x4 grids) and perform matrix multiplication on them. It's slightly more complicated than that: technically they multiply two FP16 4x4 grids and add the result to an FP32 4x4 grid. Each one would be larger than either an FP32 or an INT32 unit, but there are not a lot of them integrated on the chip. Most of the work is still done on the FP32 units.

E:

I do think Tensor cores and RT cores added significantly to the bill for the Turing architecture, but it's not possible to say how much of the die area went to which component (cache vs. registers vs. RT vs. Tensor vs. FP32 vs. INT32, etc.). What you can see is that TU102 is massive, so there is certainly a significant increase in hardware, but it's difficult to make an exact breakdown without specific data. These features definitely added to R&D costs, though.
Amazing post, thanks a bunch.

I think we’ll “mostly” have our answer about the die cost of RTX in a few days when the 1660 Ti launches.
post #78 of 91 (permalink) Old 02-20-2019, 07:55 AM
New to Overclock.net
 
 
Join Date: Jul 2014
Location: Dallas
Posts: 3,485
Rep: 171 (Unique: 125)
Quote: Originally Posted by TheBlademaster01 View Post
Was this meant in response to me?

If so: INT32 units are regular 32-bit integer units. They are used in GPGPU work, but also for graphics. Basically, simple calculations like 2 + 2 = 4 are integer operations, as is evaluating which number is bigger, doing bit shifts, etc. Some practical use cases would be pixel color manipulations, image-processing filters, and dataflow control (conditional execution). FP32 units are floating-point units and are used when you need data to be accurate to several decimal places, or simply need to represent very small and very large numbers (game physics, camera rotations, and lighting/shading fall into this category, i.e. most of the heavy lifting GPUs do).

Integer arithmetic is much simpler, since you can easily and efficiently perform operations on each individual bit. Floating-point arithmetic is more complicated because, while the number representation is efficient for covering a great range of numbers, it uses an encoding scheme (sign, exponent, mantissa) that needs to be accounted for in every calculation. That is why floating-point units are much larger, slower, and more power-hungry than integer units (especially the double-precision floating-point units integrated in Volta and Radeon VII).

Tensor cores are custom hardware that take in a batch of small (half-precision) floating-point operands (48 of them, arranged in three 4x4 grids) and perform matrix multiplication on them. It's slightly more complicated than that: technically they multiply two FP16 4x4 grids and add the result to an FP32 4x4 grid. Each one would be larger than either an FP32 or an INT32 unit, but there are not a lot of them integrated on the chip. Most of the work is still done on the FP32 units.

E:

I do think Tensor cores and RT cores added significantly to the bill for the Turing architecture, but it's not possible to say how much of the die area went to which component (cache vs. registers vs. RT vs. Tensor vs. FP32 vs. INT32, etc.). What you can see is that TU102 is massive, so there is certainly a significant increase in hardware, but it's difficult to make an exact breakdown without specific data. These features definitely added to R&D costs, though.

Shoddy math will give a hint, I think.

Going from 16nm to 12nm should cut per-transistor area by ~44% if both dimensions shrink ((12/16)^2 = 144/256 ≈ 0.56), although admittedly I have no clue what the real change is, nor do I know whether the process shrink affects only lengths or lengths and widths alike... Someone fix this for me if they know a more accurate number.


GP100 is on a 610 mm^2 die and has:


Texture Units - 224
CUDA - 3584
Tensor - 640

GV100 is on an 815 mm^2 die and has:

Texture Units - 320
CUDA - 5120
Tensor - 640

Between the bigger die size and increased density, you should see about a total uptick of 79% across the board for a GP100 at that size on that process.


Theoretical GP100 on the 12nm process at 815 mm^2:

CUDA - 6415
Texture Units - 400


Which gives us: 640 Tensor cores = 1295 CUDA cores and 80 texture units (6415 - 5120 and 400 - 320).

Or: 1 Tensor core ≈ 2 CUDA cores and ~0.1 texture units as far as overall real estate goes.



Going out on a limb and saying an RT core is the same size as a Tensor core, the 2080 Ti has 544 Tensor cores and 68 RT cores, adding up to 612 total. To further extrapolate, that would put a 2080 Ti with no Tensor/RT cores at ~5576 CUDA cores instead of the 4352 it got, giving a final total of ~22% of the Tensor/CUDA area budget being used on Tensor/RT cores.


Obviously that's taking a lot of freedom and making a series of extrapolations on that fast and loose guess, so take that with a dump truck of salt.
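If anyone wants to poke at the numbers, here's the same back-of-the-envelope math as a quick Python script (all the same fast-and-loose assumptions baked in, including the linear-only density scaling):

[CODE]
# Back-of-the-envelope Turing die-cost estimate. Same assumptions as above:
# linear-only 16nm -> 12nm density gain, and an RT core ~= a Tensor core in size.
gp100_cuda = 3584                     # GP100, 610 mm^2
gv100_cuda, gv100_tensor = 5120, 640  # GV100, 815 mm^2

scale = (815 / 610) * (16 / 12)       # die growth x linear density gain, ~1.78
cuda_if_no_tensor = gp100_cuda * scale                                    # ~6400
cuda_per_tensor = round((cuda_if_no_tensor - gv100_cuda) / gv100_tensor)  # ~2

tu102_cuda, tu102_tensor, tu102_rt = 4352, 544, 68
extra = (tu102_tensor + tu102_rt) * cuda_per_tensor  # CUDA-core equivalents
print(f"2080 Ti without Tensor/RT: ~{tu102_cuda + extra} CUDA cores")          # ~5576
print(f"Tensor/RT share of shader area: ~{extra / (tu102_cuda + extra):.0%}")  # ~22%
[/CODE]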


post #79 of 91 (permalink) Old 02-20-2019, 09:09 AM
Looking Ahead
 
 
Join Date: Dec 2008
Location: Cluain Dolcáin, Leinster (Ireland)
Posts: 13,043
Rep: 785 (Unique: 536)
Quote: Originally Posted by DNMock View Post
Shoddy math will give a hint, I think.

Going from 16nm to 12nm should cut per-transistor area by ~44% if both dimensions shrink ((12/16)^2 = 144/256 ≈ 0.56), although admittedly I have no clue what the real change is, nor do I know whether the process shrink affects only lengths or lengths and widths alike... Someone fix this for me if they know a more accurate number.


GP100 is on a 610 mm^2 die and has:


Texture Units - 224
CUDA - 3584
Tensor - 640

GV100 is on an 815 mm^2 die and has:

Texture Units - 320
CUDA - 5120
Tensor - 640

Between the bigger die size and increased density, you should see about a total uptick of 79% across the board for a GP100 at that size on that process.


Theoretical GP100 on the 12nm process at 815 mm^2:

CUDA - 6415
Texture Units - 400


Which gives us: 640 Tensor cores = 1295 CUDA cores and 80 texture units (6415 - 5120 and 400 - 320).

Or: 1 Tensor core ≈ 2 CUDA cores and ~0.1 texture units as far as overall real estate goes.



Going out on a limb and saying an RT core is the same size as a Tensor core, the 2080 Ti has 544 Tensor cores and 68 RT cores, adding up to 612 total. To further extrapolate, that would put a 2080 Ti with no Tensor/RT cores at ~5576 CUDA cores instead of the 4352 it got, giving a final total of ~22% of the Tensor/CUDA area budget being used on Tensor/RT cores.


Obviously that's taking a lot of freedom and making a series of extrapolations on that fast and loose guess, so take that with a dump truck of salt.
Haha, nice

Yes, it's valid to assume that transistor area scales quadratically with decreasing feature size. Aside from some non-trivial manufacturing-process-specific factors, there's a roughly linear relation between a transistor's width-to-length ratio and its drive strength/current. So, in order to get similar performance, you can reduce the width by approximately the same ratio as the feature-size reduction.
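In rough first-order terms (ignoring those process-specific factors), the relation looks like this:

[CODE]
I_D \propto \frac{W}{L}
\quad\Rightarrow\quad
L' = sL,\ W' = sW \ \text{keeps}\ \frac{W'}{L'} = \frac{W}{L},
\qquad
\text{area} = W'L' = s^2\,WL,
\qquad
s = \tfrac{12}{16} \ \Rightarrow\ s^2 = \tfrac{144}{256} \approx 0.56 .
[/CODE]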

I can follow the rest of your reasoning. I think equating the size of an RT core to that of a Tensor core is not accurate, but the error might be negligible because of how few there are on the chip.

I Googled around, and your "pseudo-science" is within 1% of someone who solved the same problem graphically, so who knows:

https://www.reddit.com/r/nvidia/comm...dont_think_so/

 



post #80 of 91 (permalink) Old 02-20-2019, 09:23 AM
Waiting for 7nm EUV
 
 
Join Date: Nov 2010
Posts: 11,384
Rep: 894 (Unique: 503)
The 12nm process they are using is just a slight improvement over 16nm; it probably doesn't amount to that much. Also, GV100 actually has 84 SMs, but we only ever saw products with 80 enabled, due to yield and probably also power-consumption reasons, so take that into account in your math. There's also the difference that GV100 uses an HBM2 memory controller, thus saving a bit of die space compared to the GDDR6-equipped Turing dies.

