Overclock.net - An Overclocking Community - View Single Post - [Techpowerup] NVIDIA DLSS and its Surprising Resolution Limitations

post #78, 02-20-2019, 07:55 AM
DNMock
Quote: Originally Posted by TheBlademaster01
Was this meant in response to me?

If so, INT32 are regular 32-bit integer units. They are used in GPGPU, but also for graphics. Basically, simple calculations like 2 + 2 = 4 are integer operations, as are evaluating which number is bigger, bit shifts, etc. Some practical use cases would be pixel color manipulation, image-processing filters and dataflow control (conditional execution). FP32 are floating point units and are used when you need data to be accurate to several decimal points, or simply need to represent very small and very large numbers (game physics, camera rotations and lighting/shading fall in this category, i.e. most of the heavy lifting GPUs do).

Integer arithmetic is much simpler, since you can easily/efficiently perform operations on each individual bit. Floating point arithmetic is more complicated because, while the number representation scheme is efficient for representing a great range of numbers, it uses an encoding scheme that needs to be accounted for in each calculation. That is why floating point units are much larger, slower and more power hungry than integer units (especially the double precision floating point units integrated in Volta and Radeon VII).
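A quick Python sketch of that contrast (nothing GPU-specific, just the two arithmetic styles on any machine): integer ops are exact and bit-addressable, while floating point trades exactness for dynamic range.

```python
# Integer arithmetic: exact, and individual bits are directly addressable.
assert 2 + 2 == 4
assert (1 << 10) == 1024        # bit shift
assert max(7, 3) == 7           # "which number is bigger"

# Floating point: huge range, but the IEEE 754 encoding introduces
# rounding that every operation has to account for.
print(0.1 + 0.2)                # 0.30000000000000004, not exactly 0.3
print(2.0 ** 1000)              # very large numbers still representable
```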

Tensor cores are custom hardware that take in a batch (48 operands arranged in three 4x4 grids) of small (half precision) floating point operands and perform matrix multiplications on them. It's slightly more complicated than that (technically they multiply two FP16 4x4 grids and add the result to an FP32 4x4 grid). They would be larger than both FP32 and INT32 units in size, but there are not a lot of them integrated on the chip. Most of the work is still done on the FP32 units.
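A pure-Python sketch of that fused step (D = A·B + C on 4x4 grids; the FP16-input/FP32-accumulate split only appears in the comments, since Python floats are all double precision):

```python
# One tensor-core style operation: multiply two 4x4 matrices (FP16 operands
# on the real hardware) and accumulate into a third 4x4 matrix (the FP32
# accumulator grid).
def tensor_core_step(A, B, C):
    return [[sum(A[i][k] * B[k][j] for k in range(4)) + C[i][j]
             for j in range(4)] for i in range(4)]

identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
zeros = [[0.0] * 4 for _ in range(4)]
# identity x identity + zeros gives identity back:
print(tensor_core_step(identity, identity, zeros) == identity)  # True
```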


I do think tensor cores and RT cores added significantly to the bill of the Turing architecture, but it's not possible to say how much of the die area went to which component (cache vs registers vs RT vs tensor vs FP32 vs INT32, etc.). What you can see is that TU102 is massive, so there certainly is a significant increase in hardware, but it's difficult to make an exact taxonomy without specific data. These features definitely added to R&D costs, though.

Shoddy math will give a hint I think.

Going from 16nm to 12nm should, on paper, shrink the area per transistor to 144/256 ≈ 56% (a ~78% density increase) if the node number scales both dimensions, or give only ~33% more density if it scales lengths alone. Admittedly, I have no clue what the real change is. Someone fix this for me if they know a more accurate number.
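For what it's worth, the two readings of the shrink work out like this (back-of-envelope Python; the real 16nm-to-12nm density change isn't public, so both factors are guesses from the node names alone):

```python
# "16nm -> 12nm" read as a lengths-only scaling vs. a full areal scaling:
linear_gain = 16 / 12            # one dimension shrinks: ~1.33x density
areal_gain = (16 / 12) ** 2      # both dimensions shrink: 256/144 ~= 1.78x
print(round(linear_gain, 2))     # 1.33
print(round(areal_gain, 2))      # 1.78
```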

GP100 is a 610mm^2 die and has:

Texture Units - 224
CUDA - 3584
Tensor - 640

GV100 is an 815mm^2 die and has:

Texture Units - 320
CUDA - 5120
Tensor - 640

Between the bigger die (815/610 ≈ 1.34) and the lengths-only density guess (16/12 ≈ 1.33), you should see a total uptick of about 1.34 × 1.33 ≈ 1.79, i.e. ~79% across the board for a GP100 scaled to that size on that process.

Theoretical GP100 on a 12nm process at 815mm^2:

CUDA - 6415
Texture Units - 400

Which gives us 640 tensor cores = 1295 CUDA cores (6415 − 5120) and 80 texture units (400 − 320) of real estate.

Or 1 tensor core ≈ 2 CUDA cores and ~0.1 texture units, as far as overall real estate goes.
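The bookkeeping behind those per-core numbers, as a Python sketch (the 1.79 scale factor is the guess from above, not a measured figure):

```python
scale = 1.79                        # guessed die-size x density uptick
cuda_scaled = round(3584 * scale)   # theoretical GP100 CUDA cores at 815mm^2
tex_scaled = round(224 * scale)     # theoretical texture units

cuda_traded = cuda_scaled - 5120    # CUDA cores GV100 "gave up" vs. the theory
tex_traded = tex_scaled - 320       # texture units GV100 "gave up"
print(cuda_traded, tex_traded)      # 1295 81
print(round(cuda_traded / 640, 2))  # 2.02 CUDA cores per tensor core
print(round(tex_traded / 640, 2))   # 0.13 texture units per tensor core
```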

Going out on a limb and saying an RT core is the same size as a tensor core, the 2080 Ti has 544 tensor cores and 68 RT cores, adding up to 612 total. At ~2 CUDA cores' worth of area each, that would put a 2080 Ti with no tensor/RT cores at 5576 CUDA cores instead of the 4352 it got, giving a final total of roughly 22% of the shader area budget being used on tensor/RT cores.
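Running the same trade for TU102 as a Python sketch (the "RT core ≈ tensor core in size" assumption is the limb being gone out on, and the 2-CUDA-cores-per-tensor-core figure is the GV100 guess from above):

```python
tensor_like_units = 544 + 68              # 2080 Ti tensor + RT cores = 612
cuda_equivalent = tensor_like_units * 2   # each worth ~2 CUDA cores of area
hypothetical_cuda = 4352 + cuda_equivalent

print(hypothetical_cuda)                  # 5576 CUDA cores if all shader area
print(round(cuda_equivalent / hypothetical_cuda, 2))  # 0.22 of the budget
```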

Obviously that's taking a lot of liberties and stacking extrapolations on a fast-and-loose guess, so take it with a dump truck of salt.
