
[WCCF] HITMAN To Feature Best Implementation Of DX12 Async Compute Yet, Says AMD - Page 75

post #741 of 799
Quote:
Originally Posted by Mahigan View Post

Multi-threaded command listing & deferred rendering (DirectX runtime MT):
DirectX works by creating bundles (batches) of commands, called command lists. These bundles are sent from the API to the graphics driver. The driver can make some changes to these commands (shader replacements, reordering of commands, etc.) and then translates them into ISA (Instruction Set Architecture, the GPU's native language) command lists (grids/threads) before sending them to the GPU for processing.

Multi-threaded command listing allows the DirectX driver to pre-record lists of commands on idling CPU cores. These lists are then played back to the graphics driver on the CPU's primary core (thread 0). Why? Because the DirectX driver can only run on the primary CPU thread.

Multi-threaded rendering (DirectX runtime MT + DirectX driver MT):
This is more or less the same as above (the DirectX runtime can also scale past 4 cores), except for the last part: the DirectX driver doesn't need to play the commands back over the primary CPU thread. Any CPU core/thread can talk directly to the graphics driver and send it command lists. How? The DirectX driver is split amongst every CPU thread.
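The difference between the two submission models can be sketched with a toy Python simulation (this is purely illustrative; these are not real D3D11 API calls, and `record_command_list` is a made-up stand-in): worker threads record command lists in parallel, but in the DX11 deferred-context model only the primary thread may play them back to the driver.

```python
import queue
import threading

def record_command_list(scene_chunk):
    """Toy stand-in for recording a deferred command list on a worker thread."""
    return [f"draw({obj})" for obj in scene_chunk]

def dx11_style_render(scene_chunks):
    """Commands are recorded in parallel, but played back ('submitted')
    only on the primary thread -- the serialization point described above."""
    recorded = queue.Queue()
    workers = [
        threading.Thread(target=lambda c=chunk: recorded.put(record_command_list(c)))
        for chunk in scene_chunks
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    submitted = []
    while not recorded.empty():  # single-threaded playback (thread 0)
        submitted.extend(recorded.get())
    return submitted

calls = dx11_style_render([["tree", "rock"], ["car"], ["sky"]])
print(len(calls))  # 4 draw calls reach the driver, all via one thread
```

In the fully multi-threaded model, each worker would hand its list straight to the driver instead of funneling everything through the single playback loop.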

NVIDIA:
NVIDIA's driver uses more than one thread (hidden driver threads) to perform the DirectX driver's translations into ISA. These commands are kept in system memory and fetched in bulk by the GigaThread engine. This saves CPU time: commands can be sent in bulk, and the CPU can then handle other complex tasks without creating a stall. Lower DX11 API overhead. Result: a higher draw-call rate.

AMD:
The AMD driver wouldn't benefit from being multi-threaded because there are only 64 thread slots in the command processor. So even if multi-threaded command listing and deferred rendering were used, the command processor could only fetch 64 threads at a time. That means constant fetching, or streaming, of work. If the CPU is busy with other work, a stall occurs and the GPU waits for the CPU to feed it. Hence GCN's higher DX11 API overhead. Result: a lower draw-call rate.
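Taking the post's numbers at face value (the 64-slot figure is the poster's claim, not vendor documentation), the cost of streaming work in small batches is easy to quantify: the number of fetch round-trips grows with the command count, and each round-trip is an opportunity to stall if the CPU is busy.

```python
import math

def fetch_round_trips(num_commands: int, slots_per_fetch: int) -> int:
    """How many times the command processor must come back for more work."""
    return math.ceil(num_commands / slots_per_fetch)

commands = 10_000
# Per the post above: fetching at most 64 commands at a time means many
# round-trips, while bulk fetching grabs everything queued in one trip.
print(fetch_round_trips(commands, 64))        # 157 round-trips
print(fetch_round_trips(commands, commands))  # 1 bulk fetch
```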

I am a graphics engine programmer with roughly average knowledge of graphics APIs. I know why this is happening, but even I don't have data or facts on exactly why, because the bottlenecks and the examples are different every time. And it's not only what you posted that hurts DX11 performance; for example, AMD's support for command lists is poor, among many other things. My point is that someone with such an amount of rep should present his post as speculation, not as fact.
Workstation (4 items): Xeon E5-2690 · Supermicro 2011 · Nvidia GP100/Vega FE · Dell UltraSharp 4K
post #742 of 799
Quote:
Originally Posted by sugarhell View Post

I am a graphics engine programmer with roughly average knowledge of graphics APIs. I know why this is happening, but even I don't have data or facts on exactly why, because the bottlenecks and the examples are different every time. And it's not only what you posted that hurts DX11 performance; for example, AMD's support for command lists is poor, among many other things. My point is that someone with such an amount of rep should present his post as speculation, not as fact.

Don't worry, the PowerPoint presentation slides always give it away. :D
loon 3.2 (18 items): i7-3770K · Asus P8Z77-V Pro · EVGA 980 Ti SC+ · 16GB PNY DDR3 1866 · PNY 1311 240GB · 1TB Seagate · 3TB WD Blue · DVDRW · EKWB P280 kit · EK-VGA Supremacy · Win 10 · LG 24MC57HQ-P · Ducky Zero (blues) · EVGA SuperNova 750 G2 · Stryker M · Corsair M65 · SB Recon3D · Klipsch ProMedia 2.1
post #743 of 799
Whether it's shaders or something else, it's quite clear that under DX11 AMD cards are nowhere near using their full potential. Even mid-range cards such as the R9 380 show this. The exact percentage is speculation, but it's nowhere near 90%+. And the higher-end the card, the larger the gap.

The Fury X has been underperforming since the beginning.
post #744 of 799
Quote:
Originally Posted by BradleyW View Post

No, I don't have any technical data that proves my comment, so we may categorise it as an opinion. I have completed 2 years of Computer Science at university. My Computer Science BSc degree covers hardware architecture in depth, but not GCN specifically. It is my opinion that AMD's GCN is not being fully utilised by single-thread-dependent applications, as seen in various DX11 titles, resulting in reduced performance because not all cylinders are firing, or at least "not as many at the same time".

Polaris will have a bigger instruction buffer for better single-threaded performance. So, besides being great for DX12, it should reduce CPU overhead in DX11.

Sadly, I don't think we will see a big Polaris chip in the 4xx series; in the 5xx series we probably will.
post #745 of 799
Quote:
Originally Posted by sugarhell View Post

I am a graphics engine programmer with roughly average knowledge of graphics APIs. I know why this is happening, but even I don't have data or facts on exactly why, because the bottlenecks and the examples are different every time. And it's not only what you posted that hurts DX11 performance; for example, AMD's support for command lists is poor, among many other things. My point is that someone with such an amount of rep should present his post as speculation, not as fact.

Well, let's take it a step further, shall we?

GM200 (GTX 980 Ti):
22 SMMs.
Each SMM contains 128 SIMD cores.
Each SMM can execute 64 warps concurrently.
Each warp comprises 32 threads.
So that's 2,048 threads per SMM (128 SIMD cores).
2,048 x 22 = 45,056 threads in flight (executing concurrently).

GCN3:
64 CUs.
Each CU contains 64 SIMD cores.
Each CU can execute 40 wavefronts concurrently.
Each wavefront comprises 64 threads.
So that's 2,560 threads per CU (64 SIMD cores).
2,560 x 64 = 163,840 threads in flight (executing concurrently).

Now factor in this:

GCN3 SIMDs are more powerful than GM200 SIMDs, core for core. It takes a GCN3 SIMD less time to process a MADD.

So what is the conclusion?

1. GCN3 is far more parallel.

2. GCN3 has fewer SIMD cores per unit, each dedicated to doing more work; GM200 has more SIMD cores per unit, each dedicated to doing less work.

3. If you're feeding your GPU small amounts of compute work, GM200 will come out on top. If you're feeding your GPU large amounts of compute work, GCN3 will come out on top.
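The threads-in-flight figures above follow directly from the per-unit numbers, which is easy to check:

```python
def threads_in_flight(units: int, slots_per_unit: int, threads_per_slot: int) -> int:
    """Maximum concurrently resident threads: units x warps (or wavefronts) x width."""
    return units * slots_per_unit * threads_per_slot

# GM200 (GTX 980 Ti): 22 SMMs, 64 warps per SMM, 32 threads per warp
gm200 = threads_in_flight(22, 64, 32)
# GCN3 (Fiji): 64 CUs, 40 wavefronts per CU, 64 threads per wavefront
gcn3 = threads_in_flight(64, 40, 64)
print(gm200)                   # 45056
print(gcn3)                    # 163840
print(round(gcn3 / gm200, 1))  # 3.6 -- GCN3 keeps ~3.6x more threads resident
```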

Now, this is just the tip of the iceberg. There's also Multi-Engine support to consider:

GCN3 stands to benefit more from asynchronous compute + graphics than GM200 does, because GCN3 has more threads idling than GM200. So long as you feed both architectures optimized code, they perform as expected.
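The idle-thread argument can be made concrete with a toy model (the 40,000-thread graphics workload below is a made-up illustrative number, not measured data): if a graphics workload occupies only a fraction of the resident-thread capacity, the remainder is headroom that concurrent async compute could fill.

```python
def async_headroom(capacity: int, graphics_threads: int) -> float:
    """Fraction of resident-thread slots left idle by the graphics workload,
    i.e. room available for concurrently scheduled compute work."""
    return max(0, capacity - graphics_threads) / capacity

# Hypothetical: the same 40,000-thread graphics workload on both chips.
graphics = 40_000
print(round(async_headroom(45_056, graphics), 2))   # GM200: 0.11 of slots idle
print(round(async_headroom(163_840, graphics), 2))  # GCN3:  0.76 of slots idle
```

Under this (very simplified) model, the same workload leaves GCN3 with far more idle slots, which is exactly the resource async compute is meant to exploit.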

What is as expected?


Hitman, under DX12, will showcase GCN quite nicely I believe.
Edited by Mahigan - 2/24/16 at 10:28am
Kn0wledge (20 items) · Pati3nce (14 items) · Wisd0m (10 items)
post #746 of 799
Quote:
Originally Posted by Mahigan View Post

Well, let's take it a step further, shall we?

GM200 (GTX 980 Ti):
22 SMMs.
Each SMM contains 128 SIMD cores.
Each SMM can execute 64 warps concurrently.
Each warp comprises 32 threads.
So that's 2,048 threads per SMM (128 SIMD cores).
2,048 x 22 = 45,056 threads in flight (executing concurrently).

GCN3:
64 CUs.
Each CU contains 64 SIMD cores.
Each CU can execute 40 wavefronts concurrently.
Each wavefront comprises 64 threads.
So that's 2,560 threads per CU (64 SIMD cores).
2,560 x 64 = 163,840 threads in flight (executing concurrently).

Now factor in this:

GCN3 SIMDs are more powerful than GM200 SIMDs, core for core. It takes a GCN3 SIMD less time to process a MADD.

So what is the conclusion?

1. GCN3 is far more parallel.

2. GCN3 has fewer SIMD cores per unit, each dedicated to doing more work; GM200 has more SIMD cores per unit, each dedicated to doing less work.

3. If you're feeding your GPU small amounts of compute work, GM200 will come out on top. If you're feeding your GPU large amounts of compute work, GCN3 will come out on top.

Now, this is just the tip of the iceberg. There's also Multi-Engine support to consider:

GCN3 stands to benefit more from asynchronous compute + graphics than GM200 does, because GCN3 has more threads idling than GM200. So long as you feed both architectures optimized code, they perform as expected.

What is as expected?

Hitman, under DX12, will showcase GCN quite nicely, I believe.

This is an excellent post and I very much agree with your conclusions, especially: "GCN3 stands to benefit more from asynchronous compute + graphics than GM200 does, because GCN3 has more threads idling than GM200."
+1
X79-GCN (22 items): Intel 3930K 4.5GHz HT · GIGABYTE GA-X79-UP4 · AMD R9 290X · GEil Evo Potenza DDR3 2400MHz CL10 (4x4GB) · Samsung 840 Pro 120GB · custom loop (EK Supremacy, EK-FC R9 290X, EK D5 Vario, EK RES X3 150, 360mm + 420mm Phobya G-Changer rads, NF F12/A14 fans, XSPC fittings, Primochill LRT tubing) · Win 10 x64 Pro · BenQ XR3501 35" Curved · Corsair Vengeance K90 · Seasonic X-1250 Gold (v2) · Corsair 900D · Logitech G400s · Senn HD 598
post #747 of 799
From my own testing in Dirt Rally:
3840x2160, Ultra settings, without Advanced Blending unless specified. This is on a single 290X.
[Screenshots: No AA · No AA + Advanced Blending · CMAA · 8xMSAA · 8xMSAA + Advanced Blending]

I'd like to point out something I noticed when the game was still in Early Access. Back then, when I turned on Advanced Blending, the entire forest would go white with specks of green in between. I believe it had something to do with the vegetation using transparent textures; the road, the car, and everything else were completely fine. I believe transparent textures are the culprit for the image quality on AMD cards in this game.
post #748 of 799
I'm surprised by the small difference between DX11 and DX12 for the 980 Ti in the Ashes benchmark. But the game itself is a lot heavier on the CPU than the benchmark, and even NVIDIA cards should see a significant increase from lower CPU overhead.

It's good to see the Fury X doing so well. That was expected once they added significant (over 15%) gains from async compute. But NVIDIA might get some boost too when they release their software emulation.
post #749 of 799
Quote:
Originally Posted by Potatolisk View Post

I'm surprised by the small difference between DX11 and DX12 for the 980 Ti in the Ashes benchmark.

Might have something to do with this:
Quote:
Originally Posted by Kollock View Post

Async compute is currently forcibly disabled on public builds of Ashes for NV hardware.
post #750 of 799
Quote:
Originally Posted by Forceman View Post

Might have something to do with this:

In Beta 2, which was tested today, you can turn async on and off.

Kepler and Maxwell just can't handle the context switching. No async for Kepler/Maxwell.