
DirectX 12: Asynchronous Compute (An exercise in Crowd-sourcing)

post #1 of 252
Thread Starter 
*This is a work in progress and should be viewed as such*

It's been several weeks since the Ashes of the Singularity benchmarks hit the PC gaming scene and brought a new feature into our collective vocabulary. Over these past few weeks, a lot of confusion and misinformation has spread across the web because of the rather complex nature of this topic. In an effort to combat this misinformation, @GorillaSceptre asked if a new thread could be started to condense the information which has been gathered on the topic. This thread is by no means final; it will change as new information comes to light. If you have new information, feel free to bring it to the attention of the Overclock.net community by commenting on this thread.

As things stand right now, Sept 6, 2015, we're waiting for a new driver from nVIDIA to rectify an issue which has prevented the Maxwell 2 series from supporting Asynchronous Compute. Both Oxide, the developer of the Ashes of the Singularity video game and benchmark, and nVIDIA are working hard to implement a fix. While we wait for this fix, let's take the opportunity to break down some misconceptions.



nVIDIA HyperQ


nVIDIA implements Asynchronous Compute through what it calls "HyperQ". HyperQ is a hybrid solution: part software scheduling and part hardware scheduling. While little information is available about its implementation in nVIDIA's Maxwell/Maxwell 2 architecture, we've been able to piece together how it works from various sources.


Now I'm certain many of you have seen this image floating around, as it pertains to Kepler's HyperQ implementation:
[Images not preserved: the Kepler HyperQ diagram, and an updated version by Ext3h]

What the image shows is a series of CPU cores (blue squares) scheduling tasks to a series of queues (black squares), which are then distributed to the various SMMs throughout the Maxwell 2 architecture. While this image is useful, it doesn't truly tell us what is going on. To figure out just what those black squares represent, we need to take a look at the nVIDIA Kepler white papers. Within these white papers we find HyperQ defined as such:
[Diagram not preserved]


Based on this diagram we can infer that HyperQ passes work through two components before tasks are scheduled to the GPU. These were first described here as software components; we now speculate that they are hardware components of a hardwired ARM processor built into the Maxwell die:
  1. Grid Management Unit
  2. Work Distributor


So far the scheduling works like this:
  1. The developer marks up a command list
  2. This command list is sent to the nVIDIA software driver
  3. The nVIDIA software driver translates the commands into ISA
  4. The ISA commands are fed to a Grid Management Unit
  5. The Grid Management Unit transfers 32 pending grids (32 Compute or 1 Graphic and 31 Compute) to the Work Distributor
  6. The Work Distributor transfers the 32 Compute or 1 Graphic and 31 Compute tasks to the SMMs which are a hardware component within the nVIDIA GPU.
  7. The components within the SMMs which receive the tasks are called Asynchronous Warp Schedulers and they assign the tasks to available CUDA cores for processing.
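To make the slot rule in steps 5 and 6 concrete, here is a minimal Python sketch of how a batch of up to 32 pending grids might be picked with at most one graphics task in it. This is a toy model of the description above, not nVIDIA's actual scheduler: `next_batch`, the tuple layout and the constant names are all made up for illustration.

```python
from collections import deque

MAX_SLOTS = 32      # pending grids handed to the Work Distributor (from the steps above)
MAX_GRAPHICS = 1    # at most one graphics task per batch (from the steps above)

def next_batch(pending):
    """Pop up to MAX_SLOTS tasks from `pending`, allowing at most one graphics task.

    `pending` is a deque of (kind, name) tuples, where kind is "graphics" or
    "compute". Tasks that do not fit stay queued in their original order.
    """
    batch, deferred = [], []
    graphics_used = 0
    while pending and len(batch) < MAX_SLOTS:
        kind, name = pending.popleft()
        if kind == "graphics":
            if graphics_used == MAX_GRAPHICS:
                deferred.append((kind, name))  # a second graphics task has to wait
                continue
            graphics_used += 1
        batch.append((kind, name))
    # Put the deferred tasks back at the front, keeping their order.
    pending.extendleft(reversed(deferred))
    return batch
```

Feeding it two graphics tasks followed by 40 compute tasks gives a first batch of 1 graphics + 31 compute, with the second graphics task left at the head of the queue for the next batch.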

Quote:
That's all fine and dandy but why doesn't it work?

New information has come to light which indicates that Asynchronous compute + graphics does not work due to the lack of proper Resource Barrier support under HyperQ. What is a resource barrier? A resource barrier adds commands which convert a resource (or resources) from one type to another (such as a render target to a texture), and prevents further command execution until the GPU has finished any work needed to convert the resources as requested. Without this feature, nVIDIA's HyperQ implementation cannot be used by DirectX 12 to execute Graphics and Compute commands in parallel. (However, Compute commands should still be able to execute in parallel with one another.)

An in-depth explanation can be found here: http://ext3h.makegames.de/DX12_Compute.html
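For anyone who wants to see the idea of a resource barrier in code, here is a small Python sketch. It is only a toy state tracker under my own assumptions: the `State`, `Resource`, `barrier` and `sample_as_texture` names are hypothetical and this is not the DirectX 12 API. It just shows the rule that a resource written as a render target has to be transitioned before it can be read as a texture.

```python
from enum import Enum, auto

class State(Enum):
    RENDER_TARGET = auto()     # the GPU is drawing into the resource
    SHADER_RESOURCE = auto()   # the resource can be read as a texture

class Resource:
    def __init__(self, name, state):
        self.name = name
        self.state = state

def barrier(resource, before, after):
    """Transition `resource` from `before` to `after`; fail if it is not in `before`."""
    if resource.state is not before:
        raise RuntimeError(
            f"{resource.name} is in {resource.state.name}, expected {before.name}"
        )
    resource.state = after

def sample_as_texture(resource):
    """Reading as a texture is only valid once the resource is a SHADER_RESOURCE."""
    if resource.state is not State.SHADER_RESOURCE:
        raise RuntimeError(f"{resource.name} must be transitioned before sampling")
    return f"sampled {resource.name}"
```

Calling `sample_as_texture` on a resource that is still a `RENDER_TARGET` raises an error; calling `barrier(res, State.RENDER_TARGET, State.SHADER_RESOURCE)` first makes the read valid.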

Quote:
What is preemption?

David Kanter explains it here (starts around 1:18:00 into the video): [video not preserved]

Preemption is important for VR. GCN has finer-grained preemption, which allows the ACEs to execute a compute task asynchronously, in parallel with other tasks and at a lower latency, whenever a VR headset movement is detected. The compute task being executed is an Asynchronous Time Warp: a compute shader which slightly alters the previously rendered frame to adjust for the new angle of the VR headset. Without this feature operating at a low latency (20ms or less), motion sickness can ensue.

What about Maxwell?:

Say you have a graphics shader in a frame that is taking a particularly long time to complete. With Maxwell 2, you have to wait for this shader to complete before you can preempt it. If a graphics shader takes 16ms, you have to wait until it completes before executing another graphics or compute command. This is because NVIDIA does not support finer-grained preemption, only coarse-grained preemption. NVIDIA has made great strides with its VR implementation from Kepler to Maxwell, going from 57ms latency down to 34ms, but they're still not there.
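A rough way to picture the difference is to work out how long a preemption request has to wait. The Python sketch below is a simplified model under my own assumptions: with coarse-grained preemption the request waits for the whole running task to finish, while with finer-grained preemption it only waits until the next preemption point. The function names and the `slice_ms` spacing between preemption points are hypothetical, not measured hardware figures.

```python
def coarse_wait_ms(task_ms, elapsed_ms):
    """Coarse-grained: the request waits until the running task has finished."""
    return max(task_ms - elapsed_ms, 0)

def fine_wait_ms(task_ms, elapsed_ms, slice_ms):
    """Finer-grained: the request waits until the next preemption point.

    `slice_ms` is an assumed spacing between preemption points. The wait can
    never be longer than the time the task has left to run.
    """
    remaining = max(task_ms - elapsed_ms, 0)
    to_next_point = (-elapsed_ms) % slice_ms
    return min(remaining, to_next_point)

# A 16ms graphics shader, with a time warp request arriving 4ms into it.
print(coarse_wait_ms(16, 4))    # 12
print(fine_wait_ms(16, 4, 3))   # 2
```

In this model the coarse-grained case uses up most of a 20ms VR budget just waiting, before the time warp shader has even started.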

Quote:
What is Slow Context Switching?

Slow context switching has to do with preemption in VR. Do you remember that 16ms graphics shader? Well, if that were a Graphics task being executed and you needed to preempt it with an Asynchronous Time Warp shader, you would be switching from a Graphics context to a Compute context. Because graphics and compute jobs share resources within an SMM (the shared L1 cache, for example), not only would you have to wait until that graphics task finishes executing (as mentioned in the preemption section), you would also need to perform a full "flush" of the SMM. A flush means emptying the caches etc. in the SMM. This can incur latency of upwards of 1,000ms.

Quote:
Why is this different than GCN?

GCN doesn't have this issue because GCN has built-in hardware redundancy (and the power usage that goes along with it). Each CU has its own L1 cache, as do the Render Back-Ends, Texture Units etc. GCN can switch contexts in a single cycle. With GCN, you can execute tasks simultaneously (without a waiting period); the ACEs will also check for errors and re-issue work, if needed, to correct an error. You don't need to wait for one task to complete before you work on the next. So say, on GCN, a graphics shader task takes 16ms to execute: in that same 16ms you can execute many other tasks in parallel (like the compute and copy commands above). Your frame therefore ends up taking only 16ms, because you're executing several tasks in parallel. There's little to no latency or lag between executions because they execute like Hyper-Threading (hence the benefits to the LiquidVR implementation).
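The frame time argument boils down to a sum versus a maximum. Here is a tiny Python sketch of that arithmetic. It is an idealised model: it assumes the tasks are independent and that there are enough idle units for them to overlap fully. The 16ms graphics figure is from the example above; the 5ms compute and 3ms copy figures are made up for illustration.

```python
def serial_frame_ms(task_ms):
    """Tasks run one after another: the frame takes the sum of all task times."""
    return sum(task_ms.values())

def overlapped_frame_ms(task_ms):
    """Tasks overlap perfectly: the frame takes as long as the longest task."""
    return max(task_ms.values())

# 16ms graphics is from the post; the compute and copy times are assumed.
tasks = {"graphics": 16, "compute": 5, "copy": 3}

print(serial_frame_ms(tasks))      # 24
print(overlapped_frame_ms(tasks))  # 16
```

Real hardware will land somewhere between the two numbers, depending on how much the tasks compete for the same units.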

Quote:
What does this mean for gaming performance?

Developers need to be careful about how they program for Maxwell 2. If they aren't, far too much latency will be added to a frame. This is true even once nVIDIA fixes its driver issue. It's an architectural issue, not a software issue.

Quote:
It's architectural? How so?

Well, that's all in how a context switch is performed in hardware. To understand this, we need to understand something about the Compute Units found in every GCN-based graphics card since Tahiti. We already know that a Compute Unit can hold several threads in flight at the same time. The maximum number of threads in flight per CU is 2,560 (40 Wavefronts @ 64 Threads ea). GCN can, within one cycle, switch to and begin the execution of a different Wavefront. While that's happening, GCN can also be working on Graphics tasks. This allows the entire GCN GPU to execute and process both Graphics and Compute tasks simultaneously, with extremely low latency associated with switching between them. Idle CUs can be switched to a different task at a very fine-grained level, with a minimal performance penalty associated with the switch.
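The 2,560 figure is just 40 x 64, and it scales with the number of CUs. A quick Python check, where `threads_in_flight` is a hypothetical helper name and the 44 CU count is the R9 290X's, which is my addition rather than a figure from this post:

```python
WAVEFRONTS_PER_CU = 40        # from the post
THREADS_PER_WAVEFRONT = 64    # from the post

def threads_in_flight(cu_count=1):
    """Maximum threads in flight for a GPU with `cu_count` Compute Units."""
    return cu_count * WAVEFRONTS_PER_CU * THREADS_PER_WAVEFRONT

print(threads_in_flight())     # 2560, one CU
print(threads_in_flight(44))   # 112640, assuming 44 CUs (R9 290X)
```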


On top of what the Compute Units can do, the ACEs are also far more flexible than the AWSs (Asynchronous Warp Schedulers) found in Maxwell 2. Each ACE can synchronize with other ACEs in order to execute large workloads requiring dependencies (enforced by fences on the developer side). On top of this, if one ACE dispatches work to one particular Compute Unit (say CU1) and the result of the shader computed is required by another Compute Unit (say CU2), then the intermediate result can be placed into the LDS (Local Data Share Cache) or the GDS (Global Data Share Cache). The result is then pulled straight from the LDS/GDS by CU2 in order to complete the shader calculation. Each ACE can stop, start, pause or move intermediate data into memory. The image below explains just what ACEs can do: [image not preserved]

Quote:
Well that's GCN's architecture... What about Maxwell 2 and that Slow Context Switching thing?

In terms of a Context Switch, Maxwell 2 can switch between a Compute and a Graphics task in a coarse-grained fashion and pays a penalty for doing so (sometimes on the order of over a thousand cycles in worst-case scenarios). While Maxwell 2 excels at ordering tasks, based on priority, in a way which minimizes conflicts between Graphics and Compute tasks, it doesn't necessarily gain a boost in performance from doing so. This is why it remains to be seen whether Maxwell 2 will gain a performance boost from Asynchronous Compute. A developer would need to finely tune their code in order to derive any sort of performance benefit. From all the sources I've seen, Pascal will perhaps fix this problem (but wait and see, as it could just be speculation). There is also evidence that Pascal will not fix this issue, as you can see here: [image not preserved]
Source here: Page 23


nVIDIA may not have thought that the industry would jump on DX12 the way it is right now, or VR for that matter, but many AAA titles will be heading down the DX12 route in 2016. We'll even get a few titles in the remaining months of 2015. What's worse is that the majority of these titles have partnered with AMD. We can therefore be quite certain that Asynchronous Compute will be implemented, for AMD GCN at least, throughout the majority of DX12 titles to arrive in 2016.


TechReport: David Kanter discusses Asynchronous Compute [video not preserved]

Extra Goodies! [content not preserved]

There is no underestimating nVIDIA's capacity to fix their driver. It will be fixed. As for the performance derived from nVIDIA's solution? Best to wait and see.

Take care :)



**If you spot a mistake, either PM me or post it in the comment section. Let's get this whole issue as factual as possible**
Edited by Mahigan - 2/29/16 at 1:22am
post #2 of 252
Damn, nice work!

We need an "Over 9 000!" REP+ button :D

Tons of info there, this should stop people from getting confused. Despite all the negativity that's been thrown your way, some of us really appreciate all the research you've put into this.
Edited by GorillaSceptre - 9/6/15 at 1:29pm
post #3 of 252
Thread Starter 
Quote:
Originally Posted by GorillaSceptre View Post

Damn, nice work!

We need an "Over 9 000!" REP+ button :D

Tons of info there, this should stop people from getting confused. Despite all the negativity that's been thrown your way, some of us really appreciate all the research you've put into this.

Thank you, brother :)
post #4 of 252
Before everyone yells wrong section at you, just wanted to say well done. Even though I don't have nVidia, I'm finding this very interesting and informative.

+rep
Sabertooth (18 items)
CPU: Intel Core i7 4770k @4.5 GHz | Motherboard: Asus Z87 Pro | Graphics: PowerColor R9 290X BF4 Edition | RAM: Corsair/G.Skill 16GB @ 1600
Hard Drive: Sandisk Ultra Plus 256GB | Hard Drive: WD1002FAEX Caviar Black 1TB | Hard Drive: Samsung HD502HJ 500GB | Hard Drive: Samsung HD501LJ 500GB
Optical Drive: Lite-ON 24X CD/DVD Burner | Cooling: Antec H20 620 w/ AP-15 Push-Pull | OS: Windows 7 Pro | Monitor: Acer G276HL 27"
Keyboard: Das Keyboard Pro w/ MX Blues | Power: Corsair HX850 | Case: Cooler Master HAF 922 | Mouse: Microsoft IE 3.0
Mouse Pad: Steelseries Qck
post #5 of 252
From the developer's point of view, what is the difference between sending a compute task to the DX12 device that is doing graphics and sending it to a secondary DX12 device that is doing nothing? The way I see it, this is a perfect example of a simple DX12 Multiadapter solution: one does graphics, the secondary does compute tasks. You just can't ignore that option.
post #6 of 252
Fantastic, no, awesome post! +rep indeed ;) I'm looking forward to seeing how this all works out. That will decide whether I'll buy a second 970 next year or an AMD card. Or perhaps even something Pascal.
 
Naru (16 items)
CPU: Intel Core i5 750 (3.8 Ghz, 1.34v) | Motherboard: P7H55-M | Graphics: 7870 xt | Graphics: Gigabyte G1 970
RAM: Hyperx Fury Blue (2x4gb) | RAM: Hyperx Fury Blue (2x4gb) | Hard Drive: Sandisk 240 Plus SSD | Cooling: Gelid Tranquillo rev. 2
OS: Windows 7, 64 bit | Monitor: Dell U2515H | Keyboard: Kûl ES-87 (Cherry MX Black) | Power: Corsair 400w
Case: Fractal Design Define Mini | Mouse: Logitech G303 | Mouse Pad: Steelseries QCK Heavy | Audio: Audio Technica mth 50
CPU: i7 5820k | Motherboard: msi x99a | Graphics: EVGA 1080 Superclocked | RAM: Hyper X 4GB (16GB in total)
Hard Drive: Samsung Evo 850 500GB | Hard Drive: Kingston 256 GB | Hard Drive: Western Digital Green 1TB | Cooling: Noctua NH-D15
OS: Windows 10 | Monitor: Dell U2515H | Keyboard: Leopold FC750R (Cherry MX Blue) | Power: Corsair RM850i
Case: Fractal Design Define R5 | Mouse: Zowie ZA11 | Mouse Pad: Steelseries QCK Heavy | Audio: AKG K612 Pro
 
post #7 of 252
Quote:
Originally Posted by Nehabje View Post

Fantastic, no, awesome post! +rep indeed ;) I'm looking forward to seeing how this all works out. That will decide whether I'll buy a second 970 next year or an AMD card. Or perhaps even something Pascal.

Yup, solid research like this goes a long way in helping us make educated buying decisions. Something I used to count on tech websites for *rolls eyes*
post #8 of 252
THX a lot, Mahigan!

I'm also following this thread with wide-open eyes: http://www.overclock.net/t/1569897/various-ashes-of-the-singularity-dx12-benchmarks/0_20

+rep
Fury0s4 (22 items)
CPU: Intel Core i7 2700K | Motherboard: P8P67 DELUXE | Graphics: Sapphire R9 Fury Nitro OC+ | RAM: Crucial BLE2CP8G3D1869DE1TX0CEU
Hard Drive: Crucial M4 256 GB | Hard Drive: Western Digital Caviar Black 2000GB | Hard Drive: Western Digital Caviar Black 2000GB | Hard Drive: Western Digital VelociRaptor 600GB
Hard Drive: Samsung 840 evo | Optical Drive: Pioneer BDR-S09XLT | Cooling: EK Laing D5 | Cooling: Mo-ra 420 Pro
Cooling: EK D5 X-Res CSQ | OS: 7 x64 Ultimate | Monitor: Samsung XL 2370 | Keyboard: Logitech G19
Power: Enermax Platimax 850W | Case: HAF X | Mouse: Roccat XTD optical | Mouse Pad: Steelseries DEX
Audio: X-Fi Titanium HD | Other: Steelseries Arctis 5
CPU: Q9550S | Motherboard: ASUS P5G41T-M LX | Graphics: EVGA GTX 670 FTW | RAM: Kingston
RAM: Kingston | Hard Drive: WD3000HLFS | Hard Drive: WD2000 EADS | Hard Drive: Samsung SM843 MZ7TD240HAFV-000DA
Cooling: Scythe Big Shuriken 2 | OS: Windows 7 Prof x64 | Power: Antec EarthWatts 380w | Case: Phanteks Enthoo Evolv MATX
Mouse: Roccat Kone Pure Optical | Mouse Pad: Steelseries | Audio: onboard
post #9 of 252
I love you.
My Rig (10 items)
CPU: Intel i7-4770k @ 4.1Ghz | Motherboard: Asus z87-Plus | Graphics: XFX Double D r9 290X | RAM: G Skill Sniper 8GB @ 2133
OS: Windows 10 Pro | Monitor: 27 Inch 1440P PLS @ 110Hz | Keyboard: Mechanical Keyboard MX Red | Mouse: Steelseries Kana
Mouse Pad: Steelseries Mouse pad | Audio: Samson SR850 Headphones/Asus Xonar Sound card
post #10 of 252
Quote:
Originally Posted by axizor View Post

Before everyone yells wrong section at you, just wanted to say well done. Even though I don't have nVidia, I'm finding this very interesting and informative.

+rep

Ditto~