
·
Premium Member
Joined
·
1,904 Posts
Discussion Starter · #1 ·
*This is a work in progress and should be viewed as such*

It's been several weeks since the Ashes of the Singularity benchmarks hit the PC gaming scene and brought a new feature into our collective vocabulary. Throughout these past few weeks, there has been a lot of confusion and misinformation spreading throughout the web as a result of the rather complex nature of this topic. In an effort to combat this misinformation, @GorillaSceptre asked if a new thread could be started in order to condense a lot of the information which has been gathered on the topic. This thread is by no means final. It is set to change as new information comes to light. If you have new information, feel free to bring it to the attention of the Overclock.net community as a whole by commenting on this thread.

As things stand right now, Sept 6, 2015, we're waiting for a new driver from nVIDIA to rectify an issue which has inhibited the Maxwell 2 series from supporting Asynchronous Compute. Both Oxide, the developer of the Ashes of the Singularity video game and benchmark, and nVIDIA are working hard to implement a fix for this issue. While we wait for this fix, let's take an opportunity to break down some misconceptions.

nVIDIA HyperQ

nVIDIA implements Asynchronous Compute through what it calls "HyperQ". HyperQ is a hybrid solution which is part software scheduling and part hardware scheduling. While little information is available on its implementation in nVIDIA's Maxwell/Maxwell 2 architectures, we've been able to piece together how it works from various sources.

Now I'm certain many of you have seen this image floating around, as it pertains to Kepler's HyperQ implementation:
Or this updated one by Ext3h:

What the image shows is a series of CPU cores (indicated by blue squares) scheduling a series of tasks to a series of queues (indicated by black squares), which are then distributed to the various SMMs throughout the Maxwell 2 architecture. While this image is useful, it doesn't truly tell us what is going on. In order to figure out just what those black squares represent, we need to take a look at the nVIDIA Kepler white papers. Within these white papers we find HyperQ defined as follows:

Based on this diagram, we can infer that HyperQ works through two software components plus what we now speculate to be a hardwired ARM processor built into the Maxwell die. That processor is comprised of two hardware components which handle tasks before they are scheduled to the GPU:
  1. Grid Management Unit
  2. Work Distributor

So far the scheduling works like this:
  1. The developer marks up a command list
  2. This command list is sent to the nVIDIA software driver
  3. The nVIDIA software driver translates the commands into ISA
  4. The ISA commands are fed to a Grid Management Unit
  5. The Grid Management Unit transfers 32 pending grids (32 Compute, or 1 Graphics and 31 Compute) to the Work Distributor
  6. The Work Distributor transfers those 32 Compute (or 1 Graphics and 31 Compute) tasks to the SMMs, which are hardware components within the nVIDIA GPU.
  7. The components within the SMMs which receive the tasks are called Asynchronous Warp Schedulers, and they assign the tasks to available CUDA cores for processing.
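To ground steps 1 and 2 in something concrete, here's a minimal DX12 (C++) sketch of the developer-visible side of that pipeline: creating a graphics queue and a compute queue to feed command lists into. Everything from step 3 onward happens inside the driver and GPU behind ExecuteCommandLists(); the function and variable names here are just for illustration, not from any particular source.

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Developer-visible side of the pipeline above: one direct (graphics)
// queue and one compute queue. Steps 3-7 (driver translation, Grid
// Management Unit, Work Distributor, warp schedulers) all happen
// behind the scenes once command lists are submitted to these queues.
void CreateQueues(ID3D12Device* device,
                  ComPtr<ID3D12CommandQueue>& gfxQueue,
                  ComPtr<ID3D12CommandQueue>& computeQueue)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};

    desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;   // graphics + compute + copy
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&gfxQueue));

    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;  // compute + copy only
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue));
}
```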
Quote:
That's all fine and dandy, but why doesn't it work?
New information has come to light which stipulates that asynchronous compute + graphics does not work due to the lack of proper Resource Barrier support under HyperQ. What is a resource barrier? A resource barrier adds commands to convert a resource (or resources) from one type to another (such as a render target to a texture), and it prevents further command execution until the GPU has finished doing any work needed to convert the resources as requested. Without this feature, nVIDIA's HyperQ implementation cannot be used by DirectX 12 in order to execute Graphics and Compute commands in parallel. (However, Compute commands should still be able to execute in parallel with one another.)
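For reference, here is what the render-target-to-texture barrier described above looks like on the developer side in DX12 (C++). This is just the standard API usage, not a claim about how HyperQ handles it internally:

```cpp
#include <d3d12.h>

// Transition a texture that was just used as a render target so it can
// be sampled as a shader resource. Commands recorded after the barrier
// may not execute until the GPU has finished its pending work on the
// resource.
void RenderTargetToTexture(ID3D12GraphicsCommandList* cmdList,
                           ID3D12Resource* texture)
{
    D3D12_RESOURCE_BARRIER barrier = {};
    barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    barrier.Transition.pResource   = texture;
    barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
    barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;
    cmdList->ResourceBarrier(1, &barrier);
}
```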

An in-depth explanation can be found here: http://ext3h.makegames.de/DX12_Compute.html

Quote:
What is preemption?
David Kanter explains it here (starts around 1:18:00 into the video):

Preemption is important for VR. GCN has finer-grained preemption, which allows the ACEs to execute a compute task asynchronously, in parallel with other tasks and at a lower latency, whenever a VR headset movement is detected. The compute task being executed is an Asynchronous Time Warp: a compute shader which alters the previously rendered frame slightly in order to adjust for the new angle of the VR headset. Without this feature operating at a low latency (20ms or less), motion sickness can ensue.
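As a rough sketch of what this looks like at the API level: a VR runtime can put the time warp compute shader on its own high-priority compute queue. Whether that work actually preempts an in-flight draw quickly is then down to the hardware, which is the whole point of this section. (The function name below is mine, not from any SDK.)

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Hypothetical async time warp setup: a dedicated high-priority compute
// queue. On GCN an ACE can slot this work in at fine granularity; on
// Maxwell 2 it still waits for the running graphics work to finish
// (coarse-grained preemption), regardless of the priority flag.
ComPtr<ID3D12CommandQueue> CreateTimeWarpQueue(ID3D12Device* device)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type     = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH;

    ComPtr<ID3D12CommandQueue> queue;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue));
    return queue;
}
```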

What about Maxwell?

Say you have a graphics shader, in a frame, that is taking a particularly long time to complete. With Maxwell 2, you have to wait for this graphics shader to complete before you can preempt it. If a graphics shader takes 16ms, you have to wait until it completes before executing another graphics or compute command. This is because nVIDIA do not support finer-grained preemption; they support coarse-grained preemption. nVIDIA have made great strides with their VR implementation from Kepler to Maxwell, going from 57ms of latency down to 34ms, but they're still not there.

Quote:
What is Slow Context Switching?
Slow context switching has to do with preemption in VR. Do you remember that 16ms graphics shader? Well, if it were a graphics task being executed and you needed to preempt it with an Asynchronous Time Warp shader, you would be switching from a graphics context to a compute context. Due to the resources within an SMM that are shared between graphics and compute jobs, such as the shared L1 cache, not only would you have to wait until the end of the execution of that graphics task (as mentioned in the preemption section), but you'd also need to perform a full "flush" of the SMM. A flush means emptying the caches etc. in the SMM. This can incur latency of upwards of 1,000ms.

Quote:
Why is this different than GCN?
GCN doesn't have this issue because GCN has built-in hardware redundancy (and the power usage that goes along with it). Each CU has its own L1 cache, as do the render back-ends, texture units, etc. GCN can switch contexts in a single cycle. With GCN, you can execute tasks simultaneously (without a waiting period), and the ACEs will also check for errors and re-issue work, if needed, to correct an error. You don't need to wait for one task to complete before you work on the next. So say, on GCN, a graphics shader task takes 16ms to execute; in that same 16ms you can execute many other tasks in parallel (like the compute and copy commands above). Your frame therefore ends up taking only 16ms, because you're executing several tasks in parallel. There's little to no latency or lag between executions because they execute like Hyper-Threading does (hence the benefits to the LiquidVR implementation).
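In DX12 terms, the parallelism described above is expressed simply by submitting independent work to separate queues and letting the hardware overlap it. A minimal sketch, assuming the queues and command lists were created elsewhere:

```cpp
#include <d3d12.h>

// Independent graphics and compute work with no fence between the
// queues: on GCN the ACEs can run the compute dispatches alongside the
// 16ms graphics shader instead of after it, so the frame still costs
// roughly 16ms rather than the sum of both workloads.
void SubmitOverlapping(ID3D12CommandQueue* gfxQueue,
                       ID3D12CommandQueue* computeQueue,
                       ID3D12CommandList*  gfxWork,
                       ID3D12CommandList*  computeWork)
{
    gfxQueue->ExecuteCommandLists(1, &gfxWork);
    computeQueue->ExecuteCommandLists(1, &computeWork);  // free to overlap
}
```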

Quote:
What does this mean for gaming performance?
Developers need to be careful about how they program for Maxwell 2; if they aren't, then far too much latency will be added to a frame. This will remain true even once nVIDIA fix their driver issue. It's an architectural issue, not a software issue.

Quote:
It's architectural? How so?
Well, that's all in how a context switch is performed in hardware. In order to understand this, we need to understand something about the Compute Units found in every GCN-based graphics card since Tahiti. We already know that a Compute Unit can hold several threads, executing in flight, at the same time. The maximum number of threads executing concurrently, per CU, is 2,560 (40 wavefronts at 64 threads each). GCN can, within one cycle, switch to and begin the execution of a different wavefront. While that's happening, GCN can also be working on graphics tasks. This allows the entire GCN GPU to execute and process both graphics and compute tasks simultaneously, with extremely low latency associated with switching between them. Idle CUs can be switched to a different task at a very fine-grained level, with a minimal performance penalty associated with the switch.


On top of what the Compute Units can do, the ACEs are also far more flexible than the AWSs found in Maxwell 2. Each ACE can synchronize with other ACEs in order to execute large workloads requiring dependencies (enforced by fences on the developer side; see the sketch after this paragraph). On top of this, if one ACE dispatches work to one particular Compute Unit (say CU1) and the result of the computed shader is required by another Compute Unit (say CU2), then the intermediate result can be placed into the LDS (Local Data Share) or the GDS (Global Data Share). The result is then pulled straight from the LDS/GDS by CU2 in order to complete the shader calculation. Each ACE can stop, start, or pause a task, or move intermediate data into memory. The image below explains just what ACEs can do:
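Those "fences on the developer side" map directly onto DX12's cross-queue fence API. Here's a minimal sketch, assuming the fence, queues, and command lists already exist, of how a developer expresses a dependency that the ACEs then enforce in hardware:

```cpp
#include <d3d12.h>

// Cross-queue dependency of the kind the ACEs enforce: the graphics
// queue must not consume the compute result until the compute queue
// has signalled the fence. The Wait() is a GPU-side wait, so the CPU
// never stalls.
void SubmitWithDependency(ID3D12CommandQueue* computeQueue,
                          ID3D12CommandQueue* gfxQueue,
                          ID3D12CommandList*  computeWork,
                          ID3D12CommandList*  gfxWork,
                          ID3D12Fence*        fence,
                          UINT64              fenceValue)
{
    computeQueue->ExecuteCommandLists(1, &computeWork);
    computeQueue->Signal(fence, fenceValue);   // mark compute as done

    gfxQueue->Wait(fence, fenceValue);         // GPU waits for the signal
    gfxQueue->ExecuteCommandLists(1, &gfxWork);
}
```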

Quote:
Well that's GCN's architecture... What about Maxwell 2 and that Slow Context Switching thing?
In terms of a context switch, Maxwell 2 can switch between a Compute and a Graphics task in a coarse-grained fashion and pays a penalty (sometimes on the order of over a thousand cycles in worst-case scenarios) for doing so. While Maxwell 2 excels at ordering tasks, based on priority, in a way which minimizes conflicts between Graphics and Compute tasks, Maxwell 2 doesn't necessarily gain a boost in performance from doing so. This is why it remains to be seen whether Maxwell 2 will gain a performance boost from Asynchronous Compute. A developer would need to finely tune his/her code in order to derive any sort of performance benefit. From all the sources I've seen, Pascal will perhaps fix this problem (but wait and see, as it could just be speculation). There is also evidence that Pascal will not fix this issue, as you can see here:
Source here: Page 23

nVIDIA may not have thought that the industry would jump on DX12 the way it is doing right now, or on VR for that matter, but many AAA titles will be heading down the DX12 route in 2016. We'll even get a few titles in the remaining months of 2015. What's worse is that the majority of these titles have partnered with AMD. We can therefore be quite certain that Asynchronous Compute will be implemented, for AMD GCN at least, throughout the majority of DX12 titles arriving in 2016.

TechReport: David Kanter discusses Asynchronous Compute

Extra Goodies!

There is no underestimating nVIDIA's capacity to fix their driver. It will be fixed. As for the performance derived from nVIDIA's solution? Best to wait and see.

Take Care


**If you spot a mistake, either PM me or post it in the comment section. Let's get this whole issue as factual as possible**
 

·
Premium Member
Joined
·
2,373 Posts
Damn, nice work


We need an "Over 9 000!" REP+ button


Tons of info there, this should stop people from getting confused. Despite all the negativity that's been thrown your way, some of us really appreciate all the research you've put into this.
 

·
Premium Member
Joined
·
1,904 Posts
Discussion Starter · #3 ·
Quote:
Originally Posted by GorillaSceptre View Post

Damn, nice work


We need an "Over 9 000!" REP+ button


Tons of info there, this should stop people from getting confused. Despite all the negativity that's been thrown your way, some of us really appreciate all the research you've put into this.
Thank you, brother
 
  • Rep+
Reactions: sages and doritos93

·
Registered
Joined
·
1,468 Posts
Before everyone yells "wrong section" at you, I just wanted to say well done. Even though I don't have nVIDIA, I'm finding this very interesting and informative.

+rep
 

·
Registered
Joined
·
8 Posts
From the developer's point of view, what's the difference between sending a compute task to the DX12 device that is doing graphics and sending it to a secondary DX12 device that is doing nothing? The way I see it, this is a perfect example of a simple DX12 multiadapter solution: one does graphics, the secondary does compute tasks. You just can't ignore that option.
 

·
Registered
Joined
·
120 Posts
Fantastic, no, awesome post! +rep indeed
I'm looking forward to seeing how this all works out. That will decide whether I'll buy a second 970 next year or an AMD card. Or perhaps even something Pascal.
 
  • Rep+
Reactions: GorillaSceptre

·
Premium Member
Joined
·
2,373 Posts
Quote:
Originally Posted by Nehabje View Post

Fantastic, no, awesome post! +rep indeed
I'm looking forward to seeing how this all works out. That will decide whether I'll buy a second 970 next year or an AMD card. Or perhaps even something Pascal.
Yup, solid research like this goes a long way in helping us make educated buying decisions. Something I used to count on tech websites for...
 
  • Rep+
Reactions: sawe


·
Not a Fan
Joined
·
2,656 Posts
Aren't DirectX 12 and Asynchronous Compute two different things? There seems to be a bit of confusion, with posters declaring that only AMD can implement DX12 since only their hardware architecture supports ASC.
 

·
Premium Member
Joined
·
1,904 Posts
Discussion Starter · #12 ·
Quote:
Originally Posted by GnarlyCharlie View Post

Aren't DirectX 12 and Asynchronous Compute two different things? There seems to be a bit of confusion, posters declaring that only AMD can implement DX12 since only their hardware architecture supports ASC.
Asynchronous Compute is a feature now made available by DirectX 12. It serves as a form of optimization, in order to derive better performance out of a GPU, by executing tasks in parallel... keeping the available Graphics and Compute resources fed.

GCN was built, from the ground up, with this sort of usage scenario in mind. GCN wasn't too great at DX11, though not horrible, because most of its capabilities were not exposed through the DX11 API.

Maxwell 2 is a step in the direction of a more GCN-like architecture. We can only assume that Pascal will take an even greater step in that direction.

Current DX11 performance will remain the same, with nVIDIA's Maxwell 2 reigning as top dog. Under DX12, however, things will likely be quite different. We should see AMD and nVIDIA trading blows at first, at least until Pascal and Greenland arrive... then who knows.
 
  • Rep+
Reactions: doritos93

·
Premium Member
Joined
·
6,192 Posts
Glad to see this thread!


A lot of people don't understand Asynchronous Compute or DX12 very well, so this should be really helpful for them.

Before this whole DX12 thing blew up, I read a lot about GCN's Asynchronous Compute capabilities and I understand them very well, so I'm not unfamiliar with the subject at all.

In fact, I'd also like to help other people understand Asynchronous Compute and DX12.
 

·
Premium Member
Joined
·
6,440 Posts
Quote:
Originally Posted by Mahigan View Post

Asynchronous Compute is a feature now made available by DirectX 12. It serves as a form of optimization, in order to derive better performance out of a GPU, by executing tasks in parallel... keeping the available Graphics and Compute resources fed.

GCN was built, from the ground up, with this sort of usage scenario in mind. GCN wasn't too great at DX11, though not horrible, because most of its capabilities were not exposed through the DX11 API.

Maxwell 2 is a step in the direction of a more GCN-like architecture. We can only assume that Pascal will take an even greater step in that direction.

Current DX11 performance will remain the same, with nVIDIA's Maxwell 2 reigning as top dog. Under DX12, however, things will likely be quite different. We should see AMD and nVIDIA trading blows at first, at least until Pascal and Greenland arrive... then who knows.
How interesting. In this context, GCN was a better architecture than Kepler or Maxwell, but because DX11 does not natively support a key feature of the architecture, its potential has never been fully realized.

That's quite eye-opening.
 

·
Registered
Joined
·
653 Posts
Clears up a lot of misinformation, no doubt.

Personally, I see Nvidia's driver fix as being like virtual 120Hz on TVs (doubling of the frames) versus a native 120Hz monitor. Yes, it'll work, but it'll only get you so far.

Nevertheless, nice post
 

·
Banned
Joined
·
1,258 Posts
Quote:
Originally Posted by Shivansps View Post

From the developer's point of view, what's the difference between sending a compute task to the DX12 device that is doing graphics and sending it to a secondary DX12 device that is doing nothing? The way I see it, this is a perfect example of a simple DX12 multiadapter solution: one does graphics, the secondary does compute tasks. You just can't ignore that option.
Is it possible to use 1 GPU for graphics and 1 GPU for compute WHILE AT THE SAME TIME rendering in SFR (split frame rendering)?
 

·
Registered
Joined
·
2,280 Posts
Quote:
Originally Posted by mtcn77 View Post

Does this qualify for an editorial? We should have more of these in-house analyses.
My 2 cents.


I don't think there is any other site that can provide the breadth and exposure for talented independents to showcase their work in this rather niche "PC gaming"/"enthusiast" journalism segment.

There can be a lot of win-win opportunities, especially for the consumers, should this be expanded further.
 