Originally Posted by Kollock
Async compute is currently forcibly disabled on public builds of Ashes for NV hardware. Whatever performance changes you are seeing driver to driver doesn't have anything to do with async compute.
I can confirm that the latest shipping DX12 drivers from NV do support async compute. You'd have to ask NV how specifically it is implemented.
Now this I've got to see... Will Maxwell execute Graphics + Compute commands concurrently, or will Asynchronous Compute simply mean that there is no defined order in which compute commands are executed?
AMD appear to stress that performance gains are best achieved through concurrent execution of Graphics and Compute commands, whereas in the computer science world "asynchronous" doesn't by itself mean concurrent.
Currently the prevailing conclusion (sourced from across the web) has been:
Nvidia executes Asynchronous compute + graphics code synchronously under DX12. Nvidia supports Async Compute through Hyper-Q under CUDA, but Hyper-Q doesn't support the additional wait conditions of barriers (a DX12 requirement). So no, there is no Async Compute + Graphics for Fermi, Kepler or Maxwell under DX12 currently.
Let me explain. With the DX12 API, Microsoft have introduced additional queues into 3D apps:
Graphics queue for primary rendering tasks
Compute queue for supporting GPU tasks (lighting, post processing, physics etc)
Copy queue for simple data transfers
Command lists within a single queue are still executed synchronously (in submission order), while those in different queues can execute asynchronously (e.g. concurrently and in parallel). What does asynchronous mean? It means that the order of execution of each queue in relation to the others is not defined. Workloads submitted to these queues may start or complete in a different order than they were issued. Fences and barriers only apply to their respective queue. When the workload in one queue is blocked by a fence, the other queues can still be running and taking on work for execution. If synchronisation points between two or more queues are required, they can be defined and enforced using fences.
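To make the fence behaviour a bit more concrete, here's a rough CPU-side toy model in Python. It is not DX12 code: each "queue" is just a thread that runs its commands in order, and the `Fence` class, `run_queue`, `simulate` and the pass names (`shadow_pass`, `lighting` and so on) are all made up for illustration. The only point is that ordering is guaranteed inside a queue and across a fence, and nowhere else.

```python
import threading


class Fence:
    """Toy stand-in for a fence: a counter that can be signaled and waited on."""

    def __init__(self):
        self._value = 0
        self._cond = threading.Condition()

    def signal(self, value):
        # Raise the fence value and wake up any queue waiting on it.
        with self._cond:
            self._value = max(self._value, value)
            self._cond.notify_all()

    def wait(self, value):
        # Block the calling queue until the fence reaches `value`.
        with self._cond:
            self._cond.wait_for(lambda: self._value >= value)


def run_queue(name, commands, log, log_lock):
    """Run one queue's commands strictly in submission order."""
    for command in commands:
        kind = command[0]
        if kind == "work":
            with log_lock:
                log.append((name, command[1]))
        elif kind == "signal":
            command[1].signal(command[2])
        elif kind == "wait":
            command[1].wait(command[2])


def simulate():
    log, log_lock = [], threading.Lock()
    fence = Fence()
    queues = {
        # Graphics may not composite until compute has finished lighting.
        "graphics": [("work", "shadow_pass"), ("wait", fence, 1), ("work", "composite")],
        "compute": [("work", "lighting"), ("signal", fence, 1), ("work", "physics")],
        # The copy queue has no fence, so it can land anywhere in the log.
        "copy": [("work", "upload_textures")],
    }
    threads = [
        threading.Thread(target=run_queue, args=(name, cmds, log, log_lock))
        for name, cmds in queues.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Run `simulate()` a few times and the interleaving of the three queues can change from run to run, but `lighting` always shows up before `composite`, because that is the one cross-queue dependency the fence enforces.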
Similar features have been available under OpenCL and CUDA for some time. The fences and signals under DX12 map directly to a subset of the event system under OpenCL and CUDA. Under DX12, however, barriers have additional wait conditions. These wait conditions are not supported by either OpenCL or CUDA. Instead, a write-through of dirty buffers needs to be explicitly requested. Therefore Asynchronous Compute + Graphics under DX12, though similar to Asynchronous compute under OpenCL and CUDA, cannot simply reuse that path and requires explicit support of its own.
These new queues are also different from the classic Graphics queue. While the classic Graphics queue can be fed with compute commands, copy commands and graphics commands (draw calls), the new compute queue accepts only compute and copy commands, and the copy queue accepts only copy commands. Hence their names.
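If it helps, the queue restrictions boil down to a small lookup table. The sketch below is just illustrative Python with made-up names (`ACCEPTED_COMMANDS`, `can_submit`, `queues_for`); it follows the D3D12 command list types as I understand them, where the compute queue takes copy commands as well as compute commands and the copy queue takes copies only.

```python
# Which command kinds each queue type accepts (toy model, not the real API).
ACCEPTED_COMMANDS = {
    "graphics": {"draw", "compute", "copy"},
    "compute": {"compute", "copy"},
    "copy": {"copy"},
}


def can_submit(queue, command):
    """Return True if a command of this kind may go on the given queue."""
    return command in ACCEPTED_COMMANDS[queue]


def queues_for(command):
    """List every queue type that accepts this kind of command."""
    return sorted(q for q, cmds in ACCEPTED_COMMANDS.items() if command in cmds)
```

So a draw call can only ever go to the graphics queue, while a copy can be placed on any of the three.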
For Maxwell, Compute and Graphics can't currently be active at the same time under DX12. It is theorized that this is because there is only a single functional unit (the Command Processor) rather than a Command Processor plus ACEs. Copy commands, however, can run concurrently with Graphics and Compute commands thanks to the inclusion of more than one DMA engine in Maxwell. We see this when looking at how Fable Legends executes the various queues. What Nvidia would need, in order to execute graphics and compute commands asynchronously, is to add support for the additional barrier wait conditions to their Hyper-Q implementation. Why? This would expose the additional execution unit under Hyper-Q. The Hyper-Q interface used for CUDA's concurrent execution supports Asynchronous compute, as we see in DX11 + PhysX titles (the Batman Arkham series, for example). Hyper-Q is, however, not compatible with the DX12 API as of the time of writing (for the reasons mentioned above). If it were compatible, there would be a hardware limit of 31 asynchronous compute queues and 1 Graphics queue (as Anandtech reported).
So all that to say: if you fence often, you can get Nvidia hardware to run the Asynchronous Compute + Graphics code synchronously. You also have to make sure you use large batches of short-running shaders, because long-running shaders would complicate scheduling on Nvidia hardware and introduce latency. Oxide, because they were using AMD-supplied code, ran into this problem in Ashes of the Singularity (according to posts over at overclock.net).
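A back-of-the-envelope way to see what frequent fencing does to the timeline: in the toy Python model below (hypothetical function names, invented millisecond figures, equal numbers of graphics and compute batches assumed), the unfenced case lets the two queues overlap completely, while the fenced case makes every batch wait for the previous batch on the other queue. It says nothing about real hardware timings; it only shows why per-batch fencing turns a concurrent schedule into a serial one.

```python
def makespan_unfenced(graphics_ms, compute_ms):
    """Both queues overlap fully: the frame takes as long as the slower queue."""
    return max(sum(graphics_ms), sum(compute_ms))


def makespan_fenced_every_batch(graphics_ms, compute_ms):
    """Ping-pong fencing: compute batch i waits for graphics batch i,
    and graphics batch i+1 waits for compute batch i, so nothing overlaps."""
    total = 0
    for g, c in zip(graphics_ms, compute_ms):
        total += g  # graphics batch runs while compute sits on the fence
        total += c  # compute batch runs while graphics sits on the fence
    return total
```

With invented batch times of `[4, 4]` ms for graphics and `[2, 2]` ms for compute, the unfenced frame takes 8 ms and the fenced one takes 12 ms, which is simply the sum of everything.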
Since AMD are working with IO on the Hitman DX12 path, you can be sure that the DX12 path will be AMD optimized. That means less fencing and longer-running shaders.
For Hitman, Nvidia basically have to work with IO as well, in order to add a vendor-ID-specific DX12 path (like we saw Oxide do). It's probably not worth it, seeing as Nvidia have little to gain from DX12 over DX11. AMD, however, will likely suffer from a CPU bottleneck under Hitman DX11 (as they do under Rise of the Tomb Raider DX11). AMD have a lot to gain from working with developers on coding and optimizing a DX12 path.
So, to summarize:
Nvidia do not support Async Compute + Graphics under DX12 at this time, and perhaps never will. Hitman's DX12 path may run like crap on Nvidia hardware unless Nvidia convince IO Interactive to code a vendor-ID-specific path and supply IO with optimized short-running shaders. Basically, the same thing that Nvidia did with Oxide for Ashes of the Singularity (if memory serves me right). Since Nvidia have little to gain from moving from DX11 to DX12, it's best for them not to waste time and money helping IO code a vendor-ID-specific path.
AMD will suffer performance issues when running the Hitman DX11 path, due to a CPU bottleneck brought on by the lack of support for DX11 multi-threaded command lists. AMD have everything to gain in assisting IO Interactive with the implementation of a DX12 path. Asynchronous compute is just an added bonus on top of the removal of the CPU bottleneck which plagues AMD GCN in DX11 titles.