Originally Posted by Xuper
So Nvidia was able to use async compute with CUDA (PhysX) but couldn't with pure DX12? If it's true that Nvidia was able to use async, can I then say that async compute is done at the hardware level?
I'm still confused. Is it possible to use async compute without context switching, or is it still required?
Context switching can't be avoided in full.
It happens on multiple levels: once in the scheduling frontend, where the GPU as a whole switches between 3D and pure compute mode, and once again at the SMM/SMX level, where each individual unit needs to switch as well.
The first mode switch could be avoided if it were just working correctly. The second one can't yet, not with Maxwell at least.
Originally Posted by Xuper
I read somewhere that "async isn't a single command, it's multiple commands spread across multiple command queues"
One step further and it would be complete: async also requires that the queues can progress freely in relation to each other. There is no "happens before" or "happens after" relation unless it is explicitly modelled with signals and fences; the execution order is only fixed inside each queue.
But that alone doesn't get you much, except for some freedom in the execution schedule. To take full advantage of that freedom, you need to be responsive enough to use every possible idle phase for interleaved execution. If you aren't, your possible gains are limited solely to reducing context switches, nothing else.
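To make the queue/fence relationship concrete, here is a minimal D3D12 sketch (mine, not from any of the posts above) of a direct queue and a compute queue whose only cross-queue ordering comes from an explicit fence. The device and the pre-recorded command lists (gfxList, computeList) are assumed to exist; those names are placeholders.

```cpp
// Minimal sketch: explicit cross-queue ordering in D3D12.
// Assumes `device`, `gfxList` and `computeList` were created/recorded elsewhere.
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void SubmitWithExplicitOrdering(ID3D12Device* device,
                                ID3D12CommandList* gfxList,
                                ID3D12CommandList* computeList)
{
    // One queue per engine type: the 3D queue and a separate compute queue.
    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
    gfxDesc.Type  = D3D12_COMMAND_LIST_TYPE_DIRECT;
    D3D12_COMMAND_QUEUE_DESC compDesc = {};
    compDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;

    ComPtr<ID3D12CommandQueue> gfxQueue, computeQueue;
    device->CreateCommandQueue(&gfxDesc,  IID_PPV_ARGS(&gfxQueue));
    device->CreateCommandQueue(&compDesc, IID_PPV_ARGS(&computeQueue));

    // A fence is the only way to establish a "happens before" relation
    // between the two queues; without it they are free to overlap.
    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    // Kick off graphics work; inside this queue the order is fixed.
    ID3D12CommandList* gfxLists[] = { gfxList };
    gfxQueue->ExecuteCommandLists(1, gfxLists);
    gfxQueue->Signal(fence.Get(), 1);       // mark: graphics batch submitted

    // The compute queue only stalls because we ask it to. Drop the Wait()
    // and the compute work may run concurrently with the graphics batch.
    computeQueue->Wait(fence.Get(), 1);     // explicit dependency
    ID3D12CommandList* compLists[] = { computeList };
    computeQueue->ExecuteCommandLists(1, compLists);
}
```

Without the Wait() call the two queues have no ordering relation at all, which is exactly the freedom the scheduler needs to fill idle phases with work from the other queue.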
Originally Posted by Mahigan
Software, but the context switch is processed by the SMM/SMX, which requires a complete flush, meaning that all work the SMM/SMX was working on is lost when a high-priority/preemption request is sent, which is worse than I had originally thought. This is what results in the added latency.
A flush doesn't necessarily imply data loss, as it can simply wait until the SMM has drained its current queue, and this only applies to mixing compute and 3D kernels. It's really just a minor issue compared to the rest.
Maxwell doesn't even have preemption fine-grained enough to evict running jobs in any way; it can only preempt jobs which haven't started execution yet. That's what NV meant by "preemption at draw call boundaries" in this year's GDC talk.
It's not losing any progress either. In fact, it looks like Nvidia is abusing regular signals to communicate to the GPU how far the work in each queue has advanced. You can even see these signals in GPUView, in the device context: you will notice an additional fence placed on all command buffers whose value correlates with the length of that specific command buffer.
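Purely as an illustration of the mechanism (an assumption on my part, not Nvidia's driver code), a per-queue progress fence like the one visible in GPUView could be expressed at the API level roughly like this; `queue` and `batches` are placeholder names for an existing queue and a set of pre-recorded command lists.

```cpp
// Sketch of a "progress fence": after every submission the fence value is
// bumped, so the completed value tells how far the queue has advanced
// without ever having to preempt work that is already running.
#include <d3d12.h>
#include <wrl/client.h>
#include <vector>
using Microsoft::WRL::ComPtr;

void SubmitWithProgressFence(ID3D12Device* device,
                             ID3D12CommandQueue* queue,
                             const std::vector<ID3D12CommandList*>& batches)
{
    ComPtr<ID3D12Fence> progress;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&progress));

    UINT64 submitted = 0;
    for (ID3D12CommandList* cl : batches)
    {
        ID3D12CommandList* lists[] = { cl };
        queue->ExecuteCommandLists(1, lists);
        // Signal a monotonically increasing value after each batch;
        // the last completed value is the queue's progress marker.
        queue->Signal(progress.Get(), ++submitted);
    }

    // At any time, GetCompletedValue() reports how many batches have retired.
    UINT64 done = progress->GetCompletedValue();
    (void)done; // e.g. compare against `submitted` to see outstanding work
}
```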
Well, actual "preemption" with data loss does happen, whenever the driver decides to commit suicide because something timed out. That even causes duplicated work. But it's by no means the regular case; it's more like a reset button for the GPU.