Originally Posted by rosade
Pointing out mistakes in the post
1. In the async feature table comparing GCN to NVIDIA architectures (conveniently excludes Pascal)
The table was made before Pascal was released.
2. Pascal supports much more than pre-emption; it actually supports async compute via its dynamic load balancer
How does this work?
a) There are two Tasks A and B running at the same time.
b) Task A completes before Task B
c) The dynamic load balancer ensures Task B takes over the resources of Task A
d) This results in usage of "Idle cycles", thus decreasing latency
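To make steps a) to d) concrete, here is a rough toy model, not anything taken from NVIDIA's documentation. It assumes the GPU's execution units start out split evenly between the two tasks, and that with dynamic load balancing the still-running task picks up all units as soon as the other one finishes. The function name, the unit count and the work sizes are all made up for illustration.

```python
def finish_time(work_a, work_b, units, rebalance):
    """Time until both tasks are done.

    Hypothetical model: each task starts with half of the units.
    If rebalance is True, the task still running inherits every unit
    once the other task completes (steps b and c in the list above).
    """
    half = units / 2
    t_a = work_a / half          # time for A on its half of the GPU
    t_b = work_b / half          # time for B on its half of the GPU
    if not rebalance:
        # Static split: the freed half simply idles until the end.
        return max(t_a, t_b)
    first = min(t_a, t_b)        # moment the shorter task completes
    remaining = max(work_a, work_b) - first * half
    # The longer task now runs on all units instead of half of them.
    return first + remaining / units


static = finish_time(work_a=4, work_b=12, units=8, rebalance=False)
dynamic = finish_time(work_a=4, work_b=12, units=8, rebalance=True)
print(static, dynamic)
```

In this toy setup the static split leaves half the units idle after Task A finishes, while the rebalanced run uses them for Task B, which is the "idle cycles" point in d).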
Exactly. In GPUView, the AotS test shows asynchronous compute actually happening: graphics and compute queues execute at the same time, with more compute queues. Meanwhile, in 3DMark Time Spy, whenever a graphics/3D queue ends there is a gap/pause (fences/barriers), and only then comes a compute queue.
That is the same way pre-emption works,
and using fences wastes some time/resources idling while switching contexts.
- Minimize the use of barriers and fences
- We have seen redundant barriers and associated wait for idle operations as a major performance problem for DX11 to DX12 ports
- The DX11 driver is doing a great job of reducing barriers – now under DX12 you need to do it
- Any barrier or fence can limit parallelism
- Don’t insert redundant barriers
- This limits parallelism
- This doesn’t allow the driver to pick the worst case of a set of barriers
- Don’t expect fences to trigger signals/advance at a finer granularity than once per ExecuteCommandLists call.
- To reiterate: We have seen redundant and/or overly conservative barrier flags and their associated wait for idle operations as a major performance problem for DX11 to DX12 ports.
- Submit work in parallel and evenly across several threads/cores to multiple command lists
- Recording commands is a CPU intensive operation and no driver threads come to the rescue
- Command lists are not free threaded so parallel work submission means submitting to multiple command lists
- Be aware of the fact that there is a cost associated with setup and reset of a command list
- You still need a reasonable number of command lists for efficient parallel work submission
- Fences force the splitting of command lists for various reasons (multiple command queues, picking up the results of queries)
- Try to aim at a reasonable number of command lists in the range of 15-30 or below. Try to bundle those CLs into 5-10 ExecuteCommandLists() calls per frame.
- Make 3D and compute sections long enough.
- Switching between compute and 3D queues results in a full flush of all pipelines. The GPU should have spent enough time in one mode to justify the penalty for switching.
- Beware that there is no active preemption, a long running shader in either engine will stall the transition
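The last point in the list (the cost of switching between 3D and compute) can be sketched with a simple cost model. This is only an illustration under assumptions: every transition between engines is charged a fixed flush penalty, and the durations and the penalty value are invented numbers, not measurements from any GPU.

```python
from itertools import groupby


def frame_time(segments, flush_penalty):
    """Total time for a list of (engine, duration) segments in submission order.

    Hypothetical model: work time adds up, and every change of engine
    between consecutive segments costs one pipeline flush.
    """
    work = sum(duration for _, duration in segments)
    # Collapse consecutive segments on the same engine into one run.
    runs = [engine for engine, _ in groupby(segments, key=lambda s: s[0])]
    switches = max(len(runs) - 1, 0)
    return work + switches * flush_penalty


# Same amount of work, submitted in two different orders.
interleaved = [("3d", 1.0), ("compute", 0.5), ("3d", 1.0), ("compute", 0.5)]
grouped = [("3d", 1.0), ("3d", 1.0), ("compute", 0.5), ("compute", 0.5)]

print(frame_time(interleaved, flush_penalty=0.25))
print(frame_time(grouped, flush_penalty=0.25))
```

With short, interleaved sections the penalty is paid three times; with longer grouped sections it is paid once, which is the reasoning behind making each section long enough to justify the switch.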
Originally Posted by rosade
Sound familiar? Yes, that is basic async compute right there
which this benchmark isn't doing.
If you read the DOOM Vulkan patch notes, they clearly say async compute support for Pascal is still in the works, and even without it Pascal cards already see an 8-12% gain
So for sure it won't be using asynchronous compute + graphics in parallel.
Hitman and AotS are AMD Gaming Evolved titles; you cannot use them for a fair comparison, just like RotTR is a GameWorks title. And Quantum Break is the worst PC port ever, and Total War is a bad title regardless of platform.
Why not, if they have a different path for each GPU vendor? Don't they?
Which implies there are so far only two neutral DX12/Vulkan benchmarks
a) DOOM (where Pascal async is going to be enabled soon)
b) Time Spy
DOOM is not a benchmark.
So I guess there is no conspiracy here, and if people are talking about favoring one architecture, then they should have ignored Hitman and AotS long ago.
Well, this happened with AotS.
Why can't 3DMark be biased if a neutral dev couldn't use this feature?
Edited by PontiacGTX - 7/19/16 at 5:57pm