I am not sure if people have asked about fences or not. The fences are the basis of my argument. It does not matter if the driver switches Async Compute on or off. 3DMark has two paths. One with Async Compute and another without. When you toggle Async On... the Async path is used. When you toggle Async Off... the non-Async path is used.
The argument has to do with the fences in the Async path. Those fences would still incur a performance penalty regardless of what the nVIDIA Maxwell driver is doing. Those fences cause tiny delays... the more fences the more delays are introduced. The more delays... the lower your FPS due to the introduced latency.
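In D3D12 terms, the sync point I am talking about is a cross-queue fence. A minimal sketch of what that looks like (a Windows/D3D12 fragment, not a complete program; graphicsQueue, computeQueue, fence, fenceValue and the command list arrays are all assumed to already exist):

```cpp
// Hypothetical fragment: the graphics queue signals a fence when its
// pass is done, and the compute queue waits on it before starting the
// async compute work that depends on that pass.
graphicsQueue->ExecuteCommandLists(1, gbufferLists);
graphicsQueue->Signal(fence.Get(), ++fenceValue);  // GPU-side signal

// GPU-side wait: the compute queue stalls here until the fence reaches
// fenceValue. Every one of these is a synchronization point, and each
// one adds a small delay to the frame.
computeQueue->Wait(fence.Get(), fenceValue);
computeQueue->ExecuteCommandLists(1, asyncComputeLists);
```

The more of these Signal/Wait pairs per frame, the more little stalls accumulate, which is exactly why they should show up as a cost on hardware that cannot overlap the work.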
What I do not see is the performance hit associated with those delays when running Time Spy on an nVIDIA Maxwell based GPU. Those fences should be negatively affecting the performance of the Maxwell GPU, since they synchronize Graphics and Compute tasks. The only way this would not be the case, as I have already mentioned, is if the application was coded to favor nVIDIA hardware (utilizing short-running shaders and very few asynchronous Compute + Graphics workloads).
Since 3DMark apparently uses only a single path for both AMD and nVIDIA hardware, if that path was coded to favor nVIDIA then the benchmark is not objective. Why not? Because AMD hardware could take on even more asynchronous Compute + Graphics workloads and gain even more performance than is currently being shown. If a separate path were coded for AMD, more tasks could be "marked" to run in parallel (more Graphics + Compute jobs), reducing per-frame latency and resulting in higher FPS.
The main reason both AMD and nVIDIA (as well as Microsoft) stated that separate paths are the way to go for DX12 applications is specifically due to this issue. If a developer codes in favor of nVIDIA then AMD will suffer bad performance and vice versa. In order to avert this... both architectures ought to have their own optimized paths.
This should be mentioned to that 3DMark rep on steam.
And nVIDIA itself tries to show that this breaks their DX12 performance:
Do's:
- Minimize the use of barriers and fences
- We have seen redundant barriers and associated wait for idle operations as a major performance problem for DX11 to DX12 ports
- The DX11 driver is doing a great job of reducing barriers – now under DX12 you need to do it
- Any barrier or fence can limit parallelism
- Make sure to always use the minimum set of resource usage flags
- Stay away from using D3D12_RESOURCE_STATE_GENERIC_READ unless you really need every single flag that is set in this combination of flags
- Redundant flags may trigger redundant flushes and stalls and slow down your game unnecessarily
- To reiterate: We have seen redundant and/or overly conservative barrier flags and their associated wait for idle operations as a major performance problem for DX11 to DX12 ports.
- Specify the minimum set of targets in ID3D12GraphicsCommandList::ResourceBarrier
- Adding false dependencies adds redundancy
- Group barriers in one call to ID3D12GraphicsCommandList::ResourceBarrier
- This way the worst case can be picked instead of sequentially going through all barriers
- Use split barriers when possible
- Use the _BEGIN_ONLY/_END_ONLY flags
- This helps the driver do a more efficient job
- Do use fences to signal events/advance across calls to ExecuteCommandLists
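For example, the "group barriers in one call" point would look roughly like this (a sketch using the CD3DX12_RESOURCE_BARRIER helper from d3dx12.h; the resources are made up):

```cpp
// Hypothetical fragment: three transitions handed to the driver in ONE
// ResourceBarrier call, so it can resolve them together instead of
// potentially flushing three separate times.
D3D12_RESOURCE_BARRIER barriers[3] = {
    CD3DX12_RESOURCE_BARRIER::Transition(shadowMap,
        D3D12_RESOURCE_STATE_DEPTH_WRITE,
        D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE),
    CD3DX12_RESOURCE_BARRIER::Transition(gbufferAlbedo,
        D3D12_RESOURCE_STATE_RENDER_TARGET,
        D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE),
    CD3DX12_RESOURCE_BARRIER::Transition(lightBuffer,
        D3D12_RESOURCE_STATE_UNORDERED_ACCESS,
        D3D12_RESOURCE_STATE_NON_PIXEL_SHADER_RESOURCE),
};
cmdList->ResourceBarrier(3, barriers);  // one call, not three
```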
Don'ts:
- Don’t insert redundant barriers
- This limits parallelism
- A transition from D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE to D3D12_RESOURCE_STATE_RENDER_TARGET and back without any draw calls in-between is redundant
- Avoid read-to-read barriers
- Get the resource in the right state for all subsequent reads
- Don’t use D3D12_RESOURCE_STATE_GENERIC_READ unless you really need every single flag
- Don’t sequentially call ID3D12GraphicsCommandList::ResourceBarrier with just one barrier
- This doesn’t allow the driver to pick the worst case of a set of barriers
- Don’t expect fences to trigger signals/advance at a finer granularity than once per ExecuteCommandLists call.
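To illustrate the redundant-barrier and read-to-read points (a hypothetical sketch, resource names made up): instead of bouncing a texture back and forth, transition it once into the combined read state every later stage needs.

```cpp
// Bad (hypothetical): PIXEL_SHADER_RESOURCE -> RENDER_TARGET -> back
// with no draw call in between is pure waste, and a later second
// barrier just to add another read state is a read-to-read barrier.
//
// Better: one transition into the union of all the read states the
// resource will need (read states can be combined in D3D12):
auto toRead = CD3DX12_RESOURCE_BARRIER::Transition(
    texture,
    D3D12_RESOURCE_STATE_RENDER_TARGET,
    D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE |
    D3D12_RESOURCE_STATE_NON_PIXEL_SHADER_RESOURCE);
cmdList->ResourceBarrier(1, &toRead);
```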
Also, the work batches/command lists submitted can be big but few, so there are fewer synchronization points, which fits Maxwell and Pascal.
- Submit work in parallel and evenly across several threads/cores to multiple command lists
- Recording commands is a CPU intensive operation and no driver threads come to the rescue
- Command lists are not free threaded so parallel work submission means submitting to multiple command lists
- Be aware of the fact that there is a cost associated with setup and reset of a command list
- You still need a reasonable number of command lists for efficient parallel work submission
- Fences force the splitting of command lists for various reasons (multiple command queues, picking up the results of queries)
- Try to aim at a reasonable number of command lists in the range of 15-30 or below. Try to bundle those CLs into 5-10 ExecuteCommandLists() calls per frame.
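That submission advice would look roughly like this (sketch, names hypothetical): record several command lists in parallel, then hand them to the queue in one ExecuteCommandLists call.

```cpp
// Hypothetical fragment: each worker thread records into its own
// command list (lists[i]->Reset(...), record draws, lists[i]->Close()),
// then the main thread submits the whole batch in a single call.
ID3D12CommandList* batch[] = {
    lists[0].Get(), lists[1].Get(), lists[2].Get(), lists[3].Get(),
};
queue->ExecuteCommandLists(_countof(batch), batch);  // one submit, 4 CLs
```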
Even if you optimize everything for the specific target GPUs, each will have strengths and weaknesses depending on how they handle the tasks given to them.
We have seen this argument before with the now ancient 8800 GTX vs HD 2900 XT debate (SIMD vs VLIW respectively). You can't change the fact that both approaches deal with data processing in intrinsically different ways.
Maybe they use the same path regardless of the architecture: a basic procedure for every GPU that sends work to however many CUs are available, in the common way that works across all architectures, even if the asynchronous compute queues don't take advantage of any particular architecture; like using the Pascal path, which also works on Maxwell and GCN.
Edited by PontiacGTX - 7/17/16 at 3:45pm