Minimize the use of barriers and fences
We have seen redundant barriers and the associated wait-for-idle operations as a major performance problem in DX11-to-DX12 ports
The DX11 driver is doing a great job of reducing barriers – now under DX12 you need to do it
Any barrier or fence can limit parallelism
Make sure to always use the minimum set of resource usage flags
Stay away from using D3D12_RESOURCE_STATE_GENERIC_READ unless you really need every single flag that is set in this combination of flags
Redundant flags may trigger redundant flushes and stalls and slow down your game unnecessarily
To reiterate: We have seen redundant and/or overly conservative barrier flags and their associated wait-for-idle operations as a major performance problem in DX11-to-DX12 ports.
Specify the minimum set of targets in ID3D12GraphicsCommandList::ResourceBarrier
Adding false dependencies adds redundancy
Group barriers in one call to ID3D12GraphicsCommandList::ResourceBarrier
This way the worst case can be picked instead of sequentially going through all barriers
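A minimal sketch of the grouping idea. The types below are simplified stand-ins so it compiles anywhere, not the real API: real code fills `D3D12_RESOURCE_BARRIER` structs from `<d3d12.h>` and passes them all to one `ID3D12GraphicsCommandList::ResourceBarrier` call.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Stand-ins for the D3D12 resource states and transition barrier.
enum State { RenderTarget, PixelShaderResource };
struct Barrier { int resource; State before; State after; };

// Stand-in for a command list; counts ResourceBarrier calls, since each
// call is (in the worst case) a separate sync point for the driver.
struct CommandList {
    int barrierCalls = 0;
    void ResourceBarrier(std::size_t count, const Barrier* barriers) {
        ++barrierCalls;
        std::printf("ResourceBarrier with %zu barrier(s)\n", count);
    }
};

// Good: submit the whole group in one call, so the driver can resolve
// the worst case of the set once instead of flushing per barrier.
inline void SubmitGrouped(CommandList& cl, const std::vector<Barrier>& group) {
    cl.ResourceBarrier(group.size(), group.data());
}
```

The anti-pattern is calling `ResourceBarrier(1, &b)` once per transition in a loop; the grouped call gives the driver the full picture up front.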
Use split barriers when possible
Use the _BEGIN_ONLY/_END_ONLY flags
This helps the driver do a more efficient job
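A sketch of what a split barrier buys you, again with portable stand-ins rather than the real API. In real code the two halves are a transition barrier recorded twice, first with `D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY` and later with `D3D12_RESOURCE_BARRIER_FLAG_END_ONLY`.

```cpp
#include <string>
#include <vector>

// Stand-in for the split-barrier flags.
enum class Flag { None, BeginOnly, EndOnly };

// Stand-in command list that just records what was submitted.
struct CommandList {
    std::vector<std::string> log;
    void Transition(Flag f) {
        if (f == Flag::BeginOnly)    log.push_back("begin transition");
        else if (f == Flag::EndOnly) log.push_back("end transition");
        else                         log.push_back("full transition");
    }
    void Draw() { log.push_back("draw"); }
};

// Begin the transition early, record unrelated work, then end it: the GPU
// is free to overlap the state change with the draws in between, instead
// of stalling at one full barrier right before the resource is needed.
inline void RecordWithSplitBarrier(CommandList& cl) {
    cl.Transition(Flag::BeginOnly); // announce the upcoming state change
    cl.Draw();                      // unrelated work that can overlap
    cl.Draw();
    cl.Transition(Flag::EndOnly);   // transition must be complete here
}
```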
Do use fences to signal events/advance across calls to ExecuteCommandLists
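The fence semantics can be sketched with a mutex and condition variable; this is a portable model of the behavior, not the real API. In D3D12 the queue side is `ID3D12CommandQueue::Signal` after an `ExecuteCommandLists` batch, and the CPU side waits via `ID3D12Fence::SetEventOnCompletion`.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>

class Fence {
public:
    // Queue side: mark all work submitted before this point (i.e. whole
    // ExecuteCommandLists batches) as complete up to 'value'.
    void Signal(std::uint64_t value) {
        std::lock_guard<std::mutex> lock(m_);
        if (value > completed_) completed_ = value;
        cv_.notify_all();
    }
    // CPU side: block until the fence has reached 'value'.
    void Wait(std::uint64_t value) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [&] { return completed_ >= value; });
    }
    std::uint64_t GetCompletedValue() const {
        std::lock_guard<std::mutex> lock(m_);
        return completed_;
    }
private:
    mutable std::mutex m_;
    std::condition_variable cv_;
    std::uint64_t completed_ = 0; // fence values only ever advance
};
```

Note the granularity: the signal lands once per submitted batch, which is exactly why the don't below warns against expecting anything finer than per-`ExecuteCommandLists`.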
Don’t insert redundant barriers
This limits parallelism
A transition from D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE to D3D12_RESOURCE_STATE_RENDER_TARGET and back without any draw calls in-between is redundant
Avoid read-to-read barriers
Get the resource in the right state for all subsequent reads
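One way to get the resource right for all subsequent reads is to transition once into the union of every read state it will be consumed in; D3D12 states are bit flags and can be OR-ed (e.g. `D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE | D3D12_RESOURCE_STATE_NON_PIXEL_SHADER_RESOURCE`). The constants below are stand-ins with the same bit-flag shape, not the real headers.

```cpp
#include <cstdint>

// Stand-ins mimicking D3D12's bit-flag resource states.
using State = std::uint32_t;
constexpr State RenderTarget           = 0x4;
constexpr State NonPixelShaderResource = 0x40;
constexpr State PixelShaderResource    = 0x80;

struct Resource { State state = RenderTarget; int transitions = 0; };

inline void Transition(Resource& r, State after) {
    if (r.state == after) return; // same state: the barrier is redundant, skip it
    r.state = after;
    ++r.transitions;
}

// Good: one transition into the union of all upcoming read states, so no
// read-to-read barrier is needed per consuming stage afterwards.
inline void PrepareForAllReads(Resource& r) {
    Transition(r, PixelShaderResource | NonPixelShaderResource);
}
```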
Don’t use D3D12_RESOURCE_STATE_GENERIC_READ unless you really need every single flag
Don’t sequentially call ID3D12GraphicsCommandList::ResourceBarrier with just one barrier
This doesn’t allow the driver to pick the worst case of a set of barriers
Don’t expect fences to trigger signals/advance at a finer granularity than once per ExecuteCommandLists call.
This has come to my attention thanks to a few members of this board, namely @Doothe, @Mahigan, @Slomo4shO, @JackCY, @PontiacGTX among others.
These are some of the more interesting posts:
I took three screenshots in GPUView, one of each game. From left to right: DOOM, AOTS, and Time Spy. Each timeline covers roughly the same length of time. I'm still learning how to read and interpret this information, but I figured I'd share some of the images with you guys and maybe get a better understanding of what's going on.
The image is 4800x2560; I recommend opening it up in a separate tab.
Time Spy has a pre-emption packet (black rectangle) in the 3D queue that shows up every time a compute queue is processed
From Nvidia’s whitepaper:
"Compute Preemption is another important new hardware and software feature added to GP100 that allows compute tasks to be preempted at instruction-level granularity, rather than thread block granularity as in prior Maxwell and Kepler GPU architectures. Compute Preemption prevents long-running applications from either monopolizing the system (preventing other applications from running) or timing out."
BTW, DOOM is Vulkan. I don't know if Vulkan is properly picked up by GPUView, so disregard it if you want.
That's what I keep saying. They simply reused their older, DX11-like approach with DX12, and the features they use are quite limited as well, so that they can support old hardware; new hardware features that older GPUs don't have go unused. I bet they also want one engine with one path that runs on all GPUs to make their benchmark "valid" to them, but that makes it invalid to me, since it doesn't use each piece of hardware to its maximum potential, be it NV or AMD or some other GPU.
Figuratively: say there are two architectures, one with 1 thread to do the work and the other with 16 threads. Now they make an engine that only uses 1 thread and tries to compute parallel work on that single thread, so it context-switches like mad to get it done. Of course this engine works on both the 1- and 16-threaded hardware, and in theory runs at the same speed, but the 16-threaded hardware is underutilized: it could do 16 times more work in the same time if used in parallel with 16 submission threads. Context switching is expensive, and so on.
This article has a bit of explanation of the differences between architectures and their features.
It can be used on GCN, but it won't take advantage of parallelism and the performance gains. Maxwell can do some degree of pre-emption and doesn't see negative performance (given how fences limit the context switching), and it can work on Pascal, given its improved pre-emption.
When people compare them, Maxwell seems to have some degree of async compute (the benchmark is aimed at it, but it does pre-emption), GCN can do pre-emption but doesn't deliver the same gains as async compute, and Pascal shows its improved pre-emption gains.
Devs say they use a single path, but this only favors one side.
I love big boards like this, because you can call everyone's attention to a problem when it's noticed. And a lot of people here are quite capable of noticing such problems. Glad to be a part of such a community.
Anyway, I can say that the logical conclusion from this is that Futuremark's benchmark is BOTCHED and biased, not indicative of DX12 capabilities as it should be, but instead restricting them; thus it arguably has no credibility as a BENCHMARK suite.
Benchmark - Standard, or a set of standards, used as a point of reference for evaluating performance or level of quality. Benchmarks may be drawn from a firm's own experience, from the experience of other firms in the industry, or from legal requirements such as environmental regulations.
Example: A new benchmark was set for the football team when the weakest member benched 200 pounds, thereby setting the expectation for all other teammates to bench at least that amount.
In this case we have the weakest member benching 200 pounds, but he happens to be sponsoring the gym... and the gym has 2 members. So the bench press goes to 200 pounds.
I am incredibly disappointed and that's why I am giving voice to this in NEWS.
Well, do you think I should pick this benchmark as valid evidence for async compute? I say no.
No, but if I recall, similar complaints were made about Firestrike favoring Nvidia hardware. So, my question was if this surprised anyone, considering the accusations against the previous 3DMark benchmark.
However, that is most likely not going to happen, so you are either going to have someone coding for an AMD-preferred path or vice versa.
Made to look better for Nvidia results.
Oh wow, I'm so surprised I can't even.
Also, I don't know if this should be in the news section already.