[computerbase.de] DOOM + Vulkan Benchmarked. - Page 44

post #431 of 632
Quote:
Originally Posted by Mahigan View Post

I am not sure if people have asked about fences or not. The fences are the basis of my argument. It does not matter if the driver switches Async Compute on or off. 3DMark has two paths. One with Async Compute and another without. When you toggle Async On... the Async path is used. When you toggle Async Off... the non-Async path is used.

The argument has to do with the fences in the Async path. Those fences would still incur a performance penalty regardless of what the nVIDIA Maxwell driver is doing. Those fences cause tiny delays... the more fences the more delays are introduced. The more delays... the lower your FPS due to the introduced latency.

What I do not see is the performance hit associated with those delays when running Time Spy on an nVIDIA Maxwell-based GPU. Those fences should be negatively affecting the performance of the Maxwell GPU as they synchronize Graphics and Compute tasks. I have already mentioned ways in which this would not be the case, namely if the application was coded to favor nVIDIA hardware (using short-running shaders and very few Asynchronous Compute + Graphics workloads).

Since 3DMark apparently only uses a single path for both AMD and nVIDIA hardware, if that path was coded to favor nVIDIA hardware then the benchmark is not objective. Why not? Because the AMD hardware can take even more Asynchronous Compute + Graphics workloads in order to gain even more performance than is currently being shown. Therefore, if a separate path were coded for AMD, more tasks could be "marked" to run in parallel (more Graphics + Compute jobs), leading to a reduction in per-frame latency (resulting in higher FPS).

The main reason both AMD and nVIDIA (as well as Microsoft) stated that separate paths are the way to go for DX12 applications is specifically due to this issue. If a developer codes in favor of nVIDIA then AMD will suffer bad performance and vice versa. In order to avert this... both architectures ought to have their own optimized paths.

This should be mentioned to that 3DMark rep on Steam.
It seems devs can control how many fences/barriers are used and when,

and Nvidia's own guidelines try to show that this breaks their DX12 perf.

Do's


  • Minimize the use of barriers and fences
  • We have seen redundant barriers and associated wait for idle operations as a major performance problem for DX11 to DX12 ports
  • The DX11 driver is doing a great job of reducing barriers – now under DX12 you need to do it
  • Any barrier or fence can limit parallelism
  • Make sure to always use the minimum set of resource usage flags
  • Stay away from using D3D12_RESOURCE_USAGE_GENERIC_READ unless you really need every single flag that is set in this combination of flags
  • Redundant flags may trigger redundant flushes and stalls and slow down your game unnecessarily
  • To reiterate: We have seen redundant and/or overly conservative barrier flags and their associated wait for idle operations as a major performance problem for DX11 to DX12 ports.
  • Specify the minimum set of targets in ID3D12CommandList::ResourceBarrier
  • Adding false dependencies adds redundancy
  • Group barriers in one call to ID3D12CommandList::ResourceBarrier
  • This way the worst case can be picked instead of sequentially going through all barriers
  • Use split barriers when possible
  • Use the _BEGIN_ONLY/_END_ONLY flags
  • This helps the driver do a more efficient job (see the sketch after this list)
  • Do use fences to signal events/advance across calls to ExecuteCommandLists
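
Since grouping and split barriers come up repeatedly in these bullets, here is a rough idea of what they look like at the API level. This is only a sketch, assuming d3d12.h and an already-recorded ID3D12GraphicsCommandList; the resources (shadowMap, gbuffer) and their states are hypothetical, not something taken from 3DMark or the Nvidia slides.

Code:
// Sketch: group several transitions into one ResourceBarrier call, and use a
// split barrier (BEGIN_ONLY / END_ONLY) so other work can overlap the transition.
#include <d3d12.h>

void TransitionForLighting(ID3D12GraphicsCommandList* cmdList,
                           ID3D12Resource* shadowMap,   // hypothetical resource
                           ID3D12Resource* gbuffer)     // hypothetical resource
{
    // One call with two barriers: the driver resolves the worst case once
    // instead of flushing per barrier.
    D3D12_RESOURCE_BARRIER barriers[2] = {};

    barriers[0].Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    barriers[0].Transition.pResource   = shadowMap;
    barriers[0].Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    barriers[0].Transition.StateBefore = D3D12_RESOURCE_STATE_DEPTH_WRITE;
    barriers[0].Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;

    barriers[1].Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    barriers[1].Transition.pResource   = gbuffer;
    barriers[1].Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    barriers[1].Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
    barriers[1].Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;

    cmdList->ResourceBarrier(2, barriers);
}

void SplitTransition(ID3D12GraphicsCommandList* cmdList, ID3D12Resource* gbuffer)
{
    D3D12_RESOURCE_BARRIER barrier = {};
    barrier.Type  = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    barrier.Flags = D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY;  // start the transition early
    barrier.Transition.pResource   = gbuffer;
    barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
    barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;
    cmdList->ResourceBarrier(1, &barrier);

    // ... record independent work here; it can overlap the in-flight transition ...

    barrier.Flags = D3D12_RESOURCE_BARRIER_FLAG_END_ONLY;    // finish it just before the read
    cmdList->ResourceBarrier(1, &barrier);
}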

Don'ts
  • Don’t insert redundant barriers
  • This limits parallelism
  • A transition from D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE to D3D12_RESOURCE_STATE_RENDER_TARGET and back without any draw calls in-between is redundant
  • Avoid read-to-read barriers
  • Get the resource in the right state for all subsequent reads
  • Don’t use D3D12_RESOURCE_USAGE_GENERIC_READ unless you really need every single flag
  • Don’t sequentially call ID3D12CommandList::ResourceBarrier with just one barrier
  • This doesn’t allow the driver to pick the worst case of a set of barriers
  • Don’t expect fences to trigger signals/advance at a finer granularity than once per ExecuteCommandLists call (see the sketch below)
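
And for the fence bullets (signal/advance only once per ExecuteCommandLists), a minimal sketch of the intended pattern: batch the command lists, submit once, signal once. It assumes the queue, fence, and recorded command lists already exist; the names are made up for illustration.

Code:
// Sketch: one ExecuteCommandLists call per batch, one Signal per batch. The
// fence cannot advance at a finer granularity than the submission itself.
#include <d3d12.h>
#include <windows.h>

void SubmitBatchAndWait(ID3D12CommandQueue* queue,
                        ID3D12Fence* fence,
                        UINT64& fenceValue,
                        ID3D12CommandList* const* lists,
                        UINT count)
{
    queue->ExecuteCommandLists(count, lists);   // whole batch in one call

    const UINT64 valueToWaitFor = ++fenceValue;
    queue->Signal(fence, valueToWaitFor);       // fence advances after the batch

    if (fence->GetCompletedValue() < valueToWaitFor)
    {
        HANDLE evt = CreateEvent(nullptr, FALSE, FALSE, nullptr);
        fence->SetEventOnCompletion(valueToWaitFor, evt);
        WaitForSingleObject(evt, INFINITE);     // CPU-side wait for this batch only
        CloseHandle(evt);
    }
}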

Also, the work batches/command lists submitted can be big but few, and therefore there are fewer synchronization points, which fits Maxwell and Pascal.

Do's


  • Submit work in parallel and evenly across several threads/cores to multiple command lists
  • Recording commands is a CPU intensive operation and no driver threads come to the rescue
  • Command lists are not free threaded so parallel work submission means submitting to multiple command lists
  • Be aware of the fact that there is a cost associated with setup and reset of a command list
  • You still need a reasonable number of command lists for efficient parallel work submission
  • Fences force the splitting of command lists for various reasons ( multiple command queues, picking up the results of queries)
  • Try to aim at a reasonable number of command lists in the range of 15-30 or below. Try to bundle those CLs into 5-10 ExecuteCommandLists() calls per frame (see the sketch after this list).
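
As a rough illustration of that last bullet, here is a hedged sketch of recording command lists on several threads and then batching them into a single ExecuteCommandLists call. The per-thread allocators/lists are assumed to exist already (and to be safe to reset, i.e. fenced elsewhere); RecordScene is a hypothetical stand-in for whatever a frame actually records.

Code:
// Sketch: parallel recording into multiple command lists, one batched submit.
#include <d3d12.h>
#include <thread>
#include <vector>

// Hypothetical per-chunk recording; a real renderer would record draws here.
static void RecordScene(ID3D12GraphicsCommandList* /*cl*/, unsigned /*chunk*/) {}

void RecordAndSubmitFrame(ID3D12CommandQueue* queue,
                          std::vector<ID3D12CommandAllocator*>& allocators,
                          std::vector<ID3D12GraphicsCommandList*>& lists)
{
    std::vector<std::thread> workers;
    for (size_t i = 0; i < lists.size(); ++i)
    {
        workers.emplace_back([&, i]
        {
            // Command lists are not free-threaded: one list per worker thread.
            // Assumes the GPU has finished with these allocators (fenced elsewhere).
            allocators[i]->Reset();
            lists[i]->Reset(allocators[i], nullptr);
            RecordScene(lists[i], static_cast<unsigned>(i));
            lists[i]->Close();
        });
    }
    for (auto& w : workers) w.join();

    // Bundle the whole frame into one ExecuteCommandLists call instead of
    // calling it once per command list.
    std::vector<ID3D12CommandList*> raw(lists.begin(), lists.end());
    queue->ExecuteCommandLists(static_cast<UINT>(raw.size()), raw.data());
}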


Quote:
Originally Posted by KarathKasun View Post

Even if you optimize everything for the specific target GPU's, each will have strengths and weaknesses depending on how they handle the tasks given to them.

We have seen this argument before with the now ancient 8800 GTX vs HD 2900 XT debate (SIMD vs VLIW respectively). You can't change the fact that both approaches deal with data processing in intrinsically different ways.

Maybe they use the same path regardless of the architecture: a basic procedure for every GPU which sends work to however many CUs are available, in the way that is common across all architectures, even if the asynchronous compute queues don't take advantage of each architecture - like using a Pascal path which also works on Maxwell and GCN.
Edited by PontiacGTX - 7/17/16 at 3:45pm
post #432 of 632

If you give async code to the driver and the driver is doing something to make it shine while it's disabled on the Maxwell card, the score is not valid, because the driver is cheating! Either the developer should disable async compute (by graying out the button) or we should see a perf hit with Async On.

post #433 of 632
Quote:
Originally Posted by Remij View Post

Big developers with their own engines of course have much to gain from low level APIs where they push the latest and greatest hardware to show off their games and engines.. You'll see those devs take initiative and support the hardware better regardless of the API used because they have huge teams with highly specialized engineers and programmers that know exactly how to code close to the metal. It's no surprise they want to push technology forward.

But then you have to remember about the point I made about developing for the largest potential market of hardware out there. Even Johan was debating pushing for DX12 only vs coding two separate paths. He said the benefits would be there, but they have to consider the market.

So relax... don't blame developers just yet, screaming bloody murder when something isn't fully taken advantage of. This is a transition period and the two architectures are quite different, as we already know... So in the games where AMD gets ahead, celebrate and be happy. But when Nvidia wins games here and there... just take solace in the fact that AMD's performance will likely be much better than it would have been before DX12. So progress is being made.

I am not blaming anyone. I just expressed my objection to the notion that devs have this passive role of getting served a specific type of hardware or API and can't do anything about it, only code for whatever the majority uses. In fact it is the top devs and their games that shape the hardware. Carmack did it back in the day, now it is Andersson. Radeons and the modern APIs had to meet his demands, essentially. Not that he is some sort of king or anything - his wishes echoed the industry as a whole. Indie devs and non-AAA studios in general will compromise, sure. But even they affect the shape of things to come by supporting certain advanced engines over older ones. As for Johan and DX12, he is dead set on using it. The thing he considered was whether to drop the DX11 burden altogether or not. My guess is that if DX12 was not tied to Windows 10, BF1 would be DX12 only, just like BF3 was DX11 only back in 2011. In other words, his market concerns are on the OS level, not the hardware one - non-DX12-compliant GPUs are far too weak these days to matter for this discussion.
Edited by Kuivamaa - 7/17/16 at 3:22pm
post #434 of 632
Seems the NV approach relies on doing compute kernels in a single wavefront, or whatever they call the smallest slice of GPU time they can allot to a task, effectively limiting the chip to being treated as a single resource. This is similar to the way a single CPU core was used for multitasking before multi-core PCs became commonplace.

GCN OTOH is operated more like a cluster of weaker components where each can do its own task without disturbing the jobs taking place in adjacent resources. It has more transistors dedicated to control and scheduling logic, so it gets less raw performance per transistor but can handle multiple tasks concurrently.

Both can do the same things, but you have to optimize for either to get the most out of them.
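
For what it's worth, the application-side expression of that "multiple independent tasks" model is simply creating a separate compute queue next to the graphics queue; whether the GPU actually runs the two concurrently (as GCN can) or serializes/time-slices them is up to the hardware and driver. A minimal sketch, assuming an existing ID3D12Device:

Code:
// Sketch: one direct (graphics) queue plus one dedicated compute queue.
#include <d3d12.h>

HRESULT CreateQueues(ID3D12Device* device,
                     ID3D12CommandQueue** graphicsQueue,
                     ID3D12CommandQueue** computeQueue)
{
    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
    gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;    // graphics + compute + copy
    HRESULT hr = device->CreateCommandQueue(&gfxDesc, __uuidof(ID3D12CommandQueue),
                                            reinterpret_cast<void**>(graphicsQueue));
    if (FAILED(hr)) return hr;

    D3D12_COMMAND_QUEUE_DESC compDesc = {};
    compDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;  // compute + copy only
    return device->CreateCommandQueue(&compDesc, __uuidof(ID3D12CommandQueue),
                                      reinterpret_cast<void**>(computeQueue));
}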
post #435 of 632
Quote:
Originally Posted by Xuper View Post

If you give async code to the driver and the driver is doing something to make it shine while it's disabled on the Maxwell card, the score is not valid, because the driver is cheating! Either the developer should disable async compute (by graying out the button) or we should see a perf hit with Async On.
I honestly couldn't care either way... as long as image quality isn't affected.

If Nvidia can do something in drivers to disable async on Maxwell gpus so there is no performance hit, then it's the right thing to do imo.

Again... as long as there is no effect on image quality, who cares what drivers do to get the performance they do? Async itself is a way of gaining more performance. In the end, the numbers are important... not how they got there.
post #436 of 632
Quote:
Originally Posted by PontiacGTX View Post

It seems devs can control how many fences/barriers are used and when,

and Nvidia's own guidelines try to show that this breaks their DX12 perf.

Maybe they use the same path regardless of the architecture: a basic procedure for every GPU which sends work to however many CUs are available, in the way that is common across all architectures, even if the asynchronous compute queues don't take advantage of each architecture - like using a Pascal path which also works on Maxwell and GCN.

They are the only ones who can control how many fences are used and when they are used. The driver has no control over fences according to Kollock. Kollock stated that the fences are invisible to the driver contrary to what the 3DMark rep is saying.

Every single Asynchronous + Graphics task requires a fence in order to synchronize both the Compute and Graphics context. So the more Asynchronous + Graphics work... the more fences.
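
For reference, this is roughly what one of those synchronization points looks like from the application's side: the compute queue signals a fence when the async job is done, and the graphics queue waits on it (on the GPU) before consuming the result. A sketch only, assuming both queues, the fence, and the recorded command lists already exist.

Code:
// Sketch: cross-queue dependency between an async compute job and the graphics
// work that consumes it. Each dependency like this is one fence, and one point
// where a queue may have to sit idle.
#include <d3d12.h>

void SubmitAsyncComputeThenGraphics(ID3D12CommandQueue* computeQueue,
                                    ID3D12CommandQueue* graphicsQueue,
                                    ID3D12Fence* fence,
                                    UINT64& fenceValue,
                                    ID3D12CommandList* computeWork,
                                    ID3D12CommandList* graphicsWork)
{
    computeQueue->ExecuteCommandLists(1, &computeWork);
    const UINT64 computeDone = ++fenceValue;
    computeQueue->Signal(fence, computeDone);

    // GPU-side wait: the graphics queue stalls here until the compute queue
    // has passed its Signal, then runs the dependent work.
    graphicsQueue->Wait(fence, computeDone);
    graphicsQueue->ExecuteCommandLists(1, &graphicsWork);
}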
post #437 of 632
Quote:
Originally Posted by Remij View Post


I honestly couldn't care either way... as long as image quality isn't affected.

If Nvidia can do something in drivers to disable async on Maxwell gpus so there is no performance hit, then it's the right thing to do imo.

Again... as long as there is no effect on image quality, who cares what drivers do to get the performance they do? Async itself is a way of gaining more performance. In the end, the numbers are important... not how they got there.

 

Sorry, I don't accept this kind of logic! When I say do it, you should do exactly what I am saying!! Not use some trick to get a result. Test A = Async OFF, Test B = Async ON. If you can't do Test B then your score should be much lower than Test A, because I told you to do exactly what I say, not to use some optimization path.

post #438 of 632
The test in question actually does do a good bit of async. It was altered by AMD to use a good amount. You say in games we see much different, but so far in games that hasn't really been true. In fact, the difference in Ashes between on and off isn't much at all. So where exactly are you going with this? Are you saying that it's impossible for async to be turned off in the driver by Nvidia? That no matter what, you're going to have async running on Maxwell if the programmer doesn't explicitly disable it in code? That's easy enough to prove. Since AMD's own demonstration of async isn't enough, do you care to write your own program? I'll gladly test.
post #439 of 632
Quote:
Originally Posted by Mahigan View Post

They are the only ones who can control how many fences are used and when they are used. The driver has no control over fences according to Kollock. Kollock stated that the fences are invisible to the driver contrary to what the 3DMark rep is saying.

Every single Asynchronous + Graphics task requires a fence in order to synchronize both the Compute and Graphics context. So the more Asynchronous + Graphics work... the more fences.
That is exactly what the text says:
Quote:
Submit work in parallel and evenly across several threads/cores to multiple command lists
Recording commands is a CPU intensive operation and no driver threads come to the rescue
Command lists are not free threaded so parallel work submission means submitting to multiple command lists

Reuse fragments recorded in bundles if you can
No need to spend CPU time once again
...
...
The DX11 driver is doing a great job of reducing barriers – now under DX12 you need to do it

To reiterate: We have seen redundant and/or overly conservative barrier flags and their associated wait for idle operations as a major performance problem for DX11 to DX12 ports.

Don’t insert redundant barriers
This limits parallelism
A transition from D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE to D3D12_RESOURCE_STATE_RENDER_TARGET and back without any draw calls in-between is redundant
Avoid read-to-read barriers

Don’t sequentially call ID3D12CommandList::ResourceBarrier with just one barrier
This doesn’t allow the driver to pick the worst case of a set of barriers
Don’t expect fences to trigger signals/advance at a finer granularity than once per ExecuteCommandLists call.

Edited by PontiacGTX - 7/17/16 at 3:58pm
post #440 of 632
Quote:
Originally Posted by Xuper View Post

Sorry, I don't accept this kind of logic! When I say do it, you should do exactly what I am saying!! Not use some trick to get a result. Test A = Async OFF, Test B = Async ON. If you can't do Test B then your score should be much lower than Test A, because I told you to do exactly what I say, not to use some optimization path.
There is no Test A and Test B though, and there doesn't need to be. This isn't an image-quality-affecting optimization. There's Async = On, which benefits AMD. There's Async = Disabled on Nvidia, which still benefits AMD, and benefits Nvidia Maxwell GPUs.

I doubt many games in the future will come with an Async on/off option because it's just something you'd naturally want to use on AMD, and something that Nvidia will disable in drivers to maintain better performance on their older hardware.

I might catch hell for this, but I'm a fan of driver cheats. I'm a fan of anything they can do to improve performance as long as Image Quality isn't affected.
Edited by Remij - 7/17/16 at 4:20pm