
[PCPER] NVIDIA Publishes DirectX 12 Tips for Developers - Page 8

post #71 of 127
Mahigan, care to comment? I wonder how async-compute-heavy multiplatform games will be this gen.
Quote:
A masterpiece. Your work is done here.

Just a note so I can add something to this discussion. For async compute jobs on GCN (edit: I'm speculating based on the information provided below), a default of 4 CUs is allocated to the job unless more are specified.

Fable Legends may use async compute, yes, but it is designed with the XBO in mind, which has 12 CUs, so keep that in mind when looking at this particular benchmark. The Fury X has 64, IIRC.

Optimizing for async compute in games will not be easy because of the varying number of ACE queues and CUs available at any given time across different hardware configurations. It should be doable, but optimizing for both Nvidia and AMD will be much harder, since they approach async differently.


Source: https://forum.beyond3d.com/posts/1873115/
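For anyone who wants to see the API side of this: below is a minimal C++/D3D12 sketch (my own illustration, not from the thread) of creating a dedicated compute queue. The application only gets a queue; how many CUs actually service it (such as the 4-CU default speculated above) is decided by the driver and hardware, not by anything in the API.

```cpp
// Minimal C++/D3D12 sketch: async compute is exposed to the app only as
// a separate COMPUTE-type queue. Whether its work overlaps the graphics
// queue, and how many CUs/SMs service it, is up to the driver/hardware.
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

ComPtr<ID3D12CommandQueue> CreateAsyncComputeQueue(ID3D12Device* device)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;        // compute-only queue
    desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_NORMAL;

    ComPtr<ID3D12CommandQueue> queue;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue));

    // Cross-queue ordering against the graphics queue must be expressed
    // explicitly with ID3D12Fence signals/waits.
    return queue;
}
```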
post #72 of 127
Quote:
Originally Posted by SpeedyVT View Post

There are more GCN systems than Maxwell systems supporting these API designs, console and desktop combined. It's hard to see nVidia having much say in game development when they are in the minority, even if AMD only has 25-30% of total desktop graphics sales.
Multiplatform games have always been designed around consoles. Since the consoles are x86 + GCN, porting from them will take much less effort than before. But the problem here is Nvidia: future games will need to make heavy use of async compute since the console CPUs are weak, which means porting to PC on Nvidia hardware is extra work for the developer. The async compute that has been optimized heavily for GCN has to be either taken out for Nvidia or given a different code path.

I'm beginning to wonder whether lazy developers will do that, or whether they'll just tell Nvidia users to upgrade to a more powerful GPU and brute-force it.
post #73 of 127
Quote:
Originally Posted by Devnant View Post

Mahigan, care to comment? I wonder how async-compute-heavy multiplatform games will be this gen.
Source: https://forum.beyond3d.com/posts/1873115/

The degree of async compute programmed into console titles is higher than in any PC title to date, including Ashes of the Singularity. Therefore, if a console title were ported to the PC with its asynchronous compute commands intact, we would see a greater boost than we've seen thus far in either Fable Legends or Ashes of the Singularity. I'm even beginning to doubt that async compute is active in the recently released Fable Legends benchmark. It could be that the feature has been placed on a separate path that would need to be activated from within the game when it is released, or by a console command (Zlatan, an Anandtech user, made this claim).

If it is active, then one could assume that only a minute degree of asynchronous compute is taking place (I keep hearing a 5% figure quoted around various forums, its source being ExtremeTech's article/comment section).

We know that Kollock mentioned that Ashes of the Singularity only makes mild use of this feature. That would appear to indicate that the performance boost for AMD hardware had more to do with the increased parallelism of GCN hardware under the DX12 API than with the async compute taking place. We should know more soon, once Fable Legends is released in mid-October.
post #74 of 127
Quote:
Originally Posted by Glottis View Post

Oh yes, game devs should totally cripple performance on Nvidia GPUs by needlessly overusing async compute like Ashes did, just to prove some sick point. Yes, that's so good and healthy for the gaming community! If you actually bothered reading what more neutral game devs have to say about this matter, you would see they just want to optimize their games so people have the best experience regardless of GPU brand. Speaking of Maxwell 2, you do realize that the 980 Ti is still the #1 DX12 card out there. It's only in the mid-range that AMD has a few faster cards, and only in an async-heavy game like Ashes.

Oh you mean like how some game devs totally crippled performance on AMD GPUs by needlessly overusing tessellation just to prove some sick point?

Fun fact: The tessellation slider in AMD's CCC resulted from the Crysis 2 invisible tessellated water shenanigans
post #75 of 127
Quote:
Originally Posted by ku4eto View Post

My post was more an opinion than a technical statement. And yes, I read the Oxide dev post. I am just saddened that nVidia keeps on with this stupid crap while holding a straight poker face, like nothing happened at all.

Do you really think AMD doesn't do something similar? Obviously they are going to steer things in their preferred direction, but if you think only Nvidia does this, you're crazy. Have you seen AMD's reviewer guides, for example?
post #76 of 127
Nvidia should post some DirectX 11 tips for AMD
post #77 of 127
Quote:
Originally Posted by Clocknut View Post

Multiplatform games have always been designed around consoles. Since the consoles are x86 + GCN, porting from them will take much less effort than before. But the problem here is Nvidia: future games will need to make heavy use of async compute since the console CPUs are weak, which means porting to PC on Nvidia hardware is extra work for the developer. The async compute that has been optimized heavily for GCN has to be either taken out for Nvidia or given a different code path.

I'm beginning to wonder whether lazy developers will do that, or whether they'll just tell Nvidia users to upgrade to a more powerful GPU and brute-force it.

Mind you, x86 doesn't mean PC-compatible x86. However, I agree. Multiplatform titles usually settle for a non-custom engine suited to generic deployment across all platforms: the best support across the board, not the best performance. Lately, though, development seems more console-oriented than PC-oriented, which is why we've had so many bad ports recently.
post #78 of 127
Quote:
Originally Posted by Kollock View Post



No, the DXGI swap chain has nothing to do with any vendor. It has to do with being more directly exposed to the swap buffer, as well as some big changes MS made with Windows 10. The big issue is that in D3D12 you really don't have the equivalent of VSYNC being disabled that you did in D3D11. The compositor gives you a buffer to write to that is used directly. It makes things a little quirky. To top it off, the DXGI interface dates back to Vista and there are some dragons burrowed deep at the OS level. It's proved pretty tricky to navigate. This becomes even more apparent when you start dealing with MGPU stuff.

I would say that Oxide does almost everything on the recommended list and virtually nothing on the don't list. The list is very good advice and pretty vendor-independent; it's just good general advice for using D3D12. Many of the items are really good to do with D3D11 as well. Ironically, if you refactor your engine for D3D12, you will typically end up getting much better perf in D3D11. Our D3D11 performance is very good for many of the same reasons.
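For anyone wanting to see what Kollock is describing in code: D3D12 only runs on a DXGI flip-model swap chain, which is why the presented buffer goes to the compositor directly and the old D3D11 "VSYNC off" behaviour has no direct equivalent. A minimal C++ sketch (parameter choices are purely illustrative):

```cpp
// Minimal C++/DXGI sketch: D3D12 requires a flip-model swap chain, so
// the DWM compositor consumes the presented buffer directly rather than
// the driver blitting/tearing as a D3D11 blt-model chain could.
#include <d3d12.h>
#include <dxgi1_4.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

ComPtr<IDXGISwapChain3> CreateFlipSwapChain(IDXGIFactory4* factory,
                                            ID3D12CommandQueue* queue,
                                            HWND hwnd)
{
    DXGI_SWAP_CHAIN_DESC1 desc = {};
    desc.BufferCount = 3;                            // triple buffering
    desc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
    desc.BufferUsage = DXGI_USAGE_RENDER_TARGET_OUTPUT;
    desc.SwapEffect = DXGI_SWAP_EFFECT_FLIP_DISCARD; // flip model: mandatory for D3D12
    desc.SampleDesc.Count = 1;                       // no MSAA on flip-model chains

    // Note: for D3D12 the first argument is the command queue, not the device.
    ComPtr<IDXGISwapChain1> swapChain1;
    factory->CreateSwapChainForHwnd(queue, hwnd, &desc,
                                    nullptr, nullptr, &swapChain1);

    ComPtr<IDXGISwapChain3> swapChain3;
    swapChain1.As(&swapChain3);
    return swapChain3;
}
```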

Thank you Kollock :)

If Oxide is, as you're saying, mostly following the Do's on this list, then I would assume that Ashes of the Singularity is representative of DX12 performance, in its infancy of course. This leads me to believe that Maxwell 2 is performing as expected and that, contrary to the claims many people have made, Oxide is not crippling Maxwell 2's performance in Ashes of the Singularity. The performance we've seen thus far from Maxwell 2 is thus indicative of what one could expect from that architecture under a DX12 title.

As for AMD GCN's DX11 performance, some have mentioned that a lack of multi-threading support in AMD's DX11 driver is responsible for the poor performance we're seeing from GCN when running the DX11 path in Ashes of the Singularity. This could explain why the D3D12 performance does not carry over to D3D11 on AMD's GCN. It is alleged to be caused by a lack of deferred context/command list support in the AMD driver (ID3D11DeviceContext).

We're told that a developer would be interested in using deferred-context command lists if:
• Your game is CPU-bottlenecked.
• You have a significant number of draw calls (>3000).
• Your CPU bottleneck comes from render-thread load or Direct3D API calls.
• You have a threaded renderer but serialize to a main render thread for mapping, incurring sync-point costs.

Do you believe that this is responsible for the low performance of AMD's GCN when using D3D11?
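For reference, the deferred-context pattern being discussed looks roughly like the sketch below (C++/D3D11; function names are my own, and whether the driver executes command lists natively is exactly the capability in question):

```cpp
// Minimal C++/D3D11 sketch of the deferred-context pattern: record on a
// worker thread, replay on the immediate context. If the driver reports
// DriverCommandLists == FALSE, the D3D11 runtime emulates command lists
// in software and most of the CPU-side benefit evaporates.
#include <d3d11.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Does the *driver* (not just the runtime) support command lists natively?
bool DriverSupportsCommandLists(ID3D11Device* device)
{
    D3D11_FEATURE_DATA_THREADING caps = {};
    device->CheckFeatureSupport(D3D11_FEATURE_THREADING, &caps, sizeof(caps));
    return caps.DriverCommandLists == TRUE;
}

// Worker thread: record state changes and draw calls into a command list.
ComPtr<ID3D11CommandList> RecordOnWorkerThread(ID3D11Device* device)
{
    ComPtr<ID3D11DeviceContext> deferred;
    device->CreateDeferredContext(0, &deferred);

    // ... set state and issue draw calls on `deferred` here ...

    ComPtr<ID3D11CommandList> commandList;
    deferred->FinishCommandList(FALSE, &commandList); // FALSE: don't restore state
    return commandList;
}

// Main render thread: replay the recorded work.
void ReplayOnMainThread(ID3D11DeviceContext* immediate,
                        ID3D11CommandList* commandList)
{
    immediate->ExecuteCommandList(commandList, FALSE);
}
```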
Edited by Mahigan - 9/27/15 at 10:24pm
post #79 of 127
Quote:
Originally Posted by Mahigan View Post

Conservative rasterization and ROVs are supported in hardware by Maxwell 2. ROVs come with a hefty performance penalty regardless of software or hardware support.

Conservative rasterization can be emulated in software, using CPU cycles, with minimal to zero impact on performance on GCN-equipped systems. This is because GCN doesn't have software-side scheduling, so GCN's CPU footprint is minimal to begin with (low CPU overhead). Software-side CR was used in a DiRT game, an AMD Gaming Evolved title, as a proof of concept.

Doing conservative rasterization in "software" means it's being done with a geometry shader. It's still done on the GPU, not the CPU.

Quote:
Originally Posted by SpeedyVT View Post

Incorrect. Only in exclusive titles do we, or will we ever, see this kind of exploitation of the hardware. Not in any cookie-cutter titles. Cookie-cutter titles are usually games released across all platforms.


Not anymore. Both consoles and a good chunk of PCs use AMD GPUs.
Edited by dogen1 - 9/27/15 at 10:48pm
post #80 of 127
Quote:
Originally Posted by dogen1 View Post

Doing conservative rasterization in "software" means it's being done with a geometry shader. It's still done with the GPU, not CPU.

The global illumination is done in hardware, but not the occlusion culling; that is done in software.
Quote:
The first two methods do not require a geometry shader; rendering is repeated by submitting N times as many draw calls, which may cause an application to become CPU-bound if driver overhead is large.
Quote:
In the shading pass, it is critical that all pixels that in any part overlap a triangle in shading space are rasterized and shaded, as otherwise shading samples would be missed. This requires conservative rasterization, as opposed to standard sample-based rasterization. Due to the lack of hardware support in current GPUs, we implement conservative rasterization in a geometry shader (GS) [Hasselgren et al. 2005]. The triangle is first clipped to the near plane, and then a bounding shape B with up to eight vertices is computed, which dilates the triangle by 0.5 pixels in all directions. To enable our depth-culling optimization (c.f., Figure 3), it is desirable to place B as far back as possible, while never beyond the original triangle. In practice, if B does not intersect the near plane, we place it in the plane of the original triangle, and otherwise at the near plane.

The main drawback of a geometry shader-based approach, besides having to enable the GS stage, is that we cannot rely on built-in perspective-correct vertex attribute interpolation. Instead, the vertex attributes of the original triangle have to be passed to the pixel shader for manual interpolation, which consumes a large number of input/output registers. For these reasons, we believe hardware support for conservative rasterization is highly desired. In Section 6 we have estimated the performance gains that could result.
Quote:
In this paper, we present the details of a conservative rasterization algorithm based on edge functions [5, 13]. It can be used in both hardware and software. An advantage of this algorithm is that it requires only a small modification to the triangle setup of a rasterizer, while the rest of the pipeline is left unmodified. Furthermore, we show that the same algorithm can be used for tiled rasterization, which is used to improve memory coherence [10, 11], to do simple forms of culling [2, 12], and for different types of analysis to accelerate rendering [1]. The algorithm allows enabling conservative rasterization separately for each edge.

Source
Source
Source
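To illustrate that last quote, the edge-function modification really is a tiny change to triangle setup. A minimal C++ sketch (my own simplification, for overestimating conservative rasterization on 1x1 pixels):

```cpp
// Minimal C++ sketch of the edge-function trick from the quoted paper:
// each edge function e(x,y) = a*x + b*y + c is shifted outward by half a
// pixel in each axis, so any pixel whose area overlaps the triangle still
// passes the inside test at its centre sample.
#include <cmath>

struct EdgeFn { float a, b, c; };  // e(x,y) = a*x + b*y + c, inside if >= 0

// Dilate one edge for conservative coverage of 1x1 pixels: the worst-case
// offset over a half-pixel extent in each axis is (|a| + |b|) / 2.
EdgeFn MakeConservative(EdgeFn e)
{
    e.c += 0.5f * (std::fabs(e.a) + std::fabs(e.b));
    return e;
}

// A pixel, sampled at its centre (cx, cy), overlaps the triangle if it is
// inside all three dilated edges.
bool PixelOverlapsTriangle(const EdgeFn dilated[3], float cx, float cy)
{
    for (int i = 0; i < 3; ++i)
        if (dilated[i].a * cx + dilated[i].b * cy + dilated[i].c < 0.0f)
            return false;
    return true;
}
```

The rest of the rasterizer pipeline is untouched, which is exactly the advantage the paper claims over the geometry-shader approach.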
Edited by Mahigan - 9/27/15 at 10:51pm