For the past five years, I've been very vocal and outspoken about CPU inefficiency, commonly known as a CPU bottleneck. Back then, I could not convince a single soul that such a thing even existed; every time, I was told that something was wrong with my system. Five systems later, I still have the problem! Often, the CPU's usage will increase substantially, causing GPU usage to drop. FPS often suffers, mouse input becomes delayed and frame latency increases. Windows Task Manager reports high usage on all cores. Although this indicates a bottleneck, Task Manager cannot display the actual usage properly: what really happens is that Core #0 runs at 99% usage whilst the remaining cores show less. We are quick to blame the DirectX 11 API for its shortcomings, but how does the AMD driver fare in all this?
Let us start with the DirectX 11 API. In short, it does have the ability to use multiple CPU cores for a more efficient workload:
Quote:
Source
DX11 adds multi-threading support that allows applications to simultaneously create resources or manage state and issue draw commands, all from an arbitrary number of threads. This may not significantly speed up the graphics subsystem (especially if we are already very GPU limited), but this does increase the ability to more easily explicitly massively thread a game and take advantage of the increasing number of CPU cores on the desktop.
This API introduced multi-threaded capabilities to the pipeline, allowing workloads to be processed in parallel:
Quote:
Source
The major benefit I'm talking about here is multi-threading. Yes, eventually everything will need to be drawn, rasterized, and displayed (linearly and synchronously), but DX11 adds multi-threading support that allows applications to simultaneously create resources or manage state and issue draw commands, all from an arbitrary number of threads.
DirectX 11 has the ability to use deferred contexts/command lists as its main mechanism for multi-threaded workloads:
Quote:
Source
A deferred context is a special ID3D11DeviceContext that can be called in parallel on a different thread than the main thread which is issuing commands to the immediate context. Unlike the immediate context, calls to a deferred context are not sent to the GPU at the time of the call and must be marshalled into a command list, which is then executed at a later time. It is also possible to execute a command list multiple times to replay a sequence of GPU work against different input data.
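To make that pattern concrete, here is a minimal sketch of the idea in Python. The class and method names below are illustrative stand-ins, not the real D3D11 API (which uses ID3D11DeviceContext, FinishCommandList and ExecuteCommandList): worker threads record draw commands into their own deferred contexts, and the main thread later replays the finished command lists through the immediate context.

```python
import threading

# Conceptual model only: DeferredContext, CommandList and ImmediateContext
# are stand-ins for the D3D11 objects described in the quote above.

class CommandList:
    """A frozen list of pre-recorded commands, ready for later replay."""
    def __init__(self, commands):
        self.commands = commands

class DeferredContext:
    """Records commands on a worker thread instead of executing them."""
    def __init__(self):
        self._recorded = []

    def draw(self, mesh):
        self._recorded.append(("draw", mesh))

    def finish_command_list(self):
        command_list = CommandList(list(self._recorded))
        self._recorded.clear()
        return command_list

class ImmediateContext:
    """The single context that actually submits work to the 'GPU'."""
    def __init__(self):
        self.submitted = []

    def execute_command_list(self, command_list):
        # Replay the pre-recorded commands; in D3D11 this is the only
        # point where the deferred work reaches the GPU.
        self.submitted.extend(command_list.commands)

def record(ctx, meshes):
    for mesh in meshes:
        ctx.draw(mesh)

# Four worker threads record three draw calls each, in parallel...
contexts = [DeferredContext() for _ in range(4)]
threads = [threading.Thread(target=record,
                            args=(ctx, [f"mesh{i}_{j}" for j in range(3)]))
           for i, ctx in enumerate(contexts)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# ...then the main thread replays the command lists in order.
imm = ImmediateContext()
for ctx in contexts:
    imm.execute_command_list(ctx.finish_command_list())

print(len(imm.submitted))  # 12 draw commands submitted
```

The point is that the expensive recording work happens on the worker threads, and only the cheap replay is serialized on the main thread; this is exactly the CPU load a driver either spreads across cores or leaves piled up on one.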
Nvidia explain the reasons why a developer would want to take advantage of deferred contexts within the API:
Quote:
Source
The entire reason for using or not using deferred contexts revolves around performance. There is a potential to parallelize CPU load onto idle CPU cores and improve performance.
You will be interested in using deferred context command lists if:
• Your game is CPU bottlenecked.
• You have a significant number of draw calls (>3000).
• Your CPU bottleneck is from render thread load or Direct3D API calls.
• You have a threaded renderer but serialize to a main render thread for mapping, incurring sync point costs.
After this quick explanation of the API and some of its core functions, let's look at AMD's side of the story. To put things simply, deferred contexts and command lists are not mandatory, and AMD have specifically chosen not to use them in their driver. The drawback is significantly higher CPU overhead, which heavily chokes the GPU(s) in CPU-intensive applications. The upside is a simple, stable driver that works with almost any newly released game without compatibility issues or the need for game-specific driver tweaks. It also gives consistent performance in most games across the board, even if that performance is relatively low. Hence why you can run pretty much any indie game on an AMD card without a "not supported" error message. Nvidia's drivers, by contrast, often need per-game tweaks to remain stable. That is a lot of work for the driver team, but at least you get the performance boosts from DirectX 11's multi-threaded rendering capabilities. I would not be surprised if AMD chose compatibility over performance either to reduce staff workload, or because the driver team lack the resources to build a fully stable driver that uses the features discussed.
AMD have shown promise in tackling their CPU overhead issues in the past with "Sid Meier's Civilization: Beyond Earth", where they worked closely with the developer to ensure the game used good API optimization methods for better CPU performance. AMD also enabled command list support (a rare moment) for this title, which allowed proper usage of all threads.
Source
Here is more information on the Civ: Beyond Earth case in relation to AMD's driver and the API's capabilities:
Quote:
Traditionally, rendering is a very serial process. The program needs to set up a bunch of objects and then pass them on to the video drivers and finally to the GPU. There's a high degree of submission overhead, meaning it's possible to choke the CPU while submitting a large number of objects to the GPU. In DirectX 11, multi-threaded rendering is achieved by turning the D3D pipeline into a 3-step process: the Device, the Immediate Context, and the Deferred Context. The important bit here is that the deferred context is full of things that have yet to be sent to the GPU, and that you can have a deferred context for each thread. When developers talk about multi-threaded rendering with DX11, this is what they're referring to. When you use DX11's multi-threaded rendering capabilities correctly, you can have several threads assemble their deferred contexts, and then combine them into a single command list once it comes time to render the scene.
Quote:
But let's be clear here: multi-threaded rendering is a massive undertaking on the driver and hardware side. You're doing the GPU equivalent of inventing the multi-tasking operating system. NVIDIA and AMD have not until this point supported multi-threaded rendering in their drivers, as they have needed time to implement this feature correctly in their drivers.
Quote:
Anyhow, as far as I know, AMD does not currently offer full support for multi-threaded rendering (I don't have an AMD card plugged in right now to run the DX Caps Viewer against). I'm not sure where they are on it, though I doubt they're very far behind.
Source
So in conclusion, the reason NVIDIA beats AMD in Civ V is that NVIDIA currently offers full support for multi-threaded rendering/deferred contexts/command lists, while AMD does not. Civ V uses massive amounts of objects and complex terrain, and because it's multi-threaded rendering capable the introduction of multi-threaded rendering support in NVIDIA's drivers means that NVIDIA's GPUs can now rip through the game.
Will we see an improvement in AMD's DirectX 11 overhead? I think not, as suggested in this interview with AMD's Richard Huddy at Bit-tech:
Quote:
Source One
AMD says DX11 multi-threaded rendering can double object/draw-call throughput, and they want to go well beyond that by bypassing the DX11 API.
Source Two
Not too long ago, Nvidia decided to fully support DirectX 11's capabilities with their "wonder driver", which promised the following:
Source
It would seem Nvidia have fully enabled all the bells and whistles of the DirectX 11 API, putting an end to CPU bottlenecks in most games for those running an Nvidia GPU. AMD still refuse to make such changes even now, in Q3 2015. Their main focus is clearly DirectX 12, which does not help those who wish to play a game using DirectX 11, as shown here:
Source
Clearly, the image above shows that AMD's driver had a maximum output of 1.1m draw calls regardless of the hardware used at the time of testing. This has since improved to around 1.3m, depending on the hardware configuration.
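To put those numbers in perspective, here is a quick back-of-the-envelope calculation. The 1.1m/1.3m throughput figures come from the paragraph above; the 60 fps target is my own assumption for illustration.

```python
# Convert a driver's draw-call throughput ceiling into a per-frame budget.
# Throughput figures are the ones quoted above; 60 fps is an assumed target.
def draw_calls_per_frame(calls_per_second: int, fps: int) -> int:
    return calls_per_second // fps

amd_before = draw_calls_per_frame(1_100_000, 60)
amd_after = draw_calls_per_frame(1_300_000, 60)
print(amd_before, amd_after)  # 18333 21666
```

In other words, a game pushing much more than roughly 18,000 draw calls per frame would have been driver-limited below 60 fps regardless of how fast the GPU itself was.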
To conclude, this is only a basic analysis of DirectX and AMD's drivers. It is clear that AMD are not utilizing the full potential of DirectX 11, causing CPU limitations for AMD customers. To reiterate, I spoke about all this 4-5 years ago, yet I was labelled as the crazy guy in the corner of the room making stuff up because he's out of his mind, and told the bottleneck was nothing more than a problem with my system. Instead, it turns out those involved simply lacked understanding. Hopefully, with these findings, we can show people the true state of AMD's GPU drivers and demand change once and for all. Last but not least, I'd like to leave you with some Nvidia marketing benchmarks (which turned out to be completely true) to solidify the fact that AMD's driver performance is in a diabolical state by comparison.
Source
UPDATE 24/02/2016: More evidence of AMD's DX11 single-threaded performance issues. Look at the boost over DX11, and over the 980 Ti, in DX12.
New Ashes of the Singularity build benchmark results:
Quote:
Originally Posted by Mahigan
Well, let's take it a step further, shall we?
SM200 (GTX 980 Ti):
22 SMMs
Each SMM contains 128 SIMD cores.
Each SMM can execute 64 warps concurrently.
Each Warp is comprised of 32 threads.
So that's 2,048 threads per SMM or 128 SIMD cores.
2,048 x 22 = 45,056 threads in flight (executing concurrently).
GCN3:
64 CUs
Each CU contains 64 SIMD cores.
Each CU can execute 40 wavefronts concurrently.
Each Wavefront is comprised of 64 threads.
So that's 2,560 threads per CU or 64 SIMD cores.
2,560 x 64 = 163,840 threads in flight (executing concurrently).
Now factor this:
GCN3 SIMDs are more powerful than SM200 SIMDs core for core. It takes a GCN3 SIMD less time to process a MADD.
So what is the conclusion?
1. GCN3 is far more parallel.
2. GCN3 has less SIMD cores dedicated towards doing more work. SM200 has more SIMD cores dedicated towards doing less work.
3. If you're feeding your GPU with small amounts of compute work items, SM200 will come out on top. If you're feeding your GPU with large amounts of compute work, GCN3 will come out on top.
Now this is just the tip of the Iceberg, there's also this to consider (Multi-Engine support):
GCN3 stands to benefit more from Asynchronous compute + graphics than SM200 does because GCN3 has more threads idling than SM200. So long as you feed both architectures with optimized code, they perform as expected.
What is as expected?
Hitman, under DX12, will showcase GCN quite nicely I believe.
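For what it's worth, the threads-in-flight arithmetic in the quote above checks out:

```python
# Sanity check of the quoted numbers.
# Maxwell (GTX 980 Ti): 22 SMMs x 64 warps per SMM x 32 threads per warp.
maxwell_threads = 22 * 64 * 32
# GCN3 (Fiji): 64 CUs x 40 wavefronts per CU x 64 threads per wavefront.
gcn3_threads = 64 * 40 * 64
print(maxwell_threads, gcn3_threads)  # 45056 163840
```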