Originally Posted by Mahigan
I think you are both arguing different things.
You are arguing that the Fury X beats a GTX 980 Ti in every DX12/Vulkan game.
Remi is arguing that the GTX 1070/1080 are faster than AMD's older-generation GPUs.
As for me... I do not care much about the performance I am seeing. I am simply here to explain why we are seeing it. I couldn't care less about benchmarks or e-peen.
There is one dude brushing aside what I am saying (even though this is my topic and yes... some of us here ARE ENGINEERS) and others who claim I am making excuses when I am actually attempting to explain what is happening.
There are people who think that Pascal will not gain performance when running Asynchronous compute + graphics. They are wrong.
There are other people who think that Pascal fully supports Asynchronous Compute + Graphics. They are wrong too.
Pascal is using a hack... a clever hack.
Pascal makes use of its more advanced pre-emption capabilities (compared to Maxwell) and its Dynamic Load Balancing capabilities in order to process Asynchronous Compute + Graphics tasks concurrently. There is a limit to how many of these tasks Pascal can handle due to the GPC nature of Pascal.

Each GTX 1080 is broken up into 4 GPCs (GPUs within a GPU) and each GPC can only be populated by either an array of Compute or Graphics tasks (not both) at any given time. Therefore a GTX 1080 can have two GPCs populated with Compute tasks and two GPCs populated with Graphics tasks (or 3 to 1, etc.). The load can be dynamically balanced in order to allow for more GPU resources to be dedicated to one set of tasks over another.
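To make that split concrete, here is a toy Python sketch of the claimed behaviour: each GPC holds only one kind of work at a time, and the graphics/compute split is rebalanced in proportion to load. This is purely illustrative; the function name, the rounding rule, and the load numbers are invented, not NVIDIA's actual scheduler.

```python
# Toy model (illustrative only, NOT real NVIDIA scheduling): each GPC holds
# tasks of a single kind at a time, and the split between graphics and
# compute GPCs can be rebalanced between dispatches.

GPC_COUNT = 4  # a GTX 1080 in the post's description

def partition_gpcs(graphics_load, compute_load, gpc_count=GPC_COUNT):
    """Split GPCs between graphics and compute in proportion to load.

    Each GPC gets exactly one kind of work (never both), mirroring the
    claim that a GPC is populated by either Compute or Graphics tasks.
    If both kinds have work, each gets at least one GPC.
    """
    total = graphics_load + compute_load
    if total == 0:
        return 0, 0
    gfx = round(gpc_count * graphics_load / total)
    gfx = min(max(gfx, 1 if graphics_load else 0),
              gpc_count - (1 if compute_load else 0))
    return gfx, gpc_count - gfx

print(partition_gpcs(50, 50))   # (2, 2): even split
print(partition_gpcs(75, 25))   # (3, 1): rebalanced toward graphics
print(partition_gpcs(100, 0))   # (4, 0): all GPCs on graphics
```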
Compute and Graphics tasks are not executed in parallel. They are (as in Maxwell) executed sequentially, but they can be processed concurrently (up to a certain limit).
This will give you a small performance boost under light loads. This is why nVIDIA released AotS benchmark figures using the "high" preset and not the "crazy" preset.
This is not bickering... that is how Pascal works.
I am only attempting to explain why we see the performance boost. The kicker is that everything I have said comes straight from nVIDIA...
Translated... in other words, you have one GPC handling Graphics and another handling Compute. One task in the Graphics GPC is synchronized with one task in the Compute GPC. If the Compute task takes longer than the Graphics task, and there are idle compute resources in the GPC where that longer-running Compute task sits, then those idle resources can be dynamically assigned to it in order to boost performance and complete it quicker.
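The synchronized-pair scenario above can be sketched as a toy calculation: the frame waits on the slower of the two tasks, and handing idle units to the long-running compute task shortens that wait. All names, unit counts, and timings are invented for illustration.

```python
# Illustrative only: one GPC runs a graphics task, another a synchronized
# compute task. If the compute task runs long and its GPC has idle units,
# those units can be handed to the compute task so the pair finishes sooner.
# Numbers below are made up for the sketch.

def frame_time(gfx_ms, compute_work, busy_units, total_units, rebalance):
    """Time until both synchronized tasks finish.

    compute_work is in unit-milliseconds; without rebalancing only
    busy_units work on it, with rebalancing all total_units do.
    """
    units = total_units if rebalance else busy_units
    compute_ms = compute_work / units
    return max(gfx_ms, compute_ms)  # synchronized: wait for the slower task

# Compute task: 40 unit-ms of work, 2 of 4 units busy, graphics takes 12 ms.
print(frame_time(12, 40, 2, 4, rebalance=False))  # 20.0 ms: compute is the bottleneck
print(frame_time(12, 40, 2, 4, rebalance=True))   # 12.0 ms: idle units absorb the overrun
```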
Translated... in other words, if a high-priority Asynchronous Compute + Graphics task enters the pipeline, it can be assigned priority over currently running tasks. So a GPC will be flushed of currently running tasks (taking less than 100 ms) so that the high-priority job can be executed quickly.
What if the GPU is busy and there are no idling resources? That is my argument. Dynamic Load Balancing then has nothing to reassign, so you get no performance boost (and maybe even a performance loss) when running heavy Asynchronous Compute + Graphics workloads.
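That heavy-load argument is just the degenerate case of the same toy model: when every unit is already busy, rebalancing has nothing to hand over. A minimal sketch with invented numbers:

```python
# Illustrative only: when every unit in the compute GPC is already busy,
# Dynamic Load Balancing has nothing to reassign, so the heavy-load case
# sees no gain. Numbers are invented for the sketch.

def speedup_from_rebalance(compute_work, busy_units, total_units):
    """Ratio of compute completion time without vs with rebalancing idle units."""
    before = compute_work / busy_units   # only the busy units work on it
    after = compute_work / total_units   # idle units are reassigned too
    return before / after

print(speedup_from_rebalance(40, 2, 4))  # 2.0: light load, half the units were idle
print(speedup_from_rebalance(40, 4, 4))  # 1.0: saturated GPU, no idle units, no gain
```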
So yeah... not bickering... informing people instead.
This has been interesting to read, for sure. Before the Time Spy benchmark I was a bit dubious that anything was actually occurring other than spin from NV's side.
It's still a hack, or at least the less elegant approach, as this switching still carries small time penalties, and in worst-case heavy-load situations there are likely no gains to be had via this pre-emptive split-loading method.
It does, however, move NV midway between nothing and a real gain, which is a good step forward. Games have unfortunately shown very little gain, I would say half of what we are seeing here, so this could be quite difficult to optimise.
I think it shows that either architecture can only use whatever idle resources are available to get gains from async commands; hence Polaris, with far fewer stream processors than the Fury X, has far less chance of having an idle stream processor (possibly it would show better if the bench ran at 1080p). Also, the 1080 gains slightly more than the 1070, since the 1070 has a quarter of its GPCs fused off and therefore has fewer idle units for a given workload than the 1080.