
[Gamer's Nexus] PCIe x16/x16 vs. x8/x8 (Dual Titan V Bandwidth Limit Test)

#1 ·
Quote:
We've previously found unexciting differences of <1% gains between x16 vs. x8 PCIe 3.0 arrangements, primarily relying on GTX 1080 Ti GPUs for the testing. There were two things we wanted to overhaul on that test: (1) Increase the count of GPUs to at least two, thereby placing greater strain on the PCIe bus (x16/x16 vs. x8/x8), and (2) use more powerful GPUs.
...
It's time to revisit PCIe bandwidth testing. We're looking at the point at which a GPU can exceed the bandwidth limitations of the PCIe Gen3 slot, particularly when in x8 mode. This comparison includes dual Titan V testing in x8 and x16 configurations, pushing the limits of the 1GB/s/lane limits of the PCIe slots.

Testing PCIe x8 vs. x16 lane arrangements can be done a few ways, including: (1) Tape off the physical pins on the PCIe foot of the GPU, thereby forcing x8 mode; (2) switch motherboard PCIe generation to Gen2 for half the bandwidth, but potentially introduce variables; (3) use a motherboard with slots which are physically wired for x8 or x16.

Our test platform includes 2x Titan V cards, the EVGA X299 Dark motherboard, an Intel i9-7980XE, and 32GB 3866MHz GSkill Trident Z Black. We were using the Titan Vs under air for these tests. When overclocked, they were set to a stable OC that was achievable on both cards -- +150 core and HBM2.

Ashes is being run at 4K with completely maxed settings. We set them to "Crazy," then manually increment all options to the highest point, including 8xMSAA. This was required to ensure adequate GPU work, thereby reducing the potential for a CPU bottleneck.
...
What we are left with, however, is a somewhat strong case for waning PCIe bandwidth sufficiency as we move toward the next generation - likely named something other than Volta, but built atop it. SLI or HB SLI bridges may still be required on future nVidia designs, as it's possible that a 1080 Ti successor could encounter this same issue, and would need an additional bridge to transact without limitations.


Source

So we are finally reaching a point where PCIe 3.0 x8 links may be a bottleneck. PCIe 4.0 is coming - probably around 2019 to 2020 - and its main attraction has been NVMe SSDs, but if GPUs continue to advance, the slot itself may become the limiting factor. Keep in mind that we should not be comparing against the Titan V specifically - the point is that the big-die GPUs of 2019 or 2020 would otherwise also be using PCIe 3.0.
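As a rough sketch of the raw numbers involved (nominal transfer rates and line encoding only - real-world throughput is somewhat lower), per direction:

```cpp
// Back-of-envelope PCIe bandwidth per direction, by generation and lane count.
// Nominal transfer rates and line encodings only; protocol overhead is ignored.
#include <cstdio>

int main() {
    struct Gen { const char* name; double gt_per_s; double encoding; };
    const Gen gens[] = {
        {"PCIe 2.0", 5.0,  8.0 / 10.0},    // 8b/10b encoding
        {"PCIe 3.0", 8.0,  128.0 / 130.0}, // 128b/130b encoding
        {"PCIe 4.0", 16.0, 128.0 / 130.0},
    };
    const int lane_counts[] = {8, 16};

    for (const Gen& g : gens)
        for (int lanes : lane_counts) {
            double GBps = g.gt_per_s * g.encoding / 8.0 * lanes; // GT/s -> GB/s per lane
            printf("%s x%-2d ~ %5.1f GB/s\n", g.name, lanes, GBps);
        }
    return 0;
}
```

Gen4 x8 lands at roughly the same ~15.8 GB/s as Gen3 x16, which is exactly why a doubled per-lane rate would relieve mainstream x8/x8 setups.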

Edit: Oh, and we finally have a solution that can average more than 60 fps at 4K on Crazy settings in Ashes of the Singularity: Escalation. The problem, of course, is the cost, and the fact that the 0.1% lows are still hovering around 20 fps, with no gains from the second GPU.
 
#2 ·
That's a relatively minimal hit for cutting bandwidth in half in a top of the line multi-GPU rendering configuration that doesn't have any sort of bridge dedicated to frame compositing.

Probably time for those looking to continue using top-of-the-line multi-GPU configs to run x16/x16 3.0 if possible, but that's a pretty small segment.

Single card is probably fine at far lower interface bandwidth, probably even with most external GPU enclosures over Thunderbolt (something I'd also like to see tested).
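For a ballpark on the Thunderbolt case: a TB3 enclosure typically tunnels something in the neighborhood of a PCIe 3.0 x4 link, so a quick comparison against full-size slots (illustrative figures, assuming that x4 tunnel; real enclosures lose a bit more to protocol overhead) looks like this:

```cpp
// Rough comparison of a Thunderbolt 3 eGPU link vs. full-size PCIe 3.0 slots,
// per direction. Assumes the enclosure tunnels roughly a Gen3 x4 link.
#include <cstdio>

int main() {
    const double lane_GBps = 8.0 * (128.0 / 130.0) / 8.0; // ~0.985 GB/s per Gen3 lane

    printf("Thunderbolt 3 eGPU (~Gen3 x4): ~%4.1f GB/s\n", 4 * lane_GBps);
    printf("Desktop slot, Gen3 x8:         ~%4.1f GB/s\n", 8 * lane_GBps);
    printf("Desktop slot, Gen3 x16:        ~%4.1f GB/s\n", 16 * lane_GBps);
    return 0;
}
```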
 
#3 ·
Quote:
Originally Posted by CrazyElf View Post

Edit: Oh, and we finally have a solution that can average more than 60 fps at 4K on Crazy settings in Ashes of the Singularity: Escalation. The problem, of course, is the cost, and the fact that the 0.1% lows are still hovering around 20 fps, with no gains from the second GPU.
That would be the game engine doing something that takes significant CPU time, causing a bottleneck that prevents the GPUs from being fed data any faster during that calculation.
 
#4 ·
This puts people wanting this setup in a weird situation: go for the 8700K, which is the best gaming CPU but is limited on PCIe bandwidth, or buy the more workstation-style 7900X, with no bandwidth problems but clearly less CPU performance...

In the end, SLI, especially with a card this expensive, is not very practical these days, so... it's interesting to see - it shows that the new PCIe generation, with double the bandwidth per lane, could soon be beneficial in some dual-GPU cases.
 
#5 ·
Quote:
Originally Posted by HMBR View Post

This puts people wanting this setup in a weird situation: go for the 8700K, which is the best gaming CPU but is limited on PCIe bandwidth, or buy the more workstation-style 7900X, with no bandwidth problems but clearly less CPU performance...

In the end, SLI, especially with a card this expensive, is not very practical these days, so... it's interesting to see - it shows that the new PCIe generation, with double the bandwidth per lane, could soon be beneficial in some dual-GPU cases.
The 7900X is not less performant. I am running it at 4.7 GHz daily, and I can go up to 4.8, which removes any possible bottlenecks.
 
#8 ·
Quote:
Originally Posted by Nautilus View Post

At x16 it does not make any difference, single vs. SLI - no bottleneck.
That's not what I'm saying... two Titan Vs @ x16 are 14% faster than two Titan Vs @ x8. But for a single Titan V, x16 vs. x8 makes no difference.
Edit: ah, it's because they don't use an SLI bridge - they rely on that DX12 multi-GPU thing (can't remember what it's called, lol).
 
#9 ·
Quote:
Originally Posted by Nautilus View Post

The 7900X is not less performant. I am running it at 4.7 GHz daily, and I can go up to 4.8, which removes any possible bottlenecks.
Skylake-X will always perform worse in high-refresh-rate scenarios than the older ring-bus CPUs. In this case the 8700K would overtake it in performance, and it clocks higher than your 7900X on average.
 
#10 ·
Hi,
Here are some more practical numbers on how PCIe and bridge bandwidth can influence performance/scaling when games are maxed out (especially with a G-SYNC panel, which puts extra load on the required bandwidth). They are about a year and a half old by now, but they show very well that a mainstream socket with PCIe 3.0 x8/x8 (simulated by running Gen2 x16/x16) is not sufficient in some cases, especially at high resolution with temporal AA and adaptive sync. So PCIe 4.0 x8/x8 is going to be a big boost for upcoming mainstream sockets. The interesting thing is that mGPU under DX12 does not take the same scaling hit from reduced bandwidth - I saw some good numbers on YouTube, for example - but then there are not really that many titles on the market supporting mGPU under that API.

These tests were made by a friend of mine (Blaire on the German 3DCenter forums), who is also a beta tester for NVIDIA drivers, so you can call him very experienced.


Yes, the socket is not really up to date, but with 4K plus GW effects and temporal AA you are GPU-limited in SLI.


Some hard cases - you can clearly see that when using a flex bridge and Gen2 x16/x16, the scaling does not do well:

Original post (in German)

Another new game showing how important PCIe bandwidth is: Hellblade, with UE4 and temporal AA. With Blaire's custom SLI bits you get very good scaling and consistent frametimes (~50-70% scaling, as always when games use temporal AA), but some users running 3.0 x8/x8 systems with SLI reported no scaling at all - it turned out to be a bit better when disabling G-SYNC. In fact, as you can see, the frametimes in mGPU are much smoother than with a single GPU! This was with 2-way TITAN X (Pascal).


Have a look at this comment and the replies:
1080 in SLI... made absolutely zero difference.

Regards,
Edge
 
#12 ·
GN stated at 1:00 that the 1080 Ti test was single-card. Multi-GPU uses more PCIe bandwidth per card. They may have reached the same conclusion had they tested SLI 1080 Tis, just to a lesser extent.

I tested this too, just a bit, by tossing my PhysX GPU into a slot that forced an x16/x0 pair down to two x8s, which made my SLI run x16/x8. My loss in Deus Ex: Mankind Divided was bigger in DX12; DX11 stayed pretty close to the same. Don't know why, but it did.
 
#13 ·
My 1080 is on a riser out of the second slot. I probably don't have to worry.
 
#14 ·
Most games don't have a problem, but I am currently testing this with a 6850K @ 4.5 GHz and SLI 1080 Tis @ 2 GHz, and in Total War: Warhammer 2 @ 4K DSR there was a 10% difference between PCIe 3.0 and 2.0 x16 (the same speed difference as 3.0 x16 vs. x8). 1080p was partially CPU-limited, which is why that test didn't make much sense.

There are a couple of games which can supposedly lose up to 20% performance with x8 PCIe in SLI: Fallout 4 (-20%), Witcher 3 (-20%), Watch Dogs (-10%):
https://www.forum-3dcenter.org/vbulletin/showpost.php?p=11059897&postcount=2261

Modern SSAO (e.g. HBAO+) and AA solutions (e.g. TSSAA) especially seem to require a lot of bandwidth.
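One plausible reason, sketched below with assumed numbers: under alternate-frame rendering, temporal techniques need last frame's buffers, which the other GPU produced, so a full-resolution history copy has to cross the link (or bridge) every frame. The buffer format and the one-copy-per-frame figure are illustrative assumptions, not measurements:

```cpp
// Rough estimate of the inter-GPU traffic temporal AA can add under AFR:
// each GPU must fetch the history buffer rendered by the other GPU last frame.
// Format (RGBA16F) and one copy per frame are assumptions, not measured values.
#include <cstdio>

int main() {
    const double width = 3840.0, height = 2160.0; // 4K render target
    const double bytes_per_pixel = 8.0;           // assumed RGBA16F history buffer
    const double fps = 60.0;

    const double buffer_MB   = width * height * bytes_per_pixel / 1e6;
    const double traffic_GBs = buffer_MB * fps / 1e3; // one copy per frame

    printf("History buffer: ~%.0f MB -> ~%.1f GB/s at %.0f fps,\n",
           buffer_MB, traffic_GBs, fps);
    printf("a sizeable slice of PCIe 3.0 x8's ~7.9 GB/s before any other traffic.\n");
    return 0;
}
```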
 
#15 ·
Quote:
Originally Posted by pas008 View Post

Did they do this PCIe test on a Titan X with the HB SLI bridge and a ribbon bridge?
They tested the Titan V, which doesn't support SLI.

If you go to https://youtu.be/i8iE_sQBFXk?t=1m57s

AotS uses explicit multi-GPU under DX12, so SLI and CrossFire support isn't needed.

Oops, misunderstood - read below.
 
#16 ·
Quote:
Originally Posted by pas008 View Post

Did they do this PCIe test on a Titan X with the HB SLI bridge and a ribbon bridge?
In their previous test they basically did exactly that, yes. The only difference was that it was not the older Titan X but the somewhat newer and more performant 1080 Ti, which offered pretty much the same performance as the Titan XP v1 at that time (before v2 was released).
 
#17 ·
Quote:
Originally Posted by profundido View Post

In their previous test they basically did exactly that, yes. The only difference was that it was not the older Titan X but the somewhat newer and more performant 1080 Ti, which offered pretty much the same performance as the Titan XP v1 at that time (before v2 was released).
Have a link to that one? Can't find it with my limited internet at work.
 
#18 ·
#19 ·
Quote:
Originally Posted by profundido View Post

I went through the articles on their site and found it:

https://www.gamersnexus.net/guides/2963-intel-12k-marketing-blunder-pcie-lane-scaling-benchmarks

and also:

https://www.gamersnexus.net/guides/2488-pci-e-3-x8-vs-x16-performance-impact-on-gpus

Really nice tests they did there
Thanks.
I know I've seen them before, I just couldn't get access to anything at work.
I was going to hunt them down once I was done at work, but the kiddos got me all sidetracked.

Really wish they hadn't compared different CPUs;
ring vs. mesh is really an inconvenience for this test, to me,
though I know it wouldn't make that much of a difference - the gaps would just possibly get bigger.

I am more interested in how much GPU power it takes before PCIe lane bandwidth and generation actually start to matter.

Sorry, typed fast at work again.
 
#20 ·
Wait, they didn't use SLI bridges? Doesn't that make these tests sorta flawed, since NVIDIA cards aren't really designed for a bridgeless setup? I wonder how using DX12 multi-GPU over PCIe would affect an AMD card? Or does it just bypass any hardware-based SLI/CrossFire bridging?
 
#21 ·
Quote:
Originally Posted by Talon720 View Post

Wait, they didn't use SLI bridges? Doesn't that make these tests sorta flawed, since NVIDIA cards aren't really designed for a bridgeless setup? I wonder how using DX12 multi-GPU over PCIe would affect an AMD card? Or does it just bypass any hardware-based SLI/CrossFire bridging?
It depends on how the program wants to use the multiple GPUs. For typical DX12 multi-GPU you don't actually enable SLI, and thus you don't use an SLI bridge. There are games that do use SLI mode and need the bridge, but the way games like AotS do it, you don't enable SLI at all. You can even mix NVIDIA and AMD GPUs in the same system and it works. In the case of these tests, because of the way the game functions, an SLI bridge would do nothing, so the test is valid for this game. I believe the mode is called "explicit multi-GPU".
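To make that concrete, here is a minimal sketch (Windows-only, trimmed to the enumeration step) of how an explicit multi-GPU title sees the hardware: it enumerates the adapters and creates an independent D3D12 device on each one, and any cross-GPU copies it schedules go over PCIe, so the driver's SLI path - and the bridge - never comes into play:

```cpp
// Minimal sketch: enumerating adapters the way a DX12 explicit multi-GPU title does.
// Each physical GPU gets its own independent device; SLI is never enabled, so any
// inter-GPU transfers the application schedules travel over PCIe, not a bridge.
#include <d3d12.h>
#include <dxgi1_4.h>
#include <wrl/client.h>
#include <cstdio>

using Microsoft::WRL::ComPtr;

int main() {
    ComPtr<IDXGIFactory4> factory;
    if (FAILED(CreateDXGIFactory1(IID_PPV_ARGS(&factory)))) return 1;

    for (UINT i = 0; ; ++i) {
        ComPtr<IDXGIAdapter1> adapter;
        if (factory->EnumAdapters1(i, &adapter) == DXGI_ERROR_NOT_FOUND) break;

        DXGI_ADAPTER_DESC1 desc;
        adapter->GetDesc1(&desc);
        if (desc.Flags & DXGI_ADAPTER_FLAG_SOFTWARE) continue; // skip the WARP adapter

        ComPtr<ID3D12Device> device;
        if (SUCCEEDED(D3D12CreateDevice(adapter.Get(), D3D_FEATURE_LEVEL_11_0,
                                        IID_PPV_ARGS(&device)))) {
            wprintf(L"Independent D3D12 device created on: %s\n", desc.Description);
        }
    }
    return 0;
}
```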
 
#23 ·
Quote:
Originally Posted by okcomputer360 View Post

What about mining and hash rates, where the cards are usually running at or near full capacity? Any bottlenecks between PCIe 2.0 and 3.0?

Thank you
Mining/hashing generally uses very little PCIe bandwidth, and even a 2.0 x1 link won't be a bottleneck for most mining or password-cracking setups.
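For a rough sense of why: the dataset being hashed lives in VRAM, so the bus mostly carries small job packets one way and 32-byte results the other. The packet sizes and job rate below are illustrative assumptions, not measurements of any particular miner:

```cpp
// Rough sketch of mining's PCIe footprint: the dataset stays in VRAM, so only tiny
// job packets and hash results cross the link. All figures are assumed for illustration.
#include <cstdio>

int main() {
    const double pcie2_x1_MBps   = 500.0;  // ~PCIe 2.0 x1, per direction
    const double job_bytes       = 128.0;  // assumed job/work header sent to the GPU
    const double result_bytes    = 32.0;   // one hash/share returned
    const double jobs_per_second = 1000.0; // deliberately generous assumption

    const double traffic_MBps = (job_bytes + result_bytes) * jobs_per_second / 1e6;
    printf("Bus traffic: ~%.2f MB/s of ~%.0f MB/s available on PCIe 2.0 x1\n",
           traffic_MBps, pcie2_x1_MBps);
    return 0;
}
```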