Overclock.net › Forums › AMD › AMD CPUs › The Ryzen Gaming Performance Gap is Mostly Gone

The Ryzen Gaming Performance Gap is Mostly Gone - Page 4

post #31 of 39
Thread Starter 
Yeah you do make interesting points.

A lot has changed in the past few months. SMT no longer has to be disabled; one of the newer AGESA updates has removed that performance penalty.

You are very right though about the diminishing returns past 3200 MHz. To be honest, for Zen+, AMD may want to aim for, say, 3200 MHz, a 50% improvement in the base Infinity Fabric speed, and make it independent of RAM. That would improve gaming performance dramatically. So would adding a separate multiplier (kind of like what Intel has done).

What's really fascinating is that the speed of communication between the cores is what's causing the slowdown. We see this happening in Intel CPUs too.



The 7900X is faster in terms of IPC and it's clocked at 4.6 GHz, but it loses to a 4.4 GHz 6950X in many games. This is caused by the mesh fabric. In Intel's case, though, the Uncore and memory clocks are separate, so overclocking the memory does not make the Uncore faster. To add insult to injury, unlike with, say, a 5960X, no OC socket is present, limiting Uncore overclocks. While the mesh topology is good for multithreaded performance, gaming does suffer.

I think that if AMD were to bump up the inter-core speeds, it would be very helpful. The HyperTransport architecture from which Infinity Fabric was developed was rated to 3.2 GHz; perhaps with overclocking, that could be brought up to 4 GHz. The stock Infinity Fabric operates at just 1066 MHz, a third of that. The interesting question, as you note, is the optimal balance between single-threaded and multi-threaded performance. A higher-clocked fabric would certainly use more power. Perhaps 1600 MHz (the clock of DDR4-3200) is where the law of diminishing returns begins.
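The relationship between a DDR4 kit's rated transfer speed and the fabric clock described above can be sketched in a few lines (a minimal illustration; the function name is mine):

```python
def fabric_clock_mhz(ddr_rating_mts):
    """Infinity Fabric runs at the memory's actual clock,
    which is half the DDR transfer rate in MT/s."""
    return ddr_rating_mts // 2

print(fabric_clock_mhz(2133))  # stock DDR4-2133 -> ~1066 MHz fabric
print(fabric_clock_mhz(3200))  # DDR4-3200 -> 1600 MHz fabric
```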


It would be interesting to see the returns on a 3466 MHz kit with tight timings, or how much can be gained from, say, a 4000 MHz kit stepped down to very tight timings. Ryzen seems to be very timing-sensitive, perhaps because of the design of the CCX communication. I still think that a 1-2 MB L4 cache might have been very helpful. It would have to be small to keep the cost down, but even a small cache might help. It might also help inter-die communication on Threadripper and Epyc successors.






Quote:
Originally Posted by gtbtk View Post

Hi, just checking in from YT. I actually had a half-typed reply to this thread in draft from a while ago. I must have gotten sidetracked and forgotten to come back to it.

Thanks for checking in. You might still have it in your draft folder.








I think we will know soon enough how big the penalty for one CCX is. If AMD is releasing their new APUs next year, the report is that they will have 4 cores (1 CCX) and 11 NCUs (so 704 Vega-like SPs). This could test our hypothesis.
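As a quick sanity check on that SP count: each Vega NCU contains 64 stream processors, so the rumored 11 NCUs work out exactly as stated.

```python
ncus = 11
sps_per_ncu = 64  # a Vega NCU contains 64 stream processors
total_sps = ncus * sps_per_ncu
print(total_sps)  # 704
```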
Trooper Typhoon (20 items)
CPU: 5960X · Motherboard: X99A Godlike · Graphics: 2× MSI 1080 Ti Lightning
RAM: G.Skill Trident Z 32 GB · Storage: Samsung 850 Pro, Samsung SM843T 960 GB, Samsung SV843 960 GB, WD Caviar Black 2 TB · Optical: LG WH14NS40
Cooling: Cryorig R1 Ultimate, 9× Gentle Typhoon 1850 rpm on case
OS: Windows 7 Pro x64 · Monitor: LG 27UD68 · Keyboard: Ducky Legend with Vortex PBT Doubleshot Backlit... · PSU: EVGA 1300W G2
Case: Cooler Master Storm Trooper · Mouse: Logitech G502 Proteus · Audio: Asus Xonar Essence STX · Other: Lamptron Fanatic Fan Controller
post #32 of 39
Quote:
Originally Posted by CrazyElf View Post

Yeah you do make interesting points.

A lot has changed in the past few months. SMT no longer has to be disabled; one of the newer AGESA updates has removed that performance penalty.

You are very right though about the diminishing returns past 3200 MHz. To be honest, for Zen+, AMD may want to aim for, say, 3200 MHz, a 50% improvement in the base Infinity Fabric speed, and make it independent of RAM. That would improve gaming performance dramatically. So would adding a separate multiplier (kind of like what Intel has done).

What's really fascinating is that the speed of communication between the cores is what's causing the slowdown. We see this happening in Intel CPUs too.



The 7900X is faster in terms of IPC and it's clocked at 4.6 GHz, but it loses to a 4.4 GHz 6950X in many games. This is caused by the mesh fabric. In Intel's case, though, the Uncore and memory clocks are separate, so overclocking the memory does not make the Uncore faster. To add insult to injury, unlike with, say, a 5960X, no OC socket is present, limiting Uncore overclocks. While the mesh topology is good for multithreaded performance, gaming does suffer.

I think that if AMD were to bump up the inter-core speeds, it would be very helpful. The HyperTransport architecture from which Infinity Fabric was developed was rated to 3.2 GHz; perhaps with overclocking, that could be brought up to 4 GHz. The stock Infinity Fabric operates at just 1066 MHz, a third of that. The interesting question, as you note, is the optimal balance between single-threaded and multi-threaded performance. A higher-clocked fabric would certainly use more power. Perhaps 1600 MHz (the clock of DDR4-3200) is where the law of diminishing returns begins.


It would be interesting to see the returns on a 3466 MHz kit with tight timings, or how much can be gained from, say, a 4000 MHz kit stepped down to very tight timings. Ryzen seems to be very timing-sensitive, perhaps because of the design of the CCX communication. I still think that a 1-2 MB L4 cache might have been very helpful. It would have to be small to keep the cost down, but even a small cache might help. It might also help inter-die communication on Threadripper and Epyc successors.





 
Quote:
Originally Posted by gtbtk View Post

Hi, just checking in from YT. I actually had a half-typed reply to this thread in draft from a while ago. I must have gotten sidetracked and forgotten to come back to it.

Thanks for checking in. You might still have it in your draft folder.








I think we will know soon enough how big the penalty for one CCX is. If AMD is releasing their new APUs next year, the report is that they will have 4 cores (1 CCX) and 11 NCUs (so 704 Vega-like SPs). This could test our hypothesis.

 

The mesh and the Fabric are conceptually similar and exhibit similar throughput/latency issues. The difference is that you can separately overclock the mesh on the Skylake-X chips and improve performance.

 

A single 4-core CCX in the APUs should actually work better than the 2+2 CCX design the R3 chips are using. In spite of the spec sheet saying 16 MB of L3 cache, Ryzen really has two separate 8 MB L3 caches, one per CCX. For 4 cores, a single CCX eliminates any cross-fabric switching.

 

Ryzen is a v1 product. Of course there are design elements AMD would change if they had their time over again, just as with any newly developed product. I hope the next generation of Zen improves a few things:

 

Firstly, clock the fabric at 2× the memory frequency, or separate the clocks and make the fabric independently overclockable like Intel's. That would give the chips more headroom for optimization in terms of memory interleaving, etc.

 

Secondly, either add an on-die L4 cache, running at the same frequency as the other caches, that is shared and directly accessible from both CCXes, or better yet, make the L3 a monolithic 16 MB cache accessible from both CCX modules. A cache shared between both CCXes would alleviate the cross-fabric thread latency penalty.

 

Thirdly, increase the maximum achievable all-core frequency to 4.5 GHz from the current ~4 GHz. In spite of the myth being perpetuated on the internet that the Intel chips have significantly better IPC (instructions per clock) than Ryzen, they don't. Ryzen's IPC is better than Broadwell-E, about the same as Skylake-X, and only slightly behind Kaby Lake/Coffee Lake. The only reason the Intel chips win in benchmarks like Cinebench single-core is that they have more clock cycles available to process instructions.

 

All of the current-gen CPUs from both Intel and AMD come in at about 25 cycles per point, ±0.5 cycles.
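That "25 cycles per point" rule of thumb can be turned into a rough single-core score estimator (purely illustrative; the function and normalization are my own, based only on the figure quoted above):

```python
def estimate_single_core_score(clock_mhz, cycles_per_point=25.0):
    """Rule of thumb from the post: single-core score scales with
    clock speed at roughly 25 cycles (here, MHz) per point."""
    return clock_mhz / cycles_per_point

print(estimate_single_core_score(4000))  # ~160 for a 4.0 GHz chip
print(estimate_single_core_score(4600))  # ~184 for a 4.6 GHz chip
```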

post #33 of 39
This is a fantastic review! I find that when I look at gaming rigs and test them, once I get out of the 1080p range I see a lot less of a difference.

I personally just finished a TR setup based on the 1900X with two 980 Tis in SLI, and I find the 4K performance to be top notch!
post #34 of 39
Quote:
Originally Posted by robtorbay View Post

This is a fantastic review! I find that when I look at gaming rigs and test them, once I get out of the 1080p range I see a lot less of a difference.

I personally just finished a TR setup based on the 1900X with two 980 Tis in SLI, and I find the 4K performance to be top notch!

As resolution increases, the number of rendered frames drops, meaning less work for the CPU, which does the calculations for each frame, but more work for the GPU, which renders all the pixels being sent to the screen.

 

Would it be possible for you to do a test for me? Do you have a flexible or alternative SLI bridge, so that you could try SLI in PCIe slots 1 and 2, compare that to PCIe slots 1 and 3, and run some 1080p benchmarks with the two configurations?

 

What you would end up testing is the difference between one x16 + one x8 slot both connected to the same PCIe controller, versus two x16 slots connected to two different controllers separated by Infinity Fabric. I am curious to see what, if any, performance difference there is. I don't know for sure, but I suspect that slot 1 + slot 2 may perform better than slot 1 + slot 3, even though the latter are both x16 slots.

post #35 of 39
Too bad AMD rushed the launch and reviewers got beta firmware. Now that poor gaming reputation is gonna dog them for a while.
post #36 of 39
Quote:
Originally Posted by gtbtk View Post

As resolution increases, the number of rendered frames drops, meaning less work for the CPU, which does the calculations for each frame, but more work for the GPU, which renders all the pixels being sent to the screen.

Would it be possible for you to do a test for me? Do you have a flexible or alternative SLI bridge, so that you could try SLI in PCIe slots 1 and 2, compare that to PCIe slots 1 and 3, and run some 1080p benchmarks with the two configurations?

What you would end up testing is the difference between one x16 + one x8 slot both connected to the same PCIe controller, versus two x16 slots connected to two different controllers separated by Infinity Fabric. I am curious to see what, if any, performance difference there is. I don't know for sure, but I suspect that slot 1 + slot 2 may perform better than slot 1 + slot 3, even though the latter are both x16 slots.

In my case, with dual-rank memory and tuned timings @ 3200: no difference.

Maybe it matters if your IF/memory speeds are low. Maybe it matters if you have something faster than a Fury X that can saturate the bandwidth. Maybe my cards just don't have enough power to do so.

All I can say is my scores are spot on, if not better than expected, using both x16 slots.

Maybe it's the fact that I am using UMA, but whatever the case, it seems not to matter.

Honestly... anyone claiming gaming issues on TR who bought it specifically for gaming... well, first and foremost, they're idiots. You do not buy TR for gaming.

Game tests on X370 show more FPS even though TR wins most newer benchmarks overall. (FWIW, most newer 3D benchmarks are CPU tests.)
Edited by chew* - 10/9/17 at 10:00am
post #37 of 39
Meh, I'm an idiot then :) I bought my TR for gaming, and it's going to work out just fine. The CPU just isn't the important thing when it comes to games, never has been, never will be. Sure, if you want to be an elite benchmarker or a twitch gamer @ 720p it's important, but I just want to put some images up on a nice big 4K screen, max details and a minimum 60 fps. I might even use a joypad :P ;) What this platform allows me to do is plug in what I want/need and not have to deal with dark-lane rubbish.

Built my last system in 2013, and if it wasn't for spilling a cuppa on it I wouldn't be here; I'd still be quite happily pootling along with a 3770K for the next few years :D That didn't start out as much of a system, with only one GPU and some spinners, but by the end there were 3 GPUs and NVMe SSDs in there, with no CPU/motherboard upgrade needed, because I bought a decent platform with some scope (Z77 + PLX; I did actually buy another CPU after the incident, as the CPU died. All I could find was a 3570K, and even that ran most games just fine!)

This one will start out similarly humble, but who knows what it will end up as; a lot can change in a few years. One thing that probably won't change is the board/CPU, as there's plenty of capability in there. When I looked at the other lower-end platforms, it felt like they had gone backwards in many respects. Throw in a million RGB LEDs, USB ports and SATA, and that'll sell it to 'em.

Has it cost me more to do this build? Sure. But really, what does the extra boil down to? No more than a weekend out on the beers with mates that I would probably be too drunk to remember anyway, bar the empty wallet :D I can skip one of those, and the embarrassing pictures that come after :D
Edited by sandysuk - 10/9/17 at 4:59pm
post #38 of 39
I'm referring to the 1080p guys ;)
post #39 of 39
Quote:
Originally Posted by chew* View Post
 
Quote:
Originally Posted by gtbtk View Post

As resolution increases, the number of rendered frames drops, meaning less work for the CPU, which does the calculations for each frame, but more work for the GPU, which renders all the pixels being sent to the screen.

Would it be possible for you to do a test for me? Do you have a flexible or alternative SLI bridge, so that you could try SLI in PCIe slots 1 and 2, compare that to PCIe slots 1 and 3, and run some 1080p benchmarks with the two configurations?

What you would end up testing is the difference between one x16 + one x8 slot both connected to the same PCIe controller, versus two x16 slots connected to two different controllers separated by Infinity Fabric. I am curious to see what, if any, performance difference there is. I don't know for sure, but I suspect that slot 1 + slot 2 may perform better than slot 1 + slot 3, even though the latter are both x16 slots.

In my case, with dual-rank memory and tuned timings @ 3200: no difference.

Maybe it matters if your IF/memory speeds are low. Maybe it matters if you have something faster than a Fury X that can saturate the bandwidth. Maybe my cards just don't have enough power to do so.

All I can say is my scores are spot on, if not better than expected, using both x16 slots.

Maybe it's the fact that I am using UMA, but whatever the case, it seems not to matter.

Honestly... anyone claiming gaming issues on TR who bought it specifically for gaming... well, first and foremost, they're idiots. You do not buy TR for gaming.

Game tests on X370 show more FPS even though TR wins most newer benchmarks overall. (FWIW, most newer 3D benchmarks are CPU tests.)

I don't know one way or the other which approach will work better. I was theorizing about SLI/CrossFire and the differences between running the cards on the same die versus across dies. I would have guessed that the AMD cards, without any additional bridge cables, may have struggled a bit on two different dies. I am glad to hear that the two x16 slots are working well for you. You may be right about the Fury not having the power to really push things along.

 

As a matter of interest, on your TR system, have you ever tried the two cards in the x16 and x8 slots connected to the first die?

 

Personally, I don't really give a damn about the games themselves. Sure, they are fun to play every now and then, but I am not a rabid gamer. To me, they are tools, like wPrime, that exercise various components and identify areas that are not behaving as expected. The Fabric is not functioning the way we all assumed it did in the beginning, and the gaming benchmarks gave me insight into how the fabric actually connects the components, how it seems to work, and its strengths and weaknesses, not just in games but in everything we use these systems for.

 

I've discovered that using CPU affinity to pin some applications to appropriate cores gives you the benefits of "Game Mode" and NUMA mode while being able to leave the rest of the system in UMA mode. It lets you have your cake and eat it too.

 

I have also discovered that the location of a PCIe device relative to a particular die matters for performance, particularly for applications that cannot scale to 32 threads. TR and Epyc are likely to see extensive use as VM hosts. Installing passthrough GPUs and shared NVMe storage devices local to the die the VM runs on means more efficient use of that hardware, better performance, and reduced operating costs.
