Overclock.net - An Overclocking Community

Overclock.net - An Overclocking Community (https://www.overclock.net/forum/)
-   AMD CPUs (https://www.overclock.net/forum/10-amd-cpus/)
-   -   Theories on why the SMT hurts the performance of gaming in Ryzen and some recommendations for the future (https://www.overclock.net/forum/10-amd-cpus/1624566-theories-why-smt-hurts-performance-gaming-ryzen-some-recommendations-future.html)

Kuivamaa 03-07-2017 01:01 AM

Ryzen is excellent at BF1 MP.

CrazyElf 03-07-2017 07:33 AM

Well, we have one answer: look at Ryzen with just one CCX enabled. This is a review by someone who disabled one CCX, leaving just 4 cores and 8 threads, and then ran it clock for clock against a 7700K.

http://www.zolkorn.com/reviews/amd-ryzen-7-1800x-vs-intel-core-i7-7700k-mhz-by-mhz-core-by-core/view-all/


Benchmark charts from the review (images not preserved): RealBench, Grand Theft Auto, Battlefield.

Note that the penalty is smaller than with both CCXs enabled. I suspect that with the other CCX disabled, this is a big step forward because it means the L3 becomes Last Level Cache. It'd be interesting to see what would happen if SMT is offline as well. It could go either way. Battlefield is more threaded than most games.

That's good news for the 4 core and Zen APUs.


Edit:

One other thing I need to draw people's attention to: disabling HPET can sometimes improve gaming performance, but AMD Ryzen Master needs HPET for accurate measurements.

http://www.tomshardware.com/reviews/amd-ryzen-7-1800x-cpu,4951-12.html
Quote:
The evening before launch, AMD sent us a list of games that it says should perform well with Ryzen, including Sniper Elite 4, Battlefield 1, Star Wars: Battlefront, and Overwatch, among others. Many of the titles tend to be heavily threaded, which would lend itself well to Ryzen's high core count. We plan on revisiting some of those. Further, AMD suggests adjusting several different parameters for games that suffer from low performance. It recommends using Windows' High Performance power profile (which also helps Intel CPUs). It also says to disable the HPET (High Precision Event Timer), either in your BIOS or operating system, to gain a 5-8% advantage. Our results already reflect HPET disabled, though. Interestingly, AMD's Ryzen Master software requires HPET to “provide accurate measurements,” so you may find yourself toggling back and forth for the best experience.

Going to update the OP on that later.


Quote:
Originally Posted by superstition222

The worst Ryzen gaming results are typically with those that favor single-threaded performance and fewer threads being heavily used, right? So, a quad that doesn't deal with the CCX would be most optimal? Things like Dolphin would also benefit from higher clocks and fewer cores/threads.

Is there a way to have a quad (half of Ryzen) and have 8 threads via SMT or does that involve the CCX1 to CCX2 latency issue? A 4/8 part with a high enough clock that doesn't have the CCX to CCX latency issue should be pretty competitive.

I wonder if Zen+ will have an eDRAM L4.


Either eDRAM L4 or even some on die L4. We don't need that much. Perhaps just 20 MB will do. That way there's no latency penalty for off die communications.

The only other way to do it would be to line up the caches together. However, that would mean abandoning the CCX configuration in favor of something that looks more like the 5960X, and it would be costly to redesign.

In my OP, I proposed using HBM as a solution as well. It's got high bandwidth and the latency is better than going off package, but it would be costly.


Quote:
Originally Posted by Kuivamaa

No, Ryzen game performance seems to depend on engine sensitivities and nuances. Single-thread performance is strong and MT even more so, yet there are both poorly and well threaded games that run well and less well on Ryzen. It comes down to what the engine is doing and whether it touches non-optimized areas.

The ideal would be that:

  1. Any single threaded games that don't scale with cores keep their threads within 1 CCX.
  2. Any games that do scale need to be optimized across the 2 CCXs to minimize inter-CCX communication.
  3. For Zen+ of course, we need a better solution; ideally an on die last level cache.
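
As a rough illustration of point 1, a game could be restricted to the logical processors of one CCX with a CPU affinity mask. The sketch below only computes the mask. It assumes an 8-core Ryzen with 4 cores per CCX, and it assumes the OS numbers logical processors CCX by CCX with the two SMT siblings of a core next to each other; that numbering should be checked on a real system. The function name `ccx_affinity_mask` is made up for this example.

```python
def ccx_affinity_mask(ccx_index, cores_per_ccx=4, smt=True):
    """Return a bitmask selecting every logical CPU of one CCX.

    Assumption (not from the thread): logical CPUs are numbered CCX by CCX,
    with the two SMT siblings of each core adjacent to each other.
    """
    threads_per_core = 2 if smt else 1
    cpus_per_ccx = cores_per_ccx * threads_per_core
    first_cpu = ccx_index * cpus_per_ccx

    mask = 0
    for cpu in range(first_cpu, first_cpu + cpus_per_ccx):
        mask |= 1 << cpu  # set the bit for this logical CPU
    return mask


if __name__ == "__main__":
    for ccx in (0, 1):
        print(f"CCX{ccx}: {ccx_affinity_mask(ccx):X}")
```

On Windows the resulting hex value could be handed to `start /affinity <mask> game.exe` (with `game.exe` as a placeholder), which keeps the process on one CCX without disabling the other in the BIOS.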


Quote:
Originally Posted by gtbtk


You need to take a look at this (not mine); it is from Ryzenchrist here on OCN.

http://valid.x86.fr/qmfrkd


Yes, but that is probably not in PCIe 3.0. At higher Base Clock frequencies (or REFClock) Ryzen goes into PCIe 2.0 or even 1.0.

It's all in the guide:
https://www.overclock.net/t/1624603/rog-crosshair-vi-overclocking-thread/0_100

Page 2 in the PDF.




While modern GPUs don't suffer much of a penalty in PCIe 2.0, it would be a serious penalty for any NVMe SSD (like the SSD 750, if you have the PCIe version). If you were running an SSD 750 with Ryzen, there would be some penalty on top of that, because the GPU would then be running at PCIe 2.0 x8.
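
To put rough numbers on that, here is a small sketch of theoretical PCIe link bandwidth per generation. The signalling rates and line encodings are the standard ones (2.5 and 5 GT/s with 8b/10b, 8 GT/s with 128b/130b). The x4 link for the SSD is an assumption for illustration, and the figures ignore protocol overhead, so real throughput is lower.

```python
# Per-lane signalling rate in GT/s and line-encoding efficiency per PCIe generation.
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),      # 8b/10b encoding
    2: (5.0, 8 / 10),      # 8b/10b encoding
    3: (8.0, 128 / 130),   # 128b/130b encoding
}


def pcie_bandwidth_gbs(generation, lanes):
    """Theoretical one-direction link bandwidth in GB/s (no protocol overhead)."""
    rate_gt, efficiency = PCIE_GENERATIONS[generation]
    return rate_gt * efficiency * lanes / 8  # divide by 8: bits -> bytes


if __name__ == "__main__":
    # Assumed link widths: x4 for an NVMe SSD, x8 and x16 for a GPU.
    for gen in (1, 2, 3):
        for lanes in (4, 8, 16):
            print(f"PCIe {gen}.0 x{lanes}: {pcie_bandwidth_gbs(gen, lanes):.2f} GB/s")
```

Dropping from PCIe 3.0 to 2.0 roughly halves the ceiling of an x4 link, which is why a fast NVMe drive would feel it before a GPU does.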

gtbtk 03-07-2017 08:42 AM

Quote:
Originally Posted by mozmo

Ryzen will never be as good as intel in heavily coherent memory sharing applications.

The L3 in Ryzen is a victim cache not inclusive, it's broken into 2(CCX) and acts like a cluster on die chip. The L3 is not the LLC in the system like the L3 is in intel designs. This means any coherent locks/dependent memory sharing is going to be much slower than intel because a lot of the time it will need to go through slower DDR4 to ensure memory coherency.

This is why it falls behind in gaming, gaming depends on coherency and memory sharing a lot more. Improving the windows scheduler to recognize the clusters will help somewhat but you'll still hit scaling issues if a thread from CCX1 need to share data to a thread on CCX2. The bandwidth between these 2 is only 22gb/s and not fast, you're looking at around 50-100ns of pipeline stall vs 10ns on intel.

That analysis ignores the fact that workloads without CPU/GPU interaction (Cinebench, RealBench, SuperPi, etc.) perform well yet have to deal with the same cache design and limitations. Impaired performance without interaction with a GPU is simply not the case. If what you said were correct, performance would be bad in every scenario, whether you are gaming or not.

What everyone is ignoring is that the core processing unit is supported by a system on a chip that provides the memory controller and the PCIe controller for the x16 lanes that connect to the GPU. The only thing between the CPU and the GPU, other than wire that only carries the data and timing signalling, is that SOC PCIe controller.

Getting faster memory speeds has also been problematic. Memory clocks in the 3000 MHz and above range increase stress on the memory controller in the SOC. That suggests the performance issues are caused by the SOC not being tuned for optimal performance, which creates the bottleneck between the CPU and GPU as well as between the CPU and the installed memory. Underperformance of the SOC can also explain why 4 sticks of RAM, which put even more load on the memory controller, perform worse than 2 sticks. SMT is also managed by the SOC as far as I am aware, and even switching the L3 cache and accessing the DDR4 relies on that SOC memory controller.

The performance challenges should be resolvable by tuning firmware settings, be it manually by changing settings in current BIOSes or with changed defaults baked into new versions from the motherboard vendors.

The thing that I do not understand is why not one of the reviewer types seems able to work this out. They keep trying to blame the architecture instead of working through the processing pipeline that applies when you are gaming.


btupsx 03-17-2017 02:43 PM

Absolutely outstanding thread, everyone. Best thread I've seen on OCN in some time.

Echoing a couple of thoughts already espoused, and agree with most of the analysis so far.
Quote:
Originally Posted by CrazyElf

Well, we have one answer: look at Ryzen with just one CCX enabled. This is a review by someone who disabled one CCX, leaving just 4 cores and 8 threads, and then ran it clock for clock against a 7700K.

http://www.zolkorn.com/reviews/amd-ryzen-7-1800x-vs-intel-core-i7-7700k-mhz-by-mhz-core-by-core/view-all/


Benchmark charts from the review (images not preserved): RealBench, Grand Theft Auto, Battlefield.

Note that the penalty is smaller than with both CCXs enabled. I suspect that with the other CCX disabled, this is a big step forward because it means the L3 becomes Last Level Cache. It'd be interesting to see what would happen if SMT is offline as well. It could go either way. Battlefield is more threaded than most games.

That's good news for the 4 core and Zen APUs.


Yes, FINALLY concrete data with just one CCX properly enabled. I agree this bodes extremely well for the quad Ryzens, as well as the forthcoming APUs. This also makes the R5 quads the probable (undisputed?) high performance gaming value chip king until the CCX communication and DRAM/bclk nuances are completely probed.

Most importantly, it would seem to offer the best confirmation yet that DRAM is indeed being used as a last level L4 cache, and that the true bottleneck is the Infinity Fabric structure, at least in current form. *A LOT* of low hanging fruit here (and interesting engineering choices) for AMD to make in optimizing Zen 2 & Zen 3 iterations. Do they integrate a kind of eDRAM solution on-die? Do they locate a separate HBM cache physically close to the socket? Do they forgo the current unified bclk design in favor of a more distributed arrangement?

This much is clear: IMC binning, DRAM speed/timings, high level BIOS refinement, and motherboard tolerances to high speed DRAM will all be the name of the game for this iteration of Zen, critical to obtaining the best performance out of the silicon. It would also be best habit to stick to two DIMMs, so I can see high capacity dual DIMM kits becoming de rigueur for any enthusiast.
Quote:
Originally Posted by gtbtk

That analysis ignores the fact that workloads without CPU/GPU interaction (Cinebench, RealBench, SuperPi, etc.) perform well yet have to deal with the same cache design and limitations. Impaired performance without interaction with a GPU is simply not the case. If what you said were correct, performance would be bad in every scenario, whether you are gaming or not.

What everyone is ignoring is that the core processing unit is supported by a system on a chip that provides the memory controller and the PCIe controller for the x16 lanes that connect to the GPU. The only thing between the CPU and the GPU, other than wire that only carries the data and timing signalling, is that SOC PCIe controller.

Getting faster memory speeds has also been problematic. Memory clocks in the 3000 MHz and above range increase stress on the memory controller in the SOC. That suggests the performance issues are caused by the SOC not being tuned for optimal performance, which creates the bottleneck between the CPU and GPU as well as between the CPU and the installed memory. Underperformance of the SOC can also explain why 4 sticks of RAM, which put even more load on the memory controller, perform worse than 2 sticks. SMT is also managed by the SOC as far as I am aware, and even switching the L3 cache and accessing the DDR4 relies on that SOC memory controller.

The performance challenges should be resolvable by tuning firmware settings, be it manually by changing settings in current BIOSes or with changed defaults baked into new versions from the motherboard vendors.

The thing that I do not understand is why not one of the reviewer types seems able to work this out. They keep trying to blame the architecture instead of working through the processing pipeline that applies when you are gaming.

As with most things, the simplest answer is most likely the optimal one: They didn't have any idea where to look. Aside from AnandTech, most popular sites/reviewers aren't truly boned up on the latest EI trends and uarch designs.

CrazyElf 03-18-2017 08:35 AM

Thanks for the compliments, everyone.

Quote:
Originally Posted by btupsx

Absolutely outstanding thread, everyone. Best thread I've seen on OCN in some time.

Echoing a couple of thoughts already espoused, and agree with most of the analysis so far.
Yes, FINALLY concrete data with just one CCX properly enabled. I agree this bodes extremely well for the quad Ryzens, as well as the forthcoming APUs. This also makes the R5 quads the probable (undisputed?) high performance gaming value chip king until the CCX communication and DRAM/bclk nuances are completely probed.

Most importantly, it would seem to offer the best confirmation yet that DRAM is indeed being used as a last level L4 cache, and that the true bottleneck is the Infinity Fabric structure, at least in current form. *A LOT* of low hanging fruit here (and interesting engineering choices) for AMD to make in optimizing Zen 2 & Zen 3 iterations. Do they integrate a kind of eDRAM solution on-die? Do they locate a separate HBM cache physically close to the socket? Do they forgo the current unified bclk design in favor of a more distributed arrangement?

This much is clear: IMC binning, DRAM speed/timings, high level BIOS refinement, and motherboard tolerances to high speed DRAM will all be the name of the game for this iteration of Zen, critical to obtaining the best performance out of the silicon. It would also be best habit to stick to two DIMMs, so I can see high capacity dual DIMM kits becoming de rigueur for any enthusiast.
As with most things, the simplest answer is most likely the optimal one: They didn't have any idea where to look. Aside from AnandTech, most popular sites/reviewers aren't truly boned up on the latest EI trends and uarch designs.



We are expecting to see 32 GB DIMMs in the near future, so 2x 32 makes 64 GB of RAM. I mean in theory, they can already make 128 GB DIMMs, but probably not overclocked.

I proposed an L4 cache, eDRAM, or perhaps even HBM might work for higher end designs.



If you haven't read my other thread:
https://www.overclock.net/t/1625187/the-ryzen-gaming-performance-gap-is-mostly-gone/0_100


Some games love RAM, by the way:




We need that RAM controller unlocked ASAP. Ideally we could get DDR4 >4000.
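
For a sense of what higher memory speeds buy in raw bandwidth, here is a minimal sketch of theoretical peak DDR4 throughput. It assumes a dual-channel setup with the standard 64-bit channel width and ignores timings and real-world efficiency; the function name is made up for this example.

```python
def ddr4_peak_bandwidth_gbs(transfer_rate_mts, channels=2, bus_width_bits=64):
    """Theoretical peak bandwidth in GB/s for a given DDR4 transfer rate (MT/s)."""
    bytes_per_transfer = bus_width_bits / 8
    return transfer_rate_mts * 1e6 * bytes_per_transfer * channels / 1e9


if __name__ == "__main__":
    # Example speed grades chosen for illustration.
    for speed in (2133, 2666, 3200, 4000):
        print(f"DDR4-{speed}: {ddr4_peak_bandwidth_gbs(speed):.1f} GB/s")
```

This only covers bandwidth; latency depends on the timings as well, so a faster kit with loose timings is not automatically better.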

superstition222 03-18-2017 01:47 PM

Quote:
Originally Posted by CrazyElf

I proposed an L4 cache, eDRAM, or perhaps even HBM might work for higher end designs.
People have been wanting L4 for a long time.

Intel’s Skylake lineup is robbing us of the performance king we deserve

