Theories on why the SMT hurts the performance of gaming in Ryzen and some recommendations for the future - Overclock.net - An Overclocking Community

post #1 of 56 (permalink) Old 03-02-2017, 05:39 PM - Thread Starter
Meeeeeeeow!
 
CrazyElf's Avatar
 
Join Date: Dec 2011
Location: Ontario, Canada
Posts: 2,227
Rep: 428 (Unique: 299)
In gaming, there is a performance penalty of around 10% (give or take) when Simultaneous Multithreading (SMT) is enabled. Why is this?


The SMT Implementation

When Intel released Hyper-Threading for the first time, there were actually performance penalties. There seem to be similar penalties with AMD right now.



The 3 queue resources (in green) are shared. That means that they are not duplicated in SMT.

I suspect that when SMT is on, they are essentially halved. While sufficient for a full core with no SMT, they are probably bottlenecking the SMT implementation. In essence, AMD repeated Intel's mistake with HT.

This cannot be fixed with any BIOS update, although perhaps they could find a few ways to mitigate in microcode.



The CCX

This is the Ryzen CPU.



There are 2 distinct 4 core clusters, making 8 cores. Each of these is called a CCX.




Communication between the 4 cores within a CCX is very fast. The cores each have their own L1 and L2 cache, plus a shared L3 cache in 4 slices (8 MB shared amongst the 4 cores, kind of like a 4 core CPU). One notable difference versus Intel CPUs is that the L3 is a victim cache: it is filled with lines evicted from L2, rather than being an inclusive cache that collects data from prefetch/demand requests, as on Intel CPUs. Note of course the larger than usual L2 cache to make up for this.
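To make the victim cache idea concrete, here is a toy model in Python. It is only a sketch: the class name, the LRU replacement policy and the tiny cache sizes are my own assumptions for illustration, not how Zen's real L2/L3 are built. The only point is that the L3 gets filled by lines pushed out of L2, not by the original memory request.

```python
from collections import OrderedDict


class VictimCacheModel:
    """Toy L2 + victim L3 model (hypothetical, LRU replacement assumed)."""

    def __init__(self, l2_lines, l3_lines):
        self.l2 = OrderedDict()  # oldest entry first
        self.l3 = OrderedDict()
        self.l2_lines = l2_lines
        self.l3_lines = l3_lines

    def access(self, addr):
        if addr in self.l2:
            self.l2.move_to_end(addr)  # refresh LRU position
            return "L2 hit"
        if addr in self.l3:
            del self.l3[addr]  # line moves back up into L2
            result = "L3 hit"
        else:
            result = "miss"  # would be served from DRAM
        self._fill_l2(addr)
        return result

    def _fill_l2(self, addr):
        self.l2[addr] = True
        if len(self.l2) > self.l2_lines:
            victim, _ = self.l2.popitem(last=False)
            self.l3[victim] = True  # the L3 is filled only by L2 victims
            if len(self.l3) > self.l3_lines:
                self.l3.popitem(last=False)


cache = VictimCacheModel(l2_lines=2, l3_lines=4)
results = [cache.access(a) for a in ["A", "B", "C", "A", "C"]]
print(results)
```

In this run, "A" only shows up in the L3 after "C" pushes it out of the L2, which is why the second access to "A" is an L3 hit and not an L2 hit.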



But communication between the 2 CCXs is much less so, and there is a big performance penalty. The penalty is in both bandwidth and latency. How it works is that there is a link between the 2 CCXs. The link that AMD currently uses is called "Infinity Fabric", which is basically an upgraded HyperTransport design. This Infinity Fabric appears to run at RAM speed. When one CCX needs data that the other CCX may have, the Infinity Fabric checks the L3 cache of the other CCX and at the same time requests the data from the memory controller. In most cases, the memory controller request will be cancelled because the data will be in the other CCX's L3 cache. However, if it is not, then DRAM is the "true" last level cache. AMD claims that the Infinity Fabric has a bandwidth of 22 GB/s.

I'm actually concerned about that. For comparison, the QPI on Haswell E is about 38.4 GB/s: QPI operates at 4.8 GHz on Haswell E x 2 for Double Data Rate x 16 (the link is 20 bits wide, but only 16 bits carry data) x 2 (bidirectional) / 8 (since there are 8 bits per byte) = 38.4 GB/s. That's almost double AMD's quoted 22 GB/s and that's for a 2P socket! For Skylake (Purley), Intel plans an even faster interconnect called UltraPath Interconnect (UPI) technology (also known as KTI or Keizer Technology). It is reported to have 9.6 GT/s or 10.4 GT/s transfer speeds and it should support many requests per message, so it should see efficiency gains. The point is that on Intel, off-die communication between 2 CPUs has more bandwidth than on-die communication between 2 CCXs!

Edit: It is worth mentioning that Intel lists the Haswell EP Xeons as 9.6 GT/s on QPI (the 9.6 GT/s already includes the double data rate), which means 9.6 GT/s x 16 (data link is 16 bits; really 20 for data integrity) / 8 bits per byte x 2 for both directions = 38.4 GB/s
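For anyone who wants to check the arithmetic, here it is as a few lines of Python. The function name is just something I made up for this sketch; the inputs are the QPI figures quoted above and AMD's quoted 22 GB/s.

```python
def link_bandwidth_gb_s(transfers_gt_s, data_bits, directions=2):
    """Peak bandwidth in GB/s for a link (hypothetical helper for this post)."""
    bytes_per_transfer = data_bits / 8  # 8 bits per byte
    return transfers_gt_s * bytes_per_transfer * directions


qpi_clock_ghz = 4.8
qpi_gt_s = qpi_clock_ghz * 2  # double data rate: 2 transfers per clock = 9.6 GT/s

qpi_gb_s = link_bandwidth_gb_s(qpi_gt_s, data_bits=16)
infinity_fabric_gb_s = 22.0  # figure quoted by AMD, per the post

print(qpi_gb_s)                                   # 38.4
print(round(qpi_gb_s / infinity_fabric_gb_s, 2))  # 1.75
```

Both ways of writing it (4.8 GHz with DDR, or 9.6 GT/s directly) land on the same 38.4 GB/s, roughly 1.75x the quoted Infinity Fabric number.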

Infinity Fabric will also be used in Vega.




That means this is not like having 1 big monolithic die. This is like having 2 fast 4 core CPUs.

For a comparison, here's Broadwell E, which uses a distinct "ring" design:




So there isn't a performance penalty on Intel CPUs for communication between cores because it is one "ring" rather than 2 CCXs. While communicating within that ring will be slower (since data has to travel halfway across the "ring" in the worst case), it also means that the CPU acts as 1. By contrast, with AMD's solution, communication within a CCX is much faster, but communication between CCXs is really slow. Apparently AMD made this design to be scalable (they just don't have as many engineers as Intel).

Right now there is a performance penalty because Windows is treating this like a monolithic die, rather than 2 separate CPU complexes, which is what this really is.

Think of it as two 4-core CPUs like on a 2P socket, not one 8-core CPU. There is no real way to "fix" this issue, since it's inherent in the design, although a Windows update/updated Linux kernel would be very good.




Other

RAM Speeds
We learned a while back that there was a RAM slowdown.

https://www.overclock.net/t/1624058/dvhardware-amd-ryzen-has-issues-with-high-frequency-ddr4-fix-expected-in-1-2-months/200

That might be an issue. When the 6700K was first released, it was often slower in gaming than the 4790K, despite Skylake having a better IPC than Haswell. The reason was the poor speed and loose timings of early DDR4. The interesting thing here is that Zen does very well at workstation benchmarks (better than Broadwell E in many cases and perhaps even better than a hypothetical Skylake E), which makes this a very likely culprit.


Clockspeeds

One of the reasons why Ryzen is cheap is that AMD used High Density Libraries (HDL). That allows for a smaller die area and reduced power consumption. The penalty is clockspeed. For similar reasons, GPUs like the 290X do not have as much overclocking headroom as, say, a 7970 might. You can put more transistors in a given area, but at the expense of clocks, due to the power density (power rises much faster than linearly with clockspeed). Actually, that reminds me, one of the reasons why Kaby Lake clocks faster than Skylake by about 300 MHz is that Kaby Lake is less dense.


This design may very well be why Ryzen cannot overclock more. Actually 4 GHz is already very good considering this.
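A rough way to see why power density bites so quickly when overclocking: the usual first-order rule for dynamic power is P ~ C x V^2 x f. The sketch below just plugs numbers into that rule. The 10% figures are made-up illustration values, not Ryzen measurements, and the function name is hypothetical.

```python
def relative_dynamic_power(freq_ratio, voltage_ratio):
    """Dynamic power relative to stock, using the first-order rule P ~ C * V^2 * f.

    Capacitance C is assumed unchanged, so it cancels out of the ratio.
    """
    return (voltage_ratio ** 2) * freq_ratio


# Hypothetical example: +10% clock that needs +10% voltage to stay stable.
print(round(relative_dynamic_power(1.10, 1.10), 3))  # 1.331
```

So under that assumption a 10% overclock costs about 33% more dynamic power in the same die area, and a denser die has less room to get rid of it.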

Voltage Integration
For those who remember, Haswell introduced FIVR, which integrated the voltage regulator on the CPU package, rather than the motherboard. I think that the LDO on Zen is bypassed on consumer boards, so this is a non-issue. That means voltage regulation takes place entirely on the motherboard.

Unlike Carrizo, I don't see this as a bottleneck, unless the integrated regulator is in use somehow.

Fun fact: The voltage integration design is called Zeppelin.

Uncore
The uncore (cache) does not seem to be clocked separately from the core, unlike on Intel CPUs. If this is bottlenecking clockspeeds and not the HDL, then we may be cache speed limited rather than core limited. This might explain Ryzen's weak OCs.

What I don't know is if it is the cache or the HDL that is limiting OCs. If it is the cache, then we may be able to get a few hundred MHz from splitting this out.

Base Clock
Much like Sandy Bridge, the Base Clock is closely tied to everything else, so overclocking it is likely to introduce instability. I'd guess that past 105 MHz on PCIe 3.0, there may be instability.

I would like to see this separated (kind of like what Intel did with Skylake - they separated the CPU baseclock from the rest of the board). I would also like a "strap" function like on Intel boards to be added for unlocked CPUs.



4 cores don't suffer from the CCX communication problem

With only 1 CCX, the 4 core Zen CPUs will not suffer from this problem. Actually, for a mid-range system, a 4 core Ryzen CPU with SMT disabled would be a very good value.

Once the RAM speeds are resolved and with SMT disabled, the main flaws are not a problem at all.

This works for a budget system, an APU, and may be an advantage on a laptop.




You may still be better off with Ryzen

Keep in mind that most games are GPU bottlenecked, not CPU bottlenecked.

You could buy a 6900K + an X99 motherboard + 1 GPU. Alternatively, you could buy a 1800X + X370 board + 2 GPUs for CF/SLI. In games that support CF or SLI, that would be an advantage and keep in mind you are not CPU bottlenecked.

The main drawback of course is multi-GPU issues. The other is of course where you do have CPU bottlenecks. Many strategy games (like the Total War series, simulator games) and the Battlefield series (especially in multiplayer on large maps) are CPU bottlenecked.


We really need a review of Zen vs X99 at 4k.



AVFS

Ryzen uses AVFS, much like Carrizo.



That may be a big part of the power savings of Ryzen.

I don't know if this has any impact on the overclock headroom, but the top Polaris chips also used AVFS and the best ones (the XFX RX 480 GTR Black comes to mind) could go past 1500 MHz at times, provided you get lucky with the silicon lottery. We may see clocks improve as Zen matures.


AVX/FMA

Currently AMD's AVX/FMA does not scale as well as Intel's.
https://forums.anandtech.com/threads/ryzen-strictly-technical.2500572/

Not many games use these instructions, but it is a point to note. It may affect productivity though, depending on your workload.






Conclusions
Keep in mind the value proposition: you may still be better off with Zen. Also keep in mind the possibilities of the 4 core CPUs for budget systems and APUs.

The combination of these problems means that Zen cannot be as fast as Intel's HEDT in terms of gaming (let's assume that a 6800K/5820K + an X99 board is the approximate peer of an 1800X + an X370 board; the 1800X will be more expensive, but a good X99 board will be more expensive than an X370 board, due to the 40 PCIe lanes, quad channel RAM, and more complex chipset). Combine this with weak OC headroom (either due to the cache or the HDL) and you have an explanation.

That said, any game that uses more cores will mean that the 8 Core Ryzen should destroy the Skylake and Kaby Lake Intel CPUs. Keep in mind that with DX12 and Vulkan it may be more future proof to get Ryzen. Oh and Ryzen can destroy Intel at content creation - even the 6900K cannot keep up.


I think that we need a patch in Windows and in the next Linux kernel for the 2 CCXs to be treated like 2 separate CPUs. Essentially this would eliminate much of the performance penalty of Ryzen's CCX design, which causes data to miss the L3 cache and go to RAM.

Maybe a microcode update could mitigate some of these problems, but they are inherent in the hardware, so I am unsure of how much it will gain.

They need to get the BIOS updates for the RAM out ASAP.

For gamers, disable the SMT when gaming.

From a programming POV, it may make sense to treat each 4 core CCX like a mini-NUMA cluster. Then communication between CCXs is minimized, keeping data in L3 rather than using the Infinity Fabric and DDR4 as Last Level Cache.
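As a rough illustration of what "treat each CCX like a mini-NUMA node" could look like from user space, here is a Python sketch that pins a process to the logical CPUs of one CCX. Big caveats: the helper names are made up, `os.sched_setaffinity` only exists on Linux, and the CPU numbering (CCX 0 = the first block of logical CPUs, SMT siblings adjacent) is purely my assumption. Check the real topology on your own system (for example with lscpu) before relying on it.

```python
import os

CORES_PER_CCX = 4  # from the post: each CCX has 4 cores


def ccx_cpu_ids(ccx_index, smt_enabled=True, cores_per_ccx=CORES_PER_CCX):
    """Logical CPU ids assumed to belong to one CCX.

    ASSUMPTION: logical CPUs are numbered CCX by CCX, with both SMT threads
    of a core inside the same block. Real systems may enumerate differently.
    """
    threads_per_core = 2 if smt_enabled else 1
    block = cores_per_ccx * threads_per_core
    first = ccx_index * block
    return set(range(first, first + block))


def pin_current_process_to_ccx(ccx_index, smt_enabled=True):
    """Restrict this process to one CCX; returns the CPU set actually applied."""
    wanted = ccx_cpu_ids(ccx_index, smt_enabled)
    if not hasattr(os, "sched_setaffinity"):  # not Linux: do nothing
        return set()
    target = wanted & os.sched_getaffinity(0)  # only CPUs we are allowed to use
    if target:
        os.sched_setaffinity(0, target)
    return target


print(sorted(ccx_cpu_ids(0)))                     # CCX 0 with SMT on
print(sorted(ccx_cpu_ids(1, smt_enabled=False)))  # CCX 1 with SMT off
```

A game with 4 or fewer heavy threads could call something like `pin_current_process_to_ccx(0)` at startup so its threads share one L3 and never cross the Infinity Fabric. The trade-off is that it can then never use the other 4 cores.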





Zen+ Ideas
  • AVX and FMA performance need a boost in the future
  • Increase the resources for the queues so as to prevent penalties with SMT.
  • A higher speed interconnect between the CCXs so that they don't have to go to DRAM as the Last Level Cache, and perhaps further improvements to Infinity Fabric. I think that Zen would benefit from an L4 eDRAM cache. Perhaps a future version could feature HBM.
  • Another idea might be to isolate the cache, Bclk and core clocks. On Skylake, for example, you have your Core speed, then Uncore. On Zen there's no separate Uncore. If the Uncore is holding back clocks, then perhaps the Core could go a bit faster once they are separated. As for the base clock, on Skylake Intel introduced the ability for the Bclk of the CPU to be separate from the rest of the motherboard.

By high speed interconnect, I mean something like this. It is a 24 core Broadwell-EP HCC design, with 2 "rings" of 12 cores.



Note the 2 buses between the 2 "rings" that allow for a high speed connection. Without these, anything between the 2 rings would have to go to the DRAM, exacting a huge performance penalty. AMD needs to do something similar or beef up the Infinity Fabric. 22 GB/s is not enough.

Yet another option may be an L4 eDRAM or even HBM configuration for the Last Level Cache (to prevent trips to DRAM that have a latency penalty).





I'm sure AMD engineers know about these.

This is an amazing, competitive CPU if you think about it, and it could get a lot better with Zen+. I think that they could gain 15% with some changes. Power consumption is good, and it has decent clocks even with low OC headroom. It's also a good value and awesome for content creation.



Thanks
To Looncraz as well for his advice on the LDOs
CrazyElf is offline  
post #2 of 56 (permalink) Old 03-03-2017, 08:18 AM
New to Overclock.net
 
Join Date: Sep 2013
Posts: 2,070
Rep: 42 (Unique: 34)
Wow, great review! Thanks!

OCmember is offline  
post #3 of 56 (permalink) Old 03-03-2017, 08:24 AM
Stock *ahem*
 
Quantum Reality's Avatar
 
Join Date: Nov 2008
Posts: 6,380
Rep: 307 (Unique: 237)
Excellent!

Also, some benches I've seen support the theory that disabling SMT will help gaming framerates in the interim while OS, BIOS and microcode patches get rolled out.
Quantum Reality is offline  
post #4 of 56 (permalink) Old 03-03-2017, 08:59 AM
New to Overclock.net
 
Yukon Trooper's Avatar
 
Join Date: Jun 2007
Posts: 159
Rep: 6 (Unique: 6)
Meh. Many gaming benchmarks show much less than 10% difference with SMT enabled/disabled. I don't think there's as much to gain there as people are hoping.

I also take issue with the GPU bottleneck argument. That may be true for average/max FPS, but for the all-important minimum FPS metric we'll likely see Zen destroyed as more in-depth gaming benchmarks come out.

System optimizations, BIOS updates, etc. are only going to get people's hopes up. AMD doesn't even really try arguing this point. AMD's main argument is that game developers will start coding for Zen moving forward, but it will take at least 1-2 years before the market is semi-saturated with Zen optimized titles.
Yukon Trooper is offline  
post #5 of 56 (permalink) Old 03-03-2017, 09:33 AM
New to Overclock.net
 
madbrayniak's Avatar
 
Join Date: Jan 2012
Posts: 1,103
Rep: 11 (Unique: 11)
Here is a good video by Gamers Nexus explaining some variance in Ryzen performance reviews:

https://www.youtube.com/watch?v=TBf0lwikXyU

Best Regards,
Bryan
madbrayniak is offline  
post #6 of 56 (permalink) Old 03-03-2017, 09:55 AM
New to Overclock.net
 
Gamingboy's Avatar
 
Join Date: Feb 2017
Posts: 36
Rep: 0
Ryzen's launch was a bit "premature" in the sense that the third-party companies creating motherboards for these processors are not yet THAT prepared. There are many issues highlighted by Steve from Gamers Nexus about the BIOS problems with the MSI and ASUS boards. Just click on the YouTube link madbrayniak shared.
Gamingboy is offline  
post #7 of 56 (permalink) Old 03-03-2017, 10:43 AM
⤷ αC
 
AlphaC's Avatar
 
Join Date: Sep 2012
Posts: 10,668
Rep: 871 (Unique: 575)
Great overview.

It's interesting that you suggest the OS should recognize it as a 2P 4 core rather than 8 core.

I hope by the time Ryzen 5 releases many of these issues are ironed out by motherboard manufacturers and RAM manufacturers. So far the only RAM company seemingly on top of the RAM issue is G.Skill; their "solution" in the long term is to release AMD Ryzen specialized RAM in the form of Flare X.

If your reasoning is correct then the biggest improvement could come with support for higher RAM speeds (due to the CCX's Infinity Fabric implementation). Earlier news suggested the Infinity Fabric would be faster: "The company declined to give data rates or latency figures for Infinity, which comes only in a coherent version. However, it said that it is modular and will scale from 30- to 50-GBytes/second versions for notebooks to 512 Gbytes/s and beyond for Vega." http://www.eetimes.com/document.asp?doc_id=1330981&page_number=2

IMO when you see Firestrike/Timespy benchmarks with the Ryzen 7 competitive with respect to i7-6900K & i7-7700k @5.1GHz, then that means it is a game optimization issue.

Also I realized The Stilt's reasoning of < 3.3GHz being optimal for these chips might mean we ought to be under-volting instead of overclocking. It certainly explains the clock rates on the Ryzen 7 1700 and Ryzen 7 1700X.



AlphaC is offline  
post #8 of 56 (permalink) Old 03-03-2017, 11:58 AM - Thread Starter
Meeeeeeeow!
 
CrazyElf's Avatar
 
Join Date: Dec 2011
Location: Ontario, Canada
Posts: 2,227
Rep: 428 (Unique: 299)
I feel that apart from the issues I've raised, AMD's CPU is very well made.

Quote:
Originally Posted by AlphaC View Post

Great overview.

It's interesting that you suggest the OS should recognize it as a 2P 4 core rather than 8 core.

I hope by the time Ryzen 5 releases many of these issues are ironed out by motherboard manufacturers and RAM manufacturers. So far the only RAM company seemingly on top of the RAM issue is G.Skill; their "solution" in the long term is to release AMD Ryzen specialized RAM in the form of Flare X.

If your reasoning is correct then the biggest improvement could come with support for higher RAM speeds (due to the CCX's Infinity Fabric implementation). Earlier news suggested the Infinity Fabric would be faster: "The company declined to give data rates or latency figures for Infinity, which comes only in a coherent version. However, it said that it is modular and will scale from 30- to 50-GBytes/second versions for notebooks to 512 Gbytes/s and beyond for Vega." http://www.eetimes.com/document.asp?doc_id=1330981&page_number=2

IMO when you see Firestrike/Timespy benchmarks with the Ryzen 7 competitive with respect to i7-6900K & i7-7700k @5.1GHz, then that means it is a game optimization issue.

Also I realized The Stilt's reasoning of < 3.3GHz being optimal for these chips might mean we ought to be under-volting instead of overclocking. It certainly explains the clock rates on the Ryzen 7 1700 and Ryzen 7 1700X.


+Rep - that gets my thought juices going.

Yes, which reminds me.

Once the RAM fix is in order, it may be advisable to buy the top binned RAM and OC it (best timings/clocks you can get). There is probably more to gain from tight timings and high RAM clocks on Ryzen than on Intel platforms. The reason is that the Intel platforms don't use their DRAM as the last level cache, while Ryzen does. Because of that, memory bandwidth is not the bottleneck on Intel in most cases, whereas on Ryzen the memory could be a bottleneck when communicating between the 2 CCXs. Actually, a 2 DIMM board might be a potentially good idea on Ryzen for that reason (trace lengths and possibility for better OC).

The source of the Infinity Fabric 22 GB/s figure was the PCGH.de review. Apparently they talked with AMD about this.

Also, seeing that there's not much OC headroom and you want to undervolt, there is little point in buying flagship motherboards with insane VRMs, unless of course you need the other features that said flagship motherboards offer. Maybe put the savings towards buying better binned RAM. The only case you may want to consider a flagship then might be if that board has the ability to clock RAM faster.

Perhaps AMD should also focus on releasing a better memory controller for Zen+, although with my Zen+ proposals, it won't be needed because there will be a faster last level cache.

Will update the OP on this.





I'm actually worried about what the Infinity Fabric could mean for Vega. Keep in mind the 22GB/s is not a lot at all.

Quote:
Originally Posted by Quantum Reality View Post

Excellent!

Also, some benches I've seen support the theory that disabling SMT will help gaming framerates in the interim while OS, BIOS and microcode patches get rolled out.


We should see modest gains. Getting rid of the SMT will add a few percentage points (perhaps as much as 10%) and the RAM fixes will add another few percent. That should mostly close the gap with Intel.

The big thing that we need the microcode (and the OS kernels) to do is to treat the CCXs as different CPUs. If we can get, say, a 4 thread game to only use 1 CCX, the gap will disappear. In that case, we could even see AMD get the kinds of wins in games that it gets in workstation benchmarks.



Quote:
Originally Posted by Yukon Trooper View Post

Meh. Many gaming benchmarks show much less than 10% difference with SMT enabled/disabled. I don't think there's as much to gain there as people are hoping.

I also take issue with the GPU bottleneck argument. That may be true for average/max FPS, but for the all-important minimum FPS metric we'll likely see Zen destroyed as more in-depth gaming benchmarks come out.

System optimizations, BIOS updates, etc. are only going to get peoples' hopes up. AMD doesn't even really try arguing this point. AMD's main argument is game developers will start coding for Zen moving forward, but it will take at least 1-2 years before the market is semi-saturated with Zen optimized titles.



True, it's not a huge difference, but it counts for a lot of people. The optimal use of CCXs would also likely boost not just games, but also workstation loads even more. If games that use 4 or fewer cores kept their loads within 1 CCX, Ryzen would be able to pull the kind of results in games that it does in workstation loads.

Keep in mind that at 1440p, GPUs and not CPUs become the bottleneck (save in CPU bottlenecked games like the Total War games). That means that even this bottleneck will disappear, and in the few games where there are CPU bottlenecks, the faster RAM along with better CCX management should mitigate those.

I do not believe that with the fixes I have proposed Zen will get destroyed - at the very least, my Zen+ proposals would lead to viable fixes for Zen+.
CrazyElf is offline  
post #9 of 56 (permalink) Old 03-03-2017, 12:41 PM
⤷ αC
 
AlphaC's Avatar
 
Join Date: Sep 2012
Posts: 10,668
Rep: 871 (Unique: 575)
AlphaC is offline  
post #10 of 56 (permalink) Old 03-03-2017, 12:49 PM
Stock *ahem*
 
Quantum Reality's Avatar
 
Join Date: Nov 2008
Posts: 6,380
Rep: 307 (Unique: 237)
So applications (e.g. video encoding, calculations, etc) - SMT on.

Games (esp DX12) - SMT off.

Too bad you can't have dynamic feature-disabling profiles without needing a reboot to change the setting.
Quantum Reality is offline  