Overclock.net - An Overclocking Community - Theories on why the SMT hurts the performance of gaming in Ryzen and some recommendations for the future
Post #1 by CrazyElf (Thread Starter), 03-02-2017, 05:39 PM
In gaming, enabling Simultaneous Multithreading (SMT) carries a performance penalty of around 10% (give or take). Why is this?

The SMT Implementation

When Intel first released Hyper-Threading, there were actually performance penalties. Right now there seem to be similar penalties with AMD.

The 3 queue resources (in green) are shared. That means that they are not duplicated in SMT.

I suspect that when SMT is on, they are essentially halved. While sufficient for a full core with no SMT, they are probably bottlenecking the SMT implementation. In essence, AMD repeated Intel's mistake with HT.

This cannot be fixed with any BIOS update, although perhaps they could find a few ways to mitigate it in microcode.


This is the Ryzen CPU.

There are 2 distinct 4 core clusters, making 8 cores. Each of these is called a CCX.

Communication between the 4 cores within a CCX is very fast. These cores have their own L1 and L2 cache, then a shared L3 cache in 4 slices (8 MB shared amongst the 4 cores, kind of like a 4 core CPU). One notable difference versus Intel CPUs is that the L3 is a victim cache: it is filled by lines evicted from L2, whereas Intel's inclusive L3 is also filled directly by prefetch and demand requests. Note of course the larger than usual L2 cache to make up for this.

But communication between the 2 CCXs is much slower, and there is a big performance penalty in both bandwidth and latency. The 2 CCXs are connected by a link that AMD calls "Infinity Fabric", which is basically an upgraded HyperTransport design. This Infinity Fabric appears to run at RAM speed. When one CCX needs data that the other CCX has, the Infinity Fabric checks the L3 cache of the other CCX and at the same time requests the data from the memory controller. In most cases, the memory controller request will be cancelled because the data will be in the other CCX's L3 cache. If it is not, then DRAM is the "true" last level cache. AMD claims that the Infinity Fabric has a bandwidth of 22 GB/s.
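The lookup described above can be sketched as a toy model. This is only an illustration of the "probe the other L3 and ask DRAM at the same time" idea as I understand it; the function name `cross_ccx_read` and the dictionary-based caches are made up for the example and say nothing about how the hardware is actually built.

```python
def cross_ccx_read(address, remote_l3, dram):
    """Toy model of a read that missed in the local CCX.

    ASSUMPTION: caches are modelled as plain dicts (address -> data).
    The probe of the other CCX's L3 and the memory controller request
    are issued together; the DRAM request is cancelled on an L3 hit.

    Returns (source, data, dram_request_cancelled).
    """
    if address in remote_l3:
        # Hit in the other CCX's L3: the speculative DRAM request is dropped.
        return ("remote_l3", remote_l3[address], True)
    # Miss in the other CCX's L3: DRAM acts as the last level.
    return ("dram", dram[address], False)


if __name__ == "__main__":
    remote_l3 = {0x10: "line A"}
    dram = {0x10: "line A", 0x20: "line B"}
    print(cross_ccx_read(0x10, remote_l3, dram))
    print(cross_ccx_read(0x20, remote_l3, dram))
```

The first call is served from the other CCX's L3, the second falls through to DRAM.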

I'm actually concerned about that. For comparison, the QPI on Haswell E is about 38.4 GB/s: QPI operates at 4.8 GHz on Haswell E x 2 for Double Data Rate x 16 (QPI is 20 bits wide, but only 16 bits carry data) x 2 (bidirectional) / 8 (since there are 8 bits per byte) = 38.4 GB/s. That's almost double AMD's quoted 22 GB/s, and that's for a link between sockets in a 2P system! For Skylake (Purley), Intel plans an even faster interconnect called UltraPath Interconnect (UPI) (also known as KTI or Keizer Technology). It is reported to have 9.6 GT/s or 10.4 GT/s transfer speeds and it should support multiple requests per message, so it should see efficiency gains. The point is that on Intel, off-die communication between 2 CPUs has more bandwidth than on-die communication between 2 CCXs!

Edit: It is worth mentioning that Intel lists the Haswell EP Xeons as 9.6 GT/s on QPI (that figure already includes the double data rate), which means 9.6 GT/s x 16 (data link is 16 bits; really 20 for data integrity) / 8 bits per byte x 2 for bidirectional = 38.4 GB/s
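For anyone who wants to check the arithmetic, here is a minimal sketch. The helper name `link_bandwidth_gb_s` is just something I made up; the inputs are the QPI figures quoted above (9.6 GT/s, 16 data bits, 2 directions) and AMD's quoted 22 GB/s.

```python
def link_bandwidth_gb_s(transfers_gt_s, data_bits, directions=2):
    """Aggregate link bandwidth in GB/s.

    transfers_gt_s: transfer rate in GT/s (already includes double data rate)
    data_bits:      payload bits moved per transfer, per direction
    directions:     2 for a bidirectional link
    """
    gbit_per_s = transfers_gt_s * data_bits * directions
    return gbit_per_s / 8  # 8 bits per byte


if __name__ == "__main__":
    qpi = link_bandwidth_gb_s(9.6, 16)      # 4.8 GHz x 2 (DDR) = 9.6 GT/s
    infinity_fabric = 22.0                  # AMD's quoted figure, GB/s
    print(f"QPI: {qpi:.1f} GB/s")
    print(f"QPI / Infinity Fabric: {qpi / infinity_fabric:.2f}x")
```

This gives 38.4 GB/s for QPI, roughly 1.75x the quoted Infinity Fabric number.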

Infinity Fabric will also be used in Vega.

That means this is not like having 1 big monolithic die. This is like having 2 fast 4 core CPUs.

For a comparison, here's Broadwell E, which uses a distinct "ring" design:

So there isn't a comparable penalty on Intel CPUs for communication between cores, because all the cores sit on one "ring" rather than in 2 CCXs. Communication within that ring is slower (since data has to travel half way around the "ring" in the worst case), but it also means that the CPU acts as one unit. With AMD's solution, by contrast, communication within a CCX is much faster, but communication between CCXs is really slow. Apparently AMD chose this design to be scalable (they just don't have as many engineers as Intel).
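To put a number on the "half way across the ring" worst case, here is a minimal sketch. It assumes a bidirectional ring with evenly spaced stops numbered 0..n-1; the function name `ring_hops` and the stop counts are hypothetical and are not taken from Intel's actual layout.

```python
def ring_hops(src, dst, stops):
    """Hops between two stops on a bidirectional ring.

    ASSUMPTION: data can travel either way around the ring and takes
    the shorter direction, so the worst case is half the ring.
    """
    distance = abs(src - dst) % stops
    return min(distance, stops - distance)


if __name__ == "__main__":
    stops = 8  # hypothetical stop count, for illustration only
    worst = max(ring_hops(0, dst, stops) for dst in range(stops))
    print(f"worst case on a {stops}-stop ring: {worst} hops")
```

The worst case grows with the number of stops, but every core sees the same kind of path, which is the point being made about the ring acting as one unit.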

Right now there is a performance penalty because Windows is treating this like a monolithic die, rather than 2 separate CPU complexes, which is what this really is.

Think of it as 2 4-core CPUs like on a 2P socket, not 1 8-core CPU. There is no real way to "fix" this issue - it's inherent in the design, although a Windows update/updated Linux kernel would help a lot.


RAM Speeds
We learned a while back that there was a RAM slowdown.


That might be an issue. When the 6700K was first released, it was often slower in gaming than the 4790K, despite Skylake having a better IPC than Haswell. The reason was the poor speed and loose timings of early DDR4. The interesting thing here is that Zen does very well at workstation benchmarks (better than Broadwell E in many cases and perhaps even better than a hypothetical Skylake E), which makes RAM speed a very likely culprit for the gaming results.


One of the reasons why Ryzen is cheap is because AMD used High Density Libraries. That packs the logic into a smaller die area and reduces power consumption. The penalty is clockspeed. For similar reasons, GPUs like the 290X do not have as much overclocking headroom as, say, a 7970 might. You can put more transistors in a given area, but at the expense of clocks, due to the power density (power rises much faster than linearly with clockspeed). Actually, that reminds me, one of the reasons why Kaby Lake clocks faster than Skylake by about 300 MHz is because Kaby Lake is less dense.

This design may very well be why Ryzen cannot overclock more. Actually 4 GHz is already very good considering this.

Voltage Integration
For those who remember, Haswell introduced FIVR, which integrated the voltage regulator on the CPU package, rather than the motherboard. I think that the LDO on Zen is bypassed on consumer boards, so this is a non-issue. That means voltage regulation takes place on the motherboard in full.

Unlike on Carrizo, I don't see this as a bottleneck, unless the integrated voltage regulation is in use somehow.

Fun fact: The voltage integration design is called Zeppelin.

The uncore (cache) does not seem to be clocked separately from the core, unlike on Intel CPUs. If this is bottlenecking clockspeeds and not the HDL, then we may be cache speed limited rather than core limited. This might explain Ryzen's weak OCs.

What I don't know is if it is the cache or the HDL that is limiting OCs. If it is the cache, then we may be able to get a few hundred MHz from splitting this out.

Base Clock
Much like Sandy Bridge, the Base Clock is closely tied to everything else, so overclocking it is likely to introduce instability. I'd guess that past 105 MHz on PCIe 3.0, there may be instability.

I would like to see this separated (kind of like what Intel did with Skylake - they separated the CPU baseclock from the rest of the board). I would also like a "strap" function like on Intel boards to be added for unlocked CPUs.

4 cores don't suffer from the CCX communication problem

With only 1 CCX, the 4 core Zen CPUs will not suffer from this problem. Actually, for a mid-range system a 4 core Ryzen CPU with SMT disabled would be a very good value.

Once the RAM speeds are resolved and with SMT disabled, the main flaws are not a problem at all.

This works for a budget system, an APU, and may be an advantage on a laptop.

You may still be better off with Ryzen

Keep in mind that most games are GPU bottlenecked, not CPU bottlenecked.

You could buy a 6900K + an X99 motherboard + 1 GPU. Alternatively, you could buy a 1800X + X370 board + 2 GPUs for CF/SLI. In games that support CF or SLI, that would be an advantage and keep in mind you are not CPU bottlenecked.

The main drawback of course is multi-GPU issues. The other is of course where you do have CPU bottlenecks. Many strategy games (like the Total War series), simulator games, and the Battlefield series (especially in multiplayer on large maps) are CPU bottlenecked.

We really need a review of Zen vs X99 at 4k.


Ryzen uses AVFS, much like Carrizo.

That may be a big part of the power savings of Ryzen.

I don't know if this has any impact on the overclock headroom, but the top Polaris chips also used AVFS and the best ones (the XFX RX 480 GTR Black comes to mind) could go past 1500 MHz at times - provided you get lucky with the silicon lottery. We may see clocks improve as Zen matures.


Currently AMD's AVX/FMA does not scale as well as Intel's.

Not many games use these instructions, but it is a point to note. It may affect productivity though, depending on your workload.

Keep in mind the value proposition - you may still be better off with Zen. Also keep in mind the possibilities for your budget systems and APUs of the 4 core CPU.

The combination of these problems means that Zen cannot be as fast as Intel's HEDT in terms of gaming (let's assume that a 6800K/5820K + an X99 board is the approximate peer of an 1800X + an X370 board; the 1800X is the more expensive CPU, but a good X99 board will be more expensive than an X370 board, due to the 40 PCIe lanes, quad channel RAM, and more complex chipset). Combine this with weak OC headroom (either due to the cache or the HDL) and you have an explanation.

That said, any game that uses more cores will mean that the 8 Core Ryzen should destroy the Skylake and Kaby Lake Intel CPUs. Keep in mind that with DX12 and Vulkan it may be more future proof to get Ryzen. Oh and Ryzen can destroy Intel at content creation - even the 6900K cannot keep up.

I think that we need a patch in Windows and in the next Linux kernel so that the 2 CCXs are treated like 2 separate CPUs. Essentially this would eliminate much of the performance penalty of Ryzen's CCX design, where data misses the L3 cache and has to come from RAM.

Maybe a microcode update could mitigate some of these problems, but they are inherent in the hardware, so I am unsure of how much it will gain.

They need to get the BIOS updates for the RAM out ASAP.

For gamers, disable the SMT when gaming.

From a programming POV, it may make sense to treat each 4 core CCX like a mini-NUMA cluster. Then communication between CCXs is minimized, keeping data in L3 rather than using the Infinity Fabric and DDR4 as Last Level Cache.
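As a rough sketch of what that could look like in practice, the snippet below groups logical CPUs per CCX and pins the current process to one group. The logical CPU numbering is an ASSUMPTION (contiguous ids, CCX by CCX, with SMT siblings inside the same block); check what your OS actually reports before trusting it. The helper names `ccx_cpu_sets` and `pin_to_ccx` are mine. `os.sched_setaffinity` is a real call, but it only exists on Linux.

```python
import os


def ccx_cpu_sets(num_ccx=2, cores_per_ccx=4, threads_per_core=2):
    """Return one set of logical CPU ids per CCX.

    ASSUMPTION: logical CPUs are numbered contiguously, CCX by CCX,
    so an 8 core / 16 thread part maps to 0-7 and 8-15. Real systems
    may number them differently.
    """
    per_ccx = cores_per_ccx * threads_per_core
    return [set(range(i * per_ccx, (i + 1) * per_ccx)) for i in range(num_ccx)]


def pin_to_ccx(ccx_index, cpu_sets):
    """Restrict the calling process to one CCX. Returns True if applied."""
    if not hasattr(os, "sched_setaffinity"):
        return False  # not available on this platform (e.g. Windows, macOS)
    # Only keep CPUs this process is actually allowed to run on.
    target = cpu_sets[ccx_index] & os.sched_getaffinity(0)
    if not target:
        return False
    os.sched_setaffinity(0, target)
    return True


if __name__ == "__main__":
    sets = ccx_cpu_sets()
    print("assumed CCX layout:", [sorted(s) for s in sets])
    print("pinned to CCX 0:", pin_to_ccx(0, sets))
```

Threads that share data would then stay on one CCX and hit the same L3, instead of bouncing across the Infinity Fabric.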

Zen+ Ideas
  • AVX and FMA performance need a boost in the future
  • Increase the resources for the queues so as to prevent penalties with SMT.
  • A higher speed interconnect between the CCXs so that they don't have to go to DRAM as the Last Level Cache, and perhaps further improvements to Infinity Fabric. I think that Zen would benefit from an L4 eDRAM cache. Perhaps a future version could feature HBM.
  • Another idea might be to isolate the cache, Bclk and core clocks. On Skylake for example, you have your Core speed, then Uncore. On Zen, there's no separate Uncore. If the Uncore is holding back clocks then perhaps the Core can go a bit faster. For the base clock, on Skylake Intel also introduced the ability for the Bclk of the CPU to be separate from the rest of the motherboard.

By high speed interconnect, I mean something like this. It is a 24 core Broadwell E HCC design, with 2 "rings" of 12 cores.

Note the 2 buses between the 2 "rings" that allows for high speed connection. Without these, anything between the 2 rings would have to go to the DRAM, exacting a huge performance penalty. AMD needs to do something similar or beef up the Infinity Fabric. 22GB/s is not enough.

Yet another option may be an L4 eDRAM or even HBM configuration for the Last Level Cache (to prevent trips to DRAM that have a latency penalty).

I'm sure AMD engineers know about these.

This is an amazing, competitive CPU when you consider everything, and it could get a lot better with Zen+. I think that they could gain 15% with some changes. Power consumption is good, and it has decent clocks even with low OC headroom. It's also a good value and awesome for content creation.

Thanks to Looncraz as well for his advice on the LDOs