
CrazyElf · Discussion Starter · #1
Intro

The purpose of this thread is to further discuss the new information about Ryzen and CPU performance in games. As most OCNers know, when Ryzen came out, it was very strong in workstation workloads, but a bit behind Kaby Lake (maybe 10-20%) clock for clock in games at 1080p.

With faster RAM, the gap that Ryzen has is mostly gone.

This is a follow-up to my earlier thread:
http://www.overclock.net/t/1624566/theories-on-why-the-smt-hurts-the-performance-of-gaming-in-ryzen-and-some-recommendations-for-the-future/0_100

In my previous thread, there was a discussion about possible theories and what to do to mitigate these problems.

Ryzen loves memory

In that thread, I hypothesized in the OP that Zen might love memory due to its unique topology, the CCXs: DRAM is used as a cache to communicate between the 2 CCXs because of the way Infinity Fabric works. The fabric first checks the L3 cache of the other CCX and, if the information is not there, goes to DRAM.

From eTeknix, we have:
http://www.eteknix.com/memory-speed-large-impact-ryzen-performance/



From the Finnish website IO Tech, we have:
https://www.io-tech.fi/artikkelit/amdn-uusi-zen-x86-arkkitehtuuri-clock-to-clock-suorituskyky/



There is also this - Elmor from HWBOT:
http://forum.hwbot.org/showpost.php?p=479666&postcount=22



With faster RAM, Ryzen clearly sees gains.

Potential Hypotheses as to Why
Hypothesis 1: In the absence of an L4 cache, the CCXs fall back to DRAM to communicate with each other, and they love the higher speeds because, unlike on Intel, Zen's topology effectively makes DRAM the last level cache. If so, an L4 or even an eDRAM-like solution would see huge gains. Memory clocks have a disproportionate effect because they are the bottleneck. Keep in mind that right now on X370, you've got 8 cores being fed by dual-channel RAM, while on X99, you've got 6-10 cores being fed by quad-channel RAM. The quad-channel RAM is clocked a bit lower, but it still offers a lot more total bandwidth because it is quad channel.
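To put rough numbers on that last point, here's a quick back-of-the-envelope script (a sketch: the DDR4 speeds are just illustrative configurations, and these are theoretical peaks, not measured figures):

```python
def peak_bw_gbs(mts, channels):
    """Theoretical peak DRAM bandwidth in GB/s: each DDR4 channel is
    8 bytes wide, transferring at the module's MT/s rating."""
    return channels * 8 * mts / 1000

# Ryzen on X370: 8 cores fed by dual-channel DDR4-3200
ryzen = peak_bw_gbs(3200, 2)   # 51.2 GB/s
# Broadwell-E on X99: 8 cores fed by quad-channel DDR4-2666
bdw_e = peak_bw_gbs(2666, 4)   # ~85.3 GB/s

print(f"Ryzen : {ryzen:.1f} GB/s total, {ryzen / 8:.1f} GB/s per core")
print(f"BDW-E : {bdw_e:.1f} GB/s total, {bdw_e / 8:.1f} GB/s per core")
```

Even with slower sticks, quad channel ends up with roughly two-thirds more raw bandwidth per core, which is why the dual-channel AM4 platform leans so hard on memory clocks.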

Hypothesis 2: The Infinity Fabric speed is tightly tied to the memory clock. In my previous article, I noted that unlike Skylake, where there is a core clock, an uncore clock, a RAM clock, and a base clock that is separate from the PCIe functions, Ryzen does not have these separations. Overclocking the RAM may be overclocking the Infinity Fabric itself, which is believed to be based on HyperTransport. (Thanks Looncraz.)

There is one reason to believe Hypothesis 1: Geekbench results show that secondary timings do matter at 3466. We will know more in a month or two, when the ability to alter RAM clocks and timings arrives.

Of course, to test these 2 hypotheses we need the timings unlocked, and more people with boards.

What we need to see
We also need to test how far this can scale:
  1. What is the optimal point between timings and clocks? As you go faster, the timings get looser, but Ryzen seems to keep benefiting. For example, on X99 with Haswell-E, 2666 MHz with tight timings was often faster than 3200 MHz with loose timings. With Ryzen, if Hypothesis 2 is true, faster speed with loose timings might win.
  2. How fast can Ryzen's memory controller go in raw clocks? Top G.Skill kits today reach 4266 @ 19-19-19-39 for Z270.
  3. Are there diminishing returns at some point?
We need to test this once software tools are updated and timings are unlocked.

AIDA64 for example is not up to date right now.

We'll need to test how far Zen can go with fast RAM and tight timings.
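One way to reason about the clocks-versus-timings question while we wait for tools: convert CAS latency into nanoseconds. A minimal sketch (the kits listed are just illustrative):

```python
def cas_ns(mts, cl):
    """First-word CAS latency in ns: CL cycles at the memory clock,
    which is half the MT/s rating for DDR."""
    return cl / (mts / 2) * 1000

# Illustrative kits
for mts, cl in [(2666, 13), (3200, 16), (3466, 16), (4266, 19)]:
    print(f"DDR4-{mts} CL{cl}: {cas_ns(mts, cl):.2f} ns")
```

A loose-timed fast kit can still match or beat a tight-timed slow kit in absolute latency while adding bandwidth - and if Hypothesis 2 is right, the Infinity Fabric clock gain from raw MT/s comes on top of that.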

Are these typical of Ryzen?
Extremetech recently published a review of the 1080Ti. They used 3200 MT/s RAM on Ryzen and benchmarked it against a 6900K.

https://www.extremetech.com/gaming/245604-review-gtx-1080-ti-first-real-4k-gpu-drives-better-amd-intel

From the review, we can conclude that the following games scale with RAM:
  • Witcher 3 (from above)
  • ARMA 3 (from above)
  • Company of Heroes
  • Metro Last Light Redux - the AMD 1800X is actually faster than the Intel 6900K at 4K
  • Ashes of the Singularity
  • Shadow of Mordor - the AMD 1800X is actually faster than the Intel 6900K at 4K
  • Rise of the Tomb Raider - the AMD 1800X is actually faster than the Intel 6900K at 4K
  • DiRT Rally - the AMD 1800X is actually faster than the Intel 6900K at 4K

Interestingly, Hitman shows AMD's 1800X lagging the 6900K. There is also a gap in Civ VI. It's a notable exception because it seems to be CPU bottlenecked, so turning on SMT might actually help. Further testing required.



Averaged out, Broadwell-E was just 1% faster than the 1800X at 4K with a 1080 Ti.

Keep in mind of course ET's testing methods - they used slower 2666 MHz RAM for Broadwell-E, although it was in quad channel, so I doubt there was a RAM bandwidth bottleneck. I do not believe any CPUs were OC'ed. There were a few games where Ryzen was faster than Broadwell-E!

Broadwell-E of course has a bit of OC headroom (4.2 to 4.4 GHz), but keep in mind that in this testing, ET kept Ryzen's Simultaneous Multi-Threading off. It will be very interesting to see how the 4-core Zen parts scale with RAM: with only 1 CCX, they suffer no inter-CCX bottleneck. They may very well give the 7700K a fair fight, even if the 7700K is a faster CPU.

I'd say though that many, if not most games love faster RAM with Ryzen. With fast RAM, the gaming gap basically goes away for most games.

What we need to see
A clock-for-clock review of Ryzen vs Broadwell-E, with Ryzen at top RAM speeds and SMT off, against Broadwell-E overclocked with the best possible RAM speeds.

I don't expect it will change much. Broadwell-E has more OC headroom than Zen - base of 3.2 GHz, turbo 3.7 GHz, single-core turbo of 4.0 GHz, with overclocks in the 4.2 to 4.4 GHz range - against an 1800X at 3.6 GHz, with overclocks in the 3.9-4.1 GHz range. Still, most of that advantage should be recouped with AMD's SMT disabled. It will be interesting to find a CPU-limited game that uses all 16 threads on both CPUs to see who wins. With faster RAM, though, AMD should not come out behind, and may even come out on top in some titles.

Since the GPU, not the CPU, will usually be the bottleneck, I don't expect much variation across titles - which, considering Zen's lower cost, is a win for AMD.

AMD Binning

This is from Silicon Lottery:
https://siliconlottery.com/collections/all

Note that this is as of March 2017. Processes tend to mature with time.

Ryzen 7 1700
  • It's presumed all CPUs can reach 3.7 GHz (Silicon Lottery, please correct me if this assumption is wrong)
  • 93% can do 3.8GHz @ 1.376V
  • 70% can do 3.9GHz @ 1.408V
  • 20% can do 4.0GHz @ 1.440V

Ryzen 7 1700X
  • It's presumed all CPUs go to 3.8 GHz
  • 77% can do 3.9GHz @ 1.392V
  • 33% can do 4.0GHz @ 1.424V

Ryzen 7 1800X
  • It's presumed all CPUs go to 3.8 GHz
  • 97% can do 3.9GHz @ 1.376V
  • 67% can do 4.0GHz @ 1.408V
  • 20% can do 4.1GHz @ 1.440V
My thoughts
Keep in mind that Silicon Lottery uses RealBench as their testing platform. That is "game stable", but not "server grade stable". There also isn't any binning by IMC quality, which is very important on Zen.

There is very clearly a binning process going on with AMD and the higher numbers clearly are better binned.

Does it matter? If you play games that are CPU intensive, then it is well worth it.

Examples:
  • Battlefield series
  • Cities: Skylines
  • Total War series
  • Many flight simulators are CPU limited
  • Starcraft 2 is single thread limited, as are many other strategy games
  • Unsurprisingly, Civ VI is faster on Intel right now
Intel might win thanks to Broadwell-E's higher clockspeeds from its extra OC headroom, but it's going to be very close - and you'd be paying a lot more for a little more performance.

You should take into account what you play. If you play strategy or simulation games, or the Battlefield series, you will frequently be CPU bottlenecked.

Hitman was also faster on Intel than on Ryzen.

Is there anything else I should know?

This is not about Ryzen, but considering that the Vega GPUs use Infinity Fabric, this may affect how we overclock them. Worst case, the Infinity Fabric might be a bottleneck, in which case we may be limited by either the fabric or the VRAM overclocks. HBM1 on the Fury X had about 10-15% overclocking headroom, although it was driver locked. Why is this an issue? Overclocking the Infinity Fabric may require overclocking the HBM2 VRAM, much like overclocking the memory speed might be doing on Ryzen. We also have no way of knowing at this time what HBM2's overclocking headroom will be.

It is likely that the Infinity Fabric is what the NCUs (Next-generation Compute Units) use to communicate, so this is fairly important. Of course, without knowing the speed of the fabric, or any other details, we have no way of knowing if it is even a bottleneck at all.

However, one problem of the Fury X, due to its deployment of HDLs, was limited overclocking headroom. I expect that AMD will use HDLs again on Vega, but I have no further information.

Without further details, there is no way to know.

I just want the maximum gaming performance!
OK, the first thing to do is to wait. We need to see which boards are best for RAM overclocking.

Buy good RAM and a motherboard competent at overclocking RAM
First, buy top-tier RAM. Depending on how things look after the RAM timings are unlocked, a base clock generator might also be useful - some people say it will soon be just marketing, while others say it is key.

Watch Buildzoid's video:
Keep in mind, the following boards have external BCLK generators:
  • Asrock X370 Taichi (my recommendation currently)
  • Asrock X370 Fatal1ty Professional Gaming
  • Asus X370 Crosshair Hero
  • Gigabyte X370 Gaming K7
No one knows the full situation yet, so wait for the platform to mature. We really need to see which boards are competent at OC'ing RAM. Make sure the motherboard can actually run at your RAM's rated speed!

However, what is certain is that you'll want to invest in some top-binned RAM. You could even buy a lower-end board (maybe the cheapest with a BCLK generator if that turns out to be useful; if not, your choices widen). Why? Without much CPU OC headroom, there's no point in overkill VRMs. The only time you need a flagship board is for its other features. Just make sure the board is good at overclocking RAM. If you don't believe me, scroll back up and look at those Witcher 3 and ARMA 3 benchmarks!

Overclock the RAM! We will need to see what the best combination is though.

Disable the SMT
Most games do not use more than 8 threads. Enable SMT only for games that use more than 8 threads.

It is a pain to reboot, but this is how you get the most performance.
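If rebooting per game is too much, a partial workaround is pinning the game process to physical cores only. A minimal sketch, assuming SMT siblings are enumerated as adjacent logical CPUs (0 and 1 on core 0, and so on) - that enumeration is an assumption, so verify your own topology first:

```python
import os

def physical_core_set(n_logical, smt=True):
    """Logical CPUs to keep, assuming SMT siblings are adjacent
    (0,1 = core 0; 2,3 = core 1; ...). Check your system's layout."""
    return set(range(0, n_logical, 2)) if smt else set(range(n_logical))

mask = physical_core_set(16)   # 8C/16T Ryzen -> {0, 2, 4, ..., 14}
print(sorted(mask))

# On Linux, restrict this process (and any game it spawns) to those CPUs;
# intersect with the currently available set to stay safe on smaller machines.
if hasattr(os, "sched_setaffinity"):
    os.sched_setaffinity(0, mask & os.sched_getaffinity(0))
```

On Windows, Task Manager's "Set affinity" or `start /affinity <hexmask>` does the same job per process.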

There are some things that you cannot control
If you read my previous post, I discussed how I felt AMD should add more resources to the queues so that the performance penalty with SMT is minimized. That will have to wait for Zen+ or Zen++.

On the software side, we still need Windows and Linux schedulers that make the most of AMD's CCX topology, and programmers need to optimize their software for AMD too.

Conclusions

As we can see, the gaming gap is largely gone. Actually, AMD is even able to pull a few wins!

If we combine the following:
  • Faster RAM speeds and tighter timings from an unlocked RAM multiplier in a month or two
  • Highly binned RAM
  • Disabling SMT
  • A motherboard with good RAM OCing ability
We have a very good CPU that will be able to beat Intel in some games.

For Zen+, what do we want?

First of all, faster clocks! There's little OC headroom. Hopefully they can get it onto Samsung's 14LPU (their fourth-generation 14nm process), which judging by its description is geared toward higher clocks. I suspect that Skylake-E will be faster: a 5960X typically did 4.4 to 4.6 GHz, roughly what a 4770K did, and the 4790K was on average around 300 MHz faster than that. If the same holds, Skylake-E should be a decent leap - 6700K-like speeds, but at Skylake IPC. Zen+ will need to be good to respond to that.

Unlink everything, like on Skylake. On Skylake, you have the core clock, uncore, and RAM clock, plus a base clock that is separate from the CPU and PCIe functions, along with a strap function on K CPUs.

Next, we want more queue resources, to eliminate the performance penalty with SMT on. AMD's queues are probably inadequate there.

An L4 cache would be very helpful for inter-CCX communications. Even a small one could see huge gains. Failing that, something like the eDRAM of Broadwell or even an HBM solution for high end CPUs would help.

We need the Infinity Fabric bandwidth to be higher, so that it bottlenecks the communication between cores less.

A wider core would also help, although the returns will diminish, as on Intel CPUs - only certain apps can take advantage of it. We also want better AVX2 performance (its scaling is not as good as Intel's).

Beef up the memory controller. Even with an L4 cache, this might help. AMD is trying to feed 8 fast cores with dual channel, versus Intel, which feeds its cores with quad channel.

Maybe AMD should consider offering an HEDT solution to fight the 6950X. Such a solution could be two 1800X dies together in a multi-chip module. At the same clocks, that would be 190W - a lot less if they lower the clocks. It could offer 2 M.2 PCIe x4 SSDs, quad-channel RAM, and 32 PCIe 3.0 lanes.

Closing thoughts
When I look at the resources AMD had versus what Intel had, yes, it's a solid architecture. We may not even have the gaming penalty we initially feared. We may be pretty close to having our cake and eating it too.

That said, Zen+ has a lot of options for improvement.
 

Undervolter
Excellent post, but that's exactly Ryzen's drama: the RAM compatibility. Yesterday I downloaded an ASUS memory QVL sheet, and there were RAM kits rated 3200 MHz that the sheet validated for only 2133 MHz effective speed, for example. This is atrocious, because you can't expect every buyer to be a RAM expert and know which kits REALLY run at 3000+ clocks. Soon there will be disappointed buyers who blame it all on Zen for not being able to run with the RAM they bought.
 


⤷ αC
CrazyElf, nice updated overview!


It was very organized and well thought out, unlike some of the reviews we've been seeing online, which are unprofessional and inaccurate.
Quote:
Originally Posted by Undervolter View Post

Excellent post, but that's exactly Ryzen's drama. The RAM compatibility. Yesterday i downloaded an ASUS memory QVL sheet and there were RAM kits rated 3200Mhz, that in the sheet were validated for 2133Mhz effective speed for example. This is atrocious, because you can't expect every buyer to be RAM expert and know which one REALLY runs at 3000+ clocks. Soon, there will be disappointed buyers that will blame it all on Zen, not being able to run with the RAM they bought.
The RAM issue is horrific, but that's the price you pay for being an early adopter. Those kits were made for Intel X99 and Intel Z270/Z170.

When the memory manufacturers and retailers start to actually do some work on their memory kits, we will likely see some improvements on the PC building front.

What are Crucial / Corsair / Kingston / etc. doing? G.Skill already has Flare X kits for sale on Newegg, albeit with poor timings, and the ones faster than 2666 MHz are nowhere to be found.

Anandtech's review added the following caveats at the bottom:
Quote:
Originally Posted by http://www.anandtech.com/show/11170/the-amd-zen-and-ryzen-7-review-a-deep-dive-on-1800x-1700x-and-1700/23
  • Windows 10 RTC disliking 0.25x multipliers, causing timing issues,
  • Software not reading L3 properly (thinking each core has 8MB of L3, rather than 2MB/core),
  • Latency within a CCX being regular, but across CCX boundaries having limited bandwidth,
  • Static partitioning methods being used shows performance gains when SMT is disabled,
  • Ryzen showing performance gains with faster memory, more so than expected,
  • Gaming performance, particularly towards 240 Hz gaming, being questioned,
  • Microsoft's scheduler not understanding the different CCX core-to-core latencies,
  • Windows not scheduling threads within a CCX before moving onto the next CCX,
  • Some motherboards having difficulty with DRAM compatibility,
  • Performance-related EFIs being released highly regularly, nearly weekly, since two weeks before launch.
As far as GTX 1080 Ti performance:
http://hothardware.com/reviews/nvidia-geforce-gtx-1080-ti-performance-review-with-intel-and-ryzen?page=7
http://www.guru3d.com/articles-pages/geforce-gtx-1080-ti-review,31.html
https://www.nordichardware.se/test/test-nvidia-geforce-gtx-1080-ti-nvidia-ensamma-i-toppen.html/16?/cpu-test-vilken-processor-ar-bast-for-gtx-1080-ti
http://www.eteknix.com/nvidia-gtx-1080-ti-cpu-showdown-i7-7700k-vs-ryzen-r7-1800x-vs-i7-5820k/3/
 


⤷ αC
Ryzen's "inherent problem" is the OS deciding to switch a thread from one CCX to another. This came up in the pcperspective comments section, to which Allyn Malventano replies:

"We don't disagree, but the rumors we were attempting to control were those stating the performance issues stemmed from improper handling of physical vs. logical cores."

So essentially pcperspective was testing something that is not the main issue. CrazyElf's topic is more on point with the actual problem. The fact that memory affects the core performance so significantly presents a strong case for inter-CCX communication as the weak point.

edit: It has been proposed that tools such as Process Lasso or Process Hacker could be used to alleviate (i.e. lessen the impact of) this in the interim (i.e. before a Windows patch is released), along with some core unparking tweaks.

edit 2: Silicon Lottery stats must have changed
As of 3/6/17, the top 70% of 1700s were able to hit 3.9GHz or greater. 1.408V CPU VCORE (https://siliconlottery.com/collections/all/products/1700a39g)
As of 3/6/17, the top 23% of 1700s were able to hit 4.0GHz or greater. 1.44V CPU VCORE (https://siliconlottery.com/collections/all/products/1700a40g)
As of 3/6/17, the top 77% of 1700Xs were able to hit 3.9GHz or greater. 1.392V CPU VCORE (https://siliconlottery.com/collections/all/products/1700x39g)
As of 3/6/17, the top 33% of 1700Xs were able to hit 4.0GHz or greater. 1.424V CPU VCORE (https://siliconlottery.com/collections/all/products/1700x40g)
As of 3/6/17, the top 67% of 1800Xs were able to hit 4.0GHz or greater. 1.408V CPU VCORE (https://siliconlottery.com/collections/all/products/1800x40g)
As of 3/6/17, the top 20% of 1800Xs were able to hit 4.1GHz or greater. 1.44V CPU VCORE (https://siliconlottery.com/collections/all/products/1800x41g)
 

chew*
Based on what the video is saying, the way you want to game is not with SMT off.

You would actually want SMT on and use the core disable feature.

On some boards you have choices: 2+2 or 0+4.

Since we want to eliminate the cross-CCX latency, the 0+4 option with SMT on should be the optimal setting in theory and on paper, according to the video.

As I stated in my review, and I stick by it, the 1700 is the chip for enthusiast overclockers - and it seems Silicon Lottery had to adjust their findings.

Also keep in mind RealBench is not real stability, but it should still be a reliable way to "bin".

You will probably end up 0.025-0.050 V over the voltage they required for RealBench to be "Prime blend stable with kickers (Cinebench and wPrime 1024 during Prime95)".

As far as ref clock boards go... I would suggest keeping it low (105 MHz max) until someone gets some actual data in games, and data on what hardware combinations you can get away with without killing the hardware.

Hardcore overclockers benching tend to use specific hardware that is far more tolerant of this, and it is in no way indicative of what the real world can survive without killing something.

Anyway cheers, I finally remembered my password.

chew*
 

chew*
I don't think this is the thread to discuss that but I can discuss it in a thread you make and offer some impartial and probably more objective advice than the majority.
 

Blameless
Quote:
Originally Posted by CrazyElf View Post

DRAM is used as a cache to communicate between the 2 CCXs due to the way Infinity Fabric works.
This particular statement strikes me as false. DRAM is far too slow to be an efficient way to communicate between CCXes.

It's my understanding that the memory clock controls the data fabric clock and that the data fabric connects the CCXes together, as well as connecting the memory controllers to the CCXes, so that direct CCX to CCX communication can be indirectly affected by how fast the system memory is, not that the system memory itself is a buffer/cache for this data.

 

CrazyElf · Discussion Starter · #11
Quote:
Originally Posted by Blameless View Post

This particular statement strikes me as false. DRAM is far too slow to be an efficient way to communicate between CCXes.

It's my understanding that the memory clock controls the data fabric clock and that the data fabric connects the CCXes together, as well as connecting the memory controllers to the CCXes, so that direct CCX to CCX communication can be indirectly affected by how fast the system memory is, not that the system memory itself is a buffer/cache for this data.

C6eUL0DWAAAeCQG.jpg
How it works is that the Infinity Fabric interfaces between the two L3 caches of the CCXs. A data request is sent to both the RAM and the other CCX.

If the other CCX's L3 cache has the information, then the communication goes through the Infinity Fabric from CCX to CCX. If not, then the DRAM serves as the "real" last level cache, which incurs a latency penalty.

See:
http://www.anandtech.com/show/11170/the-amd-zen-and-ryzen-7-review-a-deep-dive-on-1800x-1700x-and-1700/9
Quote:
The L3 cache is actually a victim cache, taking data from L1 and L2 evictions rather than collecting data from prefetch/demand instructions. Victim caches tend to be less effective than inclusive caches, however Zen counters this by having a sufficiently large L2 to compensate. The use of a victim cache means that it does not have to hold L2 data inside, effectively increasing its potential capacity with less data redundancy.

It is worth noting that a single CCX has 8 MB of cache, and as a result the 8-core Zen being displayed by AMD at the current events involves two CPU Complexes. This affords a total of 16 MB of L3 cache, albeit in two distinct parts. This means that the true LLC for the entire chip is actually DRAM, although AMD states that the two CCXes can communicate with each other through the custom fabric which connects both the complexes, the memory controller, the IO, the PCIe lanes etc.
Then there's this:
http://www.pcgameshardware.de/Ryzen-7-1800X-CPU-265804/Tests/Test-Review-1222033/
Quote (translated from German):
The website anandtech.com has now published a list of known issues (under "Caveats") that Ryzen CPUs and AM4 motherboards still have. Some points are already clear: the BIOS versions are anything but final, which affects stability and performance - our Ryzen testers can attest to that. The Windows 10 scheduler does not yet know Ryzen - neither AMD nor Microsoft could confirm reports that next week's Patch Day will solve the problem. Threads are merrily passed back and forth between the CCXs, which needlessly increases communication. The scheduler is also said to assume that it is not each CCX that has 8 MiB of L3 cache, but each thread - which would be 128 MiB of L3 in total instead of 16. This supports the thesis that Windows 10 treats Ryzen like Bulldozer and therefore deliberately places load on the virtual SMT threads. Anandtech.com also reports problems with 0.25x multipliers, which is why users should use only half or whole ones.

And this, on their core-scaling measurements (translated):
Quote:
The aim of this series of measurements is firstly to work out how much performance (frame rate) increases going from 4 to 6 and finally 8 CPU cores. Secondly, we examine the extent to which communication between the Compute Complexes, AMD's basic units of the Zen architecture, slows things down. Recall that four cores with 2 MiB L3 cache slices each (8 MiB total, 16-way associative) form a Compute Complex (CCX). Two of these CCXs in Summit Ridge (Ryzen R7) communicate internally via the so-called Infinity Fabric, a coherent interconnect. According to AMD, this connection achieves approximately 22 GB of data throughput per second for single-threaded tasks and mixed read/write accesses. The Infinity Fabric sends data requests to the memory controller and simultaneously checks the L3 cache of the other CCX for availability of the requested data. Depending on which answer returns more quickly, either the other L3 (in most cases) or the memory is accessed. The memory request is cancelled if the data already exists in the L3.
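That "race" between the remote L3 and DRAM can be put into a toy model (the latencies below are hypothetical placeholders, not measurements - the real values are exactly what needs to be benchmarked):

```python
def effective_latency(p_remote_hit, lat_remote_l3, lat_dram):
    """Average latency for a cross-CCX data request: the fabric probes the
    other CCX's L3 and DRAM together, and the DRAM request is cancelled
    whenever the remote L3 answers (per the description above)."""
    return p_remote_hit * lat_remote_l3 + (1 - p_remote_hit) * lat_dram

# Hypothetical numbers in ns: remote L3 ~40 ns, DRAM ~90 ns
for p in (0.9, 0.5, 0.1):
    print(f"remote-L3 hit rate {p:.0%}: {effective_latency(p, 40, 90):.0f} ns")
```

The point of the model: the worse the remote-L3 hit rate (e.g. a thread bounced to the other CCX with a cold cache), the closer cross-CCX traffic gets to full DRAM latency, which is where faster RAM would visibly help.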
Then this:
http://www.hardware.fr/articles/956-23/retour-sous-systeme-memoire.html
Quote (translated from French):
However, the situation beyond 8 MB is clear-cut: on a 12 MB access (within a combined 16 MB L3), latency is close to what you would get if accesses above 8 MB went to memory.

We asked AMD, who confirmed this: in practice, in this textbook case the L3 of the second CCX is not used, and Ryzen behaves as if it only had 8 MB of L3.

This is in fact a peculiarity of the benchmark used: to measure latency correctly, the threads must be pinned to a core and not move. But, as you probably remember, the L3 cache is a "victim" cache: it holds the data evicted from the L2 cache. And to fill the L2 cache, an access has to happen... inside the CCX. As a result, with our benchmark pinned (latency is only measured on one thread) to the first core, the other CCX is never solicited, so its cache never fills.

This is indeed a disadvantage of the architecture, even if it is excessively theoretical: in practice, as we keep saying, Windows 10 loves to move software threads from one core to another.

While AMD could not give us an idea of the latency of an access to the second CCX, they provided another, far more important figure: the bandwidth between these two CCXs: only 22 GB/s!

"Only", because as you will have noted, this bandwidth is not just lower than that of the L3 inside the CCX ("at least" 175 GB/s, a figure to verify once measurement software has been updated), but in fact lower... than memory bandwidth!

At GDC, AMD shared a diagram which gives some clarifications and explains the announced 22 GB/s figure:

...

Each CCX is connected to the Data Fabric by a bus capable of transferring 32 bytes/cycle, clocked at the memory frequency (1200 MHz for DDR4-2400). It is through this bus that requests to the RAM controller and to the other CCX both pass.

At 1200 MHz, this gives a theoretical bandwidth for this bus of 38 GB/s. Given that it is shared between memory accesses and CCX-to-CCX traffic, contention problems between RAM accesses and inter-CCX communications can appear fairly quickly. Is part of the bandwidth reserved for RAM and the rest for L3 exchanges? Do some requests have priority over others? The manufacturer has not given more details for now.

Note also that all the PCI Express IO sits on this shared bus as well, which may not be without impact on games.
Considering what is tied to the RAM clock at this point, that's why I theorized that overclocking the RAM would lead to other performance improvements - you are also improving the communication between CCXs.

In my original post, I called for the Infinity Fabric bandwidth to be higher, as 22 GB/s is not adequate. How the 22 GB/s and 38 GB/s figures are derived, I'm not as sure. But 1200 MHz means an effective DDR speed of 2400 MHz, which means a peak transfer rate of 19.2 GB/s per channel, so that's probably how they got the 38 GB/s (really 38.4 GB/s, which also matches 32 bytes x 1200 MHz). If this bus is tied to the memory clock, then at 1600 MHz - 3200 MHz on the RAM - we might be getting 51.2 GB/s of link bandwidth. Assuming the Infinity Fabric's share scales the same way, its bandwidth may be 22 x (51.2 / 38.4) = 29.33 GB/s. That's really slow too - Haswell-EP uses QPI at 9.6 GT/s, which works out to 38.4 GB/s! Intel's Purley (Skylake Xeons) will introduce the newer, faster UPI bus. If the 22 GB/s figure is accurate, AMD has less bandwidth between 2 CCXs on the same die than Intel has between 2 CPUs on a motherboard!
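The same arithmetic as the paragraph above, as a quick script (the 32 bytes/cycle and the 22 GB/s baseline come from AMD's GDC slide and hardware.fr; the linear scaling with memory clock is Hypothesis 2, not a confirmed fact):

```python
def fabric_bw_gbs(ddr_mts, bytes_per_cycle=32):
    """Theoretical CCX-to-Data-Fabric link bandwidth, assuming the link runs
    at the memory clock (half the DDR MT/s) and moves 32 bytes per cycle."""
    memclk_mhz = ddr_mts / 2
    return bytes_per_cycle * memclk_mhz / 1000

MEASURED_2400 = 22.0   # AMD's quoted cross-CCX figure at DDR4-2400

for mts in (2400, 2666, 3200, 3466):
    link = fabric_bw_gbs(mts)
    scaled = MEASURED_2400 * link / fabric_bw_gbs(2400)
    print(f"DDR4-{mts}: link {link:.1f} GB/s, cross-CCX maybe ~{scaled:.1f} GB/s")
```

Even at the optimistic end (DDR4-3466), the scaled cross-CCX figure stays around 31-32 GB/s, still under the Haswell-EP QPI number quoted above.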

It's also why I called for an L4 cache. Right now, programmers need to keep whatever they can within 1 CCX, as do OS schedulers. If data is going out to DRAM sometimes, that would explain the penalties.

This is still a hypothesis (what's not is that faster RAM is clearly doing something for Ryzen's gaming performance), but unless I'm missing something (and I could be), it does look like they are using DRAM as a last level cache.

Edit:
Some other observations (will edit OP later on to reflect these).

Also of note is the 32 bytes/clock everywhere. Is that a coincidence, or are these interfaces all tied to the RAM clock? The fact that the data fabric is also 32 B/clock suggests to me that it is tied to RAM.

The other interesting result: Hardware.fr ran a test comparing a 4+0 core configuration against 2+2:
http://www.hardware.fr/articles/956-24/retour-sous-systeme-memoire-suite.html

Note the huge jump in performance for Battlefield, which matters because it is a CPU-bottlenecked game. Most things do better with 4+0 - the only exception is 7-Zip, which the authors speculate benefits from the extra L3 cache available in the 2+2 configuration.

It looks like there are some RAM latency penalties. The authors think this is because there are two controllers connected to the CCXs via the data fabric: the memory controller and the PCH.

Contradicting information:
http://www.tomshardware.com/reviews/amd-ryzen-7-1800x-cpu,4951-5.html


Quote:
We measured performance with the utilities and achieved similar results for Intel's Core i7-6900K, but we also noticed a large gap between the AMD-provided Ryzen measurements and our test results. Ryzen's L3 cache latency measured 20 ~ 23ns, which is double the provided value. Due to some of the performance characteristics we noted during our game testing, we also tested with SMT enabled and disabled, but the results fell within expected variation. We also measured a ~10ns memory latency gap in favor of the Intel processor.
But Hardware.fr has different results - note the discrepancy on L3 cache latency.
https://www.techpowerup.com/231268/amds-ryzen-cache-analyzed-improvements-improveable-ccx-compromises



That's slower than 10ns for sure.



Except for L2 bandwidth, it looks like AMD has a slower cache.

I believe that this is because the Tom's Hardware review used an older version of the cache benchmark.

That's the scheduling issue.

I feel like they should have treated it as a NUMA design. It's not quite NUMA, but it's close enough - the scheduler should be changed to reflect this.

Edit 2:
https://www.pcper.com/reviews/Processors/AMD-Ryzen-and-Windows-10-Scheduler-No-Silver-Bullet



Ping times between logical cores - that's the inter-CCX penalty right there. Within a CCX, Ryzen is about 2x as fast as a 5960X latency-wise, but about 2x slower outside the CCX.

I'd love to see what this is like on Venice.

Quote:
Originally Posted by chew* View Post

Based on what the video is saying the way you want to game is not with smt off.

You would actually want SMT on and use the core disable feature.

On some boards you have choices 2+2 or 0+4.

Since we want to eliminate the cross core latency between ccx the 0+4 option with smt on should be the optimal setting in theory and on paper according to the video.

As I stated in my review, and I stick by it, the 1700 is the chip for enthusiast overclockers, and it seems Silicon Lottery had to adjust their findings.

Also keep in mind realbench is not real stability but it should still be a reliable way to "bin"

You will probably end up .025-.050 over the volts they required for realbench for "prime blend stable with kickers ( cinebench and wprime1024 during prime 95 )"

As far as ref clock boards go.....I would suggest keeping it low( 105 MAX) till someone gets some actual data in games and data on what you can get away with hardware combination wise without killing the hardware.

Hardcore overclockers benching tend to use specific hardware that is far more tolerant of this, and it is in no way indicative of real world use without killing something.

Anyway cheers, I finally remembered my password.

chew*
It depends on the game. For games that use fewer than 4 threads, you'd actually want to do both: disable cores 4-7, leaving just 0-3 (one CCX), AND turn off SMT.

The first because it prevents inter-CCX communication. The second because turning off SMT has led to performance gains independent of the core disable - presumably because there are not enough resources in the shared queues for SMT to yield gains.

Most games show minimal gains. Some show noticeable gains.
Example: http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/74880-amd-ryzen-7-1700x-review-testing-smt-11.html



AMD is looking into this:
https://community.amd.com/community/gaming/blog/2017/03/13/amd-ryzen-community-update?sf62107357=1
Quote:
Simultaneous Multi-threading (SMT)

Finally, we have investigated reports of instances where SMT is producing reduced performance in a handful of games.
They claim that SMT can produce gains in some games.

For that reason, note that in the OP I recommend that where games can use more than 8 threads, you leave SMT and all cores on.

Perhaps the best recommendation is:
  • <4 threads - disable 4 of 8 cores and disable SMT
  • <8 threads - disable SMT, but keep all 8 cores on
  • >8 threads - disable nothing
Yeah it's complex, but I don't see a better option.
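As a software-side approximation of the "<4 threads" case (no BIOS reboot needed), you can pin a game's process to one CCX. A minimal sketch, assuming - as is typical on Linux, but not guaranteed - that logical CPUs 0-3 correspond to the physical cores of the first CCX; `first_ccx_cpus` and `pin_to_first_ccx` are hypothetical helper names:

```python
import os

def first_ccx_cpus(available, cores_per_ccx=4):
    """Pick the CPUs of the first CCX from the set currently available.

    Assumes logical CPUs 0..cores_per_ccx-1 map to the first CCX;
    the real numbering depends on the OS and whether SMT is enabled.
    """
    return set(range(cores_per_ccx)) & set(available)

def pin_to_first_ccx():
    """Pin the current process to the first CCX (Linux-only API)."""
    if hasattr(os, "sched_setaffinity"):
        target = first_ccx_cpus(os.sched_getaffinity(0))
        if target:
            os.sched_setaffinity(0, target)
        return target
    return None  # not supported on this platform
```

On Windows you'd use the Task Manager affinity dialog instead. Note this only approximates the BIOS core-disable: the second CCX still exists, and SMT is unchanged.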
 

·
Iconoclast
Joined
·
30,613 Posts
Quote:
Originally Posted by CrazyElf View Post

How it works is that there is the Infinity Fabric, which interfaces between the two L3 caches of the CCXs. Data is requested to both the RAM and CCX.

If the L3 cache has the information, then the communication has to be done through the Infinity Fabric from CCX to CCX. If not, then the DRAM is used as the "real" last level cache, which incurs a latency penalty.
Barring things like swap files, DRAM is always the last level cache. That's not the point of contention.

The infinity fabric is the only way for data to leave or enter a CCX, so all access to other CCXes, all I/O and all memory accesses go over it. The infinity fabric is faster at higher DRAM clocks because the clocks are linked, not because the CCXes are using main system memory to talk to each other.
Quote:
Originally Posted by CrazyElf View Post

it does look like they are using DRAM as a last level cache.
No more so than any other architecture.

If one CCX needs what another CCX has, they can talk to each other over the infinity fabric without accessing system memory, and this would be vastly faster than accessing system memory.

Higher memory clocks improving data fabric performance don't imply memory is being used for inter-CCX communication, it's because the entire infinity/data fabric, northbridge, uncore, whatever you want to call it, is locked to DRAM speed. The CCXes can talk to each other faster when you have faster memory, even without ever touching the memory in any way.
 

·
Registered
Joined
·
1,157 Posts
Quote:
Originally Posted by Blameless View Post

Barring things like swap files, DRAM is always the last level cache. That's not the point of contention.

The infinity fabric is the only way for data to leave or enter a CCX, so all access to other CCXes, all I/O and all memory accesses go over it. The infinity fabric is faster at higher DRAM clocks because the clocks are linked, not because the CCXes are using main system memory to talk to each other.
No more so than any other architecture.

If one CCX needs what another CCX has, they can talk to eachother over the infinity fabric without accessing system memory, and this would be vastly faster than accessing system memory.

Higher memory clocks improving data fabric performance don't imply memory is being used for inter-CCX communication, it's because the entire infinity/data fabric, northbridge, uncore, whatever you want to call it, is locked to DRAM speed. The CCXes can talk to each other faster when you have faster memory, even without ever touching the memory in any way.
Which raises the question: why would AMD not use a separate clock? Why would they tie the DF clock to the memory clock? What would they gain?
 

·
Registered
Joined
·
1,911 Posts
Quote:
Originally Posted by Firann View Post

Which gives rise to the question "Why would AMD not have a separate CLK?". Why would they connect the DF clk to the memory CLK ? What gain would they accomplish?
Easier timings, and no stalls from crossing clock domains. The real question is why they didn't make it SIGNIFICANTLY wider - 4x or more should be doable.
 

·
Iconoclast
Joined
·
30,613 Posts
Minimizing clock crossing boundaries and clock generation/distribution would certainly make for a simpler part and seems like a likely motivation.

However, I'm not sure widening the Fabric interconnect is an ideal solution either. Wider interconnects are also costly and wider won't help latency much.
 

·
Meeeeeeeow!
Joined
·
2,229 Posts
Discussion Starter #16
Quote:
Originally Posted by Blameless View Post

Barring things like swap files, DRAM is always the last level cache. That's not the point of contention.

The infinity fabric is the only way for data to leave or enter a CCX, so all access to other CCXes, all I/O and all memory accesses go over it. The infinity fabric is faster at higher DRAM clocks because the clocks are linked, not because the CCXes are using main system memory to talk to each other.
No more so than any other architecture.

If one CCX needs what another CCX has, they can talk to eachother over the infinity fabric without accessing system memory, and this would be vastly faster than accessing system memory.

Higher memory clocks improving data fabric performance don't imply memory is being used for inter-CCX communication, it's because the entire infinity/data fabric, northbridge, uncore, whatever you want to call it, is locked to DRAM speed. The CCXes can talk to each other faster when you have faster memory, even without ever touching the memory in any way.
Yes - this is dependent though on the other CCX having the information. I'm not sure though how the L3 plays into this, as it is a victim cache.

I think that the way AMD has designed it, it is much more likely for a request to have to go to DRAM rather than to another CCX, compared to Intel's design:
http://frankdenneman.nl/2016/07/11/numa-deep-dive-part-3-cache-coherency/

In Intel's case, because it's a "conventional" L3 cache, the L3 eventually holds a "copy" of the lower-level caches. In AMD's case, the L3 is a "victim" cache, which holds lines evicted from the L2 (which is a bit larger than Intel's to compensate). I'm not sure how that plays into fetching data from another CCX - on paper, there's a higher probability that what the other core needs is in that core's L2, not the shared L3 victim cache. Compounding the problem for AMD, their memory controller has a lot more latency than Intel's.

In regards to more complex topologies:

What Intel did was Cluster-on-Die technology:


It's a recognition that communicating between the two clusters introduces delays - it can take as many as 310 cycles to get something from the opposing memory controller. Maybe AMD should do the same for the 4-core CCXs, as there is a pretty big penalty. A lot of these problems could be partially mitigated by an L4 cache, even a small one.

We still need a calculation of the actual frequency of the Infinity Fabric. I think it's definitely tied to RAM - it must be the FCLK, which runs at 50% of the MEMCLK. In turn, the GMI (Global Memory Interface, the part of the Infinity Fabric that links the two CCXs) runs at 4x the FCLK, or 2x the MEMCLK - at least that's how I understand it. There are also two other clocks running at 50% of the MEMCLK: the DFICLK (which I think is the data fabric interface clock) and the UCLK (the memory controller clock). The UCLK can be modified to run at 100% of MEMCLK. No idea what this does to stability, though, or whether it's advisable - I'd advise some pretty strict tests (>1000% HCI Memtest) if you try it.
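The clock relationships described above work out like this (the GMI = 4x FCLK ratio is my reading of the situation, not a confirmed spec):

```python
# Sketch of the Zen clock relationships described above, all in MHz.
# Assumption (unconfirmed): GMI runs at 4x FCLK, i.e. 2x MEMCLK.

def zen_clocks(ddr_rating):
    """Derive the related clocks from a DDR4 rating (e.g. 3200)."""
    memclk = ddr_rating / 2   # DDR4-3200 -> MEMCLK 1600 MHz
    fclk   = memclk / 2       # FCLK at 50% of MEMCLK
    dficlk = memclk / 2       # DFICLK also at 50% of MEMCLK
    uclk   = memclk / 2       # UCLK at 50% (can be set to 100%)
    gmi    = 4 * fclk         # = 2x MEMCLK, if the 4x ratio holds
    return {"MEMCLK": memclk, "FCLK": fclk, "DFICLK": dficlk,
            "UCLK": uclk, "GMI": gmi}

clocks = zen_clocks(3200)     # MEMCLK 1600, FCLK 800, GMI 3200
```

This makes it obvious why overclocking the RAM moves everything at once: every one of these clocks is a fixed ratio of MEMCLK.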

Fun fact: Skylake EP (Purley), will likely have 3 rings in a 10+10+12 configuration with 6 distinct channels.

Edit:
Now that I think about this, it may make sense to bin on memory controller speeds. If RAM is tied to everything, then that means that it's extremely important.

Still waiting for final confirmation on GMI speeds - but will assume 22GB/s at 1200MHz RAM for now.

I'd like to see one other test: single-channel RAM results. It might be that they are using one RAM channel per CCX right now (how that would work on the 4-core version I'm not sure, as it would surely have dual channel too, but it is a different die). In my OP I also discussed what Intel did on Skylake - split the timings apart. AMD needs to follow.

Quote:
Originally Posted by Blameless View Post

Minimizing clock crossing boundaries and clock generation/distribution would certainly make for a simpler part and seems like a likely motivation.

However, I'm not sure widening the Fabric interconnect is an ideal solution either. Wider interconnects are also costly and wider won't help latency much.
It all comes down to what the bottleneck is. At 22GB/s @ 1200 MHz on the RAM, that's not much at all. We probably need both better latency and bandwidth here.
 

·
Iconoclast
Joined
·
30,613 Posts
Quote:
Originally Posted by CrazyElf View Post

In regards to more complex topology.

When Intel did was a Cluster on Die Technology:


It's a recognition that communicating between the 2 buffers introduces delays - it can take as many as 310 cycles to get something from the opposing memory controller. Maybe AMD should do the same for the 4 core CCXs, as there is a pretty big penalty.
Intel has a set of fully custom ring busses that each have nearly ten times the bandwidth of AMD's data fabric, and buffered switches to match. This all sits on its own metal layer and is overlapped by the cache/cores.

Not sure it's practical for AMD to duplicate this, at least not in the short term.
Quote:
Originally Posted by CrazyElf View Post

A lot of these problems could be partially mitigated by an L4 cache, even a small one.
The L4 cache would still be accessed via the same relatively slow data fabric that connects everything else together.
Quote:
Originally Posted by CrazyElf View Post

I"d like to see one other test - single channel RAM results. It might be that they are using 1 RAM channel per CCX right now (how that would work on the 4 core version I"m not sure, as it would surely have dual channel too), but it is a different die.
According to every description/block diagram I've seen, each CCX is four cores and an L3 cache; anything outside of that, including the memory controllers, is largely independent of the CCXes.

I suspect the quad-core parts have essentially the same 'uncore' as the eight-core parts, just with one less CCX attached.

I doubt that single channel would change much, but it may be a good way to isolate the basic data fabric performance issues from memory performance issues (e.g. if single channel does nothing, the bottleneck is the data fabric, if single channel reduces performance appreciably, it's probably an actual memory bottleneck).
Quote:
Originally Posted by CrazyElf View Post

We probably need both better latency and bandwidth here.
Best way to do this, without a radical architectural shift, would be to decouple the data fabric from the memory clock or change its multiplier.

The former option would require some more PLLs and buffers, while the latter might cap memory frequency unless the data fabric is capable of rather extreme clocks. Still, either would likely be more workable than widening the data fabric or, worse, replacing it with the sort of custom interconnect Intel uses.
 

·
Registered
Joined
·
1,157 Posts
Wouldn't it make more sense to link it to the bclk/cclk - have it be a multiple of that clock? In my mind it makes more sense to link the speed at which the two CCXs speak to each other to the core clock rather than the memory clock. This would also result in uniform gains when overclocking: if you overclock the core clocks, you also overclock the ability of the CCXs to communicate with each other. As it stands, it relies on a third, external factor rather than just the processor itself.

Then again, I'm not a processor engineer or architect.
 

·
Registered
Joined
·
2,725 Posts
Quote:
Originally Posted by CrazyElf View Post

For that reason, note in the OP that I recommend that where games can use more than 8 threads, that you leave SMT and all cores on.
There are some games that "use" more than 8 threads but don't really benefit from having more than 8 threads in the processor, though, right?
Quote:
Originally Posted by CrazyElf View Post

Perhaps the best recommendation is:
  • <4 threads - disable 4 of 8 cores and disable SMT
  • <8 threads - disable SMT, but keep all 8 cores on
  • >8 threads - disable nothing
Yeah it's complex, but I don't see a better option.
I had mentioned the idea of operating systems adopting a per-application profiling system to do this. That way the user could override the vendors' setting which would be chosen for optimal efficiency and no rebooting would be required.
 

·
Registered
Joined
·
626 Posts
Great thread, +REP to all. So if I'm reading this correctly and AMD eliminates the dual-CCX design for the R5 series, those could potentially be better-performing chips in applications that run fewer than 4-8 threads? The other issue I see is how to resolve the problem presented: the latency penalty appears to be something that can only be improved with better software scheduling of cores/logical cores, yet the penalty will always be present when spanning both CCXs?
 