Overclock.net - An Overclocking Community (https://www.overclock.net/forum/)
AMD CPUs (https://www.overclock.net/forum/10-amd-cpus/)
Theories on why SMT hurts gaming performance in Ryzen, and some recommendations for the future (https://www.overclock.net/forum/10-amd-cpus/1624566-theories-why-smt-hurts-performance-gaming-ryzen-some-recommendations-future.html)

CrazyElf 03-02-2017 05:39 PM

In gaming, there is a performance penalty of around 10% (give or take) when Simultaneous Multi-Threading (SMT) is enabled. Why is this?


The SMT Implementation

When Intel first introduced Hyper-Threading, there were real performance penalties in some workloads. Right now, AMD's SMT appears to be in a similar situation.



The 3 queue resources (in green in AMD's block diagram) are shared, meaning they are not duplicated for SMT.

I suspect that when SMT is on, these queues are effectively halved per thread. While sufficient for a full core without SMT, they are probably bottlenecking the SMT implementation. In essence, AMD repeated Intel's early mistake with HT.

This cannot be fixed with any BIOS update, although perhaps a few mitigations could be found in microcode.



The CCX

This is the Ryzen die.



There are 2 distinct 4-core clusters, making 8 cores. Each cluster is called a CCX (Core Complex).




Communication between the 4 cores within a CCX is very fast. Each core has its own L1 and L2 cache, and the CCX shares an 8 MB L3 cache split into 4 slices, much like a standalone 4-core CPU. One notable difference versus Intel CPUs is that the L3 is a victim cache (filled by lines evicted from L2), rather than one that collects data directly from prefetch/demand requests as on Intel CPUs. Note the larger-than-usual L2 cache to compensate for this.



Communication between the 2 CCXs is less so, and there is a big penalty in both bandwidth and latency. How it works is that there is a link between the 2 CCXs. The link AMD currently uses is called "Infinity Fabric", which is basically an upgraded HyperTransport design, and it appears to run at RAM speed. When one CCX needs data that the other CCX has, the Infinity Fabric checks the other CCX's L3 cache and, at the same time, sends a request to the memory controller. In most cases the memory controller request is cancelled because the data is found in the other CCX's L3. If not, however, DRAM effectively becomes the "true" last-level cache. AMD claims the Infinity Fabric has a bandwidth of 22 GB/s.

I'm actually concerned about that. For comparison, QPI on Haswell-E is about 38.4 GB/s: 4.8 GHz x 2 (double data rate) x 16 data bits per transfer (the link is physically 20 bits wide, but 4 bits are overhead) / 8 bits per byte x 2 (bidirectional) = 38.4 GB/s. That's almost double AMD's quoted 22 GB/s, and QPI is a 2P-socket interconnect! For Skylake (Purley), Intel plans an even faster interconnect called UltraPath Interconnect (UPI), also known as KTI or Keizer Technology Interconnect. It is reported to run at 9.6 GT/s or 10.4 GT/s and to support multiple requests per message, so it should see efficiency gains. The point is that on Intel, off-die communication between 2 CPUs has more bandwidth than Ryzen's on-die communication between 2 CCXs!

Edit: It is worth mentioning that Intel lists the Haswell-EP Xeons at 9.6 GT/s on QPI, which works out the same way: 9.6 GT/s x 16 data bits per transfer (really 20 including data integrity) / 8 bits per byte x 2 (bidirectional) = 38.4 GB/s.
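
A quick sanity check on this arithmetic - a minimal sketch; the 22 GB/s Infinity Fabric number is AMD's quoted figure, not something derived here:

Code:

# Link bandwidth in GB/s: transfers/s (GT/s) x 16 data bits per transfer
# / 8 bits per byte, then x2 if counting both directions of the link.
def link_bandwidth_gbs(gt_per_s, data_bits=16, bidirectional=True):
    one_way = gt_per_s * data_bits / 8.0
    return one_way * (2 if bidirectional else 1)

print(link_bandwidth_gbs(9.6))    # Haswell-EP QPI, 9.6 GT/s  -> 38.4
print(link_bandwidth_gbs(10.4))   # planned UPI,   10.4 GT/s  -> 41.6
# versus AMD's quoted 22 GB/s for the CCX-to-CCX Infinity Fabric link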

Infinity Fabric will also be used in Vega.




That means this is not like having 1 big monolithic die. This is like having 2 fast 4 core CPUs.

For a comparison, here's Broadwell E, which uses a distinct "ring" design:




So there isn't a comparable penalty on Intel CPUs for communication between cores, because they all sit on one "ring" rather than 2 CCXs. Communication across the ring has some latency (data may have to travel halfway around it in the worst case), but the CPU acts as one unit. With AMD's solution, by contrast, communication within a CCX is much faster, but communication between CCXs is really slow. Apparently AMD chose this design for scalability (they also don't have as many engineers as Intel).

Right now there is a performance penalty because Windows is treating this like a monolithic die, rather than 2 separate CPU complexes, which is what this really is.

Think of it as 2 4-core CPUs on a 2P socket, not 1 8-core CPU. There is no real way to "fix" this issue - it's inherent in the design - although a Windows update or an updated Linux kernel would be very good.




Other

RAM Speeds
We learned a while back that there was a RAM slowdown.

https://www.overclock.net/t/1624058/dvhardware-amd-ryzen-has-issues-with-high-frequency-ddr4-fix-expected-in-1-2-months/200

That might be an issue. When the 6700K was first released, it was often slower in gaming than the 4790K, despite Skylake having better IPC than Haswell. The reason was the poor speed and loose timings of early DDR4. The interesting thing here is that Zen does very well in workstation benchmarks (better than Broadwell-E in many cases, and perhaps even better than a hypothetical Skylake-E), which makes RAM a very likely culprit.


Clockspeeds

One of the reasons Ryzen is cheap is that AMD used high-density libraries. That allows more dies in a smaller area and reduces power consumption. The penalty is clockspeed. For similar reasons, a GPU like the 290X does not have as much overclocking headroom as, say, a 7970. You can put more transistors in a given area, but at the expense of clocks, due to power density (power rises much faster than linearly with clockspeed, since voltage has to rise with it). That reminds me: one of the reasons Kaby Lake clocks about 300 MHz higher than Skylake is that Kaby Lake is less dense.


This design may very well be why Ryzen cannot overclock further. Actually, 4 GHz is already very good considering this.
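
To illustrate why density trades against clocks, here is a toy model of dynamic power, P ~ C x V^2 x f, with an assumed (purely illustrative) linear voltage-frequency curve - these are not Ryzen measurements:

Code:

# Toy model: dynamic power scales with C * V^2 * f, and V must rise with f.
# The voltage/frequency slope below is an illustrative assumption.
def relative_power(f_ghz, f_base=3.0, v_base=1.0, dv_per_ghz=0.15):
    v = v_base + dv_per_ghz * (f_ghz - f_base)   # assumed V-f relation
    return (v / v_base) ** 2 * (f_ghz / f_base)  # power relative to base

for f in (3.0, 3.5, 4.0, 4.2):
    print(f, round(relative_power(f), 2))
# 3.0 -> 1.0, 4.2 -> ~1.95: a 40% clock bump nearly doubles power in this
# model, which is why dense, low-power layouts run out of headroom fast.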

Voltage Integration
For those who remember, Haswell introduced the FIVR, which integrated the voltage regulator onto the CPU package rather than the motherboard. I think the LDOs on Zen are bypassed on consumer boards, so this is a non-issue: voltage regulation takes place entirely on the motherboard.

Unlike Carrizo, I don't see this as a bottleneck, unless the integrated voltage is in use somehow.

Fun fact: the Zen die that implements this design is codenamed Zeppelin.

Uncore
The uncore (cache) clock does not seem to be separate from the core clock, unlike on Intel CPUs. If this, and not the HDL, is what bottlenecks clockspeeds, then we may be cache-speed rather than core limited. This might explain Ryzen's weak OCs.

What I don't know is whether it is the cache or the HDL that is limiting OCs. If it is the cache, then we may be able to gain a few hundred MHz by splitting the two apart.

Base Clock
Much like Sandy Bridge, the base clock is closely tied to everything else, so overclocking it is likely to introduce instability. I'd guess that past about 105 MHz, PCIe 3.0 may become unstable.

I would like to see this separated (kind of like what Intel did with Skylake, where they separated the CPU base clock from the rest of the board). I would also like a "strap" function like on Intel boards to be added for unlocked CPUs.



4 cores don't suffer from the CCX communication problem

With only 1 CCX, the 4-core Zen CPUs will not suffer from this problem. In fact, for a mid-range system, a 4-core Ryzen CPU with SMT disabled would be a very good value.

Once the RAM speed issues are resolved, and with SMT disabled, the main flaws are not a problem at all.

This works for a budget system, an APU, and may be an advantage on a laptop.




You may still be better off with Ryzen

Keep in mind that most games are GPU bottlenecked, not CPU bottlenecked.

You could buy a 6900K + an X99 motherboard + 1 GPU. Alternatively, you could buy an 1800X + an X370 board + 2 GPUs for CF/SLI. In games that support CF or SLI, that would be an advantage - and remember, you are usually not CPU bottlenecked.

The main drawback of course is multi-GPU issues. The other is the games where you do have CPU bottlenecks. Many strategy games (like the Total War series and simulator games) and the Battlefield series (especially in multiplayer on large maps) are CPU bottlenecked.


We really need a review of Zen vs X99 at 4k.



AVFS

Ryzen uses AVFS, much like Carrizo.



That may be a big part of the power savings of Ryzen.

I don't know if this has any impact on overclocking headroom, but the top Polaris chips also used AVFS, and the best ones (the XFX RX 480 GTR Black comes to mind) could go past 1500 MHz at times - provided you got lucky in the silicon lottery. We may see clocks mature as Zen matures.


AVX/FMA

Currently AMD's AVX/FMA throughput does not scale as well as Intel's: Zen executes 256-bit AVX operations as two 128-bit micro-ops, so its peak per-clock FMA throughput is roughly half of Broadwell's.
https://forums.anandtech.com/threads/ryzen-strictly-technical.2500572/

Not many games use these instructions, but it is a point to note. It may affect productivity, though, depending on your workload.
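
For a rough sense of the gap, peak FMA throughput can be estimated from the commonly cited per-cycle figures (Zen: 2x128-bit FMA pipes; Broadwell: 2x256-bit). These are peak numbers under assumed clocks, not benchmarks:

Code:

# Peak single-precision GFLOPS = cores x FLOPs/cycle x clock (GHz).
# FLOPs/cycle values are the commonly cited peak FMA rates (assumptions).
def peak_sp_gflops(cores, flops_per_cycle, ghz):
    return cores * flops_per_cycle * ghz

print(peak_sp_gflops(8, 16, 3.6))  # Ryzen 7 1800X-ish: ~460.8
print(peak_sp_gflops(8, 32, 3.2))  # i7-6900K-ish:      ~819.2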






Conclusions
Keep in mind the value proposition - you may still be better off with Zen. Also keep in mind what the 4-core CPUs could mean for budget systems and APUs.

The combination of these problems means that Zen cannot match Intel's HEDT in gaming (let's assume a 6800K/5820K + an X99 board is the approximate peer of an 1800X + an X370 board; the 1800X will be more expensive, but a good X99 board will also cost more, due to the 40 PCIe lanes, quad-channel RAM, and more complex chipset). Combine this with weak OC headroom (whether due to the cache or the HDL) and you have an explanation.

That said, in any game that uses more cores, the 8-core Ryzen should destroy the Skylake and Kaby Lake Intel CPUs. Keep in mind that with DX12 and Vulkan, Ryzen may be the more future-proof buy. Oh, and Ryzen can destroy Intel at content creation - even the 6900K cannot keep up.


I think we need a patch in Windows, and in the next Linux kernel, for the 2 CCXs to be treated like 2 separate CPUs. Essentially this would eliminate much of the CCX penalty, where data misses the other CCX's L3 cache and has to come from RAM.

Maybe a microcode update could mitigate some of these problems, but they are inherent in the hardware, so I am unsure of how much it will gain.

They need to get the BIOS updates for the RAM out ASAP.

For gamers: disable SMT when gaming.

From a programming POV, it may make sense to treat each 4-core CCX like a mini NUMA node. Communication between CCXs is then minimized, keeping data in L3 rather than using the Infinity Fabric and DDR4 as the last-level cache.
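
As a minimal sketch of that idea on Linux (the CPU numbering here is an assumption - verify with `lscpu -e` which logical CPUs belong to which CCX):

Code:

import os

# Assume logical CPUs 0-3 are the four cores of CCX 0; with SMT on, the
# sibling threads may be numbered 8-11 or interleaved, so check first.
ccx0 = {0, 1, 2, 3}
os.sched_setaffinity(0, ccx0)  # pid 0 = the current process

# Threads spawned from here inherit the mask, so a 4-thread workload
# stays inside one L3 and never crosses the Infinity Fabric.
# Shell equivalent: taskset -c 0-3 ./game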





Zen+ Ideas
  • AVX and FMA performance need a boost in the future.
  • Increase the queue resources so as to prevent penalties with SMT.
  • A higher-speed interconnect between the CCXs, so that they don't have to fall back on DRAM as the last-level cache, and perhaps further augments to Infinity Fabric. I think Zen would benefit from an L4 eDRAM cache; perhaps a future version could feature HBM.
  • Isolate the cache, BCLK, and core clocks from one another. On Skylake, for example, the core and uncore clocks are separate; Ryzen has no separate uncore clock. If the uncore is holding back clocks, then perhaps the core could go a bit faster. For the base clock, Intel introduced with Skylake the ability to run the CPU's BCLK separately from the rest of the motherboard.

By a higher-speed interconnect, I mean something like this: a 24-core Broadwell-E HCC design, with 2 "rings" of 12 cores.



Note the 2 buses between the 2 "rings" that allow for a high-speed connection. Without them, anything passing between the 2 rings would have to go through DRAM, exacting a huge performance penalty. AMD needs to do something similar or beef up the Infinity Fabric; 22 GB/s is not enough.

Yet another option may be an L4 eDRAM or even HBM configuration for the last-level cache (to prevent trips to DRAM and their latency penalty).





I'm sure AMD engineers know about these.

This is an amazing CPU all things considered - competitive, and it could get a lot better with Zen+. I think they could gain 15% with some changes. Power consumption is good, and it has decent clocks even with low OC headroom. It's also a good value, and awesome for content creation.



Thanks
To Looncraz as well for his advice on the LDOs

OCmember 03-03-2017 08:18 AM

Wow, great review! Thanks!

Quantum Reality 03-03-2017 08:24 AM

Excellent!

Also, some benches I've seen support the theory that disabling SMT will help gaming framerates in the interim while OS, BIOS and microcode patches get rolled out.

Yukon Trooper 03-03-2017 08:59 AM

Meh. Many gaming benchmarks show much less than 10% difference with SMT enabled/disabled. I don't think there's as much to gain there as people are hoping.

I also take issue with the GPU bottleneck argument. That may be true for average/max FPS, but for the all-important minimum FPS metric we'll likely see Zen destroyed as more in-depth gaming benchmarks come out.

System optimizations, BIOS updates, etc. are only going to get people's hopes up. AMD doesn't even really try arguing this point. AMD's main argument is that game developers will start coding for Zen moving forward, but it will take at least 1-2 years before the market is semi-saturated with Zen-optimized titles.

madbrayniak 03-03-2017 09:33 AM

Here is a good video by Gamers Nexus explaining some variance in Ryzen performance reviews:

https://www.youtube.com/watch?v=TBf0lwikXyU

Gamingboy 03-03-2017 09:55 AM

Ryzen's launch was a bit "premature" in the sense that the third-party companies making motherboards for the processors were not yet all that prepared. Steve from Gamers Nexus highlights many of the BIOS problems with the MSI and ASUS boards; just click on the YouTube link madbrayniak shared.

AlphaC 03-03-2017 10:43 AM

Great overview.

It's interesting that you suggest the OS should recognize it as a 2P 4 core rather than 8 core.

I hope that by the time Ryzen 5 releases, many of these issues are ironed out by motherboard manufacturers and RAM manufacturers. So far the only RAM company seemingly on top of the RAM issue is GSkill; their long-term "solution" is to release Ryzen-specialized RAM in the form of Flare X.

If your reasoning is correct, then the biggest improvement could come with greater supported RAM speeds (due to the CCXs' Infinity Fabric implementation). Earlier news suggested the Infinity Fabric would be faster: "The company declined to give data rates or latency figures for Infinity, which comes only in a coherent version. However, it said that it is modular and will scale from 30- to 50-GBytes/second versions for notebooks to 512 Gbytes/s and beyond for Vega." http://www.eetimes.com/document.asp?doc_id=1330981&page_number=2

IMO when you see Firestrike/Timespy benchmarks with Ryzen 7 competitive with the i7-6900K and an i7-7700K @ 5.1GHz, that means it is a game optimization issue.

Also, I realized the Stilt's reasoning that < 3.3GHz is optimal for these chips might mean we ought to be undervolting instead of overclocking. It certainly explains the clockrates on the Ryzen 7 1700 and Ryzen 7 1700X.

CrazyElf 03-03-2017 11:58 AM

I feel that apart from the issues I've raised, AMD's CPU is very well made.

Quote:
Originally Posted by AlphaC View Post

Great overview.

It's interesting that you suggest the OS should recognize it as a 2P 4 core rather than 8 core.

I hope by the time Ryzen 5 releases many of these issues are ironed out by motherboard manufacturers and RAM manufacturers. So far the only RAM company seemingly on top of the RAM issue is GSkill , their "solution" in the long term is to release AMD Ryzen specialized RAM in the form of Flare X.

If your reasoning is correct then the biggest improvement could come with greater support RAM speeds (due to the CCX's Infinity Fabric implementation). Earlier news suggested the Infinity Fabric would be faster : "The company declined to give data rates or latency figures for Infinity, which comes only in a coherent version. However, it said that it is modular and will scale from 30- to 50-GBytes/second versions for notebooks to 512 Gbytes/s and beyond for Vega." http://www.eetimes.com/document.asp?doc_id=1330981&page_number=2

IMO when you see Firestrike/Timespy benchmarks with the Ryzen 7 competitive with respect to i7-6900K & i7-7700k @5.1GHz, then that means it is a game optimization issue.

Also I realized the Stilt's reasoning of < 3.3GHz being optimal for these chips might mean we ought to be under-volting instead of overclocking. It certainly explains the clockrates on ryzen 7 1700 and Ryzen 7 1700X.


+Rep - that gets my thought juices going.

Yes, which reminds me.

Once the RAM fix is in order, it may be advisable to buy top-binned RAM and OC it (the best timings/clocks you can get). There is probably more to gain from tight timings and high RAM clocks on Ryzen than on Intel platforms. The reason is that Intel platforms don't use their DRAM as the last-level cache, while Ryzen effectively does; memory bandwidth is therefore rarely the bottleneck on Intel, whereas when communicating between the 2 CCXs it can be. Actually, a 2-DIMM board might be a good idea on Ryzen for that reason (shorter trace lengths and the possibility of a better memory OC).

The source of the Infinity Fabric 22 GB/s figure was the PCGH.de review. Apparently they talked with AMD about this.

Also, seeing that there's not much OC headroom and you may want to undervolt, there is little point in buying flagship motherboards with insane VRMs, unless of course you need the other features those boards offer. Maybe put the savings towards better-binned RAM. The only case where you may still want a flagship is if that board can clock RAM faster.

Perhaps AMD should also focus on a better memory controller for Zen+, although with my Zen+ proposals it won't be as critical, because there would be a faster last-level cache.

Will update the OP on this.





I'm actually worried about what the Infinity Fabric could mean for Vega. Keep in mind that 22 GB/s is not a lot at all.

Quote:
Originally Posted by Quantum Reality View Post

Excellent!

Also, some benches I've seen support the theory that disabling SMT will help gaming framerates in the interim while OS, BIOS and microcode patches get rolled out.


We should see modest gains. Getting rid of SMT will add a few percentage points (perhaps as much as 10%), and the RAM fixes will add another few percent. That should mostly close the gap with Intel.

The big thing we need from the microcode (and the OS kernels) is to treat the CCXs as different CPUs. If we can get, say, a 4-thread game to use only 1 CCX, the gap should all but disappear. In that case, we could even see AMD get the kinds of wins in games that it gets in workstation benchmarks.
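
On Windows this can be tried today with an affinity mask, where bit n covers logical processor n. Assuming logical processors 0-7 are CCX 0's four cores plus their SMT siblings (an assumption - verify the mapping with Coreinfo or Task Manager), the mask is 0xFF:

Code:

# Build an affinity mask covering logical processors 0..7
# (assumed to be CCX 0's cores plus their SMT siblings).
mask = 0
for lp in range(8):
    mask |= 1 << lp
print(hex(mask))  # 0xff

# Usage with the built-in Windows launcher:
#   start /affinity FF game.exe
# keeping the game's threads inside CCX 0's L3.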



Quote:
Originally Posted by Yukon Trooper View Post

Meh. Many gaming benchmarks show much less than 10% difference with SMT enabled/disabled. I don't think there's as much to gain there as people are hoping.

I also take issue with the GPU bottleneck argument. That may be true for average/max FPS, but for the all-important minimum FPS metric we'll likely see Zen destroyed as more in-depth gaming benchmarks come out.

System optimizations, BIOS updates, etc. are only going to get peoples' hopes up. AMD doesn't even really try arguing this point. AMD's main argument is game developers will start coding for Zen moving forward, but it will take at least 1-2 years before the market is semi-saturated with Zen optimized titles.



True, it's not a huge difference, but it counts for a lot of people. Optimal use of the CCXs would likely boost not just games but workstation loads even more. If games using 4 or fewer threads kept their loads within 1 CCX, they'd be able to pull the kind of results in games that they do in workstation benchmarks.

Keep in mind that at 1440p, GPUs rather than CPUs become the bottleneck (save in CPU-bottlenecked games like the Total War series). That means even this bottleneck will mostly disappear, and in the few games with CPU bottlenecks, faster RAM along with better CCX management should mitigate it.

I do not believe that Zen will get destroyed with the fixes I have proposed - at the very least, my Zen+ proposals would lead to viable fixes in Zen+.

AlphaC 03-03-2017 12:41 PM

The impact on workloads can be extremely odd though:


http://www.hardware.fr/articles/956-7/impact-smt-ht.html

and guru3d: http://www.guru3d.com/articles_pages/amd_ryzen_7_1700x_review,22.html

Quantum Reality 03-03-2017 12:49 PM

So applications (e.g. video encoding, calculations, etc) - SMT on.

Games (esp DX12) - SMT off.

Too bad you can't have dynamic feature-disabling profiles without needing a reboot to change the setting.

gtbtk 03-04-2017 12:12 AM

This is the first post that is actually looking for answers in the right place.

 

I am really surprised that no one else has noticed the correlation between CPU and GPU performance yet. All the benchmarks that only hit the CPU seem to produce really good results that can beat Broadwell-E, while all the benchmarks that rely on the CPU plus a strong GPU underperform. GPU load impacts CPU performance.

 

Together with the memory clock speed limitations, isn't it obvious that the gaming performance issue is caused by a weakness somewhere in the interface between the CPU and GPU? The PCIe 3.0 bus runs at a fixed x16 rate, so it is not the size of the pipe that is limiting things. That leaves only one thing that can be causing the issue: the part of the chip that manages IO (the PCIe and memory controllers - the fabric discussed in the original post).

 

CPUs, whether Intel or AMD, all have to juggle interdependent resources to get peak performance; too much strength on one side will overwhelm the other and reduce performance. Given that none of the reviewer "experts", nor apparently any of the motherboard "engineering" marketing people, have mentioned it, it would seem they don't really understand what is going on and are just following an overclocking process they memorized in the past.

 

This is just from a thought experiment, but I believe I can tell you the solution.

 

I am pretty certain that the workaround to this apparent conundrum will be to use a higher BCLK frequency with a lower multiplier and to fine-tune the IO/SOC voltages. It should ease the memory clocking limitations and increase the number of cycles per second the PCIe controller has to handle data flow between the CPU and GPU, allowing a better-balanced system. It should also help fast memory kits clock past 2933MHz.

 

I would start by setting BCLK to 125 and the CPU multiplier to around 32 or 33. Set memory to a higher bin frequency and adjust RAM timings. I don't know what the exact best multiplier/BCLK combination will be; that will require experimentation by the people who have the chip in hand. I suspect that the 1800X and 1700X chips, because of the higher complexity at the SOC level that currently seems to be having problems, will benefit the most from this approach.
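
Since core clock is just BCLK x multiplier, the candidate combinations are easy to enumerate (arithmetic only - it says nothing about which combinations are actually stable):

Code:

# Core clock (MHz) = BCLK (MHz) x multiplier.
target_mhz = 4000.0
for bclk in (100.0, 125.0, 150.0):
    print(f"BCLK {bclk:5.1f} -> multiplier {target_mhz / bclk:.2f} for 4.0 GHz")
# 100.0 -> 40.00, 125.0 -> 32.00, 150.0 -> 26.67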


navjack27 03-04-2017 01:35 AM

I love all these theories. Now that I have my 1800x running my ASRock x370 killer Mobo I finally have a feel for the strangeness that is this chip. It performs so wildly.

I've updated my BIOS to the latest, but what gets me is how many options look like blatant debug settings. So many settings have no description but, once you look at them, hint at some deeper issues.

I wonder how much AMD is helping the motherboard guys; most of it seems like a shot in the dark to me.

gtbtk 03-04-2017 01:50 AM

Quote:
Originally Posted by navjack27 View Post

I love all these theories. Now that I have my 1800x running my ASRock x370 killer Mobo I finally have a feel for the strangeness that is this chip. It performs so wildly.

I've updated my BIOS to what is the latest, but what gets me is how many options seem like blatant debug settings. So many settings that have no description but really once you look at them, hint at some deeper issues.

I wonder how much AMD is helping the motherboard guys, most of it seems like a shot in the dark to me.

 

The whole platform seems to have come to market before it was ready. Deadlines from the CEO, I guess.

 

so how have you overclocked the chip/memory?


navjack27 03-04-2017 02:15 AM

well i have a coolermaster t4, so cooling ain't that great. i can't get my memory to do much past stock for whatever reason. i've had 4ghz just fine until heavy multithreaded loads, but the issue is i have no idea what the voltage was or is, or if it took. i have the best luck using ryzen master to set things AFTER i set them in bios.

right now i'm just stock auto everything and i'm going to wait it out until microcode updates and more bios updates happen.

maybe i'm a huge nerd but it's neat having all these options but also having no idea what they do





EDIT: no option in my bios for memory timings... no way of knowing if XFR is actually ON when the ratio is on auto. no per-core settings... there's actually a ton "missing"

might i mention that the ryzen master user manual says you need to enable HPET in windows to use it. and yes, you do need to enable it. but that COULD explain the lower gaming benchmark numbers in some cases.

ChronoBodi 03-04-2017 02:29 AM

Quote:
Originally Posted by navjack27 View Post

well i have a coolermaster t4, so cooling ain't that great. i can't get my memory to do much past stock for whatever reason. i've had 4ghz just fine until heavy multithreaded loads but the issue is i have no idea what the voltage was or is or if it took. i have the best luck using ryzen master to set things AFTER i set them in bios.

right now i'm just stock auto everything and i'm going to wait it out until microcode updates and more bios updates happen.

maybe i'm a huge nerd but its neat having all these options but also having no idea what they do





EDIT: no option in my bios for memory timings... knowing if XFR is actually ON when ratio is on auto. no per-core settings... theres actually a ton "missing"

might i mention that it says in the ryzen master user manual that you need to enable HPET in windows to use it. and, yes you do need to enable it. but that COULD explain the lower gaming benchmark numbers in some cases.

those bios options are utterly alien coming from Intel for so long.

like lol, where to even begin?

navjack27 03-04-2017 02:34 AM

well if you have irritable bowel syndrome just disable or enable.
if ur zen is common then go in that menu
if you wanna gimp ur cpu then disable branch prediction
if someone dis'd you and you want to redirect a witty rejoinder back at em then enable that first setting.

at least, these are my guesses

Undervolter 03-04-2017 02:50 AM

Quote:
Originally Posted by ChronoBodi View Post

those bios options are utterly alien coming from Intel for so long.

like lol, where to even begin?

Custom P-States looks like an option to edit every P-state's frequency and voltage, much like K10Stat for AM3 CPUs or AMDMsrTweaker for AM3+. For example, if you think the lowest P-state Ryzen uses is too low for your taste, you can edit and change it. Or if you think it's overvolted, you can undervolt it. Just a guess here, i don't have the CPU. If you don't care about such stuff, you can simply ignore it and use whatever P-states AMD has chosen. It's a very nice feature for undervolters, if it's what i think it is.

navjack27 03-04-2017 02:52 AM

it goes down to 400mhz on the lowest one. you have to enable 'custom' on each one starting with p0 to see the other ones accurately.

gtbtk 03-04-2017 02:56 AM

Quote:
Originally Posted by navjack27 View Post

well i have a coolermaster t4, so cooling ain't that great. i can't get my memory to do much past stock for whatever reason. i've had 4ghz just fine until heavy multithreaded loads but the issue is i have no idea what the voltage was or is or if it took. i have the best luck using ryzen master to set things AFTER i set them in bios.

right now i'm just stock auto everything and i'm going to wait it out until microcode updates and more bios updates happen.

maybe i'm a huge nerd but its neat having all these options but also having no idea what they do





EDIT: no option in my bios for memory timings... knowing if XFR is actually ON when ratio is on auto. no per-core settings... theres actually a ton "missing"

might i mention that it says in the ryzen master user manual that you need to enable HPET in windows to use it. and, yes you do need to enable it. but that COULD explain the lower gaming benchmark numbers in some cases.

Realbench will give you a message if you try to run it without the HPET setting in the EFI. As a number of the reviews used Realbench, I doubt that is the issue.

 

Look in "DRAM timing configuration" section for the primary memory timings.

 

I can't tell you where to look, but you might like to try 125 BCLK with a 32 multiplier instead of 100 BCLK with a 40 multiplier. You will still get 4GHz, but you will have the option to clock memory higher than 3200 and it may help with combined CPU/GPU performance.

 

P-states are the performance states the CPU operates in. P-state 0 is full speed, and different loads/temps will push the CPU to different P-state levels. I haven't seen the settings myself, so I can't advise how they could be adjusted. It is possible that Cinebench and gaming operate the CPU at different P-states, and gaming performance could be recovered in those settings.


ChronoBodi 03-04-2017 02:58 AM

what would be the option to set 3.9 ghz on all cores, not just one or two at full load?

gtbtk 03-04-2017 03:08 AM

Quote:
Originally Posted by ChronoBodi View Post

what would be the option to set 3.9 ghz on all cores, not just one or two at full load?

with a higher BCLK? that would be 3900Mhz/125 = 31.2

 

using only the multiplier with default BCLK is 3900/100=39.

 

Any overclock on Ryzen currently applies equally to all operational cores. I don't believe they have enabled independent per-core overclocking yet. You can disable cores 2 at a time to get higher overclocks on the remaining cores.


Undervolter 03-04-2017 03:25 AM

Quote:
Originally Posted by navjack27 View Post

it goes down to 400mhz on the lowest one. you have to enable 'custom' on each one starting with p0 to see the other ones accurately.

Well, that means that if you choose it, the CPU will most likely idle at 400MHz instead of whatever default AMD has set. That's a very good feature for undervolters. I've read of people who would resort to 3rd-party programs to do that, like trying to run a Phenom or FX at a lower idle P-state than the default 1400MHz.

gtbtk 03-04-2017 06:03 AM

Quote:
Originally Posted by Undervolter View Post
 
Quote:
Originally Posted by navjack27 View Post

it goes down to 400mhz on the lowest one. you have to enable 'custom' on each one starting with p0 to see the other ones accurately.

Well, that means, that if you choose it, then most likely the CPU will idle at 400Mhz instead of whatever the default set by AMD is. That's a very good feature for undervolters. I 've read people that would resort to 3rd party programs to do that. Like, try to run Phenom or FX at lower idle P-state than the default 1400Mhz.

I think that it will use that one when it is in power saving idle mode


Undervolter 03-04-2017 06:21 AM

Quote:
Originally Posted by gtbtk View Post

I think that it will use that one when it is in power saving idle mode

Yes, for the P-states to work, you need Cool'n'Quiet enabled, or whatever name AMD gives it nowadays (PowerNow? Power-something?). When Cool'n'Quiet is enabled, the CPU has various P-states, from the lowest (idle) to the highest (the upper turbo). If the motherboard allows editing them, then you can change them all, or if possible only some. The only trick is that, if you are allowed to change vcore for each one, you should be certain that the vcore you set is enough to keep the CPU stable.

gtbtk 03-04-2017 06:30 AM

Quote:
Originally Posted by Undervolter View Post
 
Quote:
Originally Posted by gtbtk View Post

I think that it will use that one when it is in power saving idle mode

Yes, for the P-States to work, you need Cool N Quiet enabled, or whatever name AMD has given it nowdays (PowerNow? Powersomething?). When "Cool N Quiet" is enabled, the CPU has various P-States, from the lowest (idle) to the highest (the upper turbo). If the motherboard allows to edit them, then you can change them all or if possible, only some. The only trick, is that, if you are allowed to change vcore for each one, you should be certain that the vcore you put, is enough to keep the CPU stable.

 

It is doing exactly the same thing that GPUs do.

 

I can't speak for AMD GPUs, but Nvidia Pascal units use P-state 0 in gaming 3D loads, yet they use P-state 2 if you run something like LuxMark to render a 3D image.

 

Is there a utility that monitors what P-state the CPU is actually operating in under different loads? And what settings can actually be changed with regard to P-state performance? It would be interesting to compare the P-state in Cinebench and the P-state in GTA V, for example. Tuning those may well be the place to find improvements.
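
On Linux, one low-effort way to watch this is to sample each core's current frequency from the standard cpufreq sysfs files while a workload runs (a sketch, assuming the cpufreq driver exposes these files on Ryzen):

Code:

import glob, time

# Print each core's current frequency once per second; run Cinebench or
# a game alongside and watch where the cores actually settle.
while True:
    for path in sorted(glob.glob(
            "/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq")):
        cpu = path.split("/")[5]        # e.g. "cpu0"
        with open(path) as f:
            print(cpu, int(f.read()) // 1000, "MHz", end="  ")
    print()
    time.sleep(1)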


Undervolter 03-04-2017 06:37 AM

Quote:
Originally Posted by gtbtk View Post

It is doing exactly the same thing that GPUs do.

I cant speak for AMD GPus but Nvidia Pascal units use Pstate0 in gaming 3d loads but they use Pstate2 if you run something like lexmark to render a 3d image.

Is there a utility that monitors what Pstate the CPU us actually operating in under different loads? What settings can actually be changed with regards pstate performance? it would be interesting to compare the Pstate in cinebench and the pstate in GTA V for example. Tuning those may well be the place to find improvements

The thing is, i don't have Ryzen, so it's all guesswork. If you can see the P-states somewhere in your BIOS, then you can understand what's running inside Windows, provided you have software that shows the clock in real time. For example, the lowest P-state corresponds to the lowest clock, the next one to the next attainable clock, and so on. For FX, AMD had released a small program called PSCheck that showed the P-states all the time, but it's of little practical use: when your CPU is at max clock, you are also at the top P-state. As for performance, i don't know what possibilities the BIOS gives. On AM3, software like K10Stat allowed you to set the multi (and thus clock), vcore and NB voltage. In AMDMsrTweaker (see my signature), you can adjust the multi (thus clock) and vcore.

CrazyElf 03-04-2017 09:07 AM

There is some evidence that the bandwidth may be double, but we will need testing.



Quote:
Originally Posted by gtbtk View Post

This is the first post that is actually looking for answers in the right place.

I am really surprised that no-one else has noticed the correlation yet between CPU and GPU performance. All the CPU benchmarks that only hit the CPU seem to provide really good results that beat can Broadwell-E while All the benchmarks that rely on the CPU plus a strong GPU are under performing. GPU load impacts CPU performance.  

Together with the memory clock speed limitations, Isnt it obvious to everyone that the gaming performance issue is being caused because of a weakness in performance somewhare in the interface between the CPU and GPU? The PCIe 3.0 bus is running at a fixed rate x16 so it is not the size of the pipe that is limiting things so there is only one thing left that can possibly be causing the issue and that is the part of the Chip that manages the chip IO (PCIe and Memory controllers - the fabric that you are discussing in the original post).

CPU's, and it doesn't matter if it is Intel or AMD, all have to juggle interdependent resources to get peak performance, too much strength on one side will overwhelm the other side and will reduce performance. Given that none of the Reviewer "experts" nor, apparently any of the Motherboard "engineering" Marketing people have mentioned it would seem to indicate that they don't really understand what is going on are are just trying to follow an overclocking process that they have memorized in the past. 

This is just from a thought experiment but I believe that I can tell you The Solution.

I am pretty certain that the workaround to this apparent conundrum will be to use a higher BCLK frequency with a lower multiplier and fine tune the IO/SOC voltages. It will improve memory clocking limitations and will increase the the number of cycles per second that the PCIe controller can deal with data flow between the CPU and GPU allowing a better balanced system. It will also allow you to  improve fast frequency memory kits ability to clock past 2933Mhz.

I would start by Setting BCLK to 125 and the CPU multiplier to around 32 or 33. Set memory to a higher bin frequency and adjust Ram timings. I don't know what the exact best multiplier/BCLK frequency combination will be, It could be 150/20 for a 4Ghz CPU frequency. That will require experimentation by the people who have the chip in hand. I suspect that the 1800X and 1700X chips, because of higher complexity at the SOC level that currently seems to be having problems, will benefit the most from this approach.


I'm not sure if that solution will work.

Keep in mind that unlike Skylake, the base clock is not isolated from the rest of the components. PCIe signals begin to degrade past about 105 MHz, like on Sandy Bridge, so base clock overclocks may be limited the way they were on P67/Z68. To the best of my knowledge, there's nothing like the "strap" function on Intel CPUs.


We also need mature BIOSes to see what the RAM OC potential is. If DRAM is effectively the last-level cache, then we need all the RAM OC we can get.

sterob 03-04-2017 10:15 AM

Can the SMT problem be fixed with microcode, Windows updates and software optimization, or should users wait it out until the next generation?

cloppy007 03-04-2017 10:29 AM

If I had a Ryzen CPU, I would benchmark with different affinity settings, I'm fairly confident that will have a big impact in those games that perform so-so. If that's the case, an updated OS scheduler (or game) will be able to fix, or minimise, that.

navjack27 03-04-2017 10:39 AM

The P-states have something to do with XFR. Yes, like a GPU. Each one has its own power offset settings and final clock.

I was planning on messing with Process Lasso and affinity settings. It's not like I haven't done that before to tighten frame times in CSGO with Intel CPUs and hyperthreading.

starliner 03-04-2017 11:17 AM

So I guess this means Ryzen laptops are going to be bad gaming machines, depending on whether there is a BIOS option to disable SMT. But let's face it, laptop BIOSes aren't that great as it is.

gtbtk 03-04-2017 11:29 AM

Quote:
Originally Posted by sterob View Post

Can the SMT problem be fixed with microcode, windows update and software optimization or users should wait out till the next generation?

the old FX CPUs had a similar issue with Windows that was fixed with patches


gtbtk 03-04-2017 11:51 AM

Quote:
Originally Posted by CrazyElf View Post

There is some evidence that the bandwidth may be double, but we will need testing.


 
Quote:
Originally Posted by gtbtk View Post

(snip - see the full post quoted above)


I'm not sure if that solution will work.

Keep in mind that unlike Skylake, the Base clock is not isolated from the rest of the other components. PCie signals begin to degrade after 105 MHz, like on Sandy Bridge. The overclocks to Base Clock may be limited like P67/Z68 was. To the best of my knowledge, there's nothing like the "strap" function on Intel CPUs.


We need mature BIOSes as well to see what the RAM OC is. If the RAM is the last level cache then we need all the OC to RAM that we can get.

sorry, you are wrong, this is not Sandy Bridge with a 105MHz limit. Like X99, you can run 125 BCLK here - you also get access to memory speeds higher than 3200MHz in the memory option list

 

Besides, what is there to lose? If it doesn't work, set it back to 100.

 

If you want better performance when you are loading both the CPU and the GPU, both need to be able to balance out their requirements. The only thing between them is the integrated PCIe and memory controllers in the fabric on the chip. The controller plays the role of traffic cop, and right now it is letting the GPU have most of the fun and backing up CPU traffic, as it were.

 

You know the GPU is OK because it works in other computers, and the CPU crunches numbers just fine in any of the calculation-only benchmarks. Performance is only impacted when you load both the CPU and GPU together. Give the controller a bit more voltage and frequency to work with, and the system moves back into balance.


AlphaC 03-04-2017 07:23 PM

https://www.phoronix.com/scan.php?page=article&item=amd-ryzen-cores&num=3

AMD Ryzen CPU Core Scaling Performance
Quote:
It seems that whenever SMT is exposed, it hurts Dota 2's Vulkan performance. Note that in the OpenGL result above, having SMT threads present wasn't hurting the performance so severely.

OS independent.

crucifix85 03-05-2017 06:45 AM

Quote:
Originally Posted by gtbtk View Post

The whole platforms seems to have come to market before it was ready. Deadlines from the CEO I guess.

so how have you overclocked the chip/memory?

I'm not buying that. It's not like board makers just got their hands on Zen this year. As far as MS is concerned, AMD can bribe them now to fast track a good patch with Zen money

madbrayniak 03-05-2017 07:56 AM

I think we will see some real info when the 4-core Ryzen CPUs come out.

gtbtk 03-05-2017 10:24 AM

Quote:
Originally Posted by crucifix85 View Post
 
Quote:
Originally Posted by gtbtk View Post

The whole platforms seems to have come to market before it was ready. Deadlines from the CEO I guess.

so how have you overclocked the chip/memory?

I'm not buying that. It's not like board makers just got their hands on Zen this year. As far as MS is concerned, AMD can bribe them now to fast track a good patch with Zen money

 

good conspiracy theory.

 

AMD is still talking about microcode updates to resolve some of the issues. If it was all squared away, they would not still be addressing multiple issues at once at this stage.

 

While engineering samples are good enough to design and test motherboards at a basic functionality level, they are not exactly the same as a retail chip, which BIOS firmware needs to be finalized and tested against. The mobo manufacturers had retail samples for less than a month to chase down firmware bugs or support features that were only recently finalized in silicon. Asrock did not even have any product ready on release day. That is not a decision a business makes by choice.


cekim 03-05-2017 10:46 AM

Quote:
Originally Posted by Quantum Reality View Post

So applications (e.g. video encoding, calculations, etc) - SMT on.

Games (esp DX12) - SMT off.

Too bad you can't have dynamic feature-disabling profiles without needing a reboot to change the setting.
Quote:
Originally Posted by sterob View Post

Can the SMT problem be fixed with microcode, windows update and software optimization or users should wait out till the next generation?
This is where I think there is a lot of truth to the assertion that some big optimizations are yet to come.

The OS dispatcher, and even applications themselves (through affinity and thread-count choices), can make choices specific to this topology, as we see with Intel. Many applications look first to the physical core count to choose their level of parallelism. How they get that information, and how well the OS biases its thread dispatch toward unloaded resources (physical or otherwise), can make a big difference to end performance.

The hype train was at full speed, so I think people expected these sorts of entirely predictable growing pains to already be resolved - but they were predictable. It's impressive to start from scratch as AMD did and reach this level of performance out of the gate, but that's not what people had in their heads (right or wrong), judging by the death threats to reviewers pointing this out.

cekim 03-05-2017 10:53 AM

Quote:
Originally Posted by gtbtk View Post

good conspiracy theory.

AMD are still talking about microcode updates to resolve some of the issues. If it was all squared away then that would not be happening at this stage to address multiple issues at once.

While engineering samples are good enough to design and test motherboards from a basic functionality level, the are not exactly the same as a retail level chip that bios firmware needs to be finalized and tested against. The mobo manufacturers had retail samples for less than a month to finalize any firmware bugs or to get features that have only recently been finalized in silicon. Asrock did not even have any product ready on release day. That is not a business choice made from choice
Not only that: MS, like Intel, has long been more reactive than proactive on such things. Windows is a huge boat; they don't turn the wheel quickly (for better and for worse).

They generally know they aren't the ones that will be blamed first, so they have some time to react (and/or simply require more time, owing to the logistics they face in making changes).

gtbtk 03-05-2017 01:09 PM

Quote:
Originally Posted by CrazyElf View Post

There is some evidence that the bandwidth may be double, but we will need testing.


 
Quote:
Originally Posted by gtbtk View Post

(snip - see the full post quoted above)


I'm not sure if that solution will work.

Keep in mind that unlike Skylake, the Base clock is not isolated from the rest of the other components. PCie signals begin to degrade after 105 MHz, like on Sandy Bridge. The overclocks to Base Clock may be limited like P67/Z68 was. To the best of my knowledge, there's nothing like the "strap" function on Intel CPUs.


We need mature BIOSes as well to see what the RAM OC is. If the RAM is the last level cache then we need all the OC to RAM that we can get.

 

 

you need to take a look at this (not mine) - it is from Ryzenchrist here on OCN:

 

http://valid.x86.fr/qmfrkd


superstition222 03-06-2017 02:03 AM

Quote:
Originally Posted by gtbtk View Post

Asrock did not even have any product ready on release day. That is not a business choice made from choice
It's also very possible for board makers to do a poor job even when they could have done better. That can be the result of business decisions.

It's not like software hasn't frequently followed the "ship it in beta form and issue endless patches" model of development for a long time now. Board makers have a long history of issuing multiple revisions of the same board.

My Gigabyte UD3P 2.0 board, like all the others, will not boot with a multiplier higher than the one for 4.4 GHz (don't remember the number... 22 maybe). AsRock FX boards have shipped without LLC and with poor VRM cooling. One board didn't even have the thermal pad covering all the VRMs, at least according to one person's picture here. AsRock made a board that claims support for the 9000-series FX CPUs but isn't robust enough in VRM cooling and quality to reliably provide it.

Poor support for the features of Intel's Broadwell C CPUs is also pretty common - things like being able to disable TDP throttling and manually adjust the eDRAM speed successfully. AsRock, oddly enough, has a good reputation in that area, as far as I know. Board makers decided that investing more time into making Broadwell C work better wasn't in their interest or Intel's.

gtbtk 03-06-2017 02:50 AM

Quote:
Originally Posted by superstition222 View Post
 
Quote:
Originally Posted by gtbtk View Post

Asrock did not even have any product ready on release day. That is not a business choice made from choice
It's also very possible for board makers to do a poor job even if they could have done a better one. That can be the result of business decisions.

It's not like software hasn't frequently had the "ship it in beta form and issue endless patches" model of development for a long time, in many cases. Board makers have a long history of issuing multiple revisions of the same board.

My Gigabyte UD3P 2.0 board, like all the others, will not boot with a multiplier higher than 4.4 GHz (don't remember the number... 22 maybe). AsRock FX boards have been shipped without LLC and with poor VRM cooling. One board didn't even have the thermal pad covering all the VRMs, at least according to one person's picture here. AsRock made a board that said it supports the 9000 series FX CPUs but isn't robust enough in terms of VRM cooling and quality to reliably do so.

Poor support for the features of Intel's Broadwell C CPUs is something that is pretty common, like being able to disable TDP throttling and manually adjust the EDRAM speed successfully. AsRock, oddly enough, has a good reputation in that area, as far as I know. Board makers decided that investing more time into making Broadwell C work better wasn't in their interest or Intel's.

 

There is a difference between shipping and selling beta hardware and software, and not shipping anything so you have nothing to sell.

 

Let's be honest here: companies, and I am not just talking about motherboard vendors, all aim to do the minimum amount of work possible and sell the product they have at the highest price they can convince customers to pay.

 

Broadwell desktop CPUs, with their production run of, what, a month(?), did not exactly sell a lot of units. As there are multiple vendors with multiple models supporting those CPUs, they maybe sold a thousand of each motherboard model, making a total profit of say $20,000 for the entire family of products after R&D costs are accounted for. Why would you put a BIOS developer being paid $120,000 a year onto developing BIOS upgrades for a product you will never sell again, with maybe 1,000 to 5,000 users of the CPU in total? In 2 man-months, you have eaten all the profit from that entire family of products; after that you are ensuring a loss. You would put that developer on a new product that stands a chance of selling 100,000-plus units and being much more profitable.

 

Ryzen is a product with much greater potential than Broadwell, and not having a product available for the day-one reviews is a huge loss of inexpensive "free" marketing. The only reason you would not ship volume-selling hardware is if you knew that doing so with a totally unfinished product would damage your reputation even more than doing on-the-fly patches like Asus have been doing.


navjack27 03-06-2017 04:10 AM

don't even talk to me about broadwell-c... i friggin LOVE that series of CPUs. i love the concept of the speed that L4 cache brings to the table.

EDIT: in fact since getting my 5820k main desktop. i retired my 5775c to a 24/7 folding machine/boinc machine and i just feel bad for doing that to it, but i don't have the desk space to have two full machines at my access at the moment. i'd set it up next to my main rig on the same monitor in a heartbeat LOL

superstition222 03-06-2017 11:06 AM

Quote:
Originally Posted by gtbtk View Post

Broadwell desktop CPUs, with their production run of what, a month(?), did not exactly sell a lot of units. As there are multiple vendors with multiple models supporting those CPUs, they maybe sold a thousand of each motherboard model, making a total profit of say $20,000 for the entire family of products after the R&D costs are accounted for. Why would you put a BIOS developer being paid $120,000 a year onto developing BIOS upgrades for a product that you will never sell again and that has maybe 1,000 to 5,000 users of the CPU in total? In two man-months, at $10,000 per month, you have eaten all of the profit you made from that entire family of products; after that you are ensuring a loss. You would put that developer onto a new product that stands a chance of selling 100,000-plus units and being much more profitable.

Ryzen is a product with much greater potential than Broadwell; not having a product available to be included in the day-one reviews is a huge loss of inexpensive "free" marketing potential. The only reason you would not try to ship volume-selling hardware is if, by doing so with a totally unfinished product, you know you will damage your reputation even more than doing on-the-fly patches like ASUS has been doing.
Broadwell C didn't require a new special board. As for seeking profit, neither ASUS nor Gigabyte bothered to put premium features on their Zen boards, like a hybrid water-air cooler, features that have been on Intel boards since 2013. As for why board makers should have fully supported Broadwell C chips: if they claim to support a CPU, they need to fully support it.

gtbtk 03-06-2017 12:42 PM

Quote:

Broadwell C didn't require a new special board. As for seeking profit, neither ASUS nor Gigabyte bothered to put premium features on their Zen boards, like a hybrid water-air cooler, features that have been on Intel boards since 2013. As for why board makers should have fully supported Broadwell C chips: if they claim to support a CPU, they need to fully support it.

 

Broadwell C did require BIOS upgrades to support the chips, in just the same way Z170 needed BIOS updates to support Kaby Lake. Support just means that the manufacturer guarantees you can run a CPU on a given piece of hardware and will assist you for the life of that product. It doesn't guarantee you will get any upgrades after the fact.

 

With regard to the Zen motherboards, the planned full range of each manufacturer's boards has almost certainly not yet been announced. Given the market share of AMD's Bulldozer and Piledriver CPUs, I would think that the vendors are probably taking a wait-and-see approach. If there is good uptake of Ryzen, they will expand the range. That is called risk mitigation.

 

If Ryzen fails, and I am certainly not suggesting it will, they have enough product to cover what demand there is, but why would they throw good money after bad to support a CPU that no one is buying?


superstition222 03-06-2017 12:55 PM

Quote:
Originally Posted by gtbtk View Post

Broadwell C did require BIOS upgrades to support the chips, in just the same way Z170 needed BIOS updates to support Kaby Lake.
A BIOS update is hardly the same thing as a different board.

As for support, all the features of the product (the board and the CPU) should be fully supported. Otherwise it should be labeled partial or minimal support rather than just "supported". For Broadwell C that means enabling the user to adjust the TDP, to avoid TDP throttling, and to adjust the eDRAM clock. Some brands made the effort to provide this support for at least one board, some did it in a half-baked manner, and some didn't bother.

mozmo 03-06-2017 03:36 PM

Ryzen will never be as good as Intel in applications with heavy coherent memory sharing.

The L3 in Ryzen is a victim cache, not inclusive; it's broken into two (one per CCX) and acts like a cluster-on-die chip. The L3 is not the last-level cache (LLC) of the system the way the L3 is in Intel designs. This means any coherent locks or dependent memory sharing are going to be much slower than on Intel, because much of the time the data will need to go through slower DDR4 to ensure memory coherency.

This is why it falls behind in gaming; games depend on coherency and memory sharing a lot more. Improving the Windows scheduler to recognize the clusters will help somewhat, but you'll still hit scaling issues if a thread on CCX1 needs to share data with a thread on CCX2. The bandwidth between the two is only 22 GB/s, which is not fast, and you're looking at around 50-100 ns of pipeline stall versus roughly 10 ns on Intel.
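
If you want to see that penalty for yourself, a cache-line ping-pong test between two pinned threads will show it. Here's a minimal sketch for Linux; the core numbering is an assumption (it presumes logical cores 0-3 sit on one CCX and 4-7 on the other; check with lstopo or /proc/cpuinfo before trusting the numbers):

Code:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ROUNDS 1000000

static _Atomic int flag = 0;  /* the cache line the two threads fight over */

static void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *pong(void *arg) {
    pin_to_core((int)(long)arg);
    for (int i = 0; i < ROUNDS; i++) {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 1)
            ;  /* spin until ping sets the flag */
        atomic_store_explicit(&flag, 0, memory_order_release);
    }
    return NULL;
}

int main(void) {
    pin_to_core(0);                              /* ping thread: assumed CCX 0 */
    pthread_t t;
    pthread_create(&t, NULL, pong, (void *)4L);  /* pong thread: assumed CCX 1 */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ROUNDS; i++) {
        atomic_store_explicit(&flag, 1, memory_order_release);
        while (atomic_load_explicit(&flag, memory_order_acquire) != 0)
            ;  /* spin until pong clears it */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_join(t, NULL);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* each round is a full round trip, i.e. two cache-line handoffs */
    printf("avg one-way handoff: %.1f ns\n", ns / ROUNDS / 2);
    return 0;
}

Build with gcc -O2 -pthread. Run it once with the second thread on core 4 and once on core 1; the gap between the two numbers is roughly the fabric crossing I'm describing.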

superstition222 03-06-2017 03:42 PM

Quote:
Originally Posted by mozmo View Post

Ryzen will never be as good as Intel in applications with heavy coherent memory sharing.

The L3 in Ryzen is a victim cache, not inclusive; it's broken into two (one per CCX) and acts like a cluster-on-die chip. The L3 is not the last-level cache (LLC) of the system the way the L3 is in Intel designs. This means any coherent locks or dependent memory sharing are going to be much slower than on Intel, because much of the time the data will need to go through slower DDR4 to ensure memory coherency.

This is why it falls behind in gaming; games depend on coherency and memory sharing a lot more. Improving the Windows scheduler to recognize the clusters will help somewhat, but you'll still hit scaling issues if a thread on CCX1 needs to share data with a thread on CCX2. The bandwidth between the two is only 22 GB/s, which is not fast, and you're looking at around 50-100 ns of pipeline stall versus roughly 10 ns on Intel.
The worst Ryzen gaming results are typically in games that favor single-threaded performance and heavy use of only a few threads, right? So a quad-core that never crosses the CCX boundary would be optimal? Things like Dolphin would also benefit from higher clocks and fewer cores/threads.

Is there a way to have a quad (half of a Ryzen) with 8 threads via SMT, or does that involve the CCX1-to-CCX2 latency issue? A 4/8 part with a high enough clock that doesn't have the CCX-to-CCX latency issue should be pretty competitive.

I wonder if Zen+ will have an eDRAM L4.
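
For reference, on Linux you can at least map out which logical CPUs are SMT siblings of which physical core, which is the first step to confining a game to half the chip by hand. A minimal sketch using the standard sysfs topology files; whether cores 0-3 really form one CCX on a given chip is an assumption to verify:

Code:

#include <stdio.h>

int main(void) {
    char path[128], buf[64];
    for (int cpu = 0; ; cpu++) {
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/topology/core_id", cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            break;                      /* no more logical CPUs */
        if (fgets(buf, sizeof buf, f))
            printf("cpu%d: core_id %s", cpu, buf);  /* buf keeps its \n */
        fclose(f);

        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
                 cpu);
        f = fopen(path, "r");
        if (f) {
            if (fgets(buf, sizeof buf, f))
                printf("       smt siblings: %s", buf);
            fclose(f);
        }
    }
    return 0;
}

Pinning a game to one physical quad plus its SMT siblings would effectively give you that 4/8 half-Ryzen without the cross-CCX trips.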

Kuivamaa 03-06-2017 03:56 PM

Quote:
Originally Posted by superstition222 View Post

The worst Ryzen gaming results are typically in games that favor single-threaded performance and heavy use of only a few threads, right? So a quad-core that never crosses the CCX boundary would be optimal? Things like Dolphin would also benefit from higher clocks and fewer cores/threads.

Is there a way to have a quad (half of a Ryzen) with 8 threads via SMT, or does that involve the CCX1-to-CCX2 latency issue? A 4/8 part with a high enough clock that doesn't have the CCX-to-CCX latency issue should be pretty competitive.

I wonder if Zen+ will have an eDRAM L4.

No, Ryzen game performance seems to depend on engine sensitivities and nuances. Single-thread performance is strong, MT even more so, yet there are both poorly and well threaded games that run both well and less well on Ryzen. It comes down to what the engine is doing and whether it touches non-optimized areas.

mozmo 03-06-2017 07:37 PM

Quote:
Originally Posted by superstition222 View Post

The worst Ryzen gaming results are typically in games that favor single-threaded performance and heavy use of only a few threads, right? So a quad-core that never crosses the CCX boundary would be optimal? Things like Dolphin would also benefit from higher clocks and fewer cores/threads.

Is there a way to have a quad (half of a Ryzen) with 8 threads via SMT, or does that involve the CCX1-to-CCX2 latency issue? A 4/8 part with a high enough clock that doesn't have the CCX-to-CCX latency issue should be pretty competitive.

I wonder if Zen+ will have an eDRAM L4.
Watch Dogs 2 and BF1 are heavily threaded, and they perform worse on Ryzen. Rise of the Tomb Raider in DX12 spreads load across all threads and runs worse. GTA V is another one that spreads load and runs worse. Lots of games use many threads now.

The IPC of Ryzen is roughly the same as Broadwell-E in single thread; Ryzen runs at higher clocks and still loses badly to Broadwell-E, but largely only in gaming, which is heavily cache dependent. Most other workloads where Ryzen does well are ones that scale 100% across all cores, because each thread has no dependency on another thread or on any shared data.

If you look at the architecture of these chips, they are very similar now: same-width decoder, same number of integer/FP units, similar register count, out-of-order window, micro-op cache, etc.

The main differences now are the L3 (split victim cache vs. fully inclusive) and 128-bit FMACs vs. 256-bit in Intel chips.

AVX performance on Ryzen is half Intel's rate, and cache bandwidth and latency are worse than Intel's.
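
The FMAC difference is easy to demonstrate with a small throughput loop, since Zen executes each 256-bit FMA as two 128-bit micro-ops. A minimal sketch; the iteration and accumulator counts are arbitrary choices of mine, and you build it with gcc -O2 -mavx2 -mfma:

Code:

#include <immintrin.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    /* four independent accumulators hide FMA latency, so the loop is
       limited by FMA throughput rather than the dependency chain */
    __m256 acc0 = _mm256_set1_ps(1.0f), acc1 = _mm256_set1_ps(2.0f);
    __m256 acc2 = _mm256_set1_ps(3.0f), acc3 = _mm256_set1_ps(4.0f);
    const __m256 a = _mm256_set1_ps(1.0000001f);
    const __m256 b = _mm256_set1_ps(0.0000001f);
    const long iters = 100000000L;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        acc0 = _mm256_fmadd_ps(a, acc0, b);   /* 8 lanes x (mul + add) */
        acc1 = _mm256_fmadd_ps(a, acc1, b);
        acc2 = _mm256_fmadd_ps(a, acc2, b);
        acc3 = _mm256_fmadd_ps(a, acc3, b);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* fold the accumulators and print one lane so nothing is optimized out */
    float sink[8];
    _mm256_storeu_ps(sink, _mm256_add_ps(_mm256_add_ps(acc0, acc1),
                                         _mm256_add_ps(acc2, acc3)));
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double flops = (double)iters * 4 /* FMAs */ * 8 /* lanes */ * 2;
    printf("%.1f GFLOP/s (sink %f)\n", flops / secs / 1e9, sink[0]);
    return 0;
}

At the same clock, a core with native 256-bit FMA units should post roughly twice the GFLOP/s of a Zen 1 core on this loop.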

Luckily, games make little use of AVX; otherwise we'd have an even bigger meltdown by AMD fans.

Kuivamaa 03-07-2017 01:01 AM

Ryzen is excellent at BF1 MP.

CrazyElf 03-07-2017 07:33 AM

Well, we have one answer: look at Ryzen with just one CCX enabled. This is a review by someone who disabled one CCX, leaving just 4 cores and 8 threads, and then matched it clock for clock against a 7700K.

http://www.zolkorn.com/reviews/amd-ryzen-7-1800x-vs-intel-core-i7-7700k-mhz-by-mhz-core-by-core/view-all/


RealBench: [benchmark chart]

Grand Theft Auto: [benchmark chart]

Battlefield performance: [benchmark chart]

Note that the penalty is smaller than with both CCXs enabled. I suspect that with the other CCX disabled this is a big step forward, because it means the L3 becomes the true last-level cache. It'd be interesting to see what would happen if SMT were off as well; it could go either way. Battlefield is more threaded than most games.

That's good news for the 4-core parts and the Zen APUs.


Edit:

One other thing I need to draw to people's attention: disabling HPET can sometimes improve gaming performance, at the expense of AMD Ryzen Master's measurement accuracy.

http://www.tomshardware.com/reviews/amd-ryzen-7-1800x-cpu,4951-12.html
Quote:
The evening before launch, AMD sent us a list of games that it says should perform well with Ryzen, including Sniper Elite 4, Battlefield 1, Star Wars: Battlefront, and Overwatch, among others. Many of the titles tend to be heavily threaded, which would lend itself well to Ryzen's high core count. We plan on revisiting some of those. Further, AMD suggests adjusting several different parameters for games that suffer from low performance. It recommends using Windows' High Performance power profile (which also helps Intel CPUs). It also says to disable the HPET (High Precision Event Timer), either in your BIOS or operating system, to gain a 5-8% advantage. Our results already reflect HPET disabled, though. Interestingly, AMD's Ryzen Master software requires HPET to “provide accurate measurements,” so you may find yourself toggling back and forth for the best experience.

Going to update the OP on that later.


Quote:
Originally Posted by superstition222 View Post

The worst Ryzen gaming results are typically in games that favor single-threaded performance and heavy use of only a few threads, right? So a quad-core that never crosses the CCX boundary would be optimal? Things like Dolphin would also benefit from higher clocks and fewer cores/threads.

Is there a way to have a quad (half of a Ryzen) with 8 threads via SMT, or does that involve the CCX1-to-CCX2 latency issue? A 4/8 part with a high enough clock that doesn't have the CCX-to-CCX latency issue should be pretty competitive.

I wonder if Zen+ will have an eDRAM L4.


Either an eDRAM L4 or even some on-die L4. We don't need that much; perhaps just 20 MB will do. That way there's no latency penalty for off-die communication.

The only other way to do it would be to unify the caches. That would mean abandoning the CCX configuration in favor of something that looks more like the 5960X, though, and would be costly to redesign.

In my OP, I proposed using HBM as a solution as well. It's got high bandwidth, and the latency is better than going off-package, but it would be costly.


Quote:
Originally Posted by Kuivamaa View Post

No, Ryzen game performance seems to depend on engine sensitivities and nuances. Single-thread performance is strong, MT even more so, yet there are both poorly and well threaded games that run both well and less well on Ryzen. It comes down to what the engine is doing and whether it touches non-optimized areas.

The ideal would be that:

  1. Any single-threaded games that don't scale with cores keep their threads within one CCX (see the affinity sketch after this list).
  2. Any games that do scale optimize their thread placement across the two CCXs and minimize inter-CCX communication.
  3. For Zen+, of course, we need a better solution; ideally an on-die last-level cache.
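
Until the scheduler is CCX-aware, point 1 can be approximated by hand. Here's a minimal sketch for Linux; it assumes cores 0-3 are one CCX (verify with lstopo), and from a shell, taskset -c 0-3 <game> does the same thing:

Code:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t ccx0;
    CPU_ZERO(&ccx0);
    for (int core = 0; core < 4; core++)
        CPU_SET(core, &ccx0);           /* cores 0-3: assumed CCX 0 */

    /* pid 0 = the calling thread; threads created after this point
       inherit the mask, so worker threads stay on one CCX */
    if (sched_setaffinity(0, sizeof(ccx0), &ccx0) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("confined to cores 0-3 (one CCX)\n");
    /* ...spawn the latency-sensitive threads here... */
    return 0;
}

That keeps every cache-line handoff inside one L3 instead of crossing the fabric.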


Quote:
Originally Posted by gtbtk View Post


You need to take a look at this (not mine); it's from Ryzenchrist here on OCN:

http://valid.x86.fr/qmfrkd


Yes, but that is probably not running at PCIe 3.0. At higher base clock (REFCLK) frequencies, Ryzen drops to PCIe 2.0 or even 1.0.

It's all in the guide:
https://www.overclock.net/t/1624603/rog-crosshair-vi-overclocking-thread/0_100

Page 2 in the PDF.




While modern GPUs don't take much of a penalty at PCIe 2.0, it would be a serious penalty for any NVMe SSD (like the SSD 750, if you have the PCIe version), and if you were running an SSD 750 with Ryzen there would still be some penalty even for the GPU, because you would be running it at PCIe 2.0 x8.
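
If you want to verify what link your card actually negotiated after raising the REFCLK, Linux exposes it in sysfs (GPU-Z shows the same thing on Windows). A minimal sketch; the device address 0000:01:00.0 is a placeholder for your GPU's, which lspci will tell you:

Code:

#include <stdio.h>

static void show(const char *attr) {
    char path[128], buf[64];
    snprintf(path, sizeof path,
             "/sys/bus/pci/devices/0000:01:00.0/%s", attr);
    FILE *f = fopen(path, "r");
    if (f) {
        if (fgets(buf, sizeof buf, f))
            printf("%s: %s", attr, buf);   /* buf keeps its newline */
        fclose(f);
    }
}

int main(void) {
    show("current_link_speed");   /* 8 GT/s = Gen3, 5 GT/s = Gen2 */
    show("current_link_width");   /* e.g. 16 or 8 */
    return 0;
}

If the speed reads 5 GT/s after a REFCLK bump, you've dropped to Gen2 and the NVMe penalty above applies.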

gtbtk 03-07-2017 08:42 AM

Quote:
Originally Posted by mozmo View Post

Ryzen will never be as good as Intel in applications with heavy coherent memory sharing.

The L3 in Ryzen is a victim cache, not inclusive; it's broken into two (one per CCX) and acts like a cluster-on-die chip. The L3 is not the last-level cache (LLC) of the system the way the L3 is in Intel designs. This means any coherent locks or dependent memory sharing are going to be much slower than on Intel, because much of the time the data will need to go through slower DDR4 to ensure memory coherency.

This is why it falls behind in gaming; games depend on coherency and memory sharing a lot more. Improving the Windows scheduler to recognize the clusters will help somewhat, but you'll still hit scaling issues if a thread on CCX1 needs to share data with a thread on CCX2. The bandwidth between the two is only 22 GB/s, which is not fast, and you're looking at around 50-100 ns of pipeline stall versus roughly 10 ns on Intel.

That analysis ignores the fact that non-GPU workloads (Cinebench, RealBench, SuperPi, etc.) that show good performance also have to deal with the same cache design and limitations. Impaired performance is simply not seen when no GPU interaction is involved. If what you said were correct, performance would be bad in every scenario, not only in gaming.

 

What everyone is ignoring is that the core complexes are supported by a system-on-chip block that provides the memory controller and the PCIe controller for the x16 lanes connecting to the GPU. The only thing between the CPU and the GPU, other than wire that merely carries the data and timing signalling, is that SoC PCIe controller.

 

Getting faster memory speeds has also been problematic. Memory clocks in the 3000 MHz-and-above range increase stress on the memory controller in the SoC, which suggests that the performance issues are caused by the SoC not being tuned for optimal performance, creating a bottleneck between the CPU and GPU as well as between the CPU and the installed memory. Underperformance of the SoC would also explain why four sticks of RAM, which put even more load on the memory controller, perform worse than two sticks. SMT is also managed by the SoC as far as I am aware, and even spills from the L3 cache out to DDR4 rely on that SoC memory controller to reach the RAM.
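
This is testable: measure sustained DRAM bandwidth with 2 sticks vs. 4, and at different memory clocks, and see whether it tracks the gaming results. A minimal STREAM-style triad sketch; the array size is my arbitrary pick, and you build it with gcc -O2 -fopenmp so all cores load the controller at once:

Code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)   /* 3 x 512 MB arrays: far bigger than L3 */

int main(void) {
    double *a = malloc((size_t)N * sizeof *a);
    double *b = malloc((size_t)N * sizeof *b);
    double *c = malloc((size_t)N * sizeof *c);
    if (!a || !b || !c)
        return 1;
    for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    #pragma omp parallel for            /* every core hammers the IMC */
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];       /* triad: two reads, one write */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("triad: %.2f GB/s (a[0]=%f)\n",
           3.0 * N * sizeof(double) / secs / 1e9, a[0]);
    free(a); free(b); free(c);
    return 0;
}

If four sticks at 3000+ MHz genuinely post lower numbers than two here, that points at the SoC/IMC rather than the CCX fabric.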

 

The performance challenges should be resolvable with firmware tuning, be it manually by changing settings in current BIOSes or via changed defaults baked into new versions from the motherboard vendors.

 

The thing I do not understand is why not one of the reviewers seems able to work this out; they keep trying to blame the architecture instead of just working through the processing pipeline that applies when you are gaming.


btupsx 03-17-2017 02:43 PM

Absolutely outstanding thread, everyone. Best thread I've seen on OCN in some time.

Echoing a couple of thoughts already espoused, and agree with most of the analysis so far.
Quote:
Originally Posted by CrazyElf View Post

Well, we have one answer: look at Ryzen with just one CCX enabled. This is a review by someone who disabled one CCX, leaving just 4 cores and 8 threads, and then matched it clock for clock against a 7700K.

http://www.zolkorn.com/reviews/amd-ryzen-7-1800x-vs-intel-core-i7-7700k-mhz-by-mhz-core-by-core/view-all/


RealBench: [benchmark chart]

Grand Theft Auto: [benchmark chart]

Battlefield performance: [benchmark chart]

Note that the penalty is smaller than with both CCXs enabled. I suspect that with the other CCX disabled this is a big step forward, because it means the L3 becomes the true last-level cache. It'd be interesting to see what would happen if SMT were off as well; it could go either way. Battlefield is more threaded than most games.

That's good news for the 4-core parts and the Zen APUs.


Edit:

One other thing I need to draw to people's attention: disabling HPET can sometimes improve gaming performance, at the expense of AMD Ryzen Master's measurement accuracy.

http://www.tomshardware.com/reviews/amd-ryzen-7-1800x-cpu,4951-12.html
Going to update the OP on that later.
Either an eDRAM L4 or even some on-die L4. We don't need that much; perhaps just 20 MB will do. That way there's no latency penalty for off-die communication.

The only other way to do it would be to unify the caches. That would mean abandoning the CCX configuration in favor of something that looks more like the 5960X, though, and would be costly to redesign.

In my OP, I proposed using HBM as a solution as well. It's got high bandwidth, and the latency is better than going off-package, but it would be costly.
The ideal would be that:

  1. Any single-threaded games that don't scale with cores keep their threads within one CCX (see the affinity sketch after this list).
  2. Any games that do scale optimize their thread placement across the two CCXs and minimize inter-CCX communication.
  3. For Zen+, of course, we need a better solution; ideally an on-die last-level cache.
Yes, but that is probably not running at PCIe 3.0. At higher base clock (REFCLK) frequencies, Ryzen drops to PCIe 2.0 or even 1.0.

It's all in the guide:
https://www.overclock.net/t/1624603/rog-crosshair-vi-overclocking-thread/0_100

Page 2 in the PDF.




While modern GPUs don't take much of a penalty at PCIe 2.0, it would be a serious penalty for any NVMe SSD (like the SSD 750, if you have the PCIe version), and if you were running an SSD 750 with Ryzen there would still be some penalty even for the GPU, because you would be running it at PCIe 2.0 x8.

Yes, FINALLY, concrete data with just one CCX properly enabled. I agree this bodes extremely well for the quad Ryzens, as well as the forthcoming APUs. This also makes the R5 quads the probable (undisputed?) high-performance gaming value king until the CCX communication and DRAM/BCLK nuances are completely probed.

Most importantly, it would seem to offer the best confirmation yet that DRAM is indeed acting as the last-level cache, and that the true bottleneck is the Infinity Fabric structure, at least in its current form. *A LOT* of low-hanging fruit here (and interesting engineering choices) for AMD in optimizing the Zen 2 & Zen 3 iterations. Do they integrate a kind of eDRAM solution on-die? Do they locate a separate HBM cache physically close to the socket? Do they forgo the current unified BCLK design in favor of a more distributed arrangement?

This much is clear: IMC binning, DRAM speed/timings, high-level BIOS refinement, and motherboard tolerance of high-speed DRAM will all be the name of the game for this iteration of Zen, critical to obtaining the best performance out of the silicon. It would also be best to stick to two DIMMs, so I can see high-capacity dual-DIMM kits becoming de rigueur for any enthusiast.
Quote:
Originally Posted by gtbtk View Post

That analysis ignores the fact that non-GPU workloads (Cinebench, RealBench, SuperPi, etc.) that show good performance also have to deal with the same cache design and limitations. Impaired performance is simply not seen when no GPU interaction is involved. If what you said were correct, performance would be bad in every scenario, not only in gaming.

What everyone is ignoring is that the core complexes are supported by a system-on-chip block that provides the memory controller and the PCIe controller for the x16 lanes connecting to the GPU. The only thing between the CPU and the GPU, other than wire that merely carries the data and timing signalling, is that SoC PCIe controller.

Getting faster memory speeds has also been problematic. Memory clocks in the 3000 MHz-and-above range increase stress on the memory controller in the SoC, which suggests that the performance issues are caused by the SoC not being tuned for optimal performance, creating a bottleneck between the CPU and GPU as well as between the CPU and the installed memory. Underperformance of the SoC would also explain why four sticks of RAM, which put even more load on the memory controller, perform worse than two sticks. SMT is also managed by the SoC as far as I am aware, and even spills from the L3 cache out to DDR4 rely on that SoC memory controller to reach the RAM.

The performance challenges should be resolvable with firmware tuning, be it manually by changing settings in current BIOSes or via changed defaults baked into new versions from the motherboard vendors.

The thing I do not understand is why not one of the reviewers seems able to work this out; they keep trying to blame the architecture instead of just working through the processing pipeline that applies when you are gaming.

As with most things, the simplest answer is most likely the right one: they didn't have any idea where to look. Aside from AnandTech, most popular sites/reviewers aren't truly boned up on the latest EI trends and uarch designs.

CrazyElf 03-18-2017 08:35 AM

Thanks for the compliments, everyone.

Quote:
Originally Posted by btupsx View Post

Absolutely outstanding thread, everyone. Best thread I've seen on OCN in some time.

Echoing a couple of thoughts already espoused, and agree with most of the analysis so far.
Yes, FINALLY, concrete data with just one CCX properly enabled. I agree this bodes extremely well for the quad Ryzens, as well as the forthcoming APUs. This also makes the R5 quads the probable (undisputed?) high-performance gaming value king until the CCX communication and DRAM/BCLK nuances are completely probed.

Most importantly, it would seem to offer the best confirmation yet that DRAM is indeed acting as the last-level cache, and that the true bottleneck is the Infinity Fabric structure, at least in its current form. *A LOT* of low-hanging fruit here (and interesting engineering choices) for AMD in optimizing the Zen 2 & Zen 3 iterations. Do they integrate a kind of eDRAM solution on-die? Do they locate a separate HBM cache physically close to the socket? Do they forgo the current unified BCLK design in favor of a more distributed arrangement?

This much is clear: IMC binning, DRAM speed/timings, high-level BIOS refinement, and motherboard tolerance of high-speed DRAM will all be the name of the game for this iteration of Zen, critical to obtaining the best performance out of the silicon. It would also be best to stick to two DIMMs, so I can see high-capacity dual-DIMM kits becoming de rigueur for any enthusiast.
As with most things, the simplest answer is most likely the right one: they didn't have any idea where to look. Aside from AnandTech, most popular sites/reviewers aren't truly boned up on the latest EI trends and uarch designs.



We are expecting to see 32 GB DIMMs in the near future, so 2 x 32 GB makes 64 GB of RAM. In theory they can already make 128 GB DIMMs, but probably not ones that overclock.

I proposed an L4 cache; eDRAM, or perhaps even HBM, might work for higher-end designs.



If you haven't read my other thread:
https://www.overclock.net/t/1625187/the-ryzen-gaming-performance-gap-is-mostly-gone/0_100


Some games love RAM, by the way: [benchmark chart]

We need that RAM controller unlocked ASAP. Ideally we could get DDR4 >4000.

superstition222 03-18-2017 01:47 PM

Quote:
Originally Posted by CrazyElf View Post

I proposed an L4 cache, eDRAM, or perhaps even HBM might work for higher end designs.
People have been wanting L4 for a long time.

Intel’s Skylake lineup is robbing us of the performance king we deserve

