

αC · Joined · 11,239 Posts
Quote:
Originally Posted by Steele84 View Post

Great thread, +REP to all. So if what I'm reading is correct and AMD eliminates the dual CCX for the R5 series, then they could potentially be better-performing chips in applications that run fewer than 4-8 threads? The other issue I see with this is finding a resolution to the problem presented... the latency penalty appears to be an issue that can only be improved with better software scheduling of cores/logical cores. However, the penalty will always be present when spanning both CCXs?
Not the hex-core R5 1600X; that has to be 4+2 or 3+3 cores.

The Ryzen 5 1500X quad core likely would be 4+0 or 2+2.

I believe the penalty is mainly in Windows 10. Right now it is mitigated with faster single-rank RAM with tighter timings (e.g. 3000+ MHz CL14/CL15).

----

New developments as far as compatibility
http://media.kingston.com/pdfs/hx-product-memory-ddr4-amd-ryzen-and-hx-compatibility-us.pdf
Quote:
FURY DDR4 2666MHz Timing Change Black Heat Spreader
(coming March 20)
HX426C16FB2/8
HX426C16FB2K2/16
HX426C16FB2K4/32
HX426C16FBK2/32
FURY DDR4 2666MHz Timing Change & Red Heat Spreader
(coming March 20)
HX426C16FR2/8
HX426C16FR2K2/16
HX426C16FR2K4/32
HX426C16FRK2/32
FURY DDR4 2666MHz Timing Change & White Heat Spreader
(coming March 20)
HX426C16FW2/8
HX426C16FW2K2/16
HX426C16FW2K4/32
HX426C16FWK2/32
 

Iconoclast · Joined · 30,657 Posts
Quote:
Originally Posted by Steele84 View Post

Great thread, +REP to all. So if what I'm reading is correct and AMD eliminates the dual CCX for the R5 series, then they could potentially be better-performing chips in applications that run fewer than 4-8 threads? The other issue I see with this is finding a resolution to the problem presented... the latency penalty appears to be an issue that can only be improved with better software scheduling of cores/logical cores. However, the penalty will always be present when spanning both CCXs?
A single CCX part won't perform better in more lightly threaded tasks, unless those threads were being incorrectly scheduled on a multi-CCX part (which admittedly does seem to be happening with some setups).

There is always going to be a penalty to latency sensitive tasks when spanning both CCXes. This penalty will decrease as data fabric speed (which is tied to memory clock, but not necessarily memory performance) increases.
 

Meeeeeeeow! · Joined · 2,229 Posts · Discussion Starter #23
Strangely, I've learned that the L3 caches on the cores are actually hybrid - they are mostly victim caches, but can be used as conventional caches. One problem is that apps that are cache aware might not be able to figure this one out.

I also think that Zen seems to regularly probe the other CCX's L3 when cores seek data. The problem is that the other L3 doesn't necessarily have that data. Not sure what the impact on performance is, but it's probably negative.

Quote:
Originally Posted by Blameless View Post

Intel has a set of fully custom ring busses that each have nearly ten times the bandwidth of AMD's data fabric, and buffered switches to match. This all sits on its own metal layer and is overlapped by the cache/cores.

Not sure it's practical for AMD to duplicate this, at least not in the short term.
The L4 cache would still be accessed via the same relatively slow data fabric that connects everything else together.
According to every description/block diagram I've seen, each CCX is four cores and an L3 cache; anything outside of that, including the memory controllers, is largely independent of the CCXes.

I suspect the quad core parts will have essentially the same 'uncore' as the eight core parts, just with one less CCX attached.

I doubt that single channel would change much, but it may be a good way to isolate the basic data fabric performance issues from memory performance issues (e.g. if single channel does nothing, the bottleneck is the data fabric; if single channel reduces performance appreciably, it's probably an actual memory bottleneck).
Best way to do this, without a radical architectural shift, would be to decouple the data fabric from the memory clock or change its multiplier.

The former option would require some more PLLs and buffers, while the latter option might cap memory frequency unless the data fabric is capable of rather extreme clocks. Still, either would likely be more workable than widening the data fabric or, worse, replacing it with the sort of custom interconnect Intel uses.
It would have to wait for Zen+ or Zen++ to make the kind of changes I am proposing.

The whole data fabric needs a bandwidth and latency upgrade, as it seems to be the bottleneck in many cases right now and is tied to DRAM. It would probably be best to decouple it and see if we could run it faster. Adding more lanes would help, but it would not address the latency gap. The whole point of the L4 is to minimize the frequency of DRAM access. It would not resolve the latency issues, save for reducing the number of accesses to DRAM, which carry a latency penalty.

I agree, though, that the bandwidth is a lot less than Intel's ring - in bandwidth terms it's closer to a 2P system of two 4-core CPUs than to two "rings" (technically meshes, since the CCX is a mesh topology) of 4 cores on the same die.

We will need to test the single channel RAM to prove/disprove this hypothesis though.

Quote:
Originally Posted by Blameless View Post

A single CCX part won't perform better in more lightly threaded tasks, unless those threads were being incorrectly scheduled on a multi-CCX part (which admittedly does seem to be happening with some setups).

There is always going to be a penalty to latency sensitive tasks when spanning both CCXes. This penalty will decrease as data fabric speed (which is tied to memory clock, but not necessarily memory performance) increases.
The 4+0 seems to be doing better than the 2+2 in benchmarks. Sometimes the gap is nearly 20%. The only apps that like 2+2 seem to do so because of the extra cache.

We need a way to test the bandwidth between cores and how it scales with RAM along with core clocks.
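Something like a cross-core ping-pong would be a start; here is a minimal sketch of the idea in Python with the psutil module. This is only an illustration of the kind of test meant here, not a validated benchmark: interpreter overhead dominates the absolute numbers, and which logical CPUs sit on which CCX (and which are SMT siblings) is an assumption you would adjust for your own topology.

Code:
# Rough core-to-core "ping-pong" latency sketch (needs the psutil module).
# Interpreter overhead dominates the absolute numbers; only the relative gap
# between a same-CCX pair and a cross-CCX pair is (roughly) meaningful.
# Which logical CPUs belong to which CCX / SMT sibling is an assumption.
import time
import multiprocessing as mp
import psutil

ITERS = 100_000

def pong(flag, core):
    psutil.Process().cpu_affinity([core])    # pin the responder to one core
    while True:
        v = flag.value
        if v == 1:
            flag.value = 2                   # bounce the token back
        elif v == -1:
            break                            # shutdown signal

def ping(core_a, core_b):
    flag = mp.Value("i", 0, lock=False)      # raw shared int, no lock
    worker = mp.Process(target=pong, args=(flag, core_b))
    worker.start()
    psutil.Process().cpu_affinity([core_a])  # pin the sender to another core
    t0 = time.perf_counter()
    for _ in range(ITERS):
        flag.value = 1
        while flag.value != 2:
            pass                             # spin until the other core replies
        flag.value = 0
    elapsed = time.perf_counter() - t0
    flag.value = -1
    worker.join()
    print(f"CPU {core_a} <-> CPU {core_b}: ~{elapsed / ITERS * 1e9:.0f} ns round trip")

if __name__ == "__main__":
    ping(0, 2)   # assumed: two cores on the same CCX
    ping(0, 8)   # assumed: cores on different CCXes

In principle the same-CCX pair and the cross-CCX pair should show a clear gap, and repeating the run at different memory and core clocks would show how it scales.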

The 4-core versions, when they arrive in a couple of months, could give us more clues as to whether it is 4+0 or just one CCX on the die.

Quote:
Originally Posted by superstition222 View Post

There are some games that "use" more than 8 threads but don't really benefit from having more than 8 threads in the processor, though, right?
I had mentioned the idea of operating systems adopting a per-application profiling system to do this. That way the user could override the vendors' setting which would be chosen for optimal efficiency and no rebooting would be required.
Yeah, I think that we need a per-application profile to do this too. Then for things that use fewer than 4 cores, we need to keep them in the same CCX.

The OS seems to be having scheduling issues judging by the architecture.

Quote:
Originally Posted by AlphaC View Post

Not the hex-core R5 1600X; that has to be 4+2 or 3+3 cores.

The Ryzen 5 1500X quad core likely would be 4+0 or 2+2.

I believe the penalty is mainly in Windows 10. Right now it is mitigated with faster single-rank RAM with tighter timings (e.g. 3000+ MHz CL14/CL15).

----

New developments as far as compatibility
http://media.kingston.com/pdfs/hx-product-memory-ddr4-amd-ryzen-and-hx-compatibility-us.pdf
Probably 4 cores on a new die? It would be a huge waste to sell an 8-core die with half the cores disabled. Unless the die yields are horrible, but even then you'd want a 4-core die version. Granted, Intel isn't that far from doing that with Broadwell-E (6 cores enabled with 4 disabled on the 6800K and 6850K).

Yeah we are going to be seeing more RAM developments come out in the next few months. Unlike before, it's going to be very important to get good RAM considering how tied this Infinity Fabric is to memory.

I think that the 6800K and 6850K basically need to come down in price. The only people who need them are those who want multi-threaded performance with slightly faster single-thread, quad-channel RAM, 40 PCIe lanes, or the extras of X99 like additional SATA ports.
 

stock...hahaha · Joined · 1,622 Posts
Hey Guys,

Been pondering this some more, and something has just occurred to me that I don't remember seeing anyone mention before. This is absolutely relevant to Nvidia drivers; AMD may enable it by default, but it's worth checking regardless. I am not sure that this will resolve or alleviate the issue, but it will absolutely not hurt performance.

Message Signaled Interrupts (MSI)

Windows Nvidia drivers do not enable this feature by default, as I believe there can be incompatibilities with older hardware. That should not be a problem with Ryzen. I think AMD graphics card drivers do enable it, but I am not sure; you would need to check.

MSI touches the core aspects of the PCIe and memory communication that seem to be part of the slow gaming performance. Traditional pin-based interrupts can at times add latency, which in turn can cause bottlenecks.

Advantages of MSI, from Wikipedia:

"MSI increases the number of interrupts that are possible. While conventional PCI was limited to four interrupts per card (and, because they were shared among all cards, most are using only one), message signaled interrupts allow dozens of interrupts per card, when that is useful.

There is also a slight performance advantage. In software, a pin-based interrupt could race with a posted write to memory. That is, the PCI device would write data to memory and then send an interrupt to indicate the DMA write was complete. However, a PCI bridge or memory controller might buffer the write in order to not interfere with some other memory use. The interrupt could arrive before the DMA write was complete, and the processor could read stale data from memory. To prevent this race, interrupt handlers were required to read from the device to ensure that the DMA write had finished. This read had a moderate performance penalty. An MSI write cannot pass a DMA write, so the race is eliminated"

When Pascal was new, there were some DPC latency issues that popped up, and one of the things we tried was enabling MSI. It seemed to help a bit back then, and if it doesn't help, it is easy to remove. Enabling it is not overly difficult, but it does require a registry edit. If someone would like to try it out, it may be something that at least helps in the gaming performance area. It certainly will not hurt. The only thing to be aware of is that when the drivers are updated, the registry changes that enable MSI get deleted, so you need to export that section of the registry to a *.reg file and run it when you update drivers to re-enable it.

Unfortunately, the exact location is different for every graphics card model, so I can't just write a batch file that updates everyone.

To switch the device to MSI mode, you must edit your graphics card's registry key. You can follow these steps (a rough scripted version of the same edit is sketched after them):

Back up, or at least create a system restore point, before making changes, to be safe.

Open Device Manager.

Find the listing for your graphics adapter.

Open the device properties dialog.

Switch to the "Details" tab.

Select "Device Instance Path" in the "Properties" dropdown box.

Write down the "Value" (for example "PCI\VEN_1002&DEV_4397&SUBSYS_1609103C&REV_00\3&11583659&0&B0").

This is the relative registry path under the key "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Enum\PCI\VEN.......".

Go to that device's registry key ("HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Enum\PCI\VEN_1002&DEV_4397&SUBSYS_1609103C&REV_00\3&11583659&0&B0") and locate the subkey "Device Parameters\Interrupt Management".

For devices working in MSI mode there will be a subkey "Device Parameters\Interrupt Management\MessageSignaledInterruptProperties", and in that subkey there will be a DWORD value "MSISupported" equal to "0x00000001".

If that subkey or the MSISupported value is missing, add them manually.

You can export a copy of the key you just created to a *.reg file so it is easy to apply after your next driver update by just double-clicking the file and rebooting.

Reboot the PC.
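For anyone who prefers to script the same edit, here is a minimal Python sketch (run it from an elevated/Administrator prompt). The device instance path below is only a placeholder; paste in the one you copied from Device Manager.

Code:
import winreg  # Windows-only standard library module

# Placeholder only: replace with the Device Instance Path from Device Manager.
DEVICE_INSTANCE_PATH = r"PCI\VEN_XXXX&DEV_XXXX&SUBSYS_XXXXXXXX&REV_XX\X&XXXXXXXX&X&XX"

key_path = (r"SYSTEM\CurrentControlSet\Enum" + "\\" + DEVICE_INSTANCE_PATH +
            r"\Device Parameters\Interrupt Management\MessageSignaledInterruptProperties")

# Create the MessageSignaledInterruptProperties subkey if it is missing,
# then set MSISupported = 1 (DWORD).
with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, key_path, 0, winreg.KEY_WRITE) as key:
    winreg.SetValueEx(key, "MSISupported", 0, winreg.REG_DWORD, 1)

print("MSISupported set to 1 for", DEVICE_INSTANCE_PATH)

Exporting the key to a *.reg file afterwards still works the same way, and the value still gets wiped by driver updates, so re-run the script (or re-import the .reg) after each driver update and reboot.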

If you could run a benchmark afterwards and report back any changes from your last gaming benchmark, I would appreciate it.
 

Premium Member · Joined · 2,636 Posts
Most of the difference people saw with Win 7 vs 10 comes down to Nvidia drivers being better in 7. Swap to an AMD card and it's vice versa: better in 10.

Sadly there are no magic beans... only the kind that give you gas.

MS might get us a few %, AGESA updates a few %; the biggest gain will come from higher memory/fabric clocks and subtimings in a much later AGESA.

Regardless, it's a well-rounded CPU and a far more affordable platform. Enjoy it.
 

Registered · Joined · 626 Posts
Again, a great thread. My next question relates to the memory and the performance/functionality of 2 DIMMs vs 4 DIMMs being used on boards right now. Do we know yet if the issues we are seeing are related to the bandwidth of each CCX?
 

Meeeeeeeow! · Joined · 2,229 Posts · Discussion Starter #28
Quote:
Originally Posted by Steele84 View Post

Again, a great thread. My next question relates to the memory and the performance/functionality of 2 DIMMs vs 4 DIMMs being used on boards right now. Do we know yet if the issues we are seeing are related to the bandwidth of each CCX?
We will need to wait for more independent testing.

However, seeing that the clock between the CCXes is likely tied to memory, 2 DIMMs might do better than 4 DIMMs, the reason being that 2 DIMMs allow for much more aggressive RAM overclocking. That is provided we can overclock the other 32B/clock buses aggressively and the memory controller is up to the job.

I am hoping that we will see DDR4-4000. No idea if that is at all possible though. Once they unlock Ryzen, we should know what the full potential of Ryzen turns out to be in terms of memory overclocking.

In terms of gaming, we will reach a point where the GPU becomes the bottleneck in most situations. For CPU-limited games, RAM overclocking might help quite a bit. Ryzen already dominates workstation benchmarks in many cases. Some RAM-sensitive benchmarks, though, like 7-Zip, might see big gains with Ryzen.

My advice is to hold off for at least a couple of months and to see how good it is.
 

Meeeeeeeow! · Joined · 2,229 Posts · Discussion Starter #31
Yeah you do make interesting points.

A lot has changed in the past few months. SMT no longer has to be disabled; one of the newer AGESA updates has ensured that there isn't a performance penalty.

You are very right though about the diminishing returns past 3200 MHz. To be honest, I think that for Zen+ AMD may want to aim for, say, 3200 MHz, a 50% improvement in their base Infinity Fabric speed, and make that independent of RAM. That would improve gaming performance dramatically. So too would adding a separate multiplier (kind of like what Intel has done).

What's really fascinating about this is that the speed of communication between the cores is causing the slowdown. We see this happening in Intel CPUs too.



The 7900X is both faster in terms of IPC and clocked at 4.6 GHz, but it loses in games to a 4.4 GHz 6950X in many cases. This is caused by the mesh fabric. In Intel's case though, Uncore and memory are separate, so overclocking the memory does not make the Uncore faster. To add insult to injury, unlike with, say, a 5960X, no OC socket is present, limiting Uncore overclocks. While the mesh topology is good for multithreaded performance, gaming does suffer.

I think that if AMD were to bump up the inter-core speeds, it would be very helpful. The HyperTransport architecture from which Infinity Fabric was developed was rated to 3.2 GHz. Perhaps with OCing, that could be brought up to 4 GHz. The stock Infinity Fabric operates at just 1066 MHz, a third of that. The interesting question, as you note, is the optimal balance between single-threaded and multi-threaded performance. A higher-clocked fabric would use more power for sure. Perhaps 1600 MHz (the speed of DDR4-3200) may be where the law of diminishing returns begins.
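To put rough numbers on that memory-to-fabric coupling, here is a quick back-of-envelope sketch in Python. The DDR4 speeds are just common examples, and the 1:1 MEMCLK-to-fabric ratio is the first-gen Ryzen behaviour assumed above.

Code:
# First-gen Ryzen runs the data fabric 1:1 with the memory clock (MEMCLK),
# and DDR transfers twice per clock, so MEMCLK = (DDR4 data rate) / 2.
for data_rate in (2133, 2666, 3200, 3466, 4000):   # DDR4-XXXX (MT/s)
    memclk = data_rate / 2                          # actual memory clock in MHz
    fclk = memclk                                   # fabric clock, assumed 1:1
    print(f"DDR4-{data_rate}: MEMCLK ~{memclk:.0f} MHz, fabric ~{fclk:.0f} MHz")

So DDR4-2133 gives the ~1066 MHz stock fabric mentioned above, and DDR4-3200 would put the fabric at 1600 MHz.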

It would be interesting to see the returns on 3466 MHz tight-timings kits, or how much can be gained from, say, a 4000 MHz kit stepped down to very tight timings. Ryzen seems to be very timing-sensitive, perhaps because of the design of the CCX communication. I still think that a 1-2 MB L4 cache might have been very helpful. It would have to be small to keep the cost down, but even a small cache might be helpful. It might also help with inter-die communication on Threadripper and Epyc successors.

Quote:
Originally Posted by gtbtk View Post

HI, just checking in from YT. I actually had a half typed reply in draft to this thread from a while ago. I must have gotten side tracked and forgot to come back to it
Thanks for checking in. You might still have it in your draft folder.

I think we will know soon enough how big the penalty of spanning CCXes is. If AMD releases their new APUs next year, the report is that they will have 4 cores (1 CCX) and 11 NCUs (so 704 Vega-like SPs). That could test our hypothesis.
 

stock...hahaha · Joined · 1,622 Posts
Quote:
Originally Posted by CrazyElf View Post

Yeah you do make interesting points.

A lot has changed in the past few months. SMT no longer has to be disabled; one of the newer AGESA updates has ensured that there isn't a performance penalty.

You are very right though about the diminishing returns past 3200 MHz. To be honest, I think that for Zen+ AMD may want to aim for, say, 3200 MHz, a 50% improvement in their base Infinity Fabric speed, and make that independent of RAM. That would improve gaming performance dramatically. So too would adding a separate multiplier (kind of like what Intel has done).

What's really fascinating about this is that the speed of communication between the cores is causing the slowdown. We see this happening in Intel CPUs too.

The 7900X is both faster in terms of IPC and clocked at 4.6 GHz, but it loses in games to a 4.4 GHz 6950X in many cases. This is caused by the mesh fabric. In Intel's case though, Uncore and memory are separate, so overclocking the memory does not make the Uncore faster. To add insult to injury, unlike with, say, a 5960X, no OC socket is present, limiting Uncore overclocks. While the mesh topology is good for multithreaded performance, gaming does suffer.

I think that if AMD were to bump up the inter-core speeds, it would be very helpful. The HyperTransport architecture from which Infinity Fabric was developed was rated to 3.2 GHz. Perhaps with OCing, that could be brought up to 4 GHz. The stock Infinity Fabric operates at just 1066 MHz, a third of that. The interesting question, as you note, is the optimal balance between single-threaded and multi-threaded performance. A higher-clocked fabric would use more power for sure. Perhaps 1600 MHz (the speed of DDR4-3200) may be where the law of diminishing returns begins.

It would be interesting to see the returns on 3466 MHz tight-timings kits, or how much can be gained from, say, a 4000 MHz kit stepped down to very tight timings. Ryzen seems to be very timing-sensitive, perhaps because of the design of the CCX communication. I still think that a 1-2 MB L4 cache might have been very helpful. It would have to be small to keep the cost down, but even a small cache might be helpful. It might also help with inter-die communication on Threadripper and Epyc successors.

Quote:
Originally Posted by gtbtk View Post

HI, just checking in from YT. I actually had a half typed reply in draft to this thread from a while ago. I must have gotten side tracked and forgot to come back to it
Thanks for checking in. You might still have it in your draft folder.

I think we will know soon enough how big the penalty of spanning CCXes is. If AMD releases their new APUs next year, the report is that they will have 4 cores (1 CCX) and 11 NCUs (so 704 Vega-like SPs). That could test our hypothesis.
The mesh and the fabric are conceptually similar and are exhibiting similar issues regarding throughput/latency. The difference is that you can separately overclock the mesh and improve performance with the Skylake-X chips.

A single 4-core CCX in the APUs should actually work better than the 2+2 (two cores on each of two CCXes) design that the R3 chips are using. In spite of the spec sheet saying 16MB of L3 cache, Ryzen only has an 8MB L3 repeated twice. For 4 cores, a single CCX will eliminate any cross-fabric switching.

Ryzen is a version 1 product. Of course there are design elements that AMD would change if they had their time over again, just as there are with any newly developed product. I hope that the next generation of Zen improves a few things:

Firstly, clock the fabric at 2x the memory frequency, or separate the clocks and make the fabric overclockable like on the Intel chips. That would give the chips more headroom for optimization in terms of memory interleaving etc.

Secondly, either add an on-die L4 cache, running at the same frequency the other caches use, that is shared and directly accessible from both CCXes, or even better, make the L3 a single monolithic 16MB cache accessible from both CCX modules. A cache shared between both CCX modules would alleviate the cross-fabric thread latency penalty.

Thirdly, increase the max achievable all-core frequency to 4.5 GHz from the current 4 GHz or so. In spite of the myth being perpetuated on the internet that the Intel chips have significantly better IPC (instructions per clock) than Ryzen, they don't. Ryzen IPC is better than Broadwell-E, about the same as Skylake-X, and only slightly behind Kaby Lake/Coffee Lake. The only reason the Intel chips are winning in benchmark scores like Cinebench single core is that they have more clock cycles per second to process instructions.

All of the current-gen CPUs from both Intel and AMD come in at about 25 cycles per point, +/- 0.5 cycle.
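If I read "cycles per point" as core clock (in MHz) divided by the Cinebench R15 single-thread score, a rough sanity check looks like the sketch below; the clocks and scores in it are illustrative assumptions, not measurements from this thread.

Code:
# Rough sanity check of the "clock cycles per Cinebench point" idea.
# The clocks and single-thread R15 scores below are assumed examples only.
examples = {
    "Ryzen 7 1800X": (4000, 160),   # ~4.0 GHz, ~160 cb 1T (assumed)
    "Core i7-6900K": (3700, 153),   # ~3.7 GHz, ~153 cb 1T (assumed)
    "Core i7-7700K": (4500, 190),   # ~4.5 GHz, ~190 cb 1T (assumed)
}
for chip, (mhz, score) in examples.items():
    print(f"{chip}: {mhz / score:.1f} MHz of clock per R15 point")

With numbers in that ballpark the figure does land in the mid-20s, although how tight the +/- 0.5 band really is would need properly measured scores.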
 

Registered · Joined · 104 Posts
This is a fantastic review! I find that when I look at gaming rigs and test them, once I get out of the 1080p range I see a lot less of a difference.

I personally just finished a TR setup based on the 1900x with 2 980ti in SLI and I find the 4k performance to be top notch!
 

stock...hahaha · Joined · 1,622 Posts
Quote:
Originally Posted by robtorbay View Post

This is a fantastic review! I find that when I look at gaming rigs and test them, once I get out of the 1080p range I see a lot less of a difference.

I personally just finished a TR setup based on the 1900x with 2 980ti in SLI and I find the 4k performance to be top notch!
As resolution increases, the number of rendered frames drops, meaning less work for the CPU calculating all the extra frames but more work for the GPU rendering all the pixels being sent to the screen.

Would it be possible for you to do a test for me? Do you have either a flexible or some alternative SLI bridges, so that you can try SLI in PCIe slots 1 and 2, compare that to PCIe slots 1 and 3, and run some 1080p benchmarks with the two different configurations?

What you will end up testing is the difference between one x16 + one x8 slot both connected to the same PCIe controller, and two x16 slots connected to two different controllers that are separated by Infinity Fabric. I am curious to see what, if any, performance difference there is. I don't know for sure, but I suspect that slot 1 + slot 2 may perform better than slot 1 + slot 3, even though the latter are both x16 slots.
 

Registered · Joined · 329 Posts
Too bad AMD rushed the launch and reviewers got beta firmware. Now that poor gaming reputation is gonna dog them for a while.
 

Premium Member · Joined · 2,636 Posts
Quote:
Originally Posted by gtbtk View Post

As resolution increases, the number of rendered frames drops, meaning less work for the CPU calculating all the extra frames but more work for the GPU rendering all the pixels being sent to the screen.

Would it be possible for you to do a test for me? Do you have either a flexible or some alternative SLI bridges, so that you can try SLI in PCIe slots 1 and 2, compare that to PCIe slots 1 and 3, and run some 1080p benchmarks with the two different configurations?

What you will end up testing is the difference between one x16 + one x8 slot both connected to the same PCIe controller, and two x16 slots connected to two different controllers that are separated by Infinity Fabric. I am curious to see what, if any, performance difference there is. I don't know for sure, but I suspect that slot 1 + slot 2 may perform better than slot 1 + slot 3, even though the latter are both x16 slots.
In my circumstances, with DR (dual-rank) and tuned timings @ 3200: no difference.

Maybe if your IF/memory speeds are low it matters. Maybe if you have something better than a Fury X it matters... i.e. something that can saturate the bandwidth. Maybe my cards don't have enough power to do so.

All I can say is my scores are spot on, if not better than expected, using both x16 slots.

Maybe it's the fact that I am using UMA, but whatever the case, it seems not to matter.

Honestly... anyone claiming issues gaming on TR bought specifically for gaming... Well, they first and foremost are idiots. You do not buy TR for gaming.

Game tests on X370 get more FPS even though TR wins most newer benchmarks overall. (FWIW, most newer 3D benchmarks are CPU tests.)
 

Registered · Joined · 74 Posts
Meh, I'm an idiot then :) I bought my TR for gaming and it's going to work out just fine. The CPU just isn't the important thing when it comes to games; never has been, never will be. Sure, if you want to be an elite benchmarker or a twitch gamer @ 720p it's important, but I just want to put some images up on a nice big 4K screen, max details and a minimum 60fps. I might even use a joypad :P ;) What this platform allows me to do is plug in what I want/need and not have to deal with dark lane rubbish.

Built my last system in 2013, and if it wasn't for spilling a cuppa on it I wouldn't be here; I'd still be quite happily pootling along with a 3770K for the next few years :D It didn't start out as much of a system, with only one GPU and some spinners, but by the end there were 3 GPUs and NVMe SSDs in there, and no CPU/motherboard upgrade was needed as I bought a decent platform with some scope (Z77 + PLX; I did actually buy another CPU after the incident, as the CPU died, and all I could find was a 3570K, but even that ran most games just fine!).

This one will start out similarly humble, but who knows what it will end up as; a lot can change in a few years. One thing that probably won't change is the board/CPU, as there's plenty of capability in there. When you look at the other lower-end platforms it felt like they had gone backwards in many respects; throw in a million RGB LEDs, USB ports and SATA and that'll sell it to 'em.

Has it cost me more to do this build? Sure, but really, what does the extra boil down to? No more than a weekend out on the beers with mates that I would probably be too drunk to remember anyway, bar the empty wallet :D I can skip one of those and the embarrassing pictures that come after :D
 

Premium Member · Joined · 2,636 Posts
I'm referring to the 1080p guys ;)
 

stock...hahaha · Joined · 1,622 Posts
Quote:
Originally Posted by chew* View Post

Quote:
Originally Posted by gtbtk View Post

As resolution increases, the number of rendered frames drops, meaning less work for the CPU calculating all the extra frames but more work for the GPU rendering all the pixels being sent to the screen.

Would it be possible for you to do a test for me? Do you have either a flexible or some alternative SLI bridges, so that you can try SLI in PCIe slots 1 and 2, compare that to PCIe slots 1 and 3, and run some 1080p benchmarks with the two different configurations?

What you will end up testing is the difference between one x16 + one x8 slot both connected to the same PCIe controller, and two x16 slots connected to two different controllers that are separated by Infinity Fabric. I am curious to see what, if any, performance difference there is. I don't know for sure, but I suspect that slot 1 + slot 2 may perform better than slot 1 + slot 3, even though the latter are both x16 slots.
In my circumstances, with DR (dual-rank) and tuned timings @ 3200: no difference.

Maybe if your IF/memory speeds are low it matters. Maybe if you have something better than a Fury X it matters... i.e. something that can saturate the bandwidth. Maybe my cards don't have enough power to do so.

All I can say is my scores are spot on, if not better than expected, using both x16 slots.

Maybe it's the fact that I am using UMA, but whatever the case, it seems not to matter.

Honestly... anyone claiming issues gaming on TR bought specifically for gaming... Well, they first and foremost are idiots. You do not buy TR for gaming.

Game tests on X370 get more FPS even though TR wins most newer benchmarks overall. (FWIW, most newer 3D benchmarks are CPU tests.)
I don't know one way or the other which approach will work better. I was theorizing about SLI/CrossFire and the differences between running the cards on the same die or across dies. I would have guessed that the AMD cards, without any additional bridge cables, may have struggled a bit when put on two different dies. I am glad to hear that using the two x16 slots is working well for you. You may be right about the Fury not having the power to really push things along.

As a matter of interest, on your TR system, have you ever tried the two cards in the x16 and x8 slots connected to the first die?

Personally I don't really give a damn about the games themselves. Sure, they are fun to play every now and then, but I am not a rabid gamer. To me, they are tools, like wprime, that exercise various components and identify areas that are not behaving as expected. In this case, the fabric is not functioning the way we all assumed it did in the beginning, and the gaming benchmarks gave me insight into how the fabric actually connects the components, how it seems to work, and its strengths and weaknesses, not just in games but in everything we use these systems for.

I've discovered that using CPU affinity to pin some applications to appropriate cores gives you the benefits of "gaming mode" and NUMA mode while being able to leave the rest of the system in UMA mode. It lets you have your cake and eat it too (a quick example of doing that is sketched below).
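For anyone who wants to try that, here is a minimal sketch of pinning a process with Python and the psutil module; the process name and the core list are placeholders, so adjust them for your own application and your CCX/die layout.

Code:
import psutil  # third-party module: pip install psutil

TARGET = "game.exe"                    # placeholder process name, substitute your own
FIRST_DIE_CPUS = list(range(0, 16))    # assumption: logical CPUs 0-15 sit on the first die

# Find the running process by name and restrict it to one die's logical CPUs.
for proc in psutil.process_iter(["name"]):
    if proc.info["name"] and proc.info["name"].lower() == TARGET:
        proc.cpu_affinity(FIRST_DIE_CPUS)
        print(f"Pinned PID {proc.pid} to CPUs {FIRST_DIE_CPUS}")

The rest of the system stays on the default scheduler, which is the have-your-cake-and-eat-it-too part.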

I have discovered that the location of a PCIe device in relation to a particular die is relevant to performance, particularly for applications that cannot scale to 32 threads. TR and Epyc are likely to see extensive use as VM hosts. Understanding this means that if passthrough GPUs and shared NVMe storage devices are installed local to the die the VM is running on, you get more efficient use of that hardware, better performance, and reduced operating costs (a quick way to check the device-to-die mapping on Linux is sketched below).
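On a Linux VM host, sysfs reports which NUMA node (die) each PCIe device hangs off. A minimal sketch, assuming a NUMA-enabled kernel; a value of -1 means no affinity is reported.

Code:
import glob, os

# Each PCIe device directory in sysfs exposes a numa_node file.
for dev in sorted(glob.glob("/sys/bus/pci/devices/*")):
    try:
        with open(os.path.join(dev, "numa_node")) as f:
            node = f.read().strip()
    except OSError:
        continue
    print(f"{os.path.basename(dev)}: NUMA node {node}")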
 