Originally Posted by Keith Myers
I want to see the memory latency die to die or die to I/O chip before I decide on the 3000 series. With the current 2000 series 32 core chips, half the cores can't be used because of the memory access penalties with only two dies having direct access. Any core usage over about 24-36 threads just hammers the compute times on the threads that have the high memory latency through the extra off die excursion. One of my friends sold off his 2990WX because it couldn't handle running the thread count to 60. Most of my other friends with the 2990WX are only running about half loading to keep the CPU compute times consistent.
You are wrong and are repeating information that has since been disproved.
It had NOTHING to do with memory bandwidth: the exact same behavior was exhibited when tested on Windows 10 with an Epyc 7551P, a CPU with the full eight memory channels.
There are multiple issues. First, the NUMA support in Windows is so old that it was designed for spill-over to a second node, back when Intel's Xeons had two CPU nodes on a single package sharing one memory controller. The scheduler only handles flow from the first NUMA node to a second, while the 2990WX has four nodes, so the NUMA handling and the scheduler were partly to blame for the performance regression. This behavior was NOT observed on Linux, which accounts for the architecture better.
Next, there is the issue of stale data. Because two of the dies have no direct memory access, and because of the way latency stacks on Threadripper (see my discussion here: https://www.overclock.net/forum/379-...l#post27893200, also worth checking out for the Zen 3 rumors, which AMD is due to expand on in 5 days at the Game Developers Conference), you can run into a stale-data problem; it is described toward the end of the post or two of mine just above the linked one.
Now, AMD has dealt with this to a degree by centralizing the I/O die, which allows a UMA treatment of memory calls. For Epyc, the core dies are not connected to each other with IF directly; everything is routed through the I/O die (Mark Papermaster disclosed this months ago). The mainstream Ryzen CPUs, on the other hand, will have their core dies linked directly with IF, so inter-die traffic doesn't need the two hops through the I/O die and over to the other die. The reasoning: the fewer the core dies, the lower the chance of data going stale. Meanwhile, wiring all the core dies to each other on a 64-core chip built from 8-core chiplets would be a technical nightmare, and the latency that varies by die pair might well have made it less efficient than standardizing the latency, which allows a more efficient scheduling algorithm. Either way, overall latency is reduced, even though some latencies increased because the core dies no longer have direct memory access.
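To see why full die-to-die wiring doesn't scale, just count the IF links each layout needs. This is a toy sketch (link counts only, not measured latencies); the function names are mine, not AMD's:

```python
from itertools import combinations

def full_mesh_links(dies):
    """Direct die-to-die wiring: one IF link per pair of core dies."""
    return len(list(combinations(range(dies), 2)))

def star_links(dies):
    """Centralized I/O die: one IF link per core die, uniform 2-hop routing."""
    return dies

# 8 chiplets fully meshed need 28 links; routed through the I/O die, only 8.
print(full_mesh_links(8), star_links(8))  # 28 8
```

The star layout costs an extra hop on every transfer, but every die pair pays the same cost, which is the standardized latency mentioned above.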
CorePrio by Bitsum is not a perfect solution, but it is similar to AMD's own solution.
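For anyone who wants the flavor of what an affinity workaround like this does, here is a minimal sketch that confines a process to the memory-attached dies. The die-to-CPU numbering is an assumption for illustration (real OS enumeration varies; check numactl --hardware or coreinfo), and this uses the Linux affinity call since Windows would need SetProcessAffinityMask via ctypes:

```python
import os

# Assumed 2990WX layout: 4 dies x 8 cores x SMT2 = 64 logical CPUs,
# numbered die-by-die (0-15 on die 0, 16-31 on die 1, ...), with dies
# 0 and 2 being the two that have direct DRAM access.
def memory_attached_cpus(threads_per_die=16, attached_dies=(0, 2)):
    """Return the logical CPU ids on dies with direct memory access."""
    cpus = set()
    for die in attached_dies:
        start = die * threads_per_die
        cpus.update(range(start, start + threads_per_die))
    return cpus

def pin_to_memory_dies(pid=0):
    """Crude static version of the CorePrio idea: keep a process off
    the compute-only dies entirely (Linux only; pid 0 = this process)."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(pid, memory_attached_cpus())

print(sorted(memory_attached_cpus()))  # CPUs 0-15 and 32-47 under the assumed numbering
```

The difference is that CorePrio adjusts affinity dynamically based on load rather than pinning statically, which is why it loses less peak throughput when a workload really can use all 64 threads.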
Edit: Here is PCWorld comparing the 2990WX using CorePrio to the 28-core Intel Xeon.
(only shown on 7-Zip compression test)
Edit 2: Ian Cutress's article from January: https://www.anandtech.com/show/13853...dows-scheduler