Originally Posted by adam3234
I was hoping the new 4-die Threadripper CPUs would encourage AMD to upgrade the TR4 socket to support 8 channels, because without 8 channels the 2 of the 4 dies that don't have direct RAM access take a massive latency hit from having to request RAM data from the other 2 dies that do. Although that is with the Zen 1 architecture; I'm not sure how the Zen 2 architecture, with its chiplets and I/O die, will handle memory requests from 4 dies.
I was hoping future boards like X699 or X799 would support 8 channels, not that AMD would upgrade X399 boards to support 8 channels...although if AMD can do that, that would be pretty cool. What I really want is 16 slots and 8-channel support.
The I/O die centralizes ALL memory channels. All traffic going from one CCX to another, or to memory, goes through the I/O die. Latency was equalized this way to address the issue of stale data. Windows also changed the scheduler so that new threads are spawned on the same CCX as the assigned thread until that CCX is full, then it spills to another CCX. Because the memory channels are centralized, the latency to memory is the same REGARDLESS of which core is requesting the data.
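The fill-one-CCX-first placement described above can be sketched like this. This is a toy model, not Windows scheduler internals; the 4-cores-per-CCX size and CCX count are assumptions for illustration:

```python
# Toy sketch of "fill a CCX before spilling to the next one" thread placement.
# NOT Windows internals; CCX size and count are assumed values.

CORES_PER_CCX = 4   # assumption: Zen CCX with 4 cores
NUM_CCX = 8         # assumption: e.g. a 32-core part

def place_threads(n_threads):
    """Assign each new thread to the first CCX that still has a free core."""
    load = [0] * NUM_CCX
    placement = []
    for _ in range(n_threads):
        # First CCX with spare capacity wins, keeping threads co-located.
        ccx = next(i for i, used in enumerate(load) if used < CORES_PER_CCX)
        load[ccx] += 1
        placement.append(ccx)
    return placement

print(place_threads(5))  # first 4 threads land on CCX 0, the 5th spills to CCX 1
```

The point of this policy is that sibling threads share the lowest-latency communication domain (same CCX) for as long as possible.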
Further, your statement is predicated on the argument that the problem was memory bandwidth to the cores and NOT the latency causing stale data. Sure, memory bandwidth per core is going to be stretched a bit, but all cores have equal access through a two-hop path: one hop to the I/O die and one hop to memory.
Now, AMD did this because MS's NUMA awareness was shiite. Whenever it was in a NUMA situation, the scheduler allowed spawning to ONLY 1 extra node, meaning two of the four dies on the 2990WX. With a centralized I/O die, the system sees the entire CPU as UMA, roughly one memory controller and one die, rather than each die being a separate NUMA node, for simplicity's sake. That means the NUMA problem has been resolved. And by standardizing all latency, even though it increased latency in some cases, it lowers the amount of stale data and other issues, making the CPU faster overall. Efficiency through inefficiency.
So the problems of first- and second-gen TR are gone. That also means the new 32-core will likely smash the crap out of the 18-core Intel CPUs and go toe to toe with the 28-core in many cases. Meanwhile, the 64-core variant, expected to be priced at $3000-3400, will be in the price category of the 28-core OC Xeon while also beating it badly.
AMD has already said it is keeping backwards compatibility on the new third-gen chips. Making those chips shut down two memory controllers, with 2 channels per controller, would be a PITA! I suppose it is possible, just very unlikely.
Instead, the new MBs will support PCIe 4.0. No new socket is likely until DDR5 arrives around 2021, so the chances of adding those memory channels before then are small. Before they mentioned possible X399 compatibility, I was hoping, since we got two generations on the platform, for a potential new socket and additional channels. But most likely it's not happening.
As I believe I said, Intel isn't increasing their memory channels either, so that leaves little reason for complaint.
Finally, if you are going off of people saying there isn't enough bandwidth to run 32 cores on 4 channels of memory, that is FALSE. It was disproved when the scheduler issue came to light. I, on my own, suspected a stale-data problem for the cores without memory channels, as they were ALWAYS stuck with the largest latency possible. On first- and second-gen TR there are four different core-to-core latencies and two memory latencies:

Core to core: 1) same CCX, 2) different CCX on the same die, 3) mirrored CCX on a different die, 4) different CCX on a different die.
Memory: 1) local memory controller, 2) memory controller on the other die.

So you would wind up with the two dies without memory controllers always having to go off-die for memory calls, while also having very long latencies on die-to-die comms, resulting in a lot of latency. Combine that with a scheduler acting opportunistically, causing thread thrashing by moving apps to and from core 0 regularly, and you get a system that is FUBARed in many use cases.
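Those latency classes can be sketched as a small model. Everything here is an assumed topology for illustration (4 dies, 2 CCXs per die, 4 cores per CCX, memory controllers on dies 0 and 2), not exact 2990WX silicon numbers:

```python
# Hypothetical sketch of first/second-gen Threadripper latency classes.
# Topology values below are assumptions for illustration, not silicon specs.

CORES_PER_CCX = 4
CCX_PER_DIE = 2
CORES_PER_DIE = CORES_PER_CCX * CCX_PER_DIE
DIES_WITH_MEMORY = {0, 2}   # assumption: only half the dies have controllers

def zen1_core_to_core(a, b):
    """Classify the core-to-core latency between cores a and b."""
    die_a, die_b = a // CORES_PER_DIE, b // CORES_PER_DIE
    ccx_a, ccx_b = (a // CORES_PER_CCX) % CCX_PER_DIE, (b // CORES_PER_CCX) % CCX_PER_DIE
    if die_a == die_b:
        return "same CCX" if ccx_a == ccx_b else "different CCX, same die"
    return "mirrored CCX, different die" if ccx_a == ccx_b else "different CCX, different die"

def zen1_memory_path(core):
    """Cores on dies without a controller always pay the off-die hop."""
    die = core // CORES_PER_DIE
    return "local controller" if die in DIES_WITH_MEMORY else "remote controller (extra hop)"
```

Under this model, every memory access from a controller-less die lands in the "remote controller" class, which is exactly the always-worst-case latency described above.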
So, with Zen 2, you have only two core-to-core latencies: 1) same CCX, and 2) CCX to CCX. Further, you have only 1 memory latency: go to the I/O die, then go to memory. In addition, by centralizing the memory controller, the OS reads it as a single memory node (Unified Memory Architecture), so it does NOT treat it as NUMA when scheduling. On top of that, MS changed the scheduler to keep threads on the same CCX as much as possible before spawning to another CCX, keeping core-to-core traffic in the lowest latency class at all times.
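For contrast with the first-gen picture, the Zen 2 model collapses to two core-to-core classes and a single uniform memory path. Again, the CCX size and core numbering are assumptions for illustration:

```python
# Hypothetical sketch of the Zen 2 latency model described above:
# two core-to-core classes, one uniform memory path through the I/O die.
# CCX size and core numbering are assumed values, not silicon specs.

CORES_PER_CCX = 4   # assumption

def zen2_core_to_core(a, b):
    """Only two classes remain: same CCX or CCX to CCX."""
    same_ccx = (a // CORES_PER_CCX) == (b // CORES_PER_CCX)
    return "same CCX" if same_ccx else "CCX to CCX"

def zen2_memory_hops(core):
    """Every core takes the identical two-hop path, regardless of which core asks."""
    return ["I/O die", "memory controller"]
```

The key property is that `zen2_memory_hops` ignores its argument: memory latency is equalized, which is what lets the OS treat the whole package as a single UMA node.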
If you understand what was done, you should not fear the new gen at all; you should really welcome it. AMD will be discussing this more at its server event on Aug. 7th, followed by Hot Chips the week after, where they are headlining.
Now, if your workload needs the memory bandwidth, it may be time to step up to the Epyc Rome platform anyway.