Now this inter-CCX latency is almost 3 times higher than Zen 4's. This is probably the root of all these multi-core performance problems, because it increases core-to-core communication time and decreases chip-to-chip bandwidth, and on top of that it increases the RAM latency too.
There is no inter-CCX latency on the single CCX parts, but odd performance issues remain. Memory latency is also not significantly different between Raphael and Granite Ridge. The inter-CCX latency issue is just one problem of many, and is probably more of a symptom than a cause.

all they needed to do to bandwidth-unlock the 8c/16t part was dump the single CCD...
Splitting an 8c/16t part into two CCXes would hit performance far harder, far more often, than the single-CCD bandwidth limitations do.
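Rough numbers for context, a minimal back-of-the-envelope sketch assuming the commonly quoted GMI link widths of 32 bytes/clk read and 16 bytes/clk write per CCD (not figures confirmed by AMD documentation):

```python
# Back-of-the-envelope: single-CCD fabric bandwidth vs. dual-channel DDR5.
# Assumed link widths (32 B/clk read, 16 B/clk write per CCD) are the
# commonly quoted GMI figures, not numbers confirmed by AMD.

GB = 1e9  # decimal GB, the way memory bandwidth is usually quoted

def fabric_bw_gbs(fclk_mhz: float, bytes_per_clk: int) -> float:
    """Peak fabric bandwidth for one CCD link at a given FCLK."""
    return fclk_mhz * 1e6 * bytes_per_clk / GB

def dram_bw_gbs(mt_per_s: float, channels: int = 2, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth (2x 64-bit channels by default)."""
    return mt_per_s * 1e6 * channels * bus_bytes / GB

for fclk in (2000, 3000):
    print(f"FCLK {fclk}: ~{fabric_bw_gbs(fclk, 32):.0f} GB/s read, "
          f"~{fabric_bw_gbs(fclk, 16):.0f} GB/s write per CCD")
print(f"DDR5-6000 dual channel: ~{dram_bw_gbs(6000):.0f} GB/s theoretical")
```

Under those assumptions a single CCD at FCLK 2000 tops out around 64 GB/s read against ~96 GB/s of theoretical DDR5-6000 bandwidth; raising FCLK to 3000 would roughly close that gap.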

Yeah, but that's a big change.
Keeping the Ryzen 7000 architecture and power limits, moving to 4 nm, and upgrading the memory controller to allow for 3000 FCLK would have made an amazing product.
It would need to be monolithic and would need to be something significantly different from the current Fabric implementation. A hypothetical monolithic desktop part built from the ground up probably wouldn't even connect the memory controller with Fabric; they'd give it its own stop on the ring bus.

Infinity Fabric exists for the same reason chiplets exist: to make designs more modular and ultimately cheaper. There is a whole lot that is within the realm of technical possibility that will never happen because it would be a massive waste of money for AMD (or whoever). Replacing the IOD, or moving to a new substrate/interconnect, or spending their expensive TSMC wafer allotments building an entire die flavor just to make a handful of enthusiasts happy, or to win a few client benchmarks, doesn't jibe with running a profitable business.
 
so I had this funny feeling testing the 9950X ES... and I was right.

AMD made so many improvements for server that they basically crippled the single-CCD parts even more... so much that they probably should not exist...

4c/4c, theoretically what the 9700X should have been. @kailz, what was your best balls-out 1b?

View attachment 2669622
16,4s with AVX512 Cannonlake. The Zen 5 workload is slower on the single-CCD parts ;(

Keeping them in :D it's just PBO and 1,43v you know :) Best it can do is around 5550 effective on this cooling; I didn't delid the CPU either. I think MAX 1b with LN2 will be 14 seconds, so not HWBOT-worthy compared to the 12900K AVX for it.
 
Maybe one day I can beat my 12900KF YC Pi 1b, but I need 1,2 seconds or so and better cooling.

+unlocked PMIC :D


Probably can shave it under 15s; this was just testing my new kit, 7400C36 Patriots.
 
I'm a Karhu addict, I just want to match my Intel 450 MB/s output. Is that too much to ask? 🤣
 
y-cruncher is bandwidth limited, but y-cruncher is an outlier. What real-world tasks are going to perform better on two four-core CCXes than one eight-core?

'In the olden days' (2018), I built a TR 2950X system; it had not-so-hot inter-CCX latency - HOWEVER, one could switch the TR 2950X NUMA modes, which gave some big improvements in latency. Something similar for the Granite Ridge parts could be useful, no?
You can set most any Ryzen setup to split the CPU into NUMA nodes based on L3 caches. It doesn't do anything to inter-CCX latency, it just convinces most OS schedulers to keep within a node as much as possible.

Older Threadrippers had different memory controllers on different dies and using NUMA could reduce memory latency because anything running on one CCX wouldn't touch a non-local memory controller unless absolutely necessary. This doesn't apply to any AM4 or AM5 CPU because none of them have this topology.
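If anyone wants to sanity-check which cores share which L3 before flipping that option, the CCX boundaries are just the L3 sharing domains. A minimal sketch, assuming a Linux box with the standard sysfs cache topology files (the BIOS option name varies by board, often something like "ACPI SRAT L3 Cache As NUMA Domain" under AMD CBS):

```python
# Minimal sketch: group logical CPUs by the L3 cache they share.
# On Ryzen each group corresponds to one CCX; with "L3 as NUMA domain"
# enabled, each group should also show up as its own NUMA node.
import glob
import os
from collections import defaultdict

def l3_groups() -> dict[str, set[int]]:
    groups: dict[str, set[int]] = defaultdict(set)
    for cpu_dir in glob.glob("/sys/devices/system/cpu/cpu[0-9]*"):
        cpu = int(os.path.basename(cpu_dir)[3:])
        for idx in glob.glob(os.path.join(cpu_dir, "cache/index*")):
            try:
                with open(os.path.join(idx, "level")) as f:
                    if f.read().strip() != "3":
                        continue  # only interested in L3
                with open(os.path.join(idx, "shared_cpu_list")) as f:
                    groups[f.read().strip()].add(cpu)
            except FileNotFoundError:
                continue  # some CPUs may not expose cache info
    return groups

if __name__ == "__main__":
    for shared, cpus in sorted(l3_groups().items()):
        print(f"L3 shared by CPUs {shared}: {sorted(cpus)}")
```

On a dual-CCD part this should print two groups; once the BIOS option is set, `numactl --hardware` should report matching NUMA nodes.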
 
y-cruncher just prefers latency.

I can still get it fast with 2x CCD @ 1:1.
 
View attachment 2669638

This is on a slow version of Windows, need W11 for it, but I was trying some stuff yesterday with just PBO +200 and -45 all-core.
Once I set a static OC, for example 5600 and 1,35v, it's way slower. And I need more cooling for 5675+ static.
Yes, those were my findings and why I felt the need to mention it when you beat my single-CCD 1b.

Unfortunately, until AMD fixes AGESA and stops thinking I'm overheating and throttling back my CPU, I can't optimize anything sub-zero for actual "performance". So I have just used it to gauge how I want to run benchmarks based on what gains I can measure while static. Static is at least good for that: when static, nothing jumps around. A tune is faster or it isn't.

That wasn't an upgrade, kaliz. That thing has just been dusted off and ghetto-mount adapted. I had it way back on Deneb 😉
 
...you can borrow mine - this is the 8000 one, have an extra for 8200 :p

View attachment 2669640




@Blameless - yeah, on my TR 2950X the latency gain via NUMA/UMA was something like (-)12 ns, mid 70s to low 60s
 
V-cache CPU... try non-V-cache vs non-V-cache.

Also we are talking 8c rankings.

I'm already 10.777 16c untuned, just flipping benches back to back.
 
Mate, if you lower this score by 600ns it's a top score already for that part, but on the 9950X ES I saw Dom getting 10,8s!

 
About to put retail under ice. Maybe I'll try not running the default 1b config and tuning it instead 😁

 
Regarding that ARdPtrInitValMP0/1 setting, it's been around since AM4.

I have it set to "0" on all my saved AMD NVRAM dumps going back to at least my 3900X. IIRC, I first noticed it in an MSI board that exposed a bunch of settings that MSI gave silly brand-specific names, and I noted that it improved memory benches slightly, so I started looking for it in my other boards (it's hidden on most of them but editable via AMISCE or other tools). It's also enabled on my 7800X3D system, which explains why I haven't been seeing any advantage to 2:3 ratios on any of my AM5 setups.
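If anyone wants to hunt for it on their own board, here's a minimal sketch of the kind of thing I mean, assuming you've already exported the setup NVRAM to a text file with AMISCE/SCEWIN or whatever your vendor tool produces (the filename below is just a placeholder):

```python
# Minimal sketch: search an exported BIOS setup/NVRAM text dump for the
# ARdPtrInitValMP0/1 settings. "nvram_dump.txt" is a placeholder name;
# produce the dump yourself with AMISCE/SCEWIN or a similar tool first.
import re
import sys

path = sys.argv[1] if len(sys.argv) > 1 else "nvram_dump.txt"
pattern = re.compile(r"ARdPtrInitValMP[01]", re.IGNORECASE)

with open(path, errors="replace") as f:
    for lineno, line in enumerate(f, start=1):
        if pattern.search(line):
            print(f"{lineno}: {line.rstrip()}")
```

Whatever the dump format, the setting name should appear verbatim, so a plain text search is usually enough to confirm whether your board exposes it.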
 