For a while now I've been working on my new custom-loop MSI MEG Creation X399-based 2950X sort-of-silent 4U rackmount rig. I just wanted to report that I finally got PBO to do something useful on this board. I am using the latest 1usmus bios.
Because I run Linux I endeavored to do as much as possible in BIOS, so (almost) everything I've accomplished was achieved via BIOS tweaking -- that is, no Windows software allowed, as I won't be booting Windows except to verify/test.
I have 64G of ram in 4 dual-rank modules from a Trident Z F4-3200C14-16GTZ Glo-Stick kit. Getting this to run at an actual 3200 was hard AF but ultimately boiled down to this:
- XMP profile 2 (3200)
- SOC voltage ~1.09 (not quite... it's whatever is one tick down from 1.1 IIRC)
- Adjust DRAM voltage to 1.39V as reported in BIOS§.
DRAM voltage seems to be quite a sensitive frob. If I go down to just 1.38 prime95 instanta-fails, yet I have successfully run prime95 for hours at 1.39 with no failures (once I set ProcODT 53; otherwise it eventually fails at any voltage).
For RAM testing I usually run prime95 until I want to scream, and then compile chromium with full debug symbols and high parallelism on a compressed ramdisk. Probably I am doing this in the wrong order: I find the latter test is often faster at uncovering RAM instability, which manifests as unexplained "ninja" failures, with an error message something like "ninja stopped working for no reason." Due to the huge time-cost of avoiding Type II errors (hmm, so H0 = "ram stable"? Not really... so maybe they're type I errors then), these RAM "over"clocking tasks take forever and I kind-of hate doing them.
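For anyone wanting to automate the "rebuild until something inexplicably breaks" test, here is a minimal Python sketch. It is not what I actually ran; the function name `repeat_command` and the idea of re-running the identical command several times are my own illustration. The reasoning is that a build which fails on one run and passes on the next, with nothing changed, points at hardware rather than at the source tree.

```python
import subprocess


def repeat_command(cmd, runs=5):
    """Run `cmd` (a list of arguments) `runs` times in a row.

    Returns a list of (run_index, return_code) tuples, one per failing run.
    An empty list means every run exited with status 0.
    """
    failures = []
    for i in range(runs):
        # Output is discarded; only the exit status matters for this check.
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures.append((i, result.returncode))
    return failures
```

Usage would be something like `repeat_command(["ninja", "-C", "out/Debug", "chrome"], runs=3)`, where the build directory and target are placeholders for whatever you actually build. You would need to clean the build between runs for each pass to do real work.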
I also have several options changed in the "DigitALL Power" menu (or whatever it's really called -- I'm referring to the menu where Spread Spectrum is found) which might affect RAM and/or system stability. I'm typing on the system in question and don't want to reboot but I'll endeavor to document all my BIOS changes (and correct any mistakes I made, working from memory) in a subsequent post, by reverting to stock, loading a stored OC profile, and photographing the "changes" dialog. Might take me a few days to get around to this.
As for the CPU, I'm surely forgetting a lot of stuff but the main changes I made were:
- Enable PBO using the 400W limits
- Select manual PBO level and set it to "2"
- Enable a negative voltage offset of (IIRC) 0.075V
- LLC level "7"
I took inspiration from
on the interaction between XFR/PBO and offset voltage in MSI Ryzen BIOSen.
One thing that tripped me up for a long time: options in the MSI BIOS without the "[..]" markers are not necessarily read-only! Those markers only mean that a list of discrete values is available, but if you just highlight the option in question and start typing numbers, most of these can, in fact, be changed. The voltage offset value is one such frob.
All that got me to stable, but not quite. Under linux it was rock-solid under load, but, occasionally, I would return to the machine after leaving it idle for several hours and it would be frozen (with no signs of life whatsoever, i.e., three-finger-salute does nothing, network interfaces non-responsive, etc). Turns out I could solve this problem (so far, at least...) by disabling c6 using zenstates.py on boot. Since I made that change, I haven't seen any problems (it's been several days without a lockup now... Before c6-disablement, they were more-than-daily events; maybe I should just disable c6 in bios, assuming that's possible, but since it apparently ain't broke, I might just not try to fix it).
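The boot-time step can be sketched in Python as below. Assumptions, loudly: the install path `/usr/local/sbin/zenstates.py` is hypothetical, the script is assumed to be executable, and `--c6-disable` is the flag name as I understand it from the ZenStates-Linux script. The script talks to the CPU through `/dev/cpu/*/msr`, so the `msr` kernel module has to be loaded first and everything has to run as root.

```python
import subprocess

# Hypothetical install location; adjust to wherever the script actually lives.
ZENSTATES = "/usr/local/sbin/zenstates.py"


def c6_disable_commands(zenstates_path=ZENSTATES):
    """Return the commands needed to disable C6, in the order they must run."""
    return [
        ["modprobe", "msr"],               # expose /dev/cpu/*/msr
        [zenstates_path, "--c6-disable"],  # flag name assumed from the script
    ]


def disable_c6(zenstates_path=ZENSTATES, dry_run=True):
    """Print the commands (dry_run=True) or run them as root (dry_run=False)."""
    commands = c6_disable_commands(zenstates_path)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if a step fails.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `disable_c6(dry_run=False)` from a root-owned startup script or service would be one way to hook it into boot; the default only prints what it would run.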
As a yard-stick, with these changes my multi-thread Cinebench 15 scores under Windows go from something under 3000 to something over 3500, and I don't see any terrifying voltages (except on idle cores, which I've decided to treat as mostly harmless artifacts of XFR), things stay reasonably cool... basically, I don't see any mortifying crap going on. Power efficiency is clearly a casualty but I don't get the impression that my CPU will wind up extra-crispy anytime soon. I should probably spend some time physically probing motherboard thermals though; there could easily be scary hotspots on the mobo that I have no clue about yet.
One power-related thing that's kind-of curious: under linux, my wall-power draw never drops below 140W or so, whereas in Windows, I see it idling at something more like 90W. I have no idea what Windows is doing differently to cause such a dramatic difference. I'm using the ondemand frequency governor and haven't tried obvious experiments like using the powersave governor, however. [Edit: come to think of it, this could well be entirely due to inferior power management under the amdgpu drivers vs. Windows AMD gaming/consumer GPU drivers, which sounds like a plausible and testable hypothesis.]
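To rule the governor in or out, it helps to confirm what every core is actually using. A minimal sketch, assuming the standard cpufreq sysfs layout (`/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor`); the function name `governor_counts` is my own:

```python
from collections import Counter
from pathlib import Path


def governor_counts(cpu_root="/sys/devices/system/cpu"):
    """Count how many logical CPUs use each cpufreq governor.

    Returns a dict such as {"ondemand": 32}. CPUs without a cpufreq
    directory are simply not counted.
    """
    counts = Counter()
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in Path(cpu_root).glob(pattern):
        counts[gov_file.read_text().strip()] += 1
    return dict(counts)


if __name__ == "__main__":
    print(governor_counts())
```

Comparing wall power with everything on ondemand versus everything on powersave would at least separate the CPU side from the GPU-driver hypothesis.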
Would be interested in hearing any thoughts about this stuff. Overclocking threadripper is confusing AF compared to normal CPUs. It's hard to figure out what's going on, especially under linux*. It doesn't help that, for some reason, AMD seems determined not to properly document Ryzen's MSRs... maybe because they haven't finalized the interface yet, or because they don't want to expose frobs able to blow-up an otherwise highly nerfed platform that provides "safe-ish" OC capabilities?... Really not sure.
Obviously these "secrets" are going to partners and leaking to the usual SMEs, so why not tell the rest of us? My theory: a conspiracy to provide more datapoints to the Cortana spyware
§ For some reason if I set voltage v₀ then when I loop back to view the results, the bios reports voltage ~(v₀ + 0.01V) for channels A/B and ~(v₀) for channels C/D. Taking this as a hint, I set DRAM voltages of 1.40V for A/B and 1.39V for C/D, to achieve a BIOS-reported value of ~1.39V on all channels.
* Check out CoreFreq from GitHub; insmod with the Experimental=1 option -- those guys/gals are slowly reverse-engineering the undocumented MSRs, which is closing the information gap that currently exists between Linux and Windows on Threadripper (and Ryzen and many more, btw). Sadly it doesn't provide normal lm_sensors read-outs, so it is useful for troubleshooting, but probably not for general-purpose monitoring/tuning. Hopefully these capabilities will eventually find their way into lm_sensors and/or other standard linux platform introspection/power-tuning places. Also, for the Meg Creation (and likely other boards), manually loading the nct6775 kernel module will reveal some of the missing sensors -- although, unsurprisingly, the provided information seems incomplete/wrong to some extent.
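Once nct6775 (or any other hwmon driver) is loaded, its readings show up under `/sys/class/hwmon`, which is the same place lm_sensors reads from. Here is a minimal Python sketch that dumps the temperature inputs. It assumes the standard hwmon layout, where each `temp*_input` file holds millidegrees Celsius; the function name and the key format are my own choices, and which sensor maps to which physical spot on the board is still guesswork.

```python
from pathlib import Path


def read_hwmon_temps(hwmon_root="/sys/class/hwmon"):
    """Return {"hwmonN:chip/tempM_input": degrees_C} for every readable sensor."""
    readings = {}
    for chip_dir in sorted(Path(hwmon_root).glob("hwmon*")):
        name_file = chip_dir / "name"
        if not name_file.is_file():
            continue
        chip = name_file.read_text().strip()
        for temp_file in sorted(chip_dir.glob("temp*_input")):
            try:
                millideg = int(temp_file.read_text().strip())
            except (OSError, ValueError):
                # Some inputs error out or return junk when nothing is wired up.
                continue
            key = f"{chip_dir.name}:{chip}/{temp_file.name}"
            readings[key] = millideg / 1000.0
    return readings


if __name__ == "__main__":
    for sensor, celsius in read_hwmon_temps().items():
        print(f"{sensor}: {celsius:.1f} C")
```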