1 - 17 of 17 Posts

mike7877 · Registered · 1,627 Posts
Discussion starter · #1 ·
Many people avoid using Prime95 (P95) to determine their system's stability because it generates so much heat while consuming ridiculous amounts of power - much, much more than any real program would cause (60-80% more). The biggest problem with this is that the heat skews the voltage a CPU requires for stability. E.g. at 100°C, a chip might need 1.385V for proper operation at 5.1GHz, but just 1.32V at 68°C. You can see the issue if only Prime95 raises the chip to 100 degrees, and the next worst case (the worst 'real' program) raises it to just 68 degrees.

What can be done?

Well, what I was partial to in the past was using Prime95 to find the two weakest cores (2/6, 3/8, etc.) and only running it on them. I'd find the approximate temperature the most torturous (real) program would cause the CPU to reach, and that would be the target temperature for the two cores running P95 small FFT. Usually done by running the cooler at 100% - easy enough.

This method is very good. Very good. After using it, I've never been able to generate errors with any other stability-testing program, and the voltages aren't excessive - they're only about 0.015-0.02V over the point where the next most demanding stability tests start generating errors.

Is there a better method?

I've been doing some thinking, and I think there might be. It requires ThrottleStop!

The easiest way to describe it is with an example. Instead of targeting the two weakest cores and matching temperature, the CPU, running at say 5GHz, is throttled down until, with P95 running, it matches the power consumption of the CPU during its worst real-world workload. If that's 150W: start ThrottleStop and set its CPU clock modulation to much lower than required (25%), then start P95 small FFT on all cores and check CPU power consumption (e.g. 60W). Then increase clock modulation (in ThrottleStop's 6.25% steps) until you reach the first step that causes at least 150W to be drawn, and wait for errors for however long you choose (at least 4 hours, no more than 30). If no errors, reduce voltage and repeat the test. If errors, increase voltage and repeat. Keep doing whichever you need until you find the optimal voltage. Then add your offset - at least 0.01V, no more than 0.05V (there are many factors to consider when choosing this - I won't get into them right now because it's not the focus of this thread).
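Sketched in Python, that search loop looks something like this. Both helper functions are hypothetical stand-ins - ThrottleStop has no scripting API, so in practice you change modulation by hand and watch power in HWiNFO - and the 240W full-tilt figure is made up for illustration:

```python
# Sketch of the modulation-matching step described above. The two helpers
# are hypothetical stand-ins: ThrottleStop has no scripting API, so in
# reality you set modulation by hand and read power from HWiNFO.

TARGET_WATTS = 150.0   # worst real-world power draw of this example CPU
STEP = 6.25            # ThrottleStop's clock-modulation step size

def set_clock_modulation(percent):
    """Stand-in for manually setting ThrottleStop's clock modulation."""
    pass

def read_package_power(percent):
    """Stand-in for reading package power while P95 small FFT runs.
    Fakes a roughly linear curve: assumed 240 W at 100% duty."""
    return 240.0 * percent / 100.0

def find_modulation(target=TARGET_WATTS, step=STEP):
    """Walk modulation up from 25% until draw first reaches the target."""
    percent = 25.0
    while percent <= 100.0:
        set_clock_modulation(percent)
        if read_package_power(percent) >= target:
            return percent
        percent += step
    return 100.0

print(find_modulation())  # -> 62.5, the first step reaching 150 W here
```

From there the voltage loop is the familiar bisection by hand: pass → drop Vcore, fail → raise it, then add the final offset.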

So, what do you think of this method? I've done some preliminary testing and its effectiveness seems pretty similar to the method I described in the second paragraph - the voltage given may be ever so slightly more conservative (meaning higher than absolutely necessary).

I'm making this thread because I think what I described could be the method - recommended over all others for its simplicity and efficacy. The golden standard lol

Thoughts?
 
Just run y-cruncher's Component Stress Test and see if you can even pass one loop.
If you can, you're pretty much 99.9% stable. Since that test rapes the PC harder than any other test I've tried to date.
 
The problem with the assumption that Prime95 is an extreme load is that this is increasingly less likely to be the case. Prime95 has always been a distributed computing project adapted to stress testing, not a dedicated test capable of loading parts beyond what could reasonably be encountered in the field. As applications get more optimized over time, we see more and more real-world, even commonplace, apps rival such loads. Indeed, I'm in search of ever more stressful tests because I cannot rule out future apps that I will expect to be unconditionally stable from being even more demanding. Furthermore, real-world loads don't make for good stress tests because the time needed to validate them is extreme. If I want to know with ~99% certainty that any given 15-hour encode won't fail, I'd need to test an equivalent load for hundreds of hours... a good stress test should give me similar levels of certainty in a small fraction of the time.

On my AMD systems, both my Monero miner and newer versions of x265 will pull more current than Prime95 Small FFTs (every time I transcode a video for Vimeo, I load my CPU more than most of the tests most commonly used). It's not that far off on my Intel setups either. And pretty much every single modern CPU I've got will sit at its temperature limit in all kinds of real-world loads, unless equipped with sub-ambient cooling or undervolted. Anything that artificially limits temperature, clocks, or anything else is automatically less than representative and cannot be a good indicator of how the part will behave under more extreme stress.

Personally, I'm looking to find problems so I can fix them, not conceal problems to achieve a false sense of security that could easily be upended because I decided to do something new with my hardware. Any 'golden standard' that obfuscates problems outside narrowly drawn bounds of 'realistic' loads is not very useful to me.

If this methodology is satisfactory for you, by all means, use it. Just be aware that if it results in a higher OC and/or lower voltage than other tests, there is significant potential for issues down the line, should your use patterns change.
 
Discussion starter · #4 · (Edited)
The problem with the assumption that Prime95 is an extreme load is that this is increasingly less likely to be the case. [...] Just be aware that if it results in a higher OC and/or lower voltage than other tests, there is significant potential for issues down the line, should your use patterns change.
Many good points - the method I'm describing is more for people who don't run loads that take more power than Prime95, and don't have cooling for it either (95% of people over 5GHz). You'd have no problem running Prime95 on your system, so you wouldn't need to modulate clock speed. Actually, you wouldn't want to, because your maximum real world load causes even more power consumption than Prime95!
It seems your machine is tuned more toward mission-critical operation - the method I described is for normal client usage. Not that it wouldn't be stable: if someone wanted mission-critical stability, then after completing what I described above, leaving the voltage there and reducing clocks by 100MHz would do it - mission-critical with margin to spare. Even before doing that it should pass everything I can think of.

Prime95 does not exercise all parts of the CPU, but to my knowledge, no stress test does. Especially if you want error checking. I don't think Prime95 does non-AVX error checking either.

Yes, I agree that using normal programs to check system stability is an absolutely massive waste of time, but I never suggested it. I suggested error checking within the power envelope the processor will be run in, plus 0-6.25%. If a CPU's maximum power consumption using 'real' programs will be 150W, someone could test to, say, 175W, then, in the BIOS, limit CPU power consumption to 160W (where voltage is not reduced, but clock rate is).

The usual issues are still present, just to a lesser extent. I haven't solved them.

TL;DR
Prime95 isn't the focus (though I did use it and think it's good in general). The focus is limiting power consumption and temperature during system error testing - by using clock modulation to artificially reduce CPU clocks - to what can be cooled and to the temperature the CPU will reach during normal operation.
 
Discussion starter · #5 ·
Just run y-cruncher's Component Stress Test and see if you can even pass one loop.
If you can, you're pretty much 99.9% stable. Since that test rapes the PC harder than any other test I've tried to date.
There's a bit of a problem with using that though. Say someone's maximum CPU power consumption and temperature during use is 120W and 72 degrees, and y-cruncher causes 170W to be used, also raising the temperature to 95C.

Just that jump in temperature would need ~0.03-0.04V added to Vcore, maybe more. Then core temperatures would be hitting 100, needing another 0.01V. Then the CPU would be throttling. That's ~0.05V too high. Basically everything I described as wrong with using Prime95 is wrong with using y-cruncher.
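As a back-of-envelope check of that arithmetic - the ~1.5mV-per-degree coefficient below is my assumption for illustration; the post only gives the 0.03-0.04V range directly:

```python
# Back-of-envelope version of the overshoot arithmetic above.
# The 1.5 mV/degree coefficient is an assumed figure for illustration;
# the post itself only states the 0.03-0.04 V range.

real_temp_c = 72        # hottest realistic workload in this example
ycruncher_temp_c = 95   # temperature reached under y-cruncher
mv_per_deg = 1.5        # assumed extra Vcore needed per degree, in mV

temp_comp_v = (ycruncher_temp_c - real_temp_c) * mv_per_deg / 1000
throttle_comp_v = 0.010  # extra bump once cores start brushing 100C

overshoot_v = temp_comp_v + throttle_comp_v
print(round(overshoot_v, 4))  # ~0.045 V of Vcore the real workload never needed
```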
 
Prime95 does not exercise all parts of the CPU, but to my knowledge, no stress test does.
Which is why any good test regimen will have many tests, and many combinations of tests.
 
Discussion starter · #7 ·
Which is why any good test regimen will have many tests, and many combinations of tests.
"This method is very good. Very good. After it, I've never been able to generate errors with other stability testing programs, and the voltages aren't excessive - they're only about 0.015-0.02V over [what's required by other error checking programs]"

"Very good." (not perfect).

And it's "for people who don't run things that take more power than Prime95 [small FFT], and don't have the cooling for it [to run on all cores full tilt] either"

"Prime95 isn't the focus (though I did use it and think it's good in general), limiting power consumption and temperature during system error testing, to [what it] will be during normal operation, by using clock modulation" is [the focus].

Prime95 can be substituted with any program you'd like (anything that, when run on all cores, you can't cool your CPU down to its normal maximum temperature - the maximum when using real programs).

The same error checking will be done at the same frequency and temperature that your processor will be running at during normal operation.

So excessive voltage isn't added to compensate for unrealistic temperatures during testing (which have to be compensated for again when those voltages raise temperatures even further).

I'm sure that 95+% of people who run AVX offsets on their CPUs don't need to. If, very occasionally, they're somehow running almost 100% AVX on all cores, a CPU power limit that cuts frequency and not voltage will protect the CPU. If you're comfortable using the thermal limit, that'd work too.


Before this (IMO great) idea, I'd find the two weakest cores and run P95 on only them, aiming for those cores to reach the approximate temperature the CPU would reach while running the most torturous real program in its repertoire. Temperatures usually worked out by running the CPU fan at 100% - at all but the highest frequencies and voltages (5.2GHz/1.4V). Way up there, the temperature difference between cores becomes too great, and the heatsink isn't cold enough to keep the tested cores below 100. I have a Noctua D15 + 9600K, and at 5200MHz/1.44V, P95 small FFT on 2 cores raises their temperatures to 92-93 degrees while the others are in the 50-55 range. I believe this means the heatsink is 50-55 degrees C colder than the CPU heatspreader - not great.

Anyway, now all cores can be used at once for testing by 'wasting' intermediate cycles. No more massive temperature differences! And you end up at a voltage that's very stable/not excessive. For the majority of workloads. Special cases are, as always, special cases.
 
everything I described wrong with using Prime95 is wrong with using y-cruncher
It's only "wrong" if you wish to use the computer just at the edge of stability and have it crash at the slightest addition of "unrealistic" load. Is your complaint really that if people use p95 or y-cruncher they'll make their PC "unnecessarily too stable"?

Just run y-cruncher's Component Stress Test and see if you can even pass one loop.
If you can, you're pretty much 99.9% stable.
Well, with y-cruncher I got one WHEA error after 8.5h ("CPU cache L0 error") (y-cruncher itself didn't report any error), and then it continued running for another 7 hours without problems. I then ran P95 small FFTs and got a crash in 1.5h.
But maybe I was just testing different V/f points with P95 and ycruncher and the V/f point ycruncher was running at was more stable than the V/f point P95 was running at?
 
well with ycruncher I got one WHEA error after 8.5h ("cpu cache L0 error") [...] maybe I was just testing different V/f points with P95 and ycruncher and the V/f point ycruncher was running at was more stable than the V/f point P95 was running at?
That error means too low Vcore. Small FFTs hammers the chip while y-cruncher does a mixed test. I'll baton pass this to @Falkentyne who can explain better.
 
Prime95 and y-cruncher are hot garbage unless you use them for mixed memory testing (VST, HNT on older (Ring I think), VT3).
While Prime95 and y-cruncher used to be good at weeding out weak cores on older CPUs (like the 10900K), I find that on modern chips, unless an e-core fails, you just straight up BSOD. If you get a Parity or TLB error it means you got lucky. L0 cache errors don't tell you what core failed at all.

Use Stockfish.

You can pass prime95 and SFT all day long and still fail Stockfish, because Stockfish actually uses the ENTIRE system (Hell, if you're running Stockfish in chessbase, even the storage gets hit when it tries to cache stuff).

L0 means you're unstable. Parity error means you're unstable. Translation Lookaside buffer error means you're unstable. Many chips go south FAST once you pass 89C core temps.

BTW, for Parity and Translation Lookaside Buffer errors, you can find which core failed by dumping your HTML report with CPU-Z, looking up the APIC ID of the thread, and then seeing which APIC ID appears in the Windows Event Viewer entry.
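Roughly, that lookup amounts to joining the WHEA entry's APIC ID against CPU-Z's thread table. The table below is a made-up stand-in for what you'd transcribe from a CPU-Z HTML report - the real APIC ID layout varies by CPU:

```python
# Sketch of the APIC-ID lookup described above. The dict is a made-up
# stand-in for the thread table you'd transcribe from a CPU-Z HTML
# report; the ID itself comes from the WHEA entry in Event Viewer.

# Hypothetical layout: an 8-core chip with hyperthreading where APIC IDs
# stride by one within a core (real chips differ - check your own report).
apic_to_core = {apic: (apic // 2, apic % 2) for apic in range(16)}

def failed_core(whea_apic_id, table=apic_to_core):
    """Map the APIC ID from a WHEA Parity/TLB error to a core number."""
    core, thread = table[whea_apic_id]
    return core

print(failed_core(11))  # -> core 5, on this assumed layout
```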
 
Use Stockfish.
Thanks for the recommendation, how do I use it? "go infinite"? Can I stop it from printing out too much of the chess information and only display the information relevant for the stress test? Will it print out any possible errors if my settings are unstable and stop the stress test? Also I probably wouldn't want it to use the SSD too much
 
Thanks for the recommendation, how do I use it? [...]
Yes, you download the Arena UCI client, load the engine exe file into Arena, set the # of threads to the max # of logical threads (do not disable hyperthreading unless you are specifically testing with HT disabled on purpose), then have it analyze any chess position infinitely. I have found that more complicated positions are more stressful, but for normal users the starting position (or maybe after a few pawn, knight, and bishop moves) is good enough.

There is no extra 'information' relevant to running Stockfish. You either BSOD, crash, generate a WHEA error logged in HWiNFO64 (sensors), or are stable. The engine doesn't let calculation errors go unnoticed - it's a chess engine, so there are precise algorithms and checksums in place. If a calculation error occurs, the engine will crash.

As far as the "move spam" in Arena (AGAIN this is not important), I don't know how to trim the outputs, as I don't use Arena. I use Chessbase 16.
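If you'd rather skip a GUI entirely: Stockfish speaks the plain-text UCI protocol over stdin/stdout, so a few lines can drive an infinite analysis. The commands below are standard UCI; a "stockfish" binary on your PATH is an assumption - point it at wherever your engine lives:

```python
# Driving Stockfish over UCI without a GUI. The commands below are
# standard UCI protocol; the engine binary name/path is an assumption.

def uci_session(threads):
    """Build the UCI command sequence for an infinite all-thread analysis."""
    return [
        "uci",                                     # protocol handshake
        f"setoption name Threads value {threads}", # one per logical CPU
        "isready",
        "position startpos",                       # or any FEN you prefer
        "go infinite",                             # analyze until "stop"
    ]

# To actually run it (assuming a stockfish binary on your PATH):
#   import subprocess
#   p = subprocess.Popen(["stockfish"], stdin=subprocess.PIPE, text=True)
#   p.stdin.write("\n".join(uci_session(threads=16)) + "\n")
#   p.stdin.flush()   # engine now loads all threads until you send "stop"

print(uci_session(16)[1])  # -> setoption name Threads value 16
```

Unstable settings show up the same way as in Arena: the engine process crashes, or WHEA errors land in the log.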
 
where do I set that? I can't find a setting like that anywhere
In Arena, it's under "Engine--UCI Options". Assuming you're using that front end.



 
A strange thing happened to me with Stockfish: I started the analysis as usual, and after some time noticed the power draw displayed in HWiNFO was very low (something like 20-70W, I think), but there was no error, no crash, no WHEA errors, and Stockfish seemed to still be running - analysis on, GUI responding normally. So something probably crashed in the background. I later confirmed with y-cruncher that there was indeed some instability with those particular settings (even though another run of Stockfish survived something like 9.5 hours without issues).
 
I didn't read every post, and I see Blameless does have some fair points, but my experience is this. I run Stable Diffusion, Blender, and video AI upscaling (where I'll be at high loads for 1-3 hours at a rip, video and CPU/mem), and I game and benchmark heavily. And this includes in Fedora Linux and Windows 10 and 11. And I do agree that these lighter-load methods being discussed here do not find you 100% of all errors... BUTTTTTTTTT

My opinion, as someone who's been doing this since I was 12 years old and my first overclock was a 486 DX2 50MHz ---> 66MHz: the only programs I find, outside of anything 'normal' I described above, that create these 1% high-end, hard-to-debug errors... ARE the damn stress/high-load programs that we use to find them. And I've used everything! And my system is never sitting stock - always heavy RAM/CPU/GPU overclocks, on water with PTM/high-end pastes and a modded deshroud and hardware mods for more power. Never have I found my way to the 99% land that exists before the nutty 1% area Blameless speaks of... and gotten an error from any of the high-intensity things I enjoy doing listed above. And if I do? It's because I pushed the hardware too far. Period.

just sayin..?
 