Yesterday I built my new PC which consists of:
Gigabyte X570 Aorus Pro (latest drivers and BIOS)
Ryzen 9 3950X
Noctua NH-D15 Chromax.black
Kingston HyperX Predator 2x16GB 3600MHz CL18
Gigabyte GTX 1080Ti Aorus Xtreme
The CPU is most definitely a BEAST. Considering that I switched from a 9-year-old i7 2600k, basically everything I usually do showed immense performance improvements, and I had no issues whatsoever.
However, today I wanted to check whether the temperatures are okay and started testing with Prime95. I started with the blend test, which ran for ~20 minutes before the PC simply reset itself. So I disabled XMP, thinking that would be the end of it, and reran blend. It ran for ~40 minutes, and just before I stopped the workers the PC reset again.
Temps were fine both times, usually at ~65°C with a couple of peaks at ~83°C, so I guess that's not the issue. I still wasn't sure what the problem was, so I wanted to make sure the CPU is fine and ran small FFTs, since that test mostly stresses the CPU and not the memory.
Aaaand boy was I wrong to assume that the RAM was faulty. Immediately after the worker threads start, the PC just resets. And it happens every single time I run small FFTs. It doesn't even do anything, just starts the workers and bails. No BSODs, no freezing, nothing. A quick reset and I'm back at my Windows desktop.
After each of these resets, Event Viewer logs contain these errors:
A fatal hardware error has occurred.
Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Cache Hierarchy Error
Processor APIC ID: 21
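To see whether the error always lands on the same APIC ID, the event text can be tallied with a quick script. This is only a rough sketch: it assumes the event details are copied out of Event Viewer by hand and pasted in as plain text in the format shown above, and the name tally_apic_ids is made up for illustration.

```python
import re
from collections import Counter

# Matches the "Processor APIC ID: N" line as it appears in the event text above.
APIC_RE = re.compile(r"Processor APIC ID:\s*(\d+)")


def tally_apic_ids(event_text: str) -> Counter:
    """Count how often each APIC ID appears in pasted event text (hypothetical helper)."""
    return Counter(int(apic_id) for apic_id in APIC_RE.findall(event_text))


# Sample input: the single event quoted in this post.
sample = """\
A fatal hardware error has occurred.
Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Cache Hierarchy Error
Processor APIC ID: 21
"""

if __name__ == "__main__":
    for apic_id, count in tally_apic_ids(sample).most_common():
        print(f"APIC ID {apic_id}: {count} event(s)")
```

If every reset reports the same APIC ID, that would point at one specific core; if the IDs are spread out, it looks more like a voltage or power delivery problem affecting the whole chip.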
I found a couple of similar issues, and an identical one on the AMD community forums which was resolved by getting a new CPU. A workaround was also mentioned: switching LLC (Load-Line Calibration) from Auto to Medium. So I tried that as well. With LLC at Low or anything above it, small FFTs run just fine.
I ran it with LLC at Low for more than an hour and didn't have any issues. All cores were at ~3580MHz, Vcore was at ~1.02V, and the temperature was at ~60°C. Here's a pic of Ryzen Master at that time: https://i.imgur.com/cpEGc4j.png
Other threads mentioned VRM cooling and CPU overcurrent protection. VRM MOS temps were never above 47°C during these tests, and the Aorus Pro should have sufficient cooling for it.
I also tried setting the clock to 4200MHz and increasing the Vcore to 1.375V, which resulted in small FFTs starting and then a complete shutdown after 10 seconds or so. That would point to a power issue, so I could probably get around it by connecting the 4-pin CPU connector as well (the 8-pin is obviously already connected).
What do you guys think? Is this a case of the best possible time to get a faulty CPU (considering the coronavirus and everything), or do I just need to tweak the BIOS settings a bit and forget about it?