1 - 20 of 407 Posts

mouacyk

Discussion starter · #1 ·
Michael finally confirms this hardware bug in Linux. Took long enough for a Linux news outlet.
Quote:
With running a number of new Ryzen Linux tests lately, a number of readers requested I take a fresh look at the reported Ryzen segmentation fault issues / bugs affecting a number of Linux users. I did and still am able to reproduce the problem.
source: http://www.phoronix.com/scan.php?page=news_item&px=Ryzen-Test-Stress-Run

Just to clear the confusion around this possibly stemming from overclocking, Michael also stated:
Quote:
I also tried setting the memory to its defaults at DDR4-2133, but the issue occurred still in 83 seconds.
There are other users who even tested with a stock CPU and ECC RAM on the AMD community thread. A few users reported the problem going away after they received an RMA replacement CPU. This might mean that Ryzen was binned a bit too aggressively when it comes to heavy workloads. If so, does the same lottery apply to Threadripper, where the bad luck is doubled and the CPUs are intended for 24/7 heavy workloads?

Update 8/23/2017: Customers report that RMA replacements from batches manufactured on or after week 25 allowed them to pass the segfault tests. If you have this issue at stock settings, RMA your CPU. Sounds like later batches either were better binned or something changed in the manufacturing process.

Google Docs spreadsheet listing RMA successes: https://docs.google.com/spreadsheets/d/1pp6SKqvERxBKJupIVTp2_FMNRYmtxgP14ZekMReQVM4/edit#gid=0
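For what it's worth, the batch week is only printed on the heatspreader (the batch code on the IHS; one containing "1725" would indicate week 25 of 2017), so it can't be read in software. You can at least confirm which silicon you have from /proc/cpuinfo; first-generation Ryzen should report family 23 (17h). A minimal sketch, using a sample cpuinfo excerpt so it runs anywhere; on a live Linux box, feed it /proc/cpuinfo instead:

```shell
# Sample /proc/cpuinfo excerpt (illustrative values); on a real system:
#   awk -f this_script /proc/cpuinfo
sample='cpu family : 23
model : 1
model name : AMD Ryzen 7 1700X Eight-Core Processor
stepping : 1'

echo "$sample" | awk '
/^cpu family/   { fam  = $NF }   # 23 decimal = family 17h (Zen)
/^model[ \t]*:/ { mod  = $NF }   # does not match the "model name" line
/^stepping/     { step = $NF }
END { printf "family %s, model %s, stepping %s\n", fam, mod, step }'
```

This only identifies the silicon revision, not the production week, so it can't tell a pre-week-25 chip from a post-week-25 one; the spreadsheet below tracks that by batch code instead.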

Follow this guide if you want to boot a USB live image just to run the test:
 
Discussion starter · #3 ·
Some extra notes:

How to reproduce the segmentation faults of Ryzen bug:
http://fujii.github.io/2017/06/23/how-to-reproduce-the-segmentation-faluts-on-ryzen/

  • Some users in Japan have done analysis and think it's an erratum where addresses get shifted by 64 bytes. If so, it is a hardware defect.
  • Turning certain hardware features on or off, as some people suggest, only reduces the chances of encountering the issue and may introduce other unwanted effects like lower performance or compromised security.
  • The issue was initially encountered with GCC on Linux, but was later also found on the Windows Subsystem for Linux. It has also been observed on other Unixes and BSDs.
  • A stock CPU and stock RAM, even ECC RAM, do not eliminate the issue.
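For reference, the reproduction guide linked above boils down to one shape: hammer the CPU with parallel compile jobs (typically unpacking and building GCC inside a tmpfs) and flag any child that dies with SIGSEGV. A minimal sketch of that detection loop; the workload here is a stand-in that deliberately kills itself with SIGSEGV so the signal-handling logic can be shown without a multi-hour build:

```shell
#!/bin/sh
workload() {
    # Stand-in that dies with SIGSEGV; a real run would instead do
    # something like:  cd /mnt/ramdisk/gcc-build && make
    sh -c 'kill -s SEGV $$'
}

njobs=4                     # real runs use one job per logical core
i=0
while [ "$i" -lt "$njobs" ]; do
    (
        workload 2>/dev/null
        st=$?
        # A shell exit status >= 128 means "killed by signal (st - 128)";
        # SIGSEGV is signal 11, so 139 indicates a segfault.
        [ "$st" -ge 128 ] && echo "job $i: killed by signal $((st - 128))"
    ) &
    i=$((i + 1))
done
wait
```

The community scripts are essentially this loop wrapped around a full GCC build per core, repeated for hours until a job segfaults or the run times out.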
Quote:
Originally Posted by BobiBolivia View Post

For those who are interested in some nice reading, Gentoo thread contains some nice troubleshooting / reading.
More recent Gentoo thread: https://forums-lb.gentoo.org/viewtopic-t-1057910-postdays-0-postorder-asc-start-450.html?sid=9a3a45505a1756c9b96a0662ecfb7ecc
 
Interesting.

What's the real world impact for desktop users?

I can see they're finding it in extremely high benchmark loads and compiling with gcc, which means it could crop up elsewhere... But where?
 
Quote:
Originally Posted by Artikbot View Post

Interesting.

What's the real world impact for desktop users?

I can see they're finding it in extremely high benchmark loads and compiling with gcc, which means it could crop up elsewhere... But where?
Limited, if you don't use your machine to compile code. Other heavy workloads (e.g. Prime95) haven't turned up problems. From what I understand it's most likely related to heavily allocating and manipulating virtual memory, which is an uncommon pattern outside compilers. However, if this bug is also in EPYC (which from what I understand is a B2 stepping?), then AMD is going to have serious problems.
 
From what I was reading recently, it's something related to older GCC versions used to compile Bash / kernels prior to 4.12 (I'm not a Linux guy, but I was subscribed to a topic on Reddit, and will try to do some testing on my system this weekend)
 
Discussion starter · #8 ·
Quote:
Originally Posted by GreenArchon View Post

Limited, if you don't use your machine to compile code. Other heavy workloads (e.g. Prime95) haven't turned up problems. From what I understand it's most likely related to heavily allocating and manipulating virtual memory, which is an uncommon pattern outside compilers. However, if this bug is also in EPYC (which from what I understand is a B2 stepping?), then AMD is going to have serious problems.
Some of the specific programs written to isolate this issue actually work entirely in a RAM drive, unpacking GCC there and then compiling.
Quote:
Originally Posted by figuretti View Post

From what I was reading recently, it's something related to older GCC versions used to compile Bash / kernels prior to 4.12 (I'm not a Linux guy, but I was subscribed to a topic on Reddit, and will try to do some testing on my system this weekend)
If you follow the AMD Community thread, which contains all the latest user reports, the problem persists with GCC 7 and kernels 4.11/4.12, rendering them ineffective as solutions.

In the Gentoo thread, it was posted that FreeBSD patched their kernel with a workaround for the errata:
https://svnweb.freebsd.org/base?view=revision&revision=321899
 
Discussion starter · #9 ·
Quote:
Originally Posted by naz2 View Post

dang, and i was just finalizing my ryzen linux workstation build
For the price and multi-threaded performance, it might still be worth it. I had some funds set aside for a build to be used as a compiler server, so I'm not thrilled either (looking across the fence, where the grass ain't so green either).
 
Quote:
Originally Posted by EniGma1987 View Post

Can someone explain to me exactly what the issue is? All that is said in the article is "heavy workloads can cause segmentation faults". And if it is hardware related, why does it only happen in Linux and not Windows?
IIRC a guy on 4chan's /g/ told me he ran into a similar or the same problem doing HPC calculations on his lab's new system. He said if two related threads were working on different CCXs, sometimes one thread would hang waiting for the other to set data, which would never happen. The implication is that it's related to the CCXs and SMT. AFAIK there are workarounds; it's an annoying but rare bug. It shouldn't affect consumers, considering Ryzen chips can clear y-cruncher for days without issue. I only skimmed the explanation, but apparently somewhere along the line a register will return a random memory location and cause a segfault.
 
Quote:
Originally Posted by prjindigo View Post

this is more an OS level bug not keeping the work separate than a hardware problem
I'm talking out of my backside now, but I would think what prjindigo said is correct. If a similar workload doesn't cause the same fault on other OSes, then how could this be a hardware fault? If it's truly hardware, it should be reproducible on pretty much any OS, right?

Again... I know nothing of this, I'm sincerely asking.
 
Quote:
Originally Posted by geoxile View Post

IIRC a guy on 4chan's /g/ told me he ran into a similar or the same problem doing HPC calculations on his lab's new system. He said if two related threads were working on different CCXs, sometimes one thread would hang waiting for the other to set data, which would never happen. The implication is that it's related to the CCXs and SMT. AFAIK there are workarounds; it's an annoying but rare bug. It shouldn't affect consumers, considering Ryzen chips can clear y-cruncher for days without issue. I only skimmed the explanation, but apparently somewhere along the line a register will return a random memory location and cause a segfault.
That's interesting, thanks for the share.

@gupsterg
you may look at this
 
I guess you might have missed my reply earlier http://www.overclock.net/t/1635467/wccf-amd-ryzen-threadripper-1900x-8-core-hedt-cpu-officially-confirmed-will-cost-549-us-and-feature-64-pcie-lanes/50#post_26258423

Anyway, I had several Asteroids@home AVX WUs error out (several out of hundreds) in a Linux VM on a Ryzen 7 with a Windows 7 Pro host. I found other users had the same issue on Intel i7s.

The error was SEGV

Code:
<core_client_version>7.6.31</core_client_version> <![CDATA[ <message> process exited with code 193 (0xc1, -63) </message> <stderr_txt> SIGSEGV: segmentation violation
Caveat: errors were on Linux 4.8 guest, not 4.11

All my skynet POGs and Asteroids@home sse2/sse3 WUs validated.

Also, given the amount of memory issues people have running XMP (i.e. Intel-spec) timings, it might be a timing issue on certain motherboards? DDR4-2133 doesn't mean the timings are correct.

TL;DR: It's not truly a Ryzen bug if stock-clocked Intel CPUs error out on the same code. This needs to be confirmed.

edit: also, I noticed most of the erroring-out motherboards are those with cheapo VRMs, like the ASUS B350 Plus
 
Quote:
Originally Posted by LancerVI View Post

I'm talking out of my backside now, but I would think what prjindigo said is correct. If a similar workload doesn't cause the same fault on other OSes, then how could this be a hardware fault? If it's truly hardware, it should be reproducible on pretty much any OS, right?

Again... I know nothing of this, I'm sincerely asking.
It probably is a hardware fault, but Windows probably manages things differently than Linux so the bug is never encountered. In all likeliness, it isn't a serious issue and some Linux/GCC patches will fix the problem with little or no performance penalty.
 
Quote:
Originally Posted by AlphaC View Post

I guess you might have missed my reply earlier http://www.overclock.net/t/1635467/wccf-amd-ryzen-threadripper-1900x-8-core-hedt-cpu-officially-confirmed-will-cost-549-us-and-feature-64-pcie-lanes/50#post_26258423

Anyway, I had several Asteroids@home AVX WUs error out (several out of hundreds) in a Linux VM on a Ryzen 7 with a Windows 7 Pro host. I found other users had the same issue on Intel i7s.

The error was SEGV

Code:
<core_client_version>7.6.31</core_client_version> <![CDATA[ <message> process exited with code 193 (0xc1, -63) </message> <stderr_txt> SIGSEGV: segmentation violation
Caveat: errors were on Linux 4.8 guest, not 4.11

All my skynet POGs and Asteroids@home sse2/sse3 WUs validated.

Also, given the amount of memory issues people have running XMP (i.e. Intel-spec) timings, it might be a timing issue on certain motherboards? DDR4-2133 doesn't mean the timings are correct.

TL;DR: It's not truly a Ryzen bug if stock-clocked Intel CPUs error out on the same code. This needs to be confirmed.
That's weird you mention that...

When testing my RAM overclocks + timings, I was passing 1 hr of GSAT but failing HCI. HCI passed when I loosened my timings, but GSAT never showed me any errors; HCI would catch the error way too fast, it wouldn't even make it to 100% before crapping out.

I also managed to run 16 instances of HCI at the same time: 14 of them ran non-stop with no memory errors, but 2 of them kept throwing memory errors. Those 2 were the ones I was using for testing, raising and lowering voltages on the fly with the MSI command app and retesting the same memory blocks, while the other 14 threads were eating the rest of the RAM. Those same 2 threads kept getting memory errors.

This was with no page file, to make sure the app wasn't testing the page file instead. I literally maxed out the memory; I had less than 50 MB free.

Then I finally gave up on 14-14-14 GSAT stable and went to 15-15-15: no HCI errors.
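For anyone wanting to replicate this methodology: the GSAT mentioned above is Google's stressapptest, which on Linux takes the amount of memory to exercise and a run time on the command line. A sketch of a one-hour pass like the runs above; the memory size is illustrative and should leave headroom for the OS:

```shell
# Build the GSAT command line; run it for real only on the target box.
# -M is megabytes of RAM to exercise, -s is seconds to run.
mem_mb=12000                  # illustrative; size to your free RAM
secs=3600                     # one-hour pass, like the GSAT runs above
cmd="stressapptest -M $mem_mb -s $secs"
echo "$cmd"
```

HCI MemTest, by contrast, tests a fixed block per instance, which is why the post above runs 16 copies at once to cover all of RAM.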
 