Overclock.net banner

[Phoronix] Segmentation Faults On Zen CPUs Under Heavy Workloads

39K views 406 replies 82 participants last post by  kert06 
#1 ·
Michael finally confirms this hardware bug in Linux. Took long enough for a Linux news outlet.
Quote:
With running a number of new Ryzen Linux tests lately, a number of readers requested I take a fresh look at the reported Ryzen segmentation fault issues / bugs affecting a number of many Linux users. I did and still am able to reproduce the problem.
source: http://www.phoronix.com/scan.php?page=news_item&px=Ryzen-Test-Stress-Run

Just to clear the confusion around this possibly stemming from overclocking, Michael also stated:
Quote:
I also tried setting the memory to its defaults at DDR4-2133, but the issue occurred still in 83 seconds.
There are other users who even tested with stock CPU and ECC RAM on the AMD community thread. A few users reported the problem going away after they received an RMA replacement CPU. This might mean that Ryzen was binned a bit aggressively when it comes to heavy workloads. If so, does the same lottery apply to Threadripper, where the bad luck is doubled and the CPUs are intended for 24/7 heavy workloads?

Update 8/23/2017: Customers report that RMA replacements from batches manufactured on or after week 25 allowed them to pass the segfault tests. If you have this issue at stock settings, RMA your CPU. It sounds like later batches were either better binned or something changed in the manufacturing process.

Google Docs spreadsheet listing RMA successes: https://docs.google.com/spreadsheets/d/1pp6SKqvERxBKJupIVTp2_FMNRYmtxgP14ZekMReQVM4/edit#gid=0

Follow this guide if you want to boot a USB live image just to run the test:
 
#3 ·
Some extra notes:

How to reproduce the segmentation faults of Ryzen bug:
http://fujii.github.io/2017/06/23/how-to-reproduce-the-segmentation-faluts-on-ryzen/

  • Some users in Japan have analyzed the issue and think it is an address-shift erratum of 64 bytes. If so, it is a hardware defect.
  • Turning certain hardware features on or off, as some people may suggest, only reduces the chances of encountering the issue and may introduce other unwanted effects like lower performance or compromised security.
  • The issue was initially encountered with GCC on Linux, but was later found to also occur on the Windows Subsystem for Linux. It has also been observed on other *nixes and the BSDs.
  • Stock CPU and RAM settings, and even ECC RAM, do not eliminate the issue.
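The reproduction approach in the reports above boils down to running a heavy parallel build over and over until something dies with a segmentation fault. A rough sketch of that loop follows. It is not the script from the linked write-up: the helper names `looks_like_segfault` and `run_until_segfault` and their parameters are made up for illustration, and it assumes a POSIX system, where Python reports a child killed by a signal as a negative return code and shells report it as 128 plus the signal number.

```python
import signal
import subprocess


def looks_like_segfault(returncode, stderr_text=""):
    """Heuristic (assumption): decide whether a finished command hit a segfault."""
    # Direct child was killed by SIGSEGV (subprocess reports -signal number).
    if returncode == -signal.SIGSEGV:
        return True
    # Shell convention: exit status 128 + signal number (139 for SIGSEGV on Linux).
    if returncode == 128 + signal.SIGSEGV:
        return True
    # A build tool may exit with its own error code after a child crashed,
    # so fall back to scanning the error output for the usual message.
    return returncode != 0 and "segmentation fault" in stderr_text.lower()


def run_until_segfault(cmd, max_runs=10):
    """Run cmd up to max_runs times; return the 1-based run that segfaulted, else None."""
    for run in range(1, max_runs + 1):
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.PIPE,
            text=True,
            errors="replace",  # build output is not guaranteed to be valid UTF-8
        )
        if looks_like_segfault(result.returncode, result.stderr):
            return run
    return None
```

Inside an unpacked source tree you might call something like `run_until_segfault(["make", "-j16"], max_runs=20)`; the build command and job count are placeholders, and a real test would also clean the tree between runs.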
Quote:
Originally Posted by BobiBolivia View Post

For those who are interested in some nice reading, Gentoo thread contains some nice troubleshooting / reading.
More recent Gentoo thread: https://forums-lb.gentoo.org/viewtopic-t-1057910-postdays-0-postorder-asc-start-450.html?sid=9a3a45505a1756c9b96a0662ecfb7ecc
 
#6 ·
Quote:
Originally Posted by Artikbot View Post

Interesting.

What's the real world impact for desktop users?

I can see they're finding it in extremely high benchmark loads and compiling with gcc, which means it could crop up elsewhere... But where?
Limited, if you don't use your machine to compile code. Other heavy workloads (e.g. prime95) haven't turned up problems. From what I understand it's most likely related to allocating/playing a lot with virtual memory, which is an uncommon pattern outside compilers. However, if this bug is also in EPYC (which from what I understand is a B2 stepping?), then AMD is going to have serious problems.
 
#8 ·
Quote:
Originally Posted by GreenArchon View Post

Limited, if you don't use your machine to compile code. Other heavy workloads (e.g. prime95) haven't turned up problems. From what I understand it's most likely related to allocating/playing a lot with virtual memory, which is an uncommon pattern outside compilers. However, if this bug is also in EPYC (which from what I understand is a B2 stepping?), then AMD is going to have serious problems.
Some of the programs written specifically to isolate this issue actually work entirely in a RAM drive, unpacking GCC there and then compiling it.
Quote:
Originally Posted by figuretti View Post

From what i was reading recently, it's something related to older GCC versions used to compile bash / Kernel prior 4.12 (I'm not a Linux guy, but i was subscribed to one topic on reddit, and will try to do some testing on my system this weekend)
If you follow the AMD Community thread, which contains all the latest user reports, the problem persists with GCC 7 and kernel 4.11/4.12, so upgrading to them is not a solution.

In the Gentoo thread, it was posted that FreeBSD patched their kernel with a workaround for the erratum:
https://svnweb.freebsd.org/base?view=revision&revision=321899
 
#9 ·
Quote:
Originally Posted by naz2 View Post

dang, and i was just finalizing my ryzen linux workstation build
For the price and multi-threaded performance, it might still be worth it. I had some funds set aside for a build to be used as a compiler server, so I'm not thrilled either (looking across the fence, where the grass ain't so green either).
 
#11 ·
Quote:
Originally Posted by EniGma1987 View Post

Can someone explain to me exactly what the issue is? All that is said in the article is "heavy workloads can cause segmentation faults". And if it is hardware related, why does it only happen in Linux and not Windows?
IIRC a guy on 4chan's /g/ told me he ran into a similar or the same problem when doing HPC calculations on his lab's new system. He said that if two related threads were working on different CCXs, sometimes one thread would hang waiting for another thread to set data, which would never happen. The implication is that it's related to the CCXs and SMT. AFAIK there are workarounds, and it's an annoying but rare bug. It shouldn't affect consumers, though, considering Ryzen chips can clear y-cruncher for days without issue. I only skimmed the explanation, but apparently somewhere along the line a register will return a random memory location and cause a segfault.
 
#13 ·
Quote:
Originally Posted by prjindigo View Post

this is more an OS level bug not keeping the work separate than a hardware problem
I'm talking out of my backside now, but I would think what prjindigo has said would be correct. If a similar workload doesn't cause the same fault on other OSes, then how could this be a hardware fault? If it's truly hardware, it could be replicated on pretty much any OS, right?

Again...I know nothing of this, I'm sincerely asking.
 
#15 ·
Quote:
Originally Posted by geoxile View Post

IIRC a guy on 4chan's /g/ told me he ran into a similar or the same problem when doing HPC calculations on his lab's new system. He said that if two related threads were working on different CCXs, sometimes one thread would hang waiting for another thread to set data, which would never happen. The implication is that it's related to the CCXs and SMT. AFAIK there are workarounds, and it's an annoying but rare bug. It shouldn't affect consumers, though, considering Ryzen chips can clear y-cruncher for days without issue. I only skimmed the explanation, but apparently somewhere along the line a register will return a random memory location and cause a segfault.
That's interesting, thanks for sharing.

@gupsterg
you may look at this
 
#16 ·
I guess you might have missed my reply earlier http://www.overclock.net/t/1635467/wccf-amd-ryzen-threadripper-1900x-8-core-hedt-cpu-officially-confirmed-will-cost-549-us-and-feature-64-pcie-lanes/50#post_26258423

Anyway, I had several Asteroids@home AVX WUs error out (several out of hundreds) in a Linux VM on Ryzen 7 with a Windows 7 Pro host. I found other users had the same issue on Intel i7s.

The error was SEGV

Code:
<core_client_version>7.6.31</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
SIGSEGV: segmentation violation
Caveat: errors were on Linux 4.8 guest, not 4.11
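The three numbers in that message are the same exit status written three ways: decimal, hex, and a signed 8-bit value. A quick sketch of the arithmetic is below; the helper name `as_signed_byte` is made up for illustration, and reading the status as a signed byte is an assumption based on the `-63` shown in the log.

```python
def as_signed_byte(value: int) -> int:
    """Reinterpret the low 8 bits of value as a signed (two's complement) byte."""
    low = value & 0xFF
    return low - 256 if low >= 128 else low


code = 193
print(hex(code))             # 0xc1
print(as_signed_byte(code))  # -63
```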

All my skynet POGs and Asteroids@home sse2/sse3 WUs validated.

Also given the amount of memory issues people have running XMP (i.e. Intel spec) timings, it might be a timing issue on certain motherboards? DDR4 2133 doesn't mean the timings are correct.

TL;DR: It's not truly a Ryzen bug if stock-clocked Intel CPUs error out on the code. This needs to be confirmed.

Edit: I also noticed that most of the motherboards erroring out are ones with cheap VRMs, like the ASUS B350 Plus.
 
#17 ·
Quote:
Originally Posted by LancerVI View Post

I'm talking out of my backside now, but I would think what prjindigo has said would be correct. If a similar workload doesn't cause the same fault on other OSes, then how could this be a hardware fault? If it's truly hardware, it could be replicated on pretty much any OS, right?

Again...I know nothing of this, I'm sincerely asking.
It probably is a hardware fault, but Windows probably manages things differently than Linux, so the bug is never encountered. In all likelihood, it isn't a serious issue and some Linux/GCC patches will fix the problem with little or no performance penalty.
 
#19 ·
Quote:
Originally Posted by AlphaC View Post

I guess you might have missed my reply earlier http://www.overclock.net/t/1635467/wccf-amd-ryzen-threadripper-1900x-8-core-hedt-cpu-officially-confirmed-will-cost-549-us-and-feature-64-pcie-lanes/50#post_26258423

Anyway, I had several Asteroids@home AVX WUs error out (several out of hundreds) in a Linux VM on Ryzen 7 with a Windows 7 Pro host. I found other users had the same issue on Intel i7s.

The error was SEGV

Code:
<core_client_version>7.6.31</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
SIGSEGV: segmentation violation
Caveat: errors were on Linux 4.8 guest, not 4.11

All my skynet POGs and Asteroids@home sse2/sse3 WUs validated.

Also given the amount of memory issues people have running XMP (i.e. Intel spec) timings, it might be a timing issue on certain motherboards? DDR4 2133 doesn't mean the timings are correct.

TL;DR: It's not truly a Ryzen bug if stock-clocked Intel CPUs error out on the code. This needs to be confirmed.
That's weird that you mention that.

I was passing GSAT for 1 hour but failing HCI when testing my RAM overclocks and timings. HCI passed once I moved my timings up, but GSAT never showed me any errors. HCI would catch the error very quickly; it wouldn't even make it to 100% before it crapped out.

I also managed to run 16 instances of HCI at the same time. 14 of them ran non-stop with no memory errors, but 2 of them kept throwing memory errors. I used those 2 for testing, raising and lowering voltages on the fly with the MSI command app and retesting the same memory blocks while the other 14 threads were eating the rest. I kept getting memory errors on those same 2 threads.

This was with no page file, to make sure the app wasn't testing the page file instead. I literally maxed out the memory; I had less than 50 MB free.

Then I finally gave up on the GSAT-stable 14/14/14 and went to 15/15/15: no HCI errors.
 
#21 ·
There is a hyperthreading bug in Skylake and Kaby Lake processors that causes crashes and potential corruption/data loss. Intel fixed the microcode in April and distributed it to board makers. HOWEVER...

AsRock, for instance, hasn't bothered to update the BIOS of the "Fatality" ITX Skylake board I got for a colleague's VR machine. Some in this forum have tossed around a variety of fallacies to attempt to defend this unacceptably low level of support. Apparently, BIOS patches to fix severe bugs are no longer a requirement for basic support in the eyes of many here. Just throw your Skylake rig into the trash and buy whatever a company released in the last week or two.

Like this Ryzen bug, the Skylake/Kaby Lake bug appears under heavy loads.

There are reportedly kludges to work around the bug in Linux and Windows, but proper BIOS patches are necessary for a basic level of support. People should not have to be locked out of other operating systems just because vendors like AsRock don't understand what proper support is.

Here's a list of the fallacies I remember:

1) I'm immature in the way I'm handling the problem. (Blame me, not AsRock and/or other board makers' lack of adequate support.)

2) I'm ineffective in the way I'm handling the problem. (Blame me, not AsRock and/or other board makers' lack of adequate support.)

3) The bug can't be discussed in AMD-related subforums because it's not relevant. (It was brought up in the context of someone considering buying an AsRock board, a topic that, obviously, involves the issue of the brand's support quality.)

4) It's only one board so it doesn't matter. (Wrong and wrong - 1: It's not just one board. 2: It still matters if it's one board.)

5) It's adequate policy to only update the very latest boards, maybe not ever patching the rest. (Wrong. This person tried to compare Skylake with Sandy Bridge, in 2017, too.)

6) It doesn't interest me because I am only interested in AMD stuff. (As if this guy is the entire forum, or is the only person who reads and participates in that topic.)

7) Other board makers do the same thing. (Good tu quoque-pot/kettle fallacy.)

8) The patch might trickle down someday, after the company has updated the higher-end boards. (The patch was released by Intel to board makers in April. Not good enough.)

9) Using word-of-mouth is wrong. (Yeah, right. Reputation via word-of-mouth is huge in business. Always has been. Always will be.)

10) Complaining about inadequate support and warning others about it is "slandering" a business. (Blame me. It's my fault the BIOS hasn't been patched.)

11) If you annoy me I'll leave. (Literally tried to threaten to quit the forum because I criticized AsRock.)

12) "We" aren't interested. (Gotta love the royal we in an open Internet forum.)

I probably forgot some of them.

So, it's interesting to hear about a potential hardware bug in Zen that may require a BIOS patch. I wonder which fallacies are going to be tossed around this time?
 
#24 ·
Quote:
Originally Posted by zGunBLADEz View Post

That's interesting, thanks for sharing.

@gupsterg
you may look at this
Remembered something. It was either the same workload or another workload that caused errors. The guy said Ryzen is recognized as a NUMA device, and if one thread on a complex sends an inter-processor interrupt to a thread working on another complex, it'll stall forever. Or something like that. I'm not familiar with HPC.
 
#25 ·
Quote:
Originally Posted by stahlhart View Post

"Moral equivalence" is one in particular that stands out.
Corporations have no morals. They exist to get as much money as possible for as little product/service as possible. However, consumers do have morals. So, there is an inherent conflict between the amorality of the corporation and the morality of the consumer.
 