[Phoronix] Segmentation Faults On Zen CPUs Under Heavy Workloads

post #1 of 388
Thread Starter 
Michael finally confirms this hardware bug in Linux. Took long enough for a Linux news outlet.
Quote:
With running a number of new Ryzen Linux tests lately, a number of readers requested I take a fresh look at the reported Ryzen segmentation fault issues / bugs affecting a number of many Linux users. I did and still am able to reproduce the problem.

source: http://www.phoronix.com/scan.php?page=news_item&px=Ryzen-Test-Stress-Run

To clear up any confusion about whether this stems from overclocking, Michael also stated:
Quote:
I also tried setting the memory to its defaults at DDR4-2133, but the issue occurred still in 83 seconds.

There are other users on the AMD community thread who even tested with a stock CPU and ECC RAM. A few users reported the problem going away after they received an RMA replacement CPU. This might mean that Ryzen was binned a bit aggressively when it comes to heavy workloads. If so, does the same lottery apply to Threadripper, where the bad luck is doubled and the CPUs are intended for 24/7 heavy workloads?

Update 8/23/2017: Customers report that RMA replacements from batches manufactured on or after week 25 pass the segfault tests. If you have this issue at stock settings, RMA your CPU. It sounds like later batches were either better binned or something changed in the manufacturing process.

Google Docs spreadsheet listing RMA successes: https://docs.google.com/spreadsheets/d/1pp6SKqvERxBKJupIVTp2_FMNRYmtxgP14ZekMReQVM4/edit#gid=0

Follow this guide if you just want to run the test from a USB live image:
https://www.reddit.com/r/Amd/comments/6rwggi/ryzen_build_loop_compile_failures_under_linux/

Update 8/7/2017: Some good news from AMD: their engineers are saying it's not a hardware bug, so that's good:
Quote:
AMD engineers found the problem to be very complex and characterize it as a performance marginality problem exclusive to certain workloads on Linux. The problem may also affect other Unix-like operating systems such as FreeBSD, but testing is ongoing for this complex problem and is not related to the recently talked about FreeBSD guard page issue attributed to Ryzen. AMD's testing of this issue under Windows hasn't uncovered problematic behavior.
source: http://phoronix.com/scan.php?page=news_item&px=Ryzen-Segv-Response
Edited by mouacyk - 8/23/17 at 5:32pm
post #2 of 388
For those who are interested in some nice reading, Gentoo thread contains some nice troubleshooting / reading.
post #3 of 388
Thread Starter 
Some extra notes:

How to reproduce the Ryzen segmentation faults:
http://fujii.github.io/2017/06/23/how-to-reproduce-the-segmentation-faluts-on-ryzen/

  • Some users in Japan have done analysis and think it's an address-shift erratum of 64 bytes. If so, it is a hardware defect.
  • Turning certain hardware features on or off, as some people suggest, only reduces the chances of encountering the issue and may introduce other unwanted effects like lower performance or compromised security.
  • The issue was initially encountered with GCC on Linux, but was later found to also occur on the Windows Subsystem for Linux. It has also been observed on other *nixes and the BSDs.
  • Stock CPU and RAM settings, and even ECC RAM, do not eliminate the issue.
Quote:
Originally Posted by BobiBolivia View Post

For those who are interested in some nice reading, Gentoo thread contains some nice troubleshooting / reading.

More recent Gentoo thread: https://forums-lb.gentoo.org/viewtopic-t-1057910-postdays-0-postorder-asc-start-450.html?sid=9a3a45505a1756c9b96a0662ecfb7ecc
Edited by mouacyk - 8/4/17 at 11:56am
post #4 of 388

Interesting.

What's the real world impact for desktop users?

I can see they're finding it in extremely high benchmark loads and compiling with gcc, which means it could crop up elsewhere... But where?
   
   
post #5 of 388
dang, and i was just finalizing my ryzen linux workstation build
post #6 of 388
Quote:
Originally Posted by Artikbot View Post

Interesting.

What's the real world impact for desktop users?

I can see they're finding it in extremely high benchmark loads and compiling with gcc, which means it could crop up elsewhere... But where?

Limited, if you don't use your machine to compile code. Other heavy workloads (eg. prime95, etc.) haven't turned up problems. From what I understand it's most likely related to allocating/playing a lot with virtual memory, which is an uncommon pattern outside compilers. However if this bug is also in EPYC (which from what I understand is a B2 stepping?), then AMD is going to have serious problems.
post #7 of 388
From what i was reading recently, it's something related to older GCC versions used to compile bash / Kernel prior 4.12 (I'm not a Linux guy, but i was subscribed to one topic on reddit, and will try to do some testing on my system this weekend)
post #8 of 388
Thread Starter 
Quote:
Originally Posted by GreenArchon View Post

Limited, if you don't use your machine to compile code. Other heavy workloads (eg. prime95, etc.) haven't turned up problems. From what I understand it's most likely related to allocating/playing a lot with virtual memory, which is an uncommon pattern outside compilers. However if this bug is also in EPYC (which from what I understand is a B2 stepping?), then AMD is going to have serious problems.

Some of the programs written to isolate this issue actually work entirely from a RAM drive: they unpack GCC there and then compile it.
Quote:
Originally Posted by figuretti View Post

From what i was reading recently, it's something related to older GCC versions used to compile bash / Kernel prior 4.12 (I'm not a Linux guy, but i was subscribed to one topic on reddit, and will try to do some testing on my system this weekend)
If you follow the AMD Community thread, which contains all the latest user reports, the problem persists with GCC 7 and kernel 4.11/4.12, so upgrading to them is not a solution.

In the Gentoo thread, it was posted that FreeBSD patched their kernel with a workaround for the erratum:
https://svnweb.freebsd.org/base?view=revision&revision=321899
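
On the RAM-drive build loops mentioned above: here's a rough sketch of how one could script such a loop in Python and record how long it takes to hit the first failure. This is not the script from any of the linked threads. The function name `run_until_failure`, the `/dev/shm/build` directory and the `make -j16` command in the usage comment are all made up for illustration.

```python
import signal
import subprocess
import sys
import time


def run_until_failure(cmd, max_runs, cwd=None):
    """Run cmd up to max_runs times and stop at the first non-zero exit.

    Returns (run_number, seconds_elapsed, kind) for the first failure,
    where kind is "segfault" or "error", or None if every run succeeded.
    """
    start = time.monotonic()
    for run in range(1, max_runs + 1):
        result = subprocess.run(
            cmd,
            cwd=cwd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            elapsed = time.monotonic() - start
            # On POSIX, a return code of -N means the child itself was
            # killed by signal N. A compiler that crashes *inside* make
            # only shows up here as a plain non-zero exit from make.
            if result.returncode == -signal.SIGSEGV:
                kind = "segfault"
            else:
                kind = "error"
            return run, elapsed, kind
    return None


# Hypothetical usage, assuming a source tree was unpacked to a RAM drive:
#   print(run_until_failure(["make", "-j16"], max_runs=100, cwd="/dev/shm/build"))
```

Since a crashed compiler inside `make` is reported only as a failed build, you would still need to read the build output to confirm it was a segfault and not an ordinary compile error.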
post #9 of 388
Thread Starter 
Quote:
Originally Posted by naz2 View Post

dang, and i was just finalizing my ryzen linux workstation build
For the price and multi-threaded performance, it might still be worth it. I had some funds set aside for a build to be used as a compiler server, so I'm not thrilled either (looking across the fence, where the grass ain't so green either).
post #10 of 388
Can someone explain to me exactly what the issue is? All that is said in the article is "heavy workloads can cause segmentation faults". And if it is hardware related, why does it only happen in Linux and not Windows?