GPU possible dying?

post #1 of 7
Thread Starter 
Ok, so over the past couple of days one of my 460s has been constantly failing work units. Sometimes it fails within the first couple of steps, and other times it makes it to 20-30% before failing. I am including only one log file here because they always give the same error:
[22:26:31] + Processing work unit
[22:26:31] Core required: FahCore_15.exe
[22:26:31] Core found.
[22:26:31] Working on queue slot 00 [April 23 22:26:31 UTC]
[22:26:31] + Working ...
[22:26:31]
[22:26:31] *------------------------------*
[22:26:31] Folding@Home GPU Core
[22:26:31] Version 2.22 (Thu Dec 8 17:08:05 PST 2011)
[22:26:31] Build host SimbiosNvdWin7
[22:26:31] Board Type NVIDIA/CUDA
[22:26:31] Core 15
[22:26:31]
[22:26:31] Window's signal control handler registered.
[22:26:31] Preparing to commence simulation
[22:26:31] - Looking at optimizations...
[22:26:31] DeleteFrameFiles: successfully deleted file=work/wudata_00.ckp
[22:26:31] - Created dyn
[22:26:31] - Files status OK
[22:26:31] sizeof(CORE_PACKET_HDR) = 512 file=<>
[22:26:31] - Expanded 550173 -> 983760 (decompressed 178.8 percent)
[22:26:31] Called DecompressByteArray: compressed_data_size=550173 data_size=983760, decompressed_data_size=983760 diff=0
[22:26:31] - Digital signature verified
[22:26:31]
[22:26:31] Project: 7644 (Run 14, Clone 0, Gen 31)
[22:26:31]
[22:26:31] Assembly optimizations on if available.
[22:26:31] Entering M.D.
[22:26:33] Tpr hash work/wudata_00.tpr: 3665839461 730685157 1054519910 1163210139 3366301962
[22:26:33] GPU device info: vendor=0 device=0 name= match=0
[22:26:34] Working on Protein in water
[22:26:34] Client config found, loading data.
[22:26:34] Starting GUI Server
[22:28:02] Setting checkpoint frequency: 25000
[22:28:02] Completed 3 out of 2500000 steps (0%).
[22:40:16] Completed 25000 out of 2500000 steps (1%).
[22:52:32] Completed 50000 out of 2500000 steps (2%).
[23:05:20] Completed 75000 out of 2500000 steps (3%).
[23:17:29] Completed 100000 out of 2500000 steps (4%).
[23:29:30] Completed 125000 out of 2500000 steps (5%).
[23:41:36] Completed 150000 out of 2500000 steps (6%).
[23:54:20] Completed 175000 out of 2500000 steps (7%).
[00:06:51] Completed 200000 out of 2500000 steps (8%).
[00:19:32] Completed 225000 out of 2500000 steps (9%).
[00:32:25] Completed 250000 out of 2500000 steps (10%).
[00:44:40] Completed 275000 out of 2500000 steps (11%).
[00:57:09] Completed 300000 out of 2500000 steps (12%).
[01:10:07] Completed 325000 out of 2500000 steps (13%).
[01:23:06] Completed 350000 out of 2500000 steps (14%).
[02:06:25] Completed 375000 out of 2500000 steps (15%).
[02:06:25] mdrun_gpu returned 52
[02:06:25] NANs detected on GPU
[02:06:25]
[02:06:25] Folding@home Core Shutdown: UNSTABLE_MACHINE
[02:06:27] CoreStatus = 7A (122)
[02:06:27] Sending work to server
[02:06:27] Project: 7644 (Run 14, Clone 0, Gen 31)
[02:06:27] - Read packet limit of 540015616... Set to 524286976.
[02:06:27] - Error: Could not get length of results file work/wuresults_00.dat
[02:06:27] - Error: Could not read unit 00 file. Removing from queue.
[02:06:27] - Preparing to get new work unit...
[02:06:27] Cleaning up work directory
[02:06:27] + Attempting to get work packet
[02:06:27] Passkey found
[02:06:27] Gpu type=3 species=20.
[02:06:27] - Connecting to assignment server
[02:06:28] - Successful: assigned to (171.64.65.93).
[02:06:28] + News From Folding@Home: Welcome to Folding@Home
[02:06:28] Loaded queue successfully.
[02:06:28] Gpu type=3 species=20.
[02:06:30] + Closed connections

The error message is not all that helpful; it points to possibly switching user sessions or the monitor falling asleep. I don't think either is really the issue.
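For anyone comparing several of these logs, here is a rough Python sketch (not part of the F@H client) that pulls out how far a unit got, the core shutdown status, and the minutes between progress lines. The function name `summarize` and the regular expressions are my own assumptions based only on the line formats visible in the log above. In that log the frames take roughly 12 to 13 minutes each, but the last one before the failure took about 43.

```python
import re
from datetime import datetime, timedelta

# Patterns are assumptions derived from the log lines quoted in this post.
STEP_RE = re.compile(r"^\[(\d\d:\d\d:\d\d)\] Completed (\d+) out of (\d+) steps")
SHUTDOWN_RE = re.compile(r"^\[(\d\d:\d\d:\d\d)\] Folding@home Core Shutdown: (\w+)")


def summarize(log_text):
    """Return (steps_done, total_steps, shutdown_status, frame_minutes)."""
    done = total = 0
    status = None
    frame_minutes = []
    prev = None
    for line in log_text.splitlines():
        m = STEP_RE.match(line)
        if m:
            t = datetime.strptime(m.group(1), "%H:%M:%S")
            if prev is not None:
                delta = t - prev
                # The log only carries a time of day, so handle midnight wrap.
                if delta < timedelta(0):
                    delta += timedelta(days=1)
                frame_minutes.append(round(delta.total_seconds() / 60, 1))
            prev = t
            done, total = int(m.group(2)), int(m.group(3))
            continue
        m = SHUTDOWN_RE.match(line)
        if m:
            status = m.group(2)
    return done, total, status, frame_minutes


# Excerpt copied from the log in this post.
SAMPLE_LOG = """\
[01:10:07] Completed 325000 out of 2500000 steps (13%).
[01:23:06] Completed 350000 out of 2500000 steps (14%).
[02:06:25] Completed 375000 out of 2500000 steps (15%).
[02:06:25] mdrun_gpu returned 52
[02:06:25] NANs detected on GPU
[02:06:25] Folding@home Core Shutdown: UNSTABLE_MACHINE
"""

if __name__ == "__main__":
    done, total, status, frames = summarize(SAMPLE_LOG)
    print(f"{done}/{total} steps, status={status}, frame minutes={frames}")
```

Running it over each failed unit's log would show whether the slow final frame is a consistent pattern or a one-off.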

I do have a jury-rigged molex splitter that I have used to power a second GPU I purchased recently. I have a 1000W PSU but don't have the modular connectors anymore. So I thought it might be a power issue, but I am not so sure, as I have returned the second card to stock voltages.

Any thoughts on this? I have now underclocked the card in question and will see in an hour or two if that helps, but somehow I doubt it, as even at stock settings it failed 5 WUs yesterday afternoon and last night.
post #2 of 7
Driver version? As I'm pretty sure you're aware, the 295 drivers have an issue with the card stopping folding when the monitor falls asleep, but I haven't seen it result in NANs, just not starting the next WU. Still, you might as well go ahead and stop Windows from turning the monitor off, just in case. If you want to rule out drivers, try the 290.53, or if you want to really rule out drivers, go back to 266.58.

Can you pass the OCCT GPU test with error checking on for ~15 minutes at stock, or any, speed? If you can't do that then it does appear to be an unstable card for some reason; whether that reason is the card outright dying or possible power issues, I don't know.

What exact 460 is it, and what voltages did you run it at and for how long? That will help guesstimate the likelihood of it dying of 'natural' causes. Either way, if it had a 3 year warranty when it was bought then it should still be under warranty, right? The 460 release wasn't that long ago.
Main Rig (16 items)
CPU: Intel i7 2700k
Motherboard: ASUS P8P67 WS Revolution
Graphics: EVGA 980 Ti SC+
RAM: Samsung 4x4GB DDR3 1866MHz
Hard Drive: Samsung 850 Evo 1TB
Hard Drive: Samsung Spinpoint F4 2TB
Optical Drive: Samsung BD Combo
Cooling: Noctua NH-D14
OS: Windows 10 64 bit
Monitor: Asus PG279Q
Power: Kingwin Lazer Platinum 1000W
Case: Silverstone Raven RV03
post #3 of 7
Thread Starter 
Ok well I am currently on the 290.53 drivers and am fairly sure that is not the issue as I have another GPU chugging away just fine. I have also been on this driver for several months. In addition I did change the monitors to stay on 24/7 earlier today but still failed a unit a couple of hours ago.


I had it stable at 850 for about 4 months, but even at the stock 780 it was failing WUs. Here is a link to the exact card: http://www.newegg.com/Product/Product.aspx?Item=N82E16814127518. I do have a several-year warranty. However, in my drunken stupidity/desire to get a higher OC I decided to remove a sticker saying "warranty void if removed". The sticker hid a switch to unlock the voltages, although I never actually flipped the switch. I looked around to see if I could buy one of these stickers but no luck so far. I may still end up doing that, as it would be a lot cheaper than a new GPU.

Right now I have the voltage at 1.03 (all I know is Afterburner is saying +30 mV), although it was stable at 1.0 for months @850 core clock. I have also overclocked the memory and Aux to +20 mV. I have been able to pass the OCCT GPU tests for 15 minutes no problem, as well as the Stanford memory tests. I will run OCCT for a couple of hours when I go to class later tonight to see if that is the issue.
post #4 of 7
I wouldn't run OCCT for "a couple of hours" because, while a GPU should be able to do that, it just seems like asking for trouble in your circumstance.

If the card is a dedicated folding card then I would remove the extra memory and aux voltage and return the memory to stock; memory speed has next to no effect on folding. Even if it's for gaming I would still leave the aux and memory voltage at stock, only OC the memory as far as it will go at stock volts for gaming, and then fold on it at stock frequency (to remove variables for errors). The reason is that I've not seen enough people find the limits of acceptable memory and aux voltages to make it worth the risk for the marginal gain they might or might not bring.

As for your specific model, it's a good card but it has no VRM heatsinks, which isn't great. If the VRMs had failed, though, the card would fail under all sorts of load, so I don't think that's the problem. It's just a warning to be conservative with extra voltage and to keep the fans spinning fairly high, because that's the only bit of cooling the VRMs have.

Not sure exactly what it could be because if you're passing more than 15 minutes of OCCT with error checking showing no errors that thing should be stable. Maybe try running at 700MHz core and every other setting on default, that should at least help you narrow down the problem a little.
post #5 of 7
Thread Starter 
Ok, so I did not run OCCT while I was at class, to prevent other nastiness like you said. Don't want to bake anything else. I just checked on my computer and it looks like it failed another WU at around 20%. I had the GPU underclocked with the core at 700 and the RAM 100 or 200 under. The reason I had added voltage to the memory was because that's all I found when googling. I really think it is a hardware failure but I am hoping it is not.


I have never really overvolted it. I might have taken it to its max at +100mV for an hour or two, +30mV for a week or two but at stock voltage for folding over the last 6 months or so.

I know that there is some dust build-up on the fans and am pretty sure that both are still spinning, but I will double check when I get back home.
post #6 of 7
If possible try it in another machine, or at least swap its power connections to some native PCIe plugs instead of adapters and see if that does anything, but I'm starting to think the GPU is failing as well.
post #7 of 7
Thread Starter 
No other rig I can test it in, as my server is in a custom case that will not fit the card. I will try taking it out of the slot and trying different power cables. Best case, I might get lucky and have the motherboard slot be what failed, so I can get that replaced under warranty. Really wish I had not removed that sticker. Maybe Amazon will have one.

Also, I don't know if this is relevant, but I have about 6 drives in my computer and I can hear one of them shut off and restart periodically. I imagine that I am pushing the cables to their limit power-wise, although I don't think this is my root cause. While the PSU is 1000W, I don't have any of the modular cables, so all of the cables that I do have attached are in use, and I have two splitters that add 8 total extra connections (for 5 drives and 1 GPU).


Too lazy to pull out one GPU to test now, but I will do that tomorrow afternoon.

Edit: so a little update. Seems my GPU picked up an 8019 WU and it is handling that just fine. It is downclocked to 700 currently but is showing full load with no problems. Hopefully it just had a problem with those WUs, but I am not all that confident in that. If it can finish 2 WUs I might try to raise the clocks back up a bit.
Edited by crystalhand - 4/25/12 at 8:09am
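To check whether the failures really cluster on one project, a small Python sketch like the one below could tally core shutdown statuses per project number from the client log. This is only an illustration under assumptions: `tally_by_project` is a hypothetical helper, the patterns are taken from the `Project:` and `Core Shutdown:` lines in the log posted earlier, and the 8019 lines in the sample (including the `FINISHED_UNIT` status and the Run/Clone/Gen values) are made up for the example rather than copied from a real log.

```python
import re
from collections import defaultdict

# Patterns assumed from the "Project:" and "Core Shutdown:" lines in the posted log.
PROJECT_RE = re.compile(r"Project: (\d+) \(Run \d+, Clone \d+, Gen \d+\)")
SHUTDOWN_RE = re.compile(r"Folding@home Core Shutdown: (\w+)")


def tally_by_project(log_text):
    """Map project number -> {shutdown status: count}."""
    counts = defaultdict(lambda: defaultdict(int))
    current = None
    for line in log_text.splitlines():
        m = PROJECT_RE.search(line)
        if m:
            # Remember which project the following shutdown belongs to.
            current = m.group(1)
            continue
        m = SHUTDOWN_RE.search(line)
        if m and current is not None:
            counts[current][m.group(1)] += 1
            current = None
    return {project: dict(statuses) for project, statuses in counts.items()}


# First unit is from the posted log; the 8019 unit is a made-up sample.
SAMPLE_LOG = """\
[22:26:31] Project: 7644 (Run 14, Clone 0, Gen 31)
[02:06:25] NANs detected on GPU
[02:06:25] Folding@home Core Shutdown: UNSTABLE_MACHINE
[02:06:27] Project: 7644 (Run 14, Clone 0, Gen 31)
[02:10:00] Project: 8019 (Run 1, Clone 2, Gen 3)
[05:00:00] Folding@home Core Shutdown: FINISHED_UNIT
"""

if __name__ == "__main__":
    for project, statuses in tally_by_project(SAMPLE_LOG).items():
        print(project, statuses)
```

If every failure lands on one project and other projects finish cleanly, that points at those WUs; failures spread across projects would point back at the card or its power.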