
1 - 20 of 23 Posts

·
Registered
Joined
·
89 Posts
Discussion Starter #1
EVGA 980ti rebooting from overheated VRM or Memory. Fixable?

I've used a Raijintek Morpheus cooler on my EVGA 980 Ti (06G-P4-4995-KR) for the past 3 years. GPU core temp was always 50-55C while gaming, which is pretty cool, but recently it started hard rebooting without warning. After troubleshooting for a long time, I discovered the VRM or memory is overheating.

The 980 Ti doesn't have a VRM temperature monitor, so I placed a thermal probe on top of the GPU plate, right above one of the VRM or memory chips. It idles around 25-30C, but at around 50C the system reboots, usually after 2-3 hours of gaming. I'm pretty sure the chip underneath is actually hotter than that. I've read that VRMs are often rated for 100-120C. Not sure about memory.

If I limit GPU power to 60% using MSI Afterburner, the probe temp stays around 42C and it won't reboot, so I could play all day. However, my 980 Ti's performance is reduced to that of a 970 or worse.

Has my GPU degraded beyond hope or repair? What are my options? Will replacing the thermal pad in between the plates help?
 

·
AMD OC'ing Enthusiast
Joined
·
2,671 Posts
Don't use the power limit to reduce VRM temps. That just makes the card struggle even more to stay stable, since it keeps flipping back and forth between voltages and frequencies.

Use Afterburner to reduce clocks and voltage directly.

I'd say a good -300 MHz core offset with a big chunk of voltage reduction would be better than lowering the power limit. If it is a VRM temperature issue, that will "solve" it right then and there.

This only addresses a temperature issue, though. It does not rule out a component failure.
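
A rough sketch of why dropping voltage along with clocks helps more than dropping clocks alone, assuming the usual first-order approximation that dynamic power scales with frequency times voltage squared. The clock and voltage numbers below are made up for illustration, not readings from this card, and leakage and memory power are ignored.

```python
def relative_dynamic_power(f_old_mhz, v_old, f_new_mhz, v_new):
    """Ratio of new to old dynamic power, assuming P ~ f * V^2.

    First-order approximation only: leakage and memory power are ignored.
    """
    return (f_new_mhz / f_old_mhz) * (v_new / v_old) ** 2


# Hypothetical operating points (not measured on the card in this thread).
clock_only = relative_dynamic_power(1400, 1.20, 1100, 1.20)
clock_and_voltage = relative_dynamic_power(1400, 1.20, 1100, 1.05)

print(f"clock offset only:      {clock_only:.0%} of original core power")
print(f"clock + voltage offset: {clock_and_voltage:.0%} of original core power")
```

Under these assumed numbers the voltage drop accounts for a larger share of the savings than the clock drop does, which is why a voltage reduction is worth chasing if the card allows it.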
 

·
Joined
·
2,452 Posts
Have you tried running Furmark on it yet? Let that run for a few hours, it will max out the power target, might help you crash earlier than 2-3 hours.
 

·
Registered
Joined
·
89 Posts
Discussion Starter #4
The most I can lower the core clock is -90 and the memory clock -201. The core voltage slider is already at the minimum, so it seems I can only add voltage.

By component failure, do you mean the PSU or the mobo? I've already replaced the PSU thinking it was at fault, but that didn't fix it.
 

·
Registered
Joined
·
89 Posts
Discussion Starter #5
Furmark lasts about 5 seconds if the card has already rebooted once and is still hot.

From a normal/cold state it dies after around 15-20 minutes.
 

·
9 Cans of Ravioli
Joined
·
19,126 Posts
Don't use Furmark and stop suggesting people use it. It serves zero purpose and most cards will cripple themselves to a lower P state if they detect it's running. Just run a benchmark if you want to test things out.

@Sugita2Junko: Where did you put the temp probe? You said the "VRM", but the VRM consists of chokes/coils/inductors/capacitors/MOSFETs/voltage regulators and not all of them put out that much heat or need a heatsink. The only ones that should have one are the MOSFETs.
 

·
Registered
Joined
·
89 Posts
Discussion Starter #7
http://images.hardwarecanucks.com/image//skymtl/GPU/EVGA-GTX-980-TI/EVGA-GTX-980-TI-1.PNG

It has a backplate and a front plate covering it. I just put the probe on the front plate where it felt hottest, so hot I can't leave my finger on it for more than a second. That's in between the exposed MOSFETs (the square things?) and the GPU core.
 

·
9 Cans of Ravioli
Joined
·
19,126 Posts
The bigger gray square "things" in that picture are chokes and they don't need a heatsink.

The partly melted thing in this picture is a mosfet and what you should be checking the temps of. If it doesn't have a heatsink, stop using the card until you get some on it.
 

Attachments

·
Registered
Joined
·
89 Posts
Discussion Starter #9
https://i.imgur.com/GmS9JK7.jpg

These are the areas that are too hot for me to touch. I circled them in green. I only touched the plate. The probe location is in between them, in the red circle.

https://www.fudzilla.com/images/stories/2015/Reviews/Graphics/Nvidia/Maxwell/GTX_980_Ti/evga_gtx_980_ti_sc_pcb1.jpg

Are those square chips surrounding the GPU core the VRAM?

https://qph.fs.quoracdn.net/main-qimg-e3e43542faf87ee5a7e4838524ae8118-c

The VRM was scorching hot too when I touched it. I didn't put a probe on it, so I don't know how hot.
 

·
Registered
Joined
·
3,732 Posts
The green areas look like where the VRAM is located.

What kind of fan do you have blowing down on the VRM side of the card? You might want to consider putting a more powerful shrouded fan over that location.
 

·
Registered
Joined
·
89 Posts
Discussion Starter #11 (Edited)
An NF-A12x25 running at 2000 rpm. Previously I had an NF-F12 at 1500 rpm.

I wonder if the thermal pads under the plate have degraded?
 

·
Joined
·
2,452 Posts
Furmark will max out power consumption at P0 on most modern GPUs without dropping down a state. It is the best way to test whether there is an issue with power delivery, e.g. an underpowered PSU. It's also very useful for VRAM artifact testing. The fact that the OP's card failed 5 seconds into the test on a warm boot implies something is wrong, and by reducing the core offset and VRAM offset they can eliminate those from the equation.

One caveat: it is less useful for general stability testing - synthetic demos and games are better for that.

@Sugita2Junko, those green sections are definitely the VRAM modules. Disassemble the backplate and card and make sure the thermal pads are mating with the VRAM modules. When you remove the pads, they should have rectangular indentations if they were making contact. Unstable VRAM can cause kernel panics and driver lockups on the 980 Ti, especially with Hynix memory. The fact that it occurs after a few hours means something is heating up to steady state. An unstable core clock would result in a watchdog timeout, not a hard reboot.
 

·
9 Cans of Ravioli
Joined
·
19,126 Posts
It's also very useful for VRAM artifact testing.

You can find your vRAM OC in less than a minute with Unigine Heaven, that's not a reason to use Furmark. Start the test in windowed mode at 720p/1080p, pause somewhere during the test on a scene of your choice, increase your vRAM OC until you either get artifacts or performance drops (the scene will still render @ xxx FPS while it's paused) and then back off until you don't have artifacts or performance goes back up. You don't need Furmark to OC quickly, it's a useless program that does nothing outside making a ton of heat for no reason - pretty much why it's nicknamed "power virus."
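
The step-up-and-back-off procedure above could be sketched roughly like this. It is only an illustration: `measure` is a hypothetical callback standing in for "apply the offset by hand in Afterburner, then read the FPS of the paused Heaven scene and look for artifacts", and the step size and ceiling are assumed values, not recommendations.

```python
from typing import Callable, Tuple


def find_vram_offset(
    measure: Callable[[int], Tuple[float, bool]],
    step_mhz: int = 25,          # assumed step size
    max_offset_mhz: int = 1000,  # assumed safety ceiling
) -> int:
    """Raise the memory offset until artifacts appear or FPS drops.

    `measure(offset)` is a hypothetical callback returning
    (fps_of_paused_scene, artifacts_seen). Returns the last offset
    that was clean and did not lose performance.
    """
    best_fps, _ = measure(0)  # baseline at stock memory clock
    last_good = 0
    offset = step_mhz
    while offset <= max_offset_mhz:
        fps, artifacts = measure(offset)
        if artifacts or fps < best_fps:
            break  # back off to the previous offset
        best_fps = fps
        last_good = offset
        offset += step_mhz
    return last_good


def fake_measure(offset: int) -> Tuple[float, bool]:
    """Made-up card: FPS rises until +400 MHz, then error correction eats it."""
    if offset <= 400:
        fps = 100 + offset * 0.01
    else:
        fps = 104 - (offset - 400) * 0.05
    return fps, offset > 500


print(find_vram_offset(fake_measure))
```

The FPS check matters as much as the artifact check, since performance can start falling before anything visibly breaks on screen.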
 

·
Registered
Joined
·
89 Posts
Discussion Starter #14
I can run Furmark for a long time before it reboots. But with RealBench it reboots within 15 minutes from a cold boot. Once it has rebooted, running RealBench again right away makes it reboot within a few seconds, but Furmark can last a few minutes.

I don't see any artifacts or issues while gaming. No bluescreen, kernel panic or driver lockup. It just hard reboots once something gets hot enough. Not sure if it is the VRAM, VRM, MOSFETs or what, but something is overheating. All I know is the GPU core is well below its thermal limit, usually 50-60C while gaming.

Going to see if EVGA will do an RMA on my 980 Ti. It is 2 weeks past the 3-year warranty cutoff. If not, I'll order some thermal pads and maybe stick extra aluminum heatsinks on the plate to help cool it.
 

·
Registered
Joined
·
3,732 Posts
Do you ever see any perfcaps before it reboots? I wonder if it could be your PSU and some sort of over-current protection kicking in. PSUs heat up too, and if they heat up enough that can affect their output.
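
One way to see what the card was doing right before a hard reboot is to log telemetry to disk once a second and read the last few lines afterwards. Below is a minimal sketch, assuming `nvidia-smi` is on the PATH and that the driver reports these fields for the card (some fields can come back as `[N/A]`). The file name and interval are arbitrary choices, and none of these fields is a VRM or VRAM temperature, since the card doesn't expose one.

```python
import os
import subprocess
import time

# Standard nvidia-smi --query-gpu fields; the VRM/VRAM temperature is not available.
FIELDS = "timestamp,temperature.gpu,power.draw,clocks.sm,clocks.mem"


def parse_sample(line):
    """Turn one CSV line of nvidia-smi output into a dict; '[N/A]' becomes None."""
    names = FIELDS.split(",")
    values = [v.strip() for v in line.strip().split(",")]
    sample = dict(zip(names, values))
    for key in names[1:]:  # everything except the timestamp is numeric
        try:
            sample[key] = float(sample.get(key))
        except (TypeError, ValueError):
            sample[key] = None
    return sample


def log_forever(path="gpu_log.csv", interval_s=1.0):
    """Append one sample per interval and force it to disk each time."""
    cmd = ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader,nounits"]
    with open(path, "a") as f:
        while True:
            out = subprocess.run(cmd, capture_output=True, text=True, check=True)
            f.write(out.stdout.strip() + "\n")
            f.flush()
            os.fsync(f.fileno())  # so the last line survives a hard reboot
            time.sleep(interval_s)


# To start logging, call log_forever() and leave it running while gaming.
```

After a reboot, the tail of the log shows the last core temperature, power draw and clocks the driver reported, which helps tell a power spike apart from a slow heat soak.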
 

·
Registered
Joined
·
89 Posts
Discussion Starter #16 (Edited)
I play Overwatch and cap the FPS at 162, so GPU utilization is usually 60-80%, rarely maxing out. At first I also thought it was a PSU issue. I swapped in a brand new one and it still reboots. While taking the GPU out to re-seat it I discovered it was freaking hot. Too hot to hold.

I installed a thermal probe on the front/backplate that covers the VRAM etc. and found it usually reboots when the temp reaches 50C. I tried stress testing again, but this time fanning it hard to cool it a little, and it didn't reboot. I lowered the power limit to 60% and my games stopped rebooting.
 

·
Registered
Joined
·
3,732 Posts
It sounds like you definitely need more cooling. Maybe you could try attaching an even higher powered fan or fans to the heatsink. Personally, I have a noctua NFA14-ippc3000 and a SanAce 127x38mm attached to my GPU's heatsink and both are shrouded.
 

·
Registered
Joined
·
89 Posts
Discussion Starter #18
It has been fine for 2 years with dual NF-F12s @ 1500 rpm, which is pretty strong, much stronger than the stock fans. Not sure if a GPU component has degraded and is generating more heat than before, or if something is faulty. EVGA is actually sending me an RMA replacement despite the card being just under 2 weeks out of warranty.
 

·
Registered
Joined
·
3,732 Posts
That's great. How are they sending you a replacement 980Ti though? They can't have any of those in stock anymore can they?
 

·
Registered
Joined
·
89 Posts
Discussion Starter #20
No clue what I'm getting, but people on Reddit have reported getting either a refurb 980 Ti, a 1070 Ti or a 1080. I guess it is whatever they have available and repaired.
 