|Topic Review (Newest First)|
|12-17-2018 03:44 PM|
I know it's not good forum etiquette to answer your own post but here are my steps on how I managed to (finally) fix my card.
step 1: check for actual hardware defects
I found a small SMD capacitor close to the PCI-E connector that had broken off, and soldered it back on. Not sure if this had any effect.
step 2: set the card to PCI-E 2.0
This is the most important part. PCI-E 3.0 running at max bandwidth (e.g. loading textures into memory, running without an fps limit) seems to crash(?) the PLX chip for some reason, which disconnects it while running and gives you a BSOD. You shouldn't lose much performance in games (1440p) or rendering with the card running at 2.0.
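For a rough idea of what dropping from 3.0 to 2.0 costs on paper, the theoretical per-direction link bandwidth can be worked out from the published PCIe signalling rates and encodings. This is only a sketch: the x16 lane count is an assumption (the link width isn't confirmed above), and `link_bandwidth_gbs` is just an illustrative helper name.

```python
# Theoretical PCIe bandwidth per direction, from the per-lane signalling
# rate and the line encoding overhead of each generation.
GENERATIONS = {
    "2.0": (5.0, 8 / 10),     # 5 GT/s per lane, 8b/10b encoding
    "3.0": (8.0, 128 / 130),  # 8 GT/s per lane, 128b/130b encoding
}


def link_bandwidth_gbs(gen: str, lanes: int = 16) -> float:
    """Return theoretical bandwidth in GB/s (one direction) for a PCIe link.

    lanes=16 is an assumption; pass the real link width if it differs.
    """
    gt_per_s, efficiency = GENERATIONS[gen]
    # Each transfer moves one bit per lane; divide by 8 to get bytes.
    return gt_per_s * efficiency * lanes / 8


if __name__ == "__main__":
    for gen in GENERATIONS:
        print(f"PCIe {gen} x16: {link_bandwidth_gbs(gen):.2f} GB/s")
```

On paper that is roughly half the bandwidth at 2.0, which only matters while the link is actually saturated.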
Do not use MSI afterburner with the latest version of windows 10 and the AMD 2019 driver. It can cause crashes.
Get the SDK package for the PLX PEX 8747 chip that connects the 2 GPUs with the PC. You'll want to start the software called PLX GenMon and enable logging. Some of the other programs offer other useful debug functions.
Add a capacitor (15V 220uF works fine) to the fan on the water cooler. This makes the pumps run much smoother and can stop the card from thermal throttling.
You can also use clock blocker and hawaii bios editor if you want to stop throttling.
Well, that's what I did. As a test, I've been running Blender for 2 days now without a crash. Previously this would crash the card after 1-2 rendered frames (VRAM intensive renders).
|10-19-2018 01:48 AM|
Problem with R9 295x2 VRAM
Recently one of my friends gave me his old R9 295x2 because it was giving him BSODs. I cleaned the card properly and redid the thermal paste/pads.
After having the card for about 3 days it crashed for the first time while loading a game.
video of crash:
Last night I found a program called memtestCL and ran it to see if maybe it was a problem with the memory of the card, and sure enough, it got a few errors in almost every test.
(the "random blocks" test always shows loads of errors on any card, so I think it's a bug)
(left card is the primary, right side is the secondary card)
(also, the test on the primary card took way longer)
I think it only crashes when windows tries to access some of the non-working RAM. Usually all 3 of my screens freeze, even the ones plugged into my onboard graphics, but sound continues to play and as far as I can tell everything else keeps running as well. The fan on the card goes into a "default" state (same as in BIOS or without the drivers installed) and then switches between that and normal a few times (probably tries to re-initialise the drivers).
What I tried so far:
loosened the screws on the backplate so it doesn't put pressure on the ram modules - no effect
tightened the screws - no effect
dumped rubbing alcohol on the whole card and let it sit/dry for a few hours - no effect
set the windows timeout detection and recovery (TDR) to max (8) - no effect (takes longer to reboot automatically)
underclocked the ram - no effect
overclocked the ram - no effect (no idea what I was expecting)
tried the hotkey for reloading graphics drivers (start + ctrl + shift + B)
reinstalled drivers/windows/tried different slot/etc etc (basically the usual troubleshooting procedures)
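For anyone wanting to repeat the TDR change from the list above, here is a minimal sketch that writes out a `.reg` file for it. It assumes the "(8)" refers to the `TdrDelay` value (in seconds) under the GraphicsDrivers key, which is not spelled out above; `tdr_reg_file` is a made-up helper name. It only builds the text, importing it and rebooting is still a manual step.

```python
# Registry key that holds the Windows TDR (timeout detection and recovery) settings.
KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"


def tdr_reg_file(delay_seconds: int) -> str:
    """Build .reg file text that sets TdrDelay to delay_seconds.

    Assumption: the value being changed is TdrDelay (a REG_DWORD in seconds).
    """
    if not 0 <= delay_seconds <= 0xFFFFFFFF:
        raise ValueError("TdrDelay must fit in a 32-bit DWORD")
    return (
        "Windows Registry Editor Version 5.00\n"
        "\n"
        f"[{KEY}]\n"
        # .reg files store DWORDs as 8 hex digits.
        f'"TdrDelay"=dword:{delay_seconds:08x}\n'
    )


if __name__ == "__main__":
    print(tdr_reg_file(8))
```

As noted in the list, a longer delay did not stop the crashes here, it only made the automatic reboot take longer.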
I had a few ideas for workarounds (use my second card for 3 way crossfire, switch the roles of the 2 GPUs on the 295x2, add a third (nvidia) card and push the rendered frames to that) but I don't think any of these would do anything.
The card itself runs perfectly, and there are no visual clues of the ram failing (no artifacts). Can't RMA because I got it from a friend and he got it like ~4 years ago. I still have my original 290 so I can switch back if I need to.
The only lead I have is a forum post from a few years ago where someone had similar problems due to the VRM not providing enough power to the RAM - this seems possible, but I can't check voltage levels on the card in software (default bios), and for hardware I'd have to remove the cooling plate, which also has the VRM cooling on it - This could be the problem because the original thermal pads were in pretty bad condition when I got it, so one of the MOSFETs on the memory VRM could be damaged.
I also got another program that specifies where in the memory the error is, and it seems to be random (anywhere between 700 and 3500 MB), so it might also be a problem with the memory controller.
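One way to check whether the error locations really are random rather than clustered is to bucket the reported offsets and count them. The sketch below assumes the tool reports offsets in MB; `bucket_errors` and the 256 MB bucket size are arbitrary illustrative choices, and the sample values are placeholders, not real results from this card. A linear offset also doesn't map directly onto a physical memory chip, so this only shows spread, not which chip is at fault.

```python
from collections import Counter


def bucket_errors(offsets_mb, bucket_mb=256):
    """Count errors per bucket_mb-sized slice of VRAM.

    Returns a dict mapping each bucket's start offset (MB) to its error count,
    sorted by offset.
    """
    counts = Counter((int(off) // bucket_mb) * bucket_mb for off in offsets_mb)
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Placeholder offsets for illustration only (NOT measured data).
    example = [700, 720, 1500, 3499]
    for start, n in bucket_errors(example).items():
        print(f"{start:5d}-{start + 255:5d} MB: {n} error(s)")
```

If the counts pile up in one or two buckets that would hint at a localized fault, while an even spread fits better with the memory controller or power delivery guess.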
Any ideas on what I should do? The card works perfectly fine otherwise, so it would be a shame to just put it in a box and leave it.
This is my current PC configuration, but I think this is an issue with the card itself so it's not that relevant
Asus H97 pro gamer mobo with an i5 4690 (non K)
16 gigabytes of 1600 mhz RAM, a 240 GB samsung SSD and 2 HDDs for data
an EVGA 1000GQ PSU (eco mode is off atm)
MSI r9 295x2 (and currently an rx 460 so I can write this without my PC rebooting)
some random case I forgot the name of and a bunch of decent cooling fans for airflow