Recently one of my friends gave me his old R9 295x2 because it was giving him BSODs. I cleaned the card properly and redid the thermal paste/pads.
After I'd had the card for about 3 days, it crashed for the first time while loading a game.
video of crash:
After that, it kept crashing at random - under load, light load (YouTube), no load - and the frequency was random too, ranging from no crashes in 3 days to twice in 20 minutes.
Last night I found a program called memtestCL and ran it to see if maybe it was a problem with the memory of the card, and sure enough, it reported a few errors in almost every test
(the "random blocks" test always shows loads of errors on any card, so I think it's a bug)
(left card is the primary, right side is the secondary card)
(also, the test on the primary card took way longer)
I think it only crashes when Windows tries to access some of the non-working RAM. Usually all 3 of my screens freeze, even the ones plugged into my onboard graphics, but sound continues to play and, as far as I can tell, everything else keeps running as well. The fan on the card goes into a "default" state (same as in BIOS or without the drivers installed) and then switches between that and normal a few times (it's probably trying to re-initialise the drivers).
What I tried so far:
loosened the screws on the backplate so it doesn't put pressure on the ram modules - no effect
tightened the screws - no effect
dumped rubbing alcohol on the whole card and let it sit/dry for a few hours - no effect
set the windows timeout detection and recovery (TDR) to max (8) - no effect (takes longer to reboot automatically)
underclocked the ram - no effect
overclocked the ram - no effect (no idea what I was expecting)
tried the hotkey for reloading graphics drivers (start + ctrl + shift + B)
reinstalled drivers/windows/tried different slot/etc etc (basically the usual troubleshooting procedures)
I had a few ideas for workarounds (use my second card for 3-way CrossFire, switch the roles of the 2 GPUs on the 295x2, add a third (Nvidia) card and push the rendered frames to that) but I don't think any of these would do anything.
The card itself runs perfectly, and there are no visual clues of the ram failing (no artifacts). Can't RMA because I got it from a friend and he got it like ~4 years ago. I still have my original 290 so I can switch back if I need to.
The only lead I have is a forum post from a few years ago where someone had similar problems due to the VRM not providing enough power to the RAM. This seems possible, but I can't check voltage levels on the card in software (default BIOS), and to check in hardware I'd have to remove the cooling plate, which also has the VRM cooling on it. This could be the problem because the original thermal pads were in pretty bad condition when I got the card, so one of the MOSFETs on the memory VRM could be damaged.
I also got another program that reports where in the memory each error is, and the locations seem to be random (anywhere between 700 and 3500 MB), so it might also be a problem with the memory controller.
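To make "seems to be random" a bit less of a gut feeling, the error offsets can be bucketed into a histogram. Below is a rough Python sketch, not part of either test tool: the function names, the 512 MB bucket size, the 80% threshold and the sample offsets are all made up for illustration, and the 4096 MB total assumes 4 GB per GPU on the 295x2. Keep in mind that GPUs usually interleave addresses across memory channels, so a bucket probably doesn't correspond to one physical chip. The most this can show is whether the errors pile up in one address range or are spread over the whole thing.

```python
from collections import Counter


def bucket_errors(offsets_mb, bucket_mb=512, total_mb=4096):
    """Count error offsets (in MB) per fixed-size address bucket.

    total_mb=4096 assumes 4 GB of VRAM per GPU; bucket_mb is arbitrary.
    """
    counts = Counter()
    for off in offsets_mb:
        if not 0 <= off < total_mb:
            raise ValueError(f"offset {off} MB is outside 0..{total_mb} MB")
        counts[int(off // bucket_mb)] += 1
    n_buckets = -(-total_mb // bucket_mb)  # ceiling division
    return [counts.get(i, 0) for i in range(n_buckets)]


def looks_clustered(hist, threshold=0.8):
    """True if a single bucket holds at least `threshold` of all errors.

    The 0.8 cut-off is a guess, not a known diagnostic value.
    """
    total = sum(hist)
    if total == 0:
        return False
    return max(hist) / total >= threshold


# Hypothetical offsets in MB, NOT real output from the test tool.
sample = [712, 1490, 2210, 2975, 3480, 905, 1801]
hist = bucket_errors(sample)
for i, n in enumerate(hist):
    print(f"{i * 512:4d}-{(i + 1) * 512 - 1:4d} MB: {'#' * n} ({n})")
print("clustered:", looks_clustered(hist))
```

If the real offsets from several runs all landed in one or two buckets, that would point more towards a specific bad region. An even spread like the made-up sample fits better with something upstream of the chips, such as the memory controller or the power delivery.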
Any ideas on what I should do? The card works perfectly fine otherwise, so it would be a shame to just put it in a box and leave it.
This is my current PC configuration, but I think this is an issue with the card itself so it's not that relevant
Asus H97 pro gamer mobo with an i5 4690 (non K)
16 GB of 1600 MHz RAM, a 240 GB Samsung SSD and 2 HDDs for data
an EVGA 1000GQ PSU (eco mode is off atm)
MSI r9 295x2 (and currently an rx 460 so I can write this without my PC rebooting)
some random case I forgot the name of and a bunch of decent cooling fans for airflow