Overclock.net banner
1 - 20 of 50 Posts

jpz

· Registered
Joined
·
1,337 Posts
Discussion starter · #1 ·
Warning: this is not a simple problem. There is a lot of information posted here and I ask that you please do not offer advice or speculate as to what the problem might be unless you have read all of my posts.

I have been troubleshooting for the past week and what I am seeing continues to make less and less sense. I normally do not post threads asking for help; I am doing this as a last resort. My hope is that posting this thread will help me organize all my observations in one place, and that perhaps one of you has seen something like this in the past.

First, some history:
The machine in question is BlackBox, my sig rig. It is running a Q6600 in a Gigabyte X38-DS4 motherboard with 2x2GB DDR2 1066mhz Corsair Dominator RAM. I built this computer back in February '08, at which point it had 2x1GB DDR2 800mhz Corsair XMS2 RAM. I overclocked my Q6600 to 3.6ghz (9x400), and my overclock was 48 hours Prime95 stable. Everything was great until the spring of '09 when I started Project BlackBox.

Project BlackBox:
During this project I moved my sig rig from an NZXT Nemesis Elite case to my scratch built acrylic case. I also made some hardware upgrades: I swapped in 4GB of Dominator RAM and put the old XMS2 RAM in another computer, added a second ATI 3870, lapped my Q6600, swapped a Creative X-Fi Titanium in for an old Audigy SE, and swapped in an Asus SATA DVD drive for an old IDE Sony drive. I also upgraded my water loop, added more fans, and replaced all the lighting.

After finishing project BlackBox I set everything back to BIOS defaults except I set the RAM voltage to 2.1V for the Dominators. Everything was fine for a few weeks. When I finally had time, I overclocked my machine back to 3.6ghz (verified stable by another 48 hours of Prime95). I used BlackBox like this for about a week, during which I rebuilt my entire Gentoo installation (after 6 months just about every piece of software was out of date so I ended up recompiling everything) and ran a F@H SMP client 24/7 under Gentoo.

Issues Become Present:
After a week of folding under Gentoo, I booted into Vista to play some games. Everything was fine until I ended my gaming session: after closing the game I had been playing, I tried to open a web browser and e-mail client, and Vista blue-screened. I decided I would investigate later and booted Gentoo. That night and the next day I started getting segfaults and my FAH client crashed, two things that had never happened to me in nearly a full year of folding under Gentoo 24/7. At this point I knew something was very wrong.

OK, enough story time. Here are the things I have tried and my observations in chronological order as best I can remember:
Booted Vista, ran Prime95 -> in-place fft runs fine, but blend failed instantly
Dropped multiplier to 6(6*400=2.4ghz), tried Prime95 again -> blend failed after 10-15 minutes
From here on out I run a custom test in Prime95, which is the same as blend but uses more RAM
ran memtest86 -> lots of errors detected
removed one ram stick, tried memtest86 again -> no errors after two passes(~1 hour)
swapped in second stick, tried memtest86 again -> no errors after two passes(~1 hour)
put both sticks in, tried memtest86 again -> no errors after about 4 hours
booted vista, tried Prime95 again -> failed after an hour
moved RAM sticks from the yellow slots to the red slots, ran Prime95 overnight -> no errors after 10 hours

I then booted into Gentoo and after a few hours I started seeing segfaults again.

ran memtest86 -> errors detected within minutes
booted Vista -> Prime 95 detects errors instantly

After this I flashed my BIOS to the latest version, returned to BIOS defaults, raised my RAM voltage to 2.1V. I tried testing with memtest86 and it started picking up errors after about an hour. I then set all my RAM timings manually according to the EPP profile (still running at 1066mhz). No matter how I played with the settings I could not get memtest86 to run for more than an hour without generating errors.

Next I changed the RAM divider and ran the Dominators underclocked at 800mhz. I ran memtest and no errors were picked up after ~4 hours. Then I booted Vista, ran Prime95 blend for 20 hours and no errors were detected.

After I ended the blend test I booted Gentoo. A few hours later I started getting more segfaults. I ran memtest and it found errors within seconds.

I believe that covers just about everything I tried over the past week. Now for some general observations:

I have experienced crashing programs, blue screens, and lots of errors in the Vista logs- my point is that I have experienced weird behavior in Windows too, not just Linux. During the long Prime95 tests everything is fine though; I have only experienced strange behavior in Windows either just before or just after failing Prime95 and/or memtest86.

I have gotten these memory errors at 2.4ghz and 3.6ghz, at 400mhz fsb and 266mhz fsb, at 1066 and 800mhz RAM (at all 4 combinations with 400mhz and 266 mhz fsb), and with my lights on as well as off.
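For reference, here is a rough sketch of the clock arithmetic behind those combinations. It assumes that "266" and "1066" are the usual rounded names for 266 2/3 and 1066 2/3 MHz, and that the memory clock is half the DDR2 transfer rate; the function names are made up for illustration and nothing here comes from the BIOS itself.

```python
from fractions import Fraction
from itertools import product

# Nominal clocks in MHz. Assumption: "266" and "1066" are rounded names
# for 266 2/3 and 1066 2/3 MHz.
FSB_MHZ = {"266": Fraction(800, 3), "400": Fraction(400)}
DDR2_RATE = {"800": Fraction(800), "1066": Fraction(3200, 3)}


def cpu_clock_mhz(multiplier, fsb_mhz):
    """CPU core clock is the multiplier times the FSB base clock."""
    return multiplier * fsb_mhz


def memory_to_fsb_ratio(ddr_rate, fsb_mhz):
    """DDR2 transfers twice per clock, so the memory clock is half the rate."""
    return (ddr_rate / 2) / fsb_mhz


def tested_combinations():
    """List every FSB / RAM speed pairing with its memory:FSB ratio."""
    rows = []
    for (fsb_name, fsb), (ram_name, rate) in product(FSB_MHZ.items(), DDR2_RATE.items()):
        rows.append((fsb_name, ram_name, memory_to_fsb_ratio(rate, fsb)))
    return rows


if __name__ == "__main__":
    print("9 x 400 =", cpu_clock_mhz(9, FSB_MHZ["400"]), "MHz")
    print("6 x 400 =", cpu_clock_mhz(6, FSB_MHZ["400"]), "MHz")
    for fsb_name, ram_name, ratio in tested_combinations():
        print(f"FSB {fsb_name} / DDR2-{ram_name}: memory:FSB = {ratio}")
```

Under those assumptions the four pairings work out to four different memory:FSB ratios, which suggests the errors are not tied to one particular divider.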

Something that I noticed over the past couple days is that everything seems to be fine when I first turn the computer on, and the errors only begin to occur after 24-48 hours. Once I start getting memory errors, I continue to get memory errors until I shut down the computer. Rebooting has no effect, yet powering the entire machine down, even for just a second, seems to make all the errors go away for ~24 hours.

The next thing I am going to try is more extensive testing with individual RAM sticks.
 
Discussion starter · #3 ·
Quote:

Originally Posted by nolonger View Post
To me it seems like your RAM modules are overheating, can you get a temperature reading?
I can't give you a temp reading, but they have never been more than barely warm to the touch when I've gone to swap sticks.

If the problem was heat related, the machine would have to be powered down much longer than a fraction of a second before the errors would disappear.

EDIT: The memory errors the past few days have been consistently in the 1400MB-1600MB range. It seems that this section of the memory just stops working randomly and refuses to work until the DIMMs are powered down.
 
Quote:

Originally Posted by jpz View Post
I can't give you a temp reading, but they have never been more than barely warm to the touch when I've gone to swap sticks.

If the problem was heat related, the machine would have to be powered down much longer than a fraction of a second before the errors would disappear.
Well, with CPUs, whenever you remove the load from them they instantly drop around 10-20ºC. I don't see why this can't happen with RAM modules.

But let's look at possible troubleshooting solutions.

Try to increase your RAM voltage just one bump and see if that solves your problems. If you still get errors, record if they occur faster or take longer than previously. Try this with only one module at a time.
 
Discussion starter · #5 ·
Quote:

Originally Posted by nolonger View Post
Try to increase your RAM voltage just one bump and see if that solves your problems. If you still get errors, record if they occur faster or take longer than previously. Try this with only one module at a time.
Already did this except with two modules. My board allows increments of 0.05 volts. I tried 2.1, 2.15, and 2.2, and there was no significant difference in how long it took for errors to occur.
 
Quote:

Originally Posted by jpz View Post
Already did this except with two modules. My board allows increments of 0.05 volts. I tried 2.1, 2.15, and 2.2, and there was no significant difference in how long it took for errors to occur.
Try with only one. My guess is you'll get errors faster with more volts on the faulty RAM (assuming this is a RAM issue, not motherboard).
 
Discussion starter · #7 ·
I pulled one of the sticks out about half an hour ago.

Also, a few days ago I was getting errors in the 4700-4800MB range from memtest86. Not quite sure what to make of that.

I've been leaning toward calling this a motherboard/cpu issue for a while now. I'm even willing to believe that the issue is being caused by EMI from a microwave in the room next to me.

I also forgot to mention that I installed a new wireless card during the upgrade. Come to think of it, there is a (small) chance that might be related. I had my wireless card disabled the first few weeks I was here, up until about the time I started having these weird memory issues. I turned my desktop into a wireless router so my laptops can get wireless access. I only have my wireless card enabled on gentoo because Vista hard-locks whenever I try to download files. Prior to setting up the wireless adhoc network, I had not installed drivers for my wireless card under Gentoo.

Guess I'll be playing with the wireless card too now.
 
Quote:

Originally Posted by jpz View Post
I pulled one of the sticks out about half an hour ago.

Also, a few days ago I was getting errors in the 4700-4800MB range from memtest86. Not quite sure what to make of that.

I've been leaning toward calling this a motherboard/cpu issue for a while now. I'm even willing to believe that the issue is being caused by EMI from a microwave in the room next to me.

I also forgot to mention that I installed a new wireless card during the upgrade. Come to think of it, there is a (small) chance that might be related. I had my wireless card disabled the first few weeks I was here, up until about the time I started having these weird memory issues. I turned my desktop into a wireless router so my laptops can get wireless access. I only have my wireless card enabled on gentoo because Vista hard-locks whenever I try to download files. Prior to setting up the wireless adhoc network, I had not installed drivers for my wireless card under Gentoo.

Guess I'll be playing with the wireless card too now.
I'd just pull the wireless card out in that case and retry the Memtest86 cycle all over again. You want to remove every PCI/PCI-E component you can to eliminate the most possibilities.

For testing I'd remove the wireless card and leave only one stick running Memtest86. If that fails, try the other one only. Then I'd start looking into increasing the NB Voltage.
 
Discussion starter · #9 ·
Quote:

Originally Posted by nolonger View Post
For testing I'd remove the wireless card and leave only one stick running Memtest86. If that fails, try the other one only. Then I'd start looking into increasing the NB Voltage.
I also tried playing with the NB voltage, and it didn't change anything. My board is rated for 400fsb so it should be able to handle 4gb of ram @ 1066mhz and 400fsb... yet I still get errors with 4gb ram @ 800mhz with 266mhz fsb.

I will try playing with the "new" pci cards if/when I start getting errors with a single RAM stick.
 
Quote:

Originally Posted by jpz View Post
I also tried playing with the NB voltage, and it didn't change anything. My board is rated for 400fsb so it should be able to handle 4gb of ram @ 1066mhz and 400fsb... yet I still get errors with 4gb ram @ 800mhz with 266mhz fsb.

I will try playing with the "new" pci cards if/when I start getting errors with a single RAM stick.
Alright, keep me posted.
 
You have done so many modifications at one time, it could be so many things now.
Start by simplifying to the simplest system you can. One stick of RAM, no wireless card, run everything stock.

You're looking at RAM problems, RAM slot issues, PSU issues, and possible memory controller errors.

Is it possible to get a different set of RAM to test? That will cover 3/4 of the above. If it acts the same way with different RAM, you're now onto motherboard and memory controller problems (the memory controller is in the CPU on almost all new boards).

RAM deteriorates. It will show errors, then seem fine, until it just doesn't work at all anymore no matter what voltage you use. I am sure your main problem will become more and more obvious as you continue testing.

Overheating is an obvious sign. Once I had a RAM chip not seat right and that thing got hot enough to burn the crap out of my finger. That's the easiest problem to identify.

When I upgrade, I do one thing at a time and test. You really have dug a hole here.

good luck man.
 
Discussion starter · #12 ·
I don't have access to any hardware other than what is in my sig rig. If I was back at home one of the first things I would have done would have been to swap my old RAM back in.

Last night I noticed that my northbridge heatsink (entirely passive) was very hot. The temperature diode was reading ~40C but the heatsink itself felt more like 60-70C. My northbridge heatsink has a Gigabyte plate covering half of the top of the heatsink; I wouldn't be surprised if the northbridge cooler was less effective with the motherboard in a horizontal position. I noticed that the side of the heatsink closest to the RAM (which is the covered side) was much hotter than the other side of the heatsink; the far side felt relatively cool. I also remembered that after running the extended Prime95 blend tests, I often played games for a couple hours before booting into Gentoo.

Based on this new information, I came up with a theory that something in the northbridge (probably related to the memory controller) was getting corrupted when the northbridge hit higher temperatures, and would remain corrupted until a poweroff reset. Crossfire puts quite a bit of stress on the northbridge, and the northbridge runs about 5C hotter when my graphics cards are under load than when the CPU is under load.

I decided to suspend one of my case fans over the northbridge and pop both RAM sticks in. My northbridge temp dropped 10C according to the temp diode and the heatsink felt much cooler. I turned the machine back on, ran a quick memtest to make sure no errors were detected(whenever memtest has picked up errors in the past week, it has always found errors within the first 30 seconds). Then I booted vista and ran Prime95 blend and Furmark in burning mode for a couple hours. Next I restarted my computer and ran memtest for a few minutes, which found no errors. Finally I booted Gentoo and let it sit overnight.

When I woke up this morning my computer had hardlocked. I pushed the reset button (did NOT power it off) and booted memtest. Within the first minute memtest detected over 1,500,000 errors. This time they were in the range of ~400MB to ~4200MB. I reset my computer and ran memtest again, and I saw the exact same results as the previous memtest. Next I pushed the power button and my computer turned off. As soon as all the lights turned off I pushed the power button again and my motherboard started to POST. I ran memtest a third time and it did not find a single error.

I turned off the computer, removed one of the sticks, and booted into Gentoo. I am currently waiting for a segfault or lockup.
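To keep the failing address ranges reported so far in one place, here is a small sketch that compares them against each other and against the installed 4096MB. The range values are the approximate figures from the posts above; the labels and helper names are made up for illustration. One common explanation for addresses above the installed size is the chipset remapping RAM above the 4GB mark, but that is an assumption about this board and not something verified here.

```python
INSTALLED_MB = 4096  # 2 x 2GB sticks

# Approximate failing ranges (start MB, end MB) as reported in the posts.
REPORTED_RANGES_MB = {
    "past few days": (1400, 1600),
    "a few days ago": (4700, 4800),
    "after overnight hardlock": (400, 4200),
}


def overlaps(a, b):
    """True if the two closed ranges share at least one address."""
    return a[0] <= b[1] and b[0] <= a[1]


def contains(outer, inner):
    """True if inner lies entirely within outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def beyond_installed(address_range, installed_mb=INSTALLED_MB):
    """True if the range reaches past the installed amount of RAM."""
    return address_range[1] > installed_mb


if __name__ == "__main__":
    names = list(REPORTED_RANGES_MB)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            a, b = REPORTED_RANGES_MB[first], REPORTED_RANGES_MB[second]
            print(f"{first} vs {second}: overlap={overlaps(a, b)}")
    for name, address_range in REPORTED_RANGES_MB.items():
        print(f"{name}: beyond installed RAM={beyond_installed(address_range)}")
```

The point of the comparison is only that the failing region is not fixed: the small 1400-1600MB window sits inside the much larger overnight range, while the 4700-4800MB window does not overlap either.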
 
Here is what I do to test memory sticks:

1. Reset CPU and memory timings to stock settings.
2. Download the latest Memtest86+ and create a bootable CD.
3. Test each stick individually. If a stick runs for 8 hours with no errors I go on to the next one.

Good luck!
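For anyone unsure what a memory test does during those hours, here is a toy sketch of the basic idea: write a known pattern, read it back, and report anything that changed. It is an illustration only and not a substitute for Memtest86+, because a user-space program only sees whatever memory the OS hands it and the offsets are not physical addresses. The function names and the pattern list are arbitrary choices for this example.

```python
def find_mismatches(buffer, pattern):
    """Return (offset, expected, actual) for every byte that is not the pattern."""
    return [
        (offset, pattern, value)
        for offset, value in enumerate(buffer)
        if value != pattern
    ]


def pattern_test(size_bytes, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Fill a buffer with each pattern in turn and collect any mismatches."""
    buffer = bytearray(size_bytes)
    errors = []
    for pattern in patterns:
        # Write the pattern to every byte, then read the whole buffer back.
        buffer[:] = bytes([pattern]) * size_bytes
        errors.extend(find_mismatches(buffer, pattern))
    return errors


if __name__ == "__main__":
    errors = pattern_test(16 * 1024 * 1024)  # 16MB, an arbitrary size
    print("mismatches found:", len(errors))
```

The alternating patterns 0x55 and 0xAA flip every bit between passes, which is why simple tests like this tend to include them.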
 
Discussion starter · #15 ·
I actually started a batch of compiles right before I went to bed- they finished without error before my computer locked up.

IIRC It has happened a number of times when I only booted vista. I have yet to ever receive an error while using just one stick, but I haven't used either stick long enough to determine anything.

What makes this problem so frustrating is that it takes so long to occur and I haven't found any way to reproduce it other than leaving my computer turned on for 24-48 hours.
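One way to narrow down that 24-48 hour window would be a heartbeat log: a script that writes a timestamp to disk at a fixed interval, so that after a hardlock the last line shows roughly when the machine died. The sketch below is only an assumption about how that could be done. The file name in the usage note and the 60 second interval are hypothetical, the log has to be started fresh at each power-on, and if memory is already corrupt the script itself may misbehave.

```python
import os
import time
from datetime import datetime


def append_heartbeat(path, now=None):
    """Append one ISO timestamp and force it to disk before returning."""
    stamp = (now or datetime.now()).isoformat()
    with open(path, "a") as log:
        log.write(stamp + "\n")
        log.flush()
        os.fsync(log.fileno())


def uptime_at_last_heartbeat(lines):
    """Time between the first and last heartbeat, or None for an empty log."""
    stamps = [datetime.fromisoformat(line.strip()) for line in lines if line.strip()]
    if not stamps:
        return None
    return stamps[-1] - stamps[0]


def run(path, interval_seconds=60):
    """Write a heartbeat forever; stop it with Ctrl+C."""
    while True:
        append_heartbeat(path)
        time.sleep(interval_seconds)
```

Calling `run("heartbeat.log")` right after a cold boot and, after the next lockup, passing the file's lines to `uptime_at_last_heartbeat` gives an estimate of how long the machine stayed up, accurate to about one interval.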
 
Quote:

Originally Posted by jpz View Post
I actually started a batch of compiles right before I went to bed- they finished without error before my computer locked up.

IIRC It has happened a number of times when I only booted vista. I have yet to ever receive an error while using just one stick, but I haven't used either stick long enough to determine anything.

What makes this problem so frustrating is that it takes so long to occur and I haven't found any way to reproduce it other than leaving my computer turned on for 24-48 hours.
So your computer wasn't idle, it was compiling. I'm not sure if this can cause a freeze, but you might have had a voltage overshoot when your system finished compiling. Do you have any kind of vDroop control on?
 
Discussion starter · #17 ·
I have LLC enabled in the BIOS.

The problem is that my RAM becomes completely unusable after some extended period of time. That is what causes the lockups... the real question is why can't my computer store and retrieve data from the RAM once the computer has been powered up for a while.
 
Quote:

Originally Posted by jpz View Post
I have LLC enabled in the BIOS.

The problem is that my RAM becomes completely unusable after some extended period of time. That is what causes the lockups... the real question is why can't my computer store and retrieve data from the RAM once the computer has been powered up for a while.
It might be best if you run everything at stock to test which hardware is faulty. I know errors also occur at stock, but we want to remove the biggest amount of variables. I'd also remove one of the HD3870's and run with only one card while testing.
 
Discussion starter · #19 ·
I'm still waiting to get errors running with just my first stick of RAM. I'm not going to end this test early again like I did last night.
 
I would just change the sticks for new ones!

There's no need to get a headache, I think. Just RMA or get another pair.
 