Project "Headless Linux CLI Multiple GPU Boinc Server" - Ubuntu Server 12.04.4/14.04.1 64bit - Using GPU's from GeForce GT610/GT640/GTX750ti/+ to crunch data. - Page 10

post #91 of 343
Seems to be coming along nicely.
Precious
(23 items)
 
Intel 4P Rig
(16 items)
 
AMD 4P Rig
(9 items)
 
CPUCPUCPUCPU
Xeon E5-4650 (ES) C0 Xeon E5-4650 (ES) C0 Xeon E5-4650 (ES) C0 Xeon E5-4650 (ES) C0 
MotherboardGraphicsRAMHard Drive
SuperMicro X9QRi-F+ G200 on board  4GB PC3-12800R x16 OCZ Deneva  
Optical DriveCoolingOSMonitor
Samsung CD/DVD burner 4 x Cooler Master Hyper 212 Server 2012 Standard None 
PowerCaseMouse
OCZ ZX 1250 Rosewill Blackhawk Ultra None 
CPUCPUCPUCPU
Opteron 6376 Opteron 6376 Opteron 6376 Opteron 6376 
MotherboardRAMHard DriveCooling
H8QGi+-F Hynix 16 x HMT151R7BFR4C-H9 4GB 2RX4 PC3-10600R Silicon Power Slim S55 2.5" 480GB SATA III SSD ... Cooler Master Hyper 212 
OS
Ubuntu Server 14.04 
post #92 of 343
Thread Starter 
Hello Tex,


Thanks, my friend. And yes, it was a little more work than I expected when I started the test. I forgot all about the short support period of Ubuntu 12.10. That's why it's great we got it to work on 12.04, even though it's an older version.
There are a few "hiccups", but nothing really hard to solve. It just takes some time, which I'm not prepared to spend right now.

OK, it's time to change the fans and see if the new fans, which run twice as fast, will make a difference.

Thanks for checking in on me, Tex.

STATUS:

20.08.14 12:58am

I've got a few pictures I would like to show you of mounting/placing the sensors on the GPUs. It's a little harder than you might think at first.

Image 1. Case/chassis heat issues

Red: Here you can see the air flow from 3 of the 4 chassis/case fans. With the fans placed the way they are, I always guessed that GPU 4/graphics card 4 (next to the 2U PSU, at the bottom of the image) would be the hottest. As we know now, it's not this card which gets hottest. It's card 3. The little red box in the lower right corner of the image is where the 4th fan is placed. It points directly at the intake/fan 1 of the 2U PSU. That's good for the PSU, but not for the GPUs.
Green: This is the area which I thought would cause the biggest issues, because of the limited space between the PSU and GPU 4's heatsink. I'm not quite sure why this isn't so. Is it because of the materials the heatsinks are made of? (The heatsinks on the 4 graphics cards are not quite the same. The metal looks different on some of them. Maybe it's paint, I don't know.) Normally this would be a great reason for another test, but in this case we just press on with the test we are doing.
Blue: Here you can see that I didn't mount the "low profile VGA brackets". I only did that on the first card, because I use it during the first part of the installation. When the set-up/installation is done, it'll be removed as well. I leave this area open because of the hot air, to make it easier for the fans to blow the hot air out of the case.



Image 2. Showing the "missing" low profile vga brackets & air-flow.
Green & Red: As written above, the space between the 2U PSU and GPU 4/graphics card 4 was always my guess for what was causing the heat issues. I call them heat issues because the temperature got very high, close to 70 degrees Celsius inside the case, and this increased the temperature of the other stuff inside. And it burned my fingers as well. This was too hot in my opinion. I'd like this to be a stable system which will be able to run for 10 years without any trouble or need for repairs.
Blue: I keep these open so that the fans have a chance to blow out the very hot air. This case is very well ventilated, but the air flow is really wrong when using the rear side brackets etc. I like to be able to control the direction the air is moving. I don't like the hot air to come out at the front when all fans, the CPU fan included, are pointing backwards.



Image 3. Where to place the sensors.
Red: Here you can see the sensors we need to mount. In this case we mount them directly on the heatsink. It's very important that the sensor is placed in such a way that the air flow from the fans doesn't hit it directly. It's a bit difficult, because there's really not a lot of room here, but it's possible.



Image 4. Mounting the sensor & securing it to the circuit board.
Red: Here you can see where to place the sensor. Use the tape which comes along with the AeroCool X-Vision, but use an extra piece of tape to keep it in place. I use a special kind of tape, which in Denmark is called "Fe-hud" (fairy skin). It's the kind you use when placing stuff on e.g. an article before you copy it on the Xerox.
Green: I use the plastic which protected the sensor to protect the little sticker (showing the sensor number) and to protect the wire against heat. It makes mounting the wire easier as well.



Image 5 & 6. Using a strip* to mount the sensor and secure it from falling off (* is it called a strip over there??)
Red: Here you can see how I mounted the wire using a "*strip". This way it's kept in place and the sensor will not fall off when working on the case/chassis. Please notice the little hole in the top right corner. It's very nice of Asus to make such a little mounting hole in the circuit board for us.




Image 7 & 8. Placing and securing sensor on CPU.
Red: On image 7 you can see the whole CPU area where the sensor is placed. On image 8 you can see it a little better. I chose to place the sensor where I did because it's the spot closest to the CPU. It's a good place because it's not directly in line with the air flow from the fans. The sensor is secured the same way as the sensors on the GPUs, using a *strip and the plastic tube (the plastic tube is used to protect the sensor during installation). It's possible that we can find a better place for the sensor. The numbers are not exactly the same as those from the internal sensor. Of course not, but it would be nice to be a little closer to "the truth". Or maybe not!? Maybe it's a good thing to know the difference between the inside of the circuits and the temperature right next to them. I would love to hear some remarks on this from you overclockers. I know you know a great deal about heat and temperatures, so please get back to me on this.





Question!! What's the right word? Sensor is ...... placed/installed/mounted ????


....more to come

.
Edited by DanHansenDK - 8/20/14 at 9:02am
post #93 of 343
You know, a lot of folks that assemble systems merely assume things will be fine... you have really gone the extra mile! This is all very interesting to me too. I have a laser temperature gun I use to check things out and it works pretty well considering position limitations. Your temperature sensor placements seem ideal to me.

In any case, please keep us updated! Yours is an awesome setup!

Blue Beast
(13 items)
 
  
CPUMotherboardGraphicsRAM
W3670 4.0 GHz (HT On) 1.345v 24/7 Asus P6X58D Premium DUAL EVGA GTX 560 Ti SC 1G in SLI 12G Corsair Dominator GT2000MHz 
Hard DriveOptical DriveOSMonitor
Kingston V300 120G SSD, 2x 1TB Barracuda HD's LG BlueRay/LightScribe Burner Windows 7 Pro 64Bit 3x24", 1x22" LCD 
KeyboardPowerCaseMouse
Wireless Logitech K320 ULTRA X4 1200W Custom Danger Den LDR-29 Wireless Logitech M310 
Mouse Pad
Custom 
post #94 of 343
Thread Starter 
Hi Tex,


Thanks, my friend.
Quote:
I have a laser temperature gun I use to check things out and it works pretty well considering position limitations.

WHAT A F...... GREAT IDEA!!! I've got such a "gun", but I never thought about using it that way!! Thanks Tex!! This is why I prefer overclock.net. No BS, only constructive ideas and suggestions!! Thanks, my friend.


STATUS:
20.08.14 6:08pm --> Just started to dismount the standard fans! The fan I'll test first is:
The German 8 x 8 cm PAPST industrial fan
Type 8412 N/2GH.
12V DC / 235mA / 2.8W

I haven't got all the specs right now, but the RPM is more than twice as high. The depth of the fan is the same as the standard fan, so the result should be better, more or less. The "Tornado" fan is long overdue. I haven't got it yet, even though I ordered it more than 3 weeks ago or so! Anyroad, I wanted to test this fan first. It's an industrial fan used in many machines in industry and it's not so costly. Actually, I was able to make a deal with my supplier, which made them even cheaper. Well, they are not that cheap, but they are cheaper than the "Tornado" fan mentioned earlier.

BTW Tex, geographically, where in the world are you? Just curious. State & city?? Texas maybe?

On with it thumb.gif


STATUS:
21.08.14 00:49am --> The fan upgrade is pretty easy to do because of the chassis/case design. (Remember the fan sequence and notice the small labels 1 - 5. Use 2-4. Fan 1 is the CPU fan connector.) Just unplug the fan connectors from the AeroCool and then remove the combined "fan holder". Remove the fan guard/grill by unscrewing the 4 screws. After that, you release the fan by unscrewing the 4 screws on the other side of the "fan holder". Now mount the 4 new fans by doing it all in reverse.





Here's a little teaser while waiting for the new fans to be installed and tested. I'm trying to make a script which controls the fans on the graphics cards. It would be nice to succeed with "GPUFanWatchDog.sh". This is what's on my mind when "turning in". I love to think about these things when trying to fall asleep. Usually I don't get much further this way, but I do solve a lot of issues this way. Here's a "layout" of the script that is going to be "GPUFanWatchDog.sh". The other *****WatchDog.sh scripts are done, more or less. I'm not satisfied, not just yet. I want to do it another way. As it is now, it's more or less like those 2-3 standard mobo fan programs. And I need to add variables so that the script checks all GPUs. If there are 2 GPUs it'll check those 2, if there are 4 GPUs it'll check those 4. So it's a script in the making. We are going to use the nvidia-smi command again:

Problems: Need to set Coolbits in xorg.conf to "4" to enable the fan control slider under "Thermal Settings"
Code:

#!/bin/bash

# GPUFanWatchDog.sh v.0.1.0 
# Checks Nvidia GPU temperature and adjusts fan speed accordingly
# CRON job
# NOTE: layout only, handles GPU 0 / fan 0 so far

# Set update interval in seconds
interval=5

# Set min. and max. temperatures in degrees Celsius
min_threshold=45
max_threshold=60

# Set a number between 35 and 100 for fan speed. Speed is in percentage of maximum
min_speed=35
mid_speed=60
max_speed=100

# set_fan_speed SPEED: enable manual fan control on GPU 0 and set fan 0 to SPEED percent
# (attribute strings are quoted so the shell does not treat [gpu:0] as a glob pattern)
set_fan_speed() {
    nvidia-settings -a "[gpu:0]/GPUFanControlState=1" -a "[fan:0]/GPUCurrentFanSpeed=$1" > /dev/null
}

# continually loop
while true; do
    # get current temperature and fan speed of GPU 0 (-i 0), one number each
    current_temp=$(nvidia-smi -q -d TEMPERATURE -i 0 | grep Gpu | sed 's/.*\([0-9]\{2\}\).*/\1/')
    current_speed=$(nvidia-smi -q -i 0 | grep Fan | sed 's/.* \(1\?[0-9]\{2\}\) .*/\1/')

    # pick the target speed; -gt compares numbers ([[ a > b ]] compares strings)
    if [[ $current_temp -gt $max_threshold ]]; then
        # if temp greater than 60, set fan speed to maximum
        target_speed=$max_speed
    elif [[ $current_temp -gt $min_threshold ]]; then
        # if temp greater than 45, set fan speed to medium
        target_speed=$mid_speed
    else
        # if temp is 45 or below, set fan speed to minimum
        target_speed=$min_speed
    fi

    # only set the speed if it actually needs changing, as nvidia-settings eats CPU cycles
    if [[ $current_speed != "$target_speed" ]]; then
        set_fan_speed "$target_speed"
    fi

    # wait until interval expires before rechecking
    sleep "$interval"
done
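
One possible direction for the "check all GPUs" version mentioned above, as a rough sketch only. It assumes the same thresholds and speeds as the script, that "nvidia-smi -L" prints one line per GPU starting with "GPU", that the temperature line still looks like "Gpu : 58 C", and that fan N belongs to GPU N (that mapping is an assumption and not confirmed on this box). The helper names pick_speed, gpu_count, gpu_temp and set_all_fans are made up for illustration.

```shell
#!/bin/bash
# Sketch: choose a fan speed per GPU and apply it to every GPU found.
# Thresholds and speeds are copied from GPUFanWatchDog.sh.
min_threshold=45
max_threshold=60
min_speed=35
mid_speed=60
max_speed=100

# pick_speed TEMP -> prints the fan speed (percent) for that temperature
pick_speed() {
    local temp=$1
    if [ "$temp" -gt "$max_threshold" ]; then
        echo "$max_speed"
    elif [ "$temp" -gt "$min_threshold" ]; then
        echo "$mid_speed"
    else
        echo "$min_speed"
    fi
}

# gpu_count -> number of GPUs (assumes one "GPU n: ..." line per card)
gpu_count() {
    nvidia-smi -L | grep -c '^GPU'
}

# gpu_temp INDEX -> temperature of one GPU, taken from its "Gpu : NN C" line
gpu_temp() {
    nvidia-smi -q -d TEMPERATURE -i "$1" |
        awk '/Gpu/ && $3 ~ /^[0-9]+$/ { print $3; exit }'
}

# set_all_fans -> one pass over all GPUs (assumes fan N sits on GPU N)
set_all_fans() {
    local n i temp speed
    n=$(gpu_count)
    i=0
    while [ "$i" -lt "$n" ]; do
        temp=$(gpu_temp "$i")
        if [ -n "$temp" ]; then           # skip GPUs that report N/A
            speed=$(pick_speed "$temp")
            nvidia-settings -a "[gpu:$i]/GPUFanControlState=1" \
                            -a "[fan:$i]/GPUCurrentFanSpeed=$speed" > /dev/null
        fi
        i=$((i + 1))
    done
}
```

Calling set_all_fans inside the existing loop, followed by the sleep, would take the place of the single-GPU body. Keep in mind that nvidia-settings talks to a running X server, so on a headless box this probably only works once X is up and DISPLAY is set.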


Problem: the missing value under "Fan speed" when using the command "nvidia-smi -a" in e.g. a script to control the fans on the GPUs
OK, solved the "fan speed" issue as well. I didn't know this, so I had to "study" a little while. We need the value under "Fan speed" in the script. If the value is "0", the script will not work, of course. It's pretty simple and straightforward. Here's how we do it.

How to enable the "Fan speed" value in nvidia-smi:

Edit /etc/X11/xorg.conf and add this line to the "Device" section: Option "Coolbits" "4"

Command:
Code:
vi /etc/X11/xorg.conf

Add the line. It should look like this:
Code:

Command: # vi /etc/X11/xorg.conf

# nvidia-xconfig: X configuration file generated by nvidia-xconfig
# nvidia-xconfig:  version 319.37  (buildmeister@swio-display-x64-rhel04-11)  Wed Jul  3 18:14:07 PDT 2013

Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen0"
    InputDevice    "Keyboard0" "CoreKeyboard"
    InputDevice    "Mouse0" "CorePointer"
EndSection

Section "Files"
EndSection

Section "Module"
    Load           "dbe"
    Load           "extmod"
    Load           "type1"
    Load           "freetype"
    Load           "glx"
EndSection


Section "InputDevice"
    # generated from default
    Identifier     "Mouse0"
    Driver         "mouse"
    Option         "Protocol" "auto"
    Option         "Device" "/dev/psaux"
    Option         "Emulate3Buttons" "no"
    Option         "ZAxisMapping" "4 5"
EndSection

Section "InputDevice"
    # generated from default
    Identifier     "Keyboard0"
    Driver         "keyboard"
EndSection

Section "Monitor"
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "Unknown"
    HorizSync       28.0 - 33.0
    VertRefresh     43.0 - 72.0
    Option         "DPMS"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "GeForce GT 640"
    Option         "Coolbits" "4"       # <-- ADD THIS LINE TO EVERY "Device" SECTION
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection
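
With 4 cards there will be several "Device" sections, and it is easy to miss one. The following is only a sketch of a quick check, assuming the file is laid out like the listing above (keywords spelled Section / Option / EndSection exactly like that). The function name missing_coolbits is made up for this example.

```shell
#!/bin/bash
# Sketch: count the "Device" sections in an xorg.conf that have no Coolbits option.
# missing_coolbits FILE -> prints the number of Device sections without Coolbits
missing_coolbits() {
    awk '
        # entering a Device section ("InputDevice" does not match this pattern)
        /^[[:space:]]*Section[[:space:]]+"Device"/ { in_dev = 1; has = 0; next }
        # remember a Coolbits option seen inside the current Device section
        in_dev && /^[[:space:]]*Option[[:space:]]+"Coolbits"/ { has = 1 }
        # leaving the section: count it if no Coolbits line was seen
        in_dev && /^[[:space:]]*EndSection/ { if (!has) missing++; in_dev = 0 }
        END { print missing + 0 }
    ' "$1"
}
```

Usage would be: missing_coolbits /etc/X11/xorg.conf (0 means every Device section has the option). nvidia-xconfig also has a --cool-bits option that may be able to write the line for you. Check nvidia-xconfig --help on your driver version before relying on it.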


OK, a few "hiccups" here as well. I'll get back to it. Let's finish the test first. It's long overdue.
Edited by DanHansenDK - 8/22/14 at 6:41am
post #95 of 343
Thread Starter 
STATUS:
22.08.14 11:04am <
Dismounting the standard fans is done. Let's mount the new fans.

Mounting the new fans in the 2U chassis/case.

Image 1. Let's do this.
Try to fit as many of the wires as possible into the "front" of the case. I tried to keep as few wires and other stuff as possible in the "back" where the mobo is. Everything that is in the way of the air flow is a no-go. I'll get this into the complete ToDo as well.



Image 2. Mounting the new fan, but!
Red: It's now pretty straightforward. Just mount one fan after another using the screws from before. But please notice one thing.
IMPORTANT! When fitting the screws, DO NOT TIGHTEN THEM TOO MUCH!!! It's very important that you do not tighten the screws too much. The fan is made of plastic, so of course you cannot tighten them very much, or the material will break. But what's even more important, the fan will take the shape of the holder/bracket it is placed in. This means that tightening the screws too much may result in a fan that will crash or even run very hot. By this I mean so hot that the ball bearings and the plastic will melt. It's serious, but if this happens you will be warned by the shell script we are making, and by the AeroCool temp./fan control as well. So it's all good, but it's something to keep in mind anyroad.
The right way to fit the fan is to tighten the screws using the smallest possible screwdriver (less force is applied that way), up to the point where you can feel some resistance. You can also see when the screw is all the way in. Tighten it to that point and then unscrew it a little. Then tighten it again, and it will be much easier to feel the point where it shouldn't be tightened any more. When you tighten the screw the first time, the screw "cuts" itself through the plastic and makes a "thread*". Tightening it the second time is therefore much easier, and it's much easier to keep it tight without forcing it too much!
* thread (it's what my dictionary told me, so I hope it's correct)



Image 3 & 4. Mounting the fan, please notice the wires!
When you are mounting the fan, please notice the wires on the other side! The fan has a little plastic "thingy" whose job is to hold the wires in place. When all 4 screws have been tightened, the wires will be kept in place by this, but you may have to work a little to get it that way. On these two images you can see what I'm talking about. On the first image you can see what may happen when mounting the fan. Please keep this in mind when fitting the fan to the case. If you have already tightened the fan and this is what happened, unscrew all 4 screws a third of the way and then correct the problem.




Image 4. Fitting the fan grill
Mount the fan grill with the same caution as when you mounted the fan itself. It's straightforward here.



Image 5. Notice the wires
Green: Please notice the wires when finishing the fan installation. Here the wires may "pop out" of the "thingy" which holds them in place. It's important that no wire is loose and gets in the way of the spinning fan. It's always very important to keep an eye on these things. This is what causes most accidents and breakdowns: wires and connectors not fitted the right way!!



AeroCool and what it does
Now we have mounted the new fans from PAPST too. Let's begin the temperature test. The AeroCool controller reads the temperature on each of the sensors, and if one reaches a limit, set by you of course, it'll increase the fan RPM. Both the "lower" limit (the speed the fan runs at to begin with) and the "alarm limit" (the temperature where the fan starts to increase its RPM to cool the system off) are set by you. Setting these is not a must. You can choose to have no "lower limit", and you can choose to have the fan run at full speed all the time too. It's also possible to set an "alarm", which goes off when a certain temperature is reached. This is one of the reasons I chose to use a temp./fan controller. Then the fans will run at the speed you set and nothing else, no matter what "mood" your mobo is in.

NOTE! Regarding the AeroCool temp./fan controller, it's not all good. I've noticed a problem with the LCD display. The light seems to get dimmer as time passes. I'm not sure about this, but it will indeed be a thing I'll check "down the road". It may just be a glitch on the "older" AeroCool, but we need to be sure about this. I don't want any circuit boards with sh**ty soldering. Or as my old electronics teacher called it, "poo-soldering".


STATUS:
22.08.14 1:34pm <--- Testing the new fans at 100%
Let's test the new fans and compare the result to the result with the standard fans. We'll do this first test running the fans at 100% to see how much difference they can make at most. Then we'll see if the fans at e.g. 80% would be enough. Let's see what the test will reveal.

The standard fans turned at 1900-2100 RPM according to the AeroCool.
Here's the temperature result from those fans:
Code:
# sensors
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +78.0°C  (high = +80.0°C, crit = +100.0°C)
Core 0:         +78.0°C  (high = +80.0°C, crit = +100.0°C)  <------- here is one issue. Too close to the temperature where the CPU needs cooling
Core 1:         +73.0°C  (high = +80.0°C, crit = +100.0°C)
Core 2:         +69.0°C  (high = +80.0°C, crit = +100.0°C)
Core 3:         +68.0°C  (high = +80.0°C, crit = +100.0°C)

# nvidia-smi -a |grep Gpu
        Gpu                         : N/A
        Gpu                         : 58 C
        Gpu                         : N/A
        Gpu                         : 58 C
        Gpu                         : N/A
        Gpu                         : 66 C  <------- here is one issue. Too close to the temperature where the GPU needs cooling*
        Gpu                         : N/A
        Gpu                         : 57 C

# * I know that it endures 95 degrees Celsius, but it heats up the case/environment.
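
The "too close" remark on Core 0 can be turned into a small check. This is just a sketch, assuming the sensors output keeps the format shown above ("Core N: +TT.T°C (high = +HH.H°C, ..."). The function name near_high and the margin of 5 degrees are made up for the example.

```shell
#!/bin/bash
# Sketch: list cores whose temperature is within MARGIN degrees of their "high" limit.
# Reads "sensors" output on stdin.
# near_high MARGIN -> prints lines like "Core 0: 78.0 (high 80.0)"
near_high() {
    local margin=$1
    # reduce each "Core" line to: Core <n> <temp> <high>
    sed -n 's/^\(Core [0-9]*\): *+\([0-9.]*\).*high = +\([0-9.]*\).*/\1 \2 \3/p' |
        # keep only the cores that are MARGIN degrees or less below the high limit
        awk -v m="$margin" '$4 - $3 <= m { printf "%s %s: %s (high %s)\n", $1, $2, $3, $4 }'
}
```

Used as: sensors | near_high 5. With the numbers above, only Core 0 (78 of 80) would be listed.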


Edited by DanHansenDK - 8/23/14 at 12:34pm
post #96 of 343
Thread Starter 
STATUS:
22.08.14 1:58pm <---- Testing the new fans. Here are the results:

After 10 min. GPUs at 100% / fans at 3300 RPM:
Code:
# nvidia-smi -a |grep Gpu
        Gpu                         : N/A
        Gpu                         : 46 C
        Gpu                         : N/A
        Gpu                         : 46 C
        Gpu                         : N/A
        Gpu                         : 51 C
        Gpu                         : N/A
        Gpu                         : 48 C

After 20 min. GPUs at 100% / fans at 3300 RPM:
Code:
# nvidia-smi -a |grep Gpu
        Gpu                         : N/A
        Gpu                         : 46 C
        Gpu                         : N/A
        Gpu                         : 47 C
        Gpu                         : N/A
        Gpu                         : 51 C
        Gpu                         : N/A
        Gpu                         : 49 C

After 1 hour. GPUs at 100% / fans at 3300 RPM:
Code:
# sensors
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +79.0°C  (high = +80.0°C, crit = +100.0°C)
Core 0:         +79.0°C  (high = +80.0°C, crit = +100.0°C)
Core 1:         +74.0°C  (high = +80.0°C, crit = +100.0°C)
Core 2:         +70.0°C  (high = +80.0°C, crit = +100.0°C)
Core 3:         +69.0°C  (high = +80.0°C, crit = +100.0°C)

# nvidia-smi -a |grep Gpu
        Gpu                         : N/A
        Gpu                         : 46 C
        Gpu                         : N/A
        Gpu                         : 47 C
        Gpu                         : N/A
        Gpu                         : 51 C
        Gpu                         : N/A
        Gpu                         : 49 C

SUCCESS !!!
Well, I don't think "success" is too big a word to use right now.
The temperature of the hottest card has gone down 15 degrees Celsius (from 66 to 51)! So I was right, I guess. There was not enough air flow inside the case! This really saves my day.
The CPU temperature hasn't changed, but that's OK, because the CPU fan is not being controlled by the AeroCool. When I connected it I noticed that it worked, but there was no output on the AeroCool (RPM/display), so there's an issue there to be solved as well. This means that the 2U CPU cooler is being controlled by the mobo, and therefore only increases the RPM when needed. The mobo has several different set-ups for controlling chassis/CPU fans, as you might know. More about this later on. Right now let's just be thankful for the result. We may be able to build this Economic Semi-Professional Boinc Cruncher without water-cooling and save some cash.
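
For anyone who wants the drop per card and not only the biggest one, here is a small sketch. It takes the four readings with the standard fans (58, 58, 66, 57) and the four readings after 1 hour with the PAPST fans at 3300 RPM (46, 47, 51, 49) from the posts above. The function name temp_drop is made up for the example.

```shell
#!/bin/bash
# Sketch: per-GPU temperature drop between two sets of readings.
# temp_drop "BEFORE..." "AFTER..." -> prints one drop per GPU, space separated
temp_drop() {
    awk -v b="$1" -v a="$2" 'BEGIN {
        n = split(b, before, " ")
        split(a, after, " ")
        for (i = 1; i <= n; i++)
            printf "%s%d", (i > 1 ? " " : ""), before[i] - after[i]
        print ""
    }'
}

# Standard fans vs. PAPST fans after 1 hour, same card order as in nvidia-smi
temp_drop "58 58 66 57" "46 47 51 49"    # prints: 12 11 15 8
```

So the drop is between 8 and 15 degrees depending on the card, with the hottest card gaining the most.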

STATUS:
22.08.14 3:25pm <---- Testing the new fans at a lower speed:

After 10 min. GPUs at 100% / fans at 2500 RPM:
Code:
# nvidia-smi -a |grep Gpu
        Gpu                         : N/A
        Gpu                         : 48 C
        Gpu                         : N/A
        Gpu                         : 51 C
        Gpu                         : N/A
        Gpu                         : 55 C
        Gpu                         : N/A
        Gpu                         : 52 C

After 20 min. GPUs at 100% / fans at 2500 RPM:
Code:
# nvidia-smi -a |grep Gpu
        Gpu                         : N/A
        Gpu                         : 49 C
        Gpu                         : N/A
        Gpu                         : 52 C
        Gpu                         : N/A
        Gpu                         : 55 C
        Gpu                         : N/A
        Gpu                         : 52 C

STATUS:
22.08.14 3:45pm <---- Testing the new fans at a slightly higher speed:

After 10 min. GPUs at 100% / fans at 2800 RPM:
Code:
# nvidia-smi -a |grep Gpu
        Gpu                         : N/A
        Gpu                         : 47 C
        Gpu                         : N/A
        Gpu                         : 49 C
        Gpu                         : N/A
        Gpu                         : 53 C
        Gpu                         : N/A
        Gpu                         : 51 C

After 30 min. GPUs at 100% / fans at 2800 RPM:
Code:
# nvidia-smi -a |grep Gpu
        Gpu                         : N/A
        Gpu                         : 47 C
        Gpu                         : N/A
        Gpu                         : 49 C
        Gpu                         : N/A
        Gpu                         : 53 C
        Gpu                         : N/A
        Gpu                         : 51 C


CONCLUSION:
22.08.14 4:28pm
I think we can conclude that these fans are all we need for this system to run. Now I'll start a "burn-in" test, where we'll hit it with all we've got and let it run for 48 hours. I've just written a little shell script which takes the temperature every hour and logs it. Let's see how our "poor man's super cruncher" handles a little work.
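
The logging script itself isn't posted here, so the following is not it, only a sketch of how such an hourly logger could look. It assumes the "Gpu : NN C" lines of nvidia-smi shown in this thread. The function names gpu_temps and log_temps, and any log file path you pass in, are made up for the example.

```shell
#!/bin/bash
# Sketch: append one timestamped line of GPU temperatures to a log file.

# gpu_temps: reads nvidia-smi text on stdin, keeps only the numeric
# "Gpu : NN C" lines (the "Gpu : N/A" lines are skipped), prints "NN NN ..."
gpu_temps() {
    awk '/Gpu/ && $3 ~ /^[0-9]+$/ { printf "%s%s", sep, $3; sep = " " }
         END { print "" }'
}

# log_temps LOGFILE -> appends a line like "2014-08-22 16:00:01 47 50 53 51"
log_temps() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') $(nvidia-smi -a | gpu_temps)" >> "$1"
}
```

Saved as a script that ends with a call such as log_temps /var/log/gpu-temps.log (a hypothetical path), it could be run from cron once an hour, e.g. with a crontab line starting with "0 * * * *".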

I think this looks pretty good:
Code:
# sensors
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +66.0°C  (high = +80.0°C, crit = +100.0°C)
Core 0:         +66.0°C  (high = +80.0°C, crit = +100.0°C)
Core 1:         +66.0°C  (high = +80.0°C, crit = +100.0°C)
Core 2:         +64.0°C  (high = +80.0°C, crit = +100.0°C)
Core 3:         +64.0°C  (high = +80.0°C, crit = +100.0°C)
Code:
# nvidia-smi -a |grep Gpu
        Gpu                         : N/A
        Gpu                         : 47 C
        Gpu                         : N/A
        Gpu                         : 50 C
        Gpu                         : N/A
        Gpu                         : 53 C
        Gpu                         : N/A
        Gpu                         : 51 C


I will be back.....

.
Edited by DanHansenDK - 8/22/14 at 7:23am
post #97 of 343
Nice progress! I like your attention to detail as well.

And hopefully, what works for one setup is repeatable in all setups.

Nice work so far! Looks really good!


PS: I moved from Texas to Tennessee and then to Paducah, Kentucky as my final retirement venue near good fishing and cheap housing...
post #98 of 343
Thread Starter 
Hi Tex,


Thank you. Nice to know who you are talking to, or at least in which direction "they" are. I think you know what I mean. It sounds really nice where you are. We have to stay put in this, our home for more than 10 years; well, actually I was raised in this area, Koege, close to Copenhagen. But we are moving to our dream house when I've finished university. It's a little late in life to become a student again, but I needed the know-how. What I'm learning in these 3 years, I couldn't "pick up" myself. Just not possible. I needed to learn about SMD components and how to replace, test and check them. So I chose to take this electronics engineering degree. Anyway, thanks for letting me know.

OK. Regarding the test. There's a little problem. The system suddenly rebooted, and I've just checked the system uptime:
Code:
# uptime
 22:11:04 up 16 min,  1 user,  load average: 4.78, 4.73, 4.53

This was the reason I wanted to get the temperature of the GPUs down to begin with. That is, if it's the same thing that's causing it to reboot. Well, I'll let the system run a little while longer to see if it happens again. Back then, when testing system 3 with the Asus Extreme mobo, the crashes and reboots were far more frequent. I'm not sure if this is the same thing, but I think it might be. Let's see what happens. We've still got 500 more RPM to work with. I just didn't think these temperatures could cause it to "crash". It's much hotter near the heatsinks of the GPUs, but we did manage to lower the temperature by 11-15 degrees Celsius.
Code:
# nvidia-smi -a |grep Gpu
        Gpu                         : N/A
        Gpu                         : 49 C
        Gpu                         : N/A
        Gpu                         : 50 C
        Gpu                         : N/A
        Gpu                         : 55 C
        Gpu                         : N/A
        Gpu                         : 51 C


GOT AN IDEA???
Does anybody have an idea what the reason for the crash & reboot was???
The system rebooted at about 21:24 (9:24pm)
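
To avoid reading the whole syslog by eye, a filter like the one below could pull out the lines that tend to matter after an unexpected reboot. It's only a sketch: the keyword list is a guess at useful patterns, not an official or complete list, and the function name suspect_lines is made up for the example.

```shell
#!/bin/bash
# Sketch: print syslog lines that are often worth a look after an unexpected reboot.
# suspect_lines FILE -> matching lines, in file order
suspect_lines() {
    grep -E -i 'segfault|NVRM|Xid|machine check|thermal|out of memory' "$1"
}
```

Used as: suspect_lines /var/log/syslog. On the log in this post it would pick up the two NVRM lines from boot and the setiathome segfault at 15:52, which is hours before the reboot and so may well be unrelated. Running "last -x" next to it should show whether a shutdown was recorded before the reboot. If none was, that would point more towards a hard reset or power problem than a software-initiated restart.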
Code:
# vi syslog
Aug 22 13:41:29 beaufort dhclient: DHCPACK of 192.168.1.29 from 192.168.1.1
Aug 22 13:41:29 beaufort dhclient: bound to 192.168.1.29 -- renewal in 42149 seconds.
Aug 22 13:41:29 beaufort kernel: [    5.824371] type=1400 audit(1408707689.750:8): apparmor="STATUS" operation="profile_replace" name="/sbin/dhclient" pid=1065 comm="apparmor_parser"
Aug 22 13:41:29 beaufort kernel: [    5.824536] type=1400 audit(1408707689.750:9): apparmor="STATUS" operation="profile_replace" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=1065 comm="apparmor_parser"
Aug 22 13:41:29 beaufort kernel: [    5.824628] type=1400 audit(1408707689.750:10): apparmor="STATUS" operation="profile_replace" name="/usr/lib/connman/scripts/dhclient-script" pid=1065 comm="apparmor_parser"
Aug 22 13:41:29 beaufort kernel: [    5.825656] type=1400 audit(1408707689.750:11): apparmor="STATUS" operation="profile_load" name="/usr/sbin/tcpdump" pid=1067 comm="apparmor_parser"
Aug 22 13:41:29 beaufort cron[1120]: (CRON) INFO (pidfile fd = 3)
Aug 22 13:41:29 beaufort acpid: starting up with proc fs
Aug 22 13:41:29 beaufort cron[1144]: (CRON) STARTUP (fork ok)
Aug 22 13:41:29 beaufort cron[1144]: (CRON) INFO (Running @reboot jobs)
Aug 22 13:41:29 beaufort acpid: 1 rule loaded
Aug 22 13:41:29 beaufort acpid: waiting for events: event logging is off
Aug 22 13:41:32 beaufort kernel: [    8.893501] NVRM: os_pci_init_handle: invalid context!
Aug 22 13:41:32 beaufort kernel: [    8.893504] NVRM: os_pci_init_handle: invalid context!
Aug 22 13:41:38 beaufort ntpdate[1007]: adjust time server 91.189.94.4 offset 0.307711 sec
Aug 22 13:42:09 beaufort dbus[725]: [system] Activating service name='org.freedesktop.ConsoleKit' (using servicehelper)
Aug 22 13:42:09 beaufort dbus[725]: [system] Activating service name='org.freedesktop.PolicyKit1' (using servicehelper)
Aug 22 13:42:09 beaufort polkitd[1439]: started daemon version 0.104 using authority implementation `local' version `0.104'
Aug 22 13:42:09 beaufort dbus[725]: [system] Successfully activated service 'org.freedesktop.PolicyKit1'
Aug 22 13:42:09 beaufort dbus[725]: [system] Successfully activated service 'org.freedesktop.ConsoleKit'
Aug 22 14:17:01 beaufort CRON[2013]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug 22 15:17:01 beaufort CRON[2069]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug 22 15:52:29 beaufort kernel: [ 7858.605739] setiathome_7.01[2189]: segfault at ffffffffffffffc8 ip 0000000000763244 sp 00007fff563ef1d8 error 7
Aug 22 16:17:01 beaufort CRON[2256]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug 22 17:17:01 beaufort CRON[2338]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug 22 18:17:01 beaufort CRON[2386]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug 22 19:17:01 beaufort CRON[2428]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug 22 20:17:01 beaufort CRON[2484]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug 22 21:17:01 beaufort CRON[2525]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)

# THIS IS WHERE THE SYSTEM CRASHED AND REBOOTED !!!

Aug 22 21:24:48 beaufort kernel: imklog 5.8.6, log source = /proc/kmsg started.
Aug 22 21:24:48 beaufort rsyslogd: [origin software="rsyslogd" swVersion="5.8.6" x-pid="792" x-info="http://www.rsyslog.com"] start
Aug 22 21:24:48 beaufort rsyslogd: rsyslogd's groupid changed to 103
Aug 22 21:24:48 beaufort rsyslogd: rsyslogd's userid changed to 101
Aug 22 21:24:48 beaufort rsyslogd-2039: Could not open output pipe '/dev/xconsole' [try http://www.rsyslog.com/e/2039 ]
Aug 22 21:24:48 beaufort kernel: [    0.000000] Initializing cgroup subsys cpuset
Aug 22 21:24:48 beaufort kernel: [    0.000000] Initializing cgroup subsys cpu
Aug 22 21:24:48 beaufort kernel: [    0.000000] Linux version 3.8.0-29-generic (buildd@panlong) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #42~precise1-Ubuntu SMP Wed Aug 14 16:19:23 UTC 2013 (Ubuntu 3.8.0-29.42~precise1-generic 3.8.13.5)
Aug 22 21:24:48 beaufort kernel: [    0.000000] Command line: BOOT_IMAGE=/vmlinuz-3.8.0-29-generic root=/dev/mapper/beaufort--vg-root ro
Aug 22 21:24:48 beaufort kernel: [    0.000000] KERNEL supported cpus:
Aug 22 21:24:48 beaufort kernel: [    0.000000]   Intel GenuineIntel
Aug 22 21:24:48 beaufort kernel: [    0.000000]   AMD AuthenticAMD
Aug 22 21:24:48 beaufort kernel: [    0.000000]   Centaur CentaurHauls
Aug 22 21:24:48 beaufort kernel: [    0.000000] e820: BIOS-provided physical RAM map:
Aug 22 21:24:48 beaufort kernel: [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009d7ff] usable
Aug 22 21:24:48 beaufort kernel: [    0.000000] BIOS-e820: [mem 0x000000000009d800-0x000000000009ffff] reserved
Aug 22 21:24:48 beaufort kernel: [    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
Aug 22 21:24:48 beaufort kernel: [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000088a05fff] usable
Aug 22 21:24:48 beaufort kernel: [    0.000000] BIOS-e820: [mem 0x0000000088a06000-0x0000000088a0cfff] ACPI NVS
Aug 22 21:24:48 beaufort kernel: [    0.000000] BIOS-e820: [mem 0x0000000088a0d000-0x0000000089545fff] usable
Aug 22 21:24:48 beaufort kernel: [    0.000000] BIOS-e820: [mem 0x0000000089546000-0x000000008998ffff] reserved
Aug 22 21:24:48 beaufort kernel: [    0.000000] BIOS-e820: [mem 0x0000000089990000-0x000000009cb0dfff] usable
Aug 22 21:24:48 beaufort kernel: [    0.000000] BIOS-e820: [mem 0x000000009cb0e000-0x000000009d088fff] reserved
Aug 22 21:24:48 beaufort kernel: [    0.000000] BIOS-e820: [mem 0x000000009d089000-0x000000009d0c8fff] usable
Aug 22 21:24:48 beaufort kernel: [    0.000000] BIOS-e820: [mem 0x000000009d0c9000-0x000000009d171fff] ACPI NVS
Aug 22 21:24:48 beaufort kernel: [    0.000000] BIOS-e820: [mem 0x000000009d172000-0x000000009dffefff] reserved
Aug 22 21:24:48 beaufort kernel: [    0.000000] BIOS-e820: [mem 0x000000009dfff000-0x000000009dffffff] usable
Aug 22 21:24:48 beaufort kernel: [    0.000000] BIOS-e820: [mem 0x00000000e0000000-0x00000000efffffff] reserved
Aug 22 21:24:48 beaufort kernel: [    0.000000] BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
Aug 22 21:24:48 beaufort kernel: [    0.000000] BIOS-e820: [mem 0x00000000fed00000-0x00000000fed03fff] reserved
Aug 22 21:24:48 beaufort kernel: [    0.000000] BIOS-e820: [mem 0x00000000fed1c000-0x00000000fed1ffff] reserved
Aug 22 21:24:48 beaufort kernel: [    0.000000] BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
Aug 22 21:24:48 beaufort kernel: [    0.000000] BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
Aug 22 21:24:48 beaufort kernel: [    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000025effffff] usable
Aug 22 21:24:48 beaufort kernel: [    0.000000] NX (Execute Disable) protection: active
Aug 22 21:24:48 beaufort kernel: [    0.000000] SMBIOS 2.7 present.
Aug 22 21:24:48 beaufort kernel: [    0.000000] DMI: To Be Filled By O.E.M. To Be Filled By O.E.M./Z87 OC Formula, BIOS P2.10 07/17/2014
Aug 22 21:24:48 beaufort kernel: [    0.000000] e820: update [mem 0x00000000-0x0000ffff] usable ==> reserved
Aug 22 21:24:48 beaufort kernel: [    0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
Aug 22 21:24:48 beaufort kernel: [    0.000000] No AGP bridge found
Aug 22 21:24:48 beaufort kernel: [    0.000000] e820: last_pfn = 0x25f000 max_arch_pfn = 0x400000000
Aug 22 21:24:48 beaufort kernel: [    0.000000] MTRR default type: uncachable
Aug 22 21:24:48 beaufort kernel: [    0.000000] MTRR fixed ranges enabled:
Aug 22 21:24:48 beaufort kernel: [    0.000000]   00000-9FFFF write-back
Aug 22 21:24:48 beaufort kernel: [    0.000000]   A0000-BFFFF uncachable
Aug 22 21:24:48 beaufort kernel: [    0.000000]   C0000-CFFFF write-protect
Aug 22 21:24:48 beaufort kernel: [    0.000000]   D0000-E7FFF uncachable
Aug 22 21:24:48 beaufort kernel: [    0.000000]   E8000-FFFFF write-protect
Aug 22 21:24:48 beaufort kernel: [    0.000000] MTRR variable ranges enabled:
Aug 22 21:24:48 beaufort kernel: [    0.000000]   0 base 0000000000 mask 7E00000000 write-back
Aug 22 21:24:48 beaufort kernel: [    0.000000]   1 base 0200000000 mask 7FC0000000 write-back
Aug 22 21:24:48 beaufort kernel: [    0.000000]   2 base 0240000000 mask 7FF0000000 write-back
Aug 22 21:24:48 beaufort kernel: [    0.000000]   3 base 0250000000 mask 7FF8000000 write-back
Aug 22 21:24:48 beaufort kernel: [    0.000000]   4 base 0258000000 mask 7FFC000000 write-back
Aug 22 21:24:48 beaufort kernel: [    0.000000]   5 base 025C000000 mask 7FFE000000 write-back
Aug 22 21:24:48 beaufort kernel: [    0.000000]   6 base 025E000000 mask 7FFF000000 write-back
Aug 22 21:24:48 beaufort kernel: [    0.000000]   7 base 00C0000000 mask 7FC0000000 uncachable
Aug 22 21:24:48 beaufort kernel: [    0.000000]   8 base 00A0000000 mask 7FE0000000 uncachable
Aug 22 21:24:48 beaufort kernel: [    0.000000]   9 disabled
Aug 22 21:24:48 beaufort kernel: [    0.000000] x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
Aug 22 21:24:48 beaufort kernel: [    0.000000] original variable MTRRs
Aug 22 21:24:48 beaufort kernel: [    0.000000] reg 0, base: 0GB, range: 8GB, type WB
Aug 22 21:24:48 beaufort kernel: [    0.000000] reg 1, base: 8GB, range: 1GB, type WB
Aug 22 21:24:48 beaufort kernel: [    0.000000] reg 2, base: 9GB, range: 256MB, type WB
Aug 22 21:24:48 beaufort kernel: [    0.000000] reg 3, base: 9472MB, range: 128MB, type WB
Aug 22 21:24:48 beaufort kernel: [    0.000000] reg 4, base: 9600MB, range: 64MB, type WB
Aug 22 21:24:48 beaufort kernel: [    0.000000] reg 5, base: 9664MB, range: 32MB, type WB
Aug 22 21:24:48 beaufort kernel: [    0.000000] reg 6, base: 9696MB, range: 16MB, type WB
Aug 22 21:24:48 beaufort kernel: [    0.000000] reg 7, base: 3GB, range: 1GB, type UC
Aug 22 21:24:48 beaufort kernel: [    0.000000] reg 8, base: 2560MB, range: 512MB, type UC
Aug 22 21:24:48 beaufort kernel: [    0.000000] total RAM covered: 8176M
Aug 22 21:24:48 beaufort kernel: [    0.000000] Found optimal setting for mtrr clean up
Aug 22 21:24:48 beaufort kernel: [    0.000000]  gran_size: 64K         chunk_size: 32M         num_reg: 6      lose cover RAM: 0G
Aug 22 21:24:48 beaufort kernel: [    0.000000] New variable MTRRs
Aug 22 21:24:48 beaufort kernel: [    0.000000] reg 0, base: 0GB, range: 2GB, type WB
Aug 22 21:24:48 beaufort kernel: [    0.000000] reg 1, base: 2GB, range: 512MB, type WB
Aug 22 21:24:48 beaufort kernel: [    0.000000] reg 2, base: 4GB, range: 4GB, type WB
Aug 22 21:24:48 beaufort kernel: [    0.000000] reg 3, base: 8GB, range: 1GB, type WB
Aug 22 21:24:48 beaufort kernel: [    0.000000] reg 4, base: 9GB, range: 512MB, type WB
Aug 22 21:24:48 beaufort kernel: [    0.000000] reg 5, base: 9712MB, range: 16MB, type UC
Aug 22 21:24:48 beaufort kernel: [    0.000000] e820: update [mem 0xa0000000-0xffffffff] usable ==> reserved
Aug 22 21:24:48 beaufort kernel: [    0.000000] e820: last_pfn = 0x9e000 max_arch_pfn = 0x400000000
Aug 22 21:24:48 beaufort kernel: [    0.000000] found SMP MP-table at [mem 0x000fd9b0-0x000fd9bf] mapped at [ffff8800000fd9b0]
Aug 22 21:24:48 beaufort kernel: [    0.000000] initial memory mapped: [mem 0x00000000-0x1fffffff]
Aug 22 21:24:48 beaufort kernel: [    0.000000] Base memory trampoline at [ffff880000097000] 97000 size 24576
Aug 22 21:24:48 beaufort kernel: [    0.000000] Using GB pages for direct mapping
Aug 22 21:24:48 beaufort kernel: [    0.000000] init_memory_mapping: [mem 0x00000000-0x9dffffff]
Aug 22 21:24:48 beaufort kernel: [    0.000000]  [mem 0x00000000-0x7fffffff] page 1G
Aug 22 21:24:48 beaufort kernel: [    0.000000]  [mem 0x80000000-0x9dffffff] page 2M
Aug 22 21:24:48 beaufort kernel: [    0.000000] kernel direct mapping tables up to 0x9dffffff @ [mem 0x1fffe000-0x1fffffff]
Aug 22 21:24:48 beaufort kernel: [    0.000000] init_memory_mapping: [mem 0x100000000-0x25effffff]
Aug 22 21:24:48 beaufort kernel: [    0.000000]  [mem 0x100000000-0x23fffffff] page 1G
Aug 22 21:24:48 beaufort kernel: [    0.000000]  [mem 0x240000000-0x25effffff] page 2M
Aug 22 21:24:48 beaufort kernel: [    0.000000] kernel direct mapping tables up to 0x25effffff @ [mem 0x9d0c7000-0x9d0c8fff]
Aug 22 21:24:48 beaufort kernel: [    0.000000] RAMDISK: [mem 0x3604a000-0x3701cfff]
Aug 22 21:24:48 beaufort kernel: [    0.000000] ACPI: RSDP 00000000000f0490 00024 (v02 ALASKA)
Aug 22 21:24:48 beaufort kernel: [    0.000000] ACPI: XSDT 000000009d14d080 00084 (v01 ALASKA    A M I 01072009 AMI  00010013)
Aug 22 21:24:48 beaufort kernel: [    0.000000] ACPI: FACP 000000009d157f88 0010C (v05 ALASKA    A M I 01072009 AMI  00010013)
Aug 22 21:24:48 beaufort kernel: [    0.000000] ACPI: DSDT 000000009d14d1a0 0ADE7 (v02 ALASKA    A M I 00000210 INTL 20091112)
Aug 22 21:24:48 beaufort kernel: [    0.000000] ACPI: FACS 000000009d170080 00040
Aug 22 21:24:48 beaufort kernel: [    0.000000] ACPI: APIC 000000009d158098 00072 (v03 ALASKA    A M I 01072009 AMI  00010013)
Aug 22 21:24:48 beaufort kernel: [    0.000000] ACPI: FPDT 000000009d158110 00044 (v01 ALASKA    A M I 01072009 AMI  00010013)
Aug 22 21:24:48 beaufort kernel: [    0.000000] ACPI: ASF! 000000009d158158 000A5 (v32 INTEL       HCG 00000001 TFSM 000F4240)
Aug 22 21:24:48 beaufort kernel: [    0.000000] ACPI: SSDT 000000009d158200 00539 (v01  PmRef  Cpu0Ist 00003000 INTL 20051117)
Aug 22 21:24:48 beaufort kernel: [    0.000000] ACPI: SSDT 000000009d158740 00AD8 (v01  PmRef    CpuPm 00003000 INTL 20051117)
Aug 22 21:24:48 beaufort kernel: [    0.000000] ACPI: SSDT 000000009d159218 001C7 (v01  PmRef LakeTiny 00003000 INTL 20051117)
Aug 22 21:24:48 beaufort kernel: [    0.000000] ACPI: MCFG 000000009d1593e0 0003C (v01 ALASKA    A M I 01072009 MSFT 00000097)
Aug 22 21:24:48 beaufort kernel: [    0.000000] ACPI: HPET 000000009d159420 00038 (v01 ALASKA    A M I 01072009 AMI. 00000005)
Aug 22 21:24:48 beaufort kernel: [    0.000000] ACPI: SSDT 000000009d159458 0036D (v01 SataRe SataTabl 00001000 INTL 20091112)
Aug 22 21:24:48 beaufort kernel: [    0.000000] ACPI: SSDT 000000009d1597c8 03493 (v01 SaSsdt  SaSsdt  00003000 INTL 20091112)
Aug 22 21:24:48 beaufort kernel: [    0.000000] ACPI: AAFT 000000009d15cc60 00475 (v01 ALASKA OEMAAFT  01072009 MSFT 00000097)
Aug 22 21:24:48 beaufort kernel: [    0.000000] ACPI: Local APIC address 0xfee00000
Aug 22 21:24:48 beaufort kernel: [    0.000000] No NUMA configuration found
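
Instead of scrolling through the file in vi, the last entry before each reboot can be pulled out directly. A rough sketch, assuming (as in the log above) that the first line of every new boot is the rsyslog "imklog ... started" message; the function name last_line_before_boot is hypothetical.

```shell
#!/bin/sh
# Sketch only: last_line_before_boot is a hypothetical helper name.
# For every boot found in a syslog file, print the non-empty log line that
# came immediately before it, i.e. the last thing written before the reboot.
# Assumption: each boot starts with the rsyslog "imklog ... started" line.
last_line_before_boot() {
    awk '
        /kernel: imklog .* started/ { if (prev != "") print prev }
        NF { prev = $0 }                 # remember the latest non-empty line
    ' "$1"
}

# Intended use (path is the usual Ubuntu location, adjust if needed):
#   last_line_before_boot /var/log/syslog
#   grep 'segfault at' /var/log/syslog
```

For the log above, the last entry before the reboot is the routine cron job at 21:17:01, and the only odd entry earlier that day is the setiathome_7.01 segfault at 15:52:29. So syslog by itself does not show what triggered the reset.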

Edited by DanHansenDK - 8/22/14 at 2:15pm
post #99 of 343
Thread Starter 
STATUS 48 HOUR BURN-IN TEST:

24 HOUR CHECK!
OK, it's now precisely 24 hours since the test system rebooted for some reason. Yesterday I increased the fan speed by 200 RPM, so the fans were running at 3000 RPM (according to the AeroCool temp./fan controller). According to the specs for the PAPST fan, it should be able to run at more than 5000 RPM, so I'm a little lost on this matter. Anyway, the test shows that the system hasn't rebooted or crashed in the last 24 hours. Maybe something else entirely caused this crash/reboot, I don't know. The system is not quite perfected yet, we know that, so we'll see after another 24 hours of testing. Here are the results:
Code:
# uptime
 21:20:20 up 23:55,  1 user,  load average: 4.61, 4.71, 4.79  <---- UPTIME 23 HOURS AND 55 MINUTES  ;)

# nvidia-smi -a |grep Gpu
        Gpu                         : N/A
        Gpu                         : 46 C
        Gpu                         : N/A
        Gpu                         : 47 C
        Gpu                         : N/A
        Gpu                         : 50 C
        Gpu                         : N/A
        Gpu                         : 48 C


48 HOUR CHECK!
OK, it looks like we did it. Here's the result of the 48 hour burn-in test:
Code:
# uptime
 21:49:59 up 2 days, 25 min,  1 user,  load average: 4.76, 4.76, 4.72

# sensors
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +64.0°C  (high = +80.0°C, crit = +100.0°C)
Core 0:         +64.0°C  (high = +80.0°C, crit = +100.0°C)
Core 1:         +62.0°C  (high = +80.0°C, crit = +100.0°C)
Core 2:         +61.0°C  (high = +80.0°C, crit = +100.0°C)
Core 3:         +58.0°C  (high = +80.0°C, crit = +100.0°C)

# nvidia-smi -a |grep Gpu
        Gpu                         : N/A
        Gpu                         : 46 C
        Gpu                         : N/A
        Gpu                         : 47 C
        Gpu                         : N/A
        Gpu                         : 50 C
        Gpu                         : N/A
        Gpu                         : 48 C
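
Since a crash or reboot resets the uptime counter, "uptime of at least 48 hours" is the whole pass condition for the burn-in. A small sketch of that check, with hypothetical helper names (burnin_reached, fmt_uptime); it assumes Linux, where the first field of /proc/uptime is the number of seconds since boot.

```shell
#!/bin/sh
# Sketch only: burnin_reached and fmt_uptime are hypothetical helper names.

# Succeeds (exit 0) when an uptime in seconds ($1) has reached a target in hours ($2).
burnin_reached() {
    uptime_s=${1%%.*}                    # drop the fraction: 174300.12 -> 174300
    target_h=$2
    [ "$uptime_s" -ge $((target_h * 3600)) ]
}

# Format seconds as "D days, H:MM".
fmt_uptime() {
    s=${1%%.*}
    printf '%d days, %d:%02d\n' $((s / 86400)) $((s % 86400 / 3600)) $((s % 3600 / 60))
}

# Intended use on the test machine:
#   read -r up _ < /proc/uptime
#   burnin_reached "$up" 48 && echo "48 hour burn-in passed: $(fmt_uptime "$up")"
```

The "up 2 days, 25 min" above corresponds to 174300 seconds, which passes the 48 hour target.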


OK, there are a couple of issues still to be solved, but the system works now, and very well actually! I just received the "Top 5% average" from SETI, so we are doing something right wink.gif According to my calculation (not that accurate), running 3 systems like this, we will get about 15,000-20,000 points per system per day. That's between 45,000 and 60,000 points a day. Let's say 50,000 on average, which makes about 1,500,000 points every month. That way we'll get into "better society" pretty d... fast thumb.gif
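
The estimate written out as shell arithmetic, in case anybody wants to plug in their own numbers. All figures are the rough guesses from this post, not measurements, and the 30 days per month is an extra assumption.

```shell
#!/bin/sh
# Sketch of the points estimate. The inputs are rough guesses, not measured
# values; days_per_month=30 is an added assumption.
systems=3
low_per_system=15000       # points per system per day, low guess
high_per_system=20000      # points per system per day, high guess
daily_avg=50000            # rounded daily figure used for the monthly total
days_per_month=30

low_per_day=$((systems * low_per_system))
high_per_day=$((systems * high_per_system))
per_month=$((daily_avg * days_per_month))

echo "per day:   $low_per_day to $high_per_day points"
echo "per month: $per_month points"
```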

Back to reality! Issues need to be solved.
1. I'll test the new update 12.04.5 with CUDA 5.5. If it doesn't work, I've got a possible solution to the issues regarding 12.04.3/CUDA 5.5 from a friend at HowToForge, "Srijan". Let's see how that goes wink.gif
2. Solve the issues regarding the "headless" part.
3. Finish the shell scripts, which are going to watch over the system and alert you if anything goes wrong.
4. Solve the hardware issues regarding the CPU fan connector, connecting the ASRock Z87 OC Formula to the AeroCool Temp. & Fan Control. The RPM doesn't show in the display. The display has been tested of course!
5. Decide whether FanControl by LM-sensors should control the system fans using the GPUFanWatchDog.sh and SystemFanWatchDog.sh scripts. Using software fan control can increase the fan temperature! So, if we choose to do so, we need a warning system in case of a fan failure. Well, a warning system is exactly what we need it for, so this shouldn't be a problem, right biggrin.gif
6. Check power consumption. Is this 550-watt 2U PSU really needed? I think not. And it's too d... expensive as well wink.gif I think we'll use about 250-300 watts in total. If this is right, then it's pretty good I guess: 1 powerful CPU and 4 GPUs running at full speed. This is why I want the system to be without anything that isn't needed.
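
For point 3, a watchdog check on the CPU side could start from something like this. It is only a sketch and not one of the WatchDog scripts mentioned above: the function name hot_cores is hypothetical, and it assumes the "Core N: +TT.T°C (high = +HH.H°C, ...)" layout that sensors printed in the 48 hour check.

```shell
#!/bin/sh
# Sketch only: hot_cores is a hypothetical helper, not one of the WatchDog scripts.
# Reads lm-sensors "sensors" output on stdin and prints every "Core N" line
# whose current temperature has reached its own "high" limit.
# Exit status is 1 if any core is too hot, 0 otherwise.
hot_cores() {
    awk '
        $1 == "Core" && $4 == "(high" {
            temp = $3 + 0                # "+64.0°C"  -> 64
            high = $6 + 0                # "+80.0°C," -> 80
            if (temp >= high) {
                print "Core " $2 " " temp " C (high " high " C)"
                hot = 1
            }
        }
        END { exit hot + 0 }
    '
}

# Intended use (assumes lm-sensors is installed):
#   sensors | hot_cores || echo "ALERT: CPU core at or above its high limit"
```

How the alert is delivered (mail, log entry, beeper) is left open here; the echo is only a placeholder.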

...more to come


Tex, Magic!? What do you say? Do you think it's because of the heat inside/on the graphics cards, or is it something completely different?

A PROBLEM:
Headless-Linux-CLI-Multiple-GPU-Boinc-Server_Problem-GPUs suddenly missing
Installed Ubuntu Server 12.04 and CUDA 5.5 for number crunching with Boinc. I used the 12.04.3 update, since the 12.04.4 update doesn't work with CUDA!
The system runs perfectly, using all 4 GPUs to crunch data. Then suddenly, without installing or updating anything, the GPUs are lost to Boinc!
It might be better when testing this with the new 12.04.5 update and CUDA 5.5, if that works at all, that is!! We are done testing the fans and GPU temperatures, so it's time to solve the issues wink.gif
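
One way to narrow this down is to check whether the driver itself still sees all the cards when Boinc loses them. A rough sketch, assuming the usual "nvidia-smi -L" listing with one "GPU n: ..." line per card; the function name check_gpu_count is hypothetical. Note that this only covers the driver side. Whether Boinc picks the cards up is a separate question, which its startup messages should answer.

```shell
#!/bin/sh
# Sketch only: check_gpu_count is a hypothetical helper name.
# Reads "nvidia-smi -L" output on stdin (one "GPU n: ..." line per card) and
# compares the number of cards with the expected count given as $1.
check_gpu_count() {
    expected=$1
    found=$(grep -c '^GPU [0-9][0-9]*:' || true)   # grep -c prints 0 on no match
    if [ "$found" -eq "$expected" ]; then
        echo "OK: $found of $expected GPUs visible to the driver"
    else
        echo "WARNING: $found of $expected GPUs visible to the driver"
        return 1
    fi
}

# Intended use on the 4-GPU box (assumes nvidia-smi is in PATH):
#   nvidia-smi -L | check_gpu_count 4
```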

Edited by DanHansenDK - 8/26/14 at 6:28am
post #100 of 343
Thread Starter 
Hello friends wink.gif


I've been fighting this since early this morning, but I think we might just have cracked the case wink.gif
We are now crunching data on an Ubuntu Server 14.04.1 LTS (newest version) using CUDA 6.5 (newest version) on all 4 GPUs and the CPU of course thumb.gif

I've started a Burn-In test again, using these new versions.

Status after 10 min. of testing at 100%:
Code:
# nvidia-smi -a | grep GPU
Attached GPUs                       : 4
GPU 0000:01:00.0
    GPU UUID                        : GPU-ea22ef3d-4254-dff0-2db8-86656441c198
    MultiGPU Board                  : N/A
    GPU Operation Mode
        GPU Link Info
        GPU Current Temp            : 48 C
        GPU Shutdown Temp           : N/A
        GPU Slowdown Temp           : N/A
GPU 0000:02:00.0
    GPU UUID                        : GPU-bf213a08-c3c6-346b-53ff-5ff7d82c5c74
    MultiGPU Board                  : N/A
    GPU Operation Mode
        GPU Link Info
        GPU Current Temp            : 49 C
        GPU Shutdown Temp           : N/A
        GPU Slowdown Temp           : N/A
GPU 0000:03:00.0
    GPU UUID                        : GPU-d5813be2-bf30-6c90-a591-90fef765984f
    MultiGPU Board                  : N/A
    GPU Operation Mode
        GPU Link Info
        GPU Current Temp            : 53 C
        GPU Shutdown Temp           : N/A
        GPU Slowdown Temp           : N/A
GPU 0000:04:00.0
    GPU UUID                        : GPU-a9bb2423-c2ba-16cd-529f-cdcc43fafd61
    MultiGPU Board                  : N/A
    GPU Operation Mode
        GPU Link Info
        GPU Current Temp            : 49 C
        GPU Shutdown Temp           : N/A
        GPU Slowdown Temp           : N/A
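
With the new driver the output is a lot longer, so here is a rough sketch that reduces it to one line per card. The function name gpu_temp_table is hypothetical, and it assumes exactly the layout shown above (a "GPU 0000:xx:00.0" header per card followed by a "GPU Current Temp" line).

```shell
#!/bin/sh
# Sketch only: gpu_temp_table is a hypothetical helper name.
# Reads "nvidia-smi -a | grep GPU" output (the newer layout) on stdin and
# prints one "bus-id temperature" pair per card. Cards whose current
# temperature is not a number (e.g. N/A) are left out.
gpu_temp_table() {
    awk '
        /^GPU [0-9A-Fa-f]+:/                 { bus = $2 }               # "GPU 0000:01:00.0"
        /GPU Current Temp *: *[0-9]+ *C/     { print bus, $(NF - 1) }   # "... : 48 C"
    '
}

# Intended use (assumes nvidia-smi is in PATH):
#   nvidia-smi -a | grep GPU | gpu_temp_table
```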