Be warned if your hard drive is about to die!
#1 (permalink)
has left the building
About time I contributed something I guess.
You probably already know about the SMART technology that's built into your hard drive. It's supposed to warn you if your hard drive is about to croak so that hopefully you'll have enough time to rescue your files before the drive is completely unusable. Today I'm going to tell you how to set up SMART monitoring on your Linux box, how to have it e-mail you whenever there's a problem with your drive, and how to run drive self-tests. It should take you about ten minutes to set up.

(This guide is for Linux. If you have Windows, try Disk Checkup, HDD Health or SpeedFan to monitor your hard disk's SMART attributes and run self-tests.)

The reason I'm writing this in the first place is that I forgot to set it up on my laptop, and the drive died despite three weeks of warnings in the system log which I never noticed. I was fortunate enough to get most of my files off, but I did lose some MP3s and a movie. The only good news there is I got it replaced for free.

Simple SMART Monitoring

The GNOME Disk Utility included in Fedora 12 and Ubuntu 9.10 will display detailed SMART status for all the drives in your system. On Fedora, open Applications > System Tools > Disk Utility. On Ubuntu, open System > Administration > Disk Utility. When you highlight one of your drives, you will see a message stating whether your disk is healthy or not.

When you click on More Information you will see the detailed SMART data for your drive and be able to determine if any specific component is failing, as well as conduct a self-test.

For a deeper understanding of how your hard drive monitors itself, and to have your system automatically e-mail you if your drive develops problems, keep reading.

More Sophisticated SMART Monitoring

All of the commands here require root access, so the first thing you want to do is open a Terminal with root access. Get a root Terminal by opening a regular Terminal and then running the following command:

Code:
# Ubuntu:
sudo -s
# Fedora:
su -

Setting Up SMART Monitoring

The first thing to do is to install the smartmontools package, which contains everything you need. How you do this varies but it's usually pretty straightforward. If it's already installed, just run the second command to make sure it starts when you boot up the computer.

Code:
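# Ubuntu:
apt-get install mailx smartmontools    # install package
update-rc.d smartmontools defaults     # start at boot time
# Fedora (the service is called smartd, as in the start command further down):
yum install smartmontools              # install package
chkconfig smartd on                    # start at boot time

On Ubuntu, also check that the daemon is enabled in its defaults file:

Code: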
# Ubuntu:
gedit /etc/default/smartmontools    # check the file for start_smartd=yes

Next, open the smartd configuration file:

Code:
gedit /etc/smartd.conf

Set the DEVICESCAN line so that reports are mailed to your address:

Code:
DEVICESCAN -m your@email.address

Now we have the SMART daemon smartd set to start whenever you boot up the system and to e-mail you detailed reports of your drive's health. All that's left to do is to start it up!

Code:
# Ubuntu:
/etc/init.d/smartmontools start
# Fedora:
service smartd start

SMART will also give you a lot of detailed information about the drive, which you can view by doing:

Code:
smartctl -a /dev/sda

Here's an example from my own hard drive:

Code:
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Second Generation Serial ATA family
Device Model: WDC WD2500KS-00MJB0
Serial Number: WD-WCANK4057361
Firmware Version: 02.01C03
User Capacity: 250,059,350,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Sat Sep 6 03:44:42 2008 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (7080) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 83) minutes.
Conveyance self-test routine
recommended polling time: ( 6) minutes.
SCT capabilities: (0x103f) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0003 201 186 021 Pre-fail Always - 4916
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 90
5 Reallocated_Sector_Ct 0x0033 197 197 140 Pre-fail Always - 22
7 Seek_Error_Rate 0x000f 200 200 051 Pre-fail Always - 0
9 Power_On_Hours 0x0032 089 089 000 Old_age Always - 8553
10 Spin_Retry_Count 0x0013 100 253 051 Pre-fail Always - 0
11 Calibration_Retry_Count 0x0012 100 253 051 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 84
190 Airflow_Temperature_Cel 0x0022 044 022 045 Old_age Always FAILING_NOW 56
194 Temperature_Celsius 0x0022 094 072 000 Old_age Always - 56
196 Reallocated_Event_Count 0x0032 199 199 000 Old_age Always - 1
197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 1
200 Multi_Zone_Error_Rate 0x0009 200 200 051 Pre-fail Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
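
The attribute table in a report like the one above can also be checked mechanically. Here is a rough sketch, not part of smartmontools: the helper name `flag_attrs` and its output format are made up for illustration. It reads smartctl output on standard input and prints every attribute whose WHEN_FAILED column is something other than `-`.

```shell
#!/bin/sh
# flag_attrs: hypothetical helper (an assumption for illustration, not a
# smartmontools command). Reads smartctl output on stdin and prints each
# attribute whose WHEN_FAILED column (field 9) is not "-".
flag_attrs() {
    awk '
        # Attribute rows: numeric ID in field 1, hex FLAG in field 3,
        # and at least the ten columns named in the table header.
        $1 ~ /^[0-9]+$/ && $3 ~ /^0x[0-9a-f]+$/ && NF >= 10 && $9 != "-" {
            printf "%s %s value=%s thresh=%s raw=%s\n", $2, $9, $4, $6, $10
        }
    '
}

# Typical use, as root:
#   smartctl -A /dev/sda | flag_attrs
```

Fed the report above, this should print only the Airflow_Temperature_Cel row, since that is the one attribute marked FAILING_NOW. The check on the FLAG column is there so that rows from the error log, which also start with numbers, are not mistaken for attributes.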
For comparison, here's the output from a drive that is failing:

Code:
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: SAMSUNG HS122JC
Serial Number: S18GJ16Q102092
Firmware Version: GQ100-01
User Capacity: 120,034,123,776 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 6
ATA Standard is: ATA/ATAPI-6 T13 1410D revision 1
Local Time is: Sat Aug 30 01:28:29 2008 EDT
==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
See vendor-specific Attribute list for failed Attributes.
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 121) The previous self-test completed having
the read element of the test failed.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x59) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 63) minutes.
SMART Attributes Data Structure revision number: 4
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 100 021 051 Pre-fail Always In_the_past 52231
3 Spin_Up_Time 0x0007 092 091 025 Pre-fail Always - 2245
4 Start_Stop_Count 0x0032 029 029 000 Old_age Always - 718447
5 Reallocated_Sector_Ct 0x0033 001 001 010 Pre-fail Always FAILING_NOW 691
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 2885
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 221
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 4372
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 703
194 Temperature_Celsius 0x0022 103 091 000 Old_age Always - 45 (Lifetime Min/Max 4/49)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 2867
197 Current_Pending_Sector 0x0032 001 001 000 Old_age Always - 340822684
198 Offline_Uncorrectable 0x0032 081 081 000 Old_age Always - 51728038
199 UDMA_CRC_Error_Count 0x0032 089 089 000 Old_age Always - 31225975
200 Multi_Zone_Error_Rate 0x0032 088 088 000 Old_age Always - 32804553
SMART Error Log Version: 1
Warning: ATA error count 65535 inconsistent with error log pointer 1
ATA Error Count: 65535 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 65535 occurred at disk power-on lifetime: 6 hours (0 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
00 50 01 00 00 00 e0
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
00 ff 01 01 00 00 a0 0a 06:03:14.156 NOP [Reserved subcommand]
c8 ff 01 00 00 00 e0 08 06:03:10.895 READ DMA
c6 ff 08 00 00 00 e0 08 06:02:54.504 SET MULTIPLE MODE
10 ff 01 78 a9 1d e0 08 06:02:54.504 RECALIBRATE [OBS-4]
91 ff 3f 18 79 00 af 08 06:02:54.504 INITIALIZE DEVICE PARAMETERS [OBS-6]
Error 65534 occurred at disk power-on lifetime: 6 hours (0 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
00 50 f1 00 4f c2 e0
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
00 d2 01 01 00 00 a0 00 06:02:54.478 NOP [Reserved subcommand]
b0 d2 f1 00 4f c2 e0 08 06:02:54.433 SMART ENABLE/DISABLE ATTRIBUTE AUTOSAVE
b0 d8 00 00 4f c2 e0 08 06:02:53.862 SMART ENABLE OPERATIONS
c6 ff 08 00 00 00 e0 08 06:02:53.862 SET MULTIPLE MODE
ef 03 45 01 07 00 a0 00 06:02:53.861 SET FEATURES [Set transfer mode]
Error 65533 occurred at disk power-on lifetime: 6 hours (0 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
00 50 a9 55 00 00 a0
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
00 ff 01 01 00 00 a0 00 06:02:45.439 NOP [Reserved subcommand]
f5 ff a9 55 00 00 a0 00 06:02:45.438 SECURITY FREEZE LOCK
ec ff aa 55 00 00 a0 00 06:02:45.437 IDENTIFY DEVICE
00 00 01 01 00 00 a0 00 06:02:40.110 NOP [Abort queued commands]
e7 00 00 30 9e 13 a0 08 06:02:39.439 FLUSH CACHE
Error 65532 occurred at disk power-on lifetime: 6 hours (0 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
00 50 00 30 9e 13 a0
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
00 00 01 01 00 00 a0 00 06:02:40.110 NOP [Abort queued commands]
e7 00 00 30 9e 13 a0 08 06:02:39.439 FLUSH CACHE
c8 00 08 29 9e 13 e8 08 06:02:38.438 READ DMA
c8 00 08 09 9e 13 e8 08 06:02:38.423 READ DMA
c8 00 10 99 48 c1 ea 08 06:02:38.422 READ DMA
Error 65531 occurred at disk power-on lifetime: 5 hours (0 days + 5 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 1a 1f 26 a8 ea Error: UNC 26 sectors at LBA = 0x0aa8261f = 178791967
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 00 39 25 a8 ea 08 05:43:41.362 READ DMA
f8 00 00 00 00 00 e0 08 05:43:41.362 READ NATIVE MAX ADDRESS
ec 00 00 00 00 00 a0 0a 05:43:41.358 IDENTIFY DEVICE
ef 03 45 00 00 00 a0 0a 05:43:41.356 SET FEATURES [Set transfer mode]
f8 00 00 00 00 00 e0 08 05:43:41.356 READ NATIVE MAX ADDRESS
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 2845 234441640
# 2 Conveyance offline Aborted by host 90% 2845 -
# 3 Extended offline Completed without error 00% 580 -
# 4 Extended offline Completed without error 00% 196 -
# 5 Short offline Completed without error 00% 5 -
# 6 Short offline Completed without error 00% 2 -
SMART Selective Self-Test Log Data Structure Revision Number (0) should be 1
SMART Selective self-test log data structure revision number 0
Warning: ATA Specification requires selective self-test log data structure revision number = 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
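
The self-test log near the end of the report above is worth a quick look of its own, because a line such as "Completed: read failure" tells you a test actually hit a bad sector. A minimal sketch for pulling those rows out follows. The helper name `failed_selftests` is hypothetical, and it assumes failed tests are logged with a status starting with "Completed: ", as in the report above.

```shell
#!/bin/sh
# failed_selftests: hypothetical helper (an assumption for illustration, not
# a smartmontools command). Reads smartctl output on stdin and prints the
# self-test log rows whose status starts with "Completed: " (for example
# "Completed: read failure"). Rows reading "Completed without error" or
# "Aborted by host" are not printed.
# Exit status is 0 if at least one such row was found, 1 otherwise.
failed_selftests() {
    grep -E '^# *[0-9]+ .*Completed: '
}

# Typical use, as root:
#   smartctl -l selftest /dev/sda | failed_selftests
```

Run against the log above, it should print only entry # 1, the extended test that ended with a read failure.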
Running Drive Self-Tests

The smartctl command also allows you to run self-tests on the drive: short and extended tests, and on drives that support them, conveyance and selective tests. The one I'll explain here is the so-called extended offline read test. This test reads every sector of the drive (while it's not busy doing other things for you) and reports if there were any errors. If you have the SMART daemon running (as above) you'll also get an e-mail with the test results if there were any problems.

Code:
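smartctl -t long /dev/sda

The test runs in the background on the drive itself. When it finishes, the result shows up in the self-test log at the end of the smartctl -a output.

If you don't need all that level of detail and just want to check whether the drive is dying, overheating, or the like, you can try this:

Code: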
smartctl -H /dev/sda

Code:
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Please note the following marginal Attributes:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022 045 022 045 Old_age Always FAILING_NOW 55

Hopefully there aren't any errors or omissions in here, but I'll update the post if any are brought to my attention. Even so, don't try to hold me responsible if your drive dies without warning. It's been known to happen to the best of us.

I think I've written enough for now to get you started. I need to go buy a fan now.
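
A follow-up to the `smartctl -H` check above, for anyone who wants to script it: smartctl reports its findings in its exit status as a bitmask, so a script does not have to parse the text. The sketch below decodes that value. The helper name `explain_smartctl_status` is hypothetical, and the messages are paraphrased from the "Return Values" section of the smartctl man page, so check your own version's man page before relying on them.

```shell
#!/bin/sh
# explain_smartctl_status: hypothetical helper (an assumption for
# illustration, not a smartmontools command). Takes a smartctl exit status
# and prints one line per bit that is set. Messages are paraphrased from the
# smartctl man page.
explain_smartctl_status() {
    code=$1
    bit=0
    for msg in \
        "command line did not parse" \
        "device open failed" \
        "a SMART command failed or a checksum error was found" \
        "SMART status check returned DISK FAILING" \
        "a prefail attribute is at or below its threshold" \
        "an attribute was at or below its threshold in the past" \
        "the device error log contains errors" \
        "the self-test log contains errors"
    do
        # Test bit number $bit of the exit status.
        if [ $(( (code >> bit) & 1 )) -eq 1 ]; then
            echo "bit $bit: $msg"
        fi
        bit=$((bit + 1))
    done
    return 0
}

# Typical use, as root:
#   smartctl -H /dev/sda
#   explain_smartctl_status $?
```

An exit status of 0 prints nothing, which is what you want to see. Anything with bit 3 set means the drive itself is reporting that it is failing.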
Last edited by error10 : 11-27-09 at 08:54 PM Reason: Added Disk Utility section

#2 (permalink)
om nom nom
Join Date: Apr 2008
Location: On earth, thankfully.
Posts: 8,248
Rep: 616
Unique Rep: 463
Trader Rating: 5
thank you. considering 2 drives of mine have died within 6 months, i'ma set this up asap.
+rep

#3 (permalink)
Overclocker
great thread, thanks

#4 (permalink)
PC Gamer
Totally off-topic, but given your avatar and location - are you Michael Hampton?
On-topic, might have a look at this, thankfully there's nothing of value on my laptop, it's all in Google's 'cloud' so to speak. I for one welcome our new alien Google overlords.

#5 (permalink)
has left the building
Yeah, that's me. What's up?

#6 (permalink)
Schubie approved shoe
Quote:

#7 (permalink)
has left the building
I'm a minor celebrity in certain circles (and apparently a potential terrorist in others), so I'm used to it by now.

#8 (permalink)
Database Developer
nice read. Thanks for the contribution.

#9 (permalink)
4.0 GHz
Join Date: Jun 2007
Location: In the middle of nowhere
Posts: 385
Rep: 66
Unique Rep: 57
Trader Rating: 0
Hehe... I've been to your website before! WELCOME TO OCN!

#10 (permalink)
PC Gamer
Not much, just a reader/Teaparty/FSP forums member. I'm not in NH so I'm not good but still.
I'll just get on the bandwagon and welcome you; I only recently joined myself.