I have noticed some people having problems with their RAID arrays, and I have also noticed that no one pulls the logs. The logs should be your first port of call any time you have an issue with an array. Even if a disk drops offline and comes back online, you should pull the logs and check for errors. The logs can also save an array. Say two disks drop offline in a RAID 5: you now have a dead array, and you need to force online the last disk that dropped off (because it has the latest parity data). The logs will tell you which disk that was. They will also show failing disks, predictive failures, and timeouts on the disks, giving you time to replace a faulty disk before it fails completely.
The PERC is an enterprise controller; as such it will protect the data on the disks at all costs, so it will kick a disk from the array on a whim if it thinks the disk is unreliable. Its tolerance is much tighter than that of a cheaper/home solution.
Always keep the controller drivers and firmware up to date, and always do the driver first. Keep the disk firmware up to date as well. Most firmware updates do not improve performance but will improve reliability.
Also note that below I will be talking about RAID 5 only, though the errors from the logs are applicable to any RAID level.
Some general RAID 5 troubleshooting:
RAID 5 offers redundancy against a single disk failure. It's not bulletproof, so if your data is important, think about tape backup or some other form of backup. I understand most folks here are using this as a home solution, but it's still something to note.
Have a hot spare.
Foreign config (RAID 5 and 1/10): Occurs if a disk is kicked or falls out of the array. The config is now kept on the disks, so if the config on one disk does not match the config on the other disks, it shows up as foreign. ALWAYS CLEAR THE FOREIGN CONFIG ON THE DISK. This is done in the controller BIOS or through Dell OpenManage. Never import a foreign config, as you may be importing bad parity with it. Just clear it and allow the disk to rebuild back into the array. (This is only relevant in a single disk failure, not if the card fails; that's another issue.)
Set patrol read on: Just make sure it's on. It is on by default on the PERC 5/6/H700, but double check. As for running it, once a week should be fine. Patrol read will also start if the controller calls for it because it thinks there is a problem. If problems are found, it will call the consistency check as well.
Gathering logs: Windows only (they can be gathered on Linux as well, but I will have to get the commands to pull them; will update ASAP)
Go to http://www.lsi.com/downloads/Public/Obsolete/Obsolete%20Common%20Files/1.01.39_Windows_Cli.zip
Download and extract the .zip to c:\megaCLI
Run CMD and browse to the folder. Once there, run:
MegaCLI -FwTermLog -Dsply -aALL > tty.log
It's case sensitive, so be careful. That will dump the controller logs into tty.log in the same folder.
It's just a text file and opens with notepad.
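If you would rather script the dump than type it each time, a minimal Python sketch could look like the one below. It assumes the MegaCLI executable is in the current folder or on the PATH under the name used above; the function names and the `megacli_exe` parameter are my own and not part of MegaCLI.

```python
import subprocess
from pathlib import Path


def build_fwtermlog_command(megacli_exe="MegaCLI"):
    """Return the argument list that dumps the firmware log for all adapters."""
    # Same options as the manual command: -FwTermLog -Dsply -aALL
    return [megacli_exe, "-FwTermLog", "-Dsply", "-aALL"]


def dump_controller_log(out_path="tty.log", megacli_exe="MegaCLI"):
    """Run MegaCLI and write its output to out_path, like '> tty.log' does."""
    with open(out_path, "w") as fh:
        # check=True raises CalledProcessError if MegaCLI exits with an error.
        subprocess.run(build_fwtermlog_command(megacli_exe), stdout=fh, check=True)
    return Path(out_path)
```

Calling `dump_controller_log()` from the c:\megaCLI folder should leave you with the same tty.log as the manual command.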
HOW TO READ THE LOGS/WHAT TO LOOK FOR.
The top of the log shows the controller and firmware version:
PERC 5/i Integrated 0:
T0: LSI Logic MegaRAID firmware loaded
T0: Firmware version 1.00.01-0088 built on Apr 17 2006 at 18:02:55
T0: Board is type 1028/0015/1028/1f03
As you scroll down the logs you will see "T" at the beginning of each line; this is the card posting. After that you get a date and time stamp (very, very useful).
If you search the logs you will find a section like below:
04/11/11 8:30:00: PD  Flags    State Type Size     S NCQ Vendor  Product       Rev  P C ID SAS Addr         Port Phy DH
04/11/11 8:30:00: --- -------- ----- ---- -------- - --- ------- ------------- ---- - - -- ---------------- ---- --- --
04/11/11 8:30:00: 0   00c00005 00020 00   00000000 0 0   SEAGATE ST3300655SS   S527 0 0 00 5000c50008fe857d 00   00  0a
04/11/11 8:30:00: 1   f0400005 00020 00   22ecb25b 0 0   SEAGATE ST3300655SS   S527 0 0 01 5000c50008fea0fd 01   01  0b
04/11/11 8:30:00: 2   f0400005 00020 00   22ecb25b 0 0   SEAGATE ST3300655SS   S527 0 0 02 5000c50008fe8399 02   02  0c
04/11/11 8:30:00: 3   f0400005 00020 00   22ecb25b 0 0   SEAGATE ST3300655SS   S527 0 0 03 5000c50008fe8685 03   03  0d
04/11/11 8:30:00: 20  00400005 00020 0d   00000000 0 0   DP      BACKPLANE     1.05 0 0 20 5001e0f0356bcf00 09   08  09
04/11/11 8:30:00: 100 00400005 00020 03   00000000 0 0   LSI     SMP/SGPIO/SEP 0396 0 0 ff 0                00   ff  00
(This is a bit messy here; once you open a log in notepad it's much easier to read.)
That's general info on the disks, but it also gives each disk's firmware rev, which can be important.
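If you want to pull the firmware revs out of that table without eyeballing it, something like the Python sketch below might help. It assumes every device row has the same 17 whitespace-separated fields as the sample above and that vendor and product names contain no spaces; `parse_pd_table` and the `COLUMNS` list are my own names, not anything MegaCLI provides.

```python
# Column names copied from the header row of the sample; "SAS Addr" is one column.
COLUMNS = ["PD", "Flags", "State", "Type", "Size", "S", "NCQ", "Vendor",
           "Product", "Rev", "P", "C", "ID", "SAS Addr", "Port", "Phy", "DH"]


def parse_pd_table(log_text):
    """Return one dict per device row found in the PD table of a log dump."""
    rows = []
    for line in log_text.splitlines():
        # Drop the "date time: " prefix; the first ": " ends the timestamp.
        _, sep, rest = line.partition(": ")
        fields = rest.split()
        # Device rows start with a numeric PD index and have exactly 17 fields.
        if sep and len(fields) == len(COLUMNS) and fields[0].isdigit():
            rows.append(dict(zip(COLUMNS, fields)))
    return rows
```

Feed it the contents of tty.log and print `row["PD"]`, `row["Product"]` and `row["Rev"]` for each row to get a quick firmware list. Other lines that happen to have 17 fields could slip through, so treat the output as a starting point.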
The logs will also show the patrol reads and any rebuilds that occurred, as well as events like:
T21: EVT#09685-T21: 91=Inserted: PD 00(e0x20/s0)
As you can see, someone inserted PD (physical disk) 00, so the disk was pulled and then reseated (hot swap disk).
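To list events like that one without scrolling, a small Python sketch along these lines could work. The pattern is built only from the single example line above, so other event layouts in a real log may not match; `find_events` is a name I made up.

```python
import re

# Matches e.g. "EVT#09685-T21: 91=Inserted: PD 00(e0x20/s0)"
EVENT_RE = re.compile(r"EVT#(\d+)-T\d+:\s*(\d+)=(.*)")


def find_events(log_text, keyword=None):
    """Return (sequence, code, description) for each event, optionally filtered."""
    events = []
    for match in EVENT_RE.finditer(log_text):
        seq = match.group(1)
        code = int(match.group(2))
        text = match.group(3).strip()
        # Keep everything when no keyword is given, else do a case-insensitive match.
        if keyword is None or keyword.lower() in text.lower():
            events.append((seq, code, text))
    return events
```

For example, `find_events(log_text, "inserted")` returns only the insert events, which is handy for checking whether a disk was reseated.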
What to look for.
Predictive failure: Just search the logs for "pred". If you find a disk with a predictive failure, just replace it. Such disks can last a few days to weeks, but they will fail. This is the disk itself failing and cannot be resolved by firmware/drivers.
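Notepad's search does the job, but if you have several logs to go through, a tiny Python helper like this sketch can do the same search and report line numbers. `find_in_log` is just an illustrative name, and it does nothing smarter than a case-insensitive substring match.

```python
def find_in_log(log_text, needle="pred"):
    """Return (line_number, line) pairs for lines containing needle, ignoring case."""
    needle = needle.lower()
    return [
        (number, line.strip())
        for number, line in enumerate(log_text.splitlines(), start=1)
        if needle in line.lower()
    ]
```

The same helper works for any other search term, for example `find_in_log(log_text, "bad_lba")`.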
Bad_lba: A bad sector/block on the disk. The disk can take a few of these, but really, if you're seeing more than 5 or 6, think about replacing the disk.
(A very low level format might work, but I would replace the disk.)
Corrected medium error (VD 00/0 at 357bf80d): Again, a media error on the disk. In this case it was corrected; in some cases it is uncorrectable. If you see more than 5 or 6 of these, replace the disk. If you see one, copy the block number "357bf80d" and search the logs, making sure it's not on any other disks. If it is, you're in trouble.
Background Initialization detected uncorrectable multiple medium errors (PD 01(e1/s1) at 34b9cdda on VD 00/0). In this case the error was not correctable. Again, a single occurrence is nothing to worry about on its own, but if you see multiple occurrences, replace the disk.
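The "search for the same block on other disks" check can be sketched in Python as below. It assumes medium error lines carry the block as a hex number after " at " and name the disk as "PD nn(", as in the two examples above; real logs may word things differently, and `blocks_seen_on_multiple_disks` is a made-up name.

```python
import re
from collections import defaultdict

BLOCK_RE = re.compile(r"\bat ([0-9a-fA-F]{6,})\b")  # e.g. "at 357bf80d"
PD_RE = re.compile(r"\bPD ([0-9a-fA-F]+)\(")        # e.g. "PD 01(e1/s1)"


def blocks_seen_on_multiple_disks(log_text):
    """Map each medium-error block to the PDs it appears with, if more than one."""
    lines = log_text.splitlines()
    # Pass 1: collect block numbers reported on medium error lines.
    blocks = set()
    for line in lines:
        if "medium error" in line.lower():
            blocks.update(block.lower() for block in BLOCK_RE.findall(line))
    # Pass 2: for every line mentioning one of those blocks, note which PDs it names.
    disks_by_block = defaultdict(set)
    for line in lines:
        lowered = line.lower()
        for block in blocks:
            if block in lowered:
                disks_by_block[block].update(PD_RE.findall(line))
    return {block: sorted(pds) for block, pds in disks_by_block.items() if len(pds) > 1}
```

An empty result means no block from a medium error line showed up against more than one physical disk. Lines that only name the VD are not attributed to any disk, so it is still worth a manual look.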
T81: DM: Timing wheel expired - Chip 0 Slot 4e
Timeout on the disks. Usually this is firmware, but not always. The PERC will try to talk to the disk. If it gets no reply, it will deem the disk unsafe to write data to and kick it from the array. Update the firmware on the disk, reseat it, and allow it to rebuild back into the array. If you keep getting timeouts, again, replace the disk.
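To see whether the timeouts keep hitting the same slot, a short Python sketch like this one can count them. It matches only the "Timing wheel expired - Chip X Slot Y" wording shown above, and `count_timeouts` is my own name for it.

```python
import re
from collections import Counter

TIMEOUT_RE = re.compile(r"Timing wheel expired - Chip (\w+) Slot (\w+)")


def count_timeouts(log_text):
    """Count 'Timing wheel expired' lines per (chip, slot) pair."""
    return Counter(match.groups() for match in TIMEOUT_RE.finditer(log_text))
```

A slot that shows up again and again after a firmware update and reseat is the one to look at for replacement.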
Those are the main ones to look out for. I spent years reading PERC logs, so it's something you get used to and not really something you can teach. As you might expect, there are many other errors that can occur and show up in the logs, but for the most part what I have listed above is what's most likely to be seen.
Any questions or comments, let me know.