
    random drive failure / improvement for raid 5 rebuild

    krisha

      Hi,

       

      For some weeks now I have had problems with my ICH8R RAID 5 consisting of 3 Samsung HD501LJ 500 GB drives. I was using Win XP SP3 x64 with Matrix Storage 8.(something). It ran without trouble for 2-3 years. I did not upgrade any driver or software, but now from time to time one of the drives fails... At first I just clicked rebuild (which took around 15 hours each time) and everything was fine. Then it started happening more often, so I decided not to boot from this RAID array anymore but from an SSD running Win 7 x64, and to rebuild the array from there. Suddenly, during the rebuild, a second drive reported an error and the RAID was completely dead.

      I bought a 2 TB hard disk and created a raw copy of all the drives using Linux. With Linux and dmraid I was able to rescue all important data. I also wrote a small tool that scanned the hard disks to find out in which area the errors occurred and how many there were, since Matrix Storage 8.x did not show this. I booted back into Win 7, and now 2 of the drives were OK again (how come?). I rebuilt the array and ran chkdsk. In the end some files were destroyed. I was wondering what happened...
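
      My tool isn't posted here, but a minimal sketch of such a scan under Linux could look like the following (Python; the device path /dev/sdb and the 1 MiB chunk size are just placeholders, and it only records which offsets the kernel fails to read, not the exact failing sector):

      import os

      DEVICE = "/dev/sdb"      # placeholder: raw disk (or image) to scan, needs root
      CHUNK = 1024 * 1024      # scan in 1 MiB steps

      def scan(device=DEVICE, chunk=CHUNK):
          fd = os.open(device, os.O_RDONLY)
          size = os.lseek(fd, 0, os.SEEK_END)   # block devices report their size via lseek
          errors = []
          offset = 0
          while offset < size:
              try:
                  os.pread(fd, min(chunk, size - offset), offset)
              except OSError:
                  errors.append(offset)         # unreadable region starts somewhere in this chunk
              offset += chunk
          os.close(fd)
          return size, errors

      if __name__ == "__main__":
          total, bad = scan()
          print("scanned %d bytes, %d unreadable chunks" % (total, len(bad)))
          for off in bad:
              print("read error near offset %d" % off)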

       

      I was developing a USB device and had some blue screens, so I thought this could be the problem... maybe the disks end up out of sync after a BSOD. Can anybody confirm this?

       

      I also checked the SMART status of the drives; everything seems to be OK, but one drive had an extremely high count for "Hardware ECC Recovered" - since this is a vendor-specific value, I don't know whether it is related. But anyway, this was only on one hard disk - so if there were a problem with that hard disk, only that one should fail, not randomly 1 of the 3 drives.

       

      OK, by now I had been running the array for around 4 days without problems and without any BSOD or system failure, but today it happened again. I found out that Matrix Storage has been replaced by RST, so I installed it. I'm surprised that it now shows the count of erroneous stripes (nice improvement), and the rebuild is also much faster!

       

      I also remember that a few seconds/minutes before the RAID became degraded, I opened the window because it was very hot. I think the temperature fell by around 10°C in just 1-2 minutes. Could this be my problem? Somehow I can't believe it ;-)

       

      OK, and now the suggestions:

      *) Add an option to RST to fix only the defective stripe! A single stripe can be fixed in a few milliseconds and no full rebuild is needed (see the parity sketch after this list). If you take the data from 2 drives to rebuild the 3rd drive, it could be that one of the "working" drives also delivers a defective sector during the repair. That means that after the rebuild you have more errors than before, and you (or RST) cannot tell which drive has newly gone bad. Sometimes less is more ;-)

      *) Add more information about the error (not just "degraded") - how many bits/bytes are defective?
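
      To illustrate the single-stripe repair idea above: in RAID 5, any one block of a stripe is just the XOR of the other blocks plus parity, so a single bad stripe can be recomputed and rewritten on its own instead of resyncing the whole array. A toy sketch (Python, 3 drives, made-up block contents):

      from functools import reduce

      def xor_blocks(blocks):
          # byte-wise XOR of equally sized blocks
          return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

      # one stripe across a 3-drive RAID 5: two data blocks + one parity block
      d0 = bytes([0x11, 0x22, 0x33, 0x44])
      d1 = bytes([0xAA, 0xBB, 0xCC, 0xDD])
      parity = xor_blocks([d0, d1])

      # drive 1 reports a read error for this stripe: rebuild only its block
      rebuilt_d1 = xor_blocks([d0, parity])
      assert rebuilt_d1 == d1   # only this one stripe needs to be rewritten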

       

      How does the ICH8R detect the defective hard disk? If there is a sector error on one hard disk, or one hard disk sends back invalid data, there is no way to know which hard disk contains the error, right? Maybe it would be nice to integrate this smoothly with the filesystem so integrity can be checked before repairing - that would also determine which 2 of the 3 hard disks should be used for rebuilding the missing data.
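
      To put that question in concrete terms: with a single parity block, a mismatch only tells you that some block in the stripe is wrong, not which one, unless a drive actually reports a read error. A small self-contained illustration (same toy values as above):

      from functools import reduce

      def xor_blocks(blocks):
          return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

      d0 = bytes([0x11, 0x22, 0x33, 0x44])
      d1 = bytes([0xAA, 0xBB, 0xCC, 0xDD])
      parity = xor_blocks([d0, d1])

      # drive 0 silently returns one flipped bit, without reporting any error
      bad_d0 = bytes([d0[0] ^ 0x01]) + d0[1:]

      # the controller can detect that the stripe is inconsistent...
      assert xor_blocks([bad_d0, d1, parity]) != bytes(4)

      # ...but exactly the same mismatch would appear if d1 or the parity
      # block were the corrupted one, so parity alone cannot say which of
      # the 3 drives should be rebuilt - which is why filesystem-level
      # integrity checks would help here.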

       

      I hope to see a nice discussion and maybe some improvements in later versions (and maybe also some answers about what happened to my RAID 5).