
    RAID5 failure during migration

    cfcohen

      So I don't know if anyone will be able to help, because this is a pretty unusual situation, but I thought it was worth describing.

       

      I have (had) a three drive RAID5 array on the ICH10R chipset.  I wanted to expand the array by adding a fourth drive shortly after upgrading to a new motherboard.  I used the Rapid Storage Technology Interface to add the drive, and the data migration started and was doing well (26% complete) when I went to bed for the evening.

       

      Well, the next morning I awoke to find the (brand new) machine in some failed power management state that it would not wake up from.  I perform a hard reset and the BIOS reports that the RAID array is in the middle of a migration.   Hooray, I think.... There's hope yet that the right thing will happen.  Only it turns out that the previous hang was some kind of failed hibernation, and Windows won't boot correctly this time either.  Another hard reset and...  You guessed it -- All four drives in the array are marked failed.

       

      After confirming that Windows would boot correctly, I shut down gracefully and went back into the BIOS.  Apparently the ICH10R decides that since all four drives have failed simultaneously, maybe there's not really anything wrong, and so it asks if I would like to try and "recover" my array (Y/N).  Having little to lose at this point, I pressed "Y" three times in a row, and watched as it added drives 2, 3, and 4 back into the array.  Now my array is spontaneously marked as "degraded", which is definitely a step up from "failed".

       

      After I booted into Windows, it detected the array and CHKDSK offered to "fix" my corrupted volume...  Ha!  Not falling for that one.  I politely declined. :-)

       

      At this point, I read some reviews, did some more research, and concluded that my best bet was to try a read-only recovery program to see what could be recovered from the current configuration.  I chose R-Studio based on several positive recommendations and a reasonable price.  When I pointed the software at the failed array, it immediately detected practically all of my files, and happily recovered them all to an external drive (over about 24 hours).  Again, I breathed a deep sigh of relief.

       

      But sadly, it was premature.  It turns out that some of the recovered files are randomly scrambled, while other files are fine.  It's a little hard to tell for certain what the pattern is, but it appears that older files are ok, and newer files are corrupt.  There are a few exceptions, suggesting that the actual explanation is related to the block order on the drive, and which files were completely migrated to the fourth drive, and which were not.  I figure the migration was probably about 33% complete when the machine decided to demonstrate its inability to hibernate correctly.

       

      Obviously, I'd like to get the other third (or two-thirds) of the files back if possible as well.  I figure that my best bet is to get the array rebuilt in a degraded state using drives 1, 2, and 3 instead of 2, 3, and 4.  Then I can repeat the recovery that worked before, and hopefully the corrupt files will be fine this time, and the files that were previously recovered correctly will now be corrupt...  The question before you, fine reader, is whether I should:

       

      1. Unplug the fourth drive, failing the entire array again, in hopes that the BIOS will offer to "recover" again using drives 1, 2, and 3.

       

      2. Mark all four drives as "normal" and attempt to create a virtual array from drives 1, 2, and 3.

       

      3. Delete the array, and recreate it as a new array using the first three drives.

       

      I see little harm in trying option one first.  I figure I can always mark the drives as "normal" and fall back to option number 2.  The third sounds dangerous to me.  Does anyone have enough experience with failed arrays to suggest whether any of these approaches are more or less wise than the others?

       

      I fully understand the principles of RAID arrays, but honestly have no experience with what actually happens when you start intentionally failing more than the allowed number of drives in the array.  The official Intel documentation says that all data on a "failed" array is irrecoverably lost, but having four out of four good drives obviously leaves me in a situation that is rarely discussed.
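
      For what it's worth, the reason I believe drives 1, 2, and 3 ought to be enough is basic RAID5 parity: each stripe's parity block is the XOR of its data blocks, so any one missing member can be rebuilt from the rest.  A toy Python sketch of that arithmetic (just the principle, not the ICH10R's actual on-disk layout):

      def xor_blocks(blocks):
          """XOR a list of equal-length byte blocks together."""
          out = bytearray(len(blocks[0]))
          for block in blocks:
              for i, b in enumerate(block):
                  out[i] ^= b
          return bytes(out)

      data1 = b"hello world! "                # block on drive 1
      data2 = b"raid5 example"                # block on drive 2
      parity = xor_blocks([data1, data2])     # block on the parity drive

      # If the drive holding data2 is lost, its contents come back from the rest.
      rebuilt = xor_blocks([data1, parity])
      assert rebuilt == data2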

       

      Thanks for any comments, feedback or opinions!

        • 1. Re: RAID5 failure during migration

          I find it interesting how many problems of this type on the forum go unanswered by anyone at Intel.  I think it says a lot about corporate commitment.

           

          I have done much the same as you have.  I had a 3x2TB RAID 5 array that I wanted to expand by adding one additional identical 2TB drive.  The so-called user's manual for the ICH10R says to plug it in, go to the console and add it to the array, and reboot.  I did not reboot, as I saw the migration proceeding.  But after 2 DAYS, and having gotten to all of 33%, I assumed something was wrong and a reboot really was needed.  What Intel doesn't say is that upon reboot the BIOS says the array is Migrating and Windows wants to do a CHKDSK.  But the CHKDSK seems to be interrupted by whatever the migration process is.  I can hear the drives churning the data just as before, but I have been stuck on the chkdsk-do-you-want-to-cancel-this query for 2 days.  So, Intel - if anyone is awake there - how long will this Migration take, and why don't you explain how to estimate the length of this process in your manuals?  We are not mind readers.

          • 2. Re: RAID5 failure during migration

            Since nobody responded to my original post, I didn't think anyone else cared.  But since you seem to have experienced a similar problem, I'll relay the outcome of my crisis.  I was able to recover almost all of my data after several days of agonizing file recovery.  As I mentioned in my previous post, I used the R-Studio product from Data Recovery Software and was able to mount the partially migrated drive and recover the majority of the data.  I ultimately ended up marking all four drives as "normal", and attempting to build a virtual RAID array.

             

            Sadly, none of the files (except the ones I'd already recovered) were restored properly.  The files that were corrupt in the first restore were also corrupt in the second restore.  I then embarked on a week-long saga of trying to replace the majority of the files from old backups, Internet downloads, my wife's machine, etc.  I was able to make a list of the "corrupt" files by inspecting the thumbnails and so forth of the files restored earlier.  I also happened to have a complete list of the MD5s of all files on my system from a couple months before the failure, which turned out to be extremely useful for validating which files were restored correctly.  Since this is unlikely to be of any use to you, I'll move on to the relevant part of the story.
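
            Before I do, for anyone who also has an old MD5 list lying around, the check itself is only a few lines of Python.  This is just a sketch - the manifest file name, the recovery path, and the "md5sum"-style manifest format below are placeholders, not necessarily what I actually used:

            import hashlib
            from pathlib import Path

            MANIFEST = Path("old_md5_list.txt")   # saved list of "<md5>  <path>" lines
            RECOVERED = Path("E:/recovered")      # where the recovered files were written

            def md5_of(path, chunk=1 << 20):
                h = hashlib.md5()
                with open(path, "rb") as f:
                    for block in iter(lambda: f.read(chunk), b""):
                        h.update(block)
                return h.hexdigest()

            corrupt = []
            for line in MANIFEST.read_text().splitlines():
                if not line.strip():
                    continue
                digest, _, rel = line.partition("  ")
                candidate = RECOVERED / rel.strip()
                if candidate.is_file() and md5_of(candidate) != digest:
                    corrupt.append(rel.strip())

            print(f"{len(corrupt)} recovered files do not match their pre-failure MD5s")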

             

            Once I was down to a relatively small list of important files that I really wanted back (but still couldn't find anywhere), I got serious about using R-Studio.  In particular, I knew some strings that were certain to occur in a TeX (ASCII) file that I wanted to restore.  Using the R-Studio full-disk search option, I got a list of the blocks that contained this string.  By manually inspecting these blocks, I was able to figure out that this particular file was striped across three drives (not four), which was consistent with being in the middle of a migration step to add an additional drive.  Further, the correct RAID block ordering for this portion of the drive was NOT the same as the volume that I had recovered earlier.  In particular, the "ordering" option turned out to be "left asynchronous" or some other non-standard ordering that was unexpected.  I forget the exact details, and can no longer check since I've reformatted the volumes now.
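
            R-Studio did that search for me, but the same idea is easy to reproduce against a raw image of one member drive: scan for a string you know is in the file, and note the byte offset and which stripe it lands in.  A rough Python sketch - the image name, the search string, and the 64 KiB stripe size are assumptions for illustration, not my actual values:

            STRIPE = 64 * 1024                    # assumed stripe size; use your array's value
            NEEDLE = b"\\documentclass"           # a string known to occur in the TeX file
            CHUNK = 16 * 1024 * 1024

            def find_offsets(image_path, needle=NEEDLE):
                hits = []
                offset = 0
                tail = b""
                with open(image_path, "rb") as f:
                    while True:
                        buf = f.read(CHUNK)
                        if not buf:
                            break
                        data = tail + buf
                        start = 0
                        while (pos := data.find(needle, start)) != -1:
                            absolute = offset - len(tail) + pos
                            hits.append((absolute, absolute // STRIPE))
                            start = pos + 1
                        tail = data[-(len(needle) - 1):]   # keep overlap across chunk boundary
                        offset += len(buf)
                return hits                                # (byte offset, stripe number) pairs

            for off, stripe in find_offsets("drive2.img"):
                print(f"hit at byte {off} (stripe {stripe})")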

             

            But the point is that half of my array was in a different ordering than the other half, and I had to manually detect the correct ordering using data that I knew came from the "corrupt" portion of the earlier restore.  I was disappointed to find that the critical NTFS root filesystem records had apparently been converted to the new block ordering, and so R-Studio was unable to find a "filesystem" on the correctly striped array.  I was however able to point it at the correct block range determined manually, and it correctly recovered the entire file I was working on.  Having confirmed that the correct block ordering was chosen, I told R-Studio to "scan" the drive looking for lost and deleted files of known types.

             

            As with any recovery mechanism that works without the benefit of correct filesystem pointers, not all of the recovered files were correct.  One of the most common problems was corrupt files that were really in the four-drive stripe configuration (these typically had legitimate names and bogus data).  Another common problem was files with ambiguous endings (PDF & WMV) that were padded out with zeros to the former "size on disk".  In a few cases, I recovered completely intact video files that just had an extra gigabyte or two appended to the end of them. :-)  But there were many files that had bogus names and correct data.  For example: 120934.pdf for roughly the 120,934th PDF that was restored.  In many cases I could tell by simple visual inspection which files were on my short list for priority recovery.  Searching the newly recovered files for ones with identical or slightly larger file sizes was useful for limiting the search to a few candidates.
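
            The file-size matching is also easy to script.  A rough sketch of what I mean - the scan directory, the example file name and size, and the 64 KiB slack allowance for zero-padding are all placeholders:

            from pathlib import Path

            RECOVERED = Path("E:/recovered_scan")   # output of the R-Studio "scan" pass
            SLACK = 64 * 1024                       # allow padding out to the "size on disk"

            # name -> known byte size, taken from backups or the earlier restore attempt
            wanted = {"vacation_042.jpg": 3_481_600}

            files = [p for p in RECOVERED.rglob("*") if p.is_file()]

            for name, size in wanted.items():
                candidates = [p for p in files if size <= p.stat().st_size <= size + SLACK]
                print(f"{name}: {len(candidates)} candidate(s)")
                for p in candidates[:10]:
                    print("   ", p)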

             

            Using this technique, I was able to recover several thousand of the most important outstanding files.  I still don't understand why I wasn't able to recover all of the remaining files on the second pass.  It must be related to files in mid-migration or failures of the R-Studio lost-file recovery algorithm.  In the end, I lost about 10,000 files totalling 52GB.  The majority of those were files I didn't care much about.  The real losses amounted to a few hundred photos and a few miscellaneous binary files that were created by me. :-(  Still, that's not bad considering the volume was nearly full at 2TB and 400,000 files when my migration failure occurred.  I've switched to RAID1 mirroring from now on.

             

            Hope this is helpful in recovering some of your data.  Obviously, when these kinds of "followed the instructions" failures occur, there's little advice available except that all of your data is unexpectedly gone forever. :-(

            • 3. Re: RAID5 failure during migration

              It seems that Intel only hosts forums and doesn't participate in them.  Sorry you lost most of your data.

               

              My saga has moved on.  After 2 days of waiting on the chkdsk to clear, I took matters into my own hands - the same ones that got me into this.  I powered off, unplugged the 4 RAID drives, and powered back on.  I booted up and then started the Matrix console.  I then plugged in the drives, and after about a minute they all showed up green, but the array was listed as failed.  I clicked on rebuild, and after waiting overnight it was successfully recovered.

               

              I rebooted and then noticed the RAID array is not listed in My Computer.  Under the Disk Manager it asked me to select either MBR or GPT for it.  After selecting GPT, I then have a drive that is listed as "unallocated."  If I right-click on it, the only option is to make a simple volume.

               

              I simply don't know what it will do next - hook up a recovered array, or wipe it clean.  Any ideas?