5 Replies Latest reply on May 22, 2014 3:54 PM by KernelSoftware

    Modular server array rebuild taking more than a week

    KernelSoftware

      We have a customer with a RAID 5 array of 5 x 136 GB 10K SAS drives on a Modular Server.  Over a week ago, one of the drives started getting timeout errors.  The server automatically marked the drive as stale and unused and began rebuilding the array using a dedicated spare.  However, it is now about 10 days later and the background task indicates the rebuild is only at 32% complete.  Is it possible that such a small array could really take this long?  Or do we need to take other action?  The stale drive is still physically in the server and powered up, but it continues to get a timeout and reset event every minute or so.

       

      What is the proper action here?  We are afraid of ejecting the stale drive until the rebuild is complete, but maybe the failing drive is causing the array to have poor performance.  Any guidance, comments, or suggestions are greatly appreciated.  Thank you.

       

      Message was edited by: Tim Sagstetter

      The rebuild is proceeding at about 1% per day.  At this rate, it will take months to complete.
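
      As a rough sanity check on the "months" figure, here is a minimal Python sketch that projects the remaining time from the numbers in this post (32% complete, about 1% per day).  It assumes the rate stays constant, which is only an approximation, and the function name is made up for illustration.

```python
def remaining_days(percent_complete, percent_per_day):
    """Days left if the rebuild keeps progressing at a constant rate."""
    if percent_per_day <= 0:
        raise ValueError("rate must be positive")
    return (100.0 - percent_complete) / percent_per_day


# Figures from the post: 32% done, about 1% per day.
days = remaining_days(32, 1.0)
print(f"{days:.0f} days, roughly {days / 30:.1f} months")
```

      With those inputs the projection is about 68 more days, which matches the "months" estimate.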

        • 1. Re: Modular server array rebuild taking more than a week
          Dan_O

          I have seen very slow rebuilds with some Toshiba drives, or when there was too much rotational vibration interfering with the normal rebuild.  Normally, a rebuild takes about 5 min/GB.  If it took two and a half days, I'd say that's not unreasonable, but 10 days is definitely too much.
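
          To put numbers on that rule of thumb, here is a small Python sketch.  The 5 min/GB figure and the 5 x 136 GB layout come from this thread.  Whether the rate applies to a single drive or to the whole array's raw capacity is not stated, so both readings are computed; the function name is hypothetical.

```python
MINUTES_PER_GB = 5.0  # rule-of-thumb rebuild rate quoted in the thread


def rebuild_hours(capacity_gb, minutes_per_gb=MINUTES_PER_GB):
    """Estimated rebuild time in hours for the given capacity."""
    return capacity_gb * minutes_per_gb / 60.0


drive_gb = 136    # per-drive capacity from the original post
drive_count = 5   # drives in the RAID 5 array

per_drive = rebuild_hours(drive_gb)                  # one drive's worth of data
whole_array = rebuild_hours(drive_gb * drive_count)  # raw capacity of all drives

print(f"single drive: {per_drive:.1f} h")
print(f"whole array:  {whole_array:.1f} h ({whole_array / 24:.1f} days)")
```

          The whole-array reading comes out at a bit under two and a half days, which lines up with the "not unreasonable" figure above.  Either way, 10 days at 32% is far outside the estimate.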

           

          Can you check what make/model your hard drives are, and see if your I/O fan module is D91260-004 or higher?

          • 2. Re: Modular server array rebuild taking more than a week
            KernelSoftware

            Thanks, Dan.  The five drives are a mix of Seagate ST9146802SS and ST9146803SS.  The completion time is now estimated at 2,225 hours.  The Front I/O Fan part number is D70745-403.

             

            Are you suggesting this part has an issue which might cause this behavior?

             

            Do you think it is safe to eject the drive that is being replaced by the spare?  Could its continual timeouts be affecting the performance of the remaining drives?

             

            Thanks, again.

            • 3. Re: Modular server array rebuild taking more than a week
              Dan_O

              If the 10K.2 drives (ST9146802SS) and the 10K.3 drives (ST9146803SS) are in the same storage pool, that could also be contributing to the slow rebuild.

               

              The stale drive can be ejected.  The RAID should be rebuilding from everything except that drive, so it shouldn't be in use at all.

               

              I think you might have the older I/O fan module.  Can you also check if your power supplies are D73299-008 or higher?  Both of those hardware parts had a higher RVI than the newer ones.

              • 4. Re: Modular server array rebuild taking more than a week
                KernelSoftware

                Thanks, Dan.  We'll toss the stale drive.  While not part of the rebuild, it is still getting timeouts and being reset every minute.  We have moved all content of the problem array onto other storage at this point, so we're just going to cancel the rebuild, delete the storage pool, and start over.

                 

                This entire modular server is five years old now, so a lot of it is out of date.  So far, no component has failed, though.  Thank you for your insights.

                • 5. Re: Modular server array rebuild taking more than a week
                  KernelSoftware

                  Dan, one final note on this issue.  After we ejected the stale drive, the rebuild completed within 20 minutes.  Thus, it appears that leaving the drive inserted was the cause of the issue.  In this case, the drive was seen, but could not come ready.  During each attempted start, it appears the rebuild would pause.  So, it seems any drive that fails to come online should be removed as soon as possible.  Thanks again for your comments.