2 Replies Latest reply on May 16, 2014 1:30 AM by pentoli

    MFSYS25 - One Drive in Storage Array Failing Causes Entire Chassis to Fail?

    ckoeber

      Hello,

       

      We have an MFSYS25 chassis (details below) which has a strange but rather critical problem.

       

      Whenever a drive in our RAID setup fails, the entire system locks up. The fans speed up, all of the compute modules restart but make no progress, and we can barely manage the system because the management console hangs. The only fix we have found is to unplug the system's power completely, plug it back in, and then start the rebuild on the failed drive. After that, we can start the compute modules and bring the various servers back up.

       

      Has anyone else experienced this? What can we do to add some resiliency to the system?

       

      Thank you for your time.

       

      Details of our MFSYS25 System are as follows:

       

      1. Chassis Management Module: Part Number: D70735-403
      2. Server Storage Module: Part Number: D70737-404
      3. Gigabit Ethernet Switch: Part Number: D70739-404
      4. Six (6) Server Compute Modules: Part Number: D70726-404
      5. Firmware Versions:
        Component                 Item          Status          Version
        Server 1                  BMC Firmware  ok              1.36.6
                                  BMC Boot      ok              0.10
                                  BIOS          ok              SB5000.86B.10.00.0050.083120090939
        Server 2                  BMC Firmware  ok              1.36.6
                                  BMC Boot      ok              0.10
                                  BIOS          ok              SB5000.86B.10.00.0050.083120090939
        Server 3                  BMC Firmware  ok              1.36.6
                                  BMC Boot      ok              0.10
                                  BIOS          ok              SB5000.86B.10.00.0050.083120090939
        Server 4                  BMC Firmware  ok              1.36.6
                                  BMC Boot      ok              0.10
                                  BIOS          ok              SB5000.86B.10.00.0050.083120090939
        Server 5                  BMC Firmware  ok              1.36.6
                                  BMC Boot      ok              0.10
                                  BIOS          ok              SB5000.86B.10.00.0050.083120090939
        Server 6                  BMC Firmware  ok              1.36.6
                                  BMC Boot      ok              0.10
                                  BIOS          ok              SB5000.86B.10.00.0050.083120090939
        Switch 1                  Firmware      ok              1.0.0.27
                                  Boot          ok              1.0.0.6
        Switch 2                  Firmware      not present     --
                                  Boot          not present     --
        Storage Control Module 1  Firmware      ok              3.10.140.2
        Storage Control Module 2  Firmware      not present     --
        System Fan 1              Firmware      ok              1.2
                                  Boot          ok              1.2
        System Fan 2              Firmware      ok              1.2
                                  Boot          ok              1.2
        I/O Fan                   Firmware      ok              1.2
                                  Boot          ok              1.2
        Power Supply 1            Firmware      not applicable  --
                                  Boot          not applicable  --
        Power Supply 2            Firmware      not applicable  --
                                  Boot          not applicable  --
        Power Supply 3            Firmware      not applicable  --
                                  Boot          not applicable  --
        Power Supply Blank 4      Firmware      ok              1.2
                                  Boot          ok              1.2
        • 1. Re: MFSYS25 - One Drive in Storage Array Failing Causes Entire Chassis to Fail?
          emilec

          What firmware version is your chassis on? Might be worth loading the latest version if you haven't already.

           

          Which drives are you using? Are they on the Intel compatibility list? Are they all the same firmware version?

           

          With RAID in general, I have seen some odd things where a single faulty disk can cause entire volumes to go offline or make other disks "disappear". Even a faulty backplane can cause similar weirdness. If the same drive is repeatedly going offline, maybe start by replacing that drive or moving it to a different slot to see whether the problem follows the drive or the slot.
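
          The drive-versus-slot check can be sketched as a small Python helper. This is only an illustration: `localize_fault` and the (serial, slot) log format are hypothetical, not part of any Intel tool, and the failure events would have to be noted by hand each time a drive drops out.

```python
def localize_fault(failures):
    """Guess whether repeated failures follow a drive or a slot.

    failures: list of (drive_serial, slot_number) pairs, one per
    observed failure event. Both fields are hypothetical labels
    recorded manually; nothing here talks to the chassis.
    """
    if len(failures) < 2:
        # A single event cannot separate drive from slot.
        return "inconclusive"

    drives = {serial for serial, _ in failures}
    slots = {slot for _, slot in failures}

    if len(drives) == 1 and len(slots) > 1:
        # Same drive failed in different slots: suspect the drive.
        return "drive"
    if len(slots) == 1 and len(drives) > 1:
        # Different drives failed in the same slot: suspect the slot
        # (or the backplane behind it).
        return "slot"
    return "inconclusive"


if __name__ == "__main__":
    # Example: the same drive failed again after being moved.
    print(localize_fault([("SN-A", 3), ("SN-A", 5)]))
```

          If the result stays "inconclusive" after a swap, that suggests looking at things shared by all drives, such as the storage control module or chassis firmware.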

          • 2. Re: MFSYS25 - One Drive in Storage Array Failing Causes Entire Chassis to Fail?
            pentoli

            I had the same thing happen 2 weeks ago after one drive failed.

             

            I was still able to connect to the configuration web page, but the rebuild of the drive sat at 0% for 5 hours. All VMs on internal storage were shut down; however, the VMs on the vtrak were still running. I was unable to bring the storage up again and had to pull all 4 power cables, as you described.

             

            I have had drives fail in the past that rebuilt without problems.

             

            To reduce the risk of a similar failure, I split my internal storage into 2 storage pools, hoping that a crash will affect only the pool containing the broken drive.