5 Replies Latest reply on Sep 5, 2014 9:51 AM by kevin_intel

    Repeatable RAID1 Mirror Corruption on 2008 R2 Server with Intel RSTe Controller with Intel SSD Drives

    Stephen_Done

       

      We are experiencing random, occasional but catastrophic array corruption on two servers that are on test before being moved to a hosting centre.

       

       

       

      SERVER CONFIGURATION

       

      SuperMicro SuperServer 6017R-TDLRF 1U Server

       

      Incorporating SuperMicro X9DRD-LF Motherboard with Intel C602 Chipset with latest BIOS.

       

      64GB ECC RAM

       

      1x Xeon E5-2630v2 CPU

       

      2x Intel DC S3700 800GB SSD Drives in RAID 1 (Mirror) on RSTe Hardware RAID.

       

      Windows Server Enterprise 2008 R2, fully updated.

       

       

       

      CORRUPTION PROBLEM

       

      Under heavy load, after a random period of time, often when doing a Windows backup, the array corrupts and the following event log messages are generated. There are varying quantities of each message...

       

       

       

      Event ID:      55

       

      Description:

       

      The file system structure on the disk is corrupt and unusable. Please run the chkdsk utility on the volume VMs.

       

       

       

      Event ID: 12289

       

      Description:

       

      Volume Shadow Copy Service error: Unexpected error CreateFileW(\\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy25\,0x80000000,0x00000003,...).  hr = 0x800703ed, The volume does not contain a recognized file system.

       

      Please make sure that all required file system drivers are loaded and that the volume is not corrupted.

       

       

       

      Event ID:      136

       

      The default transaction resource manager on volume E: encountered an error while starting and its metadata was reset.  The data contains the error code.
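      Because the number of each message varies from one incident to the next, a per-ID tally can make incidents easier to compare. The following is only a rough sketch: it assumes the events have been exported to plain text in which each record carries an "Event ID: N" line as quoted above, and the function name `tally_event_ids` is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as "Event ID:      55" or "Event ID: 12289".
EVENT_ID_LINE = re.compile(r"^\s*Event ID:\s*(\d+)\s*$", re.MULTILINE)


def tally_event_ids(log_text: str) -> Counter:
    """Count how many times each event ID appears in exported log text."""
    return Counter(int(event_id) for event_id in EVENT_ID_LINE.findall(log_text))
```

      Running this over the export from each corruption incident would show whether the mix of events 55, 12289 and 136 changes with driver version.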

       

       

       

      A chkdsk on a corrupted volume shows hundreds of lines of errors. I can post these too, but I do not think the exact errors are relevant, as they vary each time. They include:

       

      ...

       

      The object id index entry in file 0x19 points to file 0x174c

       

      but the file has no object id in it.

       

      ...

       

      The multi-sector header signature for VCN 0x0 of index $I30

       

      in file 0x3e is incorrect.

       

      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................

       

      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................

       

      Error detected in index $I30 for file 62.

       

      The index bitmap $I30 in file 0x3e is incorrect.

       

      ...

       

       

       

      TESTS PERFORMED

       

      # We have two complete servers with identical hardware. We can repeat the fault on either server. So we know there is not a fault with a specific hardware item.

       

      # We have tested with 800GB Intel SSDs with HP firmware. We have also tested with 200GB Intel SSDs with Intel firmware. Configurations with both drive versions exhibit the fault.

       

      # We have tested with Windows-based software RAID and the fault does not occur. Unfortunately this halves the array read performance, as we have confirmed with drive benchmarking software. Having spent £2000 per server on drives, halving the disk performance is not something we want to do. Switching to software RAID uses the standard Microsoft AHCI driver instead of the Intel RSTe driver. Since the software RAID works, this suggests that the drives and connectivity are not at fault, as they are used in both hardware and software RAID.

       

      # Configurations with the following Intel RSTe driver versions exhibit corruption: Version 3.8.0.1113, version 4.0.0.1045 and version 4.1.0.1047.

       

      # Configurations with Intel 'C600+/C200+ series chipset SATA RAID' RSTe driver version 3.6.0.1093 do not exhibit corruption.

       

      # Power consumption is around 130 Watts and is well within the limits of each server's dual 500 Watt power supplies.

       

       

       

      OBSERVATIONS

       

      # Once corrupted, running an array verify from the Windows Intel RAID utility often results in a blue screen.

       

      # Once an array has corrupted, if we break the array and inspect each disk of the mirror, we find that one drive is intact and the other drive is corrupt. But this is not a fault with a drive or a cable because we have run tests on two different servers, six drives and four SATA cables.
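      The per-disk inspection described in the observation above could be scripted along the following lines. This is a minimal sketch, not the tooling actually used in these tests: it assumes each mirror member has been imaged to a file (or is readable as a raw device), and the function name `diff_ranges` and the 1 MiB chunk size are arbitrary choices. RAID metadata areas may legitimately differ between members, so any ranges reported there would need to be discounted.

```python
from typing import BinaryIO, Iterator, Tuple

CHUNK = 1024 * 1024  # compare 1 MiB at a time


def diff_ranges(a: BinaryIO, b: BinaryIO, chunk: int = CHUNK) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) byte offsets of contiguous runs of chunks that differ."""
    offset = 0
    run_start = None  # start offset of the current run of differing chunks
    while True:
        chunk_a = a.read(chunk)
        chunk_b = b.read(chunk)
        if not chunk_a and not chunk_b:
            break  # both streams exhausted
        if chunk_a != chunk_b:
            if run_start is None:
                run_start = offset
        elif run_start is not None:
            # The differing run ended at the start of this matching chunk.
            yield (run_start, offset)
            run_start = None
        offset += max(len(chunk_a), len(chunk_b))
    if run_start is not None:
        yield (run_start, offset)
```

      Comparing the intact and the corrupt member this way would show whether the damage is confined to a few regions or spread across the whole disk, which may help whoever investigates the driver.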

       

       

       

      CONCLUSION

       

      Based on 6 weeks of exhaustive testing, we have concluded that there is a fault in the Intel RSTe driver.

       

      We are trying to find a way to get this bug fixed.

       

      If others have had the same issue, this adds weight to the case.

       

       

       

      Has anybody else experienced this behaviour? If so, have you been able to fix the problem by downgrading to RSTe driver version 3.6.0.1093?

       

       

       

      Does anyone have any good suggestions or good contacts at Intel, so we can get this information to the right people so it gets fixed?

       

       

       

      We don't really want the servers to go live with a very old driver version because, once live, they will effectively remain stuck at that driver version, as it will be too risky to update them.

       

       

       

      Any help or suggestions are appreciated.

       

       

       

      Best regards

       

       

       

      Stephen Done

       

       

       

       

       

       

       

        • 1. Re: Repeatable RAID1 Mirror Corruption on 2008 R2 Server with Intel RSTe Controller with Intel SSD Drives

          Hi Stephen_Done,

           

          I am really sorry for your trouble but let me help you with this.

           

          All the information you have provided is very complete and helpful for our understanding and investigation process. However, there are some other questions I want to ask so that I have a better picture of the issue:

           

          1. What is the SSDs' firmware version?
          2. Are these SSDs connected directly to the motherboard or through a PCIe RAID card?
          3. Are you using any virtual machine software?
          4. Have you tried the IRSTe driver provided by your motherboard manufacturer?

           

          Kevin M

          • 2. Re: Repeatable RAID1 Mirror Corruption on 2008 R2 Server with Intel RSTe Controller with Intel SSD Drives
            Stephen_Done

            Hello Kevin,

             

            1. The Intel DC S3700 SSD firmware version is 5DV10270.

            2. The two Intel SSDs are connected directly to the two Intel Chipset RAID ports. There is no external RAID card. It is Intel C602 chipset RAID. See my first post.

            3. This is a Windows Enterprise Server and is running Microsoft Hyper-V. There are four VMs, but none have direct disk access. The corruption is beneath the logical disk level, so I do not believe that anything within the OS can be doing this other than the Intel RAID driver. Also, the only thing we can change that makes the problem go away is the Intel RAID driver version. When the array corrupts, one drive is intact and the other is scrambled - this points at driver or firmware to me. My software developer hat makes me think driver race condition.

            4. Yes, we have tried the iRSTe driver provided by SuperMicro - this corrupts. We have also tried an update supplied by SuperMicro - this corrupts. We have also tried all driver versions downloadable from the Intel site. These all corrupt, except 3.6.0.1093. However, SuperMicro are simply supplying the Intel driver, as I would expect them to. Boston, the UK distributor for SuperMicro, have recently supplied us with several more driver versions between 3.6 and 3.8, as we have offered to pinpoint the driver version where the corruption began. We will do this to assist Intel in solving the problem, if we know that the information will be used - would you find this useful? Please confirm, as these tests all cost my company further time and money. However, the fact remains that the current driver version corrupts.
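            Pinpointing the version where the corruption began amounts to a bisection over the ordered driver list. The sketch below illustrates the idea under stated assumptions: the oldest version in the list is known good, the newest is known bad, and once the fault appears it stays in every later version. The function name `first_bad_version` and the `corrupts` callback are hypothetical; in practice `corrupts` would stand for a long soak test under load, since the fault is intermittent and a short clean run proves little.

```python
from typing import Callable, Sequence


def first_bad_version(versions: Sequence[str], corrupts: Callable[[str], bool]) -> str:
    """Return the earliest version for which corrupts() is True.

    Assumes versions are ordered oldest to newest, versions[0] is known good,
    versions[-1] is known bad, and the fault persists once introduced.
    """
    good, bad = 0, len(versions) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if corrupts(versions[mid]):
            bad = mid   # fault already present at mid
        else:
            good = mid  # fault introduced after mid
    return versions[bad]
```

            With n intermediate versions this needs roughly log2(n) soak tests rather than n, which matters when each test can take days.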

             

            Best regards

             

            Stephen Done

            • 3. Re: Repeatable RAID1 Mirror Corruption on 2008 R2 Server with Intel RSTe Controller with Intel SSD Drives

              Thanks for the information.

               

              Please note that our drivers are generic drivers for OEM Systems like SuperMicro. This is because they create their own software and special drivers for their units.

               

              Have you tried the IRSTe driver version provided by SuperMicro?

               

              Kevin M

              • 4. Re: Repeatable RAID1 Mirror Corruption on 2008 R2 Server with Intel RSTe Controller with Intel SSD Drives
                Stephen_Done

                >Please note that our drivers are generic drivers for OEM Systems like SuperMicro.

                >This is because they create their own software and special drivers for their units.

                >

                So the OEM drivers should work then?

                But perhaps not have extra features that SuperMicro implement?

                I've never been told that an OEM driver will not work before, and I've been in the business for some time.

                But I can see how that might cut down on your support overhead :-)

                 

                >Have you tried the IRSTe driver version provided by SuperMicro?

                >

                Yes.

                You asked this in question number 4 above and I answered above.

                Summary of reply to question 4: The problem is the same, whether using SuperMicro supplied drivers or Intel supplied drivers.

                This is why it is logical to conclude that, since the problem is present in both driver versions, the problem is in the core code.

                 

                But I can see that someone higher up in Intel is listening, as the later driver versions have just been pulled from the Intel download centre!

                I'm glad to know I'm getting somewhere, even though it might at first glance appear otherwise :-)

                 

                Best regards

                 

                Stephen Done

                BEng (Hons) Information Systems Engineering,

                MCSE+Internet

                • 5. Re: Repeatable RAID1 Mirror Corruption on 2008 R2 Server with Intel RSTe Controller with Intel SSD Drives

                  Thanks for the information. I am going to research this and will be back with more updates.

                   

                  Kevin M