5 Replies Latest reply on Aug 29, 2014 2:02 PM by sylvia_intel

    Repeatable RAID1 Mirror Corruption on 2008 R2 Server with Intel RSTe Controller with Intel SSD Drives



      We are experiencing random, occasional but catastrophic array corruption on two servers that are on test before being moved to a hosting centre.






      SuperMicro SuperServer 6017R-TDLRF 1U Server


      Incorporating SuperMicro X9DRD-LF Motherboard with Intel C602 Chipset with latest BIOS.


      64GB ECC RAM


      1x Xeon E5-2630v2 CPU


      2x Intel DC S3700 800GB SSD Drives in RAID 1 (Mirror) on RSTe Hardware RAID.


      Windows Server Enterprise 2008 R2, fully updated.






      Under heavy load, after a random period of time, often when doing a Windows backup, the array corrupts and the following event log messages are generated. There are varying quantities of each message...




      Event ID:      55




      The file system structure on the disk is corrupt and unusable. Please run the chkdsk utility on the volume VMs.




      Event ID: 12289




      Volume Shadow Copy Service error: Unexpected error CreateFileW(\\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy25\,0x80000000,0x00000003,...).  hr = 0x800703ed, The volume does not contain a recognized file system.


      Please make sure that all required file system drivers are loaded and that the volume is not corrupted.




      Event ID:      136


      The default transaction resource manager on volume E: encountered an error while starting and its metadata was reset.  The data contains the error code.




      A chkdsk on a corrupted volume shows hundreds of lines of errors. I can post these too, but I do not think the exact errors are relevant, as they vary each time. They include:




      The object id index entry in file 0x19 points to file 0x174c


      but the file has no object id in it.




      The multi-sector header signature for VCN 0x0 of index $I30


      in file 0x3e is incorrect.


      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................


      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................


      Error detected in index $I30 for file 62.


      The index bitmap $I30 in file 0x3e is incorrect.








      # We have two complete servers with identical hardware. We can repeat the fault on either server. So we know there is not a fault with a specific hardware item.


      # We have tested with 800GB Intel SSDs with HP firmware. We have also tested with 200GB Intel SSDs with Intel firmware. Configurations with both drive versions exhibit the fault.


      # We have tested with Windows-based software RAID and the fault does not occur. Unfortunately this halves the array read performance, as we have confirmed with drive benchmarking software. Having spent £2000 per server on drives, halving the disk performance is not something we want to do. Switching to software RAID uses a standard Microsoft AHCI driver instead of the Intel RSTe driver. Since the software RAID works with the same drives and connectivity that the hardware RAID uses, this suggests that the drives and connectivity are not at fault.


      # Configurations with the following Intel RSTe driver versions exhibit corruption: Version, version and version


      # Configurations with Intel 'C600+/C200+ series chipset SATA RAID' RSTe driver version do not exhibit corruption.


      # Power consumption is around 130 Watts and is well within the limits of each server's dual 500 Watt power supplies.






      # Once corrupted, running an array verify from the Windows Intel RAID utility often results in a blue screen.


      # Once an array has corrupted, if we break the array and inspect each disk of the mirror, we find that one drive is intact and the other drive is corrupt. But this is not a fault with a drive or a cable because we have run tests on two different servers, six drives and four SATA cables.
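
For anyone wanting to repeat the last check, here is a rough sketch of how the two halves of a broken mirror could be compared. It assumes each member has first been copied to an image file; the function name, the chunk size and the idea of working from image files are illustrative assumptions, not part of our actual test procedure. Any region holding RAID metadata may differ legitimately between members, so a mismatch there is not necessarily corruption.

```python
CHUNK_SIZE = 1024 * 1024  # compare 1 MiB at a time (arbitrary choice)


def find_mismatched_chunks(path_a, path_b, chunk_size=CHUNK_SIZE):
    """Return the byte offsets of chunks that differ between two images.

    path_a and path_b are hypothetical image files, one per mirror member.
    If one image is shorter, the extra chunks of the longer one are
    reported as mismatches.
    """
    mismatches = []
    offset = 0
    with open(path_a, "rb") as img_a, open(path_b, "rb") as img_b:
        while True:
            block_a = img_a.read(chunk_size)
            block_b = img_b.read(chunk_size)
            if not block_a and not block_b:
                break  # both images fully read
            if block_a != block_b:
                mismatches.append(offset)
            offset += chunk_size
    return mismatches
```

The returned offsets only show where the two members disagree. Deciding which member holds the good copy still needs a chkdsk or a file-level check on each disk.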






      Based on 6 weeks of exhaustive testing, we have concluded that there is a fault in the Intel RSTe driver.


      We are trying to find a way to get this bug fixed.


      If others have had the same issue, this adds weight to the case.




      Has anybody else experienced this behaviour? If so, have you been able to fix the problem by downgrading to RSTe driver version




      Does anyone have any good suggestions or good contacts at Intel, so we can get this information to the right people so it gets fixed?




      We don't really want the servers to go live with a very old driver version. Once live, they will effectively remain stuck at that driver version, as it will be too risky to update them.




      Any help or suggestions are appreciated.




      Best regards




      Stephen Done