For about 4 years we've had an SR2520SAXSR machine (s5000vsasas board, 2x Xeon processors, active backplane, sgh-something iirc). Up to now it had a 3ware 9690SA hardware RAID controller, which was pretty much top of its class at the time. The machine runs Windows Server 2003 Standard 64-bit. Its main roles are DC, DNS and file server. Storage configuration was 2x SATA disks in RAID-1 for the system drive and 4x SAS disks in RAID-5 for file serving. Load is average, mainly during work hours, but data reliability and availability are of the highest priority. Two separate backup strategies run on regular schedules.
The system is towards the end of its life, but it is not planned to be replaced before the end of next year.
The 3ware controller was somewhat hard to get going in the beginning, but once it got going it didn't stop for years (except for planned system maintenance once a year). So... this month we decided to give it a well-deserved break, and as the machine is going to be retired soon anyway, we decided to try the ESRT2 built into the motherboard, which turned out to be an LSI M1064e. We were going to replace the RAID-5 array with RAID-10, which, given the moderate load, was supposed to do the job. So far, so good. The OS was updated from Windows Update, the system drive was cloned, and the BIOS and ESRT2 firmware were updated, all to the latest versions. The OS was restored from the clone, AD was checked (no errors) and the data was restored from backup. All in all this took 3 days, which included full initialization of all logical volumes from the ESRT2 BIOS console (earlier lessons learned the hard way). RAID Web Console 2 (latest version) was installed. The system was up and running.
As the built-in ESRT2 does not have a BBU, write-back caching was disabled for both the controller and the LVs. For several days the system ran fine. Serving speed was satisfactory even without write-back caching. We decided to call it production.
What followed was quite disappointing:
First - patrol read is unavailable in ESRT2, which is so "cheap". We can live with that, although it is not what I would call an "enterprise feature"; it should be one of the most basic (read: essential) features of any RAID controller, entry or enterprise level. The major annoyance was that every time you tried to schedule a patrol read or start one right away, the web console would return a "red cross" error saying that the command was failed by the OS, with not a line in ANY log (neither the OS logs nor the console log) - go figure! In the tons of Intel documentation I read, not a single line mentioned that an add-in hardware key had to be purchased in order to get this "enterprise feature". Only a very short line in a discussion in this community shed some light on it. I don't remember the name of the author, but thanks anyway. NVM, the machine is managed remotely, so running a consistency check from the web console every Sunday was not a big deal.
And now comes the funniest part.
Every time a consistency check is attempted from the RAID web console, the respective LV degrades immediately. This was tested on both volumes and all disks were replaced, but with no improvement. Within seconds of starting a consistency check from the web console, both LVs would degrade. The console didn't show any other activity. Strangely, in the RAID-10 volume, both mirrors would lose one disk each simultaneously. Even more strangely, a consistency check started from the BIOS console would run fine.
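For readers less familiar with RAID-10: losing one disk from each mirror leaves the volume degraded but still online, while losing both disks of the same mirror takes it offline. A minimal Python sketch of that rule, assuming a hypothetical `raid10_state` helper and one list of booleans per mirror (none of this comes from the ESRT2 or web console tooling):

```python
def raid10_state(mirrors):
    """Classify a RAID-10 volume.

    mirrors: list of mirror sets, each a list of booleans
             (True = disk online, False = disk failed/missing).
    This is an illustrative model, not the controller's actual logic.
    """
    # If any mirror set has no surviving disk, its stripe segment is lost.
    if any(not any(pair) for pair in mirrors):
        return "offline"
    # Every disk in every mirror set is present.
    if all(all(pair) for pair in mirrors):
        return "optimal"
    # At least one disk is missing, but each mirror still has a survivor.
    return "degraded"


# The case described above: each of the two mirrors loses one disk.
print(raid10_state([[True, False], [True, False]]))
```

In that state the volume keeps serving data, but a second failure in either mirror would take it offline, which is why the simultaneous drop in both mirrors is worrying.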
I don't know if it is related, but the web console gave no sign that it noticed the active backplane. It shows only that the physical drives are connected directly to the controller.
Anyway, we would like to give it one last try, so any help would be much appreciated.
What are the ESRT2 driver version and the RWC2 version? What's the HDD model? I'd suggest you check the Tested Hardware List (THL). We've seen a lot of strange behavior with non-validated HDDs.
ESRT2 driver version: 14.5.727.2011
HDDs are 2x ST500NM011 (Seagate) and 4x HUA72201 (Hitachi)
Of course they are not in the THL, as its last version is from 2009 while the disks are newer. However, they look like they are from a tested family.
Anyway, the problem stopped after several days, without any intervention whatsoever!