I am a professional accountant (at least I try to be in this strenuous world focused mainly on immediate good financial results).
I have a recent PC with an OEM MSI MS-7616 board, an i5-750 and the P55 chipset (1 year old).
It had two RAID 1 arrays (4 drives): the first almost exclusively for the OS, MS Win7-64 (300GB), the second for all data (1000GB). The drives are WD drives (ordinary drives, not built for servers or RAID).
Everything is backed up to a Linux-based home NAS (Synology 209+II in RAID 1).
I thought it was reliable. It was, but not as I expected.
I go to a client at lunch, shut down Windows and get the normal black screen (but I do not cut the AC mains power). Back at my home/office at four, "blue screen" because Windows wants to protect my system at boot! No one was at the office in between (normally!). The BIOS was set to allow wake-up on USB (keyboard or mouse).
I try to recover the system with the WinRE DVD. In a hurry, I don't pay attention to the BIOS message showing 4 non-member disks; it only registers subconsciously. Before accepting the WinRE repair, I could still see the Windows OS partition on drives 1 and 2 in WinRE, described as faulty.
Afterwards, on the second run of WinRE, it is all gone (no Windows partition at all!).
In WinRE console mode, I find 4 main disks instead of 2.
I go into the BIOS and have to correct the disk access mode to SATA! (It had been set to something else, neither IDE nor SATA.)
Using "testdisk" (a useful hard drive utility from a French guy), I find that both file tables of the main OS partition on volume 0 are too damaged (hard drives 1 and 2 are nevertheless still visible as Volume 0). Irrecoverable for TestDisk or anything else I know.
Drives 3 and 4 have become non-RAID members. But all user data is intact.
I decide to rebuild a new Win7-64 on the existing volume 0. It works.
I also have a full data backup on my home/office NAS (Synology 209+II, RAID 1), so no issue there. The data disks are still intact, although no longer in a RAID 1 volume.
Once the OS is up and running again, I decide to make disk 4 (port 3) a member of a RAID 1 volume 1 again, that is, to overwrite disk 4 using disk 3 (port 2) as its source.
The Intel RST software shows a systematic error: "An unknown error occurred when creating the volume."
I now discover that I will have to format drive 4 first! I hope I don't have to format drive 3 as well (for time's sake). I will make a DVD copy of them using Acronis beforehand.
Separately, I spoke to my brother this week. He has the same config (Win7-64 and 2 RAID 1 volumes) but with a GA motherboard (built totally separately), with the same resilience objective that we had discussed beforehand. He also had to restore the system image and rebuild his RAID 1 volumes after exactly the same incident! But he was smarter: he had an Acronis system image, so he recovered faster.
This could still be an accident but frankly, what is the probability that it happens like this without a serious inadequacy in the solution on offer? Is it that the hard drives are not suitable for these chipsets (should we have bought dedicated RAID server hard drives instead of ordinary ones)? Is it the Intel chipsets themselves that are not robust enough in most cases? Should we have used software RAID 1 instead of hardware RAID at this cost? Should we have bought a dedicated RAID 1 controller rather than using the "low-cost" Intel chipsets, as suggested on the net? I can't find a scientific view on it.
It is impossible for me to be sure about my chipset config (OEM board, hence no documentation; 64-bit, hence no Intel tool to identify it, as only the Intel 32-bit tools work).
I found out that the internet is full of such incidents with Intel Raid storage controller. My controller is referred to in Win 7 properties as Intel Server Express Chipset SATA Raid PCI\VEN_8086&DEV2822&SUBSYS-76161462&REV05. I always use recent drivers.
I reviewed the alerts from the system logs of my failed Win7 system (backed up on the NAS). Nothing indicates recent disk controller errors, but there were errors in October (unreadable to me).
1) When, after such an incident, I have two non-member RAID 1 disks that are supposed to be identical (I tested that the data was identical), why can't I simply ask to put them back into a RAID 1 volume as before? There should be a utility to put them back together.
2) Why can't I easily put these hard drives in "read-only" mode to perform my verifications (a typing error is quickly made)?
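For anyone wanting to repeat the "is the data identical?" check from 1), here is a minimal sketch in Python of how it could be done at the file level. It assumes both former RAID 1 members are visible as ordinary volumes; the drive letters D:\ and E:\ and the function names are hypothetical, not anything provided by Intel or Microsoft. The script only opens files for reading, so it never writes file contents, but it is not a substitute for a true read-only mode and it says nothing about the RAID metadata on the disks.

```python
import hashlib
import os


def hash_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:  # "rb" opens for reading only
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            full = os.path.join(folder, name)
            digests[os.path.relpath(full, root)] = hash_file(full)
    return digests


def compare_trees(root_a, root_b):
    """Return (only_in_a, only_in_b, differing) as sorted lists of paths."""
    a = tree_digests(root_a)
    b = tree_digests(root_b)
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    differing = sorted(p for p in set(a) & set(b) if a[p] != b[p])
    return only_a, only_b, differing


if __name__ == "__main__":
    # Hypothetical drive letters for the two former RAID 1 members.
    only_a, only_b, differing = compare_trees("D:\\", "E:\\")
    print("Only on first drive: ", only_a)
    print("Only on second drive:", only_b)
    print("Different content:   ", differing)
```

If all three lists come back empty, the two drives hold the same files with the same content. In this sketch a file that cannot be opened (for example a protected system file) stops the run with an error rather than being skipped.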
I am just a user who tries to be serious about data security. Can you guys in the top IT world listen to the voice of the customer, who requires better reliability and recovery tools? MS appears to be getting there, but evidently their WinRE tool further destroyed the RAID 1 OS volume, and I don't know why. Not enough testing? Not handling RAID well?
A lot of time lost for the customer.
One of my daughters is an oncologist. She told me she switched to Apple and dumped Microsoft for good, as Apple makes an OS that works more reliably with its hardware and she can't afford the higher risk of losing time to such incidents. I cannot count the precious hours I have lost to incidents due to poor integration of components from the triad Microsoft / Intel / Taiwanese manufacturers, but it is huge.
My recommendation, which Intel unfortunately can't follow: until Intel can make RAID chipsets that work more reliably, with easy recovery tested between you "IT manufacturers", don't sell them; wait until they are close to rock solid. We seem to be far from that yet. Otherwise, it will continue to ruin quality efforts elsewhere and, ultimately, reputations.
As a more realistic recommendation perhaps, could you escalate this to your management so that extra resources are put into significantly reducing the risk of this seemingly widespread customer issue?
Thanks for your consideration,