I've spent the last two weeks trying to track down some mysterious issues on two separate S1200BTL-based servers. Both were purchased barebones (i.e., motherboard, memory, CPU, and case, but no drives) from two totally different suppliers, months apart. We installed the storage (two 120GB 520 Series SSDs in a software mirror for the OS) and did a fresh install of SBS 2011 on one and Server 2008 R2 on the other.
After approximately 1 month of running without issue on-site, and 1 week after setting up users and joining to the domain, the SBS server began locking up. When it happens, you cannot move the mouse, use the keyboard, or RDP in, but the server still responds to pings. The only recourse is a hard reboot, after which the system comes back up without issue. It then runs for about a week until the next lockup.
The log files usually show next to nothing shortly before the hang, except for some Windows Server errors about not being able to contact the domain (for example, "The DHCP service failed to see a directory server for authorization." and "The processing of Group Policy failed. Windows could not obtain the name of a domain controller. This could be caused by a name resolution failure. Verify your Domain Name System (DNS) is configured and working correctly.").
There are, however, recurring iaStor timeout errors that I have not been able to resolve: "The device, \Device\Ide\iaStor0, did not respond within the timeout period." They do not seem to coincide with the hangs, but they could obviously still be related. Neither of the AHCI drivers provided by Intel (the one included in the chipset drivers, or the separate RST AHCI driver) has fixed the errors, nor has updating the BIOS on both servers to the most recent version available. I also tried the registry fix for disabling link power management, which did not resolve the errors either. (For what it's worth, the SATA mode is set to AHCI in the BIOS.)
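For anyone who wants to check that kind of correlation more rigorously than by eyeballing the event log, here is a minimal Python sketch. It assumes you have already exported the iaStor timeout timestamps and noted the approximate hang times yourself; the function name, the 30-minute window, and every timestamp below are made-up placeholders, not data from these servers.

```python
from datetime import datetime, timedelta


def timeouts_before_hangs(timeout_times, hang_times, window=timedelta(minutes=30)):
    """Map each hang time to the timeout events logged within `window` before it."""
    result = {}
    for hang in hang_times:
        # Keep only timeouts that happened at or before the hang, inside the window.
        result[hang] = [
            t for t in timeout_times
            if timedelta(0) <= hang - t <= window
        ]
    return result


# Placeholder timestamps for illustration only (not real log data).
timeouts = [
    datetime(2013, 1, 7, 9, 15),
    datetime(2013, 1, 9, 14, 2),
    datetime(2013, 1, 14, 3, 40),
]
hangs = [
    datetime(2013, 1, 9, 14, 20),
    datetime(2013, 1, 16, 11, 0),
]

for hang, nearby in timeouts_before_hangs(timeouts, hangs).items():
    print(hang.isoformat(), "->", len(nearby), "timeout(s) in the preceding window")
```

If most hangs show zero nearby timeouts, the two symptoms are probably not tightly coupled in time, although that alone would not rule out a common cause.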
It gets more bizarre. After one of the hangs, I pulled the plug and booted the SBS server, and the system came back up showing only 8GB of the 16GB of memory installed. Note that this was in Windows, in Task Manager and the computer properties; I was not present to see what the BIOS had detected at boot. In any event, it ran all day like that without issue. That evening I rebooted, didn't touch a thing, and ran a full 1.5-hour memtest86+ pass, which it passed with flying colors on all 16GB. I booted Windows, and all 16GB was back. The supplier gave us a new motherboard after this, which we swapped in about 1 week ago, but the server hung again yesterday, which shows the motherboard was not the issue. I was getting ready to ask for replacement memory (despite memtest passing) when the totally separate second server, the 2008 R2 one, hung this morning for the first time, in exactly the same way as the SBS server. It's been getting more use lately, which I think is what triggered it.
Because both servers seem to be having the same problem (including the repeated iaStor errors), I'm leaning towards some kind of incompatibility between the 520 Series SSDs and this board. Separate hardware failures just seem too unlikely, unless we somehow got a bad batch of SSDs, which I find hard to believe given how many we've installed and how few issues we've had with them.
Any thoughts on which direction to go? I'm ready to pull my hair out!