I've spent the last two weeks attempting to track down some mysterious issues on two separate S1200BTL-based servers. Both were purchased barebones (i.e., motherboard, memory, CPU, and case, but no drives) from two totally different suppliers, months apart. We installed the storage (120GB software-mirrored 520 Series SSDs for the OS) and did a fresh install of SBS 2011 on one and Server 2008 R2 on the other.
After approximately one month of running without issue on-site, and one week after setting up users and joining them to the domain, the SBS server began locking up. When it occurs, you cannot move the mouse, use the keyboard, or RDP in, but pings do respond. The only recourse is a hard reboot, after which the system comes back up without issue. It then proceeds to run for something like a week until the next lockup.
The log files usually show next to nothing shortly before the hang, except for some Windows Server errors about not being able to contact the domain controller (e.g., "The DHCP service failed to see a directory server for authorization." and "The processing of Group Policy failed. Windows could not obtain the name of a domain controller. This could be caused by a name resolution failure. Verify your Domain Name System (DNS) is configured and working correctly.")
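(For anyone who wants to check their own logs the same way: I exported the System event log and filtered for entries in a window before each hang with a quick script. This is just a rough sketch — it assumes you've already parsed the export into (timestamp, message) pairs and noted the hang times yourself.)

```python
from datetime import datetime, timedelta

def events_before_hang(events, hang_times, window_minutes=30):
    """Return log events that fall within `window_minutes` before any hang.

    events:     list of (datetime, message) tuples parsed from an exported log
    hang_times: list of datetimes when the server locked up
    """
    window = timedelta(minutes=window_minutes)
    hits = []
    for ts, msg in events:
        # Keep the event if it lands in the window leading up to any hang
        if any(hang - window <= ts <= hang for hang in hang_times):
            hits.append((ts, msg))
    return hits

# Example with made-up timestamps: one event 15 minutes before a hang, one days away
events = [
    (datetime(2012, 5, 1, 9, 0), "iaStor timeout"),
    (datetime(2012, 5, 3, 12, 0), "DHCP authorization error"),
]
hangs = [datetime(2012, 5, 1, 9, 15)]
print(events_before_hang(events, hangs))  # only the iaStor timeout qualifies
```

In my case this turned up essentially nothing in the half hour before each hang, which is what makes this so maddening.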
There are, however, recurring iaStor timeout errors that I have not been able to resolve: "The device, \Device\Ide\iaStor0, did not respond within the timeout period." They do not seem to correspond with the hangs, but they obviously could be the issue. Using either AHCI driver provided by Intel (the one included in the chipset drivers, or the separate RST AHCI driver) has not fixed the errors, nor has updating both systems' BIOSes to the most recent available version. I also tried the registry fix for disabling link power management, which likewise did not resolve the errors. (For what it's worth, the mode is set to AHCI in the BIOS.)
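(For reference, the link power management registry change I tried was the commonly posted per-port one under the iaStor service parameters — shown here for Port0 only; I repeated it for each port. Treat this as a sketch rather than gospel, since the exact value names can vary between RST versions.)

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\iaStor\Parameters\Port0]
"LPM"=dword:00000000
"LPMSTATE"=dword:00000000
"LPMDSTATE"=dword:00000000
```

A reboot is needed after importing it. As noted, it made no difference here.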
It gets more bizarre. After pulling the plug and booting the SBS server one time after a hang, the system came back up showing only 8GB of the 16GB of memory installed. Note this is what Windows reported in Task Manager and Computer Properties; I was not present to see what the BIOS had detected at boot. In any event, it ran all day without issue like that. That evening I rebooted, didn't touch a thing, ran a full 1.5-hour MemTest86+ pass, and it passed with flying colors on all 16GB. Booted Windows, and all 16GB was back. The supplier gave us a new motherboard after this, which we swapped in about a week ago, but the server hung again yesterday, showing the motherboard was not the issue. I was getting ready to ask for replacement memory (despite memtest passing) when the totally separate, second R2 server hung this morning for the first time, in exactly the same way as the SBS server. It's been getting more use lately, which I think is what triggered it.
Because both servers seem to be having the same problem (including the repeated iaStor errors), I'm leaning towards some kind of incompatibility between the 520 Series SSDs and this board. It just seems too unlikely to be separate hardware failures, unless somehow we got a bad batch of SSDs, which I find hard to believe given how many we've installed and how few issues we've had with them.
Any thoughts on a direction to go? I'm ready to pull my hair out!
Just a quick mention that only the 710 Series SSDs are listed as compatible on the S1200BTL landing page. The rule of thumb in these cases is "you have to prove they are compatible if not listed." Sometimes the list is simply incomplete.
Since you have issues on three separate server boards, the problem seems pretty fundamental.
Looking at the 710 vs. the 520, I see:
SATA rev 2.6 vs SATA rev 3
ATA8-ACS2 including SCT (SMART Command Transport) and device statistics log support vs. ACS2
I don't see anything in the S1200BTL TPS about these protocols, so I'm not drawing a conclusion, but it might ring a bell for you?
Thanks for taking the time to reply, especially given the frustrating weirdness of this issue. I agree that the problem must be something fundamental (though the 8GB memory issue doesn't fit that theory very well, since I have a hard time figuring out how I/O issues could cause such a thing; I'm ignoring it for the time being). For what it's worth, I've been using consumer Intel SSDs on Intel server boards for a few years now without issue, but obviously that doesn't mean anything given the newer SSDs with different controllers.
Some notes of interest:
1) Since I posted, I reverted the AHCI drivers on both systems to the default Microsoft AHCI driver. The iaStor errors have stopped, as you'd expect, but it remains to be seen whether that has solved the problem or simply masked it. Man, I love problems that take a week to appear...
2) I'm beginning to think these newer Intel SSDs with Marvell controllers are intrinsically temperamental. (That's a nice way of putting it.) Yesterday evening I installed a new 120GB 520 Series SSD in an Asus U56E laptop approximately six months old. I did a fresh install of Win 7 and installed all drivers, including the Intel chipset driver. The result? NTFS.sys critical object termination BSODs on every wake-up from sleep mode. Switching to the Microsoft driver made no difference; however, switching to the separate Intel RST driver fixed the problem immediately. (Just to be sure, I switched back to the driver provided by the chipset package, and it was back to BSODing on wake-up.)
The interesting thing? It was a Series 6 / C200 chipset, the same as the server boards. Next time I am on-site, I am going to put one of the servers into sleep mode; it will be interesting to see if it BSODs similarly. That would be great, as it might give me a way to actually figure out whether or not I have fixed the problem, rather than having to wait a week or more. (That assumes no BSOD on wake-up also means no lockups after a week, which is pure speculation. Worth a shot, since I'm in Hail Mary territory now!)
Regarding the 8GB issue, I think you need to divide and conquer: two issues, two solution sets. There's not enough to go on for the memory recognition problem. However...
Correct me if I am wrong, but if you are using a Marvell controller, then you would not want to use the iaStor driver; you would want to be using MegaSR. Unless you meant the actual controller of the SSD, which IIRC is made by SandForce.
Anyway, I'm not trying to get lost splitting hairs, but it may be relevant.
How have you configured your (I assume) RAID array? On which controller, and in which mode?
Just wanted to provide an update for everyone. Switching the controller drivers to the Microsoft AHCI driver seems to have solved the problem entirely. Both servers have been running nearly four weeks now without any further hangs. I'm tentatively calling this fixed (as much as you can call a server running on Microsoft default drivers "fixed"...), though I have to say I am less than thrilled with Intel right now. I've used nothing but Intel SSDs for years, but I'm beginning to question the reliability and compatibility of their latest offerings after all of the above issues.
Jason, just to clear things up: I am using the onboard Intel SATA controllers on both servers, set to AHCI mode. I was talking about the SSDs' own controllers and mistakenly said Marvell when I meant SandForce. Thanks for your comments.
Steve1677, for what it's worth, what kind of drives do you have in your system? Are they SSDs or traditional hard drives? Is there anything else about your setup that's similar to mine?