First suggestion is to stick with hardware in the THOL. We've seen a lot of strange errors with untested hard drives. Also, please note the tested HDD FW level and additional requirements in Notes/Comments.
A complete diag file could be very helpful in investigating the problem.
The MFSYS25 firmware 6.8 release notes mention something I was looking for ... but it's not entirely fixed, and I'm seriously debating dropping our Intel products in the future unless I can get some idea of where the problem is. As hos reported, the SCM simply power cycles. For me it is every 30-60 days on one of our modular servers. On another it is once every 3 months. We have 6 modular servers (fully loaded); the others have not had the problem, but the hardware is the same (more or less).
The release notes list 'SCM crash due to firmware error' under 'defects fixed in this release'. This actually isn't true (if it's the same problem; I have no idea, as the release notes aren't very forthcoming).
The drives we are using are listed as supported for the hardware ... in fact a few of the other modular servers we run have mixed drives, not supported, and they have yet to go down. The drives in the affected machines are ST936751SS, firmware 0001 - apparently supported according to Intel.
I haven't bothered upgrading the firmware to 10 yet, as there is nothing in the release notes that appears to fix this problem. The configuration on both of the affected systems is the same: single SCM. The firmware reports what it should according to the upgrade docs (from Intel).
In each case, the event log shows the following taking place:
1. Media patrol always starts.
2. Media patrol ends.
3. Server Power Permission Revoked (Chassis)
4. scm reports 'controller has started'
5. scm reports 'the system has started' (redundant message?)
6. scm reports 'controller reset by firmware'
7 & 8. Battery detected/charged
9. chassis reports power granted
10. scm reports online
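For anyone who wants to check their own logs for the same pattern, here is a rough Python sketch that scans an exported event log for the sequence above and reports the number of days between resets. The line format (`timestamp | source | message`), the message wording, and the function names are all assumptions made for illustration; an actual MFSYS25 export may look quite different and would need its own parsing.

```python
from datetime import datetime

# Assumed export format (hypothetical): "YYYY-MM-DD HH:MM:SS | source | message".
# The key phrases below are taken from the sequence described in this post;
# the exact wording in a real log may differ.
RESET_SEQUENCE = [
    "media patrol",
    "power permission revoked",
    "controller has started",
    "controller reset by firmware",
]


def parse_line(line):
    """Split one log line into (timestamp, source, message)."""
    stamp, source, message = (part.strip() for part in line.split("|", 2))
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), source, message


def find_resets(lines):
    """Return timestamps at which the key phrases have all appeared in order.

    Matching is loose: unrelated events between the phrases are ignored.
    """
    resets = []
    step = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        when, _source, message = parse_line(line)
        if RESET_SEQUENCE[step] in message.lower():
            step += 1
            if step == len(RESET_SEQUENCE):
                resets.append(when)
                step = 0
    return resets


def days_between(resets):
    """Whole days between consecutive resets."""
    return [(later - earlier).days for earlier, later in zip(resets, resets[1:])]
```

Feeding it the lines of an exported log, for example `days_between(find_resets(open("events.txt")))` with a hypothetical file name, would make it easy to compare the reset intervals across several chassis.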
The power has never been revoked (we run dual UPS power plants and 4 protected APC power strips; each power supply is plugged into its own power strip, running low amps). The server blades never restart; most of the time they lock up (as the SCM is gone and the OS panics). Sometimes I'll get lucky and one or two will recover, but it's rare.
From the response here, it sounds like you've run into unsupported hardware causing the issue, but that isn't my case. I've gotten nowhere with resellers (apparently they haven't gotten anywhere with Intel). I don't buy that I'm the only one with this issue. The systems were purchased 2 years apart from each other as well, so it can't just be a bad batch (I wouldn't think, since two resellers were involved, from two different countries).
I've gone so far as to check whether anything running on the blades could be causing a lot of disk IO. The reboots are so random that they don't make sense to me, outside of a possible firmware issue.