Two years of trouble with S5000PSL
Occasional crashes (BSOD, hangs, or restarts): most of the time, after a crash/restart one or more drive cages are missing. I need to pull the power cables, wait a few minutes, reconnect the cables and boot, and then the drives are back.
Most of the time there is no crash information in Windows.
Restart from the OS: often one or more drive cages are missing (after a normal restart, NOT a crash). Same workaround: pull the power cables, wait a few minutes, reconnect the cables and boot, and the drives are back.
RMM2: no remote console possible (Vista/MacOS/XP; I can log in to RMM2)
RMM2: first page of SEL log not visible
Intel Management: the software interferes with the Windows OS; unable to install and use Intel Management
Windows Small Business Server 2008 SP1 (64 bit)
Windows Small Business Server 2003 R2 SP2 (32 bit) (replaced by SBS 2008)
Drive cages: AXX4DRV3GEXP and AXX6DRV3GEXP
all components have latest versions
Dozens of emails: problem still exists
So far I tried:
- Replaced all hard disks (found several problems with Seagate Barracuda 7200.11 SATA II disk drives)
- Replaced drive cages EXP4 and EXP6
- Replaced RAID controller (Intel SRCSAS18E -> Adaptec 5805) (both crash in the same way)
- Replaced OS (new clean installations): before SBS 2003 -> now SBS 2008 (both crash in the same way)
- Replaced RAM
Have you captured the BSOD you receive? Is it always the same? ....
Also, you mentioned that sometimes the hot-swap cage is missing.... which cables do you have plugged from the backplane to the controller card and/or motherboard?
Are those 3.0 Gb/s Serial ATA hard drives?.... Have you tried forcing the hard drives to work at 1.5 Gb/s? ....
let me know..
Most of the time the system just hangs or crashes, without BSOD.
If there was a BSOD, it was a different one each time.
So I can't give any more details about why the system crashed.
Often, even after a normal restart, one or more drive cages (with drives) are missing.
I switched/changed cages (with backplane), and cables.
No pattern emerged: it happened in all combinations (across 4 drive cages).
No, I didn't force the HD to use SATA 1.5.
I only found that the Barracudas had the most problems.
So I replaced them with WD drives.
This (?) resulted in fewer crashes.
I have seen many issues with SATA 2.0 hard drives; the vibration of these drives causes issues like the one you are experiencing. So as a suggestion, try forcing them to work at 1.5 Gb/s and check the behavior.
Additionally, I noticed there is a new backplane update available with LOTS of fixes in it. Download the firmware and perform the update.
6 Bay expander
4 Bay expander
The update of the HBA firmware to version 2.12 seems to solve the "missing drives after restart" issue (after both genuine restarts and crashes).
However, the BSODs are still there:
STOP 0x...D1, 0x...34C
This happens if Adaptec Agent (Service) is enabled.
On our system this error is reproducible.
I did at least 3 installations!
Adaptec, using the same S5000PSL board, is unable to reproduce this error.
This time the crash damaged the boot sector (?): chkdsk C: reports "second NTFS boot sector unwritable" and Windows Backup fails every time.
Did the HBA issues (missing drives) come back or does that seem to be fixed? I've been fighting the same problem (I think) for a while now and sometimes it would appear to be fixed for a few weeks and then come back.
When you were having issues with the "missing drives" were you getting MR_MONITOR warnings (a lot of them) in the Event logs?
I seem to be having the same issue with the drives missing, and I was getting a lot of MR_MONITOR errors in the Event logs. I just tried the HBA firmware update, but it has only been a few days without errors so I'm not sure if it has fixed my problem or not.
I replaced the Intel RAID controller (see original post), because I thought the controller was causing the crashes. So no MR_MONITOR events.
However the Adaptec RAID Controller suffered from the same problem.
Now we know this was due to the firmware issues described in Intel TA-933-1:
“The frequency of the failure under normal operating conditions is once in 1 to 12 months depending on the system configuration. The failure may occur with high probability during system FRUSDR update.”
A 'little' lie: our system went 1-14 days between crashes! So more than 75 in a year!
Sorry, but I am really annoyed about this. (I reported 'disk' problems at the beginning of 2008.)
I can't say for sure that all issues are solved.
The missing drive cage seems to be solved.
However, the system still has serious problems:
Enabling the Adaptec Agent (comparable with MR_MONITOR) crashes the system.
Adaptec said this could be related to a firmware issue.
I am in contact with an Adaptec engineer trying to solve this issue.
The last crash left the system unable to run backups.
Trying other backup software (Acronis) led to even more problems: now the NICs have disappeared (and Adaptec can't use VPN to access the system).
That is nuts. Thanks for the info. If I hadn't found this post I'd think I was losing my mind.
About 5 months ago I built 3 servers with identical parts. One server was a mess with these issues and the other two were fine. After replacing all of the parts and spending hours on the phone with Intel (they kept telling me they had never seen this before), I finally gave up on that one and just installed the two good ones. One of them has never had any issues, but the other just started to fail a week ago after working fine for 4+ months. When I called Intel last week they never mentioned TA-933-1. I just updated it on Friday, but I guess I'll have to baby this server because I won't ever be sure it is right.
Thanks again. I really hope you solve your issues. I seem to have more luck finding answers on sites like this than from actual tech support now.
I have two of these servers, both running Linux and both giving the same errors: crashing, drives missing after restart, RMM failing.
I have left one turned off!! for the last few months. I will power it up again and try to get all updates done (Including TA-933-1) to see if I get any further with it.
Even after applying all driver and firmware updates, I am still getting errors:
Feb 26 17:19:17 storage-test-test MR_MONITOR: <MRMON181> Controller ID: 0 Enclosure shutdown: Ports 4-7:1
Feb 26 17:22:22 storage-test-test MR_MONITOR: <MRMON113> Controller ID: 0 Unexpected sense: PD = :16 - Enclosure services unavailable, CDB = 0x1c 0x01 0x0e 0x14 0x00 0x00 , Sense = 0x70 0x00 0x02 0x00 0x00 0x00 0x00 0x0a 0x00 0x00 0x00 0x00 0x35 0x02 0x00 0x00 0x00 0x00
Does anyone have any insight into this?
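In case it helps anyone reading the MRMON113 line above: the CDB and Sense bytes are plain SCSI and can be decoded by hand. Below is a rough Python sketch that does it, assuming the sense data is in the standard fixed format (response code 0x70/0x71, sense key in byte 2, ASC/ASCQ in bytes 12 and 13). The function names and the small lookup tables are made up for this example and only cover the codes in this log line; the wrapped log line is joined back into one string first.

```python
import re

# Subset of SCSI sense keys; not exhaustive.
SENSE_KEYS = {
    0x0: "NO SENSE", 0x1: "RECOVERED ERROR", 0x2: "NOT READY",
    0x3: "MEDIUM ERROR", 0x4: "HARDWARE ERROR", 0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION", 0xB: "ABORTED COMMAND",
}
# Only the codes needed for this log line; anything else shows as "unknown".
OPCODES = {0x1C: "RECEIVE DIAGNOSTIC RESULTS"}
ASC_ASCQ = {(0x35, 0x02): "ENCLOSURE SERVICES UNAVAILABLE"}


def hex_bytes(text):
    """Turn a string like '0x1c 0x01 ...' into a list of ints."""
    return [int(tok, 16) for tok in re.findall(r"0x[0-9a-fA-F]{1,2}", text)]


def decode_sense_line(line):
    """Decode the CDB and fixed-format sense data from one MR_MONITOR line.

    Returns None if the line has no 'CDB = ..., Sense = ...' part or the
    sense data is not in fixed format.
    """
    match = re.search(r"CDB = (.*?),\s*Sense = (.*)$", line)
    if match is None:
        return None
    cdb = hex_bytes(match.group(1))
    sense = hex_bytes(match.group(2))
    # Fixed format: response code 0x70 or 0x71, at least 14 bytes to reach ASC/ASCQ.
    if not cdb or len(sense) < 14 or (sense[0] & 0x7F) not in (0x70, 0x71):
        return None
    key = sense[2] & 0x0F
    asc, ascq = sense[12], sense[13]
    return {
        "opcode": OPCODES.get(cdb[0], "unknown"),
        "sense_key": SENSE_KEYS.get(key, "unknown"),
        "asc_ascq": (asc, ascq),
        "meaning": ASC_ASCQ.get((asc, ascq), "unknown"),
    }


# The second log line from the post, with the wrapped "0x" / "01" rejoined.
LINE = (
    "Feb 26 17:22:22 storage-test-test MR_MONITOR: <MRMON113> Controller ID: 0 "
    "Unexpected sense: PD = :16 - Enclosure services unavailable, "
    "CDB = 0x1c 0x01 0x0e 0x14 0x00 0x00 , "
    "Sense = 0x70 0x00 0x02 0x00 0x00 0x00 0x00 0x0a 0x00 0x00 0x00 0x00 "
    "0x35 0x02 0x00 0x00 0x00 0x00"
)

decoded = decode_sense_line(LINE)
print(decoded)
```

For this line the command is RECEIVE DIAGNOSTIC RESULTS (opcode 0x1c) and the sense data decodes to sense key NOT READY with ASC/ASCQ 35h/02h, which matches the "Enclosure services unavailable" text in the message. So the controller asked the enclosure for status and the enclosure did not answer; this does not say why.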
I notice you mention firmware 2.12 and 2.14.
I see no mention of these version numbers at http://downloadcenter.intel.com/SearchResult.aspx?lang=eng&ProductFamily=Server+Products&ProductLine=Intel%C2%AE+Storage+Systems&ProductProduct=Intel%C2%AE+Storage+Server+SSR212MC2
Can you provide a link?
I don't know if you are experiencing the same problem: on our system there were never any crash dumps or other log errors.
The system just froze and was unable to restart (see original post).