Dear disk gurus…
I have a small research cluster based on S3600PT boards, and I have recently hit some puzzling problems. A couple of weeks ago, I replaced dead fans in 5 of the 20 machines. About four days later, two of the machines started having very similar disk system problems, puzzling enough that I'm at a loss as to what to do. Two machines down out of 20 seriously dents our research effort, and we don't have any funding to upgrade, so anything we can do to fix these would be really worthwhile.
To summarise: both machines get disk errors when booted from their HDD. In both cases, if I boot a Fedora DVD and run the disk utility, the SMART status shows millions of read/write errors, and if I run a read-write test, it aborts with read-write errors. I have performed the following additional tests:
. Tried both SATA connectors - no change
. Replaced the SATA cables - no change
. Replaced the disks - no change
. Powered the disk from an external power supply - no change
All of this seems to suggest that the problem is the SATA controller.
The puzzling aspects are:
. If I connect any of the four disks I have tried to another machine and read its SMART status there, it shows no history of read/write errors. My understanding was that the SMART status is stored by the disk's own controller, so the same status should show on any machine the disk is connected to. Is this a misunderstanding?
. I've run the full test suite from ?? on one of the failing machines, with one of the failing disks connected. No errors at all were detected.
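In case it helps anyone check the SMART discrepancy, here is a minimal sketch of how the two readings of one disk could be compared attribute by attribute. It assumes the usual table layout printed by smartmontools' `smartctl -A`, with the output saved to a text file on each machine; the function names `parse_attributes` and `diff_readings` are just made up for illustration, and I have not checked this against every drive's output.

```python
import re

# One attribute row of the `smartctl -A` table (assumed layout), e.g.
#   1 Raw_Read_Error_Rate  0x002f  200  200  051  Pre-fail  Always  -  0
# Columns: ID, name, flag, value, worst, thresh, type, updated, when_failed, raw.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+\d+\s+\S+\s+"
    r"\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_attributes(text):
    """Return {attribute_name: (normalized_value, raw_value)} from saved output."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:  # header and blank lines do not match and are skipped
            attrs[m.group(2)] = (int(m.group(3)), int(m.group(4)))
    return attrs


def diff_readings(a, b):
    """Return only the attributes that differ between two parsed readings."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

The idea would be to save the output for the same disk on a failing machine and on a good machine, parse both files, and look at what `diff_readings` returns. If the attributes really are kept by the disk, the two readings should differ only in things like power-on time and temperature.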
Other things I've tried:
. Running the machines with the replaced fans disconnected (in an environment where I could guarantee plenty of external airflow). Originally, I suspected that the replacement fans might be overloading the power supply, though in theory their power demand is slightly lower than the originals'. I also wondered whether the new fans might be injecting noise into the power supply, but running with them completely disconnected should have ruled that out too.
. Reflashing the BIOS
. Fully powering down (i.e. removing the onboard battery for half an hour)
I've had no joy with any of these.
If you have any suggestions for further tests I could run, that would be great! If not, could I get your thoughts on an alternative: if the SATA controller really is smoked, could we resuscitate the machines by installing consumer-grade PCIe SATA controllers? There's a spare PCIe x8 slot on the board, though it will take a bit of jiggery-pokery and a flexible PCIe cable to get a card to fit in the blade chassis.
Thanks in advance for any suggestions