We have MFS2600KI Compute Modules, all running ESXi 5.5. We have been seeing intermittent issues where a catastrophic error occurs and the blades reboot. The failures appear to be completely random and can happen on any of the 6 blades. Here is an example of the error:
ID: 2101
Type: IPMI
Detailed Description: A catastrophic error has occurred. The system has halted.
Cause: An uncorrectable memory error is often the cause.
Action: Check for other events that occurred near the same time which may help identify the cause or potential hardware failure.
Extra Data: s:68:"Raw IPMI (hex): Gen:3000 Num:80 Type:07 EDir:83 ED1:a1 ED2:01 ED3:01";
The error points to a possible memory issue, but Intel support has been unable to identify the exact cause. We've completely replaced one module, but the others are still throwing these errors. Has anyone seen this before, and does anyone know of a possible resolution?
A CATERR can be triggered by almost anything: hardware, software, or firmware. For that reason, I would highly recommend sending the complete system diagnostics to our Support Team, along with your case number, so they can go through the logs and see what may be triggering these random errors.
As an additional suggestion, please make sure the memory installed in your compute modules is among the officially tested modules.
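For anyone who wants to compare these events across blades before sending the logs in, here is a minimal Python sketch that splits the "Raw IPMI (hex)" string from the Extra Data field into numeric values. The function name `parse_raw_ipmi` and the small `SENSOR_TYPES` table are made up for illustration. It also assumes that the `Type` value is the standard IPMI sensor type byte, which the event text itself does not confirm.

```python
import re

# Two sensor type codes from the IPMI v2.0 sensor type table.
# Treating the "Type" field as this byte is an assumption.
SENSOR_TYPES = {0x07: "Processor", 0x0C: "Memory"}

def parse_raw_ipmi(extra_data: str) -> dict:
    """Return the 'Key:hex' pairs that follow 'Raw IPMI (hex):' as integers."""
    # Everything before the marker (e.g. the s:68:" wrapper) is ignored.
    _, _, payload = extra_data.partition("Raw IPMI (hex):")
    pairs = re.findall(r"(\w+):([0-9A-Fa-f]+)", payload)
    return {key: int(value, 16) for key, value in pairs}

# The Extra Data string from the event quoted in the question.
EXAMPLE = 's:68:"Raw IPMI (hex): Gen:3000 Num:80 Type:07 EDir:83 ED1:a1 ED2:01 ED3:01";'

fields = parse_raw_ipmi(EXAMPLE)
sensor = SENSOR_TYPES.get(fields["Type"], "unknown")
print(fields)
print("Sensor type (assumed meaning of 'Type'):", sensor)
```

If that assumption holds, type 07 maps to a processor sensor rather than a memory sensor (0C), which would fit the point that a CATERR is not necessarily a memory fault. The support team's reading of the full diagnostics should take precedence over this kind of quick decoding.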