In order to better understand your issue, please let us know what console you are using to extract this information for the thermal margin.
The pasted output is from "ipmitool sensor". The pertinent section of the RMM sensor page is below.
P1 Status | reports the processor's presence has been detected | OK | 0x0080
P2 Status | reports the processor's presence has been detected | OK | 0x0080
P1 Therm Margin | All deasserted | Unknown | Not Available
P2 Therm Margin | All deasserted | Unknown | Not Available
P1 Therm Ctrl % | All deasserted | Unknown | Not Available
P2 Therm Ctrl % | All deasserted | Unknown | Not Available
P1 ERR2 | All deasserted | OK | 0x0000
P2 ERR2 | All deasserted | OK | 0x0000
CATERR | All deasserted | OK | 0x0000
P1 MSID Mismatch | All deasserted | OK | 0x0000
CPU Missing | All deasserted | OK | 0x0000
P1 DTS Therm Mgn | All deasserted | Unknown | Not Available
P2 DTS Therm Mgn | All deasserted | Unknown | Not Available
P2 MSID Mismatch | All deasserted | OK | 0x0000
P1 VRD Hot | All deasserted | OK | 0x0000
P2 VRD Hot | All deasserted | OK | 0x0000
P1 MEM01 VRD Hot | All deasserted | OK | 0x0000
P1 MEM23 VRD Hot | All deasserted | OK | 0x0000
P2 MEM01 VRD Hot | All deasserted | OK | 0x0000
P2 MEM23 VRD Hot | All deasserted | OK | 0x0000
DIMM Thrm Mrgn 1 | All deasserted | Unknown | Not Available
DIMM Thrm Mrgn 2 | All deasserted | Unknown | Not Available
DIMM Thrm Mrgn 3 | All deasserted | Unknown | Not Available
DIMM Thrm Mrgn 4 | All deasserted | Unknown | Not Available
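For anyone wanting to check the same thing from the command line, here is a minimal Python sketch that lists the sensors with no reading in "ipmitool sensor" output. It assumes the usual pipe-separated layout (name, reading, unit, status, then thresholds) where an unavailable reading is printed as "na". The function names and the SAMPLE text are made up for illustration; the sample rows reuse sensor names from this thread but are not a capture from the actual system.

```python
import subprocess


def parse_sensor_output(text):
    """Split `ipmitool sensor` output into (name, reading, status) tuples."""
    rows = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        # Skip blank or malformed lines that lack name/reading/unit/status.
        if len(fields) < 4 or not fields[0]:
            continue
        rows.append((fields[0], fields[1], fields[3]))
    return rows


def unavailable_sensors(text):
    """Return the names of sensors whose reading column is 'na'."""
    return [
        name
        for name, reading, _status in parse_sensor_output(text)
        if reading.lower() == "na"
    ]


def read_live_sensors():
    """Run ipmitool locally (assumes it is installed and a BMC is reachable)."""
    result = subprocess.run(
        ["ipmitool", "sensor"], capture_output=True, text=True, check=True
    )
    return result.stdout


# Hypothetical sample in the assumed layout; not real output from this system.
SAMPLE = """\
P1 Status        | 0x0        | discrete   | 0x8000| na | na | na | na | na | na
P1 Therm Margin  | na         | degrees C  | na    | na | na | na | na | na | na
P2 Therm Margin  | na         | degrees C  | na    | na | na | na | na | na | na
P1 VRD Hot       | 0x0        | discrete   | 0x0080| na | na | na | na | na | na
"""

if __name__ == "__main__":
    for sensor_name in unavailable_sensors(SAMPLE):
        print(sensor_name)
```

To run it against a real system, pass read_live_sensors() to unavailable_sensors() instead of SAMPLE.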
What is the part number of the CPUs?
It should be something like SR0L7 or QBxx.
Apologies for the delay. I had to wait for a time when I could tear the machine down.
The CPU part numbers are SR0L7. While I had the system apart, I put these two CPUs in an S2600GZ board and verified that they do in fact report processor thermal margin values there, as expected. I also took the opportunity to try a pair of E5-2609s (SR0LA) in the S2600CO4 board. These processors have also been verified to report thermal margin values. As with the others, the sensors report as unavailable when the processors are in the S2600CO4 board.
This would certainly point the finger at the S2600CO board as having something misconfigured, or wrong with it.
You could try reflashing the complete firmware stack, especially the ME, BMC and SDRs.
I would also restore the BMC defaults, which can be done with the syscfg -rbfd (I think) command. (You may need to run syscfg /? to get the help and find the restore BMC defaults command.)
I would not give this very high odds of working, as the cause is more likely a damaged CPU pin or damage on the motherboard.
I am fairly certain it is not damage per se, as I have two boards that behave exactly the same way. However, it might be the boards. Prompted by your damage comment, I was looking at the second board in detail, just giving it a good looking over. Turns out it is an engineering sample board. Turns out both boards are. Now, I wouldn't normally expect that to be the cause. In the past, the engineering sample equipment we have gotten from Intel has been fully functional, if not at its final hardware rev. Usually it just means that we got it before it had completed certifications. I suppose that these boards could have not been fully functional yet, or that the ME connection to the processors could have been changed slightly such that release firmware expects things to be different. If that is the case, it will be disappointing. I hate to trash a couple of otherwise functional boards.
I will give the BMC reset a try, just to cover all the bases.
Engineering samples are meant for OEMs (Original Equipment Manufacturers), and Intel provides them for testing purposes only. They may lack features that the production units will have. We strongly recommend returning these to the place of purchase or your Intel representative and requesting production units instead.
Well, that would require returning them directly to Intel as, at the time we purchased these boards, we were an Intel OEM. In cleaning up recently, these boards were discovered. We have a number of other engineering sample systems we use for various purposes in our hardware lab and it was decided to see if these boards could be put to use. It seems the answer is "sort of" as they appear to be fully functional, other than the broken CPU thermal sensors. While I expected some features to be missing or to not work, something as basic as CPU thermal sensors wasn't expected. They will just have to be used where the extra fan noise is not an issue. Not ideal, but not worth putting much effort into either.
We at Intel appreciate your feedback on this matter. Thank you for taking the time to communicate this issue to us.