Server Products
Data Center Products including boards, integrated systems, Intel® Xeon® Processors, and RAID Storage

S2600GZ: what should I replace first?

AZhuk2
Beginner
2,315 Views

Last night I had the following problem with an S2600GZ server, serial number QSGR21400619.

The server had been up for about half a year, and under heavy load for about three months.

What should I replace first?

The processor? If so, which one?

The memory?

Or the motherboard?

I can also send the DebugLog file on request.

ID | Date/Time | Sensor | Sensor type | Event
315 | 05.12.2013 8:49 | BIOS Evt Sensor | System Event | reports OEM System Boot Event - Asserted
314 | 05.12.2013 8:48 | BIOS Evt Sensor | System Event | reports Timestamp Clock Sync. Event is one of two expected events from BIOS on every power on. - Asserted
313 | 05.12.2013 8:48 | BIOS Evt Sensor | System Event | reports Timestamp Clock Sync. Event is one of two expected events from BIOS on every power on. - Asserted
312 | 05.12.2013 8:47 | Pwr Unit Status | Power Unit | reports the power unit is powered off or being powered down - Deasserted
311 | 05.12.2013 8:47 | Pwr Unit Status | Power Unit | reports the power unit is powered off or being powered down - Asserted
310 | 05.12.2013 5:43 | BMC FW Health | Management Subsystem Health | 'P2 Therm Ctrl %' sensor has failed and may not be providing a valid reading - Asserted
309 | 05.12.2013 5:43 | BMC FW Health | Management Subsystem Health | 'P1 Therm Ctrl %' sensor has failed and may not be providing a valid reading - Asserted
308 | 05.12.2013 2:48 | BIOS Evt Sensor | System Event | reports OEM System Boot Event - Asserted
307 | 05.12.2013 2:47 | Mmry ECC Sensor | Memory | Uncorrectable ECC. CPU: 1, DIMM: B1. - Asserted
306 | 05.12.2013 2:47 | BIOS Evt Sensor | System Event | reports Timestamp Clock Sync. Event is one of two expected events from BIOS on every power on. - Asserted
305 | 05.12.2013 2:43 | BIOS Evt Sensor | System Event | reports Timestamp Clock Sync. Event is one of two expected events from BIOS on every power on. - Asserted
304 | 05.12.2013 2:43 | CATERR | Processor | reports it has been asserted - Deasserted
303 | 05.12.2013 2:43 | CATERR | Processor | reports it has been asserted - Asserted
302 | 04/29/2013 08:26:20 | PS2 Status | Power Supply | reports a predictive failure has been detected for the power supply - Deasserted
301 | 04/29/2013 08:26:19 | PS2 Status | Power Supply | reports a predictive failure has been detected for the power supply - Asserted
300 | 03/19/2013 15:15:50 | PS2 Status | Power Supply | reports a predictive failure has been detected for the power supply - Deasserted
299 | 03/19/2013 15:15:49 | PS2 Status | Power Supply | reports a predictive failure has been detected for the power supply - Asserted
298 | 03/18/2013 22:55:12 | Pwr Unit Redund | Power Unit | reports redundancy has been lost, but the unit is still functioning with the minimum amount of resources needed for normal operation - Deasserted
297 | 03/18/2013 22:55:12 | Pwr Unit Redund | Power Unit | reports redundancy has been lost - Deasserted
296 | 03/18/2013 22:55:11 | PS2 Status | Power Supply | reports a predictive failure has been detected for the power supply - Deasserted
295 | 03/18/2013 22:55:11 | Pwr Unit Redund | Power Unit | reports redundancy has been lost, but the unit is still functioning with the minimum amount of resources needed for normal operation - Asserted
294 | 03/18/2013 22:55:11 | Pwr Unit Redund | Power Unit | reports redundancy has been lost - Asserted
1 Solution
DSilv11
Valued Contributor III
769 Views

307 | 05.12.2013 2:47 | Mmry ECC Sensor | Memory | Uncorrectable ECC. CPU: 1, DIMM: B1. - Asserted

This is a hard failure on the DIMM.

It most likely caused this error as a secondary message:

303 | 05.12.2013 2:43 | CATERR | Processor | reports it has been asserted

The two 'Therm Ctrl %' events are very strange. I have seen similar events on early Engineering Sample processors, but not on production processors. They indicate that the BMC can't read the CPU temperature, so the fans will all go to 100%.

'P2 Therm Ctrl %' sensor has failed and may not be providing a valid reading - Asserted

This might be related to the DIMM, if the DIMM is hanging the I2C bus, but it is very strange.

I would recommend:

Replacing DIMM B1, as this is the error that took the system down.

Updating to the newest code stack release for BIOS, BMC, ME and FRUSDR (this may fix the PSU messages).
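
For anyone who wants to pick the failing DIMM out of a longer SEL dump, here is a minimal Python sketch. It assumes the log has been saved as plain text and that uncorrectable ECC records use the same wording as record 307 above; the function name `failing_dimms` and the sample text are made up for illustration and are not part of any Intel tool.

```python
import re

# Matches the event text of an uncorrectable ECC record as it is worded in
# the log above, e.g. "Uncorrectable ECC. CPU: 1, DIMM: B1. - Asserted".
ECC_RE = re.compile(
    r"Uncorrectable ECC\.\s*CPU:\s*(\d+),\s*DIMM:\s*([A-Z]\d+)\.\s*-\s*Asserted"
)

def failing_dimms(sel_text):
    """Return a sorted list of unique (cpu, dimm) pairs with an asserted uncorrectable ECC event."""
    return sorted({(int(cpu), dimm) for cpu, dimm in ECC_RE.findall(sel_text)})

# Sample text copied from records 308, 307 and 303 of the log in this thread.
sample = (
    "308 05.12.2013 2:48 BIOS Evt Sensor System Event reports OEM System Boot Event - Asserted\n"
    "307 05.12.2013 2:47 Mmry ECC Sensor Memory Uncorrectable ECC. CPU: 1, DIMM: B1. - Asserted\n"
    "303 05.12.2013 2:43 CATERR Processor reports it has been asserted - Asserted\n"
)

print(failing_dimms(sample))  # [(1, 'B1')]
```

The CATERR and boot records are ignored on purpose, since only the ECC record names a specific slot.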

6 Replies
AZhuk2
Beginner
769 Views

Thanks.

Anyway, memory is the cheapest option.

An empirical question: the computer in an inexpensive car can tell exactly what is going on with the car. Why can't the computer inside a computer do the same?

 

DSilv11
Valued Contributor III
769 Views

You must have better luck with auto OBD codes than I have. I usually get 3 or 4 codes in my car and then have to figure out which of 3 or 4 components is bad.

Hmmm, you had 3 or 4 codes on your computer... Wonder if the same guy wrote the code?

AZhuk2
Beginner
769 Views

Well, I actually get a few more codes than that from my automatic transmission diagnostics.

Conclusion: I got the best answer from Intel support. They simply suggested swapping the DIMM in B1 with another DIMM.

I did that on Friday. Today the server went down twice with the same error, but for a different memory slot. We installed the same type of memory there, but with a different batch number. I hope that fixes it.

Thanks, everyone, for your time.

SAlex9
Beginner
769 Views

Hello everybody.

I have the same problem with my server, which I bought a year ago.

Please see the configuration:

Today I had two unexpected restarts, at 05:45 and 07:00. In the six hours since then, the server has been running fine.

I installed SEL Viewer and I see two errors like the topic starter's: one for a DIMM and one CATERR.

Please see the SEL file: https://www.dropbox.com/s/xk3mdm8zdmqlr29/Sel12122013.sel

Unfortunately, I can only get into the server room 8 hours from now (it's closed at night).

Please tell me, what do I need to do in the morning?

I have no spare DIMMs like this one, but I can buy one (though it will have a different batch number). If I can't buy one tomorrow, is it possible to remove (not replace) the first module? Will the server work fine?

Thank you.

Edward_Z_Intel
Employee
769 Views

I'd suggest you replace DIMM D1 first.

I think it should be OK to temporarily remove the DIMM from the D1 slot. Just remember that for each CPU, all blue DIMM slots need to be populated before the black slots.
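
The blue-before-black rule can be sanity-checked before pulling a module with a small Python sketch like the one below. The slot layout in it is purely hypothetical and for illustration only (the real slot colors must be taken from the board documentation), and `population_violations` is a made-up helper name.

```python
def population_violations(layout, populated):
    """Report CPUs where a black slot is populated while a blue slot is still empty.

    layout:    dict mapping a CPU name to {"blue": [...], "black": [...]} slot names.
    populated: set of slot names that currently hold a DIMM.
    Returns a dict: CPU name -> (empty blue slots, populated black slots).
    """
    violations = {}
    for cpu, slots in layout.items():
        empty_blue = [s for s in slots["blue"] if s not in populated]
        used_black = [s for s in slots["black"] if s in populated]
        if empty_blue and used_black:
            violations[cpu] = (empty_blue, used_black)
    return violations

# HYPOTHETICAL layout, not taken from the S2600GZ documentation.
layout = {
    "CPU1": {"blue": ["A1", "B1", "C1", "D1"], "black": ["A2", "B2", "C2", "D2"]},
}

# D1 removed and no black slots in use: the rule is not violated.
print(population_violations(layout, {"A1", "B1", "C1"}))        # {}

# D1 removed while D2 is still populated: the rule is violated.
print(population_violations(layout, {"A1", "B1", "C1", "D2"}))  # {'CPU1': (['D1'], ['D2'])}
```

Under this assumed layout, removing the D1 module is only a concern if a black slot on the same CPU is still populated.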
