6 Replies Latest reply on Dec 11, 2013 7:35 PM by edwardzh

    S2600GZ, What should I change the first?

    AOZ

      Last night I had the following trouble with an S2600GZ server, serial number QSGR21400619.

      The server had been up for about half a year, under heavy load for about three months.

      What should I replace first?

      The processor? If so, which one?

      The memory?

      Or the motherboard?

       

      I can also send the DebugLog file on request.

       

      ID | Timestamp | Sensor | Sensor type | Event
      315 | 05.12.2013 8:49 | BIOS Evt Sensor | System Event | reports OEM System Boot Event - Asserted
      314 | 05.12.2013 8:48 | BIOS Evt Sensor | System Event | reports Timestamp Clock Sync. Event is one of two expected events from BIOS on every power on. - Asserted
      313 | 05.12.2013 8:48 | BIOS Evt Sensor | System Event | reports Timestamp Clock Sync. Event is one of two expected events from BIOS on every power on. - Asserted
      312 | 05.12.2013 8:47 | Pwr Unit Status | Power Unit | reports the power unit is powered off or being powered down - Deasserted
      311 | 05.12.2013 8:47 | Pwr Unit Status | Power Unit | reports the power unit is powered off or being powered down - Asserted
      310 | 05.12.2013 5:43 | BMC FW Health | Management Subsystem Health | 'P2 Therm Ctrl %' sensor has failed and may not be providing a valid reading - Asserted
      309 | 05.12.2013 5:43 | BMC FW Health | Management Subsystem Health | 'P1 Therm Ctrl %' sensor has failed and may not be providing a valid reading - Asserted
      308 | 05.12.2013 2:48 | BIOS Evt Sensor | System Event | reports OEM System Boot Event - Asserted
      307 | 05.12.2013 2:47 | Mmry ECC Sensor | Memory | Uncorrectable ECC. CPU: 1, DIMM: B1. - Asserted
      306 | 05.12.2013 2:47 | BIOS Evt Sensor | System Event | reports Timestamp Clock Sync. Event is one of two expected events from BIOS on every power on. - Asserted
      305 | 05.12.2013 2:43 | BIOS Evt Sensor | System Event | reports Timestamp Clock Sync. Event is one of two expected events from BIOS on every power on. - Asserted
      304 | 05.12.2013 2:43 | CATERR | Processor | reports it has been asserted - Deasserted
      303 | 05.12.2013 2:43 | CATERR | Processor | reports it has been asserted - Asserted
      302 | 04/29/2013 08:26:20 | PS2 Status | Power Supply | reports a predictive failure has been detected for the power supply - Deasserted
      301 | 04/29/2013 08:26:19 | PS2 Status | Power Supply | reports a predictive failure has been detected for the power supply - Asserted
      300 | 03/19/2013 15:15:50 | PS2 Status | Power Supply | reports a predictive failure has been detected for the power supply - Deasserted
      299 | 03/19/2013 15:15:49 | PS2 Status | Power Supply | reports a predictive failure has been detected for the power supply - Asserted
      298 | 03/18/2013 22:55:12 | Pwr Unit Redund | Power Unit | reports redundancy has been lost, but the unit is still functioning with the minimum amount of resources needed for normal operation - Deasserted
      297 | 03/18/2013 22:55:12 | Pwr Unit Redund | Power Unit | reports redundancy has been lost - Deasserted
      296 | 03/18/2013 22:55:11 | PS2 Status | Power Supply | reports a predictive failure has been detected for the power supply - Deasserted
      295 | 03/18/2013 22:55:11 | Pwr Unit Redund | Power Unit | reports redundancy has been lost, but the unit is still functioning with the minimum amount of resources needed for normal operation - Asserted
      294 | 03/18/2013 22:55:11 | Pwr Unit Redund | Power Unit | reports redundancy has been lost - Asserted
        • 1. Re: S2600GZ, What should I change the first?
          Doc_SilverCreek

          307 | 05.12.2013 2:47 | Mmry ECC Sensor | Memory | Uncorrectable ECC. CPU: 1, DIMM: B1. - Asserted

          This is a hard failure on the DIMM, which most likely caused this error as a secondary message:

          303 | 05.12.2013 2:43 | CATERR | Processor | reports it has been asserted

           

          These two are very strange. I have seen similar events on early engineering sample processors, but not on production processors. They indicate that the BMC can't read the CPU temperature, so the fans will all go to 100%:

          'P2 Therm Ctrl %' sensor has failed and may not be providing a valid reading - Asserted

          This might be related to the DIMM, if the DIMM is hanging the I2C bus, but it is very strange.

           

          I would recommend:

          Replacing DIMM B1, as this is the error that took the system down.

          Updating to the newest code stack release for BIOS, BMC, ME and FRUSDR (this may fix the PSU messages).
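
When a SEL export contains many events, it can help to pull out only the uncorrectable ECC entries and count them per DIMM slot before deciding what to replace. The following is a minimal sketch, assuming the event text is worded exactly as in the log quoted in this thread; the function name `count_uncorrectable_ecc` and the sample lines are illustrative, not part of any Intel tool.

```python
import re
from collections import Counter

# Pattern taken from the event text quoted in this thread. Other BMC firmware
# versions may word the event differently (assumption).
ECC_EVENT = re.compile(
    r"Uncorrectable ECC\. CPU: (\d+), DIMM: ([A-Z]\d+)\. - Asserted"
)


def count_uncorrectable_ecc(sel_lines):
    """Return a Counter keyed by (cpu, dimm_slot) for uncorrectable ECC events."""
    hits = Counter()
    for line in sel_lines:
        match = ECC_EVENT.search(line)
        if match:
            hits[(int(match.group(1)), match.group(2))] += 1
    return hits


# Sample lines modeled on the log above (hypothetical input for illustration).
sample = [
    "05.12.2013 2:47 Mmry ECC Sensor Memory Uncorrectable ECC. CPU: 1, DIMM: B1. - Asserted",
    "05.12.2013 2:43 CATERR Processor reports it has been asserted - Asserted",
]

for (cpu, slot), count in count_uncorrectable_ecc(sample).most_common():
    print(f"CPU {cpu}, DIMM {slot}: {count} uncorrectable ECC event(s)")
```

A slot that keeps showing up across reboots is the first candidate for replacement. If the error follows the module after a swap, the DIMM is at fault rather than the slot.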

          • 2. Re: S2600GZ, What should I change the first?
            AOZ

            Thanks.

            Anyway, memory is the cheapest.

             

            An empirical question: the computer in an inexpensive car can tell exactly what is going on with the car. Why can't the computer inside a computer do the same?

            • 3. Re: S2600GZ, What should I change the first?
              Doc_SilverCreek

              You must have better luck with auto OBD codes than I have. I usually get 3 or 4 codes in my car and then have to figure out which of 3 or 4 components is bad.

               

              Hmmm, you had 3 or 4 codes on your computer...... Wonder if the same guy wrote the code?

              • 4. Re: S2600GZ, What should I change the first?
                AOZ

                Well, actually I get a few more codes than that from my automatic transmission diagnostics.

                 

                Conclusion: I got the best answer from Intel support. They simply suggested swapping the B1 DIMM with another DIMM.

                I did that on Friday. Today the server went down twice with the same error, now for the other memory slot. We installed the same memory there, but from a different batch. I hope that settles it.

                 

                Thanks to everyone for your time.

                • 5. Re: S2600GZ, What should I change the first?
                  AlexeySp

                  Hello everybody.

                   

                  I have the same problem with my server. I bought it one year ago.

                  Please see the configuration: http://ssmaker.ru/25a4fdce.jpg

                   

                  Today I had two unexpected restarts, at 05:45 and 07:00. Six hours have passed since then, and the server is running fine.

                  I installed SEL Viewer and I see two errors like the topic starter's: one for a DIMM and one CATERR.

                  Please see the SEL file: Dropbox - Sel12122013.sel

                  Unfortunately, I can't get into the server room until 8 hours from now (it's closed at night).

                  Please tell me, what do I need to do in the morning?

                  I have no spare DIMM modules like this one, but I can buy one (though it will be from a different batch). If I can't buy one tomorrow, is it possible to remove (not replace) the first module? Will the server work fine?

                  Thank you.

                  • 6. Re: S2600GZ, What should I change the first?
                    edwardzh

                    I'd suggest you replace DIMM D1 first.

                     

                    I think it should be OK to temporarily remove the DIMM from the D1 slot. Just remember that for each CPU, all blue DIMM slots need to be populated before the black slots.
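
The population rule can be checked on paper before going to the server room. Here is a minimal sketch, assuming a hypothetical layout for one CPU in which slots A1 to D1 are blue and A2 to D2 are black; the actual slot colors and names must be taken from the board's documentation, and the function name `population_violations` is illustrative.

```python
def population_violations(populated, blue_slots, black_slots):
    """Return the blue slots left empty while any black slot is populated."""
    populated = set(populated)
    if not populated & set(black_slots):
        # No black slot is in use, so the rule cannot be violated.
        return []
    return sorted(set(blue_slots) - populated)


# Hypothetical slot layout for a single CPU (assumption, not from the thread).
BLUE = {"A1", "B1", "C1", "D1"}
BLACK = {"A2", "B2", "C2", "D2"}

# D1 removed while a black slot is still populated: the rule is violated.
print(population_violations({"A1", "B1", "C1", "A2"}, BLUE, BLACK))

# D1 removed with only blue slots in use: nothing to fix.
print(population_violations({"A1", "B1", "C1"}, BLUE, BLACK))
```

If the first call reports an empty blue slot, moving a module from a black slot into it restores a valid configuration under this rule.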