38 Replies Latest reply on Nov 6, 2017 12:33 AM by David_Intel

    SC2600CO random reboot

    guilhermefsmiguel

      Dear all, I have an S2600CO board in a server with 2 Xeon CPUs and 196GB of RAM.

      This machine is used as a calculation node in a cluster environment, alongside other machines that have almost the same configuration.

       

      A few days ago it started rebooting for no apparent reason.

       

      To try to identify the problem, I checked all DIMM slots and all the memory modules, looking for a faulty one. I tested all of them but could not find an error.

      Then I checked the SEL logs:

       

        1 | 09/17/2017 | 16:38:10 | Event Logging Disabled #0x07 | Log area reset/cleared | Asserted

         2 | 09/17/2017 | 17:16:55 | Power Unit #0x01 | Failure detected | Asserted

         3 | 09/17/2017 | 17:16:56 | Power Unit #0x01 | Power off/down | Asserted

         4 | 09/17/2017 | 17:17:01 | Power Unit #0x01 | Power off/down | Deasserted

         5 | 09/17/2017 | 17:17:01 | Power Unit #0x01 | Failure detected | Deasserted

         6 | 09/17/2017 | 17:17:02 | Power Unit #0x01 | Power off/down | Asserted

         7 | 09/17/2017 | 17:17:07 | Power Unit #0x01 | Power off/down | Deasserted

         8 | 09/17/2017 | 17:17:13 | Fan #0x32 | Lower Non-critical going low  | Deasserted

         9 | 09/17/2017 | 17:17:13 | Fan #0x32 | Lower Critical going low  | Deasserted

         a | 09/17/2017 | 17:17:13 | Fan #0x32 | Lower Non-critical going low  | Deasserted

         b | 09/17/2017 | 17:17:13 | Fan #0x32 | Lower Critical going low  | Deasserted

         c | 09/17/2017 | 17:17:24 | Fan #0x32 | Lower Non-critical going low  | Asserted

         d | 09/17/2017 | 17:17:24 | Fan #0x32 | Lower Critical going low  | Asserted

         e | 09/17/2017 | 17:17:31 | System Event #0x83 | Timestamp Clock Sync | Asserted

         f | 09/17/2017 | 17:17:32 | System Event #0x83 | Timestamp Clock Sync | Asserted

        10 | 09/17/2017 | 17:17:55 | System Event #0x83 | OEM System boot event | Asserted
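
A listing like the one above can be summarized with a few lines of Python. This is only a minimal sketch: it assumes the six pipe-separated columns shown in the paste (hexadecimal record ID, date, time, sensor, event, state), and the names `SelEvent` and `parse_sel_line` are made up for illustration. The sample rows are copied from the log above.

```python
from collections import Counter
from datetime import datetime
from typing import NamedTuple


class SelEvent(NamedTuple):
    record_id: int   # first column, read as a hexadecimal number
    when: datetime
    sensor: str
    event: str
    asserted: bool


def parse_sel_line(line: str) -> SelEvent:
    """Parse one 'id | date | time | sensor | event | state' line."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) != 6:
        raise ValueError(f"unexpected SEL line: {line!r}")
    rec, date, time, sensor, event, state = fields
    return SelEvent(
        record_id=int(rec, 16),
        when=datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S"),
        sensor=sensor,
        event=event,
        asserted=(state == "Asserted"),
    )


# A few rows copied from the SEL listing above.
SAMPLE = """\
   2 | 09/17/2017 | 17:16:55 | Power Unit #0x01 | Failure detected | Asserted
   3 | 09/17/2017 | 17:16:56 | Power Unit #0x01 | Power off/down | Asserted
   c | 09/17/2017 | 17:17:24 | Fan #0x32 | Lower Non-critical going low  | Asserted
   d | 09/17/2017 | 17:17:24 | Fan #0x32 | Lower Critical going low  | Asserted
  10 | 09/17/2017 | 17:17:55 | System Event #0x83 | OEM System boot event | Asserted
"""

events = [parse_sel_line(row) for row in SAMPLE.splitlines() if row.strip()]

# Count asserted events per sensor to see which sensors are involved.
asserted_by_sensor = Counter(e.sensor for e in events if e.asserted)
for sensor, count in asserted_by_sensor.most_common():
    print(f"{sensor}: {count} asserted event(s)")
```

Counting asserted events per sensor only shows which sensors are involved in each cycle; it does not by itself identify a root cause.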

       

      and on the BMC web console:

       

      30 | 09/17/2017 17:39:32 | Pwr Unit Status | Power Unit | reports the power unit is powered off or being powered down | Asserted
      29 | 09/17/2017 17:37:19 | BIOS Evt Sensor | System Event | reports OEM System Boot Event | Asserted
      28 | 09/17/2017 17:36:56 | BIOS Evt Sensor | System Event | reports Timestamp Clock Sync. Event is one of two expected events from BIOS on every power on. | Asserted
      27 | 09/17/2017 17:36:56 | BIOS Evt Sensor | System Event | reports Timestamp Clock Sync. Event is one of two expected events from BIOS on every power on. | Asserted
      26 | 09/17/2017 17:36:49 | System Fan 3 | Fan | reports the sensor is in a low, critical, and going lower state | Asserted
      25 | 09/17/2017 17:36:49 | System Fan 3 | Fan | reports the sensor is in a low, but non-critical, and going lower state | Asserted
      24 | 09/17/2017 17:36:36 | System Fan 3 | Fan | reports the sensor is in a low, critical, and going lower state | Deasserted
      23 | 09/17/2017 17:36:36 | System Fan 3 | Fan | reports the sensor is in a low, but non-critical, and going lower state | Deasserted
      22 | 09/17/2017 17:36:34 | System Fan 3 | Fan | reports the sensor is in a low, critical, and going lower state | Deasserted
      21 | 09/17/2017 17:36:34 | System Fan 3 | Fan | reports the sensor is in a low, but non-critical, and going lower state | Deasserted
      20 | 09/17/2017 17:36:31 | Pwr Unit Status | Power Unit | reports the power unit is powered off or being powered down | Deasserted
      19 | 09/17/2017 17:36:26 | Pwr Unit Status | Power Unit | reports the power unit has suffered a failure | Deasserted
      18 | 09/17/2017 17:36:20 | Pwr Unit Status | Power Unit | reports the power unit is powered off or being powered down | Asserted
      17 | 09/17/2017 17:36:20 | Pwr Unit Status | Power Unit | reports the power unit has suffered a failure | Asserted

       

      The power unit "Failure detected" event does not seem to point to the real cause, since I have replaced the power supply and the problem remains.

      All the sensors, fans, etc. are OK. There is no problem with them, but the fault LED keeps blinking amber.

       

      No errors are reported in the BIOS, only in the SEL.

       

      I have downloaded the debug logs, but I could not check them because they are password protected.

       

       

      The board information is as follows:

       

      Manufacturing Date :2012-10-09   03:53
      Manufacturer :Intel Corporation
      Product Name :S2600CO
      Serial Number:QSCO22700376
      Part/Model Number :G29920-205
      FRU File ID :FRU Ver 1.00

      If anyone could help me, it would be great.

       

      Best regards,

        • 1. Re: SC2600CO random reboot
          Intel Corporation
          This message was posted on behalf of Intel Corporation

          Hi guilhermefsmiguel,
           
          I am Mike and it is a pleasure to assist you.
           
          The Intel® Server Board S2600COE (G29920-205) is rebooting randomly, the SEL utility shows power supply #1 as faulty, and a new power supply did not solve the issue.

          I noticed that your system is running FRU 1.0, which was released in 2012, so my first recommendation is to update the BIOS-firmware of the server. Before doing so, please run our Intel® System Support Utility and send us the results; based on the current BIOS-firmware version, I will let you know which BIOS version you need to use for the update. If we jump straight to the latest version, 02.06.0006 (7/20/2017), the server might stop working properly.

          Downloads for Intel® System Support Utility (Windows* and Linux*)

          I will be waiting for your results to assist you further.
           
          Regards,
          Mike C
           

          • 2. Re: SC2600CO random reboot
            guilhermefsmiguel

            Hi Mike, thank you for the support.

             

            As you requested, I have executed the ssu.sh on the machine and the result is attached.

             

            I forgot to mention in my first message that once I figured out that the problem was not a memory or power problem, I tried to update the BIOS, but the FRU could not be upgraded.

             

            Once again, thank you for the support

            • 3. Re: SC2600CO random reboot
              guilhermefsmiguel

              Mike, I also need to tell you that I am not using any PCI Express card or any hard disk other than the one shown in the report.

              We use an SSD as swap, but this disk is deactivated, which is why you may see that there is no virtual memory available.

              This machine is not part of the cluster as a node right now, so there is no processing load on it. It is currently configured as under maintenance.

               

              Thank you for the support,

               

              M.Eng. Guilherme Fernandes de Souza Miguel

              • 4. Re: SC2600CO random reboot
                guilhermefsmiguel

                Mike, I am also attaching the ssu report with 3rd party log messages.

                 

                Thank you,

                • 5. Re: SC2600CO random reboot
                  Intel Corporation
                  This message was posted on behalf of Intel Corporation

                  Hi guilhermefsmiguel,
                   
                  I noted you have updated the BIOS to the latest version 02.06.0006; however, the logs are not showing the current version of the ME and BMC firmware.
                   
                  Please run our application Intel® System Information Retrieval Utility and send me the results.
                   
                  Additionally, please send us the model of the chassis; if you are using an Intel® System, please add the product code of the chassis.
                   
                  Regards,
                  Mike C
                   

                  • 6. Re: SC2600CO random reboot
                    guilhermefsmiguel

                    Hi Mike,

                     

                    Here are the files you have requested.

                     

                    Please note that OpenSuSE LEAP 42.1 doesn't have a /var/log/messages file.

                    journalctl is the option now. If necessary, I can try to send its output to you.

                     

                    Best regards,

                    • 7. Re: SC2600CO random reboot
                      Intel Corporation
                      This message was posted on behalf of Intel Corporation
                      Hi guilhermefsmiguel,

                      Thank you for your update. The BMC and ME firmware versions are up to date; however, the FRUSDR has not been updated yet. The system is using version 1.08.

                      Let’s try updating it to an older version, 1.09, first. Use BIOS-Firmware version 01.06.0002R4151 and follow the steps below:
                      https://downloadcenter.intel.com/download/22399/Intel-Server-Board-S2600CO-Firmware-Update-Package-for-Extensible-Firmware-Interface-EFI-?product=63157
                      FRUSDR update steps:
                      1) Boot the system to the EFI shell and go to root folder
                      2) At the EFI command prompt, run "FRUSDR.nsh" to start FRUSDR update
                      3) Answer questions and enter desired information when prompted.
                      4) When complete, reboot the system using the front control panel

                      Verify that the FRUSDR update worked:
                      1)  During POST, hit the F2 key when prompted to access the BIOS Setup Utility
                      2)  Hit the F9 key to load BIOS defaults, then hit F10 (save changes)
                      3)  At the MAIN menu verify the BIOS revision is 02.06.0006 
                      4)  Move cursor to the SERVER MANAGEMENT Menu
                      5)  Move cursor down to the SYSTEM INFORMATION Option and hit Enter
                      6)  Verify the BMC Firmware revision is 01.28.10603
                      7)  Verify the SDR revision is 1.09
                      8)  Verify the ME Firmware revision is 02.01.07.328
                      9)  Hit the F10 Key to save changes and Exit
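
The verification above boils down to comparing four version strings. A minimal sketch of that comparison follows, assuming the values are copied by hand from the BIOS Setup screens; the helper name `find_mismatches` and the `observed` readings are illustrative only (the SDR value 1.08 is the version reported before the update).

```python
# Expected revisions, taken from the verification steps above.
EXPECTED = {
    "BIOS": "02.06.0006",
    "BMC": "01.28.10603",
    "SDR": "1.09",
    "ME": "02.01.07.328",
}


def find_mismatches(observed, expected=EXPECTED):
    """Return {component: (expected, observed)} for every version that differs."""
    return {
        name: (want, observed.get(name))
        for name, want in expected.items()
        if observed.get(name) != want
    }


# Hypothetical readings typed in from the SYSTEM INFORMATION screen.
observed = {
    "BIOS": "02.06.0006",
    "BMC": "01.28.10603",
    "SDR": "1.08",
    "ME": "02.01.07.328",
}

mismatches = find_mismatches(observed)
for name, (want, got) in mismatches.items():
    print(f"{name}: expected {want}, found {got}")
```

An empty result would mean all four revisions match the values listed in the steps.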

                      If it works, do the same with FRUSDR version 1.11

                      I will be waiting for the outcome of this workaround. Please also let me know the brand name and model of the chassis.

                      Regards,
                      Mike C
                      • 8. Re: SC2600CO random reboot
                        guilhermefsmiguel

                        Hi Mike,

                         

                        First of all, thank you for your time and help.

                        Checking your steps, I did not see any reference to a jumper change on the motherboard, so I assume that this is not necessary, right?

                        I am travelling and will return to the university on Friday, which is why I ask: do you think it is better to wait until Friday to run this procedure on site, or is it safe to run it over SOL?

                        I know that there are risks involved in any firmware upgrade, but I am not sure whether a BMC restart during its update, with the update process being controlled through a SOL session, could make the update fail.

                        If it is not necessary to change any jumper position and there is no additional risk in doing this upgrade via SOL, I will ask another person to download the software to a USB drive and insert it into the machine to proceed with the update.

                         

                        Thank you once again,

                        • 9. Re: SC2600CO random reboot
                          Intel Corporation
                          This message was posted on behalf of Intel Corporation

                          Hi guilhermefsmiguel,

                          It is my pleasure to assist you. 

                          The FRUSDR update does not require removing a jumper from the board; we can do it using the EFI shell.

                          I suggest updating the FRUSDR firmware on site rather than remotely. The BIOS might get corrupted if we try the remote option.

                          Let me know how the workaround goes at your convenience. I will be waiting for your results.

                          Regards,
                          Mike C

                          • 10. Re: SC2600CO random reboot
                            guilhermefsmiguel

                            Hi Mike,

                             

                            I have upgraded the FRU firmware to the versions you recommended, and the screens on which you asked me to confirm the versions are attached.

                            But the problem persists.

                             

                             

                              1a | 09/22/2017 | 14:02:59 | System Event #0x83 | OEM System boot event | Asserted

                              1b | 09/22/2017 | 14:04:16 | Power Unit #0x01 | Failure detected | Asserted

                              1c | 09/22/2017 | 14:04:16 | Power Unit #0x01 | Power off/down | Asserted

                              1d | 09/22/2017 | 14:04:21 | Power Unit #0x01 | Power off/down | Deasserted

                              1e | 09/22/2017 | 14:04:21 | Power Unit #0x01 | Failure detected | Deasserted

                              1f | 09/22/2017 | 14:04:54 | System Event #0x83 | Timestamp Clock Sync | Asserted

                              20 | 09/22/2017 | 14:04:54 | System Event #0x83 | Timestamp Clock Sync | Asserted

                              21 | 09/22/2017 | 14:05:19 | System Event #0x83 | OEM System boot event | Asserted

                              22 | 09/22/2017 | 14:09:59 | Power Unit #0x01 | Failure detected | Asserted

                              23 | 09/22/2017 | 14:09:59 | Power Unit #0x01 | Power off/down | Asserted

                              24 | 09/22/2017 | 14:10:04 | Power Unit #0x01 | Power off/down | Deasserted

                              25 | 09/22/2017 | 14:10:04 | Power Unit #0x01 | Failure detected | Deasserted

                              26 | 09/22/2017 | 14:10:35 | System Event #0x83 | Timestamp Clock Sync | Asserted

                              27 | 09/22/2017 | 14:10:35 | System Event #0x83 | Timestamp Clock Sync | Asserted

                              28 | 09/22/2017 | 14:11:01 | System Event #0x83 | OEM System boot event | Asserted

                              29 | 09/22/2017 | 14:14:50 | Power Unit #0x01 | Failure detected | Asserted

                              2a | 09/22/2017 | 14:14:51 | Power Unit #0x01 | Power off/down | Asserted

                              2b | 09/22/2017 | 14:14:56 | Power Unit #0x01 | Power off/down | Deasserted

                              2c | 09/22/2017 | 14:15:08 | Power Unit #0x01 | Failure detected | Deasserted

                              2d | 09/22/2017 | 14:15:26 | System Event #0x83 | Timestamp Clock Sync | Asserted

                              2e | 09/22/2017 | 14:15:27 | System Event #0x83 | Timestamp Clock Sync | Asserted

                              2f | 09/22/2017 | 14:15:52 | System Event #0x83 | OEM System boot event | Asserted

                              30 | 09/22/2017 | 14:17:59 | Power Unit #0x01 | Failure detected | Asserted

                              31 | 09/22/2017 | 14:18:00 | Power Unit #0x01 | Power off/down | Asserted

                              32 | 09/22/2017 | 14:18:05 | Power Unit #0x01 | Failure detected | Deasserted

                              33 | 09/22/2017 | 14:18:10 | Power Unit #0x01 | Power off/down | Deasserted

                              34 | 09/22/2017 | 14:18:35 | System Event #0x83 | Timestamp Clock Sync | Asserted

                              35 | 09/22/2017 | 14:18:35 | System Event #0x83 | Timestamp Clock Sync | Asserted

                              36 | 09/22/2017 | 14:23:09 | System Event #0x83 | OEM System boot event | Asserted
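
To see how long the machine stays up in each cycle, the boot and failure timestamps from the listing above can be paired up. This is a rough sketch under two assumptions of mine: "boot" stands for an "OEM System boot event" row, "failure" stands for an asserted "Power Unit / Failure detected" row, and the function name `uptimes_before_failure` is made up for illustration.

```python
from datetime import datetime

FMT = "%m/%d/%Y %H:%M:%S"

# (timestamp, kind) pairs copied from the SEL listing above.
BOOTS_AND_FAILURES = [
    ("09/22/2017 14:02:59", "boot"),
    ("09/22/2017 14:04:16", "failure"),
    ("09/22/2017 14:05:19", "boot"),
    ("09/22/2017 14:09:59", "failure"),
    ("09/22/2017 14:11:01", "boot"),
    ("09/22/2017 14:14:50", "failure"),
    ("09/22/2017 14:15:52", "boot"),
    ("09/22/2017 14:17:59", "failure"),
]


def uptimes_before_failure(rows):
    """Return the seconds between each boot event and the next failure."""
    result = []
    last_boot = None
    for stamp, kind in rows:
        when = datetime.strptime(stamp, FMT)
        if kind == "boot":
            last_boot = when
        elif kind == "failure" and last_boot is not None:
            result.append(int((when - last_boot).total_seconds()))
            last_boot = None  # wait for the next boot before measuring again
    return result


uptimes = uptimes_before_failure(BOOTS_AND_FAILURES)
print(uptimes)  # [77, 280, 229, 127]
```

For these four cycles the interval between the boot event and the next logged failure ranges from about one to five minutes, with no obvious fixed period.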

                             

                            Do you think it is worth testing the power supply again?

                            • 11. Re: SC2600CO random reboot
                              guilhermefsmiguel

                              I would like to mention that the machine restarts when it boots into Linux, even without any CPU or RAM consumption. If I load the EFI shell or enter the BIOS setup, it doesn't restart.

                               

                              I have googled about it, but I could not find the exact same problem.

                               

                              I have attached the information shown in the BMC web interface.

                               

                              Thank you once again for your time and support.

                              • 12. Re: SC2600CO random reboot
                                guilhermefsmiguel

                                Dear Mike, another professor told me a few seconds ago that he saw the machine restart even while it was in the EFI shell.

                                So please disregard my previous statement that it only restarts when it is booted into Linux.

                                • 13. Re: SC2600CO random reboot
                                  Intel Corporation
                                  This message was posted on behalf of Intel Corporation

                                  Hi guilhermefsmiguel,

                                  Thank you for your update. The system is still showing the power supply as faulty even with FRUSDR 1.11.

                                  I suggest updating the FRUSDR to the latest version, 1.12 (Version: 02.03.0003). Hopefully, it will solve the issue. Keep using the same method.

                                  FRUSDR update steps:
                                  1) Boot the system to the EFI shell and go to root folder
                                  2) At the EFI command prompt, run "FRUSDR.nsh" to start FRUSDR update
                                  3) Answer questions and enter desired information when prompted.
                                  4) When complete, reboot the system using the front control panel

                                  If the problem continues, double check if OpenSuSE LEAP 42.1 is up to date.

                                  Please, keep us posted with the results.

                                  Regards,
                                  Mike C

                                  • 14. Re: SC2600CO random reboot
                                    Intel Corporation
                                    This message was posted on behalf of Intel Corporation

                                    Hi Guilhermefsmiguel,
                                     
                                    Thank you for your update. I am interested to know if you are still having issues with the Intel® Server Board S2600COE.
                                     
                                    Regards,
                                    Mike C
                                     
