4 Replies Latest reply on Sep 11, 2018 10:00 AM by gchq

    WHEA Logger - Event ID 17 - 5520/x58

    gchq

      An eight-year-old Supermicro pedestal server has started to freeze and/or BSOD. The error logged just before each incident is Event ID 17 from WHEA-Logger.

       

      It points to

       

      Bus 0

      Device 5

      Function 0

       

      and that leads to

       

      Intel 5520/ x58 I/O Hub PCI Express Root Port 5-340C

       

      OS is Windows Server 2008 R2

       

      Any help on pinning this down would be appreciated.

       

      Thanks

        • 1. Re: WHEA Logger - Event ID 17 - 5520/x58
          Intel Corporation
          This message was posted on behalf of Intel Corporation

          Hello gchq,
           
           Thank you for joining this community; it will be more than a pleasure to assist you.
           
          Please provide us with your full system configuration and specifications (Motherboard model, CPU and any additional details that you consider important)
           
           
          I hope to hear from you soon.
           
           
          Regards,
          Diego S.
           

          • 2. Re: WHEA Logger - Event ID 17 - 5520/x58
            gchq

            Thank you for your reply Diego

             

            Server is SuperMicro SYS-7046T-6F Dual LGA Xeon

             

            http://www.supermicro.com/products/system/4U/7046/SYS-7046T-6F.cfm

             

            MB is X8DT6-F

             

            X8DT6-F | Motherboards | Products - Super Micro Computer, Inc.

             

            Processors are 2 x Xeon X5680

             

            All temperatures seem normal, but when it throws the toys out and freezes the temperature warning light comes on shortly afterwards

             

            If it does BSOD, the screen stays frozen at the point just before or at the moment the WHEA Logger event is thrown (it doesn't show the BSOD, just a lot of drive activity), which makes me suspect the graphics card (NVIDIA GeForce GTS 450)

             

            It's very random. There was a BSOD about eight months ago, but I didn't retain the information because it then ran without problems until recently.

             

            Of late it can run two or three days without incident, then fall over two or three times in succession, sometimes during login

             

            A corrected hardware error has occurred.

            Component: PCI Express Root Port
            Error Source: Advanced Error Reporting (PCI Express)

            Bus:Device:Function: 0x0:0x5:0x0
            Vendor ID:Device ID: 0x8086:0x340c
            Class Code: 0x30400

             

            ErrorSource 4
              FRUId {00000000-0000-0000-0000-000000000000}
              FRUText 
              ValidBits 0xdf
              PortType 4
              Version 0x101
              Command 0x10
              Status 0x507
              Bus 0x0
              Device 0x5
              Function 0x0
              Segment 0x0
              SecondaryBus 0x0
              Slot 0x0
              VendorID 0x8086
              DeviceID 0x340c
              ClassCode 0x30400
              DeviceSerialNumber 0x0
              BridgeControl 0x0
              BridgeStatus 0x0
              UncorrectableErrorStatus 0x0
              CorrectableErrorStatus 0x1
              HeaderLog 00000000000000000000000000000000
              Length 672
              RawData 435045521002FFFFFFFF02000200000002000000A0020000143B0200050912140000000000000000000000000000000000000000000000000000000000000000BDC407CF89B7184EB3C41F732CB571311FC093CF161AFC4DB8BC9C4DAF67C10405BA5D41C344D40100000000455200000000000000000000000000000000000010010000D0000000010200000100000054E995D9C1BB0F43AD91B44DCB3C6F3500000000000000000000000000000000020000000000000000000000000000000000000000000000E0010000C00000000102000000000000ADCC7698B447DB4BB65E16F193C4F3DB00000000000000000000000000000000030000000000000000000000000000000000000000000000DF000000000000000400000001010000100007050000000086800C3400040300050000000000000000000000000000000000000010E042012180000007010100823C3B0041008170800C3000C00348010F0001000000000000000000000000000000000000000000000000000000000001000115000000000000000010200600010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000043010000000000000002000000000000C206020000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000031000000000000000000000000000000000000000000000000000000000000000000000000000000
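For anyone reading a record like the one above, here is a minimal Python sketch of how two of the fields could be decoded. It assumes the CorrectableErrorStatus value follows the standard PCIe AER Correctable Error Status bit layout, and that RawData starts with a CPER-style record header (signature, revision, section count, severity, valid bits, length). The function names are hypothetical and not part of any Windows or Intel tool.

```python
import struct

# Bit positions of the PCIe AER Correctable Error Status register.
AER_CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}

# Severity codes used in the error record header.
CPER_SEVERITY = {0: "Recoverable", 1: "Fatal", 2: "Corrected", 3: "Informational"}


def decode_correctable_status(value):
    """Return the names of all bits set in a CorrectableErrorStatus value."""
    return [name for bit, name in sorted(AER_CORRECTABLE_BITS.items())
            if value & (1 << bit)]


def decode_record_header(raw_hex):
    """Decode the first 24 bytes of the RawData hex string (little-endian)."""
    raw = bytes.fromhex(raw_hex[:48])
    sig, rev, sig_end, sections, severity, valid, length = struct.unpack(
        "<4sHIHIII", raw)
    return {
        "signature": sig.decode("ascii"),
        "revision": rev,
        "section_count": sections,
        "severity": CPER_SEVERITY.get(severity, "Unknown"),
        "valid_bits": valid,
        "record_length": length,
    }


# Values copied from the event above (only the start of RawData is needed).
status_bits = decode_correctable_status(0x1)
header = decode_record_header(
    "435045521002FFFFFFFF02000200000002000000A0020000")
print(status_bits)
print(header)
```

Under those assumptions, CorrectableErrorStatus 0x1 has only bit 0 set (Receiver Error), and the header reports a corrected-severity record of 672 bytes, which agrees with the Length field in the event details. That says the root port saw a link-level error it could recover from; it does not by itself identify which device behind the port is responsible.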

            • 3. Re: WHEA Logger - Event ID 17 - 5520/x58
              Intel Corporation
              This message was posted on behalf of Intel Corporation

              Hello, 

              I recommend updating the firmware/BIOS and the chipset driver. Because you have an OEM server system,
              I suggest contacting the manufacturer of your server to obtain those updates.

              Interactive technical support for this product has been discontinued.
              We can still provide resources and self-service support information through our website,
              and we strongly suggest that you consult the product support site.

              All available technical information is on the support site, which you can find through the following links:
              https://communities.intel.com/community/tech/discontinued-product
              https://www.intel.com/content/www/us/en/support/discontinued-products.html

              Best Regards, 

              Emeth O
               

              • 4. Re: WHEA Logger - Event ID 17 - 5520/x58
                gchq

                Given that this box has run for eight years without any issues, I don't think firmware or drivers are the problem. I suspect it is hardware or cooling related.

                 

                This is the only 2008 R2 box remaining; all the rest are rack-mounted Server 2016 units. I did notice that the CPU temps for those are around 33-34 C

                 

                Here we have temps on one CPU hitting 83 C. Are we approaching a range that would cause this behavior?

                 

                 

                 

                [Attached image: CPU_Temps.png]