5 Replies Latest reply on Jul 16, 2014 10:50 AM by Mayagrafix

    Linux Machine Check Exception: Is it the CPU?

    josmith

      Hello,

       

      On my laptop, Windows often showed a BSOD after a few minutes of use, so we contacted Dell and provided them with the dump files. They replaced the motherboard.

       

      Now I am running Linux, but random kernel panics occur, sometimes after minutes, sometimes after days.

       

      I configured kdump-tools on my Linux distribution to boot a crash kernel when a panic occurs, so that the memory is dumped along with the dmesg output for post-mortem analysis.

       

      This is what dmesg says when the panic occurs:

       

      [ 3933.364173] mce: [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 3: be00000000200135

      [ 3933.364177] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8171d9c2> {_raw_spin_lock+0x12/0x50}

      [ 3933.364182] mce: [Hardware Error]: TSC a0255fbd7f7 ADDR 42dd14480 MISC d62285

      [ 3933.364185] mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 1398357146 SOCKET 0 APIC 1 microcode 15

      [ 3933.364186] mce: [Hardware Error]: Run the above through 'mcelog --ascii'

      [ 3933.364188] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 3: be00000000200135

      [ 3933.364190] mce: [Hardware Error]: RIP !INEXACT! 33:<0000045a7992c1b5>

      [ 3933.364191] mce: [Hardware Error]: TSC a0255fbd7f0 ADDR 42dd14480 MISC d62285

      [ 3933.364194] mce: [Hardware Error]: PROCESSOR 0:306a9 TIME 1398357146 SOCKET 0 APIC 0 microcode 15

      [ 3933.364195] mce: [Hardware Error]: Run the above through 'mcelog --ascii'

      [ 3933.364196] mce: [Hardware Error]: Machine check: Processor context corrupt

      [ 3933.364197] Kernel panic - not syncing: Fatal Machine check
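
      To compare the two records field by field, the dmesg lines can be pulled apart with a few lines of Python. This is only a sketch: the regular expressions are an assumption based on the line format shown above (the number after "Machine Check Exception:" is treated as the MCG status), and the sample text is a subset of the lines from the log.

```python
import re

# Subset of the dmesg lines quoted above (header and TSC/ADDR/MISC lines only).
DMESG = """\
[ 3933.364173] mce: [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 3: be00000000200135
[ 3933.364182] mce: [Hardware Error]: TSC a0255fbd7f7 ADDR 42dd14480 MISC d62285
[ 3933.364188] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 3: be00000000200135
[ 3933.364191] mce: [Hardware Error]: TSC a0255fbd7f0 ADDR 42dd14480 MISC d62285
"""

# Assumed patterns, derived from the lines above rather than from kernel docs.
HEADER = re.compile(
    r"CPU (\d+): Machine Check Exception: ([0-9a-f]+) Bank (\d+): ([0-9a-f]+)"
)
DETAIL = re.compile(r"TSC ([0-9a-f]+) ADDR ([0-9a-f]+) MISC ([0-9a-f]+)")


def parse_mce_records(text):
    """Return one dict per reported CPU, with hex fields converted to int."""
    records = []
    for line in text.splitlines():
        header = HEADER.search(line)
        if header:
            records.append({
                "cpu": int(header.group(1)),
                "mcgstatus": int(header.group(2), 16),
                "bank": int(header.group(3)),
                "status": int(header.group(4), 16),
            })
            continue
        detail = DETAIL.search(line)
        if detail and records:
            # The detail line belongs to the most recent header line.
            records[-1].update(
                tsc=int(detail.group(1), 16),
                addr=int(detail.group(2), 16),
                misc=int(detail.group(3), 16),
            )
    return records


records = parse_mce_records(DMESG)
for rec in records:
    print(f"CPU {rec['cpu']}: bank {rec['bank']} "
          f"status {rec['status']:#018x} addr {rec['addr']:#x}")

same_event = len({(r["bank"], r["status"], r["addr"]) for r in records}) == 1
print("All CPUs report the same bank/status/address:", same_event)
```

      For the log above, both records carry the same bank, status and address, so CPU 4 and CPU 0 appear to be reporting one event rather than two independent ones.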

       

      Analyzing the memory dump file with crash (running crash /usr/lib/debug/boot/vmlinux<kernelversion> /path/to/crashdump/file and typing "bt") gives me the following backtrace:

       

      PID: 0 TASK: ffff8804177617f0 CPU: 6 COMMAND: "swapper/6"

      #0 [ffff88042dd89ca0] machine_kexec at ffffffff8104a732

      #1 [ffff88042dd89cf0] crash_kexec at ffffffff810e6ab3

      #2 [ffff88042dd89db8] panic at ffffffff8170ec6c

      #3 [ffff88042dd89e30] mce_panic at ffffffff8103687a

      #4 [ffff88042dd89e70] do_machine_check at ffffffff81038684

      #5 [ffff88042dd89f50] machine_check at ffffffff8171e25f

        [exception RIP: intel_idle+216]

        RIP: ffffffff813dfd78 RSP: ffff88041775de28 RFLAGS: 00000046

        RAX: 0000000000000001 RBX: 0000000000000002 RCX: 0000000000000001

        RDX: 0000000000000000 RSI: ffffffff81c93220 RDI: 0000000000000006

        RBP: ffff88041775de50 R8: ffff88042dd912d0 R9: 000000000000001c

        R10: 0000000000000320 R11: 0000000000000249 R12: 0000000000000002

        R13: 0000000000000001 R14: 0000000000000001 R15: ffffffff81c932e8

        ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018

      --- <MCE exception stack> ---

      #6 [ffff88041775de28] intel_idle at ffffffff813dfd78

      #7 [ffff88041775de58] cpuidle_enter_state at ffffffff815c9570

      #8 [ffff88041775de90] cpuidle_idle_call at ffffffff815c96a9

      #9 [ffff88041775ded0] arch_cpu_idle at ffffffff8101ceae

      #10 [ffff88041775dee0] cpu_startup_entry at ffffffff810beb85

      #11 [ffff88041775df30] start_secondary at ffffffff81040fc8

       

      Diagnosing the dmesg output with mcelog gives me the following:

       

       

      Hardware event. This is not a software error.

      CPU 4 BANK 3 TSC a0255fbd7f7

      RIP !INEXACT! 10:ffffffff8171d9c2

      MISC d62285 ADDR 42dd14480

      TIME 1398357146 Thu Apr 24 18:32:26 2014

      MCG status:RIPV MCIP

      MCi status:

      Uncorrected error

      Error enabled

      MCi_MISC register valid

      MCi_ADDR register valid

      Processor context corrupt

      MCA: Data CACHE Level-1 Data-Read Error

      STATUS be00000000200135 MCGSTATUS 5

      CPUID Vendor Intel Family 6 Model 58

      RIP: _raw_spin_lock+0x12/0x50}                                                      

      SOCKET 0 APIC 1 microcode 15

       

      and

       

      Hardware event. This is not a software error.                                                                       

      CPU 0 BANK 3 TSC a0255fbd7f0

      RIP !INEXACT! 33:45a7992c1b5

      MISC d62285 ADDR 42dd14480

      TIME 1398357146 Thu Apr 24 18:32:26 2014

      MCG status:RIPV MCIP

      MCi status:

      Uncorrected error

      Error enabled

      MCi_MISC register valid

      MCi_ADDR register valid

      Processor context corrupt

      MCA: Data CACHE Level-1 Data-Read Error

      STATUS be00000000200135 MCGSTATUS 5

      CPUID Vendor Intel Family 6 Model 58

      SOCKET 0 APIC 0 microcode 15
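
      The flags that mcelog prints can be cross-checked by decoding the raw STATUS and MCGSTATUS values by hand. The sketch below assumes the bit layout of IA32_MCi_STATUS and IA32_MCG_STATUS and the cache-hierarchy error code pattern (000F 0001 RRRR TTLL) as described in the Intel Software Developer's Manual; the table and function names are made up for this example.

```python
# Values copied from the mcelog output above.
STATUS = 0xBE00000000200135
MCGSTATUS = 0x5

# Assumed bit positions (IA32_MCi_STATUS / IA32_MCG_STATUS, Intel SDM).
STATUS_FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
                59: "MISCV", 58: "ADDRV", 57: "PCC"}
MCG_FLAGS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}

# Assumed sub-field encodings of the cache-hierarchy error code.
REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "generic"}


def set_flags(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for bit, name in table.items() if (value >> bit) & 1]


def decode_cache_error(status):
    """Decode the low 16 bits if they match 000F 0001 RRRR TTLL.

    Returns None for any other error class (bus, TLB, internal, ...).
    """
    code = status & 0xFFFF
    if (code & 0xEF00) != 0x0100:  # bit 12 (F) is ignored by the mask
        return None
    return {
        "request": REQUEST.get((code >> 4) & 0xF, "unknown"),
        "transaction": TRANSACTION.get((code >> 2) & 0x3, "reserved"),
        "level": LEVEL[code & 0x3],
    }


status_flags = set_flags(STATUS, STATUS_FLAGS)
mcg_flags = set_flags(MCGSTATUS, MCG_FLAGS)
cache_error = decode_cache_error(STATUS)

print("MCi_STATUS flags:", status_flags)
print("MCG_STATUS flags:", mcg_flags)
print("MCA error code:  ", cache_error)
```

      Under these assumptions the decoded values agree with mcelog: UC, EN, MISCV, ADDRV and PCC are set, MCG status is RIPV plus MCIP, and the error code is a data read (DRD) on the level-1 data cache.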

       

       

      I have also run many passes of Memtest86+, and it found no errors, so the memory seems to be fine. Given that the motherboard has already been replaced, it is very likely that the CPU is bad, right? Does anything in the output support that view?