1 Reply Latest reply on Feb 21, 2012 11:26 AM by Adolfo_Intel

    Diagnosing source of mcelog errors under Linux for Pentium D 950

    deploylinux

      We had an event at the start of this month in which the office ambient temperature went out of spec (normally under 77 °F, but for a few days it spiked into at least the mid 80s). Unfortunately, after that was addressed we began to see mcelog errors on a Linux workstation, and the errors have become ever more frequent.

       

      We suspect processor damage. The 950 is already quite sensitive to temperature, as it runs at 3.4 GHz and sits near an older high-end NVIDIA graphics card. We would like to confirm this before buying and installing a replacement processor, rather than swapping out memory or the motherboard.

       

      Note that to test this we tried to change the processor frequency, but the Pentium D doesn't support this, so we had to adjust the ACPI T-states instead. Interestingly, with a T-state of 4, which ensures that only one core has load at any one time, the mcelog errors go away and the system seems completely stable, although very slow. I would take this as further evidence that it is a processor-related issue and not motherboard/RAM.

       

      The mcelog errors generally follow this pattern:

       

      Feb 19 00:10:03 hyperion.professionalsysadmin.com MCA: Instruction CACHE Level-3 Instruction-Fetch Error
      Feb 19 00:10:03 hyperion.professionalsysadmin.com STATUS 9000000000000153 MCGSTATUS 0
      Feb 19 00:10:03 hyperion.professionalsysadmin.com MCGCAP 180204 APICID 0 SOCKETID 0
      Feb 19 00:10:03 hyperion.professionalsysadmin.com CPUID Vendor Intel Family 15 Model 6
      Feb 19 00:10:03 hyperion.professionalsysadmin.com HARDWARE ERROR. This is *NOT* a software problem!
      Feb 19 00:10:03 hyperion.professionalsysadmin.com Please contact your hardware vendor
      Feb 19 00:10:03 hyperion.professionalsysadmin.com MCE 30
      Feb 19 00:10:03 hyperion.professionalsysadmin.com CPU 0 BANK 0
      Feb 19 00:10:03 hyperion.professionalsysadmin.com MISC 140002d0002a0 ADDR 1b83041c0
      Feb 19 00:10:03 hyperion.professionalsysadmin.com TIME 1329639003 Sun Feb 19 00:10:03 2012
      Feb 19 00:10:03 hyperion.professionalsysadmin.com MCG status:
      Feb 19 00:10:03 hyperion.professionalsysadmin.com MCi status:
      Feb 19 00:10:03 hyperion.professionalsysadmin.com Error overflow
      Feb 19 00:10:03 hyperion.professionalsysadmin.com MCi_MISC register valid
      Feb 19 00:10:03 hyperion.professionalsysadmin.com MCi_ADDR register valid
      Feb 19 00:10:03 hyperion.professionalsysadmin.com MCA: Generic CACHE Level-1 Snoop Error
      Feb 19 00:10:03 hyperion.professionalsysadmin.com Corrected events: 255
      Feb 19 00:10:03 hyperion.professionalsysadmin.com MISC format 0 value 140002d0002a0
      Feb 19 00:10:03 hyperion.professionalsysadmin.com STATUS cc0000ff20040189 MCGSTATUS 0
      Feb 19 00:10:03 hyperion.professionalsysadmin.com MCGCAP 180204 APICID 0 SOCKETID 0
      Feb 19 00:10:03 hyperion.professionalsysadmin.com CPUID Vendor Intel Family 15 Model 6
      Feb 19 00:10:03 hyperion.professionalsysadmin.com HARDWARE ERROR. This is *NOT* a software problem!
      Feb 19 00:10:03 hyperion.professionalsysadmin.com Please contact your hardware vendor
      Feb 19 00:10:03 hyperion.professionalsysadmin.com MCE 31
      Feb 19 00:10:03 hyperion.professionalsysadmin.com CPU 0 BANK 1
      Feb 19 00:10:03 hyperion.professionalsysadmin.com TIME 1329639003 Sun Feb 19 00:10:03 2012
      Feb 19 00:10:03 hyperion.professionalsysadmin.com MCG status:
      Feb 19 00:10:03 hyperion.professionalsysadmin.com MCi status:
      Feb 19 00:10:03 hyperion.professionalsysadmin.com MCA: Data CACHE Level-1 Data-Read Error
      Feb 19 00:10:03 hyperion.professionalsysadmin.com Corrected events: 200
      Feb 19 00:10:03 hyperion.professionalsysadmin.com STATUS 800008c800000135 MCGSTATUS 0
      Feb 19 00:10:03 hyperion.professionalsysadmin.com MCGCAP 180204 APICID 0 SOCKETID 0
      Feb 19 00:10:03 hyperion.professionalsysadmin.com CPUID Vendor Intel Family 15 Model 6

       

      Note that these all seem to be cache errors. I'm assuming that the caches must be shared somehow between the cores, and that the damage might be interfering with their ability to lock changes, etc.
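The decoded messages can be cross-checked against the raw STATUS values in the log. Below is a minimal Python sketch, assuming the architectural MCi_STATUS layout from the Intel SDM (flag bits 63 to 57, MCA error code in bits 15:0, cache-hierarchy codes of the form 000F 0001 RRRR TTLL). The function name `decode_status` and the `count_39_32` field are made up for illustration, and reading bits 39:32 as the corrected-event count is only an observation that fits the two samples here (255 and 200), not something confirmed from documentation.

```python
# Sketch: decode the STATUS values from the mcelog output above.
# Assumes the architectural MCi_STATUS layout (Intel SDM); helper names are illustrative.

FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN", 59: "MISCV", 58: "ADDRV", 57: "PCC"}
REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
KIND = {0: "Instruction", 1: "Data", 2: "Generic"}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "LG"}  # LG = generic level; this mcelog prints it as Level-3


def decode_status(status_hex):
    """Split a hex MCi_STATUS string into flags, MCA code, and cache fields."""
    status = int(status_hex, 16)
    code = status & 0xFFFF
    result = {
        "flags": [name for bit, name in FLAGS.items() if (status >> bit) & 1],
        "mca_code": hex(code),
        # Assumption: bits 39:32 match the "Corrected events" values in this log.
        "count_39_32": (status >> 32) & 0xFF,
    }
    # Cache hierarchy errors have the form 000F 0001 RRRR TTLL (F = filter bit).
    if (code >> 8) & 0xEF == 0x01:
        result["cache"] = {
            "request": REQUEST.get((code >> 4) & 0xF, "reserved"),
            "kind": KIND.get((code >> 2) & 0x3, "reserved"),
            "level": LEVEL[code & 0x3],
        }
    return result


for raw in ("9000000000000153", "cc0000ff20040189", "800008c800000135"):
    print(raw, decode_status(raw))
```

For cc0000ff20040189 this gives the flags VAL, OVER, MISCV and ADDRV with a Generic L1 SNOOP code, which agrees with the "Error overflow", "MCi_MISC register valid", "MCi_ADDR register valid" and "Generic CACHE Level-1 Snoop Error" lines. None of the three values has UC or PCC set, so all three were reported as corrected errors.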

       

      Can we be confident that this is a processor damage issue?

       

      I've checked and cleaned out the computer interior, and verified that the system fan is running properly and that the temperatures and voltages look correct:

       

      When the ACPI T-state is 4:

      Processor temp is ~48-50 °C

      Motherboard temp seems stable at ~35 °C

      Video card ambient temp is 39 °C

      Video card core temp is about 51 °C

       

      Initial error frequency was just a few times per day, but it was roughly every 15 minutes yesterday until the T-state was switched to 4.
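To put numbers on the error frequency, the epoch values in the `TIME` lines can be pulled out of the syslog. This is a rough Python sketch, assuming the log format shown in the excerpt. `mce_timestamps` and `mean_gap_minutes` are hypothetical helper names, and records that share one timestamp are treated as a single burst.

```python
import re

# Matches lines such as "... TIME 1329639003 Sun Feb 19 00:10:03 2012".
TIME_RE = re.compile(r"\bTIME (\d+)\b")


def mce_timestamps(lines):
    """Return the sorted, de-duplicated epoch seconds found in TIME lines."""
    stamps = set()
    for line in lines:
        match = TIME_RE.search(line)
        if match:
            stamps.add(int(match.group(1)))
    return sorted(stamps)


def mean_gap_minutes(stamps):
    """Average gap between consecutive bursts in minutes, or None if fewer than two."""
    if len(stamps) < 2:
        return None
    return (stamps[-1] - stamps[0]) / (len(stamps) - 1) / 60


# The two TIME lines from the excerpt share one timestamp, so they count as one burst.
excerpt = [
    "Feb 19 00:10:03 hyperion.professionalsysadmin.com TIME 1329639003 Sun Feb 19 00:10:03 2012",
    "Feb 19 00:10:03 hyperion.professionalsysadmin.com TIME 1329639003 Sun Feb 19 00:10:03 2012",
]
stamps = mce_timestamps(excerpt)
print(stamps, mean_gap_minutes(stamps))
```

Fed the full log instead of the two-line excerpt, a mean gap near 15 minutes would match the rate seen before the T-state change.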