We had an event at the start of this month in which the office ambient temperature went out of spec (normally under 77 °F, but for a few days it spiked into at least the mid-80s). Unfortunately, after that was addressed we began to see mcelog errors on a Linux workstation, and the errors have become more and more frequent.
We suspect processor damage: the 950 is already quite sensitive to temperature, as it runs at 3.4 GHz and sits near an older high-end NVIDIA graphics card. We would like to confirm this before buying and installing a replacement processor, rather than swapping out memory or the motherboard.
Note that to test this we tried to change the processor frequency, but the Pentium D doesn't support that, so we had to adjust T-states instead. Interestingly, with a T-state of 4, which as far as I understand it ensures that only one core has load at any one time, the mcelog errors go away and the system seems completely stable, although very, very slow. I would assume this reinforces the theory that the issue is processor related and not motherboard/RAM.
The mcelog errors generally follow this pattern:
Feb 19 00:10:03 hyperion.professionalsysadmin.com MCA: Instruction CACHE Level-3 Instruction-Fetch Error
Feb 19 00:10:03 hyperion.professionalsysadmin.com STATUS 9000000000000153 MCGSTATUS 0
Feb 19 00:10:03 hyperion.professionalsysadmin.com MCGCAP 180204 APICID 0 SOCKETID 0
Feb 19 00:10:03 hyperion.professionalsysadmin.com CPUID Vendor Intel Family 15 Model 6
Feb 19 00:10:03 hyperion.professionalsysadmin.com HARDWARE ERROR. This is *NOT* a software problem!
Feb 19 00:10:03 hyperion.professionalsysadmin.com Please contact your hardware vendor
Feb 19 00:10:03 hyperion.professionalsysadmin.com MCE 30
Feb 19 00:10:03 hyperion.professionalsysadmin.com CPU 0 BANK 0
Feb 19 00:10:03 hyperion.professionalsysadmin.com MISC 140002d0002a0 ADDR 1b83041c0
Feb 19 00:10:03 hyperion.professionalsysadmin.com TIME 1329639003 Sun Feb 19 00:10:03 2012
Feb 19 00:10:03 hyperion.professionalsysadmin.com MCG status:
Feb 19 00:10:03 hyperion.professionalsysadmin.com MCi status:
Feb 19 00:10:03 hyperion.professionalsysadmin.com Error overflow
Feb 19 00:10:03 hyperion.professionalsysadmin.com MCi_MISC register valid
Feb 19 00:10:03 hyperion.professionalsysadmin.com MCi_ADDR register valid
Feb 19 00:10:03 hyperion.professionalsysadmin.com MCA: Generic CACHE Level-1 Snoop Error
Feb 19 00:10:03 hyperion.professionalsysadmin.com Corrected events: 255
Feb 19 00:10:03 hyperion.professionalsysadmin.com MISC format 0 value 140002d0002a0
Feb 19 00:10:03 hyperion.professionalsysadmin.com STATUS cc0000ff20040189 MCGSTATUS 0
Feb 19 00:10:03 hyperion.professionalsysadmin.com MCGCAP 180204 APICID 0 SOCKETID 0
Feb 19 00:10:03 hyperion.professionalsysadmin.com CPUID Vendor Intel Family 15 Model 6
Feb 19 00:10:03 hyperion.professionalsysadmin.com HARDWARE ERROR. This is *NOT* a software problem!
Feb 19 00:10:03 hyperion.professionalsysadmin.com Please contact your hardware vendor
Feb 19 00:10:03 hyperion.professionalsysadmin.com MCE 31
Feb 19 00:10:03 hyperion.professionalsysadmin.com CPU 0 BANK 1
Feb 19 00:10:03 hyperion.professionalsysadmin.com TIME 1329639003 Sun Feb 19 00:10:03 2012
Feb 19 00:10:03 hyperion.professionalsysadmin.com MCG status:
Feb 19 00:10:03 hyperion.professionalsysadmin.com MCi status:
Feb 19 00:10:03 hyperion.professionalsysadmin.com MCA: Data CACHE Level-1 Data-Read Error
Feb 19 00:10:03 hyperion.professionalsysadmin.com Corrected events: 200
Feb 19 00:10:03 hyperion.professionalsysadmin.com STATUS 800008c800000135 MCGSTATUS 0
Feb 19 00:10:03 hyperion.professionalsysadmin.com MCGCAP 180204 APICID 0 SOCKETID 0
Feb 19 00:10:03 hyperion.professionalsysadmin.com CPUID Vendor Intel Family 15 Model 6
Note that these all seem to be cache errors. I'm assuming the caches must be shared somehow between the cores, and the damage might be interfering with their ability to lock changes, etc.
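For anyone who wants to cross-check what mcelog printed, here is a minimal Python sketch that decodes the STATUS values from the log above. It assumes the architectural IA32_MCi_STATUS layout (flag bits 63 down to 57, MCA error code in the low 16 bits) and the cache-hierarchy error-code form `0000 0001 RRRR TTLL`. The function name `decode_status` is made up for illustration, and model-specific fields such as the corrected-event count are deliberately left undecoded.

```python
# Sketch only: decodes the architectural part of an IA32_MCi_STATUS value.
# Assumption: flag bit positions follow the architectural layout; anything
# model-specific (e.g. the corrected-event count) is not decoded here.
FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}
CACHE_TYPE = {0: "Instruction", 1: "Data", 2: "Generic"}
REQUEST = {0: "Generic-Error", 1: "Generic-Read", 2: "Generic-Write",
           3: "Data-Read", 4: "Data-Write", 5: "Instruction-Fetch",
           6: "Prefetch", 7: "Eviction", 8: "Snoop"}


def decode_status(status_hex):
    """Return the set flags and, for cache errors, the decoded error code."""
    status = int(status_hex, 16)
    flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    code = status & 0xFFFF  # MCA error code, bits 15:0
    result = {"flags": flags, "mca_code": code}
    # Cache hierarchy errors have the form 0000 0001 RRRR TTLL.
    if code >> 8 == 0x01:
        result["cache"] = {
            "type": CACHE_TYPE.get((code >> 2) & 0x3, "Unknown"),
            "level": code & 0x3,  # mcelog prints the raw value as "Level-N"
            "request": REQUEST.get((code >> 4) & 0xF, "Unknown"),
        }
    return result


for value in ("9000000000000153", "cc0000ff20040189", "800008c800000135"):
    print(value, decode_status(value))
```

In all three values the UC and PCC bits are clear, which under the layout assumed above means the errors were logged as corrected rather than uncorrected. The second value has OVER set, which agrees with the "Error overflow" line in the log.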
Can we be confident that this is a processor damage issue?
I've checked the computer interior, cleaned it out, and verified that the system fan is running properly and that general temperatures and voltages are correct.
With the ACPI T-state at 4:
processor temp is ~48-50 °C
motherboard temp seems stable at ~35 °C
video card ambient temp is 39 °C
video card core temp is about 51 °C
Initially the errors occurred just a few times a day, but by yesterday they were occurring roughly every 15 minutes, until the T-state was switched to 4.
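To put numbers on how the error rate changes over time, something like the following Python sketch could tally MCE records per hour. It assumes each record carries a `TIME <epoch seconds>` line, as in the log excerpt above; the function name `mce_per_hour` is hypothetical, and buckets are reported in UTC to keep the output independent of the machine's time zone.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Assumption: every MCE record contains a line such as
# "... TIME 1329639003 Sun Feb 19 00:10:03 2012".
TIME_RE = re.compile(r"\bTIME (\d+)\b")


def mce_per_hour(lines):
    """Count MCE records per UTC hour, keyed as 'YYYY-MM-DD HH:00'."""
    counts = Counter()
    for line in lines:
        match = TIME_RE.search(line)
        if match:
            stamp = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
            counts[stamp.strftime("%Y-%m-%d %H:00")] += 1
    return counts


sample = [
    "Feb 19 00:10:03 hyperion.professionalsysadmin.com TIME 1329639003 Sun Feb 19 00:10:03 2012",
    "Feb 19 00:10:03 hyperion.professionalsysadmin.com MCG status:",
    "Feb 19 00:10:03 hyperion.professionalsysadmin.com TIME 1329639003 Sun Feb 19 00:10:03 2012",
]
for hour, count in sorted(mce_per_hour(sample).items()):
    print(hour, count)
```

Running this over the full log before and after the T-state change would show whether the rate really drops to zero or only falls.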
The only way to be sure that the processor is the defective component is to test the processor on a second motherboard and see whether it causes the same behavior, or to test another processor in your system.