4 Replies Latest reply on Jul 12, 2017 6:38 PM by Intel Corporation

    Watchdog timer / Time Stamp Counter (TSC) in Xeon E5 2670. Stability concerns (possible overheating?)


      I recently purchased a refurbished HP Z820 workstation with dual Xeon E5-2670 processors (2 sockets, 8 cores per socket, for 16 cores and 32 threads in total).


      Shortly after receiving the machine I experienced a Windows 10 crash (Windows is on a hard drive) and a Red Hat Enterprise Linux 7.3 crash (Linux is on an SSD). The Windows crash showed the following message: "929-Fatal MCA error. MLC error detected CPU 0. Internal timeout error - watchdog timer (3 strike)", while the Linux crash contained a phrase like 'CPU0 ... TSC dead'. (I don't have the exact text of the error available right now, but "TSC dead" was definitely there.)


      The Wikipedia pages for the two mechanisms (Watchdog timer, Time Stamp Counter) make them sound related. Could both crashes have been caused by the same hardware error?


      I have stress tested the processors with Intel's "Processor Diagnostic Tool" (IPDT) and Prime95. IPDT ran with zero errors, and its stress test produced reasonable temperatures of about 60 degrees C per core. Prime95 pushes the machine much harder: all 16 cores are above 80 degrees C, and one core reached as high as 89 degrees C (the average across all cores is about 84 C). However, Prime95 is not reporting any errors. Although I am concerned about the high temperatures when running Prime95, strictly speaking they are below my processor's Tj-Max (100 C).
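To make the margin explicit, here is a minimal sketch of the headroom arithmetic. The function name `thermal_headroom` and the per-core list are hypothetical; the list is constructed only so that it matches the figures quoted above (hottest core 89 C, average 84 C, every core above 80 C, Tj-Max 100 C) and is not a real sensor dump.

```python
from statistics import mean


def thermal_headroom(core_temps_c, tj_max_c):
    """Return (hottest core, average, margin between hottest core and Tj-Max) in degrees C."""
    hottest = max(core_temps_c)
    return hottest, mean(core_temps_c), tj_max_c - hottest


# Hypothetical readings for 16 cores, chosen to reproduce the numbers in the post.
temps = [89] + [84] * 10 + [83] * 5
hottest, average, margin = thermal_headroom(temps, tj_max_c=100)
print(f"hottest={hottest} C, average={average:.1f} C, margin to Tj-Max={margin} C")
```

With these numbers the hottest core sits 11 C below Tj-Max, which is inside the limit but leaves little room under sustained load.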


      Is there any tool I could use to test for hardware failures related to the TSC or a "watchdog timer"? If there is an error there, it doesn't seem like IPDT or Prime95 is going to find it.
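This is not a hardware diagnostic, but on the Linux side one quick check is whether the kernel is still using the TSC as its clocksource, since it normally falls back to another source (such as hpet or acpi_pm) once it decides the TSC is unreliable. A rough sketch, assuming the standard sysfs clocksource directory is present; the helper names `clocksource_status` and `tsc_in_use` are made up for illustration.

```python
from pathlib import Path

# Standard sysfs location on Linux; assumed to exist on the RHEL install.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def clocksource_status(base=CLOCKSOURCE_DIR):
    """Return (current, available) clocksources, or None if the files cannot be read."""
    base = Path(base)
    try:
        current = (base / "current_clocksource").read_text().strip()
        available = (base / "available_clocksource").read_text().split()
    except OSError:
        return None
    return current, available


def tsc_in_use(status):
    """True only if the kernel currently uses the TSC as its clocksource."""
    return status is not None and status[0] == "tsc"


if __name__ == "__main__":
    status = clocksource_status()
    if status is None:
        print("clocksource information not available on this system")
    else:
        current, available = status
        print(f"current: {current}; available: {', '.join(available)}")
        print("TSC in use" if tsc_in_use(status) else "kernel is not using the TSC")
```

If the current source is not `tsc` even though `tsc` was selected at boot, the kernel log from that boot should say why it switched.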


      And in general, could the high core temperatures be contributing to instability? Note that the crashes did not occur under extremely high load (there may have been high levels of I/O, but nothing like the numerical computation Prime95 does).