5 Replies Latest reply on Mar 6, 2011 10:35 AM by Doc_SilverCreek

    Custom-Built PC Suddenly Unstable After Year of Solid Operation

    steppinwolf

      OK, I built this rig a year ago (specs below). As you can see, it has high-quality components centered around an Intel mobo, CPU and SSD system drive with RAID-1 for increased reliability. The hardware has been rock-solid for a year. The only changes since then (other than Windows patches and applications) have been occasional driver and BIOS updates. It's been running fine with no changes for several months.

       

      Yesterday, I worked on my computer all morning, took a phone call and came back to find the computer frozen--desktop and apps still visible as I left them, but keyboard and mouse not responding. I've seen this occasionally. A simple restart should fix it. Well, the computer got partway through bootup, powered off and restarted the power-up sequence several times! I turned off power and waited about 10 mins (to let things cool down) and tried again. This time, fans would come on, but nothing on screen--not even the BIOS sequence--and no spinup of the mechanical drives. The Intel DP55KG posted code 69 at this point, which means Boot Device Selection (BDS) driver entry point. This made me suspect a bad drive or controller, but I needed some sleep. Next day, I tried booting a couple more times and got a different result. This time I got a single beep, the normal BIOS boot sequence and all the way through Windows startup! However, before I could log in, poof! the power went totally out from under me--no lights on the mobo, no fans, nothing.

       

      OK, this is looking like a power supply issue, don't you think? But I have smooth power coming in through the CyberPower UPS and I'm using a top-rated SeaSonic PSU. I don't have another PSU to test. Could it be something else? I usually troubleshoot/fix problems on my own, but I'd appreciate any suggestions at this point. (I'm also sending this info to Intel and SeaSonic support.)

       

      ---------------------------------------------------------

      OS: Windows 7 Pro x64

      Motherboard: Intel DP55KG AA E47218-403

      CPU: Intel i7 860

      CPU Cooler: Prolimatech Megahalems Rev B

      RAM: G.SKILL (4 x 2GB) F3-12800CL7D-4GBRH

      HD: Intel X25-M G2 80GB SSD x 2 (RAID-1) and Samsung F3 HD103SJ 1TB x 2 (RAID-1)

      PSU: SeaSonic X650 80 PLUS Gold Modular

      Case: Lian Li PC-B10

      Video: PowerColor AX5750 1GBD5-S3DH HD 5750

      Monitor: Samsung T260HD

      Optical: LG 8X BD-ROM 16X DVD-ROM

       

       

      BIOS version: KGIBX10J.86A.5893.2010.1116.0001

      Chipset: P55, INF version 9.1.1.1022

      Disk Controller: Intel 3400 Series SATA RAID controller; driver: RST 9.6.0.1014 (3/3/2010)

      Marvell Controller 88SE6145 RAID Adapter; BIOS 1.2.0.31; Driver 1.2.0.7103 (9/15/2009)

      UPS: CyberPower 1285 AVR

        • 1. Re: Custom-Built PC Suddenly Unstable After Year of Solid Operation
          Doc_SilverCreek

          Yuck, about every symptom in the world!

           

          If you had a PS on hand, I would try it since it would be in the top few suspects.

          But you could also have a shorted device causing the PS to shut down.

           

          The stalls could be memory, but the final lights-out sounds more like a power issue (supply or a device shorting).

           

          Try a minimum configuration.

          No add-in cards, minimum memory, 1 HDD, no CD/DVD drives, etc. Remove everything not needed to boot, then see if it will come back.

           

          If yes, try some diagnostic tests (memtest86, prime95, etc.) to see if you can get it to crash again or detect a hardware problem.

          If all the tests look good, add your devices one at a time and see if you can reproduce the failure.

           

          If no, you have it down to the few components still attached (PS, motherboard, memory, CPU). With your symptoms, I would also suspect them in that order.

           

          The final lights-out sounds like a power supply or something drawing excess current, causing the supply to shut down.

          A brown-out of the 3.3 V or 5 V rail on the PS would cause any / all of your symptoms.
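
          The add-one-at-a-time procedure above could be sketched roughly like this. It is only an illustration: find_culprit and system_is_stable are hypothetical names, and in practice the "stability check" is you running memtest86 or prime95 for a while, not a function call. With an intermittent fault, the first device that appears to trigger a failure may just be the extra load that tips a marginal supply over, so treat the result as a lead rather than proof.

```python
def find_culprit(components, system_is_stable):
    """Reinstall components one at a time, starting from a minimum configuration.

    components: parts to add back, in the order you plan to reinstall them.
    system_is_stable: hypothetical callback that takes the list of parts
        installed so far and returns True if the system survived testing.
    Returns the first part whose addition made the system unstable, or None.
    """
    installed = []
    for part in components:
        installed.append(part)
        if not system_is_stable(installed):
            return part
    return None


if __name__ == "__main__":
    # Simulated example only: pretend the optical drive is the faulty part.
    parts = ["video card", "second RAID pair", "optical drive"]
    simulated_check = lambda installed: "optical drive" not in installed
    print(find_culprit(parts, simulated_check))
```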

          • 2. Re: Custom-Built PC Suddenly Unstable After Year of Solid Operation
            steppinwolf

            Doc, I want to thank you for taking time to respond and provide such a helpful troubleshooting list. I actually stumbled onto what seems to be the cause earlier today, but your response probably would have saved me some time.

             

            It turned out to be something I didn't even include in the components list. About 3-4 weeks ago, my trusty Logitech Cordless Desktop S 510 was swapped for a Microsoft Natural Ergonomic Desktop 7000 to help alleviate a repetitive stress injury. Anyway, I discovered the system could boot into BIOS setup and remain there indefinitely without crashing. Temps and voltage looked normal and remained stable. Only one problem. After 5-10 minutes, the keyboard stopped responding. I had to reset to get out of the BIOS.

             

            That focused attention on the keyboard/mouse. Sure enough, the system hasn't crashed or locked up since swapping back to Logitech. The Ergonomic Desktop 7000 worked fine for a month, but as you said, it could be causing a short. Anyway, I'm keeping your suggestions for next time and giving you credit for the answer.

            • 3. Re: Custom-Built PC Suddenly Unstable After Year of Solid Operation
              steppinwolf

              Sadly, my belief that the new Microsoft keyboard caused a short or something was short-lived. The system ran OK for hours that day, then became unstable again (sudden power-down), then swung to the other extreme where it would not stay powered on more than a few seconds. Removing and reseating the video card and memory and more thoroughly blowing dust out of all crevices allowed the system to remain up longer.

               

              Strangely, the type of symptom has become consistent now, even as the amount of time it remains powered up varies from seconds to more than an hour. I get the two-toned thermal alarm at power on. Then it boots to the message "CPU was shutdown due to a thermal event (Overheating)" with a prompt to press Enter to continue. Then it boots into the BIOS (pressing F2) or all the way into Windows, but thermal protection can suddenly power it down at any time.

               

              Here's where it gets strange again. When I boot into BIOS and monitor temperatures, processor thermal margin remains excellent at somewhere between 56 and 61 °C. Internal and remote temps are in the 38 to 46 °C range. Monitoring for 15-20 minutes, temps remain stable. Then I boot into Windows and fire up the Desktop Utility, which also reports excellent thermal margin. It runs OK for at least an hour, with Desktop Utility temps visible the entire time. Suddenly, while just sitting idle, it shuts down again. Most recently I remained in BIOS to monitor for about 20 mins, and when I went to escape out of BIOS it suddenly powered down again. (This is definitely not a Windows issue.)

               

              Found some interesting advice about Intel mobos and thermal events here: http://www.techimo.com/forum/general-tech-discussion/140400-rundll-message-popup-thermal-event-overheating-problems-2.html. Also there are discussions out there about capacitors going bad on some mobos.

               

              Since my system was rock-solid stable for a year and shows no sign of leaky capacitors, I'm starting to suspect warping and/or separation between heatsink and CPU over time. Even though overall CPU temps are excellent, perhaps one or more individual cores are overheating? I have a giant Prolimatech Megahalems cooler with a 120mm fan blowing through it. But the motherboard is vertical, so there's downward pressure from the large heatsink. If this is the cause, it's the one flaw in my home-built rig that was designed to be stable, cool and quiet with power to spare.

               

              At this point I probably have to remove the entire mobo, heatsink and CPU to check for warping/separation... 

              • 4. Re: Custom-Built PC Suddenly Unstable After Year of Solid Operation
                steppinwolf

                Update: The computer stayed up over an hour lying horizontal on its side to mitigate possible warping/separation. Detailed sensor information below indicates no unusually hot cores or grossly uneven heating (at least in horizontal position). I'd be willing to keep the PC in this position if it would help. However, the LG BD-ROM/DVD burner will be on its side (vertical) and I'm not sure it functions well in this position. May have to put it in an external enclosure...

                 

                Anyway, none of that matters right now because the PC just powered off a few minutes ago while I was typing. Back to the laptop.

                 

                Sensor readings using AIDA64:


                    Sensor Properties:
                      Sensor Type                     Analog Devices ADT7490  (SMBus 2Ch)
                      GPU Sensor Type                 Diode  (ATI-Diode)
                      Motherboard Name                Intel DP55KG / DP55SB / DP55WG


                    Temperatures:
                      Motherboard                     38 °C  (100 °F)
                      CPU #1 / Core #1                31 °C  (88 °F)
                      CPU #1 / Core #2                28 °C  (82 °F)
                      CPU #1 / Core #3                30 °C  (86 °F)
                      CPU #1 / Core #4                32 °C  (90 °F)
                      South Bridge                    38 °C  (100 °F)
                      GPU Diode (DispIO)              35 °C  (95 °F)
                      GPU Diode (MemIO)               38 °C  (100 °F)
                      GPU Diode (Shader)              34 °C  (93 °F)
                      SAMSUNG HD103SJ                 21 °C  (70 °F)
                      SAMSUNG HD103SJ                 22 °C  (72 °F)


                    Cooling Fans:
                      CPU                             1896 RPM
                      Front                           727 RPM
                      Rear                            563 RPM
                      Aux                             503 RPM
                      GPU                             40%


                    Voltage Values:
                      CPU Core                        0.902 V
                      +3.3 V                          3.283 V
                      +5 V                            5.052 V
                      +12 V                           12.375 V
                      DIMM                            1.510 V
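
                For anyone wanting to sanity-check the rail voltages above, here is a rough sketch in Python. It assumes the commonly cited ±5% tolerance for the positive ATX rails and assumes the motherboard sensor values are trustworthy; the function names are made up for illustration. Keep in mind that a single reading taken while the system is up cannot rule out a momentary sag at the instant of a shutdown.

```python
# Assumption: +/-5% is the tolerance commonly cited for the +3.3 V, +5 V
# and +12 V ATX rails. Readings are copied from the sensor dump above.
ATX_TOLERANCE = 0.05

readings = {3.3: 3.283, 5.0: 5.052, 12.0: 12.375}  # nominal -> measured (volts)


def check_rail(nominal, measured, tolerance=ATX_TOLERANCE):
    """Return (within_tolerance, deviation_percent) for one rail."""
    low = nominal * (1 - tolerance)
    high = nominal * (1 + tolerance)
    deviation_pct = (measured - nominal) / nominal * 100
    return low <= measured <= high, deviation_pct


def check_all(rail_readings):
    """Apply check_rail to every nominal/measured pair."""
    return {
        nominal: check_rail(nominal, measured)
        for nominal, measured in rail_readings.items()
    }


if __name__ == "__main__":
    for nominal, (ok, dev) in check_all(readings).items():
        status = "within tolerance" if ok else "OUT OF RANGE"
        print(f"+{nominal} V rail: {dev:+.2f}% ({status})")
```

                By this check all three rails sit inside the assumed window, with +12 V the furthest from nominal, which fits the observation that the idle readings look normal.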

                 


                Temp readings using Real Temp:

                [Attached image: Real Temp.png]

                • 5. Re: Custom-Built PC Suddenly Unstable After Year of Solid Operation
                  Doc_SilverCreek

                  Stranger and Stranger,

                   

                  The MB reporting a CPU thermal shutdown is pretty solid evidence that this is why the shutdown occurred. Something set a thermal trip bit.

                  Now to determine if it was accurate.

                  The temps all look very good, which makes a thermal shutdown suspect.

                   

                  If we believe the motherboard message, I would say check the heat sink mounting and thermal grease.

                  I would just plan on cleaning and reinstalling.

                  Inspect the heat sink when you remove it to see that it was fully seated and that the grease is even across the whole surface of the processor.

                  Don't get too much grease when reinstalling. Make sure it is a thin even coat.

                   

                  Option B

                  The readings you are reporting are all normal to cool, which makes me wonder about the motherboard's errors.

                  Sometimes monitoring software can generate this type of failure because it is polling the sensors at the same time the motherboard tries to poll them.

                  You could try uninstalling the monitoring software and see if the issue goes away.

                   

                  Option C

                  Hardware failure / electronic noise causing a failure.

                  There is no easy way to isolate this type of issue outside of a lab.

                  General field isolation is to swap in known-good parts until the failure goes away.