2 Replies Latest reply on Jun 24, 2012 6:43 AM by Gilboa Davara

    S5520SC: Rock solid w/ single NV GTX470, random (?) power off when NV 9400 is added.

    Gilboa Davara

      Hello all,

       

      I've got a two-year-old (rock solid) S5520SC machine with the following configuration:

      Case: Coolermaster HAF 932.

      PSU: Thermaltake Toughpower 1200w.

      CPUs: 2 x Xeon X5680.

      MEM: 6 x 2GB KVR1333D3E9S/2GI (6/6).

      GPU: nVidia GTX 470.

      Drives: 5 x 320GB in software RAID5.

      Sound: SB Audigy 2SZ.

      OS: Fedora Linux 17 / x86_64.

       

      For the past two years, the machine has been rock solid (uptimes measured in months).

      The only minor issue is the 1.5.4.2 beep code on reboot, which as far as I know is caused by the use of a non-Intel-certified case.

       

      Now, in order to improve VM performance, I'm entertaining the idea of adding a second, low-power GPU as the host GPU and giving guests direct access to the GTX470 GPU.

      To test the concept, I added a second, known-good, passively cooled 9400 GPU (taken from another Xeon workstation) and connected it alongside the current GPU.

      The machine rebooted without problems, and after some fiddling with xorg.conf I got simultaneous display on both cards.

       

      *However*, after a couple of minutes of work the machine simply turned off: no reboot, no OOPS (Linux's version of a blue screen), nothing in the logs, nada.

      I restarted the machine. Again, boot was successful, display out of both cards, everything working, 5 minutes idle, power off.

       

      Once I removed the 9400 GPU, the machine returned to its old self.

      Just to be certain, I reconnected the 9400 to another machine and it worked out of the box.

       

      The 9400 is a low-end GPU (AFAIR it uses <30w peak), so I would imagine that MB power usage should not be an issue.

      The PSU is well-above-and-beyond my current usage level (w/ Seti@home running at full swing my APC UPS is reporting ~350-400w power usage).
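
For reference, the power-budget reasoning above can be written out as a quick back-of-the-envelope check. This is only a minimal Python sketch: the 30 W figure for the 9400 is the from-memory upper bound quoted above, not a measurement, and the function and constant names are purely illustrative. It also assumes the UPS figure was taken without the 9400 installed.

```python
# Rough power-budget check using the figures quoted in this post.
PSU_RATED_W = 1200            # Thermaltake Toughpower rating
UPS_REPORTED_W = (350, 400)   # wall draw reported by the APC UPS under Seti@home load
GPU_9400_PEAK_W = 30          # assumed upper bound for the 9400 (from memory)


def worst_case_load(ups_range, extra_w):
    """Return the highest reported wall draw plus the extra card's peak draw."""
    return max(ups_range) + extra_w


def utilisation(load_w, rated_w):
    """Return the load as a fraction of the PSU rating."""
    return load_w / rated_w


load = worst_case_load(UPS_REPORTED_W, GPU_9400_PEAK_W)
print(f"worst-case load: {load} W ({utilisation(load, PSU_RATED_W):.0%} of PSU rating)")
```

Since the UPS measures draw at the wall, PSU conversion losses are already included, so the actual DC load on the PSU is lower than this estimate.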

      As for cooling, the HSF is very well cooled, fans are cleaned regularly, and the current 470 GPU temperature idles at ~65C and peaks at ~85C.

       

      Any ideas?

      - Gilboa

        • 1. Re: S5520SC: Rock solid w/ single NV GTX470, random (?) power off when NV 9400 is added.
          Gilboa Davara

          OK, took the PSU out and tested it on another Xeon workstation (Tyan board w/ SLI).

          No issues whatsoever.

          Looked at the IPMI SEL logs, and found multiple critical events of two types:

          1. PCI-E Critical error:

          SEL Record ID          : 05a0

          Record Type          : 02

          Timestamp            : 06/21/2012 18:54:21

          Generator ID          : 0033

          EvM Revision          : 04

          Sensor Type          : Critical Interrupt

          Sensor Number        : 05

          Event Type            : OEM

          Event Direction      : Assertion

          Event Event Data (RAW)      : a60038

          Description          :

          Sensor ID              : PCIe Cor Sensor (0x5)

          Entity ID              : 49.1 (PCI Express Bus)

          Sensor Type            : Critical Interrupt

           

          2. IOH over-heating.

          SEL Record ID          : 0586

          Record Type          : 02

          Timestamp            : 06/21/2012 07:00:01

          Generator ID          : 0020

          EvM Revision          : 04

          Sensor Type          : Temperature

          Sensor Number        : 22

          Event Type            : Threshold

          Event Direction      : Assertion

          Event Event Data (RAW)      : 570505

          Trigger Reading      : 5.000 degrees C

          Trigger Threshold    : 5.000 degrees C

          Description          : Upper Non-critical going high

          Sensor ID              : IOH Therm Margin (0x22)

          Entity ID            : 7.18

          Sensor Type (Analog)  : Temperature

          Sensor Reading        : -19 (+/- 0) degrees C

          Status                : ok

          Lower Non-Recoverable : na

          Lower Critical        : na

          Lower Non-Critical    : na

          Upper Non-Critical    : 5.000

          Upper Critical        : 10.000

          Upper Non-Recoverable : na

          Assertion Events      :

          Assertions Enabled    : unc+ ucr+

          Deassertions Enabled  : unc+ ucr+
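
As a sanity check on the second record, here is a minimal Python sketch that decodes its raw event data bytes (570505). The bit layout is the generic IPMI threshold-event format as I understand it; the function name and dictionary keys are my own, and it does not apply to the OEM-type PCI-E record above.

```python
# Decode the three "Event Data" bytes of an IPMI threshold-type SEL event.
# Byte 1: bits 7:6 describe byte 2, bits 5:4 describe byte 3,
#         bits 3:0 hold the threshold event offset.

THRESHOLD_OFFSETS = {
    0x0: "Lower Non-critical going low",
    0x1: "Lower Non-critical going high",
    0x2: "Lower Critical going low",
    0x3: "Lower Critical going high",
    0x4: "Lower Non-recoverable going low",
    0x5: "Lower Non-recoverable going high",
    0x6: "Upper Non-critical going low",
    0x7: "Upper Non-critical going high",
    0x8: "Upper Critical going low",
    0x9: "Upper Critical going high",
    0xA: "Upper Non-recoverable going low",
    0xB: "Upper Non-recoverable going high",
}


def decode_threshold_event(raw_hex):
    """Split a 3-byte threshold event data string into its fields."""
    data = bytes.fromhex(raw_hex)
    if len(data) != 3:
        raise ValueError("expected exactly 3 event data bytes")
    b1, b2, b3 = data
    return {
        "event": THRESHOLD_OFFSETS.get(b1 & 0x0F, "reserved"),
        # A value of 1 in each 2-bit field means the byte carries a raw sensor value.
        "byte2_is_trigger_reading": (b1 >> 6) & 0x3 == 1,
        "byte3_is_trigger_threshold": (b1 >> 4) & 0x3 == 1,
        "raw_reading": b2,
        "raw_threshold": b3,
    }


print(decode_threshold_event("570505"))
```

The decoded offset agrees with the "Upper Non-critical going high" description in the record. The raw values are sensor units; converting them to degrees in general needs the sensor's SDR conversion factors, although for this sensor the tool reports raw 5 as 5.000 degrees C.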

           

          It's very possible that, given the position of the IOH (right beneath the second PCI slot) and its relatively small heatsink w/o active cooling, adding a second card simply sends it over the edge.

          Nevertheless, does anyone have any idea whether the first PCI-E event is critical? What does it mean?

          - Gilboa

          • 2. Re: S5520SC: Rock solid w/ single NV GTX470, random (?) power off when NV 9400 is added.
            Gilboa Davara

            P.S. The HAF 932 case has a 300mm fan right above the IOH. Weird... :(