Here is a photo of the memory test failing - it happens early in the test, when it reaches 'moving inversions':
It shows these errors for both the 1600 MHz RAM and the 1333 MHz RAM, and since the server otherwise works, I'm not sure I can trust this memtest86 result.
The system event log for the time the server crashed had:
Event ID: 46
A fatal hardware error has occurred.
Error Source: Generic
So I wanted to run tests for a day or two to prove or disprove that a memory problem exists.
The front panel system status LED blinks green (no amber).
Sorry, I should have specified: when I said SEL, I meant the BMC log. If there's something logged at the hardware level, then I would think that is definitive.
By the way, you never really described why you think the memory is faulting besides indicating a system crash. Is there anything more there that you can say about the event - was it a single event?
Is memtest86 performing a modulo-x test? If so, does the memory pass it? If it does, you can likely ignore the moving-inversions results: "caching, buffering and out of order execution will interfere with the moving inversions algorithm and make less effective."
Kick that around and see how it goes,
memtest86 has a Modulo-20 test, but it failed as dramatically as all the rest.
Do you know how I can view this BMC log? On older machines I used to be able to view it via the BIOS, but that is not available on this machine. I booted the management CD that came with the server, but it has no utility to view the log. I can't install the Active System Console software, as it does not run (http://communities.intel.com/thread/30361) and its PHP conflicts with some database monitoring software already installed - they won't play nicely together.
I'm not sure memory is the problem. I've had two random crashes so far, and the Windows event log seems to point to memory, but I'm not certain - I just want to test it to either eliminate it or confirm it as the problem.
There is a SEL viewer in the latest firmware update package for S1200BTL. Not a bad idea to consider upgrading your firmware before taking any further steps - besides looking into the SEL that is.
thanks for that - got the logs now.
Items of interest before the unexpected reboots:
773 8/15/2012-4:37:31 AM Memory Mmry ECC Sensor (#0x02) CRITICAL event: Mmry ECC Sensor reports uncorrectable error. There has been an uncorrectable ECC or other uncorrectable memory error for the memory module CPU_1, Channel = A, DIMM = 2. BIOS - LUN#0 (Channel#0
1231 8/20/2012-2:57:17 PM Memory Mmry ECC Sensor (#0x02) CRITICAL event: Mmry ECC Sensor reports uncorrectable error. There has been an uncorrectable ECC or other uncorrectable memory error for the memory module CPU_1, Channel = A, DIMM = 2. BIOS - LUN#0 (Channel#0
Still not sure whether it's the memory or maybe the controller: between the two failures above, all the memory was swapped to 1333 for testing and then the 1600 was put back, so there was only a 1 in 4 chance that the same module ended up in the same slot.
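To keep track of which slot the errors land in as more entries show up, a rough Python sketch like the one below could tally uncorrectable ECC events per slot. It assumes the SEL viewer's text output looks like the two entries pasted above (other firmware versions may format it differently); the names `SEL_LINES`, `SLOT_RE` and `tally_uncorrectable` are made up for illustration, and the truncated tail of each entry is left out.

```python
import re
from collections import Counter

# The two entries from the SEL excerpt above (truncated tails omitted).
SEL_LINES = [
    "773 8/15/2012-4:37:31 AM Memory Mmry ECC Sensor (#0x02) CRITICAL event: "
    "Mmry ECC Sensor reports uncorrectable error. There has been an "
    "uncorrectable ECC or other uncorrectable memory error for the memory "
    "module CPU_1, Channel = A, DIMM = 2.",
    "1231 8/20/2012-2:57:17 PM Memory Mmry ECC Sensor (#0x02) CRITICAL event: "
    "Mmry ECC Sensor reports uncorrectable error. There has been an "
    "uncorrectable ECC or other uncorrectable memory error for the memory "
    "module CPU_1, Channel = A, DIMM = 2.",
]

# Assumed line format: pull out CPU, channel and DIMM position
# from entries that report an uncorrectable error.
SLOT_RE = re.compile(
    r"reports uncorrectable error\..*?CPU_(\d+), Channel = (\w+), DIMM = (\d+)"
)


def tally_uncorrectable(lines):
    """Count uncorrectable ECC events per (cpu, channel, dimm) slot."""
    counts = Counter()
    for line in lines:
        match = SLOT_RE.search(line)
        if match:
            cpu, channel, dimm = match.groups()
            counts[(f"CPU_{cpu}", channel, dimm)] += 1
    return counts


if __name__ == "__main__":
    for slot, count in tally_uncorrectable(SEL_LINES).items():
        print(slot, count)
```

Both entries land on CPU_1, Channel A, DIMM 2. With only a 1 in 4 chance that the same module went back into that slot, a repeat in the same slot would lean towards the slot, channel or controller rather than one particular module, though two events are not much to go on.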
And by the way, there is now an amber light on the front panel for system status.
Any thoughts on this?