Hi George, it may work better after repairing the Microsoft* .NET Framework installation from Programs and Features: highlight the software component and choose Uninstall/Change.
I believe the results you are getting are not very conclusive, but you can still try other troubleshooting steps:
- Use a single memory stick for testing, and then swap it with a different one in order to determine if one of them is defective.
- You may use the bootable version of the Intel® Processor Diagnostic Tool.
- Try removing all other hardware components (add-in cards) and testing the system in minimum configuration.
- Use a different hard drive with a freshly installed operating system just for testing; there may be bad sectors on the other hard drive(s), or the operating system may have been tampered with or corrupted.
Many thanks for your suggestions. I did eventually download and use the Fedora bootable ISO version, burning it to CD instead of USB as I'm not sure the PowerEdge 800 will boot from USB. Anyway, that worked, although it was a bit confusing having to manually install IPDT from a shell window. I'm sure Intel could build a better Live Fedora ISO for the next version that doesn't need this step.
The results of running the diagnostics didn't reveal anything, even after running the stress tests for 3 hours. It did help eliminate the processor though. I needed to do that because, when the freezing issue started, the system hardware log (not the Windows server system log) was logging "CPU Sensor Processor IERR" events, which indicate the CPU was asserting IERR to flag an internal error. So my first line of attack was to give the whole machine a good clean out (the heatsink was rather clogged with dust/cat hairs/dirt etc). That didn't do anything, so I swapped the CPU for another P4. Same result, with no change in the frequency of the lock-ups. So I then ran several memory tests using Dell's system diagnostics (both from within SBS 2003 and from a bootable CD) and also Memtest86 from a bootable CD.
The fact that the system even froze up during the memory tests when booted from CD (with no pattern as to where in the tests or after how long) and that no memory problems were logged indicated it probably wasn't the RAM. Just in case, I followed your suggestion and removed one of the two 1GB DIMMs, ran it, and it froze. I took that DIMM out, replaced it with the other one, then ran the server, and it still froze. Still no RAM errors were logged, either by the memory test software or in the system hardware log (it has ECC RAM). This also eliminates the OS and hard drives (4 SCSI drives in RAID 5 via a PERC 4/SC RAID controller) as the problem, as it wasn't booting from or even using them and it still locked up.
As a final test I removed the CPU heatsink again, as well as the two heatsinks on the mainboard chipset chips, and renewed the thermal compound. That did have an effect on the lock-ups in that it can now stay up and running for anything from a couple to 8 hours before it locks up, whereas before they could occur after just 30 mins. I (usually) know precisely when the lock-ups occur as I've got the baseboard management controller (BMC) set to reboot the system (using a hardware watchdog timer set to time out after 480 seconds) if the server freezes while Windows Server is running.
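For anyone wanting to do the same, here is a rough sketch of how a watchdog reboot time can be turned back into an approximate lock-up time. Only the 480-second timeout comes from the setup above; the function name, the example timestamp, and the 30-second interval at which the OS resets the watchdog are assumptions made up for illustration, so check what your own BMC agent actually uses.

```python
from datetime import datetime, timedelta

# Watchdog timeout as configured on the BMC (480 seconds, from the post above).
WATCHDOG_TIMEOUT = timedelta(seconds=480)


def estimate_freeze_window(reboot_time, reset_interval):
    """Return (earliest, latest) times at which the freeze could have happened.

    The watchdog fires WATCHDOG_TIMEOUT after the last reset it received, so
    the last reset was at reboot_time - WATCHDOG_TIMEOUT. The freeze happened
    at or after that reset, and before the next reset was due.
    reset_interval is an assumed value, not something taken from the post.
    """
    earliest = reboot_time - WATCHDOG_TIMEOUT
    latest = earliest + reset_interval
    return earliest, latest


# Hypothetical reboot timestamp read from the system hardware log.
reboot = datetime(2020, 1, 1, 12, 8, 0)
earliest, latest = estimate_freeze_window(reboot, timedelta(seconds=30))
print("Freeze occurred between", earliest, "and", latest)
```

Subtracting the boot time from the earliest value gives a conservative figure for how long the server stayed up before each lock-up.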
So I think that pretty much says the mainboard is the problem, so it's time to find a new one (or a new server!) on eBay...