I have an interesting issue that appears with one of our products when it runs on servers with Xeon X5xxx series processors (it is particularly bad on HP DL360 and DL380 G6 and G7 servers), and I was wondering whether anyone has seen it or has suggestions on how to troubleshoot it.
Our product is a financial portfolio analysis software package that performs rather intense calculations and is heavy on resource utilization, especially CPU. During a portfolio analysis, the work is broken up into tasks, and each task is run by a process on an individual core. The idea is that each task uses only its assigned core and, for the most part, this works very well for us.
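To illustrate what I mean by one task per core, here is a minimal sketch of pinning a process to a single core on Windows. This is not our actual code; the function names `affinity_mask`, `cores_in_mask` and `pin_current_process` are made up for this post, and it assumes the pinning is done through the Win32 `SetProcessAffinityMask` call, which may not match what our developers really use.

```python
import ctypes
import sys


def affinity_mask(core_index):
    """Return a bitmask with only the bit for core_index set."""
    if core_index < 0:
        raise ValueError("core_index must be non-negative")
    return 1 << core_index


def cores_in_mask(mask):
    """List the core indexes whose bits are set in mask."""
    return [i for i in range(mask.bit_length()) if mask & (1 << i)]


def pin_current_process(core_index):
    """Restrict the current process to one core (Windows only).

    Returns False on other platforms or if the Win32 call fails.
    Assumption: the machine has at most 64 logical processors, so a
    single affinity mask covers all of them.
    """
    if sys.platform != "win32":
        return False
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetCurrentProcess.restype = wintypes.HANDLE
    # The mask parameter is a DWORD_PTR, i.e. pointer-sized unsigned.
    kernel32.SetProcessAffinityMask.argtypes = [wintypes.HANDLE, ctypes.c_size_t]
    kernel32.SetProcessAffinityMask.restype = wintypes.BOOL
    handle = kernel32.GetCurrentProcess()
    return bool(kernel32.SetProcessAffinityMask(handle, affinity_mask(core_index)))
```

With a setup like this, a worker pinned to core 3 would have an affinity mask of 8 and should never be scheduled on any other core, which is why seeing every core at 100% is so unexpected.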
The problem is that occasionally one of our processes goes into what I call a "race condition," where it not only uses its own core but also drives the rest of the cores on the machine to 100%, which makes the server so busy that it essentially becomes unresponsive. Additional notes regarding the condition are shown below:
- This does not happen on most of our clients' servers, just those running Xeon X5xxx series processors (my test system uses Xeon X5650 CPUs). It is particularly bad on HP DL360 and DL380 G6 and G7 servers (G3, G4 and G5 appear to be OK).
- It is not O.S. specific. The problem happens on Windows 2003 and 2008 servers.
- It is not consistent. The same job can run successfully 10 times, but then all of a sudden act up. It also does not occur during the same part of the analysis each time.
- We have done extensive hardware tests on machines that have experienced the issues and they always come back clean.
- All of the drivers and firmware on the servers have been updated to the latest available.
- We have tested several CPU-related BIOS changes on the servers with no luck.
- I have attempted to use Windows Performance Monitor to see what is happening, but once the cores all go to 100%, Performance Monitor does not have enough resources to continue working.
To me it seems like something in our process (which is pretty resource intensive) is doing something that is pushing the CPUs to the point where its code does not know how to handle something.
Does anyone have any ideas or suggestions on how to troubleshoot this? HP has been less than helpful so far (their suggestion was to reload the O.S., which makes no sense), and our developers have not been able to pin it down because it never happens in the same job or task.