Intel Xeon X5xxx processors go into a race condition - Intel Community


I have an interesting issue that has been appearing with one of our products when it is run on a server with Xeon X5xxx series processors (it is particularly bad on HP DL360 and DL380 G6s and G7s). I was wondering if anyone had seen it or had suggestions on how to troubleshoot.

Our product is a financial portfolio analysis software package that performs rather intense calculations and is heavy on resource utilization...especially the CPU. During a portfolio analysis, the tasks are broken up and run on individual cores by a process. This is with the idea that only that particular core will be used per task and, for the most part, it works very well for us.

The problem is that occasionally one of our processes goes into what I call a "race condition" where it not only uses its own core, but also drives the rest of the cores on the machine to 100%, which makes the server so busy that it essentially becomes unresponsive. Additional notes regarding the condition are shown below:

This does not happen on most of our clients' servers, just those running Xeon X5xxx series processors (my test system is using Xeon X5650 CPUs). It is particularly bad on HP DL360 and DL380 G6s and G7s (G3, G4 and G5 appear to be OK).
It is not O.S. specific. The problem happens on Windows 2003 and 2008 servers.
It is not consistent. The same job can run successfully 10 times, but then all of a sudden act up. It also does not occur during the same part of the analysis each time.
We have done extensive hardware tests on machines that have experienced the issues and they always come back clean.
All of the drivers and firmware on the servers have been updated to the latest available.
We have tested several CPU-related BIOS changes on the servers with no luck.
I have attempted to use Windows Performance Monitor to see what is happening, but once the cores all go to 100%, it does not have enough resources to continue working.

To me it seems like something in our process (which is pretty resource intensive) is pushing the CPUs to the point where their code does not know how to handle something.

Does anyone have any ideas or suggestions on how to troubleshoot this? HP has been less than helpful so far (their suggestion was to reload the O.S. which makes no sense) and our developers have not been able to pin it down because it never happens in the same job or task.

Thanks!


4 Replies


I would suggest testing another processor in this server system to see if it causes the same behavior. Personally, I have never seen behavior like this; usually a problem of this kind occurs on the same specific task or at the same exact moment.


Thanks for the suggestion, Adolfo. I will probably use it with HP as we move along. Another odd part about this whole issue is that none of the tests from Intel or HP stress the CPU as hard or as long as our own program.

I think the idea that we may have to ask our customer base to avoid the HP servers is finally gaining some traction on that front, so we will see where it goes.


Have you tried turning off the power saving features in the BIOS?

We had a similar problem and I know that there are plenty of documented cases where the power saving features of those CPU families cause problems.

Look in the BIOS for a setting to configure "No C-States" or something similar.


Thanks for the tip. We had already tried turning off the power saving features in the BIOS with no success.

However, we do have a new development which I believe puts us pretty close to what the problem is and a possible solution. I was able to do a Process Explorer analysis on one of the processes that was acting up, and it looks like the process is going "thread crazy". Normally our process would spawn 10 to 15 threads, but when the problem occurs we were seeing hundreds. I then looked at the DLL and stacks and did some research on what we were seeing (I Googled it). It looked like the garbage collector being used (workstation) may have an issue on multi-processor machines (particularly the Xeon X5xxx series). Below are a couple of links that provide more information (there are many more).

http://odetocode.com/Blogs/scott/archive/2004/07/16/server-or-workstation-garbage-collection.aspx

http://stackoverflow.com/questions/2618161/interpreting-w3wp-exe-thread-infos-does-mscorwks-dllstrongnameerrorinfo0x7688

Currently we are attempting the code workaround and have been running our jobs/processes for 4.5 days without issue, so we're cautiously optimistic. We also have some clients testing this out, but have not heard back from them yet.

The code change to the process config file was as follows:
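
The snippet itself did not survive in this copy of the thread. As a hedged sketch only, and assuming the change was the one the linked articles discuss (turning off concurrent garbage collection for the workstation collector), the relevant part of a .NET application config file would look something like this:

```xml
<configuration>
  <runtime>
    <!-- Assumption: the workaround disabled concurrent garbage collection.
         gcConcurrent is a standard .NET Framework runtime setting; whether
         it is the exact change made here is not shown in the post. -->
    <gcConcurrent enabled="false"/>
  </runtime>
</configuration>
```

The element names are standard .NET Framework runtime settings; that this is the precise change made to the process config file is an assumption.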

If this works then we know where our problem is and can either use the setting above or try the server based garbage collector:
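
The snippet for this option is also missing from this copy of the thread. The server garbage collector is normally selected with the standard gcServer runtime setting, so a minimal sketch, assuming the same application config file is used, might be:

```xml
<configuration>
  <runtime>
    <!-- Assumption: switches the process from the workstation collector
         to the server collector, which runs one GC heap per logical CPU. -->
    <gcServer enabled="true"/>
  </runtime>
</configuration>
```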

I think Microsoft also has a hotfix so we have some options.
