I'm looking for help working through a performance bottleneck in my app. Normally that would be a question for a more general programming forum, but in this case I need to understand something about how the CPU interacts with the OS.
First, the environment:
HP Pavilion dv6t Quad Edition series (http://www.shopping.hp.com/webapp/shopping/computer_series.do?storeName=computer_store&category=notebooks&series_name=dv6tqe_series)
Windows 7 Professional x64
Intel(R) Core(TM) i7-820QM Processor (1.73GHz, 8MB L2 Cache, 1333MHz FSB) w/Turbo Boost up to 3.06 GHz
8GB DDR3 System Memory
1GB Nvidia GeForce GT 230M
Next, the problem:
Looking at my latest profiler runs, I'm seeing that nearly 30% of my app's time is being spent on one specific assembler instruction: prefetchnta.
At first this puzzled me, as the docs led me to believe this instruction was intended to be asynchronous. However, further thought and research led me to what I believe is the cause.
In my app, I have 8 worker threads (each with its affinity set to one logical processor) and a supervisor thread. The worker threads do some calculations, then do a lookup in a (very) large cache table, then do more calculations. The worker threads do zero I/O or system calls, as the cache is all in RAM. Looking at Windows Task Manager, each of these threads is shown running at a constant 100% CPU utilization.
The supervisor thread spends most of its time Sleeping, but wakes up periodically to printf some stats.
When I said this was a very large cache table, I wasn't kidding. It is nearly 8 GB. It is all in RAM (i.e. zero page faults), but 8 GB is *so* large and my accesses to the table are so random that not only will the data I am looking for never be in L1/L2/L3 cache, the TLB entry that maps it won't be cached either. So, while my tests show the retrieval of the *data* is async, the retrieval of the TLB entry appears to be synchronous, and this is what (I believe) is blocking my prefetch calls.
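For illustration, the access pattern described above can be sketched in C roughly as follows. This is a simplified sketch, not the actual application code: the names `sum_lookups`, `PREFETCH_NTA` and `distance` are hypothetical, and the real workers do more calculation between lookups. It assumes a compiler that provides `_mm_prefetch` in `<xmmintrin.h>` (MSVC, GCC and Clang on x86 all do).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* _mm_prefetch with _MM_HINT_NTA compiles to prefetchnta on x86.
   On other targets the macro is a no-op so the sketch still builds. */
#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <xmmintrin.h>
#define PREFETCH_NTA(p) _mm_prefetch((const char *)(p), _MM_HINT_NTA)
#else
#define PREFETCH_NTA(p) ((void)(p))
#endif

/* Hypothetical worker loop: sums table[idx[i]] for i in [0, n),
   requesting each entry `distance` iterations before it is read. */
uint64_t sum_lookups(const uint64_t *table, const size_t *idx,
                     size_t n, size_t distance)
{
    uint64_t sum = 0;

    assert(table != NULL && idx != NULL);

    for (size_t i = 0; i < n; i++) {
        /* Ask for a future entry now, so the data can arrive while
           the loop is still busy with earlier entries. */
        if (i + distance < n)
            PREFETCH_NTA(&table[idx[i + distance]]);

        sum += table[idx[i]];   /* stand-in for "do more calculations" */
    }
    return sum;
}
```

The prefetch only changes timing, never the result, so the function returns the same sum for any `distance`. Whether a larger `distance` actually hides the stall depends on whether the address translation itself can overlap with other work, which is exactly the open question here.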
I've struggled with various approaches to try to improve the performance here, but have had no luck. My thoughts break down into 3 categories:
1) Prefetch the TLB
If there were some way to asynchronously prefetch the TLB entry, then asynchronously prefetch the data, I might be in business. However, since I am just an application (i.e. not a device driver), I doubt I have access to the location where my process's TLB entries are stored. If someone could prove me wrong here, that would be great.
2) Switch to another thread while waiting for the TLB to be retrieved.
This approach sounded promising, but in real life, I can't make it work:
a) Attempting to switch threads AFTER the prefetchnta call doesn't help. The time is already spent.
b) Attempting to switch threads BEFORE the prefetchnta call doesn't help, because the delay still occurs when the thread resumes running. To verify this, instead of switching threads I tried putting a time-wasting loop right before the prefetch call. The profiler then shows huge amounts of time in the time-wasting loop, followed by huge amounts of time in the prefetch. I had hoped that since the processor performs some memory prefetches on its own, it might prefetch the TLB entry, seeing as that is the next memory access, but observation suggests otherwise.
These two points, plus the huge overhead of switching threads this often, seem to have doomed this approach. I'm willing to investigate using fibers to avoid the kernel hit of switching threads, but it seems they would have the same 2 problems.
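The delay-loop experiment from 2b can be sketched along these lines. This is a hypothetical reconstruction, not the code that was profiled: `waste_time`, `spin_then_lookup` and the `spins` parameter are invented names, and the spin count is arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same hypothetical wrapper as before: prefetchnta on x86, no-op elsewhere. */
#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <xmmintrin.h>
#define PREFETCH_NTA(p) _mm_prefetch((const char *)(p), _MM_HINT_NTA)
#else
#define PREFETCH_NTA(p) ((void)(p))
#endif

/* Burn `spins` iterations without touching the table.
   `volatile` keeps the compiler from deleting the loop. */
static void waste_time(unsigned long spins)
{
    volatile unsigned long sink = 0;
    for (unsigned long i = 0; i < spins; i++)
        sink = sink + i;
}

/* Delay first, then prefetch, then read the same entry. */
uint64_t spin_then_lookup(const uint64_t *table, size_t index,
                          unsigned long spins)
{
    assert(table != NULL);

    waste_time(spins);            /* stand-in for switching to another thread */
    PREFETCH_NTA(&table[index]);  /* the instruction the profiler flags */
    return table[index];          /* demand load of the prefetched entry */
}
```

On a small table that fits in cache this shows nothing interesting; the effect described above only appears when the table is far larger than what the TLB can map at once.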
3) Live with it.
Sometimes when perf tuning, you get to a point where you have simply maxed out some hardware capability. 8 GB takes just over 2 million 4 KB pages. Since the specs say the i7 processor only stores 64 (not 64 million, not 64 thousand, just 64.00) TLB entries, this may be one of those limitations. It may even be the case that while I think of the thread as being blocked waiting for a "RAM I/O", in reality the CPU is doing a great deal of work behind the scenes to retrieve this data, and there aren't actually any slack cycles here to recover. Insight please?
When you hit a hardware limitation, you either need better hardware or a different way of approaching the problem. Since the i7 is pretty much state of the art right now, I'm hoping there is some way to squeeze just a *little* more performance out of it.
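The page arithmetic above can be checked with a few lines of C. This is only a sketch: `pages_needed` and `tlb_reach` are hypothetical helper names, "8 GB" is assumed to mean 8 GiB (2^33 bytes), "4 KB" to mean 4096 bytes, and the 64-entry figure is the one quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Number of pages needed to map `bytes` of memory, rounding up. */
uint64_t pages_needed(uint64_t bytes, uint64_t page_size)
{
    assert(page_size != 0);
    return (bytes + page_size - 1) / page_size;
}

/* Bytes of address space covered by `entries` TLB entries. */
uint64_t tlb_reach(uint64_t entries, uint64_t page_size)
{
    return entries * page_size;
}
```

With those assumptions, `pages_needed(8ULL << 30, 4096)` is 2,097,152 pages, and `tlb_reach(64, 4096)` is 262,144 bytes (256 KiB). In other words, 64 entries can map only 1/32768 of the table at any moment, so a uniformly random lookup will almost never find its translation already cached.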
Any thoughts, ideas, suggestions, or even just more discussion about how TLBs or PrefetchNTA work would be appreciated. Thanks.