
    Poor memcpy performance for dual-socket E5-* Xeons




      This question is very similar to the following Stack Overflow question, which has many relevant details:


      c++ - Poor memcpy Performance on Linux - Stack Overflow


      In short, dual-socket systems with E5-26* CPUs (Sandy Bridge and Ivy Bridge) exhibit poor memcpy performance. For this question I will focus on the CacheBench results (LLCbench Home Page), whereas the Stack Overflow question has some other results as well. I have tested about half a dozen systems with varying CPU models (E5-2630, E5-2670, E5-2650 v2, X5650/X5660) and different versions of Linux. The older systems with X5660 Xeons yield around 10 GB/s for the larger test sizes in CacheBench, whereas the newer E5 models only yield around 6 GB/s.

      I have also seen this issue show up in other ways in some optimized code. For example, I have several optimized versions of an out-of-place matrix transpose (so similar to memcpy), and on these systems the cache-blocked versions are actually slower than the naive version. On essentially every other CPU model that I test, the cache-blocked version is substantially faster than the naive version.

      The tests take NUMA into account and are pinned to a single core so that all accesses go to local memory only (a sketch of that measurement setup is included below). Any thoughts on what could be causing the performance degradation on the newer dual-socket systems? (I have also tested many single-socket systems, and those have all behaved as expected.)
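      Since the numbers depend on how the measurement is done, here is a minimal sketch of the kind of pinned, NUMA-local memcpy bandwidth test described above. It is not the CacheBench source; the buffer size, repetition count, and the convention of counting only the bytes passed to memcpy are assumptions for illustration.

      /*
       * Minimal sketch: pin the thread to one core, first-touch the buffers
       * so their pages land on that socket's local NUMA node, then time
       * repeated memcpy calls on a buffer much larger than the LLC.
       */
      #define _GNU_SOURCE
      #include <sched.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <time.h>

      #define BUF_BYTES (256UL * 1024 * 1024)  /* well beyond LLC size */
      #define REPS      20

      static double now_sec(void)
      {
          struct timespec ts;
          clock_gettime(CLOCK_MONOTONIC, &ts);
          return ts.tv_sec + ts.tv_nsec * 1e-9;
      }

      int main(void)
      {
          /* Pin to core 0 so all accesses stay on one socket. */
          cpu_set_t set;
          CPU_ZERO(&set);
          CPU_SET(0, &set);
          if (sched_setaffinity(0, sizeof(set), &set) != 0) {
              perror("sched_setaffinity");
              return 1;
          }

          char *src = malloc(BUF_BYTES);
          char *dst = malloc(BUF_BYTES);
          if (!src || !dst) {
              fprintf(stderr, "allocation failed\n");
              return 1;
          }

          /* First-touch from the pinned thread places pages on the local node. */
          memset(src, 1, BUF_BYTES);
          memset(dst, 0, BUF_BYTES);

          double t0 = now_sec();
          for (int i = 0; i < REPS; i++)
              memcpy(dst, src, BUF_BYTES);
          double t1 = now_sec();

          /* Bandwidth counted as bytes copied per second; the reads plus
           * writes move roughly twice this much data over the memory bus. */
          double gbs = (double)BUF_BYTES * REPS / (t1 - t0) / 1e9;
          printf("memcpy: %.2f GB/s\n", gbs);

          free(src);
          free(dst);
          return 0;
      }

      Compiled with gcc -O2 and run on an affected node, a test along these lines should make it easy to compare the X5660 and E5-26* systems directly; it can also be run under numactl to double-check the local-memory placement.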


      Thanks and regards,