I'm running some performance tests on an Intel Xeon 7130M, using matrix multiplication as the benchmark. The tests cover blocking, SIMD instructions, parallelization and loop interchange, and I benchmark the solutions obtained by applying the different optimizations. When I don't use SIMD instructions (the code works only with scalars), there is a big loss of performance once the L2 cache overflows (roughly when the matrices reach about 500*500 elements). As far as I know, overflowing the L1 data cache does not cause a noticeable loss of performance, because the L1 miss latency is short enough to be hidden.
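As a rough sanity check on the matrix sizes quoted in this question, the memory footprint can be estimated with a few lines of C. This is only a sketch: the helper names `matrix_bytes` and `working_set_bytes` are hypothetical, 4-byte single-precision floats are assumed, and comparing the results against the real cache capacities of the processor is left open.

```c
#include <stddef.h>

/* Bytes occupied by one n x n matrix of single-precision floats. */
size_t matrix_bytes(size_t n)
{
    return n * n * sizeof(float);
}

/* Bytes touched by C = A * B when all three n x n matrices are live
 * (assumption: no blocking, so the whole of A, B and C is in play). */
size_t working_set_bytes(size_t n)
{
    return 3 * matrix_bytes(n);
}
```

With 4-byte floats, `matrix_bytes(500)` is 1,000,000 bytes (about 977 KiB) and `matrix_bytes(200)` is 160,000 bytes (about 156 KiB) per matrix.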
But with a solution that uses vector (SIMD) instructions, I notice a loss of performance roughly when the L1 data cache overflows (the matrices are about 200*200 elements). The code uses single-precision floating point, so each 128-bit SSE register holds 4 matrix elements (4*32 bits). This made me wonder why overflowing L1 hurts performance only when I use vector/SIMD instructions. My first assumption is that the data path between L1 and the SSE register bank is 128 bits wide, while the data path between L2 and L1/the register bank is only 64 bits wide. With scalar-only code this limitation would not show up in the measurements, but with vector instructions performance would drop, because on every L1 miss the processor has to wait until all 128 bits it needs have been brought into L1/the SSE register bank.
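To make the "4 floats per 128-bit register" access pattern concrete, here is a minimal sketch of an SSE inner loop. It is not the code under test: the function name `matmul_sse`, the row-major layout, the i-k-j loop order and the use of unaligned loads are all assumptions made for illustration, and a scalar fallback is included so the sketch also compiles where SSE is unavailable.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_*_ps */
#endif

/*
 * C += A * B for row-major n x n float matrices, i-k-j loop order.
 * Hypothetical kernel for illustration; n must be a multiple of 4.
 */
void matmul_sse(const float *a, const float *b, float *c, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < n; k++) {
            const float aik = a[i * n + k];
#if defined(__SSE__)
            /* Broadcast A[i][k] into all four lanes of one register. */
            const __m128 va = _mm_set1_ps(aik);
            for (size_t j = 0; j < n; j += 4) {
                /* Each load/store moves 4 floats (128 bits) at once. */
                __m128 vb = _mm_loadu_ps(&b[k * n + j]);
                __m128 vc = _mm_loadu_ps(&c[i * n + j]);
                vc = _mm_add_ps(vc, _mm_mul_ps(va, vb));
                _mm_storeu_ps(&c[i * n + j], vc);
            }
#else
            /* Scalar fallback: one float (32 bits) per load/store. */
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
#endif
        }
    }
}
```

The caller is expected to zero `c` first if a plain product is wanted. In the SSE path each iteration of the inner loop issues two 128-bit loads and one 128-bit store, versus 32-bit accesses in the scalar path, which is the difference in memory traffic per instruction that the assumption above is about.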
I tried to find the data path widths in the Intel documentation, but I couldn't. I did find the specifications for the Core 2 Duo (which I guess is similar to the Xeon 7130M), but they don't fit my assumption: there, the L2 has a 256-bit data path to the L1 data cache. So my questions are:
Are my assumptions correct?
How wide are the L1 and L2 data paths?
If I'm wrong, what could be causing this loss of performance only in the code that uses vector instructions?
Thanks in advance,