1) I ran the exact code and found that the first two loop iterations have more overhead; after that, the time taken is almost identical. Hence we should add a warm-up kernel call.
MKL - Completed 1 in: 0.2302730 seconds
MKL - Completed 2 in: 0.0001534 seconds
MKL - Completed 3 in: 0.0001267 seconds
MKL - Completed 4 in: 0.0001275 seconds
MKL - Completed 15 in: 0.0001280 seconds
MKL - Completed 16 in: 0.0001347 seconds
CMMA - Completed 1 in: 0.0504993 seconds
CMMA - Completed 2 in: 0.0003169 seconds
CMMA - Completed 3 in: 0.0001666 seconds
CMMA - Completed 4 in: 0.0001687 seconds
CMMA - Completed 15 in: 0.0001638 seconds
CMMA - Completed 16 in: 0.0001636 seconds
2) Cache misses should also contribute if a large matrix size is used.
For further reference, please go through: https://software.intel.com/en-us/ipcc
Rishabh Kumar Jain
However, in most cases the kernels will always have a new matrix to operate on, and I understand that most of the overhead in the first operation is simply due to cache misses. These cache misses will happen every time a new matrix is provided.
So do you think the first result should be included?
Is the overhead primarily due to cache misses or to warm-up time?
If it is indeed cache misses, how can I work on that? I thought the matrix is always accessed in row-major format, so cache misses would be avoided as long as I access it in that same format.
The following key points should answer your questions:
1) First result should not be included.
2) The Intel® Math Kernel Library (Intel® MKL) is multi-threaded and employs internal buffers for fast memory allocation. The initial call initializes the threads and internal buffers, so the time cost is due to Intel MKL initialization. Cache warm-up time is also an overhead.
3) To further improve performance, you can use the loop interchange and loop blocking optimization techniques (LBOT); Intel MKL also uses these internally.
Further, you could also refer to the following link:
Rishabh Kumar Jain