Yesterday I started reading some of the chapters of the Intel® 64 and IA-32 Architectures Optimization Reference Manual available here.
Now I'm somewhat confused. Is it possible that there are quite a lot of mistakes in this manual?
Firstly, in Example 4-18 a code snippet for aligning memory chunks to 64-bit boundaries is presented. Does this code really do what it is supposed to when compiled using ICC? I haven't tried this yet, but incrementing a pointer usually results in an address offset of sizeof(Datatype)*Offset (which in the given example would be 8 bytes * 7 = 56 bytes). Since the array pointed to by p was initially allocated with a size of NUM_ELEMENTS+1, using the adapted pointer to iterate over the "aligned" array might lead to an out-of-bounds access, couldn't it?
Moreover, I even doubt that this code will compile in the first place, since the & operator is usually not defined for pointer types (at least, my VS2010 complained about that). Or is it in ICC?
Both issues could be resolved using type casts:
double *p, *newp;
p = (double*)malloc (sizeof(double)*(NUM_ELEMENTS+1));
newp = (double*)(((unsigned)p+7) & (~0x7));
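As a side note, casting through unsigned only works where a pointer fits into an unsigned int, which is not the case on typical 64-bit targets. Below is a minimal sketch of a more portable variant, assuming a C99 compiler that provides uintptr_t in <stdint.h>. The helper name align_up is my own and does not come from the manual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the manual): round the address in p up to
   the next multiple of align. align must be a power of two, e.g. 8 for
   64-bit boundaries. The arithmetic is done on an integer, so the +7 is
   a byte offset and not a count of elements. */
static void *align_up(void *p, size_t align)
{
    uintptr_t mask = (uintptr_t)align - 1;
    uintptr_t addr = (uintptr_t)p;

    assert(align != 0 && (align & (align - 1)) == 0);
    return (void *)((addr + mask) & ~mask);
}
```

With this, the last line of the snippet above would read newp = (double*)align_up(p, 8). The result is at most 7 bytes past p, so the one spare double allocated via NUM_ELEMENTS+1 is enough to keep NUM_ELEMENTS aligned elements inside the allocation.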
Secondly, I think that parts of Example 4-21 are incorrect.
1: movaps xmm0, Array ; xmm0 = DC, x0, y0, z0
2: movaps xmm1, Fixed ; xmm1 = DC, xF, yF, zF
3: mulps xmm0, xmm1 ; xmm0 = DC, x0*xF, y0*yF, z0*zF
4: movhlps xmm, xmm0 ; xmm = DC, DC, DC, x0*xF
5: addps xmm1, xmm0 ; xmm0 = DC, DC, DC,
6: ; x0*xF+z0*zF
7: movaps xmm2, xmm1
8: shufps xmm2, xmm2,55h ; xmm2 = DC, DC, DC, y0*yF
9: addps xmm2, xmm1 ; xmm1 = DC, DC, DC,
10: ; x0*xF+y0*yF+z0*zF
First of all, I'm not quite sure which register is meant to be the destination operand of the movhlps instruction in line 4. Judging by the register contents given in the comments in lines 5 and 6, I suppose it's xmm1, although the comment says that the contents apply to register xmm0. In my opinion that is not possible, since (according to the "Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 2A: Instruction Set Reference, A-M") the addps instruction uses its first operand as the destination operand. Moreover, using xmm0 as the destination operand of the addps instruction would result in the loss of the product y0*yF, which is stored in the high-order doubleword of the low-order quadword of xmm0 (in fact, it would be overwritten with DC).
Now, assuming xmm1 contains DC, DC, DC, x0*xF+z0*zF, line 7 doesn't make much sense anymore, and line 8 can be rewritten as shufps xmm0, xmm0,55h, since the only part of xmm0 that is still of interest is the high-order doubleword of its low-order quadword, containing y0*yF. Hence, the operands in line 9 need to be replaced, finally yielding addps xmm1, xmm0. This also agrees with what the comment in lines 9 and 10 suggests.
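To double-check the sequence I am proposing (movhlps xmm1, xmm0 / addps xmm1, xmm0 / shufps xmm0, xmm0,55h / addps xmm1, xmm0), here is a rough sketch using the SSE intrinsics from <xmmintrin.h>. It assumes an x86 target with SSE enabled. The function name dot3_sse is my own, and I set the DC lanes to 0 only so that the result is easy to check.

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 target with SSE */

/* Hypothetical helper: 3-component dot product following the proposed
   instruction order. Lane comments list elements from high to low, like
   the comments in the manual's listing. DC ("don't care") is 0 here. */
static float dot3_sse(float x0, float y0, float z0,
                      float xF, float yF, float zF)
{
    __m128 xmm0 = _mm_set_ps(0.0f, x0, y0, z0);  /* DC, x0, y0, z0 */
    __m128 xmm1 = _mm_set_ps(0.0f, xF, yF, zF);  /* DC, xF, yF, zF */

    /* mulps xmm0, xmm1: xmm0 = DC, x0*xF, y0*yF, z0*zF */
    xmm0 = _mm_mul_ps(xmm0, xmm1);
    /* movhlps xmm1, xmm0: low lane of xmm1 = x0*xF, other lanes are DC */
    xmm1 = _mm_movehl_ps(xmm1, xmm0);
    /* addps xmm1, xmm0: low lane of xmm1 = x0*xF + z0*zF */
    xmm1 = _mm_add_ps(xmm1, xmm0);
    /* shufps xmm0, xmm0, 55h: every lane of xmm0 = y0*yF */
    xmm0 = _mm_shuffle_ps(xmm0, xmm0, 0x55);
    /* addps xmm1, xmm0: low lane of xmm1 = x0*xF + y0*yF + z0*zF */
    xmm1 = _mm_add_ps(xmm1, xmm0);

    return _mm_cvtss_f32(xmm1);  /* extract the low lane */
}
```

For example, dot3_sse(1, 2, 3, 4, 5, 6) computes 1*4 + 2*5 + 3*6. Note that xmm0 is never overwritten by an addps here, so y0*yF stays available for the shuffle.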
Last, I think that the paragraph
"[...]There is an alternative: a hybrid SoA approach blends the two alternatives (see Example 4-16). In this case, only 2 separate address streams are generated and referenced: one contains XXXX, YYYY,ZZZZ, ZZZZ,... and the other AAAA, BBBB, CCCC, AAAA, DDDD,... [...]"
found on page 4-26 contains confusing and incorrect information. In my opinion, it is meant to be
"[...]There is an alternative: a hybrid SoA approach blends the two alternatives (see Example 4-16). In this case, only 2 separate address streams are generated and referenced: one contains XXXX, YYYY,ZZZZ,XXXX ,... and the other AAAA, BBBB, CCCC, AAAA, BBBB,... [...]".
Did I get something fundamentally wrong? Please help me out of my confusion.
- Gustav Heinrich