I'm doing some audio processing on an i7 875K processor. I wrote some algorithms in SSE2, then modified them to use SSSE3 and SSE4.1 instructions, hoping to get even better performance on the latest processors. In some cases the modified versions seemed to be up to 50% slower. Digging deeper, I was really confused by some of the numbers:
Unpacking 16-bit values into 32 bits:
pmovsxwd xmm1, [esi]
pmovsxwd xmm2, [esi+8]
pmulld xmm0, xmm6
pmulld xmm1, xmm6
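For anyone not familiar with the unpack step, here is a rough scalar C model of what it computes. It is only a sketch of the semantics, not the code being timed. The function names are made up for illustration, and the SSE2 idiom shown (punpcklwd of a register with itself, then psrad by 16) is an assumption about what an SSE2-only version would typically use, not something taken from the listing above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of pmovsxwd: sign-extend four packed
 * 16-bit values into four 32-bit values. */
static void widen_pmovsxwd_model(const int16_t src[4], int32_t dst[4])
{
    for (int i = 0; i < 4; ++i)
        dst[i] = (int32_t)src[i];
}

/* Hypothetical scalar model of a common SSE2-only idiom (assumed here):
 *   punpcklwd xmm, xmm   ; duplicate each word into both halves of a dword
 *   psrad     xmm, 16    ; arithmetic shift right, leaving a sign-extended word
 */
static void widen_sse2_model(const int16_t src[4], int32_t dst[4])
{
    for (int i = 0; i < 4; ++i) {
        uint16_t w = (uint16_t)src[i];
        uint32_t doubled = ((uint32_t)w << 16) | w;   /* punpcklwd with itself */
        uint32_t hi = doubled >> 16;                  /* upper word kept by the shift */
        /* Portable sign extension of the 16-bit value, which is what psrad yields. */
        dst[i] = (int32_t)(hi ^ 0x8000u) - 0x8000;
    }
}

/* Checks that both models agree on a few edge cases. */
static void widen_models_selfcheck(void)
{
    const int16_t in[4] = { 0, 1, -1, INT16_MIN };
    int32_t a[4], b[4];
    widen_pmovsxwd_model(in, a);
    widen_sse2_model(in, b);
    for (int i = 0; i < 4; ++i)
        assert(a[i] == b[i]);
}
```

Both routes produce the same four 32-bit values. The model does not capture the difference in instruction count, latency, or throughput, which is what the timings depend on.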