SSE4 seems slow...

mmulh · ‎12-23-2010

I'm doing some audio processing on an i7 875K processor, and wrote some algorithms in both SSE2 and modified with SSSE3 and SSE4.1 instructions hoping to get even better performance on the latest processors - in some cases my modified stuff seemed to be up to 50% slower. Digging deeper, I was really confused by some of the numbers:

Unpacking 16 bit values into 32 bits.

SSE2:

movapd xmm0, [esi] // xmm0 = samples[]

punpckhwd xmm1, xmm0

psrld xmm1, 16

punpcklwd xmm0, xmm0

psrld xmm0, 16

SSE4.1:

pmovsxwd xmm1, [esi]

pmovsxwd xmm2, [esi+8]
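For reference, here is a rough scalar sketch in C of what one 32-bit lane ends up holding after each unpack sequence. The function names are made up for illustration, and `low_word` stands for whatever happens to be in the destination register before the unpack. One thing the sketch makes visible: psrld is a logical shift, so the SSE2 sequence zero-extends, while pmovsxwd sign-extends. The two only agree for non-negative samples; psrad would be the sign-extending shift.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of: punpck{l,h}wd dst, src ; psrld dst, 16.
   The unpack places the source word in the upper half of the dword,
   and the logical shift brings it down with zero fill. */
uint32_t unpack_then_psrld(int16_t sample, uint16_t low_word)
{
    uint32_t dword = ((uint32_t)(uint16_t)sample << 16) | low_word;
    return dword >> 16;
}

/* Same unpack, but followed by psrad (arithmetic shift), written
   portably as an explicit sign extension of the upper word. */
int32_t unpack_then_psrad(int16_t sample, uint16_t low_word)
{
    uint32_t dword = ((uint32_t)(uint16_t)sample << 16) | low_word;
    int32_t r = (int32_t)(dword >> 16);
    if (r & 0x8000)
        r -= 0x10000;
    return r;
}

/* One lane of pmovsxwd: sign-extend a 16-bit sample to 32 bits. */
int32_t pmovsxwd_lane(int16_t sample)
{
    return (int32_t)sample;
}

/* Hypothetical self-check comparing the three models. */
void check_unpack_models(void)
{
    assert(unpack_then_psrld(5, 0x1234) == 5u);
    assert(unpack_then_psrld(-2, 0x1234) == 0xFFFEu); /* zero-extended */
    assert(unpack_then_psrad(-2, 0x1234) == pmovsxwd_lane(-2));
}
```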

These two unpack sequences seem to perform at about the same speed (despite the fact that I'm using 60% fewer instructions).

Multiplying two 16-bit values into 32-bit results:

SSE2:

movapd xmm1, xmm0 // xmm1 = hi16(xmm0 * scale_factor)

pmulhw xmm1, xmm6

movapd xmm2, xmm0 // xmm2 = lo16(xmm0 * scale_factor)

pmullw xmm2, xmm6

movapd xmm3, xmm2

punpcklwd xmm3, xmm1

punpckhwd xmm2, xmm1

SSE2 - case 2:

movapd xmm0, xmm1

pmaddwd xmm0, xmm6

andps xmm1, xmm4

pmaddwd xmm1, xmm6

SSE4.1:

pmulld xmm0, xmm6

pmulld xmm1, xmm6
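As a sanity check on the arithmetic, here is a scalar C sketch of one lane of the first SSE2 multiply (pmulhw and pmullw recombined with punpcklwd/punpckhwd) next to one lane of pmulld. The helper names are illustrative only, and the pmulld model assumes both inputs were already sign-extended to 32 bits (for example by pmovsxwd). The pmaddwd variant is not modelled here.

```c
#include <assert.h>
#include <stdint.h>

/* pmulhw lane: upper 16 bits of the signed 16x16 product. */
uint16_t pmulhw_lane(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;
    return (uint16_t)((uint32_t)p >> 16);
}

/* pmullw lane: lower 16 bits of the same product. */
uint16_t pmullw_lane(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;
    return (uint16_t)((uint32_t)p & 0xFFFFu);
}

/* punpcklwd/punpckhwd lane: destination word lands in the low half,
   source word in the high half of each dword. */
uint32_t interleave_words(uint16_t lo, uint16_t hi)
{
    return ((uint32_t)hi << 16) | lo;
}

/* SSE2 path: rebuild the full 32-bit product from the two halves. */
uint32_t mul_sse2_lane(int16_t a, int16_t b)
{
    return interleave_words(pmullw_lane(a, b), pmulhw_lane(a, b));
}

/* pmulld lane: low 32 bits of the product of two 32-bit values
   (assumed here to be sign-extended 16-bit samples). */
uint32_t mul_pmulld_lane(int16_t a, int16_t b)
{
    return (uint32_t)((int64_t)a * (int64_t)b);
}

/* Hypothetical self-check: both paths give the same bit pattern. */
void check_multiply_models(void)
{
    assert(mul_sse2_lane(-3, 7) == mul_pmulld_lane(-3, 7));
    assert(mul_sse2_lane(1234, 20000) == 24680000u);
}
```

Under that assumption both paths produce the same 32-bit result per lane, so any timing difference comes from instruction cost and not from doing different work.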

The SSE4.1 multiply seems to be about 50% slower than SSE2, and SSE2 - case 2 seems to take over double the time of SSE2.

Is there something that I'm doing wrong, or something that needs to be enabled on the processor for SSE4.1 to get what I'd expect to be much better performance, since I'm using so many fewer operations?

I'm using inline assembly on Win7 and compiling a Win32 target, even though I'm running my machine in 64-bit mode.

Michael.