0 Replies Latest reply on Dec 23, 2010 8:17 PM by mulhaupt

    SSE4 seems slow...


      I'm doing some audio processing on an i7 875K processor. I wrote some algorithms in SSE2, then modified them to use SSSE3 and SSE4.1 instructions, hoping for even better performance on the latest processors. In some cases the modified versions seemed to be up to 50% slower. Digging deeper, I was really confused by some of the numbers:


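      Unpacking 16-bit values into 32 bits:

      SSE2:
      movapd xmm0, [esi]    // xmm0 = samples[] (8 x 16-bit)
      punpckhwd xmm1, xmm0  // high 4 samples go into the upper half of each dword of xmm1
      psrld xmm1, 16        // shift them down; the old xmm1 bits are discarded
      punpcklwd xmm0, xmm0  // low 4 samples, duplicated into both halves of each dword
      psrld xmm0, 16        // xmm0 = low 4 samples, zero-extended to 32 bits

      SSE4.1:
      pmovsxwd xmm1, [esi]    // low 4 samples, sign-extended to 32 bits
      pmovsxwd xmm2, [esi+8]  // high 4 samples, sign-extended to 32 bits
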
      These seem to perform at about the same speed, despite the second version using 60% fewer instructions.
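
      One thing worth checking alongside the timing: the two unpack sequences are not arithmetically identical. psrld is a logical shift, so the shift-based version zero-extends each sample, while pmovsxwd sign-extends it (psrad would be the sign-preserving shift). Below is a minimal scalar C sketch of a single lane. The function names are made up for illustration, and it models only the arithmetic, not the speed.

```c
#include <stdint.h>

/* One lane of "punpcklwd x, x" followed by "psrld x, 16":
   the 16-bit pattern is duplicated into both halves of a dword,
   then a logical right shift leaves it zero-extended. */
static uint32_t unpack_logical_shift(int16_t sample)
{
    uint32_t w = (uint16_t)sample;
    uint32_t dword = (w << 16) | w;
    return dword >> 16;
}

/* Same lane, but modeling psrad (arithmetic shift) instead of psrld.
   The sign is restored by hand to avoid implementation-defined
   behavior in C when shifting or converting negative values. */
static int32_t unpack_arithmetic_shift(int16_t sample)
{
    uint32_t w = (uint16_t)sample;
    uint32_t dword = (w << 16) | w;
    uint32_t hi = dword >> 16;
    return (hi & 0x8000u) ? (int32_t)hi - 65536 : (int32_t)hi;
}

/* One lane of pmovsxwd: plain sign extension. */
static int32_t unpack_sign_extend(int16_t sample)
{
    return (int32_t)sample;
}
```

      For non-negative samples all three agree. For a sample of -1 the logical-shift model gives 65535, while the other two give -1, so with signed audio data the two sequences are only interchangeable if the shift is arithmetic.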

      Multiplying two 16-bit values into 32-bit results:

      SSE2:
      movapd xmm1, xmm0     // xmm1 = hi16(xmm0 * scale_factor)
      pmulhw xmm1, xmm6
      movapd xmm2, xmm0     // xmm2 = lo16(xmm0 * scale_factor)
      pmullw xmm2, xmm6
      movapd xmm3, xmm2
      punpcklwd xmm3, xmm1  // xmm3 = 32-bit products of the low 4 samples
      punpckhwd xmm2, xmm1  // xmm2 = 32-bit products of the high 4 samples

      SSE2 - case 2:
      movapd xmm0, xmm1
      pmaddwd xmm0, xmm6
      andps xmm1, xmm4
      pmaddwd xmm1, xmm6

      SSE4.1:
      pmulld xmm0, xmm6     // 32-bit x 32-bit multiply, keeping the low 32 bits of each product
      pmulld xmm1, xmm6

      The SSE4.1 multiply seems to be about 50% slower than SSE2, and SSE2 - case 2 seems to take more than double the time of SSE2.
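
      As a sanity check on the arithmetic (not the timing), the first SSE2 multiply can be modeled one lane at a time in scalar C: pmulhw and pmullw give the high and low 16 bits of the signed product, and punpcklwd/punpckhwd stitch them back into a full 32-bit result. This is only a sketch, and the function names are hypothetical.

```c
#include <stdint.h>

/* One lane of pmullw: low 16 bits of the signed 16x16 product. */
static uint16_t mul_lo16(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;
    return (uint16_t)((uint32_t)p & 0xFFFFu);
}

/* One lane of pmulhw: high 16 bits of the signed 16x16 product. */
static uint16_t mul_hi16(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;
    return (uint16_t)((uint32_t)p >> 16);
}

/* One lane of "punpcklwd lo, hi": low word from lo, high word from hi.
   Returns the raw 32-bit pattern of the reassembled product. */
static uint32_t widening_mul_sse2_model(int16_t a, int16_t b)
{
    return ((uint32_t)mul_hi16(a, b) << 16) | mul_lo16(a, b);
}

/* Reference: the full signed 32-bit product, as a bit pattern. */
static uint32_t widening_mul_reference(int16_t a, int16_t b)
{
    return (uint32_t)((int32_t)a * (int32_t)b);
}
```

      Both functions return the same bit pattern for every pair of 16-bit inputs, so the seven-instruction SSE2 sequence and a sign-extend followed by pmulld should compute the same products. Any difference between them is in cost, not in result.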
      Is there something I'm doing wrong, or something that needs to be enabled on the processor, for SSE4.1 to deliver the much better performance I'd expect given that I'm using far fewer instructions? I'm using inline assembly on Windows 7 and compiling for a Win32 target, even though the machine is running in 64-bit mode.