
    AVX512 slower than AVX2

    reybernardo843@gmail.com

      I am trying to use AVX512. I don't see any improvement over AVX2; in fact it is running a little slower. In VTune I can see the %zmm registers in the assembly, so I believe the compiler is doing the right thing, but AVX2 is still faster than AVX512 by about 5%. I did three quick tests (mults, adds and FMA) and all three show the same behavior. I also tried GCC 7.3.0 and GCC 8.1.0. I am using an i7-7820X.

       

      Below is example code for my mult for both AVX2 and AVX512:

       

      #include <immintrin.h>   // AVX2/AVX-512 intrinsics and _mm_malloc
      #include <ctime>         // std::clock
      #include <iostream>

      void calcMult_AVX512( __m512* vecInput, int vecSize, __m512* diffX512Ptr )
      {
          int loopCmulSizeInFloats = vecSize;
          int ptrIndex = 0;
          for (int i = 0; i < loopCmulSizeInFloats; i = i + 16)  // 16 floats per iteration
          {
              __m512 A = vecInput[ptrIndex];
              diffX512Ptr[ptrIndex] = _mm512_mul_ps(A, A);
              ptrIndex++;  // do this after debug logging
          }
      }

      void calcMult_AVX2( __m256* vecInput, int vecSize, __m256* diffX256Ptr )
      {
          int loopCmulSizeInFloats = vecSize;
          int index = 0;
          for (int i = 0; i < loopCmulSizeInFloats; i = i + 8)  // 8 floats per iteration
          {
              __m256 A = vecInput[index];
              diffX256Ptr[index] = _mm256_mul_ps(A, A);
              index++;
          }
      }
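      The add and FMA tests follow the same pattern as the mult; a minimal sketch of what those kernels look like (the function names and the second input vector here are illustrative, not the exact code I timed):

      void calcAdd_AVX512( __m512* vecInputA, __m512* vecInputB, int vecSize, __m512* outPtr )
      {
          int ptrIndex = 0;
          for (int i = 0; i < vecSize; i = i + 16)  // 16 floats per iteration
          {
              outPtr[ptrIndex] = _mm512_add_ps(vecInputA[ptrIndex], vecInputB[ptrIndex]);
              ptrIndex++;
          }
      }

      void calcFMA_AVX512( __m512* vecInputA, __m512* vecInputB, int vecSize, __m512* outPtr )
      {
          int ptrIndex = 0;
          for (int i = 0; i < vecSize; i = i + 16)  // 16 floats per iteration
          {
              __m512 A = vecInputA[ptrIndex];
              __m512 B = vecInputB[ptrIndex];
              outPtr[ptrIndex] = _mm512_fmadd_ps(A, A, B);  // A*A + B
              ptrIndex++;
          }
      }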


      And here is how I use the mult functions:

          int vectorSizeInFloats = 262144*4;

          float  * vectorIn     = static_cast<float*> ( _mm_malloc(sizeof(float)  * vectorSizeInFloats,    4));
          __m256 * vectorIn256  = static_cast<__m256*>( _mm_malloc(sizeof(__m256) * vectorSizeInFloats/8,  32));
          __m512 * vectorIn512  = static_cast<__m512*>( _mm_malloc(sizeof(__m512) * vectorSizeInFloats/16, 64));
          float  * vectorInB    = static_cast<float*> ( _mm_malloc(sizeof(float)  * vectorSizeInFloats,    4));
          __m256 * vectorInB256 = static_cast<__m256*>( _mm_malloc(sizeof(__m256) * vectorSizeInFloats/8,  32));
          __m512 * vectorInB512 = static_cast<__m512*>( _mm_malloc(sizeof(__m512) * vectorSizeInFloats/16, 64));
          float  * diffXPtr     = static_cast<float*> ( _mm_malloc(sizeof(float)  * vectorSizeInFloats,    4));
          __m256 * diffX256Ptr  = static_cast<__m256*>( _mm_malloc(sizeof(__m256) * vectorSizeInFloats/8,  32));
          __m512 * diffX512Ptr  = static_cast<__m512*>( _mm_malloc(sizeof(__m512) * vectorSizeInFloats/16, 64));

          for (int i = 0; i < vectorSizeInFloats; i++)
          {
              vectorIn[i]  = (float)i;
              vectorInB[i] = (float)-i;
          }

          int index = 0;
          for (int i = 0; i < vectorSizeInFloats; i = i + 8)
          {
              vectorIn256[index]  = _mm256_setr_ps(0+i,1+i,2+i,3+i,4+i,5+i,6+i,7+i);
              vectorInB256[index] = _mm256_setr_ps(0-i,1-i,2-i,3-i,4-i,5-i,6-i,7-i);
              index++;  // advance once per iteration so both arrays stay in step
          }

          index = 0;
          for (int i = 0; i < vectorSizeInFloats; i = i + 16)
          {
              vectorIn512[index]  = _mm512_setr_ps(0+i,1+i,2+i,3+i,4+i,5+i,6+i,7+i,8+i,9+i,10+i,11+i,12+i,13+i,14+i,15+i);
              vectorInB512[index] = _mm512_setr_ps(0-i,1-i,2-i,3-i,4-i,5-i,6-i,7-i,8-i,9-i,10-i,11-i,12-i,13-i,14-i,15-i);
              index++;
          }

          int loopAmount = 5000;

          ////////////////////////////////////
          std::clock_t start;
          double durationAVX512;

          start = std::clock();

          for (int k = 0; k < loopAmount; k++)
          {
              calcMult_AVX512( vectorIn512, vectorSizeInFloats, diffX512Ptr );
          }

          durationAVX512 = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
          std::cout << "mult AVX512 Duration of decimate: " << durationAVX512 << '\n';

          ////////////////////////////////////
          double durationAVX2;

          start = std::clock();

          for (int k = 0; k < loopAmount; k++)
          {
              calcMult_AVX2( vectorIn256, vectorSizeInFloats, diffX256Ptr );
          }

          durationAVX2 = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
          std::cout << "mult AVX2 Duration of decimate: " << durationAVX2 << '\n';
          std::cout << "delta: " << durationAVX2/durationAVX512 << '\n';
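      As a side note on the timing: std::clock reports CPU time, so the same loop can also be checked against the wall clock with std::chrono; a minimal sketch (variable names are my own, not part of the run above):

          // requires #include <chrono>
          // Wall-clock timing of the same AVX512 loop, as a sanity check on std::clock.
          auto t0 = std::chrono::steady_clock::now();
          for (int k = 0; k < loopAmount; k++)
          {
              calcMult_AVX512( vectorIn512, vectorSizeInFloats, diffX512Ptr );
          }
          auto t1 = std::chrono::steady_clock::now();
          double wallSeconds = std::chrono::duration<double>(t1 - t0).count();
          std::cout << "mult AVX512 wall-clock duration: " << wallSeconds << '\n';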

       

      And the output result:

      mult AVX512 Duration of decimate: 1.56

      mult AVX2 Duration of decimate: 1.52

      delta: 0.974359

       

      As you can see, AVX2 is slightly faster. 

       

      Below is the snapshot for AVX512 code from VTune:

      [VTune assembly snapshot, AVX512 loop]

      Below is the snapshot for AVX2 code from VTune:

      [VTune assembly snapshot, AVX2 loop]

      I am using GCC 7.3.0 and 8.1.0, with these compiler options: -O3 -march=skylake-avx512 -o (I also tried several others, like -march=native); the results are the same in every case. I am running an i7-7820X. As a comparison, I ran the AVX2 code on an i7-8700K and, accounting for the different CPU clock speeds, the AVX2 code ran the same. The i7-8700K doesn't support AVX512, so I couldn't test it there.
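      For reference, whether a CPU supports AVX512 can be checked at runtime with GCC's __builtin_cpu_supports builtin; a minimal sketch (my own example, separate from the benchmark):

      #include <iostream>

      int main()
      {
          // GCC/Clang builtin CPU feature check; "avx512f" is the AVX-512 Foundation flag.
          if (__builtin_cpu_supports("avx512f"))
              std::cout << "AVX-512F supported on this CPU\n";
          else
              std::cout << "AVX-512F not supported on this CPU\n";
          return 0;
      }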

      I am running on CentOS Linux release 7.4.1708 (Core).

       

      I am using a Fatal1ty X299 Professional Gaming i9 motherboard. I didn't find anything in the BIOS to change the AVX512 CPU clock speed.