4 Replies Latest reply on Jul 30, 2018 7:56 AM by Intel Corporation

    Xeon-Phi vs. Xeon Unexplained Overhead

    galoren.com

      Hello all!

       

      I am trying to run the following code with different values of n on a Xeon Phi KNC (61 cores, 4 threads per core) and on a Xeon host (2 sockets of Xeon E5-2660 v2).

       

      I am getting the timings shown in the tables below. However, I am trying to understand why the MIC's performance is poorer than the Xeon's. What am I doing wrong here, and how can I fix it (if possible)?

       

      Thanks!

       

      CODE:

       

      program prog

        integer, allocatable :: arr1(:), arr2(:)

        integer :: i, n, time_start, time_end

        n=481

        do while (n .le. 481000000)

          allocate(arr1(n),arr2(n))

          call system_clock(time_start)

          !dir$ offload begin target(mic)

          !$omp SIMD

          do i=1,n

             arr1(i) = arr1(i) + arr2(i)

          end do

          !dir$ end offload

          call system_clock(time_end)

          write (*,*) "n=",n," time=",time_end-time_start

          deallocate(arr1,arr2)    

          n = n*10

        end do 

      end program
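
For comparison, here is a minimal host-only sketch of the same loop (no offload directive), written under a few assumptions that are not in the code above. The names `prog_sketch`, `t0`, `t1` and `rate` are made up for illustration, and the largest n is reduced to keep the run short. It differs in three ways: the arrays are initialized before they are read, `count_rate` is queried so the elapsed time is reported in seconds rather than raw ticks, and `!$omp parallel do simd` is used because `!$omp simd` on its own only vectorizes the loop and does not start a team of threads.

```fortran
program prog_sketch
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer, allocatable :: arr1(:), arr2(:)
  integer :: i, n
  integer(int64) :: t0, t1, rate

  ! Ticks per second of the clock behind system_clock (processor dependent).
  call system_clock(count_rate=rate)

  n = 481
  do while (n <= 4810000)
     allocate(arr1(n), arr2(n))
     ! Give the arrays defined values; the original loop reads them uninitialized.
     arr1 = 1
     arr2 = 2

     call system_clock(t0)
     ! Split the iterations across threads and vectorize each thread's chunk.
     !$omp parallel do simd
     do i = 1, n
        arr1(i) = arr1(i) + arr2(i)
     end do
     !$omp end parallel do simd
     call system_clock(t1)

     ! arr1(n) is printed so the result of the loop is actually used.
     write (*,*) "n=", n, " seconds=", real(t1 - t0) / real(rate), " check=", arr1(n)

     deallocate(arr1, arr2)
     n = n * 10
  end do
end program prog_sketch
```

If the compiler's OpenMP option is not enabled, the `!$omp` lines are treated as comments and the loop simply runs serially, so the same source can be used to check the single-thread time.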

       

      Xeon-Phi RESULTS:

       

      n=         481  time=        8881

      n=        4810  time=          75

      n=       48100  time=          53

      n=      481000  time=         261

      n=     4810000  time=        1991

      n=    48100000  time=       18912

      n=   481000000  time=      188203

       

      Settings:

       

      #!/bin/bash

      #SBATCH -N 1

      #SBATCH -o out_122

      #SBATCH --exclusive

      export MIC_KMP_AFFINITY=verbose,granularity=fine,scatter

      export MIC_OMP_NUM_THREADS=122

      ./prog.exe

       

      sbatch -p xphi -N 1 --exclusive run_par.sh

      where all of the settings above are in run_par.sh and xphi is the name of the device.

       

      It's also worth mentioning that a native run (i.e., without the !dir$ offload begin target(mic) directive before the !$omp SIMD) yields much better results.

       

      n= 481       time= 0

      n= 4810      time= 0

      n= 48100     time= 6

      n= 481000    time= 55

      n= 4810000   time= 455

      n= 48100000  time= 4342

      n= 481000000 time= 43322

       

      In the native run the settings are:

       

      #!/bin/bash

      #SBATCH -N 1

      #SBATCH -o out_244_native

      #SBATCH --exclusive

      export SINK_LD_LIBRARY_PATH=...intel/compilers_and_libraries/linux/lib/mic:$SINK_LD_LIBRARY_PATH

      micnativeloadex ./prog.exe.MIC -e "KMP_AFFINITY=verbose,granularity=fine,scatter"

       

      Xeon RESULTS:

       

      n=         481         time=           0

      n=        4810         time=           0

      n=       48100         time=           2

      n=      481000         time=          19

      n=     4810000         time=          93

      n=    48100000         time=         706

      n=   481000000         time=        7006
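
To put a number on the gap, the sketch below computes the slowdown ratios for the largest case (n = 481000000) from the three tables in this post. It assumes all three runs report ticks at the same `count_rate`, which is not guaranteed since the native run reads the clock on the coprocessor. The program name `slowdown` and the `t_*` constants are made up for illustration. With the posted values the ratios come out at roughly 27 (offload vs. Xeon), 6 (native vs. Xeon) and 4 (offload vs. native).

```fortran
program slowdown
  implicit none
  ! Tick counts copied from the tables in this post for n = 481000000.
  real, parameter :: t_offload = 188203.0
  real, parameter :: t_native  = 43322.0
  real, parameter :: t_xeon    = 7006.0

  ! The ratios are unit-free only if all runs share one count_rate (assumed).
  write (*,'(a,f6.2)') "offload / Xeon   = ", t_offload / t_xeon
  write (*,'(a,f6.2)') "native  / Xeon   = ", t_native  / t_xeon
  write (*,'(a,f6.2)') "offload / native = ", t_offload / t_native
end program slowdown
```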

       

      Here is the output of lscpu command on my Xeon machine:

       

      Architecture:          x86_64

      CPU op-mode(s):        32-bit, 64-bit

      Byte Order:            Little Endian

      CPU(s):                40

      On-line CPU(s) list:   0-39

      Thread(s) per core:    2

      Core(s) per socket:    10

      Socket(s):             2

      NUMA node(s):          2

      Vendor ID:             GenuineIntel

      CPU family:            6

      Model:                 62

      Model name:            Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz

      Stepping:              4

      CPU MHz:               1203.382

      BogoMIPS:              4405.99

      Virtualization:        VT-x

      L1d cache:             32K

      L1i cache:             32K

      L2 cache:              256K

      L3 cache:              25600K

      NUMA node0 CPU(s):     0-9,20-29

      NUMA node1 CPU(s):     10-19,30-39

       

      My MIC specs are (tail of /proc/cpuinfo):

       

      processor       : 239

      vendor_id       : GenuineIntel

      cpu family      : 11

      model           : 1

      model name      : 0b/01

      stepping        : 3

      cpu MHz         : 1052.630

      cache size      : 512 KB

      physical id     : 0

      siblings        : 240

      core id         : 59

      cpu cores       : 60

      apicid          : 239

      initial apicid  : 239

      fpu             : yes

      fpu_exception   : yes

      cpuid level     : 4

      wp              : yes

      flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr mca pat fxsr ht syscall nx lm nopl lahf_lm

      bogomips        : 2112.44

      clflush size    : 64

      cache_alignment : 64

      address sizes   : 40 bits physical, 48 bits virtual

      power management: