4 Replies Latest reply on Oct 19, 2014 1:58 PM by PpHd

    Strange IPC behavior

    PpHd

  I have found a strange IPC behavior in a test program which benchmarks matrix multiplication using the MPFR library at 53 and 113 bits of precision. The 113-bit version was always significantly faster (typically 20-30%) even though it performs more computation. After analysis, I have reduced the problem to the mpfr_mul function.

       

  Here is an assembly extract of where I think the problem lies: in the mpfr_mul function, more precisely in the section which performs the 1x1, 2x1 or 2x2 multiplication:

          cmpq   $2, %r9

          jg    .L21

          movq    24(%r14), %rsi

          leaq    8(%rbx), %rdi

          movq    24(%r13), %rcx

          movq    (%rsi), %rax

      #APP

      # 324 "mul.c" 1

          mulq (%rcx)

      # 0 "" 2

      #NO_APP

          cmpq    $1, %r9

          movq    %rdx, %r11

          movq    %rax, (%rbx)

          movq    %rdx, 8(%rbx)

          je    .L23

          movq    8(%rsi), %rax

      #APP

      # 334 "mul.c" 1

          mulq (%rcx)

      # 0 "" 2

      # 335 "mul.c" 1

          addq %rax,%r11

          adcq $0,%rdx

      # 0 "" 2

      #NO_APP

          cmpq    $1, -136(%rbp)

          movq    %rdx, 16(%rbx)

          movq    %r11, (%rdi)

      #    je    .L189

          movq    8(%rcx), %r9

          movq    (%rsi), %rcx

          movq    %rcx, %rax

      #APP

      # 346 "mul.c" 1

          mulq %r9

      # 0 "" 2

      #NO_APP

          movq    %rdx, %r11

          movq    %rax, %rcx

          movq    8(%rsi), %rax

      #APP

      # 347 "mul.c" 1

          mulq %r9

      # 0 "" 2

      # 348 "mul.c" 1

          addq %rax,%r11

          adcq $0,%rdx

      # 0 "" 2

      #NO_APP

          movq    8(%rbx), %rax

          movq    %rdx, 24(%rbx)

          movq    16(%rbx), %rdx

      #APP

      # 350 "mul.c" 1

          addq %rcx,%rax

          adcq %r11,%rdx

      # 0 "" 2

      #NO_APP

          movq    %rdx, 16(%rbx)

          movq    %rax, (%rdi)

          cmpq    %r11, 16(%rbx)

          setb    %r11b

          movzbl    %r11b, %r11d

          addq    24(%rbx), %r11

          movq    %r11, 24(%rbx)

      .L23:

          subq    -144(%rbp), %r8

          shrq    $63, %r11
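
      For reference, the inline asm blocks from mul.c above are the usual GMP-style limb products with carry propagation. Here is a minimal C sketch of the 2x1 case (the mulq followed by addq/adcq sequence), written with GCC's unsigned __int128 instead of inline asm; the function name is mine, not MPFR's:

      ```c
      #include <assert.h>
      #include <stdint.h>

      /* 2x1 limb multiply: (a1:a0) * b -> (r2:r1:r0).
         Mirrors the mulq / addq / adcq sequence in the listing,
         using unsigned __int128 for clarity. */
      static void mul_2x1(uint64_t a1, uint64_t a0, uint64_t b,
                          uint64_t *r2, uint64_t *r1, uint64_t *r0)
      {
          unsigned __int128 lo = (unsigned __int128)a0 * b;   /* first mulq  */
          unsigned __int128 hi = (unsigned __int128)a1 * b;   /* second mulq */
          *r0 = (uint64_t)lo;
          /* add the low product's high limb into the high product;
             the carry folds into the top limb (the adcq $0,%rdx) */
          unsigned __int128 mid = hi + (uint64_t)(lo >> 64);
          *r1 = (uint64_t)mid;
          *r2 = (uint64_t)(mid >> 64);
      }

      int main(void)
      {
          uint64_t r2, r1, r0;
          /* (2^128 - 1) * (2^64 - 1) = (2^64-2 : 2^64-1 : 1) in limbs */
          mul_2x1(~0ULL, ~0ULL, ~0ULL, &r2, &r1, &r0);
          assert(r0 == 1ULL);
          assert(r1 == ~0ULL);
          assert(r2 == ~0ULL - 1);
          return 0;
      }
      ```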

       

      When I leave the asm as it is (it is produced by gcc, with a little change around je .L189 in order to better show the problem), I get this performance (using the Linux perf stat -B tool):

       

             23431,087207 task-clock                #    0,976 CPUs utilized         

                   2 109 context-switches          #    0,000 M/sec                 

                       4 CPU-migrations            #    0,000 M/sec                 

                  11 888 page-faults               #    0,001 M/sec                 

          49 043 462 004 cycles                    #    2,093 GHz                     [50,06%]

         <not supported> stalled-cycles-frontend

         <not supported> stalled-cycles-backend 

          30 713 070 462 instructions              #    0,63  insns per cycle         [75,02%]

           4 492 657 867 branches                  #  191,739 M/sec                   [74,99%]

              71 968 726 branch-misses             #    1,60% of all branches         [74,95%]

       

            24,008123640 seconds time elapsed

       

      If I comment out the line in bold (je .L23) in the assembly source (a jump that skips only 29 instructions), I get:

       

            12919,383975 task-clock                #    0,943 CPUs utilized         

                   1 520 context-switches          #    0,000 M/sec                 

                      15 CPU-migrations            #    0,000 M/sec                 

                  11 887 page-faults               #    0,001 M/sec                 

          27 032 904 739 cycles                    #    2,092 GHz                     [50,04%]

         <not supported> stalled-cycles-frontend

         <not supported> stalled-cycles-backend 

          31 976 622 505 instructions              #    1,18  insns per cycle         [75,04%]

           4 734 392 898 branches                  #  366,457 M/sec                   [75,03%]

              64 698 800 branch-misses             #    1,37% of all branches         [74,93%]

       

            13,704240040 seconds time elapsed

       

      It runs much faster even though it effectively executes more instructions (the IPC is nearly twice as high, and this is the IPC of the whole program).
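
      To be precise, the IPC figures reported above are just instructions divided by cycles, so the "nearly twice" claim can be checked directly from the two perf runs:

      ```c
      #include <assert.h>
      #include <stdio.h>

      int main(void)
      {
          /* counter values taken from the two perf stat runs above */
          double slow_ipc = 30713070462.0 / 49043462004.0; /* je left in place */
          double fast_ipc = 31976622505.0 / 27032904739.0; /* je commented out */
          printf("slow: %.2f insns/cycle, fast: %.2f insns/cycle\n",
                 slow_ipc, fast_ipc);
          /* matches the 0.63 and 1.18 that perf printed */
          assert(slow_ipc > 0.62 && slow_ipc < 0.64);
          assert(fast_ipc > 1.17 && fast_ipc < 1.19);
          return 0;
      }
      ```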

      I cannot explain such behavior. It has been seen on multiple Intel Core CPUs (not only mine, which is an Intel Core 2 Duo T6500). Full benchmark code for Linux is available on demand.

       

      If I replace je .L23 with an unconditional jump, I get the slow behavior.

      If I replace je .L23 with a nop instruction (or 2, 3, or 4 nops), I get the fast behavior.

       

      Does anyone have an explanation for this?