0 Replies Latest reply on Apr 24, 2011 8:32 AM by jack.ooo.lantern

    How to read the results from IACA (Intel Architecture Code Analyzer)

    jack.ooo.lantern

      Hello, I'm a developer.

       

      I saw the presentation (below) at GDC 2011.
      I can't understand the IACA output for AVX shown on page 17 of the presentation.
      1. Where does 'Loop Latency: 14 Cycles' come from?
      The sum of the latencies below is 12 cycles.
        [v]movss => 1 x 2
        [v]mulss => 5
        [v]addss => 3
        add      => 1
        cmp      => 1
        jl       => 0 (because of "Branch Prediction")
        -------------
                   12
      These latencies are taken from the 'Intel(R) 64 and IA-32 Architectures Optimization Reference Manual'.
      If the accesses to the variables 's', 'a', 'x', and 'y' each go through the cache, the total latency is even larger: 12 + 5 x 4 = 32 cycles.
      The '5' is the load latency given on page 2-19 of the same manual.
      Neither 12 nor 32 matches the reported 14 cycles. What am I missing?
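      To make my arithmetic explicit, here is a small Python sketch of the two sums above. The names and the table layout are my own, for illustration only. It simply adds every latency as if the instructions formed one serial chain, which is my assumption and not necessarily how IACA computes 'Loop Latency'.

```python
# Latencies in cycles, as I read them from the Optimization Reference Manual.
# Each entry is (mnemonic, latency per instruction, number of occurrences).
INSTRUCTION_LATENCIES = [
    ("[v]movss", 1, 2),  # two moves at 1 cycle each
    ("[v]mulss", 5, 1),
    ("[v]addss", 3, 1),
    ("add", 1, 1),
    ("cmp", 1, 1),
    ("jl", 0, 1),        # assumed free because of branch prediction
]

L1_LOAD_LATENCY = 5                      # load latency from page 2-19 of the manual
MEMORY_OPERANDS = ("s", "a", "x", "y")   # variables that may be loaded through the cache


def serial_latency(table):
    """Sum all latencies as if the instructions ran strictly one after another."""
    return sum(latency * count for _, latency, count in table)


base = serial_latency(INSTRUCTION_LATENCIES)
with_loads = base + L1_LOAD_LATENCY * len(MEMORY_OPERANDS)

print(base)        # 12
print(with_loads)  # 32
```

      Neither value equals the 14 cycles that IACA reports, which is why I suspect this simple serial model is not what the tool uses.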
      2. What causes the 'vmovss' latencies?
      I got an analysis report for 'movss xmm0, dword ptr [eax]' from IACA, shown below.
      ----------------------------------------------------------------
      Intel(R) Architecture Code Analyzer Version - 1.1.3
      Analyzed File - GDC.exe
      Binary Format - 32Bit
      Architecture  - Intel(R) AVX
      Analysis Report
      ---------------
      Total Throughput: 1 Cycles;             Throughput Bottleneck: Port2_ALU, Port2_DATA
      Total number of Uops bound to ports:  1
      Data Dependency Latency:    6 Cycles;   Performance Latency:    7 Cycles
      Port Binding in cycles:
      -------------------------------------------------------
      |  Port  |  0 - DV |  1 |  2 -  D |  3 -  D |  4 |  5 |
      -------------------------------------------------------
      | Cycles |  0 |  0 |  0 |  1 |  1 |  0 |  0 |  0 |  0 |
      -------------------------------------------------------
      | Num of |          Ports pressure in cycles          |    |
      |  Uops  |  0 - DV |  1 |  2 -  D |  3 -  D |  4 |  5 |    |
      ------------------------------------------------------------
      |   1    |    |    |    |  1 |  1 |  X |  X |    |    | CP | movss xmm0, dword ptr [eax]
      ----------------------------------------------------------------
      I guess that 'Data Dependency Latency:    6 Cycles' comes from 1 cycle for 'movss' itself plus 5 cycles of L1 cache load latency for the xmm register.
      But I don't understand where the additional 6 cycles of latency in 'Performance Latency:    7 Cycles' come from.
      I'd appreciate it if somebody could answer these questions.