10 Replies Latest reply on Mar 15, 2018 4:58 AM by Intel Corporation

    Intel Caffe AlexNet - Run Time Variation On Intel Xeon Phi

    Chetan Arvind Patil

      HI All,

       

      I am profiling Intel Caffe running AlexNet. The number of iterations is 1000 and the batch size is 256. I use OpenMP to distribute 128 threads, and I explicitly provide a thread mapping list that replicates scatter affinity, i.e., threads are allocated to logical processors 0-63 and then 64-127.

       

      However, I observe a large variation in run time between two runs of the above workload. In the first run I get a run time of 55.8 minutes, and in the second run I get 42.5 minutes. I was expecting the difference between two runs of the same Intel Caffe workload to be within 1%, but here I observe a run-time difference of more than 23%. After the first run I flush the DRAM and ensure that the new run has all 116 GB available to it (100 GB DDR4 + 16 GB MCDRAM in All-to-All Flat mode).

       

      Is this due to specific learning parameters or random seeding in the network that changes for every new run? How can I ensure that the run time does not vary between two runs of the same network?

       

      Can anyone please suggest a possible reason for this?

       

      Thanks,

      Chetan Arvind Patil

        • 1. Re: Intel Caffe AlexNet - Run Time Variation On Intel Xeon Phi
          Intel Corporation
          This message was posted on behalf of Intel Corporation

          Hi Chetan,

          We are looking into it and will come back shortly.

          Thanks,
          Rishabh

          • 2. Re: Intel Caffe AlexNet - Run Time Variation On Intel Xeon Phi
            Intel Corporation
            This message was posted on behalf of Intel Corporation

            Hi Chetan,

            Could you please provide the following details:
            1) Which server are you using? Since you have mentioned Xeon Phi, are you running the model on Intel's old Colfax cluster (KNL)?
            2) Which dataset are you using?
            3) Could you share the detailed steps you are using to profile Intel Caffe?

            Thanks,
            Rishabh

            • 3. Re: Intel Caffe AlexNet - Run Time Variation On Intel Xeon Phi
              Chetan Arvind Patil

              Hi Rishabh,

               

               1) I am using an Intel Xeon Phi 7210: Developer Access Program (DAP) for Intel® Xeon Phi™ Processor (formerly Knights Landing). This isn't the Colfax cluster; it is a standalone Linux box.

               2) I am using the ImageNet dataset in LMDB format.

              3) Steps:

                          1) Compiled Intel Caffe

                          2) Setup AlexNet as per the instructions given

                           3) I use the Intel Optimized Model files for AlexNet. I keep the iterations at 1000 (attached are the solver and training files I use): caffe/models/intel_optimized_models/alexnet at master · intel/caffe · GitHub

                           4) I have attached the bash script I use to profile the benchmark. I use perf for profiling and numactl to map the application to MCDRAM, with the scatter affinity described in the bash script (a simplified sketch of what such a script looks like is shown after this list).

                           5) I collect the execution-time data and log it.

                           6) I re-run the same setup a second time and get a very different run time.

               4) I haven't changed anything on the Intel Caffe side.
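
               Roughly, the profiling wrapper does the following. This is a simplified sketch, not the attached script itself; the caffe path, the output file name, and the assumption that MCDRAM is exposed as NUMA node 1 in flat mode are illustrative:

                   #!/bin/bash
                   # Simplified sketch of the profiling wrapper (the attached script is authoritative).
                   # Assumes MCDRAM shows up as NUMA node 1 in flat mode.

                   export OMP_NUM_THREADS=128
                   # Explicit mapping that replicates scatter: logical processors 0-63 first, then 64-127.
                   export KMP_AFFINITY="granularity=fine,proclist=[$(seq -s, 0 127)],explicit,verbose"

                   # Bind allocations to MCDRAM (node 1) and collect counters for the whole run.
                   perf stat -e cycles,instructions,branches,branch-misses,page-faults \
                       -o alexnet_run.txt -- \
                       numactl --membind=1 \
                       ./build/tools/caffe train \
                           --solver=models/intel_optimized_models/alexnet/solver.prototxt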

               

               If it helps, you may reach out to me directly if there is a way to do so on the Intel forum. I have a strict deadline and want the Intel Caffe and Xeon Phi training benchmarking to be accurate.

               

               P.S.: I don't see a difference in run time between runs if I run matrix multiplication with Intel OpenMP (mmatest1.c) as described here: Performance of Classic Matrix Multiplication Algorithm on Intel® Xeon Phi™ Processor System | Intel® Software

               

              Thanks,

              Chetan Arvind Patil

              • 4. Re: Intel Caffe AlexNet - Run Time Variation On Intel Xeon Phi
                Intel Corporation
                This message was posted on behalf of Intel Corporation

                 Had a discussion over email internally with Chetan. Working on this.

                • 5. Re: Intel Caffe AlexNet - Run Time Variation On Intel Xeon Phi
                  Intel Corporation
                  This message was posted on behalf of Intel Corporation

                  Hi Chetan,

                   I ran your code with 2,000 images from ImageNet and 1,000 iterations using Intel Caffe with AlexNet and got the following results:
                  1) With thread settings:

                  KMP_AFFINITY="granularity=fine,proclist=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63],explicit,verbose"

                  a) It took 16 min 37 sec
                  Output from perf:
                  $ tail -5 test0-63.txt
                     997.377487200              9,357      page-faults               #    0.035 K/sec
                     997.377487200     68,626,124,016      cycles                    #    0.256 GHz                      (50.00%)
                     997.377487200     22,399,959,992      instructions              #    0.12  insn per cycle           (74.99%)
                     997.377487200      1,299,834,688      branches                  #    4.841 M/sec                    (75.01%)
                     997.377487200         71,964,655      branch-misses             #    1.13% of all branches          (75.00%)
                  2) With thread settings:
                  export KMP_AFFINITY="granularity=fine,proclist=[64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127],explicit,verbose"

                  a) It took 17 min 16 sec
                  Output from perf:
                  $ tail -5 test64-127.txt
                    1036.172187473              1,984      page-faults               #    0.007 K/sec
                    1036.172187473     24,444,323,170      cycles                    #    0.088 GHz                      (50.00%)
                    1036.172187473      3,970,192,335      instructions              #    0.02  insn per cycle           (75.00%)
                    1036.172187473        368,822,852      branches                  #    1.333 M/sec                    (75.00%)
                    1036.172187473         33,082,440      branch-misses             #    0.49% of all branches          (75.00%)

                   The difference is only approximately 39 seconds.
                   Please let us know if you have any further concerns.
                   
                  Thanks,
                  Rishabh
                  • 6. Re: Intel Caffe AlexNet - Run Time Variation On Intel Xeon Phi
                    Chetan Arvind Patil

                    Hi Rishabh,

                     

                     Your experiment is different from mine. I ran all 128 threads together (2 per core), but I explicitly defined the thread list exactly as scatter would. In your case you did two runs, but each used only 64 threads, not 128.

                     

                    It should be like this:

                     

                    export KMP_AFFINITY="granularity=fine,proclist=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127],explicit,verbose"

                     

                     The above mapping for AlexNet and Intel Caffe should be run twice, and then the comparison should be done between those two runs. Also, in your case the number of instructions varies a lot between runs; I expect the total instruction count to remain the same as long as I am running the same thing twice.

                     

                    Thanks,

                    Chetan Arvind Patil

                    • 7. Re: Intel Caffe AlexNet - Run Time Variation On Intel Xeon Phi
                      Intel Corporation
                      This message was posted on behalf of Intel Corporation

                      Hi Chetan,

                       I ran with threads 0-63 and 64-127 in two runs because you mentioned initially that you explicitly provide a thread mapping list that replicates scatter, i.e., allocating threads to 0-63 and then 64-127.
                       Now I have run the same script you provided, i.e., with 128 threads:
                      export KMP_AFFINITY="granularity=fine,proclist=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127],explicit,verbose"

                       I got the following results:

                                              Time taken (sec)    Total Instructions
                       First Run              820.8708172         74,858,540,735,339
                       Second Run             819.9883731         74,678,843,574,714
                       % diff between runs    0.1075              0.24

                       The instruction counts include all kernel, user, and system level instructions. There are many factors that can lead to variation in the instruction count.
                       Could you please let us know exactly what you are trying to achieve with this analysis, so that we can help you better?

                      Thanks,
                      Rishabh

                      • 8. Re: Intel Caffe AlexNet - Run Time Variation On Intel Xeon Phi
                        Chetan Arvind Patil

                        Hi Rishabh,

                         

                         One issue was getting different run times across runs with the same mapping and network. I solved that earlier by reducing, compressing, and resizing the training data; I suspect the data was not fitting in memory before, which led to I/O requests.
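
                         For anyone who runs into the same issue: one way to do this is to rebuild a smaller, pre-resized LMDB with Caffe's convert_imageset tool. The paths and image size below are illustrative assumptions, not the exact values I used:

                             # Rebuild the training LMDB with smaller, pre-resized images so it fits in memory.
                             # Paths and dimensions are illustrative only.
                             ./build/tools/convert_imageset --resize_height=256 --resize_width=256 --shuffle \
                                 /data/imagenet/train/ train.txt /data/imagenet/train_lmdb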

                         

                         The second issue is getting a different number of instructions or L2 misses when the thread mapping changes. Ideally, if the input and output are the same (in this case Intel Caffe + AlexNet), there shouldn't be a large variation in the performance counters. But in my analysis I see many things that don't make sense, starting with the instruction count.

                         

                         If you can, please try to compare balanced vs. scatter with the same thread count. I understand that the way balanced maps threads (given that the number of threads exceeds the number of physical cores) is different, but in the end the application is the same and the computation is the same. Justifying the result as "something happening that leads to a drastic change in instructions retired" is not acceptable.

                         

                        Please use "userspace" for your analysis, as that way we can at least remove kernel noise.

                         

                        Thanks,

                        Chetan Arvind Patil

                        • 9. Re: Intel Caffe AlexNet - Run Time Variation On Intel Xeon Phi
                          Intel Corporation
                          This message was posted on behalf of Intel Corporation

                          Hi Chetan,

                          1) The lower performance of balanced as compared to scatter mode is explained in the following link:

                          Besides the spreading of threads across cores and the sharing of memory data, for which balanced seems logically better, performance may also be affected by the code itself.
                          As described in the link:
                          // Triangular loop: outer iteration i does i units of constant-time work,
                          // so the work is unevenly distributed across outer iterations.
                          #pragma omp parallel for
                          for (int i = 0; i < 10; i++)
                              for (int j = 0; j < i; j++)
                                  do_constant_time_work();  // placeholder for the constant-time body
                          Each outer-loop iteration has a different amount of work: iteration i does i units, so i=1 does 1 unit, i=2 does 2 units, and finally i=9 does 9 units.
                          a) If you use a compact schedule, the distribution of outer-loop iterations to cores will be (0,1), (2,3), ..., (8,9), so the amount of work executed by each core will be (0+1), (2+3), ..., (8+9). With scatter affinity the distribution of iterations to cores is (0,5), (1,6), ..., (4,9), and similarly the work per core will be (0+5), (1+6), ..., (4+9). In either case there remains an imbalance, but in the compact case it is between 1 and 17 units of work per core, whereas in the scatter case it is between 5 and 13.
                          b) Looking at the last core: with the compact schedule it takes (8*2)+1 = 17 units of time, whereas with scatter it takes (4*2)+5 = 13 units of time.
                          c) Threads that are waiting may also be executing polling loops.

                          Hence balanced could be underperforming due to such code patterns.

                          2) In addition, to get more out of your analysis:
                          a) Try running the code in quadrant mode
                          b) Run using 64 threads (refer to: https://github.com/intel/caffe/wiki/Recommendations-to-achieve-best-performance ); a small sketch of such a run follows below.
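
                          For 2b, a run along the following lines is what we have in mind; the exact recommended values are in the linked wiki, so treat the settings and paths below as illustrative assumptions:

                              export OMP_NUM_THREADS=64                           # one thread per physical core
                              export KMP_AFFINITY="granularity=fine,compact,1,0"  # place one thread on each core
                              numactl --membind=1 ./build/tools/caffe train \
                                  --solver=models/intel_optimized_models/alexnet/solver.prototxt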

                          Thanks,
                          Rishabh Kumar Jain
                          • 10. Re: Intel Caffe AlexNet - Run Time Variation On Intel Xeon Phi
                            Intel Corporation
                            This message was posted on behalf of Intel Corporation

                            Hi Chetan,

                            I believe the above answers your question, so I am closing the ticket.
                            Please let us know if you have any further concerns.

                            Thanks and regards,
                            Rishabh Kumar Jain