
    Hints for performance tuning on AI DevCloud

    gnperdue

      Hello,

       

      I am running a particle physics application that does event vertex reconstruction for a neutrino scattering experiment (it happens to be the MINERvA experiment at Fermilab, MINERvA: Bringing neutrinos into sharp focus). The input data are images from three "views" of a particle physics detector. They are stored as gzip-compressed TensorFlow TFRecord files and accessed via the TF file and batch-queue API. My baseline model feeds the image views into three convolutional towers before joining them with fully connected layers. Because the image space is non-linear, regression for the vertex position does not work well, but classification (with cross-entropy loss) is quite successful.
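
      For context, the input pipeline looks roughly like the sketch below, reading the gzip-compressed TFRecord files through the TF 1.x file and batch-queue API. The feature names, image dimensions, and queue capacities here are placeholders rather than the real MINERvA schema:

      ```
      import tensorflow as tf

      IMG_HEIGHT, IMG_WIDTH = 127, 94  # placeholder image dimensions

      def batch_generator(file_list, batch_size=500):
          # Queue of input files, reshuffled on each pass through the data.
          filename_queue = tf.train.string_input_producer(file_list, shuffle=True)

          # Reader configured for gzip-compressed TFRecord files.
          options = tf.python_io.TFRecordOptions(
              tf.python_io.TFRecordCompressionType.GZIP)
          reader = tf.TFRecordReader(options=options)
          _, serialized = reader.read(filename_queue)

          # One image per detector view (x, u, v) plus the vertex class label.
          features = tf.parse_single_example(
              serialized,
              features={
                  'view_x': tf.FixedLenFeature([], tf.string),
                  'view_u': tf.FixedLenFeature([], tf.string),
                  'view_v': tf.FixedLenFeature([], tf.string),
                  'label': tf.FixedLenFeature([], tf.int64),
              })
          views = [tf.reshape(tf.decode_raw(features[k], tf.float32),
                              [IMG_HEIGHT, IMG_WIDTH, 1])
                   for k in ('view_x', 'view_u', 'view_v')]

          # Batch queue that feeds the three convolutional towers.
          return tf.train.shuffle_batch(
              views + [features['label']], batch_size=batch_size,
              capacity=10 * batch_size, min_after_dequeue=2 * batch_size,
              num_threads=4)
      ```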

       

      I'm naively benchmarking the AI DevCloud, running the exact same network and hyper-parameters I use on GPU servers. I am seeing relatively slow performance on the DevCloud, although there are quirks to the performance on GPU servers as well. For example, I run training with 500 events per batch and see the following results:

       

      * 1 batch / 1.2s on a gpu server with k40m

      * 1 batch / 1.5s on a gpu server with P100

      * 1 batch / 3.4s on the AI DevCloud

       

      I don't fully understand why the P100 is slower than the K40 GPU, but that is not a subject for this forum, haha. What I would like is some insight on ways to tune my application to provide a more honest benchmark on the AI DevCloud. I would prefer not to change the network structure, but perhaps there are other parameters I could tune slightly (e.g. batch size, file compression, perhaps certain optimizers should be avoided, etc.). Also, I am a bit surprised to see this warning in the log files:

       

      ```

      2017-11-22 13:29:27.509739: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.

      2017-11-22 13:29:27.509771: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.

      2017-11-22 13:29:27.509777: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.

      2017-11-22 13:29:27.509781: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.

      2017-11-22 13:29:27.509786: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX512F instructions, but these are available on your machine and could speed up CPU computations.

      2017-11-22 13:29:27.509791: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.

      ```

       

      Is this expected? I believe I am on the right hardware; at the end of the job, `lscpu` yields:

       

      ```

      Architecture:          x86_64

      CPU op-mode(s):        32-bit, 64-bit

      Byte Order:            Little Endian

      CPU(s):                24

      On-line CPU(s) list:   0-23

      Thread(s) per core:    2

      Core(s) per socket:    6

      Socket(s):             2

      NUMA node(s):          2

      Vendor ID:             GenuineIntel

      CPU family:            6

      Model:                 85

      Model name:            Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz

      Stepping:              4

      CPU MHz:               1599.992

      CPU max MHz:           3700.0000

      CPU min MHz:           1200.0000

      BogoMIPS:              6800.00

      Virtualization:        VT-x

      L1d cache:             32K

      L1i cache:             32K

      L2 cache:              1024K

      L3 cache:              19712K

      NUMA node0 CPU(s):     0-5,12-17

      NUMA node1 CPU(s):     6-11,18-23

      ```

       

      Thanks for any thoughts!

        • 1. Re: Hints for performance tuning on AI DevCloud
          Ravikeron

          Hi,

             Regarding the warnings, you can ignore them for now. For performance improvement, please try the following options.
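
             Those warnings only mean that the TensorFlow build you are running was not compiled with those instruction sets enabled; as the log itself says, the CPU does support them. If you want to double-check that from Python, a quick (Linux-only) look at /proc/cpuinfo would be roughly:

             ```
             # Quick sanity check (Linux only): list which of the instruction sets
             # named in the TensorFlow warnings the CPU actually advertises.
             cpu_flags = set()
             with open('/proc/cpuinfo') as f:
                 for line in f:
                     if line.startswith('flags'):
                         cpu_flags.update(line.split(':', 1)[1].split())
                         break

             for isa in ('sse4_1', 'sse4_2', 'avx', 'avx2', 'avx512f', 'fma'):
                 print(isa, 'available' if isa in cpu_flags else 'not reported')
             ```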

           

          While executing your script, you can prefix the command with the following option:

                 numactl --interleave=all <your training command>      -- this interleaves memory allocation across both NUMA nodes, which can enhance performance

           

          Inside your script, you can set the following options (see the sketch below for how the thread-count flags feed into the session configuration):

               import os
               import tensorflow as tf

               # Thread-count flags, consumed when the session config is built.
               tf.app.flags.DEFINE_integer('inter_op', 2, """Inter Op Parallelism Threads.""")
               tf.app.flags.DEFINE_integer('intra_op', 136, """Intra Op Parallelism Threads.""")

               # OpenMP / Intel MKL threading environment settings.
               os.environ["OMP_NUM_THREADS"] = "136"
               os.environ["KMP_BLOCKTIME"] = "30"
               os.environ["KMP_SETTINGS"] = "1"
               os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"

           

          Thanks

          Ravi Keron N

          • 2. Re: Hints for performance tuning on AI DevCloud
            Anju_Paul

            Hi,

             

            Is this issue resolved? 

            We will need to close this if we do not receive any more updates. 

            You will be able to create a new issue if you are still facing difficulties.

             

            Regards,

            Anju