18 Replies. Latest reply on Apr 8, 2018 10:30 PM by Intel Corporation

    Performance++?

    Zubair2018

      I followed your response to this thread and made the following changes to my script.

       

      Training of my model is still really slow. This could be because my HDF5 dataset is large and the training genuinely requires a lot of time (I haven't tried it locally to compare).

       

       

      1. Set the inter-op and intra-op thread parameters in Keras. The following code can be used for that:
      from keras import backend as K
      import tensorflow as tf
      config = tf.ConfigProto(intra_op_parallelism_threads=64, inter_op_parallelism_threads=2, allow_soft_placement=True,  device_count = {'CPU': 64})
      session = tf.Session(config=config)
      K.set_session(session)

       

      2. Set OpenMP* environment variables (OMP_) and extensions (KMP_).
      import os
      os.environ["OMP_NUM_THREADS"] = "64"
      os.environ["KMP_BLOCKTIME"] = "30"
      os.environ["KMP_SETTINGS"] = "1"
      os.environ["KMP_AFFINITY"]= "granularity=fine,verbose,compact,1,0"

       

      This should give you better performance than what you've observed.

       

      For much better performance, try DevCloud with Skylake Xeon processors.
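
A minimal sketch of the two steps above, assuming the safer ordering is to export the OpenMP variables before TensorFlow is imported, since the OpenMP runtime generally reads them when it initializes. The helper name `openmp_env` and the use of `os.cpu_count()` as the thread count are illustrative assumptions, not part of the quoted advice. The TensorFlow 1.x calls are left as comments so the snippet runs without TensorFlow installed.

```python
import os


def openmp_env(num_threads):
    """Return the OpenMP/KMP settings from step 2 for a given thread count.

    Hypothetical helper: the values mirror the ones quoted above, with only
    the thread count made configurable.
    """
    return {
        "OMP_NUM_THREADS": str(num_threads),
        "KMP_BLOCKTIME": "30",
        "KMP_SETTINGS": "1",
        "KMP_AFFINITY": "granularity=fine,verbose,compact,1,0",
    }


# Assumption: use the number of logical CPUs the node reports, not a fixed 64.
num_threads = os.cpu_count() or 1

# Export the variables first, before TensorFlow (and its OpenMP runtime) loads.
os.environ.update(openmp_env(num_threads))

# Only now import TensorFlow and build the session (TensorFlow 1.x API, as in
# step 1). Left commented so the sketch runs without TensorFlow installed.
# import tensorflow as tf
# from keras import backend as K
# config = tf.ConfigProto(intra_op_parallelism_threads=num_threads,
#                         inter_op_parallelism_threads=2,
#                         allow_soft_placement=True)
# K.set_session(tf.Session(config=config))
```

With `KMP_SETTINGS` set to `1`, the OpenMP runtime prints the settings it picked up at startup, which is one way to check whether the variables were actually read.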

       

       

      The post says to try the Skylake Xeon processors, but it seems all the processors are already Skylake.

       

      When I check pbsnodes, every node shows details similar to the following:

       

      state = job-exclusive
      power_state = Running
      np = 2
      properties = xeon,skl,gold6128,ram96gb,1gbe
      ntype = cluster
      jobs = 0/52733.c009,1/52738.c009
      status = rectime=1522183083,macaddr=a4:bf:01:38:e0:68,cpuclock=Fixed,varattr=,jobs=52733.c009(cput=16,energy_used=0,mem=156208kb,vmem=2147040kb,walltime=12555,session_id=123975) 52738.c009(cput=21,energy_used=0,mem=327132kb,vmem=3214064kb,walltime=10602,session_id=128221),state=free,netload=1139955518563,gres=,loadave=0.00,ncpus=24,physmem=196704400kb,availmem=212528904kb,totmem=213478540kb,idletime=94659,nusers=2,nsessions=5,sessions=123975 124459 124467 128221 128472,uname=Linux c009-n001 4.15.2-1.el7.elrepo.x86_64 #1 SMP Wed Feb 7 17:26:44 EST 2018 x86_64,opsys=linux
      mom_service_port = 15002
      mom_manager_port = 15003
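
The `status` line above reports the node's logical CPU count (`ncpus=24`), which is lower than the 64 threads set in the configuration quoted earlier. A small sketch for pulling such fields out of a `status` string follows. The function name `parse_pbs_status` is hypothetical, and the sample string is abbreviated from the output above.

```python
def parse_pbs_status(status):
    """Split a pbsnodes 'status' value into a dict of key -> string value.

    Fields are comma-separated key=value pairs; anything without an '='
    is skipped. Values that themselves contain commas (such as the 'jobs'
    entry) are not handled by this simple sketch.
    """
    fields = {}
    for item in status.split(","):
        key, sep, value = item.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


# Abbreviated sample taken from the node listing above.
sample = ("rectime=1522183083,cpuclock=Fixed,state=free,loadave=0.00,"
          "ncpus=24,physmem=196704400kb,nusers=2,opsys=linux")

info = parse_pbs_status(sample)
ncpus = int(info["ncpus"])
# physmem is reported in kilobytes with a 'kb' suffix; convert to gigabytes.
physmem_gb = int(info["physmem"].rstrip("kb")) / 1024 ** 2
print(ncpus, round(physmem_gb, 1))
```

A value like `ncpus` could then feed the thread-count settings instead of a hard-coded number, though whether that is the best setting for this model is something only a measurement can confirm.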

       

      So my question is: what more can I do to make sure I'm getting maximum performance out of these nodes?

        • 1. Re: Performance++?
          Intel Corporation
          This message was posted on behalf of Intel Corporation

          Hello,

          Thanks for reaching out to us.

          Could you please share the following details so we can investigate further?

          1. The topology and framework details.
          2. Is the code available on GitHub? If yes, please share the details and we will check the performance from our end.
          3. The current performance result.

          Thanks,
          Dilraj

          • 2. Re: Performance++?
            Zubair2018

            Hi,

             

            I am using Keras with the TensorFlow backend.

            The code is not available on GitHub, but it can be found in my directory: u12628 / deepergooglenet / train.py

             

            It takes around 8 hours to do 9 epochs. I'm not sure what topology means here.

            • 3. Re: Performance++?
              Intel Corporation
              This message was posted on behalf of Intel Corporation

              Hello,

              Could you please change the thread-count parameters below (currently 64) to 24 or 48 and let us know whether performance improves?
              config = tf.ConfigProto(intra_op_parallelism_threads=64, inter_op_parallelism_threads=2, allow_soft_placement=True,  device_count = {'CPU': 64})
              session = tf.Session(config=config)
              K.set_session(session)

              os.environ["OMP_NUM_THREADS"] = "64"

              We don't have access to the path you provided, so we are unable to open the file and test it from our end.

              Thanks



               

              • 4. Re: Performance++?
                Zubair2018

                OK, I have changed it and started retraining. I'll report back on the performance.

                • 5. Re: Performance++?
                  Zubair2018

                  So there's definitely something wrong. The author of the model, which I'm training without modification, just said that one epoch takes him 250 seconds (just under 5 minutes) on a Titan X GPU.

                  For me, it takes 8 hours to do 9 epochs, which is insane.
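
To put the two figures on the same scale, here is a quick calculation using only the numbers quoted in this post (8 hours for 9 epochs here, 250 seconds per epoch reported for the GPU). The function name is illustrative.

```python
def seconds_per_epoch(total_hours, epochs):
    """Average wall-clock seconds per epoch for a run."""
    return total_hours * 3600 / epochs


cpu_epoch = seconds_per_epoch(8, 9)   # figures from this post
gpu_epoch = 250                       # seconds, as reported for the Titan X

print(f"CPU: {cpu_epoch:.0f} s/epoch")            # CPU: 3200 s/epoch
print(f"Slowdown: {cpu_epoch / gpu_epoch:.1f}x")   # Slowdown: 12.8x
```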

                   

                  I have reduced the number of threads and CPU count and that's not helping either.

                   

                  Please help!

                  • 6. Re: Performance++?
                    Zubair2018

                    I killed the training run that was using a CPU count of 48 and changed it to 12. It did 5 epochs in just over 2 hours, which is a significant improvement over the previous result.

                    Now I have completely removed the changes you suggested and I'm retraining. Let's see what I get.

                    • 7. Re: Performance++?
                      Zubair2018

                      Alright, completely removing those optimizations was not a good idea: the model didn't learn a thing.

                      I have set them back to the following to see what I get this time:

                       

                      config2 = tf.ConfigProto(intra_op_parallelism_threads=48, inter_op_parallelism_threads=2, allow_soft_placement=True,  device_count = {'CPU': 48})
                      session = tf.Session(config=config2)
                      KK.set_session(session)

                      os.environ["OMP_NUM_THREADS"] = "48"
                      os.environ["KMP_BLOCKTIME"] = "30"
                      os.environ["KMP_SETTINGS"] = "1"
                      os.environ["KMP_AFFINITY"]= "granularity=fine,verbose,compact,1,0"

                      • 8. Re: Performance++?
                        Intel Corporation
                        This message was posted on behalf of Intel Corporation

                        Thanks for sharing your observations. We will wait for your final result.

                        Thanks,
                        Dilraj

                        • 9. Re: Performance++?
                          Zubair2018

                          It is taking 3 hours to do 5 epochs, but on a Titan X GPU it takes 25 minutes.

                          How can I fix that?

                          • 10. Re: Performance++?
                            Zubair2018

                            Dilraj,

                             

                            This is the code that I'm using. The model is taking 6 hours to do 5 epochs.

                            A DeeperGoogLeNet model is built in Keras and used for training. It takes 25 minutes to do 5 epochs on a Titan X GPU, but here it takes 6 hours.

                             

                            # USAGE
                            # python train.py --checkpoints output/checkpoints
                            # python train.py --checkpoints output/checkpoints --model output/checkpoints/epoch_25.hdf5 --start-epoch 25
                            
                            
                            # set the matplotlib backend so figures can be saved in the background
                            import matplotlib
                            matplotlib.use("Agg")
                            
                            
                            # import the necessary packages
                            from config import tiny_imagenet_config as config
                            from pyimagesearch.preprocessing import ImageToArrayPreprocessor
                            from pyimagesearch.preprocessing import SimplePreprocessor
                            from pyimagesearch.preprocessing import MeanPreprocessor
                            from pyimagesearch.callbacks import EpochCheckpoint
                            from pyimagesearch.callbacks import TrainingMonitor
                            from pyimagesearch.io import HDF5DatasetGenerator
                            from pyimagesearch.nn.conv import DeeperGoogLeNet
                            from keras.preprocessing.image import ImageDataGenerator
                            from keras.optimizers import Adam
                            from keras.models import load_model
                            import keras.backend as K
                            import argparse
                            import json
                            import os
                            import keras.backend.tensorflow_backend as KK
                            import tensorflow as tf
                            
                            
                            config2 = tf.ConfigProto(intra_op_parallelism_threads=256, inter_op_parallelism_threads=2, allow_soft_placement=True,  device_count = {'CPU': 64})
                            session = tf.Session(config=config2)
                            KK.set_session(session)
                            
                            
                            os.environ["OMP_NUM_THREADS"] = "256"
                            os.environ["KMP_BLOCKTIME"] = "30"
                            os.environ["KMP_SETTINGS"] = "1"
                            os.environ["KMP_AFFINITY"]= "granularity=fine,verbose,compact,1,0"
                            
                            
                            # construct the argument parse and parse the arguments
                            ap = argparse.ArgumentParser()
                            ap.add_argument("-c", "--checkpoints", required=True,
                                help="path to output checkpoint directory")
                            ap.add_argument("-m", "--model", type=str,
                                help="path to *specific* model checkpoint to load")
                            ap.add_argument("-s", "--start-epoch", type=int, default=0,
                                help="epoch to restart training at")
                            args = vars(ap.parse_args())
                            
                            
                            # construct the training image generator for data augmentation
                            aug = ImageDataGenerator(rotation_range=18, zoom_range=0.15,
                                width_shift_range=0.2, height_shift_range=0.2, shear_range=0.15,
                                horizontal_flip=True, fill_mode="nearest")
                            
                            
                            # load the RGB means for the training set
                            means = json.loads(open(config.DATASET_MEAN).read())
                            
                            
                            # initialize the image preprocessors
                            sp = SimplePreprocessor(64, 64)
                            mp = MeanPreprocessor(means["R"], means["G"], means["B"])
                            iap = ImageToArrayPreprocessor()
                            
                            
                            # initialize the training and validation dataset generators
                            trainGen = HDF5DatasetGenerator(config.TRAIN_HDF5, 64, aug=aug,
                                preprocessors=[sp, mp, iap], classes=config.NUM_CLASSES)
                            valGen = HDF5DatasetGenerator(config.VAL_HDF5, 64,
                                preprocessors=[sp, mp, iap], classes=config.NUM_CLASSES)
                            
                            
                            # if there is no specific model checkpoint supplied, then initialize
                            # the network and compile the model
                            if args["model"] is None:
                                print("[INFO] compiling model...")
                                model = DeeperGoogLeNet.build(width=64, height=64, depth=3,
                                    classes=config.NUM_CLASSES, reg=0.0002)
                                opt = Adam(1e-3)
                                model.compile(loss="categorical_crossentropy", optimizer=opt,
                                    metrics=["accuracy"])


                            # otherwise, load the checkpoint from disk
                            else:
                                print("[INFO] loading {}...".format(args["model"]))
                                model = load_model(args["model"])


                                # update the learning rate
                                print("[INFO] old learning rate: {}".format(
                                    K.get_value(model.optimizer.lr)))
                                K.set_value(model.optimizer.lr, 1e-5)
                                print("[INFO] new learning rate: {}".format(
                                    K.get_value(model.optimizer.lr)))
                            
                            
                            # construct the set of callbacks
                            callbacks = [
                                EpochCheckpoint(args["checkpoints"], every=5,
                                    startAt=args["start_epoch"]),
                                TrainingMonitor(config.FIG_PATH, jsonPath=config.JSON_PATH,
                                    startAt=args["start_epoch"])]
                            
                            
                            # train the network
                            model.fit_generator(
                                trainGen.generator(),
                                steps_per_epoch=trainGen.numImages // 64,
                                validation_data=valGen.generator(),
                                validation_steps=valGen.numImages // 64,
                                epochs=10,
                                max_queue_size=64 * 2,
                                callbacks=callbacks, verbose=1)
                            
                            
                            # close the databases
                            trainGen.close()
                            valGen.close()
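
One way to narrow this down is to time the input pipeline on its own, since `fit_generator` can only train as fast as `trainGen.generator()` yields augmented batches from the HDF5 file. The sketch below is an assumption about how one might measure that, not something from the thread. `batches_per_second` is a hypothetical helper, and the dummy generator stands in for `trainGen.generator()` so the snippet runs without the dataset.

```python
import itertools
import time


def batches_per_second(generator, num_batches=50):
    """Pull up to num_batches from a generator and return (count, rate)."""
    start = time.perf_counter()
    count = sum(1 for _ in itertools.islice(generator, num_batches))
    elapsed = time.perf_counter() - start
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, rate


def dummy_generator():
    """Stand-in for trainGen.generator(): yields (images, labels) forever."""
    while True:
        yield [[0.0] * 8], [0]


count, rate = batches_per_second(dummy_generator(), num_batches=20)
print(f"{count} batches at {rate:.1f} batches/s")
```

Comparing this rate with the steps per second that Keras reports during training would suggest whether the data pipeline or the model computation is the limiting factor.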
                            

                             

                            Awaiting your earliest reply.

                            • 11. Re: Performance++?
                              Intel Corporation
                              This message was posted on behalf of Intel Corporation

                              Hello,

                              We will check and run the code and get back to you as soon as possible.

                              Thanks,
                              Dilraj

                              • 12. Re: Performance++?
                                Intel Corporation
                                This message was posted on behalf of Intel Corporation

                                Hello,

                                Could you please share all of the code, including the .py files for the dependencies imported below?

                                from config import tiny_imagenet_config as config  
                                from pyimagesearch.preprocessing import ImageToArrayPreprocessor  
                                from pyimagesearch.preprocessing import SimplePreprocessor  
                                from pyimagesearch.preprocessing import MeanPreprocessor  
                                from pyimagesearch.callbacks import EpochCheckpoint  
                                from pyimagesearch.callbacks import TrainingMonitor  
                                from pyimagesearch.io import HDF5DatasetGenerator  
                                from pyimagesearch.nn.conv import DeeperGoogLeNet  
                                from keras.preprocessing.image import ImageDataGenerator  
                                from keras.optimizers import Adam  
                                from keras.models import load_model  
                                import keras.backend as K   
                                import keras.backend.tensorflow_backend as KK 

                                Thanks,
                                Dilraj

                                • 13. Re: Performance++?
                                  karlfezer

                                  Hey Zubair,

                                   

                                   

                                  Do you have a link to the post you're trying to replicate? That might help in terms of having a written guide to follow.

                                   

                                  Thanks,

                                  -Karl

                                  • 14. Re: Performance++?
                                    Zubair2018

                                    Hi Karl,

                                     

                                    I'm not allowed to share the code publicly. I can, however, send it to you by email. Would that work for you?
