16 Replies · Latest reply on Apr 26, 2018 4:42 AM by Anju_Paul

    Part2:  Missing log and information on job to DevCloud

    AdamMiltonBarker

      Hi guys, after a number of issues attempting to train on a large dataset (roughly 4000 images per class), the trainer simply fails without any error or output logs, so I am unable to see what went wrong. A pbtxt file was generated in the logs and I have tried to create the graph for Movidius from it, but it does not appear to have worked correctly. Using the pb generated from my local training, converting to a Movidius-friendly graph works fine; converting the pbtxt file to a pb and then converting that pb to a graph with the Movidius SDK gives me an error about the input node not existing. However, it should exist, as per this line in the training script:

       

      images = tf.placeholder("float", [1, Trainer._confs["ClassifierSettings"]["image_size"], Trainer._confs["ClassifierSettings"]["image_size"], 3], name="input")

       

      After converting the pb file generated by the part of the script that was skipped when training on the server, everything works fine and classification works as expected.
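
      As a fallback I may also try freezing directly from the checkpoint and its .meta file rather than converting the pbtxt. A rough sketch of that route is below; the checkpoint prefix and the output node name are placeholders, not the exact values from my script:

      # Rough sketch only (TF 1.x): freeze from the checkpoint instead of the pbtxt.
      # The checkpoint prefix and output node name below are placeholders / assumptions.
      import tensorflow as tf
      from tensorflow.python.framework import graph_util

      checkpoint = "model/_logs/model.ckpt-10000"          # placeholder checkpoint prefix
      output_node = "InceptionV3/Predictions/Softmax"      # assumed output node name

      with tf.Session() as sess:
          # Rebuild the graph from the .meta file and restore the trained weights
          saver = tf.train.import_meta_graph(checkpoint + ".meta")
          saver.restore(sess, checkpoint)

          # Fold variables into constants so the Movidius SDK sees one self-contained graph
          frozen = graph_util.convert_variables_to_constants(
              sess, sess.graph_def, [output_node])

      with tf.gfile.GFile("frozen.pb", "wb") as f:
          f.write(frozen.SerializeToString())

      # Then compile for the NCS (with the NCSDK installed), using the "input"
      # placeholder name from the training line above:
      #   mvNCCompile frozen.pb -in input -on InceptionV3/Predictions/Softmax -o graph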

       

      I will convert the pbtxt from the original training tomorrow to clarify if it is anything related to that process.

        • 1. Re: Part2:  Missing log and information on job to DevCloud
          Intel Corporation
          This message was posted on behalf of Intel Corporation

          Hello,

          We would like to troubleshoot the issue by recreating the environment.

          Since we don't have access to the files mentioned, kindly share the following:

          1. Full code(ipynb/py file) along with all the dependencies.
          2. Any associated data required to run the code.

          Regards,
          Nikhila

          • 2. Re: Part2:  Missing log and information on job to DevCloud
            Intel Corporation
            This message was posted on behalf of Intel Corporation

            The following first-level response was given through mail:

             

            After having a quick look, the following are our observations:

            1. The checkpoints have not been saving

              It looks like the checkpoints are already getting saved. Given below is a screenshot of the directory listing of /home/uxxxx/IDC/IDC-Colfax-Trainer/model/_logs.

              (screenshot: listing of the _logs directory)

            2. It takes too long to train

              We added the following lines in Trainer2.py after the tensorflow import:

            config = tf.ConfigProto(intra_op_parallelism_threads=12,
                                    inter_op_parallelism_threads=2,
                                    allow_soft_placement=True,
                                    device_count={'CPU': 12})

            os.environ["OMP_NUM_THREADS"] = "12"
            os.environ["KMP_BLOCKTIME"] = "30"
            os.environ["KMP_SETTINGS"] = "1"
            os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"

             

            and changed the line

             

            with tf.Session() as sess:               

            to          

            with tf.Session(config=config) as sess:

             

            This brings the time per step down to 2.5-3.2 sec/step from 4.5-5.5 sec/step.
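
            For clarity, a consolidated sketch of the change is given below (the environment variables are set before the session is created; the rest of Trainer2.py is unchanged):

            # Consolidated sketch of the Trainer2.py change described above (TF 1.x)
            import os
            import tensorflow as tf

            # Export the OpenMP/KMP threading settings before the session starts computing
            os.environ["OMP_NUM_THREADS"] = "12"
            os.environ["KMP_BLOCKTIME"] = "30"
            os.environ["KMP_SETTINGS"] = "1"
            os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"

            # Thread counts as used in the observation above
            config = tf.ConfigProto(intra_op_parallelism_threads=12,
                                    inter_op_parallelism_threads=2,
                                    allow_soft_placement=True,
                                    device_count={'CPU': 12})

            with tf.Session(config=config) as sess:
                # ... existing Trainer2.py training loop ...
                pass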

             

            We are also checking if there are further optimizations possible. Will let you know soon.

            • 3. Re: Part2:  Missing log and information on job to DevCloud
              AdamMiltonBarker

              Hi, thanks. Sorry, I had mentioned in the email that the checkpoints were not being saved; I thought this had already been covered. If you open those checkpoints there is no checkpoint, just an error that does not occur when training on my local GPU. The checkpoints are saved correctly on GPU but unfortunately not on DevCloud.

               

              Regarding the other issues, thanks for the assistance. For the article I am writing I am going to attempt reducing the dataset, and I will incorporate your changes. Please let me know if you are successful in training the existing dataset in less than 36 hours. The other issue was the training simply ending after close to 24 hours with no error logs or output, with the exception of the errors in the checkpoints. I am unable to spend that much time working on the large dataset, and I will be attempting the reduced dataset at the weekend, so hopefully reducing the dataset will work without reducing the accuracy.

               

              Thanks for the help guys.

              • 4. Re: Part2:  Missing log and information on job to DevCloud
                Intel Corporation
                This message was posted on behalf of Intel Corporation

                Hi Adam,

                Do you get an error like the one below when you run?

                /var/spool/torque/mom_priv/epilogue.parallel: /usr/local/bin/kill-illegit-procs: No such file or directory

                We got this error for both sorter & train.
                Kindly confirm.


                Regards,
                Anju

                • 5. Re: Part2:  Missing log and information on job to DevCloud
                  Intel Corporation
                  This message was posted on behalf of Intel Corporation

                  Hi Adam,

                  Kindly reduce the dataset as suggested or reduce the number of epochs (num_epochs in confs.json).
                  One suggestion would be to set num_epochs to 1 or 2 and check if there are issues with the saved checkpoint.
                  Kindly clear the _logs directory and run both datasort and train.
                  We did this ourselves and could not find any problem with the saved checkpoint, hence we are requesting you to check.
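
                  For reference, a quick check like the sketch below can confirm the epoch count being picked up from confs.json before re-running datasort and train (the exact key path is an assumption based on the ClassifierSettings block used in the trainer, so please adjust to your layout):

                  # Sketch only: confirm num_epochs before re-running datasort and train.
                  # The "ClassifierSettings" key is assumed from the trainer code quoted
                  # earlier in this thread; adjust if your confs.json is laid out differently.
                  import json

                  with open("confs.json") as f:
                      confs = json.load(f)

                  print("num_epochs:", confs["ClassifierSettings"].get("num_epochs"))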

                  Regards,
                  Anju

                  • 6. Re: Part2:  Missing log and information on job to DevCloud
                    AdamMiltonBarker

                    Hi, no, I never got any error logs. As mentioned, this was the issue: after nearly 24 hours of training something happened, and there were no error or output logs. The checkpoints, when opened, contained an error warning about utf8 and saying checkpoints were disabled and to view the console for logs, but there were no logs. This did not happen when training on my local server, but it happened twice on AI DevCloud.

                     

                    Right now I am going to test with a reduced dataset, which I feel will make the model less accurate. Since these issues I have successfully completed training as normal, with no modifications to my existing training script, on my local server, and maintained the same accuracy. I will follow the suggestions, compare the results and let you know. Where is the reduced dataset you used? How many examples did you train on? Thanks for the help with DevCloud so far.

                    • 7. Re: Part2:  Missing log and information on job to DevCloud
                      AdamMiltonBarker

                      I do not see the changes you mentioned in Trainer2.py. 

                       

                      Please could you let me know the following:

                       

                      - How many images did you train on?

                      - How many epochs did you train with?

                      - What was the final streaming accuracy, etc., after your modifications?

                      • 8. Re: Part2:  Missing log and information on job to DevCloud
                        AdamMiltonBarker

                        After running the data sorter I get the following errors. Again, the second error only appears when using DevCloud, not in my local training. The first warning is normal with the version of TF I am using, as h5py needs updating and they have not done so yet.

                         

                        Python 3.6.3 :: Intel Corporation

                        /glob/intel-python/python3/lib/python3.6/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
                          from ._conv import register_converters as _register_converters

                        /var/spool/torque/mom_priv/epilogue.parallel: line 6: /usr/local/bin/kill-illegit-procs: No such file or directory

                         

                        I have reduced the dataset to 1000 examples of each class and am going to try 2 epochs, then 20 for the final training. I will add the results here.

                        • 9. Re: Part2:  Missing log and information on job to DevCloud
                          AdamMiltonBarker

                          I am still getting the same errors in the checkpoints; every file in the directory contains this error, with the exception of the checkpoint file.

                           

                          (screenshot: errors.PNG)

                          Compared to the files saved on my local machine, which is training now, they are not saving correctly on DevCloud; on my local machine they save correctly.

                           

                          This error only appears when accessing the files in Notebook; if I nano the files they are OK, so it seems this is not stopping successful training but is a bug with Notebook.
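
                          A quick way to double-check that the checkpoints themselves are readable, independent of how the Notebook viewer renders them, is something like the sketch below (the _logs path is the one from my project; adjust as needed):

                          # Sketch only: verify the latest checkpoint in _logs is readable,
                          # regardless of how the Notebook file viewer displays the files.
                          import tensorflow as tf

                          ckpt = tf.train.latest_checkpoint("model/_logs")
                          print("Latest checkpoint:", ckpt)

                          reader = tf.train.NewCheckpointReader(ckpt)
                          # Print a few saved variables and their shapes as a sanity check
                          for name, shape in sorted(reader.get_variable_to_shape_map().items())[:5]:
                              print(name, shape)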

                          • 10. Re: Part2:  Missing log and information on job to DevCloud
                            AdamMiltonBarker

                            OK, 1 epoch with 1000 images per class appears to have worked. I am going to try training with 20 epochs.

                            • 11. Re: Part2:  Missing log and information on job to DevCloud
                              AdamMiltonBarker

                              There is no output when using qpeek so I am unable to monitor the status:

                               

                              [u13339@c009 ~]$ qpeek 71701

                               

                              ########################################################################

                              # Colfax Cluster - https://colfaxresearch.com/

                              #      Date:           Sat Apr 21 14:55:55 PDT 2018

                              #    Job ID:           71701.c009

                              #      User:           u13339

                              # Resources:           neednodes=1:ppn=2,nodes=1:ppn=2,walltime=24:00:00

                              ########################################################################

                               

                              * Hello world from compute server c009-n039 on the A.I. DevCloud!

                              * The current directory is /home/u13339/IDC-Colfax-Trainer.

                              * Compute server's CPU model and number of logical CPUs:

                              CPU(s):                24

                              Model name:            Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz

                              * Python available to us:

                              /glob/intel-python/python3/bin/python

                              * This job trains the IDC Classifier on the Colfax Cluster

                               

                              [u13339@c009 ~]$

                              • 12. Re: Part2:  Missing log and information on job to DevCloud
                                Intel Corporation
                                This message was posted on behalf of Intel Corporation

                                Hi Adam,

                                 

                                Sorry for the confusion. We belong to the technical support team and do not have read/write access to any of the user folders. Since you had given permission to check your data/code, we raised a request to the DevCloud Admin team to get a local copy. Hence those changes will not be available in your folder. However, you could make the changes as suggested and check.

                                 

                                Given below are the answers to your questions:

                                 

                                1. Used the whole dataset
                                2. Tried with 1 epoch.
                                3. Final Accuracy (in the logs) : 81.54%(0.8154), Final Loss (in the logs) : 0.9723

                                 

                                We are not able to get the final streaming accuracy, since this needs the evaluation to be run and we do not have the complete setup for that (the JumpWayMQTT server needs to be started?).

                                 

                                Regards,
                                Anju

                                • 13. Re: Part2:  Missing log and information on job to DevCloud
                                  Intel Corporation
                                  This message was posted on behalf of Intel Corporation

                                  Hi Adam,

                                   

                                  Kindly use qpeek -e <JOB_ID> to view the real-time logs of a running job.

                                  The job progress logs, in this case, are being saved to the error file and can be viewed with the above-mentioned command.

                                   

                                  Regards,
                                  Anju

                                  • 14. Re: Part2:  Missing log and information on job to DevCloud
                                    Intel Corporation
                                    This message was posted on behalf of Intel Corporation

                                    Hi Adam,

                                    We will definitely check what creates these warnings/errors.
                                    However, they do not seem to influence the final results.
                                    Could you please confirm?

                                    Regards,
                                    Anju
