17 Replies. Latest reply on Feb 22, 2018 9:35 PM by Intel Corporation

    Get this error in the error file generated by the cluster

    sumedh.pendurkar

      I use keras with tensorflow as backend.

       

      =>> PBS: job killed: walltime 21637 exceeded limit 21600

      cat: /var/spool/torque/mom_priv/jobs/0-1/40746.c009.JB: No such file or directory

       

      I understand the first line: 6 hours was not enough, so the job was killed.

      But the output file did not contain the model.summary() printout. It was blank.

      I ran this two or three times on the cluster. Sometimes model.summary() (which is called before fit_generator) is written to the output file, and sometimes it isn't.

      I am not calling "cat" explicitly through any system call in my Python or shell script.

      What is the second error about?

       

      This is the shell script:

      #PBS -N test

      python /home/u7529/person-reidentfication/Implementation-CVPR2015-CNN-for-ReID/CUHK03/main.py  --dataset /home/u7529/person-reidentfication/Implementation-CVPR2015-CNN-for-ReID/CUHK03/cuhk-03.h5
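One possible explanation for the intermittently missing model.summary() output (an assumption; the thread does not confirm it): when stdout is redirected to a file, Python buffers its writes, and text still sitting in the buffer can be lost when the scheduler kills the job. A minimal sketch of a logging helper that flushes after every line is shown below. The name `log` is hypothetical and is not part of the original scripts.

```python
import sys


def log(message, stream=None):
    """Write one line and flush at once, so it reaches the file even if the job is killed later."""
    stream = stream if stream is not None else sys.stdout
    stream.write(str(message) + "\n")
    stream.flush()  # push the text out of Python's buffer immediately


log("Model Compile Successful.")
```

Running the script as `python -u main.py ...` turns off stdout buffering altogether. Keras releases whose `model.summary()` accepts a `print_fn` argument could also be called as `model.summary(print_fn=log)`.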

        • 1. Re: Get this error in the error file generated by the cluster
          Intel Corporation
          This message was posted on behalf of Intel Corporation

          Thanks for reaching out to us!
          To resolve the issue, we need to reproduce and analyze it on our end.
          Could you please provide the scripts that you are running?

          Regards,
          Krishnaprasad T


           

          • 2. Re: Get this error in the error file generated by the cluster
            sumedh.pendurkar

            Please find attached files.

             

            model.py contains the model and the loss functions.

            data_preparation.py contains the image generator.

             

            create_dataset.py extracts images from the CUHK03 dataset download and converts them into an h5 file.

             

            This is the git link to the code (in the cuhk folder): https://gitlab.com/sumedh.pendurkar/Person-Reidentification.git

             

             

             

             

            Thank you.

            • 3. Re: Get this error in the error file generated by the cluster
              Intel Corporation

              Hi Sumedh,

              I have started running your code in AI Dev Cloud. It is still running.
              I will get back to you soon with any updates.
              In the meantime, could you please try the steps below? These are the steps I followed to set up the environment in AI Dev Cloud:

              Step 1: Create a new Intel environment
              conda create -n intelpython2_full python=2

              Step 2: Install all required Python packages
              e.g., TensorFlow 1.3
              conda install -c intel tensorflow==1.3

              Similarly, install the Intel distributions of Keras and Theano.

              Step 3: Set the parameters below in the shell script that runs your code
              a. Set the PBS walltime to 24 hours in your PBS shell script. The following directive overrides the default walltime of 6 hours.

              #PBS -l walltime=24:00:00

              b. Activate the virtual environment
              source activate intelpython2_full

              c. Run your script: python ....main.py (with the path to your code).
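The numbers in the original error can be checked with a little arithmetic: 21600 seconds is exactly the 6-hour default, and the job had run for 21637 seconds when it was killed. The sketch below formats a number of seconds as the HH:MM:SS string used by the walltime directive. `format_walltime` is a hypothetical helper for illustration only.

```python
def format_walltime(seconds):
    """Format a duration in seconds as HH:MM:SS for a '#PBS -l walltime=' line."""
    if seconds < 0:
        raise ValueError("walltime must be non-negative")
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return "%02d:%02d:%02d" % (hours, minutes, secs)


default_limit = 6 * 3600      # 21600 s, the limit quoted in the error
used = 21637                  # walltime reported when the job was killed
print(format_walltime(default_limit))   # 06:00:00
print(used - default_limit)             # 37 (seconds over the limit)
print(format_walltime(24 * 3600))       # 24:00:00
```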



              Regards,
              Krishnaprasad T







               

              • 4. Re: Get this error in the error file generated by the cluster
                sumedh.pendurkar

                Hello,

                 

                While creating the conda environment, I got this error:

                Solving environment: failed

                libgcc_s.so.1 must be installed for pthread_cancel to work

                Aborted

                 

                I referred to this issue: "libgcc_s.so.1 must be installed for pthread_cancel to work".

                I am still unable to figure it out.

                As suggested in that issue, here are my environment settings.

                 

                find  /usr/lib64/ -name  "libgcc_s.so.1"

                output: /usr/lib64/libgcc_s.so.1

                 

                 

                conda config --show-sources

                ==> /glob/intel-python/versions/2018u1/intelpython3/.condarc <==

                channels:

                  - intel

                  - defaults

                 

                ==> /home/u7529/.condarc <==

                channels:

                  - intel

                  - defaults

                 

                echo $LIBRARY_PATH | grep "usr/lib64/"

                /usr/lib64/:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/ipp/lib/intel64:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/compiler/lib/intel64_lin:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/mkl/lib/intel64_lin:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/tbb/lib/intel64/gcc4.7:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/tbb/lib/intel64/gcc4.7:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/daal/lib/intel64_lin:/usr/lib64/:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/ipp/lib/intel64:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/compiler/lib/intel64_lin:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/mkl/lib/intel64_lin:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/tbb/lib/intel64/gcc4.7:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/tbb/lib/intel64/gcc4.7:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/daal/lib/intel64_lin:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/daal/../tbb/lib/intel64_lin/gcc4.4

                 

                echo $LD_LIBRARY_PATH | grep "/usr/lib64"

                /glob/development-tools/mklml/lib/:/usr/lib64/:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/compiler/lib/intel64:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/compiler/lib/intel64_lin:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/mpi/intel64/lib:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/mpi/mic/lib:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/ipp/lib/intel64:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/compiler/lib/intel64_lin:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/mkl/lib/intel64_lin:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/tbb/lib/intel64/gcc4.7:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/tbb/lib/intel64/gcc4.7:/glob/development-tools/versions/intel-parallel-studio-2018/debugger_2018/iga/lib:/glob/development-tools/versions/intel-parallel-studio-2018/debugger_2018/libipt/intel64/lib:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/daal/lib/intel64_lin:/glob/development-tools/mklml/lib/:/usr/lib64/:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/compiler/lib/intel64:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/compiler/lib/intel64_lin:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/mpi/intel64/lib:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/mpi/mic/lib:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/ipp/lib/intel64:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/compiler/lib/intel64_lin:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/mkl/lib/intel64_lin:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/tbb/lib/intel64/gcc4.7:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/tbb/lib/intel64/gcc4.7:/glob/development-tools/versions/intel-parallel-studio-2018/debugger_2018/iga/lib:/glob/development-tools/versions/intel-parallel-studio-2018/debugger_2018/libipt/intel64/lib:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/daal/lib/intel64_lin:/glob/development-tools/versions/intel-parallel-studio-2018/compilers_and_libraries_2018.0.128/linux/daal/../tbb/lib/intel64_lin/gcc4.4
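The path variables above repeat many directories, which makes them hard to read by eye. As a rough sketch (the helper names `unique_dirs` and `dirs_containing` are hypothetical), the same check the `find` and `grep` commands perform can be done in Python: split the variable, drop repeats, and list the directories that really contain the library file.

```python
import os


def unique_dirs(path_value):
    """Split a path variable (colon-separated on Linux) and drop empty or repeated entries, keeping order."""
    seen = set()
    result = []
    for entry in path_value.split(os.pathsep):
        normalized = entry.rstrip("/") or "/"   # treat /usr/lib64 and /usr/lib64/ as the same directory
        if entry and normalized not in seen:
            seen.add(normalized)
            result.append(normalized)
    return result


def dirs_containing(filename, path_value):
    """Return the directories on the path that actually contain `filename`."""
    return [d for d in unique_dirs(path_value)
            if os.path.isfile(os.path.join(d, filename))]


if __name__ == "__main__":
    ld_path = os.environ.get("LD_LIBRARY_PATH", "")
    print("%d distinct directories" % len(unique_dirs(ld_path)))
    print(dirs_containing("libgcc_s.so.1", ld_path))
```

This only reports where the file is visible on the search path; it does not by itself explain the conda failure.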

                • 5. Re: Get this error in the error file generated by the cluster
                  Intel Corporation

                  Hi Sumedh,

                  Please confirm whether you tried creating the conda environment on a compute node.

                  If not, please try the following:

                  1. The command below switches you from the login node to a compute node.
                  qsub -I

                  2. Add the Intel channel
                  conda config --add channels intel

                  3. Create the conda environment and run your scripts.

                  NB: Activate your conda environment before executing Python scripts.


                  Regards,
                  Krishnaprasad T









                   

                  • 6. Re: Get this error in the error file generated by the cluster
                    sumedh.pendurkar

                    Hello,

                     

                    Sorry, I forgot to switch to the compute node.

                    I created a new env and installed the packages.

                     

                    I ran:

                    #PBS -N test -l walltime=24:00:00
                    source activate intelpython2_full
                    python /home/u7529/person-reidentfication/Implementation-CVPR2015-CNN-for-ReID/CUHK03/main.py  --dataset /home/u7529/person-reidentfication/Implementation-CVPR2015-CNN-for-ReID/CUHK03/cuhk-03.h5

                    The model.summary() output was printed, but the run then stopped with the following exception:

                     

                    Using TensorFlow backend.

                    Traceback (most recent call last):

                      File "/home/u7529/person-reidentfication/Implementation-CVPR2015-CNN-for-ReID/CUHK03/main.py", line 99, in <module>

                        main(args.dataset_path)

                      File "/home/u7529/person-reidentfication/Implementation-CVPR2015-CNN-for-ReID/CUHK03/main.py", line 20, in main

                        train(model, dataset_path)

                      File "/home/u7529/person-reidentfication/Implementation-CVPR2015-CNN-for-ReID/CUHK03/main.py", line 42, in train

                        model.fit_generator(Data_Generator.flow(f,flag = flag_train),one_epoch/batch_size,epoch_num)

                      File "/home/u7529/.local/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 91, in wrapper

                        return func(*args, **kwargs)

                      File "/home/u7529/.local/lib/python2.7/site-packages/keras/engine/training.py", line 2145, in fit_generator

                        generator_output = next(output_generator)

                      File "/home/u7529/.local/lib/python2.7/site-packages/keras/utils/data_utils.py", line 561, in get

                        six.raise_from(StopIteration(e), e)

                      File "/home/u7529/.conda/envs/intelpython2_full/lib/python2.7/site-packages/six.py", line 718, in raise_from

                        raise value

                    StopIteration

                    cat: /var/spool/torque/mom_priv/jobs/0-1/43408.c009.JB: No such file or directory
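The last Keras frame in this traceback, `six.raise_from(StopIteration(e), e)`, re-raises whatever went wrong while fetching data as a StopIteration, so one cause worth ruling out is a generator that runs out of data or fails partway. fit_generator expects a generator that keeps yielding batches for as long as training runs. The sketch below shows that pattern in plain Python. It is not the thread's Data_Generator.flow, whose implementation is not shown here; `endless_batches` and its arguments are hypothetical.

```python
import random


def endless_batches(samples, batch_size, seed=None):
    """Yield lists of `batch_size` samples forever, reshuffling before every pass."""
    if batch_size <= 0 or batch_size > len(samples):
        # Raised on the first next() call; without this guard the loop below would spin without yielding.
        raise ValueError("batch_size must be between 1 and len(samples)")
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    while True:                      # never let the generator finish
        rng.shuffle(indices)
        # Only full batches are produced; a short tail is dropped on each pass.
        for start in range(0, len(indices) - batch_size + 1, batch_size):
            yield [samples[i] for i in indices[start:start + batch_size]]


gen = endless_batches(list(range(10)), batch_size=4, seed=0)
for _ in range(5):                   # more batches than a single pass contains
    print(next(gen))
```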

                    • 7. Re: Get this error in the error file generated by the cluster
                      Intel Corporation

                      Hi Sumedh,

                       

                      Please double-check that the data path given in the PBS file is correct.

                       

                      Regards,
                      Krishnaprasad T
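A quick way to double-check the data path is to fail early, before the model is built, if the file is missing. The following is a minimal sketch; `require_file` is a hypothetical helper and is not part of the original main.py.

```python
import os


def require_file(path):
    """Return the absolute path if `path` is an existing file; otherwise raise IOError."""
    full = os.path.abspath(os.path.expanduser(path))
    if not os.path.isfile(full):
        raise IOError("dataset file not found: %s" % full)
    return full
```

Calling `require_file` on the value passed via `--dataset` at the top of the script would turn a wrong path into an immediate, readable error instead of a failure deep inside training.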

                      • 8. Re: Get this error in the error file generated by the cluster
                        Intel Corporation

                        Hi Sumedh,

                        I hope the information shared was helpful. If you don't have any more questions, may I close this thread?

                        Regards,
                        Krishnaprasad T
                         

                        • 9. Re: Get this error in the error file generated by the cluster
                          sumedh.pendurkar

                          Hello,

                           

                          I have checked the path and commented out the print line.

                          I have started the run again.

                           

                           

                           

                          Sorry for the delayed response.

                          I'll reply once it completes.

                           

                           

                          Thanks,

                          • 10. Re: Get this error in the error file generated by the cluster
                            Intel Corporation

                            Ok Sumedh.
                             

                            • 11. Re: Get this error in the error file generated by the cluster
                              sumedh.pendurkar

                              Hello,

                               

                              The process terminated with the same error.

                              Error file:

                              Using TensorFlow backend.

                              Traceback (most recent call last):

                                File "/home/u7529/person-reidentfication/Implementation-CVPR2015-CNN-for-ReID/CUHK03/main.py", line 100, in <module>

                                  main(args.dataset_path)

                                File "/home/u7529/person-reidentfication/Implementation-CVPR2015-CNN-for-ReID/CUHK03/main.py", line 20, in main

                                  train(model, dataset_path)

                                File "/home/u7529/person-reidentfication/Implementation-CVPR2015-CNN-for-ReID/CUHK03/main.py", line 43, in train

                                  model.fit_generator(Data_Generator.flow(f,flag = flag_train),one_epoch/batch_size,epoch_num)

                                File "/home/u7529/.local/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 91, in wrapper

                                  return func(*args, **kwargs)

                                File "/home/u7529/.local/lib/python2.7/site-packages/keras/engine/training.py", line 2145, in fit_generator

                                  generator_output = next(output_generator)

                                File "/home/u7529/.local/lib/python2.7/site-packages/keras/utils/data_utils.py", line 561, in get

                                  six.raise_from(StopIteration(e), e)

                                File "/home/u7529/.conda/envs/intelpython2_full/lib/python2.7/site-packages/six.py", line 718, in raise_from

                                  raise value

                              StopIteration

                               

                               

                              Output file:

                              __________________________________________________________________________________________________

                              Layer (type)                    Output Shape         Param #     Connected to                    

                              ==================================================================================================

                              input_1 (InputLayer)            (None, 160, 60, 3)   0                                           

                              __________________________________________________________________________________________________

                              conv2d_1 (Conv2D)               (None, 156, 56, 20)  1520        input_1[0][0]                   

                                                                                               input_2[0][0]                   

                              __________________________________________________________________________________________________

                              max_pooling2d_1 (MaxPooling2D)  multiple             0           conv2d_1[0][0]                  

                                                                                               conv2d_1[1][0]                  

                                                                                               conv2d_2[0][0]                  

                                                                                               conv2d_2[1][0]                  

                                                                                               conv2d_5[0][0]                  

                                                                                               conv2d_6[0][0]                  

                              __________________________________________________________________________________________________

                              input_2 (InputLayer)            (None, 160, 60, 3)   0                                           

                              __________________________________________________________________________________________________

                              conv2d_2 (Conv2D)               (None, 74, 24, 25)   12525       max_pooling2d_1[0][0]           

                                                                                               max_pooling2d_1[1][0]           

                              __________________________________________________________________________________________________

                              lambda_1 (Lambda)               (None, 185, 60, 25)  0           max_pooling2d_1[2][0]           

                                                                                               max_pooling2d_1[3][0]           

                              __________________________________________________________________________________________________

                              up_sampling2d_1 (UpSampling2D)  (None, 185, 60, 25)  0           max_pooling2d_1[2][0]           

                                                                                               max_pooling2d_1[3][0]           

                              __________________________________________________________________________________________________

                              lambda_2 (Lambda)               (None, 185, 60, 25)  0           lambda_1[0][0]                  

                                                                                               lambda_1[1][0]                  

                              __________________________________________________________________________________________________

                              add_1 (Add)                     (None, 185, 60, 25)  0           up_sampling2d_1[0][0]           

                                                                                               lambda_2[1][0]                  

                              __________________________________________________________________________________________________

                              add_2 (Add)                     (None, 185, 60, 25)  0           up_sampling2d_1[1][0]           

                                                                                               lambda_2[0][0]                  

                              __________________________________________________________________________________________________

                              conv2d_3 (Conv2D)               (None, 37, 12, 25)   15650       add_1[0][0]                     

                              __________________________________________________________________________________________________

                              conv2d_4 (Conv2D)               (None, 37, 12, 25)   15650       add_2[0][0]                     

                              __________________________________________________________________________________________________

                              conv2d_5 (Conv2D)               (None, 35, 10, 25)   5650        conv2d_3[0][0]                  

                              __________________________________________________________________________________________________

                              conv2d_6 (Conv2D)               (None, 35, 10, 25)   5650        conv2d_4[0][0]                  

                              __________________________________________________________________________________________________

                              concatenate_1 (Concatenate)     (None, 17, 5, 50)    0           max_pooling2d_1[4][0]           

                                                                                               max_pooling2d_1[5][0]           

                              __________________________________________________________________________________________________

                              flatten_1 (Flatten)             (None, 4250)         0           concatenate_1[0][0]             

                              __________________________________________________________________________________________________

                              dense_1 (Dense)                 (None, 500)          2125500     flatten_1[0][0]                 

                              __________________________________________________________________________________________________

                              dense_2 (Dense)                 (None, 2)            1002        dense_1[0][0]                   

                              ==================================================================================================

                              Total params: 2,183,147

                              Trainable params: 2,183,147

                              Non-trainable params: 0

                              __________________________________________________________________________________________________

                              Model Compile Successful.

                              ('number', 0, 'in', 100)

                              Epoch 1/1
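The output stops right after "Epoch 1/1", which suggests the failure happens while the first batches are being fetched. Because Keras reports the problem only as StopIteration, it can help to pull a few batches from the generator directly, outside fit_generator, to see the underlying error. The sketch below is one way to do that; `probe_generator` is a hypothetical helper.

```python
import traceback


def probe_generator(generator, steps):
    """Pull up to `steps` items; return (items_pulled, error_text), with error_text None on success."""
    pulled = 0
    try:
        for _ in range(steps):
            next(generator)
            pulled += 1
    except StopIteration:
        return pulled, "generator exhausted after %d item(s)" % pulled
    except Exception:
        # Any other exception is the real cause that would otherwise be hidden.
        return pulled, traceback.format_exc()
    return pulled, None
```

For the script in this thread, the call would look something like `probe_generator(Data_Generator.flow(f, flag=flag_train), 5)`, assuming those names behave as they do in the traceback.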

                              • 12. Re: Get this error in the error file generated by the cluster
                                Intel Corporation

                                Hi Sumedh,

                                I have started running your code now.
                                I will get back to you soon with any updates.

                                Regards,
                                Krishnaprasad T

                                • 13. Re: Get this error in the error file generated by the cluster
                                  Intel Corporation

                                  Hi Sumedh,

                                  The issue is caused by the Keras version installed in DevCloud.
                                  Try reverting to Keras 2.0.0. That will fix the StopIteration error.
                                  Please try it and confirm.

                                  Regards,
                                  Krishnaprasad T

                                  • 14. Re: Get this error in the error file generated by the cluster
                                    sumedh.pendurkar

                                    Hello,

                                     

                                    I have started the training. I installed Keras version 2.0.2, since 2.0.0 is not available on the cluster (I checked with conda search -c intel keras).
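In both tracebacks earlier in the thread, the keras frames are loaded from /home/u7529/.local/lib/python2.7/site-packages while six comes from the conda environment. If that is still the case, a Keras installed with conda may not be the copy that actually gets imported. This is an inference from the paths, not something the thread confirms. A small sketch for checking which copy and version Python picks up follows; `module_origin` is a hypothetical helper.

```python
import importlib


def module_origin(name):
    """Import `name` and return (version, file path) of the copy Python actually loads."""
    module = importlib.import_module(name)
    version = getattr(module, "__version__", "unknown")
    location = getattr(module, "__file__", "built-in")
    return version, location
```

Calling `module_origin("keras")` inside the activated environment shows whether the newly installed 2.0.2 or an older copy under ~/.local is in use.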
