5 Replies Latest reply on May 6, 2018 10:46 PM by Intel Corporation

    Missing job process

    virtualdvid

      I was training my model and it seemed to be going well, but the job suddenly disappeared without creating these files:

       

      my_project_1.oXXXXX

      my_project_1.eXXXXX

       

      When I ran `qstat -f XXXXX`, it returned:

      qstat: Unknown Job Id Error XXXXX.c00X

       

      When I checked my log file, training had only reached epoch 14.

       

      What can I do to avoid this?

       

      Thank you!

        • 1. Re: Missing job process
          Intel Corporation
          This message was posted on behalf of Intel Corporation

          Hi David,

          To help us answer your question, please reply with answers to the following:

          1. How did you check the log file?
          2. How did you run the job?

          Regards,
          Anju

           

          • 2. Re: Missing job process
            virtualdvid

            1. When I ran some basic tests, I got two files: one with the output and another with errors.

            2. I ran the job with `qsub myjob`.

             

            Here are my steps:

             

            1. I followed these instructions to open the terminal: Using Jupyter Notebook* Terminal Console | Intel® Software

            2. There:

            • Created a conda environment.
            • Activated the environment.
            • Installed some libraries.
            • Created the file "myjob" with these lines:

                          #PBS -l nodes=1
                          cd $PBS_O_WORKDIR
                          echo Starting calculation
                          source activate iMaterialist
                          python NASNet.py
                          echo End of calculation

            • Executed `qsub myjob` from the terminal.
            • The training job started with ID xxxxx.c00x.
            • It ran for a few hours.

            3. The job stopped suddenly and I did not get any information about the process.

            4. My model saves a basic log, which shows it was on epoch 14/189, still far from the end.

             

            I now think it has to do with the `walltime`, but I am confused about how to increase it or what the best configuration is. I am only getting a couple of hours :/ I would like to have the maximum of 24h.

            • 3. Re: Missing job process
              Intel Corporation

              Hi David,

              Answers to each of the problems you mentioned are given below:

              1. When I executed the command qstat -f XXXXX it gave me: qstat: Unknown Job Id Error XXXXX.c00X
              Reply: `qstat -f XXXXX` works only while the job is queued or running. Once the job has ended and left the queue, the scheduler no longer recognizes its ID.
               
              2. I would like to have the max 24 hours
              Reply: Please use a PuTTY/Linux SSH terminal instead of the Jupyter Notebook terminal, and add the following #PBS setting to the job file: `#PBS -l walltime=24:00:00`

              3. Not able to see output & error files
              Reply: Please let us know if you face these issues with a PuTTY/Linux SSH terminal.
              Please note that the output and error files appear only after the job has completed. While the job is running, use the `qpeek` command to view the logs.
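
For illustration, the replies above could be combined into something like the following sketch. The file name, environment name, and script name come from the earlier posts. The `qpeek` arguments shown in the comments are an assumption, so please check the cluster documentation for the exact usage. The 24-hour value is the one requested in this thread, not a verified cluster limit.

```shell
# Write the job file from the earlier post, with a walltime request added.
cat > myjob <<'EOF'
#PBS -l nodes=1
#PBS -l walltime=24:00:00
cd $PBS_O_WORKDIR
echo Starting calculation
source activate iMaterialist
python NASNet.py
echo End of calculation
EOF

# Sanity check: count walltime directives written with a plain ASCII hyphen.
# A dash pasted from a web page ("–l") may not be recognized as an option.
grep -c '^#PBS -l walltime=' myjob

# On the cluster (not run here), with XXXXX standing for the job ID:
#   qsub myjob     # submit the job; prints the job ID
#   qstat XXXXX    # status while the job is queued or running
#   qpeek XXXXX    # assumed usage: view the output so far while the job runs
```

If the `grep` count is 0, the directive was most likely typed with a non-ASCII dash and should be retyped by hand.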

              Regards,
              Anju
               

              • 4. Re: Missing job process
                virtualdvid

                Thank you!! I followed the instructions and I can run `qpeek` now! It is working perfectly!

                • 5. Re: Missing job process
                  Intel Corporation

                  Hi David,

                  We are closing this case on your confirmation.
                  Please open a fresh thread for any further assistance.

                  Regards,
                  Anju