To answer your question better, please reply with answers to the following questions.
1. How did you check the log file?
2. How did you run the job?
1. When I ran some basic tests, I got two files: one with the output and another with the errors.
2. I ran the job using the `qsub myjob` command.
Here are my steps:
1. I followed these instructions to open the terminal: Using Jupyter Notebook* Terminal Console | Intel® Software
- Created a conda environment.
- Activated the environment.
- Installed some libraries.
- Created the file "myjob" with these lines:
#PBS -l nodes=1
echo Starting calculation
source activate iMaterialist
echo End of calculation
- Executed `qsub myjob` from the terminal.
- The training job started with number xxxxx.c00x.
- It ran for a few hours.
3. The job stopped suddenly and I did not get any information about the process.
4. My model saves a basic log, and it shows the job was at epoch 14/189, far from the end.
I now think it has to do with the `walltime`, but I am confused about how to increase it, or what the best configuration for it is. I am only getting a couple of hours :/ I would like to have the maximum of 24h.
The answers to each of the problems mentioned are given below:
1. When I executed the command `qstat -f XXXXX` it gave me: qstat: Unknown Job Id Error XXXXX.c00X
Reply: The `qstat -f XXXXX` command works only while the job is running.
2. I would like to have the max 24 hours
Reply: Please use a PuTTY/Linux ssh terminal instead of the Jupyter Notebook terminal, and then add the following #PBS setting to the job file: `#PBS -l walltime=24:00:00` (note that the option is a plain hyphen followed by a lowercase L).
3. Not able to see output & error files
Reply: Please let us know if you face these issues with a PuTTY/Linux ssh terminal.
Please note that you can see the output and error files only after the job has completed. While the job is running, use the `qpeek` command to see the logs.
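
For reference, the walltime setting from reply 2 could be added to the job file roughly as follows. This is only a sketch: it reuses the lines of the "myjob" file posted above and adds the walltime request. The original post does not show the actual training command, so it is left as a placeholder comment, and the `command -v qsub` guard is an addition so that the snippet does nothing harmful on a machine without the scheduler.

```shell
# Write the job file: same lines as the posted "myjob", plus a walltime request.
cat > myjob <<'EOF'
#PBS -l nodes=1
#PBS -l walltime=24:00:00
echo Starting calculation
source activate iMaterialist
# (training command goes here; it is not shown in the original post)
echo End of calculation
EOF

# Submit only if qsub is available on this machine (assumption: login node).
if command -v qsub >/dev/null 2>&1; then
    qsub myjob
else
    echo "qsub not found; job file written but not submitted"
fi
```

`qsub` prints the job ID on submission. Running `qstat` with no arguments lists your jobs while they are running, and the output and error files appear in the submission directory once the job has finished.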