Thanks for reaching out to us.
Could you please share a screenshot of the error you observed during submission? Meanwhile, we will investigate from our end.
I do not get any error; the job is simply submitted with the wrong settings. I submit:
#PBS -N $fn
#PBS -e $addres/reports/errors-$fn.err
#PBS -o $addres/reports/output-$fn.out
#PBS -l nodes=1:ppn=2
#PBS -q batch
#PBS -l mem=10gb
#PBS -l vmem=$memo
#PBS -l walltime=24:00:00
However, the submitted job runs for 6:00:00 with 1 node and 2 cores and without any memory limit, which are the default settings for the batch queue.
Additionally, I can currently run at most 15 jobs at the same time. I believe the limit used to be 30. Has the policy changed?
We would like to discuss the problems raised, one by one.
1. #PBS -l nodes=1:ppn=1 does not work and also triggers an error from that line of my submission file.
Reply: This does not look like an error to us. It is more of an informational message stating that "Queue manager has overridden the nodes request from #PBS -l nodes=1:ppn=1 to #PBS -l nodes=1:ppn=2". It does not affect the running of the job script. However, we will check and let you know why a value of 1 for ppn is not accepted.
2. I could run at most 15 jobs at a time. I believe it was 30. Has the policy changed?
Reply: How did you run 15 jobs? Did you submit the jobs one after the other? Did it throw an error when the number of submitted jobs exceeded 15? We could not reproduce this situation, hence the questions.
3. Even after giving different PBS settings for wall time and memory, the submitted job ran with the default settings.
Reply: Do you have commands before the #PBS commands in the job script?
Did you export the variables used in the #PBS lines ($fn, $addres, $memo), or did you define them inside the job script?
Note: PBS commands should always be given at the very beginning of the job script file. Otherwise they will be ignored.
Kindly reply with the output of “qstat -xf <JOB_ID>” after submitting your job.
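The placement rule in the note above can be checked before submitting. The following is only a minimal sketch: the function name check_pbs_placement is hypothetical, and it assumes that blank lines and ordinary comment lines do not end the block of #PBS lines, while any other line counts as the first command.

```shell
# Hypothetical helper: list #PBS lines that come after the first command
# in a job script. Returns non-zero if any such line is found.
# Assumption: blank lines and comment lines do not end the #PBS block.
check_pbs_placement() {
    awk '
        /^[[:space:]]*$/ { next }            # skip blank lines
        /^#PBS/ {
            if (seen_cmd) {                  # a command was already seen
                printf "line %d ignored: %s\n", NR, $0
                bad = 1
            }
            next
        }
        /^[[:space:]]*#/ { next }            # shebang and ordinary comments
        { seen_cmd = 1 }                     # anything else is a command
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

For example, `check_pbs_placement job.pbs` (job.pbs being a placeholder file name) prints nothing and returns 0 when every #PBS line precedes the first command.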
Thanks for your reply. Here are the answers to your questions:
1- The problem is that once #PBS -l nodes=1:ppn=1 is given, the rest of the settings become inactive and are replaced by the defaults; i.e., the submitted job runs for 6:00:00 (instead of 24:00:00) with 1 node and 2 cores (instead of 1 node and 1 core) and without any memory limit (instead of 11gb), which are the default settings for the batch queue.
2- I submitted around 90 jobs and the scheduler started to run 15 of them. When I called qstat -q, there were situations in which the total number of jobs in the queue was exactly the number of my jobs with status Q (so no one else had jobs in the queue), while the total number of running jobs was around 30. I know that the other running jobs might be so resource-demanding that my jobs cannot run, but it is very improbable for that to last a whole week. So, I guess that the limit of total running jobs per person is 15.
3- I do not have any commands before the #PBS lines. My file is what I posted. I have another script that creates the different settings of the game that I want to run, and that script completes the submission script that I posted here. Indeed, $fn and $memo come from that script. Once this file is completed, I call qsub to submit it to the queue.
Below is what you asked for, but note that I changed #PBS -l nodes=1:ppn=1 to #PBS -l nodes=1:ppn=2 and submitted the jobs to the queue to get the results as soon as possible.
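The generator script itself is not shown in this thread, so the following is only a sketch of how such a script might fill in the submission file. The function name make_job_script and the example argument values are hypothetical. It relies on the assumption that qsub reads #PBS lines as plain text and does not expand shell variables in them, so the values have to be substituted before submission.

```shell
# Hypothetical generator: print a job script with $fn, $memo and $addres
# already replaced by their values. The body runs in a subshell (note the
# parentheses) so the variables do not leak into the calling shell.
make_job_script() (
    fn=$1       # job name
    memo=$2     # vmem limit, e.g. 12gb
    addres=$3   # base directory that contains reports/
    # Unquoted heredoc delimiter: the shell expands the variables here,
    # so the generated file contains literal values only.
    cat <<EOF
#!/bin/bash
#PBS -N $fn
#PBS -e $addres/reports/errors-$fn.err
#PBS -o $addres/reports/output-$fn.out
#PBS -l nodes=1:ppn=2
#PBS -q batch
#PBS -l mem=10gb
#PBS -l vmem=$memo
#PBS -l walltime=24:00:00
# job commands follow here
EOF
)
```

A possible use, with placeholder values, is `make_job_script game1 12gb /path/to/project > job-game1.pbs` followed by `qsub job-game1.pbs`. Inspecting the generated file before submitting shows whether any unexpanded `$` remains in the #PBS lines.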
From your reply, we assume that “#PBS -l nodes=1:ppn=2” does not encounter any issues with wall time or memory. We can also see this from the job summary you provided.
We will check with the concerned team and get back to you on the following issues:
- The issues with setting “#PBS -l nodes=1:ppn=1”
- Limit of total running jobs per person is 15.
We checked with the concerned team; their response is given below.
- The issues with setting “#PBS -l nodes=1:ppn=1” : This is expected. We implemented it recently. The reason for this is that 1 slot vs 2 slots does not affect resource allocation, it affects only how the resource manager counts the occupancy of this node. The reason why we don't like ppn=1 jobs is that, if you have one job on a node, it will occupy all cores, but if you have two jobs per node, they will compete for cores. This is not ideal because your performance depends on whether you have a neighbor on the node or not. Our updated scheme only allows ppn=2 so that queued jobs never have co-tenants on the node. The only case when ppn=1 is used is the Jupyter queue.
- Limit of total running jobs per person is 15: This is correct: we allow up to 15 running jobs per user, and each job can request up to 5 nodes.