I am trying to run a multiprocessor job on the SCC (kernel 3.1.4, gcc-4.6, RCKMPI with sccmulti).
The job runs well on our own Linux clusters, but on the SCC I get this:
3: terminate called after throwing an instance of 'std::system_error'
2: CommandID: 0
3: what(): Resource temporarily unavailable
rank 3 in job 3 rck00_58139 caused collective abort of all ranks
exit status of rank 3: killed by signal 6
My program never throws std::system_error itself, yet the trace clearly says "what(): Resource temporarily unavailable".
I am at a loss as to what this could mean.
I am running the job on cores 00-07.
Any ideas would be helpful.
To answer my own question: this seems to be a problem with RCKMPI. Even when no new message has arrived, MPI_Recv(...) behaves as if one had been received, and so the system goes into a continuous loop.
I am attempting to rebuild MPI to see whether that solves the problem.
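For anyone who hits the same abort: "Resource temporarily unavailable" is the standard text for the EAGAIN error code, and a std::system_error carrying it can come from the C++ runtime or a library rather than from your own code. Below is a minimal sketch, assuming C++11, of how the exception could be caught to inspect its error code instead of letting it terminate the rank. The helpers simulate_eagain, describe_system_error and is_eagain are hypothetical names made up for illustration; simulate_eagain only stands in for whatever call actually fails, which I have not identified.

```cpp
#include <cerrno>
#include <string>
#include <system_error>

// Hypothetical stand-in for the call that really fails on the SCC.
// It throws the same kind of exception seen in the trace (EAGAIN).
void simulate_eagain() {
    throw std::system_error(EAGAIN, std::generic_category());
}

// Hypothetical helper: formats the numeric error code together with what().
std::string describe_system_error(const std::system_error& e) {
    return "code " + std::to_string(e.code().value()) + ": " + e.what();
}

// Hypothetical helper: true if the exception reports EAGAIN.
bool is_eagain(const std::system_error& e) {
    return e.code() == std::errc::resource_unavailable_try_again;
}

// Runs the failing call and returns a description of the error,
// or an empty string if nothing was thrown.
std::string run_and_report() {
    try {
        simulate_eagain();
    } catch (const std::system_error& e) {
        return describe_system_error(e);
    }
    return std::string();
}
```

Wrapping the suspect region of the real program in a try/catch like the one in run_and_report, and printing the result to stderr together with the rank, should at least show which rank and which call produce the EAGAIN before the collective abort.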