I went on your marc computer and successfully built RCCE and ran pingpong. You can look in my /home and my /shared.
I noticed that your command line shows pssh pointing to your /home.
Only the /shared directory is mounted on the cores. You must copy your pingpong executable to /shared/<yourusername> and run from there.
I think Error code 127 means that the core cannot find your file. We'll start a list of error codes and post here.
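For what it's worth, 127 is the exit status a POSIX shell conventionally returns when it cannot find the command it was asked to run, which fits the "cannot find your file" reading. A minimal sketch you can try on any Linux box (the program name no_such_program_xyz is made up for illustration):

```shell
# Ask a child shell to run a command that does not exist and capture
# the exit status. "no_such_program_xyz" is a hypothetical name.
sh -c 'no_such_program_xyz' 2>/dev/null
rc=$?
echo "exit status: $rc"   # prints "exit status: 127"
```

This only shows the shell convention; it does not say which file is missing on the core.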
Ted is correct. In general, I recommend the following method for troubleshooting mysterious error codes during RCCE executions. Build your code so that it can run on a single core. Execute the rccerun command as you normally would, but with just a single core (-nue 1). Capture the string that shows how the code would be run, i.e. the line that starts with "pssh". Strip off everything on that line that comes before the name of the executable. What is left contains the name of the executable, the parameters generated by rccerun, and the parameters supplied by the user.
Log into the core you want to use with ssh. If you cannot do this, the core is dead. If you can, but are asked to supply a password, there is a problem with ssh; see the earlier post by Ted. Neither of these problems has anything to do with RCCE. Once you are on the core, execute the code you were trying to run from the MCPC by pasting the string you saved. If this fails, it is usually immediately obvious why: for example, incorrect permissions, executable not present, input files not present, etc.
Note: if you cannot build your code to run on a single core, most of the above procedure can still be useful in troubleshooting error codes.
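The "strip off everything before the executable" step can be sketched in shell as follows. This is only an illustration: the sample pssh line, the user name someuser, and the host file suffix are made up, and the sed expression assumes the first "/" on the line is the start of the executable path.

```shell
# Hypothetical line as echoed by rccerun; your flags and paths will differ.
line='pssh -v -h PSSH_HOST_FILE.1234 -t -1 -P -p 1 /shared/someuser/pingpong < /dev/null'

# Drop everything before the first "/" so that only the executable path
# and whatever follows it remain.
cmd=$(printf '%s\n' "$line" | sed 's|^[^/]*\(/.*\)$|\1|')
echo "$cmd"   # prints "/shared/someuser/pingpong < /dev/null"
```

After that, ssh to the core (for example rck00) and paste the saved string at the prompt.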
Thanks Rob, yes, I found this method quite effective (ssh-ing into the core and running the code). What mechanisms can I use to debug an application on a node? I noticed that strace is not present. Is it possible to use gdb?
I had an MPI + Pthread application, which is tested on our school cluster. I replaced the MPI calls with RCCE calls and made sure that the build process succeeds. Now when I try to run it, I am stuck with error code 127:

rccerun -nue 3 -f rc.hosts sc_application
pssh -v -h PSSH_HOST_FILE.8599 -t -1 -p 3 /shared/devendra.rai/mpb.8599 < /dev/null
09:54:56 [SUCCESS] rck23
09:54:56 [SUCCESS] rck00
09:54:56 [SUCCESS] rck12
pssh -v -h PSSH_HOST_FILE.8599 -t -1 -P -p 3 /shared/devendra.rai/sc_application < /dev/null
09:55:21 [FAILURE] rck12 Exited with error code 127
09:55:21 [FAILURE] rck23 Exited with error code 127
09:55:21 [FAILURE] rck00 Exited with error code 127

I believe that 127 means that some kind of run-time support is missing, and it was (based on the recommendation here, I tried running the application on rck00 manually):

./sc_application: error while loading shared libraries: libcprts.so.5: cannot open shared object file: No such file or directory

We have remote access to SCC, so can anyone tell me how to get around this error? I am sure that there are many more such errors to come.

Thanks a lot
Devendra Rai
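One way to spot this kind of problem before launching is to list the shared libraries the binary needs and pick out the ones the loader cannot resolve. The sketch below assumes a glibc-style ldd is available on the machine where you run it (I have not checked whether it is installed on the cores); missing_libs is a helper name made up for this example.

```shell
# Hypothetical helper: read ldd output on stdin and print the names of
# libraries reported as "not found".
missing_libs() {
    grep 'not found' | awk '{print $1}'
}

# Self-contained demonstration on sample text in ldd's usual format.
sample=$(printf '\tlibcprts.so.5 => not found\n\tlibc.so.6 => /lib/libc.so.6 (0x00000000)\n')
printf '%s\n' "$sample" | missing_libs   # prints "libcprts.so.5"
```

On a real system the usage would be: ldd ./sc_application | missing_libs. An empty result only means ldd resolved everything on that machine, which may not match what is installed on the cores.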
I copied your sc_application binary to another system (used by Intel) and see the same error. So the shared library is missing on all our systems. I filed a bug about the missing library. It's bug 146 on http://marcbug.scc-dc.com/bugzilla3/
We'll look into whether it's feasible to install this library.
Have you tried building your application statically? This may be one way of getting around the need for the shared library.