6 Replies Latest reply on Jan 20, 2011 12:29 PM by tedk

    Exited with error code 127

    keith_chapman

      While running the pingpong sample on marc010 I get the following error:

       

      ./rccerun -nue 2 -f hosts/rc.hosts apps/PINGPONG/pingpong

      [1] 20:11:21 [FAILURE] rck01 Exited with error code 127

      [2] 20:11:21 [FAILURE] rck00 Exited with error code 127

      pssh -h PSSH_HOST_FILE.10969 -t -1 -P -p 2 /home/keith_chapman/svn/apps/PINGPONG/pingpong 2 1.0 00 01 < /dev/null

      [1] 20:11:46 [FAILURE] rck00 Exited with error code 127

      [2] 20:11:46 [FAILURE] rck01 Exited with error code 127

      What does the error code 127 indicate?

        • 1. Re: Exited with error code 127
          tedk

          I went on your marc computer and successfully built RCCE and ran pingpong. You can look in my /home and my /shared.

          I noticed that your command line shows pssh pointing to your /home.

          Only the /shared directory is mounted on the cores. You must copy your pingpong executable to /shared/<yourusername> and run from there.

           

          I think error code 127 means that the core cannot find your file. We'll start a list of error codes and post it here.
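As a quick illustration of Ted's point, the sketch below asks a POSIX shell to run an executable that is not there and prints the exit status it reports. The path /nonexistent/pingpong is a made-up placeholder, assumed not to exist on the machine where you try this; it stands in for a binary under /home that the cores cannot see.

```shell
# Ask a shell to run a file that does not exist and capture its exit status.
# /nonexistent/pingpong is a hypothetical placeholder path (assumed absent).
status=0
sh -c '/nonexistent/pingpong' 2>/dev/null || status=$?
echo "exit status: $status"
```

A POSIX shell reports 127 when the command cannot be found, and 126 when the file is found but cannot be executed, so the two codes point at different problems (missing file versus permissions).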

          • 2. Re: Exited with error code 127
            rfvander

            Ted is correct. In general, I recommend the following method to troubleshoot mysterious error codes experienced during RCCE executions. Build your code so that it can run on a single core. Execute the rccerun command as you normally would, but with just a single core ("-nue 1"). Capture the string that shows how the code would be run, i.e., the line that starts with "pssh". Strip off everything on that line that comes before the name of the executable. What is left contains the name of the executable, parameters generated by rccerun, plus parameters supplied by the user.

            Log into the core you want to use, using ssh. If you cannot do this, the core is dead. If you can do it but are asked to supply a password, there is a problem with ssh; see the earlier post by Ted. Neither of these problems has anything to do with RCCE. Once you are on the core, execute the code you were trying to run from the MCPC by pasting the string you saved. If this fails, it is usually immediately obvious why: for example, incorrect permissions, executable not present, input files not present, etc.

            Note: if you cannot build your code to run on a single core, most of the above procedure can still be useful in troubleshooting error codes.
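The "strip off everything before the executable" step can be sketched in plain shell. This is only an illustration: core_command is a hypothetical helper name, and it assumes the executable appears as an absolute path and that nothing earlier on the pssh line contains a "/". The sample line is the one from the first post.

```shell
# Extract the part of an rccerun "pssh ..." line that actually runs on a core.
# Assumptions: the executable is an absolute path, and no earlier token on
# the line contains a "/". core_command is a hypothetical helper name.
core_command() {
    line=$1
    cmd="/${line#*/}"            # drop everything before the first "/"
    cmd=${cmd%" < /dev/null"}    # drop the trailing stdin redirection, if present
    printf '%s\n' "$cmd"
}

core_command 'pssh -h PSSH_HOST_FILE.10969 -t -1 -P -p 2 /home/keith_chapman/svn/apps/PINGPONG/pingpong 2 1.0 00 01 < /dev/null'
```

The printed string is what you would paste after logging into the core with ssh.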

            • 3. Re: Exited with error code 127
              keith_chapman

              Thanks Rob, yes, I found this method quite effective (ssh-ing into the core and running the code). What mechanisms can I use to debug an application on a node? I noticed that strace is not present. Is it possible to use gdb?

              • 4. Error code 127: shared library not found
                devendra.rai

                Hello,

                 

                I had an MPI + Pthread application, which was tested on our school cluster. I replaced the MPI calls with RCCE calls, and made sure that the build process succeeds. Now when I try to run it, I am stuck with error code 127:

                 

                 

                rccerun -nue 3 -f rc.hosts  sc_application
                pssh -v -h PSSH_HOST_FILE.8599 -t -1 -p 3 /shared/devendra.rai/mpb.8599 < /dev/null
                [1] 09:54:56 [SUCCESS] rck23
                [2] 09:54:56 [SUCCESS] rck00
                [3] 09:54:56 [SUCCESS] rck12
                pssh -v -h PSSH_HOST_FILE.8599 -t -1 -P -p 3 /shared/devendra.rai/sc_application < /dev/null
                [1] 09:55:21 [FAILURE] rck12 Exited with error code 127
                [2] 09:55:21 [FAILURE] rck23 Exited with error code 127
                [3] 09:55:21 [FAILURE] rck00 Exited with error code 127
                I believe that 127 means that some kind of run-time support is missing, and it was:
                (based on the recommendation here, I tried running the application on rck00 manually)
                ./sc_application: error while loading shared libraries: libcprts.so.5: cannot open shared object file: No such file or directory
                We have remote access to SCC, so can anyone tell me how to get around this error? I am sure that there are many more such errors to come.
                Thanks a lot
                Devendra Rai
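One way to see every unresolved library at once, rather than hitting them one run at a time, is to filter the output of ldd. The sketch below is hedged: missing_libs is a hypothetical helper name, it assumes ldd is available on the machine where you inspect the binary (this thread does not say whether it is present on the cores), and it relies on ldd marking unresolved entries with the words "not found".

```shell
# Print the names of shared libraries that ldd reports as unresolved.
# Reads ldd output on stdin; missing_libs is a hypothetical helper name.
missing_libs() {
    awk '/not found/ { print $1 }'
}

# Usage, on a machine where ldd is available:
#   ldd ./sc_application | missing_libs
```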
                • 5. Re: Error code 127: shared library not found
                  tedk

                  I copied your sc_application binary to another system (used by Intel) and see the same error. So the shared library is missing on all our systems. I filed a bug about the missing library. It's bug 146 on http://marcbug.scc-dc.com/bugzilla3/

                   

                  We'll look into whether it's feasible to install this library.

                  • 6. Re: Error code 127: shared library not found
                    tedk

                    Have you tried building your application statically? This may be one way of getting around the need for a shared library.