2 Replies Latest reply on Sep 24, 2010 3:10 PM by tedk

    pssh: Error Code 126

    devendra.rai

      Hello All,

       

      Has anyone seen error code 126 when using pssh? I tried to find man pages that explain what the error codes mean, but so far there do not seem to be any decent ones. Curiously, my pssh succeeds the first time, while the next time I get error 126.

       

      Is there any place where all error codes are catalogued? Or, can we agree to have a place where we catalogue all error codes that we find?

       

      That would help everyone a lot.

       

      Cheers

      Devendra Rai

        • 1. Re: pssh: Error Code 126
          devendra.rai

          Just to reply to my own mail:

           

          Retrying the procedure after some time does the trick, although I do not know whether this counts as a "solution". I am still looking for a plausible explanation of the error code.

           

          Cheers

           

          Devendra

          • 2. Re: pssh: Error Code 126
            tedk

            Your screenshots seem to indicate that you are running the share example on 2 cores (00 and 47). Is that correct or is one of your own programs called share? The share example, however, requires two parameters, and I see only one in your screenshot. Could you post the original rccerun command you used?

             

            Please understand that shared memory with RCCE is still under development. Some areas of the memory assigned to shared are used by the system, and there is no protection against corrupting those locations (see Michael Riepen's post, in which he points this out and lists the locations used). We are working on methods to avoid those system locations.

             

            I have seen error code 126 before but haven't been able to reproduce it this morning. These error codes are not set inside any Intel code; I think they come straight from bash. I have seen error code 139 when I try to run an RCCE program with pssh, and I see error code 1 when I leave out a command-line parameter.
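
For reference, the bash manual reserves a few exit statuses: 126 means the command was found but is not executable, 127 means the command was not found, and 128+N means the command was terminated by signal N (so 139 would correspond to signal 11, SIGSEGV on Linux). Anything else, such as 1, is set by the program itself. Below is a minimal Python sketch of that mapping. The function name `describe_exit_status` is made up for illustration, and the sketch assumes the status really was produced by bash on the remote side, which this thread only suspects.

```python
import signal


def describe_exit_status(code: int) -> str:
    """Describe an exit status using the conventions documented for bash.

    Hypothetical helper for illustration; not part of pssh or RCCE.
    """
    if code == 0:
        return "success"
    if code == 126:
        return "command found but not executable"
    if code == 127:
        return "command not found"
    if 128 < code < 256:
        # bash reports 128 + N when a command dies from signal N
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"terminated by {name}"
    return "failure reported by the program itself"


if __name__ == "__main__":
    # The three codes mentioned in this thread
    for code in (1, 126, 139):
        print(code, describe_exit_status(code))
```

If 126 does come from bash, one thing to check when it appears is whether the binary on the shared path is readable and has its execute bit set as seen from the cores at that moment.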

             

            Here's what I got when I ran share on two cores, choosing 16 doubles. The pssh command produced by rccerun looks somewhat different from yours: mine has an additional parameter and redirects input from /dev/null. I'm curious why yours is different. Note also that my clock is 0.533, not 1.0. This is probably due to the setting of REFCLOCKGHZ in rccerun.in. In the latest RCCE release it is set to 1.0; it should be 0.533. It's a benign typo and will be changed in the next release.

             

            tkubasx@marc101:/shared/tkubasx$ rccerun -nue 2 -f rc.hosts share 16 1
            pssh -h PSSH_HOST_FILE.14696 -t -1 -p 2 /shared/tkubasx/mpb.14696 < /dev/null
            [1] 13:42:13 [SUCCESS] rck00
            [2] 13:42:13 [SUCCESS] rck47
            pssh -h PSSH_HOST_FILE.14696 -t -1 -P -p 2 /shared/tkubasx/share 2 0.533 00 47 16 1 < /dev/null
            rck47: Final sum on UE 001 equals 152.000000, refval = 152.000000
            rck00: Buffer is allocated 16 doubles
            Initial sum on UE 000 equals 136.000000
            [1] 13:42:13 [SUCCESS] rck00
            [2] 13:42:13 [SUCCESS] rck47
            tkubasx@marc101:/shared/tkubasx$ head rc.hosts
            00
            47
            02
            03
            04
            05
            06
            07
            08
            09
            tkubasx@marc101:/shared/tkubasx$
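
To keep track of which cores fail and with which code across repeated runs, the per-host status lines in the transcript above can be collected with a few lines of Python. This is only a sketch. The `[n] HH:MM:SS [SUCCESS] host` shape is taken from the output above, but the failure line in the sample, including the "Exited with error code" wording, is an assumption, and `parse_pssh_status` is a hypothetical helper name.

```python
import re

# Assumed line shape: "[n] HH:MM:SS [STATUS] host optional-message"
STATUS_LINE = re.compile(r"^\[\d+\]\s+\d\d:\d\d:\d\d\s+\[(\w+)\]\s+(\S+)\s*(.*)$")


def parse_pssh_status(text):
    """Return {host: (status, exit_code_or_None)} from pssh summary lines."""
    results = {}
    for line in text.splitlines():
        match = STATUS_LINE.match(line.strip())
        if not match:
            continue  # program output or echoed commands, not a status line
        status, host, message = match.groups()
        code = re.search(r"error code (\d+)", message)
        results[host] = (status, int(code.group(1)) if code else None)
    return results


if __name__ == "__main__":
    sample = """\
[1] 13:42:13 [SUCCESS] rck00
rck47: Final sum on UE 001 equals 152.000000, refval = 152.000000
[2] 13:42:13 [FAILURE] rck47 Exited with error code 126
"""
    # The FAILURE line above is hypothetical, written in the assumed format.
    for host, (status, code) in parse_pssh_status(sample).items():
        print(host, status, code)
```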