3 Replies Latest reply on Nov 21, 2010 11:17 PM by tedk

    Run NPB with rccerun

    xl10

      I ran into a problem when running BT from NPB using rccerun. RCCE was built for SCC_LINUX and BT was built with make bt NPROCS=1 CLASS=C. The command used to launch the code was ../../../rccerun -nue 1 -f ../../../hosts/rc.hosts bt.C.1. The output is as follows:

       

      pssh -h PSSH_HOST_FILE.6525 -t -1 -p 1 /shared/xl10/rcce/apps/NPB/BT/mpb.6525 < /dev/null
      [1] 19:22:56 [SUCCESS] rck01
      pssh -h PSSH_HOST_FILE.6525 -t -1 -P -p 1 /shared/xl10/rcce/apps/NPB/BT/bt.C.1 1 0.533 01 < /dev/null
      [1] 19:22:57 [FAILURE] rck01 Exited with error code 137

       

      Any ideas?

       

      Thank you

       

      Xu

        • 1. Re: Run NPB with rccerun
          tedk

          I hadn't seen error code 137 before, but I now think it has something to do with insufficient memory. I was able to duplicate your error. I tried it on 4 cores and also got an error. When built as CLASS=S, the benchmark works fine on 4 cores.

           

          I checked with our RCCE developer and he says "This is probably caused by insufficient memory. Please do a “size” on the executable (no dynamic memory allocation, so this will give some meaningful information), and/or run with more cores. This thing takes squares, so could go up to 36 cores."
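
          As a rough sketch of the "size" check suggested above: the helper below sums the text, data and bss columns of the default (Berkeley-format) output of the binutils size tool. The function name static_bytes is made up for illustration, and the comparison against /proc/meminfo assumes a Linux image on the core.

```shell
# Sum the text, data and bss columns of Berkeley-format `size` output
# (read from stdin) to estimate the static memory footprint in bytes.
# static_bytes is a hypothetical helper name, not part of RCCE or binutils.
static_bytes() {
    # Line 1 is the column header; line 2 holds the numbers for the file.
    awk 'NR == 2 { printf "%.0f\n", $1 + $2 + $3 }'
}

# Typical use on a core, with the binary from this thread:
#   size bt.C.1 | static_bytes
#   grep MemTotal /proc/meminfo   # memory visible to that core, in kB
```

          If the reported total is close to or above the memory visible to a single core, spreading the problem over more cores is the likely fix.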

           

          So I built it as "make bt CLASS=C NPROCS=36" and then ran it as "rccerun -nue 36 -f rc.hosts bt.C.36", and so far it's running OK. It hasn't finished yet (it'll take a while on 36 cores), but error 137 did not come up immediately as it did before, and the benchmark is giving reasonable output.

           

          If you look at this some more and find out the minimum number of nodes it needs to run, please report it.
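
          Since BT only accepts square process counts, the candidates to try are few. A minimal sketch for listing them, assuming an upper bound of 36 as mentioned above (square_counts is a hypothetical helper name):

```shell
# Print the perfect squares up to a given maximum core count.
square_counts() {
    max=$1
    i=1
    while [ $((i * i)) -le "$max" ]; do
        echo $((i * i))
        i=$((i + 1))
    done
}

square_counts 36

# Sketch of a search loop built on it, using the commands from this thread:
#   for n in $(square_counts 36); do
#       make bt CLASS=C NPROCS=$n && rccerun -nue $n -f rc.hosts bt.C.$n
#   done
```

          The smallest count that gets past the immediate error 137 would be the minimum asked about here.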

           

          What compiler were you using for this? icc, icpc, g++?

          • 2. Re: Run NPB with rccerun
            xl10

            I used gcc to compile BT. For now I have switched to class A to continue my experiments, and it works fine.

            • 3. Re: Run NPB with rccerun
              tedk

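              That error code 137 you saw most likely means the process was killed with SIGKILL (137 = 128 + 9) rather than a seg fault, which would show up as 139. That fits the insufficient-memory explanation. The code is a bash exit status that comes from rccerun, which runs pssh.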

              Here's a trick for getting a better error message that sometimes works ... unfortunately not in this case, however.

               

              You can run your application with rccerun and let it fail with the bash error code. Then ssh into a node and explicitly run your application. The reason you want to run your application with rccerun first is to get the parameters that your app needs.
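
              To pull the command and its parameters out of the pssh line that rccerun prints, something like the following may help. It is only a sketch: extract_cmd is a made-up helper, and it assumes the line has the same shape as the ones in this thread (options ending in "-p N", then the command, then "< /dev/null").

```shell
# Strip the pssh options and the trailing "< /dev/null" from an rccerun
# log line, leaving only the application command and its arguments.
# Assumes the application's own arguments do not contain " -p <number> ".
extract_cmd() {
    sed -e 's/^pssh .* -p [0-9][0-9]* //' -e 's| < /dev/null$||'
}

echo 'pssh -h PSSH_HOST_FILE.10268 -t -1 -P -p 1 /shared/tkubasx/bt.C.1 1 0.533 00 < /dev/null' | extract_cmd
# prints: /shared/tkubasx/bt.C.1 1 0.533 00
```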

               

              pssh -h PSSH_HOST_FILE.10268 -t -1 -P -p 1 /shared/tkubasx/bt.C.1 1 0.533 00 < /dev/null
              [1] 21:55:21 [FAILURE] rck00 Exited with error code 137

               

              That </dev/null is there because some versions of pssh need it when used in scripts. Then,

               

              tkubasx@marc101:/shared/tkubasx$ ssh root@rck00
              root@rck00:~> cd /shared/tkubasx/
              root@rck00:/shared/tkubasx> bt.C.1 1 0.533 00
              Killed <== not the best error message but often much better with this method.
              root@rck00:/shared/tkubasx>
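
              The "Killed" message and the code 137 are consistent with the usual shell convention that an exit status above 128 means the process was terminated by signal (status - 128). A small sketch for decoding such a status (decode_status is a hypothetical helper name):

```shell
# Decode a shell exit status: values above 128 conventionally mean
# "terminated by signal (status - 128)".
decode_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        # kill -l <number> prints the signal name without the SIG prefix.
        echo "signal $sig ($(kill -l "$sig"))"
    else
        echo "exit code $status"
    fi
}

decode_status 137
# prints: signal 9 (KILL)
```

              On Linux, a SIGKILL that the user did not send often comes from the kernel's out-of-memory killer, which matches the insufficient-memory explanation given earlier in the thread.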