22 Replies. Latest reply: Sep 21, 2011 9:24 AM by tedk

Unable to use RCKMPI due to lack of mpdboot

papandya Community Member

I installed RCKMPI and followed the instructions in the Quickstart manual, but got stuck at the step that asks me to run mpdboot. It says the program is not installed and that the administrator needs to install mpich2. Is there a workaround for this problem? Is there an alternative that could be used for running MPI programs?

  • 1. Re: Unable to use RCKMPI due to lack of mpdboot
    compres Community Member

    After you configure, compile, and install the library (to a directory in /shared/), you need to copy the executables to the cores (from the install directory in /shared/ to /usr/bin). You then run the mpdboot command on one of the cores (not on the MCPC), and an mpd ring is started.

     

    You then launch jobs on the cores as if they were a cluster (with mpiexec, as normal). On the cores you are root, so you have permission to do anything.
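
    The sequence above could be sketched as a short script. This is a minimal sketch, not the Quickstart's own procedure: INSTALL_DIR is a hypothetical install location under /shared/, ./my_mpi_app is a placeholder program, and the mpdboot arguments are the ones quoted later in this thread. With DRY_RUN=1 the script only prints the commands.

```shell
#!/bin/sh
# Sketch of the setup sequence, meant to be run as root on rck00.
# Assumptions (not from the RCKMPI documentation): INSTALL_DIR and APP
# are placeholders; adjust them to your own paths.
INSTALL_DIR="/shared/youruser/install/rckmpi"
APP="./my_mpi_app"
DRY_RUN=1   # set to 0 to actually execute the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

setup_and_launch() {
    # Copy the executables from the shared install directory to /usr/bin.
    # (This copy has to be repeated on every core that joins the ring.)
    run cp -a "$INSTALL_DIR/bin/." /usr/bin/
    # Start the mpd ring from rck00 only.
    run mpdboot --totalnum=48 --file=mpi.hosts --maxbranch=1
    # Launch a job across the ring.
    run mpiexec -n 48 "$APP"
}

setup_and_launch
```

    Keeping the dry-run default makes it possible to review the exact commands before anything is written to /usr/bin.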

     

    - Isaías

  • 2. Re: Unable to use RCKMPI due to lack of mpdboot
    papandya Community Member

    Thanks for the quick reply, but it's not the end of my troubles. I am unable to copy anything to /usr/bin due to permission problems. Is there a workaround for that, or an alternative folder that can be used for initiating it? I also tried going into the folder and running mpdboot, but it does not work there either. Maybe I did not understand exactly what you meant by running mpdboot on the cores instead of the MCPC.

  • 3. Re: Unable to use RCKMPI due to lack of mpdboot
    tedk Community Member

    I'm very confused. If you ssh to rck00, you will have access to /usr/bin because you are root. What am I missing here? You don't even need a password to ssh as root to the cores. The MCPC and the cores share a certificate.

     

    tekubasx@marc101:~$ ssh root@rck00
    root@rck00:~> cd /shared
    root@rck00:/shared> cd tekubasx/

    root@rck00:/shared/tekubasx> cp a.sh /usr/bin
    root@rck00:/shared/tekubasx> cd /usr/bin
    root@rck00:/usr/bin> ls -l a.sh
    -rwxr-xr-x    1 root     root          103 Jan 11 04:06 a.sh
    root@rck00:/usr/bin> rm a.sh
    root@rck00:/usr/bin> ls -l a.sh
    ls: a.sh: No such file or directory
    root@rck00:/usr/bin> exit

    Connection to rck00 closed.
    tekubasx@marc101:~$

  • 4. Re: Unable to use RCKMPI due to lack of mpdboot
    papandya Community Member

    I hadn't done the ssh root@rck00 step, having missed it in the documentation, which is why I had problems copying to /usr/bin. I am still having trouble, though. As I can't connect via VNC to the Intel machine, I am unable to do a few of the steps specified in the Quickstart document. How do I broadcast commands from rck00 to all the other cores? I couldn't find a command that does that. Is there a document for setting this up using only the ssh console?

     

    Thank you for helping me out, appreciate it.

  • 5. Re: Unable to use RCKMPI due to lack of mpdboot
    tedk Community Member

    What marc system are you trying to VNC into? You should not have a problem with VNC. It has generally good response, better than X11 forwarding and in my opinion usable. In the morning I can try to VNC into your system and see if there is a problem.

     

    We use sccKonsole to broadcast a command (issue on one core, sent to all cores). sccKonsole requires a GUI desktop. I don't know how to do this from a command line. Does anyone know if it is possible?

  • 6. Re: Unable to use RCKMPI due to lack of mpdboot
    compres Community Member

    If there is no access to sccKonsole, broadcasting would be easier to achieve from the MCPC through PSSH.
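
    A minimal sketch of that approach, assuming the pssh client is installed on the MCPC (it is named parallel-ssh on some distributions) and that the cores are reachable as rck00 to rck47 with passwordless root ssh, as described earlier in the thread. The file name cores.txt and the helper names are placeholders.

```shell
#!/bin/sh
# Print the 48 core hostnames rck00..rck47, one per line.
core_hosts() {
    i=0
    while [ "$i" -lt 48 ]; do
        printf 'rck%02d\n' "$i"
        i=$((i + 1))
    done
}

# Broadcast one command to every core from the MCPC.
# -h: hosts file, -l: remote user, -i: show each host's output inline.
broadcast() {
    core_hosts > cores.txt
    pssh -h cores.txt -l root -i "$@"
}

# Example (run on the MCPC):
#   broadcast 'python --version'
```

    If pssh is not available, a plain loop over the output of core_hosts that calls ssh root@<host> for each core gives the same effect, only sequentially.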

     

    - Isaías

  • 7. Re: Unable to use RCKMPI due to lack of mpdboot
    papandya Community Member

    I tried using PSSH and it seems to work all right from the MCPC. I keep getting the following error from mpdboot:

     

    root@rck00:~> mpdboot --totalnum=48 --file=mpi.hosts --maxbranch=1
    mpdboot_rck00 (handle_mpd_output 420): from mpd on rck01, invalid port info:
    no_port

     

    My mpi.hosts is correct and is in the folder specified by the Quickstart document.

     

    What is missing on my end here? As it says invalid port, I feel that it has something to do with mpi.hosts, but I'm not sure.
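
    One way to rule out the hosts file is a quick sanity check. This is a sketch under an assumption the thread does not confirm: that mpi.hosts lists one core hostname per line (rck00 to rck47), the usual layout for an mpd hosts file. The function name check_hosts_file is hypothetical.

```shell
#!/bin/sh
# Report how many hosts the file lists and whether any appear twice.
# Usage: check_hosts_file mpi.hosts 48
check_hosts_file() {
    file="$1"
    expected="$2"
    # Count non-empty lines and distinct non-empty lines.
    total=$(grep . "$file" | wc -l | tr -d ' ')
    unique=$(grep . "$file" | sort -u | wc -l | tr -d ' ')
    if [ "$total" -ne "$expected" ]; then
        echo "expected $expected hosts, found $total"
        return 1
    fi
    if [ "$unique" -ne "$total" ]; then
        echo "duplicate host entries found"
        return 1
    fi
    echo "ok: $total unique hosts"
}

# Example: check_hosts_file mpi.hosts 48
```

    A clean result here only shows that the file is well formed; it does not explain the no_port message by itself.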

  • 8. Re: Unable to use RCKMPI due to lack of mpdboot
    michael.riepen Community Member

    mpdboot should not be invoked via pssh as it should only be started on core rck00!

  • 9. Re: Unable to use RCKMPI due to lack of mpdboot
    papandya Community Member

    Hello Michael, by using PSSH I meant that I am able to broadcast to all the cores. On the other hand, if you look at the error message I posted, mpdboot was already being run as root@rck00, so there is some salient point that I am failing to grasp.

  • 10. Re: Unable to use RCKMPI due to lack of mpdboot
    compres Community Member

    First, I would suggest you take a look at the manual for more details; maybe it will help you:

    http://communities.intel.com/docs/DOC-6133

     

    Second, you could try the following:

     

    1. Copy the binaries and libraries for python and mpich to the SCC filesystem, as specified in the documents.

     

    2. On all cores:

     

    a. Test that python works:

    python --version

     

    b. Test that mpd scripts work:

    mpd --help

     

    3. Try the alternative way of creating a ring:

     

    a. In rck00, issue:

    mpd --daemon

     

    b. Again in rck00, issue:

    mpdtrace -l

    And take note of the port (output format is rck00_<port> (192.168.0.1)).

     

    c. On all cores, issue the command:

    mpd --daemon --host=rck00 --port=<port from b>

     

    d. In rck00, issue:

    mpdtrace -l

    to verify that the ring was created.
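
    Steps 3a to 3d could be scripted roughly as follows. This is a sketch, not a tested procedure for any particular sccKit release: the helper names are hypothetical, and it assumes mpdtrace -l prints lines of the form rck00_<port> (<ip>) as noted in step 3b.

```shell
#!/bin/sh
# Read `mpdtrace -l` output on stdin and print the port of the given host.
# Input lines look like: rck00_49084 (192.168.26.1)
ring_port() {
    sed -n "s/^${1}_\([0-9][0-9]*\) .*/\1/p" | head -n 1
}

# Build the command a core runs to join the ring started on rck00.
join_command() {
    echo "mpd --daemon --host=rck00 --port=$1"
}

# On rck00 (left as comments so the sketch has no side effects):
#   mpd --daemon
#   port=$(mpdtrace -l | ring_port rck00)
#   join_command "$port"   # broadcast the printed command to the other cores
#   mpdtrace -l            # verify that the ring was created
```

    Extracting the port with a filter avoids copying it by hand, since the port changes each time the first mpd is started.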

     

    Hope this helps.

     

    - Isaías

  • 11. Re: Unable to use RCKMPI due to lack of mpdboot
    tedk Community Member

    Has this been resolved? The instructions we used to use to run rckmpi don't work any longer ... at least for me.

     

    Here is the error I get ...

    rck00:/root # mpdboot --totalnum=48 --file=mpi.hosts --maxbranch=1 --verbose
    running mpdallexit on rck00
    LAUNCHED mpd on rck00  via
    RUNNING: mpd on rck00
    LAUNCHED mpd on rck01  via  rck00
    mpdboot_rck00 (handle_mpd_output 420): from mpd on rck01, invalid port info:
    no_port

     

    I successfully went through all the other rckmpi steps. Before trying to run mpdboot, I broadcast the following from rck00:

    cp -a /shared/tekubasx/rckmpi/python/lib/* /usr/lib

    cp -a /shared/tekubasx/rckmpi/zlib/lib/* /usr/lib

    cp -a /shared/tekubasx/rckmpi/libssl/lib/* /usr/lib

    cp -a /shared/tekubasx/rckmpi/python/bin/* /usr/bin

    Then I can see

    rck00:rckmpi # python --version

    Python 2.6.5+

    rck00:rckmpi #

    on all cores. So python works, a good sign. Then I did

    cp -a /shared/tekubasx/install/rckmpi/bin/* /usr/bin

    cp -a /shared/tekubasx/install/rckmpi/lib/* /usr/lib

    cp -a /shared/tekubasx/rckmpi/conf/mpich/mpd.conf /etc

     

     

    Then I stopped broadcasting on rck00 and did the mpdboot and got the error.

    I do see

    rck00:rckmpi # mpdtrace -l
    rck00_49084 (192.168.26.1)
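
    Since mpdtrace -l lists only rck00 here, a small filter can show which cores never joined the ring. A sketch, assuming the rck00 to rck47 naming seen in this thread; missing_cores is a hypothetical helper.

```shell
#!/bin/sh
# Read `mpdtrace -l` output on stdin and print every core (rck00..rck47)
# that does not appear in it.
missing_cores() {
    trace=$(cat)
    i=0
    while [ "$i" -lt 48 ]; do
        host=$(printf 'rck%02d' "$i")
        # A ring member shows up as "<host>_<port> (<ip>)".
        printf '%s\n' "$trace" | grep -q "^${host}_" || echo "$host"
        i=$((i + 1))
    done
}

# Example: mpdtrace -l | missing_cores
```
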
  • 12. Re: Unable to use RCKMPI due to lack of mpdboot
    tedk Community Member

    Well, here's something interesting. I put RCK MPI on an sccKit 1.3.0 system (there are not many of those left, and the one I found won't last long), and on 1.3.0 mpdboot works just fine.

     

    root@rck00:~> mpdboot --totalnum=48 --file=mpi.hosts --maxbranch=1
    root@rck00:~> mpdtrace -l
    rck00_59676 (192.168.0.1)
    rck01_57624 (192.168.0.2)
    rck02_50782 (192.168.0.3)
    rck03_53645 (192.168.0.4)
           :

           :

    rck43_51985 (192.168.0.44)
    rck44_35423 (192.168.0.45)
    rck45_44521 (192.168.0.46)
    rck46_35245 (192.168.0.47)
    rck47_50624 (192.168.0.48)
    root@rck00:~>

     

    So the problems with mpdboot discussed in this thread seem to have arisen with 1.4.0 and eMAC. The system I saw my problem on was 1.4.1.2 and eMAC, and that configuration is stable. So my suspicion is that our SVN either doesn't have the very latest RCK MPI or our RCK MPI needs an update.

     

    Isaias, what do you think?

  • 13. Re: Unable to use RCKMPI due to lack of mpdboot
    papandya Community Member

    Hello All, sorry for having disappeared from this board. The problem was resolved for me when I followed all the steps given as replies in this post with a little trial and error. I manually formed the complete ring and had to copy the required files to every core that I ended up using. Finally I was able to run the program I intended to with RCK MPI.

     

    @Ted: I will check and see which version is running on the machine I had access to and let you guys know.

     

    Again, thanks for all the helpful comments received here.

  • 14. Re: Unable to use RCKMPI due to lack of mpdboot
    compres Community Member

    Hi Ted,

     

    Which kernel version are you using?

     

    As a workaround on the newer kernel, you may want to start the MPD daemon ring by broadcasting (as described in my previous post) instead of with mpdboot.

     

    There is higher latency for inter-core communication through sockets on the newer kernel, and mpdboot may fail to initialize the MPD ring.
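
    When the ring is started by broadcasting, the higher socket latency suggests waiting for the ring to fill rather than checking it once. A sketch of such a poll; wait_for_ring and its arguments are hypothetical, and the retry count and delay in the example are arbitrary values, not measured ones.

```shell
#!/bin/sh
# Poll a trace command until it reports the expected number of ring members.
# Usage: wait_for_ring <expected> <tries> <delay_seconds> <command...>
# On rck00 the command would be: mpdtrace -l
wait_for_ring() {
    expected="$1"; tries="$2"; delay="$3"
    shift 3
    count=0
    while [ "$tries" -gt 0 ]; do
        # Each line of trace output is one ring member.
        count=$("$@" 2>/dev/null | wc -l | tr -d ' ')
        if [ "$count" -ge "$expected" ]; then
            echo "ring complete: $count members"
            return 0
        fi
        tries=$((tries - 1))
        sleep "$delay"
    done
    echo "ring incomplete: $count of $expected members"
    return 1
}

# Example: wait_for_ring 48 30 2 mpdtrace -l
```
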

     

    - Isaías
