I installed RCKMPI and followed the instructions in the Quickstart manual, but I am stuck at the step that asks me to run mpdboot. It says the program is not installed and that the administrator needs to install mpich2. Is there a workaround for this problem? Is there some alternative that could be used for running MPI programs?
After you configure, compile and install the library (to a directory in /shared/), you need to copy the executables to the cores (from the install directory in /shared/ to /usr/bin). You then run the mpdboot command on one of the cores (not on the MCPC), and an mpd ring is started.
You then launch jobs on the cores, as if they were a cluster (with mpiexec as normal). On the cores, you are root, so you have permission to do anything.
- Isaías
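For reference, the sequence described above could be sketched as a dry run that only prints the commands so they can be reviewed before use. The install prefix and the program name ./hello are placeholders, not names from the Quickstart, and the mpdboot flags are simply the ones quoted later in this thread.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands for review, executes nothing.
# $1 is the install prefix under /shared (hypothetical; use your own).
rckmpi_steps() {
    prefix=$1
    # Run on every core: stage the executables where the shell finds them.
    printf 'cp -a %s/bin/* /usr/bin\n' "$prefix"
    # Run on rck00 only (not on the MCPC): start the mpd ring.
    printf 'mpdboot --totalnum=48 --file=mpi.hosts --maxbranch=1\n'
    # Run on rck00: launch a job; ./hello is a placeholder program name.
    printf 'mpiexec -n 48 ./hello\n'
}

rckmpi_steps /shared/youruser/install/rckmpi
```

The first command has to run on every core, while the last two run only on rck00.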
Thanks for the quick reply, but it's not the end of my troubles. I am unable to copy anything to /usr/bin due to permission problems. Is there any workaround for that? Is there some alternative folder that can be used for initiating it? I also tried going into the folder and running mpdboot there, but that does not work either. Maybe I did not understand exactly what you meant by running mpdboot on the cores instead of the MCPC.
I'm very confused. If you ssh to rck00, you will have access to /usr/bin because you are root. What am I missing here? You don't even need a password to ssh as root to the cores. The MCPC and the cores share a certificate.
tekubasx@marc101:~$ ssh root@rck00
root@rck00:~> cd /shared
root@rck00:/shared> cd tekubasx/
root@rck00:/shared/tekubasx> cp a.sh /usr/bin
root@rck00:/shared/tekubasx> cd /usr/bin
root@rck00:/usr/bin> ls -l a.sh
-rwxr-xr-x 1 root root 103 Jan 11 04:06 a.sh
root@rck00:/usr/bin> rm a.sh
root@rck00:/usr/bin> ls -l a.sh
ls: a.sh: No such file or directory
root@rck00:/usr/bin> exit
Connection to rck00 closed.
tekubasx@marc101:~$
I hadn't done the ssh root@rck00 step and had missed it in the documentation, which is why I was having problems copying to /usr/bin. I am still having trouble, though. As I can't connect via VNC to the Intel machine, I am unable to do a few of the steps specified in the Quickstart document. How do I broadcast commands from rck00 to all the other cores? I couldn't find a command that does that. Is there some document for setting this up using only the ssh console?
Thank you for helping me out, appreciate it.
What marc system are you trying to VNC into? You should not have a problem with VNC. It has generally good response, better than X11 forwarding and in my opinion usable. In the morning I can try to VNC into your system and see if there is a problem.
We use sccKonsole to broadcast a command (issue on one core, sent to all cores). sccKonsole requires a GUI desktop. I don't know how to do this from a command line. Does anyone know if it is possible?
If there is no access to sccKonsole, broadcasting would be easier to achieve from the MCPC through PSSH.
- Isaías
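A minimal sketch of what broadcasting from the MCPC with PSSH could look like, assuming 48 cores named rck00 to rck47 and passwordless root ssh as described earlier in the thread. The function names and the hosts file name are made up for illustration, and broadcast only prints the command line rather than running it.

```shell
#!/bin/sh
# Print the core hostnames rck00..rck47, one per line (48 cores assumed).
core_hosts() {
    i=0
    while [ "$i" -lt 48 ]; do
        printf 'rck%02d\n' "$i"
        i=$((i + 1))
    done
}

# Print (do not run) a pssh command that sends command "$2" to every host
# listed in file "$1". -h names the hosts file, -l the remote user, and
# -i shows each host's output inline.
broadcast() {
    printf "pssh -h %s -l root -i '%s'\n" "$1" "$2"
}

broadcast cores.txt 'python --version'
```

To use it, write the list once with `core_hosts > cores.txt`, then run the printed line on the MCPC. On some distributions the pssh binary is installed as parallel-ssh.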
I tried using PSSH and it seems to work all right from the MCPC. However, I keep getting this error from mpdboot:
root@rck00:~> mpdboot --totalnum=48 --file=mpi.hosts --maxbranch=1
mpdboot_rck00 (handle_mpd_output 420): from mpd on rck01, invalid port info:
no_port
My mpi.hosts is correct and is in the folder specified by the Quickstart document.
What is missing on my end here? Since it says invalid port, I feel it has something to do with mpi.hosts, but I'm not sure.
mpdboot should not be invoked via pssh as it should only be started on core rck00!
Hello Michael, by using PSSH I meant that I am able to broadcast to all the cores. On the other hand, if you look at the error message I posted, mpdboot is already being run as root@rck00, so there is some salient point that I am unable to get hold of.
First, I would suggest you take a look at the manual to find more details; maybe it will help you:
http://communities.intel.com/docs/DOC-6133
Second, you could try the following:
1. Copy the binaries and libraries for python and mpich to the SCC filesystem, as specified in the documents.
2. On all cores:
a. Test that python works:
python --version
b. Test that mpd scripts work:
mpd --help
3. Try the alternative way of creating a ring:
a. In rck00, issue:
mpd --daemon
b. Again in rck00, issue:
mpdtrace -l
And take note of the port (output format is rck00_<port> (192.168.0.1)).
c. On all cores, issue the command:
mpd --daemon --host=rck00 --port=<port from b>
d. In rck00, issue:
mpdtrace -l
to verify that the ring was created.
Hope this helps.
- Isaías
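Steps 3b and 3c above can be sketched as follows. This is only an illustration: the helper names mpd_port and join_cmd are made up, the sample line is copied from output posted in this thread, and the join command is printed rather than executed.

```shell
#!/bin/sh
# Extract the port from one line of `mpdtrace -l` output,
# whose format is "<host>_<port> (<ip>)".
mpd_port() {
    printf '%s\n' "$1" | sed -n 's/^[^_]*_\([0-9][0-9]*\) .*/\1/p'
}

# Build the command a core runs to join the ring started on rck00.
join_cmd() {
    printf 'mpd --daemon --host=rck00 --port=%s\n' "$1"
}

line='rck00_49084 (192.168.26.1)'   # sample mpdtrace -l line from this thread
join_cmd "$(mpd_port "$line")"
# prints: mpd --daemon --host=rck00 --port=49084
```

On a live system the line would come from `mpdtrace -l` on rck00, and the printed command is what gets broadcast to the cores.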
Has this been resolved? The instructions we used to use to run rckmpi don't work any longer ... at least for me.
Here is the error I get ...
rck00:/root # mpdboot --totalnum=48 --file=mpi.hosts --maxbranch=1 --verbose
running mpdallexit on rck00
LAUNCHED mpd on rck00 via
RUNNING: mpd on rck00
LAUNCHED mpd on rck01 via rck00
mpdboot_rck00 (handle_mpd_output 420): from mpd on rck01, invalid port info:
no_port
I successfully went through all the other rckmpi steps. Before trying to run mpdboot, I broadcast the following from rck00:
cp -a /shared/tekubasx/rckmpi/python/lib/* /usr/lib
cp -a /shared/tekubasx/rckmpi/zlib/lib/* /usr/lib
cp -a /shared/tekubasx/rckmpi/libssl/lib/* /usr/lib
cp -a /shared/tekubasx/rckmpi/python/bin/* /usr/bin
Then I can see
rck00:rckmpi # python --version
Python 2.6.5+
rck00:rckmpi #
on all cores. So Python works, a good sign. Then I did
cp -a /shared/tekubasx/install/rckmpi/bin/* /usr/bin
cp -a /shared/tekubasx/install/rckmpi/lib/* /usr/lib
cp -a /shared/tekubasx/rckmpi/conf/mpich/mpd.conf /etc
Then I stopped broadcasting on rck00 and did the mpdboot and got the error.
I do see
rck00:rckmpi # mpdtrace -l
rck00_49084 (192.168.26.1)
Well, here's something interesting. I put RCK MPI on a 1.3.0 system (not many of those left and the one I found won't last for long) but on sccKit 1.3.0, mpdboot works just fine.
root@rck00:~> mpdboot --totalnum=48 --file=mpi.hosts --maxbranch=1
root@rck00:~> mpdtrace -l
rck00_59676 (192.168.0.1)
rck01_57624 (192.168.0.2)
rck02_50782 (192.168.0.3)
rck03_53645 (192.168.0.4)
::
rck43_51985 (192.168.0.44)
rck44_35423 (192.168.0.45)
rck45_44521 (192.168.0.46)
rck46_35245 (192.168.0.47)
rck47_50624 (192.168.0.48)
root@rck00:~>
So the problems with mpdboot discussed in this thread seem to have arisen with 1.4.0 and eMAC. The system I saw my problem on was 1.4.1.2 and eMAC, and that configuration is stable. So my suspicion is that our SVN either doesn't have the very latest RCK MPI or our RCK MPI needs an update.
Isaias, what do you think?
Hello All, sorry for having disappeared from this board. The problem was resolved for me when I followed all the steps given as replies in this post with a little trial and error. I manually formed the complete ring and had to copy the required files to every core that I ended up using. Finally I was able to run the program I intended to with RCK MPI.
@Ted: I will check and see which version is running on the machine I had access to and let you guys know.
Again, thanks for all the helpful comments received here.
Hi Ted,
Which kernel version are you using?
As a workaround on the newer kernel, you may want to start the MPD daemon ring by broadcasting (as described in my previous post) instead of with mpdboot.
On the newer kernel there is higher latency for inter-core communication through sockets, and mpdboot may fail to initialize the MPD ring.
- Isaías
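After forming the ring by broadcasting, one way to confirm that every core joined is to count the entries in the `mpdtrace -l` output. A small sketch, assuming the output format shown earlier in the thread and 48 cores; ring_size is a made-up helper name.

```shell
#!/bin/sh
# Count ring members in `mpdtrace -l` output read from stdin.
# Each member appears as "<rckNN>_<port> (<ip>)" on its own line.
ring_size() {
    grep -c '^rck[0-9][0-9]*_[0-9][0-9]* '
}

# Example with two lines copied from output posted in this thread.
printf 'rck00_59676 (192.168.0.1)\nrck01_57624 (192.168.0.2)\n' | ring_size
# prints: 2
```

On rck00 this would be used as `mpdtrace -l | ring_size`, and the ring is complete when the result is 48.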

