mmrun: Simple job distributing using MPI
by Kasper Peeters, kasper.peeters (at) aei.mpg.de
Introduction
The mmrun tool is a very simple command-line
tool to run a series of (non-MPI) programs on machines connected
through an MPI layer. You give it a list of programs, and mmrun will
spawn each of these in turn over as many nodes as you have requested
from the MPI layer. It is useful, for instance, if you want to run one
program many times, but with different values of the input
parameters.
You do not need to be root in order to use
mmrun. You only need SSH access to the machines in your
cluster.
The programs that you want to run with mmrun do
not have to be compiled with MPI. An explicit design goal was
to require no changes whatsoever to existing programs, but still be
able to distribute jobs easily over many processors. MPI is only used
by the master to spawn processes on the clients. A typical situation
in which mmrun is thus useful is when you have a long list of
independent programs to be executed, and have access to a set of
machines which can be reached through SSH.
No other networking tools are necessary except for
SSH access (key-based) to all machines and a working LAM installation (other MPI
implementations can probably also be made to work, but LAM is very
simple to install). All jobs which you want to run have to be
executable on all machines, so this tool is only useful if you have
access to a homogeneous cluster of machines.
Progress can be monitored by pointing a web browser
at the simple web server built into mmrun.
The mmrun program is licensed under the GNU General Public License.
Downloading and installing
In order to compile
mmrun you need the following libraries:
Then do
tar zxf mmrun-1.0.tar.gz
cd mmrun-1.0/src
make
in order to compile
mmrun (you may need to change a couple of
paths in the Makefile so the compiler finds your mpic++ and the ehs
and mpi libraries). Install the binary somewhere convenient (autoconf
stuff will be added sometime soon).
Make sure that you have a machine file for LAM,
something like
# This is an example of a LAM machine
# file, listing all machines in the cluster.
#
machine1.domain.org
machine2.domain.org
machine3.domain.org
machine4.domain.org
Start LAM by doing
lamboot [machinefile]
and make sure that this works fine (i.e. that LAM manages to start its
daemons on all machines in the machine file without asking for an SSH
password).
The input file
Each line of the input file is interpreted as one 'command', and
executed by the users' default shell on the computing node. It can
contain multiple commands separated by semi-colons. A typical
example: (to create a 7-frame movie with
PoVray)
# This is an example input file for mmrun.
# Each line contains one command (or sequence
# of commands separated by semicolons) which
# will be executed by mmrun.
#
povray +SF1 +EF1 +Imovie.pov +Omovie_frames
povray +SF2 +EF2 +Imovie.pov +Omovie_frames
povray +SF3 +EF3 +Imovie.pov +Omovie_frames
povray +SF4 +EF4 +Imovie.pov +Omovie_frames
povray +SF5 +EF5 +Imovie.pov +Omovie_frames
povray +SF6 +EF6 +Imovie.pov +Omovie_frames
povray +SF7 +EF7 +Imovie.pov +Omovie_frames
Comments can be included by prefixing them with a '#' sign. Make sure
that all commands listed in the input file can run on any of the nodes
in the machine file.
Running
Make sure that LAM is running (see "installing" above). Also make sure
that all commands listed in the input file can be run on any of the
nodes (mmrun will only run these command lines, it does not take care
of installing or copying software).
To set your cluster of machines to work on the tasks described in the
mmrun input file, do
mpirun -np [number of working nodes + 1] -f mmrun [inputfile]
where "number of working nodes" is the number of nodes on which LAM is
running (or fewer, of course, if you do not want to use all
nodes). The "+1" is for the master node, which consumes very few
resources and can easily be run together with a compute job on a
single CPU.
Currently, all output from all nodes goes (without separators) to
stdout on the master node. All jobs run as if you had manually
ssh'ed to a client and started the program there; if you want these
programs to create files which are to be visible on the master node,
you have to write to a shared filesystem.
Monitoring
Once
mmrun is running, you can monitor it by pointing a web
browser at port 12345 on the machine on which mpirun was called (you
can change this port number by giving mmrun the option "--port [number]"). This
interface is view-only; the
mmrun program cannot be influenced
through this web interface. For the machine and input files shown
above, the command
mpirun -np 4 -f mmrun jobs.mmrun
yields, after the first four jobs have completed, a monitor screen as shown below,
jobs
| command | rank | status | started on | ended on |
| povray +SF1 +EF1 +Imovie.pov +Omovie_frames | 1 | completed | 15:52:05 | 15:52:08 |
| povray +SF2 +EF2 +Imovie.pov +Omovie_frames | 2 | completed | 15:52:05 | 15:52:08 |
| povray +SF3 +EF3 +Imovie.pov +Omovie_frames | 1 | completed | 15:52:08 | 15:52:11 |
| povray +SF4 +EF4 +Imovie.pov +Omovie_frames | 2 | completed | 15:52:08 | 15:52:11 |
| povray +SF5 +EF5 +Imovie.pov +Omovie_frames | 1 | running | 15:52:11 | |
| povray +SF6 +EF6 +Imovie.pov +Omovie_frames | 2 | running | 15:52:11 | |
| povray +SF7 +EF7 +Imovie.pov +Omovie_frames | | not running | | |
machines
| machine | rank | status | load avg |
|---|
| machine1.domain.org | 0 | 0.93 | master |
| machine2.domain.org | 1 | 0.87 | running |
| machine3.domain.org | 2 | 0.99 | running |
| machine4.domain.org | 3 | 2.30 | overloaded |
Note that machine4 is recognised to be loaded and waiting for the load
level to drop.
The screen updates automatically every second to reflect the current
status (note, however, that the machine status summary is not updated
during the time that a command runs, so the load average reflects the
status just before the command was started).
Startup options
There are a few startup options:
- --port=
- Determines the port on which the internal web server is run.
- --maxload=
- Determines the maximum load average that a slave node is
allowed to have; if the load average goes higher than this the slave
will be instructed to wait and not will receive any additional jobs
until the load level drops again.
Coming up / todo
There are certainly some improvements left to
incorporate. One is the ability to start processes only when the load
level drops below a given number, or when there are no interactive
users logged in to a machine. A related useful option would be to
temporarily sleep a process on a client when the interactive load goes
up (so that you can start a process at night on a computer of your
office mate but have it automatically sleep in the morning, when the
machine is needed for interactive). Finally, a clean way to separate
the stdout streams of all processes would probably be useful.
If you have any other suggestions, feel free to
contact me by email (kasper.peeters (at) aei.mpg.de).

$Id: index.html,v 1.6 2005/11/04 16:04:11 kp229 Exp $