Parallel tasks without MPI
You may already have read the section on using a Message Passing
Interface (MPI) to communicate between parallel tasks, and how this can
be used in the cluster.
Sometimes tasks can usefully be performed in parallel without the
need for MPI, and for these the pbsdsh
command is useful. Here is an example of an 8-processor job using pbsdsh:
    #!/bin/sh
    #PBS -l nodes=4:ppn=2
    #PBS -l walltime=5:00:00,cput=20:00:00
    #PBS -j oe
    .... initial processing ....
    pbsdsh -v $PBS_O_WORKDIR/myscript
    .... final processing ....
Since the same "myscript" is run on each of the processor cores of a
job, that script needs to be clever enough to decide what its role is.
Of course, if the task is identical on every processor core, then
that's simple. But in the case where each processor core should be
doing a different task, then you can make use of an environment
variable called $PBS_VNODENUM. This variable takes a value from 0 to
c-1, where c is the number of processor cores allocated to the job, and
is set by the torque system when it invokes the pbsdsh'd script on each
core. So if you have pre-prepared several lower-level scripts named
mysub.0 to mysub.7, your file "myscript" might contain:
    #!/bin/sh
    cd $PBS_O_WORKDIR
    PATH=$PBS_O_PATH
    sh mysub.$PBS_VNODENUM
or, if you have pre-prepared a program myprog and a set of different
data-files, mydata.0 to mydata.7, for the tasks, then
    #!/bin/sh
    cd $PBS_O_WORKDIR
    PATH=$PBS_O_PATH
    myprog < mydata.$PBS_VNODENUM
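A further pattern, not shown above, is to branch inside the single script itself rather than pre-preparing separate mysub.N files. The sketch below assumes this structure; the commands in each branch are placeholders for your own tasks, and the fallbacks (such as defaulting $PBS_VNODENUM to 0) are only there so the sketch is self-contained outside a real job:

```shell
#!/bin/sh
# Sketch: dispatch on the core number that pbsdsh assigns via
# $PBS_VNODENUM. The echo commands stand in for real work.
run_task() {
    case "$1" in
        0)     echo "core 0: collate inputs" ;;
        [1-6]) echo "core $1: process chunk $1" ;;
        7)     echo "core 7: generate summary" ;;
    esac
}

# Fallbacks let this run outside a Torque job for testing.
cd "${PBS_O_WORKDIR:-.}"
PATH=${PBS_O_PATH:-$PATH}
run_task "${PBS_VNODENUM:-0}"
```

This keeps all the per-core logic in one file, at the cost of a longer script; for many distinct tasks the mysub.N approach above scales better.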
Let me know of other, innovative methods of using pbsdsh.
Note that there is also the variable $PBS_NODENUM, which takes a
unique value from 0 upwards for each different node (so 0 to 3 in the
above example), but observe that it is not as useful in this context as
$PBS_VNODENUM. There is also the variable $PBS_TASKNUM, which
is incremented before each task on each core is started.
Initial environment of a script invoked by pbsdsh
A script invoked by pbsdsh starts in a very basic environment: the
user's $HOME directory is defined and is the current directory, the
LANG variable is set to C, and the PATH is set to the basic /usr/local/bin:/usr/bin:/bin as
defined in a system-wide file pbs_environment. Nothing that would
normally be set up by a system shell profile or user shell profile is
defined, unlike the environment for the main job
script. To be positive about this, you could say that this
is very efficient, particularly if you use pbsdsh repeatedly in your
main job script, as it eliminates unnecessary overheads!
The first thing such a script is likely to need to do, therefore, is to
change directory to $PBS_O_WORKDIR, and to set the PATH to $PBS_O_PATH.
Be careful: this approach assumes that the environment in which you
submit the job is the one you want when it is
running. Alternatively, it might be sensible for the script to source a file containing all the
definitions of environment that your job script requires.
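One way to do that is for the main job script to snapshot its exported environment into a file before calling pbsdsh, and for each pbsdsh-invoked script to source that file. This is a sketch of that save-and-restore pattern; the file name job-env.sh and the variable MYJOB_DATADIR are assumptions for illustration, not Torque conventions:

```shell
#!/bin/sh
# Step 1 (main job script, before pbsdsh): snapshot exported variables.
# MYJOB_DATADIR is a made-up example of a setting your tasks need.
export MYJOB_DATADIR="${PBS_O_WORKDIR:-$PWD}/data"
export -p > job-env.sh

# Step 2 (inside each pbsdsh-invoked script): restore the snapshot.
# We unset the variable first purely to demonstrate the restore.
unset MYJOB_DATADIR
. ./job-env.sh
echo "MYJOB_DATADIR=$MYJOB_DATADIR"
```

In a real job the two steps live in different scripts, with the snapshot written somewhere shared such as $PBS_O_WORKDIR.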
Yet another choice is for the pbsdsh command in your main job script to
invoke your script via a shell,
like sh or bash, with or without the "-l" login-shell
option, so that it gives an initialised environment for each instance:
for example:
pbsdsh bash -l -c '$PBS_O_WORKDIR/myscript'
In detail, the initial environment of a command invoked by pbsdsh has
the following defined, listed alphabetically. Notice that this list of
variable names is
the same list as for a main job script (see the Torque details page),
except
that PBS_NODEFILE is not defined on secondary nodes.
ENVIRONMENT
HOME
LANG
PATH
PBS_ENVIRONMENT
PBS_JOBCOOKIE
PBS_JOBID
PBS_JOBNAME
PBS_MOMPORT
PBS_NODENUM
PBS_O_HOME
PBS_O_HOST
PBS_O_LANG
PBS_O_LOGNAME
PBS_O_MAIL
PBS_O_PATH
PBS_O_QUEUE
PBS_O_SHELL
PBS_O_WORKDIR
PBS_QUEUE
PBS_TASKNUM
PBS_VNODENUM
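You can verify this list empirically. The sketch below is a per-task report that a pbsdsh-invoked script could run; outside a job it simply notes that no PBS_ variables are set:

```shell
#!/bin/sh
# Sketch: print every PBS_-prefixed variable visible to this task,
# sorted alphabetically, so the actual environment can be compared
# against the documented list.
report_pbs_env() {
    env | grep '^PBS_' | sort
}

report_pbs_env || echo "(no PBS_ variables set - not running under Torque)"
```

Launched as `pbsdsh sh $PBS_O_WORKDIR/report.sh`, this would produce one report per allocated core in the job's output file.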
Questions of efficiency when running multi-core jobs
When considering running different processes on different nodes/cores
as part of a multi-core job, be aware that some processes may finish
well before others. Therefore the cores that those processes were using
will be idle until all the
pbsdsh-invoked processes have finished. Your job effectively reserves
all the cores you requested for the total duration of the job: busy or
not.
Some inefficiency is
inevitable in this sort of parallel environment if the parts running
in parallel are not identical, and it can make the cluster as a whole
less efficient. Your user and group fair-shares are based on core
wall-time occupancy, not on actual processing, so idle cores are still
charged in fair-share terms, and will count against you and your
group for future jobs. So do not devise parallel jobs where there is
little benefit in doing so: if the tasks can perfectly adequately run
as multiple single-core jobs, run them that way instead.
L.S.Lowe