C H A P T E R  6

Running Programs With mpirun in Distributed Resource Management Systems

This chapter describes the options to the mpirun command that are used for distributed resource management, and provides instructions for each resource manager.


mpirun Options for Third-Party Resource Manager Integration

ORTE is compatible with a number of other launchers, including rsh/ssh, Sun Grid Engine, and PBS.



Note - Open MPI itself supports other third-party launchers, such as SLURM and Torque. However, these launchers are not currently supported in Sun HPC ClusterTools software. To use them, you must download the Open MPI source, then compile and link it with the libraries for those launchers.
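If you do build Open MPI from source for this purpose, the relevant configure flags are --with-tm (for Torque/PBS) and --with-slurm. The following is a hedged sketch only; the install prefix and the Torque installation path are assumptions that you would adjust for your site.

```shell
# Hedged sketch: configure Open MPI from source with Torque and
# SLURM launcher support. /opt/openmpi and /opt/torque are assumed
# paths; substitute your own.
./configure --prefix=/opt/openmpi \
            --with-tm=/opt/torque \
            --with-slurm
make all install
```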


Checking Your Open MPI Configuration

To see whether your Open MPI installation has been configured for use with the third-party resource manager you want to use, issue the ompi_info command and pipe the output to grep. The following examples show how to use ompi_info to check for the desired third-party resource manager.


procedure icon  To Check for rsh/ssh

To see whether your Open MPI installation has been configured to use the rsh/ssh launcher:


% ompi_info | grep rsh
MCA plm: rsh (MCA v2.0, API v2.0, Component v1.3)


procedure icon  To Check for PBS/Torque

To see whether your Open MPI installation has been configured to use the PBS/Torque launcher:


% ompi_info | grep tm
MCA ras: tm (MCA v2.0, API v2.0, Component v1.3)
MCA plm: tm (MCA v2.0, API v2.0, Component v1.3)


procedure icon  To Check for Sun Grid Engine

To see whether your Open MPI installation has been configured to use Sun Grid Engine:


% ompi_info | grep gridengine
MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)
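The three checks above can also be combined into a single loop. This sketch assumes ompi_info is on your PATH; a component that prints no MCA line was not built into the installation.

```shell
# Check all three launcher components in one pass.
for comp in rsh tm gridengine; do
    echo "== $comp =="
    ompi_info | grep " $comp "
done
```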


Running Parallel Jobs in the PBS Environment

If your Open MPI environment is set up to include PBS, Open MPI automatically detects when mpirun is running within PBS and executes properly.

First, reserve resources by invoking the qsub command with the -l option. The -l option specifies the number of nodes and the number of processes per node. For example, this command sequence reserves four nodes with four processes per node for the job myjob.sh:


% qsub -l nodes=4:ppn=4 myjob.sh

When you enter the PBS environment, you can launch an individual job or a series of jobs with mpirun. The mpirun command launches the job using the node and process information from PBS. The resource information is accessed using the tm calls provided by PBS; hence, tm is the name used to identify the module in ORTE. The job ranks are children of PBS, not ORTE.
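Inside the PBS environment, the reserved nodes and slots are listed in the file named by the PBS_NODEFILE environment variable, with one hostname per reserved slot. The following sketch summarizes an allocation; the fallback sample file exists only so the snippet can be tried outside PBS, and its hostnames are illustrative.

```shell
# Inside a PBS job, $PBS_NODEFILE names a file listing one hostname
# per reserved slot (a host is repeated once per process on it).
# Outside PBS the variable is unset, so this sketch falls back to a
# small sample file to stay runnable.
if [ -z "$PBS_NODEFILE" ]; then
    PBS_NODEFILE=/tmp/sample_nodefile
    printf 'mynode1\nmynode1\nmynode2\nmynode2\n' > "$PBS_NODEFILE"
fi

total_slots=$(wc -l < "$PBS_NODEFILE")
unique_nodes=$(sort -u "$PBS_NODEFILE" | wc -l)

echo "total slots:  $total_slots"
echo "unique nodes: $unique_nodes"
```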

You can run an ORTE job within the PBS environment in two different ways: interactive and scripted.


procedure icon  To Run an Interactive Job in PBS

1. Enter the PBS environment interactively with the -I option to qsub, and use the -l option to reserve resources for the job.

Here is an example.


% qsub -l nodes=2:ppn=2 -I 

The command sequence shown above enters the PBS environment and reserves two nodes with two processes per node for the job. Here is the output:


qsub: waiting for job 20.mynode to start
qsub: job 20.mynode ready
Sun Microsystems Inc. SunOS 5.10 Generic June 2006
pbs%

2. Launch the mpirun command.

Here is an example that launches the hostname command with a verbose output:


pbs% /opt/SUNWhpc/HPC8.2.1c/sun/bin/mpirun -np 4 -mca plm_tm_verbose 1 hostname

The output shows the hostname program being launched on nodes mynode1 and mynode2, with one line of output per rank:


% /opt/SUNWhpc/HPC8.2.1c/sun/bin/mpirun -np 4 -mca plm_tm_verbose 1 hostname
[hostname1:09064] plm:tm: launching on node mynode1
[hostname2:09064] plm:tm: launching on node mynode2
hostname2
hostname1
hostname2
hostname1 


procedure icon  To Run a Batch Job in PBS

1. Write a script that calls mpirun.

In the following examples, the script is called myjob.csh. The system is called mynode. Here is an example of the script.


#!/bin/csh
 
/opt/SUNWhpc/HPC8.2.1c/sun/bin/mpirun -np 4 -mca plm_tm_verbose 1 hostname

2. Enter the PBS environment and use the -l option to qsub to reserve resources for the job.

Here is an example of how to use the -l option with the qsub command.


% qsub -l nodes=2:ppn=2 myjob.csh 

This command enters the PBS environment and reserves two nodes with two processes per node for the job that will be launched by the script named myjob.csh.

Here is the output to the script myjob.csh.


% more myjob.csh.*
::::::::::::::
myjob.csh.e2365
::::::::::::::
::::::::::::::
myjob.csh.o2365
::::::::::::::
Warning: no access to tty (Bad file number).
Thus no job control in this shell.
Sun Microsystems Inc.   SunOS 5.10      Generic January 2005
hostname5
hostname4
hostname5
hostname4

After the job finishes, it generates two output files: myjob.csh.o2365, which contains the standard output, and myjob.csh.e2365, which contains any error output.

As the output shows, the myjob.csh script calls mpirun, which launches the hostname program on each of the reserved nodes.
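As a variation, the resource request can be embedded in the script itself as #PBS directives, so that no -l option is needed on the qsub command line. This is a hedged sketch; the directive values shown are illustrative.

```shell
#!/bin/csh
# Hedged sketch: myjob.csh rewritten with the resource request as
# #PBS directives inside the script rather than on the qsub line.
#PBS -l nodes=2:ppn=2
#PBS -N myjob

/opt/SUNWhpc/HPC8.2.1c/sun/bin/mpirun -np 4 -mca plm_tm_verbose 1 hostname
```

The script is then submitted with qsub myjob.csh alone.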


Running Parallel Jobs in the Sun Grid Engine Environment

Sun Grid Engine 6.1 is the supported version of Sun Grid Engine for Sun HPC ClusterTools 8.2.1c.

Before you can run parallel jobs, make sure that you have defined the parallel environment (PE) and the queue.

Defining Parallel Environment (PE) and Queue

A PE must be defined for all the queues in the Sun Grid Engine cluster that are to be used as ORTE nodes. Each ORTE node should be installed as a Sun Grid Engine execution host. To allow ORTE to submit a job from any ORTE node, configure each ORTE node as a submit host in Sun Grid Engine.

Each execution host must be configured with a default queue. In addition, the default queue must be set up with the same number of slots as the number of processors on the host.


procedure icon  To Use PE Commands

single-step bullet  To display a list of available PEs (parallel environments), type the following:


% qconf -spl
make

single-step bullet  To define a new PE, you must have Sun Grid Engine manager or operator privileges. The following command opens a template for the new PE in a text editor; this example creates a PE named orte.


% qconf -ap orte

single-step bullet  To modify an existing PE, use this command to invoke the default editor:


% qconf -mp orte

single-step bullet  To show a particular PE that has been defined, type this command:


% qconf -sp orte
pe_name           orte
slots             8
user_lists        NONE
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min

The value NONE in user_lists and xuser_lists means enable everybody and exclude nobody.

The value of control_slaves must be TRUE; otherwise, qrsh exits with an error message.

The value of job_is_first_task must be FALSE or the job launcher consumes a slot. In other words, mpirun itself will count as one of the slots and the job will fail, because only n-1 processes will start.
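The slot arithmetic behind job_is_first_task can be sketched as follows, using the 8-slot orte PE above as an illustration; the numbers are only for demonstration.

```shell
# Slot arithmetic behind job_is_first_task, using the 8-slot orte
# PE shown above as an illustration.
slots=8

# job_is_first_task TRUE: mpirun itself occupies one slot,
# so only slots - 1 ranks can start.
ranks_if_true=$((slots - 1))

# job_is_first_task FALSE: every slot is available for a rank.
ranks_if_false=$slots

echo "TRUE:  $ranks_if_true ranks can start"
echo "FALSE: $ranks_if_false ranks can start"
```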


procedure icon  To Use Queue Commands

single-step bullet  To show all the defined queues, type the following command:


% qconf -sql
all.q

The queue all.q is set up by default in Sun Grid Engine.

single-step bullet  To add the orte PE from the example in the previous section to the existing queue, type the following:


% qconf -mattr queue pe_list "orte" all.q

You must have Sun Grid Engine manager or operator privileges to use this command.
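To confirm that the PE is now attached to the queue, you can inspect the queue configuration. This sketch assumes the Sun Grid Engine commands are on your PATH.

```shell
# Show the pe_list attribute of all.q; after the change above it
# should include orte.
qconf -sq all.q | grep pe_list
```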

Submitting Jobs Under Sun Grid Engine Integration

There are two ways to submit jobs under Sun Grid Engine integration: interactive mode and batch mode. The instructions in this section describe how to submit jobs in batch mode. For information about how to use interactive mode, see Chapter 5.


procedure icon  To Set the Interactive Display

Before you submit a job, set your DISPLAY environment variable, if you have not already done so, so that the interactive window appears on your desktop.

For example, if you are working in the C shell, type the following command:


% setenv DISPLAY desktop:0.0


procedure icon  To Submit Jobs in Batch Mode



Note - Before you can use the parallel environment, make sure that you have set it up before running the job. See Defining Parallel Environment (PE) and Queue for more information.


1. Create the script. In this example, mpirun is embedded within a script that is submitted to qsub.


mynode4% cat sge.csh
#!/usr/bin/csh
 
# set PATH: including location of MPI program to be run
setenv PATH /opt/SUNWhpc/HPC8.2.1c/sun/examples/connectivity:${PATH}
 
mpirun -np 4 -mca ras_gridengine_verbose 100 connectivity.sparc -v 



Note - The -mca ras_gridengine_verbose 100 setting is used in this example only to show that Sun Grid Engine is being used. It is not needed for normal operation.


2. Next, source the Sun Grid Engine environment variables from the settings.csh file. In this example, $SGE_ROOT is set to /opt/sge:


% source $SGE_ROOT/default/common/settings.csh

3. To start the batch (or scripted) job, specify the parallel environment, the number of slots, and the user executable.


% qsub -pe orte 2 sge.csh 
your job 305 ("sge.csh") has been submitted

Since this is submitted as a batch job, you would not expect to see output at the terminal. If no output location is specified, Sun Grid Engine redirects the output to your home directory.

The job creates the output files there. The file named <job_name>.o<job_id> contains the standard output. The file named <job_name>.e<job_id> contains the error output. If the job executes normally, the error output file is empty.
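The naming convention can be sketched as follows; the job name and ID here are illustrative.

```shell
# Sketch of how Sun Grid Engine composes the per-job output file
# names: <job_name>.o<job_id> for standard output and
# <job_name>.e<job_id> for error output. Values are illustrative.
job_name=sge.csh
job_id=866

stdout_file="${job_name}.o${job_id}"
stderr_file="${job_name}.e${job_id}"

echo "$stdout_file"
echo "$stderr_file"
```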

The following example lists the files produced by a job called sge.csh with the job ID number 866:


% ls -rlt ~ | tail
-rw-r--r--   1 joeuser   mygroup       0 Jan 16 16:42 sge.csh.po866
-rw-r--r--   1 joeuser   mygroup       0 Jan 16 16:42 sge.csh.pe866
-rw-r--r--   1 joeuser   mygroup       0 Jan 16 16:42 sge.csh.e866
-rw-r--r--   1 joeuser   mygroup     194 Jan 16 16:42 sge.csh.o866

By default, the output files are located in your home directory, but you can use Sun Grid Engine software to change the location of the files, if desired.



Note - In most cases, you do not need to change the values set in the gridengine MCA parameters. If you run into difficulty and want to change the values for debugging purposes, the option is available. For more information about MCA parameters, see Chapter 7.



procedure icon  To See a Running Job

single-step bullet  Type the following command:


% qstat -f


procedure icon  To Delete a Running Job

single-step bullet  Type the following command:


% qdel job-number

where job-number is the number of the job you want to delete.

For more information about Sun Grid Engine commands, refer to the Sun Grid Engine documentation.

rsh Limitations



Note - This issue affects both rsh and the Sun Grid Engine program qrsh. qrsh uses rsh to launch jobs.


If you are using rsh or qrsh as the job launcher on a large cluster with hundreds of nodes, rsh might show the following error messages when launching jobs on the remote nodes:


rcmd: socket: Cannot assign requested address
rcmd: socket: Cannot assign requested address
rcmd: socket: Cannot assign requested address
[node0:00749] ERROR: A daemon on node m2187 failed to start as expected.
[node0:00749] ERROR: There may be more information available from
[node0:00749] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[node0:00749] ERROR: If the problem persists, please restart the
[node0:00749] ERROR: Grid Engine PE job

This indicates that rsh is running out of sockets when launching the job from the head node.

Using rsh as the Job Launcher

If you are using rsh as your job launcher, you can switch to ssh by adding the following to your mpirun command line:


-mca plm_rsh_agent ssh
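Put together with the earlier examples, a full command line looks like the following; the installation path and process count are taken from the examples above.

```shell
# Launch with ssh instead of rsh as the remote agent.
/opt/SUNWhpc/HPC8.2.1c/sun/bin/mpirun -mca plm_rsh_agent ssh -np 4 hostname
```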

Using Sun Grid Engine as the Job Launcher

If you are using Sun Grid Engine version 6.1 or earlier as your job launcher, you can modify the Sun Grid Engine configuration to allow Sun Grid Engine to use ssh instead of rsh to launch tasks on the remote nodes. The following web site describes how to perform this workaround:

http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html

Note that this workaround does not properly track resource usage, nor does it allow proper job accounting. Sun Grid Engine tracks resource usage by attaching an extra group ID when it launches tasks over the remote connection.

Sun Grid Engine version 6.2 fixes this issue by not using rsh to start jobs on remote nodes. Instead, it uses native Interactive Job Support (IJS), which removes any dependency on rsh, ssh, or telnet. It is recommended that you upgrade to the latest available version of Sun Grid Engine.


For More Information

For more information about using the mpirun command to perform batch processing, refer to the mpirun man page and to the documentation for your resource manager.