CHAPTER 5
Running Programs With the mpirun Command
This chapter describes the general syntax of the mpirun command, lists the command’s options, and shows some of the tasks you can perform with the mpirun command.
Note - The mpirun, mpiexec, and orterun commands all perform the same function, and they can be used interchangeably. The examples in this manual all use the mpirun command.
The mpirun command controls several aspects of program execution in Open MPI. mpirun uses the Open Run-Time Environment (ORTE) to launch jobs. If you are running under distributed resource manager software, such as Sun Grid Engine or PBS, ORTE launches the resource manager for you.
If you are using rsh/ssh instead of a resource manager, you must use a hostfile or host list to identify the hosts on which the program will be run. When you issue the mpirun command, you specify the name of the hostfile or host list on the command line; otherwise, mpirun executes all the copies of the program on the local host, in round-robin sequence by CPU slot. For more information about hostfiles and their syntax, see Specifying Hosts By Using a Hostfile.
Both MPI programs and non-MPI programs can use mpirun to launch the user processes.
Some example programs are provided in the /opt/SUNWhpc/HPC8.2.1c/sun/examples directory for you to compile and run as sanity tests.
The following example shows the general SPMD (Single Program, Multiple Data) syntax for mpirun:
For an MPMD (Multiple Program, Multiple Data) application, the command syntax appears similar to the following:
This command starts x copies of the program program1, and then starts y copies of the program program2.
The options control the behavior of the mpirun command. They might or might not be followed by arguments.
Use the mpirun -h (or --help) option to see a complete list of supported options.
Use the -x args option to export environment variables to the processes that mpirun starts, where args names the variable and, optionally, its value. If you do not specify a value on the mpirun command line, the value is inherited from the current environment. For example:
% setenv DISPLAY myworkstation:0
% mpirun -x DISPLAY -x LD_LIBRARY_PATH=/opt/SUNWhpc/HPC8.2.1c/sun/lib -np 4 a.out
The mpirun command uses MCA (Modular Component Architecture) parameters to pass environment variables. To specify an MCA parameter, use the -mca (or --mca) option with the mpirun command, followed by the name of the parameter and the value you want to set. For example:
% mpirun --mca mpi_show_handle_leaks 1 -np 4 a.out
This sets the MCA parameter mpi_show_handle_leaks to the value of 1 before running the program named a.out with four processes. In general, the format used on the command line is --mca parameter_name value.
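When mpirun is driven from a script, the same command line can be assembled programmatically. The following Python sketch is only an illustration: build_mpirun_command is a hypothetical helper, not part of Open MPI, and it assumes nothing beyond the --mca, -x, and -np options described in this chapter.

```python
import shlex


def build_mpirun_command(program, np=1, mca=None, env=None, program_args=()):
    """Assemble an mpirun argument list (hypothetical helper, not an Open MPI API)."""
    argv = ["mpirun"]
    # Each MCA parameter becomes "--mca <parameter_name> <value>".
    for name, value in (mca or {}).items():
        argv += ["--mca", name, str(value)]
    # Each exported variable becomes "-x NAME" (inherit the current value)
    # or "-x NAME=value" (set the value explicitly).
    for name, value in (env or {}).items():
        argv += ["-x", name if value is None else f"{name}={value}"]
    argv += ["-np", str(np), program, *program_args]
    return argv


cmd = build_mpirun_command("a.out", np=4, mca={"mpi_show_handle_leaks": 1})
print(shlex.join(cmd))
```

The resulting list can be handed to a process launcher as is; keeping the arguments as a list rather than one string avoids shell quoting problems.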
Note - There are multiple ways to specify the values of MCA parameters. This chapter discusses how to use them from the command line with the mpirun command. MCA parameters are discussed in more detail in Chapter 7.
The examples in this section show how to use the mpirun command options to specify how and where the processes and programs run.
TABLE 5-1 shows the process control options for the mpirun command. The procedures that follow the table explain how these options are used and show the syntax for each.
To Run a Program With Default Settings
To run the program with default settings, enter the command and program name, followed by any required arguments to the program:
To Run Multiple Processes
By default, an MPI program started with mpirun runs as one process.
To run the program as multiple processes, use the -np option:
When you request multiple processes, ORTE attempts to start the number of processes you request, regardless of the number of CPUs available to run those processes. For more information, see Oversubscribing Nodes.
To Direct mpirun By Using an Appfile
You can use a type of text file (called an appfile) to direct mpirun. The appfile specifies the nodes on which to run, the number of processes to launch on each node, and the programs to execute in a parallel application. When you use the --app option, mpirun takes all its direction from the contents of the appfile and ignores any other nodes or processes specified on the command line.
For example, the following shows an appfile called my_appfile:
To use the --app option with the mpirun command, specify the name and path of the appfile on the command line. For example:
This command produces the same results as running a.out and b.out from the command line.
When you issue the mpirun command from the command line, ORTE reads the number of processes to be launched from the -np option, and then determines where the processes will run.
To determine where the processes will run, ORTE uses the following criteria:
You specify the available hosts to Open MPI in three ways:
The hostfile lists each node, the available number of slots, and the maximum number of slots on that node. For example, the following listing shows a simple hostfile:
In this example file, node0 is a single-processor machine and node1 has two slots. node2 and node3 both have four slots. On node2, slots and max_slots have the same value (4), so the processors on node2 cannot be oversubscribed. The four slots on node3 can be oversubscribed, up to a maximum of 20 processes.
When you use this hostfile with the --nooversubscribe option (see Oversubscribing Nodes), mpirun assumes that the value of max_slots for each node in the hostfile is the same as the value of slots for each node. It overrides the values for max_slots set in the hostfile.
Unless you explicitly specify max_slots, Open MPI assumes that the maximum number of slots on a node is unlimited. Resource managers also do not specify the maximum number of available slots.
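The hostfile rules described above can be sketched in Python. The hostfile text below is reconstructed from the description of node0 through node3 and is an assumption about the exact layout; the use of # for comments and both helper functions are likewise hypothetical and only model the slots and max_slots semantics explained in this section.

```python
def parse_hostfile(text):
    """Parse hostfile text into {host: (slots, max_slots)}.

    max_slots is None when the hostfile does not set it (no upper limit).
    """
    hosts = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        name, *fields = line.split()
        settings = dict(field.split("=", 1) for field in fields)
        slots = int(settings.get("slots", 1))  # assumed default: one slot
        max_slots = settings.get("max_slots")
        hosts[name] = (slots, int(max_slots) if max_slots is not None else None)
    return hosts


def apply_nooversubscribe(hosts):
    """Model --nooversubscribe: max_slots is forced to equal slots on every node."""
    return {name: (slots, slots) for name, (slots, _) in hosts.items()}


EXAMPLE_HOSTFILE = """\
node0
node1 slots=2
node2 slots=4 max_slots=4
node3 slots=4 max_slots=20
"""

hosts = parse_hostfile(EXAMPLE_HOSTFILE)
print(hosts)
print(apply_nooversubscribe(hosts))
```

With the second call, the limit of 20 on node3 is replaced by 4, which mirrors how --nooversubscribe overrides the max_slots values set in the hostfile.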
You can use the --host option to mpirun to specify the hosts you want to use on the command line in a comma-delimited list. For example, the following command directs mpirun to run a program called a.out on hosts a, b, and c:
Open MPI assumes that the default number of slots on each host is one, unless you explicitly specify otherwise.
To specify multiple slots for a host with the -host option, repeat the host name on the command line once for each slot you want to use. For example:
If you are using a resource manager such as Sun Grid Engine or PBS, the resource manager maintains an accurate count of available slots.
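The counting rule for repeated host names in a --host list can be illustrated with a few lines of Python. The host list a,a,b,c is a made-up value, and slots_from_host_list is a hypothetical helper that only mirrors the rule stated above (one slot per occurrence of a host name).

```python
from collections import Counter


def slots_from_host_list(host_list):
    """Count how many slots a comma-delimited --host list assigns to each host."""
    return dict(Counter(name for name in host_list.split(",") if name))


print(slots_from_host_list("a,a,b,c"))
```

In this sketch host a is listed twice and therefore receives two slots, while b and c receive one each.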
You can also use the --host option in conjunction with a hostfile to exclude any nodes not explicitly specified on the command line. For example, assume that you have the following hostfile called my_hosts:
Suppose you issue the following command to run program a.out:
This command launches one instance of a.out on host c, but excludes the other hosts in the hostfile (a, b, and d).
Note - If you use these two options (--hostfile and --host) together, make sure that the host(s) you specify using the --host option also exist in the hostfile. Otherwise, mpirun exits with an error.
Scheduling more processes to run than there are available slots is referred to as oversubscribing. Oversubscribing a host is not recommended, because it might degrade performance.
mpirun has a --nooversubscribe option. This option implicitly sets the max_slots value (maximum number of available slots) to the same value as the slots value for each node, as specified in your hostfile. If the number of processes requested is greater than the slots value, mpirun returns an error and does not execute the command. This option overrides the value set for max_slots in your hostfile.
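The effect of --nooversubscribe on a request can be modeled as a simple check. The sketch below is a simplification under stated assumptions: it compares totals across all nodes rather than placing processes node by node, it treats max_slots as a hard limit, and check_request with its return strings is hypothetical rather than anything mpirun reports.

```python
def check_request(num_procs, hosts, nooversubscribe=False):
    """Classify a request against {host: (slots, max_slots)}.

    max_slots of None means that no upper limit was specified.
    Returns "fits", "oversubscribed", or "error".
    """
    total_slots = sum(slots for slots, _ in hosts.values())
    if num_procs <= total_slots:
        return "fits"
    if nooversubscribe:
        # max_slots is treated as equal to slots, so any excess is refused.
        return "error"
    limits = [max_slots for _, max_slots in hosts.values()]
    if None in limits or num_procs <= sum(limits):
        return "oversubscribed"
    return "error"


example_hosts = {"node2": (4, 4), "node3": (4, 20)}
print(check_request(8, example_hosts))
print(check_request(12, example_hosts))
print(check_request(12, example_hosts, nooversubscribe=True))
```

Under these assumptions, 12 processes oversubscribe node3 when oversubscription is allowed, but the same request is refused when --nooversubscribe is in effect.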
For more information about oversubscribing, see the following URL:
http://www.open-mpi.org/faq/?category=running#oversubscribing
ORTE uses two types of scheduling policies when it determines where processes will run:
This is the default scheduling policy for Open MPI. If you do not specify a scheduling policy, this is the policy that is used.
In by-slot scheduling, Open MPI schedules processes on a node until all of its available slots are exhausted (that is, all slots are running processes) before proceeding to the next node. In MPI terms, this means that Open MPI tries to maximize the number of adjacent ranks in MPI_COMM_WORLD on the same host without oversubscribing that host.
If you want to explicitly specify by-slot scheduling for some reason, there are two ways to do it:
1. Specify the --byslot option to mpirun. For example, the following command specifies the --byslot and --hostfile options:
The following example uses the -host option:
2. Set the MCA parameter rmaps_base_schedule_policy to the value slot. For example:
Note - The examples in this chapter set MCA parameters on the command line. For more information about the ways in which you can set MCA parameters, see Chapter 7. In addition, the Open MPI FAQ contains information about MCA parameters at the following URL:
http://www.open-mpi.org/faq/?category=tuning#setting-mca-params
The following output example shows the contents of a simple hostfile called my-hosts and the results of the mpirun command using by-slot scheduling.
In by-node scheduling, Open MPI schedules a single process on each node in a round-robin fashion (looping back to the beginning of the node list as necessary) until all processes have been scheduled. Nodes are skipped once their default slot counts are exhausted.
There are two ways to specify by-node scheduling:
The following output example shows the contents of the same hostfile used in the previous example and the results of the mpirun command using by-node scheduling.
In the examples in this section, node0 and node1 each have two slots. The diagrams show the differences in scheduling between the two methods.
By-slot scheduling for the two nodes can be represented as follows:
By-node scheduling for the same two nodes can be represented this way:
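A small simulation makes the difference between the two policies concrete. The Python sketch below is an illustration only: it models the rules as described in this section for node0 and node1 with two slots each, it stops when the default slot counts are exhausted (it does not model oversubscription), and both function names are hypothetical.

```python
def schedule_by_slot(nodes, num_procs):
    """Fill every slot on a node before moving to the next node.

    nodes is a list of (name, slots); the result maps rank (list index) to node.
    """
    placement = []
    for name, slots in nodes:
        for _ in range(slots):
            if len(placement) == num_procs:
                return placement
            placement.append(name)
    return placement


def schedule_by_node(nodes, num_procs):
    """Place one process per node in round-robin order, skipping full nodes."""
    remaining = dict(nodes)
    placement = []
    while len(placement) < num_procs and any(remaining.values()):
        for name, _ in nodes:
            if len(placement) == num_procs:
                break
            if remaining[name] > 0:
                remaining[name] -= 1
                placement.append(name)
    return placement


nodes = [("node0", 2), ("node1", 2)]
for policy in (schedule_by_slot, schedule_by_node):
    for rank, node in enumerate(policy(nodes, 4)):
        print(f"{policy.__name__}: rank {rank} -> {node}")
```

With four processes, by-slot scheduling puts ranks 0 and 1 on node0 and ranks 2 and 3 on node1, whereas by-node scheduling puts ranks 0 and 2 on node0 and ranks 1 and 3 on node1.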
Binding MPI processes to specific hardware processors can benefit performance in several ways:
While default process binding often benefits performance, it has the potential to induce undesirable side effects:
There are three methods for specifying process binding:
MPI processes are bound at the time that they are launched.
The following bind-to-* options were introduced in the Sun HPC ClusterTools 8.2.1 software release.
Each can be used with the process placement options:
You can use the -report-bindings option to get a report on how the processes are bound.
Note - See the mpirun man page for detailed descriptions of the command-line binding options.
You can produce the same process binding behavior as is available with mpirun options by setting MCA parameters in the ~/.openmpi/mca-params.conf configuration file.
The MCA parameter method offers the advantage of allowing you to associate environment variables with individual process binding settings. For example, if a configuration file included the following entries, processes would be bound to successive processor sockets if the node’s environment supports binding, but not otherwise:
orte_process_binding = [none|core|socket|board][:if-avail]
rmaps_base_schedule_policy = [slot|socket|board|node]
See Chapter 7 for more information on setting MCA parameters as environment variables or in text configuration files.
You can use the mpirun command’s rankfile option to specify detailed bindings of MPI processes. The syntax for this option is:
Note - See the mpirun man page for more information on the rankfile option.
Open MPI directs UNIX standard input to /dev/null on all processes except the rank 0 process of MPI_COMM_WORLD. The MPI_COMM_WORLD rank 0 process inherits standard input from mpirun. The node from which you invoke mpirun need not be the same as the node where the MPI_COMM_WORLD rank 0 process resides. Open MPI handles the redirection of the mpirun standard input to the rank 0 process.
Open MPI directs UNIX standard output and standard error from remote nodes to the node that invoked mpirun, and then prints the information from the remote nodes on the standard output/error of mpirun. Local processes inherit the standard output/error of mpirun and transfer to it directly.
To Redirect Standard I/O
To redirect standard I/O for Open MPI applications, use the typical shell redirection procedure on mpirun. For example:
In this example, only the MPI_COMM_WORLD rank 0 process will receive the stream from my_input on stdin. The stdin on all the other nodes will be tied to /dev/null. However, the stdout from all nodes will be collected into the my_output file.
-wdir or --wdir: change the working directory
-d: specify debugging output
-h: display command help
To Change the Working Directory
Use the -wdir or --wdir option to specify the path of an alternative working directory to be used by the processes spawned when you run your program:
Setting a path with --wdir does not affect where the runtime environment looks for executables. If you do not specify --wdir, the default is the current working directory. For example:
The syntax above changes the working directory for a.out to /home/mystuff/bin.
To Specify Debugging Output
Use this syntax to specify debugging output. For example:
The -d option shows the user-level debugging output for all of the ORTE modules used with mpirun. To see more information from a particular module, you can set additional MCA debugging parameters. The availability of the additional debugging information depends on how the module of interest is implemented.
For more information on MCA parameters, see Chapter 7. For more information about whether a module provides additional verbose or debug mode, run the ompi_info command on that module.
To Display Command Help (-h)
To display a list of mpirun options, use the -h option (alone).
There are two ways to submit jobs under Sun Grid Engine integration: interactive mode and batch mode. The instructions in this chapter describe how to submit jobs interactively. For information about how to submit jobs in batch mode, see Chapter 6.
A parallel environment (PE) needs to be defined for all the queues in the Sun Grid Engine cluster to be used as ORTE nodes. Each ORTE node should be installed as a Sun Grid Engine execution host. To allow ORTE to submit a job from any ORTE node, configure each ORTE node as a submit host in Sun Grid Engine.
Each execution host must be configured with a default queue. In addition, the default queue set must have the same number of slots as the number of processors on the hosts.
To display a list of available PEs (parallel environments), type the following:
To define a new PE, you must have Sun Grid Engine manager or operator privileges. Use a text editor to modify a template for the PE. The following example creates a PE named orte.
To modify an existing PE, use this command to invoke the default editor:
To show a particular PE that has been defined, type this command:
The value NONE in user_lists and xuser_lists means that all users are allowed and no users are excluded.
The value of control_slaves must be TRUE; otherwise, qrsh exits with an error message.
The value of job_is_first_task must be FALSE or the job launcher consumes a slot. In other words, mpirun itself will count as one of the slots and the job will fail, because only n-1 processes will start.
To show all the defined queues, type the following command:
The queue all.q is set up by default in Sun Grid Engine.
To configure the orte PE from the example in the previous section to the existing queue, type the following:
You must have Sun Grid Engine manager or operator privileges to use this command.
Before you submit a job, make sure that your DISPLAY environment variable is set so that the interactive window appears on your desktop.
For example, if you are working in the C shell, type the following command:
1. Use the source command to set the Sun Grid Engine environment variables from a file:
2. Use the qsh command to start the interactive X Windows session, and specify the parallel environment (in this example, orte) and the number of slots to use:
mynode4% qsh -pe orte 2
waiting for interactive job to be scheduled...
Your interactive job 324 has been successfully scheduled.
3. On a different node in the cluster, use the cd command to switch to the directory where your executable is located.
In the above example, Sun Grid Engine starts the user executable hostname with 4 processes on the two Sun Grid Engine assigned slots. The following example shows the output from the mpirun command with the specified options.
The following is not required for normal operation, but if you want to verify that Sun Grid Engine is being used, add --mca ras_gridengine_verbose to the mpirun command line. For example:
An alternate way to start an interactive session is by using qrsh instead of qsh. For example:
The instructions in this section explain how to get best results when starting Open MPI client/server applications.
To Launch the Client/Server Job
1. Type the following command to launch the server application. Substitute the name of your MPI job’s universe for univ1:
2. Type the following command to launch the client application, substituting the name of your MPI job’s universe for univ1:
If the client and server jobs span more than 1 node, the first job (that is, the server job) must specify on the mpirun command line all the nodes that will be used. Specifying the node names allocates the specified hosts from the entire universe of server and client jobs.
For example, if the server runs on node0 and the client job runs on node1 only, the command to launch the server must specify both nodes (using the -host node0,node1 flag) even if it uses only one process on node0.
Assuming that the persistent daemon is started on node0, the command to launch the server would look like this:
The command to launch the client is:
If you are planning on using name publishing, you must perform some additional tasks. You need to start up an ompi-server process on your server so that both the clients and servers can exchange information using that server.
For information about how to start the ompi-server process, type the following command on your server:
If the MPI client/server job fails to start, you might see error messages similar to this:
These messages indicate that there is residual data left in the /tmp directory. This can happen if a previous client/server job has already run from the same node.
To empty the /tmp directory, use the orte-clean utility. For more information about orte-clean, see the orte-clean man page.
You might also need to run orte-clean if you see error messages similar to the following:
A communication failover feature for handling network failures in multi-rail InfiniBand configurations was introduced in the Sun HPC ClusterTools 8.2.1c software release. When a completion error is detected on a given rail, the failover software maps out that rail and routes future traffic through other rails available to the process.
The failover feature supports the Open MPI openib BTL (OpenFabrics user verbs) for the communications layer.
Note - The failover feature does not function with uDAPL or IPoIB.
You enable the failover feature by setting the MCA parameter pml_ob1_enable_failover, either from the mpirun command line or in the openmpi-mca-parameter.conf file.
In the openmpi-mca-parameter.conf File:
Note - For the failover feature to function correctly, you must keep the default settings of the btl_openib_flags parameter.
Failover occurs when a completion error is detected on a connection. A completion error is typically generated when a message transfer fails to complete within a defined time period. By default, this time period is probably larger than is optimal when the failover feature is enabled. This section explains how the timeout value is determined and how to adjust it.
Ttt = 4.096 microseconds * 2^btl_openib_ib_timeout
The btl_openib_ib_timeout parameter corresponds to the IBV_QP_TIMEOUT attribute.
The btl_openib_ib_retry_count parameter corresponds to the IBV_QP_RETRY attribute.
Each of these values can be set on a queue pair.
The default Transport Timer Timeout parameter value is 20, which results in a minimum Ttt period of 4.29 seconds.
Ttt = 4.096 microseconds * 2^20 = 4.096 microseconds * 1048576 = 4.29 seconds
The upper limit of the Transport Timer Timeout is four times this basic timeout period, or 17.18 seconds. That is, with a Ttt parameter value of 20, the Ttt period is guaranteed to fall within the following range:
4.096 microseconds * 2^20 <= timeout <= 4 * 4.096 microseconds * 2^20
4.29 seconds <= timeout <= 17.18 seconds
The default Retry Count parameter value is 7. When both default values are used, the combined Ttt and Retry Count timeout period will fall within the following range:
7 * 4.096 microseconds * 2^20 <= timeout <= 7 * 4 * 4.096 microseconds * 2^20
30.06 seconds <= timeout <= 120.26 seconds
If you want to cause the network failover action to take effect more quickly, set the IBV_QP_TIMEOUT attribute to a value smaller than the default. For example, if you set the timeout attribute to 15, the failover will occur within the following range:
7 * 4.096 microseconds * 2^15 <= timeout <= 7 * 4 * 4.096 microseconds * 2^15
7 * 4.096 microseconds * 32768 <= timeout <= 7 * 4 * 4.096 microseconds * 32768
0.94 seconds <= timeout <= 3.76 seconds
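The arithmetic above is easy to reproduce for other settings. The following Python sketch simply restates the formulas from this section; the function names are hypothetical, and the factor of 4 for the upper bound and the default retry count of 7 are taken from the text above.

```python
BASE_MICROSECONDS = 4.096  # base unit of the Transport Timer Timeout


def ttt_seconds(timeout_value):
    """Minimum Ttt period in seconds for a given btl_openib_ib_timeout value."""
    return BASE_MICROSECONDS * 1e-6 * 2 ** timeout_value


def failover_window(timeout_value=20, retry_count=7):
    """Return (lower, upper) bounds in seconds before a failover is triggered."""
    lower = retry_count * ttt_seconds(timeout_value)
    return lower, 4 * lower  # the upper limit is four times the basic period


for value in (20, 15):
    lower, upper = failover_window(value)
    print(f"timeout value {value}: {lower:.2f} s <= timeout <= {upper:.2f} s")
```

Because each step down in the timeout value halves the period, a small reduction (for example, from 20 to 15) shortens the failover delay by a factor of 32.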
The MPI library may also detect asynchronous errors. One example of these is the PORT_EVENT error. This error is ignored because it will ultimately result in a timeout error, which will be handled by the failover software. Other asynchronous errors, which are less likely to occur, will cause the MPI library to abort.
When running with failover enabled, you are likely to have best results if you run with the Ttt value set as follows:
Experimentally, this yields a timeout value of approximately 2 seconds.
By default, no information about rail loss is generated during a job run. However, if you enable some level of verbosity for the MPI jobs you run, you will be able to determine when a network failover occurs.
For example, if you include the following entry on the mpirun command line, minimal output will be displayed that shows when network interfaces are mapped out.
To see more details about failover activities, use verbosity level 20 or 30.
You can also view failover activity by using the MCA parameter btl_openib_verbose_failover. As with the parameter pml_ob1_verbose, you can choose among three verbosity levels: 10, 20, and 30. For example:
--mca btl_openib_verbose_failover 30
Some error events do not directly trigger a failover, but are likely to cause one through cascading timeout effects. Port Event errors are an example of this. The failover software catches Port Event errors, but does not take action. A failover may subsequently result from a Transport Timer Timeout failure, which could be a natural consequence of the Port Event.
When you suspect that a Port Event error might occur, you might want to force the failover, without waiting for an eventual Ttt-driven failover to occur. You can do this by setting a flag with the following MCA parameter:
--mca btl_openib_port_event_error_failover 1
The following example illustrates the kind of information that is displayed on standard out when a failover occurs. In this example, the Ttt parameter is set to 15 and the verbosity level is set to 10.
This output shows the openib error that caused the rail to be mapped out. It also shows a sudden increase in latency for the data affected by the network failure. The 41616.97 microseconds latency includes the time required for the seven retries to complete.
You can use the syslog facility to record information about network failures and any consequent remapping of interfaces. Run with the following MCA parameter to enable the use of syslog:
This will cause information to be stored in /var/log/messages. Messages about the job will also be written to standard out.
In this second example, the timeout parameter is again set to 15, but we have enabled the logging of failures to the syslog facility.
The /var/log/messages file on one of the nodes will include the following information:
Dec 2 17:02:05 ct-x2200-11 Open MPI Error Report:[10046]: BTL openib error: rank=0 mapping out lid=42:name=mlx4_1 to rank=1 on node=ct-x2200-12
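If you collect these syslog entries for monitoring, the fields can be extracted with a regular expression. The Python sketch below is an assumption-laden illustration: the pattern is derived only from the single sample line shown above, so other failover messages may use a different layout, and the group names are invented for this example.

```python
import re

# Pattern inferred from one sample message; not an official log format.
FAILOVER_PATTERN = re.compile(
    r"BTL openib error: rank=(?P<rank>\d+) mapping out "
    r"lid=(?P<lid>\d+):name=(?P<device>\S+) "
    r"to rank=(?P<peer_rank>\d+) on node=(?P<peer_node>\S+)"
)

sample = (
    "Dec 2 17:02:05 ct-x2200-11 Open MPI Error Report:[10046]: "
    "BTL openib error: rank=0 mapping out lid=42:name=mlx4_1 "
    "to rank=1 on node=ct-x2200-12"
)

match = FAILOVER_PATTERN.search(sample)
if match:
    event = match.groupdict()
    print(f"rank {event['rank']} mapped out {event['device']} "
          f"(lid {event['lid']}) toward rank {event['peer_rank']} "
          f"on {event['peer_node']}")
```

Lines that do not match the pattern are simply skipped, which keeps the sketch safe to run over a whole /var/log/messages file.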
For more information about the mpirun command and its options, see the following:
Copyright © 2010, Oracle and/or its affiliates. All rights reserved.