This manual explains how to execute Sun MPI applications on a Sun HPC cluster that is using the Sun Cluster Runtime Environment (CRE) 1.0 for job management.
Sun HPC ClusterTools 3.0 software is an integrated ensemble of parallel development tools that extend Sun's network computing solutions to high-end distributed-memory applications. Sun HPC ClusterTools products can be used either with the CRE or with LSF Suite 3.2.3, Platform Computing Corporation's resource-management software.
If you are using LSF Suite instead of the CRE for workload management, you should be reading the Sun MPI 4.0 User's Guide: With LSF instead of this document.
The principal components of Sun HPC ClusterTools Software are described in "Sun Cluster Runtime Environment" through "Sun S3L".
The CRE is a cluster administration and job launching facility. It provides users with an interactive command-line interface for executing jobs on the cluster and for obtaining information about job activity and cluster resources.
The CRE also performs load-balancing for programs running in shared partitions.
Load balancing, partitions, and other related Sun HPC cluster concepts are discussed in "Fundamental CRE Concepts".
Sun MPI is a highly optimized version of the Message-Passing Interface (MPI) communications library. Sun MPI implements all of the MPI 1.2 standard as well as a significant subset of the MPI 2.0 feature list. For example, Sun MPI provides the following features (a minimal program sketch follows the list):
Support for multithreaded programming.
Seamless use of different network protocols; for example, code compiled on a Sun HPC cluster that has a Scalable Coherent Interface (SCI) network can be run without change on a cluster that has an ATM network.
Multiprotocol support such that MPI picks the fastest available medium for each type of connection (such as shared memory, SCI, or ATM).
Communication via shared memory for fast performance on clusters of SMPs.
Finely tunable shared memory communication.
Optimized collectives for symmetric multiprocessors (SMPs).
Prism support - Users can develop, run, and debug programs in the Prism programming environment.
MPI I/O support for parallel file I/O.
Sun MPI is a dynamic library.
Sun MPI and MPI I/O provide full F77, C, and C++ support and basic F90 support.
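For readers who are new to MPI, the following minimal C program illustrates the kind of code Sun MPI supports. It is a generic MPI 1.2 sketch, not an excerpt from Sun documentation, and uses only standard MPI calls (MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Finalize):

    #include <stdio.h>
    #include <mpi.h>

    /* Each process in the job reports its rank and the job size. */
    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);                /* start the MPI library   */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank     */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* processes in the job    */

        printf("Hello from process %d of %d\n", rank, size);

        MPI_Finalize();                        /* shut down MPI */
        return 0;
    }

Under the CRE, a program such as this is launched with the mprun command described in Chapter 3, Executing Programs; for example, mprun -np 4 a.out would run it as a four-process job (the executable name a.out is used here only as a placeholder).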
The Sun Parallel File System (PFS) component of the Sun HPC ClusterTools suite of software provides high-performance file I/O for multiprocess applications running in a cluster-based, distributed-memory environment.
PFS file systems closely resemble UFS file systems, but provide significantly higher file I/O performance by striping files across multiple PFS I/O server nodes. This means that the time required to read or write a PFS file can be reduced by a factor roughly proportional to the number of I/O server nodes in the PFS file system.
PFS is optimized for the large files and complex data access patterns that are characteristic of parallel scientific applications.
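Because Sun MPI includes MPI I/O, parallel applications commonly read and write PFS files through the standard MPI-2 file routines. The following C sketch shows the general pattern only; the file name /pfs/data/example.dat is a hypothetical path, and a real application would check the return codes of the MPI calls.

    #include <mpi.h>

    /* Each process writes its rank into a shared file at a disjoint,
       rank-based offset. The path below is purely illustrative; on a
       Sun HPC cluster it would normally name a file on a PFS file system. */
    int main(int argc, char *argv[])
    {
        MPI_File   fh;
        MPI_Status status;
        int        rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_File_open(MPI_COMM_WORLD, "/pfs/data/example.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        MPI_File_write_at(fh, (MPI_Offset)rank * sizeof(int),
                          &rank, 1, MPI_INT, &status);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }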
Prism is the Sun HPC graphical programming environment. It allows you to develop, execute, debug, and visualize data in message-passing programs. With Prism you can
Control various aspects of program execution, such as:
Starting and stopping execution.
Setting breakpoints and traces.
Printing values of variables and expressions.
Displaying the call stack.
Visualize data in various formats.
Analyze performance of MPI programs.
Aggregate processes across multiprocess parallel jobs into meaningful groups, called process sets or psets.
Prism can be used with applications written in F77, F90, C, and C++.
The Sun Scalable Scientific Subroutine Library (Sun S3L) provides a set of parallel and scalable functions and tools that are used widely in scientific and engineering computing. It is built on top of MPI and provides the following functionality for Sun MPI programmers:
Vector and dense matrix operations (level 1, 2, 3 Parallel BLAS).
Iterative solvers for sparse systems.
Matrix-vector multiply for sparse systems.
FFT.
LU factor and solve.
Autocorrelation.
Convolution/deconvolution.
Tridiagonal solvers.
Banded solvers.
Eigensolvers.
Singular value decomposition.
Least squares.
One-dimensional sort.
Multidimensional sort.
Selected ScaLAPACK and BLACS application program interfaces.
Conversion between ScaLAPACK and S3L.
Matrix transpose.
Random number generators (linear congruential and lagged Fibonacci).
Random number generator and I/O for sparse systems.
Matrix inverse.
Array copy.
Safety mechanism.
An array syntax interface callable from message-passing programs.
Toolkit functions for operations on distributed data.
Support for the multiple instance paradigm (allowing an operation to be applied concurrently to multiple, disjoint data sets in a single call).
Thread safety.
Detailed programming examples and support documentation provided online.
Sun S3L routines can be called from applications written in F77, F90, C, and C++.
Sun HPC ClusterTools 3.0 software supports the following Sun compilers:
Sun WorkShop Compilers C/C++ v4.2 and v5.0
Sun WorkShop Compilers Fortran v4.2 and v5.0
Sun HPC ClusterTools software uses the Solaris 2.6 or Solaris 7 (32-bit or 64-bit) operating environment. All programs that execute under Solaris 2.6 or Solaris 7 execute in the Sun HPC ClusterTools environment.
This section introduces some important concepts that you should understand in order to use the Sun HPC ClusterTools software in the CRE effectively.
As its name implies, the CRE is intended to operate in a Sun HPC cluster--that is, in a collection of Sun SMP (symmetric multiprocessor) servers that are interconnected by any Sun-supported, TCP/IP-capable interconnect. An SMP attached to the cluster network is referred to as a node.
The CRE manages the launching and execution of both serial and parallel jobs on the cluster nodes. For serial jobs, its chief contribution is to perform load balancing in shared partitions, where multiple processes can be competing for the same node resources. For parallel jobs, the CRE provides:
A single job monitoring and control point.
Load balancing for shared partitions.
Information about node connectivity.
Support for spawning of MPI processes.
Support for Prism interaction with parallel jobs.
A cluster can consist of a single Sun SMP server. However, executing MPI jobs on even a single-node cluster requires the CRE to be running on that cluster.
The CRE supports parallel jobs running on clusters of up to 64 nodes containing up to 256 CPUs.
The system administrator can configure the nodes in a Sun HPC cluster into one or more logical sets, called partitions.
The CPUs in a Sun HPC 10000 server can be configured into logical nodes, called domains. These domains can be logically grouped to form partitions, which the CRE uses in the same way as partitions containing other types of Sun HPC nodes.
Any job launched on a partition will run on one or more nodes in that partition, but not on nodes in any other partition. Partitioning a cluster therefore allows multiple jobs to execute concurrently, without any risk of jobs on different partitions interfering with each other. This ability to isolate jobs can be beneficial in various ways. For example:
If one job requires exclusive use of a set of nodes, but other jobs also need to execute at the same time, the availability of two partitions in a cluster would allow both needs to be satisfied.
If a cluster contains a mix of nodes whose characteristics differ--such as having different memory sizes, CPU counts, or levels of I/O support--the nodes can be grouped into partitions that have similar resources. This would allow jobs that require particular resources to be run on suitable partitions, while jobs that are less resource-dependent could be relegated to less specialized partitions.
If you want your job to execute on a specific partition, the CRE provides you with the following methods for selecting the partition:
Log in to a node that is a member of the partition.
Set the environment variable SUNHPC_PART to the name of the partition.
Use the -p option to the job-launching command, mprun, to specify the partition.
These methods are listed in order of increasing priority. That is, setting the SUNHPC_PART environment variable overrides whichever partition you may be logged into. Likewise, specifying the mprun -p option overrides either of the other methods for selecting a partition.
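As an illustration of this precedence, suppose a cluster contains partitions named part0 and part1 (hypothetical names) and that you are working in the C shell; the executable a.out is likewise a placeholder:

    % setenv SUNHPC_PART part0
    % mprun a.out                 (job runs on part0; SUNHPC_PART overrides the login node's partition)
    % mprun -p part1 a.out        (job runs on part1; the -p option overrides SUNHPC_PART)

Only the -p option is shown here; the full set of mprun options is described in Chapter 3, Executing Programs.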
It is possible for cluster nodes to not belong to any partition. If you log in to one of these independent nodes and do not request a particular partition, the CRE will launch your job on the cluster's default partition. This is a partition whose name is either specified by the SUNHPC_PART environment variable or defined by an internal attribute that the system administrator can set.
The system administrator can also selectively enable and disable partitions. Jobs can only be executed on enabled partitions. This restriction makes it possible to define many partitions in a cluster, but have only a few active at any one time.
It is also possible for a node to belong to more than one partition, so long as only one is enabled at a time.
In addition to enabling and disabling partitions, the system administrator can set and unset other partition attributes that influence various aspects of how the partition functions. For example, if you have an MPI job that requires dedicated use of a set of nodes, you could run it on a partition that the system administrator has configured to accept only one job at a time.
The administrator could configure a different partition to allow multiple jobs to execute concurrently. Such a shared partition would be used for code development or other jobs that do not require exclusive use of their nodes.
Although a job cannot be run across partition boundaries, it can be run on a partition plus independent nodes.
The CRE load-balances programs that execute in shared partitions--that is, in partitions that allow multiple jobs to run concurrently.
When you issue the mprun command in a shared partition, the CRE first determines what criteria (if any) you have specified for the node or nodes on which the program is to run. It then determines which nodes within the partition meet these criteria. If more nodes meet the criteria than are required to run your program, the CRE starts the program on the node or nodes that are least loaded. It examines the one-minute load averages of the nodes and ranks them accordingly.
This load-balancing mechanism ensures that your program's execution will not be unnecessarily delayed because it happened to be placed on a heavily loaded node. It also ensures that some nodes won't sit idle while other nodes are heavily loaded, thereby keeping overall throughput of the partition as high as possible.
When a serial program executes on a Sun HPC cluster, it becomes a Solaris process with a Solaris process ID, or pid.
When the CRE executes a distributed message-passing program, it spawns multiple Solaris processes, each with its own pid.
The CRE also assigns a job ID, or jid, to the program. If it is an MPI job, the jid applies to the overall job. Job IDs always begin with a j to distinguish them from pids. Many CRE commands take jids as arguments. For example, you can issue an mpkill command with a signal number or name and a jid argument to send the specified signal to all processes that make up the job specified by the jid.
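For example, given a hypothetical job ID of j85, a command of the general form

    % mpkill -TERM j85

would send the TERM signal to every process in job j85. The form shown here simply follows the kill(1) convention and is intended only as an assumed illustration; see the later chapters for the exact mpkill syntax.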
From the user's perspective, PFS file systems closely resemble UNIX file systems. PFS uses a conventional inverted-tree hierarchy, with a root directory at the top and subdirectories and files branching down from there. The fact that individual PFS files are distributed across multiple disks managed by multiple I/O servers is transparent to the programmer. The way that PFS files are actually mapped to the physical storage facilities is based on file system configuration entries in the CRE database.
The balance of this manual discusses the following aspects of using the CRE:
Choosing a partition and logging in - see Chapter 2, Starting Sun MPI Programs.
Executing programs - see Chapter 3, Executing Programs.
Obtaining information - see Chapter 4, Getting Information.
Debugging programs - see Chapter 5, Debugging Programs.
Performance tuning and profiling - see Chapter 6, Performance Tuning.