This section introduces some important concepts that you should understand in order to use the Sun HPC ClusterTools software in the CRE effectively.
As its name implies, the CRE is intended to operate in a Sun HPC cluster--that is, in a collection of Sun SMP (symmetric multiprocessor) servers that are interconnected by any Sun-supported, TCP/IP-capable interconnect. An SMP attached to the cluster network is referred to as a node.
The CRE manages the launching and execution of both serial and parallel jobs on the cluster nodes. For serial jobs, its chief contribution is to perform load balancing in shared partitions, where multiple processes can be competing for the same node resources. For parallel jobs, the CRE provides:
A single job monitoring and control point.
Load balancing for shared partitions.
Information about node connectivity.
Support for spawning of MPI processes.
Support for Prism interaction with parallel jobs.
A cluster can consist of a single Sun SMP server. However, executing MPI jobs on even a single-node cluster requires the CRE to be running on that cluster.
The CRE supports parallel jobs running on clusters of up to 64 nodes containing up to 256 CPUs.
The system administrator can configure the nodes in a Sun HPC cluster into one or more logical sets, called partitions.
The CPUs in a Sun HPC 10000 server can be configured into logical nodes, called domains. These domains can be logically grouped to form partitions, which the CRE uses in the same way it uses partitions containing other types of Sun HPC nodes.
Any job launched on a partition will run on one or more nodes in that partition, but not on nodes in any other partition. Partitioning a cluster allows multiple jobs to be executed on the partitions concurrently, without any risk of jobs on different partitions interfering with each other. This ability to isolate jobs can be beneficial in various ways. For example:
If one job requires exclusive use of a set of nodes, but other jobs also need to execute at the same time, the availability of two partitions in a cluster would allow both needs to be satisfied.
If a cluster contains a mix of nodes whose characteristics differ--such as having different memory sizes, CPU counts, or levels of I/O support--the nodes can be grouped into partitions that have similar resources. This would allow jobs that require particular resources to be run on suitable partitions, while jobs that are less resource-dependent could be relegated to less specialized partitions.
If you want your job to execute on a specific partition, the CRE provides you with the following methods for selecting the partition:
Log in to a node that is a member of the partition.
Set the environment variable SUNHPC_PART to the name of the partition.
Use the -p option to the job-launching command, mprun, to specify the partition.
These methods are listed in order of increasing priority. That is, setting the SUNHPC_PART environment variable overrides whichever partition you may be logged into. Likewise, specifying the mprun -p option overrides either of the other methods for selecting a partition.
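For example, the following commands illustrate the last two methods and their relative priority. The partition names part0 and part1, the program a.out, and the C shell setenv syntax are used here for illustration only:

    % setenv SUNHPC_PART part0
    % mprun a.out                  (job runs on part0)
    % mprun -p part1 a.out         (-p overrides SUNHPC_PART; job runs on part1)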
It is possible for cluster nodes to not belong to any partition. If you log in to one of these independent nodes and do not request a particular partition, the CRE will launch your job on the cluster's default partition--that is, the partition named by the SUNHPC_PART environment variable or, if that variable is not set, the partition designated by an internal attribute that the system administrator is able to set.
The system administrator can also selectively enable and disable partitions. Jobs can only be executed on enabled partitions. This restriction makes it possible to define many partitions in a cluster, but have only a few active at any one time.
It is also possible for a node to belong to more than one partition, so long as only one is enabled at a time.
In addition to enabling and disabling partitions, the system administrator can set and unset other partition attributes that influence various aspects of how the partition functions. For example, if you have an MPI job that requires dedicated use of a set of nodes, you could run it on a partition that the system administrator has configured to accept only one job at a time.
The administrator could configure a different partition to allow multiple jobs to execute concurrently. This shared partition would be used for code development or other jobs that do not require exclusive use of their nodes.
Although a job cannot be run across partition boundaries, it can be run on a partition plus independent nodes.
The CRE load-balances programs that execute in shared partitions--that is, in partitions that allow multiple jobs to run concurrently.
When you issue the mprun command in a shared partition, the CRE first determines what criteria (if any) you have specified for the node or nodes on which the program is to run. It then determines which nodes within the partition meet these criteria. If more nodes meet the criteria than are required to run your program, the CRE starts the program on the node or nodes that are least loaded. It examines the one-minute load averages of the nodes and ranks them accordingly.
This load-balancing mechanism ensures that your program's execution will not be unnecessarily delayed because it happened to be placed on a heavily loaded node. It also ensures that some nodes won't sit idle while other nodes are heavily loaded, thereby keeping overall throughput of the partition as high as possible.
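For example, a simple launch in a shared partition lets the CRE choose where the processes run. The program name a.out is hypothetical, and the -np option for specifying the number of processes is assumed here rather than documented in this section:

    % mprun -np 4 a.out

Because no node criteria are given, the CRE ranks the partition's nodes by their one-minute load averages and starts the four processes on the least loaded of them.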
When a serial program executes on a Sun HPC cluster, it becomes a Solaris process with a Solaris process ID, or pid.
When the CRE executes a distributed message-passing program, it spawns multiple Solaris processes, each with its own pid.
The CRE also assigns a job ID, or jid, to the program. If it is an MPI job, the jid applies to the overall job. Job IDs always begin with a j to distinguish them from pids. Many CRE commands take jids as arguments. For example, you can issue an mpkill command with a signal number or name and a jid argument to send the specified signal to all processes that make up the job specified by the jid.
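For example, if an MPI job has been assigned the jid j1234 (a hypothetical value), the following command would send signal 9 (SIGKILL) to every process in that job. The -9 option syntax is assumed here for illustration; see the mpkill man page for the exact form:

    % mpkill -9 j1234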
From the user's perspective, PFS (parallel file system) file systems closely resemble UNIX file systems. PFS uses a conventional inverted-tree hierarchy, with a root directory at the top and subdirectories and files branching down from there. The fact that individual PFS files are distributed across multiple disks managed by multiple I/O servers is transparent to the programmer. The way that PFS files are actually mapped to the physical storage facilities is based on file system configuration entries in the CRE database.