Sun MPI 4.0 User's Guide: With CRE

Chapter 1 Introduction

This manual explains how to execute Sun MPI applications on a Sun HPC cluster that is using the Sun Cluster Runtime Environment (CRE) 1.0 for job management.

Sun HPC ClusterTools 3.0 Software

Sun HPC ClusterTools 3.0 software is an integrated ensemble of parallel development tools that extend Sun's network computing solutions to high-end distributed-memory applications. Sun HPC ClusterTools products can be used either with the CRE or with LSF Suite 3.2.3, Platform Computing Corporation's resource-management software.


Note -

If you are using LSF Suite instead of the CRE for workload management, you should be reading the Sun MPI 4.0 User's Guide: With LSF instead of this document.


The principal components of the Sun HPC ClusterTools software are described in the sections "Sun Cluster Runtime Environment" through "Sun S3L".

Sun Cluster Runtime Environment

The CRE is a cluster administration and job launching facility. It provides users with an interactive command-line interface for executing jobs on the cluster and for obtaining information about job activity and cluster resources.

The CRE also performs load-balancing for programs running in shared partitions.


Note -

Load balancing, partitions, and other related Sun HPC cluster concepts are discussed in "Fundamental CRE Concepts".


Sun MPI and MPI I/O

Sun MPI is a highly optimized version of the Message-Passing Interface (MPI) communications library. Sun MPI implements all of the MPI 1.2 standard as well as a significant subset of the MPI 2.0 feature list. For example, Sun MPI provides the following features:

Sun MPI and MPI I/O provide full F77, C, and C++ support, as well as basic F90 support.
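
The short C program below illustrates these bindings. It is a generic MPI sketch that uses only calls defined by the MPI standard (MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Finalize); the file name hello.c is arbitrary, and nothing in the example is specific to Sun MPI.

    /* hello.c - each process reports its rank and the size of the job. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);                  /* initialize MPI */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* rank of this process */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes */
        printf("Process %d of %d\n", rank, size);
        MPI_Finalize();                          /* shut down MPI */
        return 0;
    }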

Parallel File System

The Sun Parallel File System (PFS) component of the Sun HPC ClusterTools suite of software provides high-performance file I/O for multiprocess applications running in a cluster-based, distributed-memory environment.

PFS file systems closely resemble UFS file systems, but they provide significantly higher file I/O performance by striping files across multiple PFS I/O server nodes. This means that the time required to read or write a PFS file can be reduced by a factor roughly proportional to the number of I/O server nodes in the PFS file system.

PFS is optimized for the large files and complex data access patterns that are characteristic of parallel scientific applications.
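
As an illustration of how a parallel application might perform such I/O, the following C fragment uses standard MPI I/O calls (MPI_File_open, MPI_File_write_at, MPI_File_close) to have each process write its own block of data at a rank-dependent offset. This is a generic MPI I/O sketch rather than a PFS-specific interface, and the path /pfs/data is a hypothetical file on a PFS file system.

    /* Each process writes BLOCK integers at an offset based on its rank. */
    #include <mpi.h>

    #define BLOCK 1024

    void write_block(int rank)
    {
        MPI_File   fh;
        MPI_Status status;
        int        buf[BLOCK];
        int        i;
        MPI_Offset offset = (MPI_Offset)rank * BLOCK * sizeof(int);

        for (i = 0; i < BLOCK; i++)
            buf[i] = rank;

        /* "/pfs/data" is a hypothetical path on a PFS file system. */
        MPI_File_open(MPI_COMM_WORLD, "/pfs/data",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);
        MPI_File_write_at(fh, offset, buf, BLOCK, MPI_INT, &status);
        MPI_File_close(&fh);
    }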

Prism

Prism is the Sun HPC graphical programming environment. It allows you to develop, execute, and debug message-passing programs, as well as to visualize their data. With Prism you can

Prism can be used with applications written in F77, F90, C, and C++.

Sun S3L

The Sun Scalable Scientific Subroutine Library (Sun S3L) provides a set of parallel and scalable functions and tools that are used widely in scientific and engineering computing. It is built on top of MPI and provides the following functionality for Sun MPI programmers:

Sun S3L routines can be called from applications written in F77, F90, C, and C++.

Sun Compilers

Sun HPC ClusterTools 3.0 software supports the following Sun compilers:

Solaris Operating Environment

Sun HPC ClusterTools software runs under the Solaris 2.6 or Solaris 7 (32-bit or 64-bit) operating environment. All programs that execute under Solaris 2.6 or Solaris 7 will also execute in the Sun HPC ClusterTools environment.

Fundamental CRE Concepts

This section introduces some important concepts that you should understand in order to use the Sun HPC ClusterTools software in the CRE effectively.

Cluster of Nodes

As its name implies, the CRE is intended to operate in a Sun HPC cluster--that is, in a collection of Sun SMP (symmetric multiprocessor) servers that are interconnected by any Sun-supported, TCP/IP-capable interconnect. An SMP attached to the cluster network is referred to as a node.

The CRE manages the launching and execution of both serial and parallel jobs on the cluster nodes. For serial jobs, its chief contribution is to perform load balancing in shared partitions, where multiple processes can be competing for the same node resources. For parallel jobs, the CRE provides:

The CRE supports parallel jobs running on clusters of up to 64 nodes containing up to 256 CPUs.
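
For example, a parallel program is typically launched with the CRE's mprun command. The line below is a sketch that assumes an executable named a.out and uses mprun's -np option to request four processes; both the executable name and the process count are arbitrary, and the mprun options are described in detail later in this manual.

    % mprun -np 4 a.out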

Partitions

The system administrator can configure the nodes in a Sun HPC cluster into one or more logical sets, called partitions.


Note -

The CPUs in a Sun HPC 10000 server can be configured into logical nodes called domains. These domains can be grouped to form partitions, which the CRE uses in the same way as partitions containing other types of Sun HPC nodes.


Any job launched on a partition will run on one or more nodes in that partition, but not on nodes in any other partition. Partitioning a cluster allows multiple jobs to be executed on the partitions concurrently, without any risk of jobs on different partitions interfering with each other. This ability to isolate jobs can be beneficial in various ways. For example:

If you want your job to execute on a specific partition, the CRE provides you with the following methods for selecting the partition:

It is possible for a cluster node to not belong to any partition. If you log in to one of these independent nodes and do not request a particular partition, the CRE launches your job on the cluster's default partition. This is a partition whose name is either specified by the SUNHPC_PART environment variable or defined by an internal attribute that the system administrator can set.
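
As a sketch of these two approaches, the lines below assume a partition named part0. The -p option shown here is mprun's partition-selection option (check the mprun man page for the exact syntax on your installation), and the setenv form applies to the C shell.

    % mprun -p part0 -np 4 a.out     (select partition part0 for this job only)
    % setenv SUNHPC_PART part0       (C shell; make part0 your default partition)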

The system administrator can also selectively enable and disable partitions. Jobs can only be executed on enabled partitions. This restriction makes it possible to define many partitions in a cluster, but have only a few active at any one time.


Note -

It is also possible for a node to belong to more than one partition, so long as only one is enabled at a time.


In addition to enabling and disabling partitions, the system administrator can set and unset other partition attributes that influence various aspects of how the partition functions. For example, if you have an MPI job that requires dedicated use of a set of nodes, you could run it on a partition that the system administrator has configured to accept only one job at a time.

The administrator could configure a different partition to allow multiple jobs to execute concurrently. This shared partition would be used for code development or other jobs that do not require exclusive use of their nodes.


Note -

Although a job cannot be run across partition boundaries, it can be run on a partition plus independent nodes.


Load Balancing

The CRE load-balances programs that execute in shared partitions--that is, in partitions that allow multiple jobs to run concurrently.

When you issue the mprun command in a shared partition, the CRE first determines what criteria (if any) you have specified for the node or nodes on which the program is to run. It then determines which nodes within the partition meet these criteria. If more nodes meet the criteria than are required to run your program, the CRE starts the program on the node or nodes that are least loaded. It examines the one-minute load averages of the nodes and ranks them accordingly.

This load-balancing mechanism ensures that your program's execution will not be unnecessarily delayed because it happened to be placed on a heavily loaded node. It also ensures that some nodes won't sit idle while other nodes are heavily loaded, thereby keeping overall throughput of the partition as high as possible.

Jobs and Processes

When a serial program executes on a Sun HPC cluster, it becomes a Solaris process with a Solaris process ID, or pid.

When the CRE executes a distributed message-passing program, it spawns multiple Solaris processes, each with its own pid.

The CRE also assigns a job ID, or jid, to the program. If it is an MPI job, the jid applies to the overall job. Job IDs always begin with a j to distinguish them from pids. Many CRE commands take jids as arguments. For example, you can issue an mpkill command with a signal number or name and a jid argument to send the specified signal to all processes that make up the job specified by the jid.
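
For example, assuming a running job whose hypothetical job ID is j175, a command of the following form would send signal 9 (SIGKILL) to every process in that job; see the mpkill man page for the exact signal-naming syntax.

    % mpkill -9 j175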

Parallel File System

From the user's perspective, PFS file systems closely resemble UNIX file systems. PFS uses a conventional inverted-tree hierarchy, with a root directory at the top and subdirectories and files branching down from there. The fact that individual PFS files are distributed across multiple disks managed by multiple I/O servers is transparent to the programmer. The way that PFS files are actually mapped to the physical storage facilities is based on file system configuration entries in the CRE database.

Using the Sun CRE

The balance of this manual discusses the following aspects of using the CRE: