CHAPTER 4
Cluster Configuration Notes |
This chapter examines various issues that may have some bearing on choices you make when configuring your Sun HPC cluster. The information is organized in three general topic areas: configuring the nodes themselves (CPUs, memory, and swap space), choosing and configuring the cluster's network interconnects, and integrating Sun CRE with distributed resource managers.
Configuring a Sun server or cluster of servers to use Sun HPC ClusterTools software involves many of the same choices seen when configuring general-purpose compute servers. Common issues include the number of CPUs per machine, the amount of installed memory, and the amount of disk space reserved for swapping.
Because the characteristics of the particular applications to be run on any given Sun HPC cluster have such a large effect on the optimal settings of these parameters, the following discussion is necessarily general in nature.
Because Sun MPI programs can run efficiently within a single server, communicating through shared memory rather than over a network, it can be advantageous for a server to have at least as many CPUs as there are processes in the applications that run on the cluster. This is not a requirement, since Sun MPI applications can run across multiple nodes in a cluster. However, applications with very large interprocess communication requirements might see significant performance gains when run entirely within a single server.
Generally, the amount of installed memory should be proportional to the number of CPUs in the cluster, although the exact amount depends significantly on the particulars of the target application mix.
Sun HPC ClusterTools applications that perform intensive computations, and which process data that is stored locally, often benefit from larger external caches on their processor modules. Large cache capacity allows data to be kept closer to the processor for longer periods of time.
Because Sun HPC ClusterTools software applications are, on average, larger than those typically run on compute servers, the swap space allocated to Sun HPC clusters should be correspondingly larger. The amount of swap should be proportional to the number of CPUs and to the amount of installed memory. Additional swap should be configured to act as backing store for the shared memory areas that Sun HPC ClusterTools software uses for communication between processes running on the same node.
Sun MPI jobs require large amounts of swap space for shared memory files. The sizes of shared memory files scale in stepwise fashion, rather than linearly. For example, a two-process job (with both processes running within the same SMP) requires shared memory files of approximately 35 Mbytes. A 16-process job (all processes running within the same SMP) requires shared memory files of approximately 85 Mbytes. A 256-process job (all processes running within the same SMP) requires shared memory files of approximately 210 Mbytes.
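When estimating swap requirements, it can help to compare the configured swap against these shared memory file sizes. For example, on a Solaris node the following commands report the configured swap devices and current swap usage (a minimal sketch; the output format varies with the Solaris release):

/usr/sbin/swap -l    # list the configured swap areas and their sizes
/usr/sbin/swap -s    # print a summary of allocated, reserved, and available swap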
One of the most fundamental issues to be addressed when configuring a cluster is the question of how to connect the nodes of the cluster. In particular, both the type and the number of networks should be chosen to complement the way in which the cluster is most likely to be used.
In a broad sense, a Sun HPC cluster can be viewed as a standard LAN. Operations performed on nodes of the cluster generate the same type of network traffic that is seen on a LAN. For example, running an executable and accessing directories and files generate NFS traffic, while remote login sessions generate their own network traffic. This kind of network traffic is referred to here as administrative traffic.
Administrative traffic has the potential to tax cluster resources. This can result in significant performance losses for some Sun MPI applications, unless these resources are somehow protected from this traffic. Fortunately, Sun CRE provides enough configuration flexibility to allow you to avoid many of these problems.
The following sections discuss some of the factors that you should consider when building a cluster for Sun HPC ClusterTools software applications.
Several Sun HPC ClusterTools software components generate internode communication. It is important to understand the nature of this communication in order to make informed decisions about network configurations.
As mentioned earlier, a Sun HPC cluster generates the same kind of network traffic as any UNIX-based LAN. Common operations like starting a program can have a significant network impact. The impact of such administrative traffic should be considered when making network configuration decisions.
When a simple serial program is run within a LAN, network traffic typically occurs as the executable is read from an NFS-mounted disk and paged into a single node's memory. In contrast, when a 16- or 32-process parallel program is invoked, the NFS server is likely to experience approximately simultaneous demands from multiple nodes, each pulling pages of the executable into its own memory. Such requests can result in large amounts of network traffic; how much depends on factors such as the number of processes in the parallel job and the size of the executable.
Sun CRE uses the cluster default network interconnect to perform communication between the daemons that perform resource management functions. Sun CRE makes heavy use of this network when Sun MPI jobs are started, with the load being roughly proportional to the number of processes in the parallel jobs. This load is in addition to the start-up load described in the previous section. Sun CRE generates a similar load during job termination as the Sun CRE database is updated to reflect the expired MPI job.
There is also a small amount of steady traffic generated on this network as Sun CRE continually updates its view of the resources on each cluster node and monitors the status of its components to guard against failures.
Parallel programs use Sun MPI software to move data between processes as the program runs. If the running program is spread across multiple cluster nodes, then the program generates network traffic.
Sun MPI uses the network that Sun CRE instructs it to use, which can be set by the system administrator. In general, Sun CRE instructs Sun MPI to use the fastest network available so that message-passing programs obtain the best possible performance.
If the cluster has only one network, then message-passing traffic shares bandwidth with administrative and Sun CRE functions. This results in performance degradation for all types of traffic, especially if one of the applications is performing significant amounts of data transfer, as message-passing applications often do. You should understand the communication requirements associated with the types of applications to be run on the Sun HPC cluster in order to decide whether the amount and frequency of application-generated traffic warrants the use of a second, dedicated network for parallel application network traffic. In general, a second network will significantly assist overall performance.
Sun MPI programs can make use of the parallel I/O capabilities of Sun HPC ClusterTools software, but not all such programs do so. You need to understand how distributed multiprocess applications that are run on the Sun HPC cluster make use of parallel I/O in order to understand the ramifications for network load.
Applications can use parallel I/O to read from and write to standard UNIX file systems, generating NFS traffic on the default network, on the network being used by the Sun MPI component, or some combination of the two. The type of traffic that is generated depends on the type of I/O operations being used by the applications. Collective I/O operations generate traffic on the Sun MPI network, while most other types of I/O operations involve only the default network.
Bandwidth, latency, and performance under load are all important network characteristics to consider when choosing interconnects for a Sun HPC cluster. These are discussed in the following sections.
Bandwidth should be matched to expected load as closely as possible. If the intended message-passing applications have only modest communication requirements and no significant parallel I/O requirements, then a fast, expensive interconnect may be unnecessary. On the other hand, many parallel applications benefit from large pipes (high-bandwidth interconnects). Clusters that are likely to handle such applications should use interconnects with sufficient bandwidth to avoid communication bottlenecks. Significant use of parallel I/O would also increase the importance of having a high-bandwidth interconnect.
It is also a good practice to use a high-bandwidth network to connect large nodes (nodes with many CPUs) so that communication capabilities are in balance with computational capabilities.
An example of a low-bandwidth interconnect is 10-Mbit/s Ethernet. Examples of higher-bandwidth interconnects include ATM and switched Fast Ethernet.
The latency of the network is the sum of all delays a message encounters from its point of departure to its point of arrival. The significance of network latency varies according to the communication patterns of the application.
Low latency can be particularly important when the message traffic consists mostly of small messages; in such cases, latency accounts for a large proportion of the total time spent transmitting messages. Traffic that consists mostly of large messages is less sensitive to latency, because transmission time dominates the total transfer time.
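As a rough model (with illustrative numbers only), the time to transfer a message is approximately the network latency plus the message size divided by the bandwidth. On a network with 50-microsecond latency and 100-Mbyte/s bandwidth, a 100-byte message takes roughly 51 microseconds, almost all of it latency, while a 1-Mbyte message takes roughly 10 milliseconds, almost all of it transmission time.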
Parallel I/O operations are less vulnerable to latency delays than small-message traffic because the messages transferred by parallel I/O operations tend to be large (often 32 Kbytes or larger).
Generally speaking, better performance is provided by switched network interconnects, such as ATM and Fibre Channel. Network interconnects with collision-based semantics should be avoided in situations where performance under load is important. Unswitched 100-Mbit/s Ethernet and unswitched Gigabit Ethernet are two common examples of this type of network. While 10-Mbit/s Ethernet is almost certainly not adequate for any Sun HPC ClusterTools software application, a switched version of 100-Mbit/s Ethernet or Gigabit Ethernet might be sufficient for some applications.
Sun CRE provides close integration with several distributed resource managers. In that integration, Sun CRE retains most of its original functions, but delegates others to the resource manager. The sections that follow describe how this integration works, how to configure the hpc.conf and sunhpc.allow files, and how to enable integration with PBS, LSF, and Sun N1 Grid Engine (SGE).
The batch system, whether SGE, LSF, or PBS, launches the job through a script. The script calls mprun, and passes it a host file of the resources that have been allocated for the job, plus the job ID assigned by the batch system.
The Sun CRE environment continues to perform most of its normal parallel-processing actions. However, in the case of SGE and PBS, the child processes do not fork any executable programs, whereas in LSF, they do. In all cases, each child process identifies a communications channel (specifically, a listen query socket) through which it can be monitored by the Sun CRE environment while running in the batch system.
Before releasing its child processes to the batch system, the Sun CRE environment wraps each one in a wrapper program that identifies its listen query socket. The wrapper keeps the executable programs fully connected to the Sun CRE environment while they are running, and the Sun CRE environment can clean up when a socket closes because a program has stopped running.
A user can also invoke a similar process interactively, without a script. Instructions for script-based and interactive job launching are provided in the Sun HPC ClusterTools Software User's Guide.
The exact instructions vary from one resource manager to another, and are affected by your Sun CRE configuration. They all follow these general guidelines:
1. The job can be launched either interactively or through a script. Instructions for both are provided in the Sun HPC ClusterTools Software User's Guide and in the related man pages.
2. Users enter the batch processing environment before launching jobs with mprun.
3. Users reserve resources for the parallel job and set other job control parameters from within their resource manager.
4. Users invoke the mprun command with the applicable resource manager flags (see the sketch after this list). Those flags are described in the Sun HPC ClusterTools Software User's Guide and the mprun(1) man page.
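To make these steps concrete, here is a minimal sketch of a batch script that launches a Sun MPI program under PBS. The resource request, the program name ./a.out, and the use of the -x pbs flag to select the PBS integration are illustrative assumptions; see the Sun HPC ClusterTools Software User's Guide and mprun(1) for the flags that apply to your site.

#!/bin/sh
#PBS -N mpi_job
#PBS -l nodes=2
# Launch the parallel program through Sun CRE, selecting the PBS integration.
# Additional mprun flags (for example, a process count) may also apply; see mprun(1).
mprun -x pbs ./a.out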
[Figure: summary of the user interaction when launching a parallel job through a batch system]
To enable close integration with any resource manager, you must configure the CREOptions section of the hpc.conf file and, if you want to restrict the programs that mprun can launch, create the sunhpc.allow file. You must also perform any setup steps specific to your resource manager. Instructions are provided in the sections that follow.
To Configure the hpc.conf File
Resource managers affect only the CREOptions section of the hpc.conf file. Here is an example of that section:
Begin CREOptions
enable_core        on
corefile_name      directory and file name of core file
syslog_facility    daemon
auth_opt           sunhpc_rhosts
default_rm         cre
allow_mprun        *
End CREOptions
Only two fields affect integration with resource managers: the default_rm field and the allow_mprun field.
1. To select a default resource manager, change the default_rm field.
By selecting a default resource manager, you can save users the trouble of entering the -x flag each time they invoke the mprun command (as described in the Sun HPC ClusterTools Software User's Guide). Simply change the value of the field to one of these:
default_rm lsf
default_rm pbs
default_rm sge
2. To restrict mprun's ability to run programs in the batch processing environment, add or change the allow_mprun field.
By default, the value of that field is:
allow_mprun *
To place no restrictions on mprun, either set the value to * or remove the field.
To place restrictions on mprun, follow these steps:
a. Create the sunhpc.allow file.
The sunhpc.allow file specifies the restrictions placed on mprun by each resource manager. See To Configure the sunhpc.allow File.
b. Change the value of the allow_mprun field.
Add the names of the resource managers that do not place restrictions on mprun.
To Configure the sunhpc.allow File
The sunhpc.allow file specifies the restrictions that a batch system will place on mprun's ability to call other programs while running in a batch processing environment.
By default, sunhpc.allow is ignored. It is only examined when the hpc.conf file asks for restrictions (see To Configure the hpc.conf File).
The sunhpc.allow file is a text file with one filename pattern per line. Each filename pattern specifies the directories whose programs mprun is allowed to launch. For example:
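(The paths shown here are illustrative; list the directories appropriate to your site.)

/opt/SUNWhpc/bin/*
/usr/bin/*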
The example above uses an asterisk to denote all subdirectories or files. For a list of valid pattern-matching symbols, see fnmatch(1).
Before launching a program, mprun examines the sunhpc.allow file and tries to match the program's absolute pathname to the patterns in the file. If the program's absolute pathname matches one of the patterns, the program is launched. If it does not match, the program is not launched, and an error message is displayed to the user, listing the absolute path that had no match in the file.
The sunhpc.allow file must be located on the master node and owned by root. It is read only at startup. If you add new entries to it, you must restart the master node daemons.
To Enable Close Integration With PBS
To enable close integration with PBS, follow these procedures.
1. Ensure that a nodes file exists on the master server.
The nodes file must contain a list of entries for each node and the number of processors in each node. For example,
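In the following sketch, node0 and node1 are hypothetical host names, and np gives the number of processors on each node, using the standard PBS nodes-file syntax:

node0 np=4
node1 np=4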
Name the file /usr/spool/PBS/server_priv/nodes.
To allow the nodes to be oversubscribed (running more processes than there are processors), set np to a value larger than the actual number of processors. For example:
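Continuing the hypothetical two-node example, these entries advertise eight slots on each four-processor node:

node0 np=8
node1 np=8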
To Enable Close Integration With LSF
There is nothing special that you need to do to enable close integration with LSF. Follow the instructions in the LSF User Guide or in Using Platform LSF HPC in order to run jobs.
For more information about LSF and LSF documentation, see the Platform Computing web site.
To Enable Close Integration With SGE
The main thing you have to do is make sure each Sun CRE node is configured as an SGE execution host, and define a parallel environment for each queue in the SGE cluster that will be used as a Sun CRE node. Follow these steps.
1. Create a PE (Parallel Environment) configuration file (cre.pe) with the following contents.
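A sketch of such a file is shown below. The attribute names are standard SGE parallel environment attributes; the slots count and the allocation_rule value are illustrative and should be adapted to your site.

pe_name            cre
slots              64
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min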
In user_lists, the NONE value enables all users and excludes no users.
The control_slaves value must be TRUE, or SGE will prohibit parallel jobs, causing qrsh to exit with an error message that does not indicate the cause.
The job_is_first_task value must be FALSE, or mprun itself will be counted as one of the slots. If that happens, only n-1 processes will start, and the job will fail.
Note - The urgency_slots value was added in Sun N1 Grid Engine 6 (N1GE6). Previous SGE versions specified a queue_list value instead; the queue_list attribute has been removed.
2. Initialize the PE configuration file.
Use the qconf command. For example, if you named the configuration file cre.pe, enter:
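(This sketch assumes the standard qconf -Ap option, which adds the parallel environment defined in the named file.)

qconf -Ap cre.pe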