When the CRE starts up, it updates portions of the resource database according to the contents of a configuration file named hpc.conf. This file is organized into six sections, which are summarized below and illustrated in Example 3-1.
The ShmemResource section specifies the maximum amount of shared memory and swap space that jobs can allocate.
The Netif section lists and ranks all network interfaces to which Sun HPC nodes may be connected.
The MPIOptions section defines various MPI parameters that can affect the communication performance of MPI jobs.
The PFSFileSystem section names and defines PFS file systems in the cluster.
The PFSServers section names and defines I/O servers for the PFS file systems.
The HPCNodes section is not used by the CRE. It applies only in an LSF-based runtime environment.
You can change any of these aspects of your cluster's configuration by editing the corresponding parts of the hpc.conf file. This section explains how to:
Prepare for editing hpc.conf. See "Prepare to Edit hpc.conf".
Create one or more I/O servers for the PFS file systems. See "Create PFS I/O Servers".
Create PFS file systems. See "Create PFS File Systems".
Specify various attributes of the network interfaces that your cluster nodes use. See "Set Up Network Interfaces".
Learn how to control MPI communication attributes. See "Specify MPI Options".
Update the CRE database. See "Update the CRE Database".
You may never need to make changes to hpc.conf other than those described in this section. However, if you do want to edit hpc.conf further, see Chapter 7, hpc.conf: Detailed Description, for a fuller description of this file.
Begin ShmemResource
:
End ShmemResource

Begin Netif
NAME     RANK   MTU     STRIPE   PROTOCOL   LATENCY   BANDWIDTH
:        :      :       :        :          :         :
End Netif

Begin MPIOptions queue=hpc
:
End MPIOptions

Begin PFSFileSystem=pfs1
NODE     DEVICE   THREADS
:        :        :
End PFSFileSystem

Begin PFSServers
NODE     BUFFER_SIZE
:        :
End PFSServers

Begin HPCNodes
:
End HPCNodes
Perform the steps described in "Stop the CRE Daemons" and "Copy the hpc.conf Template".
Stop the CRE nodal and master daemons (in that order). The nodal daemons must be stopped on each node, including the master node.
You can use one of the CCM tools (cconsole, ctelnet, or crlogin) to broadcast the nodal stop command to all the nodes from a single command entered on the master node.
Use the following scripts to stop the CRE nodal and master daemons:
# /etc/init.d/sunhpc.cre_node stop
# /etc/init.d/sunhpc.cre_master stop
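If you are not using the CCM tools, a simple shell loop can serve the same purpose. The following sketch assumes hypothetical node names (hpc-node0 through hpc-node3); substitute the host names of your own cluster nodes and run it on the master node:

# Stop the nodal daemon on every node; the list must include the master node.
for node in hpc-node0 hpc-node1 hpc-node2 hpc-node3
do
    rsh $node /etc/init.d/sunhpc.cre_node stop
done
# Then stop the master daemon on the master node.
/etc/init.d/sunhpc.cre_master stop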
If you edit the hpc.conf file at a later time and make changes to the PFSServers or PFSFileSystem sections, you will also need to unmount any PFS file systems and stop the PFS daemons on the PFS I/O servers before making the changes.
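For example, assuming a PFS file system mounted at a hypothetical mount point /pfs-demo0, you would unmount it on every node where it is mounted before editing those sections:

# umount /pfs-demo0

The command for stopping the PFS I/O daemons themselves depends on how they were started on your installation and is not shown here.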
The Sun HPC ClusterTools 3.0 distribution includes an hpc.conf template, which is stored, by default, in /opt/SUNWhpc/examples/cre/hpc.conf.template.
Copy the template from its installed location to /opt/SUNWhpc/conf/hpc.conf and edit it as described in "Create PFS I/O Servers".
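For example:

# cp /opt/SUNWhpc/examples/cre/hpc.conf.template /opt/SUNWhpc/conf/hpc.conf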
When you have finished editing hpc.conf, perform the steps described in "Update the CRE Database" to update the CRE database with the new configuration information.
Decide which cluster nodes you want to function as PFS I/O servers. To be of value as PFS I/O servers, these nodes must be connected to one or more disk storage devices with enough capacity to hold the PFS file systems you expect to store on them.
The disk storage units should include some level of RAID support to protect the file systems against failure of individual storage devices.
Once you know which nodes you want as I/O servers, list their host names on separate lines in the PFSServers section of hpc.conf. Example 3-2 shows a sample PFSServers section that includes four PFS I/O server nodes.
Begin PFSServers
NODE        BUFFER_SIZE
hpc-node0   150
hpc-node1   150
hpc-node2   300
hpc-node3   300
End PFSServers
The left column lists the host names of the PFS I/O server nodes.
The second column specifies the amount of memory the PFS I/O daemon will have available for buffering transfer data. This value is specified in units of 32-Kbyte buffers. The number of buffers you should specify depends on the amount of I/O traffic you expect the server to experience at any given time.
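For example, the entries shown in Example 3-2 give hpc-node2 a BUFFER_SIZE of 300, which corresponds to 300 x 32 Kbytes, or roughly 9.4 Mbytes, of buffer memory.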
The optimal buffer size will vary with system type and load. Buffer sizes in the range of 128 to 512 provide reasonable performance on most Sun HPC Systems.
You can use pfsstat to get reports on buffer cache hit rates. Knowing buffer cache hit rates can be useful for evaluating how well suited the buffer size is to the cluster's current I/O activity.
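A minimal invocation is simply the command name; any options it accepts and the format of its report are not described here:

# pfsstat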
Add a separate PFSFileSystem section for each PFS file system you want to create. Include the following information in each PFSFileSystem section:
The name of the parallel file system. See "Parallel File System Name".
The hostname of each server node in the parallel file system. See "Server Node Hostnames".
The name of the storage device to be included in the parallel file system being defined. See "Storage Device Names".
The number of PFS I/O threads spawned to support each PFS storage device. See "Thread Limits".
The following example shows sample PFSFileSystem sections for two parallel file systems, pfs-demo0 and pfs-demo1.
Begin PFSFileSystem=pfs-demo0
NODE        DEVICE               THREADS
hpc-node0   /dev/rdsk/c0t1d0s2   1
hpc-node1   /dev/rdsk/c0t1d0s2   1
hpc-node2   /dev/rdsk/c0t1d0s2   1
End PFSFileSystem

Begin PFSFileSystem=pfs-demo1
NODE        DEVICE               THREADS
hpc-node3   /dev/rdsk/c0t1d0s2   1
hpc-node4   /dev/rdsk/c0t1d0s2   1
End PFSFileSystem
Specify the name of the PFS file system on the first line of the section, to the right of the = symbol.
Apply the same naming conventions to PFS file systems as are used for serial Solaris files.
The NODE column lists the hostnames of the nodes that function as I/O servers for the parallel file system being defined. The example configuration in Example 3-3 shows two parallel file systems:
pfs-demo0 - three server nodes: hpc-node0, hpc-node1, and hpc-node2.
pfs-demo1 - two server nodes: hpc-node3 and hpc-node4.
Note that a single node can act as an I/O server for more than one parallel file system. This is possible when the node is attached to at least two storage devices, so that a separate device can be assigned to each file system.
In the DEVICE column, specify the name of the device that will be used by the file system. Solaris device naming conventions apply.
In the THREADS column, specify the number of threads a PFS I/O daemon will spawn for the disk storage device or devices it controls. The number of threads needed by a given PFS I/O server node will depend primarily on the performance capabilities of its disk subsystem.
For a storage object with a single disk or a small storage array, one thread may be enough to exploit the storage unit's maximum I/O potential.
For a more powerful storage array, two or more threads may be needed to make full use of the available bandwidth.
Edit the Netif section to specify various characteristics of the network interfaces that are used by the nodes in the cluster. Example 3-4 illustrates the default Netif section that is in hpc.conf.template. This section discusses the various network interface attributes that are defined in the Netif section.
Begin Netif
NAME      RANK   MTU     STRIPE   PROTOCOL   LATENCY   BANDWIDTH
midnn     0      16384   0        tcp        20        150
idn       10     16384   0        tcp        20        150
scin      20     32768   1        tcp        20        150
:         :      :       :        :          :         :
scid      40     32768   1        tcp        20        150
:         :      :       :        :          :         :
scirsm    45     32768   1        rsm        20        150
:         :      :       :        :          :         :
:         :      :       :        :          :         :
smc       220    4096    0        tcp        20        150
End Netif
Add to the first column the names of the network interfaces that are used in your cluster. The supplied Netif section contains an extensive list of commonly used interface types to simplify this task.
By convention, network interface names include a trailing number as a way to distinguish multiple interfaces of the same type. For example, if your cluster includes two 100 Mbit/second Ethernet networks, include the names hme0 and hme1 in the Netif section.
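For example, the two Ethernet interfaces might appear in the Netif section as follows. The RANK, MTU, LATENCY, and BANDWIDTH values shown here are illustrative only; choose RANK values that reflect your own network preferences, as described below.

hme0    50    1500    0    tcp    20    150
hme1    55    1500    0    tcp    20    150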
Decide the order in which you want the networks in your cluster to be preferred for use and then edit the RANK column entries to implement that order.
Network preference is based on the relative rank values of the interfaces: the lower an interface's rank value, the higher its preference. For example, an interface with a rank of 10 will be selected for use over interfaces with ranks of 11 or higher, while interfaces with ranks of 9 or lower will be preferred over it.
These ranking values are relative; their absolute values have no significance. This is why gaps are left in the default rankings: if a new interface is added, it can be given an unused rank value without changing any existing values.
Decisions about how to rank two or more dissimilar network types are usually based on site-specific conditions and requirements. Ordinarily, a cluster's fastest network is ranked above slower networks. However, raw network bandwidth is only one consideration. For example, an administrator might dedicate a network that offers very low latency, but not the highest bandwidth, to all intra-cluster communication, and use a higher-capacity network to connect the cluster to systems outside it.
The MTU column is a placeholder. Its contents are not used at this time.
If your cluster includes an SCI (Scalable Coherent Interface) network, you can implement scalable communication between cluster nodes by striping MPI messages over the SCI interfaces. In striped communication, a message is split into smaller packets and transmitted in two or more parallel streams over a set of network interfaces that have been logically combined into a stripe-group.
The STRIPE column allows the administrator to include individual SCI network interfaces in a stripe-group pool. Members of this pool are available to be included in logical stripe groups. These stripe groups are formed on an as-needed basis, selecting interfaces from this stripe-group pool.
To include the SCI interface in a stripe-group pool, set its STRIPE value to 1. To exclude an interface from the pool, specify 0. Up to four SCI network interface cards per node can be configured for stripe-group membership.
When a message is submitted for transmission over the SCI network, an MPI protocol module distributes the message over as many SCI network interfaces as are available.
Stripe-group membership is made optional so you can reserve some SCI network bandwidth for non-striped use. To do so, simply set STRIPE = 0 on the SCI network interface(s) you wish to reserve in this way.
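For example, assume a node with two SCI interfaces named scirsm0 and scirsm1 (hypothetical names that follow the trailing-number convention). The following entries place scirsm0 in the stripe-group pool and reserve scirsm1 for non-striped use; the remaining column values are simply copied from the scirsm entry in the template:

scirsm0    45    32768    1    rsm    20    150
scirsm1    46    32768    0    rsm    20    150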
This column identifies the communication protocol used by the interface. The scirsm interface employs the RSM (Remote Shared Memory) protocol. The others in the default list all use TCP (Transmission Control Protocol).
If you add a network interface of a type not represented in the hpc.conf template, you will need to specify the type of protocol the new interface uses.
The LATENCY column is a placeholder. Its contents are not used at this time.
The BANDWIDTH column is a placeholder. Its contents are not used at this time.
The MPIOptions section provides a set of options that control MPI communication behavior in ways that are likely to affect message-passing performance. It contains two templates with predefined option settings. These templates are shown in Example 3-5 and are discussed below.
General-Purpose, Multiuser Template - The first template in the MPIOptions section is designed for general-purpose use at times when multiple message-passing jobs will be running concurrently.
Performance Template - The second template is designed to maximize the performance of message-passing jobs when only one job is allowed to run at a time.
The first line of each template contains the phrase "Queue=xxxx." This is because the queue-based LSF workload manager runtime environment also uses this hpc.conf file.
The options in the general-purpose template are the same as the default settings for the Sun MPI library. In other words, you do not have to uncomment the general-purpose template to have its option values be in effect. This template is provided in the MPIOptions section so you can see what options are most beneficial when operating in a multiuser mode.
# Following is an example of the options that affect the runtime
# environment of the MPI library. The listings below are identical
# to the default settings of the library. The "queue=hpc" phrase
# makes it an LSF-specific entry, and only for the queue named hpc.
# These options are a good choice for a multiuser queue. To be
# recognized by CRE, the "Queue=hpc" needs to be removed.
#
# Begin MPIOptions queue=hpc
# coscheduling      avail
# pbind             avail
# spindtimeout      1000
# progressadjust    on
# spin              off
#
# shm_numpostbox    16
# shm_shortmsgsize  256
# rsm_numpostbox    15
# rsm_shortmsgsize  401
# rsm_maxstripe     2
# End MPIOptions
#
# The listing below is a good choice when trying to get maximum
# performance out of MPI jobs that are running in a queue that
# allows only one job to run at a time.
#
# Begin MPIOptions Queue=performance
# coscheduling  off
# spin          on
# End MPIOptions
If you want to use the performance template, do the following:
Delete the "Queue=performance" phrase from the Begin MPIOptions line.
Delete the comment character (#) from the beginning of each line of the performance template, including the Begin MPIOptions and End MPIOptions lines.
The resulting template should appear as follows:
Begin MPIOptions
coscheduling  off
spin          on
End MPIOptions
When you have finished editing hpc.conf, update the CRE database with the new information. To do this, restart the CRE master and nodal daemons as follows:
# /etc/init.d/sunhpc.cre_master start
# /etc/init.d/sunhpc.cre_node start
The nodal daemons must be restarted on all the nodes in the cluster, including the master node.
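As when stopping the daemons, you can broadcast the nodal start command with one of the CCM tools or with a simple shell loop run on the master node. The following sketch assumes hypothetical node names; substitute the host names of your own cluster nodes:

# Start the master daemon on the master node first.
/etc/init.d/sunhpc.cre_master start
# Then start the nodal daemon on every node, including the master node.
for node in hpc-node0 hpc-node1 hpc-node2 hpc-node3
do
    rsh $node /etc/init.d/sunhpc.cre_node start
done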