This chapter discusses the Sun HPC configuration file hpc.conf, which defines various attributes of a Sun HPC cluster. A single hpc.conf file is shared by all the nodes in a cluster. It resides in /opt/SUNWhpc/conf.
hpc.conf is organized into six sections, which are summarized below and illustrated in Example 7-1.
The ShmemResource section defines certain shared memory attributes. See "ShmemResource Section" for details.
The Netif section lists and ranks all network interfaces to which Sun HPC nodes are connected. See "Netif Section" for details.
The MPIOptions section allows the administrator to control certain MPI parameters by setting them in the hpc.conf file. See "MPIOptions Section" for details.
The PFSFileSystem section names and defines all parallel file systems in the Sun HPC cluster. See "PFSFileSystem Section" for details.
The PFSServers section names and defines all parallel file system servers in the Sun HPC cluster. See "PFSServers Section" for details.
The HPCNodes section is not used by the CRE. It applies only in an LSF-based runtime environment.
Each configuration section is bracketed by a Begin/End keyword pair and, when a parameter definition involves multiple fields, the fields are separated by spaces.
Sun HPC ClusterTools 3.0 software is distributed with an hpc.conf template, which is installed by default in /opt/SUNWhpc/examples/cre/hpc.conf.template. You should copy this file to /opt/SUNWhpc/conf/hpc.conf and edit it to suit your site's specific configuration requirements.
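For example, the copy can be made with a command such as the following, run as superuser on the node that hosts /opt/SUNWhpc/conf (which node that is depends on how /opt/SUNWhpc is shared in your installation):

# cp /opt/SUNWhpc/examples/cre/hpc.conf.template /opt/SUNWhpc/conf/hpc.conf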
When any changes are made to hpc.conf, the system should be in a quiescent state. To ensure that it is safe to edit hpc.conf, shut down the nodal and master CRE daemons as described in "To Shut Down the CRE Without Shutting Down Solaris". If you change PFSFileSystem or PFSServers sections, you must also unmount any PFS file systems first.
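The following is a rough sketch only; it assumes the CRE init scripts accept a stop argument that mirrors the start commands shown at the end of this chapter. Consult the referenced shutdown procedure for the supported method.

# /etc/init.d/sunhpc.cre_node stop
# /etc/init.d/sunhpc.cre_master stop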
Begin ShmemResource
:
End ShmemResource

Begin Netif
NAME    RANK    MTU    STRIPE    PROTOCOL    LATENCY    BANDWIDTH
:       :       :      :         :           :          :
End Netif

Begin MPIOptions queue=hpc
:
End MPIOptions

Begin PFSFileSystem=pfs1
NODE    DEVICE    THREADS
:       :         :
End PFSFileSystem

Begin PFSServers
NODE    BUFFER_SIZE
:       :
End PFSServers

Begin HPCNodes
:
End HPCNodes
The ShmemResource section provides the administrator with two parameters that control allocation of shared memory and swap space: MaxAllocMem and MaxAllocSwap. This special memory allocation control is needed because some Sun HPC ClusterTools components use shared memory.
Example 7-2 shows the ShmemResource template that is in the hpc.conf file that is shipped with Sun HPC ClusterTools 3.0 software.
#Begin ShmemResource
#MaxAllocMem   0x7fffffffffffffff
#MaxAllocSwap  0x7fffffffffffffff
#End ShmemResource
To set MaxAllocMem and/or MaxAllocSwap limits, remove the comment character (#) from the start of each line and replace the current value, 0x7fffffffffffffff, with the desired limit.
The following section explains how to set these limits.
Sun HPC's internal shared memory allocator permits an application to use swap space, the amount of which is the smaller of:
The value (in bytes) given by the MaxAllocSwap parameter.
90% of the available swap on the node.
If MaxAllocSwap is not specified, or if zero or a negative value is specified, 90% of a node's available swap will be used as the swap limit.
The MaxAllocMem parameter can be used to limit the amount of shared memory that can be allocated. If a smaller shared memory limit is not specified, the shared memory limit will be 90% of available physical memory.
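As an illustration only (the values are examples, not recommendations), the following uncommented ShmemResource section limits shared memory allocation to 1 Gbyte (0x40000000 bytes) and swap-backed allocation to 2 Gbytes (0x80000000 bytes). On a node with 4 Gbytes of available swap, the swap limit would then be the smaller of 2 Gbytes and 3.6 Gbytes (90% of 4 Gbytes), that is, 2 Gbytes.

Begin ShmemResource
MaxAllocMem   0x40000000
MaxAllocSwap  0x80000000
End ShmemResource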
The following Sun HPC ClusterTools components use shared memory:
The CRE uses shared memory to hold cluster and job table information. Its memory use is based on cluster and job sizes and is not controllable by the user. Shared memory space is allocated for the CRE when it starts up and is not affected by MaxAllocMem and MaxAllocSwap settings. This ensures that the CRE can start up no matter how low these memory-limit variables have been set.
MPI uses shared memory for communication between processes that are on the same node. The amount of shared memory allocated by a job can be controlled by MPI environment variables.
Sun S3L uses shared memory for storing data. An MPI application can allocate parallel arrays whose subgrids are in shared memory. This is done with the utility S3L_declare_detailed().
Sun S3L supports a special form of shared memory known as Intimate Shared Memory (ISM), which reserves a region in physical memory for shared memory use. What makes ISM space special is that it is not swappable and, therefore, cannot be made available for other use. For this reason, the amount of memory allocated to ISM should be kept to a minimum.
Shared memory and swap space limits are applied per-job on each node.
If you have set up your system for dedicated use (only one job at a time is allowed), you should leave MaxAllocMem and MaxAllocSwap undefined. This will allow jobs to maximize use of swap space and physical memory.
If, however, multiple jobs will share a system, you may want to set MaxAllocMem to some level below 50% of total physical memory. This will reduce the risk of having a single application lock up physical memory. How much below 50% you choose to set it will depend on how many jobs you expect to be competing for physical memory at any given time.
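For example (illustrative numbers only): on nodes with 4 Gbytes of physical memory, 50% is 2 Gbytes, so an administrator expecting several jobs to compete for memory might choose a limit of 1.5 Gbytes (0x60000000 bytes):

Begin ShmemResource
MaxAllocMem   0x60000000
End ShmemResource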
When users make direct calls to mmap(2) or shmget(2), they are not limited by the MaxAllocMem and MaxAllocSwap variables. These interfaces allocate shared memory independently of the MaxAllocMem and MaxAllocSwap values.
The Netif section identifies the network interfaces supported by the Sun HPC cluster and specifies the rank and striping attributes for each interface. The hpc.conf template that is supplied with Sun HPC ClusterTools 3.0 software contains a default list of supported network interfaces as well as their default ranking. Example 7-3 shows a portion of the default Netif section.
Begin Netif
NAME     RANK   MTU     STRIPE   PROTOCOL   LATENCY   BANDWIDTH
midnn    0      16384   0        tcp        20        150
idn      10     16384   0        tcp        20        150
scin     20     32768   1        tcp        20        150
:        :      :       :        :          :         :
scid     40     32768   1        tcp        20        150
:        :      :       :        :          :         :
scirsm   45     32768   1        rsm        20        150
:        :      :       :        :          :         :
:        :      :       :        :          :         :
smc      220    4096    0        tcp        20        150
End Netif
The NAME column lists the names of the supported network interface types.
The RANK column specifies the order in which an interface is to be preferred over other interfaces. That is, if an interface with a rank of 0 is available when a communication operation begins, it will be selected for the operation before interfaces with a rank of 1 or greater. Likewise, an available rank 1 interface will be used before interfaces with a rank of 2 or greater.
Because hpc.conf is a shared, cluster-wide configuration file, the rank specified for a given interface will apply to all nodes in the cluster.
Network ranking decisions are usually influenced by site-specific conditions and requirements. Although interfaces connected to the fastest network in a cluster are often given preferential ranking, raw network bandwidth is only one consideration. For example, an administrator might dedicate a network that offers very low latency, but not the highest bandwidth, to intra-cluster communication and use a higher-capacity network to connect the cluster to other systems.
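As a hypothetical illustration of ranking (the rank value of 15 is invented; the remaining fields are copied from Example 7-3), an administrator who wants the scirsm interface tried before scin would give scirsm the lower rank value:

Begin Netif
NAME     RANK   MTU     STRIPE   PROTOCOL   LATENCY   BANDWIDTH
scirsm   15     32768   1        rsm        20        150
scin     20     32768   1        tcp        20        150
End Netif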
The MTU column is a placeholder. Its contents are not used at this time.
Sun HPC ClusterTools 3.0 software supports scalable communication between cluster nodes through striping of MPI messages over SCI interfaces. In striped communication, a message is split into smaller packets and transmitted in two or more parallel streams over a set of network interfaces that have been logically combined into a stripe-group.
The STRIPE column allows the administrator to include individual SCI network interfaces in a stripe-group pool. Members of this pool are available to be included in logical stripe groups. These stripe groups are formed on an as-needed basis, selecting interfaces from this stripe-group pool.
To include an SCI interface in the stripe-group pool, set its STRIPE value to 1. To exclude an interface from the pool, set the value to 0. Up to four SCI network interface cards per node can be configured for stripe-group membership.
When a message is submitted for transmission over the SCI network, an MPI protocol module distributes the message over as many SCI network interfaces as are available.
Stripe-group membership is made optional so you can reserve some SCI network bandwidth for non-striped use. To do so, simply set STRIPE = 0 on the SCI network interface(s) you wish to reserve in this way.
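For example, the following excerpt (based on the SCI rows of Example 7-3, with scid's STRIPE value changed for illustration) keeps scin and scirsm in the stripe-group pool while reserving scid for non-striped use:

NAME     RANK   MTU     STRIPE   PROTOCOL   LATENCY   BANDWIDTH
scin     20     32768   1        tcp        20        150
scid     40     32768   0        tcp        20        150
scirsm   45     32768   1        rsm        20        150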
The PROTOCOL column identifies the communication protocol used by the interface. The scirsm interface uses the RSM (Remote Shared Memory) protocol; the others use TCP (Transmission Control Protocol).
The LATENCY column is a placeholder. Its contents are not used at this time.
The BANDWIDTH column is a placeholder. Its contents are not used at this time.
This section provides a set of options that control MPI communication behavior in ways that are likely to affect message-passing performance. It contains two templates with predefined option settings. These templates are shown in Example 7-4 and discussed below.
General-Purpose, Multiuser Template - The first template in the MPIOptions section is designed for general-purpose use at times when multiple message-passing jobs will be running concurrently.
Performance Template - The second template is designed to maximize the performance of message-passing jobs when only one job is allowed to run at a time.
The first line of each template contains the phrase "Queue=xxxx." This is because the queue-based LSF workload management runtime environment uses the same hpc.conf file as the CRE.
The options in the general-purpose template are the same as the default settings for the Sun MPI library. In other words, you do not have to uncomment the general-purpose template for its option values to take effect. This template is provided in the MPIOptions section so you can see which options are most beneficial when operating in multiuser mode.
If you want to use the performance template, do the following:
Delete the "Queue=performance" phrase from the Begin MPIOptions line.
Delete the comment character (#) from the beginning of each line of the performance template, including the Begin MPIOptions and End MPIOptions lines.
The resulting template should appear as follows:
Begin MPIOptions
coscheduling   off
spin           on
End MPIOptions
Table 7-1 provides brief descriptions of the MPI runtime options that can be set in hpc.conf. Each description identifies the default value and describes the effect of each legal value.
Some MPI options not only control a parameter directly; they can also be set to a value that passes control of the parameter to an environment variable. Where an MPI option has an associated environment variable, Table 7-1 names the environment variable.
# Following is an example of the options that affect the runtime
# environment of the MPI library. The listings below are identical
# to the default settings of the library. The "queue=hpc" phrase
# makes it an LSF-specific entry, and only for the queue named hpc.
# These options are a good choice for a multiuser queue. To be
# recognized by CRE, the "Queue=hpc" needs to be removed.
#
# Begin MPIOptions queue=hpc
# coscheduling     avail
# pbind            avail
# spindtimeout     1000
# progressadjust   on
# spin             off
#
# shm_numpostbox     16
# shm_shortmsgsize   256
# rsm_numpostbox     15
# rsm_shortmsgsize   401
# rsm_maxstripe      2
# End MPIOptions
#
# The listing below is a good choice when trying to get maximum
# performance out of MPI jobs that are running in a queue that
# allows only one job to run at a time.
#
# Begin MPIOptions Queue=performance
# coscheduling   off
# spin           on
# End MPIOptions
The PFSFileSystem section describes the parallel file systems that Sun MPI applications can use. This description includes:
The name of the parallel file system.
The hostname of each server node in the parallel file system.
The name of the storage device to be included in the parallel file system being defined.
The number of PFS I/O threads spawned to support each PFS storage device.
A separate PFSFileSystem section is needed for each parallel file system that you want to create. Example 7-5 shows a sample PFSFileSystem section with two parallel file systems, pfs0 and pfs1.
Begin PFSFileSystem=pfs0
NODE     DEVICE               THREADS
node0    /dev/rdsk/c0t1d0s2   1
node1    /dev/rdsk/c0t1d0s2   1
node2    /dev/rdsk/c0t1d0s2   1
End PFSFileSystem

Begin PFSFileSystem=pfs1
NODE     DEVICE               THREADS
node2    /dev/rdsk/c0t1d0s2   1
node3    /dev/rdsk/c0t1d0s2   1
End PFSFileSystem
The first line shows the name of the parallel file system. PFS file system names must not include spaces.
The NODE column lists the hostnames of the nodes that function as I/O servers for the parallel file system being defined. The example configuration in Example 7-5 shows two parallel file systems:
pfs0 - three server nodes: node0, node1, and node2.
pfs1 - two server nodes: node2 and node3.
Note that I/O server node2 is used by both pfs0 and pfs1. Note also that node3 is used both as a PFS I/O server and as a computation server; that is, it also executes application code.
The DEVICE column gives the device name associated with each member node. This name follows Solaris device naming conventions.
The THREADS column allows the administrator to specify how many threads a PFS I/O daemon will spawn for the disk storage device or devices it controls. The number of threads needed by a given PFS I/O server node will depend primarily on the performance capabilities of its disk subsystem.
For a storage object with a single disk or a small storage array, one thread may be enough to exploit the storage unit's maximum I/O potential.
For a more powerful storage array, two or more threads may be needed to make full use of the available bandwidth, as in the sketch below.
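The following entry is a hypothetical sketch only; the node name, device path, and thread count are invented for illustration and do not come from Example 7-5. It shows a server node with a higher-throughput array being given several I/O threads:

Begin PFSFileSystem=pfs2
NODE     DEVICE               THREADS
node4    /dev/rdsk/c1t0d0s2   4
End PFSFileSystem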
A PFS I/O server is a Sun HPC node that is
Listed in the PFSServers section of hpc.conf, as shown in Example 7-6.
Connected to one or more disk storage units that are listed in a PFSFileSystem section of the hpc.conf file.
Begin PFSServers
NODE     BUFFER_SIZE
node0    150
node1    150
node2    300
node3    300
End PFSServers
In addition to being defined in hpc.conf, a PFS I/O server also differs from other nodes in a Sun HPC cluster in that it has a PFS I/O daemon running on it.
The NODE column lists the hostnames of the nodes that are PFS I/O servers. In this example, they are node0 through node3.
The BUFFER_SIZE column specifies the amount of memory the PFS I/O daemon will have for buffering transfer data. This value is specified in units of 32-Kbyte buffers; for example, a value of 300 gives the daemon 9600 Kbytes of buffer space. The number of buffers you specify will depend on the amount of I/O traffic you expect the server to experience at any given time.
The optimal buffer size will vary with system type and load. Buffer sizes in the range of 128 to 512 provide reasonable performance on most Sun HPC Systems. You can use pfsstat to get reports on buffer cache hit rates. This can be useful for evaluating how well suited the buffer size is to the cluster's current I/O activity.
This section is used only in clusters that use LSF, rather than the CRE, as the workload manager. The CRE ignores the HPCNodes section of the hpc.conf file.
Whenever hpc.conf is changed, the CRE database must be updated with the new information. After all required changes to hpc.conf have been made, restart the CRE master and nodal daemons as follows:
# /etc/init.d/sunhpc.cre_master start
# /etc/init.d/sunhpc.cre_node start
If PFS file systems were unmounted and PFS I/O daemons were stopped, restart the I/O daemons and remount the PFS file systems.
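As a hedged sketch only: this assumes PFS file systems are mounted with the standard Solaris mount -F fstype form and that a file system named pfs0 uses the mount point /pfs0 (both the mount point and the exact argument form are assumptions, not taken from this chapter). Remounting might then look like this:

# mount -F pfs pfs0 /pfs0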