This chapter discusses the Sun HPC configuration file hpc.conf, which defines attributes of Sun HPC clusters that are not defined in any LSF configuration files. A single hpc.conf file is shared by all the nodes in a cluster. It resides in the LSF shared directory LSF_SHARED_LOC.
The configuration file hpc.conf is organized into six sections, which are summarized below and illustrated in Example 2-1.
The ShmemResource section defines certain shared memory attributes. See "ShmemResource Section" for details.
The Netif section lists and ranks all network interfaces to which Sun HPC nodes are connected. See "Netif Section" for details.
The MPIOptions section allows the administrator to control certain MPI parameters by setting them in the hpc.conf file. See "MPIOptions Section" for details.
The PFSFileSystem section names and defines all parallel file systems in the Sun HPC cluster. See "PFSFileSystem Section" for details.
The PFSServers section names and defines all parallel file system servers in the Sun HPC cluster. See "PFSServers Section" for details.
The HPCNodes section can be used to define an HPC cluster that consists of a subset of the nodes contained in the LSF cluster. See "HPCNodes Section" for details.
The hpc.conf file follows the formatting conventions of LSF configuration files. That is, each configuration section is bracketed by a Begin/End keyword pair and, when a parameter definition involves multiple fields, the fields are separated by spaces.
Sun HPC ClusterTools 3.0 software is distributed with an hpc.conf template, which you can edit to suit your particular configuration requirements. This chapter provides instructions for editing each part of hpc.conf.
When any changes are made to hpc.conf, the system should be in a quiescent state. To ensure that it is safe to edit hpc.conf, shut down the LSF Batch daemons. See Chapter 3 "Managing LSF Batch" in the LSF Batch Administrator's Guide for instructions on stopping and starting LSF Batch daemons.
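For example, a minimal shutdown sketch, assuming the badmin hshutdown subcommand described in that guide, would stop the slave batch daemons on all hosts before you begin editing:

hpc-demo# badmin hshutdown all

The commands for restarting the daemons after editing are shown at the end of this chapter.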
Begin ShmemResource
   :
End ShmemResource

Begin Netif
NAME   RANK   MTU   STRIPE   PROTOCOL   LATENCY   BANDWIDTH
 :      :      :      :         :          :          :
End Netif

Begin MPIOptions queue=hpc
   :
End MPIOptions

Begin PFSFileSystem=pfs1
NODE   DEVICE   THREADS
 :       :         :
End PFSFileSystem

Begin PFSServers
NODE   BUFFER_SIZE
 :       :
End PFSServers

Begin HPCNodes
   :
End HPCNodes
If changes need to be made to the PFSFileSystem or PFSServers sections, all PFS file systems must be unmounted first. This requirement is in addition to the need to stop and restart the LSF Batch daemons, as described in the previous note. See Chapter 5, "Starting and Stopping PFS Daemons," in this manual for instructions on how to perform these tasks.
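For example, before editing either section you would unmount each PFS file system from every node on which it is mounted; the mount point shown here is hypothetical:

hpc-node0# umount /pfs-demo0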
The ShmemResource section provides the administrator with two parameters that control allocation of shared memory and swap space: MaxAllocMem and MaxAllocSwap. This special memory allocation control is needed because some Sun HPC ClusterTools components use shared memory.
Example 2-2 shows the ShmemResource template in the hpc.conf file shipped with Sun HPC ClusterTools 3.0 software.
#Begin ShmemResource
#MaxAllocMem   0x7fffffffffffffff
#MaxAllocSwap  0x7fffffffffffffff
#End ShmemResource
To set MaxAllocMem and/or MaxAllocSwap limits, remove the comment character (#) from the start of each line and replace the current value, 0x7fffffffffffffff, with the desired limit.
The following section explains how to set these limits.
Sun HPC's internal shared memory allocator permits an application to use swap space, the amount of which is the smaller of:
The value (in bytes) given by the MaxAllocSwap parameter.
90% of the available swap space on the node.
If MaxAllocSwap is not specified, or if zero or a negative value is specified, 90% of a node's available swap will be used as the swap limit.
The MaxAllocMem parameter can be used to limit the amount of shared memory that can be allocated. If a smaller shared memory limit is not specified, the shared memory limit will be 90% of available physical memory.
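As an illustration only, a ShmemResource section that limits shared memory to 1 Gbyte and swap usage to 2 Gbytes would look like the following; the values are hypothetical, not recommendations:

Begin ShmemResource
MaxAllocMem   0x40000000
MaxAllocSwap  0x80000000
End ShmemResource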
The following Sun HPC ClusterTools components use shared memory:
The resource management software uses shared memory to hold cluster and job table information. Its memory use is based on cluster and job sizes and is not controllable by the user. Shared memory space is allocated for the runtime environment when the LSF daemon starts and is not affected by MaxAllocMem and MaxAllocSwap settings. This ensures that the runtime environment and LSF Base subsystem can start up no matter how low these memory-limit variables have been set.
MPI uses shared memory for communication between processes that are on the same node. The amount of shared memory allocated by a job can be controlled by MPI environment variables.
Sun S3L uses shared memory for storing data. An MPI application can allocate parallel arrays whose subgrids are in shared memory. This is done with the utility S3L_declare_detailed(), described in the Sun S3L Programming and Reference Guide.
Sun S3L supports a special form of shared memory known as Intimate Shared Memory (ISM), which reserves a region in physical memory for shared memory use. What makes ISM space special is that it is not swappable and, therefore, cannot be made available for other use. For this reason, the amount of memory allocated to ISM should be kept to a minimum.
Shared memory and swap space limits are applied per-job on each node.
If you have set up your system for dedicated use (only one job at a time is allowed), you should leave MaxAllocMem and MaxAllocSwap undefined. This will allow jobs to maximize use of swap space and physical memory.
If, however, multiple jobs will share a system, you may wish to set MaxAllocMem to some level below 50% of total physical memory. This will reduce the risk of having a single application lock up physical memory. How much below 50% you choose to set it will depend on how many jobs you expect to be competing for physical memory at any given time.
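For example, on a node with 4 Gbytes of physical memory, 50% corresponds to 2 Gbytes (0x80000000); if two or three jobs are expected to compete for memory at once, a MaxAllocMem limit somewhere between 1 Gbyte (0x40000000) and 1.5 Gbytes (0x60000000) might be a reasonable starting point. These figures are illustrative only.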
When users make direct calls to mmap(2) or shmget(2), they are not limited by the MaxAllocMem and MaxAllocSwap variables. These interfaces allocate shared memory independently of the MaxAllocMem and MaxAllocSwap values.
The Netif section identifies the network interfaces supported by the Sun HPC cluster and specifies the rank and striping attributes for each interface. The hpc.conf template that is supplied with Sun HPC ClusterTools 3.0 software contains a default list of supported network interfaces as well as their default ranking. Example 2-3 represents a portion of the default Netif section.
Begin Netif
NAME     RANK   MTU     STRIPE   PROTOCOL   LATENCY   BANDWIDTH
midn     0      16384   0        tcp        20        150
idn      10     16384   0        tcp        20        150
scin     20     32768   1        tcp        20        150
 :        :      :       :        :          :          :
scid     40     32768   1        tcp        20        150
 :        :      :       :        :          :          :
scirsm   45     32768   1        rsm        20        150
 :        :      :       :        :          :          :
 :        :      :       :        :          :          :
smc      220    4096    0        tcp        20        150
End Netif
The first column lists the names of possible network interface types.
The rank of an interface is the order in which that interface is to be preferred over other interfaces. That is, if an interface with a rank of 0 is available when a communication operation begins, it will be selected for the operation before interfaces with ranks of 1 or greater. Likewise, an available rank-1 interface will be used before interfaces with a rank of 2 or greater.
Because hpc.conf is a shared, cluster-wide configuration file, the rank specified for a given interface will apply to all nodes in the cluster.
Network ranking decisions are usually influenced by site-specific conditions and requirements. Although interfaces connected to the fastest network in a cluster are often given preferential ranking, raw network bandwidth is only one consideration. For example, an administrator might decide to dedicate one network that offers very low latency, but not the highest bandwidth, to all intra-cluster communication and use a higher-capacity network for connecting the cluster to other systems.
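For instance, with the default ranking shown in Example 2-3, communication will use the scin interface (rank 20) in preference to the smc interface (rank 220) whenever both are available; lowering the rank of smc below 20 would reverse that preference.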
The MTU column is a placeholder. Its contents are not used at this time.
Sun HPC ClusterTools 3.0 software supports scalable communication between cluster nodes through striping of MPI messages over SCI interfaces, as described in the Sun HPC 3.0 SCI Guide. In striped communication, a message is split into smaller packets and transmitted in two or more parallel streams over a set of network interfaces that have been logically combined into a stripe-group.
The STRIPE column allows the administrator to include individual SCI network interfaces in a stripe-group pool. Members of this pool are available to be included in logical stripe groups. These stripe groups are formed on an as-needed basis, selecting interfaces from this stripe-group pool.
To include the SCI interface in a stripe-group pool, set its STRIPE value to 1. To exclude an interface from the pool, specify 0. Up to four SCI network interface cards per node can be configured for stripe-group membership.
When a message is submitted for transmission over the SCI network, an MPI protocol module distributes the message over as many SCI network interfaces as are available.
Stripe-group membership is made optional so you can reserve some SCI network bandwidth for non-striped use. To do so, simply set STRIPE = 0 on the SCI network interface(s) you wish to reserve in this way.
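For example, to keep the scirsm interface from Example 2-3 in the stripe-group pool, its entry remains as shown there:

scirsm   45   32768   1   rsm   20   150

Changing the STRIPE field of that entry from 1 to 0 would instead reserve the interface for non-striped use.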
The PROTOCOL column identifies the communication protocol used by the interface. The scirsm interface employs the RSM (Remote Shared Memory) protocol. The others all use TCP (Transmission Control Protocol).
The LATENCY column is a placeholder. Its contents are not used at this time.
The BANDWIDTH column is a placeholder. Its contents are not used at this time.
This section provides the means to control many MPI runtime parameters at the queue level. This is done by naming the LSF queue of interest and then listing the parameters to be defined, along with their desired values.
The hpc.conf file that is shipped with Sun HPC ClusterTools 3.0 software includes two MPIOptions templates--see Example 2-4. The first template, which contains the phrase queue=hpc, is designed for use by a multiuser queue named hpc. The second template, which contains the phrase queue=performance, is intended for use by a dedicated (single-user) queue named performance.
To use either template, simply delete the comment character (#) from the beginning of each line in the template of interest.
The options in the general-purpose template (queue=hpc) are the same as the default settings of the Sun MPI library. In other words, these settings are already in effect even if you do not use the template, provided you have not changed any of the library defaults. The template is provided in the MPIOptions section so you can see which options are most beneficial when operating in multiuser mode.
# Following is an example of the options that affect the runtime
# environment of the MPI library. The listings below are identical
# to the default settings of the library. The "queue=hpc" phrase
# makes it an LSF-specific entry, and only for the queue named hpc.
# These options are a good choice for a multiuser queue. To be
# recognized by CRE, the "Queue=hpc" needs to be removed.
#
# Begin MPIOptions queue=hpc
# coscheduling       avail
# pbind              avail
# spindtimeout       1000
# progressadjust     on
# spin               off
#
# shm_numpostbox     16
# shm_shortmsgsize   256
# rsm_numpostbox     15
# rsm_shortmsgsize   401
# rsm_maxstripe      2
# End MPIOptions

# The listing below is a good choice when trying to get maximum
# performance out of MPI jobs that are running in a queue that
# allows only one job to run at a time.
#
# Begin MPIOptions Queue=performance
# coscheduling       off
# spin               on
# End MPIOptions
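For example, once the comment characters are removed from the second template, the active entry for the performance queue reduces to:

Begin MPIOptions Queue=performance
coscheduling   off
spin           on
End MPIOptions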
Table 2-1 provides brief descriptions of the MPI runtime options that can be set in hpc.conf. Each description identifies the default value and describes the effect of each legal value. Refer to the Sun MPI 4.0 User's Guide: With LSF for fuller descriptions.
Some MPI options not only control a parameter directly, they can also be set to a value that passes control of the parameter to an environment variable. Where an MPI option has an associated environment variable, Table 2-1 names the environment variable.
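For instance, when an option has been left at a value that defers to its environment variable, a user can adjust that parameter for a single job from the shell at submission time. The sketch below assumes the MPI_SPIN variable name and a csh-style shell, and that jobs are launched with bsub and pam; confirm the actual variable names in Table 2-1:

% setenv MPI_SPIN 1
% bsub -q hpc -n 4 pam a.out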
The PFSFileSystem section describes the parallel file systems that Sun MPI applications can use. This description includes:
The name of the parallel file system.
The hostname of each server node in the parallel file system.
The name of the storage device to be included in the parallel file system being defined.
The number of PFS I/O threads spawned to support each PFS storage device.
A separate PFSFileSystem section is needed for each parallel file system that you want to create. Example 2-5 shows a sample PFSFileSystem section with two parallel file systems, pfs-demo0 and pfs-demo1.
Begin PFSFileSystem=pfs-demo0
NODE        DEVICE               THREADS
hpc-node0   /dev/rdsk/c0t1d0s2   1
hpc-node1   /dev/rdsk/c0t1d0s2   1
hpc-node2   /dev/rdsk/c0t1d0s2   1
End PFSFileSystem

Begin PFSFileSystem=pfs-demo1
NODE        DEVICE               THREADS
hpc-node3   /dev/rdsk/c0t1d0s2   1
hpc-node4   /dev/rdsk/c0t1d0s2   1
End PFSFileSystem
The first line shows the name of the parallel file system. PFS file system names must not include spaces.
The NODE column lists the hostnames of the nodes that function as servers for the parallel file system being defined. The example configuration in Example 2-5 shows two parallel file systems:
pfs-demo0 - three server nodes: hpc-node0, hpc-node1, and hpc-node2.
pfs-demo1 - two server nodes: hpc-node3 and hpc-node4.
The second column gives the device name associated with each member node. This name follows Solaris device naming conventions. (Note the use of the raw device, rdsk, in the device names.)
The THREADS column allows the administrator to specify how many threads a PFS I/O daemon will spawn for the disk storage device or devices it controls. The number of threads needed by a given PFS I/O server node will depend primarily on the performance capabilities of its disk subsystem.
For a storage object with a single disk or a small storage array, one thread may be enough to exploit the storage unit's maximum I/O potential.
For a more powerful storage array, two or more threads may be needed to make full use of the available bandwidth, as in the example following this list.
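For instance, if hpc-node2 in Example 2-5 were attached to a higher-throughput array that benefits from a second thread, its entry could be changed to read as follows; the sizing is hypothetical:

hpc-node2   /dev/rdsk/c0t1d0s2   2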
A PFS I/O server is a Sun HPC node that is connected to one or more disk storage units that are defined as part of a parallel file system in the hpc.conf file--that is, they are listed in a PFSFileSystem section of the hpc.conf file. Plus, the node itself must be listed in the PFSServers section of hpc.conf, as shown in Example 2-6.
Begin PFSServers
NODE        BUFFER_SIZE
hpc-node0   150
hpc-node1   150
hpc-node2   300
hpc-node3   300
End PFSServers
In addition to being defined in hpc.conf, a PFS server also differs from other nodes in a Sun HPC cluster in that it has a PFS I/O daemon running on it.
The left column lists the hostnames of the nodes that are PFS I/O servers. In this example, they are hpc-node0 through hpc-node3.
The second column specifies the amount of memory the PFS I/O daemon will have for buffering transfer data. This value is specified in units of 32-Kbyte buffers. The number of buffers that you specify will depend on the amount of I/O traffic that server is likely to experience at any given time.
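For example, the BUFFER_SIZE value of 300 shown for hpc-node2 and hpc-node3 in Example 2-6 gives each of those daemons 300 x 32 Kbytes = 9600 Kbytes (roughly 9.4 Mbytes) of buffer space, while the value of 150 used for hpc-node0 and hpc-node1 corresponds to about 4.7 Mbytes.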
The optimal buffer size will vary with system type and load. Buffer sizes in the range of 128 to 512 provide reasonable performance on most Sun HPC Systems. You can use pfsstat to get reports on buffer cache hit rates. This can be useful for evaluating how well suited the buffer size is to the cluster's current I/O activity.
This section allows you to define a Sun HPC cluster that consists of a subset of the nodes contained in the LSF cluster. To use this configuration option, enter the hostnames of the nodes that you want in the HPC cluster in this section, one hostname per line. Example 2-7 shows a sample HPCNodes section with two nodes listed.
Begin HPCNodes
node1
node2
End HPCNodes
Whenever hpc.conf is changed, the LSF daemons must be updated with the new information. After all required changes to hpc.conf have been made, restart the LSF Base daemons LIM and RES. Use the lsadmin subcommands reconfig and resrestart as follows:
hpc-demo# lsadmin reconfig
hpc-demo# lsadmin resrestart all
Then, use the badmin subcommands reconfig to restart mbatchd and hrestart to restart the slave batch daemons:
hpc-demo# badmin reconfig
hpc-demo# badmin hrestart all
This only needs to be done from one node. See the LSF Batch Administrator's Guide for additional information about restarting LSF daemons.