Sun HPC ClusterTools 3.0 Administrator's Guide: With CRE

Chapter 7 hpc.conf: Detailed Description

This chapter discusses the Sun HPC configuration file hpc.conf, which defines various attributes of a Sun HPC cluster. A single hpc.conf file is shared by all the nodes in a cluster. It resides in /opt/SUNWhpc/conf.

hpc.conf is organized into six sections, which are summarized below and illustrated in Example 7-1.

Each configuration section is bracketed by a Begin/End keyword pair and, when a parameter definition involves multiple fields, the fields are separated by spaces.

Sun HPC ClusterTools 3.0 software is distributed with an hpc.conf template, which is installed by default in /opt/SUNWhpc/examples/cre/hpc.conf.template. You should copy this file to /opt/SUNWhpc/conf/hpc.conf and edit it to suit your site's specific configuration requirements.


Note -

When any changes are made to hpc.conf, the system should be in a quiescent state. To ensure that it is safe to edit hpc.conf, shut down the nodal and master CRE daemons as described in "To Shut Down the CRE Without Shutting Down Solaris". If you change the PFSFileSystem or PFSServers sections, you must also unmount any PFS file systems first.



Example 7-1 General Organization of hpc.conf File


Begin ShmemResource
  :
End ShmemResource

Begin Netif
NAME    RANK    MTU    STRIPE    PROTOCOL    LATENCY    BANDWIDTH
  :      :       :        :         :           :           :
End Netif

Begin MPIOptions queue=hpc
  :
End MPIOptions

Begin PFSFileSystem=pfs1
NODE            DEVICE            THREADS
  :               :                  :
End PFSFileSystem

Begin PFSServers
NODE            BUFFER_SIZE
  :               :
End PFSServers

Begin HPCNodes
  :
End HPCNodes

ShmemResource Section

The ShmemResource section provides the administrator with two parameters that control allocation of shared memory and swap space: MaxAllocMem and MaxAllocSwap. This special memory allocation control is needed because some Sun HPC ClusterTools components use shared memory.

Example 7-2 shows the ShmemResource template that is in the hpc.conf file that is shipped with Sun HPC ClusterTools 3.0 software.


Example 7-2 ShmemResource Section Example


#Begin ShmemResource
#MaxAllocMem  0x7fffffffffffffff
#MaxAllocSwap 0x7fffffffffffffff
#End ShmemResource

To set MaxAllocMem and/or MaxAllocSwap limits, remove the comment character (#) from the start of each line and replace the current value, 0x7fffffffffffffff, with the desired limit.

The following section explains how to set these limits.

Guidelines for Setting Limits

Sun HPC's internal shared memory allocator permits an application to use swap space. The swap space limit is the smaller of the value specified by MaxAllocSwap and 90% of the node's available swap.

If MaxAllocSwap is not specified, or if zero or a negative value is specified, 90% of a node's available swap will be used as the swap limit.

The MaxAllocMem parameter can be used to limit the amount of shared memory that can be allocated. If a smaller shared memory limit is not specified, the shared memory limit will be 90% of available physical memory.

The following Sun HPC ClusterTools components use shared memory:

If you have set up your system for dedicated use (only one job at a time is allowed), you should leave MaxAllocMem and MaxAllocSwap undefined. This will allow jobs to maximize use of swap space and physical memory.

If, however, multiple jobs will share a system, you may want to set MaxAllocMem to some level below 50% of total physical memory. This will reduce the risk of having a single application lock up physical memory. How much below 50% you choose to set it will depend on how many jobs you expect to be competing for physical memory at any given time.
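For example, on a hypothetical node with 2 Gbytes of physical memory that is routinely shared by several jobs, you might cap shared memory allocation at 0x20000000 (512 Mbytes, or 25% of physical memory) while leaving the swap limit at its template value. The figures below are illustrative only, not recommendations:

Begin ShmemResource
MaxAllocMem  0x20000000
MaxAllocSwap 0x7fffffffffffffff
End ShmemResource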


Note -

When users make direct calls to mmap(2) or shmget(2), they are not limited by the MaxAllocMem and MaxAllocSwap variables. These system calls manipulate shared memory independently of the MaxAllocMem and MaxAllocSwap values.


Netif Section

The Netif section identifies the network interfaces supported by the Sun HPC cluster and specifies the rank and striping attributes for each interface. The hpc.conf template that is supplied with Sun HPC ClusterTools 3.0 software contains a default list of supported network interfaces as well as their default ranking. Example 7-3 represents a portion of the default Netif section.


Example 7-3 Netif Section Example


Begin Netif
NAME           RANK     MTU       STRIPE     PROTOCOL     LATENCY     BANDWIDTH
midnn          0        16384     0          tcp          20          150
idn            10       16384     0          tcp          20          150
scin           20       32768     1          tcp          20          150
  :            :        :         :          :            :           :
scid           40       32768     1          tcp          20          150
  :            :        :         :          :            :           :
scirsm         45       32768     1          rsm          20          150
  :            :        :         :          :            :           :
smc            220      4096      0          tcp          20          150
End Netif

Interface Names

The first column lists the names of possible network interface types.

Rank Attribute

The rank of an interface is the order in which that interface is to be preferred over other interfaces. That is, if an interface with a rank of 0 is available when a communication operation begins, it will be selected for the operation before interfaces with ranks of 1 or greater. Likewise, an available rank 1 interface will be used before interfaces with a rank of 2 or greater.


Note -

Because hpc.conf is a shared, cluster-wide configuration file, the rank specified for a given interface will apply to all nodes in the cluster.


Network ranking decisions are usually influenced by site-specific conditions and requirements. Although interfaces connected to the fastest network in a cluster are often given preferential ranking, raw network bandwidth is only one consideration. For example, an administrator might decide to dedicate one network that offers very low latency, but not the fastest bandwidth, to all intra-cluster communication and use a higher-capacity network for connecting the cluster to other systems.
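As an illustration only (not a recommended configuration), an administrator who wanted communication to prefer the scirsm interface over the TCP-based SCI interfaces could give scirsm a lower rank number than scin and scid; the other column values below are simply carried over from the default template:

Begin Netif
NAME           RANK     MTU       STRIPE     PROTOCOL     LATENCY     BANDWIDTH
scirsm         10       32768     1          rsm          20          150
scin           20       32768     1          tcp          20          150
scid           40       32768     1          tcp          20          150
End Netif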

MTU Attribute

This is a placeholder column. Its contents are not used at this time.

Stripe Attribute

Sun HPC ClusterTools 3.0 software supports scalable communication between cluster nodes through striping of MPI messages over SCI interfaces. In striped communication, a message is split into smaller packets and transmitted in two or more parallel streams over a set of network interfaces that have been logically combined into a stripe-group.

The STRIPE column allows the administrator to include individual SCI network interfaces in a stripe-group pool. Members of this pool are available to be included in logical stripe groups. These stripe groups are formed on an as-needed basis, selecting interfaces from this stripe-group pool.

To include the SCI interface in a stripe-group pool, set its STRIPE value to 1. To exclude an interface from the pool, specify 0. Up to four SCI network interface cards per node can be configured for stripe-group membership.

When a message is submitted for transmission over the SCI network, an MPI protocol module distributes the message over as many SCI network interfaces as are available.

Stripe-group membership is made optional so you can reserve some SCI network bandwidth for non-striped use. To do so, simply set STRIPE = 0 on the SCI network interface(s) you wish to reserve in this way.
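For example, to keep scid out of the stripe-group pool (reserving it for non-striped use) while leaving scin and scirsm available for striping, the relevant entries might look like the following; all values other than the STRIPE column are taken unchanged from the default template:

Begin Netif
NAME           RANK     MTU       STRIPE     PROTOCOL     LATENCY     BANDWIDTH
scin           20       32768     1          tcp          20          150
scid           40       32768     0          tcp          20          150
scirsm         45       32768     1          rsm          20          150
End Netif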

Protocol Attribute

This column identifies the communication protocol used by the interface. The scirsm interface employs the RSM (Remote Shared Memory) protocol. The others all use TCP (Transmission Control Protocol).

Latency Attribute

This is a placeholder column. Its contents are not used at this time.

Bandwidth Attribute

This is a placeholder column. Its contents are not used at this time.

MPIOptions Section

Overview

This section provides a set of options that control MPI communication behavior in ways that are likely to affect message-passing performance. It contains two templates with predefined option settings. These templates are shown in Example 7-4 and discussed below.

General-Purpose, Multiuser Template - The first template in the MPIOptions section is designed for general-purpose use at times when multiple message-passing jobs will be running concurrently.

Performance Template - The second template is designed to maximize the performance of message-passing jobs when only one job is allowed to run at a time.


Note -

The first line of each template contains the phrase "Queue=xxxx." This is because the queue-based LSF workload management runtime environment uses the same hpc.conf file as the CRE.


The options in the general-purpose template are the same as the default settings for the Sun MPI library. In other words, you do not have to uncomment the general-purpose template to have its option values be in effect. This template is provided in the MPIOptions section so you can see what options are most beneficial when operating in a multiuser mode.

If you want to use the performance template, remove the comment characters (#) from its lines and delete the Queue=performance phrase from the Begin MPIOptions line so that the CRE will recognize the entry.

The resulting template should appear as follows:

Begin MPIOptions
coscheduling    off
spin            on
End MPIOptions

Table 7-1 provides brief descriptions of the MPI runtime options that can be set in hpc.conf. Each description identifies the default value and describes the effect of each legal value.

Some MPI options not only control a parameter directly; they can also be set to a value that passes control of the parameter to an environment variable. Where an MPI option has an associated environment variable, Table 7-1 names the environment variable.


Example 7-4 MPIOptions Section Example


# Following is an example of the options that affect the runtime
# environment of the MPI library. The listings below are identical
# to the default settings of the library. The "queue=hpc" phrase
# makes it an LSF-specific entry, and only for the queue named hpc.
# These options are a good choice for a multiuser queue. To be
# recognized by CRE, the "Queue=hpc" needs to be removed.
#
# Begin MPIOptions queue=hpc
# coscheduling  avail
# pbind         avail
# spindtimeout   1000
# progressadjust   on
# spin            off
#
# shm_numpostbox       16
# shm_shortmsgsize    256
# rsm_numpostbox       15
# rsm_shortmsgsize    401
# rsm_maxstripe         2
# End MPIOptions

# The listing below is a good choice when trying to get maximum
# performance out of MPI jobs that are running in a queue that
# allows only one job to run at a time.
#
# Begin MPIOptions Queue=performance
# coscheduling             off
# spin                      on
# End MPIOptions

Table 7-1 MPI Runtime Options

 

Option            Values             Description

coscheduling      avail (default)    Allows spind use to be controlled by the
                                     environment variable MPI_COSCHED. If
                                     MPI_COSCHED=0 or is not set, spind is not
                                     used. If MPI_COSCHED=1, spind must be
                                     used.
                  on                 Enables coscheduling; spind is used. This
                                     value overrides MPI_COSCHED=0.
                  off                Disables coscheduling; spind is not used.
                                     This value overrides MPI_COSCHED=1.

pbind             avail (default)    Allows the processor binding state to be
                                     controlled by the environment variable
                                     MPI_PROCBIND. If MPI_PROCBIND=0 or is not
                                     set, no processes are bound to a
                                     processor. If MPI_PROCBIND=1, all
                                     processes on a node are bound to a
                                     processor.
                  on                 All processes are bound to processors.
                                     This value overrides MPI_PROCBIND=0.
                  off                No processes on a node are bound to a
                                     processor. This value overrides
                                     MPI_PROCBIND=1.

spindtimeout      1000 (default)     When polling for messages, a process
                                     waits 1000 milliseconds for spind to
                                     return. This equals the default value of
                                     the environment variable
                                     MPI_SPINDTIMEOUT.
                  integer            To change the default timeout, enter an
                                     integer specifying the number of
                                     milliseconds to wait.

progressadjust    on (default)       Allows the user to set the environment
                                     variable MPI_SPIN.
                  off                Disables the user's ability to set the
                                     environment variable MPI_SPIN.

shm_numpostbox    16 (default)       Sets to 16 the number of postbox entries
                                     dedicated to a connection endpoint. This
                                     equals the default value of the
                                     environment variable MPI_SHM_NUMPOSTBOX.
                  integer            To change the number of dedicated postbox
                                     entries, enter the desired number.

shm_shortmsgsize  256 (default)      Sets to 256 the maximum number of bytes a
                                     short message can contain. This equals
                                     the default value of the environment
                                     variable MPI_SHM_SHORTMSGSIZE.
                  integer            To change the maximum size of a short
                                     message, enter the maximum number of
                                     bytes it can contain.

rsm_numpostbox    15 (default)       Sets to 15 the number of postbox entries
                                     dedicated to a connection endpoint. This
                                     equals the default value of the
                                     environment variable MPI_RSM_NUMPOSTBOX.
                  integer            To change the number of dedicated postbox
                                     entries, enter the desired number.

rsm_shortmsgsize  401 (default)      Sets to 401 the maximum number of bytes a
                                     short message can contain. This equals
                                     the default value of the environment
                                     variable MPI_RSM_SHORTMSGSIZE.
                  integer            To change the maximum size of a short
                                     message, enter the maximum number of
                                     bytes it can contain.

rsm_maxstripe     2 (default)        Sets to 2 the maximum number of stripes
                                     that can be used. This equals the default
                                     value of the environment variable
                                     MPI_RSM_MAXSTRIPE.
                  integer            To change the maximum number of stripes,
                                     enter the desired limit. This value
                                     cannot be greater than 2.

spin              off (default)      Sets the MPI library to avoid spinning
                                     while waiting for status. This equals the
                                     default value of the environment variable
                                     MPI_SPIN.
                  on                 Sets the MPI library to spin.
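As a simple illustration of the environment-variable overrides described above, a user could raise the spind timeout for a single run from a Bourne shell before launching the job. The program name a.out and the process count are placeholders, and the sketch assumes jobs are launched with the CRE's mprun command:

% MPI_SPINDTIMEOUT=5000
% export MPI_SPINDTIMEOUT
% mprun -np 4 a.out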

PFSFileSystem Section

The PFSFileSystem section describes the parallel file systems that Sun MPI applications can use. This description includes the name of each parallel file system, the hostnames of the nodes that serve as its I/O servers, the storage device used on each node, and the number of I/O threads each server's daemon spawns.

A separate PFSFileSystem section is needed for each parallel file system that you want to create. Example 7-5 shows a sample configuration with two PFSFileSystem sections, defining the parallel file systems pfs-demo0 and pfs-demo1.


Example 7-5 PFSFileSystem Section Example


Begin PFSFileSystem=pfs-demo0
NODE            DEVICE              THREADS 
hpc-node0       /dev/rdsk/c0t1d0s2  1
hpc-node1       /dev/rdsk/c0t1d0s2  1 
hpc-node2       /dev/rdsk/c0t1d0s2  1 
End PFSFileSystem

Begin PFSFileSystem=pfs-demo1
NODE            DEVICE              THREADS 
hpc-node3       /dev/rdsk/c0t1d0s2  1
hpc-node4       /dev/rdsk/c0t1d0s2  1 
End PFSFileSystem

Parallel File System Name

The first line shows the name of the parallel file system. PFS file system names must not include spaces.

Server Node Hostnames

The NODE column lists the hostnames of the nodes that function as I/O servers for the parallel file system being defined. The example configuration in Example 7-5 shows two parallel file systems: pfs-demo0, served by hpc-node0, hpc-node1, and hpc-node2, and pfs-demo1, served by hpc-node3 and hpc-node4.

Note that a node can act as an I/O server for more than one parallel file system. Note also that a PFS I/O server can also be used as a computation server--that is, for executing application code.

Storage Device Names

The second column gives the device name associated with each member node. This name follows Solaris device naming conventions.

Thread Limits

The THREADS column allows the administrator to specify how many threads a PFS I/O daemon will spawn for the disk storage device or devices it controls. The number of threads needed by a given PFS I/O server node will depend primarily on the performance capabilities of its disk subsystem.
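For example, if one I/O server had a faster disk subsystem than the others, you might give its daemon two threads. The entries below are a hypothetical variation on Example 7-5; the device names and thread counts are illustrative only:

Begin PFSFileSystem=pfs-demo0
NODE            DEVICE              THREADS
hpc-node0       /dev/rdsk/c0t1d0s2  1
hpc-node1       /dev/rdsk/c0t1d0s2  1
hpc-node2       /dev/rdsk/c0t2d0s2  2
End PFSFileSystem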

PFSServers Section

A PFS I/O server is a Sun HPC node that is connected to the disk storage used by one or more parallel file systems and is listed in the PFSServers section of hpc.conf.

In addition to being defined in hpc.conf, a PFS I/O server also differs from other nodes in a Sun HPC cluster in that it has a PFS I/O daemon running on it.
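The general layout of a PFSServers section is sketched below. The hostnames and buffer counts are hypothetical; they reuse the I/O server nodes from Example 7-5 and buffer sizes in the 128-to-512 range discussed under "Buffer Size":

Begin PFSServers
NODE            BUFFER_SIZE
hpc-node0       128
hpc-node1       128
hpc-node2       256
hpc-node3       256
End PFSServers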

PFS I/O Server Hostnames

The left column, NODE, lists the hostnames of the nodes that are PFS I/O servers.

Buffer Size

The second column, BUFFER_SIZE, specifies the amount of memory the PFS I/O daemon will have available for buffering transfer data. This value is specified in units of 32-Kbyte buffers. The number of buffers you specify will depend on how much I/O traffic you expect that server to handle at any given time.

The optimal buffer size will vary with system type and load. Buffer sizes in the range of 128 to 512 provide reasonable performance on most Sun HPC Systems. You can use pfsstat to get reports on buffer cache hit rates. This can be useful for evaluating how well suited the buffer size is to the cluster's current I/O activity.

HPCNodes Section

This section is used only in a cluster that is using LSF as its workload manager, not CRE. The CRE ignores the HPCNodes section of the hpc.conf file.

Propagate hpc.conf Information

Whenever hpc.conf is changed, the CRE database must be updated with the new information. After all required changes to hpc.conf have been made, restart the CRE master and nodal daemons as follows:

# /etc/init.d/sunhpc.cre_master start
# /etc/init.d/sunhpc.cre_node start

If PFS file systems were unmounted and PFS I/O daemons were stopped, restart the I/O daemons and remount the PFS file systems.