This chapter discusses the Sun HPC configuration file hpc.conf, which defines attributes of Sun HPC clusters that are not defined in any LSF configuration files. A single hpc.conf file is shared by all the nodes in a cluster. It resides in the LSF shared directory LSF_SHARED_LOC.
The configuration file hpc.conf is organized into six sections, which are summarized below and illustrated in Example 2-1.
The ShmemResource section defines certain shared memory attributes. See "ShmemResource Section" for details.
The Netif section lists and ranks all network interfaces to which Sun HPC nodes are connected. See "Netif Section" for details.
The MPIOptions section allows the administrator to control certain MPI parameters by setting them in the hpc.conf file. See "MPIOptions Section" for details.
The PFSFileSystem section names and defines all parallel file systems in the Sun HPC cluster. See "PFSFileSystem Section" for details.
The PFSServers section names and defines all parallel file system servers in the Sun HPC cluster. See "PFSServers Section" for details.
The HPCNodes section can be used to define an HPC cluster that consists of a subset of the nodes contained in the LSF cluster. See "HPCNodes Section" for details.
The hpc.conf file follows the formatting conventions of LSF configuration files. That is, each configuration section is bracketed by a Begin/End keyword pair and, when a parameter definition involves multiple fields, the fields are separated by spaces.
Sun HPC ClusterTools 3.0 software is distributed with an hpc.conf template, which you can edit to suit your particular configuration requirements. This chapter provides instructions for editing each part of hpc.conf.
When any changes are made to hpc.conf, the system should be in a quiescent state. To ensure that it is safe to edit hpc.conf, shut down the LSF Batch daemons. See Chapter 3 "Managing LSF Batch" in the LSF Batch Administrator's Guide for instructions on stopping and starting LSF Batch daemons.
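For example, a minimal shutdown sketch, assuming the badmin hshutdown subcommand described in that guide, would stop the slave batch daemons on all hosts before you begin editing:

hpc-demo# badmin hshutdown all

The commands for restarting the daemons after editing are shown at the end of this chapter.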
Begin ShmemResource
   :
End ShmemResource

Begin Netif
NAME   RANK   MTU   STRIPE   PROTOCOL   LATENCY   BANDWIDTH
 :      :      :      :         :          :          :
End Netif

Begin MPIOptions queue=hpc
   :
End MPIOptions

Begin PFSFileSystem=pfs1
NODE   DEVICE   THREADS
 :       :         :
End PFSFileSystem

Begin PFSServers
NODE   BUFFER_SIZE
 :       :
End PFSServers

Begin HPCNodes
   :
End HPCNodes
If changes need to be made to the PFSFileSystem or PFSServers sections, all PFS file systems must be unmounted first. This requirement is in addition to the need to stop and restart the LSF Batch daemons, as described in the previous note. See Chapter 5, "Starting and Stopping PFS Daemons," in this manual for instructions on how to perform these tasks.
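For example, before editing either section you would unmount each PFS file system from every node on which it is mounted; the mount point shown here is hypothetical:

hpc-node0# umount /pfs-demo0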
The ShmemResource section provides the administrator with two parameters that control allocation of shared memory and swap space: MaxAllocMem and MaxAllocSwap. This special memory allocation control is needed because some Sun HPC ClusterTools components use shared memory.
Example 2-2 shows the ShmemResource template in the hpc.conf file shipped with Sun HPC ClusterTools 3.0 software.
#Begin ShmemResource
#MaxAllocMem   0x7fffffffffffffff
#MaxAllocSwap  0x7fffffffffffffff
#End ShmemResource
To set MaxAllocMem and/or MaxAllocSwap limits, remove the comment character (#) from the start of each line and replace the current value, 0x7fffffffffffffff, with the desired limit.
The following section explains how to set these limits.
Sun HPC's internal shared memory allocator permits an application to use swap space, the amount of which is the smaller of:
The value (in bytes) given by the MaxAllocSwap parameter.
90% of the available swap space on the node.
If MaxAllocSwap is not specified, or if zero or a negative value is specified, 90% of a node's available swap will be used as the swap limit.
The MaxAllocMem parameter can be used to limit the amount of shared memory that can be allocated. If a smaller shared memory limit is not specified, the shared memory limit will be 90% of available physical memory.
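As an illustration only, a ShmemResource section that limits shared memory to 1 Gbyte and swap usage to 2 Gbytes would look like the following; the values are hypothetical, not recommendations:

Begin ShmemResource
MaxAllocMem   0x40000000
MaxAllocSwap  0x80000000
End ShmemResource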
The following Sun HPC ClusterTools components use shared memory:
The resource management software uses shared memory to hold cluster and job table information. Its memory use is based on cluster and job sizes and is not controllable by the user. Shared memory space is allocated for the runtime environment when the LSF daemon starts and is not affected by MaxAllocMem and MaxAllocSwap settings. This ensures that the runtime environment and LSF Base subsystem can start up no matter how low these memory-limit variables have been set.
MPI uses shared memory for communication between processes that are on the same node. The amount of shared memory allocated by a job can be controlled by MPI environment variables.
Sun S3L uses shared memory for storing data. An MPI application can allocate parallel arrays whose subgrids are in shared memory. This is done with the utility S3L_declare_detailed(), described in the Sun S3L Programming and Reference Guide.
Sun S3L supports a special form of shared memory known as Intimate Shared Memory (ISM), which reserves a region in physical memory for shared memory use. What makes ISM space special is that it is not swappable and, therefore, cannot be made available for other use. For this reason, the amount of memory allocated to ISM should be kept to a minimum.
Shared memory and swap space limits are applied per-job on each node.
If you have set up your system for dedicated use (only one job at a time is allowed), you should leave MaxAllocMem and MaxAllocSwap undefined. This will allow jobs to maximize use of swap space and physical memory.
If, however, multiple jobs will share a system, you may wish to set MaxAllocMem to some level below 50% of total physical memory. This will reduce the risk of having a single application lock up physical memory. How much below 50% you choose to set it will depend on how many jobs you expect to be competing for physical memory at any given time.
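For example, on a node with 4 Gbytes of physical memory, 50% corresponds to 2 Gbytes (0x80000000); if two or three jobs are expected to compete for memory at once, a MaxAllocMem limit somewhere between 1 Gbyte (0x40000000) and 1.5 Gbytes (0x60000000) might be a reasonable starting point. These figures are illustrative only.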
When users make direct calls to mmap(2) or shmget(2), they are not limited by the MaxAllocMem and MaxAllocSwap variables. These interfaces allocate shared memory independently of the MaxAllocMem and MaxAllocSwap values.
The Netif section identifies the network interfaces supported by the Sun HPC cluster and specifies the rank and striping attributes for each interface. The hpc.conf template that is supplied with Sun HPC ClusterTools 3.0 software contains a default list of supported network interfaces as well as their default ranking. Example 2-3 represents a portion of the default Netif section.
Begin Netif
NAME     RANK   MTU     STRIPE   PROTOCOL   LATENCY   BANDWIDTH
midn     0      16384   0        tcp        20        150
idn      10     16384   0        tcp        20        150
scin     20     32768   1        tcp        20        150
 :        :      :       :        :          :          :
scid     40     32768   1        tcp        20        150
 :        :      :       :        :          :          :
scirsm   45     32768   1        rsm        20        150
 :        :      :       :        :          :          :
 :        :      :       :        :          :          :
smc      220    4096    0        tcp        20        150
End Netif
The first column lists the names of possible network interface types.
The rank of an interface is the order in which that interface is to be preferred over other interfaces. That is, if an interface with a rank of 0 is available when a communication operation begins, it will be selected for the operation before interfaces with ranks of 1 or greater. Likewise, an available rank-1 interface will be used before interfaces with a rank of 2 or greater.
Because hpc.conf is a shared, cluster-wide configuration file, the rank specified for a given interface will apply to all nodes in the cluster.
Network ranking decisions are usually influenced by site-specific conditions and requirements. Although interfaces connected to the fastest network in a cluster are often given preferential ranking, raw network bandwidth is only one consideration. For example, an administrator might decide to dedicate one network that offers very low latency, but not the highest bandwidth, to all intra-cluster communication and use a higher-capacity network for connecting the cluster to other systems.
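For instance, with the default ranking shown in Example 2-3, communication will use the scin interface (rank 20) in preference to the smc interface (rank 220) whenever both are available; lowering the rank of smc below 20 would reverse that preference.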
The MTU column is a placeholder. Its contents are not used at this time.
Sun HPC ClusterTools 3.0 software supports scalable communication between cluster nodes through striping of MPI messages over SCI interfaces, as described in the Sun HPC 3.0 SCI Guide. In striped communication, a message is split into smaller packets and transmitted in two or more parallel streams over a set of network interfaces that have been logically combined into a stripe-group.
The STRIPE column allows the administrator to include individual SCI network interfaces in a stripe-group pool. Members of this pool are available to be included in logical stripe groups. These stripe groups are formed on an as-needed basis, selecting interfaces from this stripe-group pool.
To include the SCI interface in a stripe-group pool, set its STRIPE value to 1. To exclude an interface from the pool, specify 0. Up to four SCI network interface cards per node can be configured for stripe-group membership.
When a message is submitted for transmission over the SCI network, an MPI protocol module distributes the message over as many SCI network interfaces as are available.
Stripe-group membership is made optional so you can reserve some SCI network bandwidth for non-striped use. To do so, simply set STRIPE = 0 on the SCI network interface(s) you wish to reserve in this way.
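For example, to keep the scirsm interface from Example 2-3 in the stripe-group pool, its entry remains as shown there:

scirsm   45   32768   1   rsm   20   150

Changing the STRIPE field of that entry from 1 to 0 would instead reserve the interface for non-striped use.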
The PROTOCOL column identifies the communication protocol used by the interface. The scirsm interface employs the RSM (Remote Shared Memory) protocol. The others all use TCP (Transmission Control Protocol).
The LATENCY column is a placeholder. Its contents are not used at this time.
The BANDWIDTH column is a placeholder. Its contents are not used at this time.
This section provides the means to control many MPI runtime parameters at the queue level. This is done by naming the LSF queue of interest and then listing the parameters to be defined, along with their desired values.
The hpc.conf file that is shipped with Sun HPC ClusterTools 3.0 software includes two MPIOptions templates--see Example 2-4. The first template, which contains the phrase queue=hpc, is designed for use by a multiuser queue named hpc. The second template, which contains the phrase queue=performance, is intended for use by a dedicated (single-user) queue named performance.
To use either template, simply delete the comment character (#) from the beginning of each line in the template of interest.
The options in the general-purpose template (queue=hpc) are the same as the default settings of the Sun MPI library. In other words, these settings are already in effect even if you do not use the template, provided you have not changed any of the library defaults. The template is provided in the MPIOptions section so you can see which options are most beneficial when operating in multiuser mode.
# Following is an example of the options that affect the runtime
# environment of the MPI library. The listings below are identical
# to the default settings of the library. The "queue=hpc" phrase
# makes it an LSF-specific entry, and only for the queue named hpc.
# These options are a good choice for a multiuser queue. To be
# recognized by CRE, the "Queue=hpc" needs to be removed.
#
# Begin MPIOptions queue=hpc
# coscheduling       avail
# pbind              avail
# spindtimeout       1000
# progressadjust     on
# spin               off
#
# shm_numpostbox     16
# shm_shortmsgsize   256
# rsm_numpostbox     15
# rsm_shortmsgsize   401
# rsm_maxstripe      2
# End MPIOptions

# The listing below is a good choice when trying to get maximum
# performance out of MPI jobs that are running in a queue that
# allows only one job to run at a time.
#
# Begin MPIOptions Queue=performance
# coscheduling       off
# spin               on
# End MPIOptions
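For example, once the comment characters are removed from the second template, the active entry for the performance queue reduces to:

Begin MPIOptions Queue=performance
coscheduling   off
spin           on
End MPIOptions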
Table 2-1 provides brief descriptions of the MPI runtime options that can be set in hpc.conf. Each description identifies the default value and describes the effect of each legal value. Refer to the Sun MPI 4.0 User's Guide: With LSF for fuller descriptions.
Some MPI options not only control a parameter directly, they can also be set to a value that passes control of the parameter to an environment variable. Where an MPI option has an associated environment variable, Table 2-1 names the environment variable.
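For instance, when an option has been left at a value that defers to its environment variable, a user can adjust that parameter for a single job from the shell at submission time. The sketch below assumes the MPI_SPIN variable name and a csh-style shell, and that jobs are launched with bsub and pam; confirm the actual variable names in Table 2-1:

% setenv MPI_SPIN 1
% bsub -q hpc -n 4 pam a.out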
The PFSFileSystem section describes the parallel file systems that Sun MPI applications can use. This description includes:
The name of the parallel file system.
The hostname of each server node in the parallel file system.
The name of the storage device to be included in the parallel file system being defined.
The number of PFS I/O threads spawned to support each PFS storage device.
A separate PFSFileSystem section is needed for each parallel file system that you want to create. Example 2-5 shows a sample PFSFileSystem section with two parallel file systems, pfs-demo0 and pfs-demo1.
Begin PFSFileSystem=pfs-demo0
NODE        DEVICE               THREADS
hpc-node0   /dev/rdsk/c0t1d0s2   1
hpc-node1   /dev/rdsk/c0t1d0s2   1
hpc-node2   /dev/rdsk/c0t1d0s2   1
End PFSFileSystem

Begin PFSFileSystem=pfs-demo1
NODE        DEVICE               THREADS
hpc-node3   /dev/rdsk/c0t1d0s2   1
hpc-node4   /dev/rdsk/c0t1d0s2   1
End PFSFileSystem
The first line shows the name of the parallel file system. PFS file system names must not include spaces.
The NODE column lists the hostnames of the nodes that function as servers for the parallel file system being defined. The example configuration in Example 2-5 shows two parallel file systems:
pfs-demo0 - three server nodes: hpc-node0, hpc-node1, and hpc-node2.
pfs-demo1 - two server nodes: hpc-node3 and hpc-node4.
The second column gives the device name associated with each member node. This name follows Solaris device naming conventions. (Note the use of the raw device, rdsk, in the device names.)
The THREADS column allows the administrator to specify how many threads a PFS I/O daemon will spawn for the disk storage device or devices it controls. The number of threads needed by a given PFS I/O server node will depend primarily on the performance capabilities of its disk subsystem.
For a storage object with a single disk or a small storage array, one thread may be enough to exploit the storage unit's maximum I/O potential.
For a more powerful storage array, two or more threads may be needed to make full use of the available bandwidth, as in the example following this list.
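For instance, if hpc-node2 in Example 2-5 were attached to a higher-throughput array that benefits from a second thread, its entry could be changed to read as follows; the sizing is hypothetical:

hpc-node2   /dev/rdsk/c0t1d0s2   2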
A PFS I/O server is a Sun HPC node that is connected to one or more disk storage units that are defined as part of a parallel file system in the hpc.conf file--that is, they are listed in a PFSFileSystem section of the hpc.conf file. Plus, the node itself must be listed in the PFSServers section of hpc.conf, as shown in Example 2-6.
Begin PFSServers
NODE        BUFFER_SIZE
hpc-node0   150
hpc-node1   150
hpc-node2   300
hpc-node3   300
End PFSServers
In addition to being defined in hpc.conf, a PFS server also differs from other nodes in a Sun HPC cluster in that it has a PFS I/O daemon running on it.
The left column lists the hostnames of the nodes that are PFS I/O servers. In this example, they are hpc-node0 through hpc-node3.
The second column specifies the amount of memory the PFS I/O daemon will have for buffering transfer data. This value is specified in units of 32-Kbyte buffers. The number of buffers that you specify will depend on the amount of I/O traffic that server is likely to experience at any given time.
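For example, the BUFFER_SIZE value of 300 shown for hpc-node2 and hpc-node3 in Example 2-6 gives each of those daemons 300 x 32 Kbytes = 9600 Kbytes (roughly 9.4 Mbytes) of buffer space, while the value of 150 used for hpc-node0 and hpc-node1 corresponds to about 4.7 Mbytes.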
The optimal buffer size will vary with system type and load. Buffer sizes in the range of 128 to 512 provide reasonable performance on most Sun HPC Systems. You can use pfsstat to get reports on buffer cache hit rates. This can be useful for evaluating how well suited the buffer size is to the cluster's current I/O activity.
This section allows you to define a Sun HPC cluster that consists of a subset of the nodes contained in the LSF cluster. To use this configuration option, enter the hostnames of the nodes that you want in the HPC cluster in this section, one hostname per line. Example 2-7 shows a sample HPCNodes section with two nodes listed.
Begin HPCNodes
node1
node2
End HPCNodes
Whenever hpc.conf is changed, the LSF daemons must be updated with the new information. After all required changes to hpc.conf have been made, restart the LSF Base daemons LIM and RES. Use the lsadmin subcommands reconfig and resrestart as follows:
hpc-demo# lsadmin reconfig
hpc-demo# lsadmin resrestart all
Then, use the badmin subcommands reconfig to restart mbatchd and hrestart to restart the slave batch daemons:
hpc-demo# badmin reconfig
hpc-demo# badmin hrestart all
This only needs to be done from one node. See the LSF Batch Administrator's Guide for additional information about restarting LSF daemons.