This chapter discusses the Sun HPC configuration file hpc.conf, which defines various attributes of a Sun HPC cluster. A single hpc.conf file is shared by all the nodes in a cluster. It resides in /opt/SUNWhpc/conf.
hpc.conf is organized into six sections, which are summarized below and illustrated in Example 7-1.
The ShmemResource section defines certain shared memory attributes. See "ShmemResource Section" for details.
The Netif section lists and ranks all network interfaces to which Sun HPC nodes are connected. See "Netif Section" for details.
The MPIOptions section allows the administrator to control certain MPI parameters by setting them in the hpc.conf file. See "MPIOptions Section" for details.
The PFSFileSystem section names and defines all parallel file systems in the Sun HPC cluster. See "PFSFileSystem Section" for details.
The PFSServers section names and defines all parallel file system servers in the Sun HPC cluster. See "PFSServers Section" for details.
The HPCNodes section is not used by the CRE. It applies only in an LSF-based runtime environment.
Each configuration section is bracketed by a Begin/End keyword pair and, when a parameter definition involves multiple fields, the fields are separated by spaces.
Sun HPC ClusterTools 3.0 software is distributed with an hpc.conf template, which is installed by default in /opt/SUNWhpc/examples/cre/hpc.conf.template. You should copy this file to /opt/SUNWhpc/conf/hpc.conf and edit it to suit your site's specific configuration requirements.
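For example, the copy can be made with a command such as the following, run as superuser on the node that hosts /opt/SUNWhpc/conf (which node that is depends on how /opt/SUNWhpc is shared in your installation):

# cp /opt/SUNWhpc/examples/cre/hpc.conf.template /opt/SUNWhpc/conf/hpc.conf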
When any changes are made to hpc.conf, the system should be in a quiescent state. To ensure that it is safe to edit hpc.conf, shut down the nodal and master CRE daemons as described in "To Shut Down the CRE Without Shutting Down Solaris". If you change PFSFileSystem or PFSServers sections, you must also unmount any PFS file systems first.
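The following is a rough sketch only; it assumes the CRE init scripts accept a stop argument that mirrors the start commands shown at the end of this chapter. Consult the referenced shutdown procedure for the supported method.

# /etc/init.d/sunhpc.cre_node stop
# /etc/init.d/sunhpc.cre_master stop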
Begin ShmemResource
:
End ShmemResource

Begin Netif
NAME    RANK    MTU    STRIPE    PROTOCOL    LATENCY    BANDWIDTH
:       :       :      :         :           :          :
End Netif

Begin MPIOptions queue=hpc
:
End MPIOptions

Begin PFSFileSystem=pfs1
NODE    DEVICE    THREADS
:       :         :
End PFSFileSystem

Begin PFSServers
NODE    BUFFER_SIZE
:       :
End PFSServers

Begin HPCNodes
:
End HPCNodes
The ShmemResource section provides the administrator with two parameters that control allocation of shared memory and swap space: MaxAllocMem and MaxAllocSwap. This special memory allocation control is needed because some Sun HPC ClusterTools components use shared memory.
Example 7-2 shows the ShmemResource template that is in the hpc.conf file that is shipped with Sun HPC ClusterTools 3.0 software.
#Begin ShmemResource
#MaxAllocMem   0x7fffffffffffffff
#MaxAllocSwap  0x7fffffffffffffff
#End ShmemResource
To set MaxAllocMem and/or MaxAllocSwap limits, remove the comment character (#) from the start of each line and replace the current value, 0x7fffffffffffffff, with the desired limit.
The following section explains how to set these limits.
Sun HPC's internal shared memory allocator permits an application to use swap space, the amount of which is the smaller of:
The value (in bytes) given by the MaxAllocSwap parameter.
90% of the available swap on the node.
If MaxAllocSwap is not specified, or if zero or a negative value is specified, 90% of a node's available swap will be used as the swap limit.
The MaxAllocMem parameter can be used to limit the amount of shared memory that can be allocated. If a smaller shared memory limit is not specified, the shared memory limit will be 90% of available physical memory.
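As an illustration only (the values are examples, not recommendations), the following uncommented ShmemResource section limits shared memory allocation to 1 Gbyte (0x40000000 bytes) and swap-backed allocation to 2 Gbytes (0x80000000 bytes). On a node with 4 Gbytes of available swap, the swap limit would then be the smaller of 2 Gbytes and 3.6 Gbytes (90% of 4 Gbytes), that is, 2 Gbytes.

Begin ShmemResource
MaxAllocMem   0x40000000
MaxAllocSwap  0x80000000
End ShmemResource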
The following Sun HPC ClusterTools components use shared memory:
The CRE uses shared memory to hold cluster and job table information. Its memory use is based on cluster and job sizes and is not controllable by the user. Shared memory space is allocated for the CRE when it starts up and is not affected by MaxAllocMem and MaxAllocSwap settings. This ensures that the CRE can start up no matter how low these memory-limit variables have been set.
MPI uses shared memory for communication between processes that are on the same node. The amount of shared memory allocated by a job can be controlled by MPI environment variables.
Sun S3L uses shared memory for storing data. An MPI application can allocate parallel arrays whose subgrids are in shared memory. This is done with the utility S3L_declare_detailed().
Sun S3L supports a special form of shared memory known as Intimate Shared Memory (ISM), which reserves a region in physical memory for shared memory use. What makes ISM space special is that it is not swappable and, therefore, cannot be made available for other use. For this reason, the amount of memory allocated to ISM should be kept to a minimum.
Shared memory and swap space limits are applied per-job on each node.
If you have set up your system for dedicated use (only one job at a time is allowed), you should leave MaxAllocMem and MaxAllocSwap undefined. This will allow jobs to maximize use of swap space and physical memory.
If, however, multiple jobs will share a system, you may want to set MaxAllocMem to some level below 50% of total physical memory. This will reduce the risk of having a single application lock up physical memory. How much below 50% you choose to set it will depend on how many jobs you expect to be competing for physical memory at any given time.
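For example (illustrative numbers only): on nodes with 4 Gbytes of physical memory, 50% is 2 Gbytes, so an administrator expecting several jobs to compete for memory might choose a limit of 1.5 Gbytes (0x60000000 bytes):

Begin ShmemResource
MaxAllocMem   0x60000000
End ShmemResource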
When users make direct calls to mmap(2) or shmget(2), they are not limited by the MaxAllocMem and MaxAllocSwap variables. These interfaces allocate shared memory independently of the MaxAllocMem and MaxAllocSwap values.
The Netif section identifies the network interfaces supported by the Sun HPC cluster and specifies the rank and striping attributes for each interface. The hpc.conf template that is supplied with Sun HPC ClusterTools 3.0 software contains a default list of supported network interfaces as well as their default ranking. Example 7-3 shows a portion of the default Netif section.
Begin Netif
NAME     RANK   MTU     STRIPE   PROTOCOL   LATENCY   BANDWIDTH
midnn    0      16384   0        tcp        20        150
idn      10     16384   0        tcp        20        150
scin     20     32768   1        tcp        20        150
:        :      :       :        :          :         :
scid     40     32768   1        tcp        20        150
:        :      :       :        :          :         :
scirsm   45     32768   1        rsm        20        150
:        :      :       :        :          :         :
:        :      :       :        :          :         :
smc      220    4096    0        tcp        20        150
End Netif
The NAME column lists the names of the supported network interface types.
The RANK column specifies the order in which an interface is to be preferred over other interfaces. That is, if an interface with a rank of 0 is available when a communication operation begins, it will be selected for the operation before interfaces with a rank of 1 or greater. Likewise, an available rank 1 interface will be used before interfaces with a rank of 2 or greater.
Because hpc.conf is a shared, cluster-wide configuration file, the rank specified for a given interface will apply to all nodes in the cluster.
Network ranking decisions are usually influenced by site-specific conditions and requirements. Although interfaces connected to the fastest network in a cluster are often given preferential ranking, raw network bandwidth is only one consideration. For example, an administrator might dedicate a network that offers very low latency, but not the highest bandwidth, to intra-cluster communication and use a higher-capacity network to connect the cluster to other systems.
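As a hypothetical illustration of ranking (the rank value of 15 is invented; the remaining fields are copied from Example 7-3), an administrator who wants the scirsm interface tried before scin would give scirsm the lower rank value:

Begin Netif
NAME     RANK   MTU     STRIPE   PROTOCOL   LATENCY   BANDWIDTH
scirsm   15     32768   1        rsm        20        150
scin     20     32768   1        tcp        20        150
End Netif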
The MTU column is a placeholder. Its contents are not used at this time.
Sun HPC ClusterTools 3.0 software supports scalable communication between cluster nodes through striping of MPI messages over SCI interfaces. In striped communication, a message is split into smaller packets and transmitted in two or more parallel streams over a set of network interfaces that have been logically combined into a stripe-group.
The STRIPE column allows the administrator to include individual SCI network interfaces in a stripe-group pool. Members of this pool are available to be included in logical stripe groups. These stripe groups are formed on an as-needed basis, selecting interfaces from this stripe-group pool.
To include an SCI interface in the stripe-group pool, set its STRIPE value to 1. To exclude an interface from the pool, set the value to 0. Up to four SCI network interface cards per node can be configured for stripe-group membership.
When a message is submitted for transmission over the SCI network, an MPI protocol module distributes the message over as many SCI network interfaces as are available.
Stripe-group membership is made optional so you can reserve some SCI network bandwidth for non-striped use. To do so, simply set STRIPE = 0 on the SCI network interface(s) you wish to reserve in this way.
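For example, the following excerpt (based on the SCI rows of Example 7-3, with scid's STRIPE value changed for illustration) keeps scin and scirsm in the stripe-group pool while reserving scid for non-striped use:

NAME     RANK   MTU     STRIPE   PROTOCOL   LATENCY   BANDWIDTH
scin     20     32768   1        tcp        20        150
scid     40     32768   0        tcp        20        150
scirsm   45     32768   1        rsm        20        150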
The PROTOCOL column identifies the communication protocol used by the interface. The scirsm interface uses the RSM (Remote Shared Memory) protocol; the others use TCP (Transmission Control Protocol).
The LATENCY column is a placeholder. Its contents are not used at this time.
The BANDWIDTH column is a placeholder. Its contents are not used at this time.
This section provides a set of options that control MPI communication behavior in ways that are likely to affect message-passing performance. It contains two templates with predefined option settings. These templates are shown in Example 7-4 and discussed below.
General-Purpose, Multiuser Template - The first template in the MPIOptions section is designed for general-purpose use at times when multiple message-passing jobs will be running concurrently.
Performance Template - The second template is designed to maximize the performance of message-passing jobs when only one job is allowed to run at a time.
The first line of each template contains the phrase "Queue=xxxx." This is because the queue-based LSF workload management runtime environment uses the same hpc.conf file as the CRE.
The options in the general-purpose template are the same as the default settings for the Sun MPI library. In other words, you do not have to uncomment the general-purpose template for its option values to take effect. This template is provided in the MPIOptions section so you can see which options are most beneficial when operating in multiuser mode.
If you want to use the performance template, do the following:
Delete the "Queue=performance" phrase from the Begin MPIOptions line.
Delete the comment character (#) from the beginning of each line of the performance template, including the Begin MPIOptions and End MPIOptions lines.
The resulting template should appear as follows:
Begin MPIOptions
coscheduling   off
spin           on
End MPIOptions
Table 7-1 provides brief descriptions of the MPI runtime options that can be set in hpc.conf. Each description identifies the default value and describes the effect of each legal value.
Some MPI options not only control a parameter directly; they can also be set to a value that passes control of the parameter to an environment variable. Where an MPI option has an associated environment variable, Table 7-1 names the environment variable.
# Following is an example of the options that affect the runtime
# environment of the MPI library. The listings below are identical
# to the default settings of the library. The "queue=hpc" phrase
# makes it an LSF-specific entry, and only for the queue named hpc.
# These options are a good choice for a multiuser queue. To be
# recognized by CRE, the "Queue=hpc" needs to be removed.
#
# Begin MPIOptions queue=hpc
# coscheduling     avail
# pbind            avail
# spindtimeout     1000
# progressadjust   on
# spin             off
#
# shm_numpostbox     16
# shm_shortmsgsize   256
# rsm_numpostbox     15
# rsm_shortmsgsize   401
# rsm_maxstripe      2
# End MPIOptions
#
# The listing below is a good choice when trying to get maximum
# performance out of MPI jobs that are running in a queue that
# allows only one job to run at a time.
#
# Begin MPIOptions Queue=performance
# coscheduling   off
# spin           on
# End MPIOptions
The PFSFileSystem section describes the parallel file systems that Sun MPI applications can use. This description includes:
The name of the parallel file system.
The hostname of each server node in the parallel file system.
The name of the storage device to be included in the parallel file system being defined.
The number of PFS I/O threads spawned to support each PFS storage device.
A separate PFSFileSystem section is needed for each parallel file system that you want to create. Example 7-5 shows a sample PFSFileSystem section with two parallel file systems, pfs0 and pfs1.
Begin PFSFileSystem=pfs0
NODE     DEVICE               THREADS
node0    /dev/rdsk/c0t1d0s2   1
node1    /dev/rdsk/c0t1d0s2   1
node2    /dev/rdsk/c0t1d0s2   1
End PFSFileSystem

Begin PFSFileSystem=pfs1
NODE     DEVICE               THREADS
node2    /dev/rdsk/c0t1d0s2   1
node3    /dev/rdsk/c0t1d0s2   1
End PFSFileSystem
The first line shows the name of the parallel file system. PFS file system names must not include spaces.
The NODE column lists the hostnames of the nodes that function as I/O servers for the parallel file system being defined. The example configuration in Example 7-5 shows two parallel file systems:
pfs0 - three server nodes: node0, node1, and node2.
pfs1 - two server nodes: node2 and node3.
Note that I/O server node2 is used by both pfs0 and pfs1. Note also that node3 is used both as a PFS I/O server and as a computation server; that is, it also executes application code.
The DEVICE column gives the device name associated with each member node. This name follows Solaris device naming conventions.
The THREADS column allows the administrator to specify how many threads a PFS I/O daemon will spawn for the disk storage device or devices it controls. The number of threads needed by a given PFS I/O server node will depend primarily on the performance capabilities of its disk subsystem.
For a storage object with a single disk or a small storage array, one thread may be enough to exploit the storage unit's maximum I/O potential.
For a more powerful storage array, two or more threads may be needed to make full use of the available bandwidth, as in the sketch below.
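The following entry is a hypothetical sketch only; the node name, device path, and thread count are invented for illustration and do not come from Example 7-5. It shows a server node with a higher-throughput array being given several I/O threads:

Begin PFSFileSystem=pfs2
NODE     DEVICE               THREADS
node4    /dev/rdsk/c1t0d0s2   4
End PFSFileSystem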
A PFS I/O server is a Sun HPC node that is
Listed in the PFSServers section of hpc.conf, as shown in Example 7-6.
Connected to one or more disk storage units that are listed in a PFSFileSystem section of the hpc.conf file.
Begin PFSServers
NODE     BUFFER_SIZE
node0    150
node1    150
node2    300
node3    300
End PFSServers
In addition to being defined in hpc.conf, a PFS I/O server also differs from other nodes in a Sun HPC cluster in that it has a PFS I/O daemon running on it.
The NODE column lists the hostnames of the nodes that are PFS I/O servers. In this example, they are node0 through node3.
The BUFFER_SIZE column specifies the amount of memory the PFS I/O daemon will have for buffering transfer data. This value is specified in units of 32-Kbyte buffers; for example, a value of 300 gives the daemon 9600 Kbytes of buffer space. The number of buffers you specify will depend on the amount of I/O traffic you expect the server to experience at any given time.
The optimal buffer size will vary with system type and load. Buffer sizes in the range of 128 to 512 provide reasonable performance on most Sun HPC Systems. You can use pfsstat to get reports on buffer cache hit rates. This can be useful for evaluating how well suited the buffer size is to the cluster's current I/O activity.
This section is used only in clusters that use LSF, rather than the CRE, as the workload manager. The CRE ignores the HPCNodes section of the hpc.conf file.
Whenever hpc.conf is changed, the CRE database must be updated with the new information. After all required changes to hpc.conf have been made, restart the CRE master and nodal daemons as follows:
# /etc/init.d/sunhpc.cre_master start
# /etc/init.d/sunhpc.cre_node start
If PFS file systems were unmounted and PFS I/O daemons were stopped, restart the I/O daemons and remount the PFS file systems.
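As a hedged sketch only: this assumes PFS file systems are mounted with the standard Solaris mount -F fstype form and that a file system named pfs0 uses the mount point /pfs0 (both the mount point and the exact argument form are assumptions, not taken from this chapter). Remounting might then look like this:

# mount -F pfs pfs0 /pfs0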