When the CRE starts up, it updates portions of the resource database according to the contents of a configuration file named hpc.conf. This file is organized into six sections, which are summarized below and illustrated in Example 3-1.
The ShmemResource section specifies the maximum amount of shared memory and swap space that jobs can allocate.
The Netif section lists and ranks all network interfaces to which Sun HPC nodes may be connected.
The MPIOptions section defines various MPI parameters that can affect the communication performance of MPI jobs.
The PFSFileSystem section names and defines PFS file systems in the cluster.
The PFSServers section names and defines I/O servers for the PFS file systems.
The HPCNodes section is not used by the CRE. It applies only in an LSF-based runtime environment.
You can change any of these aspects of your cluster's configuration by editing the corresponding parts of the hpc.conf file. This section explains how to:
Prepare for editing hpc.conf. See "Prepare to Edit hpc.conf".
Create one or more I/O servers for the PFS file systems. See "Create PFS I/O Servers".
Create PFS file systems. See "Create PFS File Systems".
Specify various attributes of the network interfaces that your cluster nodes use. See "Set Up Network Interfaces".
Learn how to control MPI communication attributes. See "Specify MPI Options".
Update the CRE database. See "Update the CRE Database".
You may never need to make changes to hpc.conf other than those described in this section. However, if you do want to edit hpc.conf further, see Chapter 7, hpc.conf: Detailed Description, for a fuller description of this file.
Begin ShmemResource
:
End ShmemResource

Begin Netif
NAME     RANK   MTU     STRIPE   PROTOCOL   LATENCY   BANDWIDTH
:        :      :       :        :          :         :
End Netif

Begin MPIOptions queue=hpc
:
End MPIOptions

Begin PFSFileSystem=pfs1
NODE     DEVICE   THREADS
:        :        :
End PFSFileSystem

Begin PFSServers
NODE     BUFFER_SIZE
:        :
End PFSServers

Begin HPCNodes
:
End HPCNodes
Perform the steps described in "Stop the CRE Daemons" and "Copy the hpc.conf Template".
Stop the CRE nodal and master daemons (in that order). The nodal daemons must be stopped on each node, including the master node.
You can use one of the CCM tools (cconsole, ctelnet, or crlogin) to broadcast the nodal stop command to all the nodes from a single command entered on the master node.
Use the following scripts to stop the CRE nodal and master daemons:
# /etc/init.d/sunhpc.cre_node stop
# /etc/init.d/sunhpc.cre_master stop
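If you are not using the CCM tools, a simple shell loop can serve the same purpose. The following sketch assumes hypothetical node names (hpc-node0 through hpc-node3); substitute the host names of your own cluster nodes and run it on the master node:

# Stop the nodal daemon on every node; the list must include the master node.
for node in hpc-node0 hpc-node1 hpc-node2 hpc-node3
do
    rsh $node /etc/init.d/sunhpc.cre_node stop
done
# Then stop the master daemon on the master node.
/etc/init.d/sunhpc.cre_master stop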
If you edit the hpc.conf file at a later time and make changes to the PFSServers or PFSFileSystem sections, you will also need to unmount any PFS file systems and stop the PFS daemons on the PFS I/O servers before making the changes.
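For example, assuming a PFS file system mounted at a hypothetical mount point /pfs-demo0, you would unmount it on every node where it is mounted before editing those sections:

# umount /pfs-demo0

The command for stopping the PFS I/O daemons themselves depends on how they were started on your installation and is not shown here.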
The Sun HPC ClusterTools 3.0 distribution includes an hpc.conf template, which is stored, by default, in /opt/SUNWhpc/examples/cre/hpc.conf.template.
Copy the template from its installed location to /opt/SUNWhpc/conf/hpc.conf and edit it as described in "Create PFS I/O Servers".
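For example:

# cp /opt/SUNWhpc/examples/cre/hpc.conf.template /opt/SUNWhpc/conf/hpc.conf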
When you have finished editing hpc.conf, perform the steps described in "Update the CRE Database" to update the CRE database with the new configuration information.
Decide which cluster nodes you want to function as PFS I/O servers. To be of value as PFS I/O servers, these nodes must be connected to one or more disk storage devices with enough capacity to hold the PFS file systems you expect to store on them.
The disk storage units should include some level of RAID support to protect the file systems against failure of individual storage devices.
Once you know which nodes you want as I/O servers, list their host names on separate lines in the PFSServers section of hpc.conf. Example 3-2 shows a sample PFSServers section that includes four PFS I/O server nodes.
Begin PFSServers
NODE        BUFFER_SIZE
hpc-node0   150
hpc-node1   150
hpc-node2   300
hpc-node3   300
End PFSServers
The left column lists the host names of the PFS I/O server nodes.
The second column specifies the amount of memory the PFS I/O daemon will have available for buffering transfer data. This value is specified in units of 32-Kbyte buffers. The number of buffers you should specify depends on the amount of I/O traffic you expect the server to experience at any given time.
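For example, the entries shown in Example 3-2 give hpc-node2 a BUFFER_SIZE of 300, which corresponds to 300 x 32 Kbytes, or roughly 9.4 Mbytes, of buffer memory.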
The optimal buffer size will vary with system type and load. Buffer sizes in the range of 128 to 512 provide reasonable performance on most Sun HPC Systems.
You can use pfsstat to get reports on buffer cache hit rates. Knowing buffer cache hit rates can be useful for evaluating how well suited the buffer size is to the cluster's current I/O activity.
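A minimal invocation is simply the command name; any options it accepts and the format of its report are not described here:

# pfsstat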
Add a separate PFSFileSystem section for each PFS file system you want to create. Include the following information in each PFSFileSystem section:
The name of the parallel file system. See "Parallel File System Name".
The hostname of each server node in the parallel file system. See "Server Node Hostnames".
The name of the storage device to be included in the parallel file system being defined. See "Storage Device Names".
The number of PFS I/O threads spawned to support each PFS storage device. See "Thread Limits".
The following example shows sample PFSFileSystem sections for two parallel file systems, pfs-demo0 and pfs-demo1.
Begin PFSFileSystem=pfs-demo0
NODE        DEVICE               THREADS
hpc-node0   /dev/rdsk/c0t1d0s2   1
hpc-node1   /dev/rdsk/c0t1d0s2   1
hpc-node2   /dev/rdsk/c0t1d0s2   1
End PFSFileSystem

Begin PFSFileSystem=pfs-demo1
NODE        DEVICE               THREADS
hpc-node3   /dev/rdsk/c0t1d0s2   1
hpc-node4   /dev/rdsk/c0t1d0s2   1
End PFSFileSystem
Specify the name of the PFS file system on the first line of the section, to the right of the = symbol.
Apply the same naming conventions to PFS file systems as are used for serial Solaris files.
The NODE column lists the hostnames of the nodes that function as I/O servers for the parallel file system being defined. The example configuration in Example 3-3 shows two parallel file systems:
pfs-demo0 - three server nodes: hpc-node0, hpc-node1, and hpc-node2.
pfs-demo1 - two server nodes: hpc-node3 and hpc-node4.
Note that a single node can act as an I/O server for more than one parallel file system. This is possible when the node is attached to at least two storage devices, so that a separate device can be assigned to each file system.
In the DEVICE column, specify the name of the device that will be used by the file system. Solaris device naming conventions apply.
In the THREADS column, specify the number of threads a PFS I/O daemon will spawn for the disk storage device or devices it controls. The number of threads needed by a given PFS I/O server node will depend primarily on the performance capabilities of its disk subsystem.
For a storage object with a single disk or a small storage array, one thread may be enough to exploit the storage unit's maximum I/O potential.
For a more powerful storage array, two or more threads may be needed to make full use of the available bandwidth.
Edit the Netif section to specify various characteristics of the network interfaces that are used by the nodes in the cluster. Example 3-4 illustrates the default Netif section that is in hpc.conf.template. This section discusses the various network interface attributes that are defined in the Netif section.
Begin Netif
NAME      RANK   MTU     STRIPE   PROTOCOL   LATENCY   BANDWIDTH
midnn     0      16384   0        tcp        20        150
idn       10     16384   0        tcp        20        150
scin      20     32768   1        tcp        20        150
:         :      :       :        :          :         :
scid      40     32768   1        tcp        20        150
:         :      :       :        :          :         :
scirsm    45     32768   1        rsm        20        150
:         :      :       :        :          :         :
:         :      :       :        :          :         :
smc       220    4096    0        tcp        20        150
End Netif
Add to the first column the names of the network interfaces that are used in your cluster. The supplied Netif section contains an extensive list of commonly used interface types to simplify this task.
By convention, network interface names include a trailing number as a way to distinguish multiple interfaces of the same type. For example, if your cluster includes two 100 Mbit/second Ethernet networks, include the names hme0 and hme1 in the Netif section.
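For example, the two Ethernet interfaces might appear in the Netif section as follows. The RANK, MTU, LATENCY, and BANDWIDTH values shown here are illustrative only; choose RANK values that reflect your own network preferences, as described below.

hme0    50    1500    0    tcp    20    150
hme1    55    1500    0    tcp    20    150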
Decide the order in which you want the networks in your cluster to be preferred for use and then edit the RANK column entries to implement that order.
Network preference is based on the relative rank values of the interfaces: the lower an interface's rank value, the higher its preference. For example, an interface with a rank of 10 will be selected for use over interfaces with ranks of 11 or higher, while interfaces with ranks of 9 or lower will be preferred over it.
These ranking values are relative; their absolute values have no significance. This is why gaps are left in the default rankings: if a new interface is added, it can be given an unused rank value without changing any existing values.
Decisions about how to rank two or more dissimilar network types are usually based on site-specific conditions and requirements. Ordinarily, a cluster's fastest network is ranked above slower networks. However, raw network bandwidth is only one consideration. For example, an administrator might dedicate a network that offers very low latency, but not the highest bandwidth, to all intra-cluster communication, and use a higher-capacity network to connect the cluster to systems outside it.
The MTU column is a placeholder. Its contents are not used at this time.
If your cluster includes an SCI (Scalable Coherent Interface) network, you can implement scalable communication between cluster nodes by striping MPI messages over the SCI interfaces. In striped communication, a message is split into smaller packets and transmitted in two or more parallel streams over a set of network interfaces that have been logically combined into a stripe-group.
The STRIPE column allows the administrator to include individual SCI network interfaces in a stripe-group pool. Members of this pool are available to be included in logical stripe groups. These stripe groups are formed on an as-needed basis, selecting interfaces from this stripe-group pool.
To include the SCI interface in a stripe-group pool, set its STRIPE value to 1. To exclude an interface from the pool, specify 0. Up to four SCI network interface cards per node can be configured for stripe-group membership.
When a message is submitted for transmission over the SCI network, an MPI protocol module distributes the message over as many SCI network interfaces as are available.
Stripe-group membership is made optional so you can reserve some SCI network bandwidth for non-striped use. To do so, simply set STRIPE = 0 on the SCI network interface(s) you wish to reserve in this way.
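For example, assume a node with two SCI interfaces named scirsm0 and scirsm1 (hypothetical names that follow the trailing-number convention). The following entries place scirsm0 in the stripe-group pool and reserve scirsm1 for non-striped use; the remaining column values are simply copied from the scirsm entry in the template:

scirsm0    45    32768    1    rsm    20    150
scirsm1    46    32768    0    rsm    20    150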
This column identifies the communication protocol used by the interface. The scirsm interface employs the RSM (Remote Shared Memory) protocol. The others in the default list all use TCP (Transmission Control Protocol).
If you add a network interface of a type not represented in the hpc.conf template, you will need to specify the type of protocol the new interface uses.
The LATENCY column is a placeholder. Its contents are not used at this time.
The BANDWIDTH column is a placeholder. Its contents are not used at this time.
The MPIOptions section provides a set of options that control MPI communication behavior in ways that are likely to affect message-passing performance. It contains two templates with predefined option settings. These templates are shown in Example 3-5 and are discussed below.
General-Purpose, Multiuser Template - The first template in the MPIOptions section is designed for general-purpose use at times when multiple message-passing jobs will be running concurrently.
Performance Template - The second template is designed to maximize the performance of message-passing jobs when only one job is allowed to run at a time.
The first line of each template contains the phrase "Queue=xxxx." This is because the queue-based LSF workload manager runtime environment also uses this hpc.conf file.
The options in the general-purpose template are the same as the default settings for the Sun MPI library. In other words, you do not have to uncomment the general-purpose template to have its option values be in effect. This template is provided in the MPIOptions section so you can see what options are most beneficial when operating in a multiuser mode.
# Following is an example of the options that affect the runtime
# environment of the MPI library. The listings below are identical
# to the default settings of the library. The "queue=hpc" phrase
# makes it an LSF-specific entry, and only for the queue named hpc.
# These options are a good choice for a multiuser queue. To be
# recognized by CRE, the "Queue=hpc" needs to be removed.
#
# Begin MPIOptions queue=hpc
# coscheduling      avail
# pbind             avail
# spindtimeout      1000
# progressadjust    on
# spin              off
#
# shm_numpostbox    16
# shm_shortmsgsize  256
# rsm_numpostbox    15
# rsm_shortmsgsize  401
# rsm_maxstripe     2
# End MPIOptions
#
# The listing below is a good choice when trying to get maximum
# performance out of MPI jobs that are running in a queue that
# allows only one job to run at a time.
#
# Begin MPIOptions Queue=performance
# coscheduling  off
# spin          on
# End MPIOptions
If you want to use the performance template, do the following:
Delete the "Queue=performance" phrase from the Begin MPIOptions line.
Delete the comment character (#) from the beginning of each line of the performance template, including the Begin MPIOptions and End MPIOptions lines.
The resulting template should appear as follows:
Begin MPIOptions
coscheduling  off
spin          on
End MPIOptions
When you have finished editing hpc.conf, update the CRE database with the new information. To do this, restart the CRE master and nodal daemons as follows:
# /etc/init.d/sunhpc.cre_master start
# /etc/init.d/sunhpc.cre_node start
The nodal daemons must be restarted on all the nodes in the cluster, including the master node.
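As when stopping the daemons, you can broadcast the nodal start command with one of the CCM tools or with a simple shell loop run on the master node. The following sketch assumes hypothetical node names; substitute the host names of your own cluster nodes:

# Start the master daemon on the master node first.
/etc/init.d/sunhpc.cre_master start
# Then start the nodal daemon on every node, including the master node.
for node in hpc-node0 hpc-node1 hpc-node2 hpc-node3
do
    rsh $node /etc/init.d/sunhpc.cre_node start
done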