The Sun HPC cluster's default configuration will support execution of MPI applications. In other words, if you have started the CRE daemons on your cluster and created the default partition all (as described in the previous chapter), you can begin executing MPI jobs on the cluster.
You may, however, want to customize your cluster's configuration to make it better suited to the specific administration and use requirements of your site. For example, you may want to create other partitions of different sizes containing different sets of nodes as members to match a variety of job execution needs. You may also want to create one or more PFS file systems to handle the more demanding disk storage requirements that MPI applications often have.
This chapter provides an overview of the principal ways in which you can control your cluster's configuration and behavior. There are three distinct types of control you can use to manage your cluster:
Environment variables - See "CRE Environment Variables".
Administration interface (mpadmin) - See "mpadmin: Administration Interface" for an introduction to the mpadmin command interface and instructions for using it to perform basic administration tasks. See Chapter 6, mpadmin: Detailed Description for a more comprehensive discussion of mpadmin.
Cluster configuration file (hpc.conf) - See "hpc.conf: Cluster Configuration File" for a brief description of the hpc.conf file and Chapter 7, hpc.conf: Detailed Description for instructions on how to modify it.
You can use the following environment variables to specify default values for various features of your Sun HPC cluster.
SUNHPC_CLUSTER
SUNHPC_CONFIG_DIR
SUNHPC_PART
These are described in "SUNHPC_CLUSTER ", "SUNHPC_CONFIG_DIR ", and "SUNHPC_PART ", respectively.
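For example, a user whose login shell is the Bourne or Korn shell could place lines such as the following in a startup file. The values shown are only illustrative; substitute the cluster name, database directory, and partition name appropriate to your site.

SUNHPC_CLUSTER=node0; export SUNHPC_CLUSTER
SUNHPC_CONFIG_DIR=/var/hpc; export SUNHPC_CONFIG_DIR
SUNHPC_PART=part0; export SUNHPC_PART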
SUNHPC_CLUSTER specifies the name of the default Sun HPC cluster. This is the cluster to which users will automatically be connected--unless a different cluster is chosen using the mpadmin -s option, as described in "mpadmin Syntax".
You can find out the name of your cluster by running mpinfo -C, which displays information about the cluster, including its name.
The name of a cluster is always the host name of the cluster's master node--that is, the node on which the master daemons are running.
SUNHPC_CONFIG_DIR specifies the directory in which the CRE's resource database files are to be stored. The default is /var/hpc.
SUNHPC_PART specifies the name of the default partition. This is the partition on which users' jobs will be executed unless a different partition is selected via the mprun -p option. This option is discussed in the Sun HPC Cluster Runtime Environment 1.0 User's Guide.
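For example, assuming a partition named part0 exists and is enabled, a user could override the default partition for a single job with a command along the following lines (a.out is a placeholder program name; see the User's Guide for the complete mprun syntax):

% mprun -p part0 a.out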
The CRE provides an interactive command interface, mpadmin, which you can use to administer your Sun HPC cluster. It must be invoked by root.
This section explains how to use mpadmin to perform the following administrative tasks:
List the names of all nodes in the cluster - see "List Names of Nodes"
Enable nodes - see "Enabling Nodes"
Create and enable partitions - see "Creating and Enabling Partitions"
Customize some aspects of cluster administration - see "Customizing Cluster Administration"
These descriptions are preceded by an introduction of the mpadmin command interface, which is presented in "Introduction to mpadmin".
mpadmin offers many more capabilities than are described in this section. See Chapter 6 for a more comprehensive description of mpadmin.
The mpadmin command has the following syntax.
# mpadmin [-c command] [-f filename] [-h] [-q] [-s cluster_name] [-V]
When you invoke mpadmin with the -c, -h, or -V options, it performs the requested operation and returns to the shell level.
When you invoke mpadmin with any of the other options (-f, -q, or -s), it performs the specified operation and then displays an mpadmin prompt, indicating that it is in the interactive mode. In this mode, you can execute any number of mpadmin commands until you quit the interactive session.
When you invoke mpadmin without any options, it goes immediately into the interactive mode, displaying an mpadmin prompt.
The mpadmin command-line options are summarized in Table 3-1 and described more fully following the table.
Table 3-1 mpadmin Options
Option | Description
---|---
-c command | Execute a single specified command.
-f file-name | Take input from the specified file.
-h | Display help/usage text.
-q | Suppress the warning message that is displayed when a non-root user attempts to use a restricted command.
-s cluster-name | Connect to the specified Sun HPC cluster.
-V | Display mpadmin version information.
Use the -c option when you want to execute a single mpadmin command and return upon completion to the shell prompt. For example, the following use of mpadmin -c changes the location of the CRE log file to /home/wmitty/cre_messages:
# mpadmin -c set logfile="/home/wmitty/cre_messages"
#
Most commands that are available via the interactive interface can be invoked via the -c option. See Chapter 6, mpadmin: Detailed Description for a description of the mpadmin command set and a list of which commands can be used as arguments to the -c option.
Use the -f option to supply input to mpadmin from the file specified by the file-name argument. The source file is expected to consist of one or more mpadmin commands, one command per line.
This option can be particularly useful in the following ways:
It can be used after running the mpadmin dump command, which outputs all or part of a cluster's configuration in the form of an mpadmin script. If the dump output is stored in a file, mpadmin can later read that file via the -f option, thereby reconstructing the configuration that was saved in the dump output file (see the sketch following this list).
The -f option can also be used to read mpadmin scripts written by the system administrator--scripts designed to simplify other cluster management tasks that involve issuing a series of mpadmin commands.
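The following sketch illustrates the first case. It assumes that dump can be invoked through the -c option and that its output can be redirected to a file from the shell; the file name is hypothetical.

# mpadmin -c dump > /tmp/cluster-config.mpadmin

The saved configuration could later be restored with:

# mpadmin -f /tmp/cluster-config.mpadmin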
The -h option displays help information about mpadmin.
Use the -q option to suppress a warning message when a non-root user attempts to invoke a restricted command.
Use the -s option to connect to the cluster specified by the cluster-name argument.
Use the -V option to display the version of mpadmin.
From the perspective of mpadmin, a Sun HPC cluster consists of a system of objects, which include
The cluster itself
Each node contained in the cluster
Each partition (logical group of nodes) defined in the cluster
The network interfaces used by the nodes
Each type of object has a set of attributes whose values can be operated on via mpadmin commands. These attributes control various aspects of their respective objects. For example, a node's enabled attribute can be
set to make the node available for use
unset to prevent it from being used
The CRE sets many attributes in a cluster to default values each time it boots up. Except for attribute modifications described here and in Chapter 6, mpadmin: Detailed Description, do not change attribute values.
mpadmin commands are organized into four contexts, which correspond to the four types of mpadmin objects. These contexts are summarized below and illustrated in Figure 3-1.
Cluster - These commands affect cluster attributes.
Node - These commands affect node attributes.
Network - These commands affect network interface attributes.
Partition - These commands affect partition attributes.
In the interactive mode, the mpadmin prompt contains one or more fields that indicate the current context. Table 3-2 shows the prompt format for each of the possible mpadmin contexts.
Table 3-2 mpadmin Prompt Formats
Prompt Format | Context
---|---
[cluster-name]:: | Current context = Cluster.
[cluster-name]Node:: | Current context = Node, but not a specific node.
[cluster-name]N(node-name):: | Current context = a specific node.
[cluster-name]Partition:: | Current context = Partition, but not a specific partition.
[cluster-name]P(partition-name):: | Current context = a specific partition.
[cluster-name]N(node-name) Network:: | Current context = Network, but not a specific network interface.
[cluster-name]N(node-name) I(net-if-name):: | Current context = a specific network interface.
When the prompt indicates a specific network interface, it uses I as the abbreviation for Network to avoid being confused with the Node abbreviation N.
mpadmin provides various ways to display many kinds of information about the cluster. However, the first information you are likely to need is a list of the nodes in your cluster.
Use the list command in the Node context to display this list. In the following example, list is executed on node1 in a four-node cluster.
node1# mpadmin
[node0]:: node
[node0] Node:: list
node0
node1
node2
node3
[node0] Node::
The mpadmin command starts up an mpadmin interactive session in the cluster context. This is indicated by the [node0]:: prompt, which contains the cluster name, node0, and no other context information.
A cluster's name is assigned by the CRE and is always the name of the cluster's master node.
The node command on the example's second line makes Node the current context. The list command displays a list of all the nodes in the cluster.
Once you have this list of nodes, you have the information you need to enable the nodes and to create a partition. However, before moving on to those steps, you might want to try listing information from within the cluster context or the partition context. In either case, you would follow the same general procedure as for listing nodes.
If this is a newly installed cluster and you have not already run the part_initialize script (as described in the previous chapter), the cluster will contain no partitions at this stage. If, however, you did run part_initialize and have thereby created the partition all, you might want to perform the following test.
node1# mpadmin
[node0]:: partition
[node0] Partition:: list
all
[node0] Partition::
To see what nodes are in partition all, make all the current context and execute the list command. The following example illustrates this; it begins in the Partition context (where the previous example ended).
[node0] Partition:: all
[node0] P[all]:: list
node0
node1
node2
node3
[node0] P[all]::
A node must be in the enabled state before MPI jobs can be run on it. To enable a node, make that node the current context and set its enabled attribute. Repeat for each node that you want to be available for running MPI jobs.
The following example illustrates this, using the same four-node cluster used in the previous examples.
node1# mpadmin
[node0]:: node0
[node0] N[node0]:: set enabled
[node0] N[node0]:: node1
[node0] N[node1]:: set enabled
[node0] N[node1]:: node2
[node0] N[node2]:: set enabled
[node0] N[node2]:: node3
[node0] N[node3]:: set enabled
[node0] N[node3]::
Note the use of a shortcut to move directly from the Cluster context to the node0 context without first going to the general Node context. You can explicitly name a particular object as the target context in this way so long as the name of the object is unambiguous--that is, it is not the same as an mpadmin command.
mpadmin accepts multiple commands on the same line. The previous example could be expressed more succinctly as
node1# mpadmin
[node0]:: node0 set enabled node1 set enabled node2 set enabled node3 set enabled
[node0] N[node3]::
To disable a node, use the unset command in place of the set command.
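For example, the following line (a continuation of the session shown above) would take node3 out of service:

[node0] N[node3]:: unset enabled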
You must create at least one partition and enable it before you can run MPI programs on your Sun HPC cluster. Even if your cluster already has the default partition all in its database, you will probably want to create other partitions with different node configurations to handle particular job requirements.
There are three essential steps involved in creating and enabling a partition:
Use the create command to assign a name to the partition. The next time the CRE starts its master daemons, it will add the names of any newly created partitions to its resource database.
Set the partition's nodes attribute to a list of the nodes you want to include in the partition.
Set the partition's enabled attribute.
Once a partition is created and enabled, you can run serial or parallel jobs on it. A serial program will run on a single node of the partition. Parallel programs will be distributed to as many nodes of the partition as the CRE determines to be appropriate for the job. Job placement on a partition's nodes is discussed in the Sun MPI 4.0 User's Guide: With CRE.
The following example creates and enables a two-node partition named part0. It then lists the member nodes to verify the success of the creation.
node1# mpadmin
[node0]:: partition
[node0] Partition:: create part0
[node0] P[part0]:: set nodes=node0 node1
[node0] P[part0]:: set enabled
[node0] P[part0]:: list
node0
node1
[node0] P[part0]::
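Once part0 has been created and enabled, you could launch a job on it with mprun. The following is only a sketch; a.out is a placeholder program name, and the -np option (number of processes) is described in the User's Guide.

% mprun -p part0 -np 2 a.out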
There are no restrictions on the number or size of partitions, so long as no node is a member of more than one enabled partition.
The next example shows a second partition, part1, being created. One of its nodes, node1, is also a member of part0.
[node0] P[part0]:: up
[node0] Partition:: create part1
[node0] P[part1]:: set nodes=node1 node2 node3
[node0] P[part1]:: list
node1
node2
node3
[node0] P[part1]::
Because node1 is shared with part0, which is already enabled, part1 is not being enabled at this time. This illustrates the rule that a node can be a member of more than one partition, but only one of those partitions can be enabled at a time.
If both partitions were enabled at the same time and you tried to run a job on either one, the attempt would fail and the CRE would return an error message. When you want to use part1, you must first disable part0 and then enable part1.
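The following sketch (continuing the session above) shows one way to make the switch; it simply disables part0 and then enables part1:

[node0] P[part1]:: part0
[node0] P[part0]:: unset enabled
[node0] P[part0]:: part1
[node0] P[part1]:: set enabled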
Note the use of the up command. The up command moves the context up one level, in this case, from the context of a particular partition (that is, from part0) to the general Partition context.
The CRE can configure a partition to allow multiple MPI jobs to be running on it concurrently. Such partitions are referred to as shared partitions. The CRE can also configure a partition to permit only one MPI job to run at a time. These are called dedicated partitions.
In the following example, the partition part0 is configured to be a dedicated partition and part1 is configured to allow shared use by up to four processes.
node1# mpadmin
[node0]:: part0
[node0] P[part0]:: set max_total_procs=1
[node0] P[part0]:: part1
[node0] P[part1]:: set max_total_procs=4
[node0] P[part1]::
The max_total_procs attribute defines how many processes can be active on each node in the partition for which it is being set. In this example, it is set to 1 on part0, which means only one job can be running at a time. It is set to 4 on part1 to allow up to four jobs to be started on that partition.
Note again that the context-changing shortcut (introduced in "Enabling Nodes") is used in the second and fourth lines of this example.
There are two cluster attributes that you may be interested in modifying: logfile and administrator.
The logfile attribute allows you to log CRE messages in a separate file from all other system messages. For example, if you enter
[node0]:: set logfile=/home/wmitty/cre-messages
CRE will output its messages to the file /home/wmitty/cre-messages. If logfile is not set, CRE messages will be passed to syslog, which will store them with other system messages in /var/adm/messages.
A full path name must be specified when setting the logfile attribute.
Set the administrator attribute to specify the email address of the system administrator. For example:
[node0]:: set administrator="root@example.com"
Note the use of double quotes.
Use either the quit or exit command to quit an mpadmin interactive session. Either causes mpadmin to terminate and return you to the shell level. For example:
[node0]:: quit
node1#
When the CRE starts up, it updates portions of the resource database according to the contents of a configuration file named hpc.conf. This file is organized into six sections, which are summarized below and illustrated in Example 3-1.
The ShmemResource section specifies the maximum amount of shared memory and swap space that jobs can allocate.
The Netif section lists and ranks all network interfaces to which Sun HPC nodes may be connected.
The MPIOptions section defines various MPI parameters that can affect the communication performance of MPI jobs.
The PFSFileSystem section names and defines PFS file systems in the cluster.
The PFSServers section names and defines I/O servers for the PFS file systems.
The HPCNodes section is not used by the CRE. It applies only in an LSF-based runtime environment.
You can change any of these aspects of your cluster's configuration by editing the corresponding parts of the hpc.conf file. This section explains how to:
Prepare for editing hpc.conf. See "Prepare to Edit hpc.conf".
Create one or more I/O servers for the PFS file systems. See "Create PFS I/O Servers".
Create PFS file systems. See "Create PFS File Systems".
Specify various attributes of the network interfaces that your cluster nodes use. See "Set Up Network Interfaces".
Learn how to control MPI communication attributes. See "Specify MPI Options".
Update the CRE database. See "Update the CRE Database".
You may never need to make any changes to hpc.conf other than those described in this section. However, if you do want to edit hpc.conf further, see Chapter 7, hpc.conf: Detailed Description for a fuller description of this file.
Begin ShmemResource
:
End ShmemResource

Begin Netif
NAME   RANK   MTU   STRIPE   PROTOCOL   LATENCY   BANDWIDTH
:      :      :     :        :          :         :
End Netif

Begin MPIOptions queue=hpc
:
End MPIOptions

Begin PFSFileSystem=pfs1
NODE   DEVICE   THREADS
:      :        :
End PFSFileSystem

Begin PFSServers
NODE   BUFFER_SIZE
:      :
End PFSServers

Begin HPCNodes
:
End HPCNodes
Perform the steps described in "Stop the CRE Daemons" and "Copy the hpc.conf Template".
Stop the CRE nodal and master daemons (in that order). The nodal daemons must be stopped on each node, including the master node.
You can use one of the CCM tools (cconsole, ctelnet, or crlogin) to broadcast the nodal stop command to all the nodes from a single command entered on the master node.
Use the following scripts to stop the CRE nodal and master daemons:
# /etc/init.d/sunhpc.cre_node stop
# /etc/init.d/sunhpc.cre_master stop
If you edit the hpc.conf file at a later time and make any changes to the PFSServers section or PFSFileSystem section, you will need to also unmount any PFS file systems and stop the PFS daemons on the PFS I/O servers before making the changes.
The Sun HPC ClusterTools 3.0 distribution includes an hpc.conf template, which is stored, by default, in /opt/SUNWhpc/examples/cre/hpc.conf.template.
Copy the template from its installed location to /opt/SUNWhpc/conf/hpc.conf and edit it as described in "Create PFS I/O Servers".
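For example:

# cp /opt/SUNWhpc/examples/cre/hpc.conf.template /opt/SUNWhpc/conf/hpc.conf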
When you have finished editing hpc.conf, perform the steps described in "Update the CRE Database" to update the CRE database with the new configuration information.
Decide which cluster nodes you want to function as PFS I/O servers. To be of value as PFS I/O servers, these nodes must be connected to one or more disk storage devices that have enough capacity to handle the PFS file systems you expect to store on them.
The disk storage units should include some level of RAID support to protect the file systems against failure of individual storage devices.
Once you know which nodes you want as I/O servers, list their host names on separate lines in the PFSServers section of hpc.conf. Example 3-2 shows a sample PFSServers section that includes four PFS I/O server nodes.
Begin PFSServers
NODE        BUFFER_SIZE
hpc-node0   150
hpc-node1   150
hpc-node2   300
hpc-node3   300
End PFSServers
The left column lists the host names of the PFS I/O server nodes.
The second column specifies the amount of memory the PFS I/O daemon will have available for buffering transfer data. This value is specified in units of 32-Kbyte buffers. The number of buffers you specify will depend on the amount of I/O traffic you expect the server to experience at any given time. For example, a value of 150 corresponds to 150 x 32 Kbytes, or roughly 4.7 Mbytes, of buffer memory.
The optimal buffer size will vary with system type and load. Buffer sizes in the range of 128 to 512 provide reasonable performance on most Sun HPC Systems.
You can use pfsstat to get reports on buffer cache hit rates. Knowing buffer cache hit rates can be useful for evaluating how well suited the buffer size is to the cluster's current I/O activity.
Add a separate PFSFileSystem section for each PFS file system you want to create. Include the following information in each PFSFileSystem section:
The name of the parallel file system. See "Parallel File System Name".
The hostname of each server node in the parallel file system. See "Server Node Hostnames".
The name of the storage device to be included in the parallel file system being defined. See "Storage Device Names".
The number of PFS I/O threads spawned to support each PFS storage device. See "Thread Limits".
The following example shows sample PFSFileSystem sections for two parallel file systems, pfs-demo0 and pfs-demo1.
Begin PFSFileSystem=pfs-demo0
NODE        DEVICE               THREADS
hpc-node0   /dev/rdsk/c0t1d0s2   1
hpc-node1   /dev/rdsk/c0t1d0s2   1
hpc-node2   /dev/rdsk/c0t1d0s2   1
End PFSFileSystem

Begin PFSFileSystem=pfs-demo1
NODE        DEVICE               THREADS
hpc-node3   /dev/rdsk/c0t1d0s2   1
hpc-node4   /dev/rdsk/c0t1d0s2   1
End PFSFileSystem
Specify the name of the PFS file system on the first line of the section, to the right of the = symbol.
Apply the same naming conventions to PFS file systems as are used for serial Solaris files.
The NODE column lists the hostnames of the nodes that function as I/O servers for the parallel file system being defined. The example configuration in Example 3-3 defines two parallel file systems:
pfs-demo0 - three server nodes: hpc-node0, hpc-node1, and hpc-node2.
pfs-demo1 - two server nodes: hpc-node3 and hpc-node4.
Note that a node can act as an I/O server for more than one parallel file system. This is possible when the node is attached to at least two storage devices, one assigned to each file system.
In the DEVICE column, specify the name of the device that will be used by the file system. Solaris device naming conventions apply.
In the THREADS column, specify the number of threads a PFS I/O daemon will spawn for the disk storage device or devices it controls. The number of threads needed by a given PFS I/O server node will depend primarily on the performance capabilities of its disk subsystem.
For a storage object with a single disk or a small storage array, one thread may be enough to exploit the storage unit's maximum I/O potential.
For a more powerful storage array, two or more threads may be needed to make full use of the available bandwidth.
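For example, a hypothetical PFSFileSystem entry for a node attached to a higher-throughput storage array might specify several threads (the node name and device shown here are placeholders):

hpc-node5   /dev/rdsk/c1t0d0s2   4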
Edit the Netif section to specify various characteristics of the network interfaces that are used by the nodes in the cluster. Example 3-4 illustrates the default Netif section that is in hpc.conf.template. This section discusses the various network interface attributes that are defined in the Netif section.
Begin Netif
NAME     RANK  MTU    STRIPE  PROTOCOL  LATENCY  BANDWIDTH
midnn    0     16384  0       tcp       20       150
idn      10    16384  0       tcp       20       150
scin     20    32768  1       tcp       20       150
:        :     :      :       :         :        :
scid     40    32768  1       tcp       20       150
:        :     :      :       :         :        :
scirsm   45    32768  1       rsm       20       150
:        :     :      :       :         :        :
smc      220   4096   0       tcp       20       150
End Netif
Add to the first column the names of the network interfaces that are used in your cluster. The supplied Netif section contains an extensive list of commonly used interface types to simplify this task.
By convention, network interface names include a trailing number as a way to distinguish multiple interfaces of the same type. For example, if your cluster includes two 100 Mbit/second Ethernet networks, include the names hme0 and hme1 in the Netif section.
Decide the order in which you want the networks in your cluster to be preferred for use and then edit the RANK column entries to implement that order.
Network preference is based on the relative rank values of the network interfaces, with higher preference given to interfaces with lower rank values. In other words, an interface with a rank of 10 will be selected for use over interfaces with ranks of 11 or higher, but will be passed over in favor of interfaces with ranks of 9 or lower.
These ranking values are relative; their absolute values have no significance. This is why gaps are left in the default rankings, so that if a new interface is added, it can be given an unused rank value without having to change any existing values.
Decisions about how to rank two or more dissimilar network types are usually based on site-specific conditions and requirements. Ordinarily, a cluster's fastest network is given preferential ranking over slower networks. However, raw network bandwidth is only one consideration. For example, an administrator might decide to dedicate a network that offers very low latency, but not the fastest bandwidth, to all intra-cluster communication and use a higher-capacity network for connecting the cluster to systems outside the cluster.
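For example, given the two hypothetical entries sketched below (only the NAME and RANK columns are shown; the remaining columns are elided), the CRE would prefer scirsm0 to hme0 for intra-cluster communication, because 45 is lower than 100:

scirsm0   45    ...
hme0      100   ...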
This is a placeholder column. Its contents are not used at this time.
If your cluster includes an SCI (Scalable Coherent Interface) network, you can implement scalable communication between cluster nodes by striping MPI messages over the SCI interfaces. In striped communication, a message is split into smaller packets and transmitted in two or more parallel streams over a set of network interfaces that have been logically combined into a stripe-group.
The STRIPE column allows the administrator to include individual SCI network interfaces in a stripe-group pool. Members of this pool are available to be included in logical stripe groups. These stripe groups are formed on an as-needed basis, selecting interfaces from this stripe-group pool.
To include the SCI interface in a stripe-group pool, set its STRIPE value to 1. To exclude an interface from the pool, specify 0. Up to four SCI network interface cards per node can be configured for stripe-group membership.
When a message is submitted for transmission over the SCI network, an MPI protocol module distributes the message over as many SCI network interfaces as are available.
Stripe-group membership is made optional so you can reserve some SCI network bandwidth for non-striped use. To do so, simply set STRIPE = 0 on the SCI network interface(s) you wish to reserve in this way.
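For example, the hypothetical entries below would place scirsm0 and scirsm1 in the stripe-group pool while reserving scirsm2 for non-striped traffic. The MTU, PROTOCOL, LATENCY, and BANDWIDTH values are simply copied from the default scirsm entry.

scirsm0   45   32768   1   rsm   20   150
scirsm1   46   32768   1   rsm   20   150
scirsm2   47   32768   0   rsm   20   150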
This column identifies the communication protocol used by the interface. The scirsm interface employs the RSM (Remote Shared Memory) protocol. The others in the default list all use TCP (Transmission Control Protocol).
If you add a network interface of a type not represented in the hpc.conf template, you will need to specify the type of protocol the new interface uses.
This is a placeholder column. Its contents are not used at this time.
This is a placeholder column. Its contents are not used at this time.
The MPIOptions section provides a set of options that control MPI communication behavior in ways that are likely to affect message-passing performance. It contains two templates with predefined option settings. These templates are shown in Example 3-5 and are discussed below.
General-Purpose, Multiuser Template - The first template in the MPIOptions section is designed for general-purpose use at times when multiple message-passing jobs will be running concurrently.
Performance Template - The second template is designed to maximize the performance of message-passing jobs when only one job is allowed to run at a time.
The first line of each template contains the phrase "Queue=xxxx." This is because the queue-based LSF workload manager runtime environment also uses this hpc.conf file.
The options in the general-purpose template are the same as the default settings for the Sun MPI library. In other words, you do not have to uncomment the general-purpose template to have its option values be in effect. This template is provided in the MPIOptions section so you can see what options are most beneficial when operating in a multiuser mode.
# Following is an example of the options that affect the runtime
# environment of the MPI library. The listings below are identical
# to the default settings of the library. The "queue=hpc" phrase
# makes it an LSF-specific entry, and only for the queue named hpc.
# These options are a good choice for a multiuser queue. To be
# recognized by CRE, the "Queue=hpc" needs to be removed.
#
# Begin MPIOptions queue=hpc
# coscheduling       avail
# pbind              avail
# spindtimeout       1000
# progressadjust     on
# spin               off
#
# shm_numpostbox     16
# shm_shortmsgsize   256
# rsm_numpostbox     15
# rsm_shortmsgsize   401
# rsm_maxstripe      2
# End MPIOptions

# The listing below is a good choice when trying to get maximum
# performance out of MPI jobs that are running in a queue that
# allows only one job to run at a time.
#
# Begin MPIOptions Queue=performance
# coscheduling       off
# spin               on
# End MPIOptions
If you want to use the performance template, do the following:
Delete the "Queue=performance" phrase from the Begin MPIOptions line.
Delete the comment character (#) from the beginning of each line of the performance template, including the Begin MPIOptions and End MPIOptions lines.
The resulting template should appear as follows:
Begin MPIOptions
coscheduling   off
spin           on
End MPIOptions
When you have finished editing hpc.conf, update the CRE database with the new information. To do this, restart the CRE master and nodal daemons as follows:
# /etc/init.d/sunhpc.cre_master start
# /etc/init.d/sunhpc.cre_node start
The nodal daemons must be restarted on all the nodes in the cluster, including the master node.
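Once the daemons have restarted, you may want to confirm that the cluster and its nodes are visible to the CRE, for example by running mpinfo. The sketch below assumes the -N option, which lists node information:

# mpinfo -N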