Sun HPC ClusterTools 3.0 Administrator's Guide: With CRE

Chapter 3 Cluster Administration: A Primer

The Sun HPC cluster's default configuration will support execution of MPI applications. In other words, if you have started the CRE daemons on your cluster and created the default partition all (as described in the previous chapter), you can begin executing MPI jobs on the cluster.

You may, however, want to customize your cluster's configuration to make it better suited to the specific administration and use requirements of your site. For example, you may want to create other partitions of different sizes containing different sets of nodes as members to match a variety of job execution needs. You may also want to create one or more PFS file systems to handle the more demanding disk storage requirements that MPI applications often have.

This chapter provides an overview of the principal ways in which you can control your cluster's configuration and behavior. There are three distinct types of control you can use to manage your cluster: CRE environment variables, the mpadmin administration interface, and the hpc.conf cluster configuration file. Each is described in the sections that follow.

CRE Environment Variables

You can use the following environment variables to specify default values for various features of your Sun HPC cluster: SUNHPC_CLUSTER, SUNHPC_CONFIG_DIR, and SUNHPC_PART.

These are described in "SUNHPC_CLUSTER", "SUNHPC_CONFIG_DIR", and "SUNHPC_PART", respectively.

SUNHPC_CLUSTER

SUNHPC_CLUSTER specifies the name of the default Sun HPC cluster. This is the cluster to which users will automatically be connected--unless a different cluster is chosen using the mpadmin -s option, as described in "mpadmin Syntax".

You can find out the name of your cluster by running mpinfo -C, which displays information about the cluster, including its name.


Note -

The name of a cluster is always the host name of the cluster's master node--that is, the node on which the master daemons are running.


SUNHPC_CONFIG_DIR

SUNHPC_CONFIG_DIR specifies the directory in which the CRE's resource database files are to be stored. The default is /var/hpc.

SUNHPC_PART

SUNHPC_PART specifies the name of the default partition. This is the partition on which users' jobs will be executed unless a different partition is selected via the mprun -p option. This option is discussed in the Sun HPC Cluster Runtime Environment 1.0 User's Guide.
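
For example, you might set site-specific defaults for these variables in a login shell before invoking CRE commands. The following Bourne-shell sketch uses illustrative values (node0, /export/hpc-config, and part0 are assumptions, not shipped defaults):

SUNHPC_CLUSTER=node0
SUNHPC_CONFIG_DIR=/export/hpc-config
SUNHPC_PART=part0
export SUNHPC_CLUSTER SUNHPC_CONFIG_DIR SUNHPC_PART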

mpadmin: Administration Interface

The CRE provides an interactive command interface, mpadmin, which you can use to administer your Sun HPC cluster. It must be invoked by root.

This section explains how to use mpadmin to perform the following administrative tasks: listing the nodes in the cluster, enabling nodes, creating and enabling partitions, configuring shared and dedicated partitions, customizing cluster administration attributes, and quitting mpadmin.

These descriptions are preceded by an introduction to the mpadmin command interface, which is presented in "Introduction to mpadmin".


Note -

mpadmin offers many more capabilities than are described in this section. See Chapter 6 for a more comprehensive description of mpadmin.


Introduction to mpadmin

mpadmin Syntax

The mpadmin command has the following syntax.

mpadmin [-c command] [-f filename] [-h] [-q] [-s cluster_name] [-V]

When you invoke mpadmin with the -c, -h, or -V options, it performs the requested operation and returns to the shell level.

When you invoke mpadmin with any of the other options (-f, -q, or -s), it performs the specified operation and then displays an mpadmin prompt, indicating that it is in the interactive mode. In this mode, you can execute any number of mpadmin commands until you quit the interactive session.

When you invoke mpadmin without any options, it goes immediately into the interactive mode, displaying an mpadmin prompt.

The mpadmin command-line options are summarized in Table 3-1 and described more fully following the table.

Table 3-1 mpadmin Options

Option             Description
-c command         Execute a single specified command.
-f file-name       Take input from the specified file.
-h                 Display help/usage text.
-q                 Suppress the display of a warning message when a non-root user attempts to use restricted command mode.
-s cluster-name    Connect to the specified Sun HPC cluster.
-V                 Display mpadmin version information.

-c command - Single Command Option

Use the -c option when you want to execute a single mpadmin command and return upon completion to the shell prompt. For example, the following use of mpadmin -c changes the location of the CRE log file to /home/wmitty/cre_messages:

# mpadmin -c set logfile="/home/wmitty/cre_messages"
#

Note -

Most commands that are available via the interactive interface can be invoked via the -c option. See Chapter 6, mpadmin: Detailed Description for a description of the mpadmin command set and a list of which commands can be used as arguments to the -c option.


-f file-name - Take Input From a File

Use the -f option to supply input to mpadmin from the file specified by the file-name argument. The source file is expected to consist of one or more mpadmin commands, one command per line.

This option can be particularly useful when you want to apply a prepared sequence of configuration commands, for example, to set up or restore a cluster configuration without entering each command interactively.
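
As a sketch, a command file contains one mpadmin command per line. The following example enables two nodes from a file (the file name enable-nodes.cmds is hypothetical; the commands mirror those shown later in "Enabling Nodes"):

# cat enable-nodes.cmds
node0
set enabled
node1
set enabled
# mpadmin -f enable-nodes.cmds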

-h - Display Help

The -h option displays help information about mpadmin.

-q - Suppress Warning Message

Use the -q option to suppress a warning message when a non-root user attempts to invoke a restricted command.

-s cluster-name - Connect to Specified Cluster

Use the -s option to connect to the cluster specified by the cluster-name argument.

-V - Version Display Option

Use the -V option to display the version of mpadmin.

mpadmin Objects, Attributes, and Contexts

mpadmin Objects and Attributes

From the perspective of mpadmin, a Sun HPC cluster consists of a system of objects, which include the cluster itself, the nodes that belong to it, partitions, and network interfaces.

Each type of object has a set of attributes whose values can be operated on via mpadmin commands. These attributes control various aspects of their respective objects. For example, a node's enabled attribute can be set to make the node available for running MPI jobs, or unset to take it out of service.

mpadmin Contexts

mpadmin commands are organized into four contexts (Cluster, Node, Partition, and Network), which correspond to the four types of mpadmin objects. These contexts are illustrated in Figure 3-1.

Figure 3-1 mpadmin Contexts


mpadmin Prompts

In the interactive mode, the mpadmin prompt contains one or more fields that indicate the current context. Table 3-2 shows the prompt format for each of the possible mpadmin contexts.

Table 3-2 mpadmin Prompt Formats

Prompt Format                                   Context
[cluster-name]::                                Current context = Cluster.
[cluster-name] Node::                           Current context = Node, but not a specific node.
[cluster-name] N(node-name)::                   Current context = a specific node.
[cluster-name] Partition::                      Current context = Partition, but not a specific partition.
[cluster-name] P(partition-name)::              Current context = a specific partition.
[cluster-name] N(node-name) Network::           Current context = Network, but not a specific network interface.
[cluster-name] N(node-name) I(net-if-name)::    Current context = a specific network interface.


Note -

When the prompt indicates a specific network interface, it uses I as the abbreviation for Network to avoid being confused with the Node abbreviation N.


List Names of Nodes

mpadmin can display many kinds of information about the cluster in a variety of ways. However, the first information you are likely to need is a list of the nodes in your cluster.

Use the list command in the Node context to display this list. In the following example, list is executed on node1 in a four-node cluster.

node1# mpadmin
[node0]:: node
[node0] Node:: list
    node0
    node1
    node2
    node3
[node0] Node::

The mpadmin command starts up an mpadmin interactive session in the cluster context. This is indicated by the [node0]:: prompt, which contains the cluster name, node0, and no other context information.


Note -

A cluster's name is assigned by the CRE and is always the name of the cluster's master node.


The node command on the example's second line makes Node the current context. The list command displays a list of all the nodes in the cluster.

Once you have this list of nodes, you have the information you need to enable the nodes and to create a partition. However, before moving on to those steps, you might want to try listing information from within the cluster context or the partition context. In either case, you would follow the same general procedure as for listing nodes.

If this is a newly installed cluster and you have not already run the part_initialize script (as described in the previous chapter), the cluster will contain no partitions at this stage. If, however, you did run part_initialize and have thereby created the partition all, you might want to perform the following test.

node1# mpadmin
[node0]:: partition
[node0] Partition:: list
    all
[node0] Partition::

To see what nodes are in partition all, make all the current context and execute the list command. The following example illustrates this; it begins in the Partition context (where the previous example ended).

[node0] Partition:: all
[node0] P[all]:: list
    node0
    node1
    node2
    node3
[node0] P[all]::

Enabling Nodes

A node must be in the enabled state before MPI jobs can be run on it. To enable a node, make that node the current context and set its enabled attribute. Repeat for each node that you want to be available for running MPI jobs.

The following example illustrates this, using the same four-node cluster used in the previous examples.

node1# mpadmin
[node0]:: node0
[node0] N[node0]:: set enabled
[node0] N[node0]:: node1
[node0] N[node1]:: set enabled
[node0] N[node1]:: node2
[node0] N[node2]:: set enabled
[node0] N[node2]:: node3
[node0] N[node3]:: set enabled
[node0] N[node3]::

Note the use of a shortcut to move directly from the Cluster context to the node0 context without first going to the general Node context. You can explicitly name a particular object as the target context in this way so long as the name of the object is unambiguous--that is, it is not the same as an mpadmin command.

mpadmin accepts multiple commands on the same line. The previous example could be expressed more succinctly as

node1# mpadmin
[node0]:: node0 set enabled node1 set enabled node2 set enabled node3 set enabled
[node0] N[node3]::

To disable a node, use the unset command in place of the set command.
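
For example, the following sketch disables node3 from within its node context, continuing the session above:

[node0] N[node3]:: unset enabled
[node0] N[node3]::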

Creating and Enabling Partitions

You must create at least one partition and enable it before you can run MPI programs on your Sun HPC cluster. Even if your cluster already has the default partition all in its database, you will probably want to create other partitions with different node configurations to handle particular job requirements.

There are three essential steps involved in creating and enabling a partition: create the partition, set its nodes attribute to the list of member nodes, and set its enabled attribute. These steps are illustrated in the examples that follow.

Once a partition is created and enabled, you can run serial or parallel jobs on it. A serial program will run on a single node of the partition. Parallel programs will be distributed to as many nodes of the partition as the CRE determines to be appropriate for the job. Job placement on a partition's nodes is discussed in the Sun MPI 4.0 User's Guide: With CRE.

Example: Creating a Two-Node Partition

The following example creates and enables a two-node partition named part0. It then lists the member nodes to verify the success of the creation.

node1# mpadmin
[node0]:: partition
[node0] Partition:: create part0
[node0] P[part0]:: set nodes=node0 node1
[node0] P[part0]:: set enabled
[node0] P[part0]:: list
    node0
    node1
[node0] P[part0]::
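
Once part0 is enabled, jobs can be directed to it with the mprun -p option discussed in the Sun HPC Cluster Runtime Environment 1.0 User's Guide. For example (a.out is a hypothetical executable):

node1# mprun -p part0 a.out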

Note -

There are no restrictions on the number or size of partitions, so long as no node is a member of more than one enabled partition.


Example: Two Partitions Sharing a Node

The next example shows a second partition, part1, being created. One of its nodes, node1, is also a member of part0.

[node0] P[part0]:: up
[node0] Partition:: create part1
[node0] P[part1]:: set nodes=node1 node2 node3
[node0] P[part1]:: list
    node1
    node2
    node3
[node0] P[part1]::

Because node1 is shared with part0, which is already enabled, part1 is not being enabled at this time. This illustrates the rule that a node can be a member of more than one partition, but only one of those partitions can be enabled at a time.

If both partitions were enabled at the same time and you tried to run a job on either, the attempt would fail and the CRE would return an error message. When you want to use part1, you will need to disable part0 first and then enable part1.
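
The following sketch shows one way to make that switch, using the unset and set commands and the context shortcuts introduced earlier (it assumes the session is continuing from the example above):

[node0] P[part1]:: part0
[node0] P[part0]:: unset enabled
[node0] P[part0]:: part1
[node0] P[part1]:: set enabled
[node0] P[part1]::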

Note the use of the up command. The up command moves the context up one level, in this case, from the context of a particular partition (that is, from part0) to the general Partition context.

Shared vs. Dedicated Partitions

The CRE can configure a partition to allow multiple MPI jobs to be running on it concurrently. Such partitions are referred to as shared partitions. The CRE can also configure a partition to permit only one MPI job to run at a time. These are called dedicated partitions.

In the following example, the partition part0 is configured to be a dedicated partition and part1 is configured to allow shared use by up to four processes per node.

node1# mpadmin
[node0]:: part0
[node0] P[part0]:: set max_total_procs=1
[node0] P[part0]:: part1
[node0] P[part1]:: set max_total_procs=4
[node0] P[part1]::

The max_total_procs attribute defines how many processes can be active on each node in the partition for which it is being set. In this example, it is set to 1 on part0, which means only one process, and therefore only one job, can be running on each of its nodes at a time. It is set to 4 on part1 to allow up to four processes per node, so that multiple jobs can share that partition.

Note again that the context-changing shortcut (introduced in "Enabling Nodes") is used in the second and fourth lines of this example.

Customizing Cluster Administration

There are two cluster attributes that you may be interested in modifying, logfile and administrator.

Changing the logfile Attribute

The logfile attribute allows you to log CRE messages in a separate file from all other system messages. For example, if you enter

[node0]:: set logfile=/home/wmitty/cre-messages

CRE will output its messages to the file /home/wmitty/cre-messages. If logfile is not set, CRE messages will be passed to syslog, which will store them with other system messages in /var/adm/messages.


Note -

A full path name must be specified when setting the logfile attribute.


Changing the administrator Attribute

Set the administrator attribute to specify the email address of the system administrator. For example:

[node0]:: set administrator="root@example.com"

Note the use of double quotes.

Quitting mpadmin

Use either the quit or exit command to quit an mpadmin interactive session. Either causes mpadmin to terminate and return you to the shell level. For example:

[node0]:: quit
node1#

hpc.conf: Cluster Configuration File

When the CRE starts up, it updates portions of the resource database according to the contents of a configuration file named hpc.conf. This file is organized into six sections, which are summarized below and illustrated in Example 3-1.

You can change any of these aspects of your cluster's configuration by editing the corresponding parts of the hpc.conf file. This section explains how to: prepare to edit hpc.conf, create PFS I/O servers, create PFS file systems, set up network interfaces, specify MPI options, and update the CRE database.

Prepare to Edit hpc.conf

Perform the steps described in "Stop the CRE Daemons" and "Copy the hpc.conf Template".

Stop the CRE Daemons

Stop the CRE nodal and master daemons (in that order). The nodal daemons must be stopped on each node, including the master node.


Note -

You can use one of the CCM tools (cconsole, ctelnet, or crlogin) to broadcast the nodal stop command to all the nodes from a single command entered on the master node.


Use the following scripts to stop the CRE nodal and master daemons:

# /etc/init.d/sunhpc.cre_node stop
# /etc/init.d/sunhpc.cre_master stop

Note -

If you edit the hpc.conf file at a later time and make any changes to the PFSServers section or PFSFileSystem section, you will need to also unmount any PFS file systems and stop the PFS daemons on the PFS I/O servers before making the changes.


Copy the hpc.conf Template

The Sun HPC ClusterTools 3.0 distribution includes an hpc.conf template, which is stored, by default, in /opt/SUNWhpc/examples/cre/hpc.conf.template.

Copy the template from its installed location to /opt/SUNWhpc/conf/hpc.conf and edit it as described in "Create PFS I/O Servers".
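
For example, as root:

# cp /opt/SUNWhpc/examples/cre/hpc.conf.template /opt/SUNWhpc/conf/hpc.conf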

When you have finished editing hpc.conf, perform the steps described in "Update the CRE Database" to update the CRE database with the new configuration information.

Create PFS I/O Servers

Decide which cluster nodes you want to function as PFS I/O servers. To be of value as PFS I/O servers, these nodes must be connected to one or more disk storage devices with enough capacity to hold the PFS file systems you expect to store on them.


Note -

The disk storage units should include some level of RAID support to protect the file systems against failure of individual storage devices.


Once you know which nodes you want as I/O servers, list their host names on separate lines in the PFSServers section of hpc.conf. Example 3-2 shows a sample PFSServers section that includes three PFS I/O server nodes.


Example 3-2 PFSServers Section Example


Begin PFSServers
NODE            BUFFER_SIZE
hpc-node0        150
hpc-node1        150
hpc-node2        300
hpc-node3        300
End PFSServers

The left column lists the host names of the PFS I/O server nodes.

The second column specifies the amount of memory the PFS I/O daemon will have available for buffering transfer data. This value is specified in units of 32-Kbyte buffers. The number of buffers you should specify depends on the amount of I/O traffic you expect the server to experience at any given time.

The optimal buffer size will vary with system type and load. Buffer sizes in the range of 128 to 512 provide reasonable performance on most Sun HPC Systems.
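
For example, the BUFFER_SIZE value of 150 specified for hpc-node0 in Example 3-2 gives that node's PFS I/O daemon 150 x 32 Kbytes = 4800 Kbytes (roughly 4.7 Mbytes) of buffer memory; the value of 300 for hpc-node2 provides twice that amount.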


Note -

You can use pfsstat to get reports on buffer cache hit rates. Knowing buffer cache hit rates can be useful for evaluating how well suited the buffer size is to the cluster's current I/O activity.


Create PFS File Systems

Add a separate PFSFileSystem section for each PFS file system you want to create. Include the following information in each PFSFileSystem section: the name of the parallel file system, the hostnames of its server nodes, the storage device to be used on each node, and the number of I/O threads for each device. Each of these entries is described below.

The following example shows sample PFSFileSystem sections for two parallel file systems, pfs-demo0 and pfs-demo1.


Example 3-3 PFSFileSystem Section Example


Begin PFSFileSystem=pfs-demo0
NODE            DEVICE              THREADS 
hpc-node0       /dev/rdsk/c0t1d0s2  1
hpc-node1       /dev/rdsk/c0t1d0s2  1
hpc-node2       /dev/rdsk/c0t1d0s2  1
End PFSFileSystem

Begin PFSFileSystem=pfs-demo1
NODE            DEVICE              THREADS
hpc-node3       /dev/rdsk/c0t1d0s2  1
hpc-node4       /dev/rdsk/c0t1d0s2  1
End PFSFileSystem

Parallel File System Name

Specify the name of the PFS file system on the first line of the section, to the right of the = symbol.

Apply the same naming conventions to PFS file systems as are used for serial Solaris files.

Server Node Hostnames

The NODE column lists the hostnames of the nodes that function as I/O servers for the parallel file system being defined. The example configuration in Example 3-3 shows two parallel file systems: pfs-demo0, which is served by hpc-node0, hpc-node1, and hpc-node2, and pfs-demo1, which is served by hpc-node3 and hpc-node4.

Note that a node can serve more than one parallel file system, provided it is attached to separate storage devices, with one device assigned to each file system.

Storage Device Names

In the DEVICE column, specify the name of the device that will be used by the file system. Solaris device naming conventions apply.

Thread Limits

In the THREADS column, specify the number of threads a PFS I/O daemon will spawn for the disk storage device or devices it controls. The number of threads needed by a given PFS I/O server node will depend primarily on the performance capabilities of its disk subsystem.

Set Up Network Interfaces

Edit the Netif section to specify various characteristics of the network interfaces that are used by the nodes in the cluster. Example 3-4 illustrates the default Netif section that is in hpc.conf.template. This section  discusses the various network interface attributes that are defined in the Netif section.


Example 3-4 Netif Section Example


Begin Netif
NAME           RANK     MTU       STRIPE     PROTOCOL     LATENCY     BANDWIDTH
midnn          0        16384     0          tcp          20          150
idn            10       16384     0          tcp          20          150
scin           20       32768     1          tcp          20          150
  :            :          :       :           :           :            :
scid           40       32768     1          tcp          20          150
  :            :          :       :           :           :            :
scirsm         45       32768     1          rsm          20          150
  :            :          :       :           :           :            :
  :            :          :       :           :           :            :
smc            220      4096      0          tcp          20          150
End Netif

Interface Names

Add to the first column the names of the network interfaces that are used in your cluster. The supplied Netif section contains an extensive list of commonly used interface types to simplify this task.

By convention, network interface names include a trailing number as a way to distinguish multiple interfaces of the same type. For example, if your cluster includes two 100 Mbit/second Ethernet networks, include the names hme0 and hme1 in the Netif section.
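
As a sketch, two such entries might look like the following, using the column layout of Example 3-4 (the RANK values 50 and 51 are arbitrary choices for illustration, and the MTU, LATENCY, and BANDWIDTH columns are placeholders, as described below):

hme0           50       1500      0          tcp          20          150
hme1           51       1500      0          tcp          20          150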

Rank Attribute

Decide the order in which you want the networks in your cluster to be preferred for use and then edit the RANK column entries to implement that order.

Network preference is based on the relative value of each network interface's rank, with higher preference given to interfaces with lower rank values. In other words, an interface with a rank of 10 is preferred over interfaces with ranks of 11 or higher, while interfaces with ranks of 9 or lower are preferred over it.


Note -

These ranking values are relative; their absolute values have no significance. This is why gaps are left in the default rankings, so that if a new interface is added, it can be given an unused rank value without having to change any existing values.


Decisions about how to rank two or more dissimilar network types are usually based on site-specific conditions and requirements. Ordinarily, a cluster's fastest network is given preferential ranking over slower networks. However, raw network bandwidth is only one consideration. For example, an administrator might decide to dedicate a network that offers very low latency, but not the highest bandwidth, to all intra-cluster communication and use a higher-capacity network for connecting the cluster to systems outside the cluster.

MTU Attribute

This is a placeholder column. Its contents are not used at this time.

Stripe Attribute

If your cluster includes an SCI (Scalable Coherent Interface) network, you can implement scalable communication between cluster nodes by striping MPI messages over the SCI interfaces. In striped communication, a message is split into smaller packets and transmitted in two or more parallel streams over a set of network interfaces that have been logically combined into a stripe-group.

The STRIPE column allows the administrator to include individual SCI network interfaces in a stripe-group pool. Logical stripe groups are then formed on an as-needed basis from the interfaces in this pool.

To include the SCI interface in a stripe-group pool, set its STRIPE value to 1. To exclude an interface from the pool, specify 0. Up to four SCI network interface cards per node can be configured for stripe-group membership.

When a message is submitted for transmission over the SCI network, an MPI protocol module distributes the message over as many SCI network interfaces as are available.

Stripe-group membership is made optional so you can reserve some SCI network bandwidth for non-striped use. To do so, simply set STRIPE = 0 on the SCI network interface(s) you wish to reserve in this way.
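
For example, to keep the scirsm interface out of the stripe-group pool, only its STRIPE value in Example 3-4 would change (a sketch; the other columns are unchanged):

scirsm         45       32768     0          rsm          20          150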

Protocol Attribute

This column identifies the communication protocol used by the interface. The scirsm interface employs the RSM (Remote Shared Memory) protocol. The others in the default list all use TCP (Transmission Control Protocol).

If you add a network interface of a type not represented in the hpc.conf template, you will need to specify the type of protocol the new interface uses.

Latency Attribute

This is a placeholder column. Its contents are not used at this time.

Bandwidth Attribute

This is a placeholder column. Its contents are not used at this time.

Specify MPI Options

The MPIOptions section provides a set of options that control MPI communication behavior in ways that are likely to affect message-passing performance. It contains two templates with predefined option settings. These templates are shown in Example 3-5 and are discussed below.

General-Purpose, Multiuser Template - The first template in the MPIOptions section is designed for general-purpose use at times when multiple message-passing jobs will be running concurrently.

Performance Template - The second template is designed to maximize the performance of message-passing jobs when only one job is allowed to run at a time.


Note -

The first line of each template contains the phrase "Queue=xxxx." This is because the queue-based LSF workload manager runtime environment also uses this hpc.conf file.


The options in the general-purpose template are the same as the default settings for the Sun MPI library. In other words, you do not have to uncomment the general-purpose template to have its option values be in effect. This template is provided in the MPIOptions section so you can see what options are most beneficial when operating in a multiuser mode.


Example 3-5 MPIOptions Section Example


# Following is an example of the options that affect the runtime
# environment of the MPI library. The listings below are identical
# to the default settings of the library. The "queue=hpc" phrase
# makes it an LSF-specific entry, and only for the queue named hpc.
# These options are a good choice for a multiuser queue. To be
# recognized by CRE, the "Queue=hpc" needs to be removed.
#
# Begin MPIOptions queue=hpc
# coscheduling  avail
# pbind         avail
# spindtimeout   1000
# progressadjust   on
# spin            off
#
# shm_numpostbox       16
# shm_shortmsgsize    256
# rsm_numpostbox       15
# rsm_shortmsgsize    401
# rsm_maxstripe         2
# End MPIOptions

# The listing below is a good choice when trying to get maximum
# performance out of MPI jobs that are running in a queue that
# allows only one job to run at a time.
#
# Begin MPIOptions Queue=performance
# coscheduling             off
# spin                      on
# End MPIOptions

If you want to use the performance template, do the following: remove the comment character (#) from the beginning of each line of the performance template, and delete the phrase Queue=performance from the Begin MPIOptions line.

The resulting template should appear as follows:

Begin MPIOptions
coscheduling			off
spin			on
End MPIOptions

Update the CRE Database

When you have finished editing hpc.conf, update the CRE database with the new information. To do this, restart the CRE master and nodal daemons as follows:

# /etc/init.d/sunhpc.cre_master start
# /etc/init.d/sunhpc.cre_node start

The nodal daemons must be restarted on all the nodes in the cluster, including the master node.