The Sun HPC cluster's default configuration will support execution of MPI applications. In other words, if you have started the CRE daemons on your cluster and created the default partition all (as described in the previous chapter), you can begin executing MPI jobs on the cluster.
You may, however, want to customize your cluster's configuration to make it better suited to the specific administration and use requirements of your site. For example, you may want to create other partitions of different sizes containing different sets of nodes as members to match a variety of job execution needs. You may also want to create one or more PFS file systems to handle the more demanding disk storage requirements that MPI applications often have.
This chapter provides an overview of the principal ways in which you can control your cluster's configuration and behavior. There are three distinct types of control you can use to manage your cluster:
Environment variables - See "CRE Environment Variables".
Administration interface (mpadmin) - See "mpadmin: Administration Interface" for an introduction to the mpadmin command interface and instructions for using it to perform basic administration tasks. See Chapter 6, mpadmin: Detailed Description for a more comprehensive discussion of mpadmin.
Cluster configuration file (hpc.conf) - See "hpc.conf: Cluster Configuration File" for a brief description of the hpc.conf file and Chapter 7, hpc.conf: Detailed Description for instructions on how to modify it.
You can use the following environment variables to specify default values for various features of your Sun HPC cluster.
SUNHPC_CLUSTER
SUNHPC_CONFIG_DIR
SUNHPC_PART
These are described in "SUNHPC_CLUSTER ", "SUNHPC_CONFIG_DIR ", and "SUNHPC_PART ", respectively.
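For example, a user whose login shell is the Bourne or Korn shell could place lines such as the following in a startup file. The values shown are only illustrative; substitute the cluster name, database directory, and partition name appropriate to your site.

SUNHPC_CLUSTER=node0; export SUNHPC_CLUSTER
SUNHPC_CONFIG_DIR=/var/hpc; export SUNHPC_CONFIG_DIR
SUNHPC_PART=part0; export SUNHPC_PART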
SUNHPC_CLUSTER specifies the name of the default Sun HPC cluster. This is the cluster to which users will automatically be connected--unless a different cluster is chosen using the mpadmin -s option, as described in "mpadmin Syntax".
You can find out the name of your cluster by running mpinfo -C, which displays information about the cluster, including its name.
The name of a cluster is always the host name of the cluster's master node--that is, the node on which the master daemons are running.
SUNHPC_CONFIG_DIR specifies the directory in which the CRE's resource database files are to be stored. The default is /var/hpc.
SUNHPC_PART specifies the name of the default partition. This is the partition on which users' jobs will be executed unless a different partition is selected via the mprun -p option. This option is discussed in the Sun HPC Cluster Runtime Environment 1.0 User's Guide.
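For example, assuming a partition named part0 exists and is enabled, a user could override the default partition for a single job with a command along the following lines (a.out is a placeholder program name; see the User's Guide for the complete mprun syntax):

% mprun -p part0 a.out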
The CRE provides an interactive command interface, mpadmin, which you can use to administer your Sun HPC cluster. It must be invoked by root.
This section explains how to use mpadmin to perform the following administrative tasks:
List the names of all nodes in the cluster - see "List Names of Nodes"
Enable nodes - see "Enabling Nodes"
Create and enable partitions - see "Creating and Enabling Partitions"
Customize some aspects of cluster administration - see "Customizing Cluster Administration"
These descriptions are preceded by an introduction of the mpadmin command interface, which is presented in "Introduction to mpadmin".
mpadmin offers many more capabilities than are described in this section. See Chapter 6 for a more comprehensive description of mpadmin.
The mpadmin command has the following syntax.
# mpadmin [-c command] [-f filename] [-h] [-q] [-s cluster_name] [-V]
When you invoke mpadmin with the -c, -h, or -V options, it performs the requested operation and returns to the shell level.
When you invoke mpadmin with any of the other options (-f, -q, or -s), it performs the specified operation and then displays an mpadmin prompt, indicating that it is in the interactive mode. In this mode, you can execute any number of mpadmin commands until you quit the interactive session.
When you invoke mpadmin without any options, it goes immediately into the interactive mode, displaying an mpadmin prompt.
The mpadmin command-line options are summarized in Table 3-1 and described more fully following the table.
Table 3-1 mpadmin Options
Option | Description
---|---
-c command | Execute a single specified command.
-f file-name | Take input from the specified file.
-h | Display help/usage text.
-q | Suppress the warning message that is displayed when a non-root user attempts to use a restricted command.
-s cluster-name | Connect to the specified Sun HPC cluster.
-V | Display mpadmin version information.
Use the -c option when you want to execute a single mpadmin command and return upon completion to the shell prompt. For example, the following use of mpadmin -c changes the location of the CRE log file to /home/wmitty/cre_messages:
# mpadmin -c set logfile="/home/wmitty/cre_messages"
#
Most commands that are available via the interactive interface can be invoked via the -c option. See Chapter 6, mpadmin: Detailed Description for a description of the mpadmin command set and a list of which commands can be used as arguments to the -c option.
Use the -f option to supply input to mpadmin from the file specified by the file-name argument. The source file is expected to consist of one or more mpadmin commands, one command per line.
This option can be particularly useful in the following ways:
It can be used after running the mpadmin dump command, which outputs all or part of a cluster's configuration in the form of an mpadmin script. If the dump output is stored in a file, mpadmin can later read that file via the -f option, thereby reconstructing the configuration that was saved in the dump output file (see the sketch following this list).
The -f option can also be used to read mpadmin scripts written by the system administrator--scripts designed to simplify other cluster management tasks that involve issuing a series of mpadmin commands.
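The following sketch illustrates the first case. It assumes that dump can be invoked through the -c option and that its output can be redirected to a file from the shell; the file name is hypothetical.

# mpadmin -c dump > /tmp/cluster-config.mpadmin

The saved configuration could later be restored with:

# mpadmin -f /tmp/cluster-config.mpadmin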
The -h option displays help information about mpadmin.
Use the -q option to suppress a warning message when a non-root user attempts to invoke a restricted command.
Use the -s option to connect to the cluster specified by the cluster-name argument.
Use the -V option to display the version of mpadmin.
From the perspective of mpadmin, a Sun HPC cluster consists of a system of objects, which include
The cluster itself
Each node contained in the cluster
Each partition (logical group of nodes) defined in the cluster
The network interfaces used by the nodes
Each type of object has a set of attributes whose values can be operated on via mpadmin commands. These attributes control various aspects of their respective objects. For example, a node's enabled attribute can be
set to make the node available for use
unset to prevent it from being used
The CRE sets many attributes in a cluster to default values each time it boots up. Except for attribute modifications described here and in Chapter 6, mpadmin: Detailed Description, do not change attribute values.
mpadmin commands are organized into four contexts, which correspond to the four types of mpadmin objects. These contexts are summarized below and illustrated in Figure 3-1.
Cluster - These commands affect cluster attributes.
Node - These commands affect node attributes.
Network - These commands affect network interface attributes.
Partition - These commands affect partition attributes.
In the interactive mode, the mpadmin prompt contains one or more fields that indicate the current context. Table 3-2 shows the prompt format for each of the possible mpadmin contexts.
Table 3-2 mpadmin Prompt Formats
Prompt Format | Context
---|---
[cluster-name]:: | Current context = Cluster.
[cluster-name]Node:: | Current context = Node, but not a specific node.
[cluster-name]N(node-name):: | Current context = a specific node.
[cluster-name]Partition:: | Current context = Partition, but not a specific partition.
[cluster-name]P(partition-name):: | Current context = a specific partition.
[cluster-name]N(node-name) Network:: | Current context = Network, but not a specific network interface.
[cluster-name]N(node-name) I(net-if-name):: | Current context = a specific network interface.
When the prompt indicates a specific network interface, it uses I as the abbreviation for Network to avoid being confused with the Node abbreviation N.
mpadmin provides various ways to display many kinds of information about the cluster. However, the first information you are likely to need is a list of the nodes in your cluster.
Use the list command in the Node context to display this list. In the following example, list is executed on node1 in a four-node cluster.
node1# mpadmin
[node0]:: node
[node0] Node:: list
node0
node1
node2
node3
[node0] Node::
The mpadmin command starts up an mpadmin interactive session in the cluster context. This is indicated by the [node0]:: prompt, which contains the cluster name, node0, and no other context information.
A cluster's name is assigned by the CRE and is always the name of the cluster's master node.
The node command on the example's second line makes Node the current context. The list command displays a list of all the nodes in the cluster.
Once you have this list of nodes, you have the information you need to enable the nodes and to create a partition. However, before moving on to those steps, you might want to try listing information from within the cluster context or the partition context. In either case, you would follow the same general procedure as for listing nodes.
If this is a newly installed cluster and you have not already run the part_initialize script (as described in the previous chapter), the cluster will contain no partitions at this stage. If, however, you did run part_initialize and have thereby created the partition all, you might want to perform the following test.
node1# mpadmin
[node0]:: partition
[node0] Partition:: list
all
[node0] Partition::
To see what nodes are in partition all, make all the current context and execute the list command. The following example illustrates this; it begins in the Partition context (where the previous example ended).
[node0] Partition:: all
[node0] P[all]:: list
node0
node1
node2
node3
[node0] P[all]::
A node must be in the enabled state before MPI jobs can be run on it. To enable a node, make that node the current context and set its enabled attribute. Repeat for each node that you want to be available for running MPI jobs.
The following example illustrates this, using the same four-node cluster used in the previous examples.
node1# mpadmin
[node0]:: node0
[node0] N[node0]:: set enabled
[node0] N[node0]:: node1
[node0] N[node1]:: set enabled
[node0] N[node1]:: node2
[node0] N[node2]:: set enabled
[node0] N[node2]:: node3
[node0] N[node3]:: set enabled
[node0] N[node3]::
Note the use of a shortcut to move directly from the Cluster context to the node0 context without first going to the general Node context. You can explicitly name a particular object as the target context in this way so long as the name of the object is unambiguous--that is, it is not the same as an mpadmin command.
mpadmin accepts multiple commands on the same line. The previous example could be expressed more succinctly as
node1# mpadmin
[node0]:: node0 set enabled node1 set enabled node2 set enabled node3 set enabled
[node0] N[node3]::
To disable a node, use the unset command in place of the set command.
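For example, the following line (a continuation of the session shown above) would take node3 out of service:

[node0] N[node3]:: unset enabled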
You must create at least one partition and enable it before you can run MPI programs on your Sun HPC cluster. Even if your cluster already has the default partition all in its database, you will probably want to create other partitions with different node configurations to handle particular job requirements.
There are three essential steps involved in creating and enabling a partition:
Use the create command to assign a name to the partition. The next time the CRE starts its master daemons, it will add the names of any newly created partitions to its resource database.
Set the partition's nodes attribute to a list of the nodes you want to include in the partition.
Set the partition's enabled attribute.
Once a partition is created and enabled, you can run serial or parallel jobs on it. A serial program will run on a single node of the partition. Parallel programs will be distributed to as many nodes of the partition as the CRE determines to be appropriate for the job. Job placement on a partition's nodes is discussed in the Sun MPI 4.0 User's Guide: With CRE.
The following example creates and enables a two-node partition named part0. It then lists the member nodes to verify the success of the creation.
node1# mpadmin
[node0]:: partition
[node0] Partition:: create part0
[node0] P[part0]:: set nodes=node0 node1
[node0] P[part0]:: set enabled
[node0] P[part0]:: list
node0
node1
[node0] P[part0]::
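Once part0 has been created and enabled, you could launch a job on it with mprun. The following is only a sketch; a.out is a placeholder program name, and the -np option (number of processes) is described in the User's Guide.

% mprun -p part0 -np 2 a.out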
There are no restrictions on the number or size of partitions, so long as no node is a member of more than one enabled partition.
The next example shows a second partition, part1, being created. One of its nodes, node1, is also a member of part0.
[node0] P[part0]:: up
[node0] Partition:: create part1
[node0] P[part1]:: set nodes=node1 node2 node3
[node0] P[part1]:: list
node1
node2
node3
[node0] P[part1]::
Because node1 is shared with part0, which is already enabled, part1 is not being enabled at this time. This illustrates the rule that a node can be a member of more than one partition, but only one of those partitions can be enabled at a time.
If both partitions were enabled at the same time and you tried to run a job on either one, the attempt would fail and the CRE would return an error message. When you want to use part1, you must first disable part0 and then enable part1.
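The following sketch (continuing the session above) shows one way to make the switch; it simply disables part0 and then enables part1:

[node0] P[part1]:: part0
[node0] P[part0]:: unset enabled
[node0] P[part0]:: part1
[node0] P[part1]:: set enabled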
Note the use of the up command. The up command moves the context up one level, in this case, from the context of a particular partition (that is, from part0) to the general Partition context.
The CRE can configure a partition to allow multiple MPI jobs to be running on it concurrently. Such partitions are referred to as shared partitions. The CRE can also configure a partition to permit only one MPI job to run at a time. These are called dedicated partitions.
In the following example, the partition part0 is configured to be a dedicated partition and part1 is configured to allow shared use by up to four processes.
node1# mpadmin
[node0]:: part0
[node0] P[part0]:: set max_total_procs=1
[node0] P[part0]:: part1
[node0] P[part1]:: set max_total_procs=4
[node0] P[part1]::
The max_total_procs attribute defines how many processes can be active on each node in the partition for which it is being set. In this example, it is set to 1 on part0, which means only one job can be running at a time. It is set to 4 on part1 to allow up to four jobs to be started on that partition.
Note again that the context-changing shortcut (introduced in "Enabling Nodes") is used in the second and fourth lines of this example.
There are two cluster attributes that you may be interested in modifying: logfile and administrator.
The logfile attribute allows you to log CRE messages in a separate file from all other system messages. For example, if you enter
[node0]:: set logfile=/home/wmitty/cre-messages
CRE will output its messages to the file /home/wmitty/cre-messages. If logfile is not set, CRE messages will be passed to syslog, which will store them with other system messages in /var/adm/messages.
A full path name must be specified when setting the logfile attribute.
Set the administrator attribute to specify the email address of the system administrator. For example:
[node0]:: set administrator="root@example.com"
Note the use of double quotes.
Use either the quit or exit command to quit an mpadmin interactive session. Either causes mpadmin to terminate and return you to the shell level. For example:
[node0]:: quit
node1#
When the CRE starts up, it updates portions of the resource database according to the contents of a configuration file named hpc.conf. This file is organized into six sections, which are summarized below and illustrated in Example 3-1.
The ShmemResource section specifies the maximum amount of shared memory and swap space that jobs can allocate.
The Netif section lists and ranks all network interfaces to which Sun HPC nodes may be connected.
The MPIOptions section defines various MPI parameters that can affect the communication performance of MPI jobs.
The PFSFileSystem section names and defines PFS file systems in the cluster.
The PFSServers section names and defines I/O servers for the PFS file systems.
The HPCNodes section is not used by the CRE. It applies only in an LSF-based runtime environment.
You can change any of these aspects of your cluster's configuration by editing the corresponding parts of the hpc.conf file. This section explains how to:
Prepare for editing hpc.conf. See "Prepare to Edit hpc.conf".
Create one or more I/O servers for the PFS file systems. See "Create PFS I/O Servers".
Create PFS file systems. See "Create PFS File Systems".
Specify various attributes of the network interfaces that your cluster nodes use. See "Set Up Network Interfaces".
Learn how to control MPI communication attributes. See "Specify MPI Options".
Update the CRE database. See "Update the CRE Database".
You may never need to make any changes to hpc.conf other than those described in this section. However, if you do want to edit hpc.conf further, see Chapter 7, hpc.conf: Detailed Description for a fuller description of this file.
Begin ShmemResource
:
End ShmemResource

Begin Netif
NAME   RANK   MTU   STRIPE   PROTOCOL   LATENCY   BANDWIDTH
:      :      :     :        :          :         :
End Netif

Begin MPIOptions queue=hpc
:
End MPIOptions

Begin PFSFileSystem=pfs1
NODE   DEVICE   THREADS
:      :        :
End PFSFileSystem

Begin PFSServers
NODE   BUFFER_SIZE
:      :
End PFSServers

Begin HPCNodes
:
End HPCNodes
Perform the steps described in "Stop the CRE Daemons" and "Copy the hpc.conf Template".
Stop the CRE nodal and master daemons (in that order). The nodal daemons must be stopped on each node, including the master node.
You can use one of the CCM tools (cconsole, ctelnet, or crlogin) to broadcast the nodal stop command to all the nodes from a single command entered on the master node.
Use the following scripts to stop the CRE nodal and master daemons:
# /etc/init.d/sunhpc.cre_node stop
# /etc/init.d/sunhpc.cre_master stop
If you edit the hpc.conf file at a later time and make any changes to the PFSServers section or PFSFileSystem section, you will need to also unmount any PFS file systems and stop the PFS daemons on the PFS I/O servers before making the changes.
The Sun HPC ClusterTools 3.0 distribution includes an hpc.conf template, which is stored, by default, in /opt/SUNWhpc/examples/cre/hpc.conf.template.
Copy the template from its installed location to /opt/SUNWhpc/conf/hpc.conf and edit it as described in "Create PFS I/O Servers".
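For example:

# cp /opt/SUNWhpc/examples/cre/hpc.conf.template /opt/SUNWhpc/conf/hpc.conf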
When you have finished editing hpc.conf, perform the steps described in "Update the CRE Database" to update the CRE database with the new configuration information.
Decide which cluster nodes you want to function as PFS I/O servers. To be of value as PFS I/O servers, these nodes must be connected to one or more disk storage devices that have enough capacity to handle the PFS file systems you expect to store on them.
The disk storage units should include some level of RAID support to protect the file systems against failure of individual storage devices.
Once you know which nodes you want as I/O servers, list their host names on separate lines in the PFSServers section of hpc.conf. Example 3-2 shows a sample PFSServers section that includes four PFS I/O server nodes.
Begin PFSServers
NODE        BUFFER_SIZE
hpc-node0   150
hpc-node1   150
hpc-node2   300
hpc-node3   300
End PFSServers
The left column lists the host names of the PFS I/O server nodes.
The second column specifies the amount of memory the PFS I/O daemon will have available for buffering transfer data. This value is specified in units of 32-Kbyte buffers. The number of buffers you specify will depend on the amount of I/O traffic you expect the server to experience at any given time. For example, a value of 150 corresponds to 150 x 32 Kbytes, or roughly 4.7 Mbytes, of buffer memory.
The optimal buffer size will vary with system type and load. Buffer sizes in the range of 128 to 512 provide reasonable performance on most Sun HPC Systems.
You can use pfsstat to get reports on buffer cache hit rates. Knowing buffer cache hit rates can be useful for evaluating how well suited the buffer size is to the cluster's current I/O activity.
Add a separate PFSFileSystem section for each PFS file system you want to create. Include the following information in each PFSFileSystem section:
The name of the parallel file system. See "Parallel File System Name".
The hostname of each server node in the parallel file system. See "Server Node Hostnames".
The name of the storage device to be included in the parallel file system being defined. See "Storage Device Names".
The number of PFS I/O threads spawned to support each PFS storage device. See "Thread Limits".
The following example shows sample PFSFileSystem sections for two parallel file systems, pfs-demo0 and pfs-demo1.
Begin PFSFileSystem=pfs-demo0
NODE        DEVICE               THREADS
hpc-node0   /dev/rdsk/c0t1d0s2   1
hpc-node1   /dev/rdsk/c0t1d0s2   1
hpc-node2   /dev/rdsk/c0t1d0s2   1
End PFSFileSystem

Begin PFSFileSystem=pfs-demo1
NODE        DEVICE               THREADS
hpc-node3   /dev/rdsk/c0t1d0s2   1
hpc-node4   /dev/rdsk/c0t1d0s2   1
End PFSFileSystem
Specify the name of the PFS file system on the first line of the section, to the right of the = symbol.
Apply the same naming conventions to PFS file systems as are used for serial Solaris files.
The NODE column lists the hostnames of the nodes that function as I/O servers for the parallel file system being defined. The example configuration in Example 3-3 defines two parallel file systems:
pfs-demo0 - three server nodes: hpc-node0, hpc-node1, and hpc-node2.
pfs-demo1 - two server nodes: hpc-node3 and hpc-node4.
Note that a node can act as an I/O server for more than one parallel file system. This is possible when the node is attached to at least two storage devices, one assigned to each file system.
In the DEVICE column, specify the name of the device that will be used by the file system. Solaris device naming conventions apply.
In the THREADS column, specify the number of threads a PFS I/O daemon will spawn for the disk storage device or devices it controls. The number of threads needed by a given PFS I/O server node will depend primarily on the performance capabilities of its disk subsystem.
For a storage object with a single disk or a small storage array, one thread may be enough to exploit the storage unit's maximum I/O potential.
For a more powerful storage array, two or more threads may be needed to make full use of the available bandwidth.
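For example, a hypothetical PFSFileSystem entry for a node attached to a higher-throughput storage array might specify several threads (the node name and device shown here are placeholders):

hpc-node5   /dev/rdsk/c1t0d0s2   4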
Edit the Netif section to specify various characteristics of the network interfaces that are used by the nodes in the cluster. Example 3-4 illustrates the default Netif section that is in hpc.conf.template. This section discusses the various network interface attributes that are defined in the Netif section.
Begin Netif
NAME     RANK  MTU    STRIPE  PROTOCOL  LATENCY  BANDWIDTH
midnn    0     16384  0       tcp       20       150
idn      10    16384  0       tcp       20       150
scin     20    32768  1       tcp       20       150
:        :     :      :       :         :        :
scid     40    32768  1       tcp       20       150
:        :     :      :       :         :        :
scirsm   45    32768  1       rsm       20       150
:        :     :      :       :         :        :
smc      220   4096   0       tcp       20       150
End Netif
Add to the first column the names of the network interfaces that are used in your cluster. The supplied Netif section contains an extensive list of commonly used interface types to simplify this task.
By convention, network interface names include a trailing number as a way to distinguish multiple interfaces of the same type. For example, if your cluster includes two 100 Mbit/second Ethernet networks, include the names hme0 and hme1 in the Netif section.
Decide the order in which you want the networks in your cluster to be preferred for use and then edit the RANK column entries to implement that order.
Network preference is based on the relative rank values of the network interfaces, with higher preference given to interfaces with lower rank values. In other words, an interface with a rank of 10 will be selected for use over interfaces with ranks of 11 or higher, but will be passed over in favor of interfaces with ranks of 9 or lower.
These ranking values are relative; their absolute values have no significance. This is why gaps are left in the default rankings, so that if a new interface is added, it can be given an unused rank value without having to change any existing values.
Decisions about how to rank two or more dissimilar network types are usually based on site-specific conditions and requirements. Ordinarily, a cluster's fastest network is given preferential ranking over slower networks. However, raw network bandwidth is only one consideration. For example, an administrator might decide to dedicate a network that offers very low latency, but not the fastest bandwidth, to all intra-cluster communication and use a higher-capacity network for connecting the cluster to systems outside the cluster.
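For example, given the two hypothetical entries sketched below (only the NAME and RANK columns are shown; the remaining columns are elided), the CRE would prefer scirsm0 to hme0 for intra-cluster communication, because 45 is lower than 100:

scirsm0   45    ...
hme0      100   ...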
This is a placeholder column. Its contents are not used at this time.
If your cluster includes an SCI (Scalable Coherent Interface) network, you can implement scalable communication between cluster nodes by striping MPI messages over the SCI interfaces. In striped communication, a message is split into smaller packets and transmitted in two or more parallel streams over a set of network interfaces that have been logically combined into a stripe-group.
The STRIPE column allows the administrator to include individual SCI network interfaces in a stripe-group pool. Members of this pool are available to be included in logical stripe groups. These stripe groups are formed on an as-needed basis, selecting interfaces from this stripe-group pool.
To include the SCI interface in a stripe-group pool, set its STRIPE value to 1. To exclude an interface from the pool, specify 0. Up to four SCI network interface cards per node can be configured for stripe-group membership.
When a message is submitted for transmission over the SCI network, an MPI protocol module distributes the message over as many SCI network interfaces as are available.
Stripe-group membership is made optional so you can reserve some SCI network bandwidth for non-striped use. To do so, simply set STRIPE = 0 on the SCI network interface(s) you wish to reserve in this way.
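For example, the hypothetical entries below would place scirsm0 and scirsm1 in the stripe-group pool while reserving scirsm2 for non-striped traffic. The MTU, PROTOCOL, LATENCY, and BANDWIDTH values are simply copied from the default scirsm entry.

scirsm0   45   32768   1   rsm   20   150
scirsm1   46   32768   1   rsm   20   150
scirsm2   47   32768   0   rsm   20   150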
This column identifies the communication protocol used by the interface. The scirsm interface employs the RSM (Remote Shared Memory) protocol. The others in the default list all use TCP (Transmission Control Protocol).
If you add a network interface of a type not represented in the hpc.conf template, you will need to specify the type of protocol the new interface uses.
This is a placeholder column. Its contents are not used at this time.
This is a placeholder column. Its contents are not used at this time.
The MPIOptions section provides a set of options that control MPI communication behavior in ways that are likely to affect message-passing performance. It contains two templates with predefined option settings. These templates are shown in Example 3-5 and are discussed below.
General-Purpose, Multiuser Template - The first template in the MPIOptions section is designed for general-purpose use at times when multiple message-passing jobs will be running concurrently.
Performance Template - The second template is designed to maximize the performance of message-passing jobs when only one job is allowed to run at a time.
The first line of each template contains the phrase "Queue=xxxx." This is because the queue-based LSF workload manager runtime environment also uses this hpc.conf file.
The options in the general-purpose template are the same as the default settings for the Sun MPI library. In other words, you do not have to uncomment the general-purpose template to have its option values be in effect. This template is provided in the MPIOptions section so you can see what options are most beneficial when operating in a multiuser mode.
# Following is an example of the options that affect the runtime
# environment of the MPI library. The listings below are identical
# to the default settings of the library. The "queue=hpc" phrase
# makes it an LSF-specific entry, and only for the queue named hpc.
# These options are a good choice for a multiuser queue. To be
# recognized by CRE, the "Queue=hpc" needs to be removed.
#
# Begin MPIOptions queue=hpc
# coscheduling       avail
# pbind              avail
# spindtimeout       1000
# progressadjust     on
# spin               off
#
# shm_numpostbox     16
# shm_shortmsgsize   256
# rsm_numpostbox     15
# rsm_shortmsgsize   401
# rsm_maxstripe      2
# End MPIOptions

# The listing below is a good choice when trying to get maximum
# performance out of MPI jobs that are running in a queue that
# allows only one job to run at a time.
#
# Begin MPIOptions Queue=performance
# coscheduling       off
# spin               on
# End MPIOptions
If you want to use the performance template, do the following:
Delete the "Queue=performance" phrase from the Begin MPIOptions line.
Delete the comment character (#) from the beginning of each line of the performance template, including the Begin MPIOptions and End MPIOptions lines.
The resulting template should appear as follows:
Begin MPIOptions
coscheduling   off
spin           on
End MPIOptions
When you have finished editing hpc.conf, update the CRE database with the new information. To do this, restart the CRE master and nodal daemons as follows:
# /etc/init.d/sunhpc.cre_master start
# /etc/init.d/sunhpc.cre_node start
The nodal daemons must be restarted on all the nodes in the cluster, including the master node.
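Once the daemons have restarted, you may want to confirm that the cluster and its nodes are visible to the CRE, for example by running mpinfo. The sketch below assumes the -N option, which lists node information:

# mpinfo -N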