CHAPTER 6

hpc.conf Configuration File

This chapter discusses the Sun HPC ClusterTools software configuration file hpc.conf, which defines various attributes of a Sun HPC cluster. A single hpc.conf file is shared by all the nodes in a cluster. It resides in /opt/SUNWhpc/etc.



Note - This configuration file is also used on LSF-based clusters, but it resides in a different location.



The hpc.conf file is organized into functional sections, which are summarized below and illustrated in CODE EXAMPLE 6-1:

- ShmemResource - limits on shared memory and swap space allocation
- MPIOptions - runtime options for the Sun MPI library
- CREOptions - Sun CRE logging, core file, and authentication behavior
- HPCNodes - node information used only by LSF-based clusters
- PMODULES - names and locations of the available protocol modules
- PM - per-protocol-module interface settings

Sun HPC ClusterTools software is distributed with an hpc.conf template, which resides by default in /opt/SUNWhpc/examples/rte/hpc.conf.template. If you wish to customize the configuration settings, you should copy this template file to /opt/SUNWhpc/etc/hpc.conf and edit it.
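For example (assuming a default installation location and superuser privileges):

# cp /opt/SUNWhpc/examples/rte/hpc.conf.template /opt/SUNWhpc/etc/hpc.conf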

Each configuration section is bracketed by a Begin/End keyword pair and, when a parameter definition involves multiple fields, the fields are separated by spaces.


CODE EXAMPLE 6-1 General Organization of the hpc.conf File
# Begin ShmemResource
# ...
# End ShmemResource
 
# Begin MPIOptions Queue=
# ...
# End MPIOptions
 
# Begin CREOptions Server=
# ...
# End CREOptions
 
# Begin HPCNodes
# ...
# End HPCNodes
 
Begin PMODULES
...
End PMODULES
 
Begin PM=shm
...
End PM
 
Begin PM=tcp
... 
End PM



Note - When any changes are made to hpc.conf, the daemons running on the system should be stopped, and no jobs should be running. To ensure that it is safe to edit hpc.conf, shut down the nodal and master Sun CRE daemons as described in Stopping and Restarting Sun CRE.
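For example, using the command syntax described in Propagating hpc.conf Information (and assuming ctstopd accepts the same node-selection options as ctstartd), the daemons on nodes node1 and node2 could be stopped from a central host with:

# ./ctstopd -n node1,node2 -r rsh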




ShmemResource Section

The ShmemResource section provides the administrator with two parameters that control allocation of shared memory and swap space: MaxAllocMem and MaxAllocSwap. This special memory allocation control is needed because some Sun HPC ClusterTools software components use shared memory.

CODE EXAMPLE 6-2 shows the ShmemResource template in the hpc.conf file shipped with Sun HPC ClusterTools software.


CODE EXAMPLE 6-2 ShmemResource Section Example
#Begin ShmemResource
#MaxAllocMem  0x7fffffffffffffff
#MaxAllocSwap 0x7fffffffffffffff
#End ShmemResource

To set MaxAllocMem and/or MaxAllocSwap limits, remove the comment character (#) from the start of each line and replace the current value, 0x7fffffffffffffff, with the desired limit.
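For example, a minimal sketch of an edited section (the limit values are illustrative only, not recommendations):

Begin ShmemResource
MaxAllocMem  0x40000000
MaxAllocSwap 0x80000000
End ShmemResource

With these settings, each job on a node could allocate at most 1 Gbyte of shared memory and 2 Gbytes of swap.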

Guidelines for Setting Limits

The Sun HPC ClusterTools software internal shared memory allocator permits an application to use an amount of swap space that is the smaller of:

- The value specified by the MaxAllocSwap parameter
- 90% of the node's available swap

If MaxAllocSwap is not specified, or if zero or a negative value is specified, 90% of a node's available swap is used as the swap limit.

The MaxAllocMem parameter can be used to limit the amount of shared memory that can be allocated. If a smaller shared memory limit is not specified, the shared memory limit is 90% of available physical memory.

Sun HPC ClusterTools software components such as the Sun MPI library, which performs on-node message passing through the shm protocol module, use shared memory.



Note - Shared memory and swap space limits are applied per-job on each node.



If you have set up your system for dedicated use (only one job at a time is allowed), you should leave MaxAllocMem and MaxAllocSwap undefined. This allows jobs to maximize use of swap space and physical memory.

If, however, multiple jobs will share a system, you may want to set MaxAllocMem to some level below 50% of total physical memory. This reduces the risk of having a single application lock up physical memory. How much below 50% you choose to set it depends on how many jobs you expect to be competing for physical memory at any given time.
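For example, a hedged sketch for a node with 8 Gbytes of physical memory (the figure is illustrative; choose a value based on your expected job mix): 40% of 8 Gbytes is about 3.2 Gbytes, so a round limit of 0xC0000000 (3 Gbytes, or 37.5% of physical memory) keeps any one job well under the 50% guideline.

Begin ShmemResource
MaxAllocMem  0xC0000000
End ShmemResource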



Note - When users make direct calls to mmap(2) or shmget(2), they are not limited by the MaxAllocMem and MaxAllocSwap variables. These system calls manipulate shared memory independently of the MaxAllocMem and MaxAllocSwap values.




MPIOptions Section

The MPIOptions section provides a set of options that control MPI communication behavior in ways that are likely to affect message-passing performance. It contains a template showing some general-purpose option settings, plus an example of alternative settings for maximizing performance. These examples are shown in CODE EXAMPLE 6-3.



Note - The first line of the template contains the phrase "Queue=hpc." This line indicates a queue in the LSF batch runtime environment, which uses the same hpc.conf file as Sun CRE. For LSF, the settings apply only to the specified queue. For Sun CRE, the settings apply across the cluster.



The options in the general-purpose template are the same as the default settings for the Sun MPI library. In other words, you do not have to uncomment the general-purpose template for its option values to take effect. The template is provided in the MPIOptions section so that you can see which options are most beneficial when operating in multiuser mode.

If you want to use the performance settings, do the following:

1. Remove the comment character (#) from the beginning of each line of the performance template.
2. Remove the phrase Queue=performance from the Begin MPIOptions line, so that Sun CRE recognizes the section.

The resulting template should appear as follows:

Begin MPIOptions
coscheduling   off
spin           on
End MPIOptions

CODE EXAMPLE 6-3 MPIOptions Section Example
# The following is an example of the options that affect the run time
# environment of the MPI library.  The listings below are identical to
# the default settings of the library.  The "Queue=hpc" phrase makes
# this an LSF-specific entry, and only for the Queue named hpc.  These
# options are a good choice for a multiuser Queue.  To be recognized
# by CRE, the "Queue=hpc" needs to be removed.
#
# Begin MPIOptions Queue=hpc
# coscheduling        avail
# pbind               avail
# spindtimeout        1000
# progressadjust      on
# spin                off
#
# shm_numpostbox      16
# shm_shortmsgsize    256
# maxprocs_limit      2147483647
# maxprocs_default    4096
#
# End MPIOptions
 
# The listing below is a good choice when trying to get maximum
# performance out of MPI jobs that are running in a Queue that
# allows only one job to run at a time.
#
# Begin MPIOptions Queue=performance
# coscheduling             off
# spin                      on
# End MPIOptions

TABLE 6-1 provides brief descriptions of the MPI runtime options that can be set in hpc.conf. Each description identifies the default value and describes the effect of each legal value.

Some MPI options not only control a parameter directly; they can also be set to a value that passes control of the parameter to an environment variable. Where an MPI option has an associated environment variable, TABLE 6-1 names the environment variable.


TABLE 6-1 MPI Runtime Options

coscheduling (default: avail)
  avail    Allows spind use to be controlled by the environment variable
           MPI_COSCHED. If MPI_COSCHED=0 or is not set, spind is not used.
           If MPI_COSCHED=1, spind must be used.
  on       Enables coscheduling; spind is used. This value overrides
           MPI_COSCHED=0.
  off      Disables coscheduling; spind is not used. This value overrides
           MPI_COSCHED=1.

pbind (default: avail)
  avail    Allows the processor binding state to be controlled by the
           environment variable MPI_PROCBIND. If MPI_PROCBIND=0 or is not
           set, no processes are bound to a processor. If MPI_PROCBIND=1,
           all processes on a node are bound to a processor.
  on       All processes are bound to processors. This value overrides
           MPI_PROCBIND=0.
  off      No processes on a node are bound to a processor. This value
           overrides MPI_PROCBIND=1.

spindtimeout (default: 1000)
  1000     When polling for messages, a process waits 1000 milliseconds for
           spind to return. This equals the default value of the environment
           variable MPI_SPINDTIMEOUT.
  integer  To change the default timeout, enter an integer value specifying
           the number of milliseconds the timeout should be.

progressadjust (default: on)
  on       Allows the user to set the environment variable MPI_SPIN.
  off      Disables the user's ability to set the environment variable
           MPI_SPIN.

shm_numpostbox (default: 16)
  16       Sets to 16 the number of postbox entries that are dedicated to a
           connection endpoint. This equals the default value of the
           environment variable MPI_SHM_NUMPOSTBOX. (See the Sun HPC
           ClusterTools Software Performance Guide for details.)
  integer  To change the number of dedicated postbox entries, enter an
           integer value specifying the desired number.

shm_shortmsgsize (default: 256)
  256      Sets to 256 the maximum number of bytes a short message can
           contain. This equals the default value of the environment
           variable MPI_SHM_SHORTMSGSIZE.
  integer  To change the maximum-size definition of a short message, enter
           an integer specifying the maximum number of bytes it can contain.

maxprocs_default (default: 4096)
  4096     Sets to 4096 the number of processes an MPI process may be
           connected to at any one time, including processes in the same MPI
           job and processes in jobs that are currently connected to the MPI
           process. This equals the default value of the environment
           variable MPI_MAXPROCS.
  integer  To change the maximum number of processes an MPI process may be
           connected to at any one time, enter an integer specifying the
           desired limit. The value may not exceed the setting for the
           option maxprocs_limit.

maxprocs_limit (default: 2147483647)
  integer  The maximum process table size a user may set MPI_MAXPROCS to.
           If the option maxprocs_default is not set, the user can specify
           a value up to MAX_INT.

spin (default: off)
  off      Sets the MPI library spin policy to spin nonaggressively. This
           equals the default value of the environment variable MPI_SPIN.
  on       Sets the MPI library to spin aggressively.


Setting MPI Spin Policy

An MPI process often has to wait for a particular event, such as the arrival of data from another process. If the process checks (spins) for this event continuously, it consumes CPU resources that may be deployed more productively for other purposes.

The administrator can direct that the MPI process instead register events associated with shared memory message passing with the spin daemon spind, which can spin on behalf of multiple MPI processes (coscheduling). This frees up multiple CPUs for useful computation. The spind daemon itself runs at a lower priority and backs off its activities with time if no progress is detected.

The SUNWcre package provides the spind daemon, which is not directly user callable.

The cluster administrator can control spin policy in the hpc.conf file. The attribute coscheduling, in the MPIOptions section, can be set to avail, on, or off.

The cluster administrator can also change the setting of the attribute spindtimeout, indicating how long a process waits for spind to return. The default is 1000 milliseconds.

For tips on determining spin policy, see the man page for MPI_COSCHED. In general, the administrator may wish to force use of spind for heavily used development partitions where performance is not a priority. On other partitions, the policy could be set to avail, and users can set MPI_COSCHED=0 for runs where performance is needed.
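For example, on a partition where coscheduling is set to avail, a user who wants maximum performance for a particular run could disable spind use from a C shell (the process count and executable name here are placeholders):

% setenv MPI_COSCHED 0
% mprun -np 16 a.out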


CREOptions Section

The CREOptions section controls the behavior of Sun CRE in logging system events, handling daemon core files, and authenticating users and programs.

The template hpc.conf file contains the default settings for these behaviors for the current cluster. These settings are shown in CODE EXAMPLE 6-4.


CODE EXAMPLE 6-4 CREOptions Section Example
Begin CREOptions
enable_core       off
corefile_name     core
syslog_facility   daemon
auth_opt          sunhpc_rhosts
max_pub_names     256
default_rm        cre
allow_mprun       *
End CREOptions

Specifying the Cluster

The cluster is specified by appending the tag Server=master-node-name to the Begin CREOptions line:

Begin CREOptions Server=master-node-name

If the node name supplied does not match the name of the current master node, then this section is ignored.

It is possible to have two CREOptions sections. The section without a tag is always processed first. Then the section with a matching master-node-name adds to or overrides the previous settings.
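For example, in the following sketch (node0 is a hypothetical master node name), daemon core files remain disabled everywhere except on the cluster whose master node is node0:

Begin CREOptions
enable_core     off
End CREOptions

Begin CREOptions Server=node0
enable_core     on
End CREOptions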

Logging System Events

By default, Sun CRE uses the syslog facility to log system events, as indicated by the entry syslog_facility daemon. Other possible values are user, local0, local1, ..., local7. See the man pages for syslog(3C), syslogd(1M), and syslog.conf(4) for information on the syslog facility.
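For example, to direct Sun CRE log messages to a dedicated facility (local0 is an arbitrary choice here; a corresponding entry is needed in syslog.conf to route that facility to a log file):

syslog_facility   local0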



Note - In rare cases, the Sun CRE daemons may log errors to the default system log. This occurs when an error is generated before the system has read the value of syslog_facility in hpc.conf.



Enabling Core Files

By default, core files are disabled for Sun CRE daemons. The administrator may enable core files by changing enable_core off to enable_core on. The administrator may also specify where daemon core files are saved by supplying a value for corefile_name. See coreadm(1M) for the possible naming patterns. For example:

corefile_name /var/hpc/core.%n.%f.%p

This would cause any core files to be placed in /var/hpc with the name core modified by node name (%n), executable file name (%f), and process ID (%p). (Note that only daemon core files are affected by the CREOptions section; user programs are not affected.)

Enabling Authentication

Authentication may be enabled by changing the auth_opt value from sunhpc_rhosts (the default) to rhosts, des, or krb5. The latter two values indicate DES and Kerberos Version 5, respectively. See Authentication and Security for the additional steps needed to establish the chosen authentication method.

Changing the Maximum Number of Published Names

By default, a single job may publish a maximum of 256 names. To increase or reduce that number, change this line:

max_pub_names maximum-number-of-names
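For example, to double the default limit (the value is illustrative):

max_pub_names 512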

Identifying A Default Resource Manager

If you have only one resource manager installed, you can save users the trouble of entering the -x resource-manager option each time they use the mprun command by specifying a default. Enter this line:

default_rm resource-manager-name
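For example, with the entry default_rm lsf in effect (assuming LSF is the resource manager in use), users could enter

% mprun -np 4 a.out

instead of

% mprun -x lsf -np 4 a.out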

Limiting mprun's Ability to Launch Programs in Batch Mode

If you need to restrict mprun's ability to launch programs while running in batch mode, use the allow_mprun field. By default, the field is set to:

allow_mprun *

The asterisk indicates that no restrictions have been placed on mprun. The asterisk is equivalent to having no entry in the CREOptions section of hpc.conf.

If you need to set restrictions, you must:

- Configure the allow_mprun entry in the CREOptions section of the hpc.conf file.
- Configure the sunhpc.allow file.

Instructions are provided in To Configure the hpc.conf File and To Configure the sunhpc.allow File.


HPCNodes Section

This section is used only in a cluster that is using LSF as its workload manager, not Sun CRE. Sun CRE ignores the HPCNodes section of the hpc.conf file.


PMODULES Section

The PMODULES section provides the names and locations of the protocol modules (PMs) that the run-time system is to discover and make available for communication in the cluster.

When a Sun CRE-based cluster is being started, an instance of the daemon tm.omd is started on each node. This daemon is responsible for discovering various information about a node, including which PMs are available for that node. The tm.omd daemon looks in the hpc.conf file for a list of PMs that may be available. It then opens each PM and calls an interface discovery function to find out whether the PM has interfaces that are up and running. This information is returned to tm.omd and stored in the cluster database.

The PMODULES section of the hpc.conf file lists each PM by name and gives the location (or default location) where it may be found. The template hpc.conf looks like this:


# PMODULE LIBRARY
Begin PMODULES
shm      ()
tcp      ()
End PMODULES

Two PMs are shipped with Sun HPC ClusterTools software and included in the template hpc.conf file. These are:

- shm - the shared memory protocol module
- tcp - the TCP/IP protocol module

The template hpc.conf file specifies the location of both PMs as (), which indicates the default location. The default location is /opt/SUNWhpc/lib for 32-bit libraries, and /opt/SUNWhpc/lib/sparcv9 or /opt/SUNWhpc/lib/amd64 for 64-bit libraries.

The administrator has the option of putting PM libraries in a location other than the default. This is useful, for instance, when a new user-defined PM is being developed. For PMs located in a directory other than the default, the administrator must put the absolute pathname in the hpc.conf file. For example:


# PMODULE LIBRARY
Begin PMODULES
tcp      ()
shm      /home/jbuffett/libs
End PMODULES

In this example, the tcp libraries are located in the default location and the shm libraries are located in /home/jbuffett/libs. In a 64-bit SPARC environment, sparcv9 is automatically appended to the pathname. Thus, this hpc.conf entry indicates that the 64-bit shm PM libraries would be found in /home/jbuffett/libs/sparcv9.


PM Section

The hpc.conf file contains a PM section for each available protocol module. The section gives standard information (name of interface and its preference ranking) for the PM, along with additional information for some types of PMs.

The name of the PM being described appears on the same line as the keyword PM, with an equal sign and no spaces between them. This example shows the PM section provided for the shm PM.


# SHM settings
# NAME  RANK
Begin PM=shm
shm     5
End PM

The NAME and RANK columns must be filled in for all PMs. The shm PM requires only these two standard items of information.

NAME Column

The name of the interface indicates the controller type and, optionally, a numbered interface instance. Interface names not ending with a number are wildcards; they specify default settings for all interfaces of that type. The name can be between 1 and 32 characters in length.

If interfaces are specified by name after a wildcard entry, the named entries take precedence.

RANK Column

The rank of an interface is the order in which that interface is preferred over other interfaces, with the lowest-ranked interface the most preferred. That is, if an interface with a rank of 0 is available when a communication operation begins, it will be selected for the operation before interfaces with ranks of 1 or greater. Likewise, an available rank 1 interface will be used before interfaces with a rank of 2 or greater.



Note - Because hpc.conf is a shared, cluster-wide configuration file, the rank specified for a given interface will apply to all nodes in the cluster.



Network ranking decisions are usually influenced by site-specific conditions and requirements. Although interfaces connected to the fastest network in a cluster are often given preferential ranking, raw network bandwidth is only one consideration. For example, an administrator might decide to dedicate one network that offers very low latency, but not the fastest bandwidth, to all communication within a cluster and use a higher-capacity network for connecting the cluster to other systems.

Rank can also be specified for interface instances within a PM section. For example, consider a customized hpc.conf entry like this:


# SHM Settings
# NAME     RANK  AVAIL
Begin PM=shm
wshm       15    1
wshm0      10    1
wshm1      20    1
wshm2      30    1
wshm3      40    1
End PM

If controllers wshm0 and wshm1 could be used to establish connections to the same process, wshm0 would always be chosen, since it has the lower ranking number.

TCP-IP PM Section

The PM section provided for the tcp PM in the template hpc.conf file contains the standard NAME and RANK columns, along with several placeholder columns that are not used at this time. The default TCP settings (and placeholders) are shown in CODE EXAMPLE 6-5.


CODE EXAMPLE 6-5 PM=tcp Section Example
# TCP settings
# NAME   RANK   MTU     STRIPE  LATENCY  BANDWIDTH
Begin PM=tcp
midn     0      16384   0       20       150
idn      10     16384   0       20       150
mscid    30     32768   0       20       150
scid     40     32768   0       20       150
ce       43     8192    0       20       150
ge       45     4096    0       20       150
bge      47     4096    0       20       150
mba      50     4096    0       20       150
ba       60     8192    0       20       150
mfa      70     8192    0       20       150
fa       80     8192    0       20       150
macip    90     8192    0       20       150
acip     100    8192    0       20       150
manfc    110    16384   0       20       150
anfc     120    16384   0       20       150
mbf      130    4096    0       20       150
bf       140    4096    0       20       150
mbe      150    4096    0       20       150
be       160    4096    0       20       150
mqfe     163    4096    0       20       150
qfe      167    4096    0       20       150
mhme     170    4096    0       20       150
hme      180    4096    0       20       150
meri     170    4096    0       20       150
eri      183    4096    0       20       150
mle      187    4096    0       20       150
le       200    4096    0       20       150
msmc     210    4096    0       20       150
smc      220    4096    0       20       150
lo       230    4096    0       20       150
End PM

The template hpc.conf file identifies the network interfaces that are included in the TCP PM section.

The bge interface is used on x64-based systems.



Note - Inclusion of any network interface in this file does not imply that Sun Microsystems supports, or intends to support, that network.




Propagating hpc.conf Information

Whenever hpc.conf is changed, the Sun CRE database must be updated with the new information. After all required changes to hpc.conf have been made, restart the Sun CRE daemons on all cluster nodes. For example, to start the daemons on cluster nodes node1 and node2 from a central host, enter


# ./ctstartd -n node1,node2 -r connection_method

where connection_method is rsh, ssh, or telnet. Or, you can specify a nodelist file instead of listing the nodes on a command line.


# ./ctstartd -N /tmp/nodelist -r connection_method

where /tmp/nodelist is the absolute path to a file containing the names of the cluster nodes, with each name on a separate line.
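For example, a /tmp/nodelist file for the two-node cluster above would contain:

node1
node2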