CHAPTER 7

Maintenance and Troubleshooting

This chapter describes some procedures you can use for preventive maintenance and troubleshooting. The topics covered are:

- Cleaning Up Defunct Sun CRE Jobs
- Using Diagnostics
- Interpreting Sun CRE Error Messages
- Anticipating Common Problems
- Understanding Protocol-Related Errors
- Recovering From System Failure


Cleaning Up Defunct Sun CRE Jobs

One preventive maintenance practice that can be beneficial is the routine cleanup of defunct jobs. There are several types of such jobs, described in the following subsections: jobs that have exited but remain in the Sun CRE database, jobs that have not terminated, and orphaned processes.

Removing Sun CRE Jobs That Have Exited

When a job does not exit cleanly, it is possible for all of a job's processes to have reached a final state without the job object itself being removed from the Sun CRE database. One indicator of such an incompletely exited job is a process (identified by mpps) in the EXIT, SEXIT, FAIL, or CORE state.

If you see a job in one of these defunct states, perform the following steps to clear the job from the Sun CRE database:

1. Execute mpps -e again in case Sun CRE has had time to update the database (and remove the job).

2. If the job is still running, kill it, specifying its job ID.

% mpkill jid

If mpps continues to report the killed job, use the -C option to mpkill to remove the job object from the Sun CRE database. This must be done as superuser from the master node.

# mpkill -C jid

Removing Sun CRE Jobs That Have Not Terminated

The second type of defunct job includes jobs that are waiting for signals from processes on nodes that have gone off line. The mpps utility displays such jobs in states such as RUNNING, EXITING, SEXTNG, or CORNG.



Note - If the job-killing option of tm.watchd (-Yk) is enabled, Sun CRE handles such situations automatically. This section assumes this option is not enabled.



Kill the job using:

% mpkill jid

There are several variants of the mpkill command, similar to the variants of the Solaris kill command. You may also use:

% mpkill -9 jid 

or

% mpkill -I jid

If these do not succeed, execute mpps -pe to display the unresponsive processes. Then, execute the Solaris ps command on each of the nodes listed. If those processes still exist on any of the nodes, you can remove them using kill -9 pid.
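For example, suppose mpps -pe reports a leftover process of the program a.out with process ID 12345 on node hpc-node2 (all three names are placeholders for whatever your output actually shows). You would then log in to hpc-node2 and run:

% ps -ef | grep a.out

# kill -9 12345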

Once you have eliminated defunct jobs, data about the jobs may remain in the Sun CRE database. As superuser from the master node, use mpkill -C to remove this residual data.

Killing Orphaned Processes

When the tm.watchd -Yk option has been enabled, the watch daemon marks processes ORPHAN if they run on nodes that have gone off line. If the node resumes communication with the Sun CRE daemons, the watch daemon will kill the ORPHAN processes. If not, you will have to kill the processes manually using the Solaris kill command. Otherwise, such processes will continue to consume resources.

Symptoms of orphaned processes can be detected by examining error log files or, if you are running from a terminal, stdout. You can also search for errors such as RPC: cannot connect or RPC: timeout. These errors appear under the user.err priority in syslog.
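For example, on a system whose /etc/syslog.conf routes user.err messages to the default /var/adm/messages file (an assumption; check your syslog configuration), you can search for these errors with:

% grep "RPC:" /var/adm/messages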



Note - If an mprun process becomes unresponsive on a system, even where tm.watchd -Yk has been enabled, it may be necessary to use Ctrl-c to kill mprun.




Using Diagnostics

The following sections describe Solaris diagnostics that may be useful in troubleshooting various types of error conditions.

Using Network Diagnostics

You can use /usr/sbin/ping to check whether you can connect to the network interface on another node. For example:

% ping hpc-node3

tests (over the default network) the connection to hpc-node3.
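Solaris ping also accepts a -s option, which sends one packet per second and reports each round-trip time; this makes intermittent connectivity problems easier to spot:

% ping -s hpc-node3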

You can use /usr/sbin/spray to determine whether a node can handle significant network traffic. spray indicates the amount of dropped traffic. For example:

% spray -c 100 hpc-node3 

sends 100 small packets to hpc-node3.

Checking Load Averages

You can use mpinfo -N or, if Sun CRE is not running, /usr/bin/uptime, to determine load averages. These averages can help to determine the current load on the machine and how quickly it reached that load level.
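For example:

% mpinfo -N

displays node information that includes per-node load averages, while

% uptime

reports the load averages of the local node only.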

Using Interval Diagnostics

The diagnostic programs described below check the status of various parameters. Each accepts a numerical option that specifies the time interval between status checks. If the interval option is not used, the diagnostics report an average value for the respective parameter since boot time; specify a numerical interval at the end of the command to get current information.

Use /usr/bin/netstat to check local system network traffic. For example:

% netstat -ni 3 

checks and reports traffic every three seconds.

Use /usr/bin/iostat to display disk and system usage. For example:

% iostat -c 2 

displays percentage utilizations every two seconds.

Use /usr/bin/vmstat to generate additional information about the virtual memory system. For example:

% vmstat -S 5

reports on swapping activity every five seconds.

It can be useful to run these diagnostics periodically, monitoring their output for multiple intervals.


Interpreting Sun CRE Error Messages

This section presents sample error messages and their interpretations.

No nodes in partition satisfy RRS:

This message indicates that no node in the partition can satisfy the resource requirement specifier (RRS) supplied to mprun. Check the RRS against the resources actually available in the partition.

Aug 27 11:02:30 ops2a tm.rdb[462]: Cond_set: unable to connect to ops2a/45126: connect: Connection refused

If these errors can be correlated with jobs being killed, they can be safely ignored. One way to check this correlation is to look in the accounting logs for jobs that were signaled during this time.

mprun: unique partition: No such object

Query returned excess results:
a.out: (TMTL UL) TMRTE_Abort: Not yet initialized
The attempt to kill your program failed

This might happen, for example, when mpps shows processes as running that are, in fact, no longer running.

Use the mpkill -C jid command to clear out such stale jobs.



Note - Before removing the job's information from the database, the mpkill -C option verifies that the processes of the job are in fact no longer running.



nodename tm.rdb[7896]: [ID 451353 daemon.error] unable to register (TMRTE_RDB_PROG, RDBVERS, tcp)

where nodename stands for the name of the system (the master node). This condition can generate errors from the mpinfo command as well. The error message is generated when the rpcbind daemon has stopped running.

To diagnose the problem, perform these steps:

i. Check whether rpcbind is running. Issue one of the following two commands:

% svcs bind

% ps -elf | grep rpcbind

ii. If rpcbind is running, issue the rpcinfo command. If rpcinfo is unable to contact rpcbind, then restart rpcbind.

iii. If rpcbind is not running, restart rpcbind.
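To illustrate steps ii and iii: issuing

% rpcinfo -p

lists the RPC services registered with rpcbind on the local host; if rpcinfo cannot contact rpcbind, it fails with a portmapper error. On Solaris releases that manage rpcbind through the Service Management Facility (as the svcs command in step i implies), one way to restart it is:

# svcadm restart network/rpc/bind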

After rpcbind has been restarted, the tm.rdb daemon starts normally.


Anticipating Common Problems

This section presents some guidelines for preventing and troubleshooting common problems.



Note - If you have set the Cluster-level attribute logfile, all error messages generated by user code will be handled by Sun CRE (not syslog) and will be logged in a file specified by an argument to logfile.



Sun CRE RPC timeouts in user code are generally not recoverable. The job might continue to run, but its processes probably will not be able to communicate with each other.

Another common problem is the accumulation of files that defunct jobs leave behind in /tmp. Larger files have file names of the form:

/tmp/.hpcshm_mmap.jid.*

Smaller files will have file names of the form:

/tmp/.hpcshm_acf.jid.*

The Sun MPI shared memory protocol module uses these files for interprocess communication on the same node. These files consume swap space.
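If such files remain after their job has exited (verify with mpps that the job is gone), you can remove them manually to reclaim the swap space. For example, for a defunct job whose job ID was 273 (a placeholder value):

# rm /tmp/.hpcshm_mmap.273.* /tmp/.hpcshm_acf.273.*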


Understanding Protocol-Related Errors

Errors may occur at cluster startup or at program initialization because of problems finding or loading protocol modules. Such errors are not fatal to the runtime environment (that is, to the Sun CRE daemons), but they do mean that the protocol in question is not available for communication on the cluster.

This section describes some error conditions that may occur in relation to protocol modules (PMs).

Errors When Sun CRE Daemons Load Protocol Modules

The errors below are generated when the Sun CRE daemons first start up. These errors can occur because of problems in the hpc.conf file, or because of problems loading the PMs.

All these errors are considered nonfatal to the daemon, but the PM that causes the error will not be usable. The errors below cause the Sun CRE daemons to generate calls to syslog that result in self-explanatory error messages.

The daemons cause a warning to be generated when there are duplicate PM entries in the PMODULES section of hpc.conf. If there are multiple PM entries with the same name, then only the first one is loaded.
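As an illustration only (the layout of hpc.conf varies with the ClusterTools release, and PMODULES entries may carry additional fields; consult the hpc.conf template shipped with your installation), a PMODULES section that names the tcp module twice might look like this:

Begin PMODULES
shm
tcp
tcp
End PMODULES

In this sketch, only the first tcp entry would be loaded; the duplicate would trigger the warning described above.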

Errors When Protocol Modules Discover Interfaces

The errors below are generated at program startup when a PM attempts interface discovery.

These errors are nonfatal to the Sun CRE daemons, but they may mean that the PM causing the error will not be usable. Appropriate error strings are logged via syslog.

-WARNING- Problem detected initializing tcp PM: PM=tcp entry hme is missing tokens

-WARNING- Problem detected initializing tcp PM: Interface hme0 has missing or broken entry in hpc.conf. Will use: Rank=1000,stripe=0, mtu=1500 latency=-1, bandwidth=-1

-WARNING- Problem detected initializing tcp PM: PM=tcp entry hme has extra tokens


Recovering From System Failure

Recovering from system failure involves rebooting Sun CRE and recreating the Sun CRE resource database.

The sunhpc.cre_master reboot and sunhpc.cre_node reboot commands should be used only as a last resort, if the system is not responding (for example, if programs such as mprun, mpinfo, or mpps hang).


To Reboot Sun CRE:

1. Run sunhpc.cre_master reboot on the master node:

# /etc/init.d/sunhpc.cre_master reboot

2. Run sunhpc.cre_node reboot on all the nodes (including the master node if it is running tm.spmd and tm.omd):

# /etc/init.d/sunhpc.cre_node reboot

The procedure attempts to save the system configuration (in the same way as using the mpadmin dump command), kill all the running jobs, and restore the system configuration. Note that the Cluster Console Manager applications may be useful in executing commands on all the nodes in the cluster simultaneously. For information about the Cluster Console Manager applications, see Appendix A.
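If the Cluster Console Manager is not available, step 2 can also be scripted. The following is a minimal sketch, run from the master node, assuming the placeholder node names shown and that root use of rsh between cluster nodes is permitted in your configuration:

# for n in hpc-node1 hpc-node2 hpc-node3; do rsh $n /etc/init.d/sunhpc.cre_node reboot; done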



Note - sunhpc.cre_master reboot saves the existing rdb-log and rdb-save files in /var/hpc/rdb-log.1 and /var/hpc/rdb-save.1. The rdb-log file is a running log of the resource database activity and rdb-save is a snapshot of the database taken at regular intervals.



To recover Sun CRE after a partial failure (that is, when some but not all daemons have failed), you can clean up bad database entries without losing the configuration information. For example, run the following commands on the master node to clear out the dynamic data while preserving the configuration:

# /opt/SUNWhpc/sbin/ctstartd -l
# /etc/init.d/sunhpc.cre_master reboot
 

The ctstartd command is necessary in case some of the daemons are not running. The -l option causes the command to run on the local system; in this case, the master node.