System administrators can prevent some error-causing conditions by performing periodic maintenance. Examples of such preventive maintenance are described in "Cleaning Up Defunct CRE Jobs". "Diagnostics" describes procedures for troubleshooting various kinds of problems. "Error Conditions and Troubleshooting Tips" discusses common error conditions and ways to address them. Finally, "Procedures for Recovery" describes a procedure for recovering the CRE database when a system failure occurs.
One preventive maintenance practice that can be beneficial is the routine cleanup of defunct jobs. There are several types of such jobs:
Jobs that have exited, but still appear in mpps output
Jobs that have not terminated, but need to be removed
Jobs that have orphan processes
When a job does not exit cleanly, it is possible for all of a job's processes to have reached a final state while the job object itself is not removed from the CRE database. The following are two indicators of such incompletely exited jobs:
A process (identified by mpps) in the EXIT, SEXIT, FAIL, or CORE states.
A Prism main window that won't close or exit.
If you see a job in one of these defunct states, perform the following steps to clear the job from the CRE database:
Execute mpps -e again in case the CRE has had time to update the database (and remove the job).
If the job is still running, kill it, specifying its job ID.
% mpkill jid
If necessary, remove the job object from the CRE database.
If mpps continues to report the killed job, use the -C option to mpkill to remove the job object from the CRE database. This must be done as root, from the master node.
# mpkill -C jid
The second type of defunct job includes jobs that are waiting for signals from processes on nodes that have gone off line. The mpps utility displays such jobs in states such as RUNNING, EXITING, SEXTNG, or CORNG.
If the job-killing option of tm.watchd (-Yk) is enabled, the CRE will handle such situations automatically. This section assumes this option is not enabled.
Kill the job using:
% mpkill jid
There are several variants of the mpkill command, similar to the variants of the Solaris kill command. You may also use:
% mpkill -9 jid
or
% mpkill -I jid
If these do not succeed, execute mpps -pe to display the unresponsive processes. Then execute the Solaris ps command on each of the nodes listed. If those processes still exist on any of the nodes, you can remove them using kill -9 pid.
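For example, assuming the job left a process with process ID 2971 on node hpc-node3 (both values are illustrative), the sequence might look like this:
% mpps -pe
% rsh hpc-node3 ps -ef
% rsh hpc-node3 kill -9 2971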
Once you have eliminated all defunct jobs, data about the jobs may remain in the CRE database. As root from the master node, use mpkill -C to remove this residual data.
When the tm.watchd -Yk option has been enabled, the watch daemon marks processes as ORPHAN if they run on nodes that have gone off line. If the node resumes communication with the CRE daemons, the watch daemon kills the ORPHAN processes. If not, you will have to kill the processes manually using the Solaris kill command. Otherwise, such processes will continue to consume resources and CPU cycles.
You can detect symptoms of orphaned processes by examining error log files or, if you are running from a terminal, stdout. You can also search for errors such as RPC: cannot connect or RPC: timeout. These errors appear under the user.err priority in syslog.
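For example, assuming messages of priority user.err are being written to /var/adm/messages (see the discussion of /etc/syslog.conf later in this section), you can scan a node's log for RPC errors with:
% grep "RPC:" /var/adm/messages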
If an mprun process becomes unresponsive on a system, even where tm.watchd -Yk has been enabled, it may be necessary to use Ctrl-c to kill mprun.
The following sections describe Solaris diagnostics that may be useful in troubleshooting various types of error conditions.
You can use /usr/sbin/ping to check whether you can connect to the network interface on another node. For example:
% ping hpc-node3
will test (over the default network) the connection to hpc-node3.
You can use /usr/sbin/spray to determine whether a node can handle significant network traffic. spray indicates the amount of dropped traffic. For example:
% spray -c 100 hpc-node3
sends 100 small packets to hpc-node3.
You can use mpinfo -N or, if the CRE is not running, /usr/bin/uptime, to determine load averages. These averages can help to determine the current load on the machine and how quickly it reached that load level.
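For example, uptime reports three load averages, taken over the preceding 1, 5, and 15 minutes (the values shown here are illustrative):
% uptime
  2:45pm  up 12 day(s),  3:18,  4 users,  load average: 0.27, 0.41, 0.38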
The diagnostic programs described below check the status of various parameters. Each accepts a numerical argument, specified at the end of the command, that sets the interval in seconds between status checks. Without this interval argument, each diagnostic reports only an average value for its parameter since boot time; specify an interval to get current information.
Use /usr/bin/netstat to check local system network traffic. For example:
% netstat -ni 3
checks and reports traffic every 3 seconds.
Use /usr/bin/iostat to display disk and system usage. For example:
% iostat -c 2
displays percentage utilizations every 2 seconds.
Use /usr/bin/vmstat to generate additional information about the virtual memory system. For example:
% vmstat -S 5
reports on swapping activity every 5 seconds.
It can be useful to run these diagnostics periodically, monitoring their output for multiple intervals.
The following sections include sample error messages and their interpretations, as well as guidelines for anticipating common problems.
The following error message usually indicates that all the nodes in a CRE partition are marked down--that is, their node daemons are not running.
No nodes in partition satisfy RRS:
This could happen, for example, if CRE was unable to check out its licenses. This message can also indicate an error in the construction of an RRS.
Under certain circumstances, when a user attempts to kill a job, the CRE may log error messages of the following form on the master node:
Aug 27 11:02:30 ops2a tm.rdb[462]: Cond_set: unable to connect to ops2a/45126: connect: Connection refused
If these can be correlated to jobs being killed, these errors can be safely ignored. One way to check this correlation would be to look at the accounting logs for jobs that were signaled during this time.
An error message of the following form generally indicates that the partition specified for the job cannot be found in the CRE database--for example, because the partition has been removed:
mprun: unique partition: No such object
When there is stale job information in the CRE database, an error message of the following form may occur:
Query returned excess results: a.out: (TMTL UL) TMRTE_Abort: Not yet initialized
The attempt to kill your program failed.
This might happen, for example, when mpps shows running processes that are actually no longer running.
Use the mpkill -C jid command to clear out such stale jobs.
Before removing the job's information from the database, the mpkill -C option verifies that the processes of the job are in fact no longer running.
When running multiprocess jobs (-np equal to 0 or greater than 1) in a cluster with NFS-mounted file systems, you should take steps to limit the core dump size to zero. This can be done with the Solaris limit command. See the limit(1) man page for additional information.
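For example, in the C shell (users can also place this command in their shell startup files so it takes effect for every session):
% limit coredumpsize 0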
The CRE resource database daemon (tm.rdb) does not remove missing interfaces after a client daemon is restarted. Instead, the mpinfo -Nv command will show them marked as down.
The contents of the /var/adm/messages file are local to each node. Any daemon messages will be logged only on the node where that daemon runs. By default, CRE daemon messages are stored in /var/adm/messages along with other messages handled by syslog. Alternatively, CRE messages can be written to a file specified by the mpadmin logfile command.
Use shell I/O redirection instead of mprun -I options whenever possible. Using shell redirection will reduce the likelihood of problems involving standard I/O.
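For example, instead of using -I to direct a job's output, you might redirect it from the shell (the program and file names are illustrative):
% mprun -np 4 a.out > a.out.log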
If mprun is signaled too soon after it has been invoked, it exits without stopping the job's processes. If this happens, find the job's ID (for example, with mpps) and use mpkill -9 jid to kill the job.
The CRE does not pass supplemental group ID information to remote processes. To run a job with the permissions of a particular group, use the -G gid option with mprun; you must be a member of that group.
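For example, to run a job with the permissions of a group of which you are a member (the group ID and program name are illustrative):
% mprun -G 300 -np 2 a.out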
RPC timeouts - CRE RPC timeouts in Sun MPI code are logged to syslog, but the default syslog.conf file causes these messages to be dropped. If you want to see these errors, you should modify your /etc/syslog.conf file so that messages of the priority user.err are not dropped. Note that this does not apply to RPC timeouts occurring in the CRE daemons themselves. By default, these are logged to /var/adm/messages.
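For example, an entry of the following form in /etc/syslog.conf directs user.err messages to the system message file (the selector and action fields must be separated by tabs; the destination shown is a typical choice, not a requirement):
user.err	/var/adm/messages
After editing the file, send syslogd a HUP signal so that it rereads its configuration.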
If you have set the Cluster-level attribute logfile, all error messages generated by user code will be handled by the CRE (not syslog) and will be logged in a file specified by an argument to logfile.
CRE RPC timeouts in user code are generally not recoverable. The job might continue to run, but processes probably won't be able to communicate with each other. There are two ways to deal with this:
Enable the tm.watchd job-killing option (-Yk), which will automatically kill jobs when nodes go off line. This will catch most of these cases, since RPC timeouts usually coincide with tm.watchd marking the node as off line.
Monitor RPC errors from user codes by looking for syslog messages of priority user.err. Then use mpkill to kill the associated job manually.
If a file system is not visible on all nodes, users can encounter a permission denied message when attempting to execute programs from that file system. Watch for errors caused by non-shared file systems such as /tmp, which exists locally on each node. This can show up when a user attempts to execute a program from /tmp and the program does not exist in the /tmp file system of every node.
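For example, to check whether a program is present in /tmp on a particular node (the node and program names are illustrative):
% rsh hpc-node2 ls -l /tmp/a.out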
If the behavior of your system suggests that you've run out of swap space (after executing vmstat or df on /tmp), you may need to increase the limit on the CRE's shmem_minfree attribute.
If you execute mpkill with the -C option (this option is available only to the system administrator), you should look for and remove leftover files on the master node. The file names for large files are of the form:
/tmp/.hpcshm_mmap.jid.*
Smaller files will have file names of the form:
/tmp/.hpcshm_acf.jid.*
The Sun MPI shared memory protocol module uses these files for interprocess communication on the same node. These files consume swap space.
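For example, after clearing a job with mpkill -C, you might look for and remove its leftover files on the master node as follows (the job ID 1234 is illustrative):
# ls /tmp/.hpcshm_mmap.1234.* /tmp/.hpcshm_acf.1234.*
# rm /tmp/.hpcshm_mmap.1234.* /tmp/.hpcshm_acf.1234.*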
The rte.master reboot and rte.node reboot commands should be used only as a last resort, if the system is not responding (for example, if programs such as mprun, mpinfo, or mpps hang). Follow these steps to reboot the CRE:
Run rte.master reboot on the master node.
# /etc/init.d/rte.master reboot
Run rte.node reboot on all the nodes (including the master node if it's running tm.spmd and tm.omd).
# /etc/init.d/rte.node reboot
The procedure will attempt to save the system configuration (in the same way as using the mpadmin dump command), kill all the running jobs, and restore the system configuration. Note that the Cluster Console Manager applications may be useful in executing commands on all of the nodes in the cluster simultaneously. For information about the Cluster Console Manager applications, see Appendix A "Cluster Management Tools".
rte.master reboot saves the existing rdb-log and rdb-save files in /var/hpc/rdb-log.1 and /var/hpc/rdb-save.1. rdb-log is a running log of the resource database activity and rdb-save is a snapshot of the database taken at regular intervals.
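For example, to confirm that the previous database state was saved, you can list the saved files on the master node:
# ls /var/hpc/rdb-log.1 /var/hpc/rdb-save.1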