Sun HPC ClusterTools 3.0 Administrator's Guide: With CRE

Chapter 8 Troubleshooting

System administrators can prevent some error conditions by performing periodic maintenance. Examples of such preventive maintenance are described in "Cleaning Up Defunct CRE Jobs". "Diagnostics" describes procedures for troubleshooting various kinds of problems. "Error Conditions and Troubleshooting Tips" interprets sample error messages and offers guidelines for anticipating common problems. Finally, "Procedures for Recovery" describes a procedure for recovering the CRE database when a system failure occurs.

Cleaning Up Defunct CRE Jobs

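One beneficial preventive maintenance practice is the routine cleanup of defunct jobs. There are several types of such jobs: jobs that have exited but remain in the CRE database, jobs that have not terminated, and jobs with orphaned processes.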

Removing CRE Jobs that have Exited

When a job does not exit cleanly, all of the job's processes may reach a final state while the job object itself is not removed from the CRE database. The following are two indicators of such incompletely exited jobs:

If you see a job in one of these defunct states, perform the following steps to clear the job from the CRE database:

  1. Execute mpps -e again in case the CRE has had time to update the database (and remove the job).

  2. If the job is still running, kill it, specifying its job ID.

% mpkill jid

  3. If necessary, remove the job object from the CRE database.

    If mpps continues to report the killed job, use the -C option to mpkill to remove the job object from the CRE database. This must be done as root, from the master node.

# mpkill -C jid
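
The three steps above can be sketched as a small Bourne shell function. This is only an illustration, not part of the CRE: the function names and the SETTLE_SECS variable are hypothetical, and the check assumes that a job's ID appears as a whole word somewhere in the mpps -e listing, which may also match other columns. The sketch never runs mpkill -C itself; it only tells you when that step appears to be needed.

```shell
# Assumption: the job ID shows up as a whole word in the mpps -e listing.
# A match in another column would make this report "still listed" too often.
job_listed() {
    mpps -e | grep -w "$1" >/dev/null
}

# Hypothetical helper: walk through steps 1-3 for one job ID.
clear_defunct_job() {
    jid=$1
    # Step 1: look again, in case the CRE has already removed the job.
    job_listed "$jid" || { echo "job $jid no longer listed"; return 0; }
    # Step 2: kill the job by job ID, then give the CRE time to update.
    mpkill "$jid"
    sleep "${SETTLE_SECS:-5}"
    job_listed "$jid" || { echo "job $jid removed after mpkill"; return 0; }
    # Step 3 is left to the administrator: it needs root on the master node.
    echo "job $jid still listed; as root on the master node run: mpkill -C $jid"
    return 1
}
```

For example, clear_defunct_job 12 would kill job 12 and report whether the job object still needs to be removed by hand.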

CRE Jobs that Have Not Terminated

The second type of defunct job includes jobs that are waiting for signals from processes on nodes that have gone off line. The mpps utility displays such jobs in states such as RUNNING, EXITING, SEXTNG, or CORNG.


Note -

If the job-killing option of tm.watchd (-Yk) is enabled, the CRE will handle such situations automatically. This section assumes this option is not enabled.


Kill the job using:

% mpkill jid

There are several variants of the mpkill command, similar to the variants of the Solaris kill command. You may also use:

% mpkill -9 jid 

or

% mpkill -I jid

If these do not succeed, execute mpps -pe to display the unresponsive processes. Then, execute the Solaris ps command on each of the nodes listed. If those processes still exist on any of the nodes, you can remove them using kill -9 pid.

Once you have eliminated all defunct jobs, data about the jobs may remain in the CRE database. As root from the master node, use mpkill -C to remove this residual data.
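
The per-node ps check described above might be scripted along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, REMOTE_SH stands for whatever remote shell your site allows (rsh is only a default guess), and the node names and process IDs are the ones you read from the mpps -pe output by hand. The sketch only reports surviving processes; it does not kill anything.

```shell
# Assumption: REMOTE_SH runs a command on a named node (rsh by default).
REMOTE_SH=${REMOTE_SH:-rsh}

# Print the PID if it still exists on the node; print nothing otherwise.
# The output is tested rather than the exit status, because rsh does not
# pass the remote command's exit status back to the caller.
pid_on_node() {
    $REMOTE_SH "$1" ps -p "$2" -o pid=
}

# Hypothetical helper: report_survivors node pid [pid ...]
report_survivors() {
    node=$1
    shift
    for pid in "$@"; do
        found=`pid_on_node "$node" "$pid" 2>/dev/null`
        if [ -n "$found" ]; then
            echo "$node: pid $pid still exists (candidate for kill -9 $pid)"
        fi
    done
}
```

For example, report_survivors hpc-node3 4711 4712 lists which of the two processes are still present on hpc-node3.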

Orphaned Processes

When the tm.watchd -Yk option has been enabled, the watch daemon marks processes ORPHAN if they run on nodes that have gone off line. If the node resumes communication with the CRE daemons, the watch daemon will kill the ORPHAN processes. If not, you will have to kill the processes manually using the Solaris kill command. Otherwise, such processes will continue to consume resources and CPU cycles.

Symptoms of orphaned processes can be detected by examining error log files or stdout, if you're running from a terminal. You can also search for such errors as RPC: cannot connect, or RPC: timout. These errors will appear under user.err priority in syslog.
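
A search for these messages could look like the sketch below. The function name is hypothetical, and the default log file is an assumption: on a stock Solaris configuration, user.err messages are normally routed to /var/adm/messages, but /etc/syslog.conf on your system may send them elsewhere. The two search strings are copied verbatim from the text above.

```shell
# Hypothetical helper: print log lines containing the RPC errors named above.
# Assumption: user.err messages land in /var/adm/messages unless a different
# file is passed as the first argument.
find_rpc_errors() {
    logfile=${1:-/var/adm/messages}
    for pattern in 'RPC: cannot connect' 'RPC: timout'; do
        grep "$pattern" "$logfile"
    done
}
```

Matches are grouped by message rather than listed in time order, which makes it easy to see which kind of error dominates.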


Note -

If an mprun process becomes unresponsive on a system, even where tm.watchd -Yk has been enabled, it may be necessary to use Ctrl-c to kill mprun.


Diagnostics

The following sections describe Solaris diagnostics that may be useful in troubleshooting various types of error conditions.

Network Diagnostics

You can use /usr/sbin/ping to check whether you can connect to the network interface on another node. For example:

% ping hpc-node3

will test (over the default network) the connection to hpc-node3.

You can use /usr/sbin/spray to determine whether a node can handle significant network traffic. spray indicates the amount of dropped traffic. For example:

% spray -c 100 hpc-node3 

sends 100 small packets to hpc-node3.
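
To run the ping check against several nodes at once, a loop such as the following may be convenient. It is a minimal sketch: check_nodes and PING_CMD are hypothetical names, and the loop relies on ping returning a zero exit status when the node answers, as the Solaris ping does. On other systems, set PING_CMD to a form that sends a bounded number of packets.

```shell
# Assumption: the ping command exits with status 0 when the node is alive.
PING_CMD=${PING_CMD:-/usr/sbin/ping}

# Hypothetical helper: check_nodes node [node ...]
check_nodes() {
    for node in "$@"; do
        if $PING_CMD "$node" >/dev/null 2>&1; then
            echo "$node reachable"
        else
            echo "$node UNREACHABLE"
        fi
    done
}
```

For example, check_nodes hpc-node1 hpc-node2 hpc-node3 prints one status line per node.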

Checking Load Averages

You can use mpinfo -N or, if the CRE is not running, /usr/bin/uptime, to determine load averages. These averages can help to determine the current load on the machine and how quickly it reached that load level.

Using Interval Diagnostics

The diagnostic programs described below check the status of various parameters. Each accepts a numerical option that specifies the time interval between status checks. If the interval option is not used, the diagnostics output an average value for the respective parameter since boot time. Specify the numerical value at the end of the command to get current information.

Use /usr/bin/netstat to check local system network traffic. For example:

% netstat -ni 3 

checks and reports traffic every 3 seconds.

Use /usr/bin/iostat to display disk and system usage. For example:

% iostat -c 2 

displays percentage utilizations every 2 seconds.

Use /usr/bin/vmstat to generate additional information about the virtual memory system. For example:

% vmstat -S 5

reports on swapping activity every 5 seconds.

It can be useful to run these diagnostics periodically, monitoring their output for multiple intervals.
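
One way to keep the output of such periodic runs is to append each run to a log file under a timestamped header. The helper below is a sketch, and both its name and the log file path in the usage comments are hypothetical. It relies on vmstat and iostat accepting a count after the interval, so that each run ends by itself.

```shell
# Hypothetical helper: sample_diag logfile command [args ...]
# Appends a timestamped header and the command's output to the log file.
sample_diag() {
    logfile=$1
    shift
    {
        echo "=== `date` : $*"
        "$@"
    } >> "$logfile" 2>&1
}

# Usage (three samples each, at the intervals used in the examples above):
#   sample_diag /var/tmp/diag.log vmstat -S 5 3
#   sample_diag /var/tmp/diag.log iostat -c 2 3
```

Comparing several logged runs makes it easier to tell a sustained problem from a momentary spike.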

Error Conditions and Troubleshooting Tips

The following sections include sample error messages and their interpretations, as well as guidelines for anticipating common problems.

Error Messages

No nodes in partition satisfy RRS:

This could happen, for example, if CRE was unable to check out its licenses. This message can also indicate an error in the construction of an RRS.

Aug 27 11:02:30 ops2a tm.rdb[462]: Cond_set: unable to connect to ops2a/45126: connect: Connection refused

If messages like this can be correlated with jobs being killed, they can be safely ignored. One way to check this correlation is to look in the accounting logs for jobs that were signaled during this time.

mprun: unique partition: No such object

Query returned excess results:
a.out: (TMTL UL) TMRTE_Abort: Not yet initialized
The attempt to kill your program failed.

This might happen, for example, when mpps shows running processes that are actually no longer running.

Use the mpkill -C nn command to clear out such stale jobs.


Note -

Before removing the job's information from the database, the mpkill -C option verifies that the processes of the job are in fact no longer running.


Troubleshooting Tips

CRE RPC timeouts in user code are generally not recoverable. The job might continue to run, but processes probably won't be able to communicate with each other. There are two ways to deal with this:

/tmp/.hpcshm_mmap.jid.*

/tmp/.hpcshm_acf.jid.*

The Sun MPI shared memory protocol module uses these files for interprocess communication on the same node. These files consume swap space.
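
To see whether a given job has left such files behind on a node, and how much space they occupy, a listing like the following sketch may help. The function name is hypothetical, and the sketch assumes that the jid component of the file names shown above is the numeric job ID. It only lists files and removes nothing.

```shell
# Hypothetical helper: list_hpcshm_files jid [directory]
# Assumption: "jid" in the file names above stands for the job ID.
list_hpcshm_files() {
    jid=$1
    dir=${2:-/tmp}
    ls -l "$dir"/.hpcshm_mmap."$jid".* "$dir"/.hpcshm_acf."$jid".* 2>/dev/null
}
```

For example, list_hpcshm_files 42 shows the shared memory files that job 42 created in /tmp on the local node.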

Procedures for Recovery

Re-creating the CRE Database

The rte.master reboot and rte.node reboot commands should be used only as a last resort, if the system is not responding (for example, if programs such as mprun, mpinfo, or mpps hang). Follow these steps to reboot the CRE:

  1. Run rte.master reboot on the master node.

# /etc/init.d/rte.master reboot

  2. Run rte.node reboot on all the nodes (including the master node if it's running tm.spmd and tm.omd).

# /etc/init.d/rte.node reboot

The procedure will attempt to save the system configuration (in the same way as using the mpadmin dump command), kill all the running jobs, and restore the system configuration. Note that the Cluster Console Manager applications may be useful in executing commands on all of the nodes in the cluster simultaneously. For information about the Cluster Console Manager applications, see Appendix A "Cluster Management Tools".


Note -

rte.master reboot saves the existing rdb-log and rdb-save files in /var/hpc/rdb-log.1 and /var/hpc/rdb-save.1. rdb-log is a running log of the resource database activity and rdb-save is a snapshot of the database taken at regular intervals.
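
If the Cluster Console Manager applications are not available, the second step of the procedure could be driven from a single shell with a loop like the one sketched here. The function name and REMOTE_SH are hypothetical, and the sketch assumes that root is permitted to run remote commands on every node with the chosen remote shell (rsh is only a default guess). Run it only after rte.master reboot has completed on the master node.

```shell
# Assumption: REMOTE_SH runs a command as root on a named node.
REMOTE_SH=${REMOTE_SH:-rsh}

# Hypothetical helper: reboot_cre_nodes node [node ...]
# Runs rte.node reboot on each node in turn, announcing each one first.
reboot_cre_nodes() {
    for node in "$@"; do
        echo "running rte.node reboot on $node"
        $REMOTE_SH "$node" /etc/init.d/rte.node reboot
    done
}
```

Setting REMOTE_SH=echo before calling the function prints the commands without running them, which is a cheap way to review the node list first.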