CHAPTER 7

Maintenance and Troubleshooting

This chapter describes some procedures you can use for preventive maintenance and troubleshooting. The topics covered are:

- Cleaning Up Defunct Sun CRE Jobs
- Using Diagnostics
- Interpreting Sun CRE Error Messages
- Anticipating Common Problems
- Understanding Protocol-Related Errors
- Recovering From System Failure


Cleaning Up Defunct Sun CRE Jobs

One preventive maintenance practice that can be beneficial is the routine cleanup of defunct jobs. There are several types of such jobs, described in the following subsections: jobs that have exited but remain in the Sun CRE database, jobs that have not terminated, and orphaned processes.

Removing Sun CRE Jobs That Have Exited

When a job does not exit cleanly, it is possible for all of a job's processes to have reached a final state without the job object itself being removed from the Sun CRE database. One indicator of such an incompletely exited job is a process (identified by mpps) in the EXIT, SEXIT, FAIL, or CORE state.

If you see a job in one of these defunct states, perform the following steps to clear the job from the Sun CRE database:

1. Execute mpps -e again in case Sun CRE has had time to update the database (and remove the job).

2. If the job is still running, kill it, specifying its job ID.

% mpkill jid

If mpps continues to report the killed job, use the -C option to mpkill to remove the job object from the Sun CRE database. This must be done as superuser from the master node.

# mpkill -C jid

Removing Sun CRE Jobs That Have Not Terminated

The second type of defunct job includes jobs that are waiting for signals from processes on nodes that have gone off line. The mpps utility displays such jobs in states such as RUNNING, EXITING, SEXTNG, or CORNG.



Note - If the job-killing option of tm.watchd (-Yk) is enabled, Sun CRE handles such situations automatically. This section assumes this option is not enabled.



Kill the job using:

% mpkill jid

There are several variants of the mpkill command, similar to the variants of the Solaris kill command. You may also use:

% mpkill -9 jid 

or

% mpkill -I jid

If these do not succeed, execute mpps -pe to display the unresponsive processes. Then, execute the Solaris ps command on each of the nodes listed. If those processes still exist on any of the nodes, you can remove them using kill -9 pid.
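For example, suppose mpps -pe reports a leftover process of the program a.out with process ID 12345 on node hpc-node2 (all three names are placeholders for whatever your output actually shows). You would then log in to hpc-node2 and run:

% ps -ef | grep a.out

# kill -9 12345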

Once you have eliminated defunct jobs, data about the jobs may remain in the Sun CRE database. As superuser from the master node, use mpkill -C to remove this residual data.

Killing Orphaned Processes

When the tm.watchd -Yk option has been enabled, the watch daemon marks processes ORPHAN if they run on nodes that have gone off line. If the node resumes communication with the Sun CRE daemons, the watch daemon will kill the ORPHAN processes. If not, you will have to kill the processes manually using the Solaris kill command. Otherwise, such processes will continue to consume resources.

Symptoms of orphaned processes can be detected by examining error log files or, if you are running from a terminal, stdout. You can also search for errors such as RPC: cannot connect or RPC: timeout. These errors appear under the user.err priority in syslog.
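For example, on a system whose /etc/syslog.conf routes user.err messages to the default /var/adm/messages file (an assumption; check your syslog configuration), you can search for these errors with:

% grep "RPC:" /var/adm/messages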



Note - If an mprun process becomes unresponsive on a system, even where tm.watchd -Yk has been enabled, it may be necessary to use Ctrl-c to kill mprun.




Using Diagnostics

The following sections describe Solaris diagnostics that may be useful in troubleshooting various types of error conditions.

Using Network Diagnostics

You can use /usr/sbin/ping to check whether you can connect to the network interface on another node. For example:

% ping hpc-node3

tests (over the default network) the connection to hpc-node3.
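Solaris ping also accepts a -s option, which sends one packet per second and reports each round-trip time; this makes intermittent connectivity problems easier to spot:

% ping -s hpc-node3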

You can use /usr/sbin/spray to determine whether a node can handle significant network traffic. spray indicates the amount of dropped traffic. For example:

% spray -c 100 hpc-node3 

sends 100 small packets to hpc-node3.

Checking Load Averages

You can use mpinfo -N or, if Sun CRE is not running, /usr/bin/uptime, to determine load averages. These averages can help to determine the current load on the machine and how quickly it reached that load level.
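For example:

% mpinfo -N

displays node information that includes per-node load averages, while

% uptime

reports the load averages of the local node only.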

Using Interval Diagnostics

The diagnostic programs described below check the status of various parameters. Each accepts a numerical option that specifies the time interval between status checks. If the interval option is not used, the diagnostics report an average value for the respective parameter since boot time; specify a numerical interval at the end of the command to get current information.

Use /usr/bin/netstat to check local system network traffic. For example:

% netstat -ni 3 

checks and reports traffic every three seconds.

Use /usr/bin/iostat to display disk and system usage. For example:

% iostat -c 2 

displays percentage utilizations every two seconds.

Use /usr/bin/vmstat to generate additional information about the virtual memory system. For example:

% vmstat -S 5

reports on swapping activity every five seconds.

It can be useful to run these diagnostics periodically, monitoring their output for multiple intervals.


Interpreting Sun CRE Error Messages

This section presents sample error messages and their interpretations.

No nodes in partition satisfy RRS:

This message indicates that no node in the partition can satisfy the resource requirement specifier (RRS) supplied to mprun. Check the RRS against the resources actually available in the partition.

Aug 27 11:02:30 ops2a tm.rdb[462]: Cond_set: unable to connect to ops2a/45126: connect: Connection refused

If these errors can be correlated with jobs being killed, they can be safely ignored. One way to check this correlation is to look in the accounting logs for jobs that were signaled during this time.

mprun: unique partition: No such object

Query returned excess results:
a.out: (TMTL UL) TMRTE_Abort: Not yet initialized
The attempt to kill your program failed

This might happen, for example, when mpps shows processes as running that are, in fact, no longer running.

Use the mpkill -C jid command to clear out such stale jobs.



Note - Before removing the job's information from the database, the mpkill -C option verifies that the processes of the job are in fact no longer running.



nodename tm.rdb[7896]: [ID 451353 daemon.error] unable to register (TMRTE_RDB_PROG, RDBVERS, tcp)

where nodename stands for the name of the system (the master node). This condition can generate errors from the mpinfo command as well. The error message is generated when the rpcbind daemon has stopped running.

To diagnose the problem, perform these steps:

i. Check whether rpcbind is running. Issue one of the following two commands:

% svcs bind

% ps -elf | grep rpcbind

ii. If rpcbind is running, issue the rpcinfo command. If rpcinfo is unable to contact rpcbind, then restart rpcbind.

iii. If rpcbind is not running, restart rpcbind.
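To illustrate steps ii and iii: issuing

% rpcinfo -p

lists the RPC services registered with rpcbind on the local host; if rpcinfo cannot contact rpcbind, it fails with a portmapper error. On Solaris releases that manage rpcbind through the Service Management Facility (as the svcs command in step i implies), one way to restart it is:

# svcadm restart network/rpc/bind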

After rpcbind has been restarted, the tm.rdb daemon starts normally.


Anticipating Common Problems

This section presents some guidelines for preventing and troubleshooting common problems.



Note - If you have set the Cluster-level attribute logfile, all error messages generated by user code will be handled by Sun CRE (not syslog) and will be logged in a file specified by an argument to logfile.



Sun CRE RPC timeouts in user code are generally not recoverable. The job might continue to run, but its processes probably will not be able to communicate with each other.

Another common problem is the accumulation of files that defunct jobs leave behind in /tmp. Larger files have file names of the form:

/tmp/.hpcshm_mmap.jid.*

Smaller files will have file names of the form:

/tmp/.hpcshm_acf.jid.*

The Sun MPI shared memory protocol module uses these files for interprocess communication on the same node. These files consume swap space.
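If such files remain after their job has exited (verify with mpps that the job is gone), you can remove them manually to reclaim the swap space. For example, for a defunct job whose job ID was 273 (a placeholder value):

# rm /tmp/.hpcshm_mmap.273.* /tmp/.hpcshm_acf.273.*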


Understanding Protocol-Related Errors

Errors may occur at cluster startup or at program initialization because of problems finding or loading protocol modules. Such errors are not fatal to the runtime environment (that is, to the Sun CRE daemons), but they do mean that the protocol in question is not available for communication on the cluster.

This section describes some error conditions that may occur in relation to protocol modules (PMs).

Errors When Sun CRE Daemons Load Protocol Modules

The errors below are generated when the Sun CRE daemons first start up. These errors can occur because of problems in the hpc.conf file, or because of problems loading the PMs.

All these errors are considered nonfatal to the daemon, but the PM that causes the error will not be usable. The errors below cause the Sun CRE daemons to generate calls to syslog that result in self-explanatory error messages.

The daemons cause a warning to be generated when there are duplicate PM entries in the PMODULES section of hpc.conf. If there are multiple PM entries with the same name, then only the first one is loaded.
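As an illustration only (the layout of hpc.conf varies with the ClusterTools release, and PMODULES entries may carry additional fields; consult the hpc.conf template shipped with your installation), a PMODULES section that names the tcp module twice might look like this:

Begin PMODULES
shm
tcp
tcp
End PMODULES

In this sketch, only the first tcp entry would be loaded; the duplicate would trigger the warning described above.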

Errors When Protocol Modules Discover Interfaces

The errors below are generated at program startup when a PM attempts interface discovery.

These errors are nonfatal to the Sun CRE daemons, but they may mean that the PM causing the error will not be usable. Appropriate error strings are logged via syslog.

-WARNING- Problem detected initializing tcp PM: PM=tcp entry hme is missing tokens

-WARNING- Problem detected initializing tcp PM: Interface hme0 has missing or broken entry in hpc.conf. Will use: Rank=1000,stripe=0, mtu=1500 latency=-1, bandwidth=-1

-WARNING- Problem detected initializing tcp PM: PM=tcp entry hme has extra tokens


Recovering From System Failure

Recovering from system failure involves rebooting Sun CRE and recreating the Sun CRE resource database.

The sunhpc.cre_master reboot and sunhpc.cre_node reboot commands should be used only as a last resort, if the system is not responding (for example, if programs such as mprun, mpinfo, or mpps hang).


To Reboot Sun CRE:

1. Run sunhpc.cre_master reboot on the master node:

# /etc/init.d/sunhpc.cre_master reboot

2. Run sunhpc.cre_node reboot on all the nodes (including the master node if it is running tm.spmd and tm.omd):

# /etc/init.d/sunhpc.cre_node reboot

The procedure attempts to save the system configuration (in the same way as using the mpadmin dump command), kill all the running jobs, and restore the system configuration. Note that the Cluster Console Manager applications may be useful in executing commands on all the nodes in the cluster simultaneously. For information about the Cluster Console Manager applications, see Appendix A.
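If the Cluster Console Manager is not available, step 2 can also be scripted. The following is a minimal sketch, run from the master node, assuming the placeholder node names shown and that root use of rsh between cluster nodes is permitted in your configuration:

# for n in hpc-node1 hpc-node2 hpc-node3; do rsh $n /etc/init.d/sunhpc.cre_node reboot; done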



Note - sunhpc.cre_master reboot saves the existing rdb-log and rdb-save files in /var/hpc/rdb-log.1 and /var/hpc/rdb-save.1. The rdb-log file is a running log of the resource database activity and rdb-save is a snapshot of the database taken at regular intervals.



To recover Sun CRE after a partial failure (that is, when some but not all daemons have failed), you can clean up bad database entries without losing the configuration information. For example, run the following commands on the master node to clear out the dynamic data while preserving the configuration:

# /opt/SUNWhpc/sbin/ctstartd -l
# /etc/init.d/sunhpc.cre_master reboot
 

The ctstartd command is necessary in case some of the daemons are not running. The -l option causes the command to run on the local system; in this case, the master node.