Sun HPC ClusterTools 3.0 Administrator's Guide: With CRE

Cleaning Up Defunct CRE Jobs

Routinely cleaning up defunct jobs is a useful preventive maintenance practice. There are three types of such jobs:

Jobs that have exited, but still appear in mpps output

Jobs that have not terminated, but need to be removed

Jobs that have orphan processes

Removing CRE Jobs That Have Exited

When a job does not exit cleanly, all of its processes may reach a final state while the job object itself remains in the CRE database. The following are two indicators of such incompletely exited jobs:

A process (identified by mpps) in the EXIT, SEXIT, FAIL, or CORE states.
A Prism main window that won't close or exit.

If you see a job in one of these defunct states, perform the following steps to clear the job from the CRE database:

Execute mpps -e again in case the CRE has had time to update the database (and remove the job).

If the job is still running, kill it, specifying its job ID.

% mpkill jid

If necessary, remove the job object from the CRE database.

If mpps continues to report the killed job, use the -C option to mpkill to remove the job object from the CRE database. This must be done as root from the master node.

# mpkill -C jid
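
The three steps above can be strung together in a short Bourne shell sketch. It is illustrative only: the function names, the SETTLE_SECS pause, and the assumption that a job ID appears as a whole word in mpps -e output are not taken from the CRE documentation, so check them against the mpps output at your site before relying on it.

```shell
#!/bin/sh
# Sketch: clear an incompletely exited job from the CRE database.
# Assumption: the job ID appears as a whole word in `mpps -e` output.

job_listed() {
    # Succeeds if mpps -e still reports the given job ID.
    mpps -e | grep -w "$1" >/dev/null
}

clear_exited_job() {
    jid=$1

    # Step 1: recheck, in case the CRE has already removed the job.
    job_listed "$jid" || return 0

    # Step 2: kill the job by job ID, then give the CRE time to update.
    # SETTLE_SECS is a hypothetical tunable, defaulting to 5 seconds.
    mpkill "$jid"
    sleep "${SETTLE_SECS:-5}"
    job_listed "$jid" || return 0

    # Step 3: remove the job object (requires root on the master node).
    mpkill -C "$jid"
}
```

Call clear_exited_job with the job ID as its only argument, as root on the master node, since the final mpkill -C step requires it.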

CRE Jobs That Have Not Terminated

The second type of defunct job includes jobs that are waiting for signals from processes on nodes that have gone off line. The mpps utility displays such jobs in states such as RUNNING, EXITING, SEXTNG, or CORNG.

Note -

If the job-killing option of tm.watchd (-Yk) is enabled, the CRE will handle such situations automatically. This section assumes this option is not enabled.

Kill the job using:

% mpkill jid

There are several variants of the mpkill command, similar to the variants of the Solaris kill command. You may also use:

% mpkill -9 jid

% mpkill -I jid

If these do not succeed, execute mpps -pe to display the unresponsive processes. Then, execute the Solaris ps command on each of the nodes listed. If those processes still exist on any of the nodes, you can remove them using kill -9 pid.
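
The per-node check can be sketched as a small shell function to run on each node that mpps -pe listed, passing the process IDs reported for that node. The function name kill_leftover is hypothetical, and collecting the PIDs from mpps -pe output is left to the administrator because that output format is not described here.

```shell
#!/bin/sh
# Sketch: on one node, kill any of the given PIDs that still exist.

kill_leftover() {
    for pid in "$@"; do
        # ps -p exits nonzero when no such process exists.
        if ps -p "$pid" >/dev/null 2>&1; then
            kill -9 "$pid"
        fi
    done
}
```

For example, kill_leftover 1234 5678 checks both PIDs and sends SIGKILL only to those that are still present.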

Once you have eliminated all defunct jobs, data about the jobs may remain in the CRE database. As root from the master node, use mpkill -C to remove this residual data.

Orphaned Processes

When the tm.watchd -Yk option has been enabled, the watch daemon marks processes ORPHAN if they run on nodes that have gone off line. If the node resumes communication with the CRE daemons, the watch daemon will kill the ORPHAN processes. If not, you will have to kill the processes manually using the Solaris kill command. Otherwise, such processes will continue to consume resources and CPU cycles.

Orphaned processes leave symptoms in error log files, or on stdout if you are running from a terminal. You can also search for errors such as RPC: cannot connect or RPC: timeout. These errors appear under the user.err priority in syslog.
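
A log search for these messages might look like the following sketch. The function name find_rpc_errors is hypothetical, and the log file to scan depends on where your syslog configuration routes user.err messages; /var/adm/messages is assumed here only because it is a common Solaris destination. The pattern assumes a grep that supports the POSIX -E option.

```shell
#!/bin/sh
# Sketch: list RPC connection errors recorded in a syslog file.

find_rpc_errors() {
    # $1: log file to scan (assumed default: /var/adm/messages)
    # The optional "e" tolerates spelling variants of the timeout message.
    grep -E 'RPC: (cannot connect|time?out)' "${1:-/var/adm/messages}"
}
```

Run find_rpc_errors with no argument to scan the assumed default file, or pass the path of the file that receives user.err messages at your site.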

Note -

If an mprun process becomes unresponsive on a system, even where tm.watchd -Yk has been enabled, it may be necessary to use Ctrl-c to kill mprun.