Sun HPC ClusterTools 3.0 Administrator's Guide: With CRE

Orphaned Processes

When the tm.watchd -Yk option is enabled, the watch daemon marks processes as ORPHAN if they are running on nodes that have gone offline. If the node resumes communication with the CRE daemons, the watch daemon kills the ORPHAN processes. If it does not, you must kill the processes manually with the Solaris kill command; otherwise, they continue to consume resources and CPU cycles.
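
The manual cleanup described above could be scripted along the following lines. This is only a sketch, not part of ClusterTools: the helper name signal_pids is hypothetical, the PIDs must already have been identified by the administrator, and the script has to run on the node where the orphaned processes live. It relies on Python's os.kill, which delivers a signal in the same way as the kill command.

```python
import os
import signal


def signal_pids(pids, sig=signal.SIGTERM):
    """Send `sig` to each PID and report what happened to each one.

    Hypothetical helper: the caller supplies PIDs that were already
    identified as orphaned; nothing here discovers them.
    """
    results = {}
    for pid in pids:
        try:
            os.kill(pid, sig)  # same effect as `kill -<sig> <pid>`
            results[pid] = "signaled"
        except ProcessLookupError:
            results[pid] = "no such process"  # it has already exited
        except PermissionError:
            results[pid] = "permission denied"  # owned by another user
    return results
```

A call such as signal_pids([4711, 4712]) (the PIDs are made up) sends SIGTERM first; passing sig=signal.SIGKILL is the usual escalation for a process that ignores the polite request.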

You can detect symptoms of orphaned processes by examining the error log files or, if you are running from a terminal, stdout. Search for errors such as RPC: cannot connect or RPC: timeout. These errors are logged at user.err priority in syslog.
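
The log search described above might be automated as in the following sketch. The function names find_rpc_errors and scan_log are hypothetical, and the pattern is an assumption based on the two message fragments quoted in this section. It matches case-insensitively and accepts both the "timout" and "timeout" spellings, because the exact wording in your logs may differ.

```python
import re

# Assumed pattern: covers "RPC: cannot connect", "RPC: timeout", "RPC: timout".
RPC_ERROR = re.compile(r"RPC: (?:cannot connect|time?out)", re.IGNORECASE)


def find_rpc_errors(lines):
    """Return (line_number, text) pairs for lines containing an RPC error."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if RPC_ERROR.search(line)
    ]


def scan_log(path):
    """Scan one log file; `path` is whichever file receives user.err messages."""
    with open(path, errors="replace") as log:
        return find_rpc_errors(log)
```

Which file to pass to scan_log depends on the local syslog configuration. /var/adm/messages is a common destination on Solaris, but treat that path as an assumption and check syslog.conf on your system.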


Note -

If an mprun process becomes unresponsive on a system, even when tm.watchd -Yk is enabled, you may need to press Ctrl-c to kill mprun.