Sun HPC ClusterTools 3.0 Administrator's Guide: With CRE

Procedures for Recovery

Re-creating the CRE Database

The rte.master reboot and rte.node reboot commands should be used only as a last resort, when the system is not responding (for example, when programs such as mprun, mpinfo, or mpps hang). Follow these steps to reboot the CRE:

  1. Run rte.master reboot on the master node.

# /etc/init.d/rte.master reboot

  2. Run rte.node reboot on all the nodes (including the master node if it's running tm.spmd and tm.omd).

# /etc/init.d/rte.node reboot

The procedure attempts to save the system configuration (in the same way as the mpadmin dump command), kill all running jobs, and restore the system configuration. The Cluster Console Manager applications can be useful for executing commands on all of the nodes in the cluster simultaneously. For information about the Cluster Console Manager applications, see Appendix A "Cluster Management Tools".
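
If the Cluster Console Manager applications are not at hand, the two steps above could also be scripted from the master node. The following is only a minimal sketch, not part of the CRE: the names NODES, REMOTE_SH, DRY_RUN, run, and reboot_cre are hypothetical, the host names are placeholders, and rsh is merely an assumed remote shell. Only the two /etc/init.d commands come from this guide. The sketch assumes it is run as root on the master node and defaults to printing the commands rather than executing them.

```shell
#!/bin/sh
# Sketch only. NODES, REMOTE_SH, and DRY_RUN are hypothetical settings,
# not CRE variables. Replace the placeholder host names with your own.
NODES="${NODES:-node0 node1}"   # nodes that should run rte.node reboot
REMOTE_SH="${REMOTE_SH:-rsh}"   # assumed remote shell command
DRY_RUN="${DRY_RUN:-1}"         # 1 = print commands only, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1 on the master node (assumed to be the local host), then
# step 2 on every node listed in NODES. Stops at the first failure.
reboot_cre() {
    run /etc/init.d/rte.master reboot || return 1
    for node in $NODES; do
        run "$REMOTE_SH" "$node" /etc/init.d/rte.node reboot || return 1
    done
}
```

Calling reboot_cre with DRY_RUN left at 1 lists the commands that would be issued, which makes it easy to review the node list first. If the master node runs tm.spmd and tm.omd, it would need to appear in NODES as well.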


Note -

rte.master reboot saves the existing rdb-log and rdb-save files in /var/hpc/rdb-log.1 and /var/hpc/rdb-save.1. rdb-log is a running log of the resource database activity and rdb-save is a snapshot of the database taken at regular intervals.
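
After the reboot, it may be worth confirming that the saved copies named in the note are present. The helper below is a small sketch under that assumption. check_rdb_backups is a hypothetical name, not a CRE command, and the optional directory argument exists only so that the check can be tried against a directory other than /var/hpc.

```shell
#!/bin/sh
# Sketch only. check_rdb_backups is a hypothetical helper, not part of the CRE.
# It reports whether rdb-log.1 and rdb-save.1 exist in the given directory
# (default /var/hpc) and returns nonzero if either file is missing.
check_rdb_backups() {
    dir="${1:-/var/hpc}"
    status=0
    for f in rdb-log.1 rdb-save.1; do
        if [ -f "$dir/$f" ]; then
            echo "found $dir/$f"
        else
            echo "missing $dir/$f"
            status=1
        fi
    done
    return $status
}
```

Run check_rdb_backups with no argument to inspect /var/hpc. A nonzero exit status means at least one of the two saved files was not found.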