Using Start and Stop Methods (Sun Cluster Data Services Developer's Guide for Solaris OS)

Using `Start` and `Stop` Methods

The RGM calls a resource type's method programs at the correct times and on the correct nodes or zones to bring resource groups offline and online. For example, after the crash of a cluster node or zone, the RGM moves any resource groups that are mastered by that node or zone onto a new node or zone. In this case, you must implement a Start method to provide the RGM with, among other things, a way of restarting each resource on the surviving host node or zone.

A Start method must not return until the resource has been started and is available on the local node or zone. Be certain that resource types that require a long initialization period have sufficiently long timeouts set on their Start methods. To ensure sufficient timeouts, set the default and minimum values for the Start_timeout property in the RTR file.
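The blocking behavior described above can be sketched as a polling loop. The following Python fragment is an illustration only: the function name `wait_until_available`, the `probe` callback, and the timing parameters are hypothetical and are not part of any Sun Cluster API. Because the RGM enforces Start_timeout on its own, an internal deadline such as this one is assumed to be shorter than that property's value.

```python
import time
from typing import Callable


def wait_until_available(
    probe: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 1.0,
) -> bool:
    """Block until probe() reports the resource is available.

    Returns True as soon as probe() succeeds, or False if timeout_s
    elapses first. probe is a hypothetical, application-specific check,
    for example an attempt to connect to the service's port.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A Start method built on this sketch might launch the application, call `wait_until_available`, and exit with 0 only if the call returns True, so that the method does not return success before the resource is usable.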

You must implement a Stop method for situations in which the RGM takes a resource group offline. For example, suppose a resource group is taken offline in ZoneA on Node1 and brought back online in ZoneB on Node2. While taking the resource group offline, the RGM calls the Stop method on resources in the resource group to stop all activity in ZoneA on Node1. After the Stop methods for all resources have completed in ZoneA on Node1, the RGM brings the resource group back online in ZoneB on Node2.

A Stop method must not return until the resource has completely stopped all its activity on the local node or zone and has completely shut down. The safest implementation of a Stop method terminates all processes on the local node or zone that are related to the resource. Resource types that require a long time to shut down need sufficiently long timeouts on their Stop methods. To ensure sufficient timeouts, set the Stop_timeout property in the RTR file.
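One way to terminate every process that belongs to a resource, and to return only after they are gone, is to request a graceful shutdown first and escalate afterward. The Python fragment below is a minimal sketch under that assumption. The names `stop_all`, `send_signal`, and `is_alive`, the grace period, and the way the list of process IDs is obtained are all hypothetical; the source does not prescribe this approach.

```python
import os
import signal
import time


def is_alive(pid: int) -> bool:
    """Return True if a process with this PID still exists."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but is owned by another user
    return True


def send_signal(pid: int, sig: int) -> None:
    """Send sig to pid, treating an already-exited process as success."""
    try:
        os.kill(pid, sig)
    except ProcessLookupError:
        pass


def stop_all(pids, grace_s=10.0, poll_s=0.5, send=send_signal, alive=is_alive):
    """Stop every PID: SIGTERM first, then SIGKILL after grace_s seconds.

    Returns True only when none of the processes remains.
    """
    for pid in pids:
        send(pid, signal.SIGTERM)

    deadline = time.monotonic() + grace_s
    remaining = [p for p in pids if alive(p)]
    while remaining and time.monotonic() < deadline:
        time.sleep(poll_s)
        remaining = [p for p in remaining if alive(p)]

    # Escalate for anything that ignored or outlived SIGTERM.
    for pid in remaining:
        send(pid, signal.SIGKILL)
    if remaining:
        time.sleep(poll_s)  # give the kernel a moment to tear them down
    return not any(alive(p) for p in remaining)
```

In this sketch the grace period is assumed to be well below the resource's Stop_timeout value, so that the escalation to SIGKILL happens before the RGM times out the method. The `send` and `alive` parameters exist only so the logic can be exercised without real processes.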

If an RGM method callback times out, the method's process tree is killed by a SIGABRT signal (not a SIGTERM signal). As a result, all members of the process group generate a core dump file in the /var/cluster/core directory. This core dump file is generated to enable you to determine why your method exceeded its timeout.

Note –

Avoid writing data service methods that create a new process group. If your data service method must create a new process group, write a signal handler for the SIGTERM and SIGABRT signals. Also, ensure that your signal handler forwards the SIGTERM or SIGABRT signal to the child process group or groups before the signal handler terminates the process. Writing a signal handler for these signals increases the likelihood that all processes that are spawned by your method are correctly terminated.

Failure or timeout of a Stop method causes the resource group to enter an error state that requires the cluster administrator to intervene. To avoid this state, the Stop and Monitor_stop method implementations must attempt to recover from all possible error conditions. Ideally, these methods exit with a status of 0 (success), having successfully stopped all activity of the resource and its monitor on the local node or zone.