Comparing Restart With Failover (Sun Cluster Data Services Developer's Guide for Solaris OS)

Sun Cluster Data Services Developer's Guide for Solaris OS

Comparing Restart With Failover

If the probefail variable is something other than 0 (success), the nslookup command timed out or the reply came from a server other than the sample service's DNS. In either case, the DNS server is not functioning as expected and the fault monitor calls the decide_restart_or_failover() function to determine whether to restart the data service locally or request that the RGM relocate the data service to a different node. If the probefail variable is 0, a message is generated that the probe was successful.

   if [ $probefail -ne 0 ]; then
         decide_restart_or_failover
   else
         logger -p ${SYSLOG_FACILITY}.err\
         -t [$SYSLOG_TAG]\
         "${ARGV0} Probe for resource HA-DNS successful"
   fi

The decide_restart_or_failover() function uses a time window (Retry_interval) and a failure count (Retry_count) to determine whether to restart DNS locally or request that the RGM relocate the data service to a different node. This function implements the following conditional logic. The code listing for decide_restart_or_failover() in PROBE Program Code Listing contains the code.

If this is the first failure, restart the data service. Log an error message and bump the counter in the retries variable.
If this is not the first failure, but the window has been exceeded, restart the data service. Log an error message, reset the counter, and slide the window.
If the time is still within the window and the retry counter has been exceeded, fail over to another node. If the failover does not succeed, log an error and exit the probe program with status 1 (failure).
If time is still within the window but the retry counter has not been exceeded, restart the data service. Log an error message and bump the counter in the retries variable.

If the number of restarts reaches the limit during the time interval, the function requests that the RGM relocate the data service to a different node. If the number of restarts is under the limit, or the interval has been exceeded so the count begins again, the function attempts to restart DNS on the same node.

Note the following points about this function:

The gettime utility is used to track the time between restarts. This is a C program that is located in the (RT_basedir) directory.
The Retry_count and Retry_interval system-defined resource properties determine the number of restart attempts and the time interval over which to count. These properties default to two attempts in a period of 5 minutes (300 seconds) in the RTR file, although the cluster administrator can change these values.
The restart_service() function is called to attempt to restart the data service on the same node. See the next section, Restarting the Data Service, for information about this function.
The scha_control() API function, with the SCHA_GIVEOVER argument, brings the resource group that contains the sample data service offline and back online on a different node.