Sun Cluster Data Services Developer's Guide for Solaris OS

Controlling the Data Service

A data service must provide a Start or Prenet_start method to activate the application daemon in the cluster, and a Stop or Postnet_stop method to stop the application daemon in the cluster. The sample data service implements a Start and a Stop method. See Deciding Which Start and Stop Methods to Use for information about when to use Prenet_start and Postnet_stop instead.

How the Start Method Works

The RGM runs the Start method on a cluster node or zone when the resource group that contains the data service resource is brought online on that node or zone or when the resource group is already online and the resource is enabled. In the sample application, the Start method activates the in.named DNS daemon in the global zone on that node.

This section describes the major pieces of the Start method for the sample application. This section does not describe functionality that is common to all callback methods, such as the parse_args() function. This section also does not describe using the syslog() function. Common functionality is described in Providing Common Functionality to All Methods.

For the complete listing of the Start method, see Start Method Code Listing.

What the Start Method Does

Before attempting to start DNS, the Start method in the sample data service verifies that the configuration directory and configuration file (named.conf) are accessible and available. Information in named.conf is essential to the successful operation of DNS.

This callback method uses the PMF (pmfadm) to start the DNS daemon (in.named). If DNS crashes or fails to start, the PMF attempts to start the DNS daemon a prescribed number of times during a specified interval. The number of retries and the interval are specified by properties in the data service's RTR file.

Verifying the Configuration

In order to operate, DNS requires information from the named.conf file in the configuration directory. Therefore, the Start method performs some sanity checks to verify that the directory and file are accessible before attempting to start DNS.

The Confdir extension property provides the path to the configuration directory. The property itself is defined in the RTR file. However, the cluster administrator specifies the actual location when the cluster administrator configures the data service.

In the sample data service, the Start method retrieves the location of the configuration directory by using the scha_resource_get() function.


Note –

Because Confdir is an extension property, scha_resource_get() returns both the type and value. The awk command retrieves just the value and places that value in a shell variable, CONFIG_DIR.


# find the value of Confdir set by the cluster administrator at the time of
# adding the resource.
config_info=`scha_resource_get -O Extension -R $RESOURCE_NAME \
-G $RESOURCEGROUP_NAME Confdir`

# scha_resource_get returns the "type" as well as the "value" for the
# extension properties. Get only the value of the extension property 
CONFIG_DIR=`echo $config_info | awk '{print $2}'`

The Start method uses the value of CONFIG_DIR to verify that the directory is accessible. If it is not accessible, Start logs an error message and exits with an error status. See Start Exit Status.

# Check if $CONFIG_DIR is accessible.
if [ ! -d $CONFIG_DIR ]; then
   logger -p ${SYSLOG_FACILITY}.err \
         -t [$SYSLOG_TAG] \
         "${ARGV0} Directory $CONFIG_DIR is missing or not mounted"
   exit 1
fi

Before starting the application daemon, this method performs a final check to verify that the named.conf file is present. If the file is not present, Start logs an error message and exits with an error status.

# Change to the $CONFIG_DIR directory in case there are relative
# pathnames in the data files.
cd $CONFIG_DIR

# Check that the named.conf file is present in the $CONFIG_DIR directory
if [ ! -s named.conf ]; then
   logger -p ${SYSLOG_FACILITY}.err \
         -t [$SYSLOG_TAG] \
         "${ARGV0} File $CONFIG_DIR/named.conf is missing or empty"
   exit 1
fi

Starting the Application

This method uses the process manager facility (pmfadm) to start the application. The pmfadm command enables you to set the number of times to try to restart the application during a specified time frame. The RTR file contains two properties: Retry_count specifies the number of times to attempt restarting an application, and Retry_interval specifies the time period over which to do so.

The Start method retrieves the values of Retry_count and Retry_interval by using the scha_resource_get() function and stores their values in shell variables. The Start method passes these values to pmfadm by using the -n and -t options.

# Get the value for retry count from the RTR file.
RETRY_CNT=`scha_resource_get -O Retry_count -R $RESOURCE_NAME \
-G $RESOURCEGROUP_NAME`
# Get the value for retry interval from the RTR file. This value is in seconds
# and must be converted to minutes for passing to pmfadm. Note that the 
# conversion rounds up; for example, 50 seconds rounds up to 1 minute.
((RETRY_INTRVAL=`scha_resource_get -O Retry_interval -R $RESOURCE_NAME \
-G $RESOURCEGROUP_NAME` / 60))

# Start the in.named daemon under the control of PMF. Let it crash and restart
# up to $RETRY_COUNT times in a period of $RETRY_INTERVAL; if it crashes
# more often than that, PMF will cease trying to restart it.
# If there is a process already registered under the tag
# <$PMF_TAG>, then PMF sends out an alert message that the
# process is already running.
pmfadm -c $PMF_TAAG -n $RETRY_CNT -t $RETRY_INTRVAL \
    /usr/sbin/in.named -c named.conf

# Log a message indicating that HA-DNS has been started.
if [ $? -eq 0 ]; then
   logger -p ${SYSLOG_FACILITY}.err \
         -t [$SYSLOG_TAG] \
         "${ARGV0} HA-DNS successfully started"
fi
exit 0

Start Exit Status

A Start method should not exit with success until the underlying application is actually running and is available, particularly if other data services depend on it. One way to verify success is to probe the application to make sure that it is running before exiting the Start method. For a complex application, such as a database, be certain to set the value for the Start_timeout property in the RTR file sufficiently high to allow time for the application to initialize and recover from a crash.


Note –

Because the application resource (DNS) in the sample data service starts quickly, the sample data service does not poll to verify that it is running before exiting with success.


If this method fails to start DNS and exits with failure status, the RGM checks the Failover_mode property, which determines how to react. The sample data service does not explicitly set the Failover_mode property, so this property has the default value NONE (unless the cluster administrator overrides the default value and specifies a different value). In this case, the RGM takes no action other than to set the state of the data service. The cluster administrator needs to initiate a restart on the same node or zone or a fail over to a different node or zone.

How the Stop Method Works

The RGM runs the Stop method on a cluster node or zone when the resource group that contains the HA-DNS resource is brought offline on that node or zone or if the resource group is online and the resource is disabled. This method stops the in.named (DNS) daemon on that node or zone.

This section describes the major pieces of the Stop method for the sample application. This section does not describe functionality that is common to all callback methods, such as the parse_args() function. This section also does not describe using the syslog() function. Common functionality is described in Providing Common Functionality to All Methods.

For the complete listing of the Stop method, see Stop Method Code Listing.

What the Stop Method Does

There are two primary considerations when attempting to stop the data service. The first is to provide an orderly shutdown. Sending a SIGTERM signal through pmfadm is the best way to accomplish an orderly shutdown.

The second consideration is to ensure that the data service is actually stopped to avoid putting it in Stop_failed state. The best way to accomplish putting the data service in this state is to send a SIGKILL signal through pmfadm.

The Stop method in the sample data service takes both of these considerations into account. It first sends a SIGTERM signal. If this signal fails to stop the data service, the method sends a SIGKILL signal.

Before attempting to stop DNS, this Stop method verifies that the process is actually running. If the process is running, Stop uses the PMF (pmfadm) to stop the process.

This Stop method is guaranteed to be idempotent. Although the RGM should not call a Stop method twice without first starting the data service with a call to its Start method, the RGM could call a Stop method on a resource even though the resource was never started or the resource died of its own accord. Therefore, this Stop method exits with success even if DNS is not running.

Stopping the Application

The Stop method provides a two-tiered approach to stopping the data service: an orderly or smooth approach using a SIGTERM signal through pmfadm and an abrupt or hard approach using a SIGKILL signal. The Stop method obtains the Stop_timeout value (the amount of time in which the Stop method must return). Stop allocates 80 percent of this time to stopping smoothly and 15 percent to stopping abruptly (5 percent is reserved), as shown in the following sample code.

STOP_TIMEOUT='scha_resource_get -O STOP_TIMEOUT -R $RESOURCE_NAME \
-G $RESOURCEGROUP_NAME'
((SMOOTH_TIMEOUT=$STOP_TIMEOUT * 80/100))
((HARD_TIMEOUT=$STOP_TIMEOUT * 15/100))

The Stop method uses pmfadm -q to verify that the DNS daemon is running. If the DNS daemon is running, Stop first uses pmfadm -s to send a TERM signal to terminate the DNS process. If this signal fails to terminate the process after 80 percent of the timeout value has expired, Stop sends a SIGKILL signal. If this signal also fails to terminate the process within 15 percent of the timeout value, the method logs an error message and exits with an error status.

If pmfadm terminates the process, the method logs a message that the process has stopped and exits with success.

If the DNS process is not running, the method logs a message that it is not running and exits with success anyway. The following code sample shows how Stop uses pmfadm to stop the DNS process.

# See if in.named is running, and if so, kill it. 
if pmfadm -q $PMF_TAG; then
   # Send a SIGTERM signal to the data service and wait for 80% of the
   # total timeout value.
   pmfadm -s $RESOURCE_NAME.named -w $SMOOTH_TIMEOUT TERM
   if [ $? -ne 0 ]; then
      logger -p ${SYSLOG_FACILITY}.err \
          -t [$RESOURCETYPE_NAME,$RESOURCEGROUP_NAME,$RESOURCE_NAME] \
          “${ARGV0} Failed to stop HA-DNS with SIGTERM; Retry with \
           SIGKILL”
      
      # Since the data service did not stop with a SIGTERM signal, use 
      # SIGKILL now and wait for another 15% of the total timeout value.
      pmfadm -s $PMF_TAG -w $HARD_TIMEOUT KILL
      if [ $? -ne 0 ]; then
          logger -p ${SYSLOG_FACILITY}.err \
          -t [$SYSLOG_TAG] \
          “${ARGV0} Failed to stop HA-DNS; Exiting UNSUCCESSFUL”
         exit 1
      fi
fi
else 
   # The data service is not running as of now. Log a message and 
   # exit success.
   logger -p ${SYSLOG_FACILITY}.err \
           -t [$SYSLOG_TAG] \
           “HA-DNS is not started”

   # Even if HA-DNS is not running, exit success to avoid putting
   # the data service resource in STOP_FAILED State.
   exit 0
fi

# Could successfully stop DNS. Log a message and exit success.
logger -p ${SYSLOG_FACILITY}.err \
    -t [$RESOURCETYPE_NAME,$RESOURCEGROUP_NAME,$RESOURCE_NAME] \
    “HA-DNS successfully stopped”
exit 0

Stop Exit Status

A Stop method should not exit with success until the underlying application is actually stopped, particularly if other data services depend on it. Failure to do so can result in data corruption.

For a complex application, such as a database, be certain to set the value for the Stop_timeout property in the RTR file sufficiently high to allow time for the application to clean up while stopping.

If this method fails to stop DNS and exits with failure status, the RGM checks the Failover_mode property, which determines how to react. The sample data service does not explicitly set the Failover_mode property, so this property has the default value NONE (unless the cluster administrator overrides the default value and specifies a different value). In this case, the RGM takes no action other than to set the state of the data service to Stop_failed. The cluster administrator needs to stop the application forcibly and clear the Stop_failed state.