Oracle Solaris Cluster Data Services Developer's Guide     Oracle Solaris Cluster 4.0

Defining a Fault Monitor

The sample application implements a basic fault monitor to monitor the reliability of the DNS resource (in.named).

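The fault monitor consists of the following elements: the dns_probe program and the Monitor_start, Monitor_stop, and Monitor_check callback methods. Each element is described in the sections that follow.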

How the Probe Program Works

The dns_probe program implements a continuously running process that verifies that the DNS resource that is controlled by the sample data service is running. The dns_probe program is started by the dns_monitor_start method, which the RGM runs automatically after the sample data service is brought online. The probe is stopped by the dns_monitor_stop method, which the RGM runs before it brings the sample data service offline.

This section describes the major pieces of the PROBE method for the sample application. It does not describe functionality that is common to all callback methods, such as the parse_args() function. This section also does not describe using the syslog() function. Common functionality is described in Providing Common Functionality to All Methods.

For the complete listing of the PROBE method, see PROBE Program Code Listing.

What the Probe Program Does

The probe runs in an infinite loop. It uses nslookup to verify that the correct DNS resource is running. If DNS is running, the probe sleeps for a prescribed interval (set by the Thorough_probe_interval system-defined property) and checks again. If DNS is not running, this program attempts to restart it locally, or depending on the number of restart attempts, requests that the RGM relocate the data service to a different node.

Obtaining Property Values

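This program requires the values of the following properties: Thorough_probe_interval (the period during which the probe sleeps between checks), Probe_timeout (the timeout that is enforced on the nslookup command), Network_resources_used (the host on which DNS is serving), Retry_count and Retry_interval (the number of restart attempts and the period over which to count them), and RT_basedir (the directory that contains the resource type's methods).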

The scha_resource_get() function obtains the values of these properties and stores them in shell variables, as follows:

PROBE_INTERVAL=`scha_resource_get -O Thorough_probe_interval \
-R $RESOURCE_NAME -G $RESOURCEGROUP_NAME`

PROBE_TIMEOUT_INFO=`scha_resource_get -O Extension -R $RESOURCE_NAME \
-G $RESOURCEGROUP_NAME Probe_timeout`
PROBE_TIMEOUT=`echo $PROBE_TIMEOUT_INFO | awk '{print $2}'`

DNS_HOST=`scha_resource_get -O Network_resources_used -R $RESOURCE_NAME \
-G $RESOURCEGROUP_NAME`

RETRY_COUNT=`scha_resource_get -O Retry_count -R $RESOURCE_NAME -G \
$RESOURCEGROUP_NAME`

RETRY_INTERVAL=`scha_resource_get -O Retry_interval -R $RESOURCE_NAME -G \
$RESOURCEGROUP_NAME`

RT_BASEDIR=`scha_resource_get -O RT_basedir -R $RESOURCE_NAME -G \
 $RESOURCEGROUP_NAME`

Note - For system-defined properties, such as Thorough_probe_interval, the scha_resource_get() function returns the value only. For extension properties, such as Probe_timeout, the scha_resource_get() function returns the type and value. Use the awk command to obtain the value only.
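As a minimal sketch of the awk step, the following fragment substitutes a hard-coded reply for the real scha_resource_get output, because that command is available only on a cluster node. The INT type label and the value 30 are illustrative assumptions, not values taken from the sample RTR file.

```shell
# Hypothetical stand-in for the output of
#   scha_resource_get -O Extension ... Probe_timeout
# assumed here to be the property type followed by the value.
probe_timeout_info=$(printf 'INT\n30\n')

# The unquoted expansion collapses the reply to "INT 30";
# awk then selects the second field, which is the value.
PROBE_TIMEOUT=$(echo $probe_timeout_info | awk '{print $2}')

echo "Probe_timeout value: $PROBE_TIMEOUT"
```

The same pattern applies to any extension property, such as Confdir, as long as the value contains no embedded white space.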


Checking the Reliability of the Service

The probe itself is an infinite while loop of nslookup commands. Before the while loop, a temporary file is set up to hold the nslookup replies. The probefail and retries variables are initialized to 0.

# Set up a temporary file for the nslookup replies.
DNSPROBEFILE=/var/cluster/run/.$RESOURCE_NAME.probe
probefail=0
retries=0

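The while loop carries out the following tasks: it sleeps for the probe interval, uses hatimerun to run nslookup against the DNS host with the Probe_timeout value, sets probefail to 1 if nslookup returns a nonzero exit code, and, if nslookup succeeded, verifies that the reply came from the sample data service's DNS server and not from another name server.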

Here is the while loop code.

while :
do
   # The interval at which the probe needs to run is specified in the
   # property THOROUGH_PROBE_INTERVAL. Therefore, set the probe to sleep
   # for a duration of THOROUGH_PROBE_INTERVAL.
   sleep $PROBE_INTERVAL

   # Run an nslookup command of the IP address on which DNS is serving.
   /usr/cluster/bin/hatimerun -t $PROBE_TIMEOUT /usr/sbin/nslookup $DNS_HOST $DNS_HOST \
      > $DNSPROBEFILE 2>&1

   retcode=$?
   if [ $retcode -ne 0 ]; then
      probefail=1
   fi

   # Make sure that the reply to nslookup comes from the HA-DNS
   # server and not from another nameserver mentioned in the
   # /etc/resolv.conf file.
   if [ $probefail -eq 0 ]; then
      # Get the name of the server that replied to the nslookup query.
      SERVER=`awk '$1=="Server:" { print $2 }' \
         $DNSPROBEFILE | awk -F. '{ print $1 }'`
      if [ -z "$SERVER" ]; then
         probefail=1
      else
         if [ "$SERVER" != "$DNS_HOST" ]; then
            probefail=1
         fi
      fi
   fi
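The server check can be tried outside a cluster with a canned reply. The following is a sketch only: the reply text, the host name dnshost, and the address are hypothetical, and the reply uses the layout in which nslookup reports the responding server by name.

```shell
# Hypothetical nslookup reply saved to a temporary file.
DNSPROBEFILE=$(mktemp)
cat > "$DNSPROBEFILE" <<'EOF'
Server:  dnshost.example.com
Address:  192.0.2.10

Name:    dnshost.example.com
Address:  192.0.2.10
EOF

DNS_HOST=dnshost   # assumed name of the logical host that serves DNS
probefail=0

# Take the second field of the "Server:" line, then strip the domain part.
SERVER=$(awk '$1 == "Server:" { print $2 }' "$DNSPROBEFILE" | awk -F. '{ print $1 }')

# An empty or unexpected server name counts as a probe failure.
if [ -z "$SERVER" ] || [ "$SERVER" != "$DNS_HOST" ]; then
    probefail=1
fi

rm -f "$DNSPROBEFILE"
echo "server=$SERVER probefail=$probefail"
```

If the reply identified the server by IP address instead of by name, the second awk command would return only the first octet, so the comparison depends on the reply format.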

Comparing Restart With Failover

If the probefail variable is something other than 0 (success), the nslookup command timed out or the reply came from a server other than the sample service's DNS. In either case, the DNS server is not functioning as expected and the fault monitor calls the decide_restart_or_failover() function to determine whether to restart the data service locally or request that the RGM relocate the data service to a different node. If the probefail variable is 0, a message is generated that the probe was successful.

   if [ $probefail -ne 0 ]; then
         decide_restart_or_failover
   else
         logger -p ${SYSLOG_FACILITY}.info \
         -t [$SYSLOG_TAG] \
         "${ARGV0} Probe for resource HA-DNS successful"
   fi

The decide_restart_or_failover() function uses a time window (Retry_interval) and a failure count (Retry_count) to determine whether to restart DNS locally or request that the RGM relocate the data service to a different node. This function implements the following conditional logic. The code listing for decide_restart_or_failover() in PROBE Program Code Listing contains the code.

If the number of restarts reaches the limit during the time interval, the function requests that the RGM relocate the data service to a different node. If the number of restarts is under the limit, or the interval has been exceeded so the count begins again, the function attempts to restart DNS on the same node.
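The decision described above can be sketched as follows. This is not the listing from the sample data service. The function name decide, the variables window_start and action, and the convention of passing the current time in seconds as an argument are all assumptions made so that the fragment runs without cluster commands.

```shell
RETRY_COUNT=2        # illustrative limit on restarts within the window
RETRY_INTERVAL=300   # illustrative window length, in seconds
retries=0
window_start=0

# decide NOW: sets action to "restart" or "failover".
# NOW is the current time in seconds, supplied by the caller (assumption).
decide() {
    now=$1
    if [ "$retries" -eq 0 ]; then
        # First failure: open a new window and restart locally.
        window_start=$now
        retries=1
        action=restart
        return 0
    fi
    elapsed=$((now - window_start))
    if [ "$elapsed" -ge "$RETRY_INTERVAL" ]; then
        # The window has expired, so the count begins again.
        window_start=$now
        retries=1
        action=restart
    elif [ "$retries" -ge "$RETRY_COUNT" ]; then
        # The limit was reached inside the window: ask for a failover.
        action=failover
    else
        retries=$((retries + 1))
        action=restart
    fi
}

decide 0;    echo "t=0    -> $action"
decide 100;  echo "t=100  -> $action"
decide 200;  echo "t=200  -> $action"
decide 1000; echo "t=1000 -> $action"
```

In a real probe, the restart branch would call restart_service() and the failover branch would call scha_control with the GIVEOVER option.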

Note the following points about this function:

Restarting the Data Service

The restart_service() function is called by decide_restart_or_failover() to attempt to restart the data service on the same node.

This function executes the following logic:

function restart_service
{

        # To restart the data service, first verify that the 
        # data service itself is still registered under PMF.
        pmfadm -q $PMF_TAG
        if [[ $? -eq 0 ]]; then
                # Since the TAG for the data service is still registered under
                # PMF, first stop the data service and start it back up again.

                # Obtain the Stop method name and the STOP_TIMEOUT value for
                # this resource.
                STOP_TIMEOUT=`scha_resource_get -O STOP_TIMEOUT \
                        -R $RESOURCE_NAME -G $RESOURCEGROUP_NAME`
                STOP_METHOD=`scha_resource_get -O STOP \
                        -R $RESOURCE_NAME -G $RESOURCEGROUP_NAME`
                /usr/cluster/bin/hatimerun -t $STOP_TIMEOUT $RT_BASEDIR/$STOP_METHOD \
                        -R $RESOURCE_NAME -G $RESOURCEGROUP_NAME \
                        -T $RESOURCETYPE_NAME

                if [[ $? -ne 0 ]]; then
                        logger -p ${SYSLOG_FACILITY}.err -t [$SYSLOG_TAG] \
                                "${ARGV0} Stop method failed."
                        return 1
                fi

                # Obtain the START method name and the START_TIMEOUT value for
                # this resource.
                START_TIMEOUT=`scha_resource_get -O START_TIMEOUT \
                        -R $RESOURCE_NAME -G $RESOURCEGROUP_NAME`
                START_METHOD=`scha_resource_get -O START \
                        -R $RESOURCE_NAME -G $RESOURCEGROUP_NAME`
                /usr/cluster/bin/hatimerun -t $START_TIMEOUT $RT_BASEDIR/$START_METHOD \
                        -R $RESOURCE_NAME -G $RESOURCEGROUP_NAME \
                        -T $RESOURCETYPE_NAME

                if [[ $? -ne 0 ]]; then
                        logger -p ${SYSLOG_FACILITY}.err -t [$SYSLOG_TAG] \
                                "${ARGV0} Start method failed."
                        return 1
                fi


        else
                # The absence of the TAG for the dataservice 
                # implies that the data service has already
                # exceeded the maximum retries allowed under PMF.
                # Therefore, do not attempt to restart the
                # data service again, but try to failover
                # to another node in the cluster.
                scha_control -O GIVEOVER -G $RESOURCEGROUP_NAME \
                        -R $RESOURCE_NAME
        fi

        return 0
}

Probe Exit Status

The sample data service's PROBE program exits with failure if attempts to restart locally fail and the attempt to fail over to a different node fails as well. This program logs the message Failover attempt failed.

How the Monitor_start Method Works

The RGM calls the Monitor_start method to start the dns_probe program after the sample data service is brought online.

This section describes the major pieces of the Monitor_start method for the sample application. This section does not describe functionality that is common to all callback methods, such as the parse_args() function. This section also does not describe using the syslog() function. Common functionality is described in Providing Common Functionality to All Methods.

For the complete listing of the Monitor_start method, see Monitor_start Method Code Listing.

What the Monitor_start Method Does

This method uses the PMF (pmfadm) to start the probe.

Starting the Probe

The Monitor_start method obtains the value of the RT_basedir property to construct the full path name for the PROBE program. This method starts the probe by using the infinite retries option of pmfadm (-n -1, -t -1), which means that if the probe fails to start, the PMF tries to start it an infinite number of times over an infinite period of time.

# Find where the probe program resides by obtaining the value of the
# RT_basedir property of the resource.
RT_BASEDIR=`scha_resource_get -O RT_basedir -R $RESOURCE_NAME -G \
$RESOURCEGROUP_NAME`

# Start the probe for the data service under PMF. Use the infinite retries
# option to start the probe. Pass the resource name, type, and group to the
# probe program. 
pmfadm -c $RESOURCE_NAME.monitor -n -1 -t -1 \
   $RT_BASEDIR/dns_probe -R $RESOURCE_NAME -G $RESOURCEGROUP_NAME \
   -T $RESOURCETYPE_NAME

How the Monitor_stop Method Works

The RGM calls the Monitor_stop method to stop execution of dns_probe when the sample data service is brought offline.

This section describes the major pieces of the Monitor_stop method for the sample application. This section does not describe functionality that is common to all callback methods, such as the parse_args() function. This section also does not describe using the syslog() function. Common functionality is described in Providing Common Functionality to All Methods.

For the complete listing of the Monitor_stop method, see Monitor_stop Method Code Listing.

What the Monitor_stop Method Does

This method uses the PMF (pmfadm) to check whether the probe is running, and if so, to stop it.

Stopping the Monitor

The Monitor_stop method uses pmfadm -q to see if the probe is running, and if so, uses pmfadm -s to stop it. If the probe is already stopped, the method exits successfully anyway, which guarantees the idempotence of the method.


Caution - Be certain to use the KILL signal with pmfadm to stop the probe and not a signal that can be masked, such as TERM. Otherwise, the Monitor_stop method can hang indefinitely and eventually time out. The reason is that the PROBE method calls scha_control() when it is necessary to restart or fail over the data service. When scha_control() calls Monitor_stop as part of the process of bringing the data service offline, if Monitor_stop uses a signal that can be masked, Monitor_stop hangs waiting for scha_control() to complete, and scha_control() hangs waiting for Monitor_stop to complete.


# See if the monitor is running, and if so, kill it.
if pmfadm -q $PMF_TAG; then
   pmfadm -s $PMF_TAG KILL
   if [ $? -ne 0 ]; then
      logger -p ${SYSLOG_FACILITY}.err \
         -t [$SYSLOG_TAG] \
         "${ARGV0} Could not stop monitor for resource " \
         $RESOURCE_NAME
      exit 1
   else
      # The monitor was stopped successfully. Log an informational message.
      logger -p ${SYSLOG_FACILITY}.info \
         -t [$SYSLOG_TAG] \
         "${ARGV0} Monitor for resource " $RESOURCE_NAME \
         " successfully stopped"
   fi
fi
exit 0
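The idempotence of this pattern can be illustrated without a cluster by replacing pmfadm with a stub. In the following sketch, the pmfadm shell function, the MONITOR_RUNNING flag, the stop_monitor wrapper, and the tag dns-res.monitor are all hypothetical. The stub assumes only that pmfadm -q returns 0 when the tag is registered and that pmfadm -s stops the tagged process.

```shell
# Hypothetical stub that stands in for the real pmfadm command.
MONITOR_RUNNING=1
pmfadm() {
    case "$1" in
        -q) [ "$MONITOR_RUNNING" -eq 1 ] ;;   # succeeds only while "running"
        -s) MONITOR_RUNNING=0 ;;              # simulates a successful kill
    esac
}

PMF_TAG=dns-res.monitor   # illustrative tag name

# Stop the monitor if it is running; succeed either way.
stop_monitor() {
    if pmfadm -q "$PMF_TAG"; then
        pmfadm -s "$PMF_TAG" KILL || return 1
    fi
    return 0
}

stop_monitor; first=$?
stop_monitor; second=$?
echo "first=$first second=$second running=$MONITOR_RUNNING"
```

Both calls return success: the first stops the monitor, and the second finds nothing to stop.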

Monitor_stop Exit Status

The Monitor_stop method logs an error message and exits with failure if it cannot stop the PROBE program. In that case, the RGM puts the sample data service into MONITOR_FAILED state on the primary node, which can panic the node.

Monitor_stop should not exit before the probe has been stopped.

How the Monitor_check Method Works

The RGM calls the Monitor_check method whenever the PROBE method attempts to fail over the resource group that contains the data service to a new node.

This section describes the major pieces of the Monitor_check method for the sample application. This section does not describe functionality that is common to all callback methods, such as the parse_args() function. This section also does not describe using the syslog() function. Common functionality is described in Providing Common Functionality to All Methods.

For the complete listing of the Monitor_check method, see Monitor_check Method Code Listing.

The Monitor_check method must be implemented so that it does not conflict with other methods that are running concurrently.

The Monitor_check method calls the Validate method to verify that the DNS configuration directory is available on the new node. The Confdir extension property points to the DNS configuration directory. Therefore, Monitor_check obtains the path and name for the Validate method and the value of Confdir. It passes this value to Validate, as shown in the following listing.

# Obtain the full path for the Validate method from
# the RT_basedir property of the resource type.
RT_BASEDIR=`scha_resource_get -O RT_basedir -R $RESOURCE_NAME \
   -G $RESOURCEGROUP_NAME`

# Obtain the name of the Validate method for this resource.
VALIDATE_METHOD=`scha_resource_get -O Validate \
   -R $RESOURCE_NAME -G $RESOURCEGROUP_NAME`

# Obtain the value of the Confdir property in order to start the
# data service. Use the resource name and the resource group entered to
# obtain the Confdir value set at the time of adding the resource.
config_info=`scha_resource_get -O Extension -R $RESOURCE_NAME \
 -G $RESOURCEGROUP_NAME Confdir`

# scha_resource_get returns the type as well as the value for extension
# properties. Use awk to get only the value of the extension property.
CONFIG_DIR=`echo $config_info | awk '{print $2}'`

# Call the validate method so that the dataservice can be failed over
# successfully to the new node.
$RT_BASEDIR/$VALIDATE_METHOD -R $RESOURCE_NAME -G $RESOURCEGROUP_NAME \
   -T $RESOURCETYPE_NAME -x Confdir=$CONFIG_DIR

See How the Validate Method Works to see how the sample application verifies the suitability of a node for hosting the data service.