Sun Cluster 2.2 API Developer's Guide

2.3 Fault Monitoring Methods for the in.named Data Service

Sun Cluster enables the author of an HA data service to write fault monitoring methods for the data service. For example, a modest fault monitor for in.named can query in.named periodically using nslookup(1M). If the lookup does not complete within a very long timeout, the fault monitor concludes that the in.named daemon is hung and must be killed and restarted.
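
The timeout behavior that such a probe depends on can be tried out away from a cluster. The following is a minimal sketch only: run_with_timeout is a hypothetical helper, not part of Sun Cluster, that imitates the convention the sample probe script relies on, namely an exit status of 99 when the command does not finish in time. It assumes a shell whose wait reports 137 (128 + 9) for a child killed by SIGKILL, so a command killed that way for any other reason is also reported as a timeout.

```sh
#!/bin/sh
# run_with_timeout SECS CMD [ARGS...]
# Hypothetical helper for off-cluster experiments; on a cluster the
# sample probe uses hatimerun instead.
# Returns 99 if CMD had to be killed after SECS seconds, otherwise
# returns CMD's own exit status.
run_with_timeout() {
    rwt_secs=$1
    shift

    # Start the command in the background and remember its pid.
    "$@" &
    rwt_cmd=$!

    # Watchdog: once the timeout expires, kill the command.
    # Its output is discarded so that it never holds a pipe open.
    ( sleep "$rwt_secs"; kill -KILL "$rwt_cmd" ) >/dev/null 2>&1 &
    rwt_dog=$!

    # Collect the command's status without tripping "set -e".
    rwt_status=0
    wait "$rwt_cmd" || rwt_status=$?

    # The command has finished; the watchdog is no longer needed.
    kill "$rwt_dog" >/dev/null 2>&1 || true
    wait "$rwt_dog" >/dev/null 2>&1 || true

    # Assumption: status 137 means the watchdog's SIGKILL ended CMD.
    if [ "$rwt_status" -eq 137 ]; then
        return 99
    fi
    return "$rwt_status"
}
```

A caller would test the result the same way the probe script does, for example by running run_with_timeout 5 some_command and comparing $? with 99.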

Fault monitoring will be executed only on the physical host on which in.named is running, that is, on the host that masters the logical host used by in.named. The non-master physical hosts do not perform fault monitoring.

The fault monitor is started by the FM_START method and stopped by the FM_STOP method. It does not need an FM_INIT method, so HA-in.named does not register one when calling hareg(1M).

The following is a sample FM_START method for the in.named data service.


#! /bin/sh
 # Copyright 26 Oct 1996 Sun Microsystems, Inc.  All Rights Reserved.
 #ident "@(#)innamed_fm_start.sh  1.1  96/04/13 SMI"
 # HA in.named fm_start method
 # Called back by Sun Cluster as the FM_START method for HA in.named.
 #
 ARGV0=`basename $0`
 SYSLOG_FACILITY=`haget -f syslog_facility`

 MASTERED_LOGICAL_HOSTS="$1"
 if [ -z "$MASTERED_LOGICAL_HOSTS" ]; then
 		# This physical host does not currently master any logical hosts.
 		exit 0
 fi

 # Replace comma with space to form an sh word list:
 MASTERED_LOGICAL_HOSTS="`echo $MASTERED_LOGICAL_HOSTS | tr ',' ' '`"

 # Dynamically search the list of logical hosts which this physical
 # host currently masters, to see if one of them is the logical host
 # that HA-in.named uses.

 MYLH=
 for LH in $MASTERED_LOGICAL_HOSTS ; do
 	# Map logical hostname to administrative file system name:
 	PATHPREFIX_FS=`haget -f pathprefix $LH`
 	CONFIG="${PATHPREFIX_FS}/hainnamed/hainnamed.config"

 	if [ -f "$CONFIG" ]; then
 			MYLH=$LH
 			break
 	fi
 done
 if [ -z "$MYLH" ]; then
 	# This host does not currently master the logical host
 	# that HA-in.named uses.
 	exit 0
 fi

 # This host currently masters the logical host that HA in.named uses,
 # $MYLH.
 # Create an asynchronous process to periodically probe the in.named
 # daemon, under the control of the process monitor facility.
 # The asynchronous probe is in its own shell script:
 #     hainnamed_fmprobe
 # The asynchronous process will be terminated by the FM_STOP method.
 pmfadm -c hainnamedfm hainnamed_fmprobe $MYLH
 exit 0
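
The search that FM_START performs can be exercised without a cluster. The sketch below repeats the same steps (split the comma-separated list, map each logical host to its administrative file system, look for the configuration file) with one substitution: lookup_pathprefix is a hypothetical stand-in for haget -f pathprefix, and it assumes that each logical host's administrative file system is a directory under $DEMO_ROOT, a variable invented for this illustration.

```sh
#!/bin/sh
# lookup_pathprefix LOGICAL_HOST
# Hypothetical stand-in for `haget -f pathprefix LOGICAL_HOST`.
# Assumption for this sketch: the administrative file system of each
# logical host is simply a directory under $DEMO_ROOT.
lookup_pathprefix() {
    echo "${DEMO_ROOT}/$1"
}

# find_service_lh MASTERED_LIST
# MASTERED_LIST is a comma-separated list of logical hosts, in the
# form that FM_START receives as its first argument.
# Prints the first logical host whose administrative file system
# holds hainnamed/hainnamed.config and returns 0; prints nothing and
# returns 1 if no logical host in the list has that file.
find_service_lh() {
    # Replace commas with spaces to form an sh word list.
    for fs_lh in `echo "$1" | tr ',' ' '`; do
        fs_config="`lookup_pathprefix "$fs_lh"`/hainnamed/hainnamed.config"
        if [ -f "$fs_config" ]; then
            echo "$fs_lh"
            return 0
        fi
    done
    return 1
}
```

With DEMO_ROOT set to a scratch directory that contains hahost2/hainnamed/hainnamed.config and no such file under hahost1, find_service_lh hahost1,hahost2 prints hahost2.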

The following is a sample FM_STOP method for the in.named data service.


#! /bin/sh
 #
 # Copyright 26 Oct 1996 Sun Microsystems, Inc.  All Rights Reserved.
 #
 #ident "@(#)innamed_fm_stop.sh  1.1  96/04/13 SMI"
 #
 # HA in.named fm_stop method
 #
 # Called back by Sun Cluster as the FM_STOP method for HA in.named.
 #
 # Stop the asynchronous fault monitoring process that was created
 # earlier under the control of pmfd.
 #
 # Ignore errors when calling pmfadm just in case the hainnamed_fmprobe
 # is already not running.  Reasons for it being already not running
 # include the fact that it is started only on the physical host that
 # currently masters the logical host, the fact that FM_STOP can be
 # called even though FM_START has not been called, and the fact
 # that it may have died an early death all by itself.
 pmfadm -s hainnamedfm TERM >/dev/null 2>&1
 exit 0

The following is a sample probe script, hainnamed_fmprobe, for the in.named data service. It is started under the control of the process monitor facility by the FM_START method.


#! /bin/sh
 #
 # Copyright 26 Oct 1996 Sun Microsystems, Inc.  All Rights Reserved.
 #
 #ident "@(#)hainnamed_fmprobe.sh  1.1  96/04/13 SMI"
 #
 # Usage: hainnamed_fmprobe logical_host
 #
 # Periodically probes the in.named running on the logical_host.
 # If the probe times out, then this script will query the pmfd to
 # see if the pmfd is still running in.named:
 # (i) if so, this script assumes that in.named is hung and
 # sends a KILL signal to the in.named process, causing it to
 # die.  pmfd will restart in.named provided it has not used
 # up its ration of restarts per time period.
 # (ii) if not, this script will assume that in.named has exhausted
 # its ration of restarts.  This script will call hactl -g to give up
 # mastery of the logical host to some other new master physical host.
 #
 ARGV0=`basename $0`
 LOGICAL_HOST="$1"
 SYSLOG_FACILITY=`haget -f syslog_facility`
 PROBE_INTERVAL_SECS=60
 MIN_PROBE_SECS=`hactl -f min_probe_timeout_secs`
 PROBE_TIMEOUT_SECS=`expr $MIN_PROBE_SECS + 180`
 CLUSTER_KEY=`hactl -f cluster_key`
 NSLOOKUP=/usr/sbin/nslookup
 if [ ! -x $NSLOOKUP  -o  ! -s $NSLOOKUP ]; then
 	logger -p ${SYSLOG_FACILITY}.err \
 		"${ARGV0}: $NSLOOKUP does not exist or is not executable"
 	exit 1
 fi

 while true; do
 	# Call nslookup under a timeout, using hatimerun.
 	# The -norecurse option tells in.named not to consult
 	# other name service instances on other hosts beyond the
 	# one on $LOGICAL_HOST.
 	# The -retry=10000 option tells nslookup to keep retrying
 	# effectively forever: for a hung server, nslookup will
 	# never give up by itself; rather, the timeout on hatimerun
 	# will expire first.
 	hatimerun -t $PROBE_TIMEOUT_SECS \
 		$NSLOOKUP -norecurse -retry=10000 $LOGICAL_HOST $LOGICAL_HOST
 	if [ $? -ne 99 ]; then
 			sleep $PROBE_INTERVAL_SECS
 			continue
 	fi

 	# Here when the timeout occurred.
 	logger -p ${SYSLOG_FACILITY}.err \
 		"${ARGV0}: nslookup of in.named on $LOGICAL_HOST timed-out"
 	if pmfadm -q hainnamed; then
 			# The in.named process exists.  Kill it on the
 			# assumption that it is hung.  Sleep a short time,
 			# and if hainnamed still exists in the pmfd, assume
 			# that pmfd is restarting it (it has not yet used
 			# up its ration of restarts per time interval.)
 			logger -p ${SYSLOG_FACILITY}.err \
 				"${ARGV0}: KILLing hung in.named"
 			pmfadm -k hainnamed KILL
 			sleep 30
 			if pmfadm -q hainnamed; then
 					continue
 			fi
 	fi
 	# Here when pmfadm -q says that hainnamed no longer
 	# exists in pmfd.  Assume that the ration of restarts
 	# was exhausted.  Also assume that something is amiss
 	# that moving to a new master could improve.
 	logger -p ${SYSLOG_FACILITY}.err \
 		"${ARGV0}: in.named restarted too many times, not restarting"
 	logger -p ${SYSLOG_FACILITY}.err \
 		"${ARGV0}: giving up mastery of $LOGICAL_HOST"
 	hactl -g -s hainnamed -k $CLUSTER_KEY -l $LOGICAL_HOST
 done
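
The branching in the probe loop can be summarized as a small decision function. This is an illustrative sketch, not part of the sample scripts: probe_action and its action names are hypothetical, and the two yes/no arguments stand for the results of pmfadm -q hainnamed before and after the KILL signal is sent.

```sh
#!/bin/sh
# probe_action PROBE_STATUS REGISTERED_BEFORE REGISTERED_AFTER
#   PROBE_STATUS       exit status of the timed nslookup (99 = timed out)
#   REGISTERED_BEFORE  yes/no: pmfd still had in.named before the kill
#   REGISTERED_AFTER   yes/no: pmfd still had in.named after the kill
# Prints the action that the probe loop takes next.
probe_action() {
    if [ "$1" -ne 99 ]; then
        # No timeout: sleep for the probe interval, then probe again.
        echo sleep_then_probe
    elif [ "$2" = yes ] && [ "$3" = yes ]; then
        # Hung daemon killed and pmfd is restarting it: probe again.
        echo kill_then_probe
    elif [ "$2" = yes ]; then
        # Killed, but pmfd has dropped it: restarts are used up.
        echo kill_then_give_up
    else
        # pmfd had already dropped it: give up mastery.
        echo give_up
    fi
}
```

Only the two give-up outcomes lead to the hactl -g call; every other outcome keeps the logical host on its current master.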