How the Probe Program Works (Sun Cluster Data Services Developer's Guide for Solaris OS)

How the Probe Program Works

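The dns_probe program implements a continuously running process that verifies that the DNS resource controlled by the sample data service is running. The dns_probe program is launched by the dns_monitor_start method, which the RGM runs automatically after the sample data service is brought online. The probe is stopped by the dns_monitor_stop method, which the RGM runs before it brings the sample data service offline.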

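This section describes the major pieces of the PROBE method for the sample application. It does not describe functionality that is common to all callback methods, such as the parse_args() function, and it does not describe the use of the syslog() function. Common functionality is described in Providing Common Functionality to All Methods.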

For the complete listing of the PROBE method, see PROBE Program Code Listing.

What the Probe Program Does

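The probe runs in an infinite loop. It uses nslookup to verify that the correct DNS resource is running. If DNS is running, the probe sleeps for a prescribed interval (set by the system-defined property Thorough_probe_interval) and then checks again. If DNS is not running, the program either attempts to restart DNS locally or, depending on the number of restart attempts already made, requests that the RGM relocate the data service to a different node.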

Obtaining Property Values

This program requires the values of the following properties:

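Thorough_probe_interval – Sets the period for which the probe sleeps
Probe_timeout – Enforces the probe timeout on the nslookup command that does the probing
Network_resources_used – Provides the IP address of the DNS server
Retry_count and Retry_interval – Determine the number of restart attempts and the period over which those attempts are counted
RT_basedir – Provides the directory that contains the PROBE program and the gettime utility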

The scha_resource_get() function obtains the values of these properties and stores them in shell variables, as follows:

PROBE_INTERVAL=`scha_resource_get -O Thorough_probe_interval \
-R $RESOURCE_NAME -G $RESOURCEGROUP_NAME`

PROBE_TIMEOUT_INFO=`scha_resource_get -O Extension -R $RESOURCE_NAME \
-G $RESOURCEGROUP_NAME Probe_timeout`
# An extension property returns its type and its value; keep only the value.
PROBE_TIMEOUT=`echo $PROBE_TIMEOUT_INFO | awk '{print $2}'`

DNS_HOST=`scha_resource_get -O Network_resources_used -R $RESOURCE_NAME \
-G $RESOURCEGROUP_NAME`

RETRY_COUNT=`scha_resource_get -O Retry_count -R $RESOURCE_NAME -G \
$RESOURCEGROUP_NAME`

RETRY_INTERVAL=`scha_resource_get -O Retry_interval -R $RESOURCE_NAME -G \
$RESOURCEGROUP_NAME`

RT_BASEDIR=`scha_resource_get -O RT_basedir -R $RESOURCE_NAME -G \
 $RESOURCEGROUP_NAME`

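Note –

For system-defined properties, such as Thorough_probe_interval, the scha_resource_get() function returns the value only. For extension properties, such as Probe_timeout, the scha_resource_get() function returns the type and the value. Use the awk command to obtain the value only.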
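
The value extraction described in the note can be tried without a cluster. The following is a minimal sketch, assuming that the output for an extension property is the type followed by the value, separated by white space or a newline. The extension_value function name and the sample output (INT and 30) are hypothetical and are not part of the sample data service.

```shell
#!/bin/sh
# extension_value OUTPUT: print the value (second field) from the
# type-and-value output returned for an extension property.
extension_value() {
    # Leaving $1 unquoted joins the type and the value onto one line.
    echo $1 | awk '{ print $2 }'
}

# Hypothetical output for Probe_timeout: type first, then the value.
info=`printf 'INT\n30\n'`
PROBE_TIMEOUT=`extension_value "$info"`
echo "Probe_timeout is $PROBE_TIMEOUT"
```

For a system-defined property no such step is needed, because the output is already the bare value.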

Checking the Reliability of the Service

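The probe itself is an infinite while loop of nslookup commands. Before the while loop, a temporary file is set up to hold the nslookup replies. The probefail and retries variables are both initialized to 0.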

# Set up a temporary file for the nslookup replies.
DNSPROBEFILE=/tmp/.$RESOURCE_NAME.probe
probefail=0
retries=0

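The while loop carries out the following tasks:

Sets the sleep interval for the probe
Uses hatimerun to launch nslookup, passing the Probe_timeout value and identifying the target host
Sets the probefail variable according to whether the nslookup return code indicates success or failure
If probefail is still 0 (success), verifies that the reply to nslookup came from the sample data service and not from some other DNS server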

Here is the while loop code.

while :
do
   # The interval at which the probe needs to run is specified in the
   # property THOROUGH_PROBE_INTERVAL. Therefore, set the probe to sleep
   # for a duration of THOROUGH_PROBE_INTERVAL.
   sleep $PROBE_INTERVAL

   # Run an nslookup command of the IP address on which DNS is serving.
   hatimerun -t $PROBE_TIMEOUT /usr/sbin/nslookup $DNS_HOST $DNS_HOST \
   > $DNSPROBEFILE 2>&1

   retcode=$?
   if [ $retcode -ne 0 ]; then
      probefail=1
   fi

   # Make sure that the reply to nslookup comes from the HA-DNS
   # server and not from another nameserver mentioned in the
   # /etc/resolv.conf file.
   if [ $probefail -eq 0 ]; then
      # Get the name of the server that replied to the nslookup query.
      SERVER=`awk '$1=="Server:" { print $2 }' \
         $DNSPROBEFILE | awk -F. '{ print $1 }'`
      if [ -z "$SERVER" ]; then
         probefail=1
      else
         if [ $SERVER != $DNS_HOST ]; then
            probefail=1
         fi
      fi
   fi
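
The server check at the end of the loop can be tried in isolation. The following is a minimal sketch, assuming an nslookup reply whose Server: line carries a fully qualified host name. The reply_server function name, the expected host name, and the sample reply text are hypothetical and are not part of the sample data service.

```shell
#!/bin/sh
# reply_server FILE: print the short host name from the "Server:" line
# of an nslookup reply saved in FILE (same awk pipeline as the probe).
reply_server() {
    awk '$1 == "Server:" { print $2 }' "$1" | awk -F. '{ print $1 }'
}

# Hypothetical reply text, for illustration only.
reply=${TMPDIR:-/tmp}/reply_server.$$
printf 'Server:  dnshost.example.com\nAddress:  192.0.2.10\n' > "$reply"

server=`reply_server "$reply"`
rm -f "$reply"

# An empty name, or a name other than the expected host, counts as a failure.
expected=dnshost
if [ -z "$server" ] || [ "$server" != "$expected" ]; then
    echo "probe failed"
else
    echo "reply came from $server"
fi
```

Because only the part of the name before the first dot is compared, the check works only when the expected host is given as a short host name.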

Comparing Restart With Failover

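If the probefail variable is nonzero (failure), either the nslookup command timed out or the reply came from a server other than the DNS of the sample service. In either case, the DNS server is not functioning as expected, and the fault monitor calls the decide_restart_or_failover() function to determine whether to restart the data service locally or to request that the RGM relocate the data service to a different node. If the probefail variable is 0, a message is generated that indicates that the probe was successful.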

   if [ $probefail -ne 0 ]; then
         decide_restart_or_failover
   else
         logger -p ${SYSLOG_FACILITY}.err \
         -t [$SYSLOG_TAG] \
         "${ARGV0} Probe for resource HA-DNS successful"
   fi

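The decide_restart_or_failover() function uses a time window (Retry_interval) and a failure count (Retry_count) to determine whether to restart DNS locally or to request that the RGM relocate the data service to a different node. The function implements the following conditional logic. The code for decide_restart_or_failover() appears in PROBE Program Code Listing.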

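If this is the first failure, restart the data service. Log an error message and increment the counter in the retries variable.
If this is not the first failure, but the time window has been exceeded, restart the data service. Log an error message, reset the counter, and slide the window.
If the time is still within the window and the retry counter has been exceeded, fail over to another node. If the failover does not succeed, log an error and exit the probe program with status 1 (failure).
If the time is still within the window but the retry counter has not been exceeded, restart the data service. Log an error message and increment the counter in the retries variable.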

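If the number of restarts reaches the limit during the time interval, the function requests that the RGM relocate the data service to a different node. If the number of restarts is under the limit, or if the interval has been exceeded so that the count begins again, the function attempts to restart DNS on the same node. Note the following about this function: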

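The gettime utility is used to track the time between restarts. This utility is a C program that is located in the RT_basedir directory.
The Retry_count and Retry_interval system-defined resource properties determine the number of restart attempts and the time interval over which to count them. In the RTR file, these properties default to two attempts in a period of 5 minutes (300 seconds), although the cluster administrator can change these values.
The restart_service() function is called to attempt to restart the data service on the same node. For information about this function, see the next section, Restarting the Data Service.
The scha_control() API function, with the GIVEOVER option, brings the resource group that contains the sample data service offline and brings it back online on a different node.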
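
The conditional logic described above can be sketched as a pure function that only reports the decision. This is a minimal illustration, not the code from the sample data service: the decide_action function name, its argument order, and its output format are hypothetical, times are plain second counts rather than gettime output, and the restart and failover actions are printed instead of being carried out.

```shell
#!/bin/sh
# decide_action NOW WINDOW_START RETRIES RETRY_COUNT RETRY_INTERVAL
# Prints "ACTION RETRIES WINDOW_START" with the updated counter and window.
decide_action() {
    da_now=$1 da_start=$2 da_retries=$3 da_max=$4 da_interval=$5

    if [ "$da_retries" -eq 0 ]; then
        # First failure: restart, open the window, bump the counter.
        echo "restart 1 $da_now"
    elif [ $(($da_now - $da_start)) -ge "$da_interval" ]; then
        # Window exceeded: restart, reset the counter, slide the window.
        echo "restart 1 $da_now"
    elif [ "$da_retries" -ge "$da_max" ]; then
        # Still inside the window and out of retries: fail over.
        echo "failover $da_retries $da_start"
    else
        # Still inside the window with retries left: restart, bump the counter.
        echo "restart $(($da_retries + 1)) $da_start"
    fi
}

# Two attempts allowed per 300 seconds, as in the RTR file defaults.
decide_action 1000 0    0 2 300   # first failure
decide_action 1100 1000 1 2 300   # second failure inside the window
decide_action 1200 1000 2 2 300   # third failure inside the window
decide_action 1400 1000 2 2 300   # failure after the window has elapsed
```

Keeping the decision separate from the actions makes the window and counter handling easy to check by hand before wiring in the calls that restart or relocate the service.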

Restarting the Data Service

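The decide_restart_or_failover() function calls the restart_service() function to attempt to restart the data service on the same node. The restart_service() function implements the following logic: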

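Determines whether the data service is still registered under PMF. If the service is still registered, the function does the following:
- Obtains the Stop method name and the Stop_timeout value for the data service
- Uses hatimerun to launch the Stop method for the data service, passing the Stop_timeout value
- If the data service stops successfully, obtains the Start method name and the Start_timeout value for the data service
- Uses hatimerun to launch the Start method for the data service, passing the Start_timeout value
If the data service is no longer registered under PMF, the data service has exceeded the maximum number of retries allowed under PMF. The function calls scha_control() with the GIVEOVER option to fail over the data service to a different node.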

function restart_service
{

        # To restart the data service, first verify that the 
        # data service itself is still registered under PMF.
        pmfadm -q $PMF_TAG
        if [[ $? -eq 0 ]]; then
                # Since the TAG for the data service is still registered under
                # PMF, first stop the data service and start it back up again.

                # Obtain the Stop method name and the STOP_TIMEOUT value for
                # this resource.
                STOP_TIMEOUT=`scha_resource_get -O STOP_TIMEOUT \
                        -R $RESOURCE_NAME -G $RESOURCEGROUP_NAME`
                STOP_METHOD=`scha_resource_get -O STOP \
                        -R $RESOURCE_NAME -G $RESOURCEGROUP_NAME`
                hatimerun -t $STOP_TIMEOUT $RT_BASEDIR/$STOP_METHOD \
                        -R $RESOURCE_NAME -G $RESOURCEGROUP_NAME \
                        -T $RESOURCETYPE_NAME

                if [[ $? -ne 0 ]]; then
                        logger -p ${SYSLOG_FACILITY}.err -t [$SYSLOG_TAG] \
                                "${ARGV0} Stop method failed."
                        return 1
                fi

                # Obtain the START method name and the START_TIMEOUT value for
                # this resource.
                START_TIMEOUT=`scha_resource_get -O START_TIMEOUT \
                        -R $RESOURCE_NAME -G $RESOURCEGROUP_NAME`
                START_METHOD=`scha_resource_get -O START \
                        -R $RESOURCE_NAME -G $RESOURCEGROUP_NAME`
                hatimerun -t $START_TIMEOUT $RT_BASEDIR/$START_METHOD \
                        -R $RESOURCE_NAME -G $RESOURCEGROUP_NAME \
                        -T $RESOURCETYPE_NAME

                if [[ $? -ne 0 ]]; then
                        logger -p ${SYSLOG_FACILITY}.err -t [$SYSLOG_TAG] \
                                "${ARGV0} Start method failed."
                        return 1
                fi


        else
                # The absence of the TAG for the dataservice 
                # implies that the data service has already
                # exceeded the maximum retries allowed under PMF.
                # Therefore, do not attempt to restart the
                # data service again, but try to failover
                # to another node in the cluster.
                scha_control -O GIVEOVER -G $RESOURCEGROUP_NAME \
                        -R $RESOURCE_NAME
        fi

        return 0
}

Probe Exit Status

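The PROBE program of the sample data service exits with failure if the attempts to restart locally fail and the attempt to fail over to a different node also fails. The program logs the message Failover attempt failed.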
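
The way a failed failover request turns into the exit status can be sketched as follows. This is a minimal illustration under stated assumptions: the give_over wrapper is hypothetical, the true and false commands stand in for the real failover request, and the message is written to standard error rather than to syslog.

```shell
#!/bin/sh
# give_over CMD...: run the failover command passed as arguments. On
# failure, report the message and return 1 so that the caller can exit
# with that status.
give_over() {
    if "$@"; then
        return 0
    fi
    echo "Failover attempt failed." >&2
    return 1
}

# Stand-in commands; a real probe would pass its failover request
# and call "exit 1" when give_over fails.
give_over true  && echo "relocation requested"
give_over false || echo "probe would exit with status 1"
```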