Sun Cluster Data Services Developer's Guide for Solaris OS

Designing the Fault Monitor Daemon

A resource type implementation that uses the DSDL typically has a fault monitor daemon that carries out the following responsibilities:

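Periodically monitors the health of the application that is being managed. How the monitor daemon does this depends largely on the particular application and can vary widely from one resource type to another. The DSDL contains built-in utility functions that perform health checks for simple TCP-based services. You can use these utilities for applications that use ASCII-based protocols, such as HTTP, NNTP, IMAP, and POP3.
Tracks the problems that the application encounters by using the resource properties Retry_interval and Retry_count. When the application fails completely, the fault monitor needs to determine whether the PMF action script should restart the service, or whether the application failures have accumulated so rapidly that a failover is required. The DSDL utilities scds_fm_action() and scds_fm_sleep() help you implement this mechanism.
Takes appropriate action, typically either restarting the application or attempting a failover of the containing resource group. The DSDL utility scds_fm_action() implements this algorithm. For this purpose, the utility computes the number of probe failures that have accumulated in the past Retry_interval seconds.
Updates the resource status so that the health of the application is visible to the scstat command and to the cluster management GUI.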
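
The Retry_interval and Retry_count bookkeeping described above can be pictured as a sliding window of failure timestamps. The following is only a minimal sketch of that idea, not the DSDL's implementation: the type failure_history_t, the functions history_prune(), history_record(), and history_needs_failover(), and the HISTORY_MAX cap are all hypothetical names introduced for illustration. It also assumes that a failover is warranted once the failures inside the window exceed Retry_count.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define HISTORY_MAX 64  /* hypothetical cap on remembered failures */

/* Hypothetical failure history: timestamps of recent probe failures. */
typedef struct {
    time_t when[HISTORY_MAX];
    size_t count;
} failure_history_t;

/* Drop entries that are retry_interval seconds old or older. */
static void
history_prune(failure_history_t *h, time_t now, long retry_interval)
{
    size_t i, kept = 0;

    assert(h != NULL);
    for (i = 0; i < h->count; i++) {
        if ((long)(now - h->when[i]) < retry_interval)
            h->when[kept++] = h->when[i];
    }
    h->count = kept;
}

/*
 * Record a failure at time `now` and return the number of failures
 * inside the last retry_interval seconds, including this one.
 */
static size_t
history_record(failure_history_t *h, time_t now, long retry_interval)
{
    size_t i;

    history_prune(h, now, retry_interval);
    if (h->count == HISTORY_MAX) {
        /* Window is full: discard the oldest entry to make room. */
        for (i = 1; i < h->count; i++)
            h->when[i - 1] = h->when[i];
        h->count--;
    }
    h->when[h->count++] = now;
    return (h->count);
}

/* Assumption: fail over once the windowed failures exceed retry_count. */
static int
history_needs_failover(const failure_history_t *h, int retry_count)
{
    return ((int)h->count > retry_count);
}
```

With Retry_interval set to 300 and Retry_count set to 2, failures recorded at 1000, 1100, and 1200 seconds bring the count to 3, so this sketch would choose a failover. A further failure at 1400 seconds sees only two entries in the window, because the first two have aged out.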

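The DSDL utilities are designed so that the main loop of the fault monitor daemon can be represented by the pseudocode at the end of this section.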

Keep the following factors in mind when you implement a fault monitor with the DSDL:

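scds_fm_sleep() detects the death of an application process rapidly, because the PMF delivers notification of the process's death asynchronously. Fault detection time is therefore reduced significantly, which increases the availability of the service. Without this notification, a fault monitor would have to wake up at frequent intervals to check the health of the service and to find out whether the application process has died.
If the RGM rejects the attempt to fail over the service through the scha_control API, scds_fm_action() resets, or forgets, its current failure history. The function resets the history because the history already exceeds Retry_count. If the monitor daemon wakes up in the next iteration and its health check of the application again does not complete successfully, the monitor daemon would once more attempt to call the scha_control() function. That call is likely to be rejected again, because the situation that led to the rejection in the previous iteration is still in effect. Resetting the history ensures that, in the next iteration, the fault monitor at least attempts to correct the situation locally (for example, by restarting the application).
scds_fm_action() does not reset the application failure history when a restart fails, because if the situation does not correct itself you typically want scha_control() to be issued soon afterward.
The utility scds_fm_action() updates the resource status to SCHA_RSSTATUS_OK, SCHA_RSSTATUS_DEGRADED, or SCHA_RSSTATUS_FAULTED, depending on the failure history. This status is consequently available to cluster system management.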
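
The two history rules above (reset after a rejected failover, no reset after a failed restart) can be illustrated with a small state machine. This is a sketch under stated assumptions rather than the DSDL's actual code: fm_state_t, fm_action_t, fm_choose_action(), and fm_after_action() are hypothetical names, and the failure count is assumed to be already limited to the Retry_interval window.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; the real DSDL types and constants differ. */
typedef enum { ACTION_NONE, ACTION_RESTART, ACTION_FAILOVER } fm_action_t;

typedef struct {
    int failures;     /* complete failures inside the Retry_interval window */
    int retry_count;  /* value of the Retry_count property */
} fm_state_t;

/* Pick an action for one probe; probe_failed is nonzero on complete failure. */
static fm_action_t
fm_choose_action(fm_state_t *s, int probe_failed)
{
    assert(s != NULL);
    if (!probe_failed)
        return (ACTION_NONE);
    s->failures++;
    return ((s->failures > s->retry_count) ? ACTION_FAILOVER : ACTION_RESTART);
}

/* Adjust the history once the action has been tried; action_ok is nonzero on success. */
static void
fm_after_action(fm_state_t *s, fm_action_t action, int action_ok)
{
    assert(s != NULL);
    if (action == ACTION_FAILOVER && !action_ok) {
        /*
         * Failover was rejected: forget the history so that the
         * next failure leads to a local restart instead of another
         * failover request that would probably be rejected too.
         */
        s->failures = 0;
    }
    /* A failed restart leaves the history intact on purpose. */
}
```

In this sketch, with Retry_count set to 2, the third consecutive failure selects a failover. If that failover is rejected, the history drops to zero and the following failure selects a restart again.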

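In most cases, you can implement the application-specific health check in a separate stand-alone utility (svc_probe(), for example) and integrate it into the following generic main loop.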

for (;;) {
    /*
     * Sleep for a duration of Thorough_probe_interval between
     * successive probes.
     */
    (void) scds_fm_sleep(scds_handle,
        scds_get_rs_thorough_probe_interval(scds_handle));

    /*
     * Now probe all the IP addresses we use. Loop over:
     * 1. All net resources we use.
     * 2. All IP addresses in a given resource.
     * For each IP address that is probed, compute the failure history.
     */
    probe_result = 0;

    /*
     * Iterate through all the resources to get each
     * IP address to use for calling svc_probe().
     */
    for (ip = 0; ip < netaddr->num_netaddrs; ip++) {
        /*
         * Grab the hostname and port on which the health has to
         * be monitored. HA-XFS supports only one port and hence
         * obtains the port value from the first entry in the
         * array of ports.
         */
        hostname = netaddr->netaddrs[ip].hostname;
        port = netaddr->netaddrs[ip].port_proto.port;

        /* Latch probe start time. */
        ht1 = gethrtime();
        probe_result = svc_probe(scds_handle, hostname, port, timeout);
        /* Latch probe end time. */
        ht2 = gethrtime();

        /* Convert the elapsed nanoseconds to milliseconds. */
        dt = (ulong_t)((ht2 - ht1) / 1e6);

        /*
         * Update the service probe history, compute the
         * failure history, and take action if needed.
         */
        (void) scds_fm_action(scds_handle, probe_result, (long)dt);
    }   /* Each net resource */
}       /* Keep probing forever */
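
The body of svc_probe() is application specific and is not shown in the loop. As a rough illustration of what a simple TCP-based check might do, the sketch below uses plain POSIX sockets rather than the DSDL's own TCP utilities. The function name tcp_connect_probe(), the PROBE_OK and PROBE_COMPLETE_FAILURE constants, and the choice to report a timeout as half of a complete failure are assumptions made for this example only.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netdb.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define PROBE_OK                0
#define PROBE_COMPLETE_FAILURE  100  /* assumed failure scale for this sketch */

/*
 * Hypothetical probe: returns PROBE_OK when a TCP connection to
 * hostname:port is established within timeout_secs, half of
 * PROBE_COMPLETE_FAILURE on timeout, and PROBE_COMPLETE_FAILURE otherwise.
 */
static int
tcp_connect_probe(const char *hostname, int port, int timeout_secs)
{
    struct addrinfo hints, *res = NULL;
    char service[16];
    int fd, result = PROBE_COMPLETE_FAILURE;

    assert(hostname != NULL);
    memset(&hints, 0, sizeof (hints));
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_STREAM;
    (void) snprintf(service, sizeof (service), "%d", port);
    if (getaddrinfo(hostname, service, &hints, &res) != 0)
        return (PROBE_COMPLETE_FAILURE);

    fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0) {
        freeaddrinfo(res);
        return (PROBE_COMPLETE_FAILURE);
    }

    /* Non-blocking connect so that poll() can enforce the timeout. */
    (void) fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
    if (connect(fd, res->ai_addr, res->ai_addrlen) == 0) {
        result = PROBE_OK;
    } else if (errno == EINPROGRESS) {
        struct pollfd pfd;
        int n;

        pfd.fd = fd;
        pfd.events = POLLOUT;
        pfd.revents = 0;
        n = poll(&pfd, 1, timeout_secs * 1000);
        if (n == 0) {
            /* Timed out: treat as a partial failure. */
            result = PROBE_COMPLETE_FAILURE / 2;
        } else if (n > 0) {
            int err = 0;
            socklen_t len = sizeof (err);

            /* SO_ERROR tells whether the connect actually succeeded. */
            if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) == 0 &&
                err == 0)
                result = PROBE_OK;
        }
    }

    (void) close(fd);
    freeaddrinfo(res);
    return (result);
}
```

A real probe would normally go further than a connection check, for example by sending a protocol-level request and validating the reply. The return value is shaped so that it could be passed to scds_fm_action() as the probe result, assuming that the same failure scale applies.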