Sun Cluster Data Services Developer's Guide for Solaris OS

Designing the Fault Monitor Daemon

Resource type implementations that use the DSDL typically have a fault monitor daemon that carries out the following tasks:

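Periodically monitors the health of the application that is being managed. This particular task of the monitor daemon depends largely on the specific application and can vary widely from one resource type to another. The DSDL contains some built-in utility functions that perform health checks for simple TCP-based services. You can use these utilities to implement health checks for applications that use ASCII-based protocols, such as HTTP, NNTP, IMAP, and POP3.
Tracks the problems that the application encounters by using the resource properties Retry_interval and Retry_count. When the application fails completely, the fault monitor must determine whether the PMF action script should restart the service, or whether the application failures have accumulated so rapidly that a failover is required. The DSDL utilities scds_fm_action() and scds_fm_sleep() are intended to help you implement this mechanism.
Takes action, typically either restarting the application or attempting a failover of the containing resource group. The DSDL utility scds_fm_action() implements this algorithm. This utility computes the current accumulation of probe failures over the past Retry_interval seconds.
Updates the resource state so that the health of the application is available to the scstat command and to the cluster management GUI.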
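The Retry_interval and Retry_count bookkeeping described in the tasks above can be sketched as a sliding window of failure timestamps. This is a minimal illustration and not the DSDL implementation: the names fail_history_t, history_prune(), and history_record() are hypothetical, and the policy (restart while the number of complete failures inside the window is at most Retry_count, fail over once it exceeds Retry_count) is a simplifying assumption.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_EVENTS 64   /* hypothetical cap on remembered failures */

typedef enum { ACTION_RESTART, ACTION_FAILOVER } action_t;

typedef struct {
    long times[MAX_EVENTS];   /* timestamps (seconds) of complete failures */
    size_t count;             /* number of valid entries in times[] */
} fail_history_t;

/* Forget failures that are retry_interval seconds old or older. */
static void history_prune(fail_history_t *h, long now, long retry_interval)
{
    size_t kept = 0;

    for (size_t i = 0; i < h->count; i++) {
        if (now - h->times[i] < retry_interval)
            h->times[kept++] = h->times[i];
    }
    h->count = kept;
}

/* Record one complete failure at time `now` and choose an action. */
static action_t history_record(fail_history_t *h, long now,
    long retry_interval, size_t retry_count)
{
    assert(h != NULL && retry_interval > 0);

    history_prune(h, now, retry_interval);
    if (h->count == MAX_EVENTS) {
        /* Window is full: drop the oldest entry to make room. */
        for (size_t i = 1; i < h->count; i++)
            h->times[i - 1] = h->times[i];
        h->count--;
    }
    h->times[h->count++] = now;

    /* Assumed policy: too many failures in the window means failover. */
    return (h->count > retry_count) ? ACTION_FAILOVER : ACTION_RESTART;
}
```

With Retry_interval set to 300 and Retry_count set to 2, failures at 0, 100, and 200 seconds would lead to two restarts and then a failover under this assumed policy, while a third failure at 400 seconds would fall into a fresh window and lead to another restart.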

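The DSDL utilities are designed so that the main loop of the fault monitor daemon can be represented by the pseudocode at the end of this section.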

Keep the following factors in mind when you implement a fault monitor with the DSDL:

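scds_fm_sleep() detects the death of an application process quickly because notification of the process death through the PMF is asynchronous. Fault detection time is therefore reduced significantly, which increases the availability of the service. Without this notification, a fault monitor would only wake up at intervals to check the health of the service and would find the application process already dead.
If the RGM rejects the attempt to fail over the service through the scha_control API, scds_fm_action() resets (forgets) its current failure history. The function resets the history because the history already exceeds Retry_count. If the monitor daemon wakes up in the next iteration and again cannot complete its health check of the daemon successfully, the monitor daemon calls the scha_control() function again. That call is likely to be rejected once more, because the situation that led to the rejection in the last iteration is still in effect. Resetting the history ensures that, in the next iteration, the fault monitor at least attempts to correct the situation locally (for example, by restarting the application).
scds_fm_action() does not reset the application failure history when an application restart fails, because you typically want scha_control() to be issued soon afterward if the situation does not correct itself.
Depending on the failure history, the scds_fm_action() utility updates the resource status to SCHA_RSSTATUS_OK, SCHA_RSSTATUS_DEGRADED, or SCHA_RSSTATUS_FAULTED. As a result, this status is available to cluster system management.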
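The history handling in the points above can be illustrated with a small model. This is a sketch under stated assumptions, not the DSDL code: history_t, status_from_history(), and history_after_action() are hypothetical names, the local status_t values stand in for SCHA_RSSTATUS_OK, SCHA_RSSTATUS_DEGRADED, and SCHA_RSSTATUS_FAULTED, COMPLETE_FAILURE mirrors the DSDL constant SCDS_PROBE_COMPLETE_FAILURE (assumed here to be 100), and the thresholds that map history to status are a guess.

```c
#include <assert.h>
#include <stddef.h>

#define COMPLETE_FAILURE 100   /* stand-in for SCDS_PROBE_COMPLETE_FAILURE */

/* Local stand-ins for the SCHA_RSSTATUS_* values. */
typedef enum { STATUS_OK, STATUS_DEGRADED, STATUS_FAULTED } status_t;

/* Outcomes of a recovery action that did not fix the service. */
typedef enum { OUTCOME_RESTART_FAILED, OUTCOME_FAILOVER_REJECTED } outcome_t;

typedef struct {
    int accumulated;   /* sum of probe failure values inside the window */
    int restarts;      /* restarts attempted inside the window */
} history_t;

/* Assumed mapping from accumulated failures to a resource status. */
static status_t status_from_history(const history_t *h)
{
    assert(h != NULL);

    if (h->accumulated >= COMPLETE_FAILURE)
        return STATUS_FAULTED;
    if (h->accumulated > 0)
        return STATUS_DEGRADED;
    return STATUS_OK;
}

/* Apply the reset rules described in the text to the history. */
static void history_after_action(history_t *h, outcome_t outcome)
{
    assert(h != NULL);

    if (outcome == OUTCOME_FAILOVER_REJECTED) {
        /* Failover was refused: forget the history so that the next
         * iteration starts with a local fix such as a restart. */
        h->accumulated = 0;
        h->restarts = 0;
    }
    /* OUTCOME_RESTART_FAILED: keep the history so that a failover
     * request can follow quickly if the problem persists. */
}
```

The asymmetry is the point of the design: a rejected failover clears the history so the monitor retries locally, whereas a failed restart leaves the history intact so escalation is not delayed.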

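In most cases, you can implement the application-specific health check in a separate stand-alone utility (svc_probe(), for example). You can then integrate it with the following generic main loop.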

for (;;) {
    /*
     * Sleep for a duration of thorough_probe_interval between
     * successive probes.
     */
    (void) scds_fm_sleep(scds_handle,
        scds_get_rs_thorough_probe_interval(scds_handle));

    /*
     * Now probe all the IP addresses we use. Loop over:
     * 1. All net resources we use.
     * 2. All IP addresses in a given resource.
     * For each IP address that is probed, compute the
     * failure history.
     */
    probe_result = 0;

    /*
     * Iterate through all the resources to get each
     * IP address to use for calling svc_probe().
     */
    for (ip = 0; ip < netaddr->num_netaddrs; ip++) {
        /*
         * Grab the hostname and port on which the
         * health has to be monitored.
         */
        hostname = netaddr->netaddrs[ip].hostname;
        /*
         * HA-XFS supports only one port and hence
         * obtains the port value from the first
         * entry in the array of ports.
         */
        port = netaddr->netaddrs[ip].port_proto.port;

        /* Latch probe start time. */
        ht1 = gethrtime();
        probe_result = svc_probe(scds_handle, hostname, port, timeout);
        /* Latch probe end time. */
        ht2 = gethrtime();

        /* Convert nanoseconds to milliseconds. */
        dt = (ulong_t)((ht2 - ht1) / 1e6);

        /*
         * Update the service probe history and
         * take action if necessary.
         */
        (void) scds_fm_action(scds_handle, probe_result, (long)dt);
    }    /* Each net resource */
}    /* Keep probing forever */
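The loop above leaves svc_probe() to the resource type author. As one hedged illustration of an application-specific check for an ASCII-based protocol, the sketch below scores the first line of an HTTP response as a probe result. Everything here is hypothetical: score_http_status_line() is not a DSDL function, the PROBE_* values are local stand-ins (with PROBE_COMPLETE_FAILURE mirroring SCDS_PROBE_COMPLETE_FAILURE, assumed to be 100), and treating 5xx responses as a partial failure is an arbitrary policy choice. Connecting to the service and reading the line are omitted.

```c
#include <assert.h>
#include <stdio.h>

#define PROBE_OK                 0
#define PROBE_PARTIAL_FAILURE   50   /* assumption: half of a complete failure */
#define PROBE_COMPLETE_FAILURE 100   /* stand-in for SCDS_PROBE_COMPLETE_FAILURE */

/*
 * Score the first line of an HTTP response, for example
 * "HTTP/1.1 200 OK". Returns one of the PROBE_* values.
 */
static int score_http_status_line(const char *line)
{
    int code = 0;

    /* No response at all counts as a complete failure. */
    if (line == NULL)
        return PROBE_COMPLETE_FAILURE;

    /* A reply that is not an HTTP status line is a complete failure. */
    if (sscanf(line, "HTTP/%*d.%*d %d", &code) != 1)
        return PROBE_COMPLETE_FAILURE;

    /* Status codes outside 100..599 are treated as garbage. */
    if (code < 100 || code > 599)
        return PROBE_COMPLETE_FAILURE;
    assert(code >= 100 && code <= 599);

    /* Assumed policy: the server answered, but with a server error. */
    if (code >= 500)
        return PROBE_PARTIAL_FAILURE;

    return PROBE_OK;
}
```

A value computed this way could be returned from svc_probe() and passed as the probe result to scds_fm_action(), so that repeated partial failures accumulate in the failure history rather than triggering an immediate restart.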