The fault monitors for all of the Sun Cluster HA for Netscape data services share a common methodology for fault monitoring of the data service instance. All use the concept of remote and local fault monitoring.
The fault monitor process running on the node which currently masters the logical host that the data service is running on is called the local fault monitor. The fault monitor process running on a node which is a possible master of the logical host is called a remote fault monitor.
Sun Cluster HA for Netscape fault monitors periodically perform a simple data service operation with the server. If the operation fails or times out, that particular probe is declared to have failed.
When a probe fails, the local fault probe attempts to restart the data service locally. This is usually sufficient to restore the data service. The remote probe keeps a record of the probe failure but does not take any action. Upon two successive failures of the probe (indicating that a restart of the data service did not correct the problem), the remote probe invokes the hactl(1M) command in "takeover" mode to initiate a failover of the logical host. Some Netscape data services use a sliding window algorithm of probe successes and failures, in which a pre-configured number of failures within the window causes the probe to take action.
You can use the hadsconfig(1M) command to tune probe interval and timeout values for Sun Cluster HA for Netscape fault monitors. Reducing the probe interval value for fault probing results in faster detection of problems, but it also might result in spurious failovers due to transient problems. Similarly, reducing the probe timeout value results in faster detection of problems related to the data service instances, but also might result in spurious takeovers if the data service is merely busy due to heavy load. For most situations, the default values for these parameters are sufficient. The parameters are described in the hadsconfig(1M) man page and in the configuration sections of each data service chapter in the Sun Cluster 2.2 Software Installation Guide.
The Sun Cluster HA for DNS fault probe performs an nslookup operation to check the health of the Sun Cluster HA for DNS server. It looks up the domain name of the Sun Cluster HA for DNS logical host from the Sun Cluster HA for DNS server. Depending upon the configuration of your /etc/resolv.conf file, nslookup might contact other servers if the primary Sun Cluster HA for DNS server is down. Thus, the nslookup operation might succeed, even when the primary Sun Cluster HA for DNS server is down. To guard against this, the fault probe verifies whether replies come from the primary Sun Cluster HA for DNS server or other servers.
The Sun Cluster HA for Netscape HTTP fault probe checks the health of the http server by trying to connect to it on the logical host address on the configured port. Note that the fault monitor uses the port number specified to hadsconfig(1M) during configuration of the nshttp service instance.
The Sun Cluster HA for Netscape News fault probe checks the health of the news server by connecting to it on the logical host IP addresses and the nntp port number. It then attempts to execute the NNTP date command on the news server, and expects a response from the server within the specified probe timeout period.
The Sun Cluster HA for Netscape Mail or Message Server fault probe checks the health of the mail or message server by probing it on all three service ports served by the server, namely the SMTP, IMAP, and POP3 ports:
SMTP (port 25)--Executes an SMTP "hello" message on the server and then executes a quit command.
IMAP (port 143)--Executes an IMAP4 CAPABILITY command followed by an IMAP4 LOGOUT command.
POP3 (port 110)--Executes a quit command.
For all of these tests, the fault probe expects a response string from the server within the probe timeout interval. Note that a probe failure on any of the above three service ports is considered a failure of the server. To avoid spurious failovers, the nsmail fault probe uses a sliding window algorithm for tracking probe failures and successes. If the number for probe failures in the sliding window is greater than a pre-configured number, a takeover is initiated by the remote probe.
The Sun Cluster HA for Netscape LDAP local probe can perform a variable number of local restarts before initiating a failover. The local restart mechanism uses a sliding window algorithm; only when the number of retries is exhausted within that window does a failover occur.
The Sun Cluster HA for Netscape LDAP remote probe uses a simple telnet connection to the LDAP port to check the status of the server. The LDAP port number is the one specified during initial set-up with hadsconfig(1M).
The local probe:
Probes the server by running a monitoring script. The script performs a search for the LDAP common name "monitor." The common name is defined by the Directory Server and is used only for monitoring. The probe uses the ldapsearch utility to perform this operation.
Tries to restart the server locally, upon detecting a problem with the server.
Initiates the hactl(1M) command in the giveup mode upon deciding that the local node cannot reliably run the directory server instance, while the remote probe initiates the hactl(1M) command in the takeover mode. If there are multiple possible masters of the logical host, all of the remote probes invoke the takeover operation in unison. However, after the takeover, the underlying framework ensures that a unique master node is chosen for the Directory Server.