Sun Cluster 2.2 System Administration Guide

Appendix B Sun Cluster Fault Detection

This appendix describes fault detection for Sun Cluster, and includes the following topics:

This section presents an overview of Sun Cluster fault detection. This fault detection encompasses three general approaches:

A heartbeat mechanism
Fault monitoring of networks
Fault monitoring of specific data services

Fault monitoring performs sanity checks to ensure that the faulty node is the one being blamed for a problem, and not the healthy node.

Some of the information presented is specific to this release of Sun Cluster, and is expected to change as the product evolves. The time estimates given to detect various faults are rough approximations and are intended only to give the reader a general understanding of how Sun Cluster behaves. This document is not intended to be a program logic manual for the internals of Sun Cluster nor does it describe a programming interface.

Fault Detection Overview

As noted in the basic Sun Cluster architecture discussion, when one server goes down the other server takes over. This raises an important issue: how does one server recognize that another server is down?

Sun Cluster uses three methods of fault detection.

Heartbeat and SMA link monitoring - These monitors run over the private links. For Ethernet, there are two monitors: an SMA link monitor and a cluster membership monitor. For SCI, there are three monitors: an SMA link monitor, a cluster membership monitor, and a low-level SCI heartbeat monitor.
Network fault monitoring - All servers' public network connections are monitored: if a server cannot communicate over the public network because of a hardware or software problem, then another server in the server set will take over.
Data service-specific fault probes - Each Sun Cluster data service performs fault detection that is specific for that data service. This last method addresses the issue of whether the data service is performing useful work, not just the low-level question of whether the machine and operating system appear to be running.

For the second and third methods, one server is probing the other server for a response. After detecting an apparent problem, the probing server carries out a number of sanity checks of itself before forcibly taking over from the other server. These sanity checks try to ensure that a problem on the probing server is not the real cause of the lack of response from the other server. These sanity checks are provided by hactl(1M), a library subroutine that is part of the Sun Cluster base framework; hence, data service-specific fault detection code need only call hactl(1M) to perform sanity checks on the probing server. See the hactl(1M) man page for details.

The Heartbeat Mechanism: Cluster Membership Monitor

Sun Cluster uses a heartbeat mechanism. The heartbeat processing is performed by a real-time high-priority process which is pinned in memory, that is, it is not subject to paging. This process is called the cluster membership monitor. In a ps(1) listing, its name appears as clustd.

Each server sends out an "I am alive" message, or heartbeat, over both private links approximately once every two seconds. In addition, each server is listening for the heartbeat messages from other servers, on both private links. Receiving the heartbeat on either private link is sufficient evidence that another server is running. A server will decide that another server is down if it does not hear a heartbeat message from that server for a sufficiently long period of time, approximately 12 seconds.

In the overall fault detection strategy, the cluster membership monitor heartbeat mechanism is the first line of defense. The absence of the heartbeat will immediately detect hardware crashes and operating system panics. It might also detect some gross operating system problems, for example, leaking away all communication buffers. The heartbeat mechanism is also Sun Cluster's fastest fault detection method. Because the cluster membership monitor runs at real-time priority and because it is pinned in memory, a relatively short timeout for the absence of heartbeats is justified. Conversely, for the other fault detection methods, Sun Cluster must avoid labelling a server as being down when it is merely very slow. For those methods, relatively long timeouts of several minutes are used, and, in some cases, two or more such timeouts are required before Sun Cluster will perform a takeover.

The fact that the cluster membership monitor runs at real-time priority and is pinned in memory leads to the paradox that the membership monitor might be alive even though its server is performing no useful work at the data service level. This motivates the data service-specific fault monitoring, as described in "Data Service-Specific Fault Probes".

Sanity Checking of Probing Node

The network fault probing and data service-specific fault probing require each node to probe another node for a response. Before doing a takeover, the probing node performs a number of basic sanity checks of itself. These checks attempt to ensure that the problem does not really lie with the probing node. They also try to ensure that taking over from the server that seems to be having a problem really will improve the situation. Without the sanity checks, the problem of false takeovers would likely arise. That is, a sick node would wrongly blame another node for lack of response and would take over from the healthier server.

The probing node performs the following sanity checks on itself before doing a takeover from another node:

The probing node checks its own ability to use the public network, as described in "Public Network Monitoring (PNM)".
The probing node also checks whether its own HA data services are responding. All the HA data services that the probing node is already running are checked. If any are not responsive, takeover is inhibited, on the assumption that the probing node will not do any better trying to run another node's services if it can't run its own. Furthermore, the failure of the probing node's own HA data services to respond might be an indication of some underlying problem with the probing node that could be causing the probe of the other node to fail. Sun Cluster HA for NFS provides an important example of this phenomenon: to lock a file on another node, the probing node's own lockd and statd daemons must be working. By checking the response of its lockd and statd daemons, the probing node rules out the scenario where its own daemons' failure to respond makes the other node look unresponsive.

Public Network Monitoring (PNM)

The PNM component has two primary functions:

To monitor the status of configured adapters on a node and report general adapter or network failures
To fail over transparently to other backup adapters on a node when the primary adapter fails

PNM is implemented as a daemon (pnmd) which periodically gathers network statistics on the set of public network interfaces in a node. If the results indicate any abnormalities, pnmd attempts to distinguish between the following three cases:

The network is quiescent.
The network is down.
The network interface is down.

PNM then does a multicast ping. PNM places the results of its findings in the CCD and compares the local results with the results of the other nodes (which are also placed in the CCD). This comparison is used to determine whether the network is down or whether the network interface is faulty. If PNM detects that the network interface is faulty and backup adapters are configured, it performs the network adapter failover.

Note -

The multicast ping initiated by PNM might not be understood by any non-Sun hardware components present in the configuration. Therefore, you should directly connect a Sun network appliance to the network being monitored.

The results of PNM monitoring are used by various entities. The network adapter failover component of PNM uses the monitoring results to decide whether an adapter failover would be useful. For example, if the network is experiencing a failure, no adapter failover is performed. Fault monitors associated with SC HA data services and the API call hactl use the PNM facility to diagnose the cause of data service failures. The information returned by PNM is used to decide whether to migrate the data service, and to determine the location of the data service after migration.

The syslog messages written by the PNM facility on detection of adapter failures are read by the SC Manager, which translates the messages into graphic icons and displays them through the graphical user interface.

You also can run the PNM utilities on the command line to determine the status of network components. For more information, see the man pages pnmset(1M), pnmstat(1M), pnmptor(1M)pnmrtop(1M), , and pnmd(1M).

Sun Cluster Fault Probes

PNM monitors the health of the public network and will switch to backup connections when necessary. However, in the event of the total loss of public network access, PNM will not provide data service or logical host failover. In such a case, PNM will report the loss but it is up to an external fault probe to handle switching between backup nodes.

If you are using VxVM as your volume manager, the Sun Cluster framework is responsible for monitoring each Network Adapter Failover (NAFO) backup group defined per logical host, and initiating a switchover to a backup node when either of the following conditions are met:

There is total loss of the public network (all NAFO backup groups are unavailable) and the backup node has at least one NAFO group available.
There is partial loss of the public network--at least one NAFO backup group is still active when more than one (multiple subnets) are defined for a logical host--and the backup node has a greater number of valid, active NAFO backup groups.

If neither of these conditions are met, Sun Cluster will not attempt a switchover.

If your volume manager is Solstice DiskSuite, loss of public network causes the disconnected node to abort and causes the logical hosts mastered by that node to migrate to the backup node.

The Sun Cluster framework monitors the public networks only while the configuration includes a logical host and while a data service is in the "on" state and registered on that logical host. Only those NAFO backup groups that are in use by a logical host are monitored.

Data Service-Specific Fault Probes

The motivation for performing data service-specific fault probing is that although the server node and operating system are running, the software or hardware might be in such a confused state that no useful work at the data service level is occurring. In the overall architecture, the total failure of the node or operating system is detected by the cluster membership monitor's heartbeat mechanism. However, a node might be working well enough for the heartbeat mechanism to continue to execute even though the data service is not doing useful work.

Conversely, the data service-specific fault probes do not need to detect the state where one node has crashed or has stopped sending cluster heartbeat messages. The assumption is made that the cluster membership monitor detects such states, and the data service fault probes themselves contain no logic for handling these states.

A data service fault probe behaves like a client of the data service. A fault probe running on a machine monitors both the data service exported by that machine and, more importantly, the data service exported by another server. A sick server is not reliable enough to detect its own sickness, so each server is monitoring another node in addition to itself.

In addition to behaving like a client, a data service-specific fault probe will also, in some cases, use statistics from the data service as an indication that useful work is or is not occurring. A probe might also check for the existence of certain processes that are crucial to a particular data service.

Typically, the fault probes react to the absence of service by forcing one server to take over from another. In some cases, the fault probes will first attempt to restart the data service on the original machine before doing the takeover. If many restarts occur within a short time, the indication is that the machine has serious problems. In this case, a takeover by another server is executed immediately, without attempting another local restart.

Sun Cluster HA for NFS Fault Probes

The probing server runs two types of periodic probes against another server's NFS service.

The probing server sends a NULL RPC to all daemon processes on the target node that are required to provide NFS service; these daemons are rpcbind, mountd, nfsd, lockd, and statd.
The probing server does an end-to-end test: it tries to mount an NFS file system from the other node, and then to read and write a file in that file system. It does this end-to-end test for every file system that the other node is currently sharing. Because the mount is expensive, it is executed less often than the other probes.

If any of these probes fail, the probing node will consider doing a takeover from the serving node. However, certain conditions might inhibit the takeover from occurring immediately:

Grace period for local restart - Before doing the takeover, the probing node waits for a short time period that is intended to:
- Give the victim node a chance to notice its own sickness and fix the problem by doing a local restart of its own daemons
- Give the victim node a chance to be less busy (if it is merely overloaded)

After waiting, the prober retries the probe, going on with takeover consideration only if it fails again. In general, two entire timeouts of the basic probe are required for a takeover, to allow for a slow server.

Multiple public networks - If the other node is on multiple public networks, the probing node will try the probe on at least two of them.
Locks - Some backup utilities exploit the lockfs(1M) facility, which locks out various types of updates on a file system, so that backup can take a snapshot of an unchanging file system. Unfortunately, in the NFS context, lockfs(1M) makes a file system appear unavailable; NFS clients will see the condition NFS server not responding. Before doing a takeover, the probing node queries the other node to find out whether the file system is in lockfs state, and, if so, takeover is inhibited. The takeover is inhibited because the lockfs is part of a normal, intended administrative procedure for doing backup. Note that not all backup utilities use lockfs; some permit NFS service to continue uninterrupted.
Daemons - Unresponsiveness of lockd and statd daemons does not cause a takeover. The lockd and statd daemons, together, provide network locking for NFS files. If these daemons are unresponsive, the condition is merely logged to syslog, and a takeover does not occur. lockd and statd, in the course of their normal work, must perform RPCs to client machines, so that a dead or partitioned client can cause lockd and statd to hang for long periods of time. Thus, a bad client can make lockd and statd on the server look sick. And if a takeover by the probing server were to occur, the probing server would probably be stalled by the bad client in the same way. With the current model, a bad client will not cause a false takeover.

After passing these Sun Cluster HA for NFS-specific tests, the process of considering whether or not to do a takeover continues with calls to hactl(1M) (see "Sanity Checking of Probing Node").

The probing server also checks its own NFS service. The logic is similar to the probes of the other server, but instead of doing takeovers, error messages are logged to syslog and an attempt is made to restart any daemons whose process no longer exists. In other words, the restart of a daemon process is performed only when the daemon process has exited or crashed. The restart of a daemon process is not attempted if the daemon process still exists but is not responding, because that would require killing the daemon without knowing which data structures it is updating. The restart is also not done if a local restart has been attempted too recently (within the last hour). Instead, the other server is told to consider doing a takeover (provided the other server passes its own sanity checks). Finally, the rpcbind daemon is never restarted, because there is no way to inform processes that had registered with rpcbind that they need to re-register.

HA-DBMS Fault Probes

The fault probes for Sun Cluster HA for Oracle, Sun Cluster HA for Sybase and Sun Cluster HA for Informix perform similarly to monitor the database server. The HA-DBMS fault probes are configured by running one of the utilities, haoracle(1M), hasybase(1M), or hainformix(1M). (See the online man pages for a detailed description of the options for these utilities.)

Once the utilities are configured and activated, two processes are started on the local node and two processes are started on the remote node simulating a client access. The remote fault probe is initiated by the ha_dbms_serv daemon and is started when hareg -y dataservicename is initiated.

The HA-DBMS module uses two methods to monitor whether the DBMS service is available. First, HA-DBMS extracts statistics from the DBMS itself:

In Oracle, the V$SYSSTAT table is queried.
In Sybase, the global variables @@io_busy, @@pack_received, @@pack_sent, @@total_read, @@total_write, and @@connections are queried.
In Informix, the SYSPROFILE table is queried.

If the extracted statistics indicate that work is being performed for clients, then no other probing of the DBMS is required. Second, if the DBMS statistics show that no work is occurring, then HA-DBMS submits a small test transaction to the DBMS. If all clients happen to be idle, the DBMS statistics would show no work occurring; that is, the test transaction distinguishes the situation of the database being hung from the legitimately idle situation. Because the test transaction is executed only when the statistics show no activity, it imposes no overhead on an active database. The test transaction consists of:

Creating a table by the name of either HA_DBMS_REM or HA_DBMS_LOC
Inserting values into the created table
Updating the inserted value
Dropping the created table

HA-DBMS carefully filters the error codes returned by the DBMS, using a table that describes which codes should or should not cause a takeover. For example, in the case of Sun Cluster HA for Oracle, the scenario of table space full does not cause a takeover, because an administrator must intervene to fix this condition. (If a takeover were to occur, the new master server would encounter the same table space full condition.)

On the other hand, an error return code such as could not allocate Unix semaphore causes Sun Cluster HA for Oracle to attempt to restart ORACLE locally on this server machine. If a local restart has occurred too recently, then the other machine takes over instead (after first passing its own sanity checks).

Sun Cluster HA for Netscape Fault Probes

The fault monitors for all of the Sun Cluster HA for Netscape data services share a common methodology for fault monitoring of the data service instance. All use the concept of remote and local fault monitoring.

The fault monitor process running on the node which currently masters the logical host that the data service is running on is called the local fault monitor. The fault monitor process running on a node which is a possible master of the logical host is called a remote fault monitor.

Sun Cluster HA for Netscape fault monitors periodically perform a simple data service operation with the server. If the operation fails or times out, that particular probe is declared to have failed.

When a probe fails, the local fault probe attempts to restart the data service locally. This is usually sufficient to restore the data service. The remote probe keeps a record of the probe failure but does not take any action. Upon two successive failures of the probe (indicating that a restart of the data service did not correct the problem), the remote probe invokes the hactl(1M) command in "takeover" mode to initiate a failover of the logical host. Some Netscape data services use a sliding window algorithm of probe successes and failures, in which a pre-configured number of failures within the window causes the probe to take action.

You can use the hadsconfig(1M) command to tune probe interval and timeout values for Sun Cluster HA for Netscape fault monitors. Reducing the probe interval value for fault probing results in faster detection of problems, but it also might result in spurious failovers due to transient problems. Similarly, reducing the probe timeout value results in faster detection of problems related to the data service instances, but also might result in spurious takeovers if the data service is merely busy due to heavy load. For most situations, the default values for these parameters are sufficient. The parameters are described in the hadsconfig(1M) man page and in the configuration sections of each data service chapter in the Sun Cluster 2.2 Software Installation Guide.

Sun Cluster HA for DNS Fault Probes

The Sun Cluster HA for DNS fault probe performs an nslookup operation to check the health of the Sun Cluster HA for DNS server. It looks up the domain name of the Sun Cluster HA for DNS logical host from the Sun Cluster HA for DNS server. Depending upon the configuration of your /etc/resolv.conf file, nslookup might contact other servers if the primary Sun Cluster HA for DNS server is down. Thus, the nslookup operation might succeed, even when the primary Sun Cluster HA for DNS server is down. To guard against this, the fault probe verifies whether replies come from the primary Sun Cluster HA for DNS server or other servers.

Sun Cluster HA for Netscape HTTP Fault Probes

The Sun Cluster HA for Netscape HTTP fault probe checks the health of the http server by trying to connect to it on the logical host address on the configured port. Note that the fault monitor uses the port number specified to hadsconfig(1M) during configuration of the nshttp service instance.

Sun Cluster HA for Netscape News Fault Probes

The Sun Cluster HA for Netscape News fault probe checks the health of the news server by connecting to it on the logical host IP addresses and the nntp port number. It then attempts to execute the NNTP date command on the news server, and expects a response from the server within the specified probe timeout period.

Sun Cluster HA for Netscape Mail or Message Server Fault Probes

The Sun Cluster HA for Netscape Mail or Message Server fault probe checks the health of the mail or message server by probing it on all three service ports served by the server, namely the SMTP, IMAP, and POP3 ports:

SMTP (port 25)--Executes an SMTP "hello" message on the server and then executes a quit command.
IMAP (port 143)--Executes an IMAP4 CAPABILITY command followed by an IMAP4 LOGOUT command.
POP3 (port 110)--Executes a quit command.

For all of these tests, the fault probe expects a response string from the server within the probe timeout interval. Note that a probe failure on any of the above three service ports is considered a failure of the server. To avoid spurious failovers, the nsmail fault probe uses a sliding window algorithm for tracking probe failures and successes. If the number for probe failures in the sliding window is greater than a pre-configured number, a takeover is initiated by the remote probe.

Sun Cluster HA for Netscape LDAP Fault Probes

The Sun Cluster HA for Netscape LDAP local probe can perform a variable number of local restarts before initiating a failover. The local restart mechanism uses a sliding window algorithm; only when the number of retries is exhausted within that window does a failover occur.

The Sun Cluster HA for Netscape LDAP remote probe uses a simple telnet connection to the LDAP port to check the status of the server. The LDAP port number is the one specified during initial set-up with hadsconfig(1M).

The local probe:

Probes the server by running a monitoring script. The script performs a search for the LDAP common name "monitor." The common name is defined by the Directory Server and is used only for monitoring. The probe uses the ldapsearch utility to perform this operation.
Tries to restart the server locally, upon detecting a problem with the server.
Initiates the hactl(1M) command in the giveup mode upon deciding that the local node cannot reliably run the directory server instance, while the remote probe initiates the hactl(1M) command in the takeover mode. If there are multiple possible masters of the logical host, all of the remote probes invoke the takeover operation in unison. However, after the takeover, the underlying framework ensures that a unique master node is chosen for the Directory Server.

Sun Cluster HA for Lotus Fault Probes

The Sun Cluster HA for Lotus fault probe has two parts--a local probe that runs on the node on which the Lotus Domino server processes are currently running, and a remote probe that runs on all other nodes that are possible masters of the Lotus Domino server's logical host.

Both probes use a simple telnet connection to the Lotus Domino port to check the status of the Domino server. If a probe fails to connect, it initiates a failover or takeover by invoking the hactl(1M) command.

The local fault probe can perform three local restarts before initiating a failover. The local restart mechanism uses a sliding time window algorithm; only when the number of retries is exhausted within that window does a failover occur.

Sun Cluster HA for Tivoli Fault Probes

Sun Cluster HA for Tivoli uses only a local fault probe. It runs on the node on which the Tivoli object dispatcher, the oserv daemon, is currently running.

The fault probe uses the Tivoli command wping to check the status of the monitored oserv daemon. The wping of an oserv daemon can fail for the following reasons:

The monitored oserv daemon is not running.

The oserv daemon on the server dies while monitoring a client oserv daemon.

Proper Tivoli roles (authorization) have not been set for the administrative user. See the Sun Cluster 2.2 Software Installation Guide for details about Tivoli.

If the local probe fails to ping the oserv daemon, it initiates a failover by invoking the hactl(1M) command. The fault probe will perform one local restart before initiating a failover.

Sun Cluster HA for SAP Fault Probes

The Sun Cluster HA for SAP fault probe monitors the availability of the Central Instance, specifically the message server, the enqueue server, and the dispatcher. The probe monitors only the local node by checking for the existence of the critical SAP processes. It also uses the SAP utility lgtst to verify that the SAP message server is reachable.

Upon detecting a problem, such as when a process dies prematurely or lgtst reports an error, the fault probe will first try to restart SAP on the local node for a configurable number of times (configurable through hadsconfig(1M)). If the number of restarts that the user has configured has been exhausted, then the fault probe initiates a switchover by calling hactl(1M), if this instance has been configured to allow failover (also configurable through hadsconfig(1M)). The Central Instance is shut down before the switchover occurs, and then is restarted on the remote node after the switchover is complete.

Displaying `LOG_DB_WARNING` Messages for the SAP Probe

The Sun Cluster HA for SAP parameter LOG_DB_WARNING determines whether warning messages should be displayed if the Sun Cluster HA for SAP probe cannot connect to the database. When LOG_DB_WARNING is set to y and the probe cannot connect to the database, a message is logged at the warning level in the local0 facility. By default, the syslogd(1M) daemon does not display these messages to /dev/console or to /var/adm/messages. To see these warnings, you must modify the /etc/syslog.conf file to display messages of local0.warning priority. For example:

...
*.err;kern.notice;auth.notice;local0.warning /dev/console
*.err;kern.debug;daemon.notice;mail.crit;local0.warning /var/adm/messages
...

After modifying the file, you must restart syslogd(1M). See the syslog.conf(1M) and syslogd(1M) man pages for more information.