Sun Cluster 3.0 Data Services Installation and Configuration Guide

Sun Cluster Data Service Fault Monitors

The data services supplied by Sun contain fault monitors that are built into the package. The fault monitor (or fault probe) is a process that probes the health of the data service.

Fault Monitor Invocation

The RGM invokes the fault monitor when you bring a resource group and its resources online. To do so, the RGM internally calls the MONITOR_START method for the data service.

The fault monitor performs two functions:

Monitoring of the Abnormal Exit of the Server Process

The Process Monitor Facility (PMF) monitors the data service process. On abnormal exit, the PMF invokes an action script supplied by the data service to communicate the failure to the data service fault monitor.

This communication between the PMF action script and the probe occurs over a UNIX domain socket. The only communication intended to take place through the UNIX domain socket is when the PMF informs the probe, through the action script, that the data service has exited abnormally. This event is considered a total failure of the data service.

The data service fault probe runs in an infinite loop and sleeps for an adjustable amount of time set by the resource property Thorough_probe_interval. While sleeping, the probe polls for messages from the PMF action script. If the server process exits abnormally during this interval, the PMF action script informs the probe.

The probe then updates the status of the data service as "Service daemon not running" and takes action. The action can involve just restarting the data service locally or failing over the data service to a secondary cluster node. The probe decides whether to restart or to fail over the data service by checking the value set in the resource properties Retry_count and Retry_interval for the data service application resource.
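
The following Python sketch illustrates the general idea of polling a control socket for a PMF failure notification between health checks. The socket path, message format, and interval value are hypothetical placeholders, not the actual Sun Cluster implementation.

  # Illustrative only: poll a UNIX domain socket for an abnormal-exit
  # notification from the PMF action script, waiting at most one
  # Thorough_probe_interval. All names and values here are hypothetical.
  import os
  import socket

  SOCKET_PATH = "/tmp/probe_control.sock"   # hypothetical control socket
  THOROUGH_PROBE_INTERVAL = 60              # seconds (resource property value)

  def wait_for_pmf_message(listener, interval):
      """Wait up to `interval` seconds; return the message if the PMF
      action script reports an abnormal exit, or None if none arrives."""
      listener.settimeout(interval)
      try:
          conn, _ = listener.accept()       # action script connects on failure
          message = conn.recv(256)
          conn.close()
          return message
      except socket.timeout:
          return None                       # interval elapsed with no failure

  if __name__ == "__main__":
      if os.path.exists(SOCKET_PATH):
          os.unlink(SOCKET_PATH)
      listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
      listener.bind(SOCKET_PATH)
      listener.listen(1)
      message = wait_for_pmf_message(listener, THOROUGH_PROBE_INTERVAL)
      if message:
          print("total failure reported by PMF:", message)
      else:
          print("no failure reported; proceed to the periodic health check")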

Health Checks of the Data Service

Typically, communication between the probe and the data service occurs through a dedicated command or a successful connection to the specified data service port.

If no messages have been received on the control socket, the probe, after sleeping for the interval specified by Thorough_probe_interval, checks the health of the data service. The logic followed by the probe is as follows (a sketch of the loop appears after these steps):

  1. Sleep (Thorough_probe_interval).

  2. Perform health checks under a time-out property Probe_timeout. This is a resource extension property of each data service that you can set.

  3. If the result of Step 2 is a success, that is, the service is healthy, update the success/failure history by purging any history records that are older than the value set for the resource property Retry_interval. The probe sets the status message for the resource as "Service is online" and returns to Step 1.

    If Step 2 resulted in a failure, the probe updates the failure history. It then computes the total number of times the health check failed.

    The result of the health check can range from a total failure to success. The interpretation of the result depends on the specific data service. Consider a scenario where the probe can successfully connect to the server and send a handshake message to it but receives only a partial response before timing out. This scenario is most likely a result of system overload. If some action is taken (such as restarting the service), the clients reconnect to the service, further overloading the system. In that case, a data service fault monitor can decide not to treat this "partial" failure as fatal. Instead, the monitor can track it as a nonfatal probe failure. These partial failures are still accumulated over the interval specified in Retry_interval.

    However, if the probe cannot connect to the server at all, it can be considered a fatal failure. Partial failures lead to incrementing the failure count by a fractional amount. A fatal (total) failure always increments the failure count by 1. Every time the failure count increases by 1 (either by a fatal failure or by accumulation of partial failures), the probe attempts to correct the situation either by restarting or failing over the data service.

  4. If the result of the computation in Step 3 (the number of failures in the history interval) is less than the value of the resource property Retry_count, the probe attempts to correct the situation locally (for example, by restarting the service). The probe sets the status message of the resource as "Service is degraded" and returns to Step 1.

  5. If the number of failures in Retry_interval exceeds Retry_count, the probe calls scha_control with the "giveover" option. This option requests failover of the service. If this request succeeds, the fault probe stops on this node. The probe sets the status message for the resource as: "Service has failed."

  6. The scha_control request issued in the previous step can be denied by the Sun Cluster framework for various reasons; the reason is identified by the return code of scha_control. The probe checks the return code. If the scha_control request is denied, the probe resets the failure/success history and starts afresh. Because the number of failures is already above Retry_count, the fault probe would otherwise attempt to issue scha_control in each subsequent iteration, and would most likely be denied again. These repeated requests would place additional load on the system and increase the likelihood of further service failures if they were triggered by an overloaded system. The probe then returns to Step 1.
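
The following Python sketch summarizes steps 1 through 6. The property values, the result scale of health_check, and the request_giveover stub are illustrative assumptions; the real probe is not implemented this way verbatim.

  # Minimal sketch of the probe loop in steps 1-6. Values and helper
  # functions are placeholders, not the Sun Cluster implementation.
  import time

  THOROUGH_PROBE_INTERVAL = 60   # seconds between health checks
  RETRY_INTERVAL = 300           # history window, in seconds
  RETRY_COUNT = 2                # failures tolerated before giveover

  def health_check(timeout):
      """Return 0.0 on success, a fraction for a partial failure,
      or 1.0 for a total failure (placeholder)."""
      return 0.0

  def restart_service():
      print("restarting the data service locally")   # "Service is degraded"

  def request_giveover():
      """Stand-in for requesting failover (the scha_control giveover
      option); return True if the framework accepts the request."""
      return True

  history = []                                        # (timestamp, weight)
  while True:
      time.sleep(THOROUGH_PROBE_INTERVAL)                      # step 1
      result = health_check(timeout=30)                        # step 2
      now = time.time()
      history = [(t, w) for (t, w) in history                  # step 3: purge
                 if now - t <= RETRY_INTERVAL]                  # old records
      if result == 0.0:
          continue                                             # "Service is online"
      history.append((now, result))                            # record failure
      failures = sum(w for (_, w) in history)
      if failures < RETRY_COUNT:                               # step 4
          restart_service()
      elif request_giveover():                                 # step 5
          break                                                # probe stops here
      else:                                                    # step 6: denied
          history = []                                         # reset history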

Sun Cluster HA for Apache Fault Monitor

The Sun Cluster HA for Apache probe sends a request to the server to query the health of the Apache server. Before the probe actually queries the Apache server, it checks to confirm that network resources are configured for this Apache resource. If no network resources are configured, an error message (No network resources found for resource.) is logged and the probe exits with failure.

The probe executes the following steps:

  1. Uses the time-out value set by the resource property Probe_timeout to limit the time spent trying to successfully probe the Apache server.

  2. Connects to the Apache server and performs an HTTP 1.0 HEAD check by sending an HTTP HEAD request and receiving the response. The probe connects to the Apache server on each IP address/port combination in turn (see the sketch after these steps).

    The result of this query can be either a failure or a success. If the probe successfully receives a reply from the Apache server, the probe returns to its infinite loop and continues the next cycle of probing and sleeping.

    The query can fail for various reasons, such as heavy network traffic, heavy system load, and misconfiguration. Misconfiguration can occur if the Apache server is not configured to listen on all IP address/port combinations that are being probed. The Apache server should service every port for every IP address specified for this resource. If the reply to the query is not received within the Probe_timeout limit (specified in Step 1), the probe considers this scenario a failure on the part of the Apache data service and records the failure in its history. An Apache probe failure can be a total failure or a partial failure.

    Probe failures that are considered total failures are:

    • Failure to connect to the server, as flagged by the error message: Failed to connect to %s port %d, with %s being the host name and %d the port number.

    • Running out of time (exceeding the resource property time-out Probe_timeout) after trying to connect to the server.

    • Failure to successfully send the probe string to the server, as flagged by the error message: Failed to communicate with server %s port %d: %s, with the first %s being the host name, %d the port number, and the second %s further details about the error.

    Probe failures that are considered partial failures are listed below. Two such partial failures within the resource property interval Retry_interval are accumulated by the monitor and counted as one failure:

    • Running out of time (exceeding the resource property timeout Probe_timeout) while trying to read the reply from the server to the probe's query.

    • Failing to read data from the server for other reasons, as flagged by the error message: Failed to communicate with server %s port %d: %s, with the first %s being the host name and %d the port number; the second %s further details about the error.

  3. Based on the history of failures, a failure can cause either a local restart or a failover of the data service. This action is further described in "Health Checks of the Data Service".
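
The sketch below shows the shape of such an HTTP 1.0 HEAD check against each IP address/port combination, with the total and partial failure cases distinguished as described above. The addresses, ports, and timeout are example values; this is not the shipped fault monitor code.

  # Illustrative HTTP 1.0 HEAD probe; addresses, ports, and the timeout
  # are placeholders, and this is not the shipped fault monitor.
  import socket

  ADDRESSES = ["192.168.1.10"]   # IP addresses of the resource (example)
  PORTS = [80]                   # from Port_list (example)
  PROBE_TIMEOUT = 30             # seconds (cf. Probe_timeout)

  def probe(host, port, timeout):
      """Return 'ok', 'partial', or 'total' for one IP/port combination."""
      try:
          sock = socket.create_connection((host, port), timeout=timeout)
      except OSError:
          return "total"         # cannot connect at all: total failure
      try:
          sock.settimeout(timeout)
          try:
              sock.sendall(b"HEAD / HTTP/1.0\r\n\r\n")
          except OSError:
              return "total"     # could not send the probe string
          try:
              reply = sock.recv(1024)
              return "ok" if reply else "partial"
          except OSError:
              return "partial"   # timeout or error while reading the reply
      finally:
          sock.close()

  for host in ADDRESSES:
      for port in PORTS:
          print(host, port, probe(host, port, PROBE_TIMEOUT))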

Sun Cluster HA for DNS Fault Monitor

The probe uses the nslookup command to query the health of DNS. Before the probe actually queries the DNS server, a check is made to confirm that network resources are configured in the same resource group as the DNS data service. If no network resources are configured, an error message is logged and the probe exits with failure. The probe executes the following steps (a sketch of the nslookup check follows these steps):

  1. Run the nslookup command by using the time-out value specified by the resource property Probe_timeout.

    The result of this nslookup command can be either failure or success. If the nslookup query was successfully replied to by DNS, the probe returns to its infinite loop, waiting for the next probe time.

    If the nslookup command fails, the probe considers this scenario a failure of the DNS data service and records the failure in its history. The DNS probe considers every failure a total failure.

  2. Based on the success/failure history, a failure can cause a local restart or a data service failover. This action is further described in "Health Checks of the Data Service".
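
A minimal Python sketch of this check follows. The target name, server address, and timeout are placeholder values.

  # Sketch of the nslookup-based health check under a Probe_timeout-style
  # limit. The hostname, server address, and timeout are placeholders.
  import subprocess

  PROBE_TIMEOUT = 30                  # seconds (cf. Probe_timeout)
  TARGET = "example.com"              # name to resolve (placeholder)
  DNS_SERVER = "192.168.1.20"         # address of the DNS resource (placeholder)

  def dns_probe():
      """Return True if nslookup answers within the timeout; every
      failure, including a timeout, counts as a total failure."""
      try:
          result = subprocess.run(["nslookup", TARGET, DNS_SERVER],
                                  capture_output=True,
                                  timeout=PROBE_TIMEOUT)
          return result.returncode == 0
      except subprocess.TimeoutExpired:
          return False

  if __name__ == "__main__":
      print("DNS healthy" if dns_probe() else "DNS probe failed (total failure)")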

Sun Cluster HA for NFS Fault Monitor

The Sun Cluster HA for NFS fault monitor contains two parts. One is NFS system fault monitoring, which involves monitoring the NFS daemons (nfsd, mountd, statd, and lockd) and taking appropriate action when they have problems. The other part is specific to each NFS resource. The fault monitor of each resource monitors the file systems exported by the resource by checking the status of each shared path.

Fault Monitor Startup

The NFS system fault monitor is started by an NFS resource start method. This start method first checks whether the NFS system fault monitor (nfs_daemons_probe) is already running under the process monitor pmfadm. If not, the start method starts the nfs_daemons_probe process under the control of the process monitor. It then starts the resource fault monitor (nfs_probe), also under the control of the process monitor.

Fault Monitor Stops

The NFS resource Monitor_stop method stops the resource fault monitor. It also stops the NFS system fault monitor if no other NFS resource fault monitor is running on the local node.

NFS System Fault Monitor Process

The system fault monitor probes rpcbind, statd, lockd, nfsd, and mountd by checking for the presence of the process and its response to a null rpc call. This monitor uses the following NFS extension properties:

Rpcbind_nullrpc_timeout
Lockd_nullrpc_timeout
Nfsd_nullrpc_timeout
Rpcbind_nullrpc_reboot
Mountd_nullrpc_timeout
Nfsd_nullrpc_restart
Statd_nullrpc_timeout
Mountd_nullrpc_restart

For a description of these properties, see Chapter 7, Installing and Configuring Sun Cluster HA for Network File System (NFS).

Each system fault monitor probe cycle does the following:

  1. Sleeps for Cheap_probe_interval.

  2. Probes rpcbind.

    If the process dies, reboots the system if Failover_mode=HARD.

    If a null rpc call fails and if Rpcbind_nullrpc_reboot=True and Failover_mode=HARD, reboots the system.

  3. Probes statd and lockd.

    If either of these daemons dies, restarts both.

    If a null rpc call fails, logs a message to syslog but does not restart.

  4. Probes nfsd and mountd.

    If either of these daemons dies, restarts it.

    If a null rpc call fails, restarts mountd if the PXFS device is available and the extension property Mountd_nullrpc_restart=True.

If any of the NFS daemons fails to restart, the status of all online NFS resources is set to FAULTED. When all NFS daemons are restarted and healthy, the resource status is set to ONLINE again.
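
The Python sketch below shows one way such a cycle can check daemon presence and responsiveness: pgrep for process existence and rpcinfo -u, which issues a null RPC call to procedure 0, for responsiveness. The program numbers and timeout are illustrative, and the restart and reboot decisions driven by the extension properties are omitted.

  # Rough sketch of one system fault monitor cycle: check that each daemon
  # is present and answers a null RPC call. This is illustrative only; the
  # restart and reboot actions described above are not implemented here.
  import subprocess

  DAEMONS = {                 # well-known RPC program numbers
      "rpcbind": 100000,
      "statd":   100024,
      "lockd":   100021,
      "nfsd":    100003,
      "mountd":  100005,
  }
  NULLRPC_TIMEOUT = 10        # seconds, cf. the *_nullrpc_timeout properties

  def process_running(name):
      return subprocess.run(["pgrep", "-x", name],
                            capture_output=True).returncode == 0

  def nullrpc_ok(program, timeout):
      try:
          return subprocess.run(["rpcinfo", "-u", "localhost", str(program)],
                                capture_output=True,
                                timeout=timeout).returncode == 0
      except subprocess.TimeoutExpired:
          return False

  for name, program in DAEMONS.items():
      alive = process_running(name)
      answers = alive and nullrpc_ok(program, NULLRPC_TIMEOUT)
      print(f"{name}: process {'up' if alive else 'down'}, "
            f"null rpc {'ok' if answers else 'failed'}")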

NFS Resource Monitor Process

Before the resource monitor probes start, all shared paths are read from the dfstab file and stored in memory. In each probe cycle, every shared path is checked by performing stat() on the path.

Each resource monitor fault probe does the following:

  1. Sleeps for Thorough_probe_interval.

  2. Refreshes the memory if dfstab has been changed since the last read.

  3. Probes all shared paths by performing stat() on each path.

If any path is bad, the resource status is set to FAULTED. If all paths are working, the resource status is set to ONLINE again.
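
A simplified Python sketch of that cycle follows. The dfstab location is the standard Solaris path, but the parsing is deliberately naive and the FAULTED/ONLINE status updates are only printed.

  # Simplified sketch: read shared paths from dfstab and stat() each one.
  # Real dfstab parsing and the resource status updates are more involved.
  import os

  DFSTAB = "/etc/dfs/dfstab"          # Solaris share table

  def shared_paths(dfstab=DFSTAB):
      """Return the last field of each non-comment line, which is the
      shared path on a typical share(1M) line (naive parsing)."""
      paths = []
      with open(dfstab) as f:
          for line in f:
              line = line.strip()
              if line and not line.startswith("#"):
                  paths.append(line.split()[-1])
      return paths

  def bad_paths(paths):
      """Return the paths that fail a stat() check."""
      failed = []
      for path in paths:
          try:
              os.stat(path)
          except OSError:
              failed.append(path)
      return failed

  if __name__ == "__main__":
      failed = bad_paths(shared_paths())
      print("resource would be FAULTED:" if failed else "resource ONLINE", failed)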

Sun Cluster HA for Oracle Fault Monitor

The two fault monitors for Sun Cluster HA for Oracle are a server fault monitor and a listener fault monitor.

Oracle Server Fault Monitor

The fault monitor for the Oracle server queries the health of the server by sending it a request.

The server fault monitor consists of two processes: a main fault monitor process and a database client fault probe. The main process performs error lookup and scha_control actions. The database client fault probe performs database transactions.

All database connections from the probe are performed as user oracle. The main fault monitor determines that the operation is successful if the database is online and no errors are returned during the transaction.

If the database transaction fails, the main process checks the internal action table for an action to be performed and performs the predetermined action. If the action executes an external program, it is executed as a separate process in the background. Some possible actions are: switchover, stopping and restarting the server, and stopping and restarting the resource group.

The probe uses the time-out value set in the resource property Probe_timeout to determine how much time to spend to successfully probe Oracle.

The server fault monitor also scans Oracle's alert_log_file and takes action based on any errors it finds.

The server fault monitor is started through pmfadm to make it highly available. If the monitor is killed for any reason, the PMF automatically restarts it.
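
The following Python sketch illustrates the idea of an action-table lookup. The error codes shown and the actions mapped to them are hypothetical examples; the shipped action table and its contents are internal to the data service.

  # Hypothetical example of an error-to-action lookup; these mappings are
  # illustrative and do not reproduce the shipped action table.
  ACTION_TABLE = {
      "ORA-00020": "none",        # maximum number of processes exceeded: log only
      "ORA-01034": "restart",     # ORACLE not available: restart the server
      "ORA-03113": "restart",     # end-of-file on communication channel
      "ORA-00600": "switchover",  # internal error: fail over the resource group
  }

  def decide_action(error_code):
      """Return the configured action for an error, defaulting to 'none'."""
      return ACTION_TABLE.get(error_code, "none")

  for code in ("ORA-01034", "ORA-99999"):
      print(code, "->", decide_action(code))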

Oracle Listener Fault Monitor

The Oracle listener fault monitor checks the status of an Oracle listener.

If the listener is running, the Oracle listener fault monitor considers a probe successful. If the fault monitor detects an error, the listener is restarted.

The listener probe is started through pmfadm to make it highly available. If the probe is killed, the PMF automatically restarts it.

If a problem occurs with the listener during a probe, the probe tries to restart the listener. The maximum number of times it attempts the restart is determined by the value set in the resource property Retry_count. If, after trying for the maximum number of times, the probe is still unsuccessful, it stops the fault monitor and does not switch over the resource group.
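
A Python sketch of a check-and-restart loop bounded by a retry limit follows. It drives the standard Oracle lsnrctl utility; the listener name and retry limit are placeholders, and treating a nonzero exit status as failure is a simplification.

  # Sketch of a listener check-and-restart loop bounded by a Retry_count-style
  # limit. Treating a nonzero lsnrctl exit status as failure is a
  # simplification; the listener name and retry count are placeholders.
  import subprocess

  LISTENER = "LISTENER"
  RETRY_COUNT = 2

  def listener_running(name=LISTENER):
      result = subprocess.run(["lsnrctl", "status", name], capture_output=True)
      return result.returncode == 0

  def restart_listener(name=LISTENER):
      subprocess.run(["lsnrctl", "start", name], capture_output=True)

  attempts = 0
  while not listener_running():
      if attempts >= RETRY_COUNT:
          print("giving up; the fault monitor stops without switching over")
          break
      restart_listener()
      attempts += 1
  else:
      print("listener is running")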

Sun Cluster HA for iPlanet Web Server Fault Monitor

The probe for Sun Cluster HA for iPlanet Web Server (iWS) uses a request to the server to query the health of that server. Before the probe actually queries the server, a check is made to confirm that network resources are configured for this Web server resource. If no network resources are configured, an error message (No network resources found for resource.) is logged and the probe exits with failure.

The probe must address two configurations of iWS: the secure instance and insecure instance. If the Web server is in secure mode and if the probe cannot get the secure ports from the configuration file, an error message (Unable to parse configuration file.) is logged and the probe exits with failure. The secure and insecure instance probes involve common steps.

The probe uses the time-out value set by the resource property Probe_timeout to limit the time spent trying to successfully probe iWS. For details on this resource property, see Appendix A, Standard Properties.

The Network_resources_used resource property setting on the iWS resource determines the set of IP addresses that are used by the Web server. The Port_list resource property setting determines the list of port numbers in use by iWS. The fault monitor assumes that the Web server is listening on all combinations of IP address and port. If you customize your Web server configuration to listen on other port numbers (in addition to port 80), ensure that the resulting configuration file (magnus.conf) contains all possible combinations of IP addresses and ports. The fault monitor attempts to probe all such combinations and might fail if the Web server is not listening on a particular IP address and port combination.
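
The Python fragment below only illustrates how the probe targets are formed: every IP address used by the resource combined with every port in Port_list. The addresses and ports are example values.

  # Example of forming the probe targets: each IP address is combined with
  # each port. The addresses and ports shown are placeholders.
  from itertools import product

  ip_addresses = ["192.168.1.30", "192.168.1.31"]   # from the network resources
  port_list = [80, 8080]                            # from Port_list

  for ip, port in product(ip_addresses, port_list):
      print(f"probe {ip}:{port}")   # the Web server must listen on all of these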

The probe executes the following steps (a sketch of the secure and insecure checks follows these steps):

  1. The probe connects to the Web server by using the specified IP address and port combination. If the connection is not successful, the probe concludes that a total failure has occurred. The probe then records the failure and takes appropriate action.

  2. If the probe successfully connects, it checks to see if the Web server is being run in a secure mode. If so, the probe just disconnects and returns with a success status. No further checks are performed for a secure iWS server.

    However, if the Web server is running in insecure mode, the probe sends an HTTP 1.0 HEAD request to the Web server and waits for the response. The request can be unsuccessful for various reasons, including heavy network traffic, heavy system load, and misconfiguration.

    Misconfiguration can occur when the Web server is not configured to be listening on all IP address and port combinations that are being probed. The Web server should service every port for every IP address specified for this resource.

    Misconfigurations can also result if the Network_resources_used and Port_list resource properties are not set correctly while you are creating the resource.

    If the reply to the query is not received within the limit set by the Probe_timeout resource property, the probe considers this a failure of Sun Cluster HA for iPlanet Web Server. The failure is recorded in the probe's history.

    A probe failure can be a total or partial failure. Probe failures that are considered total failures are:

    • Failure to connect to the server, as flagged by the error message: Failed to connect to %s port %d, with %s being the host name and %d the port number.

    • Running out of time (exceeding the resource property timeout Probe_timeout) after trying to connect to the server.

    • Failure to successfully send the probe string to the server, as flagged by the error message: Failed to communicate with server %s port %d: %s, with the first %s being the host name and %d the port number; the second %s further details about the error.

    Probe failures that are considered partial failures are listed below. Two such partial failures within the resource property interval Retry_interval are accumulated by the monitor and counted as one failure:

    • Running out of time (exceeding the resource property timeout Probe_timeout) while trying to read the reply from the server to the probe's query.

    • Failing to read data from the server for other reasons, as flagged by the error message: Failed to communicate with server %s port %d: %s, with the first %s being the host name and %d the port number; the second %s further details about the error.

  3. Based on the history of failures, a failure can cause either a local restart or a failover of the data service. This action is further described in "Health Checks of the Data Service".
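
The following Python sketch shows the branching described in steps 1 and 2: a secure instance is considered healthy once the TCP connection succeeds, whereas an insecure instance is also sent an HTTP 1.0 HEAD request. The host, port, and timeout are placeholders, and this is not the shipped fault monitor.

  # Illustrative probe for one IP/port combination; host, port, and timeout
  # are placeholders, and this is not the shipped fault monitor.
  import socket

  PROBE_TIMEOUT = 30   # seconds (cf. Probe_timeout)

  def probe_iws(host, port, secure, timeout=PROBE_TIMEOUT):
      """Return 'ok', 'partial', or 'total' for one IP/port combination."""
      try:
          sock = socket.create_connection((host, port), timeout=timeout)
      except OSError:
          return "total"             # connection failure: total failure
      try:
          if secure:
              return "ok"            # secure mode: the connect is enough
          sock.settimeout(timeout)
          try:
              sock.sendall(b"HEAD / HTTP/1.0\r\n\r\n")
          except OSError:
              return "total"         # could not send the probe string
          try:
              reply = sock.recv(1024)
              return "ok" if reply else "partial"
          except OSError:
              return "partial"       # timeout or error while reading the reply
      finally:
          sock.close()

  print(probe_iws("192.168.1.30", 80, secure=False))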

Sun Cluster HA for Netscape Directory Server Fault Monitor

The probe for Sun Cluster HA for Netscape Directory Server accesses particular IP addresses and port numbers. The IP addresses are those from network resources listed in the Network_resources_used resource property. The port is the one listed in the Port_list resource property. For a description of these properties, see Appendix A, Standard Properties.

The fault monitor determines whether the Sun Cluster HA for Netscape Directory Server instance is secure or non-secure. The monitor probes secure and non-secure directory servers differently. If the keyword security is not found in the configuration file (slapd.conf) or the setting security off is found, then the instance is determined to be non-secure. Otherwise, it is determined to be secure.
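
A small Python sketch of that decision follows. The configuration path and the line-level parsing are simplified placeholders.

  # Simplified check of slapd.conf for the security directive; the file
  # path and parsing are placeholders.
  def instance_is_secure(conf_path="slapd.conf"):
      """Secure only if a 'security' line is present and not set to 'off'."""
      try:
          with open(conf_path) as f:
              for line in f:
                  fields = line.split()
                  if fields and fields[0].lower() == "security":
                      return not (len(fields) > 1 and fields[1].lower() == "off")
      except OSError:
          pass
      return False       # no security keyword found: treat as non-secure

  print("secure" if instance_is_secure() else "non-secure")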

The probe for a secure instance consists of a simple TCP connect. If the connect succeeds, the probe is successful. A connection failure or timeout is interpreted as a total failure.

The probe for a non-secure instance depends on running the ldapsearch executable provided with Sun Cluster HA for Netscape Directory Server. The search filter that is used is intended to always find something. The probe detects partial and total failures. The following conditions are considered partial failures; all other conditions are interpreted as total failures.