Sun Cluster HA for Apache Fault Monitor (Sun Cluster 3.0 Data Services Installation and Configuration Guide)

Sun Cluster HA for Apache Fault Monitor

The Sun Cluster HA for Apache probe sends a request to the server to query the health of the Apache server. Before the probe actually queries the Apache server, it checks to confirm that network resources are configured for this Apache resource. If no network resources are configured, an error message (No network resources found for resource.) is logged and the probe exits with failure.

The probe executes the following steps:

1. Uses the time-out value set by the resource property Probe_timeout to limit the time spent trying to probe the Apache server.
2. Connects to the Apache server and performs an HTTP 1.0 HEAD check by sending the HTTP request and receiving a response. The probe connects to the Apache server on each IP address/port combination in turn.
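
The two steps above can be sketched in Python as follows. This is an illustration only, not the Sun Cluster probe itself: the function names, the request URI "/", the 1024-byte read size, and the choice to apply Probe_timeout to each address/port combination separately are all assumptions that the text does not confirm.

```python
import socket
import time


def build_head_request():
    """Return the bytes of a minimal HTTP/1.0 HEAD request (URI "/" is assumed)."""
    return b"HEAD / HTTP/1.0\r\n\r\n"


def status_line(reply):
    """Return the first line of the server's reply as text."""
    return reply.split(b"\r\n", 1)[0].decode("latin-1")


def head_probe(host, port, probe_timeout):
    """Probe one address/port; raise OSError if no reply arrives in time."""
    deadline = time.monotonic() + probe_timeout

    def remaining():
        # Time left before the (assumed per-endpoint) Probe_timeout expires.
        left = deadline - time.monotonic()
        if left <= 0:
            raise TimeoutError("Probe_timeout exceeded")
        return left

    with socket.create_connection((host, port), timeout=remaining()) as sock:
        sock.settimeout(remaining())
        sock.sendall(build_head_request())
        sock.settimeout(remaining())
        reply = sock.recv(1024)
    if not reply:
        raise ConnectionError("server closed the connection without replying")
    return status_line(reply)


def probe_all(endpoints, probe_timeout):
    """Probe every (host, port) pair in turn and collect the outcome of each."""
    if not endpoints:
        # Mirrors the check made before the probe queries the server.
        raise RuntimeError("No network resources found for resource.")
    results = {}
    for host, port in endpoints:
        try:
            results[(host, port)] = head_probe(host, port, probe_timeout)
        except OSError as exc:  # includes time-outs and refused connections
            results[(host, port)] = exc
    return results
```

A call such as `probe_all([("192.0.2.10", 80), ("192.0.2.10", 8080)], 30.0)` (hypothetical addresses) would return, for each pair, either the reply's status line or the exception that ended that probe.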

The result of this query can be either a failure or a success. If the probe successfully receives a reply from the Apache server, the probe returns to its infinite loop and continues the next cycle of probing and sleeping.

The query can fail for various reasons, such as heavy network traffic, heavy system load, and misconfiguration. Misconfiguration can occur if the Apache server is not configured to listen on all IP address/port combinations that are being probed. The Apache server should service every port for every IP address specified for this resource. If the reply to the query is not received within the Probe_timeout limit (specified in Step 1), the probe considers this scenario a failure on the part of the Apache data service and records the failure in its history. An Apache probe failure can be a total failure or a partial failure.

Probe failures that are considered total failures are:
- Failure to connect to the server, as flagged by the error message: Failed to connect to %s port %d, with %s being the host name and %d the port number.
- Running out of time (exceeding the resource property time-out Probe_timeout) after trying to connect to the server.
- Failure to successfully send the probe string to the server, as flagged by the error message: Failed to communicate with server %s port %d: %s, with the first %s being the host name, %d the port number, and the second %s further details about the error.
  
Probe failures that are considered partial failures are:
- Running out of time (exceeding the resource property time-out Probe_timeout) while trying to read the reply from the server to the probe's query.
- Failure to read data from the server for other reasons, as flagged by the error message: Failed to communicate with server %s port %d: %s, with the first %s being the host name, %d the port number, and the second %s further details about the error.

The monitor accumulates partial failures: two partial failures within the resource property interval Retry_interval are counted as one failure.

Based on the history of failures, a failure can cause either a local restart or a failover of the data service. This action is further described in "Health Checks of the Data Service".
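
The accumulation of total and partial failures described above can be sketched as a sliding window over the failure history. This is a hypothetical illustration: the class and constant names, the stage labels, and the weight of 0.5 for a partial failure (chosen so that two partial failures add up to one failure) are assumptions, and the sketch deliberately leaves out the decision between restart and failover, which is covered in "Health Checks of the Data Service".

```python
from collections import deque

TOTAL = 1.0
PARTIAL = 0.5  # assumption: two partial failures count as one failure

# Assumed mapping from the stage at which the probe failed to a weight:
# connect and send failures are total, failures while reading are partial.
FAILURE_WEIGHT = {"connect": TOTAL, "send": TOTAL, "read": PARTIAL}


class FailureHistory:
    """Keep the failures recorded within the last retry_interval seconds."""

    def __init__(self, retry_interval):
        self.retry_interval = retry_interval
        self._events = deque()  # (timestamp, weight), oldest first

    def _expire(self, now):
        # Drop failures that are older than Retry_interval.
        while self._events and now - self._events[0][0] > self.retry_interval:
            self._events.popleft()

    def record(self, now, stage):
        """Record a probe failure that happened at the given stage."""
        self._events.append((now, FAILURE_WEIGHT[stage]))
        self._expire(now)

    def failures(self, now):
        """Return the accumulated failure count inside the window."""
        self._expire(now)
        return sum(weight for _, weight in self._events)
```

For example, with a Retry_interval of 300 seconds, two read time-outs recorded 100 seconds apart accumulate to a count of 1.0, the same as a single failure to connect.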