Sun Cluster HA for iPlanet Web Server Fault Monitor (Sun Cluster 3.0 Data Services Installation and Configuration Guide)

The probe for Sun Cluster HA for iPlanet Web Server (iWS) uses a request to the server to query the health of that server. Before the probe actually queries the server, a check is made to confirm that network resources are configured for this Web server resource. If no network resources are configured, an error message (No network resources found for resource.) is logged and the probe exits with failure.

The probe must address two configurations of iWS: the secure instance and insecure instance. If the Web server is in secure mode and if the probe cannot get the secure ports from the configuration file, an error message (Unable to parse configuration file.) is logged and the probe exits with failure. The secure and insecure instance probes involve common steps.

The probe uses the time-out value set by the resource property Probe_timeout to limit the time spent trying to successfully probe iWS. For details on this resource property, see Appendix A, Standard Properties.

The Network_resources_used resource property setting on the iWS resource determines the set of IP addresses that are used by the Web server. The Port_list resource property setting determines the list of port numbers in use by iWS. The fault monitor assumes that the Web server is listening on every combination of IP address and port. If you customize your Web server configuration to listen on other port numbers (in addition to port 80), ensure that the resulting configuration file (magnus.conf) contains every combination of IP address and port. The fault monitor attempts to probe all such combinations and might fail if the Web server is not listening on a particular IP address and port combination.
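
As an illustration of the "all combinations" rule above, the following minimal Python sketch enumerates the pairs that would be probed and reports any pair that the Web server is not listening on. The function names, the sample addresses, and the plain integer ports are assumptions made for this example; they are not part of Sun Cluster or iWS.

```python
from itertools import product


def probe_targets(ip_addresses, ports):
    """Return every (ip, port) pair, i.e. the cross product of both lists."""
    return list(product(ip_addresses, ports))


def missing_listeners(ip_addresses, ports, listening):
    """Return the pairs that would be probed but are absent from `listening`.

    `listening` is a hypothetical list of (ip, port) pairs that the Web
    server is actually configured to serve.
    """
    served = set(listening)
    return [pair for pair in probe_targets(ip_addresses, ports)
            if pair not in served]


if __name__ == "__main__":
    # Hypothetical values: two addresses and two ports give four probes.
    ips = ["192.0.2.10", "192.0.2.11"]
    ports = [80, 8080]
    configured = [("192.0.2.10", 80), ("192.0.2.10", 8080), ("192.0.2.11", 80)]
    for ip, port in missing_listeners(ips, ports, configured):
        print(f"not listening on {ip} port {port}")
```

Any pair reported by such a check is one that the fault monitor would try to reach and might record as a failure.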

The probe executes the following steps:

1. The probe connects to the Web server by using the specified IP address and port combination. If the connection is not successful, the probe concludes that a total failure has occurred. The probe then records the failure and takes appropriate action.
2. If the probe successfully connects, it checks to see if the Web server is being run in a secure mode. If so, the probe just disconnects and returns with a success status. No further checks are performed for a secure iWS server.

However, if the Web server is running in insecure mode, the probe sends an HTTP 1.0 HEAD request to the Web server and waits for the response. The request can be unsuccessful for various reasons, including heavy network traffic, heavy system load, and misconfiguration.

Misconfiguration can occur when the Web server is not configured to listen on all IP address and port combinations that are being probed. The Web server should service every port for every IP address specified for this resource.

Misconfigurations can also result if the Network_resources_used and Port_list resource properties are not set correctly while you are creating the resource.

If the reply to the query is not received within the Probe_timeout resource property limit, the probe considers this a failure of Sun Cluster HA for iPlanet Web Server. The failure is recorded in the probe's history.
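
The probe steps above can be sketched in Python as follows. This is an illustration only, not the actual fault monitor code: the function name probe_once, the outcome strings, the request path "/", and the rule that any non-empty reply counts as success are all assumptions. The sketch also applies the time-out to each socket operation separately, whereas Probe_timeout limits the probe as a whole.

```python
import socket


def probe_once(host, port, timeout, secure=False):
    """Probe one IP address and port pair and return an outcome string.

    Outcome names are hypothetical: "ok", "connect_failed", "send_failed",
    "reply_timeout", or "read_failed".
    """
    try:
        # Step 1: connect to the Web server.
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "connect_failed"

    with conn:
        # Step 2: a secure instance is only checked for connectivity.
        if secure:
            return "ok"
        try:
            # Insecure instance: send an HTTP 1.0 HEAD request.
            conn.sendall(b"HEAD / HTTP/1.0\r\n\r\n")
        except OSError:
            return "send_failed"
        try:
            reply = conn.recv(1024)
        except socket.timeout:
            # No reply arrived within the time-out.
            return "reply_timeout"
        except OSError:
            return "read_failed"

    # Assumption: any non-empty reply means the server answered the query.
    return "ok" if reply else "read_failed"
```

A caller could run probe_once for each IP address and port pair and record every outcome other than "ok" in its failure history.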

A probe failure can be a total or partial failure. Probe failures that are considered total failures are:
- Failure to connect to the server, as flagged by the error message: Failed to connect to %s port %d, with %s being the host name and %d the port number.
- Running out of time (exceeding the resource property timeout Probe_timeout) after trying to connect to the server.
- Failure to successfully send the probe string to the server, as flagged by the error message: Failed to communicate with server %s port %d: %s, with the first %s being the host name, %d the port number, and the second %s giving further details about the error.

The monitor accumulates partial failures: two partial failures within the interval set by the Retry_interval resource property are counted as one total failure. Probe failures that are considered partial failures are:
- Running out of time (exceeding the resource property timeout Probe_timeout) while trying to read the reply from the server to the probe's query.
- Failing to read data from the server for other reasons, as flagged by the error message: Failed to communicate with server %s port %d: %s, with the first %s being the host name, %d the port number, and the second %s giving further details about the error.
Based on the history of failures, a failure can cause either a local restart or a failover of the data service. This action is further described in "Health Checks of the Data Service".
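
The accumulation of failures within Retry_interval can be pictured with the short Python sketch below. It reads "counted as one" as "two partial failures equal one total failure", which is an interpretation. The class name, the weights 1.0 and 0.5, and the sliding-window bookkeeping are assumptions for illustration, not the monitor's actual implementation. The sketch deliberately does not decide between a restart and a failover, because that policy is described in "Health Checks of the Data Service".

```python
from collections import deque

# Assumed weights: a partial failure counts as half a total failure.
TOTAL = 1.0
PARTIAL = 0.5


class FailureHistory:
    """Hypothetical sliding-window history of probe failures."""

    def __init__(self, retry_interval):
        self.retry_interval = retry_interval  # seconds
        self._events = deque()                # (timestamp, weight) pairs

    def record(self, now, weight):
        """Record a failure at time `now` and return the accumulated total."""
        self._events.append((now, weight))
        return self.accumulated(now)

    def accumulated(self, now):
        """Drop failures older than retry_interval and sum the rest."""
        while self._events and now - self._events[0][0] > self.retry_interval:
            self._events.popleft()
        return sum(weight for _, weight in self._events)


if __name__ == "__main__":
    history = FailureHistory(retry_interval=300)
    history.record(0, PARTIAL)
    # A second partial failure inside the window adds up to one total failure.
    print(history.record(100, PARTIAL))
```

Using fractional weights keeps total and partial failures in one history, so a single threshold comparison could cover both kinds.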