Tuning the HA for Oracle Traffic Director Fault Monitor

The HA for Oracle Traffic Director fault monitor is contained in the resource that represents Oracle Traffic Director. You create this resource when you register and configure HA for Oracle Traffic Director. For more information, see Registering and Configuring HA for Oracle Traffic Director.

System properties and extension properties of this resource control the behavior of the fault monitor. The default values of these properties determine the preset behavior of the fault monitor. The preset behavior should be suitable for most Oracle Solaris Cluster installations. Therefore, you should tune the HA for Oracle Traffic Director fault monitor only if you need to modify this preset behavior.

For more information, see the following sections.

Operations by the Fault Monitor During a Probe

The probe for HA for Oracle Traffic Director uses a request to the Oracle Traffic Director instance to query the health of the Oracle Traffic Director instance. Before the probe actually queries the Oracle Traffic Director instance, a check is made to confirm that the network resources are configured for the Oracle Traffic Director instance resource. If no network resources are configured, an error message is logged, and the probe exits with a failure.

If the Oracle Traffic Director instance resource is used with a failover logical hostname, the fault monitor retrieves the Port_list extension property and connects to the Oracle Traffic Director instance through localhost on that port. When the resource is used with a shared address, the Resource_dependencies resource-property setting on the Oracle Traffic Director resource determines the set of IP addresses that the Oracle Traffic Director instance uses. The Port_list resource-property setting determines the list of port numbers that the Oracle Traffic Director instance uses. The fault monitor monitors the Oracle Traffic Director server instance through the first port in Port_list. All the ports used by Oracle Traffic Director must be configured in Port_list for the shared address load balancing feature to work correctly.
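As an illustration of how the first port in Port_list could be selected, the following Python sketch parses a Port_list value. It assumes the value is a comma-separated list of port/protocol entries such as 8080/tcp. The function name first_probe_port is hypothetical and is not part of the data service.

```python
def first_probe_port(port_list: str) -> int:
    """Return the port number of the first entry in a Port_list value.

    Assumes entries look like "8080/tcp" and are separated by commas.
    """
    first_entry = port_list.split(",")[0].strip()
    # Keep only the port number; the "/tcp" protocol suffix is dropped.
    return int(first_entry.split("/")[0])


if __name__ == "__main__":
    print(first_probe_port("8080/tcp,8443/tcp"))
```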

If the probe fails to connect to the Oracle Traffic Director instance using a specified IP address and port combination, a complete failure occurs. The probe records the failure and takes appropriate action.

The probe sends an HTTP HEAD request to the Oracle Traffic Director instance and waits for the response. The request can be unsuccessful for various reasons, including heavy network traffic, heavy system load, and misconfiguration.

Misconfigurations can occur under the following conditions.

The Oracle Traffic Director instance is not configured to listen on all addresses (INADDR_ANY) when the Oracle Traffic Director resource is configured with a failover logical hostname.
The Oracle Traffic Director instance is not configured to listen on the shared address when the Oracle Traffic Director resource is configured with a shared address.
The Resource_dependencies and Port_list resource properties were not set correctly when you created the resource.

If the reply to the query is not received within the time limit set by the Probe_timeout resource property, the probe treats the attempt as a failure of HA for Oracle Traffic Director. The failure is recorded in the probe's history.
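The HEAD request and its time limit can be pictured with a minimal Python sketch. This is not the fault monitor's implementation. The function name head_probe and the probe_timeout parameter (standing in for Probe_timeout) are assumptions made for illustration.

```python
import http.client


def head_probe(host: str, port: int, probe_timeout: float) -> bool:
    """Return True if a HEAD request receives an HTTP reply in time."""
    # The timeout applies to each blocking socket operation (connect,
    # send, read), not to the request as a whole.
    conn = http.client.HTTPConnection(host, port, timeout=probe_timeout)
    try:
        conn.request("HEAD", "/")
        conn.getresponse()
        return True
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and malformed replies.
        return False
    finally:
        conn.close()
```

A call such as head_probe("localhost", 8080, 30.0) would correspond to probing a local instance on port 8080 with a 30-second limit. Both values are examples only.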

A probe failure can be a complete or partial failure. The following probe failures are considered complete failures.

Failure to connect to the server. The following error message is sent, where %s indicates the host name and %d indicates the port number.
```
Failed to connect to %s port %d
```
Timeout (exceeding the resource-property timeout Probe_timeout) after trying to connect to the server.
Failure to successfully send the probe string to the server. The following error message is sent, where the first %s indicates the host name, %d indicates the port number, and the second %s indicates further details about the error.
```
Failed to communicate with server %s port %d: %s
```

The following probe failures are considered partial failures.

Timeout (exceeding the resource-property timeout Probe_timeout) while trying to read the reply from the server to the probe's query.
Failure to read data from the server for other reasons. The following error message is sent, where the first %s indicates the host name, %d indicates the port number, and the second %s indicates further details about the error.
```
Failed to communicate with server %s port %d: %s
```

The monitor accumulates two such partial failures within the resource-property interval Retry_interval and counts them as one failure.
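The distinction between complete and partial failures, and the rule that two partial failures within Retry_interval count as one failure, can be sketched in Python as follows. The names Stage, classify, and FailureHistory are hypothetical. The handling of the interval boundary is an assumption, and the real monitor's bookkeeping may differ.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    CONNECT = "connect"  # opening the connection (including timeouts)
    SEND = "send"        # sending the probe string
    READ = "read"        # reading the reply (including timeouts)


def classify(stage: Stage) -> str:
    """Connect and send failures are complete; read failures are partial."""
    return "partial" if stage is Stage.READ else "complete"


@dataclass
class FailureHistory:
    retry_interval: float  # seconds; stands in for Retry_interval
    partial_times: list = field(default_factory=list)
    failures: int = 0

    def record(self, stage: Stage, now: float) -> int:
        """Record one probe failure and return the failure count so far."""
        if classify(stage) == "complete":
            self.failures += 1
            return self.failures
        # Forget partial failures that are older than the interval.
        self.partial_times = [
            t for t in self.partial_times if now - t <= self.retry_interval
        ]
        self.partial_times.append(now)
        if len(self.partial_times) >= 2:
            # Two partial failures inside the interval count as one failure.
            self.failures += 1
            self.partial_times.clear()
        return self.failures
```

In this sketch, a read timeout at 0 seconds followed by another at 100 seconds, with a 300-second interval, raises the failure count by one, while a single connect failure raises it by one immediately.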

The probe connects to the Oracle Traffic Director instance and performs an HTTP GET check by sending an HTTP request to the URL specified in the Server_URL property. If the HTTP server return code is 500 (Internal Server Error) or if the connection fails, the probe takes action.

The result of the HTTP requests is either a success or a failure. If all the requests to the Oracle Traffic Director instance receive a response that indicates a success, then the probe returns and continues with the next cycle of probing.
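The HTTP GET check described above can be approximated with the following Python sketch. It assumes plain http URLs and treats only a 500 return code or a failed connection as a failure, as stated above. How the real probe treats other return codes is not covered here. The function names check_url and probe_server_urls are hypothetical.

```python
import http.client
from urllib.parse import urlsplit


def check_url(url: str, timeout: float) -> bool:
    """Return False on a 500 reply or a failed connection, else True."""
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(
        parts.hostname, parts.port or 80, timeout=timeout
    )
    try:
        conn.request("GET", parts.path or "/")
        return conn.getresponse().status != 500
    except (OSError, http.client.HTTPException):
        # The connection failed, timed out, or the reply was malformed.
        return False
    finally:
        conn.close()


def probe_server_urls(urls, timeout: float = 5.0) -> bool:
    """The probe succeeds only if every URL check succeeds."""
    return all(check_url(url, timeout) for url in urls)
```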

Heavy network traffic, heavy system load, and misconfiguration can cause the HTTP GET probe to fail. Misconfiguration of the Server_URL property can cause a failure if a URI in the Server_URL property includes an incorrect port or hostname. For example, a failure occurs if the URI specifies a host other than localhost when the resource is configured with a failover logical hostname, or a host other than the shared address when the resource is configured with a shared address.
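A check for this kind of Server_URL misconfiguration might look like the following Python sketch. The function name server_url_problems and its parameters are hypothetical. The caller is assumed to supply the hosts and ports that are valid for the resource, for example localhost or the shared address, and the ports from Port_list.

```python
from urllib.parse import urlsplit


def server_url_problems(url: str, allowed_hosts: set, allowed_ports: set) -> list:
    """Return a list of problems found in one Server_URL entry."""
    parts = urlsplit(url)
    # Fall back to the scheme's default port when the URI omits one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    problems = []
    if parts.hostname not in allowed_hosts:
        problems.append(f"unexpected host {parts.hostname!r}")
    if port not in allowed_ports:
        problems.append(f"unexpected port {port}")
    return problems
```

An empty list means that the host and port of the URI match the expected values.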

Based on the history of failures, a failure can cause either a local restart or a failover of the data service. This action is further described in Tuning Fault Monitors for Oracle Solaris Cluster Data Services in Oracle Solaris Cluster Data Services Planning and Administration Guide.