Sun Cluster 3.0 Data Services Installation and Configuration Guide

Sun Cluster HA for Oracle Fault Monitor

Sun Cluster HA for Oracle has two fault monitors: a server fault monitor and a listener fault monitor.

Oracle Server Fault Monitor

The fault monitor for the Oracle server sends a request to the server to query its health.

The server fault monitor consists of two processes: a main fault monitor process and a database client fault probe. The main process performs error lookup and scha_control actions. The database client fault probe performs database transactions.

The probe makes all database connections as user oracle. The main fault monitor considers the operation successful if the database is online and the transaction returns no errors.

If the database transaction fails, the main process looks up the error in its internal action table and performs the predetermined action. If the action runs an external program, that program is executed as a separate process in the background. Possible actions include a switchover, stopping and restarting the server, and stopping and restarting the resource group.
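The lookup step described above can be sketched in Python as follows. This is an illustration only, not the product's implementation: the error codes, action names, and the table contents are hypothetical, and the real action table is internal to the fault monitor.

```python
from typing import Dict

# Hypothetical action table: error code -> predetermined action.
# The codes and action names are invented for illustration.
ACTION_TABLE: Dict[int, str] = {
    1001: "restart_server",
    1002: "restart_resource_group",
    1003: "switchover",
}

# Assumption: errors that are not in the table trigger no action.
DEFAULT_ACTION = "none"


def lookup_action(error_code: int) -> str:
    """Return the predetermined action for a failed transaction's error code."""
    return ACTION_TABLE.get(error_code, DEFAULT_ACTION)
```

Keeping the mapping in a table rather than in branching code means the response to a given error can be read, and changed, in one place.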

The probe uses the time-out value set in the resource property Probe_timeout to determine how long to wait for a probe of Oracle to succeed.
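As a rough sketch of how a time-out can bound a probe, the following Python function runs a probe as a separate process and treats both a nonzero exit status and an expired time-out as failure. The function name run_probe and the idea of passing the probe as a command line are assumptions made for this example; the actual probe is the database client process described earlier.

```python
import subprocess
import sys
from typing import Sequence


def run_probe(command: Sequence[str], probe_timeout: float) -> bool:
    """Return True only if the probe command exits with status 0
    within probe_timeout seconds."""
    try:
        result = subprocess.run(
            command,
            timeout=probe_timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so a hung
        # probe does not linger after the time-out expires.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Stand-in probe: a Python child process that exits immediately.
    print(run_probe([sys.executable, "-c", "pass"], probe_timeout=10))
```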

The server fault monitor also scans Oracle's alert_log_file and takes action based on any errors it finds.
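A log scan of this kind might look like the following Python sketch. It assumes that errors appear in the alert log as ORA- followed by a five-digit code, and it only extracts the codes; how the fault monitor actually parses the log and which errors it acts on are not specified here. The function name find_ora_errors is hypothetical.

```python
import re
from typing import Iterable, List

# Assumption: errors are written as "ORA-" plus a five-digit code.
ORA_ERROR = re.compile(r"\bORA-(\d{5})\b")


def find_ora_errors(lines: Iterable[str]) -> List[int]:
    """Return every ORA error code found in the given log lines, in order."""
    codes: List[int] = []
    for line in lines:
        for match in ORA_ERROR.finditer(line):
            codes.append(int(match.group(1)))
    return codes
```

Each extracted code could then be looked up in an action table to decide what, if anything, to do.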

The server fault monitor is started through pmfadm to make it highly available. If the monitor is killed for any reason, it is automatically restarted by the Process Monitor Facility (PMF).

Oracle Listener Fault Monitor

The Oracle listener fault monitor checks the status of an Oracle listener.

If the listener is running, the Oracle listener fault monitor considers a probe successful. If the fault monitor detects an error, the listener is restarted.

The listener probe is started through pmfadm to make it highly available. If the probe is killed, it is automatically restarted by the Process Monitor Facility.

If a problem occurs with the listener during a probe, the probe tries to restart the listener. The resource property Retry_count sets the maximum number of restart attempts. If the probe is still unsuccessful after the maximum number of attempts, it stops the fault monitor and does not switch over the resource group.
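The retry behavior described above can be sketched as a bounded loop. In this Python illustration, probe and restart are hypothetical callables that stand in for the listener status check and the listener restart, and the returned strings are invented labels for the two outcomes.

```python
from typing import Callable


def handle_listener_fault(
    probe: Callable[[], bool],
    restart: Callable[[], None],
    retry_count: int,
) -> str:
    """Restart the listener up to retry_count times after a failed probe.

    Returns "recovered" as soon as a probe succeeds after a restart,
    or "monitor_stopped" if every attempt fails. No switchover of the
    resource group is attempted in either case.
    """
    for _ in range(retry_count):
        restart()
        if probe():
            return "recovered"
    return "monitor_stopped"
```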