Sun Cluster 3.1 Data Service for Oracle

Understanding Sun Cluster HA for Oracle Fault Monitor

Sun Cluster HA for Oracle has two fault monitors: a server fault monitor and a listener fault monitor.

Oracle Server Fault Monitor

The fault monitor for the Oracle server sends a request to the server to query the server's health.

The server fault monitor consists of the following two processes:

a main fault monitor process, which performs error lookup and scha_control actions
a database client fault probe, which performs database transactions

The probe makes all of its database connections as user oracle. The main fault monitor determines that the operation is successful if the database is online and no errors are returned during the transaction.

If the database transaction fails, the main process looks up the error in the internal action table and performs the predetermined action. If the action runs an external program, that program runs as a separate process in the background. Possible actions include the following:

switching over
stopping the server
restarting the server
stopping the resource group
restarting the resource group
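
The lookup step described above can be sketched roughly as follows. This is only an illustration: the real action table is internal to the fault monitor, so the error numbers, the pairing of errors with actions, and the default action here are all hypothetical.

```python
from enum import Enum


class Action(Enum):
    """Actions named in this section."""
    NONE = "no action"
    SWITCHOVER = "switching over"
    STOP_SERVER = "stopping the server"
    RESTART_SERVER = "restarting the server"
    STOP_RESOURCE_GROUP = "stopping the resource group"
    RESTART_RESOURCE_GROUP = "restarting the resource group"


# Hypothetical table. The error numbers are placeholders and the
# error-to-action pairing is invented for illustration only.
ACTION_TABLE = {
    1001: Action.RESTART_SERVER,
    1002: Action.SWITCHOVER,
    1003: Action.STOP_SERVER,
}

# Assumption: an error that is not listed in the table triggers no action.
DEFAULT_ACTION = Action.NONE


def lookup_action(error_number: int) -> Action:
    """Return the predetermined action for a failed database transaction."""
    return ACTION_TABLE.get(error_number, DEFAULT_ACTION)
```

For example, lookup_action(1002) returns Action.SWITCHOVER, and an unlisted error number falls back to DEFAULT_ACTION.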

The resource property Probe_timeout sets the time-out value for the probe, that is, the maximum amount of time that the probe can take to probe Oracle successfully.
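
As a rough sketch of how a time-out bounds a probe, the following runs a probe function and gives up after a fixed number of seconds. The function name run_probe, the result strings, and the use of a worker thread are assumptions made for illustration; the section does not describe how the real probe enforces Probe_timeout.

```python
import concurrent.futures


def run_probe(probe, probe_timeout: float) -> str:
    """Run probe() and return 'ok', 'failed', or 'timeout'.

    probe is a hypothetical callable that returns True on success.
    probe_timeout plays the role of the Probe_timeout value, in seconds.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(probe)
    try:
        result = "ok" if future.result(timeout=probe_timeout) else "failed"
    except concurrent.futures.TimeoutError:
        # The probe did not finish within the allowed time.
        result = "timeout"
    except Exception:
        # The probe raised an error, which counts as a failed probe.
        result = "failed"
    finally:
        # Do not block on a probe that is still running.
        executor.shutdown(wait=False)
    return result
```

A probe that returns True within the limit yields 'ok'. A probe that is still running when the limit expires yields 'timeout', and the caller can then treat the probe as unsuccessful.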

The server fault monitor also scans Oracle's alert_log_file and takes action based on any errors that the fault monitor finds.
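
A minimal sketch of such a scan is shown below. It assumes that errors appear in the alert log as Oracle error codes of the form ORA-NNNNN. How the real fault monitor parses the log is not described here, so the function name and the matching rule are illustrative only.

```python
import re

# Oracle error codes are written as "ORA-" followed by five digits.
ORA_ERROR = re.compile(r"ORA-(\d{5})")


def scan_alert_log(lines):
    """Return the Oracle error numbers found in the given log lines, in order."""
    found = []
    for line in lines:
        for match in ORA_ERROR.finditer(line):
            found.append(int(match.group(1)))
    return found
```

Each number that the scan returns could then be passed to whatever lookup the monitor uses to choose an action.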

The server fault monitor is started through pmfadm to make the monitor highly available. If the monitor is killed for any reason, the Process Monitor Facility (PMF) automatically restarts the monitor.

Oracle Listener Fault Monitor

The Oracle listener fault monitor checks the status of an Oracle listener.

If the listener is running, the Oracle listener fault monitor considers the probe successful. If the fault monitor detects an error, the fault monitor restarts the listener.

The listener probe is started through pmfadm to make the probe highly available. If the probe is killed, PMF automatically restarts the probe.

If a problem occurs with the listener during a probe, the probe tries to restart the listener. The value that is set in the resource property Retry_count determines the maximum number of times that the probe attempts the restart. If the listener is still not running after the maximum number of attempts, the probe stops the fault monitor and does not switch over the resource group.
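
The restart logic described above can be sketched as a bounded retry loop. The function name probe_listener, the two callables, and the result strings are hypothetical. Only the behavior (restart up to Retry_count times, then stop the fault monitor without a switchover) comes from this section.

```python
def probe_listener(is_running, restart, retry_count: int) -> str:
    """Probe the listener and restart it at most retry_count times.

    is_running and restart are hypothetical callables that check and
    restart the listener. retry_count plays the role of Retry_count.
    """
    if is_running():
        return "ok"
    for _ in range(retry_count):
        restart()
        if is_running():
            return "restarted"
    # All attempts failed: stop the fault monitor. The resource group
    # is not switched over.
    return "stop_monitor"
```

With retry_count set to 3, a listener that comes up on the second restart yields 'restarted' after two attempts, and a listener that never comes up yields 'stop_monitor' after three attempts.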