Sun Cluster HA for Oracle has two fault monitors: a server fault monitor and a listener fault monitor.
The server fault monitor queries the health of the Oracle server by sending a request to the server.
The server fault monitor is started through pmfadm to make the monitor highly available. If the monitor is killed for any reason, the Process Monitor Facility (PMF) automatically restarts the monitor.
The server fault monitor consists of the following processes:
A main fault monitor process, which performs error lookup and scha_control actions
A database client fault probe, which performs database transactions
The main fault monitor determines that an operation is successful if the database is online and no errors are returned during the transaction.
The database client fault probe queries the dynamic performance view v$sysstat to obtain database performance statistics. Changes to these statistics indicate that the database is operational. If these statistics remain unchanged between consecutive queries, the fault probe performs database transactions to determine if the database is operational. These transactions involve the creation, updating, and dropping of a table in the user table space.
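The decision logic described above can be sketched as follows. This is only an illustrative model, not the actual probe implementation: the class name ProbeSketch and the callables read_sysstat and run_test_transaction are hypothetical stand-ins for the v$sysstat query and for the create, update, and drop table transactions. The sketch also assumes that the first probe, which has no earlier statistics to compare against, falls back to a test transaction.

```python
from typing import Callable, Dict, Optional

Stats = Dict[str, int]


class ProbeSketch:
    """Hypothetical model of the client fault probe's decision logic."""

    def __init__(
        self,
        read_sysstat: Callable[[], Stats],
        run_test_transaction: Callable[[], bool],
    ) -> None:
        # Both callables are assumptions: one returns a snapshot of
        # v$sysstat, the other creates, updates, and drops a test table.
        self._read_sysstat = read_sysstat
        self._run_test_transaction = run_test_transaction
        self._previous: Optional[Stats] = None

    def probe(self) -> bool:
        """Return True if the database appears to be operational."""
        current = self._read_sysstat()
        changed = self._previous is not None and current != self._previous
        self._previous = current
        if changed:
            # Statistics moved since the last query, so the database is working.
            return True
        # Statistics are unchanged (or there is no earlier snapshot):
        # fall back to explicit database transactions.
        return self._run_test_transaction()
```

The point of the two-stage design is that the cheap statistics comparison avoids running write transactions against a database that is already demonstrably active.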
The database client fault probe performs all its transactions as the Oracle user. The ID of this user is specified during the preparation of the nodes as explained in How to Prepare the Nodes.
The resource property Probe_timeout sets the time-out value for the probe, that is, the maximum time that the probe allows for a successful probe of Oracle.
If a database transaction fails, the server fault monitor performs an action that is determined by the error that caused the failure. To change the action that the server fault monitor performs, customize the server fault monitor as explained in Customizing the Sun Cluster HA for Oracle Server Fault Monitor.
If the action requires an external program to be run, the program is run as a separate process in the background.
Possible actions are as follows:
Ignore. The server fault monitor ignores the error.
Stop monitoring. The server fault monitor is stopped without shutting down the database.
Restart. The server fault monitor stops and restarts the entity that is specified by the value of the Restart_type extension property:
If the Restart_type extension property is set to RESOURCE_GROUP_RESTART, the server fault monitor restarts the database server resource group. By default, the server fault monitor restarts the database server resource group.
If the Restart_type extension property is set to RESOURCE_RESTART, the server fault monitor restarts the database server resource.
If the number of restart attempts exceeds the value of the Retry_count resource property within the time that the Retry_interval resource property specifies, the server fault monitor attempts to switch over the resource group to another node.
Switch over. The server fault monitor switches over the database server resource group to another node. If no nodes are available, the attempt to switch over the resource group fails, and the database server is restarted instead.
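The escalation from restart to switchover can be modeled with a sliding time window. The following is a minimal sketch under stated assumptions, not the cluster framework's implementation: the class RestartPolicy and its method next_action are hypothetical names, and the exact boundary handling (whether an attempt that falls exactly at the edge of Retry_interval still counts) is an assumption.

```python
import time
from collections import deque
from typing import Deque, Optional


class RestartPolicy:
    """Hypothetical model of the Retry_count / Retry_interval escalation."""

    def __init__(self, retry_count: int, retry_interval: float) -> None:
        self.retry_count = retry_count        # models Retry_count
        self.retry_interval = retry_interval  # models Retry_interval, in seconds
        self._attempts: Deque[float] = deque()

    def next_action(self, now: Optional[float] = None) -> str:
        """Record a restart attempt and return "restart" or "switch_over"."""
        now = time.monotonic() if now is None else now
        # Forget attempts that fall outside the retry interval.
        while self._attempts and now - self._attempts[0] > self.retry_interval:
            self._attempts.popleft()
        self._attempts.append(now)
        # Escalate once the attempts within the window exceed the limit.
        if len(self._attempts) > self.retry_count:
            return "switch_over"
        return "restart"
```

With retry_count set to 2 and retry_interval set to 300, two failures within five minutes each yield "restart", and a third within the same window yields "switch_over". After the window has passed, the count starts again.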
The Oracle software logs alerts in an alert log file. The absolute path of this file is specified by the alert_log_file extension property of the SUNW.oracle_server resource. The server fault monitor scans the alert log file for new alerts at the following times:
When the server fault monitor is started
Each time that the server fault monitor queries the health of the server
If an action is defined for a logged alert that the server fault monitor detects, the server fault monitor performs the action in response to the alert.
Preset actions for logged alerts are listed in Table B–2. To change the action that the server fault monitor performs, customize the server fault monitor as explained in Customizing the Sun Cluster HA for Oracle Server Fault Monitor.
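Scanning only for new alerts implies remembering how far the log was read on the previous scan. The following sketch shows one way to do this. It is an assumption-laden illustration rather than the monitor's actual code: the function names are hypothetical, alerts are assumed to be identifiable by ORA-nnnnn error codes, and the action table is supplied by the caller rather than taken from Table B–2.

```python
import re
from typing import Dict, List, Tuple

# Assumption: logged alerts carry five-digit Oracle error codes such as ORA-00600.
ORA_ERROR = re.compile(r"ORA-(\d{5})")


def scan_new_alerts(path: str, offset: int) -> Tuple[List[str], int]:
    """Return the error codes logged after byte `offset`, plus the new offset."""
    with open(path, "rb") as log:
        log.seek(offset)
        data = log.read()
    text = data.decode("utf-8", errors="replace")
    return ORA_ERROR.findall(text), offset + len(data)


def actions_for_alerts(codes: List[str], action_table: Dict[str, str]) -> List[str]:
    """Keep only the alerts for which an action is defined."""
    return [action_table[code] for code in codes if code in action_table]
```

A caller would invoke scan_new_alerts with offset 0 when monitoring starts, then pass the returned offset back in on each subsequent health query so that earlier alerts are not acted on twice.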
The Oracle listener fault monitor checks the status of an Oracle listener.
If the listener is running, the Oracle listener fault monitor considers a probe successful. If the fault monitor detects an error, the listener is restarted.
The listener probe is started through pmfadm to make the probe highly available. If the probe is killed, PMF automatically restarts the probe.
If a problem occurs with the listener during a probe, the probe tries to restart the listener. The value that is set in the resource property Retry_count determines the maximum number of times that the probe attempts the restart. If, after trying for the maximum number of times, the probe is still unsuccessful, the probe stops the fault monitor and does not switch over the resource group.
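The listener probe's bounded retry behavior can be sketched as a simple loop. This is a hypothetical illustration only: the function handle_listener_failure and the callable restart_listener are assumed names, and the returned strings merely label the two outcomes described above.

```python
from typing import Callable


def handle_listener_failure(
    restart_listener: Callable[[], bool], retry_count: int
) -> str:
    """Try to restart the listener at most `retry_count` times."""
    for _ in range(retry_count):
        if restart_listener():
            # The listener came back, so monitoring continues.
            return "listener_restarted"
    # All attempts failed: the fault monitor stops and the
    # resource group is not switched over.
    return "fault_monitor_stopped"
```

Unlike the server fault monitor, this path never escalates to a switchover. Exhausting the retries only stops the fault monitor.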