Tuning the Oracle Communications ASAP Fault Monitors

This section describes the probing algorithm and functionality of the Oracle Communications ASAP fault monitor, and states the conditions, messages, and recovery actions associated with unsuccessful probing.

For conceptual information about fault monitors, see the Oracle Solaris Cluster 4.3 Concepts Guide.

Resource Properties

The HA for Oracle Communications ASAP fault monitor uses the resource properties that are specified in the resource type ORCL.asap. For a list of the general resource properties used, see the r_properties(5) man page. For a list of the extension properties for this resource type, see Appendix A, Oracle HA for Oracle Communications ASAP Extension Properties.

Probing Algorithm and Functionality

The Oracle Communications ASAP fault monitor is controlled by extension properties that determine how often and how thoroughly the application is probed. The default values of these properties determine the preset behavior of the fault monitor and are suitable for most Oracle Solaris Cluster installations.

You can change this preset behavior by modifying the following settings:

The interval between fault monitor probes (Thorough_probe_interval).
The timeout for fault monitor probes (Probe_timeout).
Whether a detailed probe is run on the application (DETAILED_PROBING).
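
As an illustration of changing these settings, the following minimal sketch builds the argument vector for a clresource set call, assuming the standard Oracle Solaris Cluster form `clresource set -p name=value resource`. The helper name, the resource name asap-rs, and the values 120 and 60 are hypothetical and chosen only for the example; check the actual defaults and permitted ranges in the extension properties appendix before changing anything.

```python
def clresource_set_args(resource, **properties):
    """Build the argument vector for `clresource set`, one -p flag per property.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    args = ["clresource", "set"]
    for name, value in properties.items():
        args += ["-p", f"{name}={value}"]
    args.append(resource)
    return args


# "asap-rs" and the values below are assumptions for illustration only.
command = clresource_set_args(
    "asap-rs",
    Thorough_probe_interval=120,
    Probe_timeout=60,
)
print(" ".join(command))
```

On a cluster node the resulting list could be handed to subprocess.run by a user with the required authorization. Building the vector separately makes it easy to review the command before it is executed.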

Operation of the Oracle Communications ASAP Probe

The following list explains how the Oracle Communications ASAP probe operates:

The sqlplus command queries the database for the list of configured processes, and the status utility reports all running ASAP application processes. The HA ASAP agent compares the output of the sqlplus command with the output of the status command. If the Control Server process is not started, the probe returns 100. If all the processes configured in the database are started, the probe returns 0. If the probe detects that any process is not available, the status message for the resource is set to proc_name terminated.
If the database connection fails, the probe returns 0 and the service is put into DEGRADED status. When the database comes back up, the monitor program detects the restored connectivity and changes the resource status to ONLINE.
If an application process managed by the Control Server terminates abnormally, the Control Server tries to restart the terminated process, using the RESTART_ATTEMPTS and RESTART_DELAY attributes configured in the ASAP.cfg file. If the Control Server fails to restart the process within the time calculated from RESTART_ATTEMPTS and RESTART_DELAY, a restart of the complete ASAP system might be required, depending on the criticality of the process.

If the CRITICAL_PROCESS_LIST property is set, the processes listed are critical to the running of the ASAP system. If a process in the list terminates abnormally and the Control Server fails to restart it, the highly available ASAP agent restarts the resource.

If the Control Server is terminated, the agent probe returns 100, and the highly available ASAP resource is either restarted on the same node or failed over to the next available node.
If the UNMONITORED_PROCESS_LIST property is set, the agent does not report a status message when a listed process is stopped. Instead, the probe method periodically logs a message that the process is unmonitored. This lets you intentionally stop a process for maintenance without the Oracle Communications ASAP agent restarting the resource. After maintenance, re-enable monitoring by removing the process name from the property.
If DETAILED_PROBING is set to true, the agent probe runs the run_suite command and then issues an SQL query against the ASAP SARM database to get the status of the work order.
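
The decision logic in the list above can be sketched roughly as follows. This is not the agent's actual implementation. The function name, the order of the checks, the message used for a database failure, and the process names in the usage lines are assumptions. In particular, the sketch returns 0 with a status message when a non-critical process is missing, and it returns 100 for a missing critical process without modeling the Control Server's own restart attempts. The source does not spell out either detail.

```python
PROBE_OK = 0        # probe succeeded
PROBE_FAILED = 100  # complete failure: resource is restarted or failed over


def evaluate_probe(configured, running, control_server,
                   db_reachable=True, critical=(), unmonitored=()):
    """Return (exit_code, status_message) for one probe pass.

    configured: process names from the database query (sqlplus).
    running: process names reported by the status utility.
    """
    if not db_reachable:
        # Documented: the probe returns 0 and the resource becomes DEGRADED.
        return PROBE_OK, "database connection failed"

    running = set(running)
    if control_server not in running:
        # Documented: a missing Control Server makes the probe return 100.
        return PROBE_FAILED, f"{control_server} terminated"

    skipped = set(unmonitored)
    missing = [p for p in configured
               if p not in running and p not in skipped]
    if not missing:
        return PROBE_OK, ""

    message = ", ".join(f"{p} terminated" for p in missing)
    if any(p in critical for p in missing):
        # Assumption: the Control Server has already failed to restart it.
        return PROBE_FAILED, message
    # Assumption: non-critical failures are reported but do not fail the probe.
    return PROBE_OK, message


# Hypothetical process names, for illustration only.
configured = ["CTRL", "SARM", "NEP1"]
print(evaluate_probe(configured, ["CTRL", "SARM"], "CTRL"))
print(evaluate_probe(configured, ["CTRL", "SARM"], "CTRL", critical=["NEP1"]))
print(evaluate_probe(configured, ["CTRL", "SARM"], "CTRL", unmonitored=["NEP1"]))
```

Keeping the comparison as a pure function of the two process lists makes each rule easy to check in isolation, separately from the sqlplus and status calls that would supply the lists on a real node.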