HA-DBMS Fault Probes (Sun Cluster 2.2 System Administration Guide)

Sun Cluster 2.2 System Administration Guide

HA-DBMS Fault Probes

The fault probes for Sun Cluster HA for Oracle, Sun Cluster HA for Sybase and Sun Cluster HA for Informix perform similarly to monitor the database server. The HA-DBMS fault probes are configured by running one of the utilities, haoracle(1M), hasybase(1M), or hainformix(1M). (See the online man pages for a detailed description of the options for these utilities.)

Once the utilities are configured and activated, two processes are started on the local node and two processes are started on the remote node simulating a client access. The remote fault probe is initiated by the ha_dbms_serv daemon and is started when hareg -y dataservicename is initiated.

The HA-DBMS module uses two methods to monitor whether the DBMS service is available. First, HA-DBMS extracts statistics from the DBMS itself:

In Oracle, the V$SYSSTAT table is queried.
In Sybase, the global variables @@io_busy, @@pack_received, @@pack_sent, @@total_read, @@total_write, and @@connections are queried.
In Informix, the SYSPROFILE table is queried.

If the extracted statistics indicate that work is being performed for clients, then no other probing of the DBMS is required. Second, if the DBMS statistics show that no work is occurring, then HA-DBMS submits a small test transaction to the DBMS. If all clients happen to be idle, the DBMS statistics would show no work occurring; that is, the test transaction distinguishes the situation of the database being hung from the legitimately idle situation. Because the test transaction is executed only when the statistics show no activity, it imposes no overhead on an active database. The test transaction consists of:

Creating a table by the name of either HA_DBMS_REM or HA_DBMS_LOC
Inserting values into the created table
Updating the inserted value
Dropping the created table

HA-DBMS carefully filters the error codes returned by the DBMS, using a table that describes which codes should or should not cause a takeover. For example, in the case of Sun Cluster HA for Oracle, the scenario of table space full does not cause a takeover, because an administrator must intervene to fix this condition. (If a takeover were to occur, the new master server would encounter the same table space full condition.)

On the other hand, an error return code such as could not allocate Unix semaphore causes Sun Cluster HA for Oracle to attempt to restart ORACLE locally on this server machine. If a local restart has occurred too recently, then the other machine takes over instead (after first passing its own sanity checks).