Recovering From Failures
Oracle Clusterware can recover automatically from many kinds of failures.
The following sections describe several failure scenarios and how Oracle Clusterware manages the failures.
How TimesTen Performs Recovery When Oracle Clusterware is Configured
The TimesTen database monitor (the ttCRSmaster process) performs recovery. It attempts to connect to the failed database without using the forceconnect option. If the connection fails with error 994 ("Data store connection terminated"), the database monitor tries to reconnect 10 times. If the connection fails with error 707 ("Attempt to connect to a data store that has been manually unloaded from RAM"), the database monitor changes the RAM policy and tries to connect again. If the database monitor cannot connect, it returns a connection failure.
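The retry behavior described above can be sketched roughly as follows. This is an illustration only, not TimesTen code: the ConnectError class and the connect and set_ram_policy callables are hypothetical stand-ins, and limiting the RAM policy change to a single attempt is an assumption.

```python
from typing import Callable

MAX_RECONNECTS = 10  # from the text: error 994 leads to up to 10 reconnect attempts


class ConnectError(Exception):
    """Hypothetical exception that carries a TimesTen error code."""

    def __init__(self, code: int):
        super().__init__(f"TimesTen error {code}")
        self.code = code


def try_connect(connect: Callable[[], None],
                set_ram_policy: Callable[[], None]) -> bool:
    """Return True once connected, False on connection failure.

    connect() stands for a connection attempt without forceconnect.
    Both callables are placeholders supplied by the caller.
    """
    reconnects = 0
    ram_policy_changed = False
    while True:
        try:
            connect()
            return True
        except ConnectError as err:
            if err.code == 994 and reconnects < MAX_RECONNECTS:
                reconnects += 1          # connection terminated: try again
            elif err.code == 707 and not ram_policy_changed:
                set_ram_policy()         # database manually unloaded from RAM
                ram_policy_changed = True  # assumption: only changed once
            else:
                # Other error codes follow separate recovery paths in the text.
                return False
```

A caller would pass in whatever functions actually open the connection and change the RAM policy; the sketch only captures the order of decisions.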
If the database monitor can connect to the database, then it performs these tasks:
- It queries the CHECKSUM column in the TTREP.REPLICATIONS replication table.
- If the value in the CHECKSUM column matches the checksum stored in the Oracle Cluster Registry, then the database monitor verifies the role of the database. If the role is ACTIVE, then recovery is complete. If the role is not ACTIVE, then the database monitor queries the replication Commit Ticket Number (CTN) in the local database and the CTN in the active database to find out whether there are transactions that have not been replicated. If all transactions have been replicated, then recovery is complete.
- If the checksum does not match or if some transactions have not been replicated, then the database monitor performs a duplicate operation from the remote database to re-create the local database.
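The decision sequence in the list above can be written as a small function. This is a sketch under stated assumptions: the function and enum names are hypothetical, and the CTN comparison is reduced to a boolean supplied by the caller because the text does not say how the two CTN values are compared.

```python
from enum import Enum


class Outcome(Enum):
    RECOVERY_COMPLETE = "recovery complete"
    DUPLICATE_FROM_REMOTE = "duplicate from remote database"


def check_after_connect(local_checksum: str,
                        registry_checksum: str,
                        role: str,
                        all_transactions_replicated: bool) -> Outcome:
    """Mirror the checks the database monitor makes after connecting.

    all_transactions_replicated stands in for the result of comparing the
    local CTN with the CTN in the active database (hypothetical input).
    """
    # Checksum in TTREP.REPLICATIONS must match the Oracle Cluster Registry.
    if local_checksum != registry_checksum:
        return Outcome.DUPLICATE_FROM_REMOTE
    # A matching checksum and an ACTIVE role means nothing more is needed.
    if role == "ACTIVE":
        return Outcome.RECOVERY_COMPLETE
    # Otherwise recovery is complete only if nothing is left unreplicated.
    if all_transactions_replicated:
        return Outcome.RECOVERY_COMPLETE
    return Outcome.DUPLICATE_FROM_REMOTE
```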
If the database monitor fails to connect with the database because of error 8110 or 8111 (master catchup required or in progress), then it uses the forceconnect=1 option to connect and starts master catchup. Recovery is complete when master catchup has been completed. If master catchup fails with error 8112 ("Operation not permitted"), then the database monitor performs a duplicate operation from the remote database. See Automatic Catch-Up of a Failed Master Database.
If the connection fails because of other errors, then the database monitor tries to perform a duplicate operation from the remote database.
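Taken together, the error codes mentioned in this section map to recovery actions roughly as in the following sketch. The function names and the action strings are hypothetical labels chosen for illustration; only the error codes and their handling come from the text.

```python
from typing import Optional


def action_for_connect_error(code: int) -> str:
    """Map a connection error code to the next step described in the text."""
    if code == 994:
        return "retry_connect"                    # up to 10 reconnect attempts
    if code == 707:
        return "change_ram_policy_and_retry"
    if code in (8110, 8111):
        return "forceconnect_and_master_catchup"  # connect with forceconnect=1
    return "duplicate_from_remote"                # any other connection error


def action_after_catchup(error_code: Optional[int]) -> str:
    """Decide what follows master catchup; None means catchup succeeded."""
    if error_code is None:
        return "recovery_complete"
    if error_code == 8112:
        return "duplicate_from_remote"
    # The text does not say what happens for other catchup errors.
    raise ValueError(f"handling of catchup error {error_code} is not described")
```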
The duplicate operation verifies that:
- The remote database is available.
- The replication agent is running.
- The remote database has the correct role. The role must be ACTIVE when the duplicate operation is attempted for creation of a standby database. The role must be STANDBY or ACTIVE when the duplicate operation is attempted for creation of a read-only subscriber.
When the conditions for the duplicate operation are satisfied, the existing failed database is destroyed and the duplicate operation starts.
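The three preconditions can be expressed as a single check. This is a minimal sketch: the RemoteStatus type, its field names, and the "standby" and "subscriber" target labels are hypothetical and exist only to illustrate the rule.

```python
from dataclasses import dataclass

# Roles the remote database may have, keyed by what is being created.
ALLOWED_REMOTE_ROLES = {
    "standby": {"ACTIVE"},
    "subscriber": {"ACTIVE", "STANDBY"},
}


@dataclass
class RemoteStatus:
    """Hypothetical snapshot of the remote database's state."""
    available: bool
    rep_agent_running: bool
    role: str


def can_duplicate(remote: RemoteStatus, target: str) -> bool:
    """Return True if a duplicate operation may start for the given target.

    target is "standby" or "subscriber" (illustrative labels).
    """
    return (
        remote.available
        and remote.rep_agent_running
        and remote.role in ALLOWED_REMOTE_ROLES[target]
    )
```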
When an Active Database or Its Host Fails
If there is a failure on the node where the active database resides, Oracle Clusterware automatically changes the state of the standby database to ACTIVE. If application failover is configured, then the application begins updating the new active database.
Figure 8-2 shows that the state of the old standby database has changed to ACTIVE and that the application is updating the new active database.
Figure 8-2 Standby Database Becomes Active
Oracle Clusterware tries to restart the database or host where the failure occurred. If it is successful, then that database becomes the standby database.
Figure 8-3 shows a cluster where the former active master becomes the standby master.
Figure 8-3 Standby Database Starts on Former Active Host
If the failure of the former active master is permanent and advanced availability is configured, Oracle Clusterware starts a standby master on one of the extra nodes.
Figure 8-4 shows a cluster in which the standby master is started on one of the extra nodes.
Figure 8-4 Standby Database Starts on Extra Host
See Perform a Forced Switchover After Failure of the Active Database or Host if you do not want to wait for these automatic actions to occur.
When a Standby Database or Its Host Fails
If there is a failure on the standby master, Oracle Clusterware first tries to restart the database or host. If it cannot restart the standby master on the same host and advanced availability is configured, Oracle Clusterware starts the standby master on an extra node.
Figure 8-5 shows a cluster in which the standby master is started on one of the extra nodes.
When Read-Only Subscribers or Their Hosts Fail
If there is a failure on a subscriber node, Oracle Clusterware first tries to restart the database or host. If it cannot restart the database on the same host and advanced availability is configured, Oracle Clusterware starts the subscriber database on an extra node.
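The same restart-then-relocate pattern appears in this scenario and in the two preceding ones (for an active database failure, it applies after the standby has been promoted). A minimal sketch of that pattern follows; the function name and return labels are hypothetical, and the None result reflects only that the text describes no automatic action for that case.

```python
from typing import Optional


def restart_location(restart_on_same_host_ok: bool,
                     advanced_availability: bool) -> Optional[str]:
    """Return where Oracle Clusterware brings the failed database back.

    Returns "same host", "extra host", or None when the text describes
    no automatic placement (restart failed, no advanced availability).
    """
    if restart_on_same_host_ok:
        return "same host"       # first choice: restart the database or host
    if advanced_availability:
        return "extra host"      # relocate to one of the extra nodes
    return None
```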
When Failures Occur on Both Master Nodes
There are both automatic and manual methods for recovery when failures occur on both master nodes.
Automatic Recovery
Oracle Clusterware can achieve automatic recovery from temporary failure on both master nodes after the nodes come back up.
Automatic recovery can occur if:
- RETURN TWOSAFE is not specified for the active standby pair.
- AutoRecover is set to y.
- RepBackupDir specifies a directory on shared storage.
- RepBackupPeriod is set to a value greater than 0.
Oracle Clusterware can achieve automatic recovery from permanent failure on both master nodes if:
- Advanced availability is configured (virtual IP addresses and at least four hosts).
- The active standby pair does not replicate cache groups.
- RETURN TWOSAFE is not specified for the active standby pair.
- AutoRecover is set to y.
- RepBackupDir specifies a directory on shared storage.
- RepBackupPeriod is set to a value greater than 0.
See Configuring for Recovery When Both Master Nodes Permanently Fail for examples of cluster.oracle.ini files.
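The two sets of conditions above can be checked mechanically. The following sketch assumes the cluster.oracle.ini attributes for one DSN are already available as a dictionary of strings. The function names are hypothetical, and facts that cannot be read from the attributes (whether the backup directory is on shared storage, whether advanced availability is configured, whether cache groups are replicated) are passed in as booleans. Treating a missing attribute as "not set" is also an assumption.

```python
from typing import Mapping


def _common_conditions(attrs: Mapping[str, str],
                       backup_dir_is_shared: bool) -> bool:
    """Conditions shared by temporary and permanent dual-master failure."""
    twosafe = "RETURN TWOSAFE" in attrs.get("ReturnServiceAttribute", "").upper()
    auto_recover = attrs.get("AutoRecover", "").strip().lower() == "y"
    has_shared_backup = bool(attrs.get("RepBackupDir")) and backup_dir_is_shared
    backup_period_set = int(attrs.get("RepBackupPeriod", "0")) > 0
    return (not twosafe) and auto_recover and has_shared_backup and backup_period_set


def can_auto_recover_temporary(attrs: Mapping[str, str],
                               backup_dir_is_shared: bool) -> bool:
    """Temporary failure on both master nodes."""
    return _common_conditions(attrs, backup_dir_is_shared)


def can_auto_recover_permanent(attrs: Mapping[str, str],
                               backup_dir_is_shared: bool,
                               advanced_availability: bool,
                               replicates_cache_groups: bool) -> bool:
    """Permanent failure on both master nodes adds two more conditions."""
    return (
        advanced_availability
        and not replicates_cache_groups
        and _common_conditions(attrs, backup_dir_is_shared)
    )
```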
Manual Recovery for Advanced Availability
This section assumes that the failed master nodes are recovered to new hosts on which TimesTen and Oracle Clusterware are installed.
These steps use the manrecoveryDSN database and cluster.oracle.ini file for examples.
To perform manual recovery in an advanced availability configuration, perform these tasks:
Manual Recovery for Basic Availability
This section assumes that the failed master nodes are recovered to new hosts on which TimesTen and Oracle Clusterware are installed.
These steps use the basicDSN database and cluster.oracle.ini file for examples.
To perform manual recovery in a basic availability configuration, perform these steps:
Manual Recovery to the Same Master Nodes When Databases Are Corrupt
Failures on both master nodes can leave the databases corrupt. In that case, you can recover to the same master nodes.
To recover to the same master nodes, perform the following steps:
Manual Recovery When RETURN TWOSAFE Is Configured
You can configure an active standby pair to have a return service of RETURN TWOSAFE.
You configure RETURN TWOSAFE by using the ReturnServiceAttribute Clusterware attribute in the cluster.oracle.ini file.
This cluster.oracle.ini example includes backup configuration in case the database logs are not available:
[basicTwosafeDSN]
MasterHosts=host1,host2
ReturnServiceAttribute=RETURN TWOSAFE
RepBackupDir=/shared_drive/dsbackup
RepBackupPeriod=3600
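As an illustration, the entry above can be read with Python's standard configparser module to confirm that RETURN TWOSAFE and the backup attributes are set. This is a sketch only: the helper names are hypothetical, and although this sample parses as INI syntax, nothing here guarantees that every cluster.oracle.ini file does.

```python
import configparser
from typing import Dict

SAMPLE = """\
[basicTwosafeDSN]
MasterHosts=host1,host2
ReturnServiceAttribute=RETURN TWOSAFE
RepBackupDir=/shared_drive/dsbackup
RepBackupPeriod=3600
"""


def load_cluster_ini(text: str) -> Dict[str, Dict[str, str]]:
    """Parse INI text into {DSN section: {attribute: value}}."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep attribute names exactly as written
    parser.read_string(text)
    return {name: dict(parser[name]) for name in parser.sections()}


def uses_return_twosafe(attrs: Dict[str, str]) -> bool:
    """True if the ReturnServiceAttribute value requests RETURN TWOSAFE."""
    return "RETURN TWOSAFE" in attrs.get("ReturnServiceAttribute", "").upper()


config = load_cluster_ini(SAMPLE)
dsn = config["basicTwosafeDSN"]
print(uses_return_twosafe(dsn), dsn["RepBackupDir"], dsn["RepBackupPeriod"])
```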
Perform these recovery tasks:
When More Than Two Master Hosts Fail
Approach a failure of more than two master hosts as a more extreme case of dual host failure.
Use these guidelines:
- Address the root cause of the failure if it is something like a power outage or network failure.
- Identify or obtain at least two healthy hosts for the active and standby databases.
- Update the MasterHosts and SubscriberHosts entries in the cluster.oracle.ini file.
- See Manual Recovery for Advanced Availability and Manual Recovery for Basic Availability for guidelines on subsequent actions to take.
Perform a Forced Switchover After Failure of the Active Database or Host
If you want to force a switchover to the standby database without waiting for automatic recovery to be performed by TimesTen and Oracle Clusterware, you can write an application that uses Oracle Clusterware commands.
Perform the following:
- Use the crsctl stop resource command to stop the TimesTen database monitor (ttCRSmaster) resource on the active database. This causes the role of the standby database to change to active.
- Use the crsctl start resource command to restart the ttCRSmaster resource on the former active database. This causes the database to recover and become the standby database.
The following example demonstrates a forced switchover from the active database on host1 to the standby database on host2.
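As a rough sketch of the two steps, the following Python helper builds the crsctl command lines in order without running them. The resource name shown is a hypothetical placeholder: the actual name of the ttCRSmaster resource for host1 depends on your cluster and must be looked up there.

```python
from typing import List


def forced_switchover_commands(active_master_resource: str) -> List[List[str]]:
    """Return the crsctl commands for a forced switchover, in order.

    active_master_resource is the name of the ttCRSmaster resource on the
    current active database (supplied by the caller, not derived here).
    """
    return [
        # Step 1: stopping the resource causes the standby to become active.
        ["crsctl", "stop", "resource", active_master_resource],
        # Step 2: restarting it lets the former active database recover
        # and become the standby database.
        ["crsctl", "start", "resource", active_master_resource],
    ]


# Hypothetical placeholder; replace with the real resource name for host1.
for command in forced_switchover_commands("MASTER_RESOURCE_ON_HOST1"):
    print(" ".join(command))
```

To execute the commands, an application could pass each list to subprocess.run with check=True, ideally confirming that the database on host2 has become active before issuing the second command.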
See the Oracle Clusterware Administration and Deployment Guide in the Oracle Database documentation for more information about the crsctl start resource and crsctl stop resource commands.