Recovering from Failures
Oracle Clusterware can recover automatically from many kinds of failures.
The following sections describe several failure scenarios and how Oracle Clusterware manages the failures.
How TimesTen Performs Recovery When Oracle Clusterware is Configured
The TimesTen database monitor (the ttCRSmaster process) performs
recovery.
It attempts to connect to the failed database without using the
forceconnect option. If the connection fails with error 994
("Data store connection terminated"), the database monitor tries to
reconnect 10 times. If the connection fails with error 707 ("Attempt to connect
to a data store that has been manually unloaded from RAM"), the database
monitor changes the RAM policy and tries to connect again. If the database monitor
cannot connect, it returns a connection failure.
If the database monitor can connect to the database, then it performs these tasks:
- It queries the CHECKSUM column in the TTREP.REPLICATIONS replication table.
- If the value in the CHECKSUM column matches the checksum stored in the Oracle Cluster Registry, then the database monitor verifies the role of the database. If the role is ACTIVE, then recovery is complete. If the role is not ACTIVE, then the database monitor queries the replication Commit Ticket Number (CTN) in the local database and the CTN in the active database to find out whether there are transactions that have not been replicated. If all transactions have been replicated, then recovery is complete.
- If the checksum does not match or if some transactions have not been replicated, then the database monitor performs a duplicate operation from the remote database to re-create the local database.
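The decision the database monitor makes after a successful connection can be sketched as a small function. This is only an illustration of the logic described above, not TimesTen code: the `DbState` type, its field names, and the returned strings are hypothetical, and the CTN comparison is reduced to a single boolean because the text does not say how the two CTNs are compared.

```python
from dataclasses import dataclass


@dataclass
class DbState:
    """Hypothetical snapshot of what the database monitor has looked up."""
    local_checksum: str               # CHECKSUM column of TTREP.REPLICATIONS
    registry_checksum: str            # checksum stored in the Oracle Cluster Registry
    role: str                         # for example "ACTIVE" or "STANDBY"
    all_transactions_replicated: bool  # outcome of comparing local and active CTNs


def post_connect_action(state: DbState) -> str:
    """Return the recovery outcome described in the text for a connected database."""
    if state.local_checksum != state.registry_checksum:
        # Checksum mismatch: re-create the local database from the remote one.
        return "duplicate"
    if state.role == "ACTIVE":
        return "complete"
    if state.all_transactions_replicated:
        return "complete"
    # Checksums match, but some transactions have not been replicated.
    return "duplicate"
```

For example, a standby whose checksum matches and whose transactions have all been replicated yields `"complete"`, while any checksum mismatch yields `"duplicate"` regardless of role.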
If the database monitor fails to connect with the database because of error 8110 or
8111 (master catchup required or in progress), then it uses the
forceconnect=1 option to connect and starts master catchup.
Recovery is complete when master catchup has been completed. If master catchup fails
with error 8112 ("Operation not permitted"), then the database monitor
performs a duplicate operation from the remote database. See Automatic Catch-Up of a Failed Master Database.
If the connection fails because of other errors, then the database monitor tries to perform a duplicate operation from the remote database.
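The error handling described in this section can be summarized as a lookup from error code to recovery action. The sketch below is an illustration only: the function names and the returned strings are hypothetical, and the error numbers are taken directly from the text above.

```python
from typing import Optional

# From the text: after error 994 the database monitor tries to reconnect 10 times.
MAX_RECONNECT_ATTEMPTS = 10


def action_for_connect_error(error_code: int) -> str:
    """Map a connection error code to the recovery action described in the text."""
    if error_code == 994:
        # Retried up to MAX_RECONNECT_ATTEMPTS times; if the monitor still cannot
        # connect, it returns a connection failure.
        return "reconnect"
    if error_code == 707:
        return "change RAM policy and reconnect"
    if error_code in (8110, 8111):
        return "connect with forceconnect=1 and start master catchup"
    # Any other connection error leads to a duplicate operation.
    return "duplicate from remote database"


def action_after_master_catchup(error_code: Optional[int] = None) -> str:
    """Outcome once master catchup has finished (None means it succeeded)."""
    if error_code is None:
        return "recovery complete"
    if error_code == 8112:
        return "duplicate from remote database"
    # The text does not describe other master catchup failures.
    return "not specified"
```

Keeping the mapping in one place makes it easy to see that a duplicate operation from the remote database is the fallback whenever a lighter recovery path is not available.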
The duplicate operation verifies that:
- The remote database is available.
- The replication agent is running.
- The remote database has the correct role. The role must be ACTIVE when the duplicate operation is attempted for creation of a standby database. The role must be STANDBY or ACTIVE when the duplicate operation is attempted for creation of a read-only subscriber.
When the conditions for the duplicate operation are satisfied, the existing failed database is destroyed and the duplicate operation starts.
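The preconditions for the duplicate operation can be expressed as a single check. This is a minimal sketch under stated assumptions: the function name, parameter names, and the `"standby"` / `"subscriber"` target labels are hypothetical and are not TimesTen identifiers.

```python
# Roles the remote database may have, keyed by the kind of database being created.
ALLOWED_REMOTE_ROLES = {
    "standby": {"ACTIVE"},
    "subscriber": {"ACTIVE", "STANDBY"},
}


def duplicate_allowed(remote_available: bool,
                      replication_agent_running: bool,
                      remote_role: str,
                      target: str) -> bool:
    """Return True when all three conditions listed in the text are satisfied."""
    if target not in ALLOWED_REMOTE_ROLES:
        raise ValueError(f"unknown target kind: {target!r}")
    return (remote_available
            and replication_agent_running
            and remote_role in ALLOWED_REMOTE_ROLES[target])
```

Only when this kind of check passes is the existing failed database destroyed, which is why the verification happens before the duplicate operation starts.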
When an Active Database or Its Host Fails
If there is a failure on the node where the active database resides, Oracle
Clusterware automatically changes the state of the standby database to
ACTIVE. If application failover is configured, then the application
begins updating the new active database.
Figure 8-2 shows that the state of the old standby database has changed to ACTIVE and that the application is updating the new active database.
Figure 8-2 Standby Database Becomes Active

Oracle Clusterware tries to restart the database or host where the failure occurred. If it is successful, then that database becomes the standby database.
Figure 8-3 shows a cluster where the former active master becomes the standby master.
Figure 8-3 Standby Database Starts on Former Active Host

If the failure of the former active master is permanent and advanced availability is configured, Oracle Clusterware starts a standby master on one of the extra nodes.
Figure 8-4 shows a cluster in which the standby master is started on one of the extra nodes.
Figure 8-4 Standby Database Starts on Extra Host

See Perform a Forced Switchover After Failure of the Active Database or Host if you do not want to wait for these automatic actions to occur.
When a Standby Database or Its Host Fails
If there is a failure on the standby master, Oracle Clusterware first tries to restart the database or host. If it cannot restart the standby master on the same host and advanced availability is configured, Oracle Clusterware starts the standby master on an extra node.
Figure 8-5 shows a cluster in which the standby master is started on one of the extra nodes.
When Read-Only Subscribers or Their Hosts Fail
If there is a failure on a subscriber node, Oracle Clusterware first tries to restart the database or host. If it cannot restart the database on the same host and advanced availability is configured, Oracle Clusterware starts the subscriber database on an extra node.
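The restart policy described for a failed standby master or read-only subscriber follows the same two-step pattern, which can be sketched as follows. The function and parameter names are hypothetical; the sketch returns `None` for the case the text does not describe (restart on the same host fails and advanced availability is not configured).

```python
from typing import Optional


def restart_location(restart_on_same_host_succeeded: bool,
                     advanced_availability: bool) -> Optional[str]:
    """Return where the failed database is restarted, per the text."""
    if restart_on_same_host_succeeded:
        # Oracle Clusterware first tries to restart the database or host in place.
        return "same host"
    if advanced_availability:
        # With advanced availability, an extra node takes over.
        return "extra node"
    # The text does not describe an automatic action for this case.
    return None
```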
When Failures Occur on Both Master Nodes
There are both automatic and manual methods for recovery when failures occur on both master nodes.
Automatic Recovery
Oracle Clusterware can achieve automatic recovery from temporary failure on both master nodes after the nodes come back up.
Automatic recovery can occur if:
- RETURN TWOSAFE is not specified for the active standby pair.
- AutoRecover is set to y.
- RepBackupDir specifies a directory on shared storage.
- RepBackupPeriod is set to a value greater than 0.
Oracle Clusterware can achieve automatic recovery from permanent failure on both master nodes if:
- Advanced availability is configured (virtual IP addresses and at least four hosts).
- The active standby pair does not replicate cache groups.
- RETURN TWOSAFE is not specified for the active standby pair.
- AutoRecover is set to y.
- RepBackupDir specifies a directory on shared storage.
- RepBackupPeriod is set to a value greater than 0.
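The two sets of conditions can be checked mechanically against the attributes of a cluster.oracle.ini entry. The sketch below is an illustration, not a TimesTen utility: the function name and its keyword parameters are hypothetical, the attributes are passed in as a plain dictionary of strings, and the code cannot verify that RepBackupDir really is on shared storage, so it only checks that a value is present.

```python
def unmet_auto_recovery_conditions(attrs, permanent=False, uses_virtual_ips=False,
                                   host_count=0, replicates_cache_groups=False):
    """Return the conditions from the text that are NOT met (empty list = eligible)."""
    unmet = []
    if "RETURN TWOSAFE" in attrs.get("ReturnServiceAttribute", "").upper():
        unmet.append("RETURN TWOSAFE must not be specified")
    if attrs.get("AutoRecover", "").strip() != "y":
        unmet.append("AutoRecover must be set to y")
    if not attrs.get("RepBackupDir", "").strip():
        # Presence check only; shared storage cannot be confirmed from the value.
        unmet.append("RepBackupDir must specify a directory on shared storage")
    try:
        period = int(attrs.get("RepBackupPeriod", "0"))
    except ValueError:
        period = 0
    if period <= 0:
        unmet.append("RepBackupPeriod must be greater than 0")
    if permanent:
        # Extra conditions that apply only to permanent failure of both masters.
        if not (uses_virtual_ips and host_count >= 4):
            unmet.append("advanced availability (virtual IPs, at least four hosts)")
        if replicates_cache_groups:
            unmet.append("active standby pair must not replicate cache groups")
    return unmet
```

Returning the list of unmet conditions, rather than a bare boolean, makes it clear which setting blocks automatic recovery.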
See Configuring for Recovery When Both Master Nodes Permanently Fail for examples of cluster.oracle.ini
files.
Manual Recovery for Advanced Availability
This section assumes that the failed master nodes are recovered to new hosts on which TimesTen and Oracle Clusterware are installed.
These steps use the manrecoveryDSN database and
cluster.oracle.ini file for examples.
To perform manual recovery in an advanced availability configuration, perform these tasks:
Manual Recovery for Basic Availability
This section assumes that the failed master nodes are recovered to new hosts on which TimesTen and Oracle Clusterware are installed.
These steps use the basicDSN database and
cluster.oracle.ini file for examples.
To perform manual recovery in a basic availability configuration, perform these steps:
Manual Recovery to the Same Master Nodes When Databases Are Corrupt
Failures can occur on both master nodes so that the databases are corrupt. You can recover to the same master nodes.
To recover to the same master nodes, perform the following steps:
Manual Recovery When RETURN TWOSAFE Is Configured
You can configure an active standby pair to have a return service of RETURN TWOSAFE.
You configure RETURN TWOSAFE by using the ReturnServiceAttribute Clusterware attribute in the cluster.oracle.ini file.
This cluster.oracle.ini example includes backup configuration in case the database logs are not available:
[basicTwosafeDSN]
MasterHosts=host1,host2
ReturnServiceAttribute=RETURN TWOSAFE
RepBackupDir=/shared_drive/dsbackup
RepBackupPeriod=3600
Perform these recovery tasks:
When More Than Two Master Hosts Fail
Approach a failure of more than two master hosts as a more extreme case of dual host failure.
Use these guidelines:
- Address the root cause of the failure if it is something like a power outage or network failure.
- Identify or obtain at least two healthy hosts for the active and standby databases.
- Update the MasterHosts and SubscriberHosts entries in the cluster.oracle.ini file.
- See Manual Recovery for Advanced Availability and Manual Recovery for Basic Availability for guidelines on subsequent actions to take.
Perform a Forced Switchover After Failure of the Active Database or Host
If you want to force a switchover to the standby database without waiting for automatic recovery to be performed by TimesTen and Oracle Clusterware, you can write an application that uses Oracle Clusterware commands.
Perform the following:
- Use the crsctl stop resource command to stop the TimesTen database monitor (ttCRSmaster) resource on the active database. This causes the role of the standby database to change to active.
- Use the crsctl start resource command to restart the ttCRSmaster resource on the former active database. This causes the database to recover and become the standby database.
The following example demonstrates a forced switchover from the active database on host1 to the standby database on host2.
See the Oracle Clusterware Administration and Deployment Guide in the Oracle Database documentation for more information about the crsctl start resource and crsctl stop resource commands.
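As a rough illustration of the two steps in this section (not the guide's own example), an application could assemble and run the two crsctl commands as shown below. The function names are hypothetical, and the resource name is a placeholder: the actual name of the ttCRSmaster resource for the active database must be looked up in your cluster and is not given here.

```python
import subprocess


def forced_switchover_commands(master_resource: str) -> list:
    """Build the two crsctl invocations, in the order described in the text."""
    if not master_resource:
        raise ValueError("a ttCRSmaster resource name is required")
    return [
        # Step 1: stopping the resource causes the standby to become active.
        ["crsctl", "stop", "resource", master_resource],
        # Step 2: restarting it lets the former active recover as the standby.
        ["crsctl", "start", "resource", master_resource],
    ]


def run_forced_switchover(master_resource: str) -> None:
    """Run both commands in order, stopping at the first failure."""
    for command in forced_switchover_commands(master_resource):
        subprocess.run(command, check=True)
```

Calling `forced_switchover_commands("EXAMPLE_MASTER_RESOURCE")` (a made-up name) returns the two argument lists without running anything, which is useful for reviewing the commands before `run_forced_switchover` is called on a node where crsctl is available.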
