Recovering From Failures

Oracle Clusterware can recover automatically from many kinds of failures.

The following sections describe several failure scenarios and how Oracle Clusterware manages the failures.

How TimesTen Performs Recovery When Oracle Clusterware is Configured

The TimesTen database monitor (the ttCRSmaster process) performs recovery.

It attempts to connect to the failed database without using the forceconnect option. If the connection fails with error 994 ("Data store connection terminated"), the database monitor tries to reconnect 10 times. If the connection fails with error 707 ("Attempt to connect to a data store that has been manually unloaded from RAM"), the database monitor changes the RAM policy and tries to connect again. If the database monitor cannot connect, it returns a connection failure.

If the database monitor can connect to the database, then it performs these tasks:

  • It queries the CHECKSUM column in the TTREP.REPLICATIONS replication table.

  • If the value in the CHECKSUM column matches the checksum stored in the Oracle Cluster Registry, then the database monitor verifies the role of the database. If the role is ACTIVE, then recovery is complete.

    If the role is not ACTIVE, then the database monitor queries the replication Commit Ticket Number (CTN) in the local database and the CTN in the active database to find out whether there are transactions that have not been replicated. If all transactions have been replicated, then recovery is complete.

  • If the checksum does not match or if some transactions have not been replicated, then the database monitor performs a duplicate operation from the remote database to re-create the local database.

If the database monitor fails to connect with the database because of error 8110 or 8111 (master catchup required or in progress), then it uses the forceconnect=1 option to connect and starts master catchup. Recovery is complete when master catchup has been completed. If master catchup fails with error 8112 ("Operation not permitted"), then the database monitor performs a duplicate operation from the remote database. See Automatic Catch-Up of a Failed Master Database.

If the connection fails because of other errors, then the database monitor tries to perform a duplicate operation from the remote database.
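
The connection-error handling described above amounts to a lookup from error code to next action. The following shell sketch only illustrates that decision table; it is not the ttCRSmaster implementation. The function name recovery_action, the action labels, and the use of 0 to mean "connected" are all hypothetical.

```shell
# Illustrative decision table only. recovery_action and its action labels
# are hypothetical names, not part of TimesTen or Oracle Clusterware.
# Assumption: code 0 stands for a successful connection.
recovery_action() {
  case "$1" in
    0)         echo "verify-checksum-and-role" ;;         # connected
    994)       echo "retry-connect" ;;                    # the text gives 10 reconnect attempts
    707)       echo "change-ram-policy-and-retry" ;;      # manually unloaded from RAM
    8110|8111) echo "forceconnect-and-master-catchup" ;;  # catchup required or in progress
    8112)      echo "duplicate-from-remote" ;;            # master catchup failed
    *)         echo "duplicate-from-remote" ;;            # any other error
  esac
}

recovery_action 8110   # prints: forceconnect-and-master-catchup
recovery_action 12345  # prints: duplicate-from-remote
```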

The duplicate operation verifies that:

  • The remote database is available.

  • The replication agent is running.

  • The remote database has the correct role. The role must be ACTIVE when the duplicate operation is attempted for creation of a standby database. The role must be STANDBY or ACTIVE when the duplicate operation is attempted for creation of a read-only subscriber.

When the conditions for the duplicate operation are satisfied, the existing failed database is destroyed and the duplicate operation starts.
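
The role requirement in the list above can be expressed as a small predicate. This is a minimal sketch, assuming the caller has already determined the role of the remote database by some other means; the function name can_duplicate_from and the target labels "standby" and "subscriber" are hypothetical.

```shell
# Hypothetical helper: succeeds (exit status 0) if a remote database in
# role $2 is an acceptable source for creating a database of kind $1.
#   $1: "standby" or "subscriber" (illustrative labels)
#   $2: role of the remote database, for example ACTIVE or STANDBY
can_duplicate_from() {
  case "$1:$2" in
    standby:ACTIVE)                       return 0 ;;  # standby needs an ACTIVE source
    subscriber:ACTIVE|subscriber:STANDBY) return 0 ;;  # subscriber accepts either master
    *)                                    return 1 ;;  # any other combination is rejected
  esac
}

can_duplicate_from standby ACTIVE && echo "ok to duplicate"
```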

When an Active Database or Its Host Fails

If there is a failure on the node where the active database resides, Oracle Clusterware automatically changes the state of the standby database to ACTIVE. If application failover is configured, then the application begins updating the new active database.

Figure 8-2 shows that the state of the old standby database has changed to ACTIVE and that the application is updating the new active database.

Figure 8-2 Standby Database Becomes Active

Oracle Clusterware tries to restart the database or host where the failure occurred. If it is successful, then that database becomes the standby database.

Figure 8-3 shows a cluster where the former active master becomes the standby master.

Figure 8-3 Standby Database Starts on Former Active Host

If the failure of the former active master is permanent and advanced availability is configured, Oracle Clusterware starts a standby master on one of the extra nodes.

Figure 8-4 shows a cluster in which the standby master is started on one of the extra nodes.

Figure 8-4 Standby Database Starts on Extra Host

See Perform a Forced Switchover After Failure of the Active Database or Host if you do not want to wait for these automatic actions to occur.

When a Standby Database or Its Host Fails

If there is a failure on the standby master, Oracle Clusterware first tries to restart the database or host. If it cannot restart the standby master on the same host and advanced availability is configured, Oracle Clusterware starts the standby master on an extra node.

Figure 8-5 shows a cluster in which the standby master is started on one of the extra nodes.

Figure 8-5 Standby Database on New Host

When Read-Only Subscribers or Their Hosts Fail

If there is a failure on a subscriber node, Oracle Clusterware first tries to restart the database or host. If it cannot restart the database on the same host and advanced availability is configured, Oracle Clusterware starts the subscriber database on an extra node.

When Failures Occur on Both Master Nodes

There are both automatic and manual methods for recovery when failures occur on both master nodes.

Automatic Recovery

Oracle Clusterware can achieve automatic recovery from temporary failure on both master nodes after the nodes come back up.

Automatic recovery can occur if:

  • RETURN TWOSAFE is not specified for the active standby pair.

  • AutoRecover is set to y.

  • RepBackupDir specifies a directory on shared storage.

  • RepBackupPeriod is set to a value greater than 0.

Oracle Clusterware can achieve automatic recovery from permanent failure on both master nodes if:

  • Advanced availability is configured (virtual IP addresses and at least four hosts).

  • The active standby pair does not replicate cache groups.

  • RETURN TWOSAFE is not specified for the active standby pair.

  • AutoRecover is set to y.

  • RepBackupDir specifies a directory on shared storage.

  • RepBackupPeriod is set to a value greater than 0.

See Configuring for Recovery When Both Master Nodes Permanently Fail for examples of cluster.oracle.ini files.
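
Some of the conditions listed above can be checked mechanically against a cluster.oracle.ini file. The following sketch assumes a simple layout with [DSN] section headers and Attribute=value lines that have no surrounding whitespace. The function names ini_get and auto_recover_ready are hypothetical. The check cannot confirm that RepBackupDir is on shared storage, that advanced availability is configured, or that no cache groups are replicated.

```shell
# ini_get FILE SECTION KEY
# Print the value of KEY inside [SECTION]; print nothing if it is absent.
ini_get() {
  awk -F= -v section="[$2]" -v key="$3" '
    /^\[.*\]$/ { in_section = ($0 == section); next }
    in_section && $1 == key { sub(/^[^=]*=/, ""); print; exit }
  ' "$1"
}

# auto_recover_ready FILE DSN
# Succeeds only if the file-level conditions for automatic recovery hold.
auto_recover_ready() {
  [ "$(ini_get "$1" "$2" AutoRecover)" = "y" ] || return 1
  [ -n "$(ini_get "$1" "$2" RepBackupDir)" ] || return 1
  # A missing or non-numeric period makes the test fail, which is intended.
  [ "$(ini_get "$1" "$2" RepBackupPeriod)" -gt 0 ] 2>/dev/null || return 1
  case "$(ini_get "$1" "$2" ReturnServiceAttribute)" in
    *TWOSAFE*) return 1 ;;   # RETURN TWOSAFE rules out automatic recovery
  esac
  return 0
}

# Usage (file name and DSN are placeholders):
#   auto_recover_ready cluster.oracle.ini myDSN && echo "file-level checks pass"
```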

Manual Recovery for Advanced Availability

This section assumes that the failed master nodes are recovered to new hosts on which TimesTen and Oracle Clusterware are installed.

These steps use the manrecoveryDSN database and cluster.oracle.ini file for examples.

To perform manual recovery in an advanced availability configuration, perform these tasks:

  1. Ensure that the TimesTen cluster agent (ttCRSAgent) is running on the local host.
    ttCWAdmin -init -hosts localhost
  2. Restore the backup database. Ensure that there is not already a database on the host with the same DSN as the database you want to restore.
    ttCWAdmin -restore -dsn manrecoveryDSN
  3. If there are cache groups in the database, drop and re-create the cache groups.
  4. If the new hosts are not already specified by MasterHosts and SubscriberHosts in the cluster.oracle.ini file, then modify the file to include the new hosts.

    This step is not necessary for the manrecoveryDSN example because its extra hosts are already specified in the cluster.oracle.ini file.

  5. Re-create the active standby pair replication scheme.
    ttCWAdmin -create -dsn manrecoveryDSN
  6. Start the active standby pair replication scheme.
    ttCWAdmin -start -dsn manrecoveryDSN
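
The ttCWAdmin commands in the steps above can be chained in a small wrapper that stops at the first failure. This is a sketch, not a supported tool: the names run, manual_recover, and DRY_RUN are hypothetical, and it defaults to a dry run that only prints the commands. Steps 3 and 4 (cache groups and cluster.oracle.ini edits) remain manual, and ttCWAdmin may prompt for input when run for real.

```shell
# DRY_RUN=1 (the default in this sketch) prints each command instead of
# running it. Set DRY_RUN=0 to execute the commands.
DRY_RUN=${DRY_RUN:-1}

# run CMD ARGS...: print or execute one command, reporting failures.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@" || { echo "step failed: $*" >&2; return 1; }
  fi
}

# manual_recover DSN: steps 1, 2, 5, and 6 of the procedure above.
manual_recover() {
  run ttCWAdmin -init -hosts localhost || return 1   # step 1
  run ttCWAdmin -restore -dsn "$1"     || return 1   # step 2
  # Steps 3 and 4 are manual and are not automated here.
  run ttCWAdmin -create -dsn "$1"      || return 1   # step 5
  run ttCWAdmin -start -dsn "$1"                     # step 6
}

manual_recover manrecoveryDSN
```

With the default DRY_RUN=1, the call prints the four ttCWAdmin commands, each prefixed with "+ ", so that the sequence can be reviewed before anything is executed.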

Manual Recovery for Basic Availability

This section assumes that the failed master nodes are recovered to new hosts on which TimesTen and Oracle Clusterware are installed.

These steps use the basicDSN database and cluster.oracle.ini file for examples.

To perform manual recovery in a basic availability configuration, perform these steps:

  1. Acquire new hosts for the databases in the active standby pair.
  2. Ensure that the TimesTen cluster agent (ttCRSAgent) is running on the local host.
    ttCWAdmin -init -hosts localhost
  3. Restore the backup database. Ensure that there is not already a database on the host with the same DSN as the database you want to restore.
    ttCWAdmin -restore -dsn basicDSN
  4. If there are cache groups in the database, drop and re-create the cache groups.
  5. Update the MasterHosts and SubscriberHosts entries in the cluster.oracle.ini file. This example uses the basicDSN database, whose entry lists only master hosts. The MasterHosts entry changes from host1,host2 to host10,host20.
    [basicDSN]
    MasterHosts=host10,host20
  6. Re-create the active standby pair replication scheme.
    ttCWAdmin -create -dsn basicDSN
  7. Start the active standby pair replication scheme.
    ttCWAdmin -start -dsn basicDSN
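
The MasterHosts edit in step 5 can also be scripted. The following sketch assumes a cluster.oracle.ini layout with [DSN] headers and Attribute=value lines that have no surrounding whitespace. The function name set_master_hosts is hypothetical. It writes the updated file to standard output and leaves the original untouched.

```shell
# set_master_hosts FILE SECTION HOSTS
# Print FILE with the MasterHosts line of [SECTION] replaced by HOSTS.
# All other sections and attributes pass through unchanged.
set_master_hosts() {
  awk -v section="[$2]" -v hosts="$3" '
    /^\[.*\]$/ { in_section = ($0 == section) }
    in_section && /^MasterHosts=/ { print "MasterHosts=" hosts; next }
    { print }
  ' "$1"
}

# Usage (review the new file before replacing the original):
#   set_master_hosts cluster.oracle.ini basicDSN host10,host20 > cluster.oracle.ini.new
```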

Manual Recovery to the Same Master Nodes When Databases Are Corrupt

Failures on both master nodes can leave both databases corrupt. In that case, you can recover to the same master nodes.

To recover to the same master nodes, perform the following steps:

  1. Ensure that the TimesTen database monitor (ttCRSmaster), the replication agent, and the cache agent are stopped and that applications are disconnected from both databases. This example uses the basicDSN database.
    ttCWAdmin -stop -dsn basicDSN
  2. On the node where you want the new active database to reside, destroy the databases by using the ttDestroy utility.
    ttDestroy basicDSN
  3. Restore the backup database.
    ttCWAdmin -restore -dsn basicDSN
    
  4. If there are cache groups in the database, drop and re-create the cache groups.
  5. Re-create the active standby pair replication scheme.
    ttCWAdmin -create -dsn basicDSN
  6. Start the active standby pair replication scheme.
    ttCWAdmin -start -dsn basicDSN

Manual Recovery When RETURN TWOSAFE Is Configured

You can configure an active standby pair to have a return service of RETURN TWOSAFE.

You configure RETURN TWOSAFE by using the ReturnServiceAttribute Clusterware attribute in the cluster.oracle.ini file.

This cluster.oracle.ini example includes backup configuration in case the database logs are not available:

[basicTwosafeDSN]
MasterHosts=host1,host2
ReturnServiceAttribute=RETURN TWOSAFE
RepBackupDir=/shared_drive/dsbackup
RepBackupPeriod=3600

Perform these recovery tasks:

  1. Ensure that the TimesTen database monitor (ttCRSmaster), the replication agent, and the cache agent are stopped and that applications are disconnected from both databases.
    ttCWAdmin -stop -dsn basicTwosafeDSN
  2. Drop the active standby pair.
    ttCWAdmin -drop -dsn basicTwosafeDSN
  3. Decide whether the former active or standby database is more up to date and re-create the active standby pair using the chosen database. The command prompts you to choose the host on which the active database resides.
    ttCWAdmin -create -dsn basicTwosafeDSN

    If neither database is usable, restore the database from backups.

    ttCWAdmin -restore -dsn basicTwosafeDSN
  4. Start the active standby pair replication scheme.
    ttCWAdmin -start -dsn basicTwosafeDSN

When More Than Two Master Hosts Fail

Approach a failure of more than two master hosts as a more extreme case of dual host failure.

Use these guidelines:

  • Address the root cause of the failure, such as a power outage or network failure.

  • Identify or obtain at least two healthy hosts for the active and standby databases.

  • Update the MasterHosts and SubscriberHosts entries in the cluster.oracle.ini file.

  • See Manual Recovery for Advanced Availability and Manual Recovery for Basic Availability for guidelines on subsequent actions to take.

Perform a Forced Switchover After Failure of the Active Database or Host

If you want to force a switchover to the standby database without waiting for automatic recovery to be performed by TimesTen and Oracle Clusterware, you can write an application that uses Oracle Clusterware commands.

Perform the following:

  1. Use the crsctl stop resource command to stop the TimesTen database monitor (ttCRSmaster) resource on the active database. This causes the role of the standby database to change to active.

  2. Use the crsctl start resource command to restart the ttCRSmaster resource on the former active database. This causes the database to recover and become the standby database.

The following example demonstrates a forced switchover from the active database on host1 to the standby database on host2.

  1. Find all TimesTen resources using the crsctl status resource command.
    % crsctl status resource | grep TT
      NAME=TT_Activeservice_tt181_ttadmin_REP1
      NAME=TT_Agent_tt181_ttadmin_HOST1
      NAME=TT_Agent_tt181_ttadmin_HOST2
      NAME=TT_App_tt181_ttadmin_REP1_updateemp
      NAME=TT_Daemon_tt181_ttadmin_HOST1
      NAME=TT_Daemon_tt181_ttadmin_HOST2
      NAME=TT_Master_tt181_ttadmin_REP1_0
      NAME=TT_Master_tt181_ttadmin_REP1_1
      NAME=TT_Subservice_tt181_ttadmin_REP1
  2. Find the host where the active database resides by retrieving the status of the ttCRSActiveService resource.
    % crsctl status resource TT_Activeservice_tt181_ttadmin_REP1
      NAME=TT_Activeservice_tt181_ttadmin_REP1
      TYPE=application
      TARGET=ONLINE
      STATE=ONLINE on host1
  3. There are two ttCRSmaster resources listed in the initial status report. Discover which ttCRSmaster resource is on the same host as the active database.
    % crsctl status resource TT_Master_tt181_ttadmin_REP1_0
      NAME=TT_Master_tt181_ttadmin_REP1_0
      TYPE=application
      TARGET=ONLINE
      STATE=ONLINE on host1
    
    % crsctl status resource TT_Master_tt181_ttadmin_REP1_1
      NAME=TT_Master_tt181_ttadmin_REP1_1
      TYPE=application
      TARGET=ONLINE
      STATE=ONLINE on host2
  4. Stop the ttCRSmaster resource on the host where the active database resides.
    % crsctl stop resource TT_Master_tt181_ttadmin_REP1_0
      CRS-2673: Attempting to stop 'TT_Master_tt181_ttadmin_REP1_0' on 'host1'
      CRS-2677: Stop of 'TT_Master_tt181_ttadmin_REP1_0' on 'host1' succeeded
  5. Restart the ttCRSmaster resource on the former active database.
    % crsctl start resource TT_Master_tt181_ttadmin_REP1_0
      CRS-2672: Attempting to start 'TT_Master_tt181_ttadmin_REP1_0' on 'host1'
      CRS-2676: Start of 'TT_Master_tt181_ttadmin_REP1_0' on 'host1' succeeded
  6. Confirm that the forced switchover succeeded by checking where the active service (ttCRSActiveService) and standby service (ttCRSsubservice) resources are located.
    % crsctl status resource TT_Activeservice_tt181_ttadmin_REP1
      NAME=TT_Activeservice_tt181_ttadmin_REP1
      TYPE=application
      TARGET=ONLINE
      STATE=ONLINE on host2
     
    % crsctl status resource TT_Subservice_tt181_ttadmin_REP1
      NAME=TT_Subservice_tt181_ttadmin_REP1
      TYPE=application
      TARGET=ONLINE
      STATE=ONLINE on host1
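
The example above can be scripted by parsing the STATE line of the crsctl status resource output. This is a minimal sketch, assuming the output format shown in the example (a line of the form "STATE=ONLINE on host"). The function names resource_host, master_on_host, and force_switchover are hypothetical, and the resource names must be taken from your own cluster.

```shell
# resource_host: read "crsctl status resource NAME" output on stdin and
# print the host named in the first "STATE=ONLINE on <host>" line.
resource_host() {
  sed -n 's/^[[:space:]]*STATE=ONLINE on \([^[:space:],]*\).*$/\1/p' | head -n 1
}

# master_on_host HOST RESOURCE...
# Print the first listed ttCRSmaster resource that is ONLINE on HOST.
master_on_host() {
  wanted=$1
  shift
  for res in "$@"; do
    if [ "$(crsctl status resource "$res" | resource_host)" = "$wanted" ]; then
      echo "$res"
      return 0
    fi
  done
  return 1
}

# force_switchover ACTIVE_SERVICE MASTER_RESOURCE...
# Stop, then restart, the ttCRSmaster resource on the active host.
force_switchover() {
  active_host=$(crsctl status resource "$1" | resource_host)
  [ -n "$active_host" ] || { echo "active host not found" >&2; return 1; }
  shift
  master=$(master_on_host "$active_host" "$@") || {
    echo "no master resource found on $active_host" >&2
    return 1
  }
  crsctl stop resource "$master" && crsctl start resource "$master"
}

# Usage, with the resource names from the example above:
#   force_switchover TT_Activeservice_tt181_ttadmin_REP1 \
#     TT_Master_tt181_ttadmin_REP1_0 TT_Master_tt181_ttadmin_REP1_1
```

After the script finishes, repeat the status checks from step 6 to confirm where the active and standby services now run.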

See the Oracle Clusterware Administration and Deployment Guide in the Oracle Database documentation for more information about the crsctl start resource and crsctl stop resource commands.