Recovering From a Failure of the Active Database

If the active master has failed and the standby database did not fail or has recovered after a failure, then you can recover the active standby pair by making the standby master the new active master.

In addition, you can then swap the active and standby masters again so that they exist on the original nodes.

Note:

If both the active and standby masters fail, see Recovering After a Dual Failure of Both Active and Standby Databases.

Recovering When the Standby Database is Ready

The first two sections describe how to recover the active database when the standby database is available and synchronized with the active database. The last section describes what to do if the instructions in either of the first two sections fail because the standby database is available but its data is not fully synchronized.

When Replication Is Return Receipt or Asynchronous

You can fail over to the standby database when the active database fails.

Complete the following tasks:

  1. On the standby database, stop the replication agent if it has not already been stopped.
  2. On the standby database, call ttRepStateSet('ACTIVE'). This changes the role of the database from STANDBY to ACTIVE. If you are replicating a read-only cache group, this action automatically causes the autorefresh state to change from PAUSED to ON for this database.
  3. On the new active database, call ttRepStateSave('FAILED', 'failed_database','host_name'), where failed_database is the former active database that failed. This step is necessary for the new active database to replicate directly to the subscriber databases. During normal operation, only the standby database replicates to the subscribers.
  4. On the new active database, start the replication agent and the cache agent.
  5. Destroy the failed database (the old active) with the ttDestroy utility.
  6. Duplicate the new active database to the new standby database. You can use either the ttRepAdmin -duplicate utility or the ttRepDuplicateEx C function to duplicate a database. Use the -keepCG -recoveringNode options with ttRepAdmin to recover and to preserve the cache group after the active master failure. See Duplicating a Database.
  7. Set up the replication agent policy on the new standby database and start the replication agent. See Starting and Stopping the Replication Agents.
  8. Start the cache agent on the new standby database.
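
Steps 1 through 4 can be sketched as a ttIsql session connected to the surviving standby database. This is only an illustration under stated assumptions: the database name master1 and host name host1 are placeholders for the failed active database, and ttRepStop may return an error if the replication agent is already stopped, in which case that call can be skipped. Steps 5 through 8 use the ttDestroy and ttRepAdmin utilities from the operating system shell and are not shown.

```sql
-- Step 1: stop the replication agent (skip if it is already stopped).
call ttRepStop();

-- Step 2: promote this database from STANDBY to ACTIVE.
call ttRepStateSet('ACTIVE');

-- Step 3: mark the former active database as failed.
-- 'master1' and 'host1' are placeholder names, not values from this guide.
call ttRepStateSave('FAILED', 'master1', 'host1');

-- Step 4: start the replication agent and the cache agent.
call ttRepStart();
call ttCacheStart();
```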

Note:

If any of these steps fails, follow the directions in When There Is Unsynchronized Data in the Cache Groups.

The standby database contacts the active database. The active database stops sending updates to the subscribers. When the standby database is fully synchronized with the active database, the standby database enters the STANDBY state and starts sending updates to the subscribers. If you are replicating an AWT cache group, the new standby database takes over processing of the cache group automatically when it enters the STANDBY state.

Note:

You can verify that the standby database has entered the STANDBY state by using the ttRepStateGet built-in procedure.
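
As a minimal illustration, the check can be run from a ttIsql session connected to the new standby database. The call takes no arguments, and the returned row reports the current replication state of the connected database. You may need to repeat the call until the state changes from its intermediate value to STANDBY.

```sql
-- Run while connected to the new standby database.
-- Repeat until the reported state is STANDBY.
call ttRepStateGet();
```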

When Replication Is Return Twosafe

You can fail over to the standby database when the active database fails.

Complete the following tasks:

  1. Stop the replication agent on the standby database if it has not already been stopped.
  2. On the standby database, call ttRepStateSet('ACTIVE'). This changes the role of the database from STANDBY to ACTIVE. If you are replicating a read-only cache group, this action automatically causes the autorefresh state to change from PAUSED to ON for this database.
  3. On the new active database, call ttRepStateSave('FAILED', 'failed_database','host_name'), where failed_database is the former active database that failed. This step is necessary for the new active database to replicate directly to the subscriber databases. During normal operation, only the standby database replicates to the subscribers.
  4. On the new active database, start the replication agent and the cache agent.
  5. Connect to the failed database. This triggers recovery from the local transaction logs. If database recovery fails, you must continue from Step 5 of the procedure for recovering when replication is return receipt or asynchronous. See When Replication Is Return Receipt or Asynchronous. If you are replicating a read-only cache group, the autorefresh state is automatically set to PAUSED.
  6. Verify that the replication agent for the failed database has restarted. If it has not restarted, then start the replication agent. See Starting and Stopping the Replication Agents.
  7. Verify that the cache agent for the failed database has restarted. If it has not restarted, then start the cache agent.
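
Steps 5 through 7 can be sketched as a ttIsql session on the node that hosts the failed database. This is a sketch under assumptions: the DSN master1 is a placeholder, and the two start calls are needed only if the agents did not restart on their own. Calling them while an agent is already running may return an error, which can be ignored in that case.

```sql
-- Step 5: connecting triggers recovery from the local transaction logs.
-- 'master1' is a placeholder DSN for the failed database.
connect "DSN=master1";

-- Step 6: start the replication agent only if it did not restart.
call ttRepStart();

-- Step 7: start the cache agent only if it did not restart.
call ttCacheStart();
```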

Note:

If any of these steps fails, follow the directions in When There Is Unsynchronized Data in the Cache Groups.

When the active database determines that it is fully synchronized with the standby database, the standby database enters the STANDBY state and starts sending updates to the subscribers. If you are replicating an AWT cache group, the new standby database takes over processing of the cache group automatically when it enters the STANDBY state.

Note:

You can verify that the standby database has entered the STANDBY state by using the ttRepStateGet built-in procedure.

When There Is Unsynchronized Data in the Cache Groups

You can fail over to the standby database when the active database fails, even if there is unsynchronized data in the cache groups.

If the steps in either When Replication Is Return Receipt or Asynchronous or When Replication Is Return Twosafe fail, then there could be unsynchronized data in the AWT cache groups that has not been propagated to the Oracle database. In addition, there could be unsynchronized data on the Oracle database that has not been uploaded to any read-only cache groups that are included in the active standby pair replication scheme.

If any AWT cache group on the standby master contains data that had not been propagated when the active database failed, then simply recovering the standby database as the new active database is not an option. In this case, perform the following:

  1. On the standby database, stop the replication agent and drop the replication configuration using the DROP ACTIVE STANDBY PAIR statement.

  2. Stop the cache agent to ensure that no more updates are applied to the AWT cache groups while performing this recovery operation and to ensure that you control when any read-only cache groups that were included in the replication scheme are refreshed.

  3. For any read-only cache groups that are included in the replication scheme, set the autorefresh state to PAUSED with the ALTER CACHE GROUP ... SET AUTOREFRESH STATE PAUSED statement.

  4. On the standby database, flush any unpropagated committed inserts or updates on TimesTen cache tables for any AWT cache groups to the cached Oracle Database tables, as follows:

    1. Set autocommit to off.

    2. Call the ttCacheAllowFlushAwtSet built-in procedure with the parameter set to 1. This built-in procedure enables you to run a FLUSH CACHE GROUP statement against an AWT cache group and should only be used in this recovery scenario.

      call ttCacheAllowFlushAwtSet(1);

    3. Run the FLUSH CACHE GROUP SQL statement against each AWT cache group to ensure that all data is propagated to the Oracle database.

      Note:

      Running the FLUSH CACHE GROUP statement under these conditions on the AWT cache group only flushes the contents of the tables in the AWT cache group; that is, the data that was either inserted or updated. It does not take into account any delete operations. As a result, rows that were deleted from the AWT cache group may still exist on the Oracle database. You are responsible for recovering any delete operations.

    4. Call the ttCacheAllowFlushAwtSet built-in procedure with the parameter set to 0 to disallow any future running of the FLUSH CACHE GROUP statement on an AWT cache group.

      call ttCacheAllowFlushAwtSet(0);

    5. Commit after calling the ttCacheAllowFlushAwtSet built-in procedure with the parameter set to 0. You can then reset autocommit to on, because it needed to be off only for the ttCacheAllowFlushAwtSet built-in procedure.

  5. Drop and re-create all AWT cache groups using the DROP CACHE GROUP and CREATE CACHE GROUP statements.

  6. Start the replication agent and the cache agent, since the cache agent needs to be active to refresh any read-only cache groups and both must be active in order to load the AWT cache groups.

  7. Refresh all read-only cache groups using the REFRESH CACHE GROUP statement to upload the most current committed data from the cached Oracle database tables. Use the REFRESH CACHE GROUP ... PARALLEL n clause to refresh these cache groups concurrently over multiple threads.

  8. Load all AWT cache groups using the LOAD CACHE GROUP statement to begin the autorefresh process. Use the LOAD CACHE GROUP ... PARALLEL n clause to concurrently load these cache groups over multiple threads.

  9. Stop both the replication agent and the cache agent in preparation to re-create the active standby pair.

  10. Re-create the replication configuration on the standby database using the CREATE ACTIVE STANDBY PAIR statement.

  11. Set the old standby database as the new active database, destroy the failed old active database, perform a duplicate of the active to create a new standby database, and start the cache and replication agents on the standby as described in the steps listed in When Replication Is Return Receipt or Asynchronous.
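
Steps 1 through 10 can be sketched as a single ttIsql session on the standby database. This is an illustration under stated assumptions, not a definitive script: the cache group names ro_cg and awt_cg, the database names master1 and master2, and the COMMIT EVERY and PARALLEL values are placeholders. The CREATE CACHE GROUP and CREATE ACTIVE STANDBY PAIR statements must match your original definitions, so the first appears only as a comment and the second is shown in its simplest form. Repeat the per-cache-group statements for every cache group of that type.

```sql
-- Steps 1-2: stop replication, drop the scheme, stop the cache agent.
call ttRepStop();
DROP ACTIVE STANDBY PAIR;
call ttCacheStop();

-- Step 3: pause autorefresh on each read-only cache group.
ALTER CACHE GROUP ro_cg SET AUTOREFRESH STATE PAUSED;

-- Step 4: flush unpropagated AWT inserts and updates to Oracle.
-- Deletes are not flushed and must be handled separately.
autocommit 0;
call ttCacheAllowFlushAwtSet(1);
FLUSH CACHE GROUP awt_cg;
call ttCacheAllowFlushAwtSet(0);
commit;
autocommit 1;

-- Step 5: drop and re-create each AWT cache group.
DROP CACHE GROUP awt_cg;
-- Re-create awt_cg here with its original
-- CREATE ASYNCHRONOUS WRITETHROUGH CACHE GROUP definition.

-- Step 6: start both agents.
call ttRepStart();
call ttCacheStart();

-- Step 7: refresh each read-only cache group (values are placeholders).
REFRESH CACHE GROUP ro_cg COMMIT EVERY 256 ROWS PARALLEL 4;

-- Step 8: load each AWT cache group (values are placeholders).
LOAD CACHE GROUP awt_cg COMMIT EVERY 256 ROWS PARALLEL 4;

-- Step 9: stop both agents before re-creating the scheme.
call ttCacheStop();
call ttRepStop();

-- Step 10: re-create the active standby pair.
-- Replace with your original statement, including any subscribers
-- and return service options.
CREATE ACTIVE STANDBY PAIR master1, master2;
```

Keeping autocommit off only around the flush in step 4 mirrors the procedure above. The remaining statements run with autocommit on so that each one commits as it completes.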

Failing Back to the Original Nodes

After a successful failover, you may want to fail back so that the active database and the standby database are on their original nodes.

See Reversing the Roles of the Active and Standby Databases.