The procedures for managing failover and recovery depend primarily on the following:
The replication scheme
Whether the failure occurred on a master or subscriber database
Whether the threshold for the transaction log on the master is exhausted before the problem is resolved and the databases reconnected
The following sections describe different procedures for managing failover:
In a default asynchronous replication scheme, if a subscriber database becomes inoperable or communication to a subscriber database fails, updates at the master are not impeded and the cluster manager does not have to take any immediate action.
If the failed subscriber is configured to use a return service, you must first disable return service blocking, as described in Disabling Return Service Blocking Manually.
During outages at subscriber systems, updates intended for the subscriber are saved
in the transaction log on the master. If the subscriber agent reestablishes communication with
its master before the master reaches its
FAILTHRESHOLD, the updates held in
the log are automatically transferred to the subscriber and no further action is required. See
Setting the Transaction Log Failure Threshold for details on how to establish the
FAILTHRESHOLD value for
the master database.
If the FAILTHRESHOLD is exceeded, the master sets the subscriber to the failed state, and the subscriber must be recovered, as described in Recovering a Failed Database. Any application that connects to the failed subscriber receives a tt_ErrReplicationInvalid (8025) warning indicating that the database has been marked as failed by a replication peer.
An application can use the ODBC
SQLGetInfo function to check if the
subscriber database it is connected to has been set to the
failed state. The
SQLGetInfo function includes a TimesTen-specific infotype,
TT_REPLICATION_INVALID, that returns an integer value of 1 if the database has failed, or 0 if it has not.
Because the infotype TT_REPLICATION_INVALID is specific to TimesTen, all applications that use it must include the timesten.h file in addition to the other ODBC include files.
However, if you are using a bidirectional replication scheme, where each database serves as both master and subscriber, and one of the subscribers fails, an error condition can occur. For example, assume that the masters and subscribers for the bidirectional replication scheme are defined as follows:
CREATE REPLICATION r1
  ELEMENT elem_accounts_1 TABLE ttuser.accounts
    MASTER westds ON "westcoast"
    SUBSCRIBER eastds ON "eastcoast"
  ELEMENT elem_accounts_2 TABLE ttuser.accounts
    MASTER eastds ON "eastcoast"
    SUBSCRIBER westds ON "westcoast";
When the eastds subscriber fails, the westds master stops accumulating updates for this subscriber, since it has received notice of the failure. On eastds itself, however, the failure causes the replication agent to shut down. The eastds master then continues accumulating updates to propagate to its subscriber on westds, unaware that the replication agent has shut down. These updates continue to accumulate past the defined FAILTHRESHOLD, because the replication agent (which both propagates the records to the subscriber and monitors the FAILTHRESHOLD) is down.
When TT_REPLICATION_INVALID is set to 1 on a subscriber or standby database, the replication agent shuts down because the subscriber or standby is no longer receiving updates. If your database fails while it is part of a bidirectional replication configuration, the replication agent is therefore not running and the FAILTHRESHOLD is not honored. To resolve this situation, destroy the subscriber or standby database and re-create it:
Destroy the failed database (in this example, the eastds database).
Re-create the failed database by performing a ttRepAdmin -duplicate operation from the other master in the bidirectional replication scheme (in this example, the master on westds).
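As a sketch, these two steps might look like the following on the failed host. The database and host names come from the example scheme above; the user name and password are placeholders, and the exact ttRepAdmin options should be verified against your TimesTen release:

```
# On the host of the failed eastds database.
# Destroy the failed database:
ttDestroy eastds

# Re-create it by duplicating from the surviving master
# (credentials are placeholders):
ttRepAdmin -duplicate -from westds -host "westcoast" -uid ttuser -pwd ttpwd eastds
```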
You can check if the database identified by the hdbc handle has been set to the failed state:

SQLINTEGER retStatus;

SQLGetInfo(hdbc, TT_REPLICATION_INVALID, (PTR)&retStatus, NULL, NULL);
The cluster manager plays a more central role if a failure involves the master database. If a master database fails, the cluster manager must detect this event and redirect the user load to one of its surviving databases.
This surviving subscriber then becomes the master, which continues to accept
transactions and replicates them to the other surviving subscriber databases. If the
failed master and surviving subscriber are configured in a bidirectional manner,
transferring the user load from a failed master to a subscriber does not require that
you make any changes to your replication scheme. However, when using unidirectional
replication or complex schemes, such as those involving propagators, you may have to
issue one or more
ALTER REPLICATION statements to reconfigure the
surviving subscriber as the "new master" in your scheme. See Replacing a Master Database in a Classic Replication Scheme for an example.
When the problem is resolved, if you are not using the bidirectional configuration or the active standby pair described in Automatic Catch-Up of a Failed Master Database, you must recover the master database as described in Recovering a Failed Database.
After the database is back online, the cluster manager can either transfer the user load back to the original master or reestablish it as a subscriber for the "acting master."
Automatic Catch-Up of a Failed Master Database
The master catch-up feature automatically restores a failed master database from a subscriber database without the need to invoke the ttRepAdmin -duplicate operation.
The master catch-up feature needs no configuration, but it can be used only in the following types of configurations:
A single master replicated in a bidirectional manner to a single subscriber
An active standby pair that is configured with RETURN TWOSAFE
For replication schemes that are not active standby pairs, the following must be true:
RETURN TWOSAFE must be enabled.
All replicated transactions must be committed nondurably, and they must be transmitted to the remote database before they are committed on the local database. For example, if the replication scheme is configured with RETURN TWOSAFE BY REQUEST and any transaction is committed without first enabling RETURN TWOSAFE, master catch-up may not occur after a failure of the master.
When the master replication agent is restarted after a crash or invalidation, any
lost transactions that originated on the master are automatically reapplied from the
subscriber to the master (or from the standby to the active in an active standby pair).
No connections are allowed to the master database until it has completely caught up with
the subscriber. Applications attempting to connect to a database during the catch-up
phase receive an error that indicates a catch-up is in progress. The only exception is
connecting to a database with the
ForceConnect first connection
attribute set in the DSN. When the catch-up phase is complete, the application can
connect to the database. If one of the databases is invalidated or crashes during the
catch-up process, the catch-up phase is resumed when the database comes back up.
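For example, an application that must connect during the catch-up phase could set the ForceConnect attribute in its connection string (the DSN name below is a placeholder):

```
DSN=masterDSN;ForceConnect=1
```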
Master catch-up can fail under these circumstances:
The failed database is offline long enough for the failure threshold to be exceeded on the subscriber database (the standby database in an active standby pair).
Dynamic load operations are taking place on the active database in an active standby pair when the failure occurs.
RETURN TWOSAFE is not enabled for dynamic load operations even though it is enabled for the active database. The database failure causes the dynamic load transactions to be trapped and RETURN TWOSAFE to fail.
When Master Catch-Up Is Required for an Active Standby Pair
TimesTen error 8110 (
Connection not permitted. This store requires Master
Catchup.) indicates that the standby database is ahead of the active database
and that master catch-up must occur before replication can resume.
When using master catch-up with an active standby pair, the standby database must be
failed over to become the new active database. If the old active database can recover,
it becomes the new standby database. If it cannot recover, the old active database must
be destroyed and the new standby database must be created by duplicating the new active
database. See When Replication is Return Twosafe for more information about recovering from a failure of the active database when RETURN TWOSAFE is configured (required for master catch-up).
In an active standby pair with
RETURN TWOSAFE configured, it is
possible to have a trapped transaction. A trapped transaction occurs when the new
standby database has a transaction present that is not present on the new active
database after failover. Error 16227 (
Standby store has replicated transactions
not present on the active) is one indication of trapped transactions. You
can verify the number of trapped transactions by checking the number of records in
replicated tables on each database during the manual recovery process. For example,
enter a statement similar to the following:
SELECT COUNT(*) FROM reptable;
When there are trapped transactions, perform these tasks for recovery:
Use the ttRepStateSet built-in procedure to change the state on the standby database to 'ACTIVE'.
Destroy the old active database.
Use ttRepAdmin -duplicate to create a new standby database from the new active database, which has all of the transactions. See Duplicating a Database.
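Sketched as commands, the recovery tasks might look like the following. All database, host, and credential names are placeholders, and the ttRepAdmin options should be verified against your TimesTen release:

```
# In ttIsql on the standby database, make it the active:
Command> call ttRepStateSet('ACTIVE');

# From the shell: destroy the old active database, then duplicate
# the new active database to create a new standby:
ttDestroy oldactive
ttRepAdmin -duplicate -from newactive -host activehost -uid ttuser -pwd ttpwd newstandby
```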
Failures in Bidirectional Distributed Workload Schemes
You can distribute the workload over multiple bidirectionally replicated databases, each of which serves as both master and subscriber. When recovering a master/subscriber database, the log on the failed database may present problems when you restart replication.
If a database in a distributed workload scheme fails and work is shifted to a surviving database, the information in the surviving database becomes more current than that in the failed database. If replication is restarted at the failed system before the log failure threshold has been reached on the surviving database, then both databases attempt to update one another with the contents of their transaction logs. In this case, the older updates in the transaction log on the failed database may overwrite more recent data on the surviving system.
There are two ways to recover in such a situation:
If the timestamp conflict resolution rules described in Resolving Replication Conflicts are sufficient to guarantee consistency for your application, then you can restart the failed system and allow the updates from the failed database to propagate to the surviving database. The conflict resolution rules prevent more recent updates from being overwritten.
Re-create the failed database, as described in Recovering a Failed Database. If the database must be re-created, the updates in the log on the failed database that were not received by the surviving database cannot be identified or restored. In the case of several surviving databases, you must select which of the surviving databases is to be used to re-create the failed database. It is possible that at the time the failed database is re-created, the selected surviving database may not have received all updates from the other surviving databases. This results in diverging databases. The only way to prevent this situation is to re-create the other surviving databases from the selected surviving database.
In the event of a temporary network failure, you do not need to perform any specific action to continue replication.
The replication agents that were in communication attempt to reconnect every few seconds. If the agents reconnect before the master database runs out of log space, the replication protocol makes sure they do not miss or repeat any replication updates. If the network is unavailable for a longer period and the log failure threshold has been exceeded for the master log, you need to recover the subscriber as described in Recovering a Failed Database.
Failures Involving Sequences
After a network link failure, if replication is allowed to recover by replaying queued logs, you do not need to take any action.
However, if the failed host was down for a significant amount of time, you must use the ttRepAdmin -duplicate command to repopulate the database on the failed host with transactions from the surviving host, because sequences are not rolled back during failure recovery. In this case, the ttRepAdmin -duplicate command copies the sequence definitions from one database to the other.