Designing a Highly Available System

Any replication scheme has several primary objectives:

  • Provide one or more backup databases to ensure that the data is always available to applications

  • Provide a means to recover failed databases from their backup databases

  • Distribute workloads efficiently to provide applications with the quickest possible access to the data

  • Enable software upgrades and maintenance without disrupting service to users

In a highly available system, a subscriber database must be able to survive failures that may affect the master. At a minimum, the master and subscriber need to be on separate hosts. For some applications, you may want to place the subscriber in an environment that has a separate power supply. In certain cases, you may need to place a subscriber at an entirely separate site.

You can configure the following classic replication schemes (as described in Types of Replication Schemes):

  • Unidirectional

  • Bidirectional split workload

  • Bidirectional distributed workload

  • Propagation

In addition, consider whether you want to replicate a whole database or selected elements of the database. Also, consider the number of subscribers in the replication scheme. Unidirectional and propagation replication schemes enable you to choose the number of subscribers.

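The rest of this section includes these topics:

  • Considering Failover and Recovery Scenarios

  • Making Decisions About Performance and Recovery Tradeoffs

  • Distributing Workloads

For information about upgrading software without disrupting service, see Performing an Online Upgrade with Classic Replication in Oracle TimesTen In-Memory Database Installation, Migration, and Upgrade Guide.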

Considering Failover and Recovery Scenarios

As you plan a classic replication scheme, consider every failover and recovery scenario.

For example, subscriber failures generally have no impact on the applications connected to the master databases. Their recovery does not disrupt user service. If a failure occurs on a master database, you should have a means to redirect the application load to a subscriber and continue service with no or minimal interruption. This process is typically handled by a cluster manager or custom software designed to detect failures, redirect users or applications from the failed database to one of its subscribers, and manage recovery of the failed database. See Managing Database Failover and Recovery.
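The redirection step described above can be sketched as follows. This is a minimal illustration and not part of TimesTen or of any cluster manager: the `Database` class, its fields, and the `choose_target` function are hypothetical names, and the health flag is assumed to come from whatever failure detection your environment provides.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Database:
    name: str      # hypothetical database name
    host: str      # hypothetical host name
    healthy: bool  # result of a health probe supplied by the caller


def choose_target(master: Database, subscribers: List[Database]) -> Optional[Database]:
    """Return the database that applications should be directed to.

    The master is preferred while it is healthy. Otherwise the first
    healthy subscriber in the list is chosen. None means no database
    is currently available.
    """
    if master.healthy:
        return master
    for subscriber in subscribers:
        if subscriber.healthy:
            return subscriber
    return None


# Example: the master is down, so the load moves to the first healthy subscriber.
master = Database("masterds", "host1", healthy=False)
subscribers = [
    Database("sub1", "host2", healthy=False),
    Database("sub2", "host3", healthy=True),
]
target = choose_target(master, subscribers)
```

The order of the subscriber list encodes the failover preference, which keeps the policy easy to review. Recovery of the failed database is a separate step that this sketch does not cover.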

When planning failover strategies, consider which subscribers are to take on the role of the master and for which users or applications. Also, consider recovery factors. For example, a failed master must be able to recover its database from its most up-to-date subscriber, and any subscriber must be able to recover from its master. A bidirectional scheme that replicates the entire database can take advantage of automatic restoration of a failed master. See Automatic Catch-Up of a Failed Master Database.
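As a rough illustration of the recovery factor mentioned above, the following sketch picks the most up-to-date subscriber as the recovery source for a failed master. The `SubscriberState` class and its `last_applied_txn` counter are assumptions made for this example; they do not correspond to a TimesTen API, and how you measure replication progress depends on your monitoring.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubscriberState:
    name: str              # hypothetical subscriber name
    last_applied_txn: int  # hypothetical counter of the last applied transaction


def most_up_to_date(subscribers: List[SubscriberState]) -> SubscriberState:
    """Return the subscriber that has applied the most recent transaction."""
    if not subscribers:
        raise ValueError("no subscribers available for recovery")
    return max(subscribers, key=lambda s: s.last_applied_txn)


candidates = [
    SubscriberState("sub1", last_applied_txn=1040),
    SubscriberState("sub2", last_applied_txn=1057),
    SubscriberState("sub3", last_applied_txn=998),
]
recovery_source = most_up_to_date(candidates)
```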

Consider the failure scenario for the unidirectionally replicated database shown in Figure 9-1. In the case of a master failure, the application cannot access the database until it is recovered from the subscriber. You cannot switch the application connection or user load to the subscriber unless you use an ALTER REPLICATION statement to redefine the subscriber database as the master. See Replacing a Master Database in a Classic Replication Scheme.

Figure 9-1 Recovering a Master in a Unidirectional Scheme

Figure 9-2 shows a bidirectional distributed workload scheme in which the entire database is replicated. Failover in this type of classic replication scheme involves shifting the users of the application on the failed database to the application on the surviving database. Upon recovery, the workload can be redistributed to the application on the recovered database.

Figure 9-2 Recovering a Master in a Distributed Workload Scheme

Similarly, the users in a split workload scheme must be shifted from the failed database to the surviving database. Because replication in a split workload scheme is not at the database level, you must use an ALTER REPLICATION statement to set a new master database. See Replacing a Master Database in a Classic Replication Scheme. Upon recovery, the users can be moved back to the recovered master database.

Propagation classic replication schemes also require the use of the ALTER REPLICATION statement to set a new master or a new propagator if the master or propagator fails. Higher availability is achieved if two propagators are defined in the replication scheme. See Figure 1-11 for an example of a propagation replication scheme with two propagators.

Making Decisions About Performance and Recovery Tradeoffs

When you design a classic replication scheme, weigh operational efficiencies against the complexities of failover and recovery. Factors that may complicate failover and recovery include the network topology that connects a master with its subscribers and the complexity of the replication scheme.

For example, it is easier to recover a master that has been fully replicated to a single subscriber than to recover a master that has selected elements replicated to different subscribers.

You can configure classic replication to work asynchronously (the default), "semi-synchronously" with return receipt service, or fully synchronously with return twosafe service. Selecting a return service provides greater confidence that your data is consistent on the master and subscriber databases. Your decision to use default asynchronous replication or to configure return receipt or return twosafe mode depends on the degree of confidence you require and the performance tradeoff you are willing to make in exchange.
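To make the difference between the three modes concrete, the following sketch models the order of events that Table 9-1 describes for each mode. It is a simplified model written for this discussion, not TimesTen code: the step names and the `steps_before_control_returns` helper are illustrative assumptions.

```python
from typing import Dict, List

# Ordered steps per replication mode, based on the commit sequence and
# blocking behavior summarized in Table 9-1. Step names are illustrative.
COMMIT_STEPS: Dict[str, List[str]] = {
    "asynchronous": [
        "commit on master",
        "return control to application",
        "send transaction to subscriber",
    ],
    "return receipt": [
        "commit on master",
        "send transaction to subscriber",
        "wait for acknowledgment from subscriber",
        "return control to application",
    ],
    "return twosafe": [
        "send transaction to subscriber",
        "commit on subscriber",
        "wait for acknowledgment from subscriber",
        "commit on master",
        "return control to application",
    ],
}


def steps_before_control_returns(mode: str) -> List[str]:
    """Return the steps the application waits for in the given mode."""
    steps = COMMIT_STEPS[mode]
    return steps[: steps.index("return control to application")]


for mode in COMMIT_STEPS:
    print(mode, "->", steps_before_control_returns(mode))
```

The longer the list of steps before control returns, the longer the application is blocked, which is the performance cost you pay for greater confidence in consistency.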

Table 9-1 summarizes the performance and recovery tradeoffs of asynchronous replication, return receipt service, and return twosafe service.

Table 9-1 Performance and Recovery Tradeoffs

The table compares three replication types for each type of behavior: asynchronous replication (default), return receipt, and return twosafe.

Commit sequence

  • Asynchronous replication (default): Each transaction is committed first on the master database.

  • Return receipt: Each transaction is committed first on the master database.

  • Return twosafe: Each transaction is committed first on the subscriber database.

Performance on master

  • Asynchronous replication (default): Shortest response time and best throughput because there is no log wait between transactions or before the commit on the master.

  • Return receipt: Longer response time and less throughput than asynchronous. The application is blocked for the duration of the network round-trip after commit. Replicated transactions are more serialized than with asynchronous replication, which results in less throughput.

  • Return twosafe: Longest response time and least throughput. The application is blocked for the duration of the network round-trip and remote commit on the subscriber before the commit on the master. Transactions are fully serialized, which results in the least throughput.

Effect of a runtime error

  • Asynchronous replication (default): Because the transaction is first committed on the master database, errors that occur when committing on a subscriber require the subscriber to be either manually corrected or destroyed and then recovered from the master database.

  • Return receipt: Because the transaction is first committed on the master database, errors that occur when committing on a subscriber require the subscriber to be either manually corrected or destroyed and then recovered from the master database.

  • Return twosafe: Because the transaction is first committed on the subscriber database, errors that occur when committing on the master require the master to be either manually corrected or destroyed and then recovered from the subscriber database.

Failover after failure of master

  • Asynchronous replication (default): If the master fails and the subscriber takes over, the subscriber may be behind the master and must reprocess data feeds and be able to remove duplicates.

  • Return receipt: If the master fails and the subscriber takes over, the subscriber may be behind the master and must reprocess data feeds and be able to remove duplicates.

  • Return twosafe: If the master fails and the subscriber takes over, the subscriber is at least up to date with the master. It is also possible for the subscriber to be ahead of the master if the master fails before committing a transaction it had replicated to the subscriber.
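With asynchronous replication or return receipt, a subscriber that takes over may need to reprocess data feeds and remove duplicates. One way to do this, sketched below, is to key each feed record by a unique identifier and skip records that have already been applied. The record layout, the dictionary standing in for the database, and the `reprocess_feed` function are assumptions made for illustration only.

```python
from typing import Dict, Iterable, Tuple


def reprocess_feed(applied: Dict[str, str], feed: Iterable[Tuple[str, str]]) -> int:
    """Apply feed records that are not already present.

    applied maps a hypothetical record ID to its payload and stands in
    for the data already on the subscriber. Returns the number of
    records that were newly applied.
    """
    newly_applied = 0
    for record_id, payload in feed:
        if record_id in applied:
            continue  # duplicate: already replicated or already reprocessed
        applied[record_id] = payload
        newly_applied += 1
    return newly_applied


# The subscriber already holds r1; the replayed feed repeats r1 and sends r2 twice.
subscriber_data = {"r1": "order created"}
replayed = [("r1", "order created"), ("r2", "order paid"), ("r2", "order paid")]
count = reprocess_feed(subscriber_data, replayed)
```

Because the apply step is idempotent, the feed can be replayed from a point safely before the failure without creating duplicate rows.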

In addition to the performance and recovery tradeoffs between the two return services, you should also consider the following:

  • Return receipt can be used in more configurations, whereas return twosafe can only be used in a bidirectional configuration or an active standby pair.

  • Return twosafe enables you to specify a "local action" to be taken on the master database in the event of a timeout or other error encountered when replicating a transaction to the subscriber database.

A transaction is classified as return receipt or return twosafe when the application updates a table that is configured for either return receipt or return twosafe. Once a transaction is classified as either return receipt or return twosafe, it remains so, even if the replication scheme is altered before the transaction completes.

See Using a Return Service in a Classic Replication Scheme.

Distributing Workloads

Consider configuring the databases to distribute application workloads and make the best use of a limited number of servers.

For example, it may be efficient and economical to configure the databases in a bidirectional distributed workload replication scheme so that each serves as both master and subscriber, rather than as separate master and subscriber databases. However, a distributed workload scheme works best with applications that primarily read from the databases. Implementing a distributed workload scheme for applications that frequently write to the same elements in a database may diminish performance and require that you implement a solution to prevent or manage update conflicts, as described in Resolving Replication Conflicts.
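To show why concurrent writes to the same element need a conflict policy, the following sketch applies one common approach, timestamp-based "last writer wins". It is only an illustration under stated assumptions: the `RowVersion` class and `resolve` function are hypothetical, and the policy shown is not necessarily the one described in Resolving Replication Conflicts.

```python
from dataclasses import dataclass


@dataclass
class RowVersion:
    value: str
    timestamp: float  # hypothetical time of the update, in seconds


def resolve(local: RowVersion, incoming: RowVersion) -> RowVersion:
    """Keep the version with the newer timestamp; ties keep the local version."""
    return incoming if incoming.timestamp > local.timestamp else local


# Both databases updated the same row; the replicated update is newer.
local_row = RowVersion("balance=100", timestamp=1000.0)
replicated_row = RowVersion("balance=80", timestamp=1002.5)
winner = resolve(local_row, replicated_row)
```

Each conflict check adds work to every replicated write, which is one reason a distributed workload scheme suits read-mostly applications better than write-heavy ones.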