Release 2 (9.2)
Part Number A96571-02
This chapter explains concepts relating to Streams high availability environments.
Configuring a high availability solution requires careful planning and analysis of failure scenarios. Database backups and physical standby databases provide physical copies of a source database for failover protection. Oracle Data Guard, in SQL apply mode, implements a logical standby database in a high availability environment. Because Data Guard is designed for high availability, it handles most failure scenarios. However, some environments require the additional flexibility and extended feature set offered by Oracle Streams.
This chapter discusses some of the scenarios that may benefit from a Streams-based solution and explains Streams-specific issues that arise in high availability environments. It also contains information about best practices for deploying Streams in a high availability environment, including hardware failover within a cluster, instance failover within an Oracle Real Application Clusters cluster, and failover and switchover between replicas.
Oracle Real Application Clusters is the preferred method for protecting from an instance or system failure. After a failure, services are provided by a surviving node in the cluster. However, clustering does not protect from user error, media failure, or disasters. These types of failures require redundant copies of the database. You can make both physical and logical copies of a database.
Physical copies are identical, block for block, with the source database, and are the preferred means of protecting data. There are three types of physical copies: database backup, mirrored or multiplexed database files, and a physical standby database.
Logical copies contain the same information as the source database, but the information may be stored differently within the database. Creating a logical copy of your database offers many advantages. However, you should always create a logical copy in addition to a physical copy, not instead of one.
There are three types of logical copies of a database: logical standby databases, Streams replica databases, and application-maintained copies. Logical standby databases are best maintained using Oracle Data Guard in SQL apply mode. The rest of this chapter discusses Streams replica databases and application-maintained copies.
Like Oracle Data Guard in SQL apply mode, Oracle Streams can capture database changes, propagate them to destinations, and apply the changes at these destinations. Streams is optimized for replicating data. Streams can locally capture changes in the online redo log as it is written, and the captured changes can be propagated asynchronously to replica databases. This optimization can reduce latency and can enable the replicas to lag the primary database by no more than a few seconds. However, there is a possibility of a small window of data loss in the event of a catastrophic failure, which can be avoided by the use of a companion physical standby database.
Nevertheless, you may choose to use Streams to configure and maintain a logical copy of your production database. Although using Streams may require additional work, it offers increased flexibility that may be required to meet specific business requirements. A logical copy configured and maintained using Streams is called a replica, not a logical standby, because it provides many capabilities that are beyond the scope of the normal definition of a standby database. Some of the requirements that can best be met using an Oracle Streams replica are listed in the following sections.
The greatest difference between a replica database and a standby database is that a replica database can be updated and a standby database cannot. Applications that must update data can run against the replica, including job queues and reporting applications that log reporting activity. Replica databases also allow local applications to operate autonomously, protecting local applications from WAN failures and reducing latency for database operations.
The production and the replica need not be running on the exact same platform. This gives more flexibility in using computing assets, and facilitates migration between platforms.
Streams replicas can use different character sets than the production database. Data is automatically converted from one character set to another before being applied. This ability is extremely important if you have global operations and you must distribute data in multiple countries.
If the replica is used for near real-time reporting, Streams can lag the production database by no more than a few seconds, providing up-to-date and accurate queries. Changes can be read from the online redo logs as the logs are written, rather than from the redo logs after archiving.
Streams supports unlimited numbers of replicas. Its flexible routing architecture allows for hub-and-spoke configurations that can efficiently propagate data to hundreds of replicas. This ability may be important if you must provide autonomous operation to many local offices in your organization. In contrast, because standby databases configured with Data Guard use the LOG_ARCHIVE_DEST_n initialization parameter to specify destinations, there is a limit of ten copies when you use Data Guard.
Streams replicas can be open to read/write operations at all times. If a primary database fails, Streams replicas can instantly resume processing. A small window of data may remain only at the failed primary database, but this data is automatically applied at the replicas when the primary recovers. This ability may be important if you value fast recovery time over guaranteed zero data loss. Assuming the primary database can eventually be recovered, the data is only temporarily lost.
In a complex environment, changes need only be captured once. These changes can then be sent to multiple destinations. This ability enables more efficient use of the resources needed to mine the redo logs for changes.
If you wish to have a logical standby database, but are also using Streams on the primary database, you must use Streams APIs to create and maintain the logical copy of the database. Data Guard SQL apply mode and Streams are not supported on the same database.
As mentioned previously, there are scenarios where customers may choose to use Streams to meet some of their high availability requirements. One of the rules of high availability is to keep it simple. Oracle Data Guard is designed for high availability and is easier to implement than a Streams-based high availability solution. Customers who decide to leverage the flexibility offered by Streams must be prepared to invest in the expertise and planning required to make a Streams-based solution robust. This means writing scripts to implement much of the automation and management tools provided out-of-the-box with Oracle Data Guard.
Also, because Streams was designed for data integration, not high availability, it does not provide a zero data loss mode of operation, as does Oracle Data Guard. Customers who cannot afford to lose transactions in the event of a failure should either use Data Guard rather than Streams, or complement their Streams-based solution with a zero data loss physical standby maintained by Data Guard. Data Guard also provides a delayed apply option to protect from human errors. Again, a complementary physical standby can provide such protection.
The best availability can be achieved by designing the maintenance of logical copies of data directly into an application. The application knows what data is valuable and must be immediately moved off-site to guarantee no data loss. It can also synchronously replicate truly critical data, while asynchronously replicating less critical data. Applications maintain copies of data by either synchronously or asynchronously sending data to other applications that manage another logical copy of the data. Synchronous operations are performed using the distributed SQL or remote procedure features of the database. Asynchronous operations are performed using Advanced Queuing. Advanced Queuing is a database-integrated message queuing feature built on top of the infrastructure of Oracle Streams.
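As a minimal sketch of the asynchronous case, an application can hand a change record to an Advanced Queuing queue and let a consumer at the other site apply it to its copy of the data. The queue name, object type, and payload values below are hypothetical; only the DBMS_AQ.ENQUEUE call itself is standard.

```sql
-- Sketch only: queue name, object type, and values are hypothetical.
-- Asynchronously hand a change record to a copy-maintenance queue.
DECLARE
  enqueue_options     DBMS_AQ.ENQUEUE_OPTIONS_T;
  message_properties  DBMS_AQ.MESSAGE_PROPERTIES_T;
  msgid               RAW(16);
  payload             order_change_typ;  -- user-defined object type
BEGIN
  payload := order_change_typ(101, 'SHIPPED');
  DBMS_AQ.ENQUEUE(
    queue_name         => 'strmadmin.copy_maint_queue',
    enqueue_options    => enqueue_options,
    message_properties => message_properties,
    payload            => payload,
    msgid              => msgid);
  COMMIT;
END;
/
```

A consumer application would dequeue the payload with DBMS_AQ.DEQUEUE and apply the change to its logical copy, committing the dequeue and the change in one transaction.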
Although the highest levels of availability can be achieved with application maintained copies of data, great care is required to realize these results. Typically, a great amount of custom development is required. Many of the difficult boundary conditions that have been analyzed and solved with solutions such as Data Guard and Streams replication must be re-analyzed and solved by the custom application developers. In addition, standard solutions like Data Guard and Streams replication undergo stringent testing both by Oracle and its customers. It will take a great deal of effort before a custom-developed solution can exhibit the same degree of maturity. For these reasons, only organizations with substantial patience and expertise should attempt to build a high availability solution with application maintained copies.
See Also: Oracle9i Application Developer's Guide - Advanced Queuing for more information about developing applications with Advanced Queuing
Implementing Streams in a high availability environment requires consideration of possible failure and recovery scenarios, and the implementation of procedures to ensure Streams continues to capture, propagate, and apply changes after a failure. Some of the issues that must be examined include the following:
The following sections discuss these issues in detail.
When configuring a solution using Streams, it is important to anticipate failures and design availability into the architecture. You must examine every database in the distributed system, and design a recovery plan in case of failure of that database. In some situations, failure of a database affects only services accessing data on that database. In other situations, a failure is multiplied, because it may affect other databases.
A configuration where each database is directly connected to every other database in the distributed system is the most resilient to failures, because a failure of one database will not prevent any other databases from operating or communicating. Assuming all data is replicated, services that were using the failed database can connect to surviving replicas.
Although configurations where each database is directly connected to every other database provide the best high availability characteristics, they can become unmanageable when the number of databases becomes large. Hub and spoke configurations solve this manageability issue by funneling changes from many databases into a hub database, and then to other hub databases, or to other spoke databases. To add a new source or destination you simply connect it to a hub database, rather than establishing connections to every other database.
A hub, however, becomes a very important node in your distributed environment. Should it fail, all communications flowing through the hub will fail. Due to the asynchronous nature of the events propagating through the hub, it can be very difficult to redirect a stream from one hub to another. A better approach is to make the hub resilient to failures.
The same techniques used to make a single database resilient to failures also apply to distributed hub databases. Oracle Corporation recommends Oracle Real Application Clusters to provide protection from instance and node failures. This configuration should be combined with a "no loss" physical standby database, to protect from disasters and data errors. Oracle Corporation does not recommend using a Streams replica as the only means to protect from disasters or data errors.
See Also: "Primary Database Sharing Data with Several Secondary Databases" for a detailed example of such an environment
Using Oracle Real Application Clusters with Streams introduces some important considerations. Capturing changes from the online redo log as it is written is not supported with Oracle Real Application Clusters. Rather, changes are captured from the archived redo log. Capturing from the archived redo logs introduces additional latency between the time a change is made at the production database and the time it appears at the replica.
If low latency is important, a cold failover cluster should be used to protect from system failure rather than Oracle Real Application Clusters. A cold failover cluster is not an Oracle Real Application Cluster. Instead, a cold failover cluster uses a secondary node to mount and recover the database when the first node fails.
When running in an Oracle Real Application Clusters cluster, a capture process runs on the instance that owns the queue that is receiving the captured logical change records (LCRs). Job queues should be running on all instances where propagation is enabled. Assuming propagation is enabled for an instance, a propagation job running on that instance will propagate LCRs from any queue owned by that instance to destination queues. An apply process runs on the instance that owns the queue from which the apply process dequeues its events. That may or may not be the same queue on which capture runs.
Any propagation to the database running Oracle Real Application Clusters is made over database links. The database links must be configured to connect to the destination instance that owns the queue that will receive the events.
The following sections provide best practices for recovering from failures.
It is important to ensure that propagation continues to function after a failure of a destination database instance. A propagation job will continually retry the connection over its database link after a failure, with an increasing delay between retries, until the connection is reestablished. If the database is restarted on the same node, or on a different node in a cold failover cluster, then the connection should be reestablished once the database instance is restarted. In some circumstances, the database link may be waiting on a read or write, and will not detect the failure until a lengthy timeout, controlled by the TCP_KEEPALIVE_INTERVAL TCP/IP parameter, expires. In such circumstances, drop and re-create the database link to ensure that communication is reestablished quickly.
When an instance in an Oracle Real Application Clusters cluster fails, the instance is recovered by another node in the cluster. Each queue that was previously owned by the failed instance is assigned to a new instance. Any inbound database links must be dropped and reestablished to point to the new instance that owns the destination queue. In a high availability environment, you can prepare scripts that will drop and re-create all necessary database links. Once a failed instance has been recovered, you can execute these scripts so that Streams can resume propagation.
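Such a script can be as simple as a drop followed by a re-create for each inbound link. The link name, user, and service alias below are hypothetical; the alias must resolve to the instance that now owns the destination queue.

```sql
-- Sketch only: link name, user, and service alias are hypothetical.
-- Drop and re-create an inbound database link so that it connects to
-- the instance that now owns the destination queue.
DROP DATABASE LINK dest.example.com;

CREATE DATABASE LINK dest.example.com
  CONNECT TO strmadmin IDENTIFIED BY strmadminpw
  USING 'dest_inst2';  -- net service alias for the owning instance
```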
See Also: "Configuring Network Connectivity and Database Links" for information about creating database links in a Streams environment
After a failure and restart of a single node database, or a failure and restart of a database on another node in a cold failover cluster, the capture process will automatically return to the state it was in at the time of the failure. That is, if it was running at the time of the failure, there is no need to restart the capture process.
For a capture process running in an Oracle Real Application Clusters environment, if an instance running the capture process fails, then the queue that receives the captured LCRs will be assigned to another node in the cluster. You must determine which instance now owns the queue used by the capture process by querying the DBA_QUEUE_TABLES data dictionary view, and then restart the capture process on that node. If the failed instance is brought back online subsequently, it will not restart the capture process, even though it was running the capture process at the time of failure, because it is no longer the owner of the queue used by the capture process.
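The two steps can be sketched as follows. The queue table name and capture process name are hypothetical; DBA_QUEUE_TABLES and DBMS_CAPTURE_ADM.START_CAPTURE are the documented view and procedure.

```sql
-- Determine which instance now owns the queue table used by the
-- capture process (queue table name is hypothetical).
SELECT queue_table, owner_instance
  FROM dba_queue_tables
 WHERE queue_table = 'STREAMS_QUEUE_TABLE';

-- Then, connected to the owning instance, restart the capture process
-- (capture process name is hypothetical).
BEGIN
  DBMS_CAPTURE_ADM.START_CAPTURE(capture_name => 'capture_emp');
END;
/
```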
For events to be propagated from a source queue to a destination queue, a propagation job must run on the instance owning the source queue. In a single node database, or cold failover cluster, propagation will resume when the single database instance is restarted.
When running in an Oracle Real Application Clusters environment, care must be taken to ensure that propagation jobs are enabled on the correct instance. A propagation job should run on the instance that owns the source queue from which the propagation job sends events to destination queues. An instance affinity parameter can be specified for the propagation job to force the job to run on a particular instance. If the instance fails, and the queue ownership migrates to another instance, then the propagation job affinity must be reset to the new owning instance. Also, for any jobs to run on an instance, the dynamic initialization parameter JOB_QUEUE_PROCESSES must be greater than zero for that instance.
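For example, assuming the propagation job is job number 21 and queue ownership has migrated to instance 2 (both values are hypothetical), the parameter and the affinity can be reset as follows:

```sql
-- Enable job queue processes on this instance (JOB_QUEUE_PROCESSES
-- is dynamic in Oracle9i; the value 4 is an example).
ALTER SYSTEM SET JOB_QUEUE_PROCESSES = 4;

-- Re-point the propagation job's instance affinity to the instance
-- that now owns the source queue (job number 21 is hypothetical).
BEGIN
  DBMS_JOB.INSTANCE(job => 21, instance => 2);
  COMMIT;
END;
/
```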
After a failure and restart of a single node database, or a failure and restart of a database on another node in a cold failover cluster, the apply process will automatically return to the state it was in at the time of the failure. That is, if it was running at the time of the failure, then there is no need to restart the apply process.
In an Oracle Real Application Clusters environment, if an instance hosting the apply process fails, then the queue from which the apply process dequeues events will be assigned to another node in the cluster. You must determine which instance now owns the queue used by the apply process by querying the DBA_QUEUE_TABLES data dictionary view, and then restart the apply process on that node. If the failed instance is brought back online subsequently, it will not restart the apply process, even though it was running the apply process at the time of failure, because it is no longer the owner of the queue used by the apply process.
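After querying DBA_QUEUE_TABLES for the OWNER_INSTANCE of the relevant queue table, the restart itself is a single call. The apply process name below is hypothetical; DBMS_APPLY_ADM.START_APPLY is the documented procedure.

```sql
-- Connected to the instance that now owns the queue, restart the
-- apply process (apply process name is hypothetical).
BEGIN
  DBMS_APPLY_ADM.START_APPLY(apply_name => 'apply_emp');
END;
/
```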