6 Transaction Recovery Spanning Multiple Sites or Data Centers

Learn about the best practices for XA transaction recovery of WebLogic domains across physical sites as part of a Disaster Recovery (DR) solution.

Understanding XA Transaction Recovery in Disaster Recovery

Maximizing availability and providing protection from unforeseen disasters and natural calamities are key requirements of a disaster recovery solution for an enterprise deployment. One aspect of a disaster recovery solution is to ensure that all XA transactions of affected WebLogic domains can be recovered if a production site is no longer available.

For transaction recovery, the following solutions are provided:

Active-Passive: These solutions involve setting up and pairing a standby site at a geographically different location with an active (production) site. The standby site is normally in a passive mode; it is started when the production site is not available. See Active-Passive XA Transaction Recovery.
Active-Active Stretch Cluster: In this architecture, a cluster stretches across two sites. Transaction recovery across sites takes place by service or server migration. See Active-Active Stretch Cluster XA Transaction Recovery.

Note:
Transactions that enlist a WebLogic Server JMS resource cannot be recovered in active-active recovery solutions.

See Overview of Disaster Recovery in Oracle® Fusion Middleware Disaster Recovery Guide for more information on disaster recovery solutions for enterprise deployments. The following sections provide the requirements and guidelines to recover XA transactions from failed production domains.

Active-Passive XA Transaction Recovery

Learn about the domain configuration and requirements for XA transactions in an active-passive recovery solution.

Requirements for XA Transactions in an Active-Passive Disaster Recovery Solution

This section provides the conditions and configuration requirements necessary to enable the successful recovery of XA transactions following the failure of a production site in an active-passive disaster recovery solution:

All active-passive domain pairs are configured with a symmetric topology: they are identical and have the same domain configurations.
With WebLogic Continuous Availability, failover can be orchestrated with Oracle Site Guard. Oracle Enterprise Manager Cloud Control is one option that provides the ability to manage disaster recovery of WebLogic Server domains across multiple data centers.
The workload on the active and passive sites can be maintained at less than full capacity during runtime, so that capacity remains consistent during runtime and recovery.
Use only host names (not static IP addresses) to specify the listen address of managed servers. Configuration of these host names must be identical on all sites. Because the host names are identical between sites but the IP addresses are not, host names make it possible to simply start an identically configured server or domain on the recovery site.
- Before initiating the Transaction Recovery service by starting servers in the passive domain, update the DNS server to point the DNS names to the machine(s) in the passive data center.
  
  For example: Active domain Domain1 has two managed servers running on two machines, Mc1 and Mc2. In the domain configuration, use the corresponding DNS names dns-1 and dns-2. When the active domain fails and you want to activate the corresponding passive domain, update the DNS server so that dns-1 and dns-2 point to Mc3 and Mc4 respectively. Then start the passive domain, Domain2.
- Do not use DNS names that include an underscore; they are not valid in WebLogic Server domains. DNS names with a dash are valid.
You have several options to store the TLog: a default store, a JDBC TLog, LLR, and a determiner resource. A default store must be in a common area (usually NFS or SAN). A JDBC TLog uses a database as a storage location common to all WLS servers and is typically replicated using Data Guard or Active Data Guard to ensure high availability. When possible, eliminate XA transaction TLog writes by using determiner resources. See XA Transactions without Transaction TLog Write.
Transaction service migration within a cluster is supported only if the entire cluster, including the corresponding domains and servers, is failed over to a recovery site. Specifically, the administrator must ensure that Node Manager and the entire cluster, the corresponding domains, and any impacted servers have been shut down on the failed site before starting them all on the recovery site.
Transactions that span WebLogic domains can only be recovered in a site failover if all domains involved in the transaction are failed over.
The domain information is kept in one or more shared locations to ensure that domain configurations are in sync. Applications are likewise kept in one or more shared locations to ensure that they are in sync.

Note:
Pack/Unpack could be another approach to keep configurations in sync.
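
The host name rules above (host names rather than static IP addresses, no underscores, dashes allowed) and the DNS repointing example can be checked before a failover with a small script. The following is a minimal sketch in Python, the language family that WLST scripts also use. The function names check_listen_address and repoint are hypothetical helpers written for this illustration; they are not part of WebLogic Server, and the script does not update a real DNS server.

```python
import ipaddress
import re

# One DNS label: letters, digits, and hyphens; no leading or trailing hyphen.
_LABEL = re.compile(r"^[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?$")


def check_listen_address(address):
    """Return a list of problems found in a listen address (empty if none).

    Hypothetical helper: it only applies the rules stated in this section.
    """
    try:
        ipaddress.ip_address(address)
        return ["static IP address used instead of a host name"]
    except ValueError:
        pass  # not an IP literal, so treat it as a host name

    problems = []
    for label in address.rstrip(".").split("."):
        if "_" in label:
            problems.append("underscore in label %r" % label)
        elif not _LABEL.match(label):
            problems.append("invalid DNS label %r" % label)
    return problems


def repoint(dns_records, failover_map):
    """Return a copy of dns_records with each name moved to its standby machine.

    dns_records maps DNS name -> machine; failover_map maps active -> standby.
    This only models the change; updating the DNS server is a separate step.
    """
    return {name: failover_map.get(machine, machine)
            for name, machine in dns_records.items()}


if __name__ == "__main__":
    for addr in ("dns-1", "dns_1", "10.0.0.5"):
        print(addr, check_listen_address(addr))
    print(repoint({"dns-1": "Mc1", "dns-2": "Mc2"}, {"Mc1": "Mc3", "Mc2": "Mc4"}))
```

Because the domain configuration refers only to dns-1 and dns-2, the passive domain needs no configuration change after the records are repointed; only the DNS mapping differs between sites.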

See Setting Up and Managing Disaster Recovery Sites in Oracle® Fusion Middleware Disaster Recovery Guide for detailed information on conditions and requirements for setting up active and passive recovery sites for enterprise deployments.

Example Active-Passive Domain Configuration for XA Transaction Recovery

All active-passive domain pairs are configured with a symmetric topology: they are identical and have the same domain configurations. The failover process in an active-passive architecture is either manual or controlled by an external tool.

Figure 6-1 Domain Configuration for Active-Passive Recovery

An application running on Site1 starts a transaction. After it calls commit, the entire application infrastructure tier comes down. In this example, the application session replication, file system replication, and DB replication are taking place between the two sites.

If the entire WebLogic Server tier has not come down, then all servers need to be shut down and all the identical servers in the passive domain need to be started. Because the domains, clusters, servers, and resources all have identical names, recovery commences as soon as the servers are started, and all transactions are recovered. A graceful shutdown of each server is recommended to allow work to drain.
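
The ordering described above (every server on the failed site is shut down before the identically named servers on the passive site are started) can be expressed as a pre-start check. The following Python sketch assumes that the operator has already collected the server states by some other means, for example from an administration tool. The function ready_to_start_passive and its input format are hypothetical and exist only for this illustration.

```python
def ready_to_start_passive(active_states, passive_names):
    """Return (ok, reasons) for starting the passive domain.

    active_states: dict mapping each server name on the failed site to a
        state string such as "SHUTDOWN" or "RUNNING" (collected elsewhere).
    passive_names: iterable of server names configured in the passive domain.
    Hypothetical helper; it does not contact any server.
    """
    reasons = []

    # Every server on the failed site must be down before recovery starts.
    for name, state in sorted(active_states.items()):
        if state != "SHUTDOWN":
            reasons.append("%s on the failed site is still %s" % (name, state))

    # The two domains must use identical server names for recovery to
    # pick up the transactions of the failed servers.
    active = set(active_states)
    passive = set(passive_names)
    for name in sorted(active - passive):
        reasons.append("%s has no counterpart in the passive domain" % name)
    for name in sorted(passive - active):
        reasons.append("%s exists only in the passive domain" % name)

    return (not reasons, reasons)


if __name__ == "__main__":
    ok, why = ready_to_start_passive(
        {"ms1": "SHUTDOWN", "ms2": "RUNNING"}, ["ms1", "ms2"])
    print(ok, why)
```

A check of this kind is only a guard against starting the passive domain too early; the failover itself remains manual or is driven by an external tool, as noted earlier in this section.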

Active-Active Stretch Cluster XA Transaction Recovery

Learn about the requirements and domain configuration for XA transactions in an active-active recovery solution, where a WebLogic Server Cluster stretches across two sites.

When a server of the cluster fails, another server of the cluster can take over transaction recovery with service or server migration. Server and resource checkpoints are still written to the TLOG. Checkpoints are written when the resources are first involved in a global transaction, updated only if there are changes to the transaction participants, and purged only if they are no longer used or become unavailable. Because checkpoints are created early in the transaction, there is little danger of a checkpoint being out of sync during transaction service migration or transaction recovery spanning multiple sites or data centers, as long as there are no changes to the participants in the global transactions. This kind of architecture requires low latency between the sites. See Multicast and Cluster Configuration in Administering Clusters for Oracle WebLogic Server.

This section describes the domain configuration and requirements for XA transactions in an active-active stretch cluster recovery solution.

Requirements for XA Transactions in an Active-Active Stretch Cluster Disaster Recovery Solution

This section describes the conditions and configuration requirements necessary to enable the successful recovery of XA transactions following the failure of a production site in an active-active stretch cluster disaster recovery solution.

Because this architecture uses service or server migration for XA transaction recovery, the requirements are the same as the conditions necessary for server migration. See Server Migration.

In addition, the network must meet the following requirements:

Full support of IP multicast packet propagation. In other words, all routers and other tunneling technologies must be configured to propagate multicast messages to clustered server instances.
Network latency low enough to ensure that most multicast messages reach their final destination in approximately 10 milliseconds.
Multicast Time-To-Live (TTL) value for the cluster high enough to ensure that routers do not discard multicast packets before they reach their final destination. For instructions on setting the Multicast TTL parameter, see Configure Multicast Time-To-Live (TTL).

See If Your Cluster Spans Multiple Subnets In a WAN in Administering Clusters for Oracle WebLogic Server.
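
The latency and TTL requirements above can be explored with a small test outside WebLogic Server. The following Python sketch shows how a multicast TTL is set on an ordinary UDP socket using the standard socket module, and how a set of measured latencies can be compared against the 10 millisecond guideline. The function names and the sample values are assumptions made for this illustration. The cluster's own Multicast TTL is set in the WebLogic cluster configuration, not by this script.

```python
import socket
import struct


def open_multicast_sender(ttl):
    """Create a UDP socket whose outgoing multicast packets use the given TTL.

    Each router hop decrements the TTL, so the value must be at least the
    number of hops between the two sites for packets to arrive.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL,
                    struct.pack("B", ttl))
    return sock


def fraction_within(latencies_ms, limit_ms=10.0):
    """Return the fraction of latency samples at or below limit_ms."""
    if not latencies_ms:
        raise ValueError("no latency samples provided")
    within = sum(1 for value in latencies_ms if value <= limit_ms)
    return within / len(latencies_ms)


if __name__ == "__main__":
    # Hypothetical measurements, in milliseconds, between the two sites.
    samples = [2.0, 5.0, 9.0, 12.0]
    print("fraction within 10 ms:", fraction_within(samples))

    sender = open_multicast_sender(ttl=4)
    sender.close()
```

How large a fraction counts as "most" messages is a judgment for the administrator; the guideline above does not give a precise threshold.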

Example Active-Active Stretch Cluster for XA Transaction Recovery

In an active-active stretch cluster recovery solution, the JTA service or server migration is used for transaction recovery. This architecture is recommended when there is low latency between sites.

Figure 6-2 Domain Configuration for Active-Active Stretch Cluster Recovery

In this example, an application running on Site 1 starts a transaction. After it calls commit, one or more servers on Site 1 fail. If service or server migration is configured, the surviving servers on Site 2 (in the stretch cluster) take over recovery for the failed servers.

Additional Information on Maximum Availability Architecture

Oracle provides a number of resources with additional information on how to configure environments for maximum availability.

See the following topics: