6 Transaction Recovery Spanning Multiple Sites or Data Centers

This chapter describes best practices for XA transaction recovery of WebLogic domains across physical sites as part of a Disaster Recovery (DR) solution.

Understanding XA Transaction Recovery in Disaster Recovery

Maximizing availability and providing protection from unforeseen disasters and natural calamities are key requirements of a disaster recovery solution for an enterprise deployment. One aspect of a disaster recovery solution is to ensure that all XA transactions of affected WebLogic domains can be recovered if a production site is no longer available. For transaction recovery, the following solutions are provided:

Active-Active: In an active-active recovery solution, when all of the servers of a cluster of a site fail, transactions can be recovered by an active server or servers in a different domain either collocated on the same site or on a different site. For more information, see Active-Active XA Transaction Recovery.

Note:
Transaction recovery happens across domains but not between clusters in the same domain.
Active-Passive: These solutions involve setting up and pairing a standby site at a geographically different location with an active (production) site. The standby site is normally in a passive mode; it is started when the production site is not available. For more information, see Active-Passive XA Transaction Recovery.
Active-Active Stretch Cluster: In this architecture, a cluster stretches across two sites. Transaction recovery across sites takes place by service or server migration. For more information, see Active-Active Stretch Cluster XA Transaction Recovery.

Note:
Transactions that enlist a WebLogic Server JMS resource cannot be recovered in active-active recovery solutions.

See Overview of Disaster Recovery in Oracle® Fusion Middleware Disaster Recovery Guide for more information on disaster recovery solutions for enterprise deployments. The following sections provide the requirements and guidelines to recover XA transactions from failed production domains.

Active-Active XA Transaction Recovery

Active-active recovery solutions involve two or more active domain configurations and are used to improve scalability and availability. This section describes the domain configuration and requirements for XA transactions in an active-active recovery solution.

Requirements for XA Transactions in an Active-Active Recovery Solution

The cross-site XA transaction recovery feature is enabled dynamically by setting the site-name and recovery-site-name attributes. See Configuring MBean Attributes for Cross-Site XA Transaction Recovery. Also, the cross-site recovery feature should be activated after both domains or sites are up.

Note:

The site-name attribute can be set at the same time as recovery-site-name or prior to starting up the servers.

This section describes the configuration requirements necessary to enable the successful recovery of XA transactions after the failure of a production site in an active-active disaster recovery solution:

All domains that are in separate data centers have identical configurations and the same domain, server, and resource names.
Use only hostnames (not static IP addresses) to specify the listen address of managed servers. Configuration of these hostnames must be identical on all sites. Because hostnames are identical between sites but IP addresses are not, the hostname provides the dynamic ability to simply start an identically configured server or domain on the recovery site.
To store the TLog, you can use a JDBC TLog, LLR, or a determiner resource. A JDBC TLog uses a database as a common storage location to all WebLogic Servers and is typically replicated using DataGuard or Active DataGuard to ensure high availability. If you are using LLR or a determiner resource for storing transaction logs, you still need to configure a JDBC TLog for non-transaction checkpoints.
- Transaction store names must have a prefix that corresponds to the SiteName of the domain. For more information about the SiteName attribute, see Configuring MBean Attributes for Cross-Site XA Transaction Recovery.
Cross-site XA transaction recovery solutions use the leasing framework to determine the store ownership transfer in case of failure. The leasing design follows the existing model for database leasing of TRS (transaction recovery service) migration within a cluster. In this model, a server attempts to obtain a lease using a write lock, which prevents multiple servers from attempting to recover the same lease. One table is created per site, named [site-name]SITELEASING.
Note:
- If cross-site XA transaction recovery and server migration with database leasing are both configured, there will be two separate tables: the database leasing table and the new site leasing table.
- The site leasing tables should be stored in a highly available database.
- XA data sources are not supported for connecting to the database to update the site leasing table.
WebLogic JMS is not supported in this recovery solution.
Each site will have its own leasing table for Cross-Site transaction recovery.
Active-Active recovery is supported in WebLogic MT partitions. For more information, see the chapter Configuring Transactions in Using WebLogic Server MT.
In Maximum Availability Architecture (MAA), it is not a good practice to have transactions span servers in domains that are on two different sites.
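
The naming requirements above can be checked with a short script. The following is a minimal sketch only, not a WebLogic API: the function names and the sample store names are hypothetical, and it assumes that the site name is concatenated verbatim with SITELEASING, as the [site-name]SITELEASING nomenclature suggests.

```python
def site_leasing_table(site_name: str) -> str:
    """Build a table name following the [site-name]SITELEASING nomenclature.

    Assumption: the site name is used verbatim, with no separator or case change.
    """
    return f"{site_name}SITELEASING"


def store_has_site_prefix(store_name: str, site_name: str) -> bool:
    """Return True if the transaction store name starts with the domain's SiteName."""
    return store_name.startswith(site_name)


if __name__ == "__main__":
    # Hypothetical store names, used only for illustration.
    for store in ("site1_tlog_server1", "tlog_server1"):
        print(store, store_has_site_prefix(store, "site1"))
    print(site_leasing_table("site1"))
```

A check of this kind could be run against an exported list of store names before the cross-site recovery feature is enabled.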

Limitations and Considerations for Cross-Site XA Transaction Recovery

This section describes the limitations of XA recovery when transactions span domains, that is, when a transaction is started in one domain and calls into one or more other domains within the scope of the transaction:

When there is a failover of all the domains involved in the transaction (for example, an entire mid-tier or site failure), all domains involved in the transaction will fail over to the recovery site.
If one or more, but not all, of the domains involved in the transaction fail, manually shut down the surviving domains so that all domains on the recovery site are available for transaction recovery.
Active-Active topology configurations need to be kept identical on both sites by using WLST scripts, REST, Fusion Middleware Control, or the Administration Console. The same domain names, server names, and resource names should be configured on both sites. Parameters such as SiteName and RecoverySiteName should be configured differently in domains on each site. See Configuring MBean Attributes for Cross-Site XA Transaction Recovery.
In a cross-site XA recovery scenario, when a server is restarted after a crash, it is possible that the server's store is taken over by a recovering server in another domain.
In this case, the original server will signal the recovering server to release the store in order for failback to occur. This failback will be retried internally if necessary.
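
The symmetry rule for the two sites (identical names, but reciprocal SiteName and RecoverySiteName values) can be expressed as a simple consistency check. The following sketch assumes a hypothetical DomainSummary record that you populate yourself, for example from exported configuration; it is not a WebLogic type, and the sample domain, server, and resource names are illustrative.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class DomainSummary:
    """Hypothetical summary of one domain's configuration (not a WebLogic API type)."""
    domain_name: str
    server_names: FrozenSet[str]
    resource_names: FrozenSet[str]
    site_name: str
    recovery_site_name: str


def check_site_pair(a: DomainSummary, b: DomainSummary) -> List[str]:
    """Return a list of problems; an empty list means the pair looks consistent."""
    problems = []
    if a.domain_name != b.domain_name:
        problems.append("domain names differ")
    if a.server_names != b.server_names:
        problems.append("server names differ")
    if a.resource_names != b.resource_names:
        problems.append("resource names differ")
    if a.site_name == b.site_name:
        problems.append("SiteName must differ between sites")
    if a.recovery_site_name != b.site_name or b.recovery_site_name != a.site_name:
        problems.append("RecoverySiteName must name the other site's SiteName")
    return problems


# Illustrative values only.
site1 = DomainSummary("Domain1", frozenset({"ms1", "ms2"}), frozenset({"ds1"}), "site1", "site2")
site2 = DomainSummary("Domain1", frozenset({"ms1", "ms2"}), frozenset({"ds1"}), "site2", "site1")

if __name__ == "__main__":
    print(check_site_pair(site1, site2))
```

Returning a list of problems rather than a single boolean makes it easier to report every mismatch in one pass.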

Example Active-Active Domain Configuration for XA Transaction Recovery

This section provides an example of an active-active recovery solution.

Figure 6-1 Domain Configuration for Active-Active Recovery

In this example, application session replication, file system replication, and database replication are taking place between the two sites. An application running on Site 1 starts a transaction. After it calls commit, several failure scenarios can take place:

The application infrastructure tier (including the WebLogic domain) fails on Site 1, but the database tier has not crashed.

In this failure scenario, the servers in the WebLogic domain on Site 2 take over the TLogs in the database on Site 1 for the failed servers, and all transactions are recovered.
The application infrastructure tier (including the WebLogic domain) fails on Site 1, and the database tier crashes.

In this failure scenario, the servers in the WebLogic domain on Site 2 take over the TLogs in the database on Site 2 for the failed servers, and all transactions are recovered.
One server fails in the WebLogic domain on Site 1. If service or server migration is configured, then transactions will be recovered by other servers in the cluster. Due to latency between the two sites, it would take longer for any servers in the WebLogic domain on Site 2 to take over recovery for the failed server.
Transactions span two domains, Domain1 and Domain2, on Site 1. The servers in Domain1 are the transaction coordinators. Only Domain1 on Site 1 fails.

In this scenario, the servers in Domain1 on Site 2 will take over the TLogs in the database on Site 1. When the transaction is recovered, Domain2 on Site 1 will be unable to acknowledge the commit call to Domain1 on Site 1, and the transaction cannot be recovered. If the failback does not happen in a timely manner, shut down and restart all domains involved in the transaction for transaction recovery.

Configuring MBean Attributes for Cross-Site XA Transaction Recovery

Table 6-1 describes the DomainMBean attributes that you need to configure for cross-site XA transaction recovery.

For more information about these MBeans, see MBean Reference for Oracle WebLogic Server.

Table 6-1 DomainMBean Attributes for Cross-Site XA Transaction Recovery

Attribute	Value
SiteName	The name given to the site that the domain belongs to. This attribute is used in association with the RecoverySiteName attribute of JTAMBean. See Table 6-2. For example, if two sites are configured for transaction recovery, one is configured with the SiteName "site1" and the RecoverySiteName "site2". The second site is configured with the SiteName "site2" and the RecoverySiteName "site1". In this way, the sites are configured to recover each other's transactions and thereby provide active-active availability.

Table 6-2 describes the JTAMBean and JTAClusterMBean attributes that you need to configure for cross-site XA transaction recovery.

Table 6-2 JTAMBean Attributes for Cross-Site XA Transaction Recovery

Attribute	Value
RecoverySiteName	The name of the site that a domain will recover for. This is the value that the site being recovered has configured as its SiteName. If RecoverySiteName is null, the cross-domain XA transaction feature is disabled. This MBean value can be changed dynamically. See Table 6-1.
CrossSiteRecoveryRetryInterval	The interval at which the site leasing table is checked to verify that the lease has not expired. The default is 60 seconds.
CrossSiteRecoveryLeaseExpiration	If the lease has not been updated within the `CrossSiteRecoveryLeaseExpiration` interval, the recovery server takes over the lease and starts recovery for the failed server in the remote site. The default is 30 seconds; adjust the value according to the latency between sites.
CrossSiteRecoveryLeaseUpdate	The interval, in seconds, at which a lease timestamp is updated. The default is 10 seconds.
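
The interaction between the three timing attributes can be illustrated with a small model. The following is a sketch under stated assumptions, not the WebLogic implementation: it assumes that a lease counts as expired once its last update is older than the expiration interval, and that the recovering site polls the leasing table once per retry interval. The class and function names are hypothetical; the default values are those listed in Table 6-2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeaseSettings:
    """Defaults taken from Table 6-2; all values are in seconds."""
    retry_interval_s: int = 60    # CrossSiteRecoveryRetryInterval
    lease_expiration_s: int = 30  # CrossSiteRecoveryLeaseExpiration
    lease_update_s: int = 10      # CrossSiteRecoveryLeaseUpdate


def lease_expired(last_update_s: float, now_s: float, settings: LeaseSettings) -> bool:
    """Assumed rule: the lease is expired when its age exceeds the expiration interval."""
    return (now_s - last_update_s) > settings.lease_expiration_s


def worst_case_takeover_delay_s(settings: LeaseSettings) -> int:
    """Upper bound on the time from the last lease update to takeover in this model.

    A poll that lands just before the lease expires sees a valid lease, so
    takeover waits for the next poll, one full retry interval later.
    """
    return settings.lease_expiration_s + settings.retry_interval_s


def settings_are_sane(settings: LeaseSettings) -> bool:
    """The owner must refresh the lease more often than it expires."""
    return settings.lease_update_s < settings.lease_expiration_s


if __name__ == "__main__":
    defaults = LeaseSettings()
    print(lease_expired(last_update_s=0, now_s=31, settings=defaults))
    print(worst_case_takeover_delay_s(defaults))
```

In this model, raising the lease expiration to tolerate higher latency between sites also lengthens the delay before a remote site begins recovery.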

Active-Passive XA Transaction Recovery

This section describes the domain configuration and requirements for XA transactions in an active-passive recovery solution.

Requirements for XA Transactions in an Active-Passive Disaster Recovery Solution

This section provides the conditions and configuration requirements necessary to enable the successful recovery of XA transactions following the failure of a production site in an active-passive disaster recovery solution:

All active-passive domain pairs are configured with symmetric topology; they are identical and have the same domain configurations.
With WebLogic Continuous Availability, failover can be orchestrated with Oracle Site Guard. Oracle Enterprise Manager Cloud Control is one option that provides the ability to manage disaster recovery of WebLogic Server domains across multiple data centers.
The workload on the active and passive sites is maintained at less than full capacity during runtime, so that capacity remains consistent during both runtime and recovery.
Use only hostnames (not static IP addresses) to specify the listen address of managed servers. Configuration of these hostnames must be identical on all sites. Because hostnames are identical between sites but IP addresses are not, the hostname provides the dynamic ability to simply start an identically configured server or domain on the recovery site.
- Before initiating the Transaction Recovery service by starting servers in the passive domain, update the DNS server to point the DNS names to the machine(s) in the passive data center.
  
  For example: Active domain Domain1 has two managed servers running on two machines, Mc1 and Mc2. In the domain configuration, use the corresponding DNS names dns-1 and dns-2. When the active domain fails and you want to activate the corresponding passive domain, update the DNS server and change the configuration to point dns-1 and dns-2 to Mc3 and Mc4 respectively. Then start passive Domain2.
- Do not use DNS names that include an underscore; they are not valid in WebLogic Server domains. DNS names with a dash are valid.
You have several options to store the TLog: a default store, JDBC TLog, LLR, and a determiner resource. A default store must be in a common area (usually NFS or SAN). A JDBC TLog uses a database as a common storage location to all WebLogic Servers and is typically replicated using DataGuard or Active DataGuard to ensure high availability. When possible, eliminate XA transaction TLog writes by using determiner resources; see XA Transactions without Transaction TLog Write.
Transaction service migration within a cluster is only supported if the entire cluster, including the corresponding domains and servers, is failed over to a recovery site. Specifically, the administrator must ensure that the node manager and the entire cluster, corresponding domains, and any impacted servers have been shut down on the failed site before starting them all on the recovery site.
Transactions that span WebLogic domains can only be recovered in a site failover if all domains involved in the transaction are failed over.
The domain information is kept in a shared location(s) to ensure domain configurations are in sync. Applications are kept in a shared location(s) to ensure they are in sync.

Note:
Pack/Unpack could be another approach to keep configurations in sync.
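
The listen-address rules above (hostnames only, no underscores) lend themselves to an automated check. The following minimal sketch uses the Python standard library ipaddress module to detect IP literals. The function name and the sample addresses are hypothetical, and the check is illustrative rather than a complete hostname validator.

```python
import ipaddress
from typing import List


def listen_address_problems(address: str) -> List[str]:
    """Return the reasons why an address breaks the listen-address rules, if any."""
    problems = []
    try:
        # ip_address() raises ValueError for anything that is not an IP literal.
        ipaddress.ip_address(address)
        problems.append("is an IP address; use a hostname instead")
    except ValueError:
        pass
    if "_" in address:
        problems.append("contains an underscore")
    return problems


if __name__ == "__main__":
    # Sample values for illustration; dns-1 and dns-2 come from the example above.
    for addr in ("dns-1", "dns-2", "dns_3", "10.0.0.5"):
        print(addr, listen_address_problems(addr))
```

Running such a check on both the active and passive domain configurations helps confirm that a DNS update alone is enough to redirect traffic during failover.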

See Setting Up and Managing Disaster Recovery Sites in Oracle® Fusion Middleware Disaster Recovery Guide for detailed information on conditions and requirements for setting up active and passive recovery sites for enterprise deployments.

Example Active-Passive Domain Configuration for XA Transaction Recovery

All active-passive domain pairs are configured with symmetric topology; they are identical and have the same domain configurations. The failover process in an Active-Passive architecture is either manual or controlled by an external tool.

Figure 6-2 Domain Configuration for Active-Passive Recovery

An application running on Site1 starts a transaction. After it calls commit, the entire application infrastructure tier comes down. In this example, the application session replication, file system replication, and DB replication are taking place between the two sites.

If the entire WebLogic Server tier has not come down, then all servers need to be shut down and all the identical servers in the passive domain need to be started. Because the domains, clusters, servers, and resources all have identical names, recovery commences as soon as the servers are started, and all transactions recover. A graceful shutdown of each server is recommended to allow work to drain.

Active-Active Stretch Cluster XA Transaction Recovery

In this architecture, a WebLogic Server cluster stretches across two sites. When a server of the cluster fails, another server of the cluster can take over transaction recovery with service or server migration. This kind of architecture requires low latency between the sites. For more information, see Multicast and Cluster Configuration in Administering Clusters for Oracle WebLogic Server.

Server and resource checkpoints are still written to the TLog. Checkpoints are written when the resources are first involved in a global transaction, updated only if there are changes to the transaction participants, and purged only if they are no longer used or become unavailable. Because checkpoints are created early in the transaction, as long as there are no changes to the participants in the global transactions, there is little danger of a checkpoint being out of sync during transaction service migration or transaction recovery spanning multiple sites or data centers.

This section describes the domain configuration and requirements for XA transactions in an active-active stretch cluster recovery solution.

Requirements for XA Transactions in an Active-Active Stretch Cluster Disaster Recovery Solution

This section describes the conditions and configuration requirements necessary to enable the successful recovery of XA transactions following the failure of a production site in an active-active stretch cluster disaster recovery solution.

Because this architecture uses service or server migration for XA transaction recovery, the requirements are the same as the conditions necessary for server migration. See Server Migration.

In addition, the network must meet the following requirements:

Full support of IP multicast packet propagation. In other words, all routers and other tunneling technologies must be configured to propagate multicast messages to clustered server instances.
Network latency low enough to ensure that most multicast messages reach their final destination in approximately 10 milliseconds.
Multicast Time-To-Live (TTL) value for the cluster high enough to ensure that routers do not discard multicast packets before they reach their final destination. For instructions on setting the Multicast TTL parameter, see Configure Multicast Time-To-Live (TTL).

For more information, see If Your Cluster Spans Multiple Subnets In a WAN in Administering Clusters for Oracle WebLogic Server.

Example Active-Active Stretch Cluster for XA Transaction Recovery

In an active-active stretch cluster recovery solution, the JTA service or server migration is used for transaction recovery. This architecture is recommended when there is low latency between sites.

Figure 6-3 Domain Configuration for Active-Active Stretch Cluster Recovery

In this example, an application running on Site 1 starts a transaction. After it calls commit, one or more servers on Site 1 fail. If service or server migration is configured, then the surviving servers on Site 2 (in the stretch cluster) will take over recovery for the failed servers.

Additional Information on Maximum Availability Architecture

Oracle provides a number of resources with additional information on how to configure environments for maximum availability.