1 Introduction to WebLogic Continuous Availability

Topics include:

What Is Continuous Availability?

Continuous availability is the ability of a system to provide maximum availability by employing both high availability and disaster recovery solutions to ensure that applications are available when they are needed. Typically, a high availability solution provides redundancy in one data center. Disaster recovery solutions provide the ability to safeguard against natural or unplanned outages at a production site by having a recovery strategy for applications and data to a geographically separate standby site.

Oracle WebLogic Server Continuous Availability provides an integrated solution for building maximum availability architectures (MAA) that span data centers in distributed geographical locations. Integrated components include Oracle WebLogic Server, Oracle Coherence, Oracle Traffic Director, and Oracle Site Guard. The major benefits of this integrated solution are faster failover or switchover, increased overall application availability, data integrity, reduced human error and risk, recovery of work, and local access of real-time data.

Continuous Availability Terminology

The following list describes the common terminology that applies to continuous availability:

Active-active: An active-active solution deploys two or more active servers to improve scalability and provide high availability. In active-active deployments, all instances handle requests concurrently. When an entire domain or site fails, transactions can be recovered by an active server in a different domain either collocated in the same site or on a different site.
Active-passive: An active-passive solution involves setting up and pairing a standby site at a geographically different location with an active (production) site. The standby site may have equal or fewer services and resources compared to the production site. Application data, metadata, configuration data, and security data are replicated periodically to the standby site. The standby site is normally in a passive mode; it is started when the production site is not available. This model is usually adopted when the two sites are connected over a WAN, and network latency does not allow clustering across the two sites.
WebLogic Server cluster: A Weblogic Server cluster is a collection of WebLogic Server server instances running simultaneously and working together to provide increased scalability and reliability. In a cluster, most resources and services are deployed identically to each Managed Server, enabling failover and load balancing.
Coherence cluster: A Coherence cluster is a collection of Java Virtual Machines (JVM) processes, called Coherence servers, that run Coherence. A Coherence cluster consists of multiple Coherence server instances that distribute data in-memory to increase application scalability, availability, and performance. Application data is automatically and transparently distributed and backed up across cluster members.
Stretch cluster: A stretch cluster is a cluster in which nodes can span data centers within a proximate geographical range, usually with guaranteed, relatively low latency networking between the sites. Stretch clusters are also referred to as extended clusters.
High availability: High availability is the ability of a system or device to be available when it is needed. A high availability architecture ensures that users can access a system without loss of service. Deploying a high availability system minimizes the time when the system is down, or unavailable, and maximizes the time when it is running, or available.
Disaster recovery: Disaster recovery is the ability to safeguard against natural or unplanned outages at a production site by having a recovery strategy for applications and data to a geographically separate standby site.
Switchover: Switchover is the process of reversing the roles of the production site and the standby site. Switchovers are planned operations done for periodic validation or to perform planned maintenance on the current production site. During a switchover, the current standby site becomes the new production site, and the current production site becomes the new standby site.
Failover: Failover is the process of making the current standby site the new production site after the production site becomes unexpectedly unavailable (for example, due to a disaster at the production site).
Latency: Latency is the time that it takes for packets to travel from one cluster to another, and can be a factor of many things, including the length of the path between the sites and any layers in between. Typically latency is determined by using utilities such as traceroute or ping to send test packets from one site to another. The latency or round-trip time (RTT) has a direct effect on the response time that any one user experiences when accessing the system. The effects of high latency can be seen even with only one user on the system.
Metropolitan area network (MAN): A MAN is a telecommunications or computer network that spans an entire city or campus. The MAN standard for data communication specified in the IEEE 802.6 standard is called distributed-queue dual-bus (DQDB). With DQDB, networks can extend up to 20 miles (30 km) long and operate at speeds of 34–155 Mbit/s. A stretch cluster topology is appropriate in a MAN.
Wide Area Network (WAN): A WAN is a telecommunications or computer network that extends over large geographical distances and between different LANs, MANs and other localized computer networking architectures. Wide area networks are often established with leased telecommunication circuits. Distance and latency of a WAN need to taken into consideration when determining the type of topology you can configure.

Continuous Availability Key Features

Continuous Availability provides maximum availability, reliability, and application stability during planned upgrades or unexpected failures. It builds on the existing high availability features in Oracle WebLogic Server, Oracle Coherence, and Oracle Fusion Middleware, and supports the key features described in the following sections.

Automated Cross-Site XA Transaction Recovery

Automated cross-site XA transaction recovery provides automatic recovery of XA transactions across an entire domain, or across an entire site with servers running in a different domain or at a different site. Cross-site transaction recovery uses the leasing framework to automate cross-site recovery. The leasing design follows the existing model for database leasing of transaction recovery service (TRS) migration within a cluster.

In active/active architectures, transactions can be recovered when an entire domain or site fails by having an active server running in a different domain either collocated at the same site or at a different site. In active/passive architectures, the server at the passive (standby) site at a different location can be started when the production site is no longer available.

For more information, see "Transaction Recovery Spanning Multiple Sites or Data Centers" in Developing JTA Applications for Oracle WebLogic Server. For design considerations when using cross-site XA transaction recovery in continuous availability architectures, see Cross-Site XA Transaction Recovery.

Automated cross-site XA transaction recovery also takes advantage of the WebLogic Server high availability features described in WebLogic Server High Availability Features.

WebLogic Server Zero Downtime Patching

WebLogic Server Zero Downtime Patching (ZDT Patching) provides an automated mechanism to orchestrate the rollout of patches while avoiding downtime or loss of sessions. It reduces risks and downtime of mission-critical applications that require availability and predictability while applying patches.

Using workflows that you define, you can patch or update any number of nodes in a domain with little or no manual intervention. Changes are rolled out to one node at a time, allowing a load balancer such as Oracle Traffic Director to redirect incoming traffic to the remaining nodes until the node has been updated.

You can use ZDT Patching to update Coherence applications while maintaining high availability of the Coherence data during the rollout process.

For more information, see Administering Zero Downtime Patching Workflows.

WebLogic Server Multitenant Live Resource Group Migration

In WebLogic Server Multitenant environments, you can migrate partition resource groups that are running from one cluster/server to another within a domain without impacting the application users. A key benefit of migrating the resource groups is that it eliminates application downtime for planned events.

Resource groups are a collection of (typically) related deployable resources, such as Java EE applications and the data sources, Java Message Service (JMS) artifacts, and other resources that the applications use. When you migrate a resource group, you change the virtual target used by the resource group from one physical target (cluster/server) to another. After migration, the virtual target points to the new physical target (cluster/server).

For more information about resource groups and migration, see the following topics in Using WebLogic Server Multitenant:

Coherence Federated Caching

The Oracle Coherence federated caching feature replicates cache data asynchronously across multiple geographically distributed clusters. Cached data is replicated across clusters to provide redundancy, off-site backup, and multiple points of access for application users in different geographical locations.

Federated caching supports multiple replication topologies. These include:

Active-passive: Replicates data from an active cluster to a passive cluster. The passive site supports read-only operations and off-site backup.
Active-active: Replicates data between active clusters. Data that is put into one active cluster is replicated at the other active clusters. Applications at different sites have access to a local cluster instance.
Hub and spoke: Replicates data from a single hub cluster to multiple spoke clusters. The hub cluster can only send data and the spoke clusters can only receive data. This topology requires multiple geographically dispersed copies of a cluster. Each spoke cluster can be used by local applications to perform read-only operations.

For more information, see “Federating Caches Across Clusters” in Administering Oracle Coherence.

Coherence GoldenGate HotCache

The Oracle Coherence GoldenGate HotCache feature detects and reflects database changes in cache in real time. Third-party updates to the database can cause Coherence applications to work with data that can be stale and out-of-date. Coherence GoldenGate HotCache solves this problem by monitoring the database and pushing any changes into the Coherence cache in real time. It employs an efficient push model that processes only stale data. Low latency is assured because the data is pushed when the change occurs in the database.

In Maximum Availability Architectures, when the database is replicated to a secondary site during failover, the database changes are reflected to the cache using GoldenGate HotCache.

For more information, see “Integrating with Oracle Coherence GoldenGate HotCache” in Integrating Oracle Coherence.

Oracle Traffic Director

Oracle Traffic Director is a fast, reliable, and scalable software load balancer that routes HTTP, HTTPS, and TCP traffic to application servers and web servers on the network. It distributes the requests that it receives from clients to available servers based on the specified load-balancing method, routes the requests based on specified rules, caches frequently accessed data, prioritizes traffic, and controls the quality of service. Oracle Traffic Director is a layer-7 software load balancer and is not meant to replace a global load balancer.

The architecture of Oracle Traffic Director enables it to handle large volumes of application traffic with low latency. For high availability, you can set up pairs of Oracle Traffic Director instances for either active-passive or active-active failover. As the volume of traffic to your network grows, you can easily scale the environment by reconfiguring Oracle Traffic Director with additional back-end servers to which it can route requests.

There is a tight integration between Oracle Traffic Director and Continuous Availability features such as Zero Downtime Patching and Live Partition Migration to provide zero downtime to applications, either during a rolling upgrade process or during partition migration. This integration allows for applications to be highly available without requiring any changes.

For design considerations when using Oracle Traffic Director in continuous availability architectures, see Oracle Traffic Director. For additional information, see Administering Oracle Traffic Director.

Oracle Site Guard

Oracle Site Guard, a component of Oracle Enterprise Manager Cloud Control, is a disaster recovery solution that enables administrators to automate complete site switchover or failover, thereby minimizing downtime for enterprise deployments. Because Oracle Site Guard operates at the site level, it eliminates the need to tediously perform manual disaster recovery for individual site components like applications, middleware, databases, and so on. The traffic of an entire production site can be redirected to a standby site in a single operation.

Administrators do not require any special skills or domain expertise in areas like databases, applications, and storage replication. Oracle Site Guard can continuously monitor disaster recovery readiness, and it can do this without disrupting the production site.

You can manage an Oracle Site Guard configuration by using either the Enterprise Manager Command-Line Interface (EMCLI) or a compatible version of Oracle Enterprise Manager Cloud Control (Cloud Control).

For more information, see Site Guard Administrator's Guide.

Value-Added WebLogic Server and Coherence High Availability Features

In addition to the features described in Continuous Availability Key Features, Oracle Continuous Availability also takes advantage of the high availability features provided with WebLogic Server and Coherence, as described in the following sections.

WebLogic Server High Availability Features

The following WebLogic Server features can be used with the Oracle Continuous Availability features to provide the highest level of availability:

Clustering: A WebLogic Server cluster consists of multiple WebLogic Server server instances running simultaneously and working together in a domain to provide increased scalability and reliability. For more information, see Clustering.
Singleton services: Services such as server and service migration, persistent data stores, and leasing make singleton services such as JMS and JTA highly available in a WebLogic Server cluster. For more information, see Singleton Services.
Session Replication: Session replication is a feature of WebLogic Server clusters that is used to replicate the data stored in a session across different server instances in the cluster. For more information, see Session Replication.
Transaction and data source features:
- Active GridLink data sources that use Fast Connection Failover to provide rapid failure detection of Oracle Real Application Clusters (Oracle RAC) nodes, and failover to remaining nodes for continuous connectivity. For design considerations when using Active Gridlink in continuous availability architectures, see Data Sources. For additional information, see “Using Active GridLink Data Sources” in Administering JDBC Data Sources for Oracle WebLogic Server.
- Transaction logs in the database (JDBC TLogs) that store information about committed transactions coordinated by the server that may not have been completed. WebLogic Server uses the TLogs when recovering from system crashes or network failures. For more information, see “Using Transaction Log Files to Recover Transactions” in Developing JTA Applications for Oracle WebLogic Server.
- No transaction TLog writes (No TLog) where you eliminate writes of the transaction checkpoints to the TLog store. For more information, see “XA Transactions without Transaction TLog Write” in Developing JTA Applications for Oracle WebLogic Server.
- Logging Last Resource (LLR) transaction optimization which is a performance enhancement option that enables one non-XA resource to participate in a global transaction with the same ACID (atomicity, consistency, isolation, durability) guarantee as XA. For more information, see “Logging Last Resource Transaction Optimization” in Developing JTA Applications for Oracle WebLogic Server.
  
  These features work with Oracle Data Guard which replicates databases to make transaction logs needed for recovery to be highly available. For more information about Oracle Data Guard, see Data Guard Concepts and Administration.

For more information about high availability in Oracle Fusion Middleware, see High Availability Guide.

Coherence Persistence and Clusters

Coherence persistence is a set of tools and technologies that manage the persistence and recovery of Coherence distributed caches. Cached data is persisted so that it can be quickly recovered after a catastrophic failure or after a cluster restart due to planned maintenance. Persistence and federated caching can be used together as required. For more information about Coherence persistence, see “Persisting Caches” in Administering Oracle Coherence.

When an application asks for an entry to the Coherence cache, if the entry does not exist in the cache and does exist in the database, then Coherence updates the cache with the database value. This is called Read-Through caching. For more information, see Read-Through Caching in Developing Applications with Oracle Coherence.

Coherence clusters consist of multiple Coherence server instances that distribute data in-memory to increase application scalability, availability, and performance. Application data is automatically and transparently distributed and backed up across cluster members. For more information about Coherence clusters, see “Configuring and Managing Coherence Clusters” in Administering Clusters for Oracle WebLogic Server.

Disaster Recovery for Oracle Database

Oracle WebLogic Server 12c provides strong support for integrating with the High Availability (HA) features of the Oracle database. Integrating with these HA features minimizes database access time while allowing transparent access to rich pooling management functions that maximizes both connection performance and application availability.

Oracle Continuous Availability takes advantage of the HA database features described in this section. The integration of all these products contributes to managing and orchestrating the failover and switchover of the Oracle Database, and makes the failover of the database fast and automatic.

Oracle Data Guard ensures high availability, data protection, and disaster recovery for enterprise data. It provides a comprehensive set of services that create, maintain, manage, and monitor one or more standby databases to enable production Oracle databases to survive disasters and data corruptions. Oracle Data Guard maintains these standby databases as transactionally consistent copies of the primary database. If the primary database becomes unavailable because of a planned or an unplanned outage, then Oracle Data Guard enables you to switch any standby database to the production role, thus minimizing the downtime associated with the outage. See Data Guard Concepts and Administration.
Oracle Active Data Guard is a comprehensive solution to eliminate single points of failure for mission critical Oracle Databases. It prevents data loss and downtime by maintaining a synchronized physical replica (standby) of a production database (primary). If there is an outage, client connections quickly failover to the standby and resume service. Active Data Guard achieves the highest level of data protection through deep integration with Oracle Database, strong fault isolation, and unique Oracle-aware data validation. System and software defects, data corruption, and administrator error that affect a primary are not mirrored to the standby. Idle redundancy is eliminated by directing read-only workloads and backups to active standby databases for high return on investment. See Data Guard Concepts and Administration.
Oracle Data Guard broker logically groups these primary and standby databases into a broker configuration that enables the broker to manage and monitor them together as an integrated unit. It sends notifications to WebLogic Active GridLink which then makes new connections to the database in the failover site, and coordinates with Oracle Clusterware to fail over role-based services. Oracle Site Guard uses Oracle Data Guard broker to perform the failover or switchover of the databases. See Data Guard Broker.
Oracle Real Application Clusters (Oracle RAC) is a clustered version of Oracle Database that allows running multiple database instances on different servers in the cluster against a shared set of data files, also known as the database. The database spans multiple hardware systems and yet appears as a single unified database to the application. See Real Application Clusters Administration and Deployment Guide.
Oracle Clusterware manages the availability of instances of an Oracle RAC database. It works to rapidly recover failed instances to keep the primary database available. If Oracle Clusterware cannot recover a failed instance, then the broker continues to run automatically with one fewer instance. If the last instance of the primary database fails, then the broker provides a way to fail over to a specified standby database. If the last instance of the primary database fails, and fast-start failover is enabled, then the broker can continue to provide high availability by automatically failing over to a pre-determined standby database. See Clusterware Administration and Deployment Guide.
Oracle GoldenGate is a high-performance software application that uses log-based bidirectional data replication for real-time capture, transformation, routing, and delivery of database transactions across heterogeneous systems. Oracle GoldenGate allows for databases to be in active-active mode. Applications that use Oracle GoldenGate must have tolerance for data loss due to the asynchronous nature of Oracle GoldenGate replication. See Administering Oracle GoldenGate for Windows and UNIX.
Oracle Database 12c Global Data Services (GDS) streamline the delivery of database services on a global scale, which is key to deploying databases in MAA environments.These technologies oversee replication and failover while performing load balancing within and across data centers, optimizing resource utilization and streamlining database management practices in a distributed database environment. GDS works by enabling a Global Service across Oracle Real Application Clusters (RAC) and single-instance Oracle databases interconnected via Oracle Data Guard, Oracle GoldenGate, or any other replication technology. Client access to this distributed infrastructure is completely transparent. GDS implementations are easy to apply to Oracle WebLogic Server with minimal changes. See Global Data Services Concepts and Administration Guide.
Application Continuity (AC) is available with the Oracle RAC, Oracle RAC One Node and Oracle Active Data Guard options that masks outages from end users and applications by recovering the in-flight database sessions following recoverable outages. Application Continuity enables replay, in a non-disruptive and rapid manner, of a database request when a recoverable error makes the database session unavailable. The request can contain transactional and nontransactional calls to the database and calls that are executed locally at the client or middle tier. After a successful replay, the application can continue where that database session left off. See Ensuring Application Continuity in Oracle Database Development Guide.

WebLogic Server Active GridLink integrates with the Oracle Database 12c features like Application Continuity and Global Data Services to provide the highest possible availability. Application Continuity will replay transactions when encountered with unplanned database outages. End-user applications will not receive errors or even know that there have been outages. Active GridLink, Application Continuity, and Data Guard provide protection for planned and unplanned database outages in highly available environments.

These technologies oversee replication and failover while performing load balancing within and across data centers, optimizing resource utilization and streamlining database management practices in a distributed database environment.