1 Introduction to WebLogic Server High Availability and Disaster Recovery

Oracle WebLogic Server, together with Oracle Coherence and Database, includes features that you can use for building maximum availability architectures (MAA) that span data centers in distributed geographical locations.

About High Availability and Disaster Recovery in WebLogic Server

Using both high availability and disaster recovery solutions ensures that applications are available when they are needed.

Typically, a high availability solution provides redundancy in one data center. However, production deployments in a data center also need protection from unforeseen disasters and natural calamities. Disaster recovery solutions provide a recovery strategy for applications and data by setting up a standby site at a geographically different location than the production site. Application data, metadata, configuration data, and security data are replicated periodically to the standby site.

Oracle WebLogic Server and Coherence include an extensive set of high availability features, such as server clustering, server migration, cluster integration, Active GridLink, load balancing, failover, backup and recovery, rolling upgrades, and rolling configuration changes, which protect a deployment from unplanned downtime and help to minimize planned downtime. Disaster protection for Oracle databases that are included in your configuration is provided through Oracle Data Guard and Oracle Real Application Clusters (Oracle RAC).

Using the high availability and disaster recovery features of Oracle WebLogic Server, Oracle Coherence, and Oracle Database, you can design and build maximum availability architectures (MAA) that span data centers in distributed geographical locations. Oracle Maximum Availability Architecture (MAA) is Oracle's comprehensive architecture to reduce downtime for scheduled outages, and to prevent, detect and recover from unscheduled outages. The major benefits of these integrated solutions are faster failover or switchover, increased overall application availability, data integrity, reduced human error and risk, recovery of work, and local access of real-time data. See Best Practices Blueprints for High Availability at https://www.oracle.com/database/technologies/high-availability/maa.html.

Terminology

This section defines common terms that apply to Oracle WebLogic Server and Coherence high availability and disaster recovery.

  • Active-active: An active-active solution deploys two or more active servers to improve scalability and provide high availability. In active-active deployments, all instances handle requests concurrently. When an entire domain or site fails, transactions can be recovered by an active server in a different domain either collocated in the same site or on a different site.

  • Active-passive: An active-passive solution involves setting up and pairing a standby site at a geographically different location with an active (production) site. The standby site may have equal or fewer services and resources compared to the production site, although Oracle recommends configuring symmetrical topology and capacity at both production and standby sites. Having a different number of nodes or different capacity can cause inconsistencies at the functional and performance levels. Application data, metadata, configuration data, and security data are replicated periodically to the standby site. The standby site is normally in a passive mode; it is started when the production site is not available. This model is usually adopted when the two sites are connected over a WAN, and network latency does not allow clustering across the two sites.

  • Domain pair: A domain pair consists of an active and a passive domain. In an active-active application infrastructure tier with WebLogic domain pairs, the infrastructure tier spans two sites and each site contains a primary active domain and a secondary passive domain. The primary domains at each site are independent domains and do not have to be configured with a symmetric topology; however, the domain pair must be symmetrical. For example, if Domain A is the primary (active) domain on Site 1 and Domain B is the primary (active) domain on Site 2, then there must be a Domain B as the secondary (passive) domain on Site 1, and there must be a Domain A as the secondary (passive) domain on Site 2. That is, the pair of domains at each site must be symmetrical, even though the domains themselves can be unique.

  • WebLogic Server cluster: A WebLogic Server cluster is a collection of WebLogic Server instances running simultaneously and working together to provide increased scalability and reliability. In a cluster, most resources and services are deployed identically to each Managed Server, enabling failover and load balancing.

  • Coherence cluster: A Coherence cluster is a collection of Java Virtual Machine (JVM) processes, called Coherence servers, that run Coherence. A Coherence cluster consists of multiple Coherence server instances that distribute data in-memory to increase application scalability, availability, and performance. Application data is automatically and transparently distributed and backed up across cluster members.

  • Stretch cluster: A stretch cluster is a cluster in which nodes can span data centers within a proximate geographical range, usually with guaranteed, relatively low latency networking between the sites. Stretch clusters are also referred to as extended clusters.

  • High availability: High availability is the ability of a system or device to be available when it is needed. A high availability architecture ensures that users can access a system without loss of service. Deploying a high availability system minimizes the time when the system is down, or unavailable, and maximizes the time when it is running, or available.

  • Disaster recovery: Disaster recovery is the ability to safeguard against natural or unplanned outages at a production site by having a recovery strategy for applications and data at a geographically separate standby site.

  • Switchover: Switchover is the process of reversing the roles of the production site and the standby site. Switchovers are planned operations done for periodic validation or to perform planned maintenance on the current production site. During a switchover, the current standby site becomes the new production site, and the current production site becomes the new standby site.

  • Failover: Failover is the process of making the current standby site the new production site after the production site becomes unexpectedly unavailable (for example, due to a disaster at the production site).

  • Latency: Latency is the time that it takes for packets to travel from one cluster to another. It depends on many factors, including the length of the path between the sites and any layers in between. Typically, latency is determined by using utilities such as traceroute or ping to send test packets from one site to another. The latency or round-trip time (RTT) has a direct effect on the response time that any one user experiences when accessing the system. The effects of high latency can be seen even with only one user on the system.

  • Metropolitan area network (MAN): A MAN is a telecommunications or computer network that spans an entire city or campus. The MAN standard for data communication specified in the IEEE 802.6 standard is called distributed-queue dual-bus (DQDB). With DQDB, networks can extend up to 20 miles (30 km) long and operate at speeds of 34–155 Mbit/s. A stretch cluster topology is appropriate in a MAN.

  • Wide Area Network (WAN): A WAN is a telecommunications or computer network that extends over large geographical distances and between different LANs, MANs, and other localized computer networking architectures. Wide area networks are often established with leased telecommunication circuits. The distance and latency of a WAN need to be taken into consideration when determining the type of topology you can configure.
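
The domain pair rule in the list above (every primary domain on one site needs a matching secondary domain on the other site) can be checked mechanically. The following is a minimal sketch in Python. The `Site` type and the `unpaired_domains` function are hypothetical names introduced for illustration only and are not part of any WebLogic Server API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """Hypothetical description of the domains hosted at one site."""
    name: str
    active: frozenset    # names of primary (active) domains at this site
    passive: frozenset   # names of secondary (passive) domains at this site


def unpaired_domains(site_1: Site, site_2: Site) -> set:
    """Return active domains that have no passive counterpart on the other site."""
    missing_on_site_2 = set(site_1.active) - set(site_2.passive)
    missing_on_site_1 = set(site_2.active) - set(site_1.passive)
    return missing_on_site_1 | missing_on_site_2


# Domain A is active on Site 1 and passive on Site 2; Domain B is the reverse.
site_1 = Site("Site 1", active=frozenset({"A"}), passive=frozenset({"B"}))
site_2 = Site("Site 2", active=frozenset({"B"}), passive=frozenset({"A"}))
print(unpaired_domains(site_1, site_2))  # prints set()
```

An empty result means that the domain pairs are symmetrical in the sense described above, even though Domain A and Domain B themselves may have different topologies.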

WebLogic Server High Availability Components and Features

Oracle WebLogic Server provides components and features that work in conjunction with Oracle Coherence and Oracle Database high availability features to provide maximum availability, reliability, and application stability during planned upgrades or unexpected failures.

These WebLogic Server components and features are described in the following sections.

WebLogic Server Zero Downtime Patching

WebLogic Server Zero Downtime Patching (ZDT Patching) provides an automated mechanism to orchestrate the rollout of patches while avoiding downtime or loss of sessions. It reduces risks and downtime of mission-critical applications that require availability and predictability while applying patches.

Using workflows that you define, you can patch or update any number of nodes in a domain with little or no manual intervention. Changes are rolled out to one node at a time, allowing a load balancer to redirect incoming traffic to the remaining nodes until the node has been updated.
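
The one-node-at-a-time rollout described above can be sketched as follows. This is a conceptual illustration in Python, not the ZDT Patching API: the `rolling_update` function, the node names, and the `apply_patch` callback are all hypothetical.

```python
def rolling_update(nodes, apply_patch):
    """Patch nodes one at a time and record which nodes stay in service at each step."""
    in_service = set(nodes)
    history = []
    for node in nodes:
        in_service.discard(node)             # the load balancer stops routing to this node
        history.append(sorted(in_service))   # the remaining nodes keep serving traffic
        apply_patch(node)                    # the node is updated while out of rotation
        in_service.add(node)                 # the updated node rejoins the pool
    return history


patched = []
steps = rolling_update(["node1", "node2", "node3"], patched.append)
for serving in steps:
    print(serving)   # two of the three nodes are in service at every step
```

Because only one node leaves the pool at a time, the capacity available to clients never drops by more than one node during the rollout.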

The ZDT custom hooks feature identifies certain points, referred to as extension points, in a patching workflow where additional commands can be executed to modify the rollout. You can specify an extension to run at one or more predefined extension points in the workflow. The extension is executed either on the Administration Server node or on a remote node. See Modifying Workflows Using Custom Hooks in Administering Zero Downtime Patching Workflows.

You can use ZDT Patching to update Coherence applications while maintaining high availability of the Coherence data during the rollout process.

For an overview of the features in ZDT Patching, see Introduction to Zero Downtime Patching in Administering Zero Downtime Patching Workflows.

Clustering

WebLogic Server clusters provide scalability and reliability for your applications by distributing the work load among multiple instances of WebLogic Server.

For scalability, the capacity of an application deployed on a WebLogic Server cluster can be increased dynamically to meet demand. You can add server instances to a cluster without interruption of service—the application continues to run without impact to clients and end users.

In a WebLogic Server cluster, application processing can continue when a server instance fails. You cluster application components by deploying them on multiple server instances in the cluster—so, if a server instance on which a component is running fails, then another server instance on which that component is deployed can continue application processing. See Understanding WebLogic Server Clustering in Administering Clusters for Oracle WebLogic Server.

Singleton Services

Singleton services are services that must run on only a single Managed Server instance of a cluster at any given time, for example JMS and the JTA transaction recovery system. WebLogic Server allows you to automatically monitor and migrate singleton services from one server instance to another.

WebLogic Server features such as server and service migration, persistent data stores, and leasing make singleton services such as JMS and JTA highly available in a WebLogic Server cluster. See Singleton Services.
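
The role that leasing plays in keeping a singleton service on exactly one server can be illustrated with a toy lease table. This is a conceptual sketch only and does not reflect how WebLogic Server implements leasing. The `LeaseTable` class, the service name, and the server names are hypothetical.

```python
class LeaseTable:
    """Toy lease table: one owner per service, valid until the lease expires."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self._leases = {}   # service name -> (owner, expiry time)

    def try_acquire(self, service, server, now):
        """Grant or renew the lease if it is free, expired, or already held by server."""
        owner, expiry = self._leases.get(service, (None, 0.0))
        if owner in (None, server) or now >= expiry:
            self._leases[service] = (server, now + self.lease_seconds)
            return True
        return False

    def owner(self, service, now):
        """Return the current lease holder, or None if the lease has expired."""
        owner, expiry = self._leases.get(service, (None, 0.0))
        return owner if now < expiry else None


leases = LeaseTable(lease_seconds=10)
leases.try_acquire("jms-server", "ms1", now=0)    # ms1 hosts the singleton service
leases.try_acquire("jms-server", "ms2", now=5)    # refused: the ms1 lease is still valid
leases.try_acquire("jms-server", "ms2", now=12)   # ms1 stopped renewing, so ms2 takes over
print(leases.owner("jms-server", now=13))         # prints ms2
```

A server that fails stops renewing its lease, so another server can claim the service after the lease expires. At no time do two servers hold a valid lease for the same service.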

Session Replication

Session replication is a feature of WebLogic Server clusters that is used to replicate the data stored in a session across different server instances in the cluster.

WebLogic Server provides three methods for replicating HTTP session state across servers in a cluster: in-memory replication, JDBC-based persistence, and Coherence*Web. See Session Replication.

Transaction and Data Source Features

WebLogic Server features such as Active GridLink data sources, JDBC TLogs and No TLog, and Logging Last Resource help to provide high availability in WebLogic Server configurations.

  • Active GridLink data sources use Fast Connection Failover to provide rapid failure detection of Oracle Real Application Clusters (Oracle RAC) nodes, and failover to remaining nodes for continuous connectivity. For design considerations when using Active GridLink in high availability architectures, see Data Sources. See also Using Active GridLink Data Sources in Administering JDBC Data Sources for Oracle WebLogic Server.

  • Transaction logs in the database (JDBC TLogs) store information about committed transactions coordinated by the server that may not have been completed. WebLogic Server uses the TLogs when recovering from system crashes or network failures. See Using Transaction Log Files to Recover Transactions in Developing JTA Applications for Oracle WebLogic Server.

  • No transaction TLog writes (No TLog) eliminates writes of the transaction checkpoints to the TLog store. See XA Transactions without Transaction TLogs Write in Developing JTA Applications for Oracle WebLogic Server.

  • Logging Last Resource (LLR) transaction optimization is a performance enhancement option that enables one non-XA resource to participate in a global transaction with the same ACID (atomicity, consistency, isolation, durability) guarantee as XA. See Logging Last Resource Transaction Optimization in Developing JTA Applications for Oracle WebLogic Server.

    These features work with Oracle Data Guard, which replicates databases so that the transaction logs needed for recovery are highly available. See Introduction to Oracle Data Guard in Oracle Data Guard Concepts and Administration.
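
To show how one non-XA resource can take part in a global transaction, the following Python sketch walks through a commit sequence. It assumes the usual LLR ordering: prepare the XA participants, commit the LLR participant's local transaction (which also carries the commit record), and then commit the XA participants. The `RecordingResource` class and the `commit_with_llr` function are hypothetical stand-ins, not WebLogic Server or JTA APIs.

```python
class RecordingResource:
    """Hypothetical transaction participant that only records the calls it receives."""

    def __init__(self, name, log, vote=True):
        self.name, self.log, self.vote = name, log, vote

    def _record(self, action):
        self.log.append((self.name, action))

    def prepare(self):
        self._record("prepare")
        return self.vote

    def commit(self):
        self._record("commit")

    def rollback(self):
        self._record("rollback")


def commit_with_llr(xa_resources, llr_resource):
    """Commit a global transaction that has one non-XA (LLR) participant."""
    # Step 1: prepare every XA participant; a "no" vote rolls everything back.
    for resource in xa_resources:
        if not resource.prepare():
            for participant in [*xa_resources, llr_resource]:
                participant.rollback()
            return False
    # Step 2: the LLR participant's local commit acts as the commit decision.
    # In this sketch it is assumed to include the commit record as well.
    llr_resource.commit()
    # Step 3: the prepared XA participants are told to commit.
    for resource in xa_resources:
        resource.commit()
    return True


calls = []
ok = commit_with_llr([RecordingResource("jms", calls)], RecordingResource("db", calls))
print(ok, calls)
```

The sketch leaves out recovery. In a real system, the commit record stored with the LLR participant is what allows the prepared XA participants to be resolved after a crash.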

Coherence High Availability Components and Features

Oracle Coherence provides components and features that work in conjunction with Oracle WebLogic Server and Oracle Database high availability features to provide maximum availability, reliability, and application stability during planned upgrades or unexpected failures.

Coherence Persistence and Clusters

Coherence persistence is a set of tools and technologies that manage the persistence and recovery of Coherence distributed caches. Cached data is persisted so that it can be quickly recovered after a catastrophic failure or after a cluster restart due to planned maintenance. Persistence and federated caching can be used together as required. See Persisting Caches in Administering Oracle Coherence.

When an application requests an entry from the Coherence cache and the entry does not exist in the cache but does exist in the database, Coherence loads the value from the database into the cache and returns it to the application. This is called read-through caching. See Read-Through Caching in Developing Applications with Oracle Coherence.
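
The read-through behavior described above can be sketched in a few lines of Python. This is a conceptual illustration, not the Coherence API: the `ReadThroughCache` class and the dictionary that stands in for the database are hypothetical.

```python
class ReadThroughCache:
    """Toy cache that loads missing entries from a backing store on demand."""

    def __init__(self, load_from_database):
        self._load = load_from_database   # callable: key -> value, or None if absent
        self._entries = {}

    def get(self, key):
        if key in self._entries:
            return self._entries[key]     # cache hit: the database is not touched
        value = self._load(key)           # cache miss: read through to the database
        if value is not None:
            self._entries[key] = value    # keep the value for later requests
        return value


database = {"order-1": "shipped"}         # stands in for the database
loads = []


def load(key):
    loads.append(key)                     # track how often the database is read
    return database.get(key)


cache = ReadThroughCache(load)
cache.get("order-1")
cache.get("order-1")
print(loads)   # prints ['order-1']: only the first request reached the database
```

The application only ever talks to the cache. Whether a value came from memory or from the database is invisible to the caller.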

Coherence clusters consist of multiple Coherence server instances that distribute data in-memory to increase application scalability, availability, and performance. Application data is automatically and transparently distributed and backed up across cluster members. See Configuring and Managing Coherence Clusters in Administering Clusters for Oracle WebLogic Server.

Coherence Federated Caching

The Oracle Coherence federated caching feature replicates cache data asynchronously across multiple geographically distributed clusters. Cached data is replicated across clusters to provide redundancy, off-site backup, and multiple points of access for application users in different geographical locations.

Federated caching supports multiple replication topologies. These include:

  • Active-passive: Replicates data from an active cluster to a passive cluster. The passive site supports read-only operations and off-site backup.

  • Active-active: Replicates data between active clusters. Data that is put into one active cluster is replicated at the other active clusters. Applications at different sites have access to a local cluster instance.

  • Hub and spoke: Replicates data from a single hub cluster to multiple spoke clusters. The hub cluster can only send data and the spoke clusters can only receive data. This topology requires multiple geographically dispersed copies of a cluster. Each spoke cluster can be used by local applications to perform read-only operations.

See Federating Caches Across Clusters in Administering Oracle Coherence.
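
The three topologies above differ only in the direction in which cache updates flow between clusters. The following Python sketch lists those directions as (source, destination) pairs. The `replication_links` function, the topology labels, and the cluster names are hypothetical and are not Coherence configuration values.

```python
from itertools import permutations


def replication_links(topology, clusters):
    """Return the (source, destination) pairs along which cache updates flow."""
    if topology == "active-passive":
        active, passive = clusters               # exactly two clusters are expected
        return [(active, passive)]
    if topology == "active-active":
        return list(permutations(clusters, 2))   # every cluster sends to every other
    if topology == "hub-and-spoke":
        hub, *spokes = clusters                  # the first cluster is the hub
        return [(hub, spoke) for spoke in spokes]
    raise ValueError(f"unknown topology: {topology}")


print(replication_links("active-passive", ["SiteA", "SiteB"]))
print(replication_links("active-active", ["SiteA", "SiteB"]))
print(replication_links("hub-and-spoke", ["Hub", "Spoke1", "Spoke2"]))
```

A cluster that appears only as a destination, such as the passive cluster or a spoke, receives data but never sends it, which is why the text above describes those clusters as suitable for read-only operations.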

Coherence GoldenGate HotCache

The Oracle Coherence GoldenGate HotCache feature detects and reflects database changes in cache in real time. Third-party updates to the database can cause Coherence applications to work with stale data. Coherence GoldenGate HotCache solves this problem by monitoring the database and pushing any changes into the Coherence cache in real time. It employs an efficient push model that processes only stale data. Latency is kept low because the data is pushed when the change occurs in the database.

In maximum availability architectures, when the database is replicated to a secondary site during failover, the database changes are reflected in the cache using GoldenGate HotCache.

See Integrating with Oracle Coherence GoldenGate HotCache in Integrating Oracle Coherence.

Oracle Database High Availability and Disaster Recovery

Oracle WebLogic Server provides strong support for integrating with the high availability (HA) and disaster recovery features of Oracle Database. Integrating with these HA and disaster recovery features minimizes database access time while allowing transparent access to rich pooling management functions that maximize both connection performance and application availability.

Note:

For the most up-to-date details about the specific database versions that are supported with this release of WebLogic Server, see the Oracle Fusion Middleware Supported System Configurations page on Oracle Technology Network.

Oracle WebLogic Server and Coherence take advantage of the HA database features described in this section. The integration of all these products contributes to managing and orchestrating the failover and switchover of the Oracle Database, and makes the failover of the database fast and automatic.

  • Oracle Data Guard ensures high availability, data protection, and disaster recovery for enterprise data. It provides a comprehensive set of services that create, maintain, manage, and monitor one or more standby databases to enable production Oracle databases to survive disasters and data corruptions. Oracle Data Guard maintains these standby databases as transactionally consistent copies of the primary database. If the primary database becomes unavailable because of a planned or an unplanned outage, then Oracle Data Guard enables you to switch any standby database to the production role, thus minimizing the downtime associated with the outage. See Introduction to Oracle Data Guard in Data Guard Concepts and Administration.

  • Oracle Active Data Guard is a comprehensive solution to eliminate single points of failure for mission critical Oracle Databases. It prevents data loss and downtime by maintaining a synchronized physical replica (standby) of a production database (primary). If there is an outage, client connections quickly fail over to the standby and resume service. Active Data Guard achieves the highest level of data protection through deep integration with Oracle Database, strong fault isolation, and unique Oracle-aware data validation. System and software defects, data corruption, and administrator error that affect a primary are not mirrored to the standby. Idle redundancy is eliminated by directing read-only workloads and backups to active standby databases for high return on investment. See Getting Started with Oracle Data Guard in Data Guard Concepts and Administration.

  • Oracle Data Guard broker logically groups these primary and standby databases into a broker configuration that enables the broker to manage and monitor them together as an integrated unit. It sends notifications to WebLogic Active GridLink which then makes new connections to the database in the failover site, and coordinates with Oracle Clusterware to fail over role-based services. See Oracle Data Guard Broker Concepts in Data Guard Broker.

  • Oracle Real Application Clusters (Oracle RAC) is a clustered version of Oracle Database that allows running multiple database instances on different servers in the cluster against a shared set of data files, also known as the database. The database spans multiple hardware systems and yet appears as a single unified database to the application. See Introduction to Oracle RAC in Real Application Clusters Administration and Deployment Guide.

  • Oracle Clusterware manages the availability of instances of an Oracle RAC database. It works to rapidly recover failed instances to keep the primary database available. If Oracle Clusterware cannot recover a failed instance, then the broker continues to run automatically with one fewer instance. If the last instance of the primary database fails, then the broker provides a way to fail over to a specified standby database. If the last instance of the primary database fails, and fast-start failover is enabled, then the broker can continue to provide high availability by automatically failing over to a pre-determined standby database. See Introduction to Oracle Clusterware in Oracle Clusterware Administration and Deployment Guide.

  • Oracle GoldenGate is a high-performance software application that uses log-based bidirectional data replication for real-time capture, transformation, routing, and delivery of database transactions across heterogeneous systems. Oracle GoldenGate allows for databases to be in active-active mode. Applications that use Oracle GoldenGate must have tolerance for data loss due to the asynchronous nature of Oracle GoldenGate replication. See Oracle GoldenGate Administration Overview in Administering Oracle GoldenGate.

  • Oracle Database Global Data Services (GDS) streamline the delivery of database services on a global scale, which is key to deploying databases in MAA environments. These technologies oversee replication and failover while performing load balancing within and across data centers, optimizing resource utilization and streamlining database management practices in a distributed database environment. GDS works by enabling a Global Service across Oracle Real Application Clusters (RAC) and single-instance Oracle databases interconnected via Oracle Data Guard, Oracle GoldenGate, or any other replication technology. Client access to this distributed infrastructure is completely transparent. GDS implementations are easy to apply to Oracle WebLogic Server with minimal changes. See Introduction to Global Data Services in Oracle Database Global Data Services Concepts and Administration Guide.

  • Application Continuity (AC) is a feature, available with the Oracle RAC, Oracle RAC One Node, and Oracle Active Data Guard options, that masks outages from end users and applications by recovering the in-flight database sessions following recoverable outages. Application Continuity enables replay, in a non-disruptive and rapid manner, of a database request when a recoverable error makes the database session unavailable. The request can contain transactional and nontransactional calls to the database and calls that are executed locally at the client or middle tier. After a successful replay, the application can continue where that database session left off. See Ensuring Application Continuity in Real Application Clusters Administration and Deployment Guide.

WebLogic Server Active GridLink integrates with Oracle Database features such as Application Continuity and Global Data Services to provide the highest possible availability. Application Continuity replays transactions when it encounters unplanned database outages. End-user applications do not receive errors or even know that there have been outages. Active GridLink, Application Continuity, and Data Guard provide protection for planned and unplanned database outages in highly available environments.
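
The idea of replaying a request after a recoverable error can be sketched as follows. This is a conceptual Python illustration only. Application Continuity itself works inside the database and driver layers and includes safeguards that this sketch omits, such as making sure that a transaction that was already committed is not applied twice. `RecoverableError`, `run_with_replay`, `connect`, and `request` are hypothetical names.

```python
class RecoverableError(Exception):
    """Stands in for an outage after which a new session can be established."""


def run_with_replay(request, connect, max_replays=1):
    """Run request(session); on a recoverable error, reconnect and replay it."""
    session = connect()
    for attempt in range(max_replays + 1):
        try:
            return request(session)
        except RecoverableError:
            if attempt == max_replays:
                raise               # give up and surface the error to the caller
            session = connect()     # obtain a new session, then replay the request


sessions = iter(["session-1", "session-2"])


def connect():
    return next(sessions)


def request(session):
    if session == "session-1":
        raise RecoverableError("instance went down")   # simulated outage
    return f"result from {session}"


print(run_with_replay(request, connect))   # prints result from session-2
```

The caller sees only the final result. The failed first attempt is hidden, which is the effect that the text above describes for end-user applications.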

Load Balancers

Load balancers provide high availability by ensuring that if one web server goes down, requests are routed to the remaining web servers that are up and running.

There are two types of load balancers: global load balancers and local load balancers. Load balancers can be hardware devices, such as BIG-IP, Cisco, or Brocade devices, or software applications.

A global load balancer is used when you have multiple sites that need to function as the same logical environment. Its purpose is to distribute requests between the sites based on a pre-determined set of rules. Global load balancers are typically used in disaster recovery deployments.

A local load balancer, such as Oracle HTTP Server, is used to distribute traffic within a site. In a typical deployment, at least two Oracle HTTP Server instances are configured in the web tier to provide high availability. See Oracle HTTP Server High Availability Architecture and Failover Considerations in Administering Oracle HTTP Server. A web tier with Oracle HTTP Server is not a requirement; you can route traffic directly from the hardware load balancer to the WebLogic Server instances in the application tier. However, a web tier provides several advantages, such as faster failover in the event of a WebLogic Server instance failure and HTTP redirection, which is why it is recommended as part of the supported MAA architectures.
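
The basic behavior of a load balancer, routing requests only to servers that are up, can be sketched as follows. This is a conceptual Python illustration and does not represent the routing logic of Oracle HTTP Server or of any hardware load balancer. The `RoundRobinBalancer` class and the server names are hypothetical.

```python
class RoundRobinBalancer:
    """Toy round-robin balancer that skips servers marked as down."""

    def __init__(self, servers):
        self._servers = list(servers)
        self._healthy = set(self._servers)
        self._next = 0

    def mark_down(self, server):
        self._healthy.discard(server)

    def mark_up(self, server):
        if server in self._servers:
            self._healthy.add(server)

    def route(self):
        """Return the next healthy server, or raise if none is available."""
        for _ in range(len(self._servers)):
            server = self._servers[self._next]
            self._next = (self._next + 1) % len(self._servers)
            if server in self._healthy:
                return server
        raise RuntimeError("no healthy servers available")


balancer = RoundRobinBalancer(["web1", "web2"])
balancer.mark_down("web1")                       # web1 goes down
print([balancer.route() for _ in range(3)])      # prints ['web2', 'web2', 'web2']
```

A production load balancer detects failed servers through health checks instead of explicit `mark_down` calls, but the routing decision has the same shape.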

Supported MAA Architectures

WebLogic Server supports three primary maximum availability architecture (MAA) solutions that can be used to protect an Oracle WebLogic Server system against downtime across multiple data centers.

MAA architectures span data centers in distributed geographical locations. Oracle MAA is Oracle's best practices blueprint based on proven Oracle high availability technologies, expert recommendations and customer experiences. The goal of MAA is to achieve optimal high availability for Oracle customers at the lowest cost and complexity.

See the following topics for details and design considerations for the WebLogic Server and Coherence supported MAA architectures:

Potential Failure Scenarios

Potential failure scenarios range from unexpected full and partial site failures to maintenance outages.

The design considerations and recommendations provided in this document apply to the following potential failure scenarios:

  • Full site failure - With full site failure, the database, the middle-tier application server, and all user connections fail over to a secondary site that is prepared to handle the production load.

  • Partial site failure - In the context of this document, partial failures are at the mid-tier. Partial site failures at the mid-tier can consist of the entire mid-tier (WebLogic Server and Coherence), WebLogic Server only failure, Coherence cluster failure, or a failure in one instance of Oracle HTTP Server when two instances are configured for high availability.

  • Network partition failure - The communication between sites fails.

  • Maintenance outage - During planned maintenance, all components of a site are brought down gracefully. A switchover takes place from one site to the other.