Get Started

Oracle WebLogic Server can be deployed across multiple on-premises sites or across multiple Oracle Cloud Infrastructure (OCI) regions.

The configuration in this playbook uses a single Oracle WebLogic Server domain in which servers in two sites participate in the same cluster (known as a stretched cluster) and rely on Data Guard to provide protection for the database.

In OCI, features such as traffic management, health checks, load balancers, DNS private views, and dynamic routing gateways provide enhanced capabilities to support this setup. For on-premises environments, equivalent networking and infrastructure components should be used to meet these requirements.

The network latency between the sites or regions needs to be sufficiently low to minimize the performance penalty introduced by the delay in invocations and to prevent inconsistencies during deployment and runtime. Oracle supports this topology only when the latency between the WebLogic servers and the database is below 10 ms round-trip time (RTT).
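
As a rough way to check a candidate pair of sites against the 10 ms limit, the following minimal Python sketch times TCP handshakes and summarizes the samples. It assumes that one TCP connect takes approximately one network round trip. The function names and the use of the median are illustrative choices, not part of any Oracle tool. When a host name is passed instead of an IP address, name resolution time is included in the measurement.

```python
import socket
import statistics
import time

MAX_SUPPORTED_RTT_MS = 10.0  # RTT limit stated in this playbook


def tcp_connect_rtt_ms(host, port, timeout=2.0):
    """Time one TCP handshake to host:port and return it in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # the connection is closed as soon as the handshake completes
    return (time.perf_counter() - start) * 1000.0


def summarize_rtt(samples_ms, limit_ms=MAX_SUPPORTED_RTT_MS):
    """Return the median and worst RTT, and whether the median is below the limit."""
    if not samples_ms:
        raise ValueError("at least one RTT sample is required")
    median = statistics.median(samples_ms)
    return {
        "median_ms": median,
        "max_ms": max(samples_ms),
        "within_limit": median < limit_ms,
    }
```

For example, collecting 20 samples with tcp_connect_rtt_ms against the database listener address in the remote site (an address you supply) and passing the list to summarize_rtt gives a quick indication of whether the topology is viable. A dedicated network measurement tool remains the better choice for a formal assessment.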

To achieve optimum performance and failover behaviors, Oracle recommends that you analyze each application running in a WebLogic stretched cluster and adjust the different parameters discussed in this playbook (timeouts, session replication configuration, service migration leasing, Java Transaction API (JTA), and so on) accordingly.

Learn About Oracle Fusion Middleware Stretched Clusters

Providing Oracle maximum availability architecture (MAA) is one of the key requirements for any Oracle Fusion Middleware enterprise deployment.

Oracle Fusion Middleware includes an extensive set of high availability features, such as process death detection and restart, server clustering, service migration, GridLink, load balancing, failover, backup and recovery, rolling upgrades, and rolling configuration changes, which protect an enterprise deployment from unplanned downtime and minimize planned downtime. These features deliver a local, high availability solution within a single data center.

Additionally, applications need protection from unforeseen disasters, natural calamities, and downtime that can affect an entire data center. Most traditional disaster protection systems use the active-passive model, which involves setting up a standby site at a geographically different location than the production site. This model is usually adopted when the latency between the sites is high and does not allow clustering across the two sites. This approach provides complete maximum availability architecture (MAA) protection. However, it results in additional operating and administrative costs because the standby middleware system mirrors the primary system and requires continuous replication. This model is described in the Oracle Fusion Middleware Disaster Recovery Guide.

This playbook describes another model: the active-active model based on Oracle Fusion Middleware stretched clusters, which can be used for protecting an Oracle Fusion Middleware system against downtime across multiple locations. This model uses an active-active configuration for the middle tier, while the database tier uses an active-passive configuration with Data Guard. It is designed to optimize capacity and distribute workloads across the sites. It utilizes the resources more effectively than the active-passive model by making use of all the available resources rather than leaving standby machines idle. This model is referred to as an FMW stretched clusters deployment.

In particular, this playbook focuses on how to implement this model across Oracle Cloud Infrastructure (OCI) regions. It provides the configuration steps for setting up the recommended topology and guidance about the performance and failover implications of this setup.

The results and examples in this playbook are based on an Oracle SOA Suite 14.1.2 system that follows the Enterprise Deployment Guide architecture. This is a significant example because it includes many features such as standard Jakarta components, HTTP session replication, database metadata persistence, a Coherence cluster, and both Java Message Service (JMS) and Java Transaction API (JTA) persistent stores, among other relevant considerations for stretched clusters. As a result, the described topology and recommendations can also be applied to other Oracle Fusion Middleware environments.

Terminology

Here are definitions of some terms used in this playbook:
  • Mid-tier (also middle tier or middleware)

    The mid-tier refers to the layer within a multi-tiered application architecture that sits between the user interface (front end) and the data storage (back end). It handles business logic, data processing, and security, acting as a bridge between the user and the database.

  • Oracle Fusion Middleware

    Oracle Fusion Middleware is a comprehensive family of enterprise middleware products from Oracle that enables organizations to build, deploy, and manage applications efficiently and securely. It includes solutions for application servers, integration, business process management, business intelligence, security, identity management, web servers, and more.

  • Disaster

    A sudden, unplanned catastrophic event that causes unacceptable damage or loss in a site or geographical area. A disaster is an event that compromises an organization's ability to provide critical functions, processes, or services for an unacceptable period and causes the organization to invoke its recovery plans.

  • Disaster Recovery

    Ability to safeguard against natural or unplanned outages at a production site by having a recovery strategy for applications and data to a geographically separate standby site.

  • Oracle maximum availability architecture

    Oracle maximum availability architecture (Oracle MAA) is the best practice blueprint for data protection and availability of Oracle products (Oracle Database, Oracle Fusion Middleware, Applications). Implementing Oracle MAA best practices is one of the key requirements for any Oracle deployment. It provides recommendations for setting up and managing an Oracle system. Oracle MAA includes the Oracle Fusion Middleware Enterprise Deployment Guide recommendations and adds disaster protection best practices to minimize planned and unplanned downtime for outages affecting an entire data center or region. See the Explore More section for links to related documentation and other resources.

  • Site (or data center)

    A site is the set of different components in a data center needed to run a group of applications. For example, a site could consist of Oracle Fusion Middleware instances, databases, storage, and so on.

  • System

    A system is a set of targets (hosts, databases, application servers, and so on) that work together to host your applications. For example, to monitor an application in Oracle Enterprise Manager, you first create a system that consists of the database, listener, application server, and host targets on which the application runs.

  • Stretched cluster

    A stretched cluster refers to a cluster architecture where the nodes in a single logical cluster are distributed across geographically separate data centers or locations.

  • Switchover

    Switchover is the process of reversing the roles of the production site and standby site. Switchovers are planned operations done for periodic validation or to perform planned maintenance on the current production site. During a switchover, the current standby site becomes the new production site, and the current production site becomes the new standby site. This playbook also uses the term switchover to refer to a site switchover.

  • Switchback

    Switchback is the process of reverting the current production site and the current standby site to their original roles. Switchbacks are planned operations done after the switchover operation has been completed. A switchback restores the original roles of each site: the current standby site becomes the production site and the current production site becomes the standby site. This playbook also uses the term switchback to refer to a site switchback.

  • Failover

    Failover is the process of making the current standby site the new production site after the production site becomes unexpectedly unavailable due, for example, to a disaster at the production site. This playbook also uses the term failover to refer to a site failover.

  • Recovery point objective (RPO)

    Recovery point objective is the amount of data loss that a system can tolerate from a business point of view, such as the amount of data loss that is acceptable when an outage takes place.

  • Recovery time objective (RTO)

    Recovery time objective is the amount of downtime a system can tolerate or the acceptable amount of time that an application or service can remain unavailable when an outage takes place, from a business point of view.

  • Oracle Cloud Infrastructure

    Oracle Cloud Infrastructure (OCI) is a set of complementary cloud services that enable you to build and run a range of applications and services in a highly available hosted environment. OCI provides high-performance compute capabilities (as physical hardware instances) and storage capacity in a flexible overlay virtual network that is securely accessible from your on-premises network.

  • OCI region

    An OCI region is a localized geographic area that contains one or more data centers, hosting availability domains. Regions are independent of other regions, and vast distances can separate them (across countries or even continents).

    A region is a site in terms of disaster recovery.

  • Availability domain

    Availability domains are standalone, independent data centers within a region. The physical resources in each availability domain are isolated from the resources in the other availability domains, which provides fault tolerance. Availability domains don’t share infrastructure such as power or cooling, or the internal availability domain network. So, a failure at one availability domain shouldn't affect the other availability domains in the region.

  • OCI virtual cloud network and subnet

    A virtual cloud network (VCN) is a customizable, software-defined network that you set up in an OCI region. Like traditional data center networks, VCNs give you control over your network environment. A VCN can have multiple non-overlapping classless inter-domain routing (CIDR) blocks that you can change after you create the VCN. You can segment a VCN into subnets, which can be scoped to a region or to an availability domain. Each subnet consists of a contiguous range of addresses that don't overlap with the other subnets in the VCN. You can change the size of a subnet after creation. A subnet can be public or private.

  • Dynamic routing gateway (DRG)

    The DRG is a virtual router that provides a path for private network traffic between VCNs in the same region, and between a VCN and a network outside the region, such as a VCN in another OCI region, an on-premises network, or a network in another cloud provider.

  • OCI DNS

    Oracle Cloud Infrastructure Domain Name System (DNS) service is a highly scalable, global anycast domain name system (DNS) network that offers enhanced DNS performance, resiliency, and scalability, so that end users connect to internet applications quickly, from anywhere.

  • OCI DNS private view

    An OCI DNS private view is a collection of one or more OCI Private DNS zones, used to logically group DNS zones and control how they are resolved. A view is attached to a private DNS resolver, which can be the default resolver for a virtual cloud network (VCN) or a custom one. This allows you to manage separate DNS configurations for different environments or applications within your VCN.

  • Virtual IP

    A virtual IP (VIP) address refers to an IP address that is not tied to a particular physical network interface or device. Instead, it is abstracted and can move between devices or be used for various network functions.

  • OCI Traffic Management

    OCI Traffic Management intelligently directs user traffic to optimal endpoints using advanced DNS-based policies (OCI traffic steering policies). It enables organizations to control how DNS queries are resolved, optimizing the routing of client requests for improved availability, performance, and resilience of applications or services deployed on OCI or elsewhere.

  • Load balancer

    A load balancer is a system or service that distributes incoming network traffic across multiple servers to ensure high availability, reliability, and optimal performance of applications.

  • OCI Load Balancer

    OCI Load Balancer is a fully managed Oracle Cloud Infrastructure service that automatically distributes incoming traffic across multiple backend servers or resources to ensure high availability, better performance, and scalability for applications.

  • Block storage

    Block storage is a type of data storage where information is saved in fixed-sized blocks and can be accessed directly by servers or applications via storage protocols such as iSCSI or Fibre Channel.

  • OCI Block Volumes

    With Oracle Cloud Infrastructure Block Volumes, you can create, attach, connect, and move storage volumes, and change volume performance to meet your storage, performance, and application requirements. After you attach and connect a volume to an instance, you can use the volume like a regular hard drive. You can also disconnect a volume and attach it to another instance without losing data.

  • Shared storage

    Shared storage refers to a storage system or resource that can be accessed concurrently by multiple servers, computers, or applications within an IT environment. This setup enables all participating systems to read from and write to the same data repository, making it ideal for scenarios that require data consistency, collaboration, high availability, and scalability.

  • OCI File Storage

    Oracle Cloud Infrastructure File Storage provides a durable, scalable, secure, enterprise-grade network file system. You can connect to OCI File Storage from any bare metal, virtual machine, or container instance in a VCN. You can also access OCI File Storage from outside the VCN by using Oracle Cloud Infrastructure FastConnect and IPSec VPN.

  • OCI database services

    OCI database services refer to the portfolio of managed database solutions provided by Oracle Cloud Infrastructure (OCI). These services offer a variety of database deployment and management options in the Oracle Cloud, supporting different workloads, performance needs, and data models, such as Oracle Base Database Service and Oracle Exadata Database Service.

  • Oracle Data Guard

    Oracle Data Guard and Active Data Guard provide a comprehensive set of services that create, maintain, manage, and monitor one or more standby databases and that enable production Oracle databases to remain available without interruption. Oracle Data Guard maintains these standby databases as copies of the production database by using in-memory replication. If the production database becomes unavailable due to a planned or an unplanned outage, Oracle Data Guard can switch any standby database to the production role, minimizing the downtime associated with the outage. Oracle Active Data Guard provides the additional ability to offload read-mostly workloads to standby databases and also provides advanced data protection features.

This playbook uses OCI as an example infrastructure for deploying Oracle Fusion Middleware stretched clusters. These are the on-premises to OCI equivalencies for the main components required for the Oracle Fusion Middleware stretched cluster setup:

On-premises (or generic)                  OCI equivalent
--------------------------------------    -----------------------------------------------
Site (or data center)                     OCI region
Shared storage (for example, NFS)         OCI File Storage
Block storage                             OCI Block Volumes
Global load balancer                      OCI Traffic Management and steering policies
Local load balancer                       OCI Load Balancer
Network                                   OCI virtual cloud network
Firewall/firewall rules                   OCI network security rules
Internal DNS                              OCI DNS private view
Physical server / virtual machine         OCI Compute instance
On-premises database                      OCI database service
On-premises connectivity between sites    OCI remote peering with dynamic routing gateway

Architecture

This section describes the topology and considerations for the Oracle Fusion Middleware (FMW) stretched cluster architecture.

The following considerations apply to the FMW stretched cluster architecture:

  • Regions

    There are two regions, or sites, in this topology. The latency between them should not be higher than 10 ms round-trip time (RTT). Bandwidth requirements will depend on the types of payloads handled by each system, but a minimum of 1 or 2 gigabits per second (Gbps) is recommended.

  • Middle tier

    The middle tier operates in an active-active model, with a single domain. Half of the managed servers are deployed at one site, with the remainder at the other site. The Administration Server runs in one site but can fail over to the other site if needed. This setup is commonly referred to as a stretched cluster.

  • Database tier

    The database tier uses an active-passive architecture, supported by Oracle Data Guard. The primary database is located at one site, while the standby database resides at the other site. All middle-tier servers are configured to connect to the database currently serving as the primary, regardless of its location. The Oracle Database configured in each site is an Oracle Real Application Clusters (Oracle RAC). Oracle RAC enables an Oracle database to run across a cluster of servers within the same data center, providing fault tolerance, performance, and scalability with no application changes necessary.

  • Storage

    The shared storage is local to each site. For contention and security reasons, it is not recommended to use shared storage across sites. Disk mirroring or replication from site1 to site2 and vice versa is recommended to provide a recoverable copy of the artifacts in each site.

  • Persistent Stores

    The WebLogic Persistent Stores (Java Message Service (JMS) and Java Transaction API (JTA)) are configured as Java Database Connectivity (JDBC) stores within the database. This way, they are reachable from both sites. This allows the Automatic Service Migration feature to work between both sites.

  • Request forwarding

    The web servers at each site forward requests only to the Oracle WebLogic Server instances located within the same site. There is no cross-region communication between the web servers (Oracle HTTP Server instances) and the WebLogic servers in the other site to minimize latency and cross-region traffic.

  • Load balancers

    Each site has its own dedicated load balancer, which directs requests exclusively to the web servers within that local site. There is no cross-region communication between the load balancers and the web servers located at the other site.

  • Front-end access

    The solution provides a single front-end access point to the system. Clients connect using a single address that remains the same, regardless of the site to which they are directed. This mechanism offers a domain name system (DNS) name that is accessible to all clients and routes requests to either site based on predefined criteria and rules, such as the client’s geographic location.
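
The site-local routing rules for load balancers and web servers described above can be expressed as a simple consistency check. The following Python sketch is illustrative only: the component names, the SITE_OF inventory, and the ROUTES table are hypothetical and do not correspond to any Oracle configuration format. It reports every forwarding rule whose source and target are in different sites.

```python
# Hypothetical inventory: component name -> site. Names are illustrative only.
SITE_OF = {
    "lbr1": "site1", "ohs1": "site1", "wls1": "site1", "wls2": "site1",
    "lbr2": "site2", "ohs2": "site2", "wls3": "site2", "wls4": "site2",
}

# Hypothetical routing table: front component -> back-end targets.
ROUTES = {
    "lbr1": ["ohs1"],          # load balancer to local web server
    "ohs1": ["wls1", "wls2"],  # web server to local WebLogic servers
    "lbr2": ["ohs2"],
    "ohs2": ["wls3", "wls4"],
}


def cross_site_routes(routes, site_of):
    """Return (source, target) pairs whose endpoints are in different sites."""
    return [
        (source, target)
        for source, targets in routes.items()
        for target in targets
        if site_of[source] != site_of[target]
    ]
```

With the tables shown, cross_site_routes(ROUTES, SITE_OF) returns an empty list. Adding a WebLogic server from site2 to the targets of ohs1 would cause that pair to be reported.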

The following diagram illustrates the Oracle Fusion Middleware stretched cluster topology:



stretched-cluster-topology-oracle.zip

The following diagram illustrates the WebLogic domain and clusters in the Oracle Fusion Middleware stretched cluster topology:



stretched-cluster-topology-wls-domain-oracle.zip

Advantages

The benefits of using the Oracle Fusion Middleware (FMW) stretched cluster model over the traditional active-passive model include the following:
  • Simplified administration

    Active-passive deployments incur additional administration overhead to keep the configuration synchronized between the primary and the standby site. Although most of the runtime information and metadata resides in the database, the Oracle WebLogic Server configuration resides in the file system. So, in addition to the Oracle Data Guard replication for the database, the active-passive model requires continuous file system replication, which can be implemented in various ways (rsync, storage-level replica, and so on).

    In the FMW stretched clusters model, however, the Oracle WebLogic Server infrastructure keeps the configuration synchronized across the multiple nodes in the same domain. Most of this configuration usually resides under the Administration Server’s domain directory. This configuration is propagated automatically to the other nodes in the same domain when the managed servers start or when a change is implemented. For this reason, the administration overhead of the deployment is very small compared to any active-passive approach, where constant replication of configuration changes is required.

  • Improved availability and lower RTO and RPO for some failover scenarios

    The FMW stretched cluster model provides better recovery point objective (RPO) and recovery time objective (RTO) than the active-passive model in these scenarios:

    • Complete middle-tier failure in one site

      If all the middle-tier servers in one location fail, the system can continue fulfilling requests with the middle-tier servers in the peer site immediately. The RTO and RPO are zero in this type of scenario.

      To achieve this, the middle-tier servers in the alternative location need to be able to sustain the combined workload of the two locations. Appropriate capacity planning must be done to account for such scenarios. Depending on the design, requests from end clients may need to be throttled when only one site is active. Otherwise, sites must be designed with excess capacity, partially defeating the purpose of constant and efficient capacity usage.

    • Failures in the database tier

      A switchover of the database incurs the same RTO and RPO in the active-passive model and in the FMW stretched cluster model. The overall RTO of the system, however, is lower in the stretched cluster model because the mid-tier servers are already active in the secondary site. A restart of the middle tiers is not required. An appropriate data source configuration, using a dual connect string with GridLink and Fast Application Notification (FAN) events, automates the failover of the database connections from the middle tiers, reducing the system RTO.
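
To illustrate what a dual connect string can look like, the following Python sketch assembles a JDBC thin URL with two address lists, one for each site's database. The host names, service name, and the timeout and retry values are placeholders chosen for illustration. The sketch assumes that the database service is started only on the database that currently holds the primary role, so that connection attempts to the standby address fail and fall through to the primary.

```python
def address_list(scan_host, port=1521):
    """Build one ADDRESS_LIST entry for the SCAN address of a database cluster."""
    return (
        "(ADDRESS_LIST=(LOAD_BALANCE=on)"
        f"(ADDRESS=(PROTOCOL=TCP)(HOST={scan_host})(PORT={port})))"
    )


def dual_connect_url(primary_scan, standby_scan, service_name,
                     connect_timeout=15, retry_count=5, retry_delay=5):
    """Build a JDBC URL that lists both databases, the primary site's first."""
    return (
        "jdbc:oracle:thin:@(DESCRIPTION="
        f"(CONNECT_TIMEOUT={connect_timeout})"
        f"(RETRY_COUNT={retry_count})(RETRY_DELAY={retry_delay})"
        f"{address_list(primary_scan)}{address_list(standby_scan)}"
        f"(CONNECT_DATA=(SERVICE_NAME={service_name})))"
    )


# Hypothetical host and service names, for illustration only.
url = dual_connect_url(
    "db-scan.site1.example.com",
    "db-scan.site2.example.com",
    "soa_service.example.com",
)
```

The resulting string would be used as the URL of the GridLink data source. The appropriate timeout and retry values depend on the application and should be validated with failover tests.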

Considerations

Consider the following when implementing the Oracle Fusion Middleware (FMW) stretched cluster model:
  • Capacity planning

    This model requires capacity planning to account for failover scenarios between the two sites. If one site loses all its middle-tier servers, the other site must be able to sustain the full workload. Otherwise, the surviving site may become unresponsive due to the overload. This means that during normal operation, the middle-tier nodes must be underutilized to allow sufficient capacity to handle failover scenarios. The same rule applies to the web tier. If one site loses all its web servers, the web servers at the other site must be capable of handling all the system requests.

  • Network throughput across sites

    Network throughput mainly depends on two things: how far apart the sites are and how well the network handles reliability and congestion. No matter how fast the servers or software are, there’s a limit to how quickly data can move between sites. Two important factors affecting this are latency and jitter:

    • Latency is the time it takes for data to travel across the network from one site to another. It can be measured one-way (from source to destination) or round-trip (there and back). The round-trip time (RTT) is more common and can be checked with the ping command.

    • Jitter is the variation in the time it takes for data packets to arrive.

    For the current model, keeping the latency low is especially important, as jitter usually only matters when the latency is already very low. Therefore, controlling latency is the main priority for good performance in this kind of setup. The distance is typically the most relevant cause of latency.

    Tests conducted have shown that the performance in this model (where some Oracle WebLogic Server instances in the cluster are in a different site than the database) degrades considerably when latency (RTT) exceeds 10 milliseconds.

    Oracle has conducted multiple tests with different configurations and different latencies. The reference latencies shown in this playbook differentiate between clusters with:
    • All members in the same availability domain
    • Members in different availability domains
    • Members in two nearby but different OCI regions

    The following image shows the throughput (transactions per second) for an Oracle Fusion Middleware SOA system running the Fusion Order Demo (FOD) for different latencies between the WebLogic server and the database:



    The following image shows the Java Transaction API (JTA) active time for an Oracle Fusion Middleware SOA system running the Fusion Order Demo (FOD) that uses a database on a different site with different latencies between sites:



    The following image shows the degradation observed in the overall system throughput in transactions per second for different latencies between the sites (both sites working together with the database in one of the sites).



    Note:

    Given the above and the performance penalties observed in many tests, Oracle does not support Oracle Fusion Middleware stretched clusters in which the latency (RTT) between the sites exceeds 10 milliseconds. Systems with higher latency may operate without functional issues, but transaction times will increase considerably. Latencies beyond 10 milliseconds (RTT) will also cause problems in the Oracle Coherence cluster used for deployment, as well as JTA, web services, or application timeouts. This makes the solutions presented in this playbook suitable primarily for sites or regions with low latency between them.

  • Cross-region traffic

    In the current model, you need to minimize the traffic across sites to reduce the effect of latency on the system’s throughput. In a typical FMW system, the communications between the different tiers or elements are:

    • Access to the database from the Oracle WebLogic Server instances for metadata access and other database read/write operations.

    • Incoming HTTP invocations from load balancers (LBR) or Oracle HTTP Server instances and HTTP callbacks.

    • Java Naming and Directory Interface (JNDI)/Remote Method Invocation (RMI) and Java Message Service (JMS) invocations between Oracle WebLogic Server instances.

    • Oracle Coherence notifications between all servers in the system. For example, SOA uses Coherence to provide a consistent composite and metadata image.

    • HTTP session replications between the Oracle WebLogic Server instances. Some components use stateful web applications that may rely on session replication to enable transparent failover of sessions across servers. Depending on the usage patterns and number of users, this may generate a considerable amount of replication data.

    • Lightweight Directory Access Protocol (LDAP)/policy/identity store access is performed by Oracle WebLogic Server infrastructure for authorization and authentication purposes. Ideally, each site should have an independent identity and policy store that is synchronized regularly to minimize invocations from one site to the other. Alternatively, both sites can share the same store. The impact of sharing the store will depend on the type of store and the usage pattern of the system.

    Whenever possible, all of the above should be contained within the site to enhance performance. For example:

    • The servers in one site should only receive invocations from Oracle HTTP Server instances in the same site.

    • The servers in one site should make JMS, RMI, and JNDI invocations only to servers in the same site and should receive callbacks only from servers in the same site.

    • HTTP session replication should be restricted to the other servers in the same site only. Replication and failover requirements have to be analyzed for each business case, but ideally, session replication traffic should be reduced across sites as much as possible.

  • Shared storage

    The latency for network file system (NFS) writes across sites may cause severe performance degradation. The servers should use storage devices that are local to their site to eliminate contention in read/write requests to file systems. Shared storage should be limited to being within each site.

  • Other Resources

    The two sites may share other external resources, such as LDAP, external JMS destinations, external web services, and so on. These resources must be consistently available in both sites. The configuration details for these external resources are out of the scope of this playbook.
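
The capacity planning consideration above can be made concrete with a small calculation. The following Python sketch is a simplified model, not an Oracle sizing method. It assumes that capacity is expressed in a single unit (for example, requests per second that a site can sustain) and that, after a site failure, the surviving site may be driven up to a chosen utilization ceiling. The function name and parameters are illustrative.

```python
def max_normal_utilization(site1_capacity, site2_capacity, ceiling=1.0):
    """Return the highest overall utilization allowed in normal operation.

    The result is the fraction of the combined capacity of both sites that
    the workload may use so that either site alone can still carry the
    full workload without exceeding the utilization ceiling.
    """
    if site1_capacity <= 0 or site2_capacity <= 0:
        raise ValueError("site capacities must be positive")
    if not 0 < ceiling <= 1:
        raise ValueError("ceiling must be in the range (0, 1]")
    # Worst case: the larger site fails and the smaller one must take over.
    surviving = min(site1_capacity, site2_capacity)
    return ceiling * surviving / (site1_capacity + site2_capacity)
```

For two equally sized sites, the function returns 0.5: in normal operation the combined workload must not exceed half of the total capacity. If the surviving site should not run above 80 percent utilization (ceiling=0.8), the limit drops to 0.4.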