2 Common Design Considerations for High Availability and Disaster Recovery

Oracle provides recommended design considerations and best practices for the Maximum Availability Architecture (MAA) solutions supported for Oracle WebLogic Server and Coherence. MAA architectures span data centers in distributed geographical locations. The goal of MAA is to achieve optimal high availability for Oracle customers at the lowest cost and complexity.

The recommendations in this chapter apply to all of the WebLogic Server and Coherence supported MAA architectures. Recommendations that are specific to a particular architecture are provided in the subsequent chapters.

Global Load Balancer

When a global load balancer is deployed in front of the production and standby sites, it provides fault detection services and performance-based routing redirection for the two sites. Additionally, the load balancer can provide authoritative DNS name server equivalent capabilities.

In the event of a primary-site disaster, and after the standby site has assumed the production role, a global load balancer is used to reroute user requests to the standby site. Global load balancers such as F5 BIG-IP Global Traffic Manager (GTM) and Cisco Global Site Selector (GSS) also handle DNS server resolution (by offloading the resolution process from the traditional DNS servers).

During normal operations, the global load balancer can be configured with the production site's load balancer name-to-IP mapping. When a DNS switchover is required, this mapping in the global load balancer is changed to map to the standby site's load balancer IP. This allows requests to be directed to the standby site, which now has the production role.

This method of DNS switchover works for both site switchover (planned) and failover (unplanned). One advantage of using a global load balancer is that the new name-to-IP mapping can take effect almost immediately. The downside is the additional investment required for the global load balancer. For instructions on performing a DNS switchover, see Manually Changing DNS Names in Disaster Recovery Guide.
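As a quick sanity check during switchover testing, you can confirm which site the global name currently resolves to. The following minimal Python sketch uses hypothetical values; substitute your own global load balancer name and site load balancer addresses:

    import socket

    # Hypothetical values: the global load balancer DNS name and the
    # virtual IPs of the production and standby site load balancers.
    GLOBAL_NAME = "app.example.com"
    SITE_IPS = {"192.0.2.10": "production", "198.51.100.10": "standby"}

    ip = socket.gethostbyname(GLOBAL_NAME)
    print("%s resolves to %s (%s site)" % (GLOBAL_NAME, ip, SITE_IPS.get(ip, "unknown")))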

Web Tier

Configuring a web tier is optional in the supported WebLogic Server MAA architectures. Web tier products such as Oracle HTTP Server (OHS) and Oracle WebLogic Server Proxy Plug-In are designed to efficiently front-end WebLogic Server applications. OHS and the WebLogic Server Proxy Plug-In can be used with other WebLogic Server high availability features.

You can configure Oracle HTTP Server in one of two ways: as part of an existing Oracle WebLogic Server domain or in its own standalone domain. In the WebLogic Server and Coherence supported MAA architectures, the Oracle HTTP Server instances are configured as separate standalone domains, so that you can configure and manage them using WLST offline commands, independently of the application tier domains.

The mod_wl_ohs module handles the link to Managed Servers. You configure mod_wl_ohs to route requests of a particular type, such as JSPs, or requests destined for a particular URL, to specific Managed Servers or clusters.
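For illustration, a minimal mod_wl_ohs configuration fragment follows; the host names, port, and URL path are placeholders for your own environment:

    <IfModule weblogic_module>
       # Route all JSP requests to the WebLogic cluster.
       WebLogicCluster apphost1.example.com:8001,apphost2.example.com:8001
       MatchExpression *.jsp
    </IfModule>

    # Route all requests under /inventory to the same cluster.
    <Location /inventory>
       SetHandler weblogic-handler
       WebLogicCluster apphost1.example.com:8001,apphost2.example.com:8001
    </Location>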

Oracle HTTP Server (OHS) is subject to two types of failure: process failure, where an individual operating system process fails, and node failure, where the entire host computer that OHS runs on fails.

  • In a process failure, Node Manager protects and manages OHS processes. If an OHS process fails, Node Manager automatically restarts it.

  • In a node failure, the load balancer in front of OHS sends requests to another OHS instance if the first one does not respond, or if URL pings to it indicate that it has failed.

  • If a Managed Server in a cluster fails, the mod_wl_ohs module automatically redirects requests to one of the active cluster members. If the application stores state and state replication is enabled within the cluster, redirected requests have access to the same state information.

For more information, see the documentation for Oracle HTTP Server and the Oracle WebLogic Server Proxy Plug-Ins.

WebLogic Server

WebLogic Server features such as clustering, singleton services, session replication, and others can be used together with Coherence and Oracle Database features to provide the highest level of availability.

The following sections provide the design considerations for these WebLogic Server features in a supported WebLogic Server MAA architecture:

Clustering

A WebLogic Server cluster consists of multiple WebLogic Server instances running simultaneously and working together to provide increased scalability, reliability, and high availability. A cluster appears to clients as a single WebLogic Server instance. The server instances that constitute a cluster can run on the same machine or be located on different machines. You can increase a cluster's capacity by adding server instances to the cluster on an existing machine, or you can add machines to the cluster to host the additional server instances. Each server instance in a cluster must run the same version of WebLogic Server.

WebLogic Server supports two types of clusters:

  • Dynamic clusters - Dynamic clusters consist of server instances that can be dynamically scaled up to meet the resource needs of your application. When you create a dynamic cluster, the dynamic servers are preconfigured and automatically generated for you, enabling you to easily scale up the number of server instances when you need additional capacity. Dynamic clusters allow you to define and configure rules and policies to expand or shrink the cluster; a brief WLST sketch after this list shows a minimal dynamic cluster definition.

    In dynamic clusters, the Managed Server configurations are based on a single, shared server template. This greatly simplifies the configuration of clustered Managed Servers, allows servers to be dynamically assigned to machine resources, and provides greater utilization of resources with minimal configuration.

    Dynamic cluster elasticity allows the cluster to be scaled up or down based on conditions identified by the user. Scaling a cluster can be performed on-demand (interactively by the administrator), at a specific date or time, or based on performance as seen through various server metrics.

    When shrinking a dynamic cluster, the Managed Servers are shut down gracefully and in-flight work and transactions are allowed to complete. If needed, singleton services are automatically migrated to another instance in the cluster.

  • Static clusters - In a static cluster, an administrator must configure each new server, add it to the cluster, and start and stop it manually. Expanding or shrinking a static cluster is never automatic.
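The following WLST offline sketch creates a minimal dynamic cluster from a server template; the domain path, names, port, and sizes are placeholders, and attribute names can vary slightly between releases:

    # WLST offline sketch: create a dynamic cluster from a server template.
    readDomain('/u01/domains/mydomain')

    # Shared template from which all dynamic servers are generated.
    create('dyn-template', 'ServerTemplate')
    cd('/ServerTemplates/dyn-template')
    set('ListenPort', 7100)

    # Dynamic cluster that generates servers from the template.
    cd('/')
    create('dyn-cluster', 'Cluster')
    cd('/Clusters/dyn-cluster')
    create('dyn-cluster', 'DynamicServers')
    cd('DynamicServers/dyn-cluster')
    set('ServerTemplate', 'dyn-template')
    set('ServerNamePrefix', 'dyn-server-')
    set('DynamicClusterSize', 4)        # initial number of dynamic servers
    set('MaxDynamicClusterSize', 8)     # upper bound when scaling up
    set('CalculatedListenPorts', false)

    updateDomain()
    closeDomain()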

In most cases, Oracle recommends the use of dynamic clusters to provide elasticity to WebLogic deployments. The benefits of dynamic clusters include minimal configuration, cluster elasticity, and proper migration of JMS and JTA singleton services when the cluster shrinks.

However, there are some cases where static clusters should be used. One such case is when you need to manually migrate singleton services, because dynamic clusters do not support manual migration of singleton services.

Singleton Services

A singleton service is a service running on a Managed Server that is available on only one member of a cluster at a time. WebLogic Server allows you to automatically monitor and migrate singleton services from one server instance to another.

Pinned services, such as JMS-related services and user-defined singleton services, are hosted on individual server instances within a WebLogic cluster. To ensure that singleton JMS or JTA services do not introduce a single point of failure for dependent applications in the cluster, WebLogic Server can be configured to automatically or manually migrate them to any server instance in the cluster.

Within an application, you can define a singleton service that performs tasks you want executed on only one member of a cluster at any given time. Automatic singleton service migration provides automatic health monitoring and migration of user-defined singleton services.
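As an illustration, a user-defined singleton service is registered at the domain level with a class that implements the weblogic.cluster.singleton.SingletonService interface. The following WLST online sketch uses placeholder names:

    # WLST online sketch: register a user-defined singleton service.
    # The class must implement weblogic.cluster.singleton.SingletonService
    # and be available on the server classpath.
    connect('admin', 'password', 't3://adminhost.example.com:7001')
    edit()
    startEdit()

    singleton = cmo.createSingletonService('mySingletonService')
    singleton.setClassName('com.example.MySingletonService')
    singleton.setCluster(getMBean('/Clusters/cluster1'))

    save()
    activate()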

Singleton services described in the following sections include:

Server and Service Migration

Oracle WebLogic Server supports two distinct types of automatic migration mechanisms:

  • Whole server migration, where a migratable server instance, and all of its services, is migrated to a different physical machine upon failure. When a failure occurs in a server that is part of a cluster that is configured with server migration, the server is restarted on any of the other machines that host members of the cluster. See Whole Server Migration in Administering Clusters for Oracle WebLogic Server.

  • Service migration, where failed services are migrated from one server instance to a different available server instance within the cluster. In some circumstances, service migration performs much better than whole server migration because only the singleton services are migrated rather than the entire server. See Service Migration in Administering Clusters for Oracle WebLogic Server.

Oracle recommends using service migration instead of whole server migration. Service migration provides the same high availability protection while using fewer resources. For example, the floating IP addresses required by whole server migration are not needed, and less memory and CPU are consumed because only the critical services are migrated instead of the entire WebLogic Server instance. A brief WLST sketch at the end of this section shows the relevant setting.

Both whole server migration and service migration require that you configure a database leasing table. See Leasing.

Instructions for configuring WebLogic Server to use server and service migration in an MAA environment are provided in Using Whole Server Migration and Service Migration in an Enterprise Deployment in Enterprise Deployment Guide for Oracle SOA Suite.
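For example, automatic service migration for the JMS services hosted on a server's default migratable target can be enabled with an exactly-once policy. The following WLST online sketch uses placeholder names and assumes database leasing is already configured (see Leasing):

    # WLST online sketch: enable automatic migration of the JMS services
    # hosted on server1's default migratable target.
    connect('admin', 'password', 't3://adminhost.example.com:7001')
    edit()
    startEdit()

    cd('/MigratableTargets/server1 (migratable)')
    cmo.setMigrationPolicy('exactly-once')   # restart the services elsewhere on failure

    save()
    activate()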

Data Stores

There are two kinds of persistent data stores for Oracle WebLogic Server transaction logs (TLogs) and Oracle WebLogic Server JMS: database-based and file-based.

Keeping persistent stores in the database provides the replication and high availability benefits inherent in the underlying database system. With JMS, TLogs, and the application data in the same database, and replication handled by Oracle Data Guard, cross-site synchronization is simplified and the middle tier no longer requires a shared storage subsystem such as a NAS or a SAN. See Database.

However, storing TLogs and JMS stores in the database imposes a performance penalty, which increases when one site needs to communicate with the database on the other site. Ideally, from a performance perspective, shared storage that is local to each site should be used for both types of stores, with appropriate storage-level replication and backup strategies provisioned to guarantee zero data loss without performance degradation. Whether database stores are more suitable than shared storage for a system depends on the criticality of the JMS and transaction data, because the level of protection that shared storage provides is much lower than what the database guarantees.

You can minimize the performance impact of database stores, especially under high concurrency, by using techniques such as global hash partitions for indexes (if Oracle Database partitioning is available). For recommendations about minimizing the performance impact, see Using Persistent Stores for TLOGs and JMS in an Enterprise Deployment in Enterprise Deployment Guide for Oracle SOA Suite.

In active-active and active-passive topologies, keeping the data stores in the database is a requirement. Oracle recommends keeping WebLogic Server stores, such as JMS and JTA stores, in a highly available database such as Oracle RAC, and connecting to the database using Active GridLink data sources for maximum performance and availability (see the WLST sketch at the end of this section).

In the case of an active-active stretch cluster, you can choose between keeping the data stores in a shared storage subsystem, such as a NAS or a SAN, or in the database. However, database-based stores are recommended in stretch clusters as well, because high availability and cross-site replication are automatically provided by the underlying Oracle RAC database and Data Guard.

Oracle recommends keeping WebLogic Server stores, such as JMS and JTA stores, and leasing tables in a highly available database such as Oracle RAC, and connecting to the database using Active GridLink data sources. Storing the stores and leasing tables in the database provides the following advantages:

  • Exploits the replication and other high availability aspects inherent in the underlying database system.

  • Enhances handling of disaster recovery scenarios. When JMS, the TLogs, and the application data are in the same database and replication is handled by Data Guard, there is no need to worry about cross-site synchronization.

  • Alleviates the need for a shared storage subsystem such as a NAS or a SAN. Using the database also reduces overall system complexity, because in most cases a database is already present for normal runtime and application work.
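As a concrete example, the following WLST offline sketch moves one server's transaction log into a database store; the domain path, server name, data source, and prefix are placeholders:

    # WLST offline sketch: keep server1's TLog in a JDBC store.
    readDomain('/u01/domains/mydomain')

    cd('/Servers/server1')
    create('server1', 'TransactionLogJDBCStore')
    cd('TransactionLogJDBCStore/server1')
    set('DataSource', 'tlogDataSource')   # existing (ideally Active GridLink) data source
    set('PrefixName', 'TLOG_server1_')    # unique table prefix for this server
    set('Enabled', true)

    updateDomain()
    closeDomain()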

Leasing

Leasing is the process WebLogic Server uses to manage services that are required to run on only one member of a cluster at a time. Leasing ensures exclusive ownership of a cluster-wide entity. Within a cluster, there is a single owner of a lease. Additionally, leases can failover in case of server or cluster failure which helps to avoid having a single point of failure. See Leasing in Administering Clusters for Oracle WebLogic Server.

WebLogic Server provides two types of leasing functionality: non-database consensus leasing and high availability database leasing. In high availability or disaster recovery scenarios, Oracle recommends the use of database leasing.

For database leasing, Oracle recommends the following:

  • A highly available database, such as Oracle RAC, accessed through Active GridLink (AGL) data sources.

  • A standby database, with Oracle Data Guard providing replication between the two databases.

WebLogic Server includes an option to automatically create WebLogic cluster database leasing tables. This option automatically detects that a leasing table is missing, detects the database type, and then finds and runs the appropriate default DDL file to create the table. See High Availability Database Leasing in Administering Clusters for Oracle WebLogic Server.

When using database leasing, WebLogic Server instances may shut down if the database remains unavailable (during switchover or failover) for a period that is longer than their server migration fencing times. You can adjust the server migration fencing times as described in Administering Clusters for Oracle WebLogic Server.
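For reference, database leasing is configured per cluster by setting the migration basis and the leasing data source. A minimal WLST offline sketch, with placeholder names, follows:

    # WLST offline sketch: configure database leasing for a cluster.
    # 'leasingDataSource' must point to the highly available database
    # that holds the leasing table.
    readDomain('/u01/domains/mydomain')

    cd('/Clusters/cluster1')
    set('MigrationBasis', 'database')
    set('DataSourceForAutomaticMigration', 'leasingDataSource')

    updateDomain()
    closeDomain()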

Session Replication

WebLogic Server provides three methods for replicating HTTP session state across servers in a cluster:

  • In-memory replication - Using in-memory replication, WebLogic Server copies the session state from one server instance to another. WebLogic Server creates a primary session state on the server to which the client first connects, and a secondary replica on another WebLogic Server instance in the cluster. The replica is kept up to date so that it can be used if the server that hosts the servlet fails.

  • JDBC-based persistence - With JDBC-based persistence, WebLogic Server maintains the HTTP session state of a servlet or JSP in a database table that is accessed through a JDBC data source. For more information about this and the other persistence mechanisms, including file-based persistence, see Configuring Session Persistence in Developing Web Applications, Servlets, and JSPs for Oracle WebLogic Server.

  • Coherence*Web - Coherence*Web is not a replacement for WebLogic Server's in-memory HTTP state replication services. However, you should consider using Coherence*Web when an application has large HTTP session state objects, when running into memory constraints due to storing HTTP session object data, or if you want to reuse an existing Coherence cluster. See Using Coherence*Web with WebLogic Server in Administering HTTP Session Management with Oracle Coherence*Web.

Choose the replication method that best fits your latency model, tolerance for session loss, and performance requirements.

  • When the latency is small, such as in MAN networks (stretch cluster topology), Oracle recommends WebLogic Server in-memory session replication. However, if a site experiences a failure, sessions may be lost.

  • When the latency is large (WAN networks), in active-active or active-passive topologies, and when your applications cannot tolerate session loss, Oracle recommends database session replication.

In most cases, in-memory session replication performs much better than database session replication. See Failover and Replication in a Cluster in Administering Clusters for Oracle WebLogic Server.
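The replication method is selected per application in the weblogic.xml session descriptor. A minimal sketch of the two most common settings follows; the pool name is a placeholder:

    <!-- In-memory replication when the application is deployed to a cluster -->
    <session-descriptor>
      <persistent-store-type>replicated_if_clustered</persistent-store-type>
    </session-descriptor>

    <!-- JDBC-based persistence in a database table -->
    <session-descriptor>
      <persistent-store-type>jdbc</persistent-store-type>
      <persistent-store-pool>sessionDataSource</persistent-store-pool>
    </session-descriptor>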

Data Sources

WebLogic Active GridLink data sources integrate with Oracle RAC databases and Oracle Data Guard to provide the best performance, high scalability, and the highest availability. The integration with Oracle RAC enables Active GridLink to provide Fast Connection Failover (FCF), Runtime Connection Load Balancing (RCLB), and affinity features. Active GridLink can handle planned maintenance in the database without any interruption to end users while allowing all work to complete.

You can configure your Active GridLink URL to minimize the time to failover between databases. See Supported AGL Data Source URL Formats in Administering JDBC Data Sources for Oracle WebLogic Server.
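For illustration, the following WLST online sketch sets an Active GridLink URL that lists both the primary and the standby SCAN addresses; all host, service, and data source names are placeholders. RETRY_COUNT and CONNECT_TIMEOUT bound the time spent on an unreachable site before the driver moves on:

    # WLST online sketch: set a failover-friendly URL on an existing
    # Active GridLink data source named 'myAGLDS'.
    agl_url = ('jdbc:oracle:thin:@(DESCRIPTION=(CONNECT_TIMEOUT=5)(RETRY_COUNT=3)'
               '(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=prmy-scan.example.com)(PORT=1521)))'
               '(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=stby-scan.example.com)(PORT=1521)))'
               '(CONNECT_DATA=(SERVICE_NAME=myservice.example.com)))')

    connect('admin', 'password', 't3://adminhost.example.com:7001')
    edit()
    startEdit()
    cd('/JDBCSystemResources/myAGLDS/JDBCResource/myAGLDS/JDBCDriverParams/myAGLDS')
    cmo.setUrl(agl_url)
    save()
    activate()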

Security

It is important that you determine your security needs and make sure that you take the appropriate security measures before you deploy WebLogic Server and your Java EE applications into a production environment. See Ensuring the Security of Your Production Environment in Securing a Production Environment for Oracle WebLogic Server.

Storage

The Oracle Fusion Middleware components in a given environment are usually interdependent, so it is important that the components in the topology be kept in sync. The storage artifacts that you need to take into consideration in an MAA environment are classified as either static or dynamic.

  • Static artifacts are files and directories that do not change frequently. These include:

    • Middleware home: The Middleware home usually consists of an Oracle home and an Oracle WebLogic Server home.

    • Oracle Inventory: This includes oraInst.loc and oratab files, which are located in the /etc directory.

  • Dynamic or runtime artifacts are files that change frequently. Runtime artifacts include:

    • Domain home: Domain directories of the Administration Server and the Managed Servers.

    • Oracle instances: Oracle Instance home directories.

    • Application artifacts, such as .ear or .war files.

    • Database artifacts, such as the MDS repository.

    • Database metadata repositories used by Oracle Fusion Middleware.

    • Persistent stores, such as JMS providers and transaction logs. As a best practice for High Availability and Disaster recovery, Oracle recommends storing these in database persistent stores. See Data Stores.

    • Deployment plans: Used for updating technology adapters such as file and JMS adapters. They need to be saved in a location that is accessible to all nodes in the cluster to which the artifacts are deployed.

For maximum availability, Oracle recommends using redundant binary installations. Each node should have its own Oracle home so that when you apply Zero Downtime Patching, only the servers on one node need to come down at a time.

For recommended guidelines regarding shared storage for artifacts such as home directories and configuration files, see Using Shared Storage in High Availability Guide.

Zero Downtime Patching

Zero Downtime Patching (ZDT Patching) provides continuous application availability while rolling out upgrades, even though the possibility of failures during the rollout always exists. In an MAA environment, Oracle recommends patching one site at a time and staggering the update to the other site to ensure that the sites remain synchronized. If a site fails during patching, allow the failed site to resume operation before resuming ZDT Patching.

When using ZDT Patching, consider the following:

  • The rollout shuts down one node at a time, so the more nodes in a cluster, the less impact the rollout has on the cluster's ability to handle traffic.

  • If a cluster has only two nodes, and one node is down for patching, then high availability cannot be guaranteed. Oracle recommends having more than two nodes in the cluster.

  • If you include a Managed Server on the same node as the Administration Server, then both servers must be shut down together to update the Oracle home.

  • Two clusters can have servers on the same node sharing an Oracle home, but both clusters must be shut down and patched together.

  • If your configuration contains two Oracle homes, then Oracle recommends that you create and patch the second Oracle home on a nonproduction machine so that you can test the patches you apply, but this is not required. The Oracle home on that node must be identical to the Oracle home you are using for your production domain.
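A rollout is driven from WLST. The following sketch rolls a patched Oracle home out across a domain, one node at a time; the paths and names are placeholders, and the patched image is prepared beforehand (for example, with OPatch):

    # WLST sketch: rolling rollout of a patched Oracle home.
    connect('admin', 'password', 't3://adminhost.example.com:7001')

    # Each node's servers are gracefully shut down, the Oracle home is
    # replaced, and the servers are restarted before the next node starts.
    progress = rolloutOracleHome('mydomain', '/share/wls_patched.jar',
                                 '/share/home_backup', 'FALSE')
    # progress references a workflow MBean that can be polled for status.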

See Introduction to Zero Downtime Patching in Administering Zero Downtime Patching Workflows.

Coherence

Coherence features such as federated caching, persistence, and GoldenGate HotCache can be used together with WebLogic Server and Oracle Database features to provide the highest level of availability.

The following sections provide the design considerations for Coherence in the supported MAA architectures.

Coherence Persistent Cache

Cached data is persisted so that it can be quickly recovered after a catastrophic failure or after a cluster restart due to planned maintenance. In multi data center environments, Oracle recommends using Coherence persistence and federated caching together to ensure the highest level of protection during failure or planned maintenance events.

Persistence is only available for distributed caches and requires the use of a centralized partition assignment strategy. There are two persistence modes:

  • Active persistence - cache contents are automatically persisted on all mutations and are automatically recovered on cluster/service startup.

  • On-demand persistence - a cache service is manually persisted and recovered upon request using the persistence coordinator.
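For example, assuming the default persistence environment, active persistence can be enabled for all distributed cache services by setting a single system property on each cache server:

    -Dcoherence.distributed.persistence.mode=active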

See Persisting Caches in Administering Oracle Coherence.

Coherence Federated Caching

You can use Coherence federated caching in active-active and active-passive topologies (not stretch clusters). Before doing so, consider these ramifications:

  • Coherence data reaches the other site at some point in an ordered fashion (in Coherence, ordering is per Coherence partition), even after a network partition or a remote cluster outage.

  • The remote site may read stale data for a period of time after the local site has been updated.

  • Update conflicts are possible; Coherence identifies them and calls out to an application-specific conflict resolver.

Coherence federated caching implements an eventual consistency model between sites for the following reasons:

  • The data center can be anywhere; the location is not constrained by latency or available bandwidth.

  • Tolerance for unavailability of the remote data center or cluster is extremely desirable. Note that it is very hard to tell the difference between communications being down and a remote cluster being down, and it is not necessary to differentiate.

  • Full consistency in active-active configurations requires some form of distributed concurrency control across data centers, as well as synchronous writes. This can have a significant impact on performance and is not desirable. Instead, where consistency matters, you can use stretch clusters with synchronous replication. In this case, it is reasonable to assert a maximum latency between data centers, with guaranteed bandwidth.

See Federating Caches Across Clusters in Administering Oracle Coherence.

Coherence GoldenGate HotCache

Within a single Coherence cluster that holds a mix of data, where some of the data is owned by the database and some is owned by Coherence, you can use both Coherence read-through caching and Coherence GoldenGate HotCache.

The choice between HotCache and read-through caching comes down to (1) whether read-through may lead to stale reads because the database is updated outside of Coherence, and (2) whether the real-time nature of HotCache is preferred for other reasons. There are also situations where HotCache and read-through are used together, for example, to push real-time updates through HotCache while using read-through to handle data that was removed from the cache due to eviction or expiration.

See Integrating with Oracle Coherence GoldenGate HotCache in Integrating Oracle Coherence.

Database

Oracle Database provides several features, such as Oracle Data Guard and Oracle Real Application Clusters (Oracle RAC), that can be integrated to provide high availability of the database in MAA architectures. See Oracle Database High Availability and Disaster Recovery. Regardless of the topology, the goal is to minimize the time that switchover and failover of your databases take.

To achieve high availability of your database for both planned and unplanned outages, Oracle recommends using an active-passive configuration with a combination of the following features:

  • Oracle RAC as the highly available database. See Introduction to Oracle RAC in Real Application Clusters Administration and Deployment Guide.

  • Oracle Data Guard because it eliminates single points of failure for mission critical Oracle Databases. It prevents data loss and downtime by maintaining a synchronized physical replica of a production database at a remote location. If the production database is unavailable for any reason, client connections can quickly, and in some configurations transparently, failover to the synchronized replica to restore service. Applications can take advantage of Oracle Data Guard with little or no application changes required. See Introduction to Oracle Data Guard in Data Guard Concepts and Administration.

    In a supported WebLogic Server MAA architecture, Oracle recommends using the Oracle Data Guard maximum availability protection mode. This protection mode provides the highest level of data protection that is possible without compromising the availability of a primary database. It ensures zero data loss except in the case of certain double faults, such as failure of a primary database after failure of the standby database. See Oracle Data Guard Protection Modes in Oracle Data Guard Concepts and Administration.

    Note:

    Oracle Data Guard can be used only in active-passive configurations, but it guarantees zero data loss.

  • Oracle Active Data Guard, an option built on the infrastructure of Oracle Data Guard, allows a physical standby database to be opened read-only while changes are applied to it from the primary database. This enables read-only applications to use the physical standby with minimal latency between the data on the standby database and that on the primary database, even while processing very high transaction volumes at the primary database. This is sometimes referred to as real-time query. See Opening a Physical Standby Database in Oracle Data Guard Concepts and Administration.

    An Oracle Active Data Guard standby database is used for automatic repair of data corruption detected by the primary database, transparent to the application. In the event of an unplanned outage on the primary database, high availability is maintained by quickly failing over to the standby database. An Active Data Guard standby database can also be used to off-load fast incremental backups from the primary database because it is a block-for-block physical replica of the primary database.

    Oracle Active Data Guard provides a far sync feature that improves performance in zero data loss configurations. An Oracle Data Guard far sync instance is a remote Oracle Data Guard destination that accepts redo from the primary database and then ships that redo to other members of the Oracle Data Guard configuration. Unlike a standby database, a far sync instance does not have data files, cannot be opened, and cannot apply received redo. These limitations yield the benefit of using fewer disk and processing resources. More importantly, a far sync instance provides the ability to failover to a terminal database with no data loss if it receives redo data using synchronous transport mode and the configuration protection mode is set to maximum availability. See Using Far Sync Instances in Oracle Data Guard Concepts and Administration.

  • Oracle Data Guard broker as a distributed management framework that automates and centralizes the creation, maintenance, and monitoring of Data Guard configurations. The broker can create, manage, and monitor Data Guard configurations; invoke switchover or failover to initiate and control complex role changes across all databases in the configuration; and configure failover to occur automatically. See Oracle Data Guard Broker Concepts in Data Guard Broker.

    You can enable Oracle Data Guard fast-start failover to fail over automatically when the primary database becomes unavailable. When fast-start failover is enabled, the Oracle Data Guard broker determines whether a failover is necessary and initiates the failover to the specified target standby database automatically, with no need for database administrator intervention (see the DGMGRL sketch after this list). See Managing Fast-Start Failover in Oracle Data Guard Broker.

  • Active GridLink data sources in WebLogic Server make scheduled maintenance on the database servers transparent to applications. When an instance is brought down for maintenance at the database server, draining ensures that all work using instances at that node completes and that idle sessions are removed. Sessions are drained without impacting in-flight work.

  • Application Continuity protects applications during planned and unplanned outages. Use Application Continuity and Active GridLink for maximum availability during unplanned down events. See Ensuring Application Continuity in Real Application Clusters Administration and Deployment Guide.

  • Global Data Services (GDS) or Data Guard broker with Fast Application Notification (FAN) to drain work across sites. When you use Active Data Guard, work can complete before switching over to the secondary database.
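As an illustration of the broker recommendation above, once the broker configuration exists, maximum availability protection and fast-start failover can be enabled with a few DGMGRL commands (the connect identifier is a placeholder); a separate observer process must also be started to monitor the configuration:

    DGMGRL> CONNECT sys@prmy
    DGMGRL> EDIT CONFIGURATION SET PROTECTION MODE AS MaxAvailability;
    DGMGRL> ENABLE FAST_START FAILOVER;
    DGMGRL> SHOW FAST_START FAILOVER;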

If your configuration requires that you have an active-active database configuration, Oracle recommends:

  • Oracle GoldenGate, which allows databases to run in active-active mode, with both read and write services active on the databases at both sites. See Configuring Oracle GoldenGate for Active-Active Configuration in Administering Oracle GoldenGate.

    When using Oracle GoldenGate, Application Continuity and Active GridLink can be used within a site (intra-site) to handle planned and unplanned database down events. Application Continuity cannot be used to replay transactions during failover or switchover operations across sites (inter-site). Application Continuity does not support failover to a logically different database, including Oracle Logical Standby and Oracle GoldenGate, because replay strictly requires a database with verified zero transaction loss.

    Note:

    Because of the asynchronous replication nature of Oracle GoldenGate, applications must tolerate data loss due to network lag.

  • Implementing conflict resolution for all applications and schemas that use Oracle GoldenGate in a full active-active configuration.

  • Designing the environment with web affinity so that a conversation sticks to a single site, which avoids reading stale data. Global Data Services (GDS) can provide affinity to the database that is local to the site and manage global services. See Introduction to Global Data Services in Oracle Database Global Data Services Concepts and Administration Guide.

When environments require an active-active database, a combination of these technologies can be used to maximize availability and minimize data loss in planned maintenance events.