Understanding Oracle Fusion Middleware Disaster Recovery

Oracle Fusion Middleware Disaster Recovery is a disaster recovery solution that provides protection to Oracle Fusion Middleware components in different Oracle product suites.

This chapter includes the following sections:

Disaster Recovery Terminology

Learn about disaster recovery terminology.

The following terms are used in disaster recovery:

Disaster

A sudden, unplanned catastrophic event that causes unacceptable damage or loss in a site or geographical area. A disaster is an event that compromises an organization's ability to provide critical functions, processes, or services for an unacceptable period and causes an organization to invoke its recovery plans.
Disaster Recovery

Ability to safeguard against natural or unplanned outages at a production site by having a recovery strategy for applications and data to a geographically separate secondary site.
Disaster Recovery Topology

The production and secondary site hardware and software components that comprise an Oracle Fusion Middleware Disaster Recovery solution.
Enterprise Deployment Guide

The Enterprise Deployment Guides provide detailed and verified instructions that help you plan, prepare, install, and configure a multi-host, secure, highly available production topology for selected Oracle Fusion Middleware products within the scope of a single data center. One example is the Oracle Fusion Middleware Enterprise Deployment Guide for Oracle SOA Suite.
Alias Host Name

An alias host name is an alternate name for accessing the system in addition to its real network name. Typically, it resolves to the same IP address as the network name of the system. It can be defined in a name resolution system such as DNS, or locally in the hosts file on each system. Multiple alias host names can be defined for a given system.
Physical Host Name

Physical host name is the host name of the system as returned by the gethostname() call or the hostname command. Typically, the physical host name is also the network name used by clients to access the system. In this case, an IP address is associated with this name in DNS (or whichever name resolution mechanism is in use) and this IP address is enabled on one of the network interfaces of the system.

A given system typically has one physical host name. It can also have one or more additional network names that correspond to the IP addresses enabled on its network interfaces and that clients use to access it over the network. Each network name can be aliased with one or more alias host names.
Production or Primary Site

A production or primary site in a disaster protection topology is the site that carries the system's workload at a given point in time. It is a group of hardware, network, and storage resources, and processes, that are actively used to run business logic and process requests at that point in time.
Maximum Availability Architecture

Oracle Maximum Availability Architecture (Oracle MAA) is the best practices blueprint for data protection and availability of Oracle products (Database, Fusion Middleware, Applications). Implementing Oracle MAA best practices is one of the key requirements for any Oracle deployment. It provides recommendations for setting up and managing an Oracle system. Oracle MAA includes the Enterprise Deployment Guide recommendations and adds disaster protection best practices to minimize planned and unplanned downtime for outages affecting an entire data center or region.
Recovery Point Objective (RPO)

Recovery point objective is the amount of data loss that a system can tolerate or is acceptable when an outage takes place, from a business point of view.
Recovery Time Objective (RTO)

Recovery time objective is the amount of downtime a system can tolerate or the acceptable amount of time that an application or service can remain unavailable when an outage takes place, from a business point of view.
Site Failover

Process of making the current secondary site the new production or primary site after the production site becomes unexpectedly unavailable due to a disaster at the production site. The term failover is also used to refer to a site failover in this document.
Site Switchback

Process of reverting the current production site and the current secondary site to their original roles. Switchbacks are planned operations done after the switchover operation is completed. The current secondary site becomes the production site and the current production site becomes the secondary site. The term switchback is also used to refer to a site switchback in this document.
Site Switchover

Process of reversing the roles of the production and secondary site. Switchovers are planned operations done for periodic validation or to perform planned maintenance on the current production site. During a switchover, the current secondary site becomes the new production site and the current production site becomes the new secondary site. The term switchover is also used to refer to a site switchover in this document.
Site Synchronization

Process of applying changes made at the production site to the secondary site. For example, when a new application is deployed at the production site, you should perform a synchronization so that the same application is also deployed at the secondary site.
Secondary or Standby Site

A secondary site is a backup location that can take over the business logic and requests that a primary site was processing. Secondary sites are typically also called "standby" sites because they remain in standby (inactive) mode, meaning that they do not process the production workload during normal operations. However, this does not imply that the secondary site cannot be used for other purposes. In more modern models, the secondary site is used for reporting operations and, more importantly, for validating changes before applying them to the primary site.
Symmetric Topology

An Oracle Fusion Middleware Disaster Recovery configuration that is completely identical across tiers on the production and secondary site. It has an identical number of hosts, load balancers, instances, and applications, with the same ports used at both sites and systems configured with identical capacity. This document describes how to set up a symmetric Oracle Fusion Middleware Disaster Recovery topology for an enterprise configuration.
Asymmetric Topology

An Oracle Fusion Middleware Disaster Recovery configuration that is different across tiers on the production and secondary site. For example, an asymmetric topology can include a secondary site with fewer hosts and instances than the production site.

Note:
Oracle does not recommend using scaled-down secondary systems. Non-symmetric standbys can cause cascading failures if workloads are not handled properly, and they can also lead to misconfigurations and data loss.
System

A System is a set of targets (hosts, databases, application servers, and so on) that work together to host your applications. For example, to monitor an application in Enterprise Manager, you would first create a system that consists of the database, listener, application server, and host targets on which the application runs.
Site

Site is the set of different components in a datacenter needed to run a group of applications. For example, a site could consist of Oracle Fusion Middleware instances, databases, storage, and so on.
Virtual Host Name

Virtual host name is a network addressable host name that can be mapped to one or more physical systems. This can be done by enabling the associated VIP in a node through a load balancer or a hardware cluster.

For load balancers, the term virtual server name is used interchangeably with virtual host name in this document. A load balancer can hold a virtual host name on behalf of a set of servers, and clients communicate indirectly with the systems by using the virtual host name.

In a hardware cluster, a virtual host name is a network host name assigned to a cluster virtual IP. Because the cluster virtual IP is not permanently attached to any particular node of a cluster, the virtual host name is not permanently attached to any particular node either.

In the context of a single host, a virtual host name is an additional host name for accessing the system besides the real network name. It is typically mapped to a virtual IP enabled on the node's network interfaces, or it can be mapped to an existing IP address in the system. In the latter case, it becomes an alias host name of the system, defined in a name resolution system such as DNS or locally in the hosts file.
Virtual IP

Generally, a virtual IP (VIP) is an IP that is assigned to a secondary Network Interface Controller (NIC) or to a Virtual Network Interface Controller (VNIC). The hardware nodes or virtual machines have their own physical IP address and physical host name and can use several additional VIP addresses. These VIP addresses "float" or can be migrated between different nodes. VIPs are also used in load balancers and hardware clusters. A VIP presents a single entry point IP address that abstracts accessors from the backend points and can be migrated or moved across nodes for different purposes.

Traditionally, hardware clusters use a cluster virtual IP to present to the outside world the entry point into the cluster. The hardware cluster’s software manages the movement of this IP address between the two physical nodes of the cluster, while clients connect to this IP address without the need to know which physical node this IP address is currently active on.

Currently, virtual IPs are also managed manually or through application servers (for example, WebLogic provides virtual IP migration functionality with server migration) when a specific component needs to be failed over, transparently to consumers, to different hardware.

A load balancer also uses a virtual IP as the entry point to a set of servers. These servers tend to be active at the same time. This virtual IP address is not assigned to any individual server but to the load balancer that acts as a proxy between servers and their clients.
WebLogic Whole Server Migration

Whole server migration occurs when a WebLogic Server instance migrates to a different physical system upon failure.
WebLogic Service Migration

Service-level migration occurs when services running in a WebLogic Server move to a different WebLogic Server instance within the cluster.
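
As a rough illustration of the physical host name and alias host name terms defined above, the following Python sketch reads the physical host name with socket.gethostname() and checks whether two names resolve to the same IP address. The helper is_alias_of, the host names, and the IP addresses are hypothetical and are not part of any Oracle tooling. The resolver is injectable so that the check can be tried against an in-memory table instead of a real DNS or hosts file.

```python
import socket
from typing import Callable


def is_alias_of(alias: str, network_name: str,
                resolve: Callable[[str], str] = socket.gethostbyname) -> bool:
    """Return True when both names resolve to the same IP address."""
    try:
        return resolve(alias) == resolve(network_name)
    except OSError:
        # A name that cannot be resolved is not treated as an alias.
        return False


# Hypothetical name resolution table standing in for DNS or a hosts file.
HOSTS = {
    "host1.example.com": "10.0.0.11",
    "soahost1.example.com": "10.0.0.11",  # alias of host1.example.com
    "host2.example.com": "10.0.0.12",
}


def lookup(name: str) -> str:
    """Resolve a name from the in-memory table, failing like a DNS miss."""
    if name not in HOSTS:
        raise OSError(f"cannot resolve {name}")
    return HOSTS[name]


if __name__ == "__main__":
    # Physical host name, the same value that the hostname command reports.
    print("physical host name:", socket.gethostname())
    print(is_alias_of("soahost1.example.com", "host1.example.com", resolve=lookup))
    print(is_alias_of("soahost1.example.com", "host2.example.com", resolve=lookup))
```

Leaving the resolve argument at its default makes the same check use the name resolution that is configured on the host where the script runs.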

Overview of Oracle Fusion Middleware Disaster Recovery

Learn about Oracle Fusion Middleware Disaster Recovery.

A disaster protection strategy must address the different phases in a system’s life cycle:

Initial Setup

Configuring the system initially to get an initial replica of the primary system in a secondary location.
Managing Switchover and Failover

Moving workloads to the secondary location in an event of a planned or unplanned downtime affecting an entire data center or geographical region.
Maintenance
- Ongoing Synchronizations
  
  Keeping the secondary location up to date with the configuration, metadata, and runtime data when they are modified in the primary system.
- Patching
  
  Applying patches to the disaster recovery topology.
- Scale Out Operations
  
  Scaling the secondary system to match when the primary system is scaled.

Before running the initial setup and preparing the lifecycle of a disaster protection system, it is crucial to understand the critical aspects that will drive its implementation. It is also important to differentiate between features that provide protection against local failure and those that protect against a disaster.

The other two main areas that will drive decisions for different variations in the disaster protection configuration are data replication (how a system is replicated to a secondary location) and virtualization of the configuration (how to make the configuration used in the primary, valid in the secondary). While following the Oracle MAA recommendations, it is important to understand that a disaster solution needs to use a secondary system that is a replica of the primary so that it can become a first class production site itself in the event of a total loss of the primary. The following sections provide more details in these areas.

Disaster Protection vs Local Failure Protection

Providing Oracle Maximum Availability Architecture is one of the key requirements for any Oracle Fusion Middleware deployment. An Oracle Fusion Middleware Enterprise Deployment Guide provides the best practices and recommendations within the scope of a single data center (see the Enterprise Deployment Guide for Oracle SOA Suite). Oracle Fusion Middleware includes an extensive set of high availability features, such as process death detection and restart, server clustering, service migration, cluster integration, GridLink, load balancing, failover, backup and recovery, rolling upgrades, and rolling configuration changes, which protect an enterprise deployment from unplanned downtime and minimize planned downtime.

Most of the downtime experienced by Fusion Middleware systems is caused by local failures. These are failures that affect a component or part of the resources in a data center but that can be corrected with local redundancy for that specific component. Such outages typically do not render an entire data center inaccessible. Therefore, disaster protection only makes sense on top of an existing strategy against these local failures. Complete-site outages and downtime affecting entire regions occur much less frequently than local storage crashes, hypervisor failures, local network issues, and so on. To provide protection against local failures, follow the recommendations provided in the Enterprise Deployment Guide for your Fusion Middleware component. Enterprise Deployment Guides are the foundation on top of which this disaster protection guide is built.

In addition to these local failures, enterprise deployments need protection from unforeseen disasters and natural calamities that can bring down an entire data center or geographical area. A Maximum Availability Architecture for Fusion Middleware implements all the best practices prescribed by the Fusion Middleware Enterprise Deployment Guide and adds disaster protection. A disaster protection solution involves setting up a secondary site at a geographically different location with services and resources equal to those of the production site. Oracle recommends configuring symmetrical topology and capacity at both the production and the secondary sites to prevent inconsistencies at the functional and performance levels.

The secondary site is normally in passive mode. This deployment model is sometimes referred to as an active-passive or active-standby model. "Passive" in this context means that the secondary site is not processing the production workload that the primary is processing at that point in time. However, it does not mean that the secondary system cannot be used during normal operation. Secondary systems in the DR configurations proposed by this guide are used to verify new applications, validate patches, or run workload tests before applying those changes to the primary system. This model is usually adopted when the two sites are connected over a WAN and network latency does not allow clustering across the two sites.

Data Replication

Most Oracle Fusion Middleware components are stateful. Different data types stored in different persistent formats need to be copied from the primary site to the secondary site. Application data, metadata, configuration data, and security data need to be replicated periodically to the secondary site. This is done to ensure that in a switchover or failover scenario, the reply from the new active site will be perfectly consistent with the one that was offered from the original primary. Different WebLogic and Fusion Middleware components store configuration information in the file system. Additionally, artifacts such as keystores (for Identity and Trust) are critical pieces of the Oracle Fusion Middleware 14.1.2 Enterprise Deployment SSL configuration. These stores can reside in external vaults and also in the file systems.

The Oracle Fusion Middleware Disaster Recovery solution can use different replication technologies for disaster protection of Oracle Fusion Middleware middle-tier components. It can use storage level replication and is compatible with third-party storage vendor recommended solutions. It can also use other supported methods to replicate the Fusion Middleware middle-tier configurations like DBFS or rsync. Although a single replication strategy is typically used for all file systems, different data types can use different approaches according to the RTO, RPO, and consistency needs in each case.
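
To make the rsync option more concrete, the following Python sketch builds (but does not run) an rsync command line that mirrors a directory to the same directory path on a peer host at the secondary site. It is only an illustration under stated assumptions: the function name build_rsync_command, the oracle user, the host name, and the /u01/oracle/config path are all hypothetical, and the options shown (-a for archive mode, -z for compression, --delete to remove files that no longer exist at the source) are generic rsync options rather than an Oracle-prescribed set.

```python
import shlex
from typing import List


def build_rsync_command(path: str, standby_host: str, user: str = "oracle") -> List[str]:
    """Build an rsync invocation that mirrors path to the same path on the peer host."""
    # A trailing slash tells rsync to copy the directory contents, not the directory itself.
    src = path.rstrip("/") + "/"
    # The destination uses the identical directory path on the secondary site peer.
    dest = f"{user}@{standby_host}:{src}"
    return ["rsync", "-az", "--delete", src, dest]


if __name__ == "__main__":
    # Hypothetical path and host; the command is only printed, never executed.
    command = build_rsync_command("/u01/oracle/config", "standby-host1.example.com")
    print(shlex.join(command))
```

Using the same path for the source and the destination keeps the directory layout identical at both sites, so the replicated configuration can be used at the secondary without path changes.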

Replication and disaster protection for the Oracle databases used by Oracle Fusion Middleware is provided through Oracle Data Guard. This is the only supported configuration to protect the Oracle Fusion Middleware against disaster with a remote mirror configuration.

The replication frequency for the different types of data (whether in the database or on a file system) should be as high as the system's recovery point objective (RPO) demands. The time consumed in transitioning from a primary system to a secondary system should be as short as the system's recovery time objective (RTO) requires.
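
The relationship between replication frequency and RPO can be sketched with simple arithmetic. The following Python example assumes a simplified model of periodic file replication in which a change made just after one cycle starts is only safe on the secondary once the next cycle finishes copying, so the worst-case data loss is the replication interval plus the copy duration. The function names and the sample values are hypothetical and serve only to illustrate the check.

```python
from datetime import timedelta


def worst_case_data_loss(interval: timedelta, copy_duration: timedelta) -> timedelta:
    """Simplified worst case: one full interval plus the time a copy takes."""
    return interval + copy_duration


def meets_rpo(interval: timedelta, copy_duration: timedelta, rpo: timedelta) -> bool:
    """True when the worst-case data loss stays within the RPO."""
    return worst_case_data_loss(interval, copy_duration) <= rpo


def meets_rto(transition_time: timedelta, rto: timedelta) -> bool:
    """True when the measured primary-to-secondary transition fits within the RTO."""
    return transition_time <= rto


if __name__ == "__main__":
    # Hypothetical values: replicate every 30 minutes, each copy takes 5 minutes.
    interval = timedelta(minutes=30)
    copy_duration = timedelta(minutes=5)
    print("worst-case data loss:", worst_case_data_loss(interval, copy_duration))
    print("meets 1 hour RPO:", meets_rpo(interval, copy_duration, timedelta(hours=1)))
    print("meets 15 minute RPO:", meets_rpo(interval, copy_duration, timedelta(minutes=15)))
```

Under this model, a stricter RPO can only be met by replicating more often or by reducing the amount of data copied in each cycle.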

Access Points and Configuration Virtualization

Using the primary's configuration and metadata in the secondary without any modifications is also a key aspect of an appropriate disaster protection solution. Disaster protection solutions should avoid manipulating the primary configuration to adjust it to the secondary. These manipulations increase RTO in failover scenarios and are difficult to maintain as applications evolve. The less information that needs to be replicated, the more frequently the replication cycles can be scheduled, thus reducing the system's recovery point objective. To make the configuration agnostic to whether it is run from primary or standby, the following requirements must be met:

- Clients and other applications or services accessing the system should continue using the same address after a switchover or failover, without having to change the hostname that is used to access the failed-over resources. Failures should be transparent to consumers, especially when these front-end addresses are public and used by thousands of browsers or devices.
- All the listen addresses used by different components in the Fusion Middleware system (besides the system’s front-end addresses) should be hostnames that can be activated in both locations (mapping to a different IP in each location). This avoids the need to replace the listen addresses in the configuration that the secondary receives from the primary.
- Any external dependencies (such as services that are not part of the Fusion Middleware domain) should be accessible with the same configuration from both the primary and the secondary. This includes external hostnames, storage, and network resources. All of them should be equally accessible in both regions.
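
The listen address requirement above can be sketched as a simple portability check. The following Python example assumes that each site's name resolution is available as a plain mapping from host name to IP address. It flags listen addresses that are IP literals (which would be valid at one site only) and host names that do not resolve at every site. The function names, host names, and addresses are hypothetical and only illustrate the idea.

```python
import ipaddress
from typing import Dict, List


def is_ip_literal(value: str) -> bool:
    """Return True when value is an IPv4 or IPv6 address rather than a host name."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def portability_issues(listen_addresses: List[str],
                       site_resolution: Dict[str, Dict[str, str]]) -> List[str]:
    """Report listen addresses that would need editing after replication."""
    issues = []
    for address in listen_addresses:
        if is_ip_literal(address):
            issues.append(f"{address}: IP literal, valid at one site only")
            continue
        for site, table in site_resolution.items():
            if address not in table:
                issues.append(f"{address}: does not resolve at {site}")
    return issues


if __name__ == "__main__":
    # Hypothetical data: the same host names map to a different IP at each site.
    resolution = {
        "primary": {"soahost1.example.com": "10.0.0.11", "soahost2.example.com": "10.0.0.12"},
        "secondary": {"soahost1.example.com": "10.1.0.11"},
    }
    addresses = ["soahost1.example.com", "soahost2.example.com", "10.0.0.13"]
    for issue in portability_issues(addresses, resolution):
        print(issue)
```

An empty result means that the replicated configuration can be started at either site without changing any listen address.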

Symmetry Requirements

The Oracle Fusion Middleware Disaster Recovery topology uses a replica of the primary in the secondary site. Oracle does not recommend using scaled-down secondary systems. Non-symmetric secondary systems can cause cascading failures if the workload is not handled properly, and they can also lead to misconfiguration and data loss. Symmetry between the two sites is based on the following:

Hardware, Nodes, and Infrastructure Resources

The production and secondary sites have an identical number of hosts, load balancers, instances, and applications. The same ports are used at both sites. The systems are configured with identical capacity and should be capable of sustaining the exact same workload.
Directory Names and Paths

Every file that exists at the production site host must exist in the same directory path at the secondary site peer host. Therefore, Oracle Home names and directory paths for WebLogic domains, deployments, and configuration must be the same at the production and secondary site.
Port Numbers

Port numbers are used by listeners for routing of requests. Port numbers are stored in the configuration and must be the same at the production site host and their secondary site peer hosts.
Security

The same user accounts must exist at both the production and secondary sites. The same central LDAP content and policies must be accessible from both locations. You must also configure the file system and SSL identically at the production and secondary sites. For example, in the Oracle Fusion Middleware Enterprise Deployment Guide for Oracle SOA Suite, the production site uses SSL, so the secondary site must also use SSL, configured in exactly the same way as in the primary.
Load Balancers and Virtual Server Names

A front-end load balancer should be set up with virtual server names for the production site and an identical front-end load balancer should be set up with the same virtual server names for the secondary site.
Software

The same versions of software must be used at the production and secondary site. The operating system patch level must also be the same at both sites and patches to Oracle or third-party software must be applied to both the production and secondary site.
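
The symmetry requirements listed above lend themselves to an automated comparison. The following Python sketch assumes that the relevant properties of each site have already been collected into a small record, and it reports every property that differs between the production and secondary sites. The SiteDescription class, its field names, and the sample values are hypothetical; a real check would gather this data from the hosts themselves.

```python
from dataclasses import dataclass, fields, replace
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SiteDescription:
    """Hypothetical summary of the properties that must match across sites."""
    host_count: int
    listen_ports: Dict[str, int]
    oracle_home: str
    domain_home: str
    software_versions: Dict[str, str]
    virtual_server_names: Tuple[str, ...]


def symmetry_differences(production: SiteDescription,
                         secondary: SiteDescription) -> List[str]:
    """List every field whose value differs between the two sites."""
    differences = []
    for f in fields(SiteDescription):
        prod_value = getattr(production, f.name)
        sec_value = getattr(secondary, f.name)
        if prod_value != sec_value:
            differences.append(
                f"{f.name}: production={prod_value!r}, secondary={sec_value!r}")
    return differences


# Hypothetical production site description.
production = SiteDescription(
    host_count=4,
    listen_ports={"AdminServer": 7001, "WLS_SOA1": 7004},
    oracle_home="/u01/oracle/products/fmw",
    domain_home="/u01/oracle/config/domains/soadomain",
    software_versions={"os_patch_level": "example-1", "fmw": "14.1.2"},
    virtual_server_names=("soa.example.com", "admin.example.com"),
)

if __name__ == "__main__":
    # A scaled-down secondary with fewer hosts is reported as asymmetric.
    secondary = replace(production, host_count=2)
    for difference in symmetry_differences(production, secondary):
        print(difference)
```

An empty list indicates that, for the properties captured in the record, the two sites are symmetric.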

Oracle Fusion Middleware Disaster Recovery Architecture Overview

Learn about the typical topology and main aspects in a disaster recovery solution for an Oracle Fusion Middleware enterprise deployment.

Figure 1-1 shows an overview of an Oracle Fusion Middleware Disaster Recovery topology for an on-premises deployment.

Figure 1-1 Production and Standby Sites for Oracle Fusion Middleware Disaster Recovery Topology

The primary system is constructed following the Oracle Fusion Middleware Enterprise Deployment Guide. Some of the additional key aspects of the solution in Figure 1-1 are described below:

- The solution involves two sites. The current production site is running and active, while the second site is serving as a secondary site and is in passive mode.
- Hosts on each site have mount points that are defined for accessing the shared storage system for the site, as prescribed in the applicable EDG. A replication technology (storage level replication, rsync, or DBFS) is used to copy the middle-tier file systems and other data from the shared storage of the production site to the shared storage of the standby site.
- After file replication is enabled, application deployment, configuration, metadata, data, and product binary information is replicated from the production site to the standby site.
- It is not necessary to perform any Oracle software installations at the standby site hosts. When the production site storage is replicated to the standby site storage, the equivalent Oracle Home directories and data are written to the standby site storage.
- Oracle Data Guard is used to replicate all Oracle database repositories, including Oracle Fusion Middleware repositories and custom application databases. For more information about the disaster protection provided by Oracle Data Guard, see Oracle Data Guard.
- Middle tiers in each region connect only to the database that is local to that region. Cross-region connections, from the middle tier in the primary to the database in the secondary and from the middle tier in the secondary to the database in the primary, should be avoided because, depending on factors like firewalls and host name resolution, these connections could hang and affect the health of the Fusion Middleware system.
- During normal operation, user requests are routed to the production site.
- When there is a failure or planned outage of the production site, the following summary steps are executed so that the secondary site assumes the primary role in the topology:
  1. File replication from the production to the secondary site is stopped (when a failure occurs, replication may have already been stopped due to the failure).
  2. A failover or switchover of the Oracle databases is performed using Oracle Data Guard.
  3. The services and applications on the secondary site are started.
  4. User requests are rerouted to the secondary site by using a global load balancer or a DNS change. At this point, the secondary site has assumed the production role.
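
The ordering of the summary steps above can be sketched as a small runner that executes each step in sequence and stops at the first failure, so that user requests are never rerouted to a site that is not ready. This Python example is only an outline under stated assumptions: the step functions are hypothetical placeholders that print a message, and a real procedure would call the storage replication tooling, Oracle Data Guard, the server start scripts, and the global load balancer or DNS instead.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_role_transition(steps: List[Step]) -> List[str]:
    """Run the steps in order and return the names of the completed ones.

    An exception raised by any step propagates, so later steps never run.
    """
    completed = []
    for name, action in steps:
        action()
        completed.append(name)
    return completed


# Hypothetical placeholders for the four summary steps.
def stop_file_replication() -> None:
    print("stopping file replication from production to secondary")


def transition_databases() -> None:
    print("performing database failover or switchover (Oracle Data Guard)")


def start_secondary_services() -> None:
    print("starting services and applications on the secondary site")


def reroute_user_requests() -> None:
    print("rerouting user requests (global load balancer or DNS change)")


TRANSITION_STEPS: List[Step] = [
    ("stop file replication", stop_file_replication),
    ("transition databases", transition_databases),
    ("start secondary services", start_secondary_services),
    ("reroute user requests", reroute_user_requests),
]

if __name__ == "__main__":
    print(run_role_transition(TRANSITION_STEPS))
```

Keeping the steps in a single ordered list makes the sequence easy to review and to reuse for both planned switchovers and unplanned failovers.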

The following chapters provide details about how to configure the disaster protection system initially and how to manage the system through its lifecycle.

Oracle Fusion Middleware Disaster Recovery in Oracle Cloud Infrastructure

Learn how Oracle Cloud Infrastructure can be used to host a secondary system for an on-premises primary deployment.

For cases where the secondary system resides in Oracle Cloud Infrastructure (OCI), Oracle provides a framework that analyzes the primary system, creates the peer resources in the secondary, and replicates the entire Fusion Middleware system to the secondary. The framework is open source and is available on GitHub. It creates and configures a symmetric disaster recovery system in OCI for an existing Oracle WebLogic or Fusion Middleware domain environment based on JEE/Jakarta components. The framework does not cover or address system components such as LDAPs; it is intended for systems that are based on standard WebLogic deployments. It offers its greatest degree of automation when the primary environment follows the Enterprise Deployment Guide. It maintains an inventory of created resources that can be easily cleaned up or reclaimed, thus allowing a quick disaster recovery deployment and verification without incurring high costs. These architectures are usually referred to as hybrid disaster protection architectures, and they provide many benefits compared to "on-premises to on-premises" disaster protection systems:

- Allows a gradual and easy move of workloads to the cloud. The secondary system can be used as a test bed for moving systems to the cloud. It allows you to become familiar with cloud infrastructure in a quick and versatile way.
- Leverages OCI high availability and reliability features. Oracle Cloud Infrastructure provides many high availability features, including fault domains and availability domains, continuous mirroring for storage at different levels, and redundancy for load balancers and network devices, among other things.
- Reduces costs compared to a full on-premises secondary because the management and administration overhead is minimal: shared storage, network, compute, and many other infrastructure pieces are managed directly by Oracle Cloud. Oracle Cloud universal credits can be used to provision the secondary, and if after some tests you decide not to use the DR configuration, the secondary system can be cleaned up quickly and the credits reclaimed for other purposes.

Apart from these generic benefits, Oracle’s WebLogic hybrid disaster recovery framework completely automates the disaster protection setup experience through a reliable process that avoids human errors and implements many MAA best practices.

Note that the primary system can also be on OCI. The framework can be used to create a secondary copy in OCI even if the primary was installed in OCI using manual procedures or Oracle Marketplace Stacks. The setup procedures, replication technologies, and configuration in general are specific to OCI, and precise implementation details are provided for these cases. For more information about the topology and how to use the different tools provided with it, see the WLS_HYDR framework page on GitHub.