About Managing Failures
The Oracle Fusion Middleware (FMW) stretched cluster topology is resilient to failures in any individual component.
Each site follows the high availability best practices outlined in the Oracle FMW Enterprise Deployment topology, ensuring that local redundancy protects against disruptions at the component level (such as the load balancer, Oracle HTTP Server instances, Oracle WebLogic Server, or the database instance).
The considerations for addressing scenarios such as a complete tier failure within a site or a total site failure are discussed in the following sections.
Manage the Failure of All the Web Servers on One Site
All the client requests, regardless of their site preference, will be routed to the other site.
Hence, the Oracle WebLogic servers of the site with the failed web servers will not receive new requests. They may continue some processing (for example, processing Oracle SOA composites and Java Message Service (JMS) messages). However, any HTTP callbacks generated internally by these servers will fail, because they point to their own site, whose web servers have failed.
The following diagram shows failures in all the web servers on one site:
Recover From the Failure
Clients are automatically redirected to the other site thanks to Oracle Cloud Infrastructure (OCI) Traffic Management steering policies or the global load balancer (GLBR).
If the restoration of the lost web server instances is not possible in the short term, you can perform the following to use the WebLogic Servers of the site with the failed web tier:
- Configure the Oracle HTTP Server (OHS) instances at the other site to route requests to the Oracle WebLogic Server instances on the failed site:
  - Update the OHS configuration and set the DynamicServerList parameter to ON.
  - Apply this change by restarting the OHS instances in a rolling fashion to avoid downtime.
  - Additionally, ensure that cross-region communication is permitted from the web servers to the WebLogic servers at the other site.
- To prevent failures in Hypertext Transfer Protocol (HTTP) callbacks originating from the site with unavailable OHS servers, update the entry for the frontend name in the WebLogic server hosts' /etc/hosts file (or private domain name system (DNS)) so that it points to the load balancer at the other site.
- Start the OHS processes in the failed site.
As soon as the Oracle Cloud Infrastructure Health Checks are OK again, the traffic management steering policy will load balance the client requests between both sites, as per the rules defined.
- Set the DynamicServerList parameter to OFF again in the other site.
- Revert any change in the WebLogic servers' /etc/hosts file (or private DNS) so that they point to their own site load balancer again.
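For illustration, the routing change on the surviving site's OHS instances might look like the following mod_wl_ohs.conf and /etc/hosts fragments. The location, hostnames, ports, and IP address are placeholders, not values from the tests above.

```
# mod_wl_ohs.conf fragment (placeholder hostnames and ports):
<Location /soa-infra>
    WLSRequest ON
    # Seed the list with the WebLogic servers of both sites; with
    # DynamicServerList ON, OHS refreshes the member list from the cluster.
    WebLogicCluster wlshost1-site1:8001,wlshost2-site1:8001,wlshost1-site2:8001,wlshost2-site2:8001
    DynamicServerList ON
</Location>
```

```
# /etc/hosts entry on the WebLogic hosts of the failed site, pointing the
# frontend name at the healthy site's load balancer (placeholder values):
203.0.113.20   fmw.example.com
```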
The following image shows Java Message Service (JMS) queue messages and client failed requests during a failure of all the web servers on one site:
Expected Recovery Time Objective
The DNS update affects clients whose steering policy preference is set to the failed region. The TTL value determines how long these clients continue using the old entry before it is updated to point to the healthy site. The additional time (around 1 minute) depends on the frequency and timeout of the health checks configured in the OCI steering policy (the test above used a 30-second health check interval and a 10-second timeout).
When using a global load balancer (GLBR), the outage time depends on the frequency of the health checks configured in the GLBR. As soon as the GLBR marks a pool as unhealthy, the incoming requests will be redirected to the other site. With a GLBR, there is no DNS update, so the TTL value of the front-end entry is irrelevant.
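The timing above can be sketched with a quick back-of-the-envelope calculation. The number of consecutive failed probes needed before the endpoint is marked unhealthy and the client-side TTL are assumptions for illustration; take the real values from your own OCI Health Checks and DNS configuration.

```python
# Rough worst-case time for a client with a cached DNS answer to be
# steered to the healthy site. All parameter values are illustrative.

def worst_case_failover_seconds(probe_interval, probe_timeout,
                                failures_to_unhealthy, dns_ttl):
    # Time to declare the endpoint unhealthy: a run of failed probes,
    # plus the timeout of the last probe...
    detection = failures_to_unhealthy * probe_interval + probe_timeout
    # ...plus however long the client keeps the stale DNS answer cached.
    return detection + dns_ttl

# 30 s interval and 10 s timeout match the test described above; the
# 2-probe unhealthy threshold and 30 s TTL are assumptions.
print(worst_case_failover_seconds(30, 10, 2, 30))  # 100 seconds
```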
Manage the Failure of All the Oracle WebLogic Servers on One Site
The failed site’s load balancer will return a failed response, so the frontend global balancing feature, based on Oracle Cloud Infrastructure (OCI) Traffic Management steering policies and health checks, should mark the site as unhealthy. All the client requests, regardless of their preference, will be routed to the other site.
The WebLogic Java Message Service (JMS) and Java Transaction API (JTA) services will automatically migrate to the servers in the other site when using Automatic Service Migration along with Java Database Connectivity (JDBC) persistent stores.
In the Oracle Fusion Middleware (FMW) SOA case, if the automatic recovery cluster master was hosted in the failed servers, a new cluster master will arise in the available site. This server performs automated recovery of SOA instances initiated on the other site.
The following diagram shows the failure of all the WebLogic servers on one site:
The following image shows client failed requests and JMS messages per server when all the WebLogic Servers fail on a site.
In the JMS messages graph, there are four lines, each representing a server’s JMS queue. The green and blue lines (which are almost overlapped) correspond to the servers that were killed. The number of JMS messages for these queues doesn’t increase after the outage begins.
The red and yellow lines represent the servers that remain up in region 2. When all requests are redirected to this region, each remaining server receives 50% of the total load. However, the rate at which messages accumulate in their queues is different. This is because the JMS servers of the failed servers migrated to one of the remaining servers, so that server now processes the messages of three queues. As a result, the slope appears lower for the yellow line (note that the monitoring tool does not display the message counts for the migrated queues).
Recover From the Failure
After the failed servers are available again:
- Start the managed servers in the failed site.
- As soon as the Oracle Cloud Infrastructure Health Checks are healthy again, the Traffic Management steering policy will load balance the client requests between both sites, as per the rules defined.
Expected Recovery Time Objective
This is similar to the scenario where all the web servers on one site fail.
The DNS update affects clients whose steering policy preference is set to the failed region. The TTL value determines how long these clients continue using the old entry before it is updated to point to the healthy site. The additional time (around 1 minute) depends on the frequency and timeout of the health checks configured in the OCI Traffic Management steering policy (the test used a 30-second health check interval and a 10-second timeout).
When using a global load balancer (GLBR), the outage time depends on the frequency of the health checks configured in the GLBR. As soon as the GLBR marks a pool as unhealthy, the incoming requests will be redirected to the other site. With a GLBR, there is no DNS update, so the TTL value of the front-end entry is irrelevant.
Manage Failures in the Database: Data Guard Switchover and Failover
The Java Database Connectivity (JDBC) URL string and Oracle Notification Service (ONS) configuration provided earlier in “Set Up WebLogic Data Sources” ensure that reconnection to the new primary database happens automatically. For these tests (using Oracle Fusion Middleware (FMW) SOA FOD, even with high workloads of 160 concurrent invocations), the database switchover or failover takes less than a couple of minutes. This time can vary based on system configuration and environment. In well-tuned systems, switchover times of 1-5 minutes are common, but factors such as system size, resources, workload, redo log synchronization, and network performance can impact the total duration. See the Explore More section for links to Oracle Data Guard documentation and other resources.
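As a sketch of the kind of connect string referenced above, a Data Guard-aware JDBC URL usually lists the scan addresses of both sites and relies on RETRY_COUNT and RETRY_DELAY to ride through role transitions. Every hostname, port, service name, and timeout value below is a placeholder.

```
jdbc:oracle:thin:@(DESCRIPTION=
  (CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)
  (RETRY_COUNT=20)(RETRY_DELAY=3)
  (ADDRESS_LIST=(LOAD_BALANCE=ON)
    (ADDRESS=(PROTOCOL=TCP)(HOST=site1-scan.example.com)(PORT=1521)))
  (ADDRESS_LIST=(LOAD_BALANCE=ON)
    (ADDRESS=(PROTOCOL=TCP)(HOST=site2-scan.example.com)(PORT=1521)))
  (CONNECT_DATA=(SERVICE_NAME=fmwsvc.example.com)))
```

The service name should be a role-based database service that runs only on the current primary, so that the data sources, together with ONS/FAN notifications, reconnect to whichever site holds the primary role.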
During a switchover or failover of the database, there will be application errors. Also, the WebLogic Servers using service migration may be shut down and restarted automatically by the Node Manager if they are not able to update their leasing table. The expected behavior with the default leasing parameters is:
- If the database outage is very short (<1-2 min), no WebLogic server auto-restart is expected.
- If the database outage is longer (2-10 min), the WebLogic servers may
auto-restart due to “lost a lease” when the database starts again.
The lower limit can be increased by tuning WebLogic's database leasing retries, as described earlier in "Configure Tuning WebLogic Database Leasing".
- If the database outage is much longer (>10 min), then WebLogic servers can auto-restart due to other failures like losing access to critical JDBC stores (“JDBC store of JTA is unavailable”).
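As a rough sketch of that tuning trade-off: database leasing keeps retrying the connection to the leasing table, so the outage window a server can ride out before “losing its lease” grows with the retry count and delay. The retry values below are hypothetical, not product defaults.

```python
# Approximate database outage a WebLogic server can tolerate before its
# database lease is considered lost. Values are illustrative only.

def leasing_outage_tolerance_seconds(retry_count, retry_delay_seconds):
    # The server only gives up (and may be auto-restarted by Node Manager)
    # after every retry against the leasing table has failed.
    return retry_count * retry_delay_seconds

# Hypothetical tuning: 60 retries, 5 s apart -> about 5 minutes tolerated,
# which raises the 1-2 minute lower limit discussed above.
print(leasing_outage_tolerance_seconds(60, 5))  # 300 seconds
```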
The following diagram shows the database switchover in the FMW stretched clusters topology:
The following image shows client requests performance and Java Message Service (JMS) messages per server queue during a database switchover in a FMW stretched cluster.
Recover From the Failure
After the failed servers are available again:
- Reinstate the failed database if you performed a database failover.
This action is not required if you performed a switchover.
- Perform a database switchback to the original site.
Expected Recovery Time Objective
For the tests performed, the switchover takes less than 2 minutes.
For an unplanned switchover or failover, the total downtime depends on the time the database was down:
- If you perform the database failover or switchover almost immediately, then the total time to recover is short. It depends on the time that the database needs for the switchover or failover. For the tests performed, the switchover takes less than 2 minutes, so the expected recovery time objective (RTO) is:
  RTO = DB DOWNTIME + SHORT TIME (1-2 min)
- If the database downtime is longer, there can be additional errors, such as Oracle WebLogic Server auto-restarts, that increase the RTO. In this case, the expected RTO is:
  RTO = DB DOWNTIME + WEBLOGIC START TIME
Manage Failures in the WebLogic Administration Server
Node Manager will automatically restart the failed server in place.
However, you need to fail over the Administration Server to a different node if an outage completely affects the host where the Administration Server runs.
Essentially, this consists of restarting the Administration Server in a different node, ensuring it points to the location that contains the Administration Server domain directory and that it uses a listen address that maps to the appropriate virtual IP (VIP).
This Administration Server domain directory may be a shared storage location available to different nodes in the same region, or a restore from a backup or file system replication made available to nodes in a different region.
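At a high level, the manual failover might look like the following sketch. The VIP address, network interface, and domain path are placeholders, and the authoritative steps are those in your Enterprise Deployment Guide procedures.

```
# Sketch only; adapt names and paths to your environment.
# 1. On the new node, bring up the Administration Server VIP:
sudo ip addr add 100.64.10.5/24 dev eth0        # placeholder VIP/interface
# 2. Make the Administration Server domain directory available
#    (shared storage mount, or restore from backup/replication), then:
cd /u01/oracle/config/domains/fmw_domain/bin    # placeholder path
./startWebLogic.sh
```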
Note:
Regardless of the stretched cluster configuration, it is expected that the appropriate backup procedures are in place for your Oracle WebLogic domain.

In an Oracle Fusion Middleware (FMW) stretched cluster topology, different considerations apply when migrating the Administration Server to a node in a different region versus migrating it to a node in the same region.
The following diagram shows the Administration Server failover to the other site in the FMW stretched cluster:
Fail Over to a Different Region
Verify that the Administration Server is working properly by accessing both the WebLogic Remote Console and the Oracle Enterprise Manager Fusion Middleware Control.
Fail Over to the Same Region
The failover procedure is the same as the one described for the Administration Server in the Enterprise Deployment Guide, in Verifying Manual Failover of the Administration Server.
To manage the Administration Server virtual IP (VIP) in Oracle Cloud Infrastructure (OCI) systems, you can use the steps described in the blog Using a Virtual IP (VIP) in Oracle Cloud Infrastructure.
Manage Failure of the Entire Region Hosting the Primary Database
The Oracle WebLogic Server instances in the remaining site will automatically reconnect to the new primary database if they use the recommended configuration described in the previous section.
The failed site’s load balancer will return a failed response, so the frontend global balancing feature should mark the site as unhealthy. All the client requests, regardless of their preference, will be routed to the other site.
The WebLogic JMS and JTA services will automatically migrate to the servers in the other site when using Automatic Service Migration along with JDBC persistent stores. In Oracle Fusion Middleware (FMW) Oracle SOA Suite’s case, if the automatic recovery cluster master was hosted in the failed servers, a new cluster master will arise in the available site. The new cluster master will perform automated recovery of SOA instances initiated on the other site.
The following diagram shows the failure of the entire region 1 in the FMW stretched clusters topology:
Recover From the Failure
After the failed site is recovered and is available again:
- Restart the processes in the failed hosts: Oracle HTTP Server instances, WebLogic Administration Server, and managed servers.
Make sure that the Administration Server virtual IP (VIP) is set and that no orphan files exist that prevent startup.
- Reinstate the failed database if you performed a database failover.
This action is not required if you performed a switchover.
- Perform a database switchover to the original site.
Expected Recovery Time Objective
The servers in the remaining site can continue processing requests as soon as the database is running again in the remaining site, so the downtime depends on the time used before switching over the database.
- If you perform the database failover or switchover almost immediately, then the total time to recover is short. It depends on the time that the database needs for the switchover or failover. For the test performed, the switchover takes less than 2 minutes, so the expected RTO is:
  RTO = DB DOWNTIME + SHORT TIME (1-2 min)
- If the database downtime is longer, there can be additional errors, such as Oracle WebLogic Server auto-restarts, that increase the RTO. The expected RTO is:
  RTO = DB DOWNTIME + WEBLOGIC START TIME
Manage Failure of the Entire Region Hosting the Standby Database
All the client requests, regardless of their site preference, will be routed to region 1, which continues processing requests. The WebLogic JMS and JTA services will automatically migrate to the servers in site 1 when using Automatic Service Migration along with JDBC persistent stores.
In the Oracle Fusion Middleware (FMW) with Oracle SOA Suite case, if the automatic recovery cluster master was hosted in the failed servers, a new cluster master will arise in the available site. This server performs automated recovery of SOA instances initiated on the other site.
There is no need to perform a database switchover since the outage doesn’t affect the primary database.
The following diagram shows the failure of the entire region 2 in the FMW stretched clusters topology:
Recover From the Failure
After the failed site is available again, restart the processes in the failed hosts for the Oracle HTTP servers and WebLogic managed servers.
Make sure that no orphan files exist that prevent WebLogic from starting.
Thanks to the global load balancing feature (either Oracle Cloud Infrastructure Traffic Management steering policies or a global load balancer), client requests will be rebalanced between both sites again.
Expected Recovery Time Objective
The domain name system (DNS) update affects clients that have region 2, the failed region, set as their preference in the geolocation steering policy. The TTL value determines how long these clients continue using the old entry before it is updated to point to the healthy site. The additional time (around 1 minute) depends on the frequency and timeout of the health checks configured in the OCI Traffic Management steering policy (the test above used a 30-second health check interval and a 10-second timeout).
When using a global load balancer (GLBR), the outage time depends on the frequency of the health checks configured in the GLBR. As soon as the GLBR marks a pool as unhealthy, the incoming requests will be redirected to the other site. With a GLBR, there is no DNS update, so the TTL value of the front-end entry is irrelevant.