About Managing Failures
The Oracle Fusion Middleware (FMW) stretched cluster topology is resilient to failures in any individual component.
Each site follows the high availability best practices outlined in the Oracle FMW Enterprise Deployment topology, ensuring that local redundancy protects against disruptions at the component level (such as the load balancer, Oracle HTTP Server instances, Oracle WebLogic Server, or the database instance).
The considerations for addressing scenarios such as a complete tier failure within a site or a total site failure are discussed in the following sections.
Manage the Failure of All the Web Servers on One Site
All the client requests, regardless of their site preference, will be routed to the other site.
Hence, the Oracle WebLogic servers of the site with the failed web servers will not receive new requests. They may continue some processing (for example, processing Oracle SOA composites and Java Message Service (JMS) messages). However, any HTTP callbacks generated internally by these servers will fail, because they point to their own site, whose web servers have failed.
The following diagram shows failures in all the web servers on one site:
Recover From the Failure
Clients are automatically redirected to the other site thanks to Oracle Cloud Infrastructure (OCI) Traffic Management steering policies or the global load balancer (GLBR).
If the restoration of the lost web server instances is not possible in the short term, you can perform the following to use the WebLogic Servers of the site with the failed web tier:
- Configure the Oracle HTTP Server (OHS) instances at the other site to route requests to the Oracle WebLogic Server instances on the failed site:
  - Update the OHS configuration and set the DynamicServerList parameter to ON.
  - Apply this change by restarting the OHS instances in a rolling fashion to avoid downtime.
  - Additionally, ensure that cross-region communication is permitted from the web servers to the WebLogic servers at the other site.
- To prevent failures in Hypertext Transfer Protocol (HTTP) callbacks originating from the site with unavailable OHS servers, update the entry for the frontend name in the WebLogic server hosts' /etc/hosts file (or private domain name system (DNS)) so that it points to the load balancer at the other site.
- Start the OHS processes in the failed site.
As soon as the Oracle Cloud Infrastructure Health Checks are OK again, the traffic management steering policy will load balance the client requests between both sites, as per the rules defined.
- Set the DynamicServerList parameter to OFF again in the other site.
- Revert any change in the WebLogic servers' /etc/hosts file (or private DNS) so that they point to their own site load balancer again.
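For illustration, the routing change on the surviving site's OHS instances might look like the following mod_wl_ohs.conf and /etc/hosts fragments. The location, hostnames, ports, and IP address are placeholders, not values from the tests above.

```
# mod_wl_ohs.conf fragment (placeholder hostnames and ports):
<Location /soa-infra>
    WLSRequest ON
    # Seed the list with the WebLogic servers of both sites; with
    # DynamicServerList ON, OHS refreshes the member list from the cluster.
    WebLogicCluster wlshost1-site1:8001,wlshost2-site1:8001,wlshost1-site2:8001,wlshost2-site2:8001
    DynamicServerList ON
</Location>
```

```
# /etc/hosts entry on the WebLogic hosts of the failed site, pointing the
# frontend name at the healthy site's load balancer (placeholder values):
203.0.113.20   fmw.example.com
```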
The following image shows Java Message Service (JMS) queue messages and client failed requests during a failure of all the web servers on one site:
Expected Recovery Time Objective
The DNS update affects clients whose steering policy preference is set to the failed region. The TTL value determines how long these clients continue using the old entry before it is updated to point to the healthy site. The additional time (around 1 minute) depends on the frequency and timeout of the health checks configured in the OCI steering policy (the test above used a 30-second health check interval and a 10-second timeout).
When using a global load balancer (GLBR), the outage time depends on the frequency of the health checks configured in the GLBR. As soon as the GLBR marks a pool as unhealthy, the incoming requests will be redirected to the other site. With a GLBR, there is no DNS update, so the TTL value of the front-end entry is irrelevant.
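The timing above can be sketched with a quick back-of-the-envelope calculation. The number of consecutive failed probes needed before the endpoint is marked unhealthy and the client-side TTL are assumptions for illustration; take the real values from your own OCI Health Checks and DNS configuration.

```python
# Rough worst-case time for a client with a cached DNS answer to be
# steered to the healthy site. All parameter values are illustrative.

def worst_case_failover_seconds(probe_interval, probe_timeout,
                                failures_to_unhealthy, dns_ttl):
    # Time to declare the endpoint unhealthy: a run of failed probes,
    # plus the timeout of the last probe...
    detection = failures_to_unhealthy * probe_interval + probe_timeout
    # ...plus however long the client keeps the stale DNS answer cached.
    return detection + dns_ttl

# 30 s interval and 10 s timeout match the test described above; the
# 2-probe unhealthy threshold and 30 s TTL are assumptions.
print(worst_case_failover_seconds(30, 10, 2, 30))  # 100 seconds
```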
Manage the Failure of All the Oracle WebLogic Servers on One Site
The failed site’s load balancer will return a failed response, so the frontend global balancing feature, based on Oracle Cloud Infrastructure (OCI) Traffic Management steering policies and health checks, should mark the site as unhealthy. All the client requests, regardless of their preference, will be routed to the other site.
The WebLogic Java Message Service (JMS) and Java Transaction API (JTA) services will automatically migrate to the servers in the other site when using Automatic Service Migration along with Java Database Connectivity (JDBC) persistent stores.
In the Oracle Fusion Middleware (FMW) SOA case, if the automatic recovery cluster master was hosted in the failed servers, a new cluster master will arise in the available site. This server performs automated recovery of SOA instances initiated on the other site.
The following diagram shows the failure of all the WebLogic servers on one site:
The following image shows client failed requests and JMS messages per server when all the WebLogic Servers fail on a site.
In the JMS messages graph, there are four lines, each representing a server’s JMS queue. The green and blue lines (which are almost overlapped) correspond to the servers that were killed. The number of JMS messages for these queues doesn’t increase after the outage begins.
The red and yellow lines represent the servers that remain up in region 2. When all requests are redirected to this region, each remaining server receives 50% of the total load. However, the rate at which messages accumulate in their queues is different. This is because the JMS servers of the failed servers migrated to one of the remaining servers, so that server now processes the messages of three queues. As a result, the slope appears lower for the yellow line (note that the monitoring tool does not display the message counts for the migrated queues).
Recover From the Failure
After the failed servers are available again:
- Start the managed servers in the failed site.
- As soon as the Oracle Cloud Infrastructure Health Checks are healthy again, the Traffic Management steering policy will load balance the client requests between both sites, as per the rules defined.
Expected Recovery Time Objective
This is similar to the scenario where all the web servers on one site fail.
The DNS update affects clients whose steering policy preference is set to the failed region. The TTL value determines how long these clients continue using the old entry before it is updated to point to the healthy site. The additional time (around 1 minute) depends on the frequency and timeout of the health checks configured in the OCI Traffic Management steering policy (the test used a 30-second health check interval and a 10-second timeout).
When using a global load balancer (GLBR), the outage time depends on the frequency of the health checks configured in the GLBR. As soon as the GLBR marks a pool as unhealthy, the incoming requests will be redirected to the other site. With a GLBR, there is no DNS update, so the TTL value of the front-end entry is irrelevant.
Manage Failures in the Database: Data Guard Switchover and Failover
The Java Database Connectivity (JDBC) URL string and Oracle Notification Service (ONS) configuration provided earlier in “Set Up WebLogic Data Sources” ensure that reconnection to the new primary database happens automatically. For these tests (using Oracle Fusion Middleware (FMW) SOA FOD, even with high workloads of 160 concurrent invocations), the database switchover or failover takes less than a couple of minutes. This time can vary based on system configuration and environment. In well-tuned systems, switchover times of 1-5 minutes are common, but factors such as system size, resources, workload, redo log synchronization, and network performance can impact the total duration. See the Explore More section for links to Oracle Data Guard documentation and other resources.
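As a sketch of the kind of connect string referenced above, a Data Guard-aware JDBC URL usually lists the scan addresses of both sites and relies on RETRY_COUNT and RETRY_DELAY to ride through role transitions. Every hostname, port, service name, and timeout value below is a placeholder.

```
jdbc:oracle:thin:@(DESCRIPTION=
  (CONNECT_TIMEOUT=5)(TRANSPORT_CONNECT_TIMEOUT=3)
  (RETRY_COUNT=20)(RETRY_DELAY=3)
  (ADDRESS_LIST=(LOAD_BALANCE=ON)
    (ADDRESS=(PROTOCOL=TCP)(HOST=site1-scan.example.com)(PORT=1521)))
  (ADDRESS_LIST=(LOAD_BALANCE=ON)
    (ADDRESS=(PROTOCOL=TCP)(HOST=site2-scan.example.com)(PORT=1521)))
  (CONNECT_DATA=(SERVICE_NAME=fmwsvc.example.com)))
```

The service name should be a role-based database service that runs only on the current primary, so that the data sources, together with ONS/FAN notifications, reconnect to whichever site holds the primary role.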
During a switchover or failover of the database, there will be application errors. Also, the WebLogic Servers using service migration may be shut down and restarted automatically by the Node Manager if they are not able to update their leasing table. The expected behavior with the default leasing parameters is:
- If the database outage is very short (<1-2 min), no WebLogic server auto-restart is expected.
- If the database outage is longer (2-10 min), the WebLogic servers may
auto-restart due to “lost a lease” when the database starts again.
The lower limit can be increased by tuning WebLogic's database leasing retries, as described earlier in "Configure Tuning WebLogic Database Leasing".
- If the database outage is much longer (>10 min), then WebLogic servers can auto-restart due to other failures like losing access to critical JDBC stores (“JDBC store of JTA is unavailable”).
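As a rough sketch of that tuning trade-off: database leasing keeps retrying the connection to the leasing table, so the outage window a server can ride out before “losing its lease” grows with the retry count and delay. The retry values below are hypothetical, not product defaults.

```python
# Approximate database outage a WebLogic server can tolerate before its
# database lease is considered lost. Values are illustrative only.

def leasing_outage_tolerance_seconds(retry_count, retry_delay_seconds):
    # The server only gives up (and may be auto-restarted by Node Manager)
    # after every retry against the leasing table has failed.
    return retry_count * retry_delay_seconds

# Hypothetical tuning: 60 retries, 5 s apart -> about 5 minutes tolerated,
# which raises the 1-2 minute lower limit discussed above.
print(leasing_outage_tolerance_seconds(60, 5))  # 300 seconds
```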
The following diagram shows the database switchover in the FMW stretched clusters topology:
The following image shows client requests performance and Java Message Service (JMS) messages per server queue during a database switchover in a FMW stretched cluster.
Recover From the Failure
After the failed servers are available again:
- Reinstate the failed database if you performed a database failover.
This action is not required if you performed a switchover.
- Perform a database switchback to the original site.
Expected Recovery Time Objective
For the tests performed, the switchover takes less than 2 minutes.
For an unplanned switchover or failover, the total downtime depends on the time the database was down:
- If you perform the database failover or switchover almost immediately, then the total time to recover is short. It depends on the time that the database needs for the switchover or failover. For the tests performed, the switchover takes less than 2 minutes, so the expected recovery time objective (RTO) is:
  RTO = DB DOWNTIME + SHORT TIME (1-2 min)
- If the database downtime is longer, there can be additional errors, such as Oracle WebLogic Server auto-restarts, that increase the RTO. In this case, the expected RTO is:
  RTO = DB DOWNTIME + WEBLOGIC START TIME
Manage Failures in the WebLogic Administration Server
Node Manager will automatically restart the failed server in place.
However, you need to fail over the Administration Server to a different node if an outage completely affects the host where the Administration Server runs.
Essentially, this consists of restarting the Administration Server in a different node, ensuring it points to the location that contains the Administration Server domain directory and that it uses a listen address that maps to the appropriate virtual IP (VIP).
This Administration Server domain directory may be a shared storage location available to different nodes in the same region, or a restore from a backup or file system replication made available to nodes in a different region.
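At a high level, the manual failover might look like the following sketch. The VIP address, network interface, and domain path are placeholders, and the authoritative steps are those in your Enterprise Deployment Guide procedures.

```
# Sketch only; adapt names and paths to your environment.
# 1. On the new node, bring up the Administration Server VIP:
sudo ip addr add 100.64.10.5/24 dev eth0        # placeholder VIP/interface
# 2. Make the Administration Server domain directory available
#    (shared storage mount, or restore from backup/replication), then:
cd /u01/oracle/config/domains/fmw_domain/bin    # placeholder path
./startWebLogic.sh
```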
Note:
Regardless of the stretched cluster configuration, it is expected that the appropriate backup procedures are in place for your Oracle WebLogic domain.

In an Oracle Fusion Middleware (FMW) stretched cluster topology, different considerations apply when migrating the Administration Server to a node in a different region versus migrating it to a node in the same region.
The following diagram shows the Administration Server failover to the other site in the FMW stretched cluster:
Fail Over to a Different Region
Verify that the Administration Server is working properly by accessing both the WebLogic Remote Console and the Oracle Enterprise Manager Fusion Middleware Control.
Fail Over to the Same Region
The failover procedure is the same as the one described for the Administration Server in the Enterprise Deployment Guide, in Verifying Manual Failover of the Administration Server.
To manage the Administration Server virtual IP (VIP) in Oracle Cloud Infrastructure (OCI) systems, you can use the steps described in the blog Using a Virtual IP (VIP) in Oracle Cloud Infrastructure.
Manage Failure of the Entire Region Hosting the Primary Database
The Oracle WebLogic Server instances in the remaining site will automatically reconnect to the new primary database if they use the recommended configuration described in the previous section.
The failed site’s load balancer will return a failed response, so the frontend global balancing feature should mark the site as unhealthy. All the client requests, regardless of their preference, will be routed to the other site.
The WebLogic JMS and JTA services will automatically migrate to the servers in the other site when using Automatic Service Migration along with JDBC persistent stores. In Oracle Fusion Middleware (FMW) Oracle SOA Suite’s case, if the automatic recovery cluster master was hosted in the failed servers, a new cluster master will arise in the available site. The new cluster master will perform automated recovery of SOA instances initiated on the other site.
The following diagram shows the failure of the entire region 1 in the FMW stretched clusters topology:
Recover From the Failure
After the failed site is recovered and is available again:
- Restart the processes in the failed hosts: Oracle HTTP Server instances, WebLogic Administration Server, and managed servers.
Make sure that the Administration Server virtual IP (VIP) is set and that no orphan files exist that prevent startup.
- Reinstate the failed database if you performed a database failover.
This action is not required if you performed a switchover.
- Perform a database switchover to the original site.
Expected Recovery Time Objective
The servers in the remaining site can continue processing requests as soon as the database is running again in the remaining site, so the downtime depends on the time used before switching over the database.
- If you perform the database failover or switchover almost immediately, then the total time to recover is short. It depends on the time that the database needs for the switchover or failover. For the test performed, the switchover takes less than 2 minutes, so the expected RTO is:
  RTO = DB DOWNTIME + SHORT TIME (1-2 min)
- If the database downtime is longer, there can be additional errors, such as Oracle WebLogic Server auto-restarts, that increase the RTO. The expected RTO is:
  RTO = DB DOWNTIME + WEBLOGIC START TIME
Manage Failure of the Entire Region Hosting the Standby Database
All the client requests, regardless of their site preference, will be routed to region 1, which continues processing requests. The WebLogic JMS and JTA services will automatically migrate to the servers in site 1 when using Automatic Service Migration along with JDBC persistent stores.
In the Oracle Fusion Middleware (FMW) with Oracle SOA Suite case, if the automatic recovery cluster master was hosted in the failed servers, a new cluster master will arise in the available site. This server performs automated recovery of SOA instances initiated on the other site.
There is no need to perform a database switchover since the outage doesn’t affect the primary database.
The following diagram shows the failure of the entire region 2 in the FMW stretched clusters topology:
Recover From the Failure
After the failed site is available again, restart the processes in the failed hosts for the Oracle HTTP servers and WebLogic managed servers.
Make sure that no orphan files exist that prevent WebLogic from starting.
Thanks to the global load balancing feature (either Oracle Cloud Infrastructure Traffic Management steering policies or a global load balancer), client requests will be rebalanced between both sites again.
Expected Recovery Time Objective
The domain name system (DNS) update affects clients that have region 2, the failed region, set as their preference in the geolocation steering policy. The TTL value determines how long these clients continue using the old entry before it is updated to point to the healthy site. The additional time (around 1 minute) depends on the frequency and timeout of the health checks configured in the OCI Traffic Management steering policy (the test above used a 30-second health check interval and a 10-second timeout).
When using a global load balancer (GLBR), the outage time depends on the frequency of the health checks configured in the GLBR. As soon as the GLBR marks a pool as unhealthy, the incoming requests will be redirected to the other site. With a GLBR, there is no DNS update, so the TTL value of the front-end entry is irrelevant.