4 Managing Switchover, Switchback, and Failover Operations

Learn how to perform switchover, switchback, and failover for your Oracle Fusion Middleware Disaster Recovery topology.

Performing a Switchover

A switchover is a planned operation in which the secondary site assumes the production role.

This operation is needed when you plan to take down the production site (for example, to perform maintenance) and make the current secondary site the production site.

To perform a switchover operation:

  1. Shut down any processes running on the production site. These include the Oracle Fusion Middleware instances and any other processes in the application tier and the web tier.
  2. Stop the replication between the production site shared storage and the secondary site shared storage. If you are using shared storage replication, pause the replications. If you have scheduled rsync jobs that update the configuration on a regular basis, ensure that you either stop the jobs or schedule the replication windows so that they do not interfere with the planned switchover.
  3. When using storage replication, unmount the shared storage volume with the middle-tier artifacts on the current production site and mount the corresponding volumes on the current secondary site, which becomes the new production site.
  4. Use Oracle Data Guard to switch over the databases (see the sketch after this list).
  5. On the secondary site hosts, manually start all the processes. These include the Oracle Fusion Middleware instances and any other processes in the application tier and the web tier.
  6. Ensure that all user requests are routed to the secondary site by performing a global DNS push or something similar, such as updating the global load balancer. See the Wide Area DNS Operations section.
  7. Use a browser client to perform post-switchover testing to confirm that requests are being resolved and redirected to the secondary site.

    At this point, the former secondary site is the new production site and the former production site is the new secondary site.

  8. Reestablish the replication between the two sites but configure the replication so that the snapshot or rsync copies go in the opposite direction (from the current production site to the current secondary site). See the documentation for your shared storage to learn how to configure the replication so that snapshot copies are transferred in the opposite direction.
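
As an illustration of step 4, the following is a minimal Data Guard Broker sketch, assuming hypothetical database names SOADB_PRIM (current primary) and SOADB_STBY (current standby); replace them with the names reported by "show configuration" in your environment and review the broker output at each step:

  # Connect to the Data Guard Broker (DGMGRL prompts for the SYS password).
  dgmgrl
  DGMGRL> connect sys@SOADB_PRIM
  DGMGRL> show configuration;
  DGMGRL> validate database 'SOADB_STBY';
  DGMGRL> switchover to 'SOADB_STBY';
  DGMGRL> show configuration;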

At this point, the former secondary site becomes the new production site, and you can perform maintenance on the original production site. After you have completed the maintenance, you can use the original production site as either the production site or the secondary site.

To use the original production site as the production site, perform a switchback as explained in Performing a Switchback.

Performing a Switchback

A switchback operation reverts the roles of the current production and secondary sites.

To perform a switchback operation:

  1. Shut down any processes running on the current production site. These include the Oracle Fusion Middleware instances and any other processes in the application tier and the web tier.
  2. Stop the replication between the production site shared storage and the secondary site shared storage. If you are using shared storage replication, pause the replications. If you have scheduled rsync jobs that update the configuration on a regular basis, ensure that you either stop the jobs or schedule the replication windows so that they do not interfere with the planned switchback.
  3. When using storage replication, unmount the shared storage volume with the middle-tier artifacts on the current production site and mount the corresponding volumes on the current secondary site, which becomes the new production site (see the sketch after this list).
  4. Use Oracle Data Guard to switch back the databases.
  5. On the new production site hosts, manually start all the processes. These include the Oracle Fusion Middleware instances and any other processes in the application tier and the web tier.
  6. Ensure that all user requests are routed to the new production site by performing a global DNS push or something similar, such as updating the global load balancer. See the Wide Area DNS Operations section.
  7. Use a browser client to perform post-switchback testing to confirm that requests are being resolved and redirected to the new production site.

    At this point, the former secondary site is the new production site and the former production site is the new secondary site.

  8. Reestablish the replication between the two sites, but configure the replication so that the snapshot copies go in the opposite direction (from the new production site to the new secondary site). See the documentation for your shared storage to learn how to configure the replication so that snapshot copies are transferred in the opposite direction.
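
As an illustration of step 3, the following is a minimal sketch of the volume change, assuming a hypothetical NFS export (stgsrv:/export/soa_domain) and mount point (/u01/oracle/config) that hold the replicated middle-tier artifacts; use the device names, export paths, and mount options that apply to your storage:

  # On the current production site hosts: stop using the replicated volume.
  umount /u01/oracle/config

  # On the new production site hosts: after the replica has been made
  # writable on the storage side, mount the corresponding volume.
  mount -t nfs stgsrv:/export/soa_domain /u01/oracle/config
  df -h /u01/oracle/config   # verify that the volume is mounted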

Performing a Failover

A failover operation assigns the production role to the secondary site when the production site becomes unavailable. This is an unplanned operation: the primary site may no longer be accessible, so the servers and storage at the primary site cannot be managed, and changes are possible only from the secondary site.

To perform a failover operation:

  1. Stop the replication between the production site shared storage and the secondary site shared storage (your shared storage should also have a control module in the secondary site).
  2. When using shared storage replication, mount the shared storage volume with the middle-tier artifacts on the current secondary site, which becomes the new production site.
  3. From the secondary site, use Oracle Data Guard Broker (dgmgrl) to fail over the databases (see the sketch after this list).
  4. On the secondary site hosts, manually start all the processes. These include the Oracle Fusion Middleware instances and any other processes in the application tier and the web tier.
  5. Ensure that all user requests are routed to the secondary site by performing a global DNS push or something similar, such as updating the global load balancer. See the Wide Area DNS Operations section.
  6. Use a browser client to perform post-failover testing to confirm that requests are being resolved and redirected to the new production site.

    At this point, the secondary site is the new production site. You can examine the issues that caused the former production site to become unavailable.

  7. Once the primary site is accessible again, you can use the original production site as the new secondary site. You must reestablish the replication between the two sites, but configure the replication so that the snapshot copies go in the opposite direction (from the current production site to the current secondary site). See the documentation for your shared storage system to learn how to configure the replication so that snapshot copies are transferred in the opposite direction.
  8. Depending on the type and duration of the outage at the primary site, you may need to reinstate and reconfigure Oracle Data Guard with the database in the original primary system. For more information about different failover situations and how to reinstate a failed primary, see the Oracle Data Guard documentation and How To Reinstate Failed Primary Database into Physical Standby (Doc ID 738642.1).
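
As an illustration of steps 3 and 8, the following broker sketch uses the same hypothetical database names as in the switchover example (SOADB_PRIM as the failed primary, SOADB_STBY as the standby); adapt the names to your configuration and review the broker output before proceeding:

  # Connect to the broker through the surviving standby database.
  dgmgrl
  DGMGRL> connect sys@SOADB_STBY
  DGMGRL> show configuration;
  DGMGRL> failover to 'SOADB_STBY';

  # Later, once the failed primary is reachable again (and, if needed,
  # flashback or restore has been performed), reinstate it as a standby.
  DGMGRL> reinstate database 'SOADB_PRIM';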

To use the original production site as the production site again, perform a switchback as explained in Performing a Switchback.

Wide Area DNS Operations

When a site switchover or failover is performed, client requests must be redirected transparently to the new site that is now playing the primary role.

To accomplish this redirection, either use a global load balancer or manually change the DNS names.

Using a Global Load Balancer

A global load balancer deployed in front of the production and secondary sites provides fault detection services and performance-based routing redirection for the two sites.

In addition, the load balancer can provide authoritative DNS name server equivalent capabilities.

During normal operations, you can configure the global load balancer with the production site load balancer name-to-IP mapping. When a DNS switchover is required, this mapping in the global load balancer is changed to map to the secondary site's load balancer IP. This allows requests to be directed to the secondary site, which now has the production role.

This method of DNS switchover works for both site switchover and failover. One advantage of using a global load balancer is that the time for a new name-to-IP mapping to take effect can be almost immediate. The downside is that an additional investment must be made for the global load balancer.

Manually Changing DNS Names

The DNS switchover involves manually changing the name-to-IP mapping of the production site's load balancer.

The mapping is changed to point to the IP address of the secondary site's load balancer. Follow these instructions to perform the switchover:

  1. Note the current Time to Live (TTL) value of the production site's load balancer mapping. This mapping is in the DNS cache, and it remains there until the TTL expires. As an example, assume that the TTL is 3600 seconds. (You can check the current value with a DNS lookup tool, as shown in the sketch after this list.)
  2. Modify the TTL value to a short interval (for example, 60 seconds).
  3. Wait for one interval of the original TTL, that is, the original TTL of 3600 seconds from Step 1.
  4. Ensure that the secondary site is switched over and ready to receive requests.
  5. Modify the DNS mapping to resolve to the secondary site's load balancer, and restore the TTL to the appropriate value for normal operation (for example, 3600 seconds).
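
For example, with a BIND-style lookup tool such as dig and a hypothetical front-end name (app.example.com), you can check the remaining TTL and the address that clients currently resolve before and after each change:

  # The second column of the answer is the remaining TTL in seconds;
  # the last column is the IP address currently being returned.
  dig +noall +answer app.example.com
  # example output (hypothetical): app.example.com. 3600 IN A 203.0.113.10

  # After the switchover, confirm that the name resolves to the secondary
  # site's load balancer and that the TTL has been restored.
  dig +noall +answer app.example.com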

This method of DNS switchover works for both switchover and failover operations. The TTL value set in Step 2 determines how long clients can keep resolving the old address after the mapping changes, so it should be short enough that the period during which client requests cannot be fulfilled remains acceptable. Modifying the TTL effectively alters the caching semantics of the address resolution from a long period of time to a short one. Because of the shortened caching period, an increase in DNS requests can be observed.

If the clients that point to the FMW endpoints run on Java, another TTL setting must be taken into account: the Java cache for successful DNS resolutions. If that cache does not expire, the change in the DNS server is not picked up until the JVM is restarted. You can control this behavior by setting the networkaddress.cache.ttl property to a low value.
  • You can set it globally, for all the applications running on the JVM, by modifying the property in the JAVA_HOME/jre/lib/security/java.security file: networkaddress.cache.ttl=60

  • You can define it for a specific application only, by setting the property in the application's initialization code: java.security.Security.setProperty("networkaddress.cache.ttl", "60")

As an alternative to global load balancers and DNS provider record updates, some cloud Web Application Firewall (WAF) services, such as Oracle Cloud Infrastructure's Web Application Firewall, provide a way to map a single front-end DNS name (a CNAME) to multiple backend IP addresses (in a DR topology, these would be the IP addresses of the load balancers in the primary and secondary sites). With the appropriate Edge Policies pointing to each region's load balancer IP address, this can act as an effective global load balancer that fails requests over from the primary to the secondary when there is a switchover. To use this alternative, refer to the WAF product-specific documentation.

Expected RTO and RPO

This section provides information about the expected RTO and RPO during an outage.

Expected RTO

The Recovery Time Objective (RTO) describes the maximum acceptable downtime for a particular system when an outage occurs. The downtime caused by a failover depends on multiple "uncontrollable" factors, because it is normally an unplanned event caused by a critical issue that affects the system. But it is possible to measure the required downtime for a planned switchover event.

The following table shows the typical time taken by each switchover step in a sample Oracle FMW SOA 14.1.2 EDG system containing SOA, OSB, B2B, WSM, and ESS clusters. The systems used in this example run on hosts with 4 CPUs and 48 GB of memory, with an 8 GB maximum heap for the SOA servers. The WebLogic servers use the out-of-the-box configuration for the connection pools of the different SOA Suite components. Additionally, this sample SOA system includes dozens of composites for different types of components (BPEL, Mediator, EDN, and so on).

Step No. | Switchover Step | Sample Times in FMW SOA EDG

1 | Pre-switchover tasks | This does not cause downtime. Downtime starts after this step.

2 | Stop servers in primary site |
  2.1 | Stop managed servers | ~30 seconds (Force) / ~2 minutes (Graceful)
  2.2 | Stop admin server | ~8 seconds (Force) / ~2 minutes (Graceful)

3 | Switchover DNS name | This is customer specific. For example, if you use OCI DNS it can be as low as 30 seconds, but it could take hours depending on the DNS provider used. This step can be done in parallel with the rest of the steps.

4 | Switchover database | ~3 minutes

5 | Start the servers in secondary site |
  5.1 | Start admin server | ~6 minutes (domain on shared storage)
  5.2 | Start managed servers (in parallel) | ~2 minutes

Total downtime | ~15 minutes

Natural delays between steps and any other additional validations are not included in the above times, because they depend on how the switchover steps are executed (for example, manually, automated with custom scripts, with custom orchestration tools, and so on). Therefore, some additional time must be added to the total; it is not just the arithmetic sum of the step times. The time for the DNS switchover is also excluded because it is customer specific.

Normally, the total switchover time is expected to be in the range of 15-30 minutes. Here is a list of tips to minimize the downtime during the switchover operation:

  • Perform any switchover-related activity that does not require downtime before you stop the primary servers. For example, WebLogic configuration replication based on rsync does not require downtime; you can perform it while the primary system is up and running (see the sketch after this list). Another example is starting any shut-down hosts or dependent resources in the secondary site.

  • If possible, stop the managed servers and admin server in parallel.

  • If applications and business allow it, use force shutdown to stop the WebLogic servers.

  • The maximum time taken by the WebLogic servers to shut down is limited by the parameters "server lifecycle timeout" (normally set to 30 seconds) and "graceful shutdown" (normally set to 120 seconds). Make sure that these parameters are configured to limit the maximum shutdown time.

  • The front-end update in DNS is customer dependent. Use a low TTL value in the appropriate DNS entry (at least during the switchover operation) to reduce the update time. Once the switchover is finished, the TTL can be reverted to its original value.

  • Using Data Guard Broker commands (dgmgrl) to switch over the database is faster than using Enterprise Manager or other agent-based orchestrators.

  • Load balancers also take some time to detect that the OHS and WebLogic servers are up before they start sending requests to them. This is usually a few seconds, depending on the frequency of the LBR health checks. Lower the interval used for the checks in advance and revert it after the switchover. Be cautious when using very low health check intervals, because they could overload the backend.
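
As an illustration of the first tip, the following is a minimal rsync sketch, assuming a hypothetical domain path (/u01/oracle/config/domains/soa_domain) and a standby administration host reachable as standby-admin; adjust the paths and the excluded directories to your DR setup:

  # Replicate the WebLogic domain configuration to the standby site while
  # the primary is still running; run it again right before the switchover.
  rsync -avz --delete \
    --exclude 'servers/*/tmp' \
    --exclude 'servers/*/cache' \
    /u01/oracle/config/domains/soa_domain/ \
    standby-admin:/u01/oracle/config/domains/soa_domain/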

Expected RPO

The Recovery Point Objective (RPO) describes the maximum amount of data loss that can be tolerated. For example, in the Oracle FMW SOA case, this is especially related to transaction logs, JMS messages, and SOA instance information, all of which reside in the database. Given that the database and the WebLogic configuration are replicated with different mechanisms, you can differentiate between the RPO for the runtime data and the RPO for the WebLogic configuration.

The actual achievable RPO for the runtime data relies upon the RPO of the database, because the runtime data (composite instances, JMS messages, TLogs, customer data, and so on) is stored in the database. In some cases, there can be runtime artifacts stored in file systems too (such as files consumed by File or FTP adapters). Therefore, the RPO for the runtime data depends upon the following:

  1. The available network bandwidth and network reliability between the primary and standby sites. Use connections between the primary and secondary that provide consistent performance for bandwidth, latency, and jitter. For a sample system like the one presented above, you can expect an RPO of approximately five minutes. For optimum behavior, manual configuration of the Fast-Start Failover observer for the database may be required. Refer to the Oracle Database documentation for details about Fast-Start Failover.

  2. The Data Guard protection mode used. There are three different modes: Maximum Availability, Maximum Protection, and Maximum Performance (the default). You can review or change the mode with Data Guard Broker, as shown in the sketch after this list.

    • Maximum Availability mode ensures zero data loss except in the case of certain double faults, such as failure of a primary database after failure of the standby database.

    • Maximum Performance mode offers slightly less data protection than maximum availability mode and has minimal impact on primary database performance.

    • Maximum Protection mode ensures that no data loss occurs if the primary database fails. To ensure that data loss cannot occur, the primary database shuts down instead of continuing processing transactions, if it cannot write its redo stream to at least one synchronized standby database.

      The choice of one Data Guard protection mode or another is driven by business requirements. In some situations, a business cannot afford to lose data regardless of the circumstances. In other situations, the availability of the database may be more important than any potential data loss in the unlikely event of a multiple failure. Finally, some applications always require maximum database performance, and can therefore tolerate a small amount of data loss if any component fails. For more information, see Oracle Data Guard Protection Modes in the Oracle Data Guard Administration documentation.

  3. Additionally, if there are runtime artifacts stored in file systems rather than in the database (for example, files stored in custom File Storage Services that are consumed or generated by MFT or by File/FTP adapters), the RPO for this data depends on how frequently it is synchronized to the secondary location. What, how, and when this content should be synchronized is determined by the business needs. For example, if these runtime files are very volatile (created and consumed quickly), synchronizing them may be unnecessary. But if the content is more static and it is required in the secondary site in case of a DR event, the copy frequency should match the expected RPO of the system. The RPO is then the amount of data generated between replications of this content.

    Alternatively, these runtime files can be located in a DBFS file system. In that case, they are replicated to the standby through the underlying Data Guard replica, so the RPO is determined by the Data Guard protection mode.
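
As a complement to the protection modes described in item 2, the following is a minimal Data Guard Broker sketch (using the same hypothetical SOADB_PRIM and SOADB_STBY names as in the earlier examples) to review and, if your RPO requires it, raise the protection mode:

  # Review the current protection mode and, if needed, raise it.
  # Synchronous redo transport (LogXptMode=SYNC) is a prerequisite for
  # Maximum Availability.
  dgmgrl
  DGMGRL> connect sys@SOADB_PRIM
  DGMGRL> show configuration;
  DGMGRL> edit database 'SOADB_STBY' set property LogXptMode='SYNC';
  DGMGRL> edit configuration set protection mode as MaxAvailability;
  DGMGRL> show configuration;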

The actual achievable RPO for the WebLogic configuration depends upon:

  1. How frequently the WebLogic configuration is modified.

    The WebLogic configuration does not change as dynamically as the runtime data. Beyond the initial stages of a system, it is not common to have continuous configuration changes. The more frequently the configuration is modified, the more configuration changes could be lost in a disaster event.

  2. How frequently the WebLogic configuration is synchronized to the standby.

    When using the shared storage and DBFS replication methods, the WebLogic configuration can be replicated manually or with cron jobs (see the sketch at the end of this section). One approach is to replicate the configuration after every configuration change that is performed in the primary. This ensures that the secondary WebLogic configuration is always up to date with the primary, but requires the replication process to be included in every change performed in the primary. Another approach is to schedule the replication on a regular basis (for example, every night). In this case, when an outage event takes place, any configuration changes that were applied in the primary after the last replication are lost.

  3. Reliability of the procedure used for the WebLogic configuration replication.

    All the replication methods are reliable, but any failure in the underlying infrastructure (for example, unavailability of the staging folder, connectivity outages, and so on) can impact the RPO. Thus, it is recommended to verify the proper functioning of the replication procedure and to perform regular validations of the secondary site.
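
For the scheduled approach, the following is a minimal cron-based sketch, assuming the same hypothetical domain path and standby host as in the earlier rsync example and a hypothetical wrapper script path; adjust the schedule to the RPO you expect for configuration changes:

  #!/bin/sh
  # /u01/scripts/sync_domain_config.sh (hypothetical wrapper script):
  # replicate the WebLogic domain configuration to the standby site and
  # keep a log of each run for later verification.
  rsync -avz --delete /u01/oracle/config/domains/soa_domain/ \
    standby-admin:/u01/oracle/config/domains/soa_domain/ \
    >> /var/log/dr_config_sync.log 2>&1

  # crontab entry on the primary administration host:
  # run the replication every night at 02:00.
  0 2 * * * /u01/scripts/sync_domain_config.sh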