Plan for Resiliency
Your business or organization has specific requirements for a resilient data integration solution. Take the time to plan an integration architecture that meets those requirements before you begin the detailed design.
Determine Resiliency Requirements
Before diving into what will make your environment resilient, you first need to define what resiliency means to you and your business. In other words, what is the cost associated with an outage of your integration processes?
For some customers, an outage of a few minutes is acceptable and will only slightly delay a batch process that runs well within its processing window. For other customers, even a few seconds of outage results in financial losses that directly impact the business.
From that perspective, it is important to look at the following elements:
- What is the duration of an acceptable outage in your environment? Here you should define the cost to the business of an outage, and outline how that cost evolves with the outage's duration (see the sketch after this list).
- What technologies are used, and how can they deliver the expected level of service? Are you taking a real-time approach, a batch approach, or a combination of the two? How much data are you processing?
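For example, here is a minimal sketch, with entirely hypothetical figures, of how outage cost can evolve with duration: a batch shop with slack in its processing window absorbs the first minutes of an outage for free, while a near real-time shop accrues cost from the start.

```java
// Hypothetical outage-cost model; every figure below is a placeholder,
// meant only to illustrate how cost evolves with outage duration.
public class OutageCost {

    // Cost of an outage lasting `minutes`, given slack in the processing
    // window (free buffer) and a flat per-minute cost once slack is used up.
    static double cost(double minutes, double freeBufferMinutes, double ratePerMinute) {
        return Math.max(0, minutes - freeBufferMinutes) * ratePerMinute;
    }

    public static void main(String[] args) {
        // Batch shop: 30 minutes of slack, then $100/min.
        // Near real-time shop: no slack, $500/min from the start.
        for (double m : new double[] {1, 10, 30, 60, 240}) {
            System.out.printf("%3.0f min -> batch: $%,.0f, real-time: $%,.0f%n",
                    m, cost(m, 30, 100), cost(m, 0, 500));
        }
    }
}
```

Tabulating this kind of curve with your own numbers makes the acceptable-outage discussion with the business concrete.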
Define a Resilient Architecture
Defining a resilient architecture requires looking at the end-to-end data integration solution.
In the case of an integration process, you need to consider the following components of the architecture (both hardware and software):
- Resiliency of the source system
- Resiliency of the target system
- Resiliency of the staging area, if one is used
- Resiliency of the data integration tools
- Resiliency of the orchestration (if outside of the ETL tool)
- Resiliency of the network (both from a connectivity perspective and a bandwidth perspective)
You should also consider your requirements in terms of Disaster Recovery and High Availability: what happens if you lose the data center where this infrastructure is installed?
The following elements are required for your Oracle Data Integrator installation to be resilient:
- Your agents must be redundant: JEE agents are designed for resiliency, with a load balancer distributing the load across multiple agents (see the health-check sketch after this list).
- The Oracle Data Integrator repository needs to run on a resilient system: an Oracle RAC or Exadata installation would be a minimum requirement, so that the loss of a node does not mean the loss of the complete infrastructure. For Oracle Cloud deployments, Oracle Database Exadata Cloud Service provides a resilient solution.
- If you are using an external product to orchestrate the Oracle Data Integrator processes (Oracle Integration, for instance), you must make sure that this product is also resilient.
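As an illustration, the sketch below probes each JEE agent directly, rather than only the load balancer's virtual address, so that a dead node is detected even while the load balancer is masking it. The host names, port, and the /oraclediagent context path are assumptions to adapt to your own topology.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Probes each ODI JEE agent behind the load balancer. Hosts, port, and the
// /oraclediagent context path are placeholders for your own topology.
public class AgentHealthCheck {
    public static void main(String[] args) throws Exception {
        String[] agents = {
            "http://odi-agent-1.example.com:8002/oraclediagent",
            "http://odi-agent-2.example.com:8002/oraclediagent"
        };
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        for (String url : agents) {
            try {
                HttpResponse<Void> resp = client.send(
                        HttpRequest.newBuilder(URI.create(url)).GET().build(),
                        HttpResponse.BodyHandlers.discarding());
                System.out.println(url + " -> HTTP " + resp.statusCode());
            } catch (Exception e) {
                System.out.println(url + " -> UNREACHABLE (" + e.getMessage() + ")");
            }
        }
    }
}
```

On the repository side, pointing the data source at a RAC SCAN address (for example, a hypothetical jdbc:oracle:thin:@//db-scan.example.com:1521/ODIREPO) lets connections survive the loss of a single database node.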
If you are considering a Disaster Recovery strategy, the same elements described above are required, so you will have to make sure that:
- You have a (recent enough) copy of the Oracle Data Integrator repository in your DR site so that you can continue running your Oracle Data Integrator processes.
- You have Oracle Data Integrator agents available in that DR site to access this repository.
- You have access to the source and target systems, or access to a copy of the source and target systems.
For Oracle Data Integrator specifically, two elements of the topology must be validated:
- The IP address or server name of the Oracle Data Integrator Work Repository is stored in the Oracle Data Integrator Master Repository. If the name or IP address changes when you switch to your DR site, you have to make sure that this information is updated before you start Oracle Data Integrator.
- The IP addresses or server names for the source and target systems are stored in the Work Repository. There are two possible strategies:
- Define separate contexts for each environment (primary and DR), which allows you to have two distinct physical server definitions for each logical unit.
- Or, overwrite the IP addresses or server names before starting Oracle Data Integrator in the DR site (a sketch of this approach follows this list).
- In all cases, any script using the SDK that overwrites information in the Oracle Data Integrator repositories must have a reverse script to restore the information in the primary site.
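For the overwrite strategy, the following is a minimal sketch using the Oracle Data Integrator 12c SDK. The repository connection details, credentials, data server name (SALES_DWH), and DR URLs are all hypothetical, and you should verify the SDK classes and signatures against your Oracle Data Integrator version.

```java
import oracle.odi.core.OdiInstance;
import oracle.odi.core.config.MasterRepositoryDbInfo;
import oracle.odi.core.config.OdiInstanceConfig;
import oracle.odi.core.config.PoolingAttributes;
import oracle.odi.core.config.WorkRepositoryDbInfo;
import oracle.odi.core.persistence.transaction.ITransactionStatus;
import oracle.odi.core.persistence.transaction.support.DefaultTransactionDefinition;
import oracle.odi.core.security.Authentication;
import oracle.odi.domain.topology.OdiDataServer;
import oracle.odi.domain.topology.finder.IOdiDataServerFinder;

// Sketch: after switching to the DR site, repoint a data server's JDBC URL
// at the DR copy of the target database. All names and URLs are placeholders.
public class SwitchToDrSite {
    public static void main(String[] args) {
        MasterRepositoryDbInfo master = new MasterRepositoryDbInfo(
                "jdbc:oracle:thin:@//dr-db.example.com:1521/ODIMASTER",
                "oracle.jdbc.OracleDriver", "ODI_MASTER",
                "master_password".toCharArray(), new PoolingAttributes());
        WorkRepositoryDbInfo work =
                new WorkRepositoryDbInfo("WORKREP", new PoolingAttributes());
        OdiInstance odi = OdiInstance.createInstance(
                new OdiInstanceConfig(master, work));
        try {
            Authentication auth = odi.getSecurityManager()
                    .createAuthentication("SUPERVISOR", "sup_password".toCharArray());
            odi.getSecurityManager().setCurrentThreadAuthentication(auth);

            ITransactionStatus txn = odi.getTransactionManager()
                    .getTransaction(new DefaultTransactionDefinition());
            IOdiDataServerFinder finder = (IOdiDataServerFinder) odi
                    .getTransactionalEntityManager().getFinder(OdiDataServer.class);
            OdiDataServer ds = finder.findByName("SALES_DWH"); // hypothetical
            ds.setConnectionSettings(new OdiDataServer.JdbcSettings(
                    "jdbc:oracle:thin:@//dr-dwh.example.com:1521/SALES",
                    "oracle.jdbc.OracleDriver"));
            odi.getTransactionalEntityManager().persist(ds);
            odi.getTransactionManager().commit(txn);
        } finally {
            odi.close();
        }
    }
}
```

The reverse script mentioned above is this same sketch with the primary site's URLs; keep both under version control so the switch can be rehearsed and rolled back.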
Plan for Initial and Successive Loads
Chances are you will have to design a slightly different process for the initial load of your target system before you can focus on your regular loads (such as near real-time or daily batches).
That said, if you want to protect yourself against unforeseen future outages, it is crucial to keep and maintain these initial load processes. The ability to re-run an initial load (or a partial initial load) should never be underrated, particularly in the following situations:
- A major flaw is discovered in the validity of the data that was loaded (missing data, invalid formula in the ETL, etc.).
- A major outage occurs on the target system resulting in massive data loss.
- The integration processes fail to run for an extended period, for whatever reason.
Having the ability to run initial loads at will also lets you instantiate new environments.
You can further enhance these load strategies with the ability to re-apply a previous load (such as a previous month's worth of data). This requires a combination of a partial cleanup of the loaded data and a partial reload (see the sketch below).
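As an illustration, re-applying one month of data might pair a partial cleanup with a reload of the same window. The sketch below, with hypothetical connection details, table, and column names, covers the cleanup step; the regular load scenario is then re-run restricted to the same date range.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Partial cleanup before re-applying one month of data. The connection
// details, table, and column names are placeholders.
public class ReapplyMonthlyLoad {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dwh.example.com:1521/SALES",
                "etl_user", "etl_password")) {
            conn.setAutoCommit(false);
            // 1. Partial cleanup: remove only the window being re-applied.
            try (PreparedStatement del = conn.prepareStatement(
                    "DELETE FROM sales_fact WHERE load_date >= ? AND load_date < ?")) {
                del.setDate(1, java.sql.Date.valueOf("2024-03-01"));
                del.setDate(2, java.sql.Date.valueOf("2024-04-01"));
                System.out.println("Deleted " + del.executeUpdate() + " rows");
            }
            conn.commit();
            // 2. Partial load: re-run the load scenario (or a partial variant
            //    of the initial load) filtered to the same date window.
        }
    }
}
```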