Code for Resiliency
Use these best practices while developing your resilient data integrations.
Define Automated Cleanup Procedures
Defining automated cleanup procedures helps make your code more resilient.
As you develop your code and run your tests, you will need to reset your environment repeatedly while you fix bugs, improve performance, and optimize loading strategies. Creating cleanup procedures early allows you to improve them as your code evolves, to the point where they should be flexible enough to reset your environment to any state you want:
- Reset everything, no data in the target system
- Reset the environment to the state it was in before the last load (this would include variables, status or audit tables, etc.)
- Reset the environment so that the data from a particular load (or batch_id or run_id) is reset across the environment without impacting other data
Automating the cleanup process usually makes it more efficient to run (no need to hunt for the list of tables to reset or the SQL queries to run). Automation also dramatically reduces the risk of someone resetting the wrong part of the environment, especially when debugging gets frustrating and people get tired.
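The most fine-grained of these resets (removing the data from a single load without touching anything else) can often be reduced to a handful of statements. Here is a minimal sketch, assuming hypothetical TGT_SALES and ETL_AUDIT tables and a hypothetical MYPROJECT.BATCH_ID variable that identifies the load to be reset:

-- Minimal cleanup sketch, meant to run as an Oracle Data Integrator procedure step.
-- TGT_SALES, ETL_AUDIT, and MYPROJECT.BATCH_ID are placeholder names standing in
-- for your own target tables and numeric load identifier.
DELETE FROM TGT_SALES WHERE BATCH_ID = #MYPROJECT.BATCH_ID;
DELETE FROM ETL_AUDIT WHERE BATCH_ID = #MYPROJECT.BATCH_ID;

An Oracle Data Integrator procedure built around statements like these can be run on its own while debugging, or called as the first step of the load itself for a preemptive cleanup.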
As these cleanup procedures become more fine-grained, they can be included in the integration processes themselves. There are two ways to take advantage of them:
- Always run a preemptive cleanup before loading data. If the data you are loading can be easily identified (by a batch_id or a particular date), this provides an easy way to overwrite previous loads of the same data set that may have failed
- Use these cleanup procedures in the workflows in case of error. You do not want to abort your entire load because, for a short amount of time, you lost connection to your source or target environment. Let’s say that you are loading hundreds of files and running complex transformations on these files. Would you abort everything because for a split second you had a problem accessing just one of the files, or would you rather complete everything, retry that file as soon as possible, and only take more drastic measures if that file still cannot be loaded? Depending on the stability of the environment you are running on, you can choose to try several times before alerting the appropriate individuals. Likewise, based on the type of outage you experience in your environment, you may want to force a pause before trying again.
Consider different types of recovery procedures. For simple tasks, it is often enough to retry the operation that failed without any other cleanup or resets. But as code evolves and becomes more complex, the types of errors you encounter will evolve as well. After resolving load errors (the load itself does not work), you may run into business errors (the data is loaded, but you are loading the wrong data, or some calculations are erroneous). Carefully investigate how you want to address these types of corrections. In early stages, it may be valid to reset everything when business errors are encountered, but you will want to become more fine-grained as the code evolves, so that your corrections can be more surgical.
Loops can be defined in Oracle Data Integrator to include cleanup procedures and to increment counters that keep track of the number of attempts. For better traceability of outages, best practice is to record the errors encountered, so that appropriate actions can be taken if errors become more frequent: you do not want to build in resiliency at the cost of hiding growing problems. You can use the approach described in Retrieve Errors From Failed Steps.
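As a rough illustration of such a loop, the retry counter can simply be a numeric project variable that is refreshed each time the package goes through the cleanup path. A hedged sketch of the refresh query, assuming a hypothetical MYPROJECT.RETRY_COUNT variable, is:

-- Refresh query for the hypothetical numeric variable MYPROJECT.RETRY_COUNT:
-- each refresh adds one to the current value of the variable.
select #MYPROJECT.RETRY_COUNT + 1 from dual

Set the variable to 0 at the start of the package so that the first refresh has a value to increment. In the package, the failing step's KO (error) link leads to the cleanup procedure, then to a Refresh Variable step on this counter, then to an Evaluate Variable step that either loops back to the load or raises an alert once the maximum number of attempts is reached.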
If you know that one particular step in your processes is prone to errors (for example, a remote system that goes offline regularly, with no set timing for the outage), you can take advantage of the built-in Oracle Data Integrator capability to retry a step in your packages. Use Automated Retries describes how to set up automated retries for a step.
Two things to keep in mind if you use this approach:
- Oracle Data Integrator will not notify you of the failure if the step is set to retry. If one of the subsequent attempts succeeds, the step is logged as successful in the Oracle Data Integrator logs (you can still keep track of these attempts in your own audit tables).
- Retries and wait times increase the duration of that step. Keep that in mind when reviewing the performance of your integration processes.
Identify Performance Drift
You will want to understand where the delays are and what causes them. Delays can be related to increased network activity that reduces the available bandwidth, stale statistics in your databases that adversely impact the running of your SQL code, or a profusion of files on a server where you are only interested in a select few.
Best practice in Oracle Data Integrator is to purge the logs (and scenario reports) as often as possible to improve performance. That said, it is very useful to archive the logs or copy performance-relevant information so that you can track performance over time.
As you investigate performance decay, look into individual steps in Oracle Data Integrator (do not stop at the overall performance): this is the best way to start understanding where the delays are coming from. Also make sure that you have an automated process to purge the Oracle Data Integrator logs. You can create an Oracle Data Integrator job that performs these purges, including the scenario reports. For details, see Purge Logs with OdiPurgeLog.
Because Oracle Data Integrator only purges scenario reports linked to sessions still present in the Operator, purging scenario reports after the sessions themselves have been purged requires assistance from Oracle Support. If you forgot to purge the scenario reports before purging sessions, contact Oracle Support, who can guide you through the appropriate procedure.
For the long run, it is crucial to continually monitor performance. Performance degradation is often a warning sign that the environment itself is degrading. In particular, Oracle recommends controlling execution plans using SQL Plan Management (SPM): execution plan changes in production can cause failures, and table loads can introduce the risk of execution plan changes.
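As a hedged illustration (run in the target database rather than in Oracle Data Integrator), a DBA can capture the current plan of a critical statement into a SQL plan baseline with the DBMS_SPM package. The sql_id value below is a placeholder for a statement generated by your mappings:

-- Capture the current execution plan of one statement as a SQL plan baseline.
-- 'abcd1234efgh5678' is a placeholder sql_id; look up the real one in V$SQL.
DECLARE
  l_plans PLS_INTEGER;
BEGIN
  l_plans := DBMS_SPM.LOAD_PLANS_FROM_CURSOR_CACHE(sql_id => 'abcd1234efgh5678');
  DBMS_OUTPUT.PUT_LINE('Plans captured: ' || l_plans);
END;
/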
Purge Logs with OdiPurgeLog
If you look in the Utilities drawer of your Oracle Data Integrator Packages, you will see a tool called OdiPurgeLog. You can use this in a scenario designed to purge the Oracle Data Integrator logs, and schedule this scenario to run regularly to make sure that you retain as few logs as possible.
Best practices include:
- Always purge the scenario reports as well. They are more difficult to remove on their own than they are as part of a log purge.
- You can build some latency into the purge: use variables to store a prior time or date before which everything should be purged, and pass that value to the End Date parameter (see the sketch after this list).
- You can choose to purge scenario logs (and reports) only, or purge both scenario and load plan logs.
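As a rough sketch, a package step built on OdiPurgeLog could look like the following. MYPROJECT.PURGE_DATE is a hypothetical variable holding the cutoff date, and the parameter names and accepted values should be checked against the Oracle Data Integrator Tools Reference for your release:

OdiPurgeLog "-PURGE_TYPE=ALL" "-TODATE=#MYPROJECT.PURGE_DATE" "-PURGE_REPORTS=Yes"

Scheduling the scenario generated from this package to run regularly keeps the log volume, and therefore the Operator and repository overhead, under control.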
Retrieve Errors From Failed Steps
The getPrevStepLog() API is typically used in an Oracle Data Integrator procedure. It is very convenient when a step fails and you want to retrieve the errors reported by that step before taking corrective action.
This API is invoked with a property name and returns the corresponding information. For example, if you want the name of the session, the name of the step that failed, and the associated error message, you could use the following code in your procedure:
Session: <%=odiRef.getInfo("SESS_NAME")%> encountered the following error at step: <%=odiRef.getPrevStepLog("STEP_NAME")%>
Error Message: <%=odiRef.getPrevStepLog("MESSAGE")%>
You would wrap this snippet in additional code that stores the information somewhere, or sends it for proper processing, as in the sketch below.
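For example, here is a hedged sketch of a procedure command that writes this information to a hypothetical ETL_ERROR_LOG audit table on the target data server; the table name and columns are assumptions:

-- Hypothetical audit table: ETL_ERROR_LOG(SESSION_NAME, STEP_NAME, ERROR_MESSAGE, LOGGED_AT)
insert into ETL_ERROR_LOG (session_name, step_name, error_message, logged_at)
values (
  '<%=odiRef.getInfo("SESS_NAME")%>',
  '<%=odiRef.getPrevStepLog("STEP_NAME")%>',
  q'[<%=odiRef.getPrevStepLog("MESSAGE")%>]',
  sysdate
)

In a package, connect this procedure to the monitored step with the KO (error) link so that it runs only when that step fails.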
Use Automated Retries
Automated retries save time: the complete process gets a chance to finish instead of being canceled because of a brief or temporary glitch.
In your package, select the step where you want to allow retries. In the properties box, click the Advanced tab. In the Processing after failure area:
- Define the number of times you want to attempt to retry that step
- Define how long you want to wait between each retry
Use Unique or Dynamic Names for Scenario Sessions
When the same scenario is run numerous times to load different sets of data, the Oracle Data Integrator Operator view is not very helpful if all it shows is a list of many instances of that same scenario running, perhaps with an occasional error.
One elegant way around this is to take advantage of the Session Name (SESS_NAME) parameter when invoking the scenario. If the same scenario runs many times, you probably already have a parameter telling the scenario which data it has to process (a particular data slice, load_id, date, and so on). You can use that variable to build a unique name for each execution of the scenario. By setting the session name in the call to the scenario, each session appears under a distinct name, resulting in much more readable logs in the Oracle Data Integrator Operator. This makes it a lot easier to understand which data set is problematic when one of the executions fails.
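As an illustration, a call built with the OdiStartScen tool could look like the sketch below. LOAD_SALES, MYPROJECT, and the LOAD_ID variable are hypothetical names, and the exact parameters are described in the Oracle Data Integrator Tools Reference:

OdiStartScen "-SCEN_NAME=LOAD_SALES" "-SCEN_VERSION=001" "-SYNC_MODE=2" "-SESSION_NAME=LOAD_SALES_#MYPROJECT.LOAD_ID" "-MYPROJECT.LOAD_ID=#MYPROJECT.LOAD_ID"

Each execution then shows up in the Operator under a name that carries its load identifier, rather than as yet another anonymous run of the same scenario.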
Use Event-Driven Tools
Oracle Data Integrator offers a number of tools that can be used in your packages to wait for new data to be available.
All these tools allow you to set your polling rate, as well as a timeout period:
- OdiFileWait waits for files to be available in a directory (keep in mind that your Oracle Data Integrator agent must be able to see that directory). See the sketch after this list.
- OdiWaitForData waits for new data to be available in a table based on a query that you supply.
- OdiWaitForTable waits for a table to be created in your database.
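For example, a hedged sketch of an OdiFileWait step that polls a hypothetical /data/incoming directory every 10 seconds, for up to one hour, for CSV files could look like this (parameter names and units should be verified against the Oracle Data Integrator Tools Reference):

OdiFileWait "-DIR=/data/incoming" "-PATTERN=*.csv" "-POLLINT=10000" "-TIMEOUT=3600000" "-ACTION=NONE"

Depending on the tool's error options, reaching the timeout with no matching file can be treated as a step failure, which the package's error link can route to an alert or to one of the cleanup procedures described earlier.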
Configure the Agent Blueprint Cache Timeout
With Oracle Data Integrator 12c, the efficiency of the agent has been improved by caching the definition of the scenarios that are run. You can control how long scenarios are cached in the definition of the physical agent in Oracle Data Integrator Topology.
Caching the scenario in the agent is useful for real-time jobs because the agent does not have to retrieve the information from the repository for every run. The drawback is that a newly deployed version of a scenario is not picked up immediately. The default timeout before a new version of a cached scenario is picked up is 600 seconds (10 minutes), and 100 entries are cached by default.
You can manage these values. In the Agent definition, on the Definition tab, use the Session Blueprint Cache Management section to set Maximum cache entries and Unused Blueprint Lifetime (Sec).