Recover from Application Failures

Recoverability in Oracle Health Insurance Components manifests in different places and at different levels of operation.

Examples of technical errors are:

  • External web service endpoints are unavailable, for example because specific systems are down or because of network failures

  • Errors in dynamic logic scripts

Typically, technical errors affect a large number of similar operations, and recovery can be costly, both in the system resources it requires and in the risk of missing service level agreements. It is therefore important to detect and resolve any error as soon as possible, so that failing or incomplete tasks can be restarted with a minimum of reprocessing and from a state as close as possible to the one prior to the incident.

Besides recovery from technical errors in individual processes, the system is able to recover from system or node failure.

Tasks

In some Oracle Health Insurance applications, like Oracle Health Insurance Claims, processing is based on units of work referred to as tasks. Error recovery is centered around tasks. Tasks go through different states:

  • Initial: the task is initialized. This involves execution of basic sanity checks and construction of the task’s context.

  • Pending: the task is executed.

  • Completed: execution of the task ended as expected.

  • Errored: an error occurred that prevented the task from completing successfully.
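The four states above can be pictured as a small state machine. The following is a minimal sketch only: the enum name, the method names, and the exact set of allowed transitions are assumptions made for illustration and are not taken from the product.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical model of the four task states; not the product's own type. */
enum TaskState {
    INITIAL, PENDING, COMPLETED, ERRORED;

    /** States this state may move to. The transition rules are an assumption. */
    Set<TaskState> next() {
        switch (this) {
            case INITIAL: return EnumSet.of(PENDING, ERRORED);   // sanity checks pass or fail
            case PENDING: return EnumSet.of(COMPLETED, ERRORED); // execution ends or fails
            case ERRORED: return EnumSet.of(PENDING);            // resubmitted after recovery
            default:      return EnumSet.noneOf(TaskState.class); // COMPLETED is final
        }
    }

    boolean canMoveTo(TaskState target) {
        return next().contains(target);
    }
}
```

In this sketch, Errored is the only state from which a task re-enters processing, which matches the idea that recovery resumes at the task in which the error occurred.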

The claims process flow in Oracle Health Insurance Claims is built up as a series of distinct tasks. Tasks in the flow allow the following:

  • Failing fast, in order to recover as close as possible to the state prior to the incident.

  • Restarting from a particular task, which prevents the reprocessing that would occur if processing of a claim had to be restarted from the beginning. The goal is to resume processing by repeating only the task in which the error occurred.

Recover Errored Tasks in the Claims Processing Flow

In a number of cases, recovery requires manual intervention: after the problem is resolved, processing of errored tasks must be resumed. This is done through the Recovery from Technical Errors screen. The screen was designed with the following principles in mind:

  • Consistency and simplicity: the screen provides a generic recovery facility that applies to any task in the system. Error processing that differs from task to task would be harder for operators to work with.

  • Ease of maintenance: when the root cause of the problem is resolved, the system supports (manual) recovery from the error in a simple way for all tasks that failed as a result of it. Manually restarting a large volume of errored tasks one by one is not a valid option. Therefore, the system supports bulk operations to achieve this.

The screen is used to view tasks that are in a technical error state. Once the root cause of a problem has been identified and resolved, the operator can trigger the resumption of processing either for an individual claim or for the whole batch of similar tasks in the error state.

The screen has three main features used to:

  • detect tasks in the errored state

  • aid in issue resolution, and

  • resubmit the errored tasks for processing.

Errored Task Listing

The screen shows a table with the following columns, in this order:

  • Last Updated Date, the date and time when this task was last run

  • Task Name, the name of the task in errored status

  • Type, the entity type of the Target associated with this task, for example a Claim

  • Code, the code of the item associated with this task or, if the entity does not have a code, its ID. Note that no data access restrictions apply to the Claim code on this page. Any user with access to this page should be able to see all Claim codes (and the codes of other entities to which data access restrictions might apply on other pages)

  • Actions

Quick Search and Advanced Search are available on Last Updated Date, Task Name and Type. Advanced Search also supports a between (range) search on Last Updated Date.

Process All

The Process All button requests processing of all tasks returned to the screen by a query.

Actions: Details and Process

Each row in the errored task listing offers two actions: showing more details about the failed task, and restarting that task. The exception details are shown in a single text box containing a stack trace, formatted as a readable Java stack trace.

When a target is resubmitted to the processing flow, the status of that task changes from errored to pending, and the screen contents are re-queried. As a result, a task cannot be resubmitted for processing multiple times.
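The status guard described above, together with the bulk behavior of Process All, can be sketched as follows. This is an illustration under stated assumptions, not the product's implementation: the class names ErroredTask and RecoveryActions, the resubmit and processAll methods, and the use of an atomic compare-and-set are all hypothetical.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for one row on the recovery screen. */
class ErroredTask {
    enum Status { PENDING, ERRORED }

    final String code;
    private final AtomicReference<Status> status =
            new AtomicReference<>(Status.ERRORED);

    ErroredTask(String code) {
        this.code = code;
    }

    Status status() {
        return status.get();
    }

    /** Returns true only for the call that actually moved the task to PENDING. */
    boolean resubmit() {
        return status.compareAndSet(Status.ERRORED, Status.PENDING);
    }
}

class RecoveryActions {
    /** Mimics Process All: resubmits every task in the query result
     *  and returns how many resubmissions were accepted. */
    static int processAll(List<ErroredTask> queryResult) {
        int accepted = 0;
        for (ErroredTask task : queryResult) {
            if (task.resubmit()) {
                accepted++;
            }
        }
        return accepted;
    }
}
```

Because a task that is already pending rejects a second resubmission, running the bulk action on a stale query result would simply skip the tasks that were restarted earlier.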

Recover from External Service Failures

For recovery from failures to deliver a message to an external or outbound service, the following cases are distinguished:

  • Processing of a claim should not be interrupted

  • Processing of a claim should be interrupted

Failing to deliver a message for which processing of a claim should not be interrupted

An example is the failure to deliver claims event messages. The timing for delivering these types of messages is not critical. The task for delivering the claims event message goes into the errored state (the guiding principle being that the event message needs to be delivered), but processing of the particular claim continues by spawning the next task. The failing task must be restarted from the Recovery from Technical Errors screen.

Oracle Health Insurance applications keep track of the availability of the external services they send messages to. If the system sends a message and a timeout occurs, the external service is temporarily flagged as unavailable. The system then waits before making another attempt to access the failing service, so that it does not run into the same timeout with every request to that service. The message-sending tasks fail with the error message 'Delivery of message to Web Service {service name x} failed. Timeout period is {time out period y} ms.'. The waiting period is controlled by the system property ohi.ws.client.retrytimeout.
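The availability tracking described above resembles a simple per-service back-off. The sketch below shows one way such a tracker could work. It is not the product's code: the class ServiceAvailability and its methods are hypothetical, and the retry period is passed in as a Duration because the unit of ohi.ws.client.retrytimeout is not stated here.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical tracker of external service availability. */
class ServiceAvailability {
    // Plays the role of ohi.ws.client.retrytimeout (unit assumed by the caller).
    private final Duration retryTimeout;
    private final Map<String, Instant> unavailableUntil = new ConcurrentHashMap<>();

    ServiceAvailability(Duration retryTimeout) {
        this.retryTimeout = retryTimeout;
    }

    /** Flags the service as unavailable after a request to it timed out. */
    void recordTimeout(String service, Instant now) {
        unavailableUntil.put(service, now.plus(retryTimeout));
    }

    /** True if the service was never flagged or its waiting period is over. */
    boolean mayAttempt(String service, Instant now) {
        Instant until = unavailableUntil.get(service);
        if (until == null) {
            return true;
        }
        if (now.isBefore(until)) {
            return false; // still waiting; skip the call and fail fast
        }
        unavailableUntil.remove(service); // waiting period elapsed, clear the flag
        return true;
    }
}
```

Passing the current time in as a parameter keeps the sketch deterministic; a caller would check mayAttempt before sending and call recordTimeout when a send times out.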

Failing to deliver a message for which the Claims Flow should be interrupted

Examples are enrollment request messages: enrollment data is required to adjudicate a claim. The task goes into the errored state and processing of the particular claim is stopped. The failing task must be restarted from the Recovery from Technical Errors screen.
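The difference between the two cases can be summarized in a small decision sketch. The names MessageKind and DeliveryOutcome are hypothetical and serve only to restate the rules above: a failed delivery always leaves the delivering task in the errored state, and the message type decides whether the claim continues.

```java
/** Hypothetical classification of outbound messages; names are illustrative. */
enum MessageKind {
    CLAIMS_EVENT(false),      // delivery timing is not critical
    ENROLLMENT_REQUEST(true); // adjudication needs the enrollment data

    final boolean interruptsClaim;

    MessageKind(boolean interruptsClaim) {
        this.interruptsClaim = interruptsClaim;
    }
}

/** Result of one delivery attempt, as seen by the claims flow. */
final class DeliveryOutcome {
    final boolean taskErrored;    // task must be restarted from the recovery screen
    final boolean claimContinues; // the next task in the flow is spawned

    private DeliveryOutcome(boolean taskErrored, boolean claimContinues) {
        this.taskErrored = taskErrored;
        this.claimContinues = claimContinues;
    }

    static DeliveryOutcome of(MessageKind kind, boolean delivered) {
        if (delivered) {
            return new DeliveryOutcome(false, true);
        }
        // Failed delivery: the task errors in both cases; only
        // non-interrupting messages let the claim move on.
        return new DeliveryOutcome(true, !kind.interruptsClaim);
    }
}
```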