Recover from Application Failures

Recoverability in Oracle Health Insurance applications manifests in different places and at different levels of operation.

Examples of technical errors are:

  • External web service endpoints are unavailable, for example because specific systems are down or because of network failures.

  • Errors in Dynamic Logic scripts.

Typically, technical errors impact many similar operations, and recovery costs can be high, both in terms of the system resources necessary for recovery and in terms of the risk of missing service level agreements. Therefore, it is important to detect and resolve any error as soon as possible, so that failing or incomplete tasks can restart with a minimum of reprocessing, as close as possible to the state before the incident.

Besides recovery from technical errors in individual processes, the system can also recover from system or node failure.

Tasks

Some Oracle Health Insurance applications, like Claims, base processing on units of work referred to as tasks. Error recovery centers on tasks. Tasks go through different stages:

  • Initial: The task starts. This involves the execution of basic sanity checks and the construction of the task’s context.

  • Pending: The task is executing.

  • Completed: The task executed as expected.

  • Errored: When an error prevents the task from completing successfully.
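The stages above can be read as a small state machine. The following Python sketch is illustrative only: the names `TaskStatus`, `ALLOWED`, and `advance` are hypothetical, and the transition table is an assumption derived from the stage descriptions and from the recovery screen behavior (an Errored task returns to Pending when resubmitted). It is not the product's implementation.

```python
from enum import Enum


class TaskStatus(Enum):
    INITIAL = "Initial"
    PENDING = "Pending"
    COMPLETED = "Completed"
    ERRORED = "Errored"


# Assumed transitions, for illustration only.
ALLOWED = {
    TaskStatus.INITIAL: {TaskStatus.PENDING},
    TaskStatus.PENDING: {TaskStatus.COMPLETED, TaskStatus.ERRORED},
    TaskStatus.ERRORED: {TaskStatus.PENDING},  # resubmission after recovery
    TaskStatus.COMPLETED: set(),               # terminal stage
}


def advance(current, target):
    """Return the target status, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Under these assumptions, a Completed task can never re-enter processing, while an Errored task has exactly one way forward: back to Pending.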

A series of distinct tasks builds the claims process flow in Claims. Splitting the flow into tasks allows the system to:

  • Fail fast, so that recovery starts as close as possible to the state before the incident.

  • Restart from a particular task instead of restarting a Claim from the beginning, which prevents reprocessing. The goal is to resume processing by repeating only the task in which the error occurred.

Steps to Recover Errored Tasks in the Claims Processing Flow

In most cases, recovery requires manual intervention. The processing of Errored tasks must resume after the problem is resolved. Do this through the Recovery from Technical Errors screen. The design of the screen follows these principles:

  • Consistency and simplicity: the screen provides a generic recovery facility that applies to any task in the system. Different error-processing on a per-task basis would be harder for operators.

  • Ease of maintenance: After the root cause of the problem is resolved, the system supports (manual) recovery for all tasks that failed because of the error. Manually restarting a large volume of Errored tasks one by one is not a viable option. Therefore, the system supports bulk operations to achieve this.

Use the screen to view tasks that are in a technical error state. After identifying and resolving the root cause of a problem, the operator triggers the resumption of processing either on an individual Claim or on the whole batch of similar tasks in the Errored state.

The screen has three key features used to:

  • Detect tasks in the Errored state.

  • Aid in issue resolution.

  • Resubmit Errored tasks for processing.

Errored Task Listing

The screen shows a table with the following columns, in this order:

  • Last Updated Date: The last date and time when this task was running.

  • Task Name: The name of the task in Errored status.

  • Type: The Entity Type of the target that the task is associated with. For example, a Claim.

  • Code: The Code of the item associated with this task, or the ID if the entity does not have a Code. Note that no data access restrictions apply to the Claim Code on this page. Any user with access to this page can see all Claim Codes (and the Codes of other Entities to which data access restrictions may apply on other pages).

  • Actions: Quick Search and Advanced Search are available on Last Updated Date, Task Name, and Type. Advanced Search supports a between search on Last Updated Date.

Process All

The Process All button requests processing of all tasks that the current query returns.

Actions: Details and Process

Each row in the Errored Claim screen has a feature for showing more details about the task that failed, or for restarting that task. The exception details appear in a single text box containing a stack trace, formatted as a readable Java stack trace.

When a user resubmits a target to the processing flow, the status of that task changes from Errored to Pending and the screen re-queries its contents. This way, a task cannot be returned to processing multiple times.
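The resubmission behavior can be sketched as a guarded status change. The Python example below is a minimal illustration, not the product's code: the `Task` class and the `resubmit` and `process_all` functions are hypothetical names, and statuses are modeled as plain strings. It shows why a task that has already moved to Pending cannot be resubmitted a second time, and how a bulk operation in the style of Process All can reuse the same guard.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: int
    name: str
    status: str  # one of "Initial", "Pending", "Completed", "Errored"


def resubmit(task):
    """Move an Errored task back to Pending; return True if it was resubmitted."""
    if task.status != "Errored":
        # Already resubmitted (or never failed): nothing to do.
        return False
    task.status = "Pending"
    return True


def process_all(tasks):
    """Resubmit every Errored task in a query result; return how many moved."""
    resubmitted = 0
    for task in tasks:
        if resubmit(task):
            resubmitted += 1
    return resubmitted
```

Calling `process_all` twice on the same result set resubmits each Errored task only once, because the second call finds every task already in the Pending status.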

Recover from External Service Failures

For recovery from failures to deliver a message to an external or outbound service, the following cases are important:

  • Processing of a Claim must not be interrupted.

  • Processing of a Claim must be interrupted.

Task Recovery

Use the restart API operation to recover tasks from the user interface.

Restart the Errored task by sending a POST request to:

http://[hostName]:[portNumber]/[api-context-root]/taskprocessing/{task id}/restart
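As an illustration of calling the restart operation, the following Python sketch builds the POST request from the URL pattern above using only the standard library. The function name `build_restart_request` and the sample values are hypothetical. An empty request body is an assumption, and authentication is omitted because the text above does not describe either. A real deployment will most likely require credentials.

```python
import urllib.request
from urllib.parse import quote


def build_restart_request(host_name, port_number, api_context_root, task_id):
    """Build (but do not send) the POST request that restarts one task."""
    url = (
        f"http://{host_name}:{port_number}/{api_context_root.strip('/')}"
        f"/taskprocessing/{quote(str(task_id), safe='')}/restart"
    )
    # An empty body is assumed; the source does not describe a payload.
    return urllib.request.Request(url, data=b"", method="POST")


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    request = build_restart_request("localhost", 8080, "api", 12345)
    print(request.get_method(), request.full_url)
    # To send it (requires a reachable server and, most likely, credentials):
    # with urllib.request.urlopen(request) as response:
    #     print(response.status)
```

Separating request construction from sending keeps the URL logic easy to check without a running server.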