Recover from Application Failures

Recoverability in Oracle Health Insurance applications manifests in different places and at different levels of operation.

Examples of technical errors are:

  • External web service endpoints are unavailable, for example because specific systems are down or because of network failures.

  • Errors in Dynamic Logic scripts.

Typically, technical errors impact many similar operations, and recovery costs can be high, both in terms of the system resources necessary for recovery and the risk of missing service level agreements. Therefore, it is important to detect and resolve any error as soon as possible, so that failing or incomplete tasks can restart with a minimum of reprocessing, as close as possible to the state before the incident.

Besides recovery from technical errors in individual processes, the system can also recover from system or node failure.

Tasks

Some Oracle Health Insurance applications, like Claims, base processing on units of work referred to as tasks. Error recovery centers on tasks. Tasks go through different stages:

  • Initial: The task starts. This involves the execution of basic sanity checks and the construction of the task’s context.

  • Pending: The task is executing.

  • Completed: The task executed as expected.

  • Errored: An error prevented the task from completing successfully.
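The stages above can be read as a small state machine. The following is a minimal sketch, not the product's implementation: `Stage`, `ALLOWED`, and `advance` are hypothetical names, and the transitions are assumptions inferred from the stage descriptions (the Errored to Pending transition reflects a restart).

```python
from enum import Enum


class Stage(Enum):
    INITIAL = "Initial"
    PENDING = "Pending"
    COMPLETED = "Completed"
    ERRORED = "Errored"


# Assumed transitions, inferred from the stage descriptions; not taken from the product.
ALLOWED = {
    Stage.INITIAL: {Stage.PENDING},
    Stage.PENDING: {Stage.COMPLETED, Stage.ERRORED},
    Stage.ERRORED: {Stage.PENDING},   # a restart puts the task back in Pending
    Stage.COMPLETED: set(),           # terminal
}


def advance(current: Stage, target: Stage) -> Stage:
    """Return the target stage, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

In this sketch a Completed task has no outgoing transitions, so only Errored tasks are candidates for recovery.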

A series of distinct tasks builds the claims process flow in Claims. Dividing the flow into tasks allows the system to:

  • Fail fast, so that recovery starts as close as possible to the state before the incident.

  • Prevent reprocessing. Instead of restarting a Claim from the beginning, processing restarts from a particular task. The goal is to resume processing by repeating only the task in which the error occurred.
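The idea of failing fast and resuming at the failed task can be illustrated with a short sketch. Everything here is hypothetical (the `run_flow` function, the task names, and the shared context dictionary); it only shows the principle of repeating the failed task rather than the whole flow.

```python
from typing import Callable, Dict, List, Optional, Tuple

Task = Tuple[str, Callable[[Dict], None]]  # (task name, action on the shared context)


def run_flow(tasks: List[Task], context: Dict, start_at: int = 0) -> Optional[int]:
    """Run tasks in order from start_at.

    Returns None when every task completes, or the index of the task that failed.
    """
    context.pop("error", None)  # clear the error left by a previous attempt
    for index in range(start_at, len(tasks)):
        name, action = tasks[index]
        try:
            action(context)
        except Exception as exc:
            # Fail fast: stop at the first error and report where it happened.
            context["error"] = f"{name}: {exc}"
            return index
        context.setdefault("completed", []).append(name)
    return None


# Hypothetical flow in which the second task depends on an external service.
service_up = {"value": False}


def enrich(ctx: Dict) -> None:
    if not service_up["value"]:
        raise RuntimeError("service unavailable")


flow: List[Task] = [
    ("validate", lambda ctx: None),
    ("enrich", enrich),
    ("finalize", lambda ctx: None),
]
ctx: Dict = {}
failed_at = run_flow(flow, ctx)                   # stops at "enrich"
service_up["value"] = True                        # root cause resolved
result = run_flow(flow, ctx, start_at=failed_at)  # repeats "enrich", then continues
```

The "validate" task runs only once: the second call starts at the index returned by the first.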

Steps to Recover Errored Tasks in the Claims Processing Flow

In most cases, recovery requires manual intervention. Processing of Errored tasks must resume after the problem is resolved. Do this through the Recovery from Technical Errors screen. The screen is designed with the following principles in mind:

  • Consistency and simplicity: the screen provides a generic recovery facility that applies to any task in the system. Different error-processing on a per-task basis would be harder for operators.

  • Ease of maintenance: After the root cause of the problem is resolved, the system supports (manual) recovery for all tasks that failed because of the error. Manually restarting a large volume of Errored tasks one by one is not a viable option. Therefore, the system supports bulk operations to achieve this.

Use the screen to view tasks that are in a technical error state. After identifying and resolving the root cause of a problem, the operator triggers the resumption of processing either on an individual Claim or on the whole batch of similar tasks in the Errored state.

The screen has three key features used to:

  • Detect tasks in the Errored state.

  • Aid in issue resolution.

  • Resubmit Errored tasks for processing.

Errored Task Listing

The screen shows a table with the following columns, in this order:

  • Last Updated Date: The last date and time when this task was running.

  • Task Name: The name of the task in Errored status.

  • Type: The Entity Type of the target that the task is associated with, for example, a Claim.

  • Code: The Code of the item associated with this task, or the ID if the entity does not have a Code. Note that no data access restrictions apply to the Claim Code on this page. Any user with access to this page must be able to see all Claim Codes (and the Codes of other Entities where data access restrictions may apply in other pages).

  • Actions: The actions available for the task (see Actions: Details and Process).

Quick Search and Advanced Search are available on Last Updated Date, Task Name, and Type. Advanced Search also supports a between search on Last Updated Date.

Process All

The Process All button requests processing of all tasks listed on the screen, that is, all tasks that the current query returns.

Actions: Details and Process

Each row in the screen offers an action for showing more details about the failed task and an action for restarting that task. The exception details appear in a single text box containing a stack trace, formatted as a readable Java stack trace.

When a user resubmits a target to the processing flow, the status of the task changes from Errored to Pending and the screen re-queries its contents. In this fashion, the same task cannot be resubmitted for processing multiple times.
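The resubmission behavior, including the bulk Process All operation, can be sketched as follows. This is an illustration under assumptions, not the product's code: `TaskRow`, `resubmit`, and `process_all` are hypothetical names, and a real system would perform the status check and update atomically.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class TaskRow:
    task_id: int
    task_name: str
    status: str  # "Errored", "Pending", ...


def resubmit(row: TaskRow) -> bool:
    """Move an Errored task back to Pending; return False if it was not Errored."""
    if row.status != "Errored":
        return False  # already resubmitted (or never failed): do nothing
    row.status = "Pending"
    return True


def process_all(rows: Iterable[TaskRow], matches: Callable[[TaskRow], bool]) -> List[int]:
    """Resubmit every row selected by the query predicate; return the resubmitted ids."""
    return [row.task_id for row in rows if matches(row) and resubmit(row)]
```

Because `resubmit` only acts on rows in the Errored status, running `process_all` a second time with the same predicate resubmits nothing.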

Recover from External Service Failures

For recovery from failures to deliver a message to an external or outbound service, the system distinguishes two cases:

  • Processing of the Claim must not be interrupted.

  • Processing of the Claim must be interrupted.

Failing to Deliver a Message for Which Processing of a Claim Must Not Interrupt

An example is a failure to deliver Claims event messages. The timing for delivering these types of messages is not critical.

The task for delivering the Claims event message goes into the Errored state (the guiding principle is that delivery of the event message is essential), but processing of the particular Claim continues by spawning the next task. Restart the failing task from the Recovery from Technical Errors screen.

Oracle Health Insurance applications keep track of the availability of the external services they send messages to. If the system sends a message and a time-out occurs, the system flags the external service as unavailable. The system then waits before making another attempt to access the failing service, to prevent running into the same time-out with every request it makes to that service. During the waiting period, message sending tasks fail with the error message "Delivery of message to Web Service {service name x} failed. Timeout period is {time out period y} ms". The system property ohi.ws.client.retrytimeout controls the waiting period.
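The availability tracking described above resembles a simple circuit breaker. The sketch below is only an illustration of that pattern: the `ServiceAvailability` class and its methods are hypothetical, and the `retry_timeout_ms` parameter merely stands in for the waiting period that ohi.ws.client.retrytimeout controls (the unit of that property is an assumption here).

```python
import time
from typing import Callable, Dict


class ServiceAvailability:
    """Tracks which external services recently timed out."""

    def __init__(self, retry_timeout_ms: int, clock: Callable[[], float] = time.monotonic):
        self._retry_timeout_s = retry_timeout_ms / 1000.0  # assumed to be milliseconds
        self._clock = clock
        self._unavailable_since: Dict[str, float] = {}

    def mark_timed_out(self, service: str) -> None:
        """Flag the service as unavailable, starting the waiting period."""
        self._unavailable_since[service] = self._clock()

    def may_attempt(self, service: str) -> bool:
        """Return True if the service is not flagged or the waiting period has passed."""
        since = self._unavailable_since.get(service)
        if since is None:
            return True
        if self._clock() - since >= self._retry_timeout_s:
            del self._unavailable_since[service]  # waiting period over: allow a new attempt
            return True
        return False
```

A sending task would call `may_attempt` before contacting the service and fail immediately when it returns False, instead of waiting for another time-out.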

Failing to Deliver a Message for Which the Claims Flow Must Interrupt

Examples are enrollment request messages. Enrollment data is necessary for the adjudication of a Claim. The task goes into the Errored state and processing of the particular Claim stops. Restart the failing task from the Recovery from Technical Errors screen.

Task Recovery

Recover tasks from the user interface or through the restart API operation.

Restart the Errored task by sending a POST to:

http://<hostname>/<context>/api/taskprocessing/<task id>/restart
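As a minimal sketch, the restart request could be built with the Python standard library as shown below. The host name, context root, and task ID are placeholder values, the empty request body is an assumption, and authentication is not shown; `build_restart_request` is a hypothetical helper.

```python
import urllib.request


def build_restart_request(hostname: str, context: str, task_id: int) -> urllib.request.Request:
    """Build (but do not send) the POST request that restarts an Errored task."""
    url = f"http://{hostname}/{context}/api/taskprocessing/{task_id}/restart"
    return urllib.request.Request(url, data=b"", method="POST")


# Placeholder values for illustration only.
request = build_restart_request("localhost:8080", "claims", 12345)
# To send it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes the URL easy to inspect or log before any call reaches the system.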

Dismiss Task

A user can dismiss certain failed tasks using the user interface or through the API. Dismissing a task sets its status to DISMISSED, after which the task cannot be restarted.

Dismiss the task by sending a POST to:

http://<hostname>/<context>/api/tasks/<task id>/dismiss

For tasks that can be dismissed, the system adds this link to the resource representation of the task in the generic API.

The following task types can be dismissed:

  • WorkflowEvent

  • TaskDoneEvent

  • ClaimsEvent

  • CtrClaimEvent

  • PolicyAccountTransactionEvent

The system responds with 204 No Content after dismissing the task.

Error Messages

The following error messages that are specific to this service may be returned in the response messages:

Table 1. Error Messages
Code             Severity  Text
OHI-TASK-IP-002  Fatal     Only tasks in status Errored can be dismissed
OHI-TASK-IP-003  Fatal     {tasktypecode} type of task cannot be dismissed
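A client could mirror these two rules before sending a dismiss request. The sketch below is a hypothetical client-side pre-check, not part of the API: the function name, the order in which the rules are checked, and the "Errored" status string are assumptions. The task types are those listed as dismissible above.

```python
from typing import Optional

# Task types listed above as dismissible.
DISMISSIBLE_TYPES = {
    "WorkflowEvent",
    "TaskDoneEvent",
    "ClaimsEvent",
    "CtrClaimEvent",
    "PolicyAccountTransactionEvent",
}


def expected_dismiss_error(task_type: str, status: str) -> Optional[str]:
    """Return the error code a dismiss request would likely get, or None if it should succeed."""
    if status != "Errored":
        return "OHI-TASK-IP-002"  # only tasks in status Errored can be dismissed
    if task_type not in DISMISSIBLE_TYPES:
        return "OHI-TASK-IP-003"  # this type of task cannot be dismissed
    return None
```

Such a check only avoids requests that are likely to fail; the system's own response remains authoritative.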