Recovering from Transient Errors

Because a grid spans multiple hosts, there is an opportunity for multiple types of failure, many of which can be transient errors. For the most part, TimesTen Scaleout can detect transient errors and adapt to them quickly.

Most errors in the grid are transient and are reported with error codes designated as Transient. A transient error may cause a specific API call, SQL statement, or transaction to fail. In most cases, the application can retry the same operation successfully.

The potential impacts of a transient error are:

  • The execution of a particular statement failed. Your application should re-run the statement.

  • The execution of a particular transaction failed. Your application should roll back the transaction and perform the operations of the transaction again.

  • The connection to the data instance failed. If you are using a client/server connection, then TimesTen Scaleout routes the connection to another active data instance. See Client Connection Failover.

The following sections describe how TimesTen Scaleout recovers the element from the more common transient errors:

Retry Transient Errors

While TimesTen Scaleout automatically handles the source of most transient errors, your application may need to retry the entire transaction. Specifically, retry the entire transaction when the application receives any of the errors described in Table 13-1.

Table 13-1 SQLSTATE and ORA Errors for Retrying After Transient Failure

SQLSTATE | ORA Error | PL/SQL Exception | Error Message
TT005 | ORA-57005 | -57005 | Transient transaction failure due to unavailability of a grid resource. Roll back the transaction and then retry the transaction.

Your applications can check for the transient error as follows:

  • ODBC or JDBC applications check for the SQLSTATE TT005 error to determine if the application should retry the transaction. See Retrying After Transient Errors (ODBC) in Oracle TimesTen In-Memory Database C Developer's Guide and Retrying After Transient Errors (JDBC) in Oracle TimesTen In-Memory Database Java Developer's Guide.

  • OCI and Pro*C applications check for the ORA-57005 error to determine if the application should retry a SQL statement or transaction. See Transient Errors (OCI) in Oracle TimesTen In-Memory Database C Developer's Guide.

  • PL/SQL applications check for the -57005 PL/SQL exception to determine if the application should retry the transaction. See Retrying After Transient Errors (PL/SQL) in Oracle TimesTen In-Memory Database PL/SQL Developer's Guide.
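The roll-back-and-retry pattern described above can be sketched in a few lines of Python. This is a minimal illustration, not a TimesTen API: GridError is a hypothetical stand-in for whatever exception your driver raises, and conn is assumed to be any connection object that offers commit() and rollback(). Only the SQLSTATE value TT005 comes from Table 13-1.

```python
TRANSIENT_SQLSTATE = "TT005"  # transient failure code from Table 13-1


class GridError(Exception):
    """Hypothetical driver error that carries a SQLSTATE (illustration only)."""

    def __init__(self, sqlstate, message=""):
        super().__init__(message)
        self.sqlstate = sqlstate


def run_with_retry(conn, work, max_attempts=3):
    """Run work(conn) as one transaction; roll back and retry on TT005.

    conn is assumed to expose commit() and rollback().
    work must contain every operation of the transaction, because the
    whole transaction is re-run after a transient failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = work(conn)
            conn.commit()
            return result
        except GridError as exc:
            # Roll back before either retrying or giving up.
            conn.rollback()
            if exc.sqlstate != TRANSIENT_SQLSTATE or attempt == max_attempts:
                raise
```

Errors with any other SQLSTATE are rolled back and re-raised immediately, so only the transient case is retried. The limit of three attempts is an arbitrary choice for the sketch.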

Communications Error

Communications can fail between elements, between data instances or between a data instance and a ZooKeeper membership server.

The following describes the type of communications that might fail:

  • Communication between elements: Used to run SQL statements within transactions and stream data between elements, as required. If there is a communications error while the application is executing a transaction, then you must roll back the transaction. When you retry the transaction, communications are recreated and work continues.

  • Communication between data instances: The data instances communicate with each other to establish communication and to send or receive recovery messages. If communication between the data instances breaks, then it is automatically recovered when you retry the operation.

  • Communication between data instances and the ZooKeeper membership servers: Each data instance communicates with the ZooKeeper membership service through one of the defined ZooKeeper servers. If communications fail between a data instance and the ZooKeeper server with which it has been communicating, then the data instance attempts to connect to another ZooKeeper server. If the data instance cannot connect to any ZooKeeper server, then the data instance considers itself to be down.

    See Recovering When a Data Instance Is Down for details on what to do when a data instance is down.

Software Error

If a software error causes an element to be unloaded, then an error is returned to the active application. After rolling back the transaction, the application can continue executing transactions as long as one element from each replica set is open.

TimesTen Scaleout attempts to reload the element. Once opened, the element can accept transactions again.

Note:

You can manually initiate the reload of an element by reloading the database with the ttGridAdmin dbLoad command. If the element status is load failed, fix what caused the element load to fail and then reload the element with the ttGridAdmin dbLoad command. See Load a Database Into Memory (dbLoad) in Oracle TimesTen In-Memory Database Reference.
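The availability rule in this section (transactions can continue as long as one element from each replica set is open) can be written as a small check. The sketch below is illustrative only: the input mapping and the status string "opened" are assumptions made for the example, not output produced by any TimesTen utility.

```python
def database_accepts_transactions(replica_sets):
    """Return True if every replica set has at least one open element.

    replica_sets maps a replica set identifier to a list of element
    status strings. The "opened" label is an assumed status name.
    """
    if not replica_sets:
        return False  # no replica sets known, so nothing can be confirmed
    return all(
        any(status == "opened" for status in statuses)
        for statuses in replica_sets.values()
    )
```

For example, {1: ["opened", "down"], 2: ["opened", "opened"]} satisfies the rule, while {1: ["opened"], 2: ["down", "load failed"]} does not, because replica set 2 has no open element.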

Host or Data Instance Failure

If the host that contains a data instance crashes or if the data instance crashes, then an error is returned to the active application.

Because the data instance is down, the element status is displayed as down. If the data instance restarts (whether through automatic recovery or manual intervention), the element within the data instance most likely recovers. Monitor the status of the element with the ttGridAdmin dbStatus command to verify whether it recovered.
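If you want to script the monitoring step, one possible approach is a simple polling loop around ttGridAdmin dbStatus. The sketch below makes several assumptions: ttGridAdmin is on the PATH of the host where the script runs, the database name is a placeholder you supply, and the recovered predicate is yours to write, because this sketch does not presume any particular output format.

```python
import subprocess
import time


def fetch_db_status(db_name):
    """Run `ttGridAdmin dbStatus <db_name>` and return its text output."""
    done = subprocess.run(
        ["ttGridAdmin", "dbStatus", db_name],
        capture_output=True,
        text=True,
        check=True,
    )
    return done.stdout


def wait_until(recovered, fetch, attempts=10, delay_s=5.0, sleep=time.sleep):
    """Poll fetch() until recovered(output) is true.

    Returns True on success, or False once all attempts are used up.
    """
    for i in range(attempts):
        if recovered(fetch()):
            return True
        if i < attempts - 1:
            sleep(delay_s)  # wait before the next poll
    return False
```

A call might look like wait_until(my_predicate, lambda: fetch_db_status("database1")), where my_predicate inspects the command output for the element you care about and database1 is a placeholder name.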

Heavy Load or Temporary Communication Failure

A transient failure may occur if an element becomes slow or unresponsive due to heavy load.

During a database operation, a transient failure can occur for reasons such as the following:

  • A query timeout may occur if one or more hosts in the grid are overloaded and slow to respond.

  • Communication is temporarily suspended, for example when a host is unplugged from the network to reset communications.
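When failures stem from heavy load, retrying immediately can add to the load. A common general-purpose remedy, which this guide does not prescribe, is to wait between retries with exponential backoff and random jitter. The sketch below assumes op is a callable that rolls back and re-runs the whole transaction, and is_transient is a predicate you supply. All names and default values are illustrative.

```python
import random
import time


def backoff_delays(base_s=0.1, cap_s=5.0, retries=5, rng=random.random):
    """Yield one delay per retry: exponential growth, capped, with full jitter."""
    for n in range(retries):
        yield rng() * min(cap_s, base_s * (2 ** n))


def retry_with_backoff(op, is_transient, sleep=time.sleep, **backoff):
    """Call op(); on a transient error, wait and try again.

    op is tried once, plus up to `retries` more times. An error from
    the final attempt, or any non-transient error, propagates.
    """
    for delay in backoff_delays(**backoff):
        try:
            return op()
        except Exception as exc:
            if not is_transient(exc):
                raise
            sleep(delay)  # give an overloaded host time to recover
    return op()  # final attempt
```

Jitter spreads out retries from many clients so that they do not all return at the same moment. The cap keeps the wait bounded during a long outage.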