13 Recovering from Failure

Error conditions and failure situations can impact availability. If the error condition can be recovered automatically, then standard operations resume. However, there may be situations where you need to intervene to recover from failure.

TimesTen Scaleout includes error and failure detection, with automatic recovery for many error and failure situations, to maintain continuous operation for all applications that use it. Errors and failure situations can include:

  • Software errors.

  • Network outage or other communication channel failures. A communication channel is a TCP connection.

  • One or more machines hosting a data instance unexpectedly reboot or crash.

  • The main TimesTen daemon for an instance, or any of its sub-daemons, fails.

  • An element becomes slow or unresponsive because it is suspended waiting on a lock or is under heavy load.

  • A machine or rack of machines hosting data instances is unexpectedly brought down for unknown reasons.

The responses necessary for error conditions and failure situations are as follows:

  • Transient errors: A transient error is due to a temporary condition that TimesTen Scaleout is usually able to quickly resolve. You can immediately retry the failed transaction, which usually succeeds.

  • Element failure: When an element fails, TimesTen Scaleout can automatically recover the element most of the time. However, there are certain element failure situations where you may be required to fix the problem. The application response to an element failure may differ depending on the configuration of the grid and the database. After the problem is fixed, either TimesTen Scaleout recovers the element and operations continue or you supply a new element to take the place of the failed element.

  • Replica set failure: If all of the elements in a replica set fail, TimesTen Scaleout can automatically recover the elements once the cause of the original failure has been fixed. The element with the latest changes, known as the seed element, is recovered first. All other elements in the replica set are then recovered from the seed element.

  • Database failure: If all replica sets fail, the database is considered failed. You need to reload the database for recovery. How the database recovers on reload depends on the value of the Durability attribute.

  • Data distribution failure: You can attempt a re-synchronization of your data if the data distribution process is interrupted or fails to complete. Re-synchronization involves executing the ttGridAdmin dbDistribute -resync operation.
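The retry guidance for transient errors can be sketched in application code. The following Python fragment is a minimal illustration only, not a TimesTen API: `TransientError`, `run_with_retry`, and their parameters are hypothetical names introduced here, and the sketch assumes the caller's transaction function rolls back its own work and raises `TransientError` when the driver reports a transient condition.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a driver error reported as transient."""


def run_with_retry(txn, max_attempts=5, base_delay=0.0, sleep=time.sleep):
    """Run txn(), retrying when it raises TransientError.

    txn          -- callable that executes one complete transaction
                    (assumed to roll back its own work before raising)
    max_attempts -- total number of tries before giving up
    base_delay   -- seconds of linear backoff per attempt; 0 retries at once
    sleep        -- injectable sleep function, useful for testing
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return txn()
        except TransientError:
            if attempt >= max_attempts:
                # Retries exhausted: surface the error to the caller.
                raise
            sleep(base_delay * attempt)
```

The default delay of zero mirrors the guidance that a failed transaction can be retried immediately. Bounding the number of attempts keeps a persistent problem, such as an element failure that needs intervention, from being retried indefinitely.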

The following sections describe these error and failure situations and how to recover from them: