About Managing TimesTen Scaleout

TimesTen Scaleout delivers high performance, fault tolerance, and scalability within a highly available in-memory database that provides persistence and recovery. Since a database is distributed across multiple hosts, some components of the database may fail while others continue to operate.

TimesTen Scaleout detects errors and failures and recovers automatically from many of them, so that applications can continue to operate.

The TimesTen Operator implements best practices for handling failures in TimesTen Scaleout. For more information about how TimesTen Scaleout handles failures, see Recovering from Failure in the Oracle TimesTen In-Memory Database Scaleout User's Guide.

In particular, the Operator detects and handles the following failure cases:
  • If a TimesTen instance or element fails, the Operator restarts it.

  • If an entire replica set fails and all elements in the replica set reach the waiting for seed state, the Operator, by default, unloads and reloads the database to recover. For details about how TimesTen Scaleout recovers from a down replica set, see Recovering from a Down Replica Set in the Oracle TimesTen In-Memory Database Scaleout User's Guide.

  • If all data instances fail, the Operator detects and reports the failure.

The Operator communicates with the TimesTen agent running in the tt container of each Pod that runs TimesTen. The agent gathers information about the state of TimesTen in its container and sends that information back to the Operator. The Operator analyzes this information to determine the health and state of TimesTen, and summarizes the result in well-defined states. The Operator uses state machines to determine which commands to run to detect failures and, if possible, repair TimesTen. These states are discussed later in the chapter.
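
The decision logic described above can be illustrated with a minimal sketch. This is not the Operator's actual implementation: the ElementReport type, the Action names, the decide function, and the "loaded" and "down" state strings are all hypothetical and exist only for illustration. Only the "waiting for seed" state and the three failure cases come from the text above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Action(Enum):
    """Hypothetical repair actions, one per failure case listed above."""
    NONE = "none"
    RESTART_ELEMENT = "restart failed instance or element"
    RELOAD_DATABASE = "unload and reload database"
    REPORT_FAILURE = "report failure"


@dataclass
class ElementReport:
    """Hypothetical summary of what one agent reports for its element."""
    replica_set: int
    instance_up: bool
    element_state: str  # illustrative values: "loaded", "down", "waiting for seed"


def decide(reports: List[ElementReport]) -> Action:
    """Map the collected agent reports to a single action (illustrative only)."""
    # Case 3: every data instance has failed, so only detect and report.
    if reports and not any(r.instance_up for r in reports):
        return Action.REPORT_FAILURE

    # Group the reports by replica set.
    by_set: Dict[int, List[ElementReport]] = {}
    for r in reports:
        by_set.setdefault(r.replica_set, []).append(r)

    # Case 2: all elements of some replica set are waiting for seed.
    for members in by_set.values():
        if all(m.element_state == "waiting for seed" for m in members):
            return Action.RELOAD_DATABASE

    # Case 1: at least one instance or element is not healthy.
    if any(not r.instance_up or r.element_state != "loaded" for r in reports):
        return Action.RESTART_ELEMENT

    return Action.NONE
```

For example, under these assumptions, a report set in which both elements of one replica set are in the waiting for seed state yields Action.RELOAD_DATABASE, while a single down element in an otherwise healthy grid yields Action.RESTART_ELEMENT.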

Let's take a deeper look at how the Operator detects and repairs TimesTen Scaleout. Specifically, let's look at how the Operator handles single data instance failure, management instance failure, entire replica set failure, and total database failure.