How the TimesTen Kubernetes Operator Handles Node Failure

Kubernetes does a good job of detecting and resolving Node and Pod failures. However, there are cases in which Kubernetes cannot resolve a failure on its own. Let's look at an example that could affect TimesTen databases that reside in containers in Pods on one or more Nodes in a Kubernetes cluster. Let's then examine what you can do to have the TimesTen Operator detect this situation and take appropriate action.

In a Kubernetes cluster, assume you use the local volume provisioner to make storage on each Kubernetes Node available as persistent storage for Pods running on the Node. A drawback of this approach is that the storage on one Node is not available to other Nodes. Consider the following scenario:

There are three nodes in the cluster (Node A, Node B, and Node C).
TimesTen is running on Node A and Node B.
Node A goes down and is unavailable.

Kubernetes detects the failure of Node A, but cannot automatically create a new Pod on Node C to run TimesTen. This is because the persistent volumes used by TimesTen are local to Node A, so Node C cannot access them. The TimesTen Operator can correctly fail over the database to Node B, but cannot bring up a replacement for the TimesTen instance that was running on Node A. Therefore, there is no redundancy in the cluster until Node A comes back up.
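
The placement constraint in this scenario can be illustrated with a small model. The following Python sketch is a simplification for illustration only, not how Kubernetes schedules Pods internally. The names Node, LocalVolumeClaim, and eligible_nodes are hypothetical, and the sketch assumes that a claim bound to a local volume can only be served by the one Node that holds that volume.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    ready: bool


@dataclass
class LocalVolumeClaim:
    # Hypothetical model: the only Node that can serve this local volume.
    # None means the claim is not yet bound to any Node's storage.
    bound_node: Optional[str]


def eligible_nodes(nodes: List[Node], claim: LocalVolumeClaim) -> List[str]:
    """Return the names of Nodes on which a Pod using `claim` could run."""
    ready = [n.name for n in nodes if n.ready]
    if claim.bound_node is None:
        # An unbound claim can be satisfied on any ready Node.
        return ready
    # A bound local volume pins the Pod to the Node that holds the storage.
    return [name for name in ready if name == claim.bound_node]


# Node A is down; Node B and Node C are ready.
nodes = [Node("A", ready=False), Node("B", ready=True), Node("C", ready=True)]

# The claim is bound to storage on Node A, so no Node is eligible.
print(eligible_nodes(nodes, LocalVolumeClaim(bound_node="A")))
```

In this model, the Pod has nowhere to run while its claim stays bound to storage on Node A, even though Node C is ready.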

You can configure the TimesTen Operator to detect such a situation and take appropriate action to reconfigure and automatically start TimesTen on Node C.

Here's how:

The .spec.ttspec.deleteDbOnNotReadyNode datum of a TimesTenClassic object allows you to direct the TimesTen Operator to detect situations where a Node has been not ready (or in an unknown state) for a specific period of time. In such cases, if .spec.ttspec.deleteDbOnNotReadyNode is specified, the TimesTen Operator takes appropriate action to remedy the situation. For more information about the .spec.ttspec.deleteDbOnNotReadyNode datum, see the deleteDbOnNotReadyNode entry in Table 20-3.

Let's look at this in further detail.

Approximately every .spec.ttspec.pollingInterval seconds, the TimesTen Operator reconciles each TimesTenClassic object. During this reconciliation, the TimesTen Operator examines the state of each Pod associated with the TimesTenClassic object. The TimesTen Operator also retrieves the state of the Node on which each Pod is running. If a Pod is scheduled on a Node that is not ready (or whose state is unknown), the TimesTen Operator records the time in the TimesTenClassic object's status.

During the next reconciliation (.spec.ttspec.pollingInterval seconds later), if the Pod is assigned to the same Node and the Node is still not ready (or its state is still unknown), the TimesTen Operator checks whether .spec.ttspec.deleteDbOnNotReadyNode is specified. If it is specified, the TimesTen Operator checks whether the Node's not ready condition has existed for more than .spec.ttspec.deleteDbOnNotReadyNode seconds. If so, the TimesTen Operator deletes the Pod and the PVCs associated with the Pod. This causes Kubernetes to create a new Pod and new PVCs on a surviving Node. Once the Pod is scheduled and started by Kubernetes, the TimesTen Operator configures it as usual.
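
The per-Pod decision described above can be sketched as follows. This is a minimal Python illustration of the logic, not the TimesTen Operator's actual code. The names PodView, Status, and reconcile_pod are hypothetical, delete_after stands in for .spec.ttspec.deleteDbOnNotReadyNode (None meaning it is not specified), and clearing the recorded time when the Node becomes ready again is an assumption that the text above does not state.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class PodView:
    name: str
    node_name: str
    node_ready: bool  # False covers both "not ready" and "unknown"


@dataclass
class Status:
    # Pod name -> (Node name, time the Node was first seen not ready).
    not_ready_since: Dict[str, Tuple[str, float]] = field(default_factory=dict)


def reconcile_pod(pod: PodView, status: Status, now: float,
                  delete_after: Optional[float]) -> str:
    """Decide what one reconciliation pass does for one Pod."""
    if pod.node_ready:
        # Assumption: a healthy Node clears any previously recorded time.
        status.not_ready_since.pop(pod.name, None)
        return "ok"

    record = status.not_ready_since.get(pod.name)
    if record is None or record[0] != pod.node_name:
        # First time this Pod is seen on this not ready Node: record the time.
        status.not_ready_since[pod.name] = (pod.node_name, now)
        return "recorded"

    if delete_after is None:
        # deleteDbOnNotReadyNode is not specified: take no action.
        return "waiting"

    if now - record[1] > delete_after:
        # The condition has lasted longer than the threshold: delete the Pod
        # and its PVCs so that Kubernetes can recreate them elsewhere.
        del status.not_ready_since[pod.name]
        return "delete-pod-and-pvcs"

    return "waiting"
```

For example, with a threshold of 30 seconds, a Pod on a Node that stays not ready is first recorded, then left alone on passes within the threshold, and finally selected for deletion on the first pass after more than 30 seconds have elapsed since the recorded time. Because the check only runs during reconciliation, the deletion can happen up to roughly one polling interval after the threshold is crossed.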