Recovering When a Single Element Fails in a Replica Set

The following sections describe the recovery methods you can use when a single element fails within a replica set and k >= 2.

Troubleshooting Based on Element Status

Some element states may require you to intervene. After you display the element status with the ttGridAdmin dbStatus command, respond to each state as described below.

Table 13-2 shows details on each element status and a recommendation of how to respond to changes in the element status.

Table 13-2 Element Status

Status Meaning Notes and Recommendations

close failed

The attempt to close the element failed.

Refer to the ttGridAdmin dbStatus command output for information about the failure.

You can try ttGridAdmin dbClose again.

closing

The element is in the process of closing.

Wait, and run the ttGridAdmin dbStatus command again to see when the element is closed. You can unload the database when some elements are still closing, but you would have to use the ttGridAdmin dbUnload -force command.

create failed

The attempt to create the element failed.

Refer to the ttGridAdmin dbStatus output for information about the failure. A common issue is that there are not enough semaphores to create the element or there is something wrong with the directory (incorrect permissions) for the checkpoint files. See Set the SEMMSL and SEMMNS Parameters.

You can use the ttGridAdmin dbCreate command with the -instance hostname[.instancename] option to retry the creation of the element on that data instance. See Retry Element Creation.

creating

The element is being created.

Wait, and run the ttGridAdmin dbStatus command again to see when the element is created.

destroy failed

The attempt to destroy the element failed.

Refer to the ttGridAdmin dbStatus command output for information about the failure.

If the element status is destroy failed, you can retry the destroy of the element on the data instance with the ttGridAdmin dbDestroy command with the -instance hostname[.instancename] option. See Destroy an Evicted Element or an Element Where a Destroy Failed.

destroyed

The element has been destroyed.

Element no longer exists.

Note: When the last element of a database is destroyed, no record of the database, including element status, will exist.

destroying

The element is being destroyed.

Wait, and run the ttGridAdmin dbStatus command again to see when the element is destroyed.

down

The data instance where this element is located is not running.

If the data instance is down, the status of an element is down.

Try to restart the data instance by using the ttGridAdmin instanceExec command to run the ttDaemonAdmin -start command. Use the instanceExec option -only hostname[.instancename].

See Restart a Data Instance That Is Down and Recovering When a Data Instance Is Down.

evicted

The element was evicted or removed through ttGridAdmin dbDistribute and has been removed from the distribution map.

When the element status is evicted, destroy the element of the data instance with the ttGridAdmin dbDestroy command with the -instance hostname[.instancename] option. See Destroy an Evicted Element or an Element Where a Destroy Failed.

evicted (loaded)

The element was evicted or removed through ttGridAdmin dbDistribute but removal from the distribution map has not yet begun.

Wait, and run the ttGridAdmin dbStatus command again to see when the element is unloaded.

When the element status is evicted, destroy the element with the ttGridAdmin dbDestroy command with the -instance hostname[.instancename] option. See Destroy an Evicted Element or an Element Where a Destroy Failed.

evicted (unloading)

The element was evicted or removed through ttGridAdmin dbDistribute and is being removed from the distribution map.

Wait, and run the ttGridAdmin dbStatus command again to see when the element is unloaded.

When the element status is evicted, destroy the element of the data instance with the ttGridAdmin dbDestroy command with the -instance hostname[.instancename] option. See Destroy an Evicted Element or an Element Where a Destroy Failed.

load failed

The attempt to load the element failed.

Refer to the ttGridAdmin dbStatus command output for information about the failure.

You can try again to load the element with the ttGridAdmin dbLoad command with the -instance hostname[.instancename] option.

loaded

The element is loaded.

Element is loaded and can now be opened. You can confirm if the element is in the distribution map with the ttGridAdmin dbStatus -replicaset command.

loading

The element is being loaded.

Wait, and run the ttGridAdmin dbStatus command again to see when the element is loaded.

opened

The element is open.

Standard status for a functioning element. Database connections are possible through the element.

open failed

The attempt to open the element failed.

Refer to the ttGridAdmin dbStatus command output for information about the failure.

You can try ttGridAdmin dbOpen again.

opening

The element is in the process of opening.

Wait, and run the ttGridAdmin dbStatus command again to see when the element is open.

uncreated

The element should be created, but creation has not yet started.

Wait, and run the ttGridAdmin dbStatus command again to see when creation begins (status creating).

unloaded

The element has been unloaded.

Database is ready to be loaded again (ttGridAdmin dbLoad) or destroyed (ttGridAdmin dbDestroy).

You can run the ttGridAdmin dbLoad command to reload the database.

unloading

The element is being unloaded.

Wait, and run the ttGridAdmin dbStatus command again to see when the element is unloaded.

waiting for seed

The element will be loaded, but not until after the seed element in its replica set is loaded.

Note the status of the seed element in the replica set. The element in the replica set that failed with the latest changes is known as the seed element. The seed element recovers first with the latest transaction in the checkpoint and transaction log files.

  • If the status of the seed element is loading, then failed elements will load as soon as the status of the seed element is loaded.

  • If the status of the seed element is load failed, then address that problem. See the entry for load failed above.

  • If the status of the seed element is down, then the failed elements cannot recover. Restart the data instance as indicated within the element down status information in this table.

  • If all elements in the replica set are in the waiting for seed state, then the only way to recover the replica set is to either:

    - Reload the database with the ttGridAdmin dbLoad command. See Database Recovery.

    - If reloading the database does not recover the elements and Durability=0, then you may need to evict the replica set, and then unload and reload the database, with the ttGridAdmin dbDistribute -evict, dbUnload, and dbLoad commands. See Recovering a Failed Replica Set When Durability=0.

Note:

The notes and recommendations column often refers to ttGridAdmin commands. For more information on these commands within Oracle TimesTen In-Memory Database Reference, see Monitor the Status of a Database (dbStatus) for ttGridAdmin dbStatus, Create a Database (dbCreate) for ttGridAdmin dbCreate, Open a Database (dbOpen) for ttGridAdmin dbOpen, Load a Database Into Memory (dbLoad) for ttGridAdmin dbLoad, Unload a Database (dbUnload) for ttGridAdmin dbUnload, Close a Database (dbClose) for ttGridAdmin dbClose, Destroy a Database (dbDestroy) for ttGridAdmin dbDestroy, and Run a Command or Script on Grid Instances (instanceExec) for ttGridAdmin instanceExec.
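
The recommendations in Table 13-2 can be condensed into a small lookup for use in operator scripts. The following is only a sketch: recommend_action is a hypothetical helper name, not a TimesTen utility, and the returned text is an abbreviated paraphrase of the table rather than output from any TimesTen command.

```shell
# Map an element status from Table 13-2 to a short suggested action.
# recommend_action is a hypothetical helper, not part of TimesTen.
recommend_action() {
  case "$1" in
    "close failed")
      echo "check dbStatus output, then retry ttGridAdmin dbClose" ;;
    "create failed")
      echo "check dbStatus output, then retry ttGridAdmin dbCreate -instance" ;;
    "load failed")
      echo "check dbStatus output, then retry ttGridAdmin dbLoad -instance" ;;
    "open failed")
      echo "check dbStatus output, then retry ttGridAdmin dbOpen" ;;
    "destroy failed"|"evicted")
      echo "run ttGridAdmin dbDestroy -instance" ;;
    "down")
      echo "restart the data instance with ttGridAdmin instanceExec -only" ;;
    "waiting for seed")
      echo "check the status of the seed element in the replica set" ;;
    closing|creating|destroying|loading|opening|unloading|uncreated|\
    "evicted (loaded)"|"evicted (unloading)")
      echo "wait, then run ttGridAdmin dbStatus again" ;;
    loaded|opened|unloaded|destroyed)
      echo "no recovery action required" ;;
    *)
      echo "unknown status: $1"
      return 1 ;;
  esac
}

recommend_action "load failed"
```

The transitional states are grouped into a single branch because the table gives the same advice for all of them: wait, and check the status again.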

The following sections describe how to respond to different scenarios in which a single element in the replica set has failed:

Retry Element Creation

If the creation of the element failed, then retry the creation of the element with the ttGridAdmin dbCreate -instance command on the same data instance where the element should exist.

% ttGridAdmin dbCreate database1 -instance host3
Database database1 creation started

Restart a Data Instance That Is Down

When a data instance is down, the element within the data instance is down. You can check if data instances are down by using the ttGridAdmin dbStatus -all command.

Restart the daemon of the data instance by using the ttGridAdmin instanceExec -only command to run either the ttDaemonAdmin -start or the ttDaemonAdmin -restart command.

The following example starts the host4.instance1 data instance:

% ttGridAdmin instanceExec -only host4.instance1 ttDaemonAdmin -start 
Overall return code: 0
Commands executed on:
  host4.instance1 rc 0
Return code from host4.instance1: 0
Output from host4.instance1:
TimesTen Daemon (PID: 15491, port: 14000) startup OK.

If the data instance does not restart, see Recovering When a Data Instance Is Down.

Destroy an Evicted Element or an Element Where a Destroy Failed

If you evict an element, you still need to destroy the element to free up the file system space that it uses. You may then decide to create a new element.

When the element status is destroy failed or evicted, destroy the element of the data instance with the ttGridAdmin dbDestroy -instance command.

% ttGridAdmin dbDestroy database1 -instance host3
Database database1 destroy started

See Recovering When the Replica Set Has a Permanently Failed Element.

Recovering a Replica Set After an Element Goes Down

When k >= 2, all active elements in the same replica set are transactionally synchronized. Any DML or DDL statements applied to one element in a replica set are also applied to all other elements in the replica set. When one element in the replica set is not up, another element in the replica set continues to run DML or DDL statements.

  • If the failed element recovers, it was unavailable for a time and fell behind transactionally. Before this element can resume its part in the replica set in the grid, it must synchronize its data with the active element of its replica set.

  • If an element permanently fails, such as a file system failure, you need to remove that element from the replica set and replace it with another element with the ttGridAdmin dbDistribute -remove -replaceWith command. See Replace an Element with Another Element.

TimesTen Scaleout automatically re-synchronizes and restores the data on the restored or new element in the replica set with the following methods:

  • Log-based catch up: This process transfers the transaction logs from an active element in the replica set and applies transaction records that are missing on a recovering element. This operation applies the DML or DDL statements that occurred while an element was not participating in the replica set. However, TimesTen Scaleout blocks any new DDL statements during the log-based catch up recovery phase of a recovering element.

    Transactions that are started while one of the elements of the replica set is down must be replayed when recovering the down element. The log-based catch up process waits for any open transactions to commit or roll back before replaying them from the transaction log. If the down element is in the recovery process for an extended period of time, then there may be an open transaction (on the active element) preventing the completion of the log-based catch up process for the recovering element. Use the ttXactAdmin utility to check for open transactions. Resolve any open transactions by either committing or rolling them back.

  • Duplicate: TimesTen Scaleout duplicates the active element either to a recovering element or to a new element that replaces a failed element. The duplication operation copies all checkpoint and log files of the active element to the recovering element.

    However, since the active element continues to accept transactions during the duplicate operation, there may be additional transaction log records that are not a part of the copied transaction log files. After completing the duplicate operation, TimesTen Scaleout contacts the active element and performs a log-based catch up operation to bring the new element completely up to date.
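
If a log-based catch up appears to stall, one way to look for open transactions is to run ttXactAdmin on the data instance that hosts the active element. The sketch below rests on several assumptions: host3.instance1 and the database1 DSN are placeholder names, check_open_xacts and run are hypothetical helpers, and running ttXactAdmin through ttGridAdmin instanceExec is one possible approach rather than a documented requirement. The helper prints the command by default so it can be reviewed before it is run.

```shell
# Dry-run by default: print the command instead of running it.
# Set DRY_RUN=0 to run it on a host where TimesTen is installed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# check_open_xacts INSTANCE DSN  (hypothetical helper)
# Runs ttXactAdmin on the data instance that hosts the active element.
check_open_xacts() {
  run ttGridAdmin instanceExec -only "$1" ttXactAdmin "$2"
}

check_open_xacts host3.instance1 database1
```

Any open transactions that ttXactAdmin reports then need to be committed or rolled back by the owning application before the catch up can complete.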

Remove and Replace a Failed Element in a Replica Set

When k >= 2, if an element cannot be recovered automatically, then you have to investigate what caused the failure.

You may discover a problem that can be fixed, such as a drive that needs to be remounted. However, you may discover a problem that cannot be fixed, such as a drive that is completely destroyed. Most permanent, unrecoverable failures are related to hardware failures.

  • If you can, fix the problem with the host or the data instance and then perform one of the following:

  • If you cannot fix the problem with the host or data instance, then the data on the element may be in a state where it cannot be retrieved. In this case, you must remove the element and replace it with another element. Once replaced, the active element updates the new element with the data for this replica set.

If one of your hosts is encountering multiple errors (even though it has been able to automatically recover), you may decide to replace it with another host that is more reliable.

To replace an element without data loss, run the ttGridAdmin dbDistribute -remove -replaceWith command, which takes the data that exists on the element you want to replace and redistributes it to a new element. See Replace an Element with Another Element.
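
As an illustration, the replacement might look like the following sketch. The instance names host3.instance1 (the element being replaced) and host5.instance1 (the replacement) are assumptions, as is the presumption that the replacement data instance already exists in the grid and does not yet hold an element of this database. replace_element and run are hypothetical helpers, and the -apply option is included on the assumption that the change should take effect immediately. The helper prints the command by default so that it can be reviewed first.

```shell
# Dry-run by default: print the command instead of running it.
# Set DRY_RUN=0 to run it on a host where TimesTen is installed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# replace_element DATABASE OLD_INSTANCE NEW_INSTANCE  (hypothetical helper)
# Removes the element on OLD_INSTANCE and replaces it with one on NEW_INSTANCE.
replace_element() {
  run ttGridAdmin dbDistribute "$1" -remove "$2" -replaceWith "$3" -apply
}

replace_element database1 host3.instance1 host5.instance1
```

After the replacement, the ttGridAdmin dbStatus command can be used to confirm the status of the new element.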