Recovering When a Single Element Fails in a Replica Set
This section describes the recovery methods you can perform when a single element fails within a replica set and k >= 2.
Troubleshooting Based on Element Status
Some element states require you to intervene. When you display the element status, use the state shown to decide how to respond.
Table 13-2 shows details on each element status and a recommendation of how to respond to changes in the element status.
Table 13-2 Element Status
| Status | Meaning | Notes and Recommendations |
|---|---|---|
| | The attempt to close the element failed. | Refer to the You can try |
| | The element is in the process of closing. | Wait, and run the |
| | The attempt to create the element failed. | Refer to the You can use the |
| | The element is being created. | Wait, and run the |
| | The attempt to destroy the element failed. | Refer to the If the element status is |
| | The element has been destroyed. | Element no longer exists. Note: When the last element of a database is destroyed, no record of the database, including element status, will exist. |
| | The element is being destroyed. | Wait, and run the |
| | The data instance where this element is located is not running. | If the data instance is down, the status of an element is down. Try to restart the data instance with the See Restart a Data Instance That Is Down and Recovering When a Data Instance Is Down. |
| | The element was evicted or removed through | When the element status is |
| | The element was evicted or removed through | Wait, and run When the element status is |
| | The element was evicted or removed through | Wait, and run When the element status is |
| | The attempt to load the element failed. | Refer to the You can try again to load the element with the |
| | The element is loaded. | Element is loaded and can now be opened. You can confirm if the element is in the distribution map with the |
| | The element is being loaded. | Wait, and run the |
| | The element is open. | Standard status for a functioning element. Database connections are possible through the element. |
| | The attempt to open the element failed. | Refer to the You can try |
| | The element is in the process of opening. | Wait, and run |
| | The element should be created, but creation has not yet started. | Wait, and run the |
| | The element has been unloaded. | Database is ready to be loaded again ( You can run the |
| | The element is being unloaded. | Wait, and run the |
| | The element will be loaded, but not until after the seed element in its replica set is loaded. | Note the status of the seed element in the replica set. The element in the replica set that failed with the latest changes is known as the seed element. The seed element recovers first with the latest transaction in the checkpoint and transaction log files. |
Note:
The notes and recommendations column often refers to ttGridAdmin commands. For more information on these commands within Oracle TimesTen In-Memory Database Reference, see Monitor the Status of a Database (dbStatus) for ttGridAdmin dbStatus, Create a Database (dbCreate) for ttGridAdmin dbCreate, Open a Database (dbOpen) for ttGridAdmin dbOpen, Load a Database Into Memory (dbLoad) for ttGridAdmin dbLoad, Unload a Database (dbUnload) for ttGridAdmin dbUnload, Close a Database (dbClose) for ttGridAdmin dbClose, Destroy a Database (dbDestroy) for ttGridAdmin dbDestroy, and Run a Command or Script on Grid Instances (instanceExec) for ttGridAdmin instanceExec.
The following sections demonstrate how to respond to different scenarios in which a single element in the replica set has failed:
Retry Element Creation
If the creation of the element failed, then retry the creation of the element with the ttGridAdmin dbCreate -instance command on the same data instance where the element should exist.
% ttGridAdmin dbCreate database1 -instance host3
Database database1 creation started
Restart a Data Instance That Is Down
When a data instance is down, the element within the data instance is down. You can check if data instances are down by using the ttGridAdmin dbStatus -all command.
Restart the daemon of the data instance with the ttGridAdmin instanceExec -only command to run either the ttDaemonAdmin -start or the ttDaemonAdmin -restart command.
The following example starts the host4.instance1 data instance:
% ttGridAdmin instanceExec -only host4.instance1 ttDaemonAdmin -start
Overall return code: 0
Commands executed on:
host4.instance1 rc 0
Return code from host4.instance1: 0
Output from host4.instance1:
TimesTen Daemon (PID: 15491, port: 14000) startup OK.
If the data instance does not restart, see Recovering When a Data Instance Is Down.
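The restart step can be wrapped in a small function that checks the result instead of relying on a visual scan of the output. The following is a minimal sketch, not an official procedure: the function name start_instance is hypothetical, and it assumes that ttGridAdmin instanceExec prints an "Overall return code:" line in the same form as the example output above.

```shell
#!/bin/sh
# Sketch: start the daemon of one data instance and check the result.
# start_instance is a hypothetical helper name. The check relies on the
# "Overall return code: N" line shown in the example output above; that the
# line always has this form is an assumption.
start_instance() {
    instance=$1
    # Capture the output even if the command itself reports a failure.
    output=$(ttGridAdmin instanceExec -only "$instance" ttDaemonAdmin -start 2>&1) || true
    echo "$output"
    # Extract N from the "Overall return code: N" line.
    rc=$(echo "$output" | sed -n 's/^Overall return code: *\([0-9][0-9]*\).*$/\1/p')
    if [ "$rc" = "0" ]; then
        return 0
    fi
    echo "Instance $instance did not start; see Recovering When a Data Instance Is Down." >&2
    return 1
}

# Example call (commented out so that sourcing this file runs nothing):
# start_instance host4.instance1
```

A nonzero or missing return code is treated as a failure, which errs on the side of sending you to the data instance recovery procedure.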
Destroy an Evicted Element or an Element Where a Destroy Failed
If you evict an element, you still need to destroy the element to free up the file system space used by the element. Afterward, you can create a new element.
When the element status is destroy failed or evicted, destroy the element of the data instance with the ttGridAdmin dbDestroy -instance command.
% ttGridAdmin dbDestroy database1 -instance host3
Database database1 destroy started
See Recovering When the Replica Set Has a Permanently Failed Element.
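The three scenarios above can be summarized as a lookup from element status to the command this section recommends. The following is a dry-run sketch: it only prints the command and never runs it. The function name suggest_recovery is hypothetical, the status string "create failed" is assumed to match the dbStatus spelling, and the instance argument is passed through exactly as you supply it.

```shell
#!/bin/sh
# Dry-run sketch: print (do not run) the command that this section suggests
# for a given element status.
# suggest_recovery is a hypothetical helper name. The status strings are
# assumed to match what ttGridAdmin dbStatus reports.
# Arguments: $1 = element status, $2 = database name, $3 = instance argument
suggest_recovery() {
    status=$1
    db=$2
    instance=$3
    case "$status" in
        "create failed")
            # Retry Element Creation
            echo "ttGridAdmin dbCreate $db -instance $instance" ;;
        "down")
            # Restart a Data Instance That Is Down
            echo "ttGridAdmin instanceExec -only $instance ttDaemonAdmin -start" ;;
        "destroy failed"|"evicted")
            # Destroy an Evicted Element or an Element Where a Destroy Failed
            echo "ttGridAdmin dbDestroy $db -instance $instance" ;;
        *)
            echo "no suggestion for status: $status" >&2
            return 1 ;;
    esac
}

# Example: suggest_recovery "destroy failed" database1 host3
```

Printing the command rather than running it keeps the decision with the administrator, which suits operations such as dbDestroy that remove data from the file system.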
Recovering a Replica Set After an Element Goes Down
- If the failed element recovers, it was unavailable for a time and fell behind transactionally. Before this element can resume its part in the replica set in the grid, it must synchronize its data with the active element of its replica set.
- If an element permanently fails, for example because of a file system failure, you need to remove that element from the replica set and replace it with another element with the ttGridAdmin dbDistribute -remove -replaceWith command. See Replace an Element with Another Element.
TimesTen Scaleout automatically re-synchronizes and restores the data on the restored or new element in the replica set with the following methods:
- Log-based catch up: This process transfers the transaction logs from an active element in the replica set and applies transaction records that are missing on a recovering element. This operation applies the DML or DDL statements that occurred while an element was not participating in the replica set. However, TimesTen Scaleout blocks any new DDL statements during the log-based catch up recovery phase of a recovering element.

  Transactions that are started while one of the elements of the replica set is down must be replayed when recovering the down element. The log-based catch up process waits for any open transactions to commit or roll back before replaying them from the transaction log. If the down element is in the recovery process for an extended period of time, then there may be an open transaction (on the active element) preventing the completion of the log-based catch up process for the recovering element. Use the ttXactAdmin utility to check for open transactions. Resolve any open transactions by either committing or rolling them back.
- Duplicate: TimesTen Scaleout duplicates the active element either to a recovering element or to a new element that replaces a failed element. The duplication operation copies all checkpoint and log files of the active element to the recovering element.

  However, since the active element continues to accept transactions during the duplicate operation, there may be additional transaction log records that are not a part of the copied transaction log files. After completing the duplicate operation, TimesTen Scaleout contacts the active element and performs a log-based catch up operation to bring the new element completely up to date.
Remove and Replace a Failed Element in a Replica Set
When k >= 2, if an element cannot be recovered automatically, then you have to investigate what caused the failure.
You may discover a problem that can be fixed, such as a drive that needs to be remounted. However, you may discover a problem that cannot be fixed, such as a drive that is completely destroyed. Most permanent, unrecoverable failures are related to hardware failures.
- If you can, fix the problem with the host or the data instance and then perform one of the following:
  - Restart the data instance. See Recovering When a Data Instance Is Down.
  - Reload the TimesTen database with the ttGridAdmin dbLoad command, which attempts to reload the element.
- If you cannot fix the problem with the host or data instance, then the data on the element may be in a state where it cannot be retrieved. In this case, you must remove the element and replace it with another element. Once replaced, the active element updates the new element with the data for this replica set.
If one of your hosts is encountering multiple errors (even though it has been able to automatically recover), you may decide to replace it with another host that is more reliable.
To replace an element without data loss, run the ttGridAdmin dbDistribute -remove -replaceWith command, which takes the data that exists on the element you want to replace and redistributes it to a new element. See Replace an Element with Another Element.
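As an illustration of the replacement command, the following dry-run sketch assembles a dbDistribute command line and prints it without running it. The function name replace_element_cmd is hypothetical, and the instance names in the example call are placeholders. The trailing -apply option is an assumption that this section does not show; confirm the exact syntax in Replace an Element with Another Element before running anything.

```shell
#!/bin/sh
# Dry-run sketch: build and print the replacement command without running it.
# replace_element_cmd is a hypothetical helper name. The -apply option is an
# assumption not shown in this section.
# Arguments: $1 = database, $2 = instance of the failed element,
#            $3 = instance that will hold the replacement element
replace_element_cmd() {
    db=$1
    failed=$2
    replacement=$3
    if [ -z "$db" ] || [ -z "$failed" ] || [ -z "$replacement" ]; then
        echo "usage: replace_element_cmd DATABASE FAILED_INSTANCE NEW_INSTANCE" >&2
        return 2
    fi
    # Replacing an element with itself would not move any data.
    if [ "$failed" = "$replacement" ]; then
        echo "replacement instance must differ from the failed instance" >&2
        return 1
    fi
    echo "ttGridAdmin dbDistribute $db -remove $failed -replaceWith $replacement -apply"
}

# Example with placeholder instance names:
# replace_element_cmd database1 host3.instance1 host5.instance1
```

Reviewing the printed command before running it is a cheap safeguard, because the operation changes the distribution map of the database.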