Recovering When the Replica Set Has a Permanently Failed Element

If an element or an entire replica set is unrecoverable because of a permanent failure, you must remove the failed element or evict the failed replica set.

Permanent failure can occur when a host permanently fails or if all elements in the replica set fail.

  • If all elements within a replica set permanently fail, you must evict the entire replica set, which results in the permanent loss of the data on the elements within that replica set.

    When k = 1, the permanent failure of one element is a replica set failure. When k >= 2, all elements in a replica set must fail for the replica set to be considered failed. If k >= 2 and the replica set permanently fails, you must evict all elements of the replica set simultaneously.

    Evicting the replica set removes it from the distribution for the grid. However, you cannot evict the replica set if the failed replica set is the only replica set in the database. In this case, save any checkpoint files, transaction log files or daemon log files (if possible) and then destroy and recreate the database.

    When a replica set goes down:

    • If Durability=0, the database goes into read-only mode.

    • If Durability=1, all transactions that include the failed replica set are blocked until you evict the failed replica set. Transactions that do not involve the failed replica set continue to work normally.

  • If k >= 2 and only one element of a replica set fails, one of the active elements in the replica set takes over all requests for data and processes the incoming transactions until the failed element is replaced. Thus, no data is lost with the failure. You can remove the failed element and replace it with a new element that is duplicated from that active element. See Replace an Element with Another Element.
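The rule above (how many elements must fail before the replica set itself counts as failed) can be sketched as a small shell function. This is only an illustration: recovery_action and the action names it prints are hypothetical and are not part of TimesTen Scaleout.

```shell
#!/bin/sh
# Hypothetical helper, not a TimesTen utility.
# Suggests a recovery action for ONE replica set, given:
#   $1 = k (number of elements in the replica set)
#   $2 = number of permanently failed elements in that replica set
recovery_action() {
  k=$1
  failed=$2

  if [ "$failed" -le 0 ]; then
    echo "none"
  elif [ "$failed" -ge "$k" ]; then
    # Every element is gone, so the replica set itself has failed.
    echo "evict-replica-set"
  else
    # At least one active element survives and can seed a duplicate.
    echo "remove-and-replace-element"
  fi
}

recovery_action 1 1   # prints: evict-replica-set
recovery_action 2 1   # prints: remove-and-replace-element
recovery_action 2 2   # prints: evict-replica-set
```

With k = 1 a single failed element already means the replica set must be evicted, while with k >= 2 eviction is needed only when every element has failed.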

Note:

If you know of problems that TimesTen Scaleout is not aware of and that require a replica set to be evicted, you can evict and replace the replica set as needed.

You can evict the replica set from the distribution map for your grid with the ttGridAdmin dbDistribute -evict command. Make sure that all pending requests for adding or removing elements are applied before requesting the eviction of a replica set.

You have the following options when you evict a replica set:

  • Evict the replica set without replacing it immediately.

    If the data instances and hosts for this replica set have not failed, then you can recreate the replica set using the same data instances. This is a preferred option if there are other databases on the grid and the hosts are fine.

    In this case, you must:

    1. Evict the elements of the failed replica set, while the data instances and hosts are still up.

      When you evict the replica set, the data is lost within this replica set, but the other replica sets in the database continue to function. There is now one fewer replica set in your grid.

    2. Delete all checkpoint files and transaction log files for the elements within the evicted replica set if you want to add new elements to the distribution map on the same data instances that previously held the evicted elements.

    3. Destroy the elements of the evicted replica set, while the data instances and hosts are still up.

    4. Optionally, you can replace the evicted replica set with a new replica set either on the same data instances and hosts if they are still viable or on new data instances and hosts. Add the new elements to the distribution map. This restores the grid to its expected configuration.

  • Evict the replica set and immediately replace it with a new replica set to restore the grid to its expected configuration.

    1. Create new data instances and hosts to replace the data instances and hosts of the failed replica set.

    2. Evict the elements of the failed replica set, while replacing it with a new replica set. When you evict the replica set, the data is lost within this replica set, but the other replica sets in the database continue to function.

      Use the ttGridAdmin dbDistribute -evict -replaceWith command to evict and replace the replica set with a new replica set, where each new element is created on a new data instance and host. The elements of the new replica set are added to the distribution map. However, the data in the other replica sets is not redistributed to include the new replica set. Thus, the new replica set remains empty until you insert data.

    3. Destroy the elements of the evicted replica set.
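Steps 2 and 3 of the evict-and-replace option can be sketched as a dry-run script that only prints the ttGridAdmin commands rather than running them. The function name print_evict_replace_plan and the old=new pair notation are assumptions made for this sketch; the command options are the ones described in this section.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands for steps 2 and 3, does not run them.
#   $1       = database name
#   $2 ...   = hypothetical "old=new" pairs of data instances,
#              for example host2.instance1=host4.instance1
print_evict_replace_plan() {
  db=$1
  shift

  # Step 2: one -evict/-replaceWith pair per failed element, applied together.
  cmd="ttGridAdmin dbDistribute $db"
  for pair in "$@"; do
    old=${pair%%=*}   # text before the first "="
    new=${pair#*=}    # text after the first "="
    cmd="$cmd -evict $old -replaceWith $new"
  done
  echo "$cmd -apply"

  # Step 3: destroy each evicted element.
  for pair in "$@"; do
    old=${pair%%=*}
    echo "ttGridAdmin dbDestroy $db -instance $old"
  done
}

print_evict_replace_plan database1 host2.instance1=host4.instance1
```

Review the printed commands before running them. The new data instances from step 1 must already exist in the grid.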

The following sections demonstrate how to evict a failed replica set when you have one or two elements in the replica set:

Evicting the Element in the Permanently Failed Replica Set When K = 1

The example shown in Figure 13-4 shows a TimesTen database that has been configured with k set to 1 with three data instances: host1.instance1, host2.instance1 and host3.instance1. The element on the host2.instance1 data instance fails because of a permanent hardware failure.

Figure 13-4 Grid Database Where K = 1


The following sections demonstrate the eviction options:

Evict the Element to Potentially Replace at Another Time

If you cannot recover a failed element, you evict the replica set.

The following example:

  1. Evicts the replica set for the element on the host2.instance1 data instance with the ttGridAdmin dbDistribute -evict command.

  2. Destroys the checkpoint and transaction logs for only this element within the evicted replica set with the ttGridAdmin dbDestroy -instance command.

    Note:

    Alternatively, see the instructions in Remove and Replace a Failed Element in a Replica Set if the data instance or host on which the element exists is not reliable.

% ttGridAdmin dbDistribute database1 -evict host2.instance1 -apply
Element host2.instance1 evicted 
Distribution map updated

% ttGridAdmin dbDestroy database1 -instance host2.instance1
Database database1 instance host2 destroy started

% ttGridAdmin dbStatus database1 -all
Database database1 summary status as of Thu Feb 22 16:44:15 PST 2018
 
created,loaded-complete,open
Completely created elements: 2 (of 3)
Completely loaded elements: 2 (of 3)
 
Open elements: 2 (of 3) 
 
Database database1 element level status as of Thu Feb 22 16:44:15 PST 2018
 
Host  Instance  Elem Status    Date/Time of Event  Message 
----- --------- ---- --------- ------------------- ------- 
host1 instance1    1 opened    2018-02-22 16:42:14         
host2 instance1    2 destroyed 2018-02-22 16:44:01         
host3 instance1    3 opened    2018-02-22 16:42:14         
 
Database database1 Replica Set status as of Thu Feb 22 16:44:15 PST 2018
 
RS DS Elem Host  Instance  Status Date/Time of Event  Message 
-- -- ---- ----- --------- ------ ------------------- ------- 
 1  1    1 host1 instance1 opened 2018-02-22 16:42:14         
 2  1    3 host3 instance1 opened 2018-02-22 16:42:14         
 
Database database1 Data Space Group status as of Thu Feb 22 16:44:15 PST 2018
 
DS RS Elem Host  Instance  Status Date/Time of Event  Message 
-- -- ---- ----- --------- ------ ------------------- ------- 
 1  1    1 host1 instance1 opened 2018-02-22 16:42:14         
    2    3 host3 instance1 opened 2018-02-22 16:42:14
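To confirm which elements are no longer open after an eviction, the element-level table in the status output can be filtered. The following sketch assumes the column layout shown in the output above (Host, Instance, Elem, Status). The function list_unopened_elements is a hypothetical helper, not a TimesTen utility.

```shell
#!/bin/sh
# Hypothetical helper, not a TimesTen utility.
# Reads "ttGridAdmin dbStatus <db> -all" output on stdin and prints
# "host.instance status" for every element whose status is not "opened".
list_unopened_elements() {
  awk '
    # The element-level table starts after this heading line.
    /element level status/ { in_section = 1; next }
    # Any later "Database ..." heading starts a different table.
    /^Database / { in_section = 0 }
    # Skip the column header, the dashed rule, and blank lines.
    in_section && NF >= 4 && $1 != "Host" && $1 !~ /^-+$/ && $4 != "opened" {
      print $1 "." $2, $4
    }
  '
}

# Example usage (requires a working grid):
#   ttGridAdmin dbStatus database1 -all | list_unopened_elements
```

If the output format differs between releases, adjust the field positions accordingly.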

Because the data instance and host are still viable, the following example creates a new element for the replica set and then adds it to the distribution map:

  1. Creates a new element with the ttGridAdmin dbCreate -instance command on the same data instance where the previous element existed before its replica set was evicted.
  2. Adds the new element into the distribution map with the ttGridAdmin dbDistribute -add command.

% ttGridAdmin dbCreate database1 -instance host2
Database database1 creation started
% ttGridAdmin dbDistribute database1 -add host2 -apply 
Element host2 is added 
Distribution map updated
% ttGridAdmin dbStatus database1 -all
Database database1 summary status as of Thu Feb 22 16:53:17 PST 2018
 
created,loaded-complete,open
Completely created elements: 3 (of 3)
Completely loaded elements: 3 (of 3)
 
Open elements: 3 (of 3)
 
Database database1 element level status as of Thu Feb 22 16:53:17 PST 2018
 
Host  Instance  Elem Status Date/Time of Event  Message
----- --------- ---- ------ ------------------- -------
host1 instance1    1 opened 2018-02-22 16:42:14
host3 instance1    3 opened 2018-02-22 16:42:14
host2 instance1    4 opened 2018-02-22 16:53:14
 
Database database1 Replica Set status as of Thu Feb 22 16:53:17 PST 2018
 
RS DS Elem Host  Instance  Status Date/Time of Event  Message
-- -- ---- ----- --------- ------ ------------------- -------
 1  1    1 host1 instance1 opened 2018-02-22 16:42:14
 2  1    3 host3 instance1 opened 2018-02-22 16:42:14
 3  1    4 host2 instance1 opened 2018-02-22 16:53:14
 
Database database1 Data Space Group status as of Thu Feb 22 16:53:17 PST 2018
 
DS RS Elem Host  Instance  Status Date/Time of Event  Message
-- -- ---- ----- --------- ------ ------------------- -------
 1  1    1 host1 instance1 opened 2018-02-22 16:42:14
    2    3 host3 instance1 opened 2018-02-22 16:42:14
    3    4 host2 instance1 opened 2018-02-22 16:53:14

Evict and Replace the Data Instance Without Re-Distribution

To recover the initial capacity with the same number of replica sets that the database started with, evict the failed element and replace it with a new element using the ttGridAdmin dbDistribute -evict -replaceWith command.

The following example:

  1. Creates a new host (identified as host4), installation, data instance and element.
  2. Evicts the replica set that contains the failed element on the host2.instance1 data instance and replaces the evicted element with the element on the host4.instance1 data instance using the ttGridAdmin dbDistribute -evict -replaceWith command.

    The data that exists on the elements on the host1.instance1 and host3.instance1 data instances is not redistributed to the new element on the host4.instance1 data instance. The element on the host4.instance1 data instance is empty.

  3. Destroys the element on the host2.instance1 data instance with the ttGridAdmin dbDestroy -instance command.

% ttGridAdmin hostCreate host4 -address myhost.example.com -dataspacegroup 1
Host host4 created in Model
% ttGridAdmin installationCreate -host host4 -location /timesten/host4/installation1
Installation installation1 on Host host4 created in Model
% ttGridAdmin instanceCreate -host host4 -location /timesten/host4 
Instance instance1 on Host host4 created in Model
% ttGridAdmin modelApply
Copying Model.........................................................OK
Exporting Model Version 2.............................................OK
Marking objects 'Pending Deletion'....................................OK
Deleting any Hosts that are no longer in use..........................OK
Verifying Installations...............................................OK
Creating any missing Installations....................................OK
Creating any missing Instances........................................OK
Adding new Objects to Grid State......................................OK
Configuring grid authentication.......................................OK
Pushing new configuration files to each Instance......................OK
Making Model Version 2 current........................................OK
Making Model Version 3 writable.......................................OK
Checking ssh connectivity of new Instances............................OK
Starting new data instances...........................................OK
ttGridAdmin modelApply complete
% ttGridAdmin dbDistribute database1 -evict host2.instance1 \
 -replaceWith host4.instance1 -apply
Element host2.instance1 evicted 
Distribution map updated
% ttGridAdmin dbDestroy database1 -instance host2
Database database1 instance host2 destroy started
% ttGridAdmin dbStatus database1 -all
Database database1 summary status as of Thu Feb 22 17:04:21 PST 2018
 
created,loaded-complete,open
Completely created elements: 3 (of 4)
Completely loaded elements: 3 (of 4)
 
Open elements: 3 (of 4)
 
Database database1 element level status as of Thu Feb 22 17:04:21 PST 2018
 
Host  Instance  Elem Status    Date/Time of Event  Message
----- --------- ---- --------- ------------------- -------
host1 instance1    1 opened    2018-02-22 16:42:14
host3 instance1    3 opened    2018-02-22 16:42:14
host2 instance1    4 destroyed 2018-02-22 17:04:11
host4 instance1    5 opened    2018-02-22 17:03:18
 
Database database1 Replica Set status as of Thu Feb 22 17:04:21 PST 2018
 
RS DS Elem Host  Instance  Status Date/Time of Event  Message
-- -- ---- ----- --------- ------ ------------------- -------
 1  1    1 host1 instance1 opened 2018-02-22 16:42:14
 2  1    3 host3 instance1 opened 2018-02-22 16:42:14
 3  1    5 host4 instance1 opened 2018-02-22 17:03:18
 
Database database1 Data Space Group status as of Thu Feb 22 17:04:21 PST 2018
 
DS RS Elem Host  Instance  Status Date/Time of Event  Message
-- -- ---- ----- --------- ------ ------------------- -------
 1  1    1 host1 instance1 opened 2018-02-22 16:42:14
    2    3 host3 instance1 opened 2018-02-22 16:42:14
    3    5 host4 instance1 opened 2018-02-22 17:03:18

Evicting All Elements in a Permanently Failed Replica Set When K >= 2

If k >= 2 and the replica set permanently fails, then you need to evict all elements of the replica set simultaneously.

Figure 13-5 shows where replica set 1 fails.

For the example shown in Figure 13-5, replica set 1 contains elements that exist on the host3.instance1, host4.instance1 and host5.instance1 data instances. The replica set fails in an unrepairable way. When you run the ttGridAdmin dbDistribute command to evict the replica set, specify the data instances of all elements in the replica set that are being evicted.

% ttGridAdmin dbDistribute database1 -evict host3.instance1 \
 -evict host4.instance1 -evict host5.instance1 -apply
Element host3.instance1 evicted 
Element host4.instance1 evicted 
Element host5.instance1 evicted 
Distribution map updated
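Because every element of the failed replica set must be named in a single dbDistribute request, it can help to build the command from a list of data instances. The following is a minimal sketch that only prints the command. The function build_evict_command is a hypothetical helper, not part of ttGridAdmin.

```shell
#!/bin/sh
# Hypothetical helper: prints (does not run) one dbDistribute command that
# evicts every listed data instance in the same request.
#   $1     = database name
#   $2 ... = data instances of ALL elements in the failed replica set
build_evict_command() {
  db=$1
  shift

  if [ "$#" -eq 0 ]; then
    echo "usage: build_evict_command DATABASE INSTANCE..." >&2
    return 1
  fi

  cmd="ttGridAdmin dbDistribute $db"
  for inst in "$@"; do
    cmd="$cmd -evict $inst"   # one -evict option per element
  done
  echo "$cmd -apply"
}

build_evict_command database1 host3.instance1 host4.instance1 host5.instance1
```

The call above prints the same command as the example, on a single line.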

Replacing the Replica Set with New Elements with No Data Redistribution

If you cannot recover any of the elements in the replica set, then you must evict all elements in the replica set simultaneously. To recover the initial capacity with the same number of replica sets that the database started with, evict the failed elements and replace them with new elements using the ttGridAdmin dbDistribute -evict -replaceWith command.

The following example:

  1. Creates new elements in the host9.instance1 and host10.instance1 data instances.
  2. Evicts the replica set with the failed elements on the host3.instance1 and host4.instance1 data instances, replacing them with new elements in the host9.instance1 and host10.instance1 data instances.

    The data that exists on the elements in the active replica sets is not redistributed to include the new elements on the host9.instance1 and host10.instance1 data instances. The elements on the host9.instance1 and host10.instance1 data instances are empty.

  3. Destroys the elements on the host3.instance1 and host4.instance1 data instances with the ttGridAdmin dbDestroy -instance command.

    The new replica set is now listed as replica set 1, and its elements are located in the host9.instance1 and host10.instance1 data instances.

% ttGridAdmin hostCreate host9 -internalAddress int-host9 -externalAddress \
 ext-host9.example.com -like host3 -cascade
Host host9 created in Model
Installation installation1 created in Model
Instance instance1 created in Model
% ttGridAdmin hostCreate host10 -internalAddress int-host10 -externalAddress \
 ext-host10.example.com -like host4 -cascade
Host host10 created in Model
Installation installation1 created in Model
Instance instance1 created in Model
% ttGridAdmin dbDistribute database1 -evict host3.instance1 \
 -replaceWith host9.instance1 -evict host4.instance1 \
 -replaceWith host10.instance1 -apply
Element host3.instance1 evicted 
Element host4.instance1 evicted 
Distribution map updated
% ttGridAdmin dbStatus database1 -all 
Database database1 summary status as of Fri Feb 23 10:22:57 PST 2018
 
created,loaded-complete,open
Completely created elements: 8 (of 8)
Completely loaded elements: 6 (of 8) 
Completely created replica sets: 3 (of 3) 
Completely loaded replica sets: 3 (of 3)  
 
Open elements: 6 (of 8) 
 
Database database1 element level status as of Fri Feb 23 10:22:57 PST 2018
 
Host   Instance  Elem Status  Date/Time of Event  Message
------ --------- ---- ------- ------------------- -------
 host3 instance1    1 evicted 2018-02-23 10:22:28
 host4 instance1    2 evicted 2018-02-23 10:22:28
 host5 instance1    3 opened  2018-02-23 07:28:23
 host6 instance1    4 opened  2018-02-23 07:28:23
 host7 instance1    5 opened  2018-02-23 07:28:23
 host8 instance1    6 opened  2018-02-23 07:28:23
host10 instance1    7 opened  2018-02-23 10:22:27
 host9 instance1    8 opened  2018-02-23 10:22:27
 
Database database1 Replica Set status as of Fri Feb 23 10:22:57 PST 2018
 
RS DS Elem Host   Instance  Status Date/Time of Event  Message
-- -- ---- ------ --------- ------ ------------------- -------
 1  1    8 host9  instance1 opened 2018-02-23 10:22:27
    2    7 host10 instance1 opened 2018-02-23 10:22:27
 2  1    3 host5  instance1 opened 2018-02-23 07:28:23
    2    4 host6  instance1 opened 2018-02-23 07:28:23
 3  1    5 host7  instance1 opened 2018-02-23 07:28:23
    2    6 host8  instance1 opened 2018-02-23 07:28:23
 
Database database1 Data Space Group status as of Fri Feb 23 10:22:57 PST 2018
 
DS RS Elem Host   Instance  Status Date/Time of Event  Message
-- -- ---- ------ --------- ------ ------------------- -------
 1  1    8 host9  instance1 opened 2018-02-23 10:22:27
    2    3 host5  instance1 opened 2018-02-23 07:28:23
    3    5 host7  instance1 opened 2018-02-23 07:28:23
 2  1    7 host10 instance1 opened 2018-02-23 10:22:27
    2    4 host6  instance1 opened 2018-02-23 07:28:23
    3    6 host8  instance1 opened 2018-02-23 07:28:23
 
% ttGridAdmin dbDestroy database1 -instance host3 
Database database1 instance host3 destroy started
% ttGridAdmin dbDestroy database1 -instance host4
Database database1 instance host4 destroy started

Maintaining Database Consistency After an Eviction

Eviction of an entire replica set results in data loss, which can leave the database in an inconsistent state. For example, if parent rows were stored in the evicted replica set, then child rows stored in other replica sets reference a parent that no longer exists, which violates the foreign key relationship.

To ensure that you maintain database consistency after an eviction, fix all foreign key references by performing one of the following steps:

  • Delete any child row that does not have a corresponding parent.

  • Drop the foreign key constraint on any child table that contains rows without a corresponding parent.
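For the first option, the orphaned child rows can usually be located with a NOT EXISTS subquery against the parent table. The following sketch only prints generic SQL for review. The function orphan_cleanup_sql and the table and column names in the example call are hypothetical, so confirm the statements against your schema and the TimesTen SQL reference before running them.

```shell
#!/bin/sh
# Hypothetical helper: prints SQL that counts and then deletes child rows
# whose parent row no longer exists. It does not connect to any database.
#   $1 = child table     $2 = foreign key column in the child table
#   $3 = parent table    $4 = referenced column in the parent table
orphan_cleanup_sql() {
  child=$1
  child_col=$2
  parent=$3
  parent_col=$4

  # Rows with a NULL foreign key do not reference any parent, so keep them.
  cat <<EOF
SELECT COUNT(*) FROM $child
 WHERE $child_col IS NOT NULL
   AND NOT EXISTS (SELECT 1 FROM $parent
                    WHERE $parent.$parent_col = $child.$child_col);

DELETE FROM $child
 WHERE $child_col IS NOT NULL
   AND NOT EXISTS (SELECT 1 FROM $parent
                    WHERE $parent.$parent_col = $child.$child_col);
EOF
}

# Example with assumed table names: orders (child) and customers (parent).
orphan_cleanup_sql orders customer_id customers id
```

Running the SELECT first shows how many rows the DELETE would remove, which is a useful check before committing the cleanup.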