7.8.3 Rebooting a Cluster from a Major Outage

If your cluster experiences a complete outage, you can reconfigure it using dba.rebootClusterFromCompleteOutage(). This operation enables you to connect to one of the cluster's MySQL instances and use its metadata to recover the cluster.

A complete outage means that Group Replication has stopped on all member instances.

Note

Ensure all cluster members are started before running dba.rebootClusterFromCompleteOutage(). The command will fail if any of the cluster members are unreachable.

This check is ignored if the cluster is INVALIDATED and is a member of a ClusterSet.

Connect to the most up-to-date instance and run the following command:

  JS> var cluster = dba.rebootClusterFromCompleteOutage()

If all members have the same GTID set, the member to which you are currently connected becomes the primary. See Selecting a Primary with rebootClusterFromCompleteOutage.

The dba.rebootClusterFromCompleteOutage() operation follows these steps to ensure the cluster is correctly reconfigured:

The cluster metadata and the cluster topology are retrieved from the current instance.
If a cluster member is in RECOVERING or ERROR, and all other members are OFFLINE or ERROR, dba.rebootClusterFromCompleteOutage() attempts to stop Group Replication on that member. If Group Replication fails to stop, the command stops and displays an error.
The InnoDB Cluster metadata found on the instance to which MySQL Shell is currently connected is checked to see whether that instance contains the GTID superset. If it does not, the operation aborts and reports this.
See GTID Superset.
If the instance contains the GTID superset, the cluster is recovered based on the metadata stored in that instance.
MySQL Shell checks which instances of the cluster are currently reachable and fails if any member is currently unreachable.
Note
It is possible to bypass this check with the force option. This reboots the cluster using the remaining contactable members.
See Force Option.
Similarly, MySQL Shell detects instances that are currently unreachable. Former members that are currently unreachable cannot be added to, or removed from, the cluster as part of the dba.rebootClusterFromCompleteOutage() command.
In single-primary mode, if super_read_only is enabled on the primary instance of the cluster, it is disabled.

GTID Superset

To reboot the cluster, you must connect to the member with the GTID superset, that is, the instance that had applied the most transactions before the outage.

To determine which member has the GTID superset, do one of the following:

Connect to an instance and run dba.rebootClusterFromCompleteOutage() with dryRun: true. The generated report returns information similar to the following:
```
Switching over to instance '127.0.0.1:4001' to be used as seed.
```
This indicates the member with the GTID superset.
Running dba.rebootClusterFromCompleteOutage() against a member with a lower GTID set results in an error.
Connect to each instance in turn and run the following in SQL mode:
```
SHOW VARIABLES LIKE 'gtid_executed';
```
The instance whose gtid_executed value includes every transaction present on the other instances contains the GTID superset.
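
Comparing gtid_executed values by eye is error-prone when they contain several source UUIDs or gaps. The following is a minimal sketch, not part of MySQL Shell, of how the values collected from each instance could be compared in plain JavaScript. The function names (parseGtidSet, isSubset, findGtidSuperset), the instance addresses, and the UUID are illustrative assumptions, and the parser assumes untagged GTIDs of the form uuid:interval[:interval].

```javascript
// Sketch only: these helpers are hypothetical and are not MySQL Shell APIs.
// Assumes untagged GTID sets such as "uuid:1-5:7,uuid2:1-3".

// Merge overlapping or adjacent [start, end] intervals.
function normalizeIntervals(intervals) {
  const sorted = intervals.slice().sort((a, b) => a[0] - b[0]);
  const merged = [];
  for (const [start, end] of sorted) {
    const last = merged[merged.length - 1];
    if (last && start <= last[1] + 1) {
      last[1] = Math.max(last[1], end);
    } else {
      merged.push([start, end]);
    }
  }
  return merged;
}

// Parse a gtid_executed string into a Map of UUID -> merged intervals.
function parseGtidSet(text) {
  const result = new Map();
  const cleaned = text.replace(/\s+/g, ""); // SHOW output may contain newlines
  for (const part of cleaned.split(",")) {
    if (part === "") continue;
    const [rawUuid, ...ranges] = part.split(":");
    const uuid = rawUuid.toLowerCase();
    const intervals = result.get(uuid) || [];
    for (const range of ranges) {
      const [start, end = start] = range.split("-").map(Number);
      intervals.push([start, end]);
    }
    result.set(uuid, normalizeIntervals(intervals));
  }
  return result;
}

// True if every transaction in `inner` is also present in `outer`.
function isSubset(inner, outer) {
  for (const [uuid, intervals] of inner) {
    const candidates = outer.get(uuid) || [];
    for (const [start, end] of intervals) {
      if (!candidates.some(([s, e]) => s <= start && end <= e)) return false;
    }
  }
  return true;
}

// Return the address whose set contains all the others, or null if the
// sets have diverged and no single instance holds the superset.
function findGtidSuperset(gtidByInstance) {
  const parsed = Object.entries(gtidByInstance).map(
    ([address, text]) => [address, parseGtidSet(text)]
  );
  for (const [address, candidate] of parsed) {
    if (parsed.every(([, other]) => isSubset(other, candidate))) return address;
  }
  return null;
}

// Hypothetical values copied from each instance's gtid_executed.
const example = {
  "127.0.0.1:4000": "3e11fa47-71ca-11e1-9e33-c80aa9429562:1-10",
  "127.0.0.1:4001": "3e11fa47-71ca-11e1-9e33-c80aa9429562:1-15",
  "127.0.0.1:4002": "3e11fa47-71ca-11e1-9e33-c80aa9429562:1-12",
};
console.log(findGtidSuperset(example)); // prints 127.0.0.1:4001
```

A null result suggests that the members have diverged, in which case no instance can be used as the seed without the force option.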

Note

It is possible to override this behavior, and use an instance with a lower GTID set, by running dba.rebootClusterFromCompleteOutage() with the force option.

This makes the selected member the primary and discards any transactions not included in the selected member's GTID set.

If this process fails, and the cluster metadata has become badly corrupted, you might need to drop the metadata and create the cluster again from scratch. You can drop the cluster metadata using dba.dropMetadataSchema().

Warning

The dba.dropMetadataSchema() method should only be used as a last resort, when it is not possible to restore the cluster. It cannot be undone.

If you are using MySQL Router with the cluster, when you drop the metadata, all current connections are dropped and new connections are forbidden. This causes a full outage.

Options

dba.rebootClusterFromCompleteOutage() has the following options:

force: true | false (default): If true, the operation is executed even if some members of the Cluster cannot be reached, or the primary instance selected has a diverging or lower GTID_SET. See Force Option.
dryRun: true | false (default): If true, performs all validations and steps of the command, but makes no changes. A report is displayed when finished. See Testing rebootClusterFromCompleteOutage.
primary: Instance definition representing the instance that must be selected as the primary. See Selecting a Primary with rebootClusterFromCompleteOutage.
switchCommunicationStack: mysql | xcom: The Group Replication protocol stack to be used by the Cluster after the reboot. See Section 7.5.9, “Configuring the Group Replication Communication Stack”.
ipAllowList: The list of hosts allowed to connect to the instance for Group Replication traffic when using the XCOM protocol stack.
localAddress: String value with the Group Replication local address to use instead of the automatically generated one when using the XCOM protocol stack.

Force Option

The force option enables you to ignore the availability of Cluster members or GTID-set divergence in the selected member and reboot the Cluster.

For example, rebooting the Cluster myCluster:

  JS> var cluster = dba.rebootClusterFromCompleteOutage("myCluster",{force: true})

The force option is not permitted in the following situations:

The Cluster belongs to a ClusterSet and is INVALIDATED, or the primary Cluster is not in global status OK.
The Cluster belongs to a ClusterSet, is the primary Cluster, and is INVALIDATED.

It is not possible to add or rejoin instances with rebootClusterFromCompleteOutage. If you used force to ignore unreachable members and reboot your Cluster, you must use cluster.rejoinInstance() to add the unreachable members to the Cluster.
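
As an illustration of the follow-up step after a forced reboot, the sketch below calls cluster.rejoinInstance() for each member that was skipped and collects any failures instead of stopping at the first one. The helper name rejoinMembers and the addresses shown are hypothetical; only cluster.rejoinInstance() comes from the text above.

```javascript
// Sketch only: rejoinMembers is a hypothetical helper, not a MySQL Shell API.
// cluster:   the Cluster object returned by dba.rebootClusterFromCompleteOutage()
// addresses: members that were unreachable when the Cluster was rebooted
function rejoinMembers(cluster, addresses) {
  const failed = [];
  for (const address of addresses) {
    try {
      cluster.rejoinInstance(address);
    } catch (err) {
      // Record the failure and continue with the remaining members.
      failed.push({ address: address, reason: err.message });
    }
  }
  return failed;
}

// Example use in MySQL Shell (addresses are placeholders):
//   var cluster = dba.rebootClusterFromCompleteOutage("myCluster", {force: true})
//   var failed = rejoinMembers(cluster, ["127.0.0.1:4000", "127.0.0.1:4002"])
```

Any entries in the returned list identify members that could not be rejoined yet and can be retried once they are reachable.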

Selecting a Primary with rebootClusterFromCompleteOutage

You can define the Cluster primary in one of the following ways:

Define the primary option in the dba.rebootClusterFromCompleteOutage() command.
For example, rebooting the Cluster myCluster and setting the member running on the local machine, on port 4001, as the primary:
```
var cluster = dba.rebootClusterFromCompleteOutage("myCluster",{primary: "127.0.0.1:4001"})
```
Use the primary option together with the force option to select a Cluster member that has a lower GTID set than another member.

Testing rebootClusterFromCompleteOutage

You can test the changes by using the dryRun option. This option validates the command and its options and generates a log of results. An exception is thrown if there is a problem with the proposed changes.

The following example shows a dry run of rebooting the Cluster, myCluster, setting the primary to the local member running on port 4001, and the log message it returns:

  JS> var cluster = dba.rebootClusterFromCompleteOutage("myCluster",{primary: "127.0.0.1:4001", dryRun: true})

  NOTE: dryRun option was specified. Validations will be executed, but no changes will be applied.
  Cluster instances: '127.0.0.1:4000' (OFFLINE), '127.0.0.1:4001' (OFFLINE), '127.0.0.1:4002' (OFFLINE)
  Switching over to instance '127.0.0.1:4001' to be used as seed.
  dryRun finished.
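
Because a dry run throws an exception when validation fails, one possible pattern is to run the dry run first and perform the real reboot only if it completes. The sketch below assumes this pattern; the wrapper name rebootAfterDryRun is hypothetical, and the dba object is passed in as a parameter so that the function does not depend on the MySQL Shell global.

```javascript
// Sketch only: rebootAfterDryRun is a hypothetical wrapper, not a MySQL Shell API.
// dba:         the MySQL Shell dba global object
// clusterName: name of the Cluster to reboot
// options:     options for the real reboot, for example {primary: "127.0.0.1:4001"}
function rebootAfterDryRun(dba, clusterName, options) {
  // Validation pass: no changes are made, and an exception is thrown
  // if there is a problem with the proposed changes.
  dba.rebootClusterFromCompleteOutage(
    clusterName,
    Object.assign({}, options, { dryRun: true })
  );
  // Reached only if the dry run did not throw.
  return dba.rebootClusterFromCompleteOutage(
    clusterName,
    Object.assign({}, options, { dryRun: false })
  );
}

// Example use in MySQL Shell:
//   var cluster = rebootAfterDryRun(dba, "myCluster", {primary: "127.0.0.1:4001"})
```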

Considerations for ClusterSet and ReplicaSet

rebootClusterFromCompleteOutage performs the following checks and generates a warning if the Cluster does not meet the requirements:

Confirms the Replica Cluster was not forcibly removed from the ClusterSet.
Confirms the ClusterSet's primary Cluster is reachable.
Checks the Cluster for errant transactions which are not View Change Log Events (VCLE). See How Distributed Recovery Works.
Confirms the Cluster's executed transaction set (GTID_EXECUTED) is not empty.

The command automatically rejoins a Replica Cluster to the ClusterSet, ensuring the ClusterSet replication channel is configured for all Cluster members.

Switching Communication Stack

You can switch the communication stack during a dba.rebootClusterFromCompleteOutage() operation.

For example:

  JS> dba.rebootClusterFromCompleteOutage("testcluster", {switchCommunicationStack: "mysql"})

Switching from the MYSQL protocol to XCOM requires an additional network address for the localAddress and may also require you to define ipAllowList values.
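
For the opposite direction, the options described above can be combined in a single call. The following is a minimal sketch of assembling them; the helper name xcomSwitchOptions and the address and allowlist values are placeholders chosen for illustration, not recommended settings.

```javascript
// Sketch only: xcomSwitchOptions is a hypothetical helper, not a MySQL Shell API.
// localAddress: Group Replication local address to use with the XCOM stack
// ipAllowList:  optional list of hosts allowed to connect for Group Replication traffic
function xcomSwitchOptions(localAddress, ipAllowList) {
  const options = {
    switchCommunicationStack: "xcom",
    localAddress: localAddress,
  };
  // Only set ipAllowList when a value is supplied.
  if (ipAllowList) {
    options.ipAllowList = ipAllowList;
  }
  return options;
}

// Example use in MySQL Shell (placeholder values):
//   dba.rebootClusterFromCompleteOutage("testcluster",
//     xcomSwitchOptions("10.0.0.5:33061", "10.0.0.0/24"))
```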