MySQL Shell 8.0 (part of MySQL 8.0)

6.2.7 Troubleshooting InnoDB Cluster

This section describes how to troubleshoot an InnoDB Cluster.

Rejoining an Instance to a Cluster

If an instance leaves the cluster, for example because it lost connection, and for some reason it could not automatically rejoin the cluster, it might be necessary to rejoin it to the cluster at a later stage. To rejoin an instance to a cluster issue Cluster.rejoinInstance(instance).

Tip

If the instance has super_read_only=ON then you might need to confirm that AdminAPI can set super_read_only=OFF. See Super Read-only and Instances for more information.

In the case where an instance has not had its configuration persisted (see Persisting Settings), upon restart the instance does not rejoin the cluster automatically. The solution is to issue cluster.rejoinInstance() so that the instance is added to the cluster again and ensure the changes are persisted. Once the InnoDB Cluster configuration is persisted to the instance's option file it rejoins the cluster automatically.

If you are rejoining an instance which has changed in some way then you might have to modify the instance to make the rejoin process work correctly. For example, when you restore a MySQL Enterprise Backup backup, the server_uuid changes. Attempting to rejoin such an instance fails because InnoDB Cluster instances are identified by the server_uuid variable. In such a situation, information about the instance's old server_uuid must be removed from the InnoDB Cluster metadata and then a Cluster.rescan() must be executed to add the instance to the metadata using it's new server_uuid. For example:

cluster.removeInstance("root@instanceWithOldUUID:3306", {force: true})

cluster.rescan()

In this case you must pass the force option to the Cluster.removeInstance() method because the instance is unreachable from the cluster's perspective and we want to remove it from the InnoDB Cluster metadata anyway.

Restoring a Cluster from Quorum Loss

If an instance (or instances) fail, then a cluster can lose its quorum, which is the ability to vote in a new primary. This can happen when there is a failure of enough instances that there is no longer a majority of the instances which make up the cluster to vote on Group Replication operations. See Fault-tolerance. When a cluster loses quorum you can no longer process write transactions with the cluster, or change the cluster's topology, for example by adding, rejoining, or removing instances. However if you have an instance online which contains the InnoDB Cluster metadata, it is possible to restore a cluster with quorum. This assumes you can connect to an instance that contains the InnoDB Cluster metadata, and that instance can contact the other instances you want to use to restore the cluster.

Important

This operation is potentially dangerous because it can create a split-brain scenario if incorrectly used and should be considered a last resort. Make absolutely sure that there are no partitions of this group that are still operating somewhere in the network, but not accessible from your location.

Connect to an instance which contains the cluster's metadata, then use the Cluster.forceQuorumUsingPartitionOf(instance) operation, which restores the cluster based on the metadata on instance, and then all the instances that are ONLINE from the point of view of the given instance definition are added to the restored cluster.

mysql-js> cluster.forceQuorumUsingPartitionOf("icadmin@ic-1:3306")

  Restoring replicaset 'default' from loss of quorum, by using the partition composed of [icadmin@ic-1:3306]

  Please provide the password for 'icadmin@ic-1:3306': ******
  Restoring the InnoDB cluster ...

  The InnoDB cluster was successfully restored using the partition from the instance 'icadmin@ic-1:3306'.

  WARNING: To avoid a split-brain scenario, ensure that all other members of the replicaset
  are removed or joined back to the group that was restored.

In the event that an instance is not automatically added to the cluster, for example if its settings were not persisted, use Cluster.rejoinInstance() to manually add the instance back to the cluster.

The restored cluster might not, and does not have to, consist of all of the original instances which made up the cluster. For example, if the original cluster consisted of the following five instances:

  • ic-1

  • ic-2

  • ic-3

  • ic-4

  • ic-5

and the cluster experiences a split-brain scenario, with ic-1, ic-2, and ic-3 forming one partition while ic-4 and ic-5 form another partition. If you connect to ic-1 and issue Cluster.forceQuorumUsingPartitionOf('icadmin@ic-1:3306') to restore the cluster the resulting cluster would consist of these three instances:

  • ic-1

  • ic-2

  • ic-3

because ic-1 sees ic-2 and ic-3 as ONLINE and does not see ic-4 and ic-5.

Rebooting a Cluster from a Major Outage

If your cluster suffers from a complete outage, you can ensure it is reconfigured correctly using dba.rebootClusterFromCompleteOutage(). This operation takes the instance which MySQL Shell is currently connected to and uses its metadata to recover the cluster. In the event that a cluster's instances have completely stopped, the instances must be started and only then can the cluster be started. For example if the machine a sandbox cluster was running on has been restarted, and the instances were at ports 3310, 3320 and 3330, issue:

mysql-js> dba.startSandboxInstance(3310)
mysql-js> dba.startSandboxInstance(3320)
mysql-js> dba.startSandboxInstance(3330)
    

This ensures the sandbox instances are running. In the case of a production deployment you would have to start the instances outside of MySQL Shell. Once the instances have started, you need to connect to an instance with the GTID superset, which means the instance which had applied the most transaction before the outage. If you are unsure which instance contains the GTID superset, connect to any instance and follow the interactive messages from the dba.rebootClusterFromCompleteOutage() operation, which detects if the instance you are connected to contains the GTID superset. Reboot the cluster by issuing:

mysql-js> var cluster = dba.rebootClusterFromCompleteOutage();

The dba.rebootClusterFromCompleteOutage() operation then follows these steps to ensure the cluster is correctly reconfigured:

  • The InnoDB Cluster metadata found on the instance which MySQL Shell is currently connected to is checked to see if it contains the GTID superset, in other words the transactions applied by the cluster. If the currently connected instance does not contain the GTID superset, the operation aborts with that information. See the subsequent paragraphs for more information.

  • If the instance contains the GTID superset, the cluster is recovered based on the metadata of the instance.

  • Assuming you are running MySQL Shell in interactive mode, a wizard is run that checks which instances of the cluster are currently reachable and asks if you want to rejoin any discovered instances to the rebooted cluster.

  • Similarly, in interactive mode the wizard also detects instances which are currently not reachable and asks if you would like to remove such instances from the rebooted cluster.

If you are not using MySQL Shell's interactive mode, you can use the rejoinInstances and removeInstances options to manually configure instances which should be joined or removed during the reboot of the cluster.

If you encounter an error such as The active session instance isn't the most updated in comparison with the ONLINE instances of the Cluster's metadata. then the instance you are connected to does not have the GTID superset of transactions applied by the cluster. In this situation, connect MySQL Shell to the instance suggested in the error message and issue dba.rebootClusterFromCompleteOutage() from that instance.

Tip

To manually detect which instance has the GTID superset rather than using the interactive wizard, check the gtid_executed variable on each instance. For example issue:

mysql-sql> SHOW VARIABLES LIKE 'gtid_executed';

The instance which has applied the largest GTID set of transactions contains the GTID superset.

If this process fails, and the cluster metadata has become badly corrupted, you might need to drop the metadata and create the cluster again from scratch. You can drop the cluster metadata using dba.dropMetadataSchema().

Warning

The dba.dropMetadataSchema() method should only be used as a last resort, when it is not possible to restore the cluster. It cannot be undone.

If you are using MySQL Router with the cluster, when you drop the metadata, all current connections are dropped and new connections are forbidden. This causes a full outage.

Rescanning a Cluster

If you make configuration changes to a cluster outside of the AdminAPI commands, for example by changing an instance's configuration manually to resolve configuration issues or after the loss of an instance, you need to update the InnoDB Cluster metadata so that it matches the current configuration of instances. In these cases, use the Cluster.rescan() operation, which enables you to update the InnoDB Cluster metadata either manually or using an interactive wizard. The Cluster.rescan() operation can detect new active instances that are not registered in the metadata and add them, or obsolete instances (no longer active) still registered in the metadata, and remove them. You can automatically update the metadata depending on the instances found by the command, or you can specify a list of instance addresses to either add to the metadata or remove from the metadata. You can also update the topology mode stored in the metadata, for example after changing from single-primary mode to multi-primary mode outside of AdminAPI.

The syntax of the command is Cluster.rescan([options]). The options dictionary supports the following:

  • interactive: boolean value used to disable or enable the wizards in the command execution. Controls whether prompts and confirmations are provided. The default value is equal to MySQL Shell wizard mode, specified by shell.options.useWizards.

  • addInstances: list with the connection data of the new active instances to add to the metadata, or auto to automatically add missing instances to the metadata. The value auto is case-insensitive.

    • Instances specified in the list are added to the metadata, without prompting for confirmation

    • In interactive mode, you are prompted to confirm the addition of newly discovered instances that are not included in the addInstances option

    • In non-interactive mode, newly discovered instances that are not included in the addInstances option are reported in the output, but you are not prompted to add them

  • removeInstances: list with the connection data of the obsolete instances to remove from the metadata, or auto to automatically remove obsolete instances from the metadata.

    • Instances specified in the list are removed from the metadata, without prompting for confirmation

    • In interactive mode, you are prompted to confirm the removal of obsolete instances that are not included in the removeInstances option

    • In non-interactive mode, obsolete instances that are not included in the removeInstances option are reported in the output but you are not prompted to remove them

  • updateTopologyMode: boolean value used to indicate if the topology mode (single-primary or multi-primary) in the metadata should be updated (true) or not (false) to match the one being used by the cluster. By default, the metadata is not updated (false).

    • If the value is true then the InnoDB Cluster metadata is compared to the current mode being used by Group Replication, and the metadata is updated if necessary. Use this option to update the metadata after making changes to the topology mode of your cluster outside of AdminAPI.

    • If the value is false then InnoDB Cluster metadata about the cluster's topology mode is not updated even if it is different from the topology used by the cluster's Group Replication group

    • If the option is not specified and the topology mode in the metadata is different from the topology used by the cluster's Group Replication group, then:

      • In interactive mode, you are prompted to confirm the update of the topology mode in the metadata

      • In non-interactive mode, if there is a difference between the topology used by the cluster's Group Replication group and the InnoDB Cluster metadata, it is reported and no changes are made to the metadata

    • When the metadata topology mode is updated to match the Group Replication mode, the auto-increment settings on all instances are updated as described at InnoDB Cluster and Auto-increment.