MySQL 8.0 Reference Manual Including MySQL NDB Cluster 8.0

Configuring Distributed Recovery

Distributed recovery is fault tolerant and configurable. This section explains how to adjust the settings.

Donor Selection

A donor is selected randomly from the existing list of online group members in the current view. Selecting a random donor means that there is a good chance that the same server is not selected more than once when multiple members enter the group.

From MySQL 8.0.17, the joiner only selects a donor that is running a lower or equal patch version of MySQL Server compared to itself. In earlier releases, any online member can be a donor.
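
To see which servers are potential donors, you can list the online members and their versions from any existing member. The query below is a minimal sketch; it assumes that the MEMBER_VERSION column of the performance_schema.replication_group_members table is available in your release.

```sql
-- List the online members, which are the potential donors in the current view.
SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_VERSION
  FROM performance_schema.replication_group_members
 WHERE MEMBER_STATE = 'ONLINE';
```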

Donor Connection Retries

If the connection to the selected donor fails, a new connection is automatically attempted to a new candidate donor. Selecting a different donor in the event of an error means that there is a chance the new candidate donor does not have the same error.

Distributed recovery detects an error and switches over to a new donor in the following scenarios:

  • Connection error - There is an authentication issue or another problem with making the connection to a candidate donor.

  • Purged transactions - The selected donor has purged from its binary logs, or never had in its binary logs, some transactions that are needed for the recovery process.

  • Duplicated transactions - The joining member already contains some transactions that conflict with the transactions coming from the selected donor during recovery. This could be caused by some errant transactions present on the server joining the group.

  • Replication errors - If either of the recovery threads (receiver or applier) fails, an error occurs and recovery switches over to a new donor. Distributed recovery data transfer relies on the binary log and the existing MySQL replication framework, so transient errors can cause failures in the receiver or applier threads.
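
When recovery switches donors, the cause can usually be found in the Performance Schema replication tables for the group_replication_recovery channel. The following queries are a sketch of one way to inspect the receiver and applier threads on the joining member; which table holds the error depends on which thread failed.

```sql
-- Receiver (connection) thread state and last error for the recovery channel.
SELECT CHANNEL_NAME, SERVICE_STATE, LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE
  FROM performance_schema.replication_connection_status
 WHERE CHANNEL_NAME = 'group_replication_recovery';

-- Applier worker state and last error for the recovery channel.
SELECT CHANNEL_NAME, SERVICE_STATE, LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE
  FROM performance_schema.replication_applier_status_by_worker
 WHERE CHANNEL_NAME = 'group_replication_recovery';
```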

Number of Connection Attempts

If the connection retry limit is reached without a successful connection, the distributed recovery procedure terminates with an error. By default, a joining member makes 10 attempts to connect to a donor from the pool of donors. This limit is global: it counts the total number of attempts that the server joining the group makes across all of the suitable donors, not the number of attempts per donor.

You can configure this setting using the group_replication_recovery_retry_count system variable. The following command sets the maximum number of attempts to connect to a donor to 5:

mysql> SET GLOBAL group_replication_recovery_retry_count= 5;

Sleep Interval for Connection Attempts

The group_replication_recovery_reconnect_interval system variable defines how much time the recovery process should sleep between donor connection attempts. The default is 60 seconds, and you can change this value dynamically. The following command sets the recovery donor connection retry interval to 120 seconds:

mysql> SET GLOBAL group_replication_recovery_reconnect_interval= 120;

Note, however, that recovery does not sleep after every donor connection attempt. As the server joining the group is connecting to different servers and not to the same one over and over again, it can assume that the problem that affects server A does not affect server B. As such, recovery suspends only when it has gone through all the possible donors. Once the server joining the group has tried to connect to all the suitable donors in the group and none remains, the recovery process sleeps for the number of seconds configured by the group_replication_recovery_reconnect_interval system variable.
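
Because the retry count and the sleep interval work together, it can help to check both before changing either. The following statement is a small sketch that reads back the current global values; the column aliases are illustrative only.

```sql
-- Show the current retry limit and the sleep interval (in seconds).
SELECT @@GLOBAL.group_replication_recovery_retry_count AS retry_count,
       @@GLOBAL.group_replication_recovery_reconnect_interval AS reconnect_interval;
```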

Marking the Joining Member Online

When distributed recovery has successfully completed state transfer from the donor to the joining member, the joining member can be marked as online in the group and ready to participate. By default, this is done after the joining member has received and applied all the transactions that it was missing. Optionally, you can allow a joining member to be marked as online when it has received and certified (that is, completed conflict detection for) all the transactions that it was missing, but before it has applied them. If you want to do this, use the group_replication_recovery_complete_at system variable to specify the alternative setting TRANSACTIONS_CERTIFIED.
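
As a sketch, the alternative setting can be applied as follows. The second statement restores the default behavior; TRANSACTIONS_APPLIED is the name of the default value for this variable.

```sql
-- Mark the joining member online once the missing transactions are certified,
-- without waiting for them to be applied.
SET GLOBAL group_replication_recovery_complete_at = 'TRANSACTIONS_CERTIFIED';

-- Return to the default: wait until the missing transactions are applied.
SET GLOBAL group_replication_recovery_complete_at = 'TRANSACTIONS_APPLIED';
```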

SSL and Authentication for Distributed Recovery

You can optionally use SSL for distributed recovery connections between group members. SSL for distributed recovery is configured separately from SSL for normal group communications, which is determined by the server's SSL settings and the group_replication_ssl_mode system variable. For distributed recovery connections, dedicated Group Replication recovery SSL system variables are available to configure the use of certificates and ciphers specifically for the distributed recovery process over the group_replication_recovery channel.

By default, SSL is not used for distributed recovery connections. To activate this, set group_replication_recovery_use_ssl=ON, and configure the Group Replication recovery SSL system variables as described in Section 18.5.2, “Group Replication Secure Socket Layer (SSL) Support”. You need a replication user that is set up to use SSL.
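
The following statements sketch a possible configuration on a joining member. The certificate and key paths are hypothetical placeholders, and the example assumes that a replication user requiring SSL has already been created and configured for the group_replication_recovery channel.

```sql
-- Enable SSL for distributed recovery connections.
SET GLOBAL group_replication_recovery_use_ssl = ON;

-- Hypothetical file locations; substitute your own CA, certificate, and key.
SET GLOBAL group_replication_recovery_ssl_ca = '/etc/mysql/certs/ca.pem';
SET GLOBAL group_replication_recovery_ssl_cert = '/etc/mysql/certs/client-cert.pem';
SET GLOBAL group_replication_recovery_ssl_key = '/etc/mysql/certs/client-key.pem';
```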

If you are not using SSL for distributed recovery over the group_replication_recovery channel (that is, group_replication_recovery_use_ssl is set to OFF), and the replication user account for Group Replication authenticates with the caching_sha2_password plugin (the default in MySQL 8.0) or the sha256_password plugin, RSA key pairs are used for password exchange. In this case, either use the group_replication_recovery_public_key_path system variable to specify the RSA public key file, or use the group_replication_recovery_get_public_key system variable to request the public key from the source (the donor), as described in Using Group Replication and the Caching SHA-2 User Credentials Plugin.
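
The two options can be sketched as follows. Only one of them is needed; the key file path is a hypothetical placeholder. If both are set, a valid public key file is expected to take precedence over requesting the key.

```sql
-- Option 1: request the RSA public key from the donor when connecting.
SET GLOBAL group_replication_recovery_get_public_key = ON;

-- Option 2: use a local copy of the RSA public key (hypothetical path).
SET GLOBAL group_replication_recovery_public_key_path = '/etc/mysql/keys/public_key.pem';
```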

Compression for Distributed Recovery

From MySQL 8.0.18, you can optionally configure compression for distributed recovery connections between group members. Compression can benefit distributed recovery where network bandwidth is limited and the donor has to transfer many transactions to the joining member. The group_replication_recovery_compression_algorithms and group_replication_recovery_zstd_compression_level system variables configure the permitted compression algorithms and the zstd compression level for Group Replication recovery connections when a new member joins a group and connects to a donor. For more information, see Section 4.2.6, “Connection Compression Control”.
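
As a minimal sketch, the following statements permit zstd compression for recovery connections and choose a compression level. The level 5 is an illustrative value only, and the example assumes the plural variable name group_replication_recovery_compression_algorithms as listed in the Group Replication system variables reference.

```sql
-- Permit zstd for distributed recovery connections.
SET GLOBAL group_replication_recovery_compression_algorithms = 'zstd';

-- Choose the zstd compression level (illustrative value).
SET GLOBAL group_replication_recovery_zstd_compression_level = 5;
```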