Tuning the HA for MySQL Cluster Fault Monitor

The HA for MySQL Cluster fault monitor verifies that the data service is running in a healthy condition.

An HA for MySQL Cluster fault monitor is contained in each resource that represents the MySQL Cluster instance. You created these resources when you registered and configured HA for MySQL Cluster. For more information, see Registering and Configuring HA for MySQL Cluster.

System properties and extension properties of the MySQL Cluster resources control the behavior of the fault monitor. The default values of these properties determine the preset behavior of the fault monitor. Because the preset behavior should be suitable for most Oracle Solaris Cluster installations, tune the HA for MySQL Cluster fault monitor only if you need to modify this preset behavior.

Tuning the HA for MySQL Cluster fault monitor involves the following tasks, depending on the specific component:

Setting the return value for failed MySQL Cluster monitor connections for the ndb daemon
Setting the interval between fault monitor probes
Setting the timeout for fault monitor probes
Defining the criteria for persistent faults
Specifying the failover behavior of a resource

The fault monitor for the HA for MySQL Cluster ndb daemon differentiates between connection problems and definitive application failures. The value of ERROR_ON_SHOW in the MySQL Cluster ndb daemon parameter file specifies the return code for connection problems. Consecutive failed probes are tolerated as long as each of them returns the value of ERROR_ON_SHOW, and the first successful probe resets the count of failed probes to zero. The maximum number of tolerated failed probes is calculated as 100 / ERROR_ON_SHOW. A definitive application failure results in an immediate restart or failover.

The value that you set for ERROR_ON_SHOW selects one of two behaviors for failed database connections of a MySQL Cluster ndb daemon resource.

Retry the connection to the ndb management server several times before considering the MySQL Cluster ndb daemon resource as failed and triggering a restart or failover.
Complain at every probe that the connection to the test database failed. No restart or failover will be triggered.

To achieve either of these behaviors, use the standard resource properties retry_interval and thorough_probe_interval.

The probe only complains as long as the following inequality is true: retry_interval < thorough_probe_interval * 100 / ERROR_ON_SHOW
As soon as this inequality is false, the MySQL Cluster ndb daemon resource restarts after 100 / ERROR_ON_SHOW consecutive probe failures.

The value 100/ERROR_ON_SHOW defines the maximum number of retries for the probe in the case of a failed connection.
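The inequality above can be checked with a few lines of Python. This is only a minimal sketch of the arithmetic described in this section, not part of the data service. The function names are hypothetical, and the intervals are assumed to be in seconds.

```python
def max_connection_retries(error_on_show):
    """Maximum number of failed-connection probes, as 100 / ERROR_ON_SHOW."""
    return 100 / error_on_show


def only_complains(retry_interval, thorough_probe_interval, error_on_show):
    """Return True if failed connections are only logged, never acted upon.

    Implements: retry_interval < thorough_probe_interval * 100 / ERROR_ON_SHOW
    """
    limit = thorough_probe_interval * max_connection_retries(error_on_show)
    return retry_interval < limit


# Hypothetical settings: 660 is not less than 90 * 4 = 360, so a restart is possible.
print(only_complains(retry_interval=660, thorough_probe_interval=90, error_on_show=25))
# Hypothetical settings: 300 is less than 360, so the probe only complains.
print(only_complains(retry_interval=300, thorough_probe_interval=90, error_on_show=25))
```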

Assume that the following resource parameters are set:
- thorough_probe_interval=90
- retry_interval=660
- ERROR_ON_SHOW=25
If, for example, the management servers are unresponsive for 4.5 minutes, you will see three complaints in /var/adm/messages, but no resource restart. If the outage lasts 6 minutes, the MySQL Cluster ndb daemon resource restarts after the fourth failed probe.

If you do not want a resource restart in the previous example, set the value of ERROR_ON_SHOW to 15 or less.
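The example can be replayed with a small simulation. The sketch below rests on simplifying assumptions that are not stated in this guide: one probe runs every thorough_probe_interval seconds, every probe during the outage fails with ERROR_ON_SHOW, probe run time is ignored, and 100 / ERROR_ON_SHOW is rounded up to a whole number of probes. The function name is hypothetical.

```python
import math


def outage_outcome(outage_seconds, thorough_probe_interval, retry_interval, error_on_show):
    """Return (failed_probes_seen, restart_triggered) for one outage.

    Assumption: failed probes count toward a restart only if enough of them
    fit inside retry_interval.
    """
    failed_probes = int(outage_seconds // thorough_probe_interval)
    needed = math.ceil(100 / error_on_show)  # probes required for a restart
    fits_in_window = needed * thorough_probe_interval <= retry_interval
    restart = fits_in_window and failed_probes >= needed
    # Once a restart is triggered, no further probes are counted.
    return (needed if restart else failed_probes, restart)


print(outage_outcome(270, 90, 660, 25))  # 4.5 minutes -> (3, False)
print(outage_outcome(360, 90, 660, 25))  # 6 minutes   -> (4, True)
print(outage_outcome(360, 90, 660, 15))  # 6 minutes with ERROR_ON_SHOW=15 -> (4, False)
```

Under these assumptions, a value of 15 requires seven failed probes, so a 6-minute outage ends before a restart is triggered.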

For more information, see Tuning Fault Monitors for Oracle Solaris Cluster Data Services in Oracle Solaris Cluster 4.3 Data Services Planning and Administration Guide.

This section contains the following additional information:

Operation of the HA for MySQL Cluster Management Server Parameter File

The HA for MySQL Cluster management server resources use a parameter file to pass parameters to the start, stop, and probe commands. Changes to these parameters take effect, at the latest, whenever the resource is restarted, enabled, or disabled.

A change to any of the following parameters takes effect at the next probe of the MySQL Cluster management server resource:

BASEDIR
USER
TRY_RECONNECT
CONNECT_STRING
CONFIG_DIR
ID

Note - An unexpected change of the parameters with an enabled MySQL Cluster management server resource might result in an unplanned service outage. To avoid such an outage, first disable the MySQL Cluster management server resource, execute the change, and then re-enable the resource.
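The disable, change, and re-enable sequence from the note can be scripted. The following is a minimal sketch that assumes the clresource command is on the PATH and that the caller supplies the code that edits the parameter file. The function name, the resource name mgm-rs, and the run hook are hypothetical and exist only for illustration.

```python
import subprocess


def change_parameters_safely(resource, apply_change, run=None):
    """Disable the resource, apply the parameter change, then re-enable it.

    resource     -- name of the cluster resource (hypothetical, e.g. "mgm-rs")
    apply_change -- callable that edits the parameter file
    run          -- callable that executes a command list; defaults to
                    subprocess.run with check=True so a failed step raises
    """
    if run is None:
        run = lambda argv: subprocess.run(argv, check=True)
    run(["clresource", "disable", resource])
    apply_change()
    run(["clresource", "enable", resource])


# Dry run: record the steps instead of executing them.
steps = []
change_parameters_safely("mgm-rs", lambda: steps.append("edit parameter file"), run=steps.append)
for step in steps:
    print(step)
```

The run hook allows a dry run on a host without the cluster software. If a step fails, the sequence stops, so the resource is not re-enabled with a partially edited file.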

Operation of the HA for MySQL Cluster ndb Daemon Parameter File

The HA for MySQL Cluster ndb daemon resources use a parameter file to pass parameters to the start, stop, and probe commands. Changes to these parameters take effect, at the latest, whenever the resource is restarted, enabled, or disabled.

A change to any of the following parameters takes effect at the next probe of the MySQL Cluster ndb daemon resource:

BASEDIR
USER
TRY_RECONNECT
CONNECT_STRING
ID
MULTI_THREAD
DATA_DIR
ERROR_ON_SHOW

Caution - Do not lower the Probe_timeout property of the ndb daemon resource below 70 seconds. The probe algorithm relies on the presence of a management server. If the first physical node specified in CONNECT_STRING is down, the probe waits for a 60-second timeout. Enough time must remain to run the probe request on the second node specified in CONNECT_STRING.
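The arithmetic behind the caution above can be sketched as follows. The 60-second timeout and the 70-second minimum come from the caution. The constant and function names are hypothetical.

```python
FIRST_NODE_TIMEOUT = 60  # seconds lost when the first node in CONNECT_STRING is down
MIN_PROBE_TIMEOUT = 70   # lowest recommended Probe_timeout, in seconds


def remaining_probe_budget(probe_timeout, first_node_down=True):
    """Seconds left to probe the second node specified in CONNECT_STRING."""
    used = FIRST_NODE_TIMEOUT if first_node_down else 0
    return probe_timeout - used


def probe_timeout_is_safe(probe_timeout):
    """Return True if Probe_timeout respects the 70-second minimum."""
    return probe_timeout >= MIN_PROBE_TIMEOUT


print(remaining_probe_budget(70))  # 10 seconds remain for the second node
print(probe_timeout_is_safe(65))   # False: only 5 seconds would remain
```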

Note - An unexpected change of the parameters with an enabled MySQL Cluster ndb daemon resource might result in an unplanned service outage. To avoid such an outage, first disable the MySQL Cluster ndb daemon resource, execute the change, and then re-enable the resource.

Operation of the Fault Monitor for HA for MySQL Cluster Management Server

The fault monitor for HA for MySQL Cluster management server ensures that all the requirements for the MySQL Cluster management server component to run are met. These requirements include the following:

The HA for MySQL Cluster management server ndb_mgmd process is running. If this process is not running, the fault monitor restarts the MySQL Cluster management server. If the fault persists, the fault monitor gives up on the resource group that contains the resource for the MySQL Cluster management server because it is a scalable or multiple-master resource.
Connections to the MySQL Cluster management server are possible, and the ndb_mgm STATUS command does not show the value "not connected" for the selected server ID.

Operation of the Fault Monitor for HA for MySQL Cluster ndb Daemon

The fault monitor for HA for MySQL Cluster ndb daemon ensures that all the requirements for the MySQL Cluster ndb daemon component to run are met. These requirements include the following:

The HA for MySQL Cluster ndb daemon ndbd or ndbmtd process is running, depending on the MULTI_THREAD value at resource start time. If this process is not running, the fault monitor restarts the MySQL Cluster ndb daemon. If the fault persists, the fault monitor gives up the resource group that contains the resource for the MySQL Cluster ndb daemon, because it is a multiple-master resource.
Connections to the MySQL Cluster ndb daemon management server are possible, and the ndb_mgm STATUS command shows the value "started" or "starting" for the selected server ID. If the resource is waiting to be put online, only "started" is a legal value for the selected server ID.

If the connection to the management server fails, the probe exits with the connection-failed return code ERROR_ON_SHOW. If the ndb_mgm STATUS command shows an illegal value, the fault monitor restarts the MySQL Cluster ndb daemon resource, unless the resource is in its wait-for-online phase.
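The decision logic of the ndb daemon probe, as described above, can be summarized in a short sketch. It is an illustration, not the actual probe script. The names are hypothetical, the status string is assumed to be already extracted from the ndb_mgm STATUS output, and the KEEP_WAITING outcome is an assumption for the case that this guide does not spell out (an illegal value during the wait-for-online phase).

```python
from enum import Enum


class ProbeResult(Enum):
    HEALTHY = "healthy"
    CONNECTION_FAILED = "connection failed"  # probe exits with ERROR_ON_SHOW
    RESTART = "restart"                      # definitive application failure
    KEEP_WAITING = "keep waiting"            # assumption: no restart while waiting for online


def classify_ndbd_probe(connected, node_status, waiting_for_online):
    """Map one probe observation to an outcome.

    connected          -- True if the management server could be reached
    node_status        -- status reported for the selected server ID
    waiting_for_online -- True while the resource waits to be put online
    """
    if not connected:
        return ProbeResult.CONNECTION_FAILED
    legal = {"started"} if waiting_for_online else {"started", "starting"}
    if node_status in legal:
        return ProbeResult.HEALTHY
    if waiting_for_online:
        return ProbeResult.KEEP_WAITING
    return ProbeResult.RESTART


print(classify_ndbd_probe(True, "starting", waiting_for_online=False))
print(classify_ndbd_probe(True, "starting", waiting_for_online=True))
print(classify_ndbd_probe(False, None, waiting_for_online=False))
```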

Oracle® Solaris Cluster Data Service for MySQL Cluster Guide