Sun Cluster 2.2 System Administration Guide

Chapter 9 Using Dual-String Mediators

This chapter describes the Solstice DiskSuite feature that allows Sun Cluster to run highly available data services using only two disk strings. Refer to the Solstice DiskSuite documentation for more information about Solstice DiskSuite features and concepts.

Overview of Mediators

Sun Cluster requires that a dual-string configuration survive the failure of a single node or of a single string of drives without user intervention.

In a dual-string configuration, metadevice state database replicas are always placed such that exactly half of the replicas are on one string and half are on a second string. A quorum (half + 1 or more) of the replicas is required to guarantee that the most current data is being presented. In the dual-string configuration, if one string becomes unavailable, a quorum of the replicas will not be available.

A mediator is a host (node) that stores mediator data. Mediator data provides information about the location of other mediators and contains a commit count that is identical to the commit count stored in the database replicas. This commit count is used to confirm that the mediator data is in sync with the data in the database replicas. Mediator data is individually verified before use.

Solstice DiskSuite requires a replica quorum (half + 1) to determine when "safe" operating conditions exist. This guarantees data correctness. With a dual-string configuration, it is possible that only one string is accessible. In this situation it is impossible to get a replica quorum. If mediators are used and a mediator quorum is present, the mediator data can help you determine whether the data on the accessible string is up-to-date and safe to use.

The introduction of mediators enables the Sun Cluster software to ensure that the most current data is presented in the case of a single string failure in a dual-string configuration.

Golden Mediators

To avoid unnecessary user intervention in some dual-string failure scenarios, the concept of a golden mediator has been implemented. If exactly half of the database replicas are accessible and an event occurs that causes the mediator hosts to be updated, two mediator updates are attempted. The first update attempts to advance the commit count and to set the mediator status to not golden. The second update occurs if and only if, during the first update, all mediator hosts were successfully contacted and the number of replicas that were accessible (and whose commit count was advanced) was exactly half of the total number of replicas. If all of these conditions are met, the second update sets the mediator status to golden. The golden status enables a takeover by the host holding the golden mediator to proceed without user intervention. If the status is not golden, the data is set to read-only, and user intervention is required for a takeover or failover to succeed. For the user to initiate a takeover or failover, exactly half of the replicas must be accessible.

The golden state is stored in volatile memory (RAM) only. Once a takeover occurs, the mediator data is once again updated. If any mediator hosts cannot be updated, the golden state is revoked. Since the state is in RAM only, a reboot of a mediator host causes the golden state to be revoked. The default state for mediators is not golden.

Configuring Mediators

Figure 9-1 shows a Sun Cluster system configured with two strings and mediators on two Sun Cluster nodes.

Regardless of the number of nodes, there are still only two mediator hosts in the cluster. The mediator hosts are the same for all disksets using mediators in a given cluster, even when a mediator host is not a member of the server set capable of mastering the diskset.

To simplify the presentation, the configurations shown here use only one diskset and a symmetric configuration. The number of disksets is not significant in these sample scenarios. In the stable state, the diskset is mastered by phys-hahost1.

Figure 9-1 Sun Cluster System in Steady State With Mediators


Normally, if half + 1 of the database replicas are accessible, mediators are not used. When exactly half of the replicas are accessible, the mediator commit count can be used to determine whether the accessible half is the most up-to-date. To guarantee that the correct mediator commit count is used, both mediators must be accessible, or the accessible mediator must be golden. Half + 1 of the mediators constitutes a mediator quorum; the mediator quorum is independent of the replica quorum.

Failures Addressed by Mediators

With mediators, it is possible to recover from single failures, as well as from some double failures. Because Sun Cluster guarantees automatic recovery only from single failures, only the single-failure recovery situation is covered here in detail. The double-failure scenarios are included, but only general recovery processes are described.

Figure 9-1 shows a dual-string configuration in the stable state. Note that mediators are established on both Sun Cluster nodes, so both nodes must be up for a mediator quorum to exist and for mediators to be used. If one Sun Cluster node fails, a replica quorum will exist. If a takeover of the diskset is necessary, the takeover will occur without the use of mediators.

The following sections show various failure scenarios and describe how mediators help recover from these failures.

Single Server Failure

Figure 9-2 shows the situation where one Sun Cluster node fails. In this case, the mediator software is not used since there is a replica quorum available. Sun Cluster node phys-hahost2 will take over the diskset previously mastered by phys-hahost1.

The process for recovery in this scenario is identical to the process followed when one Sun Cluster node fails and there are more than two disk strings. No administrator action is required except perhaps to switch over the diskset after phys-hahost1 rejoins the cluster. See the haswitch(1M) man page for more information about the switchover procedure.
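
For example, after phys-hahost1 rejoins the cluster, you could switch the logical host back to it with a command similar to the following; the logical host name hahost1 is illustrative:


phys-hahost2# haswitch phys-hahost1 hahost1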

Figure 9-2 Single Sun Cluster Server Failure With Mediators


Single String Failure

Figure 9-3 illustrates the case where, starting from the steady state shown in Figure 9-1, a single string fails. When String 1 fails, the mediator hosts on both phys-hahost1 and phys-hahost2 will be updated to reflect the event, and the system will continue to run as follows:

The commit count is incremented and the mediator status is set to golden.

Figure 9-3 Single String Failure With Mediators


The administration required in this scenario is the same as that required when a single string fails in a configuration with three or more strings. Refer to the relevant chapter on administration of your disk expansion unit for details on these procedures.

Host and String Failure

Figure 9-4 shows a double failure in which both String 1 and phys-hahost2 fail. If the failure sequence is such that the string fails first and the host fails later, the mediator on phys-hahost1 could be golden.

Figure 9-4 Multiple Failure - One Server and One String


This type of failure is recovered automatically by Sun Cluster. If phys-hahost2 mastered the diskset, phys-hahost1 will take over mastery of the diskset. Otherwise, mastery of the diskset will be retained by phys-hahost1. After String 1 is fixed, the data on String 1 must be resynchronized with the data on String 2. For more information about the resynchronization process, refer to the Solstice DiskSuite User's Guide and the metareplace(1M) man page.
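
As a sketch of the resynchronization step, the metareplace(1M) -e option re-enables a failed component in a mirror of the diskset and starts a resync; the mirror and component names below are placeholders:


phys-hahost1# metareplace -s diskset -e mirror component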


Caution -

Although you can recover from this scenario, restore the failed components immediately, because a third failure will make the cluster unavailable.


If the mediator on phys-hahost1 is not golden, Sun Cluster does not recover from this case automatically, and administrative intervention is required. In this case, Sun Cluster generates an error message and the logical host is put into maintenance mode (read-only). If this or any other multiple failure occurs, contact your service provider for assistance.

Administering Mediators

Administer mediator hosts with the medstat(1M) and metaset(1M) commands. Use these commands to add or delete mediator hosts, and to check and fix mediator data. See the medstat(1M), metaset(1M), and mediator(7) man pages for details.

How to Add Mediator Hosts

Use this procedure after you have installed and configured Solstice DiskSuite.

  1. Start the cluster software on all nodes.

    On the first node:


    # scadmin startcluster
    

    On all remaining nodes:


    # scadmin startnode
    

  2. Determine the name of the private link for each node.

    Use grep(1) to find the private link address for each node in the clustername.cdb file.


    hahost1# grep "^cluster.node.0.hostname" \ 
    /etc/opt/SUNWcluster/conf/clustername.cdb
    cluster.node.0.hostname : hahost0
    phys-hahost1# grep "cluster.node.0.hahost0" \
    /etc/opt/SUNWcluster/conf/clustername.cdb | grep 204
    204.152.65.33
    
    hahost1# grep "^cluster.node.1.hostname" \
    /etc/opt/SUNWcluster/conf/clustername.cdb
    cluster.node.1.hostname : hahost1
    hahost1# grep "cluster.node.1.hahost1" \
    /etc/opt/SUNWcluster/conf/clustername.cdb | grep 204
    204.152.65.34

    In this example, 204.152.65.33 is the private link for hahost0 and 204.152.65.34 is the private link for hahost1.

  3. Configure mediators using the metaset(1M) command.

    Add each host with connectivity to the diskset as a mediator for that diskset. Run each command on the host that currently masters the diskset. Use the hastat(1M) command to determine the current master; the information hastat(1M) reports for the logical host identifies the diskset master.


    hahost1# metaset -s disksetA -a -m hahost0,204.152.65.33
    hahost1# metaset -s disksetA -a -m hahost1,204.152.65.34
    hahost1# metaset -s disksetB -a -m hahost0,204.152.65.33
    hahost1# metaset -s disksetB -a -m hahost1,204.152.65.34
    hahost1# metaset -s disksetC -a -m hahost0,204.152.65.33
    hahost1# metaset -s disksetC -a -m hahost1,204.152.65.34
    

    The metaset(1M) command treats the private link as an alias.

How to Check the Status of Mediator Data
  1. Run the medstat(1M) command.


    phys-hahost1# medstat -s diskset
    

    See the medstat(1M) man page to interpret the output. If the output indicates that the mediator data for any one of the mediator hosts for a given diskset is bad, refer to the following procedure to fix the problem.

How to Fix Bad Mediator Data

Note -

The medstat(1M) command checks the status of mediators. Use this procedure if medstat(1M) reports that a mediator host is bad.


  1. Remove the bad mediator host(s) from all affected diskset(s).

    Log into the Sun Cluster node that owns the affected diskset and enter:


    phys-hahost1# metaset -s diskset -d -m bad_mediator_host
    

  2. Restore the mediator host and its aliases:


    phys-hahost1# metaset -s diskset -a -m bad_mediator_host,physical_host_alias,...
    


    Note -

    The private links must be assigned as mediator host aliases. On the metaset(1M) command line, specify the physical host name first, followed by the private link address. See the mediator(7) man page for details on this use of the metaset(1M) command.


Handling Failures Without Automatic Recovery

Certain double-failure scenarios do not allow automatic recovery by Sun Cluster.

It is very important to monitor the state of the disksets, replicas, and mediators regularly. The medstat(1M) command is useful for this purpose. Repair bad mediator data, bad replicas, and failed disks immediately to avoid the risk of potentially damaging multiple-failure scenarios.
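
For example, you might periodically check the replicas and mediators for each diskset; diskset here is a placeholder:


phys-hahost1# metadb -s diskset
phys-hahost1# medstat -s diskset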

When a failure of this type does occur, one of the following sets of error messages will be logged:


ERROR: metaset -s <diskset> -f -t exited with code 66
ERROR: Stale database for diskset <diskset>
NOTICE: Diskset <diskset> released

ERROR: metaset -s <diskset> -f -t exited with code 2
ERROR: Tagged data encountered for diskset <diskset>
NOTICE: Diskset <diskset> released

ERROR: metaset -s <diskset> -f -t exited with code 3
ERROR: Only 50% replicas and 50% mediator hosts available for diskset <diskset>
NOTICE: Diskset <diskset> released

Eventually, the following set of messages will also be issued:


ERROR: Could not take ownership of logical host(s) <lhost>, so switching into maintenance mode
ERROR: Once in maintenance mode, a logical host stays in maintenance mode until the admin intervenes manually
ERROR: The admin must investigate/repair the problem and if appropriate use haswitch command to move the logical host(s) out of maintenance mode

Note that for a dual failure of this nature, high availability goals are sacrificed in favor of attempting to preserve data integrity. Your data might be unavailable for some time. In addition, it is not possible to guarantee complete data recovery or integrity.

Contact your service provider immediately. Only an authorized service representative should attempt manual recovery from this type of dual failure. A carefully planned and well-coordinated effort is essential to data recovery. Do nothing until your service representative arrives at the site.

Your service provider will inspect the log messages, evaluate the problem, and, possibly, repair any damaged hardware. Your service provider might then be able to regain access to the data by using some of the special metaset(1M) options described on the mediator(7) man page. However, such options should be used with extreme care to avoid recovery of the wrong data.


Caution -

Attempts to alternate access between the two strings should be avoided at all costs; such attempts will make the situation worse.


Before restoring client access to the data, exercise any available validation procedures on the entire dataset or on any data affected by recent transactions against the dataset.

Before you run the haswitch(1M) command to return any logical host from maintenance mode, make sure that you release ownership of the associated diskset.
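
For example, assuming phys-hahost1 is the node that currently owns the diskset, a sketch of this sequence follows; the diskset and logical host names are placeholders:


phys-hahost1# metaset -s diskset -r
phys-hahost1# haswitch phys-hahost1 logicalhost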

Error Log Messages Associated With Mediators

The following syslog or console messages indicate that there is a problem with mediators or mediator data. Use the procedure "How to Fix Bad Mediator Data" to address the problem.


Attention required - medstat shows bad mediator data on host %s for diskset %s

Attention required - medstat finds a fatal error in probing mediator data on host %s for diskset %s!

Attention required - medstat failed for diskset %s