Sun Cluster 2.2 System Administration Guide

Administering Mediators

Administer mediator hosts with the medstat(1M) and metaset(1M) commands. Use these commands to add or delete mediator hosts, and to check and fix mediator data. See the medstat(1M), metaset(1M), and mediator(7) man pages for details.

How to Add Mediator Hosts

Use this procedure after you have installed and configured Solstice DiskSuite.

Start the cluster software on all nodes.

On the first node:
# scadmin startcluster
On all remaining nodes:
# scadmin startnode

Determine the name of the private link for each node.

Use grep(1) to identify the private link included in the clustername.cdb file.

hahost1# grep "^cluster.node.0.hostname" \
/etc/opt/SUNWcluster/conf/clustername.cdb
cluster.node.0.hostname : hahost0
hahost1# grep "cluster.node.0.hahost0" \
/etc/opt/SUNWcluster/conf/clustername.cdb | grep 204
204.152.65.33

hahost1# grep "^cluster.node.1.hostname" \
/etc/opt/SUNWcluster/conf/clustername.cdb
cluster.node.1.hostname : hahost1
hahost1# grep "cluster.node.1.hahost1" \
/etc/opt/SUNWcluster/conf/clustername.cdb | grep 204
204.152.65.34

In this example, 204.152.65.33 is the private link address for hahost0, and 204.152.65.34 is the private link address for hahost1.
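The host name lookup in this step can be scripted. The following is a minimal sketch, assuming the "cluster.node.N.hostname : name" line format shown in the grep output above. The function name node_hostname is hypothetical, and the path to the cdb file is passed in as an argument rather than assumed.

```shell
# Print the host name recorded for cluster node number $2 in the cdb file $1.
# Assumes lines of the form "cluster.node.N.hostname : name", as in the
# grep output above. node_hostname is a hypothetical helper name.
node_hostname() {
    cdb_file=$1
    node_index=$2
    sed -n "s/^cluster\.node\.${node_index}\.hostname[ ]*:[ ]*//p" "$cdb_file"
}
```

For example, "node_hostname /etc/opt/SUNWcluster/conf/clustername.cdb 0" would print the host name of node 0, which can then be used in the second grep to find that node's private link.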

Configure mediators using the metaset(1M) command.

Add each host with connectivity to the diskset as a mediator for that diskset. Run each command on the host currently mastering the diskset. You can use the hastat(1M) command to determine the current master of the diskset. The information returned by hastat(1M) for the logical host identifies the diskset master.

hahost1# metaset -s disksetA -a -m hahost0,204.152.65.33
hahost1# metaset -s disksetA -a -m hahost1,204.152.65.34
hahost1# metaset -s disksetB -a -m hahost0,204.152.65.33
hahost1# metaset -s disksetB -a -m hahost1,204.152.65.34
hahost1# metaset -s disksetC -a -m hahost0,204.152.65.33
hahost1# metaset -s disksetC -a -m hahost1,204.152.65.34

The metaset(1M) command treats the private link as an alias.
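With several disksets and mediator hosts, the command list above grows quickly. The following is a minimal sketch that only prints the metaset(1M) commands so they can be reviewed before being run on the diskset master. The function name print_mediator_adds is hypothetical; the command form is the one used in the example above.

```shell
# Print (do not run) one "metaset -a -m" command per diskset and mediator.
# $1: space-separated diskset names
# $2: space-separated host,alias pairs, for example "hahost0,204.152.65.33"
# print_mediator_adds is a hypothetical helper name.
print_mediator_adds() {
    for diskset in $1; do
        for mediator in $2; do
            echo "metaset -s $diskset -a -m $mediator"
        done
    done
}
```

Calling print_mediator_adds "disksetA disksetB disksetC" "hahost0,204.152.65.33 hahost1,204.152.65.34" prints the six commands shown in the example.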

How to Check the Status of Mediator Data

Run the medstat(1M) command.
phys-hahost1# medstat -s diskset
See the medstat(1M) man page to interpret the output. If the output indicates that the mediator data for any of the mediator hosts for a given diskset is bad, use the procedure "How to Fix Bad Mediator Data" to fix the problem.
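The status check can be filtered so that only problem hosts are listed. The following is a minimal sketch, not a definitive parser. It ASSUMES that the output consists of one header line followed by lines of the form "host status golden" and that a healthy host is reported as "Ok"; this guide does not show the output, so verify the layout against the medstat(1M) man page. The function name bad_mediators is hypothetical.

```shell
# Read medstat-style output on stdin and print each mediator host whose
# status column is not "Ok". ASSUMPTION: one header line, then lines of
# "host status golden"; check medstat(1M) for the real layout.
# bad_mediators is a hypothetical helper name.
bad_mediators() {
    awk 'NR > 1 && NF >= 2 && $2 != "Ok" { print $1 }'
}
```

Under those assumptions, "medstat -s diskset | bad_mediators" prints nothing when every mediator host is healthy.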

How to Fix Bad Mediator Data

Note -

The medstat(1M) command checks the status of mediators. Use this procedure if medstat(1M) reports that a mediator host is bad.

Remove the bad mediator host(s) from all affected diskset(s).

Log in to the Sun Cluster node that owns the affected diskset and enter:
phys-hahost1# metaset -s diskset -d -m bad_mediator_host

Restore the mediator host and its aliases:
phys-hahost1# metaset -s diskset -a -m bad_mediator_host,physical_host_alias,...
Note -
The private links must be assigned as mediator host aliases. Specify the physical host IP address first, and then the HA private link on the metaset(1M) command line. See the mediator(7) man page for details on this use of the metaset(1M) command.
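The remove and restore steps can be paired so that the same host name is used in both. The following is a minimal sketch that only prints the two metaset(1M) commands for review; it does not run them. The function name print_mediator_fix is hypothetical, and the alias list is passed through exactly as given.

```shell
# Print (do not run) the delete and re-add commands for one bad mediator.
# $1: diskset name  $2: mediator host  $3: comma-separated aliases (optional)
# print_mediator_fix is a hypothetical helper name.
print_mediator_fix() {
    diskset=$1
    host=$2
    aliases=${3:-}
    echo "metaset -s $diskset -d -m $host"
    if [ -n "$aliases" ]; then
        echo "metaset -s $diskset -a -m $host,$aliases"
    else
        echo "metaset -s $diskset -a -m $host"
    fi
}
```

For example, print_mediator_fix disksetA hahost0 204.152.65.33 prints the delete command followed by the add command with the private link as an alias.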

Handling Failures Without Automatic Recovery

Sun Cluster cannot recover automatically from certain double-failure scenarios, including the following:

Both a node and a string have failed in a dual string configuration, but the mediator on the surviving node was not golden. This scenario is further described in "Host and String Failure".
Mediator data is bad, stale, or non-existent on one or both of the nodes and one of the strings in a dual string configuration fails. The next attempt to take ownership of the affected logical host(s) will fail.
A string fails in a dual string configuration, but the number of good replicas on the surviving string does not represent at least half of the total replica count for the failed diskset. The next attempt by DiskSuite to update these replicas will result in a system panic.
A failure with no automatic recovery has occurred, and an attempt is made to bring the affected logical host(s) out of maintenance mode before manual recovery procedures have been completed.
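The replica condition in the third scenario is a simple comparison: the good replicas on the surviving string must number at least half of the total. The following is a minimal sketch of that arithmetic only. The function name has_half_replicas is hypothetical, and the counts must be obtained separately.

```shell
# Succeed (exit 0) when good replicas are at least half of the total.
# $1: good replica count  $2: total replica count
# has_half_replicas is a hypothetical helper name.
has_half_replicas() {
    [ $(( $1 * 2 )) -ge "$2" ]
}
```

With six replicas in total, three good replicas satisfy the condition and two do not.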

Monitor the state of the disksets, replicas, and mediators regularly. The medstat(1M) command is useful for this purpose. Always repair bad mediator data, replicas, and disks immediately to avoid the risk of potentially damaging multiple-failure scenarios.

When a failure of this type does occur, one of the following sets of error messages will be logged:

ERROR: metaset -s <diskset> -f -t exited with code 66
ERROR: Stale database for diskset <diskset>
NOTICE: Diskset <diskset> released

ERROR: metaset -s <diskset> -f -t exited with code 2
ERROR: Tagged data encountered for diskset <diskset>
NOTICE: Diskset <diskset> released

ERROR: metaset -s <diskset> -f -t exited with code 3
ERROR: Only 50% replicas and 50% mediator hosts available for diskset <diskset>
NOTICE: Diskset <diskset> released
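When scanning a log for these failures, it can help to pair each exit code with the condition listed above. The following is a minimal sketch that reads log text on standard input and relies only on the message wording shown in this section. The function name classify_take_failures is hypothetical.

```shell
# Read log text on stdin; for each "exited with code N" line from a
# forced take ("metaset -s <diskset> -f -t"), print the diskset and the
# condition the guide pairs with that code. Other codes print "unknown".
# classify_take_failures is a hypothetical helper name.
classify_take_failures() {
    sed -n 's/.*metaset -s \([^ ]*\) -f -t exited with code \([0-9][0-9]*\).*/\1 \2/p' |
    while read -r diskset code; do
        case $code in
            66) reason="stale database" ;;
            2)  reason="tagged data encountered" ;;
            3)  reason="only 50% replicas and 50% mediator hosts available" ;;
            *)  reason="unknown" ;;
        esac
        echo "$diskset: code $code: $reason"
    done
}
```

The summary is only an aid for locating the affected disksets; it does not replace inspection of the full log by the service provider.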

Eventually, the following set of messages is also issued:

ERROR: Could not take ownership of logical host(s) <lhost>, so switching into maintenance mode
ERROR: Once in maintenance mode, a logical host stays in maintenance mode until the admin intervenes manually
ERROR: The admin must investigate/repair the problem and if appropriate use haswitch command to move the logical host(s) out of maintenance mode

Note that for a dual failure of this nature, high availability goals are sacrificed in favor of attempting to preserve data integrity. Your data might be unavailable for some time. In addition, it is not possible to guarantee complete data recovery or integrity.

Contact your service provider immediately. Only an authorized service representative should attempt manual recovery from this type of dual failure. A carefully planned and well-coordinated effort is essential to data recovery. Do nothing until your service representative arrives at the site.

Your service provider will inspect the log messages, evaluate the problem, and, possibly, repair any damaged hardware. Your service provider might then be able to regain access to the data by using some of the special metaset(1M) options described on the mediator(7) man page. However, such options should be used with extreme care to avoid recovery of the wrong data.

Caution -

Do not attempt to alternate access between the two strings; such attempts will make the situation worse.

Before restoring client access to the data, exercise any available validation procedures on the entire dataset or on any data affected by recent transactions against the dataset.

Before you run the haswitch(1M) command to return any logical host from maintenance mode, make sure that you release ownership of the associated diskset.

Error Log Messages Associated With Mediators

The following syslog or console messages indicate that there is a problem with mediators or mediator data. Use the procedure "How to Fix Bad Mediator Data" to address the problem.

Attention required - medstat shows bad mediator data on host %s for diskset %s

Attention required - medstat finds a fatal error in probing mediator data on host %s for diskset %s!

Attention required - medstat failed for diskset %s
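To find which host and diskset a message refers to, the first two message forms can be reduced to "host diskset" pairs. The following is a minimal sketch that reads log text on standard input and matches only the wording shown above. The function name mediator_alerts is hypothetical, and the third message form, which names no host, is not matched.

```shell
# Read log text on stdin and print "host diskset" for every
# "bad mediator data" or "fatal error in probing mediator data" message.
# mediator_alerts is a hypothetical helper name.
mediator_alerts() {
    sed -n 's/.*Attention required - medstat .* mediator data on host \([^ ]*\) for diskset \([^ !]*\).*/\1 \2/p'
}
```

Each pair printed identifies a mediator host and diskset to pass to the procedure "How to Fix Bad Mediator Data".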