Administer mediator hosts with the medstat(1M) and metaset(1M) commands. Use these commands to add or delete mediator hosts, and to check and fix mediator data. See the medstat(1M), metaset(1M), and mediator(7) man pages for details.
Use this procedure after you have installed and configured Solstice DiskSuite.
Start the cluster software on all nodes.
On the first node:
# scadmin startcluster
On all remaining nodes:
# scadmin startnode
Determine the name of the private link for each node.
Use grep(1) to identify the private link included in the clustername.cdb file.
hahost1# grep "^cluster.node.0.hostname" \
    /etc/opt/SUNWcluster/conf/clustername.cdb
cluster.node.0.hostname : hahost0
phys-hahost1# grep "cluster.node.0.hahost0" \
    /etc/opt/SUNWcluster/conf/clustername.cdb | grep 204
204.152.65.33
hahost1# grep "^cluster.node.1.hostname" \
    /etc/opt/SUNWcluster/conf/clustername.cdb
cluster.node.1.hostname : hahost1
hahost1# grep "cluster.node.1.hahost1" \
    /etc/opt/SUNWcluster/conf/clustername.cdb | grep 204
204.152.65.34
In this example, 204.152.65.33 is the private link for hahost0 and 204.152.65.34 is the private link for hahost1.
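The hostname lookup in the example above can be wrapped in a small shell function. This is only a minimal sketch, assuming the `cluster.node.N.hostname : name` line format shown in the example output. The function name `node_hostname` and its arguments are illustrative and are not part of Sun Cluster.

```shell
# Hypothetical helper (not a Sun Cluster command): print the hostname
# recorded for node N in a clustername.cdb-style file.
# Assumes lines of the form:  cluster.node.N.hostname : name
#   $1 = node number, $2 = path to the cdb file
node_hostname() {
    node=$1
    cdb=$2
    grep "^cluster\.node\.$node\.hostname" "$cdb" | awk -F' : ' '{ print $2 }'
}

# Example call (path taken from the procedure above):
#   node_hostname 0 /etc/opt/SUNWcluster/conf/clustername.cdb
```

Passing the file as an argument keeps the helper usable against a copy of the configuration file, so it can be tried without touching a live cluster node.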
Configure mediators using the metaset(1M) command.
Add each host with connectivity to the diskset as a mediator for that diskset. Run each command on the host currently mastering the diskset. You can use the hastat(1M) command to determine the current master of the diskset. The information returned by hastat(1M) for the logical host identifies the diskset master.
hahost1# metaset -s disksetA -a -m hahost0,204.152.65.33
hahost1# metaset -s disksetA -a -m hahost1,204.152.65.34
hahost1# metaset -s disksetB -a -m hahost0,204.152.65.33
hahost1# metaset -s disksetB -a -m hahost1,204.152.65.34
hahost1# metaset -s disksetC -a -m hahost0,204.152.65.33
hahost1# metaset -s disksetC -a -m hahost1,204.152.65.34
The metaset(1M) command treats the private link as an alias.
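The metaset commands in this step follow one pattern: every mediator host, paired with its private link, is added to every diskset. A minimal dry-run sketch of that pattern follows. The function name `print_mediator_cmds` and its argument layout are hypothetical, and the loop only prints the commands so they can be reviewed before being run on the host that currently masters each diskset.

```shell
# Dry-run sketch: print (do not execute) one "metaset -a -m" command
# per diskset/mediator combination. Names and layout are illustrative.
#   $1 = space-separated diskset names
#   $2 = space-separated "host,private_link" pairs
print_mediator_cmds() {
    for ds in $1; do
        for med in $2; do
            echo "metaset -s $ds -a -m $med"
        done
    done
}

# Values taken from the example in this procedure.
print_mediator_cmds "disksetA disksetB disksetC" \
    "hahost0,204.152.65.33 hahost1,204.152.65.34"
```

Printing rather than executing is a deliberate choice here, because each command must run on the current master of the diskset, which the sketch does not determine.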
Check the status of the mediator data with the medstat(1M) command.
phys-hahost1# medstat -s diskset
See the medstat(1M) man page to interpret the output. If the output indicates that the mediator data for any one of the mediator hosts for a given diskset is bad, refer to the following procedure to fix the problem.
The medstat(1M) command checks the status of mediators. Use this procedure if medstat(1M) reports that a mediator host is bad.
Remove the bad mediator host(s) from all affected diskset(s).
Log in to the Sun Cluster node that owns the affected diskset and enter:
phys-hahost1# metaset -s diskset -d -m bad_mediator_host
Restore the mediator host and its aliases:
phys-hahost1# metaset -s diskset -a -m bad_mediator_host,physical_host_alias,...
The private links must be assigned as mediator host aliases. Specify the physical host IP address first, and then the HA private link on the metaset(1M) command line. See the mediator(7) man page for details on this use of the metaset(1M) command.
Sun Cluster cannot recover automatically from certain double-failure scenarios, including the following:
Both a node and a string have failed in a dual string configuration, but the mediator on the surviving node was not golden. This scenario is further described in "Host and String Failure".
Mediator data is bad, stale, or non-existent on one or both of the nodes and one of the strings in a dual string configuration fails. The next attempt to take ownership of the affected logical host(s) will fail.
A string fails in a dual string configuration, but the number of good replicas on the surviving string does not represent at least half of the total replica count for the failed diskset. The next attempt by DiskSuite to update these replicas will result in a system panic.
A failure with no automatic recovery has occurred, and an attempt is made to bring the affected logical host(s) out of maintenance mode before manual recovery procedures have been completed.
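The replica condition in the third scenario is simple arithmetic: the good replicas on the surviving string must make up at least half of the total replica count for the diskset. A minimal sketch of that comparison follows. The function name `has_half_replicas` is hypothetical, and the counts are supplied by the caller rather than read from DiskSuite.

```shell
# Hypothetical check (not a DiskSuite command): succeed when the good
# replicas represent at least half of the total replica count.
#   $1 = number of good replicas on the surviving string
#   $2 = total number of replicas for the diskset
has_half_replicas() {
    good=$1
    total=$2
    [ $((good * 2)) -ge "$total" ]
}

# Example: 3 good replicas out of 6 meets the threshold; 2 out of 6 does not.
if has_half_replicas 3 6; then
    echo "3 of 6: at least half"
fi
if ! has_half_replicas 2 6; then
    echo "2 of 6: below half"
fi
```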
Monitor the state of the disksets, replicas, and mediators regularly; the medstat(1M) command is useful for this purpose. Always repair bad mediator data, replicas, and disks immediately to avoid potentially damaging multiple-failure scenarios.
When a failure of this type does occur, one of the following sets of error messages will be logged:
ERROR: metaset -s <diskset> -f -t exited with code 66
ERROR: Stale database for diskset <diskset>
NOTICE: Diskset <diskset> released

ERROR: metaset -s <diskset> -f -t exited with code 2
ERROR: Tagged data encountered for diskset <diskset>
NOTICE: Diskset <diskset> released

ERROR: metaset -s <diskset> -f -t exited with code 3
ERROR: Only 50% replicas and 50% mediator hosts available for diskset <diskset>
NOTICE: Diskset <diskset> released
Eventually, the following set of messages will also be issued:
ERROR: Could not take ownership of logical host(s) <lhost>, so switching into maintenance mode
ERROR: Once in maintenance mode, a logical host stays in maintenance mode until the admin intervenes manually
ERROR: The admin must investigate/repair the problem and if appropriate use haswitch command to move the logical host(s) out of maintenance mode
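The first group of messages pairs each exit code of `metaset -s <diskset> -f -t` with a condition. The sketch below records that pairing as a lookup, which may help when scanning logs. The function name `describe_take_failure` is hypothetical, and only the three codes that appear in the messages are covered.

```shell
# Hypothetical lookup: map an exit code reported for
# "metaset -s <diskset> -f -t" to the condition named in the logged
# error messages. Codes other than 66, 2, and 3 are not interpreted.
describe_take_failure() {
    case $1 in
        66) echo "stale database" ;;
        2)  echo "tagged data encountered" ;;
        3)  echo "only 50% replicas and 50% mediator hosts available" ;;
        *)  echo "unrecognized exit code $1" ;;
    esac
}

describe_take_failure 66
```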
Note that for a dual failure of this nature, high availability goals are sacrificed in favor of attempting to preserve data integrity. Your data might be unavailable for some time. In addition, it is not possible to guarantee complete data recovery or integrity.
Contact your service provider immediately. Only an authorized service representative should attempt manual recovery from this type of dual failure. A carefully planned and well-coordinated effort is essential to data recovery. Do nothing until your service representative arrives at the site.
Your service provider will inspect the log messages, evaluate the problem, and, possibly, repair any damaged hardware. Your service provider might then be able to regain access to the data by using some of the special metaset(1M) options described on the mediator(7) man page. However, such options should be used with extreme care to avoid recovery of the wrong data.
Do not attempt to alternate access between the two strings under any circumstances; doing so will make the situation worse.
Before restoring client access to the data, exercise any available validation procedures on the entire dataset or on any data affected by recent transactions against the dataset.
Before you run the haswitch(1M) command to return any logical host from maintenance mode, make sure that you release ownership of the associated diskset.
The following syslog or console messages indicate that there is a problem with mediators or mediator data. Use the procedure "How to Fix Bad Mediator Data" to address the problem.
Attention required - medstat shows bad mediator data on host %s for diskset %s
Attention required - medstat finds a fatal error in probing mediator data on host %s for diskset %s!
Attention required - medstat failed for diskset %s
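All three messages share the prefix "Attention required - medstat", so a simple filter can pull them out of a log. The following is a minimal sketch. The function name `mediator_alerts` is hypothetical, and it reads standard input so that no particular log file location is assumed.

```shell
# Hypothetical filter: read log lines on standard input and print only
# the mediator warnings that begin with the shared medstat prefix.
mediator_alerts() {
    grep "Attention required - medstat"
}

# Example with sample lines; on a real system, redirect the syslog
# file into the function instead (its path is site-specific).
printf '%s\n' \
    "unrelated log line" \
    "Attention required - medstat failed for diskset disksetA" |
    mediator_alerts
```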