Solaris Cluster is designed to achieve automatic system recovery under severe failure scenarios by transferring operational control from one server node to the other. But most failures in a Solaris system can be recovered without a full system switchover.
Failures involving network communication are handled quickly and quietly by Solaris IPMP.
System disk failures are handled silently and automatically by Solaris ZFS.
Failures of any single disk drive in the attached storage array are recovered automatically by the storage array firmware. And where the storage array lacks the ability to recover from a disk failure, Solaris ZFS provides uninterrupted disk I/O through the alternate drive in the mirrored configuration.
If an HBA port to the shared array should fail, Solaris automatically switches to an alternate port. Similarly, if a controller module on the shared array should fail or an interconnecting cable is disconnected, Solaris instantly reverts to the alternate path that connects to the disk resource.
Failure in a library communication path is recovered automatically by dual TCP/IP logic in ACSLS. And operations from a failed library controller card are recovered automatically by ACSLS HA logic associated with library Redundant Electronics (RE).
If any of the multiple running processes in ACSLS should fail, the ACSLS daemon instantly restarts the failed process.
Should the ACSLS daemon itself fail, or should any of the remaining ACSLS services stop running, the Solaris Service Management Facility (SMF) is there to instantly restart the failed service.
All of these scenarios are handled quickly and automatically without the involvement of Solaris Cluster. But if any other severe fault should impact ACSLS operation on the active server node, ACSLS HA instructs Solaris Cluster to switch control over to the alternate node.
Once it is started, ACSLS HA probes the system once every minute, watching for any of the following events to occur:
Loss of communication to an attached library.
Loss of network contact to the ACSLS logical host.
Loss of contact to the RPC listener port for client calls.
Loss of access to the ACSLS file system.
Unrecoverable maintenance state of the acsls SMF service.
Any of these events triggers a Cluster failover. Solaris Cluster also knows to fail over if any fatal system condition occurs on the active server node.
To activate Cluster failover control:
# cd /opt/ACSLSHA/util
# ./acsAgt configure
The utility prompts you for the logical host name. Ensure that the logical host is defined in the /etc/hosts file and that the corresponding IP address maps to the IPMP group defined in the chapter, "Configuring the Solaris System for ACSLS HA". Before running acsAgt configure, use zpool list to confirm that acslspool is mounted to the current server node.
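These two prerequisite checks can be gathered into a small pre-flight script. This is only a sketch, not part of the ACSLS HA distribution; the function name is ours, and it assumes getent resolves the logical host through /etc/hosts.

```shell
#!/bin/sh
# Hypothetical pre-flight check to run before ./acsAgt configure.
check_acsagt_prereqs() {
    lhost=$1
    # The logical host name must resolve (normally via /etc/hosts).
    if ! getent hosts "$lhost" >/dev/null 2>&1; then
        echo "FAIL: logical host '$lhost' does not resolve" >&2
        return 1
    fi
    # acslspool must be imported on the node running acsAgt configure.
    if ! zpool list acslspool >/dev/null 2>&1; then
        echo "FAIL: acslspool is not mounted on this node" >&2
        return 1
    fi
    echo "OK: prerequisites satisfied for acsAgt configure"
}
```

Invoke it as check_acsagt_prereqs with your own logical host name before running ./acsAgt configure.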
This action initiates Cluster control of ACSLS. Solaris Cluster monitors the system, probing once each minute to verify the health of ACSLS specifically and the Solaris system in general. Any condition that is deemed fatal initiates an action on the alternate node.
To check cluster status of the ACSLS resource group:
# clrg status
The display:
Reveals the status of each node.
Identifies which node is the active node.
Reveals whether failover action is suspended.
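If you script around clrg status, the active node can be picked out with a short parser. The column layout assumed below (node name, then Suspended, then Status on each node line) is typical clrg status output, but verify it against your own display before relying on it.

```shell
#!/bin/sh
# Hypothetical helper that reports the active (Online) node of acsls-rg.
# Assumes the usual "clrg status" columns: the Status field is last and
# the node name sits two fields before it.
acsls_active_node() {
    clrg status acsls-rg | awk '
        $NF == "Online" { print $(NF - 2); exit }
    '
}
```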
It is advisable to set a policy in the acsls-storage resource to reboot the active node whenever communication is lost between that node and the shared RAID disk device. This causes the active node to relinquish control when it cannot connect to the disk, allowing Solaris Cluster to pass control to the alternate node. Setting the Failover_mode from SOFT to HARD ensures a reboot of the active node whenever communication with the shared storage device has been lost.
To view the existing Failover_mode, run the following command:
# clrs show -v acsls-storage | grep Failover
The Failover_mode should be set to HARD as follows:
# clrs set -p Failover_mode=HARD acsls-storage
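A small wrapper can make this check-and-set idempotent. This is a hedged sketch built only from the two clrs commands shown above; the function name is ours.

```shell
#!/bin/sh
# Hypothetical helper: ensure Failover_mode on acsls-storage is HARD.
ensure_hard_failover_mode() {
    mode=$(clrs show -v acsls-storage | awk '/Failover_mode/ { print $NF }')
    if [ "$mode" = "HARD" ]; then
        echo "Failover_mode already HARD"
    else
        echo "Failover_mode is '$mode'; setting to HARD"
        clrs set -p Failover_mode=HARD acsls-storage
    fi
}
```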
Once cluster control has been activated, ACSLS can be operated in normal fashion. Start and stop ACSLS using the standard acsss control utility. Under cluster control, you start and stop ACSLS services the same way as on a standalone ACSLS server. Operation is administered with these standard acsss commands:
acsss enable
acsss disable
acsss db
Manually starting or stopping acsss services with these commands in no way causes Solaris Cluster to intervene with failover action. Nor will the use of Solaris SMF commands (such as svcadm) cause Cluster to intervene. Whenever acsss services are aborted or interrupted, it is SMF, not Cluster, that is primarily responsible for restarting these services.
Solaris Cluster only intervenes to restore control on the adjacent node under the following circumstances:
Lost communication with the ACSLS file system.
Lost communication with all redundant public Ethernet ports.
Lost and unrecoverable communication with a specified library.
If you suspect that maintenance activity might trigger an unwanted cluster failover event, suspend cluster control of the acsls resource group.
To suspend Cluster control:
# clrg suspend acsls-rg
While the resource group is suspended, Solaris Cluster makes no attempt to switch control to the adjacent node, no matter what conditions might otherwise trigger such action.
This suspension enables you to make more invasive repairs to the system, even while library production may be in full operation.
If the active node happens to reboot while in suspended mode, it does not mount the acslspool after the reboot, and ACSLS operation is halted. To clear this condition, resume Cluster control.
To resume Cluster control:
# clrg resume acsls-rg
If the shared disk resource is mounted to the current node, then normal operation resumes. But if Solaris Cluster discovers upon activation that the zpool is not mounted, it immediately switches control to the adjacent node. If the adjacent node is not accessible, control switches back to the current node. Cluster attempts to mount the acslspool and start ACSLS services on this node.
The following procedure provides a safe power-down sequence when it is necessary to power down the ACSLS HA system.
Determine the active node in the cluster.
# clrg status
Look for the online node.
Log in as root to the active node and halt Solaris Cluster control of the ACSLS resource group.
# clrg suspend acsls-rg
Switch to user acsss and shut down the acsss services:
# su - acsss
$ acsss shutdown
Log out as acsss and gracefully power down the node.
$ exit
# init 5
Log in to the alternate node and power it down with init 5.
Power down the shared disk array using the physical power switch.
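The first steps of this sequence can be collected into one script to run as root on the active node. This is a sketch only; powering down the alternate node and the shared disk array remain manual steps.

```shell
#!/bin/sh
# Hypothetical wrapper for the controlled power-down of the active node.
# Run as root on the active node; the alternate node and the disk array
# must still be powered down separately.
power_down_active_node() {
    clrg suspend acsls-rg || return 1           # stop failover monitoring
    su - acsss -c "acsss shutdown" || return 1  # stop the acsss services
    echo "acsss services stopped; powering down this node"
    init 5                                      # graceful power-off
}
```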
To restore ACSLS operation on the node that was active before a controlled shutdown:
Power on both nodes locally using the physical power switch or remotely using the Sun Integrated Lights Out Manager.
Power on the shared disk array.
Log in to either node as root.
If you attempt to log in as acsss or to list the $ACS_HOME directory, you find that the shared disk resource is not mounted to either node. To resume cluster monitoring, run the following command:
# clrg resume acsls-rg
With this action, Solaris Cluster mounts the shared disk to the node that was active when the system was brought down. This action should also automatically restart the acsss services and resume normal operations.
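This restore step can be sketched as a resume followed by a status check. The helper below is our own, not part of ACSLS HA, and the 60-second wait is an assumption; tune it to how long your site takes to mount acslspool and restart the acsss services.

```shell
#!/bin/sh
# Hypothetical post-power-up check: resume Cluster control, then confirm
# that the acsss services came back.
resume_and_verify() {
    clrg resume acsls-rg || return 1
    # Allow time for Cluster to mount acslspool and start ACSLS (assumed).
    sleep 60
    su - acsss -c "acsss status"
}
```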
There may be occasions where ACSLS must continue operation from a standalone server environment on one node while the other node is being serviced. This applies to hardware maintenance, an operating system upgrade, or an upgrade to Solaris Cluster.
Use the following procedures to create a standalone ACSLS server.
Reboot the desired node in non-cluster mode.
# reboot -- -x
To boot into non-cluster mode from the Open Boot Prom (OBP) on SPARC servers:
ok: boot -x
On X86 Servers, it is necessary to edit the GRUB boot menu.
Power on the system.
When the GRUB boot menu appears, press e (edit).
From the submenu, using the arrow keys, select kernel /platform/i86pc/multiboot. When this is selected, press e.
In the edit mode, add -x to the multiboot option (kernel /platform/i86pc/multiboot -x) and press Return.
With the multiboot -x option selected, press b to boot with that option.
Once the boot cycle is complete, log in as root and import the ACSLS zpool.
# zpool import acslspool
Use the -f (force) option if necessary when the disk resource remains tied to another node.
# zpool import -f acslspool
Bring up the acsss services:
# su - acsss
$ acsss enable
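The standalone bring-up can be sketched as a single helper. The automatic fallback to -f is our choice, mirroring the manual procedure above; before forcing, be certain the other node has actually released the pool, since forcing an import of a pool still active elsewhere risks data corruption.

```shell
#!/bin/sh
# Hypothetical bring-up helper for a node booted in non-cluster mode (-x).
standalone_acsls_start() {
    if ! zpool import acslspool 2>/dev/null; then
        # Plain import failed; fall back to a forced import. Only safe if
        # the peer node has released the pool.
        echo "plain import failed; forcing import with -f" >&2
        zpool import -f acslspool || return 1
    fi
    su - acsss -c "acsss enable"
}
```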