Solaris Cluster is designed to achieve automatic system recovery under severe failure scenarios by transferring operational control from one server node to the other. But most failures in a Solaris system do not require a full system switch-over to recover.
Failures involving network communication are handled quickly and quietly by Solaris IPMP.
System disk failures are handled silently and automatically by Solaris ZFS.
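As a quick health check of the system disks, you can inspect the mirrored ZFS root pool on either node (the pool name rpool below is an assumption; substitute the root pool name configured on your servers):

# zpool status rpool

A healthy mirror shows both disks ONLINE; after a drive failure the pool reports DEGRADED while ZFS continues serving I/O from the surviving disk.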
Failures of any single disk drive in the attached storage array are recovered automatically by the storage array firmware. And where the storage array cannot recover from a disk failure, Solaris ZFS provides uninterrupted disk I/O from the alternate drive in the mirrored configuration.
If an HBA port to the shared array should fail, Solaris automatically switches to an alternate port. Similarly, if a controller module on the shared array should fail or an interconnecting cable is disconnected, Solaris instantly reverts to the alternate path that connects to the disk resource.
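To confirm that both paths to the shared disk array are available, you can use the Solaris multipathing administration utility (shown here only as an illustrative check):

# mpathadm list lu

Each shared logical unit should report two total paths with both operational; after a port, controller, or cable failure, the operational path count drops while I/O continues on the surviving path.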
Failure in a library communication path is recovered automatically by dual TCP/IP logic in ACSLS. And operations from a failed library controller card are recovered automatically by ACSLS HA logic associated with library Redundant Electronics (RE).
If any of the multiple running processes in ACSLS should fail, the ACSLS daemon immediately restarts the failed process.
Should the ACSLS daemon itself fail, or should any of the remaining ACSLS services stop running, the Solaris Service Management Facility (SMF) immediately restarts the failed service.
All of these scenarios are handled quickly and automatically without the involvement of Solaris Cluster. But if any other severe fault should impact ACSLS operation on the active server node, ACSLS HA instructs Solaris Cluster to switch control over to the alternate node.
Once it is started, ACSLS HA probes the system once every minute, watching for any of the following events to occur:
Loss of communication to an attached library
Loss of network contact to the ACSLS logical host
Loss of contact to the RPC listener port for client calls
Loss of access to the ACSLS file system
Unrecoverable maintenance state of the acsls SMF service
Any of these events triggers a Cluster failover. Solaris Cluster also fails over if any fatal system condition occurs on the active server node.
To activate Cluster failover control:
# cd /opt/ACSLSHA/util
# ./start_acslsha.sh -h <logical hostname> -g <IPMP group> -z acslspool
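For example, on a cluster whose logical hostname is acsls-logical and whose IPMP group is named ipmp0 (both names are illustrative; substitute the values configured at your site):

# cd /opt/ACSLSHA/util
# ./start_acslsha.sh -h acsls-logical -g ipmp0 -z acslspool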
This action initiates Cluster control of ACSLS. Solaris Cluster monitors the system, probing once each minute to verify the health of ACSLS specifically and the Solaris system in general. Any condition that is deemed fatal initiates a failover to the alternate node.
To check cluster status of the ACSLS resource group:
# clrg status
The display will:
Reveal the status of each node.
Identify which node is the active node.
Reveal whether failover action is suspended.
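The output resembles the following (node names and states are illustrative):

=== Cluster Resource Groups ===

Group Name      Node Name      Suspended      Status
----------      ---------      ---------      ------
acsls-rg        node1          No             Online
                node2          No             Offline

In this example, node1 is the active node, node2 is the standby, and the Suspended column shows whether failover action has been suspended.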
Once cluster control has been activated, you can operate ACSLS in the normal fashion. Under cluster control, you start and stop the acsss services with the standard acsss control utility, in the same way as you would on a stand-alone ACSLS server. Operation is administered with these standard acsss commands:
acsss enable
acsss disable
acsss db
Manually starting or stopping acsss services with these commands in no way causes Solaris Cluster to intervene with failover action, nor does the use of Solaris SMF commands (such as svcadm). Whenever acsss services are aborted or interrupted, it is SMF, not Cluster, that is primarily responsible for restarting these services.
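As an illustration of this division of responsibility, you can query the SMF view of the service directly (acsls is the SMF service referenced above):

# svcs -x acsls

If the service has dropped into the maintenance state, svcadm clear acsls returns it to SMF control so that SMF can restart it; only an unrecoverable maintenance state triggers a Cluster failover, as noted above.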
Solaris Cluster intervenes to restore operation on the adjacent node only under the following circumstances:
Lost communication with the ACSLS filesystem
Lost communication with all redundant public Ethernet ports
Lost and unrecoverable communication with a specified library
If you suspect that your maintenance activity might trigger an unwanted cluster failover event, you can suspend cluster control of the acsls resource group.
To suspend Cluster control:
# clrg suspend acsls-rg
While the resource group is suspended, Solaris Cluster makes no attempt to switch control to the adjacent node, no matter what conditions might otherwise trigger such action.
This enables you to make more invasive repairs to the system, even while library production is in full operation.
If the active node happens to reboot while in suspended mode, it will not mount the acslspool after the reboot, and ACSLS operation will be halted. To clear this condition, you should resume Cluster control.
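You can confirm this condition from the rebooted node before resuming (an optional check):

# zpool list acslspool

If the pool is not mounted to this node, the command reports that the pool cannot be opened.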
To resume Cluster control:
# clrg resume acsls-rg
If the shared disk resource is mounted to the current node, then normal operation resumes. But if Solaris Cluster discovers upon activation that the zpool is not mounted, it immediately switches control to the adjacent node. If the adjacent node is not accessible, then control switches back to the current node and Cluster attempts to mount the acslspool and start ACSLS services on this node.
The following procedure provides a safe power-down sequence when it is necessary to power down the ACSLS HA system; a consolidated command summary follows the steps.
Determine the active node in the cluster.
# clrg status
Look for the online node.
Log in as root to the active node and halt Solaris Cluster control of the ACSLS resource group.
# clrg suspend acsls-rg
Switch to user acsss and shut down the acsss services:
# su - acsss
$ acsss shutdown
Log out as acsss and gracefully power down the node.
$ exit
# init 5
Log in to the alternate node and power it down with init 5.
Power down the shared disk array using the physical power switch.
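Taken together, the commands issued on the active node in this sequence are summarized below; then run init 5 on the alternate node and power off the shared disk array as described above.

# clrg suspend acsls-rg
# su - acsss
$ acsss shutdown
$ exit
# init 5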
To restore ACSLS operation on the node that was active before a controlled shutdown, use the following procedure:
Power on both nodes locally using the physical power switch or remotely using the Sun Integrated Lights Out Manager.
Power on the shared disk array.
Log in to either node as root.
If you attempt to log in as acsss or to list the $ACS_HOME directory, you find that the shared disk resource is not mounted to either node. To resume cluster monitoring, run the following command:
# clrg resume acsls-rg
With this action, Solaris Cluster mounts the shared disk to the node that was active when you brought the system down. This action should also automatically restart the acsss services, and normal operation should resume.
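You can confirm the recovery from either node (an optional check):

# clrg status

The acsls-rg resource group should report Online on the node that was active before the shutdown, and the acsss services should return to their running state shortly afterward.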
There may be occasions where ACSLS must continue operation from a standalone server environment on one node while the other node is being serviced. This would apply in situations of hardware maintenance, an operating system upgrade, or an upgrade to Solaris Cluster.
Use the following procedures to create a standalone ACSLS server.
Reboot the desired node in non-cluster mode.
# reboot -- -x
To boot into non-cluster mode from the Open Boot Prom (OBP) on SPARC servers:
ok: boot -x
On x86 servers, it is necessary to edit the GRUB boot menu.
Power on the system.
When the GRUB boot menu appears, press e (edit).
From the submenu, use the arrow keys to select kernel /platform/i86pc/multiboot, then press e.
In edit mode, append -x to the multiboot option so the line reads kernel /platform/i86pc/multiboot -x, and press Return.
With the multiboot -x option selected, press b to boot with that option.
Once the boot cycle is complete, log in as root and import the ACSLS zpool.
# zpool import acslspool
Use the -f (force) option if necessary when the disk resource remains tied to another node.
# zpool import -f acslspool
Bring up the acsss services.
# su - acsss
$ acsss enable
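Once the services are up, you can verify standalone operation (an optional check; the acsss utility also provides a status query):

$ acsss status
$ exit
# zpool status acslspool

These should show the acsss services online and the acslspool imported on this node.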