8 Fine Tuning ACSLS HA

This chapter explains how to set up an optimal failover policy in a library complex, how to adjust the default pingpong interval to avoid unwanted fail-back events, and how to register for email notification of failover events.

Defining a Failover Policy for Library Communications

The ACSLS HA agent constantly monitors communication between ACSLS and the attached libraries. Such communication is critical for continuous ACSLS operation. The action to take, if any, when library communication fails depends on a policy set by the local ACSLS HA administrator.

A policy table, $ACS_HOME/acslsha/ha_acs_list.txt, allows the local administrator to define the desired failover action for any ACS that requires HA recovery. When library communication fails, and depending on the administrator's directive, the ACSLS HA agent fails over to the alternate node, provided that successful ACS communication has been confirmed on that node.

In multiple-ACS environments, it may be desirable for the ACSLS HA system to fail over when communication with any single ACS has failed. But since any failover action disrupts production on all attached libraries, the administrator may prefer to limit general failover action to the more critical ACS (or ACSs) in the data center. Create a policy record in ha_acs_list.txt for each ACS for which cluster failover action is required when library communication is lost. Each record has two fields:

ACS Number   Fail-over Action (true or false)

The first field is the ACS ID and the second field is the Boolean value of true or false. The logic of the policy settings is as follows:

When the second field is false, the ACSLS HA agent does not initiate cluster failover action to the alternate node, even though communication to the ACS has failed and cannot be restored.
When the second field is true, the ACSLS HA agent asserts cluster failover action after every attempt to reestablish communication from the primary node has failed. The system fails over only if library contact has been confirmed on the alternate node.

The default action is false for any ACS that is not listed in this file.
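
The policy logic described above can be illustrated with a short shell sketch. It is not the ACSLS HA agent's own code. It assumes that each record holds the ACS number and the action separated by white space, and it works on a temporary sample file rather than the real ha_acs_list.txt. The function name acs_failover_policy is hypothetical.

```shell
#!/bin/sh
# Illustrative only: report the failover policy for one ACS from a
# policy file with one "<ACS number> <true|false>" record per line.
# White-space-separated fields are an assumption of this sketch.

# acs_failover_policy FILE ACS_ID
# Prints "true" or "false". An ACS with no record gets the default, false.
acs_failover_policy() {
    awk -v acs="$2" '
        $1 == acs { action = $2 }
        END { if (action == "true") print "true"; else print "false" }
    ' "$1"
}

# Sample policy: fail over for ACS 0, not for ACS 1. ACS 2 is not listed.
policy_file=$(mktemp)
printf '0 true\n1 false\n' > "$policy_file"

for acs in 0 1 2; do
    echo "ACS $acs: failover=$(acs_failover_policy "$policy_file" "$acs")"
done
rm -f "$policy_file"
```

With this sample, only ACS 0 is reported as true. ACS 1 is false by explicit record, and ACS 2 is false because it is not listed.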

Libraries with Redundant Electronics (RE)

For libraries with Redundant Electronics (RE), the ACSLS HA agent attempts to switch communication to the alternate RE path before resorting to cluster failover action. This RE switch action applies only to a single SL8500, an SL3000, or an older 9310 with dual LMUs. Automatic RE switching is not attempted on any partitioned library.

Setting the Failover Pingpong_interval

The Solaris Cluster Pingpong_interval is a timeout property that prevents repeated failover action if full recovery cannot be restored after the first cluster failover event.

This is a user-modifiable property of the ACSLS resource group. The default value is 20 minutes. With this setting, the first failover event occurs immediately when failover action is requested by the ACSLS HA agent. But if a condition that would trigger failover action persists on the new cluster node, then subsequent failover action is delayed until the defined pingpong interval has expired. This prevents needless thrashing of control between one cluster node and the other until the root problem has been resolved.

To change the default setting of this property, modify the default number defined in the file, $ACS_HOME/acslsha/pingpong_interval. That number is expressed in seconds.

The default setting of 1200 seconds is a reasonable setting for most medium to large library configurations. An optimal timeout value for this property depends upon the actual number of LSMs and tape drives that exist in the library configuration. Larger library configurations take longer to recover after a failover event and so this number should be set to a longer interval for systems configured with more than ten LSMs or forty drives, or both.

A setting of 1800 (30 minutes) is recommended for a forty-LSM configuration, while a setting of 900 (15 minutes) is recommended for smaller libraries configured with one to four LSMs.
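
The sizing guidance above can be expressed as a small shell sketch. The function name suggest_pingpong_interval is hypothetical, and applying 1800 seconds to every configuration with more than ten LSMs or more than forty drives is an assumption made here for illustration. The guidance itself names 1800 only for a forty-LSM configuration.

```shell
#!/bin/sh
# suggest_pingpong_interval LSMS DRIVES
# Prints a suggested Pingpong_interval in seconds. Thresholds follow the
# guidance in this section. Using 1800 for all large configurations is
# an assumption of this sketch, not a documented rule.
suggest_pingpong_interval() {
    lsms=$1
    drives=$2
    if [ "$lsms" -gt 10 ] || [ "$drives" -gt 40 ]; then
        echo 1800    # 30 minutes: large configurations
    elif [ "$lsms" -le 4 ]; then
        echo 900     # 15 minutes: one to four LSMs
    else
        echo 1200    # 20 minutes: the default
    fi
}

suggest_pingpong_interval 2 8      # small library
suggest_pingpong_interval 8 32     # mid-size library
suggest_pingpong_interval 40 160   # forty-LSM complex
```

The three sample calls print 900, 1200, and 1800 respectively. Treat the result as a starting point and adjust it to the recovery time actually observed at your site.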

Changes you make here do not take effect until you reconfigure ACSLS HA with the command, acsAgt configure.

# cd /opt/ACSLSHA/util
# ./acsAgt configure

This command may be run even if the acsls-rg resource group is already active. It registers the new default setting without impacting normal HA operation.

The pingpong_interval setting can be dynamically changed for testing purposes using acsAgt pingpong. The value set with this command remains in effect until you restart the resource group with acsAgt configure.

Registering for Email Notification of System Events

Users with administrative duties may register for automatic email notification of system events, including system boot events and ACSLS HA cluster failover events.

To register for such events, users must add their email address to the respective files under the directory:

$ACS_HOME/data/external/email_notification/
   boot_notification
   ha_failover_notification

Place the email address of each intended recipient on a single line under the header remarks. Thereafter, every time the system boots or the HA cluster fails over to the standby node, each registered user is notified by email.
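
Registration can be done with any text editor. As an alternative, the following shell sketch appends an address on its own line and skips the append if that exact line is already present. The function name register_email and the address admin@example.com are hypothetical, and the sketch assumes a grep that supports the POSIX -q, -x, and -F options.

```shell
#!/bin/sh
# register_email FILE ADDRESS
# Appends ADDRESS on its own line at the end of FILE, unless an
# identical line already exists. Existing header remarks are untouched.
register_email() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Example (hypothetical address; run as a user who may write to the files):
# dir=$ACS_HOME/data/external/email_notification
# register_email "$dir/boot_notification" admin@example.com
# register_email "$dir/ha_failover_notification" admin@example.com
```

Because the function checks for an existing line first, running it twice with the same address leaves a single entry in the file.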

This capability assumes that the sendmail service has been enabled on the ACSLS server, and that network firewall constraints allow for email communication from the data center.
