8 Fine Tuning ACSLS HA

This chapter explains how to set up an optimal failover policy for library communication failures, how to adjust the Pingpong_interval to avoid unwanted fail-back events, and how to register for email notification of failover events.

Defining a Failover Policy for Library Communications

The ACSLS HA agent constantly monitors communication between ACSLS and the attached libraries. Such communication is critical for continuous ACSLS operation. However, the action to take, if any, when library communication fails depends on a policy determined by the local ACSLS HA administrator.

A policy table, $ACS_HOME/acslsha/ha_acs_list.txt, allows the local administrator to define the desired failover action for any ACS that requires HA recovery. When library communication fails, and depending on the administrator's directive, the ACSLS HA agent fails over to the alternate node, provided that successful ACS communication has been confirmed on that node.

In multiple-ACS environments, it may be desirable for the ACSLS HA system to fail over when communication with any single ACS has failed. However, because any failover action disrupts production on all attached libraries, the administrator may prefer to limit general failover action to the more critical ACS (or ACSs) in the data center. A policy record is created in ha_acs_list.txt for each ACS for which cluster failover action is required when library communication is lost. Each record has two fields:

ACS Number   Fail-over Action (true or false)

The first field is the ACS ID and the second field is a Boolean value, either true or false. The logic of the policy settings is as follows:

When the second field is false, the ACSLS HA agent will not initiate cluster failover action to the alternate node, even though communication to the ACS has failed and cannot be restored.
When the second field is true, the ACSLS HA agent asserts cluster failover action after every attempt to reestablish communication from the primary node has failed. The system fails over only if library contact has been confirmed on the alternate node.

The default action is false for any ACS that is not listed in this file.
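
The policy logic above can be sketched as a small lookup. This is an illustration only: it assumes that the two fields are separated by whitespace and that lines beginning with # are remarks, neither of which is confirmed here, so check the header of your own ha_acs_list.txt first. The function name failover_policy is hypothetical, and the demonstration runs against a scratch file rather than the live policy table.

```shell
#!/bin/sh
# Sketch only. Assumed record layout: "<ACS number> <true|false>",
# whitespace-separated, with '#' lines treated as remarks.

# failover_policy ACS_ID POLICY_FILE
# Prints "true" or "false". An ACS with no record defaults to "false".
failover_policy() {
    action=false
    while read -r id value _rest; do
        case $id in
            ''|'#'*) continue ;;          # skip blank lines and remarks
        esac
        if [ "$id" = "$1" ]; then
            if [ "$value" = "true" ]; then action=true; else action=false; fi
        fi
    done < "$2"
    echo "$action"
}

# Demonstration against a scratch copy, not the live policy file.
policy_file=$(mktemp)
cat > "$policy_file" <<'EOF'
# ACS Number   Fail-over Action (true or false)
0   true
1   false
EOF

for acs in 0 1 2; do
    echo "ACS $acs: failover=$(failover_policy "$acs" "$policy_file")"
done
rm -f "$policy_file"
```

In this sketch, ACS 2 has no record and therefore reports false, matching the default described above.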

Libraries with Redundant Electronics (RE)

For libraries with Redundant Electronics (RE), the ACSLS HA agent attempts to switch communication to the alternate RE path before resorting to cluster failover action. This RE switch action applies only to a single SL8500, an SL3000, or an older 9310 with dual LMUs. Automatic RE switching is not attempted on any partitioned library.

Setting the Failover Pingpong_interval

The Solaris Cluster Pingpong_interval is a timeout property that prevents repeated failover action if full recovery cannot be restored after the first cluster failover event.

This is a user-modifiable property for the ACSLS resource group. The default value is 20 minutes (1200 seconds). With this setting, the first failover event occurs immediately when failover action is requested by the ACSLS HA agent. However, if the condition that would trigger failover action is not cleared on the new cluster node, any subsequent failover action is delayed until the defined pingpong interval has expired. This prevents needless thrashing of control between one cluster node and the other until the root problem has been resolved.

To adjust this property, modify the number defined in the file $ACS_HOME/acslsha/pingpong_interval. The number is expressed in seconds.

The default setting of 1200 seconds is reasonable for most medium to large library configurations. The optimal timeout value for this property depends on the actual number of LSMs and tape drives in the library configuration. Larger library configurations take longer to recover after a failover event, so set a longer interval for systems configured with more than ten LSMs or forty drives, or both.

A setting of 1800 (30 minutes) is recommended for a forty-LSM configuration, while a setting of 900 (15 minutes) is recommended for smaller libraries configured with one to four LSMs.
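
Because the file takes seconds while the recommendations are easier to reason about in minutes, the conversion can be sketched as follows. This is a minimal illustration that assumes the pingpong_interval file holds a single integer; the helper name minutes_to_seconds is hypothetical, and the example writes to a scratch file standing in for $ACS_HOME/acslsha/pingpong_interval. Inspect the shipped file for any header remarks before overwriting it.

```shell
#!/bin/sh
# Sketch only. Assumes the interval file contains one integer (seconds).

# minutes_to_seconds MINUTES
# Prints the interval in seconds.
minutes_to_seconds() {
    echo $(( $1 * 60 ))
}

# The three intervals mentioned in the text.
for minutes in 15 20 30; do
    echo "$minutes minutes = $(minutes_to_seconds "$minutes") seconds"
done

# Write a 30-minute interval to a scratch file that stands in for
# $ACS_HOME/acslsha/pingpong_interval.
interval_file=$(mktemp)
minutes_to_seconds 30 > "$interval_file"
cat "$interval_file"
rm -f "$interval_file"
```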

After changing the value in the pingpong_interval file, run the ACSLS HA start script:

start_acslsha.sh -h <logical hostname> -g <IPMP group> -z acslspool

This start command may be run even if the HA system is already running. It registers the new Pingpong_interval without impacting normal HA operation.

Registering for Email Notification of System Events

Users with administrative duties may register for automatic email notification of system events, including system boot events and ACSLS HA cluster failover events.

To register for such events, users add their email addresses to the respective files under the directory:

$ACS_HOME/data/external/email_notification/
   boot_notification
   ha_failover_notification

Place the email address of each intended recipient on a single line under the header remarks. Thereafter, every time the system boots or the HA cluster fails over to the standby node, each registered user is notified by email.
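
Registration amounts to appending a line to a text file, which can be sketched as below. This is an illustration under stated assumptions: one address per line, header remarks beginning with #, and a hypothetical helper named register_email. The demonstration uses a scratch file in place of ha_failover_notification and a placeholder address.

```shell
#!/bin/sh
# Sketch only. Assumes one address per line below the header remarks.

# register_email ADDRESS FILE
# Appends ADDRESS on its own line unless an identical line already exists.
register_email() {
    grep -qxF -- "$1" "$2" || printf '%s\n' "$1" >> "$2"
}

# Demonstration against a scratch file, not the live notification file.
notify_file=$(mktemp)
printf '# header remarks\n' > "$notify_file"
register_email "admin@example.com" "$notify_file"
register_email "admin@example.com" "$notify_file"   # second call adds nothing
cat "$notify_file"
rm -f "$notify_file"
```

Checking for an existing line before appending keeps the file free of duplicate entries when the helper is run more than once.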

This capability assumes that the sendmail service has been enabled on the ACSLS server, and that network firewall constraints allow for email communication from the data center.
