Sun Enterprise 10000 SSP 3.5 User Guide

How Automatic Failover Works

Automatic failover of the main to the spare SSP is accomplished through the following:

Failover monitoring

Failover monitoring is performed by the fod daemon, which continuously monitors the components in a dual SSP configuration for failure conditions. When a failover condition is detected, the fod daemon, in conjunction with the ssp_startup daemon, actually initiates the failover from the main SSP to the spare.

For details on the fod daemon and the various failure conditions that it detects, see Chapter 10, SSP Internals.

Data synchronization

For failover purposes, data on the main SSP must be synchronized with data on the spare SSP. The data synchronization daemon, datasyncd(1M), ensures that all SSP configuration files and specified user-created files (identified in the data propagation list) are copied from the main SSP to the spare, so that both SSPs are synchronized when a failover occurs. For further information on the datasyncd daemon, see Chapter 10, SSP Internals.

This data synchronization occurs whenever the SSP configuration or user-created files change on the main SSP, or when you perform a data synchronization backup. For details on data synchronization backup, see "To Resynchronize SSP Configuration Files Between the Main and the Spare SSP".
- When a change to an SSP configuration file occurs, the change is propagated to the spare SSP immediately, except for the ssp_resource(4) file and the COD license file, which are checked once every minute and then propagated if they have changed.
- Any change to a user-created file is propagated to the spare SSP at the time interval designated through the setdatasync(1M) command.
You control the data synchronization process using the setdatasync(1M) command, as described in "Managing Data Synchronization".

Command synchronization

The recovery of user-defined commands interrupted by an automatic failover is called command synchronization. You use synchronization commands to indicate how these user commands are to be rerun on the new main SSP after a failover. For details on controlling command synchronization, see "Performing Command Synchronization".

Floating IP address

The working main SSP is identified by a floating IP address that you assign during SSP installation or upgrade. This floating IP address is a logical interface that eliminates the need for a specific SSP host name to communicate between the Sun Enterprise 10000 domains and the main SSP. When a failover occurs, the floating IP address identifies the new main SSP. The floating IP address enables communication between the external monitoring software and the working main SSP.

The following sections provide an overview of the basic SSP failover situations and the various ways to control automatic failover.

SSP Failover Situations

An automatic failover is triggered when a failure in the dual SSP configuration affects the proper operation of the main SSP. Failure points can be caused by the following:

Failed network connections
SSP system failure due to any of the following:
- System panic
- Complete power failure
- Drop in the OpenBoot PROM (OBP) that persists for five minutes or less
Resource depletion

Resource depletion means that the disk space or virtual memory available for SSP operations is insufficient. If these resources drop below a certain threshold, the fod daemon initiates a failover. The thresholds are stored in the ssp_resource(4) file and can be modified using the setfailover(1M) command. For details, see "To Modify the Memory or Disk Space Threshold in the ssp_resource File".

However, note that failover does not occur when it has been disabled by operator request or when certain failure conditions prevent the failover. Chapter 10, SSP Internals, summarizes the failure conditions and the resulting failover actions, and identifies and explains the different points of failure detected by the failover process.

Controlling Automatic SSP Failover

The SSP failover capability is automatically enabled upon SSP installation or upgrade. You control the failover state through the setfailover(1M) command, which enables you to do the following:

- Disable, enable, or force an SSP failover.
- View or set the memory or disk space thresholds in the ssp_resource file.

For additional information, see the setfailover(1M) man page.

To Disable SSP Failover

As user ssp on the main SSP, type:
ssp% setfailover off
SSP failover remains disabled until you enable it. To determine whether failover was disabled, use the showfailover(1M) command to review the failover state, as explained in "Obtaining Failover Status Information".

To Enable SSP Failover

When you use the setfailover(1M) command to enable failover after it has been disabled, the connection states are checked before failover is enabled. All connection links must be functioning properly before failover can be enabled. If any failed connections exist, failover is not enabled.

As user ssp on the main SSP, type:
ssp% setfailover on
SSP failover is activated if all control board connections are working. To verify that failover was enabled, use the showfailover(1M) command to review the failover state and connection status, as explained in "Obtaining Failover Status Information".

Note -
Wait several minutes before verifying the failover state. During this time, the setfailover command checks the control board connections before activating SSP failover.

To Force a Failover to the Spare SSP

Note -

Before forcing an SSP failover, be sure that both the main and spare SSP are synchronized. Use the setdatasync(1M) command to synchronize the SSP configuration files between the main and spare SSP.

As user ssp on the main SSP, type:
ssp% setfailover force

Use the showfailover(1M) command to determine whether the forced failover occurred and to review the failover state and connection status.

For details, see "Obtaining Failover Status Information".

Re-enable SSP failover, as explained in "To Enable SSP Failover".

To Modify the Memory or Disk Space Threshold in the ssp_resource File

When memory or disk space resources drop below a certain threshold, a failover occurs. You can change these thresholds, which are stored in the ssp_resource(4) file, by using the setfailover(1M) command.

On the main SSP, log in as user ssp and do one of the following:
- To change the memory threshold, type:
  ssp% setfailover -m memory_threshold
  where memory_threshold is the updated virtual memory value in Kbytes.
- To change the disk space threshold, type:
  ssp% setfailover -d disk_space_threshold
  where disk_space_threshold is the updated disk space value in Kbytes.

Verify the updated threshold value by using the setfailover(1M) command with only the -m or -d option.
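
Because both thresholds are given in Kbytes, it can help to compute the value before typing the command. The following Python sketch builds the argument list for either option. The helper name and the conversion of 1 Mbyte to 1024 Kbytes are assumptions made for illustration; only the -m and -d options come from this guide.

```python
def setfailover_threshold_args(kind, megabytes):
    """Build the setfailover argument list for a threshold given in Mbytes.

    kind is "memory" (-m) or "disk" (-d). The Mbyte-to-Kbyte factor of
    1024 is an assumption; adjust it if your site counts differently.
    """
    flags = {"memory": "-m", "disk": "-d"}
    kbytes = int(megabytes * 1024)
    return ["setfailover", flags[kind], str(kbytes)]


# Print the command line for a 20-Mbyte virtual memory threshold.
print(" ".join(setfailover_threshold_args("memory", 20)))
```

The printed line can be typed at the ssp% prompt on the main SSP.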

Obtaining Failover Status Information

Use the showfailover(1M) command on the main SSP to display failover status information. The following example shows the failover information displayed.

ssp% showfailover  
Failover State:
     SSP Failover: Disabled
     CB Failover:  Active
Failover Connection Map:
     Main SSP to Spare SSP thru Main Hub:   FAILED
     Main SSP to Spare SSP thru Spare Hub:  FAILED
     Main SSP to Primary Control Board:     GOOD
     Main SSP to Spare Control Board:       GOOD
     Spare SSP to Main SSP thru Main Hub:   FAILED
     Spare SSP to Main SSP thru Spare Hub:  FAILED
     Spare SSP to Primary Control Board:    FAILED
     Spare SSP to Spare Control Board:      FAILED
SSP/CB Host Information
     Main SSP:                              xf12-ssp
     Spare SSP:                             xf12-ssp2
     Primary Control Board (JTAG source):   xf12-cb1
     Spare Control Board:                   xf12-cb0
     System Clock source:                   xf12-cb1

The failover status includes the following:

Failover state

The failover state is one of the following:
- Active -- automatic failover is enabled and functioning normally
- Disabled -- automatic failover has been disabled by operator request or by a failure condition that prevents a failover from occurring
- Failed -- a failover occurred
  
  After a failover, the status is listed as Failed until you re-enable failover using the setfailover(1M) command. You must manually re-enable failover, even after you have fixed all connections and they are identified as GOOD in the failover connection map (explained below).

Failover connection map

The connection map provides the status of the control board connection links monitored by the failover processes. A connection link is either GOOD, which means the connection is functioning properly, or FAILED, which indicates the connection is not working.

If you have failed connections, use this connection map to help determine the failure condition. For additional details on the failure conditions associated with the various failure points, see "Description of Failover Detection Points" in Chapter 10, SSP Internals.

SSP/CB host information

The host information identifies the SSPs, control boards, and the control board that manages the JTAG interface and system clock.

You can also obtain information about the role of the current SSP by specifying the showfailover(1M) command with the -r option. The SSP role is either UNKNOWN (SSP role has not been determined), MAIN, or SPARE.

For additional details on the showfailover(1M) command, see the showfailover(1M) man page.
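
If you collect showfailover output regularly, a small parser can pick out the failover state and the failed links. The following Python sketch assumes the output layout is exactly that of the example in this section (unindented section titles, indented "name: value" lines). The function names are hypothetical and are not part of SSP.

```python
SAMPLE = """\
Failover State:
     SSP Failover: Disabled
     CB Failover:  Active
Failover Connection Map:
     Main SSP to Spare SSP thru Main Hub:   FAILED
     Main SSP to Spare SSP thru Spare Hub:  FAILED
     Main SSP to Primary Control Board:     GOOD
     Main SSP to Spare Control Board:       GOOD
     Spare SSP to Main SSP thru Main Hub:   FAILED
     Spare SSP to Main SSP thru Spare Hub:  FAILED
     Spare SSP to Primary Control Board:    FAILED
     Spare SSP to Spare Control Board:      FAILED
SSP/CB Host Information
     Main SSP:                              xf12-ssp
     Spare SSP:                             xf12-ssp2
     Primary Control Board (JTAG source):   xf12-cb1
     Spare Control Board:                   xf12-cb0
     System Clock source:                   xf12-cb1
"""


def parse_showfailover(text):
    """Group indented 'name: value' lines under their unindented section title."""
    sections = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # An unindented line starts a new section.
            current = line.strip().rstrip(":")
            sections[current] = {}
        elif current is not None and ":" in line:
            # Split on the last colon so names containing "(...)" stay intact.
            name, _, value = line.strip().rpartition(":")
            sections[current][name.strip()] = value.strip()
    return sections


def failed_links(sections):
    """Return the connection map entries whose status is FAILED."""
    links = sections.get("Failover Connection Map", {})
    return [name for name, status in links.items() if status == "FAILED"]


status = parse_showfailover(SAMPLE)
print("SSP Failover:", status["Failover State"]["SSP Failover"])
for link in failed_links(status):
    print("FAILED:", link)
```

In this example every link that involves the spare SSP is reported as FAILED, which is consistent with the Disabled failover state.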

Managing Data Synchronization

If you have user-created files (non-SSP files that are not contained in the SSP directories) that must be maintained on the spare SSP for failover purposes, you must identify these files in the data propagation list (/var/opt/SUNWssp/.ssp_private/user_file_list) used for data synchronization. The datasyncd daemon uses this list to determine which files to copy from the main SSP to the spare.

By default, the data synchronization process checks for any changes to the user-created files on the main SSP every 60 minutes. You can use the setdatasync command to set the interval at which the data propagation list is to be checked for modifications (see "To Add a File to the Data Propagation List"). The interval starts from the time at which a file is added to the data propagation list. The files in this list are propagated to the spare SSP only when they have changed since the last interval check.

Note -

The data synchronization daemon uses the available disk space in the /tmp directory to copy files from the main SSP to the spare. Files that are larger than the available space in the /tmp directory cannot be propagated. For example, if the data synchronization backup file (ds_backup.cpio) grows larger than the available space in /tmp, you must reduce the size of this file before data propagation can occur. For details on reducing the size of the data synchronization backup file, see "To Reduce the Size of the Data Synchronization Backup File".
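
A quick way to anticipate this limit is to compare a file's size with the free space in /tmp before adding the file to the data propagation list. The following Python sketch uses only the standard library. The function names are hypothetical, and the free space reported by the operating system is only an approximation of what the datasyncd daemon can actually use at copy time.

```python
import os
import shutil


def fits(size_bytes, free_bytes):
    """Return True if a file of size_bytes fits into free_bytes of space."""
    return size_bytes <= free_bytes


def free_bytes(directory="/tmp"):
    """Return the free space, in bytes, of the file system holding directory."""
    return shutil.disk_usage(directory).free


def fits_in_tmp(path, tmp_dir="/tmp"):
    """Return True if the file at path is no larger than the free space in tmp_dir."""
    return fits(os.path.getsize(path), free_bytes(tmp_dir))
```

For example, calling fits_in_tmp() with the absolute path of a user-created file returns False when the file is too large to be staged in /tmp.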

Use the setdatasync(1M) command to do the following:

- Add a file to the data propagation list and indicate how often this file is to be checked for modifications.
- Remove a file from the data propagation list.
- Erase all entries and temporary files in the data propagation list and remove the data propagation list.
- Push a file to the spare SSP without adding the file to the data propagation list.
- Resynchronize the SSP configuration files between the main and the spare SSP.

Note -

The files on the spare SSP are not monitored by the datasyncd daemon, which means that if you remove a user-created file on the spare SSP, the user file will not be automatically restored (copied) from the main to the spare SSP. In addition, do not remove SSP configuration files from the spare SSP.

For additional details, see the setdatasync(1M) man page.

To Add a File to the Data Propagation List

As user ssp on the main SSP, type:
ssp% setdatasync -i interval schedule filename
where interval specifies how often, in minutes, the specified filename is to be checked as part of the data synchronization process. The specified file name must contain the absolute path. Files on the data propagation list are copied to the spare SSP only when they have changed on the main SSP, not each time they are checked.

To Remove a File From the Data Propagation List

As user ssp on the main SSP, type:
ssp% setdatasync cancel filename
where filename is the file to be removed from the data propagation list. The file name must contain the absolute path.

To Remove the Data Propagation List

The setdatasync clean command is useful for managing disk space in single SSP configurations, where the data propagation list can grow quite large and consume unnecessary disk space. If the /tmp directory becomes full, the system can hang. You can run the setdatasync clean command as needed, for example daily or weekly, to prevent the /tmp directory from filling up. Alternatively, you can automate the cleanup with cron(1M) by adding a crontab(1) entry that runs the setdatasync clean command.

Note -

Do not use this option when you have a dual SSP configuration because it can desynchronize data between the main and spare SSP.

As user ssp on the main SSP, type:
ssp% setdatasync clean

To Push a File to the Spare SSP

As user ssp on the main SSP, type:
ssp% setdatasync push filename
where filename is the file to be copied to the spare SSP without adding the file to the data propagation list. The file name must contain the absolute path.

To Resynchronize SSP Configuration Files Between the Main and the Spare SSP

Use this procedure to keep data between the main and spare SSP synchronized. If you want to archive an SSP configuration, use the ssp_backup(1M) command.

As user ssp on the main SSP, type:
ssp% setdatasync backup
A data synchronization backup file of all SSP configuration data on the main SSP is created and then restored on the spare SSP. Note that the data synchronization backup differs from a backup created by the ssp_backup(1M) command:
- The data synchronization backup, while similar to a backup created by the ssp_backup command, does not back up the /tftpboot directory.
- The data synchronization backup does not restore the following files:
  - /var/opt/SUNWssp/.ssp_private/machine_server_fifo
  - /var/opt/SUNWssp/adm/messages
    
    This file is propagated to the /var/opt/SUNWssp/adm/messages.dsbk file on the spare SSP.
  - /var/opt/SUNWssp/adm/messages.dsbk
  - /var/opt/SUNWssp/.ssp_private/user_file_list
  - /var/opt/SUNWssp/.ssp_private/.ds_queue
The data synchronization backup can fail if the backup file exceeds the available disk space in the /tmp directory. For details on reducing the size of the data synchronization backup file, see the following procedure.

To Reduce the Size of the Data Synchronization Backup File

As superuser on the main SSP, run ssp_backup(1M) to create an archive of your SSP environment.

Before you run setdatasync backup, remove the following files to reduce the size of the data synchronization backup file:
- $SSPLOGGER/messages.x
- $SSPLOGGER/domain/Edd-recovery_files
- $SSPLOGGER/domain/messages.x
- $SSPLOGGER/domain/netcon.x
- $SSPLOGGER/domain/post/files
where x is the archive number of the file. Because these files are propagated from the new main SSP to the spare after a failover, you must remove these files on both the main and spare SSP to prevent regeneration of these files.

Obtaining Data Synchronization Information

Use the showdatasync(1M) command on the main SSP to obtain basic status information about data synchronization. The examples in this section show the different types of information displayed by the showdatasync command.

The following example shows the status of the datasyncd daemon (file propagation), the file currently being propagated, and the number of files queued for data propagation:

ssp% showdatasync 
File Propagation Status:  ACTIVE
Active File:              -
Queued files:             0

The next example shows a data propagation list:

ssp% showdatasync -l  
TIME PROPAGATED         INTERVAL     FILE
Mar 23 16:00:00         60           /tmp/t1
Mar 23 17:00:00         120          /tmp/t2

The example below shows the files queued for data synchronization:

ssp% showdatasync -Q  
FILE
/tmp/t1
/tmp/t2

For additional details, see the showdatasync(1M) man page.
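
The data propagation list can also be read by a script, for example to report which files are checked least often. The following Python sketch assumes the column layout of the showdatasync -l example above and file names that contain no spaces. The function name and the dictionary keys are hypothetical.

```python
SAMPLE_LIST = """\
TIME PROPAGATED         INTERVAL     FILE
Mar 23 16:00:00         60           /tmp/t1
Mar 23 17:00:00         120          /tmp/t2
"""


def parse_propagation_list(text):
    """Turn 'showdatasync -l' output into a list of dictionaries."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("TIME PROPAGATED"):
            continue  # skip blank lines and the header row
        # The last two columns are the interval and the file;
        # everything before them is the propagation time.
        when, interval, path = line.rsplit(None, 2)
        entries.append(
            {"propagated": when.strip(), "interval_min": int(interval), "file": path}
        )
    return entries


for entry in parse_propagation_list(SAMPLE_LIST):
    print(entry["file"], "is checked every", entry["interval_min"], "minutes")
```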

Performing Command Synchronization

Command synchronization recovers user-defined commands that are interrupted by a failover and automatically reruns those commands on the new main SSP after a failover. Command synchronization does the following:

- Maintains a command synchronization list on the spare SSP that specifies the commands to be restarted after a failover. Each command is run as user ssp.
- After a failover, reruns specified user commands.
- After a failover, resumes processing of specified user scripts from certain marked points (that you identify within each script).

  These user scripts must be structured so that processing can be resumed from a labeled marker point in the script.

If you want user commands to be automatically recovered after a failover, you must prepare these user commands for synchronization as explained in the following sections.

Preparing User Commands for Automatic Restart

The runcmdsync(1M) command prepares a user command for automatic restart. runcmdsync adds the user command to the command synchronization list, which identifies the commands to be rerun after a failover.

To Prepare a User Command for Restart

As user ssp on the main SSP, type:
ssp% runcmdsync script_name [parameters]
where:

script_name is the name of the user command to be restarted.

parameters are the options associated with the specified command.

The specified command will be rerun automatically on the new main SSP after a failover.

Preparing User Scripts for Automatic Recovery

If you want to resume processing of a user script from a certain marked point (location) within the script, you must include the following synchronization commands in the user script:

- initcmdsync(1M) creates a command synchronization descriptor that identifies a particular script and its associated data.

  These descriptors are placed in a command synchronization list that determines which user scripts are to be restarted after an automatic failover.
- savecmdsync(1M) specifies a marker point from which the script can be restarted.
- cancelcmdsync(1M) removes the command synchronization descriptor from the command synchronization list.

Each script must contain the initcmdsync and cancelcmdsync commands, which initialize the script for synchronization and remove the command from the command synchronization list, respectively. For details on the synchronization commands, see the cmdsync(1M) man page.

Note -

These synchronization commands are intended for use by experienced programmers. You can use the runcmdsync(1M) command instead of the synchronization commands described in this section to prepare a script for recovery. However, the runcmdsync(1M) command will prepare the script so that it is rerun from the beginning and not from specified marker points.

The following procedures describe how to use these synchronization commands.

Note -

After an SSP failover or in a single SSP configuration, SSP failover is disabled. When failover is disabled, scripts that contain synchronization commands will generate error messages to the platform log file and return non-zero exit codes. These error messages can be ignored.

To Create a Command Synchronization Descriptor

In your user script, type the following to create a command synchronization descriptor that identifies your script:
initcmdsync script_name [parameters]
where:

script_name is the name of the script.

parameters are the options associated with the specified script.

The output returned from the initcmdsync command serves as the command synchronization descriptor.

To Specify a Command Synchronization Marker Point

In your user script, type the following to mark an execution point from which processing can be resumed:
savecmdsync -M identifier cmdsync_descriptor
where:

identifier is a positive integer that marks an execution point from which the script can be restarted.

cmdsync_descriptor is the command synchronization descriptor output by the initcmdsync command.

To Remove a Command Synchronization Descriptor

In your user script, type the following after the script termination sequence:
cancelcmdsync cmdsync_descriptor
where cmdsync_descriptor is the command synchronization descriptor output by the initcmdsync command. The specified descriptor is removed from the command synchronization list so that the user script is not run on the new main SSP after a failover.
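
The three procedures above fit together in a fixed order: create the descriptor, save a marker before each restartable step, and cancel the descriptor at the end. The following Python sketch shows that order only. The command names and the -M option come from this guide; the function names, the resume_from parameter, and the way a restarted script learns which marker to resume from are assumptions made for illustration. The example script and README in /opt/SUNWssp/examples/cmdsync remain the reference for real scripts.

```python
import subprocess


def _run(argv):
    """Run an SSP command and return its standard output as text.

    The exit status is not treated as fatal, because the synchronization
    commands return non-zero exit codes when failover is disabled.
    """
    result = subprocess.run(argv, check=False, capture_output=True, text=True)
    return result.stdout.strip()


def run_with_markers(script_name, steps, parameters=(), resume_from=1, run=_run):
    """Run numbered steps, saving a marker point before each one.

    steps is a list of (identifier, callable) pairs; each identifier is a
    positive integer. resume_from is hypothetical: steps with a smaller
    identifier are skipped, as they would be after a restart.
    """
    # The output of initcmdsync serves as the descriptor.
    descriptor = run(["initcmdsync", script_name, *parameters])
    for identifier, step in steps:
        if identifier < resume_from:
            continue
        run(["savecmdsync", "-M", str(identifier), descriptor])
        step()
    # Remove the descriptor so the script is not rerun after a failover.
    run(["cancelcmdsync", descriptor])
```

The run parameter lets you substitute a stand-in for the real commands, which is useful for trying the control flow on a machine that does not have SSP installed.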

Obtaining Command Synchronization Information

Use the showcmdsync(1M) command on the main SSP to review the command synchronization list that identifies the user commands to be restarted on the new main SSP after an automatic failover.

The following is an example command synchronization list output by the showcmdsync(1M) command:

ssp% showcmdsync 
DESCRIPTOR      IDENTIFIER   CMD
         0              -1   c1 c2 a2

For further details, see the showcmdsync(1M) man page.
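
To check from a script whether any commands are waiting to be rerun, you can parse this list. The following Python sketch assumes the three-column layout of the example above and integer values in the first two columns. The function name and the dictionary keys are hypothetical.

```python
SAMPLE_CMDSYNC = """\
DESCRIPTOR      IDENTIFIER   CMD
         0              -1   c1 c2 a2
"""


def parse_cmdsync_list(text):
    """Turn showcmdsync output into a list of dictionaries."""
    rows = []
    for line in text.splitlines():
        # Split into descriptor, identifier, and the rest of the line,
        # so that a command with arguments stays in one field.
        fields = line.split(None, 2)
        if len(fields) < 3 or fields[0] == "DESCRIPTOR":
            continue  # skip the header row and incomplete lines
        rows.append(
            {
                "descriptor": int(fields[0]),
                "identifier": int(fields[1]),
                "cmd": fields[2].rstrip(),
            }
        )
    return rows


for row in parse_cmdsync_list(SAMPLE_CMDSYNC):
    print("descriptor", row["descriptor"], "will rerun:", row["cmd"])
```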

Example Script with Synchronization Commands

SSP provides an example user script that shows how the synchronization commands can be used. This script is located in the /opt/SUNWssp/examples/cmdsync directory. This directory also contains a README file that explains how the script works.

After an SSP Failover

After an SSP failover occurs, you must perform certain recovery tasks:

- Identify the failure point or condition that caused the failover and determine how to correct the failure.

  Depending on the failover condition, note that a failover is either initiated or disabled. To identify the failure point, use the showfailover(1M) command to review the failover state and connection status. Review the connection map in the showfailover output and the summary of the failover detection points in Chapter 10, SSP Internals.

  You can also review the platform log file to identify other error conditions and determine the corrective action needed to reactivate the failed components.
- After resolving the problem, re-enable SSP failover using the setfailover(1M) command (see "To Enable SSP Failover").
- Rerun any SSP commands that were interrupted by a failover, with the exception of the DR commands addboard(1M), deleteboard(1M), and moveboard(1M), which are automatically resumed on the new main SSP.
