Sun Enterprise 10000 SSP 3.5 User Guide

How Automatic Failover Works

Automatic failover of the main to the spare SSP is accomplished through the following:

The following sections provide an overview of the basic SSP failover situations and the various ways to control automatic failover.

SSP Failover Situations

An automatic failover is triggered when a failure in the dual SSP configuration affects the proper operation of the main SSP. Failure points can be caused by the following:

However, failover does not occur when it has been disabled by operator request or when certain failure conditions prevent it. The failure conditions and the resulting failover actions are summarized in Chapter 10, SSP Internals, which also identifies and explains the different points of failure detected by the failover process.

Controlling Automatic SSP Failover

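The SSP failover capability is automatically enabled upon SSP installation or upgrade. You control the failover state through the setfailover(1M) command, which enables you to do the following:

  • Disable SSP failover

  • Enable SSP failover

  • Force a failover to the spare SSP

  • Modify the memory or disk space threshold that triggers a failover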

For additional information, see the setfailover(1M) man page.

To Disable SSP Failover
  1. As user ssp on the main SSP, type:


    ssp% setfailover off
    

    SSP failover remains disabled until you enable it. To determine whether failover was disabled, use the showfailover(1M) command to review the failover state, as explained in "Obtaining Failover Status Information".

To Enable SSP Failover

When you use the setfailover(1M) command to enable failover after it has been disabled, the connection states are checked before failover is enabled. All connection links must be functioning properly before failover can be enabled. If any failed connections exist, failover is not enabled.

  1. As user ssp on the main SSP, type:


    ssp% setfailover on
    

    SSP failover is activated if all control board connections are working. To verify that failover was enabled, use the showfailover(1M) command to review the failover state and connection status, as explained in "Obtaining Failover Status Information".


    Note -

    Wait several minutes before verifying the failover state. During this time, the setfailover command checks the control board connections before activating SSP failover.
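
The wait-then-verify step can be scripted. The following is a minimal sketch, not part of SSP: it polls showfailover until the "SSP Failover:" line no longer reads Disabled. It assumes a POSIX shell (for example ksh) and an awk that accepts a regular expression as field separator (nawk on Solaris). The function names, the number of checks, and the delay are illustrative assumptions, and the only state value taken from this guide is "Disabled".

```shell
# Print the value of the "SSP Failover:" line from showfailover output on stdin.
ssp_failover_state() {
    awk -F': *' '/SSP Failover:/ { print $2; exit }'
}

# Poll showfailover until SSP failover is no longer reported as Disabled.
# $1 = maximum number of checks, $2 = seconds to sleep between checks
# (both defaults are illustrative assumptions, not documented values).
wait_for_failover_enabled() {
    tries=${1:-10}
    delay=${2:-60}
    while [ "$tries" -gt 0 ]; do
        state=$(showfailover | ssp_failover_state)
        if [ -n "$state" ] && [ "$state" != "Disabled" ]; then
            echo "SSP failover state: $state"
            return 0
        fi
        tries=$((tries - 1))
        [ "$tries" -gt 0 ] && sleep "$delay"
    done
    echo "SSP failover still reported as Disabled" >&2
    return 1
}
```

For example, after typing setfailover on, running wait_for_failover_enabled 10 60 checks once a minute for up to ten minutes. Review the full showfailover output afterward to confirm the connection status.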


To Force a Failover to the Spare SSP

Note -

Before forcing an SSP failover, be sure that both the main and spare SSP are synchronized. Use the setdatasync(1M) command to synchronize the SSP configuration files between the main and spare SSP.


  1. As user ssp on the main SSP, type:


    ssp% setfailover force
    

  2. Use the showfailover(1M) command to determine whether the forced failover occurred and to review the failover state and connection status.

    For details, see "Obtaining Failover Status Information".

  3. Re-enable SSP failover, as explained in "To Enable SSP Failover".

To Modify the Memory or Disk Space Threshold in the ssp_resource File

When memory or disk space resources drop below a certain threshold, a failover occurs. However, you can change the threshold for these resources, which are stored in the ssp_resource(4) file, by using the setfailover(1M) command.

  1. On the main SSP, log in as user ssp and do one of the following:

    • To change the memory threshold, type:


      ssp% setfailover -m memory_threshold
      

      where memory_threshold is the updated virtual memory value in Kbytes.

    • To change the disk space threshold, type:


      ssp% setfailover -d disk_space_threshold
      

      where disk_space_threshold is the updated disk space value in Kbytes.

  2. Verify the updated threshold value by using the setfailover(1M) command with only the -m or -d option.
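
Because both thresholds are given in Kbytes, a small conversion helper can reduce arithmetic mistakes. The sketch below is illustrative only: mb_to_kb is a hypothetical helper, not an SSP command, and the sizes in the comments are placeholder values rather than recommended thresholds. It assumes a POSIX shell such as ksh.

```shell
# Convert a size in Mbytes to Kbytes (1 Mbyte = 1024 Kbytes).
mb_to_kb() {
    case $1 in
        ''|*[!0-9]*)
            echo "mb_to_kb: expected a non-negative integer" >&2
            return 1 ;;
    esac
    echo $(( $1 * 1024 ))
}

# Example use, as user ssp on the main SSP (placeholder sizes):
#   setfailover -m "$(mb_to_kb 512)"   # memory threshold of 524288 Kbytes
#   setfailover -d "$(mb_to_kb 200)"   # disk space threshold of 204800 Kbytes
```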

Obtaining Failover Status Information

Use the showfailover(1M) command on the main SSP to display failover status information. The following example shows the failover information displayed.


ssp% showfailover  
Failover State:
     SSP Failover: Disabled
     CB Failover:  Active
Failover Connection Map:
     Main SSP to Spare SSP thru Main Hub:   FAILED
     Main SSP to Spare SSP thru Spare Hub:  FAILED
     Main SSP to Primary Control Board:     GOOD
     Main SSP to Spare Control Board:       GOOD
     Spare SSP to Main SSP thru Main Hub:   FAILED
     Spare SSP to Main SSP thru Spare Hub:  FAILED
     Spare SSP to Primary Control Board:    FAILED
     Spare SSP to Spare Control Board:      FAILED
SSP/CB Host Information
     Main SSP:                              xf12-ssp
     Spare SSP:                             xf12-ssp2
     Primary Control Board (JTAG source):   xf12-cb1
     Spare Control Board:                   xf12-cb0
     System Clock source:                   xf12-cb1

The failover status includes the failover state, the failover connection map, and the SSP and control board host information.

You can also obtain information about the role of the current SSP by specifying the showfailover(1M) command with the -r option. The SSP role is either UNKNOWN (SSP role has not been determined), MAIN, or SPARE.

For additional details on the showfailover(1M) command, see the showfailover(1M) man page.
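
When the connection map is long, it can help to list only the links that are down. The following sketch filters showfailover output for connections marked FAILED. It assumes the layout shown in the example above (section headings at the left margin, entries indented with spaces, status as the last word on the line); failed_links is a hypothetical helper name, not an SSP command.

```shell
# List the connections marked FAILED in showfailover output read from stdin.
failed_links() {
    awk '
        /^Failover Connection Map:/ { in_map = 1; next }
        /^[^ ]/                     { in_map = 0 }
        in_map && $NF == "FAILED" {
            sub(/^ +/, "")
            sub(/: +FAILED$/, "")
            print
        }
    '
}
```

Typing showfailover | failed_links prints one line per failed connection. Empty output means that no link in the connection map is reported as FAILED.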

Managing Data Synchronization

If you have user-created files (non-SSP files that are not contained in the SSP directories) that must be maintained on the spare SSP for failover purposes, you must identify these files in the data propagation list (/var/opt/SUNWssp/.ssp_private/user_file_list) used for data synchronization. The datasyncd daemon uses this list to determine which files to copy from the main SSP to the spare.

By default, the data synchronization process checks for any changes to the user-created files on the main SSP every 60 minutes. You can use the setdatasync command to set the interval at which the data propagation list is to be checked for modifications (see "To Add a File to the Data Propagation List"). The interval starts from the time at which a file is added to the data propagation list. The files in this list are propagated to the spare SSP only when they have changed from the last interval check.


Note -

The data synchronization daemon uses the available disk space in the /tmp directory to copy files from the main SSP to the spare. Files that are larger than the available space in /tmp cannot be propagated. For example, if the data synchronization backup file (ds_backup.cpio) grows larger than the available space in /tmp, you must reduce the size of this file before data propagation can occur. For details on reducing the size of the data synchronization backup file, see "To Reduce the Size of the Data Synchronization Backup File".
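
The size limit described in this note can be checked before a file is added for propagation. The sketch below compares a file's size with the free space in /tmp. The helper names are hypothetical, and it assumes that df -k prints a single header line followed by one data line whose fourth column is the available space in Kbytes.

```shell
# Size of a file in Kbytes, rounded up.
file_kb() {
    bytes=$(wc -c < "$1") || return 1
    echo $(( (bytes + 1023) / 1024 ))
}

# Available space in /tmp in Kbytes (fourth column of df -k output).
tmp_avail_kb() {
    df -k /tmp | awk 'NR == 2 { print $4 }'
}

# Succeed if a file of $1 Kbytes fits in $2 Kbytes of free space.
fits() {
    [ "$1" -le "$2" ]
}
```

For example, fits "$(file_kb /export/home/ssp/site.conf)" "$(tmp_avail_kb)" succeeds only when the file (a hypothetical path here) is no larger than the space currently free in /tmp.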


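Use the setdatasync(1M) command to do the following:

  • Add a file to the data propagation list

  • Remove a file from the data propagation list

  • Remove the data propagation list

  • Push a file to the spare SSP

  • Resynchronize SSP configuration files between the main and spare SSP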


Note -

The files on the spare SSP are not monitored by the datasyncd daemon, which means that if you remove a user-created file on the spare SSP, the user file will not be automatically restored (copied) from the main to the spare SSP. In addition, do not remove SSP configuration files from the spare SSP.


For additional details, see the setdatasync(1M) man page.

To Add a File to the Data Propagation List
  1. As user ssp on the main SSP, type:


    ssp% setdatasync -i interval schedule filename 
    

    where interval is how often, in minutes, the specified file is checked as part of the data synchronization process, and filename is the file to be added. The file name must contain the absolute path. The files on the data propagation list are copied to the spare SSP only when those files change on the main SSP, and not each time the files are checked.
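
Because the file name must be an absolute path, a wrapper can reject relative names before setdatasync is called. This is a sketch under that single assumption; is_abs_path and add_to_propagation_list are hypothetical helper names, and only the setdatasync -i interval schedule filename form shown above is used.

```shell
# Succeed only if $1 is an absolute path (begins with "/").
is_abs_path() {
    case $1 in
        /*) return 0 ;;
        *)  return 1 ;;
    esac
}

# Add file $2 to the data propagation list, checked every $1 minutes.
# Relative paths are refused before setdatasync is run.
add_to_propagation_list() {
    interval=$1
    file=$2
    if ! is_abs_path "$file"; then
        echo "not an absolute path: $file" >&2
        return 1
    fi
    setdatasync -i "$interval" schedule "$file"
}
```

For example, add_to_propagation_list 30 /tmp/t1 runs setdatasync -i 30 schedule /tmp/t1, whereas add_to_propagation_list 30 t1 stops with an error.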

To Remove a File From the Data Propagation List
  1. As user ssp on the main SSP, type:


    ssp% setdatasync cancel filename 
    

    where filename is the file to be removed from the data propagation list. The file name must contain the absolute path.

To Remove the Data Propagation List

The setdatasync clean command is useful for managing disk space in single SSP configurations, where the data propagation list can grow quite large and consume unnecessary disk space. It is possible for the /tmp directory to become full, which can cause the system to hang. You can run the setdatasync clean command as needed, either daily or weekly, to prevent the /tmp directory from filling up. Alternatively, you can automate the cleanup by using cron(1M) with a crontab(1) entry that runs the setdatasync clean command.


Note -

Do not use this option when you have a dual SSP configuration because it can desynchronize data between the main and spare SSP.


  1. As user ssp on the main SSP, type:


    ssp% setdatasync clean  
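
For single SSP configurations, the weekly cleanup can be expressed as a crontab entry. The sketch below only prints such an entry so that it can be reviewed first. The schedule (02:00 every Sunday) is an arbitrary example, and /opt/SUNWssp/bin is an assumed location for setdatasync that is not confirmed by this guide; verify the path on your SSP.

```shell
# Print a crontab entry that runs "setdatasync clean" once a week.
# /opt/SUNWssp/bin is an ASSUMED install path; check where setdatasync
# actually lives on your SSP before using this entry.
clean_cron_entry() {
    # Fields: minute hour day-of-month month day-of-week command
    echo '0 2 * * 0 /opt/SUNWssp/bin/setdatasync clean'
}
```

Add the printed line to the ssp user's crontab with crontab -e. Cron jobs run with a minimal environment, so the entry might also need to set up the environment that user ssp normally has at login.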
    

To Push a File to the Spare SSP
  1. As user ssp on the main SSP, type:


    ssp% setdatasync push filename 
    

    where filename is the file to be copied to the spare SSP without adding the file to the data propagation list. The file name must contain the absolute path.

To Resynchronize SSP Configuration Files Between the Main and the Spare SSP

Use this procedure to keep data between the main and spare SSP synchronized. If you want to archive an SSP configuration, use the ssp_backup(1M) command.

  1. As user ssp on the main SSP, type:


    ssp% setdatasync backup  
    

    A data synchronization backup file of all SSP configuration data on the main SSP is created and then restored on the spare SSP. Note that the data synchronization backup differs from a backup created by the ssp_backup(1M) command:

    • The data synchronization backup, while similar to a backup created by the ssp_backup command, does not back up the /tftpboot directory.

    • The data synchronization backup does not restore the following files:

      • /var/opt/SUNWssp/.ssp_private/machine_server_fifo

      • /var/opt/SUNWssp/adm/messages

        This file is propagated to the /var/opt/SUNWssp/adm/messages.dsbk file on the spare SSP.

      • /var/opt/SUNWssp/adm/messages.dsbk

      • /var/opt/SUNWssp/.ssp_private/user_file_list

      • /var/opt/SUNWssp/.ssp_private/.ds_queue

    The data synchronization backup can fail if the backup file exceeds the available disk space in the /tmp directory. For details on reducing the size of the data synchronization backup file, see the following procedure.

To Reduce the Size of the Data Synchronization Backup File
  1. As superuser on the main SSP, run ssp_backup(1M) to create an archive of your SSP environment.

  2. Before you run setdatasync backup, remove the following files to reduce the size of the data synchronization backup that is created:

    • $SSPLOGGER/messages.x

    • $SSPLOGGER/domain/Edd-recovery_files

    • $SSPLOGGER/domain/messages.x

    • $SSPLOGGER/domain/netcon.x

    • $SSPLOGGER/domain/post/files

    where x is the archive number of the file. Because these files are propagated from the new main SSP to the spare after a failover, you must remove these files on both the main and spare SSP to prevent regeneration of these files.
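
Before removing anything, it can be useful to see which archived log files exist and how large they are. The following sketch lists the messages.x and netcon.x files under a directory with their sizes in Kbytes. It does not delete files, and it does not cover the Edd-recovery_files or post/files entries. list_archived_logs is a hypothetical helper name, and the sketch assumes that x is always numeric.

```shell
# List archived messages.x and netcon.x files under directory $1,
# with their sizes in Kbytes, without removing anything.
list_archived_logs() {
    find "$1" -type f \( -name 'messages.[0-9]*' -o -name 'netcon.[0-9]*' \) \
        -exec du -k {} \; | sort -k 2
}
```

For example, list_archived_logs "$SSPLOGGER" shows the candidates on the current SSP. Run it on both the main and the spare SSP, because the files must be removed on both.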

Obtaining Data Synchronization Information

Use the showdatasync(1M) command on the main SSP to obtain basic status information about data synchronization. The examples in this section show the different types of information displayed by the showdatasync command.

The following example shows the status of the datasyncd daemon (file propagation), the file currently being propagated, and the number of files queued for data propagation:


ssp% showdatasync 
File Propagation Status:  ACTIVE
Active File:              -
Queued files:             0

The next example shows a data propagation list:


ssp% showdatasync -l  
TIME PROPAGATED         INTERVAL     FILE
Mar 23 16:00:00         60           /tmp/t1
Mar 23 17:00:00         120          /tmp/t2

The example below shows the files queued for data synchronization:


ssp% showdatasync -Q  
FILE
/tmp/t1
/tmp/t2

For additional details, see the showdatasync(1M) man page.
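
The summary form of the showdatasync output is easy to read from a script. The sketch below extracts the propagation status and the queued file count. It assumes the "label: value" layout shown in the first example above and an awk that accepts a regular expression as field separator (nawk on Solaris); the function names are hypothetical.

```shell
# Print the value after "File Propagation Status:" from showdatasync output.
propagation_status() {
    awk -F': *' '/^File Propagation Status:/ { print $2; exit }'
}

# Print the number after "Queued files:" from showdatasync output.
queued_files() {
    awk -F': *' '/^Queued files:/ { print $2; exit }'
}
```

For example, showdatasync | queued_files prints 0 for the first example in this section.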

Performing Command Synchronization

Command synchronization recovers user-defined commands that are interrupted by a failover and automatically reruns those commands on the new main SSP after a failover. Command synchronization does the following:

If you want user commands to be automatically recovered after a failover, you must prepare these user commands for synchronization as explained in the following sections.

Preparing User Commands for Automatic Restart

The runcmdsync(1M) command prepares a user command for automatic restart. runcmdsync adds the user command to the command synchronization list, which identifies the commands to be rerun after a failover.

To Prepare a User Command for Restart
  1. As user ssp on the main SSP, type:


    ssp% runcmdsync script_name [parameters] 
    

    where:

    script_name is the name of the user command to be restarted.

    parameters are the options associated with the specified command.

    The specified command will be rerun automatically on the new main SSP after a failover.

Preparing User Scripts for Automatic Recovery

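If you want to resume processing of a user script from a certain marked point (location) within the script, you must include the following synchronization commands in the user script:

  • initcmdsync creates a command synchronization descriptor that identifies the script.

  • savecmdsync marks an execution point from which processing can be resumed.

  • cancelcmdsync removes the descriptor from the command synchronization list.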

Each script must contain the initcmdsync and cancelcmdsync commands, which initialize the script for synchronization and remove it from the command synchronization list, respectively. For details on the synchronization commands, see the cmdsync(1M) man page.


Note -

These synchronization commands are intended for use by experienced programmers. You can use the runcmdsync(1M) command instead of the synchronization commands described in this section to prepare a script for recovery. However, the runcmdsync(1M) command will prepare the script so that it is rerun from the beginning and not from specified marker points.


The following procedures describe how to use these synchronization commands.


Note -

After an SSP failover or in a single SSP configuration, SSP failover is disabled. When failover is disabled, scripts that contain synchronization commands will generate error messages to the platform log file and return non-zero exit codes. These error messages can be ignored.


To Create a Command Synchronization Descriptor
  1. In your user script, type the following to create a command synchronization descriptor that identifies your script:


    initcmdsync script_name [parameters]

    where:

    script_name is the name of the script.

    parameters are the options associated with the specified script.

    The output returned from the initcmdsync command serves as the command synchronization descriptor.

To Specify a Command Synchronization Marker Point
  1. In your user script, type the following to mark an execution point from which processing can be resumed:


    savecmdsync -M identifier cmdsync_descriptor 
    

    where:

    identifier is a positive integer that marks an execution point from which the script can be restarted.

    cmdsync_descriptor is the command synchronization descriptor output by the initcmdsync command.

To Remove a Command Synchronization Descriptor
  1. In your user script, type the following after the script termination sequence:


    cancelcmdsync cmdsync_descriptor
    

    where cmdsync_descriptor is the command synchronization descriptor output by the initcmdsync command. The specified descriptor is removed from the command synchronization list so that the user script is not run on the new main SSP after a failover.
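
The three procedures above fit together as shown in the following sketch of a user script. It is an illustration only, not the example shipped in /opt/SUNWssp/examples/cmdsync. step_one and step_two are hypothetical placeholders for real work, and the marker identifiers 1 and 2 are arbitrary positive integers. How a restarted script learns which marker to resume from is not covered in this section, so that logic is left out; see the cmdsync(1M) man page.

```shell
# Hypothetical work steps; replace these with the real work of your script.
step_one() { echo "step one"; }
step_two() { echo "step two"; }

main() {
    # Register the script; the output is the command synchronization descriptor.
    desc=$(initcmdsync "$0" "$@")

    step_one
    # Marker 1: step one has completed.
    savecmdsync -M 1 "$desc"

    step_two
    # Marker 2: step two has completed.
    savecmdsync -M 2 "$desc"

    # Normal termination: remove the descriptor so that the script is not
    # rerun on the new main SSP after a failover.
    cancelcmdsync "$desc"

    # The synchronization commands return non-zero when failover is
    # disabled; this sketch ignores those exit codes.
    return 0
}
```

In a real script, end the file with main "$@". The sketch avoids set -e on purpose, because the synchronization commands return non-zero exit codes when failover is disabled.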

Obtaining Command Synchronization Information

Use the showcmdsync(1M) command on the main SSP to review the command synchronization list that identifies the user commands to be restarted on the new main SSP after an automatic failover.

The following is an example command synchronization list output by the showcmdsync(1M) command:


ssp% showcmdsync 
DESCRIPTOR      IDENTIFIER   CMD
         0              -1   c1 c2 a2

For further details, see the showcmdsync(1M) man page.
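
The command synchronization list can be reshaped into one labeled line per entry for use in scripts. This sketch assumes the three-column layout shown in the example above, with a single header line and whitespace-separated columns; list_sync_commands is a hypothetical helper name.

```shell
# Reformat showcmdsync output (read from stdin) as one labeled line per entry.
list_sync_commands() {
    awk 'NR > 1 && NF >= 3 {
        desc = $1; ident = $2
        $1 = ""; $2 = ""
        sub(/^ +/, "")
        print "descriptor=" desc " identifier=" ident " cmd=" $0
    }'
}
```

For the example list above, showcmdsync | list_sync_commands prints descriptor=0 identifier=-1 cmd=c1 c2 a2.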

Example Script with Synchronization Commands

SSP provides an example user script that shows how the synchronization commands can be used. This script is located in the /opt/SUNWssp/examples/cmdsync directory. This directory also contains a README file that explains how the script works.

After an SSP Failover

After an SSP failover occurs, you must perform certain recovery tasks: