CHAPTER 12

SC Failover

SC failover maximizes Sun Fire high-end system uptime by adding high-availability features to its administrative operations. A Sun Fire high-end system contains two SCs. Failover provides software support to a high-availability two-SC system configuration.

The main SC provides all resources for the entire Sun Fire high-end system. If hardware or software failures occur on the main SC or on any hardware control path (for example, the console bus interface or Ethernet interface) from the main SC to other system devices, SC failover software automatically triggers a failover to the spare SC. The spare SC then assumes the role of the main and takes over all the main SC responsibilities. In a high-availability system configuration using two SCs, SMS data, configuration, and log files are replicated on the spare SC. Active domains are not affected by this switch.



Note - For failover to be supported, both SCs must be configured with identical versions of the Solaris OS and SMS software.



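This chapter includes the following sections:

Overview
Fault Monitoring
File Propagation
Failover Management
Failover CLI Commands
Command Synchronization
Data Synchronization
Failure and Recovery
Security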


Overview

In the current high-availability SC configuration, one SC acts as a "hot spare" for the other.

Failover eliminates the single point of failure in the management of the Sun Fire high-end system. The fomd daemon identifies and handles as many multiple points of failure as possible. Some failover scenarios are discussed in Failure and Recovery.

At any time during SC failover, the failover process does not adversely affect any configured or running domains except for temporary loss of services from the SC.

In a high-availability SC system:

The failover management daemon (fomd(1M)) is the core of the SC failover mechanism. It is installed on both the main and spare SCs.

The fomd daemon performs the following functions:

Services that would be interrupted during an SC failover include:

You do not need to know the host name of the main SC to establish connections to it. As part of configuring SMS (refer to the smsconfig(1M) man page), a logical host name was created which is always active on the main SC. Refer to the Sun Fire 15K/12K System Site Planning Guide and the System Management Services (SMS) 1.6 Installation Guide for information on the creation of the logical host names in your network database.

Operations interrupted by an SC failover can be recovered after the failover completes. Reissuance of the interrupted operation causes the operation to resume and continue to completion.

All automated functions provided by fomd resume without operator intervention after SC failover. Any recovery actions interrupted before completion by the SC failover restart.


Fault Monitoring

There are three types of failovers:

1. Main-initiated

A main-initiated failover is where the fomd running on the main SC yields control to the spare SC in response to either an unrecoverable local hardware or software failure or an operator request.

2. Spare-initiated (takeover)

In a spare-initiated failover (takeover), the fomd running on the spare determines that the main SC is no longer functioning properly.

3. Indirect-triggered takeover

If the I2 network path between the SCs is down and there is a fault on the main, the main switches itself to the role of spare. Upon detecting this, the spare SC assumes the role of main.

In the last two scenarios, the spare fomd eliminates the possibility of a split-brain condition by resetting the main SC.

When either a software-controlled or a user-forced failover occurs, fomd deactivates the failover mechanism. This eliminates the possibility of repeatedly failing over back and forth between the two SCs.


File Propagation

One of the purposes of the fomd is propagation of data from the main SC to the spare SC through the interconnects that exist between the two SCs. This data includes configuration, data, and log files.

The fomd daemon performs the following functions:

The I2 network must be operative for the transfer of data to occur.



Note - Any changes made to the network configuration on one SC using smsconfig -m must be made to the other SC as well. Network configuration is not automatically propagated.



Should both interconnections between the two SCs fail, failover can still occur provided main and spare SC accesses to the high-availability SRAMs (HASRAMs) remain intact. Due to the failure of both interconnections, propagation of SMS data can no longer occur, creating the potential of stale data on the spare SC. In the event of a failover, fomd on the new main keeps the current state of the data, logs the state, and provides other SMS daemons and clients information about the current state of the data.

When either of the interconnects between the two SCs is healthy again, data is pulled over depending on the timestamp of each SMS file. If the timestamp of the file is earlier than the one on the SC now acting as the spare, it gets transferred over. If the timestamp of the file is later than the one on the spare SC, no action is taken.

Failover cannot occur when both of the following conditions are met:

This is considered a quadruple fault, and failover is disabled until at least one of the links is restored.


Failover Management

This section explains the startup, main SC, and spare SC roles.

Startup



Note - Failover between main and spare SCs with different Solaris OS versions is not a Sun-supported configuration.



For the failover software to function, both SCs must be present in the system. The determination of main and spare roles is based in part on the SC number. This slot number does not prevent a given SC from assuming either role - it only controls how it goes about doing so.

If SMS is started on one SC first, that SC becomes main. If SMS starts up on both SCs at essentially the same time, whichever SC first determines that the other SC either is not main or is not running SMS becomes main.

If SC0 is in the middle of the startup process, it queries SC1 for its role, and if the SC1 role cannot be confirmed, SC0 tries to become main. SC0 resets SC1 during this process. This is done to prevent both SCs from assuming the main role, a condition known as split brain. The reset occurs even if the failover mechanism is deactivated.
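The SC0 startup decision described above can be sketched as a small function. This is an illustrative model only, not SMS code: the function name, the role strings, and the use of None for an unconfirmed role are assumptions made for the example.

```python
from typing import Optional, Tuple


def sc0_startup_decision(sc1_role: Optional[str]) -> Tuple[str, bool]:
    """Return (role SC0 takes, whether SC0 resets SC1).

    sc1_role is "MAIN", "SPARE", or None when the role of SC1
    cannot be confirmed (hypothetical encoding for this sketch).
    """
    if sc1_role == "MAIN":
        # The other SC is already main, so SC0 stays spare.
        return ("SPARE", False)
    if sc1_role is None:
        # Role unconfirmed: SC0 tries to become main and resets SC1
        # to avoid a split-brain condition.
        return ("MAIN", True)
    # The other SC is confirmed not to be main.
    return ("MAIN", False)
```

In this model the reset happens only when the SC1 role cannot be confirmed, which matches the text; whether the failover mechanism is deactivated plays no part in the decision.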

Main SC

Upon startup, the fomd running on the main SC begins periodically testing the hardware and network interfaces. Initially the failover mechanism is disabled (internally) until at least one status response has been received from the remote (spare) SC indicating that it is healthy.

If a local fault is detected by the main fomd during initial startup, failover occurs when all of the following conditions are met:

1. The I2 network was not the source of the fault.

2. The remote SC is healthy (as indicated by the health status response).

3. The failover mechanism has not been deactivated.
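The three conditions above amount to a single boolean test. The following minimal sketch assumes hypothetical parameter names; it only restates the list and does not reflect how fomd is implemented.

```python
def main_should_fail_over(fault_from_i2_network: bool,
                          remote_sc_healthy: bool,
                          mechanism_deactivated: bool) -> bool:
    """True only when all three documented conditions are met."""
    return (not fault_from_i2_network
            and remote_sc_healthy
            and not mechanism_deactivated)
```

For example, a local fault on the main SC with a healthy spare does not lead to a failover if the operator has deactivated the mechanism.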

Spare SC

Upon startup, fomd runs on the spare SC and begins periodically testing the software, hardware, and network interfaces.

If a local fault is detected by the fomd running on the spare SC during initial startup, it informs the main fomd of its debilitated state.


Failover CLI Commands

This section describes the setfailover and showfailover commands.

setfailover Command

The setfailover command modifies the state of the SC failover mechanism. The default state is on.

Forcing a failover to a spare SC with a faulty clock can cause the affected domains to domain stop (Dstop). The setfailover command detects faulty clocks on spare SCs and provides a second-chance confirmation prompt to avoid accidentally forcing a failover to a faulty SC. However, the -q (quiet) and -y (yes to all prompts) options bypass this check.

The syntax of the setfailover command is as follows:


# setfailover [-q] [-y|-n] [on|off|force]




Caution - The -q option suppresses all prompts, including the second-chance prompt. If you use both the -q and the -y options, the failover is forced to the spare SC even if it is faulty. This forced failover could result in a Dstop if the spare SC is faulty.



The following is an example of the setfailover command detecting a faulty clock on the spare SC:


# setfailover force
Forcing failover. Do you want to continue (yes/no)? yes
The spare clock input on some boards might be bad. Forcing a failover now is likely to cause the affected domains to domain stop (Dstop).
Do you want to continue (yes/no)? no

TABLE 12-1 describes SC failover states.


TABLE 12-1 Options for Modifying Failover States

Option

Definition

[-q]

Enables quiet mode, which suppresses all messages to stdout including prompts. When used alone, -q defaults to the -n option for all prompts. When used with either the -y or the -n option, -q suppresses all user prompts and automatically answers with either yes or no based on the option chosen.

[-y|-n]

-y automatically answers yes to all prompts. Prompts are displayed unless used with the -q option. Use with caution. -n automatically answers no to all prompts. Prompts are displayed unless used with the -q option.

on

Enables failover for systems that previously had failover disabled due to a failover or an operator request. This option instructs the command to attempt to re-enable failover only. If failover cannot be re-enabled, subsequent use of the showfailover command indicates the current failure that prevented the enable.

off

Disables the failover mechanism. This prevents a failover until the mechanism is re-enabled.

force

Forces a failover to the spare SC. The spare SC must be available and healthy.
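The interaction of the -q, -y, and -n options in TABLE 12-1 can be summarized in a short sketch. The function and its return shape are hypothetical and exist only to illustrate the documented behavior; treating -y and -n as mutually exclusive is inferred from the [-y|-n] syntax.

```python
from typing import Optional, Tuple


def resolve_prompt_behavior(quiet: bool, yes: bool, no: bool) -> Tuple[bool, Optional[str]]:
    """Return (prompts_displayed, automatic_answer).

    automatic_answer is "yes", "no", or None when the operator
    must answer each prompt interactively.
    """
    if yes and no:
        raise ValueError("-y and -n cannot be combined")
    if yes:
        answer = "yes"
    elif no:
        answer = "no"
    elif quiet:
        # Used alone, -q defaults to -n for all prompts.
        answer = "no"
    else:
        answer = None
    # Prompts are displayed unless -q is given.
    return (not quiet, answer)
```

Under this model, the combination of -q and -y yields (False, "yes"): nothing is displayed and every prompt, including the second-chance prompt, is answered yes.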




Note - In the event a patch must be applied to SMS 1.6, failover must be disabled before the patch is installed. Refer to the System Management Services (SMS) 1.6 Installation Guide.



For more information and examples, refer to the setfailover man page.

showfailover Command

The showfailover command allows you to monitor the state and display the current status of the SC failover mechanism. The -v option displays the current status of all monitored components.


xc30p13-sc0:sms-svc:13> showfailover -v
SC Failover Status: ACTIVE
Status of Shared Memory:
HASRAM (CSB at CS0): .................................Good
HASRAM (CSB at CS1): .................................Good
Status of xc30p13-sc0: 
Role: ................................................MAIN
SMS Daemons: .........................................Good
System Clock: ........................................Good
Private I2 Network: ..................................Good
Private HASRAM Network:...............................Good
Public Network: ................................NOT TESTED
System Memory: ......................................38.9%
Disk Status: 
/: ..................................................17.4%
Console Bus Status: 
EXB at EX1: .........................................Good
EXB at EX2: .........................................Good
EXB at EX4: .........................................Good

Status of xc30p13-sc1: 
Role: ...............................................SPARE
SMS Daemons: .........................................Good
System Clock: ........................................Good
Private I2 Network: ..................................Good
Private HASRAM Network:...............................Good
Public Network: ................................NOT TESTED
System Memory: ......................................34.2%
Disk Status: 
/: ..................................................17.1%
Console Bus Status: 
EXB at EX1: .........................................Good
EXB at EX2: .........................................Good
EXB at EX4: .........................................Good
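If you capture this output in a script, the dotted "label ....value" lines are straightforward to parse. The following is a minimal sketch that assumes the layout shown in the example above; the regular expression, the function name, and the abbreviated sample text are illustrative and are not part of SMS.

```python
import re

# A label, an optional colon, a run of dots, then the value.
LINE_RE = re.compile(r"^(?P<label>.+?):?\s*\.{3,}\s*(?P<value>.+?)\s*$")

# Abbreviated sample in the same layout as the showfailover -v example.
SAMPLE = """\
SC Failover Status: ACTIVE
Status of xc30p13-sc0:
Role: ................................................MAIN
Public Network..................................NOT TESTED
Disk Status:
/: ..................................................17.4%
Status of xc30p13-sc1:
Role: ...............................................SPARE
"""


def parse_showfailover(text):
    """Group dotted status lines under the latest 'Status of <name>:' header."""
    result = {}
    section = None
    for line in text.splitlines():
        stripped = line.strip()
        match = LINE_RE.match(stripped)
        if match:
            if section is not None:
                result[section][match.group("label")] = match.group("value")
        elif stripped.startswith("Status of ") and stripped.endswith(":"):
            section = stripped[len("Status of "):-1]
            result[section] = {}
    return result


parsed = parse_showfailover(SAMPLE)
```

Sub-headers such as Disk Status are ignored in this sketch, so the "/" entry is stored directly under the SC it belongs to.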

The -r option displays the SC role: main, spare, or unknown. For example:


sc0:sms-user:> showfailover -r
MAIN

If you do not specify an option, only the state information is displayed:


sc0:sms-user:> showfailover
SC Failover Status: state

The failover mechanism can be in one of four states: ACTIVATING, ACTIVE, DISABLED, and FAILED. TABLE 12-2 describes the four states.


TABLE 12-2 States of the Failover Mechanism

State

Definition

ACTIVATING

The failover mechanism is preparing to transition to the ACTIVE state. Failover becomes active when all tests have passed and files have been synchronized.

ACTIVE

The failover mechanism is enabled and functioning normally.

DISABLED

The failover mechanism has been disabled due to the occurrence of a failover or an operator request (setfailover off).

FAILED

The failover mechanism has detected a failure that prevents a failover from being possible, or failover has not yet completed activation.


In addition, showfailover displays the state of each of the network interface links monitored by the failover processes. The display format is as follows:


network i/f device name: [GOOD|FAILED]

The showfailover command returns a failure string describing the failure condition. Each failure string has a code associated with it. TABLE 12-3 describes the showfailover command failure strings.


TABLE 12-3 showfailover Failure Strings

String

Explanation

None

No failure.

S-SC EXT NET

The spare SC external network interface has failed.

S-SC CONSOLE BUS

A fault has been detected on the spare SC console bus paths.

S-SC LOC CLK

The spare SC local clock has failed.

S-SC DISK FULL

The spare SC system disk is full.

S-SC IS DOWN

The spare SC is down or unresponsive. If this message results from the I2 network or HASRAMs being down, the spare SC could still be running. Log in to the spare SC to verify.

S-SC MEM EXHAUSTED

The spare SC memory or swap space has been exhausted.

S-SC SMS DAEMON

At least one SMS daemon could not be started or restarted on the spare SC.

S-SC INCOMPATIBLE SMS VERSION

The spare SC is running a different version of SMS software. Both SCs must be running the same version.

I2 NETWORK/HASRAM DOWN

Both interfaces for communication between the SCs are down. The main cannot tell what version of SMS is running on the spare or what its state is. It declares the spare down and logs a message to that effect. Dependent services, including file propagation, are unavailable.


For examples and more information, refer to the showfailover man page.


Command Synchronization

If an SC failover occurs during the execution of a command, you can restart the same command on the new main SC.

All commands and actions do the following:

The fomd daemon provides the following support for command synchronization:

The four CLI commands in SMS that require command sync support are addboard, deleteboard, moveboard, and rcfgadm.

cmdsync CLIs

The cmdsync commands provide the ability to initialize a script or command with a cmdsync descriptor, update an existing cmdsync descriptor execution point, or cancel a cmdsync descriptor from the spare SC's list of recovery actions. Commands or scripts can also be run in a cmdsync envelope.

In the case of an SC failover to the spare, initialization of a cmdsync descriptor on the spare SC enables the spare SC to restart or resume the target script or command from the last execution point set. These commands execute only on the main SC and have no effect on the current cmdsync list if executed on the spare.

Commands or scripts invoked with the cmdsync commands when there is no enabled spare SC result in a no-op. That is, command execution proceeds as normal, but an entry in the platform log indicates that a cmdsync attempt has failed.

initcmdsync Command

The initcmdsync(1M) command creates a cmdsync descriptor. The target script or command and its associated parameters are saved as part of the cmdsync data. The exit code of the initcmdsync command provides a cmdsync descriptor that can be used in subsequent cmdsync commands to reference the action. Actual execution of the target command or script is not performed. For more information, refer to the initcmdsync(1M) man page.

savecmdsync Command

The savecmdsync(1M) command saves a new execution point in a previously defined cmdsync descriptor. This allows a target command or script to restart execution at a location associated with an identifier. The target command or script must support being restarted at this execution point; otherwise, execution restarts at the beginning of the target command or script. For more information, refer to the savecmdsync(1M) man page.

cancelcmdsync Command

The cancelcmdsync(1M) command removes a cmdsync descriptor from the spare restart list. Once this command is run, the target command or script associated with the cmdsync descriptor is not restarted on the spare SC in the event of a failover. Take care to ensure that all target commands or scripts contain an initcmdsync command sequence as well as a cancelcmdsync sequence after the normal or abnormal termination flows. For more information, refer to the cancelcmdsync(1M) man page.

runcmdsync Command

The runcmdsync(1M) command executes the specified target command or script under a cmdsync wrapper. You cannot restart at execution points other than the beginning. The target command or script is executed through the system command after creation of the cmdsync descriptor. Upon termination of the system command, the cmdsync descriptor is removed from the cmdsync list, and the exit code of the system command is returned to the user. For more information, refer to the runcmdsync(1M) man page.

showcmdsync Command

The showcmdsync(1M) command displays the current cmdsync descriptor list. For more information, refer to the showcmdsync(1M) man page.
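The relationship between the five cmdsync commands can be pictured with a toy in-memory model of the recovery list. This sketch is purely conceptual: the class, its method names, the integer descriptors, and the script names in the usage lines are hypothetical, and the real commands operate on the SCs rather than on a Python object.

```python
import itertools


class CmdSyncList:
    """Toy model of the recovery list kept for the spare SC."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._entries = {}

    def init(self, command):
        """Like initcmdsync: register the command without running it."""
        descriptor = next(self._ids)
        self._entries[descriptor] = {"command": command, "point": None}
        return descriptor

    def save(self, descriptor, point):
        """Like savecmdsync: record a new execution point."""
        self._entries[descriptor]["point"] = point

    def cancel(self, descriptor):
        """Like cancelcmdsync: remove the descriptor from the list."""
        self._entries.pop(descriptor, None)

    def run(self, command, action):
        """Like runcmdsync: wrap one run; restart is from the beginning only."""
        descriptor = self.init(command)
        try:
            return action()
        finally:
            self.cancel(descriptor)

    def show(self):
        """Like showcmdsync: return the current descriptor list."""
        return dict(self._entries)

    def restart_plan(self):
        """What a new main SC would restart after a failover."""
        return [(e["command"], e["point"]) for e in self._entries.values()]


sync = CmdSyncList()
desc = sync.init("my_script.sh")   # hypothetical script name
sync.save(desc, "step2")
plan_after_save = sync.restart_plan()
sync.cancel(desc)
```

The model makes the importance of the cancel step visible: a descriptor that is never cancelled stays in the restart plan, so the associated command would be restarted after a later failover.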


Data Synchronization

Customized data synchronization is provided in SMS by the setdatasync(1M) command. setdatasync enables you to specify a user-created file to be added to or removed from the data propagation list.

setdatasync Command

The setdatasync list identifies the files to be copied from the main to the spare system controller (SC) as part of data synchronization for automatic failover. The specified user file and the directory in which it resides must have read and write permissions for you on both SCs. You must also have platform or domain privileges.

The data synchronization process checks the user-created files on the main SC for any changes. If the user-created files on the main SC have changed since the last propagation, they are repropagated to the spare SC. By default, the data synchronization process checks a specified file every 60 minutes; however, you can use setdatasync to indicate how often a user file is checked for modifications.
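The check-then-repropagate cycle described above can be sketched as two small predicates. Only the 60-minute default comes from the text; the function names and the use of plain minute counters as timestamps are assumptions made for illustration.

```python
DEFAULT_CHECK_INTERVAL_MIN = 60  # default check interval stated in the text


def due_for_check(now_min, last_check_min, interval_min=DEFAULT_CHECK_INTERVAL_MIN):
    """True when the file's check interval has elapsed."""
    return now_min - last_check_min >= interval_min


def needs_repropagation(modified_at_min, last_propagated_min):
    """True when the file changed on the main SC after its last propagation."""
    return modified_at_min > last_propagated_min
```

A file is copied to the spare SC again only when both predicates hold: its interval has elapsed and it has changed since the last propagation.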

You can also use setdatasync to propagate a specified file to the spare SC without adding the file to the data propagation list.

Using setdatasync backup can slow down automatic fomd file propagation.

The time required to execute setdatasync backup is proportional to the number of files being transferred. Other factors that can affect the speed of file transfer include: the average size of files being transferred, the amount of memory available on the SCs, the load (CPU cycles and disk traffic) on the SCs, and whether the I2 network is functioning.

The following statistics assume an average file size of 200 Kbytes:



Note - There are repropagation constraints you should be aware of before using this command. For more information and examples, refer to the setdatasync(1M) man page.



showdatasync Command

The showdatasync command provides the current status of files being propagated (copied) from the main SC to its spare. The showdatasync command also provides the list of files registered using setdatasync and their status. Data propagation synchronizes data on the spare SC with data on the main SC, so that the spare SC is current with the main SC if an SC failover occurs.

For more information, refer to the showdatasync(1M) man page.


Failure and Recovery

In a high-availability configuration, fomd manages the failover mechanism on the local and remote SCs. The fomd daemon detects the presence of local hardware and software faults and determines the appropriate action to take.

The fomd daemon is responsible for detecting the faults described in TABLE 12-4.


TABLE 12-4 fomd Hardware and Software Fault Categories

Category

Description

a

All relevant hardware buses that are local to the SC Control board (CB)/CPU board.

b

The external network interfaces.

c

The I2 network interface between the SCs.

d

Unrecoverable software failures. This category is for those cases where an SMS software component (daemon) crashes and cannot be restarted after three attempts, the file system is full, the heap is exhausted, and so forth.


FIGURE 12-1 illustrates the failover fault categories.


FIGURE 12-1 Failover Fault Categories


TABLE 12-5 illustrates how faults in the categories affect the failover mechanism. Assume that the failover mechanism is activated.


TABLE 12-5 Failover Fault Categories

Failure Point | Main SC | Spare SC | Failover  | Notes
a             | X       |          | X         | Failover to spare occurs.
a             |         | X        | Disables  | No effect on the main SC, but the spare SC has suffered a hardware fault so failover is disabled.
b             | X       |          |           | Failover to spare.
b             |         | X        | No effect | The fact that the spare SC external network interfaces have failed does not affect the failover mechanism.
c             |         |          | No effect | Main and spare SC log the fault.
d             | X       |          | X         | Failover to the spare SC, assuming that it is healthy.
d             |         | X        | Disables  | Failover is disabled because the spare SC is deemed unhealthy at this point.
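The contents of TABLE 12-5 can be expressed as a lookup keyed by fault category and the SC on which the fault occurs. This is a hedged sketch: the dictionary, the function name, and the result strings are illustrative paraphrases of the table, and the failover mechanism is assumed to be activated.

```python
# (fault category, SC where the fault occurs) -> effect on failover.
# Category c (the I2 network between the SCs) is not tied to one SC.
FAILOVER_EFFECTS = {
    ("a", "main"): "failover to spare",
    ("a", "spare"): "failover disabled",
    ("b", "main"): "failover to spare",
    ("b", "spare"): "no effect",
    ("c", None): "no effect",
    ("d", "main"): "failover to spare",
    ("d", "spare"): "failover disabled",
}


def failover_effect(category, location=None):
    """Look up the effect of a fault, assuming failover is activated."""
    if category == "c":
        location = None  # both SCs simply log the fault
    return FAILOVER_EFFECTS[(category, location)]
```

Note that the category d entry for the main SC still depends on the spare being healthy, as stated in the table notes.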


Failover on Main SC (Main-Controlled Failover)

Events for the main fomd during SC failover occur in the following order:

1. Detects the fault.

2. Stops generating heartbeats.

3. Tells the remote failover software to start a takeover timer. The purpose of this timer is to provide an alternate means for the remote (spare) SC to take over if for any reason the main hangs and never reaches the final notification step (Number 11).

4. Starts the SMS software in spare mode.

5. Removes the logical IP interface.

6. Enables the console bus caging mechanism.

7. Triggers propagation of any modified SMS files to the spare SC or HASRAMs.

8. Stops file propagation monitoring.

9. Shuts down main-specific daemons and sets the main SC role to UNKNOWN.

10. Logs a failover event.

11. Notifies remote (spare) failover software that it should assume the role of main. If the takeover timer expires before the spare is notified, the remote SC takes over on its own.

Events for the spare fomd during failover occur in the following order:

1. Receives message from the main fomd to assume main role, or the takeover timer expires. If the former is true, then the takeover timer is stopped.

2. Resets the old main SC.

3. Notifies hwad, frad, and mand to configure the spare fomd in the main role.

4. Assumes the role of main.

5. Starts generating heartbeat interrupts.

6. Configures the logical IP interface.

7. Disables the console bus caging mechanism.

8. Starts the SMS software in main mode.

9. Prepares the DARBs to receive interrupts.

10. Logs a role reversal event, spare to main.

11. The spare SC is now the main, and fomd deactivates the failover mechanism.
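The role of the takeover timer in the two sequences above can be illustrated with a small sketch. The function, its parameters, and the use of seconds are hypothetical; the actual timer length is not given in this chapter.

```python
from typing import Optional


def spare_takeover_trigger(timer_seconds: float, notified_after: Optional[float]) -> str:
    """Return what causes the spare fomd to assume the main role.

    notified_after is the time, measured from the start of the takeover
    timer, at which the main's notification arrives, or None if the main
    hangs and never sends it.
    """
    if notified_after is not None and notified_after <= timer_seconds:
        # Notification arrived first, so the takeover timer is stopped.
        return "notified by main"
    # Main hung or was too slow: the spare takes over on its own.
    return "takeover timer expired"
```

Either way the spare ends up as main; the timer only guarantees that a hung main SC cannot block the failover indefinitely.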

Fault on Main SC (Spare Takes Over Main Role)

In this scenario, the spare SC takes main control in reaction to loss of communication with the main SC. The most important aspect of this type of failover is the prevention of the split-brain condition. Another assumption is that the failover mechanism is not deactivated. If it has been deactivated, no takeover can occur.

The spare fomd does the following:

From the spare fomd perspective, this phenomenon can be caused by two conditions: the main SC is truly dead, or the I2 network interface is down.

In the former case, a failover is needed (provided that the failover mechanism is activated), while in the latter it is not. To identify which is the case, the spare fomd polls for the presence of heartbeat interrupts from the main SC to determine if the main SC is still up and running. As long as heartbeat interrupts are being received, or the failover mechanism is deactivated or disabled, no failover occurs.

In the case where no interrupts are detected but the failover mechanism is deactivated, the spare fomd does not attempt to take over unless the operator manually activates the failover mechanism using the CLI command setfailover. Otherwise, if the spare SC is healthy, the spare fomd proceeds to take over the role of main.
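The decision made by the spare fomd in the preceding paragraphs reduces to three checks. The sketch below uses hypothetical parameter names and is only a restatement of the text, not the fomd implementation.

```python
def spare_should_take_over(heartbeats_present: bool,
                           mechanism_activated: bool,
                           spare_healthy: bool) -> bool:
    """Decide whether the spare fomd takes over the main role."""
    if heartbeats_present:
        # The main SC is still running; only the I2 network is down.
        return False
    if not mechanism_activated:
        # No takeover until the operator re-enables failover with setfailover.
        return False
    return spare_healthy
```

The ordering matters for reading the text: heartbeat interrupts are the evidence that distinguishes a dead main SC from a failed I2 network, so they are checked first.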

The following lists the events for the spare fomd, in order, during failover:

1. Reconfigures itself as main. This includes taking over control of the I2C bus, configuring the logical main SC IP address, and starting up the necessary SMS software daemons.

2. Starts generating heartbeat interrupts.

3. Configures the logical IP interface.

4. Disables console bus caging.

5. Starts the SMS software in main mode.

6. Configures the DARB interrupts.

7. Logs a takeover event.

8. The spare fomd, now the main, deactivates the failover mechanism.

I2 Network Fault

The following lists the events, in order, that occur after an I2 network fault.

1. The main fomd detects the I2 network is not healthy.

2. The main fomd stops propagating files and checkpointing data over to the spare SC.

3. The spare fomd detects the I2 network is not healthy.

From the spare fomd perspective, this phenomenon can be caused by two conditions: the main SC is truly malfunctioning, or the I2 network interface is down. In the former case, the corrective action is to fail over, while in the latter, it is not. To identify which is the case, the fomd starts polling for the presence of heartbeat interrupts from the main SC to determine if the main SC is still up and running. If heartbeat interrupts are present, the fomd keeps the spare as spare.

4. The spare fomd clears out the checkpoint data on the local disk.

Fault on Main SC (I2 Network Is Also Down)

The following lists the events, in order, that occur after a fault on the main SC.

1. The main fomd detects the fault.

If the last known state of the spare SC was good, then the main fomd stops generating heartbeats. Otherwise, failover does not continue.

If the access to the console bus is still available, the main failover software finishes propagating any remaining critical files to HASRAM and flushes out any or all critical state information to HASRAM.

2. The main fomd reconfigures the SMS software into spare mode.

3. The main fomd removes the logical main SC IP address.

4. The main fomd stops generating heartbeat interrupts.

Fault Recovery and Reboot

This section describes fault recovery and reboot processes.

I2 Fault Recovery

The following lists the events, in order, that occur during an I2 network fault recovery.

1. The main fomd detects that the I2 network is healthy.

If the spare SC is completely healthy as indicated in the health status response message, the fomd enables failover and, assuming that the failover mechanism has not been deactivated by the operator, does a complete re-sync of the log files and checkpointing data over to the spare SC.

2. The spare fomd detects that the I2 network is healthy.

The spare fomd disables failover and clears out the checkpoint data on the local disk.

Reboot and Recovery

The following lists the events, in order, that occur during a reboot and recovery. A reboot and recovery scenario happens in two cases.

Main SC Receives a Master Reset or Its UltraSPARC Processor Receives a Reset

1. Assume SSCPOST passed without any problems. If SSCPOST failed and the OS cannot be booted, the main is inoperable.

2. Assume all SSC Solaris drivers attached without any problems. If the SBBC driver fails to attach, see Fault on Main SC (Spare Takes Over Main Role). If any other drivers fail to attach, see Failover on Main SC (Main-Controlled Failover).

3. The main fomd is started.

4. If the fomd determines that the remote SC has already assumed the main role, then see Number 5 in Spare SC Receives a Master Reset or Its UltraSPARC Processor Receives a Reset. Otherwise, proceed to Number 5 in this list.

5. The fomd configures the logical main IP address and starts up the rest of the SMS software.

6. SMS daemons start in recovery mode if necessary.

7. Main fomd starts generating heartbeat interrupts.

8. At this point, the main SC is fully recovered.

Spare SC Receives a Master Reset or Its UltraSPARC Processor Receives a Reset

1. Assume SSCPOST passed without any problems. If SSCPOST failed and the OS cannot be booted, the spare is inoperable.

2. Assume all SSC Solaris drivers attached without any problems. If the SBBC driver fails to attach, or any other drivers fail to attach, the spare SC is deemed inoperable.

3. The fomd is started.

4. The fomd determines that the SC is the preferred spare and assumes the spare role.

5. The fomd starts checking for the presence of heartbeat interrupts from the remote (initially presumed to be main) SC.

If after a configurable amount of time no heartbeat interrupts are detected, the failover mechanism state is checked. If enabled and activated, fomd initiates a takeover. See Number 5 of Main SC Receives a Master Reset or Its UltraSPARC Processor Receives a Reset. Otherwise, fomd continues monitoring for the presence of heartbeat interrupts and the state of the failover mechanism.

6. The fomd starts periodically checking the hardware, software, and network interfaces.

7. The fomd configures the local main SC IP address.

8. At this point, the spare SC is fully recovered.

Client Failover Recovery

The following lists the events that occur during a client failover recovery. A recovery scenario happens in the following two cases.

Fault on Main SC-Recovering From the Spare SC

Clients with any operations in progress are manually recovered by checkpointing any recurring data.

Fault on Main SC (With I2 Network Down)-Recovering From the Spare SC

Since the I2 network is down, all checkpointing data is removed. Clients cannot perform any recovery.

Once you have finished with recovery, you can continue with the reboot steps.

Reboot Main SC (With Spare SC Down)

This condition is identical to Fault on Main SC-Recovering From the Spare SC.

Reboot of Spare SC

No recovery is necessary.


Security

All failover-specific network traffic (such as health status request or response messages and file propagation packets) is sent only over the interconnect network that exists between the two SCs.