CHAPTER 12 - SC Failover

SC failover maximizes Sun Fire high-end system uptime by adding high-availability features to its administrative operations. A Sun Fire high-end system contains two SCs. Failover provides software support to a high-availability two-SC system configuration.

The main SC provides all resources for the entire Sun Fire high-end system. If hardware or software failures occur on the main SC or on any hardware control path (for example, the console bus interface or Ethernet interface) from the main SC to other system devices, SC failover software automatically triggers a failover to the spare SC. The spare SC then assumes the role of the main and takes over all the main SC responsibilities. In a high-availability system configuration using two SCs, SMS data, configuration, and log files are replicated on the spare SC. Active domains are not affected by this switch.

Overview

In the current high-availability SC configuration, one SC acts as a "hot spare" for the other.

Failover eliminates the single point of failure in the management of the Sun Fire high-end system. The fomd daemon identifies and handles as many multiple points of failure as possible. Some failover scenarios are discussed in Failure and Recovery.

At any time during SC failover, the failover process does not adversely affect any configured or running domains except for temporary loss of services from the SC.

The failover management daemon (fomd(1M)) is the core of the SC failover mechanism. It is installed on both the main and spare SCs.

You do not need to know the host name of the main SC to establish connections to it. As part of configuring SMS (refer to the smsconfig(1M) man page), a logical host name was created which is always active on the main SC. Refer to the Sun Fire 15K/12K System Site Planning Guide and the System Management Services (SMS) 1.6 Installation Guide for information on the creation of the logical host names in your network database.

Operations interrupted by an SC failover can be recovered after the failover completes. Reissuing the interrupted operation causes it to resume and continue to completion.

All automated functions provided by fomd resume without operator intervention after SC failover. Any recovery actions interrupted before completion by the SC failover restart.

Fault Monitoring

In a main-initiated failover, the fomd running on the main SC yields control to the spare SC in response to either an unrecoverable local hardware or software failure or an operator request.

In a spare-initiated failover (takeover), the fomd running on the spare determines that the main SC is no longer functioning properly.

If the I2 network path between the SCs is down and there is a fault on the main, the main switches itself to the role of spare. Upon detecting this, the spare SC assumes the role of main.

In the last two scenarios, the spare fomd eliminates the possibility of a split-brain condition by resetting the main SC.

When either a software-controlled or a user-forced failover occurs, fomd deactivates the failover mechanism. This eliminates the possibility of repeatedly failing over back and forth between the two SCs.

File Propagation

One of the purposes of the fomd is propagation of data from the main SC to the spare SC through the interconnects that exist between the two SCs. This data includes configuration, data, and log files.

Should both interconnections between the two SCs fail, failover can still occur provided main and spare SC accesses to the high-availability SRAMs (HASRAMs) remain intact. Due to the failure of both interconnections, propagation of SMS data can no longer occur, creating the potential of stale data on the spare SC. In the event of a failover, fomd on the new main keeps the current state of the data, logs the state, and provides other SMS daemons and clients information about the current state of the data.

When either of the interconnects between the two SCs is healthy again, data is pulled over depending on the timestamp of each SMS file. If the timestamp of the file is earlier than the one on the SC now acting as the spare, it gets transferred over. If the timestamp of the file is later than the one on the spare SC, no action is taken.

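If both interconnections between the SCs fail and access to both HASRAMs is also lost, failover cannot occur. This is considered a quadruple fault, and failover is disabled until at least one of the links is restored.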

Failover Management

Startup

For the failover software to function, both SCs must be present in the system. The determination of main and spare roles is based in part on the SC number. This slot number does not prevent a given SC from assuming either role - it only controls how it goes about doing so.

If SMS is started on one SC first, that SC becomes main. If SMS starts up on both SCs at essentially the same time, whichever SC first determines that the other SC either is not main or is not running SMS becomes main.

If SC0 is in the middle of the startup process, it queries SC1 for its role, and if the SC1 role cannot be confirmed, SC0 tries to become main. SC0 resets SC1 during this process. This is done to prevent both SCs from assuming the main role, a condition known as split brain. The reset occurs even if the failover mechanism is deactivated.
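
The SC0 startup behavior described above can be sketched as a small decision function. This is an illustrative model only, not fomd code: the names Role, StartupDecision, and sc0_startup_decision are hypothetical, and the handling of a confirmed spare (SC0 becomes main without resetting SC1) is an assumption, because the text describes the reset only for the unconfirmed case.

```python
from enum import Enum
from typing import NamedTuple, Optional


class Role(Enum):
    MAIN = "MAIN"
    SPARE = "SPARE"


class StartupDecision(NamedTuple):
    role: Role         # role SC0 tries to assume
    reset_peer: bool   # whether SC0 resets SC1 while doing so


def sc0_startup_decision(sc1_role: Optional[Role]) -> StartupDecision:
    """Decide SC0's role from SC1's reported role (None = not confirmed)."""
    if sc1_role is Role.MAIN:
        # The other SC is already main, so SC0 does not compete for the role.
        return StartupDecision(Role.SPARE, reset_peer=False)
    if sc1_role is Role.SPARE:
        # SC1 is confirmed not to be main, so SC0 can become main.
        # Assumption: no reset is needed when the peer role is confirmed.
        return StartupDecision(Role.MAIN, reset_peer=False)
    # Role unconfirmed: SC0 tries to become main and resets SC1
    # so that both SCs cannot end up as main (split brain).
    return StartupDecision(Role.MAIN, reset_peer=True)
```

In this model, passing None stands for a query to SC1 that could not be confirmed, which is the only case that triggers the reset.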

Main SC

Upon startup, the fomd running on the main SC begins periodically testing the hardware and network interfaces. Initially the failover mechanism is disabled (internally) until at least one status response has been received from the remote (spare) SC indicating that it is healthy.

If a local fault is detected by the main fomd during initial startup, failover occurs when all of the following conditions are met:

Spare SC

Upon startup, fomd runs on the spare SC and begins periodically testing the software, hardware, and network interfaces.

If a local fault is detected by the fomd running on the spare SC during initial startup, it informs the main fomd of its debilitated state.

Failover CLI Commands

setfailover Command

The setfailover command modifies the state of the SC failover mechanism. The default state is on. The syntax of the setfailover command is as follows:

# setfailover [-q] [-y|-n] [on|off|force]

Forcing a failover to a spare SC with a faulty clock can cause the affected domains to domain stop (dstop). The setfailover command detects faulty clocks on spare SCs and provides a second-chance confirmation prompt to avoid accidentally forcing a failover to a faulty SC. However, the -q (quiet) and -y (yes to all prompts) options do not allow checking for a faulty SC.

The following is an example of the setfailover command detecting a faulty clock on the spare SC:

# setfailover force

Forcing failover. Do you want to continue (yes/no)? yes

The spare clock input on some boards might be bad. Forcing a failover now is likely to cause the affected domains to domain stop (Dstop).

Do you want to continue (yes/no)? no

TABLE 12-1 Options for Modifying Failover States
Option	Definition
[-q]	Enables quiet mode, which suppresses all messages to `stdout` including prompts. When used alone, `-q` defaults to the `-n` option for all prompts. When used with either the `-y` or the `-n` option, `-q` suppresses all user prompts and automatically answers with either yes or no based on the option chosen.
[-y|-n]	`-y` automatically answers yes to all prompts. Prompts are displayed unless used with the `-q` option. Use with caution. `-n` automatically answers no to all prompts. Prompts are displayed unless used with the `-q` option.
on	Enables failover for systems that previously had failover disabled due to a failover or an operator request. This option instructs the command to attempt to re-enable failover only. If failover cannot be re-enabled, subsequent use of the `showfailover` command indicates the current failure that prevented the enable.
off	Disables the failover mechanism. This prevents a failover until the mechanism is re-enabled.
force	Forces a failover to the spare SC. The spare SC must be available and healthy.
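
The way the -q, -y, and -n options combine in TABLE 12-1 can be summarized in a short sketch. This models only the prompt handling described in the table and is not the setfailover implementation; the names PromptBehavior and prompt_behavior are hypothetical.

```python
from typing import NamedTuple, Optional


class PromptBehavior(NamedTuple):
    show_prompt: bool            # is the prompt text written to stdout?
    auto_answer: Optional[str]   # "yes" or "no", or None to ask the operator


def prompt_behavior(quiet: bool = False,
                    answer_flag: Optional[str] = None) -> PromptBehavior:
    """Model how -q, -y, and -n combine, following TABLE 12-1.

    answer_flag is "y", "n", or None when neither -y nor -n is given.
    """
    if answer_flag not in (None, "y", "n"):
        raise ValueError("answer_flag must be 'y', 'n', or None")
    if quiet:
        # -q suppresses prompts; used alone it defaults to -n.
        return PromptBehavior(False, "yes" if answer_flag == "y" else "no")
    if answer_flag is not None:
        # -y or -n without -q: prompts are displayed but answered automatically.
        return PromptBehavior(True, "yes" if answer_flag == "y" else "no")
    # No options: the operator sees each prompt and answers it.
    return PromptBehavior(True, None)
```

For example, prompt_behavior(quiet=True) models -q used alone: no prompt is shown and every prompt is answered no.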

showfailover Command

The showfailover command allows you to monitor the state and display the current status of the SC failover mechanism. The -v option displays the current status of all monitored components.

xc30p13-sc0:sms-svc:13> showfailover -v
SC Failover Status: ACTIVE

Status of Shared Memory:
HASRAM (CSB at CS0): .................................Good
HASRAM (CSB at CS1): .................................Good

Status of xc30p13-sc0:
Role: ................................................MAIN
SMS Daemons: .........................................Good
System Clock: ........................................Good
Private I2 Network: ..................................Good
Private HASRAM Network: ..............................Good
Public Network: ................................NOT TESTED
System Memory: ......................................38.9%
Disk Status:
/: ..................................................17.4%
Console Bus Status:
EXB at EX1: ..........................................Good
EXB at EX2: ..........................................Good
EXB at EX4: ..........................................Good

Status of xc30p13-sc1:
Role: ...............................................SPARE
SMS Daemons: .........................................Good
System Clock: ........................................Good
Private I2 Network: ..................................Good
Private HASRAM Network: ..............................Good
Public Network: ................................NOT TESTED
System Memory: ......................................34.2%
Disk Status:
/: ..................................................17.1%
Console Bus Status:
EXB at EX1: ..........................................Good
EXB at EX2: ..........................................Good
EXB at EX4: ..........................................Good
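
If the verbose output is captured to a file for monitoring, it can be turned into a nested dictionary with a few lines of Python. The following is a minimal sketch that assumes the line layout shown in the sample above (a label, a run of dots, then a value). The function name parse_showfailover and the "General" section name are hypothetical, and the real output format may differ between SMS releases.

```python
import re
from typing import Dict

# A status line is "<label>[:] <run of dots> <value>". The colon is treated
# as optional because captured output does not always include it.
_STATUS = re.compile(r"^(?P<label>.+?):?\s*\.{3,}\s*(?P<value>\S.*)$")
# The echoed command line, for example "host:user:13> showfailover -v".
_PROMPT = re.compile(r"^\S+>\s")


def parse_showfailover(text: str) -> Dict[str, Dict[str, str]]:
    """Group status lines under the most recent "Status of ..." heading."""
    report: Dict[str, Dict[str, str]] = {"General": {}}
    section, subsection = "General", ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or _PROMPT.match(line):
            continue  # skip blank lines and the echoed shell prompt
        status = _STATUS.match(line)
        if status:
            label = status.group("label").strip()
            key = f"{subsection}: {label}" if subsection else label
            report[section][key] = status.group("value")
        elif line.endswith(":"):
            heading = line[:-1]
            if heading.startswith("Status of "):
                # New top-level section, for example one SC.
                section, subsection = heading[len("Status of "):], ""
                report.setdefault(section, {})
            else:
                # Sub-heading such as "Disk Status" inside an SC section.
                subsection = heading
        else:
            # Plain "key: value" line without dots.
            key, _, value = line.partition(":")
            report[section][key.strip()] = value.strip()
    return report
```

With the sample above, the role of the first SC would be available as report["xc30p13-sc0"]["Role"], and its root file system usage as report["xc30p13-sc0"]["Disk Status: /"].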

sc0:sms-user:> showfailover -r

MAIN

sc0:sms-user:> showfailover

SC Failover Status: state

The failover mechanism can be in one of four states: ACTIVATING, ACTIVE, DISABLED, and FAILED. TABLE 12-2 describes the four states.

TABLE 12-2 States of the Failover Mechanism
State	Definition
ACTIVATING	The failover mechanism is preparing to transition to the ACTIVE state. Failover becomes active when all tests have passed and files have been synchronized.
ACTIVE	The failover mechanism is enabled and functioning normally.
DISABLED	The failover mechanism has been disabled due to the occurrence of a failover or an operator request (`setfailover` off).
FAILED	The failover mechanism has detected a failure that prevents a failover from being possible, or failover has not yet completed activation.

In addition, showfailover displays the state of each of the network interface links monitored by the failover processes. The display format is as follows:

network i/f device name: [GOOD|FAILED]

The showfailover command returns a failure string describing the failure condition. Each failure string has a code associated with it. TABLE 12-3 describes the showfailover command failure strings.

TABLE 12-3 `showfailover` Failure Strings
String	Explanation
None	No failure.
S-SC EXT NET	The spare SC external network interface has failed.
S-SC CONSOLE BUS	A fault has been detected on the spare SC console bus paths.
S-SC LOC CLK	The spare SC local clock has failed.
S-SC DISK FULL	The spare SC system disk is full.
S-SC IS DOWN	The spare SC is down or unresponsive. If this message results from the I2 network or HASRAMs being down, the spare SC could still be running. Log in to the spare SC to verify.
S-SC MEM EXHAUSTED	The spare SC memory or swap space has been exhausted.
S-SC SMS DAEMON	At least one SMS daemon could not be started or restarted on the spare SC.
S-SC INCOMPATIBLE SMS VERSION	The spare SC is running a different version of SMS software. Both SCs must be running the same version.
I2 NETWORK/HASRAM DOWN	Both interfaces for communication between the SCs are down. The main cannot tell what version of SMS is running on the spare or what its state is. It declares the spare down and logs a message to that effect. Dependent services, including file propagation, are unavailable.
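
A monitoring script might keep the strings from TABLE 12-3 in a lookup table so that operators see an explanation rather than a bare string. The sketch below assumes the failure string has already been extracted from the showfailover output; the names FAILURE_STRINGS, describe_failure, and is_spare_side are hypothetical, and the explanations are shortened from the table.

```python
# Failure strings and short explanations, taken from TABLE 12-3.
FAILURE_STRINGS = {
    "None": "No failure.",
    "S-SC EXT NET": "Spare SC external network interface has failed.",
    "S-SC CONSOLE BUS": "Fault detected on the spare SC console bus paths.",
    "S-SC LOC CLK": "Spare SC local clock has failed.",
    "S-SC DISK FULL": "Spare SC system is full.",
    "S-SC IS DOWN": "Spare SC is down or unresponsive.",
    "S-SC MEM EXHAUSTED": "Spare SC memory or swap space is exhausted.",
    "S-SC SMS DAEMON": "An SMS daemon could not be started on the spare SC.",
    "S-SC INCOMPATIBLE SMS VERSION": "Spare SC runs a different SMS version.",
    "I2 NETWORK/HASRAM DOWN": "Both SC-to-SC communication interfaces are down.",
}


def describe_failure(failure: str) -> str:
    """Return the explanation for a failure string, or flag it as unknown."""
    key = failure.strip()
    return FAILURE_STRINGS.get(key, f"Unrecognized failure string: {key!r}")


def is_spare_side(failure: str) -> bool:
    """True for strings that report a condition on the spare SC (S-SC prefix)."""
    return failure.strip().startswith("S-SC ")
```

Unknown strings are reported rather than rejected, on the assumption that other SMS releases may return strings not listed here.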

Command Synchronization

If an SC failover occurs during the execution of a command, you can restart the same command on the new main SC.

The four CLI commands in SMS that require command sync support are addboard, deleteboard, moveboard, and rcfgadm.

cmdsync CLIs

The cmdsync commands provide the ability to initialize a script or command with a cmdsync descriptor, update an existing cmdsync descriptor execution point, or cancel a cmdsync descriptor from the spare SC's list of recovery actions. Commands or scripts can also be run in a cmdsync envelope.

In the case of an SC failover to the spare, initialization of a cmdsync descriptor on the spare SC enables the spare SC to restart or resume the target script or command from the last execution point set. These commands execute only on the main SC, and have no effect on the current cmdsync list if executed on the spare.

Commands or scripts invoked with the cmdsync commands when there is no enabled spare SC result in a no-op. That is, command execution proceeds as normal, but a log entry in the platform log indicates that a cmdsync attempt has failed.

initcmdsync Command

The initcmdsync(1M) command creates a cmdsync descriptor. The target script or command and its associated parameters are saved as part of the cmdsync data. The exit code of the initcmdsync command provides a cmdsync descriptor that can be used in subsequent cmdsync commands to reference the action. Actual execution of the target command or script is not performed. For more information, refer to the initcmdsync(1M) man page.

savecmdsync Command

The savecmdsync(1M) command saves a new execution point in a previously defined cmdsync descriptor. This allows a target command or script to restart execution at a location associated with an identifier. The target command or script must support being restarted at this execution point; otherwise, execution restarts at the beginning of the target command or script. For more information, refer to the savecmdsync(1M) man page.

cancelcmdsync Command

The cancelcmdsync(1M) command removes a cmdsync descriptor from the spare restart list. Once this command is run, the target command or script associated with the cmdsync descriptor is not restarted on the spare SC in the event of a failover. Take care to ensure that all target commands or scripts contain an initcmdsync command sequence as well as a cancelcmdsync sequence after the normal or abnormal termination flows. For more information, refer to the cancelcmdsync(1M) man page.

runcmdsync Command

The runcmdsync(1M) command executes the specified target command or script under a cmdsync wrapper. You cannot restart at execution points other than the beginning. The target command or script is executed through the system command after creation of the cmdsync descriptor. Upon termination of the system command, the cmdsync descriptor is removed from the cmdsync list, and the exit code of the system command is returned to the user. For more information, refer to the runcmdsync(1M) man page.

showcmdsync Command

The showcmdsync(1M) command displays the current cmdsync descriptor list. For more information, refer to the showcmdsync(1M) man page.
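
The descriptor lifecycle described in this section (initialize, save an execution point, cancel, list) can be pictured with a small in-memory model. This is a conceptual sketch only: it does not call the SMS commands, the class and method names are hypothetical, and the integer descriptor IDs are an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Descriptor:
    command: List[str]                     # target command or script and its parameters
    execution_point: Optional[str] = None  # last saved restart identifier, if any


class CmdSyncList:
    """In-memory stand-in for the spare SC's list of recovery actions."""

    def __init__(self) -> None:
        self._entries: Dict[int, Descriptor] = {}
        self._next_id = 1

    def init(self, command: List[str]) -> int:
        """Like initcmdsync: register the target without running it."""
        descriptor_id = self._next_id
        self._next_id += 1
        self._entries[descriptor_id] = Descriptor(list(command))
        return descriptor_id

    def save(self, descriptor_id: int, identifier: str) -> None:
        """Like savecmdsync: record a new execution point."""
        self._entries[descriptor_id].execution_point = identifier

    def cancel(self, descriptor_id: int) -> None:
        """Like cancelcmdsync: drop the entry so it is not restarted."""
        self._entries.pop(descriptor_id, None)

    def pending(self) -> Dict[int, Descriptor]:
        """Like showcmdsync: entries that would be restarted after a failover."""
        return dict(self._entries)
```

A script following the advice above would call init at its start, save at each restartable step, and cancel on both its normal and abnormal exit paths, so that nothing stale remains in the list.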

Data Synchronization

Customized data synchronization is provided in SMS by the setdatasync(1M) command. setdatasync enables you to specify a user-created file to be added to or removed from the data propagation list.

setdatasync Command

The setdatasync list identifies the files to be copied from the main to the spare system controller (SC) as part of data synchronization for automatic failover. The specified user file and the directory in which it resides must have read and write permissions for you on both SCs. You must also have platform or domain privileges.

The data synchronization process checks the user-created files on the main SC for any changes. If the user-created files on the main SC have changed since the last propagation, they are repropagated to the spare SC. By default, the data synchronization process checks a specified file every 60 minutes; however, you can use setdatasync to indicate how often a user file is checked for modifications.
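
The check described above amounts to comparing a file's state against the time of its last propagation, repeated at a fixed interval. The sketch below assumes that a change is detected through the file's modification time, which the text does not state. The names needs_propagation and next_check are hypothetical, and this is not the SMS data synchronization code.

```python
import os

DEFAULT_INTERVAL_MINUTES = 60  # default check interval stated in the text


def needs_propagation(path: str, last_propagated: float) -> bool:
    """Return True if the file was modified after the last propagation.

    last_propagated is a POSIX timestamp (seconds since the epoch).
    Assumption: modification time is the change indicator.
    """
    return os.path.getmtime(path) > last_propagated


def next_check(last_check: float,
               interval_minutes: int = DEFAULT_INTERVAL_MINUTES) -> float:
    """Return the POSIX timestamp at which the file is next due for a check."""
    return last_check + interval_minutes * 60
```

A shorter interval detects changes sooner at the cost of more frequent checks on the main SC.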

You can also use setdatasync to propagate a specified file to the spare SC without adding the file to the data propagation list.

The time required to execute setdatasync backup is proportional to the number of files being transferred. Other factors that can affect the speed of file transfer include: the average size of files being transferred, the amount of memory available on the SCs, the load (CPU cycles and disk traffic) on the SCs, and whether the I2 network is functioning.

showdatasync Command

The showdatasync command provides the current status of files being propagated (copied) from the main SC to its spare. The showdatasync command also provides the list of files registered using setdatasync and their status. Data propagation synchronizes data on the spare SC with data on the main SC, so that the spare SC is current with the main SC if an SC failover occurs.

Failure and Recovery

In a high-availability configuration, fomd manages the failover mechanism on the local and remote SCs. The fomd daemon detects the presence of local hardware and software faults and determines the appropriate action to take.

The fomd daemon is responsible for detecting the faults described in TABLE 12-4.

TABLE 12-4 `fomd` Hardware and Software Fault Categories
Category	Description
a	All relevant hardware buses that are local to the SC Control board (CB)/CPU board.
b	The external network interfaces.
c	The I2 network interface between the SCs.
d	Unrecoverable software failures. This category is for those cases where an SMS software component (daemon) crashes and cannot be restarted after three attempts, the file system is full, the heap is exhausted, and so forth.

TABLE 12-5 illustrates how faults in the categories affect the failover mechanism. Assume that the failover mechanism is activated.

TABLE 12-5 Failover Fault Categories
Failure Point	Main SC	Spare SC	Failover	Notes
a	X		X	Failover to spare occurs.
a		X	Disables	No effect on the main SC, but the spare SC has suffered a hardware fault so failover is disabled.
b	X		X	Failover to spare.
b		X	No effect	The fact that the spare SC external network interfaces have failed does not affect the failover mechanism.
c			No effect	Main and spare SC log the fault.
d	X		X	Failover to the spare SC, assuming that it is healthy.
d		X	Disables	Failover is disabled because the spare SC is deemed unhealthy at this point.
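
TABLE 12-5 can be read as a lookup from fault category and fault location to the effect on the failover mechanism. The following sketch encodes the table rows under the table's own assumption that the failover mechanism is activated; the names Effect and failover_effect are hypothetical.

```python
from enum import Enum
from typing import Optional


class Effect(Enum):
    FAILOVER = "failover to spare"
    DISABLED = "failover disabled"
    NO_EFFECT = "no effect"


# (fault category, SC on which the fault occurred) -> effect, per TABLE 12-5.
# Category "c" (I2 network) lies between the SCs, so its location is None.
_EFFECTS = {
    ("a", "main"): Effect.FAILOVER,
    ("a", "spare"): Effect.DISABLED,
    ("b", "main"): Effect.FAILOVER,
    ("b", "spare"): Effect.NO_EFFECT,
    ("c", None): Effect.NO_EFFECT,
    ("d", "main"): Effect.FAILOVER,
    ("d", "spare"): Effect.DISABLED,
}


def failover_effect(category: str, location: Optional[str] = None) -> Effect:
    """Look up the effect of a fault, assuming failover is activated."""
    if category == "c":
        location = None  # the I2 fault is not attributed to either SC
    try:
        return _EFFECTS[(category, location)]
    except KeyError:
        raise ValueError(f"unknown fault: {category!r} on {location!r}") from None
```

For category d on the main SC, the table adds the condition that the spare SC is healthy; the sketch does not model that check.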

Failover on Main SC (Main-Controlled Failover)

3. Tells the remote failover software to start a takeover timer. The purpose of this timer is to provide an alternate means for the remote (spare) SC to take over if for any reason the main hangs and never reaches a count of 10.

11. Notifies remote (spare) failover software that it should assume the role of main. If the takeover timer expires before the spare is notified, the remote SC takes over on its own.

1. Receives message from the main fomd to assume main role, or the takeover timer expires. If the former is true, then the takeover timer is stopped.

3. Notifies hwad, frad, and mand to configure the spare fomd in the main role.

11. The spare SC is now the main, and fomd deactivates the failover mechanism.

Fault on Main SC (Spare Takes Over Main Role)

In this scenario, the spare SC takes main control in reaction to loss of communication with the main SC. The most important aspect of this type of failover is the prevention of the split-brain condition. Another assumption is that the failover mechanism is not deactivated. If it has been deactivated, no takeover can occur.

From the spare fomd perspective, this phenomenon can be caused by two conditions: the main SC is truly dead, or the I2 network interface is down.

In the former case, a failover is needed (provided that the failover mechanism is activated), while in the latter it is not. To identify which is the case, the spare fomd polls for the presence of heartbeat interrupts from the main SC to determine if the main SC is still up and running. As long as heartbeat interrupts are being received, or the failover mechanism is deactivated or disabled, no failover occurs.

In the case where no interrupts are detected but the failover mechanism is deactivated, the spare fomd does not attempt to take over unless the operator manually activates the failover mechanism using the CLI command setfailover. Otherwise, if the spare SC is healthy, the spare fomd proceeds to take over the role of main.
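
The takeover decision described in the preceding paragraphs reduces to three conditions. The sketch below restates them as a single function; it is a reading of the text, not fomd source, and the name spare_should_take_over and its parameters are hypothetical.

```python
def spare_should_take_over(heartbeat_seen: bool,
                           failover_activated: bool,
                           spare_healthy: bool) -> bool:
    """Decide whether the spare fomd takes over the main role."""
    if heartbeat_seen:
        # The main SC is still running, so only the I2 link is suspect.
        return False
    if not failover_activated:
        # No takeover until the operator activates failover with setfailover.
        return False
    # No heartbeat and failover activated: take over only if the spare is healthy.
    return spare_healthy
```

Checking the heartbeat first reflects the goal stated above of avoiding a split-brain condition when only the I2 network has failed.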

1. Reconfigures itself as main. This includes taking over control of the I²C bus, configuring the logical main SC IP address, and starting up the necessary SMS software daemons.

I2 Network Fault

2. The main fomd stops propagating files and checkpointing data over to the spare SC.

From the spare fomd perspective, this phenomenon can be caused by two conditions: the main SC is truly malfunctioning, or the I2 network interface is down. In the former case, the corrective action is to fail over, while in the latter, it is not. To identify which is the case, the fomd starts polling for the presence of heartbeat interrupts from the main SC to determine if the main SC is still up and running. If heartbeat interrupts are present, the fomd keeps the spare as spare.

Fault on Main SC (I2 Network Is Also Down)

The following lists the events, in order, that occur after a fault on the main SC.

If the last known state of the spare SC was good, then the main fomd stops generating heartbeats. Otherwise, failover does not continue.

If the access to the console bus is still available, the main failover software finishes propagating any remaining critical files to HASRAM and flushes out any or all critical state information to HASRAM.

Fault Recovery and Reboot

I2 Fault Recovery

The following lists the events, in order, that occur during an I2 network fault recovery.

If the spare SC is completely healthy as indicated in the health status response message, the fomd enables failover and, assuming that the failover mechanism has not been deactivated by the operator, does a complete re-sync of the log files and checkpointing data over to the spare SC.

The spare fomd disables failover and clears out the checkpoint data on the local disk.

Reboot and Recovery

The following lists the events, in order, that occur during a reboot and recovery. A reboot and recovery scenario happens in two cases.

Main SC Receives a Master Reset or Its UltraSPARC Processor Receives a Reset

1. Assume SSCPOST passed without any problems. If SSCPOST failed and the OS cannot be booted, the main is inoperable.

5. The fomd configures the logical main IP address and starts up the rest of the SMS software.

Spare SC Receives a Master Reset or Its UltraSPARC Processor Receives a Reset

1. Assume SSCPOST passed without any problems. If SSCPOST failed and the OS cannot be booted, the spare is inoperable.

2. Assume all SSC Solaris drivers attached without any problems. If the SBBC driver fails to attach, or any other drivers fail to attach, the spare SC is deemed inoperable.

4. The fomd determines that the SC is the preferred spare and assumes the spare role.

5. The fomd starts checking for the presence of heartbeat interrupts from the remote (initially presumed to be main) SC.

If after a configurable amount of time no heartbeat interrupts are detected, the failover mechanism state is checked. If enabled and activated, fomd initiates a takeover. See Step 5 of Main SC Receives a Master Reset or Its UltraSPARC Processor Receives a Reset. Otherwise, fomd continues monitoring for the presence of heartbeat interrupts and the state of the failover mechanism.

6. The fomd starts periodically checking the hardware, software, and network interfaces.

Client Failover Recovery

The following lists the events that occur during a client failover recovery. A recovery scenario happens in the following two cases.

Fault on Main SC-Recovering From the Spare SC

Clients with any operations in progress are manually recovered by checkpointing any recurring data.

Fault on Main SC (With I2 Network Down)-Recovering From the Spare SC

Since the I2 network is down, all checkpointing data is removed. Clients cannot perform any recovery.

Reboot Main SC (With Spare SC Down)

Reboot of Spare SC

Security

All failover-specific network traffic (such as health status request or response messages and file propagation packets) is sent only over the interconnect network that exists between the two SCs.
