CHAPTER 8

System Controller Failover

Sun Fire midrange systems can be configured with two system controllers for high availability. In a high-availability system controller (SC) configuration, one SC serves as the main SC, which manages all the system resources, while the other SC serves as a spare. When certain conditions cause the main SC to fail, a switchover or failover from the main SC to the spare is triggered automatically, without operator intervention. The spare SC assumes the role of the main and takes over all system controller responsibilities.

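This chapter explains the following:

- SC Failover Overview
- SC Failover Prerequisites
- Conditions That Affect Your SC Failover Configuration
- Managing SC Failover
- Recovering After an SC Failover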

SC Failover Overview

The SC failover capability is enabled by default on Sun Fire midrange servers that have two System Controller boards installed. The failover capability includes both automatic and manual failover. In automatic SC failover, a failover is triggered when certain conditions cause the main SC to fail or become unavailable. In manual SC failover, you force the switchover of the spare SC to the main.

The failover software performs the following tasks to determine when a failover from the main SC to the spare is necessary and to ensure that the system controllers are failover-ready:

If at any time the spare SC is not available or does not respond, the failover mechanism disables SC failover. If SC failover is enabled but the connection link between the SCs is down, failover remains enabled and active until the system configuration changes. After a configuration change, such as a change in platform or domain parameter settings, the failover mechanism remains enabled but is no longer active: because the connection link is down, the SCs are not in a failover-ready state. To check the SC failover state, use commands such as showfailover or showplatform, as explained in To Obtain Failover Status Information.

What Triggers an Automatic Failover

A failover from the main to the spare SC is triggered when one of the following failure conditions occurs:

What Happens During a Failover

An SC failover is characterized by the following:

The SC failover is logged in the platform message log file, which you can view on the console of the new main SC or by running the showlogs command on the SC. The logged information indicates that a failover has occurred and identifies the failure condition that triggered it.

CODE EXAMPLE 8-1 shows the type of information that appears on the console of the spare SC when a failover occurs due to a stop in the main SC heartbeat:

CODE EXAMPLE 8-1 Messages Displayed During an Automatic Failover

Platform Shell - Spare System Controller
schostname:sc> Nov 12 01:15:42 schostname Platform.SC: SC Failover: enabled and active.
Nov 12 01:16:42 schostname Platform.SC: SC Failover: no heartbeat detected from the Main SC
Nov 12 01:16:42 schostname Platform.SC: SC Failover: becoming main SC ...
Nov 12 01:16:49 schostname Platform.SC: Chassis is in single partition mode.
Nov 12 01:17:04 schostname Platform.SC: Main System Controller
Nov 12 01:17:04 schostname Platform.SC: SC Failover: disabled
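
Platform messages like those in CODE EXAMPLE 8-1 can be filtered programmatically when you review a saved console capture. The following Python sketch is illustrative only. It assumes each message has the shape "month day time hostname Platform.SC: text" seen in the example, and the names PlatformMessage, parse_line, and failover_messages are hypothetical helpers, not part of the system controller software.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Assumed line shape, inferred from CODE EXAMPLE 8-1:
#   "Nov 12 01:16:42 schostname Platform.SC: SC Failover: <text>"
LINE_RE = re.compile(
    r"(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) Platform\.SC: (?P<message>.*)$"
)


class PlatformMessage(NamedTuple):
    stamp: str    # timestamp exactly as logged (no year is present)
    host: str     # SC host name that logged the message
    message: str  # text that follows "Platform.SC: "


def parse_line(line: str) -> Optional[PlatformMessage]:
    """Parse one console line; return None if it is not a platform message."""
    # search() rather than match(), because a shell prompt may precede the timestamp.
    m = LINE_RE.search(line)
    if m is None:
        return None
    return PlatformMessage(m["stamp"], m["host"], m["message"].strip())


def failover_messages(lines: Iterable[str]) -> List[PlatformMessage]:
    """Keep only the messages whose text starts with 'SC Failover:'."""
    parsed = (parse_line(line) for line in lines)
    return [p for p in parsed if p is not None and p.message.startswith("SC Failover:")]
```

Applied to the lines of CODE EXAMPLE 8-1, this sketch keeps the four "SC Failover:" messages and skips the banner line and the two messages that do not begin with that prefix.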

The prompt for the main SC is hostname:SC>. The upper-case letters SC identify the main SC.

The prompt for the spare SC is hostname:sc>. The lower-case letters sc identify the spare SC.

When an SC failover occurs, the prompt for the spare SC changes and becomes the prompt for the main SC (hostname:SC> ), as shown in the last line of CODE EXAMPLE 8-1.
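
Because the case of the prompt suffix is the only visible difference between the two roles, a script that drives an SC session can use it to confirm which SC it is talking to. The following Python sketch assumes the prompt forms described above (hostname:SC> and hostname:sc>). The function name sc_role_from_prompt is hypothetical.

```python
def sc_role_from_prompt(prompt: str) -> str:
    """Return 'main' or 'spare' based on the platform shell prompt suffix."""
    text = prompt.strip()
    if text.endswith(":SC>"):
        return "main"   # upper-case SC identifies the main SC
    if text.endswith(":sc>"):
        return "spare"  # lower-case sc identifies the spare SC
    # Any other prompt is outside what this sketch understands.
    raise ValueError(f"not a platform shell prompt: {prompt!r}")
```

A caller might check that the result is "main" before sending commands that are intended for the main SC.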

When an SC failover is in progress, command execution is disabled.

The recovery time for an SC failover from the main to the spare is approximately five minutes or less. This recovery period consists of the time required to detect a failure and direct the spare SC to assume the responsibilities of the main SC.
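
As a rough check against the five-minute figure, you can compute the time between two logged messages. The following Python sketch is an illustration under stated assumptions: the log timestamps carry no year, so a placeholder year is supplied for parsing, month names are assumed to be English abbreviations, and both timestamps are assumed to fall in the same year. It measures only the interval between two log entries, which is not necessarily the full recovery period described above. All names are hypothetical.

```python
from datetime import datetime

# Assumption: the log does not record a year, so a placeholder is used for parsing.
ASSUMED_YEAR = 2000

# Upper bound taken from the text: five minutes.
RECOVERY_BOUND_S = 5 * 60


def parse_stamp(stamp: str) -> datetime:
    """Parse a timestamp such as 'Nov 12 01:16:42' (English month names assumed)."""
    return datetime.strptime(f"{ASSUMED_YEAR} {stamp}", "%Y %b %d %H:%M:%S")


def elapsed_seconds(start_stamp: str, end_stamp: str) -> float:
    """Seconds between two timestamps that fall in the same (assumed) year."""
    return (parse_stamp(end_stamp) - parse_stamp(start_stamp)).total_seconds()


def within_recovery_bound(start_stamp: str, end_stamp: str) -> bool:
    """True if the interval is non-negative and no longer than five minutes."""
    return 0 <= elapsed_seconds(start_stamp, end_stamp) <= RECOVERY_BOUND_S
```

For the timestamps in CODE EXAMPLE 8-1, the interval from the "no heartbeat detected" message (01:16:42) to the "Main System Controller" message (01:17:04) is 22 seconds.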

The failover process does not affect any running domains, except for temporary loss of services from the system controller.

After an automatic or manual failover occurs, the failover capability is automatically disabled. This prevents the possibility of repeated failovers back and forth between the two SCs.

A failover closes SSH or Telnet sessions connected to the domain console, and any domain console output is lost. When you reconnect to the domain through an SSH or Telnet session, you must specify the host name or IP address of the new main SC, unless you previously assigned a logical host name or IP address to your main system controller. See the next section for an explanation of the logical host name and IP address.

The remainder of this chapter describes SC failover prerequisites, conditions that affect your SC failover configuration, and how to manage SC failover, including how to recover after an SC failover occurs.

SC Failover Prerequisites

This section identifies SC failover prerequisites and optional platform parameters that can be set for SC failover:

Starting with the 5.13.0 release, SC failover requires that you run the same version of the firmware on both the main and spare system controllers. Be sure to follow the instructions for installing and upgrading the firmware described in the Install.info file that accompanies the firmware release.
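
If you record the firmware version of each SC, the same-version requirement can be checked with a numeric comparison. The following Python sketch assumes versions are dotted integers in the style of 5.13.0. The function names are hypothetical, and how you obtain the version strings is outside the scope of the sketch.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Convert a dotted version such as '5.13.0' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def compare_sc_firmware(main_version: str, spare_version: str) -> str:
    """Return 'match', 'main newer', or 'spare newer'."""
    main, spare = parse_version(main_version), parse_version(spare_version)
    if main == spare:
        return "match"
    return "main newer" if main > spare else "spare newer"
```

Comparing integer tuples rather than raw strings matters here: as text, "5.9.0" sorts after "5.13.0", but numerically 5.9.0 is the older release.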

You can optionally perform the following after you install or upgrade the firmware on each SC:

The logical host name or IP address identifies the working main system controller, even after a failover occurs. Assign the logical IP address or host name by running the setupplatform command on the main SC.

Note - The logical host name or IP address is required if you are using Sun Management Center software for Sun Fire midrange systems.

The date and time on the two SCs must be synchronized to ensure that the same time service is provided to the domains. Run the setupplatform command on each SC to identify the host name or IP address of the system to be used as the SNTP server (reference clock).

For further information on setting the platform date and time, see To Set the Date, Time, and Time Zone for the Platform.

Conditions That Affect Your SC Failover Configuration

If you power cycle your system (power your system off then on), note the following:

Certain factors, such as disabling or running SC POST with different diag levels, influence which SC is booted first.

If SC failover is disabled at the time a power cycle occurs, it is possible for the new main SC to boot with a stale SC configuration.

When SC failover is disabled, data synchronization does not occur between the main and spare SC. As a result, any configuration changes made on the main SC are not propagated to the spare. If the roles of the main and spare SC change after a power cycle, scapp on the new main SC will boot with a stale SC configuration. As long as SC failover is enabled and active, data on both SCs will be synchronized, and it will not matter which SC becomes the main SC after the power cycle.

Managing SC Failover

You control the failover state by using the setfailover command, which enables you to do the following:

You can also obtain failover status information through commands such as showfailover or showplatform. For details, see To Obtain Failover Status Information.

To Disable SC Failover

From the platform shell on either the main or spare SC, type:

schostname:SC> setfailover off

A message indicates failover is disabled. Note that SC failover remains disabled until you re-enable it (see the next procedure).

To Enable SC Failover

From the platform shell on either the main or spare SC, type:

schostname:SC> setfailover on

The following message is displayed while the failover software verifies the failover-ready state of the system controllers:

SC Failover: enabled but not active.

Within a few minutes, after failover readiness has been verified, the following message is displayed on the console, indicating that SC failover is activated:

SC Failover: enabled and active.
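
An automation script that re-enables failover might wait for the second message before treating the system as failover-ready. The following Python sketch shows one way to poll for it. It is not tied to any real SC interface: read_status is a hypothetical callable that you would supply to return the most recent failover status text, and the 300-second timeout and 10-second polling interval are assumptions chosen to reflect "within a few minutes."

```python
import time
from typing import Callable

# Message text taken from the procedure above.
ACTIVE_MESSAGE = "SC Failover: enabled and active."


def wait_until_active(
    read_status: Callable[[], str],
    timeout_s: float = 300.0,   # assumed upper bound for "a few minutes"
    interval_s: float = 10.0,   # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll read_status() until failover is active or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        if ACTIVE_MESSAGE in read_status():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The sleep and clock parameters exist only so that the loop can be exercised without real delays. In normal use you would leave them at their defaults.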

To Perform a Manual SC Failover

1. Be sure that other SC commands are not currently running on the main SC.

2. From the platform shell on either the main or spare SC, type:

schostname:SC> setfailover force

A failover from one SC to the other occurs, unless there are fault conditions (for example, the spare SC is not available or the connection link between the SCs is down) that prevent the failover from taking place.

A message describing the failover event is displayed on the console of the new main SC.

Be aware that the SC failover capability is automatically disabled after the failover. If at some point you need the SC failover feature, be sure to re-activate failover (see To Enable SC Failover).

To Obtain Failover Status Information

Run any of the following commands from either the main or spare SC to display failover information:

The SC failover state can be one of the following:

- The main SC has a higher firmware version than the spare.

- A board in the system can be controlled by the main SC but not the spare.

In this case, the showfailover -v output indicates that the failover configuration is degraded and identifies the boards that cannot be managed by the spare SC. For example:

CODE EXAMPLE 8-3 showfailover Command Output - Failover Degraded Example

schostname:SC> showfailover -v
Main System Controller
SC Failover: enabled and active.
Clock failover enabled.
SC Failover: Failover is degraded
SC Failover: Please upgrade the other SC SSC1 running 5.13.0
SB0: COD CPU Board V2 not supported on 5.13.0
SB2: CPU Board V3 not supported on 5.13.0

If a degraded failover condition occurs, upgrade the spare system controller firmware to the same version used by the main system controller. For firmware upgrade instructions, refer to the Install.info file that accompanies the firmware release.
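
If you save showfailover -v output to a file, the degraded condition and the affected boards can be extracted with a small parser. The following Python sketch assumes the output follows the wording shown in CODE EXAMPLE 8-3 ("Failover is degraded" and "<slot>: <board> not supported on <version>"). Other firmware releases may word these lines differently, and the function name is hypothetical.

```python
import re
from typing import Dict

# Assumed line shape, inferred from CODE EXAMPLE 8-3:
#   "SB0: COD CPU Board V2 not supported on 5.13.0"
BOARD_RE = re.compile(
    r"^(?P<slot>[A-Z]+\d+): (?P<board>.+) not supported on (?P<version>\S+)$"
)


def parse_showfailover(output: str) -> Dict[str, object]:
    """Summarize saved showfailover -v output (format assumed from the example)."""
    lines = [line.strip() for line in output.splitlines()]
    unsupported: Dict[str, str] = {}
    for line in lines:
        m = BOARD_RE.match(line)
        if m:
            unsupported[m["slot"]] = m["board"]  # slot name -> board description
    return {
        "degraded": any("Failover is degraded" in line for line in lines),
        "unsupported_boards": unsupported,
    }
```

For the output in CODE EXAMPLE 8-3, the sketch reports a degraded configuration and lists SB0 and SB2 as the boards that the spare SC cannot manage.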

For details on these commands, refer to their descriptions in the Sun Fire Midrange System Controller Command Reference Manual.

Recovering After an SC Failover

This section explains the recovery tasks that you must perform after an SC failover occurs.

To Recover After an SC Failover Occurs

1. Identify the failure point or condition that caused the failover and determine how to correct the failure.

a. Use the showlogs command to review the platform messages logged for the working SC.

Evaluate these messages for failure conditions and determine the corrective action needed to reactivate any failed components.

b. If the syslog loghost has been configured, review the platform loghost to see any platform messages for the failed SC.

c. If you need to replace a failed System Controller board, see To Remove and Replace a System Controller Board in a Redundant SC Configuration.

If you need to hot-plug an SC (remove an SC that has been powered off and then insert a replacement SC), be sure to verify that the clock signals to the system boards are coming from the new main SC before you perform the hot-plug operation. Run the showboard -p clock command to verify the clock signal source.

d. If an automatic failover occurred while you were running the flashupdate, setkeyswitch, or DR commands, rerun those commands after you resolve the failure condition.

Any flashupdate, setkeyswitch, or DR operations are stopped when an automatic failover occurs. However, if you were running configuration commands such as setupplatform, it is possible that some configuration changes occurred before the failover. Be sure to verify whether any configuration changes were made.

For example, if you were running the setupplatform command when an automatic failover occurred, use the showplatform command to verify any configuration changes made before the failover. After you resolve the failure condition, run the appropriate commands to update your configuration as needed.

2. After you resolve the failover condition, re-enable SC failover by using the setfailover on command (see To Enable SC Failover).