Control board failover is automatically enabled upon SSP installation or upgrade. The fod daemon performs failover monitoring of the control boards and other failover components. If the primary control board is not functioning properly, the fod daemon triggers an automatic failover to the spare control board. A control board failure can be caused by any of the following:
A clock failure
When a clock failure occurs, all active domains arbstop simultaneously and a control board failover is automatically triggered. Both the system clock and JTAG interface are automatically moved to the spare control board. When the new control board is started, normal EDD recovery actions reboot the Sun Enterprise 10000 domains.
A JTAG interface failure
If the SSP cannot communicate with the JTAG interface, the SSP determines that the control board failed and automatically triggers a control board failover.
Failure of the Ethernet interface on the control board
Failure of the control board processor
Disconnected cable between the control board and the hub
Failure of the hub connected to the control board
Disconnected cable between the main SSP and the hub
Failure of the SSP network interface card (NIC) for the control board network
User error caused by disabling the NIC for the control board network.
Note that under certain failure conditions the fod daemon can disable a control board failover. For a detailed description of the failure conditions and a summary of the failover actions performed, see Chapter 10, SSP Internals.
A control board failover can be either partial or complete, depending on whether domains are running:
If domains are active and a control board failure condition is detected, a partial failover occurs.
In a partial failover, the JTAG interface is moved from the primary control board to the spare. However, the system clock source remains on the failed primary control board. You must complete the control board failover so that both the JTAG interface and system clock source are managed by the same control board. For details, see "To Force a Complete Control Board Failover".
If no domains are running and a control board failure condition is detected, a complete failover occurs.
In a complete control board failover, both the JTAG interface and the system clock source are moved from the primary control board to the spare.
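The rule above can be expressed as a small decision function. This is a minimal sketch of the documented behavior, not SSP code; the names failover_type and moved_to_spare are hypothetical and exist only for illustration.

```python
def failover_type(active_domains):
    """Return the kind of failover described above for a detected failure.

    active_domains is the number of domains running when the failure
    condition is detected (a hypothetical input for this sketch).
    """
    if active_domains < 0:
        raise ValueError("active_domains must be non-negative")
    return "partial" if active_domains > 0 else "complete"


def moved_to_spare(kind):
    """Report which resources end up on the spare control board."""
    if kind not in ("partial", "complete"):
        raise ValueError("kind must be 'partial' or 'complete'")
    # The JTAG interface moves in both cases; the system clock source
    # moves only in a complete failover.
    return {"jtag": True, "system_clock": kind == "complete"}


print(failover_type(3), moved_to_spare(failover_type(3)))
print(failover_type(0), moved_to_spare(failover_type(0)))
```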
You can enable, disable, or force a control board failover as explained in the following procedures. Use the setfailover(1M) command on the main SSP to manage the failover state. For example, after a control board failover occurs, you must use the setfailover(1M) command to re-enable the control board failover capability.
To disable control board failover, as user ssp on the main SSP, type:
ssp% setfailover -t cb off
Control board failover remains disabled until you enable it. To determine whether control board failover was disabled, use the showfailover(1M) command to verify the failover state, as explained in "Obtaining Control Board Failover Information".
To enable control board failover, as user ssp on the main SSP, type:
ssp% setfailover -t cb on
Control board failover is activated when all the connection links are functioning properly. If any failed connections exist, control board failover is not enabled. You can use the showfailover(1M) command to verify that control board failover is enabled and review the connection status.
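Because failover is not enabled while any connection is failed, it can help to list the failed links before running setfailover. The following is a minimal sketch that scans saved showfailover output for links whose status is not GOOD. The function name failed_links is hypothetical, and the one-entry-per-line layout of the connection map is an assumption based on the example output in this section.

```python
import re

# Sample connection map lines modeled on the showfailover example in this
# section. The line-per-link layout is an assumption.
SAMPLE = """\
Main SSP to Spare SSP thru Main Hub: GOOD
Main SSP to Primary Control Board: FAILED
Main SSP to Spare Control Board: GOOD
"""

# A link line looks like "<endpoint> to <endpoint>: <STATUS>".
LINK_RE = re.compile(r"^\s*(?P<link>.+ to .+?):\s*(?P<status>[A-Z]+)\s*$")


def failed_links(text):
    """Return the names of all links whose status is anything but GOOD."""
    bad = []
    for line in text.splitlines():
        match = LINK_RE.match(line)
        if match and match.group("status") != "GOOD":
            bad.append(match.group("link"))
    return bad


for link in failed_links(SAMPLE):
    print("failed connection:", link)
```

An empty result suggests that the links shown were all reported as GOOD; confirm the actual failover state with showfailover(1M).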
To force a complete control board failover, in which both the JTAG connection and the system clock source are moved from the primary control board to the spare, you must shut down all running domains and then power off and power on all system boards before you switch control boards. If you do not shut down all the domains, a partial control board failover occurs: the JTAG connection is moved to the spare control board, but the system clock source remains on the former primary control board.
If any domains are running, shut down those domains using the standard shutdown(1M) command.
Log in to the main SSP as user ssp.
To ensure that domains do not arbstop, power off and then power on all system boards.
Type the following to force the control board failover:
ssp% setfailover -t cb force
Issue the bringup(1M) command for all domains.
Re-enable control board failover as described in "To Enable Control Board Failover".
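The order of the steps above matters, so it can be useful to write the procedure down as an ordered checklist. The sketch below only builds that checklist; it does not run any command. The function name complete_failover_plan and the domain names are hypothetical, and shutdown(1M) and bringup(1M) are listed without arguments because this section does not give them.

```python
def complete_failover_plan(running_domains):
    """Return ordered (where, action) steps for a forced complete failover."""
    steps = []
    # 1. Shut down every running domain first; otherwise the failover
    #    is only partial.
    for domain in running_domains:
        steps.append(("domain " + domain, "shutdown(1M)"))
    # 2. The remaining steps are performed on the main SSP as user ssp.
    steps.append(("main SSP", "log in as user ssp"))
    steps.append(("main SSP", "power off, then power on all system boards"))
    steps.append(("main SSP", "setfailover -t cb force"))
    steps.append(("main SSP", "bringup(1M) for all domains"))
    # 3. Failover stays disabled after it occurs, so re-enable it last.
    steps.append(("main SSP", "setfailover -t cb on"))
    return steps


for number, (where, action) in enumerate(complete_failover_plan(["domA"]), 1):
    print(number, where, "->", action)
```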
Use the showfailover(1M) command on the main SSP to obtain the SSP and control board failover states and the status of the private connection links. The output also lists the names of the SSPs and control boards, and identifies the control boards responsible for the JTAG interface and the system clock. For details on the failover information displayed, see "Obtaining Failover Status Information".
The following example shows the information displayed for a control board failover in which the primary control board failed.
ssp% showfailover
Failover State:
    SSP Failover: Active
    CB Failover: Failed
Failover Connection Map:
    Main SSP to Spare SSP thru Main Hub: GOOD
    Main SSP to Spare SSP thru Spare Hub: GOOD
    Main SSP to Primary Control Board: FAILED
    Main SSP to Spare Control Board: GOOD
    Spare SSP to Main SSP thru Main Hub: GOOD
    Spare SSP to Main SSP thru Spare Hub: GOOD
    Spare SSP to Primary Control Board: FAILED
    Spare SSP to Spare Control Board: GOOD
SSP/CB Host Information
    Main SSP: xf12-ssp
    Spare SSP: xf12-ssp2
    Primary Control Board (JTAG source): xf12-cb1
    Spare Control Board: xf12-cb0
    System Clock source: xf12-cb1
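If you save showfailover output to a file, a short script can turn it into a lookup table for reporting. The following is a minimal sketch, assuming each field appears on its own line as "name: value" and that section headers carry no value; the function name parse_showfailover is hypothetical, and the sample is an abbreviated copy of the example above.

```python
# Abbreviated copy of the example output; the layout is an assumption.
SAMPLE_OUTPUT = """\
Failover State:
    SSP Failover: Active
    CB Failover: Failed
Failover Connection Map:
    Main SSP to Primary Control Board: FAILED
    Main SSP to Spare Control Board: GOOD
SSP/CB Host Information
    Main SSP: xf12-ssp
    Primary Control Board (JTAG source): xf12-cb1
    System Clock source: xf12-cb1
"""


def parse_showfailover(text):
    """Map each 'name: value' line to a dictionary entry."""
    fields = {}
    for raw in text.splitlines():
        # Split on the last colon so the whole field name is kept intact.
        name, sep, value = raw.strip().rpartition(":")
        value = value.strip()
        if not sep or not value:
            continue  # blank line or section header such as "Failover State:"
        fields[name.strip()] = value
    return fields


info = parse_showfailover(SAMPLE_OUTPUT)
print("CB failover state:", info["CB Failover"])
print("System clock source:", info["System Clock source"])
```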
You can also use Hostview to verify the type of control board failover (complete or partial). When you use Hostview to verify a control board, the "J" (JTAG) and "C" (system clock source) characters indicate which control board manages the JTAG interface and system clock.
Figure 9-1 shows an example Hostview window after a partial control board failover. One control board handles the JTAG interface, while the other serves as the system clock source.
After a control board failover occurs, you must perform certain recovery tasks:
Identify the failure point or condition that caused the failover and determine how to correct the failure.
For example, if a control board failover occurred due to a faulty control board, you must determine whether you need to replace the failed control board.
Use the showfailover(1M) command to review the failover state and verify which control board is responsible for the JTAG interface and system clock. Review the connection map in the showfailover output and the summary of the failover detection points in Chapter 10, SSP Internals.
You can also check the platform log file for other error conditions and determine the corrective action needed to reactivate the failed components.
If a partial failover occurred, resynchronize the JTAG and system clock interfaces so that they are managed by the same control board.
To resynchronize the JTAG and system clock interfaces, perform a complete control board failover as described in "To Force a Complete Control Board Failover". The first domain that is brought up resynchronizes the system clock and the JTAG interface on the primary control board.
Once you have resolved the control board failure, re-enable control board failover (for details, see "To Enable Control Board Failover").
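As part of the recovery tasks above, you can check whether a failover was partial by comparing the control board named as the JTAG source with the one named as the system clock source in showfailover output. The sketch below assumes the field labels shown in the example in this section; the function names are hypothetical, and the sample pairing of host names is invented for illustration.

```python
import re


def control_board_roles(text):
    """Return (jtag_board, clock_board) found in showfailover output."""
    jtag = re.search(r"Primary Control Board \(JTAG source\):\s*(\S+)", text)
    clock = re.search(r"System Clock source:\s*(\S+)", text)
    if not (jtag and clock):
        raise ValueError("JTAG source or system clock source not found")
    return jtag.group(1), clock.group(1)


def is_partial_failover(text):
    """True when JTAG and system clock are on different control boards."""
    jtag_board, clock_board = control_board_roles(text)
    return jtag_board != clock_board


# Hypothetical output fragment after a partial failover.
SAMPLE = """\
Primary Control Board (JTAG source): xf12-cb0
Spare Control Board: xf12-cb1
System Clock source: xf12-cb1
"""

if is_partial_failover(SAMPLE):
    print("Partial failover: resynchronize JTAG and system clock.")
```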