Sun Enterprise 10000 SSP 3.5 User Guide

Automatic Failover to the Spare Control Board

Control board failover is automatically enabled upon SSP installation or upgrade. The fod daemon performs failover monitoring of the control boards and other failover components. If the primary control board is not functioning properly, the fod daemon will trigger an automatic failover to the spare control board. A control board failure can be caused by

Note that under certain failure conditions the fod daemon can disable a control board failover. For a detailed description of the failure conditions and a summary of the failover actions performed, see Chapter 10, SSP Internals.

A control board failover can be either partial or complete, depending on whether domains are running:

Managing Control Board Failover

You can enable, disable, or force a control board failover as explained in the following procedures. Use the setfailover(1M) command on the main SSP to manage the failover state. For example, after a control board failover occurs, you must use the setfailover(1M) command to re-enable the control board failover capability.

To Disable Control Board Failover
  1. As user ssp on the main SSP, type:


    ssp% setfailover -t cb off
    

    Control board failover remains disabled until you enable it. To confirm that it is disabled, check the failover state with the showfailover(1M) command, as explained in "Obtaining Control Board Failover Information".

To Enable Control Board Failover
  1. As user ssp on the main SSP, type:


    ssp% setfailover -t cb on
    

    Control board failover is activated when all the connection links are functioning properly. If any failed connections exist, control board failover is not enabled. You can use the showfailover(1M) command to verify that control board failover is enabled and review the connection status.
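Because a single failed connection keeps control board failover from being enabled, it can help to list only the failed links. The following is a minimal sh sketch, not part of SSP: the failed_links function name is hypothetical, and the parsing assumes that showfailover(1M) prints the "Failover Connection Map" layout shown in the example in "Obtaining Control Board Failover Information", with one link per line ending in GOOD or FAILED.

```shell
#!/bin/sh
# Hypothetical helper, not part of SSP: list failed links from showfailover
# output read on standard input. Assumes the "Failover Connection Map" layout
# shown in this guide, one link per line, ending in GOOD or FAILED.
failed_links() {
    awk '
        /^Failover Connection Map:/ { in_map = 1; next }
        /^SSP\/CB Host Information/ { in_map = 0 }
        in_map && $NF == "FAILED" {
            sub(/^[ \t]+/, "")               # drop leading indentation
            sub(/:[ \t]+FAILED[ \t]*$/, "")  # drop the status column
            print
        }
    '
}
```

On the main SSP, this could be used as `showfailover | failed_links`. No output means that no link in the connection map was reported as FAILED.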

To Force a Complete Control Board Failover

Note - To force a complete control board failover, in which both the JTAG connection and the system clock source move from the primary control board to the spare, you must first shut down all running domains, then power all system boards off and back on, before you switch control boards. If you do not shut down all the domains, a partial control board failover occurs: the JTAG connection moves to the spare control board, but the system clock source remains on the former primary control board.


  1. If any domains are running, shut down those domains using the standard shutdown(1M) command.

  2. Log in to the main SSP as user ssp.

  3. To ensure that domains do not arbstop, do the following:

    1. Stop event detection monitoring.


      ssp% edd_cmd -x stop
      
    2. Power off all of the system boards.


      ssp% power -off -all
      
    3. Power on all of the system boards.


      ssp% power -on -all
      
    4. Start event detection monitoring.


      ssp% edd_cmd -x start
      
  4. Type the following to force the control board failover:


    ssp% setfailover -t cb force
    
  5. Issue the bringup(1M) command for all domains.

  6. Re-enable control board failover as described in "To Enable Control Board Failover".
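
The command sequence in steps 3 and 4 of this procedure can be sketched as a single sh function. This is an illustration under stated assumptions, not a supported SSP script: the run helper, the DRY_RUN variable, and the force_cb_failover name are hypothetical, and only the SSP commands themselves come from the procedure. DRY_RUN defaults to 1, so the commands are printed rather than executed.

```shell
#!/bin/sh
# Sketch of steps 3 and 4 of the forced complete failover procedure.
# run, DRY_RUN, and force_cb_failover are illustrative names, not SSP features.
# With DRY_RUN=1 (the default) each command is printed instead of executed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@" || { echo "failed: $*" >&2; return 1; }
    fi
}

force_cb_failover() {
    run edd_cmd -x stop &&          # stop event detection monitoring
    run power -off -all &&          # power off all system boards
    run power -on -all &&           # power on all system boards
    run edd_cmd -x start &&         # start event detection monitoring
    run setfailover -t cb force     # force the control board failover
}
```

The sketch stops at the first command that fails and does not undo earlier steps, so event detection monitoring could be left stopped after a failure. Shutting down the domains beforehand, running bringup(1M) afterward, and re-enabling control board failover remain separate manual steps.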

Obtaining Control Board Failover Information

Use the showfailover(1M) command on the main SSP to obtain the SSP and control board failover states and the status of the private connection links. The output also lists the names of the SSPs and control boards, and identifies which control boards provide the JTAG interface and the system clock. For details on the failover information displayed, see "Obtaining Failover Status Information".

The following example shows the information displayed for a control board failover in which the primary control board failed.


ssp% showfailover  
Failover State:
     SSP Failover: Active
     CB Failover:  Failed
Failover Connection Map:
     Main SSP to Spare SSP thru Main Hub:   GOOD
     Main SSP to Spare SSP thru Spare Hub:  GOOD
     Main SSP to Primary Control Board:     FAILED
     Main SSP to Spare Control Board:       GOOD
     Spare SSP to Main SSP thru Main Hub:   GOOD
     Spare SSP to Main SSP thru Spare Hub:  GOOD
     Spare SSP to Primary Control Board:    FAILED
     Spare SSP to Spare Control Board:      GOOD
SSP/CB Host Information
     Main SSP:                              xf12-ssp
     Spare SSP:                             xf12-ssp2
     Primary Control Board (JTAG source):   xf12-cb1
     Spare Control Board:                   xf12-cb0
     System Clock source:                   xf12-cb1
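
The host information in this output can also be checked mechanically. The following sh sketch, which is not part of SSP, reads showfailover output on standard input and reports the CB failover state and whether the JTAG source and the system clock source name the same control board. The cb_failover_summary function name is hypothetical, and the field labels are assumed to match the example above exactly. Treating different boards as a sign of a partial failover follows from the description of partial failover in this chapter.

```shell
#!/bin/sh
# Hypothetical helper, not part of SSP: summarize showfailover output read on
# standard input. Field labels are assumed to match the example in this guide.
cb_failover_summary() {
    awk -F': *' '
        { sub(/[ \t]+$/, "") }                    # trim trailing whitespace
        /CB Failover:/                            { state = $2 }
        /Primary Control Board \(JTAG source\):/  { jtag  = $2 }
        /System Clock source:/                    { clock = $2 }
        END {
            if (jtag == "" || clock == "") {
                print "could not find JTAG or system clock source"
                exit 2
            }
            printf "CB failover state: %s\n", state
            if (jtag == clock)
                printf "JTAG and system clock are both on %s\n", jtag
            else
                printf "JTAG on %s, system clock on %s (split, as in a partial failover)\n", jtag, clock
        }
    '
}
```

On the main SSP, this could be used as `showfailover | cb_failover_summary`.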

You can also use Hostview to verify the type of control board failover (complete or partial). In the Hostview window, the "J" (JTAG) and "C" (system clock source) characters indicate which control board manages the JTAG interface and which provides the system clock.

Figure 9-1 shows an example Hostview window after a partial control board failover. One control board handles the JTAG interface, while the other serves as the system clock source.

Figure 9-1 Example Hostview Window After a Partial Control Board Failover

After Control Board Failover

After a control board failover occurs, you must perform certain recovery tasks: