CHAPTER 4

Managing RAS Features and System Firmware

This chapter describes how to manage reliability, availability, and serviceability (RAS) features and system firmware, including the Sun Advanced Lights Out Manager (ALOM) system controller and automatic system recovery (ASR). In addition, this chapter describes how to unconfigure and reconfigure a device manually, and introduces multipathing software.

This chapter contains the following sections:


OpenBoot Emergency Procedures

The introduction of universal serial bus (USB) keyboards with the newest Sun systems has made it necessary to change some of the OpenBoot emergency procedures. Specifically, the Stop-N, Stop-D, and Stop-F commands that were available on systems with non-USB keyboards are not supported on systems that use USB keyboards, such as the Sun Fire server. If you are familiar with the earlier (non-USB) keyboard functionality, this section describes the analogous OpenBoot emergency procedures available in newer systems that use USB keyboards.

The following sections describe how to perform the functions of the Stop commands on systems that use USB keyboards. These same functions are available through the ALOM software.

Stop-A Functionality

The Stop-A (Abort) key sequence works the same as it does on systems with standard keyboards, except that it does not work during the first few seconds after the server is reset. In addition, you can issue the ALOM system controller break command.

Stop-N Functionality

Stop-N functionality is not available. However, the Stop-N functionality can be closely emulated by completing the following steps, provided the system console is configured to be accessible using either the serial management port or the network management port.


To Restore OpenBoot Configuration Defaults

1. Log in to the ALOM system controller.

2. Type the following commands:


sc> bootmode reset_nvram
sc> bootmode bootscript="setenv auto-boot? false"
sc> 



Note - If you do not issue the poweroff and poweron commands or the reset command within 10 minutes, the host server ignores the bootmode command.



You can issue the bootmode command without arguments to display the current setting.


sc> bootmode
Bootmode: reset_nvram
Expires WED SEP 09 09:52:01 UTC 2006
bootscript="setenv auto-boot? false"

3. To reset the system, type the following command:


sc> reset
Are you sure you want to reset the system [y/n]?  y
sc> 

4. To view console output as the system boots with default OpenBoot configuration variables, switch to console mode.


sc> console
 
ok

5. Type set-defaults to discard any customized IDPROM values and to restore the default settings for all OpenBoot configuration variables.
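The 10-minute window from the note in step 2 can be sketched in a few lines of Python. This is only an illustrative model, not ALOM code. The function names are hypothetical. The issue time is assumed, chosen so that the computed expiry matches the clock time shown in the sample bootmode output. Treating the window boundary as inclusive is also an assumption.

```python
from datetime import datetime, timedelta, timezone

# Window length taken from the note in step 2 (10 minutes).
BOOTMODE_WINDOW = timedelta(minutes=10)


def bootmode_expiry(issued_at: datetime) -> datetime:
    """Return the time after which the host ignores the bootmode setting."""
    return issued_at + BOOTMODE_WINDOW


def bootmode_applies(issued_at: datetime, reset_at: datetime) -> bool:
    """True if a reset at reset_at still honors a bootmode issued at issued_at.

    Assumption: the boundary itself is treated as inside the window.
    """
    return issued_at <= reset_at <= bootmode_expiry(issued_at)


# Hypothetical issue time, 10 minutes before the expiry in the sample output.
issued = datetime(2006, 9, 9, 9, 42, 1, tzinfo=timezone.utc)

print(bootmode_expiry(issued).strftime("%H:%M:%S"))
print(bootmode_applies(issued, issued + timedelta(minutes=3)))
print(bootmode_applies(issued, issued + timedelta(minutes=11)))
```

Under this model, a reset issued 3 minutes after the bootmode command picks up the setting, while one issued 11 minutes later does not.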

Stop-F Functionality

The Stop-F functionality is not available on systems with USB keyboards.

Stop-D Functionality

The Stop-D (Diags) key sequence is not supported on systems with USB keyboards. However, the Stop-D functionality can be closely emulated by setting the virtual keyswitch to diag, using the ALOM setkeyswitch command.


Automatic System Recovery

The system provides for automatic system recovery (ASR) from failures in memory modules or PCI cards.

Automatic system recovery functionality enables the system to resume operation after experiencing certain nonfatal hardware faults or failures. When ASR is enabled, the firmware diagnostics automatically detect failed hardware components. An autoconfiguring capability designed into the system firmware enables the system to unconfigure failed components and to restore system operation. As long as the system is capable of operating without the failed component, the ASR features enable the system to reboot automatically, without operator intervention.

Autoboot Options

The system firmware stores a configuration variable called auto-boot?, which controls whether the firmware will automatically boot the operating system after each reset. The default setting for Sun platforms is true.

Normally, if a system fails power-on diagnostics, auto-boot? is ignored and the system does not boot unless an operator boots it manually, because an automatic boot is not acceptable for a system in a degraded state. For cases where a degraded boot is wanted, the Sun Fire server OpenBoot firmware provides a second setting, auto-boot-on-error?, which controls whether the system attempts a degraded boot when a subsystem failure is detected. Both the auto-boot? and auto-boot-on-error? switches must be set to true to enable an automatic degraded boot. To set the switches, type:


ok setenv auto-boot? true
ok setenv auto-boot-on-error? true



Note - The default setting for auto-boot-on-error? is true. Therefore, the system attempts a degraded boot unless you change this setting to false. In addition, the system will not attempt a degraded boot in response to any fatal nonrecoverable error, even if degraded booting is enabled. For examples of fatal nonrecoverable errors, see Error Handling Summary.
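The interaction between the two switches and the diagnostic outcome can be summarized as a small decision function. The following Python sketch only restates the rules given above. It is not firmware code, and the names DiagResult and will_auto_boot are hypothetical.

```python
from enum import Enum


class DiagResult(Enum):
    PASSED = "passed"      # power-on diagnostics found no fault
    NONFATAL = "nonfatal"  # a subsystem failed but can be unconfigured
    FATAL = "fatal"        # fatal nonrecoverable error


def will_auto_boot(auto_boot: bool, auto_boot_on_error: bool,
                   diag: DiagResult) -> bool:
    """Model of the boot decision described in this section."""
    if diag is DiagResult.FATAL:
        # No degraded boot after a fatal error, whatever the settings.
        return False
    if diag is DiagResult.NONFATAL:
        # A degraded boot requires both switches to be true.
        return auto_boot and auto_boot_on_error
    # Clean diagnostics: auto-boot? alone decides.
    return auto_boot


print(will_auto_boot(True, True, DiagResult.NONFATAL))
print(will_auto_boot(True, False, DiagResult.NONFATAL))
print(will_auto_boot(True, True, DiagResult.FATAL))
```

The three calls cover a degraded boot that proceeds, a degraded boot blocked by auto-boot-on-error? set to false, and a fatal error that blocks booting in every case.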



Error Handling Summary

Error handling during the power-on sequence falls into one of the following three cases:



Note - If POST or OpenBoot firmware detects a nonfatal error associated with the normal boot device, the OpenBoot firmware automatically unconfigures the failed device and tries the next-in-line boot device, as specified by the boot-device configuration variable.
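The fallback described in the note amounts to walking the boot-device list in order and skipping any device that has been unconfigured. A minimal Python sketch of that selection follows. The function name is hypothetical, and the value "disk net" is used only as an example of a space-separated boot-device setting.

```python
from typing import Iterable, Optional


def next_boot_device(boot_device: str, failed: Iterable[str]) -> Optional[str]:
    """Return the first device in boot-device order that is still configured.

    boot_device is the space-separated value of the boot-device variable;
    failed holds the devices the firmware has unconfigured.
    """
    unconfigured = set(failed)
    for device in boot_device.split():
        if device not in unconfigured:
            return device
    return None  # every listed device has failed


print(next_boot_device("disk net", failed={"disk"}))
print(next_boot_device("disk net", failed=set()))
```

With the first device marked as failed, the sketch selects the next entry in the list. With no failures, it selects the first entry.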




Displaying System Fault Information

ALOM software lets you display current valid system faults. The showfaults command displays the fault ID, the faulted FRU device, and the fault message to standard output. The showfaults command also displays POST results. For example:


sc> showfaults
   ID FRU        Fault
    0 FT0.FM2    SYS_FAN at FT0.FM2 has FAILED.

Adding the -v option also displays the time of each fault:


sc> showfaults -v
   ID Time              FRU        Fault
    0 MAY 20 10:47:32   FT0.FM2    SYS_FAN at FT0.FM2 has FAILED.


To Display System Fault Information

At the sc> prompt, type:


sc> showfaults -v
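If you capture showfaults output in a log, a short script can turn it into records for further processing. The following Python sketch assumes the column layout shown in the examples in this section (ID, optional time, FRU, fault message). The function name parse_showfaults is hypothetical, and other ALOM versions might format the output differently.

```python
import re
from typing import Dict, List

# Verbose rows carry a timestamp such as "MAY 20 10:47:32" after the ID.
_VERBOSE = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<time>[A-Z]{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<fru>\S+)\s+(?P<fault>.+?)\s*$"
)
# Non-verbose rows have only ID, FRU, and fault message.
_PLAIN = re.compile(r"^\s*(?P<id>\d+)\s+(?P<fru>\S+)\s+(?P<fault>.+?)\s*$")


def parse_showfaults(text: str) -> List[Dict[str, str]]:
    """Parse captured showfaults output into one dict per fault row.

    Prompt and header lines do not start with a numeric ID and are skipped.
    """
    faults = []
    for line in text.splitlines():
        match = _VERBOSE.match(line) or _PLAIN.match(line)
        if match:
            faults.append(match.groupdict())
    return faults


sample = """sc> showfaults -v
   ID Time              FRU        Fault
    0 MAY 20 10:47:32   FT0.FM2    SYS_FAN at FT0.FM2 has FAILED.
"""

for fault in parse_showfaults(sample):
    print(fault["id"], fault["time"], fault["fru"], fault["fault"])
```

Rows parsed from non-verbose output have no time key, so code that handles both forms should use fault.get("time").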


Multipathing Software

Multipathing software enables you to define and control redundant physical paths to I/O devices, such as storage devices and network interfaces. If the active path to a device becomes unavailable, the software can automatically switch to an alternate path to maintain availability. This capability is known as automatic failover. To take advantage of multipathing capabilities, you must configure the server with redundant hardware, such as redundant network interfaces or two host bus adapters connected to the same dual-ported storage array.
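The idea of automatic failover can be illustrated with a toy model. The following Python sketch is not based on any particular multipathing product. The class name, method names, and path labels (hba0, hba1) are all hypothetical.

```python
from typing import Dict, List, Optional


class MultipathDevice:
    """Toy model of one I/O device reachable over several physical paths."""

    def __init__(self, paths: List[str]) -> None:
        if not paths:
            raise ValueError("at least one path is required")
        # Insertion order of the dict doubles as the failover order.
        self.healthy: Dict[str, bool] = {p: True for p in paths}
        self.active: Optional[str] = paths[0]

    def path_failed(self, path: str) -> None:
        """Mark a path unavailable and fail over if it was the active one."""
        if path not in self.healthy:
            raise KeyError(path)
        self.healthy[path] = False
        if path == self.active:
            # Switch to the first remaining healthy path, if any.
            self.active = next(
                (p for p, ok in self.healthy.items() if ok), None
            )


dev = MultipathDevice(["hba0", "hba1"])
dev.path_failed("hba0")
print(dev.active)
dev.path_failed("hba1")
print(dev.active)
```

With two paths configured, losing the active path moves I/O to the alternate path. When no healthy path remains, the device becomes unreachable, which is why redundant hardware is a prerequisite.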

Three different types of multipathing software are available:

For More Information

For instructions on how to configure and administer Solaris IP Network Multipathing, consult the IP Network Multipathing Administration Guide provided with your specific Solaris release.

For information about Sun StorEdge Traffic Manager, refer to your Solaris OS documentation.