CHAPTER 2

Managing RAS Features and System Firmware

This chapter describes how to manage reliability, availability, and serviceability (RAS) features and system firmware, including the Sun Advanced Lights Out Manager (ALOM) system controller, and Automatic System Recovery (ASR). In addition, this chapter describes how to unconfigure and reconfigure a device manually, and introduces multipathing software.

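This chapter contains the following sections: Interpreting System LEDs, Automatic System Recovery, Unconfiguring and Reconfiguring Devices, Displaying System Fault Information, Multipathing Software, and Storing FRU Information.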



Note - This chapter does not cover detailed troubleshooting and diagnostic procedures. For information about fault isolation and diagnostic procedures, refer to the Sun Blade T6300 Server Module Service Manual.




Interpreting System LEDs

The behavior of LEDs on your server conforms to the American National Standards Institute (ANSI) Status Indicator Standard (SIS). These standard LED behaviors are described in TABLE 2-1.


TABLE 2-1 LED Behavior and Meaning

| LED Behavior | Meaning |
|---|---|
| Off | The condition represented by the color is not true. |
| Steady on | The condition represented by the color is true. |
| Standby blink | The system is functioning at a minimal level and ready to resume full function. |
| Slow blink | Transitory activity or new activity represented by the color is taking place. |
| Fast blink | Attention is required. |
| Feedback flash | Activity is taking place commensurate with the flash rate (such as disk drive activity). |


The LEDs have assigned meanings, described in TABLE 2-2.


TABLE 2-2 LED Behaviors With Assigned Meanings

| Color | Behavior | Definition | Description |
|---|---|---|---|
| White | Off | Steady state | |
| | Fast blink | 4 Hz repeating sequence, equal intervals on and off | This indicator helps you to locate a particular enclosure, board, or subsystem (for example, the Locator LED). |
| Blue | Off | Steady state | |
| | Steady on | Steady state | If blue is on, a service action can be performed on the applicable component with no adverse consequences (for example, the OK to Remove LED). |
| Yellow or Amber | Off | Steady state | |
| | Steady on | Steady state | This indicator signals the existence of a fault condition. Service is required (for example, the Service Required LED). |
| Green | Off | Steady state | |
| | Standby blink | Repeating sequence consisting of a brief (0.1 sec.) on flash followed by a long off period (2.9 sec.) | The system is running at a minimum level and is ready to be quickly revived to full function (for example, the System Activity LED). |
| | Steady on | Steady state | Status normal. System or component functioning with no service actions required. |
| | Slow blink | | A transitory (temporary) event is taking place for which direct proportional feedback is not needed or not feasible. |
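
The two blink patterns with explicit timings in TABLE 2-2 (fast blink and standby blink) can be modeled with a short calculation. The following Python sketch is illustrative only: the function and constant names are assumptions, not part of any Sun software, and it models only the timings stated in the table.

```python
# Blink timings taken from TABLE 2-2; all names here are illustrative.
FAST_BLINK_HZ = 4.0      # 4 Hz repeating sequence, equal intervals on and off
STANDBY_ON_SEC = 0.1     # brief on flash
STANDBY_OFF_SEC = 2.9    # long off period


def fast_blink_is_on(t: float) -> bool:
    """Return True if a fast-blinking LED is lit at time t (in seconds)."""
    period = 1.0 / FAST_BLINK_HZ        # 0.25 s per on/off cycle
    return (t % period) < period / 2    # lit during the first half of a cycle


def standby_blink_is_on(t: float) -> bool:
    """Return True if a standby-blinking LED is lit at time t (in seconds)."""
    period = STANDBY_ON_SEC + STANDBY_OFF_SEC   # 3.0 s per cycle
    return (t % period) < STANDBY_ON_SEC        # lit only during the brief flash
```

This sketch assumes each cycle starts with the on phase at t = 0. The source table does not state a phase, so that choice is arbitrary.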


Controlling the Locator LED

You control the Locator LED from the sc> prompt or with the Locator button on the front of the server module.


To Turn On the Locator LED From the ALOM System Controller Command Prompt

Type:


sc> setlocator on


To Turn Off the Locator LED From the ALOM System Controller Command Prompt

Type:


sc> setlocator off


To Display the State of the Locator LED From the ALOM System Controller Command Prompt

Type:


sc> showlocator
Locator LED is on.



Note - You do not need user permissions to use the setlocator and showlocator commands.




Automatic System Recovery

Automatic System Recovery functionality enables the system to resume operation after experiencing certain nonfatal hardware faults or failures. When ASR is enabled, the system's firmware diagnostics automatically detect failed hardware components. An autoconfiguring capability designed into the system firmware enables the system to unconfigure failed components and to restore system operation. As long as the system is capable of operating without the failed component, the ASR features enable the system to reboot automatically, without operator intervention.



Note - ASR is not activated until you enable it. See Enabling and Disabling Automatic System Recovery.



AutoBoot Options

The system firmware stores a configuration variable called auto-boot?, which controls whether the firmware automatically boots the operating system after each reset. The default setting for Sun platforms is true.

Normally, if a system fails power-on diagnostics, auto-boot? is ignored and the system does not boot unless an operator boots the system manually. An automatic boot is generally not acceptable for booting a system in a degraded state. Therefore, the server's OpenBoot firmware provides a second setting, auto-boot-on-error?. This setting controls whether the system will attempt a degraded boot when a subsystem failure is detected. Both the auto-boot? and auto-boot-on-error? switches must be set to true to enable an automatic degraded boot.


To Enable an Automatic Degraded Boot

Set the switches by typing:


ok setenv auto-boot? true
ok setenv auto-boot-on-error? true



Note - The default setting for auto-boot-on-error? is false. The system will not attempt a degraded boot unless you change this setting to true. In addition, the system will not attempt a degraded boot in response to any fatal nonrecoverable error, even if degraded booting is enabled. For examples of fatal nonrecoverable errors, see Error Handling Summary.
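
The interaction between auto-boot? and auto-boot-on-error? can be summarized as a small decision function. The Python sketch below is a simplified model of the rules stated in this section, not firmware code. The function name, parameter names, and return strings are assumptions made for illustration.

```python
def boot_decision(auto_boot: bool, auto_boot_on_error: bool,
                  diagnostics_passed: bool, fatal_error: bool = False) -> str:
    """Model the automatic boot outcome after a reset.

    Returns "normal", "degraded", or "none" (operator must boot manually).
    """
    if fatal_error:
        # Fatal nonrecoverable errors never lead to a degraded boot.
        return "none"
    if diagnostics_passed:
        # With clean diagnostics, only auto-boot? matters.
        return "normal" if auto_boot else "none"
    # Diagnostics found a nonfatal failure: both switches must be true.
    return "degraded" if (auto_boot and auto_boot_on_error) else "none"
```

For example, with the default settings (auto-boot? true, auto-boot-on-error? false), a nonfatal diagnostic failure yields "none" in this model, which matches the note above.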



Error Handling Summary

Error handling during the power-on sequence falls into one of the following three cases:

When a DIMM fails, the firmware unconfigures the entire logical bank associated with the failed DIMM. Another nonfailing logical bank must be present in the system for the system to attempt a degraded boot. Note that certain DIMM failures might not be diagnosable to a single DIMM. These failures are fatal, and result in both logical banks being unconfigured.



Note - If POST or OpenBoot Diagnostics detect a nonfatal error associated with the normal boot device, the OpenBoot firmware automatically unconfigures the failed device and tries the next-in-line boot device, as specified by the boot-device configuration variable.
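
The DIMM failure handling described above can be sketched as follows. This is a minimal Python model under stated assumptions: the bank and DIMM names are hypothetical, the function name is invented for illustration, and passing None for the failed DIMM stands for a failure that cannot be diagnosed to a single DIMM.

```python
def banks_after_dimm_failure(banks, failed_dimm=None):
    """Return (remaining_banks, can_attempt_degraded_boot).

    banks: dict mapping a logical bank name to the set of DIMMs in it.
    failed_dimm: name of the failed DIMM, or None if the failure could
                 not be diagnosed to a single DIMM.
    """
    if failed_dimm is None:
        # Undiagnosable failure: fatal, every logical bank is unconfigured.
        return {}, False
    # Unconfigure the whole bank that contains the failed DIMM.
    remaining = {name: dimms for name, dimms in banks.items()
                 if failed_dimm not in dimms}
    # A degraded boot needs at least one nonfailing bank.
    return remaining, len(remaining) > 0
```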



Reset Scenarios

Three ALOM configuration variables, diag_mode, diag_level, and diag_trigger, control whether the system runs firmware diagnostics in response to system reset events.

The standard system reset protocol bypasses POST completely unless the virtual keyswitch or ALOM variables are set as follows:


TABLE 2-3 Virtual Keyswitch Setting for Reset Scenario

| Keyswitch | Value |
|---|---|
| virtual keyswitch | diag |


TABLE 2-4 ALOM Variable Settings for Reset Scenario

| Variable | Value | Default |
|---|---|---|
| diag_mode | normal or service | normal |
| diag_level | min or max | min |
| diag_trigger | power-on-reset error-reset | power-on-reset |


With the default values shown in TABLE 2-4, the ALOM diagnostic settings that ASR depends on are already in place. To complete ASR setup, see Enabling and Disabling Automatic System Recovery.
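
One way to read TABLE 2-3 and TABLE 2-4 is as a predicate that decides whether POST runs for a given reset event. The Python sketch below is an interpretation, not the firmware's actual logic. The function name is hypothetical, and the keyswitch default of "normal" and the treatment of diag_trigger as a space-separated list of event names are assumptions.

```python
def post_runs_on_reset(event: str, keyswitch: str = "normal",
                       diag_mode: str = "normal", diag_level: str = "min",
                       diag_trigger: str = "power-on-reset") -> bool:
    """Return True if POST would run for this reset event in this model."""
    if keyswitch == "diag":
        # TABLE 2-3: the virtual keyswitch in diag position enables POST.
        return True
    # TABLE 2-4: all three ALOM variables must hold one of the listed values,
    # and the reset event must be named in diag_trigger.
    triggers = diag_trigger.split()
    return (diag_mode in ("normal", "service")
            and diag_level in ("min", "max")
            and event in triggers)
```

With the defaults from TABLE 2-4, this model runs POST on a power-on reset but not on an error reset, unless diag_trigger is set to include error-reset.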

Automatic System Recovery User Commands

ALOM commands are available for enabling and disabling ASR and for obtaining ASR status information.

For more information, see:

Enabling and Disabling Automatic System Recovery

The ASR feature is not activated until you enable it. Enabling ASR requires changing configuration variables in ALOM as well as OpenBoot firmware.


To Enable Automatic System Recovery

1. At the sc> prompt, type:


sc> setsc diag_mode normal
sc> setsc diag_level min
sc> setsc diag_trigger power-on-reset

2. At the ok prompt, type:


ok setenv auto-boot? true
ok setenv auto-boot-on-error? true

3. To cause the parameter changes to take effect, type:


ok reset-all

The system permanently stores the parameter changes and boots automatically when the OpenBoot configuration variable auto-boot? is set to true (its default value).



Note - To store parameter changes, you can also power cycle the system using the front panel Power button.




To Disable Automatic System Recovery

1. At the ok prompt, type:


ok setenv auto-boot-on-error? false

2. To cause the parameter changes to take effect, type:


ok reset-all

The system permanently stores the parameter change.



Note - To store parameter changes, you can also power cycle the system using the front panel Power button.



After you disable the ASR feature, it is not activated again until you re-enable it.
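
For checklists or runbooks, the two procedures above can be captured as data. The following Python sketch simply records the commands from this section as (prompt, command) pairs and renders them as text. It does not connect to ALOM or OpenBoot, and the names ENABLE_ASR, DISABLE_ASR, and render are illustrative assumptions.

```python
# Each step is (prompt, command), copied from the procedures in this section.
ENABLE_ASR = [
    ("sc>", "setsc diag_mode normal"),
    ("sc>", "setsc diag_level min"),
    ("sc>", "setsc diag_trigger power-on-reset"),
    ("ok", "setenv auto-boot? true"),
    ("ok", "setenv auto-boot-on-error? true"),
    ("ok", "reset-all"),
]

DISABLE_ASR = [
    ("ok", "setenv auto-boot-on-error? false"),
    ("ok", "reset-all"),
]


def render(steps):
    """Format the steps as the lines an operator would see and type."""
    return [f"{prompt} {command}" for prompt, command in steps]
```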

Obtaining Automatic System Recovery Information

Use the following procedure to retrieve information about the status of system components affected by ASR.


To Obtain ASR Information

At the sc> prompt, type:


sc> showcomponent

In the showcomponent command output, any devices marked disabled have been manually unconfigured using the system firmware. The showcomponent command also lists devices that have failed firmware diagnostics and have been automatically unconfigured by the system firmware.


Unconfiguring and Reconfiguring Devices

To support a degraded boot capability, the ALOM firmware provides the disablecomponent command, which enables you to unconfigure system devices manually. This command flags the specified device as disabled by creating an entry in the ASR database.


To Unconfigure a Device Manually

At the sc> prompt, type:


sc> disablecomponent asr-key

where asr-key is one of the device identifiers from TABLE 2-5.



Note - The device identifiers are not case-sensitive. You can type them as uppercase or lowercase characters.




TABLE 2-5 Device Identifiers and Devices

| Device Identifiers | Devices |
|---|---|
| MB/CMPcpu-number/Pstrand-number | CPU strand (Number: 0-31) |
| MB/PCIEa | PCIe leaf A (/pci@780) |
| MB/PCIEb | PCIe leaf B (/pci@7c0) |
| MB/CMP0/CHchannel-number/Rrank-number/Ddimm-number | DIMMs |



To Reconfigure a Device Manually

At the sc> prompt, type:


sc> enablecomponent asr-key

where asr-key is any device identifier from TABLE 2-5.



Note - The device identifiers are not case-sensitive. You can type them as uppercase or lowercase characters.



You can use the ALOM enablecomponent command to reconfigure any device that you previously unconfigured with the disablecomponent command.
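
Before passing an asr-key to disablecomponent or enablecomponent, a script could check that the key has one of the shapes listed in TABLE 2-5. The Python sketch below is a hypothetical helper, not an ALOM feature. It checks the documented strand range (0-31) and case-insensitivity, but it accepts any digits for the CPU, channel, rank, and DIMM numbers because the table does not give their ranges.

```python
import re

# Patterns derived from TABLE 2-5; matching is case-insensitive.
_ASR_KEY_PATTERNS = [
    re.compile(r"MB/CMP(\d+)/P(\d+)", re.IGNORECASE),       # CPU strand
    re.compile(r"MB/PCIE[ab]", re.IGNORECASE),              # PCIe leaf A or B
    re.compile(r"MB/CMP0/CH\d+/R\d+/D\d+", re.IGNORECASE),  # DIMM
]


def is_valid_asr_key(key: str) -> bool:
    """Return True if key has the shape of an identifier in TABLE 2-5."""
    key = key.strip()
    strand = _ASR_KEY_PATTERNS[0].fullmatch(key)
    if strand:
        # Strand numbers are documented as 0-31.
        return 0 <= int(strand.group(2)) <= 31
    return any(p.fullmatch(key) for p in _ASR_KEY_PATTERNS[1:])
```

A shape check of this kind cannot confirm that the device exists on a given server module. The showcomponent command remains the authoritative list.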


Displaying System Fault Information

ALOM software enables you to display current valid system faults. The showfaults command displays the fault ID, the faulted FRU device, and the fault message to standard output. The showfaults command also displays POST results.


To Display System Fault Information

Type showfaults.

For example:


sc> showfaults
   ID FRU        Fault
    0 FT0.F2     SYS_FAN at FT0.F2 has FAILED.

Adding the -v option displays additional information:


sc> showfaults -v
   ID Time              FRU        Fault
    0   MAY 20 10:47:32 FT0.F2     SYS_FAN at FT0.F2 has FAILED.

For more information about the showfaults command, refer to the Advanced Lights Out Management (ALOM) CMT v1.3 Guide.
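
If showfaults output is captured to a file or a terminal log, the non-verbose form shown above can be split into fields with a few lines of Python. This is a minimal sketch that assumes the three-column layout of the example (ID, FRU, Fault). The function name is hypothetical, and the sketch does not handle the -v layout, which adds a Time column.

```python
def parse_showfaults(output: str):
    """Parse non-verbose showfaults output into a list of dicts."""
    faults = []
    for line in output.splitlines():
        # Split into at most three fields: ID, FRU, and the fault message.
        fields = line.split(None, 2)
        # Skip blank lines, the prompt line, and the column header.
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        faults.append({
            "id": int(fields[0]),
            "fru": fields[1],
            "fault": fields[2].strip(),
        })
    return faults
```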


Multipathing Software

Multipathing software enables you to define and control redundant physical paths to I/O devices, such as storage devices and network interfaces. If the active path to a device becomes unavailable, the software can automatically switch to an alternate path to maintain availability. This capability is known as automatic failover. To take advantage of multipathing capabilities, you must configure the server with redundant hardware, such as redundant network interfaces or two host bus adapters connected to the same dual-ported storage array.
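
The idea of automatic failover can be illustrated with a small model. The Python sketch below is conceptual only and does not represent how any particular multipathing product is implemented. The class name, method names, and path names such as "hba0" are hypothetical.

```python
class MultipathDevice:
    """Toy model of a device reachable over several redundant paths."""

    def __init__(self, paths):
        if not paths:
            raise ValueError("at least one path is required")
        self.paths = list(paths)
        self.failed = set()
        self.active = self.paths[0]   # start on the first configured path

    def mark_failed(self, path):
        """Record a path failure and fail over if it was the active path."""
        self.failed.add(path)
        if path == self.active:
            self._failover()

    def _failover(self):
        # Switch to the first path that has not failed, if any remains.
        for candidate in self.paths:
            if candidate not in self.failed:
                self.active = candidate
                return
        self.active = None            # no redundant path left
```

With two paths, failing the active one moves I/O to the alternate. Failing both leaves the device unreachable, which is why redundant hardware is a prerequisite.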

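For your server, three different types of multipathing software are available: Solaris IP Network Multipathing software, VERITAS Volume Manager (VVM) software with its Dynamic Multipathing (DMP) feature, and Sun StorageTek Traffic Manager.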

For More Information on Multipathing Software

For instructions on how to configure and administer Solaris IP Network Multipathing software, consult the IP Network Multipathing Administration Guide provided with your specific Solaris release.

For information about VVM and its DMP feature, refer to the documentation provided with the VERITAS Volume Manager software.

For information about Sun StorageTek Traffic Manager, refer to your Solaris OS documentation.


Storing FRU Information

The setfru command enables you to store information on FRU PROMs. For example, you might store information identifying the server in which the FRUs have been installed.


To Store Information in Available FRU PROMs

At the sc> prompt, type:


sc> setfru -c data