CHAPTER 2
Managing RAS Features and System Firmware
This chapter describes how to manage reliability, availability, and serviceability (RAS) features and system firmware, including the Sun Advanced Lights Out Manager (ALOM) system controller and automatic system recovery (ASR). In addition, this chapter describes how to unconfigure and reconfigure a device manually, and introduces multipathing software.
This chapter contains the following sections:
Note - This chapter does not cover detailed troubleshooting and diagnostic procedures. For information about fault isolation and diagnostic procedures, refer to the diagnostics and troubleshooting guide for your server.
The system controller supports a total of nine concurrent ALOM CMT sessions per server: one connection through the serial management port and up to eight connections through the network management port.
After you log in to your ALOM account, the system controller command prompt (sc>) appears, and you can enter system controller commands. If the command you want to use has multiple options, you can enter the options either individually or grouped together, as shown in the following example. The two commands in the following example are equivalent.
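The equivalence of grouped and individual options can be illustrated with a minimal Python sketch. This is not ALOM CMT code: the normalize_options helper is hypothetical, and the poweroff command with the -f and -y options is assumed here purely for illustration.

```python
def normalize_options(args):
    """Expand grouped single-letter options (-fy) into individual ones (-f -y).

    Hypothetical helper for illustration only; it is not part of ALOM CMT.
    """
    expanded = []
    for arg in args:
        # A grouped option starts with a single dash and carries several letters.
        if arg.startswith("-") and not arg.startswith("--") and len(arg) > 2:
            expanded.extend("-" + letter for letter in arg[1:])
        else:
            expanded.append(arg)
    return expanded


# Assumed example command line; both spellings reduce to the same option list.
individual = normalize_options(["poweroff", "-f", "-y"])
grouped = normalize_options(["poweroff", "-fy"])
print(individual == grouped)
```

Both forms reduce to the same list of options, which is why the system controller treats them as the same command.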
All environmental monitoring and control is handled by the system controller. The system controller command prompt (sc>) provides you with a way of interacting with the system controller. For more information about the sc> prompt, see ALOM CMT and The sc> Prompt.
For instructions on connecting to the system controller, see:
1. If you are logged in to the system console, type #. (Hash-Period) to get to the sc> prompt.
Press the Hash key, followed by the Period key. Then press the Return key.
2. At the ALOM CMT login prompt, enter the login name and press Return.
The default login name is admin.
3. At the password prompt, enter the password and press Return to get to the sc> prompt.
Note - There is no default password when you connect to ALOM CMT for the first time using the serial management port. When you connect to the system controller using the network management port for the first time, the default ALOM CMT password is the last 8 digits of the Chassis Serial Number. The Chassis Serial Number is printed on the back of the server and on the printed system information sheet that shipped with your server. You must assign a password during initial system configuration. For more information, refer to the installation guide for your server and the ALOM CMT guide for your server.
Using the system controller, you can monitor the system, turn the Locator LED on and off, or perform maintenance tasks on the system controller card itself. For more information, refer to the ALOM CMT guide for your server.
1. Log in to the system controller.
2. Use the showenvironment command to display a snapshot of the server's environmental status.
The information this command can display includes temperature, power supply status, front panel LED status, and so on.
The behavior of the LEDs on your server conforms to the American National Standards Institute (ANSI) Status Indicator Standard (SIS). These standard LED behaviors are described in TABLE 2-1.
The LEDs have assigned meanings, described in TABLE 2-2.
You control the Locator LED from the sc> prompt or with the Locator button on the front of the chassis.
To turn on the Locator LED from the system controller command prompt, type:
To turn off the Locator LED from the system controller command prompt, type:
To display the state of the Locator LED from the system controller command prompt, type:
The system provides for automatic system recovery (ASR) from failures in memory modules or PCI cards.
Automatic system recovery functionality enables the system to resume operation after experiencing certain nonfatal hardware faults or failures. When ASR is enabled, the system's firmware diagnostics automatically detect failed hardware components. An autoconfiguring capability designed into the system firmware enables the system to unconfigure failed components and to restore system operation. As long as the system is capable of operating without the failed component, the ASR features enable the system to reboot automatically, without operator intervention.
Note - ASR is not activated until you enable it. See Enabling and Disabling Automatic System Recovery.
For more information about ASR, refer to the service manual for your server.
The system firmware stores a configuration variable called auto-boot?, which controls whether the firmware will automatically boot the operating system after each reset. The default setting for Sun platforms is true.
Normally, if a system fails power-on diagnostics, auto-boot? is ignored and the system does not boot unless an operator boots the system manually. Because booting a degraded system automatically is generally not acceptable, the server's OpenBoot firmware provides a second setting, auto-boot-on-error?. This setting controls whether the system will attempt a degraded boot when a subsystem failure is detected. Both the auto-boot? and auto-boot-on-error? switches must be set to true to enable an automatic degraded boot.
Set the switches by typing:
Note - The default setting for auto-boot-on-error? is false. The system will not attempt a degraded boot unless you change this setting to true. In addition, the system will not attempt a degraded boot in response to any fatal nonrecoverable error, even if degraded booting is enabled. For examples of fatal nonrecoverable errors, see Error Handling Summary.
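The interaction between the two switches can be summarized in a short Python sketch. It models only the rules stated above; the function name and its parameters are hypothetical and do not correspond to any firmware interface.

```python
def will_boot_automatically(auto_boot, auto_boot_on_error,
                            diagnostics_passed, fatal_error=False):
    """Model of the boot decision described in this section (illustrative only).

    auto_boot          -- value of the auto-boot? variable
    auto_boot_on_error -- value of the auto-boot-on-error? variable
    diagnostics_passed -- True if power-on diagnostics found no failure
    fatal_error        -- True for a fatal nonrecoverable error
    """
    if fatal_error:
        # A fatal nonrecoverable error never results in a degraded boot.
        return False
    if not auto_boot:
        # Without auto-boot?, an operator must boot the system manually.
        return False
    if diagnostics_passed:
        return True
    # Diagnostics failed: a degraded boot needs auto-boot-on-error? as well.
    return auto_boot_on_error


# Default settings (auto-boot? true, auto-boot-on-error? false), failed diagnostics.
print(will_boot_automatically(True, False, diagnostics_passed=False))
# Both switches true, failed diagnostics: degraded boot is attempted.
print(will_boot_automatically(True, True, diagnostics_passed=False))
```

With the default settings the first call reports that no automatic boot takes place after a diagnostic failure; setting both switches to true changes the result.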
Error handling during the power-on sequence falls into one of the following three cases:
Note - If POST or OpenBoot Diagnostics detect a nonfatal error associated with the normal boot device, the OpenBoot firmware automatically unconfigures the failed device and tries the next-in-line boot device, as specified by the boot-device configuration variable.
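The next-in-line behavior described in the preceding note can be sketched as follows. This is a simplified model, not OpenBoot code: the choose_boot_device function is hypothetical, and the device names disk and net are example values for the boot-device variable.

```python
def choose_boot_device(boot_devices, failed_devices):
    """Return the first listed boot device that has not been unconfigured.

    boot_devices   -- ordered list, as given by the boot-device variable
    failed_devices -- set of devices unconfigured after a nonfatal error
    Returns None when every listed device has failed.
    """
    for device in boot_devices:
        if device not in failed_devices:
            return device
    return None


# Example values only: the first device failed diagnostics, so the next is tried.
print(choose_boot_device(["disk", "net"], failed_devices={"disk"}))
```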
For more information about troubleshooting fatal errors, refer to the service manual for your server.
Three ALOM CMT configuration variables, diag_mode, diag_level, and diag_trigger, control whether the system runs firmware diagnostics in response to system reset events.
The standard system reset protocol bypasses POST completely unless the virtual keyswitch or ALOM CMT variables are set as follows:
Therefore, ASR is enabled by default. For instructions, see Enabling and Disabling Automatic System Recovery.
ALOM CMT commands are available for obtaining ASR status information and for manually unconfiguring or reconfiguring system devices. For more information, see:
The automatic system recovery (ASR) feature is not activated until you enable it. Enabling ASR requires changing configuration variables in ALOM CMT as well as OpenBoot firmware.
1. At the sc> prompt, type:
2. At the ok prompt, type:
3. To cause the parameter changes to take effect, type:
The system permanently stores the parameter changes and boots automatically when the OpenBoot configuration variable auto-boot? is set to true (its default value).
1. At the ok prompt, type:
2. To cause the parameter changes to take effect, type:
The system permanently stores the parameter change.
After you disable the automatic system recovery (ASR) feature, it is not activated again until you re-enable it.
Use the following procedure to retrieve information about the status of system components affected by automatic system recovery (ASR).
At the sc> prompt, type:
sc> showcomponent
In the showcomponent command output, any devices marked disabled have been manually unconfigured using the system firmware. The showcomponent command also lists devices that have failed firmware diagnostics and have been automatically unconfigured by the system firmware.
For more information, see:
To support a degraded boot capability, the ALOM CMT firmware provides the disablecomponent command, which enables you to unconfigure system devices manually. This command flags the specified device as disabled by creating an entry in the ASR database.
At the sc> prompt, type:
sc> disablecomponent asr-key
where asr-key is one of the device identifiers from TABLE 2-5.
1. At the sc> prompt, type:
sc> enablecomponent asr-key
where asr-key is any device identifier from TABLE 2-5.
You can use the ALOM CMT enablecomponent command to reconfigure any device that you previously unconfigured with the disablecomponent command.
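The relationship between the two commands and the ASR database can be pictured with a small Python model. This is a conceptual sketch only: the AsrDatabase class is hypothetical, the real database format is not described in this chapter, and the key example-device-0 is a placeholder rather than an identifier from TABLE 2-5.

```python
class AsrDatabase:
    """Toy model of the ASR database: a set of device keys flagged as disabled."""

    def __init__(self):
        self._disabled = set()

    def disablecomponent(self, asr_key):
        # Unconfiguring a device creates an entry for it.
        self._disabled.add(asr_key)

    def enablecomponent(self, asr_key):
        # Reconfiguring a device removes its entry, if one exists.
        self._disabled.discard(asr_key)

    def is_disabled(self, asr_key):
        return asr_key in self._disabled


db = AsrDatabase()
db.disablecomponent("example-device-0")   # placeholder key, not from TABLE 2-5
print(db.is_disabled("example-device-0"))
db.enablecomponent("example-device-0")
print(db.is_disabled("example-device-0"))
```

In this model, enabling a device that was never disabled has no effect, which mirrors the statement that enablecomponent applies to devices previously unconfigured with disablecomponent.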
ALOM CMT software enables you to display current valid system faults. The showfaults command displays the fault ID, the faulted FRU device, and the fault message to standard output. The showfaults command also displays POST results.
Adding the -v option displays additional information.
For more information about the showfaults command, refer to the Advanced Lights Out Management (ALOM) CMT v1.3 Guide.
Multipathing software enables you to define and control redundant physical paths to I/O devices, such as storage devices and network interfaces. If the active path to a device becomes unavailable, the software can automatically switch to an alternate path to maintain availability. This capability is known as automatic failover. To take advantage of multipathing capabilities, you must configure the server with redundant hardware, such as redundant network interfaces or two host bus adapters connected to the same dual-ported storage array.
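Automatic failover can be illustrated with a minimal Python sketch. It does not reflect how any particular multipathing product is implemented: the Path class, the select_path function, and the path names hba0 and hba1 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Path:
    """One physical path to an I/O device (illustrative model)."""
    name: str
    available: bool = True


def select_path(paths, active):
    """Keep the active path while it works; otherwise fail over to an alternate."""
    if active.available:
        return active
    for path in paths:
        if path.available:
            return path
    raise RuntimeError("no available path to the device")


# Two redundant paths, for example through two host bus adapters.
paths = [Path("hba0"), Path("hba1")]
active = paths[0]

paths[0].available = False            # the active path becomes unavailable
active = select_path(paths, active)   # failover selects the alternate path
print(active.name)
```

The sketch also shows why redundant hardware is required: with a single path, the loop finds no alternate and the device becomes unreachable.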
For your server, three different types of multipathing software are available:
For instructions on how to configure and administer Solaris IP Network Multipathing, consult the IP Network Multipathing Administration Guide provided with your specific Solaris release.
For information about VERITAS Volume Manager (VVM) and its Dynamic Multipathing (DMP) feature, refer to the documentation provided with the VERITAS Volume Manager software.
For information about Sun StorEdge Traffic Manager, refer to your Solaris OS documentation.
The setfru command enables you to store information on FRU PROMs. For example, you might store information identifying the server in which the FRUs have been installed.
At the sc> prompt, type: