CHAPTER 2

Managing RAS Features and System Firmware

This chapter describes how to manage reliability, availability, and serviceability (RAS) features and system firmware, including ALOM CMT on the system controller, and automatic system recovery (ASR). In addition, this chapter describes how to unconfigure and reconfigure a device manually, and introduces multipathing software.

This chapter contains the following sections:



Note - This chapter does not cover detailed troubleshooting and diagnostic procedures. For information about fault isolation and diagnostic procedures, refer to the service manual for your server.




ALOM CMT and The System Controller

The ALOM system controller supports a total of nine concurrent sessions per server: eight connections through the network management port and one connection through the serial management port.

After you log in to your ALOM CMT account, the ALOM system controller command prompt (sc>) appears, and you can enter ALOM system controller commands. If the command you want to use has multiple options, you can enter the options either individually or grouped together, as shown in the following example. The two commands are equivalent.


sc> poweroff -f -y
sc> poweroff -fy
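
The equivalence of grouped and separate options can be illustrated with a small Python sketch. This is not ALOM CMT's own parser; the function name expand_short_options is hypothetical, and the sketch only assumes that each option is a single letter introduced by a hyphen.

```python
def expand_short_options(args):
    """Split grouped single-letter options ("-fy") into separate ones ("-f", "-y")."""
    expanded = []
    for arg in args:
        # A grouped option starts with one hyphen and carries more than one letter.
        if arg.startswith("-") and not arg.startswith("--") and len(arg) > 2:
            expanded.extend("-" + letter for letter in arg[1:])
        else:
            expanded.append(arg)
    return expanded


grouped = expand_short_options(["poweroff", "-fy"])
separate = expand_short_options(["poweroff", "-f", "-y"])
print(grouped == separate)  # prints True
```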

Logging In To ALOM CMT

All environmental monitoring and control is handled by ALOM CMT on the ALOM system controller. The ALOM system controller command prompt (sc>) provides you with a way of interacting with ALOM CMT. For more information about the sc> prompt, see ALOM CMT sc> Prompt.

For instructions on connecting to the ALOM system controller, see:



Note - This procedure assumes that the system console is directed to use the serial management and network management ports (the default configuration).




To Log In To ALOM CMT

1. If you are logged in to the system console, type #. (Pound-Period) to get to the sc> prompt.

Press the Pound key, followed by the Period key. Then press the Return key.

2. At the ALOM CMT login prompt, enter the login name and press Return.

The default login name is admin.


Sun(tm) Advanced Lights Out Manager 1.0.12
Please login: admin

3. At the password prompt, enter the password and press Return twice to get to the sc> prompt.


Please Enter password:
 
sc>



Note - There is no default password. You must assign a password during initial system configuration. For more information, refer to the installation guide and ALOM CMT guide for your server.





Caution - To provide optimum system security, change the default system login name and password during initial setup.



Using the ALOM system controller, you can monitor the system, turn the Locator LED on and off, or perform maintenance tasks on the ALOM system controller card itself. For more information, refer to the ALOM CMT guide for your server.


To View Environmental Information

1. Log in to the ALOM system controller.

2. Use the showenvironment command to display a snapshot of the server's environmental status.

The information this command can display includes temperature, power supply status, front panel LED status, and so on.



Note - Some environmental information might not be available when the server is in standby mode.





Note - You do not need ALOM system controller user permissions to use this command.



Interpreting System LEDs

The behavior of LEDs on the Sun Fire T2000 Server conforms to the American National Standards Institute (ANSI) Status Indicator Standard (SIS). These standard LED behaviors are described in TABLE 2-1.


TABLE 2-1 LED Behavior and Meaning

LED Behavior | Meaning
Off | The condition represented by the color is not true.
Steady on | The condition represented by the color is true.
Standby blink | The system is functioning at a minimal level and ready to resume full function.
Slow blink | Transitory activity or new activity represented by the color is taking place.
Fast blink | Attention is required.
Feedback flash | Activity is taking place commensurate with the flash rate (such as disk drive activity).


The LEDs have assigned meanings, described in TABLE 2-2.


TABLE 2-2 LED Behaviors with Assigned Meanings

Color | Behavior | Definition | Description
White | Off | Steady state |
White | Fast blink | 4 Hz repeating sequence, equal intervals on and off. | This indicator helps you to locate a particular enclosure, board, or subsystem. For example, the Locator LED.
Blue | Off | Steady state |
Blue | Steady on | Steady state | If blue is on, a service action can be performed on the applicable component with no adverse consequences. For example, the OK-to-Remove LED.
Yellow/Amber | Off | Steady state |
Yellow/Amber | Slow blink | 1 Hz repeating sequence, equal intervals on and off. | This indicator signals new fault conditions. Service is required. For example, the Service Required LED.
Yellow/Amber | Steady on | Steady state | The amber indicator stays on until the service action is completed and the system returns to normal function.
Green | Off | Steady state |
Green | Standby blink | Repeating sequence consisting of a brief (0.1 sec.) on flash followed by a long off period (2.9 sec.). | The system is running at a minimum level and is ready to be quickly revived to full function. For example, the System Activity LED.
Green | Steady on | Steady state | Status normal; system or component functioning with no service actions required.
Green | Slow blink | | A transitory (temporary) event is taking place for which direct proportional feedback is not needed or not feasible.
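
The blink definitions in TABLE 2-2 translate directly into on and off durations. The following Python sketch works through that arithmetic. The names BlinkPattern and equal_interval_pattern are hypothetical, and the only inputs are the frequencies and intervals stated in the table.

```python
from typing import NamedTuple


class BlinkPattern(NamedTuple):
    on_seconds: float
    off_seconds: float


def equal_interval_pattern(frequency_hz):
    """Return the pattern for a repeating sequence with equal on and off intervals."""
    period = 1.0 / frequency_hz
    return BlinkPattern(period / 2, period / 2)


# Timings taken from TABLE 2-2.
BLINK_PATTERNS = {
    "fast blink": equal_interval_pattern(4.0),     # 4 Hz (white Locator LED)
    "slow blink": equal_interval_pattern(1.0),     # 1 Hz (amber Service Required LED)
    "standby blink": BlinkPattern(0.1, 2.9),       # brief on flash, long off period
}

print(BLINK_PATTERNS["fast blink"])
print(BLINK_PATTERNS["slow blink"])
```

A 4 Hz fast blink therefore means 0.125 seconds on and 0.125 seconds off, and a 1 Hz slow blink means 0.5 seconds on and 0.5 seconds off.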


Controlling the Locator LED

You control the Locator LED from the sc> prompt or with the Locator button on the front of the chassis.


FIGURE 2-1 Locator Button on Sun Fire T2000 Chassis

Graphic image of the front panel of the Sun Fire T2000 server. The locator button is located in the upper left corner of the chassis.


To turn on the Locator LED, from the ALOM system controller command prompt, type:


sc> setlocator on
Locator LED is on.

To turn off the Locator LED, from the ALOM system controller command prompt, type:


sc> setlocator off
Locator LED is off.

To display the state of the Locator LED, from the ALOM system controller command prompt, type:


sc> showlocator
Locator LED is on.



Note - You do not need user permissions to use the setlocator and showlocator commands.




OpenBoot Emergency Procedures

The introduction of Universal Serial Bus (USB) keyboards with the newest Sun systems has made it necessary to change some of the OpenBoot emergency procedures. Specifically, the Stop-N, Stop-D, and Stop-F commands that were available on systems with non-USB keyboards are not supported on systems that use USB keyboards, such as the Sun Fire T2000 Server. If you are familiar with the earlier (non-USB) keyboard functionality, this section describes the analogous OpenBoot emergency procedures available in newer systems that use USB keyboards.

OpenBoot Emergency Procedures for Sun Fire T2000 Systems

The following sections describe how to perform the functions of the Stop commands on systems that use USB keyboards, such as the Sun Fire T2000 Server. These same functions are available through Sun Advanced Lights Out Manager (ALOM) system controller software.

Stop-A Functionality

The Stop-A (Abort) key sequence works the same as it does on systems with standard keyboards, except that it does not work during the first few seconds after the server is reset. In addition, you can issue the ALOM system controller break command. For more information, see Reaching the ok Prompt.

Stop-N Functionality

Stop-N functionality is not available. However, the Stop-N functionality can be closely emulated by completing the following steps, provided the system console is configured to be accessible using either the serial management port or the network management port.


To Restore OpenBoot Configuration Defaults

1. Log in to the ALOM system controller.

2. Type the following commands:


sc> bootmode reset_nvram
sc> bootmode bootscript="setenv auto-boot? false"
sc> 



Note - If you do not issue the poweroff and poweron commands or the reset command within 10 minutes, the host server ignores the bootmode command.



You can issue the bootmode command without arguments to display the current setting:


sc> bootmode
Bootmode: reset_nvram
Expires WED SEP 09 09:52:01 UTC 2005
bootscript="setenv auto-boot? false"

3. To reset the system, type the following command:


sc> reset
Are you sure you want to reset the system [y/n]?  y
sc> 

4. To view console output as the system boots with default OpenBoot configuration variables, switch to console mode.


sc> console
 
ok

5. Type set-defaults to discard any customized IDPROM values and to restore the default settings for all OpenBoot configuration variables.
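
The 10-minute window mentioned in the note for the bootmode command can be modeled with a short Python sketch. The function name bootmode_applies is hypothetical, and treating a reset at exactly 10 minutes as still valid is an assumption, because the note does not define the boundary.

```python
from datetime import datetime, timedelta

# Window stated in the note: the reset must follow bootmode within 10 minutes.
BOOTMODE_WINDOW = timedelta(minutes=10)


def bootmode_applies(set_at, reset_at):
    """Return True if the reset falls within the bootmode window.

    Assumption: a reset at exactly 10 minutes still counts.
    """
    elapsed = reset_at - set_at
    return timedelta(0) <= elapsed <= BOOTMODE_WINDOW


set_at = datetime(2005, 9, 9, 9, 42, 1)
print(bootmode_applies(set_at, datetime(2005, 9, 9, 9, 50, 0)))  # prints True
print(bootmode_applies(set_at, datetime(2005, 9, 9, 9, 55, 0)))  # prints False
```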

Stop-F Functionality

The Stop-F functionality is not available on systems with USB keyboards.

Stop-D Functionality

The Stop-D (Diags) key sequence is not supported on systems with USB keyboards. However, the Stop-D functionality can be closely emulated by setting the virtual keyswitch to diag, using the ALOM CMT setkeyswitch command. For more information, refer to the ALOM CMT guide for your server.


Automatic System Recovery

The system provides for automatic system recovery (ASR) from failures in memory modules or PCI cards.

Automatic system recovery functionality enables the system to resume operation after experiencing certain nonfatal hardware faults or failures. When ASR is enabled, the system's firmware diagnostics automatically detect failed hardware components. An auto-configuring capability designed into the system firmware enables the system to unconfigure failed components and to restore system operation. As long as the system is capable of operating without the failed component, the ASR features enable the system to reboot automatically, without operator intervention.



Note - ASR is not activated until you enable it. See Enabling and Disabling Automatic System Recovery.



For more information about ASR, refer to the service manual for your server.

Auto-Boot Options

The system firmware stores a configuration variable called auto-boot?, which controls whether the firmware will automatically boot the operating system after each reset. The default setting for Sun platforms is true.

Normally, if a system fails power-on diagnostics, auto-boot? is ignored and the system does not boot unless an operator boots the system manually. An automatic boot is generally not acceptable for booting a system in a degraded state. Therefore, the Sun Fire T2000 Server OpenBoot firmware provides a second setting, auto-boot-on-error?. This setting controls whether the system will attempt a degraded boot when a subsystem failure is detected. Both the auto-boot? and auto-boot-on-error? switches must be set to true to enable an automatic degraded boot. To set the switches, type:


ok setenv auto-boot? true
ok setenv auto-boot-on-error? true



Note - The default setting for auto-boot-on-error? is false. The system will not attempt a degraded boot unless you change this setting to true. In addition, the system will not attempt a degraded boot in response to any fatal nonrecoverable error, even if degraded booting is enabled. For examples of fatal nonrecoverable errors, see Error Handling Summary.
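
The interaction of the two switches with the diagnostic outcome can be summarized as a decision function. The following Python sketch encodes only the rules stated above; the names DiagResult and will_boot_automatically are hypothetical and do not correspond to any firmware interface.

```python
from enum import Enum


class DiagResult(Enum):
    PASSED = "passed"
    NONFATAL_ERROR = "nonfatal error"
    FATAL_ERROR = "fatal nonrecoverable error"


def will_boot_automatically(auto_boot, auto_boot_on_error, diag_result):
    """Decide whether the system boots without operator intervention."""
    if diag_result is DiagResult.FATAL_ERROR:
        # No degraded boot after a fatal nonrecoverable error, whatever the switches say.
        return False
    if diag_result is DiagResult.NONFATAL_ERROR:
        # A degraded boot requires both switches to be true.
        return auto_boot and auto_boot_on_error
    # Diagnostics passed: only auto-boot? matters.
    return auto_boot


# With the default settings (auto-boot? true, auto-boot-on-error? false):
print(will_boot_automatically(True, False, DiagResult.PASSED))          # prints True
print(will_boot_automatically(True, False, DiagResult.NONFATAL_ERROR))  # prints False
```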



Error Handling Summary

Error handling during the power-on sequence falls into one of the following three cases:



Note - If POST or OpenBoot firmware detects a nonfatal error associated with the normal boot device, the OpenBoot firmware automatically unconfigures the failed device and tries the next-in-line boot device, as specified by the boot-device configuration variable.



For more information about troubleshooting fatal errors, refer to the service manual for your server.

Reset Scenarios

Three ALOM CMT configuration variables, diag_mode, diag_level, and diag_trigger, control whether the system runs firmware diagnostics in response to system reset events.

The standard system reset protocol bypasses POST completely unless the virtual keyswitch or the ALOM CMT variables are set as follows:


TABLE 2-3 Virtual Keyswitch Setting for Reset Scenario

Keyswitch | Value
virtual keyswitch | diag


TABLE 2-4 ALOM CMT Variable Settings for Reset Scenario

Variable | Value
diag_mode | normal or service
diag_level | min or max
diag_trigger | power-on-reset error-reset


The default settings for these variables are:

Therefore, ASR is enabled by default. For instructions, see Enabling and Disabling Automatic System Recovery.

Automatic System Recovery User Commands

ALOM CMT commands are available for obtaining ASR status information and for manually unconfiguring or reconfiguring system devices. For more information, see:

Enabling and Disabling Automatic System Recovery

The automatic system recovery (ASR) feature is not activated until you enable it. Enabling ASR requires changing configuration variables in ALOM CMT as well as OpenBoot.


To Enable Automatic System Recovery

1. At the sc> prompt, type:


sc> setsc diag_mode normal
sc> setsc diag_level max
sc> setsc diag_trigger power-on-reset

2. At the ok prompt, type:


ok setenv auto-boot? true
ok setenv auto-boot-on-error? true



Note - For more information about OpenBoot configuration variables, refer to the service manual for your server.



3. To cause the parameter changes to take effect, type:


ok reset-all

The system permanently stores the parameter changes and boots automatically when the OpenBoot configuration variable auto-boot? is set to true (its default value).



Note - To store parameter changes, you can also power cycle the system using the front panel Power button.




To Disable Automatic System Recovery

1. At the ok prompt, type:


ok setenv auto-boot-on-error? false

2. To cause the parameter changes to take effect, type:


ok reset-all

The system permanently stores the parameter change.



Note - To store parameter changes, you can also power cycle the system using the front panel Power button.



After you disable the automatic system recovery (ASR) feature, it is not activated again until you re-enable it.

Obtaining Automatic System Recovery Information

Use the following procedure to retrieve information about the status of system components affected by automatic system recovery (ASR).

At the sc> prompt, type:


sc> showcomponent

In the showcomponent command output, any devices marked disabled have been manually unconfigured using the system firmware. The showcomponent command also lists devices that have failed firmware diagnostics and have been automatically unconfigured by the system firmware.

For more information, see:


Unconfiguring and Reconfiguring Devices

To support a degraded boot capability, the ALOM CMT firmware provides the disablecomponent command, which enables you to unconfigure system devices manually. This command "marks" the specified device as disabled by creating an entry in the ASR database. Any device marked disabled, whether manually or by the system's firmware diagnostics, is removed from the system's machine description prior to the hand-off to other layers of system firmware, such as OpenBoot PROM.


To Unconfigure a Device Manually

At the sc> prompt, type:


sc> disablecomponent asr-key

where asr-key is one of the device identifiers from TABLE 2-5.



Note - The device identifiers are not case sensitive. You can type them as uppercase or lowercase characters.




TABLE 2-5 Device Identifiers and Devices

Device Identifiers | Devices
MB/CMPcpu_number/Pstrand_number | CPU Strand (Number: 0-31)
PCIEslot_number | PCI-E Slot (Number: 0-2)
PCIXslot_number | PCI-X Slot (Number: 0-1)
IOBD/PCIEa | PCI-E leaf A (/pci@780)
IOBD/PCIEb | PCI-E leaf B (/pci@7c0)
TTYA | DB9 Serial Port
MB/CMP0/CHchannel_number/Rrank_number/Ddimm_number | DIMMs
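
The identifier patterns in TABLE 2-5 can be assembled programmatically, for example when scripting a series of disablecomponent commands. The following Python sketch builds keys from the patterns and number ranges in the table. All function names are hypothetical, and the channel, rank, and DIMM numbers are not range-checked because the table does not state their limits.

```python
def cpu_strand_key(cpu_number, strand_number):
    """Build an MB/CMPcpu_number/Pstrand_number key (strand range 0-31 per TABLE 2-5)."""
    if not 0 <= strand_number <= 31:
        raise ValueError("strand_number must be in the range 0-31")
    return f"MB/CMP{cpu_number}/P{strand_number}"


def pcie_slot_key(slot_number):
    """Build a PCIEslot_number key (slot range 0-2 per TABLE 2-5)."""
    if not 0 <= slot_number <= 2:
        raise ValueError("PCI-E slot_number must be in the range 0-2")
    return f"PCIE{slot_number}"


def pcix_slot_key(slot_number):
    """Build a PCIXslot_number key (slot range 0-1 per TABLE 2-5)."""
    if not 0 <= slot_number <= 1:
        raise ValueError("PCI-X slot_number must be in the range 0-1")
    return f"PCIX{slot_number}"


def dimm_key(channel_number, rank_number, dimm_number):
    """Build an MB/CMP0/CHchannel_number/Rrank_number/Ddimm_number key (no range check)."""
    return f"MB/CMP0/CH{channel_number}/R{rank_number}/D{dimm_number}"


def same_device(key_a, key_b):
    """Compare two keys the way the note describes: without regard to case."""
    return key_a.casefold() == key_b.casefold()


print(cpu_strand_key(0, 5))                      # prints MB/CMP0/P5
print(same_device("mb/cmp0/p5", "MB/CMP0/P5"))   # prints True
```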



To Reconfigure a Device Manually

1. At the sc> prompt, type:


sc> enablecomponent asr-key

where asr-key is any device identifier from TABLE 2-5.



Note - The device identifiers are not case sensitive. You can type them as uppercase or lowercase characters.



You can use the ALOM CMT enablecomponent command to reconfigure any device that you previously unconfigured with the disablecomponent command.


Displaying System Fault Information

ALOM CMT software lets you display current valid system faults. The showfaults command displays the fault ID, the faulted FRU device, and the fault message to standard output. The showfaults command also displays POST results. For example:


sc> showfaults
   ID FRU       Fault
    0 FT0.FM2   SYS_FAN at FT0.FM2 has FAILED.

Adding the -v option displays the time:


sc> showfaults -v
   ID Time              FRU         Fault
    0 MAY 20 10:47:32   FT0.FM2     SYS_FAN at FT0.FM2 has FAILED.

For more information about the showfaults command, refer to the ALOM CMT guide for your server.
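
If you capture showfaults output in a script, the non-verbose listing can be split into fields. The following Python sketch is a minimal parser that assumes the three-column layout of the single sample shown above (ID, FRU, Fault); the names Fault and parse_showfaults are hypothetical, and real output might differ.

```python
from typing import NamedTuple


class Fault(NamedTuple):
    fault_id: int
    fru: str
    message: str


def parse_showfaults(output):
    """Parse non-verbose showfaults output into Fault records.

    Assumption: each fault row is "<numeric ID> <FRU> <message>", as in the sample.
    """
    faults = []
    for line in output.splitlines():
        fields = line.split(None, 2)
        # Skip the command echo, the header row, and blank lines.
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        faults.append(Fault(int(fields[0]), fields[1], fields[2].strip()))
    return faults


SAMPLE = """\
sc> showfaults
   ID FRU       Fault
    0 FT0.FM2   SYS_FAN at FT0.FM2 has FAILED.
"""

for fault in parse_showfaults(SAMPLE):
    print(fault.fault_id, fault.fru, fault.message)
```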


To Display System Fault Information

At the sc> prompt, type:


sc> showfaults -v


Multipathing Software

Multipathing software lets you define and control redundant physical paths to I/O devices, such as storage devices and network interfaces. If the active path to a device becomes unavailable, the software can automatically switch to an alternate path to maintain availability. This capability is known as automatic failover. To take advantage of multipathing capabilities, you must configure the server with redundant hardware, such as redundant network interfaces or two host bus adapters connected to the same dual-ported storage array.
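
The idea of automatic failover can be sketched in a few lines of Python. This is a conceptual model only, not the behavior of any particular multipathing product; the class MultipathDevice and the path names are hypothetical.

```python
class MultipathDevice:
    """Track redundant paths to one device and switch when the active path fails."""

    def __init__(self, paths):
        if not paths:
            raise ValueError("at least one path is required")
        self.paths = list(paths)
        self.failed = set()
        self.active = self.paths[0]

    def mark_failed(self, path):
        """Record a path failure and fail over if the active path was affected."""
        self.failed.add(path)
        if path == self.active:
            self.active = self._next_available()

    def _next_available(self):
        # Pick the first configured path that has not failed, or None if all are down.
        for path in self.paths:
            if path not in self.failed:
                return path
        return None


device = MultipathDevice(["hba0", "hba1"])
device.mark_failed("hba0")
print(device.active)  # prints hba1
```

With only one configured path, a failure leaves no alternate, which is why redundant hardware is a prerequisite.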

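For the Sun Fire T2000 Server, three different types of multipathing software are available:

Solaris IP Network Multipathing

VERITAS Volume Manager (VVM) with its Dynamic Multipathing (DMP) feature

Sun StorEdge Traffic Manager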

For More Information

For instructions on how to configure and administer Solaris IP Network Multipathing, consult the IP Network Multipathing Administration Guide provided with your specific Solaris release.

For information about VVM and its DMP feature, refer to the documentation provided with the VERITAS Volume Manager software.

For information about Sun StorEdge Traffic Manager, refer to your Solaris OS documentation.


Storing FRU Information


To Store Information in Available FRU PROMs

At the sc> prompt, type:


sc> setfru -c data