CHAPTER 2

Managing RAS Features and System Firmware

This chapter describes how to manage reliability, availability, and serviceability (RAS) features and system firmware, including ILOM on the service processor, and automatic system recovery (ASR). In addition, this chapter describes how to unconfigure and reconfigure a device manually, and introduces multipathing software.

This chapter contains the following sections:



Note - This chapter does not cover detailed troubleshooting and diagnostic procedures. For information about fault isolation and diagnostic procedures, refer to the Sun Netra T5220 Server Service Manual.



ILOM and the Service Processor

The ILOM service processor supports a total of five concurrent sessions per server: four SSH connections through the network management port and one connection through the serial management port.

After you log in to your ILOM account, the ILOM service processor command prompt (->) appears, and you can enter ILOM service processor commands. If the command you want to use has multiple options, you can either enter the options individually or grouped together, as shown in the following example.


-> stop -force -script /SYS
-> start -script /SYS

Logging In To ILOM

All environmental monitoring and control is handled by ILOM on the ILOM service processor. The ILOM service processor command prompt (->) provides you with a way of interacting with ILOM. For more information about the -> prompt, see ILOM -> Prompt.

For instructions on connecting to the ILOM service processor, see:



Note - This procedure assumes that the system console is directed to use the serial management and network management ports (the default configuration).



To Log In To ILOM

1. At the ILOM login prompt, enter the login name and press Return.

The default login name is root.


Integrated Lights Out Manager 2.0
Please login: root

2. At the password prompt, enter the password and press Return to get to the -> prompt.


Please Enter password:
->



Note - The default user is root and the password is changeme. For more information, refer to the Sun Netra T5220 Server Installation Guide, the Integrated Lights Out Management User’s Guide, and the Integrated Lights Out Management 2.0 Supplement for the Sun Netra T5220 Server.




Caution - To provide optimum system security, change the default system password during initial setup.


Using the ILOM service processor, you can monitor the system, turn the Locator LED on and off, or perform maintenance tasks on the ILOM service processor itself. For more information, refer to the ILOM user’s guide and the ILOM supplement for your server.


To View Environmental Information

1. Log in to the ILOM service processor.

2. Use the following command to display a snapshot of the server’s environmental status.

show /SP/faultmgmt



Note - You do not need ILOM administrator permissions to use this command.



Status Indicators

The system has LED indicators associated with the server itself and with various components. The server status indicators are located on the bezel and repeated on the back panel. The components that use LED indicators to convey status are the dry contact alarm card, power supply units, Ethernet ports, and hard drives.

The topics in this section include:

Interpreting System LEDs

The behavior of LEDs on the Sun Netra T5220 server conforms to the American National Standards Institute (ANSI) Status Indicator Standard (SIS). These standard LED behaviors are described in TABLE 2-1.


TABLE 2-1 Standard LED Behaviors and Values

| LED Behavior | Meaning |
|---|---|
| Off | The condition represented by the color is not true. |
| Steady on | The condition represented by the color is true. |
| Standby blink | The system is functioning at a minimal level and ready to resume full function. |
| Slow blink | Transitory activity or new activity represented by the color is taking place. |
| Fast blink | Attention is required. |
| Feedback flash | Activity is taking place commensurate with the flash rate (such as disk drive activity). |


The system LEDs have assigned meanings, described in TABLE 2-2.


TABLE 2-2 System LED Behaviors With Assigned Meanings

| Color | Behavior | Definition | Description |
|---|---|---|---|
| White | Off | Steady state | |
| | Fast blink | 4-Hz repeating sequence, equal intervals On and Off. | This indicator helps you to locate a particular enclosure, board, or subsystem. Example: the Locator LED. |
| Blue | Off | Steady state | |
| | Steady on | Steady state | If blue is on, a service action can be performed on the applicable component with no adverse consequences. Example: the OK-to-Remove LED. |
| Yellow/amber | Off | Steady state | |
| | Slow blink | 1-Hz repeating sequence, equal intervals On and Off. | This indicator signals new fault conditions. Service is required. Example: the Service Required LED. |
| | Steady on | Steady state | The amber indicator stays on until the service action is completed and the system returns to normal function. |
| Green | Off | Steady state | |
| | Standby blink | Repeating sequence consisting of a brief (0.1 sec.) on flash followed by a long off period (2.9 sec.) | The system is running at a minimum level and is ready to be quickly revived to full function. Example: the System Activity LED. |
| | Steady on | Steady state | Status normal. System or component functioning with no service actions required. |
| | Slow blink | | A transitory (temporary) event is taking place for which direct proportional feedback is not needed or is not feasible. |
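
The blink definitions in TABLE 2-2 translate directly into on and off durations. The following Python sketch is illustrative only: the function and dictionary names are assumptions and are not part of ILOM. Only the 4-Hz and 1-Hz rates and the 0.1-second and 2.9-second values come from the table.

```python
# Illustrative only: timing values are taken from TABLE 2-2; all names are hypothetical.

def equal_blink(frequency_hz):
    """Return (on_seconds, off_seconds) for a blink with equal on and off intervals."""
    period = 1.0 / frequency_hz
    return (period / 2, period / 2)

LED_TIMING = {
    "fast blink": equal_blink(4),   # 4-Hz repeating sequence (Locator LED)
    "slow blink": equal_blink(1),   # 1-Hz repeating sequence (Service Required LED)
    "standby blink": (0.1, 2.9),    # brief on flash, long off period
}

def duty_cycle(behavior):
    """Return the fraction of each cycle during which the LED is lit."""
    on, off = LED_TIMING[behavior]
    return on / (on + off)

if __name__ == "__main__":
    for name, (on, off) in LED_TIMING.items():
        print(f"{name}: on {on:.3f} s, off {off:.3f} s, duty cycle {duty_cycle(name):.1%}")
```

Under these definitions, a fast blink is lit for 0.125 seconds at a time and a standby blink is lit for roughly 3 percent of each 3-second cycle.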


Bezel Server Status Indicators

FIGURE 2-1 shows the location of the bezel indicators, and TABLE 2-3 provides information about the server status indicators.

FIGURE 2-1 Location of the Bezel Server Status and Alarm Status Indicators


Figure shows the front panel of the Sun Netra T5220 server. The locator button (top button) is located in the upper left corner of the chassis.


Figure Legend

| 1 | User (amber) Alarm Status Indicator | 5 | Locator LED |
| 2 | Minor (amber) Alarm Status Indicator | 6 | Fault LED |
| 3 | Major (red) Alarm Status Indicator | 7 | Activity LED |
| 4 | Critical (red) Alarm Status Indicator | 8 | Power Button |



TABLE 2-3 Bezel Server Status Indicators

| Indicator | LED Color | LED State | Component Status |
|---|---|---|---|
| Locator | White | On | Server is identified |
| | | Off | Normal state |
| Fault | Amber | On | The server has detected a problem and requires the attention of service personnel. |
| | | Off | The server has no detected faults. |
| Activity | Green | On | The server is powered up and running the Oracle Solaris Operating System. |
| | | Off | Either power is not present or the Oracle Solaris software is not running. |


Alarm Status Indicators

The dry contact alarm card has four LED status indicators that are supported by ILOM. They are located vertically on the bezel (FIGURE 2-1). Information on the alarm indicators and dry contact alarm states is provided in TABLE 2-4. For more information on alarm indicators, see the Integrated Lights Out Management User’s Guide.


TABLE 2-4 Alarm Indicators and Dry Contact Alarm States

| Indicator and Relay Labels | Indicator Color | Application or Server State | Condition or Action | Activity Indicator State | Alarm Indicator State | Relay NC[1] State | Relay NO[2] State | Comments |
|---|---|---|---|---|---|---|---|---|
| Critical (Alarm0) | Red | Server state (Power on or off, and Oracle Solaris OS functional or not functional) | No power input | Off | Off | Closed | Open | Default state |
| | | | System power off | Off | Off[3] | Closed | Open | Input power connected |
| | | | System power turns on, Oracle Solaris OS not fully loaded | Off | Off | Closed | Open | Transient state |
| | | | Oracle Solaris OS successfully loaded | On | Off | Open | Closed | Normal operating state |
| | | | Watchdog timeout | Off | On | Closed | Open | Transient state, reboot Oracle Solaris OS |
| | | | Oracle Solaris OS shutdown initiated by user[4] | Off | Off | Closed | Open | Transient state |
| | | | Lost input power | Off | Off | Closed | Open | Default state |
| | | | System power shutdown by user | Off | Off | Closed | Open | Transient state |
| | | Application state | User sets critical alarm to on[5] | -- | On | Closed | Open | Critical fault detected |
| | | | User sets critical alarm to off | -- | Off | Open | Closed | Critical fault cleared |
| Major (Alarm1) | Red | Application state | User sets major alarm to on | -- | On | Open | Closed | Major fault detected |
| | | | User sets major alarm to off | -- | Off | Closed | Open | Major fault cleared |
| Minor (Alarm2) | Amber | Application state | User sets minor alarm to on | -- | On | Open | Closed | Minor fault detected |
| | | | User sets minor alarm to off | -- | Off | Closed | Open | Minor fault cleared |
| User (Alarm3) | Amber | Application state | User sets user alarm to on | -- | On | Open | Closed | User fault detected |
| | | | User sets user alarm to off | -- | Off | Closed | Open | User fault cleared |
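
The application-state rows of TABLE 2-4 show that the critical relay reads the opposite way from the major, minor, and user relays. The following Python sketch restates those rows as a lookup. The function name and argument names are hypothetical, and the sketch covers only the rows in which the user sets an alarm on or off, not the server-state rows.

```python
# Sketch of the application-state rows of TABLE 2-4; all names are hypothetical.

ALARMS = ("critical", "major", "minor", "user")

def relay_contacts(alarm, indicator_on):
    """Return (nc_state, no_state) after the user sets the given alarm on or off."""
    if alarm not in ALARMS:
        raise ValueError(f"unknown alarm: {alarm}")
    # Major, minor, and user: setting the alarm on opens the NC contact.
    # Critical: the table lists the reverse, so the NC contact opens when
    # the alarm is set off and is closed while the alarm is on.
    nc_open = (not indicator_on) if alarm == "critical" else indicator_on
    return ("Open", "Closed") if nc_open else ("Closed", "Open")

if __name__ == "__main__":
    for alarm in ALARMS:
        for state in (True, False):
            nc, no = relay_contacts(alarm, state)
            print(f"{alarm:8} alarm {'on ' if state else 'off'} -> NC {nc}, NO {no}")
```

One consequence visible in the table is that the critical relay contacts with no input power (NC closed, NO open) match the contacts while a critical alarm is on.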


When the user sets an alarm, a message is displayed on the console. For example, when the critical alarm is set, the following message is displayed on the console:


SC Alert: CRITICAL ALARM is set 

In certain instances when the critical alarm is set, the associated alarm indicator is not lit.

Controlling the Locator LED

You control the Locator LED from the -> prompt or with the Locator button on the front of the chassis.


To Control the Locator LED

To turn on the Locator LED, from the ILOM service processor command prompt, type:


-> set /SYS/LOCATE value=on

To turn off the Locator LED, from the ILOM service processor command prompt, type:


-> set /SYS/LOCATE value=off

To display the state of the Locator LED, from the ILOM service processor command prompt, type:


-> show /SYS/LOCATE



Note - You do not need Administrator permissions to use the set /SYS/LOCATE and show /SYS/LOCATE commands.



OpenBoot Emergency Procedures

The introduction of Universal Serial Bus (USB) keyboards with the newest systems has made it necessary to change some of the OpenBoot emergency procedures. Specifically, the Stop-N, Stop-D, and Stop-F commands that were available on systems with non-USB keyboards are not supported on systems that use USB keyboards, such as the Sun Netra T5220 server. If you are familiar with the earlier (non-USB) keyboard functionality, this section describes the analogous OpenBoot emergency procedures available in newer systems that use USB keyboards.

OpenBoot Emergency Procedures for the Sun Netra T5220 System

The following sections describe how to perform the functions of the Stop commands on systems that use USB keyboards. These same functions are available through Integrated Lights Out Manager (ILOM) system controller software.

Stop-N Functionality

Stop-N functionality is not available. However, you can closely emulate the Stop-N functionality by completing the following steps, provided the system console is configured to be accessible using either the serial management port or the network management port.


To Restore OpenBoot Configuration Defaults

1. Log in to the ILOM service processor.

2. Type the following commands:


-> set /HOST/bootmode state=reset_nvram
-> set /HOST/bootmode script="setenv auto-boot? false"
-> 



Note - If you do not issue the stop /SYS and start /SYS commands or the reset /SYS command within 10 minutes, the host server ignores the set /HOST/bootmode commands.


You can issue the show /HOST/bootmode command without arguments to display the current setting.


-> show /HOST/bootmode
 
 /HOST/bootmode
    Targets:
 
    Properties:
        config = (none)
        expires = Tue Jan 19 03:14:07 2038
        script = (none)
        state = normal 

3. To reset the system, type the following commands:


-> reset /SYS
Are you sure you want to reset /SYS (y/n)?  y
-> 

4. To view console output as the system boots with default OpenBoot configuration variables, switch to console mode.


-> start /SP/console

5. To discard any customized IDPROM values and restore the default settings for all OpenBoot configuration variables, type:


-> set /SP reset_to_defaults=all 
-> reset /SP
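
The 10-minute limit described in the Note in Step 2 can be checked with simple time arithmetic. The following Python sketch is an illustration under stated assumptions: the function name and the timestamps are hypothetical, and the sketch treats a reset at exactly 10 minutes as still inside the window, which the source text does not specify.

```python
from datetime import datetime, timedelta

# The 10-minute value comes from the Note in Step 2; everything else is hypothetical.
BOOTMODE_WINDOW = timedelta(minutes=10)

def bootmode_still_applies(set_at, reset_at):
    """Return True if the host is reset within the window after the bootmode change."""
    elapsed = reset_at - set_at
    return timedelta(0) <= elapsed <= BOOTMODE_WINDOW

if __name__ == "__main__":
    changed = datetime(2024, 1, 1, 12, 0, 0)
    print(bootmode_still_applies(changed, changed + timedelta(minutes=4)))   # inside the window
    print(bootmode_still_applies(changed, changed + timedelta(minutes=15)))  # outside the window
```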

Stop-F Functionality

The Stop-F functionality is not available on systems with USB keyboards.

Stop-D Functionality

The Stop-D (Diags) key sequence is not supported on systems with USB keyboards. However, you can closely emulate the Stop-D functionality by setting the virtual keyswitch to diag, using the ILOM set /SYS keyswitch_state=diag command. For more information, refer to the Integrated Lights Out Management User’s Guide and the Integrated Lights Out Management 2.0 Supplement for the Sun Netra T5220 Server.


Automatic System Recovery

The system provides for automatic system recovery (ASR) from failures in memory modules or PCI cards.

Automatic system recovery functionality enables the system to resume operation after experiencing certain nonfatal hardware faults or failures. When ASR is enabled, the system’s firmware diagnostics automatically detect failed hardware components. An autoconfiguring capability designed into the system firmware enables the system to unconfigure failed components and to restore system operation. As long as the system is capable of operating without the failed component, the ASR features enable the system to reboot automatically, without operator intervention.



Note - ASR is not activated until you enable it. See Enabling and Disabling Automatic System Recovery.


For more information about ASR, refer to the Sun Netra T5220 Server Service Manual.

Auto-Boot Options

The system firmware stores a configuration variable called auto-boot?, which controls whether the firmware will automatically boot the operating system after each reset. The default setting for Sun Netra platforms is true.

Normally, if a system fails power-on diagnostics, auto-boot? is ignored and the system does not boot unless an operator boots the system manually. An automatic boot is generally not acceptable for booting a system in a degraded state. Therefore, the server OpenBoot firmware provides a second setting, auto-boot-on-error?. This setting controls whether the system will attempt a degraded boot when a subsystem failure is detected. Both the auto-boot? and auto-boot-on-error? switches must be set to true to enable an automatic degraded boot. To set the switches, type:


ok setenv auto-boot? true
ok setenv auto-boot-on-error? true



Note - The default setting for auto-boot-on-error? is false. The system will not attempt a degraded boot unless you change this setting to true. In addition, the system will not attempt a degraded boot in response to any fatal nonrecoverable error, even if degraded booting is enabled. For examples of fatal nonrecoverable errors, see Error Handling Summary.
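
The interaction between auto-boot? and auto-boot-on-error? can be summarized as a small decision function. The following Python sketch is a simplified reading of the text above, not firmware logic: the function name, the three diagnostic outcome labels, and the returned strings are all hypothetical.

```python
# Simplified model of the auto-boot behavior described above; all names are hypothetical.

def boot_action(auto_boot, auto_boot_on_error, diagnostics):
    """Return the expected action after a reset.

    diagnostics is one of "pass", "nonfatal", or "fatal".
    """
    if diagnostics not in ("pass", "nonfatal", "fatal"):
        raise ValueError(f"unknown diagnostics result: {diagnostics}")
    if not auto_boot:
        return "no automatic boot"
    if diagnostics == "pass":
        return "normal boot"
    # A degraded boot needs both switches set to true and a nonfatal error.
    if diagnostics == "nonfatal" and auto_boot_on_error:
        return "degraded boot"
    return "no automatic boot"

if __name__ == "__main__":
    for result in ("pass", "nonfatal", "fatal"):
        print(result, "->", boot_action(True, True, result))
```

With the default settings (auto-boot? true, auto-boot-on-error? false), this model boots only when diagnostics pass.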


Error Handling Summary

Error handling during the power-on sequence falls into one of the following three cases:



Note - If POST or OpenBoot firmware detects a nonfatal error associated with the normal boot device, the OpenBoot firmware automatically unconfigures the failed device and tries the next-in-line boot device, as specified by the boot-device configuration variable.


For more information about troubleshooting fatal errors, refer to the Sun Netra T5220 Server Service Manual.

Reset Scenarios

Three ILOM /HOST/diag configuration properties, mode, level, and trigger, control whether the system runs firmware diagnostics in response to system reset events.

The standard system reset protocol bypasses POST completely unless the virtual keyswitch or ILOM properties are set as follows:


TABLE 2-5 Virtual Keyswitch Setting for Reset Scenario

| Keyswitch | Value |
|---|---|
| /SYS keyswitch_state | diag |


If keyswitch_state is set to diag, the system can power itself on using preset values of diagnostic properties (/HOST/diag level=max, /HOST/diag mode=max, /HOST/diag verbosity=max) to provide thorough fault coverage. This option overrides the values of diagnostic properties that you might have set elsewhere.


TABLE 2-6 ILOM Property Settings for Reset Scenario

| Property | Value |
|---|---|
| mode | normal or service |
| level | min or max |
| trigger | power-on-reset error-reset |


The default settings for these properties are:

For instructions on automatic system recovery (ASR), see Enabling and Disabling Automatic System Recovery.
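
The conditions in TABLE 2-5 and TABLE 2-6 can be read as a single test for whether POST runs on a reset. The following Python sketch is one simplified interpretation of those tables, not a description of the firmware. The function name and the reset_event argument are hypothetical, and any property value not listed in the tables (such as a keyswitch state of normal or a mode of off) is an assumption used only for illustration.

```python
# One simplified reading of TABLE 2-5 and TABLE 2-6; all names are hypothetical.

def post_runs(keyswitch_state, mode, level, trigger, reset_event):
    """Return True if firmware diagnostics would run for the given reset event."""
    # TABLE 2-5: a diag keyswitch overrides the other diagnostic properties.
    if keyswitch_state == "diag":
        return True
    # TABLE 2-6: mode, level, and trigger must all hold a listed value,
    # and the trigger list must include the reset event that occurred.
    return (
        mode in ("normal", "service")
        and level in ("min", "max")
        and reset_event in trigger.split()
    )

if __name__ == "__main__":
    print(post_runs("normal", "normal", "max", "power-on-reset error-reset", "error-reset"))
    print(post_runs("normal", "off", "max", "power-on-reset", "power-on-reset"))
```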

Automatic System Recovery User Commands

ILOM commands are available for obtaining ASR status information and for manually unconfiguring or reconfiguring system devices. For more information, see:

Enabling and Disabling Automatic System Recovery

The automatic system recovery (ASR) feature is not activated until you enable it. Enabling ASR requires changing configuration variables in ILOM as well as in OpenBoot firmware.


To Enable Automatic System Recovery

1. At the -> prompt, type:


-> set /HOST/diag mode=normal
-> set /HOST/diag level=max
-> set /HOST/diag trigger=power-on-reset

2. At the ok prompt, type:


ok setenv auto-boot? true
ok setenv auto-boot-on-error? true



Note - For more information about OpenBoot configuration variables, refer to the service manual for your server.


3. To cause the parameter changes to take effect, type:


ok reset-all

The system permanently stores the parameter changes and boots automatically when the OpenBoot configuration variable auto-boot? is set to true (its default value).



Note - To store parameter changes, you can also power cycle the system using the front panel Power button.



To Disable Automatic System Recovery

1. At the ok prompt, type:


ok setenv auto-boot-on-error? false

2. To cause the parameter changes to take effect, type:


ok reset-all

The system permanently stores the parameter change.



Note - To store parameter changes, you can also power cycle the system using the front panel Power button.


After you disable the ASR feature, it is not activated again until you re-enable it.
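
The two procedures above can be captured as ordered command lists, which is convenient for a checklist or a runbook. In the following Python sketch, the command text is copied from the procedures and from the auto-boot? variable name used earlier in this chapter. The helper function itself is hypothetical and does not connect to a server or run anything.

```python
# Command text comes from the procedures above; the helper is hypothetical.

def asr_commands(enable):
    """Return (prompt, command) pairs in the order the procedures list them."""
    if enable:
        return [
            ("->", "set /HOST/diag mode=normal"),
            ("->", "set /HOST/diag level=max"),
            ("->", "set /HOST/diag trigger=power-on-reset"),
            ("ok", "setenv auto-boot? true"),
            ("ok", "setenv auto-boot-on-error? true"),
            ("ok", "reset-all"),
        ]
    return [
        ("ok", "setenv auto-boot-on-error? false"),
        ("ok", "reset-all"),
    ]

if __name__ == "__main__":
    for prompt, command in asr_commands(enable=True):
        print(prompt, command)
```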

Obtaining Automatic System Recovery Information


To Retrieve Information About the Status of System Components Affected by ASR

At the -> prompt, type:


-> show /SYS/component component_state

In the show /SYS/component component_state command output, any devices marked disabled have been manually unconfigured using the system firmware. The command output also shows devices that have failed firmware diagnostics and have been automatically unconfigured by the system firmware.

For more information, see:


Unconfiguring and Reconfiguring Devices

To support a degraded boot capability, the ILOM firmware provides the set Device-Identifier component_state=disabled command, which enables you to unconfigure system devices manually. This command “marks” the specified device as disabled by creating an entry in the ASR database. Any device marked disabled, whether manually or by the system’s firmware diagnostics, is removed from the system’s machine description prior to the hand-off to other layers of system firmware, such as OpenBoot PROM.


To Unconfigure a Device Manually

At the -> prompt, type:


-> set Device-Identifier component_state=disabled

where the Device-Identifier is one of the device identifiers from TABLE 2-7.



Note - The device identifiers are not case sensitive. You can type them as uppercase or lowercase characters.



TABLE 2-7 Device Identifiers and Devices

| Device Identifiers | Devices |
|---|---|
| /SYS/MB/PCI-MEZZ/PCIXnumber | |
| /SYS/MB/PCI-MEZZ/PCIEnumber | |
| /SYS/MB/CMPcpu-number/Pstrand-number | CPU Strand (Number: 0-63) |
| /SYS/MB/RISERriser-number/PCIEslot-number | PCIe Slot (Number: 0-2) |
| /SYS/MB/RISERriser-number/XAUIcard-number | XAUI card (Number: 0-1) |
| /SYS/MB/GBEcontroller-number | GBE controllers (Number: 0-1). GBE0 controls NET0 and NET1. GBE1 controls NET2 and NET3. |
| /SYS/MB/PCIE | PCIe root complex |
| /SYS/MB/USBnumber | USB ports (Number: 0-1, located on rear of chassis) |
| /SYS/MB/CMP0/L2-BANKnumber | (Number: 0-3) |
| /SYS/USBBD/USBnumber | USB ports (Number: 2-3, located on front of chassis) |
| /SYS/MB/CMP0/BRbranch-number/CHchannel-number/Ddimm-number | DIMMs |



To Reconfigure a Device Manually

At the -> prompt, type:


-> set Device-Identifier component_state=enabled

where the Device-Identifier is any device identifier from TABLE 2-7.



Note - The device identifiers are not case sensitive. You can type them as uppercase or lowercase characters.


You can use the ILOM set Device-Identifier component_state=enabled command to reconfigure any device that you previously unconfigured with the set Device-Identifier component_state=disabled command.
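
Because each device identifier in TABLE 2-7 combines a fixed path with a number from a documented range, the command strings can be generated and range-checked before you type them. The following Python sketch covers only the identifiers that take a single number with a range stated in the table. The dictionary keys and function names are hypothetical, and the sketch only builds text; it does not communicate with ILOM.

```python
# Paths and number ranges come from TABLE 2-7; keys and function names are hypothetical.

SINGLE_NUMBER_DEVICES = {
    "GBE": ("/SYS/MB/GBE{n}", range(0, 2)),
    "USB_REAR": ("/SYS/MB/USB{n}", range(0, 2)),
    "USB_FRONT": ("/SYS/USBBD/USB{n}", range(2, 4)),
    "L2_BANK": ("/SYS/MB/CMP0/L2-BANK{n}", range(0, 4)),
}

def device_identifier(kind, number):
    """Build a device identifier, rejecting numbers outside the documented range."""
    template, valid = SINGLE_NUMBER_DEVICES[kind]
    if number not in valid:
        raise ValueError(f"{kind} number must be {valid.start}-{valid.stop - 1}")
    return template.format(n=number)

def component_state_command(kind, number, enabled):
    """Return the ILOM command text that enables or disables the device."""
    state = "enabled" if enabled else "disabled"
    return f"set {device_identifier(kind, number)} component_state={state}"

def net_ports_for_gbe(controller):
    """GBE0 controls NET0 and NET1; GBE1 controls NET2 and NET3."""
    device_identifier("GBE", controller)  # validates the controller number
    return [f"NET{2 * controller}", f"NET{2 * controller + 1}"]

if __name__ == "__main__":
    print(component_state_command("GBE", 1, enabled=False))
    print(net_ports_for_gbe(1))
```

The port mapping is a reminder that unconfiguring one GBE controller affects two network ports.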


Displaying System Fault Information

ILOM software enables you to display current valid system faults.


To Display Current Valid System Faults

Type:


-> show /SP/faultmgmt 

This command displays the fault ID, the faulted FRU device, and the fault message to standard output. The show /SP/faultmgmt command also displays POST results.

For example:


-> show /SP/faultmgmt
  /SP/faultmgmt
     Targets:
         0 (/SYS/PS1)
 
     Properties:
 
 
     Commands:
         cd
         show
->

For more information about the show /SP/faultmgmt command, refer to the ILOM guide and the ILOM supplement for your server.
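
If you capture the command output in a log, the faulted components can be pulled out of the Targets section with a short script. The following Python sketch parses the example output shown above. It assumes that real output follows the same layout as that example, which this chapter does not guarantee, and the function name is hypothetical.

```python
import re

# Sample text copied from the example above.
SAMPLE_OUTPUT = """\
-> show /SP/faultmgmt
  /SP/faultmgmt
     Targets:
         0 (/SYS/PS1)

     Properties:


     Commands:
         cd
         show
->
"""

# Matches a line such as "0 (/SYS/PS1)": a fault index, then a path in parentheses.
TARGET_LINE = re.compile(r"^\s*(\d+)\s+\((/\S+)\)\s*$")

def faulted_targets(output):
    """Return {fault_index: component_path} from the Targets section of the output."""
    faults = {}
    in_targets = False
    for line in output.splitlines():
        stripped = line.strip()
        if stripped == "Targets:":
            in_targets = True
            continue
        if stripped.endswith(":"):
            in_targets = False  # a new section such as Properties: or Commands:
        if in_targets:
            match = TARGET_LINE.match(line)
            if match:
                faults[int(match.group(1))] = match.group(2)
    return faults

if __name__ == "__main__":
    print(faulted_targets(SAMPLE_OUTPUT))
```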


To Clear a Fault

Type:


-> set /SYS/component clear_fault_action=true

Setting clear_fault_action to true clears the fault at the component and all levels below it in the /SYS tree.


Storing FRU Information


To Store Information in Available FRU PROMs

At the -> prompt, type:


-> set /SP customer_frudata=data


Multipathing Software

Multipathing software enables you to define and control redundant physical paths to I/O devices such as storage devices and network interfaces. If the active path to a device becomes unavailable, the software can automatically switch to an alternate path to maintain availability. This capability is known as automatic failover. To take advantage of multipathing capabilities, you must configure the server with redundant hardware, such as redundant network interfaces or two host bus adapters connected to the same dual-ported storage array.
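
The idea of automatic failover can be shown with a small path-selection routine. The following Python sketch is a generic illustration only. It does not describe how any particular multipathing product is implemented, and the function name, the path names, and the availability callback are all hypothetical.

```python
# Generic illustration of automatic failover; all names are hypothetical.

def select_path(paths, is_available, active=None):
    """Keep the active path while it works; otherwise switch to the first available alternate."""
    if active is not None and is_available(active):
        return active
    for path in paths:
        if is_available(path):
            return path
    raise RuntimeError("no available path to the device")

if __name__ == "__main__":
    paths = ["hba0", "hba1"]          # two redundant paths to the same storage array
    failed = {"hba0"}                 # simulate loss of the active path
    print(select_path(paths, lambda p: p not in failed, active="hba0"))
```

The sketch also shows why redundant hardware is required: with a single path, a failure leaves nothing to switch to.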

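For the Sun Netra T5220 Server, three different types of multipathing software are available:

  • Oracle Solaris IP Network Multipathing
  • VERITAS Volume Manager (VVM) software, which includes the Dynamic Multipathing (DMP) feature
  • Sun StorageTek Traffic Manager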

For More Information

For instructions on how to configure and administer Oracle Solaris IP Network Multipathing, consult the IP Network Multipathing Administration Guide provided with your specific Oracle Solaris release.

For information about VVM and its DMP feature, refer to the documentation provided with the VERITAS Volume Manager software.

For information about Sun StorageTek Traffic Manager, refer to your Oracle Solaris OS documentation.


[1] NC state is the normally closed state. This state represents the default mode of the relay contacts in the normally closed state.
[2] NO state is the normally open state. This state represents the default mode of the relay contacts in the normally open state.
[3] The implementation of this alarm indicator state is subject to change.
[4] The user can shut down the system using commands such as init 0 and init 6. These commands do not remove power from the system.
[5] Based on a determination of the fault conditions, the user can turn the alarm on using the Oracle Solaris platform alarm API or ILOM CLI.