CHAPTER 5

Managing RAS Features and System Firmware

This chapter describes how to manage reliability, availability, and serviceability (RAS) features and system firmware, including the Sun Advanced Lights Out Manager (ALOM) system controller, automatic system restoration (ASR), and the hardware watchdog mechanism. In addition, this chapter describes how to unconfigure and reconfigure a device manually, and introduces multipathing software.

This chapter contains the following sections:



Note - This chapter does not cover detailed troubleshooting and diagnostic procedures. For information about fault isolation and diagnostic procedures, see Chapter 8 and Chapter 9.



About Reliability, Availability, and Serviceability Features

Reliability, availability, and serviceability (RAS) are aspects of a system's design that affect its ability to operate continuously and to minimize the time necessary to service the system.

Together, reliability, availability, and serviceability features provide near-continuous system operation.

To deliver high levels of reliability, availability, and serviceability, the Sun Fire V445 server offers the following features:

Hot-Pluggable and Hot-Swappable Components

Sun Fire V445 hardware is designed to support hot-plugging of internal disk drives. By using the proper software commands, you can install or remove these components while the system is running. The server also supports hot-swapping of power supplies, fan trays, and USB components. These components can be removed and installed without issuing software commands. Hot-plug and hot-swap technology significantly increases the system's serviceability and availability by providing you with the ability to do the following:

For additional information about the system's hot-pluggable and hot-swappable components, see About Hot-Pluggable and Hot-Swappable Components.

2+2 Power Supply Redundancy

The system features four hot-pluggable power supplies, any two of which are capable of handling the system's entire load. Thus, the four power supplies provide N+N redundancy (here, 2+2), enabling the system to continue operating should up to two of the power supplies or their AC power source fail.
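The redundancy rule described above can be expressed as a simple check. The following is a minimal sketch for illustration only, not part of any Sun software. The function and constant names are hypothetical; the only facts taken from the text are that four supplies are installed and that any two can carry the full load.

```python
# Hypothetical illustration of the power supply redundancy rule.
INSTALLED_SUPPLIES = 4   # from the text: four hot-pluggable power supplies
REQUIRED_SUPPLIES = 2    # from the text: any two can carry the entire load


def can_carry_full_load(failed_supplies: int) -> bool:
    """Return True if the remaining supplies can still power the system."""
    if not 0 <= failed_supplies <= INSTALLED_SUPPLIES:
        raise ValueError("failed_supplies must be between 0 and 4")
    working = INSTALLED_SUPPLIES - failed_supplies
    return working >= REQUIRED_SUPPLIES


if __name__ == "__main__":
    for failed in range(INSTALLED_SUPPLIES + 1):
        print(failed, "failed ->", can_carry_full_load(failed))
```

With zero, one, or two failed supplies the check passes; a third failure leaves too few supplies to carry the load.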

For more information about power supplies, redundancy, and configuration rules, see About the Power Supplies.

ALOM System Controller

The Sun Advanced Lights Out Manager (ALOM) system controller is a secure server management tool that comes preinstalled on the Sun Fire V445 server, in the form of a module with preinstalled firmware. It lets you monitor and control your server over a serial line or over a network. The ALOM system controller provides remote system administration for geographically distributed or physically inaccessible systems. You can connect to the ALOM system controller card using a local alphanumeric terminal, a terminal server, or a modem connected to its serial management port, or over a network using its 10BASE-T network management port.

For more details about the ALOM system controller hardware, see About the ALOM System Controller Card.

For information about configuring and using the ALOM system controller, see:

Environmental Monitoring and Control

The Sun Fire V445 server features an environmental monitoring subsystem that protects the server and its components against:

Monitoring and control capabilities are handled by the ALOM system controller firmware. This ensures that monitoring capabilities remain operational even if the system has halted or is unable to boot, and without requiring the system to dedicate CPU and memory resources to monitor itself. If the ALOM system controller fails, the operating system reports the failure and takes over limited environmental monitoring and control functions.

The environmental monitoring subsystem uses an industry-standard I2C bus. The I2C bus is a simple two-wire serial bus used throughout the system to allow the monitoring and control of temperature sensors, fan trays, power supplies, and status indicators.

Temperature sensors are located throughout the system to monitor the ambient temperature of the system and the CPUs, as well as the CPU die temperature. The monitoring subsystem polls each sensor and uses the sampled temperatures to report and respond to any overtemperature or undertemperature conditions. Additional I2C sensors detect component presence and component faults.

The hardware and software together ensure that the temperatures within the enclosure do not exceed predetermined "safe operation" ranges. If the temperature observed by a sensor falls below a low-temperature warning threshold or rises above a high-temperature warning threshold, the monitoring subsystem software lights the system Service Required indicators on the front and back panels. If the temperature condition persists and reaches a critical threshold, the system initiates a graceful system shutdown. In the event of a failure of the ALOM system controller, backup sensors are used to protect the system from serious damage, by initiating a forced hardware shutdown.
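The threshold behavior described above can be sketched as a small classification function. This is an illustrative model only, not the ALOM firmware logic: the class name, function name, and example threshold values are assumptions, and the sketch ignores the requirement that a condition persist before shutdown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical threshold set for one sensor, in degrees Celsius."""
    low_critical: float
    low_warn: float
    high_warn: float
    high_critical: float


def classify(temp_c: float, t: Thresholds) -> str:
    """Map a sampled temperature to the response described in the text."""
    if temp_c <= t.low_critical or temp_c >= t.high_critical:
        return "graceful shutdown"
    if temp_c < t.low_warn or temp_c > t.high_warn:
        return "light Service Required indicators"
    return "no action"


# Example values chosen for illustration only (assumed, not from the firmware).
EXAMPLE = Thresholds(low_critical=-10, low_warn=0, high_warn=60, high_critical=65)

if __name__ == "__main__":
    for sample in (23, 62, 70, -3):
        print(sample, "->", classify(sample, EXAMPLE))
```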

All error and warning messages are sent to the system console and logged in the /var/adm/messages file. Service Required indicators remain lit after an automatic system shutdown to aid in problem diagnosis.

The monitoring subsystem is also designed to detect fan failures. The system features integral power supply fans and six fan trays, each containing one fan. Four fans cool the CPU/Memory modules and two fans cool the disk drives. All fans are hot-swappable. If any fan fails, the monitoring subsystem detects the failure, sends an error message to the system console, logs the message in the /var/adm/messages file, and lights the Service Required indicators.

The power subsystem is monitored in a similar fashion. Polling the power supply status periodically, the monitoring subsystem indicates the status of each supply's DC outputs, AC inputs, and presence.



Note - The power supply fans are not required for system cooling. However, if a power supply fails, its fan obtains power from other power supplies and through the motherboard to maintain the cooling function.


If a power supply problem is detected, an error message is sent to the system console and logged in the /var/adm/messages file. Additionally, indicators located on each power supply light to indicate failures. The system Service Required indicator lights to indicate a system fault. The ALOM system controller console alerts record power supply failures.

Automatic System Restoration

The system provides automatic system restoration (ASR) from component failures in memory modules and PCI cards.

The ASR features enable the system to resume operation after experiencing certain nonfatal hardware faults or failures. Automatic self-test features enable the system to detect failed hardware components. An autoconfiguring capability designed into the system's boot firmware enables the system to unconfigure failed components and to restore system operation. As long as the system can operate without the failed component, the ASR features enable the system to reboot automatically, without operator intervention.

During the power-on sequence, if a faulty component is detected, the component is marked as failed and, if the system can function, the boot sequence continues. In a running system, some types of failures can cause the system to fail. If this happens, the ASR functionality enables the system to reboot immediately if it is possible for the system to detect the failed component and operate without it. This prevents a faulty hardware component from keeping the entire system down or causing the system to crash repeatedly.



Note - Control over the system ASR functionality is provided by several OpenBoot commands and configuration variables. For additional details, see About Automatic System Restoration.


Sun StorEdge Traffic Manager

Sun StorEdge™ Traffic Manager, a feature of the Solaris OS, is a native multipathing solution for storage devices such as Sun StorEdge disk arrays. Sun StorEdge Traffic Manager provides the following features:

For more information, see Sun StorEdge Traffic Manager. Also consult your Solaris software documentation.

Hardware Watchdog Mechanism and XIR

To detect and respond to a system hang, should one ever occur, the Sun Fire V445 server features a hardware "watchdog" mechanism, which is a hardware timer that is continually reset as long as the operating system is running. In the event of a system hang, the operating system is no longer able to reset the timer. The timer will then expire and cause an automatic externally initiated reset (XIR), eliminating the need for operator intervention. When the hardware watchdog mechanism issues the XIR, debug information is displayed on the system console. The hardware watchdog mechanism is present by default, but it requires some additional setup in the Solaris OS.
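The watchdog principle described above (a timer that a healthy operating system keeps resetting, and that triggers an XIR when the resets stop) can be modeled in a few lines. This is a toy simulation under stated assumptions: the class name, the helper function, and the 10-second timeout are hypothetical and do not reflect the actual hardware interval.

```python
class WatchdogTimer:
    """Toy model of a hardware watchdog: it expires unless reset in time."""

    def __init__(self, timeout_s: float, now: float = 0.0) -> None:
        self.timeout_s = timeout_s
        self.last_reset = now

    def reset(self, now: float) -> None:
        """Called periodically by a healthy operating system."""
        self.last_reset = now

    def expired(self, now: float) -> bool:
        """True once the timeout has elapsed with no reset (system hang)."""
        return now - self.last_reset >= self.timeout_s


def simulate(reset_times, check_time, timeout_s=10.0):
    """Return 'XIR' if the timer has expired by check_time, else 'running'."""
    timer = WatchdogTimer(timeout_s)
    for t in reset_times:
        if timer.expired(t):
            return "XIR"  # the timer ran out before this reset could arrive
        timer.reset(t)
    return "XIR" if timer.expired(check_time) else "running"


if __name__ == "__main__":
    print(simulate([5, 10, 15], check_time=20))  # resets keep arriving
    print(simulate([5, 10], check_time=25))      # resets stop after t=10
```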

The XIR feature is also available for you to invoke manually at the ALOM system controller prompt. You use the ALOM system controller reset -x command manually when the system is unresponsive and an L1-A (Stop-A) keyboard command or alphanumeric terminal Break key does not work. When you issue the reset -x command manually, the system is immediately returned to the OpenBoot ok prompt. From there, you can use OpenBoot commands to debug the system.

For more information, see:

Support for RAID Storage Configurations

By attaching one or more external storage devices to the Sun Fire V445 server, you can use a redundant array of independent disks (RAID) software application such as Solstice DiskSuite™ to configure system disk storage in a variety of different RAID levels. Configuration options include RAID 0 (striping), RAID 1 (mirroring), RAID 0+1 (striping plus mirroring), RAID 1+0 (mirroring plus striping), and RAID 5 (striping with interleaved parity). You choose the appropriate RAID configuration based on the price, performance, reliability, and availability goals for your system. You can also configure one or more disk drives to serve as "hot spares" to fill in automatically in the event of a disk drive failure.

In addition to software RAID configurations, you can set up a hardware RAID 1 (mirroring) configuration for any pair of internal disk drives using the SAS controller, providing a high-performance solution for disk drive mirroring.

For more information, see:

Error Correction and Parity Checking

DIMMs employ error-correcting code (ECC) to ensure high levels of data integrity. The system reports and logs correctable ECC errors. (A correctable ECC error is any single-bit error in a 128-bit field.) Such errors are corrected as soon as they are detected. The ECC implementation can also detect double-bit errors in the same 128-bit field and multiple-bit errors in the same nibble (4 bits). In addition to ECC protection for data, the system uses parity protection on the PCI and UltraSCSI buses and in the UltraSPARC IIIi CPU internal caches. ECC detection and correction, as used for DRAM, is also present in the 1-Mbyte on-chip ecache SRAM of the UltraSPARC IIIi processor.


About the ALOM System Controller Command Prompt

The ALOM system controller supports a total of five concurrent sessions per server: four connections available through the network management port and one connection through the serial management port.



Note - Some of the ALOM system controller commands are also available through the Solaris scadm utility. For more information, see the Sun Advanced Lights Out Manager (ALOM) Online Help.


After you log in to your ALOM account, the ALOM system controller command prompt (sc>) appears, and you can enter ALOM system controller commands. If the command you want to use has multiple options, you can enter the options either individually or grouped together, as shown in the following example. The two commands are equivalent.


TABLE 5-1
sc> poweroff -f -y
sc> poweroff -fy
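The equivalence of separate and grouped single-letter options follows the familiar UNIX option convention. As a rough illustration of that convention in general (not of how the ALOM firmware itself parses commands, which this guide does not describe), the following sketch uses Python's standard getopt module. The helper name parse_flags is hypothetical.

```python
import getopt


def parse_flags(argv, allowed="fy"):
    """Return the set of single-letter options found in argv."""
    opts, _rest = getopt.getopt(argv, allowed)
    return {name for name, _value in opts}


if __name__ == "__main__":
    separate = parse_flags(["-f", "-y"])
    grouped = parse_flags(["-fy"])
    print(separate == grouped)
```

Both argument lists yield the same set of options, which is why the two forms of the command behave identically.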


Logging In to the ALOM System Controller

All environmental monitoring and control is handled by the ALOM system controller. The ALOM system controller command prompt (sc>) provides you with a way of interacting with the system controller. For more information about the sc> prompt, see About the sc> Prompt.

For instructions on connecting to the ALOM system controller, see:


To Log In to the ALOM System Controller



Note - This procedure assumes that the system console is directed to use the serial management and network management ports (the default configuration).


1. If you are logged in to the system console, type #. to get to the sc> prompt.

Press the hash key, followed by the period key. Then press the Return key.

2. At the login prompt, enter the login name and press Return.

The default login name is admin.


TABLE 5-2
Sun(tm) Advanced Lights Out Manager 1.1
 
Please login: admin

3. At the password prompt, enter the password and press Return twice to get to the sc> prompt.


TABLE 5-3
Please Enter password:
 
sc>



Note - There is no default password. You must assign a password during initial system configuration. For more information, see your Sun Fire V445 Server Installation Guide and Sun Advanced Lights Out Manager (ALOM) Online Help.




Caution - To provide optimum system security, change the default system login name and assign a password during initial setup.


Using the ALOM system controller, you can monitor the system, turn the Locator indicator on and off, or perform maintenance tasks on the ALOM system controller card itself. For more information, see:


About the scadm Utility

The System Controller Administration (scadm) utility, which is part of the Solaris OS, enables you to perform many ALOM tasks while logged in to the host server. The scadm commands control several functions. Some functions allow you to view or set ALOM environment variables.



Note - Do not use the scadm utility while SunVTS™ diagnostics are running. See your SunVTS documentation for more information.


You must be logged in to the system as superuser to use the scadm utility. The scadm utility uses the following syntax:


TABLE 5-4
# scadm command

The scadm utility sends its output to stdout. You can also use scadm in scripts to manage and configure ALOM from the host system.

For more information about the scadm utility, refer to the following:


Viewing Environmental Information

Use the showenvironment command to view environment information.


To View Environmental Information

1. Log in to the ALOM system controller.

2. Use the showenvironment command to display a snapshot of the server's environmental status.


TABLE 5-5
sc> showenvironment
 
=============== Environmental Status ===============
 
 
------------------------------------------------------------------------------
System Temperatures (Temperatures in Celsius):
------------------------------------------------------------------------------
Sensor         Status    Temp LowHard LowSoft LowWarn HighWarn HighSoft HighHard
------------------------------------------------------------------------------
C0.P0.T_CORE    OK         72    -20     -10       0     108      113      120
C1.P0.T_CORE    OK         68    -20     -10       0     108      113      120
C2.P0.T_CORE    OK         70    -20     -10       0     108      113      120
C3.P0.T_CORE    OK         70    -20     -10       0     108      113      120
C0.T_AMB        OK         23    -20     -10       0      60       65       75
C1.T_AMB        OK         23    -20     -10       0      60       65       75
C2.T_AMB        OK         23    -20     -10       0      60       65       75
C3.T_AMB        OK         23    -20     -10       0      60       65       75
FIRE.T_CORE     OK         40    -20     -10       0      80       85       92
MB.IO_T_AMB     OK         31    -20     -10       0      70       75       82
FIOB.T_AMB      OK         26    -18     -10       0      65       75       85
MB.T_AMB        OK         28    -20     -10       0      70       75       82
....

The information this command can display includes temperature, power supply status, front panel indicator status, and so on. The display uses a format similar to that of the UNIX command prtdiag(1M).
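If you capture showenvironment output to a file, the temperature rows can be processed with a short script. The following sketch assumes the nine-column layout shown in the sample output; the actual layout might differ between firmware versions. The function names and the headroom calculation are hypothetical conveniences, not ALOM features.

```python
# Three rows copied from the sample output above.
SAMPLE = """\
C0.T_AMB        OK         23    -20     -10       0      60       65       75
FIRE.T_CORE     OK         40    -20     -10       0      80       85       92
MB.T_AMB        OK         28    -20     -10       0      70       75       82
"""

COLUMNS = ("LowHard", "LowSoft", "LowWarn", "HighWarn", "HighSoft", "HighHard")


def parse_temperature_rows(text):
    """Parse temperature rows into dictionaries; skip lines that do not fit."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 9:
            continue  # separator, blank, or truncated line
        sensor, status, *numbers = fields
        try:
            values = [int(n) for n in numbers]
        except ValueError:
            continue  # column header or other non-data line
        row = {"Sensor": sensor, "Status": status, "Temp": values[0]}
        row.update(zip(COLUMNS, values[1:]))
        rows.append(row)
    return rows


def headroom(row):
    """Degrees Celsius remaining before the HighWarn threshold."""
    return row["HighWarn"] - row["Temp"]


if __name__ == "__main__":
    for r in parse_temperature_rows(SAMPLE):
        print(r["Sensor"], r["Status"], "headroom:", headroom(r))
```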



Note - Some environmental information might not be available when the server is in Standby mode.




Note - You do not need ALOM system controller user permissions to use this command.



Controlling the Locator Indicator

The Locator indicator helps you locate the server in a data center or lab. When the Locator indicator is enabled, the white Locator indicator flashes. You can control the Locator indicator either from the Solaris command prompt or from the sc> prompt. You can also reset the Locator indicator with the Locator indicator button.


To Control the Locator Indicator

1. To turn on the Locator indicator, do one of the following:

2. To turn off the Locator indicator, do one of the following:

3. To display the state of the Locator indicator, do one of the following:



Note - You do not need user permissions to use the locator commands.



About Performing OpenBoot Emergency Procedures

The introduction of Universal Serial Bus (USB) keyboards with the newest Sun systems has made it necessary to change some of the OpenBoot emergency procedures. Specifically, the Stop-N, Stop-D, and Stop-F commands that were available on systems with non-USB keyboards are not supported on systems that use USB keyboards, such as the Sun Fire V445 server. If you are familiar with the earlier (non-USB) keyboard functionality, this section describes the analogous OpenBoot emergency procedures available in newer systems that use USB keyboards.

The following sections describe how to perform the functions of the Stop commands on systems that use USB keyboards, such as the Sun Fire V445 server. These same functions are available through Sun Advanced Lights Out Manager (ALOM) system controller software.

Stop-A Function

The Stop-A (Abort) key sequence works the same as it does on systems with standard keyboards, except that it does not work during the first few seconds after the server is reset. In addition, you can issue the ALOM system controller break command. For more information, see Entering the ok Prompt.

Stop-N Function

The Stop-N function is not available. However, you can reset OpenBoot configuration variables to their default values by completing the following steps, provided the system console is configured to be accessible using either the serial management port or the network management port.


To Emulate the Stop-N Function

1. Log in to the ALOM system controller.

2. Type:


TABLE 5-12
sc> bootmode reset_nvram
sc>
SC Alert: SC set bootmode to reset_nvram, will expire 20030218184441.
bootmode
Bootmode: reset_nvram
Expires TUE FEB 18 18:44:41 2003

This command resets the OpenBoot configuration variables to their default values.

3. To reset the system, type:


TABLE 5-13
sc> reset
Are you sure you want to reset the system [y/n]?  y
sc> console

4. To view console output as the system boots with default OpenBoot configuration variables, switch to console mode.


TABLE 5-14
sc> console
 
ok

5. Type set-defaults to discard any customized IDPROM values and to restore the default settings for all OpenBoot configuration variables.

Stop-F Function

The Stop-F function is not available on systems with USB keyboards.

Stop-D Function

The Stop-D (Diags) key sequence is not supported on systems with USB keyboards. However, the Stop-D function can be closely emulated with ALOM software by enabling the Diagnostics mode.

In addition, you can emulate the Stop-D function using the ALOM system controller bootmode diag command. For more information, see the Sun Advanced Lights Out Manager (ALOM) Online Help.


About Automatic System Restoration

The system provides automatic system restoration (ASR) from failures in memory modules or PCI cards.

Automatic system restoration functionality enables the system to resume operation after experiencing certain nonfatal hardware faults or failures. When ASR is enabled, the system's firmware diagnostics automatically detect failed hardware components. An autoconfiguring capability designed into the OpenBoot firmware enables the system to unconfigure failed components and to restore system operation. As long as the system is capable of operating without the failed component, the ASR features enable the system to reboot automatically, without operator intervention.

For more information about ASR, see About Automatic System Restoration.


Unconfiguring a Device Manually

To support a degraded boot capability, the OpenBoot firmware provides the asr-disable command, which enables you to unconfigure system devices manually. This command "marks" a specified device as disabled, by creating an appropriate status property in the corresponding device tree node. By convention, the Solaris OS does not activate a driver for any device so marked.


To Unconfigure a Device Manually

1. At the ok prompt, type:


 
ok asr-disable device-identifier

where device-identifier is one of the following:



Note - The device identifiers are not case-sensitive. You can type them as uppercase or lowercase characters.



TABLE 5-15 Device Identifiers and Devices

Device Identifiers                        Devices
----------------------------------------  ----------------------------------------
cpu0-bank0, cpu0-bank1, cpu0-bank2,       Memory banks 0 - 3 for each CPU
cpu0-bank3, ... cpu3-bank0, cpu3-bank1,
cpu3-bank2, cpu3-bank3
cpu0-bank*, cpu1-bank*, ... cpu3-bank*    All memory banks for each CPU
ide                                       On-board IDE controller
net0, net1, net2, net3                    On-board Ethernet controllers
ob-scsi                                   SAS controller
pci0, ... pci7                            PCI slots 0 - 7
pci-slot*                                 All PCI slots
pci*                                      All on-board PCI devices (on-board
                                          Ethernet, SAS) and all PCI slots
hba8, hba9                                PCI bridge chips 0 and 1, respectively
usb0, ..., usb4                           USB devices
*                                         All devices


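The show-devs command lists the system devices and displays the full path name of each device.

You can also create your own device alias for a physical device by typing:

ok devalias alias-name physical-device-path

where alias-name is the alias that you want to assign, and physical-device-path is the full physical device path for the device.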



Note - If you manually disable a device using asr-disable, and then assign a different alias to the device, the device remains disabled even though the device alias has changed.


2. To cause the parameter change to take effect, type:


 
ok reset-all

The system permanently stores the parameter change.



Note - To store parameter changes, you can also power cycle the system using the front panel Power button.
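To make the wildcard identifiers in TABLE 5-15 concrete, the following sketch expands an identifier into the individual identifiers it covers. It mirrors the table only and is not the OpenBoot implementation. The function name is hypothetical, and the membership of the pci* group (Ethernet controllers, SAS controller, and PCI slots) is an interpretation of the table's description.

```python
MEMORY_BANKS = [f"cpu{c}-bank{b}" for c in range(4) for b in range(4)]
NETWORK = [f"net{n}" for n in range(4)]
PCI_SLOTS = [f"pci{n}" for n in range(8)]
BRIDGES = ["hba8", "hba9"]
USB = [f"usb{n}" for n in range(5)]
ALL_DEVICES = (MEMORY_BANKS + ["ide"] + NETWORK + ["ob-scsi"]
               + PCI_SLOTS + BRIDGES + USB)


def expand_identifier(identifier):
    """Expand a TABLE 5-15 identifier into the individual identifiers it covers."""
    ident = identifier.lower()  # identifiers are not case-sensitive
    if ident == "*":
        return list(ALL_DEVICES)
    if ident == "pci-slot*":
        return list(PCI_SLOTS)
    if ident == "pci*":
        # Assumption: on-board Ethernet and SAS controllers plus all PCI slots.
        return NETWORK + ["ob-scsi"] + PCI_SLOTS
    if ident.endswith("-bank*"):
        prefix = ident[:-1]  # for example, "cpu2-bank"
        banks = [m for m in MEMORY_BANKS if m.startswith(prefix)]
        if banks:
            return banks
    if ident in ALL_DEVICES:
        return [ident]
    raise ValueError(f"unknown device identifier: {identifier!r}")


if __name__ == "__main__":
    print(expand_identifier("CPU2-BANK*"))
    print(expand_identifier("pci-slot*"))
```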



Reconfiguring a Device Manually

You can use the OpenBoot asr-enable command to reconfigure any device that you previously unconfigured with the asr-disable command.


To Reconfigure a Device Manually

1. At the ok prompt, type:


 
ok asr-enable device-identifier

where the device-identifier is one of the following:



Note - The device identifiers are not case-sensitive. You can type them as uppercase or lowercase characters.


For a list of device identifiers and devices, see TABLE 5-15.


Enabling the Hardware Watchdog Mechanism and Its Options

For background information about the hardware watchdog mechanism and related externally initiated reset (XIR) functionality, see:


To Enable the Hardware Watchdog Mechanism and Its Options

1. Edit the /etc/system file to include the following entry:


 
set watchdog_enable = 1

2. To obtain the ok prompt, type:


TABLE 5-16
# init 0

3. Reboot the system so that the changes can take effect.

4. To have the hardware watchdog mechanism automatically reboot the system in case of system hang, at the ok prompt, type:


 
ok setenv error-reset-recovery boot

5. To generate automated crash dumps in case of system hang, at the ok prompt, type:


 
ok setenv error-reset-recovery sync

To remain at the ok prompt to debug the system after a hang instead, set the variable to none. For more information about OpenBoot configuration variables, see Appendix C.
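Before you reboot, you might want to confirm that the /etc/system entry from Step 1 is present and not commented out. The following sketch checks the text of the file. It assumes that comment lines begin with an asterisk or hash mark and that the last matching entry takes precedence; the function name is hypothetical.

```python
import re

# Matches an uncommented "set watchdog_enable = <value>" line.
_SET_LINE = re.compile(r"^\s*set\s+watchdog_enable\s*=\s*(\S+)\s*$")


def watchdog_enabled(system_text):
    """Return True if the text contains an active 'set watchdog_enable = 1'."""
    enabled = False
    for line in system_text.splitlines():
        if line.lstrip().startswith(("*", "#")):
            continue  # comment line
        match = _SET_LINE.match(line)
        if match:
            enabled = match.group(1) == "1"  # assumption: last entry wins
    return enabled


if __name__ == "__main__":
    sample = "* Enable the hardware watchdog\nset watchdog_enable = 1\n"
    print(watchdog_enabled(sample))
```

To check a live system, read the file as superuser and pass its contents to the function, for example watchdog_enabled(open("/etc/system").read()).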


About Multipathing Software

Multipathing software allows you to define and control redundant physical paths to I/O devices, such as storage devices and network interfaces. If the active path to a device becomes unavailable, the software can automatically switch to an alternate path to maintain availability. This capability is known as automatic failover. To take advantage of multipathing capabilities, you must configure the server with redundant hardware, such as redundant network interfaces or two host bus adapters connected to the same dual-ported storage array.
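The automatic failover behavior described above can be sketched as follows. This is a conceptual model only and does not represent the interface of any particular multipathing product. The class name, method names, and path labels are hypothetical.

```python
class MultipathDevice:
    """Conceptual model of a device reachable over redundant physical paths."""

    def __init__(self, paths):
        if not paths:
            raise ValueError("at least one path is required")
        self.paths = list(paths)  # ordered by preference
        self.failed = set()

    @property
    def active_path(self):
        """Return the first usable path, or None if every path has failed."""
        for path in self.paths:
            if path not in self.failed:
                return path
        return None

    def mark_failed(self, path):
        """Record a path failure; later lookups fail over automatically."""
        self.failed.add(path)

    def mark_restored(self, path):
        """Record that a previously failed path is usable again."""
        self.failed.discard(path)


if __name__ == "__main__":
    array = MultipathDevice(["hba0", "hba1"])  # hypothetical path labels
    print(array.active_path)
    array.mark_failed("hba0")
    print(array.active_path)
```

The model also shows why redundant hardware is required: with a single path, the first failure leaves no alternate path to switch to.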

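For the Sun Fire V445 server, three different types of multipathing software are available: Solaris IP Network Multipathing, for IP network interfaces; Sun StorEdge Traffic Manager, for storage devices; and VERITAS Volume Manager, which includes a Dynamic Multipathing (DMP) feature.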

For information about setting up redundant hardware interfaces for networks, see About Redundant Network Interfaces.

For instructions on how to configure and administer Solaris IP Network Multipathing, consult the IP Network Multipathing Administration Guide provided with your specific Solaris release.

For information about Sun StorEdge Traffic Manager, see Sun StorEdge Traffic Manager and refer to your Solaris OS documentation.

For information about VERITAS Volume Manager and its DMP feature, see About Volume Management Software and refer to the documentation provided with the VERITAS Volume Manager software.