CHAPTER 2

Managing RAS Features and System Firmware

This chapter describes how to manage reliability, availability, and serviceability (RAS) features and system firmware, including the Sun Advanced Lights Out Manager (ALOM) system controller, automatic system recovery (ASR), and the hardware watchdog mechanism. In addition, this chapter describes how to unconfigure and reconfigure a device manually, and introduces multipathing software.

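This chapter contains the following sections: ALOM System Controller, OpenBoot Emergency Procedures, Automatic System Recovery, Unconfiguring and Reconfiguring Devices, Enabling the Hardware Watchdog Mechanism and Its Options, and Multipathing Software.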



Note - This chapter does not cover detailed troubleshooting and diagnostic procedures. For information about fault isolation and diagnostic procedures, refer to the Netra 440 Server Diagnostics and Troubleshooting Guide (817-3886-xx).




ALOM System Controller

The ALOM system controller supports a total of five concurrent sessions per server: four connections available through the network management port and one connection through the serial management port.



Note - Some of the ALOM system controller commands are also available through the Solaris scadm utility. For more information, refer to the Advanced Lights Out Manager User's Guide (817-5481-xx).



After you log in to your ALOM account, the ALOM system controller command prompt (sc>) appears, and you can enter ALOM system controller commands. If the command you want to use has multiple options, you can enter the options either individually or grouped together, as shown in the following example. The two commands are equivalent.

sc> poweroff -f -y
sc> poweroff -fy
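
The equivalence of the two forms can be illustrated with a short Python sketch. The function name expand_flags is hypothetical, and the parsing rule (each letter after a single hyphen is a separate one-letter option) is an assumption made for illustration. It is not the parser that ALOM itself uses.

```python
def expand_flags(args):
    """Split grouped single-letter options (such as -fy) into a set."""
    flags = set()
    for arg in args:
        if arg.startswith("-") and len(arg) > 1:
            # Assumption: every character after the hyphen is its own option.
            flags.update(arg[1:])
    return flags


separate = expand_flags("poweroff -f -y".split()[1:])
grouped = expand_flags("poweroff -fy".split()[1:])
print(separate == grouped)  # True
```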

Logging In To the ALOM System Controller

All environmental monitoring and control is handled by the ALOM system controller. The ALOM system controller command prompt (sc>) provides you with a way of interacting with the system controller. For more information about the sc> prompt, see About the sc> Prompt.

For instructions on connecting to the ALOM system controller, see:



Note - This procedure assumes that the system console is directed to use the serial management and network management ports (the default configuration).




To Log In To the ALOM System Controller

1. If you are logged in to the system console, type #. to get to the sc> prompt.

Press the pound sign key, followed by the period key. Then press the Return key.

2. At the ALOM login prompt, enter the login name and press Return.

The default login name is admin.

Sun(tm) Advanced Lights Out Manager 1.3
Please login: admin

3. At the password prompt, enter the password and press Return twice to get to the sc> prompt.

Please Enter password:
 
sc>



Note - There is no default password. You must assign a password during initial system configuration. For more information, refer to your Netra 440 Server Installation Guide (817-3882-xx) and Advanced Lights Out Manager User's Guide (817-5481-xx).






Caution - In order to provide optimum system security, best practice is to change the default system login name and password during initial setup.



Using the ALOM system controller, you can monitor the system, turn the Locator LED on and off, or perform maintenance tasks on the ALOM system controller card itself. For more information, refer to the Advanced Lights Out Manager User's Guide (817-5481-xx).

About the scadm Utility

The System Controller Administration (scadm) utility, which is part of the Solaris OS, enables you to perform many ALOM tasks while logged in to the host server. The scadm commands control several functions. Some functions allow you to view or set ALOM environment variables.



Note - Do not use the scadm utility while SunVTS(tm) diagnostics are running. See your SunVTS documentation for more information.



You must be logged in to the system as root to use the scadm utility. The scadm utility uses the following syntax:

# scadm command

The scadm utility sends its output to stdout. You can also use scadm in scripts to manage and configure ALOM from the host system.
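
As one possible way of using scadm from a script, the following Python sketch runs the scadm show command and collects the reported variables in a dictionary. It is a minimal sketch under stated assumptions: the output is assumed to consist of name="value" lines, the scadm binary is assumed to be on the search path of the root user, and the function names and sample values are hypothetical.

```python
import subprocess

# Assumption: scadm is on root's PATH. On many systems it is installed under
# a platform-specific directory, so adjust this value as needed.
SCADM = "scadm"


def parse_scadm_show(text):
    """Parse lines of the assumed form name="value" into a dictionary."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition("=")
        if sep:
            settings[name] = value.strip('"')
    return settings


def read_alom_settings(run=subprocess.run):
    """Run 'scadm show' (requires root) and return the parsed variables."""
    result = run([SCADM, "show"], capture_output=True, text=True, check=True)
    return parse_scadm_show(result.stdout)


# Illustrative sample text only; it is not captured from a real system.
sample = 'if_network="true"\nnetsc_dhcp="false"\n'
print(parse_scadm_show(sample))
```

Passing the command runner as a parameter keeps the parsing logic testable on a machine that has no ALOM hardware.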

For more information about the scadm utility, refer to the following:


To View Environmental Information

1. Log in to the ALOM system controller.

2. Use the showenvironment command to display a snapshot of the server's environmental status.


sc> showenvironment
 
=============== Environmental Status ===============
 
 
------------------------------------------------------------------------------
System Temperatures (Temperatures in Celsius):
------------------------------------------------------------------------------
Sensor         Status    Temp LowHard LowSoft LowWarn HighWarn HighSoft HighHard
------------------------------------------------------------------------------
C0.P0.T_CORE    OK         48    -20     -10       0      97      102      120
C1.P0.T_CORE    OK         53    -20     -10       0      97      102      120
C2.P0.T_CORE    OK         49    -20     -10       0      97      102      120
C3.P0.T_CORE    OK         57    -20     -10       0      97      102      120
C0.T_AMB        OK         28    -20     -10       0      70       82       87
C1.T_AMB        OK         33    -20     -10       0      70       82       87
C2.T_AMB        OK         27    -20     -10       0      70       82       87
C3.T_AMB        OK         28    -20     -10       0      70       82       87
MB.T_AMB        OK         32    -18     -10       0      65       75       85
...

The information this command can display includes temperature, power supply status, front panel LED status, system control keyswitch position, and so on. The display uses a format similar to that of the UNIX command prtdiag(1M).



Note - Some environmental information might not be available when the server is in standby mode.





Note - You do not need ALOM system controller user permissions to use this command.



The showenvironment command has one option: -v. If you use this option, ALOM returns more detailed information about the host server's status, including warning and shutdown thresholds.
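
If you capture the showenvironment output to a file, the temperature table shown above can be post-processed. The following Python sketch is an illustration only: it assumes the column layout of the sample output in this section (sensor, status, temperature, then six thresholds), and the function names and the margin parameter are hypothetical.

```python
def parse_temperatures(text):
    """Return {sensor: (status, temp, high_warn)} from temperature rows."""
    readings = {}
    for line in text.splitlines():
        fields = line.split()
        # A data row has a sensor name, a status, and seven integers.
        if len(fields) != 9:
            continue
        try:
            numbers = [int(field) for field in fields[2:]]
        except ValueError:
            continue  # column header row or other non-data line
        temp, high_warn = numbers[0], numbers[4]
        readings[fields[0]] = (fields[1], temp, high_warn)
    return readings


def near_warning(readings, margin=10):
    """List sensors whose temperature is within margin degrees of HighWarn."""
    return sorted(name for name, (_, temp, warn) in readings.items()
                  if warn - temp <= margin)


sample = """\
Sensor         Status    Temp LowHard LowSoft LowWarn HighWarn HighSoft HighHard
C3.P0.T_CORE    OK         57    -20     -10       0      97      102      120
MB.T_AMB        OK         32    -18     -10       0      65       75       85
"""
print(near_warning(parse_temperatures(sample), margin=35))  # ['MB.T_AMB']
```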

Controlling the Locator LED

You can control the Locator LED either from the Solaris command prompt or from the sc> prompt.

To turn on the Locator LED, do one of the following:

To turn off the Locator LED, do one of the following:

To display the state of the Locator LED, do one of the following:



Note - You do not need user permissions to use the setlocator and showlocator commands.




OpenBoot Emergency Procedures

The introduction of Universal Serial Bus (USB) keyboards with the newest Sun systems has made it necessary to change some of the OpenBoot emergency procedures. Specifically, the Stop-N, Stop-D, and Stop-F commands that were available on systems with non-USB keyboards are not supported on systems that use USB keyboards, such as the Netra 440 server. If you are familiar with the earlier (non-USB) keyboard functionality, this section describes the analogous OpenBoot emergency procedures available in newer systems that use USB keyboards.

OpenBoot Emergency Procedures for Systems With Non-USB Keyboards

TABLE 2-1 summarizes the Stop key command functions for systems that use standard (non-USB) keyboards.

TABLE 2-1 Stop Key Command Functions for Systems With Standard (Non-USB) Keyboards

Standard (Non-USB)
Keyboard Command     Description
-------------------  ----------------------------------------------------------
Stop                 Bypass POST. This command does not depend on security mode.
Stop-A               Abort.
Stop-D               Enter the diagnostic mode (set diag-switch? to true).
Stop-F               Enter Forth on ttya instead of probing. Use fexit to
                     continue with the initialization sequence. Useful when
                     there is a hardware problem.
Stop-N               Reset OpenBoot configuration variables to their default
                     values.


OpenBoot Emergency Procedures for Systems With USB Keyboards

The following sections describe how to perform the functions of the Stop commands on systems that use USB keyboards, such as the Netra 440 server. These same functions are available through Sun Advanced Lights Out Manager (ALOM) system controller software.

Stop-A Functionality

The Stop-A (Abort) key sequence works the same as it does on systems with standard keyboards, except that it does not work during the first few seconds after the server is reset. In addition, you can issue the ALOM system controller break command. For more information, see Reaching the ok Prompt.

Stop-N Functionality

Stop-N functionality is not available. However, the Stop-N functionality can be closely emulated by completing the following steps, provided the system console is configured to be accessible using either the serial management port or the network management port.


To Restore OpenBoot Configuration Defaults

1. Log in to the ALOM system controller.

2. Type the following command:

sc> bootmode reset_nvram
sc>
SC Alert: SC set bootmode to reset_nvram, will expire 20030218184441.
bootmode
Bootmode: reset_nvram
Expires TUE FEB 18 18:44:41 2003

This command causes the OpenBoot configuration variables to be reset to their default values at the next system reset.

3. To reset the system, type the following command:

sc> reset
Are you sure you want to reset the system [y/n]?  y
sc> console

4. To view console output as the system boots with default OpenBoot configuration variables, switch to console mode.

sc> console
 
ok

5. Type set-defaults to discard any customized IDPROM values and to restore the default settings for all OpenBoot configuration variables.
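
In the sample output of Step 2, the SC Alert message reports the expiration time as 20030218184441, while the bootmode command reports it as TUE FEB 18 18:44:41 2003. The following Python sketch converts the first form to the second. It assumes that the 14-digit value uses the layout YYYYMMDDhhmmss, which is inferred from the two sample lines rather than stated in this chapter. The function name is hypothetical.

```python
from datetime import datetime


def parse_expiry(stamp):
    """Parse a 14-digit timestamp, assumed to be in YYYYMMDDhhmmss form."""
    return datetime.strptime(stamp, "%Y%m%d%H%M%S")


expires = parse_expiry("20030218184441")
# Day and month names are printed in the default C locale.
print(expires.strftime("%a %b %d %H:%M:%S %Y").upper())
# TUE FEB 18 18:44:41 2003
```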

Stop-F Functionality

The Stop-F functionality is not available on systems with USB keyboards.

Stop-D Functionality

The Stop-D (Diags) key sequence is not supported on systems with USB keyboards. However, the Stop-D functionality can be closely emulated by turning the system control keyswitch to the Diagnostics position. For more information, refer to the Netra 440 Server Product Overview (817-3881-xx).

In addition, you can emulate Stop-D functionality using the ALOM system controller bootmode diag command. For more information, refer to the Advanced Lights Out Manager User's Guide (817-5481-xx).


Automatic System Recovery

The system provides for automatic system recovery (ASR) from failures in memory modules or PCI cards.

Automatic system recovery functionality enables the system to resume operation after experiencing certain nonfatal hardware faults or failures. When ASR is enabled, the system's firmware diagnostics automatically detect failed hardware components. An auto-configuring capability designed into the OpenBoot firmware enables the system to unconfigure failed components and to restore system operation. As long as the system is capable of operating without the failed component, the ASR features enable the system to reboot automatically, without operator intervention.



Note - ASR is not activated until you enable it. See Enabling and Disabling Automatic System Recovery.



For more information about ASR, refer to the Netra 440 Server Diagnostics and Troubleshooting Guide (817-3886-xx).

Auto-Boot Options

The OpenBoot firmware stores a configuration variable on the system configuration card (SCC) called auto-boot?, which controls whether the firmware will automatically boot the operating system after each reset. The default setting for Sun platforms is true.

Normally, if a system fails power-on diagnostics, auto-boot? is ignored and the system does not boot unless an operator boots the system manually. Requiring a manual boot is not acceptable for a system that must be able to boot unattended in a degraded state. Therefore, the Netra 440 server OpenBoot firmware provides a second setting, auto-boot-on-error?. This setting controls whether the system will attempt a degraded boot when a subsystem failure is detected. Both the auto-boot? and auto-boot-on-error? switches must be set to true to enable an automatic degraded boot. To set the switches, type:

ok setenv auto-boot? true
ok setenv auto-boot-on-error? true



Note - The default setting for auto-boot-on-error? is false. Therefore, the system will not attempt a degraded boot unless you change this setting to true. In addition, the system will not attempt a degraded boot in response to any fatal nonrecoverable error, even if degraded booting is enabled. For examples of fatal nonrecoverable errors, see Error Handling Summary.
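
The interaction of the two switches can be summarized in a short Python sketch. This is a simplified model of the rules described in this section, not firmware code. The function name, the error labels ("nonfatal" and "fatal"), and the returned strings are hypothetical.

```python
def boot_decision(auto_boot, auto_boot_on_error, error=None):
    """Model the boot outcome; error is None, "nonfatal", or "fatal"."""
    if error == "fatal":
        # A fatal nonrecoverable error never leads to a degraded boot.
        return "wait for operator"
    if not auto_boot:
        return "wait for operator"
    if error is None:
        return "normal boot"
    # Nonfatal error: both switches must be true for a degraded boot.
    return "degraded boot" if auto_boot_on_error else "wait for operator"


print(boot_decision(True, True, "nonfatal"))   # degraded boot
print(boot_decision(True, False, "nonfatal"))  # wait for operator
```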



Error Handling Summary

Error handling during the power-on sequence falls into one of the following three cases:



Note - If POST or OpenBoot Diagnostics detects a nonfatal error associated with the normal boot device, the OpenBoot firmware automatically unconfigures the failed device and tries the next-in-line boot device, as specified by the diag-device configuration variable.



For more information about troubleshooting fatal errors, refer to the Netra 440 Server Diagnostics and Troubleshooting Guide (817-3886-xx).

Reset Scenarios

Three OpenBoot configuration variables, diag-switch?, obdiag-trigger, and post-trigger, control whether the system runs firmware diagnostics in response to system reset events.

The standard system reset protocol bypasses POST and OpenBoot Diagnostics completely unless the variable diag-switch? is set to true, or the system control keyswitch is in the Diagnostics position. The default setting for this variable is false. Therefore, to enable ASR, which relies on firmware diagnostics to detect faulty devices, you must change this setting to true. For instructions, see Enabling and Disabling Automatic System Recovery.

To control which reset events, if any, automatically initiate firmware diagnostics, the OpenBoot firmware provides variables called obdiag-trigger and post-trigger. For detailed explanations of these variables and their uses, refer to the Netra 440 Server Diagnostics and Troubleshooting Guide (817-3886-xx).

Automatic System Recovery User Commands

The OpenBoot commands .asr, asr-disable, and asr-enable are available for obtaining ASR status information and for manually unconfiguring or reconfiguring system devices. For more information, see:

Enabling and Disabling Automatic System Recovery

The automatic system recovery (ASR) feature is not activated until you enable it at the system ok prompt.


To Enable Automatic System Recovery

1. At the ok prompt, type:

ok setenv diag-switch? true
ok setenv auto-boot? true
ok setenv auto-boot-on-error? true

2. Set the obdiag-trigger variable to any combination of power-on-reset, error-reset, and user-reset. For example, type:

ok setenv obdiag-trigger power-on-reset error-reset



Note - For more information about OpenBoot configuration variables, refer to the Netra 440 Server Diagnostics and Troubleshooting Guide (817-3886-xx).



3. To cause the parameter changes to take effect, type:

ok reset-all

The system permanently stores the parameter changes and boots automatically when the OpenBoot configuration variable auto-boot? is set to true (its default value).



Note - To store parameter changes, you can also power cycle the system using the front panel Power button.
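
After the system boots, the settings made in this procedure could be checked from a script. The following Python sketch is illustrative only. It assumes that the variables are available as name=value lines, such as the output of the Solaris eeprom command saved to a file, and it recognizes only the three obdiag-trigger values named in Step 2. The function names are hypothetical.

```python
REQUIRED = {
    "diag-switch?": "true",
    "auto-boot?": "true",
    "auto-boot-on-error?": "true",
}
STEP2_TRIGGERS = {"power-on-reset", "error-reset", "user-reset"}


def parse_variables(text):
    """Parse name=value lines into a dictionary."""
    variables = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            variables[name.strip()] = value.strip()
    return variables


def asr_problems(variables):
    """Return a list of settings that differ from the procedure above."""
    problems = [f"{name} should be {want}"
                for name, want in REQUIRED.items()
                if variables.get(name) != want]
    triggers = set(variables.get("obdiag-trigger", "").split())
    if not triggers & STEP2_TRIGGERS:
        problems.append("obdiag-trigger names none of the Step 2 reset events")
    return problems


sample = """\
diag-switch?=true
auto-boot?=true
auto-boot-on-error?=false
obdiag-trigger=power-on-reset error-reset
"""
print(asr_problems(parse_variables(sample)))
# ['auto-boot-on-error? should be true']
```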




To Disable Automatic System Recovery

1. At the ok prompt, type:

ok setenv auto-boot-on-error? false

2. To cause the parameter change to take effect, type:

ok reset-all

The system permanently stores the parameter change.



Note - To store parameter changes, you can also power cycle the system using the front panel Power button.



After you disable the automatic system recovery (ASR) feature, it is not activated again until you enable it at the system ok prompt.

Obtaining Automatic System Recovery Information

Use the following procedure to retrieve information about the status of the automatic system recovery (ASR) feature.

At the ok prompt, type:

ok .asr

In the .asr command output, any devices marked disabled have been manually unconfigured using the asr-disable command. The .asr command also lists devices that have failed firmware diagnostics and have been automatically unconfigured by the OpenBoot ASR feature.

For more information, see:


Unconfiguring and Reconfiguring Devices

To support a degraded boot capability, the OpenBoot firmware provides the asr-disable command, which enables you to unconfigure system devices manually. This command "marks" a specified device as disabled, by creating an appropriate status property in the corresponding device tree node. By convention, the Solaris OS does not activate a driver for any device so marked.


To Unconfigure a Device Manually

1. At the ok prompt, type:

ok asr-disable device-identifier

where the device-identifier is one of the following:



Note - The device identifiers are not case sensitive. You can type them as uppercase or lowercase characters.



TABLE 2-2 Device Identifiers and Devices

Device Identifiers                        Devices
----------------------------------------  ------------------------------------
cpu0-bank0, cpu0-bank1, cpu0-bank2,       Memory banks 0-3 for each CPU
cpu0-bank3, ... cpu3-bank0, cpu3-bank1,
cpu3-bank2, cpu3-bank3
cpu0-bank*, cpu1-bank*, ... cpu3-bank*    All memory banks for each CPU
ob-ide                                    On-board IDE controller
ob-net0, ob-net1                          On-board Ethernet controllers
ob-scsi                                   On-board Ultra-4 SCSI controller
pci-slot0, pci-slot1, ... pci-slot5       PCI slots 0-5
pci-slot*                                 All PCI slots
pci*                                      All on-board PCI devices (on-board
                                          Ethernet, Ultra-4 SCSI) and all PCI
                                          slots
hba8, hba9                                PCI bridge chips 0 and 1, respectively
ob-usb0, ob-usb1                          USB devices
*                                         All devices


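The show-devs command lists the system devices and displays the full path name of each device.

You can also assign your own alias to a physical device with the OpenBoot devalias command, in the form devalias alias-name physical-device-path, where alias-name is the alias that you want to assign, and physical-device-path is the full physical device path for the device.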



Note - If you manually disable a device using asr-disable, and then assign a different alias to the device, the device remains disabled even though the device alias has changed.



2. To cause the parameter change to take effect, type:

ok reset-all

The system permanently stores the parameter change.



Note - To store parameter changes, you can also power cycle the system using the front panel Power button.
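
The wildcard identifiers in TABLE 2-2 can be thought of as shorthand for groups of individual identifiers. The following Python sketch expands an identifier into the individual entries listed in the table. It is an illustration under stated assumptions: the function names are hypothetical, shell-style matching from the fnmatch module stands in for whatever matching the firmware performs, and pci* is handled as a special case because the table defines it to include the on-board Ethernet and SCSI controllers.

```python
from fnmatch import fnmatchcase


def known_identifiers():
    """Return the individual device identifiers listed in TABLE 2-2."""
    ids = [f"cpu{cpu}-bank{bank}" for cpu in range(4) for bank in range(4)]
    ids += ["ob-ide", "ob-net0", "ob-net1", "ob-scsi"]
    ids += [f"pci-slot{slot}" for slot in range(6)]
    ids += ["hba8", "hba9", "ob-usb0", "ob-usb1"]
    return ids


def expand(identifier):
    """Expand an identifier, which is not case sensitive, into a list."""
    pattern = identifier.lower()
    if pattern == "pci*":
        # Per TABLE 2-2: on-board Ethernet, Ultra-4 SCSI, and all PCI slots.
        wanted = {"ob-net0", "ob-net1", "ob-scsi"}
        return [i for i in known_identifiers()
                if i in wanted or i.startswith("pci-slot")]
    return [i for i in known_identifiers() if fnmatchcase(i, pattern)]


print(expand("CPU1-BANK*"))
# ['cpu1-bank0', 'cpu1-bank1', 'cpu1-bank2', 'cpu1-bank3']
```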




To Reconfigure a Device Manually

1. At the ok prompt, type:

ok asr-enable device-identifier

where the device-identifier is one of the following:



Note - The device identifiers are not case sensitive. You can type them as uppercase or lowercase characters.



You can use the OpenBoot asr-enable command to reconfigure any device that you previously unconfigured with the asr-disable command.


Enabling the Hardware Watchdog Mechanism and Its Options

For background information about the hardware watchdog mechanism and related externally initiated reset (XIR) functionality, refer to the Netra 440 Server Product Overview (817-3881-xx).


To Enable the Hardware Watchdog Mechanism

1. Edit the /etc/system file to include the following entry:

set watchdog_enable = 1

2. Bring the system to the ok prompt by typing the following:

# init 0

3. Reboot the system so that the changes can take effect.

To have the hardware watchdog mechanism automatically reboot the system in case of system hangs:

At the ok prompt, type the following:

ok setenv error-reset-recovery boot

To generate automated crash dumps in case of system hangs:

At the ok prompt, type the following:

ok setenv error-reset-recovery sync

The none option leaves you at the ok prompt in order to debug the system. For more information about OpenBoot configuration variables, see Appendix A.


Multipathing Software

Multipathing software lets you define and control redundant physical paths to I/O devices, such as storage devices and network interfaces. If the active path to a device becomes unavailable, the software can automatically switch to an alternate path to maintain availability. This capability is known as automatic failover. To take advantage of multipathing capabilities, you must configure the server with redundant hardware, such as redundant network interfaces or two host bus adapters connected to the same dual-ported storage array.
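
The idea of automatic failover can be sketched in a few lines of Python. This is a conceptual model only and does not represent the interface of any of the multipathing products mentioned in this chapter. The class name, method names, and path names are hypothetical.

```python
class MultipathDevice:
    """Track redundant paths to one device and fail over when needed."""

    def __init__(self, paths):
        if not paths:
            raise ValueError("at least one path is required")
        self.paths = list(paths)
        self.failed = set()
        self.active = self.paths[0]

    def mark_failed(self, path):
        """Record a path failure and switch paths if it was the active one."""
        self.failed.add(path)
        if path == self.active:
            self._fail_over()

    def _fail_over(self):
        # Pick the first path that has not failed, in configured order.
        for candidate in self.paths:
            if candidate not in self.failed:
                self.active = candidate
                return
        self.active = None  # no redundant path is left


device = MultipathDevice(["hba0", "hba1"])
device.mark_failed("hba0")
print(device.active)  # hba1
```

With only one configured path, a failure leaves no active path, which is why redundant hardware is a prerequisite for multipathing.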

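For the Netra 440 server, three different types of multipathing software are available: Solaris IP Network Multipathing, VERITAS Volume Manager (VVM) with its Dynamic Multipathing (DMP) feature, and Sun StorEdge Traffic Manager.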

For More Information

For information about setting up redundant hardware interfaces for networks, refer to the Netra 440 Server Installation Guide (817-3882-xx).

For instructions on how to configure and administer Solaris IP Network Multipathing, consult the IP Network Multipathing Administration Guide provided with your specific Solaris release.

For information about VVM and its DMP feature, see Volume Management Software and refer to the documentation provided with the VERITAS Volume Manager software.

For information about Sun StorEdge Traffic Manager, refer to the Netra 440 Server Product Overview (817-3881-xx) and to your Solaris OS documentation.