Sun Enterprise 250 Server Owner's Guide

About Reliability, Availability, and Serviceability Features

Reliability, availability, and serviceability are aspects of a system's design that affect its ability to operate continuously and minimize the time necessary to service the system. Reliability refers to a system's ability to operate continuously without failures and to maintain data integrity. System availability refers to the percentage of time that a system remains accessible and usable. Serviceability relates to the time it takes to restore a system to service following a system failure. Together, reliability, availability, and serviceability provide for near continuous system operation.
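The definition of availability above can be expressed as a simple calculation. The following Python sketch is illustrative only; the function name and the sample figures are assumptions, not measurements for this server.

```python
def availability_percent(uptime_hours, downtime_hours):
    """Percentage of total time the system was accessible and usable."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return 100.0 * uptime_hours / total


# Hypothetical year of 8,760 hours, 4 of which were spent restoring service.
print(round(availability_percent(8756.0, 4.0), 3))
```

Shortening the time needed to restore service (serviceability) reduces the downtime term, which is how serviceability contributes to availability.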

To deliver high levels of reliability, availability, and serviceability, the system offers the following features:

Error correction and parity checking for improved data integrity

Easily accessible status indicators

Hot-pluggable disk drives

Support for RAID 0, 1, and 5 storage configurations

Environmental monitoring and fault protection

N+1 power supply redundancy

Hot-swappable power supplies

Automatic system recovery (ASR)

Hardware watchdog mechanism

Four different levels of system diagnostics

Remote System Control (RSC)

Error Correction and Parity Checking

Error-correcting code (ECC) is used on all internal system data paths to ensure high levels of data integrity. All data that moves between processors, I/O, and memory has end-to-end ECC protection.

The system reports and logs correctable ECC errors. A correctable ECC error is any single-bit error in a 64-bit field. Such errors are corrected as soon as they are detected. The ECC implementation can also detect double-bit errors in the same 64-bit field and multiple-bit errors in the same nibble (4 bits).
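The single-bit correction and double-bit detection described above can be illustrated with a textbook extended Hamming code, which protects 64 data bits with 8 check bits. The following Python sketch is not the server's actual ECC implementation; the function names are hypothetical, and the sketch does not model the detection of multiple-bit errors within a nibble.

```python
from typing import List, Optional, Tuple


def encode(data: int, data_bits: int = 64) -> List[int]:
    """Return a codeword as a list of bits.

    Index 0 holds an overall parity bit; indexes 1..n follow the usual
    Hamming layout, with check bits at the power-of-two positions.
    """
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    n = data_bits + r
    code = [0] * (n + 1)
    bit = 0
    for pos in range(1, n + 1):
        if pos & (pos - 1):              # not a power of two: data position
            code[pos] = (data >> bit) & 1
            bit += 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos
    for i in range(r):                   # set check bits so the syndrome is 0
        if syndrome & (1 << i):
            code[1 << i] = 1
    code[0] = sum(code) & 1              # make the total number of 1s even
    return code


def decode(received: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data); data is None if the error is uncorrectable."""
    code = list(received)
    syndrome = 0
    for pos in range(1, len(code)):
        if code[pos]:
            syndrome ^= pos
    parity_ok = (sum(code) & 1) == 0
    if syndrome == 0 and parity_ok:
        status = "ok"
    elif not parity_ok and syndrome < len(code):
        code[syndrome] ^= 1              # single-bit error: flip it back
        status = "corrected"
    else:
        return ("uncorrectable", None)   # for example, a double-bit error
    data = 0
    bit = 0
    for pos in range(1, len(code)):
        if pos & (pos - 1):
            data |= code[pos] << bit
            bit += 1
    return (status, data)


word = 0x0123456789ABCDEF
codeword = encode(word)

single = list(codeword)
single[17] ^= 1                          # inject a single-bit error
print(decode(single)[0])

double = list(codeword)
double[5] ^= 1                           # inject a double-bit error
double[40] ^= 1
print(decode(double)[0])
```

In this sketch, a single flipped bit is located by the syndrome and corrected, while two flipped bits leave the overall parity even with a nonzero syndrome, so the error is detected but not corrected.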

In addition to providing ECC protection for data, the system offers parity protection on all system address buses. Parity protection is also used on the PCI and SCSI buses, and in the UltraSPARC CPU's internal and external cache.

Status LEDs

The system provides easily accessible light-emitting diode (LED) indicators on the system front panel, internal disk bays, and power supplies to provide a visual indication of system and component status. Status LEDs eliminate guesswork and simplify problem diagnosis for enhanced serviceability.

Status and control panel LEDs are described in "About the Status and Control Panel". Disk drive and power supply LEDs are described in "Error Indications".

Hot-Pluggable Disk Drives

The "hot-plug" feature of the system's internal disk drives permits the removal and installation of drives while the system is operational. All drives are easily accessed from the front of the system. Hot-plug technology significantly increases the system's serviceability and availability by providing the ability to:

Increase storage capacity dynamically to handle larger workloads and improve system performance.

Replace disk drives without service disruption.

For more information about hot-pluggable disk drives, see "About Internal Disk Drives" and "About Disk Array Configurations and Concepts".

Support for RAID 0, RAID 1, and RAID 5 Disk Configurations

The Solstice DiskSuite software designed for use with the system lets you configure system disk storage in a variety of RAID levels. You choose the appropriate RAID configuration based on the price, performance, and reliability/availability goals for your system.

RAID 0 (striping), RAID 1 (mirroring), RAID 0+1 (striping plus mirroring), and RAID 5 (striping with interleaved parity) configurations can all be implemented using Solstice DiskSuite. You can also configure one or more drives to serve as "hot spares" to fill in automatically for a defective drive in the event of a disk failure.
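The parity idea behind RAID 5 can be shown with a byte-wise XOR. The following Python sketch is a general illustration of parity-based reconstruction, not a description of how Solstice DiskSuite stores data; the block contents and the helper name are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe spread across three hypothetical data drives.
stripe = [b"disk0data", b"disk1data", b"disk2data"]
parity = xor_blocks(stripe)

# Simulate losing the second drive: XOR the survivors with the parity block.
rebuilt = xor_blocks([stripe[0], stripe[2], parity])
print(rebuilt == stripe[1])
```

Because XOR is its own inverse, any one missing block in a stripe can be recomputed from the remaining blocks and the parity block.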

For more information about RAID configurations, see "About Disk Array Configurations and Concepts".

Environmental Monitoring and Control

The system features an environmental monitoring subsystem designed to protect against:

Extreme temperatures

Lack of air flow through the system

Power supply problems

Monitoring and control capabilities reside at the operating system level as well as in the system's flash PROM firmware. This ensures that monitoring capabilities are operational even if the system has halted or is unable to boot.

The environmental monitoring subsystem uses an industry-standard I²C bus implemented on the main logic board. The I²C bus is a simple two-wire serial bus, used throughout the system to allow the monitoring and control of temperature sensors, fans, power supplies, and status LEDs.

Temperature sensors are located throughout the system to monitor the ambient temperature of the system and the temperature of each CPU module. The monitoring subsystem frequently polls each sensor and uses the sampled temperatures to:

Regulate fan speeds for maintaining an optimum balance between proper cooling and noise levels.

Report and respond to any over-temperature conditions.

To indicate an over-temperature condition, the monitoring subsystem generates a warning message and, depending on the nature of the condition, may even shut down the system. If a CPU module reaches 60 degrees C or the ambient temperature reaches 53 degrees C, the system generates a warning message and illuminates the temperature fault LED on the status and control panel. If a CPU module reaches 65 degrees C or the ambient temperature reaches 58 degrees C, the system is automatically shut down.

This thermal shutdown capability is also built into the main logic board circuitry as a fail-safe measure. This feature provides backup thermal protection in the unlikely event that the environmental monitoring subsystem becomes disabled at both the software and firmware levels.
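The warning and shutdown thresholds given above can be summarized as a small decision function. The following Python sketch uses the documented temperatures; the function name, the returned labels, and the treatment of "reaches" as greater than or equal to are assumptions made for illustration.

```python
# Thresholds in degrees C, taken from the text above.
CPU_WARN_C, CPU_SHUTDOWN_C = 60, 65
AMBIENT_WARN_C, AMBIENT_SHUTDOWN_C = 53, 58


def thermal_action(cpu_temps_c, ambient_c):
    """Classify one polling sample as 'normal', 'warning', or 'shutdown'."""
    if ambient_c >= AMBIENT_SHUTDOWN_C or any(
        t >= CPU_SHUTDOWN_C for t in cpu_temps_c
    ):
        return "shutdown"
    if ambient_c >= AMBIENT_WARN_C or any(t >= CPU_WARN_C for t in cpu_temps_c):
        return "warning"
    return "normal"


# Hypothetical samples for a two-CPU configuration.
print(thermal_action([48, 51], 30))
print(thermal_action([61, 51], 30))
print(thermal_action([48, 51], 58))
```

The shutdown check comes first so that a sample exceeding both thresholds is never reported as a mere warning.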

All error and warning messages are displayed on the system console (if one is attached) and are logged in the /var/adm/messages file. Front panel fault LEDs remain lit after an automatic system shutdown to aid in problem diagnosis.

The monitoring subsystem is also designed to detect fan failures. The system includes three fans, part of a single assembly called the fan tray assembly. Any fan failure causes the monitoring subsystem to generate an error message and light the general fault LED on the status and control panel.

The power subsystem is monitored in a similar fashion. The monitoring subsystem periodically polls the power supply status registers for a power supply OK status, indicating the status of each supply's +2.5V, +3.3V, +5V, +12V, and -12V DC outputs.

If a power supply problem is detected, an error message is displayed on the console (if one is attached) and logged in the /var/adm/messages file. The power supply LED on the status and control panel is also lit. The LEDs on the power supply itself indicate the type of fault and, if two power supplies are installed, which supply is the source of the fault.

For more information about error messages generated by the environmental monitoring subsystem, see "Environmental Failures". For additional details about the status and control panel LEDs, see "About the Status and Control Panel".

N+1 Power Supply Redundancy

The system can accommodate one or two power supplies. All system configurations can operate with only one power supply installed. A second supply can be used to provide N+1 redundancy, allowing the system to continue operating should one of the power supplies fail.

For more information about power supplies, redundancy, and configuration rules, see "About Power Supplies".

Hot-Swappable Power Supplies

Power supplies in a redundant configuration feature a "hot-swap" capability. You can remove and replace a faulty power supply without turning off the system power or even shutting down the operating system. The power supplies are easily accessed from the rear of the system, without the need to remove system covers.

Automatic System Recovery (ASR)

The system provides for automatic system recovery (ASR) from the following types of hardware component failures:

CPU modules
Memory modules
PCI buses
System I/O interfaces

The automatic system recovery feature allows the system to resume operation after experiencing certain hardware faults or failures. Automatic self-test features enable the system to detect failed hardware components, and an auto-configuring capability designed into the system's boot firmware allows the system to deconfigure failed components and restore system operation. As long as the system is capable of operating without the failed component, the ASR features enable the system to reboot automatically, without operator intervention.

During the power-on sequence, if a faulty component is detected, the component is effectively disabled and, if the system remains capable of functioning, the boot sequence continues. In a running system, some types of failures (such as a processor failure) will usually bring the system down. If this happens, the ASR functionality enables the system to reboot immediately if it is possible for the system to function without the failed component. This prevents a faulty hardware component from keeping the entire system down or causing the system to crash again.
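The decision described above (disable the failed component, then continue only if the system can still function) can be sketched as follows. This Python example is a conceptual model, not the boot firmware's logic; the component names and the rule that at least one CPU module and one memory module must remain are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    kind: str               # "cpu", "memory", "pci", or "io"
    passed_self_test: bool


def plan_boot(components):
    """Return (can_boot, names_of_deconfigured_components)."""
    enabled = [c for c in components if c.passed_self_test]
    deconfigured = [c.name for c in components if not c.passed_self_test]
    # Assumed minimum configuration: one working CPU and one working
    # memory module. The real firmware's criteria are not modeled here.
    has_cpu = any(c.kind == "cpu" for c in enabled)
    has_memory = any(c.kind == "memory" for c in enabled)
    return (has_cpu and has_memory, deconfigured)


# Hypothetical self-test results: one of two CPU modules has failed.
parts = [
    Component("cpu0", "cpu", True),
    Component("cpu1", "cpu", False),
    Component("dimm0", "memory", True),
]
print(plan_boot(parts))
```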

Control over the system's ASR functionality is provided by a number of OpenBoot PROM commands. These are described in the document Platform Notes: Sun Enterprise 250 Server, available on the Solaris on Sun Hardware AnswerBook. This AnswerBook documentation is provided on the SMCC Supplement CD for the Solaris release you are running.

Hardware Watchdog Mechanism

To detect and respond to system hang conditions, the Enterprise 250 server features a hardware watchdog mechanism: a hardware timer that is continually reset as long as the operating system is running. In the event of a system hang, the operating system is no longer able to reset the timer. The timer then expires and causes an automatic system reset, eliminating the need for operator intervention.
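The timer behavior described above can be modeled in a few lines. The following Python sketch is a simulation only; the class name, the one-second polling step, and the timeout value are hypothetical and do not reflect the actual hardware timer.

```python
class WatchdogTimer:
    """Expires unless it is reset before its deadline."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.deadline = timeout_s

    def reset(self, now_s):
        self.deadline = now_s + self.timeout_s

    def expired(self, now_s):
        return now_s >= self.deadline


def simulate(timeout_s, hang_at_s, end_s):
    """Return the time at which the watchdog fires, or None if it never does."""
    timer = WatchdogTimer(timeout_s)
    for now_s in range(end_s + 1):
        if now_s < hang_at_s:
            timer.reset(now_s)        # a running OS keeps resetting the timer
        elif timer.expired(now_s):
            return now_s              # hung OS: the timer runs out
    return None


# Hypothetical run: 5-second timeout, operating system hangs at t = 12 s.
print(simulate(timeout_s=5, hang_at_s=12, end_s=30))
```

The last reset in this run happens at t = 11 s, so the timer expires one full timeout later, at which point the hardware would reset the system.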

Note - The hardware watchdog mechanism is not activated until you enable it.

To enable this feature, you must edit the /etc/system file to include the following entry:

set watchdog_enable = 1

This change does not take effect until you reboot the system.
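Before rebooting, you may want to confirm that the entry is present. The following Python sketch checks the text of an /etc/system file for the entry shown above; the function name is hypothetical, and the sketch assumes that lines beginning with an asterisk are comments and does not account for a later entry that changes the value.

```python
import re

# Matches the entry shown above, allowing flexible spacing.
_ENTRY = re.compile(r"^\s*set\s+watchdog_enable\s*=\s*1\s*$")


def watchdog_entry_present(system_file_text):
    """Return True if an uncommented 'set watchdog_enable = 1' line exists."""
    for line in system_file_text.splitlines():
        if line.lstrip().startswith("*"):
            continue                  # assumed comment line
        if _ENTRY.match(line):
            return True
    return False


sample = "* enable the hardware watchdog\nset watchdog_enable = 1\n"
print(watchdog_entry_present(sample))
```

To check a live system, you could pass the contents of /etc/system to the function, for example with open("/etc/system").read().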

Four Levels of Diagnostics

For enhanced serviceability and availability, the system provides four different levels of diagnostic testing: power-on self-test (POST), OpenBoot diagnostics (OBDiag), SunVTS™, and Solstice™ SyMON™.

POST and OBDiag are firmware-resident diagnostics that can run even if the server is unable to boot the operating system. Application-level diagnostics, such as SunVTS and Solstice SyMON, offer additional troubleshooting capabilities once the operating system is running.

POST diagnostics provide a quick but thorough check of the most basic hardware functions of the system. For more information about POST, see "About Power-On Self-Test (POST) Diagnostics" and "How to Use POST Diagnostics".

OBDiag provides a more comprehensive test of the system, including external interfaces. OBDiag is described in "About OpenBoot Diagnostics (OBDiag)" and "How to Use OpenBoot Diagnostics (OBDiag)".

At the application level, you have access to SunVTS diagnostics. Like OBDiag, SunVTS provides a comprehensive test of the system, including its external interfaces. SunVTS also allows you to run tests remotely over a network connection. You can use SunVTS only if the operating system is running. For more information about SunVTS, see "About SunVTS Software", "How to Use SunVTS Software", and "How to Check Whether SunVTS Software Is Installed".

Another application-level program, called Solstice SyMON, provides you with a variety of continuous system monitoring capabilities. It allows you to monitor your server's hardware status and operating system performance. For more information about SyMON, see "About Solstice SyMON Software".

Remote System Control (RSC)

Remote System Control (RSC) is a secure server management tool that lets you monitor and control your server over modem lines or over a network. RSC provides remote system administration for geographically distributed or physically inaccessible systems. The RSC software works with the System Service Processor (SSP) on the Enterprise 250 main logic board. The RSC and SSP support both serial and Ethernet connections to a remote console.

Once RSC is configured to manage your server, you can use it to run diagnostic tests, view diagnostic and error messages, reboot your server, and display environmental status information from a remote console. If the operating system is down, RSC will notify a central host of any power failures, hardware failures, or other important events that may be occurring on your server.

The RSC provides the following features:

Remote system monitoring and error reporting (including diagnostic output)

Remote reboot on demand

Ability to monitor system environmental conditions remotely

Ability to run diagnostic tests from a remote console

Remote event notification for over-temperature conditions, power supply failures, fatal system errors, or system crashes

Remote access to detailed event logs

Remote console functions on serial and Ethernet ports

For information about configuring and using RSC, see the Remote System Control (RSC) User's Guide, provided with the RSC software.