CHAPTER 2 - Reliability, Availability, and Serviceability Features

Reliability, availability, and serviceability (RAS) are aspects of a system's design that affect its ability to operate continuously and to minimize the time necessary to service the system. Reliability refers to a system's ability to operate continuously without failures and to maintain data integrity. System availability refers to the ability of a system to recover to an operational state after a failure, with minimal impact. Serviceability relates to the time it takes to restore a system to service following a system failure. Together, reliability, availability, and serviceability features provide for near continuous system operation.

To deliver high levels of reliability, availability, and serviceability, the Netra 440 server offers the following features:

Hot-swappable hard drives and fan trays

Redundant, hot-swappable power supplies

Sun Advanced Lights Out Manager (ALOM) system controller

Environmental monitoring and fault protection

Automatic system recovery (ASR) capabilities for PCI cards and system memory

ALOM watchdog mechanism and externally initiated reset (XIR) capability

Internal hardware drive mirroring (RAID 1)

Support for drive and network multipathing with automatic failover

Error correction and parity checking for improved data integrity

Easy access to all internal replaceable components

Full in-rack serviceability for nearly all components

For more information about using RAS features, refer to the Netra 440 Server System Administration Guide (817-3884-xx).

Hot-Swappable Components

Netra 440 hardware is designed to support hot-swapping of internal hard drives, fan trays, and power supplies. By using the proper software commands, you can install or remove these components while the system is running. Hot-swap technology significantly increases the system's serviceability and availability by enabling you to do the following:

Increase storage capacity dynamically to handle larger work loads and to improve system performance

Replace hard drives, fan trays, and power supplies without service disruption

3+1 or 2+2 Power Supply Redundancy

The system features four hot-swappable power supplies, two of which are capable of handling the system's entire load. Thus, the four power supplies provide "3+1" or "2+2" redundancy, enabling the system to continue operating should one of the power supplies fail (3+1 redundancy) or its DC power source fail (2+2 redundancy).

Note - Four power supplies must be present at all times to ensure proper system cooling. Even if one power supply has failed, its fans obtain power from the other power supplies through the motherboard to maintain proper system cooling.
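The redundancy arithmetic above can be illustrated with a minimal sketch. The supply counts come from the text; the function names are hypothetical and exist only for this example.

```python
# Illustrative model of the redundancy described above (not vendor code).
TOTAL_SUPPLIES = 4      # from the text: four hot-swappable power supplies
REQUIRED_SUPPLIES = 2   # from the text: two supplies can carry the entire load


def system_stays_up(working_supplies: int) -> bool:
    """Return True if enough supplies are still delivering power."""
    return working_supplies >= REQUIRED_SUPPLIES


def tolerated_failures() -> int:
    """Number of supplies that can stop delivering power without an outage."""
    return TOTAL_SUPPLIES - REQUIRED_SUPPLIES


# One failed supply leaves three working; a failed DC source that feeds
# two supplies leaves two working. Both cases keep the system running.
print(system_stays_up(3), system_stays_up(2), system_stays_up(1))
```

The sketch models only power delivery. As the note states, a failed supply must still remain installed so that its fans continue to cool the system.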

For more information about power supplies, redundancy, and configuration rules, see Power Supplies. For instructions on performing a power supply hot-swap operation, see the Netra 440 Server Service Manual (817-3883-xx).

System Controller

The Sun Advanced Lights Out Manager (ALOM) system controller is a secure server management tool that comes preinstalled on the Netra 440 server, in the form of a module with preinstalled firmware. It lets you monitor and control your server over a serial line or over a network. The ALOM system controller provides remote system administration for geographically distributed or physically inaccessible systems. You can connect to the ALOM system controller card using a local alphanumeric terminal, a terminal server, or a modem connected to its serial management port, or over a network using its 10BASE-T network management port.

When you first power on the system, the ALOM system controller card provides a default connection to the system console through its serial management port. After initial setup, you can assign an IP address to the network management port and connect the network management port to a network. You can run diagnostic tests, view diagnostic and error messages, reboot your server, and display environmental status information using the ALOM system controller software. Even if the operating system is down or the system is powered off, the ALOM system controller can send an e-mail alert about hardware failures, or other important events that can occur on the server.

The ALOM system controller provides the following features:

Default system console connection through its serial management port to an alphanumeric terminal, terminal server, or modem

Network management port for remote monitoring and control over a network, after initial setup

Remote system monitoring and error reporting, including diagnostic output

Remote reboot, power-on, power-off, and reset functions

Ability to monitor system environmental conditions remotely

Ability to run diagnostic tests using a remote connection

Ability to remotely capture and store boot and run logs, which you can review or replay later

Remote event notification for overtemperature conditions, power supply faults, system shutdown, or system resets

Remote access to detailed event logs

For more details about the ALOM system controller hardware, see ALOM System Controller Card and Ports.

For information about configuring and using the ALOM system controller, refer to the Netra 440 Server System Administration Guide (817-3884-xx).

Environmental Monitoring and Control

The Netra 440 server features an environmental monitoring subsystem designed to protect the server and its components against:

Extreme temperatures

Lack of adequate airflow through the system

Operating with missing or misconfigured components

Power supply failures

Internal hardware faults

Monitoring and control capabilities are handled by the ALOM system controller firmware. This ensures that monitoring capabilities remain operational even if the system has halted or is unable to boot, and without requiring the system to dedicate CPU and memory resources to monitor itself. If the ALOM system controller fails, the operating system reports the failure and takes over limited environmental monitoring and control functions.

The environmental monitoring subsystem uses an industry-standard I²C bus. The I²C bus is a simple two-wire serial bus used throughout the system to allow the monitoring and control of temperature sensors, fans, power supplies, status LEDs, and the front panel rotary switch.

Temperature sensors are located throughout the system to monitor the ambient temperature of the system, the CPUs, and the CPU die temperature. The monitoring subsystem polls each sensor and uses the sampled temperatures to report and respond to any overtemperature or undertemperature conditions. Additional I²C sensors detect component presence and component faults.

The hardware and software together ensure that the temperatures within the enclosure do not exceed predetermined "safe operation" ranges. If the temperature observed by a sensor falls below a low-temperature warning threshold or rises above a high-temperature warning threshold, the monitoring subsystem software lights the system Service Required LEDs on the front and back panels. If the temperature condition persists and reaches a critical threshold, the system initiates a graceful system shutdown. In the event of a failure of the ALOM system controller, backup sensors are used to protect the system from serious damage, by initiating a forced hardware shutdown.
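The threshold behavior described above can be sketched as a simple classification. All threshold values below are invented for illustration and are not the server's actual limits. The text mentions a critical threshold without giving its direction, so the low critical threshold here is an assumption.

```python
from enum import Enum


class Action(Enum):
    NONE = "no action"
    SERVICE_LED = "light Service Required LEDs"
    GRACEFUL_SHUTDOWN = "graceful system shutdown"


# Hypothetical thresholds in degrees Celsius; NOT the server's real values.
LOW_CRITICAL = -10.0    # assumption: the text does not confirm a low critical limit
LOW_WARNING = 0.0
HIGH_WARNING = 60.0
HIGH_CRITICAL = 70.0


def respond(temp_c: float) -> Action:
    """Map one sampled sensor temperature to the response described in the text."""
    if temp_c <= LOW_CRITICAL or temp_c >= HIGH_CRITICAL:
        return Action.GRACEFUL_SHUTDOWN
    if temp_c < LOW_WARNING or temp_c > HIGH_WARNING:
        return Action.SERVICE_LED
    return Action.NONE


for sample in (25.0, 65.0, 75.0):
    print(sample, respond(sample).value)
```

The sketch evaluates a single sample. It does not model the persistence of a condition over time or the backup sensors that force a hardware shutdown when the ALOM system controller has failed.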

All error and warning messages are sent to the system console and logged in the /var/adm/messages file. Service Required LEDs remain lit after an automatic system shutdown to aid in problem diagnosis.

The power subsystem is monitored in a similar fashion. Polling the power supply status periodically, the monitoring subsystem indicates the status of each supply's outputs, inputs, and presence.

If a power supply problem is detected, an error message is sent to the system console and logged in the /var/adm/messages file. Additionally, LEDs located on each power supply light to indicate failures. The system Service Required LED lights to indicate a system fault.

Automatic System Recovery

The system provides automatic system recovery (ASR) from component failures in memory modules and PCI cards.

The ASR features enable the system to resume operation after experiencing certain nonfatal hardware faults or failures. Automatic self-test features enable the system to detect failed hardware components. An auto-configuring capability designed into the system's boot firmware enables the system to unconfigure failed components and to restore system operation. As long as the system can operate without the failed component, the ASR features enable the system to reboot automatically, without operator intervention.

During the power-on sequence, if a faulty component is detected, the component is marked as failed and, if the system can function, the boot sequence continues. In a running system, some types of failures can bring down the system. If this happens, the ASR functionality enables the system to reboot immediately if it is possible for the system to detect the failed component and operate without it. This prevents a faulty hardware component from keeping the entire system down or causing the system to crash repeatedly.

Note - ASR functionality is not enabled until you activate it. Control over the system ASR functionality is provided by several OpenBoot commands and configuration variables. For additional information, refer to the Netra 440 Server System Administration Guide.
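The power-on behavior described in this section can be sketched as follows. This is a conceptual model only and does not represent OpenBoot firmware. The Component class, the required flag, and the boot function are hypothetical names. The sketch assumes that the boot sequence stops at a failed component when ASR has not been activated, which the text does not state.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    passed_self_test: bool
    required: bool  # hypothetical flag: the system cannot operate without it


def boot(components, asr_enabled=True):
    """Simulate one power-on sequence.

    Returns (booted, unconfigured), where unconfigured lists the names of
    failed components that were left out so the boot could continue.
    """
    failed = [c for c in components if not c.passed_self_test]
    if not failed:
        return True, []
    # Assumption: without ASR, or if an essential part failed, the boot stops.
    if not asr_enabled or any(c.required for c in failed):
        return False, []
    return True, [c.name for c in failed]


parts = [
    Component("dimm0", passed_self_test=True, required=False),
    Component("dimm1", passed_self_test=False, required=False),
    Component("cpu0", passed_self_test=True, required=True),
]
print(boot(parts))
```

In this model, a failed memory module is unconfigured and the boot continues, while a failure in a component the system cannot run without halts the sequence.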

Sun StorEdge Traffic Manager

Sun StorEdge™ Traffic Manager, a feature found in Solaris 8 and later operating systems, is a native multipathing solution for storage devices such as Sun StorEdge drive arrays. Sun StorEdge Traffic Manager provides the following features:

Host-level multipathing

Physical host controller interface (pHCI) support

Sun StorEdge T3, Sun StorEdge 3510, and Sun StorEdge A5x00 support

Load balancing

For more information, refer to the Netra 440 Server System Administration Guide (817-3884-xx).

ALOM Watchdog Mechanism and XIR

To detect and respond to a system hang, should one ever occur, the Netra 440 server features an ALOM "watchdog" mechanism, which is a timer that is continually reset as long as the operating system and user application are running. In the event of a system hang, the operating system is no longer able to reset the timer. The timer will then expire and cause an automatic externally initiated reset (XIR), eliminating the need for operator intervention. When the ALOM watchdog mechanism issues the XIR, debug information is displayed on the system console.
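The watchdog principle can be illustrated with a small simulation. This is a generic watchdog timer sketch, not the ALOM implementation. The class name, method names, and the 60-second timeout are assumptions made for the example.

```python
class Watchdog:
    """Simulated watchdog timer driven by explicit timestamps in seconds."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_reset = 0.0
        self.xir_issued = False

    def pat(self, now: float) -> None:
        """Reset the timer; a healthy operating system calls this regularly."""
        self.last_reset = now

    def tick(self, now: float) -> bool:
        """Check the timer; returns True once the timeout has expired."""
        if now - self.last_reset >= self.timeout_s:
            self.xir_issued = True  # stands in for issuing the XIR
        return self.xir_issued


wd = Watchdog(timeout_s=60.0)   # hypothetical timeout value
wd.pat(now=50.0)                # the operating system is still responsive
print(wd.tick(now=100.0))       # 50 seconds since the last reset
print(wd.tick(now=110.0))       # 60 seconds with no reset: timer expires
```

Passing timestamps explicitly, rather than reading a real clock, keeps the sketch deterministic and easy to follow.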

The XIR feature is also available for you to invoke manually at the ALOM system controller prompt. Use the ALOM system controller reset -x command when the system is unresponsive and neither an L1-A (Stop-A) keyboard command nor the alphanumeric terminal Break key works. When you issue the reset -x command, the system immediately returns to the OpenBoot ok prompt. From there, you can use OpenBoot commands to debug the system.

For more information, refer to the Netra 440 Server System Administration Guide (817-3884-xx) and the Netra 440 Server Diagnostics and Troubleshooting Guide (817-3886-xx).

Support for RAID Storage Configurations

By attaching one or more external storage devices to the Netra 440 server, you can use a redundant array of independent drives (RAID) software application such as Solstice DiskSuite™ or VERITAS Volume Manager to configure system drive storage in a variety of different RAID levels. Configuration options include RAID 0 (striping), RAID 1 (mirroring), RAID 0+1 (striping plus mirroring), RAID 1+0 (mirroring plus striping), and RAID 5 (striping with interleaved parity). You choose the appropriate RAID configuration based on the price, performance, reliability, and availability goals for your system. You can also configure one or more hard drives to serve as "hot spares" to fill in automatically in the event of a hard drive failure.
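To show why a parity-based level such as RAID 5 can survive the loss of one drive, the following minimal sketch computes XOR parity over one stripe and rebuilds a missing block. It illustrates the general parity principle only, not the behavior of Solstice DiskSuite or VERITAS Volume Manager. The function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the single missing data block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# One stripe spread across three data drives, plus its parity block.
stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(stripe)

# Simulate losing the second drive and rebuilding its block.
recovered = rebuild_missing([stripe[0], stripe[2]], parity)
print(recovered == stripe[1])
```

A real RAID 5 volume also rotates the parity block across the drives from stripe to stripe, which this single-stripe sketch does not model.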

In addition to software RAID configurations, you can set up a hardware RAID 1 (mirroring) configuration for any pair of internal hard drives using the on-board Ultra-4 SCSI controller, providing a high-performance solution for hard drive mirroring.

For more information, refer to the Netra 440 Server System Administration Guide (817-3884-xx).

Error Correction and Parity Checking

DIMMs employ error-correcting code (ECC) to ensure high levels of data integrity. The system reports and logs correctable ECC errors. (A correctable ECC error is any single-bit error in a 128-bit field.) Such errors are corrected as soon as they are detected. The ECC implementation can also detect double-bit errors in the same 128-bit field and multiple-bit errors in the same nibble (4 bits). In addition to ECC protection for data, the system uses parity protection on the PCI and UltraSCSI buses and in the UltraSPARC IIIi CPU internal caches.
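The idea of correcting single-bit errors while detecting double-bit errors can be demonstrated at a much smaller scale. The sketch below uses an extended Hamming code over 4 data bits as a stand-in. The text does not describe the code the server's memory subsystem actually uses on its 128-bit fields, so the choice of code and all names here are assumptions.

```python
DATA_POSITIONS = (3, 5, 6, 7)   # codeword positions that hold data bits
PARITY_POSITIONS = (1, 2, 4)    # Hamming parity bits; position 0 is overall parity


def encode(nibble: int) -> list:
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    bits = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (nibble >> i) & 1
    for p in PARITY_POSITIONS:
        bits[p] = sum(bits[i] for i in range(1, 8) if i & p and i != p) % 2
    bits[0] = sum(bits[1:]) % 2
    return bits


def decode(bits):
    """Return (nibble, status); nibble is None when the error is uncorrectable."""
    bits = list(bits)
    syndrome = 0
    for i in range(1, 8):
        if bits[i]:
            syndrome ^= i          # syndrome equals the position of a single flipped bit
    overall = sum(bits) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        bits[syndrome] ^= 1        # single-bit error: flip it back
        status = "corrected"
    else:
        return None, "uncorrectable"   # double-bit error: detected, not corrected
    nibble = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return nibble, status


word = encode(0b1011)
damaged = list(word)
damaged[5] ^= 1                    # inject a single-bit error
print(decode(word), decode(damaged))
```

Flipping two bits of the codeword instead of one leaves the overall parity unchanged while producing a nonzero syndrome, which is how the decoder recognizes an error it cannot correct.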

Sun Java System Cluster Software

Sun Java System Cluster software lets you connect up to eight Sun servers in a cluster configuration. A cluster is a group of nodes that are interconnected to work as a single, highly available and scalable system. A node is a single instance of Solaris software. The software can be running on a standalone server or on a domain within a standalone server. With Sun Java System Cluster software, you can add or remove nodes while online, and mix and match servers to meet your specific needs.

Sun Java System Cluster software delivers scalability and, through automatic fault detection and recovery, high availability, helping to ensure that mission-critical applications and services are available when needed.

With Sun Java System Cluster software installed, other nodes in the cluster automatically take over and assume the workload when a node goes down. The software delivers predictability and fast recovery capabilities through features such as local application restart, individual application failover, and local network adapter failover. Sun Java System Cluster software significantly reduces downtime and increases productivity by helping to ensure continuous service to all users.

The software lets you run both standard and parallel applications on the same cluster. It supports the dynamic addition or removal of nodes, and enables Sun servers and storage products to be clustered together in a variety of configurations. Existing resources are used more efficiently, resulting in additional cost savings.

Sun Java System Cluster software allows nodes to be separated by up to 10 kilometers. This way, in the event of a disaster in one location, all mission-critical data and services remain available from the other unaffected locations.

For more information, see the documentation supplied with the Sun Java System Cluster software.