About Reliability, Availability, and Serviceability

The Sun Blade 8000 Series includes many blade-centric and chassis-wide features that increase reliability, availability, and serviceability (RAS). RAS features are aspects of a system's design that affect its ability to operate continuously and that minimize the time needed to service it. Reliability refers to the system's ability to operate continuously without failures and to maintain data integrity. Availability refers to the system's ability to recover to an operational state after a failure, with minimal impact. Serviceability relates to the time it takes to restore the system to service following a component failure. Together, the RAS features of the Sun Blade 8000 Series provide for near-continuous operation.

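This topic includes the following sections:

  • Hot-Pluggable Components

  • Redundant Components

  • Environmental Monitoring

  • Error Correction and Parity

  • RAS Features Summary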

Hot-Pluggable Components

Sun Blade 8000 Series hardware supports hot-plugging of the chassis-mounted Sun Blade Server Modules (blades), Sun Blade 8000 Network Express Modules, PCI Express ExpressModules, Chassis Monitoring Modules, fan modules, power supply modules, and hard disk drives. Using the proper software commands, you can install or remove these components while the system is running. Hot-plug technology significantly increases the system's serviceability and availability by enabling you to replace these components without service disruption. For more information, see About Hot-Pluggable Components.

Redundant Components

The Sun Blade 8000 Series provides redundant components that enable the system to continue operating if one of the associated components fails. This redundancy minimizes the impact of component problems and servicing. The redundant components include the following:

  • Server Modules (blades), depending on system configuration

  • Power supply modules

  • PCI Express ExpressModules (Sun Blade 8000 Chassis only)

  • Network Express Modules

  • Chassis Monitoring Modules

  • System fans

Environmental Monitoring

The Sun Blade 8000 Series features an environmental monitoring subsystem designed to protect components against the following:

  • Extreme temperatures

  • Lack of adequate airflow throughout the system

  • Power supply failures

  • Hardware faults

Temperature sensors located throughout the system monitor the ambient temperature of the chassis and internal components. The software and hardware ensure that the temperatures within the chassis do not exceed predetermined safe operating ranges. If the temperature observed by a sensor falls below or rises above a set threshold, the monitoring software subsystem lights the amber Service Required indicators on the front and back of the system. If the temperature condition persists and reaches a critical threshold, the system may initiate a graceful system shutdown.
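The two-stage response described above (an out-of-range reading lights the Service Required indicators, and a reading that reaches a critical threshold may lead to a graceful shutdown) can be sketched as a small classification function. This is only an illustration of the decision logic, not the monitoring subsystem's actual code: the function name, the `Action` values, and every threshold number are hypothetical, and the sketch assumes a critical limit on both the low and the high side.

```python
from enum import Enum


class Action(Enum):
    NONE = "no action"
    SERVICE_LED = "light amber Service Required indicators"
    SHUTDOWN = "system may initiate a graceful shutdown"


def evaluate_temperature(temp_c, warn_low, warn_high, crit_low, crit_high):
    """Map one sensor reading to the response described in the text.

    All thresholds are hypothetical inputs; real limits are set per sensor.
    """
    # A reading that reaches a critical limit takes priority over a warning.
    if temp_c <= crit_low or temp_c >= crit_high:
        return Action.SHUTDOWN
    # A reading outside the warning band lights the service indicators.
    if temp_c < warn_low or temp_c > warn_high:
        return Action.SERVICE_LED
    return Action.NONE


# Example with made-up limits (degrees Celsius).
for reading in (25, 45, 55):
    print(reading, evaluate_temperature(reading, 5, 40, 0, 50).value)
```

The sketch evaluates a single reading; the persistence check implied by "if the temperature condition persists" is left out for brevity.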

All error and warning messages are sent to the Chassis Monitoring Module (CMM) and are logged in the Sun ILOM log file. Additionally, some customer-replaceable units (CRUs), such as power supplies, fans, and DIMMs, provide LEDs that indicate a failure within the CRU.

Error Correction and Parity

The AMD dual-core processors on the Sun Blade X8400, X8420, and X8440 Server Modules (blades) and the Intel quad-core processor on the X8450 Server Module provide parity protection on internal cache memories and error-correcting code (ECC) protection of the data. The system can detect the following types of errors and log them to the system event log (SEL):

  • Correctable and uncorrectable memory ECC errors

  • Service processor (SP) correctable memory ECC errors

  • Correctable and uncorrectable CPU internal errors

  • Faults in the chassis shared infrastructure, including fan and power supply faults

Advanced ECC corrects up to 4 bits in error on nibble boundaries, as long as they are all in the same DRAM. If a DRAM fails, the DIMM or FBDIMM continues to function.
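To make the "nibble boundary" condition concrete, the following sketch checks whether every bit in error falls inside a single aligned 4-bit group. It illustrates the stated condition only and does not implement the ECC code used by the memory controller. The function name and the bit-mask representation are hypothetical, and the sketch assumes that each aligned nibble of the data word comes from one DRAM device.

```python
def within_one_nibble(error_mask: int) -> bool:
    """Return True if all set bits of error_mask lie in one aligned nibble.

    error_mask has a 1 for every bit in error within a data word.
    Assumption: each aligned 4-bit group maps to a single DRAM device.
    """
    if error_mask == 0:
        return True  # no bits in error, nothing to correct
    # Index of the lowest bit in error.
    lowest = (error_mask & -error_mask).bit_length() - 1
    # Start of the aligned nibble that contains that bit.
    nibble_start = lowest - (lowest % 4)
    # All remaining error bits must fit in the same 4-bit group.
    return (error_mask >> nibble_start) <= 0xF


print(within_one_nibble(0b1111_0000))  # four errors in one aligned nibble
print(within_one_nibble(0b0011_1100))  # four errors straddling two nibbles
```

Under these assumptions the first pattern meets the condition and the second does not, even though both contain four bits in error.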

RAS Features Summary

Feature: Description

Power supplies: Hot-pluggable; integrated into the chassis, making the blades more reliable

  • For the Sun Blade 8000 Chassis – N+N configuration

  • For the Sun Blade 8000 P Chassis – N+1 configuration

Airflow and cooling: Fans are integrated into the chassis, making the fans, blades, and power supplies more reliable

For the Sun Blade 8000 Chassis:

  • 3 hot-pluggable front fan modules cool the PCI Express ExpressModules

  • 6 fans, integral to the power supplies, cool the power supplies

  • 9 hot-pluggable rear fan modules cool the blades

For the Sun Blade 8000 P Chassis:

  • 4 fans, integral to the power supplies, cool the power supplies

  • 9 hot-pluggable rear fan modules cool the blades

Server Modules (blades): Hot-pluggable; servicing can be done without affecting cabling or I/O configuration

Memory: ECC-protected memory and CPUs

I/O modules: Hot-pluggable PCI Express ExpressModules (for the Sun Blade 8000 Chassis only) and Network Express Modules

Server Module (blade) disk drives: Hot-pluggable; configurable in RAID-0 (striping) and RAID-1 (mirroring) configuration

Chassis Monitoring Modules: Hot-pluggable; active/standby operation with two CMMs installed

Service processors: Redundant connection to the internal management network

Sun ILOM and system management: Intelligent per-blade and chassis-wide management functions; Sun ILOM continues to function and be accessible when the operating system goes offline or the system is powered off; provides remote management of the blades and remote floppy and CD-ROM emulation

Hardware upgrades: No tools required to access user-upgradeable modules

Software upgrades: Network-based booting and network-based operating system and BIOS upgrades

Power-on and restart: Automatic server restart; network-based booting capability

Troubleshooting: Troubleshooting includes:

  • Environmental monitoring

  • Failure prediction analysis

  • Rapid response lighting of system status indicators

  • Service LED indicators

  • System error logging, including logging to the system event log (SEL)
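The N+N and N+1 power supply configurations listed in the summary differ in how many supply failures the chassis can absorb. The short sketch below works through that arithmetic under the usual meaning of the terms: N supplies carry the full load, and the remainder are spares. The function name is hypothetical, and the supply counts in the usage lines are illustrative values, not figures taken from this document.

```python
def tolerated_failures(total_supplies: int, scheme: str) -> int:
    """Return how many supply failures a redundancy scheme can absorb.

    N+N: half of the installed supplies can carry the full load.
    N+1: one supply beyond the N needed for the full load is installed.
    """
    if scheme == "N+N":
        if total_supplies < 2 or total_supplies % 2:
            raise ValueError("N+N needs an even number of supplies (at least 2)")
        return total_supplies // 2
    if scheme == "N+1":
        if total_supplies < 2:
            raise ValueError("N+1 needs at least 2 supplies")
        return 1
    raise ValueError(f"unknown scheme: {scheme}")


# Illustrative counts only.
print(tolerated_failures(6, "N+N"))  # half of six supplies may fail
print(tolerated_failures(4, "N+1"))  # a single supply may fail
```

N+N tolerates the loss of an entire half of the supplies, for example a whole power feed, while N+1 covers a single supply failure with fewer spare units.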