C H A P T E R 3 - Reliability, Availability, and Serviceability

Reliability, availability, and serviceability (RAS) are measures of a system's ability to operate continuously and to minimize service times. Reliability reduces failures and ensures data integrity. Serviceability provides short service cycles when component upgrades are necessary or failures occur. When high reliability (to avoid failures) is combined with quick serviceability (to recover rapidly from failures), the result is high availability. The availability of a system defines continuous accessibility to the functions and applications supported by the system. The features that support these goals are described in the following sections:

Section 3.1, SPARC CPU Error Protection

Section 3.2, System Interconnect Error Protection

Section 3.3, Redundant Components

Section 3.4, Reconfigurable Sun Fireplane Interconnect

Section 3.5, Automatic System Recovery

Section 3.6, System Controller

Section 3.7, Concurrent Serviceability

3.1 SPARC CPU Error Protection

The CPU has error correction code (ECC) protection on its external cache SRAM and parity protection on the major internal SRAM structures, as shown in FIGURE 3-1. The letters P and E in the block diagram denote parity generate and check; and ECC generate, check, and correct by the receiving unit, respectively. A parity error on an internal cache structure is corrected by software, ensuring correct operation after the fault.

FIGURE 3-1 CPU Error Detection and Correction

Diagram showing address and data path error detection and correction on a CPU chip.

The external cache data resides on eight high-speed (4 ns) SRAMs. A single-bit error-correcting and double-bit error-detecting code protects the 64-byte-wide cache lines. Errors during data-cache or instruction-cache fills are recovered by software flushing and invalidation. Errors during system data transactions are corrected by hardware.

The Sun Fire E25K/E20K address bus connections between the CPU and the address repeater are protected by parity.
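The parity generate-and-check scheme used on these links can be illustrated with a minimal sketch. It assumes a single even-parity bit over the whole word; the chapter does not state how parity bits are assigned to bus fields, and the address value and function names are hypothetical.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 when the word holds an odd number of 1 bits."""
    return bin(word).count("1") & 1


def transmit(word: int):
    """Sender side: return the word together with its generated parity bit."""
    return word, parity_bit(word)


def parity_ok(word: int, parity: int) -> bool:
    """Receiver side: regenerate parity and compare with the bit received."""
    return parity_bit(word) == parity


address, p = transmit(0x3FA51000)   # hypothetical address value
single_flip = address ^ (1 << 7)    # one bit corrupted in transit
double_flip = address ^ 0b11        # two bits corrupted in transit

print(parity_ok(address, p))        # clean transfer
print(parity_ok(single_flip, p))    # a single-bit error is detected
print(parity_ok(double_flip, p))    # a double-bit error is not visible to parity
```

Parity detects any odd number of flipped bits but cannot correct them or see an even number of flips, which is why the data paths also carry ECC.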

The CPU generates both parity and ECC for all outgoing data blocks. The parity is checked by the receiving dual-CPU data switch. The ECC is checked by all data switch units in the path of a transfer. ECC is checked and corrected by the CPU when it receives a data block.

3.2 System Interconnect Error Protection

FIGURE 3-2 shows the protection methods at various points in the address and data interconnect. The letters P, E, and C in the block diagram denote parity generate and check; ECC check; and ECC generate, check, and correct by the receiving unit, respectively. Dashed lines denote the address interconnect, and solid lines denote the data interconnect.

3.2.1 Address Interconnect Error Protection

The Sun Fireplane interconnect address bus has three parity-error bits. In addition to the bus-level protection, the address and response crossbars on the Sun Fire E25K/E20K Sun Fireplane interconnect have ECC protection for address transactions across the Sun Fireplane interconnect. The ECC corrects single-bit address errors and detects double-bit errors. An address parity or uncorrectable ECC error stops execution in the affected dynamic system domain.

3.2.2 Data Interconnect Error Protection

All data interconnect transactions move a 64-byte-wide data block. System devices generate ECC when they source data, either for a write from the device or in response to a read of the device. They check ECC and correct single-bit errors when they receive data. Data is thus protected against both memory and data path errors from end to end.
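The generate-at-source, check-and-correct-at-receiver behavior can be sketched with a toy code. The example below uses an extended Hamming (8,4) code, which corrects single-bit errors and detects double-bit errors. It is an illustration only: the actual code applied to the 64-byte data blocks is not described in this chapter, and the function names are hypothetical.

```python
def encode(nibble: int) -> list:
    """Source side: build an 8-bit codeword from 4 data bits.

    Index 0 holds overall parity; indexes 1..7 form a Hamming (7,4) code
    with check bits at positions 1, 2, and 4.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) & 1
    return code


def decode(code: list):
    """Receiver side: return (status, nibble).

    status is "ok", "corrected" (single-bit error fixed), or
    "uncorrectable" (double-bit error detected, nibble is None).
    """
    c = list(code)
    syndrome = 0
    for position in range(1, 8):
        if c[position]:
            syndrome ^= position      # XOR of the positions of all set bits
    overall = sum(c) & 1              # 0 when overall parity still holds
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        c[syndrome] ^= 1              # syndrome 0 means the parity bit itself flipped
        status = "corrected"
    else:
        return "uncorrectable", None  # two bits flipped: detect, do not guess
    return status, c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)


codeword = encode(0b1011)
codeword[6] ^= 1                      # inject a single-bit error in transit
print(decode(codeword))
```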

3.2.3 Data Interconnect Error Isolation

If system devices checked only ECC when they received data, it would be difficult to diagnose the cause of an error. If a device generates bad ECC on a write to memory, the error can be detected by some other devices, but the cause of the error is difficult to isolate. There are two additional checks to help isolate the cause of the errors:

Individual point-to-point data links are covered by parity. This is denoted by a P in FIGURE 3-2.

ECC is checked as it enters or leaves each system device by the level 1 data switch. This is denoted by an E in FIGURE 3-2.

The ECC checks that are performed by the data switch can identify the source of ECC errors in most cases. A particularly hard case for ECC error correction occurs when a device writes bad ECC into memory. These errors are detected much later by other devices reading these locations. Since the bad device writer might have written bad ECC to many locations and these might be read by many devices, the errors appear to be in many memory locations while the real error might be a single bad device writer.

Because the data switch ASICs check the ECC for all data entering or leaving each device from other devices, the original source of errors can be isolated. For example, a bad device writer that writes bad ECC to a memory on a different board produces ECC errors that are detected in two data switches. The direction and transaction tag information can identify which CPU pair was the source of the error and which device is the target of a bad ECC device writer.

If the bad device writer writes bad ECC to its local memory, the data does not pass through a data switch. Therefore, the bad device writer is not detected until the data with the bad ECC is read by either the same CPU or another device. In either case, the cause of the ECC error can be isolated to the pair of CPUs that share the dual CPU data switch (DCDS). If the data is read by the same CPU, the fact that the data switch on that board never detected an error indicates that the data was corrupted by the local CPU or the DCDS. If the data is read by a different CPU pair, then the data passes through a data switch and the ECC error is detected as originating from a particular DCDS or the associated CPUs.
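The isolation reasoning above can be summarized as a small decision procedure. This is a simplified sketch, not the diagnosis software shipped with the system: the record layout, device names, and switch names are hypothetical, and each data switch is assumed to log one ECC result per transfer, ordered from source to destination.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SwitchReport:
    switch: str        # hypothetical name of the data switch that checked ECC
    from_device: str   # device from which the block entered this switch
    ecc_ok: bool       # result of the ECC check at this switch


def isolate(reports: List[SwitchReport], reader: str) -> str:
    """Return the suspect for an ECC error that `reader` has just observed.

    The first switch on the path that saw bad ECC points at the device
    upstream of it. If no switch saw bad ECC, the data never crossed a
    switch in a bad state, so the fault is local to the reader's own
    CPU pair or its dual CPU data switch.
    """
    for report in reports:
        if not report.ecc_ok:
            return report.from_device
    return reader


# A bad writer on one board wrote to memory on another board:
# both data switches on the path saw the bad ECC.
remote = [
    SwitchReport("board3-switch", "cpu-pair-3A", ecc_ok=False),
    SwitchReport("board7-switch", "expander-7", ecc_ok=False),
]
print(isolate(remote, reader="cpu-pair-7B"))

# Bad ECC written to local memory and read back by the same CPU pair:
# no data switch was involved.
print(isolate([], reader="cpu-pair-5B"))
```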

3.2.4 Console Bus Error Protection

The console bus is a secondary bus that enables the system controller to access the inner workings of the machine without having to rely on the integrity of the primary data and address buses. This enables the system controller to operate even when there is a fault preventing the continuation of the main operation. The console bus is common to all domains and is parity protected.

FIGURE 3-2 Interconnect ECC and Parity Checking

Diagram showing address and data path error detection and correction between the CPU/Memory and I/O boards, and the expander board and Sun Fireplane.

3.3 Redundant Components

System availability is greatly enhanced by the ability to configure redundant components. All hot-swap components in the system can be configured redundantly, if the customer desires. Each system board is capable of independent operation. Sun Fire E25K/E20K systems are built with multiple system boards and are inherently capable of operating with a subset of the configured boards.

Redundant system components include:

CPU/Memory boards

I/O assemblies

PCI cards

System Control boards

System clock sources

Bulk power supplies

Fan trays

3.3.1 Redundant CPU/Memory Boards

A Sun Fire E25K system can configure up to 18 CPU/Memory boards. A Sun Fire E20K system can configure up to 9 CPU/Memory boards. Each board contains up to four CPUs and their associated memory banks. Each CPU/Memory board is capable of independent operation and can be hot-swapped out of a running system and moved between system domains. The system is inherently capable of operating with a subset of the configured boards.

3.3.2 Redundant I/O Assemblies

A Sun Fire E25K system can configure up to 18 I/O assemblies (hsPCI-X/hsPCI+). A Sun Fire E20K system can configure up to 9 I/O assemblies. Each assembly supports up to four PCI cards. The I/O assemblies can be hot-swapped out of running systems and moved between system domains.

3.3.3 Redundant PCI Cards

You can mount a standard PCI card in the Sun Fire E25K/E20K PCI I/O assembly by using a special cassette that enables the cards to be changed using the hot-swap-replacement procedures. You can configure systems with multiple connections to the peripheral devices, enabling redundant controllers and channels. Software maintains the multiple paths and can switch to an alternate path if the primary fails.

3.3.4 Redundant System Control Boards

Sun Fire E25K/E20K systems contain two System Control boards. The system controller software running in each embedded CPU checks the other system controller and copies state information to enable automatic failover to the other system controller if the active System Control board fails.

One of the two boards acts as the main System Control board, and the other is a hot-swap replaceable spare. The main System Control board provides all the system controller resources for the system. If a hardware or software failure occurs on the main System Control board, or if a failure occurs on any hardware control path (console bus interface, Ethernet interface) from the main System Control board to other system devices, the system controller failover software automatically triggers a failover to the spare System Control board. The spare System Control board assumes the role of the main System Control board and takes over all the main system controller responsibilities. The system controller data, configuration, and log files are replicated on both System Control boards.
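The failover behavior described above can be sketched as follows. This is a minimal model under stated assumptions, not the actual system controller software: the class names, the set of health checks, and the polling approach are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical health checks, named after the failure sources in the text.
CHECKS = ("hardware", "software", "console_bus", "ethernet")


@dataclass
class ControlBoard:
    name: str
    health: Dict[str, bool] = field(default_factory=lambda: {c: True for c in CHECKS})
    state: Dict[str, str] = field(default_factory=dict)

    def is_healthy(self) -> bool:
        return all(self.health.values())


class FailoverPair:
    def __init__(self, main: ControlBoard, spare: ControlBoard):
        self.main, self.spare = main, spare

    def poll(self) -> bool:
        """Run one monitoring cycle. Return True if a failover took place."""
        if self.main.is_healthy():
            # Keep the spare's copy of data, configuration, and logs current.
            self.spare.state = dict(self.main.state)
            return False
        if not self.spare.is_healthy():
            return False              # no healthy board to fail over to
        self.main, self.spare = self.spare, self.main
        return True


sc0, sc1 = ControlBoard("SC0"), ControlBoard("SC1")
pair = FailoverPair(main=sc0, spare=sc1)
sc0.state["domain_a"] = "running"
pair.poll()                           # normal cycle: state is replicated
sc0.health["console_bus"] = False     # a control path from the main board fails
switched = pair.poll()
print(switched, pair.main.name, pair.main.state)
```

Because state is replicated on every healthy cycle, the spare already holds the information it needs when it takes over.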

3.3.5 Redundant System Clocks

Sun Fire E25K/E20K systems have redundant system clocks. If the system clock on one System Control board fails, the consumers of the clock lines continue to draw clock resources from the other System Control board until downtime can be arranged to replace the failed System Control board.

3.3.6 Redundant Power

The Sun Fire E25K/E20K cabinet uses six 4-kW dual AC-DC power supplies. Two power cables go to each AC power supply, so that each can connect to a separate power source. These supplies convert the input power to 48 VDC, and are N+1 redundant. Therefore, the system can continue running with a failed power supply, if necessary.

The power supplies can be replaced while the system is in operation. Power is distributed to the individual system board sets through separate DC circuit breakers. Each board set has its own on-board voltage converters that transform 48 VDC to the levels required by the on-board logic components. Failure of a DC-to-DC converter affects only that particular system board.
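As a rough illustration of what N+1 redundancy means for the figures above, the arithmetic can be written out. The sketch assumes that the supplies share load equally and that each delivers its rated 4 kW; the load value is hypothetical.

```python
SUPPLY_KW = 4.0      # rating of each bulk supply, from the text
SUPPLY_COUNT = 6     # supplies in the cabinet, from the text


def usable_capacity_kw(failed: int = 0) -> float:
    """Total capacity left after `failed` supplies drop out."""
    working = max(SUPPLY_COUNT - failed, 0)
    return working * SUPPLY_KW


def survives(load_kw: float, failed: int) -> bool:
    """True if the remaining supplies can still carry the load."""
    return load_kw <= usable_capacity_kw(failed)


load_kw = 19.0       # hypothetical cabinet load
print(usable_capacity_kw(0), usable_capacity_kw(1))
print(survives(load_kw, failed=1), survives(load_kw, failed=2))
```

With N+1 redundancy, five supplies (N) must be able to carry the full load, so under these assumptions a configuration stays within the redundancy guarantee only while its load does not exceed the capacity of five supplies.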

3.3.7 Redundant Fans

There are four fan trays above and four fan trays below the system boards. Each fan tray contains two layers of six-inch fans. The fans have two speeds: nominal and high speed. If any of the sensed components in the system overheat, all fans are set to high speed. If a single fan fails, the redundant fan in the corresponding layer of the tray switches to high speed. The fans are N+1 redundant, enabling the system to run with a failed fan. The fan trays can be hot-swapped while the system is running.
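The fan speed rules for one tray can be sketched as a small function. The pairing of fans is an assumption: the sketch treats the fan at the same position in the other layer as the redundant partner, and the number of fans per layer is hypothetical.

```python
NOMINAL, HIGH, FAILED = "nominal", "high", "failed"


def fan_speeds(failed: set, overheated: bool, fans_per_layer: int = 3) -> dict:
    """Return the commanded speed of every fan in one two-layer tray.

    `failed` holds (layer, position) tuples of failed fans.
    `overheated` is True if any sensed component in the system is too hot.
    """
    speeds = {}
    for layer in (0, 1):
        for position in range(fans_per_layer):
            fan = (layer, position)
            partner = (1 - layer, position)   # assumed redundant partner
            if fan in failed:
                speeds[fan] = FAILED
            elif overheated or partner in failed:
                speeds[fan] = HIGH
            else:
                speeds[fan] = NOMINAL
    return speeds


print(fan_speeds(failed={(0, 1)}, overheated=False))
```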

3.4 Reconfigurable Sun Fireplane Interconnect

Sun Fire E25K/E20K systems have three independent crossbars implemented on the Sun Fireplane interconnect: one for addresses, one for responses, and one for data. The Sun Fireplane interconnect contains 20 ASICs and is the only non-hot-swap logic component in the system. Because a failed Sun Fireplane interconnect ASIC cannot be removed from a running system, each of the three Sun Fireplane interconnect crossbars can be independently configured in and out of a degraded mode. A degraded mode is separately configurable for each system domain.

3.5 Automatic System Recovery

A suitably configured system always reboots after a failure. The system controller locates the fault; reconfigures the system excluding the failed CPU, memory, I/O, or interconnect component; and reboots the operating system.

The system controller configures only the parts that have a clear fatal-error bit. Field-replaceable units (FRUs) that have already been detected as faulty, by this or another machine, should not be used.
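The recovery sequence can be sketched as a filter over the installed parts. This is an illustrative model only; the record type, part names, and function names are hypothetical, and the real system controller stores the fatal-error indication with the FRU itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fru:
    name: str
    fatal_error: bool = False   # set once the part has been diagnosed as faulty


def bootable_configuration(frus: List[Fru]) -> List[str]:
    """Names of the parts that may be configured: fatal-error bit clear."""
    return [f.name for f in frus if not f.fatal_error]


def recover(frus: List[Fru], failed_name: str) -> List[str]:
    """Mark the located fault, then return the configuration to reboot with."""
    for f in frus:
        if f.name == failed_name:
            f.fatal_error = True
    return bootable_configuration(frus)


# Hypothetical inventory.
inventory = [Fru("SB0/CPU0"), Fru("SB0/CPU1"), Fru("SB0/MEM0"), Fru("IO0")]
print(recover(inventory, "SB0/CPU1"))
```

Because the bit stays with the part, a later boot, on this or another machine, would exclude the same FRU until it is repaired.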

3.5.1 Built-In Self-Test

Built-in self-test (BIST) logic in the ASICs applies pseudo-random patterns at the system clock rate, providing high-fault coverage of combinatorial logic. The local BIST operates within each ASIC and verifies the correct operation of the ASIC. The interconnect built-in self-test performs an interconnect test to verify that the ASICs can communicate across the interconnect. The local built-in self-tests rely on the interfaces of each ASIC sending each other known test data.
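Pseudo-random test patterns of this kind are commonly produced by a linear-feedback shift register (LFSR). The sketch below is an assumption about technique, not a description of the ASICs: the chapter does not say how the patterns are generated, and the 16-bit register, seed, and the small adder used as the block under test are all hypothetical.

```python
def lfsr16(seed: int):
    """Yield successive states of a 16-bit Fibonacci LFSR.

    Taps 16, 14, 13, and 11 give a maximal-length sequence, so every
    nonzero 16-bit pattern appears once before the sequence repeats.
    """
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("seed must be nonzero")
    while True:
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)
        yield state


def run_bist(block, reference, seed: int = 0xACE1, patterns: int = 1000) -> bool:
    """Apply pseudo-random patterns and compare the block with a known-good model."""
    generator = lfsr16(seed)
    for _ in range(patterns):
        pattern = next(generator)
        if block(pattern) != reference(pattern):
            return False
    return True


def adder(pattern: int) -> int:
    """Known-good combinatorial block: add the two bytes of the pattern."""
    return ((pattern >> 8) + (pattern & 0xFF)) & 0x1FF


def faulty_adder(pattern: int) -> int:
    """Same block with output bit 4 stuck at 0."""
    return adder(pattern) & ~0x10


print(run_bist(adder, adder))
print(run_bist(faulty_adder, adder))
```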

3.5.2 Power-On Self-Test

The power-on self-test (POST) tests each logic block first in isolation, and then with progressively more of the system. Failing components are isolated from the Sun Fireplane interconnect. The result is that the system is booted only with logic blocks that have passed this self-test and that should operate without error.

Local POST runs in each CPU and system POST runs in the system controller.

3.6 System Controller

The heart of Sun's availability technology is the system controller. This controller contains an off-the-shelf SPARCengine Netra 2140 6U cPCI board with an UltraSPARC-IIi embedded system. This board runs Solaris software and System Management software.

The system controller has access through JTAG to registers in each significant chip in the machine and continuously monitors the state of the machine. If a problem is detected, the system controller attempts to determine what hardware has failed and then takes steps to prevent the failed hardware from being used until it has been replaced.

The system controller performs the following main functions:

Configures the system by setting up the system and coordinating the boot process

Sets up the system partitions and domains

Generates the system clocks

Monitors the environmental sensors throughout the system

Detects and diagnoses errors and enables recovery

Provides the platform console functionality and the domain consoles

Provides routing through a system log of messages to a syslog host

3.6.1 Console Bus

The console bus is a secondary bus that enables the system controller to access the inner workings of the system without having to rely on the integrity of the system address and data buses. This enables the system controller to operate even when there is a fault preventing the continuation of system operation. The console bus is parity protected.

3.6.2 Environmental Monitoring

The system controller regularly monitors the system environmental sensors to obtain enough advance warning of a potential problem that the machine can be brought gracefully to a halt, avoiding physical damage to the system and possible corruption of data.

The environmental items monitored include:

Power state

Voltages

Fan speed

Temperatures

Device failure

Device presence
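The idea of advance warning followed by a graceful halt can be sketched as a two-level threshold check. All sensor names and limit values below are hypothetical; the chapter does not give the actual thresholds or the policy the system controller applies.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Limit:
    halt_low: float    # at or below this value, halt the affected hardware
    warn_low: float    # at or below this value, raise a warning
    warn_high: float   # at or above this value, raise a warning
    halt_high: float   # at or above this value, halt the affected hardware


def classify(value: float, limit: Limit) -> str:
    if value <= limit.halt_low or value >= limit.halt_high:
        return "halt"
    if value <= limit.warn_low or value >= limit.warn_high:
        return "warn"
    return "ok"


def assess(readings: Dict[str, float], limits: Dict[str, Limit]) -> str:
    """Return the most severe status across all sensors."""
    order = {"ok": 0, "warn": 1, "halt": 2}
    statuses = [classify(value, limits[name]) for name, value in readings.items()]
    return max(statuses, key=order.get, default="ok")


# Hypothetical limits for a board temperature sensor and a 48 VDC rail.
LIMITS = {
    "board_temp_c": Limit(halt_low=-5.0, warn_low=5.0, warn_high=60.0, halt_high=75.0),
    "rail_48v": Limit(halt_low=42.0, warn_low=45.0, warn_high=51.0, halt_high=54.0),
}

print(assess({"board_temp_c": 41.0, "rail_48v": 48.1}, LIMITS))
print(assess({"board_temp_c": 63.0, "rail_48v": 48.1}, LIMITS))
print(assess({"board_temp_c": 63.0, "rail_48v": 41.0}, LIMITS))
```

The gap between the warning and halt limits is what gives the controller time to stop the machine in an orderly way.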

3.7 Concurrent Serviceability

The most significant serviceability feature of the Sun Fire E25K/E20K systems is concurrent service: the ability to service various parts of the machine, including replacing system boards online, without interfering with a running system. Failing components are identified in the failure logs, with the affected FRUs clearly named. With the exception of the Sun Fireplane interconnect, power centerplane, fan backplane, and the power module, all boards and power supplies in the system can be removed and replaced during system operation without scheduled downtime using hot-swap replacement procedures. You can also replace the System Control board that is currently active or switch control to the redundant System Control board without causing a disruption in the main system operation.

The ability to repair these items without downtime is a significant contributor in achieving higher availability. A by-product of this online repairability of the system concerns upgrades to the on-site hardware. Customers might want to have additional memory or an extra I/O controller. These operations can be accomplished online, resulting in only a brief (and minor) loss of performance while the system board affected is temporarily taken out of service.

Concurrent service is a function of the following hardware facilities:

All Sun Fireplane interconnect connections are point to point, which makes it possible to logically isolate system boards by dynamically reconfiguring the system.

Sun Fire E25K/E20K systems use a distributed DC power system. Each system board has its own power supply, enabling each system board to be powered on or off individually.

All ASICs that connect off-board to the Sun Fireplane interconnect have a loopback mode that enables the system board to be verified before it is dynamically reconfigured into the system.

3.7.1 Dynamic Reconfiguration of System Boards

The online removal and replacement of a system board in a running system is called dynamic reconfiguration. For example, a board can remain configured in the system even though one of its CPUs has failed. To replace the failed module without incurring downtime, dynamic reconfiguration isolates the board from the system so that the board can be replaced using the hot-swap replacement procedures. This dynamic reconfiguration operation has three distinct steps:

Dynamic detach

Hot-swap

Dynamic attach

Dynamic reconfiguration enables a board that is not currently being used by the system to provide resources to the system. It can be used in conjunction with hot-swap replacement to upgrade a system without incurring any downtime or to move resources from one domain to another domain. It can also be used to replace a defective module that was deconfigured by the system and subsequently hot-swapped and repaired or replaced.

Dynamic deconfiguration and reconfiguration are accomplished by the system administrator (or service provider) working through the system controller. The following process is used during configuration changes and hot-swap replacement procedures:

1. The Solaris operating system scheduler is informed of the board in question, to prevent new processes from starting. Meanwhile, any running processes and I/O operations are completed, and memory contents are rewritten into other memory banks.

2. A switchover to alternate I/O paths takes place so that when the I/O assembly is removed, the system continues to have access to the data.

3. The system administrator performs the hot-swap operation, by manually removing the now deconfigured system board from the system. The removal sequences are controlled by the system controller, and the system administrator follows the software instructions.

4. The removed system board is repaired, exchanged, or upgraded.

5. The new board is reinserted into the system.

6. The swapped system board is dynamically configured by the operating system when inserted. The I/O can be switched back, the scheduler assigns new processes, and the memory starts to fill.
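The ordering constraints in the steps above can be sketched as a small state machine: a board must be dynamically detached before it is pulled, and dynamically attached after it is reinserted. The state names, operation names, and board name are hypothetical and do not correspond to actual system controller commands.

```python
from enum import Enum


class BoardState(Enum):
    CONFIGURED = "configured"   # in use by a domain
    DETACHED = "detached"       # in its slot but isolated from the system
    REMOVED = "removed"         # physically out of the cabinet


# Allowed transitions: (current state, operation) -> next state.
TRANSITIONS = {
    (BoardState.CONFIGURED, "dynamic_detach"): BoardState.DETACHED,
    (BoardState.DETACHED, "remove"): BoardState.REMOVED,
    (BoardState.REMOVED, "insert"): BoardState.DETACHED,
    (BoardState.DETACHED, "dynamic_attach"): BoardState.CONFIGURED,
}


class SystemBoard:
    def __init__(self, name: str):
        self.name = name
        self.state = BoardState.CONFIGURED
        self.history = []

    def apply(self, operation: str) -> None:
        """Perform one operation, refusing any that is out of sequence."""
        key = (self.state, operation)
        if key not in TRANSITIONS:
            raise RuntimeError(f"{operation} not allowed while {self.state.value}")
        self.state = TRANSITIONS[key]
        self.history.append(operation)


board = SystemBoard("SB4")      # hypothetical board name
for operation in ("dynamic_detach", "remove", "insert", "dynamic_attach"):
    board.apply(operation)
print(board.state.value, board.history)
```

Modeling the sequence this way makes the key safety property explicit: a board that is still configured into a domain can never be removed.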

With a combination of dynamic reconfiguration and hot-swap replacement, the Sun Fire E25K/E20K systems can be repaired or upgraded with minimal user inconvenience. On-site exchange of system boards through hot-swap replacement reduces the time that a board is out of service to minutes.

An additional advantage of dynamic reconfiguration and hot-swap replacement of hardware is that online system upgrades can be performed. For instance, when a customer purchases an additional system board, it too can be added to the system without disturbing operation.

3.7.2 System Control Board Set Removal and Replacement

The hot-spare System Control board set, which is not actively supplying system clocks, can be removed from a running system.

3.7.3 Bulk Power Supply Removal and Replacement

Bulk 4-kW dual AC-DC power supplies can be hot-swapped with no interruption to the system because the remaining power supplies can power the system during replacement.

3.7.4 Fan Tray Removal and Replacement

When a fan fails, the system controller compensates by switching the corresponding fan on the other layer to high-speed operation. The system is designed to operate normally under these conditions until the failed fan assembly can be conveniently serviced. The fan trays can be hot-swapped with no interruption to the system.

3.7.5 Remote Service

An optional capability for automatic email reporting of unplanned reboots and error log information to customer service headquarters sites is available. Every system controller has remote access capability that enables remote login to the system controller. Through this remote connection, all system controller diagnostics are accessible. Diagnostics can be run remotely or locally on deconfigured system boards while the Solaris software is running on the other system boards.