CHAPTER 2
System Features and Capabilities
The primary features of the Sun Fire E6900/E4900 systems include the ability to partition the system and create domains. These features provide greater reliability, availability, and serviceability, which translates into increased uptime. These features and capabilities are as follows:
The Sun Fire system can be divided into partitions and domains. A single physical system can have multiple independent logical systems, each running its own operating system, by using partitions and domains. Partitions and domains differ only in terms of their flexibility and isolation.
A single physical Sun Fire E6900 system can be divided into two partitions. All connections between boards of one partition and boards of the other partition are disabled. The system logically behaves as two separate systems.
If each partition is assigned to one physical half of the Sun Fire E6900 system, then the power planes associated with each partition are also isolated. A Sun Fire E6900 system can be divided into two partitions by logically isolating one set of Repeater boards for each partition. Sun Fire E4900 systems also support two partitions.
Each partition on the Sun Fire E6900 system can have up to two domains, allowing for up to four domains total. For the Sun Fire E4900 system, if a single partition is established, it can support two domains; if two partitions are established, however, each partition will support only one domain.
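The partition and domain limits stated above can be summarized in a short sketch. The function name `max_domains` and the model strings are illustrative only and are not part of any Sun Fire interface; the numbers come directly from the preceding paragraph.

```python
def max_domains(model, partitions):
    """Return (domains per partition, total domains) for the given model.

    Limits are taken from the text: an E6900 supports up to two domains in
    each partition; an E4900 supports two domains in a single partition but
    only one domain per partition when two partitions are established.
    """
    if model not in ("E6900", "E4900"):
        raise ValueError("unknown model: %r" % model)
    if partitions not in (1, 2):
        raise ValueError("a system has one or two partitions")

    if model == "E6900":
        per_partition = 2
    else:  # E4900
        per_partition = 2 if partitions == 1 else 1
    return per_partition, per_partition * partitions


for model in ("E6900", "E4900"):
    for partitions in (1, 2):
        print(model, partitions, max_domains(model, partitions))
```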
The Sun Fire system can be logically divided into multiple domains. Because each domain consists of one or more system boards, a domain can have between one and 24 processors. Each domain runs its own instance of the operating system and has its own peripherals and network connections. You can configure domains without interrupting the operation of other domains on the same system.
Domains can be used for:
While production work continues on the remaining (and usually larger) domain, there will not be any adverse interaction between any of the domains. You can gain confidence in the correctness of applications without disturbing production work. When the testing work is complete, the system can be rejoined logically without rebooting (there are no physical changes when you use domains). Thus, if problems occur, the rest of your system is not affected.
The Sun Fire E6900 system can have up to four domains. The Sun Fire E4900 system can have up to two domains. Each instance of the Solaris Operating System runs in its own domain. Domains do not depend on each other and do not interact with each other.
A single partition on a Sun Fire E6900 system can be divided into two domains. Unlike partitions, domains share the Repeater boards. Each domain gets half the address bandwidth of a full system bus.
The reliability capabilities of the Sun Fire system fall into four categories:
All the ASICs are designed for worst-case temperature, voltage, frequency, and airflow combinations. The high level of logic integration in the ASICs reduces component and interconnect count.
A distributed power system improves power supply performance and reliability.
Extensive self-test upon power-on reboot after a hardware failure screens all of the key logic blocks in the Sun Fire system:
All I/O cables have a positive lock mechanism and a strain-relief support to prevent accidental disconnections.
The Sun Fire system contains a number of subsystems that are capable of recovering from errors without failing. Subsystems that have a large number of connections have greater odds of failure. The subsystems that have the highest probability of errors are protected from transient errors through the use of single-bit error correction that uses an error-correcting code.
The entire data path from the local data crossbars and the memory subsystem is protected by error-correcting code. Single-bit data errors detected in these subsystems are corrected by the receiving UltraSPARC® IV/IV+ module, and the system is notified for logging purposes that an error has occurred.
The memory subsystem does not check or correct errors but provides the extra storage bits. The Sun Fire data buffer chips use the error-correcting codes to assist in fault isolation.
If a correctable error is detected by the interconnect, the system controller is notified and enough information is saved to isolate the failure to a single net within the interconnect system. The data containing the error is sent through the interconnect unchanged, and the error is reported.
Memory errors are logged by software so that defective DIMMs can be identified and replaced during scheduled maintenance.
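To illustrate the general idea of single-bit error correction, the following sketch uses a textbook Hamming(7,4) code. This is an assumption made for illustration: the text does not specify which error-correcting code the Sun Fire hardware uses, and the real code protects much wider data words. The sketch shows the two behaviors described above: the receiver corrects the flipped bit, and a nonzero syndrome is available for logging.

```python
def encode(data_bits):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(codeword):
    """Correct at most one flipped bit; return (data_bits, syndrome).

    A syndrome of 0 means no error was seen. A nonzero syndrome is the
    1-based position of the corrected bit, which a receiver could log.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome


word = [1, 0, 1, 1]
received = encode(word)
received[4] ^= 1  # simulate a transient single-bit error in transit
data, syndrome = decode(received)
print(data == word, syndrome)
```

A code of this kind corrects only one flipped bit per word. Detecting multiple-bit errors requires additional check bits, which this sketch omits.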
Almost all internal system paths are protected by some form of redundant check mechanism. Transmission of bad data is thus detected, preventing propagation of bad data without notification. All uncorrectable errors result in an error condition. Recovery requires an operating system automatic reboot.
Multiple-bit ECC errors are detected by the receiving port, which notifies the operating system, so that depending upon what process is affected, the system as a whole can avoid failure.
Parity errors on external cache reads to the interconnect become multibit ECC data errors and are handled as other multibit errors.
Any single-bit or multiple-bit errors detected in the address interconnect are unrecoverable and are fatal to the operating system.
Timeout errors detected by the port controller or memory controller are an indication of lost transactions. Timeouts are therefore always unrecoverable.
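The error classes described above can be collected into a small lookup, shown here as a sketch. The enum and the outcome strings are illustrative names chosen for this example, not identifiers from the system controller or the operating system.

```python
from enum import Enum


class ErrorKind(Enum):
    SINGLE_BIT_DATA = "single-bit data ECC error"
    MULTI_BIT_DATA = "multiple-bit data ECC error"
    CACHE_READ_PARITY = "parity error on external cache read"
    ADDRESS_INTERCONNECT = "error in the address interconnect"
    TIMEOUT = "port or memory controller timeout"


# Outcomes paraphrase the text; the wording is an illustrative summary.
OUTCOME = {
    ErrorKind.SINGLE_BIT_DATA: "corrected by receiver and logged",
    ErrorKind.MULTI_BIT_DATA: "detected by receiving port, operating system notified",
    # Cache read parity errors become multibit ECC data errors.
    ErrorKind.CACHE_READ_PARITY: "detected by receiving port, operating system notified",
    ErrorKind.ADDRESS_INTERCONNECT: "unrecoverable, fatal to the operating system",
    ErrorKind.TIMEOUT: "unrecoverable, transaction lost",
}


def is_correctable(kind):
    """Only single-bit data errors are corrected in flight."""
    return kind is ErrorKind.SINGLE_BIT_DATA


for kind in ErrorKind:
    print(kind.value, "->", OUTCOME[kind])
```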
The Sun Fire system uses a highly reliable distributed power system. Each I/O subsystem, CPU/Memory board, System Controller board, or Repeater board within the system has DC-to-DC converters for that board only, with multiple converters for each voltage. When a DC-to-DC converter fails, the system controller is notified. The system board reporting the failure will then be deconfigured from the system. No guarantee is made regarding continued system operation at the time of the failure.
The system chassis environment is monitored for key measures of system stability, such as temperature, airflow, and power supply performance. The system controller constantly monitors the system environmental sensors so that it has enough advance warning of a potential problem to bring the machine gracefully to a halt, avoiding physical damage to the system and possible corruption of data.
The internal temperature of the system is monitored at key locations as a fail-safe mechanism. Based on temperature readings, the system can notify the administrator of a potential problem, begin an orderly shutdown, or power off the system immediately.
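The three escalating responses described above can be sketched as a threshold check. The threshold values below are hypothetical placeholders chosen for the example; the text does not state the actual limits, and the names `Action` and `thermal_action` are illustrative.

```python
from enum import Enum


class Action(Enum):
    NONE = "no action"
    NOTIFY = "notify administrator"
    ORDERLY_SHUTDOWN = "begin orderly shutdown"
    POWER_OFF = "power off immediately"


# Hypothetical thresholds in degrees Celsius (assumed, not from the text).
WARN_C = 60.0
SHUTDOWN_C = 75.0
CRITICAL_C = 90.0


def thermal_action(temp_c):
    """Map a temperature reading to the most severe applicable response."""
    if temp_c >= CRITICAL_C:
        return Action.POWER_OFF
    if temp_c >= SHUTDOWN_C:
        return Action.ORDERLY_SHUTDOWN
    if temp_c >= WARN_C:
        return Action.NOTIFY
    return Action.NONE


for reading in (45.0, 62.5, 80.0, 95.0):
    print(reading, thermal_action(reading).value)
```

Checking the most severe threshold first ensures that a reading above several limits always yields the strongest response.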
The Sun Fire system performs additional sensing to enhance the reliability by enabling constant health checks. DC voltages are monitored at key points within the system. DC current from each power supply is monitored and reported to the system controller. The CPU power control will shut down any overheating CPU without shutting down the system.
For organizations whose goal is to make information instantly available to users across the enterprise, high levels of availability are essential. This is especially true for a large shared resource system such as the Sun Fire system.
The Reliability, Availability, and Serviceability (RAS) goals for the Sun Fire system are to protect the integrity of the customer's data and to maximize availability. The focus is on three areas:
To ensure data integrity at the hardware level, all data is error correction code (ECC) protected, and control buses are protected by parity checks out to the data on the disks. These checks ensure that errors are contained.
For tolerance to errors, resilience capabilities are designed into the Sun Fire system to ensure that the system continues to operate, even in a degraded mode. Because it is a symmetrical multiprocessing system, the Sun Fire system can function with one or more processors disabled. In recovering from a problem, the system is checked quickly to determine the fault and to ensure minimum downtime. The system can be configured with redundant hardware to reduce downtime.
The Sun Fire system capabilities raise its availability from the normal commercial category to the high availability category. These capabilities are grouped as follows:
The Sun Fire system has redundant cooling. If one fan fails, the remaining fans automatically increase their speed, thereby enabling the system to continue to operate, even at the maximum specified ambient. Therefore, operation need not be suspended when a fan fails. You can replace a fan while the system is operating, again without any adverse impact on the availability metric. The Sun Fire system has comprehensive and fail-safe temperature monitoring to ensure that there is no over-temperature stressing of components in the event of a cooling failure.
AC power is supplied to the Sun Fire system through up to four independent, 30-ampere, single-phase Redundant Transfer Switches (RTS). Each RTS module carries power to two or three 2,200-watt bulk DC power supplies.
The AC connections must be controlled by separate customer circuit breakers, and can be on isolated power grids if a high level of availability is required. Optionally, third-party battery backup power can be used to provide AC power in the event of utility failure.
On the Sun Fire system, data errors are detected, corrected, and/or reported by the data buffer on behalf of its associated processor. Additionally, data errors passing through the interconnect are detected and cause a record-stop condition, which the ASICs detect and initiate. The ASIC history buffers and record-stop condition bits can then be read and used by offline diagnostics.
Resiliency capabilities enable processing and data access to continue in spite of a failure, possibly with reduced resources. These capabilities usually require that you reboot the system, and this is counted as repair time in the availability equation.
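The text refers to an availability equation without stating it. A minimal sketch follows, assuming the conventional steady-state definition, availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair. The figures in the usage lines are made-up values that show how reboot time, counted as repair time, lowers the result.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability under the conventional definition.

    mtbf_hours: mean time between failures
    mttr_hours: mean time to repair, including any reboot time
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Made-up example: the same failure rate with and without a 30-minute reboot.
repair_only = availability(5000.0, 1.0)
repair_plus_reboot = availability(5000.0, 1.5)
print(round(repair_only, 6), round(repair_plus_reboot, 6))
```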
The Sun Fire logic DC power system is modular at the system board level. Bulk 56-VDC is supplied through a circuit protector to each system board. This 56 volts is converted through several small DC-to-DC converters to the specific low voltages needed on the board. Failure of a DC-to-DC converter affects only that particular system board. You need to configure only as many bulk DC power supplies as are needed for the particular system configuration. The standard redundant configurations are three DC power supplies for up to three system boards and six DC power supplies for up to six system boards on the Sun Fire E6900 system.
The System Controller board contains the system controller interface as well as the clock source and the emergency shutdown logic. Optionally, you can configure two System Controller boards in the system for redundancy.
The Repeater boards, CPU/Memory boards, and I/O subsystems hold the DC-to-DC converters that power the address repeater, the system data controller, the system data crossbar, and all other ASICs. If one Repeater board fails, the system continues to operate in a degraded mode with two of the four address buses and data buses.
If an UltraSPARC processor, the dual data switch, the external cache SRAMs, or the associated support ASICs fail, the failed processor can be isolated from the remainder of the system by a power-on self-test (POST) configuration step. As long as there is at least one functioning processor available in the configuration, the system can operate.
When POST completes testing the memory subsystem, any faulty banks of memory will be identified. POST can then reconfigure the memory configuration using only reliable memory banks, taking advantage of the highly configurable nature of the address-match logic in the memory controller.
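The processor isolation and memory reconfiguration steps described above amount to keeping only the components that passed testing. The sketch below illustrates that logic; the function name, the component names, and the dictionary layout are hypothetical and do not reflect the actual POST implementation.

```python
def configure_after_post(cpu_passed, bank_passed):
    """Build a configuration from components that passed self-test.

    cpu_passed and bank_passed map a component name to True (passed)
    or False (failed). Failed components are simply left out.
    """
    cpus = [name for name, ok in cpu_passed.items() if ok]
    banks = [name for name, ok in bank_passed.items() if ok]
    if not cpus:
        # The text requires at least one functioning processor.
        raise RuntimeError("no functioning processor; system cannot operate")
    return {"cpus": cpus, "memory_banks": banks}


config = configure_after_post(
    {"cpu0": True, "cpu1": False, "cpu2": True},
    {"bank0": True, "bank1": True, "bank2": False},
)
print(config)
```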
Both the customer mean time between failure and the customer availability measures of the system are enhanced by the Sun Fire system's capability to configure redundant components. There are no components in the system that cannot be configured redundantly if the customer desires. Each system board is capable of independent operation. The Sun Fire system is built with multiple system boards and is inherently capable of operating when only a subset of the configured boards is functional.
In addition to the basic system boards, redundant configurable components include:
You can configure systems with multiple connections to the peripheral devices, enabling redundant controllers and channels. Software maintains the multiple paths and can switch to an alternate path on the failure of the primary path.
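The path failover behavior described above can be sketched as follows. The class and path names are hypothetical and stand in for whatever multipathing software is in use; the sketch shows only the selection logic, in which I/O moves to the next healthy path when the current one is reported as failed.

```python
class MultipathDevice:
    """Track several paths to one peripheral and pick the first healthy one."""

    def __init__(self, paths):
        if not paths:
            raise ValueError("at least one path is required")
        self.paths = list(paths)  # ordered by preference, primary first
        self.failed = set()

    def active_path(self):
        for path in self.paths:
            if path not in self.failed:
                return path
        raise RuntimeError("all paths to the device have failed")

    def report_failure(self, path):
        """Mark a path as failed and return the path to use from now on."""
        self.failed.add(path)
        return self.active_path()


disk = MultipathDevice(["controller-a", "controller-b"])
print(disk.active_path())
print(disk.report_failure("controller-a"))
```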
The system controller is controlled through a console interface workstation. Redundant system controllers and interfaces can be configured if the customer desires.
To reduce repair time, the Sun Fire system has been designed with a number of maintenance capabilities and aids. These are used by the Sun Fire system administrator and by the service provider.
Several capabilities enable service to be performed without forcing scheduled downtime. Failing components are identified in the failure logs in such a way that the field-replaceable unit (FRU) is clearly identified. All boards and power supplies in a properly configured system can be removed and replaced during system operation without scheduled downtime.
Connectors are keyed so that boards cannot be installed upside down. No special tools are required to access the inside of the system. This is because all voltages within the cabinet are considered extra-low voltages (ELVs) as defined by applicable safety agencies.
No jumpers are required for configuration of the Sun Fire system. This makes for a much easier installation of new and/or upgraded system components. There are no slot dependencies other than the special slots required for the System Controller and Repeater boards.
The Sun Fire system cooling-system design includes capabilities that provide strength in the area of RAS. Standard proven parts and components are used wherever possible. FRUs and subassemblies are designed for quick and easy replacement with minimal use of tools.
56-VDC power supplies can be hot-swapped with no interruption to the system. This assumes that the system is configured from the factory for power supply redundancy.
If a fan fails, the remaining working fans are set to high-speed operation by the system controller to compensate for the reduced airflow. The system is designed to operate normally under these conditions until the failed fan assembly is serviced. The fan trays can be hot-swapped with no interruption to the system.
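The fan compensation rule described above can be expressed as a short sketch. The function name, fan names, and speed labels are illustrative assumptions; the actual system controller logic is not described in the text.

```python
def fan_speed_plan(fan_ok):
    """Return the speed setting for each working fan.

    fan_ok maps a fan name to True (working) or False (failed). If any fan
    has failed, every remaining fan is set to high speed to compensate for
    the reduced airflow; otherwise all fans run at normal speed.
    """
    working = [name for name, ok in fan_ok.items() if ok]
    speed = "high" if len(working) < len(fan_ok) else "normal"
    return {name: speed for name in working}


print(fan_speed_plan({"fan0": True, "fan1": True, "fan2": True}))
print(fan_speed_plan({"fan0": True, "fan1": False, "fan2": True}))
```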
The Sun Fire system has an interconnect domain facility that enables the system boards to be assigned to separate domains. For example, one domain can do production while a second domain experimentally runs the next revision of the operating system or exercises a suspected bad board with production-type work.
Nonconcurrent service requires the entire system to be powered off.
Every System Controller board has remote access capability that enables remote login to the system controller. Through this remote connection, all system controller diagnostics are accessible. You can run diagnostics remotely or locally on deconfigured system boards while the operating system is running on the other system boards.