Sun Fire 15K/12K Systems Introduction
This chapter provides the following introductory information for the Sun Fire 15K/12K systems:
- Section 1.1, "System Boards"
- Section 1.2, "System Configuration"
- Section 1.3, "System Interconnects"
- Section 1.4, "Dynamic System Domains"
- Section 1.5, "Reliability, Availability, and Serviceability"
The Sun Fire 15K/12K systems use the latest UltraSPARC III Cu CPU and the Sun Fireplane interconnect architecture, and run the binary-compatible Solaris 8 UNIX® operating environment (FIGURE 1-1). The Sun Fireplane interconnect is designed to keep system performance scaling with faster CPUs. The systems also provide industry-leading dynamic system domain and reliability, availability, and serviceability (RAS) capabilities, and use active-centerplane technology.
FIGURE 1-1 Sun Fire 15K/12K Systems
The Sun Fire 15K/12K systems are essentially the same. The Sun Fire 15K system has the capacity for 18 CPU/Memory boards and 18 I/O boards. The Sun Fire 12K system has the capacity for nine CPU/Memory boards and nine I/O boards. Each system contains two System Control boards (one main and one spare).
1.1 System Boards
1.1.1 CPU/Memory Boards
The CPU/Memory board holds four CPUs. Each CPU has an associated memory subsystem of eight DIMMs, so memory bandwidth and capacity both scale up as CPUs are added. With 1-Gbyte DIMMs, the memory capacity of the board is 32 Gbytes (4 CPUs x 8 DIMMs x 1 Gbyte). The maximum memory bandwidth inside a board is 9.6 Gbytes per second. The CPU/Memory board has a 4.8 Gbytes per second connection to the rest of the system.
1.1.2 I/O Boards
The Sun Fire 15K/12K hot-swap PCI assembly architecture (hsPCI-X/hsPCI+) has two I/O controllers. Together, the controllers provide one 33-MHz peripheral component interconnect (PCI) bus and three 33/66/90-MHz PCI buses, for a total of four on each I/O assembly. Therefore, each I/O assembly has four hot-swap PCI slots. A Sun Fire I/O assembly has a 2.4 Gbytes per second connection to the rest of the system.
1.1.3 System Controller
The system controller is the heart of the availability and serviceability technology of the Sun Fire 15K/12K systems. It configures the system, coordinates the boot process, sets up the dynamic system domains, monitors the system environmental sensors, and handles error detection, diagnosis, and recovery. Two System Control boards are configured into the system to provide redundancy and automatic failover if one board fails.
1.1.4 Peripherals
The Sun Fire 15K/12K systems cabinet does not have room for peripherals, with the exception of the system controller peripherals (DVD-ROM, digital audio tape (DAT) drive, and hard drive). However, more peripheral devices can be configured in additional peripheral expansion racks.
1.2 System Configuration
TABLE 1-1 summarizes the maximum configuration of the Sun Fire 15K/12K systems.
TABLE 1-1 Sun Fire 15K/12K System Maximum Configuration
Component | 15K Configuration | 12K Configuration
CPU/Memory boards | 18 | 9
CPUs | 72 | 36
Number of DIMMs | 576 | 288
Memory capacity (with 1-Gbyte DIMMs) | 576 Gbytes | 288 Gbytes
Sun Fireplane interconnect | Active | Active
Repeater boards | NA | NA
Expander boards | 18 | 9
Domains | 18 | 9
I/O boards (assemblies) | 18 | 9
PCI assembly types | hsPCI+, hsPCI-X | hsPCI+, hsPCI-X
PCI slots per assembly | 4 | 4
Maximum PCI slots | 72 | 36
Bulk power supplies | 6 | 6
Power requirements | 24 kW | 24 kW
System Control boards | 2 | 2
Redundant cooling | Yes | Yes
Redundant AC input | Yes | Yes
Enclosure | Sun Fire 15K/12K Systems cabinet | Sun Fire 15K/12K Systems cabinet
Room in enclosure for peripherals | No | No
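Several rows of TABLE 1-1 follow directly from the per-board figures in Section 1.1. The following Python sketch shows that arithmetic. The function and constant names are illustrative and are not part of any Sun software.

```python
# Per-board figures quoted in Section 1.1 (names are illustrative only).
CPUS_PER_BOARD = 4
DIMMS_PER_CPU = 8
DIMM_GBYTES = 1
PCI_SLOTS_PER_ASSEMBLY = 4


def max_configuration(board_sets):
    """Derive maximum counts for a system with the given number of board sets."""
    cpus = board_sets * CPUS_PER_BOARD
    dimms = cpus * DIMMS_PER_CPU
    return {
        "cpus": cpus,
        "dimms": dimms,
        "memory_gbytes": dimms * DIMM_GBYTES,
        "pci_slots": board_sets * PCI_SLOTS_PER_ASSEMBLY,
    }


sun_fire_15k = max_configuration(18)
sun_fire_12k = max_configuration(9)
print(sun_fire_15k)
print(sun_fire_12k)
```

The derived values match the CPU, DIMM, memory capacity, and maximum PCI slot rows of the table for both systems.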
1.3 System Interconnects
TABLE 1-2 summarizes the interconnect capacities of the Sun Fire 15K/12K systems.
TABLE 1-2 Sun Fire 15K/12K Systems Interconnect Specifications
Interconnect | Specification
System clock | 150 MHz
Coherency protocol | Snooping on each board set, directory across a centerplane
System address interconnect | 18 snoopy buses, 18x18 global address crossbar, 18x18 global response crossbar
CPU/Memory board internal bisection bandwidth | 4.8 Gbytes/sec
CPU/Memory board off-board data port | 4.8 Gbytes/sec
I/O board off-board data port | 2.4 Gbytes/sec
System data interconnect | 18 3x3 board set crossbars, 18x18 global crossbar
System bisection bandwidth | 43 Gbytes/sec
Average lmbench (back-to-back load) latency, assuming random accesses | 326 ns

Note - The definition of snooping, as given in PCI System Architecture, Third Edition, Appendix A: Glossary, 1995, by MindShare, Inc. (ISBN 0-201-40993-3):
Snooping - When a memory access is performed by an agent other than the cache controller, the cache controller must snoop the transaction to determine if the current master is accessing information that is also resident within the cache. If a snoop hit occurs, the cache controller must take an appropriate action to ensure the continued consistency of its cached information.
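The snooping definition in the note can be illustrated with a toy model. This is a minimal sketch, not a description of the Sun Fireplane protocol: it assumes the "appropriate action" on a snoop hit is to invalidate the cached copy, and every class and function name is hypothetical.

```python
class SnoopingCache:
    """Toy cache that snoops writes made by other agents on a shared bus."""

    def __init__(self, name):
        self.name = name
        self.lines = {}  # address -> cached value

    def load(self, address, memory):
        # Fill the line from memory on a miss, then return the cached value.
        if address not in self.lines:
            self.lines[address] = memory[address]
        return self.lines[address]

    def snoop(self, address):
        # Assumed action on a snoop hit: invalidate the stale copy.
        if address in self.lines:
            del self.lines[address]
            return True
        return False


def bus_write(writer, caches, memory, address, value):
    """Write through to memory and let every other cache snoop the address."""
    memory[address] = value
    writer.lines[address] = value
    return [c.name for c in caches if c is not writer and c.snoop(address)]


memory = {0x100: 1}
a, b = SnoopingCache("A"), SnoopingCache("B")
caches = [a, b]
a.load(0x100, memory)
b.load(0x100, memory)
hits = bus_write(a, caches, memory, 0x100, 2)
print(hits)                    # caches that reported a snoop hit
print(b.load(0x100, memory))   # B refetches the updated value
```

After the write, cache B has dropped its stale copy, so its next load returns the new value from memory.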
1.3.1 Sun Fireplane Interconnect Architecture
The Sun Fire 15K/12K systems use the Sun Fireplane interconnect architecture, which is the coherent shared-memory protocol used by the UltraSPARC III Cu CPU generation. This is the fourth generation of shared-memory interconnect. Sun Microsystems uses an improved system interconnect with each new CPU generation to keep system performance scaling with CPU performance.
The Sun Fireplane interconnect architecture is an evolutionary improvement over the previous-generation Ultra Port Architecture (UPA). The system clock rate is increased by 50%, from 100 MHz to 150 MHz. The number of snoops per clock is doubled, from one half to one. Taken together, these changes triple the snooping bandwidth, from 50 million to 150 million addresses per second.
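The snooping bandwidth comparison is simple arithmetic: clock rate multiplied by snoops per clock. The short Python sketch below reproduces it. The function name is illustrative.

```python
def snoop_bandwidth(clock_mhz, snoops_per_clock):
    """Return snooped addresses per second, in millions."""
    return clock_mhz * snoops_per_clock


upa = snoop_bandwidth(100, 0.5)       # previous generation
fireplane = snoop_bandwidth(150, 1)   # Sun Fireplane interconnect
print(upa, fireplane, fireplane / upa)
```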
The Sun Fireplane interconnect architecture also adds a new layer of point-to-point directory-coherency protocol. This protocol is used in systems that require more bandwidth than a single snoopy bus can provide. This facility enables coherency to be maintained between multiple snoopy buses.
FIGURE 1-2 shows the Sun Fireplane interconnect architecture of the Sun Fire 15K system. The board diagrams show the actual on-board connectivity but omit the switch and controller chips for clarity.
FIGURE 1-2 Sun Fireplane Interconnects
The Sun Fire 15K/12K systems use an expander board to implement a 3x3 switch between a CPU/Memory board, an I/O board, and the Sun Fireplane interconnect port. The systems have three 18x18 crossbars on the Sun Fireplane interconnect, one each for addresses, responses, and data, so that address traffic does not interfere with data traffic. The peak Sun Fireplane interconnect bandwidth of the Sun Fire 15K/12K systems is 43 Gbytes per second.
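The 43 Gbytes per second figure is consistent with a simple bisection estimate: 18 board sets, each with a 4.8 Gbytes per second off-board port, with half of the ports on each side of the bisection. The source does not spell out this derivation, so the Python sketch below is an assumption about how the number arises, and the names are illustrative.

```python
BOARD_SETS = 18
PORT_GBYTES_PER_SEC = 4.8  # CPU/Memory board off-board data port


def bisection_bandwidth(board_sets, port_bandwidth):
    """Estimate bisection bandwidth, assuming half the ports sit on each side."""
    return board_sets * port_bandwidth / 2


estimate = bisection_bandwidth(BOARD_SETS, PORT_GBYTES_PER_SEC)
print(round(estimate, 1))
```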
1.3.2 Address Interconnect
The dashed lines in FIGURE 1-2 are the snoopy address buses. A snoop can occur at every system clock. In the Sun Fire 15K/12K systems, there is a separate snoopy address bus on each board set. A board set is the combination of a CPU/Memory board, an I/O board, and an expander board. Coherency is maintained between board sets by using the point-to-point (directory) portion of the coherency protocol.
1.3.3 Data Interconnect
The solid lines in FIGURE 1-2 represent the data paths. The small circles at the intersections of these lines indicate three-port switches. The CPU/Memory board has three levels of 3x3 switches between a CPU or memory unit and the off-board port. The off-board bandwidth of a CPU/Memory board is 4.8 Gbytes per second. The bandwidth of an I/O board is 2.4 Gbytes per second.
1.4 Dynamic System Domains
Each domain in the Sun Fire 15K/12K systems includes one or more CPU/Memory boards and one or more I/O boards. Each domain runs its own instance of the Solaris operating environment and has its own peripherals and network connections. Domains can be reconfigured without interrupting the operation of other domains. Domains can be used for:
- Testing new applications
- Making operating system updates
- Supporting various departments
- Removing and reinstalling boards for repair or upgrade
As an example, a fully populated Sun Fire 15K system can be partitioned into three domains to handle three types of functions:
- Domain 1 is set up to run online transaction processing (OLTP). It is a 32-CPU domain containing eight boards of four CPUs each.
- Domain 2 is set up to run decision support software (DSS). It is also a 32-CPU domain containing eight boards of four CPUs each.
- Domain 3 is set up as a domain for developers. It is a two-board domain, each board with four CPUs.
Boards can be automatically migrated between domains as load demands change.
The Sun Fire 15K system can have up to 18 domains. The Sun Fire 12K system can have up to 9 domains. Domains are isolated from each other by the interconnect application-specific integrated circuits (ASICs).
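The partitioning example and the domain limits can be checked with a few lines of Python. This is a minimal sketch using only the limits stated in this section. The function name and the dictionary layout are hypothetical and do not correspond to any system controller command.

```python
# Maximum CPU/Memory boards per system; the maximum domain count is the same.
MAX_BOARDS = {"15K": 18, "12K": 9}
CPUS_PER_BOARD = 4


def check_partition(model, domains):
    """Validate a partition and return the CPU count of each domain.

    domains maps a domain name to its number of CPU/Memory boards.
    """
    limit = MAX_BOARDS[model]
    if len(domains) > limit:
        raise ValueError("too many domains for this system")
    if any(boards < 1 for boards in domains.values()):
        raise ValueError("each domain needs at least one CPU/Memory board")
    if sum(domains.values()) > limit:
        raise ValueError("more CPU/Memory boards than the system holds")
    return {name: boards * CPUS_PER_BOARD for name, boards in domains.items()}


example = {"OLTP": 8, "DSS": 8, "developers": 2}
cpus = check_partition("15K", example)
print(cpus)
print(sum(cpus.values()))
```

The three example domains use all 18 CPU/Memory boards and all 72 CPUs of a fully populated Sun Fire 15K system. The same partition is rejected for a Sun Fire 12K system, which holds only nine CPU/Memory boards.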
1.5 Reliability, Availability, and Serviceability
Reliability, availability, and serviceability (RAS) are critical requirements of customers who deploy business-critical applications. The Sun Fire 15K/12K systems build upon the industry-leading RAS capabilities. The sections that follow describe some of the major features that improve RAS.
1.5.1 Integrated Circuit Reliability
- Start-up diagnostics. All major Sun Fire 15K/12K systems ASICs perform a built-in self-test (BIST) at power-on. The BIST applies random patterns at the system clock rate to provide high fault coverage of combinatorial logic. The power-on self-test (POST) is controlled from the system controller and first tests each logic block in isolation. The POST then continues testing, using progressively more of the system. Failing components are electrically isolated from the Sun Fireplane interconnect. As a result, the system is booted only with logic blocks that have passed this self-test and that should operate without error.
- Internal SRAM protection inside the UltraSPARC III Cu CPU. With higher-density CPUs and lower-core voltages, SRAM cells have become more vulnerable to bit flips from cosmic-ray disturbances. Single-bit errors for the majority of the internal SRAMs are detected and are recoverable.
- External SRAM protection. All external SRAMs are protected by error-correcting codes (ECC). This includes the external cache data of the CPU and the coherency directory cache of the Sun Fire 15K/12K systems.
1.5.2 Interconnect Reliability
- Address interconnect protection. The Sun Fire 15K/12K systems address buses and control signals are parity protected to detect single-bit errors. In addition, the address and response crossbars on the Sun Fireplane interconnect have ECC protection to correct single-bit errors and detect double-bit errors.
- Data interconnect protection. The entire system data path is protected by ECC, which corrects single-bit errors and detects double-bit errors before they can cause data corruption. ECC is generated by a CPU or I/O controller when it initiates a write command. The extra bits are carried throughout the interconnect to the destination. The memory subsystem does not check or correct errors, but only provides the extra storage bits. When data is read out of memory, it is checked and, if necessary, corrected by the receiving CPU or I/O controller. To help isolate failures, parity is also checked as data is passed from chip to chip. The data switch ASICs also check ECC. The ECC patterns used detect complete DRAM chip failures but cannot correct them.
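The behavior described above, correcting single-bit errors and detecting double-bit errors, is the defining property of a SECDED code. The source does not say which code the Sun Fire 15K/12K systems use, so the following Python sketch uses a textbook extended Hamming code on 4-bit values purely as an illustration. All names are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword."""
    code = [0] * 8                     # index 0 holds the overall parity bit
    data = [(nibble >> i) & 1 for i in range(4)]
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the total parity even
    return code


def decode(code):
    """Return (nibble, status); nibble is None if the error is uncorrectable."""
    code = list(code)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1            # single-bit error at index `syndrome`
        status = "corrected"
    else:
        return None, "uncorrectable"   # two bits flipped: detect only
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
one_flip = list(word)
one_flip[5] ^= 1
two_flips = list(one_flip)
two_flips[2] ^= 1
print(decode(word))
print(decode(one_flip))
print(decode(two_flips))
```

As in the text, the check bits are generated by the writer and verified by the reader. The storage in between only has to carry the extra bits.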
1.5.3 Fault-Tolerant Redundancy
A failure in any of the following subsystems does not cause any loss of availability.
- N+1 redundancy. The AC power inputs, the bulk-power supplies, and the cooling fans are all fault tolerant through N+1 redundancy. If one of these subunits fails, the remainder of the components can continue system operation without interruption.
- Failover while running. The System Control boards are configured in pairs. One is active, and the other is a hot-spare. In the event of a failure of the system controller CPU or of the clock generation logic, control is switched from the failed board to the other board without system interruption.
1.5.4 Reconfiguration After Failure
- Automatic system recovery. A suitably configured system always reboots after a failure. The system controller locates the fault; reconfigures the system excluding the failed CPU, memory, or interconnect component; and reboots the operating system.
- Interconnect reconfiguration after failure. After a system interconnect failure occurs, the system restarts with the bad interconnect components isolated and with half the system bandwidth still available. The three crossbars can be separately reconfigured between full and degraded mode on a domain-by-domain basis.
1.5.5 Serviceability
- System controller. The System Control board is the heart of the RAS technology. The SC CPU board is an off-the-shelf SPARCengine CP1500 6U cPCI board with an UltraSPARC IIi embedded system. This board runs Solaris Software and System Management Software. The system controller has access by means of JTAG (joint test action group) to registers in each significant chip in the machine, and continuously monitors the state of the machine. If a problem is detected, the system controller attempts to determine what hardware has malfunctioned and then takes steps to prevent that hardware from being accessed until it has been replaced.
- Console bus. The console bus is a secondary bus that enables the system controller to access the inner workings of the machine without having to rely on the integrity of the system address and data buses. This enables the system controller to operate even when there is a fault that prevents the system operation from continuing. It is protected by parity.
- Environmental monitoring. The system controller monitors the cabinet environment for key measures of system stability such as temperature, fan operation, and power supply performance.
- Concurrent serviceability. The fans, the bulk power supplies, and the system boards are all hot-swap components. They can be removed and replaced in a running system.
- Dynamic system domains. Dynamic system domains enable a board to be removed from a running domain for repair or upgrade and then added back.
Sun Fire 15K/12K Systems | 806-3509-13
Copyright © 2006, Sun Microsystems, Inc. All Rights Reserved.