CHAPTER 2

Dynamic System Domains

The Sun Fire 15K/12K systems contain dynamic domains. These domains are described in the following sections.

The Sun Fire 15K system can be dynamically subdivided into as many as 18 dynamic system domains. The Sun Fire 12K system can be subdivided into as many as 9 dynamic system domains. Each domain has a separate boot disk (to execute a specific instance of the Solaris operating environment) as well as separate disk storage, network interfaces, and I/O interfaces. CPU boards and I/O boards can be separately added and removed from running domains.

Domains are used for server consolidation to run separate parts of a solution, such as an application server, a web server, and a database server. The domains are hardware-protected from hardware or software faults in other domains.


2.1 Domain Configurability

Each of the system boards (slot 0 and slot 1 boards) can be independently added to, or removed from, a running domain. This enables CPU and memory resources to be moved from one domain to another without disturbing the disk storage and network connections. Each domain must have an I/O board; therefore, the Sun Fire 15K system supports a maximum of 18 domains, and the Sun Fire 12K system supports a maximum of 9 domains.

When the two system boards in a board set are in separate domains, the board set is termed a split expander. The expander board keeps the transactions separate for each system board. FIGURE 2-1 shows an example of a configuration with some of the board sets split between two domains. No physical proximity is needed for boards in a domain.
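The split-expander definition and the I/O board requirement can be illustrated with a minimal sketch. The `BoardSet` class, the domain names, and the simplification that every slot 1 board is an I/O board are assumptions made for this example; they are not part of any Sun Fire software interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BoardSet:
    """Hypothetical model of one expander and its two system boards."""
    expander: int
    slot0_domain: Optional[str]  # domain that owns the slot 0 board, or None
    slot1_domain: Optional[str]  # domain that owns the slot 1 board, or None

    def is_split(self) -> bool:
        # Split expander: both boards are assigned, but to different domains.
        return (self.slot0_domain is not None
                and self.slot1_domain is not None
                and self.slot0_domain != self.slot1_domain)


def split_expanders(board_sets):
    """Return the expander numbers whose two boards are in different domains."""
    return [bs.expander for bs in board_sets if bs.is_split()]


def domains_without_io(board_sets):
    """Return domains that own no slot 1 board (assumed here to be the I/O board)."""
    all_domains = {d for bs in board_sets
                   for d in (bs.slot0_domain, bs.slot1_domain) if d is not None}
    with_io = {bs.slot1_domain for bs in board_sets if bs.slot1_domain is not None}
    return sorted(all_domains - with_io)


config = [
    BoardSet(0, "A", "A"),   # unsplit, both boards in domain A
    BoardSet(1, "A", "B"),   # split between domains A and B
    BoardSet(2, "B", "B"),   # unsplit, both boards in domain B
    BoardSet(3, "C", None),  # domain C has no I/O board, so it is not valid
]

print(split_expanders(config))
print(domains_without_io(config))
```

In this sketch, only expander 1 is reported as split, and domain C is flagged because it owns no I/O board.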

Because split-expander hardware is shared between two domains, a failure of the board set brings down both domains. For example, if a fully configured system is divided into two nine-board set domains, the impact of all split, versus all unsplit, expanders is on the order of 5% higher MTBF (mean time between failures). Also, memory accesses that go through a split expander take two system clocks (13 ns) longer. If all expanders were split, the load-use latency for accesses to other board sets would increase by about 6%.
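The latency figures can be cross-checked with simple arithmetic. The sketch below uses only the three numbers given in the text (two clocks, 13 ns, about 6%). The clock period and the baseline latency it derives are inferences from rounded values, not specifications, and the variable names are chosen for this example only.

```python
# Values taken from the text (all rounded in the original).
EXTRA_CLOCKS = 2        # additional system clocks through a split expander
EXTRA_NS = 13.0         # the same delay expressed in nanoseconds
RELATIVE_INCREASE = 0.06  # "about 6%" higher load-use latency

# Derived values: approximations only, because the inputs are rounded.
clock_period_ns = EXTRA_NS / EXTRA_CLOCKS           # implied system clock period
baseline_latency_ns = EXTRA_NS / RELATIVE_INCREASE  # implied unsplit latency
split_latency_ns = baseline_latency_ns + EXTRA_NS   # implied split latency

print(f"implied clock period:     {clock_period_ns:.1f} ns")
print(f"implied baseline latency: {baseline_latency_ns:.0f} ns")
print(f"implied split latency:    {split_latency_ns:.0f} ns")
```

Under these assumptions, a 13 ns penalty that amounts to about 6% implies a latency of a little over 200 ns for accesses to other board sets when no expanders are split.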


FIGURE 2-1 Example of Domain Configuration With Some Split Board Sets

Diagram showing a system with CPU/Memory boards and I/O boards connected to an expander board for a split expander configuration for two domains.



2.2 Domain Protection

Primary domain protection is accomplished in the address extender queue (AXQ) ASICs by checking each transaction for domain validity when a transaction is first detected. In the Sun Fire 15K system, the system data interface (SDI) chips can also screen data transfer requests for valid destinations (to as many as 36 system boards). In addition, each Sun Fireplane interconnect arbiter (data, address, response) screens requests to as many as 18 expanders. In the Sun Fire 12K system, the SDI chips can screen data transfer requests for valid destinations (to as many as 18 system boards), and each Sun Fireplane interconnect arbiter (data, address, response) screens requests to as many as 9 expanders. In both systems, the arbiter screening is a double check on the other domain protection mechanisms, which are in the AXQ and the SDI chips.

If a transgression error is detected in the AXQ, the AXQ treats the erroneous operation like a request to nonexistent memory. It reissues the request without asserting a mapped coherency protocol signal, which causes the Solaris operating environment to switch execution from one process to another. A transgression error in the Sun Fireplane interconnect causes a domainstop of the transgressing domains because this error must indicate a failure of the primary protection mechanism.
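The two levels of protection can be pictured with a small behavioral model. This is a sketch of the decision logic only, not of the ASIC implementation. The `screen_transaction` function, the `axq_check_works` flag (used here to simulate a failed primary check), and the board names are all hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    DELIVERED = "delivered"
    NONEXISTENT_MEMORY = "handled like a request to nonexistent memory"
    DOMAINSTOP = "domainstop"


def screen_transaction(source, target, domain_of, axq_check_works=True):
    """Model the primary (AXQ) and secondary (interconnect) domain checks.

    domain_of maps a board name to the domain that owns it.
    axq_check_works=False simulates a failure of the primary check.
    """
    if domain_of[source] == domain_of[target]:
        return Outcome.DELIVERED
    # Transgression: the target board belongs to another domain.
    if axq_check_works:
        # Primary protection: the AXQ rejects the request at its source.
        return Outcome.NONEXISTENT_MEMORY
    # The request escaped the AXQ, so the interconnect check catches it.
    return Outcome.DOMAINSTOP


domain_of = {"SB0": "A", "IO0": "A", "SB1": "B"}

print(screen_transaction("SB0", "IO0", domain_of))
print(screen_transaction("SB0", "SB1", domain_of))
print(screen_transaction("SB0", "SB1", domain_of, axq_check_works=False))
```

The third call shows why an interconnect transgression is treated as fatal: it can only occur if the primary check has already failed.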


2.3 Domain Fault Isolation

Domains are protected against software or hardware faults in other domains. If there is a fault in the processor or memory hardware that is assigned to a particular domain, only that one domain will be affected. If there is a fault in hardware that is shared between multiple domains, only those domains that share the hardware are affected.

As an example of hardware shared between two domains, consider a system that is configured to have a CPU/Memory board in one domain and its associated I/O board in another domain. The logic on the split expander board is shared between those two domains. A fault in a split expander or its control wiring to the Sun Fireplane interconnect causes a failure only in those two domains. A fault in globally shared hardware, such as the system clock generator or Sun Fireplane interconnect chips, causes a failure in all domains.
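The fault isolation rules can be summarized as a lookup from the failed component to the set of affected domains. The following is a minimal sketch; the tuple encoding of faults and the `affected_domains` function are assumptions made for this example.

```python
def affected_domains(fault, board_sets):
    """Return the set of domains brought down by a fault.

    board_sets maps an expander number to (slot0_domain, slot1_domain).
    fault is one of (hypothetical encoding):
      ("board", expander, slot)  fault on one system board
      ("expander", expander)     fault on an expander or its control wiring
      ("global",)                fault in globally shared hardware
    """
    kind = fault[0]
    if kind == "global":
        # Clock generator or interconnect chips: every domain fails.
        return {d for pair in board_sets.values() for d in pair if d is not None}
    expander = fault[1]
    if kind == "expander":
        # Only the domains that share this expander fail.
        return {d for d in board_sets[expander] if d is not None}
    if kind == "board":
        # Only the domain that owns the faulty board fails.
        owner = board_sets[expander][fault[2]]
        return {owner} if owner is not None else set()
    raise ValueError(f"unknown fault kind: {kind!r}")


board_sets = {
    0: ("A", "A"),  # unsplit expander in domain A
    1: ("A", "B"),  # split expander shared by domains A and B
    2: ("C", "C"),  # unsplit expander in domain C
}

print(sorted(affected_domains(("board", 1, 0), board_sets)))
print(sorted(affected_domains(("expander", 1), board_sets)))
print(sorted(affected_domains(("global",), board_sets)))
```

A fault on the split expander affects two domains, while the same fault on an unsplit expander would affect only one, which is the reason to keep the number of split expanders small.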

Fatal errors, such as a parity error in control wiring or a faulty ASIC, cause a domainstop. The steering signals from the expander boards to the arbiter chips of the Sun Fireplane interconnect are parity protected. If there is a parity error, the multiple copies of the Sun Fireplane interconnect arbiter could get out of sync. Therefore, this type of parity error causes an immediate domainstop of the domain.

Nonfatal errors, such as correctable single-bit errors in packets sent through the Sun Fireplane interconnect, cause a recordstop. A recordstop freezes the history buffers in the ASICs, enabling failure information to be scanned out through JTAG while the domain continues to run.

For a split-expander transaction (a transaction through an expander whose slot 0 and slot 1 boards are in different domains), it is necessary to keep the arbiters in sync so that an error cannot propagate to multiple domains. In this type of transaction, two extra cycles of latency are introduced so that a steering parity error can be detected by all arbiters before one arbiter processes its own correct version of the steering. Configure your system with a minimum of split expanders to improve system performance.
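The reason for waiting until every arbiter has checked the steering can be shown with a short sketch. The use of even parity, the width of the steering word, and the number of arbiter copies are assumptions for illustration; the actual encoding is not described here.

```python
def parity_ok(word: int, parity_bit: int) -> bool:
    """Check even parity (assumed): data bits plus parity bit have an even count of 1s."""
    return (bin(word).count("1") + parity_bit) % 2 == 0


def arbitrate(copies):
    """Decide what the arbiters do with one steering request.

    copies is a list of (steering_word, parity_bit) pairs, one pair per
    arbiter copy, as each copy received it. No copy acts until every copy
    has checked parity, so a single corrupted copy stops the transaction
    instead of letting the arbiters diverge.
    """
    if all(parity_ok(word, bit) for word, bit in copies):
        return "proceed"
    return "domainstop"


# All three copies received the same valid steering word.
good = [(0b1011, 1), (0b1011, 1), (0b1011, 1)]
# The second copy received a word with one bit flipped in transit.
bad = [(0b1011, 1), (0b1010, 1), (0b1011, 1)]

print(arbitrate(good))
print(arbitrate(bad))
```

If the copies with valid steering were allowed to proceed before the corrupted copy reported its error, the arbiters would be out of sync, which is the condition the two extra cycles are meant to prevent.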

The steering signals within the Sun Fireplane interconnect, from the data arbiter ASICs to the data MUX ASICs, are parity protected. It is not possible for the data MUX chips to cross-check for errors before acting on the steering. Therefore, a parity error on these localized wires could cause a domainstop in any or all domains.