CHAPTER 2

System Features and Capabilities

This chapter provides information on hardware and domain configuration, resource management, and reliability, availability, and serviceability (RAS).


2.1 Hardware Configuration

This section describes the hardware configuration. It covers the CPU module, memory subsystem, I/O subsystem, system bus, and system control.

2.1.1 CPU Module

The M4000 server supports up to two CPU modules, and the M5000 server supports up to four. Each CPU module contains two processors. The processors are high-performance multicore chips with an on-chip secondary cache that minimizes memory latency. These processors also support the instruction retry function, which enables continuous processing by retrying an instruction when an error is detected.

2.1.1.1 CPU Types and Features

This section describes the CPU types and features.


TABLE 2-1 CPU Specifications

CPU Name            SPARC64 VI Processor          SPARC64 VII/VII+ Processor
Number of cores     2 cores                       4 cores
Operational mode    SPARC64 VI compatible mode    SPARC64 VI compatible mode/SPARC64 VII enhanced mode


2.1.1.2 Supported Processors and CPU Operational Modes

The M4000/M5000 servers can support SPARC64 VI processors, SPARC64 VII processors, SPARC64 VII+ processors, or a mix of these different types of processors. A single domain can be configured with a mix of these processors. This section applies only to M4000/M5000 servers that run or will run SPARC64 VII/SPARC64 VII+ processors.



Note - Supported firmware and Oracle Solaris OS will vary based on the processor type. For details, see the latest version of the Product Notes (for XCP version 1100 or later) for your server.


A SPARC Enterprise M4000/M5000 server domain runs in one of the following CPU operational modes:

SPARC64 VI compatible mode: All processors in the domain behave like, and are treated by the OS as, SPARC64 VI processors. The new capabilities of SPARC64 VII or SPARC64 VII+ processors are not available in this mode.

SPARC64 VII enhanced mode: All boards in the domain must contain only SPARC64 VII or SPARC64 VII+ processors. In this mode, the server uses the new features of these processors.

By default, the Oracle Solaris OS automatically sets a domain’s CPU operational mode each time the domain is booted based on the types of processors it contains. It does this when the cpumode variable is set to auto.
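The automatic selection described above can be illustrated with a minimal Python sketch. It is not the server's actual logic: the function name, the enum, and the rule that any SPARC64 VI processor forces compatible mode are assumptions inferred from the mode descriptions in this section.

```python
from enum import Enum


class Processor(Enum):
    """Processor types named in this section."""
    SPARC64_VI = "SPARC64 VI"
    SPARC64_VII = "SPARC64 VII"
    SPARC64_VII_PLUS = "SPARC64 VII+"


COMPATIBLE = "SPARC64 VI compatible mode"
ENHANCED = "SPARC64 VII enhanced mode"


def auto_cpu_mode(processors):
    """Hypothetical helper: pick a mode the way cpumode=auto is described.

    Assumption: a domain containing any SPARC64 VI processor runs in
    compatible mode; a domain with only SPARC64 VII/VII+ processors
    runs in enhanced mode.
    """
    processors = list(processors)
    if not processors:
        raise ValueError("a domain needs at least one processor")
    if any(p is Processor.SPARC64_VI for p in processors):
        return COMPATIBLE
    return ENHANCED


if __name__ == "__main__":
    mixed = [Processor.SPARC64_VI, Processor.SPARC64_VII]
    print(auto_cpu_mode(mixed))
```

A mixed domain therefore falls back to compatible mode in this sketch, which matches the statement that the new processor capabilities are unavailable when the domain is treated as SPARC64 VI.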

For more information on CPU operational modes, refer to the SPARC Enterprise M3000/M4000/M5000/M8000/M9000 Servers XSCF User’s Guide.

2.1.2 Memory Subsystem

Each memory board in the server contains four or eight DIMMs (dual inline memory modules). Both midrange servers use Double Data Rate II (DDR II) type DIMMs. The memory subsystem supports up to eight-way memory interleaving for high-speed memory access. For more information on memory boards and DIMMs, see Memory Board.

2.1.3 I/O Subsystem

Each I/O subsystem contains the following:

The PCI slots support the hot-plug function, which enables you to replace the IOU while the domain is operating. Before you can remove a PCI card, you must first unconfigure and disconnect it.

You can also add an optional External I/O Expansion Unit, which contains additional PCIe slots or PCI-X slots.

2.1.4 System Bus

The CPU, memory subsystem, and I/O subsystem are directly connected through a high-speed broadband switch for data transfer. Individual components are connected through tightly coupled switches, which provide uniform latency for data transfer. Components can be added to the server to increase processing capability in proportion to the number of components added.

When a data error is detected in a CPU, Memory Access Controller (MAC), or I/O Controller (IOC), the system bus agent corrects the data and transfers it.

2.1.5 System Control

This section on system control describes XSCFU Hardware, Fault Detection and Management, and System Remote Control/Monitoring.

2.1.5.1 eXtended System Control Facility Unit (XSCFU)

The eXtended System Control Facility Unit (XSCFU), also known as the Service Processor, operates independently of the SPARC64 VI/SPARC64 VII/SPARC64 VII+ domains. The Service Processor directs system startup, reconfiguration, and fault diagnosis. The system management software, the eXtended System Control Facility (XSCF) firmware, runs on the Service Processor.

2.1.5.2 Fault Detection and Management

The XSCF firmware provides fault detection and management capabilities, such as monitoring, detecting, and reporting system errors or faults to the Service Processor. The XSCF firmware monitors the system status continuously to help the system operate in a stable condition.

The XSCF firmware promptly collects a hardware log when any system fault is detected. The firmware does the following:

As necessary, according to the fault conditions, the XSCF firmware degrades parts of domains or resets the system to prevent another fault from occurring. The firmware provides easy-to-understand and accurate information on hardware errors and fault locations. This enables you to take prompt action on faults.

For more information on XSCF fault management, refer to the SPARC Enterprise M3000/M4000/M5000/M8000/M9000 Servers XSCF User’s Guide.

2.1.5.3 System Remote Control/Monitoring

The XSCF firmware provides an IP address filtering function, which controls which addresses are permitted to access the XSCF, and encrypted communication based on SSH and SSL. XSCF logs operator mistakes and unauthorized access attempts made during system operation.

The XSCF firmware also manages user accounts for system or domain administration. The system administrator can grant users appropriate privileges for particular tasks.

The XSCF firmware provides the following remote notification services:


2.2 Partitioning

The M4000 and M5000 servers can be divided into multiple independent systems for operation. This dividing function is called partitioning. This section describes features of partitioning and system configurations that can be implemented through partitioning.

The individual systems that result from the partitioning of the server are called domains. Domains are sometimes called partitions. Partitioning enables arbitrary assignment of resources in the server. Partitioning also enables flexible domain configurations to be used according to the job load or processing amount.

Each domain runs its own independent instance of the operating system. Each domain is protected by hardware so that it is not affected by other domains. For example, a software-based problem in one domain, such as an OS panic, does not directly affect jobs in the other domains. Furthermore, the operating system in each domain can be reset and shut down independently.

2.2.1 Physical Unit for Domain Constitution

The basic hardware resource making up a domain in the server is called the physical system board (PSB). The physical unit configuration of each divided part of a PSB is called an extended system board (XSB). A PSB in this server can be logically divided into one part (no division) or four parts. A PSB that is logically divided into one part (no division) is called a Uni-XSB, and a PSB that is logically divided into four parts is called a Quad-XSB. A domain can be configured with any combination of these XSBs. The XSCF is used to configure a domain and specify the PSB division type.
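The relationship between a PSB and its XSBs can be sketched in Python as follows. This is an illustration only: the function name and the "PSB number, hyphen, XSB index" label format are assumptions made for the example, not something this section defines.

```python
# Division types described in this section and the number of XSBs each yields.
XSB_COUNT = {"uni": 1, "quad": 4}


def xsbs_for_psb(psb_number, division):
    """Hypothetical helper: list the XSBs produced by dividing one PSB.

    division is "uni" (no division, one XSB) or "quad" (four XSBs).
    The "NN-M" label format is an assumption used for illustration.
    """
    if division not in XSB_COUNT:
        raise ValueError(f"unknown division type: {division!r}")
    return [f"{psb_number:02d}-{i}" for i in range(XSB_COUNT[division])]


if __name__ == "__main__":
    # A domain can combine XSBs from differently divided PSBs.
    domain = xsbs_for_psb(0, "uni") + xsbs_for_psb(1, "quad")[:2]
    print(domain)
```

The usage lines show the point made in the text: a domain may be assembled from any combination of Uni-XSB and Quad-XSB boards.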

2.2.2 Domain Configuration

A domain is an independent computing resource that runs an individual instance of the Oracle Solaris OS. Each domain is separated from other domains, and is not affected by operations in other domains. Domains enable one server to perform different types of processing.

The operations within a domain are controlled with Oracle Solaris administration tools. However, to create, configure, and monitor domains, you must use the XSCF, as described in the SPARC Enterprise M3000/M4000/M5000/M8000/M9000 Servers Administration Guide and the SPARC Enterprise M3000/M4000/M5000/M8000/M9000 Servers XSCF User’s Guide. For more background on domains, see Domains.


2.3 Resource Management

Both midrange servers provide four means of managing the server’s resources: dynamic reconfiguration, PCI hot-plug, Capacity on Demand (COD), and zones.

2.3.1 Dynamic Reconfiguration

Dynamic reconfiguration (DR) enables hardware resources on system boards to be added and removed dynamically without stopping system operation. DR thus enables optimal relocation of system resources. With DR, you can add or redistribute resources as required for job expansion or new jobs. It can be used for the following purposes.

2.3.2 PCI Hot-Plug

You can insert and remove PCI cards for certain PCIe and PCI-X hot-plug controllers while the server is running. Before you can remove a PCI card, you must first unconfigure and disconnect it using the Oracle Solaris cfgadm(1M) command. For more information, refer to the SPARC Enterprise M4000/M5000 Servers Service Manual.
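The order of operations before removing a card can be sketched as a small Python helper that only builds the command lines and does not run them. The helper name and the attachment point ID in the usage lines are placeholders; check cfgadm(1M) and the Service Manual for the exact procedure on your server.

```python
def pci_removal_commands(ap_id):
    """Hypothetical helper: build the cfgadm steps that precede card removal.

    ap_id is the attachment point ID of the PCI slot. The commands are
    returned as argument lists, in the order the text gives:
    unconfigure first, then disconnect. Nothing is executed here.
    """
    if not ap_id:
        raise ValueError("an attachment point ID is required")
    return [
        ["cfgadm", "-c", "unconfigure", ap_id],
        ["cfgadm", "-c", "disconnect", ap_id],
    ]


if __name__ == "__main__":
    # "example-slot" is a placeholder, not a real attachment point ID.
    for command in pci_removal_commands("example-slot"):
        print(" ".join(command))
```

Returning argument lists rather than running the commands keeps the sketch safe to try on any machine; an administrator would review the output before issuing the commands in the domain.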

2.3.3 Capacity on Demand (COD)

The COD feature allows you to configure spare processing resources on your M4000/M5000 server in the form of one or more COD CPUs, which can be activated later when additional processing power is needed. To access each COD CPU, you must purchase a COD hardware activation permit. Under certain conditions, you can use COD resources before purchasing COD permits for them.

For more information, refer to the SPARC Enterprise M4000/M5000/M8000/M9000 Servers Capacity on Demand (COD) User’s Guide.

2.3.4 Zones

Oracle Solaris OS has a function called zones, which divides the processing resources and allocates them to applications. Zones provide flexible resource allocation, which enables optimal resource management with consideration given to the processing load.

In a domain, resources can be divided into sections called containers. The processing sections are allocated to each application. The processing resources are managed independently in each container. If a problem occurs in a container, the container can be isolated so it does not affect other containers.


2.4 Reliability, Availability, and Serviceability

Reliability, availability, and serviceability (RAS) are aspects of the system design that affect the ability of the system to:

TABLE 2-2 defines each RAS feature.


TABLE 2-2 RAS Definitions

RAS Feature       Description
Reliability       Length of time the midrange server can operate normally without failure. The ability to detect failures with accuracy.
Availability      Ratio of time during which the system is accessible and usable.
Serviceability    Time required for the system to be recovered by specific maintenance after a failure occurs.


2.4.1 Reliability

Reliability represents the length of time the midrange server can operate normally without failure.

To improve quality, suitable components must be selected with consideration given to the product service life and the required response in case of a failure. In evaluations such as stress tests that check the service life, components and products are inspected to determine whether they meet the target reliability levels.

Reliability is equally important for hardware and software. Naturally, trouble-free software is desired, but eliminating all software problems is difficult.

Installing the functions below leads to reliability improvements in the field.

Memory patrol prevents faulty areas from being used and thereby prevents the occurrence of system failures.

2.4.2 Availability

Availability represents the ratio of time the midrange server is accessible and usable. An operating ratio is used as an index.
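As a worked illustration of the operating ratio, the following Python sketch computes availability as the share of total time during which the system was usable. The function name and the sample figures are assumptions chosen for the example, not measurements from these servers.

```python
def availability(uptime_hours, downtime_hours):
    """Hypothetical helper: operating ratio = uptime / (uptime + downtime)."""
    if uptime_hours < 0 or downtime_hours < 0:
        raise ValueError("durations must be non-negative")
    total = uptime_hours + downtime_hours
    if total == 0:
        raise ValueError("total observed time must be greater than zero")
    return uptime_hours / total


if __name__ == "__main__":
    # Illustrative figures only: 990 hours usable, 10 hours unavailable.
    ratio = availability(990, 10)
    print(f"{ratio:.2%}")
```

Reducing either the number of failures or the time needed to recover from each one raises this ratio, which is why availability depends on both reliability and serviceability.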

Faults cannot be completely eliminated. To provide high availability, the system must incorporate mechanisms that enable continuous operation even if a failure occurs in hardware (such as components and devices), in basic software (such as the operating system), or in business application software.

The midrange servers can provide high availability by implementing the items listed below. Also, a cluster configuration can provide higher availability.

Since the memory patrol facility is implemented in hardware, it is not affected by the software processing workload.

2.4.3 Serviceability

Serviceability represents the ease of recovery from a system failure. To facilitate recovery, the system administrator, the field engineer, or both must do the following after a failure is detected:

The midrange server can provide high serviceability with the following features: