CHAPTER 1

Introduction

This chapter introduces the features for the Sun Fire™ family of midrange servers: the E6900/E4900/6800/4810/4800/3800 systems. For detailed descriptions of these systems, refer to the Sun Fire E6900/E4900 Systems Overview Manual and the Sun Fire 6800/4810/4800/3800 Systems Overview Manual.

This chapter describes:

The term platform, as used in this book, refers to the collection of resources such as power supplies, the centerplane, and fans that are not for the exclusive use of a domain.

A segment, also referred to as a partition, is a group of Sun FirePlane switches (Repeater boards) that are used together to provide communication between CPU/Memory boards and I/O assemblies in the same domain.

A domain runs its own instance of the Solaris operating environment and is independent of other domains. Each domain has its own CPUs, memory, and I/O assemblies. Hardware resources including fans and power supplies are shared among domains, as necessary for proper operation.

The system controller (SC) is an embedded system that configures and monitors the platform. You access the system controller using either serial or Ethernet connections. It is the focal point for platform and domain configuration and management, and is used to connect to the domain consoles.

The system controller offers a command-line interface that enables you to perform tasks needed to configure the platform and each domain. The system controller provides monitoring and configuration capabilities through the Simple Network Management Protocol (SNMP), used by the Sun Management Center software. For more information on the system controller hardware and firmware, see System Controller and System Controller Firmware.


Domains

With this family of midrange systems, you can group system boards (CPU/Memory boards and I/O assemblies) into domains. Each domain can host its own instance of the Solaris operating environment and is independent of other domains.

Domains include the following features:

All systems are configured at the factory with one domain.

You create domains by using either the system controller command-line interface or the Sun™ Management Center software. How to create domains using the system controller software is described in Creating and Starting Domains. For instructions on how to create domains using the Sun Management Center, refer to the Sun Management Center 3.5 Version 3 Supplement for Sun Fire Midrange Systems.

The largest domain configuration comprises all CPU/Memory boards and I/O assemblies in the system. The smallest domain configuration consists of one CPU/Memory board and one I/O assembly.

An active domain must meet these requirements:

In addition, sufficient power and cooling are required. The power supplies and fan trays are not assigned to a domain.

If you run more than one domain in a partition, then the domains are not completely isolated. A failed Repeater board could affect all domains within the partition. For more information, see Repeater Boards.



Note - If a Repeater board failure affects a domain running host-licensed software, it is possible to continue running that software by swapping the HostID/MAC address of the affected domain with that of an available domain. For details, see Swapping Domain HostID/MAC Addresses.




System Components

The system boards in each system consist of CPU/Memory boards and I/O assemblies. The Sun Fire midrange systems have Repeater boards (TABLE 1-1) that provide communication between CPU/Memory boards and I/O assemblies.


TABLE 1-1 Repeater Boards in Sun Fire Midrange Systems

System                            Boards Required per Partition   Total Number of Boards per System
-------------------------------   -----------------------------   ---------------------------------
Sun Fire E6900 and 6800 systems   2                               4: RP0, RP1, RP2, RP3
Sun Fire E4900 and 4800 systems   1                               2: RP0, RP2
Sun Fire 4810 system              1                               2: RP0, RP2
Sun Fire 3800 system              N/A                             Equivalent of two Repeater boards (RP0 and RP2) are built into an active centerplane.


For a system overview, including descriptions of the boards in the system, refer to the Sun Fire 6800/4810/4800/3800 Systems Overview Manual and the Sun Fire E6900/E4900 Systems Overview Manual.


Segments

A segment, also referred to as a partition, is a group of Repeater boards that are used together to provide communication between CPU/Memory boards and I/O assemblies. Depending on the system configuration, each partition can be used by either one or two domains.

Sun Fire midrange systems can be configured to have one or two partitions. When a system is divided into two partitions, the system controller firmware logically isolates connections of one partition from the other. Partitioning is done at the Repeater board level. Single-partition mode forms one large partition using all of the Repeater boards. Dual-partition mode creates two smaller partitions, each using one-half of the total number of Repeater boards in the system. For more information on Repeater boards, see Repeater Boards.

Use the setupplatform command to set up partition mode. For system controller command syntax and descriptions, refer to the Sun Fire Midrange System Controller Command Reference Manual.

Isolating errors to one partition is one of the main reasons to configure your system into dual-partition mode. With two partitions, if there is a failure in one domain in a partition, the failure will not affect the other domains running in the other partition. The exception to this is if there is a centerplane failure. If you set up two domains, it is strongly suggested that you configure dual-partition mode with the setupplatform command. Each partition should contain one domain.

Be aware that if you configure your system into two partitions, half of the theoretical maximum data bandwidth is available to the domains. However, the snooping address bandwidth is preserved.

The interconnect bus implements cache coherency through a technique called snooping. With this approach each cache monitors the address of all transactions on the system interconnect, watching for transactions that update addresses it possesses. Since all CPUs need to see the broadcast addresses on the system interconnect, the address and command signals arrive simultaneously. The address and command lines are connected in a point-to-point fashion.
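As a rough illustration of the snooping idea, the following Python sketch models caches that watch every write on a shared interconnect and drop their own copy of an address when another cache updates it. The class names, the write-through behavior, and the invalidate-on-write policy are assumptions made for this example; this chapter does not describe the actual interconnect protocol at this level of detail.

```python
class SnoopingCache:
    """Toy cache that snoops every write seen on the interconnect."""

    def __init__(self, name):
        self.name = name
        self.lines = {}  # address -> cached value

    def snoop(self, address, writer):
        # Another cache updated an address we hold: invalidate our copy.
        if writer is not self and address in self.lines:
            del self.lines[address]


class Interconnect:
    """Toy interconnect that broadcasts each write address to all caches."""

    def __init__(self):
        self.caches = []
        self.memory = {}

    def attach(self, cache):
        self.caches.append(cache)

    def write(self, writer, address, value):
        # Write-through (an assumption): memory and the writer's cache are updated.
        self.memory[address] = value
        writer.lines[address] = value
        for cache in self.caches:
            cache.snoop(address, writer)

    def read(self, reader, address):
        # On a miss, fill the reader's cache from memory.
        if address not in reader.lines:
            reader.lines[address] = self.memory.get(address)
        return reader.lines[address]
```

In this model, a cache that held a stale value misses on its next read and fetches the updated value from memory.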

TABLE 1-2 lists the maximum number of partitions and domains each system can have.


TABLE 1-2 Maximum Number of Partitions and Domains Per System

                                                    Sun Fire E6900 and 6800 Systems   Sun Fire E4900/4810/4800/3800 Systems
-------------------------------------------------   -------------------------------   -------------------------------------
Number of Partitions [1]                            1 or 2                            1 or 2
Number of Active Domains in Dual-Partition Mode     Up to 4 (A, B, C, D)              Up to 2 (A, C)
Number of Active Domains in Single-Partition Mode   Up to 2 (A, B)                    Up to 2 (A, B)

[1] The default is one partition.
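The limits in TABLE 1-2 can be expressed as a small lookup. The following Python sketch is only a planning aid; the dictionary and function names are hypothetical and are not part of the system controller software.

```python
# Domain IDs available per partition count, transcribed from TABLE 1-2.
DOMAINS_BY_PARTITIONS = {
    "E6900/6800": {1: ("A", "B"), 2: ("A", "B", "C", "D")},
    "E4900/4810/4800/3800": {1: ("A", "B"), 2: ("A", "C")},
}

# Map each system model to its row group in TABLE 1-2.
SYSTEM_FAMILY = {
    "E6900": "E6900/6800",
    "6800": "E6900/6800",
    "E4900": "E4900/4810/4800/3800",
    "4810": "E4900/4810/4800/3800",
    "4800": "E4900/4810/4800/3800",
    "3800": "E4900/4810/4800/3800",
}


def available_domains(system, partitions=1):
    """Return the domain IDs that can be active (default: one partition)."""
    if partitions not in (1, 2):
        raise ValueError("a system has one or two partitions")
    return DOMAINS_BY_PARTITIONS[SYSTEM_FAMILY[system]][partitions]
```

For example, available_domains("E6900", 2) returns the four domain IDs A through D, while the same call for a Sun Fire 3800 system returns only A and C.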


FIGURE 1-1 through FIGURE 1-6 show partitions and domains for Sun Fire midrange systems. In the Sun Fire 3800 system, the equivalent of two Repeater boards (RP0 and RP2) is integrated into the active centerplane.

All of these systems are very flexible, and you can assign CPU/Memory boards and I/O assemblies to any domain or partition. The configurations shown in the following illustrations are examples only and your configuration may differ.

TABLE 1-3 describes the board names used in FIGURE 1-1 through FIGURE 1-6.


TABLE 1-3 Board Name Descriptions

Board Name   Description
----------   -----------------
SB0 - SB5    CPU/Memory boards
IB6 - IB9    I/O assemblies
RP0 - RP3    Repeater boards


FIGURE 1-1 shows the single-partition mode for Sun Fire E6900 and 6800 systems. These systems have four Repeater boards that operate in pairs (RP0, RP1) and (RP2, RP3), six CPU/Memory boards (SB0-SB5), and four I/O assemblies (IB6-IB9).


FIGURE 1-1 Sun Fire E6900 and 6800 Systems in Single-Partition Mode

Diagram of single-partition mode in Sun Fire E6900 and 6800 systems that have four Repeater boards, six CPU/Memory boards, and four I/O assemblies.


FIGURE 1-2 shows dual-partition mode for Sun Fire E6900 and 6800 systems. The same boards and assemblies are shown as in FIGURE 1-1.


FIGURE 1-2 Sun Fire E6900 and 6800 Systems in Dual-Partition Mode

Diagram of a dual partition in Sun Fire E6900 and 6800 systems that have four Repeater boards, six CPU/Memory boards, and four I/O assemblies.


FIGURE 1-3 shows single-partition mode on Sun Fire E4900/4810/4800 systems. These systems have two Repeater boards (RP0 and RP2) that operate separately (not in pairs as in the Sun Fire E6900 and 6800 systems), three CPU/Memory boards (SB0, SB2, and SB4), and two I/O assemblies (IB6 and IB8).


FIGURE 1-3 Sun Fire E4900/4810/4800 Systems in Single-Partition Mode

Diagram of a single partition in a Sun Fire E4900/4810/4800 system that has two Repeater boards, three CPU/Memory boards, and two I/O assemblies.


FIGURE 1-4 shows Sun Fire E4900/4810/4800 systems in dual-partition mode. The same boards and assemblies are shown as in FIGURE 1-3.


FIGURE 1-4 Sun Fire E4900/4810/4800 Systems in Dual-Partition Mode

Diagram of a dual partition in a Sun Fire E4900/4810/4800 system that has two Repeater Boards, three CPU/Memory boards, and two I/O assemblies.


FIGURE 1-5 shows the Sun Fire 3800 system in single-partition mode. This system has the equivalent of two Repeater boards (RP0 and RP2) integrated into the active centerplane, two CPU/Memory boards (SB0 and SB2), and two I/O assemblies (IB6 and IB8).


FIGURE 1-5 Sun Fire 3800 System in Single-Partition Mode

Diagram of a single partition in a Sun Fire 3800 system that has two Repeater boards, two CPU/Memory boards, and two I/O assemblies.


FIGURE 1-6 shows the Sun Fire 3800 system in dual-partition mode. The same boards and assemblies are shown as in FIGURE 1-5. This system also has the equivalent of two Repeater boards, RP0 and RP2, integrated into the active centerplane.


FIGURE 1-6 Sun Fire 3800 System in Dual-Partition Mode

Diagram of a dual partition in a Sun Fire 3800 system that has two Repeater boards, two CPU/Memory boards, and two I/O assemblies.



System Controller

The system controller is the focal point for platform and domain configuration and management and is used to connect to the domain consoles.

System controller functions include:

The system can support up to two System Controller boards (TABLE 1-4) that function as a main and spare system controller (SC). This redundant configuration of system controllers supports the SC failover mechanism, which triggers the automatic switchover of the main SC to the spare, if the main SC fails. For details on SC failover, see Chapter 8.


TABLE 1-4 Functions of System Controller Boards

System Controller   Function
-----------------   --------
Main                Manages all system resources. Configure your system to connect to the main System Controller board.
Spare               If the main SC fails and a failover occurs, the spare SC assumes all system controller tasks formerly handled by the main SC. The spare SC functions as a hot standby (a running SC that can take over as the main SC if the main SC fails), and is used only as a backup for the main SC.


Starting with the 5.16.0 release, the firmware supports an enhanced memory SC (referred to as system controller V2 or SC V2). In a redundant SC configuration, both the main and spare SC must be of the same type. Mixed SC configurations are not supported.

Serial and Ethernet Ports

There are three methods to connect to the system controller console:

For security and performance reasons, it is suggested that the system controllers be configured on a private network. For details, refer to the Sun BluePrints™ online article, Sun Fire Midframe Server Best Practices for Administration, at

http://www.sun.com/blueprints

TABLE 1-5 describes the features of the serial port and the Ethernet port on the System Controller board. The Ethernet port provides the fastest connection.


TABLE 1-5 Serial Port and Ethernet Port Features on the System Controller Board

Capability: Number of connections
  Serial Port:   One
  Ethernet Port: Multiple (SSH: five; telnet: twelve)

Capability: Connection speed
  Serial Port:   9.6 Kbps
  Ethernet Port: 10/100 Mbps

Capability: System logs
  Serial Port:   Remain in the system controller message queue
  Ethernet Port: Remain in the system controller message queue and are written to the configured syslog host(s). See TABLE 3-1 for instructions on setting up the platform and domain loghosts. Loghosts capture error messages regarding system failures and can be used to troubleshoot system failures.

Capability: SNMP
  Serial Port:   Not supported
  Ethernet Port: Supported for Sun Management Center only

Capability: Firmware upgrades
  Serial Port:   No
  Ethernet Port: Yes (using the flashupdate command)

Capability: Security
  Serial Port:
    • Secure physical location plus secure terminal server
    • Password protection to the platform and domain shells
  Ethernet Port: Password-protected access only


System Controller Connections

Logical Connection Limits

The system controller supports one logical connection on the serial port and multiple logical connections with a remote connection using SSH (as many as five connections) or telnet (as many as twelve connections) on the Ethernet port. Connections can be set up for either the platform or one of the domains. Each domain can have only one logical connection at a time.
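The connection limits above can be modeled as a simple admission check. The following Python sketch is illustrative only: the class and method names are hypothetical, and treating the serial, SSH, and telnet limits as independent counters is an assumption, since this section does not say how the limits interact.

```python
# Per-transport limits as stated in this section.
LIMITS = {"serial": 1, "ssh": 5, "telnet": 12}


class ConnectionTable:
    """Tracks open logical connections as (transport, target) pairs."""

    def __init__(self):
        self.active = []

    def can_open(self, transport, target):
        if transport not in LIMITS:
            raise ValueError("unknown transport: " + transport)
        in_use = sum(1 for t, _ in self.active if t == transport)
        if in_use >= LIMITS[transport]:
            return False
        # Each domain can have only one logical connection at a time.
        if target != "platform" and any(tgt == target for _, tgt in self.active):
            return False
        return True

    def open(self, transport, target):
        """Record the connection and return True if it is allowed."""
        if not self.can_open(transport, target):
            return False
        self.active.append((transport, target))
        return True
```

Here the target is either the string "platform" or a domain ID such as "A".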

Secure Remote Connections

An alternative to the Telnet protocol, the Secure Shell (SSH) protocol provides secure access to the system controller. SSH uses encryption to protect the data flowing between host and client, using authentication mechanisms to identify both hosts and clients.

The system controller provides SSHv2 server capability. You can use the SSH client software included in the Solaris 9 operating environment, OpenSSH clients with the Solaris 8 operating environment, or SSHv2-compliant clients from other operating environments. For additional information on SSH, see Securing the System Platform.

System Controller Firmware

The sections that follow provide information on the system controller firmware, including:

Platform Administration

The platform administration function manages resources and services that are shared among the domains. With this function, you can determine how resources and services are configured and shared.

Platform administration functions include:

Platform Shell

The platform shell is the operating environment for the platform administrator. Only commands that pertain to platform administration are available. To connect to the platform, see To Select Destinations From the SC Main Menu.

Platform Console

The platform console is the system controller serial port, where the system controller boot messages and platform log messages are printed.



Note - The Solaris operating environment messages are displayed on the domain console.



System Controller Tasks Completed at System Power-On

When you power on the system, the system controller boots the real-time operating system and starts the System Controller Application (ScApp).

If there was an interruption of power, additional tasks completed at system power-on include:

Domain Administration

The domain administration function manages resources and services for a specific domain.

Domain administration functions include:

For platform administration functions, see Platform Administration.

Domain Shell

The domain shell is the operating environment for the domain administrator and is where domain tasks can be performed. There are four domain shells (A-D).

To connect to a domain, see To Navigate Between The Platform Shell And a Domain.

Domain Console

If the domain is active (the Solaris operating environment, the OpenBoot PROM, or the power-on self-test (POST) is running in the domain), you can access the domain console. When you connect to the domain console, you will be in one of the following modes of operation:

If the domain is not active, you will be at the domain console prompt, where the prompt is schostname:domainID>:

Maximum Number of Domains

The domains that are available vary with the system type and configuration. For more information on the maximum number of domains you can have, see Segments.

Domain Keyswitch

Each domain has a virtual keyswitch. You can set five keyswitch positions: off (default), standby, on, diag, and secure.

For information on keyswitch settings, see Setting Keyswitch Positions. For a description and syntax of the setkeyswitch command, refer to the Sun Fire Midrange System Controller Command Reference Manual.

Environmental Monitoring

Sensors throughout the system monitor temperature, voltage, current, and fan speed. The system controller periodically reads the values from each of these sensors. This information is maintained for display using the console commands and is available to Sun Management Center through SNMP.

When a sensor is generating values that are outside of the normal limits, the system controller takes appropriate action. This includes shutting down components in the system to prevent damage. Domains may be automatically paused as a result. If domains are paused, an abrupt hardware pause occurs (it is not a graceful shutdown of the Solaris operating environment).

Log Messages

Console messages generated by the SC for the platform and each domain are displayed on the appropriate consoles. These messages are also logged in a dynamic buffer on the SC, and these logs can be viewed by using the showlogs command. Limited history is maintained and log messages are not permanently stored in this 4 Kbyte dynamic buffer. Note that these log messages are lost when the SC is rebooted or when it loses power.
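To illustrate why only limited history is available, the following Python sketch models a fixed-size, non-persistent message buffer. It assumes that the oldest messages are discarded first when the buffer fills, which this section does not state; the class and method names are hypothetical.

```python
from collections import deque


class BoundedLog:
    """Toy model of a small in-memory log buffer (4 Kbytes by default)."""

    def __init__(self, capacity_bytes=4096):
        self.capacity = capacity_bytes
        self.messages = deque()
        self.size = 0

    def append(self, message):
        self.messages.append(message)
        self.size += len(message.encode("utf-8"))
        # Assumption: drop the oldest messages once the buffer is over capacity.
        while self.size > self.capacity and self.messages:
            oldest = self.messages.popleft()
            self.size -= len(oldest.encode("utf-8"))

    def reboot(self):
        # Contents are lost when the SC is rebooted or loses power.
        self.messages.clear()
        self.size = 0
```

Under this model, a busy platform overwrites older messages quickly, which is one reason to forward messages to a syslog host.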

However, if your midrange system has SC V2s (enhanced-memory SCs), approximately 112 Kbytes of certain message logs and system messages are retained in persistent storage, even after the SC is rebooted or the SC loses power. (For details on system error messages, see System Error Buffer.)

The persistent logs can be viewed by using the showlogs -p command. For details on the showlogs command and the options available to display specific types of persistent log messages, refer to the Sun Fire Midrange System Controller Command Reference Manual.

Even if your system has SC V2s, it is strongly suggested that you set up a syslog host so that the platform and domain console messages are sent to the syslog host, to enhance accountability and long-term storage of log information. Note that the messages retained are not the Solaris operating environment console messages.


Setting Up for Redundancy

To minimize single points of failure, configure system resources using redundant components. Redundancy allows domains to remain functional when a component fails, which enhances system availability.

For troubleshooting steps to follow if a board or component fails, see Board and Component Failures.

This section covers these topics:

CPU/Memory Boards

All systems support multiple CPU/Memory boards. Each domain must contain at least one CPU/Memory board.

The maximum number of CPUs you can have on a CPU/Memory board is four. CPU/Memory boards are configured with either two CPUs or four CPUs. TABLE 1-6 lists the maximum number of CPU/Memory boards for each system.


TABLE 1-6 Maximum Number of CPU/Memory Boards in Sun Fire Midrange Systems

System                            Maximum Number of CPU/Memory Boards   Maximum Number of CPUs
-------------------------------   -----------------------------------   ----------------------
Sun Fire E6900 and 6800 systems   6                                     24
Sun Fire 4810 system              3                                     12
Sun Fire E4900 and 4800 systems   3                                     12
Sun Fire 3800 system              2                                     8


Each CPU/Memory board has eight physical banks of memory. Each CPU provides memory management unit (MMU) support for two banks of memory. Each bank of memory has four slots. Dual inline memory modules (DIMMs) must populate a bank in groups of four. The minimum amount of memory needed to operate a domain is one bank (four DIMMs).

A CPU can be used with no memory installed in any of its banks. A memory bank cannot be used unless the corresponding CPU is installed and functioning.

A failed CPU or faulty memory will be isolated from the domain by the CPU power-on self-test (POST). If a CPU is disabled by POST, the corresponding memory banks for the CPU will also be disabled.

You can operate a domain with as little as one CPU and one memory bank (four memory modules).
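The CPU and memory figures in this section follow from a few multiplications. The following Python sketch reproduces them; the constant and function names are hypothetical, and the DIMM slot count per board is derived from the stated bank and slot counts rather than quoted from the source.

```python
CPUS_PER_BOARD_MAX = 4   # maximum CPUs on a CPU/Memory board
BANKS_PER_CPU = 2        # banks supported by each CPU
DIMMS_PER_BANK = 4       # DIMMs populate a bank in groups of four

# Maximum CPU/Memory boards per system, from TABLE 1-6.
MAX_BOARDS = {"E6900": 6, "6800": 6, "E4900": 3, "4810": 3, "4800": 3, "3800": 2}


def max_cpus(system):
    """Maximum CPUs in a fully populated system."""
    return MAX_BOARDS[system] * CPUS_PER_BOARD_MAX


def banks_per_board():
    """Physical memory banks on a four-CPU board."""
    return CPUS_PER_BOARD_MAX * BANKS_PER_CPU


def dimm_slots_per_board():
    """DIMM slots on a board (derived value)."""
    return banks_per_board() * DIMMS_PER_BANK
```

The results agree with TABLE 1-6 (for example, 6 boards x 4 CPUs = 24 CPUs) and with the eight banks per board stated above.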

I/O Assemblies

All systems support multiple I/O assemblies. For the types of I/O assemblies supported by each system and other technical information, refer to the Sun Fire 6800/4810/4800/3800 Systems Overview Manual and the Sun Fire E6900/E4900 Systems Overview Manual. TABLE 1-7 lists the maximum number of I/O assemblies for each system.


TABLE 1-7 Maximum Number of I/O Assemblies and I/O Slots per I/O Assembly

System: Sun Fire E6900 and 6800 systems
  Maximum Number of I/O Assemblies: 4
  Number of CompactPCI or PCI I/O Slots per Assembly:
    • 8 slots: 6 slots for full-length PCI cards and 2 short slots for short PCI cards
    • 4 slots for CompactPCI cards

System: Sun Fire 4810 system
  Maximum Number of I/O Assemblies: 2
  Number of CompactPCI or PCI I/O Slots per Assembly:
    • 8 slots: 6 slots for full-length PCI cards and 2 short slots for short PCI cards
    • 4 slots for CompactPCI cards

System: Sun Fire E4900 and 4800 systems
  Maximum Number of I/O Assemblies: 2
  Number of CompactPCI or PCI I/O Slots per Assembly:
    • 8 slots: 6 slots for full-length PCI cards and 2 short slots for short PCI cards
    • 4 slots for CompactPCI cards

System: Sun Fire 3800 system
  Maximum Number of I/O Assemblies: 2
  Number of CompactPCI or PCI I/O Slots per Assembly: 6 slots for CompactPCI cards


There are two possible ways to configure redundant I/O (TABLE 1-8).


TABLE 1-8 Configuring for I/O Redundancy

Ways to Configure For I/O Redundancy   Description
------------------------------------   -----------
Redundancy across I/O assemblies       You must have two I/O assemblies in a domain with duplicate cards in each I/O assembly that are connected to the same disk or network subsystem for path redundancy.
Redundancy within I/O assemblies       You must have duplicate cards in the I/O assembly that are connected to the same disk or network subsystem for path redundancy. This does not protect against the failure of the I/O assembly itself.


The network redundancy features use part of the Solaris operating environment, known as IP multipathing. For information on IP multipathing (IPMP), refer to the Solaris documentation supplied with the Solaris 8 or 9 operating environment release.

The Sun StorEdge Traffic Manager provides multipath disk configuration management, failover support, I/O load balancing, and single instance multipath support. For details, refer to the Sun StorEdge documentation available on the Sun Storage Area Network (SAN) Web site at:

http://www.sun.com/storage/san

Cooling

All systems have redundant cooling when the maximum number of fan trays are installed. If one fan tray fails, the remaining fan trays automatically increase speed, thereby enabling the system to continue to operate.




Caution - With the minimum number of fan trays installed, you do not have redundant cooling.



With redundant cooling, you do not need to suspend system operation to replace a failed fan tray. You can hot-swap a fan tray while the system is running, with no interruption to the system.

TABLE 1-9 shows the minimum and maximum number of fan trays required to cool each system. For location information, such as the fan tray number, refer to the labels on the system and the following documents:

Each system has comprehensive temperature monitoring to ensure that there is no over-temperature stressing of components in the event of a cooling failure or high ambient temperature. If there is a cooling failure, the speed of the remaining operational fans increases. If necessary, the system will be shut down.

Power

In order for power supplies to be redundant, you must have the required number of power supplies installed plus one additional redundant power supply for each power grid (referred to as the n+1 redundancy model). This means that two power supplies are required for the system to function properly. The third power supply is redundant. All three power supplies draw approximately the same current.

The power is shared in the power grid. If one power supply in the power grid fails, the remaining power supplies in the same power grid are capable of delivering the maximum power required for the power grid.

If more than one power supply in a power grid fails, there will be insufficient power to support a full load. For guidelines on what to do when a power supply fails, see To Handle Failed Components.
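The n+1 model described above can be summarized as a three-way check on the number of working power supplies in a grid. The following Python sketch uses the minimum of two supplies per grid given in this section; the function name and the status labels are hypothetical.

```python
REQUIRED_PER_GRID = 2  # supplies needed for the grid to carry a full load


def grid_power_status(working_supplies):
    """Classify a power grid by how many of its supplies are working."""
    if working_supplies < 0:
        raise ValueError("supply count cannot be negative")
    if working_supplies < REQUIRED_PER_GRID:
        return "insufficient"   # cannot support a full load
    if working_supplies == REQUIRED_PER_GRID:
        return "operational"    # full load supported, but no spare supply
    return "redundant"          # n+1: one supply can fail without impact
```

With three supplies installed, a single failure moves the grid from redundant to operational, and a second failure leaves it with insufficient power for a full load.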

The System Controller boards and the ID board obtain power from any power supply in the system. Fan trays obtain power from either power grid.

TABLE 1-10 describes the minimum and redundant power supply requirements.


TABLE 1-10 Minimum and Redundant Power Supply Requirements

System                            Number of Power Grids   Minimum Number of Power       Total Number of Supplies in Each Power Grid
                                  per System              Supplies in Each Power Grid   (Including Redundant Power Supplies)
-------------------------------   ---------------------   ---------------------------   -------------------------------------------
Sun Fire E6900 and 6800 systems   2                       2 (grid 0)                    3
Sun Fire E6900 and 6800 systems                           2 (grid 1)                    3
Sun Fire 4810 system              1                       2 (grid 0)                    3
Sun Fire E4900 and 4800 systems   1                       2 (grid 0)                    3
Sun Fire 3800 system              1                       2 (grid 0)                    3


In Sun Fire E6900 and 6800 systems, each power grid has its own assigned power supplies. Power supplies ps0, ps1, and ps2 are assigned to power grid 0. Power supplies ps3, ps4, and ps5 are assigned to power grid 1. If one power grid fails, the remaining power grid is still operational.

TABLE 1-11 lists the components in each power grid for Sun Fire E6900 and 6800 systems. If you have a Sun Fire E4900/4810/4800/3800 system, refer to the components in grid 0, since these systems have only power grid 0.


TABLE 1-11 Sun Fire E6900 and 6800 System Components in Each Power Grid

Components in the System        Grid 0          Grid 1
-----------------------------   -------------   -------------
CPU/Memory boards               SB0, SB2, SB4   SB1, SB3, SB5
I/O assemblies                  IB6, IB8        IB7, IB9
Power supplies                  PS0, PS1, PS2   PS3, PS4, PS5
Repeater boards                 RP0, RP1        RP2, RP3
Redundant transfer unit (RTU)   RTUF (front)    RTUR (rear)
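When planning which boards to place in a domain, it can help to look up the power grid that feeds each component. The following Python sketch encodes TABLE 1-11 for Sun Fire E6900 and 6800 systems only; the names are hypothetical and this is not a system controller command.

```python
# Grid membership for Sun Fire E6900 and 6800 systems, from TABLE 1-11.
POWER_GRIDS = {
    0: {"SB0", "SB2", "SB4", "IB6", "IB8", "PS0", "PS1", "PS2", "RP0", "RP1", "RTUF"},
    1: {"SB1", "SB3", "SB5", "IB7", "IB9", "PS3", "PS4", "PS5", "RP2", "RP3", "RTUR"},
}


def grid_of(component):
    """Return the power grid (0 or 1) that feeds the named component."""
    name = component.upper()
    for grid, members in POWER_GRIDS.items():
        if name in members:
            return grid
    raise KeyError(component)


def components_lost_if_grid_fails(grid):
    """Return the components that lose power if the given grid fails."""
    return sorted(POWER_GRIDS[grid])
```

For example, a domain built from SB0 and IB6 depends only on grid 0, whereas a domain built from SB0 and IB7 depends on both grids.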


Repeater Boards

The Repeater board, also referred to as a Fireplane switch, is a crossbar switch that connects multiple CPU/Memory boards and I/O assemblies. Having the required number of Repeater boards is mandatory for operation. There are Repeater boards in each midrange system except for the Sun Fire 3800. In the Sun Fire 3800 system, the equivalent of two Repeater boards are integrated into the active centerplane. Repeater boards are not fully redundant.

For steps to perform if a Repeater board fails, see Recovering from a Repeater Board Failure.

TABLE 1-12 lists the Repeater board assignments by each domain in Sun Fire E6900 and 6800 systems.


TABLE 1-12 Repeater Board Assignments by Domains in the Sun Fire E6900 and 6800 Systems

Partition Mode     Repeater Boards      Domains
----------------   ------------------   -------
Single partition   RP0, RP1, RP2, RP3   A, B
Dual partition     RP0, RP1             A, B
Dual partition     RP2, RP3             C, D




Note - If an E6900 or 6800 system in single-partition mode has fewer than four working Repeater boards available, the firmware will automatically change to dual-partition mode at the next domain reboot or keyswitch operation.




TABLE 1-13 lists the Repeater board assignments by each domain in Sun Fire E4900/4810/4800/3800 systems.


TABLE 1-13 Repeater Board Assignments by Domains in Sun Fire E4900/4810/4800/3800 Systems

Partition Mode     Repeater Boards   Domains
----------------   ---------------   -------
Single partition   RP0, RP2          A, B
Dual partition     RP0               A
Dual partition     RP2               C
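The assignments in TABLE 1-12 and TABLE 1-13 make it possible to work out which domains depend on a given Repeater board. The following Python sketch encodes both tables; the data structure and function name are hypothetical, and the sketch does not model the automatic mode change that the firmware can make after a failure.

```python
# (Repeater boards in the partition, domains that use them), per partition mode.
ASSIGNMENTS = {
    "E6900/6800": {
        "single": [({"RP0", "RP1", "RP2", "RP3"}, ("A", "B"))],
        "dual": [({"RP0", "RP1"}, ("A", "B")), ({"RP2", "RP3"}, ("C", "D"))],
    },
    "E4900/4810/4800/3800": {
        "single": [({"RP0", "RP2"}, ("A", "B"))],
        "dual": [({"RP0"}, ("A",)), ({"RP2"}, ("C",))],
    },
}


def domains_using(family, mode, board):
    """Return the domains whose partition includes the given Repeater board."""
    for boards, domains in ASSIGNMENTS[family][mode]:
        if board in boards:
            return domains
    raise KeyError(board)
```

For example, in dual-partition mode on an E6900 or 6800 system, RP3 serves only domains C and D, so domains A and B are not in the same partition as that board.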


TABLE 1-14 lists the Repeater board and domain configurations for single-partition mode and dual-partition mode in Sun Fire E6900 and 6800 systems.


TABLE 1-14 Sun Fire E6900 and 6800 Domain and Repeater Board Configurations for Single- and Dual-Partitioned Systems

Sun Fire 6800 System in Single-Partition Mode   Sun Fire 6800 System in Dual-Partition Mode
---------------------------------------------   -------------------------------------------
RP0   RP1   RP2   RP3                           RP0   RP1        RP2   RP3
Domain A                                        Domain A         Domain C
Domain B                                        Domain B         Domain D


TABLE 1-15 lists the configurations for single-partition mode and dual-partition mode in Sun Fire E4900/4810/4800/3800 systems.


TABLE 1-15 Sun Fire E4900/4810/4800/3800 Domain and Repeater Board Configurations for Single- and Dual-Partitioned Systems

Sun Fire 4810/4800/3800 System in Single-Partition Mode   Sun Fire 4810/4800/3800 System in Dual-Partition Mode
-------------------------------------------------------   -----------------------------------------------------
RP0   RP2                                                 RP0              RP2
Domain A                                                  Domain A         Domain C
Domain B


System Clocks

The System Controller board provides redundant system clocks. For more information on system clocks, see System Controller Clock Failover.


Reliability, Availability, and Serviceability (RAS)

Reliability, availability, and serviceability (RAS) are features of the Sun Fire midrange systems.

The following sections provide details on RAS. For more hardware-related information on RAS, refer to the Sun Fire 6800/4810/4800/3800 Systems Service Manual and the Sun Fire E6900/E4900 Systems Service Manual. For RAS features that involve the Solaris operating environment, refer to the Sun Hardware Platform Guide.

Reliability

The firmware reliability features include:

The reliability features also improve system availability.

POST

The power-on self-test (POST) is part of powering on a domain. A board or component that fails POST will be disabled. The domain, running the Solaris operating environment, is booted only with components that have passed POST testing.

Environmental Monitoring

The system controller monitors the system temperature, current, and voltage sensors. The fans are also monitored to make sure they are functioning. Environmental status is not provided to the Solaris operating environment--only the need for an emergency shutdown. The environmental status is provided to the Sun Management Center software with SNMP.

System Controller Clock Failover

Each system controller provides a system clock signal to each board in the system. Each board automatically determines which clock source to use. Clock failover is the ability to change the clock source from one system controller to another system controller without affecting the active domains.

When a system controller is reset or rebooted, clock failover is temporarily disabled. When the clock source is available again, clock failover is automatically enabled.

Error Checking and Correction

Any non-persistent storage device, for example dynamic random access memory (DRAM) used for main memory or static random access memory (SRAM) used for caches, is subject to occasional incidences of data loss due to collisions of alpha particles. The data loss changes the value stored in the memory location affected by the collision. These collisions predominantly result in losing one data bit.

When a bit of data is lost, this is referred to as a soft error in contrast to a hard error, which results from faulty hardware. The soft errors happen at the soft error rate, which can be predicted as a function of:

When an error-check mechanism detects that one or more bits in a word of data have changed, this is broadly categorized as an error checking and correction (ECC) error. ECC errors can be divided into two classes (TABLE 1-16).


TABLE 1-16 ECC Error Classes

ECC Error Classes        Definition
----------------------   ---------------------------------------------------------
Correctable errors       ECC errors with one data bit lost, which ECC can correct.
Non-correctable errors   ECC errors with multiple data bits lost.


ECC was developed so that systems can survive these naturally occurring data losses. Every word of data stored in memory also has check information stored along with it. This check information facilitates two things:

1. When a word of data is read out of memory, the check information can be used to detect:

2. If one bit has changed, the check information can be used to determine which bit in the word changed. The word is corrected by flipping the bit back to its complementary value.
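The detect-and-correct behavior described above can be illustrated with a textbook Hamming code extended with an overall parity bit, which corrects a single changed bit and detects two changed bits. The following Python sketch works on a 4-bit word for brevity. This chapter does not say which code the hardware uses, so the scheme, the word size, and the function names are assumptions for illustration only.

```python
DATA_POSITIONS = (3, 5, 6, 7)   # positions 1, 2, and 4 hold check bits
CHECK_POSITIONS = (1, 2, 4)


def encode(data_bits):
    """Store 4 data bits with check bits; index 0 is an overall parity bit."""
    code = [0] * 8
    for pos, bit in zip(DATA_POSITIONS, data_bits):
        code[pos] = bit
    for p in CHECK_POSITIONS:
        # Each check bit makes the parity of the positions it covers even.
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, data_bits), where status names the error class."""
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i          # a nonzero syndrome locates a changed bit
    parity_odd = sum(code) % 2 == 1
    fixed = list(code)
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        fixed[syndrome] ^= 1       # one bit changed: flip it back
        status = "corrected"
    else:
        status = "uncorrectable"   # two bits changed: detected, not fixed
    return status, [fixed[i] for i in DATA_POSITIONS]
```

In terms of TABLE 1-16, the "corrected" result corresponds to a correctable error and the "uncorrectable" result to a non-correctable error.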

Availability

The firmware availability features include:

Component Location Status

The physical location of a component, such as slots for CPU/Memory boards or slots for I/O assemblies, can be used to manage hardware resources that are configured into or out of the system.

A component location has either a disabled or enabled state, which is referred to as the component location status.

For example, if you have components that are failing, you can assign the disabled status to the locations of the failed components so that those components are deconfigured from the system.

The component locations that can be specified are described in TABLE 1-17:


TABLE 1-17 Component Locations

System Component     Component Subsystem                         Component Location
-------------------  ------------------------------------------  ------------------------------------
CPU system                                                       slot/port/physical_bank/logical_bank
                     CPU/Memory boards (slot)                    SB0, SB1, SB2, SB3, SB4, SB5
                     Ports on the CPU/Memory board               P0, P1, P2, P3
                     Physical memory banks on CPU/Memory boards  B0, B1
                     Logical banks on CPU/Memory boards          L0, L1, L2, L3
I/O assembly system                                              slot/port/bus or slot/card
                     I/O assemblies (slot)                       IB6, IB7, IB8, IB9
                     Ports on the I/O assembly                   P0 and P1
                     Note: Leave at least one I/O controller 0 enabled in a domain so that
                     the domain can communicate with the system controller.
                     Buses on the I/O assembly                   B0, B1
                     I/O cards in the I/O assemblies             C0, C1, C2, C3, C4, C5, C6, C7
                                                                 (The number of I/O cards in the
                                                                 I/O assembly varies with the
                                                                 I/O assembly type.)
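
The location formats in TABLE 1-17 can be checked mechanically. The following sketch is an illustration only and is not part of the system controller firmware. It assumes that a location may stop at any level of the hierarchy (for example, SB0 or SB0/P1), and it accepts every identifier listed in the table, although the slots and I/O cards actually present depend on the system model and I/O assembly type. The function name is_valid_location is hypothetical.

```python
# Allowed identifiers at each level, taken from TABLE 1-17.
CPU_LEVELS = [
    {f"SB{i}" for i in range(6)},      # CPU/Memory board slots
    {f"P{i}" for i in range(4)},       # ports
    {"B0", "B1"},                      # physical memory banks
    {f"L{i}" for i in range(4)},       # logical banks
]
IO_BUS_LEVELS = [
    {f"IB{i}" for i in range(6, 10)},  # I/O assembly slots
    {"P0", "P1"},                      # ports
    {"B0", "B1"},                      # buses
]
IO_CARD_LEVELS = [
    {f"IB{i}" for i in range(6, 10)},  # I/O assembly slots
    {f"C{i}" for i in range(8)},       # I/O cards
]


def _matches(parts, levels):
    """True if each path element is allowed at its level."""
    if not 0 < len(parts) <= len(levels):
        return False
    return all(part in level for part, level in zip(parts, levels))


def is_valid_location(location):
    """Check a location such as 'SB0/P1/B0/L2', 'IB6/P0/B1', or 'IB7/C3'."""
    parts = location.split("/")
    return any(
        _matches(parts, levels)
        for levels in (CPU_LEVELS, IO_BUS_LEVELS, IO_CARD_LEVELS)
    )
```

With these definitions, SB0/P1/B0/L2, IB6/P0/B1, and IB7/C3 are accepted, while SB6, IB6/P2, and SB0/C1 are rejected.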


Use the following commands to set and review the component location status:

You set the component location status by running the setls command from the platform or domain shell. The component location status is updated at the next domain reboot, board power cycle, or POST execution (for example, POST is run whenever you perform a setkeyswitch on or off operation).

The platform component location status supersedes the domain component location status. For example, if a component location is disabled in the platform, that location will be disabled in all domains. If you change the status of a component location in a domain, the change applies only to that domain. This means that if the component is moved to another location or to another domain, the component does not retain the same location status.
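
The precedence rule above can be summarized in a few lines. This is a minimal sketch and not the firmware's implementation: the dictionaries stand in for the platform and domain settings, the function name effective_status is hypothetical, and the sketch assumes that a location with no explicit setting is treated as enabled.

```python
def effective_status(location, platform, domain):
    """Return 'enabled' or 'disabled' for a location as seen by one domain.

    platform and domain map location strings to 'enabled' or 'disabled'.
    Assumption: a location with no explicit entry counts as enabled.
    """
    if platform.get(location, "enabled") == "disabled":
        # The platform setting supersedes the domain setting.
        return "disabled"
    return domain.get(location, "enabled")


# Example settings (hypothetical values).
platform = {"SB2": "disabled"}
domain_a = {"SB0/P1": "disabled", "SB2": "enabled"}
domain_b = {}
```

Here SB2 is disabled in both domains because it is disabled in the platform, even though domain_a lists it as enabled. SB0/P1 is disabled only in domain_a, because a domain-level change applies only to that domain.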



Note - Starting with the 5.15.0 release, the setls command replaces the enablecomponent and disablecomponent commands, which were formerly used to manage component resources. Although the enablecomponent and disablecomponent commands are still available, use the setls command to control the configuration of components into or out of the system.



Use the showcomponent command to display the location status of a component (enabled or disabled). In some cases, certain components identified as disabled cannot be enabled. If the POST status in the showcomponent output for a disabled component is chs (abbreviation for component health status), the component cannot be enabled, based on the current diagnostic data maintained for the component. For additional information on component health status, see Automatic Diagnosis and Recovery Overview.

System Controller Failover Recovery

Systems with redundant System Controller boards support the SC failover capability. In a high-availability system controller configuration, the SC failover mechanism triggers the switchover from the main SC to the spare if the main SC fails. Within approximately five minutes, the spare SC becomes the main and takes over all system controller operations. For details on SC failover, see SC Failover Overview.

Error Diagnosis and Domain Recovery

When the SC detects a domain hardware error, it pauses the domain. The firmware includes an auto-diagnosis (AD) engine that tries to identify the component or components responsible for the error. If possible, the SC disables (deconfigures) those components so that they cannot be used by the system.

After the auto-diagnosis, the SC automatically reboots the domain as part of the auto-restoration process, provided that the reboot-on-error parameter of the setupdomain command is set to true. For details on the AD engine and the auto-restoration process, see Automatic Diagnosis and Recovery Overview.

An automatic reboot of a specific domain can occur up to a maximum of three times. After the third automatic reboot, the domain is paused if another hardware error occurs, and the error reboots are stopped. Rather than restarting the domain manually, contact your service provider for assistance on resolving the domain hardware error.

If you set the reboot-on-error parameter to false, the domain is paused when the SC detects a domain hardware error. You must manually restart the domain (perform setkeyswitch off and then setkeyswitch on).
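
The reboot behavior described in this section can be expressed as a small decision function. The sketch below is a simplified model for illustration, not the system controller's actual logic; the function name and the way the reboot count is tracked are assumptions.

```python
MAX_AUTO_REBOOTS = 3  # maximum number of automatic reboots stated above


def action_on_hardware_error(reboot_on_error, auto_reboots_so_far):
    """Return the action taken after auto-diagnosis of a domain hardware error.

    reboot_on_error     -- value of the reboot-on-error parameter (True/False)
    auto_reboots_so_far -- automatic reboots already performed for this domain
    """
    if not reboot_on_error:
        # The domain stays paused until it is restarted manually.
        return "pause"
    if auto_reboots_so_far >= MAX_AUTO_REBOOTS:
        # The limit is reached, so error reboots stop and the domain is paused.
        return "pause"
    return "reboot"
```

For example, with reboot-on-error set to true, the first three hardware errors each lead to "reboot" and a fourth leads to "pause".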

Hung Domain Recovery

The hang-policy parameter of the setupdomain command, when set to the value reset (default), causes the system controller to automatically recover hung domains. For details, see Automatic Recovery of Hung Domains.

Unattended Power Failure Recovery

If there is a power outage, the system controller reconfigures active domains. TABLE 1-18 describes domain actions that occur during or after a power failure when the keyswitch is:

System Controller Reboot Recovery

The SC can be rebooted through SC failover or by using the reboot command. The SC will start up and resume management of the system. The reboot does not disturb the domain(s) currently running the Solaris operating environment.

Serviceability

The firmware serviceability features promote the efficiency and timeliness of providing routine as well as emergency service to midrange systems.

LEDs

All field-replaceable units (FRUs) that are accessible from outside the system have LEDs that indicate their state. The system controller manages all the LEDs in the system, with the exception of the power supply LEDs, which are managed by the power supplies. For a discussion of LED functions, refer to the appropriate board or device chapter of the Sun Fire 6800/4810/4800/3800 Systems Service Manual or the Sun Fire E6900/E4900 Systems Service Manual.

Nomenclature

The system controller, the Solaris operating environment, the power-on self-test (POST), and the OpenBoot PROM error messages use FRU name identifiers that match the physical labels in the system. The only exception is the OpenBoot PROM nomenclature used for I/O devices, which uses the device path names as described in Appendix A.

System Controller XIR Support

The system controller reset command enables you to recover from a hard hung domain and extract a Solaris operating environment core file.

System Error Buffer

If a system error occurs due to a fault condition, the information is stored in a system error buffer that retains system error messages. This information, which can be viewed by running the showerrorbuffer command, is used by your service provider to analyze a failure or problem. For details on the showerrorbuffer command, refer to the Sun Fire Midrange System Controller Command Reference Manual.


Capacity on Demand Option

Capacity on Demand (COD) is an option that provides additional processing resources (additional CPUs) when you need them. These additional CPUs are provided on COD CPU/Memory boards that are installed in your system. However, to access these COD CPUs, you must first purchase the COD right-to-use (RTU) licenses for them. After you obtain the COD RTU licenses for your COD CPUs, you can activate those CPUs as needed. For details on COD, see COD Overview.


Dynamic Reconfiguration

Dynamic reconfiguration (DR), which is provided as part of the Solaris operating environment, enables you to safely add and remove CPU/Memory boards and I/O assemblies while the system is still running. DR controls the software aspects of dynamically changing the hardware used by a domain, with minimal disruption to user processes running in the domain.

You can use DR to do the following:

The DR software uses the cfgadm command, which is a command-line interface for configuration administration. You can perform domain management DR tasks using the SC. The DR agent also provides a remote interface to the Sun Management Center software on Sun Fire midrange systems.
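
As an illustration of how a script on a domain might drive cfgadm, the sketch below builds the argument list for a state-change request. It relies only on the -c option of cfgadm, which takes one of the state-change functions listed in the code. The helper name cfgadm_command is hypothetical, and the attachment point ID in the usage note is a placeholder, so use the IDs that cfgadm reports on your own domain.

```python
# State-change functions accepted by the cfgadm -c option.
VALID_FUNCTIONS = {
    "insert", "remove", "disconnect", "connect", "configure", "unconfigure",
}


def cfgadm_command(function, ap_id):
    """Build the argument list for 'cfgadm -c <function> <ap_id>'.

    The list can be passed to subprocess.run() on a Solaris domain.
    Nothing is executed here.
    """
    if function not in VALID_FUNCTIONS:
        raise ValueError(f"unsupported cfgadm function: {function}")
    if not ap_id:
        raise ValueError("an attachment point ID is required")
    return ["cfgadm", "-c", function, ap_id]
```

For example, cfgadm_command("unconfigure", "N0.SB2") returns the list for unconfiguring the attachment point named N0.SB2, where that name is an assumed placeholder. Refer to the DR user guide for the supported sequence of operations before you run any such command.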

For complete information on DR, refer to the Sun Fire Midrange Systems Dynamic Reconfiguration User Guide and also the Solaris documentation included with the Solaris operating environment.


Sun Management Center Software for Sun Fire Midrange Systems

The Sun Management Center software is the graphical user interface for managing the Sun Fire midrange systems.

To optimize the effectiveness of the Sun Management Center software, you must install it on a separate system. The Sun Management Center software has the capability to logically group domains and the system controller into a single manageable object, to simplify operations.

Once configured, the Sun Management Center software is also the recipient of SNMP traps and events.

To use the Sun Management Center, you must attach the System Controller board to a network. With a network connection, you can view both the command-line interface and the graphical user interface.

For information on the Sun Management Center software, refer to the Sun Management Center 3.5 Version 3 Supplement for Sun Fire Midrange Systems, which is available online.


FrameManager

The FrameManager is an LCD that is located in the top right corner of the Sun Fire system cabinet. For a description of its functions, refer to the "FrameManager" chapter of the Sun Fire 6800/4810/4800/3800 Systems Service Manual or the Sun Fire E6900/E4900 Systems Service Manual.