CHAPTER 1
A domain runs its own instance of the Solaris operating environment and is independent of other domains. Each domain has its own CPUs, memory, and I/O assemblies. Hardware resources including fans and power supplies are shared among domains, as necessary for proper operation.
The system controller is an embedded system that connects into the centerplane of these midframe systems. You access the system controller using either serial or Ethernet connections. It is the focal point for platform and domain configuration and management and is used to connect to the domain consoles.
The system controller configures and monitors the hardware in the system and provides a command line interface that enables you to perform tasks needed to configure the platform and each domain. The system controller also provides monitoring and configuration capability with SNMP for use with the Sun Management Center software. For more information on the system controller hardware and software, see System Controller and System Controller Firmware.
With this family of midframe systems, you can group system boards (CPU/Memory boards and I/O assemblies) into domains. Each domain can host its own instance of the Solaris operating environment and is independent of other domains.
You create domains by using either the system controller command line interface or the Sun Management Center. How to create domains using the system controller software is described in Creating and Starting Domains. For instructions on how to create domains using the Sun Management Center, refer to the Sun Management Center Supplement for Sun Fire 6800/4810/4800/3800 Systems.
If you run more than one domain in a partition, then the domains are not completely isolated. A failed Repeater board could affect all domains within the partition. For more information, see Repeater Boards.
Note - If a Repeater board failure affects a domain running host-licensed software, it is possible to continue running that software by swapping the HostID/MAC address of the affected domain with that of an available domain. For details, see Swapping Domain HostID/MAC Addresses.
The system boards in each system consist of CPU/Memory boards and I/O assemblies. The Sun Fire 6800/4810/4800 systems have Repeater boards (TABLE 1-1), which provide communication between CPU/Memory boards and I/O assemblies.
A partition is a group of Repeater boards that are used together to provide communication between CPU/Memory boards and I/O assemblies. Depending on the system configuration, each partition can be used by either one or two domains.
These systems can be configured to have one or two partitions. Partitioning is done at the Repeater board level. Single-partition mode forms one large partition using all of the Repeater boards. In dual-partition mode, two smaller partitions using fewer Repeater boards are created. For more information on Repeater boards, see Repeater Boards.
TABLE 1-2 lists the maximum number of partitions and domains each system can have.
FIGURE 1-1 through FIGURE 1-6 show partitions and domains for the Sun Fire 6800/4810/4800/3800 systems. The Sun Fire 3800 system has the equivalent of two Repeater boards, RP0 and RP2, as part of the active centerplane. The Repeater boards are not installed in the Sun Fire 3800 system as they are for the other systems. Instead, the Repeater boards in the Sun Fire 3800 system are integrated into the centerplane.
All of these systems are very flexible, and you can assign CPU/Memory boards and I/O assemblies to any domain or partition. The configurations shown in the following illustrations are examples only and your configuration may differ.
FIGURE 1-1 shows the Sun Fire 6800 system in single-partition mode. This system has four Repeater boards that operate in pairs (RP0, RP1) and (RP2, RP3), six CPU/Memory boards (SB0 - SB5), and four I/O assemblies (IB6 - IB9).
FIGURE 1-3 shows the Sun Fire 4810/4800 systems in single-partition mode. These systems have two Repeater boards (RP0 and RP2) that operate separately (not in pairs as in the Sun Fire 6800 system), three CPU/Memory boards (SB0, SB2, and SB4), and two I/O assemblies (IB6 and IB8).
FIGURE 1-5 shows the Sun Fire 3800 system in single-partition mode. This system has the equivalent of two Repeater boards (RP0 and RP2) integrated into the active centerplane, two CPU/Memory boards (SB0 and SB2), and two I/O assemblies (IB6 and IB8).
FIGURE 1-6 shows the Sun Fire 3800 system in dual-partition mode. The same boards and assemblies are shown as in FIGURE 1-5. This system also has the equivalent of two Repeater boards, RP0 and RP2, integrated into the active centerplane.
The system controller is an embedded system that connects into the centerplane of the Sun Fire midframe systems. It is the focal point for platform and domain configuration and management and is used to connect to the domain consoles.
The system can support up to two System Controller boards (TABLE 1-4) that function as a main and spare system controller. This redundant configuration of system controllers supports the SC failover mechanism, which triggers the automatic switchover of the main SC to the spare if the main SC fails. For details on SC failover, see Chapter 8.
If the main system controller fails and a failover occurs, the spare assumes all system controller tasks formerly handled by the main system controller. The spare system controller functions as a hot standby, and is used only as a backup for the main system controller.
For performance reasons, it is suggested that the system controllers be configured on a private network. For details, refer to the article, Sun Fire Midframe Server Best Practices for Administration, at
TABLE 1-5 describes the features of the serial port and the Ethernet port on the System Controller board. The Ethernet port provides the fastest connection.
Remain in the system controller message queue and are written to the configured syslog host(s). See TABLE 3-1 for instructions on setting up the platform and domain loghosts. Loghosts capture error messages regarding system failures and can be used to troubleshoot system failures.
The system controller supports one logical connection on the serial port and multiple logical connections with telnet on the Ethernet port. Connections can be set up for either the platform or one of the domains. Each domain can have only one logical connection at a time.
The platform shell is the operating environment for the platform administrator. Only commands that pertain to platform administration are available. To connect to the platform, see Obtaining the Platform Shell.
For platform administration functions, see Platform Administration.
To connect to a domain, see Obtaining a Domain Shell or Console.
If the domain is active (the Solaris operating environment, the OpenBoot PROM, or POST is running in the domain), you can access the domain console. When you connect to the domain console, you will be in one of the following modes of operation:
The domains that are available vary with the system type and configuration. For more information on the maximum number of domains you can have, see Partitions.
For information on keyswitch settings, see Setting Keyswitch Positions. For a description and syntax of the setkeyswitch command, refer to the Sun Fire 6800/4810/4800/3800 System Controller Command Reference Manual.
Sensors throughout the system monitor temperature, voltage, current, and fan speed. The system controller periodically reads the values from each of these sensors. This information is maintained for display using the console commands and is available to Sun Management Center through SNMP.
When a sensor is generating values that are outside of the normal limits, the system controller takes appropriate action. This includes shutting down components in the system to prevent damage. Domains may be automatically paused as a result. If domains are paused, an abrupt hardware pause occurs (it is not a graceful shutdown of the Solaris operating environment).
The system controller does not have permanent storage for console messages. Both the platform and each domain have a small buffer that maintains some history. However, this information is lost when the system is rebooted or the system controller loses power.
To enhance accountability and for long-term storage, it is strongly suggested that you set up a syslog host so that the platform and domain console messages are sent to the syslog host. Be aware that these messages are not the Solaris operating environment console messages.
To minimize single points of failure, configure system resources using redundant components, which allows domains to remain functional. Component failures can be quickly and transparently handled when using redundant components.
For troubleshooting tips to perform if a board or component fails, see Board and Component Failures.
You can create two partitions on every midframe system. Use the setupplatform command to set up partition mode. For system controller command syntax and descriptions, refer to the Sun Fire 6800/4810/4800/3800 System Controller Command Reference Manual.
When a system is divided into two partitions, the system controller software logically isolates connections of one partition from the other. Partitioning is done at the Repeater board level. Single-partition mode forms one large partition using all of the Repeater boards. In dual-partition mode, two smaller partitions are created, each using one-half of the total number of Repeater boards in the system.
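The split described above can be sketched in a few lines of Python. This is only an illustration: the board names come from this chapter, but the `partitions` function and the `REPEATER_BOARDS` table are hypothetical names, and the assumption that the split follows board order (so that the Sun Fire 6800 pairs stay together) is an inference from the text, not taken from the system controller software.

```python
# Repeater boards per system, as named in this chapter.
REPEATER_BOARDS = {
    "6800": ["RP0", "RP1", "RP2", "RP3"],
    "4810": ["RP0", "RP2"],
    "4800": ["RP0", "RP2"],
    "3800": ["RP0", "RP2"],  # integrated into the active centerplane
}


def partitions(system, dual):
    """Return the list of Repeater boards used by each partition."""
    boards = REPEATER_BOARDS[system]
    if not dual:
        # Single-partition mode: one large partition with every board.
        return [boards]
    # Dual-partition mode: each partition gets half of the boards.
    # Assumption: the split follows board order, so the 6800 pairs
    # (RP0, RP1) and (RP2, RP3) each form one partition.
    half = len(boards) // 2
    return [boards[:half], boards[half:]]
```

For example, `partitions("6800", dual=True)` yields two partitions of two boards each, while `partitions("4800", dual=False)` yields a single partition containing RP0 and RP2.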
Isolating errors to one partition is one of the main reasons to configure your system into dual-partition mode. With two partitions, if there is a failure in one domain in a partition, the failure will not affect the other domains running in the other partition. The exception to this is if there is a centerplane failure.
The interconnect bus implements cache coherency through a technique called snooping. With this approach each cache monitors the address of all transactions on the system interconnect, watching for transactions that update addresses it possesses. Since all CPUs need to see the broadcast addresses on the system interconnect, the address and command signals arrive simultaneously. The address and command lines are connected in a point-to-point fashion.
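To make the snooping idea concrete, here is a deliberately simplified Python model. It is not the actual interconnect protocol: the write-invalidate and write-through behavior, and the `Cache` and `Bus` class names, are assumptions chosen to keep the sketch short. It shows only the core idea that every cache sees every broadcast address and reacts when another cache updates an address it holds.

```python
class Cache:
    """One CPU cache that snoops every address broadcast on the bus."""

    def __init__(self, name):
        self.name = name
        self.lines = {}  # address -> cached value

    def snoop(self, address, writer):
        # Drop our copy when another cache updates an address we hold.
        if writer is not self and address in self.lines:
            del self.lines[address]


class Bus:
    """Broadcasts the address of every write to all attached caches."""

    def __init__(self):
        self.caches = []
        self.memory = {}

    def attach(self, cache):
        self.caches.append(cache)

    def read(self, cache, address):
        # On a miss, fill the cache line from memory.
        if address not in cache.lines:
            cache.lines[address] = self.memory.get(address, 0)
        return cache.lines[address]

    def write(self, cache, address, value):
        # Every cache observes the address before the write completes.
        for other in self.caches:
            other.snoop(address, cache)
        cache.lines[address] = value
        self.memory[address] = value  # write-through, for simplicity


bus = Bus()
cpu0, cpu1 = Cache("cpu0"), Cache("cpu1")
bus.attach(cpu0)
bus.attach(cpu1)

bus.write(cpu0, 0x100, 1)
bus.read(cpu1, 0x100)      # cpu1 now holds a copy of address 0x100
bus.write(cpu0, 0x100, 2)  # cpu1 snoops the write and drops its copy
```

After the second write, cpu1 no longer holds address 0x100, so its next read fetches the updated value instead of a stale one.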
Redundancy within a domain means that the domain can tolerate the failure of a redundant component. When such a component fails, the failure might not affect domain functionality because the redundant component takes over and continues all operations in the domain.
Unlike the other midframe systems, the Sun Fire 6800 system has two power grids. Each power grid is supplied by a different redundant transfer unit (RTU). TABLE 1-6 lists the boards in power grid 0 and power grid 1.
For troubleshooting tips to perform if a board or component fails, see Board and Component Failures.
The maximum number of CPUs you can have on a CPU/Memory board is four. CPU/Memory boards are configured with either two CPUs or four CPUs. TABLE 1-7 lists the maximum number of CPU/Memory boards for each system.
Each CPU/Memory board has eight physical banks of memory. The CPU provides memory management unit (MMU) support for two banks of memory. Each bank of memory has four slots. The memory modules (DIMMs) must be populated in groups of four to fill a bank. The minimum amount of memory needed to operate a domain is one bank (four DIMMs).
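The bank and DIMM counts above follow from simple arithmetic, sketched here in Python. The function names are hypothetical. The per-CPU scaling (a two-CPU board being able to use four of the eight banks) is an inference from the statement that each CPU supports two banks, so treat it as an assumption rather than a documented limit.

```python
BANKS_PER_CPU = 2   # MMU support per CPU, per the text above
DIMMS_PER_BANK = 4  # each bank has four slots


def usable_banks(cpus_on_board):
    """Banks that the installed CPUs can address (assumed to scale per CPU)."""
    return cpus_on_board * BANKS_PER_CPU


def dimm_slots(cpus_on_board):
    """DIMM slots in the banks that the installed CPUs can address."""
    return usable_banks(cpus_on_board) * DIMMS_PER_BANK


def valid_dimm_count(dimms, cpus_on_board):
    """DIMMs are populated in groups of four; one full bank is the minimum."""
    return (
        dimms >= DIMMS_PER_BANK
        and dimms % DIMMS_PER_BANK == 0
        and dimms <= dimm_slots(cpus_on_board)
    )
```

With four CPUs this gives 4 x 2 = 8 banks, matching the eight physical banks on the board, and 8 x 4 = 32 DIMM slots.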
All systems support multiple I/O assemblies. For the types of I/O assemblies supported by each system and other technical information, refer to the Sun Fire 6800/4810/4800/3800 Systems Overview Manual. TABLE 1-8 lists the maximum number of I/O assemblies for each system.
There are two possible ways to configure redundant I/O (TABLE 1-9).
The network redundancy features use part of the Solaris operating environment, known as IP multipathing. For information on IP multipathing (IPMP), see IP Multipathing (IPMP) Software and refer to the Solaris documentation supplied with the Solaris 8 or 9 operating environment release.
The Sun StorEdge Traffic Manager provides multipath disk configuration management, failover support, I/O load balancing, and single instance multipath support. For details, refer to the Sun StorEdge documentation available on the Sun Storage Area Network (SAN) Web site:
All systems have redundant cooling when the maximum number of fan trays are installed. If one fan tray fails, the remaining fan trays automatically increase speed, thereby enabling the system to continue to operate.
TABLE 1-10 shows the minimum and maximum number of fan trays required to cool each system. For location information, such as the fan tray number, refer to the labels on the system and to the Sun Fire 6800/4810/4800/3800 Systems Service Manual.
Each system has comprehensive temperature monitoring to ensure that there is no over-temperature stressing of components in the event of a cooling failure or high ambient temperature. If there is a cooling failure, the speed of the remaining operational fans increases. If necessary, the system is shut down.
In order for power supplies to be redundant, you must have the required number of power supplies installed plus one additional redundant power supply for each power grid (referred to as the n+1 redundancy model). This means that two power supplies are required for the system to function properly. The third power supply is redundant. All three power supplies draw about the same current.
The power is shared in the power grid. If one power supply in the power grid fails, the remaining power supplies in the same power grid are capable of delivering the maximum power required for the power grid.
If more than one power supply in a power grid fails, there will be insufficient power to support a full load. For guidelines on what to do when a power supply fails, see To Handle Failed Components.
TABLE 1-11 describes the minimum and redundant power supply requirements.
Each power grid has power supplies assigned to the power grid. Power supplies ps0, ps1, and ps2 are assigned to power grid 0. Power supplies ps3, ps4, and ps5 are assigned to power grid 1. If one power grid, such as power grid 0, fails, the remaining power grid is still operational.
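The n+1 model and the grid assignments above can be expressed as a small check. This Python sketch is illustrative only: the supply names and grid assignments come from the text, the required count of two per grid comes from the statement that two power supplies are needed for proper operation, and the function name and status strings are hypothetical. Per-system requirements listed in TABLE 1-11 are not modeled here.

```python
GRID_SUPPLIES = {
    0: ["ps0", "ps1", "ps2"],
    1: ["ps3", "ps4", "ps5"],
}
REQUIRED_PER_GRID = 2  # the "n" in n+1, per the text above


def grid_status(grid, failed):
    """Classify a power grid given the set of failed supply names."""
    working = [ps for ps in GRID_SUPPLIES[grid] if ps not in failed]
    if len(working) > REQUIRED_PER_GRID:
        return "redundant"
    if len(working) == REQUIRED_PER_GRID:
        return "operational, not redundant"
    return "insufficient power for full load"
```

Losing ps1 leaves grid 0 operational but without a spare, and losing a second supply in the same grid leaves it unable to carry a full load. Failures in grid 0 do not change the status of grid 1.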
TABLE 1-12 lists the components in the Sun Fire 6800 system in each power grid. If you have a Sun Fire 4810/4800/3800 system, refer to the components in grid 0, since these systems have only power grid 0.
The Repeater board, also referred to as a Fireplane switch, is a crossbar switch that connects multiple CPU/Memory boards and I/O assemblies. Having the required number of Repeater boards is mandatory for operation. There are Repeater boards in each midframe system except for the Sun Fire 3800. In the Sun Fire 3800 system, the equivalent of two Repeater boards is integrated into the active centerplane. Repeater boards are not fully redundant.
TABLE 1-14 lists the Repeater board assignments by each domain in the Sun Fire 4810/4800 systems.
TABLE 1-15 lists the configurations for single-partition mode and dual-partition mode for the Sun Fire 6800 system regarding Repeater boards and domains.
TABLE 1-16 lists the configurations for single-partition mode and dual-partition mode for the Sun Fire 4810/4800/3800 systems.
The System Controller board provides redundant system clocks. For more information on system clocks, see System Controller Clock Failover.
The following sections provide details on RAS. For more hardware-related information on RAS, refer to the Sun Fire 6800/4810/4800/3800 Systems Service Manual. For RAS features that involve the Solaris operating environment, refer to the Sun Hardware Platform Guide.
The power-on self-test (POST) is part of powering on a domain. A board or component that fails POST will be disabled. The domain, running the Solaris operating environment, is booted only with components that have passed POST testing.
The component locations that can be specified are described in TABLE 1-17:
Note - Starting with the 5.15.0 release, the enablecomponent and disablecomponent commands have been replaced by the setls command. These commands were formerly used to manage component resources. While the enablecomponent and disablecomponent commands are still available, it is suggested that you use the setls command to control the configuration of components into or out of the system.
The system controller monitors the system temperature, current, and voltage sensors. The fans are also monitored to make sure they are functioning. Environmental status is not provided to the Solaris operating environment--only the need for an emergency shutdown. The environmental status is provided to the Sun Management Center software with SNMP.
Each system controller provides a system clock signal to each board in the system. Each board automatically determines which clock source to use. Clock failover is the ability to change the clock source from one system controller to another system controller without affecting the active domains.
Any non-persistent storage device, for example Dynamic Random Access Memory (DRAM) used for main memory or Static Random Access Memory (SRAM) used for caches, is subject to occasional incidences of data loss due to collisions of alpha particles. The data loss changes the value stored in the memory location affected by the collision. These collisions predominantly result in losing one data bit.
When a bit of data is lost, this is referred to as a soft error in contrast to a hard error, which results from faulty hardware. The soft errors happen at the soft error rate, which can be predicted as a function of:
When an error check mechanism detects that one or more bits in a word of data has changed, this is broadly categorized as an error checking and correction (ECC) error. ECC errors can be divided into two classes (TABLE 1-18).
ECC was invented to facilitate the survival of the naturally occurring data losses. Every word of data stored in memory also has check information stored along with it. This check information facilitates two things:
Systems with redundant System Controller boards support the SC failover capability. In a high-availability system controller configuration, the SC failover mechanism triggers the switchover of the main SC to the spare if the main SC fails. Within approximately five minutes, the spare SC becomes the main and takes over all system controller operations. For details on SC failover, see SC Failover Overview.
When the system controller detects a domain hardware error, it pauses the domain. The firmware includes an auto-diagnosis (AD) engine that tries to identify either the single or multiple components responsible for the error. If possible, the system controller disables (deconfigures) those components so that they cannot be used by the system.
After the auto-diagnosis, the system controller automatically reboots the domain as part of the auto-restoration process, provided that the reboot-on-error parameter of the setupdomain command is set to true. For details on the AD engine and the auto-restoration process, see Auto-Diagnosis and Auto-Restoration.
An automatic reboot of a specific domain can occur up to a maximum of three times. After the third automatic reboot, the domain is paused if another hardware error occurs, and the error reboots are stopped. Rather than restart the domain manually, contact your service provider for assistance on resolving the domain hardware error.
If you set the reboot-on-error parameter to false, the domain is paused when the system controller detects a domain hardware error. You must manually restart the domain (perform setkeyswitch off and then setkeyswitch on).
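The reboot behavior described above amounts to a small decision rule, sketched here in Python. This is a model of the documented behavior, not firmware code: the function name, the counter argument, and the returned strings are hypothetical, and the sketch does not cover when or how the reboot count is reset.

```python
MAX_AUTO_REBOOTS = 3  # maximum automatic reboots per domain, per the text


def on_hardware_error(reboot_on_error, auto_reboots_so_far):
    """Return (action, new_reboot_count) once auto-diagnosis has finished."""
    if not reboot_on_error:
        # reboot-on-error is false: the domain stays paused until an
        # administrator restarts it manually.
        return "paused", auto_reboots_so_far
    if auto_reboots_so_far >= MAX_AUTO_REBOOTS:
        # A further error after the third automatic reboot stops the cycle.
        return "paused", auto_reboots_so_far
    return "rebooted", auto_reboots_so_far + 1


# Four consecutive hardware errors with reboot-on-error set to true.
count = 0
actions = []
for _ in range(4):
    action, count = on_hardware_error(True, count)
    actions.append(action)
```

In this run the first three errors each trigger an automatic reboot and the fourth leaves the domain paused.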
The hang-policy parameter of the setupdomain command, when set to the value reset (default), causes the system controller to automatically recover hung domains. For details, see Automatic Recovery of Hung Domains.
If there is a power outage, the system controller reconfigures active domains. TABLE 1-19 describes domain actions that occur during or after a power failure when the keyswitch is:
The system controller can be rebooted through SC failover or by using the reboot command. The system controller will start up and resume management of the system. The reboot does not disturb the domain(s) currently running the Solaris operating environment.
All field-replaceable units (FRUs) that are accessible from outside the system have LEDs that indicate their state. The system controller manages all the LEDs in the system, with the exception of the power supply LEDs, which are managed by the power supplies. For a discussion of LED functions, refer to the appropriate board or device chapter of the Sun Fire 6800/4810/4800/3800 Systems Service Manual.
The system controller, the Solaris operating environment, the power-on self-test (POST), and the OpenBoot PROM error messages use FRU name identifiers that match the physical labels in the system. The only exception is the OpenBoot PROM nomenclature used for I/O devices, which use the device path names as described in Appendix A.
You can configure the system controller platform and domains to log errors by using the syslog protocol to an external loghost. It is strongly recommended that you set the syslog host. For details on setting the syslog host, see TABLE 3-1.
The system controller also has an internal buffer where error messages are stored. You can display the system controller logged events, stored in the system controller message buffer, by using the showlogs command. There is one log for the platform and one log for each of the four domains.
If a system error occurs due to a fault condition, you can obtain detailed information about the error through the showerrorbuffer command. The information displayed is stored in a system error buffer that retains system error messages. This information can be used by your service provider to analyze a failure or problem.
Capacity on Demand (COD) is an option that provides additional processing resources (CPUs) when you need them. These additional CPUs are provided on COD CPU/Memory boards that are installed in your system. However, to access these COD CPUs, you must first purchase the COD right-to-use (RTU) licenses for them. After you obtain the COD RTU licenses for your COD CPUs, you can activate those CPUs as needed. For details on COD, see COD Overview.
Dynamic reconfiguration (DR), which is provided as part of the Solaris operating environment, enables you to safely add and remove CPU/Memory boards and I/O assemblies while the system is still running. DR controls the software aspects of dynamically changing the hardware used by a domain, with minimal disruption to user processes running in the domain.
The DR software uses the cfgadm command, which is a command-line interface for configuration administration. You can perform domain management DR tasks using the system controller software. The DR agent also provides a remote interface to the Sun Management Center software on Sun Fire 6800/4810/4800/3800 systems.
For complete information on DR, refer to the Sun Fire 6800, 4810, 4800, and 3800 Systems Dynamic Reconfiguration User Guide and also the Solaris documentation included with the Solaris operating environment.
The Solaris operating environment implementation of IPMP provides the following features (TABLE 1-20).
Ability to detect when a network adaptor that failed previously has been repaired and automatically switch back (failback) the network access from an alternate network adaptor. This assumes that you have enabled failbacks.
Outbound network packets are spread across multiple network adaptors without affecting the ordering of packets in order to achieve higher throughput. Load spreading occurs only when the network traffic is flowing to multiple destinations using multiple connections.
For more information on IP network multipathing (IPMP), refer to the System Administration Guide: IP Services, which is available with your Solaris operating environment release. The System Administration Guide: IP Services explains basic IPMP features and network configuration details. This book is available online with your Solaris operating environment release.
To optimize the effectiveness of the Sun Management Center, you must install it on a separate system. The Sun Management Center has the capability to logically group domains and the system controller into a single manageable object, to simplify operations.
To use the Sun Management Center, you must attach the System Controller board to a network. With a network connection, you can view both the command-line interface and the graphical user interface. To attach the System Controller board Ethernet port, refer to the installation documentation that was shipped with your system.
The FrameManager is an LCD that is located in the top right corner of the Sun Fire system cabinet. For a description of its functions, refer to the "FrameManager" chapter of the Sun Fire 6800/4810/4800/3800 Systems Service Manual.