CHAPTER 1
This chapter introduces the features of the Sun Fire family of midrange servers: the E6900/E4900/6800/4810/4800/3800 systems. For detailed descriptions of these systems, refer to the Sun Fire E6900/E4900 Systems Overview Manual and the Sun Fire 6800/4810/4800/3800 Systems Overview Manual.
This chapter describes:
The term platform, as used in this book, refers to the collection of resources such as power supplies, the centerplane, and fans that are not for the exclusive use of a domain.
A segment, also referred to as a partition, is a group of Sun FirePlane switches (Repeater boards) that are used together to provide communication between CPU/Memory boards and I/O assemblies in the same domain.
A domain runs its own instance of the Solaris operating environment and is independent of other domains. Each domain has its own CPUs, memory, and I/O assemblies. Hardware resources including fans and power supplies are shared among domains, as necessary for proper operation.
The system controller (SC) is an embedded system that configures and monitors the platform. You access the system controller using either serial or Ethernet connections. It is the focal point for platform and domain configuration and management, and is used to connect to the domain consoles.
The system controller offers a command-line interface that enables you to perform the tasks needed to configure the platform and each domain. The system controller also provides monitoring and configuration capabilities through the Simple Network Management Protocol (SNMP), which is used by the Sun Management Center software. For more information on the system controller hardware and firmware, see System Controller and System Controller Firmware.
With this family of midrange systems, you can group system boards (CPU/Memory boards and I/O assemblies) into domains. Each domain can host its own instance of the Solaris operating environment and is independent of other domains.
Domains include the following features:
All systems are configured at the factory with one domain.
You create domains by using either the system controller command-line interface or the Sun Management Center software. How to create domains using the system controller software is described in Creating and Starting Domains. For instructions on how to create domains using the Sun Management Center, refer to the Sun Management Center 3.5 Version 3 Supplement for Sun Fire Midrange Systems.
The largest domain configuration comprises all CPU/Memory boards and I/O assemblies in the system. The smallest domain configuration consists of one CPU/Memory board and one I/O assembly.
An active domain must meet these requirements:
In addition, sufficient power and cooling are required. The power supplies and fan trays are not assigned to a domain.
If you run more than one domain in a partition, then the domains are not completely isolated. A failed Repeater board could affect all domains within the partition. For more information, see Repeater Boards.
Note - If a Repeater board failure affects a domain running host-licensed software, it is possible to continue running that software by swapping the HostID/MAC address of the affected domain with that of an available domain. For details, see Swapping Domain HostID/MAC Addresses.
The system boards in each system consist of CPU/Memory boards and I/O assemblies. The Sun Fire midrange systems have Repeater boards (TABLE 1-1) that provide communication between CPU/Memory boards and I/O assemblies.
For a system overview, including descriptions of the boards in the system, refer to the Sun Fire 6800/4810/4800/3800 Systems Overview Manual and the Sun Fire E6900/E4900 Systems Overview Manual.
A segment, also referred to as a partition, is a group of Repeater boards that are used together to provide communication between CPU/Memory boards and I/O assemblies. Depending on the system configuration, each partition can be used by either one or two domains.
Sun Fire midrange systems can be configured to have one or two partitions. When a system is divided into two partitions, the system controller firmware logically isolates the connections of one partition from the other. Partitioning is done at the Repeater board level. Single-partition mode forms one large partition using all of the Repeater boards. Dual-partition mode creates two smaller partitions, each using one-half of the total number of Repeater boards in the system. For more information on Repeater boards, see Repeater Boards.
Use the setupplatform command to set up partition mode. For system controller command syntax and descriptions, refer to the Sun Fire Midrange System Controller Command Reference Manual.
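The split of Repeater boards between the two modes can be sketched in Python. This is an illustrative model only, not system controller code: the function name partition_layout is hypothetical, and the board names are taken from the figures described later in this chapter.

```python
# Illustrative model only; partition_layout is a hypothetical helper,
# not a system controller command or API.

def partition_layout(repeater_boards, dual_mode):
    """Return a list of partitions, each a list of Repeater board names.

    Single-partition mode places all Repeater boards in one partition.
    Dual-partition mode gives each partition one-half of the boards.
    """
    boards = list(repeater_boards)
    if not dual_mode:
        return [boards]
    half = len(boards) // 2
    return [boards[:half], boards[half:]]


e6900_boards = ["RP0", "RP1", "RP2", "RP3"]  # E6900/6800: boards operate in pairs
e4900_boards = ["RP0", "RP2"]                # E4900/4810/4800: boards operate separately

print(partition_layout(e6900_boards, dual_mode=False))
print(partition_layout(e6900_boards, dual_mode=True))
print(partition_layout(e4900_boards, dual_mode=True))
```

The sketch only models which boards end up in which partition. The actual mode change is made with the setupplatform command.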
Isolating errors to one partition is one of the main reasons to configure your system into dual-partition mode. With two partitions, if there is a failure in one domain in a partition, the failure will not affect the other domains running in the other partition. The exception to this is if there is a centerplane failure. If you set up two domains, it is strongly suggested that you configure dual-partition mode with the setupplatform command. Each partition should contain one domain.
Be aware that if you configure your system into two partitions, half of the theoretical maximum data bandwidth is available to the domains. However, the snooping address bandwidth is preserved.
The interconnect bus implements cache coherency through a technique called snooping. With this approach each cache monitors the address of all transactions on the system interconnect, watching for transactions that update addresses it possesses. Since all CPUs need to see the broadcast addresses on the system interconnect, the address and command signals arrive simultaneously. The address and command lines are connected in a point-to-point fashion.
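The snooping idea can be illustrated with a small Python model. This is a teaching sketch under stated assumptions, not a description of the Sun Fireplane hardware: the class names are hypothetical, the reaction to a snooped update is assumed to be invalidation of the stale copy, and memory is written through to keep the model small.

```python
# Teaching sketch of snooping; SnoopingCache and Interconnect are
# hypothetical names, not part of any Sun software.

class SnoopingCache:
    def __init__(self, name):
        self.name = name
        self.lines = {}  # address -> cached value

    def snoop(self, address, writer):
        # Every cache sees every broadcast address. It reacts only when
        # another cache updates an address that this cache holds.
        if writer is not self and address in self.lines:
            del self.lines[address]  # assumption: invalidate the stale copy


class Interconnect:
    def __init__(self):
        self.caches = []
        self.memory = {}

    def attach(self, cache):
        self.caches.append(cache)

    def read(self, cache, address):
        # On a miss, fill the cache line from memory.
        if address not in cache.lines:
            cache.lines[address] = self.memory.get(address, 0)
        return cache.lines[address]

    def write(self, cache, address, value):
        # Broadcast the address to all caches before completing the write.
        for other in self.caches:
            other.snoop(address, writer=cache)
        cache.lines[address] = value
        self.memory[address] = value  # write-through, for simplicity


bus = Interconnect()
cpu0, cpu1 = SnoopingCache("cpu0"), SnoopingCache("cpu1")
bus.attach(cpu0)
bus.attach(cpu1)
bus.memory[0x40] = 1
bus.read(cpu0, 0x40)
bus.read(cpu1, 0x40)
bus.write(cpu0, 0x40, 2)         # cpu1 snoops the address and drops its copy
print(0x40 in cpu1.lines)        # the stale line is gone
print(bus.read(cpu1, 0x40))      # cpu1 re-reads the updated value
```

Because every cache must observe every address, the address traffic is the same regardless of how many partitions exist, which is consistent with the statement that snooping address bandwidth is preserved in dual-partition mode.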
TABLE 1-2 lists the maximum number of partitions and domains each system can have.
FIGURE 1-1 through FIGURE 1-6 show partitions and domains for Sun Fire midrange systems. The Sun Fire 3800 system has the equivalent of two Repeater boards, RP0 and RP2, integrated into the active centerplane.
All of these systems are very flexible, and you can assign CPU/Memory boards and I/O assemblies to any domain or partition. The configurations shown in the following illustrations are examples only and your configuration may differ.
TABLE 1-3 describes the board names used in FIGURE 1-1 through FIGURE 1-6.
FIGURE 1-1 shows single-partition mode for Sun Fire E6900 and 6800 systems. These systems have four Repeater boards that operate in pairs (RP0, RP1) and (RP2, RP3), six CPU/Memory boards (SB0-SB5), and four I/O assemblies (IB6-IB9).
FIGURE 1-2 shows dual-partition mode for Sun Fire E6900 and 6800 systems. The same boards and assemblies are shown as in FIGURE 1-1.
FIGURE 1-3 shows single-partition mode on Sun Fire E4900/4810/4800 systems. These systems have two Repeater boards (RP0 and RP2) that operate separately (not in pairs as in the Sun Fire E6900 and 6800 systems), three CPU/Memory boards (SB0, SB2, and SB4), and two I/O assemblies (IB6 and IB8).
FIGURE 1-4 shows Sun Fire E4900/4810/4800 systems in dual-partition mode. The same boards and assemblies are shown as in FIGURE 1-3.
FIGURE 1-5 shows the Sun Fire 3800 system in single-partition mode. This system has the equivalent of two Repeater boards (RP0 and RP2) integrated into the active centerplane, two CPU/Memory boards (SB0 and SB2), and two I/O assemblies (IB6 and IB8).
FIGURE 1-6 shows the Sun Fire 3800 system in dual-partition mode. The same boards and assemblies are shown as in FIGURE 1-5. This system also has the equivalent of two Repeater boards, RP0 and RP2, integrated into the active centerplane.
The system controller is the focal point for platform and domain configuration and management and is used to connect to the domain consoles.
System controller functions include:
The system can support up to two System Controller boards (TABLE 1-4) that function as a main and spare system controller (SC). This redundant configuration of system controllers supports the SC failover mechanism, which triggers the automatic switchover of the main SC to the spare, if the main SC fails. For details on SC failover, see Chapter 8.
If the main SC fails and a failover occurs, the spare SC assumes all system controller tasks formerly handled by the main SC. The spare SC functions as a hot standby (a running SC that can take over as the main SC if the main SC fails), and is used only as a backup for the main SC.
Starting with the 5.16.0 release, the firmware supports an enhanced memory SC (referred to as system controller V2 or SC V2). In a redundant SC configuration, both the main and spare SC must be of the same type. Mixed SC configurations are not supported.
There are three methods to connect to the system controller console:
For security and performance reasons, it is suggested that the system controllers be configured on a private network. For details, refer to the Sun BluePrints online article, Sun Fire Midframe Server Best Practices for Administration, at
TABLE 1-5 describes the features of the serial port and the Ethernet port on the System Controller board. The Ethernet port provides the fastest connection.
Remain in the system controller message queue and are written to the configured syslog host(s). See TABLE 3-1 for instructions on setting up the platform and domain loghosts. Loghosts capture error messages regarding system failures and can be used to troubleshoot system failures.
The system controller supports one logical connection on the serial port and multiple logical connections on the Ethernet port, using SSH (as many as five connections) or Telnet (as many as twelve connections). Connections can be set up for either the platform or one of the domains. Each domain can have only one logical connection at a time.
An alternative to the Telnet protocol, the Secure Shell (SSH) protocol provides secure access to the system controller. SSH uses encryption to protect the data flowing between host and client, using authentication mechanisms to identify both hosts and clients.
The system controller provides SSHv2 server capability. You can use the SSH client software included in the Solaris 9 operating environment or OpenSSH clients with the Solaris 8 operating environment or SSHv2-compliant clients from other operating environments. For additional information on SSH, see Securing the System Platform.
The sections that follow provide information on the system controller firmware, including:
The platform administration function manages resources and services that are shared among the domains. With this function, you can determine how resources and services are configured and shared.
Platform administration functions include:
The platform shell is the operating environment for the platform administrator. Only commands that pertain to platform administration are available. To connect to the platform, see To Select Destinations From the SC Main Menu.
The platform console is the system controller serial port, where the system controller boot messages and platform log messages are printed.
When you power on the system, the system controller boots the real-time operating system and starts the System Controller Application (ScApp).
If there was an interruption of power, additional tasks completed at system power-on include:
The domain administration function manages resources and services for a specific domain.
Domain administration functions include:
For platform administration functions, see Platform Administration.
The domain shell is the operating environment for the domain administrator and is where domain tasks can be performed. There are four domain shells (A-D).
To connect to a domain, see To Navigate Between The Platform Shell And a Domain.
If the domain is active (Solaris operating environment, the OpenBoot PROM, or the power-on self-test (POST) is running in the domain), you can access the domain console. When you connect to the domain console, you will be at one of the following modes of operation:
If the domain is not active, you will be at the domain console prompt, where the prompt is schostname:domainID>:
The domains that are available vary with the system type and configuration. For more information on the maximum number of domains you can have, see Segments.
Each domain has a virtual keyswitch. You can set five keyswitch positions: off (default), standby, on, diag, and secure.
For information on keyswitch settings, see Setting Keyswitch Positions. For a description and syntax of the setkeyswitch command, refer to the Sun Fire Midrange System Controller Command Reference Manual.
Sensors throughout the system monitor temperature, voltage, current, and fan speed. The system controller periodically reads the values from each of these sensors. This information is maintained for display using the console commands and is available to Sun Management Center through SNMP.
When a sensor is generating values that are outside of the normal limits, the system controller takes appropriate action. This includes shutting down components in the system to prevent damage. Domains may be automatically paused as a result. If domains are paused, an abrupt hardware pause occurs (it is not a graceful shutdown of the Solaris operating environment).
Console messages generated by the SC for the platform and each domain are displayed on the appropriate consoles. These messages are also logged in a dynamic buffer on the SC, and these logs can be viewed by using the showlogs command. Limited history is maintained and log messages are not permanently stored in this 4 Kbyte dynamic buffer. Note that these log messages are lost when the SC is rebooted or when it loses power.
However, if your midrange system has SC V2s (enhanced-memory SCs), approximately 112 Kbytes of certain message logs and system messages are retained in persistent storage, even after the SC is rebooted or the SC loses power. (For details on system error messages, see System Error Buffer.)
The persistent logs can be viewed by using the showlogs -p command. For details on the showlogs command and the options available to display specific types of persistent log messages, refer to the Sun Fire Midrange System Controller Command Reference Manual.
Even if your system has SC V2s, it is strongly suggested that you set up a syslog host so that the platform and domain console messages are sent to the syslog host, to enhance accountability and long-term storage of log information. Note that the messages retained are not the Solaris operating environment console messages.
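The "limited history" behavior of the 4 Kbyte dynamic buffer described above can be pictured with a small Python sketch. This is an assumption-laden illustration, not the firmware's implementation: the class name BoundedLog is hypothetical, and the policy of discarding the oldest messages first is assumed rather than documented here.

```python
from collections import deque

# Illustration only; BoundedLog is a hypothetical class. The oldest-first
# discard policy is an assumption, not documented firmware behavior.

class BoundedLog:
    def __init__(self, capacity_bytes=4 * 1024):
        self.capacity_bytes = capacity_bytes
        self.messages = deque()  # encoded messages, oldest first
        self.used = 0            # bytes currently held

    def append(self, message):
        # Truncate a message that could never fit on its own.
        data = message.encode("utf-8")[: self.capacity_bytes]
        # Discard the oldest messages until the new one fits.
        while self.used + len(data) > self.capacity_bytes:
            self.used -= len(self.messages.popleft())
        self.messages.append(data)
        self.used += len(data)

    def show(self):
        return [m.decode("utf-8", errors="replace") for m in self.messages]


log = BoundedLog(capacity_bytes=10)
for text in ("aaaa", "bbbb", "cccc"):
    log.append(text)
print(log.show())  # the first message has been pushed out
```

A buffer of this kind loses its oldest entries under load and all entries on restart, which is why forwarding messages to a syslog host is the suggested way to keep a long-term record.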
To minimize single points of failure, configure system resources using redundant components. Redundancy allows domains to remain functional when a component fails, which enhances system availability.
For troubleshooting tips to perform if a board or component fails, see Board and Component Failures.
This section covers these topics:
All systems support multiple CPU/Memory boards. Each domain must contain at least one CPU/Memory board.
The maximum number of CPUs you can have on a CPU/Memory board is four. CPU/Memory boards are configured with either two CPUs or four CPUs. TABLE 1-6 lists the maximum number of CPU/Memory boards for each system.
Each CPU/Memory board has eight physical banks of memory. The CPU provides memory management unit (MMU) support for two banks of memory. Each bank of memory has four slots. Dual inline memory modules (DIMMs) must populate a bank in groups of four. The minimum amount of memory needed to operate a domain is one bank (four DIMMs).
A CPU can be used with no memory installed in any of its banks. A memory bank cannot be used unless the corresponding CPU is installed and functioning.
A failed CPU or faulty memory will be isolated from the domain by the CPU power-on self-test (POST). If a CPU is disabled by POST, the corresponding memory banks for the CPU will also be disabled.
You can operate a domain with as little as one CPU and one memory bank (four memory modules).
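The CPU and memory bank rules above can be expressed as a short Python check. This is a sketch under stated assumptions: the function names are hypothetical, the indexing that maps CPU i to banks 2i and 2i+1 is chosen only for the example, and a partially populated bank is treated as unusable.

```python
# Illustrative sketch only. The bank-to-CPU indexing (CPU i owns banks 2i
# and 2i+1) and the function names are assumptions made for this example.

BANKS_PER_CPU = 2    # each CPU provides MMU support for two banks
DIMMS_PER_BANK = 4   # DIMMs populate a bank in groups of four


def usable_banks(cpus_ok, dimms_in_bank):
    """Return the indexes of banks that are fully populated and whose
    corresponding CPU is installed and functioning."""
    usable = []
    for bank, dimms in enumerate(dimms_in_bank):
        cpu = bank // BANKS_PER_CPU
        if cpus_ok[cpu] and dimms == DIMMS_PER_BANK:
            usable.append(bank)
    return usable


def can_operate_domain(cpus_ok, dimms_in_bank):
    """A domain needs at least one working CPU with one usable bank."""
    return len(usable_banks(cpus_ok, dimms_in_bank)) > 0


# A four-CPU board: CPU 1 was disabled by POST, so its banks are unusable too.
cpus_ok = [True, False, True, True]
dimms_in_bank = [4, 0, 4, 4, 0, 0, 4, 2]
print(usable_banks(cpus_ok, dimms_in_bank))
print(can_operate_domain(cpus_ok, dimms_in_bank))
```

With four CPUs, two banks per CPU, and four slots per bank, a fully populated board holds 4 x 2 x 4 = 32 DIMMs.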
All systems support multiple I/O assemblies. For the types of I/O assemblies supported by each system and other technical information, refer to the Sun Fire 6800/4810/4800/3800 Systems Overview Manual and the Sun Fire E6900/E4900 Systems Overview Manual. TABLE 1-7 lists the maximum number of I/O assemblies for each system.
There are two possible ways to configure redundant I/O (TABLE 1-8).
The network redundancy features use part of the Solaris operating environment, known as IP multipathing. For information on IP multipathing (IPMP), refer to the Solaris documentation supplied with the Solaris 8 or 9 operating environment release.
The Sun StorEdge Traffic Manager provides multipath disk configuration management, failover support, I/O load balancing, and single instance multipath support. For details, refer to the Sun StorEdge documentation available on the Sun Storage Area Network (SAN) Web site at:
All systems have redundant cooling when the maximum number of fan trays are installed. If one fan tray fails, the remaining fan trays automatically increase speed, thereby enabling the system to continue to operate.
With redundant cooling, you do not need to suspend system operation to replace a failed fan tray. You can hot-swap a fan tray while the system is running, with no interruption to the system.
TABLE 1-9 shows the minimum and maximum number of fan trays required to cool each system. For location information, such as the fan tray number, refer to the labels on the system and the following documents:
Each system has comprehensive temperature monitoring to ensure that there is no over-temperature stressing of components in the event of a cooling failure or high ambient temperature. If there is a cooling failure, the speed of the remaining operational fans increases. If necessary, the system will be shut down.
In order for power supplies to be redundant, you must have the required number of power supplies installed plus one additional redundant power supply for each power grid (referred to as the n+1 redundancy model). This means that two power supplies are required for the system to function properly. The third power supply is redundant. All three power supplies draw approximately the same current.
The power is shared in the power grid. If one power supply in the power grid fails, the remaining power supplies in the same power grid are capable of delivering the maximum power required for the power grid.
If more than one power supply in a power grid fails, there will be insufficient power to support a full load. For guidelines on what to do when a power supply fails, see To Handle Failed Components.
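The n+1 model can be summarized with a few lines of Python. This is a minimal sketch, assuming the three-supply grid used in the example above (two required, one redundant); the function name grid_status and the status labels are hypothetical, and the required count for a particular system should be taken from TABLE 1-10.

```python
# Minimal sketch of n+1 redundancy; grid_status and its labels are
# hypothetical. required=2 matches the three-supply example in the text.

def grid_status(working_supplies, required=2):
    if working_supplies > required:
        return "redundant"      # a supply can fail without losing full load
    if working_supplies == required:
        return "sufficient"     # full load is supported, no spare remains
    return "insufficient"       # cannot support a full load


for working in (3, 2, 1):
    print(working, grid_status(working))
```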
The System Controller boards and the ID board obtain power from any power supply in the system. Fan trays obtain power from either power grid.
TABLE 1-10 describes the minimum and redundant power supply requirements.
In Sun Fire E6900 and 6800 systems, each power grid has its own set of power supplies. Power supplies ps0, ps1, and ps2 are assigned to power grid 0. Power supplies ps3, ps4, and ps5 are assigned to power grid 1. If one power grid fails, the remaining power grid is still operational.
TABLE 1-11 lists the components in each power grid for Sun Fire E6900 and 6800 systems. If you have a Sun Fire E4900/4810/4800/3800 system, refer to the components in grid 0, since these systems have only power grid 0.
The Repeater board, also referred to as a Fireplane switch, is a crossbar switch that connects multiple CPU/Memory boards and I/O assemblies. Having the required number of Repeater boards is mandatory for operation. There are Repeater boards in each midrange system except for the Sun Fire 3800. In the Sun Fire 3800 system, the equivalent of two Repeater boards is integrated into the active centerplane. Repeater boards are not fully redundant.
For steps to perform if a Repeater board fails, see Recovering from a Repeater Board Failure.
TABLE 1-12 lists the Repeater board assignments by each domain in Sun Fire E6900 and 6800 systems.
Note - If an E6900 or 6800 system in single-partition mode has fewer than four working Repeater boards available, the firmware will automatically change to dual-partition mode at the next domain reboot or keyswitch operation.
TABLE 1-13 lists the Repeater board assignments by each domain in Sun Fire E4900/4810/4800/3800 systems.
TABLE 1-14 lists the Repeater board and domain configurations for single-partition mode and dual-partition mode in Sun Fire E6900 and 6800 systems.
TABLE 1-15 lists the configurations for single-partition mode and dual-partition mode in Sun Fire E4900/4810/4800/3800 systems.
The System Controller board provides redundant system clocks. For more information on system clocks, see System Controller Clock Failover.
Reliability, availability, and serviceability (RAS) are features of the Sun Fire midrange systems.
The following sections provide details on RAS. For more hardware-related information on RAS, refer to the Sun Fire 6800/4810/4800/3800 Systems Service Manual and the Sun Fire E6900/E4900 Systems Service Manual. For RAS features that involve the Solaris operating environment, refer to the Sun Hardware Platform Guide.
The firmware reliability features include:
The reliability features also improve system availability.
The power-on self-test (POST) is part of powering on a domain. A board or component that fails POST will be disabled. The domain, running the Solaris operating environment, is booted only with components that have passed POST testing.
The system controller monitors the system temperature, current, and voltage sensors. The fans are also monitored to make sure they are functioning. Environmental status is not provided to the Solaris operating environment; only the need for an emergency shutdown is communicated. The environmental status is provided to the Sun Management Center software through SNMP.
Each system controller provides a system clock signal to each board in the system. Each board automatically determines which clock source to use. Clock failover is the ability to change the clock source from one system controller to another system controller without affecting the active domains.
When a system controller is reset or rebooted, clock failover is temporarily disabled. When the clock source is available again, clock failover is automatically enabled.
Any non-persistent storage device, for example dynamic random access memory (DRAM) used for main memory or static random access memory (SRAM) used for caches, is subject to occasional incidences of data loss due to collisions of alpha particles. The data loss changes the value stored in the memory location affected by the collision. These collisions predominantly result in losing one data bit.
When a bit of data is lost, this is referred to as a soft error in contrast to a hard error, which results from faulty hardware. The soft errors happen at the soft error rate, which can be predicted as a function of:
When an error-check mechanism detects that one or more bits in a word of data has changed, this is broadly categorized as an error checking and correction (ECC) error. ECC errors can be divided into two classes (TABLE 1-16).
ECC was developed to facilitate the survival of the naturally occurring data losses. Every word of data stored in memory also has check information stored along with it. This check information facilitates two things:
1. When a word of data is read out of memory, the check information can be used to detect:
2. If one bit has changed, the check information can be used to determine which bit in the word changed. The word is corrected by flipping the bit back to its complementary value.
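The detect-and-flip mechanism in the two steps above can be demonstrated with a textbook Hamming(7,4) code in Python. The source does not state which ECC code the hardware uses, so this small code is substituted purely as an illustration; the function names are hypothetical.

```python
# Textbook Hamming(7,4) code, used here only to illustrate single-bit
# correction. It is NOT the ECC scheme of the Sun Fire hardware, which
# this chapter does not specify. Bit positions 1..7 hold p1 p2 d1 p3 d2 d3 d4.

def encode(data_bits):
    """Add three check bits to four data bits."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def correct(word):
    """Return (corrected word, syndrome). A nonzero syndrome is the 1-based
    position of the single bit that changed."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s3 = w[3] ^ w[4] ^ w[5] ^ w[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        w[syndrome - 1] ^= 1   # flip the bit back to its complementary value
    return w, syndrome


def decode(word):
    """Extract the four data bits from a (corrected) codeword."""
    return [word[2], word[4], word[5], word[6]]


stored = encode([1, 0, 1, 1])
damaged = list(stored)
damaged[4] ^= 1                       # simulate a soft error in one bit
repaired, position = correct(damaged)
print(position, decode(repaired))
```

This seven-bit code corrects any single changed bit but cannot reliably handle two changed bits in the same word; practical memory ECC schemes add further check information for that case.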
The firmware availability features include:
The physical location of a component, such as slots for CPU/Memory boards or slots for I/O assemblies, can be used to manage hardware resources that are configured into or out of the system.
A component location has either a disabled or enabled state, which is referred to as the component location status.
For example, if you have components that are failing, you can assign the disabled status to the locations of the failed components so that those components are deconfigured from the system.
The component locations that can be specified are described in TABLE 1-17:
Use the following commands to set and review the component location status:
You set the component location status by running the setls command from the platform or domain shell. The component location status is updated at the next domain reboot, board power cycle, or POST execution (for example, POST is run whenever you perform a setkeyswitch on or off operation).
The platform component location status supersedes the domain component location status. For example, if a component location is disabled in the platform, that location will be disabled in all domains. If you change the status of a component location in a domain, the change applies only to that domain. This means that if the component is moved to another location or to another domain, the component does not retain the same location status.
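The precedence rule can be written as a short Python function. This is a sketch only: effective_status is a hypothetical name, the location strings are placeholders rather than the syntax listed in TABLE 1-17, and treating an unlisted location as enabled is an assumption made for the example.

```python
# Sketch of the precedence rule; effective_status is a hypothetical helper.
# Location names are placeholders, and defaulting to "enabled" is an assumption.

def effective_status(location, platform_status, domain_status):
    """Platform status supersedes the domain status for a location."""
    if platform_status.get(location, "enabled") == "disabled":
        return "disabled"
    return domain_status.get(location, "enabled")


platform = {"SB2": "disabled"}
domain_a = {"SB0": "disabled"}
domain_b = {}

print(effective_status("SB2", platform, domain_a))  # disabled in every domain
print(effective_status("SB0", platform, domain_a))  # disabled in domain A only
print(effective_status("SB0", platform, domain_b))  # still enabled in domain B
```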
Note - Starting with the 5.15.0 release, the enablecomponent and disablecomponent commands have been replaced by the setls command. These commands were formerly used to manage component resources. While the enablecomponent and disablecomponent commands are still available, it is suggested that you use the setls command to control the configuration of components into or out of the system.
Use the showcomponent command to display the location status of a component (enabled or disabled). In some cases, certain components identified as disabled cannot be enabled. If the POST status in the showcomponent output for a disabled component is chs (abbreviation for component health status), the component cannot be enabled, based on the current diagnostic data maintained for the component. For additional information on component health status, see Automatic Diagnosis and Recovery Overview.
Systems with redundant System Controller boards support the SC failover capability. In a high-availability system controller configuration, the SC failover mechanism triggers the switchover of the main SC to the spare if the main SC fails. Within approximately five minutes, the spare SC becomes the main and takes over all system controller operations. For details on SC failover, see SC Failover Overview.
When the SC detects a domain hardware error, it pauses the domain. The firmware includes an auto-diagnosis (AD) engine that tries to identify either the single or multiple components responsible for the error. If possible, the SC disables (deconfigures) those components so that they cannot be used by the system.
After the auto-diagnosis, the SC automatically reboots the domain as part of the auto-restoration process, provided that the reboot-on-error parameter of the setupdomain command is set to true. For details on the AD engine and the auto-restoration process, see Automatic Diagnosis and Recovery Overview.
An automatic reboot of a specific domain can occur up to a maximum of three times. After the third automatic reboot, the domain is paused if another hardware error occurs, and the error reboots are stopped. Rather than restarting the domain manually, contact your service provider for assistance on resolving the domain hardware error.
If you set the reboot-on-error parameter to false, the domain is paused when the SC detects a domain hardware error. You must manually restart the domain (perform setkeyswitch off and then setkeyswitch on).
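The reboot-on-error behavior described above can be modeled as a small state machine in Python. This is a sketch for illustration, not firmware code: the class name DomainRecovery and the state labels are hypothetical, and the sketch does not model when, if ever, the reboot count is reset, because that is not stated here.

```python
# Illustrative state machine; DomainRecovery and its state labels are
# hypothetical. Counter reset behavior is not modeled.

MAX_AUTO_REBOOTS = 3


class DomainRecovery:
    def __init__(self, reboot_on_error=True):
        self.reboot_on_error = reboot_on_error
        self.auto_reboots = 0
        self.state = "running"

    def hardware_error(self):
        # The SC pauses the domain while auto-diagnosis runs.
        self.state = "paused"
        # The domain is rebooted automatically only if the parameter is true
        # and fewer than three automatic reboots have already occurred.
        if self.reboot_on_error and self.auto_reboots < MAX_AUTO_REBOOTS:
            self.auto_reboots += 1
            self.state = "running"
        return self.state


domain = DomainRecovery(reboot_on_error=True)
print([domain.hardware_error() for _ in range(4)])

manual = DomainRecovery(reboot_on_error=False)
print(manual.hardware_error())
```

In the first case the domain recovers from three errors and stays paused on the fourth. In the second case it stays paused after the first error until it is restarted manually.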
The hang-policy parameter of the setupdomain command, when set to the value reset (default), causes the system controller to automatically recover hung domains. For details, see Automatic Recovery of Hung Domains.
If there is a power outage, the system controller reconfigures active domains. TABLE 1-18 describes domain actions that occur during or after a power failure when the keyswitch is:
The SC can be rebooted through SC failover or by using the reboot command. The SC will start up and resume management of the system. The reboot does not disturb the domain(s) currently running the Solaris operating environment.
The firmware serviceability features promote the efficiency and timeliness of providing routine as well as emergency service to midrange systems.
All field-replaceable units (FRUs) that are accessible from outside the system have LEDs that indicate their state. The system controller manages all the LEDs in the system, with the exception of the power supply LEDs, which are managed by the power supplies. For a discussion of LED functions, refer to the appropriate board or device chapter of the Sun Fire 6800/4810/4800/3800 Systems Service Manual or the Sun Fire E6900/E4900 Systems Service Manual.
The system controller, the Solaris operating environment, the power-on self-test (POST), and the OpenBoot PROM error messages use FRU name identifiers that match the physical labels in the system. The only exception is the OpenBoot PROM nomenclature used for I/O devices, which uses the device path names described in Appendix A.
The system controller reset command enables you to recover from a hard hung domain and extract a Solaris operating environment core file.
If a system error occurs due to a fault condition, the information is stored in a system error buffer that retains system error messages. This information, which can be viewed by running the showerrorbuffer command, is used by your service provider to analyze a failure or problem. For details on the showerrorbuffer command, refer to the Sun Fire Midrange System Controller Command Reference Manual.
Capacity on Demand (COD) is an option that provides additional processing resources (additional CPUs) when you need them. These additional CPUs are provided on COD CPU/Memory boards that are installed in your system. However, to access these COD CPUs, you must first purchase the COD right-to-use (RTU) licenses for them. After you obtain the COD RTU licenses for your COD CPUs, you can activate those CPUs as needed. For details on COD, see COD Overview.
Dynamic reconfiguration (DR), which is provided as part of the Solaris operating environment, enables you to safely add and remove CPU/Memory boards and I/O assemblies while the system is still running. DR controls the software aspects of dynamically changing the hardware used by a domain, with minimal disruption to user processes running in the domain.
You can use DR to do the following:
The DR software uses the cfgadm command, which is a command-line interface for configuration administration. You can perform domain management DR tasks using the SC. The DR agent also provides a remote interface to the Sun Management Center software on Sun Fire midrange systems.
For complete information on DR, refer to the Sun Fire Midrange Systems Dynamic Reconfiguration User Guide and also the Solaris documentation included with the Solaris operating environment.
The Sun Management Center software is the graphical user interface for managing the Sun Fire midrange systems.
To optimize the effectiveness of the Sun Management Center software, you must install it on a separate system. The Sun Management Center software has the capability to logically group domains and the system controller into a single manageable object, to simplify operations.
Once configured, the Sun Management Center software is also the recipient of SNMP traps and events.
To use the Sun Management Center, you must attach the System Controller board to a network. With a network connection, you can view both the command-line interface and the graphical user interface.
For information on the Sun Management Center software, refer to the Sun Management Center 3.5 Version 3 Supplement for Sun Fire Midrange Systems, which is available online.
The FrameManager is an LCD that is located in the top right corner of the Sun Fire system cabinet. For a description of its functions, refer to the "FrameManager" chapter of the Sun Fire 6800/4810/4800/3800 Systems Service Manual and the Sun Fire E6900/E4900 Systems Service Manual.