CHAPTER 8

Domain Control

This chapter addresses the functions that provide control over domain software and server hardware. Control functions are invoked at the discretion of an administrator. They are also useful to SMS for providing automatic system recovery (ASR).

Domain control functionality provides control over the software running on a domain. It includes those functions that enable a domain to be booted and interrupted. Only the domain administrator can invoke the domain control functions.

This chapter includes the following sections:


Booting Domains

This section describes the various aspects of booting the Solaris OS in a domain.

The setkeyswitch(1M) command is responsible for initiating and sequencing a domain boot. It powers on the domain hardware as required and invokes POST to test the hardware in the logical domain and configure it into a physical hardware domain of the Sun Fire high-end system. It then downloads and starts OpenBoot PROM as required to boot the Solaris OS on the domain.

Only domains that have their virtual keyswitch set appropriately are subject to boot control. See Virtual Keyswitch.

OpenBoot PROM boot parameters are stored in the domain's virtual NVRAM. The osd(1M) daemon provides those parameter values to OpenBoot PROM, which adapts the domain boot as indicated.

Certain parameters, in particular those that might not be adjustable from OpenBoot PROM itself when a domain is failing to boot, can be set by setobpparams(1M) so that they take effect at the next boot attempt.

Keyswitch Control

The domain keyswitch control (see Virtual Keyswitch) manually initiates domain boot.

The setkeyswitch command boots a properly configured domain when its keyswitch control is moved from the off or standby position to one of the on positions.

The setobpparams(1M) command provides a method by which a manually initiated (keyswitch control) domain boot sequence can be stopped in the OpenBoot PROM. For more information, see Setting the OpenBoot PROM Variables and refer to the setobpparams man page.

Power Control

Power for the following components can be controlled using the poweron and poweroff commands.


To Power System Boards On and Off From the Command Line

Platform administrators are allowed to control power to the entire system and can execute these commands without a location option. Domain administrators can control power to any system board assigned to their domains. Users with only domain privileges must supply the location option.

For the poweron command, location is the location of the system component you want to power on. If you are a domain administrator, it must be a component for which you have privileges.

For more information, refer to the poweron(1M) man page.

For the poweroff command, location is the location of the system component you want to power off. If you are a domain administrator, it must be a component for which you have privileges.

Enter y or n after the warning message:


!!!WARNING!!!WARNING!!!WARNING!!!WARNING!!!WARNING!!!
!!!WARNING!!!WARNING!!!WARNING!!!WARNING!!!WARNING!!!
 
This will trip the breakers on PS at PS5, which must be turned on manually!
 
Are you sure you want to continue to power off this component? (yes/no)? y




Caution - Remove a component from the domain using DR before powering it down. Powering off the component without first removing it from the domain causes a domain stop (dstop). If you are powering off a component to replace it, use the poweroff(1M) command. Do not use the breakers to power off the component before it has been removed from the domain; this can also cause a dstop. After the component has been removed from the domain, using the breakers to power it down does not cause a dstop.



For more information, refer to the poweroff(1M) man page.

If you try to power off the system while any domain is actively running the OS, the command fails and displays a message in the message panel of the window. In that case, issuing a setkeyswitch domain-id standby command for the active domains gracefully shuts down the processors. Once they have shut down, you can reissue the command to power off.

If the platform loses power due to a power outage, pcd records and saves the last state of each domain before power was lost.


To Recover From Power Failure

If you lose power to only the SC, switch on the power to the SC. Sun Fire high-end system domains are not affected by the loss of power to one SC. If you lose power to both the SC and the domains, use the following procedure to recover from the power failure. For switch locations, refer to the Sun Fire 15K/12K System Site Planning Guide.




Caution - Losing power to both SCs without shutting down SMS crashes the domains.



1. Manually switch off the bulk power supplies on the Sun Fire high-end system as well as the power switch on the SC.

This prevents power surge problems that can occur when power is restored.

2. After power is restored, manually switch on the bulk power supplies on the Sun Fire high-end system.

3. Manually switch on the SC power.

This boots the SC and starts the SMS daemons. Check your SC platform message file to confirm that the SMS daemons have finished starting.

4. Wait for the recovery process to complete. Any domain that was powered on and running the Solaris OS returns to the OS run state. Domains at OpenBoot PROM eventually return to an OpenBoot PROM run state.

The recovery process must finish before any SMS operation is performed. You can monitor the domain message files to determine when the recovery process has completed.

Domain-Requested Reboot

SMS reboots domains upon request from the domain management software (Solaris software or dsmd). The domain software requests reboot services in the following situations.

Automatic System Recovery (ASR)

Automatic system recovery (ASR) consists of those procedures that restore the system to running all properly configured domains after one or more domains have been rendered inactive due to software or hardware failures or due to unacceptable environmental conditions.

SMS software supports a software-initiated reboot request as part of ASR. Every domain that crashed is automatically rebooted by dsmd.

ASR is required when domain software requests a domain boot after detecting a failure that crashes the domain (for example, a panic).

There are other situations, such as detection of domain software hangs as described in Solaris Software Hang Events, where SMS initiates a domain boot as part of the recovery process.

The dsmd software ignores the OpenBoot PROM parameter, auto-boot?, which on systems without a service processor can prevent the system from automatically rebooting in power-on-reset situations. dsmd does not ignore keyswitch control. If the keyswitch is set to off or standby, the keyswitch setting is honored when determining whether a domain is subject to ASR reboot actions.
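The rule in the preceding paragraph can be illustrated with a minimal sketch. The function and constant names are hypothetical and are not part of SMS. The set of on positions (on, diag, secure) is an assumption; this section names only off, standby, and secure explicitly.

```python
# Hypothetical model of the ASR reboot rule; not SMS code.
ON_POSITIONS = {"on", "diag", "secure"}   # assumed set of "on" positions
OFF_POSITIONS = {"off", "standby"}        # positions named in the text


def subject_to_asr_reboot(keyswitch: str, auto_boot: bool) -> bool:
    """Return True if a crashed domain would be rebooted as part of ASR.

    auto_boot mirrors the OpenBoot PROM auto-boot? parameter. It is
    accepted only to show that it plays no part in the decision.
    """
    position = keyswitch.lower()
    if position not in ON_POSITIONS | OFF_POSITIONS:
        raise ValueError(f"unknown keyswitch position: {keyswitch}")
    # Only the keyswitch setting is honored; auto_boot is ignored.
    return position not in OFF_POSITIONS
```

In this sketch, a domain with its keyswitch at standby is never rebooted by ASR, whatever the value of auto-boot? is.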

Domain Reboot

In general, a fast domain reboot is possible in situations where:

Because SMS is responsible for monitoring the hardware and detecting and responding to errors, SMS decides whether or not to request a fast reboot based upon its record of hardware errors since the last boot.

Because POST controls the hardware configuration based upon a number of inputs including, but not limited to, the blacklist data (see Blacklist Editing), POST decides whether or not the hardware configuration has changed so as to preclude a fast reboot. If system management has requested a fast reboot, POST verifies that the hardware configuration implied by its current inputs matches the hardware configuration used for the last boot; if it does not, POST fails the fast-POST operation. The system management software is prepared to recover from this type of POST failure by requesting a full-test (slow) domain boot.
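The division of responsibility between SMS and POST described above can be sketched as follows. This is an illustrative model only: the function name, the representation of a hardware configuration as a set of strings, and the returned labels are all assumptions made for the example.

```python
from typing import FrozenSet, List


def plan_domain_boot(hw_errors_since_last_boot: int,
                     last_boot_config: FrozenSet[str],
                     current_config: FrozenSet[str]) -> List[str]:
    """Return the sequence of POST modes attempted for a domain reboot."""
    # SMS side: request a fast reboot only if no hardware errors
    # have been recorded since the last boot.
    if hw_errors_since_last_boot > 0:
        return ["full"]
    # POST side: the fast-POST operation fails if the configuration implied
    # by the current inputs differs from the one used for the last boot.
    if current_config != last_boot_config:
        # SMS recovers by requesting a full-test (slow) domain boot.
        return ["fast (failed)", "full"]
    return ["fast"]
```

For example, blacklisting a component between boots changes the configuration implied by the POST inputs, so the sketch reports a failed fast attempt followed by a full-test boot.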

Sun Fire high-end system management software minimizes the elapsed time taken by the part of the domain boot process that it can control.

Domain Abort or Reset

Certain error conditions can occur in a domain that require aborting the domain software or issuing a reset to the domain software or hardware. This section describes the domain abort and reset functions that are provided by dsmd.

The dsmd software provides a software-initiated mechanism to abort the Solaris OS on a domain, requesting that it panic to take a core image. No user intervention is needed.

SMS provides the reset(1M) command to enable the user to abort the domain software and issue a reset to the domain hardware.

Control is passed to the OpenBoot PROM after the reset command is issued. In the case of a user-interface-issued reset command, the OpenBoot PROM uses its default configuration to determine whether the domain is booted to the Solaris environment. In the case of a dsmd-issued reset command, the OpenBoot PROM provides parameters that force the domain to be booted to the Solaris OS.

The reset command normally sends a signal to all CPU ports of a specified domain. This is a hard reset and clears the hardware to a clean state. Using the -x option, however, reset can send an XIR signal to the processors in a specified domain. This is done in software and is considered a soft reset. An error message is given if the virtual keyswitch is in the secure position. By default, the command displays a confirmation prompt. For example:


sc0:sms-user:> reset -d C 
Do you want to send RESET to domain C? [y|n]:y
RESET to processor 4.1.0 initiated.
RESET to processor 4.1.1 initiated.
RESET initiated to all processors for domain: C

For more information, refer to the reset man page.

For information on resetting the main or spare SC see SC Reset and Reboot.

SMS software illuminates or darkens the indicator LEDs on LED-equipped hot-pluggable units (HPUs) as necessary to reflect the correct state when the HPU is given a power-on reset.


Hardware Control

Hardware control functions are those that configure and control the platform hardware. Some functions are invoked on the domain.

Power-On Self-Test (POST)

System Management Services software invokes POST in two contexts:

1. At domain boot time, POST is invoked to test and configure all functional hardware available to the domain.

POST eliminates all hardware components that fail the self-test and attempts to build a bootable domain from the functionally working hardware.

POST provides extensive diagnostics to help analyze failures. You can request that POST only verify a domain configuration, and not test it, in situations where the domain is being rebooted with no indications that a hardware failure was the cause.

2. Before a DR operation to add a system board to a domain, POST is invoked to test and configure the system board components.

If POST indicates that the candidate system board is functional, the DR operation can safely incorporate the system board into the physical (hardware) domain.

Although POST is generally invoked automatically, there are user-visible interfaces that affect automatic POST invocations:

This gives you finer-grained control over the hardware components that are used in a domain than is allowed by the standard domain configuration interfaces that operate on DCUs, such as system boards.

Blacklist Editing

SMS supports three blacklists: one for the platform, one for the domains, and the internal automatic system recovery (ASR) blacklist.

Platform and Domain Blacklisting

The editable blacklist files specify that certain hardware resources are to be considered unusable by POST. They will not be probed for, tested, or configured in the domain interconnect.

Usually these blacklist files are empty and are not required to be present.

Blacklist capability in this context is used for resource management purposes.

Blacklisting temporarily limits the system configuration to less than all the hardware present. This has several applications, such as benchmarking, limiting memory use to make DR detach of the board faster, and varying the configuration for troubleshooting.

Sun Fire high-end system POST supports two editable canonical blacklist files, one for the platform and one for the domain, located at these paths:

/etc/opt/SUNWSMS/config/platform/blacklist

/etc/opt/SUNWSMS/config/domain-id/blacklist

The two files are considered logically concatenated.
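As a rough illustration of how the two files relate, the following sketch builds the two paths and models the logical concatenation. The helper names are hypothetical, and the blacklist file format is not described here, so entries are treated as opaque strings.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Optional

CONFIG_ROOT = PurePosixPath("/etc/opt/SUNWSMS/config")


def blacklist_path(domain_id: Optional[str] = None) -> PurePosixPath:
    """Return the platform blacklist path, or a domain's if domain_id is given.

    The domain_id is used as the directory name exactly as supplied;
    how SMS spells that directory is not confirmed by this section.
    """
    subdir = "platform" if domain_id is None else domain_id
    return CONFIG_ROOT / subdir / "blacklist"


def effective_blacklist(platform_entries: Iterable[str],
                        domain_entries: Iterable[str]) -> List[str]:
    """Model the logical concatenation: platform entries, then domain entries."""
    return list(platform_entries) + list(domain_entries)
```

A component listed in either file is therefore treated as blacklisted for that domain.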



Note - The blacklist file specifies resources based on physical location. If the component is physically moved, any corresponding blacklist entries must be changed accordingly.



The blacklist file specifies blacklisted components logically (for example, by specifying their position), and the blacklist entry remains with the component position through a hot-plug operation, rather than following a specific component.


procedure icon  To Blacklist a Component

1. Log in to the SC.

You must have platform administrator, domain administrator, or configurator privileges to edit the blacklist files.

2. Type the following command:


sc0:sms-user:> disablecomponent [-d domain-indicator] location

where:

-d domain-indicator

Specifies the domain using one of the following:

domain-id - ID for a domain. Valid domain-ids are A-R and are not case sensitive.

domain-tag - Name assigned to a domain using addtag(1M).

location

List of component locations comprising:

board-loc/proc/bank/logical-bank

board-loc/proc/bank/all-dimms-on-that-bank

board-loc/proc/bank/all-banks-on-that-proc

board-loc/proc/bank/all-banks-on-that-board

board-loc/proc

board-loc/cassette

board-loc/bus

board-loc/paroli-link

If no domain-indicator is specified, the platform blacklist is edited. All component locations are separated by forward slashes. The location forms are optional and are used to specify particular components on boards in specific locations.

Multiple location arguments are permitted, separated by a space.

TABLE 8-1 Valid location Arguments for Sun Fire High-End Servers

Location                          Valid Form for Sun Fire 15K/E25K              Valid Form for Sun Fire 12K/E20K
board-loc                         SB(0...17), IO(0...17), CS(0|1), EX(0...17)   SB(0...8), IO(0...8), CS(0|1), EX(0...8)
Processor/Processor Pair (proc)   P(0...3), PP(0|1)                             P(0...3), PP(0|1)
bank                              B                                             B
logical-bank                      L(0|1)                                        L(0|1)
all-dimms-on-that-bank            D                                             D
all-banks-on-that-proc            B                                             B
all-banks-on-that-board           B                                             B
HsPCI cassette                    C(3|5)V(0|1)                                  C(3|5)V(0|1)
HsPCI+ cassette                   C3V(0|1|2) and C5V0                           C3V(0|1|2) and C5V0
bus                               ABUS|DBUS|RBUS (0|1)                          ABUS|DBUS|RBUS (0|1)
paroli-link                       PAR(0|1)                                      PAR(0|1)

Processor locations indicate single processors or processor pairs. There are four possible processors on a system board. Processor pairs on that board are procs 0 and 1, and procs 2 and 3.



Note - If you blacklist a single CPU/Mem processor in a processor pair, neither processor is used.



The MaxCPU has two processors, procs 0 and 1, and only one proc pair (PP0). disablecomponent exits and displays an error message if you use PP1 as a location for this board.
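The processor pair rules above can be expressed in a short sketch. The function names and the is_maxcpu flag are hypothetical, chosen only for this illustration.

```python
from typing import Set


def pair_of(proc: int) -> str:
    """Return the processor pair (PP0 or PP1) that contains processor proc."""
    if proc not in (0, 1, 2, 3):
        raise ValueError(f"processor must be 0-3, got {proc}")
    # Procs 0 and 1 form PP0; procs 2 and 3 form PP1.
    return f"PP{proc // 2}"


def processors_lost(blacklisted_proc: int) -> Set[int]:
    """Blacklisting one processor takes its whole pair out of use."""
    pair_index = int(pair_of(blacklisted_proc)[2])
    return {2 * pair_index, 2 * pair_index + 1}


def check_pair_location(pair: str, is_maxcpu: bool) -> None:
    """Reject PP1 on a MaxCPU board, which has only procs 0 and 1 (PP0)."""
    if pair not in ("PP0", "PP1"):
        raise ValueError(f"unknown processor pair: {pair}")
    if is_maxcpu and pair == "PP1":
        raise ValueError("PP1 is not a valid location on a MaxCPU board")
```

Blacklisting processor 2 in this model removes both processor 2 and processor 3 from use.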

The HsPCI and HsPCI+ assemblies contain hot-pluggable cassettes.

There are three bus locations: address, data, and response.



Note - Do not use the disablecomponent command to disable centerplane support boards or a bus on the system controller.




To Remove a Component From the Blacklist

1. Log in to the SC.

2. Type the following command:


sc0:sms-user:> enablecomponent [-d domain-indicator] location

where:

-d domain-indicator

Specifies the domain using one of the following:

domain-id - ID for a domain. Valid domain-ids are A-R and are not case sensitive.

domain-tag - Name assigned to a domain using addtag(1M).

location

List of component locations consisting of:

board-loc/proc/bank/logical-bank

board-loc/proc/bank/all-dimms-on-that-bank

board-loc/proc/bank/all-banks-on-that-proc

board-loc/proc/bank/all-banks-on-that-board

board-loc/proc

board-loc/cassette

board-loc/bus

board-loc/paroli-link

If no domain-indicator is specified, the platform blacklist is edited. All component locations are separated by forward slashes. The location forms are optional and are used to specify particular components on boards in specific locations.

Multiple location arguments are permitted, separated by a space.

TABLE 8-2 Valid location Arguments for Sun Fire High-End Servers

Location                          Valid Form for Sun Fire 15K/E25K              Valid Form for Sun Fire 12K/E20K
board-loc                         SB(0...17), IO(0...17), CS(0|1), EX(0...17)   SB(0...8), IO(0...8), CS(0|1), EX(0...8)
Processor/processor pair (proc)   P(0...3), PP(0|1)                             P(0...3), PP(0|1)
bank                              B                                             B
logical-bank                      L(0|1)                                        L(0|1)
all-dimms-on-that-bank            D                                             D
all-banks-on-that-proc            B                                             B
all-banks-on-that-board           B                                             B
HsPCI cassette                    C(3|5)V(0|1)                                  C(3|5)V(0|1)
HsPCI+ cassette                   C3V(0|1|2) and C5V0                           C3V(0|1|2) and C5V0
bus                               ABUS|DBUS|RBUS (0|1)                          ABUS|DBUS|RBUS (0|1)
paroli-link                       PAR(0|1)                                      PAR(0|1)


Processor locations indicate single processors or processor pairs. There are four possible processors on a CPU/Mem board. Processor pairs on that board are: procs 0 and 1, and procs 2 and 3.



Note - If you blacklist a single CPU/Mem processor in a processor pair, neither processor is used.



The MaxCPU has two processors, procs 0 and 1, and only one proc pair (PP0). The disablecomponent command exits and displays an error message if you use PP1 as a location for this board.

The HsPCI and HsPCI+ assemblies contain hot-pluggable cassettes.

There are three bus locations: address, data and response.

For more information, refer to the enablecomponent(1M) and disablecomponent(1M) man pages.

ASR Blacklist

Hardware that has failed repeatedly, perhaps intermittently, must be excluded from subsequent domain configurations for many reasons. It might be some time before the component can be physically replaced. The failed component might be a subcomponent such as one processor on a CPU board. You do not want to lose the services of the rest of the component by powering it down until it can be replaced. If the hardware is broken, you do not want to waste time having POST discover that every time it runs. If the failure is intermittent, you do not want POST to pass it, only to have it fail when the OS is running.

To this end, esmd creates and edits a separate ASR blacklist file. Components that have been powered off due to environmental conditions are automatically listed and excluded from POST. The poweron, setkeyswitch, addboard, and moveboard commands query the ASR blacklist for components to exclude. Each of these commands except poweron displays a warning message. poweron instead asks whether you would like to continue or abort powering on the component. For more information, refer to the enablecomponent(1M), disablecomponent(1M), and showcomponent(1M) man pages.

Power Control

The main SC has power control over the following components in the Sun Fire high-end system rack:

See HPU LEDs for a description of power control in the Sun Fire high-end system I/O racks.

SMS supports the domain Solaris command interface (cfgadm(1M)) by providing the rcfgadm(1M) command to request power on or off of the HsPCI adapter slots in a Sun Fire high-end system HsPCI I/O board. For more information, refer to the rcfgadm man page.

The keyswitch control interface setkeyswitch, as described in Virtual Keyswitch, enables the user to power on or off the hardware assigned to a domain.

All power operations are logged by the power control software.

The power control software conforms to all hardware requirements for powering on or off components. For example, SMS checks for adequate power available before powering on components. The power control interfaces will not perform a user-specified power on or power off operation if it violates a hardware requirement. Power operations that are performed contrary to hardware requirements or hardware suggested procedures are noted in the message logs.

By default, the power control software refuses to perform power operations that will affect running software. The power control user interfaces include methods to override this default behavior and forcibly complete the power operation at the cost of crashing running software. The use of these forcible overrides on power operations is noted in the message logs.
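The default refusal and the forcible override can be pictured with the following sketch. It is an illustrative model, not the power control software: the function name, the parameters, and the wording of the notes are assumptions, and whether refusals themselves are written to the message logs is not stated in this section.

```python
from typing import List, Tuple


def evaluate_power_request(violates_hw_requirement: bool,
                           affects_running_software: bool,
                           force: bool = False) -> Tuple[bool, List[str]]:
    """Return (performed, notes) for a requested power operation."""
    notes: List[str] = []
    # An operation that violates a hardware requirement is not performed.
    if violates_hw_requirement:
        notes.append("refused: violates a hardware requirement")
        return False, notes
    # Default behavior: refuse operations that affect running software.
    if affects_running_software and not force:
        notes.append("refused: would affect running software")
        return False, notes
    # A forcible override completes the operation and is noted.
    if affects_running_software and force:
        notes.append("forcible override used")
    notes.append("power operation performed")
    return True, notes
```

In this model the override applies only to the running-software check; it does not bypass a hardware requirement.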

As described in HPU LEDs, SMS illuminates or darkens the indicator LEDs on LED-equipped HPUs, as necessary, to reflect the correct state when the HPU is powered on or off.

Fan Control

The esmd daemon provides fan speed control for Sun Fire high-end system fans. In general, fan speeds are set to the lowest speed that provides adequate cooling, so as to minimize noise levels.

Hot-Plug Operations

Hot-plug refers to the ability to physically insert or remove a board from a powered-on platform that is actively running one or more domains without affecting those domains. During a hot-plug operation, the board is isolated from all domains.

The term for a hardware component that can be hot-plugged is hot-pluggable unit (HPU). The OK to Remove indicator LED on an HPU is illuminated when it can be safely unplugged; see HPU LEDs for more information about the OK to Remove LEDs. Board presence registers indicate whether an HPU is present or absent and sense an HPU plug or unplug.

The Sun Fire high-end system HsPCI and HsPCI+ I/O assemblies are equipped with OK to Remove indicator LEDs associated with the slots into which HsPCI and HsPCI+ I/O assemblies are plugged. Each slot is equipped with a hot-plug controller that controls power to the slot and can detect presence of an adapter in the slot. However, unlike SMS support for other Sun Fire high-end system HPUs, the software that controls hot-plug for the HsPCI and HsPCI+ I/O assemblies is part of the Solaris OS on the domain.

SMS enables you to power on and off the adapter slots.

SMS software provides software interfaces, invocable from the domain, to control hardware devices associated with the adapter slots on I/O boards.



Note - For the purposes of the remaining hot-plug discussion in this section, HPUs do not include hot-pluggable I/O adapters.



SMS software provides support as necessary to enable hot-plug servicing of all HPUs in the Sun Fire high-end system rack.

Once an HPU is isolated from all domains, the only software support required for a hot-plug operation is power-off control.

Dynamic reconfiguration (DR) isolates DCUs (system boards) from a domain by DR detaching the DCU.

Unplugging

When an HPU is unplugged, the presence indicator for the HPU detects its absence, resulting in a change in hardware configuration status as described in Hardware Configuration.

The expected mode of user interaction during hot-unplug is as follows:

Go directly to the HPU you want to unplug.

If the HPU indicator LEDs show that it is not OK to Remove, request that the HPU be powered off using the poweroff command.

If the power-off function discovers that the HPU is in use by a domain, the power-off function fails, indicating that you first must use DR to remove the HPU from active use.

Refer to the System Management Services (SMS) 1.6 Dynamic Reconfiguration User Guide for more information.
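The hot-unplug interaction described above amounts to a small decision procedure, sketched here under stated assumptions. The function name, its two boolean inputs, and the returned strings are hypothetical. In practice the in-use condition is discovered by the poweroff command itself when it fails.

```python
def next_unplug_action(ok_to_remove_lit: bool, in_use_by_domain: bool) -> str:
    """Return the next step for an operator standing at the HPU."""
    # The OK to Remove LED is lit: the HPU can be safely unplugged.
    if ok_to_remove_lit:
        return "unplug the HPU"
    # poweroff fails while a domain is using the HPU, so DR comes first.
    if in_use_by_domain:
        return "use DR to remove the HPU from the domain, then run poweroff"
    # Not in use but still powered: request power-off.
    return "run poweroff on the HPU"
```

The sketch returns one step at a time; after each step the operator re-checks the LED before proceeding.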

Plugging

The presence of a newly inserted HPU is detected and reported as a change in hardware configuration status, as described in Hardware Configuration.

SC Reset and Reboot

The SC supports software-initiated resets for the main and spare, providing the same functionality as external reset buttons on the system controller. Typically, an SC might be reset after failover. It is possible for the main SC software to reset the spare SC, if present, and vice versa. An SC cannot reset itself.


To Reset the Main or Spare SC

The resetsc(1M) command sends a reset signal to the other SC. If the other SC is not present, resetsc exits with an error.

Type the following command:


sc0:sms-user:> resetsc 
"About to reset other SC. Are you sure you want to continue?" (y or [n])? y

For more information, refer to the resetsc man page.

HPU LEDs

The LEDs reflect the status of the hot-pluggable units (HPUs). LEDs come in groups of three:

This section describes the LED control policies that are followed by SMS software for the HPUs.

Except for the system controllers, all Sun Fire high-end system HPUs are powered on and tested under control of the SMS software that runs on the main system controller.

To a certain extent, the design of the LEDs, especially their initial state upon power-on-reset, is based upon the assumption that POST is automatically initiated at power-on-reset. The only Sun Fire high-end system HPUs that meet this assumption are the system controllers. Powering on a system controller causes the processor to begin executing SC-POST code from PROM.

For all other HPUs, some are tested by POST and some are tested (or monitored) by SMS software. Although it is generally the case that testing follows shortly after power on, it is not always so.

Furthermore, it is possible that POST can be run multiple times on a powered-on HPU that is being dynamically reconfigured from one domain to another. It is also possible that POST and SMS can both detect faults on the same physical HPU. These differences in power and test control between the system controllers and other Sun Fire high-end system HPUs result in different policies for managing them.

The system controller provides three sets of HPU LEDs that indicate:

When the Sun Fire high-end system rack is powered on, power is supplied to the system controllers. The operating indicator LED and the OK to Remove indicator LEDs are initialized appropriately by the hardware. All three fault LEDs are illuminated so that the fault LEDs correctly reflect a fault, should there be a problem that prevents SC-POST from running.

SMS software, upon powering off the spare system controller, extinguishes the operating indicator LED and illuminates the OK to Remove indicator LEDs on the spare system controller. SMS software cannot adjust the operating indicator or OK to Remove LEDs after powering off the main SC, where the software is running.

SC-POST does the following:

SC-OpenBoot PROM firmware and SMS software illuminate the proper fault LEDs on the system controller after detecting a hardware error.

The following policies are used to manage LEDs on HPUs other than the system controllers.



Note - The Sun Fire high-end system correctly illuminates the operating indicator LED and correctly darkens the OK to Remove indicator LEDs when HPUs are powered on or given a power-on-reset.



On the SC, the fault LEDs are illuminated at power on, maintained on during testing, and then extinguished if no fault is found.

Faults detected after SC-POST can cause later fault LED illumination.

Except for the brief period when the SC is being tested by POST, the fault LEDs on the SC indicate that a fault has occurred since power on. The same is true (an illuminated fault LED indicates that a fault has been detected since power on) for non-SC HPUs. For every non-SC HPU that has LEDs within the Sun Fire high-end system, SMS ensures that the fault indicator LED is extinguished when a power on or power on reset occurs.