CHAPTER 7

Domain Control

This chapter addresses the functions that provide control over domain software as well as server hardware. Control functions are invoked at the discretion of an administrator. They are also useful to SMS for providing automatic system recovery (ASR).

Domain control functionality provides control over the software running on a domain. It includes those functions that allow a domain to be booted and interrupted. Only the domain administrator can invoke the domain control functions.

This chapter includes the following sections:


Domain Boot

This section describes the various aspects of booting the Solaris operating environment in a domain running SMS software.

setkeyswitch(1M) is responsible for initiating and sequencing a domain boot. It powers on the domain hardware as required and invokes POST to test and configure the hardware in the logical domain into a Sun Fire high-end system's physical hardware domain. It downloads and initiates OpenBoot PROM as required to boot the Solaris operating environment on the domain.

Only domains that have their virtual keyswitch set appropriately are subject to boot control. See Virtual Keyswitch.

OpenBoot PROM boot parameters are stored in the domain's virtual NVRAM. osd(1M) provides those parameter values to OpenBoot PROM, which adapts the domain boot as indicated.

Certain parameters, in particular those that may not be adjustable from OpenBoot PROM itself when a domain is failing to boot, can be set by setobpparams(1M) so that they take effect at the next boot attempt.

Keyswitch On

The domain keyswitch control (Virtual Keyswitch) manually initiates domain boot.

setkeyswitch boots a properly configured domain when its keyswitch control is moved from the off or standby position to one of the on positions. This takes approximately 20 minutes.

setobpparams(1M) provides a method by which a manually initiated (keyswitch control) domain boot sequence can be stopped in OpenBoot PROM. For more information see Setting the OpenBoot PROM Variables and refer to the setobpparams man page.

Power

SMS boots all properly configured domains when the Sun Fire high-end system chassis is powered on using the poweron(1M) command. SMS shuts down all properly configured domains when the chassis is powered off using the poweroff command.

SMS checks the power state of components to determine if they are on or off and enables or disables console bus ports (where appropriate) when boards are powered on/off. poweron checks to see if a component is physically present. poweroff unconfigures DCUs from the expander and changes the expander from split-slot to nonsplit-slot when appropriate. poweroff unconfigures the expander from the centerplane when the expander is powered off and checks for voltage reading tolerances to help determine if the board is on or off.

The following components can be power controlled using the poweron and poweroff commands.


To Power System Boards On and Off From the Command Line

Platform administrators are allowed to power control the entire system and can execute these commands without a location option. Domain administrators can power control any system board assigned to their domain(s). Users with only domain privileges must supply the location option.
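The privilege rules above can be summarized as a short sketch. The role labels, the function name, and the set of assigned boards are illustrative assumptions; they are not SMS interfaces.

```python
def may_power_control(role, location=None, assigned_boards=frozenset()):
    """Return True if a power request is allowed under the rules described above.

    role            -- "platform_admin" or "domain_admin" (hypothetical labels)
    location        -- component location string, or None when it is omitted
    assigned_boards -- boards assigned to the domain administrator's domain(s)
    """
    if role == "platform_admin":
        # Platform administrators may omit the location to act on the whole system.
        return True
    if role == "domain_admin":
        # Domain administrators must name a location, and it must be a board
        # assigned to one of their domains.
        return location is not None and location in assigned_boards
    return False


print(may_power_control("platform_admin"))                       # no location needed
print(may_power_control("domain_admin"))                         # location missing
print(may_power_control("domain_admin", "SB3", {"SB3", "IO3"}))  # own board
```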

1. To power on a system component, type:

sc0:sms-user:> poweron  location

where:

location

The location of the system component you wish to power on and, if you are a domain administrator, for which you have privileges.


For more information, refer to the poweron(1M) man page.

2. To power off a system component, type:

sc0:sms-user:> poweroff  location

where:

location

The location of the system component you wish to power off and, if you are a domain administrator, for which you have privileges.


Enter y or n after the warning message:

!!!WARNING!!!WARNING!!!WARNING!!!WARNING!!!WARNING!!!
!!!WARNING!!!WARNING!!!WARNING!!!WARNING!!!WARNING!!!

This will trip the breakers on PS at PS5, which must be turned on manually!

Are you sure you want to continue to power off this component? (yes/no)? y



Note - If you are powering off a component to replace it, use the poweroff(1M) command. Do not use the breakers to power off the component; this can cause a Domain Stop.



For more information, refer to the poweroff(1M) man page.

If you try to power off the system while any domain is actively running the operating system, the command will fail and display a message in the message panel of the window. In that case, issuing a setkeyswitch domain_id standby command for the active domain(s) will gracefully shut down the processors. Then, you can reissue the command to power off.
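The ordering described above (move each active domain to standby, then reissue the power-off) can be sketched as a small helper that only builds the command lines. The helper name is hypothetical, and the setkeyswitch form is copied from the paragraph above; consult the setkeyswitch(1M) man page for the exact option syntax on your release.

```python
def poweroff_plan(active_domains):
    """Return the command lines to run, in order, ending with poweroff."""
    commands = []
    for domain_id in active_domains:
        # Gracefully shut down each active domain before removing power.
        commands.append(f"setkeyswitch {domain_id} standby")
    # With no domain running the operating system, poweroff can proceed.
    commands.append("poweroff")
    return commands


for line in poweroff_plan(["A", "C"]):
    print(line)
```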

If the platform loses power due to a power outage, pcd records and saves the last state of each domain before power was lost.


To Recover From Power Failure

If you lose power only to the SC, switch on the power to the SC. Sun Fire high-end system domains are not affected by the loss of power to one SC. If you lose power to both the SC and the domains, use the following procedure to recover from the power failure. For switch locations refer to the Sun Fire 15K/12K System Site Planning Guide.



Note - Losing power to both SCs without shutting down SMS will crash the domains.



1. Manually switch off the bulk power supplies on the Sun Fire high-end system as well as the power switch on the SC.

This prevents power surge problems that can occur when power is restored.

2. After power is restored, manually switch on the bulk power supplies on the Sun Fire high-end system.

3. Manually switch on the SC power.

This boots the SC and starts the SMS daemons. Check your SC platform message file to confirm that the SMS daemons have finished starting.

4. Wait for the recovery process to complete.

Any domain that was powered on and running the Solaris operating environment returns to the operating environment run state. Domains at OpenBoot PROM eventually return to an OpenBoot PROM run state.

The recovery process must finish before any SMS operation is performed. You can monitor the domain message files to determine when the recovery process has completed.

Domain-Requested

SMS reboots domains upon request from the domain software (Solaris software or dsmd). The domain software requests reboot services in the following situations.

Automatic System Recovery (ASR)

Automatic system recovery (ASR) consists of those procedures that restore the system to running all properly configured domains after one or more domains have been rendered inactive due to software or hardware failures or due to unacceptable environmental conditions.

SMS software supports a software-initiated reboot request as part of ASR. Every domain that crashed is automatically rebooted by dsmd.

Situations that require ASR are domain boots requested by domain software upon detecting failures that crash the domain (for example, panic).

There are other situations, such as detection of domain software hangs as described in Solaris Software Hang Events, where SMS initiates a domain boot as part of the recovery process.

dsmd ignores the OpenBoot PROM parameter, auto-boot?, which on systems without a service processor can prevent the system from automatically rebooting in power-on-reset situations. dsmd does not ignore keyswitch control. If the keyswitch is set to off or standby, the keyswitch setting will be honored in determining whether a domain is subject to ASR reboot actions.
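The decision described above can be written as a one-line predicate. This is a minimal sketch, assuming that every keyswitch position other than off and standby counts as an on position; the function name and the string values are illustrative only.

```python
def subject_to_asr_reboot(keyswitch_position, auto_boot=True):
    """Decide whether a crashed domain is a candidate for an ASR reboot.

    auto_boot is accepted only to show that it plays no part in the decision.
    """
    # The OpenBoot PROM auto-boot? parameter is deliberately ignored.
    del auto_boot
    # Only the off and standby positions exclude the domain from ASR reboot.
    return keyswitch_position not in ("off", "standby")


print(subject_to_asr_reboot("on", auto_boot=False))
print(subject_to_asr_reboot("standby"))
```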

Fast Boot

In general a fast domain reboot is possible in situations where:

Sun Fire high-end system management software minimizes the elapsed time taken by the part of the domain boot process that it can control.

Domain Abort/Reset

Certain error conditions can occur in a domain that require aborting the domain software or issuing a reset to the domain software or hardware. This section describes the domain abort/reset functions that are provided by dsmd.

dsmd provides a software-initiated mechanism to abort a domain's Solaris OS, requesting that it panic to take a core image. No user intervention is needed.

SMS provides the reset(1M) command to allow the user to abort the domain software and issue a reset to the domain hardware.

Control is passed to OpenBoot PROM after the reset command is issued. In the case of a user-interface-issued reset command, OpenBoot PROM uses its default configuration to determine whether the domain is booted to the Solaris environment. In the case of a dsmd-issued reset command, OpenBoot PROM provides parameters that force the domain to be booted to the Solaris operating environment.

reset normally sends a signal to all CPU ports of a specified domain. This is a hard reset and clears the hardware to a clean state. Using the -x option, however, reset can send an XIR signal to the processors in a specified domain. This is done in software and is considered a soft reset. An error message is given if the virtual keyswitch is in the secure position. An optional Are you sure? prompt is given by default. For example:

sc0:sms-user:> reset -d C 
Do you want to send RESET to domain C? [y|n]:y
RESET to processor 4.1.0 initiated.
RESET to processor 4.1.1 initiated.
RESET initiated to all processors for domain: C

For more information refer to the reset man page.
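The reset behavior described in this section can be summarized in a small sketch. It assumes that the secure-position check applies to both the hard reset and the -x form, which the text does not state explicitly; the function name and return values are illustrative and do not reflect the actual command implementation.

```python
def describe_reset(keyswitch_position, use_xir=False, issued_by_dsmd=False):
    """Summarize what a reset request would do, per the description above."""
    if keyswitch_position == "secure":
        # The command reports an error when the virtual keyswitch is secure.
        raise PermissionError("reset refused: virtual keyswitch is in the secure position")
    return {
        # -x sends an XIR (soft reset); the default is a hard reset.
        "signal": "XIR (soft reset)" if use_xir else "RESET (hard reset)",
        # dsmd forces a boot to Solaris; a user reset follows OpenBoot PROM defaults.
        "boot": ("forced boot to Solaris" if issued_by_dsmd
                 else "OpenBoot PROM default configuration"),
    }


print(describe_reset("on"))
print(describe_reset("on", use_xir=True, issued_by_dsmd=True))
```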

For information on resetting the main or spare SC see SC Reset and Reboot.

SMS software illuminates or darkens the indicator LEDs on LED-equipped hot-pluggable units (HPU) as necessary to reflect the correct state when the HPU is given a power-on reset.


Hardware Control

Hardware control functions are those that configure and control the platform hardware. Some functions are invoked on the domain.

Power-On Self-Test (POST)

System management services software invokes POST in two contexts.

  1. At domain boot-time, POST is invoked to test and configure all functional hardware available to the domain.

    POST eliminates all hardware components that fail self-test and attempts to build a bootable domain from the functionally working hardware.

    POST provides extensive diagnostics to report hardware test results to help analyze failures. POST may be requested only to verify a domain configuration, and not test it, in situations where the domain is being rebooted with no indications that a hardware failure was the cause.

  2. Before a DR operation to add a system board to a domain begins, POST is invoked to test and configure the system board components.

    If POST indicates that the candidate system board is functional, the DR operation can safely incorporate the system board into the physical (hardware) domain.

Although POST is generally invoked automatically, there are user visible interfaces that affect automatic POST invocations:

Blacklist Editing

SMS supports three blacklists: one for the platform, one for the domains, and the internal automatic system recovery (ASR) blacklist.

Platform and Domain Blacklisting

The editable blacklist files specify that certain hardware resources are to be considered unusable by POST. They will not be probed for, tested, or configured in the domain interconnect.

Usually these blacklist files are empty, and are not required to be present.

Blacklist capability in this context is used for resource management purposes.

Blacklisting temporarily limits the system configuration to less than all the hardware present. This has several applications, such as benchmarking, limiting memory use to make DR detach of the board faster, and varying the configuration for troubleshooting.

Sun Fire high-end system POST supports two editable canonical blacklist files, one for the platform, one for the domain, located in:

/etc/opt/SUNWSMS/config/platform/blacklist

and

/etc/opt/SUNWSMS/config/domain_id/blacklist

The two files are considered logically concatenated.
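Logical concatenation of the two files can be pictured with the sketch below. It assumes the platform entries come first and treats each non-blank line as one opaque entry; the entry syntax itself is not modeled, and the function name is hypothetical. Missing files are read as empty, since the files are not required to be present.

```python
from pathlib import Path

# Configuration root taken from the paths listed above.
CONFIG_ROOT = Path("/etc/opt/SUNWSMS/config")


def effective_blacklist(domain_id, root=CONFIG_ROOT):
    """Return the platform entries followed by the entries for one domain."""
    entries = []
    for path in (root / "platform" / "blacklist", root / domain_id / "blacklist"):
        # A blacklist file that does not exist contributes nothing.
        if path.is_file():
            entries.extend(
                line for line in path.read_text().splitlines() if line.strip()
            )
    return entries


print(effective_blacklist("A"))
```

Passing a different root makes it possible to try the function against a scratch directory rather than the live SC configuration.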



Note - The blacklist file specifies resources based on physical location. If the component is physically moved, any corresponding blacklist entries must be changed accordingly.



The blacklist specifies components logically, that is, by their position. An entry stays with the component position through a hot-swap operation rather than following a specific physical component.


To Blacklist a Component

1. Log in to the SC.

You must have platform administrator, domain administrator, or configurator privileges to edit the blacklist files.

2. Type:

sc0:sms-user:> disablecomponent [-d domain_indicator] location

where:

-d domain_indicator

Specifies the domain using one of the following:

domain_id - ID for a domain. Valid domain_ids are A-R and are not case sensitive.

domain_tag - Name assigned to a domain using addtag(1M).

location

List of component locations comprised of:

board_loc/proc/bank/logical_bank

board_loc/proc/bank/all_dimms_on_that_bank

board_loc/proc/bank/all_banks_on_that_proc

board_loc/proc/bank/all_banks_on_that_board

board_loc/proc

board_loc/cassette

board_loc/bus

board_loc/paroli_link


If no domain_indicator is specified, the platform blacklist is edited. All component locations are separated by forward slashes. The location forms are optional and are used to specify particular components on boards in specific locations.

Multiple location arguments are permitted, separated by spaces.

Location                          Valid Form for Sun Fire 15K   Valid Form for Sun Fire 12K

board_loc                         SB(0...17)                    SB(0...8)
                                  IO(0...17)                    IO(0...8)
                                  CS(0|1)                       CS(0|1)
                                  EX(0...17)                    EX(0...8)

Processor/Processor Pair (proc)   P(0...3)                      P(0...3)
                                  PP(0|1)                       PP(0|1)

bank                              B                             B

logical_bank                      L(0|1)                        L(0|1)

all_dimms_on_that_bank            D                             D

all_banks_on_that_proc            B                             B

all_banks_on_that_board           B                             B

HsPCI cassette                    C(3|5)V(0|1)                  C(3|5)V(0|1)

HsPCI+ cassette                   C3V(0|1|2) and C5V0           C3V(0|1|2) and C5V0

bus                               ABUS|DBUS|RBUS (0|1)          ABUS|DBUS|RBUS (0|1)

paroli_link                       PAR(0|1)                      PAR(0|1)


Processor locations indicate single processors or processor pairs. There are four possible processors on a CPU/Memory board. Processor pairs on that board are procs 0 and 1, and procs 2 and 3.



Note - If you blacklist a single CPU/mem processor in a processor pair, neither processor is used.



The MaxCPU has two processors, procs 0 and 1, and only one proc pair (PP0). disablecomponent exits and displays an error message if you use PP1 as a location for this board.
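The pairing rules can be made concrete with a short sketch that maps blacklisted processor tokens to the processors left unused on one board. The function name is hypothetical, and the sketch assumes the "neither processor is used" rule also applies to the single pair on a MaxCPU board.

```python
def unusable_processors(blacklisted, max_cpu=False):
    """Return the sorted processor numbers left unused on one board.

    blacklisted -- iterable of location tokens such as "P2" or "PP0"
    max_cpu     -- True for a MaxCPU board (procs 0 and 1, pair PP0 only)
    """
    pair_count = 1 if max_cpu else 2
    unused = set()
    for token in blacklisted:
        if token.startswith("PP"):
            pair = int(token[2:])
        else:
            pair = int(token[1:]) // 2  # P0, P1 -> pair 0; P2, P3 -> pair 1
        if pair >= pair_count:
            raise ValueError(f"{token} is not valid for this board")
        # Blacklisting either processor takes its whole pair out of use.
        unused.update((2 * pair, 2 * pair + 1))
    return sorted(unused)


print(unusable_processors(["P2"]))
print(unusable_processors(["P1"], max_cpu=True))
```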

The HsPCI and HsPCI+ assemblies contain hot-swappable cassettes.

There are three bus locations: address, data, and response.
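The board_loc ranges listed in the table above can be checked mechanically before a location is passed to disablecomponent. The following is a minimal sketch covering only the board_loc token; the function name and the model strings are assumptions, and it accepts only the uppercase forms shown in the table.

```python
import re

# Board type prefix followed by a board number without leading zeros.
BOARD_RE = re.compile(r"(SB|IO|CS|EX)(0|[1-9]\d?)")


def valid_board_loc(board_loc, model="15K"):
    """Check a board_loc token against the ranges for the given model."""
    match = BOARD_RE.fullmatch(board_loc)
    if not match:
        return False
    kind, number = match.group(1), int(match.group(2))
    if kind == "CS":
        # Both models list CS(0|1).
        return number in (0, 1)
    # SB, IO, and EX run 0...17 on the 15K and 0...8 on the 12K.
    limit = 17 if model == "15K" else 8
    return number <= limit


print(valid_board_loc("SB17"))
print(valid_board_loc("SB17", model="12K"))
```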



Note - Do not use the disablecomponent command to disable centerplane support boards or a bus on the system controller.




To Remove a Component From the Blacklist

1. Log in to the SC.

2. Type:

sc0:sms-user:> enablecomponent [-d domain_indicator] location

where:

-d domain_indicator

Specifies the domain using one of the following:

domain_id - ID for a domain. Valid domain_ids are A-R and are not case sensitive.

domain_tag - Name assigned to a domain using addtag(1M).

location

List of component locations comprised of:

board_loc/proc/bank/logical_bank

board_loc/proc/bank/all_dimms_on_that_bank

board_loc/proc/bank/all_banks_on_that_proc

board_loc/proc/bank/all_banks_on_that_board

board_loc/proc

board_loc/cassette

board_loc/bus

board_loc/paroli_link


If no domain_indicator is specified, the platform blacklist is edited. All component locations are separated by forward slashes. The location forms are optional and are used to specify particular components on boards in specific locations.

Multiple location arguments are permitted, separated by spaces.

Location                          Valid Form for Sun Fire 15K   Valid Form for Sun Fire 12K

board_loc                         SB(0...17)                    SB(0...8)
                                  IO(0...17)                    IO(0...8)
                                  CS(0|1)                       CS(0|1)
                                  EX(0...17)                    EX(0...8)

Processor/Processor Pair (proc)   P(0...3)                      P(0...3)
                                  PP(0|1)                       PP(0|1)

bank                              B                             B

logical_bank                      L(0|1)                        L(0|1)

all_dimms_on_that_bank            D                             D

all_banks_on_that_proc            B                             B

all_banks_on_that_board           B                             B

HsPCI cassette                    C(3|5)V(0|1)                  C(3|5)V(0|1)

HsPCI+ cassette                   C3V(0|1|2) and C5V0           C3V(0|1|2) and C5V0

bus                               ABUS|DBUS|RBUS (0|1)          ABUS|DBUS|RBUS (0|1)

paroli_link                       PAR(0|1)                      PAR(0|1)


Processor locations indicate single processors or processor pairs. There are four possible processors on a CPU/Mem board. Processor pairs on that board are: procs 0 and 1, and procs 2 and 3.



Note - If you blacklist a single CPU/mem processor in a processor pair, neither processor is used.



The MaxCPU has two processors, procs 0 and 1, and only one proc pair (PP0). disablecomponent exits and displays an error message if you use PP1 as a location for this board.

The HsPCI and HsPCI+ assemblies contain hot-swappable cassettes.

There are three bus locations: address, data and response.

For more information, refer to the enablecomponent(1M) and disablecomponent(1M) man pages.

ASR Blacklist

Hardware that has failed repeatedly, perhaps intermittently, needs to be excluded from subsequent domain configurations for many reasons. It may be some time before the component can be physically replaced. The failed component might be a subcomponent such as one processor on a CPU board. You do not want to lose the services of the rest of the component by powering it down until it can be replaced. If the hardware is broken, you do not want to waste time having POST discover that every time it runs. If the failure is intermittent, you do not want POST to pass it, only to have it fail when the OE is running.

To this end, esmd creates and edits a separate ASR blacklist file. Components that have been powered off due to environmental conditions are automatically listed and excluded from POST. poweron, setkeyswitch, addboard, and moveboard query the ASR blacklist for components to exclude. Each of these commands except poweron displays a warning message. poweron instead asks whether you would like to continue or abort powering up the component. For more information refer to the enablecomponent(1M), disablecomponent(1M), and showcomponent(1M) man pages.
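The difference in how the four commands react to an ASR-blacklisted component can be sketched as follows. The function name and the returned labels are illustrative; the actual message text and prompts are not reproduced here.

```python
# Commands that the text says query the ASR blacklist.
ASR_AWARE_COMMANDS = ("poweron", "setkeyswitch", "addboard", "moveboard")


def asr_blacklist_action(command, component, asr_blacklist):
    """Return "proceed", "warn", or "prompt" for one component."""
    if command not in ASR_AWARE_COMMANDS or component not in asr_blacklist:
        return "proceed"
    # poweron asks whether to continue or abort; the others display a warning.
    return "prompt" if command == "poweron" else "warn"


listed = {"SB4"}
print(asr_blacklist_action("poweron", "SB4", listed))
print(asr_blacklist_action("addboard", "SB4", listed))
print(asr_blacklist_action("addboard", "SB5", listed))
```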

Power Control

The main SC has power control over the following components in the Sun Fire high-end system rack:

See HPU LEDs for a description of power control in the Sun Fire high-end system I/O racks.

SMS supports the domain Solaris command interface (cfgadm(1M)) by providing the rcfgadm(1M) command to request power on or off of the HPCI adaptor slots in a Sun Fire high-end system HsPCI I/O assembly. For more information refer to the rcfgadm man page.

The keyswitch control interface, setkeyswitch, as described in Virtual Keyswitch allows the user to power on or off the hardware assigned to a domain.

All power operations are logged by the power control software.

The power control software conforms to all hardware requirements for powering on or off components. For example, SMS checks for adequate power available before powering on components. The power control interfaces will not perform a user-specified power on or power off operation if it violates a hardware requirement. Power operations that are performed contrary to hardware requirements or hardware recommended procedures are noted in the message logs.

By default, the power control software refuses to perform power operations that will affect running software. The power control user interfaces include methods to override this default behavior and forcibly complete the power operation at the cost of crashing running software. The use of these forcible overrides on power operations is noted in the message logs.
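The default-refuse policy and its override can be modeled in a few lines. This is a sketch only: the function name, the force parameter, and the message strings are assumptions chosen for illustration and do not correspond to actual SMS log output.

```python
def power_operation(operation, affects_running_software, force=False):
    """Return (performed, log_message) for a requested power operation."""
    if affects_running_software and not force:
        # Default behavior: refuse anything that would disturb running software.
        return False, f"{operation} refused: would affect running software"
    if affects_running_software:
        # A forcible override completes the operation and is noted in the log.
        return True, f"{operation} performed with forcible override"
    return True, f"{operation} performed"


print(power_operation("poweroff SB3", affects_running_software=True))
print(power_operation("poweroff SB3", affects_running_software=True, force=True))
```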

As described in HPU LEDs, SMS illuminates or darkens the indicator LEDs on LED-equipped HPUs, as necessary, to reflect the correct state when the HPU is powered on or off.

Fan Control

esmd provides the fan speed control for Sun Fire high-end system fans. In general, fan speeds are set to the lowest speed that provides adequate cooling so as to minimize noise levels.

Hot-Swap

Hot-swap refers to the ability to physically insert or remove a board from a powered-on platform that is actively running one or more domains, without affecting those domains. During a hot-swap operation, the board is isolated from all domains.

The term for a hardware component that may be hot-swapped is hot-pluggable unit (HPU). The OK to remove indicator LED on an HPU is illuminated when it can be safely unplugged; see HPU LEDs for more information about the OK to remove LEDs. Board presence registers indicate whether an HPU is present or absent and sense an HPU plug or unplug.

The Sun Fire high-end system HsPCI and HsPCI+ I/O assemblies are equipped with OK to remove indicator LEDs associated with the slots into which HsPCI and HsPCI+ I/O assemblies are plugged. Each slot is equipped with a hot-plug controller that controls power to the slot and can detect presence of an adaptor in the slot. However, unlike SMS support for other Sun Fire high-end system HPUs, the software that controls hot-swap for the HsPCI and HsPCI+ I/O assemblies is part of the Solaris environment on the domain.

SMS allows you to power on and off the adaptor slots.

SMS software provides software interfaces, invocable from the domain, to control hardware devices associated with the adaptor slots on I/O boards.

For the purposes of the remaining hot-swap discussion in this section, HPUs do not include hot-swappable I/O adaptors.

SMS software provides support as necessary to allow hot-swap servicing of all HPUs in the Sun Fire high-end system rack.

Once an HPU is isolated from all domains the only software support required for hot-swap is power-off control.

Dynamic reconfiguration (DR) is used to isolate DCUs (system boards) from a domain by DR detaching the DCU.

Hot-Unplug

When an HPU is unplugged, the presence indicator for the HPU detects its absence resulting in a change in hardware configuration status as described in Hardware Configuration.

The expected mode of user interaction during hot-unplug is as follows:

Go directly to the HPU you wish to unplug. If the HPU indicator LEDs show that it is not OK to remove, request that the HPU be powered off using the poweroff command. If the power-off function discovers that the HPU is in use by a domain, the power-off function will fail, indicating that you first must use DR to remove the HPU from active use. Refer to the System Management Services (SMS) 1.4 Dynamic Reconfiguration User Guide for more information.
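The interaction above amounts to a small decision sequence, sketched here as a function that returns the next step. The function name, parameters, and wording of the returned steps are illustrative assumptions.

```python
def hot_unplug_next_step(ok_to_remove_lit, in_use_by_domain):
    """Return the next action for an operator standing at the HPU."""
    if ok_to_remove_lit:
        # The OK to remove LED is illuminated: the HPU can be unplugged safely.
        return "unplug the HPU"
    if in_use_by_domain:
        # poweroff would fail while a domain is still using the HPU.
        return "use DR to remove the HPU from active use, then retry poweroff"
    return "run poweroff for the HPU, then check the OK to remove LED"


print(hot_unplug_next_step(ok_to_remove_lit=False, in_use_by_domain=True))
print(hot_unplug_next_step(ok_to_remove_lit=False, in_use_by_domain=False))
print(hot_unplug_next_step(ok_to_remove_lit=True, in_use_by_domain=False))
```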

Hot-Plug

The presence of a newly inserted HPU will be detected and reported as a change in hardware configuration status as described in Hardware Configuration.

SC Reset and Reboot

The SC supports software-initiated resets for the main and spare, providing the same functionality as external reset buttons on the system controller. Typically, an SC might be reset after failover. It is possible for the main SC software to reset the spare SC, if present, and vice versa. An SC cannot reset itself.


To Reset the Main or Spare SC

resetsc(1M) sends a reset signal to the other SC. If the other SC is not present, resetsc exits with an error.

1. Type:

sc0:sms-user:> resetsc 
"About to reset other SC. Are you sure you want to continue?" (y or [n])? y

For more information, refer to the resetsc man page.

HPU LEDs

The LEDs reflect the status of the hot-pluggable units (HPU). LEDs come in groups of three:

This section describes the LED control policies that are followed by SMS software for the HPUs.

Except for the system controllers, all Sun Fire high-end system HPUs are powered on and tested under control of the SMS software that runs on the main system controller.

To a certain extent, the design of the LEDs, especially their initial state upon power-on-reset, is based upon the assumption that POST is automatically initiated at power-on-reset. The only Sun Fire high-end system HPUs that meet this assumption are the system controllers. Powering on a system controller causes the processor to begin executing SC-POST code from PROM.

For all other HPUs, some are tested by POST and some are tested (or monitored) by SMS software, and although it is generally the case that testing follows shortly after power on, it is not always so.

Furthermore, it is possible that POST can be run multiple times on a powered-on HPU that is being dynamically reconfigured from one domain to another. It is also possible that POST and SMS can both detect faults on the same physical HPU. These differences in power and test control between the system controllers and other Sun Fire high-end system HPUs result in different policies proposed to manage them.

The system controller provides three sets of HPU LEDs:

When the Sun Fire high-end system rack is powered on, power is supplied to the system controllers. The operating indicator LED and the OK to remove indicator LEDs are initialized appropriately by the hardware. All three fault LEDs are illuminated so that they correctly reflect a fault should a problem prevent SC-POST from running.

SMS software, upon powering off the spare system controller, extinguishes the operating indicator LED and illuminates the OK to remove indicator LEDs on the spare system controller. SMS software cannot adjust the operating indicator or OK to remove indicator LEDs after powering off the main SC, where the software is running.

SC-POST does the following:

SC-OpenBoot PROM firmware and SMS software illuminate the proper fault LED(s) on the system controller after detecting a hardware error.

The following policies are used to manage LEDs on HPUs other than the system controllers.