CHAPTER 9

Domain Status

Status functions return measured values that characterize the state of the server hardware or software. As such, these functions are used both to provide values for status displays and input to monitoring software that periodically polls status functions and verifies that the values returned are within normal operational limits. Monitoring and event detection functions that use the status functions are described in this chapter.

This chapter includes the following sections:

- Software Status
- Hardware Status
- SC Hardware and Software Status


Software Status

The software state consists of status information provided by the software running in a domain. The identity of the software component currently running (for example, POST, OpenBoot PROM, or Solaris software) is available. Additional status information is available (booting, running, panicking).

SMS software provides the following commands to display the status of the software, if any, currently running in a domain.

Status Commands

showboards Command

showboards(1M) displays the assignment information and status of the DCUs. These include the following: Location, Power, Type of board, Board status, Test status, and Domain.

If no options are specified, showboards displays all DCUs for the platform administrator, including those that are assigned or available. For the domain administrator or configurator, showboards displays only the DCUs of domains for which the user has privileges, including boards that are assigned or available and in the domain's available component list.

If domain_indicator is specified, this command displays which DCUs are assigned or available to the given domain. If the -a option is used, showboards displays all boards including DCUs.

For examples and more information, see To Obtain Board Status and refer to the showboards man page.

showdevices Command

showdevices(1M) displays configured physical devices on system boards and the resources made available by these devices. Usage information is provided by applications and subsystems that are actively managing system resources. The predicted impact of a system board DR operation may be optionally displayed by performing an offline query of managed resources.

showdevices gathers device information from one or more Sun Fire high-end system domains. The command uses dca(1M) as a proxy to gather the information from the domains.

For examples and more information, see To Obtain Device Status and refer to the showdevices man page.

showenvironment Command

showenvironment(1M) displays environmental data, including Location, Device, Sensor, Value, Unit, Age, and Status. For fan trays, Power, Speed, and Fan Number are displayed. For bulk power, Power, Value, Unit, and Status are shown.

If a domain_indicator is specified, environmental data relating to that domain is displayed, provided that the user has domain privileges for that domain. If a domain is not specified, all domain data that the user is permitted to see is displayed.

DCUs (for example, CPU, I/O) belong to a domain and you must have domain privileges to view their status. Environmental data relating to such things as fan trays, bulk power, or other boards are displayed without domain permissions. You can also specify individual reports for temperatures, voltages, currents, faults, bulk power status, and fan tray status with the -p option. If the -p option is not present, all reports will be shown.
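The visibility rule above can be sketched in Python as a simple filter. This is only an illustration of the rule as described, not how showenvironment is implemented; the Reading type, its fields, and the location names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Reading:
    """One environmental reading (hypothetical structure for illustration)."""
    location: str           # illustrative location name, e.g. a board or fan tray
    value: float
    unit: str
    domain: Optional[str]   # owning domain for DCU readings, None for shared hardware


def visible_readings(readings: Iterable[Reading], user_domains: Set[str]) -> List[Reading]:
    """Keep shared-hardware readings plus DCU readings for domains the user may view."""
    return [r for r in readings if r.domain is None or r.domain in user_domains]


# Example: a user with privileges for domain A only.
sample = [
    Reading("SB0", 41.0, "C", "A"),      # DCU in domain A: visible
    Reading("SB1", 43.5, "C", "B"),      # DCU in domain B: hidden
    Reading("FT0", 12.0, "V", None),     # fan tray: visible without domain privileges
]
shown = visible_readings(sample, {"A"})
```

Here a reading with no owning domain stands in for fan trays, bulk power, and other boards, which are displayed regardless of domain privileges.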

For examples and more information, see Environmental Status and refer to the showenvironment man page.

showobpparams Command

showobpparams(1M) displays OpenBoot PROM bringup parameters. showobpparams allows a domain administrator to display the virtual NVRAM and REBOOT parameters passed to OpenBoot PROM by setkeyswitch(1M).

For examples and more information, see Setting the OpenBoot PROM Variables and refer to the showobpparams man page.

showplatform Command

showplatform(1M) displays the available component list and domain state of each domain.

A domain is identified by a domain_tag if one exists. Otherwise it is identified by the domain_id, a letter in the set A-R. The letter set is case insensitive. The Solaris hostname is displayed if one exists. If a hostname has not been assigned to a domain, Unknown is printed.
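As a rough sketch of the identification rules just described, the following Python helpers choose the label for a domain and its hostname. The function names are hypothetical and are not part of SMS; only the rules themselves (domain_tag preferred, domain_id in A-R, case insensitive, Unknown for a missing hostname) come from the text above.

```python
import string
from typing import Optional

# Domain IDs are the 18 letters A through R; matching is case insensitive.
VALID_DOMAIN_IDS = set(string.ascii_uppercase[:18])


def normalize_domain_id(domain_id: str) -> str:
    """Return the upper-case domain ID, or raise ValueError if it is not in A-R."""
    candidate = domain_id.strip().upper()
    if candidate not in VALID_DOMAIN_IDS:
        raise ValueError(f"domain_id must be a letter in A-R, got {domain_id!r}")
    return candidate


def domain_label(domain_id: str, domain_tag: Optional[str] = None) -> str:
    """Prefer the domain_tag when one exists; otherwise fall back to the domain_id."""
    return domain_tag if domain_tag else normalize_domain_id(domain_id)


def hostname_label(hostname: Optional[str]) -> str:
    """Show the Solaris hostname, or 'Unknown' when none has been assigned."""
    return hostname if hostname else "Unknown"
```

For instance, domain_label("b") yields "B", while domain_label("b", "eng2") yields the tag "eng2".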

The following is a list of domain statuses:

Status                        Description
Unknown                       The domain state could not be determined. For Ethernet addresses, it indicates that the domain idprom image file does not exist. You need to contact your Sun service representative.
Powered Off                   The domain is powered off.
Keyswitch Standby             The keyswitch for the domain is in STANDBY position.
Running Domain POST           The domain power-on self-test is running.
Loading OBP                   The OpenBoot PROM for the domain is being loaded.
Booting OBP                   The OpenBoot PROM for the domain is booting.
Running OBP                   The OpenBoot PROM for the domain is running.
In OBP Callback               The domain has been halted and has returned to the OpenBoot PROM.
Loading Solaris               The OpenBoot PROM is loading the Solaris software.
Booting Solaris               The domain is booting the Solaris software.
Domain Exited OBP             The domain OpenBoot PROM exited.
OBP Failed                    The domain OpenBoot PROM failed.
OBP in sync Callback to OS    The OpenBoot PROM is in sync callback to the Solaris software.
Exited OBP                    The OpenBoot PROM has exited.
In OBP Error Reset            The domain is in OpenBoot PROM due to an error reset condition.
Solaris Halted in OBP         Solaris software is halted and the domain is in OpenBoot PROM.
OBP Debugging                 The OpenBoot PROM is being used as a debugger.
Environmental Domain Halt     The domain was shut down due to an environmental emergency.
Booting Solaris Failed        OpenBoot PROM running, boot attempt failed.
Loading Solaris Failed        OpenBoot PROM running, loading attempt failed.
Running Solaris               Solaris software is running on the domain.
Solaris Quiesce In-Progress   A Solaris software quiesce is in progress.
Solaris Quiesced              Solaris software has quiesced.
Solaris Resume In-Progress    A Solaris software resume is in progress.
Solaris Panic                 Solaris software has panicked, panic flow has started.
Solaris Panic Debug           Solaris software panicked, and is entering debugger mode.
Solaris Panic Continue        Exited debugger mode and continuing panic flow.
Solaris Panic Dump            Panic dump has started.
Solaris Halt                  Solaris software is halted.
Solaris Panic Exit            Solaris software exited as a result of a panic.
Environmental Emergency       An environmental emergency has been detected.
Debugging Solaris             Debugging Solaris software; this is not a hung condition.
Solaris Exited                Solaris software has exited.
Domain Down                   The domain is down and the keyswitch is in the ON, DIAG, or SECURE position.
In Recovery                   The domain is in the midst of an automatic system recovery.


Domain status reflects one of two cases: either dsmd is busy trying to recover the domain, or dsmd has given up trying to recover the domain. In the second case you always see "Domain Down." In the first case you see either "Domain Down" or some other status. To recover from "Domain Down" in either case, run setkeyswitch off followed by setkeyswitch on:

sc0:sms-user:> setkeyswitch off
sc0:sms-user:> setkeyswitch on

For examples and more information, see To Obtain Domain Status and refer to the showplatform man page.

showxirstate Command

showxirstate(1M) displays CPU dump information after a reset pulse is sent to the processors. This save state dump can be used to analyze the cause of abnormal domain behavior. showxirstate creates a list of all active processors in that domain and retrieves the save state information for each processor, including its processor signature.

showxirstate data resides, by default, in /var/opt/SUNWSMS/adm/domain_id/dump.
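As a small illustration, the default location can be built in Python for a given domain. This sketch assumes that the domain_id component of the path is the upper-case domain letter (A-R); the helper name is hypothetical, and the path should be checked on the SC before relying on it.

```python
from pathlib import PurePosixPath

# Default SMS administrative directory named in the text above.
SMS_ADM_ROOT = PurePosixPath("/var/opt/SUNWSMS/adm")


def xir_dump_dir(domain_id: str) -> PurePosixPath:
    """Return the default showxirstate dump directory for a domain.

    Assumes the domain_id path component is the upper-case letter A-R.
    """
    domain = domain_id.strip().upper()
    if len(domain) != 1 or not ("A" <= domain <= "R"):
        raise ValueError(f"domain_id must be a letter in A-R, got {domain_id!r}")
    return SMS_ADM_ROOT / domain / "dump"
```

Under that assumption, xir_dump_dir("a") gives /var/opt/SUNWSMS/adm/A/dump.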

For examples and more information, refer to the showxirstate man page.

Solaris Software Heartbeat

During normal operation, the Solaris environment produces a periodic heartbeat indicator readable from the SC. dsmd treats the absence of heartbeat updates from a running Solaris system as a Solaris hang. Hangs are not detected for any software components other than the Solaris software.



Note - The Solaris software heartbeat should not be confused with the SC-to-SC (hardware) heartbeat or the heartbeat network, both used to determine the health of failover. For more information, see SC Heartbeats.



The Solaris heartbeat has a visible effect only when dsmd detects that the heartbeat has not been updated for long enough to indicate that the Solaris software is hung. Upon detecting a Solaris software hang, dsmd conducts an ASR.
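The general idea of heartbeat-based hang detection can be sketched in Python as follows. This is not the dsmd implementation: modeling the heartbeat as a changing counter, the class and method names, and the 30-second timeout are all assumptions made for illustration, since the text gives no threshold.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Track the last observed heartbeat value and when it last changed."""
    hang_timeout_s: float        # assumed threshold; the source gives no value
    last_value: int = -1
    last_change_s: float = 0.0

    def observe(self, value: int, now_s: float) -> bool:
        """Record a sample; return True when the heartbeat has been stale too long."""
        if value != self.last_value:
            # The heartbeat advanced, so the software is considered alive.
            self.last_value = value
            self.last_change_s = now_s
            return False
        # No update since the last change: compare the stale time to the threshold.
        return (now_s - self.last_change_s) >= self.hang_timeout_s


# Example: the heartbeat stops advancing after the second sample.
monitor = HeartbeatMonitor(hang_timeout_s=30.0)
samples = [(1, 0.0), (2, 10.0), (2, 25.0), (2, 40.0)]
results = [monitor.observe(value, now) for value, now in samples]
```

In this sketch a hang is reported only once the value has stayed unchanged for the full timeout, which mirrors the "sufficient duration" condition described above. What happens next (an ASR, in the case of dsmd) is outside the scope of the example.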


Hardware Status

The hardware status functions report information about the hardware configuration, hardware failures detected, and platform environmental state.

Hardware Configuration

The following hardware configuration status is available from the Sun Fire high-end system management software:



Note - The hardware configuration status available to SMS running on the SC is limited to presence/absence. It includes no information about the I/O configuration, such as where I/O adaptors are plugged in and what devices are attached to those I/O adaptors. Such information is available only to the software running on the domain that owns the I/O adaptor.



The hardware configuration supported by functions described in this section excludes I/O adaptors and I/O devices. showboards displays all hardware components that are present.

As described in Blacklist Editing, the current contents of the component blacklist(s) can always be viewed and altered.

Environmental Status

The following hardware environmental measurements are available:

showenvironment displays every environmental measurement that can be taken within the Sun Fire high-end system rack.


To Display the Environment Status for Domain A

1. Log in to the SC.

Platform administrators can view any environment status on the entire platform. Domain administrators can see the environment status only for those domains for which they have privileges.

2. Type:

sc0:sms-user:> showenvironment -d A

As described in HPU LEDs, the operating indicator LEDs on Sun Fire high-end system HPUs show which HPUs are powered on, and the OK to remove indicator LEDs show which HPUs can be unplugged.

Hardware Error Status

dsmd monitors the Sun Fire high-end system hardware operational status and reports errors. The occurrence of some errors is reported directly to the SC (for example, the error registers in every ASIC propagate to the SBBC on the SC, which provides an error summary register). Although the occurrence of some errors is indicated by an interrupt delivered to the SC, some error states may require the SC to monitor hardware registers for error indications. When a hardware error is detected, esmd follows the established procedures for collecting and clearing the hardware error state.

The following types of errors can occur on Sun Fire high-end system hardware:

Hardware error status is generally not reported as a status. Rather, when hardware errors occur, event handling functions perform various actions such as logging errors, initiating ASR, and so forth. These functions are discussed in Domain Events.



Note - As described in HPU LEDs, the fault LEDs, after POST completion, identify Sun Fire high-end system HPUs in which faults have been discovered since they were last powered on or subjected to a power-on reset.




SC Hardware and Software Status

Proper operation of SMS depends upon proper operation of the hardware and the Solaris software on the SC. The ability to support automatic failover from the main to the spare system controller requires properly functioning hardware and software on the spare. SMS software running on the main system controller must either be functioning sufficiently to diagnose a software or hardware failure in a manner that can be detected by the spare or it must fail in a manner that can be detected by the spare.

SC-POST determines the status of system controller hardware. It tests and configures the system controller at power-on or power-on reset.

The SC does not boot if the SC fails to function.

If the control board fails to function, the SC boots normally, but without access to the control board devices. The level of hardware functionality required to boot the system controller is essentially the same as that required for a standalone SC.

SC-POST writes diagnostic output to the SC console serial port (TTY-A). Additionally, SC-POST leaves a brief diagnostics status summary message in an NVRAM buffer that can be read by a Solaris driver and logged and/or displayed when the Solaris software boots.

SC firmware and software display information to identify and service SC hardware failures.

SC firmware and software provide a software interface that verifies that the system controller hardware is functional. This interface is used to select a working system controller as main in a high-availability SC configuration.

The system controller LEDs provide visible status regarding power and detected hardware faults as described in HPU LEDs.

Solaris software provides a level of self-diagnosis and automatic recovery (panic and reboot). Solaris software utilizes the SC hardware watchdog logic to trap hang conditions and force an automatic recovery reboot.

There are four hardware paths of communication between the SCs (two Ethernet connections, the heartbeat network, and one SC-to-SC heartbeat signal) that are used in the high-availability SC configuration by each SC to detect hangs or failures on the other SC.

SMS practices self-diagnosis and institutes automatic failure recovery procedures, even in non-high-availability SC configurations.

Upon recovery, SMS software either takes corrective actions as necessary to restore the platform hardware to a known, functional configuration or reports the inability to do so.

SMS software records and logs sufficient information to allow engineering diagnosis of single-occurrence software failures in the field.

SMS software takes a noticeable interval to initialize itself and become fully functional. The user interfaces behave predictably during this interval. Any rejections of user commands are clearly identified as due to system initialization with advice to try again after a suitable interval.
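A script that issues SMS commands shortly after startup could follow the "try again after a suitable interval" advice with a retry loop like the Python sketch below. Everything here is hypothetical: the command callable, the way an initialization rejection is recognized, and the attempt count and delay are assumptions, because the text does not specify the wording of the rejection or what counts as a suitable interval.

```python
import time
from typing import Callable, Tuple

Result = Tuple[bool, str]  # (succeeded, message)


def run_with_retry(
    command: Callable[[], Result],
    is_initializing: Callable[[str], bool],
    attempts: int = 5,
    delay_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Result:
    """Retry a command only while it is rejected because SMS is still initializing."""
    message = ""
    for attempt in range(attempts):
        ok, message = command()
        if ok:
            return True, message
        if not is_initializing(message):
            # A different kind of failure: retrying would not help.
            return False, message
        if attempt < attempts - 1:
            sleep(delay_s)
    return False, message


# Example with a stand-in command that succeeds on the third call.
replies = iter([(False, "initializing"), (False, "initializing"), (True, "done")])
waits = []
outcome = run_with_retry(
    lambda: next(replies),
    lambda msg: "initializing" in msg,   # assumed marker, not a real SMS message
    delay_s=0.0,
    sleep=waits.append,
)
```

Passing the sleep function as a parameter keeps the sketch easy to exercise without real delays. Failures that are not identified as initialization rejections are returned immediately rather than retried.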

SMS software implementation uses a distributed client/server architecture. Any errors encountered during SMS initialization, due to attempts to interact with a process that has not yet completed initialization, are dealt with silently.