CHAPTER 10
Domain Status Functions
Status functions return measured values that characterize the state of the server hardware or software. As such, these functions are used to provide both values for status displays and input to monitoring software that periodically polls status functions and verifies that the values returned are within normal operational limits. Monitoring and event detection functions that use the status functions are described in this chapter.
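The polling pattern described above can be sketched as follows. This is a minimal illustration only, not part of SMS: the names Limit and check_once, and the idea of passing status functions as Python callables, are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Limit:
    """Hypothetical normal operating range for one measured value."""
    low: float
    high: float


def check_once(status_functions: Dict[str, Callable[[], float]],
               limits: Dict[str, Limit]) -> List[Tuple[str, float]]:
    """Call each status function once and return the (name, value)
    pairs whose value falls outside its configured limits."""
    violations = []
    for name, read in status_functions.items():
        value = read()
        limit = limits[name]
        if not (limit.low <= value <= limit.high):
            violations.append((name, value))
    return violations
```

A monitor would call check_once at a fixed interval and hand any violations to its event-handling code.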
This chapter contains the following sections:
The software state consists of status information provided by the software running in a domain. The identity of the software component currently running (for example, POST, OpenBoot PROM, or Solaris software) is available. Additional status information is available (booting, running, panicking).
SMS software provides the following commands to display the status of the software, if any, currently running in a domain:
This section describes the SMS domain status commands.
The showboards(1M) command displays the assignment information and status of the DCU, including: Location, Power, Type of board, Board status, Test status, and Domain.
If no options are specified, showboards displays all DCUs, including those that are assigned or available for the platform administrator. For the domain administrator or configurator, showboards displays only DCUs for domains for which the user has privileges, including those boards that are assigned or available and in the domain's available component list.
If domain-indicator is specified, this command displays which DCUs are assigned or available to the given domain. If the -v option is used, showboards displays all boards, including DCUs.
For examples and more information, see To Obtain Board Status and refer to the showboards man page.
The showdevices(1M) command displays configured physical devices on system boards and the resources made available by these devices. Usage information is provided by applications and subsystems that are actively managing system resources. The predicted impact of a system board DR operation can be optionally displayed by performing an offline query of managed resources.
The showdevices command gathers device information from one or more Sun Fire high-end system domains. The command uses dca(1M) as a proxy to gather the information from the domains.
For examples and more information, see To Obtain Board Status and refer to the showdevices man page.
The showenvironment(1M) command displays environmental data, including: Location, Sensor, Value, Unit, Age, and Status. For fan trays, Power, Speed, and Fan Number are displayed. For bulk power, the Power, Value, Unit, and Status are shown.
If domain-indicator is specified, environmental data relating to the domain is displayed, providing that the user has domain privileges for that domain. If a domain is not specified, all domain data permissible to the user is displayed.
DCUs (for example, CPU or I/O) belong to a domain and you must have domain privileges to view their status. Environmental data relating to such things as fan trays, bulk power, or other boards are displayed without domain permissions. You can also specify individual reports for temperatures, voltages, currents, faults, bulk power status, and fan tray status with the -p option. If the -p option is not present, all reports are shown.
For examples and more information, see Environmental Status and refer to the showenvironment man page.
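The visibility rule for environmental data when no domain is specified can be sketched as follows. This is an illustrative model only and does not reflect how showenvironment is implemented: the Reading type, its fields, and the visible_readings function are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Reading:
    """Hypothetical record for one environmental measurement."""
    location: str
    sensor: str
    value: float
    unit: str
    # Owning domain for a DCU; None for fan trays, bulk power,
    # and other boards that do not belong to a domain.
    domain: Optional[str]


def visible_readings(readings: Iterable[Reading],
                     user_domains: Set[str]) -> List[Reading]:
    """Return the readings a user may see when no domain is specified.

    DCU readings require privileges for the owning domain; readings
    for hardware outside any domain are shown without them.
    """
    privileged = {d.upper() for d in user_domains}
    return [r for r in readings
            if r.domain is None or r.domain.upper() in privileged]
```

A platform administrator corresponds to a user_domains set that contains every domain.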
The showobpparams(1M) command displays OpenBoot PROM bringup parameters. The showobpparams command enables a domain administrator to display the virtual NVRAM and REBOOT parameters passed to OpenBoot PROM by setkeyswitch(1M).
For examples and more information, see Setting the OpenBoot PROM Variables and refer to the showobpparams man page.
The showpcimode(1M) command lists the mode settings for all the PCI-X slots on a V2HPCIX I/O board in your server. The settings are specified by the setpcimode command. A slot that returns a status of normal is running in PCI-X mode. A slot that returns a status of pci_only has been forced to run in PCI mode.
If you specify an I/O board that is not a V2HPCIX board, the command returns an error.
The showplatform(1M) command displays the available component list and domain state of each domain.
A domain is identified by a domain-tag if one exists. Otherwise, it is identified by the domain-id, a letter in the set A-R. The letter set is case insensitive. The Solaris hostname is displayed if one exists. If a hostname has not been assigned to a domain, Unknown is printed.
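The identification rules above can be sketched as follows. This is an illustration only, not SMS code: the function names are hypothetical, and the sketch assumes a domain-tag is an arbitrary non-empty string.

```python
import string
from typing import Optional

# Domain IDs are the letters A through R (18 letters).
VALID_DOMAIN_IDS = set(string.ascii_uppercase[:18])


def normalize_domain_id(domain_id: str) -> str:
    """Validate a domain-id case-insensitively and return it in upper case."""
    candidate = domain_id.strip().upper()
    if candidate not in VALID_DOMAIN_IDS:
        raise ValueError(f"domain-id must be a letter A-R, got {domain_id!r}")
    return candidate


def domain_label(domain_id: str, domain_tag: Optional[str] = None) -> str:
    """Prefer the domain-tag if one exists; otherwise use the domain-id."""
    return domain_tag if domain_tag else normalize_domain_id(domain_id)


def hostname_label(hostname: Optional[str] = None) -> str:
    """Return the Solaris hostname, or 'Unknown' if none is assigned."""
    return hostname if hostname else "Unknown"
```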
TABLE 10-1 lists domain statuses.
Domain status reflects two cases: either dsmd is still trying to recover the domain, or dsmd has given up trying to recover it. In the second case, the status is always "Domain Down." In the first case, the status is either "Domain Down" or some other status. To recover from "Domain Down" in either case, run setkeyswitch off followed by setkeyswitch on.
For examples and more information, see To Obtain Domain Status and refer to the showplatform man page.
The showxirstate(1M) command displays CPU dump information after a reset pulse is sent to the processors. This save state dump can be used to analyze the cause of abnormal domain behavior. showxirstate creates a list of all active processors in that domain and retrieves the save state information for each processor, including its processor signature.
The showxirstate command data resides, by default, in /var/opt/SUNWSMS/adm/domain-id/dump.
For examples and more information, refer to the showxirstate man page.
During normal operation, the Solaris environment produces a periodic heartbeat indicator readable from the SC. The dsmd daemon detects the absence of heartbeat updates for a running Solaris system as a hung Solaris. Hangs are not detected for any software components other than the Solaris software.
Note - The Solaris software heartbeat should not be confused with the SC-to-SC (hardware) heartbeat or the heartbeat network, both used to determine the health of failover. For more information, see SC Heartbeats.
The only reflection of the Solaris heartbeat occurs when dsmd detects a failure to update the Solaris heartbeat of sufficient duration to indicate that the Solaris software is hung. Upon detection of a Solaris software hang, dsmd conducts an ASR.
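The hang detection described above amounts to a watchdog on a value that is expected to change. The following is a minimal sketch of that idea and not the dsmd implementation: the HeartbeatWatch class, the integer heartbeat counter, and the timeout value are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatWatch:
    """Tracks a sampled heartbeat value and flags a suspected hang."""
    timeout_s: float                      # hypothetical threshold, in seconds
    last_value: Optional[int] = None      # most recent heartbeat value seen
    last_change: Optional[float] = None   # time at which it last changed

    def observe(self, value: int, now: float) -> bool:
        """Record one sample; return True if the heartbeat has not
        changed for at least timeout_s seconds."""
        if self.last_value is None or value != self.last_value:
            self.last_value = value
            self.last_change = now
            return False
        return (now - self.last_change) >= self.timeout_s
```

Because hangs are detected only for a running Solaris system, a caller would apply this check only while the domain reports that the Solaris software is running.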
The hardware status functions report information about the hardware configuration, hardware failures detected, and platform environmental state.
The following hardware configuration status is available from the Sun Fire high-end system management software:
Note - The hardware configuration status available to SMS running on the SC is limited to presence or absence. It does not include information about the I/O configuration, such as where I/O adapters are plugged in and what devices are attached to those I/O adapters. Such information is available only to the software running on the domain that owns the I/O adapter.
The hardware configuration supported by functions described in this section excludes I/O adapters and I/O devices. The showboards command displays all hardware components that are present.
As described in Blacklist Editing, the current contents of the component blacklists can always be viewed and altered.
The following hardware environmental measurements are available:
The showenvironment command displays every environmental measurement that can be taken within the Sun Fire high-end system rack.
1. Log in to the SC.
Platform administrators can view any environment status on the entire platform. Domain administrators can see the environment status only for those domains for which they have privileges.
2. Type the following command:
As described in HPU LEDs, the operating indicator LEDs on Sun Fire high-end system HPUs show which HPUs are powered on, and the OK to remove LEDs show which HPUs can be unplugged.
The dsmd daemon monitors the Sun Fire high-end system hardware operational status and reports errors. Some errors are reported directly to the SC (for example, the error registers in every ASIC propagate to the SBBC on the SC, which provides an error summary register). Although some errors are indicated by an interrupt delivered to the SC, other error states might require the SC to monitor hardware registers for error indications. When a hardware error is detected, dsmd follows the established procedures for collecting and clearing the hardware error state.
The following types of errors can occur on Sun Fire high-end system hardware:
Hardware error status is generally not reported as a status. Rather, event-handling functions perform various actions when hardware errors occur such as logging errors, initiating ASR, and so forth. These functions are discussed in Chapter 11.
Note - As described in HPU LEDs, the fault LEDs, after POST completion, identify Sun Fire high-end system HPUs in which faults have been discovered since they were last powered on or subjected to a power-on reset.
Proper operation of SMS depends upon proper operation of the hardware and the Solaris software on the SC. The ability to support automatic failover from the main to the spare system controller requires properly functioning hardware and software on the spare. SMS software running on the main system controller must either be functioning sufficiently to diagnose a software or hardware failure in a manner that can be detected by the spare, or it must fail in a manner that can be detected by the spare.
SC-POST determines the status of system controller hardware. It tests and configures the system controller at power-on or power-on-reset.
The SC does not boot if the SC fails to function.
If the control board fails to function, the SC boots normally, but without access to the control board devices. The level of hardware functionality required to boot the system controller is essentially the same as that required for a standalone SC.
SC-POST writes diagnostic output to the SC console serial port (TTY-A). Additionally, SC-POST leaves a brief diagnostics status summary message in an NVRAM buffer that can be read by a Solaris driver and logged or displayed when the Solaris software boots.
SC firmware and software display information to identify and service SC hardware failures.
SC firmware and software provide a software interface that verifies that the system controller hardware is functional. This interface is used to select a working system controller as the main SC in a high-availability SC configuration.
The system controller LEDs provide visible status regarding power and detected hardware faults, as described in HPU LEDs.
Solaris software provides a level of self-diagnosis and automatic recovery (panic and reboot). Solaris software utilizes the SC hardware watchdog logic to trap hang conditions and force an automatic recovery reboot.
Four hardware paths of communication between the SCs (two Ethernet connections, the heartbeat network, and one SC-to-SC heartbeat signal) are used in the high-availability SC configuration by each SC to detect hangs or failures on the other SC.
SMS practices self-diagnosis and institutes automatic failure recovery procedures, even in non-high-availability SC configurations.
Upon recovery, SMS software either takes corrective actions as necessary to restore the platform hardware to a known, functional configuration or reports the inability to do so.
SMS software records and logs sufficient information to enable engineering diagnosis of single-occurrence software failures in the field.
SMS software takes a noticeable interval to initialize itself and become fully functional. The user interfaces behave predictably during this interval. Any rejections of user commands are clearly identified as due to system initialization, with advice to try again after a suitable interval.
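A client script can follow the "try again after a suitable interval" advice with a bounded retry loop. The sketch below is illustrative only: StillInitializing is a hypothetical exception standing in for however a caller recognizes an initialization rejection, and the attempt count and delay are arbitrary.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class StillInitializing(Exception):
    """Hypothetical signal that a command was rejected because
    the software has not finished initializing."""


def retry_until_ready(command: Callable[[], T],
                      attempts: int = 5,
                      delay_s: float = 10.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run command, retrying after delay_s seconds each time it is
    rejected for initialization, up to attempts tries in total."""
    for attempt in range(1, attempts + 1):
        try:
            return command()
        except StillInitializing:
            if attempt == attempts:
                raise  # out of tries; surface the rejection to the caller
            sleep(delay_s)
    raise ValueError("attempts must be at least 1")
```

Any other failure is passed through unchanged, so only rejections attributed to initialization are retried.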
SMS software implementation uses a distributed client-server architecture. Any errors encountered during SMS initialization due to attempts to interact with a process that has not yet completed initialization are dealt with silently.