CHAPTER 3
SMS Internals
SMS operations are generally performed by a set of daemons and commands. This chapter provides an overview of how SMS works and describes the SMS daemons, processes, commands, and system files. For more information, refer to the System Management Services (SMS) 1.4 Reference Manual.
Caution - Changes made to files in /opt/SUNWSMS can cause serious damage to the system. Only very experienced system administrators should risk changing the files described in this chapter.
This chapter includes the following sections:
The following events take place when SMS boots:
User powers on the Sun Fire high-end (CPU/disk and CD-ROM) platform. The Solaris operating environment on the SC boots automatically.
During the boot process, the /etc/init.d/sms script is called. This script, for security reasons, disables forwarding, broadcast, and multicasting over the MAN network. It then starts the SMS software by invoking a background process, which starts and monitors ssd. ssd is the SMS startup daemon responsible for starting and monitoring all the SMS daemons and servers.
ssd(1M) in turn invokes: mld, pcd, hwad, tmd, dsmd, esmd, mand, osd, dca, efe, codd, efhd, elad, erd, smnptd, picld, and wcapp.
For more information, see SMS Daemons and Message Logging. For efe, refer to the latest Sun Management Center documentation available at http://docs.sun.com.
Once the daemons are running, you can use SMS commands such as console.
SMS startup can take a few minutes. During this time, any command you run returns an error message indicating that SMS has not completed startup. When startup is complete, the message "SMS software start-up complete" is posted to the platform log, which you can view using the showlogs(1M) command.
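As an illustration only, the following Python sketch scans platform log text for the completion message quoted above. The function name is hypothetical and the sample lines are invented for the example. On a real SC the platform log is normally viewed with showlogs(1M).

```python
# Completion message as quoted in this chapter.
STARTUP_MESSAGE = "SMS software start-up complete"


def sms_startup_complete(log_lines):
    """Return True if any platform log line carries the completion message."""
    return any(STARTUP_MESSAGE in line for line in log_lines)


# Invented sample lines, not real SMS log output.
sample = [
    "example line: daemons starting",
    "example line: SMS software start-up complete",
]
print(sms_startup_complete(sample))
```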
The SMS 1.4 daemons play a central role on Sun Fire high-end systems. Daemons are persistent processes that provide SMS services to clients using an API.
Daemons are always running, initiated at system startup, and restarted whenever necessary. Each daemon is fully described in its corresponding man page (with the exception of efe, which is referenced separately in the Sun Management Center documentation).
This section describes the SMS daemons and their relationships to one another, and identifies which CLIs (if any) access them.
FIGURE 3-1 illustrates the Sun Fire high-end system software components and their high-level interaction.
The capacity on demand daemon (codd(1M)) is a process that runs on the main system controller (SC).
This process does the following:
Monitors the COD resources being used and verifies that the resources used are in agreement with the licenses in the COD license database.
Provides information on installed licenses, resource use, and board status.
Handles the requests to add or delete COD license keys.
Configures headroom quantities and domain right-to-use (RTU) license reservations.
FIGURE 3-2 illustrates the CODD client server relationships to the SMS daemons and CLI commands.
dca(1M) supports remote dynamic reconfiguration (DR) by enabling communication between applications and the domain configuration server (dcs) running on a Solaris 8 or Solaris 9 domain. One dca per domain runs on the SC. Each dca communicates with its dcs over the Management Network (MAN).
ssd(1M) starts dca when the domain is brought up. ssd restarts dca if it is killed while the domain is still running. dca is terminated when the domain is shut down.
dca is an SMS application that waits for dynamic reconfiguration requests. When a DR request arrives, dca creates a dcs session. Once a session is established, dca forwards the request to dcs. dcs attempts to honor the DR request and sends the results of the operation to the dca. Once the results have been sent, the session is ended. The remote DR operation is complete when dca returns the results of the DR operation.
FIGURE 3-3 illustrates the DCA client server relationships to the SMS daemons and CLIs.
dsmd(1M) monitors domain state signatures, CPU reset conditions and Solaris heartbeat for up to 18 domains on a Sun Fire 15K and up to nine on a Sun Fire 12K system. It also handles domain stop events related to hardware failure.
dsmd detects timeouts that can occur in reboot transition flow and panic transition flow, and handles various domain hung conditions.
dsmd notifies the domain X server (dxs(1M)) and Sun Management Center of all domain state changes and automatically recovers the domain based on the domain state signature, domain stop events, and automatic system recovery (ASR) Policy. ASR Policy consists of those procedures that restore the system to running all properly configured domains after one or more domains have been rendered inactive. This can be due to software or hardware failures or to unacceptable environmental conditions. For more information, see Automatic System Recovery (ASR) and Domain Stop Events.
dsmd also passes automatic diagnosis (AD) information related to the domain stop to efhd.
FIGURE 3-4 illustrates DSMD client server relationships to the SMS daemons and CLIs.
dxs(1M) provides software support for a running domain. This support includes virtual console functionality, dynamic reconfiguration support, and HPCI support. dxs handles domain driver requests and events. dxs provides an interface for getting and setting HPCI slot status. The slot status includes cassette presence, power, frequency, and health of the cassette. This interface makes it possible to power control HPCI cassettes for hot plug operations.
The virtual console functionality allows one or more users running the console program to access the domain's virtual console. dxs acts as a link between SMS console applications and the domain virtual console drivers.
A Sun Fire 15K system can support up to 18 different domains. A Sun Fire 12K system can support up to nine domains. Each domain may require software support from the SC, and dxs provides that support. The following domain-related projects require dxs support:
There is one domain X server for each Sun Fire high-end system domain. dxs is started by ssd for every active domain, that is, a domain running OS software, and terminated when the domain is shut down.
FIGURE 3-5 illustrates DXS client server relationships to the SMS daemons.
efhd(1M) does the following:
Performs automatic error diagnosis based on the domain stop information passed by dsmd(1M).
Updates the component health status for those components that have been associated with a fault, as determined by the diagnosis engine (SMS or the Solaris operating environment) or by POST.
Passes the fault event to erd(1M) for error reporting.
FIGURE 3-6 illustrates EFHD client server relationships to the SMS daemons.
elad(1M) controls access to the SMS event log, which records fault and error events identified by the automatic diagnosis (AD) or POST diagnosis engines on a Sun Fire high-end system. elad also archives events when the event log fills.
FIGURE 3-7 illustrates the ELAD client server relationships to the SMS daemons and CLI commands.
erd(1M) provides reporting services that deliver fault event text messages to the platform and domain logs, fault event information to Sun Management Center and SRS Net Connect, and email that contains fault event messages.
erd reads the email control file and the email template file each time email event notification occurs.
FIGURE 3-8 illustrates the ERD client server relationships to the SMS daemons.
esmd(1M) monitors system cabinet environmental conditions, for example, voltage, temperature, fan tray, power supply and clock phasing. esmd logs abnormal conditions and takes action to protect the hardware, if necessary.
See Environmental Events for more information on esmd.
FIGURE 3-9 illustrates ESMD client server relationships to the SMS daemons.
fomd(1M) is the core of the SC failover mechanism. fomd detects faults on the local and remote SCs and takes the appropriate action (directing a failover or takeover). fomd tests and ensures that important configuration data is kept synchronized between both SCs. fomd runs on both the main and spare SCs.
For more information on fomd see SC Failover.
FIGURE 3-10 illustrates FOMD client server relationships to the SMS daemons.
frad(1M) is the field-replaceable unit (FRU) access daemon for SMS. frad provides controlled access to any serial electrically erasable programmable read-only memory (SEEPROM) within the Sun Fire high-end platform that is accessible by the SC. frad supports dynamic FRUID which provides improved FRU data access using the Solaris platform information and control library daemon (PICLD). FRU identification is for Sun Service use only and transparent to the user.
FIGURE 3-11 illustrates FRAD client server relationships to the SMS daemons.
hwad(1M) provides hardware access to SMS daemons and is the exclusive mechanism through which all daemons access, control, monitor, and configure the hardware.
hwad runs in either main or spare mode when it comes up. The failover daemon (fomd(1M)) determines which role hwad plays.
On both the main and spare, hwad:
Opens all the drivers (sbbc, echip, gchip, and consbus) and uses ioctl(2) calls to interface with them.
Configures the local system clock and sets the clock source for each board present in the system.
Disables SC to SC interrupt.
Disables DARB interrupts by clearing SBBC system interrupt enable register.
Creates an echip interface, which waits for any interrupt coming from the Echip driver. At startup this is the SC heartbeat interrupt.
Reads the contents of the device presence register to identify the boards present in the system and makes them accessible to the clients.
Takes control of I2C steering and initializes all board objects present in the machine.
Checks that clocks are phase locked. If they are, hwad checks that all clock sources are pointing to the main SC. If the clocks are not phase locked, hwad does not change any clock sources and disables automatic clock switch.
Initializes the DARB interrupt, enables DARB interrupt, and enables PCI interrupt generation. Disables clock failure interrupt in gchip, disables console bus error interrupt in Echip, disables power supply failure interrupt in echip.
Initializes the interrupt handler for events and creates threads to service events for mand, dsmd and each osd.
Creates the IOSRAM interfaces for 18 domains. This enables communication between the SC and the domain.
Sets the spare SC clock to the main SC clock. Also sets the reference select to 0. Initializes SC to SC interrupt.
hwad directs communication to the IOSRAM (tunnel switch) for dynamic reconfiguration (DR).
hwad notifies dsmd(1M) if there is a dstop or rstop. It also notifies related SMS daemon(s) depending on the type of the Mbox interrupt that occurs.
hwad detects and logs console bus and JTAG errors.
Hardware access to a Sun Fire high-end system on the SC is done either by going through the PCI bus or console bus. Through the PCI bus you can access:
Through the Console bus you can access:
Various ASICs internal registers
Read/write chips
Local I2C devices on various boards for temperature and chip level power control/status.
FIGURE 3-12 illustrates HWAD client server relationships to the SMS daemons and CLIs.
The key management daemon provides a mechanism for managing security for socket communications between the SC and the domains.
The current default configuration includes authentication policies for the dca(1M) and dxs(1M) clients on the SC, which connect to the dcs(1M) and cvcd(1M) servers on a domain.
kmd(1M) manages the IPSec security associations (SAs) needed to secure the communication between the SC and servers running on a domain.
kmd manages per-socket policies for connections initiated by clients on the SC to servers on a domain.
At system startup, kmd creates a domain interface for each domain that is active. An active domain both has a valid IOSRAM and is running the Solaris operating environment. Domain change events can trigger creation or removal of a domain kmd interface.
kmd manages shared policies for connections initiated by clients on the domain to servers on the SC. The kmd policy manager reads a configuration file and stores policies used to manage security associations. A request received by kmd is compared to the current set of policies to ensure that it is valid and to set various parameters for the request.
Static global policies are configured using ipsecconf(1M) and its associated data file (/etc/inet/ipsecinit.conf). Global policies are used for connections initiated from the domains to the SC. Corresponding entries are made in the kmd configuration file. Shared security associations for domain-to-SC connections are created by kmd when the domain becomes active.
Note - In order to work properly, policies created by ipsecconf and kmd must match.
The kmd configuration file is used for both SC-to-domain and domain-to-SC initiated connections. The kmd configuration file resides in /etc/opt/SUNWSMS/config/kmd_policy.conf.
The format of the kmd configuration file is as follows:
dir:d_port:protocol:sa_type:aut_alg:encr_alg:domain:login
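To make the entry layout concrete, here is a minimal Python sketch that splits one colon-separated entry into the eight named fields shown above. It assumes that fields never contain an embedded colon, which this chapter does not confirm. The function name and the placeholder entry are hypothetical, and the sketch does not interpret or validate field values.

```python
from collections import namedtuple

# Field names taken from the format line in this chapter.
KMD_FIELDS = ("dir", "d_port", "protocol", "sa_type",
              "aut_alg", "encr_alg", "domain", "login")
KmdPolicy = namedtuple("KmdPolicy", KMD_FIELDS)


def parse_kmd_policy(line):
    """Split one entry into named fields (assumes no colons inside fields)."""
    parts = line.strip().split(":")
    if len(parts) != len(KMD_FIELDS):
        raise ValueError(
            "expected %d fields, got %d" % (len(KMD_FIELDS), len(parts)))
    return KmdPolicy(*parts)


# Placeholder entry: the values are stand-ins, not real policy settings.
policy = parse_kmd_policy("DIR:PORT:PROTO:SA:AUTH:ENCR:DOMAIN:LOGIN")
print(policy.d_port, policy.login)
```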
FIGURE 3-13 illustrates KMD client server relationships to the SMS daemons.
mand(1M) supports the Management Network (MAN). See Management Network Services. By default, mand comes up in spare mode and switches to main when told to do so by the failover daemon (fomd(1M)). fomd determines which role mand plays.
At system startup, mand comes up in the role of spare and configures the SC-to-SC private network. This information is obtained from the file /etc/opt/SUNWSMS/config/MAN.cf, which is created by the smsconfig(1M) command. The failover daemon (fomd(1M)) directs mand to assume the role of main.
Registers for domain change events from platform configuration database (pcd) to track changes in the domain active board list.
Creates the mapping between domain_tag and IP address in the pcd.
Initializes the scman(7d) driver with the current domain configuration.
Registers for events from hwad to track active Ethernet information from the dman(7d) driver.
Updates the scman driver and pcd, as appropriate.
Registers for domain keyswitch events to communicate system startup MAN information to each domain when the domain is powered on (setkeyswitch on). This information includes Ethernet and MAN IP addressing, and active board list information used during the initial software installation on the domain.
FIGURE 3-14 illustrates MAND client server relationships to the SMS daemons.
The message logging daemon, mld, captures the output of all other SMS daemons and processes. mld supports three configuration directives: File, Level, and Mode, in the /var/opt/SUNWSMS/adm/.logger file.
File--Specifies the default output locations for the message files. The default is msgdaemon and should not be changed.
Platform messages are stored on the SC in /var/opt/SUNWSMS/adm/platform/messages
Domain messages are stored on the SC in /var/opt/SUNWSMS/adm/domain_id/messages
Domain console messages are stored on the SC in /var/opt/SUNWSMS/adm/domain_id/console
Domain syslog messages are stored on the SC in /var/opt/SUNWSMS/adm/domain_id/syslog
Level--Specifies the minimum level necessary for a message to be logged. The supported levels are NOTICE, WARNING, ERR, CRIT, ALERT, and EMERG. The default level is NOTICE.
Mode--Specifies the verbosity of the messages. Two modes are available: verbose and terse. The default is verbose.
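The effect of the Level directive can be sketched as a simple threshold test. The Python below is illustrative only. It assumes that the levels are listed above in ascending order of severity, as in the syslog convention, and the function name is hypothetical.

```python
# Supported levels, assumed to be in ascending order of severity.
LEVELS = ["NOTICE", "WARNING", "ERR", "CRIT", "ALERT", "EMERG"]


def should_log(message_level, minimum_level="NOTICE"):
    """Return True if message_level is at or above minimum_level."""
    return LEVELS.index(message_level) >= LEVELS.index(minimum_level)


print(should_log("ERR", "WARNING"))
print(should_log("NOTICE", "WARNING"))
```

With the default level of NOTICE, every supported level passes the test.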
mld monitors the size of each of the message log files. For each message log type, mld keeps up to ten message files at a time, x.0 through x.9. For more information on log messages, see Message Logging.
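One way to picture a ten-file limit is the shift-and-drop scheme sketched below in Python. This is an assumption for illustration, not a description of how mld itself archives files: the sketch assumes that a lower suffix means a newer file and that the oldest file is discarded when the limit is reached. The function name is hypothetical.

```python
def rotation_plan(base, present, keep=10):
    """Plan a rotation for archived files base.0 .. base.(keep-1).

    present is the set of suffix numbers that currently exist.
    Returns (renames, dropped): renames is a list of (old, new) names,
    highest suffix first so that no rename overwrites a file; dropped is
    the name that falls off the end, or None.
    """
    renames = []
    dropped = None
    for i in sorted(present, reverse=True):
        old = "%s.%d" % (base, i)
        if i + 1 >= keep:
            dropped = old  # no room for suffix `keep`, so this file goes
        else:
            renames.append((old, "%s.%d" % (base, i + 1)))
    return renames, dropped


print(rotation_plan("messages", {0, 1, 9}))
```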
FIGURE 3-15 illustrates MLD client server relationships to the SMS daemons and CLIs.
osd(1M) provides support to the OpenBoot PROM process running on a domain. osd and OpenBoot PROM communication is through a mailbox that resides on the domain. The osd daemon monitors the OpenBoot PROM mailbox. When the OpenBoot PROM writes requests to the mailbox, osd executes the requests accordingly.
osd runs at all times on the SC even if there are no domains configured. osd provides virtual TOD service, virtual NVRAM, and virtual REBOOTINFO for OpenBoot PROM and an interface to dsmd(1M) to facilitate auto-domain recovery. osd also provides an interface for the following commands: setobpparams(1M), showobpparams(1M), setdate(1M), and showdate(1M). See also SMS Configuration.
osd is a trusted daemon in that it will not export any interface to other SMS processes. It exclusively reads and writes from and to all OpenBoot PROM mailboxes. There is one OpenBoot PROM mailbox for each domain.
osd has two main tasks: to maintain the current state of the domain configuration, and to monitor the OpenBoot PROM mailbox.
FIGURE 3-16 illustrates OSD client server relationships to the SMS daemons and CLIs.
pcd(1M) is a Sun Fire high-end system management daemon that runs on the SC with primary responsibility for managing and providing controlled access to platform and domain configuration data.
pcd manages an array of information that describes the Sun Fire system configuration. In its physical form, the database information is a collection of flat files, each file appropriately identifiable by the information contained within it. All SMS applications that want to access the database information must go through pcd.
In addition to managing platform configuration data, pcd is responsible for platform configuration change notifications. When pertinent platform configuration changes occur within the system, the pcd sends out notification of the changes to clients who have registered to receive the notification.
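The register-then-notify pattern described above can be sketched in a few lines of Python. This is a generic observer sketch, not the pcd API: the class, method names, and event name are all hypothetical.

```python
class ChangeNotifier:
    """Generic register/notify sketch (hypothetical, not the pcd interface)."""

    def __init__(self):
        self._clients = {}  # event type -> list of callbacks

    def register(self, event_type, callback):
        """Record that callback wants notifications of event_type."""
        self._clients.setdefault(event_type, []).append(callback)

    def notify(self, event_type, detail):
        """Deliver detail to every client registered for event_type."""
        for callback in self._clients.get(event_type, []):
            callback(detail)


received = []
notifier = ChangeNotifier()
notifier.register("example_change", received.append)
notifier.notify("example_change", "domain A")  # delivered
notifier.notify("other_change", "ignored")     # no registered client
print(received)
```

Only clients that registered for a given change type receive it, which mirrors the behavior described for pcd clients.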
FIGURE 3-17 illustrates PCD client server relationships to the SMS daemons and CLIs.
The following information uniquely identifies the platform:
Platform type
Platform name
The Chassis HostID is used only by the COD feature to identify the platform for COD licensing purposes. The Chassis HostID is the centerplane serial number and is recorded internally within the system. To view the Chassis HostID, run the showplatform -p cod command.
The chassis serial number identifies a Sun Fire high-end system and is used to identify the platform in messages and events. It is also used by service providers to correlate events and service actions to the correct system. The chassis serial number is printed on a label located on the front of the system chassis, near the bottom center. Starting with the SMS 1.4 release, the chassis serial number is automatically recorded by Sun manufacturing on systems that ship with SMS 1.4 installed. To view the chassis serial number, run the showplatform -p csn command.
If you are upgrading to SMS 1.4 from an earlier SMS version, use the setcsn(1M) command to record the chassis serial number. For details on the setcsn command, refer to the command description in the System Management Services (SMS) 1.4 Reference Manual.
Cacheable address slice map
System clock frequency
System clock type
SC IP address
SC0 to SC1 IP address
SC1 to SC0 IP address
SC to SC IP netmask
COD instant access CPUs (headroom)
The following information is domain related:
domain_id
domain_tag
OS version (currently not used)
OS type (currently not used)
Available component list
Assigned board list
Active board list
Golden IOSRAM I/O board
Virtual keyswitch setting for a domain
Active Ethernet I/O board
Domain creation time
Domain dump state
Domain bringup priority
IP host address
Host name
Host netmask
Host broadcast address
Virtual OpenBoot PROM address
Physical OpenBoot PROM address
COD RTU license reservation
The following information is related to system boards:
Expander position
Slot position
Board type
Board state
Domain Identifier assigned to board
Available component list state
Board test status
Board test level
Board memory clear state
COD enabled flag
ssd(1M) is responsible for starting and maintaining all SMS daemons and domain X servers.
ssd checks the environment for availability of certain files and the availability of the Sun Fire high-end system, sets environment variables, and then starts esmd(1M) on the main. esmd monitors environmental changes by polling the related hardware components. When an abnormal condition is detected, esmd handles it or generates an event so that the corresponding handlers take appropriate action and/or update their current status. Some of those handlers are: dsmd, pcd and Sun Management Center (if installed). The main objective of ssd is to ensure that the SMS daemons and servers are always up and running.
FIGURE 3-18 illustrates SSD client server relationships to the SMS daemons.
ssd uses a configuration file, ssd_start, to determine which components of the SMS software to start and in what order. This configuration file is located in the /etc/opt/SUNWSMS/startup directory.
ssd_start consists of entries in the following format:
name:args:nice:role:type:trigger:startup_timeout:shutdown_timeout:uid:start_order:stop_order
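As a rough illustration of how such entries could be read and ordered, the Python sketch below turns each entry into a dictionary and sorts the result by start_order. It assumes that fields contain no embedded colons and that start_order is an integer, neither of which this chapter states. The daemon names and field values in the sample are invented.

```python
# Field names taken from the format line in this chapter.
SSD_FIELDS = ("name", "args", "nice", "role", "type", "trigger",
              "startup_timeout", "shutdown_timeout", "uid",
              "start_order", "stop_order")


def parse_ssd_entry(line):
    """Map one entry onto the eleven field names (assumes no colons in fields)."""
    parts = line.strip().split(":")
    if len(parts) != len(SSD_FIELDS):
        raise ValueError(
            "expected %d fields, got %d" % (len(SSD_FIELDS), len(parts)))
    return dict(zip(SSD_FIELDS, parts))


def in_start_order(entries):
    """Sort parsed entries by start_order, assumed here to be an integer."""
    return sorted(entries, key=lambda entry: int(entry["start_order"]))


# Invented entries: names and values are placeholders, not real settings.
sample = [
    "alphad::0:platform:core:auto:30:30:0:2:1",
    "betad:-v:0:platform:core:auto:30:30:0:1:2",
]
ordered = in_start_order([parse_ssd_entry(line) for line in sample])
print([entry["name"] for entry in ordered])
```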
Each time ssd starts, it comes up in spare mode. Once ssd has started the platform core daemons, it queries fomd(1M) for its role. If the query returns spare, ssd stays in spare mode. If the query returns main, ssd transitions to main mode.
After this initial query phase, ssd only switches between modes through events received from the fomd.
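The mode handling described above amounts to a two-state machine. The Python sketch below is a simplified model under that reading. The class and method names are hypothetical, and the sketch does not start or stop any programs.

```python
class SsdModeModel:
    """Simplified model of ssd mode switching (hypothetical names)."""

    def __init__(self):
        self.mode = "spare"  # ssd always comes up in spare mode

    def apply_initial_query(self, fomd_role):
        """Apply the role returned by the initial fomd query."""
        if fomd_role == "main":
            self.mode = "main"

    def on_fomd_event(self, event):
        """After the initial query, only fomd events change the mode."""
        if event in ("main", "spare"):
            self.mode = event


ssd_model = SsdModeModel()
ssd_model.apply_initial_query("spare")
print(ssd_model.mode)
ssd_model.on_fomd_event("main")
print(ssd_model.mode)
```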
When in spare mode, ssd starts and monitors all of the core platform role, auto trigger programs in the ssd_start file. Currently, this list is made up of the following programs.
If, while in main mode, ssd receives a spare event, then ssd shuts down all programs except the core platform role and auto trigger programs found in the ssd_start file.
ssd stays in spare mode until it receives a main event. At that time, ssd starts and monitors (in addition to the already running daemons) all of the platform role (main only) event trigger programs, in the ssd_start file. This list is made up of the following programs.
Finally, after starting all the platform role, event trigger programs, ssd queries the pcd to determine which domains are active. For each of these domains, ssd starts all the domain role, event trigger programs found in the ssd_start file.
ssd uses domain start and stop events from pcd as instructions for starting and stopping domain-specific servers.
Upon reception, ssd either starts or stops all of the domain role, event trigger programs (for the domain identified) found in the ssd_start file.
Once ssd has started a process, it monitors the process and restarts it if the process fails.
In certain instances, such as SMS software upgrades, the SMS software needs to be shut down. ssd provides a mechanism to shut down itself and all SMS daemons and servers under its control.
ssd notifies all SMS software components under its control to shut down. After all the SMS software components have been shut down, ssd shuts itself down.
tmd(1M) provides task management services such as scheduling for SMS. This reduces the number of conflicts that can arise during concurrent invocations of the hardware tests and configuration software.
Currently, the only service exported by tmd is the hpost(1M) scheduling service. In a Sun Fire high-end system, hpost is scheduled based on two factors.
Restriction of hpost. When the platform first comes up and no domains have been configured, a single instance of hpost takes exclusive control of all expanders and configures the centerplane ASICs. All subsequent hpost invocations wait until this is complete before proceeding.
Only a single hpost invocation can act on any one expander at a time. For a Sun Fire high-end system configured without split expanders, this restriction does not prevent multiple hpost invocations from running. This restriction does come into play however, when the machine is configured with split expanders.
System-wide hpost throttle limit. There is a limit to the number of concurrent hpost invocations that can run at a single time without saturating the system. The ability to throttle hpost invocations is available using the -t option in ssd_start.
Caution - Changing the default value can adversely affect system functionality. Do not adjust this parameter unless instructed by a Sun service representative to do so.
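The expander restriction and the throttle limit can be modeled together as an admission check. The Python sketch below is a simplified illustration, not tmd's implementation. The class name, the job identifiers, and the limit of 2 are hypothetical, and the initial exclusive hpost run that configures the centerplane is not modeled.

```python
class HpostAdmission:
    """Toy admission check for hpost runs (hypothetical, not tmd itself)."""

    def __init__(self, throttle_limit):
        self.throttle_limit = throttle_limit
        self.running = {}  # job id -> set of expanders in use

    def can_start(self, expanders):
        """Allow a run only under the limit and on idle expanders."""
        if len(self.running) >= self.throttle_limit:
            return False
        busy = set().union(*self.running.values())
        return busy.isdisjoint(expanders)

    def start(self, job_id, expanders):
        """Record the run if admitted; return whether it was admitted."""
        if not self.can_start(expanders):
            return False
        self.running[job_id] = set(expanders)
        return True

    def finish(self, job_id):
        """Release the expanders held by a finished run."""
        self.running.pop(job_id, None)


sched = HpostAdmission(throttle_limit=2)  # illustrative limit only
print(sched.start("a", {0, 1}))  # admitted
print(sched.start("b", {1}))     # refused: expander 1 is busy
print(sched.start("b", {2}))     # admitted
print(sched.start("c", {3}))     # refused: throttle limit reached
```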
FIGURE 3-19 illustrates TMD client server relationships to the SMS daemons.
To run SMS commands, the following basic SMS environment defaults must be set in your configuration files:
PATH to include /opt/SUNWSMS/bin
LD_LIBRARY_PATH to include /opt/SUNWSMS/lib
MANPATH to include /opt/SUNWSMS/man
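A quick way to confirm these defaults is to check each variable for the expected directory. The Python sketch below is illustrative only. The function name is hypothetical, and it assumes the usual colon-separated path list format.

```python
import os

# Variable -> directory that this chapter says it must include.
REQUIRED = {
    "PATH": "/opt/SUNWSMS/bin",
    "LD_LIBRARY_PATH": "/opt/SUNWSMS/lib",
    "MANPATH": "/opt/SUNWSMS/man",
}


def missing_sms_entries(env):
    """Return the variables whose value lacks the required SMS directory."""
    missing = []
    for var, directory in REQUIRED.items():
        entries = env.get(var, "").split(":")
        if directory not in entries:
            missing.append(var)
    return missing


# Report on the current environment.
print(missing_sms_entries(os.environ))
```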
Setting other environment variables when you log in can save time. TABLE 3-2 suggests some useful SMS environment variables.
Copyright © 2003, Sun Microsystems, Inc. All rights reserved.