CHAPTER 4
SMS Internals
SMS operations are generally performed by a set of daemons and commands. This chapter provides an overview of how SMS works and describes the SMS daemons, processes, commands, and system files. For more information, refer to the System Management Services (SMS) 1.6 Reference Manual.
Caution - Changes made to files in /opt/SUNWSMS can cause serious damage to the system. Only very experienced system administrators should risk changing the files described in this chapter.
This chapter contains the following sections:
The following events take place when the SMS boots:
1. User powers on the Sun Fire high-end (CPU/disk and DVD-ROM) platform. The Solaris OS on the SC boots automatically.
2. During the boot process, the /etc/init.d/sms script is called. This script, for security reasons, disables forwarding, broadcast, and multicasting over the MAN network. The script then starts the SMS software by invoking a background process, which starts and monitors ssd. ssd is the SMS startup daemon responsible for starting and monitoring all the SMS daemons and servers.
3. ssd(1M) in turn invokes the following daemons and processes: mld, pcd, hwad, tmd, dsmd, esmd, mand, osd, dca, efe, codd, efhd, elad, erd, smnptd, picld, and wcapp.
For more information about the SMS daemons, see SMS Daemons. For more information about efe, refer to the latest Sun Management Center documentation available at: http://docs.sun.com
4. Once the daemons are running, you can use SMS commands such as console.
SMS startup can take a few minutes, during which any command you run returns an error message indicating that SMS has not completed startup. When startup is complete, the message "SMS software start-up complete" is posted to the platform log, which you can view using the showlogs(1M) command.
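As a minimal sketch of checking for that message, the following Python function scans lines of platform log text for the completion string. The sample lines are hypothetical; real platform log entries have their own timestamp and prefix layout, and how you capture the log text (for example, from showlogs output) is left to you.

```python
# The completion message quoted in this chapter.
STARTUP_MESSAGE = "SMS software start-up complete"


def startup_complete(log_lines):
    """Return True if any log line contains the startup-complete message."""
    return any(STARTUP_MESSAGE in line for line in log_lines)


# Hypothetical log excerpt, for illustration only.
sample_log = [
    "ssd: starting daemons",
    "ssd: SMS software start-up complete",
]

print(startup_complete(sample_log))
```

A wrapper script could call this check before issuing further SMS commands, rather than retrying commands that fail during startup.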
The SMS 1.6 daemons play a central role on Sun Fire high-end systems. Daemons are persistent processes that provide SMS services to clients using an API.
Daemons are always running, initiated at system startup and restarted whenever necessary.
Each daemon is fully described in its corresponding man page with the exception of efe, which is referenced separately in the Sun Management Center documentation.
This section looks at the SMS daemons, their relationship to one another, and which CLIs access them.
FIGURE 4-1 illustrates the Sun Fire high-end system software components and their high-level interaction.
The capacity on demand daemon, codd(1M), is a process that runs on the main system controller (SC).
This process does the following:
FIGURE 4-2 illustrates the CODD client-server relationships to the SMS daemons and CLI commands.
The domain configuration agent daemon, dca(1M), supports remote dynamic reconfiguration (DR) by enabling communication between applications and the domain configuration server (dcs) running on a Solaris 8, 9, or 10 domain. One dca per domain runs on the SC. Each dca communicates with its dcs over the Management Network (MAN).
ssd(1M) starts dca when the domain is brought up. ssd restarts dca if it is terminated while the domain is still running. dca is terminated when the domain is shut down.
dca is an SMS application that waits for dynamic reconfiguration requests. When a DR request arrives, dca creates a dcs session. Once a session is established, dca forwards the request to dcs. dcs attempts to honor the DR request and sends the results of the operation to the dca. Once the results have been sent, the session is ended. The remote DR operation is complete when dca returns the results of the DR operation.
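The request flow described above can be sketched as follows. This is an illustration only: FakeDcs and its methods are hypothetical stand-ins, not the real dcs interface, and the request and result shapes are assumptions. The point it shows is that a session exists only for the duration of one DR request.

```python
class FakeDcs:
    """Hypothetical stand-in for the domain configuration server."""

    def __init__(self):
        self.open_sessions = 0

    def open_session(self):
        self.open_sessions += 1
        return object()  # opaque session handle

    def handle(self, session, request):
        # A real dcs would attempt the DR operation on the domain.
        return {"request": request, "status": "ok"}

    def close_session(self, session):
        self.open_sessions -= 1


def forward_dr_request(dcs, request):
    """Create a session, forward the request, return the result, end the session."""
    session = dcs.open_session()
    try:
        return dcs.handle(session, request)
    finally:
        # The session ends once the results have been sent back.
        dcs.close_session(session)


dcs = FakeDcs()
result = forward_dr_request(dcs, "hypothetical-dr-request")
print(result["status"], dcs.open_sessions)
```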
FIGURE 4-3 illustrates the DCA client-server relationships to the SMS daemons and CLIs.
The domain status monitoring daemon, dsmd(1M), monitors domain state signatures, CPU reset conditions, and Solaris heartbeat for up to 18 domains on a Sun Fire 15K and up to 9 on a Sun Fire 12K system. This daemon also handles domain stop events related to hardware failure.
dsmd detects timeouts that can occur in reboot transition flow and panic transition flow, and handles various domain hung conditions.
dsmd notifies the domain X server (dxs(1M)) and Sun Management Center of all domain state changes, and automatically recovers the domain based on the domain state signature, domain stop events, and automatic system recovery (ASR) policy. ASR policy consists of those procedures that restore the system to running all properly configured domains after one or more domains have been rendered inactive. This inactivity can be due to software or hardware failures or to unacceptable environmental conditions. For more information, see Automatic System Recovery (ASR) and Domain Stop Events.
dsmd also passes automatic diagnosis (AD) information related to the domain stop to efhd.
FIGURE 4-4 illustrates DSMD client-server relationships to the SMS daemons and CLIs.
The domain X server, dxs(1M), provides software support for a running domain. This support includes virtual console functionality, dynamic reconfiguration support, and HPCI support. dxs handles domain driver requests and events. dxs provides an interface for getting and setting HPCI slot status. The slot status includes cassette presence, power, frequency, and health of the cassette. This interface makes it possible to power control HPCI cassettes for hot-plug operations.
The virtual console functionality enables one or more users running the console program to access the domain's virtual console. dxs acts as a link between SMS console applications and the domain virtual console drivers.
A Sun Fire 15K system can support up to 18 different domains. A Sun Fire 12K system can support up to 9 domains. Each domain might require software support from the SC, and dxs provides that support. The following domain-related projects require dxs support:
There is one domain X server for each Sun Fire high-end system domain. dxs is started by ssd for every active domain, that is, a domain running OS software, and terminated when the domain is shut down.
FIGURE 4-5 illustrates DXS client-server relationships to the SMS daemons.
The error and fault handling daemon, efhd(1M), does the following:
FIGURE 4-6 illustrates EFHD client-server relationships to the SMS daemons.
The event log access daemon, elad(1M), controls access to the SMS event log, which records fault and error events identified by the automatic diagnosis (AD) or POST diagnosis engines on a Sun Fire high-end system. elad also archives events when the event log fills.
FIGURE 4-7 illustrates the ELAD client-server relationships to the SMS daemons and CLI commands.
The event reporting daemon, erd(1M), provides reporting services that deliver fault event text messages to the platform and domain logs, fault event information to Sun Management Center and Sun Remote Services (SRS) Net Connect, and email that contains fault event messages.
erd reads the email control file and the email template file each time email event notification occurs.
FIGURE 4-8 illustrates the ERD client-server relationships to the SMS daemons.
The environmental status monitoring daemon, esmd(1M), monitors system cabinet environmental conditions, for example, voltage, temperature, fan tray, power supply and clock phasing. esmd logs abnormal conditions and takes action to protect the hardware, if necessary.
See Environmental Events for more information about esmd.
FIGURE 4-9 illustrates ESMD client-server relationships to the SMS daemons.
The failover management daemon, fomd(1M), is the core of the SC failover mechanism. fomd detects faults on the local and remote SCs and takes the appropriate action (initiating a failover or takeover). fomd tests and ensures that important configuration data is kept synchronized between both SCs. fomd runs on both the main and spare SCs.
For more information on fomd, see Chapter 12.
FIGURE 4-10 illustrates FOMD client-server relationships to the SMS daemons.
The FRU access daemon, frad(1M), is the field-replaceable unit (FRU) access daemon for SMS. frad provides controlled access to any SEEPROM within the Sun Fire high-end platform that is accessible by the SC. frad supports dynamic FRUID, which provides improved FRU data access using the Solaris platform information and control library daemon (PICLD). FRU identification is for Sun Service use only and transparent to the user.
FIGURE 4-11 illustrates FRAD client-server relationships to the SMS daemons.
The hardware access daemon, hwad(1M), provides hardware access to the SMS daemons. It is the exclusive mechanism through which all daemons access, control, monitor, and configure the hardware.
hwad runs in either main or spare mode when it comes up. The failover daemon (fomd(1M)) determines which role hwad plays.
On both the main and spare, hwad does the following:
On the main SC, hwad does the following:
On the spare SC, hwad performs these tasks:
hwad directs communication to the IOSRAM (tunnel switch) for dynamic reconfiguration (DR).
hwad notifies dsmd(1M) if there is a dstop or rstop. It also notifies related SMS daemons, depending on the type of the Mbox interrupt that occurs.
hwad detects and logs console bus and JTAG errors.
Hardware access to a Sun Fire high-end system on the SC is done either by going through the PCI bus or console bus. Through the PCI bus you can access:
Through the console bus you can access:
FIGURE 4-12 illustrates HWAD client-server relationships to the SMS daemons and CLIs.
The key management daemon, kmd(1M), provides a mechanism for managing security for socket communications between the SC and the domains.
The current default configuration includes authentication policies for the dca(1M) and dxs(1M) clients on the SC, which connect to the dcs(1M) and cvcd(1M) servers on a domain.
kmd manages the IPSec security associations (SAs) needed to secure the communication between the SC and servers running on a domain.
kmd manages per-socket policies for connections initiated by clients on the SC to servers on a domain.
At system startup, kmd creates a domain interface for each domain that is active. An active domain has a valid IOSRAM and is running the Solaris OS. Domain change events can trigger creation or removal of a domain kmd interface.
kmd manages shared policies for connections initiated by clients on the domain to servers on the SC. The kmd policy manager reads a configuration file and stores policies used to manage security associations. A request received by kmd is compared to the current set of policies to ensure that it is valid and to set various parameters for the request.
Static global policies are configured using ipsecconf(1M) and its associated data file (/etc/inet/ipsecinit.conf). Global policies are used for connections initiated from the domains to the SC. Corresponding entries are made in the kmd configuration file. Shared security associations for domain-to-SC connections are created by kmd when the domain becomes active.
Note - To work properly, policies created by ipsecconf and kmd must match.
The kmd configuration file is used for both SC-to-domain and domain-to-SC initiated connections. The kmd configuration file resides in /etc/opt/SUNWSMS/config/kmd_policy.conf.
The format of the kmd configuration files is as follows:
FIGURE 4-13 illustrates KMD client-server relationships to the SMS daemons.
The management network daemon, mand(1M), supports the Management Network (MAN). (For more information about the MAN network, see Management Network Services.) By default, mand comes up in spare mode and switches to main when told to do so by the failover daemon (fomd(1M)). fomd determines which role mand plays.
At system startup, mand comes up in the role of spare and configures the SC-to-SC private network. This information is obtained from the file /etc/opt/SUNWSMS/config/MAN.cf, which is created by the smsconfig(1M) command. The failover daemon (fomd(1M)) directs mand to assume the role of main.
In the main role, mand does the following:
FIGURE 4-14 illustrates MAND client-server relationships to the SMS daemons.
The message logging daemon, mld(1M), captures the output of all other SMS daemons and processes. mld supports three configuration directives: File, Level, and Mode, in the /var/opt/SUNWSMS/adm/.logger file.
mld monitors the size of each of the message log files. For each message log type, mld keeps up to ten message files at a time, x.0 through x.9. For more information on log messages, see Message Logging.
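To make the ten-file limit concrete, here is a minimal sketch of one rotation step. It assumes that x.0 is the newest file and that files shift upward with the oldest dropped; this chapter does not state the rotation direction, so treat that ordering as an assumption rather than documented mld behavior. The function only computes a plan and touches no files.

```python
MAX_FILES = 10  # x.0 through x.9


def rotate(names_present, base="x"):
    """Return (renames, dropped) for one rotation step.

    names_present: set of existing file names, for example {"x.0", "x.1"}.
    Assumption: x.0 is newest and x.9 is oldest.
    """
    oldest = f"{base}.{MAX_FILES - 1}"
    dropped = oldest if oldest in names_present else None
    renames = []
    # Walk from the second-oldest slot down to x.0 so that no rename
    # overwrites a file that still has to be moved.
    for i in range(MAX_FILES - 2, -1, -1):
        src = f"{base}.{i}"
        if src in names_present:
            renames.append((src, f"{base}.{i + 1}"))
    return renames, dropped


print(rotate({"x.0", "x.1"}))
```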
FIGURE 4-15 illustrates MLD client-server relationships to the SMS daemons and CLIs.
The OpenBoot PROM support daemon, osd(1M), provides support to the OpenBoot PROM process running on a domain. osd and OpenBoot PROM communication is through a mailbox that resides on the domain. The osd daemon monitors the OpenBoot PROM mailbox. When the OpenBoot PROM writes requests to the mailbox, osd executes the requests accordingly.
osd runs at all times on the SC, even if there are no domains configured. osd provides virtual time of day (TOD) service, virtual nonvolatile random access memory (NVRAM), and virtual REBOOTINFO for OpenBoot PROM, and an interface to dsmd(1M) to facilitate auto-domain recovery. osd also provides an interface for the following commands: setobpparams(1M), showobpparams(1M), setdate(1M), and showdate(1M). See also Chapter 5.
osd is a trusted daemon in that it will not export any interface to other SMS processes. It exclusively reads and writes from and to all OpenBoot PROM mailboxes. There is one OpenBoot PROM mailbox for each domain.
osd has two main tasks: to maintain its current state of the domain configuration, and to monitor the OpenBoot PROM mailbox.
FIGURE 4-16 illustrates OSD client-server relationships to the SMS daemons and CLIs.
The platform configuration daemon, pcd(1M), is a Sun Fire high-end system management daemon that runs on the SC with primary responsibility for managing and providing controlled access to platform and domain configuration data.
pcd manages an array of information that describes the Sun Fire system configuration. In its physical form, the database information is a collection of flat files, each file appropriately identifiable by the information contained within it. All SMS applications must go through pcd to access the database information.
In addition to managing platform configuration data, pcd is responsible for platform configuration change notifications. When pertinent platform configuration changes occur within the system, the pcd sends out notification of the changes to clients who have registered to receive the notification.
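The register-then-notify pattern described above can be sketched in a few lines. The class, method names, and change-type strings below are hypothetical illustrations of the pattern, not the pcd client API.

```python
class ChangeNotifier:
    """Hypothetical registry that delivers change notifications to registered clients."""

    def __init__(self):
        self._subscribers = {}

    def register(self, change_type, callback):
        """Record that a client wants notifications of the given change type."""
        self._subscribers.setdefault(change_type, []).append(callback)

    def publish(self, change_type, detail):
        """Notify every registered client; return how many were notified."""
        callbacks = self._subscribers.get(change_type, [])
        for callback in callbacks:
            callback(detail)
        return len(callbacks)


received = []
notifier = ChangeNotifier()
notifier.register("domain-config", received.append)

notified = notifier.publish("domain-config", "hypothetical change detail")
ignored = notifier.publish("board-config", "no client registered for this")
print(notified, ignored, received)
```

Clients that never registered for a change type receive nothing, which mirrors the statement that notifications go only to registered clients.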
FIGURE 4-17 illustrates PCD client-server relationships to the SMS daemons and CLIs.
The following information uniquely identifies the platform:
The Chassis HostID is used only by the COD feature to identify the platform for COD licensing purposes. The Chassis HostID is the centerplane serial number and is recorded internally within the system. To view the Chassis HostID, run the showplatform -p cod command.
The chassis serial number identifies a Sun Fire high-end system and is used to identify the platform in messages and events. It is also used by service providers to correlate events and service actions to the correct system. The chassis serial number is printed on a label located on the front of the system chassis, near the bottom center. Starting with the SMS 1.4 release, the chassis serial number is automatically recorded by Sun manufacturing on systems that ship with SMS installed. To view the chassis serial number, run the showplatform -p csn command.
If you are upgrading to SMS 1.6 or later from an earlier SMS version, use the setcsn(1M) command to record the chassis serial number. For details on the setcsn command, refer to the command description in the System Management Services (SMS) 1.6 Reference Manual.
The following information is domain-related:
The following information is related to system boards:
The SMS startup daemon, ssd(1M), is responsible for starting and maintaining all SMS daemons and domain X servers.
ssd checks the environment for availability of certain files and the availability of the Sun Fire high-end system, sets environment variables, and then starts esmd(1M) on the main SC. esmd monitors environmental changes by polling the related hardware components. When an abnormal condition is detected, esmd handles it or generates an event so that the corresponding handlers take appropriate action, update their current status, or both. Some of those handlers are dsmd, pcd, and Sun Management Center (if installed). The main objective of ssd is to ensure that the SMS daemons and servers are always up and running.
FIGURE 4-18 illustrates SSD client-server relationships to the SMS daemons.
ssd uses a configuration file, ssd_start, to determine which SMS components to start, and in which order to start them. This configuration file is located in the /etc/opt/SUNWSMS/startup directory.
ssd_start consists of entries in the following format:
name:args:nice:role:type:trigger:startup-timeout:shutdown-timeout:uid:start-order:stop-order
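As a minimal sketch of reading this format, the following parser splits one entry into its eleven colon-separated fields. The dictionary keys follow the field names in the format line. The sample entry and all of its values are hypothetical, and the sketch assumes that no field (including args) contains a colon, which this chapter does not confirm.

```python
# Field names taken from the entry format shown in this section.
FIELDS = (
    "name", "args", "nice", "role", "type", "trigger",
    "startup-timeout", "shutdown-timeout", "uid",
    "start-order", "stop-order",
)


def parse_ssd_start_entry(line):
    """Split one ssd_start entry into a dict keyed by field name.

    Assumption: fields never contain a literal colon.
    """
    parts = line.rstrip("\n").split(":")
    if len(parts) != len(FIELDS):
        raise ValueError(
            f"expected {len(FIELDS)} fields, found {len(parts)}: {line!r}"
        )
    return dict(zip(FIELDS, parts))


# Hypothetical entry; the daemon name and values are invented for illustration.
entry = parse_ssd_start_entry("exampled::0:platform:core:auto:30:30:0:10:90")
print(entry["name"], entry["role"], entry["trigger"])
```

Given the Caution at the start of this chapter, a read-only parser like this is suited to inspecting the file, not to generating or rewriting it.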
Each time ssd starts, it comes up in spare mode. Once ssd has started the platform core daemons, it queries fomd(1M) for its role. If the fomd query returns spare, ssd stays in spare mode. If the query returns main, ssd transitions to main mode.
After this initial query phase, ssd only switches between modes through events received from the fomd.
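The mode handling described above amounts to a two-state machine. The sketch below models it under stated assumptions: the class name is hypothetical, and role and event values are represented as the plain strings "main" and "spare", which is a simplification of whatever fomd actually delivers.

```python
class SsdModeModel:
    """Hypothetical model of ssd's main/spare mode transitions."""

    def __init__(self, initial_fomd_role):
        # ssd always comes up in spare mode.
        self.mode = "spare"
        # After starting the core daemons, it queries fomd once for its role.
        if initial_fomd_role == "main":
            self.mode = "main"

    def on_fomd_event(self, event):
        """After the initial query, only fomd events change the mode."""
        if event in ("main", "spare"):
            self.mode = event
        return self.mode


ssd = SsdModeModel(initial_fomd_role="spare")
print(ssd.mode)                    # still spare after the initial query
print(ssd.on_fomd_event("main"))   # a main event promotes it
print(ssd.on_fomd_event("spare"))  # a spare event demotes it again
```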
When in spare mode, ssd starts and monitors all of the core platform role, auto trigger programs in the ssd_start file. Currently, this list is made up of the following programs:
If, while in main mode, ssd receives a spare event, then ssd shuts down all programs except the core platform role and auto trigger programs found in the ssd_start file.
ssd stays in spare mode until it receives a main event. At that time, ssd starts and monitors (in addition to the daemons that are already running) all of the main platform role event trigger programs in the ssd_start file. This list is made up of the following programs:
Finally, after starting all the platform role, event trigger programs, ssd queries the pcd to determine which domains are active. For each of these domains, ssd starts all the domain role, event trigger programs found in the ssd_start file.
ssd uses domain start and stop events from pcd as instructions for starting and stopping domain-specific servers.
Upon reception, ssd either starts or stops all of the domain role, event trigger programs (for the domain identified) found in the ssd_start file.
Once ssd has started a process, it monitors the process and restarts it in the event the process fails.
In certain instances, such as SMS software upgrades, the SMS software must be shut down. ssd provides a mechanism to shut down itself and all SMS daemons and servers under its control.
ssd notifies all SMS software components under its control to shut down. After all the SMS software components have been shut down, ssd shuts itself down.
The task management daemon, tmd(1M), provides task management services such as scheduling for SMS. This reduces the number of conflicts that can arise during concurrent invocations of the hardware tests and configuration software.
Currently, the only service exported by tmd is the hpost(1M) scheduling service. In a Sun Fire high-end system, hpost is scheduled based on the following two factors.
Only a single hpost invocation can act on any one expander at a time. For a Sun Fire high-end system configured without split expanders, this restriction does not prevent multiple hpost invocations from running. This restriction does come into play, however, when the machine is configured with split expanders.
Caution - Changing the default value can adversely affect system functionality. Do not adjust this parameter unless instructed by a Sun service representative to do so.
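The single-expander restriction described above can be illustrated with a small greedy scheduling sketch. This is not the tmd algorithm: the function, the request names, and the expander numbers are hypothetical, and the sketch models only the rule that no two running hpost invocations can share an expander.

```python
def schedule(requests):
    """Start each request in order if all of its expanders are free.

    requests: list of (name, expanders) pairs, where expanders is an
    iterable of expander numbers the invocation would act on.
    Returns (running, waiting) lists of request names.
    """
    busy = set()
    running, waiting = [], []
    for name, expanders in requests:
        wanted = set(expanders)
        if wanted & busy:
            # Another invocation already holds one of these expanders.
            waiting.append(name)
        else:
            busy |= wanted
            running.append(name)
    return running, waiting


# Hypothetical case: expander 1 is split between domains A and B.
requests = [
    ("hpost-A", {0, 1}),
    ("hpost-B", {1, 2}),
    ("hpost-C", {3}),
]
running, waiting = schedule(requests)
print(running, waiting)
```

Without split expanders the expander sets never overlap, so every request lands in the running list, which matches the statement that the restriction then does not prevent concurrent invocations.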
FIGURE 4-19 illustrates TMD client-server relationships to the SMS daemons.
Basic SMS environment defaults must be set in your configuration files to run SMS commands.
Setting other environment variables when you log in can save time. TABLE 4-2 suggests some useful SMS environment variables.
Copyright © 2006, Sun Microsystems, Inc. All Rights Reserved.