SMF Best Practices and Troubleshooting

This appendix describes best practices and troubleshooting, including:

Repairing a service instance that is degraded, offline, or in maintenance
Diagnosing and repairing SMF repository problems
Specifying the amount of SMF startup messaging
Specifying the SMF milestone to which to boot
Investigating system boot problems
Converting inetd services to SMF services

SMF Best Practices

Most services describe configuration policy. If the configuration you want is not implemented, modify the policy description by modifying the service. Modify the values of service properties or create new service instances with different property values. Do not disable service instances and edit configuration files that are intended to be managed by an SMF service. An increasing number of fundamental Oracle Solaris features are configured by SMF service properties, not by editing configuration files.

Do not modify manifests and system profiles that are delivered by Oracle or by third-party software vendors. These manifests and profiles might be replaced when you upgrade your system, and then your changes to these files will be lost. Instead, do one of the following:

Add a new service instance with different property values as described in Adding Service Instances.
Create a profile to customize the service. Use the svcbundle command or the svccfg extract command to create a profile file. Customize property values in that file, and include comments about the reason for each customization. Copy the profile file to the appropriate /etc/svc/profile subdirectory, and restart the manifest-import service.

To apply the same custom configuration to multiple systems, copy the same profile file to the same /etc/svc/profile subdirectory on each system, and restart the manifest-import service on each system. To automate delivering the profile to each system, package the profile. See Configuring Multiple Systems.
Use the svccfg command or the inetadm command to manipulate the properties directly. If you use the svccfg command to modify property values, be sure to refresh the service instance as explained in Understanding Configuration Changes. For information about modifying, adding, and deleting service configuration, see Configuring Services. To see configuration that has already been modified, see Showing Configuration Customizations. To delete custom configuration, see Example 40, Deleting Customizations and Example 42, Unmasking Configuration.

When you create a custom profile, make sure the configuration defined does not conflict with configuration defined in the same layer in another manifest or profile for the same service or service instance. Configuration conflicts are not permitted within any layer. If conflicting configuration is delivered by multiple files in any single layer, and is not set at a higher layer, the manifest-import service log reports the conflict and the service with the conflicting configuration is not started. See Conflicting Configuration for more information.

Do not use non-standard locations for manifest and profile files. See Service Bundles for manifest and profile standard locations.

When you create a service for your own use, use site at the beginning of the service name: svc:/site/service_name:instance_name.

Do not modify the configuration of the master restarter service, svc:/system/svc/restarter:default, except to configure logging levels as described in Specifying the Amount of Startup Messaging.

Before you use the svccfg delcust command, use the svccfg listcust command with the same options. The delcust subcommand can potentially remove all administrative customizations on a service. Use the listcust subcommand to verify which customizations will be deleted by the delcust subcommand.
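
The list-then-delete habit described above can be wrapped in a small helper. The following is a minimal sketch, not a supported tool: the function name review_and_delcust is hypothetical, and it assumes that any arguments after the FMRI are options that both listcust and delcust accept.

```shell
# Sketch: show what delcust would remove, then ask before deleting.
# review_and_delcust is a hypothetical helper name. Arguments after the
# FMRI are passed unchanged to both listcust and delcust so that the
# two subcommands always run with the same options.
review_and_delcust() {
    fmri=$1
    shift
    echo "Customizations that delcust would remove from $fmri:"
    svccfg -s "$fmri" listcust "$@" || return 1
    printf 'Delete these customizations? [yes/no] '
    read -r answer
    if [ "$answer" = "yes" ]; then
        svccfg -s "$fmri" delcust "$@"
    else
        echo "No changes made."
    fi
}
```

For example, `review_and_delcust svc:/site/service_name:instance_name` lists the customizations and deletes them only if you answer yes.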

In scripts, use the full service instance FMRI: svc:/service_name:instance_name.

Troubleshooting Services Problems

This section discusses the following topics:

Committing configuration changes into the running snapshot
Fixing services that are reported to have problems
Manually transitioning an instance to the degraded or maintenance state
Fixing a corrupt service configuration repository
Configuring the amount of messaging to display or store on system startup
Transitioning or booting to a specified milestone
Using SMF to investigate booting problems
Converting inetd services to SMF services

Understanding Configuration Changes

In the service configuration repository, SMF stores property changes separately from properties in the running snapshot. When you change service configuration, those changes do not immediately appear in the running snapshot.

The refresh operation updates the running snapshot of the specified service instance with the values from the editing configuration.

By default, the svcprop command shows properties in the running snapshot, and the svccfg command shows properties in the editing configuration. If you have changed property values but not performed a configuration refresh, the svcprop and svccfg commands show different property values. After you perform a configuration refresh, the svcprop and svccfg commands show the same property values.

Rebooting does not change the running snapshot. The svcadm restart command does not refresh configuration. Use the svcadm refresh or svccfg refresh command to commit configuration changes into the running snapshot.
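
To check whether a change is still waiting for a refresh, you can compare the two views of a single property. The following is a minimal sketch: the function name config_pending is hypothetical, and it assumes that the -c option of svcprop reads the editing configuration instead of the running snapshot, so confirm that behavior in the svcprop(1) man page on your release.

```shell
# Sketch: report whether a property differs between the editing
# configuration and the running snapshot of a service instance.
# config_pending is a hypothetical helper name.
config_pending() {
    fmri=$1
    prop=$2
    running=$(svcprop -p "$prop" "$fmri")     # running snapshot (default view)
    editing=$(svcprop -c -p "$prop" "$fmri")  # editing configuration (assumed -c)
    if [ "$running" = "$editing" ]; then
        echo "in sync: $prop=$running"
    else
        echo "pending: $prop running=$running editing=$editing"
        echo "run: svcadm refresh $fmri"
    fi
}
```

The helper only reports the difference. It leaves the refresh to you so that no configuration is committed by accident.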

Repairing an Instance That Is Degraded, Offline, or in Maintenance

Use the svcs -x command with no arguments to display explanatory information about any service instances that match any of the following descriptions:

The service is enabled but is not running.
The service is enabled but is not running at normal capacity.
The service is preventing another enabled service from running.
The service is disabled but is not able to complete the transition to the disabled state.

The following list summarizes how to approach service problems:

Diagnose the problem, starting with viewing the service log file.

Log files are located in /var/svc/log and /system/volatile. The service log file shows time stamps and method exit reasons.

The location of the log file for a particular service is given by the following command:
```
$ svcs -L service-name
```
The following command displays the end of the log file for a particular service:
```
$ svcs -Lx service-name
```
Fix the problem.
- If multiple service failures are identified, start by looking at the first failure to occur, using the time stamps in the service log files.
- Use the following command to show impacted dependencies of the failed service:
```
$ svcs -l service-name
```
  Use the following command to show services on which service-name depends:
```
$ svcs -d service-name
```
- If fixing the problem involves modifying service configuration, refresh the service.
Move affected services to a running state.
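
The diagnosis steps above can be collected into one first-pass report. The following is a minimal sketch: the function name triage_service is hypothetical, and the -D option of svcs, which lists the services that depend on the named service, is used here in addition to the commands shown above.

```shell
# Sketch: gather first-pass diagnostics for one service.
# triage_service is a hypothetical helper name.
triage_service() {
    svc=$1
    echo "== Explanation =="
    svcs -x "$svc"
    echo "== Log file =="
    svcs -L "$svc"
    echo "== Services that $svc depends on =="
    svcs -d "$svc"
    echo "== Services that depend on $svc =="
    svcs -D "$svc"
}
```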

How to Repair an Instance That Is in Maintenance

A service instance that is in maintenance is enabled but not able to run or is disabled but not able to complete the transition to the disabled state.

Determine why the instance is in maintenance.
The instance might be transitioning through the maintenance state because an administrative action has not yet completed. If the instance is transitioning, its state should be shown as maintenance* with an asterisk at the end.

An instance that is configured to restart after failure might be placed in maintenance because it was restarting too frequently. In this case you need to determine the cause of the consistent failure.

If an instance is in maintenance because it has conflicts, or conflicting property values, see Conflicting Configuration.

The instance might be in the maintenance state because the instance was disabled but is unable to reach the disabled state because the stop method failed.

In the following example, the “State” and “Reason” lines show that the pkg/depot service is in the maintenance state because its start method failed.
```
$ svcs -x
svc:/application/pkg/depot:default (IPS Depot)
 State: maintenance since September 11, 2013 01:30:42 PM PDT
Reason: Start method exited with $SMF_EXIT_ERR_FATAL.
   See: http://support.oracle.com/msg/SMF-8000-KS
   See: pkg.depot-config(8)
   See: /var/svc/log/application-pkg-depot:default.log
Impact: This service is not running.
```
Log in to the Oracle support site to view the referenced Predictive Self-Healing knowledge article. In this case, the article tells you to view the log file to determine why the start method failed. The svcs output gives the name of the log file. See Viewing Service Log Files for information about how to view the log file. In this example, the log file shows the start method invocation and the fatal error message.
```
[ Sep 11 13:30:42 Executing start method ("/lib/svc/method/svc-pkg-depot start"). ]
pkg.depot-config: Unable to get publisher information: 
The path '/export/ipsrepos/Solaris11' does not contain a valid package repository.
```
Fix the problem.
One or more of the following steps might be needed.
- Update service configuration.
  If fixing the reported problem required modifying service configuration, use the svccfg refresh or svcadm refresh command for any services whose configuration changed. Verify that the configuration is updated in the running snapshot by using the svcprop command to check property values or by other tests specific to this service.
- Ensure dependencies are running.
  Sometimes the “Impact” line in the svcs -x output tells you that services that depend on the service that is in the maintenance state are not running. Use the svcs -l command to check the current state of dependent services. Ensure that all required dependencies are running. Use the svcs -x command to verify that all enabled services are running.
- Ensure contract processes are stopped.
  If the service that is in the maintenance state is a contract service, determine whether any processes that were started by the service have not stopped. When a contract service instance is in a maintenance state, the contract ID should be blank, as shown in the following example, and all processes associated with that contract should have stopped. Use svcs -l or svcs -o ctid to check that no contract exists for a service instance in maintenance. Use svcs -p to check whether any processes associated with this service instance are still running. Any processes shown by svcs -p for a service instance in maintenance should be killed.
```
$ svcs -l system-repository
fmri         svc:/application/pkg/system-repository:default
name         IPS System Repository
enabled      true
state        maintenance
next_state   none
state_time   September 17, 2013 07:18:19 AM PDT
logfile      /var/svc/log/application-pkg-system-repository:default.log
restarter    svc:/system/svc/restarter:default
contract_id
manifest     /lib/svc/manifest/application/pkg/pkg-system-repository.xml
dependency   require_all/error svc:/milestone/network:default (online)
dependency   require_all/none svc:/system/filesystem/local:default (online)
dependency   optional_all/error svc:/system/filesystem/autofs:default (online)
```
Notify the restarter that the instance is repaired.
When the reported problem is fixed, use the svcadm clear command to tell the restarter for that service that the instance is repaired. SMF will attempt to transition the instance to its configured state. If the instance is enabled, SMF will attempt to bring the instance online. If the instance is disabled, SMF will transition the instance to the disabled state.
```
$ svcadm clear pkg/depot:default
```
If you specify the -s option, the svcadm command waits to return until the instance reaches the online state or until it determines that the instance cannot reach the online state without administrator intervention. Use the -T option with the -s option to specify an upper bound in seconds to make the transition or determine that the transition cannot be made.
Verify that the instance is repaired.
Use the svcs command to verify that the service that was in maintenance is now online. Use the svcs -x command to verify that all enabled services are running.
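
The contract check and the clear step can be combined so that an instance is cleared only when nothing is left behind. The following is a minimal sketch: the function name clear_if_clean is hypothetical, and it assumes that svcs -H -o ctid prints either nothing or a single hyphen when the instance has no contract.

```shell
# Sketch: clear an instance only if it is in maintenance and no
# contract is reported for it. clear_if_clean is a hypothetical name.
clear_if_clean() {
    fmri=$1
    state=$(svcs -H -o state "$fmri" | awk '{print $1}')
    if [ "$state" != "maintenance" ]; then
        echo "$fmri is $state, not maintenance; nothing to clear"
        return 0
    fi
    # Assumption: an empty value or "-" means that no contract exists.
    ctid=$(svcs -H -o ctid "$fmri" | awk '{print $1}')
    if [ -n "$ctid" ] && [ "$ctid" != "-" ]; then
        echo "$fmri still has contract $ctid; check 'svcs -p $fmri' first"
        return 1
    fi
    # Tell the restarter that the instance is repaired, then show its state.
    svcadm clear "$fmri" && svcs "$fmri"
}
```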

How to Repair an Instance That Is Offline

A service instance that is offline is enabled but not running or available to run.

Determine why the instance is offline.
The instance might be transitioning through the offline state because its dependencies are not yet satisfied. If the instance is transitioning, its state should be shown as offline*.
Fix the problem.
- Enable service dependencies.
  If required dependencies are disabled, enable them with the following command:
```
$ svcadm enable -r FMRI
```
- Fix dependency file.
  A dependency file might be missing or unreadable. You might want to use pkg fix or pkg revert to fix this type of problem. See the pkg(1) man page.
Restart the instance if necessary.
If the instance was offline because a required dependency was not satisfied, fixing or enabling the dependency might cause the offline instance to restart and come online with no further administrative action needed.

If you made some other fix to the service, then restart the instance.
```
$ svcadm restart FMRI
```
Verify that the instance is repaired.
Use the svcs command to verify that the instance that was offline is now online. Use the svcs -x command to verify that all enabled services are running.
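
To find out which dependencies need to be enabled, you can filter the dependency list by state. The following is a minimal sketch: the function name disabled_deps is hypothetical, and it assumes that the -H and -o options of svcs can be combined with the -d option.

```shell
# Sketch: print the FMRIs of disabled services on which the given
# service depends. disabled_deps is a hypothetical helper name.
disabled_deps() {
    svcs -H -o state,fmri -d "$1" | awk '$1 == "disabled" { print $2 }'
}
```

Each FMRI that the function prints is a candidate for the svcadm enable -r command shown in this procedure. Review the list before you enable anything, because a dependency might be disabled on purpose.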

How to Repair an Instance That Is Degraded

A service instance that is degraded is enabled and running or available to run, but is functioning at a limited capacity.

Determine why the instance is degraded.
Fix the problem.
Request that the restarter bring the instance online.
When the reported problem is fixed, use the svcadm clear command to return the instance to the online state. For instances in the degraded state, the clear subcommand requests that the restarter for that instance transition the instance to the online state.
```
$ svcadm clear pkg/depot:default
```
Verify that the instance is repaired.
Use the svcs command to verify that the instance that was degraded is now online. Use the svcs -x command to verify that all enabled services are running.

Marking an Instance as Degraded or in Maintenance

You can mark a service instance as being in either the degraded state or the maintenance state. You might want to do this if the application is stuck in a loop or is deadlocked, for example. The information about the state change propagates to the dependencies of the marked instance, which can help debug other related instances.

Specify the -I option to request an immediate state change.

When you mark an instance as maintenance, you can specify the -t option to request a temporary state change. Temporary requests last only until reboot.

If you specify the -s option with the svcadm mark command, svcadm marks the instance and waits for the instance to enter the degraded or maintenance state before returning. Use the -T option with the -s option to specify an upper bound in seconds to make the transition or determine that the transition cannot be made.
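
The following is a minimal sketch of an immediate mark request. The function name mark_instance is hypothetical. The sketch passes only the -I option and rejects any state other than the two that this section describes.

```shell
# Sketch: mark an instance as degraded or maintenance, immediately.
# mark_instance is a hypothetical helper name.
mark_instance() {
    state=$1
    fmri=$2
    case "$state" in
        degraded|maintenance)
            svcadm mark -I "$state" "$fmri"
            ;;
        *)
            echo "state must be degraded or maintenance" >&2
            return 2
            ;;
    esac
}
```

For example, `mark_instance maintenance svc:/site/service_name:instance_name` requests an immediate transition to the maintenance state.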

Diagnosing and Repairing Repository Problems

On system startup, the repository daemon, svc.configd, performs an integrity check of the configuration repository stored in /etc/svc/repository.db. If the svc.configd integrity check fails, the svc.configd daemon writes a message to the console similar to the following:

```
svc.configd: smf(7) database integrity check of:

    /etc/svc/repository.db

  failed.  The database might be damaged or a media error might have
  prevented it from being verified.  Additional information useful to
  your service provider is in:

    /system/volatile/db_errors

  The system will not be able to boot until you have restored a working
  database.  svc.startd(8) will provide a sulogin(8) prompt for recovery
  purposes.  The command:

    /lib/svc/bin/restore_repository

  can be run to restore a backup version of your repository. See
  http://support.oracle.com/msg/SMF-8000-MY for more information.
```

The svc.configd daemon then exits. That exit is detected by the svc.startd daemon, and svc.startd then starts sulogin.

At the sulogin prompt, enter Ctrl-D to exit sulogin. The svc.startd daemon recognizes the sulogin exit and restarts the svc.configd daemon, which checks the repository again. The problem might not reappear after this additional restart.

Caution - Do not directly invoke the svc.configd daemon. The svc.startd daemon starts the svc.configd daemon.

If svc.configd again reports a failed integrity check and you are again at the sulogin prompt, ensure that required file systems are not full. Using the root password, log in either remotely or at the sulogin prompt. Check that space is available on both the root (/) and /system/volatile file systems. If either of these file systems is full, clean up and start the system again. If neither of these file systems is full, follow the procedure How to Restore a Repository From Backup.

The service configuration repository can become corrupted for any of the following reasons:

Disk failure
Hardware bug
Software bug
Accidental overwrite of the file

The following procedure shows how to replace a corrupt repository with a backup copy of the repository.

How to Restore a Repository From Backup

Before You Begin

Caution - Only restore a corrupt repository. Do not use this repository restore procedure to delete unwanted configuration changes. To undo configuration changes, see Showing Configuration Customizations, Example 40, Deleting Customizations, and Example 42, Unmasking Configuration.

Log in.
Using the root password, log in either remotely or at the sulogin prompt.
Run the repository restore command:
```
# /lib/svc/bin/restore_repository
```
Running this command takes you through the necessary steps to restore a non-corrupt backup. SMF automatically takes backups of the repository as described in Repository Backups.

SMF maintains persistent and non-persistent configuration data. See Service Configuration Repository for descriptions of these two repositories. The restore_repository command only restores the persistent repository. The restore_repository command also reboots the system, which destroys the non-persistent configuration data. The non-persistent data is runtime data that is not needed across system reboot.

When started, the /lib/svc/bin/restore_repository command displays a message similar to the following:
```
See http://support.oracle.com/msg/SMF-8000-MY for more information on the use of
this script to restore backup copies of the smf(7) repository.

If there are any problems which need human intervention, this script will
give instructions and then exit back to your shell. 
```
After the root (/) file system is mounted with write permissions, or if the system is a local zone, you are prompted to select the repository backup to restore:
```
The following backups of /etc/svc/repository.db exists, from
oldest to newest:

... list of backups ...
```
Backups are given names, based on type and the time the backup was taken. Backups beginning with boot are completed before the first change is made to the repository after system boot. Backups beginning with manifest_import are completed after svc:/system/manifest-import:default finishes its process. The time of the backup is given in YYYYMMDD_HHMMSS format.

Enter the appropriate response.

Typically, the most recent backup option is selected.

```
Please enter either a specific backup repository from the above list to
restore it, or one of the following choices:

        CHOICE            ACTION
        ----------------  ----------------------------------------------
        boot              restore the most recent post-boot backup
        manifest_import   restore the most recent manifest_import backup
        -seed-            restore the initial starting repository  (All
                            customizations will be lost, including those
                            made by the install/upgrade process.)
        -quit-            cancel script and quit

Enter response [boot]:
```

If you press Enter without specifying a backup to restore, the default response, enclosed in [], is selected. Selecting -quit- exits the restore_repository script, returning you to your shell prompt.

Caution - Selecting -seed- restores the seed repository. This repository is designed for use during initial installation and upgrades. Only use the seed repository for recovery purposes when no other service configuration change or backup service repository will work. All configuration changes will be lost, including changes to fundamental Oracle Solaris features that were delivered by installing or updating packages. Using the seed repository for recovery purposes should be a last resort.

After you have selected the backup that you want to restore, that backup is validated and its integrity is checked. If any problems are discovered, the restore_repository command prints error messages and prompts you for another selection. Once you have selected a valid backup, the following information is printed, and you are prompted for final confirmation.

```
After confirmation, the following steps will be taken:

svc.startd(8) and svc.configd(8) will be quiesced, if running.
/etc/svc/repository.db
    -- renamed --> /etc/svc/repository.db_old_YYYYMMDD_HHMMSS
/system/volatile/db_errors
    -- copied --> /etc/svc/repository.db_old_YYYYMMDD_HHMMSS_errors
repository_to_restore
    -- copied --> /etc/svc/repository.db
and the system will be rebooted with reboot(8).

Proceed [yes/no]?
```

Type yes to remedy the fault.
The system reboots after the restore_repository command executes all of the listed actions.

Specifying the Amount of Startup Messaging

By default, each service that starts during system boot does not display a message on the console. Use one of the following methods to change which messages appear on the console and which are recorded only in the svc.startd log file. The value of logging-level can be one of the values shown in the table below.

When booting a SPARC system, specify the -m option to the boot command at the ok prompt. See “Messages options” in the kernel(8) man page.
```
ok boot -m logging-level
```
When booting an x86 system, edit the GRUB menu to specify the -m option. See Adding Kernel Arguments at Boot Time in Booting and Shutting Down Oracle Solaris 11.4 Systems and “Messages options” in the kernel(8) man page.
Prior to rebooting a system, use the svccfg command to change the value of the options/logging property. If this property has never been changed on this system, then it will not exist and you will have to add it. The following example changes to verbose messaging. The change takes effect on the next restart of the svc.startd daemon.
```
$ svccfg -s system/svc/restarter:default listprop options/logging
$ svccfg -s system/svc/restarter:default addpg options application
$ svccfg -s system/svc/restarter:default setprop options/logging=verbose
$ svccfg -s system/svc/restarter:default listprop options/logging
options/logging astring     verbose
```

Table 2 SMF Startup Message Logging Levels

Logging Level Keyword	Description
`quiet`	Display on the console any error messages that require administrative intervention. Also record these messages in `syslog` and in `/var/svc/log/svc.startd.log`.
`verbose`	In addition to the messaging provided at the `quiet` level, display on the console a single message for each service started, and record in `/var/svc/log/svc.startd.log` information about errors that do not require administrative intervention.
`debug`	In addition to the messaging provided at the `quiet` level, display on the console a single message for each service started, and record any `svc.startd` debug messages in `/var/svc/log/svc.startd.log`.

Specifying the SMF Milestone to Which to Boot

When you boot a system, you can specify the SMF milestone to which to boot.

By default, all services for which the value of the general/enabled property is true are started at system boot. To change the milestone to which to boot a system, use one of the following methods. The value of milestone can be the FMRI of a milestone service or a keyword as shown in Table 3, SMF Boot Milestones and Corresponding Run Levels.

When booting a SPARC system, specify the -m option to the boot command at the ok prompt. See the -m option in the kernel(8) man page.
```
ok boot -m milestone=milestone
```
When booting an x86 system, edit the GRUB menu to specify the -m option. See Adding Kernel Arguments at Boot Time in Booting and Shutting Down Oracle Solaris 11.4 Systems and the -m option in the kernel(8) man page.
Prior to rebooting a system, use the svcadm milestone command with the -d option. Note that with or without the -d option, this command restricts and restores running services immediately. With the -d option, the command also makes the specified milestone the default boot milestone. This new default is persistent across reboots.
```
$ svcadm milestone -d milestone
```
This command does not change the current run level of the system. To change the current run level of the system, use the init command.

If you specify the -s option, svcadm changes the milestone and then waits for the transition to the specified milestone to complete before returning. The svcadm command returns when all instances have transitioned to the state necessary to reach the specified milestone or when it determines that administrator intervention is required to make a transition. Use the -T option with the -s option to specify an upper bound in seconds to complete the milestone change operation or return.

The following table describes SMF boot milestones, including any corresponding Oracle Solaris run level. A system’s run level defines what services and resources are available to users. A system can be in only one run level at a time. For information about run levels, see How Run Levels Work in Booting and Shutting Down Oracle Solaris 11.4 Systems, the inittab(5) man page, and the /etc/init.d/README file. For more information about SMF boot milestones, see the milestone subcommand in the svcadm(8) man page.

Table 3 SMF Boot Milestones and Corresponding Run Levels

SMF Milestone FMRI or Keyword	Corresponding Run Level	Description
`none`		The `none` keyword represents a milestone where no services are running except for the master restarter. When `none` is specified, all services except for `svc:/system/svc/restarter:default` are temporarily disabled. The `none` milestone can be useful for debugging startup problems. See How to Investigate Problems Starting Services at System Boot for specific instructions.
`all`		The `all` keyword represents a milestone that depends on every service. When `all` is specified, temporary enable and disable requests are ignored for all services. This is the default milestone used by `svc.startd`.
`svc:/milestone/single-user`	s or S	Ignore temporary enable and disable requests for `svc:/milestone/single-user:default` and all services on which it depends either directly or indirectly. Temporarily disable all other services.
`svc:/milestone/multi-user`	2	Ignore temporary enable and disable requests for `svc:/milestone/multi-user:default` and all services on which it depends either directly or indirectly. Temporarily disable all other services.
`svc:/milestone/multi-user-server`	3	Ignore temporary enable and disable requests for `svc:/milestone/multi-user-server:default` and all services on which it depends either directly or indirectly. Temporarily disable all other services.
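
The run-level column of this table can be expressed as a lookup when a script needs a milestone FMRI for a given run level. The following is a minimal sketch: the function name milestone_for_run_level is hypothetical, and it covers only the run levels that the table lists.

```shell
# Sketch: map a run level from Table 3 to its SMF milestone FMRI.
# milestone_for_run_level is a hypothetical helper name.
milestone_for_run_level() {
    case "$1" in
        s|S) echo svc:/milestone/single-user:default ;;
        2)   echo svc:/milestone/multi-user:default ;;
        3)   echo svc:/milestone/multi-user-server:default ;;
        *)   echo "no SMF milestone listed for run level '$1'" >&2
             return 1 ;;
    esac
}
```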

To determine the milestone to which a system is currently booted, use the svcs command. The following example shows that the system is booted to run level 3, milestone/multi-user-server:

```
$ svcs 'milestone/*'
STATE          STIME    FMRI
online          9:08:05 svc:/milestone/unconfig:default
online          9:08:06 svc:/milestone/config:default
online          9:08:07 svc:/milestone/devices:default
online          9:08:25 svc:/milestone/network:default
online          9:08:31 svc:/milestone/single-user:default
online          9:08:51 svc:/milestone/name-services:default
online          9:09:13 svc:/milestone/self-assembly-complete:default
online          9:09:23 svc:/milestone/multi-user:default
online          9:09:24 svc:/milestone/multi-user-server:default
```
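
A script can read the same listing to report the highest run-level milestone that is online. The following is a minimal sketch: the function name highest_milestone is hypothetical, and it considers only the three milestones that correspond to run levels.

```shell
# Sketch: print the highest run-level milestone that is online.
# highest_milestone is a hypothetical helper name.
highest_milestone() {
    online=$(svcs -H -o state,fmri 'milestone/*' |
        awk '$1 == "online" { print $2 }')
    # Check from the highest run level down to the lowest.
    for m in multi-user-server multi-user single-user; do
        if echo "$online" | grep -Fx "svc:/milestone/$m:default" >/dev/null; then
            echo "svc:/milestone/$m:default"
            return 0
        fi
    done
    echo "none of the run-level milestones is online" >&2
    return 1
}
```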

Using SMF to Investigate System Boot Problems

This section describes actions to take if your system hangs during boot or if a key service fails to start during boot.

How to Investigate Problems Starting Services at System Boot

If problems occur when starting services at system boot, sometimes the system will hang during boot. This procedure shows how to investigate services problems that occur at boot time.

Boot without starting any services.
The following command instructs the svc.startd daemon to temporarily disable all services and start sulogin on the console.
```
ok boot -m milestone=none
```
See Specifying the SMF Milestone to Which to Boot for a list of SMF milestones that you can use with the boot -m command.
Log in to the system as root.
Enable all services.
```
# svcadm milestone all
```
Determine where the boot process is hanging.
When the boot process hangs, determine which services are not running by running svcs -a. Look for error messages in the log files in /var/svc/log.
After fixing the problems, verify that all services have started.
1. Verify that all needed services are online.
```
# svcs -x
```
2. Verify that the console-login service dependencies are satisfied.
  This command verifies that the login process on the console will run.
```
# svcs -l system/console-login:default
```
Continue the normal booting process.
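
When you look for the point at which boot is hanging, the full svcs -a listing can be long. The following is a minimal sketch of a filter that keeps only instances in the offline, maintenance, or degraded state, including instances that are still transitioning. The function name not_running is hypothetical.

```shell
# Sketch: list instances that are offline, in maintenance, or degraded.
# A trailing asterisk on the state marks an instance in transition.
# not_running is a hypothetical helper name.
not_running() {
    svcs -a -H -o state,fmri | awk '
        { s = $1; sub(/\*$/, "", s) }
        s == "offline" || s == "maintenance" || s == "degraded"'
}
```

The output complements the svcs -x command, which also explains why each instance is not running.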

How to Force Single-User Login if the Local File System Service Fails During Boot

Local file systems that are not required to boot the system are mounted by the svc:/system/filesystem/local:default service. When any of those file systems cannot be mounted, the filesystem/local service enters a maintenance state. System startup continues, and any services that do not depend on filesystem/local are started. Services that have a required dependency on the filesystem/local service are not started.

This procedure explains how to change the configuration of the system so that a sulogin prompt appears immediately after the service fails instead of allowing system startup to continue.

Modify the system/console-login service.

```
$ svccfg -s svc:/system/console-login
svc:/system/console-login> addpg site,filesystem-local dependency
svc:/system/console-login> setprop site,filesystem-local/entities = fmri: svc:/system/filesystem/local
svc:/system/console-login> setprop site,filesystem-local/grouping = astring: require_all
svc:/system/console-login> setprop site,filesystem-local/restart_on = astring: none
svc:/system/console-login> setprop site,filesystem-local/type = astring: service
svc:/system/console-login> end
```

Refresh the service.
```
$ svcadm refresh console-login
```
When a failure occurs with the system/filesystem/local:default service, use the svcs -vx command to identify the failure. After the failure has been fixed, use the following command to clear the error state and allow the system boot to continue:
```
$ svcadm clear filesystem/local
```