This feature is new in the Solaris Express 6/04 release. The Solaris Express 10/04 release and the Solaris 10 3/05 release provided important enhancements.
Sun Microsystems has developed a new architecture for building and deploying systems and services capable of Predictive Self-Healing. Self-healing technology enables Sun systems and services to maximize availability when software and hardware faults occur. In addition, the self-healing technology facilitates a simpler and more effective end-to-end experience for system administrators and service providers, thereby reducing costs. The first major set of new features to result from this initiative is available in the Solaris 10 OS. The Solaris 10 software includes components that facilitate self-healing for CPU, memory, I/O bus nexus components, and system services.
Specific details about the components of this new architecture are covered in the following descriptions for the Solaris Service Manager and the Solaris Fault Manager.
Introduced in the Solaris Express 10/04 release and enhanced in the Solaris 10 3/05 release, the Solaris Service Manager provides an infrastructure that augments the traditional UNIX startup scripts, init run levels, and configuration files. This infrastructure provides the following functions:
Automatically restarts failed services in dependency order, whether the services failed as the result of administrator error, a software bug, or an uncorrectable hardware error.
Makes services first-class objects that can be viewed with the new svcs command and managed with the svcadm and svccfg commands. You can also use svcs -p to view the relationships between services and processes, for both SMF services and legacy init.d scripts.
Makes it easy to back up, restore, and undo changes to services by taking automatic snapshots of service configurations.
Simplifies debugging. The svcs -x command explains why a service isn't running, and each service writes to its own persistent log file.
Enhances the ability of administrators to securely delegate tasks to nonroot users, including the ability to modify properties and start, stop, or restart services on the system.
Boots faster on large systems by starting services in parallel, in the order their dependencies allow. Shutdown reverses this process.
Enables you to customize the boot console output either to be as quiet as possible, which is the default, or to be verbose by using boot -m verbose.
Preserves compatibility with existing administrative practices wherever possible. For example, most customer and ISV-supplied rc scripts still work as usual.
Enables you to configure your system services in one of two modes, both represented as smf(5) profiles. The “generic_open.xml” profile enables all the traditional Internet services that were previously enabled by default in the Solaris OS. The “generic_limited_net.xml” profile disables a large number of services that are frequently disabled during the process of hardening a system. However, this profile is not a replacement for the Solaris Security Toolkit (JASS) tool. See the individual profiles for details.
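The dependency-ordered, parallel startup described above can be illustrated with a topological sort that groups services into "waves": every service in a wave depends only on services in earlier waves, so a whole wave can start at once, and shutdown walks the waves in reverse. The following is a minimal sketch under stated assumptions: the service names and dependency graph are hypothetical, and the code illustrates the ordering idea only. It is not the Service Manager's actual implementation or API.

```python
from typing import Dict, List, Set


def startup_waves(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group services into waves (Kahn's topological sort, layer by layer).

    deps maps each service to the set of services it depends on.
    Every service in a wave has all of its dependencies in earlier waves,
    so the members of one wave can be started in parallel.
    """
    # Count unmet dependencies per service and build reverse edges
    # (dependency -> services that depend on it).
    remaining = {svc: len(reqs) for svc, reqs in deps.items()}
    dependents: Dict[str, List[str]] = {svc: [] for svc in deps}
    for svc, reqs in deps.items():
        for req in reqs:
            if req not in deps:
                raise ValueError(f"{svc} depends on undeclared service {req}")
            dependents[req].append(svc)

    ready = sorted(svc for svc, n in remaining.items() if n == 0)
    waves: List[List[str]] = []
    started = 0
    while ready:
        waves.append(ready)
        started += len(ready)
        next_ready = []
        for svc in ready:
            # Starting svc satisfies one dependency of each dependent.
            for dependent in dependents[svc]:
                remaining[dependent] -= 1
                if remaining[dependent] == 0:
                    next_ready.append(dependent)
        ready = sorted(next_ready)

    if started != len(deps):
        raise ValueError("dependency cycle detected")
    return waves


def shutdown_waves(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Stop dependents before the services they rely on."""
    return list(reversed(startup_waves(deps)))


# Hypothetical dependency graph, for illustration only.
example = {
    "network": set(),
    "filesystem": set(),
    "inetd": {"network", "filesystem"},
    "sshd": {"network", "filesystem"},
    "webapp": {"inetd"},
}
print(startup_waves(example))
print(shutdown_waves(example))
```

With this graph, "filesystem" and "network" start together, then "inetd" and "sshd", then "webapp"; shutdown stops "webapp" first. Grouping into waves, rather than producing one flat order, is what exposes the parallelism.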
See Chapter 9, “Managing Services (Overview),” in the System Administration Guide: Basic Administration for more information about this infrastructure. An overview of the infrastructure can be found in the smf(5) man page.
Predictive Self-Healing systems include a simplified administration model. Traditional error messages are replaced by telemetry events that are consumed by software components. These components automatically diagnose the underlying fault or defect and initiate self-healing activities, such as messaging the administrator, isolating or deactivating faulty components, and guiding repair. A new software component, the Fault Manager, fmd(1M), manages the telemetry, log files, and components. The Solaris 10 OS also includes the new fmadm(1M), fmdump(1M), and fmstat(1M) tools for interacting with the Fault Manager and its log files.
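The telemetry-to-diagnosis flow described above can be illustrated with a toy diagnosis engine that consumes error events, logs them, and declares a fault when one component reports too many errors within a time window, at which point it isolates that component. This is a hypothetical sketch: the event format, the count-within-a-window rule, the threshold values, and the names ErrorEvent, Fault, and ThresholdDiagnoser are all assumptions made for illustration. It does not represent the Fault Manager's actual diagnosis modules or interfaces.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Optional, Set


@dataclass(frozen=True)
class ErrorEvent:
    """Hypothetical telemetry event: which component erred, and when."""
    component: str
    timestamp: float  # seconds


@dataclass(frozen=True)
class Fault:
    """Hypothetical diagnosis result plus the self-healing action taken."""
    component: str
    action: str


class ThresholdDiagnoser:
    """Declare a fault after `threshold` errors within `window` seconds."""

    def __init__(self, threshold: int, window: float) -> None:
        self.threshold = threshold
        self.window = window
        self._recent: Dict[str, Deque[float]] = defaultdict(deque)
        self.isolated: Set[str] = set()
        self.log: List[ErrorEvent] = []  # all telemetry, kept for later review

    def consume(self, event: ErrorEvent) -> Optional[Fault]:
        self.log.append(event)
        if event.component in self.isolated:
            return None  # already diagnosed and taken out of service
        times = self._recent[event.component]
        times.append(event.timestamp)
        # Forget errors that fall outside the time window.
        while times and event.timestamp - times[0] > self.window:
            times.popleft()
        if len(times) >= self.threshold:
            self.isolated.add(event.component)
            times.clear()
            return Fault(event.component, "isolate")
        return None


diag = ThresholdDiagnoser(threshold=3, window=60.0)
for t in (0.0, 10.0, 20.0):
    result = diag.consume(ErrorEvent("cpu3", t))
print(result)
```

A single transient error does not trigger a fault, while a burst of errors on one component does. Rate-based rules like this one are a common way to separate occasional soft errors from a failing part.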
When appropriate, the Fault Manager sends a message to the syslogd(1M) service to notify an administrator that a problem has been detected. The message directs administrators to a knowledge article on Sun's new message Web site, http://www.sun.com/msg/, which explains the impact of the problem and the appropriate responses and repair actions.
The Solaris Express 6/04 release introduced self-healing components for automated diagnosis and recovery for UltraSPARC-III and UltraSPARC-IV CPU and memory systems. This release also provided enhanced resilience and telemetry for PCI-based I/O.