Writing Device Drivers

Chapter 8 Power Management

This chapter describes the interfaces for the Power Management framework, which regulates and reduces the power consumed by computer systems and devices.

Power Management Overview

Power management provides the ability to control and manage the electrical power usage of a computer system or device. Power management enables systems to conserve energy by using less power when idle and by shutting down completely when not in use. For example, desktop computer systems can use a significant amount of power, and often (particularly at night) they are left idle. Power management software can detect that the system is not being used and power it or some of its components down. Power management can also be used in battery-powered computers (such as notebook computers) to extend battery life by powering down unused components.

The Solaris Power Management framework depends on device drivers to implement the device-specific power management functionality, such as detection of idleness in the device and changing the power state of the device. In order for a driver to do this, the device must be designed to support multiple power states.

The Solaris Power Management framework is implemented in two ways: device power management and system power management.

Device Power Management

To perform effective device power management, system software monitors the different components of the device and determines when they are not in use. Since only device drivers are able to determine when a device is idle, and only device drivers are able to reduce power consumption of a device, the Power Management framework exports interfaces to enable communication between the system software and the device driver.

The Solaris Power Management framework provides the following:

System Power Management

System power management consists of turning off the entire computer after saving its state so that it can be returned to the same state immediately when it is turned back on.

To shut down an entire system and later return it to the state it was in prior to the shutdown, it is necessary to stop (and later restart) kernel threads and user processes, notify interested processes that the system has been suspended, and save (and later restore) the hardware state of all devices on the system. System power management is currently implemented only on some SPARC systems supported by the Solaris 7 software.

The Solaris System Power Management framework provides the following:

Power Management Additions to the State Structure

This chapter adds the following fields to the state structure. See "Software State Management" for more information.

	struct xx_saved_device_state xx_device_state;	/* saved hardware state */
	int xx_suspended;	/* suspended for system power management */
	int xx_pm_suspended;	/* suspended for device power management */
	int xx_power_level[num_components];	/* component power level */

Device Power Management Model

The following sections describe the details of the device power management model. This model includes the following elements:

Components

In the power management model, each device is composed of zero or more power-manageable components. If a device has no components, then the device is not power manageable.

Components correspond to parts of the device that can be put into a state that requires less power than normal. The definition of which components a device implements depends on the device driver writer, with one exception: component zero must represent all parts of the device that have hardware state that would be lost if power were to be completely removed from the device.

The device driver notifies the system of the device components by calling pm_create_components(9F) in its attach(9E) entry point as part of driver initialization.

Idleness

Each component of a device may be in one of two states: busy or idle. The device driver notifies the framework of changes in the device state by calling pm_busy_component(9F) and pm_idle_component(9F).

Power Levels

The current implementation of the Device Power Management framework keeps track of only two power levels for each component: its current power level and its normal power level. The normal power level of a component is the power level required for normal operation of the component, and is the power level to which the component is returned by the framework when the device is needed. The device driver informs the framework of the normal power level of the component by calling pm_set_normal_power(9F).

The current power level of the component is the power level at which the component is currently operating. The device driver should ensure that the component is set to the normal power level at initialization time. The framework assumes that when a device attaches it will be operating at its normal power level until the framework power manages it.

Power-level values that represent power on states must be positive integers greater than zero. A value of zero means the device has been set to the lowest operating power available.

Dependency

A device component might depend on one or more other devices. A device component depends on another device if the component can be powered off only when all the components of all the devices it depends on are also powered off. For example, the component of the frame buffer device that represents the monitor depends on the mouse and keyboard devices. The frame buffer monitor component can thus only be powered off when both the mouse and keyboard devices are powered off.

The power.conf(4) file specifies the dependencies among devices.

Policy

The power.conf(4) file lists the devices that may be powered off and specifies dependencies between devices. Associated with each component of a device is a threshold of idle time. The threshold for each power-manageable device component is also specified in the power.conf(4) file.

The system checks the state of each device specified in power.conf(4). When a component has been idle for threshold seconds and all the dependents of the device are powered off, that component of the device is set to power level zero.
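The policy check described above can be sketched as a small user-space model. This is not framework code: struct model_device, model_device_is_off(), and model_may_power_off() are hypothetical names introduced only for illustration, and the model assumes that a device counts as powered off only when every one of its components is at power level zero.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a device: the current power level of each component. */
struct model_device {
	const int *levels;	/* current power level per component */
	size_t ncomponents;	/* number of components */
};

/* Assumption: a device is "powered off" when all components are at level 0. */
static bool
model_device_is_off(const struct model_device *dev)
{
	for (size_t i = 0; i < dev->ncomponents; i++) {
		if (dev->levels[i] != 0)
			return (false);
	}
	return (true);
}

/*
 * A component may be set to power level zero once it has been idle for
 * at least its threshold and every device it depends on is powered off.
 */
static bool
model_may_power_off(int idle_seconds, int threshold,
    const struct model_device *deps, size_t ndeps)
{
	if (idle_seconds < threshold)
		return (false);
	for (size_t i = 0; i < ndeps; i++) {
		if (!model_device_is_off(&deps[i]))
			return (false);
	}
	return (true);
}
```

In the frame buffer example, deps would hold the keyboard and mouse devices, so the monitor component stays powered on while either of them is still powered on, no matter how long the monitor itself has been idle.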

Device Power Management Interfaces

A device driver that supports a device with power-manageable components must notify the system of the existence of these components and their normal power values, and notify the system of the component state transitions from idle to busy and vice versa.

The notification of the existence of the components and their normal power values is typically done in the driver's attach(9E) entry point as part of driver initialization. The following interfaces handle creating and destroying device components and setting and getting the normal power levels of device components.

pm_create_components()

	int pm_create_components(dev_info_t *dip, int components);

pm_create_components(9F) notifies the system that the device indicated by dip has the number of components indicated by components. This function is called in the attach(9E) routine of the device driver.

pm_destroy_components()

	void pm_destroy_components(dev_info_t *dip);

pm_destroy_components(9F) removes all the components associated with the device indicated by dip from the system. This function is called in the detach(9E) routine.

pm_set_normal_power()

	void pm_set_normal_power(dev_info_t *dip, int component, int level);

pm_set_normal_power(9F) sets the normal power level for the specified component. Whenever the system turns the component on again, it calls into the driver to set the current power level to normal power level.

pm_get_normal_power()

	int pm_get_normal_power(dev_info_t *dip, int component);

pm_get_normal_power(9F) retrieves the current setting of the normal power level for a component.

Busy-Idle State Transitions

The driver must keep the framework informed of device state transitions from idle to busy or busy to idle. Where these transitions happen is entirely device specific. Some components are created and marked busy and never change. Some are created and never marked busy (components created by pm_create_components(9F) are created in an idle state). For example, a frame buffer currently supports two components: component 0 represents the frame buffer electronics and is always busy, and component 1 represents the monitor and is always idle (but dependent on the keyboard and mouse).


Note -

Component 0 represents the state of the device that would be lost if power is removed.


Some devices, such as the keyboard and mouse, are never marked busy but have their idle time reset each time a keystroke or mouse event is processed. The transitions from idle to busy and from busy to idle depend on the nature of the device and the abstraction represented by the specific component. For example, SCSI disk target drivers typically export a single component, which represents whether the SCSI target disk drive is spun up or not. It is marked busy whenever there is an outstanding request to the drive and idle when the last queued request finishes.

The following interfaces notify the Power Management framework of busy-idle state transitions.

pm_busy_component()

	int pm_busy_component(dev_info_t *dip, int component);

pm_busy_component(9F) marks the component as busy.

While the component is busy, it will not be powered off. If the component is already powered off, marking it busy does not change its power level; to power the component back up, the driver must call ddi_dev_is_needed(9F). Calls to pm_busy_component(9F) are stacked and require a corresponding number of calls to pm_idle_component(9F) to idle the component.

pm_idle_component()

	int pm_idle_component(dev_info_t *dip, int component);

pm_idle_component(9F) marks the component as idle. An idle component is subject to being powered off.

pm_idle_component(9F) must be called once for each call to pm_busy_component(9F) in order to idle the component.
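The stacking behavior can be pictured as a per-component counter. The following is a user-space model only, not the framework's implementation: model_busy_component(), model_idle_component(), model_component_is_idle(), and MODEL_NCOMPONENTS are hypothetical names, and rejecting an unmatched idle call with -1 is a choice made by this model rather than documented behavior of pm_idle_component(9F).

```c
#include <stdbool.h>

#define	MODEL_NCOMPONENTS	2	/* hypothetical number of components */

/* One busy counter per component; components start out idle (count 0). */
static int model_busy_count[MODEL_NCOMPONENTS];

/* Each busy call adds one level to the stack for that component. */
static int
model_busy_component(int component)
{
	if (component < 0 || component >= MODEL_NCOMPONENTS)
		return (-1);
	model_busy_count[component]++;
	return (0);
}

/* Each idle call removes one level; unmatched calls are rejected here. */
static int
model_idle_component(int component)
{
	if (component < 0 || component >= MODEL_NCOMPONENTS)
		return (-1);
	if (model_busy_count[component] == 0)
		return (-1);
	model_busy_count[component]--;
	return (0);
}

/* The component is idle only when every busy call has been matched. */
static bool
model_component_is_idle(int component)
{
	return (model_busy_count[component] == 0);
}
```

After two busy calls and one idle call the component is still busy in this model; it becomes idle, and therefore eligible to be powered off, only after the second idle call.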

Device Power State Transitions

A device driver can call ddi_dev_is_needed(9F) to request that a component be set to a given power level. This is necessary before using a component that has been powered off. For example, a SCSI disk target driver's read(9E) or write(9E) routine might need to spin up the disk if it had been powered off before completing the read or write. ddi_dev_is_needed(9F) notifies the Power Management framework of device state transitions.

ddi_dev_is_needed()

	int ddi_dev_is_needed(dev_info_t *dip, int component,
 						 int level);

ddi_dev_is_needed(9F) is called when the driver discovers that a component needed for some operation has been powered off. This interface arranges for the driver to be called to set the current power level of the component to the level specified in the request. All the devices that depend on this component are also brought back to normal power by this call.

When a component has been powered off by pm(7D) and a request or interrupt occurs that requires the component to be powered up, the driver must call ddi_dev_is_needed(9F) so that the framework can restore the component (and all of the devices that depend on it) to normal power.

Entry Points Used by Device Power Management

The Power Management framework uses the following entry points:

power()

	int xxpower(dev_info_t *dip, int component, int level);

The system calls the power(9E) entry point (either directly or as a result of a call to ddi_dev_is_needed(9F)) when it determines that a component's current power level needs to be changed. The action taken by this entry point is device driver specific. In the example of the SCSI target disk driver mentioned previously, setting the power level of the component to 0 results in sending a SCSI command to spin down the disk, while setting the power level to the normal power level results in sending a SCSI command to spin up the disk. Example 8-1 shows a sample power(9E) routine.


Example 8-1 power(9E) Routine

int
xxpower(dev_info_t *dip, int component, int level)
{
	struct xxstate *xsp;
	int instance;

	instance = ddi_get_instance(dip);
	xsp = ddi_get_soft_state(statep, instance);

	/*
	 * Make sure that the request is valid
	 */
	if (!xx_valid_power_level(component, level))
		return (DDI_FAILURE);

	mutex_enter(&xsp->mu);
	if (xsp->xx_power_level[component] != level) {
		/* device- and component-specific setting of power level */
		xsp->xx_power_level[component] = level;
	}
	mutex_exit(&xsp->mu);
	return (DDI_SUCCESS);
}
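Example 8-1 relies on a driver-private helper, xx_valid_power_level(), that this chapter does not define. The following is one possible sketch, assuming the helper returns true for a valid request, that the driver has two components, and that the driver records each component's normal power level in a table. XX_NUM_COMPONENTS, xx_normal_power, and the level values are all hypothetical.

```c
#include <stdbool.h>

#define	XX_NUM_COMPONENTS	2	/* hypothetical component count */

/* Hypothetical normal power level of each component. */
static const int xx_normal_power[XX_NUM_COMPONENTS] = { 1, 3 };

/*
 * Accept a request only if the component exists and the level lies
 * between 0 (lowest power) and the component's normal power level.
 */
static bool
xx_valid_power_level(int component, int level)
{
	if (component < 0 || component >= XX_NUM_COMPONENTS)
		return (false);
	return (level >= 0 && level <= xx_normal_power[component]);
}
```

A real driver might accept only the specific levels its hardware implements instead of a continuous range; the range check here simply keeps the sketch short.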

detach()

	int detach(dev_info_t *dip, ddi_detach_cmd_t cmd);

Before the system sets component 0 (entire device) to power level 0, it calls the driver's detach(9E) entry point with a detach command of DDI_PM_SUSPEND to allow the driver to save all hardware state to memory.

If the device is busy and has outstanding operations, it should fail the detach(9E) call. The framework will try again later after the device has been idle for its threshold time. Otherwise, the driver must arrange to block all subsequent accesses to the hardware until the device has been resumed (which the driver can initiate by calling ddi_dev_is_needed(9F)), and save all hardware state to memory.

Example 8-2 shows an example of a detach(9E) routine with DDI_PM_SUSPEND implemented.


Example 8-2 detach(9E) Routine Showing the Use of DDI_PM_SUSPEND

int
xxdetach(dev_info_t *dip, ddi_detach_cmd_t cmd)
{
	struct xxstate *xsp;
	int instance;

	instance = ddi_get_instance(dip);
	xsp = ddi_get_soft_state(statep, instance);

	switch (cmd) {
	case DDI_DETACH:
		/* see chapter 5, Autoconfiguration, for discussion */

	case DDI_SUSPEND:
		/* see Example 8-4 */

	case DDI_PM_SUSPEND:
		/*
		 * We won't be called with DDI_PM_SUSPEND when already called
		 * with DDI_SUSPEND.
		 */
		mutex_enter(&xsp->mu);
		if (xsp->xx_busy) {
			mutex_exit(&xsp->mu);
			return (DDI_FAILURE);
		}

		xsp->xx_pm_suspended = 1;
		/* Save device register contents into xsp->xx_device_state */

		/*
		 * This section is optional, only needed if the driver
		 * maintains a running timeout (but be sure to drop the
		 * mutex in any case)
		 */
		/* cancel timeouts */
		if (xsp->xx_timeout_id) {
			timeout_id_t temp_timeout_id = xsp->xx_timeout_id;

			xsp->xx_timeout_id = 0;
			mutex_exit(&xsp->mu);
			untimeout(temp_timeout_id);
		} else {
			mutex_exit(&xsp->mu);
		}
		return (DDI_SUCCESS);

	default:
		return (DDI_FAILURE);
	}
}

attach()

	int attach(dev_info_t *dip, ddi_attach_cmd_t cmd);

When a device that has been suspended is needed again, its power(9E) entry point is called to restore the power level of component 0 (entire device) to its normal power. The driver's attach(9E) entry point is then called with an attach command value of DDI_PM_RESUME to restore the device hardware state saved in the detach(9E) routine and unblock any pending operations. Example 8-3 shows an attach(9E) routine with DDI_PM_RESUME implemented.


Example 8-3 attach(9E) Showing the Use of DDI_PM_RESUME

int
xxattach(dev_info_t *dip, ddi_attach_cmd_t cmd)
{
	struct xxstate *xsp;
	int instance;

	instance = ddi_get_instance(dip);
	xsp = ddi_get_soft_state(statep, instance);

	switch (cmd) {
	case DDI_ATTACH:
		/* see chapter 5, Autoconfiguration, for discussion */

	case DDI_RESUME:
		/* see Example 8-5 for DDI_RESUME implementation */

	case DDI_PM_RESUME:
		/*
		 * We won't be DDI_PM_RESUMEd while DDI_SUSPENDed
		 */
		mutex_enter(&xsp->mu);
		/* Restore device register contents from xsp->xx_device_state */

		/*
		 * This section is optional, only needed if the driver
		 * maintains a running timeout
		 */
		/* restart timeouts */
		xsp->xx_timeout_id = timeout({...});

		xsp->xx_pm_suspended = 0;	/* allow new operations */
		cv_broadcast(&xsp->cv);
		mutex_exit(&xsp->mu);
		return (DDI_SUCCESS);

	default:
		return (DDI_FAILURE);
	}
}

System Power Management Model

This section describes the details of the System Power Management model. The model includes the following components:

Autoshutdown Threshold

The system may be shut down (powered off) automatically after a configurable period of idleness. This period is known as the autoshutdown threshold. This behavior may be suppressed.

Busy State

There are several ways to measure the busy state of the system. The currently supported built-in metrics are keyboard characters, mouse activity, tty characters, load average, disk reads, and NFS requests. Any one of these metrics may make the system busy. In addition to the built-in metrics, an interface is defined for running a user-specified process that may indicate that the system is busy.

Hardware State

Devices that export a reg property are considered to have hardware state that must be saved prior to shutting down the system. If a device does not have a reg property, then it is considered to be stateless. However, this consideration can be overridden by the device driver.

A device that has hardware state but no reg property (such as a SCSI target driver, which has hardware at the other end of the SCSI bus), is called to save and restore its state if it exports a pm-hardware-state property with the value needs-suspend-resume. Otherwise, the lack of a reg property is taken to mean that the device has no hardware state. For information on device properties, see "Device Attribute Representations".

A device that has a reg property but no hardware state may export a pm-hardware-state property with the value no-suspend-resume to keep the framework from calling into the driver to save and restore that state. For more information on Power Management properties, see the pm_props(9E) man page.
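The rules in this section reduce to a short decision function. The following user-space sketch is illustrative only: model_needs_suspend_resume() is a hypothetical name, not a DDI interface, and falling back to the reg rule for an unrecognized property value is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Decide whether a driver is called to save and restore hardware state.
 * has_reg:            the device exports a reg property
 * pm_hardware_state:  value of the pm-hardware-state property,
 *                     or NULL if the property is not exported
 */
static bool
model_needs_suspend_resume(bool has_reg, const char *pm_hardware_state)
{
	if (pm_hardware_state != NULL) {
		/* An explicit property value overrides the reg default. */
		if (strcmp(pm_hardware_state, "needs-suspend-resume") == 0)
			return (true);
		if (strcmp(pm_hardware_state, "no-suspend-resume") == 0)
			return (false);
	}
	/* Default: a reg property implies hardware state. */
	return (has_reg);
}
```

For example, a SCSI target driver with no reg property would pass false and "needs-suspend-resume" and still be called to save and restore its state.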

Policy

The system will be shut down if the following conditions apply:

Entry Points Used by System Power Management

System power management passes the command DDI_SUSPEND to the detach(9E) driver entry point to request the driver to save the device hardware state. It passes the command DDI_RESUME to the attach(9E) driver entry point to request the driver to restore the device hardware state. If a device has a reg property or a pm-hardware-state property with a value of needs-suspend-resume, then the framework calls into the driver's detach(9E) entry point to allow the driver to save the hardware state of the device to memory so that it can be restored after the system power returns.

detach()

	int detach(dev_info_t *dip, ddi_detach_cmd_t cmd);

To process the DDI_SUSPEND command, detach(9E) must do the following:

If, for some reason, the driver is not able to suspend the device and save its state to memory, then it must return DDI_FAILURE, and the framework aborts the system power management operation.

Dump requests must be honored. The framework uses the dump(9E) entry point to write out the state file containing the contents of memory. See dump(9E) for restrictions imposed on the device driver when using this entry point.


Note -

The entry point dump(9E) was previously used only for writing kernel crash dumps to disk. It is now also used to write out the state file containing the information necessary to restore the system to its state prior to a system power management suspend.


If the device implements a power-manageable component zero, the device may already have been suspended and powered off using the command DDI_PM_SUSPEND when its detach(9E) entry point is called with the DDI_SUSPEND command. The additional processing necessary in this case is to cancel pending timeouts and suppress the call to ddi_dev_is_needed(9F) until the device is resumed by a call to attach(9E) with a command of DDI_RESUME. The driver must keep sufficient track of its state to be able to deal appropriately with this possibility.

Example 8-4 shows an example of a detach(9E) routine with the DDI_SUSPEND command implemented.


Example 8-4 detach(9E) Routine Showing the Use of DDI_SUSPEND

int
xxdetach(dev_info_t *dip, ddi_detach_cmd_t cmd)
{
	struct xxstate *xsp;
	int instance;

	instance = ddi_get_instance(dip);
	xsp = ddi_get_soft_state(statep, instance);

	switch (cmd) {
	case DDI_DETACH:
		/* see chapter 5, Autoconfiguration, for discussion */

	case DDI_SUSPEND:
		mutex_enter(&xsp->mu);
		xsp->xx_suspended = 1;	/* stop new operations */
		if (!xsp->xx_pm_suspended) {
			/*
			 * This code assumes that we'll get a cv_broadcast when
			 * we're no longer busy
			 */
			while (xsp->xx_busy)	/* wait for pending ops */
				cv_wait(&xsp->xx_busy_cv, &xsp->mu);
			/* Save device register contents into xsp->xx_device_state */
			/*
			 * If a callback is outstanding which cannot be
			 * cancelled then either wait for the callback
			 * to complete or fail the suspend request
			 */
			/*
			 * This section is optional, only needed if the driver
			 * maintains a running timeout (but be sure to drop the
			 * mutex in any case)
			 */
			/* cancel timeouts */
			if (xsp->xx_timeout_id) {
				timeout_id_t temp_timeout_id = xsp->xx_timeout_id;

				xsp->xx_timeout_id = 0;
				mutex_exit(&xsp->mu);
				untimeout(temp_timeout_id);
			} else {
				mutex_exit(&xsp->mu);
			}
		} else {
			mutex_exit(&xsp->mu);
		}
		return (DDI_SUCCESS);

	case DDI_PM_SUSPEND:
		/* see Example 8-2 */

	default:
		return (DDI_FAILURE);
	}
}

attach()

	int attach(dev_info_t *dip, ddi_attach_cmd_t cmd);

When power is restored to the system, each device with a reg property or with a pm-hardware-state property of value needs-suspend-resume has its attach(9E) entry point called with a command value of DDI_RESUME. If the system shutdown was aborted for some reason, each driver that was suspended is called to resume, even though the power has not been shut off. Consequently, the resume code in attach(9E) must make no assumptions about the state of the hardware; it may or may not have lost power.

The resume code must restore the hardware state from the saved image in memory (possibly including reloading firmware), reregister any necessary timeouts, and unblock any pending requests.

Example 8-5 shows an example of an attach(9E) routine with the DDI_RESUME command.


Example 8-5 attach(9E) Routine Showing the Use of DDI_RESUME

int
xxattach(dev_info_t *dip, ddi_attach_cmd_t cmd)
{
	struct xxstate *xsp;
	int instance;

	instance = ddi_get_instance(dip);
	xsp = ddi_get_soft_state(statep, instance);

	switch (cmd) {
	case DDI_ATTACH:
		/* see chapter 5, Autoconfiguration, for discussion */

	case DDI_RESUME:
		mutex_enter(&xsp->mu);
		if (!xsp->xx_pm_suspended) {
			/* Restore device register contents from xsp->xx_device_state */
			/*
			 * This section is optional, only needed if the driver
			 * maintains a running timeout
			 */
			/* restart timeouts */
			xsp->xx_timeout_id = timeout({...});
		}
		xsp->xx_suspended = 0;	/* allow new operations */
		cv_broadcast(&xsp->cv);
		mutex_exit(&xsp->mu);
		return (DDI_SUCCESS);

	case DDI_PM_RESUME:
		/* see Example 8-3 */

	default:
		return (DDI_FAILURE);
	}
}


Note -

The detach(9E) and attach(9E) interfaces may also be used to resume a system that has been quiesced.


Device Access

If power management is supported, and detach(9E) and attach(9E) have code such as shown in the previous examples, the code fragment in Example 8-6 can be used wherever the driver is about to access the device from user context (for example, in read(2), write(2), or ioctl(2)).

In the following example, it is assumed that the operation about to be performed requires the component identified by component to be operating at power level level.


Example 8-6 Device Access

/*
 * Because multiple threads may come through this code
 * simultaneously and ddi_dev_is_needed() does not
 * atomically set the power level and trigger the attach
 * call with DDI_PM_RESUME, a lot of checking is done here
 *
 * prevent us from being powered down again immediately
 * due to still being idle
 */
pm_busy_component(dip, component);
mutex_enter(&xsp->mu);
do {
	/*
	 * Block commands if/while device suspended via DDI_SUSPEND
	 */
	while (xsp->xx_suspended)
		cv_wait(&xsp->cv, &xsp->mu);
	/* system may have been power cycled here */
	if (xsp->xx_power_level[component] < level) {
		/*
		 * Drop mutex because ddi_dev_is_needed will result in
		 * a call back into our power and/or attach routine
		 */
		mutex_exit(&xsp->mu);
		ddi_dev_is_needed(dip, component, level);
		mutex_enter(&xsp->mu);
	}
	/*
	 * Block commands if device still suspended with
	 * DDI_PM_SUSPEND; because we had to drop the mutex to
	 * call ddi_dev_is_needed we may be executing in a
	 * thread that came in after the power level was raised
	 * but before attach was called with DDI_PM_RESUME
	 */
	while (xsp->xx_pm_suspended)
		cv_wait(&xsp->cv, &xsp->mu);
	/*
	 * Because we may have dropped the lock in the cv_wait,
	 * we could have gotten a DDI_SUSPEND request, or we could
	 * have found the power level high enough on the way in
	 * but got powered down after we checked it, so we have
	 * to check everything over again (except the
	 * xx_pm_suspended state that we checked in the last
	 * while loop) because any of the things we tested before
	 * may have changed when we dropped the mutex
	 */
} while (xsp->xx_suspended ||
    xsp->xx_power_level[component] < level);
/* check for busy, initiate commands, and so on */
/* when command completes and there are no more commands pending */
pm_idle_component(dip, component);
....

Power Management Flow of Control

The following sections describe the general flow of control for power management states from busy to powered off. The sequence of states differs depending on whether the component is component zero or another component. Figure 8-1 illustrates the flow of control in the Power Management framework.

Device Power Management Flow of Control for Component Zero

When a component's activity is complete, a driver can call pm_idle_component(9F) to mark the component as idle. When the component has been idle for its threshold time, the framework may power off the component. If the component is component zero, the framework calls the driver's detach(9E) entry point with the command DDI_PM_SUSPEND to enable the driver to save all hardware state to memory. The framework then calls the power(9E) entry point to set the power level of the component to power level zero.

When component zero is needed again, the driver calls ddi_dev_is_needed(9F) on the powered off component. The framework then calls power(9E) to power up the component and calls attach(9E) with the DDI_PM_RESUME command to restore the state of the device and unblock any pending operations. At this point, the ddi_dev_is_needed(9F) call returns to the device driver. The component is idle but powered on, and the driver can mark it as busy by calling pm_busy_component(9F).

Device Power Management Flow of Control for Components Other Than Component Zero

As with component zero, the driver uses pm_idle_component(9F) to mark components other than zero as idle. The framework may power off an idle component other than component zero by calling the power(9E) entry point to set the component to power level zero.

When a driver finds that a needed component other than component zero is powered off, the driver calls ddi_dev_is_needed(9F) on that component. The framework calls power(9E) to set the new power level before ddi_dev_is_needed(9F) returns. ddi_dev_is_needed(9F) keeps the framework informed of the state of the device and arranges for devices that this component depends on to be powered up. When ddi_dev_is_needed(9F) returns, the component is idle but powered on. The driver can mark it as busy by calling pm_busy_component(9F).
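The difference between the two flows can be summarized in a small user-space model that only records the order in which the driver's entry points would be called. It does not call any DDI interface: model_power_down(), model_dev_is_needed(), and the enum model_call values are hypothetical stand-ins for the framework actions described in the two preceding sections.

```c
#include <stddef.h>

/* Hypothetical labels for the driver entry points the framework calls. */
enum model_call {
	MODEL_DETACH_PM_SUSPEND,	/* detach(9E) with DDI_PM_SUSPEND */
	MODEL_POWER_OFF,		/* power(9E) with level 0 */
	MODEL_POWER_NORMAL,		/* power(9E) with the normal level */
	MODEL_ATTACH_PM_RESUME		/* attach(9E) with DDI_PM_RESUME */
};

#define	MODEL_MAX_CALLS	8

static enum model_call model_log[MODEL_MAX_CALLS];
static size_t model_ncalls;

/* Append one call to the log, silently dropping calls past the limit. */
static void
model_record(enum model_call call)
{
	if (model_ncalls < MODEL_MAX_CALLS)
		model_log[model_ncalls++] = call;
}

/* Threshold expired: only component zero saves hardware state first. */
static void
model_power_down(int component)
{
	if (component == 0)
		model_record(MODEL_DETACH_PM_SUSPEND);
	model_record(MODEL_POWER_OFF);
}

/* Driver needs the component: power up, then resume component zero. */
static void
model_dev_is_needed(int component)
{
	model_record(MODEL_POWER_NORMAL);
	if (component == 0)
		model_record(MODEL_ATTACH_PM_RESUME);
}
```

For component zero the recorded sequence is suspend, power off, power up, resume; for any other component only the two power(9E) calls appear.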

Figure 8-1 Power Management Conceptual State Diagram
