Writing Device Drivers

Part I Designing Device Drivers for the Oracle Solaris Platform

The first part of this manual provides general information for developing device drivers on the Oracle Solaris platform. This part includes the following chapters:

Chapter 1, Overview of Oracle Solaris Device Drivers provides an introduction to device drivers and associated entry points on the Oracle Solaris platform. The entry points for each device driver type are presented in tables.
Chapter 2, Oracle Solaris Kernel and Device Tree provides an overview of the Oracle Solaris kernel with an explanation of how devices are represented as nodes in a device tree.
Chapter 3, Multithreading describes the aspects of the Oracle Solaris multithreaded kernel that are relevant for device driver developers.
Chapter 4, Properties describes the set of interfaces for using device properties.
Chapter 5, Managing Events and Queueing Tasks describes how device drivers log events and how to use task queues to perform a task at a later time.
Chapter 6, Driver Autoconfiguration explains the support that a driver must provide for autoconfiguration.
Chapter 7, Device Access: Programmed I/O describes the interfaces and methodologies for drivers to read or write to device memory.
Chapter 8, Interrupt Handlers describes the mechanisms for handling interrupts. These mechanisms include allocating, registering, servicing, and removing interrupts.
Chapter 9, Direct Memory Access (DMA) describes direct memory access (DMA) and the DMA interfaces.
Chapter 10, Mapping Device and Kernel Memory describes interfaces for managing device and kernel memory.
Chapter 11, Device Context Management describes the set of interfaces that enable device drivers to manage user access to devices.
Chapter 12, Power Management explains the interfaces for the Power Management feature, a framework for managing power consumption.
Chapter 13, Hardening Oracle Solaris Drivers describes how to integrate fault management capabilities into I/O device drivers, how to incorporate defensive programming practices, and how to use the driver hardening test harness.
Chapter 14, Layered Driver Interface (LDI) describes the LDI, which enables kernel modules to access other devices in the system.

Chapter 1 Overview of Oracle Solaris Device Drivers

This chapter gives an overview of Oracle Solaris device drivers. The chapter provides information on the following subjects:

Device Driver Basics

This section introduces you to device drivers and their entry points on the Oracle Solaris platform.

What Is a Device Driver?

A device driver is a kernel module that is responsible for managing the low-level I/O operations of a hardware device. Device drivers are written with standard interfaces that the kernel can call to interface with a device. Device drivers can also be software-only, emulating a device that exists only in software, such as RAM disks, buses, and pseudo-terminals.

A device driver contains all the device-specific code necessary to communicate with a device. This code includes a standard set of interfaces to the rest of the system. This interface shields the kernel from device specifics just as the system call interface protects application programs from platform specifics. Application programs and the rest of the kernel need little, if any, device-specific code to address the device. In this way, device drivers make the system more portable and easier to maintain.

When the Oracle Solaris operating system (Oracle Solaris OS) is initialized, devices identify themselves and are organized into the device tree, a hierarchy of devices. In effect, the device tree is a hardware model for the kernel. An individual device driver is represented as a node in the tree with no children. This type of node is referred to as a leaf driver. A driver that provides services to other drivers is called a bus nexus driver and is shown as a node with children. As part of the boot process, physical devices are mapped to drivers in the tree so that the drivers can be located when needed. For more information on how the Oracle Solaris OS accommodates devices, see Chapter 2, Oracle Solaris Kernel and Device Tree.

Device drivers are classified by how they handle I/O. Device drivers fall into three broad categories:

Block device drivers – For cases where handling I/O data as asynchronous chunks is appropriate. Typically, block drivers are used to manage devices with physically addressable storage media, such as disks.
Character device drivers – For devices that perform I/O on a continuous flow of bytes.

Note –
A driver can be both block and character at the same time if you set up two different interfaces to the file system. See Devices as Special Files.

Included in the character category are drivers that use the STREAMS model (see below), programmed I/O, direct memory access, SCSI buses, USB, and other network I/O.
STREAMS device drivers – Subset of character drivers that uses the streamio(7I) set of routines for character I/O within the kernel.

What Is a Device Driver Entry Point?

An entry point is a function within a device driver that can be called by an external entity to get access to some driver functionality or to operate a device. Each device driver provides a standard set of functions as entry points. For the complete list of entry points for all driver types, see the Intro(9E) man page. The Oracle Solaris kernel uses entry points for these general task areas:

Loading and unloading the driver
Autoconfiguring the device – Autoconfiguration is the process of loading a device driver's code and static data into memory so that the driver is registered with the system.
Providing I/O services for the driver

Drivers for different types of devices have different sets of entry points according to the kinds of operations the devices perform. A driver for a memory-mapped character-oriented device, for example, supports a devmap(9E) entry point, while a block driver does not support this entry.

Use a prefix based on the name of your driver to give driver functions unique names. Typically, this prefix is the name of the driver, such as xx_open() for the open(9E) routine of driver xx. See Use a Unique Prefix to Avoid Kernel Symbol Collisions for more information. In subsequent examples in this book, xx is used as the driver prefix.

Device Driver Entry Points

This section provides lists of entry points for the following categories:

Entry Points Common to All Drivers

Some operations can be performed by any type of driver, such as the functions that are required for module loading and for the required autoconfiguration entry points. This section discusses types of entry points that are common to all drivers. The common entry points are listed in Summary of Common Entry Points with links to man pages and other relevant discussions.

Device Access Entry Points

Drivers for character and block devices export the cb_ops(9S) structure, which defines the driver entry points for block device access and character device access. Both types of drivers are required to support the open(9E) and close(9E) entry points. Block drivers are required to support strategy(9E), while character drivers can choose to implement whatever mix of read(9E), write(9E), ioctl(9E), mmap(9E), or devmap(9E) entry points is appropriate for the type of device. Character drivers can also support a polling interface through chpoll(9E). Asynchronous I/O is supported through aread(9E) and awrite(9E) for block drivers and those drivers that can use both block and character file systems.

Loadable Module Entry Points

All drivers are required to implement the loadable module entry points _init(9E), _fini(9E), and _info(9E) to load, unload, and report information about the driver module.

Drivers should allocate and initialize any global resources in _init(9E). Drivers should release their resources in _fini(9E).

Note –

In the Oracle Solaris OS, only the loadable module routines must be visible outside the driver object module. Other routines can have the storage class static.

Autoconfiguration Entry Points

Drivers are required to implement the attach(9E), detach(9E), and getinfo(9E) entry points for device autoconfiguration. Drivers can also implement the optional entry point probe(9E) in cases where devices do not identify themselves during boot-up, such as SCSI target devices. See Chapter 6, Driver Autoconfiguration for more information on these routines.

Kernel Statistics Entry Points

The Oracle Solaris platform provides a rich set of interfaces to maintain and export kernel-level statistics, also known as kstats. Drivers are free to use these interfaces to export driver and device statistics that can be used by user applications to observe the internal state of the driver. Two entry points are provided for working with kernel statistics:

ks_snapshot(9E) captures kstats at a specific time.
ks_update(9E) can be used to update kstat data at will. ks_update() is useful in situations where a device is set up to track kernel data but extracting that data is time-consuming.

For further information, see the kstat_create(9F) and kstat(9S) man pages. See also Kernel Statistics.

Power Management Entry Point

Drivers for hardware devices that provide Power Management functionality can support the optional power(9E) entry point. See Chapter 12, Power Management for details about this entry point.

System Quiesce Entry Point

A driver that manages devices must implement the quiesce(9E) entry point. Drivers that do not manage devices can set the devo_quiesce field in the dev_ops structure to ddi_quiesce_not_needed(). The quiesce() function can be called only when the system is single-threaded at high PIL (priority interrupt level) with preemption disabled. Therefore, this function must not be blocked. If a device has a defined reset state configuration, the driver should return that device to that reset state as part of the quiesce operation. An example of this case is Fast Reboot, where firmware is bypassed when booting to a new operating system image.

Summary of Common Entry Points

The following table lists entry points that can be used by all types of drivers.

Table 1–1 Entry Points for All Driver Types


Category / Entry Point	Usage	Description
`cb_ops` Entry Points
open(9E)	Required	Gets access to a device. Additional information: `open()` Entry Point (Character Drivers) `open()` Entry Point (Block Drivers)
close(9E)	Required	Gives up access to a device. The version of `close()` for STREAMS drivers has a different signature than character and block drivers. Additional information: `close()` Entry Point (Character Drivers) `close()` Entry Point (Block Drivers)
Loadable Module Entry Points
_init(9E)	Required	Initializes a loadable module. Additional information: Loadable Driver Interfaces
_fini(9E)	Required	Prepares a loadable module for unloading. Required for all driver types. Additional information: Loadable Driver Interfaces
_info(9E)	Required	Returns information about a loadable module. Additional information: Loadable Driver Interfaces
Autoconfiguration Entry Points
attach(9E)	Required	Adds a device to the system as part of initialization. Also used to resume a system that has been suspended. Additional information: `attach()` Entry Point
detach(9E)	Required	Detaches a device from the system. Also, used to suspend a device temporarily. Additional information: `detach()` Entry Point
getinfo(9E)	Required	Gets device information that is specific to the driver, such as the mapping between a device number and the corresponding instance. Additional information: `getinfo()` Entry Point `getinfo()` Entry Point (SCSI Target Drivers).
probe(9E)	See Description	Determines if a non-self-identifying device is present. Required for a device that cannot identify itself. Additional information: `probe()` Entry Point `probe()` Entry Point (SCSI Target Drivers)
Kernel Statistics Entry Points
ks_snapshot(9E)	Optional	Takes a snapshot of kstat(9S) data. Additional information: Kernel Statistics
ks_update(9E)	Optional	Updates kstat(9S) data dynamically. Additional information: Kernel Statistics
Power Management Entry Point
power(9E)	Required	Sets the power level of a device. If not used, set to `NULL`. Additional information: `power()` Entry Point
System Quiesce Entry Point
quiesce(9E)	See Description	Quiesces a device so that the device no longer generates interrupts or modifies or accesses memory.
Miscellaneous Entry Points
prop_op(9E)	See Description	Reports driver property information. Required unless ddi_prop_op(9F) is substituted. Additional information: Creating and Updating Properties `prop_op()` Entry Point
dump(9E)	See Description	Dumps memory to a device during system failure. Required for any device that is to be used as the dump device during a panic. Additional information: `dump()` Entry Point (Block Drivers) Dump Handling
`identify`(9E)	Obsolete	Do not use this entry point. Assign nulldev(9F) to this entry point in the `dev_ops` structure.

Entry Points for Block Device Drivers

Devices that support a file system are known as block devices. Drivers written for these devices are known as block device drivers. Block device drivers take a file system request, in the form of a buf(9S) structure, and issue the I/O operations to the disk to transfer the specified block. The main interface to the file system is the strategy(9E) routine. See Chapter 16, Drivers for Block Devices for more information.

A block device driver can also provide a character driver interface to enable utility programs to bypass the file system and to access the device directly. This device access is commonly referred to as the raw interface to a block device.

The following table lists additional entry points that can be used by block device drivers. See also Entry Points Common to All Drivers.

Table 1–2 Additional Entry Points for Block Drivers


Entry Point	Usage	Description
aread(9E)	Optional	Performs an asynchronous read. Drivers that do not support an `aread()` entry point should use the nodev(9F) error return function. Additional information: Differences Between Synchronous and Asynchronous I/O DMA Transfers (Asynchronous)
awrite(9E)	Optional	Performs an asynchronous write. Drivers that do not support an `awrite()` entry point should use the nodev(9F) error return function. Additional information: Differences Between Synchronous and Asynchronous I/O DMA Transfers (Asynchronous)
print(9E)	Required	Displays a driver message on the system console. Additional information: `print()` Entry Point (Block Drivers)
strategy(9E)	Required	Perform block I/O. Additional information: Canceling DMA Callbacks DMA Transfers (Synchronous) `strategy()` Entry Point DMA Transfers (Asynchronous) General Flow of Control x86 Target Driver Configuration Properties

Entry Points for Character Device Drivers

Character device drivers normally perform I/O in a byte stream. Examples of devices that use character drivers include tape drives and serial ports. Character device drivers can also provide additional interfaces not present in block drivers, such as I/O control (ioctl) commands, memory mapping, and device polling. See Chapter 15, Drivers for Character Devices for more information.

The main task of any device driver is to perform I/O, and many character device drivers do what is called byte-stream or character I/O. The driver transfers data to and from the device without using a specific device address. This type of transfer is in contrast to block device drivers, where part of the file system request identifies a specific location on the device.

The read(9E) and write(9E) entry points handle byte-stream I/O for standard character drivers. See I/O Request Handling for more information.

The following table lists additional entry points that can be used by character device drivers. For other entry points, see Entry Points Common to All Drivers.

Table 1–3 Additional Entry Points for Character Drivers


Entry Point	Usage	Description
chpoll(9E)	Optional	Polls events for a non-STREAMS character driver. Additional information: Multiplexing I/O on File Descriptors
ioctl(9E)	Optional	Performs a range of I/O commands for character drivers. `ioctl()` routines must make sure that user data is copied into or out of the kernel address space explicitly using copyin(9F), copyout(9F), ddi_copyin(9F), and ddi_copyout(9F), as appropriate. Additional information: `ioctl()` Entry Point (Character Drivers) Well Known `ioctl` Interfaces
read(9E)	Required	Reads data from a device. Additional information: Vectored I/O Differences Between Synchronous and Asynchronous I/O Programmed I/O Transfers DMA Transfers (Synchronous) General Flow of Control
segmap(9E)	Optional	Maps device memory into user space. Additional information: Exporting the Mapping Allocating Kernel Memory for User Access Associating User Mappings With Driver Notifications
write(9E)	Required	Writes data to a device. Additional information: Device Access Functions Vectored I/O Differences Between Synchronous and Asynchronous I/O Programmed I/O Transfers DMA Transfers (Synchronous) General Flow of Control

Entry Points for STREAMS Device Drivers

STREAMS is a separate programming model for writing a character driver. Devices that receive data asynchronously, such as terminal and network devices, are suited to a STREAMS implementation. STREAMS device drivers must provide the loading and autoconfiguration support described in Chapter 6, Driver Autoconfiguration. See the STREAMS Programming Guide for additional information on how to write STREAMS drivers.

The following table lists additional entry points that can be used by STREAMS device drivers. For other entry points, see Entry Points Common to All Drivers and Entry Points for Character Device Drivers.

Table 1–4 Entry Points for STREAMS Drivers


Entry Point	Usage	Description
put(9E)	See Description	Coordinates the passing of messages from one queue to the next queue in a stream. Required, except for the side of the driver that reads data. Additional information: STREAMS Programming Guide
srv(9E)	Required	Manipulate messages in a queue. Additional information: STREAMS Programming Guide

Entry Points for Memory Mapped Devices

For certain devices, such as frame buffers, providing application programs with direct access to device memory is more efficient than byte-stream I/O. Applications can map device memory into their address spaces using the mmap(2) system call. To support memory mapping, device drivers implement segmap(9E) and devmap(9E) entry points. For information on devmap(9E), see Chapter 10, Mapping Device and Kernel Memory. For information on segmap(9E), see Chapter 15, Drivers for Character Devices.

Drivers that define the devmap(9E) entry point usually do not define read(9E) and write(9E) entry points, because application programs perform I/O directly to the devices after calling mmap(2).

The following table lists additional entry points that can be used by character device drivers that use the devmap framework to perform memory mapping. For other entry points, see Entry Points Common to All Drivers and Entry Points for Character Device Drivers.

Table 1–5 Entry Points for Character Drivers That Use devmap for Memory Mapping


Entry Point	Usage	Description
devmap(9E)	Required	Validates and translates virtual mapping for a memory-mapped device. Additional information: Exporting the Mapping
devmap_access(9E)	Optional	Notifies drivers when an access is made to a mapping with validation or protection problems. Additional information: `devmap_access()` Entry Point
devmap_contextmgt(9E)	Required	Performs device context switching on a mapping. Additional information: `devmap_contextmgt()` Entry Point
devmap_dup(9E)	Optional	Duplicates a device mapping. Additional information: `devmap_dup()` Entry Point
devmap_map(9E)	Optional	Creates a device mapping. Additional information: `devmap_map()` Entry Point
devmap_unmap(9E)	Optional	Cancels a device mapping. Additional information: `devmap_unmap()` Entry Point

Entry Points for Network Device Drivers

See Table 19–1 for a list of entry points for network device drivers that use the Generic LAN Driver version 3 (GLDv3) framework. For more information, see GLDv3 Network Device Driver Framework and GLDv3 MAC Registration Functions in Chapter 19, Drivers for Network Devices.

Entry Points for SCSI HBA Drivers

The following table lists additional entry points that can be used by SCSI HBA device drivers. For information on the SCSI HBA transport structure, see scsi_hba_tran(9S). For other entry points, see Entry Points Common to All Drivers and Entry Points for Character Device Drivers.

Table 1–6 Additional Entry Points for SCSI HBA Drivers


Entry Point	Usage	Description
tran_abort(9E)	Required	Aborts a specified SCSI command that has been transported to a SCSI Host Bus Adapter (HBA) driver. Additional information: `tran_abort()` Entry Point
tran_bus_reset(9E)	Optional	Resets a SCSI bus. Additional information: `tran_bus_reset()` Entry Point
tran_destroy_pkt(9E)	Required	Frees resources that are allocated for a SCSI packet. Additional information: `tran_destroy_pkt()` Entry Point
tran_dmafree(9E)	Required	Frees DMA resources that have been allocated for a SCSI packet. Additional information: `tran_dmafree()` Entry Point
tran_getcap(9E)	Required	Gets the current value of a specific capability that is provided by the HBA driver. Additional information: `tran_getcap()` Entry Point
tran_init_pkt(9E)	Required	Allocate and initialize resources for a SCSI packet. Additional information: Resource Allocation
tran_quiesce(9E)	Optional	Stop all activity on a SCSI bus, typically for dynamic reconfiguration. Additional information: Dynamic Reconfiguration
tran_reset(9E)	Required	Resets a SCSI bus or target device. Additional information: `tran_reset()` Entry Point
tran_reset_notify(9E)	Optional	Requests notification of a SCSI target device for a bus reset. Additional information: `tran_reset_notify()` Entry Point
tran_setcap(9E)	Required	Sets the value of a specific capability that is provided by the SCSI HBA driver. Additional information: `tran_setcap()` Entry Point
tran_start(9E)	Required	Requests the transport of a SCSI command. Additional information: `tran_start()` Entry Point
tran_sync_pkt(9E)	Required	Synchronizes the view of data by an HBA driver or device. Additional information: `tran_sync_pkt()` Entry Point
tran_tgt_free(9E)	Optional	Requests allocated SCSI HBA resources to be freed on behalf of a target device. Additional information: `tran_tgt_free()` Entry Point Transport Structure Cloning
tran_tgt_init(9E)	Optional	Requests SCSI HBA resources to be initialized on behalf of a target device. Additional information: `tran_tgt_init()` Entry Point `scsi_device` Structure
tran_tgt_probe(9E)	Optional	Probes a specified target on a SCSI bus. Additional information: `tran_tgt_probe()` Entry Point
tran_unquiesce(9E)	Optional	Resumes I/O activity on a SCSI bus after tran_quiesce(9E) has been called, typically for dynamic reconfiguration. Additional information: Dynamic Reconfiguration

Entry Points for PC Card Drivers

The following table lists additional entry points that can be used by PC Card device drivers. For other entry points, see Entry Points Common to All Drivers and Entry Points for Character Device Drivers.

Table 1–7 Entry Points for PC Card Drivers Only


Entry Point	Usage	Description
csx_event_handler(9E)	Required	Handles events for a PC Card driver. The driver must call the csx_RegisterClient(9F) function explicitly to set the entry point instead of using a structure field like `cb_ops`.

Considerations in Device Driver Design

A device driver must be compatible with the Oracle Solaris OS, both as a consumer and provider of services. This section discusses the following issues, which should be considered in device driver design:

DDI/DKI Facilities

The Oracle Solaris DDI/DKI interfaces are provided for driver portability. With DDI/DKI, developers can write driver code in a standard fashion without having to worry about hardware or platform differences. This section describes aspects of the DDI/DKI interfaces.

Device IDs

The DDI interfaces enable drivers to provide a persistent, unique identifier for a device. The device ID can be used to identify or locate a device. The ID is independent of the device's name or number (dev_t). Applications can use the functions defined in libdevid(3LIB) to read and manipulate the device IDs registered by the drivers.

Device Properties

The attributes of a device or device driver are specified by properties. A property is a name-value pair. The name is a string that identifies the property with an associated value. Properties can be defined by the FCode of a self-identifying device, by a hardware configuration file (see the driver.conf(4) man page), or by the driver itself using the ddi_prop_update(9F) family of routines.

Interrupt Handling

The DDI/DKI addresses the following aspects of device interrupt handling:

Registering device interrupts with the system
Removing device interrupts
Dispatching interrupts to interrupt handlers

Device interrupt sources are contained in a property called interrupt, which is either provided by the PROM of a self-identifying device, in a hardware configuration file, or by the booting system on the x86 platform.

Callback Functions

Certain DDI mechanisms provide a callback mechanism. DDI functions provide a mechanism for scheduling a callback when a condition is met. Callback functions can be used for the following typical conditions:

A transfer has completed
A resource has become available
A time-out period has expired

Callback functions are somewhat similar to entry points, for example, interrupt handlers. DDI functions that allow callbacks expect the callback function to perform certain tasks. In the case of DMA routines, a callback function must return a value indicating whether the callback function needs to be rescheduled in case of a failure.

Callback functions execute as a separate interrupt thread. Callbacks must handle all the usual multithreading issues.

Note –

A driver must cancel all scheduled callback functions before detaching a device.

Software State Management

To assist device driver writers in allocating state structures, the DDI/DKI provides a set of memory management routines called the software state management routines, also known as the soft-state routines. These routines dynamically allocate, retrieve, and destroy memory items of a specified size, and hide the details of list management. An instance number is used to identify the desired memory item. This number is typically the instance number assigned by the system.

Routines are provided for the following tasks:

Initialize a driver's soft-state list
Allocate space for an instance of a driver's soft state
Retrieve a pointer to an instance of a driver's soft state
Free the memory for an instance of a driver's soft state
Finish using a driver's soft-state list

See Loadable Driver Interfaces for an example of how to use these routines.

Programmed I/O Device Access

Programmed I/O device access is the act of reading and writing of device registers or device memory by the host CPU. The Oracle Solaris DDI provides interfaces for mapping a device's registers or memory by the kernel as well as interfaces for reading and writing to device memory from the driver. These interfaces enable drivers to be developed that are platform and bus independent, by automatically managing any difference in device and host endianness as well as by enforcing any memory-store sequence requirements imposed by the device.

Direct Memory Access (DMA)

The Oracle Solaris platform defines a high-level, architecture-independent model for supporting DMA-capable devices. The Oracle Solaris DDI shields drivers from platform-specific details. This concept enables a common driver to run on multiple platforms and architectures.

Layered Driver Interfaces

The DDI/DKI provides a group of interfaces referred to as layered device interfaces (LDI). These interfaces enable a device to be accessed from within the Oracle Solaris kernel. This capability enables developers to write applications that observe kernel device usage. For example, both the prtconf(1M) and fuser(1M) commands use LDI to enable system administrators to track aspects of device usage. The LDI is covered in more detail in Chapter 14, Layered Driver Interface (LDI).

Driver Context

The driver context refers to the condition under which a driver is currently operating. The context limits the operations that a driver can perform. The driver context depends on the executing code that is invoked. Driver code executes in four contexts:

User context. A driver entry point has user context when invoked by a user thread in a synchronous fashion. That is, the user thread waits for the system to return from the entry point that was invoked. For example, the read(9E) entry point of the driver has user context when invoked by a read(2) system call. In this case, the driver has access to the user area for copying data into and out of the user thread.
Kernel context. A driver function has kernel context when invoked by some part of the kernel. In a block device driver, the strategy(9E) entry point can be called by the pageout daemon to write pages to the device. Because the page daemon has no relation to the current user thread, strategy(9E) has kernel context in this case.
Interrupt context.Interrupt context is a more restrictive form of kernel context. Interrupt context is invoked as a result of the servicing of an interrupt. Driver interrupt routines operate in interrupt context with an associated interrupt level. Callback routines also operate in an interrupt context. See Chapter 8, Interrupt Handlers for more information.
High-level interrupt context.High-level interrupt context is a more restricted form of interrupt context. If ddi_intr_hilevel(9F) indicates that an interrupt is high level, the driver interrupt handler runs in high-level interrupt context. See Chapter 8, Interrupt Handlers for more information.

The manual pages in section 9F document the allowable contexts for each function. For example, in kernel context the driver must not call copyin(9F).

Returning Errors

Device drivers do not usually print messages, except for unexpected errors such as data corruption. Instead, the driver entry points should return error codes so that the application can determine how to handle the error. Use the cmn_err(9F) function to write messages to a system log that can then be displayed on the console.

The format string specifier interpreted by cmn_err(9F) is similar to the printf(3C) format string specifier, with the addition of the format %b, which prints bit fields. The first character of the format string can have a special meaning. Calls to cmn_err(9F) also specify the message level, which indicates the severity label to be printed. See the cmn_err(9F) man page for more details.

The level CE_PANIC has the side effect of crashing the system. This level should be used only if the system is in such an unstable state that to continue would cause more problems. The level can also be used to get a system core dump when debugging. CE_PANIC should not be used in production device drivers.

Dynamic Memory Allocation

Device drivers must be prepared to simultaneously handle all attached devices that the drivers claim to drive. The number of devices that the driver handles should not be limited. All per-device information must be dynamically allocated.

void *kmem_alloc(size_t size, int flag);

The standard kernel memory allocation routine is kmem_alloc(9F). kmem_alloc() is similar to the C library routine malloc(3C), with the addition of the flag argument. The flag argument can be either KM_SLEEP or KM_NOSLEEP, indicating whether the caller is willing to block if the requested size is not available. If KM_NOSLEEP is set and memory is not available, kmem_alloc(9F) returns NULL.

kmem_zalloc(9F) is similar to kmem_alloc(9F), but also clears the contents of the allocated memory.

Note –

Kernel memory is a limited resource, not pageable, and competes with user applications and the rest of the kernel for physical memory. Drivers that allocate a large amount of kernel memory can cause system performance to degrade.

void kmem_free(void *cp, size_t size);

Memory allocated by kmem_alloc(9F) or by kmem_zalloc(9F) is returned to the system with kmem_free(9F). kmem_free() is similar to the C library routine free(3C), with the addition of the size argument. Drivers must keep track of the size of each allocated object in order to call kmem_free(9F) later.

Hotplugging

This manual does not highlight hotplugging information. If you follow the rules and suggestions for writing device drivers given in this book, your driver should be able to handle hotplugging. In particular, make sure that both autoconfiguration (see Chapter 6, Driver Autoconfiguration) and detach(9E) work correctly in your driver. In addition, if you are designing a driver that uses power management, you should follow the information given in Chapter 12, Power Management. SCSI HBA drivers might need to add a cb_ops structure to their dev_ops structure (see Chapter 18, SCSI Host Bus Adapter Drivers) to take advantage of hotplugging capabilities.

Previous versions of the Oracle Solaris OS required hotpluggable drivers to include a DT_HOTPLUG property, but this property is no longer required. Driver writers are free, however, to include and use the DT_HOTPLUG property as they see fit.

Chapter 2 Oracle Solaris Kernel and Device Tree

A device driver needs to work transparently as an integral part of the operating system. Understanding how the kernel works is a prerequisite for learning about device drivers. This chapter provides an overview of the Oracle Solaris kernel and device tree. For an overview of how device drivers work, see Chapter 1, Overview of Oracle Solaris Device Drivers.

This chapter provides information on the following subjects:

What Is the Kernel?

The Oracle Solaris kernel is a program that manages system resources. The kernel insulates applications from the system hardware and provides them with essential system services such as input/output (I/O) management, virtual memory, and scheduling. The kernel consists of object modules that are dynamically loaded into memory when needed.

The Oracle Solaris kernel can be divided logically into two parts: the first part, referred to as the kernel, manages file systems, scheduling, and virtual memory. The second part, referred to as the I/O subsystem, manages the physical components.

The kernel provides a set of interfaces for applications to use that are accessible through system calls. System calls are documented in section 2 of the Reference Manual Collection (see Intro(2)). Some system calls are used to invoke device drivers to perform I/O. Device drivers are loadable kernel modules that manage data transfers while insulating the rest of the kernel from the device hardware. To be compatible with the operating system, device drivers need to be able to accommodate such features as multithreading, virtual memory addressing, and both 32-bit and 64-bit operation.

The following figure illustrates the kernel. The kernel modules handle system calls from application programs. The I/O modules communicate with hardware.

Figure 2–1 Oracle Solaris Kernel

Diagram shows calls from user-level applications to specific
kernel-level modules, and calls between drivers and other modules to devices.

The kernel provides access to device drivers through the following features:

Device-to-driver mapping. The kernel maintains the device tree. Each node in the tree represents a virtual or a physical device. The kernel binds each node to a driver by matching the device node name with the set of drivers installed in the system. The device is made accessible to applications only if there is a driver binding.
DDI/DKI interfaces. DDI/DKI (Device Driver Interface/Driver-Kernel Interface) interfaces standardize interactions between the driver and the kernel, the device hardware, and the boot/configuration software. These interfaces keep the driver independent from the kernel and improve the driver's portability across successive releases of the operating system on a particular machine.
LDI. The LDI (Layered Driver Interface) is an extension of the DDI/DKI. The LDI enables a kernel module to access other devices in the system. The LDI also enables you to determine which devices are currently being used by the kernel. See Chapter 14, Layered Driver Interface (LDI).

Multithreaded Execution Environment

The Oracle Solaris kernel is multithreaded. On a multiprocessor machine, multiple kernel threads can be running kernel code, and can do so concurrently. Kernel threads can also be preempted by other kernel threads at any time.

The multithreading of the kernel imposes some additional restrictions on device drivers. For more information on multithreading considerations, see Chapter 3, Multithreading. Device drivers must be coded to run as needed at the request of many different threads. For each thread, a driver must handle contention problems from overlapping I/O requests.

Virtual Memory

A complete overview of the Oracle Solaris virtual memory system is beyond the scope of this book, but two virtual memory terms of special importance are used when discussing device drivers: virtual address and address space.

Virtual address. A virtual address is an address that is mapped by the memory management unit (MMU) to a physical hardware address. All addresses directly accessible by the driver are kernel virtual addresses. Kernel virtual addresses refer to the kernel address space.
Address space. An address space is a set of virtual address segments. Each segment is a contiguous range of virtual addresses. Each user process has an address space called the user address space. The kernel has its own address space, called the kernel address space.

Devices as Special Files

Devices are represented in the file system by special files. In the Oracle Solaris OS, these files reside in the /devices directory hierarchy.

Special files can be of type block or character. The type indicates which kind of device driver operates the device. Drivers can be implemented to operate on both types. For example, disk drivers export a character interface for use by the fsck(1) and mkfs(1) utilities, and a block interface for use by the file system.

Associated with each special file is a device number (dev_t). A device number consists of a major number and a minor number. The major number identifies the device driver associated with the special file. The minor number is created and used by the device driver to further identify the special file. Usually, the minor number is an encoding that is used to identify which device instance the driver should access and which type of access should be performed. For example, the minor number can identify a tape device used for backup and can specify that the tape needs to be rewound when the backup operation is complete.

DDI/DKI Interfaces

In System V Release 4 (SVR4), the interface between device drivers and the rest of the UNIX kernel was standardized as the DDI/DKI. The DDI/DKI is documented in section 9 of the Reference Manual Collection. Section 9E documents driver entry points, section 9F documents driver-callable functions, and section 9S documents kernel data structures used by device drivers. See Intro(9E), Intro(9F), and Intro(9S).

The DDI/DKI is intended to standardize and document all interfaces between device drivers and the rest of the kernel. In addition, the DDI/DKI enables source and binary compatibility for drivers on any machine that runs the Oracle Solaris OS, regardless of the processor architecture, whether SPARC or x86. Drivers that use only kernel facilities that are part of the DDI/DKI are known as DDI/DKI-compliant device drivers.

The DDI/DKI enables you to write platform-independent device drivers for any machine that runs the Oracle Solaris OS. These binary-compatible drivers enable you to more easily integrate third-party hardware and software into any machine that runs the Oracle Solaris OS. The DDI/DKI is architecture independent, which enables the same driver to work across a diverse set of machine architectures.

Platform independence is accomplished by the design of DDI in the following areas:

Dynamic loading and unloading of modules
Power management
Interrupt handling
Accessing the device space from the kernel or a user process, that is, register mapping and memory mapping
Accessing kernel or user process space from the device using DMA services
Managing device properties

Overview of the Device Tree

Devices in the Oracle Solaris OS are represented as a tree of interconnected device information nodes. The device tree describes the configuration of loaded devices for a particular machine.

Device Tree Components

The system builds a tree structure that contains information about the devices connected to the machine at boot time. The device tree can also be modified by dynamic reconfiguration operations while the system is in normal operation. The tree begins at the root device node, which represents the platform.

Below the root node are the branches of the device tree. A branch consists of one or more bus nexus devices and a terminating leaf device.

A bus nexus device provides bus mapping and translation services to subordinate devices in the device tree. PCI - PCI bridges, PCMCIA adapters, and SCSI HBAs are all examples of nexus devices. The discussion of writing drivers for nexus devices is limited to the development of SCSI HBA drivers (see Chapter 18, SCSI Host Bus Adapter Drivers).

Leaf devices are typically peripheral devices such as disks, tapes, network adapters, frame buffers, and so forth. Leaf device drivers export the traditional character driver interfaces and block driver interfaces. The interfaces enable user processes to read data from and write data to either storage or communication devices.

The system goes through the following steps to build the tree:

The CPU is initialized and searches for firmware.
The main firmware (OpenBoot, Basic Input/Output System (BIOS), or Bootconf) initializes and creates the device tree with known or self-identifying hardware.
When the main firmware finds compatible firmware on a device, the main firmware initializes the device and retrieves the device's properties.
The firmware locates and boots the operating system.
The kernel starts at the root node of the tree, searches for a matching device driver, and binds that driver to the device.
If the device is a nexus, the kernel looks for child devices that have not been detected by the firmware. The kernel adds any child devices to the tree below the nexus node.
The kernel repeats the process from Step 5 until no further device nodes need to be created.

Each driver exports a device operations structure dev_ops(9S) to define the operations that the device driver can perform. The device operations structure contains function pointers for generic operations such as attach(9E), detach(9E), and getinfo(9E). The structure also contains a pointer to a set of operations specific to bus nexus drivers and a pointer to a set of operations specific to leaf drivers.

The tree structure creates a parent-child relationship between nodes. This parent-child relationship is the key to architectural independence. When a leaf or bus nexus driver requires a service that is architecturally dependent in nature, that driver requests its parent to provide the service. This approach enables drivers to function regardless of the architecture of the machine or the processor. A typical device tree is shown in the following figure.

Figure 2–2 Example Device Tree

Diagram shows leaves and nodes in a typical device tree.

The nexus nodes can have one or more children. The leaf nodes represent individual devices.

Displaying the Device Tree

The device tree can be displayed in three ways:

The libdevinfo library provides interfaces to access the contents of the device tree programmatically.
The prtconf(1M) command displays the complete contents of the device tree.
The /devices hierarchy is a representation of the device tree. Use the ls(1) command to view the hierarchy.

Note –

/devices displays only devices that have drivers configured into the system. The prtconf(1M) command shows all device nodes regardless of whether a driver for the device exists on the system.

`libdevinfo` Library

The libdevinfo library provides interfaces for accessing all public device configuration data. See the libdevinfo(3LIB) man page for a list of interfaces.

`prtconf` Command

The following excerpted prtconf(1M) command example displays all the devices in the system.

Memory size: 128 Megabytes
System Peripherals (Software Nodes):

SUNW,Ultra-5_10
    packages (driver not attached)
        terminal-emulator (driver not attached)
        deblocker (driver not attached)
        obp-tftp (driver not attached)
        disk-label (driver not attached)
        SUNW,builtin-drivers (driver not attached)
        sun-keyboard (driver not attached)
        ufs-file-system (driver not attached)
    chosen (driver not attached)
    openprom (driver not attached)
        client-services (driver not attached)
    options, instance #0
    aliases (driver not attached)
    memory (driver not attached)
    virtual-memory (driver not attached)
    pci, instance #0
        pci, instance #0
            ebus, instance #0
                auxio (driver not attached)
                power, instance #0
                SUNW,pll (driver not attached)
                se, instance #0
                su, instance #0
                su, instance #1
                ecpp (driver not attached)
                fdthree, instance #0
                eeprom (driver not attached)
                flashprom (driver not attached)
                SUNW,CS4231 (driver not attached)
            network, instance #0
            SUNW,m64B (driver not attached)
            ide, instance #0
                disk (driver not attached)
                cdrom (driver not attached)
                dad, instance #0
                sd, instance #15
        pci, instance #1
            pci, instance #0
                pci108e,1000 (driver not attached)
                SUNW,hme, instance #1
                SUNW,isptwo, instance #0
                    sd (driver not attached)
                    st (driver not attached)
                    sd, instance #0 (driver not attached)
                    sd, instance #1 (driver not attached)
                    sd, instance #2 (driver not attached)
                    ...
    SUNW,UltraSPARC-IIi (driver not attached)
    SUNW,ffb, instance #0
    pseudo, instance #0

`/devices` Directory

The /devices hierarchy provides a namespace that represents the device tree. Following is an abbreviated listing of the /devices namespace. The sample output corresponds to the example device tree and prtconf(1M) output shown previously.

/devices
/devices/pseudo
/devices/pci@1f,0:devctl
/devices/SUNW,ffb@1e,0:ffb0
/devices/pci@1f,0
/devices/pci@1f,0/pci@1,1
/devices/pci@1f,0/pci@1,1/SUNW,m64B@2:m640
/devices/pci@1f,0/pci@1,1/ide@3:devctl
/devices/pci@1f,0/pci@1,1/ide@3:scsi
/devices/pci@1f,0/pci@1,1/ebus@1
/devices/pci@1f,0/pci@1,1/ebus@1/power@14,724000:power_button
/devices/pci@1f,0/pci@1,1/ebus@1/se@14,400000:a
/devices/pci@1f,0/pci@1,1/ebus@1/se@14,400000:b
/devices/pci@1f,0/pci@1,1/ebus@1/se@14,400000:0,hdlc
/devices/pci@1f,0/pci@1,1/ebus@1/se@14,400000:1,hdlc
/devices/pci@1f,0/pci@1,1/ebus@1/se@14,400000:a,cu
/devices/pci@1f,0/pci@1,1/ebus@1/se@14,400000:b,cu
/devices/pci@1f,0/pci@1,1/ebus@1/ecpp@14,3043bc:ecpp0
/devices/pci@1f,0/pci@1,1/ebus@1/fdthree@14,3023f0:a
/devices/pci@1f,0/pci@1,1/ebus@1/fdthree@14,3023f0:a,raw
/devices/pci@1f,0/pci@1,1/ebus@1/SUNW,CS4231@14,200000:sound,audio
/devices/pci@1f,0/pci@1,1/ebus@1/SUNW,CS4231@14,200000:sound,audioctl
/devices/pci@1f,0/pci@1,1/ide@3
/devices/pci@1f,0/pci@1,1/ide@3/sd@2,0:a
/devices/pci@1f,0/pci@1,1/ide@3/sd@2,0:a,raw
/devices/pci@1f,0/pci@1,1/ide@3/dad@0,0:a
/devices/pci@1f,0/pci@1,1/ide@3/dad@0,0:a,raw
/devices/pci@1f,0/pci@1
/devices/pci@1f,0/pci@1/pci@2
/devices/pci@1f,0/pci@1/pci@2/SUNW,isptwo@4:devctl
/devices/pci@1f,0/pci@1/pci@2/SUNW,isptwo@4:scsi

Binding a Driver to a Device

In addition to constructing the device tree, the kernel determines the drivers that are used to manage the devices.

Binding a driver to a device refers to the process by which the system selects a driver to manage a particular device. The binding name is the name that links a driver to a unique device node in the device information tree. For each device in the device tree, the system attempts to choose a driver from a list of installed drivers.

Each device node has an associated name property. This property can be assigned either from an external agent, such as the PROM, during system boot or from a driver.conf configuration file. In any case, the name property represents the node name assigned to a device in the device tree. The node name is the name visible in /devices and listed in the prtconf(1M) output.

Figure 2–3 Device Node Names

Diagram shows a simple example of device node names.

A device node can have an associated compatible property as well. The compatible property contains an ordered list of one or more possible driver names or driver aliases for the device.

The system uses both the compatible and the name properties to select a driver for the device. The system first attempts to match the contents of the compatible property, if the compatible property exists, to a driver on the system. Beginning with the first driver name on the compatible property list, the system attempts to match the driver name to a known driver on the system. Each entry on the list is processed until the system either finds a match or reaches the end of the list.

If the contents of either the name property or the compatible property match a driver on the system, then that driver is bound to the device node. If no match is found, no driver is bound to the device node.

Generic Device Names

Some devices specify a generic device name as the value for the name property. Generic device names describe the function of a device without actually identifying a specific driver for the device. For example, a SCSI host bus adapter might have a generic device name of scsi. An Ethernet device might have a generic device name of ethernet.

The compatible property enables the system to determine alternate driver names for devices with a generic device name, for example, glm for scsi HBA device drivers or hme for ethernet device drivers.

Devices with generic device names are required to supply a compatible property.

Note –

For a complete description of generic device names, see the IEEE 1275 Open Firmware Boot Standard.

The following figure shows a device node with a specific device name. The driver binding name SUNW,ffb is the same name as the device node name.

Figure 2–4 Specific Driver Node Binding

Diagram shows a device node using a specific device name:
SUNW, ffb.

The following figure shows a device node with the generic device name display. The driver binding name SUNW,ffb is the first name on the compatible property driver list that matches a driver on the system driver list. In this case, display is a generic device name for frame buffers.

Figure 2–5 Generic Driver Node Binding

Diagram shows a device node using a generic device name:
display.

Chapter 3 Multithreading

This chapter describes the locking primitives and thread synchronization mechanisms of the Oracle Solaris multithreaded kernel. You should design device drivers to take advantage of multithreading. This chapter provides information on the following subjects:

Locking Primitives

In traditional UNIX systems, every section of kernel code terminates either through an explicit call to sleep(1) to give up the processor or through a hardware interrupt. The Oracle Solaris OS operates differently. A kernel thread can be preempted at any time to run another thread. Because all kernel threads share kernel address space and often need to read and modify the same data, the kernel provides a number of locking primitives to prevent threads from corrupting shared data. These mechanisms include mutual exclusion locks, which are also known as mutexes, readers/writer locks, and semaphores.

Storage Classes of Driver Data

The storage class of data is a guide to whether the driver might need to take explicit steps to control access to the data. The three data storage classes are:

Automatic (stack) data. Every thread has a private stack, so drivers never need to lock automatic variables.
Global static data. Global static data can be shared by any number of threads in the driver. The driver might need to lock this type of data at times.
Kernel heap data. Any number of threads in the driver can share kernel heap data, such as data allocated by kmem_alloc(9F). The driver needs to protect shared data at all times.

Mutual-Exclusion Locks

A mutual-exclusion lock, or mutex, is usually associated with a set of data and regulates access to that data. Mutexes provide a way to allow only one thread at a time access to that data. The mutex functions are:

mutex_destroy(9F): Releases any associated storage.
mutex_enter(9F): Acquires a mutex.
mutex_exit(9F): Releases a mutex.
mutex_init(9F): Initializes a mutex.
mutex_owned(9F): Tests to determine whether the mutex is held by the current thread. To be used in ASSERT(9F) only.
mutex_tryenter(9F): Acquires a mutex if available, but does not block.

Setting Up Mutexes

Device drivers usually allocate a mutex for each driver data structure. The mutex is typically a field in the structure of type kmutex_t. mutex_init(9F) is called to prepare the mutex for use. This call is usually made at attach(9E) time for per-device mutexes and _init(9E) time for global driver mutexes.

For example,

struct xxstate *xsp;
/* ... */
mutex_init(&xsp->mu, NULL, MUTEX_DRIVER, NULL);
/* ... */

For a more complete example of mutex initialization, see Chapter 6, Driver Autoconfiguration.

The driver must destroy the mutex with mutex_destroy(9F) before being unloaded. Destroying the mutex is usually done at detach(9E) time for per-device mutexes and _fini(9E) time for global driver mutexes.

Using Mutexes

Every section of the driver code that needs to read or write the shared data structure must do the following tasks:

Acquire the mutex
Access the data
Release the mutex

The scope of a mutex, that is, the data the mutex protects, is entirely up to the programmer. A mutex protects a data structure only if every code path that accesses the data structure does so while holding the mutex.

Readers/Writer Locks

A readers/writer lock regulates access to a set of data. The readers/writer lock is so called because many threads can hold the lock simultaneously for reading, but only one thread can hold the lock for writing.

Most device drivers do not use readers/writer locks. These locks are slower than mutexes. The locks provide a performance gain only when they protect commonly read data that is not frequently written. In this case, contention for a mutex could become a bottleneck, so using a readers/writer lock might be more efficient. The readers/writer functions are summarized in the following table. See the rwlock(9F) man page for detailed information. The readers/writer lock functions are:

rw_destroy(9F): Destroys a readers/writer lock
rw_downgrade(9F): Downgrades a readers/writer lock holder from writer to reader
rw_enter(9F): Acquires a readers/writer lock
rw_exit(9F): Releases a readers/writer lock
rw_init(9F): Initializes a readers/writer lock
rw_read_locked(9F): Determines whether a readers/writer lock is held for read or write
rw_tryenter(9F): Attempts to acquire a readers/writer lock without waiting
rw_tryupgrade(9F): Attempts to upgrade readers/writer lock holder from reader to writer

Semaphores

Counting semaphores are available as an alternative primitive for managing threads within device drivers. See the semaphore(9F) man page for more information. The semaphore functions are:

sema_destroy(9F): Destroys a semaphore.
sema_init(9F): Initialize a semaphore.
sema_p(9F): Decrement semaphore and possibly block.
sema_p_sig(9F): Decrement semaphore but do not block if signal is pending. See Threads Unable to Receive Signals.
sema_tryp(9F): Attempt to decrement semaphore, but do not block.
sema_v(9F): Increment semaphore and possibly unblock waiter.

Thread Synchronization

In addition to protecting shared data, drivers often need to synchronize execution among multiple threads.

Condition Variables in Thread Synchronization

Condition variables are a standard form of thread synchronization. They are designed to be used with mutexes. The associated mutex is used to ensure that a condition can be checked atomically, and that the thread can block on the associated condition variable without missing either a change to the condition or a signal that the condition has changed.

The condvar(9F) functions are:

cv_broadcast(9F): Signals all threads waiting on the condition variable.
cv_destroy(9F): Destroys a condition variable.
cv_init(9F): Initializes a condition variable.
cv_signal(9F): Signals one thread waiting on the condition variable.
cv_timedwait(9F): Waits for condition, time-out, or signal. See Threads Unable to Receive Signals.
cv_timedwait_sig(9F): Waits for condition or time-out.
cv_wait(9F): Waits for condition.
cv_wait_sig(9F): Waits for condition or return zero on receipt of a signal. See Threads Unable to Receive Signals.

Initializing Condition Variables

Declare a condition variable of type kcondvar_t for each condition. Usually, the condition variables are declared in the driver's soft-state structure. Use cv_init(9F) to initialize each condition variable. Similar to mutexes, condition variables are usually initialized at attach(9E) time. A typical example of initializing a condition variable is:

cv_init(&xsp->cv, NULL, CV_DRIVER, NULL);

For a more complete example of condition variable initialization, see Chapter 6, Driver Autoconfiguration.

Waiting for the Condition

To use condition variables, follow these steps in the code path waiting for the condition:

Acquire the mutex guarding the condition.
Test the condition.
If the test results do not allow the thread to continue, use cv_wait(9F) to block the current thread on the condition. The cv_wait(9F) function releases the mutex before blocking the thread and reacquires the mutex before returning. On return from cv_wait(9F), repeat the test.
After the test allows the thread to continue, set the condition to its new value. For example, set a device flag to busy.
Release the mutex.

Signaling the Condition

Follow these steps in the code path to signal the condition:

Acquire the mutex guarding the condition.
Set the condition.
Signal the blocked thread with cv_broadcast(9F).
Release the mutex.

The following example uses a busy flag along with mutex and condition variables to force the read(9E) routine to wait until the device is no longer busy before starting a transfer.

Example 3–1 Using Mutexes and Condition Variables

static int
xxread(dev_t dev, struct uio *uiop, cred_t *credp)
{
        struct xxstate *xsp;
        /* ... */
        mutex_enter(&xsp->mu);
        while (xsp->busy)
                cv_wait(&xsp->cv, &xsp->mu);
        xsp->busy = 1;
        mutex_exit(&xsp->mu);
        /* perform the data access */
}

static uint_t
xxintr(caddr_t arg)
{
        struct xxstate *xsp = (struct xxstate *)arg;
        mutex_enter(&xsp->mu);
        xsp->busy = 0;
        cv_broadcast(&xsp->cv);
        mutex_exit(&xsp->mu);
}

`cv_wait()` and `cv_timedwait()` Functions

If a thread is blocked on a condition with cv_wait(9F) and that condition does not occur, the thread would wait forever. To avoid that situation, use cv_timedwait(9F), which depends upon another thread to perform a wakeup. cv_timedwait() takes an absolute wait time as an argument. cv_timedwait() returns -1 if the time is reached and the event has not occurred. cv_timedwait() returns a positive value if the condition is met.

cv_timedwait(9F) requires an absolute wait time expressed in clock ticks since the system was last rebooted. The wait time can be determined by retrieving the current value with ddi_get_lbolt(9F). The driver usually has a maximum number of seconds or microseconds to wait, so this value is converted to clock ticks with drv_usectohz(9F) and added to the value from ddi_get_lbolt(9F).

The following example shows how to use cv_timedwait(9F) to wait up to five seconds to access the device before returning EIO to the caller.

Example 3–2 Using `cv_timedwait()`

clock_t            cur_ticks, to;
mutex_enter(&xsp->mu);
while (xsp->busy) {
        cur_ticks = ddi_get_lbolt();
        to = cur_ticks + drv_usectohz(5000000); /* 5 seconds from now */
        if (cv_timedwait(&xsp->cv, &xsp->mu, to) == -1) {
                /*
                 * The timeout time 'to' was reached without the
                 * condition being signaled.
                 */
                /* tidy up and exit */
                mutex_exit(&xsp->mu);
                return (EIO);
        }
}
xsp->busy = 1;
mutex_exit(&xsp->mu);

Although device driver writers generally prefer to use cv_timedwait(9F) over cv_wait(9F), sometimes cv_wait(9F) is a better choice. For example, cv_wait(9F) is better if a driver is waiting on the following conditions:

Internal driver state changes, where such a state change might require some command to be executed, or a set amount of time to pass
Something the driver needs to single-thread
Some situation that is already managing a possible timeout, as when “A” depends on “B,” and “B” is using cv_timedwait(9F)

`cv_wait_sig()` Function

A driver might be waiting for a condition that cannot occur or will not happen for a long time. In such cases, the user can send a signal to abort the thread. Depending on the driver design, the signal might not cause the driver to wake up.

cv_wait_sig(9F) allows a signal to unblock the thread. This capability enables the user to break out of potentially long waits by sending a signal to the thread with kill(1) or by typing the interrupt character. cv_wait_sig(9F) returns zero if it is returning because of a signal, or nonzero if the condition occurred. However, see Threads Unable to Receive Signals for cases in which signals might not be received.

The following example shows how to use cv_wait_sig(9F) to allow a signal to unblock the thread.

Example 3–3 Using `cv_wait_sig()`

mutex_enter(&xsp->mu);
while (xsp->busy) {
        if (cv_wait_sig(&xsp->cv, &xsp->mu) == 0) {
        /* Signaled while waiting for the condition */
                /* tidy up and exit */
                mutex_exit(&xsp->mu);
                return (EINTR);
        }
}
xsp->busy = 1;
mutex_exit(&xsp->mu);

`cv_timedwait_sig()` Function

cv_timedwait_sig(9F) is similar to cv_timedwait(9F) and cv_wait_sig(9F), except that cv_timedwait_sig() returns -1 without the condition being signaled after a timeout has been reached, or 0 if a signal (for example, kill(2)) is sent to the thread.

For both cv_timedwait(9F) and cv_timedwait_sig(9F), time is measured in absolute clock ticks since the last system reboot.

Choosing a Locking Scheme

The locking scheme for most device drivers should be kept straightforward. Using additional locks allows more concurrency but increases overhead. Using fewer locks is less time consuming but allows less concurrency. Generally, use one mutex per data structure, a condition variable for each event or condition the driver must wait for, and a mutex for each major set of data global to the driver. Avoid holding mutexes for long periods of time. Use the following guidelines when choosing a locking scheme:

Use the multithreading semantics of the entry point to your advantage.
Make all entry points re-entrant. You can reduce the amount of shared data by changing a static variable to automatic.
If your driver acquires multiple mutexes, acquire and release the mutexes in the same order in all code paths.
Hold and release locks within the same functional space.
Avoid holding driver mutexes when calling DDI interfaces that can block, for example, kmem_alloc(9F) with KM_SLEEP.

To look at lock usage, use lockstat(1M). lockstat(1M) monitors all kernel lock events, gathers frequency and timing data about the events, and displays the data.

See the Multithreaded Programming Guide for more details on multithreaded operations.

Potential Locking Pitfalls

Mutexes are not re-entrant by the same thread. If you already own the mutex, attempting to claim this mutex a second time leads to the following panic:

panic: recursive mutex_enter. mutex %x caller %x

Releasing a mutex that the current thread does not hold causes this panic:

panic: mutex_adaptive_exit: mutex not held by thread

The following panic occurs only on uniprocessors:

panic: lock_set: lock held and only one CPU

The lock_set panic indicates that a spin mutex is held and will spin forever, because no other CPU can release this mutex. This situation can happen if the driver forgets to release the mutex on one code path or becomes blocked while holding the mutex.

A common cause of the lock_set panic occurs when a device with a high-level interrupt calls a routine that blocks, such as cv_wait(9F). Another typical cause is a high-level handler grabbing an adaptive mutex by calling mutex_enter(9F).

Threads Unable to Receive Signals

The sema_p_sig(), cv_wait_sig(), and cv_timedwait_sig() functions can be awakened when the thread receives a signal. A problem can arise because some threads are unable to receive signals. For example, when close(9E) is called as a result of the application calling close(2), signals can be received. However, when close(9E) is called from within the exit(2) processing that closes all open file descriptors, the thread cannot receive signals. When the thread cannot receive signals, sema_p_sig() behaves as sema_p(), cv_wait_sig() behaves as cv_wait(), and cv_timedwait_sig() behaves as cv_timedwait().

Use caution to avoid sleeping forever on events that might never occur. Events that never occur create unkillable (defunct) threads and make the device unusable until the system is rebooted. Signals cannot be received by defunct processes.

To detect whether the current thread is able to receive a signal, use the ddi_can_receive_sig(9F) function. If the ddi_can_receive_sig()function returns B_TRUE, then the above functions can wake up on a received signal. If the ddi_can_receive_sig()function returns B_FALSE, then the above functions cannot wake up on a received signal. If the ddi_can_receive_sig()function returns B_FALSE, then the driver should use an alternate means, such as the timeout(9F) function, to reawaken.

One important case where this problem occurs is with serial ports. If the remote system asserts flow control and the close(9E) function blocks while attempting to drain the output data, a port can be stuck until the flow control condition is resolved or the system is rebooted. Such drivers should detect this case and set up a timer to abort the drain operation when the flow control condition persists for an excessive period of time.

This issue also affects the qwait_sig(9F) function, which is described in Chapter 7, STREAMS Framework – Kernel Level, in STREAMS Programming Guide.

Chapter 4 Properties

Properties are user-defined, name-value pair structures that are managed using the DDI/DKI interfaces. This chapter provides information on the following subjects:

Device Properties

Device attribute information can be represented by a name-value pair notation called a property.

For example, device registers and onboard memory can be represented by the reg property. The reg property is a software abstraction that describes device hardware registers. The value of the reg property encodes the device register address location and size. Drivers use the reg property to access device registers.

Another example is the interrupt property. An interrupt property represents the device interrupt. The value of the interrupt property encodes the device-interrupt PIN.

Five types of values can be assigned to properties:

Byte array – Series of bytes of an arbitrary length
Integer property – An integer value
Integer array property – An array of integers
String property – A null-terminated string
String array property – A list of null-terminated strings

A property that has no value is considered to be a Boolean property. A Boolean property that exists is true. A Boolean value that does not exist is false.

Device Property Names

Strictly speaking, DDI/DKI software property names have no restrictions. Certain uses are recommended, however. The IEEE 1275-1994 Standard for Boot Firmware defines properties as follows:

A property is a human readable text string consisting of from 1 to 31 printable characters. Property names cannot contain upper case characters or the characters “/”, “\”, “:”, “[“, “]” and “@”. Property names beginning with the character “+” are reserved for use by future revisions of IEEE 1275-1994.

By convention, underscores are not used in property names. Use a hyphen (-) instead. By convention, property names ending with the question mark character (?) contain values that are strings, typically TRUE or FALSE, for example auto-boot?.

Predefined property names are listed in publications of the IEEE 1275 Working Group. See http://playground.sun.com/1275/ for information about how to obtain these publications. For a discussion of adding properties in driver configuration files, see the driver.conf(4) man page. The pm(9P) and pm-components(9P) man pages show how properties are used in power management. Read the sd(7D) man page as an example of how properties should be documented in device driver man pages.

Creating and Updating Properties

To create a property for a driver, or to update an existing property, use an interface from the DDI driver update interfaces such as ddi_prop_update_int(9F) or ddi_prop_update_string(9F) with the appropriate property type. See Table 4–1 for a list of available property interfaces. These interfaces are typically called from the driver's attach(9E) entry point. In the following example, ddi_prop_update_string()creates a string property called pm-hardware-state with a value of needs-suspend-resume.

/* The following code is to tell cpr that this device
 * needs to be suspended and resumed.
 */
(void) ddi_prop_update_string(device, dip,
    "pm-hardware-state", "needs-suspend-resume");

In most cases, using a ddi_prop_update() routine is sufficient for updating a property. Sometimes, however, the overhead of updating a property value that is subject to frequent change can cause performance problems. See prop_op() Entry Point for a description of using a local instance of a property value to avoid using ddi_prop_update().

Looking Up Properties

A driver can request a property from its parent, which in turn can ask its parent. The driver can control whether the request can go higher than its parent.

For example, the esp driver in the following example maintains an integer property called targetx-sync-speed for each target. The x in targetx-sync-speed represents the target number. The prtconf(1M) command displays driver properties in verbose mode. The following example shows a partial listing for the esp driver.

% prtconf -v
...
       esp, instance #0
            Driver software properties:
                name <target2-sync-speed> length <4>
                    value <0x00000fa0>.
...

The following table provides a summary of the property interfaces.

Table 4–1 Property Interface Uses


Family	Property Interfaces	Description
`ddi_prop_lookup`	ddi_prop_exists(9F)	Looks up a property and returns successfully if the property exists. Fails if the property does not exist
	ddi_prop_get_int(9F)	Looks up and returns an integer property
	ddi_prop_get_int64(9F)	Looks up and returns a 64-bit integer property
	ddi_prop_lookup_int_array(9F)	Looks up and returns an integer array property
	ddi_prop_lookup_int64_array(9F)	Looks up and returns a 64-bit integer array property
	ddi_prop_lookup_string(9F)	Looks up and returns a string property
	ddi_prop_lookup_string_array(9F)	Looks up and returns a string array property
	ddi_prop_lookup_byte_array(9F)	Looks up and returns a byte array property
`ddi_prop_update`	ddi_prop_update_int(9F)	Updates or creates an integer property
	ddi_prop_update_int64(9F)	Updates or creates a single 64-bit integer property
	ddi_prop_update_int_array(9F)	Updates or creates an integer array property
	ddi_prop_update_string(9F)	Updates or creates a string property
	ddi_prop_update_string_array(9F)	Updates or creates a string array property
	ddi_prop_update_int64_array(9F)	Updates or creates a 64-bit integer array property
	ddi_prop_update_byte_array(9F)	Updates or creates a byte array property
`ddi_prop_remove`	ddi_prop_remove(9F)	Removes a property
	ddi_prop_remove_all(9F)	Removes all properties that are associated with a device

Whenever possible, use 64-bit versions of int property interfaces such as ddi_prop_update_int64(9F) instead of 32-bit versions such as ddi_prop_update_int(9F)).

`prop_op()` Entry Point

The prop_op(9E) entry point is generally required for reporting device properties or driver properties to the system. If the driver does not need to create or manage its own properties, then the ddi_prop_op(9F) function can be used for this entry point.

ddi_prop_op(9F) can be used as the prop_op(9E) entry point for a device driver when ddi_prop_op() is defined in the driver's cb_ops(9S) structure. ddi_prop_op() enables a leaf device to search for and obtain property values from the device's property list.

If the driver has to maintain a property whose value changes frequently, you should define a driver-specific prop_op() routine within the cb_ops structure instead of calling ddi_prop_op(). This technique avoids the inefficiency of using ddi_prop_update() repeatedly. The driver should then maintain a copy of the property value either within its soft-state structure or in a driver variable.

The prop_op(9E) entry point reports the values of specific driver properties and device properties to the system. In many cases, the ddi_prop_op(9F) routine can be used as the driver's prop_op() entry point in the cb_ops(9S) structure. ddi_prop_op() performs all of the required processing. ddi_prop_op() is sufficient for drivers that do not require special processing when handling device property requests.

However, sometimes the driver must provide a prop_op() entry point. For example, if a driver maintains a property whose value changes frequently, updating the property with ddi_prop_update(9F) for each change is not efficient. Instead, the driver should maintain a shadow copy of the property in the instance's soft state. The driver would then update the shadow copy when the value changes without using any of the ddi_prop_update() routines. The prop_op() entry point must intercept requests for this property and use one of the ddi_prop_update() routines to update the value of the property before passing the request to ddi_prop_op() to process the property request.

In the following example, prop_op() intercepts requests for the temperature property. The driver updates a variable in the state structure whenever the property changes. However, the property is updated only when a request is made. The driver then uses ddi_prop_op() to process the property request. If the property request is not specific to a device, the driver does not intercept the request. This situation is indicated when the value of the dev parameter is equal to DDI_DEV_T_ANY, the wildcard device number.

Example 4–1 `prop_op()` Routine

static int
xx_prop_op(dev_t dev, dev_info_t *dip, ddi_prop_op_t prop_op,
    int flags, char *name, caddr_t valuep, int *lengthp)
{
        minor_t instance;
        struct xxstate *xsp;
        if (dev != DDI_DEV_T_ANY) {
                return (ddi_prop_op(dev, dip, prop_op, flags, name,
                    valuep, lengthp));
        }

        instance = getminor(dev);
        xsp = ddi_get_soft_state(statep, instance);
        if (xsp == NULL)
                return (DDI_PROP_NOTFOUND);
        if (strcmp(name, "temperature") == 0) {
                ddi_prop_update_int(dev, dip, name, temperature);
        }

        /* other cases */    
}

Chapter 5 Managing Events and Queueing Tasks

Drivers use events to respond to state changes. This chapter provides the following information on events:

Drivers use task queues to manage resource dependencies between tasks. This chapter provides the following information about task queues:

Managing Events

A system often needs to respond to a condition change such as a user action or system request. For example, a device might issue a warning when a component begins to overheat, or might start a movie player when a DVD is inserted into a drive. Device drivers can use a special message called an event to inform the system that a change in state has taken place.

Introduction to Events

An event is a message that a device driver sends to interested entities to indicate that a change of state has taken place. Events are implemented in the Oracle Solaris OS as user-defined, name-value pair structures that are managed using the nvlist* functions. (See the nvlist_alloc(9F) man page. Events are organized by vendor, class, and subclass. For example, you could define a class for monitoring environmental conditions. An environmental class could have subclasses to indicate changes in temperature, fan status, and power.

When a change in state occurs, the device notifies the driver. The driver then uses the ddi_log_sysevent(9F) function to log this event in a queue called sysevent. The sysevent queue passes events to the user level for handling by either the syseventd daemon or syseventconfd daemon. These daemons send notifications to any applications that have subscribed for notification of the specified event.

Two methods for designers of user-level applications deal with events:

An application can use the routines in libsysevent(3LIB) to subscribe with the syseventd daemon for notification when a specific event occurs.
A developer can write a separate user-level application to respond to an event. This type of application needs to be registered with syseventadm(1M). When syseventconfd encounters the specified event, the application is run and deals with the event accordingly.

This process is illustrated in the following figure.

Figure 5–1 Event Plumbing

Diagram shows how events are logged into the sysevent
queue for notification of user-level applications.

Using `ddi_log_sysevent()` to Log Events

Device drivers use the ddi_log_sysevent(9F) interface to generate and log events with the system.

`ddi_log_sysevent()` Syntax

ddi_log_sysevent() uses the following syntax:

int ddi_log_sysevent(dev_info_t *dip, char *vendor, char *class, 
    char *subclass, nvlist_t *attr-list, sysevent_id_t *eidp, int sleep-flag);

where:

dip

A pointer to the dev_info node for this driver.

vendor

A pointer to a string that defines the driver's vendor. Third-party drivers should use their company's stock symbol or a similarly enduring identifier. Oracle-supplied drivers use DDI_VENDOR_SUNW.

class

A pointer to a string defining the event's class. class is a driver-specific value. An example of a class might be a string that represents a set of environmental conditions that affect a device. This value must be understood by the event consumer.

subclass

A driver-specific string that represents a subset of the class argument. For example, within a class that represents environmental conditions, an event subclass might refer to the device's temperature. This value must be intelligible to the event consumer.

attr-list

A pointer to an nvlist_t structure that lists name-value attributes associated with the event. Name-value attributes are driver-defined and can refer to a specific attribute or condition of the device.

For example, consider a device that reads both CD-ROMs and DVDs. That device could have an attribute with the name disc_type and the value equal to either cd_rom or dvd.

As with class and subclass, an event consumer must be able to interpret the name-value pairs.

For more information on name-value pairs and the nvlist_t structure, see Defining Event Attributes, as well as the nvlist_alloc(9F) man page.

If the event has no attributes, then this argument should be set to NULL.

eidp

The address of a sysevent_id_t structure. The sysevent_id_t structure is used to provide a unique identification for the event. ddi_log_sysevent(9F) returns this structure with a system-provided event sequence number and time stamp. See the ddi_log_sysevent(9F) man page for more information on the sysevent_id_t structure.

sleep-flag

A flag that indicates how the caller wants to handle the possibility of resources not being available. If sleep-flag is set to DDI_SLEEP, the driver blocks until the resources become available. With DDI_NOSLEEP, an allocation will not sleep and cannot be guaranteed to succeed. If DDI_ENOMEM is returned, the driver would need to retry the operation at a later time.

Even with DDI_SLEEP, other error returns are possible with this interface, such as system busy, the syseventd daemon not responding, or trying to log an event in interrupt context.

Sample Code for Logging Events

A device driver performs the following tasks to log events:

Allocate memory for the attribute list using nvlist_alloc(9F)
Add name-value pairs to the attribute list
Use the ddi_log_sysevent(9F) function to log the event in the sysevent queue
Call nvlist_free(9F) when the attribute list is no longer needed

The following example demonstrates how to use ddi_log_sysevent().

Example 5–1 Calling `ddi_log_sysevent()`

char *vendor_name = "DDI_VENDOR_JGJG"
char *my_class = "JGJG_event";
char *my_subclass = "JGJG_alert";
nvlist_t *nvl;
/* ... */
nvlist_alloc(&nvl, nvflag, kmflag);
/* ... */
(void) nvlist_add_byte_array(nvl, propname, (uchar_t *)propval, proplen + 1); 
/* ... */
if (ddi_log_sysevent(dip, vendor_name, my_class, 
    my_subclass, nvl, NULL, DDI_SLEEP)!= DDI_SUCCESS)
    cmn_err(CE_WARN, "error logging system event"); 
nvlist_free(nvl);

Defining Event Attributes

Event attributes are defined as a list of name-value pairs. The Oracle Solaris DDI provides routines and structures for storing information in name-value pairs. Name-value pairs are retained in an nvlist_t structure, which is opaque to the driver. The value for a name-value pair can be a Boolean, an int, a byte, a string, an nvlist, or an array of these data types. An int can be defined as 16 bits, 32 bits, or 64 bits and can be signed or unsigned.

The steps in creating a list of name-value pairs are as follows.

Create an nvlist_t structure with nvlist_alloc(9F).

The nvlist_alloc() interface takes three arguments:
- nvlp – Pointer to a pointer to an nvlist_t structure
- nvflag – Flag to indicate the uniqueness of the names of the pairs. If this flag is set to NV_UNIQUE_NAME_TYPE, any existing pair that matches the name and type of a new pair is removed from the list. If the flag is set to NV_UNIQUE_NAME, then any existing pair with a duplicate name is removed, regardless of its type. Specifying NV_UNIQUE_NAME_TYPE allows a list to contain two or more pairs with the same name as long as their types are different, whereas with NV_UNIQUE_NAME only one instance of a pair name can be in the list. If the flag is not set, then no uniqueness checking is done and the consumer of the list is responsible for dealing with duplicates.
- kmflag – Flag to indicate the allocation policy for kernel memory. If this argument is set to KM_SLEEP, then the driver blocks until the requested memory is available for allocation. KM_SLEEP allocations might sleep but are guaranteed to succeed. KM_NOSLEEP allocations are guaranteed not to sleep but might return NULL if no memory is currently available.
Populate the nvlist with name-value pairs. For example, to add a string, use nvlist_add_string(9F). To add an array of 32-bit integers, use nvlist_add_int32_array(9F). The nvlist_add_boolean(9F) man page contains a complete list of interfaces for adding pairs.

To deallocate a list, use nvlist_free(9F).

The following code sample illustrates the creation of a name-value list.

Example 5–2 Creating and Populating a Name-Value Pair List

nvlist_t*
create_nvlist()
    {
    int err;
    char *str = "child";
    int32_t ints[] = {0, 1, 2};
    nvlist_t *nvl;

    err = nvlist_alloc(&nvl, NV_UNIQUE_NAME, 0);    /* allocate list */
    if (err)
        return (NULL);
    if ((nvlist_add_string(nvl, "name", str) != 0) ||
        (nvlist_add_int32_array(nvl, "prop", ints, 3) != 0)) {
        nvlist_free(nvl);
        return (NULL);
    }
    return (nvl);
}

Drivers can retrieve the elements of an nvlist by using a lookup function for that type, such as nvlist_lookup_int32_array(9F), which takes as an argument the name of the pair to be searched for.

Note –

These interfaces work only if either NV_UNIQUE_NAME or NV_UNIQUE_NAME_TYPE is specified when nvlist_alloc(9F) is called. Otherwise, ENOTSUP is returned, because the list cannot contain multiple pairs with the same name.

A list of name-value list pairs can be placed in contiguous memory. This approach is useful for passing the list to an entity that has subscribed for notification. The first step is to get the size of the memory block that is needed for the list with nvlist_size(9F). The next step is to pack the list into the buffer with nvlist_pack(9F). The consumer receiving the buffer's content can unpack the buffer with nvlist_unpack(9F).

The functions for manipulating name-value pairs are available to both user-level and kernel-level developers. You can find identical man pages for these functions in both man pages section 3: Library Interfaces and Headers and in man pages section 9: DDI and DKI Kernel Functions. For a list of functions that operate on name-value pairs, see the following table.

Table 5–1 Functions for Using Name-Value Pairs


Man Page	Purpose / Functions
nvlist_add_boolean(9F)	Add name-value pairs to the list. Functions include: `nvlist_add_boolean()`, `nvlist_add_boolean_value()`, `nvlist_add_byte()`, `nvlist_add_int8()`, `nvlist_add_uint8()`, `nvlist_add_int16()`, `nvlist_add_uint16()`, `nvlist_add_int32()`, `nvlist_add_uint32()`, `nvlist_add_int64()`, `nvlist_add_uint64()`, `nvlist_add_string()`, `nvlist_add_nvlist()`, `nvlist_add_nvpair()`, `nvlist_add_boolean_array()`, `nvlist_add_int8_array, nvlist_add_uint8_array()`, `nvlist_add_nvlist_array()`, `nvlist_add_byte_array()`, `nvlist_add_int16_array()`, `nvlist_add_uint16_array()`, `nvlist_add_int32_array()`, `nvlist_add_uint32_array()`, `nvlist_add_int64_array()`, `nvlist_add_uint64_array()`, `nvlist_add_string_array()`
nvlist_alloc(9F)	Manipulate the name-value list buffer. Functions include: `nvlist_alloc()`, `nvlist_free()`, `nvlist_size()`, `nvlist_pack()`, `nvlist_unpack()`, `nvlist_dup()`, `nvlist_merge()`
nvlist_lookup_boolean(9F)	Search for name-value pairs. Functions include: `nvlist_lookup_boolean()`, `nvlist_lookup_boolean_value()`, `nvlist_lookup_byte()`, `nvlist_lookup_int8()`, `nvlist_lookup_int16()`, `nvlist_lookup_int32()`, `nvlist_lookup_int64()`, `nvlist_lookup_uint8()`, `nvlist_lookup_uint16()`, `nvlist_lookup_uint32()`, `nvlist_lookup_uint64()`, `nvlist_lookup_string()`, `nvlist_lookup_nvlist()`, `nvlist_lookup_boolean_array, nvlist_lookup_byte_array()`, `nvlist_lookup_int8_array()`, `nvlist_lookup_int16_array()`, `nvlist_lookup_int32_array()`, `nvlist_lookup_int64_array()`, `nvlist_lookup_uint8_array()`, `nvlist_lookup_uint16_array()`, `nvlist_lookup_uint32_array()`, `nvlist_lookup_uint64_array()`, `nvlist_lookup_string_array()`, `nvlist_lookup_nvlist_array()`, `nvlist_lookup_pairs()`
nvlist_next_nvpair(9F)	Get name-value pair data. Functions include: `nvlist_next_nvpair()`, `nvpair_name()`, `nvpair_type()`
nvlist_remove(9F)	Remove name-value pairs. Functions include: `nv_remove()`, `nv_remove_all()`

Queueing Tasks

This section discusses how to use task queues to postpone processing of some tasks and delegate their execution to another kernel thread.

Introduction to Task Queues

A common operation in kernel programming is to schedule a task to be performed at a later time, by a different thread. The following examples give some reasons that you might want a different thread to perform a task at a later time:

Your current code path is time critical. The additional task you want to perform is not time critical.
The additional task might require grabbing a lock that another thread is currently holding.
You cannot block in your current context. The additional task might need to block, for example to wait for memory.
A condition is preventing your code path from completing, but your current code path cannot sleep or fail. You need to queue the current task to execute after the condition disappears.
You need to launch multiple tasks in parallel.

In each of these cases, a task is executed in a different context. A different context is usually a different kernel thread with a different set of locks held and possibly a different priority. Task queues provide a generic kernel API for scheduling asynchronous tasks.

A task queue is a list of tasks with one or more threads to service the list. If a task queue has a single service thread, all tasks are guaranteed to execute in the order in which they are added to the list. If a task queue has more than one service thread, the order in which the tasks will execute is not known.

Note –

If the task queue has more than one service thread, make sure that the execution of one task does not depend on the execution of any other task. Dependencies between tasks can cause a deadlock to occur.

Task Queue Interfaces

The following DDI interfaces manage task queues. These interfaces are defined in the sys/sunddi.h header file. See the taskq(9F) man page for more information about these interfaces.

`ddi_taskq_t`	Opaque handle
`TASKQ_DEFAULTPRI`	System default priority
`DDI_SLEEP`	Can block for memory
`DDI_NOSLEEP`	Cannot block for memory
`ddi_taskq_create()`	Create a task queue
`ddi_taskq_destroy()`	Destroy a task queue
`ddi_taskq_dispatch()`	Add a task to a task queue
`ddi_taskq_wait()`	Wait for pending tasks to complete
`ddi_taskq_suspend()`	Suspend a task queue
`ddi_taskq_suspended()`	Check whether a task queue is suspended
`ddi_taskq_resume()`	Resume a suspended task queue

Observing Task Queues

The typical usage in drivers is to create task queues at attach(9E). Most taskq_dispatch() invocations are from interrupt context.

This section describes two techniques that you can use to monitor the system resources that are consumed by a task queue. Task queues export statistics on the use of system time by task queue threads. Task queues also use DTrace SDT probes to determine when a task queue starts and finishes execution of a task.

Task Queue Kernel Statistics Counters

Every task queue has an associated set of kstat counters. Examine the output of the following kstat(1M) command:

$ kstat -c taskq
module: unix                            instance: 0     
name:   ata_nexus_enum_tq               class:    taskq
        crtime                          53.877907833
        executed                        0
        maxtasks                        0
        nactive                         1
        nalloc                          0
        priority                        60
        snaptime                        258059.249256749
        tasks                           0
        threads                         1
        totaltime                       0

module: unix                            instance: 0     
name:   callout_taskq                   class:    taskq
        crtime                          0
        executed                        13956358
        maxtasks                        4
        nactive                         4
        nalloc                          0
        priority                        99
        snaptime                        258059.24981709
        tasks                           13956358
        threads                         2
        totaltime                       120247890619

The kstat output shown above includes the following information:

The name of the task queue and its instance number
The number of scheduled (tasks) and executed (executed) tasks
The number of kernel threads processing the task queue (threads) and their priority (priority)
The total time (in nanoseconds) spent processing all the tasks (totaltime)

The following example shows how you can use the kstat command to observe how a counter (number of scheduled tasks) increases over time:

$ kstat -p unix:0:callout_taskq:tasks 1 5
unix:0:callout_taskq:tasks      13994642

unix:0:callout_taskq:tasks      13994711

unix:0:callout_taskq:tasks      13994784

unix:0:callout_taskq:tasks      13994855

unix:0:callout_taskq:tasks      13994926

Task Queue DTrace SDT Probes

Task queues provide several useful SDT probes. All the probes described in this section have the following two arguments:

The task queue pointer returned by ddi_taskq_create()
The pointer to the taskq_ent_t structure. Use this pointer in your D script to extract the function and the argument.

You can use these probes to collect precise timing information about individual task queues and individual tasks being executed through them. For example, the following script prints the functions that were scheduled through task queues for every 10 seconds:

# !/usr/sbin/dtrace -qs

sdt:genunix::taskq-enqueue
{
  this->tq  = (taskq_t *)arg0;
  this->tqe = (taskq_ent_t *) arg1;
  @[this->tq->tq_name,
    this->tq->tq_instance,
    this->tqe->tqent_func] = count();
}

tick-10s
{
  printa ("%s(%d): %a called %@d times\n", @);
  trunc(@);
}

On a particular machine, the above D script produced the following output:

callout_taskq(1): genunix`callout_execute called 51 times
callout_taskq(0): genunix`callout_execute called 701 times
kmem_taskq(0): genunix`kmem_update_timeout called 1 times
kmem_taskq(0): genunix`kmem_hash_rescale called 4 times
callout_taskq(1): genunix`callout_execute called 40 times
USB_hid_81_pipehndl_tq_1(14): usba`hcdi_cb_thread called 256 times
callout_taskq(0): genunix`callout_execute called 702 times
kmem_taskq(0): genunix`kmem_update_timeout called 1 times
kmem_taskq(0): genunix`kmem_hash_rescale called 4 times
callout_taskq(1): genunix`callout_execute called 28 times
USB_hid_81_pipehndl_tq_1(14): usba`hcdi_cb_thread called 228 times
callout_taskq(0): genunix`callout_execute called 706 times
callout_taskq(1): genunix`callout_execute called 24 times
USB_hid_81_pipehndl_tq_1(14): usba`hcdi_cb_thread called 141 times
callout_taskq(0): genunix`callout_execute called 708 times

Chapter 6 Driver Autoconfiguration

Autoconfiguration means the driver loads code and static data into memory. This information is then registered with the system. Autoconfiguration also involves attaching individual device instances that are controlled by the driver.

This chapter provides information on the following subjects:

Driver Loading and Unloading

The system loads driver binary modules from the drv subdirectory of the kernel module directory for autoconfiguration. See Copying the Driver to a Module Directory.

After a module is read into memory with all symbols resolved, the system calls the _init(9E) entry point for that module. The _init() function calls mod_install(9F), which actually loads the module.

Note –

During the call to mod_install(), other threads are able to call attach(9E) as soon as mod_install() is called. From a programming standpoint, all _init() initialization must occur before mod_install() is called. If mod_install() fails (that is a nonzero value is returned), then the initialization must be backed out.

Upon successful completion of _init(), the driver is properly registered with the system. At this point, the driver is not actively managing any device. Device management happens as part of device configuration.

The system unloads driver binary modules either to conserve system memory or at the explicit request of a user. Before deleting the driver code and data from memory, the _fini(9E) entry point of the driver is invoked. The driver is unloaded, if and only if _fini() returns success.

The following figure provides a structural overview of a device driver. The shaded area highlights the driver data structures and entry points. The upper half of the shaded area contains data structures and entry points that support driver loading and unloading. The lower half is concerned with driver configuration.

Figure 6–1 Module Loading and Autoconfiguration Entry Points

Diagram shows structures and entry points used in autoconfiguration
and module loading.

Data Structures Required for Drivers

To support autoconfiguration, drivers are required to statically initialize the following data structures:

The data structures in Figure 5-1 are relied on by the driver. These structures must be provided and be initialized correctly. Without these data structures, the driver might not load properly. As a result, the necessary routines might not be loaded. If an operation is not supported by the driver, the address of the nodev(9F) routine can be used as a placeholder. In some instances, the driver supports the entry point and only needs to return success or failure. In such cases, the address of the routine nulldev(9F) can be used.

Note –

These structures should be initialized at compile-time. The driver should not access or change the structures at any other time.

`modlinkage` Structure

static struct modlinkage xxmodlinkage = {
    MODREV_1,       /* ml_rev */
    &xxmodldrv,     /* ml_linkage[] */
    NULL            /* NULL termination */
};

The first field is the version number of the module that loads the subsystem. This field should be MODREV_1. The second field points to driver's modldrv structure defined next. The last element of the structure should always be NULL.

`modldrv` Structure

static struct modldrv xxmodldrv = {
    &mod_driverops,           /* drv_modops */
    "generic driver v1.1",    /* drv_linkinfo */
    &xx_dev_ops               /* drv_dev_ops */
};

This structure describes the module in more detail. The first field provides information regarding installation of the module. This field should be set to &mod_driverops for driver modules. The second field is a string to be displayed by modinfo(1M). The second field should contain sufficient information for identifying the version of source code that generated the driver binary. The last field points to the driver's dev_ops structure defined in the following section.

`dev_ops` Structure

static struct dev_ops xx_dev_ops = {
    DEVO_REV,                  /* devo_rev */
    0,                         /* devo_refcnt  */
    xxgetinfo,                 /* devo_getinfo: getinfo(9E) */
    nulldev,                   /* devo_identify: identify(9E) */
    xxprobe,                   /* devo_probe: probe(9E) */
    xxattach,                  /* devo_attach: attach(9E) */
    xxdetach,                  /* devo_detach: detach(9E) */
    nodev,                     /* devo_reset: see devo_quiesce */
    &xx_cb_ops,                /* devo_cb_ops */
    NULL,                      /* devo_bus_ops */
    &xxpower,                  /* devo_power: power(9E) */
    ddi_quiesce_not_needed,    /* devo_quiesce: quiesce(9E) */
};

The dev_ops(9S) structure enables the kernel to find the autoconfiguration entry points of the device driver. The devo_rev field identifies the revision number of the structure. This field must be set to DEVO_REV. The devo_refcnt field must be initialized to zero. The function address fields should be filled in with the address of the appropriate driver entry point, except in the following cases:

Set the devo_identify field to nulldev(9F). The identify() entry point is obsolete.
Set the devo_probe field to nulldev(9F) if a probe(9E) routine is not needed.
Set the devo_reset field to nodev(9F). The nodev() function returns ENXIO. See devo_quiesce.
Set the devo_power field to NULL if a power() routine is not needed. Drivers for devices that provide Power Management functionality must have a power(9E) entry point. See Chapter 12, Power Management.
Set the devo_quiesce field to ddi_quiesce_not_needed() if the driver does not need to implement quiesce. Drivers that manage devices must provide a quiesce(9E) entry point.

The devo_cb_ops member should include the address of the cb_ops(9S) structure. The devo_bus_ops field must be set to NULL.

`cb_ops` Structure

static struct cb_ops xx_cb_ops = {
    xxopen,         /* open(9E) */
    xxclose,        /* close(9E) */
    xxstrategy,     /* strategy(9E) */
    xxprint,        /* print(9E) */
    xxdump,         /* dump(9E) */
    xxread,         /* read(9E) */
    xxwrite,        /* write(9E) */
    xxioctl,        /* ioctl(9E) */
    xxdevmap,       /* devmap(9E) */
    nodev,          /* mmap(9E) */
    xxsegmap,       /* segmap(9E) */
    xxchpoll,       /* chpoll(9E) */
    xxprop_op,      /* prop_op(9E) */
    NULL,           /* streamtab(9S) */
    D_MP | D_64BIT, /* cb_flag */
    CB_REV,         /* cb_rev */
    xxaread,        /* aread(9E) */
    xxawrite        /* awrite(9E) */
};

The cb_ops(9S) structure contains the entry points for the character operations and block operations of the device driver. Any entry points that the driver does not support should be initialized to nodev(9F). For example, character device drivers should set all the block-only fields, such as cb_stategy, to nodev(9F). Note that the mmap(9E) entry point is maintained for compatibility with previous releases. Drivers should use the devmap(9E) entry point for device memory mapping. If devmap(9E) is supported, set mmap(9E) to nodev(9F).

The streamtab field indicates whether the driver is STREAMS-based. Only the network device drivers that are discussed in Chapter 19, Drivers for Network Devices are STREAMS-based. All non-STREAMS-based drivers must set the streamtab field to NULL.

The cb_flag member contains the following flags:

The D_MP flag indicates that the driver is safe for multithreading. The Oracle Solaris OS supports only thread-safe drivers so D_MP must be set.
The D_64BIT flag causes the driver to use the uio_loffset field of the uio(9S) structure. The driver should set the D_64BIT flag in the cb_flag field to handle 64-bit offsets properly.
The D_DEVMAP flag supports the devmap(9E) entry point. For information on devmap(9E), see Chapter 10, Mapping Device and Kernel Memory.

cb_rev is the cb_ops structure revision number. This field must be set to CB_REV.

Loadable Driver Interfaces

Device drivers must be dynamically loadable. Drivers should also be unloadable to help conserve memory resources. Drivers that can be unloaded are also easier to test, debug, and patch.

Each device driver is required to implement _init(9E), _fini(9E), and _info(9E) entry points to support driver loading and unloading. The following example shows a typical implementation of loadable driver interfaces.

Example 6–1 Loadable Interface Section

static void *statep;                /* for soft state routines */
static struct cb_ops xx_cb_ops;     /* forward reference */
static struct dev_ops xx_ops = {
    DEVO_REV,
    0,
    xxgetinfo,
    nulldev,
    xxprobe,
    xxattach,
    xxdetach,
    xxreset,
    nodev,
    &xx_cb_ops,
    NULL,
    xxpower,
    ddi_quiesce_not_needed,
};

static struct modldrv modldrv = {
    &mod_driverops,
    "xx driver v1.0",
    &xx_ops
};

static struct modlinkage modlinkage = {
    MODREV_1,
    &modldrv,
    NULL
};

int
_init(void)
{
    int error;
    ddi_soft_state_init(&statep, sizeof (struct xxstate),
        estimated_number_of_instances);
    /* further per-module initialization if necessary */
    error = mod_install(&modlinkage);
    if (error != 0) {
        /* undo any per-module initialization done earlier */
        ddi_soft_state_fini(&statep);
    }
    return (error);
}

int
_fini(void)
{
    int error;
    error = mod_remove(&modlinkage);
    if (error == 0) {
        /* release per-module resources if any were allocated */
        ddi_soft_state_fini(&statep);
    }
    return (error);
}

int
_info(struct modinfo *modinfop)
{
    return (mod_info(&modlinkage, modinfop));
}

`_init()` Example

The following example shows a typical _init(9E) interface.

Example 6–2 `_init()` Function

static void *xxstatep;
int
_init(void)
{
    int error;
    const int max_instance = 20;    /* estimated max device instances */

    ddi_soft_state_init(&xxstatep, sizeof (struct xxstate), max_instance);
    error = mod_install(&xxmodlinkage);
    if (error != 0) {
        /*
         * Cleanup after a failure
         */
        ddi_soft_state_fini(&xxstatep);
    }
    return (error);
}

The driver should perform any one-time resource allocation or data initialization during driver loading in _init(). For example, the driver should initialize any mutexes global to the driver in this routine. The driver should not, however, use _init(9E) to allocate or initialize anything that has to do with a particular instance of the device. Per-instance initialization must be done in attach(9E). For example, if a driver for a printer can handle more than one printer at the same time, that driver should allocate resources specific to each printer instance in attach().

Note –

Once _init(9E) has called mod_install(9F), the driver should not change any of the data structures attached to the modlinkage(9S) structure because the system might make copies or change the data structures.

`_fini()` Example

The following example demonstrates the _fini() routine.

int
_fini(void)
{
    int error;
    error = mod_remove(&modlinkage);
    if (error != 0) {
        return (error);
    }
    /*
     * Cleanup resources allocated in _init()
     */
    ddi_soft_state_fini(&xxstatep);
    return (0);
}

Similarly, in _fini(), the driver should release any resources that were allocated in _init(). The driver must remove itself from the system module list.

Note –

_fini() might be called when the driver is attached to hardware instances. In this case, mod_remove(9F) returns failure. Therefore, driver resources should not be released until mod_remove() returns success.

`_info()` Example

The following example demonstrates the _info(9E) routine.

int
_info(struct modinfo *modinfop)
{
    return (mod_info(&xxmodlinkage, modinfop));
}

The driver is called to return module information. The entry point should be implemented as shown above.

Device Configuration Concepts

For each node in the kernel device tree, the system selects a driver for the node based on the node name and the compatible property (see Binding a Driver to a Device). The same driver might bind to multiple device nodes. The driver can differentiate different nodes by instance numbers assigned by the system.

After a driver is selected for a device node, the driver's probe(9E) entry point is called to determine the presence of the device on the system. If probe() is successful, the driver's attach(9E) entry point is invoked to set up and manage the device. The device can be opened if and only if attach() returns success (see attach() Entry Point).

A device might be unconfigured to conserve system memory resources or to enable the device to be removed while the system is still running. To enable the device to be unconfigured, the system first checks whether the device instance is referenced. This check involves calling the driver's getinfo(9E) entry point to obtain information known only to the driver (see getinfo() Entry Point). If the device instance is not referenced, the driver's detach(9E) routine is invoked to unconfigure the device (see detach() Entry Point).

To recap, each driver must define the following entry points that are used by the kernel for device configuration:

Note that attach(), detach(), and getinfo() are required. probe() is only required for devices that cannot identify themselves. For self-identifying devices, an explicit probe() routine can be provided, or nulldev(9F) can be specified in the dev_ops structure for the probe() entry point.

Device Instances and Instance Numbers

The system assigns an instance number to each device. The driver might not reliably predict the value of the instance number assigned to a particular device. The driver should retrieve the particular instance number that has been assigned by calling ddi_get_instance(9F).

Instance numbers represent the system's notion of devices. Each dev_info, that is, each node in the device tree, for a particular driver is assigned an instance number by the kernel. Furthermore, instance numbers provide a convenient mechanism for indexing data specific to a particular physical device. The most common use of instance numbers is ddi_get_soft_state(9F), which uses instance numbers to retrieve soft state data for specific physical devices.

Caution –

For pseudo devices, that is, the children of pseudo nexuses, the instance numbers are defined in the driver.conf(4) file using the instance property. If the driver.conf file does not contain the instance property, the behavior is undefined. For hardware device nodes, the system assigns instance numbers when the device is first seen by the OS. The instance numbers persist across system reboots and OS upgrades.

Minor Nodes and Minor Numbers

Drivers are responsible for managing their minor number namespace. For example, the sd driver needs to export eight character minor nodes and eight block minor nodes to the file system for each disk. Each minor node represents either a block interface or a character interface to a portion of the disk. The getinfo(9E) entry point informs the system about the mapping from minor number to device instance (see getinfo() Entry Point).

`probe()` Entry Point

For non-self-identifying devices, the probe(9E) entry point should determine whether the hardware device is present on the system.

For probe() to determine whether the instance of the device is present, probe() needs to perform many tasks that are also commonly done by attach(9E). In particular, probe() might need to map the device registers.

Probing the device registers is device-specific. The driver often has to perform a series of tests of the hardware to assure that the hardware is really present. The test criteria must be rigorous enough to avoid misidentifying devices. For example, a device might appear to be present when in fact that device is not available, because a different device seems to behave like the expected device.

The test returns the following flags:

DDI_PROBE_SUCCESS if the probe was successful
DDI_PROBE_FAILURE if the probe failed
DDI_PROBE_DONTCARE if the probe was unsuccessful yet attach(9E) still needs to be called
DDI_PROBE_PARTIAL if the instance is not present now, but might be present in the future

For a given device instance, attach(9E) will not be called until probe(9E) has succeeded at least once on that device.

probe(9E) must free all the resources that probe() has allocated, because probe() might be called multiple times. However, attach(9E) is not necessarily called even if probe(9E) has succeeded

ddi_dev_is_sid(9F) can be used in a driver's probe(9E) routine to determine whether the device is self-identifying. ddi_dev_is_sid() is useful in drivers written for self-identifying and non-self-identifying versions of the same device.

The following example is a sample probe() routine.

Example 6–3 probe(9E) Routine

static int
xxprobe(dev_info_t *dip)
{
    ddi_acc_handle_t dev_hdl;
    ddi_device_acc_attr_t dev_attr;
    Pio_csr *csrp;
    uint8_t csrval;

    /*
     * if the device is self identifying, no need to probe
     */
    if (ddi_dev_is_sid(dip) == DDI_SUCCESS)
    return (DDI_PROBE_DONTCARE);

    /*
     * Initalize the device access attributes and map in
     * the devices CSR register (register 0)
     */
    dev_attr.devacc_attr_version = DDI_DEVICE_ATTR_V0;
    dev_attr.devacc_attr_endian_flags = DDI_STRUCTURE_LE_ACC;
    dev_attr.devacc_attr_dataorder = DDI_STRICTORDER_ACC;

    if (ddi_regs_map_setup(dip, 0, (caddr_t *)&csrp, 0, sizeof (Pio_csr),
    &dev_attr, &dev_hdl) != DDI_SUCCESS)
    return (DDI_PROBE_FAILURE);

    /*
     * Reset the device
     * Once the reset completes the CSR should read back
     * (PIO_DEV_READY | PIO_IDLE_INTR)
     */
    ddi_put8(dev_hdl, csrp, PIO_RESET);
    csrval = ddi_get8(dev_hdl, csrp);

    /*
     * tear down the mappings and return probe success/failure
     */
    ddi_regs_map_free(&dev_hdl);
    if ((csrval & 0xff) == (PIO_DEV_READY | PIO_IDLE_INTR))
    return (DDI_PROBE_SUCCESS);
    else
    return (DDI_PROBE_FAILURE);
}

When the driver's probe(9E) routine is called, the driver does not know whether the device being probed exists on the bus. Therefore, the driver might attempt to access device registers for a nonexistent device. A bus fault might be generated on some buses as a result.

The following example shows a probe(9E) routine that uses ddi_poke8(9F) to check for the existence of the device. ddi_poke8() cautiously attempts to write a value to a specified virtual address, using the parent nexus driver to assist in the process where necessary. If the address is not valid or the value cannot be written without an error occurring, an error code is returned. See also ddi_peek(9F).

In this example, ddi_regs_map_setup(9F) is used to map the device registers.

Example 6–4 probe(9E) Routine Using ddi_poke8(9F)

static int
xxprobe(dev_info_t *dip)
{
    ddi_acc_handle_t dev_hdl;
    ddi_device_acc_attr_t dev_attr;
    Pio_csr *csrp;
    uint8_t csrval;

    /*
     * if the device is self-identifying, no need to probe
     */
    if (ddi_dev_is_sid(dip) == DDI_SUCCESS)
    return (DDI_PROBE_DONTCARE);

    /*
     * Initialize the device access attrributes and map in
     * the device's CSR register (register 0)
     */
    dev_attr.devacc_attr_version - DDI_DEVICE_ATTR_V0;
    dev_attr.devacc_attr_endian_flags = DDI_STRUCTURE_LE_ACC;
    dev_attr.devacc_attr_dataorder = DDI_STRICTORDER_ACC;

    if (ddi_regs_map_setup(dip, 0, (caddr_t *)&csrp, 0, sizeof (Pio_csr),
    &dev_attr, &dev_hdl) != DDI_SUCCESS)
    return (DDI_PROBE_FAILURE);

    /*
     * The bus can generate a fault when probing for devices that
     * do not exist.  Use ddi_poke8(9f) to handle any faults that
     * might occur.
     *
     * Reset the device.  Once the reset completes the CSR should read
     * back (PIO_DEV_READY | PIO_IDLE_INTR)
     */
    if (ddi_poke8(dip, csrp, PIO_RESET) != DDI_SUCCESS) {
    ddi_regs_map_free(&dev_hdl);
    return (DDI_FAILURE);

    csrval = ddi_get8(dev_hdl, csrp);
    /*
     * tear down the mappings and return probe success/failure
     */
    ddi_regs_map_free(&dev_hdl);
    if ((csrval & 0xff) == (PIO_DEV_READY | PIO_IDLE_INTR))
    return (DDI_PROBE_SUCCESS);
    else
    return (DDI_PROBE_FAILURE);
}

`attach()` Entry Point

The kernel calls a driver's attach(9E) entry point to attach an instance of a device or to resume operation for an instance of a device that has been suspended or has been shut down by the power management framework. This section discusses only the operation of attaching device instances. Power management is discussed in Chapter 12, Power Management.

A driver's attach(9E) entry point is called to attach each instance of a device that is bound to the driver. The entry point is called with the instance of the device node to attach, with DDI_ATTACH specified as the cmd argument to attach(9E). The attach entry point typically includes the following types of processing:

Allocating a soft-state structure for the device instance
Initializing per-instance mutexes
Initializing condition variables
Registering the device's interrupts
Mapping the registers and memory of the device instance
Creating minor device nodes for the device instance
Reporting that the device instance has attached

Driver Soft-State Management

To assist device driver writers in allocating state structures, the Oracle Solaris DDI/DKI provides a set of memory management routines called software state management routines, which are also known as the soft-state routines. These routines dynamically allocate, retrieve, and destroy memory items of a specified size, and hide the details of list management. An instance number identifies the desired memory item. This number is typically the instance number assigned by the system.

Drivers typically allocate a soft-state structure for each device instance that attaches to the driver by calling ddi_soft_state_zalloc(9F), passing the instance number of the device. Because no two device nodes can have the same instance number, ddi_soft_state_zalloc(9F) fails if an allocation already exists for a given instance number.

A driver's character or block entry point (cb_ops(9S)) references a particular soft state structure by first decoding the device's instance number from the dev_t argument that is passed to the entry point function. The driver then calls ddi_get_soft_state(9F), passing the per-driver soft-state list and the instance number that was derived. A NULL return value indicates that effectively the device does not exist and the appropriate code should be returned by the driver.

See Creating Minor Device Nodes for additional information on how instance numbers and device numbers, or dev_t's, are related.

Lock Variable and Conditional Variable Initialization

Drivers should initialize any per-instance locks and condition variables during attach. The initialization of any locks that are acquired by the driver's interrupt handler must be initialized prior to adding any interrupt handlers. See Chapter 3, Multithreading for a description of lock initialization and usage. See Chapter 8, Interrupt Handlers for a discussion of interrupt handler and lock issues.

Creating Minor Device Nodes

An important part of the attach process is the creation of minor nodes for the device instance. A minor node contains the information exported by the device and the DDI framework. The system uses this information to create a special file for the minor node under /devices.

Minor nodes are created when the driver calls ddi_create_minor_node(9F). The driver supplies a minor number, a minor name, a minor node type, and whether the minor node represents a block or character device.

Drivers can create any number of minor nodes for a device. The Oracle Solaris DDI/DKI expects certain classes of devices to have minor nodes created in a particular format. For example, disk drivers are expected to create 16 minor nodes for each physical disk instance attached. Eight minor nodes are created, representing the a - h block device interfaces, with an additional eight minor nodes for the a,raw - h,raw character device interfaces.

The minor number passed to ddi_create_minor_node(9F) is defined wholly by the driver. The minor number is usually an encoding of the instance number of the device with a minor node identifier. In the preceding example, the driver creates minor numbers for each of the minor nodes by shifting the instance number of the device left by three bits and using the OR of that result with the minor node index. The values of the minor node index range from 0 to 7. Note that minor nodes a and a,raw share the same minor number. These minor nodes are distinguished by the spec_type argument passed to ddi_create_minor_node().

The minor node type passed to ddi_create_minor_node(9F) classifies the type of device, such as disks, tapes, network interfaces, frame buffers, and so forth.

The following table lists the types of possible nodes that might be created.

Table 6–1 Possible Node Types


Constant	Description
`DDI_NT_SERIAL`	Serial port
`DDI_NT_SERIAL_DO`	Dialout ports
`DDI_NT_BLOCK`	Hard disks
`DDI_NT_BLOCK_CHAN`	Hard disks with channel or target numbers
`DDI_NT_CD`	ROM drives (CD-ROM)
`DDI_NT_CD_CHAN`	ROM drives with channel or target numbers
`DDI_NT_FD`	Floppy disks
`DDI_NT_TAPE`	Tape drives
`DDI_NT_NET`	Network devices
`DDI_NT_DISPLAY`	Display devices
`DDI_NT_MOUSE`	Mouse
`DDI_NT_KEYBOARD`	Keyboard
`DDI_NT_AUDIO`	Audio Device
`DDI_PSEUDO`	General pseudo devices

The node types DDI_NT_BLOCK, DDI_NT_BLOCK_CHAN, DDI_NT_CD, and DDI_NT_CD_CHAN cause devfsadm(1M) to identify the device instance as a disk and to create names in the /dev/dsk or /dev/rdsk directory.

The node type DDI_NT_TAPE causes devfsadm(1M) to identify the device instance as a tape and to create names in the /dev/rmt directory.

The node types DDI_NT_SERIAL and DDI_NT_SERIAL_DO cause devfsadm(1M) to perform these actions:

Identify the device instance as a serial port
Create names in the /dev/term directory
Add entries to the /etc/inittab file

Vendor-supplied strings should include an identifying value such as a name or stock symbol to make the strings unique. The string can be used in conjunction with devfsadm(1M) and the devlinks.tab file (see the devlinks(1M) man page) to create logical names in /dev.

Deferred Attach

open(9E) might be called on a minor device before attach(9E) has succeeded on the corresponding instance. open() must then return ENXIO, which causes the system to attempt to attach the device. If the attach() succeeds, the open() is retried automatically.

Example 6–5 Typical `attach()` Entry Point

/*
 * Attach an instance of the driver.  We take all the knowledge we
 * have about our board and check it against what has been filled in
 * for us from our FCode or from our driver.conf(4) file.
 */
static int
xxattach(dev_info_t *dip, ddi_attach_cmd_t cmd)
{
    int instance;
    Pio *pio_p;
    ddi_device_acc_attr_t   da_attr;
    static int pio_validate_device(dev_info_t *);

    switch (cmd) {
    case DDI_ATTACH:
    /*
     * first validate the device conforms to a configuration this driver
     * supports
     */
    if (pio_validate_device(dip) == 0)
        return (DDI_FAILURE);
    /*
     * Allocate a soft state structure for this device instance
     * Store a pointer to the device node in our soft state structure
     * and a reference to the soft state structure in the device
     * node.
     */
    instance = ddi_get_instance(dip);
    if (ddi_soft_state_zalloc(pio_softstate, instance) != 0)
        return (DDI_FAILURE);
    pio_p = ddi_get_soft_state(pio_softstate, instance);
    ddi_set_driver_private(dip, (caddr_t)pio_p);
    pio_p->dip = dip;
    /*
     * Before adding the interrupt, get the interrupt block
     * cookie associated with the interrupt specification to
     * initialize the mutex used by the interrupt handler.
     */
    if (ddi_get_iblock_cookie(dip, 0, &pio_p->iblock_cookie) !=
      DDI_SUCCESS) {
        ddi_soft_state_free(pio_softstate, instance);
        return (DDI_FAILURE);
    }

    mutex_init(&pio_p->mutex, NULL, MUTEX_DRIVER, pio_p->iblock_cookie);
    /*
     * Now that the mutex is initialized, add the interrupt itself.
     */
    if (ddi_add_intr(dip, 0, NULL, NULL, pio_intr, (caddr_t)instance) !=
      DDI_SUCCESS) {
        mutex_destroy(&pio_p>mutex);
        ddi_soft_state_free(pio_softstate, instance);
        return (DDI_FAILURE);
    }
    /*
     * Initialize the device access attributes for the register mapping
     */
    dev_acc_attr.devacc_attr_version = DDI_DEVICE_ATTR_V0;
    dev_acc_attr.devacc_attr_endian_flags = DDI_STRUCTURE_LE_ACC;
    dev_acc_attr.devacc_attr_dataorder = DDI_STRICTORDER_ACC;
    /*
     * Map in the csr register (register 0)
     */
    if (ddi_regs_map_setup(dip, 0, (caddr_t *)&(pio_p->csr), 0,
        sizeof (Pio_csr), &dev_acc_attr, &pio_p->csr_handle) !=
        DDI_SUCCESS) {
        ddi_remove_intr(pio_p->dip, 0, pio_p->iblock_cookie);
        mutex_destroy(&pio_p->mutex);
        ddi_soft_state_free(pio_softstate, instance);
        return (DDI_FAILURE);
    }
    /*
     * Map in the data register (register 1)
     */
    if (ddi_regs_map_setup(dip, 1, (caddr_t *)&(pio_p->data), 0,
        sizeof (uchar_t), &dev_acc_attr, &pio_p->data_handle) !=
        DDI_SUCCESS) {
        ddi_remove_intr(pio_p->dip, 0, pio_p->iblock_cookie);
        ddi_regs_map_free(&pio_p->csr_handle);
        mutex_destroy(&pio_p->mutex);
        ddi_soft_state_free(pio_softstate, instance);
        return (DDI_FAILURE);
    }
    /*
     * Create an entry in /devices for user processes to open(2)
     * This driver will create a minor node entry in /devices
     * of the form:  /devices/..../pio@X,Y:pio
     */
    if (ddi_create_minor_node(dip, ddi_get_name(dip), S_IFCHR,
        instance, DDI_PSEUDO, 0) == DDI_FAILURE) {
        ddi_remove_intr(pio_p->dip, 0, pio_p->iblock_cookie);
        ddi_regs_map_free(&pio_p->csr_handle);
        ddi_regs_map_free(&pio_p->data_handle);
        mutex_destroy(&pio_p->mutex);
        ddi_soft_state_free(pio_softstate, instance);
        return (DDI_FAILURE);
    }
    /*
     * reset device (including disabling interrupts)
     */
    ddi_put8(pio_p->csr_handle, pio_p->csr, PIO_RESET);
    /*
     * report the name of the device instance which has attached
     */
    ddi_report_dev(dip);
    return (DDI_SUCCESS);

    case DDI_RESUME:
    return (DDI_SUCCESS);

    default:
    return (DDI_FAILURE);
    }
}

Note –

The attach() routine must not make any assumptions about the order of invocations on different device instances. The system might invoke attach() concurrently on different device instances. The system might also invoke attach() and detach() concurrently on different device instances.

`detach()` Entry Point

The kernel calls a driver's detach(9E) entry point to detach an instance of a device or to suspend operation for an instance of a device by power management. This section discusses the operation of detaching device instances. Refer to Chapter 12, Power Management for a discussion of power management issues.

A driver's detach() entry point is called to detach an instance of a device that is bound to the driver. The entry point is called with the instance of the device node to be detached and with DDI_DETACH, which is specified as the cmd argument to the entry point.

A driver is required to cancel or wait for any time outs or callbacks to complete, then release any resources that are allocated to the device instance before returning. If for some reason a driver cannot cancel outstanding callbacks for free resources, the driver is required to return the device to its original state and return DDI_FAILURE from the entry point, leaving the device instance in the attached state.

There are two types of callback routines: those callbacks that can be canceled and those that cannot be canceled. timeout(9F) and bufcall(9F) callbacks can be atomically cancelled by the driver during detach(9E). Other types of callbacks such as scsi_init_pkt(9F) and ddi_dma_buf_bind_handle(9F) cannot be canceled. The driver must either block in detach() until the callback completes or else fail the request to detach.

Example 6–6 Typical `detach()` Entry Point

/*
 * detach(9e)
 * free the resources that were allocated in attach(9e)
 */
static int
xxdetach(dev_info_t *dip, ddi_detach_cmd_t cmd)
{
    Pio     *pio_p;
    int     instance;

    switch (cmd) {
    case DDI_DETACH:

    instance = ddi_get_instance(dip);
    pio_p = ddi_get_soft_state(pio_softstate, instance);

    /*
     * turn off the device
     * free any resources allocated in attach
     */
    ddi_put8(pio_p->csr_handle, pio_p->csr, PIO_RESET);
    ddi_remove_minor_node(dip, NULL);
    ddi_regs_map_free(&pio_p->csr_handle);
    ddi_regs_map_free(&pio_p->data_handle);
    ddi_remove_intr(pio_p->dip, 0, pio_p->iblock_cookie);
    mutex_destroy(&pio_p->mutex);
    ddi_soft_state_free(pio_softstate, instance);
    return (DDI_SUCCESS);

    case DDI_SUSPEND:
    default:
    return (DDI_FAILURE);
    }
}

`getinfo()` Entry Point

The system calls getinfo(9E) to obtain configuration information that only the driver knows. The mapping of minor numbers to device instances is entirely under the control of the driver. The system sometimes needs to ask the driver which device a particular dev_t represents.

The getinfo() function can take either DDI_INFO_DEVT2INSTANCE or DDI_INFO_DEVT2DEVINFO as its infocmd argument. The DDI_INFO_DEVT2INSTANCE command requests the instance number of a device. The DDI_INFO_DEVT2DEVINFO command requests a pointer to the dev_info structure of a device.

In the DDI_INFO_DEVT2INSTANCE case, arg is a dev_t, and getinfo() must translate the minor number in dev_t to an instance number. In the following example, the minor number is the instance number, so getinfo() simply passes back the minor number. In this case, the driver must not assume that a state structure is available, since getinfo() might be called before attach(). The mapping defined by the driver between the minor device number and the instance number does not necessarily follow the mapping shown in the example. In all cases, however, the mapping must be static.

In the DDI_INFO_DEVT2DEVINFO case, arg is again a dev_t, so getinfo() first decodes the instance number for the device. getinfo() then passes back the dev_info pointer saved in the driver's soft state structure for the appropriate device, as shown in the following example.

Example 6–7 Typical `getinfo()` Entry Point

/*
 * getinfo(9e)
 * Return the instance number or device node given a dev_t
 */
static int
xxgetinfo(dev_info_t *dip, ddi_info_cmd_t infocmd, void *arg, void **result)
{
    int error;
    Pio *pio_p;
    int instance = getminor((dev_t)arg);

    switch (infocmd) {
    /*
     * return the device node if the driver has attached the
     * device instance identified by the dev_t value which was passed
     */
    case DDI_INFO_DEVT2DEVINFO:
    pio_p = ddi_get_soft_state(pio_softstate, instance);
    if (pio_p == NULL) {
        *result = NULL;
        error = DDI_FAILURE;
    } else {
        mutex_enter(&pio_p->mutex);
        *result = pio_p->dip;
        mutex_exit(&pio_p->mutex);
        error = DDI_SUCCESS;
    }
    break;
    /*
     * the driver can always return the instance number given a dev_t
     * value, even if the instance is not attached.
     */
    case DDI_INFO_DEVT2INSTANCE:
    *result = (void *)instance;
    error = DDI_SUCCESS;
    break;
    default:
    *result = NULL;
    error = DDI_FAILURE;
    }
    return (error);
}

Note –

The getinfo() routine must be kept in sync with the minor nodes that the driver creates. If the minor nodes get out of sync, any hotplug operations might fail and cause a system panic.

Using Device IDs

The Oracle Solaris DDI interfaces enable drivers to provide the device ID, a persistent unique identifier for a device. The device ID can be used to identify or locate a device. The device ID is independent of the /devices name or device number (dev_t). Applications can use the functions defined in libdevid(3LIB) to read and manipulate the device IDs registered by the drivers.

Before a driver can export a device ID, the driver needs to verify the device is capable of either providing a unique ID or of storing a host-generated unique ID in a not normally accessible area. WWN (world-wide number) is an example of a unique ID that is provided by the device. Device NVRAM and reserved sectors are examples of non-accessible areas where host-generated unique IDs can be safely stored.

Registering Device IDs

Drivers typically initialize and register device IDs in the driver's attach(9E) handler. As mentioned above, the driver is responsible for registering a device ID that is persistent. As such, the driver might be required to handle both devices that can provide a unique ID directly (WWN) and devices where fabricated IDs are written to and read from stable storage.

Registering a Device-Supplied ID

If the device can supply the driver with an identifier that is unique, the driver can simply initialize the device ID with this identifier and register the ID with the Oracle Solaris DDI.

/*
 * The device provides a guaranteed unique identifier,
 * in this case a SCSI3-WWN.  The WWN for the device has been
 * stored in the device's soft state.
 */
if (ddi_devid_init(dip, DEVID_SCSI3_WWN, un->un_wwn_len, un->un_wwn,
    &un->un_devid) != DDI_SUCCESS)
    return (DDI_FAILURE);

(void) ddi_devid_register(dip, un->un_devid);

Registering a Fabricated ID

A driver might also register device IDs for devices that do not directly supply a unique ID. Registering these IDs requires the device to be capable of storing and retrieving a small amount of data in a reserved area. The driver can then create a fabricated device ID and write it to the reserved area.

/*
 * the device doesn't supply a unique ID, attempt to read
 * a fabricated ID from the device's reserved data.
 */
if (xxx_read_deviceid(un, &devid_buf) == XXX_OK) {
    if (ddi_devid_valid(devid_buf) == DDI_SUCCESS) {
        devid_sz = ddi_devi_sizeof(devid_buf);
        un->un_devid = kmem_alloc(devid_sz, KM_SLEEP);
        bcopy(devid_buf, un->un_devid, devid_sz);
        ddi_devid_register(dip, un->un_devid);
        return (XXX_OK);
    }
}
/*
 * we failed to read a valid device ID from the device
 * fabricate an ID, store it on the device, and register
 * it with the DDI
 */
if (ddi_devid_init(dip, DEVID_FAB, 0, NULL, &un->un_devid)
    == DDI_FAILURE) {
    return (XXX_FAILURE);
}
if (xxx_write_deviceid(un) != XXX_OK) {
    ddi_devid_free(un->un_devid);
    un->un_devid = NULL;
    return (XXX_FAILURE);
}
ddi_devid_register(dip, un->un_devid);
return (XXX_OK);

Unregistering Device IDs

Drivers typically unregister and free any device IDs that are allocated as part of the detach(9E) handling. The driver first calls ddi_devid_unregister(9F) to unregister the device ID for the device instance. The driver must then free the device ID handle itself by calling ddi_devid_free(9F), and then passing the handle that had been returned by ddi_devid_init(9F). The driver is responsible for managing any space allocated for WWN or Serial Number data.

Chapter 7 Device Access: Programmed I/O

The Oracle Solaris OS provides driver developers with a comprehensive set of interfaces for accessing device memory. These interfaces are designed to shield the driver from platform-specific dependencies by handling mismatches between processor and device endianness as well as enforcing any data order dependencies the device might have. By using these interfaces, you can develop a single-source driver that runs on both the SPARC and x86 processor architectures as well as the various platforms from each respective processor family.

This chapter provides information on the following subjects:

Device Memory

Devices that support programmed I/O are assigned one or more regions of bus address space that map to addressable regions of the device. These mappings are described as pairs of values in the reg property associated with the device. Each value pair describes a segment of a bus address.

Drivers identify a particular bus address mapping by specifying the register number, or regspec, which is an index into the devices' reg property. The reg property identifies the busaddr and size for the device. Drivers pass the register number when making calls to DDI functions such as ddi_regs_map_setup(9F). Drivers can determine how many mappable regions have been assigned to the device by calling ddi_dev_nregs(9F).

Managing Differences in Device and Host Endianness

The data format of the host can have different endian characteristics than the data format of the device. In such a case, data transferred between the host and device would need to be byte-swapped to conform to the data format requirements of the destination location. Devices with the same endian characteristics of the host require no byte-swapping of the data.

Drivers specify the endian characteristics of the device by setting the appropriate flag in the ddi_device_acc_attr(9S) structure that is passed to ddi_regs_map_setup(9F). The DDI framework then performs any required byte-swapping when the driver calls a ddi_getX routine like ddi_get8(9F) or a ddi_putX routine like ddi_put16(9F) to read or write to device memory.

Managing Data Ordering Requirements

Platforms can reorder loads and stores of data to optimize performance of the platform. Because reordering might not be allowed by certain devices, the driver is required to specify the device's ordering requirements when setting up mappings to the device.

`ddi_device_acc_attr` Structure

This structure describes the endian and data order requirements of the device. The driver is required to initialize and pass this structure as an argument to ddi_regs_map_setup(9F).

typedef struct ddi_device_acc_attr {
    ushort_t    devacc_attr_version;
    uchar_t     devacc_attr_endian_flags;
    uchar_t     devacc_attr_dataorder;
} ddi_device_acc_attr_t;

devacc_attr_version

Specifies DDI_DEVICE_ATTR_V0

devacc_attr_endian_flags

Describes the endian characteristics of the device. Specified as a bit value whose possible values are:

DDI_NEVERSWAP_ACC – Never swap data
DDI_STRUCTURE_BE_ACC – The device data format is big-endian
DDI_STRUCTURE_LE_ACC – The device data format is little-endian

devacc_attr_dataorder

Describes the order in which the CPU must reference data as required by the device. Specified as an enumerated value, where data access restrictions are ordered from most strict to least strict.

DDI_STRICTORDER_ACC – The host must issue the references in order, as specified by the programmer. This flag is the default behavior.
DDI_UNORDERED_OK_ACC – The host is allowed to reorder loads and stores to device memory.
DDI_MERGING_OK_ACC – The host is allowed to merge individual stores to consecutive locations. This setting also implies reordering.
DDI_LOADCACHING_OK_ACC – The host is allowed to read data from the device until a store occurs.
DDI_STORECACHING_OK_ACC – The host is allowed to cache data written to the device. The host can then defer writing the data to the device until a future time.

Note –

The system can access data more strictly than the driver specifies in devacc_attr_dataorder. The restriction to the host diminishes while moving from strict data ordering to cache storing in terms of data accesses by the driver.

Mapping Device Memory

Drivers typically map all regions of a device during attach(9E). The driver maps a region of device memory by calling ddi_regs_map_setup(9F), specifying the register number of the region to map, the device access attributes for the region, an offset, and size. The DDI framework sets up the mappings for the device region and returns an opaque handle to the driver. This data access handle is passed as an argument to the ddi_get8(9F) or ddi_put8(9F) family of routines when reading data from or writing data to that region of the device.

The driver verifies that the shape of the device mappings match what the driver is expecting by checking the number of mappings exported by the device. The driver calls ddi_dev_nregs(9F) and then verifies the size of each mapping by calling ddi_dev_regsize(9F).

Mapping Setup Example

The following simple example demonstrates the DDI data access interfaces. This driver is for a fictional little endian device that accepts one character at a time and generates an interrupt when ready for another character. This device implements two register sets: the first is an 8-bit CSR register, and the second is an 8-bit data register.

Example 7–1 Mapping Setup

#define CSR_REG 0
#define DATA_REG 1
/*
 * Initialize the device access attributes for the register
 * mapping
 */
dev_acc_attr.devacc_attr_version = DDI_DEVICE_ATTR_V0;
dev_acc_attr.devacc_attr_endian_flags = DDI_STRUCTURE_LE_ACC;
dev_acc_attr.devacc_attr_dataorder = DDI_STRICTORDER_ACC;
/*
 * Map in the csr register (register 0)
 */
if (ddi_regs_map_setup(dip, CSR_REG, (caddr_t *)&(pio_p->csr), 0,
  sizeof (Pio_csr), &dev_acc_attr, &pio_p->csr_handle) != DDI_SUCCESS) {
    mutex_destroy(&pio_p->mutex);
    ddi_soft_state_free(pio_softstate, instance);
    return (DDI_FAILURE);
}
/*
 * Map in the data register (register 1)
 */
if (ddi_regs_map_setup(dip, DATA_REG, (caddr_t *)&(pio_p->data), 0,
  sizeof (uchar_t), &dev_acc_attr, &pio_p->data_handle) \
  != DDI_SUCCESS) {
    mutex_destroy(&pio_p->mutex);
    ddi_regs_map_free(&pio_p->csr_handle);
    ddi_soft_state_free(pio_softstate, instance);
    return (DDI_FAILURE);
}

Device Access Functions

Drivers use the ddi_get8(9F) and ddi_put8(9F) family of routines in conjunction with the handle returned by ddi_regs_map_setup(9F) to transfer data to and from a device. The DDI framework automatically handles any byte-swapping that is required to meet the endian format for the host or device, and enforces any store-ordering constraints the device might have.

The DDI provides interfaces for transferring data in 8-bit, 16-bit, 32-bit, and 64-bit quantities, as well as interfaces for transferring multiple values repeatedly. See the man pages for the ddi_get8(9F), ddi_put8(9F), ddi_rep_get8(9F) and ddi_rep_put8(9F) families of routines for a complete listing and description of these interfaces.

The following example builds on Example 7–1 where the driver mapped the device's CSR and data registers. Here, the driver's write(9E) entry point, when called, writes a buffer of data to the device one byte at a time.

Example 7–2 Mapping Setup: Buffer

static  int
pio_write(dev_t dev, struct uio *uiop, cred_t *credp)
{
    int  retval;
    int  error = OK;
    Pio *pio_p = ddi_get_soft_state(pio_softstate, getminor(dev));

    if (pio_p == NULL)
        return (ENXIO);
    mutex_enter(&pio_p->mutex);
    /*
     * enable interrupts from the device by setting the Interrupt
     * Enable bit in the devices CSR register
     */
    ddi_put8(pio_p->csr_handle, pio_p->csr,
      (ddi_get8(pio_p->csr_handle, pio_p->csr) | PIO_INTR_ENABLE));

    while (uiop->uio_resid > 0) {
    /*
     * This device issues an IDLE interrupt when it is ready
     * to accept a character; the interrupt can be cleared
     * by setting PIO_INTR_CLEAR.  The interrupt is reasserted
     * after the next character is written or the next time
     * PIO_INTR_ENABLE is toggled on.
     *
     * wait for interrupt (see pio_intr)
     */
        cv_wait(&pio_p->cv, &pio_p->mutex);

    /*
     * get a character from the user's write request
     * fail the write request if any errors are encountered
     */
        if ((retval = uwritec(uiop)) == -1) {
            error = retval;
            break;
        }

    /*
     * pass the character to the device by writing it to
     * the device's data register
     */
        ddi_put8(pio_p->data_handle, pio_p->data, (uchar_t)retval);
    }

    /*
     * disable interrupts by clearing the Interrupt Enable bit
     * in the CSR
     */
    ddi_put8(pio_p->csr_handle, pio_p->csr,
      (ddi_get8(pio_p->csr_handle, pio_p->csr) & ~PIO_INTR_ENABLE));

    mutex_exit(&pio_p->mutex);
    return (error);
}

Alternate Device Access Interfaces

In addition to implementing all device accesses through the ddi_get8(9F) and ddi_put8(9F) families of interfaces, the Oracle Solaris OS provides interfaces that are specific to particular bus implementations. While these functions can be more efficient on some platforms, use of these routines can limit the ability of the driver to remain portable across different bus versions of the device.

Memory Space Access

With memory mapped access, device registers appear in memory address space. The ddi_getX family of routines and the ddi_putX family are available for use by drivers as an alternative to the standard device access interfaces.

I/O Space Access

With I/O space access, the device registers appear in I/O space, where each addressable element is called an I/O port. The ddi_io_get8(9F) and ddi_io_put8(9F) routines are available for use by drivers as an alternative to the standard device access interfaces.

PCI Configuration Space Access

To access PCI configuration space without using the normal device access interfaces, a driver is required to map PCI configuration space by calling pci_config_setup(9F) in place of ddi_regs_map_setup(9F). The driver can then call the pci_config_get8(9F) and pci_config_put8(9F) families of interfaces to access PCI configuration space.

Chapter 8 Interrupt Handlers

This chapter describes mechanisms for handling interrupts, such as allocating, registering, servicing, and removing interrupts. This chapter provides information on the following subjects:

Interrupt Handler Overview

An interrupt is a hardware signal from a device to a CPU. An interrupt tells the CPU that the device needs attention and that the CPU should stop any current activity and respond to the device. If the CPU is not performing a task that has higher priority than the priority of the interrupt, then the CPU suspends the current thread. The CPU then invokes the interrupt handler for the device that sent the interrupt signal. The job of the interrupt handler is to service the device and stop the device from interrupting. When the interrupt handler returns, the CPU resumes the work it was doing before the interrupt occurred.

The Oracle Solaris DDI/DKI provides interfaces for performing the following tasks:

Determining interrupt type and registration requirements
Registering interrupts
Servicing interrupts
Masking interrupts
Getting interrupt pending information
Getting and setting priority information

Device Interrupts

I/O buses implement interrupts in two common ways: vectored and polled. Both methods commonly supply a bus-interrupt priority level. Vectored devices also supply an interrupt vector. Polled devices do not supply interrupt vectors.

To stay current with changing bus technologies, the Oracle Solaris OS has been enhanced to accommodate both newer types of interrupts and more traditional interrupts that have been in use for many years. Specifically, the operating system now recognizes three types of interrupts:

Legacy interrupts – Legacy or fixed interrupts refer to interrupts that use older bus technologies. With these technologies, interrupts are signaled by using one or more external pins that are wired “out-of-band,” that is, separately from the main lines of the bus. Newer bus technologies such as PCI Express maintain software compatibility by emulating legacy interrupts through in-band mechanisms. These emulated interrupts are treated as legacy interrupts by the host OS.
Message-signaled interrupts – Instead of using pins, message-signaled interrupts (MSI) are in-band messages and can target addresses in the host bridge. (See PCI Local Bus for more information on host bridges.) MSIs can send data along with the interrupt message. Each MSI is unshared so that an MSI that is assigned to a device is guaranteed to be unique within the system. A PCI function can request up to 32 MSI messages.
Extended message-signaled interrupts – Extended message-signaled interrupts (MSI-X) are an enhanced version of MSIs. MSI-X interrupts have the following added advantages:
- Support 2048 messages rather than 32 messages
- Support independent message address and message data for each message
- Support per-message masking
- Enable more flexibility when software allocates fewer vectors than hardware requests. The software can reuse the same MSI-X address and data in multiple MSI-X slots.

Note –

Some newer bus technologies such as PCI Express require MSIs but can accommodate legacy interrupts by using INTx emulation. INTx emulation is used for compatibility purposes, but is not considered to be good practice.

High-Level Interrupts

A bus prioritizes a device interrupt at a bus-interrupt level. The bus interrupt level is then mapped to a processor-interrupt level. A bus interrupt level that maps to a CPU interrupt priority above the scheduler priority level is called a high-level interrupt. High-level interrupt handlers are restricted to calling the following DDI interfaces:

mutex_enter(9F) and mutex_exit(9F) on a mutex that is initialized with an interrupt priority associated with the high-level interrupt
ddi_intr_trigger_softint(9F)
The following DDI get and put routines: ddi_get8(9F), ddi_put8(9F), ddi_get16(9F), ddi_put16(9F), ddi_get32(9F), ddi_put32(9F), ddi_get64(9F), and ddi_put64(9F).

A bus-interrupt level by itself does not determine whether a device interrupts at a high level. A particular bus-interrupt level can map to a high-level interrupt on one platform, but map to an ordinary interrupt on another platform.

A driver is not required to support devices that have high-level interrupts. However, the driver is required to check the interrupt level. If the interrupt priority is greater than or equal to the highest system priority, the interrupt handler runs in high-level interrupt context. In this case, the driver can fail to attach, or the driver can use a two-level scheme to handle interrupts. For more information, see Handling High-Level Interrupts.

Legacy Interrupts

The only information that the system has about a device interrupt is the priority level of the bus interrupt and the interrupt request number. An example of the priority level for a bus interrupt is the IPL on an SBus in a SPARC machine. An example of an interrupt request number is the IRQ on an ISA bus in an x86 machine.

When an interrupt handler is registered, the system adds the handler to a list of potential interrupt handlers for each IPL or IRQ. When the interrupt occurs, the system must determine which device actually caused the interrupt, among all devices that are associated with a given IPL or IRQ. The system calls all the interrupt handlers for the designated IPL or IRQ until one handler claims the interrupt.

The following buses are capable of supporting polled interrupts:

SBus
ISA
PCI

Standard and Extended Message-Signaled Interrupts

Both standard (MSI) and extended (MSI-X) message-signaled interrupts are implemented as in-band messages. A message-signaled interrupt is posted as a write with an address and value that are specified by the software.

MSI Interrupts

Conventional PCI specifications include optional support for Message Signaled Interrupts (MSI). An MSI is an in-band message that is implemented as a posted write. The address and the data for the MSI are specified by software and are specific to the host bridge. Because the messages are in-band, the receipt of the message can be used to “push” data that is associated with the interrupt. By definition, MSI interrupts are unshared. Each MSI message that is assigned to a device is guaranteed to be a unique message in the system. PCI functions can request 1, 2, 4, 8, 16, or 32 MSI messages. Note that the system software can allocate fewer MSI messages to a function than the function requested. The host bridge can be limited in the number of unique MSI messages that are allocated for devices.

MSI-X Interrupts

MSI-X interrupts are enhanced versions of MSI interrupts that have the same features as MSI interrupts with the following key differences:

A maximum of 2048 MSI-X interrupt vectors are supported per device.
Address and data entries are unique per interrupt vector.
MSI-X supports per function masking and per vector masking.

With MSI-X interrupts, an unallocated interrupt vector of a device can use a previously added or initialized MSI-X interrupt vector to share the same vector address, vector data, interrupt handler, and handler arguments. Use the ddi_intr_dup_handler(9F) function to alias the resources provided by the Oracle Solaris OS to the unallocated interrupt vectors on an associated device. For example, if 2 MSI-X interrupts are allocated to a driver and 32 interrupts are supported on the device, then the driver can use ddi_intr_dup_handler() to alias the 2 interrupts it received to the 30 additional interrupts on the device.

The ddi_intr_dup_handler() function can duplicate interrupts that were added with ddi_intr_add_handler(9F) or initialized with ddi_intr_enable(9F).

A duplicated interrupt is disabled initially. Use ddi_intr_enable() to enable the duplicated interrupt. You cannot remove the original MSI-X interrupt handler until all duplicated interrupt handlers that are associated with this original interrupt handler are removed. To remove a duplicated interrupt handler, first call ddi_intr_disable(9F), and then call ddi_intr_free(9F). When all duplicated interrupt handlers that are associated with this original interrupt handler are removed, then you can use ddi_intr_remove_handler(9F) to remove the original MSI-X interrupt handler. See the ddi_intr_dup_handler(9F) man page for examples.

Software Interrupts

The Oracle Solaris DDI/DKI supports software interrupts, also known as soft interrupts. Soft interrupts are initiated by software rather than by a hardware device. Handlers for these interrupts must also be added to and removed from the system. Soft interrupt handlers run in interrupt context and therefore can be used to do many of the tasks that belong to an interrupt handler.

Hardware interrupt handlers must perform their tasks quickly, because the handlers might have to suspend other system activity while doing these tasks. This requirement is particularly true for high-level interrupt handlers, which operate at priority levels greater than the priority level of the system scheduler. High-level interrupt handlers mask the operations of all lower-priority interrupts, including the interrupt operations of the system clock. Consequently, the interrupt handler must avoid involvement in activities that might cause it to sleep, such as acquiring a mutex.

If the handler sleeps, then the system might hang because the clock is masked and incapable of scheduling the sleeping thread. For this reason, high-level interrupt handlers normally perform a minimum amount of work at high-priority levels and delegate other tasks to software interrupts, which run below the priority level of the high-level interrupt handler. Because software interrupt handlers run below the priority level of the system scheduler, software interrupt handlers can do the work that the high-level interrupt handler was incapable of doing.

DDI Interrupt Functions

The Oracle Solaris OS provides a framework for registering and unregistering interrupts and provides support for Message Signaled Interrupts (MSIs). Interrupt management interfaces enable you to manipulate priorities, capabilities, and interrupt masking, and to obtain pending information.

Interrupt Capability Functions

Use the following functions to obtain interrupt information:

ddi_intr_get_navail(9F): Returns the number of interrupts available for a specified hardware device and interrupt type.
ddi_intr_get_nintrs(9F): Returns the number of interrupts that the device supports for the specified interrupt type.
ddi_intr_get_supported_types(9F): Returns the hardware interrupt types that are supported by both the device and the host.
ddi_intr_get_cap(9F): Returns interrupt capability flags for the specified interrupt.

Interrupt Initialization and Destruction Functions

Use the following functions to create and remove interrupts:

ddi_intr_alloc(9F): Allocates system resources and interrupt vectors for the specified type of interrupt.
ddi_intr_free(9F): Releases the system resources and interrupt vectors for a specified interrupt handle.
ddi_intr_set_cap(9F): Sets the capability of the specified interrupt through the use of the DDI_INTR_FLAG_LEVEL and DDI_INTR_FLAG_EDGE flags.
ddi_intr_add_handler(9F): Adds an interrupt handler.
ddi_intr_dup_handler(9F): Use with MSI-X only. Copies an address and data pair for an allocated interrupt vector to an unused interrupt vector on the same device.
ddi_intr_remove_handler(9F): Removes the specified interrupt handler.
ddi_intr_enable(9F): Enables the specified interrupt.
ddi_intr_disable(9F): Disables the specified interrupt.
ddi_intr_block_enable(9F): Use with MSI only. Enables the specified range of interrupts.
ddi_intr_block_disable(9F): Use with MSI only. Disables the specified range of interrupts.
ddi_intr_set_mask(9F): Sets an interrupt mask if the specified interrupt is enabled.
ddi_intr_clr_mask(9F): Clears an interrupt mask if the specified interrupt is enabled.
ddi_intr_get_pending(9F): Reads the interrupt pending bit if such a bit is supported by either the host bridge or the device.

Priority Management Functions

Use the following functions to obtain and set priority information:

ddi_intr_get_pri(9F): Returns the current software priority setting for the specified interrupt.
ddi_intr_set_pri(9F): Sets the interrupt priority level for the specified interrupt.
ddi_intr_get_hilevel_pri(9F): Returns the minimum priority level for a high-level interrupt.

Soft Interrupt Functions

Use the following functions to manipulate soft interrupts and soft interrupt handlers:

ddi_intr_add_softint(9F): Adds a soft interrupt handler.
ddi_intr_trigger_softint(9F): Triggers the specified soft interrupt.
ddi_intr_remove_softint(9F): Removes the specified soft interrupt handler.
ddi_intr_get_softint_pri(9F): Returns the soft interrupt priority for the specified interrupt.
ddi_intr_set_softint_pri(9F): Changes the relative soft interrupt priority for the specified soft interrupt.

Interrupt Function Examples

This section provides examples for performing the following tasks:

Changing soft interrupt priority
Checking for pending interrupts
Setting interrupt masks
Clearing interrupt masks

Example 8–1 Changing Soft Interrupt Priority

Use the ddi_intr_set_softint_pri(9F) function to change the soft interrupt priority to 9.

if (ddi_intr_set_softint_pri(mydev->mydev_softint_hdl, 9) != DDI_SUCCESS)
    cmn_err (CE_WARN, "ddi_intr_set_softint_pri failed");

Example 8–2 Checking for Pending Interrupts

Use the ddi_intr_get_pending(9F) function to check whether an interrupt is pending.

if (ddi_intr_get_pending(mydevp->htable[0], &pending) != DDI_SUCCESS)
    cmn_err(CE_WARN, "ddi_intr_get_pending() failed");
else if (pending)
    cmn_err(CE_NOTE, "ddi_intr_get_pending(): Interrupt pending");

Example 8–3 Setting Interrupt Masks

Use the ddi_intr_set_mask(9F) function to set interrupt masking to prevent the device from receiving interrupts.

if ((ddi_intr_set_mask(mydevp->htable[0]) != DDI_SUCCESS))
    cmn_err(CE_WARN, "ddi_intr_set_mask() failed");

Example 8–4 Clearing Interrupt Masks

Use the ddi_intr_clr_mask(9F) function to clear interrupt masking. The ddi_intr_clr_mask(9F) function fails if the specified interrupt is not enabled. If the ddi_intr_clr_mask(9F) function succeeds, the device starts generating interrupts.

if (ddi_intr_clr_mask(mydevp->htable[0]) != DDI_SUCCESS)
    cmn_err(CE_WARN, "ddi_intr_clr_mask() failed");

Registering Interrupts

Before a device driver can receive and service interrupts, the driver must call ddi_intr_add_handler(9F) to register an interrupt handler with the system. Registering interrupt handlers provides the system with a way to associate an interrupt handler with an interrupt specification. The interrupt handler is called when the device might have been responsible for the interrupt. The handler has the responsibility of determining whether it should handle the interrupt and, if so, of claiming that interrupt.

Tip –

Use the ::interrupts command in the mdb or kmdb debugger to retrieve the registered interrupt information of a device on supported SPARC and x86 systems.

Registering Legacy Interrupts

To register a driver's interrupt handler, the driver typically performs the following steps in its attach(9E) entry point:

Use ddi_intr_get_supported_types(9F) to determine which types of interrupts are supported.
Use ddi_intr_get_nintrs(9F) to determine the number of supported interrupt types.
Use kmem_zalloc(9F) to allocate memory for DDI interrupt handles.
For each interrupt type that you allocate, take the following steps:
1. Use ddi_intr_get_pri(9F) to get the priority for the interrupt.
2. If you need to set a new priority for the interrupt, use ddi_intr_set_pri(9F).
3. Use mutex_init(9F) to initialize the lock.
4. Use ddi_intr_add_handler(9F) to register the handler for the interrupt.
5. Use ddi_intr_enable(9F) to enable the interrupt.
Take the following steps to free each interrupt:
1. Disable each interrupt using ddi_intr_disable(9F).
2. Remove the interrupt handler using ddi_intr_remove_handler(9F).
3. Remove the lock using mutex_destroy(9F).
4. Free the interrupt using ddi_intr_free(9F) and kmem_free(9F) to free memory that was allocated for DDI interrupt handles.

Example 8–5 Registering a Legacy Interrupt

The following example shows how to install an interrupt handler for a device called mydev. This example assumes that mydev supports one interrupt only.

/* Determine which types of interrupts supported */
ret = ddi_intr_get_supported_types(mydevp->mydev_dip, &type);

if ((ret != DDI_SUCCESS) || (!(type & DDI_INTR_TYPE_FIXED))) {
    cmn_err(CE_WARN, "Fixed type interrupt is not supported");
    return (DDI_FAILURE);
}

/* Determine number of supported interrupts */
ret = ddi_intr_get_nintrs(mydevp->mydev_dip, DDI_INTR_TYPE_FIXED,
    &count);

/*
 * Fixed interrupts can only have one interrupt. Check to make
 * sure that number of supported interrupts and number of
 * available interrupts are both equal to 1.
 */
if ((ret != DDI_SUCCESS) || (count != 1)) {
    cmn_err(CE_WARN, "No fixed interrupts");
    return (DDI_FAILURE);
}

/* Allocate memory for DDI interrupt handles */
mydevp->mydev_htable = kmem_zalloc(sizeof (ddi_intr_handle_t),
    KM_SLEEP);
ret = ddi_intr_alloc(mydevp->mydev_dip, mydevp->mydev_htable,
    DDI_INTR_TYPE_FIXED, 0, count, &actual, 0);

if ((ret != DDI_SUCCESS) || (actual != 1)) {
    cmn_err(CE_WARN, "ddi_intr_alloc() failed 0x%x", ret);
    kmem_free(mydevp->mydev_htable, sizeof (ddi_intr_handle_t));
    return (DDI_FAILURE);
}

/* Sanity check that count and available are the same. */
ASSERT(count == actual);

/* Get the priority of the interrupt */
if (ddi_intr_get_pri(mydevp->mydev_htable[0], &mydevp->mydev_intr_pri)) {
    cmn_err(CE_WARN, "ddi_intr_alloc() failed 0x%x", ret);

    (void) ddi_intr_free(mydevp->mydev_htable[0]);
    kmem_free(mydevp->mydev_htable, sizeof (ddi_intr_handle_t));

    return (DDI_FAILURE);
}

cmn_err(CE_NOTE, "Supported Interrupt pri = 0x%x", mydevp->mydev_intr_pri);

/* Test for high level mutex */
if (mydevp->mydev_intr_pri >= ddi_intr_get_hilevel_pri()) {
    cmn_err(CE_WARN, "Hi level interrupt not supported");

    (void) ddi_intr_free(mydevp->mydev_htable[0]);
    kmem_free(mydevp->mydev_htable, sizeof (ddi_intr_handle_t));

    return (DDI_FAILURE);
}

/* Initialize the mutex */
mutex_init(&mydevp->mydev_int_mutex, NULL, MUTEX_DRIVER,
    DDI_INTR_PRI(mydevp->mydev_intr_pri));

/* Register the interrupt handler */
if (ddi_intr_add_handler(mydevp->mydev_htable[0], mydev_intr, 
   (caddr_t)mydevp, NULL) !=DDI_SUCCESS) {
    cmn_err(CE_WARN, "ddi_intr_add_handler() failed");

    mutex_destroy(&mydevp->mydev_int_mutex);
    (void) ddi_intr_free(mydevp->mydev_htable[0]);
    kmem_free(mydevp->mydev_htable, sizeof (ddi_intr_handle_t));

    return (DDI_FAILURE);
}

/* Enable the interrupt */
if (ddi_intr_enable(mydevp->mydev_htable[0]) != DDI_SUCCESS) {
    cmn_err(CE_WARN, "ddi_intr_enable() failed");

    (void) ddi_intr_remove_handler(mydevp->mydev_htable[0]);
    mutex_destroy(&mydevp->mydev_int_mutex);
    (void) ddi_intr_free(mydevp->mydev_htable[0]);
    kmem_free(mydevp->mydev_htable, sizeof (ddi_intr_handle_t));

    return (DDI_FAILURE);
}
return (DDI_SUCCESS);
}

Example 8–6 Removing a Legacy Interrupt

The following example shows how legacy interrupts are removed.

/* disable interrupt */
(void) ddi_intr_disable(mydevp->mydev_htable[0]);

/* Remove interrupt handler */
(void) ddi_intr_remove_handler(mydevp->mydev_htable[0]);

/* free interrupt handle */
(void) ddi_intr_free(mydevp->mydev_htable[0]);

/* free memory */
kmem_free(mydevp->mydev_htable, sizeof (ddi_intr_handle_t));

Registering MSI Interrupts

To register a driver's interrupt handler, the driver typically performs the following steps in its attach(9E) entry point:

Use ddi_intr_get_supported_types(9F) to determine which types of interrupts are supported.
Use ddi_intr_get_nintrs(9F) to determine the number of supported MSI interrupt types.
Use ddi_intr_alloc(9F) to allocate memory for the MSI interrupts.
For each interrupt type that you allocate, take the following steps:
1. Use ddi_intr_get_pri(9F) to get the priority for the interrupt.
2. If you need to set a new priority for the interrupt, use ddi_intr_set_pri(9F).
3. Use mutex_init(9F) to initialize the lock.
4. Use ddi_intr_add_handler(9F) to register the handler for the interrupt.
Use one of the following functions to enable all the interrupts:
- Use ddi_intr_block_enable(9F) to enable all the interrupts in a block.
- Use ddi_intr_enable(9F) in a loop to enable each interrupt individually.

Example 8–7 Registering a Set of MSI Interrupts

The following example illustrates how to register an MSI interrupt for a device called mydev.

/* Get supported interrupt types */
if (ddi_intr_get_supported_types(devinfo, &intr_types) != DDI_SUCCESS) {
    cmn_err(CE_WARN, "ddi_intr_get_supported_types failed");
    goto attach_fail;
}

if (intr_types & DDI_INTR_TYPE_MSI)
    mydev_add_msi_intrs(mydevp);

/* Check count, available and actual interrupts */
static int
mydev_add_msi_intrs(mydev_t *mydevp)
{
    dev_info_t    *devinfo = mydevp->devinfo;
    int           count, avail, actual;
    int           x, y, rc, inum = 0;

    /* Get number of interrupts */
    rc = ddi_intr_get_nintrs(devinfo, DDI_INTR_TYPE_MSI, &count);
    if ((rc != DDI_SUCCESS) || (count == 0)) {
        cmn_err(CE_WARN, "ddi_intr_get_nintrs() failure, rc: %d, "
            "count: %d", rc, count);
        return (DDI_FAILURE);
    }

    /* Get number of available interrupts */
    rc = ddi_intr_get_navail(devinfo, DDI_INTR_TYPE_MSI, &avail);
    if ((rc != DDI_SUCCESS) || (avail == 0)) {
        cmn_err(CE_WARN, "ddi_intr_get_navail() failure, "
            "rc: %d, avail: %d\n", rc, avail);
        return (DDI_FAILURE);
    }
    if (avail < count) {
        cmn_err(CE_NOTE, "nitrs() returned %d, navail returned %d",
            count, avail);
    }

    /* Allocate memory for MSI interrupts */
    mydevp->intr_size = count * sizeof (ddi_intr_handle_t);
    mydevp->htable = kmem_alloc(mydevp->intr_size, KM_SLEEP);

    rc = ddi_intr_alloc(devinfo, mydevp->htable, DDI_INTR_TYPE_MSI, inum,
        count, &actual, DDI_INTR_ALLOC_NORMAL);

    if ((rc != DDI_SUCCESS) || (actual == 0)) {
        cmn_err(CE_WARN, "ddi_intr_alloc() failed: %d", rc);
        kmem_free(mydevp->htable, mydevp->intr_size);
        return (DDI_FAILURE);
    }

    if (actual < count) {
        cmn_err(CE_NOTE, "Requested: %d, Received: %d", count, actual);
    }

    mydevp->intr_cnt = actual;
    /*
     * Get priority for first msi, assume remaining are all the same
     */
    if (ddi_intr_get_pri(mydevp->htable[0], &mydev->intr_pri) !=
        DDI_SUCCESS) {
        cmn_err(CE_WARN, "ddi_intr_get_pri() failed");

        /* Free already allocated intr */
        for (y = 0; y < actual; y++) {
            (void) ddi_intr_free(mydevp->htable[y]);
        }

        kmem_free(mydevp->htable, mydevp->intr_size);
        return (DDI_FAILURE);
    }

    /* Call ddi_intr_add_handler() */
    for (x = 0; x < actual; x++) {
        if (ddi_intr_add_handler(mydevp->htable[x], mydev_intr,
           (caddr_t)mydevp, NULL) != DDI_SUCCESS) {
            cmn_err(CE_WARN, "ddi_intr_add_handler() failed");

            /* Free already allocated intr */
            for (y = 0; y < actual; y++) {
                (void) ddi_intr_free(mydevp->htable[y]);
            }

            kmem_free(mydevp->htable, mydevp->intr_size);
            return (DDI_FAILURE);
        }
    }

    (void) ddi_intr_get_cap(mydevp->htable[0], &mydevp->intr_cap);
    if (mydev->m_intr_cap & DDI_INTR_FLAG_BLOCK) {
        /* Call ddi_intr_block_enable() for MSI */
        (void) ddi_intr_block_enable(mydev->m_htable, mydev->m_intr_cnt);
    } else {
        /* Call ddi_intr_enable() for MSI non block enable */
        for (x = 0; x < mydev->m_intr_cnt; x++) {
            (void) ddi_intr_enable(mydev->m_htable[x]);
        }
    }
    return (DDI_SUCCESS);
}

Example 8–8 Removing MSI Interrupts

The following example shows how to remove MSI interrupts.

static void
mydev_rem_intrs(mydev_t *mydev)
{
    int    x;

    /* Disable all interrupts */
    if (mydev->m_intr_cap & DDI_INTR_FLAG_BLOCK) {
        /* Call ddi_intr_block_disable() */
        (void) ddi_intr_block_disable(mydev->m_htable, mydev->m_intr_cnt);
    } else {
        for (x = 0; x < mydev->m_intr_cnt; x++) {
            (void) ddi_intr_disable(mydev->m_htable[x]);
        }
    }

    /* Call ddi_intr_remove_handler() */
    for (x = 0; x < mydev->m_intr_cnt; x++) {
        (void) ddi_intr_remove_handler(mydev->m_htable[x]);
        (void) ddi_intr_free(mydev->m_htable[x]);
    }

    kmem_free(mydev->m_htable, mydev->m_intr_size);
}

Interrupt Resource Management

This section discusses how a driver for a device that can generate many different interruptible conditions can utilize the Interrupt Resource Management feature to optimize its allocation of interrupt vectors.

The Interrupt Resource Management Feature

The Interrupt Resource Management feature can enable a device driver to use more interrupt resources by dynamically managing the driver's interrupt configuration. When the Interrupt Resource Management feature is not used, configuration of interrupt handling usually only occurs in a driver's attach(9E) routine. The Interrupt Resource Management feature monitors the system for changes, recalculates the number of interrupt vectors to grant to each device in response to those changes, and notifies each affected participating driver of the driver's new allocation of interrupt vectors. A participating driver is a driver that has registered a callback handler as described in Callback Interfaces. Changes that can cause interrupt vector reallocation include adding or removing devices, or an explicit request as described in Modify Number of Interrupt Vectors Requested.

The Interrupt Resource Management feature is not available on every Oracle Solaris platform. The feature is only available to PCIe devices that utilize MSI-X interrupts.

Note –

A driver that utilizes the Interrupt Resource Management feature must be able to adapt correctly when the feature is not available.

When the Interrupt Resource Management feature is available, it can enable a driver to gain access to more interrupt vectors than the driver might otherwise be allocated. A driver might process interrupt conditions more efficiently when utilizing a larger number of interrupt vectors.

The Interrupt Resource Management feature dynamically adjusts the number of interrupt vectors granted to each participating driver depending upon the following constraints:

Total number available. A finite number of interrupt vectors exists in the system.
Total number requested. A driver might be granted fewer, but never more than the number of interrupt vectors it requested.
Fairness to other drivers. The total number of interrupt vectors available is shared by many drivers in a manner that is fair in relation to the total number requested by each driver.

The number of interrupt vectors made available to a device at any given time can vary:

As other devices are dynamically added to or removed from the system
As drivers dynamically change the number of interrupt vectors they request in response to load

A driver must provide the following support to use the Interrupt Resource Management feature:

Callback support. Drivers must register a callback handler so they can be notified when their number of available interrupts has been changed by the system. Drivers must be able to increase or decrease their interrupt usage.
Interrupt requests. Drivers must specify how many interrupts they want to use.
Interrupt usage. Drivers must request the correct number of interrupts at any given time, based on:
- What interruptible conditions their hardware can generate
- How many processors can be used to process those conditions in parallel
Interrupt flexibility. Drivers must be flexible enough to assign one or more interruptible conditions to each interrupt vector in a manner that best fits their current number of available interrupts. Drivers might need to reconfigure these assignments when their number of available interrupts increases or decreases at arbitrary times.

Callback Interfaces

A driver must use the following interfaces to register its callback support.

Table 8–1 Callback Support Interfaces


Interface	Data Structures	Description
`ddi_cb_register()`	`ddi_cb_flags_t`, `ddi_cb_handle_t`	Register a callback handler function to receive specific types of actions.
`ddi_cb_unregister()`	`ddi_cb_handle_t`	Unregister a callback handler function.
`(*ddi_cb_func_t)()`	`ddi_cb_action_t`	Receive callback actions and specific arguments relevant to each action to be processed.

Register a Callback Handler Function

Use the ddi_cb_register(9F) function to register a callback handler function for a driver.

int
ddi_cb_register (dev_info_t *dip, ddi_cb_flags_t cbflags,
                 ddi_cb_func_t cbfunc, void *arg1, void *arg2,
                 ddi_cb_handle_t *ret_hdlp);

The driver can register only one callback function. This one callback function is used to handle all individual callback actions. The cbflags parameter determines which types of actions should be received by the driver when they occur. The cbfunc() routine is called whenever a relevant action should be processed by the driver. The driver specifies two private arguments (arg1 and arg2) to send to itself during each execution of its cbfunc() routine.

The cbflags() parameter is an enumerated type that specifies which actions the driver supports.

typedef enum {
        DDI_CB_FLAG_INTR
} ddi_cb_flags_t;

To register support for Interrupt Resource Management actions, a driver must register a handler and include the DDI_CB_FLAG_INTR flag. When the callback handler is successfully registered, an opaque handle is returned through the ret_hdlp parameter. When the driver is finished with the callback handler, the driver can use the ret_hdlp parameter to unregister the callback handler.

Register the callback handler in the driver's attach(9F) entry point. Save the opaque handle in the driver's soft state. Unregister the callback handler in the driver's detach(9F) entry point.

Unregister a Callback Handler Function

Use the ddi_cb_unregister(9F) function to unregister a callback handler function for a driver.

int
ddi_cb_unregister (ddi_cb_handle_t hdl);

Make this call in the driver's detach(9F) entry point. After this call, the driver no longer receives callback actions.

The driver also loses any additional support from the system that it gained from having a registered callback handling function. For example, some interrupt vectors previously made available to the driver are immediately taken back when the driver unregisters its callback handling function. Before returning successfully, the ddi_cb_unregister() function notifies the driver of any final actions that result from losing support from the system.

Callback Handler Function

Use the registered callback handling function to receive callback actions and receive arguments that are specific to each action to be processed.

typedef int (*ddi_cb_func_t)(dev_info_t *dip, ddi_cb_action_t cbaction,
                             void *cbarg, void *arg1, void *arg2);

The cbaction parameter specifies what action the driver is receiving a callback to process.

typedef enum {
        DDI_CB_INTR_ADD,
        DDI_CB_INTR_REMOVE
} ddi_cb_action_t;

A DDI_CB_INTR_ADD action means that the driver now has more interrupts available to use. A DDI_CB_INTR_REMOVE action means that the driver now has fewer interrupts available to use. Cast the cbarg parameter to an int to determine the number of interrupts added or removed. The cbarg value represents the change in the number of interrupts that are available.

For example, get the change in the number of interrupts available:

count = (int)(uintptr_t)cbarg;

If the cbaction is DDI_CB_INTR_ADD, add cbarg number of interrupt vectors. If the cbaction is DDI_CB_INTR_REMOVE, free cbarg number of interrupt vectors.

See ddi_cb_register(9F) for an explanation of arg1 and arg2.

The callback handling function must be able to perform correctly for the entire time that the function is registered. The callback function cannot depend upon any data structures that might be destroyed before the callback function is successfully unregistered.

The callback handling function must return one of the following values:

DDI_SUCCESS if it correctly handled the action
DDI_FAILURE if it encountered an internal error
DDI_ENOTSUP if it received an unrecognized action

Interrupt Request Interfaces

A driver must use the following interfaces to request interrupt vectors from the system.

Table 8–2 Interrupt Vector Request Interfaces


Interface	Data Structures	Description
`ddi_intr_alloc()`	`ddi_intr_handle_t`	Allocate interrupts.
`ddi_intr_set_nreq()`		Change number of interrupt vectors requested.

Allocate an Interrupt

Use the ddi_intr_alloc(9F) function to initially allocate interrupts.

int
ddi_intr_alloc (dev_info_t *dip, ddi_intr_handle_t *h_array, int type,
                int inum, int count, int *actualp, int behavior);

Before calling this function, the driver must allocate an empty handle array large enough to contain the number of interrupts requested. The ddi_intr_alloc() function attempts to allocate count number of interrupt handles, and initialize the array with the assigned interrupt vectors beginning at the offset specified by the inum parameter. The actualp parameter returns the actual number of interrupt vectors that were allocated.

A driver can use the ddi_intr_alloc() function in two ways:

The driver can call the ddi_intr_alloc() function multiple times to allocate interrupt vectors to individual members of the interrupt handle array in separate steps.
The driver can call the ddi_intr_alloc() function one time to allocate all of the interrupt vectors for the device at once.

If you are using the Interrupt Resource Management feature, call ddi_intr_alloc() one time to allocate all interrupt vectors at once. The count parameter is the total number of interrupt vectors requested by the driver. If the value in actualp is less than the value of count, then the system is not able to fulfill the request completely. The Interrupt Resource Management feature saves this request (count becomes nreq - see below) and might be able to allocate more interrupt vectors to this driver at a later time.

Note –

When you use the Interrupt Resource Management feature, additional calls to ddi_intr_alloc() do not change the total number of interrupt vectors requested. Use the ddi_intr_set_nreq(9F) function to change the number of interrupt vectors requested.

Modify Number of Interrupt Vectors Requested

Use the ddi_intr_set_nreq(9F) function to change the number of interrupt vectors requested.

int
ddi_intr_set_nreq (dev_info_t *dip, int nreq);

When the Interrupt Resource Management feature is available, a driver can use the ddi_intr_set_nreq() function to dynamically adjust the total number of interrupt vectors requested. The driver might do this in response to the actual load that exists once the driver is attached.

A driver must first call ddi_intr_alloc(9F) to request an initial number of interrupt vectors. Any time after the ddi_intr_alloc()call, the driver can call ddi_intr_set_nreq() to change its request size. The specified nreq value is the driver's new total number of requested interrupt vectors. The Interrupt Resource Management feature might rebalance the number of interrupts allocated to each driver in the system in response to this new request. Whenever the Interrupt Resource Management feature rebalances the number of interrupts allocated to drivers, each affected driver receives a callback notification that more or fewer interrupt vectors are available for the driver to use.

A driver might dynamically adjust its total number of requested interrupt vectors if, for example, it uses interrupts in conjunction with specific transactions that it is processing. A storage driver might associate a DMA engine with each ongoing transaction, thus requiring interrupt vectors for that reason. A driver might make calls to ddi_intr_set_nreq() in its open(9F) and close(9F) routines to scale its interrupt usage in response to actual use of the driver.

Interrupt Usage and Flexibility

A driver for a device that supports many different interruptible conditions must be able to map those conditions to an arbitrary number of interrupt vectors. The driver cannot assume that interrupt vectors that are allocated will remain available. Some currently available interrupts might later be taken back by the system to accommodate the needs of other drivers in the system.

A driver must be able to:

Determine how many interrupts its hardware supports.
Determine how many interrupts are appropriate to use. For example, the total number of processors in the system might affect this evaluation.
Compare the number of interrupts needed with the number of interrupts available at any given time.

In summary, the driver must be able to select a mixture of interrupt handling functions and program its hardware to generate interrupts according to need and interrupt availability. In some cases multiple interrupts might be targeted to the same vector, and the interrupt handler for that interrupt vector must determine which interrupts occurred. The performance of the device can be affected by how well the driver maps interrupts to interrupt vectors.

Example Implementation of Interrupt Resource Management

One type of device driver that is an excellent candidate for interrupt resource management is a network device driver. The network device hardware supports multiple transmit and receive channels.

The network device generates a unique interrupt condition whenever the device receives a packet on one of its receive channels or transmits a packet on one of its transmit channels. The hardware can send a specific MSI-X interrupt for each event that can occur. A table in the hardware determines which MSI-X interrupt to generate for each event.

To optimize performance, the driver requests enough interrupts from the system to give each separate interrupt its own interrupt vector. The driver makes this request when it first calls ddi_intr_alloc(9F) in its attach(9F) routine.

The driver then evaluates the actual number of interrupts it received from ddi_intr_alloc() in actualp. It might receive all the interrupts it requested, or it might receive fewer interrupts.

A separate function inside the driver uses the total number of available interrupts to calculate which MSI-X interrupts to generate for each event. This function programs the table in the hardware accordingly.

If the driver receives all of its requested interrupt vectors, each entry in the hardware table has its own unique MSI-X interrupt. A one-to-one mapping exists between interrupt conditions and interrupt vectors. The hardware generates a unique MSI-X interrupt for each type of event.
If the driver has fewer interrupt vectors available, some MSI-X interrupt numbers must appear multiple times in the hardware table. The hardware generates the same MSI-X interrupt for more than one type of event.

The driver should have two different interrupt handler functions.

One interrupt handler function performs a specific task in response to an interrupt. This simple function handles interrupts that are generated by only one of the possible hardware events.
A second interrupt handler function is more complicated. This function handles the case where multiple interrupts are mapped to the same MSI-X interrupt vector.

In the example driver in this section, the function xx_setup_interrupts() uses the number of available interrupt vectors to program the hardware and calls the appropriate interrupt handler for each of those interrupt vectors. The xx_setup_interrupts() function is called in two places: after ddi_intr_alloc() is called in xx_attach(), and after interrupt vector allocations are adjusted in the xx_cbfunc() callback handler function.

int
xx_setup_interrupts(xx_state_t *statep, int navail, xx_intrs_t *xx_intrs_p);

The xx_setup_interrupts() function is called with an array of xx_intrs_t data structures.

typedef struct {
        ddi_intr_handler_t      inthandler;
        void                    *arg1;
        void                    *arg2;
} xx_intrs_t;

This xx_setup_interrupts() functionality must exist in the driver independent of whether the Interrupt Resource Management feature is available. Drivers must be able to function with fewer interrupt vectors than the number requested during attach. If the Interrupt Resource Management feature is available, you can modify the driver to dynamically adjust to a new number of available interrupt vectors.

Other functionality that the driver must provide independent of whether the Interrupt Resource Management feature is available includes the ability to quiesce the hardware and resume the hardware. Quiesce and resume are needed for certain events related to power management and hotplugging. Quiesce and resume also are required to handle interrupt callback actions.

The quiesce function is called in xx_detach().

int
xx_quiesce(xx_state_t *statep);

The resume function is called in xx_attach().

int
xx_resume(xx_state_t *statep);

Make the following modifications to enhance this device driver to use the Interrupt Resource Management feature:

Register a callback handler. The driver must register for the actions that indicate when fewer or more interrupts are available.
Handle callbacks. The driver must quiesce its hardware, reprogram its interrupt handling, and resume its hardware in response to each such callback action.

/*
 * attach(9F) routine.
 *
 * Creates soft state, registers callback handler, initializes
 * hardware, and sets up interrupt handling for the driver.
 */
xx_attach(dev_info_t *dip, ddi_attach_cmd_t cmd)
{
        xx_state_t              *statep = NULL;
        xx_intr_t               *intrs = NULL;
        ddi_intr_handle_t       *hdls;
        ddi_cb_handle_t         cb_hdl;
        int                     instance;
        int                     type;
        int                     types;
        int                     nintrs;
        int                     nactual;
        int                     inum;

        /* Get device instance */
        instance = ddi_get_instance(dip);

        switch (cmd) {
        case DDI_ATTACH:

                /* Get soft state */
                if (ddi_soft_state_zalloc(state_list, instance) != 0)
                        return (DDI_FAILURE);
                statep = ddi_get_soft_state(state_list, instance);
                ddi_set_driver_private(dip, (caddr_t)statep);
                statep->dip = dip;

                /* Initialize hardware */
                xx_initialize(statep);

                /* Register callback handler */
                if (ddi_cb_register(dip, DDI_CB_FLAG_INTR, xx_cbfunc,
                    statep, NULL, &cb_hdl) != 0) {
                        ddi_soft_state_free(state_list, instance);
                        return (DDI_FAILURE);
                }
                statep->cb_hdl = cb_hdl;

                /* Select interrupt type */
                ddi_intr_get_supported_types(dip, &types);
                if (types & DDI_INTR_TYPE_MSIX) {
                        type = DDI_INTR_TYPE_MSIX;
                } else if (types & DDI_INTR_TYPE_MSI) {
                        type = DDI_INTR_TYPE_MSI;
                } else {
                        type = DDI_INTR_TYPE_FIXED;
                }
                statep->type = type;

                /* Get number of supported interrupts */
                ddi_intr_get_nintrs(dip, type, &nintrs);

                /* Allocate interrupt handle array */
                statep->hdls_size = nintrs * sizeof (ddi_intr_handle_t);
                statep->hdls = kmem_zalloc(statep->hdls_size, KMEM_SLEEP);

                /* Allocate interrupt setup array */
                statep->intrs_size = nintrs * sizeof (xx_intr_t);
                statep->intrs = kmem_zalloc(statep->intrs_size, KMEM_SLEEP);

                /* Allocate interrupt vectors */
                ddi_intr_alloc(dip, hdls, type, 0, nintrs, &nactual, 0);
                statep->nactual = nactual;

                /* Configure interrupt handling */
                xx_setup_interrupts(statep, statep->nactual, statep->intrs);

                /* Install and enable interrupt handlers */
                for (inum = 0; inum < nactual; inum++) {
                        ddi_intr_add_handler(&hdls[inum],
                            intrs[inum].inthandler,
                            intrs[inum].arg1, intrs[inum].arg2);
                        ddi_intr_enable(hdls[inum]);
                }

                break;

        case DDI_RESUME:

                /* Get soft state */
                statep = ddi_get_soft_state(state_list, instance);
                if (statep == NULL)
                        return (DDI_FAILURE);

                /* Resume hardware */
                xx_resume(statep);

                break;
        }

        return (DDI_SUCESS);
}

/*
 * detach(9F) routine.
 *
 * Stops the hardware, disables interrupt handling, unregisters
 * a callback handler, and destroys the soft state for the driver.
 */
xx_detach(dev_info_t *dip, ddi_detach_cmd_t cmd)
{
        xx_state_t      *statep = NULL;
        int             instance;
        int             inum;

        /* Get device instance */
        instance = ddi_get_instance(dip);

        switch (cmd) {
        case DDI_DETACH:

                /* Get soft state */
                statep = ddi_get_soft_state(state_list, instance);
                if (statep == NULL)
                        return (DDI_FAILURE);

                /* Stop device */
                xx_uninitialize(statep);

                /* Disable and free interrupts */
                for (inum = 0; inum < statep->nactual; inum++) {
                        ddi_intr_disable(statep->hdls[inum]);
                        ddi_intr_remove_handler(statep->hdls[inum]);
                        ddi_intr_free(statep->hdls[inum]);
                }

                /* Unregister callback handler */
                ddi_cb_unregister(statep->cb_hdl);

                /* Free interrupt handle array */
                kmem_free(statep->hdls, statep->hdls_size);

                /* Free interrupt setup array */
                kmem_free(statep->intrs, statep->intrs_size);

                /* Free soft state */
                ddi_soft_state_free(state_list, instance);

                break;

        case DDI_SUSPEND:

                /* Get soft state */
                statep = ddi_get_soft_state(state_list, instance);
                if (statep == NULL)
                        return (DDI_FAILURE);

                /* Suspend hardware */
                xx_quiesce(statep);

                break;
        }

        return (DDI_SUCCESS);
}

/*
 * (*ddi_cbfunc)() routine.
 *
 * Adapt interrupt usage when availability changes.
 */
int
xx_cbfunc(dev_info_t *dip, ddi_cb_action_t cbaction, void *cbarg,
    void *arg1, void *arg2)
{
        xx_state_t      *statep = (xx_state_t *)arg1;
        int             count;
        int             inum;
        int             nactual;

        switch (cbaction) {
        case DDI_CB_INTR_ADD:
        case DDI_CB_INTR_REMOVE:

                /* Get change in availability */
                count = (int)(uintptr_t)cbarg;

                /* Suspend hardware */
                xx_quiesce(statep);

                /* Tear down previous interrupt handling */
                for (inum = 0; inum < statep->nactual; inum++) {
                        ddi_intr_disable(statep->hdls[inum]);
                        ddi_intr_remove_handler(statep->hdls[inum]);
                }

                /* Adjust interrupt vector allocations */
                if (cbaction == DDI_CB_INTR_ADD) {

                        /* Allocate additional interrupt vectors */
                        ddi_intr_alloc(dip, statep->hdls, statep->type,
                            statep->nactual, count, &nactual, 0);

                        /* Update actual count of available interrupts */
                        statep->nactual += nactual;

                } else {

                        /* Free removed interrupt vectors */
                        for (inum = statep->nactual - count;
                            inum < statep->nactual; inum++) {
                                ddi_intr_free(statep->hdls[inum]);
                        }

                        /* Update actual count of available interrupts */
                        statep->nactual -= count;
                }

                /* Configure interrupt handling */
                xx_setup_interrupts(statep, statep->nactual, statep->intrs);

                /* Install and enable interrupt handlers */
                for (inum = 0; inum < statep->nactual; inum++) {
                        ddi_intr_add_handler(&statep->hdls[inum],
                            statep->intrs[inum].inthandler,
                            statep->intrs[inum].arg1,
                            statep->intrs[inum].arg2);
                        ddi_intr_enable(statep->hdls[inum]);
                }

                /* Resume hardware */
                xx_resume(statep);

                break;

        default:
                return (DDI_ENOTSUP);
        }

        return (DDI_SUCCESS);
}

Interrupt Handler Functionality

The driver framework and the device each place demands on the interrupt handler. All interrupt handlers are required to do the following tasks:

Determine whether the device is interrupting and possibly reject the interrupt.

The interrupt handler first examines the device to determine whether this device issued the interrupt. If this device did not issue the interrupt, the handler must return DDI_INTR_UNCLAIMED. This step enables the implementation of device polling. Any device at the given interrupt priority level might have issued the interrupt. Device polling tells the system whether this device issued the interrupt.
Inform the device that the device is being serviced.

Informing a device about servicing is a device-specific operation that is required for the majority of devices. For example, SBus devices are required to interrupt until the driver tells the SBus devices to stop. This approach guarantees that all SBus devices that interrupt at the same priority level are serviced.
Perform any I/O request-related processing.

Devices interrupt for different reasons, such as transfer done or transfer error. This step can involve using data access functions to read the device's data buffer, examine the device's error register, and set the status field in a data structure accordingly. Interrupt dispatching and processing are relatively time consuming.
Do any additional processing that could prevent another interrupt.

For example, read the next item of data from the device.
Return DDI_INTR_CLAIMED.
MSI interrupts must always be claimed.

Claiming an interrupt is optional for MSI-X interrupts. In either case, the ownership of the interrupt need not be checked, because MSI and MSI-X interrupts are not shared with other devices.
Drivers that support hotplugging and multiple MSI or MSI-X interrupts should retain a separate interrupt for hotplug events and register a separate ISR (interrupt service routine) for that interrupt.

The following example shows an interrupt routine for a device called mydev.

Example 8–9 Interrupt Example

static uint_t
mydev_intr(caddr_t arg1, caddr_t arg2)
{
    struct mydevstate *xsp = (struct mydevstate *)arg1;
    uint8_t     status; 
    volatile    uint8_t  temp;

    /*
     * Claim or reject the interrupt.This example assumes
     * that the device's CSR includes this information.
     */
    mutex_enter(&xsp->high_mu);

    /* use data access routines to read status */
    status = ddi_get8(xsp->data_access_handle, &xsp->regp->csr);
    if (!(status & INTERRUPTING)) {
        mutex_exit(&xsp->high_mu);
        return (DDI_INTR_UNCLAIMED); /* dev not interrupting */
    }
    /*
     * Inform the device that it is being serviced, and re-enable
     * interrupts. The example assumes that writing to the
     * CSR accomplishes this. The driver must ensure that this data
     * access operation makes it to the device before the interrupt
     * service routine returns. For example, using the data access
     * functions to read the CSR, if it does not result in unwanted
     * effects, can ensure this.
     */
    ddi_put8(xsp->data_access_handle, &xsp->regp->csr,
        CLEAR_INTERRUPT | ENABLE_INTERRUPTS);

    /* flush store buffers */
    temp = ddi_get8(xsp->data_access_handle, &xsp->regp->csr);
    
    mutex_exit(&xsp->mu);
    return (DDI_INTR_CLAIMED);
}

Most of the steps performed by the interrupt routine depend on the specifics of the device itself. Consult the hardware manual for the device to determine the cause of the interrupt, detect error conditions, and access the device data registers.

Handling High-Level Interrupts

High-level interrupts are those interrupts that interrupt at the level of the scheduler and above. This level does not allow the scheduler to run. Therefore, high-level interrupt handlers cannot be preempted by the scheduler. High-level interrupts cannot block because of the scheduler. High-level interrupts can only use mutual exclusion locks for locking.

The driver must determine whether the device is using high-level interrupts. Do this test in the driver's attach(9E) entry point when you register interrupts. See High-Level Interrupt Handling Example.

If the interrupt priority returned from ddi_intr_get_pri(9F) is greater than or equal to the priority returned from ddi_intr_get_hilevel_pri(9F), the driver can fail to attach, or the driver can implement a high-level interrupt handler. The high-level interrupt handler uses a lower-priority software interrupt to handle the device. To allow more concurrency, use a separate mutex to protect data from the high-level handler.
If the interrupt priority returned from ddi_intr_get_pri(9F) is less than the priority returned from ddi_intr_get_hilevel_pri(9F), the attach(9E) entry point falls through to regular interrupt registration. In this case, a soft interrupt is not necessary.

High-Level Mutexes

A mutex initialized with an interrupt priority that represents a high-level interrupt is known as a high-level mutex. While holding a high-level mutex, the driver is subject to the same restrictions as a high-level interrupt handler.

High-Level Interrupt Handling Example

In the following example, the high-level mutex (xsp->high_mu) is used only to protect data shared between the high-level interrupt handler and the soft interrupt handler. The protected data includes a queue used by both the high-level interrupt handler and the low-level handler, and a flag that indicates that the low-level handler is running. A separate low-level mutex (xsp->low_mu) protects the rest of the driver from the soft interrupt handler.

Example 8–10 Handling High-Level Interrupts With `attach()`

static int
mydevattach(dev_info_t *dip, ddi_attach_cmd_t cmd)
{
    struct mydevstate *xsp;
    /* ... */

    ret = ddi_intr_get_supported_types(dip, &type);
    if ((ret != DDI_SUCCESS) || (!(type & DDI_INTR_TYPE_FIXED))) {
        cmn_err(CE_WARN, "ddi_intr_get_supported_types() failed");
        return (DDI_FAILURE);
    }

    ret = ddi_intr_get_nintrs(dip, DDI_INTR_TYPE_FIXED, &count);

    /*
     * Fixed interrupts can only have one interrupt. Check to make
     * sure that number of supported interrupts and number of
     * available interrupts are both equal to 1.
     */
    if ((ret != DDI_SUCCESS) || (count != 1)) {
    cmn_err(CE_WARN, "No fixed interrupts found");
            return (DDI_FAILURE);
    }

    xsp->xs_htable = kmem_zalloc(count * sizeof (ddi_intr_handle_t),
        KM_SLEEP);

    ret = ddi_intr_alloc(dip, xsp->xs_htable, DDI_INTR_TYPE_FIXED, 0,
        count, &actual, 0);

    if ((ret != DDI_SUCCESS) || (actual != 1)) {
    cmn_err(CE_WARN, "ddi_intr_alloc failed 0x%x", ret");
        kmem_free(xsp->xs_htable, sizeof (ddi_intr_handle_t));
        return (DDI_FAILURE);
    }

    ret = ddi_intr_get_pri(xsp->xs_htable[0], &intr_pri);
    if (ret != DDI_SUCCESS) {
        cmn_err(CE_WARN, "ddi_intr_get_pri failed 0x%x", ret");
        (void) ddi_intr_free(xsp->xs_htable[0]);
        kmem_free(xsp->xs_htable, sizeof (ddi_intr_handle_t));
        return (DDI_FAILURE);
    }

    if (intr_pri >= ddi_intr_get_hilevel_pri()) {

        mutex_init(&xsp->high_mu, NULL, MUTEX_DRIVER,
            DDI_INTR_PRI(intr_pri));

        ret = ddi_intr_add_handler(xsp->xs_htable[0],
            mydevhigh_intr, (caddr_t)xsp, NULL);

        if (ret != DDI_SUCCESS) {
            cmn_err(CE_WARN, "ddi_intr_add_handler failed 0x%x", ret");
            mutex_destroy(&xsp>xs_int_mutex);
                (void) ddi_intr_free(xsp->xs_htable[0]);
                kmem_free(xsp->xs_htable, sizeof (ddi_intr_handle_t));
            return (DDI_FAILURE);
        }

        /* add soft interrupt */
        if (ddi_intr_add_softint(xsp->xs_dip, &xsp->xs_softint_hdl,
            DDI_INTR_SOFTPRI_MAX, xs_soft_intr, (caddr_t)xsp) !=
            DDI_SUCCESS) {
            cmn_err(CE_WARN, "add soft interrupt failed");
            mutex_destroy(&xsp->high_mu);
            (void) ddi_intr_remove_handler(xsp->xs_htable[0]);
            (void) ddi_intr_free(xsp->xs_htable[0]);
            kmem_free(xsp->xs_htable, sizeof (ddi_intr_handle_t));
            return (DDI_FAILURE);
        }

        xsp->low_soft_pri = DDI_INTR_SOFTPRI_MAX;

        mutex_init(&xsp->low_mu, NULL, MUTEX_DRIVER,
            DDI_INTR_PRI(xsp->low_soft_pri));

    } else {
    /*
     * regular interrupt registration continues from here
     * do not use a soft interrupt
     */
    }

    return (DDI_SUCCESS);
}

The high-level interrupt routine services the device and queues the data. The high-level routine triggers a software interrupt if the low-level routine is not running, as the following example demonstrates.

Example 8–11 High-level Interrupt Routine

static uint_t
mydevhigh_intr(caddr_t arg1, caddr_t arg2)
{
    struct mydevstate    *xsp = (struct mydevstate *)arg1;
    uint8_t    status;
    volatile  uint8_t  temp;
    int    need_softint;

    mutex_enter(&xsp->high_mu);
    /* read status */
    status = ddi_get8(xsp->data_access_handle, &xsp->regp->csr);
    if (!(status & INTERRUPTING)) {
        mutex_exit(&xsp->high_mu);
        return (DDI_INTR_UNCLAIMED); /* dev not interrupting */
    }

    ddi_put8(xsp->data_access_handle,&xsp->regp->csr,
        CLEAR_INTERRUPT | ENABLE_INTERRUPTS);
    /* flush store buffers */
    temp = ddi_get8(xsp->data_access_handle, &xsp->regp->csr);

    /* read data from device, queue data for low-level interrupt handler */
    if (xsp->softint_running)
        need_softint = 0;
    else {
        xsp->softint_count++;
        need_softint = 1;
    }
    mutex_exit(&xsp->high_mu);

    /* read-only access to xsp->id, no mutex needed */
    if (need_softint) {
        ret = ddi_intr_trigger_softint(xsp->xs_softint_hdl, NULL);
        if (ret == DDI_EPENDING) {
            cmn_err(CE_WARN, "ddi_intr_trigger_softint() soft interrupt "
                "already pending for this handler");
        } else if (ret != DDI_SUCCESS) {
            cmn_err(CE_WARN, "ddi_intr_trigger_softint() failed");
        }           
    }

    return (DDI_INTR_CLAIMED);
}

The low-level interrupt routine is started by the high-level interrupt routine, which triggers a software interrupt. The low-level interrupt routine runs until there is nothing left to process, as the following example shows.

Example 8–12 Low-Level Soft Interrupt Routine

static uint_t
mydev_soft_intr(caddr_t arg1, caddr_t arg2)
{
    struct mydevstate *mydevp = (struct mydevstate *)arg1;
    /* ... */
    mutex_enter(&mydevp->low_mu);
    mutex_enter(&mydevp->high_mu);
    if (mydevp->softint_count > 1) {
        mydevp->softint_count--;
        mutex_exit(&mydevp->high_mu);
        mutex_exit(&mydevp->low_mu);
        return (DDI_INTR_CLAIMED);
    }

    if ( /* queue empty */ ) {
        mutex_exit(&mydevp->high_mu);
        mutex_exit(&mydevp->low_mu);
        return (DDI_INTR_UNCLAIMED);
    }

    mydevp->softint_running = 1;
    while (EMBEDDED COMMENT:data on queue) {
        ASSERT(mutex_owned(&mydevp->high_mu);
        /* Dequeue data from high-level queue. */
        mutex_exit(&mydevp->high_mu);
        /* normal interrupt processing */
        mutex_enter(&mydevp->high_mu);
    }

    mydevp->softint_running = 0;
    mydevp->softint_count = 0;
    mutex_exit(&mydevp->high_mu);
    mutex_exit(&mydevp->low_mu);
    return (DDI_INTR_CLAIMED);
}

Chapter 9 Direct Memory Access (DMA)

Many devices can temporarily take control of the bus. These devices can perform data transfers that involve main memory and other devices. Because the device is doing the work without the help of the CPU, this type of data transfer is known as direct memory access (DMA). The following types of DMA transfers can be performed:

Between two devices
Between a device and memory
Between memory and memory

This chapter explains transfers between a device and memory only. The chapter provides information on the following subjects:

DMA Model

The Oracle Solaris Device Driver Interface/Driver-Kernel Interface (DDI/DKI) provides a high-level, architecture-independent model for DMA. This model enables the framework, that is, the DMA routines, to hide architecture-specific details such as the following:

Setting up DMA mappings
Building scatter-gather lists
Ensuring that I/O and CPU caches are consistent

Several abstractions are used in the DDI/DKI to describe aspects of a DMA transaction:

DMA object – Memory that is the source or destination of a DMA transfer.
DMA handle – An opaque object returned from a successful ddi_dma_alloc_handle(9F) call. The DMA handle can be used in subsequent DMA subroutine calls to refer to such DMA objects.
DMA cookie – A ddi_dma_cookie(9S) structure (ddi_dma_cookie_t) describes a contiguous portion of a DMA object that is entirely addressable by the device. The cookie contains DMA addressing information that is required to program the DMA engine.

Rather than map an object directly into memory, device drivers allocate DMA resources for a memory object. The DMA routines then perform any platform-specific operations that are needed to set up the object for DMA access. The driver receives a DMA handle to identify the DMA resources that are allocated for the object. This handle is opaque to the device driver. The driver must save the handle and pass the handle in subsequent calls to DMA routines. The driver should not interpret the handle in any way.

Operations that provide the following services are defined on a DMA handle:

Manipulating DMA resources
Synchronizing DMA objects
Retrieving attributes of the allocated resources

Types of Device DMA

Devices perform one of the following three types of DMA:

Bus-master DMA
Third-party DMA
First-party DMA

Bus-Master DMA

The driver should program the device's DMA registers directly in cases where the device acts like a true bus master. For example, a device acts like a bus master when the DMA engine resides on the device board. The transfer address and count are obtained from the DMA cookie to be passed on to the device.

Third-Party DMA

Third-party DMA uses a system DMA engine resident on the main system board, which has several DMA channels that are available for use by devices. The device relies on the system's DMA engine to perform the data transfers between the device and memory. The driver uses DMA engine routines (see the ddi_dmae(9F) function) to initialize and program the DMA engine. For each DMA data transfer, the driver programs the DMA engine and then gives the device a command to initiate the transfer in cooperation with that engine.

First-Party DMA

Under first-party DMA, the device uses a channel from the system's DMA engine to drive that device's DMA bus cycles. Use the ddi_dmae_1stparty(9F) function to configure this channel in a cascade mode so that the DMA engine does not interfere with the transfer.

Types of Host Platform DMA

The platform on which the device operates provides either direct memory access (DMA) or direct virtual memory access (DVMA).

On platforms that support DMA, the system provides the device with a physical address in order to perform transfers. In this case, the transfer of a DMA object can actually consist of a number of physically discontiguous transfers. An example is when an application transfers a buffer that spans several contiguous virtual pages that map to physically discontiguous pages. To deal with the discontiguous memory, devices for these platforms usually have some kind of scatter-gather DMA capability. Typically, x86 systems provide physical addresses for direct memory transfers.

On platforms that support DVMA, the system provides the device with a virtual address to perform transfers. In this case, memory management units (MMU) provided by the underlying platform translate device accesses to these virtual addresses into the proper physical addresses. The device transfers to and from a contiguous virtual image that can be mapped to discontiguous physical pages. Devices that operate in these platforms do not need scatter-gather DMA capability. Typically, SPARC platforms provide virtual addresses for direct memory transfers.

DMA Software Components: Handles, Windows, and Cookies

A DMA handle is an opaque pointer that represents an object, usually a memory buffer or address. A DMA handle enables a device to perform DMA transfers. Several different calls to DMA routines use the handle to identify the DMA resources that are allocated for the object.

An object represented by a DMA handle is completely covered by one or more DMA cookies. A DMA cookie represents a contiguous piece of memory that is used in data transfers by the DMA engine. The system divides objects into multiple cookies based on the following information:

The ddi_dma_attr(9S) attribute structure provided by the driver
Memory location of the target object
Alignment of the target object

If an object does not fit within the limitations of the DMA engine, that object must be broken into multiple DMA windows. You can only activate and allocate resources for one window at a time. Use the ddi_dma_getwin(9F) function to position between windows within an object. Each DMA window consists of one or more DMA cookies. For more information, see DMA Windows.

Some DMA engines can accept more than one cookie. Such engines perform scatter-gather I/O without the help of the system. If multiple cookies are returned from a bind, the driver should call ddi_dma_nextcookie(9F) repeatedly to retrieve each cookie. These cookies must then be programmed into the engine. The device can then be programmed to transfer the total number of bytes covered by the aggregate of these DMA cookies.

DMA Operations

The steps in a DMA transfer are similar among the types of DMA. The following sections present methods for performing DMA transfers.

Note –

You do not need to ensure that the DMA object is locked in memory in block drivers for buffers that come from the file system. The file system has already locked the data in memory.

Performing Bus-Master DMA Transfers

The driver should perform the following steps for bus-master DMA.

Describe the DMA attributes. This step enables the routines to ensure that the device is able to access the buffer.
Allocate a DMA handle.
Ensure that the DMA object is locked in memory. See the physio(9F) or ddi_umem_lock(9F) man page.
Allocate DMA resources for the object.
Program the DMA engine on the device.
Start the engine.
When the transfer is complete, continue the bus master operation.
Perform any required object synchronizations.
Release the DMA resources.
Free the DMA handle.

Performing First-Party DMA Transfers

The driver should perform the following steps for first-party DMA.

Allocate a DMA channel.
Use ddi_dmae_1stparty(9F) to configure the channel.
Ensure that the DMA object is locked in memory. See the physio(9F) or ddi_umem_lock(9F) man page.
Allocate DMA resources for the object.
Program the DMA engine on the device.
Start the engine.
When the transfer is complete, continue the bus-master operation.
Perform any required object synchronizations.
Release the DMA resources.
Deallocate the DMA channel.

Performing Third-Party DMA Transfers

The driver should perform these steps for third-party DMA.

Allocate a DMA channel.
Retrieve the system's DMA engine attributes with ddi_dmae_getattr(9F).
Lock the DMA object in memory. See the physio(9F) or ddi_umem_lock(9F) man page.
Allocate DMA resources for the object.
Use ddi_dmae_prog(9F) to program the system DMA engine to perform the transfer.
Perform any required object synchronizations.
Use ddi_dmae_stop(9F) to stop the DMA engine.
Release the DMA resources.
Deallocate the DMA channel.

Certain hardware platforms restrict DMA capabilities in a bus-specific way. Drivers should use ddi_slaveonly(9F) to determine whether the device is in a slot in which DMA is possible.

DMA Attributes

DMA attributes describe the attributes and limits of a DMA engine, which include:

Limits on addresses that the device can access
Maximum transfer count
Address alignment restrictions

A device driver must inform the system about any DMA engine limitations through the ddi_dma_attr(9S) structure. This action ensures that DMA resources that are allocated by the system can be accessed by the device's DMA engine. The system can impose additional restrictions on the device attributes, but the system never removes any of the driver-supplied restrictions.

`ddi_dma_attr` Structure

The DMA attribute structure has the following members:

typedef struct ddi_dma_attr {
    uint_t      dma_attr_version;       /* version number */
    uint64_t    dma_attr_addr_lo;       /* low DMA address range */
    uint64_t    dma_attr_addr_hi;       /* high DMA address range */
    uint64_t    dma_attr_count_max;     /* DMA counter register */
    uint64_t    dma_attr_align;         /* DMA address alignment */
    uint_t      dma_attr_burstsizes;    /* DMA burstsizes */
    uint32_t    dma_attr_minxfer;       /* min effective DMA size */
    uint64_t    dma_attr_maxxfer;       /* max DMA xfer size */
    uint64_t    dma_attr_seg;           /* segment boundary */
    int         dma_attr_sgllen;        /* s/g length */
    uint32_t    dma_attr_granular;      /* granularity of device */
    uint_t      dma_attr_flags;         /* Bus specific DMA flags */
} ddi_dma_attr_t;

where:

dma_attr_version: Version number of the attribute structure. dma_attr_version should be set to DMA_ATTR_V0.
dma_attr_addr_lo: Lowest bus address that the DMA engine can access.
dma_attr_addr_hi: Highest bus address that the DMA engine can access.
dma_attr_count_max: Specifies the maximum transfer count that the DMA engine can handle in one cookie. The limit is expressed as the maximum count minus one. This count is used as a bit mask, so the count must also be one less than a power of two.
dma_attr_align: Specifies alignment requirements when allocating memory from ddi_dma_mem_alloc(9F). An example of an alignment requirement is alignment on a page boundary. The dma_attr_align field is used only when allocating memory. This field is ignored during bind operations. For bind operations, the driver must ensure that the buffer is aligned appropriately.
dma_attr_burstsizes: Specifies the burst sizes that the device supports. A burst size is the amount of data the device can transfer before relinquishing the bus. This member is a binary encoding of burst sizes, which are assumed to be powers of two. For example, if the device is capable of doing 1-byte, 2-byte, 4-byte, and 16-byte bursts, this field should be set to 0x17. The system also uses this field to determine alignment restrictions.
dma_attr_minxfer: Minimum effective transfer size that the device can perform. This size also influences restrictions on alignment and on padding.
dma_attr_maxxfer: Describes the maximum number of bytes that the DMA engine can accommodate in one I/O command. This limitation is only significant if dma_attr_maxxfer is less than (dma_attr_count_max + 1) * dma_attr_sgllen.
dma_attr_seg: Upper bound of the DMA engine's address register. dma_attr_seg is often used where the upper 8 bits of an address register are a latch that contains a segment number. The lower 24 bits are used to address a segment. In this case, dma_attr_seg would be set to 0xFFFFFF, which prevents the system from crossing a 24-bit segment boundary when allocating resources for the object.
dma_attr_sgllen: Specifies the maximum number of entries in the scatter-gather list. dma_attr_sgllen is the number of cookies that the DMA engine can consume in one I/O request to the device. If the DMA engine has no scatter-gather list, this field should be set to 1.
dma_attr_granular: This field gives the granularity in bytes of the DMA transfer ability of the device. An example of how this value is used is to specify the sector size of a mass storage device. When a bind operation requires a partial mapping, this field is used to ensure that the sum of the sizes of the cookies in a DMA window is a whole multiple of granularity. However, if the device does not have a scatter-gather capability, it is impossible for the DDI to ensure the granularity. For this case, the value of the dma_attr_granular field should be 1.
dma_attr_flags: This field can be set to DDI_DMA_FORCE_PHYSICAL, which indicates that the system should return physical rather than virtual I/O addresses if the system supports both. If the system does not support physical DMA, the return value from ddi_dma_alloc_handle(9F) is DDI_DMA_BADATTR. In this case, the driver has to clear DDI_DMA_FORCE_PHYSICAL and retry the operation.

SBus Example

A DMA engine on an SBus in a SPARC machine has the following attributes:

Access to addresses ranging from 0xFF000000 to 0xFFFFFFFF only
32-bit DMA counter register
Ability to handle byte-aligned transfers
Support for 1-byte, 2-byte, and 4-byte burst sizes
Minimum effective transfer size of 1 byte
32-bit address register
No scatter-gather list
Operation on sectors only, for example, a disk

A DMA engine on an SBus in a SPARC machine has the following attribute structure:

static ddi_dma_attr_t attributes = {
    DMA_ATTR_V0,   /* Version number */
    0xFF000000,    /* low address */
    0xFFFFFFFF,    /* high address */
    0xFFFFFFFF,    /* counter register max */
    1,             /* byte alignment */
    0x7,           /* burst sizes: 0x1 | 0x2 | 0x4 */
    0x1,           /* minimum transfer size */
    0xFFFFFFFF,    /* max transfer size */
    0xFFFFFFFF,    /* address register max */
    1,             /* no scatter-gather */
    512,           /* device operates on sectors */
    0,             /* attr flag: set to 0 */
};

ISA Bus Example

A DMA engine on an ISA bus in an x86 machine has the following attributes:

Access to the first 16 megabytes of memory only
Inability to cross a 1-megabyte boundary in a single DMA transfer
16-bit counter register
Ability to handle byte-aligned transfers
Support for 1-byte, 2-byte, and 4-byte burst sizes
Minimum effective transfer size of 1 byte
Ability to hold up to 17 scatter-gather transfers
Operation on sectors only, for example, a disk

A DMA engine on an ISA bus in an x86 machine has the following attribute structure:

static ddi_dma_attr_t attributes = {
    DMA_ATTR_V0,   /* Version number */
    0x00000000,    /* low address */
    0x00FFFFFF,    /* high address */
    0xFFFF,        /* counter register max */
    1,             /* byte alignment */
    0x7,           /* burst sizes */
    0x1,           /* minimum transfer size */
    0xFFFFFFFF,    /* max transfer size */
    0x000FFFFF,    /* address register max */
    17,            /* scatter-gather */
    512,           /* device operates on sectors */
    0,             /* attr flag: set to 0 */
};

Managing DMA Resources

This section describes how to manage DMA resources.

Object Locking

Before allocating the DMA resources for a memory object, the object must be prevented from moving. Otherwise, the system can remove the object from memory while the device is trying to write to that object. A missing object would cause the data transfer to fail and possibly corrupt the system. The process of preventing memory objects from moving during a DMA transfer is known as locking down the object.

The following object types do not require explicit locking:

Buffers coming from the file system through strategy(9E). These buffers are already locked by the file system.
Kernel memory allocated within the device driver, such as that allocated by ddi_dma_mem_alloc(9F).

For other objects such as buffers from user space, physio(9F) or ddi_umem_lock(9F) must be used to lock down the objects. Locking down objects with these functions is usually performed in the read(9E) or write(9E) routines of a character device driver. See Data Transfer Methods for an example.

Allocating a DMA Handle

A DMA handle is an opaque object that is used as a reference to subsequently allocated DMA resources. The DMA handle is usually allocated in the driver's attach() entry point that uses ddi_dma_alloc_handle(9F). The ddi_dma_alloc_handle() function takes the device information that is referred to by dip and the device's DMA attributes described by a ddi_dma_attr(9S) structure as parameters. The ddi_dma_alloc_handle() function has the following syntax:

int ddi_dma_alloc_handle(dev_info_t *dip,
    ddi_dma_attr_t *attr, int (*callback)(caddr_t),
    caddr_t arg, ddi_dma_handle_t *handlep);

where:

dip: Pointer to the device's dev_info structure.
attr: Pointer to a ddi_dma_attr(9S) structure, as described in DMA Attributes.
callback: Address of the callback function for handling resource allocation failures.
arg: Argument to be passed to the callback function.
handlep: Pointer to a DMA handle to store the returned handle.

Allocating DMA Resources

Two interfaces allocate DMA resources:

ddi_dma_buf_bind_handle(9F) – Used with buf(9S) structures

ddi_dma_addr_bind_handle(9F) – Used with virtual addresses

DMA resources are usually allocated in the driver's xxstart() routine, if an xxstart() routine exists. See Asynchronous Data Transfers (Block Drivers) for a discussion of xxstart(). These two interfaces have the following syntax:

int ddi_dma_addr_bind_handle(ddi_dma_handle_t handle,
    struct as *as, caddr_t addr,
    size_t len, uint_t flags, int (*callback)(caddr_t),
    caddr_t arg, ddi_dma_cookie_t *cookiep, uint_t *ccountp);

int ddi_dma_buf_bind_handle(ddi_dma_handle_t handle,
    struct buf *bp, uint_t flags,
    int (*callback)(caddr_t), caddr_t arg,
    ddi_dma_cookie_t *cookiep, uint_t *ccountp);

The following arguments are common to both ddi_dma_addr_bind_handle(9F) and ddi_dma_buf_bind_handle(9F):

handle: DMA handle and the object for allocating resources.
flags: Set of flags that indicate the transfer direction and other attributes. DDI_DMA_READ indicates a data transfer from device to memory. DDI_DMA_WRITE indicates a data transfer from memory to device. See the ddi_dma_addr_bind_handle(9F) or ddi_dma_buf_bind_handle(9F) man page for a complete discussion of the available flags.
callback: Address of callback function for handling resource allocation failures. See the ddi_dma_alloc_handle(9F) man page.
arg: Argument to pass to the callback function.
cookiep: Pointer to the first DMA cookie for this object.
ccountp: Pointer to the number of DMA cookies for this object.

For ddi_dma_addr_bind_handle(9F), the object is described by an address range with the following parameters:

as: Pointer to an address space structure. The value of as must be NULL.
addr: Base kernel address of the object.
len: Length of the object in bytes.

For ddi_dma_buf_bind_handle(9F), the object is described by a buf(9S) structure pointed to by bp.

Device Register Structure

DMA-capable devices require more registers than were used in the previous examples.

The following fields are used in the device register structure to support DMA-capable device with no scatter-gather support:

uint32_t      dma_addr;      /* starting address for DMA */
uint32_t      dma_size;      /* amount of data to transfer */

The following fields are used in the device register structure to support DMA-capable devices with scatter-gather support:

struct sglentry {
    uint32_t    dma_addr;
    uint32_t    dma_size;
} sglist[SGLLEN];

caddr_t       iopb_addr;     /* When written, informs the device of the next */
                             /* command's parameter block address. */
                             /* When read after an interrupt, contains */
                             /* the address of the completed command. */

DMA Callback Example

In Example 9–1, xxstart() is used as the callback function. The per-device state structure is used as the argument to xxstart(). The xxstart() function attempts to start the command. If the command cannot be started because resources are not available, xxstart() is scheduled to be called later when resources are available.

Because xxstart() is used as a DMA callback, xxstart() must adhere to the following rules, which are imposed on DMA callbacks:

Resources cannot be assumed to be available. The callback must try to allocate resources again.
The callback must indicate to the system whether allocation succeeded. DDI_DMA_CALLBACK_RUNOUT should be returned if the callback fails to allocate resources, in which case xxstart() needs to be called again later. DDI_DMA_CALLBACK_DONE indicates success, so that no further callback is necessary.

Example 9–1 DMA Callback Example

static int
xxstart(caddr_t arg)
{
    struct xxstate *xsp = (struct xxstate *)arg;
    struct device_reg *regp;
    int flags;
    mutex_enter(&xsp->mu);
    if (xsp->busy) {
        /* transfer in progress */
        mutex_exit(&xsp->mu);
        return (DDI_DMA_CALLBACK_RUNOUT);
    }
    xsp->busy = 1;
    regp = xsp->regp;
    if ( /* transfer is a read */ ) {
        flags = DDI_DMA_READ;
    } else {
        flags = DDI_DMA_WRITE;
    }
    mutex_exit(&xsp->mu);
    if (ddi_dma_buf_bind_handle(xsp->handle,xsp->bp,flags, xxstart,
        (caddr_t)xsp, &cookie, &ccount) != DDI_DMA_MAPPED) {
        /* really should check all return values in a switch */
        mutex_enter(&xsp->mu);
        xsp->busy=0;
        mutex_exit(&xsp->mu);
        return (DDI_DMA_CALLBACK_RUNOUT);
    }
    /* Program the DMA engine. */
    return (DDI_DMA_CALLBACK_DONE);
}

Determining Maximum Burst Sizes

Drivers specify the DMA burst sizes that their device supports in the dma_attr_burstsizesfield of the ddi_dma_attr(9S) structure. This field is a bitmap of the supported burst sizes. However, when DMA resources are allocated, the system might impose further restrictions on the burst sizes that might be actually used by the device. The ddi_dma_burstsizes(9F) routine can be used to obtain the allowed burst sizes. This routine returns the appropriate burst size bitmap for the device. When DMA resources are allocated, a driver can ask the system for appropriate burst sizes to use for its DMA engine.

Example 9–2 Determining Burst Size

#define BEST_BURST_SIZE 0x20 /* 32 bytes */

    if (ddi_dma_buf_bind_handle(xsp->handle,xsp->bp, flags, xxstart,
        (caddr_t)xsp, &cookie, &ccount) != DDI_DMA_MAPPED) {
        /* error handling */
    }
    burst = ddi_dma_burstsizes(xsp->handle);
    /* check which bit is set and choose one burstsize to */
    /* program the DMA engine */
    if (burst & BEST_BURST_SIZE) {
        /* program DMA engine to use this burst size */
    } else {
        /* other cases */
    }

Allocating Private DMA Buffers

Some device drivers might need to allocate memory for DMA transfers in addition to performing transfers requested by user threads and the kernel. Some examples of allocating private DMA buffers are setting up shared memory for communication with the device and allocating intermediate transfer buffers. Use ddi_dma_mem_alloc(9F) to allocate memory for DMA transfers.

int ddi_dma_mem_alloc(ddi_dma_handle_t handle, size_t length,
    ddi_device_acc_attr_t *accattrp, uint_t flags,
    int (*waitfp)(caddr_t), caddr_t arg, caddr_t *kaddrp,
    size_t *real_length, ddi_acc_handle_t *handlep);

where:

handle: DMA handle
length: Length in bytes of the desired allocation
accattrp: Pointer to a device access attribute structure
flags: Data transfer mode flags. Possible values are DDI_DMA_CONSISTENT and DDI_DMA_STREAMING.
waitfp: Address of callback function for handling resource allocation failures. See the ddi_dma_alloc_handle(9F) man page.
arg: Argument to pass to the callback function
kaddrp: Pointer on a successful return that contains the address of the allocated storage
real_length: Length in bytes that was allocated
handlep: Pointer to a data access handle

The flags parameter should be set to DDI_DMA_CONSISTENT if the device accesses in a nonsequential fashion. Synchronization steps that use ddi_dma_sync(9F) should be as lightweight as possible due to frequent application to small objects. This type of access is commonly known as consistent access. Consistent access is particularly useful for I/O parameter blocks that are used for communication between a device and the driver.

On the x86 platform, allocation of DMA memory that is physically contiguous has these requirements:

The length of the scatter-gather list dma_attr_sgllen in the ddi_dma_attr(9S) structure must be set to 1.
Do not specify DDI_DMA_PARTIAL. DDI_DMA_PARTIAL allows partial resource allocation.

The following example shows how to allocate IOPB memory and the necessary DMA resources to access this memory. DMA resources must still be allocated, and the DDI_DMA_CONSISTENT flag must be passed to the allocation function.

Example 9–3 Using `ddi_dma_mem_alloc(9F)`

if (ddi_dma_mem_alloc(xsp->iopb_handle, size, &accattr,
    DDI_DMA_CONSISTENT, DDI_DMA_SLEEP, NULL, &xsp->iopb_array,
    &real_length, &xsp->acchandle) != DDI_SUCCESS) {
    /* error handling */
    goto failure;
}
if (ddi_dma_addr_bind_handle(xsp->iopb_handle, NULL,
    xsp->iopb_array, real_length,
    DDI_DMA_READ | DDI_DMA_CONSISTENT, DDI_DMA_SLEEP,
    NULL, &cookie, &count) != DDI_DMA_MAPPED) {
    /* error handling */
    ddi_dma_mem_free(&xsp->acchandle);
    goto failure;
}

The flags parameter should be set to DDI_DMA_STREAMING for memory transfers that are sequential, unidirectional, block-sized, and block-aligned. This type of access is commonly known as streaming access.

In some cases, an I/O transfer can be sped up by using an I/O cache. I/O cache transfers one cache line at a minimum. The ddi_dma_mem_alloc(9F) routine rounds size to a multiple of the cache line to avoid data corruption.

The ddi_dma_mem_alloc(9F) function returns the actual size of the allocated memory object. Because of padding and alignment requirements, the actual size might be larger than the requested size. The ddi_dma_addr_bind_handle(9F) function requires the actual length.

Use the ddi_dma_mem_free(9F) function to free the memory allocated by ddi_dma_mem_alloc(9F).

Note –

Drivers must ensure that buffers are aligned appropriately. Drivers for devices that have alignment requirements on down bound DMA buffers might need to copy the data into a driver intermediate buffer that meets the requirements, and then bind that intermediate buffer to the DMA handle for DMA. Use ddi_dma_mem_alloc(9F) to allocate the driver intermediate buffer. Always use ddi_dma_mem_alloc(9F) instead of kmem_alloc(9F) to allocate memory for the device to access.

Handling Resource Allocation Failures

The resource-allocation routines provide the driver with several options when handling allocation failures. The waitfp argument indicates whether the allocation routines block, return immediately, or schedule a callback, as shown in the following table.

Table 9–1 Resource Allocation Handling


`waitfp` value	Indicated Action
`DDI_DMA_DONTWAIT`	Driver does not want to wait for resources to become available
`DDI_DMA_SLEEP`	Driver is willing to wait indefinitely for resources to become available
Other values	The address of a function to be called when resources are likely to be available

Programming the DMA Engine

When the resources have been successfully allocated, the device must be programmed. Although programming a DMA engine is device specific, all DMA engines require a starting address and a transfer count. Device drivers retrieve these two values from the DMA cookie returned by a successful call from ddi_dma_addr_bind_handle(9F), ddi_dma_buf_bind_handle(9F), or ddi_dma_getwin(9F). These functions all return the first DMA cookie and a cookie count indicating whether the DMA object consists of more than one cookie. If the cookie count N is greater than 1, ddi_dma_nextcookie(9F) must be called N-1 times to retrieve all the remaining cookies.

A DMA cookie is of type ddi_dma_cookie(9S). This type of cookie has the following fields:

uint64_t    _dmac_ll;       /* 64-bit DMA address */
uint32_t    _dmac_la[2];    /* 2 x 32-bit address */
size_t      dmac_size;      /* DMA cookie size */
uint_t      dmac_type;      /* bus specific type bits */

The dmac_laddress specifies a 64-bit I/O address that is appropriate for programming the device's DMA engine. If a device has a 64-bit DMA address register, a driver should use this field to program the DMA engine. The dmac_address field specifies a 32-bit I/O address that should be used for devices that have a 32-bit DMA address register. The dmac_size field contains the transfer count. Depending on the bus architecture, the dmac_type field in the cookie might be required by the driver. The driver should not perform any manipulations, such as logical or arithmetic, on the cookie.

Example 9–4 `ddi_dma_cookie(9S)` Example

ddi_dma_cookie_t            cookie;

     if (ddi_dma_buf_bind_handle(xsp->handle,xsp->bp, flags, xxstart,
     (caddr_t)xsp, &cookie, &xsp->ccount) != DDI_DMA_MAPPED) {
         /* error handling */
      }
     sglp = regp->sglist;
     for (cnt = 1; cnt <= SGLLEN; cnt++, sglp++) {
     /* store the cookie parms into the S/G list */
     ddi_put32(xsp->access_hdl, &sglp->dma_size,
         (uint32_t)cookie.dmac_size);
     ddi_put32(xsp->access_hdl, &sglp->dma_addr,
         cookie.dmac_address);
     /* Check for end of cookie list */
     if (cnt == xsp->ccount)
         break;
     /* Get next DMA cookie */
     (void) ddi_dma_nextcookie(xsp->handle, &cookie);
     }
     /* start DMA transfer */
     ddi_put8(xsp->access_hdl, &regp->csr,
     ENABLE_INTERRUPTS | START_TRANSFER);

Freeing the DMA Resources

After a DMA transfer is completed, usually in the interrupt routine, the driver can release DMA resources by calling ddi_dma_unbind_handle(9F).

As described in Synchronizing Memory Objects, ddi_dma_unbind_handle(9F) calls ddi_dma_sync(9F), eliminating the need for any explicit synchronization. After calling ddi_dma_unbind_handle(9F), the DMA resources become invalid, and further references to the resources have undefined results. The following example shows how to use ddi_dma_unbind_handle(9F).

Example 9–5 Freeing DMA Resources

static uint_t
xxintr(caddr_t arg)
{
     struct xxstate *xsp = (struct xxstate *)arg;
     uint8_t    status;
     volatile   uint8_t   temp;
     mutex_enter(&xsp->mu);
     /* read status */
     status = ddi_get8(xsp->access_hdl, &xsp->regp->csr);
     if (!(status & INTERRUPTING)) {
        mutex_exit(&xsp->mu);
        return (DDI_INTR_UNCLAIMED);
     }
     ddi_put8(xsp->access_hdl, &xsp->regp->csr, CLEAR_INTERRUPT);
     /* for store buffers */
     temp = ddi_get8(xsp->access_hdl, &xsp->regp->csr);
     ddi_dma_unbind_handle(xsp->handle);
     /* Check for errors. */
     xsp->busy = 0;
     mutex_exit(&xsp->mu);
     if ( /* pending transfers */ ) {
        (void) xxstart((caddr_t)xsp);
     }
     return (DDI_INTR_CLAIMED);
}

The DMA resources should be released. The DMA resources should be reallocated if a different object is to be used in the next transfer. However, if the same object is always used, the resources can be allocated once. The resources can then be reused as long as intervening calls to ddi_dma_sync(9F) remain.

Freeing the DMA Handle

When the driver is detached, the DMA handle must be freed. The ddi_dma_free_handle(9F) function destroys the DMA handle and destroys any residual resources that the system is caching on the handle. Any further references of the DMA handle will have undefined results.

Canceling DMA Callbacks

DMA callbacks cannot be canceled. Canceling a DMA callback requires some additional code in the driver's detach(9E) entry point. The detach() routine must not return DDI_SUCCESS if any outstanding callbacks exist. See Example 9–6. When DMA callbacks occur, the detach() routine must wait for the callback to run. When the callback has finished, detach() must prevent the callback from rescheduling itself. Callbacks can be prevented from rescheduling through additional fields in the state structure, as shown in the following example.

Example 9–6 Canceling DMA Callbacks

static int
xxdetach(dev_info_t *dip, ddi_detach_cmd_t cmd)
{
     /* ... */
     mutex_enter(&xsp->callback_mutex);
     xsp->cancel_callbacks = 1;
     while (xsp->callback_count > 0) {
        cv_wait(&xsp->callback_cv, &xsp->callback_mutex);
     }
     mutex_exit(&xsp->callback_mutex);
     /* ... */
 }

static int
xxstrategy(struct buf *bp)
{
     /* ... */
     mutex_enter(&xsp->callback_mutex);
       xsp->bp = bp;
     error = ddi_dma_buf_bind_handle(xsp->handle, xsp->bp, flags,
         xxdmacallback, (caddr_t)xsp, &cookie, &ccount);
     if (error == DDI_DMA_NORESOURCES)
       xsp->callback_count++;
     mutex_exit(&xsp->callback_mutex);
     /* ... */
}

static int
xxdmacallback(caddr_t callbackarg)
{
     struct xxstate *xsp = (struct xxstate *)callbackarg;
     /* ... */
     mutex_enter(&xsp->callback_mutex);
     if (xsp->cancel_callbacks) {
        /* do not reschedule, in process of detaching */
        xsp->callback_count--;
        if (xsp->callback_count == 0)
           cv_signal(&xsp->callback_cv);
        mutex_exit(&xsp->callback_mutex);
        return (DDI_DMA_CALLBACK_DONE);    /* don't reschedule it */
     }
     /*
      * Presumably at this point the device is still active
      * and will not be detached until the DMA has completed.
      * A return of 0 means try again later
      */
     error = ddi_dma_buf_bind_handle(xsp->handle, xsp->bp, flags,
         DDI_DMA_DONTWAIT, NULL, &cookie, &ccount);
     if (error == DDI_DMA_MAPPED) {
        /* Program the DMA engine. */
        xsp->callback_count--;
        mutex_exit(&xsp->callback_mutex);
        return (DDI_DMA_CALLBACK_DONE);
     }
     if (error != DDI_DMA_NORESOURCES) {
        xsp->callback_count--;
        mutex_exit(&xsp->callback_mutex);
        return (DDI_DMA_CALLBACK_DONE);
     }
     mutex_exit(&xsp->callback_mutex);
     return (DDI_DMA_CALLBACK_RUNOUT);
}

Synchronizing Memory Objects

In the process of accessing the memory object, the driver might need to synchronize the memory object with respect to various caches. This section provides guidelines on when and how to synchronize memory objects.

Cache

CPU cache is a very high-speed memory that sits between the CPU and the system's main memory. I/O cache sits between the device and the system's main memory, as shown in the following figure.

Figure 9–1 CPU and System I/O Caches

Diagram shows how the cache is used to speed data transfers
involving devices.

When an attempt is made to read data from main memory, the associated cache checks for the requested data. If the data is available, the cache supplies the data quickly. If the cache does not have the data, the cache retrieves the data from main memory. The cache then passes the data on to the requester and saves the data in case of a subsequent request.

Similarly, on a write cycle, the data is stored in the cache quickly. The CPU or device is allowed to continue executing, that is, transferring data. Storing data in a cache takes much less time than waiting for the data to be written to memory.

With this model, after a device transfer is complete, the data can still be in the I/O cache with no data in main memory. If the CPU accesses the memory, the CPU might read the wrong data from the CPU cache. The driver must call a synchronization routine to flush the data from the I/O cache and update the CPU cache with the new data. This action ensures a consistent view of the memory for the CPU. Similarly, a synchronization step is required if data modified by the CPU is to be accessed by a device.

You can create additional caches and buffers between the device and memory, such as bus extenders and bridges. Use ddi_dma_sync(9F) to synchronize all applicable caches.

`ddi_dma_sync()` Function

A memory object might have multiple mappings, such as for the CPU and for a device, by means of a DMA handle. A driver with multiple mappings needs to call ddi_dma_sync(9F) if any mappings are used to modify the memory object. Calling ddi_dma_sync() ensures that the modification of the memory object is complete before the object is accessed through a different mapping. The ddi_dma_sync() function can also inform other mappings of the object if any cached references to the object are now stale. Additionally, ddi_dma_sync() flushes or invalidates stale cache references as necessary.

Generally, the driver must call ddi_dma_sync() when a DMA transfer completes. The exception to this rule is if deallocating the DMA resources with ddi_dma_unbind_handle(9F) does an implicit ddi_dma_sync() on behalf of the driver. The syntax for ddi_dma_sync() is as follows:

int ddi_dma_sync(ddi_dma_handle_t handle, off_t off,
size_t length, uint_t type);

If the object is going to be read by the DMA engine of the device, the device's view of the object must be synchronized by setting type to DDI_DMA_SYNC_FORDEV. If the DMA engine of the device has written to the memory object and the object is going to be read by the CPU, the CPU's view of the object must be synchronized by setting type to DDI_DMA_SYNC_FORCPU.

The following example demonstrates synchronizing a DMA object for the CPU:

if (ddi_dma_sync(xsp->handle, 0, length, DDI_DMA_SYNC_FORCPU)
    == DDI_SUCCESS) {
    /* the CPU can now access the transferred data */
    /* ... */
} else {
    /* error handling */
}

Use the flag DDI_DMA_SYNC_FORKERNEL if the only mapping is for the kernel, as in the case of memory that is allocated by ddi_dma_mem_alloc(9F). The system tries to synchronize the kernel's view more quickly than the CPU's view. If the system cannot synchronize the kernel view faster, the system acts as if the DDI_DMA_SYNC_FORCPU flag were set.

DMA Windows

If an object does not fit within the limitations of the DMA engine, the transfer must be broken into a series of smaller transfers. The driver can break up the transfer itself. Alternatively, the driver can allow the system to allocate resources for only part of the object, thereby creating a series of DMA windows. Allowing the system to allocate resources is the preferred solution, because the system can manage the resources more effectively than the driver can manage the resources.

A DMA window has two attributes. The offset attribute is measured from the beginning of the object. The length attribute is the number of bytes of memory to be allocated. After a partial allocation, only a range of length bytes that starts at offset has allocated resources.

A DMA window is requested by specifying the DDI_DMA_PARTIAL flag as a parameter to ddi_dma_buf_bind_handle(9F) or ddi_dma_addr_bind_handle(9F). Both functions return DDI_DMA_PARTIAL_MAP if a window can be established. However, the system might allocate resources for the entire object, in which case DDI_DMA_MAPPED is returned. The driver should check the return value to determine whether DMA windows are in use. See the following example.

Example 9–7 Setting Up DMA Windows

static int
xxstart (caddr_t arg)
{
    struct xxstate *xsp = (struct xxstate *)arg;
    struct device_reg *regp = xsp->reg;
    ddi_dma_cookie_t cookie;
    int status;
    mutex_enter(&xsp->mu);
    if (xsp->busy) {
        /* transfer in progress */
        mutex_exit(&xsp->mu);
        return (DDI_DMA_CALLBACK_RUNOUT);
    }
    xsp->busy = 1;
    mutex_exit(&xsp->mu);
    if ( /* transfer is a read */) {
        flags = DDI_DMA_READ;
    } else {
        flags = DDI_DMA_WRITE;
    }
    flags |= DDI_DMA_PARTIAL;
    status = ddi_dma_buf_bind_handle(xsp->handle, xsp->bp,
        flags, xxstart, (caddr_t)xsp, &cookie, &ccount);
    if (status != DDI_DMA_MAPPED &&
        status != DDI_DMA_PARTIAL_MAP)
        return (DDI_DMA_CALLBACK_RUNOUT);
    if (status == DDI_DMA_PARTIAL_MAP) {
        ddi_dma_numwin(xsp->handle, &xsp->nwin);
        xsp->partial = 1;
        xsp->windex = 0;
    } else {
        xsp->partial = 0;
    }
    /* Program the DMA engine. */
    return (DDI_DMA_CALLBACK_DONE);
}

Two functions operate with DMA windows. The first, ddi_dma_numwin(9F), returns the number of DMA windows for a particular DMA object. The other function, ddi_dma_getwin(9F), allows repositioning within the object, that is, reallocation of system resources. The ddi_dma_getwin() function shifts the current window to a new window within the object. Because ddi_dma_getwin() reallocates system resources to the new window, the previous window becomes invalid.

Caution –

Do not move the DMA windows with a call to ddi_dma_getwin() before transfers into the current window are complete. Wait until the transfer to the current window is complete, which is when the interrupt arrives. Then call ddi_dma_getwin() to avoid data corruption.

The ddi_dma_getwin() function is normally called from an interrupt routine, as shown in Example 9–8. The first DMA transfer is initiated as a result of a call to the driver. Subsequent transfers are started from the interrupt routine.

The interrupt routine examines the status of the device to determine whether the device completes the transfer successfully. If not, normal error recovery occurs. If the transfer is successful, the routine must determine whether the logical transfer is complete. A complete transfer includes the entire object as specified by the buf(9S) structure. In a partial transfer, only one DMA window is moved. In a partial transfer, the interrupt routine moves the window with ddi_dma_getwin(9F), retrieves a new cookie, and starts another DMA transfer.

If the logical request has been completed, the interrupt routine checks for pending requests. If necessary, the interrupt routine starts a transfer. Otherwise, the routine returns without invoking another DMA transfer. The following example illustrates the usual flow control.

Example 9–8 Interrupt Handler Using DMA Windows

static uint_t
xxintr(caddr_t arg)
{
    struct xxstate *xsp = (struct xxstate *)arg;
    uint8_t    status;
    volatile   uint8_t   temp;
    mutex_enter(&xsp->mu);
    /* read status */
    status = ddi_get8(xsp->access_hdl, &xsp->regp->csr);
    if (!(status & INTERRUPTING)) {
        mutex_exit(&xsp->mu);
        return (DDI_INTR_UNCLAIMED);
    }
    ddi_put8(xsp->access_hdl,&xsp->regp->csr, CLEAR_INTERRUPT);
    /* for store buffers */
    temp = ddi_get8(xsp->access_hdl, &xsp->regp->csr);
    if ( /* an error occurred during transfer */ ) {
        bioerror(xsp->bp, EIO);
        xsp->partial = 0;
    } else {
        xsp->bp->b_resid -= /* amount transferred */ ;
    }

    if (xsp->partial && (++xsp->windex < xsp->nwin)) {
        /* device still marked busy to protect state */
        mutex_exit(&xsp->mu);
        (void) ddi_dma_getwin(xsp->handle, xsp->windex,
            &offset, &len, &cookie, &ccount);
        /* Program the DMA engine with the new cookie(s). */
        return (DDI_INTR_CLAIMED);
    }
    ddi_dma_unbind_handle(xsp->handle);
    biodone(xsp->bp);
    xsp->busy = 0;
    xsp->partial = 0;
    mutex_exit(&xsp->mu);
    if ( /* pending transfers */ ) {
        (void) xxstart((caddr_t)xsp);
    }
    return (DDI_INTR_CLAIMED);
}

Chapter 10 Mapping Device and Kernel Memory

Some device drivers allow applications to access device or kernel memory through mmap(2). Frame buffer drivers, for example, enable the frame buffer to be mapped into a user thread. Another example would be a pseudo driver that uses a shared kernel memory pool to communicate with an application. This chapter provides information on the following subjects:

Memory Mapping Overview

The steps that a driver must take to export device or kernel memory are as follows:

Set the D_DEVMAP flag in the cb_flag flag of the cb_ops(9S) structure.
Define a devmap(9E) driver entry point and optional segmap(9E) entry point to export the mapping.
Use devmap_devmem_setup(9F) to set up user mappings to a device. To set up user mappings to kernel memory, use devmap_umem_setup(9F).

Exporting the Mapping

This section describes how to use the segmap(9E) and devmap(9E) entry points.

The `segmap`(9E) Entry Point

The segmap(9E) entry point is responsible for setting up a memory mapping requested by an mmap(2) system call. Drivers for many memory-mapped devices use ddi_devmap_segmap(9F) as the entry point rather than defining their own segmap(9E) routine. By providing a segmap() entry point, a driver can take care of general tasks before or after creating the mapping. For example, the driver can check mapping permissions and allocate private mapping resources. The driver can also make adjustments to the mapping to accommodate non-page-aligned device buffers. The segmap() entry point must call the ddi_devmap_segmap(9F) function before returning. The ddi_devmap_segmap() function calls the driver's devmap(9E) entry point to perform the actual mapping.

The segmap() function has the following syntax:

int segmap(dev_t dev, off_t off, struct as *asp, caddr_t *addrp,
     off_t len, unsigned int prot, unsigned int maxprot,
     unsigned int flags, cred_t *credp);

where:

dev

Device whose memory is to be mapped.

off

Offset within device memory at which mapping begins.

asp

Pointer to the address space into which the device memory should be mapped.

Note that this argument can be either a struct as *, as shown in Example 10–1, or a ddi_as_handle_t, as shown in Example 10–2. This is because ddidevmap.h includes the following declaration:

typedef struct as *ddi_as_handle_t

addrp

Pointer to the address in the address space to which the device memory should be mapped.

len

Length (in bytes) of the memory being mapped.

prot

A bit field that specifies the protections. Possible settings are PROT_READ, PROT_WRITE, PROT_EXEC, PROT_USER, and PROT_ALL. See the man page for details.

maxprot

Maximum protection flag possible for attempted mapping. The PROT_WRITE bit can be masked out if the user opened the special file read-only.

flags

Flags that indicate the type of mapping. Possible values include MAP_SHARED and MAP_PRIVATE.

credp

Pointer to the user credentials structure.

In the following example, the driver controls a frame buffer that allows write-only mappings. The driver returns EINVAL if the application tries to gain read access and then calls ddi_devmap_segmap(9F) to set up the user mapping.

Example 10–1 `segmap(9E)` Routine

static int
xxsegmap(dev_t dev, off_t off, struct as *asp, caddr_t *addrp,
    off_t len, unsigned int prot, unsigned int maxprot,
    unsigned int flags, cred_t *credp)
{
    if (prot & PROT_READ)
        return (EINVAL);
    return (ddi_devmap_segmap(dev, off, as, addrp,
        len, prot, maxprot, flags, cred));
}

The following example shows how to handle a device that has a buffer that is not page-aligned in its register space. This example maps a buffer that starts at offset 0x800, so that mmap(2) returns an address that corresponds to the start of the buffer. The devmap_devmem_setup(9F) function maps entire pages, requires the mapping to be page aligned, and returns an address to the start of a page. If this address is passed through segmap(9E), or if no segmap() entry point is defined, mmap() returns the address that corresponds to the start of the page, not the address that corresponds to the start of the buffer. In this example, the buffer offset is added to the page-aligned address that was returned by devmap_devmem_setup so that the resulting address returned is the desired start of the buffer.

Example 10–2 Using the `segmap()` Function to Change the Address Returned by the `mmap()` Call

#define	BUFFER_OFFSET 0x800

int
xx_segmap(dev_t dev, off_t off, ddi_as_handle_t as, caddr_t *addrp, off_t len,
    uint_t prot, uint_t maxprot, uint_t flags, cred_t *credp)
{
        int rval;
        unsigned long pagemask = ptob(1L) - 1L;

        if ((rval = ddi_devmap_segmap(dev, off, as, addrp, len, prot, maxprot,
            flags, credp)) == DDI_SUCCESS) {
                /*
                 * The address returned by ddi_devmap_segmap is the start of the page
                 * that contains the buffer.  Add the offset of the buffer to get the
                 * final address.
                 */
                *addrp += BUFFER_OFFSET & pagemask);
        }
        return (rval);
}

The `devmap`(9E) Entry Point

The devmap(9E) entry point is called from the ddi_devmap_segmap(9F) function inside the segmap(9E) entry point.

The devmap(9E) entry point is called as a result of the mmap(2) system call. The devmap(9E) function is called to export device memory or kernel memory to user applications. The devmap() function is used for the following operations:

Validate the user mapping to the device or kernel memory
Translate the logical offset within the application mapping to the corresponding offset within the device or kernel memory
Pass the mapping information to the system for setting up the mapping

The devmap() function has the following syntax:

int devmap(dev_t dev, devmap_cookie_t handle, offset_t off,
     size_t len, size_t *maplen, uint_t model);

where:

dev: Device whose memory is to be mapped.
handle: Device-mapping handle that the system creates and uses to describe a mapping to contiguous memory in the device or kernel.
off: Logical offset within the application mapping that has to be translated by the driver to the corresponding offset within the device or kernel memory.
len: Length (in bytes) of the memory being mapped.
maplen: Enables driver to associate different kernel memory regions or multiple physically discontiguous memory regions with one contiguous user application mapping.
model: Data model type of the current thread.

The system creates multiple mapping handles in one mmap(2) system call. For example, the mapping might contain multiple physically discontiguous memory regions.

Initially, devmap(9E) is called with the parameters off and len. These parameters are passed by the application to mmap(2). devmap(9E) sets *maplen to the length from off to the end of a contiguous memory region. The *maplen value must be rounded up to a multiple of a page size. The *maplen value can be set to less than the original mapping length len. If so, the system uses a new mapping handle with adjusted off and len parameters to call devmap(9E) repeatedly until the initial mapping length is satisfied.

If a driver supports multiple application data models, model must be passed to ddi_model_convert_from(9F). The ddi_model_convert_from() function determines whether a data model mismatch exists between the current thread and the device driver. The device driver might have to adjust the shape of data structures before exporting the structures to a user thread that supports a different data model. See Appendix C, Making a Device Driver 64-Bit Ready page for more details.

The devmap(9E) entry point must return -1 if the logical offset, off, is out of the range of memory exported by the driver.

Associating Device Memory With User Mappings

Call devmap_devmem_setup(9F) from the driver's devmap(9E) entry point to export device memory to user applications.

The devmap_devmem_setup(9F) function has the following syntax:

int devmap_devmem_setup(devmap_cookie_t handle, dev_info_t *dip,
    struct devmap_callback_ctl *callbackops, uint_t rnumber, 
    offset_t roff, size_t len, uint_t maxprot, uint_t flags, 
    ddi_device_acc_attr_t *accattrp);

where:

handle: Opaque device-mapping handle that the system uses to identify the mapping.
dip: Pointer to the device's dev_info structure.
callbackops: Pointer to a devmap_callback_ctl(9S) structure that enables the driver to be notified of user events on the mapping.
rnumber: Index number to the register address space set.
roff: Offset into the device memory.
len: Length in bytes that is exported.
maxprot: Allows the driver to specify different protections for different regions within the exported device memory.
flags: Must be set to DEVMAP_DEFAULTS.
accattrp: Pointer to a ddi_device_acc_attr(9S) structure.

The roff and len arguments describe a range within the device memory specified by the register set rnumber. The register specifications that are referred to by rnumber are described by the reg property. For devices with only one register set, pass zero for rnumber. The range is defined by roff and len. The range is made accessible to the user's application mapping at the offset that is passed in by the devmap(9E) entry point. Usually the driver passes the devmap(9E) offset directly to devmap_devmem_setup(9F). The return address of mmap(2) then maps to the beginning address of the register set.

The maxprot argument enables the driver to specify different protections for different regions within the exported device memory. For example, to disallow write access for a region, set only PROT_READ and PROT_USER for that region.

The following example shows how to export device memory to an application. The driver first determines whether the requested mapping falls within the device memory region. The size of the device memory is determined using ddi_dev_regsize(9F). The length of the mapping is rounded up to a multiple of a page size using ptob(9F) and btopr(9F). Then devmap_devmem_setup(9F) is called to export the device memory to the application.

Example 10–3 Using the `devmap_devmem_setup()` Routine

static int
xxdevmap(dev_t dev, devmap_cookie_t handle, offset_t off, size_t len,
    size_t *maplen, uint_t model)
{
    struct xxstate *xsp;
    int    error, rnumber;
    off_t regsize;
    
    /* Set up data access attribute structure */
    struct ddi_device_acc_attr xx_acc_attr = {
        DDI_DEVICE_ATTR_V0,
        DDI_NEVERSWAP_ACC,
        DDI_STRICTORDER_ACC
    };
    xsp = ddi_get_soft_state(statep, getminor(dev));
    if (xsp == NULL)
        return (-1);
    /* use register set 0 */
    rnumber = 0;
    /* get size of register set */
    if (ddi_dev_regsize(xsp->dip, rnumber, &regsize) != DDI_SUCCESS)
        return (-1);
    /* round up len to a multiple of a page size */
       len = ptob(btopr(len));
    if (off + len > regsize)
        return (-1);
    /* Set up the device mapping */
    error = devmap_devmem_setup(handle, xsp->dip, NULL, rnumber, 
    off, len, PROT_ALL, DEVMAP_DEFAULTS, &xx_acc_attr);
    /* acknowledge the entire range */
    *maplen = len;
    return (error);
}

Associating Kernel Memory With User Mappings

Some device drivers might need to allocate kernel memory that is made accessible to user programs through mmap(2). One example is setting up shared memory for communication between two applications. Another example is sharing memory between a driver and an application.

When exporting kernel memory to user applications, follow these steps:

Use ddi_umem_alloc(9F) to allocate kernel memory.
Use devmap_umem_setup(9F) to export the memory.
Use ddi_umem_free(9F) to free the memory when the memory is no longer needed.

Allocating Kernel Memory for User Access

Use ddi_umem_alloc(9F) to allocate kernel memory that is exported to applications. ddi_umem_alloc() uses the following syntax:

void *ddi_umem_alloc(size_t size, int flag, ddi_umem_cookie_t 
*cookiep);

where:

size: Number of bytes to allocate.
flag: Used to determine the sleep conditions and the memory type.
cookiep: Pointer to a kernel memory cookie.

ddi_umem_alloc(9F) allocates page-aligned kernel memory. ddi_umem_alloc() returns a pointer to the allocated memory. Initially, the memory is filled with zeroes. The number of bytes that are allocated is a multiple of the system page size, which is rounded up from the size parameter. The allocated memory can be used in the kernel. This memory can be exported to applications as well. cookiep is a pointer to the kernel memory cookie that describes the kernel memory being allocated. cookiep is used in devmap_umem_setup(9F) when the driver exports the kernel memory to a user application.

The flag argument indicates whether ddi_umem_alloc(9F) blocks or returns immediately, and whether the allocated kernel memory is pageable. The values for the flag argument as follows:

DDI_UMEM_NOSLEEP: Driver does not need to wait for memory to become available. Return NULL if memory is not available.
DDI_UMEM_SLEEP: Driver can wait indefinitely for memory to become available.
DDI_UMEM_PAGEABLE: Driver allows memory to be paged out. If not set, the memory is locked down.

The ddi_umem_lock() function can perform device-locked-memory checks. The function checks against the limit value that is specified in project.max-locked-memory. If the current project locked-memory usage is below the limit, the project's locked-memory byte count is increased. After the limit check, the memory is locked. The ddi_umem_unlock() function unlocks the memory, and the project's locked-memory byte count is decremented.

The accounting method that is used is an imprecise full price model. For example, two callers of umem_lockmemory() within the same project with overlapping memory regions are charged twice.

For information about the project.max-locked-memory and zone.max-locked_memory resource controls on Oracle Solaris systems with zones installed, see Resource Management and Oracle Solaris Zones Developer’s Guide and see resource_controls(5).

The following example shows how to allocate kernel memory for application access. The driver exports one page of kernel memory, which is used by multiple applications as a shared memory area. The memory is allocated in segmap(9E) when an application maps the shared page the first time. An additional page is allocated if the driver has to support multiple application data models. For example, a 64-bit driver might export memory both to 64-bit applications and to 32-bit applications. 64-bit applications share the first page, and 32-bit applications share the second page.

Example 10–4 Using the `ddi_umem_alloc()` Routine

static int
xxsegmap(dev_t dev, off_t off, struct as *asp, caddr_t *addrp, off_t len,
    unsigned int prot, unsigned int maxprot, unsigned int flags, 
    cred_t *credp)
{
    int error;
    minor_t instance = getminor(dev);
    struct xxstate *xsp = ddi_get_soft_state(statep, instance);

    size_t mem_size;
        /* 64-bit driver supports 64-bit and 32-bit applications */
    switch (ddi_mmap_get_model()) {
        case DDI_MODEL_LP64:
             mem_size = ptob(2);
             break;
        case DDI_MODEL_ILP32:
             mem_size = ptob(1);
             break;
    }

    mutex_enter(&xsp->mu);
    if (xsp->umem == NULL) {
        /* allocate the shared area as kernel pageable memory */
        xsp->umem = ddi_umem_alloc(mem_size,
            DDI_UMEM_SLEEP | DDI_UMEM_PAGEABLE, &xsp->ucookie);
    }
    mutex_exit(&xsp->mu);
    /* Set up the user mapping */
    error = devmap_setup(dev, (offset_t)off, asp, addrp, len,
        prot, maxprot, flags, credp);
    return (error);
}

Exporting Kernel Memory to Applications

Use devmap_umem_setup(9F) to export kernel memory to user applications. devmap_umem_setup() must be called from the driver's devmap(9E) entry point. The syntax for devmap_umem_setup() is as follows:

int devmap_umem_setup(devmap_cookie_t handle, dev_info_t *dip,
    struct devmap_callback_ctl *callbackops, ddi_umem_cookie_t cookie,
    offset_t koff, size_t len, uint_t maxprot, uint_t flags,
    ddi_device_acc_attr_t *accattrp);

where:

handle: Opaque structure used to describe the mapping.
dip: Pointer to the device's dev_info structure.
callbackops: Pointer to a devmap_callback_ctl(9S) structure.
cookie: Kernel memory cookie returned by ddi_umem_alloc(9F).
koff: Offset into the kernel memory specified by cookie.
len: Length in bytes that is exported.
maxprot: Specifies the maximum protection possible for the exported mapping.
flags: Must be set to DEVMAP_DEFAULTS.
accattrp: Pointer to a ddi_device_acc_attr(9S) structure.

handle is a device-mapping handle that the system uses to identify the mapping. handle is passed in by the devmap(9E) entry point. dip is a pointer to the device's dev_info structure. callbackops enables the driver to be notified of user events on the mapping. Most drivers set callbackops to NULL when kernel memory is exported.

koff and len specify a range within the kernel memory allocated by ddi_umem_alloc(9F). This range is made accessible to the user's application mapping at the offset that is passed in by the devmap(9E) entry point. Usually, the driver passes the devmap(9E) offset directly to devmap_umem_setup(9F). The return address of mmap(2) then maps to the kernel address returned by ddi_umem_alloc(9F). koff and len must be page-aligned.

maxprot enables the driver to specify different protections for different regions within the exported kernel memory. For example, one region might not allow write access by only setting PROT_READ and PROT_USER.

The following example shows how to export kernel memory to an application. The driver first checks whether the requested mapping falls within the allocated kernel memory region. If a 64-bit driver receives a mapping request from a 32-bit application, the request is redirected to the second page of the kernel memory area. This redirection ensures that only applications compiled to the same data model share the same page.

Example 10–5 devmap_umem_setup(9F) Routine

static int
xxdevmap(dev_t dev, devmap_cookie_t handle, offset_t off, size_t len,
    size_t *maplen, uint_t model)
{
    struct xxstate *xsp;
    int    error;

    /* round up len to a multiple of a page size */
    len = ptob(btopr(len));
    /* check if the requested range is ok */
    if (off + len > ptob(1))
        return (ENXIO);
    xsp = ddi_get_soft_state(statep, getminor(dev));
    if (xsp == NULL)
        return (ENXIO);

    if (ddi_model_convert_from(model) == DDI_MODEL_ILP32)
        /* request from 32-bit application. Skip first page */
        off += ptob(1);

    /* export the memory to the application */
    error = devmap_umem_setup(handle, xsp->dip, NULL, xsp->ucookie,
        off, len, PROT_ALL, DEVMAP_DEFAULTS, NULL);
    *maplen = len;
    return (error);
}

Freeing Kernel Memory Exported for User Access

When the driver is unloaded, the memory that was allocated by ddi_umem_alloc(9F) must be freed by calling ddi_umem_free(9F).

void ddi_umem_free(ddi_umem_cookie_t cookie);

cookie is the kernel memory cookie returned by ddi_umem_alloc(9F).

Chapter 11 Device Context Management

Some device drivers, such as drivers for graphics hardware, provide user processes with direct access to the device. These devices often require that only one process at a time accesses the device.

This chapter describes the set of interfaces that enable device drivers to manage access to such devices. The chapter provides information on the following subjects:

Introduction to Device Context

This section introduces device context and the context management model.

What Is a Device Context?

The context of a device is the current state of the device hardware. The device driver manages the device context for a process on behalf of the process. The driver must maintain a separate device context for each process that accesses the device. The device driver has the responsibility to restore the correct device context when a process accesses the device.

Context Management Model

Frame buffers provide a good example of device context management. An accelerated frame buffer enables user processes to directly manipulate the control registers of the device through memory-mapped access. Because these processes do not use traditional system calls, a process that accesses the device need not call the device driver. However, the device driver must be notified when a process is about to access a device. The driver needs to restore the correct device context and needs to provide any necessary synchronization.

To resolve this problem, the device context management interfaces enable a device driver to be notified when a user process accesses memory-mapped regions of the device, and to control accesses to the device's hardware. Synchronization and management of the various device contexts are the responsibility of the device driver. When a user process accesses a mapping, the device driver must restore the correct device context for that process.

A device driver is notified whenever a user process performs any of the following actions:

Accesses a mapping
Duplicates a mapping
Frees a mapping
Creates a mapping

The following figure shows multiple user processes that have memory-mapped a device. The driver has granted process B access to the device, and process B no longer notifies the driver of accesses. However, the driver is still notified if either process A or process C accesses the device.

Figure 11–1 Device Context Management

Diagram shows three processes, A, B, and C, with Process
B having sole access to the device.

At some point in the future, process A accesses the device. The device driver is notified and blocks future access to the device by process B. The driver then saves the device context for process B. The driver restores the device context of process A. The driver then grants access to process A, as illustrated in the following figure. At this point, the device driver is notified if either process B or process C accesses the device.

Figure 11–2 Device Context Switched to User Process A

Diagram continues example in previous figure with sole
device access switched to Process A.

On a multiprocessor machine, multiple processes could attempt to access the device at the same time. This situation can cause thrashing. Some devices require a longer time to restore a device context. To prevent more CPU time from being used to restore a device context than to actually use that device context, the minimum time that a process needs to have access to the device can be set using devmap_set_ctx_timeout(9F).

The kernel guarantees that once a device driver has granted access to a process, no other process is allowed to request access to the same device for the time interval specified by devmap_set_ctx_timeout(9F).

Context Management Operation

The general steps for performing device context management are as follows:

Define a devmap_callback_ctl(9S) structure.
Allocate space to save device context if necessary.
Set up user mappings to the device and driver notifications with devmap_devmem_setup(9F).
Manage user access to the device with devmap_load(9F) and devmap_unload(9F).
Free the device context structure, if needed.

`devmap_callback_ctl` Structure

The device driver must allocate and initialize a devmap_callback_ctl(9S) structure to inform the system about the entry point routines for device context management.

This structure uses the following syntax:

struct devmap_callback_ctl {    
    int devmap_rev;
    int (*devmap_map)(devmap_cookie_t dhp, dev_t dev,
    uint_t flags, offset_t off, size_t len, void **pvtp);
    int (*devmap_access)(devmap_cookie_t dhp, void *pvtp,
    offset_t off, size_t len, uint_t type, uint_t rw);
    int (*devmap_dup)(devmap_cookie_t dhp, void *pvtp,
    devmap_cookie_t new_dhp, void **new_pvtp);
    void (*devmap_unmap)(devmap_cookie_t dhp, void *pvtp,
    offset_t off, size_t len, devmap_cookie_t new_dhp1,
    void **new_pvtp1, devmap_cookie_t new_dhp2,
    void **new_pvtp2);
};

devmap_rev: The version number of the devmap_callback_ctl structure. The version number must be set to DEVMAP_OPS_REV.
devmap_map: Must be set to the address of the driver's devmap_map(9E) entry point.
devmap_access: Must be set to the address of the driver's devmap_access(9E) entry point.
devmap_dup: Must be set to the address of the driver's devmap_dup(9E) entry point.
devmap_unmap: Must be set to the address of the driver's devmap_unmap(9E) entry point.

Entry Points for Device Context Management

The following entry points are used to manage device context:

`devmap_map()` Entry Point

The syntax for devmap(9E) is as follows:

int xxdevmap_map(devmap_cookie_t handle, dev_t dev, uint_t flags,
    offset_t offset, size_t len, void **new-devprivate);

The devmap_map() entry point is called after the driver returns from its devmap() entry point and the system has established the user mapping to the device memory. The devmap() entry point enables a driver to perform additional processing or to allocate mapping specific private data. For example, in order to support context switching, the driver has to allocate a context structure. The driver must then associate the context structure with the mapping.

The system expects the driver to return a pointer to the allocated private data in *new-devprivate. The driver must store offset and len, which define the range of the mapping, in its private data. Later, when the system calls devmap_unmap(9E), the driver uses this information to determine how much of the mapping is being unmapped.

flags indicates whether the driver should allocate a private context for the mapping. For example, a driver can allocate a memory region to store the device context if flags is set to MAP_PRIVATE. If MAP_SHARED is set, the driver returns a pointer to a shared region.

The following example shows a devmap() entry point. The driver allocates a new context structure. The driver then saves relevant parameters passed in by the entry point. Next, the mapping is assigned a new context either through allocation or by attaching the mapping to an already existing shared context. The minimum time interval that the mapping should have access to the device is set to one millisecond.

Example 11–1 Using the `devmap()` Routine

static int
int xxdevmap_map(devmap_cookie_t handle, dev_t dev, uint_t flags,
    offset_t offset, size_t len, void **new_devprivate)
{
    struct xxstate *xsp = ddi_get_soft_state(statep,
                  getminor(dev));
    struct xxctx *newctx;

    /* create a new context structure */
    newctx = kmem_alloc(sizeof (struct xxctx), KM_SLEEP);
    newctx->xsp = xsp;
    newctx->handle = handle;
    newctx->offset = offset;
    newctx->flags = flags;
    newctx->len = len;
    mutex_enter(&xsp->ctx_lock);
    if (flags & MAP_PRIVATE) {
        /* allocate a private context and initialize it */
        newctx->context = kmem_alloc(XXCTX_SIZE, KM_SLEEP);
        xxctxinit(newctx);
    } else {
        /* set a pointer to the shared context */
        newctx->context = xsp->ctx_shared;
    }
    mutex_exit(&xsp->ctx_lock);
    /* give at least 1 ms access before context switching */
    devmap_set_ctx_timeout(handle, drv_usectohz(1000));
    /* return the context structure */
    *new_devprivate = newctx;
    return(0);
}

`devmap_access()` Entry Point

The devmap_access(9E) entry point is called when an access is made to a mapping whose translations are invalid. Mapping translations are invalidated when the mapping is created with devmap_devmem_setup(9F) in response to mmap(2), duplicated by fork(2), or explicitly invalidated by a call to devmap_unload(9F).

The syntax for devmap_access() is as follows:

int xxdevmap_access(devmap_cookie_t handle, void *devprivate,
    offset_t offset, size_t len, uint_t type, uint_t rw);

where:

handle: Mapping handle of the mapping that was accessed by a user process.
devprivate: Pointer to the driver private data associated with the mapping.
offset: Offset within the mapping that was accessed.
len: Length in bytes of the memory being accessed.
type: Type of access operation.
rw: Specifies the direction of access.

The system expects devmap_access(9E) to call either devmap_do_ctxmgt(9F) or devmap_default_access(9F) to load the memory address translations before devmap_access() returns. For mappings that support context switching, the device driver should call devmap_do_ctxmgt(). This routine is passed all parameters from devmap_access(9E), as well as a pointer to the driver entry point devmap_contextmgt(9E), which handles the context switching. For mappings that do not support context switching, the driver should call devmap_default_access(9F). The purpose of devmap_default_access() is to call devmap_load(9F) to load the user translation.

The following example shows a devmap_access(9E) entry point. The mapping is divided into two regions. The region that starts at offset OFF_CTXMG with a length of CTXMGT_SIZE bytes supports context management. The rest of the mapping supports default access.

Example 11–2 Using the `devmap_access()` Routine

#define OFF_CTXMG      0
#define CTXMGT_SIZE    0x20000    
static int
xxdevmap_access(devmap_cookie_t handle, void *devprivate,
    offset_t off, size_t len, uint_t type, uint_t rw)
{
    offset_t diff;
    int    error;

    if ((diff = off - OFF_CTXMG) >= 0 && diff < CTXMGT_SIZE) {
        error = devmap_do_ctxmgt(handle, devprivate, off,
            len, type, rw, xxdevmap_contextmgt);
    } else {
        error = devmap_default_access(handle, devprivate,
            off, len, type, rw);
    }
    return (error);
}

`devmap_contextmgt()` Entry Point

The syntax for devmap_contextmgt(9E) is as follows:

int xxdevmap_contextmgt(devmap_cookie_t handle, void *devprivate,
    offset_t offset, size_t len, uint_t type, uint_t rw);

devmap_contextmgt() should call devmap_unload(9F) with the handle of the mapping that currently has access to the device. This approach invalidates the translations for that mapping. The approach ensures that a call to devmap_access(9E) occurs for the current mapping the next time the mapping is accessed. The mapping translations for the mapping that caused the access event to occur need to be validated. Accordingly, the driver must restore the device context for the process requesting access. Furthermore, the driver must call devmap_load(9F) on the handle of the mapping that generated the call to this entry point.

Accesses to portions of mappings that have had their mapping translations validated by a call to devmap_load() do not generate a call to devmap_access(). A subsequent call to devmap_unload() invalidates the mapping translations. This call enables devmap_access() to be called again.

If either devmap_load() or devmap_unload() returns an error, devmap_contextmgt() should immediately return that error. If the device driver encounters a hardware failure while restoring a device context, a -1 should be returned. Otherwise, after successfully handling the access request, devmap_contextmgt() should return zero. A return of other than zero from devmap_contextmgt() causes a SIGBUS or SIGSEGV to be sent to the process.

The following example shows how to manage a one-page device context.

Note –

xxctxsave() and xxctxrestore() are device-dependent context save and restore functions. xxctxsave() reads data from the registers and saves the data in the soft state structure. xxctxrestore() takes data that is saved in the soft state structure and writes the data to device registers. Note that the read, write, and save are all performed with the DDI/DKI data access routines.

Example 11–3 Using the `devmap_contextmgt()` Routine

static int
xxdevmap_contextmgt(devmap_cookie_t handle, void *devprivate,
    offset_t off, size_t len, uint_t type, uint_t rw)
{
    int    error;
    struct xxctx *ctxp = devprivate;
    struct xxstate *xsp = ctxp->xsp;
    mutex_enter(&xsp->ctx_lock);
    /* unload mapping for current context */
    if (xsp->current_ctx != NULL) {
        if ((error = devmap_unload(xsp->current_ctx->handle,
            off, len)) != 0) {
            xsp->current_ctx = NULL;
            mutex_exit(&xsp->ctx_lock);
            return (error);
        }
    }
    /* Switch device context - device dependent */
    if (xxctxsave(xsp->current_ctx, off, len) < 0) {
        xsp->current_ctx = NULL;
        mutex_exit(&xsp->ctx_lock);
        return (-1);
    }
    if (xxctxrestore(ctxp, off, len) < 0){
        xsp->current_ctx = NULL;
        mutex_exit(&xsp->ctx_lock);
        return (-1);
    }
    xsp->current_ctx = ctxp;
    /* establish mapping for new context and return */
    error = devmap_load(handle, off, len, type, rw);
    if (error)
        xsp->current_ctx = NULL;
    mutex_exit(&xsp->ctx_lock);
    return (error);
}

`devmap_dup()` Entry Point

The devmap_dup(9E) entry point is called when a device mapping is duplicated, for example, by a user process that calls fork(2). The driver is expected to generate new driver private data for the new mapping.

The syntax fordevmap_dup() is as follows:

int xxdevmap_dup(devmap_cookie_t handle, void *devprivate,
    devmap_cookie_t new-handle, void **new-devprivate);

where:

handle: Mapping handle of the mapping being duplicated.
new-handle: Mapping handle of the mapping that was duplicated.
devprivate: Pointer to the driver private data associated with the mapping being duplicated.
*new-devprivate: Should be set to point to the new driver private data for the new mapping.

Mappings that have been created with devmap_dup() by default have their mapping translations invalidated. Invalid mapping translations force a call to the devmap_access(9E) entry point the first time the mapping is accessed.

The following example shows a typical devmap_dup() routine.

Example 11–4 Using the `devmap_dup()` Routine

static int
xxdevmap_dup(devmap_cookie_t handle, void *devprivate,
    devmap_cookie_t new_handle, void **new_devprivate)
{
    struct xxctx *ctxp = devprivate;
    struct xxstate *xsp = ctxp->xsp;
    struct xxctx *newctx;
    /* Create a new context for the duplicated mapping */
    newctx = kmem_alloc(sizeof (struct xxctx), KM_SLEEP);
    newctx->xsp = xsp;
    newctx->handle = new_handle;
    newctx->offset = ctxp->offset;
    newctx->flags = ctxp->flags;
    newctx->len = ctxp->len;
    mutex_enter(&xsp->ctx_lock);
    if (ctxp->flags & MAP_PRIVATE) {
        newctx->context = kmem_alloc(XXCTX_SIZE, KM_SLEEP);
        bcopy(ctxp->context, newctx->context, XXCTX_SIZE);
    } else {
        newctx->context = xsp->ctx_shared;
    }
    mutex_exit(&xsp->ctx_lock);
    *new_devprivate = newctx;
    return(0);
}

`devmap_unmap()` Entry Point

The devmap_unmap(9E) entry point is called when a mapping is unmapped. Unmapping can be caused by a user process exiting or by calling the munmap(2) system call.

The syntax for devmap_unmap() is as follows:

void xxdevmap_unmap(devmap_cookie_t handle, void *devprivate,
    offset_t off, size_t len, devmap_cookie_t new-handle1,
    void **new-devprivate1, devmap_cookie_t new-handle2,
    void **new-devprivate2);

where:

handle: Mapping handle of the mapping being freed.
devprivate: Pointer to the driver private data associated with the mapping.
off: Offset within the logical device memory at which the unmapping begins.
len: Length in bytes of the memory being unmapped.
new-handle1: Handle that the system uses to describe the new region that ends at off - 1. The value of new-handle1 can be NULL.
new-devprivate1: Pointer to be filled in by the driver with the private driver mapping data for the new region that ends at off -1. new-devprivate1 is ignored if new-handle1 is NULL.
new-handle2: Handle that the system uses to describe the new region that begins at off + len. The value of new-handle2 can be NULL.
new-devprivate2: Pointer to be filled in by the driver with the driver private mapping data for the new region that begins at off + len. new-devprivate2 is ignored if new-handle2 is NULL.

The devmap_unmap() routine is expected to free any driver private resources that were allocated when this mapping was created, either by devmap_map(9E) or by devmap_dup(9E). If the mapping is only partially unmapped, the driver must allocate new private data for the remaining mapping before freeing the old private data. Calling devmap_unload(9F) on the handle of the freed mapping is not necessary, even if this handle points to the mapping with the valid translations. However, to prevent future devmap_access(9E) problems, the device driver should make sure the current mapping representation is set to “no current mapping”.

The following example shows a typical devmap_unmap() routine.

Example 11–5 Using the `devmap_unmap()` Routine

static void
xxdevmap_unmap(devmap_cookie_t handle, void *devprivate,
    offset_t off, size_t len, devmap_cookie_t new_handle1,
    void **new_devprivate1, devmap_cookie_t new_handle2,
    void **new_devprivate2)
{
    struct xxctx *ctxp = devprivate;
    struct xxstate *xsp = ctxp->xsp;
    mutex_enter(&xsp->ctx_lock);
    /*
     * If new_handle1 is not NULL, we are unmapping
     * at the end of the mapping.
     */
    if (new_handle1 != NULL) {
        /* Create a new context structure for the mapping */
        newctx = kmem_alloc(sizeof (struct xxctx), KM_SLEEP);
        newctx->xsp = xsp;
        if (ctxp->flags & MAP_PRIVATE) {
            /* allocate memory for the private context and copy it */
            newctx->context = kmem_alloc(XXCTX_SIZE, KM_SLEEP);
            bcopy(ctxp->context, newctx->context, XXCTX_SIZE);
        } else {
            /* point to the shared context */
            newctx->context = xsp->ctx_shared;
        }
        newctx->handle = new_handle1;
        newctx->offset = ctxp->offset;
        newctx->len = off - ctxp->offset;
        *new_devprivate1 = newctx;
    }
    /*
     * If new_handle2 is not NULL, we are unmapping
     * at the beginning of the mapping.
     */
    if (new_handle2 != NULL) {
        /* Create a new context for the mapping */
        newctx = kmem_alloc(sizeof (struct xxctx), KM_SLEEP);
        newctx->xsp = xsp;
        if (ctxp->flags & MAP_PRIVATE) {
            newctx->context = kmem_alloc(XXCTX_SIZE, KM_SLEEP);
            bcopy(ctxp->context, newctx->context, XXCTX_SIZE);
        } else {
            newctx->context = xsp->ctx_shared;
        }
        newctx->handle = new_handle2;
        newctx->offset = off + len;
        newctx->flags = ctxp->flags;
        newctx->len = ctxp->len - (off + len - ctxp->off);
        *new_devprivate2 = newctx;
    }
    if (xsp->current_ctx == ctxp)
        xsp->current_ctx = NULL;
    mutex_exit(&xsp->ctx_lock);
    if (ctxp->flags & MAP_PRIVATE)
        kmem_free(ctxp->context, XXCTX_SIZE);
    kmem_free(ctxp, sizeof (struct xxctx));
}

Associating User Mappings With Driver Notifications

When a user process requests a mapping to a device with mmap(2), the driver`s segmap(9E) entry point is called. The driver must use ddi_devmap_segmap(9F) or devmap_setup(9F) when setting up the memory mapping if the driver needs to manage device contexts. Both functions call the driver's devmap(9E) entry point, which uses devmap_devmem_setup(9F) to associate the device memory with the user mapping. See Chapter 10, Mapping Device and Kernel Memory for details on how to map device memory.

The driver must inform the system of the devmap_callback_ctl(9S) entry points to get notifications of accesses to the user mapping. The driver informs the system by providing a pointer to a devmap_callback_ctl(9S) structure to devmap_devmem_setup(9F). A devmap_callback_ctl(9S) structure describes a set of entry points for context management. These entry points are called by the system to notify a device driver to manage events on the device mappings.

The system associates each mapping with a mapping handle. This handle is passed to each of the entry points for context management. The mapping handle can be used to invalidate and validate the mapping translations. If the driver invalidates the mapping translations, the driver will be notified of any future access to the mapping. If the driver validates the mapping translations, the driver will no longer be notified of accesses to the mapping. Mappings are always created with the mapping translations invalidated so that the driver will be notified on first access to the mapping.

The following example shows how to set up a mapping using the device context management interfaces.

Example 11–6 devmap(9E) Entry Point With Context Management Support

static struct devmap_callback_ctl xx_callback_ctl = {
    DEVMAP_OPS_REV, xxdevmap_map, xxdevmap_access,
    xxdevmap_dup, xxdevmap_unmap
};

static int
xxdevmap(dev_t dev, devmap_cookie_t handle, offset_t off,
    size_t len, size_t *maplen, uint_t model)
{
    struct xxstate *xsp;
    uint_t rnumber;
    int    error;
    
    /* Setup data access attribute structure */
    struct ddi_device_acc_attr xx_acc_attr = {
        DDI_DEVICE_ATTR_V0,
        DDI_NEVERSWAP_ACC,
        DDI_STRICTORDER_ACC
    };
    xsp = ddi_get_soft_state(statep, getminor(dev));
    if (xsp == NULL)
        return (ENXIO);
    len = ptob(btopr(len));
    rnumber = 0;
    /* Set up the device mapping */
    error = devmap_devmem_setup(handle, xsp->dip, &xx_callback_ctl,
        rnumber, off, len, PROT_ALL, 0, &xx_acc_attr);
    *maplen = len;
    return (error);
}

Managing Mapping Accesses

The device driver is notified when a user process accesses an address in the memory-mapped region that does not have valid mapping translations. When the access event occurs, the mapping translations of the process that currently has access to the device must be invalidated. The device context of the process that requested access to the device must be restored. Furthermore, the translations of the mapping of the process requesting access must be validated.

The functions devmap_load(9F) and devmap_unload(9F) are used to validate and invalidate mapping translations.

`devmap_load()` Entry Point

The syntax for devmap_load(9F) is as follows:

int devmap_load(devmap_cookie_t handle, offset_t offset,
    size_t len, uint_t type, uint_t rw);

devmap_load() validates the mapping translations for the pages of the mapping specified by handle,offset, and len. By validating the mapping translations for these pages, the driver is telling the system not to intercept accesses to these pages of the mapping. Furthermore, the system must not allow accesses to proceed without notifying the device driver.

devmap_load() must be called with the offset and the handle of the mapping that generated the access event for the access to complete. If devmap_load(9F) is not called on this handle, the mapping translations are not validated, and the process receives a SIGBUS.

`devmap_unload()` Entry Point

The syntax for devmap_unload(9F) is as follows:

int devmap_unload(devmap_cookie_t handle, offset_t offset, size_t len);

devmap_unload() invalidates the mapping translations for the pages of the mapping specified by handle, offset, and len. By invalidating the mapping translations for these pages, the device driver is telling the system to intercept accesses to these pages of the mapping. Furthermore, the system must notify the device driver the next time that these mapping pages are accessed by calling the devmap_access(9E) entry point.

For both functions, requests affect the entire page that contains the offset and all pages up to and including the entire page that contains the last byte, as indicated by offset + len. The device driver must ensure that for each page of device memory being mapped, only one process has valid translations at any one time.

Both functions return zero if successful. If, however, an error occurred in validating or invalidating the mapping translations, that error is returned to the device driver. The device driver must return this error to the system.

Chapter 12 Power Management

Power management provides the ability to control and manage the electrical power usage of a computer system or device. Power management enables systems to conserve energy by using less power when idle and by shutting down completely when not in use. For example, desktop computer systems can use a significant amount of power and often are left idle, particularly at night. Power management software can detect that the system is not being used. Accordingly, power management can power down the system or some of its components.

This chapter provides information on the following subjects:

Power Management Framework

The Oracle Solaris Power Management framework depends on device drivers to implement device-specific power management functions. The framework is implemented in two parts:

Device power management – Automatically turns off unused devices to reduce power consumption
System power management – Automatically turns off the computer when the entire system is idle

Device Power Management

The framework enables devices to reduce their energy consumption after a specified idle time interval. As part of power management, system software checks for idle devices. The Power Management framework exports interfaces that enable communication between the system software and the device driver.

The Oracle Solaris Power Management framework provides the following features for device power management:

A device-independent model for power-manageable devices.
dtpower(1M), a tool for configuring workstation power management. Power management can also be implemented through the power.conf(4) and /etc/default/power files.
A set of DDI interfaces for notifying the framework about power management compatibility and idleness state.

System Power Management

System power management involves saving the state of the system prior to powering the system down. Thus, the system can be returned to the same state immediately when the system is turned back on.

To shut down an entire system with return to the state prior to the shutdown, take the following steps:

Stop kernel threads and user processes. Restart these threads and processes later.
Save the hardware state of all devices on the system to disk. Restore the state later.

SPARC only –

System power management is currently implemented only on some SPARC systems supported by the Oracle Solaris OS. See the power.conf(4) man page for more information.

The System Power Management framework in the Oracle Solaris OS provides the following features for system power management:

A platform-independent model of system idleness.
pmconfig(1M), a tool for configuring workstation power management. Power management can also be implemented through the power.conf(4) and /etc/default/power files.
A set of interfaces for the device driver to override the method for determining which drivers have hardware state.
A set of interfaces to enable the framework to call into the driver to save and restore the device state.
A mechanism for notifying processes that a resume operation has occurred.

Device Power Management Model

The following sections describe the details of the device power management model. This model includes the following elements:

Components
Idleness
Power levels
Dependency
Policy
Device power management interfaces
Power management entry points

Power Management Components

A device is power manageable if the power consumption of the device can be reduced when the device is idle. Conceptually, a power-manageable device consists of a number of power-manageable hardware units that are called components.

The device driver notifies the system about device components and their associated power levels. Accordingly, the driver creates a pm-components(9P) property in the driver's attach(9E) entry point as part of driver initialization.

Most devices that are power manageable implement only a single component. An example of a single-component, power-manageable device is a disk whose spindle motor can be stopped to save power when the disk is idle.

If a device has multiple power-manageable units that are separately controllable, the device should implement multiple components.

An example of a two-component, power-manageable device is a frame buffer card with a monitor. Frame buffer electronics is the first component [component 0]. The frame buffer's power consumption can be reduced when not in use. The monitor is the second component [component 1]. The monitor can also enter a lower power mode when the monitor is not in use. The frame buffer electronics and monitor are considered by the system as one device with two components.

Multiple Power Management Components

To the power management framework, all components are considered equal and completely independent of each other. If the component states are not completely compatible, the device driver must ensure that undesirable state combinations do not occur. For example, a frame buffer/monitor card has the following possible states: D0, D1, D2, and D3. The monitor attached to the card has the following potential states: On, Standby, Suspend, and Off. These states are not necessarily compatible with each other. For example, if the monitor is On, then the frame buffer must be at D0, that is, full on. If the frame buffer driver gets a request to power up the monitor to On while the frame buffer is at D3, the driver must call pm_raise_power(9F) to bring the frame buffer up before setting the monitor On. System requests to lower the power of the frame buffer while the monitor is On must be refused by the driver.

Power Management States

Each component of a device can be in one of two states: busy or idle. The device driver notifies the framework of changes in the device state by calling pm_busy_component(9F) and pm_idle_component(9F). When components are initially created, the components are considered idle.

Power Levels

From the pm-components property exported by the device, the Device Power Management framework knows what power levels the device supports. Power-level values must be positive integers. The interpretation of power levels is determined by the device driver writer. Power levels must be listed in monotonically increasing order in the pm-components property. A power level of 0 is interpreted by the framework to mean off. When the framework must power up a device due to a dependency, the framework sets each component at its highest power level.

The following example shows a pm-components entry from the .conf file of a driver that implements a single power-managed component consisting of a disk spindle motor. The disk spindle motor is component 0. The spindle motor supports two power levels. These levels represent “stopped” and “spinning at full speed.”

Example 12–1 Sample `pm-component` Entry

pm-components="NAME=Spindle Motor", "0=Stopped", "1=Full Speed";

The following example shows how Example 12–1 could be implemented in the attach() routine of the driver.

Example 12–2 attach(9E) Routine With `pm-components` Property

static char *pmcomps[] = {
    "NAME=Spindle Motor",
    "0=Stopped",
    "1=Full Speed"
};
/* ... */
xxattach(dev_info_t *dip, ddi_attach_cmd_t cmd)
{
    /* ... */
    if (ddi_prop_update_string_array(DDI_DEV_T_NONE, dip,
        "pm-components", &pmcomp[0],
        sizeof (pmcomps) / sizeof (char *)) != DDI_PROP_SUCCESS)
        goto failed;
    /* ... */

The following example shows a frame buffer that implements two components. Component 0 is the frame buffer electronics that support four different power levels. Component 1 represents the state of power management of the attached monitor.

Example 12–3 Multiple Component `pm-components` Entry

pm-components="NAME=Frame Buffer", "0=Off", "1=Suspend", \
    "2=Standby", "3=On",
    "NAME=Monitor", "0=Off", "1=Suspend", "2=Standby", "3=On";

When a device driver is first attached, the framework does not know the power level of the device. A power transition can occur when:

The driver calls pm_raise_power(9F) or pm_lower_power(9F).
The framework has lowered the power level of a component because a time threshold has been exceeded.
Another device has changed power and a dependency exists between the two devices. See Power Management Dependencies.

After a power transition, the framework begins tracking the power level of each component of the device. Tracking also occurs if the driver has informed the framework of the power level. The driver informs the framework of a power level change by calling pm_power_has_changed(9F).

The system calculates a default threshold for each potential power transition. These thresholds are based on the system idleness threshold. The default thresholds can be overridden using pmconfig or power.conf(4). Another default threshold based on the system idleness threshold is used when the component power level is unknown.

Power Management Dependencies

Some devices should be powered down only when other devices are also powered down. For example, if a CD-ROM drive is allowed to power down, necessary functions, such as the ability to eject a CD, might be lost.

To prevent a device from powering down independently, you can make that device dependent on another device that is likely to remain powered on. Typically, a device is made dependent upon a frame buffer, because a monitor is generally on whenever a user is utilizing a system.

The power.conf(4)file specifies the dependencies among devices. (A parent node in the device tree implicitly depends upon its children. This dependency is handled automatically by the power management framework.) You can specify a particular dependency with a power.conf(4) entry of this form:

device-dependency dependent-phys-path phys-path

Where dependent-phys-path is the device that is kept powered up, such as the CD-ROM drive. phys-path represents the device whose power state is to be depended on, such as the frame buffer.

Adding an entry to power.conf for every new device that is plugged into the system would be burdensome. The following syntax enables you to indicate dependency in a more general fashion:

device-dependency-property property phys-path

Such an entry mandates that any device that exports the property property must be dependent upon the device named by phys-path. Because this dependency applies especially to removable-media devices, /etc/power.conf includes the following line by default:

device_dependent-property  removable-media  /dev/fb

With this syntax, no device that exports the removable-media property can be powered down unless the console frame buffer is also powered down.

For more information, see the power.conf(4) and removable-media(9P) man pages.

Automatic Power Management for Devices

If automatic power management is enabled by pmconfig or power.conf(4), then all devices with a pm-components(9P) property automatically will use power management. After a component has been idle for a default period, the component is automatically lowered to the next lowest power level. The default period is calculated by the power management framework to set the entire device to its lowest power state within the system idleness threshold.

Note –

By default, automatic power management is enabled on all SPARC desktop systems first shipped after July 1, 1999. This feature is disabled by default for all other systems. To determine whether automatic power management is enabled on your machine, refer to the power.conf(4) man page for instructions.

power.conf(4) can be used to override the defaults calculated by the framework.

Device Power Management Interfaces

A device driver that supports a device with power-manageable components must create a pm-components(9P) property. This property indicates to the system that the device has power-manageable components. pm-components also tells the system which power levels are available. The driver typically informs the system by calling ddi_prop_update_string_array(9F) from the driver's attach(9E) entry point. An alternative means of informing the system is from a driver.conf(4) file. See the pm-components(9P) man page for details.

Busy-Idle State Transitions

The driver must keep the framework informed of device state transitions from idle to busy or busy to idle. Where these transitions happen is entirely device-specific. The transitions between the busy and idle states depend on the nature of the device and the abstraction represented by the specific component. For example, SCSI disk target drivers typically export a single component, which represents whether the SCSI target disk drive is spun up or not. The component is marked busy whenever an outstanding request to the drive exists. The component is marked idle when the last queued request finishes. Some components are created and never marked busy. For example, components created by pm-components(9P) are created in an idle state.

The pm_busy_component(9F) and pm_idle_component(9F) interfaces notify the power management framework of busy-idle state transitions. The pm_busy_component(9F) call has the following syntax:

int pm_busy_component(dev_info_t *dip, int component);

pm_busy_component(9F) marks component as busy. While the component is busy, that component should not be powered off. If the component is already powered off, then marking that component busy does not change the power level. The driver needs to call pm_raise_power(9F) for this purpose. Calls to pm_busy_component(9F) are cumulative and require a corresponding number of calls to pm_idle_component(9F) to idle the component.

The pm_idle_component(9F) routine has the following syntax:

int pm_idle_component(dev_info_t *dip, int component);

pm_idle_component(9F) marks component as idle. An idle component is subject to being powered off. pm_idle_component(9F) must be called once for each call to pm_busy_component(9F) in order to idle the component.

Device Power State Transitions

A device driver can call pm_raise_power(9F) to request that a component be set to at least a given power level. Setting the power level in this manner is necessary before using a component that has been powered off. For example, the read(9E) routine of a SCSI disk target driver might need to spin up the disk, if the disk has been powered off. The pm_raise_power(9F) function requests the power management framework to initiate a device power state transition to a higher power level. Normally, reductions in component power levels are initiated by the framework. However, a device driver should call pm_lower_power(9F) when detaching, in order to reduce the power consumption of unused devices as much as possible.

Powering down can pose risks for some devices. For example, some tape drives damage tapes when power is removed. Similarly, some disk drives have a limited tolerance for power cycles, because each cycle results in a head landing. Use the no-involuntary-power-cycles(9P) property to notify the system that the device driver should control all power cycles for the device. This approach prevents power from being removed from a device while the device driver is detached unless the device was powered off by a driver's call to pm_lower_power(9F) from its detach(9E) entry point.

The pm_raise_power(9F) function is called when the driver discovers that a component needed for some operation is at an insufficient power level. This interface causes the driver to raise the current power level of the component to the needed level. All the devices that depend on this device are also brought back to full power by this call.

Call the pm_lower_power(9F) function when the device is detaching once access to the device is no longer needed. Call pm_lower_power(9F) to set each component at the lowest power so that the device uses as little power as possible while not in use. The pm_lower_power() function must be called from the detach() entry point. The pm_lower_power() function has no effect if it is called from any other part of the driver.

The pm_power_has_changed(9F) function is called to notify the framework about a power transition. The transition might be due to the device changing its own power level. The transition might also be due to an operation such as suspend-resume. The syntax for pm_power_has_changed(9F) is the same as the syntax for pm_raise_power(9F).

`power()` Entry Point

The power management framework uses the power(9E) entry point.

power() uses the following syntax:

int power(dev_info_t *dip, int component, int level);

When a component's power level needs to be changed, the system calls the power(9E) entry point. The action taken by this entry point is device driver-specific. In the example of the SCSI target disk driver mentioned previously, setting the power level to 0 results in sending a SCSI command to spin down the disk, while setting the power level to the full power level results in sending a SCSI command to spin up the disk.

If a power transition can cause the device to lose state, the driver must save any necessary state in memory for later restoration. If a power transition requires the saved state to be restored before the device can be used again, then the driver must restore that state. The framework makes no assumptions about what power transactions cause the loss of state or require the restoration of state for automatically power-managed devices. The following example shows a sample power() routine.

Example 12–4 Using the `power()` Routine for a Single-Component Device

int
xxpower(dev_info_t *dip, int component, int level)
{
    struct xxstate *xsp;
    int instance;

    instance = ddi_get_instance(dip);
    xsp = ddi_get_soft_state(statep, instance);
    /*
     * Make sure the request is valid
     */
    if (!xx_valid_power_level(component, level))
        return (DDI_FAILURE);
    mutex_enter(&xsp->mu);
    /*
     * If the device is busy, don't lower its power level
     */
    if (xsp->xx_busy[component] &&
        xsp->xx_power_level[component] > level) {
        mutex_exit(&xsp->mu);
        return (DDI_FAILURE);
    }

    if (xsp->xx_power_level[component] != level) {
        /*
         * device- and component-specific setting of power level
         * goes here
         */
        xsp->xx_power_level[component] = level;
    }
    mutex_exit(&xsp->mu);
    return (DDI_SUCCESS);
}

The following example is a power() routine for a device with two components, where component 0 must be on when component 1 is on.

Example 12–5 power(9E) Routine for Multiple-Component Device

int
xxpower(dev_info_t *dip, int component, int level)
{
    struct xxstate *xsp;
    int instance;

    instance = ddi_get_instance(dip);
    xsp = ddi_get_soft_state(statep, instance);
    /*
     * Make sure the request is valid
     */
    if (!xx_valid_power_level(component, level))
        return (DDI_FAILURE);
    mutex_enter(&xsp->mu);
    /*
     * If the device is busy, don't lower its power level
     */
    if (xsp->xx_busy[component] &&
        xsp->xx_power_level[component] > level) {
        mutex_exit(&xsp->mu);
        return (DDI_FAILURE);
    }
    /*
     * This code implements inter-component dependencies:
     * If we are bringing up component 1 and component 0 
     * is off, we must bring component 0 up first, and if
     * we are asked to shut down component 0 while component
     * 1 is up we must refuse
     */
    if (component == 1 && level > 0 && xsp->xx_power_level[0] == 0) {
        xsp->xx_busy[0]++;
        if (pm_busy_component(dip, 0) != DDI_SUCCESS) {
            /*
             * This can only happen if the args to 
             * pm_busy_component()
             * are wrong, or pm-components property was not
             * exported by the driver.
             */
            xsp->xx_busy[0]--;
            mutex_exit(&xsp->mu);
            cmn_err(CE_WARN, "xxpower pm_busy_component() 
                failed");
            return (DDI_FAILURE);
        }
        mutex_exit(&xsp->mu);
        if (pm_raise_power(dip, 0, XX_FULL_POWER_0) != DDI_SUCCESS)
            return (DDI_FAILURE);
        mutex_enter(&xsp->mu);
    }
    if (component == 0 && level == 0 && xsp->xx_power_level[1] != 0) {
        mutex_exit(&xsp->mu);
        return (DDI_FAILURE);
    }
    if (xsp->xx_power_level[component] != level) {
        /*
         * device- and component-specific setting of power level
         * goes here
         */
        xsp->xx_power_level[component] = level;
    }
    mutex_exit(&xsp->mu);
    return (DDI_SUCCESS);
}

System Power Management Model

This section describes the details of the System Power Management model. The model includes the following components:

Autoshutdown threshold
Busy state
Hardware state
Policy
Power management entry points

Autoshutdown Threshold

The system can be shut down, that is, powered off, automatically after a configurable period of idleness. This period is known as the autoshutdown threshold. This behavior is enabled by default for SPARC desktop systems first shipped after October 1, 1995 and before July 1, 1999. See the power.conf(4)man page for more information. Autoshutdown can be overridden using dtpower(1M) or power.conf(4).

Busy State

The busy state of the system can be measured in several ways. The currently supported built-in metric items are keyboard characters, mouse activity, tty characters, load average, disk reads, and NFS requests. Any one of these items can make the system busy. In addition to the built-in metrics, an interface is defined for running a user-specified process that can indicate that the system is busy.

Hardware State

Devices that export a reg property are considered to have hardware state that must be saved prior to shutting down the system. A device without the reg property is considered to be stateless. However, this consideration can be overridden by the device driver.

A device with hardware state but no reg property, such as a SCSI driver, must be called to save and restore the state if the driver exports a pm-hardware-state property with the value needs-suspend-resume. Otherwise, the lack of a reg property is taken to mean that the device has no hardware state. For information on device properties, see Chapter 4, Properties.

A device with a reg property and no hardware state can export a pm-hardware-state property with the value no-suspend-resume. Using no-suspend-resume with the pm-hardware-state property keeps the framework from calling the driver to save and restore that state. For more information on power management properties, see the pm-components(9P) man page.

Automatic Power Management for Systems

The system is shut down if the following conditions apply:

Autoshutdown is enabled by dtpower(1M) or power.conf(4).
The system has been idle for autoshutdown threshold minutes.
All of the metrics that are specified in power.conf have been satisfied.

Entry Points Used by System Power Management

System power management passes the command DDI_SUSPEND to the detach(9E) driver entry point to request the driver to save the device hardware state. System power management passes the command DDI_RESUME to the attach(9E) driver entry point to request the driver to restore the device hardware state.

`detach()` Entry Point

The syntax for detach(9E) is as follows:

int detach(dev_info_t *dip, ddi_detach_cmd_t cmd);

A device with a reg property or a pm-hardware-state property set to needs-suspend-resume must be able to save the hardware state of the device. The framework calls into the driver's detach(9E) entry point to enable the driver to save the state for restoration after the system power returns. To process the DDI_SUSPEND command, detach(9E) must perform the following tasks:

Block further operations from being initiated until the device is resumed, except for dump(9E) requests.
Wait until outstanding operations have completed. If an outstanding operation can be restarted, you can abort that operation.
Cancel any timeouts and callbacks that are pending.
Save any volatile hardware state to memory. The state includes the contents of device registers, and can also include downloaded firmware.

If the driver is unable to suspend the device and save its state to memory, then the driver must return DDI_FAILURE. The framework then aborts the system power management operation.

In some cases, powering down a device involves certain risks. For example, if a tape drive is powered off with a tape inside, the tape can be damaged. In such a case, attach(9E) should do the following:

Call ddi_removing_power(9F) to determine whether a DDI_SUSPEND command can cause power to be removed from the device.
Determine whether power removal can cause problems.

If both cases are true, the DDI_SUSPEND request should be rejected. Example 12–6 shows an attach(9E) routine using ddi_removing_power(9F) to check whether the DDI_SUSPEND command causes problems.

Dump requests must be honored. The framework uses the dump(9E) entry point to write out the state file that contains the contents of memory. See the dump(9E) man page for the restrictions that are imposed on the device driver when using this entry point.

Calling the detach(9E) entry point of a power-manageable component with the DDI_SUSPEND command should save the state when the device is powered off. The driver should cancel pending timeouts. The driver should also suppress any calls to pm_raise_power(9F) except for dump(9E) requests. When the device is resumed by a call to attach(9E) with a command of DDI_RESUME, timeouts and calls to pm_raise_power() can be resumed. The driver must keep sufficient track of its state to be able to deal appropriately with this possibility. The following example shows a detach(9E) routine with the DDI_SUSPEND command implemented.

Example 12–6 detach(9E) Routine Implementing `DDI_SUSPEND`

int
xxdetach(dev_info_t *dip, ddi_detach_cmd_t cmd)
{
    struct xxstate *xsp;
    int instance;

    instance = ddi_get_instance(dip);
    xsp = ddi_get_soft_state(statep, instance);

    switch (cmd) {
    case DDI_DETACH:
        /* ... */
    case DDI_SUSPEND:
        /*
         * We do not allow DDI_SUSPEND if power will be removed and
         * we have a device that damages tape when power is removed
         * We do support DDI_SUSPEND for Device Reconfiguration.
         */
        if (ddi_removing_power(dip) && xxdamages_tape(dip))
            return (DDI_FAILURE);
        mutex_enter(&xsp->mu);
        xsp->xx_suspended = 1;  /* stop new operations */
        /*
         * Sleep waiting for all the commands to be completed
         *
         * If a callback is outstanding which cannot be cancelled
         * then either wait for the callback to complete or fail the
         * suspend request
         *
         * This section is only needed if the driver maintains a
         * running timeout
         */
        if (xsp->xx_timeout_id) {
            timeout_id_t temp_timeout_id = xsp->xx_timeout_id;
            xsp->xx_timeout_id = 0;
            mutex_exit(&xsp->mu);
            untimeout(temp_timeout_id);
            mutex_enter(&xsp->mu);
        }
        if (!xsp->xx_state_saved) {
            /*
             * Save device register contents into
             * xsp->xx_device_state
             */
        }
        mutex_exit(&xsp->mu);
        return (DDI_SUCCESS);
    default:
        return (DDI_FAILURE);
}

`attach()` Entry Point

The syntax for attach(9E) is as follows:

int attach(dev_info_t *dip, ddi_attach_cmd_t cmd);

When power is restored to the system, each device with a reg property or with a pm-hardware-state property of value needs-suspend-resume has its attach(9E) entry point called with a command value of DDI_RESUME. If the system shutdown is aborted, each suspended driver is called to resume even though the power has not been shut off. Consequently, the resume code in attach(9E) must make no assumptions about whether the system actually lost power.

The power management framework considers the power level of the components to be unknown at DDI_RESUME time. Depending on the nature of the device, the driver writer has two choices:

If the driver can determine the actual power level of the components of the device without powering the components up, such as by reading a register, then the driver should notify the framework of the power level of each component by calling pm_power_has_changed(9F).
If the driver cannot determine the power levels of the components, then the driver should mark each component internally as unknown and call pm_raise_power(9F) before the first access to each component.

The following example shows an attach(9E) routine with the DDI_RESUME command.

Example 12–7 attach(9E) Routine Implementing `DDI_RESUME`

int
xxattach(devinfo_t *dip, ddi_attach_cmd_t cmd)
{
    struct xxstate *xsp;
    int    instance;

    instance = ddi_get_instance(dip);
    xsp = ddi_get_soft_state(statep, instance);

    switch (cmd) {
    case DDI_ATTACH:
    /* ... */
    case DDI_RESUME:
        mutex_enter(&xsp->mu);
        if (xsp->xx_pm_state_saved) {
            /*
             * Restore device register contents from
             * xsp->xx_device_state
             */
        }
        /*
         * This section is optional and only needed if the
         * driver maintains a running timeout
         */
        xsp->xx_timeout_id = timeout( /* ... */ );

        xsp->xx_suspended = 0;        /* allow new operations */
        cv_broadcast(&xsp->xx_suspend_cv);
        /* If it is possible to determine in a device-specific 
         * way what the power levels of components are without 
         * powering the components up,
         * then the following code is recommended
         */
        for (i = 0; i < num_components; i++) {
            xsp->xx_power_level[i] = xx_get_power_level(dip, i);
            if (xsp->xx_power_level[i] != XX_LEVEL_UNKNOWN)
                (void) pm_power_has_changed(dip, i, 
                    xsp->xx_power_level[i]);
        }
        mutex_exit(&xsp->mu);
        return(DDI_SUCCESS);
    default:
        return(DDI_FAILURE);
    }
}

Note –

The detach(9E) and attach(9E) interfaces can also be used to resume a system that has been quiesced.

Power Management Device Access Example

If power management is supported, and detach(9E) and attach(9E) are used as in Example 12–6 and Example 12–7, then access to the device can be made from user context, for example, from read(2), write(2), and ioctl(2).

The following example demonstrates this approach. The example assumes that the operation about to be performed requires a component component that is operating at power level level.

Example 12–8 Device Access

mutex_enter(&xsp->mu);
/*
 * Block command while device is suspended by DDI_SUSPEND
 */
while (xsp->xx_suspended)
    cv_wait(&xsp->xx_suspend_cv, &xsp->mu);
/*
 * Mark component busy so xx_power() will reject attempt to lower power
 */
xsp->xx_busy[component]++;
if (pm_busy_component(dip, component) != DDI_SUCCESS) {
    xsp->xx_busy[component]--;
    /*
     * Log error and abort
     */
}
if (xsp->xx_power_level[component] < level) {
    mutex_exit(&xsp->mu);
    if (pm_raise_power(dip, component, level) != DDI_SUCCESS) {
        /*
         * Log error and abort
         */
    }
    mutex_enter(&xsp->mu);
}

The code fragment in the following example can be used when device operation completes, for example, in the device's interrupt handler.

Example 12–9 Device Operation Completion

/*
 * For each command completion, decrement the busy count and unstack
 * the pm_busy_component() call by calling pm_idle_component(). This
 * will allow device power to be lowered when all commands complete
 * (all pm_busy_component() counts are unstacked)
 */
xsp->xx_busy[component]--;
if (pm_idle_component(dip, component) != DDI_SUCCESS) {
    xsp->xx_busy[component]++;
    /*
     * Log error and abort
     */
}
/*
 * If no more outstanding commands, wake up anyone (like DDI_SUSPEND)
 * waiting for all commands to  be completed
 */

Power Management Flow of Control

Figure 12–1 illustrates the flow of control in the power management framework.

When a component's activity is complete, a driver can call pm_idle_component(9F) to mark the component as idle. When the component has been idle for its threshold time, the framework can lower the power of the component to its next lower level. The framework calls the power(9E) function to set the component's power to the next lower supported power level, if a lower level exists. The driver's power(9E) function should reject any attempt to lower the power level of a component when that component is busy. The power(9E) function should save any state that could be lost in a transition to a lower level prior to making that transition.

When the component is needed at a higher level, the driver calls pm_busy_component(9F). This call keeps the framework from lowering the power still further and then calls pm_raise_power(9F) on the component. The framework next calls power(9E) to raise the power of the component before the call to pm_raise_power(9F) returns. The driver's power(9E) code must restore any state that was lost in the lower level but that is needed in the higher level.

When a driver is detaching, the driver should call pm_lower_power(9F) for each component to lower its power to its lowest level. The framework can then call the driver's power(9E) routine to lower the power of the component before the call to pm_lower_power(9F) returns.

Figure 12–1 Power Management Conceptual State Diagram

Diagram shows the flow of control through power management
routines.

Changes to Power Management Interfaces

Prior to the Solaris 8 release, power management of devices was not automatic. Developers had to add an entry to /etc/power.conf for each device that was to be power-managed. The framework assumed that all devices supported only two power levels: 0 and standard power.

Power assumed an implied dependency of all other components on component 0. When component 0 changed to level 0, a call was made into the driver's detach(9E) with the DDI_PM_SUSPEND command to save the hardware state. When component 0 changed from level 0, a call was made to the attach(9E) routine with the command DDI_PM_RESUME to restore hardware state.

The following interfaces and commands are obsolete, although they are still supported for binary purposes:

ddi_dev_is_needed(9F)
pm_create_components(9F)
pm_destroy_components(9F)
pm_get_normal_power(9F)
pm_set_normal_power(9F)
DDI_PM_SUSPEND
DDI_PM_RESUME

Since the Solaris 8 release, devices that export the pm-components property automatically use power management if autopm is enabled.

The framework now knows from the pm-components property which power levels are supported by each device.

The framework makes no assumptions about dependencies among the different components of a device. The device driver is responsible for saving and restoring hardware state as needed when changing power levels.

These changes enable the power management framework to deal with emerging device technology. Power management now results in greater power savings. The framework can detect automatically which devices can save power. The framework can use intermediate power states of the devices. A system can now meet energy consumption goals without powering down the entire system and without any functions.

Table 12–1 Power Management Interfaces


Removed Interfaces	Equivalent Interfaces
`pm_create_components`(9F)	pm-components(9P)
`pm_set_normal_power`(9F)	pm-components(9P)
`pm_destroy_components`(9F)	None
`pm_get_normal_power`(9F)	None
`ddi_dev_is_needed`(9F)	pm_raise_power(9F)
None	pm_lower_power(9F)
None	pm_power_has_changed(9F)
DDI_PM_SUSPEND	None
DDI_PM_RESUME	None

Chapter 13 Hardening Oracle Solaris Drivers

Fault Management Architecture (FMA) I/O Fault Services enable driver developers to integrate fault management capabilities into I/O device drivers. The Oracle Solaris I/O fault services framework defines a set of interfaces that enable all drivers to coordinate and perform basic error handling tasks and activities. The Oracle Solaris FMA as a whole provides for error handling and fault diagnosis, in addition to response and recovery. FMA is a component of Oracle's Predictive Self-Healing strategy.

A driver is considered hardened when it uses the defensive programming practices described in this document in addition to the I/O fault services framework for error handling and diagnosis. The driver hardening test harness tests that the I/O fault services and defensive programming requirements have been correctly fulfilled.

This document contains the following section:

Oracle Fault Management Architecture I/O Fault Services provides a reference for driver developers who want to integrate fault management capabilities into I/O device drivers.

Oracle Fault Management Architecture I/O Fault Services

This section explains how to integrate fault management error reporting, error handling, and diagnosis for I/O device drivers. This section provides an in-depth examination of the I/O fault services framework and how to utilize the I/O fault service APIs within a device driver.

This section discusses the following topics:

What Is Predictive Self-Healing? provides background and an overview of the Oracle Fault Management Architecture.
Oracle Solaris Fault Manager describes additional background with a focus on a high-level overview of the Oracle Solaris Fault Manager, fmd(1M).
Error Handling is the primary section for driver developers. This section highlights the best practice coding techniques for high-availability and the use of I/O fault services in driver code to interact with the FMA.

What Is Predictive Self-Healing?

Traditionally, systems have exported hardware and software error information directly to human administrators and to management software in the form of syslog messages. Often, error detection, diagnosis, reporting, and handling was embedded in the code of each driver.

A system like the Solaris OS predictive self-healing system is first and foremost self-diagnosing. Self-diagnosing means the system provides technology to automatically diagnose problems from observed symptoms, and the results of the diagnosis can then be used to trigger automated response and recovery. A fault in hardware or a defect in software can be associated with a set of possible observed symptoms called errors. The data generated by the system as the result of observing an error is called an error report or ereport.

In a system capable of self-healing, ereports are captured by the system and are encoded as a set of name-value pairs described by an extensible event protocol to form an ereport event. Ereport events and other data are gathered to facilitate self-healing, and are dispatched to software components called diagnosis engines designed to diagnose the underlying problems corresponding to the error symptoms observed by the system. A diagnosis engine runs in the background and silently consumes error telemetry until it can produce a diagnosis or predict a fault.

After processing sufficient telemetry to reach a conclusion, a diagnosis engine produces another event called a fault event. The fault event is then broadcast to all agents that are interested in the specific fault event. An agent is a software component that initiates recovery and responds to specific fault events. A software component known as the Oracle Solaris Fault Manager, fmd(1M), manages the multiplexing of events between ereport generators, diagnosis engines, and agent software.

Oracle Solaris Fault Manager

The Oracle Solaris Fault Manager, fmd(1M), is responsible for dispatching in-bound error telemetry events to the appropriate diagnosis engines. The diagnosis engine is responsible for identifying the underlying hardware faults or software defects that are producing the error symptoms. The fmd(1M) daemon is the Oracle Solaris OS implementation of a fault manager. It starts at boot time and loads all of the diagnosis engines and agents available on the system. The Oracle Solaris Fault Manager also provides interfaces for system administrators and service personnel to observe fault management activity.

Diagnosis, Suspect Lists, and Fault Events

Once a diagnosis has been made, the diagnosis is output in the form of a list.suspect event. A list.suspect event is an event comprised of one or more possible fault or defect events. Sometimes the diagnosis cannot narrow the cause of errors to a single fault or defect. For example, the underlying problem might be a broken wire connecting controllers to the main system bus. The problem might be with a component on the bus or with the bus itself. In this specific case, the list.suspect event will contain multiple fault events: one for each controller attached to the bus, and one for the bus itself.

In addition to describing the fault that was diagnosed, a fault event also contains four payload members for which the diagnosis is applicable.

The resource is the component that was diagnosed as faulty. The fmdump(1M) command shows this payload member as “Problem in.”
The Automated System Recovery Unit (ASRU) is the hardware or software component that must be disabled to prevent further error symptoms from occurring. The fmdump(1M) command shows this payload member as “Affects.”
The Field Replaceable Unit (FRU) is the component that must be replaced or repaired to fix the underlying problem.
The Label payload is a string that gives the location of the FRU in the same form as it is printed on the chassis or motherboard, for example next to a DIMM slot or PCI card slot. The fmdumpcommand shows this payload member as “Location.”

For example, after receiving a certain number of ECC correctable errors in a given amount of time for a particular memory location, the CPU and memory diagnosis engine issues a diagnosis (list.suspect event) for a faulty DIMM.

# fmdump -v -u 38bd6f1b-a4de-4c21-db4e-ccd26fa8573c
TIME                 UUID                                 SUNW-MSG-ID
Oct 31 13:40:18.1864 38bd6f1b-a4de-4c21-db4e-ccd26fa8573c AMD-8000-8L
100%  fault.cpu.amd.icachetag

Problem in: hc:///motherboard=0/chip=0/cpu=0
Affects: cpu:///cpuid=0
FRU: hc:///motherboard=0/chip=0
Location: SLOT 2

In this example, fmd(1M) has identified a problem in a resource, specifically a CPU (hc:///motherboard=0/chip=0/cpu=0). To suppress further error symptoms and to prevent an uncorrectable error from occurring, an ASRU, (cpu:///cpuid=0), is identified for retirement. The component that needs to be replaced is the FRU (hc:///motherboard=0/chip=0).

Response Agents

An agent is a software component that takes action in response to a diagnosis or repair. For example, the CPU and memory retire agent is designed to act on list.suspects that contain a fault.cpu.* event. The cpumem-retire agent will attempt to off-line a CPU or retire a physical memory page from service. If the agent is successful, an entry in the fault manager's ASRU cache is added for the page or CPU that was successfully retired. The fmadm(1M) utility, as shown in the example below, shows an entry for a memory rank that has been diagnosed as having a fault. ASRUs that the system does not have the ability to off-line, retire, or disable, will also have an entry in the ASRU cache, but they will be seen as degraded. Degraded means the resource associated with the ASRU is faulty, but the ASRU is unable to be removed from service. Currently Oracle Solaris agent software cannot act upon I/O ASRUs (device instances). All faulty I/O resource entries in the cache are in the degraded state.

# fmadm faulty
   STATE RESOURCE / UUID
-------- ----------------------------------------------------------------------
degraded mem:///motherboard=0/chip=1/memory-controller=0/dimm=3/rank=0
         ccae89df-2217-4f5c-add4-d920f78b4faf
-------- ----------------------------------------------------------------------

The primary purpose of a retire agent is to isolate (safely remove from service) the piece of hardware or software that has been diagnosed as faulty.

Agents can also take other important actions such as the following actions:

Send alerts via SNMP traps. This can translate a diagnosis into an alert for SNMP that plugs into existing software mechanisms.
Post a syslog message. Message specific diagnoses (for example, syslog message agent) can take the result of a diagnosis and translate it into a syslog message that administrators can use to take a specific action.
Other agent actions such as update the FRUID. Response agents can be platform-specific.

Message IDs and Dictionary Files

The syslog message agent takes the output of the diagnosis (the list.suspect event) and writes specific messages to the console or /var/adm/messages. Often console messages can be difficult to understand. FMA remedies this problem by providing a defined fault message structure that is generated every time a list.suspect event is delivered to a syslog message.

The syslog agent generates a message identifier (MSG ID). The event registry generates dictionary files (.dict files) that map a list.suspect event to a structured message identifier that should be used to identify and view the associated knowledge article. Message files, (.po files) map the message ID to localized messages for every possible list of suspected faults that the diagnosis engine can generate. The following is an example of a fault message emitted on a test system.

SUNW-MSG-ID: AMD-8000-7U, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Fri Jul 28 04:26:51 PDT 2006
PLATFORM: Sun Fire V40z, CSN: XG051535088, HOSTNAME: parity
SOURCE: eft, REV: 1.16
EVENT-ID: add96f65-5473-69e6-dbe1-8b3d00d5c47b
DESC: The number of errors associated with this CPU has exceeded 
acceptable levels. Refer to http://sun.com/msg/AMD-8000-7U for 
more information.
AUTO-RESPONSE: An attempt will be made to remove this CPU from service.
IMPACT: Performance of this system may be affected.
REC-ACTION: Schedule a repair procedure to replace the affected CPU. 
Use fmdump -v -u <EVENT_ID> to identify the module.

System Topology

To identify where a fault might have occurred, diagnosis engines need to have the topology for a given software or hardware system represented. The fmd(1M) daemon provides diagnosis engines with a handle to a topology snapshot that can be used during diagnosis. Topology information is used to represent the resource, ASRU, and FRU found in each fault event. The topology can also be used to store the platform label, FRUID, and serial number identification.

The resource payload member in the fault event is always represented by the physical path location from the platform chassis outward. For example, a PCI controller function that is bridged from the main system bus to a PCI local bus is represented by its hc scheme path name:

hc:///motherboard=0/hostbridge=1/pcibus=0/pcidev=13/pcifn=0

The ASRU payload member in the fault event is typically represented by the Oracle Solaris device tree instance name that is bound to a hardware controller, device, or function. FMA uses the dev scheme to represent the ASRU in its native format for actions that might be taken by a future implementation of a retire agent specifically designed for I/O devices:

dev:////pci@1e,600000/ide@d

The FRU payload representation in the fault event varies depending on the closest replaceable component to the I/O resource that has been diagnosed as faulty. For example, a fault event for a broken embedded PCI controller might name the motherboard of the system as the FRU that needs to be replaced:

hc:///motherboard=0

The label payload is a string that gives the location of the FRU in the same form as it is printed on the chassis or motherboard, for example next to a DIMM slot or PCI card slot:

Label: SLOT 2

Error Handling

This section describes how to use I/O fault services APIs to handle errors within a driver. This section discusses how drivers should indicate and initialize their fault management capabilities, generate error reports, and register the driver's error handler routine.

Excerpts are provided from source code examples that demonstrate the use of the I/O fault services API from the Broadcom 1Gb NIC driver, bge. Follow these examples as a model for how to integrate fault management capability into your own drivers. Take the following steps to study the complete bge driver code:

Go to ON (OS/Net) Sources.
Enter bge in the File Path field.
Click the Search button.

Drivers that have been instrumented to provide FMA error report telemetry detect errors and determine the impact of those errors on the services provided by the driver. Following the detection of an error, the driver should determine when its services have been impacted and to what degree.

An I/O driver must respond immediately to detected errors. Appropriate responses include:

Attempt recovery
Retry an I/O transaction
Attempt fail-over techniques
Report the error to the calling application/stack
If the error cannot be constrained any other way, then panic

Errors detected by the driver are communicated to the fault management daemon as an ereport. An ereport is a structured event defined by the FMA event protocol. The event protocol is a specification for a set of common data fields that must be used to describe all possible error and fault events, in addition to the list of suspected faults. Ereports are gathered into a flow of error telemetry and dispatched to the diagnosis engine.

Declaring Fault Management Capabilities

A hardened device driver must declare its fault management capabilities to the I/O Fault Management framework. Use the ddi_fm_init(9F) function to declare the fault management capabilities of your driver.

void ddi_fm_init(dev_info_t *dip, int *fmcap, ddi_iblock_cookie_t *ibcp)

The ddi_fm_init() function can be called from kernel context in a driver attach(9E) or detach(9E) entry point. The ddi_fm_init() function usually is called from the attach() entry point. The ddi_fm_init() function allocates and initializes resources according to fmcap. The fmcap parameter must be set to the bitwise-inclusive-OR of the following fault management capabilities:

DDI_FM_EREPORT_CAPABLE - Driver is responsible for and capable of generating FMA protocol error events (ereports) upon detection of an error condition.
DDI_FM_ACCCHK_CAPABLE - Driver is responsible for and capable of checking for errors upon completion of one or more access I/O transactions.
DDI_FM_DMACHK_CAPABLE - Driver is responsible for and capable of checking for errors upon completion of one or more DMA I/O transactions.
DDI_FM_ERRCB_CAPABLE - Driver has an error callback function.

A hardened leaf driver generally sets all these capabilities. However, if its parent nexus is not capable of supporting any one of the requested capabilities, the associated bit is cleared and returned as such to the driver. Before returning from ddi_fm_init(9F), the I/O fault services framework creates a set of fault management capability properties: fm-ereport-capable, fm-accchk-capable, fm-dmachk-capable and fm-errcb-capable. The currently supported fault management capability level is observable by using the prtconf(1M) command.

To make your driver support administrative selection of fault management capabilities, export and set the fault management capability level properties to the values described above in the driver.conf(4) file. The fm-capable properties must be set and read prior to calling ddi_fm_init() with the desired capability list.

The following example from the bge driver shows the bge_fm_init() function, which calls the ddi_fm_init(9F) function. The bge_fm_init() function is called in the bge_attach() function.

static void
bge_fm_init(bge_t *bgep)
{
        ddi_iblock_cookie_t iblk;

        /* Only register with IO Fault Services if we have some capability */
        if (bgep->fm_capabilities) {
                bge_reg_accattr.devacc_attr_access = DDI_FLAGERR_ACC;
                dma_attr.dma_attr_flags = DDI_DMA_FLAGERR;
                /* 
                 * Register capabilities with IO Fault Services
                 */
                ddi_fm_init(bgep->devinfo, &bgep->fm_capabilities, &iblk);
                /*
                 * Initialize pci ereport capabilities if ereport capable
                 */
                if (DDI_FM_EREPORT_CAP(bgep->fm_capabilities) ||
                    DDI_FM_ERRCB_CAP(bgep->fm_capabilities))
                        pci_ereport_setup(bgep->devinfo);
                /*
                 * Register error callback if error callback capable
                 */
                if (DDI_FM_ERRCB_CAP(bgep->fm_capabilities))
                        ddi_fm_handler_register(bgep->devinfo,
                        bge_fm_error_cb, (void*) bgep);
        } else {
                /*
                 * These fields have to be cleared of FMA if there are no
                 * FMA capabilities at runtime.
                 */
                bge_reg_accattr.devacc_attr_access = DDI_DEFAULT_ACC;
                dma_attr.dma_attr_flags = 0;
        }
}

Cleaning Up Fault Management Resources

The ddi_fm_fini(9F) function cleans up resources allocated to support fault management for dip.

void ddi_fm_fini(dev_info_t *dip)

The ddi_fm_fini() function can be called from kernel context in a driver attach(9E) or detach(9E) entry point.

The following example from the bge driver shows the bge_fm_fini() function, which calls the ddi_fm_fini(9F) function. The bge_fm_fini() function is called in the bge_unattach() function, which is called in both the bge_attach() and bge_detach() functions.

static void
bge_fm_fini(bge_t *bgep)
{
        /* Only unregister FMA capabilities if we registered some */
        if (bgep->fm_capabilities) {
                /*
                 * Release any resources allocated by pci_ereport_setup()
                 */
                if (DDI_FM_EREPORT_CAP(bgep->fm_capabilities) ||
                    DDI_FM_ERRCB_CAP(bgep->fm_capabilities))
                        pci_ereport_teardown(bgep->devinfo);
                /*
                 * Un-register error callback if error callback capable
                 */
                if (DDI_FM_ERRCB_CAP(bgep->fm_capabilities))
                        ddi_fm_handler_unregister(bgep->devinfo);
                /*
                 * Unregister from IO Fault Services
                 */
                ddi_fm_fini(bgep->devinfo);
        }
}

Getting the Fault Management Capability Bit Mask

The ddi_fm_capable(9F) function returns the capability bit mask currently set for dip.

void ddi_fm_capable(dev_info_t *dip)

Reporting Errors

This section provides information about the following topics:

Queueing an Error Event discusses how to queue error events.
Detecting and Reporting PCI-Related Errors describes how to report PCI-related errors.
Reporting Standard I/O Controller Errors describes how to report standard I/O controller errors.
Service Impact Function discusses how to report whether an error has impacted the services provided by a device.

Queueing an Error Event

The ddi_fm_ereport_post(9F) function causes an ereport event to be queued for delivery to the fault manager daemon, fmd(1M).

void ddi_fm_ereport_post(dev_info_t *dip, 
                         const char *error_class, 
                         uint64_t ena, 
                         int sflag, ...)

The sflag parameter indicates whether the caller is willing to wait for system memory and event channel resources to become available.

The ENA indicates the Error Numeric Association (ENA) for this error report. The ENA might have been initialized and obtained from another error detecting software module such as a bus nexus driver. If the ENA is set to 0, it will be initialized by ddi_fm_ereport_post().

The name-value pair (nvpair) variable argument list contains one or more name, type, value pointer nvpair tuples for non-array data_type_t types or one or more name, type, number of element, value pointer tuples for data_type_t array types. The nvpair tuples make up the ereport event payload required for diagnosis. The end of the argument list is specified by NULL.

The ereport class names and payloads described in Reporting Standard I/O Controller Errors for I/O controllers are used as appropriate for error_class. Other ereport class names and payloads can be defined, but they must be registered in the Oracle event registry and accompanied by driver specific diagnosis engine software, or the Eversholt fault tree (eft) rules.

void
bge_fm_ereport(bge_t *bgep, char *detail)
{
        uint64_t ena;
        char buf[FM_MAX_CLASS];
        (void) snprintf(buf, FM_MAX_CLASS, "%s.%s", DDI_FM_DEVICE, detail);
        ena = fm_ena_generate(0, FM_ENA_FMT1);
        if (DDI_FM_EREPORT_CAP(bgep->fm_capabilities)) {
                ddi_fm_ereport_post(bgep->devinfo, buf, ena, DDI_NOSLEEP,
                    FM_VERSION, DATA_TYPE_UINT8, FM_EREPORT_VERS0, NULL);
        }
}

Detecting and Reporting PCI-Related Errors

PCI-related errors, including PCI, PCI-X, and PCI-E, are automatically detected and reported when you use pci_ereport_post(9F).

void pci_ereport_post(dev_info_t *dip, ddi_fm_error_t *derr, uint16_t *xx_status)

Drivers do not need to generate driver-specific ereports for errors that occur in the PCI Local Bus configuration status registers. The pci_ereport_post() function can report data parity errors, master aborts, target aborts, signaled system errors, and much more.

If pci_ereport_post() is to be used by a driver, then pci_ereport_setup(9F) must have been previously called during the driver's attach(9E) routine, and pci_ereport_teardown(9F) must subsequently be called during the driver's detach(9E) routine.

The bge code samples below show the bge driver invoking the pci_ereport_post() function from the driver's error handler.

/*
 * The I/O fault service error handling callback function
 */
/*ARGSUSED*/
static int
bge_fm_error_cb(dev_info_t *dip, ddi_fm_error_t *err, const void *impl_data)
{
     /*
      * as the driver can always deal with an error 
      * in any dma or access handle, we can just return 
      * the fme_status value.
      */
     pci_ereport_post(dip, err, NULL);
     return (err->fme_status);
}

Reporting Standard I/O Controller Errors

A standard set of device ereports is defined for commonly seen errors for I/O controllers. These ereports should be generated whenever one of the error symptoms described in this section is detected.

The ereports described in this section are dispatched for diagnosis to the eft diagnosis engine, which uses a common set of standard rules to diagnose them. Any other errors detected by device drivers must be defined as ereport events in the Sun event registry and must be accompanied by device specific diagnosis software or eft rules.

DDI_FM_DEVICE_INVAL_STATE

The driver has detected that the device is in an invalid state.

A driver should post an error when it detects that the data it transmits or receives appear to be invalid. For example, in the bge code, the bge_chip_reset() and bge_receive_ring() routines generate the ereport.io.device.inval_state error when these routines detect invalid data.

/*
 * The SEND INDEX registers should be reset to zero by the
 * global chip reset; if they're not, there'll be trouble
 * later on.
 */
sx0 = bge_reg_get32(bgep, NIC_DIAG_SEND_INDEX_REG(0));
if (sx0 != 0) {
    BGE_REPORT((bgep, "SEND INDEX - device didn't RESET"));
    bge_fm_ereport(bgep, DDI_FM_DEVICE_INVAL_STATE);
    return (DDI_FAILURE);
}
/* ... */
/*
 * Sync (all) the receive ring descriptors
 * before accepting the packets they describe
 */
DMA_SYNC(rrp->desc, DDI_DMA_SYNC_FORKERNEL);
if (*rrp->prod_index_p >= rrp->desc.nslots) {
    bgep->bge_chip_state = BGE_CHIP_ERROR;
    bge_fm_ereport(bgep, DDI_FM_DEVICE_INVAL_STATE);
    return (NULL);
}

DDI_FM_DEVICE_INTERN_CORR

The device has reported a self-corrected internal error. For example, a correctable ECC error has been detected by the hardware in an internal buffer within the device.This error flag is not used in the bge driver.

DDI_FM_DEVICE_INTERN_UNCORR

The device has reported an uncorrectable internal error. For example, an uncorrectable ECC error has been detected by the hardware in an internal buffer within the device.

This error flag is not used in the bge driver.

DDI_FM_DEVICE_STALL

The driver has detected that data transfer has stalled unexpectedly.

The bge_factotum_stall_check() routine provides an example of stall detection.

dogval = bge_atomic_shl32(&bgep->watchdog, 1);
if (dogval < bge_watchdog_count)
    return (B_FALSE);

BGE_REPORT((bgep, "Tx stall detected, 
watchdog code 0x%x", dogval));
bge_fm_ereport(bgep, DDI_FM_DEVICE_STALL);
return (B_TRUE);

DDI_FM_DEVICE_NO_RESPONSE

The device is not responding to a driver command.

bge_chip_poll_engine(bge_t *bgep, bge_regno_t regno,
        uint32_t mask, uint32_t val)
{
        uint32_t regval;
        uint32_t n;

        for (n = 200; n; --n) {
                regval = bge_reg_get32(bgep, regno);
                if ((regval & mask) == val)
                        return (B_TRUE);
                drv_usecwait(100);
        }
        bge_fm_ereport(bgep, DDI_FM_DEVICE_NO_RESPONSE);
        return (B_FALSE);
}

DDI_FM_DEVICE_BADINT_LIMIT

The device has raised too many consecutive invalid interrupts.

The bge_intr() routine within the bge driver provides an example of stuck interrupt detection. The bge_fm_ereport() function is a wrapper for the ddi_fm_ereport_post(9F) function. See the bge_fm_ereport() example in Queueing an Error Event.

if (bgep->missed_dmas >= bge_dma_miss_limit) {
    /*
     * If this happens multiple times in a row,
     * it means DMA is just not working.  Maybe
     * the chip has failed, or maybe there's a
     * problem on the PCI bus or in the host-PCI
     * bridge (Tomatillo).
     *
     * At all events, we want to stop further
     * interrupts and let the recovery code take
     * over to see whether anything can be done
     * about it ...
     */
    bge_fm_ereport(bgep,
        DDI_FM_DEVICE_BADINT_LIMIT);
    goto chip_stop;
}

Service Impact Function

A fault management capable driver must indicate whether or not an error has impacted the services provided by a device. Following detection of an error and, if necessary, a shutdown of services, the driver should invoke the

Chapter 14 Layered Driver Interface (LDI)

The LDI is a set of DDI/DKI that enables a kernel module to access other devices in the system. The LDI also enables you to determine which devices are currently being used by kernel modules.

This chapter covers the following topics:

LDI Overview

The LDI includes two categories of interfaces:

Kernel interfaces. User applications use system calls to open, read, and write to devices that are managed by a device driver within the kernel. Kernel modules can use the LDI kernel interfaces to open, read, and write to devices that are managed by another device driver within the kernel. For example, a user application might use read(2) and a kernel module might use ldi_read(9F) to read the same device. See Kernel Interfaces.
User interfaces. The LDI user interfaces can provide information to user processes regarding which devices are currently being used by other devices in the kernel. See User Interfaces.

The following terms are commonly used in discussing the LDI:

Target Device. A target device is a device within the kernel that is managed by a device driver and is being accessed by a device consumer.
Device Consumer. A device consumer is a user process or kernel module that opens and accesses a target device. A device consumer normally performs operations such as open, read, write, or ioctl on a target device.
Kernel Device Consumer. A kernel device consumer is a particular kind of device consumer. A kernel device consumer is a kernel module that accesses a target device. The kernel device consumer usually is not the device driver that manages the target device that is being accessed. Instead, the kernel device consumer accesses the target device indirectly through the device driver that manages the target device.
Layered Driver. A layered driver is a particular kind of kernel device consumer. A layered driver is a kernel driver that does not directly manage any piece of hardware. Instead, a layered driver accesses one of more target devices indirectly through the device drivers that manage those target devices. Volume managers and STREAMS multiplexers are good examples of layered drivers.

Kernel Interfaces

Some LDI kernel interfaces enable the LDI to track and report kernel device usage information. See Layered Identifiers – Kernel Device Consumers.

Other LDI kernel interfaces enable kernel modules to perform access operations such as open, read, and write a target device. These LDI kernel interfaces also enable a kernel device consumer to query property and event information about target devices. See Layered Driver Handles – Target Devices.

LDI Kernel Interfaces Example shows an example driver that uses many of these LDI interfaces.

Layered Identifiers – Kernel Device Consumers

Layered identifiers enable the LDI to track and report kernel device usage information. A layered identifier (ldi_ident_t) identifies a kernel device consumer. Kernel device consumers must obtain a layered identifier prior to opening a target device using the LDI.

Layered drivers are the only supported types of kernel device consumers. Therefore, a layered driver must obtain a layered identifier that is associated with the device number, the device information node, or the stream of the layered driver. The layered identifier is associated with the layered driver. The layered identifier is not associated with the target device.

You can retrieve the kernel device usage information that is collected by the LDI by using the libdevinfo(3LIB) interfaces, the fuser(1M) command, or the prtconf(1M) command. For example, the prtconf(1M) command can show which target devices a layered driver is accessing or which layered drivers are accessing a particular target device. See User Interfaces to learn more about how to retrieve device usage information.

The following describes the LDI layered identifier interfaces:

ldi_ident_t: Layered identifier. An opaque type.
ldi_ident_from_dev(9F): Allocate and retrieve a layered identifier that is associated with a dev_t device number.
ldi_ident_from_dip(9F): Allocate and retrieve a layered identifier that is associated with a dev_info_t device information node.
ldi_ident_from_stream(9F): Allocate and retrieve a layered identifier that is associated with a stream.
ldi_ident_release(9F): Release a layered identifier that was allocated with ldi_ident_from_dev(9F), ldi_ident_from_dip(9F), or ldi_ident_from_stream(9F).

Layered Driver Handles – Target Devices

Kernel device consumers must use a layered driver handle (ldi_handle_t) to access a target device through LDI interfaces. The ldi_handle_t type is valid only with LDI interfaces. The LDI allocates and returns this handle when the LDI successfully opens a device. A kernel device consumer can then use this handle to access the target device through the LDI interfaces. The LDI deallocates the handle when the LDI closes the device. See LDI Kernel Interfaces Example for an example.

This section discusses how kernel device consumers can access target devices and retrieve different types of information. See Opening and Closing Target Devices to learn how kernel device consumers can open and close target devices. See Accessing Target Devices to learn how kernel device consumers can perform operations such as read, write, strategy, and ioctl on target devices. Retrieving Target Device Information describes interfaces that retrieve target device information such as device open type and device minor name. Retrieving Target Device Property Values describes interfaces that retrieve values and address of target device properties. See Receiving Asynchronous Device Event Notification to learn how kernel device consumers can receive event notification from target devices.

Opening and Closing Target Devices

This section describes the LDI kernel interfaces for opening and closing target devices. The open interfaces take a pointer to a layered driver handle. The open interfaces attempt to open the target device specified by the device number, device ID, or path name. If the open operation is successful, the open interfaces allocate and return a layered driver handle that can be used to access the target device. The close interface closes the target device associated with the specified layered driver handle and then frees the layered driver handle.

ldi_handle_t: Layered driver handle for target device access. An opaque data structure that is returned when a device is successfully opened.
ldi_open_by_dev(9F): Open the device specified by the dev_t device number parameter.
ldi_open_by_devid(9F): Open the device specified by the ddi_devid_t device ID parameter. You also must specify the minor node name to open.
ldi_open_by_name(9F): Open a device by path name. The path name is a null-terminated string in the kernel address space. The path name must be an absolute path, beginning with a forward slash character (/).
ldi_close(9F): Close a device that was opened with ldi_open_by_dev(9F), ldi_open_by_devid(9F), or ldi_open_by_name(9F). After ldi_close(9F) returns, the layered driver handle of the device that was closed is no longer valid.

Accessing Target Devices

This section describes the LDI kernel interfaces for accessing target devices. These interfaces enable a kernel device consumer to perform operations on the target device specified by the layered driver handle. Kernel device consumers can perform operations such as read, write, strategy, and ioctl on the target device.

ldi_handle_t: Layered driver handle for target device access. An opaque data structure.
ldi_read(9F): Pass a read request to the device entry point for the target device. This operation is supported for block, character, and STREAMS devices.
ldi_aread(9F): Pass an asynchronous read request to the device entry point for the target device. This operation is supported for block and character devices.
ldi_write(9F): Pass a write request to the device entry point for the target device. This operation is supported for block, character, and STREAMS devices.
ldi_awrite(9F): Pass an asynchronous write request to the device entry point for the target device. This operation is supported for block and character devices.
ldi_strategy(9F): Pass a strategy request to the device entry point for the target device. This operation is supported for block and character devices.
ldi_dump(9F): Pass a dump request to the device entry point for the target device. This operation is supported for block and character devices.
ldi_poll(9F): Pass a poll request to the device entry point for the target device. This operation is supported for block, character, and STREAMS devices.
ldi_ioctl(9F): Pass an ioctl request to the device entry point for the target device. This operation is supported for block, character, and STREAMS devices. The LDI supports STREAMS linking and STREAMS ioctl commands. See the “STREAM IOCTLS” section of the ldi_ioctl(9F) man page. See also the ioctl commands in the streamio(7I) man page.
ldi_devmap(9F): Pass a devmap request to the device entry point for the target device. This operation is supported for block and character devices.
ldi_getmsg(9F): Get a message block from a stream.
ldi_putmsg(9F): Put a message block on a stream.

Retrieving Target Device Information

This section describes LDI interfaces that kernel device consumers can use to retrieve device information about a specified target device. A target device is specified by a layered driver handle. A kernel device consumer can receive information such as device number, device open type, device ID, device minor name, and device size.

ldi_get_dev(9F): Get the dev_t device number for the target device specified by the layered driver handle.
ldi_get_otyp(9F): Get the open flag that was used to open the target device specified by the layered driver handle. This flag tells you whether the target device is a character device or a block device.
ldi_get_devid(9F): Get the ddi_devid_t device ID for the target device specified by the layered driver handle. Use ddi_devid_free(9F) to free the ddi_devid_t when you are finished using the device ID.
ldi_get_minor_name(9F): Retrieve a buffer that contains the name of the minor node that was opened for the target device. Use kmem_free(9F) to release the buffer when you are finished using the minor node name.
ldi_get_size(9F): Retrieve the partition size of the target device specified by the layered driver handle.

Retrieving Target Device Property Values

This section describes LDI interfaces that kernel device consumers can use to retrieve property information about a specified target device. A target device is specified by a layered driver handle. A kernel device consumer can receive values and addresses of properties and determine whether a property exists.

ldi_prop_exists(9F): Return 1 if the property exists for the target device specified by the layered driver handle. Return 0 if the property does not exist for the specified target device.
ldi_prop_get_int(9F): Search for an int integer property that is associated with the target device specified by the layered driver handle. If the integer property is found, return the property value.
ldi_prop_get_int64(9F): Search for an int64_t integer property that is associated with the target device specified by the layered driver handle. If the integer property is found, return the property value.
ldi_prop_lookup_int_array(9F): Retrieve the address of an int integer array property value for the target device specified by the layered driver handle.
ldi_prop_lookup_int64_array(9F): Retrieve the address of an int64_t integer array property value for the target device specified by the layered driver handle.
ldi_prop_lookup_string(9F): Retrieve the address of a null-terminated string property value for the target device specified by the layered driver handle.
ldi_prop_lookup_string_array(9F): Retrieve the address of an array of strings. The string array is an array of pointers to null-terminated strings of property values for the target device specified by the layered driver handle.
ldi_prop_lookup_byte_array(9F): Retrieve the address of an array of bytes. The byte array is a property value of the target device specified by the layered driver handle.

Receiving Asynchronous Device Event Notification

The LDI enables kernel device consumers to register for event notification and to receive event notification from target devices. A kernel device consumer can register an event handler that will be called when the event occurs. The kernel device consumer must open a device and receive a layered driver handle before the kernel device consumer can register for event notification with the LDI event notification interfaces.

The LDI event notification interfaces enable a kernel device consumer to specify an event name and to retrieve an associated kernel event cookie. The kernel device consumer can then pass the layered driver handle (ldi_handle_t), the cookie (ddi_eventcookie_t), and the event handler to ldi_add_event_handler(9F) to register for event notification. When registration completes successfully, the kernel device consumer receives a unique LDI event handler identifier (ldi_callback_id_t). The LDI event handler identifier is an opaque type that can be used only with the LDI event notification interfaces.

The LDI provides a framework to register for events generated by other devices. The LDI itself does not define any event types or provide interfaces for generating events.

The following describes the LDI asynchronous event notification interfaces:

ldi_callback_id_t: Event handler identifier. An opaque type.
ldi_get_eventcookie(9F): Retrieve an event service cookie for the target device specified by the layered driver handle.
ldi_add_event_handler(9F): Add the callback handler specified by the ldi_callback_id_t registration identifier. The callback handler is invoked when the event specified by the ddi_eventcookie_t cookie occurs.
ldi_remove_event_handler(9F): Remove the callback handler specified by the ldi_callback_id_t registration identifier.

LDI Kernel Interfaces Example

This section shows an example kernel device consumer that uses some of the LDI calls discussed in the preceding sections in this chapter. This section discusses the following aspects of this example module:

This example kernel device consumer is named lyr. The lyr module is a layered driver that uses LDI calls to send data to a target device. In its open(9E) entry point, the lyr driver opens the device that is specified by the lyr_targ property in the lyr.conf configuration file. In its write(9E) entry point, the lyr driver writes all of its incoming data to the device specified by the lyr_targ property.

Device Configuration File

In the configuration file shown below, the target device that the lyr driver is writing to is the console.

Example 14–1 Configuration File

#
# Use is subject to license terms.
#
#pragma ident	"%Z%%M%	%I%	%E% SMI"

name="lyr" parent="pseudo" instance=1;
lyr_targ="/dev/console";

Driver Source File

In the driver source file shown below, the lyr_state_t structure holds the soft state for the lyr driver. The soft state includes the layered driver handle (lh) for the lyr_targ device and the layered identifier (li) for the lyr device. For more information on soft state, see Retrieving Driver Soft State Information.

In the lyr_open() entry point, ddi_prop_lookup_string(9F) retrieves from the lyr_targ property the name of the target device for the lyr device to open. The ldi_ident_from_dev(9F) function gets an LDI layered identifier for the lyr device. The ldi_open_by_name(9F) function opens the lyr_targ device and gets a layered driver handle for the lyr_targ device.

Note that if any failure occurs in lyr_open(), the ldi_close(9F), ldi_ident_release(9F), and ddi_prop_free(9F) calls undo everything that was done. The ldi_close(9F) function closes the lyr_targ device. The ldi_ident_release(9F) function releases the lyr layered identifier. The ddi_prop_free(9F) function frees resources allocated when the lyr_targ device name was retrieved. If no failure occurs, the ldi_close(9F) and ldi_ident_release(9F) functions are called in the lyr_close() entry point.

In the last line of the driver module, the ldi_write(9F) function is called. The ldi_write(9F) function takes the data written to the lyr device in the lyr_write() entry point and writes that data to the lyr_targ device. The ldi_write(9F) function uses the layered driver handle for the lyr_targ device to write the data to the lyr_targ device.

Example 14–2 Driver Source File

#include <sys/types.h>
#include <sys/file.h>
#include <sys/errno.h>
#include <sys/open.h>
#include <sys/cred.h>
#include <sys/cmn_err.h>
#include <sys/modctl.h>
#include <sys/conf.h>
#include <sys/stat.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>
#include <sys/sunldi.h>

typedef struct lyr_state {
    ldi_handle_t    lh;
    ldi_ident_t     li;
    dev_info_t      *dip;
    minor_t         minor;
    int             flags;
    kmutex_t        lock;
} lyr_state_t;

#define LYR_OPENED      0x1     /* lh is valid */
#define LYR_IDENTED     0x2     /* li is valid */

static int lyr_info(dev_info_t *, ddi_info_cmd_t, void *, void **);
static int lyr_attach(dev_info_t *, ddi_attach_cmd_t);
static int lyr_detach(dev_info_t *, ddi_detach_cmd_t);
static int lyr_open(dev_t *, int, int, cred_t *);
static int lyr_close(dev_t, int, int, cred_t *);
static int lyr_write(dev_t, struct uio *, cred_t *);

static void *lyr_statep;

static struct cb_ops lyr_cb_ops = {
    lyr_open,            /* open */
    lyr_close,           /* close */
    nodev,               /* strategy */
    nodev,               /* print */
    nodev,               /* dump */
    nodev,               /* read */
    lyr_write,           /* write */
    nodev,               /* ioctl */
    nodev,               /* devmap */
    nodev,               /* mmap */
    nodev,               /* segmap */
    nochpoll,            /* poll */
    ddi_prop_op,         /* prop_op */
    NULL,                /* streamtab  */
    D_NEW | D_MP,        /* cb_flag */
    CB_REV,              /* cb_rev */
    nodev,               /* aread */
    nodev                /* awrite */
};

static struct dev_ops lyr_dev_ops = {
    DEVO_REV,            /* devo_rev, */
    0,                   /* refcnt  */
    lyr_info,            /* getinfo */
    nulldev,             /* identify */
    nulldev,             /* probe */
    lyr_attach,          /* attach */
    lyr_detach,          /* detach */
    nodev,               /* reset */
    &lyr_cb_ops,         /* cb_ops */
    NULL,                /* bus_ops */
    NULL,                /* power */
    ddi_quiesce_not_needed,     /* quiesce */
};

static struct modldrv modldrv = {
    &mod_driverops,
    "LDI example driver",
    &lyr_dev_ops
};

static struct modlinkage modlinkage = {
    MODREV_1,
    &modldrv,
    NULL
};

int
_init(void)
{
    int rv;

    if ((rv = ddi_soft_state_init(&lyr_statep, sizeof (lyr_state_t),
        0)) != 0) {
        cmn_err(CE_WARN, "lyr _init: soft state init failed\n");
        return (rv);
    }
    if ((rv = mod_install(&modlinkage)) != 0) {
        cmn_err(CE_WARN, "lyr _init: mod_install failed\n");
        goto FAIL;
    }
    return (rv);
    /*NOTEREACHED*/
FAIL:
    ddi_soft_state_fini(&lyr_statep);
    return (rv);
}

int
_info(struct modinfo *modinfop)
{
    return (mod_info(&modlinkage, modinfop));
}

int
_fini(void)
{
    int rv;

    if ((rv = mod_remove(&modlinkage)) != 0) {
        return(rv);
    }
    ddi_soft_state_fini(&lyr_statep);
    return (rv);
}
/*
 * 1:1 mapping between minor number and instance
 */
static int
lyr_info(dev_info_t *dip, ddi_info_cmd_t infocmd, void *arg, void **result)
{
    int inst;
    minor_t minor;
    lyr_state_t *statep;
    char *myname = "lyr_info";

    minor = getminor((dev_t)arg);
    inst = minor;
    switch (infocmd) {
    case DDI_INFO_DEVT2DEVINFO:
        statep = ddi_get_soft_state(lyr_statep, inst);
        if (statep == NULL) {
            cmn_err(CE_WARN, "%s: get soft state "
                "failed on inst %d\n", myname, inst);
            return (DDI_FAILURE);
        }
        *result = (void *)statep->dip;
        break;
    case DDI_INFO_DEVT2INSTANCE:
        *result = (void *)inst;
        break;
    default:
        break;
    }

    return (DDI_SUCCESS);
}

static int
lyr_attach(dev_info_t *dip, ddi_attach_cmd_t cmd)
{
    int inst;
    lyr_state_t *statep;
    char *myname = "lyr_attach";

    switch (cmd) {
    case DDI_ATTACH:
        inst = ddi_get_instance(dip);

        if (ddi_soft_state_zalloc(lyr_statep, inst) != DDI_SUCCESS) {
            cmn_err(CE_WARN, "%s: ddi_soft_state_zallac failed "
                "on inst %d\n", myname, inst);
            goto FAIL;
        }
        statep = (lyr_state_t *)ddi_get_soft_state(lyr_statep, inst);
        if (statep == NULL) {
            cmn_err(CE_WARN, "%s: ddi_get_soft_state failed on "
                "inst %d\n", myname, inst);
            goto FAIL;
        }
        statep->dip = dip;
        statep->minor = inst;
        if (ddi_create_minor_node(dip, "node", S_IFCHR, statep->minor,
            DDI_PSEUDO, 0) != DDI_SUCCESS) {
            cmn_err(CE_WARN, "%s: ddi_create_minor_node failed on "
                "inst %d\n", myname, inst);
            goto FAIL;
        }
        mutex_init(&statep->lock, NULL, MUTEX_DRIVER, NULL);
        return (DDI_SUCCESS);
    case DDI_RESUME:
    case DDI_PM_RESUME:
    default:
        break;
    }
    return (DDI_FAILURE);
    /*NOTREACHED*/
FAIL:
    ddi_soft_state_free(lyr_statep, inst);
    ddi_remove_minor_node(dip, NULL);
    return (DDI_FAILURE);
}

static int
lyr_detach(dev_info_t *dip, ddi_detach_cmd_t cmd)
{
    int inst;
    lyr_state_t *statep;
    char *myname = "lyr_detach";

    inst = ddi_get_instance(dip);
    statep = ddi_get_soft_state(lyr_statep, inst);
    if (statep == NULL) {
        cmn_err(CE_WARN, "%s: get soft state failed on "
            "inst %d\n", myname, inst);
        return (DDI_FAILURE);
    }
    if (statep->dip != dip) {
        cmn_err(CE_WARN, "%s: soft state does not match devinfo "
            "on inst %d\n", myname, inst);
        return (DDI_FAILURE);
    }
    switch (cmd) {
    case DDI_DETACH:
        mutex_destroy(&statep->lock);
        ddi_soft_state_free(lyr_statep, inst);
        ddi_remove_minor_node(dip, NULL);
        return (DDI_SUCCESS);
    case DDI_SUSPEND:
    case DDI_PM_SUSPEND:
    default:
        break;
    }
    return (DDI_FAILURE);
}
/*
 * on this driver's open, we open the target specified by a property and store
 * the layered handle and ident in our soft state.  a good target would be
 * "/dev/console" or more interestingly, a pseudo terminal as specified by the
 * tty command
 */
/*ARGSUSED*/
static int
lyr_open(dev_t *devtp, int oflag, int otyp, cred_t *credp)
{
    int rv, inst = getminor(*devtp);
    lyr_state_t *statep;
    char *myname = "lyr_open";
    dev_info_t *dip;
    char *lyr_targ = NULL;

    statep = (lyr_state_t *)ddi_get_soft_state(lyr_statep, inst);
    if (statep == NULL) {
        cmn_err(CE_WARN, "%s: ddi_get_soft_state failed on "
            "inst %d\n", myname, inst);
        return (EIO);
    }
    dip = statep->dip;
    /*
     * our target device to open should be specified by the "lyr_targ"
     * string property, which should be set in this driver's .conf file
     */
    if (ddi_prop_lookup_string(DDI_DEV_T_ANY, dip, DDI_PROP_NOTPROM,
        "lyr_targ", &lyr_targ) != DDI_PROP_SUCCESS) {
        cmn_err(CE_WARN, "%s: ddi_prop_lookup_string failed on "
            "inst %d\n", myname, inst);
        return (EIO);
    }
    /*
     * since we only have one pair of lh's and li's available, we don't
     * allow multiple on the same instance
     */
    mutex_enter(&statep->lock);
    if (statep->flags & (LYR_OPENED | LYR_IDENTED)) {
        cmn_err(CE_WARN, "%s: multiple layered opens or idents "
            "from inst %d not allowed\n", myname, inst);
        mutex_exit(&statep->lock);
        ddi_prop_free(lyr_targ);
        return (EIO);
    }
    rv = ldi_ident_from_dev(*devtp, &statep->li);
    if (rv != 0) {
        cmn_err(CE_WARN, "%s: ldi_ident_from_dev failed on inst %d\n",
            myname, inst);
        goto FAIL;
    }
    statep->flags |= LYR_IDENTED;
    rv = ldi_open_by_name(lyr_targ, FREAD | FWRITE, credp, &statep->lh,
        statep->li);
    if (rv != 0) {
        cmn_err(CE_WARN, "%s: ldi_open_by_name failed on inst %d\n",
            myname, inst);
        goto FAIL;
    }
    statep->flags |= LYR_OPENED;
    cmn_err(CE_CONT, "\n%s: opened target '%s' successfully on inst %d\n",
        myname, lyr_targ, inst);
    rv = 0;

FAIL:
    /* cleanup on error */
    if (rv != 0) {
        if (statep->flags & LYR_OPENED)
            (void)ldi_close(statep->lh, FREAD | FWRITE, credp);
        if (statep->flags & LYR_IDENTED)
            ldi_ident_release(statep->li);
        statep->flags &= ~(LYR_OPENED | LYR_IDENTED);
    }
    mutex_exit(&statep->lock);
    if (lyr_targ != NULL)
        ddi_prop_free(lyr_targ);
    return (rv);
}
/*
 * on this driver's close, we close the target indicated by the lh member
 * in our soft state and release the ident, li as well.  in fact, we MUST do
 * both of these at all times even if close yields an error because the
 * device framework effectively closes the device, releasing all data
 * associated with it and simply returning whatever value the target's
 * close(9E) returned.  therefore, we must as well.
 */
/*ARGSUSED*/
static int
lyr_close(dev_t devt, int oflag, int otyp, cred_t *credp)
{
    int rv, inst = getminor(devt);
    lyr_state_t *statep;
    char *myname = "lyr_close";
    statep = (lyr_state_t *)ddi_get_soft_state(lyr_statep, inst);
    if (statep == NULL) {
        cmn_err(CE_WARN, "%s: ddi_get_soft_state failed on "
            "inst %d\n", myname, inst);
        return (EIO);
    }
    mutex_enter(&statep->lock);
    rv = ldi_close(statep->lh, FREAD | FWRITE, credp);
    if (rv != 0) {
        cmn_err(CE_WARN, "%s: ldi_close failed on inst %d, but will ",
            "continue to release ident\n", myname, inst);
    }
    ldi_ident_release(statep->li);
    if (rv == 0) {
        cmn_err(CE_CONT, "\n%s: closed target successfully on "
            "inst %d\n", myname, inst);
    }
    statep->flags &= ~(LYR_OPENED | LYR_IDENTED);
    mutex_exit(&statep->lock);
    return (rv);
}
/*
 * echo the data we receive to the target
 */
/*ARGSUSED*/
static int
lyr_write(dev_t devt, struct uio *uiop, cred_t *credp)
{
    int rv, inst = getminor(devt);
    lyr_state_t *statep;
    char *myname = "lyr_write";

    statep = (lyr_state_t *)ddi_get_soft_state(lyr_statep, inst);
    if (statep == NULL) {
        cmn_err(CE_WARN, "%s: ddi_get_soft_state failed on "
            "inst %d\n", myname, inst);
        return (EIO);
    }
    return (ldi_write(statep->lh, uiop, credp));
}

How to Build and Load the Layered Driver

Compile the driver.

Use the -D_KERNEL option to indicate that this is a kernel module.
- If you are compiling for a SPARC architecture, use the -xarch=v9 option:
  % cc -c -D_KERNEL -xarch=v9 lyr.c
- If you are compiling for a 32-bit x86 architecture, use the following command:
  % cc -c -D_KERNEL lyr.c

Link the driver.
% ld -r -o lyr lyr.o

Install the configuration file.

As user root, copy the configuration file to the kernel driver area of the machine:
# cp lyr.conf /usr/kernel/drv

Install the driver binary.
- As user root, copy the driver binary to the sparcv9 driver area on a SPARC architecture:
  # cp lyr /usr/kernel/drv/sparcv9
- As user root, copy the driver binary to the drv driver area on a 32-bit x86 architecture:
  # cp lyr /usr/kernel/drv

Load the driver.

As user root, use the add_drv(1M) command to load the driver.
# add_drv lyr
List the pseudo devices to confirm that the lyr device now exists:
# ls /devices/pseudo | grep lyr lyr@1 lyr@1:node

Test the Layered Driver

To test the lyr driver, write a message to the lyr device and verify that the message displays on the lyr_targ device.

Example 14–3 Write a Short Message to the Layered Device

In this example, the lyr_targ device is the console of the system where the lyr device is installed.

If the display you are viewing is also the display for the console device of the system where the lyr device is installed, note that writing to the console will corrupt your display. The console messages will appear outside your window system. You will need to redraw or refresh your display after testing the lyr driver.

If the display you are viewing is not the display for the console device of the system where the lyr device is installed, log into or otherwise gain a view of the display of the target console device.

The following command writes a very brief message to the lyr device:

# echo "\n\n\t===> Hello World!! <===\n" > /devices/pseudo/lyr@1:node

You should see the following messages displayed on the target console:

console login:

    ===> Hello World!! <===

lyr: 
lyr_open: opened target '/dev/console' successfully on inst 1
lyr: 
lyr_close: closed target successfully on inst 1

The messages from lyr_open() and lyr_close() come from the cmn_err(9F) calls in the lyr_open() and lyr_close() entry points.

Example 14–4 Write a Longer Message to the Layered Device

The following command writes a longer message to the lyr device:

# cat lyr.conf > /devices/pseudo/lyr@1:node

You should see the following messages displayed on the target console:

lyr: 
lyr_open: opened target '/dev/console' successfully on inst 1
#
# Use is subject to license terms.
#
#pragma ident   "%Z%%M% %I%     %E% SMI"

name="lyr" parent="pseudo" instance=1;
lyr_targ="/dev/console";
lyr: 
lyr_close: closed target successfully on inst 1

Example 14–5 Change the Target Device

To change the target device, edit /usr/kernel/drv/lyr.conf and change the value of the lyr_targ property to be a path to a different target device. For example, the target device could be the output of a tty command in a local terminal. An example of such a device path is /dev/pts/4.

Make sure the lyr device is not in use before you update the driver to use the new target device.

# modinfo -c | grep lyr
174          3 lyr                              UNLOADED/UNINSTALLED

Use the update_drv(1M) command to reload the lyr.conf configuration file:

# update_drv lyr

Write a message to the lyr device again and verify that the message displays on the new lyr_targ device.

User Interfaces

The LDI includes user-level library and command interfaces to report device layering and usage information. Device Information Library Interfaces discusses the libdevinfo(3LIB) interfaces for reporting device layering information. Print System Configuration Command Interfaces discusses the prtconf(1M) interfaces for reporting kernel device usage information. Device User Command Interfaces discusses the fuser(1M) interfaces for reporting device consumer information.

Device Information Library Interfaces

The LDI includes libdevinfo(3LIB) interfaces that report a snapshot of device layering information. Device layering occurs when one device in the system is a consumer of another device in the system. Device layering information is reported only if both the consumer and the target are bound to a device node that is contained within the snapshot.

Device layering information is reported by the libdevinfo(3LIB) interfaces as a directed graph. An lnode is an abstraction that represents a vertex in the graph and is bound to a device node. You can use libdevinfo(3LIB) interfaces to access properties of an lnode, such as the name and device number of the node.

The edges in the graph are represented by a link. A link has a source lnode that represents the device consumer. A link also has a target lnode that represents the target device.

The following describes the libdevinfo(3LIB) device layering information interfaces:

DINFOLYR: Snapshot flag that enables you to capture device layering information.
di_link_t: A directed link between two endpoints. Each endpoint is a di_lnode_t. An opaque structure.
di_lnode_t: The endpoint of a link. An opaque structure. A di_lnode_t is bound to a di_node_t.
di_node_t: Represents a device node. An opaque structure. A di_node_t is not necessarily bound to a di_lnode_t.
di_walk_link(3DEVINFO): Walk all links in the snapshot.
di_walk_lnode(3DEVINFO): Walk all lnodes in the snapshot.
di_link_next_by_node(3DEVINFO): Get a handle to the next link where the specified di_node_t node is either the source or the target.
di_link_next_by_lnode(3DEVINFO): Get a handle to the next link where the specified di_lnode_t lnode is either the source or the target.
di_link_to_lnode(3DEVINFO): Get the lnode that corresponds to the specified endpoint of a di_link_t link.
di_link_spectype(3DEVINFO): Get the link spectype. The spectype indicates how the target device is being accessed. The target device is represented by the target lnode.
di_lnode_next(3DEVINFO): Get a handle to the next occurrence of the specified di_lnode_t lnode associated with the specified di_node_t device node.
di_lnode_name(3DEVINFO): Get the name that is associated with the specified lnode.
di_lnode_devinfo(3DEVINFO): Get a handle to the device node that is associated with the specified lnode.
di_lnode_devt(3DEVINFO): Get the device number of the device node that is associated with the specified lnode.

The device layering information returned by the LDI can be quite complex. Therefore, the LDI provides interfaces to help you traverse the device tree and the device usage graph. These interfaces enable the consumer of a device tree snapshot to associate custom data pointers with different structures within the snapshot. For example, as an application traverses lnodes, the application can update the custom pointer associated with each lnode to mark which lnodes already have been seen.

The following describes the libdevinfo(3LIB) node and link marking interfaces:

di_lnode_private_set(3DEVINFO): Associate the specified data with the specified lnode. This association enables you to traverse lnodes in the snapshot.
di_lnode_private_get(3DEVINFO): Retrieve a pointer to data that was associated with an lnode through a call to di_lnode_private_set(3DEVINFO).
di_link_private_set(3DEVINFO): Associate the specified data with the specified link. This association enables you to traverse links in the snapshot.
di_link_private_get(3DEVINFO): Retrieve a pointer to data that was associated with a link through a call to di_link_private_set(3DEVINFO).

Print System Configuration Command Interfaces

The prtconf(1M) command is enhanced to display kernel device usage information. The default prtconf(1M) output is not changed. Device usage information is displayed when you specify the verbose option (-v) with the prtconf(1M) command. Usage information about a particular device is displayed when you specify a path to that device on the prtconf(1M) command line.

prtconf -v: Display device minor node and device usage information. Show kernel consumers and the minor nodes each kernel consumer currently has open.
prtconf path: Display device usage information for the device specified by path.
prtconf -a path: Display device usage information for the device specified by path and all device nodes that are ancestors of path.
prtconf -c path: Display device usage information for the device specified by path and all device nodes that are children of path.

Example 14–6 Device Usage Information

When you want usage information about a particular device, the value of the path parameter can be any valid device path.

% prtconf /dev/cfg/c0
SUNW,isptwo, instance #0

Example 14–7 Ancestor Node Usage Information

To display usage information about a particular device and all device nodes that are ancestors of that particular device, specify the -a flag with the prtconf(1M) command. Ancestors include all nodes up to the root of the device tree. If you specify the -a flag with the prtconf(1M) command, then you must also specify a device path name.

% prtconf -a /dev/cfg/c0
SUNW,Sun-Fire
    ssm, instance #0
        pci, instance #0
            pci, instance #0
                SUNW,isptwo, instance #0

Example 14–8 Child Node Usage Information

To display usage information about a particular device and all device nodes that are children of that particular device, specify the -c flag with the prtconf(1M) command. If you specify the -c flag with the prtconf(1M) command, then you must also specify a device path name.

% prtconf -c /dev/cfg/c0
SUNW,isptwo, instance #0
    sd (driver not attached)
    st (driver not attached)
    sd, instance #1
    sd, instance #0
    sd, instance #6
    st, instance #1 (driver not attached)
    st, instance #0 (driver not attached)
    st, instance #2 (driver not attached)
    st, instance #3 (driver not attached)
    st, instance #4 (driver not attached)
    st, instance #5 (driver not attached)
    st, instance #6 (driver not attached)
    ses, instance #0 (driver not attached)
    ...

Example 14–9 Layering and Device Minor Node Information – Keyboard

To display device layering and device minor node information about a particular device, specify the -v flag with the prtconf(1M) command.

% prtconf -v /dev/kbd
conskbd, instance #0
    System properties:
        ...
    Device Layered Over:
        mod=kb8042 dev=(101,0)
            dev_path=/isa/i8042@1,60/keyboard@0
    Device Minor Nodes:
        dev=(103,0)
            dev_path=/pseudo/conskbd@0:kbd
                spectype=chr type=minor
                dev_link=/dev/kbd
        dev=(103,1)
            dev_path=/pseudo/conskbd@0:conskbd
                spectype=chr type=internal
            Device Minor Layered Under:
                mod=wc accesstype=chr
                    dev_path=/pseudo/wc@0

This example shows that the /dev/kbd device is layered on top of the hardware keyboard device (/isa/i8042@1,60/keyboard@0). This example also shows that the /dev/kbd device has two device minor nodes. The first minor node has a /dev link that can be used to access the node. The second minor node is an internal node that is not accessible through the file system. The second minor node has been opened by the wc driver, which is the workstation console. Compare the output from this example to the output from Example 14–12.

Example 14–10 Layering and Device Minor Node Information – Network Device

This example shows which devices are using the currently plumbed network device.

% prtconf -v /dev/iprb0
pci1028,145, instance #0
    Hardware properties:
        ...
    Interrupt Specifications:
        ...
    Device Minor Nodes:
        dev=(27,1)
            dev_path=/pci@0,0/pci8086,244e@1e/pci1028,145@c:iprb0
                spectype=chr type=minor
                alias=/dev/iprb0
        dev=(27,4098)
            dev_path=<clone>
            Device Minor Layered Under:
                mod=udp6 accesstype=chr
                    dev_path=/pseudo/udp6@0
        dev=(27,4097)
            dev_path=<clone>
            Device Minor Layered Under:
                mod=udp accesstype=chr
                    dev_path=/pseudo/udp@0
        dev=(27,4096)
            dev_path=<clone>
            Device Minor Layered Under:
                mod=udp accesstype=chr
                    dev_path=/pseudo/udp@0

This example shows that the iprb0 device has been linked under udp and udp6. Notice that no paths are shown to the minor nodes that udp and udp6 are using. No paths are shown in this case because the minor nodes were created through clone opens of the iprb driver, and therefore there are no file system paths by which these nodes can be accessed. Compare the output from this example to the output from Example 14–11.

Device User Command Interfaces

The fuser(1M) command is enhanced to display device usage information. The fuser(1M) command displays device usage information only if path represents a device minor node. The -d flag is valid for the fuser(1M) command only if you specify a path that represents a device minor node.

fuser path: Display information about application device consumers and kernel device consumers if path represents a device minor node.
fuser -d path: Display all users of the underlying device that is associated with the device minor node represented by path.

Kernel device consumers are reported in one of the following four formats. Kernel device consumers always are surrounded by square brackets ([]).

        [kernel_module_name]
        [kernel_module_name,dev_path=path]
        [kernel_module_name,dev=(major,minor)]
        [kernel_module_name,dev=(major,minor),dev_path=path]

When the fuser(1M) command displays file or device users, the output consists of a process ID on stdout followed by a character on stderr. The character on stderr describes how the file or device is being used. All kernel consumer information is displayed to stderr. No kernel consumer information is displayed to stdout.

If you do not use the -d flag, then the fuser(1M) command reports consumers of only the device minor node that is specified by path. If you use the -d flag, then the fuser(1M) command reports consumers of the device node that underlies the minor node specified by path. The following example illustrates the difference in report output in these two cases.

Example 14–11 Consumers of Underlying Device Nodes

Most network devices clone their minor node when the device is opened. If you request device usage information for the clone minor node, the usage information might show that no process is using the device. If instead you request device usage information for the underlying device node, the usage information might show that a process is using the device. In this example, no device consumers are reported when only a device path is passed to the fuser(1M) command. When the -d flag is used, the output shows that the device is being accessed by udp and udp6.

% fuser /dev/iprb0
/dev/iprb0:
% fuser -d /dev/iprb0
/dev/iprb0:  [udp,dev_path=/pseudo/udp@0] [udp6,dev_path=/pseudo/udp6@0]

Compare the output from this example to the output from Example 14–10.

Example 14–12 Consumer of the Keyboard Device

In this example, a kernel consumer is accessing /dev/kbd. The kernel consumer that is accessing the /dev/kbd device is the workstation console driver.

% fuser -d /dev/kbd
/dev/kbd:  [genunix] [wc,dev_path=/pseudo/wc@0]

Compare the output from this example to the output from Example 14–9.

Part I Designing Device Drivers for the Oracle Solaris Platform

Chapter 1 Overview of Oracle Solaris Device Drivers

Device Driver Basics

What Is a Device Driver?

What Is a Device Driver Entry Point?

Device Driver Entry Points

Entry Points Common to All Drivers

Device Access Entry Points

Loadable Module Entry Points

Autoconfiguration Entry Points

Kernel Statistics Entry Points

Power Management Entry Point

System Quiesce Entry Point

Summary of Common Entry Points

Entry Points for Block Device Drivers

Entry Points for Character Device Drivers

Entry Points for STREAMS Device Drivers

Entry Points for Memory Mapped Devices

Entry Points for Network Device Drivers

Entry Points for SCSI HBA Drivers

Entry Points for PC Card Drivers

Considerations in Device Driver Design

DDI/DKI Facilities

Device IDs

Device Properties

Interrupt Handling

Callback Functions

Software State Management

Programmed I/O Device Access

Direct Memory Access (DMA)

Layered Driver Interfaces

Driver Context

Returning Errors

Dynamic Memory Allocation

Hotplugging

Chapter 2 Oracle Solaris Kernel and Device Tree

What Is the Kernel?

Figure 2–1 Oracle Solaris Kernel

Multithreaded Execution Environment

Virtual Memory

Devices as Special Files

DDI/DKI Interfaces

Overview of the Device Tree

Device Tree Components

Figure 2–2 Example Device Tree

Displaying the Device Tree

libdevinfo Library

prtconf Command

/devices Directory

Binding a Driver to a Device

Figure 2–3 Device Node Names

Generic Device Names

Figure 2–4 Specific Driver Node Binding

Figure 2–5 Generic Driver Node Binding

Chapter 3 Multithreading

Locking Primitives

Storage Classes of Driver Data

Mutual-Exclusion Locks

Setting Up Mutexes

Using Mutexes

Readers/Writer Locks

Semaphores

Thread Synchronization

Condition Variables in Thread Synchronization

Initializing Condition Variables

Waiting for the Condition

Signaling the Condition

Example 3–1 Using Mutexes and Condition Variables

cv_wait() and cv_timedwait() Functions

Example 3–2 Using cv_timedwait()

cv_wait_sig() Function

Example 3–3 Using cv_wait_sig()

cv_timedwait_sig() Function

Choosing a Locking Scheme

Potential Locking Pitfalls

Threads Unable to Receive Signals

Chapter 4 Properties

Device Properties

Device Property Names

Creating and Updating Properties

`libdevinfo` Library

`prtconf` Command

`/devices` Directory

`cv_wait()` and `cv_timedwait()` Functions

Example 3–2 Using `cv_timedwait()`

`cv_wait_sig()` Function

Example 3–3 Using `cv_wait_sig()`

`cv_timedwait_sig()` Function

`prop_op()` Entry Point

Example 4–1 `prop_op()` Routine

Using `ddi_log_sysevent()` to Log Events

`ddi_log_sysevent()` Syntax

Example 5–1 Calling `ddi_log_sysevent()`

`modlinkage` Structure

`modldrv` Structure

`dev_ops` Structure

`cb_ops` Structure

`_init()` Example

Example 6–2 `_init()` Function

`_fini()` Example

`_info()` Example

`probe()` Entry Point

`attach()` Entry Point

Example 6–5 Typical `attach()` Entry Point

`detach()` Entry Point

Example 6–6 Typical `detach()` Entry Point

`getinfo()` Entry Point

Example 6–7 Typical `getinfo()` Entry Point

`ddi_device_acc_attr` Structure