STREAMS Programming Guide

Chapter 8 Messages - Kernel Level

This chapter describes the structure and use of each STREAMS message type.

ioctl(2) Processing

STREAMS is a special type of character device driver that is different from the historical character input/output (I/O) mechanism in several ways.

In the classical device driver, all ioctl(2) calls are processed by the single device driver, which is responsible for their resolution. The classical device driver has user context, that is, all data can be copied directly to and from user space.

By contrast, the stream head itself can process some ioctl(2) calls (defined in streamio(7I)). Generally, STREAMS ioctl(2) calls operate independently of any particular module or driver on the stream. This means the valid ioctl(2) calls that are processed on a stream change over time, as modules are pushed and popped on the stream. The stream modules have no user context and must rely on the stream head to perform copyin and copyout requests.

There is no user context in a module or driver when the information associated with the ioctl(2) call is received. This prevents use of ddi_copyin(9F) or ddi_copyout(9F) by the module. No user context also prevents the module and driver from associating any kernel data with the currently running process. In any case, by the time the module or driver receives the ioctl(2) call, the process that generated it might have exited.

STREAMS allows user processes to control functions on specific modules and drivers in a stream using ioctl(2) calls. In fact, many streamio(7I) ioctl(2) commands go no further than the stream head. They are fully processed there and no related messages are sent downstream. If, however, it is an I_STR ioctl(2) or an unrecognized ioctl(2) command, the stream head creates an M_IOCTL message, which includes the ioctl(2) argument. This is then sent downstream to be processed by the pertinent module or driver. The M_IOCTL message is the precursor message type carrying ioctl(2) information to modules. Other message types are used to complete the ioctl processing in the stream. Each module has its own set of M_IOCTL messages it must recognize.

Message Allocation and Freeing

The allocb(9F) utility routine allocates a message and the space to hold the data for the message. allocb(9F) returns a pointer to a message block containing a data buffer of at least the size requested, provided enough memory is available. The routine returns NULL on failure. allocb(9F) always returns a message of type M_DATA. The type can then be changed if required. b_rptr and b_wptr are set to db_base (see msgb(9S) and datab(9S)), which is the start of the memory location for the data.

allocb(9F) can return a buffer larger than the size requested. If allocb(9F) indicates buffers are not available (allocb(9F) fails), the put or service procedure cannot block to wait for a buffer to become available. Instead, bufcall(9F) defers processing in the module or the driver until a buffer becomes available.

If message space allocation is done by the put procedure and allocb(9F) fails, the message is usually discarded. If the allocation fails in the service routine, the message is returned to the queue. bufcall(9F) is called to set a call to the service routine when a message buffer becomes available, and the service routine returns.

freeb(9F) releases the message block descriptor and the corresponding data block, if the reference count (see datab(9S)) is equal to 1. If the reference count exceeds 1, the data block is not released.

freemsg(9F) releases all message blocks in a message. It uses freeb(9F) to free all message blocks and corresponding data blocks.
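The reference-count rule can be illustrated with a small user-space model. This is only a sketch: the model_ types and functions are hypothetical stand-ins, not the kernel's msgb(9S) and datab(9S) definitions, and model_dupb merely imitates the way dupb(9F) makes two message blocks share one data block.

```c
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-ins for datab(9S) and msgb(9S), not the kernel layouts. */
struct model_datab {
	unsigned char *db_base;		/* start of the data buffer */
	unsigned int db_ref;		/* message blocks sharing this buffer */
};

struct model_msgb {
	struct model_msgb *b_cont;	/* next block in the same message */
	struct model_datab *b_datap;	/* data block, possibly shared */
};

static int data_blocks_released;	/* data blocks actually freed so far */

/* Allocate one message block with its own data block (reference count 1). */
struct model_msgb *
model_allocb(size_t size)
{
	struct model_msgb *bp = malloc(sizeof (*bp));
	struct model_datab *dp = malloc(sizeof (*dp));
	unsigned char *buf = malloc(size > 0 ? size : 1);

	if (bp == NULL || dp == NULL || buf == NULL) {
		free(bp);
		free(dp);
		free(buf);
		return (NULL);
	}
	dp->db_base = buf;
	dp->db_ref = 1;
	bp->b_cont = NULL;
	bp->b_datap = dp;
	return (bp);
}

/* Create a second descriptor for the same data block. */
struct model_msgb *
model_dupb(struct model_msgb *bp)
{
	struct model_msgb *nbp = malloc(sizeof (*nbp));

	if (nbp == NULL)
		return (NULL);
	nbp->b_cont = NULL;
	nbp->b_datap = bp->b_datap;
	bp->b_datap->db_ref++;
	return (nbp);
}

/* Free the descriptor; free the data block only with the last reference. */
void
model_freeb(struct model_msgb *bp)
{
	struct model_datab *dp = bp->b_datap;

	if (--dp->db_ref == 0) {
		free(dp->db_base);
		free(dp);
		data_blocks_released++;
	}
	free(bp);
}

/* Free every block of a message by walking b_cont. */
void
model_freemsg(struct model_msgb *mp)
{
	while (mp != NULL) {
		struct model_msgb *next = mp->b_cont;

		model_freeb(mp);
		mp = next;
	}
}

/* Returns 1 if only the second free released the shared data block. */
int
model_demo(void)
{
	int before = data_blocks_released;
	int after_first;
	struct model_msgb *a = model_allocb(16);
	struct model_msgb *b;

	if (a == NULL)
		return (-1);
	if ((b = model_dupb(a)) == NULL) {
		model_freeb(a);
		return (-1);
	}
	model_freeb(a);			/* count 2 -> 1: buffer survives */
	after_first = data_blocks_released - before;
	model_freemsg(b);		/* count 1 -> 0: buffer released */
	return (after_first == 0 && data_blocks_released - before == 1);
}
```

In the model, freeing the first descriptor leaves the buffer intact because another message block still refers to it; the buffer is released only when the last descriptor is freed.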

In Example 8-1, allocb(9F) is used by the bappend subroutine that appends a character to a message block:


Example 8-1 Use of allocb(9F)

/*
 * Append a character to a message block.
 * If (*bpp) is null, it will allocate a new block
 * Returns 0 when the message block is full, 1 otherwise
 */
#define MODBLKSZ	128	/* size of message blocks */

static int bappend(mblk_t **bpp, int ch)
{
	mblk_t *bp;

	if ((bp = *bpp) != NULL) {
		if (bp->b_wptr >= bp->b_datap->db_lim)
			return (0);
	} else {
		if ((*bpp = bp = allocb(MODBLKSZ, BPRI_MED)) == NULL)
			return (1);
	}
	*bp->b_wptr++ = ch;
	return (1);
}

bappend receives a pointer to a message block and a character as arguments. If a message block is supplied (*bpp != NULL), bappend checks if there is room for more data in the block. If not, it fails. If there is no message block, a block of at least MODBLKSZ is allocated through allocb(9F).

If allocb(9F) fails, bappend returns success and discards the character. If the original message block is not full or the allocb(9F) is successful, bappend stores the character in the block.

Example 8-2 shows the processing of all the message blocks in any downstream data (type M_DATA) messages. freemsg(9F) frees messages.


Example 8-2 Subroutine modwput

/* Write side put procedure */
static int modwput(queue_t *q, mblk_t *mp)
{
	switch (mp->b_datap->db_type) {
	default:
		putnext(q, mp);		/* Don't do these, pass along */
		break;

	case M_DATA: {
		mblk_t *bp;
		mblk_t *nmp = NULL, *nbp = NULL;

		for (bp = mp; bp != NULL; bp = bp->b_cont) {
			while (bp->b_rptr < bp->b_wptr) {
				if (*bp->b_rptr == '\n')
					if (!bappend(&nbp, '\r'))
						goto newblk;
				if (!bappend(&nbp, *bp->b_rptr))
					goto newblk;

				bp->b_rptr++;
				continue;

			newblk:
				/* current new block is full: chain it */
				if (nmp == NULL)
					nmp = nbp;
				else	/* link msg blk to tail of nmp */
					linkb(nmp, nbp);
				nbp = NULL;	/* next bappend allocates */
			}
		}
		if (nmp == NULL)
			nmp = nbp;
		else if (nbp != NULL)
			linkb(nmp, nbp);
		freemsg(mp);	/* de-allocate message */
		if (nmp)
			putnext(q, nmp);
		break;
	}
	}
	return (0);
}

Data messages are scanned and filtered. modwput copies the original message into new blocks, modifying as it copies. nbp points to the current new message block. nmp points to the new message being formed as multiple M_DATA message blocks. The outer for loop goes through each message block of the original message. The inner while loop goes through each byte. bappend is used to add characters to the current or new block. If bappend fails, the current new block is full. If nmp is NULL, nmp is pointed at the new block. If nmp is not NULL, the new block is linked to the end of nmp by use of linkb(9F).

At the end of the loops, the final new block is linked to nmp. The original message (all message blocks) is returned to the pool by freemsg(9F). If a new message exists, it is sent downstream.

Recovering From No Buffers

bufcall(9F) can be used to recover from an allocb(9F) failure. The call syntax is as follows:


bufcall_id_t bufcall(int size, int pri, void(*func)(), long arg);

Note -

qbufcall(9F) and qunbufcall(9F) must be used with perimeters.


bufcall(9F) calls (*func)(arg) when a buffer of size bytes is available. When func is called, it has no user context and must return without blocking. Also, there is no guarantee that when func is called, a buffer will actually still be available.

On success, bufcall(9F) returns a nonzero identifier that can be used as a parameter to unbufcall(9F) to cancel the request later. On failure, 0 is returned and the requested function is never called.
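The identifier handling can be pictured with a user-space model. Everything here is a hypothetical stand-in (the model_ names, the fixed-size table, and the model_buffer_freed trigger are assumptions for illustration); it shows only the contract described above: a nonzero id on success, 0 on failure, a single call of func(arg) later, and no call at all once the request is canceled.

```c
#define MODEL_MAXCALLS	4

/* One pending request, loosely modeled on a bufcall(9F) registration. */
struct model_call {
	int id;			/* nonzero while the slot is in use */
	void (*func)(long);
	long arg;
};

static struct model_call pending[MODEL_MAXCALLS];
static int next_id = 1;

/* Register func(arg); returns a nonzero id, or 0 if no slot is free. */
int
model_bufcall(void (*func)(long), long arg)
{
	int i;

	for (i = 0; i < MODEL_MAXCALLS; i++) {
		if (pending[i].id == 0) {
			pending[i].id = next_id++;
			pending[i].func = func;
			pending[i].arg = arg;
			return (pending[i].id);
		}
	}
	return (0);
}

/* Cancel a pending request so that its function is never called. */
void
model_unbufcall(int id)
{
	int i;

	for (i = 0; i < MODEL_MAXCALLS; i++) {
		if (id != 0 && pending[i].id == id)
			pending[i].id = 0;
	}
}

/* The model's "memory became available" event: run each request once. */
void
model_buffer_freed(void)
{
	int i;

	for (i = 0; i < MODEL_MAXCALLS; i++) {
		if (pending[i].id != 0) {
			struct model_call c = pending[i];

			pending[i].id = 0;	/* requests are one-shot */
			c.func(c.arg);
		}
	}
}

static int calls_seen;

static void
count_call(long arg)
{
	calls_seen += (int)arg;
}

/* Register two requests, cancel one, then let buffers become available. */
int
model_bufcall_demo(void)
{
	int keep, drop;

	calls_seen = 0;
	keep = model_bufcall(count_call, 1);
	drop = model_bufcall(count_call, 10);
	if (keep == 0 || drop == 0)
		return (-1);
	model_unbufcall(drop);
	model_buffer_freed();
	model_buffer_freed();	/* nothing left to run the second time */
	return (calls_seen);
}
```

Only the request that was not canceled runs, and it runs once. A real callback would still have to retry its allocation, because a buffer is not guaranteed to be available when it is called.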


Caution -

Care must be taken to avoid deadlock when holding resources while waiting for bufcall(9F) to call (*func)(arg). bufcall(9F) should be used sparingly.


Two examples are provided. Example 8-3 is a device-receive-interrupt handler and Example 8-4 is a write service procedure:


Example 8-3 Device Interrupt handler

#include <sys/types.h>
#include <sys/param.h>
#include <sys/stream.h>
#include <sys/cmn_err.h>

bufcall_id_t id;		/* hold id val for unbufcall */

/* dev is typed long here to match the bufcall arg shown above */
static void dev_re_load(long dev);

void
dev_rintr(long dev)
{
	/* process incoming message ... */
	/* allocate new buffer for device */
	dev_re_load(dev);
}

/*
 * Reload device with a new receive buffer
 */
static void
dev_re_load(long dev)
{
	mblk_t *bp;

	id = 0;			/* begin with no waiting for buffers */
	if ((bp = allocb(DEVBLKSZ, BPRI_MED)) == NULL) {
		cmn_err(CE_WARN, "dev: allocb failure (size %d)\n",
		    DEVBLKSZ);
		/*
		 * Allocation failed. Use bufcall to
		 * schedule a call to ourselves.
		 */
		id = bufcall(DEVBLKSZ, BPRI_MED, dev_re_load, dev);
		return;
	}

	/* pass buffer to device ... */
}

See Chapter 12, MultiThreaded STREAMS for more information on the uses of unbufcall(9F). These references are protected by MT locks.

Since bufcall(9F) can fail, there is still a chance that the device hangs. A better strategy, if bufcall(9F) fails, is to discard the current input message and resubmit that buffer to the device. Losing input data is preferable to hanging.

In Example 8-4, mod_wsrv prefixes each output message with a header.


Example 8-4 Write Service Procedure

static int mod_wsrv(queue_t *q)
{
	mblk_t *mp, *bp;

	while ((mp = getq(q)) != NULL) {
		/* check for priority messages and canput ... */

		/*
		 * Allocate a header to prepend to the message.
		 * If the allocb fails, use bufcall to reschedule.
		 */
		if ((bp = allocb(HDRSZ, BPRI_MED)) == NULL) {
			/* id is the bufcall_id_t declared in Example 8-3 */
			if (!(id = bufcall(HDRSZ, BPRI_MED, qenable, q))) {
				/* bufcall failed: retry in two seconds */
				timeout(qenable, (caddr_t)q,
				    drv_usectohz(2000000));
			}
			/*
			 * Put the msg back and exit, we will be
			 * re-enabled later
			 */
			putbq(q, mp);
			return (0);
		}
		/* process message .... */
	}
	return (0);
}

In this example, mod_wsrv illustrates a potential deadlock case. If allocb(9F) fails, mod_wsrv attempts to recover without loss of data and calls bufcall(9F). In this case, the routine passed to bufcall(9F) is qenable(9F). When a buffer is available, the service procedure is automatically re-enabled. Before exiting, the current message is put back in the queue. Example 8-4 deals with bufcall(9F) failure by calling timeout(9F).

timeout(9F) schedules the given function to be run with the given argument after the given number of clock ticks. In this example, if bufcall(9F) fails, the system runs qenable(9F) after two seconds have passed.

Releasing Callback Requests

When allocb(9F) fails and bufcall(9F) is called, a callback is pending until a buffer is actually returned. Since this callback is asynchronous, it must be released before all processing is complete. To release this queued event, use unbufcall(9F).

Pass the id returned by bufcall(9F) to unbufcall(9F). Then close the driver in the normal way. If this sequence of unbufcall(9F) and xxclose is not followed, the callback can occur after the driver has been closed. This is one of the most difficult types of bugs to find during the debugging stage.


Caution -

All bufcall(9F) and timeouts must be canceled in the close routine.


Extended STREAMS Buffers

Some hardware using the STREAMS mechanism supports memory-mapped I/O (see mmap(2)) that allows the sharing of buffers between users, kernel, and the I/O card.

If the hardware supports memory-mapped I/O, data received from the hardware is placed in the DARAM (dual access RAM) section of the I/O card. Since DARAM is memory shared between the kernel and the I/O card, coordinated data transfer between the kernel and the I/O card is eliminated. Once in kernel space, the data buffer is manipulated as if it were a kernel resident buffer. Similarly, data sent downstream is placed in the DARAM and forwarded to the network.

In a typical network arrangement, data is received from the network by the I/O card. The controller reads the block of data into the card's internal buffer. It interrupts the host computer to notify it that data has arrived. The STREAMS driver gives the controller the kernel address where the data block is to go and the number of bytes to transfer. After the controller has read the data into its buffer and verified the checksum, it copies the data into main memory to the address specified by the DMA (direct memory access) memory address. Once in the kernel space, the data is packaged into message blocks and processed in the usual manner.

When data is transmitted from a user process to the network, it is copied from the user space to the kernel space, packaged as a message block, and sent to the downstream driver. The driver interrupts the I/O card, signaling that data is ready to be transmitted to the network. The controller copies the data from the kernel space to the internal buffer on the I/O card, and from there it is placed on the network.

The STREAMS buffer allocation mechanism enables the allocation of message and data blocks to point directly to a client-supplied (non-STREAMS) buffer. Message and data blocks allocated this way are indistinguishable from the normal data blocks. The client-supplied buffers are processed as if they were normal STREAMS data buffers.

Drivers can attach non-STREAMS data buffers and also free them. This is done as follows:

freeb(9F) detects when a buffer is a client supplied, non-STREAMS buffer. If it is, freeb(9F) finds the free_rtn(9S) structure associated with the buffer. After calling the driver-dependent routine (defined in free_rtn(9S)) to free the buffer, freeb(9F) frees the message and data block.

The free routine should not reference any dynamically allocated data structures that are freed when the driver is closed, as messages can exist in a stream after the driver is closed. For example, when a stream is closed, the driver close routine is called and its private data structure can be deallocated. If the driver sends a message created by esballoc upstream, that message can still be on the stream head read queue. When the stream head read queue is flushed, the message is freed and a call is made to the driver's free routine after the driver has been closed.

The format of the free_rtn(9S) structure is as follows:


void (*free_func)();   /*driver dependent free routine*/
char *free_arg;        /* argument for free_rtn */

The structure has two fields: a pointer to a function and a location for any argument passed to the function. Instead of defining a specific number of arguments, free_arg is defined as a char *. This way, drivers can pass pointers to structures if more than one argument is needed.

The method by which free_func is called is implementation-specific. Do not assume that free_func is or is not called directly from STREAMS utility routines like freeb(9F). The free_func function must not call another module's put procedure nor try to acquire a private module lock that can be held by another thread across a call to a STREAMS utility routine that could free a message block. Otherwise, the possibility for lock recursion and deadlock exists.
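The role of the two free_rtn(9S) fields can be sketched in user space. The model_ types, pool_state, and pool_free below are hypothetical and exist only for illustration; the point is that the free routine receives nothing but free_arg, so everything it touches must remain valid for as long as the message can exist.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for free_rtn(9S): a function pointer plus one opaque argument. */
struct model_free_rtn {
	void (*free_func)(char *);
	char *free_arg;
};

/* Stand-in message block that remembers how to release its client buffer. */
struct model_esb {
	unsigned char *base;
	size_t size;
	struct model_free_rtn *frtn;
};

/* Hypothetical driver bookkeeping reached through free_arg. */
struct pool_state {
	int outstanding;	/* client buffers currently attached */
};

/* Driver-side free routine: touches only state that outlives the message. */
void
pool_free(char *arg)
{
	struct pool_state *ps = (struct pool_state *)arg;

	ps->outstanding--;
}

/* Attach a client-supplied buffer, in the spirit of esballoc(9F). */
struct model_esb *
model_esballoc(unsigned char *base, size_t size, struct model_free_rtn *frtn)
{
	struct model_esb *mp;

	if (base == NULL || frtn == NULL)
		return (NULL);
	if ((mp = malloc(sizeof (*mp))) == NULL)
		return (NULL);
	mp->base = base;
	mp->size = size;
	mp->frtn = frtn;
	return (mp);
}

/* Release: call the driver's routine, then free the descriptor itself. */
void
model_esb_free(struct model_esb *mp)
{
	mp->frtn->free_func(mp->frtn->free_arg);
	free(mp);
}

/* Returns 1 if the free routine ran exactly once for the attached buffer. */
int
model_esb_demo(void)
{
	static unsigned char card_buffer[64];
	struct pool_state ps = { 0 };
	struct model_free_rtn frtn;
	struct model_esb *mp;
	int during;

	frtn.free_func = pool_free;
	frtn.free_arg = (char *)&ps;
	mp = model_esballoc(card_buffer, sizeof (card_buffer), &frtn);
	if (mp == NULL)
		return (-1);
	ps.outstanding++;
	during = ps.outstanding;
	model_esb_free(mp);
	return (during == 1 && ps.outstanding == 0);
}
```

Passing a pointer to a structure through free_arg is how a driver would hand more than one value to its free routine.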

esballoc(9F) provides a common interface for allocating and initializing data blocks. It makes the allocation as transparent to the driver as possible and provides a way to modify the fields of the data block, since modification should only be performed by STREAMS. The driver calls this routine to attach its own data buffer to a newly allocated message and data block. If the routine successfully completes the allocation and assigns the buffer, it returns a pointer to the message block. The driver is responsible for supplying the arguments to esballoc(9F): a pointer to its data buffer, the size of the buffer, the priority of the data block, and a pointer to the free_rtn structure. All arguments should be non-NULL. See Appendix B, STREAMS Utilities, for a description of esballoc(9F).

esballoc(9F) Example

Example 8-5 (which will not compile) shows how extended buffers are managed in the multithreaded environment. The driver maintains a pool of special memory that is allocated by esballoc(9F). The allocator free routine uses the queue struct assigned to the driver or other queue private data, so the allocator and the close routine need to coordinate to ensure that no outstanding esballoc(9F) memory blocks remain after the close. The special memory blocks are of type ebm_t, the counter is ebm, and the mutex mp and the condition variable cvp implement the coordination.


Example 8-5 esballoc Example

ebm_t *
special_new()
{
	mutex_enter(&mp);
	/*
	 * allocate some special memory
	 */
	esballoc();
	/*
	 * increment counter
	 */
	ebm++;
	mutex_exit(&mp);
}

void
special_free()
{
	mutex_enter(&mp);
	/*
	 * de-allocate some special memory
	 */
	freeb();

	/*
	 * decrement counter
	 */
	ebm--;
	if (ebm == 0)
		cv_broadcast(&cvp);
	mutex_exit(&mp);
}

open_close(q, .....)
	....
{
	/*
	 * do some stuff
	 */
	/*
	 * Time to decommission the special allocator.  Are there
	 * any outstanding allocations from it?
	 */
	mutex_enter(&mp);
	while (ebm > 0)
		cv_wait(&cvp, &mp);

	mutex_exit(&mp);
}


Caution -

Close routine must wait for all esballoc(9F) memory to be freed.


General ioctl(2) Processing


Note -

See the ioctl() section in Writing Device Drivers for information on the 64-bit data structure macros.


When the stream head is called to process an ioctl(2) that it does not recognize, it creates an M_IOCTL message and sends it down the stream. An M_IOCTL message is a single M_IOCTL message block followed by zero or more M_DATA blocks. The M_IOCTL message block has the form of an iocblk(9S) structure. This structure contains the following elements.


int        ioc_cmd;              /* ioctls command type */
cred_t     *ioc_cr;              /* full credentials */
uint       ioc_id;               /* ioctl id */
uint       ioc_count;            /* byte cnt in data field */
int        ioc_error;            /* error code */
int        ioc_rval;             /* return value */

For an I_STR ioctl(2), ioc_cmd contains the command supplied by the user in the ic_cmd member of the strioctl structure defined in streamio(7I). For others, it is the value of the cmd argument in the call to ioctl(2). The ioc_cr field is the credentials of the user process.

The ioc_id field is a unique identifier used by the stream head to identify the ioctl and its response messages.

The ioc_count field indicates the number of bytes of data associated with this ioctl request. If the value is greater than zero, there will be one or more M_DATA mblks linked to the b_cont field of the M_IOCTL mblk. If the value of the ioc_count field is zero, there will be no M_DATA mblk associated with the M_IOCTL mblk. If the value of ioc_count is equal to the special value TRANSPARENT, then there is one M_DATA mblk linked to this mblk, and its contents will be the value of the argument passed to ioctl(2). This can be a user address or a numeric value (see "Transparent ioctl(2) Processing").
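The three cases for ioc_count can be summarized as a small decision function. This is a user-space sketch: MODEL_TRANSPARENT is an assumed stand-in for the TRANSPARENT constant from <sys/stream.h>, and the enum names and classify_ioctl are hypothetical. A real module would inspect the iocblk(9S) structure and the b_cont pointer of the M_IOCTL mblk.

```c
#include <stddef.h>

/* Assumed stand-in for TRANSPARENT; the real value is in <sys/stream.h>. */
#define MODEL_TRANSPARENT	((unsigned int)-1)

enum model_ioc_kind {
	MODEL_IOC_NO_DATA,	/* ioc_count == 0: no M_DATA expected */
	MODEL_IOC_DATA,		/* ioc_count > 0: data follows in b_cont */
	MODEL_IOC_TRANSPARENT,	/* b_cont holds the raw ioctl(2) argument */
	MODEL_IOC_MALFORMED	/* data promised but b_cont is NULL */
};

/*
 * Classify an M_IOCTL from its ioc_count and b_cont.
 * The transparent check comes first because TRANSPARENT is itself a
 * nonzero count value.
 */
enum model_ioc_kind
classify_ioctl(unsigned int ioc_count, const void *b_cont)
{
	if (ioc_count == MODEL_TRANSPARENT) {
		return (b_cont != NULL ?
		    MODEL_IOC_TRANSPARENT : MODEL_IOC_MALFORMED);
	}
	if (ioc_count == 0)
		return (MODEL_IOC_NO_DATA);
	return (b_cont != NULL ? MODEL_IOC_DATA : MODEL_IOC_MALFORMED);
}
```

Checking for a missing data block before dereferencing b_cont keeps a malformed request from being treated as valid data.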

An M_IOCTL message is processed by the first module or driver that recognizes it. If a module does not recognize the command, it should pass it down. If a driver does not recognize the command, it should send a negative acknowledgment or M_IOCNAK message upstream. In all circumstances, if a module or driver processes an M_IOCTL message it must acknowledge it.

Modules must always pass unrecognized messages on. Drivers should nak unrecognized ioctl(2) messages and free any other unrecognized message.

If a module or driver finds an error in an M_IOCTL message for any reason, it must produce the negative acknowledgment message. To do this, set the message type to M_IOCNAK and send the message upstream. No data or return value can be sent. If ioc_error is set to 0, the stream head causes the ioctl(2) to fail with EINVAL. The module can set ioc_error to an alternate error number optionally.

ioc_error can be set to a nonzero value in both M_IOCACK and M_IOCNAK. This causes the value to be returned as an error number to the process that sent the ioctl(2).
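The effect of ioc_error on the caller can be modeled as two small functions. This is a sketch of the behavior described above, not stream head code; the model_ names are hypothetical, and the assumption that a successful M_IOCACK returns ioc_rval follows from the "return value" comment on that field.

```c
#include <errno.h>

enum model_reply { MODEL_IOCACK, MODEL_IOCNAK };

/* Error number the caller sees in errno, or 0 if the ioctl(2) succeeds. */
int
model_ioctl_errno(enum model_reply type, int ioc_error)
{
	if (type == MODEL_IOCNAK)
		return (ioc_error != 0 ? ioc_error : EINVAL);
	return (ioc_error);	/* M_IOCACK can also carry an error */
}

/* Value returned by ioctl(2): -1 on any error, else ioc_rval. */
int
model_ioctl_return(enum model_reply type, int ioc_error, int ioc_rval)
{
	if (model_ioctl_errno(type, ioc_error) != 0)
		return (-1);
	return (ioc_rval);
}
```

A negative acknowledgment therefore never succeeds, while a positive acknowledgment succeeds only when ioc_error is zero.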

If a module checks what ioctl(2) of other modules below it are doing, the module should not just search for a specific M_IOCTL on the write side, but also look for M_IOCACK or M_IOCNAK on the read side. For example, the module's write side sees TCSETA (see termio(7I)) and records what is being set. The read-side processing knows that the module is waiting for an answer for the ioctl(2). When the read-side processing sees an ack or nak, it checks for the same ioctl(2) by checking the command (here TCSETA) and the ioc_id. If these match, the module can use the information previously saved.
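The write-side and read-side bookkeeping can be sketched as follows. This is a user-space illustration: model_iocblk carries only the two fields used for matching, and pending_ioctl, remember_ioctl, reply_matches, and the command value MODEL_CMD are hypothetical names, not part of any STREAMS interface.

```c
/* Minimal stand-in for the iocblk(9S) fields used for matching. */
struct model_iocblk {
	int ioc_cmd;
	unsigned int ioc_id;
};

/* What the write side remembers about an ioctl it saw going downstream. */
struct pending_ioctl {
	int valid;
	int cmd;
	unsigned int id;
};

#define MODEL_CMD	1	/* hypothetical command of interest */

/* Write side: record the command and id of the passing M_IOCTL. */
void
remember_ioctl(struct pending_ioctl *p, const struct model_iocblk *iocp)
{
	p->valid = 1;
	p->cmd = iocp->ioc_cmd;
	p->id = iocp->ioc_id;
}

/* Read side: return 1 and clear the record if the ack or nak matches. */
int
reply_matches(struct pending_ioctl *p, const struct model_iocblk *iocp)
{
	if (!p->valid || p->cmd != iocp->ioc_cmd || p->id != iocp->ioc_id)
		return (0);
	p->valid = 0;
	return (1);
}

/* Returns 1 if only the reply with the same command and id is accepted. */
int
model_match_demo(void)
{
	struct pending_ioctl p = { 0, 0, 0 };
	struct model_iocblk down = { MODEL_CMD, 41 };
	struct model_iocblk stale = { MODEL_CMD, 42 };

	remember_ioctl(&p, &down);
	if (reply_matches(&p, &stale))	/* same command, different id */
		return (0);
	if (!reply_matches(&p, &down))	/* same command and id */
		return (0);
	return (!reply_matches(&p, &down));	/* record already consumed */
}
```

Comparing ioc_id as well as the command keeps a reply to some other TCSETA from being mistaken for the one the module recorded.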

If the module checks, for example, the TCSETA/TCGETA group of ioctl(2) calls as they pass up or down a stream, it must never assume that because TCSETA comes down it actually has a data buffer attached to it. The user can form TCSETA as an I_STR call and accidentally give a NULL data buffer pointer. Always check b_cont to see if it is NULL before using it as an index to the data block that goes with M_IOCTL messages.

The TCGETA call, if formed as an I_STR call with a data buffer pointer set to a value by the user, always has a data buffer attached to b_cont from the main message block. Do not assume that the data block is missing and allocate a new buffer, then assign b_cont to point to it, because the original buffer will be lost.

STREAMS ioctl Issues

Regular device drivers have user context in the ioctl(9E) call. However, in a STREAMS driver or module, the only guarantee of user context is in the open(9E) and close(9E) routines. It is therefore necessary to have some indication of the calling context where data is used.

The notion of data models as well as new macros for handling data structure access are discussed in Writing Device Drivers. A STREAMS driver or module writer should use these flags and macros when dealing with structures that change size between data models.

A new flag value which represents the data model of the entity invoking the operation has been added to the ioc_flag field of the iocblk(9S) structure, the cq_flag of the copyreq(9S) structure, and the cp_flag of the copyresp(9S) structure.

The data model flag is one of these possibilities:

In addition, IOC_NATIVE is conditionally defined to match the data model of the kernel implementation.

By looking at the data model flag field of the relevant iocblk(9S), copyreq(9S), or copyresp(9S) structures, the STREAMS module can determine the best method of handling the data.


Caution -

The layout of the iocblk, copyreq, and copyresp structures is different between the 32-bit and 64-bit kernels. Be cautious of any data structure overloading in the cp_private, cq_private, or the cq_filler fields since alignment has changed.


I_STR ioctl(2) Processing

Both the transparent and nontransparent methods implement the ioctl(2) in the STREAMS driver or module itself, not in the stream head. An I_STR ioctl(2) (also referred to as a nontransparent ioctl(2)) is created when a user requests an I_STR ioctl(2) and specifies a pointer to a strioctl structure as the argument. For example, assuming that fd is an open lp STREAMS device and LP_CRLF is a valid option, the user could make a request by issuing the following:

struct strioctl str;
short lp_opt = LP_CRLF;

str.ic_cmd = SET_OPTIONS;
str.ic_timout = -1;
str.ic_dp = (char *)&lp_opt;
str.ic_len = sizeof (lp_opt);

ioctl(fd, I_STR, &str);

On receipt of the I_STR ioctl(2) request, the stream head creates an M_IOCTL message. ioc_cmd is set to SET_OPTIONS, ioc_count is set to the value contained in ic_len (in this example sizeof (short)). An M_DATA mblk is linked to the M_IOCTL mblk and the data pointed to by ic_dp is copied into it (in this case LP_CRLF).

Example 8-6 illustrates processing associated with an I_STR ioctl(2). lpdoioctl is called by the lp write-side put or service procedure to process M_IOCTL messages:


Example 8-6 I_STR ioctl(2)

static void
lpdoioctl(queue_t *q, mblk_t *mp)
{
	struct iocblk *iocp;
	struct lp *lp;

	lp = (struct lp *)q->q_ptr;

	/* 1st block contains iocblk structure */
	iocp = (struct iocblk *)mp->b_rptr;

	switch (iocp->ioc_cmd) {
	case SET_OPTIONS:
		/*
		 * Count should be exactly one short's worth
		 * (for this example)
		 */
		if (iocp->ioc_count != sizeof (short))
			goto iocnak;
		if (mp->b_cont == NULL)
			goto lognak;	/* not shown in this example */
		/* Actual data is in 2nd message block */
		iocp->ioc_error = lpsetopt(lp,
		    *(short *)mp->b_cont->b_rptr);

		/* ACK the ioctl */
		mp->b_datap->db_type = M_IOCACK;
		iocp->ioc_count = 0;
		qreply(q, mp);
		break;
	default:
	iocnak:
		/* NAK the ioctl */
		mp->b_datap->db_type = M_IOCNAK;
		qreply(q, mp);
		break;
	}
}

lpdoioctl illustrates driver M_IOCTL processing, which also applies to modules. In this example, only one command is recognized, SET_OPTIONS. ioc_count contains the number of user-supplied data bytes. For this example, ioc_count must equal the size of a short.

Once the command has been verified, lpsetopt (not shown here) is called to process the request. lpsetopt returns 0 if the request is satisfied; otherwise an error number is returned.

If ioc_error is nonzero, on receipt of the acknowledgment the stream head returns -1 to the application's ioctl(2) request and sets errno to the value of ioc_error. The ioctl(2) is then acknowledged. This includes changing the M_IOCTL message type to M_IOCACK and setting the ioc_count field to zero to indicate that no data is to be returned to the user. Finally, the message is sent upstream using qreply(9F).

If ioc_count was left nonzero, the stream head would copy that many bytes from the second through the nth message blocks into the user buffer. You must set ioc_count if you want to pass any data back to the user.

This example is for a driver. In the default case, for unrecognized commands, or for malformed requests, a nak is generated. This is done by changing the message type to M_IOCNAK and sending it back upstream. A module does not nak an unrecognized command, but passes the message on. A module does, however, nak a malformed request for a command that it recognizes.

Transparent ioctls

Transparent ioctls are used from within a module to tell the stream head to perform a copyin() or copyout() on behalf of the module. It is important for the stream head to have knowledge of the data model of the caller in order to process the copyin and copyout properly. The user should use the ioctl macros as described in Writing Device Drivers when coding a STREAMS module that uses transparent ioctls.

Transparent ioctl(2) Messages

The transparent STREAMS ioctl(2) mechanism is needed because user context does not exist in modules and drivers when an ioctl(2) is processed. This prevents them from using the kernel ddi_copyin/ddi_copyout functions.

Transparent ioctl(2) also lets an application be written using conventional ioctl(2) semantics instead of the I_STR ioctl(2) and an strioctl structure. The difference between transparent and nontransparent ioctl(2) processing in a STREAMS driver and module is the way data is transferred from user to kernel space.

The transparent ioctl(2) mechanism allows backward compatibility for older programs. This transparency only works for modules and drivers that support transparent ioctl(2). Trying to use transparent ioctl(2) on a stream that doesn't support them makes the driver send an error message upstream, causing the ioctl to fail.

The following example illustrates the semantic difference between a nontransparent and transparent ioctl(2). A module that allows arbitrary character translations is pushed on the stream. The ioctl(2) specifies the translation to do, and in this case all uppercase vowels are changed to lowercase. A transparent ioctl(2) uses XCASE instead of I_STR to inform the module directly.

Assume that fd points to a STREAMS device and that the conversion module has been pushed onto it. Use a nontransparent I_STR command to inform the module to change the case of AEIOU. The semantics of this command are:

struct strioctl strioctl;

strioctl.ic_cmd = XCASE;
strioctl.ic_timout = 0;
strioctl.ic_dp = "AEIOU";
strioctl.ic_len = strlen(strioctl.ic_dp);
ioctl(fd, I_STR, &strioctl);

When the stream head receives the I_STR ioctl(2) it creates an M_IOCTL message with the ioc_cmd set to XCASE and the data specified by ic_dp. AEIOU is copied into the first mblk following the M_IOCTL mblk.

The same ioctl(2) specified as a transparent ioctl(2) is called as follows:


ioctl(fd, XCASE, "AEIOU");

The stream head creates an M_IOCTL message with the ioc_cmd set to XCASE, but the data is not copied in. Instead, ioc_count is set to TRANSPARENT and the address of the user data is placed in the first mblk following the M_IOCTL mblk. The module then requests the stream head to copy in the data ("AEIOU") from user space.

Unlike the nontransparent ioctl(2), which can specify a timeout parameter, transparent ioctl(2)s block until processing is complete.


Caution -

Incorrectly written drivers can cause applications using transparent ioctl(2) to block indefinitely.


Notice that even though this process is simpler in the application, transparent ioctl adds considerable complexity to modules and drivers, and additional overhead to the time required to process the request.

The form of the M_IOCTL message generated by the stream head for a transparent ioctl(2) is a single M_IOCTL message block followed by one M_DATA block. The form of the iocblk(9S) structure in the M_IOCTL block is the same as described under General ioctl(2) Processing. However, ioc_cmd is set to the value of the command argument in ioctl(2) and ioc_count is set to the special value of TRANSPARENT. The value TRANSPARENT distinguishes a transparent ioctl(2) from an I_STR ioctl(2) that specifies a value of ioc_cmd equal to the command argument of the transparent ioctl(2). The b_cont block of the message contains the value of the arg parameter in the call.


Caution -

If a module processes a specific ioc_cmd and does not validate the ioc_count field of the M_IOCTL message, it breaks when transparent ioctl(2) is performed with the same command.



Note -

Write modules and drivers to support both transparent and I_STR ioctl(2).


All the message types used in M_IOCTL processing (M_COPYIN, M_COPYOUT, M_IOCDATA, M_IOCACK, and M_IOCNAK) have some similar data structures and sizes. Reuse these structures instead of reallocating them. Note the similarities in the command type, credentials, and id.

The iocblk(9S) structure is contained in M_IOCTL, M_IOCACK and M_IOCNAK message types. For the transparent case, M_IOCTL has one M_DATA message linked to it. This message contains a copy of the argument passed to ioctl(2). Transparent processing of M_IOCACK and M_IOCNAK does not allow any messages to be linked to them.

The copyreq(9S) structure is contained in M_COPYIN and M_COPYOUT message types. The M_COPYIN message type must not have any other message linked to it (that is, b_cont == NULL). The M_COPYOUT message type must have one or more M_DATA messages linked to it. These messages contain the data to be copied into user space.

The copyresp(9S) structure is contained in M_IOCDATA response message types. These messages are generated by the stream head in response to an M_COPYIN or M_COPYOUT request. If the message is in response to an M_COPYOUT request, the message has no messages attached to it (b_cont is NULL). If the response is to an M_COPYIN, then zero or more M_DATA message types are attached to the M_IOCDATA message. These attached messages contain a copy of the user data requested by the M_COPYIN message.

The iocblk(9S), copyreq(9S), and copyresp(9S) structures contain a field indicating the type of ioctl(2) command, a pointer to the user's credentials, and a unique identifier for this ioctl(2). These fields must be preserved.

The structure member cq_private is reserved for use by the module. M_COPYIN and M_COPYOUT request messages contain a cq_private field that can be set to contain state information for ioctl(2) processing (which identifies what the subsequent M_IOCDATA response message contains). This state is returned in cp_private in the M_IOCDATA message. This state information determines the next step in processing the message. Keeping the state in the message makes the message self-describing and simplifies the ioctl(2) processing.

For each piece of data the module copies from user space, an M_COPYIN message is sent to the stream head. The M_COPYIN message specifies the user address (cq_addr) and number of bytes (cq_size) to copy from user space. The stream head responds to the M_COPYIN request with an M_IOCDATA message. The b_cont field of the M_IOCDATA mblk contains the contents pointed to by the M_COPYIN request. Likewise, for each piece of data the module copies to user space, an M_COPYOUT message is sent to the stream head. Specify the user address (cq_addr) and number of bytes to copy (cq_size). The data to be copied is linked to the M_COPYOUT message as one or more M_DATA messages. The stream head responds to M_COPYOUT requests with an M_IOCDATA message, but b_cont is NULL.

After the module has completed processing the ioctl (that is, all M_COPYIN and M_COPYOUT requests have been processed), the ioctl(2) must be acknowledged with an M_IOCACK to indicate successful completion of the command or an M_IOCNAK to indicate failure.

If an error occurs when attempting to copy data to or from user address space, the stream head will set cp_rval in the M_IOCDATA message to the error number. In the event of such an error, the M_IOCDATA message should be freed by the module or driver. No acknowledgement of the ioctl(2) is sent in this case.
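
The state hand-off through cq_private and cp_private, and the rule for a failed copy, can be modeled in ordinary user-space C. The following is a minimal sketch, not kernel code: the sk_copyreq and sk_copyresp structures, the SK_ state and action names, and both functions are hypothetical stand-ins for copyreq(9S), copyresp(9S), and a module's own logic.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for copyreq(9S) and copyresp(9S). */
struct sk_copyreq {
	int	cq_cmd;		/* ioctl command */
	void	*cq_private;	/* module state, set by the module */
	size_t	cq_size;	/* number of bytes to copy */
};

struct sk_copyresp {
	int	cp_cmd;		/* ioctl command */
	void	*cp_private;	/* state echoed back by the stream head */
	int	cp_rval;	/* 0 on success, otherwise an error number */
};

/* Assumed state values, in the style of the examples in this chapter. */
enum sk_state { SK_GETSTRUCT = 0, SK_GETADDR = 1 };

/* What the module does next with the M_IOCDATA message. */
enum sk_action {
	SK_FREE_MSG, SK_SEND_COPYIN, SK_SEND_IOCACK, SK_SEND_IOCNAK
};

/* Record the state in the request so the response is self-describing. */
static void
sk_tag_request(struct sk_copyreq *cqp, enum sk_state st)
{
	cqp->cq_private = (void *)(intptr_t)st;
}

/* Decide the next step from the response alone. */
static enum sk_action
sk_on_iocdata(const struct sk_copyresp *csp)
{
	if (csp->cp_rval != 0)
		return (SK_FREE_MSG);	/* copy failed: free, send no ack */

	switch ((intptr_t)csp->cp_private) {
	case SK_GETSTRUCT:
		return (SK_SEND_COPYIN);	/* structure arrived */
	case SK_GETADDR:
		return (SK_SEND_IOCACK);	/* last piece arrived */
	default:
		return (SK_SEND_IOCNAK);	/* unknown state */
	}
}
```

Because the decision depends only on fields carried in the response, the module in this sketch needs no per-ioctl bookkeeping outside the message.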

Transparent ioctl(2) Examples

Following are three examples of transparent ioctl(2) processing. The first illustrates M_COPYIN to copy data from user space. The second illustrates M_COPYOUT to copy data to user space. The third is a more complex example showing state transitions that combine M_COPYIN and M_COPYOUT.

In these examples the message blocks are reused to avoid the overhead of allocating, copying, and releasing messages. This is standard practice.

The stream head guarantees that the size of the message block containing an iocblk(9S) structure is large enough to also hold the copyreq(9S) and copyresp(9S) structures.

M_COPYIN Example


Note -

Please see the copyin() section in Writing Device Drivers for information on the 64-bit data structure macros.


Example 8-7 illustrates only the processing of a transparent ioctl(2) request (nontransparent request processing is not shown). In this example, the contents of a user buffer are to be transferred into the kernel as part of an ioctl call of the form


ioctl(fd, SET_ADDR, (caddr_t) &bufadd);

where bufadd is a struct address whose elements are:


struct address {	
     int            ad_len;           /* buffer length in bytes */
     caddr_t        ad_addr;          /* buffer address */
};

This requires two pairs of messages (request and response) following receipt of the M_IOCTL message: the first copies in the structure (address) and the second copies in the buffer (address.ad_addr). Two states are maintained and processed in this example: GETSTRUCT is for copying in the address structure, and GETADDR is for copying in the ad_addr of the structure.

xxwput verifies that the SET_ADDR is TRANSPARENT to avoid confusion with an I_STR ioctl(2), which uses a value of ioc_cmd equivalent to the command argument of a transparent ioctl(2). This is done by checking whether the size count (ioc_count) is equal to TRANSPARENT [line 28]. If it is equal to TRANSPARENT, then the message was generated from a transparent ioctl(2); that is, not from an I_STR ioctl(2) [lines 29-32].


Example 8-7 M_COPYIN Example

	struct address {			/* same members as in user space */
		int	ad_len;	/* length in bytes */
		caddr_t	ad_addr;	/* buffer address */
	};

	/* state values (overloaded in private field) */
	#define GETSTRUCT 0			/* address structure */
	#define GETADDR	 1		/* byte string from ad_addr */

	static void xxioc(queue_t *q, mblk_t *mp);

	static int
	xxwput(queue_t *q, mblk_t *mp)		/* q is the write queue */
	{
		struct iocblk *iocbp;
		struct copyreq *cqp;

		switch (mp->b_datap->db_type) {
			.
			.
			.
			case M_IOCTL:
				/* Process ioctl commands */
				iocbp = (struct iocblk *)mp->b_rptr;
				switch (iocbp->ioc_cmd) {
					case SET_ADDR:
						if (iocbp->ioc_count != TRANSPARENT) {
							/* do non-transparent processing here
							 *       (not shown here) */
						} else {
							/* ioctl command is transparent 
							 * Reuse M_IOCTL block for first M_COPYIN request 
							 * of address structure */
							cqp = (struct copyreq *)mp->b_rptr;
							/* Get user space structure address from linked 
							 * M_DATA block */
							cqp->cq_addr = *(caddr_t *) mp->b_cont->b_rptr;
							cqp->cq_size = sizeof(struct address);
							/* MUST free linked blks */
							freemsg(mp->b_cont);
							mp->b_cont = NULL;

							/* identify response */
							cqp->cq_private = (mblk_t *)GETSTRUCT;

							/* Finish describing M_COPYIN message */
							cqp->cq_flag = 0;
							mp->b_datap->db_type = M_COPYIN;
							mp->b_wptr = mp->b_rptr + sizeof(struct copyreq);
							qreply(q, mp);
						}
						break;
					default: /* M_IOCTL not for us */
						/* if module, pass on */
						/* if driver, nak ioctl */
						break;
				} /* switch (iocbp->ioc_cmd) */
				break;
			case M_IOCDATA:
				/* all M_IOCDATA processing done here */
				xxioc(q, mp);
				break;
		}
		return (0);
	}

The transparent part of the SET_ADDR M_IOCTL message processing requires the address structure to be copied from user address space. To accomplish this, it issues an M_COPYIN request to the stream head [lines 37-64].

The mblk is reused and mapped into a copyreq(9S) structure [line 42]. The user space address of bufadd is contained in the b_cont of the M_IOCTL mblk. This address and its size are copied into the copyreq(9S) message [lines 47-49]. The b_cont of the copy request mblk is not needed, so it is freed and then NULLed [lines 51-52].


Caution -

The layout of the iocblk, copyreq, and copyresp structures is different between 32-bit and 64-bit kernels. Be cautious of any data structure overloading in the cp_private or the cq_filler fields since alignment has changed.


On receipt of the M_IOCDATA message for the SET_ADDR command, the xxioc routine checks cp_rval. If an error occurred during the copyin operation, cp_rval is set. The mblk is freed [lines 93-96] and, if necessary, xxioc cleans up from previous M_IOCTL requests, freeing memory, resetting state variables, and so on. The stream head returns the appropriate error to the user.

The cp_private field is set to GETSTRUCT [lines 97-99]. This indicates that the linked b_cont mblk contains a copy of the user's address structure. The example then copies the actual address specified in address.ad_addr.

The program issues another M_COPYIN request to the stream head [lines 100-116], but this time cq_private contains GETADDR to indicate that the M_IOCDATA response will contain a copy of address.ad_addr. The stream head copies the information at the requested user address and sends it downstream in another, final M_IOCDATA message.

The final M_IOCDATA message arrives from the stream head. cp_private contains GETADDR [line 118]. The ad_addr data is contained in the b_cont link of the mblk. If the address is successfully processed by xx_set_addr (not shown here), the message is acknowledged with an M_IOCACK message [lines 124-128]. If xx_set_addr fails, the message is rejected with an M_IOCNAK message [lines 121-122]. xx_set_addr is a routine (not shown in the example) that processes the user address from the ioctl(2).

After the final M_IOCDATA message is processed, the module acknowledges the ioctl(2), to let the stream head know that processing is complete. This is done by sending an M_IOCACK message upstream if the request was successfully processed. Always zero ioc_error, otherwise an error code could be passed to the user application. ioc_rval and ioc_count are also zeroed to reflect that a return value of 0 and no data is to be passed up [lines 124-128].

If the request cannot be processed, either an M_IOCNAK or M_IOCACK can be sent upstream with an appropriate error number. When sending an M_IOCNAK or M_IOCACK, freeing the linked M_DATA block is not mandatory, because the stream head frees it, but freeing it in the module is more efficient.

If ioc_error is set in an M_IOCNAK or M_IOCACK message, this error code will be returned to the user. If no error code is set in an M_IOCNAK message, EINVAL will be returned to the user.
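
The error-reporting rule just described can be summarized as a small function. This is a sketch only: sk_user_errno and the sk_msgtype enumeration are hypothetical names used to model what the user sees, and they are not part of the STREAMS interface.

```c
#include <errno.h>

/* Hypothetical tags for the two acknowledgment message types. */
enum sk_msgtype { SK_M_IOCACK, SK_M_IOCNAK };

/*
 * Model of the error number handed back to the ioctl(2) caller,
 * given the acknowledgment type and the ioc_error value in it.
 */
static int
sk_user_errno(enum sk_msgtype type, int ioc_error)
{
	if (ioc_error != 0)
		return (ioc_error);	/* explicit error wins in either type */
	if (type == SK_M_IOCNAK)
		return (EINVAL);	/* NAK with no error set */
	return (0);			/* ACK with ioc_error zeroed: success */
}
```

The sketch shows why the examples zero ioc_error before sending an M_IOCACK: a stale nonzero value would surface to the application as a failure.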

	static void
	xxioc(queue_t *q, mblk_t *mp)			/* M_IOCDATA processing */
	{
		struct iocblk *iocbp;
		struct copyreq *cqp;
		struct copyresp *csp;
		struct address *ap;

		csp = (struct copyresp *)mp->b_rptr;
		iocbp = (struct iocblk *)mp->b_rptr;

		/* validate this M_IOCDATA is for this module */
		switch (csp->cp_cmd) {
			case SET_ADDR:
				if (csp->cp_rval){ /* GETSTRUCT or GETADDR failed */
					freemsg(mp);
					return;
				}
				switch ((int)csp->cp_private){ /*determine state*/
					case GETSTRUCT:					/* user structure has arrived */
						/* reuse M_IOCDATA block */
						mp->b_datap->db_type = M_COPYIN;
						mp->b_wptr = mp->b_rptr + sizeof (struct copyreq);
						cqp = (struct copyreq *)mp->b_rptr;
						/* user structure */
						ap = (struct address *)mp->b_cont->b_rptr;
						/* buffer length */
						cqp->cq_size = ap->ad_len;
						/* user space buffer address */
						cqp->cq_addr = ap->ad_addr;
						freemsg(mp->b_cont);
						mp->b_cont = NULL;
						cqp->cq_flag = 0;
						cqp->cq_private = (mblk_t *)GETADDR;	/* next state */
						qreply(q, mp);
						break;

					case GETADDR:						/* user address is here */
						/* hypothetical routine */
						if (xx_set_addr(mp->b_cont) == FAILURE) {
							mp->b_datap->db_type = M_IOCNAK;
							iocbp->ioc_error = EIO;
						} else {
							mp->b_datap->db_type=M_IOCACK;/*success*/
							/* can have been overwritten */
							iocbp->ioc_error = 0;
							iocbp->ioc_count = 0;
							iocbp->ioc_rval = 0;
						}
						mp->b_wptr = mp->b_rptr + sizeof (struct iocblk);
						freemsg(mp->b_cont);
						mp->b_cont = NULL;
						qreply(q, mp);
						break;

					default: /* invalid state: can't happen */
						freemsg(mp->b_cont);
						mp->b_cont = NULL;
						mp->b_datap->db_type = M_IOCNAK;
						mp->b_wptr = mp->b_rptr + sizeof(struct iocblk);
						/* can have been overwritten */
						iocbp->ioc_error = EINVAL;
						qreply(q, mp);
						break;
				}
				break;						/* switch (cp_private) */

			default: /* M_IOCDATA not for us */
				/* if module, pass message on */
				/* if driver, free message */
				break;
		} /* switch (csp->cp_cmd) */
	}

M_COPYOUT Example


Note -

Please see the copyout() section in Writing Device Drivers for information on the 64-bit data structure macros.


Example 8-8 returns option values for this STREAMS device by placing them in the user's options structure. This is done by a transparent ioctl(2) call of the form


struct options optadd;

ioctl(fd, GET_OPTIONS, (caddr_t) &optadd);

or by an I_STR call

	struct strioctl opts_strioctl;
	struct options optadd;

	opts_strioctl.ic_cmd = GET_OPTIONS;
	opts_strioctl.ic_timeout = -1;
	opts_strioctl.ic_len = sizeof (struct options);
	opts_strioctl.ic_dp = (char *)&optadd;
	ioctl(fd, I_STR, (caddr_t) &opts_strioctl);

In the I_STR case, opts_strioctl.ic_dp points to the options structure, optadd.

Example 8-8 illustrates support of both the I_STR and transparent forms of ioctl(2). The transparent form requires a single M_COPYOUT message following receipt of the M_IOCTL to copyout the contents of the structure. xxwput is the write-side put procedure of module or driver xx.

This example first checks if the ioctl(2) command is transparent [line 22]. If it is, the message is reused as an M_COPYOUT copy request message [lines 24-32]. The pointer to the receiving buffer is in the linked message and is copied into cq_addr [lines 26-27]. Since only a single copy out is being done, no state information needs to be stored in cq_private. The original linked message is freed, in case it isn't big enough to hold the request [lines 32-33]. As an optimization, the following code checks the size of the message for reuse:


mp->b_cont->b_datap->db_lim - mp->b_cont->b_datap->db_base >= 
sizeof (struct options)
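
The size comparison above can be wrapped in a helper. The following is a user-space sketch under stated assumptions: sk_dblk is a hypothetical stand-in that carries only the db_base and db_lim pointers, and sk_can_reuse is an illustrative name, not a STREAMS utility.

```c
#include <stddef.h>

/* Hypothetical stand-in for the db_base/db_lim pair of a data block. */
struct sk_dblk {
	unsigned char	*db_base;	/* first byte of the buffer */
	unsigned char	*db_lim;	/* one past the last byte of the buffer */
};

/* Return 1 if the buffer can hold need bytes, 0 otherwise. */
static int
sk_can_reuse(const struct sk_dblk *dbp, size_t need)
{
	return ((size_t)(dbp->db_lim - dbp->db_base) >= need);
}
```

A module using such a check would keep the linked block when it returns 1 and fall back to freeing and reallocating when it returns 0.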

A new linked message is allocated to hold the option request [lines 32-40]. When using the transparent ioctl(2) the M_COPYOUT command data contained in the linked message is passed to the stream head. The stream head will copy the data to the user's address space and issue an M_IOCDATA in response to the M_COPYOUT message, which the module must acknowledge in an M_IOCACK message [lines 59-73]. ioc_error, ioc_count, and ioc_rval are cleared to prevent any stale data from being passed back to the stream head [lines 69-71].

If the message is not transparent (is issued through an I_STR ioctl(2)) the data is sent with the M_IOCACK acknowledgment message and copied into the buffer specified by the strioctl data structure [lines 50-51].


Example 8-8 M_COPYOUT Example

	struct options {						/* same members as in user space */
		int			op_one;
		int			op_two;
		short			op_three;
		long			op_four;
	};

	static int
	xxwput (queue_t *q, mblk_t *mp)
	{
		struct iocblk *iocbp;
		struct copyreq *cqp;
		struct copyresp *csp;
		int transparent = 0;

		switch (mp->b_datap->db_type) {
			.
			.
			.
			case M_IOCTL:
				iocbp = (struct iocblk *)mp->b_rptr;
				switch (iocbp->ioc_cmd) {
					case GET_OPTIONS:
						if (iocbp->ioc_count == TRANSPARENT) {
							transparent = 1;
							cqp = (struct copyreq *)mp->b_rptr;
							cqp->cq_size = sizeof(struct options);
							/* Get struct address from linked M_DATA block */
							cqp->cq_addr = *(caddr_t *)mp->b_cont->b_rptr;
							cqp->cq_flag = 0;
							/* No state necessary - we will only ever get one 
							 * M_IOCDATA from the Stream head indicating success or 
							 * failure for the copyout */
						}
						if (mp->b_cont)
							freemsg(mp->b_cont);
						if ((mp->b_cont = 
									allocb(sizeof(struct options), BPRI_MED)) == NULL) {
							mp->b_datap->db_type = M_IOCNAK;
							iocbp->ioc_error = EAGAIN;
							qreply(q, mp);
							break;
						}
						/* hypothetical routine */
						xx_get_options(mp->b_cont);
						if (transparent) {
							mp->b_datap->db_type = M_COPYOUT;
							mp->b_wptr = mp->b_rptr + sizeof(struct copyreq);
						} else {
							mp->b_datap->db_type = M_IOCACK;
							iocbp->ioc_count = sizeof(struct options);
						}
						qreply(q, mp);
						break;

					default: /* M_IOCTL not for us */
						/*if module, pass on;if driver, nak ioctl*/
						break;
				} /* switch (iocbp->ioc_cmd) */
				break;

			case M_IOCDATA:
				csp = (struct copyresp *)mp->b_rptr;
				/* M_IOCDATA not for us */
				if (csp->cp_cmd != GET_OPTIONS) {
					/*if module/pass on, if driver/free message*/
					break;
				}
				if ( csp->cp_rval ) {
					freemsg(mp);	/* failure */
					return (0);
				}
				/* Data successfully copied out, ack */

				/* reuse M_IOCDATA for ack */
				mp->b_datap->db_type = M_IOCACK;
				mp->b_wptr = mp->b_rptr + sizeof(struct iocblk);
				/* can have been overwritten */
				iocbp = (struct iocblk *)mp->b_rptr;
				iocbp->ioc_error = 0;
				iocbp->ioc_count = 0;
				iocbp->ioc_rval = 0;
				qreply(q, mp);
				break;
				.
				.
				.
		} /* switch (mp->b_datap->db_type) */
		return (0);
	}

Bidirectional Transfer Example

Example 8-9 illustrates bidirectional data transfer between the kernel and application during transparent ioctl(2) processing. It also shows how to use more complex state information.

The user wants to send and receive data from user buffers as part of a transparent ioctl(2) call of the form



	ioctl(fd, XX_IOCTL, (caddr_t) &addr_xxdata);

Three pairs of messages are required following the M_IOCTL message: the first to copyin the structure; the second to copyin one user buffer; and the third to copyout the second user buffer. xxwput is the write-side put procedure for module or driver xx:


Example 8-9 Bidirectional Transfer

struct xxdata {             /* same members in user space */
   int         x_inlen;     /* number of bytes copied in */
   caddr_t     x_inaddr;    /* buf addr of data copied in */
   int         x_outlen;    /* number of bytes copied out */
   caddr_t     x_outaddr;   /* buf addr of data copied out */
};
/* State information for ioctl processing */
struct state {
		int         st_state;    /* see below */
		struct xxdata		st_data;				/* see above */
};
/* state values */

#define GETSTRUCT    0   /* get xxdata structure */
#define GETINDATA    1   /* get data from x_inaddr */
#define PUTOUTDATA   2   /* get response from M_COPYOUT */

static void xxioc(queue_t *q, mblk_t *mp);

static int
xxwput (queue_t *q, 	mblk_t *mp) {
		struct iocblk *iocbp;
		struct copyreq *cqp;
		struct state *stp;
		mblk_t *tmp;

		switch (mp->b_datap->db_type) {
			.
			.
			.
			case M_IOCTL:
				iocbp = (struct iocblk *)mp->b_rptr;
				switch (iocbp->ioc_cmd) {
				case XX_IOCTL:
				/* do non-transparent processing. (See I_STR ioctl
				 * processing discussed in previous section.)
				 */
				/*Reuse M_IOCTL block for M_COPYIN request*/

				cqp = (struct copyreq *)mp->b_rptr;

				/* Get structure's user address from
				 * linked M_DATA block */

				cqp->cq_addr = (caddr_t)
				 *(long *)mp->b_cont->b_rptr;
				freemsg(mp->b_cont);
				mp->b_cont = NULL;

				/* Allocate state buffer */

				if ((tmp = allocb(sizeof(struct state),
				 BPRI_MED)) == NULL) {
						mp->b_datap->db_type = M_IOCNAK;
						iocbp->ioc_error = EAGAIN;
						qreply(q, mp);
						break;
				}
				tmp->b_wptr += sizeof(struct state);
				stp = (struct state *)tmp->b_rptr;
				stp->st_state = GETSTRUCT;
				cqp->cq_private = tmp;

				/* Finish describing M_COPYIN message */

				cqp->cq_size = sizeof(struct xxdata);
				cqp->cq_flag = 0;
				mp->b_datap->db_type = M_COPYIN;
				mp->b_wptr=mp->b_rptr+sizeof(struct copyreq);
				qreply(q, mp);
				break;

			default: /* M_IOCTL not for us */
				/* if module, pass on */
				/* if driver, nak ioctl */
				break;

			} /* switch (iocbp->ioc_cmd) */
			break;

	case M_IOCDATA:
			xxioc(q, mp);/*all M_IOCDATA processing here*/
			break;
			.
			.
			.
	} /* switch (mp->b_datap->db_type) */
	return (0);
}

xxwput allocates a message block to contain the state structure and reuses the M_IOCTL to create an M_COPYIN message to read in the xxdata structure.

M_IOCDATA processing is done in xxioc():

static void
xxioc(										/* M_IOCDATA processing */
	queue_t *q,
	mblk_t *mp)
{
	struct iocblk *iocbp;
	struct copyreq *cqp;
	struct copyresp *csp;
	struct state *stp;
	mblk_t *xx_indata(mblk_t *);	/* hypothetical routine */

	csp = (struct copyresp *)mp->b_rptr;
	iocbp = (struct iocblk *)mp->b_rptr;
	switch (csp->cp_cmd) {

	case XX_IOCTL:
			if (csp->cp_rval) { /* failure */
				if (csp->cp_private) /* state structure */
						freemsg(csp->cp_private);
				freemsg(mp);
				return;
			 }
			stp = (struct state *)csp->cp_private->b_rptr;
			switch (stp->st_state) {

			case GETSTRUCT:					/* xxdata structure copied in */
					/* save structure */

				stp->st_data =
				 *(struct xxdata *)mp->b_cont->b_rptr;
				freemsg(mp->b_cont);
				mp->b_cont = NULL;
				/* Reuse M_IOCDATA to copyin data */
				mp->b_datap->db_type = M_COPYIN;
				cqp = (struct copyreq *)mp->b_rptr;
				cqp->cq_size = stp->st_data.x_inlen;
				cqp->cq_addr = stp->st_data.x_inaddr;
				cqp->cq_flag = 0;
				stp->st_state = GETINDATA; /* next state */
				qreply(q, mp);
				break;

			case GETINDATA: /* data successfully copied in */
				/* Process input, return output */
				if ((mp->b_cont = xx_indata(mp->b_cont))
				 == NULL) { /* hypothetical */
							/* fail xx_indata */
							mp->b_datap->db_type = M_IOCNAK;
							mp->b_wptr = mp->b_rptr +
								sizeof(struct iocblk);
						iocbp->ioc_error = EIO;
						qreply(q, mp);
						break;
				}
				mp->b_datap->db_type = M_COPYOUT;
				cqp = (struct copyreq *)mp->b_rptr;
				cqp->cq_size = min(msgdsize(mp->b_cont),
				 stp->st_data.x_outlen);
				cqp->cq_addr = stp->st_data.x_outaddr;
				cqp->cq_flag = 0;
				stp->st_state = PUTOUTDATA; /* next state */
				qreply(q, mp);
				break;
			case PUTOUTDATA: /* data copied out, ack ioctl */
				freemsg(csp->cp_private); /*state structure*/
				mp->b_datap->db_type = M_IOCACK;
				mp->b_wptr = mp->b_rptr + sizeof (struct iocblk);
				/* can have been overwritten */
				iocbp->ioc_error = 0;
				iocbp->ioc_count = 0;
				iocbp->ioc_rval = 0;
				qreply(q, mp);
				break;

			default: /* invalid state: can't happen */
				freemsg(mp->b_cont);
				mp->b_cont = NULL;
				mp->b_datap->db_type = M_IOCNAK;
				mp->b_wptr=mp->b_rptr + sizeof (struct iocblk);
				iocbp->ioc_error = EINVAL;
				qreply(q, mp);
				break;
			} /* switch (stp->st_state) */
			break;
	default: /* M_IOCDATA not for us */
			/* if module, pass message on */
			/* if driver, free message */
			break;
	} /* switch (csp->cp_cmd) */
}

At case GETSTRUCT, the user xxdata structure is copied into the module's state structure (pointed to by cp_private in the message) and the M_IOCDATA message is reused to create a second M_COPYIN message to read the user data. At case GETINDATA, the input user data is processed by xx_indata (not supplied in the example), which frees the linked M_DATA block and returns the output data message block. The M_IOCDATA message is reused to create an M_COPYOUT message to write the user data. At case PUTOUTDATA, the message block containing the state structure is freed and an acknowledgment is sent upstream.
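
The progression through the three states can be summarized as a transition function. This is an illustrative user-space model, not kernel code: the sk_ names are hypothetical, and input_ok stands for whether the hypothetical xx_indata routine returned output data.

```c
/* Hypothetical model of the three-state ioctl handling described above. */
enum sk_state { SK_GETSTRUCT, SK_GETINDATA, SK_PUTOUTDATA, SK_DONE };
enum sk_reply { SK_M_COPYIN, SK_M_COPYOUT, SK_M_IOCACK, SK_M_IOCNAK };

struct sk_step {
	enum sk_state	next;	/* state stored for the next M_IOCDATA */
	enum sk_reply	reply;	/* message type sent upstream */
};

/* Given the state found in the M_IOCDATA message, pick the reply. */
static struct sk_step
sk_advance(enum sk_state cur, int input_ok)
{
	struct sk_step step;

	switch (cur) {
	case SK_GETSTRUCT:	/* structure copied in: fetch input buffer */
		step.next = SK_GETINDATA;
		step.reply = SK_M_COPYIN;
		break;
	case SK_GETINDATA:	/* input copied in: process, then copy out */
		if (input_ok) {
			step.next = SK_PUTOUTDATA;
			step.reply = SK_M_COPYOUT;
		} else {
			step.next = SK_DONE;
			step.reply = SK_M_IOCNAK;
		}
		break;
	case SK_PUTOUTDATA:	/* output copied out: acknowledge */
		step.next = SK_DONE;
		step.reply = SK_M_IOCACK;
		break;
	default:		/* invalid state */
		step.next = SK_DONE;
		step.reply = SK_M_IOCNAK;
		break;
	}
	return (step);
}
```

Each successful ioctl in this model produces the replies M_COPYIN, M_COPYOUT, and M_IOCACK in that order, matching the three message pairs that follow the initial M_IOCTL.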

Care must be taken at the "can't happen" default case since the message block containing the state structure (cp_private) is not returned to the pool because it might not be valid. This might result in a lost block. The ASSERT helps find errors in the module if a "can't happen" condition occurs.

I_LIST ioctl(2) Example

The I_LIST ioctl(2) lists the drivers and modules in a stream.

(Available as I-LIST2 file)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stropts.h>
#include <sys/types.h>
#include <sys/socket.h>

int
main(int argc, const char **argv)
{
		int               s, i;
		int               mods;
		struct str_list   mod_list;
		struct str_mlist *mlist;

		/* Get a socket... */
		if((s = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
			perror("socket");
			exit(1);
		}

		/* Determine the number of modules in the stream. */
		if((mods = ioctl(s, I_LIST, 0)) < 0){
			perror("I_LIST ioctl");
			exit(1);
		}
		if(mods == 0) {
			printf("No modules\n");
			exit(1);
		} else {
			printf("%d modules\n", mods);
		}
		/* Allocate memory for all of the module names... */
		mlist = (struct str_mlist *) calloc(mods, sizeof(struct str_mlist));
		if(mlist == 0){
			perror("calloc failure");
			exit(1);
		}
		mod_list.sl_modlist = mlist;
		mod_list.sl_nmods = mods;

		/* Do the ioctl and get the module names. */
		if(ioctl(s, I_LIST, &mod_list) < 0){
			perror("I_LIST ioctl fetch");
			exit(1);
		}

		/* Print out the name of the modules */
		for(i = 0; i < mods; i++) {
			printf("s: %s\n", mod_list.sl_modlist[i].l_name);
		}

		free(mlist);

		exit(0);
}

Flush Handling

All modules and drivers are expected to handle M_FLUSH messages. An M_FLUSH message can originate at the stream head or from a module or a driver. The user can cause data to be flushed from queued messages of a stream by submitting an I_FLUSH ioctl(2). Data can be flushed from the read side, write side, or both sides of a stream.



ioctl(fd,I_FLUSH, arg);

The first byte of the M_FLUSH message is an option flag that can have values described in Table 8-1.

Table 8-1 M_FLUSH Arguments and bi_flag Values

Flag         Description

FLUSHR       Flush read side of stream
FLUSHW       Flush write queue
FLUSHRW      Flush both, read and write, queues
FLUSHBAND    Flush a specified priority band only

Flushing Priority Bands

In addition to being able to flush all the data from a queue, a specific band can be flushed using the I_FLUSHBAND ioctl(2).



ioctl(fd, I_FLUSHBAND, bandp); 

The ioctl(2) is passed a pointer to a bandinfo structure. The bi_pri field indicates the band priority to be flushed (from 0 to 255). The bi_flag field is used to indicate the type of flushing to be done. The legal values for bi_flag are defined in Table 8-1. bandinfo has the following format:


struct bandinfo {
		unsigned char       bi_pri;
		int                 bi_flag;
};

See "M_FLUSH" for details on how modules and drivers should handle flush band requests.

Figure 8-1 and Figure 8-2 further demonstrate flushing the entire stream due to a line break. Figure 8-1 shows the flushing of the write-side of a stream, and Figure 8-2 shows the flushing of the read-side of a stream. In the figures, dotted boxes indicate flushed queues.

Figure 8-1 Flushing the Write-Side of a Stream

Graphic

The following takes place (dotted lines mean flushed queues):

  1. A break is detected by a driver.

  2. The driver generates an M_BREAK message and sends it upstream.

  3. The module translates the M_BREAK into an M_FLUSH message with FLUSHW set, then sends it upstream.

  4. The stream head does not flush the write queue (no messages are ever queued there).

  5. The stream head turns the message around (sends it down the write-side).

  6. The module flushes its write queue.

  7. The message is passed downstream.

  8. The driver flushes its write queue and frees the message.

Figure 8-2 shows flushing the read-side of a stream.

Figure 8-2 Flushing the Read-Side of a Stream

Graphic

The events taking place are:

  1. After generating the first M_FLUSH message, the module generates an M_FLUSH with FLUSHR set and sends it downstream.

  2. The driver flushes its read queue.

  3. The driver turns the message around (sends it up the read-side).

  4. The module flushes its read queue.

  5. The message is passed upstream.

  6. The stream head flushes the read queue and frees the message.

The following code shows line discipline module flush handling.

    static int
    ld_put(
     	queue_t *q,						/* pointer to read/write queue */
     	mblk_t *mp)						/* pointer to message being passed */
    {
     	switch (mp->b_datap->db_type) {
     		default:
     			putq(q, mp); /* queue everything */
    			return (0);					 /* except flush */
    
     		case M_FLUSH:
     			if (*mp->b_rptr & FLUSHW)					/* flush write q */
     					flushq(WR(q), FLUSHDATA);
    
     			if (*mp->b_rptr & FLUSHR)					/* flush read q */
     					flushq(RD(q), FLUSHDATA);
    
     			putnext(q, mp);											/* pass it on */
     			return(0);
     	}
    }

The stream head turns around the M_FLUSH message if FLUSHW is set (FLUSHR is cleared). A driver turns around M_FLUSH if FLUSHR is set (should mask off FLUSHW).
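
The turn-around rules for the two ends of the stream can be modeled as two small functions that operate on the flag byte of an M_FLUSH message. This is a user-space sketch: the SK_ constants are local stand-ins chosen for illustration (real code takes FLUSHR and FLUSHW from the system headers), and the function names are hypothetical.

```c
/* Local stand-ins for the flush flags; real values come from the headers. */
#define	SK_FLUSHR	0x01
#define	SK_FLUSHW	0x02
#define	SK_FLUSHRW	(SK_FLUSHR | SK_FLUSHW)

enum sk_action { SK_FREE_MSG, SK_TURN_AROUND };

/* Driver end: reflect up the read side only if FLUSHR is set. */
static enum sk_action
sk_driver_flush(unsigned char *flagp)
{
	if (*flagp & SK_FLUSHR) {
		*flagp = (unsigned char)(*flagp & ~SK_FLUSHW);	/* mask off FLUSHW */
		return (SK_TURN_AROUND);
	}
	return (SK_FREE_MSG);
}

/* Stream head end: reflect down the write side only if FLUSHW is set. */
static enum sk_action
sk_head_flush(unsigned char *flagp)
{
	if (*flagp & SK_FLUSHW) {
		*flagp = (unsigned char)(*flagp & ~SK_FLUSHR);	/* clear FLUSHR */
		return (SK_TURN_AROUND);
	}
	return (SK_FREE_MSG);
}
```

Clearing the opposite-direction bit before reflecting is what keeps a message with both bits set from bouncing between the two ends of the stream.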

Flushing Priority Band

The bi_flag field is one of FLUSHR, FLUSHW, or FLUSHRW.

The following example shows flushing according to the priority band.

queue_t *rdq;								/* read queue */
queue_t *wrq;								/* write queue */

	case M_FLUSH:
		if (*bp->b_rptr & FLUSHBAND) {
			if (*bp->b_rptr & FLUSHW)
				flushband(wrq, FLUSHDATA, *(bp->b_rptr + 1));
			if (*bp->b_rptr & FLUSHR)
				flushband(rdq, FLUSHDATA, *(bp->b_rptr + 1));
		} else {
			if (*bp->b_rptr & FLUSHW)
				flushq(wrq, FLUSHDATA);
			if (*bp->b_rptr & FLUSHR)
				flushq(rdq, FLUSHDATA);
		}
		/*
		 * modules pass the message on;
		 * drivers shut off FLUSHW and loop the message
		 * up the read-side if FLUSHR is set; otherwise,
		 * drivers free the message.
		 */
		break;
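
The fragment above reads the flags from the first byte of the message and the band from the second. The sending side can be sketched in the same terms. The following user-space helper is hypothetical: the SK_ constants are local stand-ins for the real flag definitions, and sk_build_flush only fills a plain byte buffer rather than an mblk.

```c
#include <stddef.h>

/* Local stand-ins for the flush flags; real values come from the headers. */
#define	SK_FLUSHR	0x01
#define	SK_FLUSHW	0x02
#define	SK_FLUSHRW	(SK_FLUSHR | SK_FLUSHW)
#define	SK_FLUSHBAND	0x04

/*
 * Fill buf with the body of an M_FLUSH message: the flag byte first and,
 * when SK_FLUSHBAND is set, the band priority in the second byte.
 * Returns the number of bytes written, or 0 if buf is too small.
 */
static size_t
sk_build_flush(unsigned char *buf, size_t cap, unsigned char flags,
    unsigned char band)
{
	if (flags & SK_FLUSHBAND) {
		if (cap < 2)
			return (0);
		buf[0] = flags;
		buf[1] = band;
		return (2);
	}
	if (cap < 1)
		return (0);
	buf[0] = flags;
	return (1);
}
```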

Note that modules and drivers are not required to treat messages as flowing in separate bands. Modules and drivers can view the queue as having only two bands of flow, normal and high priority. However, the latter alternative flushes the entire queue whenever an M_FLUSH message is received.

One use of the field b_flag of the msgb structure is provided to give the stream head a way to stop M_FLUSH messages from being reflected forever when the stream is used as a pipe. When the stream head receives an M_FLUSH message, it sets the MSGNOLOOP flag in the b_flag field before reflecting the message down the write-side of the stream. If the stream head receives an M_FLUSH message with this flag set, the message is freed rather than reflected.
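
The loop-prevention rule can be modeled in a few lines of user-space C. This sketch assumes the incoming M_FLUSH has FLUSHW set, so that the stream head would otherwise reflect it. The SK_MSGNOLOOP bit value, the sk_flushmsg structure, and the function name are hypothetical stand-ins.

```c
/* Stand-in bit for MSGNOLOOP; the real value comes from the headers. */
#define	SK_MSGNOLOOP	0x01

enum sk_result { SK_REFLECTED, SK_FREED };

/* Hypothetical stand-in holding only the b_flag field of a message. */
struct sk_flushmsg {
	unsigned short	b_flag;
};

/* Model of a stream head receiving an M_FLUSH that has FLUSHW set. */
static enum sk_result
sk_head_rput_flush(struct sk_flushmsg *mp)
{
	if (mp->b_flag & SK_MSGNOLOOP)
		return (SK_FREED);	/* already reflected once: stop here */
	mp->b_flag |= SK_MSGNOLOOP;	/* mark before sending it back down */
	return (SK_REFLECTED);
}
```

In a pipe, the same message reaches a stream head at each end, so in this model the first head reflects it and the second head frees it.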

Figure 8-3 Interfaces Affecting Drivers

Graphic

The set of STREAMS utilities available to drivers are listed in Appendix B, STREAMS Utilities. No system-defined macros that manipulate global kernel data or introduce structure-size dependencies are permitted in these utilities. So, some utilities that have been implemented as macros in the prior Solaris system releases are implemented as functions in the SunOS 5 System. This does not preclude the existence of both macro and function versions of these utilities. It is intended that driver source code include a header file that picks up function declarations while the core operating system source includes a header file that defines the macros. With the DKI interface, the following STREAMS utilities are implemented as C programming language functions: datamsg(9F), OTHERQ(9F), putnext(9F), RD(9F), and WR(9F).

Replacing macros such as RD with function equivalents in the driver source code allows driver objects to be insulated from changes in the data structures and their size, increasing the useful lifetime of driver source code and objects. Multithreaded drivers are also protected against changes in implementation-specific STREAMS synchronization.

The DKI defines an interface suitable for drivers and there is no need for drivers to access global kernel data structures directly. The kernel function drv_getparm(9F) fetches information from these structures. This restriction has an important consequence. Since drivers are not permitted to access global kernel data structures directly, changes in the contents/offsets of information within these structures will not break objects.

Driver and Module Service Interfaces

STREAMS provides the means to implement a service interface between any two components in a stream, and between a user process and the topmost module in the stream. A service interface is defined at the boundary between a service user and a service provider (see Figure 8-4). A service interface is a set of primitives and the rules that define a service and the allowable state transitions that result as these primitives are passed between the user and the provider. These rules are typically represented by a state machine. In STREAMS, the service user and provider are implemented in a module, driver, or user process. The primitives are carried bidirectionally between a service user and provider in M_PROTO and M_PCPROTO messages.

PROTO messages (M_PROTO and M_PCPROTO) can be multiblock, with the second through last blocks of type M_DATA. The first block in a PROTO message contains the control part of the primitive in a form agreed upon by the user and provider. The block is not intended to carry protocol headers. (Although its use is not recommended, upstream PROTO messages can have multiple PROTO blocks at the start of the message. getmsg(2) compacts the blocks into a single control part when sending to a user process.) The M_DATA block contains any data part associated with the primitive. The data part can be processed in a module that receives it, or it can be sent to the next stream component, along with any data generated by the module. The contents of PROTO messages and their allowable sequences are determined by the service interface specification.

PROTO messages can be sent bidirectionally (upstream and downstream) on a stream and between a stream and a user process. putmsg(2) and getmsg(2) system calls are analogous respectively to write(2) and read(2) except that the former allow both data and control parts to be (separately) passed, and they retain the message boundaries across the user-stream interface. putmsg(2) and getmsg(2) separately copy the control part (M_PROTO or M_PCPROTO block) and data part (M_DATA blocks) between the stream and user process.

An M_PCPROTO message is normally used to acknowledge primitives composed of other messages. M_PCPROTO ensures that the acknowledgment reaches the service user before any other message. If the service user is a user process, the stream head will only store a single M_PCPROTO message, and discard subsequent M_PCPROTO messages until the first one is read with getmsg(2).
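
The single-message rule at the stream head can be modeled as follows. This is a user-space sketch with hypothetical names: sk_head stands for the stream head's storage, and integer message ids stand in for M_PCPROTO messages.

```c
/* Hypothetical model of the stream head's storage for M_PCPROTO. */
struct sk_head {
	int	pending;	/* 1 if an M_PCPROTO is waiting to be read */
	int	msg_id;		/* identifies the stored message */
};

/* An M_PCPROTO arrives: returns 1 if stored, 0 if discarded. */
static int
sk_head_put_pcproto(struct sk_head *hp, int msg_id)
{
	if (hp->pending)
		return (0);	/* one is already stored: discard this one */
	hp->pending = 1;
	hp->msg_id = msg_id;
	return (1);
}

/* Model of getmsg(2): returns 1 and the id if a message was waiting. */
static int
sk_head_getmsg(struct sk_head *hp, int *msg_idp)
{
	if (!hp->pending)
		return (0);
	*msg_idp = hp->msg_id;
	hp->pending = 0;	/* the slot is free again */
	return (1);
}
```

A provider that relies on M_PCPROTO acknowledgments therefore needs the user to read each one before the next is sent, or the later acknowledgment is lost.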

Figure 8-4 Protocol Substitution

By defining a service interface through which applications interact with a transport protocol, you can substitute a different protocol below the service interface in a way that is completely transparent to the application. In Figure 8-4, the same application can run over the Transmission Control Protocol (TCP) and the ISO transport protocol. Of course, the service interface must define a set of services common to both protocols.

The three components of any service interface are the service user, the service provider, and the service interface itself, as seen in Figure 8-5.

Figure 8-5 Service Interface

Typically, an application makes requests of a service provider using some well-defined service primitive. Responses and event indications are also passed from the provider to the user using service primitives.

Each service interface primitive is a distinct STREAMS message that has two parts: a control part and a data part. The control part contains information that identifies the primitive and includes all necessary parameters. The data part contains user data associated with that primitive.

An example of a service interface primitive is a transport protocol connect request. This primitive requests the transport protocol service provider to establish a connection with another transport user. The parameters associated with this primitive can include a destination protocol address and specific protocol options to be associated with that connection. Some transport protocols also allow a user to send data with the connect request. A STREAMS message would be used to define this primitive. The control part would identify the primitive as a connect request and would include the protocol address and options. The data part would contain the associated user data.

Service Interface Library Example

The service interface library example presented here includes four functions that let a user do the following:

Five primitives are defined. The first two represent requests from the service user to the service provider. These are:

BIND_REQ

This request asks the provider to bind a specified protocol address. It requires an acknowledgment from the provider to verify that the contents of the request were syntactically correct.

UNITDATA_REQ

This request asks the provider to send data to the specified destination address. It does not require an acknowledgment from the provider.

The three other primitives represent acknowledgments of requests, or indications of incoming events, and are passed from the service provider to the service user.

OK_ACK

This primitive informs the user that a previous bind request was received successfully by the service provider.

ERROR_ACK

This primitive informs the user that a nonfatal error was found in the previous bind request. It indicates that no action was taken with the primitive that caused the error.

UNITDATA_IND

This primitive indicates that data destined for the user has arrived.

The defined structures describe the contents of the control part of each service interface message passed between the service user and service provider. The first field of each control part defines the type of primitive being passed.
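The role of the first field can be sketched in user-level C. The example below is an illustration under stated assumptions: int32_t and uint32_t stand in for t_scalar_t and t_uscalar_t, and build_bind_req and prim_type are hypothetical helper names. A library routine would hand the resulting buffer to putmsg(2) as the control part; that call is omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BIND_REQ      1
#define UNITDATA_REQ  2
#define OK_ACK        3
#define ERROR_ACK     4
#define UNITDATA_IND  5

/* int32_t/uint32_t stand in for t_scalar_t/t_uscalar_t (assumption). */
struct bind_req {
    int32_t  PRIM_type;     /* always BIND_REQ */
    uint32_t BIND_addr;     /* addr to bind */
};

/*
 * Fill buf with the control part of a BIND_REQ.
 * Returns the number of bytes written, or 0 if buf is too small.
 */
size_t build_bind_req(void *buf, size_t buflen, uint32_t addr)
{
    struct bind_req req;

    assert(buf != NULL);
    if (buflen < sizeof (req))
        return 0;
    req.PRIM_type = BIND_REQ;
    req.BIND_addr = addr;
    memcpy(buf, &req, sizeof (req));
    return sizeof (req);
}

/*
 * Read the primitive type from the first field of any control part.
 * Returns -1 if the control part is too short to hold a type.
 */
int32_t prim_type(const void *ctl, size_t len)
{
    int32_t type;

    assert(ctl != NULL);
    if (len < sizeof (type))
        return -1;
    memcpy(&type, ctl, sizeof (type));
    return type;
}
```

Because every control structure begins with the same type field, the receiver can inspect that field first and only then decide which structure to interpret the rest of the buffer as.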

Module Service Interface Example

The following code is part of a module that illustrates the concept of a service interface. The module implements a simple service interface and mirrors the service interface library example. The following rules pertain to service interfaces.

Declarations

The service interface primitives are defined in the declarations:

#include <sys/types.h>
#include <sys/param.h>
#include <sys/stream.h>
#include <sys/errno.h>

/* Primitives initiated by the service user */

#define BIND_REQ                      1    /* bind request */
#define UNITDATA_REQ                  2    /* unitdata request */

/* Primitives initiated by the service provider */

#define OK_ACK                        3    /* bind acknowledgment */
#define ERROR_ACK                     4    /* error acknowledgment */
#define UNITDATA_IND                  5    /* unitdata indication */

/*
 * The following structures define the format of the
 * stream message block of the above primitives.
 */
struct bind_req {                       /* bind request */
   t_scalar_t    PRIM_type;             /* always BIND_REQ */
   t_uscalar_t   BIND_addr;             /* addr to bind */
};
struct unitdata_req {                   /* unitdata request */
   t_scalar_t    PRIM_type;             /* always UNITDATA_REQ */
   t_scalar_t    DEST_addr;             /* dest addr */
};
struct ok_ack {                         /* ok acknowledgment */
   t_scalar_t    PRIM_type;             /* always OK_ACK */
};
struct error_ack {                      /* error acknowledgment */
   t_scalar_t    PRIM_type;             /* always ERROR_ACK */
   t_scalar_t    UNIX_error;            /* UNIX system error code*/
};
struct unitdata_ind {                   /* unitdata indication */
   t_scalar_t    PRIM_type;             /* always UNITDATA_IND */
   t_scalar_t    SRC_addr;              /* source addr */
};

union primitives {                      /* union of all primitives */
   t_scalar_t                type;      /* overlays PRIM_type of each member */
   struct bind_req           bind_req;
   struct unitdata_req       unitdata_req;
   struct ok_ack             ok_ack;
   struct error_ack          error_ack;
   struct unitdata_ind       unitdata_ind;
};
struct dgproto {                        /* structure minor device */
   short state;                         /* current provider state */
   long addr;                           /* net address */
};

/* Provider states */
#define IDLE 0
#define BOUND 1

In general, the M_PROTO or M_PCPROTO block is described by a data structure containing the service interface information. In this example, union primitives is that structure.

The module recognizes two commands:

BIND_REQ

Give this stream a protocol address (for example, give it a name on the network). After a BIND_REQ is completed, data from other senders will find their way through the network to this particular stream.

UNITDATA_REQ

Send data to the specified address.

The module generates three messages:

OK_ACK

A positive acknowledgment (ack) of BIND_REQ.

ERROR_ACK

A negative acknowledgment (nak) of BIND_REQ.

UNITDATA_IND

Data from the network has been received.

The acknowledgment of a BIND_REQ informs the user that the request was syntactically correct (or incorrect if ERROR_ACK). The receipt of a BIND_REQ is acknowledged with an M_PCPROTO to ensure that the acknowledgment reaches the user before any other message. For example, if a UNITDATA_IND came through before the bind was completed, the application would be confused.

The driver uses a per-minor device data structure, dgproto, which contains the following:

state

Current state of the service provider: IDLE or BOUND

addr

Network address that has been bound to this stream

It is assumed (though not shown) that the module open procedure sets the write queue q_ptr to point at the appropriate private data structure.

Service Interface Procedure

The write put procedure is:

static int protowput(queue_t *q, mblk_t *mp)
{
    union primitives *proto;
    struct dgproto *dgproto;
    int err;

    dgproto = (struct dgproto *)q->q_ptr;    /* priv data struct */
    switch (mp->b_datap->db_type) {
    default:
        /* don't understand it */
        mp->b_datap->db_type = M_ERROR;
        mp->b_rptr = mp->b_wptr = mp->b_datap->db_base;
        *mp->b_wptr++ = EPROTO;
        qreply(q, mp);
        break;
    case M_FLUSH:    /* standard flush handling goes here ... */
        break;
    case M_PROTO:
        /* Protocol message -> user request */
        proto = (union primitives *)mp->b_rptr;
        switch (proto->type) {
        default:
            /* unknown primitive: report a protocol error upstream */
            mp->b_datap->db_type = M_ERROR;
            mp->b_rptr = mp->b_wptr = mp->b_datap->db_base;
            *mp->b_wptr++ = EPROTO;
            qreply(q, mp);
            return (0);
        case BIND_REQ:
            if (dgproto->state != IDLE) {
                err = EINVAL;
                goto error_ack;
            }
            if (mp->b_wptr - mp->b_rptr != sizeof (struct bind_req)) {
                err = EINVAL;
                goto error_ack;
            }
            /* chkaddr (not shown) returns 0 if the address is legal */
            if ((err = chkaddr(proto->bind_req.BIND_addr)) != 0)
                goto error_ack;
            dgproto->state = BOUND;
            dgproto->addr = proto->bind_req.BIND_addr;
            /* reuse the request block for the acknowledgment */
            mp->b_datap->db_type = M_PCPROTO;
            proto->type = OK_ACK;
            mp->b_wptr = mp->b_rptr + sizeof (struct ok_ack);
            qreply(q, mp);
            break;
        error_ack:
            mp->b_datap->db_type = M_PCPROTO;
            proto->type = ERROR_ACK;
            proto->error_ack.UNIX_error = err;
            mp->b_wptr = mp->b_rptr + sizeof (struct error_ack);
            qreply(q, mp);
            break;
        case UNITDATA_REQ:
            if (dgproto->state != BOUND)
                goto bad;
            if (mp->b_wptr - mp->b_rptr != sizeof (struct unitdata_req))
                goto bad;
            if ((err = chkaddr(proto->unitdata_req.DEST_addr)) != 0)
                goto bad;
            putq(q, mp);
            /* start device or mux output ... */
            break;
        bad:
            /* invalid unitdata requests are silently discarded */
            freemsg(mp);
            break;
        }
    }
    return (0);
}

The write put procedure switches on the message type. The only types accepted are M_FLUSH and M_PROTO. For M_FLUSH messages, the driver performs the canonical flush handling (not shown). For M_PROTO messages, the driver assumes the message block contains a union primitives structure and switches on the type field. Two types are understood: BIND_REQ and UNITDATA_REQ.

For a BIND_REQ, the current state is checked; it must be IDLE. Next, the message size is checked. If it is the correct size, the passed-in address is verified for legality by calling chkaddr. If everything checks, the incoming message is converted into an OK_ACK and sent upstream. If there was any error, the incoming message is converted into an ERROR_ACK and sent upstream.

For UNITDATA_REQ, the state is also checked; it must be BOUND. As above, the message size and destination address are checked. If there is any error, the message is discarded. If all is well, the message is put in the queue, and the lower half of the driver is started.
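The BIND_REQ and UNITDATA_REQ checks described above amount to a two-state machine, which can be exercised outside the kernel. The following is a minimal user-space sketch, not the driver code: int32_t and uint32_t stand in for t_scalar_t and t_uscalar_t, struct provider stands in for dgproto, the function names are hypothetical, and chkaddr is an assumed implementation that rejects only address 0.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BIND_REQ      1
#define UNITDATA_REQ  2
#define OK_ACK        3
#define ERROR_ACK     4

enum { IDLE = 0, BOUND = 1 };           /* provider states */

/* int32_t/uint32_t stand in for t_scalar_t/t_uscalar_t (assumption). */
struct bind_req     { int32_t PRIM_type; uint32_t BIND_addr; };
struct unitdata_req { int32_t PRIM_type; int32_t  DEST_addr; };
struct error_ack    { int32_t PRIM_type; int32_t  UNIX_error; };

struct provider { int state; uint32_t addr; };

/* Assumed address check: this model treats 0 as the only illegal address. */
int chkaddr(uint32_t addr)
{
    return (addr == 0) ? EADDRNOTAVAIL : 0;
}

/*
 * Model of the BIND_REQ path. Fills *ack (an error_ack is used for both
 * outcomes to keep the model small) and returns OK_ACK or ERROR_ACK.
 */
int model_bind(struct provider *p, const void *ctl, size_t len,
    struct error_ack *ack)
{
    struct bind_req req = { 0, 0 };
    int err;

    assert(p != NULL && ctl != NULL && ack != NULL);
    if (p->state != IDLE || len != sizeof (req)) {
        err = EINVAL;
    } else {
        memcpy(&req, ctl, sizeof (req));
        err = chkaddr(req.BIND_addr);
    }
    if (err != 0) {
        ack->PRIM_type = ERROR_ACK;     /* no state change on error */
        ack->UNIX_error = err;
        return ERROR_ACK;
    }
    p->state = BOUND;
    p->addr = req.BIND_addr;
    ack->PRIM_type = OK_ACK;
    ack->UNIX_error = 0;
    return OK_ACK;
}

/*
 * Model of the UNITDATA_REQ path. Returns 1 if the request would be
 * queued for output, 0 if it would be silently discarded.
 */
int model_unitdata(const struct provider *p, const void *ctl, size_t len)
{
    struct unitdata_req req;

    assert(p != NULL && ctl != NULL);
    if (p->state != BOUND || len != sizeof (req))
        return 0;
    memcpy(&req, ctl, sizeof (req));
    return chkaddr((uint32_t)req.DEST_addr) == 0;
}
```

The asymmetry is deliberate: a failed bind always produces an ERROR_ACK, while a failed unitdata request produces nothing, because UNITDATA_REQ is defined as an unacknowledged primitive.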

If the write put procedure receives a message type that it does not understand, either a bad b_datap->db_type or bad proto->type, the message is converted into an M_ERROR message and is then sent upstream.

The generation of UNITDATA_IND messages (not shown in the example) would normally occur in the device interrupt if this is a hardware driver or in the lower read put procedure if this is a multiplexer. The algorithm is simple: the data part of the message is prefixed by an M_PROTO message block that contains a unitdata_ind structure and sent upstream.
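The prefixing step can be illustrated with a simplified user-space stand-in for message blocks. This is a sketch under assumptions: struct blk is a hypothetical reduction of mblk_t, malloc replaces allocb(9F), and make_unitdata_ind is an illustrative name. A real driver or multiplexer would link the blocks through b_cont and pass the result upstream with putnext(9F).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define UNITDATA_IND 5

enum blk_type { BLK_DATA, BLK_PROTO };  /* stand-ins for M_DATA, M_PROTO */

/* Hypothetical, simplified stand-in for mblk_t. */
struct blk {
    enum blk_type  type;
    unsigned char *rptr;    /* first valid byte */
    unsigned char *wptr;    /* one past the last valid byte */
    struct blk    *cont;    /* next block of the same message */
};

/* int32_t stands in for t_scalar_t (assumption). */
struct unitdata_ind { int32_t PRIM_type; int32_t SRC_addr; };

/*
 * Prefix a data block with a control block that carries a unitdata_ind.
 * Returns the new first block of the message, or NULL if allocation
 * fails, in which case the data block is left untouched.
 */
struct blk *make_unitdata_ind(struct blk *data, int32_t src_addr)
{
    struct unitdata_ind ind = { UNITDATA_IND, src_addr };
    struct blk *ctl;

    assert(data != NULL && data->type == BLK_DATA);
    ctl = malloc(sizeof (*ctl) + sizeof (ind));
    if (ctl == NULL)
        return NULL;
    ctl->type = BLK_PROTO;
    ctl->rptr = (unsigned char *)(ctl + 1);     /* payload follows header */
    memcpy(ctl->rptr, &ind, sizeof (ind));
    ctl->wptr = ctl->rptr + sizeof (ind);
    ctl->cont = data;                           /* data part comes second */
    return ctl;
}
```

The caller owns the returned control block; in this sketch it is released with free, while the data block remains the caller's own storage.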

Message Type Change Rules

Well-known ioctl Interfaces

Many ioctl operations are common to a class of STREAMS drivers or STREAMS modules. Modules that deal with terminals usually implement a subset of the termio(7I) ioctls. Similarly, drivers that deal with audio devices usually implement a subset of the audio(7I) interfaces.

Because no data structures have changed size as a result of the LP64 data model for either termio(7I) or audio(7I), you do not need to use any of the structure macros to decode any of these ioctls.

FIORDCHK

The FIORDCHK ioctl returns, as its return value, the number of bytes available to be read. Although FIORDCHK should be able to return more than MAXINT bytes, it is constrained to returning an int by the type of the ioctl(2) function.

FIONREAD

The FIONREAD ioctl returns the number of data bytes (in all data messages queued) in the location pointed to by the arg parameter. The ioctl returns a 32-bit quantity for both 32-bit and 64-bit applications. Therefore, code that passes the address of a long variable needs to be changed to pass the address of an int variable for 64-bit applications.
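The int requirement can be shown with a short user-level wrapper. This is a sketch that assumes FIONREAD is visible through <sys/ioctl.h>; some systems define it in a different header, such as <sys/filio.h>. The function name bytes_readable is hypothetical.

```c
#include <assert.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Return the number of bytes that can be read from fd, or -1 on error.
 * The result variable is an int, never a long: the ioctl stores a
 * 32-bit count, so a long would be only partly written in a 64-bit
 * application.
 */
int bytes_readable(int fd)
{
    int n = 0;

    assert(fd >= 0);
    if (ioctl(fd, FIONREAD, &n) == -1)
        return -1;
    return n;
}
```

For example, after 5 bytes are written to the write end of a pipe, calling bytes_readable on the read end is expected to report 5 on systems that support FIONREAD on pipes.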

I_NREAD

The I_NREAD ioctl (streamio(7I)) is an informational ioctl which counts the data bytes as well as the number of messages in the stream head read queue. The number of bytes in the stream head read queue is returned in the location pointed to by the arg parameter of the ioctl. The number of messages in the stream head read queue is returned as the return value of the ioctl.

Like FIONREAD, the arg parameter to the I_NREAD ioctl should be a pointer to an int, not a long. And, like FIORDCHK, the return value is constrained to be less than or equal to MAXINT bytes, even if more data is available.

Signals

STREAMS modules and drivers send signals to application processes through a special signal message. If the signal specified by the module or driver is not SIGPOLL (see signal(5)), the signal is delivered to the process group associated with the stream. If the signal is SIGPOLL, the signal is only sent to processes that have registered for the signal by using the I_SETSIG ioctl(2).

Modules or drivers use an M_SIG message to insert an explicit in-band signal into a message stream. For example, a message can be sent to the application process immediately before a particular service interface message. When the M_SIG message reaches the head of the stream read queue, a signal is generated and the M_SIG message is removed. The service interface message is the next message to be processed by the user. (The M_SIG message is usually defined as part of the service interface of the driver or module.)
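The in-band behavior of M_SIG can be modeled in user space. The sketch below is illustrative only: struct readq, struct msg, and the function names are hypothetical, the queue is a fixed array, and "generating a signal" is reduced to recording the signal number.

```c
#include <assert.h>
#include <stddef.h>

enum msg_kind { MSG_PROTO, MSG_SIG };   /* stand-ins for M_PROTO, M_SIG */

struct msg {
    enum msg_kind kind;
    int value;          /* primitive type, or signal number for MSG_SIG */
};

/* Hypothetical model of a stream head read queue. */
struct readq {
    const struct msg *msgs;     /* messages in arrival order */
    size_t count;               /* number of messages in msgs */
    size_t head;                /* index of the message at the head */
    int last_signal;            /* most recent signal generated, 0 if none */
};

/* Turn every MSG_SIG that is at the head into a signal and remove it. */
static void readq_drain_signals(struct readq *q)
{
    while (q->head < q->count && q->msgs[q->head].kind == MSG_SIG) {
        q->last_signal = q->msgs[q->head].value;
        q->head++;
    }
}

/*
 * Return the next message for the reader, or NULL if the queue is empty.
 * A MSG_SIG is never returned: it fires as soon as it reaches the head.
 */
const struct msg *readq_next(struct readq *q)
{
    const struct msg *m;

    assert(q != NULL);
    readq_drain_signals(q);     /* covers a signal at the very front */
    if (q->head == q->count)
        return NULL;
    m = &q->msgs[q->head++];
    readq_drain_signals(q);     /* a signal now at the head fires */
    return m;
}
```

In this model, once the message ahead of an M_SIG has been consumed, the signal is recorded immediately and the message that followed the M_SIG is the next one the reader sees.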