System Interface Guide

Chapter 8 Real-time Programming and Administration

This chapter describes writing and porting real-time applications to run under Solaris SunOS 5.0 through 5.8. This chapter is written for programmers experienced in writing real-time applications and administrators familiar with real-time processing and the Solaris system.

Basic Rules of Real-time Applications

Real-time response is guaranteed when certain conditions are met. This section identifies these conditions and some of the more significant design errors that can cause problems or disable a system.

Most of the potential problems described here can degrade the response time of the system. One of the potential problems can freeze a workstation. Other, more subtle, mistakes are priority inversion and system overload.

A Solaris real-time process:

  - Runs in the RT scheduling class, as described in "Scheduling"

  - Locks down all of the memory in its process address space, as described in "Memory Locking"

  - Is a process from which all dynamic binding is resolved early, as described in "Shared Libraries"

Real-time operations are described in this chapter in terms of single-threaded processes, but the description can also apply to multithreaded processes (for detailed information about multithreaded processes, see the Multithreaded Programming Guide). To guarantee real-time scheduling of a thread, it must be created as a bound thread, and the thread's LWP must be run in the RT scheduling class. The locking of memory and early dynamic binding is effective for all threads in a process.

When a process is the highest priority real-time process, it:

  - Acquires the processor within the guaranteed dispatch latency period of becoming runnable (see "Dispatch Latency")

  - Continues to run for as long as it remains the highest priority runnable process

A real-time process can lose control of the processor or can be unable to gain control of the processor because of other events on the system. These events include external events (such as interrupts), resource starvation, waiting on external events (synchronous I/O), and preemption by a higher priority process.

Real-time scheduling generally does not apply to system initialization and termination services such as open(2) and close(2).

Degrading Response Time

The problems described in this section all increase the response time of the system to varying extents. The degradation can be serious enough to cause an application to miss a critical deadline.

Real-time processing can also significantly impact the operation of aspects of other applications active on a system running a real-time application. Since real-time processes have higher priority, time-sharing processes can be prevented from running for significant amounts of time. This can cause interactive activities, such as displays and keyboard response time, to be noticeably slowed.

System Response Time

System response under SunOS 5.0 through 5.8 provides no bounds on the timing of I/O events. This means that synchronous I/O calls should never be included in any program segment whose execution is time-critical. Even program segments that permit very large time bounds must not perform synchronous mass-storage I/O: a read or write operation suspends the process while the operation takes place.

A common application mistake is to perform I/O to get error message text from disk. This should be done from an independent nonreal-time process or thread.

Interrupt Servicing

Interrupt priorities are independent of process priorities. Prioritizing processes does not carry through to prioritizing the services of hardware interrupts that result from the actions of the processes. This means that interrupt processing for a device controlled by a real-time process is not necessarily done before interrupt processing for another device controlled by a timeshare process.

Shared Libraries

Time-sharing processes can save significant amounts of memory by using dynamically linked, shared libraries. This type of linking is implemented through a form of file mapping. Because dynamically linked library routines are brought into memory on first reference, calling them can cause implicit, unpredictable reads.

Real-time programs can use shared libraries, yet avoid dynamic binding, by setting the environment variable LD_BIND_NOW to a non-null value when the program is invoked. This forces all dynamic linking to be resolved before the program begins execution. See the Linker and Libraries Guide for more information.

Priority Inversion

A time-sharing process can block a real-time process by acquiring a resource that is required by a real-time process. Priority inversion is a condition that occurs when a higher priority process is blocked by a lower priority process. The term blocking describes a situation in which a process must wait for one or more processes to relinquish control of resources. If this blocking is prolonged, even for lower level resources, deadlines might be missed.

By way of illustration, consider the case in Figure 8-1: a high priority process that wants to use a shared resource is blocked because a lower priority process holds the resource, and that lower priority process is preempted by an intermediate priority process. The condition can persist arbitrarily long, because the time the high priority process must wait for the resource depends not only on the duration of the critical section being executed by the lower priority process, but also on how long the intermediate process runs before it blocks. Any number of intermediate processes can be involved.

Figure 8-1 Unbounded Priority Inversion

Graphic

This issue and the methods of dealing with it are described in "Mutual Exclusion Lock Attributes" in Multithreaded Programming Guide.

Sticky Locks

A page is permanently locked into memory when its lock count reaches 65535 (0xFFFF). The value 0xFFFF is implementation-defined and might change in future releases. Pages locked this way cannot be unlocked.

Runaway Real-time Processes

Runaway real-time processes can cause the system to halt or can slow the system response so much that the system appears to halt.


Note -

If you have a runaway process on a SPARC system, press Stop-A. You might have to do this more than once. If this does not work, or on non-SPARC systems, turn the power off, wait a moment, then turn it back on.


When a high priority real-time process does not relinquish control of the CPU, there is no simple way to regain control of the system until the infinite loop is forced to terminate. Such a runaway process does not respond to control-C. Attempts to use a shell set at a higher priority than that of a runaway process do not work.

I/O Behavior

Asynchronous I/O

There is no guarantee that asynchronous I/O operations will be done in the sequence in which they are queued to the kernel. Nor is there any guarantee that asynchronous operations will be returned to the caller in the sequence in which they were done.

If a single buffer is specified for a rapid sequence of calls to aioread(3AIO), there is no guarantee about the state of the buffer between the time that the first call is made and the time that the last result is signaled to the caller.

An individual aio_result_t structure can be used only for one asynchronous read or write at a time.

Real-time Files

SunOS 5.0 through 5.8 provides no facilities to ensure that files will be allocated as physically contiguous.

For regular files, the read(2) and write(2) operations are always buffered. An application can use mmap(2) and msync(3C) to effect direct I/O transfers between secondary storage and process memory.

Scheduling

Real-time scheduling constraints are necessary to manage data acquisition or process control hardware. The real-time environment requires that a process be able to react to external events in a bounded amount of time. Such constraints can exceed the capabilities of a kernel designed to provide a "fair" distribution of the processing resources to a set of time-sharing processes.

This section describes the SunOS 5.0 through 5.8 real-time scheduler, its priority queue, and how to use system calls and utilities that control scheduling.

Dispatch Latency

The most significant element in scheduling behavior for real-time applications is the provision of a real-time scheduling class. The standard time-sharing scheduling class is not suitable for real-time applications because this scheduling class treats every process equally and has a limited notion of priority. Real-time applications require a scheduling class in which process priorities are taken as absolute and are changed only by explicit application operations.

The term dispatch latency describes the amount of time it takes for a system to respond to a request for a process to begin operation. With a scheduler written specifically to honor application priorities, real-time applications can be developed with a bounded dispatch latency.

Figure 8-2 illustrates the amount of time it takes an application to respond to a request from an external event.

Figure 8-2 Application Response Time.

Graphic

The overall application response time is composed of the interrupt response time, the dispatch latency, and the time it takes the application itself to determine its response.

The interrupt response time for an application includes both the interrupt latency of the system and the device driver's own interrupt processing time. The interrupt latency is determined by the longest interval that the system must run with interrupts disabled; this is minimized in SunOS 5.0 through 5.8 using synchronization primitives that do not commonly require a raised processor interrupt level.

During interrupt processing, the driver's interrupt routine wakes up the high priority process and returns when finished. The system detects that a process with higher priority than the interrupted process is now dispatchable and arranges to dispatch that process. The time to switch context from a lower priority process to a higher priority process is included in the dispatch latency time.

Figure 8-3 illustrates the internal dispatch latency/application response time of a system, defined in terms of the amount of time it takes for a system to respond to an internal event. The dispatch latency of an internal event represents the amount of time required for one process to wake up another higher priority process, and for the system to dispatch the higher priority process.

The application response time is the amount of time it takes for a driver to wake up a higher priority process, have a low priority process release resources, reschedule the higher priority task, calculate the response, and dispatch the task.


Note -

Interrupts can arrive and be processed during the dispatch latency interval. This processing increases the application response time, but is not attributed to the dispatch latency measurement, and so is not bounded by the dispatch latency guarantee.


Figure 8-3 Internal Dispatch Latency

Graphic

With the new scheduling techniques provided with real-time SunOS 5.0 through 5.8, the system dispatch latency time is within specified bounds. As you can see in the table below, dispatch latency improves with a bounded number of processes.

Table 8-1 Real-time System Dispatch Latency with SunOS 5.0 through 5.8

  Workstation       Bounded Number of Processes          Arbitrary Number of Processes
  SPARCstation 2    <0.5 milliseconds in a system with   1.0 milliseconds
                    fewer than 16 active processes
  SPARCstation 5    <0.3 millisecond                     0.3 millisecond
  Ultra 1-167       <0.15 millisecond                    <0.15 millisecond

Tests for dispatch latency and experience with such critical environments as manufacturing and data acquisition have proven that SunOS 5.8 is an effective platform for the development of real-time applications. (These examples are not of current products.)

Scheduling Classes

The SunOS 5.0 through 5.8 kernel dispatches processes by priority. The scheduler (or dispatcher) supports the concept of scheduling classes. Classes are defined as Real-time (RT), System (sys), and Time-Sharing (TS). Each class has a unique scheduling policy for dispatching processes within its class.

The kernel dispatches highest priority processes first. By default, real-time processes have precedence over sys and TS processes, but administrators can configure systems so that TS and RT processes have overlapping priorities.

Figure 8-4 illustrates the concept of classes as viewed by the SunOS 5.0 through 5.8 kernel.

Figure 8-4 Dispatch Priorities for Scheduling Classes

Graphic

At highest priority are the hardware interrupts; these cannot be controlled by software. The interrupt processing routines are dispatched directly and immediately from interrupts, without regard to the priority of the current process.

Real-time processes have the highest default software priority. Processes in the RT class have a priority and time quantum value. RT processes are scheduled strictly on the basis of these parameters. As long as an RT process is ready to run, no SYS or TS process can run. Fixed priority scheduling allows critical processes to run in a predetermined order until completion. These priorities never change unless an application changes them.

An RT class process inherits the parent's time quantum, whether finite or infinite. A process with a finite time quantum runs until the time quantum expires or the process terminates, blocks (while waiting for an I/O event), or is preempted by a higher priority runnable real-time process. A process with an infinite time quantum ceases execution only when it terminates, blocks, or is preempted.

The SYS class exists to schedule the execution of special system processes, such as paging, STREAMS, and the swapper. It is not possible to change the class of a process to the SYS class. The SYS class of processes has fixed priorities established by the kernel when the processes are started.

At lowest priority are the time-sharing (TS) processes. TS class processes are scheduled dynamically, with a few hundred milliseconds for each time slice. The TS scheduler switches context in round-robin fashion often enough to give every process an equal opportunity to run, depending upon its time slice value, its process history (when the process was last put to sleep), and considerations for CPU utilization. Default time-sharing policy gives larger time slices to processes with lower priority.

A child process inherits the scheduling class and attributes of the parent process through fork(2). A process' scheduling class and attributes are unchanged by exec(2).

Different algorithms dispatch each scheduling class. Class dependent routines are called by the kernel to make decisions about CPU process scheduling. The kernel is class-independent, and takes the highest priority process off its queue. Each class is responsible for calculating a process' priority value for its class. This value is placed into the dispatch priority variable of that process.

As Figure 8-5 illustrates, each class algorithm has its own method of nominating the highest priority process to place on the global run queue.

Figure 8-5 The Kernel Dispatch Queue

Graphic

Each class has a set of priority levels that apply to processes in that class. A class-specific mapping maps these priorities into a set of global priorities. A class's set of global priorities is not required to start at zero, nor to be contiguous.

By default, the global priority values for time-sharing (TS) processes range from -20 to +20, mapped into the kernel from 0-40, with temporary assignments as high as 99. The default priorities for real-time (RT) processes range from 0-59, and are mapped into the kernel from 100 to 159. The kernel's class-independent code runs the process with the highest global priority on the queue.

Dispatch Queue

The dispatch queue is a linear-linked list of processes with the same global priority. Each process is invoked with class specific information attached to it. A process is dispatched from the kernel dispatch table based upon its global priority.

Dispatching Processes

When a process is dispatched, the process' context is mapped into memory along with its memory management information, its registers, and its stack. Then execution begins. Memory management information is in the form of hardware registers containing data needed to perform virtual memory translations for the currently running process.

Preemption

When a higher priority process becomes dispatchable, the kernel interrupts its computation and forces the context switch, preempting the currently running process. A process can be preempted at any time if the kernel finds that a higher priority process is now dispatchable.

For example, suppose that process A performs a read from a peripheral device. Process A is put into the sleep state by the kernel. The kernel then finds that a lower priority process B is runnable, so process B is dispatched and begins execution. Eventually, the peripheral device interrupts, and the driver of the device is entered. The device driver makes process A runnable and returns. Rather than returning to the interrupted process B, the kernel now preempts B from processing and resumes execution of the awakened process A.

Another interesting situation occurs when several processes contend for kernel resources. When a lower priority process releases a resource for which a higher priority real-time process is waiting, the kernel immediately preempts the lower priority process and resumes execution of the higher priority process.

Kernel Priority Inversion

Priority inversion occurs when a higher priority process is blocked by one or more lower priority processes for a long time. The use of synchronization primitives such as mutual-exclusion locks in the SunOS 5.0 through 5.8 kernel can lead to priority inversion.

A process is blocked when it must wait for one or more processes to relinquish resources. If blocking continues, it can lead to missed deadlines, even for low levels of utilization.

The problem of priority inversion has been addressed for mutual-exclusion locks for the SunOS 5.0 through 5.8 kernel by implementing a basic priority inheritance policy. The policy states that a lower priority process inherits the priority of a higher priority process when the lower priority process blocks the execution of the higher priority process. This places an upper bound on the amount of time a process can remain blocked. The policy is a property of the kernel's behavior, not a solution that a programmer institutes through system calls or function execution. User-level processes can still exhibit priority inversion, however.

User Priority Inversion

This issue and the means to deal with it are discussed in "Mutual Exclusion Lock Attributes" in Multithreaded Programming Guide.

Function Calls That Control Scheduling

priocntl(2)

Control over scheduling of active classes is done with priocntl(2). Class attributes are inherited through fork(2) and exec(2), along with scheduling parameters and permissions required for priority control. This is true for both the RT and the TS classes.

The priocntl(2) function is the interface for specifying a real-time process, a set of processes, or a class to which the system call applies. priocntlset(2) also provides the more general interface for specifying an entire set of processes to which the system call applies.

The command argument of priocntl(2) can be one of PC_GETCID, PC_GETCLINFO, PC_GETPARMS, or PC_SETPARMS. The real or effective ID of the calling process must match the real or effective ID of the affected processes, or the calling process must have superuser privilege.

PC_GETCID

This command takes the name field of a structure that contains a recognizable class name (RT for real-time and TS for time-sharing). The class ID and an array of class attribute data are returned.  

PC_GETCLINFO

This command takes the ID field of a structure that contains a recognizable class identifier. The class name and an array of class attribute data are returned.  

PC_GETPARMS

This command returns the scheduling class identifier and/or the class-specific scheduling parameters of one of the specified processes. Even though idtype and id might specify a large set of processes, PC_GETPARMS returns the parameters of only one process; the class determines which one.

PC_SETPARMS

This command sets the scheduling class and/or the class specific scheduling parameters of the specified process or processes.  

sched_get_priority_max(3RT)

Returns the maximum priority value for the specified policy.

sched_get_priority_min(3RT)

Returns the minimum priority value for the specified policy (see sched_get_priority_max(3RT)).

sched_rr_get_interval(3RT)

Updates the specified timespec structure to the current execution time limit (see sched_get_priority_max(3RT)).

sched_setparam(3RT), sched_getparam(3RT)

Sets or gets the scheduling parameters of the specified process.

sched_yield(3RT)

Yields the processor: the calling process moves to the end of its priority queue and does not run again until it returns to the head of the process list.

Utilities That Control Scheduling

The administrative utilities that control process scheduling are dispadmin(1M) and priocntl(1). Both utilities use the priocntl(2) system call and its loadable-module support, with compatible options. These utilities provide system administration functions for controlling real-time process scheduling during runtime.

priocntl(1)

The priocntl(1) command sets and retrieves scheduler parameters for processes.

dispadmin(1M)

The dispadmin(1M) utility lists the currently configured process scheduling classes when invoked with the -l option. Scheduling parameters can also be changed at runtime for the class specified with the -c option, using RT as the argument for the real-time class.

The options shown in Table 8-2 are also available.

Table 8-2 Class Options for the dispadmin(1M) Utility

  Option   Meaning

  -l       Lists scheduler classes currently configured
  -c       Specifies the class whose parameters are to be displayed or changed
  -g       Gets the dispatch parameters for the specified class
  -r       Used with -g, specifies time quantum resolution
  -s       Specifies a file where values can be located

A class-specific file containing the dispatch parameters can also be loaded during runtime. Use this file to establish a new set of priorities that replace the default values established at boot time. The file must present the parameters in the format produced by the -g option. Parameters for the RT class are described in rt_dptbl(4) and are listed in the example at the end of this section.

To add an RT class dispatch table to the system, take the following steps:

  1. Load the class specific module with the following command, where module_name is the class specific module:


    # modload /kernel/sched/module_name
    

  2. Invoke the dispadmin(1M) command:


    # dispadmin -c RT -s file_name
    

    The file must describe a table with the same number of entries as the table that is being overwritten.
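For illustration, a configuration file in the -g output format might look like the fragment below. The RES line and the column layout follow rt_dptbl(4), but treat the exact values as an assumption; the file must contain one quantum line for each of the table's priority levels:

```
# Real Time Dispatcher Configuration
RES=1000

# TIME QUANTUM                    PRIORITY
# (rt_quantum)                      LEVEL
       100                    #        0
       100                    #        1
       ...                    #      ...
        10                    #       59
```

RES gives the resolution (here, quanta expressed in milliseconds); dispadmin -r can be used to select a different resolution when displaying the table.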

Configuring Scheduling

Associated with both scheduling classes is a parameter table, rt_dptbl(4), and ts_dptbl(4). These tables are configurable by using a loadable module at boot time, or with dispadmin(1M) during runtime.

Dispatcher Parameter Table

The in-core table for real-time establishes the properties for RT scheduling. The rt_dptbl(4) structure consists of an array of parameters, struct rt_dpent_t, one for each of the n priority levels. The properties of a given priority level are specified by the ith parameter structure in the array, rt_dptbl[i].

A parameter structure consists of the following members (also described in the /usr/include/sys/rt.h header file).

rt_globpri

The global scheduling priority associated with this priority level. The rt_globpri values cannot be changed with dispadmin(1M).

rt_quantum

The length, in ticks, of the time quantum allocated to processes at this level (see "Timestamp Functions"). The time quantum value is only a default or starting value for processes at a particular level. The time quantum of a real-time process can be changed with the priocntl(1) command or the priocntl(2) system call.

Reconfiguring config_rt_dptbl

A real-time administrator can change the behavior of the real-time portion of the scheduler by reconfiguring the config_rt_dptbl at any time. One method is described in rt_dptbl(4) in the section titled "REPLACING THE RT_DPTBL LOADABLE MODULE."

A second method for examining or modifying the real-time parameter table on a running system is through using the dispadmin(1M) command. Invoking dispadmin(1M) for the real-time class allows retrieval of the current rt_quantum values in the current config_rt_dptbl configuration from the kernel's in-core table. When overwriting the current in-core table, the configuration file used for input to dispadmin(1M) must conform to the specific format described in rt_dptbl(4).

Following is an example of prioritized rtdpent_t entries with their associated time quantum values, as they might appear in config_rt_dptbl[]:

rtdpent_t  rt_dptbl[] = {
	/* prilevel  Time quantum */	/* prilevel  Time quantum */
	100,	100,			130,	40,
	101,	100,			131,	40,
	102,	100,			132,	40,
	103,	100,			133,	40,
	104,	100,			134,	40,
	105,	100,			135,	40,
	106,	100,			136,	40,
	107,	100,			137,	40,
	108,	100,			138,	40,
	109,	100,			139,	40,
	110,	80,			140,	20,
	111,	80,			141,	20,
	112,	80,			142,	20,
	113,	80,			143,	20,
	114,	80,			144,	20,
	115,	80,			145,	20,
	116,	80,			146,	20,
	117,	80,			147,	20,
	118,	80,			148,	20,
	119,	80,			149,	20,
	120,	60,			150,	10,
	121,	60,			151,	10,
	122,	60,			152,	10,
	123,	60,			153,	10,
	124,	60,			154,	10,
	125,	60,			155,	10,
	126,	60,			156,	10,
	127,	60,			157,	10,
	128,	60,			158,	10,
	129,	60,			159,	10,
};

Memory Locking

Locking memory is one of the most important issues for real-time applications. In a real-time environment, a process must be able to guarantee continuous memory residence to reduce latency and to prevent paging and swapping.

This section describes the memory locking mechanisms available to real-time applications in SunOS 5.0 through 5.8.

Overview

Under SunOS 5.0 through 5.8, the memory residency of a process is determined by its current state, the total available physical memory, the number of active processes, and the processes' demand for memory. This is appropriate in a time-share environment, but it is often unacceptable for a real-time process. In a real-time environment, a process must guarantee a memory residence for all or part of itself to reduce its memory access and dispatch latency.

For real-time in SunOS 5.0 through 5.8, memory locking is provided by a set of library routines that allow a process running with superuser privileges to lock specified portions of its virtual address space into physical memory. Pages locked in this manner are exempt from paging until they are unlocked or the process exits.

There is a system-wide limit on the number of pages that can be locked at any time. This is a tunable parameter whose default value is calculated at boot time, based on the number of page frames less a percentage (currently set at ten percent).

Locking a Page

A call to mlock(3C) requests that one segment of memory be locked into the system's physical memory. The pages that make up the specified segment are faulted in and the lock count of each is incremented. Any page with a lock count greater than 0 is exempt from paging activity.

A particular page can be locked multiple times by multiple processes through different mappings. If two different processes lock the same page, the page remains locked until both processes remove their locks. However, within a given mapping, page locks do not nest. Multiple calls of locking functions on the same address by the same process are removed by a single unlock request.

If the mapping through which a lock has been performed is removed, the memory segment is implicitly unlocked. When a page is deleted through closing or truncating the file, it is also unlocked implicitly.

Locks are not inherited by a child process after a fork(2) call. So, if a process with locked memory forks a child, the child must perform a memory locking operation on its own behalf to lock its own pages. Otherwise, the child process incurs copy-on-write page faults, the usual penalties associated with forking a process.

Unlocking a Page

To unlock a page of memory, a process requests that a segment of locked virtual pages be released by a call to munlock(3C). The lock counts of the specified physical pages are decremented. Once the lock count of a page has been decremented to 0, the page is swapped normally.

Locking All Pages

A superuser process can request that all mappings within its address space be locked by a call to mlockall(3C). If the flag MCL_CURRENT is set, all the existing memory mappings are locked. If the flag MCL_FUTURE is set, every mapping that is added to or that replaces an existing mapping is locked into memory.

Sticky Locks

A page is permanently locked into memory when its lock count reaches 65535 (0xFFFF). The value 0xFFFF is implementation defined and might change in future releases. Pages locked in this manner cannot be unlocked. Reboot the system to recover.

High Performance I/O

This section describes I/O with real-time processes. In SunOS 5.0 through 5.8, the libraries supply two sets of functions and calls to perform fast, asynchronous I/O operations. The POSIX asynchronous I/O interfaces are the newer standard. For robustness, SunOS also provides file and in-memory synchronization operations and modes to prevent information loss and data inconsistency.

Standard UNIX I/O is synchronous to the application programmer. An application that calls read(2) or write(2) usually waits until the system call has finished.

Real-time applications need asynchronous, bounded I/O behavior. A process that issues an asynchronous I/O call proceeds without waiting for the I/O operation to complete; the caller is notified when the operation has finished. In the meantime, the process can do something useful.

Asynchronous I/O can be used with any SunOS file. Files are opened in the usual synchronous way; no special flag is required. An asynchronous I/O transfer has three elements: call, request, and operation. The application calls an asynchronous I/O function, the request for the I/O is placed on a queue, and the call returns immediately. At some point, the system dequeues the request and initiates the I/O operation.

Asynchronous and standard I/O requests can be intermingled on any file descriptor. The system maintains no particular sequence of read and write requests; it can arbitrarily resequence all pending read and write requests. If a specific sequence is required, the application must ensure the completion of prior operations before issuing the dependent requests.

POSIX Asynchronous I/O

POSIX asynchronous I/O is performed using aiocb structures. An aiocb control block identifies each asynchronous I/O request and contains all of the controlling information. A control block can be used for only one request at a time and can be reused after its request has been completed.

A typical POSIX asynchronous I/O operation is initiated by a call to aio_read(3RT) or aio_write(3RT). Either polling or signals can be used to determine the completion of an operation. If signals are used for operation completion, each operation can be uniquely tagged and the tag is returned in the si_value component of the generated signal (see siginfo(3HEAD)).

aio_read(3RT)

aio_read(3RT) is called with an asynchronous I/O control block to initiate a read operation.

aio_write(3RT)

aio_write(3RT) is called with an asynchronous I/O control block to initiate a write operation.

aio_return(3RT) and aio_error(3RT)

aio_return(3RT) and aio_error(3RT) are called to obtain return and error values, respectively, after an operation is known to have been completed.

aio_cancel(3RT)

aio_cancel(3RT) is called with an asynchronous I/O control block to cancel pending operations. It can be used to cancel a specific request, if the control block specifies one, or all of the requests pending for the specified file descriptor.

aio_fsync(3RT)

aio_fsync(3RT) queues an asynchronous fsync(3C) or fdatasync(3RT) request for all of the pending I/O operations on the specified file.

aio_suspend(3RT)

aio_suspend(3RT) suspends the caller as though one or more of the preceding asynchronous I/O requests had been made synchronously.

Solaris Asynchronous I/O

Notification (SIGIO)

When an asynchronous I/O call returns successfully, the I/O operation has only been queued, waiting to be done. The actual operation also has a return value and a potential error identifier, the values that would have been returned to the caller as the result of a synchronous call. When the I/O is finished, the return value and error value are stored at a location given by the user at the time of the request as a pointer to an aio_result_t. The structure of the aio_result_t is defined in <sys/asynch.h>:

typedef struct aio_result_t {
	ssize_t	aio_return;	/* return value of read or write */
	int	aio_errno;	/* errno generated by the IO */
} aio_result_t;

When aio_result_t has been updated, a SIGIO signal is delivered to the process that made the I/O request.

Note that a process with two or more asynchronous I/O operations pending has no certain way to determine which request, if any, is the cause of the SIGIO signal. A process receiving a SIGIO should check all its conditions that could be generating the SIGIO signal.

aioread(3AIO)

aioread(3AIO) is the asynchronous version of read(2). In addition to the normal read arguments, aioread(3AIO) takes arguments specifying a file position and the address of an aio_result_t structure in which the system stores the result information about the operation. The file position specifies a seek to be performed within the file before the operation. Whether the aioread(3AIO) call succeeds or fails, the file pointer is updated.

aiowrite(3AIO)

aiowrite(3AIO) is the asynchronous version of write(2). In addition to the normal write arguments, aiowrite(3AIO) takes arguments specifying a file position and the address of an aio_result_t structure in which the system is to store the resulting information about the operation.

The file position specifies a seek to be performed within the file before the operation. If the aiowrite(3AIO) call succeeds, the file pointer is updated to the position that would have resulted in a successful seek and write. The file pointer is also updated when a write fails to allow for subsequent write requests.

aiocancel(3AIO)

aiocancel(3AIO) attempts to cancel the asynchronous request whose aio_result_t structure is given as an argument. An aiocancel(3AIO) call succeeds only if the request is still queued. If the operation is in progress, aiocancel(3AIO) fails.

aiowait(3AIO)

A call to aiowait(3AIO) blocks the calling process until at least one outstanding asynchronous I/O operation is completed. The timeout parameter points to a maximum interval to wait for I/O completion. A timeout value of zero specifies that no wait is wanted. aiowait(3AIO) returns a pointer to the aio_result_t structure for the completed operation.

poll(2)

To synchronously determine the completion of an asynchronous I/O event rather than depend on a SIGIO interrupt, use poll(2). You can also poll to determine the origin of a SIGIO interrupt.

Use of poll(2) for very large numbers of file descriptors is slow. This problem is addressed by poll(7D).

poll(7D)

/dev/poll provides a highly scalable way of polling a large number of file descriptors. This is provided through a new set of APIs and a new driver, /dev/poll. The /dev/poll API is an alternative to, not a replacement for, poll(2). poll(7D) provides details and examples of the /dev/poll API. When used properly, the /dev/poll API scales much better than poll(2). It is especially suited for applications that repeatedly poll a large number of file descriptors that remain relatively stable between polls.

close(2)

Files are closed by calling close(2). close(2) cancels any outstanding asynchronous I/O requests that can be cancelled and waits for any operation that cannot be cancelled (see "aiocancel(3AIO)"). When close(2) returns, there is no asynchronous I/O pending for the file descriptor. Only asynchronous I/O requests queued to the specified file descriptor are cancelled when a file is closed. Any pending I/O requests for other file descriptors are not cancelled.

Synchronized I/O

Applications might need to guarantee that information has been written to stable storage, or that file updates are performed in a particular order. Synchronized I/O provides for these needs.

Modes of Synchronization

Under SunOS 5.0 through 5.8, for a write operation, data is successfully transferred to a file when the system ensures that all written data is readable after any subsequent open of the file in the absence of a failure of the physical storage medium, even an open that follows a system or power failure. For a read operation, data is successfully transferred when an image of the data on the physical storage medium is available to the requesting process. An I/O operation is complete when either the associated data has been successfully transferred or the operation has been diagnosed as unsuccessful.

An I/O operation has reached synchronized I/O data integrity completion when:

For reads, the operation has been completed or diagnosed unsuccessful. The read is complete only when an image of the data has been successfully transferred to the requesting process. If there were any pending write requests affecting the data to be read at the time that the synchronized read operation was requested, these write requests are successfully transferred prior to reading the data.

For writes, the operation has been completed or diagnosed unsuccessful. The write is complete only when the data specified in the write request is successfully transferred, and all file system information required to retrieve the data is successfully transferred.

File attributes that are not necessary for data retrieval (access time, modification time, status change time) are not transferred prior to returning to the calling process.

Synchronized I/O file integrity completion is identical to synchronized I/O data integrity completion with the addition that all file attributes relative to the I/O operation (including access time, modification time, status change time) must be successfully transferred prior to returning to the calling process.

Synchronizing a File

The fsync(3C) and fdatasync(3RT) functions explicitly synchronize a file to secondary storage.

fsync(3C) synchronizes the file at the I/O file integrity completion level, while fdatasync(3RT) synchronizes the file at the I/O data integrity completion level.

Applications can synchronize each I/O operation before the operation completes. Setting the O_DSYNC flag on the file description (with open(2) or fcntl(2)) ensures that all I/O writes (write(2) and aiowrite(3AIO)) have reached I/O data integrity completion before the operation is indicated as completed. Setting the O_SYNC flag on the file description ensures that all I/O writes have reached I/O file integrity completion before the operation is indicated as completed. Setting the O_RSYNC flag on the file description ensures that all I/O reads (read(2) and aio_read(3RT)) have reached the same level of completion that is requested for writes by the setting of O_DSYNC or O_SYNC on the descriptor.

Interprocess Communication

This section describes the interprocess communication (IPC) functions of SunOS 5.0 through 5.8 as they relate to real-time processing. Signals, pipes, FIFOs (named pipes), message queues, shared memory, file mapping, and semaphores are described here. For more information about the libraries, functions, and routines useful for interprocess communication, see Chapter 7, Interprocess Communication.

Overview

Real-time processing often requires fast, high-bandwidth interprocess communication. The choice of which mechanisms should be used can be dictated by functional requirements, and the relative performance will depend upon application behavior.

The traditional method of interprocess communication in UNIX is the pipe. Unfortunately, pipes can have framing problems. Messages can become intermingled by multiple writers or torn apart by multiple readers.

IPC messages mimic the reading and writing of files. They are easier to use than pipes when more than two processes must communicate by using a single medium.

The IPC shared semaphore facility provides process synchronization. Shared memory is the fastest form of interprocess communication. The main advantage of shared memory is that the copying of message data is eliminated. The usual mechanism for synchronizing shared memory access is semaphores.

Signals

Signals can be used to send a small amount of information between processes. The sender can use sigqueue(3RT) to send a signal together with a small amount of information to a target process.

The target process must have the SA_SIGINFO bit set for the specified signal (see sigaction(2)) if subsequent occurrences of a pending signal are also to be queued.

The target process can receive signals either synchronously or asynchronously. Blocking a signal (see sigprocmask(2)) and calling either sigwaitinfo(3RT) or sigtimedwait(3RT), causes the signal to be received synchronously, with the value sent by the caller of sigqueue(3RT) stored in the si_value member of the siginfo_t argument. Leaving the signal unblocked causes the signal to be delivered to the signal handler specified by sigaction(2), with the value appearing in the si_value of the siginfo_t argument to the handler.

Only a fixed number of signals with associated values can be sent by a process and remain undelivered. Storage for {SIGQUEUE_MAX} signals is allocated at the first call to sigqueue(3RT). Thereafter, a call to sigqueue(3RT) either successfully enqueues the signal at the target process or fails within a bounded amount of time.

Pipes

Pipes provide one-way communication between processes. Processes must have a common ancestor in order to communicate with pipes. Data passed through a pipe is treated as a conventional UNIX byte stream. See "Pipes" for more information about pipes.

Named Pipes

SunOS 5.0 through 5.8 provides named pipes or FIFOs. The FIFO is more flexible than the pipe because it is a named entity in a directory. Once created, a FIFO can be opened by any process that has legitimate access to it. Processes do not have to share a parent and there is no need for a parent to initiate the pipe and pass it to the descendants. See "Named Pipes" for more information.

Message Queues

Message queues provide another means of communicating between processes that also allows any number of processes to send and receive from a single message queue. Messages are passed as blocks of arbitrary size, not as byte streams. Message queues are provided in both System V and POSIX versions. See "System V Messages" and "POSIX Messages" for more information.

Semaphores

The semaphore is a mechanism for synchronizing access to shared resources. Semaphores are also provided in both System V and POSIX styles. The System V semaphores are very flexible, but heavyweight; the POSIX semaphores are quite lightweight. See "System V Semaphores" and "POSIX Semaphores" for more information.

Note that using semaphores can cause priority inversions unless these are explicitly avoided by the techniques mentioned earlier in this chapter.

Shared Memory

The fastest way for processes to communicate is directly, through a shared segment of memory. A common memory area is added to the address space of sharing processes. Applications use stores to send data and fetches to receive communicated data. SunOS 5.0 through 5.8 provides three mechanisms for shared memory: memory mapped files, described in "Memory Management Interfaces", System V IPC shared memory, and POSIX shared memory.

The major difficulty with shared memory is that results can be wrong when more than two processes are trying to read and write in it at the same time. See "Shared Memory Synchronization" for more information.

Memory Mapped Files

The mmap(2) interface connects a shared memory segment to the caller's address space. The caller specifies the shared segment by address and length. The caller must also specify access protection flags and how the mapped pages are managed. mmap(2) can also be used to map a file or a segment of a file to a process's memory. This technique is very convenient in some applications, but it is easy to forget that any store to the mapped file segment results in implicit I/O. This can make an otherwise bounded process have unpredictable response times. msync(3C) forces immediate or eventual copies of the specified memory segment to its permanent storage location(s). See "Memory Management Interfaces" for more information.

Fileless Memory Mapping

The zero special file, /dev/zero(4S), can be used to create an unnamed, zero initialized memory object. The length of the memory object is the least number of pages that contain the mapping. The object can be shared only by descendants of a common ancestor process.

System V IPC Shared Memory

A shmget(2) call can be used to create a shared memory segment or to obtain an existing shared memory segment. shmget(2) returns an identifier that is analogous to a file identifier. A call to shmat(2) makes the shared memory segment a virtual segment of the process memory much like mmap(2). See "System V Shared Memory".

POSIX Shared Memory

POSIX shared memory is a variation of System V shared memory and provides similar capabilities with some minor variations. See "POSIX Shared Memory" for more information.

Shared Memory Synchronization

In sharing memory, a portion of memory is mapped into the address space of one or more processes. No method of coordinating access is automatically provided, so nothing prevents two processes from writing to the same place in shared memory at the same time. Shared memory is therefore typically used with semaphores or another mechanism that synchronizes the processes. System V and POSIX semaphores can both be used for this purpose, as can the mutual exclusion locks, reader/writer locks, semaphores, and condition variables provided in the multithread library.

Choice of IPC and Synchronization Mechanisms

Applications can have specific functional requirements that determine which IPC mechanism to use. If one of several mechanisms can be used, the application writer determines which mechanism performs best for the application. The SunOS 5.0 through 5.8 interprocess communication facilities are sensitive to application behavior. Determine which mechanism provides the best response capabilities by measuring the throughput capacity of each mechanism for the particular combination of message sizes used in the application.

Asynchronous Networking

This section introduces asynchronous network communication, using sockets or the Transport-Level Interface (TLI) for real-time applications. Asynchronous networking with sockets is done by setting an open socket, of type SOCK_STREAM, to asynchronous and nonblocking (see "Asynchronous Socket I/O" in "Advanced Topics" in Network Interface Guide). Asynchronous network processing of TLI events is supported using a combination of STREAMS asynchronous features and the nonblocking mode of the TLI library routines (see "Asynchronous Networking" in Network Interface Guide).

For more information on the Transport-Level Interface, see "Socket Interfaces" in Network Interface Guide.

Modes of Networking

Both sockets and Transport-Level Interface provide two modes of service: connection-mode and connectionless-mode.

Connection-Mode Service

Connection-mode service is circuit-oriented and enables the transmission of data over an established connection in a reliable, sequenced manner. It also provides an identification procedure that avoids the overhead of address resolution and transmission during the data transfer phase. This service is attractive for applications that require relatively long-lived, datastream-oriented interactions.

Connectionless-Mode Service

Connectionless-mode service is message-oriented and supports data transfer in self-contained units with no logical relationship required among multiple units. All information required to deliver a unit of data, including the destination address, is passed by the sender to the transport provider, together with the data, in a single service request. Connectionless-mode service is attractive for applications that involve short-term request/response interactions and do not require guaranteed, in-sequence delivery of data. It is generally assumed that connectionless transports are unreliable.

Timers

This section describes the timing facilities available for real-time applications under SunOS 5.0 through 5.8. Real-time applications that use these mechanisms require detailed information from the manual pages of the routines listed in this section.

The timing functions of SunOS 5.0 through 5.8 fall into two separate areas of functionality: timestamps and interval timers. The timestamp functions provide a measure of elapsed time and allow the application to measure the duration of a state or the time between events. Interval timers allow an application to wake up at specified times and to schedule activities based on the passage of time. Although an application can poll a timestamp function to schedule itself, such an application would monopolize the processor to the detriment of other system functions.

Timestamp Functions

Two functions provide timestamps. The gettimeofday(3C) function provides the current time in a timeval structure, representing the time in seconds and microseconds since midnight, Greenwich Mean Time, on January 1, 1970. The clock_gettime(3RT) function, with a clockid of CLOCK_REALTIME, provides the current time in a timespec structure, representing in seconds and nanoseconds the same time interval returned by gettimeofday(3C).

SunOS 5.0 through 5.8 uses a hardware periodic timer. For some workstations, this is the sole timing information, and the accuracy of timestamps is limited to the resolution of that periodic timer. For other platforms, a timer register with a resolution of one microsecond allows SunOS 5.0 through 5.8 to provide timestamps accurate to one microsecond.

Interval Timer Functions

Real-time applications often schedule actions using interval timers. Interval timers can be either of two types: a one-shot type or a periodic type.

A one-shot timer is armed with an expiration time, either relative to the current time or at an absolute time. The timer expires once and is disarmed. Such a timer is useful for clearing buffers after the data has been transferred to storage, or for timing out an operation.

A periodic timer is armed with an initial expiration time (either absolute or relative) and a repetition interval. Each time the interval timer expires it is reloaded with the repetition interval and rearmed. This timer is useful for data logging or for servo-control. In calls to interval timer functions, time values smaller than the resolution of the system hardware periodic timer are rounded up to the next multiple of the hardware timer interval (typically 10 ms).

There are two sets of timer interfaces in SunOS 5.0 through 5.8. The setitimer(2) and getitimer(2) interfaces operate a fixed set of timers, called the BSD timers, using the timeval structure to specify time intervals. The POSIX timers, created with timer_create(3RT), operate on the POSIX clock, CLOCK_REALTIME. POSIX timer operations are expressed in terms of the timespec structure.

The functions getitimer(2) and setitimer(2) retrieve and establish, respectively, the value of the specified BSD interval timer. There are three BSD interval timers available to a process, including a real-time timer designated ITIMER_REAL. If a BSD timer is armed and allowed to expire, the system sends a signal appropriate to the timer to the process that set the timer.

timer_create(3RT) can create up to TIMER_MAX POSIX timers. The caller can specify what signal and what associated value are sent to the process when the timer expires. timer_settime(3RT) and timer_gettime(3RT) establish and retrieve, respectively, the value of the specified POSIX interval timer. Expirations of POSIX timers while the required signal is pending delivery are counted, and timer_getoverrun(3RT) retrieves the count of such expirations. timer_delete(3RT) deallocates a POSIX timer.

Example 8-1 illustrates how to use setitimer(2) to generate a periodic interrupt, and how to control the arrival of timer interrupts.


Example 8-1 Controlling Timer Interrupts

#include	<stdio.h>
#include	<unistd.h>
#include	<signal.h>
#include	<sys/time.h>

#define TIMERCNT 8

void timerhandler(int, siginfo_t *, void *);
void printtimes(void);
int	 timercnt;
struct	 timeval alarmtimes[TIMERCNT];

int
main(void)
{
	struct itimerval times;
	sigset_t	sigset;
	int		ret;
	struct sigaction act;
	siginfo_t	si;

	/* block SIGALRM */
	sigemptyset (&sigset);
	sigaddset (&sigset, SIGALRM);
	sigprocmask (SIG_BLOCK, &sigset, NULL);

	/* set up handler for SIGALRM */
	act.sa_sigaction = timerhandler;
	sigemptyset (&act.sa_mask);
	act.sa_flags = SA_SIGINFO;
	sigaction (SIGALRM, &act, NULL);
	/*
	 * set up interval timer, starting in three seconds,
	 *	then every 1/3 second
	 */
	times.it_value.tv_sec = 3;
	times.it_value.tv_usec = 0;
	times.it_interval.tv_sec = 0;
	times.it_interval.tv_usec = 333333;
	ret = setitimer (ITIMER_REAL, &times, NULL);
	printf ("main:setitimer ret = %d\n", ret);

	/* now wait for the alarms */
	sigemptyset (&sigset);
	timerhandler (0, &si, NULL);
	while (timercnt < TIMERCNT) {
		ret = sigsuspend (&sigset);
	}
	printtimes();
	return (0);
}

void
timerhandler (int sig, siginfo_t *siginfo, void *context)
{
	printf ("timerhandler:start\n");
	gettimeofday (&alarmtimes[timercnt], NULL);
	timercnt++;
	printf ("timerhandler:timercnt = %d\n", timercnt);
}

void
printtimes (void)
{
	int	i;

	for (i = 0; i < TIMERCNT; i++) {
		printf("%ld.%06ld\n", (long)alarmtimes[i].tv_sec,
				(long)alarmtimes[i].tv_usec);
	}
}