Transport Interfaces Programming Guide

Advanced Topics

For most programmers, the mechanisms already described are enough to build distributed applications. Others need some of the additional features in this section.

Out-of-Band Data

The stream socket abstraction includes out-of-band data. Out-of-band data is a logically independent transmission channel between a pair of connected stream sockets. Out-of-band data is delivered independent of normal data. The out-of-band data facilities must support the reliable delivery of at least one out-of-band message at a time. This message can contain at least one byte of data, and at least one message can be pending delivery at any time.

For communications protocols that support only in-band signaling (that is, urgent data is delivered in sequence with normal data), the message is extracted from the normal data stream and stored separately. This lets users choose between receiving the urgent data in order and receiving it out of sequence, without having to buffer the intervening data.

You can peek (with MSG_PEEK) at out-of-band data. If the socket has a process group, a SIGURG signal is generated when the protocol is notified of its existence. A process can set the process group or process id to be informed by SIGURG with the appropriate fcntl() call, as described in "Interrupt-Driven Socket I/O" for SIGIO. If multiple sockets have out-of-band data waiting delivery, a select() call for exceptional conditions can be used to determine the sockets with such data pending.

A logical mark is placed in the data stream at the point at which the out-of-band data was sent. The remote login and remote shell applications use this facility to propagate signals between client and server processes. When a signal is received, all data up to the mark in the data stream is discarded.

To send an out-of-band message, the MSG_OOB flag is applied to send() or sendto(). To receive out-of-band data, specify MSG_OOB to recvfrom() or recv() (unless out-of-band data is taken in line, in which case the MSG_OOB flag is not needed). The SIOCATMARK ioctl tells whether the read pointer currently points at the mark in the data stream:

int yes;
ioctl(s, SIOCATMARK, &yes);

If yes is 1 on return, the next read returns data after the mark. Otherwise, assuming out-of-band data has arrived, the next read provides data sent by the client before sending the out-of-band signal. The routine in the remote login process that flushes output on receipt of an interrupt or quit signal is shown in Example 2-13. This code reads the normal data up to the mark (to discard it), then reads the out-of-band byte.

A process can also read or peek at the out-of-band data without first reading up to the mark. This is more difficult when the underlying protocol delivers the urgent data in-band with the normal data, and only sends notification of its presence ahead of time (for example, TCP, the protocol used to provide socket streams in the Internet domain). With such protocols, the out-of-band byte might not yet have arrived when a recv() is done with the MSG_OOB flag. In that case, the call returns the error of EWOULDBLOCK. Also, there might be enough in-band data in the input buffer that normal flow control prevents the peer from sending the urgent data until the buffer is cleared. The process must then read enough of the queued data before the urgent data can be delivered.

Example 2-13 Flushing Terminal I/O on Receipt of Out-of-Band Data

#include <sys/ioctl.h>
#include <sys/file.h>
...
oob()
{
		int out = FWRITE;
		char waste[BUFSIZ];
		int mark = 0;
 
		/* flush local terminal output */
		ioctl(1, TIOCFLUSH, (char *) &out);
		while(1) {
			if (ioctl(rem, SIOCATMARK, &mark) == -1) {
				perror("ioctl");
				break;
			}
			if (mark)
				break;
			(void) read(rem, waste, sizeof waste);
		}
		if (recv(rem, &mark, 1, MSG_OOB) == -1) {
			perror("recv");
			...
		}
		...
}

There is also a facility to retain the position of urgent in-line data in the socket stream. This is available as a socket-level option, SO_OOBINLINE. See the getsockopt(3N) manpage for usage. With this option, the position of urgent data (the mark) is retained, but the urgent data immediately follows the mark in the normal data stream returned without the MSG_OOB flag. Reception of multiple urgent indications causes the mark to move, but no out-of-band data are lost.

Nonblocking Sockets

Some applications require sockets that do not block. For example, requests that cannot complete immediately and would cause the process to be suspended (awaiting completion) are not executed. An error code would be returned. After a socket is created and any connection to another socket is made, it can be made nonblocking by issuing a fcntl() call, as shown in Example 2-14.

Example 2-14 Set Nonblocking Socket

#include <fcntl.h>
#include <sys/file.h>
...
int fileflags;
int s;
...
s = socket(AF_INET, SOCK_STREAM, 0);
...
if (fileflags = fcntl(s, F_GETFL, 0) == -1)
		perror("fcntl F_GETFL");
		exit(1);
}
if (fcntl(s, F_SETFL, fileflags | FNDELAY) == -1)
		perror("fcntl F_SETFL, FNDELAY");
		exit(1);
}
...

When doing I/O on a nonblocking socket, check for the error EWOULDBLOCK (in errno.h), which occurs when an operation would normally block. accept(), connect(), send(), recv(), read(), and write() can all return EWOULDBLOCK. If an operation such as a send() cannot be done in its entirety, but partial writes work (such as when using a stream socket), the data that can be sent immediately are processed, and the return value is the amount actually sent.

Asynchronous Socket I/O

Asynchronous communication between processes is required in applications that handle multiple requests simultaneously. Asynchronous sockets must be SOCK_STREAM type. To make a socket asynchronous, you issue a fcntl() call, as shown in Example 2-15.

Example 2-15 Making a Socket Asynchronous

#include <fcntl.h>
#include <sys/file.h>
...
int fileflags;
int s;
...
s = socket(AF_INET, SOCK_STREAM, 0);
...
if (fileflags = fcntl(s, F_GETFL ) == -1)
		perror("fcntl F_GETFL");
		exit(1);
}
if (fcntl(s, F_SETFL, fileflags | FNDELAY | FASYNC) == -1)
		perror("fcntl F_SETFL, FNDELAY | FASYNC");
		exit(1);
}
...

After sockets are initialized, connected, and made asynchronous, communication is similar to reading and writing a file asynchronously. A send(), write(), recv(), or read() initiates a data transfer. A data transfer is completed by a signal-driven I/O routine, described in the next section.

Interrupt-Driven Socket I/O

The SIGIO signal notifies a process when a socket (actually any file descriptor) has finished a data transfer. The steps in using SIGIO are:

Set up a SIGIO signal handler with the signal() or sigvec() calls.
Use fcntl() to set the process ID or process group ID to route the signal to its own process ID or process group ID (the default process group of a socket is group 0).
Convert the socket to asynchronous, as shown in "Asynchronous Socket I/O".

Example 2-16 shows some sample code to allow a given process to receive information on pending requests as they occur for a socket. With the addition of a handler for SIGURG(), this code can also be used to prepare for receipt of SIGURG signals.

Example 2-16 Asynchronous Notification of I/O Requests

#include <fcntl.h>
#include <sys/file.h>
 ...
signal(SIGIO, io_handler);
/* Set the process receiving SIGIO/SIGURG signals to us. */
if (fcntl(s, F_SETOWN, getpid()) < 0) {
		perror("fcntl F_SETOWN");
		exit(1);
}

Signals and Process Group ID

For SIGURG and SIGIO, each socket has a process number and a process group ID. These values are initialized to zero, but can be redefined at a later time with the F_SETOWN fcntl(), as in the previous example. A positive third argument to fcntl() sets the socket's process ID. A negative third argument to fcntl() sets the socket's process group ID. The only allowed recipient of SIGURG and SIGIO signals is the calling process. A similar fcntl(), F_GETOWN, returns the process number of a socket.

Reception of SIGURG and SIGIO can also be enabled by using ioctl() to assign the socket to the user's process group:

/* oobdata is the out-of-band data handling routine */
sigset(SIGURG, oobdata);
int pid = -getpid();
if (ioctl(client, SIOCSPGRP, (char *) &pid) < 0) {
		perror("ioctl: SIOCSPGRP");
}

Another signal that is useful in server processes is SIGCHLD. This signal is delivered to a process when any child process changes state. Normally, servers use the signal to "reap" child processes that have exited without explicitly awaiting their termination or periodically polling for exit status. For example, the remote login server loop shown previously can be augmented as shown in Example 2-17.

Example 2-17 `SIGCHLD` Signal

int reaper();
...
sigset(SIGCHLD, reaper);
listen(f, 5);
while (1) {
		int g, len = sizeof from;
		g = accept(f, (struct sockaddr *) &from, &len);
		if (g < 0) {
			if (errno != EINTR)
				syslog(LOG_ERR, "rlogind: accept: %m");
			continue;
		}
		...
}
 
#include <wait.h>
 
reaper()
{
		int options;
		int error;
		siginfo_t info;
 
		options = WNOHANG | WEXITED;
		bzero((char *) &info, sizeof(info));
		error = waitid(P_ALL, 0, &info, options);
}

If the parent server process fails to reap its children, zombie processes result.

Selecting Specific Protocols

If the third argument of the socket() call is 0, socket() selects a default protocol to use with the returned socket of the type requested. The default protocol is usually correct, and alternate choices are not usually available. When using "raw" sockets to communicate directly with lower-level protocols or hardware interfaces, it may be important for the protocol argument to set up de-multiplexing. For example, raw sockets in the Internet domain can be used to implement a new protocol on IP, and the socket receives packets only for the protocol specified. To obtain a particular protocol, determine the protocol number as defined in the protocol domain. For the Internet domain, use one of the library routines discussed in "Standard Routines", such as getprotobyname():

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>
 ...
pp = getprotobyname("newtcp");
s = socket(AF_INET, SOCK_STREAM, pp->p_proto);

This results in a socket s using a stream-based connection, but with protocol type of newtcp instead of the default tcp.

Address Binding

TCP and UDP use a 4-tuple of local IP address, local port number, foreign IP address, and foreign port number to do their addressing. TCP requires these 4-tuples to be unique. UDP does not. It is unrealistic to expect user programs to always know proper values to use for the local address and local port, since a host can reside on multiple networks and the set of allocated port numbers is not directly accessible to a user. To avoid these problems, you can leave parts of the address unspecified and let the system assign the parts appropriately when needed. Various portions of these tuples may be specified by various parts of the sockets API.

bind(): local address or local port or both
connect(): foreign address and foreign port

A call to accept() retrieves connection information from a foreign client, so it causes the local address and port to be specified to the system (even though the caller of accept() didn't specify anything), and the foreign address and port to be returned.

A call to listen() can cause a local port to be chosen. If no explicit bind() has been done to assign local information, listen() causes an ephemeral port number to be assigned.

A service that resides at a particular port, but which does not care what local address is chosen, can bind() itself to its port and leave the local address unspecified (set to INADDR_ANY, a constant defined in <netinet/in.h>). If the local port need not be fixed, a call to listen() causes a port to be chosen. Specifying an address of INADDR_ANY or a port number of 0 is known as wildcarding.

The wildcard address simplifies local address binding in the Internet domain. The sample code below binds a specific port number, MYPORT, to a socket, and leaves the local address unspecified.

Example 2-18 Bind Port Number to Socket

#include <sys/types.h>
#include <netinet/in.h>
...
struct sockaddr_in sin;
...
		s = socket(AF_INET, SOCK_STREAM, 0);
		sin.sin_family = AF_INET;
		sin.sin_addr.s_addr = htonl(INADDR_ANY);
		sin.sin_port = htons(MYPORT);
		bind(s, (struct sockaddr *) &sin, sizeof sin);

Each network interface on a host typically has a unique IP address. Sockets with wildcard local addresses can receive messages directed to the specified port number and sent to any of the possible addresses assigned to a host. For example, if a host has two interfaces with addresses 128.32.0.4 and 10.0.0.78, and a socket is bound as in Example 2-18, the process can accept connection requests addressed to 128.32.0.4 or 10.0.0.78. To allow only hosts on a specific network to connect to it, a server binds the address of the interface on the appropriate network.

Similarly, a local port number can be left unspecified (specified as 0), in which case the system selects a port number. For example, to bind a specific local address to a socket, but to leave the local port number unspecified:

sin.sin_addr.s_addr = inet_addr("127.0.0.1");
sin.sin_family = AF_INET;
sin.sin_port = htons(0);
bind(s, (struct sockaddr *) &sin, sizeof sin);

The system uses two criteria to select the local port number:

The first is that Internet port numbers less than 1024 (IPPORT_RESERVED) are reserved for privileged users (that is, the superuser). Nonprivileged users can use any Internet port number greater than 1024. The largest Internet port number is 65535.
The second criterion is that the port number is not currently bound to some other socket.

The port number and IP address of the client is found through either accept() (the from result) or getpeername().

In certain cases, the algorithm used by the system to select port numbers is unsuitable for an application. This is because associations are created in a two-step process. For example, the Internet file transfer protocol specifies that data connections must always originate from the same local port. However, duplicate associations are avoided by connecting to different foreign ports. In this situation, the system would disallow binding the same local address and port number to a socket if a previous data connection's socket still existed. To override the default port selection algorithm, you must perform an option call before address binding:

 ...
int on = 1;
...
setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, sizeof on);
bind(s, (struct sockaddr *) &sin, sizeof sin);

With this call, local addresses already in use can be bound. This does not violate the uniqueness requirement, because the system still verifies at connect time that any other sockets with the same local address and port do not have the same foreign address and port. If the association already exists, the error EADDRINUSE is returned.

Broadcasting and Determining Network Configuration

Messages sent by datagram sockets can be broadcast to reach all of the hosts on an attached network. The network must support broadcast; the system provides no simulation of broadcast in software. Broadcast messages can place a high load on a network since they force every host on the network to service them. Broadcasting is usually used for either of two reasons: to find a resource on a local network without having its address, or functions like routing require that information be sent to all accessible neighbors.

To send a broadcast message, create an Internet datagram socket:

s = socket(AF_INET, SOCK_DGRAM, 0);

and bind a port number to the socket:

sin.sin_family = AF_INET;
sin.sin_addr.s_addr = htonl(INADDR_ANY);
sin.sin_port = htons(MYPORT);
bind(s, (struct sockaddr *) &sin, sizeof sin);

The datagram can be broadcast on only one network by sending to the network's broadcast address. A datagram can also be broadcast on all attached networks by sending to the special address INADDR_BROADCAST, defined in netinet/in.h.

The system provides a mechanism to determine a number of pieces of information (including the IP address and broadcast address) about the network interfaces on the system. The SIOCGIFCONF ioctl() call returns the interface configuration of a host in a single ifconf structure. This structure contains an array of ifreq structures, one for each address domain supported by each network interface to which the host is connected. Example 2-19 shows these structures defined in net/if.h.

Example 2-19 `net/if.h` Header File

struct ifreq {
#define IFNAMSIZ 16
char ifr_name[IFNAMSIZ]; /* if name, e.g., "en0" */
union {
		struct sockaddr ifru_addr;
		struct sockaddr ifru_dstaddr;
		char ifru_oname[IFNAMSIZ]; /* other if name */
		struct sockaddr ifru_broadaddr;
		short ifru_flags;
		int ifru_metric;
		char ifru_data[1]; /* interface dependent data */
		char ifru_enaddr[6];
} ifr_ifru;
#define ifr_addr ifr_ifru.ifru_addr
#define ifr_dstaddr ifr_ifru.ifru_dstaddr
#define ifr_oname ifr_ifru.ifru_oname
#define ifr_broadaddr ifr_ifru.ifru_broadaddr
#define ifr_flags ifr_ifru.ifru_flags
#define ifr_metric ifr_ifru.ifru_metric
#define ifr_data ifr_ifru.ifru_data
#define ifr_enaddr ifr_ifru.ifru_enaddr
};

The call that obtains the interface configuration is:

/*
 * Do SIOCGIFNUM ioctl to find the number of interfaces
 *
 * Allocate space for number of interfaces found
 *
 * Do SIOCGIFCONF with allocated buffer
 *
 */
if (ioctl(s, SIOCGIFNUM, (char *)&numifs) == -1) {
        numifs = MAXIFS;
}
bufsize = numifs * sizeof(struct ifreq);
reqbuf = (struct ifreq *)malloc(bufsize);
if (reqbuf == NULL) {
        fprintf(stderr, "out of memory\n");
        exit(1);
}
ifc.ifc_buf = (caddr_t)&reqbuf[0];
ifc.ifc_len = bufsize;
if (ioctl(s, SIOCGIFCONF, (char *)&ifc) == -1) {
        perror("ioctl(SIOCGIFCONF)");
        exit(1);
}
...
}

After this call, buf contains an array of ifreq structures, one for each network to which the host is connected. These structures are ordered first by interface name, then by supported address families. ifc.ifc_len is set to the number of bytes used by the ifreq structures.

Each structure has a set of interface flags that tell whether the corresponding network is up or down, point-to-point or broadcast, and so on. Example 2-20 shows the SIOCGIFFLAGS ioctl() returning these flags for an interface specified by an ifreq structure.

Example 2-20 Obtaining Interface Flags

struct ifreq *ifr;
ifr = ifc.ifc_req;
for (n = ifc.ifc_len/sizeof (struct ifreq); --n >= 0; ifr++) {
   /*
    * Be careful not to use an interface devoted to an address
    * domain other than those intended.
    */
   if (ifr->ifr_addr.sa_family != AF_INET)
      continue;
   if (ioctl(s, SIOCGIFFLAGS, (char *) ifr) < 0) {
      ...
   }
   /* Skip boring cases */
   if ((ifr->ifr_flags & IFF_UP) == 0 ||
      (ifr->ifr_flags & IFF_LOOPBACK) ||
      (ifr->ifr_flags & (IFF_BROADCAST | IFF_POINTOPOINT)) == 0)
      continue;
}

Example 2-21 shows the broadcast of an interface can be obtained with the SIOGGIFBRDADDR ioctl().

Example 2-21 Broadcast Address of an Interface

if (ioctl(s, SIOCGIFBRDADDR, (char *) ifr) < 0) {
		...
}
memcpy((char *) &dst, (char *) &ifr->ifr_broadaddr,
		sizeof ifr->ifr_broadaddr);

The SIOGGIFBRDADDR ioctl() can also be used to get the destination address of a point-to-point interface.

After the interface broadcast address is obtained, transmit the broadcast datagram with sendto():

sendto(s, buf, buflen, 0, (struct sockaddr *)&dst, sizeof dst);

Use one sendto() for each interface to which the host is connected that supports the broadcast or point-to-point addressing.

Zero Copy and Checksum Offload

In Solaris 2.6, the TCP/IP protocol stack has been enhanced to support two new features: zero copy and TCP checksum offload.

Zero copy uses virtual memory MMU remapping and a copy-on-write technique to move data between the application and the kernel space.
Checksum offloading relies on special hardware logic to offload the TCP checksum calculation.

Note -

Although zero copy and checksum offloading are functionally independent of one another, they have to work together to obtain the optimal performance. Checksum offloading requires hardware support from the network interface and, without this hardware support, zero copy is not enabled.

Zero copy requires that the applications supply page-aligned buffers before VM page remapping can be applied. Applications should use large, circular buffers on the transmit side to avoid expensive copy-on-write faults. A typical buffer allocation is sixteen 8k buffers.

Socket Options

You can set and get several options on sockets through setsockopt() and getsockopt(); for example by changing the send or receive buffer space. The general forms of the calls are:

setsockopt(s, level, optname, optval, optlen);

and

getsockopt(s, level, optname, optval, optlen);

Note -

In some cases, such as setting the buffer sizes, these are only hints to the operating system. The operating system reserves the right to adjust the values appropriately.

Table 2-4 shows the arguments of the calls.

Table 2-4 setsockopt() and getsockopt() Arguments


Arguments	Description
`s`	Socket on which the option is to be applied
`level`	Specifies the protocol level, such as socket level, indicated by the symbolic constant `SOL_SOCKET` in `sys/socket.h`
`optname`	Symbolic constant defined in `sys/socket.h` that specifies the option
`optval`	Points to the value of the option
`optlen`	Points to the length of the value of the option

For getsockopt(), optlen is a value-result argument, initially set to the size of the storage area pointed to by optval and set on return to the length of storage used.

It is sometimes useful to determine the type (for example, stream or datagram) of an existing socket. Programs invoked by inetd can do this by using the SO_TYPE socket option and the getsockopt() call:

#include <sys/types.h>
#include <sys/socket.h>
 
int type, size;
 
size = sizeof (int);
if (getsockopt(s, SOL_SOCKET, SO_TYPE, (char *) &type, &size) <0) {
 	...
}

After getsockopt(), type is set to the value of the socket type, as defined in sys/socket.h. For a datagram socket, type would be SOCK_DGRAM.

`inetd` Daemon

One of the daemons provided with the system is inetd. It is invoked at start-up time, and gets the services for which it listens from the /etc/inetd.conf file. The daemon creates one socket for each service listed in /etc/inetd.conf, binding the appropriate port number to each socket. See the inetd(1M) man page for details.

inetd polls each socket, waiting for a connection request to the service corresponding to that socket. For SOCK_STREAM type sockets, inetd does an accept() on the listening socket, fork()s, dup()s the new socket to file descriptors 0 and 1 (stdin and stdout), closes other open file descriptors, and exec()s the appropriate server.

The primary benefit of inetd is that services that are not in use are not taking up machine resources. A secondary benefit is that inetd does most of the work to establish a connection. The server started by inetd has the socket connected to its client on file descriptors 0 and 1, and can immediately read(), write(), send(), or recv(). Servers can use buffered I/O as provided by the stdio conventions, as long as they use fflush() when appropriate.

getpeername() returns the address of the peer (process) connected to a socket; it is useful in servers started by inetd. For example, to log the Internet address in decimal dot notation (such as 128.32.0.4, which is conventional for representing an IP address of a client), an inetd server could use the following:

struct sockaddr_in name;
int namelen = sizeof name;
 ...
if (getpeername(0, (struct sockaddr *) &name, &namelen) < 0) {
		syslog(LOG_ERR, "getpeername: %m");
		exit(1);
} else
		syslog(LOG_INFO, "Connection from %s",
			inet_ntoa(name.sin_addr));
 ...