man pages section 3: Library Interfaces and Headers

Updated: Wednesday, July 27, 2022
 
 

net_kernel_bypass (3LIB)

Name

net_kernel_bypass - kernel network data path bypass library

Description

net_kernel_bypass is a library that must be pre-loaded when you execute an application that needs to use the kernel network data path bypass capability. For more information about pre-loading, see the ld.so.1(1) man page. The net_kernel_bypass library interposes on the necessary socket calls so that the kernel network data path is bypassed. This means that applications require neither recompilation nor source changes to use the bypass capability. Currently, this library supports only UDP data path bypass. Asynchronous I/O is not supported on a bypass socket. For more information on asynchronous I/O, see the aio.h(3HEAD) man page.

This library tracks all UDP socket creation. At socket bind time (when an application calls bind(3C), connect(3C), sendto(3C), or setsockopt(3C) with IP_MULTICAST_IF, ...), the library determines whether the network interface to be used for data communication is capable of bypass. If it is, the library sets up the necessary resources. If it is not, the library stops tracking the socket, and all future operations on this socket go through the normal kernel socket path. All socket operations other than send and receive still go through the normal kernel socket path. The library tracks them to determine whether data path bypass can be set up. Note that an application needs the PRIV_SYS_NET_CONFIG privilege to perform kernel network data path bypass.

If data path bypass can be used, send and receive operations are done only in user context. The application thread that calls send(3C) puts the message in the device mapped memory for the network interface hardware to pick up and send out. When the network interface hardware receives a packet destined for a bypass socket, it uses DMA to place the packet in the application's mapped memory. When the application calls recv(3C), the application thread copies the packet content from the mapped memory to the supplied buffer. The size of the mapped memory is fixed and cannot be changed.

There are a number of limitations on using a data path bypass socket. A UDP socket that is bound with bind(3C) to a wildcard address cannot use data path bypass because no single network interface is specified. The exception is multicast traffic, because the network interface that is used to join a multicast group must be specified. This interface is later checked to see whether data path bypass is possible. A bypass socket can join only one multicast group. If a UDP socket is used to send to multiple addresses, only one address may use data path bypass, depending on the outgoing interface. Data sent to all other addresses goes through the normal socket path.
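As a minimal sketch of the wildcard restriction above, the following C helpers bind a UDP socket to one specific local IPv4 address and refuse the wildcard address, so that a single network interface is identified at bind time. The names is_wildcard_v4() and bind_udp_specific() are hypothetical and are not part of the library. Whether bypass is actually set up still depends on the interface and on the other limitations described in this page.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if addr is the IPv4 wildcard (INADDR_ANY). */
int
is_wildcard_v4(const struct sockaddr_in *addr)
{
	return (addr->sin_addr.s_addr == htonl(INADDR_ANY));
}

/*
 * Hypothetical helper: create a UDP socket bound to one specific local
 * IPv4 address. Returns the socket descriptor, or -1 if the address is
 * malformed, is the wildcard address, or if socket() or bind() fails.
 */
int
bind_udp_specific(const char *local_ip, uint16_t port)
{
	struct sockaddr_in sin;
	int fd;

	memset(&sin, 0, sizeof (sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(port);

	/* Validate the address before creating the socket. */
	if (inet_pton(AF_INET, local_ip, &sin.sin_addr) != 1)
		return (-1);
	if (is_wildcard_v4(&sin))
		return (-1);	/* a wildcard bind names no single interface */

	if ((fd = socket(AF_INET, SOCK_DGRAM, 0)) == -1)
		return (-1);
	if (bind(fd, (struct sockaddr *)&sin, sizeof (sin)) == -1) {
		(void) close(fd);
		return (-1);
	}
	return (fd);
}
```

An application would call bind_udp_specific() with the address of the interface it intends to use and fall back to its own error handling when -1 is returned.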

Packet fragmentation is currently not supported using data path bypass. The network interface hardware cannot perform a valid filter on a received fragment. The pre-load library also does not fragment outgoing packets. If send(3C) is called with a payload which requires sending a packet larger than the interface's MTU, the kernel data path is used to send out that packet.
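To illustrate the size limit described above, the following C sketch checks whether a UDP payload fits in a single unfragmented packet. It assumes IPv4 with no IP options (a 20-byte IP header plus the 8-byte UDP header) and that the MTU value is the IP MTU of the interface. The helper name udp_payload_fits_mtu() is hypothetical, and the exact check that the library performs is not documented here.

```c
#include <stddef.h>

#define	IPV4_HDR_LEN	20	/* IPv4 header without options (assumption) */
#define	UDP_HDR_LEN	8	/* fixed UDP header length */

/*
 * Hypothetical helper: returns 1 if a UDP payload of payload_len bytes fits
 * in one unfragmented IPv4 packet on a link with the given MTU, else 0.
 */
int
udp_payload_fits_mtu(size_t payload_len, size_t mtu)
{
	/* An MTU smaller than the two headers cannot carry any UDP packet. */
	if (mtu < IPV4_HDR_LEN + UDP_HDR_LEN)
		return (0);
	return (payload_len <= mtu - IPV4_HDR_LEN - UDP_HDR_LEN);
}
```

Under these assumptions, a 1500-byte MTU leaves room for at most 1472 bytes of payload. Larger sends would be expected to take the kernel data path.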

Currently, data path bypass can be used only in the global zone. Non-global zones and kernel zones are not supported. LDOM is supported as long as the entire network interface card is dedicated to a guest domain. The bypass capability does not support a virtual interface. However, it can be used on a physical function. Dynamic reconfiguration of the interface card is not supported. A bypass socket can be associated with only one kernel data path bypass capable network interface for both send and receive operations. IPMP interfaces and link aggregations are not currently supported.

Because the kernel network data path is completely bypassed, kernel features that need to inspect packets will not function properly, as they cannot inspect packets belonging to bypass sockets. These features include IPsec, Trusted Extensions, packet filter, link protection, packet sniffing such as snoop, and IP/UDP dtrace probes. If IPsec, Trusted Extensions, or packet filtering is enabled, data path bypass is not allowed. However, if one of these features is enabled after a bypass socket is created, the bypass socket continues to function.

Currently, a bypass socket does not get the error notification that is generated when an ICMP error message, such as "Destination unreachable", is received. For example, suppose an application uses a normal socket to send a message to an unreachable peer application. This may result in the peer host sending back an ICMP "Destination unreachable" message. When the application sends the message again, the send call fails with the ECONNREFUSED (errno 146) error. When the same operation is done with a bypass socket, the application does not get the error.

Packet sniffing is not supported. Running sniffers like snoop(8) does not affect creating a bypass socket. But packets sent and received using the bypass socket will not show up in the sniffer. Likewise, enabled IP/UDP dtrace probes will not fire for sent and received packets of a bypass socket. Link protection does not affect bypass socket creation.

The bypass socket does not interact with kernel flows created by flowadm(8). This means that flowstat(8) cannot show statistics about bypass sockets. Having a flow defined does not affect the creation of a bypass socket. A bypass socket can be considered a flow with the highest priority, and can use a network interface card without restriction.

The netstat(8) command with the -k option can be used to display sockets using data path bypass. Without this option, a socket using data path bypass is displayed like any other normal socket. When this option is used, only sockets using data path bypass are displayed. The following example shows the output of running netstat -akP udp:

UDP: IPv4
Local Address        Remote Address       State       If       Ipkts       Opkts       Dpkts
-------------------- -------------------- ----------- -------- ----------- ----------- -----------
*.48944                                   Idle        net11    0           281790      0

This option can also be used with the -u and -v options. The following example shows the output of running netstat -aukP udp:

UDP: IPv4
Local Address        Remote Address       State       If       Ipkts       Opkts       Dpkts       User     Pid       Command
-------------------- -------------------- ----------- -------- ----------- ----------- ----------- -------- --------- -------------
*.48944                                   Idle        net11    0           18          0           root     1495      udpsend

The following example shows the output when the -v option is also used:

UDP: IPv4
Local Address     Remote Address    State If      Ipkts  Opkts   Dpkts   User  Pid   Command
----------------  ----------------- ----- ------- ------ ------- ------- ----- ----  ---------
*.48944                             Idle  net11   0      281896  0       root  1495  sparc/udpsend -n -i 192.168.0.1 -m 239.8.8.8 -p 8888 -r 1

The dladm(8) command with the show-phys -G subcommand can be used to show ring-group resource information for the underlying physical device. This information can also be displayed with the -o option by using the field names RG-AVAIL and RG-INUSE-UMAC. This information is not displayed in the default output, that is, without either the -o or the -G option.

# dladm show-phys -G net5
LINK            RG-AVAIL    RG-INUSE-UMAC RG-INUSE-VNIC RG-INUSE-FLOW
net5            234         0             0             1

This shows that 234 ring-groups are available for kernel bypass sockets and 1 ring-group is in use. The available resources are reported by the underlying network interface driver.

Interfaces

Certain behavior of the pre-load library can be changed by the following environment variables:

_NET_KERNEL_BYPASS_ERR_EXIT

Controls how the library behaves when an error occurs while setting up the kernel data path bypass. By default, if the bypass setup fails, the library falls back to using a normal socket and the application functions properly, but the kernel data path bypass is not used. The setup can fail if no more hardware bypass resources are available.

0 is the default value if this variable is not set. If this variable is set to 1, the library calls the exit() function if the bypass setup fails.

_NET_KERNEL_BYPASS_RECV_MODE

Controls how the library behaves when there is no data to be read for a socket. If this variable is set to 0, and there is no data to be read, net_kernel_bypass will call the poll() function. When new data is available, the poll() function returns and the data will be returned to the application.

If this variable is set to 1 and there is no data to be read, the application thread busy-waits for new data. This mode allows the application to receive new data with the lowest possible latency, but it also consumes the most CPU cycles.

1 is the default value if this variable is not set.

_NET_KERNEL_BYPASS_SEND_BUSY_WAIT_CNT

Controls how many times the library tries to send a packet when the hardware resource is not available, which happens when the transmit buffer is full. The FV hardware has a fair sharing policy which guarantees no starvation. If the library still cannot send the packet after this many attempts, it returns an error to the application.

15 is the default value if this variable is not set.
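The three environment variables above can also be set from a C launcher before it executes the real application. The following is a minimal sketch. The helper name set_bypass_tunables() is hypothetical, and rejecting a negative retry count is an assumption of this sketch, because the accepted range is not stated here.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: export the three tunables documented above so that
 * a program executed afterwards (with the library pre-loaded) sees them.
 * Returns 0 on success, -1 if a value is out of range or setenv() fails.
 */
int
set_bypass_tunables(int err_exit, int recv_mode, int send_busy_wait_cnt)
{
	char cnt[16];

	/* Only 0 and 1 are documented for the first two variables. */
	if ((err_exit != 0 && err_exit != 1) ||
	    (recv_mode != 0 && recv_mode != 1) ||
	    send_busy_wait_cnt < 0)	/* assumption: negative is invalid */
		return (-1);

	(void) snprintf(cnt, sizeof (cnt), "%d", send_busy_wait_cnt);

	if (setenv("_NET_KERNEL_BYPASS_ERR_EXIT",
	    err_exit ? "1" : "0", 1) != 0 ||
	    setenv("_NET_KERNEL_BYPASS_RECV_MODE",
	    recv_mode ? "1" : "0", 1) != 0 ||
	    setenv("_NET_KERNEL_BYPASS_SEND_BUSY_WAIT_CNT", cnt, 1) != 0)
		return (-1);
	return (0);
}
```

For example, set_bypass_tunables(1, 0, 32) requests an exit on setup failure, poll()-based receive, and up to 32 send attempts. The library itself must still be pre-loaded into the executed program for these values to have any effect.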

The following socket option at the SOL_SOCKET level is supported by the pre-load library.

SO_NET_KERNEL_BYPASS_STATS, option value type is as follows:

typedef struct so_nkp_stats_s {
    uint64_t nkps_ipkts;
    uint64_t nkps_ipkts_spin;
    uint64_t nkps_opkts;
    uint64_t nkps_opkts_spin;
    uint32_t nkps_send_spin_max;
} so_nkp_stats_t;

It is a read-only option. An application can use this option to gather statistics for performance tuning purposes.

- nkps_ipkts: total number of packets received using bypass
- nkps_ipkts_spin: number of received packets that required spinning
- nkps_opkts: total number of packets sent
- nkps_opkts_spin: number of sent packets that required spinning
- nkps_send_spin_max: maximum number of spins for a send
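The following C sketch shows one way an application might read these statistics with getsockopt() and derive a spin ratio for tuning. It assumes that, on a system supporting this option, SO_NET_KERNEL_BYPASS_STATS and so_nkp_stats_t are provided by a system socket header. Which header that is is not stated here, so the sketch carries a local copy of the structure (using the field names from the list above) and reports ENOPROTOOPT where the option is unknown. The helper names get_bypass_stats() and spin_ratio() are hypothetical.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef SO_NET_KERNEL_BYPASS_STATS
/*
 * Assumption: where the option exists, the system headers define both the
 * option name and so_nkp_stats_t. This local copy of the documented
 * structure only lets the sketch compile on other systems.
 */
typedef struct so_nkp_stats_s {
	uint64_t nkps_ipkts;
	uint64_t nkps_ipkts_spin;
	uint64_t nkps_opkts;
	uint64_t nkps_opkts_spin;
	uint32_t nkps_send_spin_max;
} so_nkp_stats_t;
#endif

/* Hypothetical helper: read the bypass statistics of socket fd. */
int
get_bypass_stats(int fd, so_nkp_stats_t *st)
{
#ifdef SO_NET_KERNEL_BYPASS_STATS
	socklen_t len = sizeof (*st);

	return (getsockopt(fd, SOL_SOCKET, SO_NET_KERNEL_BYPASS_STATS,
	    st, &len));
#else
	(void) fd;
	(void) st;
	errno = ENOPROTOOPT;	/* option unknown on this system */
	return (-1);
#endif
}

/* Fraction of packets that required spinning; 0.0 when pkts is zero. */
double
spin_ratio(uint64_t pkts, uint64_t spin)
{
	return (pkts == 0 ? 0.0 : (double)spin / (double)pkts);
}
```

A tuning loop could call get_bypass_stats() periodically and compare spin_ratio(st.nkps_ipkts, st.nkps_ipkts_spin) across runs with different _NET_KERNEL_BYPASS_RECV_MODE settings.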

Examples

Example 1 Invoking a Command Using the UDP Kernel Data Path Bypass

The following shell script can be used to invoke a command to make use of UDP kernel data path bypass.

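#!/bin/sh

# Pre-load the bypass library, then replace this shell with the given command.
export LD_PRELOAD=net_kernel_bypass.so.1
# "$@" passes each argument through unchanged, even if it contains spaces.
exec "$@"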
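The same wrapper can be sketched in C for cases where a launcher program is preferred over a shell script. Unlike the script, this sketch keeps any objects already listed in LD_PRELOAD, relying on ld.so.1(1) treating the value as a whitespace-separated list. The helper names build_preload() and exec_with_bypass() are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define	NKB_LIB	"net_kernel_bypass.so.1"

/*
 * Hypothetical helper: write the LD_PRELOAD value to use into out, placing
 * the bypass library first and keeping any objects already listed in
 * existing (may be NULL). Returns 0 on success, -1 if out is too small.
 */
int
build_preload(const char *existing, char *out, size_t outlen)
{
	int n;

	if (existing == NULL || strlen(existing) == 0)
		n = snprintf(out, outlen, "%s", NKB_LIB);
	else
		n = snprintf(out, outlen, "%s %s", NKB_LIB, existing);
	return ((n < 0 || (size_t)n >= outlen) ? -1 : 0);
}

/*
 * Hypothetical helper: pre-load the library and replace the current
 * process with the command in argv. Returns -1 only if something fails.
 */
int
exec_with_bypass(char *const argv[])
{
	char value[1024];

	if (build_preload(getenv("LD_PRELOAD"), value, sizeof (value)) != 0 ||
	    setenv("LD_PRELOAD", value, 1) != 0)
		return (-1);
	(void) execvp(argv[0], argv);
	return (-1);	/* execvp() returns only on failure */
}
```

A launcher's main() would typically call exec_with_bypass(&argv[1]) after checking that at least one argument was supplied.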

Files

/usr/lib/net_kernel_bypass.so.1

Shared object

/usr/lib/64/net_kernel_bypass.so.1

64-bit shared object

Attributes

See attributes(7) for descriptions of the following attributes:

ATTRIBUTE TYPE        ATTRIBUTE VALUE
Availability          system/library/net_kernel_bypass
Interface Stability   Uncommitted
MT-Level              Safe

See Also

ld.so.1(1), intro(3), libsocket(3LIB), attributes(7), privileges(7), dladm(8), flowadm(8), flowstat(8), netstat(8), snoop(8)