1.1 About the Unbreakable Enterprise Kernel

1.1.1 About UEK Release 1
1.1.2 About UEK Release 2
1.1.3 About UEK Release 3

In September 2010, Oracle announced the new Unbreakable Enterprise Kernel (UEK) for Oracle Linux as a recommended kernel for deployment with Oracle Linux 5. Beginning with Oracle Linux 5.5, you could choose to use either the Red Hat Compatible Kernel or the UEK. In Oracle Linux 5.6, the UEK became the default kernel.

The prime motivation for creating the UEK was to provide a modern, high-performance Linux kernel for the Exadata and Exalogic engineered systems. The kernel needed to scale as the number of CPUs, the amount of memory, and the number of InfiniBand connections increased.

Oracle tests the UEK intensively with demanding Oracle workloads, and recommends the UEK for Oracle deployments and all other enterprise deployments. Oracle is committed to offering compatibility with Red Hat, and continues to release and support the Red Hat Compatible Kernel as part of Oracle Linux for customers that require strict RHEL compatibility. Under the Oracle Linux Support Program, customers can receive full support for Oracle Linux running with either kernel.

Oracle releases new versions of the UEK every 12-18 months. The latest version of the UEK receives quarterly patch updates including drivers for new hardware support, bug fixes, and critical security patches. Oracle also provides critical security patches for previous versions of the UEK. These patches are available as new installable kernels and, with the exception of device driver updates, as Ksplice patches.

Using the UEK instead of the Red Hat Compatible Kernel changes only the operating system kernel. There are no changes to any libraries, APIs, or user-space applications. Existing applications run unchanged regardless of which kernel you use. Using a different kernel does not change system libraries such as glibc. The version of glibc in Oracle Linux 6 remains the same, regardless of the kernel version.

1.1.1 About UEK Release 1

Release 1 of the UEK is based on a stable 2.6.32 Linux kernel and provides additional performance improvements, including:

  • Improved IRQ (interrupt request) balancing.

  • Reduced lock contention across the kernel.

  • Improved network I/O through the use of receive packet steering and enhancements to RDS.

  • Improved virtual memory performance.

The UEK release 1 includes optimizations developed in collaboration with Oracle’s Database, Middleware, and Hardware engineering teams to ensure stability and optimal performance for demanding enterprise workloads. In addition to performance improvements for large systems, the following UEK features are relevant to using Linux in the data center:

  • The InfiniBand OpenFabrics Enterprise Distribution (OFED) 1.5.1 implements Remote Direct Memory Access (RDMA) and kernel bypass mechanisms to deliver high-efficiency computing, wire-speed messaging, ultra-low microsecond latencies and fast I/O for servers, block storage and file systems. This also includes an improved RDS (reliable datagram sockets) stack for high-speed, low-latency networking. As an InfiniBand Upper Layer Protocol (ULP), RDS allows the reliable transmission of IPC datagrams up to 1 MB in size, and is currently used in Oracle Real Application Clusters (RAC), and in the Exadata and Exalogic products.

  • A number of additional patches significantly improve the performance of Non-Uniform Memory Access (NUMA) systems with many CPUs, CPU cores, and memory nodes.

  • Receive Packet Steering (RPS) is a software implementation of Receive Side Scaling (RSS) that improves overall networking performance, especially for high loads. RPS distributes the load of received network packet processing across multiple CPUs and ensures that the same CPU handles all packets for a specific combination of IP address and port.

    To configure the list of CPUs to which RPS can forward traffic, use /sys/class/net/interface/queues/rx-N/rps_cpus, which implements a CPU bitmap for a specified network interface and receive queue. The default value is zero, which disables RPS and results in the CPU that is handling the network interrupt also processing the incoming packet. To enable RPS and allow a particular set of CPUs to handle interrupts for the receive queue on an interface, set the value of their positions in the bitmap to 1. For example, to enable RPS to use CPUs 0, 1, 2, and 3 for the rx-0 queue on eth0, set the value of rps_cpus to f (that is, 1+2+4+8 = 15 in decimal, which is f in hexadecimal):

    # echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus

    There is no benefit in configuring RPS on a system with a multiqueue network device as RSS is usually automatically configured to map a CPU to each receive queue.

    For an interface with a single receive queue, you should typically set rps_cpus for CPUs in the same memory domain so that they share the processing of the queue. On a non-NUMA system, this means that you would set all the available CPUs in rps_cpus.

    Tip

    To verify which CPUs are handling receive interrupts, use the command watch -n1 cat /proc/softirqs and monitor the value of NET_RX for each CPU.

  • Receive Flow Steering (RFS) extends RPS to coordinate how the system processes network packets in parallel. RFS performs application matching to direct network traffic to the CPU on which the application is running.

    To configure RFS, use /proc/sys/net/core/rps_sock_flow_entries, which sets the number of entries in the global flow table, and /sys/class/net/interface/queues/rx-N/rps_flow_cnt, which sets the number of entries in the per-queue flow table for a network interface. The default values are both zero, which disables RFS. To enable RFS, set the value of rps_sock_flow_entries to the maximum expected number of concurrently active connections, and the value of rps_flow_cnt to rps_sock_flow_entries/Nq, where Nq is the number of receive queues on a device. Any value that you enter is rounded up to the nearest power of 2. The suggested value of rps_sock_flow_entries is 32768 for a moderately loaded server.
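
    For example, the following sketch (assuming a hypothetical interface named eth0 with 16 receive queues) uses the suggested global value of 32768 and sets each per-queue flow count to 32768/16 = 2048:

    # echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
    # for q in /sys/class/net/eth0/queues/rx-*; do echo 2048 > $q/rps_flow_cnt; done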

  • The kernel can detect solid state disks (SSDs), and tune itself for their use by bypassing the optimization code for spinning media and by dispatching I/O without delay to the SSD.

  • The data integrity features verify data from the database all the way down to the individual storage spindle or device. The Linux data integrity framework (DIF) allows applications or kernel subsystems to attach metadata to I/O operations, allowing devices that support DIF to verify the integrity of the data before passing it further down the stack and physically committing it to disk. The Data Integrity Extensions (DIX) feature enables the exchange of protection metadata between the operating system and the host bus adapter (HBA), and helps to prevent silent data corruption. The data-integrity enabled Automatic Storage Manager (ASM) that is available as an add-on with Oracle Database also protects against data corruption from application to disk platter.

    For more information about the data integrity features, including programming with the block layer integrity API, see http://www.kernel.org/doc/Documentation/block/data-integrity.txt.

  • Oracle Cluster File System 2 (OCFS2) version 1.6 includes a large number of features. For more information, see Chapter 7, Oracle Cluster File System Version 2.

1.1.2 About UEK Release 2

Note

The kernel version in UEK Release 2 (UEK R2) is stated as 2.6.39, but it is actually based on the 3.0-stable Linux kernel. This renumbering allows some low-level system utilities that expect the kernel version to start with 2.6 to run without change.

UEK R2 includes the following improvements over release 1:

  • Interrupt scalability is refined, and scheduler tuning is improved, especially for Java workloads.

  • Transcendent memory improves the performance of virtualization solutions for a broad range of workloads by allowing a hypervisor to cache clean memory pages, which eliminates costly disk reads of file data by virtual machines and allows you to increase their capacity and usage level. Transcendent memory also implements an LZO-compressed page cache, or zcache, which reduces disk I/O.

  • Transmit packet steering (XPS) distributes outgoing network packets from a multiqueue network device across the CPUs. XPS maps each transmit queue to a set of CPUs, so the queue that is used to send a packet is chosen based on the sending CPU, which reduces lock contention and NUMA cost.

    To configure the list of CPUs to which XPS can forward traffic, use /sys/class/net/interface/queues/tx-N/xps_cpus, which implements a CPU bitmap for a specified network interface and transmit queue. The default value is zero, which disables XPS. To enable XPS and allow a particular set of CPUs to use a specified transmit queue on an interface, set the value of their positions in the bitmap to 1. For example, to enable XPS to use CPUs 4, 5, 6, and 7 for the tx-0 queue on eth0, set the value of xps_cpus to f0 (that is, 16+32+64+128 = 240 in decimal, which is f0 in hexadecimal):

    # echo f0 > /sys/class/net/eth0/queues/tx-0/xps_cpus

    There is no benefit in configuring XPS for a network device with a single transmit queue.

    For a system with a multiqueue network device, configure XPS so that each CPU maps onto one transmit queue. If a system has an equal number of CPUs and transmit queues, you can configure exclusive pairings in XPS to eliminate queue contention. If a system has more CPUs than queues, assign CPUs that share the same cache to the same transmit queue.
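
    For example, the following sketch (assuming a hypothetical interface eth0 with four transmit queues on a four-CPU system) configures such an exclusive pairing of CPUs to queues:

    # echo 1 > /sys/class/net/eth0/queues/tx-0/xps_cpus
    # echo 2 > /sys/class/net/eth0/queues/tx-1/xps_cpus
    # echo 4 > /sys/class/net/eth0/queues/tx-2/xps_cpus
    # echo 8 > /sys/class/net/eth0/queues/tx-3/xps_cpus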

  • The btrfs file system for Linux is designed to meet the expanding scalability requirements of large storage subsystems. For more information, see Chapter 5, The Btrfs File System.

  • Cgroups provide fine-grained control of CPU, I/O and memory resources. For more information, see Chapter 8, Control Groups.

  • Linux containers provide multiple user-space versions of the operating system on the same server. Each container is an isolated environment with its own process and network space. For more information, see Chapter 9, Linux Containers.

  • Transparent huge pages take advantage of the memory management capabilities of modern CPUs to allow the kernel to manage physical memory more efficiently by reducing overhead in the virtual memory subsystem, and by improving the caching of frequently accessed virtual addresses for memory-intensive workloads. For more information, see Chapter 10, HugePages.

  • DTrace allows you to explore your system to understand how it works, to track down performance problems across many layers of software, or to locate the causes of aberrant behavior. DTrace is currently available only on ULN. For more information, see Chapter 12, DTrace.

  • The configfs virtual file system, engineered by Oracle, allows you to configure the settings of kernel objects where a file system or device driver implements this feature. configfs provides an alternative to the ioctl() system call for changing the values of settings, and complements the intended functionality of sysfs as a means to view kernel objects.

    The cluster stack for OCFS2, O2CB, uses configfs to set cluster timeouts and to examine the cluster status.

    The low-level I/O (LIO) driver uses configfs as a multiprotocol SCSI target to support the configuration of FCoE, Fibre Channel, iSCSI and InfiniBand using the lio-utils tool set.
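
    As a brief illustrative sketch (assuming that configfs is not already mounted by your initialization scripts), you can mount the file system at its conventional location and browse the objects that loaded subsystems expose:

    # mount -t configfs none /sys/kernel/config
    # ls /sys/kernel/config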

    For more information about the implementation of configfs, see http://www.kernel.org/doc/Documentation/filesystems/configfs/configfs.txt.

  • The dm-nfs feature creates virtual disk devices (LUNs) where the data is stored in an NFS file instead of on local storage. Managed networked storage has many benefits over keeping virtual devices on a disk that is local to the physical host.

    The dm-nfs kernel module provides a device-mapper target that allows you to treat a file on an NFS file system as a block device that can be loopback-mounted locally.

    The following sample code demonstrates how to use dmsetup to create a mapped device (/dev/mapper/$dm_nfsdev) for the file $filename that is accessible on a mounted NFS file system:

    # The length field of a device-mapper table is given in 512-byte sectors,
    # so convert the file size from bytes to sectors.
    nblks=$(( $(stat -c '%s' $filename) / 512 ))
    echo -n "0 $nblks nfs $filename 0" | dmsetup create $dm_nfsdev
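
    Once created, you can inspect or tear down the mapping with standard dmsetup subcommands, for example (reusing the $dm_nfsdev name from the sketch above):

    # dmsetup table $dm_nfsdev
    # dmsetup remove $dm_nfsdev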

    A sample use case is the fast migration of guest VMs for load balancing or when a physical host requires maintenance. This functionality is also possible using iSCSI LUNs, but the advantage of dm-nfs is that you can manage new virtual drives on a local host system, rather than requiring a storage administrator to initialize new LUNs.

    dm-nfs uses asynchronous direct I/O so that I/O is performed efficiently and coherently. A guest's disk data is not cached locally on the host. If the host crashes, there is a lower probability of data corruption. If a guest is frozen, you can take a clean backup of its virtual disk, as you can be certain that its data has been fully written out.

1.1.3 About UEK Release 3

Note

The kernel in UEK Release 3 (UEK R3) is based on mainline Linux kernel version 3.8.13. Low-level system utilities that expect the kernel version to start with 2.6 can run without change if they use the UNAME26 personality (for example, by using the uname26 wrapper utility).

UEK R3 includes the following major improvements over UEK R2:

  • Integrated DTrace support in the UEK R3 kernel and user-space tracing of DTrace-enabled applications.

  • Device mapper support for an external, read-only device as the origin for a thinly-provisioned volume.

  • The loop driver provides the same I/O functionality as dm-nfs by extending the AIO interface to perform direct I/O. To create the loopback device, use the losetup command instead of dmsetup. The dm-nfs module is not provided with UEK R3.
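
    For example, the following sketch (assuming a hypothetical NFS share mounted at /mnt/nfs that contains a backing file named disk.img) binds the file to the first free loop device and prints the resulting device name, such as /dev/loop0:

    # losetup -f --show /mnt/nfs/disk.img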

  • Btrfs send and receive subcommands allow you to record the differences between two subvolumes, which can either be snapshots of the same subvolume or parent and child subvolumes.
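
    For example, the following sketch (assuming hypothetical read-only snapshots /mnt/@snap1 and /mnt/@snap2 of the same subvolume, and a destination file system mounted at /backup) streams only the differences between the two snapshots:

    # btrfs send -p /mnt/@snap1 /mnt/@snap2 | btrfs receive /backup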

  • Btrfs quota groups (qgroups) allow you to set different size limits for a volume and its subvolumes.
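
    For example, the following sketch (assuming a hypothetical Btrfs file system mounted at /mnt that contains a subvolume /mnt/data) enables quota groups and caps the subvolume at 10 GiB:

    # btrfs quota enable /mnt
    # btrfs qgroup limit 10G /mnt/data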

  • Btrfs supports replacing devices without unmounting or otherwise disrupting access to the file system.
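
    For example, the following sketch (assuming that a hypothetical device /dev/sdb is to be replaced by /dev/sdc in a file system mounted at /mnt) starts the replacement while the file system remains mounted and then reports its progress:

    # btrfs replace start /dev/sdb /dev/sdc /mnt
    # btrfs replace status /mnt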

  • Ext4 quotas are enabled as soon as the file system is mounted.

  • TCP controlled delay management (CoDel) is a new active queue management algorithm that is designed to handle excessive buffering across a network connection (bufferbloat). The algorithm is based on how long packets are buffered in the queue rather than on the size of the queue. If the minimum queuing time rises above a threshold value, the algorithm discards packets, which causes TCP to reduce its transmission rate.

  • TCP connection repair implements process checkpointing and restart, which allows a TCP connection to be stopped on one host and restarted on another host. Container virtualization can use this feature to move a network connection between hosts.

  • TCP and SCTP early retransmit allows fast retransmission (under certain conditions) after fewer duplicate acknowledgements than the standard algorithm requires.

  • TCP fast open (TFO) can speed up the opening of successive TCP connections between two endpoints by eliminating one round-trip time (RTT) from some TCP transactions.
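
    For example, one way to enable TFO for both outgoing and incoming connections is to set the net.ipv4.tcp_fastopen kernel parameter, which is a bitmask where 1 enables client support and 2 enables server support:

    # sysctl -w net.ipv4.tcp_fastopen=3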

  • The TCP small queue algorithm is another mechanism intended to help deal with bufferbloat. The algorithm limits the amount of data that can be queued for transmission by a socket.

  • The secure computing mode feature (seccomp) is a simple sandbox mechanism that, in strict mode, allows a thread to transition to a state where it cannot make any system calls except from a very restricted set (_exit(), read(), sigreturn(), and write()) and it can only use file descriptors that were already open. In filter mode, a thread can specify an arbitrary filter of permitted system calls, including calls that would be forbidden in strict mode. Access to this feature is through the prctl() system call. For more information, see the prctl(2) manual page.

  • The OpenFabrics Enterprise Distribution (OFED) 2.0 stack supports the following protocols:

    • SCSI RDMA Protocol (SRP) enables access to remote SCSI devices via remote direct memory access (RDMA)

    • iSCSI Extensions for remote direct memory access (iSER) provide access to iSCSI storage devices

    • Reliable Datagram Sockets (RDS) is a high-performance, low-latency, reliable connectionless protocol for datagram delivery

    • Sockets Direct Protocol (SDP) supports stream sockets for RDMA network fabrics

    • Ethernet over InfiniBand (EoIB)

    • IP encapsulation over InfiniBand (IPoIB)

    • Ethernet tunneling over InfiniBand (eIPoIB)

    The OFED 2.0 stack also supports the following RDS features:

    • Async Send (AS)

    • Quality of Service (QoS)

    • Automatic Path Migration (APM)

    • Active Bonding (AB)

    • Shared Request Queue (SRQ)

    • Netfilter (NF)

  • Paravirtualization support has been enabled for Oracle Linux guests on Windows Server 2008 Hyper-V or Windows Server 2008 R2 Hyper-V.

  • The Virtual Extensible LAN (VXLAN) tunneling protocol overlays a virtual network on an existing Layer 3 infrastructure to allow the transfer of Layer 2 Ethernet packets over UDP. This feature is intended for use by a virtual network infrastructure in a virtualized environment. Use cases include virtual machine migration and software-defined networking (SDN).
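
    For example, the following sketch (assuming a hypothetical VXLAN network identifier of 10, multicast group 239.1.1.1, and physical uplink eth0) creates and activates a VXLAN interface with the ip command:

    # ip link add vxlan0 type vxlan id 10 group 239.1.1.1 dev eth0
    # ip link set vxlan0 up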

The UEK R3 kernel packages are available on the ol6_x86_64_UEKR3_latest channel. For more information, see the Unbreakable Enterprise Kernel Release 3 Release Notes.