1.1 Notable Changes

The following sections describe the major new features of Unbreakable Enterprise Kernel Release 3 (UEK R3) relative to UEK R2. If applicable, the mainline version in which a feature was introduced is noted in parentheses.

For brief summaries of other changes, see Appendix A, Other Changes.

1.1.1 Architecture

  • Support for the Intel Ivy Bridge (IVB) processor family has been added.

  • The efivars module provides an area of firmware-managed, nonvolatile storage, which can be used as a persistent storage backend to maintain copies of kernel oopses and aid the diagnosis of problems. (3.1)

1.1.2 Control Groups and Linux Containers

Control groups (cgroups) and Linux Containers (LXC) are now supported features. LXC is supported for 64-bit hosts, but not 32-bit hosts (in any case, UEK R3 is not available for the 32-bit x86 architecture). Both 32-bit and 64-bit guest containers can be configured. However, some applications might not be supported for use with these features.

  • The cgroups feature allows you to manage access to system resources by processes. For more information, see Control Groups.

  • LXC is based on the cgroups and namespaces functionality. Containers allow you to safely and securely run multiple applications or instances of an operating system on a single host without risking them interfering with each other. Containers are lightweight and resource-friendly, which saves both rack space and power. For more information, see Linux Containers.

    The lxc-attach command is supported by UEK R3 with the lxc-0.9.0-2.0.4 package. lxc-attach allows you to execute an arbitrary command inside a running container from outside the container. For more information, see the lxc-attach(1) manual page.

    Note

    To access this feature, use yum update to install the lxc-0.9.0-2.0.4 package (or later version of this package).

1.1.3 Core Kernel Functionality

  • To avoid binary incompatibility in applications that do not understand the 3.x versioning scheme, the UNAME26 personality patch can be used to report the kernel version as 2.6.x where x is derived from the real kernel version. The uname26 program is provided to activate the UNAME26 personality patch for 3.x kernels. uname26 does not replace the uname command. Instead, it acts as a wrapper that modifies the return value of the uname() system call to return a 2.6.x version number. If an application fails due to the 3.8.x version number, you can use the following command to start it in a 2.6 context:

    # uname26 application

    The following example demonstrates the effect of using uname26 as a wrapper program:

    # uname -r
    3.8.13-16.el6uek.x86_64
    # uname26 uname -r
    2.6.48-16.el6uek.x86_64

    The uname26 program is available in the uname26 package. (3.1)

  • Structured logging in /dev/kmsg uses printk() to attach arbitrary key/value pairs to logged messages, which carry machine-readable data that describes the context of the message when it was created. The key/value pairs allow you to reliably identify messages according to device, driver, subsystem, class, and type. The addition of a facility number to the syslog prefix allows continuation records to be merged. (3.5)

  • PCI Express runtime D3cold power state is supported. This deepest power saving state for PCIe devices removes all main power. (3.6)

  • Virtual Function I/O (VFIO) allows safe, non-privileged access to bare-metal devices from user-space drivers. Virtual machines that use direct device access (device assignment) can use it to obtain high I/O performance. From the perspective of the device and the host, the VM appears as a user-space driver, which provides the benefits of reduced latency, higher bandwidth, and the direct use of bare-metal device drivers. This feature could potentially be used by high-performance computing and similar applications. (3.6)

  • Huge pages support a zero page as a performance optimization. This feature was previously available only for normal sized pages (4 KB). When a process references a new memory page, the kernel assigns a pointer to the zero page rather than allocating a real page of memory and filling this with zeroes. When the process does attempt to write to the zero page, a write-protection fault is generated and the kernel allocates a real page of memory to the process's address space. (3.8)

  • A new foundation for the NUMA implementation will be used as the basis for future enhancements. (3.8)

  • The memory control group now supports both stack and slab kernel usage parameters with the following additional memory usage parameters (specified relative to memory.kmem):

    failcnt

    Kernel memory usage hits (display only).

    limit_in_bytes

    Kernel memory hard limit (set or display).

    max_usage_in_bytes

    Maximum recorded kernel memory usage (display only).

    usage_in_bytes

    Current kernel memory allocation (display only).

    memory.kmem.limit_in_bytes is intended to help limit the effect of fork bombs. (3.8)

  • Automatic balancing of memory allocation for NUMA nodes. (3.8)

  • The value of the SCSI error-handling timeout is now tunable. If a SCSI device times out while processing file system I/O, the kernel attempts to bring the device back online by resetting the device, followed by resetting the bus, and finally by resetting the controller. The error-handling timeout defines how many seconds the kernel should wait for a response after each recovery attempt before performing the next step in the process. For some fast-fail scenarios, it is useful to be able to adjust this value as the kernel might need additional time to try several combinations of bus device, target, bus, and controller. You can read and set the timeout via /sys/class/scsi_device/*/device/eh_timeout. The default timeout value is 10 seconds. (3.8)

  • Variable-sized huge pages via the flags argument to mmap() or the shmflg argument to shmget(). Bits 26-31 of these arguments specify the base-2 logarithm of the page size. For example, values of 21 << 26 and 30 << 26 represent page sizes of 2 MB (2^21) and 1 GB (2^30) respectively. A value of zero selects the default huge page size. (3.8)

  • The watchdog timer device (displayed in /proc/devices) provides a framework for all watchdog timer drivers, /dev/watchdog, and the sysfs interface for hardware-specific watchdog code. (3.8)

  • The Precision Time Protocol (PTP), defined in IEEE 1588, is enabled. PTP can be used to achieve synchronization of systems to within a few tens of microseconds. If hardware time-stamping units are used, synchronization to within a few hundred nanoseconds can be achieved. (3.8)
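
The version rewrite shown in the uname26 example earlier in this section can be sketched in a few lines of shell. This is only an illustration: it assumes that the reported minor number is the real minor number plus 40 (which is consistent with 3.8.13 appearing as 2.6.48 in that example), and uname26_version is a hypothetical helper name that is not part of the uname26 package.

```shell
#!/bin/sh
# Sketch of the version rewrite that the uname26 example illustrates.
# Assumption: reported minor = real minor + 40 (3.8.13 -> 2.6.48).
# uname26_version is a hypothetical helper, not part of the uname26 package.
uname26_version() {
    release=$1
    # Extract the second numeric field (the "8" in 3.8.13-16...).
    minor=$(printf '%s\n' "$release" | sed 's/^[0-9]*\.\([0-9]*\).*$/\1/')
    # Keep everything after the leading x.y or x.y.z digits,
    # for example "-16.el6uek.x86_64".
    rest=$(printf '%s\n' "$release" | sed 's/^[0-9]*\.[0-9]*\(\.[0-9]*\)\{0,1\}//')
    printf '2.6.%d%s\n' "$((minor + 40))" "$rest"
}

uname26_version 3.8.13-16.el6uek.x86_64
```

Run against the release string from the example, the helper prints the same 2.6.48-16.el6uek.x86_64 string that uname26 uname -r reports.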
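
The memory.kmem parameters listed earlier in this section are ordinary files in a memory cgroup directory, so they can be inspected with a short script. The following is a minimal sketch: show_kmem and the CGROUP_DIR variable are hypothetical names, and the default path is an assumption that depends on where the memory controller is mounted on your system.

```shell
#!/bin/sh
# show_kmem is a hypothetical helper that prints the four memory.kmem.*
# parameters for one memory cgroup directory.
show_kmem() {
    dir=$1
    for param in failcnt limit_in_bytes max_usage_in_bytes usage_in_bytes; do
        file="$dir/memory.kmem.$param"
        if [ -r "$file" ]; then
            printf '%s=%s\n' "$param" "$(cat "$file")"
        else
            # The file is missing or unreadable in this directory.
            printf '%s=unavailable\n' "$param"
        fi
    done
}

# Assumed mount point of the memory controller; CGROUP_DIR is a
# hypothetical override for systems that mount it elsewhere.
show_kmem "${CGROUP_DIR:-/sys/fs/cgroup/memory}"
```

Only limit_in_bytes is described as settable. Writing a byte count to that file as root is the usual way to apply a hard limit to a group.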
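
As an illustration of the tunable SCSI error-handling timeout described earlier in this section, the following sketch prints the current eh_timeout value for each SCSI device. list_eh_timeouts is a hypothetical helper name, and the optional argument exists only so that the function can be pointed at a different directory tree.

```shell
#!/bin/sh
# list_eh_timeouts is a hypothetical helper that prints the error-handling
# timeout (in seconds) of every SCSI device found under the given sysfs root.
list_eh_timeouts() {
    root=${1:-/sys/class/scsi_device}
    for file in "$root"/*/device/eh_timeout; do
        # Skip the unexpanded pattern when no device matches.
        [ -r "$file" ] || continue
        # Reduce the path to the device name (for example 0:0:0:0).
        dev=${file#"$root"/}
        dev=${dev%%/*}
        printf '%s %s\n' "$dev" "$(cat "$file")"
    done
}

list_eh_timeouts
```

To change the timeout for a device, write the new number of seconds to the same eh_timeout file as root.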
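
The arithmetic behind the variable-sized huge page encoding described earlier in this section can be checked with shell arithmetic. This sketch only computes the numbers and does not call mmap() or shmget(). hugepage_flag_bits and hugepage_size_bytes are hypothetical helper names.

```shell
#!/bin/sh
# Arithmetic behind the huge page size encoding (bits 26-31 of the flags).
# hugepage_flag_bits and hugepage_size_bytes are hypothetical helper names.

# Print the value that encodes a page size of 2^log2 bytes in bits 26-31.
hugepage_flag_bits() {
    printf '0x%x\n' "$(($1 << 26))"
}

# Print the page size in bytes that corresponds to a base-2 logarithm.
hugepage_size_bytes() {
    printf '%d\n' "$((1 << $1))"
}

for log2 in 21 30; do
    printf 'log2=%s flag_bits=%s size=%s bytes\n' \
        "$log2" "$(hugepage_flag_bits "$log2")" "$(hugepage_size_bytes "$log2")"
done
```

In a C program, the computed bits would typically be combined with the MAP_HUGETLB flag for mmap() or the SHM_HUGETLB flag for shmget().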

1.1.4 Cryptography

  • The Extended Verification Module (EVM) now supports digital signatures, which allows file metadata to be protected by using digital signatures instead of a Hash-based Message Authentication Code (HMAC). (3.3)

  • Kernel modules can now be signed using X.509 certificates. (3.7)

1.1.5 Device Mapper

The device mapper supports an external, read-only device as the origin for a thinly-provisioned volume. Any reads to the unprovisioned area of the thin device are passed through to this device. For example, a host could run its guest VMs on thinly provisioned volumes where the base image for all of the VMs resides on a single device. (3.4)

1.1.6 Diagnostics

  • The cpupowerutils feature extends the capabilities of cpufrequtils, and provides statistics for CPU idle and turbo/boost modes. On AMD systems, it also displays information about boost states and their frequencies. For more information, see http://lwn.net/Articles/433002/. (3.1)

  • zcache version 3 supports multiple clients and in-kernel transcendent memory (tmem) code, and adds tmem callbacks to support RAMster and corresponding no-op stubs in the zcache driver. New sysfs parameters provide additional information and allow policy control. (3.1)

1.1.7 DTrace

DTrace is a comprehensive dynamic tracing framework that was initially developed for the Oracle Solaris operating system. DTrace provides a powerful infrastructure to permit administrators, developers, and service personnel to concisely answer arbitrary questions about the behavior of the operating system and user programs in real time.

Note

The DTrace utility packages (dtrace-utils*) are available only on the Unbreakable Linux Network (ULN).

DTrace 0.4 in UEK R3 has the following additional features compared with DTrace 0.3.2 in UEK R2:

  • In UEK R2, you had to install separately available packages that contained a DTrace-enabled version of the kernel, and you had to boot the system with this kernel to be able to use DTrace. In UEK R3, DTrace support is integrated with the kernel. To use DTrace, you still need to install the dtrace-utils and dtrace-modules packages, which are available on the ol6_x86_64_UEKR3_latest and ol6_x86_64_Dtrace_userspace_latest channels. If you use yum to install the dtrace-utils package, it automatically pulls in the other packages, such as dtrace-modules, that are required.

  • The libdtrace headers, which are required for implementing a libdtrace consumer, are now located in the separate dtrace-utils-devel package. The headers for provider development are located in the dtrace-modules-provider-headers package. If you require these packages, you must install them separately from the dtrace-modules or dtrace-utils packages.

  • Meta-provider support has been implemented, which allows DTrace to instantiate providers dynamically on demand. An example of a meta-provider is the fasttrap provider that is used for user-space tracing.

  • User-space statically defined tracing (USDT) supports SDT-like probes in user-space executables and libraries. To ensure that your program computes the arguments to a DTrace probe only when required, you can use an is-enabled probe test to verify whether the probe is currently enabled.

  • USDT requires programs to be modified to include embedded static probe points. The sys/sdt.h header file is provided to support USDT, but you can also use the -h option to dtrace to generate a suitable header file from a provider description file.

    The -G option to the dtrace command processes the provider description file and the compiled object files for the code that contains the probe points to generate a DOF ELF object file (which is an Executable and Linkable Format (ELF) object file with a DTrace Object Format (DOF) section). You can then create a DTrace-enabled executable or shared library by linking this DOF ELF object file with the object files.

    For more information, refer to the chapter Statically Defined Tracing for User Applications in the Oracle Linux 6 Dynamic Tracing Guide, which you can find in the Oracle Linux 6 documentation library at http://docs.oracle.com/cd/E37670_01/index.html.

  • To enable the use of USDT probes in DTrace-enabled programs, you must load the new fasttrap module:

    # modprobe fasttrap

    Currently, the fasttrap provider supports the use of USDT probes. It is not used to implement the pid provider.

  • DTrace-enabled versions of user-space applications are planned to be made available via the playground repository of Oracle Public Yum (http://public-yum.oracle.com/repo/OracleLinux/OL6/playground/latest/x86_64/). The packages that are provided in the playground repository are intended for experimentation only and you should not use them with production systems. Oracle does not offer support for these packages and does not accept any liability for their use.

    PHP 5.4.20, PHP 5.5.4, and later versions can be built with DTrace support on Oracle Linux. See https://blogs.oracle.com/opal/entry/using_php_dtrace_on_oracle.

    PostgreSQL 9.2.4 includes support for DTrace as described in http://www.postgresql.org/docs/9.2/static/dynamic-trace.html. You can build a DTrace-enabled version of pgsql by specifying the --enable-dtrace option to configure as described in http://www.postgresql.org/docs/9.2/static/install-procedure.html. For information about obtaining the PostgreSQL packages, see http://www.postgresql.org/download/linux/redhat/.

  • The DTrace header files in the kernel, kernel modules, and DTrace user-space utility have been restructured to provide better support for custom consumers and DTrace-related utilities.

  • The systrace provider has been updated to account for changes in the 3.8.13 kernel.

  • Symbol lookup can now be performed by the & operator. ustack() output contains symbolic names instead of addresses provided that the symbols are present in the DT_NEEDED section of the ELF objects or in libraries that have been loaded with dlopen() or dlmopen(). Symbol lookup of global symbols in user-space processes respects symbol interposition and similar methods of symbol-ordering. Symbol lookup works correctly with programs that you compiled against the version of the GNU C Library (glibc) that ships with Oracle Linux 6.4 or later. With other versions of glibc, symbol lookup might fall back to using a simpler approach that does not support symbol interposition or dlmopen(). As symbol lookup depends on new machinery in the kernel that uses waitfd() and PTRACE_GETMAPFD, it does not work with earlier DTrace kernels.

  • The -x evaltime={exec | main | preinit | postinit} option to dtrace is now available with the following limitations:

    • postinit (the default behavior) is equivalent to main.

    • For statically linked binaries, preinit is equivalent to exec, and it might not skip ld.so initialization, which can happen after main().

    • For stripped, statically linked binaries, both postinit and main are equivalent to preinit, because the main symbol cannot be looked up if there is no symbol table.

    In previous versions of DTrace, the default behavior was equivalent to evaltime=exec being set.

  • You can now set DTrace options by using environment variables named DTRACE_OPT_NAME, where NAME is the name of the option in upper case. For example, the variable name corresponding to incdir, which adds a #include directory to the preprocessor search path, is DTRACE_OPT_INCDIR:

    # export DTRACE_OPT_INCDIR=/usr/lib64/dtrace:/usr/include/sys

  • The following changes have been made to user-visible internals:

    • The name of the ELF section in which CTF data is stored has been changed from .dtrace_ctf to .ctf.

    • The storage representation of internal kernel symbols has been improved, which reduces DTrace memory usage at start up by approximately one megabyte.

    • The libdtrace public API header now names its arguments.

    • The prototypes for several libdtrace functions have changed.

    • Two undocumented libproc environment variables (_LIBPROC_INCORE_ELF and _LIBPROC_NO_QSORT) from Oracle Solaris have been removed because the code, whose behaviour they adjusted, no longer exists.

    • New low-overhead debugging machinery has been implemented. If you export the DTRACE_DEBUG=signal environment variable, DTrace will emit debugging output only when it receives a SIGUSR1, avoiding the overhead due to printf() locking affecting any timings. The mechanism uses a ring buffer with a default size of 100 (in units of megabytes), which you can adjust by setting the value of the DTRACE_DEBUG_BUF_SIZE variable.

  • Negative values specified to dtrace options that take only positive integers are now correctly diagnosed as errors.

  • It is now possible to obtain the correct value for the ERR registers.

  • For more information about DTrace, refer to the Oracle Linux 6 Administrator's Solutions Guide and the Oracle Linux 6 Dynamic Tracing Guide, which you can find in the Oracle Linux 6 documentation library at http://docs.oracle.com/cd/E37670_01/index.html.
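
The naming rule for DTrace option environment variables described earlier in this section (DTRACE_OPT_ followed by the option name in upper case) can be expressed as a one-line helper. This is a small sketch, and dtrace_opt_var is a hypothetical name that is not part of the DTrace packages.

```shell
#!/bin/sh
# dtrace_opt_var is a hypothetical helper that derives the environment
# variable name for a dtrace option by upper-casing the option name.
dtrace_opt_var() {
    printf 'DTRACE_OPT_%s\n' "$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')"
}

dtrace_opt_var incdir
```

By the same rule, the evaltime option corresponds to the DTRACE_OPT_EVALTIME variable.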

1.1.8 File Systems

btrfs

In UEK R3, btrfs is based on version 3.8, whereas btrfs in the latest update to UEK R2 is based on version 3.0 with some additional backported features, such as support for large metadata blocks and device statistics.

The following notable features are implemented for the btrfs file system in UEK R3 in addition to those features that are already provided in UEK R2:

  • Support for changing the RAID profile without unmounting the file system. (3.3)

  • The btrfs-restore data recovery tool attempts to extract files from a damaged file system and copy them to a safe location. (3.4)

  • fsck in btrfs can now repair extent-allocation trees. (3.4)

  • Support in mkfs for metadata blocks of up to 64 KB (either 16 or 32 KB is recommended). (3.4)

  • Performance improvements to page cache and CPU usage, and the copy-on-write mechanisms. (3.4)

  • Improved auditing to handle unexpected conditions more effectively. When unexpected errors occur, current transactions abort, errors are returned to user-space callers, and the file system enters read-only mode. (3.4)

  • The btrfs device stats command reports I/O failure statistics, including I/O errors, CRC errors, and generation checks of metadata blocks for each drive. (3.5)

  • Performance improvements to memory reclamation and synchronous I/O latency. (3.5)

  • Subvolume-aware quota groups (qgroups) allow you to set different size limits for a volume and its subvolumes. For more information, see https://btrfs.wiki.kernel.org/index.php/UseCases. (3.6)

  • The send and receive subcommands of btrfs allow you to record the differences between two subvolumes, which can either be snapshots of the same subvolume or parent and child subvolumes. For an example of using the send/receive feature to implement an efficient incremental backup mechanism, see https://btrfs.wiki.kernel.org/index.php/Incremental_Backup. (3.6)

  • Cross-subvolume reflinks allow you to clone files across different subvolumes within a single mounted btrfs file system. However, you cannot clone files between subvolumes that are mounted separately. (3.6)

  • The copy-on-write mechanism can be disabled for an empty file by using the chattr +C command to add the NOCOW file attribute to the file, or by creating the file in a directory on which you have set NOCOW. For some applications this feature can reduce fragmentation and improve performance. (3.7)

  • File hole punching, which allows you to mark a portion of a file as unused, so freeing up the associated storage. The FALLOC_FL_PUNCH_HOLE flag to the fallocate() system call removes the specified data range from a file. The call does not change the size of the file even if you remove blocks from the end of the file. A typical use case for hole punching is to deallocate unused storage previously allocated to virtual machine images. (3.7)

  • The fsync() system call writes the modified data of a file to the hard disk. (3.7)

  • Replacing devices without unmounting or otherwise disrupting access to the file system by using the replace subcommand to btrfs, for example:

    # btrfs replace start failed_device replacement_device mountpoint

    You do not need to unmount the file system or to stop active tasks. If the power fails during replacement, the process resumes when the file system is next mounted. (3.8)

For more information, see https://btrfs.wiki.kernel.org/index.php/Changelog.
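
The file hole punching behavior described in the btrfs feature list, in which the file size is unchanged after a range is removed, can be observed from the shell. The following sketch assumes a version of the fallocate utility that provides the --punch-hole option (the version on your system might not) and a file system that supports hole punching. punch_demo is a hypothetical helper name.

```shell
#!/bin/sh
# punch_demo is a hypothetical helper: it creates a 1 MiB file, punches a
# 512 KiB hole at the start, and prints the apparent size before and after.
punch_demo() {
    file=$1
    dd if=/dev/urandom of="$file" bs=1024 count=1024 2>/dev/null
    before=$(wc -c < "$file")
    # --punch-hole deallocates the byte range and keeps the file size.
    if ! fallocate --punch-hole --offset 0 --length 524288 "$file" 2>/dev/null; then
        echo "hole punching is not available here" >&2
    fi
    after=$(wc -c < "$file")
    printf 'size_before=%d size_after=%d\n' "$before" "$after"
}

tmpfile=$(mktemp)
punch_demo "$tmpfile"
rm -f "$tmpfile"
```

The reported size is the same before and after the call. On a file system that supports hole punching, comparing the output of du -k before and after shows the reduction in allocated storage.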

cifs

The Common Internet File System (CIFS) now provides experimental support for SMB v2, which is the successor to the CIFS and SMB network file sharing protocols. (3.7)

ext3 and ext4

File system barriers are now enabled by default. If you experience a performance regression, you can disable the feature by specifying the barrier=0 option to mount. (3.1)

ext4

  • Store checksums of various metadata fields. Each time that a metadata field is read, the checksum of the read data is compared with the stored checksum to detect metadata corruption. (3.5)

  • Quota files are now stored in hidden inodes as file system metadata instead of as separate files in the file system directory hierarchy. Quotas are enabled as soon as the file system is mounted. (3.6)

f2fs

f2fs is an experimental file system that is optimized for flash memory storage devices and solid state drives (SSDs). (3.8)

FUSE

The numa mount option has been added to select code paths that improve performance on NUMA systems.

NFS

The NFS version 4.1 client supports Sessions, Directory Delegations, and parallel NFS (pNFS) as defined in RFC 5661. pNFS can take advantage of cluster systems by providing scalable parallel access, either to a file system or to individual files that are distributed on multiple servers. (3.7)

XFS

Journals now implement checksums for verifying log integrity. (3.8)

1.1.9 Memory Management

  • The frontswap feature can store swap data in transcendent memory, which is neither directly accessible to nor addressable by the kernel. Using transcendent memory in this way can significantly reduce swap I/O. Frontswap is so named because it can be thought of as being the opposite of a backing store for a swap device. A suitable storage medium is a synchronous, concurrency-safe, page-oriented, pseudo-RAM device such as Xen Transcendent Memory (tmem) or in-kernel compressed memory (zmem). (3.5)

  • Safe swapping is supported using network block devices (NBDs) or NFS. (3.6)

1.1.10 Networking

  • TCP controlled delay management (CoDel) is a new active queue management algorithm that is designed to handle excessive buffering across a network connection (bufferbloat). The algorithm is based on how long packets are buffered in the queue rather than on the size of the queue. If the minimum queuing time rises above a threshold value, the algorithm discards packets and reduces the transmission rate of TCP. (3.5)

  • TCP connection repair implements process checkpointing and restart, which allows a TCP connection to be stopped on one host and restarted on another host. Container virtualization can use this feature to move a network connection between hosts. (3.5)

  • TCP and SCTP early retransmit reduces the number of duplicate acknowledgements that are required to trigger fast retransmission (under certain conditions). (3.5)

  • TCP fast open (TFO) can speed up the opening of successive TCP connections between two endpoints by eliminating one round-trip time (RTT) from some TCP transactions. A performance improvement of between 4% and 41% has been measured for web page loading.

    TFO is not enabled by default. To enable it, use the following command:

    # sysctl -w net.ipv4.tcp_fastopen=1

    To make the change persist across system reboots, add the following entry to /etc/sysctl.conf:

    net.ipv4.tcp_fastopen = 1

    Applications that want to use TFO must notify the system using appropriate API calls, such as the TCP_FASTOPEN option to setsockopt() on the server side or the MSG_FASTOPEN flag with sendto() on the client side. (client side 3.6, server side 3.7)

  • The TCP small queue algorithm is another mechanism intended to help deal with bufferbloat. The algorithm limits the amount of data that can be queued for transmission by a socket. The limit is set by /proc/sys/net/ipv4/tcp_limit_output_bytes, where the default value is 128 KB. To reduce network latency, specify a lower value for this limit. (3.6)
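
The net.ipv4.tcp_fastopen setting described earlier in this section is a bit mask rather than a simple on/off switch. The following sketch decodes a value under the assumption, taken from the kernel's ip-sysctl documentation, that bit 0x1 enables client support and bit 0x2 enables server support. tfo_modes is a hypothetical helper name, and any other bits in the value are ignored.

```shell
#!/bin/sh
# tfo_modes is a hypothetical helper that decodes a net.ipv4.tcp_fastopen
# value. Assumption (from the kernel's ip-sysctl documentation): bit 0x1
# enables client support and bit 0x2 enables server support.
tfo_modes() {
    value=$1
    modes=""
    [ $((value & 1)) -ne 0 ] && modes="client"
    [ $((value & 2)) -ne 0 ] && modes="${modes:+$modes }server"
    printf '%s\n' "${modes:-disabled}"
}

# Decode the current setting if the sysctl file exists on this kernel.
if [ -r /proc/sys/net/ipv4/tcp_fastopen ]; then
    tfo_modes "$(cat /proc/sys/net/ipv4/tcp_fastopen)"
fi
```

Under that assumption, a value of 1 covers the client side only, and a value of 3 covers both the client and server sides.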

1.1.11 Performance

  • The slub slab allocator now implements wider lockless operations for most paths on CPU architectures that support CMPXCHG (compare and exchange) instructions. This change can improve the performance of slab intensive workloads. (3.1)

  • The perf report --gtk command launches a simple GTK2-based performance report browser. (3.4)

  • The perf annotate command now allows you to use the Enter key to trace recursively through function calls in the TUI interface. (3.4)

  • The perf record -b command supports a new hardware-based, branch-profiling feature on some CPUs that allows you to examine branch execution. (3.4)

  • Uprobes allow you to place a performance probe at any memory address in a user application so that you can collect debugging and performance information non-disruptively. (3.5)

  • The perf trace command can be used to record a workload according to a specified script, and to display a detailed trace of a workload that was previously recorded. This command provides an alternative interface to strace. (3.7)

1.1.12 Security

  • The secure computing mode feature (seccomp) is a simple sandbox mechanism that, in strict mode, allows a thread to transition to a state where it cannot make any system calls except from a very restricted set (_exit(), read(), sigreturn(), and write()) and it can only use file descriptors that were already open. In filter mode, a thread can specify an arbitrary filter of permitted system calls that would be forbidden in strict mode. Access to this feature is by using the prctl() system call. For more information, see the prctl(2) manual page. (3.5)

  • Supervisor mode access prevention (SMAP) is a new security feature that will be supported by future Intel processors. SMAP forbids kernel access to user-space memory pages, which should help eliminate some forms of exploit. If the SMAP bit has been set in CR4, an attempt to access user-space memory from privileged mode causes a page-fault exception. For more information, refer to the Intel® Architecture Instruction Set Extensions Programming Reference. (3.7)
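
The seccomp modes described earlier in this section (strict and filter) can be observed for a running process. The following sketch assumes that the kernel exposes a Seccomp field in /proc/self/status and that the mode numbers follow the kernel's SECCOMP_MODE_* constants (0 for disabled, 1 for strict, 2 for filter). seccomp_mode_name is a hypothetical helper name.

```shell
#!/bin/sh
# seccomp_mode_name is a hypothetical helper that names a seccomp mode number.
# Assumption: 0 = disabled, 1 = strict, 2 = filter, as used by the kernel's
# SECCOMP_MODE_* constants.
seccomp_mode_name() {
    case $1 in
        0) echo disabled ;;
        1) echo strict ;;
        2) echo filter ;;
        *) echo unknown ;;
    esac
}

# Assumption: /proc/self/status carries a "Seccomp:" line on this kernel.
mode=$(sed -n 's/^Seccomp:[[:space:]]*//p' /proc/self/status 2>/dev/null)
if [ -n "$mode" ]; then
    seccomp_mode_name "$mode"
else
    echo "no Seccomp field in /proc/self/status"
fi
```

The script reports the mode of the sed process that reads the file, which inherits its seccomp state from the invoking shell.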

1.1.13 Storage

  • The LSI MPT3SAS driver has been added to support LSI MPT Fusion based SAS3 (SAS 12.0 Gb/s) controllers.

  • The OpenFabrics Enterprise Distribution (OFED) 2.0 stack has been integrated, which supports the following InfiniBand (IB) hardware on systems with an x86-64 architecture:

    • Mellanox ConnectX-2 InfiniBand Host Channel Adapters

    • Mellanox ConnectX-3 InfiniBand Host Channel Adapters are supported for Oracle X4-2, X4-2L, and Netra X3-2 servers

    • Sun InfiniBand QDR Host Channel Adapter PCIe #375-3696

    OFED 2.0 supports the following protocols:

    • SCSI RDMA Protocol (SRP) enables access to remote SCSI devices via remote direct memory access (RDMA)

    • iSCSI Extensions for remote direct memory access (iSER) provide access to iSCSI storage devices

    • Reliable Datagram Sockets (RDS) is a high-performance, low-latency, reliable connectionless protocol for datagram delivery

    • Sockets Direct Protocol (SDP) supports stream sockets for RDMA network fabrics

    • Ethernet over InfiniBand (EoIB)

    • Internet Protocol over InfiniBand (IPoIB)

    • Ethernet tunneling over IPoIB (eIPoIB)

    and the following RDS features:

    • Async Send (AS)

    • Quality of Service (QoS)

    • Automatic Path Migration (APM)

    • Active Bonding (AB)

    • Shared Request Queue (SRQ)

    • Netfilter (NF)

  • Support for IB, OFED, and RDS is integrated into the kernel. The OFED user-space RPMs continue to be provided, but the kernel-ib and ofa-kernel RPMs are not required.

  • A new iSCSI implementation raises the supported iSCSI target framework to LIO version 4.1. (3.1)

1.1.14 Virtualization

  • Paravirtualization support has been enabled for Oracle Linux guests on Windows Server 2008 Hyper-V or Windows Server 2008 R2 Hyper-V.

  • VFS scalability improvements:

    • The inode_stat.nr_unused counter has been converted to a per-CPU counter.

    • The global LRU list of unused inodes has been converted to a per-superblock LRU list.

    • The iprune_sem semaphore has been removed because of changes to the LRU lists.

    • The i_alloc_sem functionality has been replaced with a simplified scheme.

    • The scalability of mount locks has been improved for file systems that do not have mount points.

    • The use of inode_hash_lock is avoided for pipes and sockets.

    (3.1)

  • privcmd is a new character device driver that handles access to arbitrary hypercalls through XenFS. (3.3)

  • xenbus_backend is a new device driver for xenbus used by XenFS. (3.3)

  • The xenbus device driver adds a new character device featuring mmap for the pre-allocated ring and an ioctl() for the event channel via XenFS. (3.3)

  • The Virtual Extensible LAN (VXLAN) tunneling protocol overlays a virtual network on an existing Layer 3 infrastructure to allow the transfer of Layer 2 Ethernet packets over UDP. This feature is intended for use by a virtual network infrastructure in a virtualized environment. Use cases include virtual machine migration and software-defined networking (SDN). (3.7)