1.1.2 Core Kernel Functionality

The following notable core kernel features are implemented in UEK R4:

  • The performance of SPECjbb is improved for a system with more than 10 CPUs by removing contention for the global epmutex lock, which is used in EPOLL_CTL_ADD and EPOLL_CTL_DEL operations. For example, in a typical 16-socket run the performance increases from 35k jOPS to 125k jOPS. Benchmarks also exhibit good scaling from 10 sockets to over 40 sockets.

  • The sysctl_numa_balancing_settle_count parameter used by the NUMA scheduler has been removed.

  • The following tracepoints are now provided to monitor NUMA scheduler activity:

    trace_sched_move_numa

    Triggered when a task is moved to a node.

    trace_sched_stick_numa

    Triggered when a NUMA migration fails.

    trace_sched_swap_numa

    Triggered when a task is swapped for another task.
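
    As a rough illustration of how these tracepoints might be switched on, the following Python sketch writes 1 to the enable file of each corresponding event in the sched group of the ftrace events directory. The tracing directory (/sys/kernel/debug/tracing here), the assumption that the events appear under events/sched with the trace_ prefix dropped, and the helper name enable_numa_tracepoints are illustrative and are not stated in these notes. The actual location depends on where debugfs or tracefs is mounted, and writing to these files requires root privileges.

```python
from pathlib import Path

# Event names assumed from the tracepoints listed above (trace_ prefix dropped).
NUMA_EVENTS = ("sched_move_numa", "sched_stick_numa", "sched_swap_numa")


def enable_numa_tracepoints(tracing_root="/sys/kernel/debug/tracing", enable=True):
    """Write 1 (or 0) to each event's enable file; return the events changed."""
    changed = []
    for event in NUMA_EVENTS:
        enable_file = Path(tracing_root) / "events" / "sched" / event / "enable"
        if not enable_file.exists():
            # Event not exposed on this kernel, so skip it rather than fail.
            continue
        enable_file.write_text("1" if enable else "0")
        changed.append(event)
    return changed
```

    Passing enable=False to the same helper turns the events off again. Events that are not present are skipped, so the returned list shows which tracepoints were actually changed.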

  • The new SCHED_STACK_END_CHECK kernel debugging option can be used to check for a stack overrun on calls to schedule() on a NUMA system. If the stack end location has been overwritten, the system panics because the contents of the corrupted region cannot be trusted.

  • Sysbench performance has been improved by preventing spurious active NUMA migration.

  • CPU clock frequency scaling is available for performance management. The possible governor settings, as displayed by /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor, are:

    ondemand

    Sets the CPU clock frequency to a value between the minimum and maximum possible frequencies, according to the current CPU demand. The following sysfs parameters are adjustable:

      ignore_nice_load

      Whether processes with a nice value count toward CPU usage (0) or do not count (1). The default value is 0.

      powersave_bias

      The amount by which to reduce the target CPU frequency, expressed in parts per thousand. A value of 0 disables this feature.

      sampling_down_factor

      A multiplier that the kernel applies to sampling_rate when the CPU is running at its maximum clock frequency. The default value is 1.

      sampling_rate_min

      The minimum permitted value of sampling_rate.

      sampling_rate

      Interval in microseconds between assessments of whether the kernel needs to change the clock frequency.

      up_threshold

      Threshold of average CPU usage, as a percentage, above which the kernel increases the clock frequency.

    ondemand is the default governor setting if tuned is not configured.

    This setting is equivalent to powersave for CPUs with more recent microarchitectures (for example, Haswell, Broadwell, and later) with which the pstate power scaling driver can interact. For CPUs with older microarchitectures (for example, Ivy Bridge, Sandy Bridge, and earlier), ondemand is equivalent to performance because the cores must be kept in a higher power state to minimize CPU latency.

    performance

    Sets the CPU clock frequency to the maximum possible frequency.

    Note

    performance is the default governor setting for the tuned throughput-performance profile.

    The performance governor setting is appropriate for some real-time applications, but it might not be appropriate for all workloads. Running a CPU at maximum frequency can prevent turbo mode from being enabled because doing so would exceed the thermal envelope.

    powersave

    Sets the CPU clock frequency to the minimum possible frequency.

    userspace

    Permits a user-space program running as an effective root user to control the CPU clock frequency by writing to the file named scaling_setspeed in the CPU-device directory under sysfs.
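
    To make the sysfs interface described above more concrete, the following Python sketch reads the scaling_governor file for each CPU and, optionally, writes a new governor name to it. The function names read_governors and set_governor and the cpu_root parameter are hypothetical and exist only for illustration. Writing to scaling_governor requires root privileges, and a running tuned profile might later change the setting again.

```python
from pathlib import Path


def read_governors(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: governor} for every CPU that exposes cpufreq."""
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(cpu_root).glob(pattern)):
        cpu_name = gov_file.parent.parent.name  # for example "cpu0"
        governors[cpu_name] = gov_file.read_text().strip()
    return governors


def set_governor(governor, cpu_root="/sys/devices/system/cpu"):
    """Write the governor name to every CPU's scaling_governor file."""
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in Path(cpu_root).glob(pattern):
        gov_file.write_text(governor)
```

    Calling read_governors() with no arguments inspects the running system. The cpu_root parameter is there so that the functions can be tried against a scratch directory before being pointed at /sys.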

    Oracle recommends that you use tuned-adm to select a tuned performance profile for your system that is based on its hardware and software configuration, for example:

    • If your system has Xeon processors or multiple disks, choose a profile such as latency-performance for a cloud server, throughput-performance for a database server, or virtual-host for a virtual host server.

      Note

      These profiles set the CPU governor setting to performance, which might not be appropriate for all workloads.

    • For a virtual machine guest, choose the virtual-guest profile.

    • For a laptop, choose a suitable laptop profile such as laptop-ac-powersave or laptop-battery-powersave.

    • For a desktop machine, choose either the desktop or balanced profile.

    You can use the tuned-adm list command to display the available profiles.
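
    For scripted setups, the tuned-adm invocations mentioned above could be wrapped as in the following Python sketch. The helper names tuned_adm_command and run_tuned_adm are hypothetical. The sketch assumes that tuned-adm is installed and on the PATH, and that the list, active, and profile subcommands are available in the installed version of tuned.

```python
import subprocess


def tuned_adm_command(action, profile=None):
    """Build the argument list for a tuned-adm invocation."""
    if action == "profile":
        if not profile:
            raise ValueError("the 'profile' action needs a profile name")
        return ["tuned-adm", "profile", profile]
    if action in ("list", "active"):
        return ["tuned-adm", action]
    raise ValueError(f"unsupported action: {action}")


def run_tuned_adm(action, profile=None):
    """Run tuned-adm and return its standard output as text."""
    result = subprocess.run(
        tuned_adm_command(action, profile),
        check=True,           # raise CalledProcessError on a non-zero exit
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode output to str
    )
    return result.stdout
```

    For example, run_tuned_adm("list") returns the text that tuned-adm list prints, and run_tuned_adm("profile", "virtual-guest") selects the virtual-guest profile. Selecting a profile typically requires root privileges.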

    If tuned is not configured, the default CPU governor setting is ondemand, which can cause some bursty, CPU-intensive workloads to run more slowly because of demand hysteresis.

    If necessary, you can create your own performance profiles based on the profiles that are provided in the /etc/tune-profiles directory hierarchy.

    When comparing system performance under different profiles, use benchmarks that simulate your server's typical workload.

    For more information, see the tuned(8) and tuned-adm(1) manual pages, which are available in the tuned package.