Programming Interfaces Guide

Locality Groups and Thread and Memory Placement

This section discusses the APIs used to discover and affect thread and memory placement with respect to lgroups. The lgrp_home() function is used to discover thread placement. The meminfo(2) system call is used to discover memory placement. The MADV_ACCESS flags to the madvise(3C) function are used to affect memory allocation among lgroups. The lgrp_affinity_set() function can affect thread and memory placement by setting a thread's affinity for a given lgroup. The affinities of an lgroup may specify an order of preference for lgroups from which to allocate resources. The kernel needs information about the likely pattern of an application's memory use in order to allocate memory resources efficiently. The madvise() function, and its shared object analogue madv.so.1, provide this information to the kernel. A running process can gather memory usage information about itself by using the meminfo() system call.

Using lgrp_home()

The lgrp_home() function returns the home lgroup for the specified process or thread.

#include <sys/lgrp_user.h>
lgrp_id_t lgrp_home(idtype_t idtype, id_t id);

The lgrp_home() function returns EINVAL when the ID type is not valid. The lgrp_home() function returns EPERM when the effective user of the calling process is not the superuser and the calling process' real or effective user ID does not match the real or effective user ID of one of the threads. The lgrp_home() function returns ESRCH when the specified process or thread is not found.

Using madvise()

The madvise() function advises the kernel that a region of user virtual memory in the range starting at the address specified in addr and with length equal to the value of the len parameter is expected to follow a particular pattern of use. The kernel uses this information to optimize the procedure for manipulating and maintaining the resources associated with the specified range. Use of the madvise() function can increase system performance when used by programs that have specific knowledge of their access patterns over memory.

#include <sys/types.h>
#include <sys/mman.h>
int madvise(caddr_t addr, size_t len, int advice);

The madvise() function provides the following flags to affect how a thread's memory is allocated among lgroups:

MADV_ACCESS_DEFAULT

This flag resets the kernel's expected access pattern for the specified range to the default.

MADV_ACCESS_LWP

This flag advises the kernel that the next LWP to touch the specified address range is the LWP that will access that range the most. The kernel allocates the memory and other resources for this range and the LWP accordingly.

MADV_ACCESS_MANY

This flag advises the kernel that many processes or LWPs will access the specified address range randomly across the system. The kernel allocates the memory and other resources for this range accordingly.

The madvise() function returns EAGAIN when some or all of the mappings in the specified address range, from addr to addr+len, are locked for I/O. The madvise() function returns EINVAL when the value of the addr parameter is not a multiple of the page size as returned by sysconf(3C). The madvise() function returns EINVAL when the length of the specified address range is less than or equal to zero. The madvise() function returns EINVAL when the advice is invalid. The madvise() function returns EIO when an I/O error occurs while reading from or writing to the file system. The madvise() function returns ENOMEM when addresses in the specified address range are outside the valid range for the address space of a process, or the addresses in the specified address range specifiy one or more pages that are not mapped. The madvise() function returns ESTALE when the NFS file handle is stale.

Using madv.so.1

The madv.so.1 shared object enables the selective configuration of virtual memory advice for launched processes and their descendants. To use the shared object, the following string must be present in the environment:

LD_PRELOAD=$LD_PRELOAD:madv.so.1

The madv.so.1 shared object applies memory advice as specified by the value of the MADV environment variable. The MADV environment variable specifies the virtual memory advice to use for all heap, shared memory, and mmap regions in the process address space. This advice is applied to all created processes. The following values of the MADV environment variable affect resource allocation among lgroups:

access_default

This value resets the kernel's expected access pattern to the default.

access_lwp

This value advises the kernel that the next LWP to touch an address range is the LWP that will access that range the most. The kernel allocates the memory and other resources for this range and the LWP accordingly.

access_many

This value advises the kernel that many processes or LWPs will access memory randomly across the system. The kernel allocates the memory and other resources accordingly.

The value of the MADVCFGFILE environment variable is the name of a text file that contains one or more memory advice configuration entries in the form <exec-name>:<advice-opts>.

The value of <exec-name> is the name of an application or executable. The value of <exec-name> can be a full pathname, a base name, or a pattern string.

The value of <advice-opts> is of the form <region>=<advice>. The values of <advice> are the same as the values for the MADV environment variable. Replace <region> with any of the following legal values:

madv

Advice applies to all heap, shared memory, and mmap(2) regions in the process address space.

heap

The heap is defined to be the brk(2) area. Advice applies to the existing heap and to any additional heap memory allocated in the future.

shm

Advice applies to shared memory segments. See shmat(2) for more information on shared memory operations.

ism

Advice applies to shared memory segments that are using the SHM_SHARE_MMU flag. The ism option takes precedence over shm.

dsm

Advice applies to shared memory segments that are using the SHM_PAGEABLE flag. The dsm option takes precedence over shm.

mapshared

Advice applies to mappings established by the mmap() system call using the MAP_SHARED flag.

mapprivate

Advice applies to mappings established by the mmap() system call using the MAP_PRIVATE flag.

mapanon

Advice applies to mappings established by the mmap() system call using the MAP_ANON flag. The mapanon option takes precendence when multiple options apply.

The value of the MADVERRFILE environment variable is the pathname where error messages are logged. In the absence of a MADVERRFILE location, the madv.so.1 shared object logs errors by using syslog(3C) with a LOG_ERR as the severity level and LOG_USER as the facility descriptor.

Memory advice is inherited. A child process has the same advice as its parent. The advice is set back to the system default advice after a call to exec(2) unless a different level of advice is configured via the madv.so.1 shared object. Advice is only applied to mmap() regions explicitly created by the user program. Regions established by the run-time linker or by system libraries that make direct system calls are not affected.

madv.so.1 Usage Examples

The following examples illustrate specific aspects of the madv.so.1 shared object.


Example 4–2 Setting Advice for a Set of Applications

This configuration applies advice to all ISM segments for applications with exec names that begin with foo.

$ LD_PRELOAD=$LD_PRELOAD:madv.so.1
$ MADVCFGFILE=madvcfg
$ export LD_PRELOAD MADVCFGFILE
$ cat $MADVCFGFILE
        foo*:ism=access_lwp


Example 4–3 Excluding a Set of Applications From Advice

This configuration sets advice for all applications with the exception of ls.

$ LD_PRELOAD=$LD_PRELOAD:madv.so.1
$ MADV=access_many
$ MADVCFGFILE=madvcfg
$ export LD_PRELOAD MADV MADVCFGFILE
$ cat $MADVCFGFILE
        ls:


Example 4–4 Pattern Matching in a Configuration File

Because the configuration specified in MADVCFGFILE takes precedence over the value set in MADV, specifying * as the <exec-name> of the last configuration entry is equivalent to setting MADV. This example is equivalent to the previous example.

$ LD_PRELOAD=$LD_PRELOAD:madv.so.1
$ MADVCFGFILE=madvcfg
$ export LD_PRELOAD MADVCFGFILE
$ cat $MADVCFGFILE
        ls:
        *:madv=access_many



Example 4–5 Advice for Multiple Regions

This configuration applies one type of advice for mmap() regions and different advice for heap and shared memory regions for applications whose exec() names begin with foo.

$ LD_PRELOAD=$LD_PRELOAD:madv.so.1
$ MADVCFGFILE=madvcfg
$ export LD_PRELOAD MADVCFGFILE
$ cat $MADVCFGFILE
        foo*:madv=access_many,heap=sequential,shm=access_lwp

Using meminfo()

The meminfo() function gives the calling process information about the virtual memory and physical memory that the system has allocated to that process.

#include <sys/types.h>
#include <sys/mman.h>
int meminfo(const uint64_t inaddr[], int addr_count,
    const uint_t info_req[], int info_count, uint64_t outdata[],
    uint_t validity[]);

The meminfo() function can return the following types of information:

MEMINFO_VPHYSICAL

The physical memory address corresponding to the given virtual address.

MEMINFO_VLGRP

The lgroup to which the physical page corresponding to the given virtual address belongs.

MEMINFO_VPAGESIZE

The size of the physical page corresponding to the given virtual address.

MEMINFO_VREPLCNT

The number of replicated physical pages that correspond to the given virtual address.

MEMINFO_VREPL|n

The nth physical replica of the given virtual address.

MEMINFO_VREPL_LGRP|n

The lgroup to which the nth physical replica of the given virtual address belongs.

MEMINFO_PLGRP

The lgroup to which the given physical address belongs.

The meminfo() function takes the following parameters:

inaddr

An array of input addresses.

addr_count

The number of addresses that are passed to meminfo().

info_req

An array listing the types of information that are being requested.

info_count

The number of pieces of information that are requested for each address in the inaddr array.

outdata

An array where the meminfo() function places the results. The array's size is equal to the product of the values of the info_req and addr_count parameters.

validity

An array of size equal to the value of the addr_count parameter. The validity array contains bitwise result codes. The 0th bit of the result code evaluates the validity of the corresponding input address. Each successive bit in the result code evaluates the validity of the response to the members of the info_req array in turn.

The meminfo() function returns EFAULT when the area of memory that the outdata or validity arrays point to cannot be written to. The meminfo() function returns EFAULT when the area of memory that the info_req or inaddr arrays point to cannot be read from. The meminfo() function returns EINVAL when the value of info_count exceeds 31 or is less than 1. The meminfo() function returns EINVAL when the value of addr_count is less than zero.


Example 4–6 Use of meminfo() to print out physical pages and page sizes corresponding to a set of virtual addresses

void
print_info(void **addrvec, int how_many)
{
        static const int info[] = {
                MEMINFO_VPHYSICAL,
                MEMINFO_VPAGESIZE};
        uint64_t * inaddr = alloca(sizeof(uint64_t) * how_many);
        uint64_t * outdata = alloca(sizeof(uint64_t) * how_many * 2;
        uint_t * validity = alloca(sizeof(uint_t) * how_many);

        int i;

        for (i = 0; i < how_many; i++)
                inaddr[i] = (uint64_t *)addr[i];

        if (meminfo(inaddr, how_many,  info,
                    sizeof (info)/ sizeof(info[0]),
                    outdata, validity) < 0)
                ...

        for (i = 0; i < how_many; i++) {
                if (validity[i] & 1 == 0)
                        printf("address 0x%llx not part of address
                                        space\n",
                                inaddr[i]);

                else if (validity[i] & 2 == 0)
                        printf("address 0x%llx has no physical page
                                        associated with it\n",
                                inaddr[i]);

                else {
                        char buff[80];
                        if (validity[i] & 4 == 0)
                                strcpy(buff, "<Unknown>");
                        else
                                sprintf(buff, "%lld", outdata[i * 2 +
                                                1]);
                        printf("address 0x%llx is backed by physical
                                        page 0x%llx of size %s\n",
                                        inaddr[i], outdata[i * 2], buff);
                }
        }
}

Locality Group Affinity

The kernel assigns a thread to a locality group when the light weight process (LWP) for that thread is created. That lgroup is called the thread's home lgroup. The kernel runs the thread on the CPUs in the thread's home lgroup and allocates memory from that lgroup whenever possible. If resources from the home lgroup are unavailable, the kernel allocates resources from other lgroups. When a thread has affinity for more than one lgroup, the operating system allocates resources from lgroups chosen in order of affinity strength. There are three affinity levels:

  1. LGRP_AFF_STRONG indicates strong affinity. If this lgroup is the thread's home lgroup, the operating system avoids rehoming the thread to another lgroup if possible. Events such as dynamic reconfiguration, processor, offlining, processor binding, and processor set binding and manipulation may still result in thread rehoming.

  2. LGRP_AFF_WEAK indicates weak affinity. If this lgroup is the thread's home lgroup, the operating system rehomes the thread if necessary for load balancing purposes.

  3. LGRP_AFF_NONE indicates no affinity. If a thread has no affinity to any lgroup, the operating system assigns the thread a home lgroup.

The operating system uses lgroup affinities as advice when allocating resources for a given thread. The advice is factored in with the other system constraints. Processor binding and processor sets do not change lgroup affinities, but may restrict the lgroups on which a thread can run.

Using lgrp_affinity_get()

The lgrp_affinity_get() function returns the affinity that a LWP or set of LWPs have for a given lgroup.

#include <sys/lgrp_user.h>
lgrp_affinity_t lgrp_affinity_get(idtype_t idtype, id_t id, lgrp_id_t lgrp);

The idtype and id arguments specify the LWP or set of LWPs that the lgrp_affinity_get() function examines. If the value of idtype is P_PID, the lgrp_affinity_get() function gets the lgroup affinity for one of the LWPs in the process whose process ID matches the value of the id argument. If the value of idtype is P_LWPID, the lgrp_affinity_get() function gets the lgroup affinity for the LWP of the current process whose LWP ID matches the value of the id argument. If the value of idtype is P_MYID, the lgrp_affinity_get() function gets the lgroup affinity for the current LWP or process.

The lgrp_affinity_get() function returns EINVAL when the given lgroup, affinity, or ID type is not valid. The lgrp_affinity_get() function returns EPERM when the effective user of the calling process is not the superuser and the calling process' ID does not match the real or effective user ID of one of the LWPs. The lgrp_affinity_get() function returns ESRCH when a given lgroup or LWP is not found.

Using lgrp_affinity_set()

The lgrp_affinity_set() function sets the affinity that a LWP or set of LWPs have for a given lgroup.

#include <sys/lgrp_user.h>
int lgrp_affinity_set(idtype_t idtype, id_t id, lgrp_id_t lgrp,
                      lgrp_affinity_t affinity);

The idtype and id arguments specify the LWP or set of LWPs the lgrp_affinity_set() function examines. If the value of idtype is P_PID, the lgrp_affinity_set() function sets the lgroup affinity for all of the LWPs in the process whose process ID matches the value of the id argument to the affinity level specified in the affinity argument. If the value of idtype is P_LWPID, the lgrp_affinity_set() function sets the lgroup affinity for the LWP of the current process whose LWP ID matches the value of the id argument to the affinity level specified in the affinity argument. If the value of idtype is P_MYID, the lgrp_affinity_set() function sets the lgroup affinity for the current LWP or process to the affinity level specified in the affinity argument.

The lgrp_affinity_set() function returns EINVAL when the given lgroup, affinity, or ID type is not valid. The lgrp_affinity_set() function returns EPERM when the effective user of the calling process is not the superuser and the calling process' ID does not match the real or effective user ID of one of the LWPs. The lgrp_affinity_set() function returns ESRCH when a given lgroup or LWP is not found.