
man pages section 5: Standards, Environments, and Macros


Updated: July 2017
 
 

cmi (5)

Name

cmi - Coherent Memory Interface (CMI)

Description

Oracle Coherent Memory Interface (CMI) exposes a distributed shared memory abstraction. In order to support a diverse range of networking technologies, vendors, and approaches to CMI, a portable interface is required. CMI clients and compiler backends can interact with the CMI using the Application Programming Interface (API) outlined in this man page. Vendors can provide a CMI library and associated OS hardware drivers to enable CMI applications to run on their platforms. CMI clients gain the benefit of portability by targeting this interface without being tied down to a specific vendor or platform implementation.

CMI Memory Model

It is envisaged that CMI implementations may support differing levels of memory consistency, ranging from relaxed to partial to total store order (TSO). In order to support execution on platforms with more relaxed memory models, CMI defines a set of memory barrier primitives to enforce ordering. CMI clients can utilize these barriers to enforce consistent access to CMI; that is, clients can write code that operates correctly on relaxed memory models by explicitly inserting memory barriers in appropriate locations. Such code then works seamlessly on implementations with stricter memory models (such as TSO).


Note -  On CMI implementations that provide stricter ordering guarantees, some of the memory barrier primitives can be no-ops to minimize the overhead of CMI applications running on these systems.

CMI currently defines three memory barrier primitives:

cmi_mb()

'Memory Barrier' that orders both loads and stores. Loads and stores preceding the memory barrier are committed to memory before any loads and stores following the memory barrier.

cmi_wmb()

'Write Memory Barrier' only orders stores. This means that stores preceding the memory barrier are committed to memory before any stores following the memory barrier in program order. Load operations can be reordered across the barrier if the underlying implementation supports it.

cmi_rmb()

'Read Memory Barrier' only orders loads. This means that loads preceding the memory barrier are completed before loads following the barrier in program order. Stores can be reordered across the barrier if the underlying implementation supports it.


Note -  CMI memory barrier primitives should also prevent the compiler from applying any optimizations that would have the effect of reordering memory operations across the barrier. With respect to memory barriers, a store to a CMI segment being 'committed to memory' refers to the memory of the home node, not any local memory being used as a cache on a remote node. This means that the result of a store should be visible on the home node as well as on other nodes in the cluster.
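
As an illustration, the following sketch shows a simple producer/consumer handoff through a CMI segment using these primitives. The record layout and flag protocol are hypothetical, and the barrier primitives are assumed to be callable with no arguments and to be declared by the vendor CMI header; none of this is specified by this page.

      #include <stdint.h>
      /* Assumes the vendor CMI header declaring cmi_wmb()/cmi_rmb() is included. */

      /* Hypothetical record placed inside a CMI segment mapped on both nodes. */
      struct handoff {
       uint64_t          data;    /* payload written by the producer */
       volatile uint64_t ready;   /* flag polled by the consumer     */
      };

      /* Producer: publish the payload before the flag. */
      void
      publish(struct handoff *h, uint64_t value)
      {
       h->data = value;
       cmi_wmb();                 /* order the data store before the flag store */
       h->ready = 1;
      }

      /* Consumer: observe the flag before reading the payload. */
      uint64_t
      consume(struct handoff *h)
      {
       while (h->ready == 0)
         ;                        /* spin until the producer sets the flag */
       cmi_rmb();                 /* order the flag load before the data load */
       return h->data;
      }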

Flush Barriers and Recovery

In the absence of any hardware failures, the memory barriers would be sufficient to ensure correct operation. However, CMI needs to maintain consistency of data in the presence of failures. To accomplish this, CMI defines a flush barrier that is used to update the home node memory at well-defined points. The flush barrier is a synchronous barrier, that is, thread execution cannot proceed till all updates are flushed to the home node. On return from the flush barrier, all updates are committed to memory, which means that all subsequent loads to committed addresses from any process on any node in the cluster should see the updated values.

The CMI platform tracks addresses that are modified by a thread, cached locally, and need to be updated to the home node. To aid the CMI implementation in tracking updates that need to be flushed to the home node, a thread starts a flush epoch through a call to the open_fb() function. CMI clients can request the CMI platform to flush or sync all stores to the home node that are being tracked during the flush epoch through a call to the flush_fb() function. The flush operation is synchronous and, on return, should guarantee that all previous stores executed by the thread and tracked since the start of the epoch, or since a previous flush command was issued, have successfully been reflected on the home node. For CMI systems that support write through semantics, all stores are reflected to the home node memory instead of only being cached locally. On such systems, flush barrier operations may have no effect (no-ops) if all stored data has already been flushed to the home node.

It is illegal for a thread to have more than one open flush epoch. The CMI platform supports a separate flush epoch per thread. It is legal to have multiple threads updating a segment concurrently with each thread having a separate flush epoch open. The epoch is explicitly closed through a call to the close_fb() function. Closing an epoch is an implicit sync operation, and all stores within the epoch must be committed to home node memory before return.
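
As a sketch, a thread-level flush epoch might be used as follows. The exact parameter lists of open_fb(), flush_fb(), and close_fb() are not specified in this page and are assumed here to take the CMI context.

      #include <stdint.h>
      /* Assumes a registered thread (ini_th()) and 'p' pointing into an
       * attached CMI segment. */

      int
      update_pair(cmi_ctxt *ctxt, uint64_t *p, uint64_t v0, uint64_t v1)
      {
       if (0 != open_fb(ctxt))          /* start this thread's flush epoch */
         return -1;

       p[0] = v0;                       /* stores tracked within the epoch */
       p[1] = v1;

       if (0 != flush_fb(ctxt))         /* synchronous flush to the home node */
         return -1;                     /* sync failure: see Error Handling */

       /* Further tracked stores could follow and be flushed again here. */

       return close_fb(ctxt);           /* implicit sync, then close the epoch */
      }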

If an access error occurs while updating the home node during a sync operation, then an access error or exception should be thrown for the thread. For more information, see Error Handling below.


Note -  All store operations will eventually be made visible even in the absence of any flush or memory barrier operations. Flush barriers have thread level scope.

CMI Objects

This section provides an overview of the various CMI objects and their usage. All CMI objects are allocated by the CMI vendor library and are guaranteed to be valid during the lifetime of the object as defined by the CMI API. CMI objects may be valid only within a certain scope. The following scopes are defined:

Thread

Object is valid for a thread context. For a multithreaded process, the object cannot be transferred to or used by other threads.

Process

Object has process scope. For a multithreaded process, all threads within a process can share the object.

Node

Object has node level scope. The object handle or value can be shared across all processes and threads on the node. This means that the underlying resource referenced by the object can be shared across processes on a node. Node in this context refers to an operating system image or a virtualization domain.

Cluster

Object has cluster level scope. All processes and threads across all the nodes in the cluster can share the object. This means that the underlying resource referenced by the object can be shared and is visible across the entire cluster. Cluster here refers to the collection of nodes connected by a CMI.

CMI Context

All CMI operations performed by a process or any of its threads require a CMI context handle. A CMI context is a semi-opaque handle allocated by the CMI library. The CMI context for the process is obtained when the CMI library is initialized using the cmi_ini() function. For multithreaded processes, any thread that may invoke a CMI routine must register itself with the CMI library using the ini_th() function. This gives the CMI library the opportunity to allocate any thread specific resources that may be required. A matching call to the fini() function per thread is required to deallocate resources. The call to the cmi_ini() function implicitly calls the ini_th() function and registers the calling thread with the CMI library. Once all threads have deregistered using calls to the fini() function, the CMI context can be deallocated by the library. All CMI operations invoked by threads that are not registered with the library using the ini_th() function fail with error CMI_ERR_INIT.
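
For a multithreaded client, per-thread registration might look like the following sketch. Direct calls to ini_th() and fini() are shown with assumed signatures taking the process context; a real client may instead dispatch them through the context vtable as in the Examples section.

      /* Thread start routine: register with the CMI library before any CMI call. */
      void *
      worker(void *arg)
      {
       cmi_ctxt *ctxt = (cmi_ctxt *) arg;   /* context obtained via cmi_ini() */

       if (0 != ini_th(ctxt))
       {
         /* Without registration, CMI calls would fail with CMI_ERR_INIT. */
         return NULL;
       }

       /* ... CMI operations performed on behalf of this thread ... */

       (void) fini(ctxt);                   /* matching per-thread deregistration */
       return NULL;
      }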

The CMI context encapsulates all library specific state required to perform CMI operations. The context contains a base CMI structure, and may contain any opaque library specific state. The base object contains the currently active version of the CMI API as well as the device and vendor IDs for the CMI library. These fields are used by the client and the library to validate various CMI objects and to allow mixed vendor operation. The CMI context also contains a function vtable for the CMI operations. The vtable is set up by the library based on the CMI API version negotiated between the client and the library. The negotiated version is determined by the base CMI API version supported by both the client and the library. The client provides the maximum CMI API version it supports in the verno parameter to the cmi_ini() function. The CMI vendor library selects the highest API version that it is capable of supporting and that the client can understand. CMI APIs are backward compatible within a major release; that is, all 1.x APIs are backward compatible, and a client that supports CMI v1.5 can also understand v1.1, v1.2, v1.3, and v1.4. For example, if a CMI client requests the library to initialize with v1.5, while the loaded CMI library supports only v1.3, then initialization succeeds and the CMI context is configured for v1.3. Similarly, if the CMI vendor library supports v1.8, then the CMI context is configured for v1.5. CMI ABI compatibility across major versions is not guaranteed and the CMI library initialization can fail. For example, if a client requests v1.5 and the CMI library supports v2.0, then the cmi_ini() function can fail with error CMI_ERR_NOTSUP, assuming that v2.0 is incompatible with v1.x. The client can retrieve the reason for failure using the cmi_get_error() function. This is the only CMI routine that a thread can call without registering with the CMI library.


Note -  CMI context has process level scope.

CMI Segment

CMI segments extend the concept of standard shared memory segments across the network. CMI segments can be allocated on a node through the seg_get() function, similar to the shmget() function. The node on which a segment is allocated is known as the home node for the segment. Processes on the home node can attach or map the segment into their address space using the seg_at() function, or detach the segment from their address space using the seg_dt() function. CMI segments have node level scope, which means that once a segment is allocated by a process, the underlying resource is available to other processes on the node. Standard OS mechanisms are employed for access control to a segment on the home node.

CMI segments can also be accessed over the network from processes on a remote node. To make a segment accessible over the network, the local segment must be exported using the seg_exp() function. A remote segment handle uniquely identifies an exported segment in the cluster, and hence has cluster level scope. Before processes can attach to remote segments, they must be imported on the node using the seg_imp() function. Importing a remote segment handle creates a local segment handle that is a proxy used to access the segment over the network. A remote segment needs to be imported by only one process on a node to make the proxy segment available to all processes on the node. Imported CMI segments are not required to support being a source or target of DMA operations. Exported CMI segments, however, should be capable of being a source or target of DMA operations.

Access control to exported segments is accomplished using access tokens. The process on the home node that allocates a segment can generate access tokens for it using the tok_new() function. Access tokens can be distributed over the network to remote nodes to grant them access to the segment. Multiple access tokens can be generated for a segment on the home node, allowing fine grained access control to the segment. Remote processes associate access tokens with imported proxy segments through the seg_ctl() function using the CMI_SET_TOKEN command. Only the process that imports a remote segment can associate an access token. If an access token is not associated with an imported segment, then all access to the mapped segment results in an access error or exception. For more information, see CMI Access Control.
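
The import side of this flow might look like the following sketch, modeled on the export example in the Examples section. The cmi_token type, the parameter lists of seg_imp(), seg_ctl(), and seg_at(), and the way the CMI_SET_TOKEN command is passed are assumptions.

      /* Import a remote segment, associate its access token, and attach it.
       * 'rseg' and 'tok' were received out of band from the exporting node. */
      void *
      import_and_attach(cmi_ctxt *ctxt, cmi_rseg *rseg, cmi_token *tok)
      {
       cmi_seg seg;

       /* Create the local proxy segment for the exported remote segment. */
       seg = CMIFN(ctxt, 10, seg_imp)(ctxt, rseg);

       /* Imported segments start disabled; without a token, accesses fault
        * with CMI_ERROR_TOKEN. */
       if (0 != CMIFN(ctxt, 10, seg_ctl)(ctxt, seg, CMI_SET_TOKEN, tok))
         return NULL;

       /* Map the proxy segment; let the library choose the address. */
       return CMIFN(ctxt, 10, seg_at)(ctxt, seg, NULL, 0);
      }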


Note -  Segments have node level scope. Exported segments have cluster level scope.

Segment Lifetime

CMI segments have a well defined lifespan. Segments are either explicitly marked for deletion using the seg_ctl() function and specifying the CMI_SEG_RM command, or are implicitly marked for deletion once the process that created the segment using the seg_get() function, or imported it using the seg_imp() function, exits. The segment is implicitly marked for deletion regardless of the manner in which the process exits, which means that abnormal termination of the process should result in all segments created or imported by that process being marked for deletion by the CMI library or the CMI system or driver.

Once a segment has been marked for deletion, the underlying resource is only deleted once the last process has detached from the segment. Attaching to a segment marked for deletion is prohibited, both for currently running processes and for newly spawned ones. However, current attach mappings should remain till processes explicitly detach from the segment. The lifetimes of exported and imported segments are thus symmetric, and they persist till the last process detaches from them.

Access to segments that are marked for deletion differs between the home and remote nodes. Once an exported segment is marked for deletion, all access to it from remote nodes is prohibited. More specifically, any access tokens associated with that segment can be dropped. Any subsequent access to the segment from remote nodes should result in a SIGSEGV exception with si_errno set to CMI_ERROR_TOKEN. For more information, see Delivering CMI Exceptions. Access to the segment by local processes is allowed as they do not require an access token. Once the last local process detaches from the segment, it can be deleted. Any subsequent accesses by remote nodes referring to this segment should result in a SIGSEGV exception with si_errno set to CMI_ERROR_SINVAL as the segment is no longer valid.


Note -  Once a segment has been marked for deletion the CMI platform will not generate any asynchronous error events related to the segment such as CMI_EVENT_STORE_FAILURE.

Note -  Synchronous exceptions on the segment will still be delivered to offending processes. For example, if a process attempts to perform a store to a remote segment that has been marked for deletion and has no access tokens configured, it will encounter a SIGSEGV exception with the appropriate error code.

Note -  A segment ID will not be recycled till all processes have detached from the segment, and the resources backing the segment have been reclaimed.

CMI Access Control

Access to segments is controlled using access tokens. Similar to the CMI context handle, tokens are semi-opaque objects and contain vendor specific state to enforce access control. Access tokens are created on the home node for the segment, and can only be created by the process that created the segment. Access tokens are created using the tok_new() function, and are allocated by the CMI library. Multiple access tokens can be created for a segment and distributed to remote nodes, allowing fine grained access control to the segment. Access tokens can be revoked using the tok_del() function on the home node. Once revoked, the CMI library can deallocate any resources allocated for the token. Tokens can only be revoked by the process that created the access token.

Access to remote segments requires the requesting node to furnish an access token. The home node is required to validate incoming requests using the provided access token. Remote segments imported using the seg_imp() function are created in a disabled state, and need to be programmed with an access token. Access tokens received over the network are programmed with the seg_ctl() function using the CMI_SEG_TOKEN command. Access tokens can only be programmed for remote segments by the process that imported the segment using the seg_imp() function. A home node may revoke and generate new access tokens for a segment, for instance, during cluster reconfiguration. Thus, access tokens for a remote segment may be programmed multiple times using the seg_ctl() function. Programming an access token replaces the previous access token, which means that a remote segment can only have one access token active for it at any time. Access to remote segments, both loads and stores, with mismatched tokens results in an access error or exception being thrown on the requesting node, for example, if the access token was revoked for the node. For more information, see Error Handling.

Access tokens are semi-opaque objects consisting of a base object and vendor specific access control state. For example, on InfiniBand, the access token may contain one or more keys for the segment. The base token object contains the vendor and device ID of the segment. Since access tokens are distributed to remote nodes, the CMI client requires the size of the access token to be available. The CMI_ATTR_TOKEN_SIZE attribute to the attr_get() function can be used to determine the size of the vendor specific access tokens. The CMI vendor library validates that a valid access token is being configured for a remote segment, that is, that the access token vendor and device IDs are compatible with the CMI context on the remote process. If mixed mode or heterogeneous environments are supported by the vendor, then the vendor CMI library interprets the access token fields in an appropriate manner. For example, access tokens created on a little endian architecture may be imported on a big endian architecture.


Note -  The size of an access token may vary depending on the size of the segment. Clients must always determine the size of access tokens configured on a per segment basis using the attr_get() function.

Access tokens have cluster level scope.
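
On the home node, token creation and distribution might look like the following sketch. The cmi_token type and the tok_new() parameter list are assumptions; the CMI_ATTR_TOKEN_SIZE query follows the description above.

      #include <stdlib.h>
      #include <string.h>

      /* Create an access token for 'seg' and copy it into a buffer that can
       * be sent to a remote node. */
      void *
      token_to_blob(cmi_ctxt *ctxt, cmi_seg seg, size_t *lenp)
      {
       size_t     toksz = 0;
       size_t     optlen = sizeof(toksz);
       cmi_token *tok;
       void      *blob;

       /* Token size varies per segment; always query it per segment. */
       if (0 != CMIFN(ctxt, 10, attr_get)(ctxt, seg, CMI_ATTR_TOKEN_SIZE,
                                          (void *) &toksz, &optlen))
         return NULL;

       /* The token object is allocated and owned by the CMI library. */
       tok = CMIFN(ctxt, 10, tok_new)(ctxt, seg);
       if (NULL == tok)
         return NULL;

       blob = malloc(toksz);
       if (NULL != blob)
         memcpy(blob, tok, toksz);      /* ship 'blob' to the remote node */
       *lenp = toksz;
       return blob;
      }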


Extensible Segments

Support for CMI extensible segments is optional. A platform can indicate support for extensible segments by setting the CMI_CAP_EXTENSIBLE_SEGMENTS capability flag. Extensible segments enable applications to adapt to varying access patterns while providing the ability to isolate distinct components. For instance, extensible segments enable applications to support multiple components, each having its own distinct set of segments and access keys. The cumulative size of all segments created can be greater than the available memory, as extensible segments do not allocate all memory up front. As access patterns in the application change, the backing physical memory can be migrated between these components and segments. At any point in time, the total amount of backing physical memory allocated across all segments cannot exceed the available memory on the node. However, each segment can have a variable amount of that memory assigned to it.

Extensible segments provide sbrk() like functionality on a per segment basis. Clients can dynamically grow and shrink the size of a segment, similar to how the data segment of a process can be grown or shrunk using the sbrk() function. Clients must ensure that segments are grown or shrunk in multiples of the protection unit size of the platform, which can be queried using the cmi_ctl() info command. The size of the instantiated segment is called the break point of the segment. Accesses within the instantiated part of the segment are valid. Attempts to access a segment outside the instantiated range, even if within the maximal size of the segment specified during creation, will result in a CMI_ERROR_SEG_BRK access exception. For normal, that is non-extensible, CMI segments, the instantiated size of the segment is equal to the maximum size of the segment specified during creation.

Allocating Extensible Segments

Extensible segments are allocated using the seg_get() function with the CMI_SEG_EXTENSIBLE option. A CMI segment can be created either as an extensible segment or as a normal segment. Once created, a segment cannot dynamically switch between being an extensible and a normal segment. For extensible segments, the size specified during creation is the maximum size of the segment, and cannot change during the lifetime of the segment. On creation, extensible segments do not allocate any backing physical memory. This means that the segment break point is 0. Clients can dynamically size the segment at runtime. For more information, see Dynamically Sizing Extensible Segments.

Exporting or Importing Extensible Segments

Before extensible segments can be accessed from remote nodes, they must be exported and imported in a manner similar to normal segments. Segments can be exported, as well as attached to, without any backing memory being allocated to them. This means that the segment break point does not have to be set before export. On segment attach, by a local or remote process, the maximum size of the segment as specified during the seg_get() function must be mapped into the address space of the process. However, any attempt to access the segment beyond the break point will result in an access exception. This means that a freshly created segment with a break point of 0 cannot be accessed till the owning context has set a break point using the CMI_SEG_BRK command. However, processes can import and/or attach to the segment before the break point has been set. Access control for extensible segments is similar to normal segments, and utilizes tokens that must be associated with the imported segments.

Dynamically Sizing Extensible Segments

It is acceptable to grow or shrink extensible segments before they are exported or attached to. Since all attaches on an extensible segment map the maximum size of the segment into the address space of the caller, the two operations, import or attach, and resizing, are independent of each other. Segments can only be resized on the home node by the owning context, that is, the context used to create the segment. Segments are sized using the seg_ctl() function with the command CMI_SEG_BRK. The absolute size of the break point for the segment is specified during resizing. The break size of the segment must be configured as a multiple of the protection unit size supported by the platform. Attempting to either increase or decrease the segment size by an amount that is not a multiple of the protection unit size will result in failure.

If the size specified is greater than the currently configured break point for the segment, then the client is requesting to grow the segment. All local and remote processes currently attached to the segment should be able to safely access the segment up to the increased break point. The newly instantiated backing memory for the segment must be initialized to 0. Attempts to set the break point beyond the maximal size of the segment specified during creation shall fail with CMI_ERR_INVAL. If a memory reservation is configured for the context, then any newly allocated backing memory should be accounted against the reservation quota. For more information on memory reservations, see Memory Reservation. Conversely, if the break point being set for the segment is smaller than the currently configured break point, then the segment is being shrunk and backing memory for the segment can be released. All local and remote processes currently attached to the segment shall encounter an access exception for any access beyond the newly configured break point. If a memory reservation is configured for the context, then any freed memory should be returned to the reservation quota to be utilized for subsequent allocations.


Note -  Extending or shrinking the segment break point will not change the address map of all processes attached to the segment. Only the address range that is valid for the segment is impacted by the resize operation. The entire address space for the segment as specified by the maximal size of the segment during create is mapped upfront during attach.

It is expected that sizing operations will be infrequent; they can be fairly expensive on some CMI implementations, potentially requiring inter-node communication to update address mappings for all processes. It is possible for the CMI_SEG_BRK command, which is provided by the seg_ctl() function, to wait until the resize operation completes on all nodes that have attached the resized segment. In any case, on return from a successful resize operation, all local and remote processes attached to the resized segment see the updated break point for all subsequent accesses.
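
A sketch of creating and growing an extensible segment follows. The flag and command names are taken from the text above; the way the break size is passed to seg_ctl() is an assumption, and the protection unit size is assumed to have been obtained through the cmi_ctl() info command described earlier.

      /* Create an extensible segment and instantiate its first protection unit. */
      cmi_seg
      make_extensible(cmi_ctxt *ctxt, size_t max_size, size_t prot_unit)
      {
       cmi_seg seg;
       size_t  brk = prot_unit;   /* must be a multiple of the protection unit */

       /* Created with a break point of 0: no backing physical memory yet. */
       seg = CMIFN(ctxt, 10, seg_get)(ctxt, max_size, CMI_SEG_EXTENSIBLE);

       /* Grow the segment; fails with CMI_ERR_INVAL if 'brk' is not a multiple
        * of the protection unit size or exceeds 'max_size'. */
       (void) CMIFN(ctxt, 10, seg_ctl)(ctxt, seg, CMI_SEG_BRK, (void *) &brk);

       return seg;
      }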


Memory Allocation

All CMI objects are allocated by the library, which maintains the ownership of the object. The only exception is the remote segment handle, which is allocated by the client. The size of the remote segment handle is obtained by querying the CMI library with the CMI_ATTR_RSEG_SIZE attribute. CMI clients are guaranteed that objects allocated by the library are valid during the lifetime of the object. CMI clients wishing to maintain control or keep track of memory usage of the CMI library may provide memory allocation callbacks when initializing using the cmi_ini() function. The CMI library must use the provided allocation callbacks when allocating CMI objects. If no allocation callbacks are provided, then the vendor library is free to perform memory management on its own.

Memory Reservation

CMI clients may request a specified amount of memory to be reserved to ensure that subsequent segment allocations are guaranteed to succeed. Clients request a reservation using the cmi_ctl() function with the CMI_CTL_MEM_RESERVE command. Each reservation is identified using an opaque memory reservation key that is allocated by the CMI platform, and returned to the client on a successful memory reservation request. Reservations are long-lived, which means that the underlying reservation can persist even after the context that created it exits.

An external facility, cmiadm, is available on the CMI platform; it lists the currently active reservations and provides the ability to delete them. For more information, see the cmiadm(1m) man page.

Reservations can be shared by one or more CMI contexts by associating the reservation key with the context. The context that creates a reservation is implicitly associated with the reservation, and must not be associated with any other reservation.

For security purposes, memory reservations can only be shared by processes that run with the same effective user ID as the one that created the reservation, and with the PRIV_CMI_ACCESS and PRIV_CMI_OWNER privileges. For more information, see the privileges(5) man page. This prevents denial of service attacks in which a reservation created by a privileged user is hijacked by a normal user to exhaust memory.

Each CMI context can have at most one memory reservation associated with it, using the CMI_CTL_MEM_RESERVE_SET command. It is not required for each context to have a memory reservation associated with it. A context can disassociate a reservation by explicitly setting a reservation with key 0. This means that a reservation key of 0 is a special identifier indicating no reservation, and should not be used by the CMI platform to identify a valid reservation. A context with no reservation may have segment allocation requests fail if sufficient memory resources are not available. However, if a memory reservation is associated with the context, then all subsequent segment allocation requests must succeed up to the amount of reserved memory. It is permissible for a context to request allocation of memory beyond the reserved amount; however, there are no guarantees by the platform that these allocations will succeed.

Clients specify the amount of memory to reserve using the CMI_CTL_MEM_RESERVE command. The amount of memory to reserve is specified in ctl_cfg_mem_reserve. If the context is not associated with any memory reservation, then a new memory reservation of the specified size is requested; otherwise the request modifies the reservation associated with the calling context. For new reservations, the key is returned in rsv_key, and the reservation is implicitly associated with the calling context. Clients can modify a previously associated reservation by increasing or decreasing the reservation amount. If the memory currently allocated against the reservation is greater than the reduced reservation request, then the request will fail with the CMI_ERR_INVAL error. Similarly, if the requested memory reservation exceeds the available CMI memory, then the request will fail with the CMI_ERR_NOMEM error. In all cases, on return from the cmi_ctl() function, the memory currently allocated against the reservation is returned in ctl_cfg_mem_allocd.
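
A reservation request might be issued as sketched below. The layout of the control argument is hypothetical; the field names ctl_cfg_mem_reserve, ctl_cfg_mem_allocd, and rsv_key follow the text above, and the cmi_ctl() parameter list is an assumption.

      #include <stdint.h>

      /* Hypothetical control argument for CMI_CTL_MEM_RESERVE. */
      typedef struct {
       uint64_t ctl_cfg_mem_reserve;   /* bytes to reserve (in)           */
       uint64_t ctl_cfg_mem_allocd;    /* bytes currently allocated (out) */
       uint64_t rsv_key;               /* reservation key (out, new rsv)  */
      } cmi_mem_rsv_cfg;

      /* Create (or resize) the reservation associated with this context. */
      int
      reserve_memory(cmi_ctxt *ctxt, uint64_t nbytes, uint64_t *keyp)
      {
       cmi_mem_rsv_cfg cfg = { 0 };

       cfg.ctl_cfg_mem_reserve = nbytes;

       /* Fails with CMI_ERR_NOMEM if the request exceeds available CMI memory. */
       if (0 != cmi_ctl(ctxt, CMI_CTL_MEM_RESERVE, (void *) &cfg))
         return -1;

       *keyp = cfg.rsv_key;            /* reservation now associated with ctxt */
       return 0;
      }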

A previously created memory reservation can be deleted using the CMI_CTL_MEM_RESERVE_DEL command. Any contexts or segments currently associated with the deleted reservation continue to be valid. However, any subsequent segment allocations do not use the deleted reservation. Any attempt to reference the deleted reservation using its key, for example attempting to associate a context with the deleted reservation, will fail with the CMI_ERR_INVAL error.


Note -  The CMI memory allocated for a reservation persists till the reservation is explicitly marked for deletion using the CMI_CTL_MEM_RESERVE_DEL command, and all segments allocated using the reservation are deleted.

Error Handling

This section enumerates the various error conditions that may be encountered by a CMI client during the normal course of operations, and how they are handled. A client can encounter errors while invoking CMI library functions. These errors usually arise due to invalid use of the API, or when resource exhaustion or system defined limits are hit. Clients can also encounter errors when accessing remote segments over the network. Remote access errors can broadly be categorised into the following categories:

Access control

Misconfigured or revoked access tokens can result in access errors or exceptions on the remote node. These errors are handled by the CMI client directly.

Node death

Accessing segments whose home nodes are dead results in access errors or exceptions. Node death impacts exported and imported segments differently, and needs to be handled differently.

Network error

Transient network errors can occur if the CMI network is undergoing reconfiguration, for example, during a link failure requiring failover to another path, or when temporarily encountering link or network buffer errors. An access error or exception is thrown by the CMI system and handled by the CMI client. The client may dismiss the exception and attempt to retry the operation immediately or after some arbitrary wait interval.

Clients need to be able to distinguish between these exceptions and take appropriate actions. The error handling infrastructure distinguishes between the various error types. Errors encountered during synchronous operations are handled differently from errors detected for asynchronous operations. For more information, see Access Errors and Exceptions. Finally, some error conditions may require cooperation between the CMI client, the CMI library, and the CMI system. These errors are handled using asynchronous event notifications, which are described in detail in the section CMI Events.

CMI API

Most routines in the API return 0 on success or -1 on failure. Routines that return an object will return NULL on failure. A thread can obtain the underlying reason for the failure by invoking the cmi_get_error() function. For multithreaded processes, the CMI library tracks errors on a per-thread basis. The value returned by the cmi_get_error() function is the last error encountered by the thread. Successful calls to the CMI library by the thread may clear the error code for the thread. CMI clients are encouraged to call the cmi_get_error() function on detection of a failure to obtain the underlying cause, as subsequent calls may reset the error code.

The cmi_get_error() function is the only CMI function that can be called by a thread without an initialized CMI context handle. This allows a thread to obtain the error in case CMI library initialization fails.
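
The usual pattern is to capture the error immediately after a failed call, as sketched below for a segment detach.

      #include <stdio.h>

      /* Capture the per-thread error code right away, since a later successful
       * CMI call may clear it. */
      int
      detach_or_report(cmi_ctxt *ctxt, cmi_seg seg, void *addr)
      {
       if (0 != CMIFN(ctxt, 10, seg_dt)(ctxt, seg, addr))
       {
         int err = cmi_get_error(ctxt);   /* read before any other CMI call */

         fprintf(stderr, "seg_dt failed with error %d\n", err);
         return -1;
       }
       return 0;
      }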

Access Errors and Exceptions

A CMI application may encounter errors during the normal course of operation when accessing remote segments. For instance, access to remote segments may have been revoked during a cluster reconfiguration, or the home node for the segment may have died. Synchronous operations on these segments by the application, such as loads, stores to write through segments, or an atomic CAS using the atm_cas command, result in an exception being thrown that must be caught by the CMI application. The CMI application will install an exception frame around all remote segment accesses and perform appropriate cleanup. The exception is thrown in the context of the calling thread accessing the offending segment. The rest of the processes, and the other threads within the process, should continue normal operation.

A thread may get an access exception during synchronous operations. Local or remote access exceptions can be triggered under the following scenarios:

Segmentation Violation

The thread accessed an illegal or protected (for example, via mprotect()) address. The address in question may not reside within a CMI segment, and hence is not directly related to CMI. However, the calling thread must be able to distinguish this error from CMI related errors.

Consistency Error

The accessed address is in an inconsistent state and the recovery operation, CMI_SEG_RECO, has not been performed. This is only possible for clients that do not implement their own consistency protocol (CMI_CTL_CLIENT_CONSIST).

Transient network error

A synchronous load/store could not be completed due to a transient error condition in the network.

Remote node death

The home node of the segment may be dead. This error needs to be distinguishable from a transient network error, as the client may initiate different recovery procedures.

Access token mismatch

The access token for the imported segment has been revoked by the home node. For example, this could occur as part of a cluster reconfiguration. This error should be distinguishable from the Segmentation Violation error above, as the client may initiate different recovery procedures and attempt to obtain and configure a new access token for the segment.

For transient network error exceptions in general, the client will dismiss the exception and attempt to reissue the offending operation after an arbitrary wait interval. Some CMI implementations may be able to reconfigure the network in a responsive manner, while others may have a more heavyweight reconfiguration procedure. Trapping an exception and retrying the operation repeatedly can result in unnecessary overhead, especially if the reconfiguration of the network is fairly lightweight. To accommodate varying implementations and CMI client semantics, a client may specify a network reconfiguration timeout with the cmi_ctl() function using the command CMI_CTL_RECONF_TOUT. The timeout specifies the maximum time an operation may be stalled by network reconfiguration before an access exception is thrown. It is permissible for an access exception to be thrown before the timeout has expired. The parameter is an indication of how long a client is willing to wait for operations stalled due to network reconfiguration, so that error recovery and its overhead can be minimized. If the client does not specify a network timeout, then the CMI implementation is free to select an appropriate interval after which to generate access exceptions on network reconfiguration.

A remote node death exception requires the CMI system and the CMI library to also deliver an asynchronous error event to the process on the node that imported the segment. This is in addition to the exception thrown in the context of the executing thread. Only a single asynchronous error per segment needs to be issued till the segment has undergone recovery as outlined in CMI Events. Access exceptions should still continue to be generated for any thread that continues to access the segment.

A process may encounter errors during asynchronous operations, such as the CMI system attempting to write back cached data to a segment. When an error is encountered during an asynchronous operation, the CMI subsystem will generate asynchronous CMI Events on the calling node indicating the failure to communicate with the remote node. The offending segment, and potentially the address within the segment, is returned as part of the event data. Once an event notification has been generated for a given segment or node, no further events are generated for the segment till the CMI client manipulates the CMI segment, such as by installing new access tokens. Any processes or threads attempting to access the segment should still encounter exceptions for synchronous operations.

Delivering CMI Exceptions

All exceptions are delivered to CMI applications using signals, namely the SIGSEGV signal. CMI processes wishing to handle exceptions must install a SIGSEGV signal handler. If a process does not have a SIGSEGV handler installed, then the default action for this signal, which is usually killing the process, is taken on the platform.

Since the SIGSEGV signal can be generated in the normal course of operation while accessing non-CMI memory, the process needs to be able to distinguish the underlying cause of the signal. A new si_code (SEGV_CMI) is added to indicate that the SIGSEGV occurred on a CMI memory access. The si_errno field in the siginfo parameter of the sigaction handler is used to deliver additional information regarding the CMI access exception. The address triggering the CMI exception is available in the si_addr field of the siginfo parameter. A new si_id field (an alias for si_trapno) in the siginfo structure is defined, which contains the ID of the segment containing si_addr. See siginfo.h(3HEAD) for more details.
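
A skeleton handler using these fields is sketched below. The recovery actions are placeholders, and the diagnostic fprintf() is not async-signal-safe; it is shown only for illustration.

      #include <signal.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      /* Distinguish CMI access exceptions from ordinary segmentation faults. */
      void
      cmi_segv_handler(int sig, siginfo_t *sip, void *uap)
      {
       if (SEGV_CMI != sip->si_code)
       {
         /* Not a CMI fault: restore the default disposition and re-raise. */
         signal(SIGSEGV, SIG_DFL);
         raise(SIGSEGV);
         return;
       }

       /* si_errno holds the CMI error code, si_addr the faulting address,
        * and si_id (an alias for si_trapno) the ID of the segment. */
       fprintf(stderr, "CMI fault: err=%d addr=%p seg=%d\n",
               sip->si_errno, sip->si_addr, (int) sip->si_trapno);

       switch (sip->si_errno)
       {
       case CMI_ERROR_TRANSIENT:
         /* the client may retry the operation after a wait interval */
         break;
       case CMI_ERROR_TOKEN:
         /* obtain and program a new access token for the segment */
         break;
       default:
         abort();
       }
      }

      /* Installation, typically during initialization. */
      void
      install_cmi_handler(void)
      {
       struct sigaction sa;

       memset(&sa, 0, sizeof(sa));
       sa.sa_sigaction = cmi_segv_handler;
       sa.sa_flags = SA_SIGINFO;
       (void) sigaction(SIGSEGV, &sa, NULL);
      }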

CMI processes will examine the si_errno field to determine the underlying cause of the SIGSEGV exception. The following CMI error codes are currently defined:

CMI_ERROR_ENABLE

Attempt to access CMI segment before enabling using the cmi_enb() function. The address of the offending segment is available in si_addr. si_id contains the ID of the segment, cmi_seg being accessed. si_code is set to SEGV_CMI.

CMI_ERROR_TOKEN

Access token error. Either mismatched or no tokens configured for segment. The address of the offending segment is available in si_addr. si_id contains the ID of the segment, cmi_seg being accessed. si_code is set to SEGV_CMI.

CMI_ERROR_ACCESS

Insufficient privileges to access segment. For example, attempting to write to a read only segment. The address of the offending segment is available in si_addr. si_id contains the ID of the segment, cmi_seg being accessed. si_code is set to SEGV_CMI.

CMI_ERROR_SINVAL

This error should only be generated when accessing imported segments and indicates that the exporting segment is no longer valid. This can be due to the exporting segment being deallocated or the home node for the segment being dead. The address of the offending segment is available in si_addr. si_id contains the ID of the segment, cmi_seg being accessed. si_code is set to SEGV_CMI.

CMI_ERROR_UE

This error code is used to indicate that an unhandled exception (UE) condition occurred while accessing a remote segment, which means that the UE occurred in memory on the home node. The address of the offending segment is available in si_addr. si_id contains the ID of the segment, cmi_seg being accessed. si_code is set to SEGV_CMI.

CMI_ERROR_TRANSIENT

A transient error condition has occurred such as a network undergoing reconfiguration. The operation may be retried. The address of the offending segment is available in si_addr. si_id contains the ID of the segment, cmi_seg being accessed. si_code is set to SEGV_CMI.

CMI_ERROR_SEG_BRK

This error code is used to indicate that an attempt was made to access beyond a valid segment break point. This error can only be generated for extensible segments where the valid/accessible size of the segment can be changed dynamically. For more information, see Extensible Segments. The address of the offending segment is available in si_addr. si_id contains the ID of the segment, cmi_seg being accessed. si_code is set to SEGV_CMI.

CMI_ERROR_CONSIST

This error code can be generated on both the home and remote node when accessing memory that may be in an inconsistent state due to a node or network failure of some sort. For more information, see Consistent Data. The address of the offending segment is available in si_addr. si_id contains the ID of the segment, cmi_seg being accessed. si_code is set to SEGV_CMI.

A UE is handled differently on home and remote nodes. If a UE is encountered on the home node within a CMI segment, then all processes attached to the CMI segment are killed; the CMI platform will not generate any signal in this case. If the UE is encountered by a process in process private memory, then only the process encountering the error is killed. To prevent UEs from cascading across the network, a remote process that encounters a UE while accessing a remote segment receives a SIGSEGV with error code CMI_ERROR_UE; the remote process SHOULD NOT consume any potentially corrupted data, nor is it killed. CMI clients will take appropriate recovery action in this situation, which may include, but is not limited to, detaching from the segment that encountered the UE.

CMI Events

During normal operation, the CMI system may encounter various forms of errors that either require cooperation from CMI clients, or for which providing sufficient information about the error event could aid the CMI clients in their recovery protocols. Synchronous operations that result in an error throw an exception that the client can catch and process. For more information, see Access Errors and Exceptions. Asynchronous error notification between the CMI system and CMI clients is accomplished using asynchronous event notifications. This section covers how the CMI subsystem and a CMI client cooperate in addressing various error conditions that may arise.

Processes that may create segments using the seg_get() function, or import remote segments using the seg_imp() function, must process CMI events in a timely fashion. This is required since certain error conditions require the cooperation of the client and the CMI system to recover the system and program state. The CMI subsystem allocates a CMI event object with data relevant to the event requiring processing. The notification event is queued to the CMI context and retrieved by the process using the evt_get() function.

Some error conditions that require cooperation between the CMI client and the CMI system to resolve may take an unbounded amount of time; for instance, they may require the CMI client to interact with processes on remote nodes. The event notification infrastructure therefore provides an asynchronous completion mechanism. Once the process has finished processing an event, the event is returned to the CMI library by invoking the evt_ret() function. The CMI library should not deallocate the event object till the event is returned using the evt_ret() function. The status of the request is returned with the event. The following event processing statuses are defined (a minimal retrieval loop is sketched after the list):

CMI_EVENT_RET_DONE

The processing of the event completed successfully. The CMI library can deallocate the event object.

CMI_EVENT_RET_FAILED

Event handling encountered an error while processing the event. Depending on the type of the event, the process may be aborted. Some events may require successful resolution before operation can continue.
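
A minimal retrieval loop is shown below. The cmi_event type, its type field, the convention that evt_get() returns NULL when no event is pending, and the evt_get()/evt_ret() parameter lists are all assumptions, as this page does not specify them.

      /* Retrieve and return CMI events for a process that exports or imports
       * segments. */
      void
      process_events(cmi_ctxt *ctxt)
      {
       cmi_event *evt;

       while (NULL != (evt = CMIFN(ctxt, 10, evt_get)(ctxt)))
       {
         int status = CMI_EVENT_RET_DONE;

         switch (evt->type)
         {
         case CMI_EVENT_QUIESCE:
           /* cease all operations on the segment named in the event */
           break;
         case CMI_EVENT_RESUME:
           /* previously quiesced segment may be accessed again */
           break;
         case CMI_EVENT_HCTXT_DOWN:
           /* detach imported segments exported by the dead context */
           break;
         default:
           status = CMI_EVENT_RET_FAILED;
           break;
         }

         /* Return the event so the library can deallocate it. */
         (void) CMIFN(ctxt, 10, evt_ret)(ctxt, evt, status);
       }
      }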

The following CMI events are defined:

CMI_EVENT_QUIESCE

This event requests the CMI client to cease all operations on the specified segment. The event is delivered in the context of the process that created the segment. The CMI system will require coordination with all nodes that have imported the affected segment. Either the CMI client can ensure that all operations to the segment are quiesced, or the CMI system can invoke the QUIESCE operation on all nodes that have imported the segment, in which case the quiesce event is generated in the context of the process that imported the segment on each node. Once all operations have ceased, the quiesce event is returned to the CMI system. The underlying condition can then be resolved.

  • Multiple QUIESCE events delivered for the SAME segment are no-ops if there are no intervening resume events for the segment. This means that QUIESCE events are NOT reference counted. A single resume event enables access to the segment regardless of the number of QUIESCE events.

  • A resume or home/remote node down event for the specified segment should be generated for correct operation. It is acceptable for segments that are QUIESCED, or in the process of being QUIESCED, to be included in home or remote node down events. On nodes where the home node down event is generated for the segment, that is, remote nodes that have attached to the segment in question, a resume event is NOT required, since the home node down event implicitly marks the segment as unavailable. In this case the client will undertake appropriate recovery procedures, such as detaching and deleting the remote segment.


Note -  Segments that have been, or are in the process of being, quiesced always generate a resume event for the segment on the home node of the segment, even after a remote node down event has been delivered for the segment on that node. The remote node down event just notifies the home node of the segments owned by it that were remotely attached and might require recovery to maintain consistency.

From the client's perspective there is no implied ordering between QUIESCE/RESUME/REMOTE|HOME_NODE_DOWN events. However, if the CMI implementation has ordering constraints, then it should generate the events in an appropriate order. For example, consider the following sequence of events:

  • Segment A QUIESCE event generated. Client starts Quiesce protocol across nodes.

  • Quiesce procedure completes. Client returns the QUIESCE event with success.

  • Remote node X, which has segment A attached, goes down. A REMOTE_NODE_DOWN event is generated for segment A.

  • Client starts attempting recovery on cache lines that may be in flux for segment A. This recovery procedure may involve modifying memory for segment A.

  • Segment A RESUME event generated. Client starts resume protocol across nodes.

  • Resume protocol completes. Client returns RESUME event with success.


If the CMI implementation does not support modifying memory of a quiesced segment during recovery, for instance because the underlying page mappings are not set up, then it can reorder the issuing of the REMOTE_NODE_DOWN and RESUME events, or internally make sure the mappings for the segment are set up before generating the REMOTE_NODE_DOWN event. However, from the client's perspective, either sequence is acceptable and correct.


Note -  The quiesce event can be generated on all CMI nodes, though this is not required. A quiesce event must be generated on the home node of the segment that is being quiesced.

CMI_EVENT_RESUME

This event is generated to indicate to the CMI client that previously quiesced operations on a segment can now be resumed, as the underlying error condition has been resolved. The CMI client can then notify remote nodes that access to the segment can resume. Alternatively, the CMI system can generate the resume event on all nodes that had imported the segment to indicate that the segment is ready for access.


Note -  While a segment is awaiting quiesce and before it is resumed, all segment attach operations using the seg_at() function, on either the home node or on remote nodes for the segment, fail with error CMI_ERR_RECONFIG.

This event is always generated on the home node for a previously quiesced segment. If the specified segment was not previously quiesced, that is, there was no CMI_EVENT_QUIESCE for the segment, the generated CMI_EVENT_RESUME event has no effect.


CMI_EVENT_RCTXT_DOWN

This event is generated to indicate the death of a remote CMI context. Remote context death may have been detected due to some underlying reliability protocol such as heartbeats, or because one of the local processes on the node encountered an access exception. The remote context death event contains all the segments exported by the context receiving this event that had been imported by the remote context that exited. Addresses in these segments may be in an undefined state as only partial stores may have been reflected. The event handler can then initiate recovery for the affected segments in an orderly fashion. Depending on the segment consistency method employed, client or CMI system provided consistency, the affected addresses in the segment may not be available for access till explicit recovery is initiated. See Consistent Data for more details.


Note -  This notification event contains segments exported by the context receiving this event, which may be affected by the death of the remote context.

The client may initiate recovery operations on the specified segments in response to this event. The CMI system ensures that it is safe for the client to perform recovery on the segments before generating this event. For example, if the segments in question are quiesced due to some underlying issue, then they should be ready for recovery before this event is generated.


CMI_EVENT_HCTXT_DOWN

Similar to CMI_EVENT_RCTXT_DOWN, this event is generated to indicate the death of a remote CMI context. However, in this case the event data contains the list of segments imported by the context receiving this event. If the CMI client does not take any action, then all subsequent loads and stores to those segments will result in access exceptions being thrown. The usual sequence of operation is to detach all affected segments and start the reconfiguration protocol so the affected segments can be reallocated and mapped.


Note -  This notification event contains segments imported by the context, which may be affected by the death of a remote context. That is, the imported segments had been exported by the dead context.

If a QUIESCE event was generated for a remotely attached segment that this event refers to, that is, the exporting context for the segment has exited, then a RESUME event is not required to be generated for the segment. Otherwise, a RESUME event needs to be generated for ALL QUIESCE events.


CMI_EVENT_STORE_FAILURE

This event is generated when the CMI platform encounters an unrecoverable store failure that may have resulted in the loss or corruption of data. Store failures in general are fatal and need to be indicated to the application in a timely manner. Asynchronous store errors can occur during normal operation as cache lines are evicted from a remote node and written back to the home node. Any errors encountered during this operation will result in a CMI_EVENT_STORE_FAILURE event being delivered to the process that imported the affected segment on the remote node. Synchronous store failures, such as errors encountered during the flush_fb() function, should return a CMI_ERR_STORE error to the caller. For synchronous stores that do not have an associated function or the ability to return an error code, such as stores to uncached segments, a signal is generated for the caller. For synchronous store failures, the platform may also deliver an asynchronous CMI_EVENT_STORE_FAILURE event to the importing process. For asynchronous store failures, however, the importing process is always notified through a store failure event.

Asynchronous store failure events are delivered using cmi_einfo_serr. Multiple store failures for a given segment can be delivered in a single event. einfo_naddrs is the number of addresses within einfo_seg that encountered a store failure. Each address encountering a store failure is returned in the einfo_addr array, which is of size einfo_naddrs. Platforms that cannot deliver precise address information set einfo_naddrs to 0, but still deliver the store failure event with einfo_seg set to the segment encountering the error; that is, at a minimum, platforms indicate failure on a per segment basis.

CMI_EVENT_HW

Similar to the store failure event CMI_EVENT_STORE_FAILURE, this event is generated if stores are lost due to a local hardware failure. This allows the CMI platform to distinguish the underlying cause of the store failures, as clients may undertake different recovery semantics. For example, if the CMI application is running in a cluster, then on store failures due to a local hardware fault the local node can be shut down and evicted from the cluster immediately. If, however, a generic store failure event due to a network outage is detected, then more involved eviction algorithms may be employed to determine which node in the cluster to evict.

CMI_EVENT_CMAP

This event is generated in response to the CMI_CTL_NODE_CMAP_GET command. For each command request, the CMI platform will generate a CMI_EVENT_CMAP event that indicates the remote nodes that are DISCONNECTED from the perspective of the requesting context. A given context is only interested in the connectivity map for nodes that have either imported segments exported by the context, or from which the context has imported segments. Clients provide a 64-bit opaque request ID when requesting a connectivity map through the cmi_ctl() function. This request ID is returned in the corresponding CMI_EVENT_CMAP event and allows the client to associate request state with the response. The CMI platform will not interpret the value of this request ID in any way.

It is possible for the CMI platform to generate connectivity map events on its own, for example when a network partition is detected. These events are referred to as unsolicited connectivity map events, and are distinguishable from solicited connectivity map events generated in response to client requests. All unsolicited connectivity map events contain a request ID of 0. Clients cannot use a request ID of 0 for solicited events.


Note -  Even if the CMI platform has generated an unsolicited connectivity map event, it also generates a solicited event for each outstanding request.

Event Ordering

Certain failure cases may lead to multiple events being generated. For example, in response to a network partition the CMI platform can generate a CMI_EVENT_CMAP event indicating the CMI nodes that are disconnected. In addition to the disconnected map, the CMI platform can generate CMI_EVENT_HCTXT_DOWN/CMI_EVENT_RCTXT_DOWN events for all CMI processes residing on the disconnected node. There is an implied ordering for these events, which the CMI platform enforces, as clients use the implied order to determine the exact cause of the underlying failure. CMI platforms generate events in the following order, highest priority first:

CMI_EVENT_HW

This is the highest priority event and indicates a malfunction in the local CMI platform hardware. This event will be generated before any other events that may have downstream effects. For example, if the node gets disconnected from the fabric due to a hardware failure, then the CMI platform may generate a CMI_EVENT_CMAP event indicating all the remote CMI nodes that are disconnected with regard to this node, along with the appropriate CMI node down events. The hardware event always precedes the generation of these node down events as it is the underlying cause of the disconnection from the fabric.

CMI_EVENT_CMAP

This event is generated whenever there is a change in the physical connectivity of the fabric, and identifies all the remote CMI contexts that are no longer reachable. These contexts must have either exported a segment to, or imported a segment from, the context that is in receipt of this event. If the contexts are no longer reachable due to a local hardware error, then a CMI_EVENT_HW must have been generated prior to this event. If a hardware error event is not generated before retrieval of a CMAP event, then the cause of the disconnection is network partitioning, which can be due to a bad cable or switch, or a hardware error on the remote node.

CMI_EVENT_RCTXT_DOWN/CMI_EVENT_HCTXT_DOWN

This event identifies a CMI context that has terminated or is no longer reachable, and that had exported segments to, or imported segments from, a process that is in receipt of this event. If the context is no longer reachable because the physical network has been partitioned, then a CMI_EVENT_CMAP event for it must have been generated first. If a CMAP event is not generated before retrieval of a context down event, then it implies that the context has exited but the connectivity between the nodes is valid.


Note -  Generation of the context down events, CMI_EVENT_RCTXT_DOWN and CMI_EVENT_HCTXT_DOWN, is optional; it depends on the vendor library whether these events are generated or not. However, the generation of the other events, CMI_EVENT_HW and CMI_EVENT_CMAP, is guaranteed. For the CMAP event, unsolicited events are optional; however, the solicited CMAP event is always generated.

Consistent Data

Consistency of data can be maintained by various layers of the CMI stack. At the most basic level, the CMI implementation provides coherency at the cache line level across nodes. Applications themselves may utilize higher level synchronization primitives such as CAS to maintain consistent access to objects from multiple nodes. Additionally, client specific recovery structures and algorithms may be available that ensure that client data can remain consistent in face of errors provided a recovery stage is initiated after failure is detected.

Consistency of data is only pertinent on the home node when a remote node dies while accessing an exported segment. The CMI system may be inconsistent in the sense that cache lines are in a flux or undetermined state in order to allow the coherency protocol to continue in a safe manner; for example, cache lines in the modified or exclusive state on the remote node may not have been written back before the death of the node. The CMI system may require the client to initiate recovery of affected data to a consistent state. Alternatively, if the client application is running a higher level protocol that guarantees consistency of data, for example, each object is protected with a latch that prevents access to stale or inconsistent data, then the CMI system can assume that cache lines are stable across error boundaries.

The default CMI consistency mode is inconsistent, which means that clients do not need to support application specific protocols to guarantee data consistency across failure boundaries. To support both modes of operation, CMI clients need to specify if they are running a higher level protocol that guarantees data consistency. The CMI system can be configured using the cmi_ctl() function with the CMI_CTL_CLIENT_CONSIST command to indicate that the client implements consistent data protocols. The CMI system can then assume that cache lines have stable data across error boundaries and can be accessed freely by remote nodes, subject to access control. Clients can also specify the consistency mode on a per segment basis when creating the segment using the seg_get() function, with the CMI_SEG_CLIENT_CONSIST and CMI_SEG_CLIENT_INCONSIST flags. This overrides the default consistency mode configured for the platform and allows clients to support mixed consistency modes. On platforms that do not support mixed consistency modes, attempts to create segments with a conflicting mode can fail with the CMI_ERR_NOTSUPP error.

Since the default mode of operation is that data is not guaranteed to be consistent, the CMI system requires explicit recovery by the client. On detection of a remote context death, or CMI_EVENT_RCTXT_DOWN, the CMI system can mark all cache lines in an undetermined state as unstable. Any access by local processes to the affected address range will result in an access exception. For more information, see Access Errors and Exceptions. Remote nodes attempting to access addresses in flux may get an access exception, or the operation may stall till the address is made consistent. CMI clients must explicitly perform recovery using the CMI_SEG_CHECK and CMI_SEG_RECO commands to the seg_ctl() function. Clients can check which address ranges of the affected segment require recovery using the CMI_SEG_CHECK command. This command returns the lowest address range being checked that requires recovery. Affected address ranges that have not had recovery performed on them will result in access exceptions.

Recovery for inconsistent segments can only be performed on the home node. The process that created and exported the segment with inconsistent data is the only process that can perform recovery on the segment. Any attempt to access data in an inconsistent state will result in a CMI_ERROR_CONSIST error.
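
On the home node, the recovery loop might be structured as in the following sketch. The range structure passed to CMI_SEG_CHECK and CMI_SEG_RECO, and the convention that CMI_SEG_CHECK fails once no ranges remain, are assumptions.

      /* Hypothetical address range reported by CMI_SEG_CHECK. */
      typedef struct {
       void *lo;
       void *hi;
      } cmi_addr_range;

      /* Recover an exported segment after a remote context death. */
      void
      recover_segment(cmi_ctxt *ctxt, cmi_seg seg)
      {
       cmi_addr_range range;

       /* Walk the address ranges that still require recovery. */
       while (0 == CMIFN(ctxt, 10, seg_ctl)(ctxt, seg, CMI_SEG_CHECK, &range))
       {
         /* Restore application-level consistency for [range.lo, range.hi),
          * then mark the range recovered so accesses stop faulting with
          * CMI_ERROR_CONSIST. */
         (void) CMIFN(ctxt, 10, seg_ctl)(ctxt, seg, CMI_SEG_RECO, &range);
       }
      }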


Note -  Accesses to regions in the segment that are not in flux succeed at all times.

Examples

Sample Code

Initialize the CMI library and allocate, attach, and export a segment:

      /* Requires the CMI header provided by the vendor library (name not
       * specified in this page) in addition to <stdio.h>. */
      #include <stdio.h>

      #define CMI_SEG_SIZE (1024 * 1024 * 32)

      int cmi_test()
      {
       cmi_ctxt   *ctxt;
       cmi_seg     seg;
       void       *ataddr = (void *) 0x100000000;
       void       *addr;
       size_t      rsegsz = 0;
       size_t      optlen;
       cmi_rseg   *rseg;

       /* Initialize CMI library - no callback handlers */
       ctxt = cmi_ini(CMI_VERNO, NULL);
       if (NULL == ctxt)
       {
         fprintf(stdout, "Ctxt initialization failed with error %d\n",
                 cmi_get_error(NULL));
         return -1;
       }

       /* Allocate CMI segment and lock it */
       seg = CMIFN(ctxt, 10, seg_get)(ctxt, CMI_SEG_SIZE, CMI_SEG_LOCK);

       /* Attach segment */
       addr = CMIFN(ctxt, 10, seg_at)(ctxt, seg, ataddr, 0);
       if (NULL == addr)
       {
         fprintf(stdout, "Segment attach failed with error %d\n",
                 cmi_get_error(ctxt));
         /* Deallocate/finalize context */
         CMIFN(ctxt, 10, fini)(ctxt);
         return -1;
       }

       /* Get remote segment handle to export - query the remote segment
        * attribute and allocate a remote segment handle of sufficient size. */
       optlen = sizeof(rsegsz);
       if (0 != CMIFN(ctxt, 10, attr_get)(ctxt, seg, CMI_ATTR_RSEG_SIZE,
                                          (void *) &rsegsz, &optlen))
       {
         fprintf(stdout, "Attribute query CMI_ATTR_RSEG_SIZE failed %d\n",
                 cmi_get_error(ctxt));
         /* Detach segment */
         CMIFN(ctxt, 10, seg_dt)(ctxt, seg, addr);

         /* Deallocate/finalize context */
         CMIFN(ctxt, 10, fini)(ctxt);
         return -1;
       }

       /* Obtain remote segment handle (export segment) */
       rseg = (cmi_rseg *) CMIFN(ctxt, 10, seg_exp)(ctxt, seg);
       if (NULL == rseg)
       {
         fprintf(stdout, "Unable to export segment. Error %d\n",
                 cmi_get_error(ctxt));

         /* Detach segment */
         CMIFN(ctxt, 10, seg_dt)(ctxt, seg, addr);

         /* Deallocate/finalize context */
         CMIFN(ctxt, 10, fini)(ctxt);
         return -1;
       }

       /* Distribute the remote segment handle to peers over the network. */
       return 0;
      }

Attributes

See attributes(5) for descriptions of the following attributes:

ATTRIBUTE TYPE          ATTRIBUTE VALUE
Interface Stability     Uncommitted
Availability            system/cmi

See Also

isainfo(1), pmap(1), cmiadm(1m), Intro(2), siginfo.h(3HEAD), attributes(5), privileges(5)