6 Persisting Caches

The Coherence persistence feature is used to save a cache to disk and ensures that cache data can always be recovered.

This chapter includes the following sections:

Overview of Persistence

Coherence persistence is a set of tools and technologies that manage the persistence and recovery of Coherence distributed caches. Cached data is persisted so that it can be quickly recovered after a catastrophic failure or after a cluster restart due to planned maintenance. Persistence and federated caching can be used together as required. See Federating Caches Across Clusters.

This section includes the following topics:

Persistence Modes

Persistence can operate in two modes:

  • On-Demand persistence mode – a cache service is manually persisted and recovered upon request using the persistence coordinator. The persistence coordinator is exposed as an MBean interface that provides operations for creating, archiving, and recovering snapshots of a cache service.

  • Active persistence mode – In this mode, cache contents are automatically persisted on all mutations and are automatically recovered on cluster/service startup. The persistence coordinator can still be used in active persistence mode to perform on-demand snapshots.

Disk-Based Persistence Storage

Persistence uses a database for the persistence store. The database is used to store the backing map partitions of a partitioned service. The locations of the database files can be stored on the local disk of each cache server or on a shared disk: Storage Area Network (SAN) or Network File System (NFS). See Plan for SAN/NFS Persistence Storage.

Note:

Database files should never be manually edited. Editing the database files can lead to persistence errors.

The local disk option allows each cluster member to access persisted data for the service partitions that it owns. Persistence is coordinated across all storage members using a list of cache server host addresses. The address list ensures that all persisted partitions are discovered during recovery. Local disk storage provides a high throughput and low latency storage mechanism; however, a partitioned service must still rely on in-memory backup (backup-count value greater than zero) to remain machine safe.

The shared disk option, together with active persistence mode, allows each cluster member to access persisted data for all service partitions. An advantage to using a shared disk is that partitioned services do not require in-memory backup (backup-count value can be equal to zero) to remain machine safe, because all storage-enabled members can recover partitions from the shared storage. Disabling in-memory backup increases the cache capacity of the cluster at the cost of higher latency recovery during node failure. In general, the use of a shared disk can potentially affect throughput and latencies and should be tested and monitored accordingly.

Note:

The service statusHA statistic shows an ENDANGERED status when the backup count is set to zero even if persistence is being used to replace in-memory backup.

Both the local disk and shared disk approach can rely on a quorum policy that controls how many cluster members must be present to perform persistence operations and before recovery can begin. Quorum policies allow time for a cluster to start before data recovery begins.

Persistence Configuration

Persistence is declaratively configured using Coherence configuration files and requires no changes to application code. An operational override file is used to configure the underlying persistence implementation if the default settings are not acceptable. A cache configuration file is used to set persistence properties on a distributed cache.

Management and Monitoring

Persistence can be monitored and managed using MBean attributes and operations. Persistence operations such as creating and archiving snapshots are performed using the PersistenceManagerMBean MBean. Persistence attributes are included as part of the attributes of a service and can be viewed using the ServiceMBean MBean.

Persistence attributes and statistics are aggregated in the persistence and persistence-details reports. Persistence statistics are also aggregated in the VisualVM plug-in. Both tools can help troubleshoot possible resource and performance issues.

Persistence Dependencies

Persistence is only available for distributed caches and requires the use of a centralized partition assignment strategy.

Note:

Transactional caches do not support persistence.

Distributed caches use a centralized partition assignment strategy by default. Although uncommon, it is possible that an autonomous or a custom partition assignment strategy is being used. Check the StrategyName attribute on the PartitionAssignment MBean to verify the strategy that is currently configured for a distributed cache. See Changing the Partition Distribution Strategy in Developing Applications with Oracle Coherence.
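
For example, the following sketch reads the StrategyName attribute from within a cluster member. It assumes JMX management is enabled, a service named Service1, and the common object name pattern for the PartitionAssignment MBean; verify the exact object name in your MBean browser before relying on it.

import javax.management.MBeanServer;
import javax.management.ObjectName;

import com.tangosol.net.management.MBeanHelper;

public class CheckAssignmentStrategy
    {
    public static void main(String[] args) throws Exception
        {
        // locate the MBean server used by this cluster member
        MBeanServer server = MBeanHelper.findMBeanServer();

        // assumed object name pattern for the PartitionAssignment MBean
        ObjectName name = new ObjectName(
            "Coherence:type=PartitionAssignment,service=Service1,"
            + "responsibility=DistributionCoordinator");

        System.out.println("StrategyName: "
            + server.getAttribute(name, "StrategyName"));
        }
    }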

Persisting Caches on Demand

Caches can be persisted to disk at any point in time and recovered as required.

To persist caches on demand:

  1. Use the persistence coordinator to create, recover, and remove snapshots. See Using Snapshots to Persist a Cache Service.
  2. Optionally, change the location where persistence files are written to disk. See Changing the Pre-Defined Persistence Directory.
  3. Optionally, configure the number of storage members that are required to perform recovery. See Using Quorum for Persistence Recovery.

Actively Persisting Caches

Caches can be automatically persisted to disk and automatically recovered when a cluster is restarted.

To actively persist caches:

  1. Enable active persistence. See Enabling Active Persistence Mode.
  2. Optionally, change the location where persistence files are written to disk. See Changing the Pre-Defined Persistence Directory.
  3. Optionally, change how a service responds to possible failures during active persistence. See Changing the Active Persistence Failure Response.
  4. Optionally, configure the number of storage members that are required to perform recovery. See Using Quorum for Persistence Recovery.

Using Snapshots to Persist a Cache Service

Snapshots are a backup of the contents of a cache service that must be manually managed using the PersistenceManagerMBean MBean.
The MBean includes asynchronous operations to create, recover, and remove snapshots. When a snapshot is recovered, the entire service is automatically restored to the state of the snapshot. To use the MBean, JMX must be enabled on the cluster. See Using JMX to Manage Oracle Coherence in Managing Oracle Coherence.

Note:

The instructions in this section were created using the VisualVM-MBeans plug-in for the Coherence VisualVM tool. The Coherence VisualVM Plug-in can also be used to perform snapshot operations.

This section includes the following topics:

Create a Snapshot

Creating snapshots writes the contents of a cache service to the snapshot directory that is specified within the persistence environment definition in the operational override configuration file. A snapshot can be created either on a running service (a service that is accepting and processing requests) or on a suspended service. The former provides consistency at a partition level while the latter provides global consistency.

By default, the snapshot operation assumes partition level consistency. To achieve global consistency, the service must be explicitly suspended which causes any requests into the service to be blocked.

Snapshot with Partition Consistency

To create a snapshot with partition consistency:

  1. From the list of MBeans, select and expand the Persistence node.

  2. Expand a service for which you want to create a snapshot and select PersistenceCoordinator.

  3. From the Operations tab, enter a name for the snapshot in the field for the createSnapshot operation.

  4. Click createSnapshot.
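
The same operation can also be invoked programmatically through JMX instead of through the VisualVM UI. The following sketch assumes JMX management is enabled, a service named Service1, and the common object name pattern for the persistence coordinator MBean; confirm the exact object name in your MBean browser.

import javax.management.MBeanServer;
import javax.management.ObjectName;

import com.tangosol.net.management.MBeanHelper;

public class CreateSnapshot
    {
    public static void main(String[] args) throws Exception
        {
        MBeanServer server = MBeanHelper.findMBeanServer();

        // assumed object name pattern for the persistence coordinator MBean
        ObjectName coordinator = new ObjectName(
            "Coherence:type=Persistence,service=Service1,"
            + "responsibility=PersistenceCoordinator");

        // createSnapshot is asynchronous and returns immediately
        server.invoke(coordinator, "createSnapshot",
            new Object[] {"my-snapshot"},
            new String[] {String.class.getName()});
        }
    }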

Snapshot with Global Consistency

To create a snapshot with global consistency:

  1. From the list of MBeans, select the ClusterMBean node.

  2. From the Operations tab, enter the name of the service that you want to suspend in the field for the suspendService operation.

  3. Click suspendService.

  4. From the list of MBeans, select and expand the Persistence node.

  5. Expand the service (now suspended) for which you want to create a snapshot and select PersistenceCoordinator.

  6. From the Operations tab, enter a name for the snapshot in the field for the createSnapshot operation.

  7. Click createSnapshot.

    Note:

    Applications can be notified when the operation completes by subscribing to the snapshot JMX notifications. See Subscribing to Persistence JMX Notifications.

  8. From the list of MBeans, select the ClusterMBean node.

  9. From the Operations tab, enter the name of the service that you want to resume in the field for the resumeService operation.

  10. Click resumeService.
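
The suspend and resume steps can also be scripted rather than performed in the MBean browser. The following sketch assumes that the Cluster API exposes the suspendService and resumeService methods (available in recent Coherence releases) and a service named Service1; the snapshot itself is created with the JMX invocation shown earlier.

import com.tangosol.net.CacheFactory;
import com.tangosol.net.Cluster;

public class GloballyConsistentSnapshot
    {
    public static void main(String[] args) throws Exception
        {
        Cluster cluster = CacheFactory.ensureCluster();

        // block requests to the service so that the snapshot is globally consistent
        cluster.suspendService("Service1");
        try
            {
            // invoke createSnapshot on the persistence coordinator MBean here
            // (see the earlier sketch) and wait for it to complete, for example
            // by polling the Idle attribute or subscribing to the JMX notifications
            }
        finally
            {
            // resume the service even if the snapshot operation fails
            cluster.resumeService("Service1");
            }
        }
    }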

Recover a Snapshot

Recovering snapshots restores the contents of a cache service from a snapshot.

Note:

A Coherence service recovered from a persistent snapshot is not propagated to federated clusters. The data on the originating cluster is recovered but the cache data on the destination cluster remains unaffected and may still contain the data that was present prior to the recovery. To propagate the snapshot data, a federation ReplicateAll operation is required after the snapshot recovery is completed. The ReplicateAll operation is available on the FederationManagerMBean MBean. See FederationManagerMBean Operations in Managing Oracle Coherence.

To recover a snapshot:

  1. From the list of MBeans, select and expand the Persistence node.
  2. Expand a service for which you want to recover a snapshot and select PersistenceCoordinator.
  3. From the Operations tab, enter the name of a snapshot in the field for the recoverSnapshot operation.
  4. Click recoverSnapshot.

    After the operation has returned, check the OperationStatus or Idle attributes on the persistence coordinator to determine when the operation has completed. Applications can be notified when the operation completes by subscribing to the snapshot JMX notifications.
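
If polling is preferred over notifications, a simple loop over the coordinator's Idle and OperationStatus attributes can be used. The sketch below reuses the assumed persistence coordinator object name pattern from the earlier examples.

import javax.management.MBeanServer;
import javax.management.ObjectName;

import com.tangosol.net.management.MBeanHelper;

public class WaitForPersistenceOperation
    {
    public static void main(String[] args) throws Exception
        {
        MBeanServer server = MBeanHelper.findMBeanServer();

        ObjectName coordinator = new ObjectName(
            "Coherence:type=Persistence,service=Service1,"
            + "responsibility=PersistenceCoordinator");

        // poll until the coordinator reports that it is idle again
        while (!((Boolean) server.getAttribute(coordinator, "Idle")))
            {
            System.out.println("Status: "
                + server.getAttribute(coordinator, "OperationStatus"));
            Thread.sleep(1000);
            }
        }
    }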

Remove a Snapshot

Removing a snapshot deletes the snapshot from the snapshot directory. The cache service remains unchanged.

To remove a snapshot:

  1. From the list of MBeans, select and expand the Persistence node.
  2. Expand a service for which you want to remove a snapshot and select PersistenceCoordinator.
  3. From the Operations tab, enter the name of a snapshot in the field for the removeSnapshot operation.
  4. Click removeSnapshot.

Archiving Snapshots

Snapshots can be archived to a central location and then later retrieved and restored. Archiving snapshots requires defining the directory where archives are stored and configuring cache services to use an archive directory. Archiving operations are performed using the PersistenceManagerMBean MBean. An archive is slower to create than snapshots but, unlike snapshots, the archive is portable.

This section includes the following topics:

Defining a Snapshot Archive Directory

The directory where snapshots are archived is defined in the operational override file using a directory snapshot archiver definition. Multiple definitions can be created as required.

Note:

The archive directory location and name must be the same across all members. The archive directory location must be a shared directory and must be accessible to all members.

To define a snapshot archiver directory, include the <directory-archiver> element within the <snapshot-archivers> element. Use the <archiver-directory> element to enter the directory where snapshot archives are stored. Use the id attribute to provide a unique name for the definition. For example:

<snapshot-archivers>
   <directory-archiver id="archiver1">
      <archive-directory>/mydirectory</archive-directory>
   </directory-archiver>
</snapshot-archivers>

Specifying a Directory Snapshot Archiver

To specify a directory snapshot archiver, edit the persistence definition within a distributed scheme and include the name of a directory snapshot archiver that is defined in the operational override configuration file. For example:

<distributed-scheme>
   <scheme-name>distributed</scheme-name>
   <service-name>Service1</service-name>
   <backing-map-scheme>
      <local-scheme/> 
   </backing-map-scheme>
   <persistence>
      <archiver>archiver1</archiver>
   </persistence>
   <autostart>true</autostart>
</distributed-scheme>

Performing Snapshot Archiving Operations

Snapshot archiving is manually managed using the PersistenceManagerMBean MBean. The MBean includes asynchronous operations to archive and retrieve snapshot archives and also includes operations to list and remove archives.

This section includes the following topics:

Archiving a Snapshot

To archive a snapshot:

  1. From the list of MBeans, select and expand the Persistence node.
  2. Expand a service for which you want to archive a snapshot and select PersistenceCoordinator.
  3. From the Operations tab, enter a name for the archive in the field for the archiveSnapshot operation.
  4. Click archiveSnapshot. The snapshot is archived to the location that is specified in the directory archiver definition defined in the operational override configuration file.

    Check the OperationStatus on the persistence coordinator to determine when the operation has completed.

Retrieving Archived Snapshots

To retrieve an archived snapshot:

  1. From the list of MBeans, select and expand the Persistence node.
  2. Expand a service for which you want to retrieve an archived snapshot and select PersistenceCoordinator.
  3. From the Operations tab, enter the name of an archived snapshot in the field for the retrieveArchivedSnapshot operation.
  4. Click retrieveArchivedSnapshot. The archived snapshot is copied from the directory archiver location to the snapshot directory and is available to be recovered to the service backing map. See Recover a Snapshot.

Removing Archived Snapshots

To remove an archived snapshot:

  1. From the list of MBeans, select and expand the Persistence node.
  2. Expand a service for which you want to purge an archived snapshot and select PersistenceCoordinator.
  3. From the Operations tab, enter the name of an archived snapshot in the field for the removeArchivedSnapshot operation.
  4. Click removeArchivedSnapshot. The archived snapshot is removed from the archive directory.

Listing Archived Snapshots

To get a list of the current archived snapshots:

  1. From the list of MBeans, select and expand the Persistence node.
  2. Expand a service for which you want to list archived snapshots and select PersistenceCoordinator.
  3. From the Operations tab, click the listArchivedSnapshots operation. A list of archived snapshots is returned.

Listing Archived Snapshot Stores

To list the individual stores, or parts, of an archived snapshot:

  1. From the list of MBeans, select and expand the Persistence node.
  2. Expand a service for which you want to list archived snapshot stores and select PersistenceCoordinator.
  3. From the Operations tab, enter the name of an archived snapshot in the field for the listArchivedSnapshotStores operation.
  4. Click listArchivedSnapshotStores. A list of stores for the archived snapshots is returned.

Creating a Custom Snapshot Archiver

Custom snapshot archiver implementations can be created as required to store archives using a technique other than the default directory snapshot archiver implementation. For example, you may want to persist archives to an external database, use a web service to store archives to a storage area network, or store archives in a content repository.

This section includes the following topics:

Create a Custom Snapshot Archiver Implementation

To create a custom snapshot archiver implementation, create a class that extends the AbstractSnapshotArchiver class.

Create a Custom Snapshot Archiver Definition

To create a custom snapshot archiver definition, include the <custom-archiver> element within the <snapshot-archivers> element and use the id attribute to provide a unique name for the definition. Add the <class-name> element within the <custom-archiver> element that contains the fully qualified name of the implementation class. The following example creates a definition for a custom implementation called MyCustomArchiver:

<snapshot-archivers>
   <custom-archiver id="custom1">
      <class-name>package.MyCustomArchiver</class-name>
   </custom-archiver>
</snapshot-archivers>

Use the <class-factory-name> element if your implementation uses a factory class that is responsible for creating archiver instances. Use the <method-name> element to specify the static factory method on the factory class that performs object instantiation. The following example gets a snapshot archiver instance using the getArchiver method on the MyArchiverFactory class.

<snapshot-archivers>
   <custom-archiver id="custom1">
      <class-factory-name>package.MyArchiverFactory</class-factory-name>
      <method-name>getArchiver</method-name>
   </custom-archiver>
</snapshot-archivers>

Any initialization parameters that are required for an implementation can be specified using the <init-params> element. The following example sets the UserName parameter to Admin.

<snapshot-archivers>
   <custom-archiver id="custom1">
      <class-name>package.MyCustomArchiver</class-name>
      <init-params>
         <init-param>
            <param-name>UserName</param-name>
            <param-value>Admin</param-value>
         </init-param>
      </init-params>
   </custom-archiver>
</snapshot-archivers>

Specifying a Custom Snapshot Archiver

To specify a custom snapshot archiver, edit the persistence definition within a distributed scheme and include the name of a custom snapshot archiver that is defined in the operational override configuration file. For example:

<distributed-scheme>
   <scheme-name>distributed</scheme-name>
   <service-name>Service1</service-name>
   <backing-map-scheme>
      <local-scheme/> 
   </backing-map-scheme>
   <persistence>
      <archiver>custom1</archiver>
   </persistence>
   <autostart>true</autostart>
</distributed-scheme>

Using Active Persistence Mode

You can enable and configure active persistence mode to have the contents of a cache automatically persisted and recovered.

This section includes the following topics:

Enabling Active Persistence Mode

Active persistence can be enabled for all services or for specific services. To enable active persistence for all services, set the coherence.distributed.persistence.mode system property to active. For example:

-Dcoherence.distributed.persistence.mode=active

The default value if no value is specified is on-demand, which enables on-demand persistence. The persistence coordinator can still be used in active persistence mode to take snapshots of a cache.

To enable active persistence for a specific service, modify a distributed scheme definition and include the <environment> element within the <persistence> element. Set the value of the <environment> element to default-active. For example:

<distributed-scheme>
   <scheme-name>distributed</scheme-name>
   <service-name>Service1</service-name>
   <backing-map-scheme>
      <local-scheme/> 
   </backing-map-scheme>
   <persistence>
      <environment>default-active</environment>
   </persistence>
   <autostart>true</autostart>
</distributed-scheme>

The default value if no value is specified is default-on-demand, which enables on-demand persistence for the service.

Changing the Active Persistence Failure Response

You can change the way a partitioned cache service responds to possible persistence failures during active persistence operations. The default response is to immediately stop the service. This behavior is ideal if persistence is critical for the service (for example, a cache depends on persistence for data backup). However, if persistence is not critical, you can choose to let the service continue servicing requests.

To change the active persistence failure response for a service, edit the distributed scheme definition and include the <active-failure-mode> element within the <persistence> element and set the value to stop-persistence. If no value is specified, then the default value (stop-service) is automatically used. The following example changes the active persistence failure response to stop-persistence.

<distributed-scheme>
   <scheme-name>distributed</scheme-name>
   <service-name>Service1</service-name>
   <backing-map-scheme>
      <local-scheme/> 
   </backing-map-scheme>
   <persistence>
      <active-failure-mode>stop-persistence</active-failure-mode>
   </persistence>
   <autostart>true</autostart>
</distributed-scheme>

Changing the Partition Count When Using Active Persistence

The partition count cannot be changed when using active persistence. If you change a service's partition count, then on restart of the service all active data is moved to the persistence trash and must be recovered after the original partition count is restored. Persisted data can only be recovered to services that are running with the same partition count, or you can select one of the available workarounds. See Workarounds to Migrate a Persistent Service to a Different Partition Count.

Ensure that the partition count is not modified if active persistence is being used. If the partition count is changed, then a message similar to the following is displayed when the services are started:

<Warning> (thread=DistributedCache:DistributedCachePersistence, member=1):
Failed to recover partition 0 from SafeBerkeleyDBStore(...); partition-count
mismatch 501(persisted) != 277(service); reinstate persistent store from
trash once validation errors have been resolved

The message indicates that the change in the partition-count is not supported and the current active data has been copied to the trash directory. To recover the data:

  1. Shut down the entire cluster.
  2. Remove the current active directory contents for the cluster and service affected on each cluster member.
  3. Copy (recursively) the contents of the trash directory for each service to the active directory.
  4. Restore the partition count to the original value.
  5. Restart the cluster.

Workarounds to Migrate a Persistent Service to a Different Partition Count

There are two possible workarounds when changing the partition count with persistent services:

  • Using Coherence federation to replicate data to a service with a different partition count.

  • Defining a new persistent service and transferring the data manually.

For instructions to use the two options, see Using Federation and Using a New Service. Oracle recommends using federation because it ensures that data is migrated and available as quickly as possible.

If the existing persistent cache service is federated, migration is trivial, as illustrated in the following steps:

  1. Stop one of the destination clusters in the federation.
  2. Move the persistence directory to backup storage.
  3. Create a new cache config that differs only in the partition count for the related service.
  4. Invoke the MBean operation replicateAll from the newly started cluster with the different partition count.

If the existing persistent service is not a federated cache service, the upgrade includes a few additional steps, but is still fairly simple.

Using Federation

If the existing persistent cache service is federated, migration is trivial. To migrate, complete the following steps:

  1. Create a new cache config, and change the existing cache service from a distributed scheme to a federated scheme.

    Note:

    It is important to use the exact same service name.
    For example, if this is the existing cache config:
    <caching-scheme-mapping>
        <cache-mapping>
          <cache-name>*</cache-name>
          <scheme-name>federated-active</scheme-name>
        </cache-mapping>
      </caching-scheme-mapping>
    
      <caching-schemes>
        <distributed-scheme>
          <scheme-name>federated-active</scheme-name>
          <service-name>DistributedCachePersistence</service-name>
          <thread-count>5</thread-count>
          <partition-count system-property="test.partitioncount">5</partition-count>
          <backing-map-scheme>
            <local-scheme>
            </local-scheme>
          </backing-map-scheme>
          <persistence>
              <environment>simple-bdb-environment</environment>
          </persistence>
          <autostart>true</autostart>
        </distributed-scheme>
      </caching-schemes>
    </cache-config>
    Then the new config should be:
    <caching-scheme-mapping>
        <cache-mapping>
          <cache-name>*</cache-name>
          <scheme-name>federated-active</scheme-name>
        </cache-mapping>
      </caching-scheme-mapping>
    
      <caching-schemes>
        <federated-scheme>
          <scheme-name>federated-active</scheme-name>
          <service-name>DistributedCachePersistence</service-name>
          <thread-count>5</thread-count>
          <partition-count system-property="test.partitioncount">5</partition-count>
          <backing-map-scheme>
            <local-scheme>
            </local-scheme>
          </backing-map-scheme>
          <persistence>
              <environment>simple-bdb-environment</environment>
          </persistence>
          <autostart>true</autostart>
          <topologies>
            <topology>
              <name>EastCoast</name>
            </topology>
          </topologies>
        </federated-scheme>
      </caching-schemes>
    Ensure that you change the override config file to include federation. For example, add the following section to the override file:
    <federation-config>
        <participants>
          <participant>
            <name>BOSTON</name>
            <remote-addresses>
              <socket-address>
                <address>192.168.1.5</address>
                <port system-property="test.federation.port.boston">7574</port>
              </socket-address>
            </remote-addresses>
          </participant>
    
          <participant>
            <name>NEWYORK</name>
            <remote-addresses>
              <socket-address>
                <address>192.168.1.5</address>
                <port system-property="test.federation.port.newyork">7574</port>
              </socket-address>
            </remote-addresses>
          </participant>
    
        </participants>
    
        <topology-definitions>
          <active-passive>
            <name>EastCoast</name>
            <active>NEWYORK</active>
            <passive>BOSTON</passive>
          </active-passive>
        </topology-definitions>
    
      </federation-config>

    See Federating Caches Across Clusters.

  2. Stop the running cluster.
  3. Start the active side of the cluster. In this example, the NEWYORK cluster. The persistent stores from the previous cluster should be recovered successfully.
  4. Start the passive cluster. In this example, the BOSTON cluster.
  5. Invoke the MBean operation replicateAll to the passive cluster (BOSTON).

Now, all the data is persisted to the cluster with the target partition count.

Oracle recommends that you use the federated cache scheme to facilitate any future upgrade. However, if desired, you can also go back to the distributed scheme with the new partition count. Simply point the active persistence directory to the one that is created by BOSTON.

Using a New Service

If the existing persistent service is not a federated cache service, the upgrade includes a few additional steps, but is still fairly simple.

If you do not want to use a federated service, perform the following steps to upgrade:

  1. Create a new cache config that duplicates the existing cache service with a different service name and the targeted partition count. For example:
    <caching-scheme-mapping>
        <cache-mapping>
          <cache-name>dist*</cache-name>
          <scheme-name>simple-persistence</scheme-name>
        </cache-mapping>
        <cache-mapping>
          <cache-name>new*</cache-name>
          <scheme-name>new-persistence</scheme-name>
        </cache-mapping>
      </caching-scheme-mapping>
    
      <caching-schemes>
        <distributed-scheme>
          <scheme-name>simple-persistence</scheme-name>
          <service-name>DistributedCachePersistence</service-name>
          <partition-count system-property="test-partitioncount">257</partition-count>
          <backing-map-scheme>
            <local-scheme/>
          </backing-map-scheme>
          <persistence>
            <environment>simple-bdb-environment</environment>
          </persistence>
          <autostart>true</autostart>
        </distributed-scheme>
    
        <distributed-scheme>
          <scheme-name>new-persistence</scheme-name>
          <service-name>DistributedCachePersistenceNew</service-name>
          <partition-count system-property="new-partitioncount">457</partition-count>
          <backing-map-scheme>
            <local-scheme/>
          </backing-map-scheme>
          <persistence>
            <environment>simple-bdb-environment</environment>
          </persistence>
          <autostart>true</autostart>
        </distributed-scheme>
      </caching-schemes>
    </cache-config>
  2. Stop the cluster.
  3. Start the cluster with the new config. Now, you have two distributed services (old and new), with different partition counts.
  4. Transfer the data from the “old” service to the “new” service. Here is an example of client code for the data transfer:
    import java.util.Map;
    import java.util.Set;

    import com.tangosol.net.CacheFactory;
    import com.tangosol.net.DistributedCacheService;
    import com.tangosol.net.NamedCache;
    import com.tangosol.net.partition.PartitionSet;
    import com.tangosol.util.EntrySetMap;
    import com.tangosol.util.Filter;
    import com.tangosol.util.filter.AlwaysFilter;
    import com.tangosol.util.filter.PartitionedFilter;

    public class TransferData
        {
        public static void main(String[] args)
            {
            // run as a storage-disabled client against the new cache config
            System.setProperty("tangosol.coherence.distributed.localstorage", "false");
            System.setProperty("coherence.cacheconfig", "path to new cache config file");
            System.setProperty("coherence.override", "path to override config file");

            NamedCache cacheOld  = CacheFactory.getCache("dist");
            NamedCache cacheTemp = CacheFactory.getCache("new");

            DistributedCacheService serviceOld = (DistributedCacheService) cacheOld.getCacheService();
            DistributedCacheService serviceNew = (DistributedCacheService) cacheTemp.getCacheService();

            NamedCache   cache       = serviceNew.ensureCache("dist", null);   // the cache name must be same as the old one
            int          cPartitions = serviceOld.getPartitionCount();
            PartitionSet parts       = new PartitionSet(cPartitions);

            // transfer the data one partition at a time to limit client memory usage
            for (int iPartition = 0; iPartition < cPartitions; iPartition++)
                {
                parts.add(iPartition);

                Filter filter = new PartitionedFilter(AlwaysFilter.INSTANCE, parts);

                Set<Map.Entry> setPart = cacheOld.entrySet(filter);

                cache.putAll(new EntrySetMap(setPart));

                parts.remove(iPartition);
                }

            System.out.println("CacheOld.size " + cacheOld.size());
            System.out.println("CacheNew.size " + cache.size());

            cacheTemp.destroy();
            }
        }

    Now, all the data is persisted in the new service with the targeted partition count.
  5. Shut down the cluster.
  6. Go to the active persistent directory of the cluster. You will see two directories. In our example, DistributedCachePersistence and DistributedCachePersistenceNew. Move the directory of the old service, DistributedCachePersistence, to a backup storage.
  7. Now, remove the old service from the cache config and restart the cluster. All partitions should be recovered successfully with new partition count and new service name.
  8. If you want to use exactly the same service name, simply rename the new persistent directory to the existing service name, and restart the cluster with the old cache config with the targeted partition count.
To avoid loss of new live data while doing the data transfer to the new service, block the client requests temporarily.

Using Asynchronous Persistence Mode

You can enable and configure asynchronous persistence mode to have storage servers persist data asynchronously. See Using Asynchronous Persistence.

Modifying the Pre-Defined Persistence Environments

Persistence uses a set of directories for storage. You can choose to use the default storage directories or change the directories as required.

This section includes the following topics:

Overview of the Pre-Defined Persistence Environment

The operational deployment descriptor includes two pre-defined persistence environment definitions:

  • default-active – used when active persistence is enabled.

  • default-on-demand – used when on-demand persistence is enabled.

The operational override file or system properties are used to override the default settings of the pre-defined persistence environments. The pre-defined persistence environments have the following configuration:

<persistence-environments>
   <persistence-environment id="default-active">
      <persistence-mode>active</persistence-mode>
      <active-directory 
        system-property="coherence.distributed.persistence.active.dir">
      </active-directory>
      <snapshot-directory
        system-property="coherence.distributed.persistence.snapshot.dir">
      </snapshot-directory>
      <trash-directory 
        system-property="coherence.distributed.persistence.trash.dir">
      </trash-directory>
   </persistence-environment>
   <persistence-environment id="default-on-demand">
      <persistence-mode>on-demand</persistence-mode>
      <active-directory 
        system-property="coherence.distributed.persistence.active.dir">
      </active-directory>
      <snapshot-directory 
        system-property="coherence.distributed.persistence.snapshot.dir">
      </snapshot-directory>
      <trash-directory 
        system-property="coherence.distributed.persistence.trash.dir">
      </trash-directory>
   </persistence-environment>
</persistence-environments>

Changing the Pre-Defined Persistence Directory

The pre-defined persistence environments use a base directory called coherence within the USER_HOME directory to save persistence files. The location includes directories for active persistence files, snapshot persistence files, and trash files. The locations can be changed to a different local directory or a shared directory on the network.

Note:

  • Persistence directories and files (including the meta.properties files) should never be manually edited. Editing the directories and files can lead to persistence errors.

To change the pre-defined location of persistence files, include the <active-directory>, <snapshot-directory>, and <trash-directory> elements that are each set to the respective directories where persistence files are saved. The following example modifies the pre-defined on-demand persistence environment and changes the location of all directories to the /persistence directory:

<persistence-environments>
   <persistence-environment id="default-on-demand">
      <active-directory
        system-property="coherence.distributed.persistence.active.dir">
        /persistence/active</active-directory>
      <snapshot-directory
        system-property="coherence.distributed.persistence.snapshot.dir">
        /persistence/snapshot</snapshot-directory>
      <trash-directory
        system-property="coherence.distributed.persistence.trash.dir">
        /persistence/trash</trash-directory>
   </persistence-environment>
</persistence-environments>

The following system properties are used to change the pre-defined location of the persistence files instead of using the operational override file:

-Dcoherence.distributed.persistence.active.dir=/persistence/active
-Dcoherence.distributed.persistence.snapshot.dir=/persistence/snapshot
-Dcoherence.distributed.persistence.trash.dir=/persistence/trash

Use the coherence.distributed.persistence.base.dir system property to change the default directory off the USER_HOME directory:

-Dcoherence.distributed.persistence.base.dir=persistence

Creating Persistence Environments

You can choose to define and use multiple persistence environments to support different cache scenarios. Persistence environments are defined in the operational override configuration file and are referred within a distributed scheme or paged-topic-scheme definition in the cache configuration file.

This section includes the following topics:

Define a Persistence Environment

To define a persistence environment, include the <persistence-environments> element that contains a <persistence-environment> element. The <persistence-environment> element includes the configuration for a persistence environment. Use the id attribute to name the environment. The id attribute is used to refer to the persistence environment from a distributed scheme definition. The following example creates a persistence environment with the name environment1:

<persistence-environments>
   <persistence-environment id="enviornment1">
      <persistence-mode></persistence-mode>
      <active-directory></active-directory>
      <snapshot-directory></snapshot-directory>
      <trash-directory></trash-directory>
   </persistence-environment>
</persistence-environments>

Configure a Persistence Mode

A persistence environment supports two persistence modes: on-demand and active. On-demand persistence requires the use of the persistence coordinator to persist and recover cache services. Active persistence automatically persists and recovers cache services. You can still use the persistence coordinator in active persistence mode to periodically persist a cache service.

To configure the persistence mode, include the <persistence-mode> element set to either on-demand or active. The default value if no value is specified is on-demand. The following example configures active persistence.

<persistence-environments>
   <persistence-environment id="enviornment1">
      <persistence-mode>active</persistence-mode>
      <persistence-mode></persistence-mode>
      <active-directory></active-directory>
      <snapshot-directory></snapshot-directory>
      <trash-directory></trash-directory>
   </persistence-environment>
</persistence-environments>

Configure Persistence Directories

A persistence environment saves cache service data to disk. The location can be configured as required and can be either on a local drive or on a shared network drive. When configuring a local drive, only the partitions that are owned by a cache server are persisted to the respective local disk. When configuring a shared network drive, all partitions are persisted to the same shared disk.

Note:

  • Persistence directories and files (including the meta.properties files) should never be manually edited. Editing the directories and files can lead to persistence errors.

  • If persistence is configured to use an NFS mounted file system, then the NFS mount should be configured to use synchronous IO and not asynchronous IO, which is the default on many operating systems. The use of asynchronous IO can lead to data loss if the file system becomes unresponsive due to an outage. For details on configuration, refer to the mount documentation for your operating system.

Different directories are used for active, snapshot and trash files and are named accordingly. Only the top-level directory must be specified. To configure persistence directories, include the <active-directory>, <snapshot-directory>, and <trash-directory> elements that are each set to a directory path where persistence files are saved. The default value if no value is specified is the USER_HOME directory. The following example configures the /env1 directory for all persistence files:

<persistence-environments>
   <persistence-environment id="enviornment1">
      <persistence-mode>on-demand</persistence-mode>
      <active-directory>/env1</active-directory>
      <snapshot-directory>/env1</snapshot-directory>
      <trash-directory>/env1</trash-directory>
   </persistence-environment>
</persistence-environments>

Configure a Cache Service to Use a Persistence Environment

To change the persistence environment used by a cache service, modify the distributed scheme definition and include the <environment> element within the <persistence> element. Set the value of the <environment> element to the name of a persistence environment that is defined in the operational override configuration file. For example:

<distributed-scheme>
   <scheme-name>distributed</scheme-name>
   <service-name>Service1</service-name>
   <backing-map-scheme>
      <local-scheme/> 
   </backing-map-scheme>
   <persistence>
      <environment>environment1</environment>
   </persistence>
   <autostart>true</autostart>
</distributed-scheme>

Using Quorum for Persistence Recovery

Coherence includes a quorum policy that enables Coherence to defer recovery until a suitable point. Suitability is based on availability of all partitions and sufficient capacity to initiate recovery across storage members.

This section includes the following topics:

Overview of Persistence Recovery Quorum

The partitioned cache recover quorum uses two inputs to determine whether the requirements of partition availability and storage capacity are met to commence persistence recovery. The partition availability requirement is met through a list of recovery host names (machines that will contain the persistent stores), while the storage capacity requirement is met by specifying the number of storage nodes. When the dynamic quorum policy is used, both of these inputs can be inferred from the 'last known good' state of the service; this is the recommended approach for configuring a recovery quorum. If the rules used by the dynamic quorum policy are insufficient, you can explicitly specify the recovery-hosts list and the recover-quorum. The use of the quorum allows time for a cluster to start and ensures that partitions are recovered gracefully without overloading too few storage members or inadvertently deleting orphaned partitions.

If the recover quorum is not satisfied, then persistence recovery does not proceed and the service or cluster may appear to be blocked. To check for this scenario, view the QuorumPolicy attribute in the ServiceMBean MBean to see if recover is included in the list of actions. If data has not been recovered after cluster startup, the following log message is emitted (each time a new service member starts up) to indicate that the quorum has not been satisfied:

<Warning> (thread=DistributedCache:DistributedCachePersistence, member=1):
 Action recover disallowed; all-disallowed-actions: recover(4)

After the quorum is satisfied, the following message is emitted:

<Warning> (thread=DistributedCache:DistributedCachePersistence, member=1):
 All actions allowed
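
The QuorumPolicy attribute can also be checked programmatically rather than in an MBean browser. The following sketch is a minimal example; the ServiceMBean object name pattern, service name, and nodeId are assumptions that must be adjusted for your deployment.

import javax.management.MBeanServer;
import javax.management.ObjectName;

import com.tangosol.net.management.MBeanHelper;

public class CheckRecoverQuorum
    {
    public static void main(String[] args) throws Exception
        {
        MBeanServer server = MBeanHelper.findMBeanServer();

        // ServiceMBean for member 1 of Service1; adjust the name and nodeId
        ObjectName name = new ObjectName(
            "Coherence:type=Service,name=Service1,nodeId=1");

        // check whether recover appears in the list of actions
        System.out.println("QuorumPolicy: "
            + server.getAttribute(name, "QuorumPolicy"));
        }
    }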

For active persistence, the recover quorum is enabled by default and automatically uses the dynamic recovery quorum policy. See Using the Dynamic Recovery Quorum Policy.

For general details about partitioned cache quorums, see Using the Partitioned Cache Quorums in Developing Applications with Oracle Coherence.

Using the Dynamic Recovery Quorum Policy

The dynamic recovery quorum policy is used with active persistence and automatically configures the persistence recovery quorum based on a predefined algorithm. The dynamic recovery quorum policy is the default quorum policy for active persistence mode and does not need to be explicitly enabled. The policy is automatically used if either the <recover-quorum> value is not specified or if the value is set to 0. The following example explicitly enables the dynamic recovery quorum policy and is provided here for clarity.

<distributed-scheme>
   <scheme-name>distributed</scheme-name>
   <service-name>Service1</service-name>
   <backing-map-scheme>
      <local-scheme/> 
   </backing-map-scheme>
   <partitioned-quorum-policy-scheme>
      <recover-quorum>0</recover-quorum>
   </partitioned-quorum-policy-scheme>
   <autostart>true</autostart>
</distributed-scheme>

Note:

When using the dynamic recovery quorum policy, the <recovery-hosts> element should not be used within the <partitioned-quorum-policy-scheme> element. All other quorum policies (for example, the read quorum policy) are still valid.

Understanding the Dynamic Recovery Algorithm

The dynamic recovery quorum policy works by recording cluster membership information each time a member joins the cluster and partition distribution stabilizes. Membership is only recorded if the service is not suspended and all other partitioned cache actions (such as read, write, restore, and distribute) are allowed by the policy. JMX notifications are sent to subscribers of the PersistenceManagerMBean MBean every time the cluster membership changes.

During recovery scenarios, a service only recovers data if the following conditions are satisfied:

  • the persistent image of all partitions is accessible by the cluster members

  • the number of storage-enabled nodes is at least 2/3 of the last recorded membership

  • if the persistent data is being stored to a local disk (not shared and visible by all hosts), then each host should have at least 2/3 of the number of members that it had when the last membership was recorded

The partitioned cache service blocks any client side requests if any of the conditions are not satisfied. However, if an administrator determines that the full recovery is impossible due to missing partitions or that starting the number of servers that is expected by the quorum is unnecessary, then the recovery can be forced by invoking the forceRecovery operation on the PersistenceManagerMBean MBean.

The recovery algorithm can be overridden by using a custom quorum policy class that extends the com.tangosol.net.ConfigurableQuorumPolicy.PartitionedCacheQuorumPolicy class. To change the hard-coded 2/3 ratio, override the PartitionedCacheQuorumPolicy.calculateMinThreshold method. See Using Custom Action Policies in Developing Applications with Oracle Coherence.

Explicit Persistence Quorum Configuration

To configure the recover quorum for persistence, modify a distributed scheme definition and include the <recover-quorum> element within the <partitioned-quorum-policy-scheme> element. Set the <recover-quorum> element value to the number of storage members that must be available before recovery starts. For example:

<distributed-scheme>
   <scheme-name>distributed</scheme-name>
   <service-name>Service1</service-name>
   <backing-map-scheme>
      <local-scheme/> 
   </backing-map-scheme>
   <partitioned-quorum-policy-scheme>
      <recover-quorum>2</recover-quorum>
   </partitioned-quorum-policy-scheme>
   <autostart>true</autostart>
</distributed-scheme>

Note:

In active persistence mode, setting the <recover-quorum> element to 0 enables the dynamic recovery quorum policy. See Using the Dynamic Recovery Quorum Policy.

In shared disk scenarios, all partitions are persisted and recovered from a single location. For local-disk scenarios, each storage member recovers its partitions from a local disk. However, if you use a non-dynamic recovery quorum with local-disk based storage, you must define a list of storage-enabled hosts in the cluster that are required to recover orphaned partitions from the persistent storage; otherwise, empty partitions are assigned.

Note:

Recovery hosts must be specified to ensure that recovery does not commence prior to all persisted state being available.

To define a list of addresses, edit the operational override configuration file and include the <address-provider> element that contains a list of addresses each defined using an <address> element. Use the id attribute to name the address provider list. The id attribute is used to refer to the list from a distributed scheme definition. The following example creates an address provider list that contains two member addresses and is named persistence_hosts:

<address-providers>
   <address-provider id="persistence_hosts">
      <address>HOST_NAME1</address>
      <address>HOST_NAME2</address>
   </address-provider>
</address-providers>

To refer to the address provider list, modify a distributed scheme definition and include the <recovery-hosts> element within the <partitioned-quorum-policy-scheme> element and set the value to the name of an address provider list. For example:

<distributed-scheme>
   <scheme-name>distributed</scheme-name>
   <service-name>Service1</service-name>
   <backing-map-scheme>
      <local-scheme/> 
   </backing-map-scheme>
   <partitioned-quorum-policy-scheme>
      <recover-quorum>2</recover-quorum>
      <recovery-hosts>persistence_hosts</recovery-hosts>
   </partitioned-quorum-policy-scheme>
   <autostart>true</autostart>
</distributed-scheme>

Subscribing to Persistence JMX Notifications

The PersistenceManagerMBean MBean includes a set of notification types that applications can use to monitor persistence operations. See PersistenceManagerMBean in Managing Oracle Coherence.

To subscribe to persistence JMX notifications, implement the JMX NotificationListener interface and register the listener. The following code snippet demonstrates registering a notification listener. Refer to the Coherence examples for the complete example, which includes a sample listener implementation.

...
MBeanServer server = MBeanHelper.findMBeanServer();
Registry registry = cluster.getManagement();
try
   {
   for (String sServiceName : setServices)
      {
      logHeader("Registering listener for " + sServiceName);
      String sMBeanName = getMBeanName(sServiceName);
       
      ObjectName           oBeanName = new ObjectName(sMBeanName);
      NotificationListener listener  = new 
         PersistenceNotificationListener(sServiceName);
      server.addNotificationListener(oBeanName, listener, null, null);
      }
   ...

Managing Persistence

Persistence should be managed to ensure there is enough disk space and to ensure persistence operations do not add significant latency to cache operations. Latency is specific to active persistence mode and can affect cache performance because persistence operations are being performed in parallel with cache operations.

This section includes the following topics:

Plan for Persistence Storage

An adequate amount of disk space is required to persist data. Ensure enough space is provisioned to persist the expected amount of cached data. The following guidelines should be used when sizing disks for persistence:

  • The approximate overhead for active persistence data storage is an extra 10-30% per partition. The actual overhead may vary depending upon data access patterns, the size of keys and values, and other factors such as block sizes and heavy system load.

  • Use the Coherence VisualVM plug-in and persistence reports to monitor space availability and usage. See Monitor Persistence Storage Usage. Specifically, use the PersistenceActiveSpaceUsed attribute on the ServiceMBean MBean to monitor the actual persistence space used for each service and node.

  • Persistence configurations that use a shared disk for storage should plan for the potential maximum size of the cache because all partitions are persisted to the same location. For example, if the maximum capacity of a cache is 8GB, then the shared disk must be able to accommodate at least 8GB of persisted data plus overhead.

  • Persistence configurations that use a local disk for storage should plan for the potential maximum cache capacity of the cache server because only the partitions owned by a cache server are persisted to the local disk. For example, if the maximum cache capacity of a cache server is 2GB, then the local disk must be able to accommodate at least 2GB of persisted data plus overhead.

  • Plan additional space when creating snapshots in either active or on-demand mode. Each snapshot of a cache duplicates the size of the persistence files on disk.

  • Plan additional space for snapshot archives. Each archive of a snapshot is slightly less than the size of the snapshot files on disk.

  • The larger the partition count, the more memory overhead a single member may observe. Each partition will consume a minimum of about 80k of heap for the underlying Berkeley DB (BDB) implementation to maintain its context. For example, with a partition count of 24,000, the senior member in dynamic quorum mode may easily consume the entire heap of a JVM configured with 2GB.

    To alleviate this issue, a recovery quorum of at least 75% of the machines can be used so that recovery takes place only when enough members are running and ready to be balanced. In this case, the partitions are evenly spread across all of the storage-enabled members, as opposed to being concentrated on a single member and only then distributed.

    Another way to alleviate the issue is to reduce the partition count whenever possible.
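
As an illustrative sizing calculation based on the guidelines above (the figures are hypothetical): a service holding 8GB of cache data with a 30% persistence overhead needs roughly 8GB x 1.3 = 10.4GB of active persistence storage, and keeping two snapshots adds approximately 2 x 10.4GB, for a total of about 31GB. Similarly, at roughly 80k of heap per partition, a member that temporarily owns all 24,000 partitions of a service needs about 24,000 x 80k, or roughly 1.9GB of heap, for the persistence contexts alone, which is why large partition counts can exhaust a 2GB heap on the senior member.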

Note:

The underlying Berkeley DB (BDB) storage which is used for persistence, is “append only”. Insertions, deletions, and updates are always added at the end of the current file. The first file is named 00000000.jdb. When this file grows to a certain size (10 MB, by default), then a new file named 00000001.jdb becomes the current file, and so on. As the files reach the 10MB limit and roll over to a new file, the background tasks are run automatically to cleanup, compress, and remove the unused files. As a result, you will have at least 10MB storage overhead per service per partition. This overhead is taken into account in the 10-30% figure quoted above.

Plan for Persistence Memory Overhead

In addition to affecting disk usage, active persistence requires additional data structures and memory within each JVM's heap for managing persistence. The amount of memory required varies based on the partition-count and data usage patterns but as a guide, you should allocate an additional 20-35% of memory to each JVM that is running active persistence.

Monitor Persistence Storage Usage

Monitor persistence storage to ensure that there is enough space available on the file system to persist cached data.

Coherence VisualVM Plug-in

Use the Persistence tab in the Coherence VisualVM plug-in to view the amount of space being used by a service for active persistence. The space is reported in both Bytes and Megabytes. The tab also reports the current number of snapshots available for a service. The snapshot number can be used to estimate the additional space usage and to determine whether snapshots should be deleted to free up space.

Coherence Reports

Use the persistence detail report (persistence-detail.txt) to view the amount of space being used by a service for both active persistence and for persistence snapshots. The amount of available disk space is also reported and allows you to monitor if a disk is reaching capacity.

Coherence MBeans

Use the persistence attributes on the ServiceMBean MBean to view all the persistence storage statistics for a service. The MBean includes statistics for both active persistence and persistence snapshots.
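
For scripted monitoring, the same statistics can be read over JMX. The following sketch reads the PersistenceActiveSpaceUsed attribute for every member of a service; the object name pattern and the service name Service1 are assumptions to adjust for your deployment.

import javax.management.MBeanServer;
import javax.management.ObjectName;

import com.tangosol.net.management.MBeanHelper;

public class MonitorPersistenceSpace
    {
    public static void main(String[] args) throws Exception
        {
        MBeanServer server = MBeanHelper.findMBeanServer();

        // one ServiceMBean instance is registered per member of the service
        for (ObjectName name : server.queryNames(
                new ObjectName("Coherence:type=Service,name=Service1,nodeId=*"), null))
            {
            System.out.println("Member " + name.getKeyProperty("nodeId") + ": "
                + server.getAttribute(name, "PersistenceActiveSpaceUsed") + " bytes");
            }
        }
    }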

Monitoring Persistence Latencies

Monitor persistence latencies when using active persistence to ensure that persistence operations are not adversely affecting cache operations. High latencies can be a sign that network issues are delaying writing persistence files to a shared disk or delaying coordination between local disks.

Coherence VisualVM Plug-In

Use the Persistence tab in the Coherence VisualVM plug-in to view the amount of latency that persistence operations are adding to cache operations. The time is reported in milliseconds. Statistics are reported for each service and provide the average latency of all persistence operations and for the highest recorded latency.

Coherence Reports

Use the persistence detail report (persistence-detail.txt) to view the amount of latency that persistence operations are adding to cache operations. The time is reported in milliseconds. Statistics are provided for the average latency of all persistence operations and for the highest recorded latency on each cluster node of a service. The statistics can be used to determine if some nodes are experiencing higher latencies than other nodes.

Coherence MBeans

Use the persistence attributes on the ServiceMBean MBean to view the amount of latency that persistence operations are adding to cache operations. The time is reported in milliseconds. Statistics are provided for the average latency of all persistence operations and for the highest recorded latency on each cluster node of a service. The statistics can be used to determine if some nodes are experiencing higher latencies than other nodes.

Configuring Caches as Transient

Caches that do not require persistence can be configured as transient. Caches that are transient are not recovered during persistence recovery operations.

Note:

During persistence recovery operations, the entire cache service is recovered from the persisted state and any caches that are configured as transient are reset.

Caches are configured as transient using the <transient> element within the <backing-map-scheme> element of a distributed scheme definition. However, because persistence is always enabled on a service, a parameter macro is used to configure the transient setting for each cache. For example:

<caching-scheme-mapping>
   <cache-mapping>
      <cache-name>nonPersistedCache</cache-name>
      <scheme-name>distributed</scheme-name>
      <init-params>
         <init-param>
            <param-name>transient</param-name>
            <param-value>true</param-value>
         </init-param>
      </init-params>
   </cache-mapping>
   <cache-mapping>
      <cache-name>persistedCache</cache-name>
      <scheme-name>distributed</scheme-name>
   </cache-mapping>
</caching-scheme-mapping>

<distributed-scheme>
   <scheme-name>distributed</scheme-name>
   <service-name>DistributedService</service-name>
   <backing-map-scheme>
      <transient>{transient false}</transient>
      <local-scheme/>
   </backing-map-scheme>
   <autostart>true</autostart>
</distributed-scheme>

Note:

The default value of the <transient> element is false and indicates that cache data is persisted.