Coherence 3.1 User Guide : Best Practices

Coherence supports several cache topologies, but the following options cover the vast majority of use cases. All are fully coherent and support cluster-wide locking and transactions:

Data Access Patterns

Data access distribution (hot spots)

When caching a large dataset, typically a small portion of that dataset will be responsible for most data accesses. For example, in a 1000 object dataset, 80% of operations may be be against a 100 object subset. The remaining 20% of operations may be against the other 900 objects. Obviously the most effective return on investment will be gained by caching the 100 most active objects; caching the remaining 900 objects will provide 25% more effective caching while requiring a 900% increase in resources.

On the other hand, if every object is accessed equally often (for example in sequential scans of the dataset), then caching will require more resources for the same level of effectiveness. In this case, achieving 80% cache effectiveness would require caching 80% of the dataset versus 10%. (Note that sequential scans of partially cached data sets will generally defeat MRU, LFU and MRU-LFU eviction policies). In practice, almost all non-synthetic (benchmark) data access patterns are uneven, and will respond well to caching subsets of data.

In cases where a subset of data is active, and a smaller subset is particularly active, Near caching can be very beneficial when used with the "all" invalidation strategy (this is effectively a two-tier extension of the above rules).

Cluster-node affinity

Coherence's Near cache technology will transparently take advantage of cluster-node affinity, especially when used with the "present" invalidation strategy. This topology is particularly useful when used in conjunction with a sticky load-balancer. Note that the "present" invalidation strategy results in higher overhead (as opposed to "all") when the front portion of the cache is "thrashed" (very short lifespan of cache entries); this is due to the higher overhead of adding/removing key-level event listeners. In general, a cache should be tuned to avoid thrashing and so this is usually not an issue.

Read-write ratio and data sizes

Generally speaking, the following cache topologies are best for the following use cases:

Replicated cache - small amounts of read-heavy data (e.g. metadata)
Partitioned cache - large amounts of read-write data (e.g. large data caches)
Near cache - similar to Partitioned, but has further benefits from read-heavy tiered access patterns (e.g. large data caches with hotspots) and "sticky" data access (e.g. sticky HTTP session data). Depending on the synchronization method (expiry, asynchronous, synchronous), the worst case performance may range from similar to a Partitioned cache to considerably worse.

Interleaving

Interleaving refers to the number of cache reads between each cache write. The Partitioned cache is not affected by interleaving (as it is designed for 1:1 interleaving). The Replicated and Near caches by contrast are optimized for read-heavy caching, and prefer a read-heavy interleave (e.g. 10 reads between every write). This is because they both locally cache data for subsequent read access. Writes to the cache will force these locally cached items to be refreshed, a comparatively expensive process (relative to the near-zero cost of fetching an object off the local memory heap). Note that with the Near cache technology, worst-case performance is still similar to the Partitioned cache; the loss of performance is relative to best-case scenarios.

Note that interleaving is related to read-write ratios, but only indirectly. For example, a Near cache with a 1:1 read-write ratio may be extremely fast (all writes followed by all reads) or much slower (1:1 interleave, write-read-write-read...).

Heap Size Considerations

Using several small heaps

For large datasets, Partitioned or Near caches are recommended. As the scalability of the Partitioned cache is linear for both reading and writing, varying the number of Coherence JVMs will not significantly affect cache performance. On the other hand, JVM memory management routines show worse than linear scalability. For example, increasing JVM heap size from 512MB to 2GB may substantially increase garbage collection (GC) overhead and pauses.

For this reason, it is common to use multiple Coherence instances per physical machine. As a general rule of thumb, current JVM technology works well up to 512MB heap sizes. Thererfore, using a number of 512MB Coherence instances will provide optimal performance without a great deal of JVM configuration or tuning.

For performance-sensitive applications, experimentation may provide better tuning. When considering heap size, it is important to find the right balance. The lower bound is determined by per-JVM overhead (and also, manageability of a potentially large number of JVMs). For example, if there is a fixed overhead of 100MB for infrastructure software (e.g. JMX agents, connection pools, internal JVM structures), then the use of JVMs with 256MB heap sizes will result in close to 40% overhead for non-cache data. The upper bound on JVM heap size is governed by memory management overhead, specifically the maximum duration of GC pauses and the percentage of CPU allocated to GC (and other memory management tasks).

For Java 5 VMs running on commodity systems, the following rules generally hold true (with no JVM configuration tuning). With a heap size of 512MB, GC pauses will not exceed one second. With a heap size of 1GB, GC pauses are limited to roughly 2-3 seconds. With a heap size of 2GB, GC pauses are limited to roughly 5-6 seconds. It is important to note that GC tuning will have an enormous impact on GC throughput and pauses. In all configurations, initial (-Xms) and maximum (-Xmx) heap sizes should be identical. There are many variations that can substantially impact these numbers, including machine architecture, CPU count, CPU speed, JVM configuration, object count (object size), object access profile (short-lived versus long-lived objects).

For allocation-intensive code, GC can theoretically consume close to 100% of CPU usage. For both cache server and client configurations, most CPU resources will typically be consumed by application-specific code. It may be worthwhile to view verbose garbage collection statistics (e.g. -verbosegc). Use the profiling features of the JVM to get profiling information including CPU usage by GC (e.g. -Xprof).

Moving the cache out of the application heap

Using dedicated Coherence cache server instances for Partitioned cache storage will minimize the heap size of application JVMs as the data is no longer stored locally. As most Partitioned cache access is remote (with only 1/N of data being held locally), using dedicated cache servers does not generally impose much additional overhead. Near cache technology may still be used, and it will generally have a minimal impact on heap size (as it is caching an even smaller subset of the Partitioned cache). Many applications are able to dramatically reduce heap sizes, resulting in better responsiveness.

Local partition storage may be enabled (for cache servers) or disabled (for application server clients) with the tangosol.coherence.distributed.localstorage Java property (e.g. -Dtangosol.coherence.distributed.localstorage=false).

It may also be disabled by modifying the <local-storage> setting in the tangosol-coherence.xml (or tangosol-coherence-override.xml) file as follows:

<coherence>
  <services>
    <service>
      <service-type>DistributedCache</service-type>
      <service-component>DistributedCache</service-component>
      <init-params>
        <init-param>
          <param-name>local-storage</param-name>
          <param-value system-property="tangosol.coherence.distributed.localstorage">false<param-value>
        </init-param>
      </init-params>
    </service>
  </services>
</coherence>

At least one storage-enabled JVM must be started before any storage-disabled clients access the cache.

Best Practices

Overview