6 Performance Tuning

This chapter provides tuning instructions to help achieve maximum performance. See also Chapter 4, "Performing a Network Performance Test."

The following sections are included in this chapter:

6.1 Operating System Tuning

The following topics are included in this section:

6.1.1 Socket Buffer Sizes

To help minimization of packet loss, the operating system socket buffers must be large enough to handle the incoming network traffic while your Java application is paused during garbage collection. By default Coherence attempts to allocate a socket buffer of 2MB. Coherence uses smaller buffers if your operating system is not configured to allow for large buffers. Most versions of UNIX have a very low default buffer limit, which should be increased to at least 2MB.

Coherence issues the following warning if the operating system failed to allocate the full size buffer.

UnicastUdpSocket failed to set receive buffer size to 1428 packets (2096304 bytes); actual size is 89 packets (131071 bytes). Consult your OS documentation regarding increasing the maximum socket buffer size. Proceeding with the actual value may cause sub-optimal performance.

Though it is safe to operate with the smaller buffers it is recommended that you configure your operating system to allow for larger buffers.

On Linux execute (as root):

sysctl -w net.core.rmem_max=2096304
sysctl -w net.core.wmem_max=2096304

On Solaris execute (as root):

ndd -set /dev/udp udp_max_buf 2096304 

On AIX execute (as root):

no -o rfc1323=1
no -o sb_max=4194304


Note that AIX only supports specifying buffer sizes of 1MB, 4MB, and 8MB.

On Windows:

Windows does not impose a buffer size restriction by default.


For information on increasing the buffer sizes for other operating systems, refer to your operating system's documentation.

You may configure Coherence to request alternate sized buffers for packet publishers and unicast listeners by using the <packet-buffer> element.

6.1.2 High Resolution timesource (Linux)

Linux has several high resolution timesources to choose from, the fastest TSC (Time Stamp Counter) unfortunately is not always reliable. Linux chooses TSC by default and during startup checks for inconsistencies, if found it switches to a slower safe timesource. The slower time sources can be 10 to 30 times more expensive to query then the TSC timesource, and may have a measurable impact on Coherence performance. Note that Coherence and the underlying JVM are not aware of the timesource which the operating system is using. It is suggested that you check your system logs (/var/log/dmesg) to verify that the following is not present. Example 6-1 illustrates a sample timesource log.

Example 6-1 Log Message from a Linux Timesource

kernel: Losing too many ticks!
kernel: TSC cannot be used as a timesource.
kernel: Possible reasons for this are:
kernel:   You're running with Speedstep,
kernel:   You don't have DMA enabled for your hard disk (see hdparm),
kernel:   Incorrect TSC synchronization on an SMP system (see dmesg).
kernel: Falling back to a sane timesource now.

As the log messages suggest, this can be caused by a variable rate CPU (SpeedStep), having DMA disabled, or incorrect TSC synchronization on multi CPU computers. If present, work with your system administrator to identify and correct the cause allowing the TSC timesource to be used.In some older versions of Linux there is a bug related to unsynchronized TSCs in multiprocessor systems. This bug can result in the clock apparently jumping forward by 4398 seconds, and then immediately jumping back to the correct time. This bug can trigger erroneous Coherence packet transmission timeouts resulting in nodes being incorrectly removed from the cluster. If you experience cluster disconnects along with a warning of a "4398 second" timeout it is suggested that you upgrade you Linux kernel. An example of such a log warning is as follows:

A potential communication problem has been detected. A packet has failed to be delivered (or acknowledged) after 4398 seconds ...

See the following resource for more details on this Linux TSC issue:



6.1.3 Datagram size (Microsoft Windows)

Microsoft Windows supports a fast I/O path which is used when sending "small" datagrams. The default setting for what is considered a small datagram is 1024 bytes; increasing this value to match your network MTU (normally 1500) can significantly improve network performance.

To adjust this parameter:

  1. Run Registry Editor (regedit)

  2. Locate the following registry key HKLM\System\CurrentControlSet\Services\AFD\Parameters

  3. Add the following new DWORD value Name: FastSendDatagramThreshold Value: 1500 (decimal)

  4. Restart.


The COHERENCE_HOME/bin/optimize.reg script performs this change. After running the script, restart the computer for the changes to take effect.

See Appendix C of http://technet.microsoft.com/en-us/library/bb726981.aspx for more details on this parameter

6.1.4 Thread Scheduling (Microsoft Windows)

Windows (including NT, 2000 and XP) is optimized for desktop application usage. If you run two console ("DOS box") windows, the one that has the focus can use almost 100% of the CPU, even if other processes have high-priority threads in a running state. To correct this imbalance, you must configure the Windows thread scheduling to less-heavily favor foreground applications.

  1. Open the Control Panel.

  2. Open System.

  3. Select the Advanced tab.

  4. Under Performance select Settings.

  5. Select the Advanced tab.

  6. Under Processor scheduling, choose Background services.


The COHERENCE_HOME/bin/optimize.reg script performs this change. After running the script, restart the computer for the changes to take effect.

6.1.5 Swapping

Ensure that you have sufficient memory such that you are not making active use of swap space on your computers. You may monitor the swap rate using tools such as vmstat and top. Actively moving through swap space can typically has a significant impact on Coherence's performance. Often this manifests itself as Coherence nodes being removed from the cluster due to long periods of unresponsiveness caused by them having been "swapped out."

6.2 Network Tuning

6.2.1 Network Interface Settings

Verify that your Network card (NIC) is configured to operate at it's maximum link speed and at full duplex. The process for doing this varies between operating systems.

On Linux execute (as root):

ethtool eth0

See the man page on ethtool for further details and for information on adjust the interface settings.

On Solaris execute (as root):

kstat ce:0 | grep link_

This displays the link settings for interface 0. Items of interest are link_duplex (2 = full), and link_speed which is reported in Mbps.


If running on Solaris 10, review Sun issues 102712 and 102741 which relate to packet corruption and multicast disconnections. These often manifest as either EOFExceptions, "Large gap" warnings while reading packet data, or frequent packet timeouts. It is highly recommend that the patches for both issues be applied when using Coherence on Solaris 10 systems.

On Windows:

  1. Open the Control Panel.

  2. Open Network Connections.

  3. Open the Properties dialog for desired network adapter.

  4. Select Configure.

  5. Select the Advanced tab.

  6. Locate the driver specific property for Speed & Duplex.

  7. Set it to either auto or to a specific speed and duplex setting.

6.2.2 Bus Considerations

For 1Gb and faster PCI network cards the system's bus speed may be the limiting factor for network performance. PCI and PCI-X busses are half-duplex, and all devices run at the speed of the slowest device on the bus. Standard PCI buses have a maximum throughput of approximately 1Gb/sec and thus are not capable of fully using a full-duplex 1Gb NIC. PCI-X has a much higher maximum throughput (1GB/sec), but can be hobbled by a single slow device on the bus. If you find that you are not able to achieve satisfactory bidirectional data rates it is suggested that you evaluate your computer's bus configuration. For instance simply relocating the NIC to a private bus may improve performance.

6.2.3 Network Infrastructure Settings

If you experience frequent multi-second communication pauses across multiple cluster nodes, try increasing your switch's buffer space. These communication pauses can be identified by a series of Coherence log messages identifying communication delays with multiple nodes which are not attributable to local or remote GCs.

Example 6-2 Message Indicating a Communication Delay

Experienced a 4172 ms communication delay (probable remote GC) with Member(Id=7, Timestamp=2006-10-20 12:15:47.511, Address=, MachineId=13838); 320 packets rescheduled, PauseRate=0.31, Threshold=512

Some switches such as the Cisco 6500 series support configuration the amount of buffer space available to each Ethernet port or ASIC. In high load applications it may be necessary to increase the default buffer space. On Cisco this can be accomplished by executing:

fabric buffer-reserve high

See Cisco's documentation for additional details on this setting.

6.2.4 Ethernet Flow-Control

Full duplex Ethernet includes a flow-control feature which allows the receiving end of a point to point link to slow down the transmitting end. This is implemented by the receiving end sending an Ethernet PAUSE frame to the transmitting end, the transmitting end then halts transmissions for the interval specified by the PAUSE frame. Note that this pause blocks all traffic from the transmitting side, even traffic destined for computers which are not overloaded. This can induce a head of line blocking condition, where one overloaded computer on a switch effectively slows down all other computers. Most switch vendors recommend that Ethernet flow-control be disabled for inter switch links, and at most be used on ports which are directly connected to computers. Even in this setup head of line blocking can still occur, and thus it is advisable to disable Ethernet flow-control. Higher level protocols such as TCP/IP and Coherence TCMP include their own flow-control mechanisms which are not subject to head of line blocking, and also negate the need for the lower level flow-control.

See http://www.networkworld.com/netresources/0913flow2.html for more details on this subject

6.2.5 Path MTU

By default Coherence assumes a 1500 byte network MTU, and uses a default packet size of 1468 based on this assumption. Having a packet size which does not fill the MTU results in an under used network. If your equipment uses a different MTU, then configure Coherence by specifying the <packet-size> setting, which is 32 bytes smaller then the network path's minimal MTU.

If you are unsure of your equipment's MTU along the full path between nodes, you can use either the standard ping or traceroute utilities to determine the MTU. For example, execute a series of ping or traceroute operations between the two computers. With each attempt, specify a different packet size, starting from a high value and progressively moving downward until the packets start to make it through without fragmentation.

On Linux execute:

ping -c 3 -M do -s 1468 serverb

On Solaris execute:

traceroute -F serverb 1468

On Windows execute:

ping -n 3 -f -l 1468 serverb

On other operating systems: Consult the documentation for the ping or traceroute command to see how to disable fragmentation, and specify the packet size.

If you receive a message stating that packets must be fragmented then the specified size is larger then the path's MTU. Decrease the packet size until you find the point at which packets can be transmitted without fragmentation. If you find that you must use packets smaller then 1468, you may want to contact your network administrator to get the MTU increased to at least 1500.

6.3 JVM Tuning

6.3.1 Basic Sizing Recommendation

The recommendations in this section are sufficient for general use cases and require minimal setup effort. The primary issue to consider when sizing your JVMs is a balance of available RAM versus garbage collection (GC) pause times.

Cache Servers

The standard, safe recommendation for Coherence cache servers is to run a fixed size heap of up to 1GB. In addition, use an incremental garbage collector to minimize GC pause durations. Lastly, run all Coherence JVMs in server mode, by specifying the -server on the JVM command line. This allows for several performance optimizations for long running applications.

For example:

java -server -Xms1g -Xmx1g -Xincgc -Xloggc: -cp coherence.jar com.tangosol.net.DefaultCacheServer

This sizing allows for good performance without the need for more elaborate JVM tuning. For more information on garbage collection, see "GC Monitoring & Tuning".


It is possible to run cache servers with larger heap sizes; though it becomes more important to monitor and tune the JVMs to minimize the GC pauses. It may also be necessary to alter the storage ratios such that the amount of scratch space is increased to facilitate faster GC compactions. Additionally it is recommended that you make use of an up to date JVM version such as HotSpot 1.6 as it includes significant improvements for managing large heaps. See "Heap Size Considerations".

TCMP Clients

Coherence TCMP clients should be configured similarly to cache servers as long GCs could cause them to be misidentified as being terminated.

Extends Clients

Coherence Extend clients are not technically speaking cluster members and, as such, the effect of long GCs is less detrimental. For extend clients it is recommended that you follow the existing guidelines as set forth by the application in which you are embedding coherence.

6.3.2 Heap Size Considerations

Use this section to decide:

  • How many CPUs are need for your system

  • How much memory is need for each system

  • How many JVMs to run per system

  • How much heap to configure with each JVM

Since all applications are different, this section should be read as guidelines. You must answer the following questions to choose the configuration that is right for you:

  • How much data is to be stored in Coherence caches?

  • What are the application requirements in terms of latency and throughput?

  • How CPU or Network intensive is the application?

Sizing is an imprecise science. There is no substitute for frequent performance and stress testing.

The following topics are included in this section: General Guidelines

Running with a fixed sized heap saves the JVM from having to grow the heap on demand and results in improved performance. To specify a fixed size heap use the -Xms and -Xmx JVM options, setting them to the same value. For example:

java -server -Xms4G -Xmx4G ...

A JVM process consumes more system memory then the specified heap size. The heap size settings specify the amount of heap which the JVM makes available to the application, but the JVM itself also consume additional memory. The amount consumed differs depending on the operating system, and JVM settings. For instance, a HotSpot JVM running on Linux configured with a 1GB JVM consumes roughly 1.2GB of RAM. It is important to externally measure the JVMs memory utilization to ensure that RAM is not over committed. Tools such as top, vmstat, and Task Manager are useful in identifying how much RAM is actually being used.

Storage Ratios

The basic recommendation for how much data can be stored within a cache server of a given size is to use a 1/3rd of the heap for primary cache storage. This leaves another 1/3rd for backup storage and the final 1/3rd for scratch space. Scratch space is then used for things such as holding classes, temporary objects, network transfer buffers, and GC compaction. You may instruct Coherence to limit primary storage on a per-cache basis by configuring the <high-units> element and specifying a BINARY value for the <unit-calculator> element. These settings are automatically applied to backup storage as well.

Ideally, both the primary and backup storage also fits within the JVMs tenured space (for HotSpot based JVMs). See HotSpot's Tuning Garbage Collection guide for details on sizing the collectors generations:


Cache Topologies and Heap Size

For large data sets, partitioned or near caches are recommended. Varying the number of Coherence JVMs does not significantly affect cache performance because the scalability of the partitioned cache is linear for both reading and writing. Using a replicated cache puts significant pressure on GC.

Planning Capacity for Data Grid Operations

Data grid operations (such as queries and aggregations) have additional heap space requirements and their use must be planned for accordingly. During data grid operations, binary representations of the cache entries that are part of the result set are held in-memory. Depending on the operation, the entries may also be held in deserialized form for the duration of the operation. Typically, this doubles the amount of memory for each entry in the result set. In addition, a second binary copy is maintained when using RAM or flash journaling as the backing map implementation due to differences in how the objects are stored. The second binary copy is also held for the duration of the operation and increases the total memory for each entry in the result set by 3x.

Heap memory usage depends on the size of the result set on which the operations are performed and the number of concurrent requests being handled. The result set size is affected by both the total number of entries as well as the size of each entry. Moderately sized result sets that are maintained on a storage cache server would not likely exhaust the heap's memory. However, if the result set is sufficiently large, the additional memory requirements can cause the heap to run out of memory. Data grid aggregate operations typically involve larger data sets and are more likely to exhaust the available memory than other operations.

The JVMs heap size should be increased on storage enabled cache servers whenever large result sets are expected. If a 1/3rd of the heap has been reserved for scratch space (as recommended above), ensure the scratch space that is allocated can support the projected result set sizes. Alternatively, data grid operations can use the PartitionedFilter API. The API reduces memory consumption by executing grid operations against individual partition sets. See Oracle Coherence Java API Reference for details on using this API.

Deciding How Many JVMs to Run Per System

The number of JVMs (nodes) to run per system depends on the system's number of processors/cores and amount of memory. As a starting point, plan to run one JVM for every two cores. This recommendation balances the following factors:

  • Multiple JVMs per server allow Coherence to make more efficient use of network resources. Coherence's packet-publisher and packet-receiver have a fixed number of threads per JVM; as you add cores, you'll want to add JVMs to scale across these cores.

  • Too many JVMs increases contention and context switching on processors.

  • Too few JVMs may not be able to handle available memory and may not fully use the NIC.

  • Especially for larger heap sizes, JVMs must have available processing capacity to avoid long GC pauses.

Depending on your application, you can add JVMs up toward one per core. The recommended number of JVMs and amount of configured heap may also vary based on the number of processors/cores per socket and on the computer architecture.

Sizing Your Heap

When considering heap size, it is important to find the right balance. The lower bound is determined by per-JVM overhead (and also, manageability of a potentially large number of JVMs). For example, if there is a fixed overhead of 100MB for infrastructure software (for example, JMX agents, connection pools, internal JVM structures), then the use of JVMs with 256MB heap sizes results in close to 40% overhead for non-cache data. The upper bound on JVM heap size is governed by memory management overhead, specifically the maximum duration of GC pauses and the percentage of CPU allocated to GC (and other memory management tasks).

GC can affect the following:

  • The latency of operations against Coherence. Larger heaps cause longer and less predictable latency than smaller heaps.

  • The stability of the cluster. With very large heaps, lengthy long garbage collection pauses can trick TCMP into believing a cluster member is terminated since the JVM is unresponsive during GC pauses. Although TCMP takes GC pauses into account when deciding on member health, at some point it may decide the member is terminated.

The following guidelines are provided:

  • For Sun 1.6 JVM or JRockit, allocate up to a 4GB heap. Also, use Sun's Concurrent Mark and Sweep GC or JRockit's Deterministic GC.

The length of a GC pause scales worse than linearly to the size of the heap. That is, if you double the size of the heap, pause times due to GC more than double ( in general). GC pauses are also impacted by application usage:

  • Pause times increase as the amount of live data in the heap increases. Do not exceed 70% live data in your heap. This includes primary data, backup data, indexes, and application data.

  • High object allocation rates increase pause times. Even "simple" Coherence applications can cause high object allocation rates since every network packet generates many objects.

  • CPU-intensive computations increase contention and may also contribute to higher pause times.

Depending on your latency requirements, you can increase allocated heap space beyond the above recommendations, but be sure to stress test your system. Moving the Cache Out of the Application Heap

Using dedicated Coherence cache server instances for Partitioned cache storage minimizes the heap size of application JVMs because the data is no longer stored locally. As most Partitioned cache access is remote (with only 1/N of data being held locally), using dedicated cache servers does not generally impose much additional overhead. Near cache technology may still be used, and it generally has a minimal impact on heap size (as it is caching an even smaller subset of the Partitioned cache). Many applications are able to dramatically reduce heap sizes, resulting in better responsiveness.

Local partition storage may be enabled (for cache servers) or disabled (for application server clients) with the tangosol.coherence.distributed.localstorage Java property (for example, -Dtangosol.coherence.distributed.localstorage=false).

It may also be disabled by modifying the <local-storage> setting in the tangosol-coherence.xml (or tangosol-coherence-override.xml) file as follows:

Example 6-3 Disabling Partition Storage

<?xml version='1.0'?>

<coherence xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   coherence-operational-config coherence-operational-config.xsd">
         id value must match what's in tangosol-coherence.xml for DistributedCache
         <service id="3">
               <init-param id="4">
                  <param-value system-property="tangosol.coherence.distributed.

At least one storage-enabled JVM must be started before any storage-disabled clients access the cache.

6.3.3 GC Monitoring & Tuning

Lengthy GC pause times can negatively impact the Coherence cluster and are typically indistinguishable from node termination. A Java application cannot send or receive packets during these pauses. As for receiving the operating system buffered packets, the packets may be discarded and must be retransmitted. For these reasons, it is very important that cluster nodes are sized and tuned to ensure that their GC times remain minimal. As a good rule of thumb, a node should spend less than 10% of its time paused in GC, normal GC times should be under 100ms, and maximum GC times should be around 1 second.

GC activity can be monitored in many ways; some standard mechanisms include:

  • JVM switch -verbose:gc

  • JVM switch -Xloggc: (similar to verbose GC but includes timestamps)

  • JVM switch -Xprof: (profiling information- profiling activities should be distinguished between testing and production deployments and its effects on resources and performance should be monitored as well)

  • Over JMX using tools such as JConsole

Log messages are generated when one cluster node detects that another cluster node has been unresponsive for a period, generally indicating that a target cluster node was in a GC cycle.

Example 6-4 Message Indicating Target Cluster Node is in Garbage Collection Mode

Experienced a 4172 ms communication delay (probable remote GC) with Member(Id=7, Timestamp=2006-10-20 12:15:47.511, Address=, MachineId=13838); 320 packets rescheduled, PauseRate=0.31, Threshold=512

PauseRate indicates the percentage of time for which the node has been considered unresponsive since the statistics were last reset. Nodes reported as unresponsive for more then a few percent of their lifetime may be worth investigating for GC tuning.

6.4 Coherence Network Tuning

Coherence includes configuration elements for throttling the amount of traffic it places on the network; see the documentation for <traffic-jam> and <flow-control>. These settings are used to control the rate of packet flow within and between cluster nodes.

6.4.1 Validation

To determine how these settings are affecting performance, you must check if you're cluster nodes are experiencing packet loss, or packet duplication, or both. This can be obtained by looking at the following JMX statistics on various cluster nodes:

  • ClusterNodeMBean.PublisherSuccessRate—If less then 1.0, packets are being detected as lost and being resent. Rates below 0.98 may warrant investigation and tuning.

  • ClusterNodeMBean.ReceiverSuccessRate—If less then 1.0, the same packet is being received multiple times (duplicates), likely due to the publisher being overly aggressive in declaring packets as lost.

  • ClusterNodeMBean.WeakestChannel—Identifies the remote cluster node which the current node is having most difficulty communicating with.

See Oracle Coherence Management Guide for information on using JMX to monitor Coherence.

6.5 Data Access Patterns

6.5.1 Data Access Distribution (hot spots)

When caching a large data set, typically a small portion of that data set is responsible for most data accesses. For example, in a 1000 object datasets, 80% of operations may be against a 100 object subset. The remaining 20% of operations may be against the other 900 objects. Obviously the most effective return on investment is gained by caching the 100 most active objects; caching the remaining 900 objects provides 25% more effective caching while requiring a 900% increase in resources.

However, if every object is accessed equally often (for example in sequential scans of the datasets), then caching requires more resources for the same level of effectiveness. In this case, achieving 80% cache effectiveness would require caching 80% of the datasets versus 10%. (Note that sequential scans of partially cached data sets generally defeat MRU, LFU and MRU-LFU eviction policies). In practice, most non-synthetic (benchmark) data access patterns are uneven, and respond well to caching subsets of data.

In cases where a subset of data is active, and a smaller subset is particularly active, Near caching can be very beneficial when used with the "all" invalidation strategy (this is effectively a two-tier extension of the above rules).

6.5.2 Cluster-node Affinity

Coherence's Near cache technology transparently takes advantage of cluster-node affinity, especially when used with the "present" invalidation strategy. This topology is particularly useful when used with a sticky load-balancer. Note that the "present" invalidation strategy results in higher overhead (as opposed to "all") when the front portion of the cache is "thrashed" (very short lifespan of cache entries); this is due to the higher overhead of adding/removing key-level event listeners. In general, a cache should be tuned to avoid thrashing and so this is usually not an issue.

6.5.3 Read/Write Ratio and Data Sizes

Generally speaking, the following cache topologies are best for the following use cases:

  • Replicated cache—small amounts of read-heavy data (for example, metadata)

  • Partitioned cache—large amounts of read/write data (for example, large data caches)

  • Near cache—similar to Partitioned, but has further benefits from read-heavy tiered access patterns (for example, large data caches with hotspots) and "sticky" data access (for example, sticky HTTP session data). Depending on the synchronization method (expiry, asynchronous, synchronous), the worst case performance may range from similar to a Partitioned cache to considerably worse.

6.5.4 Interleaving Cache Reads and Writes

Interleaving refers to the number of cache reads between each cache write. The Partitioned cache is not affected by interleaving (as it is designed for 1:1 interleaving). The Replicated and Near caches by contrast are optimized for read-heavy caching, and prefer a read-heavy interleave (for example, 10 reads between every write). This is because they both locally cache data for subsequent read access. Writes to the cache forces these locally cached items to be refreshed, a comparatively expensive process (relative to the near-zero cost of fetching an object off the local memory heap). Note that with the Near cache technology, worst-case performance is still similar to the Partitioned cache; the loss of performance is relative to best-case scenarios.

Note that interleaving is related to read/write ratios, but only indirectly. For example, a Near cache with a 1:1 read/write ratio may be extremely fast (all writes followed by all reads) or much slower (1:1 interleave, write-read-write-read...).