Skip Headers
Oracle® Coherence Developer's Guide
Release 3.6.1

Part Number E15723-03
Go to Documentation Home
Home
Go to Book List
Book List
Go to Table of Contents
Contents
Go to Feedback page
Contact Us

Go to previous page
Previous
Go to next page
Next
View PDF

45 Performance Tuning

To achieve maximum performance with Coherence it is suggested that you test and tune your operating environment. Testing is covered in Chapter 44, "Performing a Datagram Test for Network Performance."

Tuning recommendations are available for:

Operating System Tuning

Socket Buffer Sizes

To help minimization of packet loss, the operating system socket buffers need to be large enough to handle the incoming network traffic while your Java application is paused during garbage collection. By default Coherence will attempt to allocate a socket buffer of 2MB. If your operating system is not configured to allow for large buffers Coherence will use smaller buffers. Most versions of UNIX have a very low default buffer limit, which should be increased to at least 2MB.

Starting with Coherence 3.1 you will receive the following warning if the operating system failed to allocate the full size buffer.

Example 45-1 Message Indicating OS Failed to Allocate the Full Buffer Size

UnicastUdpSocket failed to set receive buffer size to 1428 packets (2096304 bytes); actual size is 89 packets (131071 bytes). Consult your OS documentation regarding increasing the maximum socket buffer size. Proceeding with the actual value may cause sub-optimal performance.

Though it is safe to operate with the smaller buffers it is recommended that you configure your operating system to allow for larger buffers.

On Linux execute (as root):

sysctl -w net.core.rmem_max=2096304
sysctl -w net.core.wmem_max=2096304

On Solaris execute (as root):

ndd -set /dev/udp udp_max_buf 2096304 

On AIX execute (as root):

no -o rfc1323=1
no -o sb_max=4194304

Note:

Note that AIX only supports specifying buffer sizes of 1MB, 4MB, and 8MB. Additionally there is an issue with IBM's 1.5 JVMs which may prevent them from allocating socket buffers larger then 64K. This issue has been addressed in IBM's 1.5 SR3 SDK.

On Windows:

Windows does not impose a buffer size restriction by default.

Other:

For information on increasing the buffer sizes for other operating systems please refer to your operating system's documentation.

You may configure Coherence to request alternate sized buffers for packet publishers and unicast listeners by using the coherence/cluster-config/packet-publisher/packet-buffer/maximum-packets and coherence/cluster-config/unicast-listener/packet-buffer/maximum-packets elements. For more information, see "packet-buffer".

High Resolution timesource (Linux)

Linux has several high resolution timesources to choose from, the fastest TSC (Time Stamp Counter) unfortunately is not always reliable. Linux chooses TSC by default, and during boot checks for inconsistencies, if found it switches to a slower safe timesource. The slower time sources can be 10 to 30 times more expensive to query then the TSC timesource, and may have a measurable impact on Coherence performance. Note that Coherence and the underlying JVM are not aware of the timesource which the operating system is using. It is suggested that you check your system logs (/var/log/dmesg) to verify that the following is not present. Example 45-2 illustrates a sample timesource log.

Example 45-2 Log Message from a Linux Timesource

kernel: Losing too many ticks!
kernel: TSC cannot be used as a timesource.
kernel: Possible reasons for this are:
kernel:   You're running with Speedstep,
kernel:   You don't have DMA enabled for your hard disk (see hdparm),
kernel:   Incorrect TSC synchronization on an SMP system (see dmesg).
kernel: Falling back to a sane timesource now.

As the log messages suggest, this can be caused by a variable rate CPU (SpeedStep), having DMA disabled, or incorrect TSC synchronization on multi CPU machines. If present it is suggested that you work with your system administrator to identify and correct the cause allowing the TSC timesource to be utilized.In some older versions of Linux there is a bug related to unsynchronized TSCs in multiprocessor systems. This bug can result in the clock apparently jumping forward by 4398 seconds, and then immediately jumping back to the correct time. This bug can trigger erroneous Coherence packet transmission timeouts resulting in nodes being incorrectly removed from the cluster. If you experience cluster disconnects along with a warning of a "4398 second" timeout it is suggested that you upgrade you Linux kernel. An example of such a log warning is as follows:

A potential communication problem has been detected. A packet has failed to be delivered (or acknowledged) after 4398 seconds ...

See the following resource for more details on this Linux TSC issue:

http://lkml.org/lkml/2007/8/23/96

https://bugzilla.redhat.com/show_bug.cgi?id=452185

Datagram size (Microsoft Windows)

Microsoft Windows supports a fast IO path which is used when sending "small" datagrams. The default setting for what is considered a small datagram is 1024 bytes; increasing this value to match your network MTU (normally 1500) can significantly improve network performance.

To adjust this parameter:

  1. Run Registry Editor (regedit)

  2. Locate the following registry key HKLM\System\CurrentControlSet\Services\AFD\Parameters

  3. Add the following new DWORD value Name: FastSendDatagramThreshold Value: 1500 (decimal)

  4. Reboot

Note:

Included in Coherence 3.1 and above is an optimize.reg script which will perform this change for you, it can be found in the coherence/bin directory of your installation. After running the script you must reboot your computer for the changes to take effect.

See Appendix C of http://technet.microsoft.com/en-us/library/bb726981.aspx for more details on this parameter

Thread Scheduling (Microsoft Windows)

Windows (including NT, 2000 and XP) is optimized for desktop application usage. If you run two console ("DOS box") windows, the one that has the focus can use almost 100% of the CPU, even if other processes have high-priority threads in a running state. To correct this imbalance, you must configure the Windows thread scheduling to less-heavily favor foreground applications.

  1. Open the Control Panel.

  2. Open System.

  3. Select the Advanced tab.

  4. Under Performance select Settings.

  5. Select the Advanced tab.

  6. Under Processor scheduling, choose Background services.

Note:

Coherence includes an optimize.reg script which will perform this change for you, it can be found in the coherence/bin directory of your installation.

Swapping

Ensure that you have sufficient memory such that you are not making active use of swap space on your machines. You may monitor the swap rate using tools such as vmstat and top. If you find that you are actively moving through swap space this will likely have a significant impact on Coherence's performance. Often this will manifest itself as Coherence nodes being removed from the cluster due to long periods of unresponsiveness caused by them having been "swapped out".

Network Tuning

Network Interface Settings

Verify that your Network card (NIC) is configured to operate at it's maximum link speed and at full duplex. The process for doing this varies between operating systems.

On Linux execute (as root):

ethtool eth0

See the man page on ethtool for further details and for information on adjust the interface settings.

On Solaris execute (as root):

kstat ce:0 | grep link_

This will display the link settings for interface 0. Items of interest are link_duplex (2 = full), and link_speed which is reported in Mbps.

Note:

If running on Solaris 10, please review Sun issues 102712 and 102741 which relate to packet corruption and multicast disconnections. These will most often manifest as either EOFExceptions, "Large gap" warnings while reading packet data, or frequent packet timeouts. It is highly recommend that the patches for both issues be applied when using Coherence on Solaris 10 systems.

On Windows:

  1. Open the Control Panel.

  2. Open Network Connections.

  3. Open the Properties dialog for desired network adapter.

  4. Select Configure.

  5. Select the Advanced tab.

  6. Locate the driver specific property for Speed & Duplex.

  7. Set it to either auto or to a specific speed and duplex setting.

Bus Considerations

For 1Gb and faster PCI network cards the system's bus speed may be the limiting factor for network performance. PCI and PCI-X busses are half-duplex, and all devices will run at the speed of the slowest device on the bus. Standard PCI buses have a maximum throughput of approximately 1Gb/sec and thus are not capable of fully using a full-duplex 1Gb NIC. PCI-X has a much higher maximum throughput (1GB/sec), but can be hobbled by a single slow device on the bus. If you find that you are not able to achieve satisfactory bidirectional data rates it is suggested that you evaluate your machine's bus configuration. For instance simply relocating the NIC to a private bus may improve performance.

Network Infrastructure Settings

If you experience frequent multi-second communication pauses across multiple cluster nodes you may need to increase your switch's buffer space. These communication pauses can be identified by a series of Coherence log messages identifying communication delays with multiple nodes which are not attributable to local or remote GCs.

Example 45-3 Message Indicating a Communication Delay

Experienced a 4172 ms communication delay (probable remote GC) with Member(Id=7, Timestamp=2006-10-20 12:15:47.511, Address=192.168.0.10:8089, MachineId=13838); 320 packets rescheduled, PauseRate=0.31, Threshold=512

Some switches such as the Cisco 6500 series support configuration the amount of buffer space available to each Ethernet port or ASIC. In high load applications it may be necessary to increase the default buffer space. On Cisco this can be accomplished by executing:

fabric buffer-reserve high

See Cisco's documentation for additional details on this setting.

Ethernet Flow-Control

Full duplex Ethernet includes a flow-control feature which allows the receiving end of a point to point link to slow down the transmitting end. This is implemented by the receiving end sending an Ethernet PAUSE frame to the transmitting end, the transmitting end will then halt transmissions for the interval specified by the PAUSE frame. Note that this pause blocks all traffic from the transmitting side, even traffic destined for machines which are not overloaded. This can induce a head of line blocking condition, where one overloaded machine on a switch effectively slows down all other machines. Most switch vendors will recommend that Ethernet flow-control be disabled for inter switch links, and at most be used on ports which are directly connected to machines. Even in this setup head of line blocking can still occur, and thus it is advisable to disable Ethernet flow-control all together. Higher level protocols such as TCP/IP and Coherence TCMP include their own flow-control mechanisms which are not subject to head of line blocking, and also negate the need for the lower level flow-control.

See http://www.networkworld.com/netresources/0913flow2.html for more details on this subject

Path MTU

By default Coherence assumes a 1500 byte network MTU, and uses a default packet size of 1468 based on this assumption. Having a packet size which does not fill the MTU will result is an under used network. If your equipment uses a different MTU, then configure Coherence by specifying a packet size which is 32 bytes smaller then the network path's minimal MTU. The packet size may be specified in coherence/cluster-config/packet-publisher/packet-size/maximum-length and preferred-length configuration elements. For more information on these elements, see "packet-size".

If you are unsure of your equipment's MTU along the full path between nodes you can use either the standard ping or traceroute utilities to determine it. To do this, execute a series of ping or traceroute operations between the two machines. With each attempt you will specify a different packet size, starting from a high value and progressively moving downward until the packets start to make it through without fragmentation. You will need to specify a particular packet size, and to not fragment the packets.

On Linux execute:

ping -c 3 -M do -s 1468 serverb

On Solaris execute:

traceroute -F serverb 1468

On Windows execute:

ping -n 3 -f -l 1468 serverb

On other operating systems: Consult the documentation for the ping or traceroute command to see how to disable fragmentation, and specify the packet size.

If you receive a message stating that packets must be fragmented then the specified size is larger then the path's MTU. Decrease the packet size until you find the point at which packets can be transmitted without fragmentation. If you find that you need to use packets smaller then 1468 you may want to contact your network administrator to get the MTU increased to at least 1500.

JVM Tuning

Basic Sizing Recommendation

The recommendations in this section are sufficient for general use cases and require minimal setup effort. The primary issue to consider when sizing your JVMs is a balance of available RAM versus garbage collection (GC) pause times.

Cache Servers

The standard, safe recommendation for Coherence cache servers is to run a fixed size heap of up to 1GB. Additionally, it is recommended to utilize an incremental garbage collector to minimize GC pause durations. Lastly, run all Coherence JVMs in server mode, by specifying the -server on the JVM command line. This allows for several performance optimizations for long running applications.

For example:

java -server -Xms1g -Xmx1g -Xincgc -Xloggc: -cp coherence.jar com.tangosol.net.DefaultCacheServer

This sizing allows for good performance without the need for more elaborate JVM tuning. For more information on garbage collection, see "GC Monitoring & Tuning".

Note:

It is possible to run cache servers with larger heap sizes; though it becomes more important to monitor and tune the JVMs to minimize the GC pauses. It may also be necessary to alter the storage ratios such that the amount of scratch space is increased to facilitate faster GC compactions. Additionally it is recommended that you make use of an up to date JVM version such as HotSpot 1.6 as it includes significant improvements for managing large heaps. See "Heap Size Considerations".

TCMP Clients

Coherence TCMP clients should be configured similarly to cache servers as long GCs could cause them to be misidentified as being dead.

Extends Clients

Coherence Extend clients are not technically speaking cluster members and, as such, the effect of long GCs is less detrimental. For extend clients it is recommended that you follow the existing guidelines as set forth by the application in which you are embedding coherence.

Heap Size Considerations

This section will help you decide:

  • How many CPUs you will need for your system

  • How much memory you will need for each system

  • How many JVMs to run per system

  • How much heap to configure with each JVM

Since all applications are different, this section should be read as guidelines. You will need to answer the following questions in order to choose the configuration that is right for you:

  • How much data will be stored in Coherence caches?

  • What are the application requirements in terms of latency and throughput?

  • How CPU or Network intensive is the application?

Sizing is an imprecise science. There is no substitute for frequent performance and stress testing.

The following topics are included in this section:

General Guidelines

Running with a fixed sized heap will save your JVM from having to grow the heap on demand and will result in improved performance. To specify a fixed size heap use the -Xms and -Xmx JVM options, setting them to the same value. For example:

java -server -Xms4G -Xmx4G ...

A JVM process consumes more system memory then the specified heap size. The heap size settings specify the amount of heap which the JVM makes available to the application, but the JVM itself will also consume additional memory. The amount consumed differs depending on the OS, and JVM settings. For instance, a HotSpot JVM running on Linux configured with a 1GB JVM will consume roughly 1.2GB of RAM. It is important to externally measure the JVMs memory utilization to ensure that RAM is not over committed. Tools such as top, vmstat, and Task Manager are useful in identifying how much RAM is actually being utilized.

Storage Ratios

The basic recommendation for how much data can be stored within a cache server of a given size is to use up to 1/3rd of the heap for primary cache storage. This leaves another 1/3rd for backup storage, and the final 1/3rd for scratch space. Scratch space is then used for things such as holding classes, temporary objects, network transfer buffers, and GC compaction. You may instruct Coherence to limit primary storage on a per-cache basis by configuring the <high-units> element and specifying a BINARY value for the <unit-calculator> element. These settings are automatically applied to backup storage as well.

Ideally both the primary and backup storage will also fit within the JVMs tenured space (for HotSpot based JVMs). See HotSpot's Tuning Garbage Collection guide for details on sizing the collectors generations:

http://java.sun.com/docs/hotspot/gc5.0/gc_tuning_5.html

Cache Topologies and Heap Size

For large datasets, Partitioned or Near caches are recommended. As the scalability of the Partitioned cache is linear for both reading and writing, varying the number of Coherence JVMs will not significantly affect cache performance. Using a Replicated cache will put significant pressure on GC.

Deciding How Many JVMs to Run Per System

The number of JVMs (nodes) to run per system depends on the system's number of processors/cores and amount of memory. As a starting point, we recommend starting with a plan to run one JVM for every two cores. This recommendation balances the following factors:

  • Multiple JVMs per server allow Coherence to make more efficient use of network resources. Coherence's packet-publisher and packet-receiver have a fixed number of threads per JVM; as you add cores, you'll want to add JVMs to scale across these cores.

  • Too many JVMs will increase contention and context switching on processors.

  • Too few JVMs may not be able to handle available memory and may not fully utilize the NIC.

  • Especially for larger heap sizes, JVMs must have available processing capacity to avoid long GC pauses.

Depending on your application, you can add JVMs up toward one per core. The recommended number of JVMs and amount of configured heap may also vary based on the number of processors/cores per socket and on the machine architecture.

Sizing Your Heap

When considering heap size, it is important to find the right balance. The lower bound is determined by per-JVM overhead (and also, manageability of a potentially large number of JVMs). For example, if there is a fixed overhead of 100MB for infrastructure software (e.g. JMX agents, connection pools, internal JVM structures), then the use of JVMs with 256MB heap sizes will result in close to 40% overhead for non-cache data. The upper bound on JVM heap size is governed by memory management overhead, specifically the maximum duration of GC pauses and the percentage of CPU allocated to GC (and other memory management tasks).

GC can affect the following:

  • The latency of operations against Coherence. Larger heaps will cause longer and less predictable latency than smaller heaps.

  • The stability of the cluster. With very large heaps, lengthy long garbage collection pauses can trick TCMP into believing a cluster member is dead since the JVM is unresponsive during GC pauses. Although TCMP takes GC pauses into account when deciding on member health, at some point it may decide the member is dead.

We offer the following guidelines:

  • For Sun 1.5 JVM, we recommend not allocating more than 1GB and 2GB heap respectively.

  • For Sun 1.6 JVM or JRockit, we recommend allocating up to a 4GB heap. We also recommend using Sun's Concurrent Mark and Sweep GC or JRockit's Deterministic GC.

The length of a GC pause scales worse than linearly to the size of the heap. That is, if you double the size of the heap, pause times due to GC will, in general, more than double. GC pauses are also impacted by application usage:

  • Pause times increase as the amount of live data in the heap increases. We recommend not exceeding 70% live data in your heap. This includes primary data, backup data, indexes, and application data.

  • High object allocation rates will increase pause times. Even "simple" Coherence applications can cause high object allocation rates since every network packet generates many objects.

  • CPU-intensive computation will increase contention and may also contribute to higher pause times.

Depending on your latency requirements, you can increase allocated heap space beyond the above recommendations, but be sure to stress test your system.

Moving the Cache Out of the Application Heap

Using dedicated Coherence cache server instances for Partitioned cache storage will minimize the heap size of application JVMs as the data is no longer stored locally. As most Partitioned cache access is remote (with only 1/N of data being held locally), using dedicated cache servers does not generally impose much additional overhead. Near cache technology may still be used, and it will generally have a minimal impact on heap size (as it is caching an even smaller subset of the Partitioned cache). Many applications are able to dramatically reduce heap sizes, resulting in better responsiveness.

Local partition storage may be enabled (for cache servers) or disabled (for application server clients) with the tangosol.coherence.distributed.localstorage Java property (for example, -Dtangosol.coherence.distributed.localstorage=false).

It may also be disabled by modifying the <local-storage> setting in the tangosol-coherence.xml (or tangosol-coherence-override.xml) file as follows:

Example 45-4 Disabling Partition Storage

<!--
Example using tangosol-coherence-override.xml
-->
<coherence>
  <cluster-config>
    <services>
      <!--
      id value must match what's in tangosol-coherence.xml for DistributedCache
      service
      -->
      <service id="3">
        <init-params>
          <init-param id="4">
            <param-name>local-storage</param-name>
            <param-value system-property="tangosol.coherence.distributed.
               localstorage">false</param-value>
          </init-param>
        </init-params>
      </service>
    </services>
  </cluster-config>
</coherence>

At least one storage-enabled JVM must be started before any storage-disabled clients access the cache.

GC Monitoring & Tuning

Lengthy GC pause times can negatively impact the Coherence cluster and are, for the most part, indistinguishable from node death. During these pauses, a Java application is unable to send or receive packets and in the case of receiving the operating system buffered packets, the packets may be discarded and need to be retransmitted. For these reasons, it is very important that cluster nodes are sized and/or tuned to ensure that their GC times remain minimal. As a good rule of thumb, a node should spend less than 10% of its time paused in GC, normal GC times should be under 100ms, and maximum GC times should be around 1 second.

GC activity can be monitored in a number of ways; some standard mechanisms include:

  • JVM switch -verbose:gc

  • JVM switch -Xloggc: (similar to verbose GC but includes timestamps)

  • JVM switch -Xprof: (profiling information- profiling activities should be distinguished between testing and production deployments and its effects on resources and performance should be monitored as well)

  • Over JMX via tools such as JConsole

Log messages will be generated when one cluster node detects that another cluster node has been unresponsive for a period, generally indicating that a target cluster node was in a GC cycle.

Example 45-5 Message Indicating Target Cluster Node is in Garbage Collection Mode

Experienced a 4172 ms communication delay (probable remote GC) with Member(Id=7, Timestamp=2006-10-20 12:15:47.511, Address=192.168.0.10:8089, MachineId=13838); 320 packets rescheduled, PauseRate=0.31, Threshold=512

PauseRate indicates the percentage of time for which the node has been considered unresponsive since the statistics were last reset. Nodes reported as unresponsive for more then a few percent of their lifetime may be worth investigating for GC tuning.

Coherence Network Tuning

Coherence includes configuration elements for throttling the amount of traffic it will place on the network; see the documentation for <traffic-jam> and <flow-control>, these settings are used to control the rate of packet flow within and between cluster nodes.

Validation

To determine how these settings are affecting performance you need to check if you're cluster nodes are experiencing packet loss and/or packet duplication. This can be obtained by looking at the following JMX statistics on various cluster nodes:

  • ClusterNodeMBean.PublisherSuccessRate—If less then 1.0, packets are being detected as lost and being resent. Rates below 0.98 may warrant investigation and tuning.

  • ClusterNodeMBean.ReceiverSuccessRate—If less then 1.0, the same packet is being received multiple times (duplicates), likely due to the publisher being overly aggressive in declaring packets as lost.

  • ClusterNodeMBean.WeakestChannel—Identifies the remote cluster node which the current node is having most difficulty communicating with.

For information on using JMX to monitor Coherence see Chapter 34, "How to Manage Coherence Using JMX."

Data Access Patterns

Data Access Distribution (hot spots)

When caching a large data set, typically a small portion of that data set will be responsible for most data accesses. For example, in a 1000 object datasets, 80% of operations may be against a 100 object subset. The remaining 20% of operations may be against the other 900 objects. Obviously the most effective return on investment will be gained by caching the 100 most active objects; caching the remaining 900 objects will provide 25% more effective caching while requiring a 900% increase in resources.

However, if every object is accessed equally often (for example in sequential scans of the datasets), then caching will require more resources for the same level of effectiveness. In this case, achieving 80% cache effectiveness would require caching 80% of the datasets versus 10%. (Note that sequential scans of partially cached data sets will generally defeat MRU, LFU and MRU-LFU eviction policies). In practice, almost all non-synthetic (benchmark) data access patterns are uneven, and will respond well to caching subsets of data.

In cases where a subset of data is active, and a smaller subset is particularly active, Near caching can be very beneficial when used with the "all" invalidation strategy (this is effectively a two-tier extension of the above rules).

Cluster-node Affinity

Coherence's Near cache technology will transparently take advantage of cluster-node affinity, especially when used with the "present" invalidation strategy. This topology is particularly useful when used with a sticky load-balancer. Note that the "present" invalidation strategy results in higher overhead (as opposed to "all") when the front portion of the cache is "thrashed" (very short lifespan of cache entries); this is due to the higher overhead of adding/removing key-level event listeners. In general, a cache should be tuned to avoid thrashing and so this is usually not an issue.

Read/Write Ratio and Data Sizes

Generally speaking, the following cache topologies are best for the following use cases:

  • Replicated cache—small amounts of read-heavy data (for example, metadata)

  • Partitioned cache—large amounts of read/write data (for example, large data caches)

  • Near cache—similar to Partitioned, but has further benefits from read-heavy tiered access patterns (for example, large data caches with hotspots) and "sticky" data access (for example, sticky HTTP session data). Depending on the synchronization method (expiry, asynchronous, synchronous), the worst case performance may range from similar to a Partitioned cache to considerably worse.

Interleaving Cache Reads and Writes

Interleaving refers to the number of cache reads between each cache write. The Partitioned cache is not affected by interleaving (as it is designed for 1:1 interleaving). The Replicated and Near caches by contrast are optimized for read-heavy caching, and prefer a read-heavy interleave (for example, 10 reads between every write). This is because they both locally cache data for subsequent read access. Writes to the cache will force these locally cached items to be refreshed, a comparatively expensive process (relative to the near-zero cost of fetching an object off the local memory heap). Note that with the Near cache technology, worst-case performance is still similar to the Partitioned cache; the loss of performance is relative to best-case scenarios.

Note that interleaving is related to read/write ratios, but only indirectly. For example, a Near cache with a 1:1 read/write ratio may be extremely fast (all writes followed by all reads) or much slower (1:1 interleave, write-read-write-read...).