4 Performance Tuning

A critical part of successfully deploying Coherence solutions is to tune the application and production environment to achieve maximum performance. This chapter includes many guidelines and considerations that can be used to tune performance and possibly identify performance issues. However, as with any production deployment, always run performance and stress tests to validate how a particular application and production environment performs.

This chapter includes the following sections:

  • Operating System Tuning

  • Network Tuning

  • JVM Tuning

  • Data Access Patterns

Operating System Tuning

Operating system settings, such as socket buffers, thread scheduling, and disk swapping can be configured to minimize latency.

This section includes the following topics:

  • Socket Buffer Sizes

  • High Resolution timesource (Linux)

  • Datagram size (Microsoft Windows)

  • TCP Retransmission Timeout (Microsoft Windows)

  • Thread Scheduling (Microsoft Windows)

  • Swapping

  • Load Balancing Network Interrupts (Linux)

Socket Buffer Sizes

Large operating system socket buffers can help minimize packet loss during garbage collection. Each Coherence socket implementation attempts to allocate a default socket buffer size. A warning message is logged for each socket implementation if the default size cannot be allocated. The following example is a message for the inbound UDP socket buffer:

UnicastUdpSocket failed to set receive buffer size to 16 packets (1023KB); actual
size is 12%, 2 packets (127KB). Consult your OS documentation regarding increasing
the maximum socket buffer size. Proceeding with the actual value may cause
sub-optimal performance.

It is recommended that you configure the operating system to allow for larger buffers. However, alternate buffer sizes for Coherence packet publishers and unicast listeners can be configured using the <packet-buffer> element. See Configuring the Size of the Packet Buffers in Developing Applications with Oracle Coherence.

Note:

Most versions of UNIX have a very low default buffer limit, which should be increased to at least 2MB. Also note that the UDP recommendations are only applicable to configurations which explicitly configure UDP in favor of TCP, as TCP is the default for performance-sensitive tasks.

On Linux, execute (as root):

sysctl -w net.core.rmem_max=2097152
sysctl -w net.core.wmem_max=2097152

On Solaris, execute (as root):

ndd -set /dev/udp udp_max_buf 2097152 

On AIX, execute (as root):

no -o rfc1323=1
no -o sb_max=4194304

Note:

AIX only supports specifying buffer sizes of 1MB, 4MB, and 8MB.

On Windows:

Windows does not impose a buffer size restriction by default.

Other:

For information on increasing the buffer sizes for other operating systems, refer to your operating system's documentation.

High Resolution timesource (Linux)

Linux has several high resolution timesources to choose from; the fastest, TSC (Time Stamp Counter), is unfortunately not always reliable. Linux chooses TSC by default and checks for inconsistencies during startup; if inconsistencies are found, it switches to a slower, safe timesource. The slower timesources can be 10 to 30 times more expensive to query than the TSC timesource and may have a measurable impact on Coherence performance. For more details on TSC, see

https://lwn.net/Articles/209101/

Note that Coherence and the underlying JVM are not aware of the timesource which the operating system is using. It is suggested that you check your system logs (/var/log/dmesg) to verify that messages such as the following are not present.

kernel: Losing too many ticks!
kernel: TSC cannot be used as a timesource.
kernel: Possible reasons for this are:
kernel:   You're running with Speedstep,
kernel:   You don't have DMA enabled for your hard disk (see hdparm),
kernel:   Incorrect TSC synchronization on an SMP system (see dmesg).
kernel: Falling back to a sane timesource now.

As the log messages suggest, this can be caused by a variable rate CPU (SpeedStep), having DMA disabled, or incorrect TSC synchronization on multi-CPU computers. If present, work with your system administrator to identify and correct the cause, allowing the TSC timesource to be used.

Datagram size (Microsoft Windows)

Microsoft Windows supports a fast I/O path which is used when sending "small" datagrams. The default setting for what is considered a small datagram is 1024 bytes; increasing this value to match your network maximum transmission unit (MTU), normally 1500, can significantly improve network performance.

To adjust this parameter:

  1. Run Registry Editor (regedit).
  2. Locate the following registry key: HKLM\System\CurrentControlSet\Services\AFD\Parameters
  3. Add a new DWORD value with Name: FastSendDatagramThreshold and Value: 1500 (decimal).
  4. Restart the computer.

Note:

The COHERENCE_HOME/bin/optimize.reg script can also perform this change. After running the script, restart the computer for the changes to take effect.

TCP Retransmission Timeout (Microsoft Windows)

Microsoft Windows includes a TCP retransmission timeout that is used for existing and new connections. The default retransmission timeout can abandon connections in a matter of seconds based on the Windows automatic tuning for TCP data transmission on the network. The short timeout can cause the TcpRing process to falsely detect cluster member death and can result in data loss. The default retransmission timeout can be configured to be more tolerant of short outages that may occur on the production network.

To increase the TCP retransmission timeout:

  1. Run Registry Editor (regedit).
  2. Locate the following registry key: HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
  3. Add a new DWORD value with Name: TcpMaxConnectRetransmissions and Value: 00000015 (Hex).
  4. Add a new DWORD value with Name: TcpMaxDataRetransmissions and Value: 00000015 (Hex).
  5. Restart the computer.

Note:

The COHERENCE_HOME/bin/optimize.reg script can also perform this change. After running the script, restart the computer for the changes to take effect.

Thread Scheduling (Microsoft Windows)

Windows is optimized for desktop application usage. If you run two console ("DOS box") windows, the one that has the focus can use almost 100% of the CPU, even if other processes have high-priority threads in a running state. To correct this imbalance, you must configure the Windows thread scheduling to favor foreground applications less heavily.

  1. Open the Control Panel.
  2. Open System.
  3. Select the Advanced tab.
  4. Under Performance select Settings.
  5. Select the Advanced tab.
  6. Under Processor scheduling, choose Background services.

Note:

The COHERENCE_HOME/bin/optimize.reg script performs this change. After running the script, restart the computer for the changes to take effect.

Swapping

Swapping, also known as paging, is the use of secondary storage to store and retrieve application data for use in RAM memory. Swapping is automatically performed by the operating system and typically occurs when the available RAM memory is depleted. Swapping can have a significant impact on Coherence's performance and should be avoided. Often, swapping manifests itself as Coherence nodes being removed from the cluster due to long periods of unresponsiveness caused by them having been swapped out of RAM. See Avoid using virtual memory (paging to disk).

To avoid swapping, ensure that sufficient RAM is available on the computer and that the running processes are accounted for so that they do not consume all of the available RAM. Tools such as vmstat and top (on UNIX and Linux) and taskmgr (on Windows) should be used to monitor swap rates.

Swappiness in Linux

Linux, by default, may choose to swap out a process or some of its heap due to low usage, even if it is not running low on RAM, so that memory is ready to handle future memory requests. This behavior should be avoided for Coherence JVMs. The swappiness setting on Linux is a value between 0 and 100, where higher values encourage more aggressive swapping. The default value is 60. For Coherence, a lower value (0 if possible) should always be set.

To see the current swappiness value that is set, enter the following at the command prompt:

cat /proc/sys/vm/swappiness

To temporarily set the swappiness, as the root user echo a value onto /proc/sys/vm/swappiness. The following example sets the value to 0.

echo 0 > /proc/sys/vm/swappiness

To set the value permanently, modify the /etc/sysctl.conf file as follows:

vm.swappiness = 0

Load Balancing Network Interrupts (Linux)

Linux kernels have the ability to balance hardware interrupt requests across multiple CPUs or CPU cores. The feature is referred to as SMP IRQ Affinity and results in better system performance as well as better CPU utilization. For Coherence, significant performance can be gained by balancing ethernet card interrupts for all servers that host cluster members. Most Linux distributions also support irqbalance, which is aware of the cache topologies and power management features of modern multi-core and multi-socket systems.

Most Linux installations are not configured to balance network interrupts. The default network interrupt behavior uses a single processor (typically CPU0) to handle all network interrupts and can become a serious performance bottleneck with high volumes of network traffic. Balancing network interrupts among multiple CPUs increases the performance of network-based operations.

For detailed instructions on how to configure SMP IRQ Affinity, see SMP IRQ Affinity. The information is summarized below:

Note:

If you are running on a cloud based Virtual Machine, such as on Oracle Cloud Infrastructure, it is not possible to apply the changes below because the hypervisor is not accessible.

For other cloud providers, refer to their documentation.

To view a list of the system's IRQs that includes which device they are assigned to and the number of interrupts each processor has handled for the device, run the following command:

# cat /proc/interrupts

The following example output snippet shows a single network interface card where all interrupts have been handled by the same processor (CPU0). This particular network card has multiple transmit and receive queues which have their own assigned IRQ. Systems that use multiple network cards will have additional IRQs assigned for each card.

         CPU0       CPU1       CPU2       CPU3
65:      20041      0          0          0      IR-PCI-MSI-edge  eth0-tx-0
66:      20232      0          0          0      IR-PCI-MSI-edge  eth0-tx-1
67:      20105      0          0          0      IR-PCI-MSI-edge  eth0-tx-2
68:      20423      0          0          0      IR-PCI-MSI-edge  eth0-tx-3
69:      21036      0          0          0      IR-PCI-MSI-edge  eth0-rx-0
70:      20201      0          0          0      IR-PCI-MSI-edge  eth0-rx-1
71:      20587      0          0          0      IR-PCI-MSI-edge  eth0-rx-2
72:      20853      0          0          0      IR-PCI-MSI-edge  eth0-rx-3

The goal is to have the interrupts balanced across the 4 processors instead of just a single processor. Ideally, the overall utilization of the processors on the system should also be evaluated to determine which processors can handle additional interrupts. Use mpstat to view statistics for a system's processors. The statistics show which processors are being over utilized and which are being under utilized and help determine the best ways to balance the network interrupts across the CPUs.

SMP IRQ affinity is configured in an smp_affinity file. Each IRQ has its own smp_affinity file that is located in the /proc/irq/irq_#/ directory. To see the current affinity setting for an IRQ (for example 65), run:

# cat /proc/irq/65/smp_affinity

The returned hexadecimal value is a bitmask and represents the processors to which interrupts on IRQ 65 are routed. Each place in the value represents a group of 4 CPUs. For a 4 processor system, the hexadecimal value to represent a group of all four processors is f (or 15) and is 00000f as mapped below:

            Binary       Hex 
    CPU 0    0001         1
    CPU 1    0010         2
    CPU 2    0100         4
    CPU 3    1000         8
    -----------------------
    all      1111         f

To target a single processor or group of processors, the bitmask must be changed to the appropriate hexadecimal value. Based on the system in the example above, to direct all interrupts on IRQ 65 to CPU1 and all interrupts on IRQ 66 to CPU2, change the smp_affinity files as follows:

echo 000002 > /proc/irq/65/smp_affinity # eth0-tx-0
echo 000004 > /proc/irq/66/smp_affinity # eth0-tx-1

To direct all interrupts on IRQ 65 to both CPU1 and CPU2, change the smp_affinity file as follows:

echo 000006 > /proc/irq/65/smp_affinity # eth0-tx-0

To direct all interrupts on each IRQ to all CPUs, change the smp_affinity files as follows:

echo 00000f > /proc/irq/65/smp_affinity # eth0-tx-0
echo 00000f > /proc/irq/66/smp_affinity # eth0-tx-1
echo 00000f > /proc/irq/67/smp_affinity # eth0-tx-2
echo 00000f > /proc/irq/68/smp_affinity # eth0-tx-3
echo 00000f > /proc/irq/69/smp_affinity # eth0-rx-0
echo 00000f > /proc/irq/70/smp_affinity # eth0-rx-1
echo 00000f > /proc/irq/71/smp_affinity # eth0-rx-2
echo 00000f > /proc/irq/72/smp_affinity # eth0-rx-3

Network Tuning

Network settings, such as network link speeds, Ethernet flow-control, and path MTU can be configured to maximize network throughput.

This section includes the following topics:

  • Network Interface Settings

  • Network Infrastructure Settings

  • Switch and Subnet Considerations

  • Ethernet Flow-Control

  • Path MTU

  • 10GbE Considerations

  • TCP Considerations

Network Interface Settings

Verify that your network interface card (NIC) is configured to operate at its maximum link speed and at full duplex. The process for doing this varies between operating systems.

On Linux execute (as root):

ethtool eth0

See the ethtool man page for further details and for information on adjusting the interface settings.

On Solaris execute (as root):

kstat ce:0 | grep link_

This displays the link settings for interface 0. Items of interest are link_duplex (2 = full), and link_speed which is reported in Mbps.

Note:

If running on Solaris 10, review issues 1000972.1 and 1000940.1, which relate to packet corruption and multicast disconnections. These often manifest as either EOFExceptions, "Large gap" warnings while reading packet data, or frequent packet timeouts. It is highly recommended that the patches for both issues be applied when using Coherence on Solaris 10 systems.

On Windows:

  1. Open the Control Panel.

  2. Open Network Connections.

  3. Open the Properties dialog for the desired network adapter.

  4. Select Configure.

  5. Select the Advanced tab.

  6. Locate the driver specific property for Speed & Duplex.

  7. Set it to either auto or to a specific speed and duplex setting.

Network Infrastructure Settings

If you experience frequent multi-second communication pauses across multiple cluster nodes, try increasing your switch's buffer space. These communication pauses can be identified by a series of Coherence log messages identifying communication delays with multiple nodes which are not attributable to local or remote GCs.

Example 4-1 Message Indicating a Communication Delay

Experienced a 4172 ms communication delay (probable remote GC) with Member(Id=7, Timestamp=2006-10-20 12:15:47.511, Address=192.168.0.10:8089, MachineId=13838); 320 packets rescheduled, PauseRate=0.31, Threshold=512

Some switches, such as the Cisco 6500 series, support configuring the amount of buffer space available to each Ethernet port or ASIC. In high-load applications it may be necessary to increase the default buffer space. On Cisco, this can be accomplished by executing:

fabric buffer-reserve high

See Cisco's documentation for additional details on this setting.

Switch and Subnet Considerations

Cluster members may be split across multiple switches and may be part of multiple subnets. However, such topologies can overwhelm inter-switch links and increase the chances of a split cluster if the links fail. Typically, the impact materializes as communication delays that affect cluster and application performance. If possible, consider always locating all cluster members on the same switch and subnet to minimize the impact. See Evaluate the Production Network's Speed for both UDP and TCP.

Ethernet Flow-Control

Full duplex Ethernet includes a flow-control feature which allows the receiving end of a point-to-point link to slow down the transmitting end. This is implemented by the receiving end sending an Ethernet PAUSE frame to the transmitting end; the transmitting end then halts transmissions for the interval specified by the PAUSE frame. Note that this pause blocks all traffic from the transmitting side, even traffic destined for computers which are not overloaded. This can induce a head-of-line blocking condition, where one overloaded computer on a switch effectively slows down all other computers. Most switch vendors recommend that Ethernet flow-control be disabled for inter-switch links and at most be used on ports which are directly connected to computers. Even in this setup, head-of-line blocking can still occur, and thus it is advisable to disable Ethernet flow-control altogether. Higher-level protocols such as TCP/IP and Coherence TCMP include their own flow-control mechanisms which are not subject to head-of-line blocking and also negate the need for the lower-level flow-control.

Path MTU

By default, Coherence assumes a 1500 byte network MTU and uses a default packet size of 1468 based on this assumption. Having a packet size which does not fill the MTU results in an underused network. If your equipment uses a different MTU, then configure Coherence by specifying a <packet-size> setting that is 32 bytes smaller than the network path's minimum MTU.

If you are unsure of your equipment's MTU along the full path between nodes, you can use either the standard ping or traceroute utilities to determine the MTU. For example, execute a series of ping or traceroute operations between the two computers. With each attempt, specify a different packet size, starting from a high value and progressively moving downward until the packets start to make it through without fragmentation.

On Linux execute:

ping -c 3 -M do -s 1468 serverb

On Solaris execute:

traceroute -F serverb 1468

On Windows execute:

ping -n 3 -f -l 1468 serverb

On other operating systems: Consult the documentation for the ping or traceroute command to see how to disable fragmentation, and specify the packet size.

If you receive a message stating that packets must be fragmented, then the specified size is larger than the path's MTU. Decrease the packet size until you find the point at which packets can be transmitted without fragmentation. If you find that you must use packets smaller than 1468, you may want to contact your network administrator to get the MTU increased to at least 1500.

10GbE Considerations

Many 10 Gigabit Ethernet (10GbE) switches and network interface cards support frame sizes that are larger than the 1500 byte ethernet frame standard. When using 10GbE, make sure that the MTU is set to the maximum allowed by the technology (approximately 16KB for ethernet) to take full advantage of the greater bandwidth. Coherence automatically detects the MTU of the network and selects a UDP socket buffer size accordingly. UDP socket buffer sizes of 2MB, 4MB, or 8MB are selected for MTU sizes of 1500 bytes (standard), 9000 bytes (jumbo), and over 9000 (super jumbo), respectively. Also, make sure to increase the operating system socket buffers to 8MB to accommodate the larger sizes. A warning is issued in the Coherence logs if a significantly small operating system buffer is detected. Lastly, always run the datagram test to validate the performance and throughput of the network. See Performing a Network Performance Test.

TCP Considerations

Coherence utilizes a TCP message bus (TMB) to transmit messages between clustered data services. Therefore, a network must be optimally tuned for TCP. Coherence inherits TCP settings, including buffer settings, from the operating system. Most servers already have TCP tuned for the network and should not require additional configuration. The recommendation is to tune the TCP stack for the network instead of tuning Coherence for the network.

Coherence includes a message bus test utility that can be used to test throughput and latency between network nodes. See Running the Message Bus Test Utility. If a network shows poor performance, then it may not be properly configured. Use the following recommendations (note that these settings are demonstrated on Linux but can be translated to other operating systems):

#!/bin/bash
#
# aggregate size limitations for all connections, measured in pages; these values
# are for 4KB pages (getconf PAGESIZE)

/sbin/sysctl -w net.ipv4.tcp_mem='   65536   131072   262144'

# limit on receive space bytes per-connection; overridable by SO_RCVBUF; still
# governed by core.rmem_max

/sbin/sysctl -w net.ipv4.tcp_rmem=' 262144  4194304  8388608'

# limit on send space bytes per-connection; overridable by SO_SNDBUF; still
# governed by core.wmem_max

/sbin/sysctl -w net.ipv4.tcp_wmem='  65536  1048576  2097152'

# absolute limit on socket receive space bytes per-connection; cannot be
# overridden programmatically

/sbin/sysctl -w net.core.rmem_max=16777216

# absolute limit on socket send space bytes per-connection; cannot be
# overridden programmatically

/sbin/sysctl -w net.core.wmem_max=16777216

Each connection consumes a minimum of 320KB but, under normal memory pressure, consumes 5MB per connection; ultimately, the operating system tries to keep the entire system's TCP buffering under 1GB. These are recommended defaults based on tuning for fast (>= 10GbE) networks and should also be acceptable on 1GbE networks.

JVM Tuning

JVM runtime settings, such as heap size and garbage collection can be configured to ensure the right balance of resource utilization and performance.

This section includes the following topics:

  • Basic Sizing Recommendation

  • Heap Size Considerations

  • Garbage Collection Monitoring

Basic Sizing Recommendation

The recommendations in this section are sufficient for general use cases and require minimal setup effort. The primary issue to consider when sizing your JVMs is a balance of available RAM versus garbage collection (GC) pause times.

Cache Servers

The standard, safe recommendation for Coherence cache servers is to run a fixed-size heap of up to 8GB. In addition, use an incremental garbage collector to minimize GC pause durations. Lastly, run all Coherence JVMs in server mode, by specifying the -server option on the JVM command line. This allows for several performance optimizations for long-running applications.

For example:

java -server -Xms8g -Xmx8g -Xloggc: -jar coherence.jar

This sizing allows for good performance without the need for more elaborate JVM tuning. See Garbage Collection Monitoring.

Larger heap sizes are possible and have been implemented in production environments; however, it becomes more important to monitor and tune the JVMs to minimize the GC pauses. It may also be necessary to alter the storage ratios such that the amount of scratch space is increased to facilitate faster GC compactions. Additionally, it is recommended that you make use of an up-to-date JVM version to ensure the latest improvements for managing large heaps. See Heap Size Considerations.

TCMP Clients

Coherence TCMP clients should be configured similarly to cache servers, as long GCs could cause them to be misidentified as being terminated.

Extend Clients

Coherence Extend clients are not, technically speaking, cluster members and, as such, the effect of long GCs is less detrimental. For Extend clients, it is recommended that you follow the existing guidelines set forth by the application in which you are embedding Coherence.

Heap Size Considerations

Use this section to decide:

  • How many CPUs are needed for your system

  • How much memory is needed for each system

  • How many JVMs to run per system

  • How much heap to configure with each JVM

Since all applications are different, this section should be read as guidelines. You must answer the following questions to choose the configuration that is right for you:

  • How much data is to be stored in Coherence caches?

  • What are the application requirements in terms of latency and throughput?

  • How CPU or Network intensive is the application?

Sizing is an imprecise science. There is no substitute for frequent performance and stress testing.

This section includes the following topics:

  • General Guidelines

  • Storage Ratios

  • Cache Topologies and Heap Size

  • Planning Capacity for Data Grid Operations

  • Deciding How Many JVMs to Run Per System

  • Sizing Your Heap

  • Moving the Cache Out of the Application Heap

General Guidelines

Running with a fixed-size heap saves the JVM from having to grow the heap on demand and results in improved performance. To specify a fixed-size heap, use the -Xms and -Xmx JVM options, setting them to the same value. For example:

java -server -Xms4G -Xmx4G ...

A JVM process consumes more system memory than the specified heap size. The heap size settings specify the amount of heap which the JVM makes available to the application, but the JVM itself also consumes additional memory. The amount consumed differs depending on the operating system and JVM settings. For instance, a HotSpot JVM running on Linux configured with a 1GB heap consumes roughly 1.2GB of RAM. It is important to externally measure the JVM's memory utilization to ensure that RAM is not over-committed. Tools such as top, vmstat, and Task Manager are useful in identifying how much RAM is actually being used.

Storage Ratios

The basic starting point for how much data can be stored within a cache server of a given size is to use 1/3 of the heap for primary cache storage. This leaves another 1/3 for backup storage and the final 1/3 for scratch space. Scratch space is then used for things such as holding classes, temporary objects, network transfer buffers, and GC compaction. However, this recommendation is considered a basic starting point and should not be considered a rule. A more precise, but still conservative, starting point is to assume your cache data can occupy no more than the total heap minus two times the young generation size of a JVM heap (for example, 32GB – (2 * 4GB) = 24GB). In this case, cache data can occupy 75% of the heap. Note that the resulting percentage depends on the configured young generation size. In addition, you may instruct Coherence to limit primary storage on a per-cache basis by configuring the <high-units> element and specifying a BINARY value for the <unit-calculator> element. These settings are automatically applied to backup storage as well.

Ideally, both the primary and backup storage also fit within the JVM's tenured space (for HotSpot-based JVMs). See Sizing the Generations in Java Platform, Standard Edition HotSpot Virtual Machine Garbage Collection Tuning.

Cache Topologies and Heap Size

For large data sets, partitioned or near caches are recommended. Varying the number of Coherence JVMs does not significantly affect cache performance because the scalability of the partitioned cache is linear for both reading and writing. Using a replicated cache puts significant pressure on GC.

Planning Capacity for Data Grid Operations

Data grid operations (such as queries and aggregations) have additional heap space requirements and their use must be planned for accordingly. During data grid operations, binary representations of the cache entries that are part of the result set are held in-memory. Depending on the operation, the entries may also be held in deserialized form for the duration of the operation. Typically, this doubles the amount of memory for each entry in the result set. In addition, a second binary copy is maintained when using RAM or flash journaling as the backing map implementation due to differences in how the objects are stored. The second binary copy is also held for the duration of the operation and increases the total memory for each entry in the result set by 3x.

Heap memory usage depends on the size of the result set on which the operations are performed and the number of concurrent requests being handled. The result set size is affected by both the total number of entries as well as the size of each entry. Moderately sized result sets that are maintained on a storage cache server would not likely exhaust the heap's memory. However, if the result set is sufficiently large, the additional memory requirements can cause the heap to run out of memory. Data grid aggregate operations typically involve larger data sets and are more likely to exhaust the available memory than other operations.

The JVM's heap size should be increased on storage-enabled cache servers whenever large result sets are expected. For example, if a third of the heap has been reserved for scratch space, then the scratch space must also support the projected result set sizes. Alternatively, data grid operations can use the PartitionedFilter API. The API reduces memory consumption by executing grid operations against individual partition sets.
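
The following is a minimal sketch of the PartitionedFilter approach, assuming a cache named example and an EqualsFilter query (both illustrative); it queries one partition at a time so that only a single partition's worth of results is held on the heap at any moment:

import com.tangosol.net.CacheFactory;
import com.tangosol.net.DistributedCacheService;
import com.tangosol.net.NamedCache;
import com.tangosol.net.partition.PartitionSet;
import com.tangosol.util.Filter;
import com.tangosol.util.filter.EqualsFilter;
import com.tangosol.util.filter.PartitionedFilter;

import java.util.Set;

public class PartitionedQueryExample {
    public static void main(String[] args) {
        NamedCache cache = CacheFactory.getCache("example");       // illustrative cache name
        Filter filter = new EqualsFilter("getStatus", "OPEN");     // illustrative query

        DistributedCacheService service =
                (DistributedCacheService) cache.getCacheService();
        int cPartitions = service.getPartitionCount();

        // Run the query against one partition at a time to bound the
        // memory held for each intermediate result set
        PartitionSet parts = new PartitionSet(cPartitions);
        for (int iPartition = 0; iPartition < cPartitions; iPartition++) {
            parts.add(iPartition);
            Filter filterPart = new PartitionedFilter(filter, parts);

            Set setEntries = cache.entrySet(filterPart);
            // process this partition's entries here, then release them
            // before moving on to the next partition

            parts.remove(iPartition);
        }
    }
}

Partitions can also be processed in small batches (several partitions per PartitionSet) to trade a modest increase in memory for fewer query round trips.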

Deciding How Many JVMs to Run Per System

The number of JVMs (nodes) to run per system depends on the system's number of processors/cores and amount of memory. As a starting point, plan to run one JVM for every four cores. This recommendation balances the following factors:

  • Multiple JVMs per server allow Coherence to make more efficient use of high-bandwidth (>1gb) network resources.

  • Too many JVMs increases contention and context switching on processors.

  • Too few JVMs may not be able to handle available memory and may not fully use the NIC.

  • Especially for larger heap sizes, JVMs must have available processing capacity to avoid long GC pauses.

Depending on your application, you can add JVMs up toward one per core. The recommended number of JVMs and amount of configured heap may also vary based on the number of processors/cores per socket and on the computer architecture.

Note:

Applications that use Coherence as a basic cache (get, put and remove operations) and have no application classes (entry processors, aggregators, queries, cachestore modules, and so on) on the cache server can sometimes go beyond 1 JVM per core. They should be tested for both health and failover scenarios.

Sizing Your Heap

When considering heap size, it is important to find the right balance. The lower bound is determined by per-JVM overhead (and also, manageability of a potentially large number of JVMs). For example, if there is a fixed overhead of 100MB for infrastructure software (for example, JMX agents, connection pools, internal JVM structures), then the use of JVMs with 256MB heap sizes results in close to 40% overhead for non-cache data. The upper bound on JVM heap size is governed by memory management overhead, specifically the maximum duration of GC pauses and the percentage of CPU allocated to GC (and other memory management tasks).

GC can affect the following:

  • The latency of operations against Coherence. Larger heaps cause longer and less predictable latency than smaller heaps.

  • The stability of the cluster. With very large heaps, lengthy garbage collection pauses can trick TCMP into believing a cluster member is terminated, since the JVM is unresponsive during GC pauses. Although TCMP takes GC pauses into account when deciding on member health, at some point it may decide the member is terminated.

The following guideline is provided:

  • For Java, a conservative heap size of 8GB is recommended for most applications and is based on throughput, latency, and stability. However, larger heap sizes are suitable for some applications where the simplified management of fewer, larger JVMs outweighs the performance benefits of many smaller JVMs. A core-to-heap ratio of roughly 4 cores to 8GB is ideal, with some leeway to manage more GB per core. Every application is different, and GC must be monitored accordingly.

The length of a GC pause scales worse than linearly with the size of the heap. That is, if you double the size of the heap, pause times due to GC more than double (in general). GC pauses are also impacted by application usage:

  • Pause times increase as the amount of live data in the heap increases. Do not exceed 70% live data in your heap. This includes primary data, backup data, indexes, and application data.

  • High object allocation rates increase pause times. Even "simple" Coherence applications can cause high object allocation rates since every network packet generates many objects.

  • CPU-intensive computations increase contention and may also contribute to higher pause times.

Depending on your latency requirements, you can increase allocated heap space beyond the above recommendations, but be sure to stress test your system.

Moving the Cache Out of the Application Heap

Using dedicated Coherence cache server instances for Partitioned cache storage minimizes the heap size of application JVMs because the data is no longer stored locally. As most Partitioned cache access is remote (with only 1/N of data being held locally), using dedicated cache servers does not generally impose much additional overhead. Near cache technology may still be used, and it generally has a minimal impact on heap size (as it is caching an even smaller subset of the Partitioned cache). Many applications are able to dramatically reduce heap sizes, resulting in better responsiveness.

Local partition storage may be enabled (for cache servers) or disabled (for application server clients) with the coherence.distributed.localstorage Java property (for example, -Dcoherence.distributed.localstorage=false).
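
For example, a dedicated cache server JVM and a storage-disabled application JVM might be started as follows (the heap sizes, application classpath, and main class are illustrative):

java -server -Xms8g -Xmx8g -jar coherence.jar
java -server -Xms1g -Xmx1g -Dcoherence.distributed.localstorage=false -cp coherence.jar:my-app.jar com.example.MyApp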

It may also be disabled by modifying the <local-storage> setting in the tangosol-coherence.xml (or tangosol-coherence-override.xml) file as follows:

Example 4-2 Disabling Partition Storage

<?xml version='1.0'?>

<coherence xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xmlns="http://xmlns.oracle.com/coherence/coherence-operational-config"
   xsi:schemaLocation="http://xmlns.oracle.com/coherence/coherence-operational-config
   coherence-operational-config.xsd">
   <cluster-config>
      <services>
         <!--
         id value must match what's in tangosol-coherence.xml for DistributedCache
         service
         -->
         <service id="3">
            <init-params>
               <init-param id="4">
                  <param-name>local-storage</param-name>
                   <param-value system-property="coherence.distributed.localstorage">false</param-value>
               </init-param>
           </init-params>
        </service>
      </services>
   </cluster-config>
</coherence>

At least one storage-enabled JVM must be started before any storage-disabled clients access the cache.

Garbage Collection Monitoring

Lengthy GC pause times can negatively impact the Coherence cluster and are typically indistinguishable from node termination. A Java application cannot send or receive packets during these pauses, and packets that are buffered by the operating system during the pause may be discarded and must be retransmitted. For these reasons, it is very important that cluster nodes are sized and tuned to ensure that their GC times remain minimal. As a general rule, a node should spend less than 10% of its time paused in GC, normal GC times should be under 100ms, and maximum GC times should be around 1 second. See Introduction in Java Platform, Standard Edition HotSpot Virtual Machine Garbage Collection Tuning.

Log messages are generated when one cluster node detects that another cluster node has been unresponsive for a period of time, generally indicating that a target cluster node was in a GC cycle.

Example 4-3 Message Indicating Target Cluster Node is in Garbage Collection Mode

Experienced a 4172 ms communication delay (probable remote GC) with Member(Id=7, Timestamp=2006-10-20 12:15:47.511, Address=192.168.0.10:8089, MachineId=13838); 320 packets rescheduled, PauseRate=0.31, Threshold=512

PauseRate indicates the percentage of time for which the node has been considered unresponsive since the statistics were last reset. Nodes reported as unresponsive for more than a few percent of their lifetime may be worth investigating for GC tuning.

GC activity can be monitored in many ways; some Oracle HotSpot mechanisms (see the example following this list) include:

  • -verbose:gc (writes GC log to standard out; use -Xloggc to direct it to some custom location)

  • -XX:+PrintGCDetails, -XX:+PrintGCTimeStamps, -XX:+PrintHeapAtGC, -XX:+PrintTenuringDistribution, -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime

  • -Xprof: enables profiling. Profiling activities should be distinguished between testing and production deployments, and their effects on resources and performance should always be monitored

  • JConsole and VisualVM (including the VisualGC plug-in), which are included with the JDK.
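
As a combined example, a cache server could be started with GC logging enabled as follows (the log file path is illustrative, and the flags shown are for HotSpot JVMs prior to JDK 9; later JVMs use unified logging via -Xlog:gc* instead):

java -server -Xms8g -Xmx8g -verbose:gc -Xloggc:/tmp/coherence-gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -jar coherence.jar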

Data Access Patterns

Understanding how an application uses a cache can help you determine how to configure the cache for maximum performance.

This section includes the following topics:

  • Data Access Distribution (hot spots)

  • Cluster-node Affinity

  • Read/Write Ratio and Data Sizes

  • Interleaving Cache Reads and Writes

  • Concurrent Near Cache Misses on a Specific Hot Key

Data Access Distribution (hot spots)

When caching a large data set, typically a small portion of that data set is responsible for most data accesses. For example, in a 1000-object data set, 80% of operations may be against a 100-object subset. The remaining 20% of operations may be against the other 900 objects. Obviously, the most effective return on investment is gained by caching the 100 most active objects; caching the remaining 900 objects provides 25% more effective caching while requiring a 900% increase in resources.

However, if every object is accessed equally often (for example, in sequential scans of the data set), then caching requires more resources for the same level of effectiveness. In this case, achieving more than 0% effectiveness requires caching 100% of the data. (Note that sequential scans of partially cached data sets generally defeat MRU, LFU and MRU-LFU eviction policies.) In practice, most non-synthetic (benchmark) data access patterns are uneven and respond well to caching subsets of data.

In cases where a subset of data is active, and a smaller subset is particularly active, Near caching can be very beneficial when used with the all invalidation strategy (this is effectively a two-tier extension of the above rules).

Cluster-node Affinity

Coherence's Near cache technology transparently takes advantage of cluster-node affinity, especially when used with the present invalidation strategy. This topology is particularly useful when used with a sticky load-balancer. Note that the present invalidation strategy results in higher overhead (as opposed to all) when the front portion of the cache is "thrashed" (very short lifespan of cache entries); this is due to the higher overhead of adding/removing key-level event listeners. In general, a cache should be tuned to avoid thrashing and so this is usually not an issue.
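
Near caches are normally defined declaratively in the cache configuration file (a <near-scheme> with an <invalidation-strategy> element). Purely to illustrate where the strategy is selected, the following sketch constructs a near cache programmatically with the present strategy; the cache name and front-map size are illustrative:

import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;
import com.tangosol.net.cache.CachingMap;
import com.tangosol.net.cache.LocalCache;
import com.tangosol.net.cache.NearCache;

public class NearCacheExample {
    public static void main(String[] args) {
        // Back tier: the distributed (partitioned) cache
        NamedCache cacheBack = CacheFactory.getCache("example");

        // Front tier: a small local cache, limited here to 10,000 units
        LocalCache cacheFront = new LocalCache(10000);

        // "present" strategy: listen only for events on keys that are
        // actually held in the front tier
        NearCache cacheNear = new NearCache(cacheFront, cacheBack, CachingMap.LISTEN_PRESENT);

        Object value = cacheNear.get("some-key");
        System.out.println(value);
    }
}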

Read/Write Ratio and Data Sizes

Generally speaking, the following cache topologies are best for the following use cases:

  • Replicated cache—small amounts of read-heavy data (for example, metadata)

  • Partitioned cache—large amounts of read/write data (for example, large data caches)

  • Near cache—similar to Partitioned, but has further benefits from read-heavy tiered access patterns (for example, large data caches with hotspots) and "sticky" data access (for example, sticky HTTP session data). Depending on the synchronization method (expiry, asynchronous, synchronous), the worst case performance may range from similar to a Partitioned cache to considerably worse.

Interleaving Cache Reads and Writes

Interleaving refers to the number of cache reads between each cache write. The Partitioned cache is not affected by interleaving (as it is designed for 1:1 interleaving). The Replicated and Near caches, by contrast, are optimized for read-heavy caching and prefer a read-heavy interleave (for example, 10 reads between every write). This is because they both locally cache data for subsequent read access. Writes to the cache force these locally cached items to be refreshed, a comparatively expensive process (relative to the near-zero cost of fetching an object off the local memory heap). Note that with the Near cache technology, worst-case performance is still similar to the Partitioned cache; the loss of performance is relative to best-case scenarios.

Note that interleaving is related to read/write ratios, but only indirectly. For example, a Near cache with a 1:1 read/write ratio may be extremely fast (all writes followed by all reads) or much slower (1:1 interleave, write-read-write-read...).

Concurrent Near Cache Misses on a Specific Hot Key

Frequent cache misses by multiple clients, made concurrently on a specific hot key that will never have a value in the near cache, result in all of the misses being serialized across the cluster. The backlog of multiple cache lookups on the same hot key causes cluster slowdown around the registering and unregistering of the message listener for the hot key in question. The more clients that concurrently request the same key, the longer the delay is for each client.

To avoid this issue, Oracle recommends that you avoid requesting a value from a near cache for a key that will never have a value.

If that is not possible, use negative caching to resolve the slowdown. Negative caching turns an expensive miss that cannot be satisfied into a cheap hit that returns a value which, by application convention, indicates that there is no value for the key. That sentinel value is kept in the near cache to avoid the overhead of repeated misses on the same key.
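
The following is a minimal sketch of the negative caching pattern; the NO_VALUE sentinel and the lookupFromSource helper are illustrative names, not part of the Coherence API:

import com.tangosol.net.NamedCache;

public class NegativeCachingExample {
    // Application-defined marker meaning "this key has no value"
    private static final String NO_VALUE = "__NO_VALUE__";

    public static Object getWithNegativeCaching(NamedCache cache, Object key) {
        Object value = cache.get(key);
        if (value == null) {
            Object loaded = lookupFromSource(key);
            value = (loaded == null) ? NO_VALUE : loaded;

            // Cache the sentinel (or the real value) so that subsequent lookups
            // on this hot key become cheap near cache hits instead of repeated
            // cluster-wide misses
            cache.put(key, value);
        }
        return NO_VALUE.equals(value) ? null : value;
    }

    private static Object lookupFromSource(Object key) {
        // Hypothetical lookup against the underlying system of record
        return null;
    }
}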