Coherence 3.3 User Guide : Performance Tuning

To achieve maximum performance with Coherence it is suggested that you test and tune your operating environment. Tuning recommendations are available for:

OS Tuning

Socket Buffer Sizes

To help minimization of packet loss, the OS socket buffers need to be large enough to handle the incoming network traffic while your Java application is paused during garbage collection. By default Coherence will attempt to allocate a socket buffer of 2MB. If your OS is not configured to allow for large buffers Coherence will utilize smaller buffers. Most versions of Unix have a very low default buffer limit, which should be increased to at least 2MB.

Starting with Coherence 3.1 you will receive the following warning if the OS failed to allocate the full size buffer.

UnicastUdpSocket failed to set receive buffer size to 1428 packets (2096304 bytes); actual size is 89 packets (131071 bytes). Consult your OS documentation regarding increasing the maximum socket buffer size. Proceeding with the actual value may cause sub-optimal performance.

Though it is safe to operate with the smaller buffers it is recommended that you configure your OS to allow for larger buffers.

sysctl -w net.core.rmem_max=2096304
sysctl -w net.core.wmem_max=2096304

ndd -set /dev/udp udp_max_buf 2096304

no -o rfc1323=1
no -o sb_max=4194304

Note that AIX only supports specifying receive buffer sizes of 1MB, 4MB, and 8MB. Additionally there is an issue with IBM's 1.4.2, and 1.5 JVMs which may prevent them from allocating socket buffers larger then 64K. This issue has been addressed in IBM's 1.4.2 SR7 SDK and 1.5 SR3 SDK.

For information on increasing the buffer sizes for other OSs please refer to your OS's documentation.

High Resolution timesource (Linux)

Linux has a number of high resolution timesources to choose from, the fastest TSC (Time Stamp Counter) unfortunately is not always reliable. Linux chooses TSC by default, and during boot checks for inconsistencies, if found it switches to a slower safe timesource. The slower time sources can be 10 to 30 times more expensive to query then the TSC timesource, and may have a measurable impact on Coherence performance. Note that Coherence and the underlying JVM are not aware of the timesource which the OS is utilizing. It is suggested that you check your system logs (/var/log/dmesg) to verify that the following is not present.

kernel: Losing too many ticks!
kernel: TSC cannot be used as a timesource.
kernel: Possible reasons for this are:
kernel:   You're running with Speedstep,
kernel:   You don't have DMA enabled for your hard disk (see hdparm),
kernel:   Incorrect TSC synchronization on an SMP system (see dmesg).
kernel: Falling back to a sane timesource now.

As the log messages suggest, this can be caused by a variable rate CPU (SpeedStep), having DMA disabled, or incorrect TSC synchronization on multi CPU machines. If present it is suggested that you work with your system administrator to identify the cause and allow the TSC timesource to be utilized.

Datagram size (Microsoft Windows)

Microsoft Windows supports a fast I/O path which is utilized when sending "small" datagrams. The default setting for what is considered a small datagram is 1024 bytes; increasing this value to match your network MTU (normally 1500) can significantly improve network performance.

Run Registry Editor (regedit)
Locate the following registry key
HKLM\System\CurrentControlSet\Services\AFD\Parameters
Add the following new DWORD value
Name: FastSendDatagramThreshold
Value: 1500 (decimal)
Reboot

Note: Included in Coherence 3.1 and above is an optimize.reg script which will perform this change for you, it can be found in the coherence/bin directory of your installation. After running the script you must reboot your computer for the changes to take effect.

Thread Scheduling (Microsoft Windows)

Windows (including NT, 2000 and XP) is optimized for desktop application usage. If you run two console (" DOS box ") windows, the one that has the focus can use almost 100% of the CPU, even if other processes have high-priority threads in a running state. To correct this imbalance, you must configure the Windows thread scheduling to less-heavily favor foreground applications.

Open the Control Panel.
Open System.
Select the Advanced tab.
Under Performance select Settings.
Select the Advanced tab.
Under Processor scheduling, choose{{Background services}}.

Note: Coherence includes an optimize.reg script which will perform this change for you, it can be found in the coherence/bin directory of your installation.

Swapping

Ensure that you have sufficient memory such that you are not making active use of swap space on your machines. You may monitor the swap rate using tools such as vmstat and top. If you find that you are actively moving through swap space this will likely have a significant impact on Coherence's performance. Often this will manifest itself as Coherence nodes being removed from the cluster due to long periods of unresponsiveness caused by them having been "swapped out".

Network Tuning

Network Interface Settings

Verify that your Network card (NIC) is configured to operate at it's maximum link speed and at full duplex. The process for doing this varies between OSs.

ethtool eth0

See the man page on ethtool for further details and for information on adjust the interface settings.

kstat ce:0 | grep link_

This will display the link settings for interface 0. Items of interest are link_duplex (2 = full), and link_speed which is reported in Mbps. For further details on Solaris network tuning see this article from Prentice Hall.

If running on Solaris 10, please review Sun issues 102712 and 102741 which relate to packet corruption and multicast disconnections. These will most often manifest as either EOFExceptions, "Large gap" warnings while reading packet data, or frequent packet timeouts. It is highly recommend that the patches for both issues be applied when using Coherence on Solaris 10 systems.

Open the Control Panel.
Open Network Connections.
Open the Properties dialog for desired network adapter.
Select Configure.
Select the Advanced tab.
Locate the driver specific property for Speed & Duplex.
Set it to either auto or to a specific speed and duplex setting.

Bus Considerations

For 1Gb and faster PCI network cards the system's bus speed may be the limiting factor for network performance. PCI and PCI-X busses are half-duplex, and all devices will run at the speed of the slowest device on the bus. Standard PCI buses have a maximum throughput of approximately 1Gb/sec and thus are not capable of fully utilizing a full-duplex 1Gb NIC. PCI-X has a much higher maximum throughput (1GB/sec), but can be hobbled by a single slow device on the bus. If you find that you are not able to achieve satisfactory bidirectional data rates it is suggested that you evaluate your machine's bus configuration. For instance simply relocating the NIC to a private bus may improve performance.

Network Infrastructure Settings

If you experience frequent multi-second communication pauses across multiple cluster nodes you may need to increase your switch's buffer space. These communication pauses can be identified by a series of Coherence log messages identifying communication delays with multiple nodes which are not attributable to local or remote GCs.

Experienced a 4172 ms communication delay (probable remote GC) with Member(Id=7, Timestamp=2006-10-20 12:15:47.511, Address=192.168.0.10:8089, MachineId=13838); 320 packets rescheduled, PauseRate=0.31, Threshold=512

Some switches such as the Cisco 6500 series support configuration the amount of buffer space available to each ethernet port or ASIC. In high load applications it may be necessary to increase the default buffer space. On Cisco this can be accomplished by executing:

fabric buffer-reserve high

Ethernet flow-control

Full duplex Ethernet includes a flow-control feature which allows the receiving end of a point to point link to slow down the transmitting end. This is implemented by the receiving end sending an Ethernet PAUSE frame to the transmitting end, the transmitting end will then halt transmissions for the interval specified by the PAUSE frame. Note that this pause blocks all traffic from the transmitting side, even traffic destined for machines which are not overloaded. This can induce a head of line blocking condition, where one overloaded machine on a switch effectively slows down all other machines. Most switch vendors will recommend that Ethernet flow-control be disabled for inter switch links, and at most be utilized on ports which are directly connected to machines. Even in this setup head of line blocking can still occur, and thus it is advisable to disable Ethernet flow-control all together. Higher level protocols such as TCP/IP and Coherence TCMP include their own flow-control mechanisms which are not subject to head of line blocking, and also negate the need for the lower level flow-control.

Path MTU

By default Coherence assumes a 1500 byte network MTU, and uses a default packet size of 1468 based on this assumption. Having a packet size which does not fill the MTU will result is an under utilized network. If your equipment uses a different MTU, you should configure Coherence accordingly, by specifying a packet size which is 32 bytes smaller then the network path's minimal MTU. The packet size may be specified in coherence/cluster-config/packet-publisher/packet-size/maximum-length and preferred-length configuration elements.

If you are unsure of your equpiment's MTU along the full path between nodes you can use either the standard ping or traceroute utility to determine it. To do this, execute a series of ping or traceroute operations between the two machines. With each attempt you will specify a different packet size, starting from a high value and progressively moving downward until the packets start to make it through without fragmentation. You will need to specify a particular packet size, and to not fragment the packets.

ping -c 3 -M do -s 1468 serverb

traceroute -F serverb 1468

ping -n 3 -f -l 1468 serverb

On other OSs, consult the documentation for the ping or traceroute command to see how to disable fragmentation, and specify the packet size.

If you receive a message stating that packets must be fragmented then the specified size is larger then the path's MTU. Decrease the packet size until you find the point at which packets can be transmitted without fragmentation. If you find that you need to use packets smaller then 1468 you may wish to contact your network administrator to get the MTU increased to at least 1500.

JVM Tuning

Server Mode

It is recommended that you run all your Coherence JVMs in server mode, by specifying the "-server" on the JVM command line. This allows for a number of performance optimizations for long running applications.

Sizing the Heap

It is generally recommended that heap sizes be kept at 1GB or below as larger heaps will have a more significant impact on garbage collection times. On 1.5 and higher JVMs larger heaps are reasonable, but will likely require additional GC tuning.

Running with a fixed sized heap will save your JVM from having to grow the heap on demand and will result in improved performance. To specify a fixed size heap use the -Xms and -Xmx JVM options, setting them to the same value. For example:

java -server -Xms1024m -Xmx1024m ...

Note that the JVM process will consume more system memory then the specified heap size, for instance a 1GB JVM will consume 1.3GB of memory. This should be taken into consideration when determining the maximum number of JVMs which you will run on a machine. The actual allocated size can be monitored with tools such as top. See Best Practices for additional details on heap size considerations.

GC Monitoring & Tuning

Frequent garbage collection pauses which are in the range of 100ms or more are likely to have a noticeable impact on network performance. During these pauses a Java application is unable to send or receive packets, and in the case of receiving, the OS buffered packets may be discarded and need to be retransmitted.

Specify "-verbose:gc" or "-Xloggc:" on the JVM command line to monitor the frequency and duration of garbage collection pauses.

Starting with Coherence 3.2 log messages will be generated when one cluster node detects that another cluster node has been unresponsive for a period of time, generally indicating that a target cluster node was in a GC cycle.

The PauseRate indicates the percentage of time for which the node has been considered unresponsive since the stats were last reset. Nodes reported as unresponsive for more then a few percent of their lifetime may be worth investigating for GC tuning.

Coherence Network Tuning

Coherence includes configuration elements for throttling the amount of traffic it will place on the network; see the documentation for traffic-jam, flow-control and burst-mode, these settings are used to control the rate of packet flow within and between cluster nodes.

Validation

To determine how these settings are affecting performance you need to check if you're cluster nodes are experiencing packet loss and/or packet duplication. This can be obtained by looking at the following JMX stats on various cluster nodes: