20 Performance Tuning

To achieve maximum performance with Coherence it is suggested that you test and tune your operating environment. Testing is covered in Chapter 17, "Performing a Datagram Test for Network Performance."

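Tuning recommendations are available for the operating system, the network, the JVM, and the Coherence network settings, as described in the sections that follow.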

20.1 Operating System Tuning

20.1.1 Socket Buffer Sizes

To help minimize packet loss, the operating system socket buffers must be large enough to handle the incoming network traffic while your Java application is paused during garbage collection. By default, Coherence attempts to allocate a socket buffer of 2MB. If your operating system is not configured to allow for large buffers, Coherence uses smaller buffers. Most versions of UNIX have a very low default buffer limit, which should be increased to at least 2MB.

Starting with Coherence 3.1, you receive the following warning if the operating system fails to allocate the full-size buffer.

Example 20-1 Message Indicating OS Failed to Allocate the Full Buffer Size

UnicastUdpSocket failed to set receive buffer size to 1428 packets (2096304 bytes); actual size is 89 packets (131071 bytes). Consult your OS documentation regarding increasing the maximum socket buffer size. Proceeding with the actual value may cause sub-optimal performance.

Although it is safe to operate with the smaller buffers, it is recommended that you configure your operating system to allow for larger buffers.
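To see what the operating system grants before starting Coherence, a small standalone check can request a buffer and read back the effective size. The following is a minimal sketch that uses only the standard java.net.DatagramSocket API. The class name SocketBufferCheck is hypothetical, and this is not necessarily how Coherence itself performs the check.

```java
import java.io.UncheckedIOException;
import java.net.DatagramSocket;
import java.net.SocketException;

// Hypothetical helper for illustration; not part of Coherence.
final class SocketBufferCheck {

    // Requests a receive buffer of the given size and returns the size
    // that the operating system reports for the socket afterwards.
    static int grantedReceiveBuffer(int requestedBytes) {
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.setReceiveBufferSize(requestedBytes);
            return socket.getReceiveBufferSize();
        } catch (SocketException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int requested = 2096304; // size taken from the warning in Example 20-1
        int granted = grantedReceiveBuffer(requested);
        System.out.println("requested " + requested + " bytes, granted " + granted + " bytes");
        if (granted < requested) {
            System.out.println("The OS limit appears to be lower than the requested size.");
        }
    }
}
```

The reported value is platform dependent, so treat it as an indication of the configured limit rather than an exact measurement.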

On Linux execute (as root):

sysctl -w net.core.rmem_max=2096304
sysctl -w net.core.wmem_max=2096304

On Solaris execute (as root):

ndd -set /dev/udp udp_max_buf 2096304 

On AIX execute (as root):

no -o rfc1323=1
no -o sb_max=4194304

Note:

AIX only supports specifying buffer sizes of 1MB, 4MB, and 8MB. Additionally, there is an issue with IBM's 1.4.2 and 1.5 JVMs which may prevent them from allocating socket buffers larger than 64K. This issue has been addressed in IBM's 1.4.2 SR7 SDK and 1.5 SR3 SDK.

On Windows:

Windows does not impose a buffer size restriction by default.

Other:

For information on increasing the buffer sizes for other operating systems please refer to your operating system's documentation.

You may configure Coherence to request alternate sized buffers for packet publishers and unicast listeners by using the coherence/cluster-config/packet-publisher/packet-buffer/maximum-packets and coherence/cluster-config/unicast-listener/packet-buffer/maximum-packets elements. For more information, see "packet-buffer".

20.1.2 High Resolution timesource (Linux)

Linux has several high resolution timesources to choose from. The fastest, TSC (Time Stamp Counter), is unfortunately not always reliable. Linux chooses TSC by default and checks for inconsistencies during boot; if any are found, it switches to a slower, safe timesource. The slower timesources can be 10 to 30 times more expensive to query than the TSC timesource, and may have a measurable impact on Coherence performance. Note that Coherence and the underlying JVM are not aware of the timesource which the operating system is using. It is suggested that you check your system logs (/var/log/dmesg) to verify that messages such as those in Example 20-2 are not present.

Example 20-2 Log Message from a Linux Timesource

kernel: Losing too many ticks!
kernel: TSC cannot be used as a timesource.
kernel: Possible reasons for this are:
kernel:   You're running with Speedstep,
kernel:   You don't have DMA enabled for your hard disk (see hdparm),
kernel:   Incorrect TSC synchronization on an SMP system (see dmesg).
kernel: Falling back to a sane timesource now.

As the log messages suggest, this can be caused by a variable rate CPU (SpeedStep), having DMA disabled, or incorrect TSC synchronization on multi-CPU machines. If these messages are present, it is suggested that you work with your system administrator to identify the cause and allow the TSC timesource to be used.
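The log check can be automated on machines where many hosts need to be verified. The following is a minimal sketch that scans log lines for the fallback messages shown in Example 20-2. The class name TimesourceLogCheck is hypothetical, and the matched strings are taken only from that example, so other kernel versions may word the messages differently.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper for illustration; not part of Coherence.
final class TimesourceLogCheck {

    // Returns true if any line contains one of the TSC fallback messages
    // that appear in Example 20-2.
    static boolean indicatesTscFallback(List<String> logLines) {
        for (String line : logLines) {
            if (line.contains("TSC cannot be used as a timesource")
                    || line.contains("Falling back to a sane timesource")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // /var/log/dmesg is the location named in the text; a different
        // path may be passed as the first argument.
        String path = args.length > 0 ? args[0] : "/var/log/dmesg";
        List<String> lines = Files.readAllLines(Paths.get(path));
        System.out.println(indicatesTscFallback(lines)
                ? "TSC fallback messages found"
                : "No TSC fallback messages found");
    }
}
```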

20.1.3 Datagram size (Microsoft Windows)

Microsoft Windows supports a fast I/O path which is used when sending "small" datagrams. The default setting for what is considered a small datagram is 1024 bytes; increasing this value to match your network MTU (normally 1500) can significantly improve network performance.

To adjust this parameter:

  1. Run Registry Editor (regedit)

  2. Locate the following registry key: HKLM\System\CurrentControlSet\Services\AFD\Parameters

  3. Add a new DWORD value with the name FastSendDatagramThreshold and the value 1500 (decimal).

  4. Reboot

Note:

Coherence 3.1 and above includes an optimize.reg script which performs this change for you. It can be found in the coherence/bin directory of your installation. After running the script, you must reboot your computer for the changes to take effect.

For more details on this parameter see Appendix C of http://technet.microsoft.com/en-us/library/bb726981.aspx

20.1.4 Thread Scheduling (Microsoft Windows)

Windows (including NT, 2000 and XP) is optimized for desktop application usage. If you run two console ("DOS box") windows, the one that has the focus can use almost 100% of the CPU, even if other processes have high-priority threads in a running state. To correct this imbalance, you must configure the Windows thread scheduling to favor foreground applications less heavily.

  1. Open the Control Panel.

  2. Open System.

  3. Select the Advanced tab.

  4. Under Performance select Settings.

  5. Select the Advanced tab.

  6. Under Processor scheduling, choose Background services.

Note:

Coherence includes an optimize.reg script which performs this change for you. It can be found in the coherence/bin directory of your installation.

20.1.5 Swapping

Ensure that you have sufficient memory so that your machines are not making active use of swap space. You may monitor the swap rate using tools such as vmstat and top. If you find that you are actively moving through swap space, this will likely have a significant impact on Coherence's performance. Often this manifests itself as Coherence nodes being removed from the cluster due to long periods of unresponsiveness caused by them having been "swapped out".

20.2 Network Tuning

20.2.1 Network Interface Settings

Verify that your network card (NIC) is configured to operate at its maximum link speed and at full duplex. The process for doing this varies between operating systems.

On Linux execute (as root):

ethtool eth0

See the man page on ethtool for further details and for information on adjusting the interface settings.

On Solaris execute (as root):

kstat ce:0 | grep link_

This will display the link settings for interface 0. Items of interest are link_duplex (2 = full), and link_speed which is reported in Mbps.

Note:

If running on Solaris 10, please review Sun issues 102712 and 102741, which relate to packet corruption and multicast disconnections. These will most often manifest as either EOFExceptions, "Large gap" warnings while reading packet data, or frequent packet timeouts. It is highly recommended that the patches for both issues be applied when using Coherence on Solaris 10 systems.

On Windows:

  1. Open the Control Panel.

  2. Open Network Connections.

  3. Open the Properties dialog for the desired network adapter.

  4. Select Configure.

  5. Select the Advanced tab.

  6. Locate the driver specific property for Speed & Duplex.

  7. Set it to either auto or to a specific speed and duplex setting.

20.2.2 Bus Considerations

For 1Gb and faster PCI network cards, the system's bus speed may be the limiting factor for network performance. PCI and PCI-X buses are half-duplex, and all devices run at the speed of the slowest device on the bus. Standard PCI buses have a maximum throughput of approximately 1Gb/sec and thus are not capable of fully using a full-duplex 1Gb NIC. PCI-X has a much higher maximum throughput (1GB/sec), but can be hobbled by a single slow device on the bus. If you find that you are not able to achieve satisfactory bidirectional data rates, it is suggested that you evaluate your machine's bus configuration. For instance, simply relocating the NIC to a private bus may improve performance.

20.2.3 Network Infrastructure Settings

If you experience frequent multi-second communication pauses across multiple cluster nodes, you may need to increase your switch's buffer space. These communication pauses can be identified by a series of Coherence log messages which report communication delays with multiple nodes and which are not attributable to local or remote GCs.

Example 20-3 Message Indicating a Communication Delay

Experienced a 4172 ms communication delay (probable remote GC) with Member(Id=7, Timestamp=2006-10-20 12:15:47.511, Address=192.168.0.10:8089, MachineId=13838); 320 packets rescheduled, PauseRate=0.31, Threshold=512

Some switches, such as the Cisco 6500 series, support configuring the amount of buffer space available to each Ethernet port or ASIC. In high load applications it may be necessary to increase the default buffer space. On Cisco this can be accomplished by executing:

fabric buffer-reserve high

See Cisco's documentation for additional details on this setting.

20.2.4 Ethernet Flow-Control

Full duplex Ethernet includes a flow-control feature which allows the receiving end of a point-to-point link to slow down the transmitting end. This is implemented by the receiving end sending an Ethernet PAUSE frame to the transmitting end; the transmitting end then halts transmissions for the interval specified by the PAUSE frame. Note that this pause blocks all traffic from the transmitting side, even traffic destined for machines which are not overloaded. This can induce a head-of-line blocking condition, where one overloaded machine on a switch effectively slows down all other machines. Most switch vendors recommend that Ethernet flow-control be disabled for inter-switch links, and at most be used on ports which are directly connected to machines. Even in this setup head-of-line blocking can still occur, and thus it is advisable to disable Ethernet flow-control altogether. Higher level protocols such as TCP/IP and Coherence TCMP include their own flow-control mechanisms which are not subject to head-of-line blocking, and also negate the need for the lower level flow-control.

For more details on this subject see http://www.networkworld.com/netresources/0913flow2.html.

20.2.5 Path MTU

By default Coherence assumes a 1500 byte network MTU, and uses a default packet size of 1468 based on this assumption. Having a packet size which does not fill the MTU will result in an underused network. If your equipment uses a different MTU, then configure Coherence by specifying a packet size which is 32 bytes smaller than the network path's minimal MTU. The packet size may be specified in the coherence/cluster-config/packet-publisher/packet-size/maximum-length and preferred-length configuration elements. For more information on these elements, see "packet-size".
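The arithmetic is simple but easy to get wrong when several network paths are involved. The following is a minimal sketch of the rule stated above (packet size equals path MTU minus 32 bytes). The class name PacketSizeCalc is hypothetical, and the 9000 byte value is only an illustrative non-default MTU, not a recommendation.

```java
// Hypothetical helper for illustration; not part of Coherence.
final class PacketSizeCalc {

    // Allowance stated in the text: the packet size is 32 bytes
    // smaller than the path's minimal MTU.
    static final int ALLOWANCE_BYTES = 32;

    static int packetSizeForMtu(int pathMtuBytes) {
        if (pathMtuBytes <= ALLOWANCE_BYTES) {
            throw new IllegalArgumentException("MTU too small: " + pathMtuBytes);
        }
        return pathMtuBytes - ALLOWANCE_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(packetSizeForMtu(1500)); // prints 1468, the Coherence default
        System.out.println(packetSizeForMtu(9000)); // prints 8968
    }
}
```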

If you are unsure of your equipment's MTU along the full path between nodes, you can use either the standard ping or traceroute utility to determine it. To do this, execute a series of ping or traceroute operations between the two machines. With each attempt, specify a different packet size, starting from a high value and progressively moving downward until the packets start to make it through without fragmentation. For each attempt you must specify the packet size and instruct the utility not to fragment the packets.

On Linux execute:

ping -c 3 -M do -s 1468 serverb

On Solaris execute:

traceroute -F serverb 1468

On Windows execute:

ping -n 3 -f -l 1468 serverb

On other operating systems: Consult the documentation for the ping or traceroute command to see how to disable fragmentation, and specify the packet size.

If you receive a message stating that packets must be fragmented, then the specified size is larger than the path's MTU. Decrease the packet size until you find the point at which packets can be transmitted without fragmentation. If you find that you need to use packets smaller than 1468, you may want to contact your network administrator to get the MTU increased to at least 1500.
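The search for the largest unfragmented size can be sketched as follows. This sketch uses a binary search rather than the step-by-step descent described above, which is valid only under the assumption that whenever one size passes unfragmented, every smaller size passes as well. The class name PathMtuProbe and the probe predicate are hypothetical; in practice the predicate would wrap one of the ping or traceroute invocations shown earlier.

```java
import java.util.function.IntPredicate;

// Hypothetical helper for illustration; not part of Coherence.
final class PathMtuProbe {

    // Returns the largest size in [low, high] accepted by the probe,
    // or -1 if no size in the range is accepted.
    static int largestUnfragmented(int low, int high, IntPredicate passesUnfragmented) {
        int best = -1;
        while (low <= high) {
            int mid = low + (high - low) / 2;
            if (passesUnfragmented.test(mid)) {
                best = mid;      // mid fits; try larger sizes
                low = mid + 1;
            } else {
                high = mid - 1;  // mid would be fragmented; try smaller sizes
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Stand-in for a real probe: pretend the path carries at most 1468 bytes.
        IntPredicate simulatedPath = size -> size <= 1468;
        System.out.println(largestUnfragmented(500, 9000, simulatedPath)); // prints 1468
    }
}
```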

20.3 JVM Tuning

20.3.1 Server Mode

It is recommended that you run all your Coherence JVMs in server mode by specifying -server on the JVM command line. This allows for several performance optimizations for long running applications.

20.3.2 Sizing the Heap

It is generally recommended that heap sizes be kept at 1GB or below as larger heaps will have a more significant impact on garbage collection times. On 1.5 and higher JVMs larger heaps are reasonable, but will likely require additional GC tuning. For more information, see "Heap Size Considerations".

Running with a fixed sized heap will save your JVM from having to grow the heap on demand and will result in improved performance. To specify a fixed size heap use the -Xms and -Xmx JVM options, setting them to the same value. For example:

java -server -Xms1024m -Xmx1024m ...

Note that the JVM process will consume more system memory than the specified heap size; for instance, a 1GB JVM will consume 1.3GB of memory. This should be taken into consideration when determining the maximum number of JVMs which you will run on a machine. The actual allocated size can be monitored with tools such as top. For additional details, see "Heap Size Considerations".
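A rough capacity estimate follows directly from this overhead. The following is a minimal sketch; the class name JvmCapacityEstimate is hypothetical, the 1.3 ratio is simply the example figure from the text (actual overhead varies by JVM and workload), and the 8192MB machine is an invented illustration. Memory needed by the operating system and other processes should be subtracted before calling it.

```java
// Hypothetical helper for illustration; not part of Coherence.
final class JvmCapacityEstimate {

    // Estimates how many JVMs fit in the given memory when each process
    // consumes heapMb * processToHeapRatio megabytes.
    static int maxJvms(long availableMemoryMb, long heapMb, double processToHeapRatio) {
        if (availableMemoryMb < 0 || heapMb <= 0 || processToHeapRatio <= 0) {
            throw new IllegalArgumentException("arguments must be positive");
        }
        double perJvmMb = heapMb * processToHeapRatio;
        return (int) Math.floor(availableMemoryMb / perJvmMb);
    }

    public static void main(String[] args) {
        // 8192MB available, 1GB heaps, 1.3x process overhead from the text's example.
        System.out.println(maxJvms(8192, 1024, 1.3)); // prints 6
    }
}
```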

20.3.3 GC Monitoring & Tuning

Frequent garbage collection pauses which are in the range of 100ms or more are likely to have a noticeable impact on network performance. During these pauses a Java application is unable to send or receive packets, and in the case of receiving, the operating system buffered packets may be discarded and need to be retransmitted.

Specify -verbose:gc or -Xloggc: on the JVM command line to monitor the frequency and duration of garbage collection pauses.

See http://java.sun.com/docs/hotspot/gc5.0/gc_tuning_5.html for details on GC tuning.

Starting with Coherence 3.2 log messages will be generated when one cluster node detects that another cluster node has been unresponsive for a period, generally indicating that a target cluster node was in a GC cycle.

Example 20-4 Message Indicating Target Cluster Node is in Garbage Collection Mode

Experienced a 4172 ms communication delay (probable remote GC) with Member(Id=7, Timestamp=2006-10-20 12:15:47.511, Address=192.168.0.10:8089, MachineId=13838); 320 packets rescheduled, PauseRate=0.31, Threshold=512

PauseRate indicates the percentage of time for which the node has been considered unresponsive since the stats were last reset. Nodes reported as unresponsive for more than a few percent of their lifetime may be worth investigating for GC tuning.
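When reviewing many log files, it can help to extract the delay and PauseRate values programmatically. The following is a minimal sketch that parses a message of the form shown in Example 20-4. The class name DelayMessageParser is hypothetical, and the pattern is based only on that single example, so other Coherence versions may format the message differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Coherence.
final class DelayMessageParser {

    // Captures the delay in milliseconds and the PauseRate value.
    private static final Pattern DELAY = Pattern.compile(
            "Experienced a (\\d+) ms communication delay.*?PauseRate=(\\d+(?:\\.\\d+)?)");

    // Returns {delayMillis, pauseRate}, or null if the line does not match.
    static double[] parse(String line) {
        Matcher m = DELAY.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new double[] {
            Double.parseDouble(m.group(1)),
            Double.parseDouble(m.group(2))
        };
    }

    public static void main(String[] args) {
        String sample = "Experienced a 4172 ms communication delay (probable remote GC) "
                + "with Member(Id=7, Timestamp=2006-10-20 12:15:47.511, "
                + "Address=192.168.0.10:8089, MachineId=13838); "
                + "320 packets rescheduled, PauseRate=0.31, Threshold=512";
        double[] parsed = parse(sample);
        System.out.println("delay=" + parsed[0] + " ms, pauseRate=" + parsed[1]);
    }
}
```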

20.4 Coherence Network Tuning

Coherence includes configuration elements for throttling the amount of traffic it will place on the network; see the documentation for <traffic-jam>, <flow-control>, and <burst-mode>. These settings are used to control the rate of packet flow within and between cluster nodes.

20.4.1 Validation

To determine how these settings are affecting performance, check whether your cluster nodes are experiencing packet loss or packet duplication. This can be determined by looking at the following JMX stats on various cluster nodes:

  • ClusterNodeMBean.PublisherSuccessRate: If less than 1.0, packets are being detected as lost and being resent. Rates below 0.98 may warrant investigation and tuning.

  • ClusterNodeMBean.ReceiverSuccessRate: If less than 1.0, the same packet is being received multiple times (duplicates), likely due to the publisher being overly aggressive in declaring packets as lost.

  • ClusterNodeMBean.WeakestChannel: Identifies the remote cluster node which the current node is having most difficulty communicating with.
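The interpretation of the two success rates can be sketched as a simple check. The class name PacketStatsCheck is hypothetical, and the method takes the rates as plain numbers on the assumption that they have already been read from JMX; the thresholds of 1.0 and 0.98 are the ones given in the list above.

```java
// Hypothetical helper for illustration; not part of Coherence.
final class PacketStatsCheck {

    // Summarizes what the publisher and receiver success rates suggest,
    // using the thresholds from the list above.
    static String assess(double publisherSuccessRate, double receiverSuccessRate) {
        StringBuilder report = new StringBuilder();
        if (publisherSuccessRate < 0.98) {
            report.append("publisher: loss may warrant investigation and tuning");
        } else if (publisherSuccessRate < 1.0) {
            report.append("publisher: some packets detected as lost and resent");
        } else {
            report.append("publisher: no loss detected");
        }
        report.append("; ");
        if (receiverSuccessRate < 1.0) {
            report.append("receiver: duplicate packets received");
        } else {
            report.append("receiver: no duplicates detected");
        }
        return report.toString();
    }

    public static void main(String[] args) {
        System.out.println(assess(0.97, 0.995));
    }
}
```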

For information on using JMX to monitor Coherence see Chapter 22, "How to Manage Coherence Using JMX."