2 Platform-Specific Deployment Considerations

This chapter identifies issues that should be considered when deploying Coherence to various platforms and offers solutions where available.

2.1 Deploying to AIX

When deploying Coherence on AIX, be aware of the following:

2.1.1 Multicast and IPv6

AIX 5.2 and above default to running multicast over IPv6 rather than IPv4. If you run in a mixed IPv6/IPv4 environment, configure your JVMs to explicitly use IPv4. This can be done by setting the java.net.preferIPv4Stack system property to true on the Java command line. See the IBM 32-bit SDK for AIX User Guide for details.
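
For example, a cache server might be launched with the property set on the command line (a minimal sketch; the classpath and main class shown are illustrative):

java -Djava.net.preferIPv4Stack=true -cp coherence.jar com.tangosol.net.DefaultCacheServer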

2.1.2 Unique Multicast Addresses and Ports

On AIX, it is suggested that each Coherence cluster use a unique multicast address and port as some versions of AIX do not take both into account when delivering packets. See Developing Applications with Oracle Coherence for details on configuring the multicast address and port.
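
As a sketch, under Coherence 3.x releases the multicast address and port can be supplied as system properties on the command line (the property names and values shown are illustrative and may differ in your release):

java -Dtangosol.coherence.clusteraddress=224.3.2.1 -Dtangosol.coherence.clusterport=32367 -cp coherence.jar com.tangosol.net.DefaultCacheServer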

2.2 Deploying to Cisco Switches

When deploying Coherence with Cisco switches, be aware of the following:

2.2.1 Buffer Space and Packet Pauses

Under heavy UDP packet load, some Cisco switches may run out of buffer space and exhibit frequent multi-second communication pauses. These communication pauses can be identified by a series of Coherence log messages referencing communication delays with multiple nodes which cannot be attributed to local or remote GCs. For example:

Experienced a 4172 ms communication delay (probable remote GC) with Member(Id=7, Timestamp=2008-09-15 12:15:47.511, Address=xxx.xxx.x.xx:8089, MachineId=13838); 320 packets rescheduled, PauseRate=0.31, Threshold=512

The Cisco 6500 series supports configuring the amount of buffer space available to each Ethernet port or ASIC. In high-load applications, it may be necessary to increase the default buffer space. This can be accomplished by executing:

fabric buffer-reserve high

See Cisco's documentation for additional details on this setting.

2.2.2 Multicast Connectivity on Large Networks

Cisco's default switch configuration does not support proper routing of multicast packets between switches due to the use of IGMP snooping. See Cisco's documentation regarding the issue and solutions.

2.2.3 Multicast Outages

Some Cisco switches have shown difficulty in maintaining multicast group membership, resulting in existing multicast group members being silently removed from the multicast group. This causes a partial communication disconnect for the associated Coherence node(s), and they are forced to leave and rejoin the cluster. This type of outage can most often be identified by the following Coherence log messages, which indicate that a partial communication problem has been detected.

A potential network configuration problem has been detected. A packet has failed to be delivered (or acknowledged) after 60 seconds, although other packets were acknowledged by the same cluster member (Member(Id=3, Timestamp=Sat Sept 13 12:02:54 EST 2008, Address=xxx.xxx.x.xxx, Port=8088, MachineId=48991)) to this member (Member(Id=1, Timestamp=Sat Sept 13 11:51:11 EST 2008, Address=xxx.xxx.x.xxx, Port=8088, MachineId=49002)) as recently as 5 seconds ago.

To confirm the issue, run the multicast connectivity test using the same multicast address and port as the running cluster. If the issue affects a multicast test node, its logs show that it suddenly stopped receiving multicast test messages. See Chapter 4, "Performing a Multicast Connectivity Test".
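
For example, the bundled multicast test might be invoked as follows, assuming the cluster uses the group shown in the logs below (a sketch; see Chapter 4 for the full set of options):

java -cp coherence.jar com.tangosol.net.MulticastTest -group 224.3.2.0:32367 -ttl 4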

The following test logs show the issue:

Example 2-1 Log for a Multicast Outage

Test Node 192.168.1.100:

Sun Sept 14 16:44:22 GMT 2008: Received 83 bytes from a Coherence cluster node at 192.168.1.100: ???
Sun Sept 14 16:44:23 GMT 2008: Received test packet 76 from ip=/192.168.1.101, group=/224.3.2.0:32367, ttl=4.
Sun Sept 14 16:44:23 GMT 2008: Received 83 bytes from a Coherence cluster node at 192.168.1.100: ???
Sun Sept 14 16:44:23 GMT 2008: Sent packet 85.
Sun Sept 14 16:44:23 GMT 2008: Received test packet 85 from self.
Sun Sept 14 16:44:24 GMT 2008: Received 83 bytes from a Coherence cluster node at 192.168.1.100: ???
Sun Sept 14 16:44:25 GMT 2008: Received test packet 77 from ip=/192.168.1.101, group=/224.3.2.0:32367, ttl=4.
Sun Sept 14 16:44:25 GMT 2008: Received 83 bytes from a Coherence cluster node at 192.168.1.100: ???
Sun Sept 14 16:44:25 GMT 2008: Sent packet 86.
Sun Sept 14 16:44:25 GMT 2008: Received test packet 86 from self.
Sun Sept 14 16:44:26 GMT 2008: Received 83 bytes from a Coherence cluster node at 192.168.1.100: ???
Sun Sept 14 16:44:27 GMT 2008: Received test packet 78 from ip=/192.168.1.101, group=/224.3.2.0:32367, ttl=4.
Sun Sept 14 16:44:27 GMT 2008: Received 83 bytes from a Coherence cluster node at 192.168.1.100: ???
Sun Sept 14 16:44:27 GMT 2008: Sent packet 87.
Sun Sept 14 16:44:27 GMT 2008: Received test packet 87 from self.
Sun Sept 14 16:44:28 GMT 2008: Received 83 bytes from a Coherence cluster node at 192.168.1.100: ???
Sun Sept 14 16:44:29 GMT 2008: Received 83 bytes from a Coherence cluster node at 192.168.1.100: ???
Sun Sept 14 16:44:29 GMT 2008: Sent packet 88.
Sun Sept 14 16:44:29 GMT 2008: Received test packet 88 from self.
Sun Sept 14 16:44:30 GMT 2008: Received 83 bytes from a Coherence cluster node at 192.168.1.100: ???
Sun Sept 14 16:44:31 GMT 2008: Received 83 bytes from a Coherence cluster node at 192.168.1.100: ???
Sun Sept 14 16:44:31 GMT 2008: Sent packet 89.
Sun Sept 14 16:44:31 GMT 2008: Received test packet 89 from self.
Sun Sept 14 16:44:32 GMT 2008: Received 83 bytes from a Coherence cluster node at 192.168.1.100: ???
Sun Sept 14 16:44:33 GMT 2008: Received 83 bytes from a Coherence cluster node at 192.168.1.100: ???

Test Node 192.168.1.101:

Sun Sept 14 16:44:22 GMT 2008: Sent packet 76.
Sun Sept 14 16:44:22 GMT 2008: Received test packet 76 from self.
Sun Sept 14 16:44:22 GMT 2008: Received 83 bytes from a Coherence cluster node at 192.168.1.100: ???
Sun Sept 14 16:44:22 GMT 2008: Received test packet 85 from ip=/192.168.1.100, group=/224.3.2.0:32367, ttl=4.
Sun Sept 14 16:44:23 GMT 2008: Received 83 bytes from a Coherence cluster node at 192.168.1.100: ???
Sun Sept 14 16:44:24 GMT 2008: Sent packet 77.
Sun Sept 14 16:44:24 GMT 2008: Received test packet 77 from self.
Sun Sept 14 16:44:24 GMT 2008: Received 83 bytes from a Coherence cluster node at 192.168.1.100: ???
Sun Sept 14 16:44:24 GMT 2008: Received test packet 86 from ip=/192.168.1.100, group=/224.3.2.0:32367, ttl=4.
Sun Sept 14 16:44:25 GMT 2008: Received 83 bytes from a Coherence cluster node at 192.168.1.100: ???
Sun Sept 14 16:44:26 GMT 2008: Sent packet 78.
Sun Sept 14 16:44:26 GMT 2008: Received test packet 78 from self.
Sun Sept 14 16:44:26 GMT 2008: Received 83 bytes from a Coherence cluster node at 192.168.1.100: ???
Sun Sept 14 16:44:26 GMT 2008: Received test packet 87 from ip=/192.168.1.100, group=/224.3.2.0:32367, ttl=4.
Sun Sept 14 16:44:27 GMT 2008: Received 83 bytes from a Coherence cluster node at 192.168.1.100: ???
Sun Sept 14 16:44:28 GMT 2008: Sent packet 79.
Sun Sept 14 16:44:28 GMT 2008: Received test packet 79 from self.
Sun Sept 14 16:44:28 GMT 2008: Received 83 bytes from a Coherence cluster node at 192.168.1.100: ???
Sun Sept 14 16:44:28 GMT 2008: Received test packet 88 from ip=/192.168.1.100, group=/224.3.2.0:32367, ttl=4.
Sun Sept 14 16:44:29 GMT 2008: Received 83 bytes from a Coherence cluster node at 192.168.1.100: ???
Sun Sept 14 16:44:30 GMT 2008: Sent packet 80.
Sun Sept 14 16:44:30 GMT 2008: Received test packet 80 from self.
Sun Sept 14 16:44:30 GMT 2008: Received 83 bytes from a Coherence cluster node at 192.168.1.100: ???
Sun Sept 14 16:44:30 GMT 2008: Received test packet 89 from ip=/192.168.1.100, group=/224.3.2.0:32367, ttl=4.
Sun Sept 14 16:44:31 GMT 2008: Received 83 bytes from a Coherence cluster node at 192.168.1.100: ???
Sun Sept 14 16:44:32 GMT 2008: Sent packet 81.
Sun Sept 14 16:44:32 GMT 2008: Received test packet 81 from self.
Sun Sept 14 16:44:32 GMT 2008: Received 83 bytes from a Coherence cluster node at 192.168.1.100: ???
Sun Sept 14 16:44:32 GMT 2008: Received test packet 90 from ip=/192.168.1.100, group=/224.3.2.0:32367, ttl=4.
Sun Sept 14 16:44:33 GMT 2008: Received 83 bytes from a Coherence cluster node at 192.168.1.100: ???
Sun Sept 14 16:44:34 GMT 2008: Sent packet 82.

Note that at 16:44:27 the first test node stops receiving multicast packets from other computers. The operating system continues to properly forward multicast traffic from other processes on the same computer, but the test packets (79 and higher) from the second test node are not received. Also note that both the test packets and the cluster's multicast traffic generated by the first node do continue to be delivered to the second node. This indicates that the first node was silently removed from the multicast group.

If you encounter this multicast issue, it is suggested that you contact Cisco technical support, or you may consider changing your configuration to unicast-only by using the well known addresses feature. See Developing Applications with Oracle Coherence for details on using well known addresses.
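
As a sketch, a well known address configuration can be approximated with the Coherence 3.x system properties shown below (the address, port, and property names are illustrative and version-dependent):

java -Dtangosol.coherence.wka=192.168.1.100 -Dtangosol.coherence.wka.port=8088 -cp coherence.jar com.tangosol.net.DefaultCacheServer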

2.2.4 Multicast Time-to-Live

The Cisco 6500 series router may become overloaded if too many packets with a time-to-live (TTL) value of 1 are received. In addition, a low TTL setting may overload single group members. Set the Coherence multicast TTL setting to at least the size of the multicast domain (127 or 255) and make sure that clusters do not use overlapping groups. See Developing Applications with Oracle Coherence for details on configuring multicast TTL.
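
For example, under Coherence 3.x releases the TTL can be raised on the command line (a sketch; the property name is version-dependent):

java -Dtangosol.coherence.ttl=255 -cp coherence.jar com.tangosol.net.DefaultCacheServer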

2.3 Deploying to Foundry Switches

When deploying Coherence with Foundry switches, be aware of the following:

2.3.1 Multicast Connectivity

Foundry switches have been shown to exhibit difficulty in handling multicast traffic. When deploying with Foundry switches, ensure that all computers that are part of the Coherence cluster can communicate over multicast. See Chapter 4, "Performing a Multicast Connectivity Test".

If you encounter issues with multicast, you may consider changing your configuration to unicast-only by using the well known addresses feature. See Developing Applications with Oracle Coherence for details on using well known addresses.

2.4 Deploying to IBM BladeCenters

When deploying Coherence on IBM BladeCenters, be aware of the following:

2.4.1 MAC Address Uniformity and Load Balancing

A typical deployment on a BladeCenter may include blades with two NICs, where one is used for administration purposes and the other for cluster traffic. By default, the MAC addresses assigned to the blades of a BladeCenter are uniform enough that the first NIC generally has an even MAC address and the second has an odd MAC address. If the BladeCenter's uplink to a central switch also has an even number of channels, then layer 2 (MAC based) load balancing may prevent one set of NICs from making full use of the available uplink bandwidth, as they are all bound to either even or odd channels. This issue arises from the switch's assumption that MAC addresses are essentially random, which in BladeCenters is untrue. Remedies to this situation include:

  • Use layer 3 (IP based) load balancing (if the IP addresses do not follow the same even/odd pattern).

    • This setting must be applied across all switches carrying cluster traffic.

  • Randomize the MAC address assignments by swapping them between the first and second NIC on alternating computers.

    • Linux enables you to change a NIC's MAC address using the ifconfig command, as sketched following this list.

    • For other operating systems custom tools may be available to perform the same task.
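
For example, on Linux the second NIC's MAC address might be changed as follows (a sketch; the interface name and address are illustrative, and the interface must be down while the address is changed):

# take the interface offline, assign a new MAC address, and bring it back up
ifconfig eth1 down
ifconfig eth1 hw ether 02:00:4C:4F:4F:50
ifconfig eth1 up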

2.5 Deploying to IBM JVMs

When deploying Coherence on IBM JVMs, be aware of the following:

2.5.1 OutOfMemoryError

JVMs that experience an OutOfMemoryError can be left in an indeterminate state, which can have adverse effects on a cluster. We recommend configuring JVMs to exit upon encountering an OutOfMemoryError instead of allowing the JVM to attempt recovery. Here is the parameter to configure this setting on IBM JVMs:

UNIX:

-Xdump:tool:events=throw,filter=java/lang/OutOfMemoryError,exec="kill -9 %pid"

Windows:

-Xdump:tool:events=throw,filter=java/lang/OutOfMemoryError,exec="taskkill /F /PID %pid"

2.5.2 Heap Sizing

IBM does not recommend fixed size heaps for JVMs. In many cases, it is recommended to use the default for -Xms (in other words, omit this setting and only set -Xmx). See this link for more details:

http://www.ibm.com/developerworks/java/jdk/diagnosis/

It is recommended to configure the JVM to generate a heap dump if an OutOfMemoryError is thrown to assist the investigation into the root cause for the error. IBM JVMs generate a heap dump on OutOfMemoryError by default; no further configuration is required.

2.6 Deploying to Linux

When deploying Coherence on Linux, be aware of the following:

2.6.1 TSC High Resolution Timesource

Linux has several high-resolution timesources to choose from; the fastest, TSC (Time Stamp Counter), is unfortunately not always reliable. Linux chooses TSC by default and, during startup, checks for inconsistencies; if any are found, it switches to a slower, safe timesource. The slower timesources can be 10 to 30 times more expensive to query than the TSC timesource and may have a measurable impact on Coherence performance. For more details on TSC, see

https://lwn.net/Articles/209101/

Coherence and the underlying JVM are not aware of the timesource which the operating system is using. It is suggested that you check your system logs (/var/log/dmesg) to verify that the following is not present.

kernel: Losing too many ticks!
kernel: TSC cannot be used as a timesource.
kernel: Possible reasons for this are:
kernel:   You're running with Speedstep,
kernel:   You don't have DMA enabled for your hard disk (see hdparm),
kernel:   Incorrect TSC synchronization on an SMP system (see dmesg).
kernel: Falling back to a sane timesource now.

As the log messages suggest, this can be caused by a variable-rate CPU (SpeedStep), having DMA disabled, or incorrect TSC synchronization on multi-CPU computers. If these messages are present, it is suggested that you work with your system administrator to identify the cause and allow the TSC timesource to be used.
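
On kernels that expose the current clocksource through sysfs, a quick check might look like the following (a sketch; the sysfs path varies across kernel versions):

# report the timesource currently in use
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
# search the boot log for TSC-related fallback messages
grep -i tsc /var/log/dmesg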

2.7 Deploying to Oracle HotSpot JVMs

When deploying Coherence on Oracle HotSpot JVMs, be aware of the following:

2.7.1 Heap Sizes

Coherence recommends keeping heap sizes at 1-4GB per JVM. However, larger heap sizes, up to 20GB, are suitable for some applications where the simplified management of fewer, larger JVMs outweighs the performance benefits of many smaller JVMs. Using multiple cache servers allows a single computer to achieve higher capacities. With Oracle's JVMs, heap sizes beyond 4GB are reasonable, though GC tuning is still advisable to minimize long GC pauses. See Oracle's GC Tuning Guide for tuning details. It is also advisable to run with fixed-size heaps as this generally lowers GC times. See "JVM Tuning" for additional information.
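
For example, a fixed-size 4 GB heap might be specified as follows (a sketch; the size, classpath, and main class are illustrative):

java -Xms4g -Xmx4g -cp coherence.jar com.tangosol.net.DefaultCacheServer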

2.7.2 AtomicLong

When available, Coherence uses the highly concurrent AtomicLong class, which allows concurrent atomic updates to long values without requiring synchronization.

It is suggested to run in server mode to ensure that the stable and highly concurrent version can be used. To run the JVM in server mode include the -server option on the Java command line.
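
For example (the classpath and main class shown are illustrative):

java -server -cp coherence.jar com.tangosol.net.DefaultCacheServer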

2.7.3 OutOfMemoryError

JVMs that experience an OutOfMemoryError can be left in an indeterminate state, which can have adverse effects on a cluster. We recommend configuring JVMs to exit upon encountering an OutOfMemoryError instead of allowing the JVM to attempt recovery. Here is the parameter to configure this setting on Oracle HotSpot JVMs:

UNIX:

-XX:OnOutOfMemoryError="kill -9 %p"

Windows:

-XX:OnOutOfMemoryError="taskkill /F /PID %p"

Additionally, it is recommended to configure the JVM to generate a heap dump if an OutOfMemoryError is thrown to assist the investigation into the root cause of the error. Use the following flag to enable this feature on HotSpot JVMs:

-XX:+HeapDumpOnOutOfMemoryError

2.8 Deploying to OS X

When deploying Coherence on OS X, be aware of the following:

2.8.1 Multicast and IPv6

OS X defaults to running multicast over IPv6 rather than IPv4. If you run in a mixed IPv6/IPv4 environment, configure your JVMs to explicitly use IPv4. This can be done by setting the java.net.preferIPv4Stack system property to true on the Java command line.

2.8.2 Unique Multicast Addresses and Ports

On OS X, it is suggested that each Coherence cluster use a unique multicast address and port, as some versions of OS X do not take both into account when delivering packets. See Developing Applications with Oracle Coherence for details on configuring the multicast address and port.

2.8.3 Socket Buffer Sizing

Generally, Coherence prefers 2MB or higher buffers, but for OS X this may result in unexpectedly high kernel CPU time, which in turn reduces throughput. For OS X, the suggested buffer size is 768KB, though your own tuning may find a better size. See Developing Applications with Oracle Coherence for details on configuring socket buffers.
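
At the operating-system level, the kernel's ceiling on socket buffer sizes can be inspected and, if necessary, raised with sysctl; a sketch (the value shown is illustrative, and the change does not persist across reboots):

# report the current maximum socket buffer size in bytes
sysctl kern.ipc.maxsockbuf
# raise the ceiling for the running system
sudo sysctl -w kern.ipc.maxsockbuf=8388608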

2.9 Deploying to Solaris

When deploying Coherence on Solaris, be aware of the following:

2.9.1 Solaris 10 (x86 and SPARC)

When running on Solaris 10, there are known issues related to packet corruption and multicast disconnections. These most often manifest as either EOFExceptions, "Large gap" warnings while reading packet data, or frequent packet timeouts. It is highly recommended that the patches for both issues below be applied when using Coherence on Solaris 10 systems.

Possible Data Integrity Issues on Solaris 10 Systems Using the e1000g Driver for the Intel Gigabit Network Interface Card (NIC)

https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&doctype=ALERT&id=1000972.1

IGMP(1) Packets do not Contain IP Router Alert Option When Sent From Solaris 10 Systems With Patch 118822-21 (SPARC) or 118844-21 (x86/x64) or Later Installed

https://support.oracle.com/CSP/main/article?cmd=show&type=NOT&doctype=ALERT&id=1000940.1

2.9.2 Solaris 10 Networking

If running on Solaris 10, review the issues described in "Solaris 10 (x86 and SPARC)" above, which relate to packet corruption and multicast disconnections, and apply the patches for both issues before using Coherence on Solaris 10 systems.

2.10 Deploying to Virtual Machines

Oracle Coherence follows the support policies of Oracle Fusion Middleware. See the following link for Oracle Fusion Middleware supported virtualization and partitioning technologies:

http://www.oracle.com/technetwork/middleware/ias/oracleas-supported-virtualization-089265.html

When deploying Coherence to virtual machines, be aware of the following:

2.10.1 Multicast Connectivity

Virtualization adds another layer to your network topology and it must be properly configured to support multicast networking. See Developing Applications with Oracle Coherence for detailed information on configuring multicast networking.

2.10.2 Performance

It is less likely that a process running in a virtualized operating system can fully use gigabit Ethernet. This is not specific to Coherence and is visible on most network-intensive virtualized applications.

2.10.3 Fault Tolerance

Additional configuration is required to ensure that cache entry backups reside on physically separate hardware. See Developing Applications with Oracle Coherence for detailed information on configuring cluster member identity.

2.11 Deploying to Windows

When deploying Coherence on Windows, be aware of the following:

2.11.1 Performance Tuning

Out of the box, Windows is not optimized for background processes and heavy network loads. This can be addressed by running the optimize.reg script included in the Coherence installation's bin directory. See "Operating System Tuning" for details on the optimizations that are performed.

2.11.2 Personal Firewalls

If running a firewall on a computer, you may have difficulties in forming a cluster consisting of multiple computers. This can be resolved by any of the following:

  • Disabling the firewall, though this is generally not recommended.

  • Granting full network access to the Java executable which runs Coherence.

  • Opening up individual addresses and ports for Coherence.

    By default, Coherence uses TCP and UDP ports starting at 8088; subsequent nodes on the same computer use increasing port numbers. Coherence may also communicate over multicast; the default address and port differ between releases and are based on the version number of the release. See Developing Applications with Oracle Coherence for details on configuring multicast and unicast. An example firewall rule is sketched after this list.
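
For example, on Windows versions that include the advanced firewall, inbound rules for the first node's default port might be added as follows (a sketch; the rule names and port are illustrative):

netsh advfirewall firewall add rule name="Coherence UDP" dir=in action=allow protocol=UDP localport=8088
netsh advfirewall firewall add rule name="Coherence TCP" dir=in action=allow protocol=TCP localport=8088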

2.11.3 Disconnected Network Interface

On Microsoft Windows, if the network interface card (NIC) is unplugged from the network, the operating system invalidates the associated IP address. The effect of this is that any socket which is bound to that IP address enters an error state. This results in the Coherence nodes exiting the cluster and residing in an error state until the NIC is reattached to the network. In cases where it is desirable to allow multiple collocated JVMs to remain clustered during a physical outage, Windows must be configured not to invalidate the IP address.

To adjust this parameter:

  1. Run Registry Editor (regedit)

  2. Locate the following registry key

    HKLM\System\CurrentControlSet\Services\Tcpip\Parameters
    
  3. Add or reset the following DWORD value

    Name: DisableDHCPMediaSense
    Value: 1 (boolean)
    
  4. Reboot
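
Equivalently, steps 2 and 3 can be performed from an elevated command prompt (a sketch; the /f flag overwrites an existing value, and a reboot is still required afterward):

reg add HKLM\System\CurrentControlSet\Services\Tcpip\Parameters /v DisableDHCPMediaSense /t REG_DWORD /d 1 /f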

While the name of the value includes DHCP, the setting affects both static and dynamic IP addresses. See Microsoft Windows TCP/IP Implementation Details for additional information:

http://technet.microsoft.com/en-us/library/bb726981.aspx#EDAA

2.12 Deploying to z/OS

When deploying Coherence on z/OS, be aware of the following:

2.12.1 EBCDIC

When deploying Coherence into environments where the default character set is EBCDIC rather than ASCII, ensure that Coherence configuration files that are loaded from JAR files or from the classpath are in ASCII format. Configuration files loaded directly from the file system should be stored in the system's native EBCDIC format.
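
For example, on z/OS UNIX System Services a configuration file might be converted from EBCDIC to ASCII with iconv before packaging it into a JAR (a sketch; the code page names depend on your system):

# convert the native EBCDIC file to ASCII for inclusion in a JAR
iconv -f IBM-1047 -t ISO8859-1 coherence-cache-config.xml > coherence-cache-config.ascii.xml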

2.12.2 Multicast

Under some circumstances, Coherence cluster nodes that run within the same logical partition (LPAR) on z/OS on IBM zSeries cannot communicate with each other. (This problem does not occur on the zSeries when running on Linux.)

The root cause is that z/OS may bind the MulticastSocket that Coherence uses to an automatically-assigned port, but Coherence requires the use of a specific port in order for cluster discovery to operate correctly. (Coherence does explicitly initialize the java.net.MulticastSocket to use the necessary port, but that information appears to be ignored on z/OS when there is an instance of Coherence running within that same LPAR.)

The solution is to run only one instance of Coherence within a z/OS LPAR; if multiple instances are required, each instance of Coherence should be run in a separate z/OS LPAR. Alternatively, well known addresses may be used. See Developing Applications with Oracle Coherence for details on using well known addresses.