Troubleshooting Oracle Coherence

Communication Delays Reported By the Coherence Packet Publisher

If Coherence experiences communication delays, then it will report these in the following format:

2011-12-06 00:06:54.386/354363.956 Oracle Coherence GE 3.4.2/411
<D5> (thread=PacketPublisher, member=2): Experienced a 254 ms
communication delay (probable remote GC) with Member(Id=3, Timestamp=2011-12-01
21:47:00.977, Address=...); 39 packets rescheduled, PauseRate=0.0,
Threshold=711

As the log message itself states, the most likely cause is garbage collection (GC) on the target node, which prevented it from responding within the expected time frame.
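To see how often and how severely a node is affected, the delay messages can be pulled out of a log file. The following is a minimal sketch that assumes each message sits on a single line in the format shown above; the function name `coherence_delays` is introduced here for illustration only.

```shell
# Print "<delay_ms> <remote_member_id> <packets_rescheduled>" for every
# communication-delay message read from standard input.
# Assumption: each log message is on one line, in the format shown above.
coherence_delays() {
  sed -n -E 's/.*Experienced a ([0-9]+) ms communication delay.*with Member\(Id=([0-9]+),.*; ([0-9]+) packets rescheduled.*/\1 \2 \3/p'
}
```

For example, `coherence_delays < server.log | sort -rn | head` would list the longest delays first, where `server.log` is a placeholder for the actual log file.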

Socket Buffer Sizes

To help minimize packet loss, the Operating System (OS) socket buffers need to be large enough to handle the incoming network traffic while the Java application pauses for garbage collection. By default, Coherence will attempt to allocate a socket buffer of 2MB. Coherence will use smaller buffers if the OS does not allow for such large buffers. It is important to increase the default buffer limit to at least 2MB, as most versions of Unix have a very low value.

On Linux execute (as root):

sysctl -w net.core.rmem_max=2096304
sysctl -w net.core.wmem_max=2096304
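Before changing anything, it can help to check whether the current limits are already large enough. The sketch below assumes a Linux host that exposes the limits under `/proc/sys/net/core`; `required` and `buffer_ok` are names introduced for illustration.

```shell
# Minimum buffer limit, matching the value used in the sysctl commands above.
required=2096304

# Succeed (exit status 0) when the given limit meets the required minimum.
buffer_ok() {
  [ "$1" -ge "$required" ]
}

# Report the current receive and send buffer limits.
for key in rmem_max wmem_max; do
  # Fall back to 0 when the file cannot be read (for example, not on Linux).
  current=$(cat "/proc/sys/net/core/$key" 2>/dev/null || echo 0)
  if buffer_ok "$current"; then
    echo "net.core.$key=$current is sufficient"
  else
    echo "net.core.$key=$current is below $required"
  fi
done
```

Values set with `sysctl -w` do not survive a reboot. To make them permanent, the same keys are typically added to `/etc/sysctl.conf`.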

Coherence Cache Logs

It is often helpful to enable debug logging of Coherence GET/PUT operations when analyzing issues in a testing or production environment. To do so, set the log level on the specific package, as shown below:

<logger name="org.eclipse.persistence.logging.cache" level="debug" />

However, this logs the Coherence GET/PUT information for all entities. To log only a specific set of entities, use OhiCacheLoggingFilter in addition to enabling debug logging for the org.eclipse.persistence.logging.cache package:

<filter class="com.oracle.healthinsurance.support.cache.logging.logback.OhiCacheLoggingFilter">
  <entityNames>FlexCodeSystemDomain,FlexCodeSystemTranslationPersistable</entityNames>
</filter>

To enable logging for a list of entities, pass them as a comma-separated value in the entityNames parameter. Note that when the log level is debug for the aforementioned package and the filter is also set, Coherence GET/PUT logging occurs only for the entities listed in entityNames.

It is important to do this in close cooperation with Oracle.

Coherence WebLogic Logging

Enabling Coherence logging helps with debugging when something goes wrong with jobs that use Coherence. Refer to this page for details: https://docs.oracle.com/cd/E24290_01/coh.371/e22837/gs_debug.htm#COHDG5549.

  • Set the following java options while starting the WebLogic nodes:

JAVA_OPTIONS="${JAVA_OPTIONS} -Dtangosol.coherence.mode=prod"
JAVA_OPTIONS="${JAVA_OPTIONS} -Dtangosol.coherence.log=stdout"
JAVA_OPTIONS="${JAVA_OPTIONS} -Dtangosol.coherence.log.level=9"

Nine is the finest (most verbose) level; four is the default.

  1. To change the server logging in the WebLogic console, go to Environments and click Clusters. Then, click Server Templates.

  2. Select the server template for Claims.

  3. Select the Logging tab.

  4. Click the Advanced section.

  5. Change the Severity Level of Log File to Trace. Coherence logs will be present in node-specific logs.

Is Further Action Required?

  • It may be worthwhile to check the node's PauseRate, as explained in the Coherence Performance Tuning Guide.

  • The PauseRate shows the percentage of time for which the node has been unresponsive since the stats were last reset.

  • Nodes that report being unresponsive for more than a few percent of their lifetime may be worth investigating for GC tuning.

  • Query the Coherence Java Management Extensions (JMX) MBeans to determine whether cluster nodes are experiencing packet loss or packet duplication. Check the following JMX attributes on the various cluster nodes:

    • ClusterNodeMBean.PublisherSuccessRate - If less than 1.0, the node is detecting packets as lost and resending them. Rates below 0.98 may warrant investigation and tuning.

    • ClusterNodeMBean.ReceiverSuccessRate - If less than 1.0, the node is receiving the same packet multiple times (duplicates). This is likely if the publisher is being overly aggressive in declaring packets as lost.

    • ClusterNodeMBean.WeakestChannel - Identifies the remote cluster node with which the current node is having the most difficulty communicating.

Further investigation (and possibly tuning) is warranted only if the PublisherSuccessRate is below 0.98.
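The 0.98 threshold can be applied to a set of collected values as sketched below. This assumes the rates have already been read from JMX (for example, with a JMX console) and saved as text, one node per line, in the hypothetical format `<node id> <PublisherSuccessRate> <ReceiverSuccessRate>`; the function name `flag_lossy_nodes` is likewise illustrative.

```shell
# Read "<node id> <PublisherSuccessRate> <ReceiverSuccessRate>" lines from
# standard input and print the nodes whose publisher rate is below 0.98.
# Assumption: this input format is a hypothetical export, not a Coherence format.
flag_lossy_nodes() {
  awk '$2 + 0 < 0.98 { print "node " $1 ": PublisherSuccessRate=" $2 " (investigate)" }'
}
```

Nodes at or above 0.98 produce no output, so an empty result means no node needs further investigation on this criterion.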

Reporting on Coherence Performance

Oracle Health Insurance applications make use of Oracle Coherence for caching data and for grid processing. Coherence reports are useful in identifying trends that are valuable for troubleshooting and planning. The Coherence reporting feature is disabled by default. It must be enabled explicitly using an operational override file, system properties, or JMX MBeans.

The Coherence Documentation specifies how to create these reports.
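As an illustration of the system-property route, the sketch below extends the WebLogic start options in the same way as the logging options earlier in this section. The two property names are an assumption based on the tangosol.coherence.* management properties of Coherence 3.x and should be verified against the Coherence documentation for the installed version.

```shell
# Assumption: these property names apply to the installed Coherence version.
# Enable Coherence JMX management on the node; reporting depends on it.
JAVA_OPTIONS="${JAVA_OPTIONS} -Dtangosol.coherence.management=all"
# Start the reporter automatically when the node starts.
JAVA_OPTIONS="${JAVA_OPTIONS} -Dtangosol.coherence.management.report.autostart=true"
```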