Troubleshooting Oracle Coherence

Communication delays reported by the Coherence Packet Publisher

When Coherence experiences communication delays, it reports them in the following format:

2011-12-06 00:06:54.386/354363.956 Oracle Coherence GE 3.4.2/411
<D5> (thread=PacketPublisher, member=2): Experienced a 254 ms
communication delay (probable remote GC) with Member(Id=3, Timestamp=2011-12-01
21:47:00.977, Address=...); 39 packets rescheduled, PauseRate=0.0,
Threshold=711

As the log message itself suggests, the most likely cause is garbage collection (GC) on the target node, which prevented that node from responding within the expected time frame.
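When many of these messages accumulate, it can help to pull out the numeric fields and look at them per member. The following is a minimal sketch, assuming the messages always follow the layout of the single sample above; the pattern and the function name parse_delay_message are illustrative and not part of Coherence.

```python
import re

# Pattern inferred from the one sample message above (an assumption, not an
# official format definition). DOTALL lets it span the wrapped log lines.
DELAY_PATTERN = re.compile(
    r"Experienced a (?P<delay_ms>\d+) ms\s+communication delay"
    r".*?Member\(Id=(?P<member_id>\d+)"
    r".*?(?P<rescheduled>\d+) packets rescheduled"
    r",\s*PauseRate=(?P<pause_rate>[^,\s]+)"
    r",\s*Threshold=(?P<threshold>\d+)",
    re.DOTALL,
)


def parse_delay_message(text):
    """Return the delay fields as a dict, or None if the text does not match."""
    match = DELAY_PATTERN.search(text)
    if match is None:
        return None
    return {
        "delay_ms": int(match["delay_ms"]),
        "member_id": int(match["member_id"]),
        "rescheduled": int(match["rescheduled"]),
        "pause_rate": float(match["pause_rate"]),
        "threshold": int(match["threshold"]),
    }


# The sample message from this section, including its elided address.
sample = (
    "2011-12-06 00:06:54.386/354363.956 Oracle Coherence GE 3.4.2/411\n"
    "<D5> (thread=PacketPublisher, member=2): Experienced a 254 ms\n"
    "communication delay (probable remote GC) with Member(Id=3, Timestamp=2011-12-01\n"
    "21:47:00.977, Address=...); 39 packets rescheduled, PauseRate=0.0,\n"
    "Threshold=711"
)
print(parse_delay_message(sample))
```

Grouping the parsed results by member_id shows whether the delays concentrate on one node, which would point at that node's GC behavior.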

Socket Buffer Sizes

To help minimize packet loss, the OS socket buffers need to be large enough to handle the incoming network traffic while your Java application is paused during garbage collection. By default, Coherence attempts to allocate a socket buffer of 2MB. If the OS is not configured to allow large buffers, Coherence uses smaller buffers. Most versions of Unix have a very low default buffer limit, which should be increased to at least 2MB.

On Linux execute (as root):

sysctl -w net.core.rmem_max=2096304
sysctl -w net.core.wmem_max=2096304
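To confirm that the limits took effect, the current values can be read back from the kernel. The following is a minimal sketch, assuming a Linux host that exposes these settings under /proc/sys/net/core; the helper names read_limit and limit_is_sufficient are illustrative.

```python
from pathlib import Path

# The value used in the sysctl commands above.
REQUIRED_BYTES = 2096304


def read_limit(name, proc_root="/proc/sys/net/core"):
    """Read a kernel socket buffer limit in bytes; None if it cannot be read."""
    try:
        return int(Path(proc_root, name).read_text().strip())
    except (OSError, ValueError):
        return None


def limit_is_sufficient(value, required=REQUIRED_BYTES):
    """True when the limit is known and at least the required size."""
    return value is not None and value >= required


if __name__ == "__main__":
    for name in ("rmem_max", "wmem_max"):
        value = read_limit(name)
        status = "ok" if limit_is_sufficient(value) else "too small or unreadable"
        print(f"net.core.{name} = {value} ({status})")
```

Values set with sysctl -w do not survive a reboot, so a check like this is also useful after a restart of the host.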

Coherence Cache Logs

It is often helpful to enable debug logging of Coherence Get/Put information when analyzing issues in a testing or production environment. To do so, set the log level on the specific package as shown below:

<logger name="org.eclipse.persistence.logging.cache" level="debug" />

However, this logs the Coherence Get/Put information for all entities. To enable this logging only for a specific set of entities, use OhiCacheLoggingFilter in addition to enabling debug logging for the "org.eclipse.persistence.logging.cache" package:

<filter class="com.oracle.healthinsurance.support.cache.logging.logback.OhiCacheLoggingFilter">
  <entityNames>FlexCodeSystemDomain,FlexCodeSystemTranslationPersistable</entityNames>
</filter>

Pass the entities for which logging should be enabled as a comma-separated list in the entityNames parameter. When the log level for the package mentioned above is set to debug and the filter is also configured, Coherence Get/Put logging happens only for the entities listed in entityNames.

This should only be done in close cooperation with Oracle.

Coherence WebLogic Logging

If jobs that use Coherence run into problems, enabling Coherence logging helps with debugging. Refer to this page for details: https://docs.oracle.com/cd/E24290_01/coh.371/e22837/gs_debug.htm#COHDG5549

  • Set the following Java options when starting the WebLogic nodes:

JAVA_OPTIONS="${JAVA_OPTIONS} -Dtangosol.coherence.mode=prod"
JAVA_OPTIONS="${JAVA_OPTIONS} -Dtangosol.coherence.log=stdout"
JAVA_OPTIONS="${JAVA_OPTIONS} -Dtangosol.coherence.log.level=9"

9 is the finest level; 4 is the default.

  • Change the server logging in the WebLogic console:

Environments→Clusters→Server Templates. Click the server template for claims. Click the Logging tab. Go to the Advanced section. Change the Severity Level of 'Log File' to Trace.

The Coherence logs then appear in the node-specific logs.

Is further action required?

The PauseRate indicates the percentage of time for which the node has been considered unresponsive since the stats were last reset. Nodes reported as unresponsive for more than a few percent of their lifetime may be worth investigating for GC tuning.

  • Coherence JMX MBeans can be queried to determine whether cluster nodes are experiencing packet loss and/or packet duplication. Look at the following JMX stats on the various cluster nodes:

    • ClusterNodeMBean.PublisherSuccessRate - If less than 1.0, packets are being detected as lost and are being resent. Rates below 0.98 may warrant investigation and tuning.

    • ClusterNodeMBean.ReceiverSuccessRate - If less than 1.0, the same packet is being received multiple times (duplicates), likely because the publisher is overly aggressive in declaring packets as lost.

    • ClusterNodeMBean.WeakestChannel - Identifies the remote cluster node which the current node is having most difficulty communicating with.

Further investigation (and possibly tuning) is warranted only if the PublisherSuccessRate is below 0.98.
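The thresholds above can be turned into a simple check once the two rates have been read from JMX. The following is a minimal sketch, assuming the rates are already available as plain numbers; how they are retrieved is out of scope here, and the function name assess_node and the wording of its findings are illustrative.

```python
def assess_node(publisher_success_rate, receiver_success_rate):
    """Return a list of findings based on the thresholds described above."""
    findings = []
    if publisher_success_rate < 0.98:
        # Below 0.98: the level at which investigation is warranted.
        findings.append("publisher: packet loss, investigate and possibly tune")
    elif publisher_success_rate < 1.0:
        # Some packets are resent, but the rate is still acceptable.
        findings.append("publisher: some packets resent, no action required")
    if receiver_success_rate < 1.0:
        # Duplicates arrive; the publisher may be declaring loss too eagerly.
        findings.append("receiver: duplicate packets received")
    return findings


# Hypothetical readings for three nodes, for illustration only.
for rates in [(1.0, 1.0), (0.99, 0.995), (0.95, 1.0)]:
    print(rates, assess_node(*rates))
```

An empty result means neither rate indicates a problem for that node.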

Reporting on Coherence Performance

Oracle Health Insurance Components use Oracle Coherence for caching data and for grid processing. Coherence reports are most often used to identify trends that are valuable for troubleshooting and planning. The Coherence reporting feature is disabled by default and must be explicitly enabled in an operational override file, through system properties, or through JMX MBeans.

The Coherence documentation ("Use Coherence Reporting") specifies how to create these reports.