Oracle® Coherence Release Notes for Oracle Coherence
Release 3.5.1

Part Number E15433-01

2 Documentation Errata

This chapter describes changes, enhancements, and corrections made to the Oracle Coherence documentation library for 3.5.1. The library can be found at the following URL:

http://download.oracle.com/docs/cd/E14526_01/index.htm

2.1 Correction to "Coherence*Web Session Management Features"

Web applications that use different sticky optimization and locking settings should not be intermixed within the same cluster. With that in mind, the following note has been added to the Session Models section of the Coherence*Web Session Management Features chapter in the User's Guide for Oracle Coherence*Web.

Note:

In general, Web applications that are part of the same Coherence cluster must use the same session model type. Inconsistent configurations may result in deserialization errors.

The following note has been added to the Session Locking Modes section of the Coherence*Web Session Management Features chapter in the User's Guide for Oracle Coherence*Web.

Note:

In general, Web applications that are part of the same Coherence cluster must use the same locking mode and sticky session optimizations setting. Inconsistent configurations may result in deadlock.

2.2 Correction to "Read-Through, Write-Through, Write-Behind, and Refresh-Ahead Caching"

This section describes the changes made to the Read-Through, Write-Through, Write-Behind, and Refresh-Ahead Caching chapter in Getting Started with Oracle Coherence. Corrected text appears in italics.

Table 2-1 illustrates the changes made to the Write-Behind Caching section.

Table 2-1 Changes Made to the Write-Behind Caching Description

Old Text:

In the *Write-Behind* scenario, modified cache entries are asynchronously written to the datasource after a configurable delay, whether after 10 seconds, 20 minutes, a day or even a week or longer. For *Write-Behind* caching, Coherence maintains a write-behind queue of the data that needs to be updated in the datasource. When the application updates X in the cache, X is added to the write-behind queue (if it isn't there already; otherwise, it is replaced), and after the specified write-behind delay Coherence will call the CacheStore to update the underlying datasource with the latest state of X. Note that the write-behind delay is relative to the first of a series of modifications—in other words, the data in the datasource will never lag behind the cache by more than the write-behind delay.

New Text:

In the *Write-Behind* scenario, modified cache entries are asynchronously written to the datasource after a configurable delay, whether after 10 seconds, 20 minutes, a day or even a week or longer. *Note that this only applies to cache inserts and updates—cache entries are removed synchronously from the datasource.* For *Write-Behind* caching, Coherence maintains a write-behind queue of the data that needs to be updated in the datasource. When the application updates X in the cache, X is added to the write-behind queue (if it isn't there already; otherwise, it is replaced), and after the specified write-behind delay Coherence will call the CacheStore to update the underlying datasource with the latest state of X. Note that the write-behind delay is relative to the first of a series of modifications—in other words, the data in the datasource will never lag behind the cache by more than the write-behind delay.
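As background for the corrected passage, the following minimal CacheStore sketch shows where the write-behind calls land. It is illustrative only: the class name and the in-memory Map standing in for a real datasource are hypothetical, and the com.tangosol.net.cache.CacheStore interface (load, loadAll, store, storeAll, erase, eraseAll) is assumed.

    import com.tangosol.net.cache.CacheStore;
    import java.util.Collection;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    // Sketch only: a real implementation would talk to a database;
    // a (non-thread-safe) Map stands in for the datasource here.
    public class SimpleCacheStore implements CacheStore {
        private final Map datasource = new HashMap();

        public Object load(Object key) {
            return datasource.get(key);
        }

        public Map loadAll(Collection keys) {
            Map result = new HashMap();
            for (Iterator i = keys.iterator(); i.hasNext(); ) {
                Object key = i.next();
                Object value = datasource.get(key);
                if (value != null) {
                    result.put(key, value);
                }
            }
            return result;
        }

        public void store(Object key, Object value) {
            // Invoked after the write-behind delay for inserts and updates.
            datasource.put(key, value);
        }

        public void storeAll(Map entries) {
            datasource.putAll(entries);
        }

        public void erase(Object key) {
            // Removes are applied to the datasource synchronously.
            datasource.remove(key);
        }

        public void eraseAll(Collection keys) {
            datasource.keySet().removeAll(keys);
        }
    }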


2.3 Correction to "Managing Map Operations with Triggers"

In the "Managing Map Operations with Triggers" section of the Developer's Guide for Oracle Coherence, the introduction to Example 4-1 incorrectly states that the createMapTrigger method returns a new MapTriggerListener(new MyCustomTrigger());.

The correct method name is createTriggerListener, not createMapTrigger.
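For reference, the corrected pattern registers a trigger with a cache through a MapTriggerListener. The following minimal sketch is illustrative only; the trigger class, cache name, and uppercasing logic are hypothetical and not taken from the guide.

    import com.tangosol.net.CacheFactory;
    import com.tangosol.net.NamedCache;
    import com.tangosol.util.MapTrigger;
    import com.tangosol.util.MapTriggerListener;

    // Hypothetical trigger that normalizes String values before they are committed.
    public class MyCustomTrigger implements MapTrigger {
        public void process(MapTrigger.Entry entry) {
            Object value = entry.getValue();
            if (value instanceof String) {
                entry.setValue(((String) value).toUpperCase());
            }
        }

        // Registration, following the pattern described by the corrected text.
        public static void main(String[] args) {
            NamedCache cache = CacheFactory.getCache("MyCache");
            cache.addMapListener(new MapTriggerListener(new MyCustomTrigger()));
        }
    }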

2.4 Correction to the cache-config.dtd

The pre-3.5.1 cache-config.dtd was not well-formed: it was missing a comma (at line 445) between the thread-count? and task-hung-threshold? child elements in the proxy-scheme element definition, as shown in the following fragment:

...
    <!ELEMENT proxy-scheme
    (scheme-name?, scheme-ref?, service-name?, thread-count? 
    task-hung-threshold?, task-timeout?, request-timeout?, acceptor-config?,
    proxy-config?, autostart?)>
...

This has been fixed for the 3.5.1 release.
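With the missing comma restored, the proxy-scheme definition reads:

...
    <!ELEMENT proxy-scheme
    (scheme-name?, scheme-ref?, service-name?, thread-count?,
    task-hung-threshold?, task-timeout?, request-timeout?, acceptor-config?,
    proxy-config?, autostart?)>
...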

2.5 Addition to the address-provider Subelement Description

The following addition was made to the description of the address-provider subelement of well-known-addresses in the Developer's Guide for Oracle Coherence:

The calling component will attempt to obtain the full list upon node startup; the provider must therefore return a terminating null address to indicate that all available addresses have been returned.
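For illustration only, the following sketch shows a provider that satisfies this requirement. The class name, addresses, and port are hypothetical, and the com.tangosol.net.AddressProvider interface (getNextAddress, accept, reject) is assumed.

    import com.tangosol.net.AddressProvider;
    import java.net.InetSocketAddress;
    import java.util.Arrays;
    import java.util.Iterator;
    import java.util.List;

    // Returns a fixed list of well-known addresses, then a terminating null.
    public class FixedAddressProvider implements AddressProvider {
        private final List addresses = Arrays.asList(
                new InetSocketAddress("192.168.0.1", 8088),
                new InetSocketAddress("192.168.0.2", 8088));
        private final Iterator iter = addresses.iterator();

        public InetSocketAddress getNextAddress() {
            // Null signals that all available addresses have been returned.
            return iter.hasNext() ? (InetSocketAddress) iter.next() : null;
        }

        public void accept() {
            // The last returned address was usable; nothing to record in this sketch.
        }

        public void reject(Throwable eCause) {
            // The last returned address was unusable; a real provider might log eCause.
        }
    }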

2.6 Addition to the backing-map-scheme Subelement Description

The following addition was made to the description of the backing-map-scheme subelement of the distributed-scheme element in the Developer's Guide for Oracle Coherence. Added text appears in italics.

Table 2-2 Changes Made to the backing-map-scheme Description

Old Text:

Note that when using an overflow-based backing map it is important that the corresponding <backup-storage> be configured for overflow (potentially using the same scheme as the backing-map). See "Partitioned Cache with Overflow" for an example configuration.

New Text:

When using an off-heap backing map it is important that the corresponding <backup-storage> be configured for off-heap (potentially using the same scheme as the backing-map). *Here, off-heap refers to any storage where some or all entries are stored outside of the JVM's garbage-collected heap space. Examples include overflow-scheme and external-scheme.* See "Partitioned Cache with Overflow" for an example configuration.


2.7 Sizing Considerations for Coherence Cluster JVMs

This section discusses JVM sizing considerations for Coherence cluster JVMs. The primary issue to consider when sizing your JVMs is achieving a balance of available RAM versus garbage collection (GC) pause times.

GC Pauses

Lengthy GC pause times can negatively impact the Coherence cluster as they are, for the most part, indistinguishable from node death. For this reason, it is very important that cluster nodes are sized and/or tuned to ensure that their GC times remain minimal. As a good rule of thumb, a node should spend less than 10% of its time paused in GC, normal GC times should be under 100ms, and maximum GC times should be around 1 second.

You can monitor GC activity in several ways; standard mechanisms include the JVM's GC logging switches (for example, -verbose:gc or -Xloggc), JMX-based tools such as JConsole, and the platform MXBeans exposed through the java.lang.management API.
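For example, the following generic sketch (standard java.lang.management API, not a Coherence-specific tool) prints cumulative collection counts and times for the local JVM; sampled periodically, these values can be used to estimate the percentage of time spent in GC.

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;
    import java.util.Iterator;
    import java.util.List;

    // Prints cumulative GC statistics for every collector in this JVM.
    public class GcReport {
        public static void main(String[] args) {
            List beans = ManagementFactory.getGarbageCollectorMXBeans();
            for (Iterator i = beans.iterator(); i.hasNext(); ) {
                GarbageCollectorMXBean gc = (GarbageCollectorMXBean) i.next();
                System.out.println(gc.getName()
                        + ": collections=" + gc.getCollectionCount()
                        + ", totalTimeMillis=" + gc.getCollectionTime());
            }
        }
    }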

2.7.1 Basic Sizing Recommendations

If you are looking to just get things up and running with minimal effort, the following recommendations should suffice.

2.7.1.1 Cache Servers

The standard safe recommendation for Coherence cache servers is to run a fixed-size heap of up to 1GB. Additionally, it is recommended to use an incremental garbage collector to minimize GC pause durations, and to run the JVM in "server" mode to encourage optimizations for long-running processes.

For example, the following command allows for good performance without the need for more elaborate JVM tuning:

java -server -Xms1g -Xmx1g -Xincgc -Xloggc: -cp coherence.jar com.tangosol.net.DefaultCacheServer

2.7.1.2 TCMP Clients

Coherence TCMP clients should be configured similarly to cache servers, because long GC pauses could cause them to be misidentified as dead.

2.7.1.3 Extend Clients

Coherence Extend clients are not, technically speaking, cluster members, so the effect of long GCs is less detrimental. For Extend clients, it is recommended that you follow the existing guidelines set forth by the application in which you are embedding Coherence.

2.7.1.4 Storage Ratios

There is a related question of how much data you can store within a cache server of a given size. The basic recommendation is to use up to one-third of the heap for primary cache storage. This leaves another one-third for backup storage, and the final one-third for "scratch space". Scratch space is used for holding classes and temporary objects, for network transfer buffers, and for GC compaction. You may instruct Coherence to limit primary storage on a per-cache basis by using the <high-units> element and specifying BINARY in the <unit-calculator> element. These settings are automatically applied to backup storage as well.

Ideally, both the primary and backup storage will also fit within the JVM's tenured space (for HotSpot-based JVMs). See HotSpot's Tuning Garbage Collection guide for details on sizing the collector generations.

See the Developer's Guide for Oracle Coherence for more information on the <high-units> and <unit-calculator> elements.

2.7.1.5 Running with Large Heaps

It is possible to run cache servers with larger heap sizes, although it becomes more important to monitor and tune the JVMs to minimize GC pauses. It may also be necessary to alter the storage ratios so that the amount of scratch space is increased to facilitate faster GC compactions. Additionally, it is recommended that you use an up-to-date JVM version, such as HotSpot 1.6, because it includes significant improvements for managing large heaps.

2.7.1.6 Using Available System Memory

Running multiple identical cache server instances on a single machine enables you to use the available system memory. It is important not to overcommit the available resources: if you have a machine with 16GB of RAM, it is not reasonable to attempt to dedicate all 16GB to your JVMs. Ultimately, when all of the machine's processes are running, swap space should not be actively used.

In selecting the size and number of JVMs to run, realize that a JVM process uses more memory than the configured heap size. The heap size settings specify the amount of heap that the JVM makes available to the application, but the JVM itself also consumes additional memory. The amount consumed differs depending on the OS and JVM settings. For instance, a HotSpot JVM running on Linux configured with a 1GB heap consumes roughly 1.2GB of RAM. It is important that you externally measure the JVMs' memory utilization to ensure that you do not overcommit your RAM. Tools such as top, vmstat, and Task Manager are useful in identifying how much RAM is actually being used.

2.8 Additions to Log and Error Messages

The following frequently encountered messages have been documented for the 3.5.1 release.

2.8.1 Additions to Partitioned Cache Service Log Messages

The following partitioned cache service messages have been documented for the 3.5.1 release.

validatePolls: This service timed-out due to unanswered handshake request. Manual intervention is required to stop the members that have not responded to this Poll
Parameters: none; Severity: 1—Error
Cause: When a node joins a clustered service, it performs a handshake with each clustered node running the service. A missing handshake response prevents this node from joining the service. Most commonly, it is caused by an unresponsive (for example, deadlocked) service thread.
Action: Corrective action may require locating and shutting down the JVM running the unresponsive service. See My Oracle Support (Metalink) Note 845363.1:

https://metalink.oracle.com/CSP/main/article?cmd=show&type=NOT&id=845363.1

Error while starting cluster: com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(%s)
Parameters: %s—information on the service that could not be started; Severity: 1—Error
Cause: When joining a service, every service in the cluster must respond to the join request. If one or more nodes have a service that does not respond within the timeout period, the join times out.
Action: See My Oracle Support (Metalink) Note 845363.1:

https://metalink.oracle.com/CSP/main/article?cmd=show&type=NOT&id=845363.1

Failed to restart services: com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(%s)
Parameters: %s—information on the service that could not be started; Severity: 1—Error
Cause: When joining a service, every service in the cluster must respond to the join request. If one or more nodes have a service that does not respond within the timeout period, the join times out.
Action: See My Oracle Support (Metalink) Note 845363.1:

https://metalink.oracle.com/CSP/main/article?cmd=show&type=NOT&id=845363.1

2.8.2 Additions to TCMP Log Messages

The following TCMP log messages have been documented for the 3.5.1 release.

This node appears to have partially lost the connectivity: it receives responses from MemberSet(%s1) which communicate with Member(%s2), but is not responding directly to this member; that could mean that either requests are not coming out or responses are not coming in; stopping cluster service.
Parameters: %s1—set of members that can communicate with the member indicated in %s2; %s2—member that can communicate with set of members indicated in %s1. Severity: 1—Error
Cause: The communication link between this member and the member indicated by %s2 has been broken. However, the set of witnesses indicated by %s1 report no communication issues with %s2. It is therefore assumed that this node is in a state of partial failure, thus resulting in the shutdown of its cluster threads.
Action: Corrective action is not necessarily required, since the rest of the cluster presumably is continuing its operation and this node may recover and rejoin the cluster. On the other hand, it may warrant an investigation into root causes of the problem (especially if it is recurring with some frequency).
validatePolls: This senior encountered an overdue poll, indicating a dead member, a significant network issue or an Operating System threading library bug (e.g. Linux NPTL): Poll
Parameters: none. Severity: 2—Warning
Cause: When a node joins a cluster, it performs a handshake with each cluster node. A missing handshake response prevents this node from joining the service. The log message following this one will indicate the corrective action taken by this node.
Action: If this message reoccurs, further investigation into the root cause may be warranted.
Received panic from senior Member(%s1) caused by Member(%s2)
Parameters: %s1—the cluster senior member as known by this node; %s2—a member claiming to be the senior member. Severity: 1—Error
Cause: This occurs after a cluster is split into multiple cluster islands (usually due to a network link failure.) When a link is restored and the corresponding island seniors see each other, the panic protocol is initiated to resolve the conflict.
Action: If this issue occurs frequently, the root cause of the cluster split should be investigated.
A potential communication problem has been detected. A packet has failed to be delivered (or acknowledged) after %n1 seconds, although other packets were acknowledged by the same cluster member (Member(%s1)) to this member (Member(%s2)) as recently as %n2 seconds ago. Possible causes include network failure, poor thread scheduling (see FAQ if running on Windows), an extremely overloaded server, a server that is attempting to run its processes using swap space, and unreasonably lengthy GC times.
Parameters: %n1—The number of seconds a packet has failed to be delivered or acknowledged; %s1—the recipient of the packets indicated in the message; %s2 —the sender of the packets indicated in the message; %n2—the number of seconds since a packet was delivered successfully between the two members indicated above. Severity: 2—Warning
Cause: Possible causes are indicated in the text of the message.
Action: If this issue occurs frequently, the root cause should be investigated.
Node %s1 is not allowed to create a new cluster; WKA list: [%s2]
Parameters: %s1—Address of node attempting to join cluster; %s2—List of WKA addresses. Severity: 1—Error
Cause: The cluster is configured to use WKA, and there are no nodes present in the cluster that are in the WKA list.
Action: Ensure that at least one node in the WKA list exists in the cluster, or add this node's address to the WKA list.
This member is configured with a compatible but different WKA list than the senior Member(%s). It is strongly recommended to use the same WKA list for all cluster members.
Parameters: %s—the senior node of the cluster. Severity: 2—Warning
Cause: The WKA list on this node is different than the WKA list on the senior node.
Action: Ensure that every node in the cluster has the same WKA list.
UnicastUdpSocket failed to set receive buffer size to %n1 packets (%n2 bytes); actual size is %n3 packets (%n4 bytes). Consult your OS documentation regarding increasing the maximum socket buffer size. Proceeding with the actual value may cause sub-optimal performance.
Parameters: %n1—the number of packets that will fit in the buffer that Coherence attempted to allocate; %n2—the size of the buffer Coherence attempted to allocate; %n3—the number of packets that will fit in the actual allocated buffer size; %n4—the actual size of the allocated buffer. Severity: 2—Warning