3 Evaluating Performance and Scalability

Coherence distributed caches are often evaluated against pre-existing local caches, which generally take the form of in-process hash maps. While Coherence does include facilities for in-process non-clustered caches, a direct performance comparison between a local cache and a distributed cache is not realistic. By the very nature of being out of process, the distributed cache must perform serialization and network transfers. For this cost you gain cluster-wide coherency of the cache data, plus data and query scalability beyond what a single JVM or computer provides. This does not mean that you cannot achieve impressive performance using a distributed cache, but it must be evaluated in the correct context.

The following sections are included in this chapter:

  * Measuring Latency and Throughput
  * Demonstrating Scalability
  * Tuning Your Environment
  * Evaluating the Measurements of a Large Sample Cluster
  * Scalability: A Test Case

3.1 Measuring Latency and Throughput

When evaluating performance, you try to establish two things: latency and throughput. A simple performance analysis test may just perform a series of timed cache accesses in a tight loop. While such tests may accurately measure latency, to measure the maximum throughput of a distributed cache a test must use multiple threads concurrently accessing the cache, and potentially multiple test clients. In a single-threaded test, the client thread naturally spends the majority of its time simply waiting on the network. By running multiple clients and threads, you make more efficient use of your available processing power by issuing several requests in parallel. Batched operations can also be used to increase the data density of each operation. As you add threads, you should see throughput continue to increase until the CPU or network is saturated, while the overall latency remains constant over the same period.
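As an illustration only, the shape of such a multi-threaded throughput test might look like the following sketch. A `ConcurrentHashMap` stands in for the clustered cache (a real test would target a Coherence `NamedCache`), and the class name, thread count, and operation count are invented for the example:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ThroughputSketch {
    /** Runs nThreads writers against the cache and returns the number of entries written. */
    static int runTest(Map<Integer, String> cache, int nThreads, int opsPerThread)
            throws InterruptedException {
        AtomicLong latencyNanos = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        long start = System.nanoTime();
        for (int t = 0; t < nThreads; t++) {
            final int offset = t * opsPerThread;
            pool.submit(() -> {
                for (int i = 0; i < opsPerThread; i++) {
                    long begin = System.nanoTime();
                    cache.put(offset + i, "value");           // the timed cache access
                    latencyNanos.addAndGet(System.nanoTime() - begin);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);

        long totalOps = (long) nThreads * opsPerThread;
        double elapsedSec = (System.nanoTime() - start) / 1e9;
        System.out.printf("throughput: %.0f ops/s, mean latency: %.2f us%n",
                totalOps / elapsedSec, latencyNanos.get() / 1e3 / totalOps);
        return cache.size();
    }

    public static void main(String[] args) throws InterruptedException {
        runTest(new ConcurrentHashMap<>(), 4, 100_000);
    }
}
```

Repeating the run with increasing `nThreads` values is what reveals the saturation point described above: throughput climbs with each added thread until the CPU or network is fully used.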

3.2 Demonstrating Scalability

To show true linear scalability as you increase cluster size, be prepared to add hardware, and not simply JVMs to the cluster. Adding JVMs to a single computer scales up to the point where the CPU or network are fully used.

Plan on testing with clusters of more than just two cache servers (storage-enabled nodes). The jump from one to two cache servers does not show the same scalability as the jump from two to four. The reason is that by default Coherence maintains one backup copy of each piece of data written into the cache. The process of maintaining backups only begins when there are two storage-enabled nodes in the cluster (there must be a place to put the backup). Thus, when you move from one cache server to two, the amount of work involved in a mutating operation such as a put actually doubles, but beyond that the amount of work stays fixed and is evenly distributed across the nodes. "Scalability: A Test Case" provides an example test case that demonstrates scalability.
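As a back-of-the-envelope sketch (assuming the default backup count of one and perfectly even partition distribution; the class and method names are invented for the example), the per-put write cost described above can be tabulated like this:

```java
public class BackupCost {
    /** Total writes (primary + backup) per put, assuming a backup count of 1. */
    static int writesPerPut(int storageNodes) {
        // A backup copy needs a second node to live on, so it only
        // exists once the cluster has two or more storage-enabled nodes.
        return storageNodes >= 2 ? 2 : 1;
    }

    /** Average writes each node absorbs per put, assuming even partition distribution. */
    static double writesPerNode(int storageNodes) {
        return (double) writesPerPut(storageNodes) / storageNodes;
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 2, 4, 8}) {
            System.out.printf("%d node(s): %d writes/put, %.2f writes/node%n",
                    n, writesPerPut(n), writesPerNode(n));
        }
    }
}
```

The table this prints shows the effect in the text: total work per put doubles going from one node to two, then stays fixed at two writes per put while the per-node share keeps shrinking as nodes are added.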

3.3 Tuning Your Environment

To get the most out of your cluster, it is important that you tune your environment and JVMs. Chapter 6, "Performance Tuning," provides a good start to getting the most out of your environment. For example, Coherence includes a registry script for Windows (optimize.reg), which adjusts a few critical settings and allows Windows to achieve much higher data rates.

3.4 Evaluating the Measurements of a Large Sample Cluster

The following graphs show the results of scaling out a cluster in an environment of 100 computers. In this particular environment, Coherence was able to scale to the limits of the network's switching infrastructure. Namely, there were 8 sets of ~12 computers, each set having a 4Gbps link to a central switch. Thus, for this test, Coherence's network throughput scales up to ~32 computers, at which point it reaches the available bandwidth; beyond that it continues to scale in total data capacity.

Figure 3-1 Coherence Throughput Versus Number of Computers

The text describes this figure.

Figure 3-2 Coherence Latency Versus Number of Computers

The text describes this figure.

Latency for 10MB operations (~300ms) is not included in the graph for display reasons, as the payload is 1000x the next payload size.

3.5 Scalability: A Test Case

This section uses a data grid aggregation example to demonstrate Coherence scalability. The tests were performed on a cluster of Dual 2.3GHz PowerPC G5 Xserve computers.

The following topics are included in this section:

  * Overview of the Scalability Test Case
  * Step 1: Create Test Data for Aggregation
  * Step 2: Configure a Partitioned Cache
  * Step 3: Add an Index
  * Step 4: Perform a Parallel Aggregation
  * Step 5: Run the Aggregation Test Case
  * Step 6: Analyzing the Results

3.5.1 Overview of the Scalability Test Case

Coherence provides a data grid by dynamically, transparently, and automatically partitioning the data set across all storage enabled nodes in a cluster. The InvocableMap interface tightly binds the concepts of a data grid (that is, partitioned cache) and the processing of the data stored in the grid.

The following test case shows that Coherence provides the ability to build an application that can scale out to handle any SLA requirement and any increase in throughput requirements. The example in this test case clearly demonstrates Coherence's ability to linearly increase the number of trades aggregated per second as hardware is increased. That is, if one computer can aggregate X trades per second, a second computer added to the data grid increases the aggregate to 2X trades per second, a third computer added to the data grid increases the aggregate to 3X trades per second and so on.

3.5.2 Step 1: Create Test Data for Aggregation

Example 3-1 illustrates a Trade object that is used as the basis for the scalability test. The Trade object contains three properties: Id, Price, and Symbol. These properties represent the data that is aggregated.

Example 3-1 Trade Object Defining Three Properties

package com.tangosol.examples.coherence.data;

import com.tangosol.util.Base;
import com.tangosol.util.ExternalizableHelper;
import com.tangosol.io.ExternalizableLite;

import java.io.IOException;
import java.io.NotActiveException;
import java.io.DataInput;
import java.io.DataOutput;

/**
* Example Trade class
*
* @author erm 2005.12.27
*/
public class Trade
        extends Base
        implements ExternalizableLite
    {
    /**
    * Default Constructor
    */
    public Trade()
        {
        }

    public Trade(int iId, double dPrice, String sSymbol)
        {
        setId(iId);
        setPrice(dPrice);
        setSymbol(sSymbol);
        }

    public int getId()
        {
        return m_iId;
        }

    public void setId(int iId)
        {
        m_iId = iId;
        }

    public double getPrice()
        {
        return m_dPrice;
        }

    public void setPrice(double dPrice)
        {
        m_dPrice = dPrice;
        }

    public String getSymbol()
        {
        return m_sSymbol;
        }

    public void setSymbol(String sSymbol)
        {
        m_sSymbol = sSymbol;
        }

    /**
    * Restore the contents of this object by loading the object's state from the
    * passed DataInput object.
    *
    * @param in the DataInput stream to read data from to restore the
    *           state of this object
    *
    * @throws IOException        if an I/O exception occurs
    * @throws NotActiveException if the object is not in its initial state, and
    *                            therefore cannot be deserialized into
    */
    public void readExternal(DataInput in)
            throws IOException
        {
        m_iId     = ExternalizableHelper.readInt(in);
        m_dPrice  = in.readDouble();
        m_sSymbol = ExternalizableHelper.readSafeUTF(in);
        }

    /**
    * Save the contents of this object by storing the object's state into the
    * passed DataOutput object.
    *
    * @param out the DataOutput stream to write the state of this object to
    *
    * @throws IOException if an I/O exception occurs
    */
    public void writeExternal(DataOutput out)
            throws IOException
        {
        ExternalizableHelper.writeInt(out, m_iId);
        out.writeDouble(m_dPrice);
        ExternalizableHelper.writeSafeUTF(out, m_sSymbol);
        }

    private int    m_iId;
    private double m_dPrice;
    private String m_sSymbol;
    }
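The ExternalizableLite pattern above can be exercised without a running cluster. The following sketch round-trips the same three fields with plain java.io streams; note that `writeInt`/`writeUTF` are stand-ins for illustration only (Coherence's `ExternalizableHelper.writeInt` uses a packed variable-length encoding, and `writeSafeUTF` handles nulls), and the class and values are invented for the example:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class TradeRoundTrip {
    /** Writes the three Trade fields, reads them back in the same order, returns the symbol. */
    static String roundTrip(int id, double price, String symbol) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(id);        // id (plain writeInt as a stand-in for writeInt's packed form)
        out.writeDouble(price);  // price
        out.writeUTF(symbol);    // symbol (stand-in for writeSafeUTF)

        // Read back in the same field order that readExternal uses
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
        int id2 = in.readInt();
        double price2 = in.readDouble();
        String symbol2 = in.readUTF();
        System.out.printf("id=%d price=%.2f symbol=%s%n", id2, price2, symbol2);
        return symbol2;
    }

    public static void main(String[] args) throws IOException {
        roundTrip(42, 101.25, "ORCL");
    }
}
```

The key property the sketch demonstrates is that `readExternal` and `writeExternal` must agree on field order; a mismatch (such as omitting the price on the write side) corrupts every field that follows it.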

3.5.3 Step 2: Configure a Partitioned Cache

Example 3-2 defines one wildcard cache mapping that maps to a distributed scheme which has unlimited capacity:

Example 3-2 Mapping a cache-mapping to a caching-scheme with Unlimited Capacity

<?xml version="1.0"?>

<cache-config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <caching-scheme-mapping>
    <cache-mapping>
      <cache-name>*</cache-name>
      <scheme-name>example-distributed</scheme-name>
    </cache-mapping>
  </caching-scheme-mapping>

  <caching-schemes>
    <!-- Distributed caching scheme. -->
    <distributed-scheme>
      <scheme-name>example-distributed</scheme-name>
      <service-name>DistributedCache</service-name>

      <!-- Backing map scheme definition used by all the caches that do
           not require any eviction policies -->
      <backing-map-scheme>
        <local-scheme/>
      </backing-map-scheme>

      <autostart>true</autostart>
    </distributed-scheme>
  </caching-schemes>
</cache-config>


3.5.4 Step 3: Add an Index

Example 3-3 illustrates the code to add an index to the Price property. Adding an index to this property increases performance by allowing Coherence to access the values directly rather than having to deserialize each item to accomplish the calculation.

Example 3-3 Adding an Index to the Price Property

ReflectionExtractor extPrice  = new ReflectionExtractor("getPrice");
m_cache.addIndex(extPrice, true, null);

In the test case, the aggregation speed increased by more than 2x after the index was applied.

3.5.5 Step 4: Perform a Parallel Aggregation

Example 3-4 illustrates the code to perform a parallel aggregation across all JVMs in the data grid. The aggregation is initiated and the results are received by a single client. That is, a single "low-power" client can use the full processing power of the cluster/data grid in aggregate to perform this aggregation in parallel with just one line of code.

Example 3-4 Perform a Parallel Aggregation Across all JVMs in the Grid

Double DResult;
DResult = (Double) m_cache.aggregate((Filter) null, new DoubleSum("getPrice"));
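Outside a cluster, the same computation is simply a sum over the Price property. The sketch below performs it with a parallel stream over a plain local collection, purely as an illustrative stand-in for the grid-wide DoubleSum aggregator (the class and values are invented for the example):

```java
import java.util.ArrayList;
import java.util.List;

public class LocalAggregation {
    /** Minimal local stand-in for the Trade object's Price property. */
    static class Trade {
        final double price;
        Trade(double price) { this.price = price; }
        double getPrice() { return price; }
    }

    /** Local, parallel analogue of cache.aggregate(null, new DoubleSum("getPrice")). */
    static double sumPrices(List<Trade> trades) {
        return trades.parallelStream().mapToDouble(Trade::getPrice).sum();
    }

    public static void main(String[] args) {
        List<Trade> trades = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            trades.add(new Trade(10.0));
        }
        System.out.println(sumPrices(trades)); // prints 40.0
    }
}
```

The difference in the clustered case is where the work runs: the parallel stream divides the sum across local CPU cores, whereas the Coherence aggregator divides it across the storage-enabled nodes holding the data, so capacity grows with the cluster rather than with one machine.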

3.5.6 Step 5: Run the Aggregation Test Case

This section describes the sample test runs and the overall test environment. The JDK used on both the client and the servers was Java 2 Runtime Environment, Standard Edition (build 1.5.0_05-84).

A test run does several things:

  1. Loads 200,000 trade objects into the data grid.

  2. Adds an index to the Price property.

  3. Performs a parallel aggregation of all trade objects stored in the data grid. This aggregation step is done 20 times to obtain an "average run time" to ensure that the test takes into account garbage collection.

  4. Loads 400,000 trade objects into the data grid.

  5. Repeats steps 2 and 3.

  6. Loads 600,000 trade objects into the data grid.

  7. Repeats steps 2 and 3.

  8. Loads 800,000 trade objects into the data grid.

  9. Repeats steps 2 and 3.

  10. Loads 1,000,000 trade objects into the data grid.

  11. Repeats steps 2 and 3.
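The steps above can be sketched as a driver loop. The cache and the aggregation here are local stand-ins (a HashMap and a plain sum) rather than the Coherence API, and the class name is invented for the example:

```java
import java.util.HashMap;
import java.util.Map;

public class AggregationTestDriver {
    /** Local stand-in for the grid-wide DoubleSum aggregation. */
    static double aggregate(Map<Integer, Double> grid) {
        double sum = 0;
        for (double price : grid.values()) {
            sum += price;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<Integer, Double> grid = new HashMap<>(); // stand-in for the partitioned cache
        int id = 0;
        // Grow the data set to 200K, 400K, ..., 1M trades (steps 1, 4, 6, 8, 10)
        for (int pass = 0; pass < 5; pass++) {
            for (int i = 0; i < 200_000; i++) {
                grid.put(id++, 10.0);
            }
            // Step 3: aggregate 20 times and average the run times
            long totalNanos = 0;
            double result = 0;
            for (int run = 0; run < 20; run++) {
                long start = System.nanoTime();
                result = aggregate(grid);
                totalNanos += System.nanoTime() - start;
            }
            System.out.printf("%,d trades: sum=%.0f, avg run=%.2f ms%n",
                    grid.size(), result, totalNanos / 20 / 1e6);
        }
    }
}
```

Running the aggregation 20 times per data-set size, as the steps specify, smooths out pauses such as garbage collection that would otherwise distort a single measurement.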

Client Considerations: The test client itself is run on an Intel Core Duo iMac which is marked as local storage disabled. The command line used to start the client was:

java ... -Dtangosol.coherence.distributed.localstorage=false -Xmx128m -Xms128m com.tangosol.examples.coherence.invocable.TradeTest

This Test Suite (and Subsequent Results) Includes Data from Four Test Runs:

  1. Start 4 JVMs on one Xserve - Perform a test run

  2. Start 4 JVMs on each of two Xserves - Perform a test run

  3. Start 4 JVMs on each of three Xserves - Perform a test run

  4. Start 4 JVMs on each of four Xserves - Perform a test run

Server Considerations: In this case a JVM refers to a cache server (DefaultCacheServer) instance that is responsible for managing/storing the data.

The following command line was used to start the server:

java ... -Xmx384m -Xms384m -server com.tangosol.net.DefaultCacheServer

3.5.7 Step 6: Analyzing the Results

The following graph shows that the average aggregation time decreases linearly as more cache servers/computers are added to the data grid.


The lowest and highest times were not used in the calculations below, resulting in a data set of eighteen results used to compute the average.
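That trimming step (drop the single lowest and highest of the 20 runs and average the remaining eighteen) can be sketched as follows; the class name is invented for the example:

```java
import java.util.Arrays;

public class TrimmedAverage {
    /** Average after dropping the single lowest and highest sample. */
    static double trimmedAverage(double[] samples) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        double sum = 0;
        // Skip index 0 (lowest) and the last index (highest)
        for (int i = 1; i < sorted.length - 1; i++) {
            sum += sorted[i];
        }
        return sum / (sorted.length - 2);
    }

    public static void main(String[] args) {
        double[] runs = {5.0, 1.0, 3.0, 100.0}; // outliers 1.0 and 100.0 are dropped
        System.out.println(trimmedAverage(runs)); // prints 4.0
    }
}
```

Discarding the extremes keeps a single anomalous run (for instance, one hit by a full garbage collection) from skewing the reported average.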

Figure 3-3 Average Aggregation Time

The text describes this figure.

Similarly, the following graph illustrates how the aggregations per second scale linearly as more computers are added. When moving from 1 computer to 2 computers, the trades aggregated per second doubled; when moving from 2 computers to 4 computers, the trades aggregated per second doubled again.

Figure 3-4 Aggregation Scale-Out

The text describes this figure.

In addition to scaling, the aggregations above complete successfully and correctly even if a cache server, or an entire computer, fails during the aggregation.