Bulk Loading and Processing with Coherence

Bulk Writing to a Cache

A common scenario when using Coherence is to pre-populate a cache before the application uses it. A simple way to do this would be:

public static void bulkLoad(NamedCache cache, Connection conn)
    {
    Statement s;
    ResultSet rs;
    
    try
        {
        s = conn.createStatement();
        rs = s.executeQuery("select key, value from table");
        while (rs.next())
            {
            Integer key   = new Integer(rs.getInt(1));
            String  value = rs.getString(2);
            cache.put(key, value);
            }
        ...
        }
    catch (SQLException e)
        {...}
    }

This works, but each call to put may result in network traffic, especially for partitioned and replicated caches. Additionally, each call to put returns the object it just replaced in the cache (per the java.util.Map interface), which adds more unnecessary overhead. Loading the cache can be made much more efficient by using putAll instead:

public static void bulkLoad(NamedCache cache, Connection conn)
    {
    Statement s;
    ResultSet rs;
    Map       buffer = new HashMap();

    try
        {
        int count = 0;
        s = conn.createStatement();
        rs = s.executeQuery("select key, value from table");
        while (rs.next())
            {
            Integer key   = new Integer(rs.getInt(1));
            String  value = rs.getString(2);
            buffer.put(key, value);

            // flush the buffer to the cache once every 1000 items
            if ((++count % 1000) == 0)
                {
                cache.putAll(buffer);
                buffer.clear();
                }
            }
        if (!buffer.isEmpty())
            {
            cache.putAll(buffer);
            }
        ...
        }
    catch (SQLException e)
        {...}
    }

Efficient Processing of Filter Results

Coherence provides the ability to query caches based on criteria via the filter API. Here is an example (given entries with integers as keys and strings as values):

NamedCache c = CacheFactory.getCache("test");

// Search for entries that start with 'c'
Filter query = new LikeFilter(IdentityExtractor.INSTANCE, "c%", '\\', true);

// Perform query, return all entries that match
Set results = c.entrySet(query);
for (Iterator i = results.iterator(); i.hasNext();)
    {
    Map.Entry e = (Map.Entry) i.next();
    out("key: "+e.getKey() + ", value: "+e.getValue());
    }

This example works for small data sets, but if the data set is very large, holding every matching entry in memory at once may exhaust the heap. Here is a pattern that processes query results in batches to avoid this problem:

public static void performQuery()
    {
    NamedCache c = CacheFactory.getCache("test");

    // Search for entries that start with 'c'
    Filter query = new LikeFilter(IdentityExtractor.INSTANCE, "c%", '\\', true);

    // Perform query, return keys of entries that match
    Set keys = c.keySet(query);

    // The number of entries to process at a time
    final int BUFFER_SIZE = 100;

    // Object buffer
    Set buffer = new HashSet(BUFFER_SIZE);

    for (Iterator i = keys.iterator(); i.hasNext();)
        {
        buffer.add(i.next());

        if (buffer.size() >= BUFFER_SIZE)
            {
            // Bulk load BUFFER_SIZE entries from the cache at once
            Map entries = c.getAll(buffer);

            // Process each entry
            process(entries);

            // Done processing these keys, clear buffer
            buffer.clear();
            }
        }
    // Handle the last partial chunk (if any)
    if (!buffer.isEmpty())
        {
        process(c.getAll(buffer));
        }
    }

public static void process(Map map)
    {
    for (Iterator ie = map.entrySet().iterator(); ie.hasNext();)
        {
        Map.Entry e = (Map.Entry) ie.next();
        out("key: "+e.getKey() + ", value: "+e.getValue());
        }
    }

In this example, the query returns all of the matching keys up front, but only BUFFER_SIZE (in this case, 100) entries are retrieved from the cache at a time, so memory use stays bounded regardless of the size of the result set.

Note that LimitFilter can also be used to process results in parts, similar to the example above. However, LimitFilter is intended for scenarios where the results are paged, such as in a user interface; it is not an efficient means of processing all of the data in a query result.

A Complete Example

Here is an example program that demonstrates the concepts described above. Note that this can be downloaded from the Coherence forums.

package com.tangosol.examples;

import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;
import com.tangosol.net.cache.NearCache;
import com.tangosol.util.Base;
import com.tangosol.util.Filter;
import com.tangosol.util.filter.LikeFilter;

import java.io.Serializable;

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.HashSet;


/**
* This sample application demonstrates the following:
* <ul>
* <li>
* <b>Obtaining a back cache from a near cache for populating a cache.</b>
* Since the near cache holds a limited subset of the data in a cache it is
* more efficient to bulk load data directly into the back cache instead of
* the near cache.
* </li>
* <li>
* <b>Populating a cache in bulk using <tt>putAll</tt>.</b>
* This is more efficient than <tt>put</tt> for a large amount of entries.
* </li>
* <li>
* <b>Executing a filter against a cache and processing the results in bulk.</b>
* This sample issues a query against the cache using a filter.  The result is
* a set of keys that represent the query results.  Instead of iterating
* through the keys and loading each item individually with a <tt>get</tt>,
* this sample loads entries from the cache in bulk using <tt>getAll</tt> which
* is more efficient.
* </li>
* </ul>
*
* @author cp, pperalta  2007.02.21
*/
public class PagedQuery
        extends Base
    {
    /**
    * Command line execution entry point.
    */
    public static void main(String[] asArg)
        {
        NamedCache cacheContacts = CacheFactory.getCache("contacts",
                Contact.class.getClassLoader());

        // if "contacts" is configured as a near cache, load through the
        // back cache; bulk loading through the near cache itself would
        // thrash its size-limited front map
        NamedCache cacheDirect = cacheContacts instanceof NearCache
                ? ((NearCache) cacheContacts).getBackCache()
                : cacheContacts;

        populateCache(cacheDirect);

        executeFilter(cacheDirect);

        CacheFactory.shutdown();
        }

    // ----- populate the cache ---------------------------------------------
    
    /**
    * Populate the cache with test data.  This example shows how to populate
    * the cache a chunk at a time using {@link NamedCache#putAll} which is more
    * efficient than {@link NamedCache#put}.
    *
    * @param cacheDirect  the cache to populate. Note that this should <b>not</b>
    *                     be a near cache since that will thrash the cache
    *                     if the load size exceeds the near cache max size.  
    */
    public static void populateCache(NamedCache cacheDirect)
        {
        if (cacheDirect.isEmpty())
            {
            Map mapBuffer = new HashMap();
            for (int i = 0; i < 100000; ++i)
                {
                // make up some fake data
                Contact contact = new Contact();
                contact.setName(getRandomName() + ' ' + getRandomName());
                contact.setPhone(getRandomPhone());
                mapBuffer.put(new Integer(i), contact);

                // flush the buffer to the cache once every 1000 items
                if (((i + 1) % 1000) == 0)
                    {
                    out("Adding "+mapBuffer.size()+" entries to cache");
                    cacheDirect.putAll(mapBuffer);
                    mapBuffer.clear();
                    }
                }
            if (!mapBuffer.isEmpty())
                {
                cacheDirect.putAll(mapBuffer);
                }
            }
        }

    /**
    * Creates a random name.
    *
    * @return  a random string between 4 and 10 characters long
    */
    public static String getRandomName()
        {
        Random rnd = getRandom();
        int    cch = 4 + rnd.nextInt(7);
        char[] ach = new char[cch];
        ach[0] = (char) ('A' + rnd.nextInt(26));
        for (int of = 1; of < cch; ++of)
            {
            ach[of] = (char) ('a' + rnd.nextInt(26));
            }
        return new String(ach);
        }

    /**
    * Creates a random phone number.
    *
    * @return  a random phone number string in the form (nnn) nnn-0000
    */
    public static String getRandomPhone()
        {
        Random rnd = getRandom();
        return "("
            + toDecString(100 + rnd.nextInt(900), 3)
            + ") "
            + toDecString(100 + rnd.nextInt(900), 3)
            + "-"
            + toDecString(10000, 4);
        }

    // ----- process the cache ----------------------------------------------

    /**
    * Query the cache and process the results in batches.  This example
    * shows how to load a chunk at a time using {@link NamedCache#getAll}
    * which is more efficient than {@link NamedCache#get}.
    *
    * @param cacheDirect  the cache to issue the query against
    */
    private static void executeFilter(NamedCache cacheDirect)
        {
        Filter query = new LikeFilter("getName", "C%");

        // Let's say we want to process 100 entries at a time
        final int CHUNK_COUNT = 100;

        // Start by querying for all the keys that match
        Set setKeys = cacheDirect.keySet(query);

        // Create a collection to hold the "current" chunk of keys
        Set setBuffer = new HashSet();

        // Iterate through the keys
        for (Iterator iter = setKeys.iterator(); iter.hasNext(); )
            {
            // Collect the keys into the current chunk
            setBuffer.add(iter.next());

            // handle the current chunk once it gets big enough
            if (setBuffer.size() >= CHUNK_COUNT)
                {
                // Instead of retrieving each object with a get,
                // retrieve a chunk of objects at a time with a getAll.
                processContacts(cacheDirect.getAll(setBuffer));
                setBuffer.clear();
                }
            }

        // Handle the last partial chunk (if any)
        if (!setBuffer.isEmpty())
            {
            processContacts(cacheDirect.getAll(setBuffer));
            }
        }

    /**
    * Process the map of contacts. In a real application some sort of
    * processing for each map entry would occur. In this example each
    * entry is logged to output. 
    *
    * @param map  the map of contacts to be processed
    */
    public static void processContacts(Map map)
        {
        out("processing chunk of " + map.size() + " contacts:");
        for (Iterator iter = map.entrySet().iterator(); iter.hasNext(); )
            {
            Map.Entry entry = (Map.Entry) iter.next();
            out("  [" + entry.getKey() + "]=" + entry.getValue());
            }
        }

    // ----- inner classes --------------------------------------------------

    /**
    * Sample object used to populate the cache.
    */
    public static class Contact
            extends Base
            implements Serializable
        {
        public Contact() {}

        public String getName()
            {
            return m_sName;
            }
        public void setName(String sName)
            {
            m_sName = sName;
            }

        public String getPhone()
            {
            return m_sPhone;
            }
        public void setPhone(String sPhone)
            {
            m_sPhone = sPhone;
            }

        public String toString()
            {
            return "Contact{"
                    + "Name=" + getName()
                    + ", Phone=" + getPhone()
                    + "}";
            }

        public boolean equals(Object o)
            {
            if (o instanceof Contact)
                {
                Contact that = (Contact) o;
                return equals(this.getName(), that.getName())
                    && equals(this.getPhone(), that.getPhone());
                }
            return false;
            }

        public int hashCode()
            {
            int result;
            result = (m_sName != null ? m_sName.hashCode() : 0);
            result = 31 * result + (m_sPhone != null ? m_sPhone.hashCode() : 0);
            return result;
            }

        private String m_sName;
        private String m_sPhone;
        }
    }

Here are the steps for running the example:

  1. Save this file as com/tangosol/examples/PagedQuery.java
  2. Point the classpath to the Coherence libraries and the current directory
  3. Compile and run the example

$ vi com/tangosol/examples/PagedQuery.java

$ export TANGOSOL_HOME=[Coherence install directory]

$ export CLASSPATH=$TANGOSOL_HOME/lib/tangosol.jar:$TANGOSOL_HOME/lib/coherence.jar:.

$ javac com/tangosol/examples/PagedQuery.java

$ java com.tangosol.examples.PagedQuery

Tangosol Coherence Version 3.2.2/371

******************************************************************************
*
* Tangosol Coherence is licensed by Tangosol, Inc.
*
* Licensed for evaluation use from 2007-02-01 until 2007-04-01 (30 days
* remaining)
*   Tangosol Coherence: DataGrid Edition
*   Tangosol Coherence: Caching Edition
*   Tangosol Coherence: Data Client
*   Tangosol Coherence: Application Edition
*   Tangosol Coherence: Real-Time Client
*   Tangosol Coherence: Compute Client
*
* A production license is required for production use. For more information,
* see http://www.tangosol.com/license.jsp.
*
* Copyright (c) 2000-2006 Tangosol, Inc.
*
******************************************************************************

2007-02-28 22:48:38.208 Tangosol Coherence DGE 3.2.2/371  (thread=Cluster, member=n/a): Created a new cluster
 with Member(Id=1, Timestamp=2007-02-28 22:48:34.828, Address=192.168.0.6:8088, MachineId=26630, Edition=DataGrid
 Edition, Mode=Evaluation, CpuCount=2, SocketCount=2) UID=0xC0A80006000001110B9D118C68061F98
Adding 1024 entries to cache
Adding 1024 entries to cache

...repeated many times...

Adding 1024 entries to cache
Adding 1024 entries to cache
Adding 1024 entries to cache
processing chunk of 100 contacts:
  [25827]=Contact{Name=Cgkyleass Kmknztk, Phone=(285) 452-0000}
  [4847]=Contact{Name=Cyedlujlc Ruexrtgla, Phone=(255) 296-0000}
...repeated many times
  [33516]=Contact{Name=Cjfwlxa Wsfhrj, Phone=(683) 968-0000}
  [71832]=Contact{Name=Clfsyk Dwncpr, Phone=(551) 957-0000}
processing chunk of 100 contacts:
  [38789]=Contact{Name=Cezmcxaokf Kwztt, Phone=(725) 575-0000}
  [87654]=Contact{Name=Cuxcwtkl Tqxmw, Phone=(244) 521-0000}
...repeated many times
  [96164]=Contact{Name=Cfpmbvq Qaxty, Phone=(596) 381-0000}
  [29502]=Contact{Name=Cofcdfgzp Nczpdg, Phone=(563) 983-0000}
...
processing chunk of 80 contacts:
  [49179]=Contact{Name=Czbjokh Nrinuphmsv, Phone=(140) 353-0000}
  [84463]=Contact{Name=Cyidbd Rnria, Phone=(571) 681-0000}
...
  [2530]=Contact{Name=Ciazkpbos Awndvrvcd, Phone=(676) 700-0000}
  [9371]=Contact{Name=Cpqo Rmdw, Phone=(977) 729-0000}