Determining the Per-Node Cache Size

Sizing Advice
Arriving at Sizing Numbers

Sizing your in-memory cache correctly is an important part of meeting your store's performance goals. Disk I/O is an expensive operation from a performance point of view; the more operations you can service from cache, the better your store's performance is going to be.

There are several disk cache strategies that you can use, each of which is appropriate for different workloads. However, Oracle NoSQL Database was designed for applications that cannot place all their data in memory, so this section describes a caching strategy appropriate for that class of workload.

Before continuing, it is worth noting that there are two caches that we are concerned with:

- The JE cache. Oracle NoSQL Database uses Berkeley DB Java Edition (JE) as its underlying storage engine, and JE maintains its own in-memory cache within the Java heap.
- The file system (FS) cache. The operating system uses otherwise unallocated RAM to cache recently read disk blocks, so repeated reads of the same data can be serviced without disk I/O.

Sizing Advice

JE uses a Btree to organize the data that it stores. A Btree is a tree data structure that allows for rapid information lookup. Btrees consist of interior nodes (INs) and leaf nodes (LNs): INs are used to navigate to data, while LNs are where the data is actually stored in the Btree.

Because of the very large data sets that an Oracle NoSQL Database application is expected to use, it is unlikely that you can place even a small fraction of your data into JE's in-memory cache. Therefore, the best strategy is to size the cache such that it is large enough to hold most, if not all, of your database's INs, and leave the rest of your node's memory available for system overhead (negligible) and the FS cache.

You cannot control whether INs or LNs are being served out of the FS cache, so sizing the JE cache to be large enough for your INs is simply sizing advice. Both INs and LNs can take advantage of the FS cache. Because INs and LNs do not have Java object overhead when present in the FS cache (as they would when using the JE cache), they can make more effective use of the FS cache memory than the JE cache memory.

Of course, in order for this strategy to be truly effective, your data access patterns should not be completely random. Some subset of your key-value pairs must be favored over others in order to achieve a useful cache hit rate. For applications where the access patterns are not random, the high file system cache hit rates on LNs and INs can increase throughput and decrease average read latency. Also, larger file system caches, when properly tuned, can help reduce the number of stalls during sequential writes to the log files, thus decreasing write latency. Large caches also permit more of the writes to be done asynchronously, thus improving throughput.

Assuming a reasonable amount of clustering in your data access patterns, your disk subsystem should be capable of delivering roughly the following throughput if you size your cache as described here:

((readOps/sec + createOps/sec + updateOps/sec + deleteOps/sec) *
    (1 - cacheHitFraction)) / nReplicationNodes => throughput in IOPs/sec

The above rough calculation assumes that each create, update, and delete operation results in a random I/O operation. Because of the log-structured nature of the underlying storage system, this is not typically the case: application-level write operations are batched into sequential, synchronous writes to the log files. The calculation may therefore overstate the IOPs requirements, but it provides a good conservative number for estimation purposes.

For example, if a KVStore with two shards and a replication factor of 3 (for a total of six replication nodes) needs to deliver an aggregate 2000 ops/sec (summing all read, create, update and delete operations), and a 50% cache hit ratio is expected, then the I/O subsystem on each replication node should be able to deliver:

((2000 ops/sec) * (1 - 0.5)) / 6 nodes = 166 IOPs/sec 

This is roughly in the range of what a single-spindle disk subsystem can provide. For higher throughput, a multi-spindle I/O subsystem may be more appropriate. Another option is to increase the number of shards, which increases the number of replication nodes (and therefore disks), thus spreading out the I/O load.
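As a sanity check, the per-node IOPs estimate can be computed directly. The sketch below simply encodes the formula above; the input values are the illustrative numbers from the example (2000 aggregate ops/sec, a 50% cache hit ratio, and six replication nodes):

```java
// Minimal sketch of the per-node IOPs estimate described above.
public class IopsEstimate {

    // Required disk throughput, in IOPs/sec, for each replication node.
    static double perNodeIops(double aggregateOpsPerSec,
                              double cacheHitFraction,
                              int nReplicationNodes) {
        return (aggregateOpsPerSec * (1 - cacheHitFraction))
                / nReplicationNodes;
    }

    public static void main(String[] args) {
        // 2 shards with a replication factor of 3 = 6 replication nodes.
        double iops = perNodeIops(2000, 0.5, 6);
        System.out.printf("%.0f IOPs/sec per node%n", iops);
    }
}
```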

Arriving at Sizing Numbers

To identify an appropriate JE cache size for your Oracle NoSQL Database application, use the com.sleepycat.je.util.DbCacheSize utility. This utility requires you to provide the number of records and the size of your keys. You can also optionally provide other information, such as your expected data size. The utility then prints a short table of information. The number you want appears in the Minimum Bytes column of the Database Cache Size table, in the Internal nodes only row.

For example, to determine the JE cache size for an environment consisting of 100 million records, with an average key size of 12 bytes, and an average value size of 1000 bytes, invoke DbCacheSize as follows:

java -d64 -XX:+UseCompressedOops -jar je.jar DbCacheSize \
-key 12 -data 1000 -records 100000000

=== Environment Cache Overhead ===

3,156,253 minimum bytes

To account for JE daemon operation and record locks,
a significantly larger amount is needed in practice.

=== Database Cache Size ===

 Minimum Bytes    Maximum Bytes   Description
---------------  ---------------  -----------
  2,888,145,968    3,469,963,312  Internal nodes only
107,499,427,952  108,081,245,296  Internal nodes and leaf nodes

=== Internal Node Usage by Btree Level ===

 Minimum Bytes    Maximum Bytes      Nodes    Level
---------------  ---------------  ----------  -----
  2,849,439,456    3,424,720,608   1,123,596    1
     38,275,968       44,739,456      12,624    2
        427,512          499,704         141    3
          3,032            3,544           1    4              

The numbers you want are in the Database Cache Size section of the output. In the Minimum Bytes column there are two numbers: one for internal nodes only, and one for internal nodes plus leaf nodes. This means that the absolute minimum cache size you should use for a dataset of this size is 2.9 GB. However, that holds only your internal database structure; the cache is not large enough to hold any data.

The second number in the output represents the minimum cache size required to hold your entire database, including all data. At 107.5 GB, it is highly unlikely that you have machines with that much RAM, which means you now have to make some decisions about your data. Namely, you have to decide how large your working set is: the data that your application accesses so frequently that it is worth placing it in the in-memory cache. How large your working set has to be is determined by the nature of your application. Ideally your working set is small enough to fit into the amount of RAM available to your node machines, as this provides the best read throughput by avoiding a great deal of disk I/O.

Suppose, for example, that you decide your working set is about 10% of your data set, or 10 million records. Rerun DbCacheSize with that record count:

java -d64 -XX:+UseCompressedOops -jar je.jar DbCacheSize \
-key 12 -data 1000 -records 10000000

=== Environment Cache Overhead ===

3,156,253 minimum bytes

To account for JE daemon operation and record locks,
a significantly larger amount is needed in practice.

=== Database Cache Size ===

 Minimum Bytes    Maximum Bytes   Description
---------------  ---------------  -----------
    288,816,824      346,998,968  Internal nodes only
 10,749,982,264   10,808,164,408  Internal nodes and leaf nodes

=== Internal Node Usage by Btree Level ===

 Minimum Bytes    Maximum Bytes      Nodes    Level
---------------  ---------------  ----------  -----
    284,944,960      342,473,280     112,360    1
      3,826,384        4,472,528       1,262    2
         42,448           49,616          14    3
          3,032            3,544           1    4
          

Not surprisingly, the cache sizes are now approximately 10% of what they were for the entire data set (because we decided that our working set is about 10% of the entire data set). That is, the working set can be held in a cache of about 10.8 GB, which is readily achievable on modern commodity hardware.
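As a rough cross-check (not a substitute for running DbCacheSize), the internal-node overhead per record is nearly constant for a fixed key size, so the Internal nodes only estimate scales almost linearly with record count. The sketch below divides the Minimum Bytes figures copied from the two runs above by their record counts:

```java
// Cross-check: per-record internal-node overhead from the two
// DbCacheSize runs shown above (figures copied from their output).
public class CacheScaling {

    static double bytesPerRecord(long inOnlyMinBytes, long records) {
        return (double) inOnlyMinBytes / records;
    }

    public static void main(String[] args) {
        double full = bytesPerRecord(2_888_145_968L, 100_000_000L);
        double ws   = bytesPerRecord(  288_816_824L,  10_000_000L);
        // Both come out to roughly 28.9 bytes per record.
        System.out.printf("%.1f vs %.1f bytes/record%n", full, ws);
    }
}
```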

For more information on using the DbCacheSize utility, see this Javadoc page: http://docs.oracle.com/cd/E17277_02/html/java/com/sleepycat/je/util/DbCacheSize.html. Note that in order to use this utility, you must add the <KVHOME>/lib/je.jar file to your Java classpath. <KVHOME> represents the directory where you placed the Oracle NoSQL Database package files.

Having used DbCacheSize to obtain a targeted cache size value, you need to find out how big your Java heap must be in order to support it. To do this, use the KVS Node Heap Shaping and Sizing spreadsheet. Plug the number you obtained from DbCacheSize into cell 8B of the spreadsheet. Cell 29B then shows you how large to make the Java heap size.

Your file system cache is whatever memory is left over on your node after you subtract system overhead and the Java heap size.

You can find the KVS Node Heap Shaping and Sizing spreadsheet in your Oracle NoSQL Database distribution here: <KVHOME>/doc/misc/MemoryConfigPlanning.xls
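Putting the pieces together, the file system cache is simply what remains after system overhead and the Java heap are accounted for. The sketch below uses assumed example figures (64 GB of node RAM, 2 GB of system overhead, and a 32 GB heap, as might come out of the sizing spreadsheet); none of these values are recommendations:

```java
// Hypothetical per-node memory budget: the FS cache is whatever RAM
// is left after system overhead and the Java heap. All figures are
// assumed example values, not recommendations.
public class MemoryBudget {

    static long fsCacheGb(long totalRamGb, long systemOverheadGb,
                          long javaHeapGb) {
        return totalRamGb - systemOverheadGb - javaHeapGb;
    }

    public static void main(String[] args) {
        // Assumed: 64 GB node, 2 GB system overhead, 32 GB heap
        // (the heap size would come from the sizing spreadsheet).
        System.out.println("FS cache: " + fsCacheGb(64, 2, 32) + " GB");
    }
}
```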