Sizing your in-memory cache correctly is an important part of meeting your store's performance goals. Disk I/O is an expensive operation from a performance point of view; the more operations you can service from cache, the better your store's performance is going to be.
There are several disk cache strategies that you can use, each appropriate for a different workload. However, Oracle NoSQL Database was designed for applications that cannot fit all of their data in memory, so this section describes a caching strategy appropriate for that class of workload.
Before continuing, it is worth noting that there are two caches that we are concerned with:
The JE cache. The underlying storage engine used by Oracle NoSQL Database is Berkeley DB Java Edition (JE), which provides an in-memory cache. For the most part, this is the cache you most need to think about, because it is the one you have the most direct control over.
The file system (FS) cache. Modern operating systems attempt to improve their I/O subsystem performance by providing a cache, or buffer, that is dedicated to disk I/O. By using the FS cache, read operations can be performed very quickly if the reads can be satisfied by data that is stored there.
JE uses a Btree to organize the data that it stores. Btrees provide a tree-like data organization structure that allows for rapid information lookup. These structures consist of interior nodes (INs) and leaf nodes (LNs). INs are used to navigate to data. LNs are where the data is actually stored in the Btree.
Because of the very large data sets that an Oracle NoSQL Database application is expected to use, it is unlikely that you can place even a small fraction of your data into JE's in-memory cache. Therefore, the best strategy is to size the cache such that it is large enough to hold most, if not all, of your database's INs, and leave the rest of your node's memory available for system overhead (negligible) and the FS cache.
You cannot control whether INs or LNs are served out of the FS cache, so sizing the JE cache to hold your INs is advice rather than a guarantee; both INs and LNs can take advantage of the FS cache. Because INs and LNs carry no Java object overhead when they are held in the FS cache (as they do in the JE cache), they make more efficient use of FS cache memory than of JE cache memory.
Of course, in order for this strategy to be truly effective, your data access patterns should not be completely random. Some subset of your key-value pairs must be favored over others in order to achieve a useful cache hit rate. For applications where the access patterns are not random, the high file system cache hit rates on LNs and INs can increase throughput and decrease average read latency. Also, larger file system caches, when properly tuned, can help reduce the number of stalls during sequential writes to the log files, thus decreasing write latency. Large caches also permit more of the writes to be done asynchronously, thus improving throughput.
Assuming a reasonable amount of clustering in your data access patterns, your disk subsystem should be capable of delivering roughly the following throughput if you size your cache as described here:
((readOps/sec + createOps/sec + updateOps/sec + deleteOps/sec) * (1 - cacheHitFraction)) / nReplicationNodes = required IOPS per replication node
The above rough calculation assumes that each create, update, and delete operation results in a random I/O operation. Due to the log-structured nature of the underlying storage system, this is not typically the case: application-level write operations result in batched, sequential, synchronous writes. So the calculation may overstate the IOPS requirement, but it provides a good conservative number for estimation purposes.
For example, if a KVStore with two shards and a replication factor of 3 (for a total of six replication nodes) needs to deliver an aggregate 2000 ops/sec (summing all read, create, update and delete operations), and a 50% cache hit ratio is expected, then the I/O subsystem on each replication node should be able to deliver:
((2000 ops/sec) * (1 - 0.5)) / 6 nodes ≈ 167 IOPS
This is roughly in the range of what a single spindle disk subsystem can provide. For higher throughput, a multi-spindle I/O subsystem may be more appropriate. Another option is to increase the number of shards and therefore the number of replication nodes and therefore disks, thus spreading out the I/O load.
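The calculation above can be sketched as a small helper. The function and parameter names here are illustrative only, not part of any Oracle NoSQL Database API:

```python
def required_iops_per_node(read_ops, create_ops, update_ops,
                           delete_ops, cache_hit_fraction,
                           n_replication_nodes):
    """Rough per-node IOPS estimate, assuming every cache miss
    becomes one random disk I/O (a conservative upper bound)."""
    total_ops = read_ops + create_ops + update_ops + delete_ops
    misses = total_ops * (1.0 - cache_hit_fraction)
    return misses / n_replication_nodes

# Two shards, replication factor 3 (six nodes), 2000 aggregate
# ops/sec, 50% expected cache hit ratio:
print(required_iops_per_node(2000, 0, 0, 0, 0.5, 6))  # ~166.7 IOPS
```

Because only the sum of the four operation rates matters, the aggregate 2000 ops/sec is passed as a single argument here.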
In order to identify an appropriate JE cache size for your Big Data application, use the com.sleepycat.je.util.DbCacheSize utility. This utility requires you to provide the number of records and the size of your keys. You can also optionally provide other information, such as your expected data size. The utility then prints a short table of information. The number you want appears in the Minimum Bytes column of the Database Cache Size section, in the Internal nodes only row.
For example, to determine the JE cache size for an environment consisting of 100 million records, with an average key size of 12 bytes and an average value size of 1000 bytes, invoke DbCacheSize as follows:
java -d64 -XX:+UseCompressedOops -jar je.jar DbCacheSize \
    -key 12 -data 1000 -records 100000000

=== Environment Cache Overhead ===

3,156,253 minimum bytes

To account for JE daemon operation and record locks,
a significantly larger amount is needed in practice.

=== Database Cache Size ===

 Minimum Bytes    Maximum Bytes   Description
---------------  ---------------  -----------
  2,888,145,968    3,469,963,312  Internal nodes only
107,499,427,952  108,081,245,296  Internal nodes and leaf nodes

=== Internal Node Usage by Btree Level ===

 Minimum Bytes    Maximum Bytes      Nodes    Level
---------------  ---------------  ----------  -----
  2,849,439,456    3,424,720,608   1,123,596    1
     38,275,968       44,739,456      12,624    2
        427,512          499,704         141    3
          3,032            3,544           1    4
The numbers you want are in the Database Cache Size section of the output. In the Minimum Bytes column there are two numbers: one for internal nodes only, and one for internal nodes plus leaf nodes. This means that the absolute minimum cache size you should use for a dataset of this size is 2.9 GB. However, that holds only your internal database structure; the cache is not large enough to hold any data.
The second number in the output represents the minimum cache size required to hold your entire database, including all data. At 107.5 GB, it is highly unlikely that you have machines with that much RAM, which means you now have to make some decisions about your data. Namely, you have to decide how large your working set is: the data that your application accesses so frequently that it is worth holding in the in-memory cache. How large your working set is depends on the nature of your application. Ideally, your working set is small enough to fit into the RAM available on your node machines, as this gives you the best read throughput by avoiding a lot of disk I/O.

Suppose you decide that your working set is about 10 million records, or 10% of the total data set. Run DbCacheSize again with that record count:
java -d64 -XX:+UseCompressedOops -jar je.jar DbCacheSize \
    -key 12 -data 1000 -records 10000000

=== Environment Cache Overhead ===

3,156,253 minimum bytes

To account for JE daemon operation and record locks,
a significantly larger amount is needed in practice.

=== Database Cache Size ===

 Minimum Bytes    Maximum Bytes   Description
---------------  ---------------  -----------
    288,816,824      346,998,968  Internal nodes only
 10,749,982,264   10,808,164,408  Internal nodes and leaf nodes

=== Internal Node Usage by Btree Level ===

 Minimum Bytes    Maximum Bytes      Nodes    Level
---------------  ---------------  ----------  -----
    284,944,960      342,473,280     112,360    1
      3,826,384        4,472,528       1,262    2
         42,448           49,616          14    3
          3,032            3,544           1    4
Not surprisingly, the cache sizes are now approximately 10% of what they were for the entire data set (because we decided that our working set is about 10% of the total). That is, our working set can be held in a cache of about 10.8 GB, which is easily achievable on modern commodity hardware.
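As a sanity check, the internal-node cache requirement reported by DbCacheSize scales roughly linearly with record count when key and value sizes are held constant, so you can estimate a working-set cache size before rerunning the utility. A sketch using the numbers from the two runs above:

```python
# Minimum "Internal nodes only" bytes from the 100-million-record run:
full_dataset_in_cache = 2_888_145_968
working_set_fraction = 0.10  # 10 million of 100 million records

estimate = full_dataset_in_cache * working_set_fraction

# DbCacheSize reported 288,816,824 bytes for the 10-million-record
# run; the linear estimate lands within a fraction of a percent.
actual = 288_816_824
print(abs(estimate - actual) / actual)  # well under 1%
```

This is only a back-of-the-envelope check; always confirm with DbCacheSize itself, since Btree fan-out effects make the relationship only approximately linear.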
For more information on using the DbCacheSize utility, see this Javadoc page: http://docs.oracle.com/cd/E17277_02/html/java/com/sleepycat/je/util/DbCacheSize.html. Note that in order to use this utility, you must add the <KVHOME>/lib/je.jar file to your Java classpath, where <KVHOME> represents the directory in which you placed the Oracle NoSQL Database package files.
Having used DbCacheSize to obtain a target cache size value, you need to find out how big your Java heap must be in order to support it. To do this, use the KVS Node Heap Shaping and Sizing spreadsheet: plug the number you obtained from DbCacheSize into cell 8B of the spreadsheet, and cell 29B then shows you how large to make the Java heap.
Your file system cache is whatever memory is left over on your node after you subtract system overhead and the Java heap size.
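That leftover-memory budget can be sketched as follows. The function name, the default overhead figure, and the example node sizes are illustrative assumptions, not Oracle-published numbers:

```python
def fs_cache_bytes(total_ram_gb, java_heap_gb, system_overhead_gb=1.0):
    """Memory left for the OS file system cache after subtracting
    the Java heap and system overhead. The 1 GB overhead default
    is an illustrative assumption; measure your own nodes."""
    gib = 1024 ** 3
    return (total_ram_gb - java_heap_gb - system_overhead_gb) * gib

# A 64 GB node running a 16 GB Java heap leaves roughly 47 GB
# available to the file system cache:
print(fs_cache_bytes(64, 16) / 1024**3)  # 47.0
```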
You can find the KVS Node Heap Shaping and Sizing spreadsheet in your Oracle NoSQL Database distribution, here: <KVHOME>/doc/misc/MemoryConfigPlanning.xls