Shard Capacity

To determine the shard capacity, first determine the application and hardware characteristics described in this section. Having determined these characteristics, enter them into the accompanying spreadsheet. The spreadsheet then calculates the capacity of a shard based on the supplied application and hardware characteristics.

Application Characteristics

Replication Factor

In general, a Primary Replication Factor of 3 is adequate for most applications and is a good starting point, because 3 replicas allow write availability if a single primary zone fails. It can be refined if performance testing suggests some other number works better for the specific workload. Do not select a Primary Replication Factor of 2 because doing so means that even a single failure results in too few sites to elect a new master. This is not the case if you have an Arbiter Node, as a new master can still be elected if the Replication Factor is two and you lose a Replication Node. However, if you have multiple failures before both Replication Nodes are caught up, you may not be able to elect a new master. A Primary Replication Factor of 1 is to be avoided in general since Oracle NoSQL Database has just a single copy of the data; if the storage device hosting the data were to fail the data could be lost.

A larger Primary Replication Factor provides two benefits:
  1. Increased durability to better withstand disk or machine failures.

  2. Increased read request throughput, because there are more nodes per shard available to service those requests.

However, the increased durability and read throughput come at a cost: more hardware resources to host and serve the additional copies of the data, and slower write performance, because each shard has more nodes to which updates must be replicated.

Note:

Only the Primary Replication Factor affects write availability, but both Primary and Secondary Replication Factors, and therefore the Store Replication Factor, have an effect on read availability.

The Primary Replication Factor is defined by the cell RF.

Average Key Size

Use knowledge of the application's key schema and the relative distributions of the various keys to arrive at an average key length. The length of a key on disk is the number of UTF-8 bytes needed to represent the components of the key, plus the number of components, minus one.

This value is defined by the cell AvgKeySize.
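
For example, the following Java sketch (the class name and key components are purely illustrative) applies this rule to a key's components:

import java.nio.charset.StandardCharsets;

public class KeySizeEstimate {
    // Length of a key on disk: the UTF-8 bytes of all components,
    // plus one separator byte between each pair of adjacent components.
    static int keySizeOnDisk(String... components) {
        int bytes = 0;
        for (String component : components) {
            bytes += component.getBytes(StandardCharsets.UTF_8).length;
        }
        return bytes + components.length - 1;
    }

    public static void main(String[] args) {
        // A hypothetical key with components "user", "9876" and "email":
        // 4 + 4 + 5 bytes of UTF-8 text plus 2 separators = 15 bytes.
        System.out.println(keySizeOnDisk("user", "9876", "email"));
    }
}

Weight such per-key sizes by the relative frequency of each key pattern in the workload to arrive at the average.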

Average Value Size

Use knowledge of the application to arrive at an average serialized value size. The value size will vary depending upon the particular serialization format used by the application.

This value is defined by the cell AvgValueSize.

Read and Write Operation Percentages

Compute a rough estimate of the relative frequency of store level read and write operations on the basis of the KVS API operations used by the application.

At the most basic level, each KVS get() call results in a store level read operation and each put() operation results in a store level write operation. Each KVS multi-key operation (KVStore.execute(), multiGet(), or multiDelete()) can result in multiple store level read/write operations. Again, use application knowledge about the number of keys accessed in these operations to arrive at an estimate.

Express the estimate as a read percentage, that is, the percentage of the total operations on the store that are reads. The rest of the operations are assumed to be write operations.

This value is defined by the cell ReadOpsPercent.
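
As an illustration, suppose a hypothetical workload issues 800 get() calls, 100 put() calls, and 20 multiGet() calls touching about 5 records each per second; the read percentage could then be estimated as follows:

public class ReadPercentExample {
    public static void main(String[] args) {
        // Hypothetical operation mix, used only to illustrate the calculation.
        double gets = 800;              // single get() calls per second
        double puts = 100;              // single put() calls per second
        double multiGetReads = 20 * 5;  // 20 multiGet() calls touching ~5 records each

        double reads = gets + multiGetReads;   // 900 store level read ops
        double writes = puts;                  // 100 store level write ops

        // ReadOpsPercent, expressed as a fraction: 900 / 1000 = 0.9
        System.out.println(reads / (reads + writes));
    }
}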

Estimate the percentage of read operations that will likely be satisfied from the file system cache. The percentage depends primarily upon the application's data access pattern and the size of the file system cache. Sizing Advice contains a discussion of how this cache is used.

This value is defined by the cell ReadCacheHitPercent.

Hardware Characteristics

Determine the following hardware characteristics based on a rough idea of the type of the machines that will be used to host the store:

  • The number of disks per machine that will be used for storing KV pairs. This value is defined by the cell DisksPerMachine. The number of disks per machine typically determines the Storage Node Capacity as described in Storage Node Parameters.

  • The usable storage capacity of each disk. This value is defined by the cell DiskCapacityGB.

  • The IOPs capacity of each disk. This information is typically available in the disk spec sheet as the number of sustained random IO operations/sec that can be delivered by the disk. This value is defined by the cell DiskIopsPerSec.

The following discussion assumes that the system will be configured with one RN per disk.

Shard Storage and Throughput Capacities

There are two types of capacity relevant to this discussion: 1) Storage Capacity and 2) Throughput Capacity. The following sections describe how these two measures of capacity are calculated. The underlying calculations are done automatically by the accompanying spreadsheet based upon the application and hardware characteristics supplied earlier.

Shard Storage Capacity

The storage capacity is the maximum number of KV pairs that can be stored in a shard. It is calculated by dividing the storage actually available for live KV pairs (after accounting for the storage set aside as a safety margin and cleaner utilization) by the storage (including a rough estimation of Btree overheads) required by each KV pair.

The KV Storage Capacity is computed by the cell: MaxKVPairsPerShard.
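
The following Java sketch illustrates the shape of that calculation; the safety margin, cleaner utilization, and Btree overhead figures are illustrative assumptions only, and the spreadsheet holds the authoritative values:

public class StorageCapacityExample {
    public static void main(String[] args) {
        // Assumed inputs for illustration; substitute the spreadsheet's values.
        double diskCapacityGB = 1000;      // DiskCapacityGB
        double safetyMarginGB = 50;        // assumed storage set aside as a safety margin
        double cleanerUtilization = 0.60;  // assumed fraction of the log holding live data
        double avgKeySize = 12;            // AvgKeySize, in bytes
        double avgValueSize = 1000;        // AvgValueSize, in bytes
        double btreeOverheadBytes = 100;   // assumed rough per-pair Btree overhead

        double usableBytes = (diskCapacityGB - safetyMarginGB) * 1e9 * cleanerUtilization;
        double bytesPerPair = avgKeySize + avgValueSize + btreeOverheadBytes;

        // Rough analogue of MaxKVPairsPerShard: about 513 million pairs here.
        System.out.printf("%,d%n", (long) (usableBytes / bytesPerPair));
    }
}

Because each replica holds a full copy of the shard's data on its own disk, the per-disk figure is also the per-shard figure under the one RN per disk assumption.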

Shard I/O Throughput capacity

The throughput capacity is a measure of the read and write ops that can be supported by a single shard. In the calculations below, the logical throughput capacity is derived from the disk IOPs capacity based upon the percentage of logical operations that actually translate into disk IOPs after allowing for cache hits. The Machine Physical Memory section contains more detail about configuring the caches used by Oracle NoSQL Database.

For logical read operations, the fraction of logical operations that results in disk IOPs is computed as:

(ReadOpsPercent * (1 - ReadCacheHitPercent))

Note that all percentages are expressed as fractions.

For logical write operations, the corresponding fraction, which accounts for the replication of each write to all RF disks in the shard, is computed as:

(((1 - ReadOpsPercent) / WriteOpsBatchSize) * RF)

The write ops calculations are very approximate. Write operations make a much smaller contribution to the IOPs load than read ops do, because of the sequential writes used by the log-structured storage system. The use of WriteOpsBatchSize is intended to account for the sequential nature of the writes to the underlying JE log-structured storage system. The above formula does not work well when there are no reads in the workload, that is, under pure insert or pure update loads. Under pure insert loads, writes are limited primarily by acknowledgement latency, which is not modeled by the formula. Under pure update loads, both acknowledgement latency and cleaner performance play an important role.

The sum of the above two numbers represents the percentage of logical operations that actually result in disk operations (the DiskIopsPercent cell). The shard's logical throughput can then be computed as:

(DiskIopsPerSec * RF)/DiskIopsPercent

and is calculated by the cell OpsPerShardPerSec.
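
Putting the pieces together, the following Java sketch applies these formulas to assumed, illustrative input values:

public class ThroughputCapacityExample {
    public static void main(String[] args) {
        // Assumed inputs for illustration; substitute the spreadsheet's values.
        double readOpsPercent = 0.90;       // ReadOpsPercent, as a fraction
        double readCacheHitPercent = 0.50;  // ReadCacheHitPercent, as a fraction
        double writeOpsBatchSize = 10;      // WriteOpsBatchSize
        double rf = 3;                      // RF
        double diskIopsPerSec = 300;        // DiskIopsPerSec

        // Fractions of logical operations that result in disk IOPs.
        double readDiskFraction = readOpsPercent * (1 - readCacheHitPercent);        // 0.45
        double writeDiskFraction = ((1 - readOpsPercent) / writeOpsBatchSize) * rf;  // 0.03
        double diskIopsPercent = readDiskFraction + writeDiskFraction;               // 0.48

        // OpsPerShardPerSec: (300 * 3) / 0.48 = 1875 logical ops/sec per shard.
        System.out.println((diskIopsPerSec * rf) / diskIopsPercent);
    }
}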

Memory and Network Configuration

Having established the storage and throughput capacities of a shard, the amount of physical memory and network capacity required by each machine can be determined. Correct configuration of physical memory and network resources is essential for the proper operation of the store. If your primary goal is to determine the total size of the store, skip ahead to Estimate total Shards and Machines but make sure to return to this section later when it is time to finalize the machine level hardware requirements.

Note:

You can also set the memory size available for each Storage Node in your store, either through the memory_mb parameter of the makebootconfig utility or through the memorymb Storage Node parameter. For more information, see Installation Configuration Parameters and Storage Node Parameters respectively.

Machine Physical Memory

The shard storage capacity (computed by the cell MaxKVPairsPerShard) and the average key size (defined by the cell AvgKeySize) can be used to estimate the physical memory requirements of the machine. The physical memory on the machine backs the caches used by Oracle NoSQL Database.

Sizing the in-memory cache correctly is essential for meeting the store's performance goals. Disk I/O is an expensive operation from a performance point of view; the more operations that can be serviced from the cache, the better the store's performance.

Before continuing, it is worth noting that there are two caches that are relevant to this discussion:

  1. The JE cache. The underlying storage engine used by Oracle NoSQL Database is Berkeley DB Java Edition (JE). JE provides an in-memory cache. For the most part, this is the cache size that is most important, because it is the one that is simplest to control and configure.

  2. The file system (FS) cache. Modern operating systems attempt to improve their I/O subsystem performance by providing a cache, or buffer, that is dedicated to disk I/O. By using the FS cache, read operations can be performed very quickly if the reads can be satisfied by data that is stored there.

Sizing Advice

JE uses a Btree to organize the data that it stores. Btrees provide a tree-like data organization structure that allows for rapid information lookup. These structures consist of interior nodes (INs) and leaf nodes (LNs). INs are used to navigate to data. LNs are where the data is actually stored in the Btree.

Because of the very large data sets that an Oracle NoSQL Database application is expected to use, it is unlikely that you can place even a small fraction of the data into JE's in-memory cache. Therefore, the best strategy is to size the cache such that it is large enough to hold most, if not all, of the database's INs, and leave the rest of the node's memory available for system overhead (negligible) and the FS cache.

Both INs and LNs can take advantage of the FS cache. Because INs and LNs do not have Java object overhead when present in the FS cache (as they would when using the JE cache), they can make more effective use of the FS cache memory than the JE cache memory.

Of course, in order for the FS cache to be truly effective, the data access patterns should not be completely random. Some subset of your key-value pairs must be favored over others in order to achieve a useful cache hit rate. For applications where the access patterns are not random, the high file system cache hit rates on LNs and INs can increase throughput and decrease average read latency. Also, larger file system caches, when properly tuned, can help reduce the number of stalls during sequential writes to the log files, thus decreasing write latency. Large caches also permit more of the writes to be done asynchronously, thus improving throughput.

Determine JE Cache Size

To determine an appropriate JE cache size, use the com.sleepycat.je.util.DbCacheSize utility. This utility requires as input the number of records and the size of the application keys. You can also optionally provide other information, such as the expected data size. The utility then provides a short table of information. The number you want is provided in the Cache Size column, and in the Internal nodes and leaf nodes: MAIN cache row.

For example, to determine the JE cache size for an environment consisting of 100 million records, with an average key size of 12 bytes, and an average value size of 1000 bytes, invoke DbCacheSize as follows:

java -Xmx64m -Xms64m \
-d64 -XX:+UseCompressedOops -jar je.jar DbCacheSize \
-key 12 -data 1000 -records 100000000 -replicated -offheap

    === Environment Cache Overhead ===

    3,164,485 minimum bytes

    To account for JE daemon operation, record locks, HA network
    connections, etc, a larger amount is needed in practice.

    === Database Cache Size ===

    Number of Bytes   Description
    ---------------   -----------
      3,885,284,944   Internal nodes only: MAIN cache
                  0   Internal nodes only: OFF-HEAP cache
      3,885,284,944   Internal nodes and leaf nodes: MAIN cache
    104,002,733,216   Internal nodes and leaf nodes: OFF-HEAP cache

Make note of the following JVM arguments (they have a special meaning when supplied to DbCacheSize):

  1. The -d64 argument is used to ensure that cache sizes account for a 64-bit JVM. Only 64-bit JVMs are supported by NoSQL DB.
  2. The -XX:+UseCompressedOops argument causes cache sizes to account for CompressedOops mode, which is used by NoSQL DB by default. This mode uses more efficient 32-bit pointers in a 64-bit JVM, thus permitting better utilization of the JE cache.
  3. The -replicated argument is used to account for memory usage in a JE ReplicatedEnvironment, which is always used by NoSQL DB.
  4. The -offheap argument is specified to use a JE off-heap cache, which is configured by default in NoSQL DB. This causes DbCacheSize to print values for the main (in-heap) cache and off-heap cache separately.

When supplied to DbCacheSize, these arguments indicate that the JE application will be run with the same settings, and DbCacheSize adjusts its calculations appropriately. The arguments are used by Oracle NoSQL Database when starting up the Replication Nodes, which use these caches.

The output indicates that a cache size of 3.6 GB is sufficient to hold all the internal nodes representing the Btree in the JE cache. With a JE cache of this size, the INs will be fetched from the JE cache and the LNs will be fetched from the off-heap cache or from disk.

For more information on using the DbCacheSize utility, see the utility's Javadoc. Note that in order to use this utility, you must add the <KVHOME>/lib/je.jar file to your Java classpath. <KVHOME> represents the directory where you placed the Oracle NoSQL Database package files.

Having used DbCacheSize to obtain the JE cache size, you can calculate the heap size from it. To do this, enter the number obtained from DbCacheSize into the cell named DbCacheSizeMB, making sure to convert the units from bytes to MB. The heap size is computed by the cell RNHeapMB as follows:

(DbCacheSizeMB/RNCachePercent)

where RNCachePercent is the percentage of the heap that is used for the JE cache. The computed heap size should not exceed 32GB, so that the Java VM can use its efficient CompressedOops format to represent the Java objects in memory. Heap sizes exceeding 32GB appear with a strikethrough in the RNHeapMB cell to emphasize this requirement. If the heap size exceeds 32GB, try to reduce the size of the keys to reduce the JE cache size and, in turn, bring the overall heap size below 32GB.

The heap size is used as the basis for computing the memory required by the machine as below:

(RNHeapMB * DisksPerMachine)/SNRNHeapPercent

where SNRNHeapPercent is the percentage of the physical memory that is available for use by the RN's hosted on the machine. The result is available in the cell MachinePhysicalMemoryMB.
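
For example, the following Java sketch combines the DbCacheSize result from the previous section with assumed values for RNCachePercent, DisksPerMachine, and SNRNHeapPercent (check your spreadsheet for the actual percentages):

public class MemorySizingExample {
    public static void main(String[] args) {
        // DbCacheSize reported about 3,885,284,944 bytes, i.e. roughly 3,706 MB.
        double dbCacheSizeMB = 3706;    // DbCacheSizeMB
        double rnCachePercent = 0.70;   // assumed share of the RN heap used for the JE cache
        double disksPerMachine = 2;     // DisksPerMachine (one RN per disk)
        double snRnHeapPercent = 0.85;  // assumed share of physical memory available to RNs

        // RNHeapMB: about 5,294 MB, comfortably below the 32 GB limit.
        double rnHeapMB = dbCacheSizeMB / rnCachePercent;

        // MachinePhysicalMemoryMB: about 12,457 MB for this two-disk machine.
        System.out.println((rnHeapMB * disksPerMachine) / snRnHeapPercent);
    }
}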

Machine Network Throughput

Ensure that the NIC attached to the machine is capable of delivering the application I/O throughput calculated earlier in Shard I/O Throughput capacity; otherwise, it could become a bottleneck.

The number of bytes received by the machine over the network as a result of write operations initiated by the client is calculated as:

(OpsPerShardPerSec * (1 - ReadOpsPercent) * 
 (AvgKeySize + AvgValueSize)) * DisksPerMachine

and is denoted by ReceiveBytesPerSec in the spreadsheet. Note that whether a node is a master or a replica does not matter for the purposes of this calculation; the inbound write bytes come from the client for the master and from the masters for the replicas on the machine.

The number of bytes received by the machine as a result of read requests is computed as:

((OpsPerShardPerSec * ReadOpsPercent)/RF) * 
 (AvgKeySize + ReadRequestOverheadBytes) * DisksPerMachine

where ReadRequestOverheadBytes is a fixed constant overhead of 100 bytes.

The bytes sent out by the machine over the network have two underlying components:

  1. The bytes sent out in direct response to application read requests, which can be expressed as:

    ((OpsPerShardPerSec *  ReadOpsPercent)/RF) * 
     (AvgKeySize + AvgValueSize) * DisksPerMachine
  2. The bytes sent out as replication traffic by the masters on the machine, expressed as:

    (OpsPerShardPerSec * (1 - ReadOpsPercent) * 
     (AvgKeySize + AvgValueSize) * (RF-1)) * MastersOnMachine

The sum of the above two values represents the total outbound traffic denoted by SendBytesPerSec in the spreadsheet.
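
As a minimal sketch of these calculations, the following Java example computes the inbound and outbound byte rates using assumed, illustrative input values carried over from the earlier examples:

public class NetworkThroughputExample {
    public static void main(String[] args) {
        // Assumed inputs for illustration; substitute the spreadsheet's values.
        double opsPerShardPerSec = 1875;        // OpsPerShardPerSec
        double readOpsPercent = 0.90;           // ReadOpsPercent, as a fraction
        double avgKeySize = 12;                 // AvgKeySize, in bytes
        double avgValueSize = 1000;             // AvgValueSize, in bytes
        double rf = 3;                          // RF
        double disksPerMachine = 2;             // DisksPerMachine
        double mastersOnMachine = 1;            // assumed masters hosted on this machine
        double readRequestOverheadBytes = 100;  // fixed read request overhead

        // Inbound bytes from client writes (ReceiveBytesPerSec) and from read requests.
        double receiveBytesPerSec = opsPerShardPerSec * (1 - readOpsPercent)
                * (avgKeySize + avgValueSize) * disksPerMachine;
        double readRequestBytesPerSec = ((opsPerShardPerSec * readOpsPercent) / rf)
                * (avgKeySize + readRequestOverheadBytes) * disksPerMachine;

        // Outbound bytes: read responses plus replication traffic from the masters (SendBytesPerSec).
        double readResponseBytesPerSec = ((opsPerShardPerSec * readOpsPercent) / rf)
                * (avgKeySize + avgValueSize) * disksPerMachine;
        double replicationBytesPerSec = opsPerShardPerSec * (1 - readOpsPercent)
                * (avgKeySize + avgValueSize) * (rf - 1) * mastersOnMachine;
        double sendBytesPerSec = readResponseBytesPerSec + replicationBytesPerSec;

        System.out.printf("in: %.0f + %.0f, out: %.0f bytes/sec%n",
                receiveBytesPerSec, readRequestBytesPerSec, sendBytesPerSec);
    }
}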

The total inbound and outbound traffic must be comfortably within the NIC's capacity. The spreadsheet calculates the kind of network card, GigE or 10GigE, that is required to support the traffic.