Choice of Keys in NoSQL Database

Using Primary Keys and Shard Keys in Oracle NoSQL tables

Primary Keys

You must designate one or more primary key columns when you create your table. The primary key cannot be changed and exists for the life of the table. A primary key uniquely identifies every row in the table. For simple CRUD operations, Oracle NoSQL Database uses the primary key to retrieve a specific row to read or modify. Since the underlying storage in NoSQL Database is based on a key/value model, the choice of the primary key can greatly enhance the performance of certain lookup operations.

Shard Keys

The main purpose of shard keys is to distribute data across the Oracle NoSQL Database cluster for scalability and to co-locate the records that share the same shard key on the same physical node for easy reference. These records can be accessed atomically and efficiently.

Impact of keys while developing an application:

In an Oracle NoSQL Database, replication nodes are grouped together to form the shards of the NoSQL Database cluster. When an application asks to retrieve the record for a given key, the NoSQL Database driver will hash a portion of the key (denoted as the shard key ) to identify the shard that houses the data. Once the shard is identified, the NoSQL Database driver can choose to read the data from the most optimal replica in the shard, depending on the requested consistency level. With respect to the write operations, the NoSQL Database driver will always route the write requests to the dynamically elected leader node of the shard. Hence, from the perspective of workload scaling, you can generally think of this architecture as being scaled by adding shards. Oracle NoSQL Database supports the online elastic expansion of the cluster by adding shards, however, without the proper selection of a shard key, expanding the cluster will be useless in scaling your solution.

How you design primary keys and shard keys has huge implications on scaling and realizing the system throughput. For instance, when records share shard keys, you can delete multiple table rows in an atomic operation, or retrieve a subset of rows in your table in a single atomic operation. In addition to enabling scalability, well-designed shard keys can improve performance by requiring fewer cycles to put data on, or get data from, a single shard. Shard keys designate storage on the same shard to facilitate efficient queries for key values. However, because you want your data to be distributed across the shards for best performance and scalability, you will want to avoid shard keys that have a small number of unique values.

Important factors to consider when choosing a shard key:

Cardinality: Low cardinality field groups are stored together on a small number of shards. In turn, those shards require frequent data rebalancing, increasing the likelihood of hot shard issues. Instead, each shard key should have high cardinality, where it can express several million values. For best performance and value, choose fields with high cardinalities, such as identity numbers, where millions of records are possible.
Atomicity:Only objects that share the same shard key can participate in a transaction. If you have a requirement for ACID transactions that span multiple records, choose only a shard key that lets you meet that requirement.

Best practices to follow:

Uniform distribution of shard keys:Operations may be limited by the capacity of a single shard. When shard keys are uniformly distributed, no single shard limits the capacity of the system. Choosing one or more columns whose values are known to be uniformly distributed is ideal.
Query Isolation: Queries should be targeted to a specific shard to maximize scalability. If queries are not isolated to a single shard, the query will be applied to all shards. This is less efficient and increases query latency. Make sure your queries fetch data stored in a single shard. Well-designed shard keys can improve performance by getting data from a single shard. Shard keys designate storage on the same shard to facilitate efficient queries for key values. Specify the fields ( which are frequently used in your application queries) as shard keys.

Key Sizes and Key Only Modeling Methods

Oracle NoSQL Database caches the keys for each table. So the key size is a critical component to the effective use of memory and ultimately may be a determining factor in the ability of Oracle NoSQL Database to service your performance SLAs. Hence, it is important for you to create primary keys that are as efficient as possible with respect to size. For workloads that require very low latencies for the read and writes (single to low double-digit milliseconds) across millions of operations per second, exploiting cached keys in NoSQL’s B-trees can be the make or break of building an application capable of achieving these stringent requirements. Furthermore, if you can encode what would otherwise be non-key values as part of the primary key and also size your keys and the NoSQL Database cluster carefully, then you can realize the enormous benefits of memory cached B-tree access methods that are maintained with ACID semantics. For highly optimal, ultra-low latency applications, Oracle NoSQL provides key-only accessors for those workloads that can model everything as key-only data. Oracle NoSQL Database offers convenient key-only access APIs such as multiKetGeys and tableKeysIterator for doing key-only scans.

When considering whether or not key-only modeling of your data is right for your application, you should consider the following:

Latency and throughput SLAs – Do you have very stringent latency and throughput SLAs that would require a key-only model? Can you afford to perform an I/O when retrieving a value, noting that for spinning disks, the average latency of retrieving your value could be anywhere from fifteen to thirty milliseconds and for Single Shared Disk (SSD) this could be anywhere from one to 5 milliseconds.
Spinning disk versus SSDs – If you are considering using SSDs and your latency SLAs are for reads that can comfortably fit within the 5-millisecond range then it’s probably not worth the effort to try and craft a key-only model for your application.
Code maintainability and extensibility – Key-only modeling brings large performance benefits to your application at the potential cost of code maintainability and extensibility. You may find that encoding your value into the key can ultimately be a complex and esoteric strategy. Ultimately, you will have to make a judgment call on whether or not the code you develop and maintain is too complex and esoteric to be worth the benefit of the key-only solution.
Accurate sizing data – Is it possible for you to derive a somewhat accurate sizing of your keys such that you can adequately size the Oracle NoSQL Database cluster? Sizing the cluster and the cache of each replication node will be crucial to exploiting the benefits of a key-only data model.

Key Column Ordering and Query-ability

In Oracle NoSQL Database, the order of declaration for key columns is crucial to satisfying partial key lookup queries. This is because of the way that the storage engine manages the underlying B-trees. You can think of composite keys as an ordered concatenation of the columns specified in the DDL for the key declaration (primary key or index key). You should think of the order from the most significant column to the least significant column based on the appearance of the columns in the DDL for the key. If your table has a composite primary key ( a primary key with more than one column), then the primary key becomes a concatenation of the string representation of each column. Here for better performance of queries, it is important to specify the most commonly used query column as the most significant column in the primary key.

As you start to think about how you will size your cluster and your Oracle NoSQL Database caches, a critical consideration is to get an estimate of your key sizes. Sizing your caches so that Oracle NoSQL can keep most or all of the index nodes in memory can help your application realize enormous performance benefits. Understanding how keys are serialized and stored persistently can help you in getting a more accurate sizing estimate. In Oracle NoSQL Database, numeric keys are stored as compressed String values but must remain sortable when in string format. This means that a numeric key must be a fixed size when represented as a key string. See Initial Capacity Planning for more details on shard capacity, shard storage, and throughput capacities and how to estimate total shards and machines.

Choice of using Identity column Vs UUID

Declare a column as IDENTITY to have Oracle NoSQL Database automatically assign values to it, where the values are generated from an associated sequence generator. The sequence generator is the table’s manager for tracking the IDENTITY column’s current, next, and total number of values. You create an IDENTITY column as part of a CREATE TABLE name DDL statement, or add an IDENTITY column to an existing table with an ALTER TABLE name DDL statement.

A universally unique identifier (UUID) is a 128-bit number used to identify information in computer systems. You can create a UUID and use it to uniquely identify something. In Oracle NoSQL, UUID values are represented by the UUID data type. The UUID data type is considered a subtype of the STRING data type, because UUID values are displayed in their canonical textual format and, in general, behave the same as string values in the various SQL operators and expressions. A table column can be declared as having UUID type in a CREATE TABLE statement. The UUID data type is best-suited in situations where you need a globally unique identifier for the records in a table that span multiple regions since identity columns are only guaranteed to be unique within a NoSQL cluster in a region.

Table 5-1 Comparison between Identity Column and UUID column

Identity Column	UUID column
Declare a column as Identity to have Oracle NoSQL Cluster automatically assign values to it	Declare a column as UUID if you need unique values to be assigned to a NoSQL Cluster column in a multi-region system
An INTEGER, LONG, or NUMBER column in a table can be defined as an Identity column	A UUID is a subtype of the STRING data type
An Identity column can be defined either as GENERATED ALWAYS or GENERATED BY DEFAULT	A UUID column can be defined as GENERATED BY DEFAULT or you can supply the value of the string while inserting or updating data
Costs less storage space than a corresponding UUID column. If LONG is the primary key, it costs a maximum of 10 bytes. If LONG is a non-primary key, it costs a maximum of 8 bytes.	Costs more storage space than a corresponding Identity column. If the UUID value is the primary key, it costs 19-bytes. If the UUID value is a non-primary key, it costs 16-bytes.
Identity columns allow Oracle NoSQL to automatically assign values within a single region. An error is thrown if an Identity column is used in a multi-region deployment.	UUID columns allow Oracle NoSQL to automatically assign global values across regions. This is useful in multi-region deployments. A UUID column is larger than an Identity column.