Table Modelling and Design

A critical part of the application development process is the task of modeling your data.

Proper modeling of your data is crucial to application performance, extensibility, application correctness, and finally, the ability for your application to support rich user experiences. In this article, you will learn some crucial aspects of data modeling and understand guidelines on how to model your persistent data for an Oracle NoSQL Database application.

The Oracle NoSQL Database gives the data modeler a large range of flexibility with respect to modeling application data. Understanding the tradeoffs associated with each level of flexibility is extremely useful in making wise data modeling decisions.

Schema Flexibility in Oracle NoSQL Database

Unlike the relational database world with purely fixed schemas, NoSQL Database is largely about schema flexibility – that is the ability to easily change how data is organized and stored.

Schema flexibility in Oracle NoSQL Database mostly takes the form of non-scalar data types. These non-scalar data types can be used to embed flexible structures inside your tables.

Non-scalar data types:

Oracle NoSQL database supports the following non-scalar data types:
  • JSON – JSON is a map of key/value pairs that can be used as a datatype of a column in the Oracle NoSQL Database. The JSON datatype gives you the ability to dynamically read and write attributes having no prior knowledge of what is and what is not stored in the JSON document. You can introspect into the document by reading from Oracle NoSQL Database as a JSON string, or you can specify path expressions as deep as you like into a hierarchy of JSON. As an example, you can create a JSON document that represents the variable terms and conditions of a contract. The document attribute names (or keys) can represent the tag or name of the contractual term or condition and the value of the attribute can represent the text of that term. Using a JSON brings you ultimate flexibility in your data model.
  • Records – Records containing scalar or non-scalar values can be used as a datatype for a column in an Oracle NoSQL Database table. You can think of a record as a document with a fixed set of attributes, however noting that one or more attributes of the record can be a non-fixed array or JSON document, giving you the flexibility to extend a fixed document without modifying the schema. Records present an interesting intermediate step between the benefits of the fixed schema world (single copy of a schema) and the ultimate flexibility of the JSON world.
  • Arrays – Arrays of scalar or non-scalar values can be used as a datatype for a column in an Oracle NoSQL Database table. Arrays can be convenient for storing a collection of event values. For example, you may wish to collect a list of behavioral segments for users as they browse web pages.

Trade-offs while using Flexible Schema:

Some guidelines that you can follow while considering flexible schema are listed below.

The Flexibility/Cost of scale Tradeoff :

When thinking about how flexible you want your schema to be, it’s important to understand that the more flexible you make your schema, the bigger the challenge is for scaling your solution. For example, let’s say that you are storing information on user behavior. And you want to store this information as the users access your website. You can implement one of the two options here. You can choose to model the solution using fixed columns for the required user attributes that you will need to track. Alternatively, you can choose to model this using a JSON document, giving you the flexibility to add and remove attributes for users without having to evolve your schema.

The second option may work quite well for small numbers of users; however, if you will need to scale this solution to large numbers of users, then you need extra storage. You also need additional compute overhead for processing the key/value pairs (attribute names and their values) in the JSON document. This could make the cost of scaling your solution prohibitive. Extra storage is needed to store the metadata along with the data (e.g. the attribute names) and extra compute is needed to serialize and de-serialize these documents. If you are using a replication factor of more than one, then that adds additional overhead for each tracked user. If a large scale is a major requirement for you then you’ll most likely want to trade off flexibility for storage efficiency and consider using a more fixed schema.

The Flexibility/Latency Tradeoff :

In many NoSQL applications, low latency data access is a key requirement. In these situations, it’s important to understand the potential tradeoff with respect to the I/O latency of using one data modeling method over another. In this respect, using a non-scalar data type such as a record, array, or JSON document will entail a read followed by an update.

For example, when you add a new value to an array your application must read the record from NoSQL Database first, add the value to the array, and then write it back to NoSQL Database. Even if performing this operation using the SQL UPDATE operator of Oracle NoSQL Database (which executes in the replication node), the record must still be read from persistent storage, de-serialized, modified, serialized, and written back. On a system with a spinning disk, this could cost anywhere from fifteen to thirty milliseconds (or more). For certain applications like online advertising, this may be beyond the latency SLA that can be tolerated. If you are faced with similar stringent latency SLAs then you should consider favoring a child table approach which eliminates the read and will allow you to simply perform a write of the new value. Of course, the tradeoff here is one of flexibility for low latency.

For more information on when to use parent-child tables, see Using Parent-Child tables in Oracle NoSQL Database.

Updates to Non-Scalars versus Inserts :

The Oracle NoSQL Database storage engine is based on an append-only architecture, also known as log-structured storage. Log structured storage systems perform extremely well for insert operations where the new records to be inserted are simply appended to the end of the log. Update operations involve appending the updated record to the log and then marking the old record for deletion. Records marked for deletion are regularly cleaned from NoSQL Database’s logs to free up disk space by a background process called the cleaner. Although the cleaner is highly optimized, it will add some CPU and I/O overhead to the replication node. The more updates performed by your application, the more log cleaning activity there will be.

As a guideline, if you have extreme performance goals for your application (or for a specific table), you should strongly consider trying to craft your data model by using parent/child tables versus non-scalar columns, giving you the potential for replacing updates with inserts. For more information on when to use parent-child tables, see Using Parent-Child tables in Oracle NoSQL Database.

Static Vs Dynamic Data :

In many applications, it’s possible to identify portions of the data that are somewhat static and change relatively slowly, and other portions of the data which are highly dynamic and change frequently, even at millisecond granularity. For example, in online advertising, campaigns are a relatively slow-moving piece of data while the budget spent (impressions or clicks delivered) can change every few milliseconds as millions of users load web pages that have ads associated with the campaign. The data pertaining to budgets is a case of highly dynamic data. This is an example of a scenario that has a high velocity of write operations. Oracle NoSQL database is a log-structured, append-only storage architecture, where inserts are more optimal than an update operation.

For the more static portions of your data, the flexibility of the non-scalar datatypes may be an attractive option for your application. Using a JSON document could provide an extensible way for your application to interact with this data without undue sacrifice to performance. On the flip side, for data that is changing rapidly or being inserted rapidly, you’ll want to consider trading off flexibility for this data and use a parent table with a fixed schema and a child table with a fixed schema. Whether or not you choose to model the rapidly changing data as a parent or child table will depend largely on how you wish to access it. For more information on when to use parent-child tables, see Using Parent-Child tables in Oracle NoSQL Database.

Choice of Keys in NoSQL Database

Primary keys and shard keys are important elements in your schema and help you access and distribute data efficiently.

Primary keys and shard keys are indispensable for data distribution and easy accessibility. You specify primary keys and shard keys only when you create a table. They remain in place for the life of the table, and cannot be changed or dropped.

Using Primary Keys and Shard Keys in Oracle NoSQL tables

Primary Keys

You must designate one or more primary key columns when you create your table. The primary key cannot be changed and exists for the life of the table. A primary key uniquely identifies every row in the table. For simple CRUD operations, Oracle NoSQL Database uses the primary key to retrieve a specific row to read or modify. Since the underlying storage in NoSQL Database is based on a key/value model, the choice of the primary key can greatly enhance the performance of certain lookup operations.

Shard Keys

The main purpose of shard keys is to distribute data across the Oracle NoSQL Database cluster for scalability and to co-locate the records that share the same shard key on the same physical node for easy reference. These records can be accessed atomically and efficiently.

Impact of keys while developing an application:

In an Oracle NoSQL Database, replication nodes are grouped together to form the shards of the NoSQL Database cluster. When an application asks to retrieve the record for a given key, the NoSQL Database driver will hash a portion of the key (denoted as the shard key ) to identify the shard that houses the data. Once the shard is identified, the NoSQL Database driver can choose to read the data from the most optimal replica in the shard, depending on the requested consistency level. With respect to the write operations, the NoSQL Database driver will always route the write requests to the dynamically elected leader node of the shard. Hence, from the perspective of workload scaling, you can generally think of this architecture as being scaled by adding shards. Oracle NoSQL Database supports the online elastic expansion of the cluster by adding shards, however, without the proper selection of a shard key, expanding the cluster will be useless in scaling your solution.

How you design primary keys and shard keys has huge implications on scaling and realizing the system throughput. For instance, when records share shard keys, you can delete multiple table rows in an atomic operation, or retrieve a subset of rows in your table in a single atomic operation. In addition to enabling scalability, well-designed shard keys can improve performance by requiring fewer cycles to put data on, or get data from, a single shard. Shard keys designate storage on the same shard to facilitate efficient queries for key values. However, because you want your data to be distributed across the shards for best performance and scalability, you will want to avoid shard keys that have a small number of unique values.

Important factors to consider when choosing a shard key:
  • Cardinality: Low cardinality field groups are stored together on a small number of shards. In turn, those shards require frequent data rebalancing, increasing the likelihood of hot shard issues. Instead, each shard key should have high cardinality, where it can express several million values. For best performance and value, choose fields with high cardinalities, such as identity numbers, where millions of records are possible.
  • Atomicity:Only objects that share the same shard key can participate in a transaction. If you have a requirement for ACID transactions that span multiple records, choose only a shard key that lets you meet that requirement.
Best practices to follow:
  • Uniform distribution of shard keys:Operations may be limited by the capacity of a single shard. When shard keys are uniformly distributed, no single shard limits the capacity of the system. Choosing one or more columns whose values are known to be uniformly distributed is ideal.
  • Query Isolation: Queries should be targeted to a specific shard to maximize scalability. If queries are not isolated to a single shard, the query will be applied to all shards. This is less efficient and increases query latency. Make sure your queries fetch data stored in a single shard. Well-designed shard keys can improve performance by getting data from a single shard. Shard keys designate storage on the same shard to facilitate efficient queries for key values. Specify the fields ( which are frequently used in your application queries) as shard keys.

Key Sizes and Key Only Modeling Methods

Oracle NoSQL Database caches the keys for each table. So the key size is a critical component to the effective use of memory and ultimately may be a determining factor in the ability of Oracle NoSQL Database to service your performance SLAs. Hence, it is important for you to create primary keys that are as efficient as possible with respect to size. For workloads that require very low latencies for the read and writes (single to low double-digit milliseconds) across millions of operations per second, exploiting cached keys in NoSQL’s B-trees can be the make or break of building an application capable of achieving these stringent requirements. Furthermore, if you can encode what would otherwise be non-key values as part of the primary key and also size your keys and the NoSQL Database cluster carefully, then you can realize the enormous benefits of memory cached B-tree access methods that are maintained with ACID semantics. For highly optimal, ultra-low latency applications, Oracle NoSQL provides key-only accessors for those workloads that can model everything as key-only data. Oracle NoSQL Database offers convenient key-only access APIs such as multiKetGeys and tableKeysIterator for doing key-only scans.

When considering whether or not key-only modeling of your data is right for your application, you should consider the following:
  • Latency and throughput SLAs – Do you have very stringent latency and throughput SLAs that would require a key-only model? Can you afford to perform an I/O when retrieving a value, noting that for spinning disks, the average latency of retrieving your value could be anywhere from fifteen to thirty milliseconds and for Single Shared Disk (SSD) this could be anywhere from one to 5 milliseconds.
  • Spinning disk versus SSDs – If you are considering using SSDs and your latency SLAs are for reads that can comfortably fit within the 5-millisecond range then it’s probably not worth the effort to try and craft a key-only model for your application.
  • Code maintainability and extensibility – Key-only modeling brings large performance benefits to your application at the potential cost of code maintainability and extensibility. You may find that encoding your value into the key can ultimately be a complex and esoteric strategy. Ultimately, you will have to make a judgment call on whether or not the code you develop and maintain is too complex and esoteric to be worth the benefit of the key-only solution.
  • Accurate sizing data – Is it possible for you to derive a somewhat accurate sizing of your keys such that you can adequately size the Oracle NoSQL Database cluster? Sizing the cluster and the cache of each replication node will be crucial to exploiting the benefits of a key-only data model.

Key Column Ordering and Query-ability

In Oracle NoSQL Database, the order of declaration for key columns is crucial to satisfying partial key lookup queries. This is because of the way that the storage engine manages the underlying B-trees. You can think of composite keys as an ordered concatenation of the columns specified in the DDL for the key declaration (primary key or index key). You should think of the order from the most significant column to the least significant column based on the appearance of the columns in the DDL for the key. If your table has a composite primary key ( a primary key with more than one column), then the primary key becomes a concatenation of the string representation of each column. Here for better performance of queries, it is important to specify the most commonly used query column as the most significant column in the primary key.

As you start to think about how you will size your cluster and your Oracle NoSQL Database caches, a critical consideration is to get an estimate of your key sizes. Sizing your caches so that Oracle NoSQL can keep most or all of the index nodes in memory can help your application realize enormous performance benefits. Understanding how keys are serialized and stored persistently can help you in getting a more accurate sizing estimate. In Oracle NoSQL Database, numeric keys are stored as compressed String values but must remain sortable when in string format. This means that a numeric key must be a fixed size when represented as a key string. See Initial Capacity Planning for more details on shard capacity, shard storage, and throughput capacities and how to estimate total shards and machines.

Using Indexes in NoSQL Database

n Oracle NoSQL Database, the query processor can identify which of the available indexes are beneficial for a query and rewrite the query to make use of such an index.

Using an index means scanning a contiguous subrange of its entries, potentially applying further filtering conditions on the entries within this subrange, and using the primary keys stored in the index entries to extract and return the associated table rows. The subrange of the index entries to scan is determined by the conditions appearing in the WHERE clause, some of which may be converted to search conditions for the index. Given that only a (hopefully small) subset of the index entries will satisfy the search conditions, the query can be evaluated without accessing each individual table row, thus saving a potentially large number of disk accesses.

In an Oracle NoSQL Database, a primary-key index is always created by default. This index maps the primary key columns of a table to the physical location of the table rows. Furthermore, if no other index is available, the primary index will be used. In other words, there is no pure table scan mechanism; a table scan is equivalent to a scan via the primary-key index. When it comes to indexes and queries, the query processor must answer two questions:
  1. Is an index applicable to a query? That is, will accessing the table via this index be more efficient than doing a full table scan (via the primary index)?
  2. Among the applicable indexes, which index or combination of indexes is the best to use?

There are no statistics on the number and distribution of values in a table column. As a result, the query processor has to rely on some simple heuristics in choosing among the applicable indexes. In addition, SQL for Oracle NoSQL Database allows for the inclusion of index hints in the queries. You can use index hints to force the use of a particular index in queries. You can use a query execution plan to understand what indexes are being used in the query. For more information on how a query is executed, see Query execution plan.

Secondary Index

There will be cases where you will want to use a secondary index to support some of your read requirements. Each secondary index that you add to a table will incur some overhead for writes as each index will need to be maintained. The good news with Oracle NoSQL is that secondary index partitions live on the same shard as the primary data, so the updates to the secondary index are limited on a per-shard basis. Index updates in Oracle NoSQL are also atomic, so your application can be guaranteed that updates to records in the shard are consistent with updates to the secondary index and these structures will never be out of sync. Another factor for consideration is that Oracle NoSQL Database nodes will keep the non-leaf index nodes in the cache, and will never cache the leaf portion (i.e. the data record). This gives the indexed scan an enormous performance benefit (for systems using spinning disk) over the non-indexed scan.

There are several things that you should think about when deciding on using a secondary index in Oracle NoSQL Database:
  • Filtering data close to the source – In Oracle NoSQL Database, secondary indexes are the primary mechanism for you to utilize when your query needs a filter and that filter needs to be executed as close as possible to the data. To fully understand why you may need a secondary index to filter your data for querying, let’s consider your options for scanning the data in a table:
    • Unordered parallel table scan with no full shard key – The shard key is a table column or multiple columns used to control how the rows of that table are distributed. The main purpose of shard keys is to distribute data across the Oracle NoSQL Database Cloud cluster for scalability, and to position records that share the same shard key locally for easy reference and access. When you write a query using filters as columns that are part of the shard key but also include other columns, then you end up doing a parallel table scan. Each shard is scanned in parallel and the data is returned to your application. This will return every record in the table across all shards in the NoSQL Database.
    • Ordered or unordered parallel index scan – The B-tree index at each shard is scanned in parallel. If an ordered scan is requested, the results are merged and presented.
  • Each option for scanning a table has its own costs and benefits and you should carefully weigh these tradeoffs and use what you know about the application requirements and expected workload to help guide your modeling decision.
    • Efficient range scans – Will it be common for your queries to restrict the value ranges? For example, if your application needs to answer queries like “find all records between a range of dates” then using secondary indexes in Oracle NoSQL Database will be the easiest and most efficient way for your application to answer these types of queries.
    • Workload and index maintenance update – Is it acceptable for writes to incur some extra overhead for index maintenance? Does your workload exhibit heavy read activity where latency for reads is more important than incurring extra write overhead?

See Tuning and Optimizing SQL queries for more guidelines on using indexes in queries.

Transactions in NoSQL database

In Oracle NoSQL Database, a transaction is treated as a logical, atomic unit of work that entails a single database operation.

Every data modification in the database takes place in a single transaction, managed by the system. Database developers do not have the ability to group multiple operations into a single transaction because there isn't the notion of begin/end transactions. In a database, transactional semantics are often described in terms of ACID properties.

ACID properties

In Oracle NoSQL Database, transactions maintain all the following properties and developers can control some of them.
  • Atomicity: Transaction either completes or fails in its entirety. There is no in-between state or no partial transactions.
  • Consistency: Transaction leaves the database in a valid state.
  • Isolation: No two transactions mingle or interfere with each other. Developers get the same result when the two transactions are executed in sequence or executed in parallel.
  • Durability: Changes in a transaction are saved and the changes survive any type of failure (network, disk, CPU, or a power failure).

Developers can define a wide range of consistency levels depending on the application's needs with the Oracle NoSQL Database Direct Driver. In addition, the Oracle NoSQL Database Drivers (commonly called the SDKs) support eventual and absolute consistency.

Developers can also configure durability such that updated rows in the database survive any failure with the Oracle NoSQL Database Direct Driver. Durability is not configurable in the SDKs.

Atomicity and Isolation are not configurable but Oracle NoSQL Database allows you to control consistency and durability policies in order to trade-off the performance for application needs. Some NoSQL databases only support eventual consistency but have no mechanism for absolute consistency.

Shard keys play an important role in achieving the ACID properties in the Oracle NoSQL database. For instance, when records share shard keys, you can delete multiple table rows in an atomic operation, or retrieve a subset of rows in your table in a single atomic operation. In addition to enabling scalability, well-designed shard keys can improve performance by requiring fewer cycles to put data on, or get data from, a single shard.

The NoSQL table hierarchy is an ideal data model for applications that need some data normalization, but also require predictable, low latency at scale. The hierarchy links different tables to enable left outer joins, combining rows from two or more tables based on related columns between them. Such joins execute efficiently since rows from the parent-child tables are co-located in the same shards. Also, writes to multiple tables in a table hierarchy obey transactional ACID properties since the records residing in each table of the hierarchy share the same shard key. All write operations perform as a single atomic unit. So all of the write operations will execute successfully, or none of them will.

Using Parent-Child tables in the Oracle NoSQL database

The Oracle NoSQL Database enables tables to exist in a parent-child relationship. This is known as table hierarchies.

Many NoSQL databases support data types like arrays and maps. When modeling a data relationship, application developers may find it easier to have each parent row store its child rows inside an array or a map in a nested structure. By doing so, not only is the data relationship denormalized but it has the potential for creating large parent rows, especially when the hierarchy is heavily nested, resulting in inefficient storage and poor performance. Oracle NoSQL Database’s table hierarchy is the ideal data model to avoid issues associated with arrays and maps. One of the biggest benefits of using child tables over embedded arrays is for those workloads that have a high velocity of write operations. When using embedded arrays, the write operations become updates, but when they are modeled as child tables, those operations become inserts. Inserts in a log-structured, append-only storage architecture are much more optimal than updates. Utilizing a table hierarchy should be considered when building data relationships in Oracle NoSQL Database.

The NoSQL table hierarchy is an ideal data model for applications that need some data normalization, but also require predictable, low latency at scale. The hierarchy links different tables to enable left outer joins, combining rows from two or more tables based on related columns between them. Such joins execute efficiently since rows from the parent-child tables are co-located in the same shards. Also, writes to multiple tables in a table hierarchy obey transactional ACID properties since the records residing in each table of the hierarchy share the same shard key. All write operations perform as a single atomic unit. So all of the write operations will execute successfully, or none of them will.

The Benefits of a Table Hierarchy

Oracle NoSQL Database table hierarchy comes with the following benefits:
  • Highly efficient for storing data in a parent-child hierarchy - Parent and child rows are stored in separate NoSQL tables, reducing the size of parent rows compared with the single parent with child rows in nested arrays or maps. Write operations on parent or child tables create new versions of smaller rows and store these changes efficiently, given the append-only architecture of Oracle NoSQL Database.
  • Highly performant for read and write workloads - Parent and child rows reside in the same local shard, enabling write and read operations to achieve high performance since all records in the hierarchy can be read or written in a single network call.
  • Highly flexible for fine-grained authorization - Access rights to a parent or child table can be configured individually based on conditions at run-time, offering granular and flexible authorization.
  • Scalable ACID transactions - Uniquely balance the goals of scalability, low latency, and ACID by co-locating parent and child data on the same shard.
  • Table joins - Data can be queried using the nested table clause or left outer joins.
Characteristics of parent-child tables:
  • A child table inherits the primary key columns of its parent table.
  • All tables in the hierarchy have the same shard key columns, which are specified in the create table statement of the root table.
  • A parent table cannot be dropped before its children are dropped.
  • A referential integrity constraint is not enforced in a parent-child table.

A NoSQL table hierarchy not only captures the relationship between data entities but also takes advantage of the co-location of the parent-child rows to offer highly performant retrievals and superior scalability. The table hierarchy enables applications to implement ACID transactions. All data in the same parent-child rows are stored in the same shard and can be committed as a single database operation to ensure atomicity, consistency, isolation, durability.