A Brief Primer on Apache Hive

Paraphrasing Wikipedia, Apache Hive is a data warehouse infrastructure built on top of Apache Hadoop that facilitates querying datasets residing in distributed file systems such as the Hadoop Distributed File System (HDFS) or in compatible file systems. In addition to those built-in features, Hive provides a pluggable programming model that allows you to specify custom interfaces and classes that support querying data residing in data sources such as the Oracle NoSQL Database.

In addition to its pluggable programming model, Hive provides a convenient client-side command line interface (the Hive CLI), which allows you to interact with the Hive infrastructure to create a Hive external table and then map it to the data located in remote sources like those just described.
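As a rough illustration of such a mapping, an external table definition entered at the Hive CLI might look like the following HiveQL sketch. The table name, column definitions, store name, and helper hosts here are hypothetical, and the storage handler class and property names should be checked against the documentation for your Oracle NoSQL Database release.

```sql
-- Hypothetical mapping of a Hive external table onto an Oracle NoSQL
-- Database table named "users" in a store named "kvstore".
CREATE EXTERNAL TABLE users_hive (
    id INT,
    name STRING,
    age INT)
STORED BY 'oracle.kv.hadoop.hive.table.TableStorageHandler'
TBLPROPERTIES (
    "oracle.kv.kvstore" = "kvstore",
    "oracle.kv.hosts" = "kvhost1:5000,kvhost2:5000",
    "oracle.kv.tableName" = "users");
```

Once such a table is created, Hive queries against `users_hive` are routed through the storage handler to the data in the store.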

Oracle NoSQL Database provides a set of interfaces and classes that satisfy the Hive programming model so that the Hive Query Language can be used to query data contained in an Oracle NoSQL Database store (either secure or non-secure). The classes that are defined for that purpose are located in the Java package oracle.kv.hadoop.hive.table (see Java API), and consist of the following Hive and Hadoop types:

  • A subclass of the Hive class org.apache.hadoop.hive.ql.metadata.HiveStorageHandler. The HiveStorageHandler is the mechanism (the pluggable interface) Oracle NoSQL Database uses to specify the location of the data that the Hive infrastructure should process, as well as how that data should be processed. The HiveStorageHandler consists of the following components:
    • A subclass of the Hadoop MapReduce version 1 class org.apache.hadoop.mapred.InputFormat, where InputFormat specifies how the associated MapReduce job reads its input data, taken from the Oracle NoSQL Database table.
    • A subclass of the Hadoop MapReduce version 1 class org.apache.hadoop.mapred.OutputFormat, where OutputFormat specifies how the associated MapReduce job writes its output.
    • A subclass of the Hive class org.apache.hadoop.hive.serde2.AbstractSerDe. The AbstractSerDe class and its associated subclasses are used to deserialize the table data that is retrieved and sent to the Hive infrastructure and/or the Hadoop MapReduce job for further processing. Although writing to the store from Hive is not currently supported, this mechanism could also be used to serialize data input to Hive for writing to an Oracle NoSQL Database table.
    • Metadata hooks for keeping an external catalog in sync with the Hive Metastore component.
    • Rules for setting up the configuration properties on MapReduce jobs run against the data being processed.
    • An implementation of the interface org.apache.hadoop.hive.ql.metadata.HiveStoragePredicateHandler. As described in the Predicate Pushdown appendix, the implementation of HiveStoragePredicateHandler provided by Oracle NoSQL Database supports the decomposition of a query's WHERE clause (the predicates of the query) into information that can be passed to the database. This allows some, or even all, of the search processing to be performed in the database itself rather than on the client side of the query.
  • A subclass of the Hadoop MapReduce version 1 class org.apache.hadoop.mapred.RecordReader, where a RecordReader is used to specify how the mapped keys and values are located and retrieved during any MapReduce processing performed while executing a Hive query.
  • A subclass of the Hadoop MapReduce version 1 class org.apache.hadoop.mapred.InputSplit, where an InputSplit is used to represent the data to be processed by an individual Mapper that operates during the MapReduce processing performed as part of executing a Hive query.
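To make the predicate pushdown mechanism concrete, consider a hedged example: in a query such as the one below (the table and column names are hypothetical), the WHERE clause is a candidate for decomposition by the HiveStoragePredicateHandler implementation, so that the range comparison on age can be evaluated in the store rather than by filtering rows on the client side.

```sql
-- Hypothetical query against a Hive external table mapped to an
-- Oracle NoSQL Database table; the predicates on "age" may be
-- pushed down to the store for server-side filtering.
SELECT name, age
FROM users_hive
WHERE age >= 21 AND age < 65;
```

Which predicates can actually be pushed down depends on the rules described in the Predicate Pushdown appendix; predicates that cannot be pushed are still applied by Hive on the client side.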

See Apache Hadoop API and Hive API for more details.

As described in the following sections, it is through the implementation of the HiveStorageHandler provided by Oracle NoSQL Database that the Hive infrastructure obtains access to a given Oracle NoSQL Database store and, ultimately, the table data on which to run the desired Hive query.