hadoop.hive.table (Oracle NoSQL Database Examples)

Package hadoop.hive.table Description

The Table API Hive Cookbook: documentation that describes how to run example Hive queries against data written via the Oracle NoSQL Database Table API.

Introduction

With the introduction of the Oracle NoSQL Database Hadoop integration classes, which support running Hadoop MapReduce jobs against data stored in an Oracle NoSQL Database table, it was natural to also provide new interfaces and classes which support running Hive queries against such table data (since a typical Hive query generally results in the execution of a MapReduce job). In addition to describing the core interfaces and classes involved in running a Hive query against data from a table located in a given Oracle NoSQL Database store (a KVStore), the information presented below also walks through the steps to take to execute a given set of basic Hive queries against example table data contained in a KVStore.

Note that this package does not contain any example Java source code. This is because a Hive query is expressed in the Hive Query Language (HQL), not Java. And the supporting example Java source and scripts on which this example depends is already provided with the Hadoop/Table API Example.

Prerequisites

Before attempting to execute the example that demonstrates the concepts presented in this document, you should first satisfy the following prerequisites:

Become familiar with both Apache Hive and its programming model; that is, become familiar with how to write and execute a Hive query.
Become familiar with Apache Hadoop; specifically, become familiar with how Hive and Hadoop interact.
Deploy a Hadoop cluster with 3 DataNodes running on machines with host names, dn-host-1, dn-host-2, and dn-host-3.
Install and configure Hive, with network access to the Hadoop cluster that was deployed; using either Derby (the default) or MySQL to store the Hive metadata.
Become familiar with the Hive Command Line Interface (the Hive CLI), and the Hive Query Language.
Become familiar with Oracle NoSQL Database and then install, start, and configure an Oracle NoSQL Database that is network reachable from the nodes of the Hadoop cluster.
Deploy a KVStore — named example-store — to 3 machines with host names, kv-host-1, kv-host-2, and kv-host-3; where 1 admin service, listening on port 13230, is deployed on kv-host-1.
Become familiar with the supporting scripts and Java classes from the Hadoop/Table API Example; and then follow the steps of that example to create and populate a table named vehicleTable with example data.

Using specific values for items such as the host names and admin port described above should allow you to more easily follow the example that is presented. Combined with the information contained in the Oracle NoSQL Database Getting Started Guide, as well as the Oracle NoSQL Database Admin Guide, you should then be able to generalize and extend the example to your own particular development scenario; substituting the values specific to the given environment where necessary.

A Brief Hive Primer

Paraphrasing wikipedia, Apache Hive is a data warehouse infrastructure built on top of Apache Hadoop, that facilitates querying datasets residing in distributed file systems or data stores such as Hadoop HDFS or Amazon S3. Additionally, Hive also provides a pluggable programming model that allows you to specify custom interfaces and classes that support querying data residing in data sources other than HDFS and S3; in particular, data written to an Oracle NoSQL Database table. To access and analyze data stored in these data sources, Hive provides a mechanism to project structure onto the data and query the data using a SQL-like language called the Hive Query Language, or HQL. Depending on the complexity of a given Hive query, the Hive infrastructure may construct and deploy a set of MapReduce jobs to retrieve and process the data or, when possible, it may simply satisfy the query via the metadata stored in the Hive metastore (Derby or MySQL).

In addition to the Hive infrastructure itself, Hive also provides a convenient client-side command line interface (the Hive CLI); which allows you to interact with the Hive infrastructure to create a Hive external table and then map it to the data located in a source like those described above.

As indicated above, a new set of interfaces and classes that satisfy the Hive programming model have been provided which support running Hive queries against table data contained in a KVStore. These new classes are located in the oracle.kv.hadoop.hive.table package, and consist of the following Hive and Hadoop types:

A subclass of the Hive class, org.apache.hadoop.hive.ql.metadata.HiveStorageHandler, which is the mechanism (pluggable interface) to use to specify to the Hive infrastructure where the data to process is located, as well as how to process that data; and which consists of the following components:
- A subclass of the Hadoop MapReduce version 1 class, org.apache.hadoop.mapred.InputFormat, which specifies how the associated MapReduce job reads its input data (from an Oracle NoSQL Database table).
- A subclass of the Hadoop MapReduce version 1 class, org.apache.hadoop.mapred.OutputFormat, which specifies how the associated MapReduce job writes its output.
- A subclass of the Hive class, org.apache.hadoop.hive.serde2.SerDe, which is used to serialize the table data that is retrieved and sent to the Hadoop MapReduce job for processing, and to deserialize the results from the Hadoop MapReduce job so those results can be output and/or displayed.
- Metadata hooks for keeping an external catalog in sync with the Hive metastore.
- Rules for setting up the configuration properties on MapReduce jobs run against the data being processed.
A subclass of the Hadoop version 1 class, org.apache.hadoop.mapred.RecordReader, which specifies how the mapped keys and values are located and retrieved during any MapReduce processing performed as part of executing a Hive query.
A subclass of the Hadoop version 1 class, org.apache.hadoop.mapred.InputSplit, which represents the data to be processed by an individual Mapper that operates during the MapReduce processing performed as part of executing a Hive query.

As described below, it is through the specific implementation of the org.apache.hadoop.hive.ql.metadata.HiveStorageHandler class provided in the Oracle NoSQL Database distribution that the Hive infrastructure obtains access to a given KVStore and the desired table data on which to run the desired Hive query.

The Oracle NoSQL Database Table API Hive Integration Classes

To support running Hive queries against data stored in a table of an Oracle NoSQL Database store, the following core classes are employed:

oracle.kv.hadoop.hive.table.TableStorageHandler
oracle.kv.hadoop.hive.table.TableHiveInputFormat
oracle.kv.hadoop.hive.table.TableHiveInputSlit
oracle.kv.hadoop.hive.table.TableHiveRecordReader
oracle.kv.hadoop.hive.table.TableSerDe
Various implementations — specific to Oracle NoSQL Database — of the Hive org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector interface; for example, oracle.kv.hadoop.hive.table.TableIntObjectInspector.

For more detail about the semantics of the classes listed above, refer to the javadoc of each respective class.

Note: YARN versus MapReduce version 1

It is important to note that currently, Hadoop deployments include two versions of MapReduce. The first version (referred to as MRv1) is the original version of MapReduce; and consists of interfaces and classes from the Java package org.apache.hadoop.mapred. The newer version of MapReduce is referred to as YARN (Yet Another Resource Negotiator) or, more generally, MRv2; and resides in the package org.apache.hadoop.mapreduce. Unfortunately, for the following reasons, the Table API Hive integration classes must address the existence of both versions of MapReduce:

Hive currently employs MRv1, whereas the Oracle NoSQL Database Table API Hadoop integration classes employ MRv2.
MRv1 and MRv2 are source incompatible.

As a result, the core classes listed above must be subclasses of the various MRv1 classes listed in the previous section. For example, rather than subclassing the MRv2 based Table API Hadoop integration class TableInputFormat (which would be preferred), because of the incompatability between MRv1 and MRv2, the Table API Hive integration class described above (TableHiveInputFormat) actually subclasses the Hadoop MRv1 class org.apache.hadoop.mapred.InputFormat. Thus, to exploit and reuse the mechanisms provided by the Table API Hadoop integration classes, the Table API Hive integration classes presented here internally create, manage, and delegate to subclasses of the appropriate MRv2 based classes.

It is also important to note that because the Table API Hadoop integration classes do not currently support writing data from a MapReduce job into a KVStore, the Table API Hive integration classes do not support queries that modify the contents of a table in a KVStore.

Example: Executing Hive Queries Against Oracle NoSQL Database Table Data

As indicated above, this example package contains no scripts or Java source files. This is because Hive queries are written in the Hive Query Language (HQL) and are executed from the Hive command line; thus, there is no example Java program to demonstrate the integration of the Table API with Hive, only the queries themselves. Nevertheless, to run this example, you do have to first create and populate an example table with a schema and data that is consistent with the example queries that are presented here; which can be achieved by following the instructions presented in the Hadoop/Table API Example.

Thus, after satisfying the prerequisites listed above, if not done so already, you should create and populate a table named vehicleTable. Once these initial steps are performed, the example Hive queries described below can be run against the data stored in that table.

Before proceeding though, execute the Hive interactive CLI from the Hive client; which should provide access to the Hive CLI command prompt. To execute the Hive CLI, type the following command from the system command line of the Hive client:

  > hive

which should then present the following Hive CLI command prompt:

  hive>

At this point, Hive commands can be executed, as described below.

Create a Hive EXTERNAL table corresponding to the Oracle NoSQL Database table

From the Hive command prompt type the following:

  hive> CREATE EXTERNAL TABLE IF NOT EXISTS 
          vehicleTable (TYPE STRING, MAKE STRING, MODEL STRING, CLASS STRING, COLOR STRING, PRICE DOUBLE, COUNT INT) 
          STORED BY 'oracle.kv.hadoop.hive.table.TableStorageHandler' 
          TBLPROPERTIES ("oracle.kv.kvstore" = "example-store",
                         "oracle.kv.hosts" = "kv-host-1:13230",
                         "oracle.kv.tableName" = "vehicleTable",
                         "oracle.kv.hadoop.hosts" = "dn-host-1,dn-host-2,dn-host-3");

The command above creates a Hive table named vehicleTable with columns whose type is consistent with the corresponding fields of the Oracle NoSQL Database table created above; also named vehicleTable. Additionally, there are a number of things to note:

The contents of the example command above are displayed on separate lines in this document for readability, but in practice, because the Hive command interpreter may have trouble handling such multi-line commands, it is generally best to enter a single, continuous command with no line breaks.
The Hive table name is not required to be the same as the name of the table created in Oracle NoSQL Database; they are the same in this example only for convenience.
If the KVStore is configured with multiple administrative hosts, then any subset of the names of those hosts can be included in the value of the "oracle.kv.hosts" property; as long as at least one valid administrative host (and port) is included.
With respect to the "oracle.kv.hadoop.hosts" property in the example command above:
- This property is currently optional for all systems except the Oracle BigDataSQL system.
- If this property is specified on a system that does not require it, it will have no effect.
- If it is specified on a system that does require it, its value must contain the names of all of the DataNodes making up the Hadoop cluster.

Execute example Hive queries against the Oracle NoSQL Database table

From the Hive command prompt type the following queries and observe the corresponding results; which, if successful, should look something like the output shown below.

List the contents of each row in the table

  hive> select * from vehicleTable;

  OK
  auto  Chrysler Imperial FrontWheelDrive white  20743.943359375 5
  auto  GM       Impala   4WheelDrive     black  20743.91015625  46
  auto  GM       Impala   FrontWheelDrive yellow 20743.5390625   28
  truck Ford     F250     AllWheelDrive   blue   31115.759765625 47
  ..........

Note that unlike the more complicated example queries below, the query above does not result in the execution of a MapReduce job. This is because there is enough metadata in the Hive metastore to satisfy the query.

Count the total number of rows in the table

  hive> select count(type) from vehicleTable;

  Launching Job 1 out of 1
  ..........
  Hadoop job information for Stage-1: number of mappers: 6; number of reducers: 1
  2014-12-12 12:04:18,403 Stage-1 map = 0%,  reduce = 0%
  2014-12-12 12:05:12,431 Stage-1 map = 7%,  reduce = 0%, Cumulative CPU 2.26 sec
  2014-12-12 12:05:13,816 Stage-1 map = 21%,  reduce = 0%, Cumulative CPU 6.7 sec
  2014-12-12 12:05:15,201 Stage-1 map = 30%,  reduce = 0%, Cumulative CPU 6.87 sec
  2014-12-12 12:05:16,594 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 14.16 sec
  2014-12-12 12:05:17,980 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 14.16 sec
  2014-12-12 12:05:19,364 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 14.16 sec
  2014-12-12 12:05:20,754 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 15.24 sec
  2014-12-12 12:05:22,140 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 15.24 sec
  ..........
  Job 0: Map: 6  Reduce: 1   Cumulative CPU: 15.24 sec   HDFS Read: 4532 HDFS Write: 3 SUCCESS
  Total MapReduce CPU Time Spent: 15 seconds 240 msec
  OK
  79
  Time taken: 89.359 seconds, Fetched: 1 row(s)

Find the vehicle that has the lowest price

  hive> select min(price) from vehicleTable;

  Launching Job 1 out of 1
  ..........
  Hadoop job information for Stage-1: number of mappers: 6; number of reducers: 1
  2014-12-12 12:11:10,924 Stage-1 map = 0%,  reduce = 0%
  2014-12-12 12:12:06,213 Stage-1 map = 21%,  reduce = 0%, Cumulative CPU 6.77 sec
  2014-12-12 12:12:07,606 Stage-1 map = 21%,  reduce = 0%, Cumulative CPU 6.77 sec
  2014-12-12 12:12:09,076 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 14.16 sec
  2014-12-12 12:12:10,464 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 14.3 sec
  2014-12-12 12:12:11,849 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 14.3 sec
  2014-12-12 12:12:13,238 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 14.3 sec
  2014-12-12 12:12:14,629 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 15.38 sec
  2014-12-12 12:12:16,031 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 15.38 sec
  ..........
  Job 0: Map: 6  Reduce: 1   Cumulative CPU: 15.38 sec   HDFS Read: 4532 HDFS Write: 16 SUCCESS
  Total MapReduce CPU Time Spent: 15 seconds 380 msec
  OK
  20743.244140625
  Time taken: 89.615 seconds, Fetched: 1 row(s)

Find all GM vehicles

  hive> select * from vehicleTable where make LIKE "%GM%"; 

  Launching Job 1 out of 1
  ..........
  Hadoop job information for Stage-1: number of mappers: 6; number of reducers: 0
  2014-12-12 12:19:24,269 Stage-1 map = 0%,  reduce = 0%
  2014-12-12 12:20:18,239 Stage-1 map = 9%,  reduce = 0%, Cumulative CPU 2.43 sec
  2014-12-12 12:20:19,622 Stage-1 map = 26%,  reduce = 0%, Cumulative CPU 4.81 sec
  2014-12-12 12:20:21,006 Stage-1 map = 26%,  reduce = 0%, Cumulative CPU 7.14 sec
  2014-12-12 12:20:22,395 Stage-1 map = 79%,  reduce = 0%, Cumulative CPU 13.09 sec
  2014-12-12 12:20:23,777 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 16.06 sec
  2014-12-12 12:20:25,162 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 16.06 sec
  ..........
  Job 0: Map: 6   Cumulative CPU: 16.06 sec   HDFS Read: 4532 HDFS Write: 1491 SUCCESS
  Total MapReduce CPU Time Spent: 16 seconds 60 msec
  OK
  suv   GM Equinox  4WheelDrive      yellow 41486.78125     37
  truck GM Sierra   4WheelDrive      black  31115.224609375 87
  auto  GM Corvette FrontWheelDrive  yellow 20743.84375     7
  auto  GM Impala   4WheelDrive      black  20743.91015625  46
  ..........

Find all vehicles that are Sierra model trucks

  hive> select * from vehicleTable where model LIKE "%Sierra%"; 

  Launching Job 1 out of 1
  ..........
  Hadoop job information for Stage-1: number of mappers: 6; number of reducers: 0
  2014-12-12 13:36:24,284 Stage-1 map = 0%,  reduce = 0%
  2014-12-12 13:37:19,528 Stage-1 map = 39%,  reduce = 0%, Cumulative CPU 9.35 sec
  2014-12-12 13:37:20,910 Stage-1 map = 65%,  reduce = 0%, Cumulative CPU 11.98 sec
  2014-12-12 13:37:22,296 Stage-1 map = 97%,  reduce = 0%, Cumulative CPU 15.31 sec
  2014-12-12 13:37:23,681 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 15.58 sec
  2014-12-12 13:37:25,069 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 15.58 sec
  ..........
  Job 0: Map: 6   Cumulative CPU: 15.58 sec   HDFS Read: 4532 HDFS Write: 496 SUCCESS
  Total MapReduce CPU Time Spent: 15 seconds 580 msec
  OK
  truck GM Silverado2500 4WheelDrive           blue          31115.548828125 68
  truck GM Silverado2500 4WheelDrive-4cylinder blue-on-green 31114.91015625  17
  truck GM Silverado2500 AllWheelDrive         white         31115.275390625 36
  truck GM Silverado2500 AllWheelDrive         yellow        31114.796875    73
  truck GM Silverado1500 4WheelDrive           black         31114.98046875  64
  truck GM Silverado1500 4WheelDrive           green         31115.078125    38
  truck GM Silverado1500 RearWheelDrive        blue          31115.640625    55
  truck GM Silverado1500 RearWheelDrive        white         31115.517578125 37
  ..........
  Time taken: 83.589 seconds, Fetched: 8 row(s)