Oracle NoSQL Database Examples
version 12cR1.3.3.4

Package hadoop.hive.table

The Table API Hive Cookbook: documentation that describes how to run example Hive queries against data written via the Oracle NoSQL Database Table API.


Introduction

With the introduction of the Oracle NoSQL Database Hadoop integration classes, which support running Hadoop MapReduce jobs against data stored in an Oracle NoSQL Database table, it was natural to also provide interfaces and classes that support running Hive queries against such table data, since a typical Hive query generally results in the execution of a MapReduce job. In addition to describing the core interfaces and classes involved in running a Hive query against data from a table in a given Oracle NoSQL Database store (a KVStore), the information presented below also walks through the steps needed to execute a set of basic Hive queries against example table data contained in a KVStore.

Note that this package does not contain any example Java source code, because a Hive query is expressed in the Hive Query Language (HQL), not Java. The supporting example Java source and scripts on which this example depends are already provided with the Hadoop/Table API Example.

Prerequisites

Before attempting to execute the example that demonstrates the concepts presented in this document, you should first satisfy the following prerequisites:

Using specific values for items such as the host names and admin port described above should allow you to follow the example more easily. Combined with the information contained in the Oracle NoSQL Database Getting Started Guide, as well as the Oracle NoSQL Database Admin Guide, you should then be able to generalize and extend the example to your own development scenario, substituting values specific to your environment where necessary.

A Brief Hive Primer

Paraphrasing Wikipedia, Apache Hive is a data warehouse infrastructure built on top of Apache Hadoop that facilitates querying datasets residing in distributed file systems or data stores such as Hadoop HDFS or Amazon S3. Hive also provides a pluggable programming model that allows you to specify custom interfaces and classes that support querying data residing in data sources other than HDFS and S3; in particular, data written to an Oracle NoSQL Database table. To access and analyze data stored in these data sources, Hive provides a mechanism to project structure onto the data and query it using a SQL-like language called the Hive Query Language, or HQL. Depending on the complexity of a given Hive query, the Hive infrastructure may construct and deploy a set of MapReduce jobs to retrieve and process the data or, when possible, it may satisfy the query directly from the metadata stored in the Hive metastore (Derby or MySQL).

In addition to the Hive infrastructure itself, Hive provides a convenient client-side command line interface (the Hive CLI), which allows you to interact with the Hive infrastructure to create a Hive external table and then map it to the data located in a source like those described above.

As indicated above, a new set of interfaces and classes that satisfy the Hive programming model is provided to support running Hive queries against table data contained in a KVStore. These classes are located in the oracle.kv.hadoop.hive.table package and consist of the following Hive and Hadoop types:

As described below, it is through the specific implementation of the org.apache.hadoop.hive.ql.metadata.HiveStorageHandler interface provided in the Oracle NoSQL Database distribution that the Hive infrastructure obtains access to a given KVStore and the table data on which to run the desired Hive query.
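The role a storage handler plays can be pictured with a simplified, hypothetical sketch. The interface and class below are illustrations only, not the real org.apache.hadoop.hive.ql.metadata.HiveStorageHandler API or the actual oracle.kv.hadoop.hive.table implementation; the property names mirror the TBLPROPERTIES used later in this document.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the Hive storage handler contract: it tells
// Hive which classes read and deserialize the external data, and copies
// the table's connection properties into the job configuration.
interface StorageHandlerSketch {
    String inputFormatClassName();   // which InputFormat produces splits/rows
    String serDeClassName();         // which SerDe deserializes each row
    void configureJobProperties(Map<String, String> jobProps);
}

public class TableStorageHandlerSketch implements StorageHandlerSketch {
    @Override
    public String inputFormatClassName() {
        return "oracle.kv.hadoop.hive.table.TableHiveInputFormat";
    }

    @Override
    public String serDeClassName() {
        // Illustrative class name; see the package javadoc for the real types.
        return "oracle.kv.hadoop.hive.table.TableSerDe";
    }

    @Override
    public void configureJobProperties(Map<String, String> jobProps) {
        // Copy the TBLPROPERTIES from the CREATE EXTERNAL TABLE statement
        // into the job configuration so each task can contact the store.
        jobProps.put("oracle.kv.kvstore", "example-store");
        jobProps.put("oracle.kv.hosts", "kv-host-1:13230");
        jobProps.put("oracle.kv.tableName", "vehicleTable");
    }

    public static void main(String[] args) {
        Map<String, String> jobProps = new HashMap<>();
        new TableStorageHandlerSketch().configureJobProperties(jobProps);
        System.out.println(jobProps.get("oracle.kv.kvstore")); // example-store
    }
}
```

In the real integration, Hive invokes the storage handler named in the STORED BY clause to wire up these pieces automatically; the sketch only shows the shape of that hand-off.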

The Oracle NoSQL Database Table API Hive Integration Classes

To support running Hive queries against data stored in a table of an Oracle NoSQL Database store, the following core classes are employed:

For more detail about the semantics of the classes listed above, refer to the javadoc of each respective class.

Note: YARN versus MapReduce version 1

It is important to note that Hadoop deployments currently include two versions of MapReduce. The first version (referred to as MRv1) is the original version of MapReduce and consists of interfaces and classes from the Java package org.apache.hadoop.mapred. The newer version of MapReduce is referred to as YARN (Yet Another Resource Negotiator) or, more generally, MRv2, and resides in the package org.apache.hadoop.mapreduce. Unfortunately, for the following reasons, the Table API Hive integration classes must address the existence of both versions of MapReduce:

As a result, the core classes listed above must be subclasses of the various MRv1 classes listed in the previous section. For example, because of the incompatibility between MRv1 and MRv2, the Table API Hive integration class TableHiveInputFormat subclasses the Hadoop MRv1 class org.apache.hadoop.mapred.InputFormat rather than the MRv2-based Table API Hadoop integration class TableInputFormat (which would otherwise be preferred). Thus, to exploit and reuse the mechanisms provided by the Table API Hadoop integration classes, the Table API Hive integration classes presented here internally create, manage, and delegate to subclasses of the appropriate MRv2-based classes.
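The delegation described above can be sketched with two tiny stand-in types. These are not the real Hadoop interfaces (org.apache.hadoop.mapred.InputFormat and org.apache.hadoop.mapreduce.InputFormat are richer and not source-compatible with each other); the sketch only illustrates the pattern of exposing the MRv1 contract while reusing an MRv2-based implementation internally.

```java
import java.util.Arrays;
import java.util.List;

// Stand-in for the MRv1 contract that Hive expects to call.
interface V1InputFormatSketch {
    List<String> getSplits(int numSplitsHint);
}

// Stand-in for the MRv2-based Table API Hadoop integration class.
class V2InputFormatSketch {
    List<String> getSplits() {
        return Arrays.asList("split-0", "split-1", "split-2");
    }
}

// The Hive-facing class implements the MRv1 contract but internally
// creates and delegates to the MRv2-based implementation, mirroring
// how TableHiveInputFormat reuses the Table API Hadoop classes.
public class V1ToV2AdapterSketch implements V1InputFormatSketch {
    private final V2InputFormatSketch delegate = new V2InputFormatSketch();

    @Override
    public List<String> getSplits(int numSplitsHint) {
        // Delegate split computation to the MRv2-style class.
        return delegate.getSplits();
    }

    public static void main(String[] args) {
        System.out.println(new V1ToV2AdapterSketch().getSplits(6).size()); // 3
    }
}
```

The same wrapping applies to the record readers and splits: each MRv1-facing object holds and forwards to its MRv2 counterpart.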

It is also important to note that because the Table API Hadoop integration classes do not currently support writing data from a MapReduce job into a KVStore, the Table API Hive integration classes do not support queries that modify the contents of a table in a KVStore.

Example: Executing Hive Queries Against Oracle NoSQL Database Table Data

As indicated above, this example package contains no scripts or Java source files, because Hive queries are written in the Hive Query Language (HQL) and are executed from the Hive command line; thus, there is no example Java program to demonstrate the integration of the Table API with Hive, only the queries themselves. Nevertheless, to run this example you must first create and populate an example table with a schema and data consistent with the example queries presented here, which can be achieved by following the instructions in the Hadoop/Table API Example.

Thus, after satisfying the prerequisites listed above, create and populate a table named vehicleTable if you have not already done so. Once these initial steps are performed, the example Hive queries described below can be run against the data stored in that table.

Before proceeding, launch the Hive interactive CLI from the Hive client, which provides access to the Hive CLI command prompt. To execute the Hive CLI, type the following command from the system command line of the Hive client:

  > hive
which should then present the following Hive CLI command prompt:
  hive>
At this point, Hive commands can be executed, as described below.

Create a Hive EXTERNAL table corresponding to the Oracle NoSQL Database table

From the Hive command prompt type the following:
  hive> CREATE EXTERNAL TABLE IF NOT EXISTS 
          vehicleTable (TYPE STRING, MAKE STRING, MODEL STRING, CLASS STRING, COLOR STRING, PRICE DOUBLE, COUNT INT) 
          STORED BY 'oracle.kv.hadoop.hive.table.TableStorageHandler' 
          TBLPROPERTIES ("oracle.kv.kvstore" = "example-store",
                         "oracle.kv.hosts" = "kv-host-1:13230",
                         "oracle.kv.tableName" = "vehicleTable",
                         "oracle.kv.hadoop.hosts" = "dn-host-1,dn-host-2,dn-host-3");
The command above creates a Hive table named vehicleTable with columns whose types are consistent with the corresponding fields of the Oracle NoSQL Database table created above, which is also named vehicleTable. Additionally, there are a number of things to note:

Execute example Hive queries against the Oracle NoSQL Database table

From the Hive command prompt, type the following queries and observe the corresponding results, which, if successful, should look something like the output shown below.

List the contents of each row in the table

  hive> select * from vehicleTable;

  OK
  auto  Chrysler Imperial FrontWheelDrive white  20743.943359375 5
  auto  GM       Impala   4WheelDrive     black  20743.91015625  46
  auto  GM       Impala   FrontWheelDrive yellow 20743.5390625   28
  truck Ford     F250     AllWheelDrive   blue   31115.759765625 47
  ..........
Note that unlike the more complicated example queries below, the query above does not result in the execution of a MapReduce job. This is because the query simply retrieves every row with no filtering or aggregation, so Hive can satisfy it by fetching the data directly.

Count the total number of rows in the table

  hive> select count(type) from vehicleTable;

  Launching Job 1 out of 1
  ..........
  Hadoop job information for Stage-1: number of mappers: 6; number of reducers: 1
  2014-12-12 12:04:18,403 Stage-1 map = 0%,  reduce = 0%
  2014-12-12 12:05:12,431 Stage-1 map = 7%,  reduce = 0%, Cumulative CPU 2.26 sec
  2014-12-12 12:05:13,816 Stage-1 map = 21%,  reduce = 0%, Cumulative CPU 6.7 sec
  2014-12-12 12:05:15,201 Stage-1 map = 30%,  reduce = 0%, Cumulative CPU 6.87 sec
  2014-12-12 12:05:16,594 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 14.16 sec
  2014-12-12 12:05:17,980 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 14.16 sec
  2014-12-12 12:05:19,364 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 14.16 sec
  2014-12-12 12:05:20,754 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 15.24 sec
  2014-12-12 12:05:22,140 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 15.24 sec
  ..........
  Job 0: Map: 6  Reduce: 1   Cumulative CPU: 15.24 sec   HDFS Read: 4532 HDFS Write: 3 SUCCESS
  Total MapReduce CPU Time Spent: 15 seconds 240 msec
  OK
  79
  Time taken: 89.359 seconds, Fetched: 1 row(s)
Find the vehicle that has the lowest price
  hive> select min(price) from vehicleTable;

  Launching Job 1 out of 1
  ..........
  Hadoop job information for Stage-1: number of mappers: 6; number of reducers: 1
  2014-12-12 12:11:10,924 Stage-1 map = 0%,  reduce = 0%
  2014-12-12 12:12:06,213 Stage-1 map = 21%,  reduce = 0%, Cumulative CPU 6.77 sec
  2014-12-12 12:12:07,606 Stage-1 map = 21%,  reduce = 0%, Cumulative CPU 6.77 sec
  2014-12-12 12:12:09,076 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 14.16 sec
  2014-12-12 12:12:10,464 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 14.3 sec
  2014-12-12 12:12:11,849 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 14.3 sec
  2014-12-12 12:12:13,238 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 14.3 sec
  2014-12-12 12:12:14,629 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 15.38 sec
  2014-12-12 12:12:16,031 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 15.38 sec
  ..........
  Job 0: Map: 6  Reduce: 1   Cumulative CPU: 15.38 sec   HDFS Read: 4532 HDFS Write: 16 SUCCESS
  Total MapReduce CPU Time Spent: 15 seconds 380 msec
  OK
  20743.244140625
  Time taken: 89.615 seconds, Fetched: 1 row(s)
Find all GM vehicles
  hive> select * from vehicleTable where make LIKE "%GM%"; 

  Launching Job 1 out of 1
  ..........
  Hadoop job information for Stage-1: number of mappers: 6; number of reducers: 0
  2014-12-12 12:19:24,269 Stage-1 map = 0%,  reduce = 0%
  2014-12-12 12:20:18,239 Stage-1 map = 9%,  reduce = 0%, Cumulative CPU 2.43 sec
  2014-12-12 12:20:19,622 Stage-1 map = 26%,  reduce = 0%, Cumulative CPU 4.81 sec
  2014-12-12 12:20:21,006 Stage-1 map = 26%,  reduce = 0%, Cumulative CPU 7.14 sec
  2014-12-12 12:20:22,395 Stage-1 map = 79%,  reduce = 0%, Cumulative CPU 13.09 sec
  2014-12-12 12:20:23,777 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 16.06 sec
  2014-12-12 12:20:25,162 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 16.06 sec
  ..........
  Job 0: Map: 6   Cumulative CPU: 16.06 sec   HDFS Read: 4532 HDFS Write: 1491 SUCCESS
  Total MapReduce CPU Time Spent: 16 seconds 60 msec
  OK
  suv   GM Equinox  4WheelDrive      yellow 41486.78125     37
  truck GM Sierra   4WheelDrive      black  31115.224609375 87
  auto  GM Corvette FrontWheelDrive  yellow 20743.84375     7
  auto  GM Impala   4WheelDrive      black  20743.91015625  46
  ..........
Find all vehicles that are Sierra model trucks
  hive> select * from vehicleTable where model LIKE "%Sierra%"; 

  Launching Job 1 out of 1
  ..........
  Hadoop job information for Stage-1: number of mappers: 6; number of reducers: 0
  2014-12-12 13:36:24,284 Stage-1 map = 0%,  reduce = 0%
  2014-12-12 13:37:19,528 Stage-1 map = 39%,  reduce = 0%, Cumulative CPU 9.35 sec
  2014-12-12 13:37:20,910 Stage-1 map = 65%,  reduce = 0%, Cumulative CPU 11.98 sec
  2014-12-12 13:37:22,296 Stage-1 map = 97%,  reduce = 0%, Cumulative CPU 15.31 sec
  2014-12-12 13:37:23,681 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 15.58 sec
  2014-12-12 13:37:25,069 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 15.58 sec
  ..........
  Job 0: Map: 6   Cumulative CPU: 15.58 sec   HDFS Read: 4532 HDFS Write: 496 SUCCESS
  Total MapReduce CPU Time Spent: 15 seconds 580 msec
  OK
  truck GM Silverado2500 4WheelDrive           blue          31115.548828125 68
  truck GM Silverado2500 4WheelDrive-4cylinder blue-on-green 31114.91015625  17
  truck GM Silverado2500 AllWheelDrive         white         31115.275390625 36
  truck GM Silverado2500 AllWheelDrive         yellow        31114.796875    73
  truck GM Silverado1500 4WheelDrive           black         31114.98046875  64
  truck GM Silverado1500 4WheelDrive           green         31115.078125    38
  truck GM Silverado1500 RearWheelDrive        blue          31115.640625    55
  truck GM Silverado1500 RearWheelDrive        white         31115.517578125 37
  ..........
  Time taken: 83.589 seconds, Fetched: 8 row(s)

Copyright (c) 2011, 2015 Oracle and/or its affiliates. All rights reserved.