See: Description
Oracle NoSQL Database Hadoop integration classes
,
which support running Hadoop MapReduce jobs against data stored in an
Oracle NoSQL Database table, it was natural to also provide new interfaces
and classes which support running Hive queries against such table data
(since a typical Hive query generally results in the execution of a
MapReduce job). In addition to describing the core interfaces and classes
involved in running a Hive query against data from a table located in a
given Oracle NoSQL Database store (a KVStore), the information
presented below also walks through the steps to take to execute a
given set of basic Hive queries against example table data contained in
a KVStore.
Note that this package does not contain any example Java source code. This is because a Hive query is expressed in the Hive Query Language (HQL), not Java. And the supporting example Java source and scripts on which this example depends is already provided with the Hadoop/Table API Example.
Apache Hive
and its programming model; that is, become familiar with how to write
and execute a Hive query.
Apache Hadoop
;
specifically, become familiar with how Hive and Hadoop interact.
Apache Hive
is a data
warehouse infrastructure built on top of
Apache Hadoop
,
that facilitates querying datasets residing in distributed file systems or data stores such as
Hadoop HDFS
or
Amazon S3
.
Additionally, Hive also provides a pluggable programming model that allows
you to specify custom interfaces and classes that support querying
data residing in data sources other than HDFS and S3; in particular,
data written to an Oracle NoSQL Database table.
To access and analyze data stored in these data sources, Hive provides a
mechanism to project structure onto the data and query the data using
a SQL-like language called the Hive Query Language, or
HQL. Depending on the complexity of a given Hive query,
the Hive infrastructure may construct and deploy a set of MapReduce
jobs to retrieve and process the data or, when possible, it may simply
satisfy the query via the metadata stored in the Hive metastore
(Derby or MySQL).
In addition to the Hive infrastructure itself, Hive also provides a convenient client-side command line interface (the Hive CLI); which allows you to interact with the Hive infrastructure to create a Hive external table and then map it to the data located in a source like those described above.
As indicated above, a new set of interfaces and classes that satisfy the
Hive programming model have been provided which support running Hive queries
against table data contained in a KVStore. These new classes are located in the
oracle.kv.hadoop.hive.table
package, and consist of the following Hive and Hadoop types:
org.apache.hadoop.hive.ql.metadata.HiveStorageHandler
,
which is the mechanism (pluggable interface) to use to specify to the Hive
infrastructure where the data to process is located, as well as how to process
that data; and which consists of the following components:
org.apache.hadoop.mapred.InputFormat
,
which specifies how the associated MapReduce job reads its input data
(from an Oracle NoSQL Database table).
org.apache.hadoop.mapred.OutputFormat
,
which specifies how the associated MapReduce job writes its output.
org.apache.hadoop.hive.serde2.SerDe
,
which is used to serialize the table data that is retrieved and
sent to the Hadoop MapReduce job for processing, and to deserialize
the results from the Hadoop MapReduce job so those results can be
output and/or displayed.
org.apache.hadoop.mapred.RecordReader
,
which specifies how the mapped keys and values are located and retrieved
during any MapReduce processing performed as part of executing a Hive query.
org.apache.hadoop.mapred.InputSplit
,
which represents the data to be processed by an individual Mapper
that operates during the MapReduce processing performed as part of
executing a Hive query.
org.apache.hadoop.hive.ql.metadata.HiveStorageHandler
class provided in the Oracle NoSQL Database distribution that the Hive
infrastructure obtains access to a given KVStore and the desired table data
on which to run the desired Hive query.
oracle.kv.hadoop.hive.table.TableStorageHandler
oracle.kv.hadoop.hive.table.TableHiveInputFormat
oracle.kv.hadoop.hive.table.TableHiveInputSlit
oracle.kv.hadoop.hive.table.TableHiveRecordReader
oracle.kv.hadoop.hive.table.TableSerDe
org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector
interface; for example,
oracle.kv.hadoop.hive.table.TableIntObjectInspector
.
Note: YARN versus MapReduce version 1
It is important to note that currently, Hadoop deployments include
two versions of MapReduce. The first version (referred to as MRv1)
is the original version of MapReduce; and consists of interfaces and classes
from the Java package org.apache.hadoop.mapred
.
The newer version of MapReduce is referred to as YARN
(Yet Another Resource Negotiator) or, more generally,
MRv2; and resides in the package org.apache.hadoop.mapreduce
.
Unfortunately, for the following reasons, the
Table API Hive integration
classes must address the existence of both versions of MapReduce:
Oracle NoSQL Database Table API Hadoop integration
classes employ MRv2.
TableInputFormat
(which would be preferred), because of the incompatability between MRv1 and MRv2,
the Table API Hive integration class described above
(TableHiveInputFormat
)
actually subclasses the Hadoop MRv1 class
org.apache.hadoop.mapred.InputFormat
. Thus, to exploit and reuse the mechanisms
provided by the
Table API Hadoop integration
classes, the
Table API Hive integration
classes presented here internally create, manage, and delegate to subclasses of the
appropriate MRv2 based classes.
It is also important to note that because the
Table API Hadoop integration
classes do not currently support writing data from a MapReduce job into
a KVStore, the
Table API Hive integration
classes do not support queries that modify the contents of a table in a KVStore.
Thus, after satisfying the prerequisites
listed above, if not done so already, you should
create
and
populate
a table named vehicleTable. Once these initial steps are performed,
the example Hive queries described below can be run against the data stored in that table.
Before proceeding though, execute the Hive interactive CLI from the Hive client; which should provide access to the Hive CLI command prompt. To execute the Hive CLI, type the following command from the system command line of the Hive client:
> hivewhich should then present the following Hive CLI command prompt:
hive>At this point, Hive commands can be executed, as described below.
hive> CREATE EXTERNAL TABLE IF NOT EXISTS vehicleTable (TYPE STRING, MAKE STRING, MODEL STRING, CLASS STRING, COLOR STRING, PRICE DOUBLE, COUNT INT) STORED BY 'oracle.kv.hadoop.hive.table.TableStorageHandler' TBLPROPERTIES ("oracle.kv.kvstore" = "example-store", "oracle.kv.hosts" = "kv-host-1:13230", "oracle.kv.tableName" = "vehicleTable", "oracle.kv.hadoop.hosts" = "dn-host-1,dn-host-2,dn-host-3");The command above creates a Hive table named vehicleTable with columns whose type is consistent with the corresponding fields of the Oracle NoSQL Database table created above; also named vehicleTable. Additionally, there are a number of things to note:
"oracle.kv.hosts"
property; as long as at least one valid administrative host (and port) is included.
"oracle.kv.hadoop.hosts"
property in the example command above:
List the contents of each row in the table
hive> select * from vehicleTable; OK auto Chrysler Imperial FrontWheelDrive white 20743.943359375 5 auto GM Impala 4WheelDrive black 20743.91015625 46 auto GM Impala FrontWheelDrive yellow 20743.5390625 28 truck Ford F250 AllWheelDrive blue 31115.759765625 47 ..........Note that unlike the more complicated example queries below, the query above does not result in the execution of a MapReduce job. This is because there is enough metadata in the Hive metastore to satisfy the query.
Count the total number of rows in the table
hive> select count(type) from vehicleTable; Launching Job 1 out of 1 .......... Hadoop job information for Stage-1: number of mappers: 6; number of reducers: 1 2014-12-12 12:04:18,403 Stage-1 map = 0%, reduce = 0% 2014-12-12 12:05:12,431 Stage-1 map = 7%, reduce = 0%, Cumulative CPU 2.26 sec 2014-12-12 12:05:13,816 Stage-1 map = 21%, reduce = 0%, Cumulative CPU 6.7 sec 2014-12-12 12:05:15,201 Stage-1 map = 30%, reduce = 0%, Cumulative CPU 6.87 sec 2014-12-12 12:05:16,594 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 14.16 sec 2014-12-12 12:05:17,980 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 14.16 sec 2014-12-12 12:05:19,364 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 14.16 sec 2014-12-12 12:05:20,754 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 15.24 sec 2014-12-12 12:05:22,140 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 15.24 sec .......... Job 0: Map: 6 Reduce: 1 Cumulative CPU: 15.24 sec HDFS Read: 4532 HDFS Write: 3 SUCCESS Total MapReduce CPU Time Spent: 15 seconds 240 msec OK 79 Time taken: 89.359 seconds, Fetched: 1 row(s)Find the vehicle that has the lowest price
hive> select min(price) from vehicleTable; Launching Job 1 out of 1 .......... Hadoop job information for Stage-1: number of mappers: 6; number of reducers: 1 2014-12-12 12:11:10,924 Stage-1 map = 0%, reduce = 0% 2014-12-12 12:12:06,213 Stage-1 map = 21%, reduce = 0%, Cumulative CPU 6.77 sec 2014-12-12 12:12:07,606 Stage-1 map = 21%, reduce = 0%, Cumulative CPU 6.77 sec 2014-12-12 12:12:09,076 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 14.16 sec 2014-12-12 12:12:10,464 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 14.3 sec 2014-12-12 12:12:11,849 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 14.3 sec 2014-12-12 12:12:13,238 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 14.3 sec 2014-12-12 12:12:14,629 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 15.38 sec 2014-12-12 12:12:16,031 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 15.38 sec .......... Job 0: Map: 6 Reduce: 1 Cumulative CPU: 15.38 sec HDFS Read: 4532 HDFS Write: 16 SUCCESS Total MapReduce CPU Time Spent: 15 seconds 380 msec OK 20743.244140625 Time taken: 89.615 seconds, Fetched: 1 row(s)Find all GM vehicles
hive> select * from vehicleTable where make LIKE "%GM%"; Launching Job 1 out of 1 .......... Hadoop job information for Stage-1: number of mappers: 6; number of reducers: 0 2014-12-12 12:19:24,269 Stage-1 map = 0%, reduce = 0% 2014-12-12 12:20:18,239 Stage-1 map = 9%, reduce = 0%, Cumulative CPU 2.43 sec 2014-12-12 12:20:19,622 Stage-1 map = 26%, reduce = 0%, Cumulative CPU 4.81 sec 2014-12-12 12:20:21,006 Stage-1 map = 26%, reduce = 0%, Cumulative CPU 7.14 sec 2014-12-12 12:20:22,395 Stage-1 map = 79%, reduce = 0%, Cumulative CPU 13.09 sec 2014-12-12 12:20:23,777 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 16.06 sec 2014-12-12 12:20:25,162 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 16.06 sec .......... Job 0: Map: 6 Cumulative CPU: 16.06 sec HDFS Read: 4532 HDFS Write: 1491 SUCCESS Total MapReduce CPU Time Spent: 16 seconds 60 msec OK suv GM Equinox 4WheelDrive yellow 41486.78125 37 truck GM Sierra 4WheelDrive black 31115.224609375 87 auto GM Corvette FrontWheelDrive yellow 20743.84375 7 auto GM Impala 4WheelDrive black 20743.91015625 46 ..........Find all vehicles that are Sierra model trucks
hive> select * from vehicleTable where model LIKE "%Sierra%"; Launching Job 1 out of 1 .......... Hadoop job information for Stage-1: number of mappers: 6; number of reducers: 0 2014-12-12 13:36:24,284 Stage-1 map = 0%, reduce = 0% 2014-12-12 13:37:19,528 Stage-1 map = 39%, reduce = 0%, Cumulative CPU 9.35 sec 2014-12-12 13:37:20,910 Stage-1 map = 65%, reduce = 0%, Cumulative CPU 11.98 sec 2014-12-12 13:37:22,296 Stage-1 map = 97%, reduce = 0%, Cumulative CPU 15.31 sec 2014-12-12 13:37:23,681 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 15.58 sec 2014-12-12 13:37:25,069 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 15.58 sec .......... Job 0: Map: 6 Cumulative CPU: 15.58 sec HDFS Read: 4532 HDFS Write: 496 SUCCESS Total MapReduce CPU Time Spent: 15 seconds 580 msec OK truck GM Silverado2500 4WheelDrive blue 31115.548828125 68 truck GM Silverado2500 4WheelDrive-4cylinder blue-on-green 31114.91015625 17 truck GM Silverado2500 AllWheelDrive white 31115.275390625 36 truck GM Silverado2500 AllWheelDrive yellow 31114.796875 73 truck GM Silverado1500 4WheelDrive black 31114.98046875 64 truck GM Silverado1500 4WheelDrive green 31115.078125 38 truck GM Silverado1500 RearWheelDrive blue 31115.640625 55 truck GM Silverado1500 RearWheelDrive white 31115.517578125 37 .......... Time taken: 83.589 seconds, Fetched: 8 row(s)
Copyright (c) 2011, 2015 Oracle and/or its affiliates. All rights reserved.