public class TableHiveInputFormat<K,V> extends Object implements InputFormat<K,V>
Note that although this class implements the version 1 InputFormat interface, in order to exploit and reuse the mechanisms provided by the Hadoop integration classes (in package oracle.kv.hadoop.table), it also creates and manages an instance of a version 2 InputFormat.
Constructor and Description |
---|
`TableHiveInputFormat()` |
Modifier and Type | Method and Description |
---|---|
`RecordReader<K,V>` | `getRecordReader(InputSplit split, JobConf job, Reporter reporter)` Returns the RecordReader for the given InputSplit. |
`InputSplit[]` | `getSplits(JobConf job, int numSplits)` Returns an array containing the input splits for the given job. |
public RecordReader<K,V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException
Note that the RecordReader returned by this method is based on version 1 of MapReduce, but wraps and delegates to a YARN-based (MapReduce version 2) RecordReader. This is done because the RecordReader provided for Hadoop integration is YARN-based, whereas the Hive infrastructure requires a version 1 RecordReader.
Additionally, note that when query execution occurs via a MapReduce job, this method is invoked by backend processes running on each DataNode in the Hadoop cluster, to which the splits are distributed. When the query is simple enough to be executed by the Hive infrastructure from data in the metastore, that is, without MapReduce, this method is instead invoked by the frontend Hive processes, once for each split. For example, if there are 6 splits and the query is executed via a MapReduce job employing only 3 DataNodes, then each DataNode will invoke this method twice, once for each of its 2 splits. If MapReduce is not employed, the Hive frontend will invoke this method 6 separate times, once per split. In either case, by the time this method is invoked, the given version 1 split already encapsulates a fully populated version 2 split, and the state of that encapsulated version 2 split is used to construct the version 1 RecordReader that wraps the fully functional version 2 RecordReader required by YARN.
Specified by:
getRecordReader in interface InputFormat<K,V>
Throws:
IOException
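The wrap-and-delegate relationship described above can be sketched with plain Java. The interfaces below are minimal, hypothetical stand-ins for the real Hadoop classes (`org.apache.hadoop.mapred.RecordReader` and `org.apache.hadoop.mapreduce.RecordReader`), not the actual APIs; in particular, the real version 1 reader reuses mutable key/value objects rather than holder arrays.

```java
// Sketch of the "version 1 wraps version 2" RecordReader pattern.
// The interfaces are simplified stand-ins, NOT the real Hadoop APIs.
public class V1WrapsV2Sketch {

    // Stand-in for the MapReduce v2 (YARN) RecordReader: pull-style iteration.
    interface V2RecordReader<K, V> {
        boolean nextKeyValue() throws Exception;
        K getCurrentKey();
        V getCurrentValue();
        void close() throws Exception;
    }

    // Stand-in for the MapReduce v1 RecordReader (simplified: holder arrays
    // instead of the mutable Writable key/value objects the real API uses).
    interface V1RecordReader<K, V> {
        boolean next(K[] keyHolder, V[] valueHolder) throws Exception;
        void close() throws Exception;
    }

    // The wrapper: a v1 reader that delegates every call to the
    // encapsulated v2 reader.
    static <K, V> V1RecordReader<K, V> wrap(V2RecordReader<K, V> v2) {
        return new V1RecordReader<K, V>() {
            @Override
            public boolean next(K[] keyHolder, V[] valueHolder) throws Exception {
                if (!v2.nextKeyValue()) {
                    return false;           // underlying v2 reader exhausted
                }
                keyHolder[0] = v2.getCurrentKey();
                valueHolder[0] = v2.getCurrentValue();
                return true;
            }
            @Override
            public void close() throws Exception {
                v2.close();                 // delegate cleanup as well
            }
        };
    }

    public static void main(String[] args) throws Exception {
        // A toy v2 reader over two records, to drive the wrapper.
        V2RecordReader<String, Integer> v2 = new V2RecordReader<String, Integer>() {
            private final String[] keys = {"a", "b"};
            private int i = -1;
            public boolean nextKeyValue() { return ++i < keys.length; }
            public String getCurrentKey() { return keys[i]; }
            public Integer getCurrentValue() { return i; }
            public void close() {}
        };

        V1RecordReader<String, Integer> v1 = wrap(v2);
        String[] k = new String[1];
        Integer[] v = new Integer[1];
        while (v1.next(k, v)) {
            System.out.println(k[0] + "=" + v[0]);   // prints a=0 then b=1
        }
        v1.close();
    }
}
```

The point of the wrapper is that callers coded against the version 1 contract (Hive) can consume data produced by a version 2 (YARN) reader without either side changing.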
public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException
Implementation note: when V1V2TableUtil.getInputFormat() is called by this method to retrieve the TableInputFormat instance to use for a given query, only the very first call to V1V2TableUtil.getInputFormat() (after the query has been entered on the command line and the input info for the job has been reset) constructs an instance of TableInputFormat; all additional calls made while that query is executing return the instance created by that first call.
Note also that, in addition to constructing a TableInputFormat instance, that first call to V1V2TableUtil.getInputFormat() populates the splitMap, via a call to getSplits() on the newly created TableInputFormat instance. Because that first call has already invoked TableInputFormat.getSplits() and placed the retrieved splits in the splitMap, no further calls to TableInputFormat.getSplits() are necessary. Such calls should in fact be avoided, because each one makes additional, unnecessary remote calls to the KVStore. In particular, avoid constructs such as V1V2TableUtil.getInputFormat().getSplits(), which may result in two successive calls to TableInputFormat.getSplits(). Instead, retrieve and return the desired splits with the following two-step process:
1. Call V1V2TableUtil.getInputFormat(); when called repeatedly, this always returns the same instance of TableInputFormat.
2. Call V1V2TableUtil.getSplitMap(), then retrieve and return the desired splits from the returned map.
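The memoization behavior behind this two-step process can be sketched as follows. The class and method names only loosely mirror V1V2TableUtil and TableInputFormat; this is a simplified, hypothetical model, not the real implementation.

```java
// Sketch of the memoization pattern: the FIRST getInputFormat() call builds
// the input format and populates the split map by calling getSplits()
// exactly once; later calls return the same instance and the cached map.
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SplitMapUtilSketch {

    // Stand-in for TableInputFormat: counts getSplits() invocations, each of
    // which would mean remote calls to the KVStore in the real class.
    static class FakeTableInputFormat {
        int remoteCalls = 0;
        List<String> getSplits() {
            remoteCalls++;                       // expensive remote call
            return List.of("split-0", "split-1", "split-2");
        }
    }

    private static FakeTableInputFormat instance;
    private static Map<String, List<String>> splitMap;

    // First call constructs the instance AND populates the split map;
    // subsequent calls return the same instance without touching the store.
    static synchronized FakeTableInputFormat getInputFormat(String query) {
        if (instance == null) {
            instance = new FakeTableInputFormat();
            splitMap = new LinkedHashMap<>();
            splitMap.put(query, instance.getSplits()); // the ONLY getSplits() call
        }
        return instance;
    }

    static synchronized Map<String, List<String>> getSplitMap() {
        return splitMap;
    }

    public static void main(String[] args) {
        // Step 1: repeated calls return the same instance.
        FakeTableInputFormat a = getInputFormat("SELECT * FROM t");
        FakeTableInputFormat b = getInputFormat("SELECT * FROM t");
        System.out.println("same instance: " + (a == b));

        // Step 2: read splits from the cached map, NOT via getSplits() again.
        List<String> splits = getSplitMap().get("SELECT * FROM t");
        System.out.println("splits: " + splits.size()
                + ", remote calls: " + a.remoteCalls);
    }
}
```

Reading from the cached map in step 2 is what keeps the remote-call count at one, which is exactly why chaining `getInputFormat().getSplits()` is discouraged above.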
Specified by:
getSplits in interface InputFormat<K,V>
Throws:
IOException
Copyright (c) 2011, 2017 Oracle and/or its affiliates. All rights reserved.