Class TableHiveInputFormat<K,V>

java.lang.Object
oracle.kv.hadoop.hive.table.TableHiveInputFormat<K,V>
All Implemented Interfaces:
InputFormat<K,V>

public class TableHiveInputFormat<K,V> extends Object implements InputFormat<K,V>
A Hadoop MapReduce version 1 InputFormat class for reading data from an Oracle NoSQL Database when processing a Hive query against data written to that database using the Table API.

Note that whereas this class implements the version 1 InputFormat interface, in order to exploit and reuse the mechanisms provided by the Hadoop integration classes (in package oracle.kv.hadoop.table), this class also creates and manages an instance of a version 2 InputFormat.

Note on Logging

Two loggers are currently employed by this class:

  • One logger based on Log4j version 1, accessed via the org.apache.commons.logging wrapper.
  • One logger based on the Log4j2 API.
Two loggers are necessary because Hive2 employs the Log4j2 logging mechanism, whereas logging in Big Data SQL 4.0 is still based on Log4j version 1. As a result, to log output from this class when executing a Hive query, the Log4j2-based logger specified by this class must be added to the Hive2 logging configuration file. On the other hand, to log output from this class when executing a Big Data SQL query, the Log4j v1 logger must be added to the Big Data SQL logging configuration file. In the future, when Big Data SQL changes its logging mechanism from Log4j v1 to Log4j2, this class should be changed to employ only the Log4j2-based logger.
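For illustration, the following is a minimal sketch of the two-logger pattern described above; the class name and logger field names are hypothetical, not the actual declarations in this class.

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.logging.log4j.LogManager;
    import org.apache.logging.log4j.Logger;

    public class TwoLoggerExample {

        // Log4j v1 logger, accessed via the commons-logging wrapper;
        // enabled through the Big Data SQL (Log4j v1) logging
        // configuration file.
        private static final Log LOG1 =
            LogFactory.getLog(TwoLoggerExample.class);

        // Log4j2 logger; enabled through the Hive2 (Log4j2) logging
        // configuration file.
        private static final Logger LOG2 =
            LogManager.getLogger(TwoLoggerExample.class);

        void logBoth(String msg) {
            LOG1.debug(msg); // visible when the Log4j v1 config enables it
            LOG2.debug(msg); // visible when the Log4j2 config enables it
        }
    }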

  • Constructor Details

    • TableHiveInputFormat

      public TableHiveInputFormat()
  • Method Details

    • getRecordReader

      public RecordReader<K,V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException
      Returns the RecordReader for the given InputSplit.

      Note that the RecordReader that is returned is based on version 1 of MapReduce, but wraps and delegates to a YARN-based (MapReduce version 2) RecordReader. This is done because the RecordReader provided for Hadoop integration is YARN-based, whereas the Hive infrastructure requires a version 1 RecordReader.

      Additionally, note that when query execution occurs via a MapReduce job, this method is invoked by backend processes running on each DataNode in the Hadoop cluster, with the splits distributed among the DataNodes. When the query is simple enough to be satisfied by the Hive infrastructure from data in the metastore, that is, without MapReduce, this method is invoked by the frontend Hive processes, once for each split. For example, if there are 6 splits and the query is executed via a MapReduce job employing only 3 DataNodes, then each DataNode invokes this method twice, once for each of its 2 splits; if MapReduce is not employed, the Hive frontend invokes this method 6 separate times, once per split.

      In either case, by the time this method is invoked, the given version 1 split already encapsulates a fully populated version 2 split, and the state of that encapsulated version 2 split can be used to construct the necessary version 1 RecordReader, which wraps a fully functional version 2 RecordReader, as required by YARN; the sketch below illustrates this wrapping.
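      The following is a minimal sketch of how a version 1 RecordReader can wrap and delegate to a version 2 RecordReader; the class name and details are illustrative assumptions, not the actual implementation used by this class.

      import java.io.IOException;
      import org.apache.hadoop.mapred.RecordReader;

      class V1DelegatingRecordReader<K, V> implements RecordReader<K, V> {

          // The wrapped YARN-based (version 2) reader that does the work.
          private final org.apache.hadoop.mapreduce.RecordReader<K, V> v2Reader;
          private long pos;

          V1DelegatingRecordReader(
                  org.apache.hadoop.mapreduce.RecordReader<K, V> v2Reader) {
              this.v2Reader = v2Reader;
          }

          @Override
          public boolean next(K key, V value) throws IOException {
              try {
                  // Advance the v2 reader; copying its current key/value
                  // into the caller-supplied v1 holders is type-specific
                  // and omitted in this sketch.
                  if (!v2Reader.nextKeyValue()) {
                      return false;
                  }
                  pos++;
                  return true;
              } catch (InterruptedException e) {
                  throw new IOException(e);
              }
          }

          @Override
          public K createKey() {
              return null; // a real reader returns a reusable key instance
          }

          @Override
          public V createValue() {
              return null; // a real reader returns a reusable value instance
          }

          @Override
          public long getPos() {
              return pos; // number of records consumed so far
          }

          @Override
          public float getProgress() throws IOException {
              try {
                  return v2Reader.getProgress();
              } catch (InterruptedException e) {
                  throw new IOException(e);
              }
          }

          @Override
          public void close() throws IOException {
              v2Reader.close();
          }
      }

      This inversion, a version 1 facade over a version 2 reader, lets the Hive runtime drive iteration through the version 1 API while all actual data access remains in the YARN-based integration classes.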

      Specified by:
      getRecordReader in interface InputFormat<K,V>
      Throws:
      IOException
    • getSplits

      public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException
      Returns an array containing the input splits to be used to satisfy the current query.

      Implementation Note: this method first calls V1V2TableUtil.getInputFormat() which, in addition to constructing a TableInputFormat instance, also populates the splitMap. That splitMap is then retrieved, and its keySet is used to populate the array returned by this method; the sketch below illustrates this flow.
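      The following sketch illustrates the flow just described. The single-argument form of V1V2TableUtil.getInputFormat(), the getSplitMap() accessor, and the value type of the map are assumptions made for illustration, not the actual V1V2TableUtil API.

      import java.io.IOException;
      import java.util.Map;
      import java.util.Set;

      import org.apache.hadoop.mapred.InputSplit;
      import org.apache.hadoop.mapred.JobConf;

      public InputSplit[] getSplits(JobConf job, int numSplits)
              throws IOException {

          // Construct the TableInputFormat; per the note above, this call
          // also populates the splitMap as a side effect. The
          // single-argument form shown here is an assumption.
          V1V2TableUtil.getInputFormat(job);

          // Retrieve the populated splitMap; getSplitMap() and the value
          // type of the map are hypothetical stand-ins.
          Map<? extends InputSplit, ?> splitMap =
              V1V2TableUtil.getSplitMap(job);

          // The keySet of the splitMap supplies the version 1 splits that
          // are returned to the caller.
          Set<? extends InputSplit> v1Splits = splitMap.keySet();
          return v1Splits.toArray(new InputSplit[0]);
      }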

      Specified by:
      getSplits in interface InputFormat<K,V>
      Throws:
      IOException