Class V1V2TableUtil


  • public final class V1V2TableUtil
    extends Object
    Utility class that provides static convenience methods for managing the interactions between version 1 and version 2 (YARN) MapReduce classes.

    Note on Logging

    Two loggers are currently employed by this class:
    • One logger based on Log4j version 1, accessed via the org.apache.commons.logging wrapper.
    • One logger based on the Log4j2 API.
    Two loggers are necessary because Hive2 employs the Log4j2 logging mechanism, whereas logging in Big Data SQL 4.0 is still based on Log4j version 1. As a result, to log output from this class when executing a Hive query, the Log4j2-based logger specified by this class must be added to the Hive2 logging configuration file. Conversely, to log output from this class when executing a Big Data SQL query, the Log4j v1 logger must be added to the Big Data SQL logging configuration file. In the future, when Big Data SQL changes its logging mechanism from Log4j v1 to Log4j2, this class should be changed to employ only the Log4j2-based logger.
    • Method Detail

      • getSplitMap

        public static Map<TableHiveInputSplit,TableInputSplit> getSplitMap(JobConf jobConf,
                                                                           TableHiveInputSplit inputSplit,
                                                                           int queryBy,
                                                                           String whereClause,
                                                                           Integer shardKeyPartitionId,
                                                                           TableInputFormatBase.TopologyLocatorWrapper topologyLocator)
                                                                    throws IOException
        For the current Hive query, returns the singleton Map that maps each version 1 InputSplit for the query to its corresponding version 2 InputSplit. If the call to this method is the first call after the query has been entered on the command line and the input info for the job has been reset (using resetInputJobInfoForNewQuery), this method constructs and populates the Map to return; otherwise, it returns the previously constructed Map.

        Implementation Note: when the getInputFormat method from this class is called to retrieve the TableInputFormat instance, only the VERY FIRST call to getInputFormat constructs an instance of TableInputFormat; all additional calls return the original instance created by that first call. More importantly, in addition to constructing a TableInputFormat instance, that first call to getInputFormat also constructs and populates the Map returned by this method, which it does by calling the getSplits method on the newly created TableInputFormat instance.

        Because the first call to the getInputFormat method of this class has already called TableInputFormat.getSplits and placed the retrieved splits in the Map to return here, additional calls to TableInputFormat.getSplits are not only unnecessary but should be avoided: any call to TableInputFormat.getSplits results in remote calls to the KVStore, which can be very costly. In particular, one should NEVER make a call such as getInputFormat().getSplits(), because such a call may result in two successive calls to TableInputFormat.getSplits. To avoid that situation, this method calls only getInputFormat (not getSplits()) to construct and populate the Map to return.
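
        The caching behavior described in the implementation note can be illustrated with a minimal sketch. It is not the actual implementation: SketchInputFormat and SplitCacheSketch are hypothetical stand-ins, splits are modeled as plain strings, and the "v1:" key prefix is purely illustrative. The sketch only shows that the first getInputFormat call builds both the InputFormat and the Map, and that getSplitMap never calls getSplits itself.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for TableInputFormat; it only counts getSplits calls.
final class SketchInputFormat {
    static int getSplitsCalls = 0;

    List<String> getSplits() {
        getSplitsCalls++; // in the real class, this is where the costly remote calls occur
        return List.of("split-0", "split-1");
    }
}

// Hypothetical sketch of the build-once, return-cached pattern.
// Thread safety is deliberately ignored here.
final class SplitCacheSketch {
    private static SketchInputFormat inputFormat;    // built once per query
    private static Map<String, String> v1ToV2Splits; // built once per query

    private SplitCacheSketch() { }

    static SketchInputFormat getInputFormat() {
        if (inputFormat == null) {
            inputFormat = new SketchInputFormat();
            // The one and only getSplits call for the current query.
            Map<String, String> map = new LinkedHashMap<>();
            for (String v2Split : inputFormat.getSplits()) {
                map.put("v1:" + v2Split, v2Split);
            }
            v1ToV2Splits = map;
        }
        return inputFormat;
    }

    static Map<String, String> getSplitMap() {
        getInputFormat(); // populates the Map on the first call; no getSplits here
        return v1ToV2Splits;
    }

    static void reset() {
        inputFormat = null;
        v1ToV2Splits = null;
    }
}
```

        In this sketch, calling SplitCacheSketch.getSplitMap() any number of times between resets leaves SketchInputFormat.getSplitsCalls at 1, whereas a call such as SplitCacheSketch.getInputFormat().getSplits() would increment it again.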

        Throws:
        IOException
      • getInputFormat

        public static InputFormat<PrimaryKey,Row> getInputFormat(JobConf jobConf,
                                                                 TableHiveInputSplit inputSplit,
                                                                 int queryBy,
                                                                 String whereClause,
                                                                 Integer shardKeyPartitionId,
                                                                 TableInputFormatBase.TopologyLocatorWrapper topologyLocator)
                                                          throws IOException
        For the current Hive query, constructs and returns a YARN-based InputFormat class that will be used when processing the query. This method also constructs and populates a singleton Map whose elements are key/value pairs in which each key is a version 1 split for the returned InputFormat, and each value is the key's corresponding version 2 split. Note that both the InputFormat and the Map are constructed only on the first call to this method for the given query. On all subsequent calls, the original objects are returned, until the resetInputJobInfoForNewQuery method from this utility is called.
        Throws:
        IOException
      • resetInputJobInfoForNewQuery

        public static void resetInputJobInfoForNewQuery()
        Clears and resets the information related to the current job's input classes.

        This method must be called before each new query is processed, to reset the splits as well as the InputFormats participating in the job. Note that the Hive infrastructure and Big Data SQL each employ different code paths with respect to the initialization of the query state set in TableStorageHandlerBase. That is, for a Hive-only query, the path consists of the following calls: decomposePredicate followed by configureJobProperties; whereas for a Big Data SQL query, the code path consists of: configureJobProperties followed by decomposePredicate. Because the two paths do not begin with the same call, the reset is not performed at the start of a query. Instead, this method must be invoked after processing of the current query has completed; for example, in the close method of the record reader.
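
        A minimal sketch of the reset-on-close pattern suggested above follows. QueryStateSketch and RecordReaderSketch are hypothetical stand-ins (not the actual V1V2TableUtil or record reader classes), and placing the reset in a finally block is an assumption of this sketch rather than something the description above prescribes.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical stand-in for the per-query state held by the utility class.
final class QueryStateSketch {
    static Object cachedInputFormat; // null until the first getInputFormat call

    private QueryStateSketch() { }

    static Object getInputFormat() {
        if (cachedInputFormat == null) {
            cachedInputFormat = new Object(); // built once per query
        }
        return cachedInputFormat;
    }

    static void resetInputJobInfoForNewQuery() {
        cachedInputFormat = null; // the next query rebuilds its own state
    }
}

// Hypothetical record reader wrapper that resets the query state on close.
final class RecordReaderSketch implements Closeable {
    private final Closeable underlying;

    RecordReaderSketch(Closeable underlying) {
        this.underlying = underlying;
    }

    @Override
    public void close() throws IOException {
        try {
            underlying.close();
        } finally {
            // Assumption: reset even if closing the underlying reader fails,
            // so a failed query cannot leak its state into the next one.
            QueryStateSketch.resetInputJobInfoForNewQuery();
        }
    }
}
```

        With this arrangement, neither the Hive path nor the Big Data SQL path needs to know which of decomposePredicate or configureJobProperties runs first; the state is already clear when the next query begins.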