A Brief Primer on Apache Hadoop

Apache Hadoop can be thought of as consisting of two primary components:

  • The Hadoop Distributed File System (referred to as HDFS).
  • The MapReduce programming model, which includes a Map Phase (a mapping step and a shuffle-and-sort step that together perform filtering and sorting) followed by a Reduce Phase that performs a summary operation on the mapped and sorted results of the Map Phase; a minimal word-count sketch follows this list.

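To make the two phases concrete, the classic word-count example below sketches the two pieces of application code a MapReduce job supplies, using the standard org.apache.hadoop.mapreduce API; the class names and the whitespace tokenization are illustrative choices, not part of Hadoop itself.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map Phase: emit a (word, 1) pair for every word in the input.
    public class WordCountMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce Phase: by the time reduce() is called, the
    // shuffle-and-sort step has grouped every count emitted for a
    // given word; summing them yields the summary result.
    class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
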
The various Hadoop distributions that are available (for example, Cloudera) provide an infrastructure for orchestrating the processing performed in a MapReduce job. This includes marshaling the distributed servers that execute job tasks in parallel, managing all communication and data transfers between the parts of the system, and providing redundancy and fault tolerance.
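
The division of labor is easiest to see in a job driver. The hypothetical driver below wires the word-count classes sketched above into a job through the standard org.apache.hadoop.mapreduce.Job API; everything after submission (scheduling tasks across the cluster, moving intermediate data, re-running failed tasks) is handled by the infrastructure. The input and output paths are assumptions for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);

            // The application supplies only the per-record logic
            // and the key/value types it produces.
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Hypothetical HDFS input and output locations.
            FileInputFormat.addInputPath(job, new Path("/user/example/in"));
            FileOutputFormat.setOutputPath(job, new Path("/user/example/out"));

            // Submitting the job hands control to the Hadoop
            // infrastructure, which parallelizes, shuffles, and
            // retries as needed.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }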

The Hadoop infrastructure also provides several interactive tools, such as a command line interface (the Hadoop CLI), that provide access to the data stored in HDFS. But the typical way application developers read, write, and process data stored in HDFS is via MapReduce jobs, which are programs that adhere to the Hadoop MapReduce programming model. For more detailed information on HDFS and MapReduce, see the Hadoop MapReduce tutorial.
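
Although MapReduce is the typical route, data in HDFS can also be read directly from application code. The fragment below is a minimal sketch using the standard org.apache.hadoop.fs.FileSystem API; the file path is a hypothetical example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS and related settings from the
            // Hadoop configuration files on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Stream a file out of HDFS to stdout (hypothetical path).
            Path file = new Path("/user/example/in/part-00000");
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }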

As indicated earlier, with the introduction of the Oracle NoSQL Table API, Oracle NoSQL Database provides a set of interfaces and classes that satisfy the Hadoop MapReduce programming model, so that you can write MapReduce jobs that process data written to a table in an Oracle NoSQL Database store. These classes are located in the oracle.kv.hadoop.table package and consist of the following types:

  • A subclass of the Hadoop class org.apache.hadoop.mapreduce.InputFormat, which specifies how the associated MapReduce job uses a Hadoop RecordReader to read its input data, and which splits the input data into logical sections, each referred to as an InputSplit.
  • A subclass of the Hadoop class org.apache.hadoop.mapreduce.OutputFormat, which specifies how the associated MapReduce job uses a Hadoop RecordWriter to write its output data.
  • A subclass of the Hadoop class org.apache.hadoop.mapreduce.RecordReader, which specifies how the mapped keys and values are located and retrieved during MapReduce processing.
  • A subclass of the Hadoop class org.apache.hadoop.mapreduce.InputSplit, which represents the data to be processed by an individual MapReduce Mapper; there is one Mapper per InputSplit (see the Mapper sketch after this list).
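
As a sketch of how those types surface in application code, the Mapper below assumes the Oracle-provided RecordReader delivers each table row to the map step as an oracle.kv.table.PrimaryKey key paired with an oracle.kv.table.Row value; the table and its lastName field are hypothetical examples.

    import java.io.IOException;

    import oracle.kv.table.PrimaryKey;
    import oracle.kv.table.Row;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Counts table rows per value of a hypothetical "lastName"
    // field. The input key/value types are produced by the
    // Oracle-provided RecordReader rather than read from HDFS.
    public class TableRowMapper
            extends Mapper<PrimaryKey, Row, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        public void map(PrimaryKey key, Row row, Context context)
                throws IOException, InterruptedException {
            // "lastName" is an assumed String field of the example
            // table, retrieved via the Oracle NoSQL Table API.
            String lastName = row.get("lastName").asString().get();
            context.write(new Text(lastName), ONE);
        }
    }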

For the complete list of classes, see Apache Hadoop API.

As described in the following sections, it is through the specific implementation of the InputFormat class provided by Oracle NoSQL Database that the Hadoop MapReduce infrastructure obtains access to a given store and the data written to the store.
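
As a preview, the hypothetical driver fragment below shows that wiring: plugging the Oracle-provided InputFormat into a job via the standard job.setInputFormatClass() call, and reusing the TableRowMapper sketched earlier. The TableInputFormat class name matches the oracle.kv.hadoop.table package described above, but the static configuration calls that identify the store and table are assumptions for illustration; consult the Oracle NoSQL Database javadoc for the actual configuration mechanism.

    import oracle.kv.hadoop.table.TableInputFormat;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TableJobDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "table row count");
            job.setJarByClass(TableJobDriver.class);

            // Plugging in the Oracle-provided InputFormat is what
            // gives the MapReduce infrastructure access to the
            // store's table data instead of files in HDFS.
            job.setInputFormatClass(TableInputFormat.class);

            // NOTE: these setter names and values are assumptions
            // for illustration; see the oracle.kv.hadoop.table
            // javadoc for the actual way to identify the store,
            // its helper hosts, and the table to read.
            TableInputFormat.setKVStoreName("kvstore");
            TableInputFormat.setKVHelperHosts(new String[] {"kvhost:5000"});
            TableInputFormat.setTableName("exampleTable");

            job.setMapperClass(TableRowMapper.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileOutputFormat.setOutputPath(job,
                new Path("/user/example/table-out"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }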