
Oracle® Loader for Hadoop Java API Reference for Linux
Release 1.1

E20858-03


oracle.hadoop.loader.examples
Class CSVInputFormat

java.lang.Object
  extended by org.apache.hadoop.mapreduce.InputFormat<org.apache.avro.generic.IndexedRecord,org.apache.hadoop.io.NullWritable>
      extended by oracle.hadoop.loader.examples.CSVInputFormat


public class CSVInputFormat
extends org.apache.hadoop.mapreduce.InputFormat<org.apache.avro.generic.IndexedRecord,org.apache.hadoop.io.NullWritable>

This is a simple InputFormat example for CSV files. It uses TextInputFormat to break the input file(s) into lines, then breaks each line into fields using a comma (,) separator, and places the fields into an Avro IndexedRecord.

CSVInputFormat uses the following simple Avro schema:

 {
    "type" : "record",
    "name" : "All_string_schema", 
    "fields" : [
        {"name" : "F0", "type" : "string"},
        {"name" : "F1", "type" : "string"}, 
        ...]
 }
 

Note that:

* CSVInputFormat cannot extend TextInputFormat, because TextInputFormat.createRecordReader() returns RecordReader<LongWritable,Text>, while createRecordReader(org.apache.hadoop.mapreduce.InputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext) must return RecordReader<IndexedRecord,NullWritable>.

* For simplicity, this implementation does not try to cache and reuse a TextInputFormat instance; the methods getSplits(org.apache.hadoop.mapreduce.JobContext) and createRecordReader(org.apache.hadoop.mapreduce.InputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext) are called on different InputFormat instances anyway.
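
The following is a minimal driver sketch showing how CSVInputFormat could be plugged into a MapReduce job. The class name CSVDriver, the job name, and the input path argument are illustrative, and all Oracle Loader for Hadoop-specific mapper and output configuration is omitted:

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

 public class CSVDriver {
     public static void main(String[] args) throws Exception {
         Configuration conf = new Configuration();
         Job job = new Job(conf, "csv-example");   // job name is illustrative
         job.setJarByClass(CSVDriver.class);

         // Present each CSV line as an <IndexedRecord, NullWritable> pair
         job.setInputFormatClass(oracle.hadoop.loader.examples.CSVInputFormat.class);
         FileInputFormat.addInputPath(job, new Path(args[0]));  // illustrative input path

         // ... mapper, reducer, and output configuration would go here ...
         System.exit(job.waitForCompletion(true) ? 0 : 1);
     }
 }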


Nested Class Summary
static class CSVInputFormat.CSVRecordReader
          The record reader parses the input data into key/value pairs which are read by OraLoaderMapper.

 

Constructor Summary
CSVInputFormat()
           

 

Method Summary
 org.apache.hadoop.mapreduce.RecordReader<org.apache.avro.generic.IndexedRecord,org.apache.hadoop.io.NullWritable> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context)
          Create a record reader for a given split.
static org.apache.avro.Schema generateSimpleAllStringSchema(int numFields)
          Generate an Avro Record schema for the CSV input record.
 java.util.List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context)
          Logically split the set of input files for the job.

 

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

 

Constructor Detail

CSVInputFormat

public CSVInputFormat()

Method Detail

getSplits

public java.util.List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context)
                                                                 throws java.io.IOException
Logically split the set of input files for the job.
Specified by:
getSplits in class org.apache.hadoop.mapreduce.InputFormat<org.apache.avro.generic.IndexedRecord,org.apache.hadoop.io.NullWritable>
Parameters:
context - the job context.
Returns:
a list of InputSplits for the job.
Throws:
java.io.IOException

createRecordReader

public org.apache.hadoop.mapreduce.RecordReader<org.apache.avro.generic.IndexedRecord,org.apache.hadoop.io.NullWritable> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
                                                                                                                                            org.apache.hadoop.mapreduce.TaskAttemptContext context)
                                                                                                                                     throws java.io.IOException,
                                                                                                                                            java.lang.InterruptedException
Create a record reader for a given split. The framework will call RecordReader.initialize(InputSplit, TaskAttemptContext) before the split is used.
Specified by:
createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<org.apache.avro.generic.IndexedRecord,org.apache.hadoop.io.NullWritable>
Parameters:
split - the split to be read
context - the information about the task
Returns:
a new record reader
Throws:
java.io.IOException
java.lang.InterruptedException
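
As a sketch of how the key/value pairs produced by this record reader could be consumed, the hypothetical mapper below (not part of this API; OraLoaderMapper is the actual consumer) reads each IndexedRecord key and joins its string fields:

 import java.io.IOException;

 import org.apache.avro.generic.IndexedRecord;
 import org.apache.hadoop.io.NullWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Mapper;

 // Hypothetical mapper: echoes the fields F0..Fn-1 of each input record
 public class FieldEchoMapper
         extends Mapper<IndexedRecord, NullWritable, Text, NullWritable> {
     @Override
     protected void map(IndexedRecord key, NullWritable value, Context context)
             throws IOException, InterruptedException {
         int numFields = key.getSchema().getFields().size();
         StringBuilder line = new StringBuilder();
         for (int i = 0; i < numFields; i++) {
             if (i > 0) line.append(',');
             line.append(key.get(i));   // field Fi, an Avro string
         }
         context.write(new Text(line.toString()), NullWritable.get());
     }
 }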

generateSimpleAllStringSchema

public static org.apache.avro.Schema generateSimpleAllStringSchema(int numFields)
Generate an Avro Record schema for the CSV input record.
Parameters:
numFields - the number of fields for the schema
Returns:
a simple Avro schema:
 {
    "type" : "record",
    "name" : "All_string_schema", 
    "fields" : [
        {"name" : "F0", "type" : "string"},
        {"name" : "F1", "type" : "string"}, 
        ...]
 }
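
For example, the following standalone snippet (class name hypothetical) generates a three-field schema with this method and prints its JSON form:

 import org.apache.avro.Schema;

 public class SchemaDemo {
     public static void main(String[] args) {
         // Fields are named F0, F1, F2, all of Avro type "string"
         Schema schema =
             oracle.hadoop.loader.examples.CSVInputFormat.generateSimpleAllStringSchema(3);
         System.out.println(schema.toString(true));   // pretty-printed JSON
     }
 }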
 



Copyright © 2011, Oracle and/or its affiliates. All rights reserved.