
Oracle® Loader for Hadoop Java API Reference for Linux
Release 1.1

E20858-03


oracle.hadoop.loader.examples
Class CSVInputFormat

java.lang.Object
  extended by org.apache.hadoop.mapreduce.InputFormat<org.apache.avro.generic.IndexedRecord,org.apache.hadoop.io.NullWritable>
      extended by oracle.hadoop.loader.examples.CSVInputFormat


public class CSVInputFormat
extends org.apache.hadoop.mapreduce.InputFormat<org.apache.avro.generic.IndexedRecord,org.apache.hadoop.io.NullWritable>

This is a simple InputFormat example for CSV files. It uses TextInputFormat to break the input file(s) into lines, then breaks each line into fields using a comma (,) separator, and places the fields into an Avro IndexedRecord.

CSVInputFormat uses the following simple Avro schema:

 {
    "type" : "record",
    "name" : "All_string_schema", 
    "fields" : [
        {"name" : "F0", "type" : "string"},
        {"name" : "F1", "type" : "string"}, 
        ...]
 }
 

Note that:

* CSVInputFormat cannot extend TextInputFormat, because TextInputFormat.createRecordReader() returns RecordReader<LongWritable,Text>, while createRecordReader(org.apache.hadoop.mapreduce.InputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext) must return RecordReader<IndexedRecord,NullWritable>.

* For simplicity, this implementation does not try to cache and reuse a TextInputFormat instance; the methods getSplits(org.apache.hadoop.mapreduce.JobContext) and createRecordReader(org.apache.hadoop.mapreduce.InputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext) are called on different InputFormat instances anyway.
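
The following is a minimal driver sketch showing how CSVInputFormat could be plugged into a MapReduce job. The class name CSVDriver, the job name, and the input path argument are illustrative, and all Oracle Loader for Hadoop-specific mapper and output configuration is omitted:

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

 public class CSVDriver {
     public static void main(String[] args) throws Exception {
         Configuration conf = new Configuration();
         Job job = new Job(conf, "csv-example");   // job name is illustrative
         job.setJarByClass(CSVDriver.class);

         // Present each CSV line as an <IndexedRecord, NullWritable> pair
         job.setInputFormatClass(oracle.hadoop.loader.examples.CSVInputFormat.class);
         FileInputFormat.addInputPath(job, new Path(args[0]));  // illustrative input path

         // ... mapper, reducer, and output configuration would go here ...
         System.exit(job.waitForCompletion(true) ? 0 : 1);
     }
 }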


Nested Class Summary
static class CSVInputFormat.CSVRecordReader
          The record reader parses the input data into key/value pairs which are read by OraLoaderMapper.

 

Constructor Summary
CSVInputFormat()
           

 

Method Summary
 org.apache.hadoop.mapreduce.RecordReader<org.apache.avro.generic.IndexedRecord,org.apache.hadoop.io.NullWritable> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context)
          Create a record reader for a given split.
static org.apache.avro.Schema generateSimpleAllStringSchema(int numFields)
          Generate an Avro Record schema for the CSV input record.
 java.util.List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context)
          Logically split the set of input files for the job.

 

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

 

Constructor Detail

CSVInputFormat

public CSVInputFormat()

Method Detail

getSplits

public java.util.List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context)
                                                                 throws java.io.IOException
Logically split the set of input files for the job.
Specified by:
getSplits in class org.apache.hadoop.mapreduce.InputFormat<org.apache.avro.generic.IndexedRecord,org.apache.hadoop.io.NullWritable>
Parameters:
context - the job context.
Returns:
a list of InputSplits for the job.
Throws:
java.io.IOException

createRecordReader

public org.apache.hadoop.mapreduce.RecordReader<org.apache.avro.generic.IndexedRecord,org.apache.hadoop.io.NullWritable> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
                                                                                                                                            org.apache.hadoop.mapreduce.TaskAttemptContext context)
                                                                                                                                     throws java.io.IOException,
                                                                                                                                            java.lang.InterruptedException
Create a record reader for a given split. The framework will call RecordReader.initialize(InputSplit, TaskAttemptContext) before the split is used.
Specified by:
createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<org.apache.avro.generic.IndexedRecord,org.apache.hadoop.io.NullWritable>
Parameters:
split - the split to be read
context - the information about the task
Returns:
a new record reader
Throws:
java.io.IOException
java.lang.InterruptedException
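
As a sketch of how the key/value pairs produced by this record reader could be consumed, the hypothetical mapper below (not part of this API; OraLoaderMapper is the actual consumer) reads each IndexedRecord key and joins its string fields:

 import java.io.IOException;

 import org.apache.avro.generic.IndexedRecord;
 import org.apache.hadoop.io.NullWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Mapper;

 // Hypothetical mapper: echoes the fields F0..Fn-1 of each input record
 public class FieldEchoMapper
         extends Mapper<IndexedRecord, NullWritable, Text, NullWritable> {
     @Override
     protected void map(IndexedRecord key, NullWritable value, Context context)
             throws IOException, InterruptedException {
         int numFields = key.getSchema().getFields().size();
         StringBuilder line = new StringBuilder();
         for (int i = 0; i < numFields; i++) {
             if (i > 0) line.append(',');
             line.append(key.get(i));   // field Fi, an Avro string
         }
         context.write(new Text(line.toString()), NullWritable.get());
     }
 }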

generateSimpleAllStringSchema

public static org.apache.avro.Schema generateSimpleAllStringSchema(int numFields)
Generate an Avro Record schema for the CSV input record.
Parameters:
numFields - the number of fields for the schema
Returns:
a simple Avro schema:
 {
    "type" : "record",
    "name" : "All_string_schema", 
    "fields" : [
        {"name" : "F0", "type" : "string"},
        {"name" : "F1", "type" : "string"}, 
        ...]
 }
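
For example, the following standalone snippet (class name hypothetical) generates a three-field schema with this method and prints its JSON form:

 import org.apache.avro.Schema;

 public class SchemaDemo {
     public static void main(String[] args) {
         // Fields are named F0, F1, F2, all of Avro type "string"
         Schema schema =
             oracle.hadoop.loader.examples.CSVInputFormat.generateSimpleAllStringSchema(3);
         System.out.println(schema.toString(true));   // pretty-printed JSON
     }
 }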
 



Copyright © 2011, Oracle and/or its affiliates. All rights reserved.