Oracle® Loader for Hadoop Java API Reference for Linux, Release 2.2 (E41239-01)
java.lang.Object
  org.apache.hadoop.mapreduce.InputFormat<java.lang.Object,org.apache.avro.generic.IndexedRecord>
    oracle.hadoop.loader.examples.CSVInputFormat
public class CSVInputFormat extends org.apache.hadoop.mapreduce.InputFormat<java.lang.Object,org.apache.avro.generic.IndexedRecord>
This is a simple InputFormat example for CSV files. It uses TextInputFormat to break the input file(s) into lines, breaks each line into fields using a comma (,) separator, and places the fields into an Avro IndexedRecord.

CSVInputFormat uses the following simple Avro schema:

{ "type" : "record", "name" : "All_string_schema",
  "fields" : [ {"name" : "F0", "type" : "string"},
               {"name" : "F1", "type" : "string"}, ...] }
Note that CSVInputFormat cannot extend TextInputFormat, because TextInputFormat.createRecordReader() returns RecordReader<LongWritable,Text>, while createRecordReader(org.apache.hadoop.mapreduce.InputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext) must return RecordReader<Object,IndexedRecord>.
The Object key is used only when there is an error processing the corresponding IndexedRecord value. The key's toString() method is used to generate a printable message that helps you identify the bad record. If the key is null, then no information identifying the record is printed when the record fails. You can use the key, for example, to carry the input file name and the record's position within it.
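To make the field-splitting behavior concrete, here is a minimal, self-contained sketch of how one input line could be broken into all-string fields on commas, in the spirit of what the description above says CSVRecordReader does. The class and method names are illustrative, not part of the Oracle Loader for Hadoop API, and the real reader wraps the fields in an Avro IndexedRecord rather than a String array.

```java
import java.util.Arrays;

public class CsvLineSketch {
    // Split one text line into fields on commas; the -1 limit keeps
    // empty trailing fields, so every line yields a fixed field count
    // matching the F0, F1, ... slots of the all-string schema.
    static String[] parseLine(String line) {
        return line.split(",", -1);
    }

    public static void main(String[] args) {
        // Each field would populate F0, F1, F2 of the Avro record.
        System.out.println(Arrays.toString(parseLine("7369,SMITH,CLERK")));
    }
}
```

Note that this sketch does not handle quoted fields or embedded commas; the example InputFormat is described as a simple comma separator, not a full CSV parser.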
Nested Class Summary

static class CSVInputFormat.CSVRecordReader
    The record reader parses the input data into key/value pairs, which are read by OraLoaderMapper.
Constructor Summary

CSVInputFormat()
Method Summary

org.apache.hadoop.mapreduce.RecordReader<java.lang.Object,org.apache.avro.generic.IndexedRecord> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context)
    Create a record reader for a given split.

static org.apache.avro.Schema generateSimpleAllStringSchema(int numFields)
    Generate an Avro record schema for the CSV input record.

java.util.List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context)
    Logically split the set of input files for the job.
Methods inherited from class java.lang.Object

equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Constructor Detail

CSVInputFormat

public CSVInputFormat()
Method Detail

getSplits

public java.util.List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context) throws java.io.IOException

Logically split the set of input files for the job.

Specified by: getSplits in class org.apache.hadoop.mapreduce.InputFormat<java.lang.Object,org.apache.avro.generic.IndexedRecord>
Parameters:
    context - job configuration.
Returns: the InputSplits for the job.
Throws: java.io.IOException
createRecordReader

public org.apache.hadoop.mapreduce.RecordReader<java.lang.Object,org.apache.avro.generic.IndexedRecord> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context) throws java.io.IOException, java.lang.InterruptedException

Create a record reader for a given split. The framework calls RecordReader.initialize(InputSplit, TaskAttemptContext) before the split is used.

Specified by: createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<java.lang.Object,org.apache.avro.generic.IndexedRecord>
Parameters:
    split - the split to be read
    context - the information about the task
Returns: a new record reader
Throws: java.io.IOException, java.lang.InterruptedException
generateSimpleAllStringSchema

public static org.apache.avro.Schema generateSimpleAllStringSchema(int numFields)

Generate an Avro record schema for the CSV input record.

Parameters:
    numFields - the number of fields for the schema
Returns: the generated schema:
{ "type" : "record", "name" : "All_string_schema", "fields" : [ {"name" : "F0", "type" : "string"}, {"name" : "F1", "type" : "string"}, ...] }
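As a rough illustration of the schema this method is documented to produce, the sketch below builds the equivalent schema text in plain Java. This is a hypothetical re-creation for clarity only: the real generateSimpleAllStringSchema returns an org.apache.avro.Schema object built with the Avro library, not a JSON string, and the class name here is invented.

```java
public class AllStringSchemaSketch {
    // Build the JSON form of the all-string record schema shown above,
    // with fields named F0 .. F(numFields-1), each of type "string".
    static String allStringSchemaJson(int numFields) {
        StringBuilder sb = new StringBuilder(
            "{ \"type\" : \"record\", \"name\" : \"All_string_schema\", \"fields\" : [ ");
        for (int i = 0; i < numFields; i++) {
            if (i > 0) sb.append(", ");
            sb.append("{\"name\" : \"F").append(i).append("\", \"type\" : \"string\"}");
        }
        return sb.append(" ] }").toString();
    }

    public static void main(String[] args) {
        System.out.println(allStringSchemaJson(2));
    }
}
```

With the real API, a string like this could equally be parsed back into a Schema via Avro's Schema.Parser; here it only serves to show the shape of the generated record.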