public class CSVInputFormat
extends org.apache.hadoop.mapreduce.InputFormat<java.lang.Object,org.apache.avro.generic.IndexedRecord>
InputFormat example for CSV files. It uses TextInputFormat to break the input file(s) into lines, then breaks each line into fields using a comma (,) separator, and places the fields into an Avro IndexedRecord.
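For context, here is a minimal sketch of how a driver might wire this input format into a MapReduce job. The job name and input path are placeholders, and the package import for CSVInputFormat is omitted:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class CSVJobSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "csv-to-avro");
        // Each CSV line will reach the mapper as an Avro IndexedRecord value.
        job.setInputFormatClass(CSVInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // ... configure mapper and output format here, then submit the job.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```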
CSVInputFormat uses the following simple Avro schema:

```json
{
  "type" : "record",
  "name" : "All_string_schema",
  "fields" : [
    {"name" : "F0", "type" : "string"},
    {"name" : "F1", "type" : "string"},
    ...
  ]
}
```
Note that:

- CSVInputFormat cannot extend TextInputFormat, since TextInputFormat.createRecordReader() returns RecordReader<LongWritable,Text>, while createRecordReader(org.apache.hadoop.mapreduce.InputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext) has to return RecordReader<Object,IndexedRecord>.
- The Object key is used only when there is an error processing the corresponding IndexedRecord value. The key's toString() method is used to generate a printable message that helps you identify the culprit record. If the key is null, then no information identifying the record is printed when the record fails. You can use the key to print information that identifies the failed record; a hypothetical sketch of such a key follows.
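The exact shape of the key is not specified here; the following is a purely hypothetical example of a key class whose toString() locates a failed record by file name and line number:

```java
// Hypothetical key class: toString() yields a message that identifies
// the record (file name plus line number) if processing fails.
public class RecordLocationKey {
    private final String fileName;
    private final long lineNumber;

    public RecordLocationKey(String fileName, long lineNumber) {
        this.fileName = fileName;
        this.lineNumber = lineNumber;
    }

    @Override
    public String toString() {
        return fileName + ":" + lineNumber;
    }
}
```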
| Modifier and Type | Class and Description |
|---|---|
| static class | CSVInputFormat.CSVRecordReader: The record reader parses the input data into key/value pairs which are read by OraLoaderMapper. |
| Constructor and Description |
|---|
| CSVInputFormat() |
| Modifier and Type | Method and Description |
|---|---|
| org.apache.hadoop.mapreduce.RecordReader<java.lang.Object,org.apache.avro.generic.IndexedRecord> | createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context): Create a record reader for a given split. |
| static org.apache.avro.Schema | generateSimpleAllStringSchema(int numFields): Generate an Avro Record schema for the CSV input record. |
| java.util.List<org.apache.hadoop.mapreduce.InputSplit> | getSplits(org.apache.hadoop.mapreduce.JobContext context): Logically split the set of input files for the job. |
public java.util.List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context) throws java.io.IOException

Logically split the set of input files for the job.

Specified by:
getSplits in class org.apache.hadoop.mapreduce.InputFormat<java.lang.Object,org.apache.avro.generic.IndexedRecord>
Parameters:
context - job configuration.
Returns:
a list of InputSplits for the job.
Throws:
java.io.IOException
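The framework normally calls getSplits itself, but since Job implements JobContext, the method can also be invoked directly, for example to inspect how the input would be divided. A sketch, with the input path taken from the command line:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitInspector {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-inspection");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Job implements JobContext, so it can be passed to getSplits directly.
        for (InputSplit split : new CSVInputFormat().getSplits(job)) {
            System.out.println(split); // one logical chunk of the input files
        }
    }
}
```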
public org.apache.hadoop.mapreduce.RecordReader<java.lang.Object,org.apache.avro.generic.IndexedRecord> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context) throws java.io.IOException, java.lang.InterruptedException

Create a record reader for a given split. The framework will call RecordReader.initialize(InputSplit, TaskAttemptContext) before the split is used.

Specified by:
createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<java.lang.Object,org.apache.avro.generic.IndexedRecord>
Parameters:
split - the split to be read
context - the information about the task
Throws:
java.io.IOException
java.lang.InterruptedException
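The initialize-before-read contract can be seen in a standalone sketch that drives the reader by hand instead of through the framework. Constructing a TaskAttemptContextImpl this way is an assumption suitable only for local experimentation:

```java
import org.apache.avro.generic.IndexedRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.TaskAttemptID;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl;

public class ManualReadDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "manual-read");
        FileInputFormat.addInputPath(job, new Path(args[0]));

        CSVInputFormat format = new CSVInputFormat();
        // Assumed local-testing shortcut; the framework builds this for you.
        TaskAttemptContext ctx =
                new TaskAttemptContextImpl(job.getConfiguration(), new TaskAttemptID());
        for (InputSplit split : format.getSplits(job)) {
            RecordReader<Object, IndexedRecord> reader =
                    format.createRecordReader(split, ctx);
            reader.initialize(split, ctx); // required before reading the split
            while (reader.nextKeyValue()) {
                System.out.println(reader.getCurrentValue());
            }
            reader.close();
        }
    }
}
```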
public static org.apache.avro.Schema generateSimpleAllStringSchema(int numFields)

Generate an Avro Record schema for the CSV input record.

Parameters:
numFields - the number of fields for the schema
Returns:
the generated schema:

```json
{
  "type" : "record",
  "name" : "All_string_schema",
  "fields" : [
    {"name" : "F0", "type" : "string"},
    {"name" : "F1", "type" : "string"},
    ...
  ]
}
```
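A short usage sketch: generate a three-field schema and list its fields, each of which is a string named F0, F1, F2:

```java
import org.apache.avro.Schema;

public class SchemaDemo {
    public static void main(String[] args) {
        Schema schema = CSVInputFormat.generateSimpleAllStringSchema(3);
        for (Schema.Field field : schema.getFields()) {
            // Prints: F0 : STRING, F1 : STRING, F2 : STRING
            System.out.println(field.name() + " : " + field.schema().getType());
        }
    }
}
```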