public class CSVInputFormat
extends org.apache.hadoop.mapreduce.InputFormat<java.lang.Object,org.apache.avro.generic.IndexedRecord>
InputFormat example for CSV files. It uses TextInputFormat to break the input file(s) into lines, then breaks each line into fields using a comma (,) separator, and places the fields into an Avro IndexedRecord.
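For context, here is a minimal sketch of how a driver might wire this input format into a MapReduce job. The job name and input path are placeholders, and the package import for CSVInputFormat is omitted:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class CSVJobSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "csv-to-avro");
        // Each CSV line will reach the mapper as an Avro IndexedRecord value.
        job.setInputFormatClass(CSVInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // ... configure mapper and output format here, then submit the job.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```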
CSVInputFormat uses the following simple Avro schema:

```json
{
  "type" : "record",
  "name" : "All_string_schema",
  "fields" : [
    {"name" : "F0", "type" : "string"},
    {"name" : "F1", "type" : "string"},
    ...
  ]
}
```
Note that:

- CSVInputFormat cannot extend TextInputFormat, since TextInputFormat.createRecordReader() returns RecordReader<LongWritable,Text>, while createRecordReader(org.apache.hadoop.mapreduce.InputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext) has to return RecordReader<Object,IndexedRecord>.
- The Object key is used only when there is an error processing the corresponding IndexedRecord value. The key's toString() method is used to generate a printable message that helps you identify the culprit record. If the key is null, then no information identifying the record is printed when the record fails. You can use the key to print information that identifies the failed record; a hypothetical sketch of such a key follows.
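The exact shape of the key is not specified here; the following is a purely hypothetical example of a key class whose toString() locates a failed record by file name and line number:

```java
// Hypothetical key class: toString() yields a message that identifies
// the record (file name plus line number) if processing fails.
public class RecordLocationKey {
    private final String fileName;
    private final long lineNumber;

    public RecordLocationKey(String fileName, long lineNumber) {
        this.fileName = fileName;
        this.lineNumber = lineNumber;
    }

    @Override
    public String toString() {
        return fileName + ":" + lineNumber;
    }
}
```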
| Modifier and Type | Class and Description |
|---|---|
| static class | CSVInputFormat.CSVRecordReader: The record reader parses the input data into key/value pairs which are read by OraLoaderMapper. |
| Constructor and Description |
|---|
| CSVInputFormat() |
| Modifier and Type | Method and Description |
|---|---|
| org.apache.hadoop.mapreduce.RecordReader<java.lang.Object,org.apache.avro.generic.IndexedRecord> | createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context): Create a record reader for a given split. |
| static org.apache.avro.Schema | generateSimpleAllStringSchema(int numFields): Generate an Avro Record schema for the CSV input record. |
| java.util.List<org.apache.hadoop.mapreduce.InputSplit> | getSplits(org.apache.hadoop.mapreduce.JobContext context): Logically split the set of input files for the job. |
public java.util.List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context) throws java.io.IOException

Logically split the set of input files for the job.

Specified by:
getSplits in class org.apache.hadoop.mapreduce.InputFormat<java.lang.Object,org.apache.avro.generic.IndexedRecord>
Parameters:
context - job configuration.
Returns:
a list of InputSplits for the job.
Throws:
java.io.IOException
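The framework normally calls getSplits itself, but since Job implements JobContext, the method can also be invoked directly, for example to inspect how the input would be divided. A sketch, with the input path taken from the command line:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitInspector {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-inspection");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Job implements JobContext, so it can be passed to getSplits directly.
        for (InputSplit split : new CSVInputFormat().getSplits(job)) {
            System.out.println(split); // one logical chunk of the input files
        }
    }
}
```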
public org.apache.hadoop.mapreduce.RecordReader<java.lang.Object,org.apache.avro.generic.IndexedRecord> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context) throws java.io.IOException, java.lang.InterruptedException

Create a record reader for a given split. The framework will call RecordReader.initialize(InputSplit, TaskAttemptContext) before the split is used.

Specified by:
createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<java.lang.Object,org.apache.avro.generic.IndexedRecord>
Parameters:
split - the split to be read
context - the information about the task
Throws:
java.io.IOException
java.lang.InterruptedException
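The initialize-before-read contract can be seen in a standalone sketch that drives the reader by hand instead of through the framework. Constructing a TaskAttemptContextImpl this way is an assumption suitable only for local experimentation:

```java
import org.apache.avro.generic.IndexedRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.TaskAttemptID;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl;

public class ManualReadDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "manual-read");
        FileInputFormat.addInputPath(job, new Path(args[0]));

        CSVInputFormat format = new CSVInputFormat();
        // Assumed local-testing shortcut; the framework builds this for you.
        TaskAttemptContext ctx =
                new TaskAttemptContextImpl(job.getConfiguration(), new TaskAttemptID());
        for (InputSplit split : format.getSplits(job)) {
            RecordReader<Object, IndexedRecord> reader =
                    format.createRecordReader(split, ctx);
            reader.initialize(split, ctx); // required before reading the split
            while (reader.nextKeyValue()) {
                System.out.println(reader.getCurrentValue());
            }
            reader.close();
        }
    }
}
```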
public static org.apache.avro.Schema generateSimpleAllStringSchema(int numFields)

Generate an Avro Record schema for the CSV input record.

Parameters:
numFields - the number of fields for the schema
Returns:
the generated schema:

```json
{
  "type" : "record",
  "name" : "All_string_schema",
  "fields" : [
    {"name" : "F0", "type" : "string"},
    {"name" : "F1", "type" : "string"},
    ...
  ]
}
```
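A short usage sketch: generate a three-field schema and list its fields, each of which is a string named F0, F1, F2:

```java
import org.apache.avro.Schema;

public class SchemaDemo {
    public static void main(String[] args) {
        Schema schema = CSVInputFormat.generateSimpleAllStringSchema(3);
        for (Schema.Field field : schema.getFields()) {
            // Prints: F0 : STRING, F1 : STRING, F2 : STRING
            System.out.println(field.name() + " : " + field.schema().getType());
        }
    }
}
```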