Oracle® Loader for Hadoop Java API Reference for Linux Release 1.1 E20858-03
java.lang.Object
  org.apache.hadoop.mapreduce.InputFormat<org.apache.avro.generic.IndexedRecord,org.apache.hadoop.io.NullWritable>
    oracle.hadoop.loader.examples.CSVInputFormat

public class CSVInputFormat
extends org.apache.hadoop.mapreduce.InputFormat<org.apache.avro.generic.IndexedRecord,org.apache.hadoop.io.NullWritable>
This is a simple InputFormat example for CSV files. It uses TextInputFormat to break the input file(s) into lines, then breaks each line into fields using a comma (,) separator, and places the fields into an Avro IndexedRecord.
CSVInputFormat uses the following simple Avro schema:

  { "type" : "record",
    "name" : "All_string_schema",
    "fields" : [ {"name" : "F0", "type" : "string"},
                 {"name" : "F1", "type" : "string"}, ...] }
Note that:

- CSVInputFormat cannot extend TextInputFormat, since TextInputFormat.createRecordReader() returns RecordReader<LongWritable,Text>, while createRecordReader(org.apache.hadoop.mapreduce.InputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext) has to return RecordReader<IndexedRecord,NullWritable>.
- For simplicity, this implementation does not try to cache and reuse a TextInputFormat instance; the methods getSplits(org.apache.hadoop.mapreduce.JobContext) and createRecordReader(org.apache.hadoop.mapreduce.InputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext) get called on different InputFormat instances anyway.
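The comma splitting described above can be illustrated with plain JDK string handling. This is a simplified, hypothetical stand-in for the internal CSVRecordReader parsing, not the actual implementation; it only shows the separator behavior (quoted or escaped commas are not considered):

```java
// Hypothetical sketch of how a CSV line is broken into fields before the
// fields are placed into an Avro IndexedRecord. Stand-alone; no Hadoop or
// Avro dependency.
public class CsvLineSplitDemo {

    // Split one line on commas; the -1 limit preserves trailing empty fields.
    public static String[] splitLine(String line) {
        return line.split(",", -1);
    }

    public static void main(String[] args) {
        String[] fields = splitLine("alpha,beta,,delta");
        // Each field would map to F0, F1, ... in the all-string schema.
        for (int i = 0; i < fields.length; i++) {
            System.out.println("F" + i + " = " + fields[i]);
        }
    }
}
```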
Nested Class Summary

static class  CSVInputFormat.CSVRecordReader
    The record reader parses the input data into key/value pairs, which are read by OraLoaderMapper.
Constructor Summary

CSVInputFormat()
Method Summary

org.apache.hadoop.mapreduce.RecordReader<org.apache.avro.generic.IndexedRecord,org.apache.hadoop.io.NullWritable>
    createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context)
        Create a record reader for a given split.

static org.apache.avro.Schema
    generateSimpleAllStringSchema(int numFields)
        Generate an Avro Record schema for the CSV input record.

java.util.List<org.apache.hadoop.mapreduce.InputSplit>
    getSplits(org.apache.hadoop.mapreduce.JobContext context)
        Logically split the set of input files for the job.
Methods inherited from class java.lang.Object

equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Constructor Detail

CSVInputFormat

public CSVInputFormat()
Method Detail

getSplits

public java.util.List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context)
                                                              throws java.io.IOException

Logically split the set of input files for the job.

Specified by:
    getSplits in class org.apache.hadoop.mapreduce.InputFormat<org.apache.avro.generic.IndexedRecord,org.apache.hadoop.io.NullWritable>
Parameters:
    context - job configuration.
Returns:
    a list of InputSplits for the job.
Throws:
    java.io.IOException
createRecordReader

public org.apache.hadoop.mapreduce.RecordReader<org.apache.avro.generic.IndexedRecord,org.apache.hadoop.io.NullWritable> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context)
                                                                                     throws java.io.IOException, java.lang.InterruptedException

Create a record reader for a given split. The framework will call RecordReader.initialize(InputSplit, TaskAttemptContext) before the split is used.

Specified by:
    createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<org.apache.avro.generic.IndexedRecord,org.apache.hadoop.io.NullWritable>
Parameters:
    split - the split to be read
    context - the information about the task
Returns:
    a new record reader
Throws:
    java.io.IOException
    java.lang.InterruptedException
generateSimpleAllStringSchema

public static org.apache.avro.Schema generateSimpleAllStringSchema(int numFields)

Generate an Avro Record schema for the CSV input record.

Parameters:
    numFields - the number of fields for the schema
Returns:
    an Avro Record schema of the form:
    { "type" : "record", "name" : "All_string_schema", "fields" : [ {"name" : "F0", "type" : "string"}, {"name" : "F1", "type" : "string"}, ...] }
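The JSON text of the schema that generateSimpleAllStringSchema(int) describes can be sketched with only the JDK (no Avro dependency). This hypothetical helper mirrors the documented record name and F0..F(n-1) string fields; it is an illustration of the schema's shape, not the library's implementation:

```java
// Build the JSON form of the all-string record schema documented above.
// AllStringSchemaDemo and schemaJson are illustrative names, not part of
// the oracle.hadoop.loader.examples API.
public class AllStringSchemaDemo {

    public static String schemaJson(int numFields) {
        StringBuilder sb = new StringBuilder();
        sb.append("{ \"type\" : \"record\", \"name\" : \"All_string_schema\", \"fields\" : [");
        for (int i = 0; i < numFields; i++) {
            if (i > 0) sb.append(", ");
            // Field names are F0, F1, ..., all typed as string.
            sb.append("{\"name\" : \"F").append(i).append("\", \"type\" : \"string\"}");
        }
        return sb.append("] }").toString();
    }

    public static void main(String[] args) {
        System.out.println(schemaJson(2));
    }
}
```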