Oracle® Loader for Hadoop Java API Reference for Linux, Release 2.2 (E41239-01)
java.lang.Object
  org.apache.hadoop.mapreduce.InputFormat<java.lang.Object,org.apache.avro.generic.IndexedRecord>
    oracle.hadoop.loader.examples.CSVInputFormat
public class CSVInputFormat extends org.apache.hadoop.mapreduce.InputFormat<java.lang.Object,org.apache.avro.generic.IndexedRecord>
This is a simple InputFormat example for CSV files. It uses TextInputFormat to break the input file(s) into lines, breaks each line into fields using a comma (,) separator, and places the fields into an Avro IndexedRecord.

CSVInputFormat uses the following simple Avro schema:

{ "type" : "record", "name" : "All_string_schema",
  "fields" : [ {"name" : "F0", "type" : "string"},
               {"name" : "F1", "type" : "string"}, ...] }
Note that CSVInputFormat cannot extend TextInputFormat, because TextInputFormat.createRecordReader() returns RecordReader<LongWritable,Text>, while createRecordReader(org.apache.hadoop.mapreduce.InputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext) must return RecordReader<Object,IndexedRecord>.
The Object key is used only when there is an error processing the corresponding IndexedRecord value. The key's toString() method is used to generate a printable message that helps you identify the bad record. If the key is null, then no information identifying the record is printed when the record fails. You can use the key, for example, to carry the input file name and the record's position within it.
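To make the field-splitting behavior concrete, here is a minimal, self-contained sketch of how one input line could be broken into all-string fields on commas, in the spirit of what the description above says CSVRecordReader does. The class and method names are illustrative, not part of the Oracle Loader for Hadoop API, and the real reader wraps the fields in an Avro IndexedRecord rather than a String array.

```java
import java.util.Arrays;

public class CsvLineSketch {
    // Split one text line into fields on commas; the -1 limit keeps
    // empty trailing fields, so every line yields a fixed field count
    // matching the F0, F1, ... slots of the all-string schema.
    static String[] parseLine(String line) {
        return line.split(",", -1);
    }

    public static void main(String[] args) {
        // Each field would populate F0, F1, F2 of the Avro record.
        System.out.println(Arrays.toString(parseLine("7369,SMITH,CLERK")));
    }
}
```

Note that this sketch does not handle quoted fields or embedded commas; the example InputFormat is described as a simple comma separator, not a full CSV parser.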
Nested Class Summary

static class CSVInputFormat.CSVRecordReader
    The record reader parses the input data into key/value pairs, which are read by OraLoaderMapper.
Constructor Summary

CSVInputFormat()
Method Summary

org.apache.hadoop.mapreduce.RecordReader<java.lang.Object,org.apache.avro.generic.IndexedRecord> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context)
    Create a record reader for a given split.

static org.apache.avro.Schema generateSimpleAllStringSchema(int numFields)
    Generate an Avro record schema for the CSV input record.

java.util.List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context)
    Logically split the set of input files for the job.
Methods inherited from class java.lang.Object

equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Constructor Detail

CSVInputFormat

public CSVInputFormat()
Method Detail

getSplits

public java.util.List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context) throws java.io.IOException

Logically split the set of input files for the job.

Specified by: getSplits in class org.apache.hadoop.mapreduce.InputFormat<java.lang.Object,org.apache.avro.generic.IndexedRecord>
Parameters:
    context - job configuration.
Returns: the InputSplits for the job.
Throws: java.io.IOException
createRecordReader

public org.apache.hadoop.mapreduce.RecordReader<java.lang.Object,org.apache.avro.generic.IndexedRecord> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context) throws java.io.IOException, java.lang.InterruptedException

Create a record reader for a given split. The framework calls RecordReader.initialize(InputSplit, TaskAttemptContext) before the split is used.

Specified by: createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<java.lang.Object,org.apache.avro.generic.IndexedRecord>
Parameters:
    split - the split to be read
    context - the information about the task
Returns: a new record reader
Throws: java.io.IOException, java.lang.InterruptedException
generateSimpleAllStringSchema

public static org.apache.avro.Schema generateSimpleAllStringSchema(int numFields)

Generate an Avro record schema for the CSV input record.

Parameters:
    numFields - the number of fields for the schema
Returns: the generated schema:
{ "type" : "record", "name" : "All_string_schema", "fields" : [ {"name" : "F0", "type" : "string"}, {"name" : "F1", "type" : "string"}, ...] }
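As a rough illustration of the schema this method is documented to produce, the sketch below builds the equivalent schema text in plain Java. This is a hypothetical re-creation for clarity only: the real generateSimpleAllStringSchema returns an org.apache.avro.Schema object built with the Avro library, not a JSON string, and the class name here is invented.

```java
public class AllStringSchemaSketch {
    // Build the JSON form of the all-string record schema shown above,
    // with fields named F0 .. F(numFields-1), each of type "string".
    static String allStringSchemaJson(int numFields) {
        StringBuilder sb = new StringBuilder(
            "{ \"type\" : \"record\", \"name\" : \"All_string_schema\", \"fields\" : [ ");
        for (int i = 0; i < numFields; i++) {
            if (i > 0) sb.append(", ");
            sb.append("{\"name\" : \"F").append(i).append("\", \"type\" : \"string\"}");
        }
        return sb.append(" ] }").toString();
    }

    public static void main(String[] args) {
        System.out.println(allStringSchemaJson(2));
    }
}
```

With the real API, a string like this could equally be parsed back into a Schema via Avro's Schema.Parser; here it only serves to show the shape of the generated record.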