Oracle® Loader for Hadoop Java API Reference for Linux, Release 1.1, E20858-03
java.lang.Object
org.apache.hadoop.mapreduce.InputFormat<org.apache.avro.generic.IndexedRecord,org.apache.hadoop.io.NullWritable>
oracle.hadoop.loader.examples.CSVInputFormat
public class CSVInputFormat extends org.apache.hadoop.mapreduce.InputFormat<org.apache.avro.generic.IndexedRecord,org.apache.hadoop.io.NullWritable>
This is a simple InputFormat example for CSV files. It uses TextInputFormat to break the input file(s) into lines, then breaks each line into fields using a comma (,) separator, and places the fields into an Avro IndexedRecord.
CSVInputFormat uses the following simple Avro schema:
{
"type" : "record",
"name" : "All_string_schema",
"fields" : [
{"name" : "F0", "type" : "string"},
{"name" : "F1", "type" : "string"},
...]
}
Note that:

- CSVInputFormat cannot extend TextInputFormat, because TextInputFormat.createRecordReader() returns RecordReader<LongWritable,Text>, while createRecordReader(org.apache.hadoop.mapreduce.InputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext) must return RecordReader<IndexedRecord,NullWritable>.
- For simplicity, this implementation does not try to cache and reuse a TextInputFormat instance; the methods getSplits(org.apache.hadoop.mapreduce.JobContext) and createRecordReader(org.apache.hadoop.mapreduce.InputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext) are called on different InputFormat instances anyway.
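As a rough illustration of the per-line parsing described above (TextInputFormat yields whole lines, which are then split on commas into string fields), the core splitting step can be sketched with the standard library alone. This is a minimal sketch, not Oracle's implementation; the class and method names are hypothetical, and like the example class it ignores quoting and escaping.

```java
// Hypothetical helper sketching the line-to-fields step described above.
// A real CSVInputFormat record reader would wrap this in a RecordReader
// and place the fields into an Avro IndexedRecord.
public class CsvLineSplitter {
    // Split one line on commas. The limit of -1 keeps trailing empty
    // fields, so "a,b," yields three fields, the last one empty.
    public static String[] splitLine(String line) {
        return line.split(",", -1);
    }

    public static void main(String[] args) {
        for (String field : splitLine("alice,42,")) {
            System.out.println("[" + field + "]");
        }
    }
}
```

Each resulting field would populate one of the F0, F1, ... string fields of the Avro schema shown above.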
| Nested Class Summary | |
|---|---|
| static class | CSVInputFormat.CSVRecordReader: The record reader parses the input data into key/value pairs which are read by OraLoaderMapper. |
| Constructor Summary | |
|---|---|
| CSVInputFormat() | |
| Method Summary | |
|---|---|
| org.apache.hadoop.mapreduce.RecordReader<org.apache.avro.generic.IndexedRecord,org.apache.hadoop.io.NullWritable> | createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context): Create a record reader for a given split. |
| static org.apache.avro.Schema | generateSimpleAllStringSchema(int numFields): Generate an Avro Record schema for the CSV input record. |
| java.util.List<org.apache.hadoop.mapreduce.InputSplit> | getSplits(org.apache.hadoop.mapreduce.JobContext context): Logically split the set of input files for the job. |
| Methods inherited from class java.lang.Object |
|---|
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Constructor Detail |
|---|
public CSVInputFormat()
| Method Detail |
|---|
public java.util.List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context)
throws java.io.IOException
Logically split the set of input files for the job.
Specified by: getSplits in class org.apache.hadoop.mapreduce.InputFormat<org.apache.avro.generic.IndexedRecord,org.apache.hadoop.io.NullWritable>
Parameters: context - job configuration.
Returns: InputSplits for the job.
Throws: java.io.IOException
public org.apache.hadoop.mapreduce.RecordReader<org.apache.avro.generic.IndexedRecord,org.apache.hadoop.io.NullWritable> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
org.apache.hadoop.mapreduce.TaskAttemptContext context)
throws java.io.IOException,
java.lang.InterruptedException
Create a record reader for a given split. The framework will call RecordReader.initialize(InputSplit, TaskAttemptContext) before the split is used.
Specified by: createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<org.apache.avro.generic.IndexedRecord,org.apache.hadoop.io.NullWritable>
Parameters: split - the split to be read; context - the information about the task
Throws: java.io.IOException, java.lang.InterruptedException

public static org.apache.avro.Schema generateSimpleAllStringSchema(int numFields)
Generate an Avro Record schema for the CSV input record.
Parameters: numFields - the number of fields for the schema
Returns: a schema of the form:
{
"type" : "record",
"name" : "All_string_schema",
"fields" : [
{"name" : "F0", "type" : "string"},
{"name" : "F1", "type" : "string"},
...]
}
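The real method returns an org.apache.avro.Schema object; to make the shape of the generated schema concrete without depending on the Avro or Hadoop libraries, the following sketch builds the equivalent JSON text for a given field count. The class and method names here are illustrative only and are not part of the Oracle Loader for Hadoop API.

```java
// Hypothetical stdlib-only sketch mirroring generateSimpleAllStringSchema:
// emits the all-string record schema JSON shown above, with fields F0..F(n-1).
public class AllStringSchema {
    public static String generate(int numFields) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"type\":\"record\",\"name\":\"All_string_schema\",\"fields\":[");
        for (int i = 0; i < numFields; i++) {
            if (i > 0) sb.append(",");
            // Every field is named F<i> and typed as a string.
            sb.append("{\"name\":\"F").append(i).append("\",\"type\":\"string\"}");
        }
        sb.append("]}");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(generate(2));
    }
}
```

The resulting JSON string could be turned into a real schema with Avro's org.apache.avro.Schema.Parser, which accepts exactly this record-declaration syntax.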