Oracle® Big Data Connectors User's Guide
Release 2 (2.0)

Part Number E36961-03

3 Oracle Loader for Hadoop

This chapter explains how to use Oracle Loader for Hadoop to copy data from Hadoop files into tables in an Oracle database. It contains the following sections:

3.1 What Is Oracle Loader for Hadoop?

Oracle Loader for Hadoop is an efficient and high-performance loader for fast movement of data from a Hadoop cluster into a table in an Oracle database. Oracle Loader for Hadoop prepartitions the data if necessary and transforms it into a database-ready format. It optionally sorts records by primary key or user-specified columns before loading the data or creating output files. Oracle Loader for Hadoop is a MapReduce application that is invoked as a command-line utility.

Note:

Partitioning is a database feature for managing and efficiently querying very large tables. It provides a way to decompose a large table into smaller and more manageable pieces called partitions, in a manner entirely transparent to applications. For more information about partitioning, see the Oracle Database VLDB and Partitioning Guide.

Oracle Loader for Hadoop works with a range of input data formats. It handles skew in the input data to help maximize performance.

After the prepartitioning and transforming steps, there are two modes for loading the data into an Oracle database from a Hadoop cluster: online database mode, in which the loader connects to the database and loads the data directly, and offline database mode, in which the loader writes output files that are subsequently loaded into the database.

3.2 Using Oracle Loader for Hadoop

This section describes the following steps for using Oracle Loader for Hadoop:

  1. Implementing InputFormat

  2. Creating the loaderMap Document

  3. Accessing Table Metadata

  4. Invoking OraLoader

  5. Loading Files Into an Oracle Database (Offline Loads Only)

See "Oracle Loader for Hadoop Setup" for installation instructions.

3.2.1 Implementing InputFormat

Oracle Loader for Hadoop is a MapReduce application that gets its input from an org.apache.hadoop.mapreduce.RecordReader implementation, which is provided by the org.apache.hadoop.mapreduce.InputFormat class specified in the mapreduce.inputformat.class configuration property. Oracle Loader for Hadoop requires the RecordReader to return an Avro IndexedRecord input object from its getCurrentValue() method. The method signature should be:

public org.apache.avro.generic.IndexedRecord getCurrentValue()     
throws IOException, InterruptedException;

Oracle Loader for Hadoop uses the schema of the IndexedRecord input object to discover the names of the input fields and map them to the columns of the table to load. This mapping is discussed in more detail in the following sections.

Additionally, Oracle Loader for Hadoop uses the Object returned by the getCurrentKey() method of the RecordReader as a way of providing feedback in the event that processing the corresponding IndexedRecord value results in an error. In such a case the toString() method of the key is called and the result is formatted in an error message. By returning one of the following pieces of information, InputFormat developers can assist users in identifying the records that Oracle Loader for Hadoop rejected:

  • Data file URI

  • InputSplit information

  • Data file and the record's offset in that file

Returning a textual representation of the actual record is not advisable if the data contains sensitive information, because the key may be printed in Hadoop logs throughout the cluster. Instead, see "Logging Rejected Records in Bad Files" for information about enabling bad files for logging rejected records.

If the key is null, then no information identifying the record is printed if the record fails.

Oracle Loader for Hadoop comes with four built-in input formats. Oracle NoSQL Database provides an input format as well. In addition, Oracle Loader for Hadoop provides the source code for an InputFormat example. The example source code is located in the examples/jsrc/ directory. Table 3-1 lists the class names for all these input formats, along with the types of input they handle and how they generate the Avro schema field names. (Understanding how an InputFormat generates field names is critical to getting the data loaded into the target table.)

The built-in input format classes are described in the next subsections. For the InputFormat example, consult the source code and Javadoc for this class for more information.

Table 3-1 InputFormat Classes, Types, and Field Names

  • oracle.hadoop.loader.lib.input.HiveToAvroInputFormat

    Input type: Hive table sources

    Avro schema field names: The Hive table's column names (uppercased)

  • oracle.hadoop.loader.lib.input.DelimitedTextInputFormat

    Input type: Delimited text files

    Avro schema field names: Comma-separated list from the property oracle.hadoop.loader.input.fieldNames (or F0, F1, … if the property is not defined)

  • oracle.hadoop.loader.lib.input.RegexInputFormat

    Input type: Text files

    Avro schema field names: Comma-separated list from the property oracle.hadoop.loader.input.fieldNames (or F0, F1, … if the property is not defined)

  • oracle.hadoop.loader.lib.input.AvroInputFormat

    Input type: Binary format Avro record files

    Avro schema field names: Field names from the input files' Avro schemas

  • oracle.hadoop.loader.examples.CSVInputFormat

    Input type: Simple, delimited text files

    Avro schema field names: F0, F1, …

  • oracle.kv.hadoop.KVAvroInputFormat

    Input type: Oracle NoSQL Database

    Avro schema field names: Field names from the value portion of key-value pairs for Oracle NoSQL Database Avro records


3.2.1.1 HiveToAvroInputFormat

This class presents an input format that reads data from a Hive table. It requires that the Hive database and table names be specified using the following configuration properties:

  • oracle.hadoop.loader.input.hive.tableName

  • oracle.hadoop.loader.input.hive.databaseName

HiveToAvroInputFormat contacts the HiveMetaStoreClient to retrieve information about the table's columns, location, InputFormat, SerDe, and so on. Depending on the way in which Hive was configured, additional Hive-specific properties must be set (such as hive.metastore.uris and hive.metastore.local).

Note that Oracle Loader for Hadoop does not currently support Hive partitioned tables.

HiveToAvroInputFormat imports the entire table (all the files in the Hive table's directory). All other (file-based) input formats discussed in this document allow "globbing" (that is, appending wildcard patterns to the input directories to restrict the input).

Rows in the Hive table are transformed into Avro records whose field names are the Hive table's column names in uppercase. This makes it more likely that the fields will coincide with the database column names. See "Creating the loaderMap Document".

3.2.1.2 DelimitedTextInputFormat

This is an InputFormat class for delimited text files, such as comma-separated values files or tab-separated values files. DelimitedTextInputFormat requires that records be separated by newline characters and that fields be delimited using single-character markers.

The DelimitedTextInputFormat class is meant to emulate the "terminated by t [optionally enclosed by ie [and te]]" behavior of SQL*Loader. The t is the field terminator, ie is the initial field encloser, and te is the trailing field encloser.

DelimitedTextInputFormat uses a parser based on the following grammar:

  • Line = Token t Line | Token\n

  • Token = EnclosedToken | UnenclosedToken

  • EnclosedToken = (white-space)* ie [(non-te)* te te]* (non-te)* te (white-space)*

  • UnenclosedToken = (white-space)* (non-t)*

  • white-space = {c | Character.isWhitespace(c) and c!=t}

Any trailing field encloser character contained inside an enclosed token must be encoded by "doubling it up" (that is, printing it twice).

White space around enclosed tokens is discarded. For unenclosed tokens the leading white space is discarded, but not the trailing white space (if any).

Any empty-string token (either enclosed or unenclosed) is replaced with a null.
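As an illustration of the doubling rule, the following self-contained sketch (not the loader's actual code) decodes an enclosed token. It assumes surrounding white space has already been stripped, so the token starts with the initial encloser and ends with the trailing encloser:

```java
public class EnclosedTokenDemo {
    // Strips the enclosers and collapses each doubled trailing encloser
    // into a single character, per the grammar above. ie is the initial
    // field encloser and te the trailing field encloser.
    static String decode(String token, char ie, char te) {
        String body = token.substring(1, token.length() - 1); // drop ie ... te
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            out.append(c);
            if (c == te) i++; // skip the second half of a doubled encloser
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The enclosed token "say ""hi"" encodes the field value: say "hi"
        System.out.println(decode("\"say \"\"hi\"\"\"", '"', '"'));
    }
}
```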

This implementation allows custom enclosers and terminator characters (see Table 3-2), but hard codes the record terminator (to newline) and white space (to Java's Character.isWhitespace()). The enclosers must be different from the terminator character and white spaces (but can be equal to each other). The terminator can be a white space (but that value is removed from the class of white space characters).

Table 3-2 describes the delimiters available for DelimitedTextInputFormat. In the table, HHHH is a big-endian hexadecimal representation of the character in UTF-16.

Table 3-2 Delimiters for DelimitedTextInputFormat

  • Field terminator

    Property: oracle.hadoop.loader.input.fieldTerminator

    Possible values: one character, or \uHHHH

    Default: , (comma)

  • Initial field encloser

    Property: oracle.hadoop.loader.input.initialFieldEncloser

    Possible values: one character, \uHHHH, or nothing

    Default: No default

  • Trailing field encloser

    Property: oracle.hadoop.loader.input.trailingFieldEncloser

    Possible values: one character, \uHHHH, or nothing

    Default: The value of the initial field encloser


If the trailing field encloser is not set, then the parser uses the value of the initial field encloser. This includes the empty string value corresponding to not using enclosers.

If the initial field encloser is not set, then the trailing field encloser must not be set either.

If the field enclosers are not set, then the EnclosedToken non-terminal is essentially removed from the grammar previously listed. If the initial field encloser is set, then the parser attempts to read each field as an EnclosedToken first before reading it as an UnenclosedToken.

DelimitedTextInputFormat reads the field names as a comma-separated list from the configuration property oracle.hadoop.loader.input.fieldNames. If parsing a line results in more tokens (fields) than field names, the extra tokens are discarded. If there are fewer tokens than field names, the missing trailing tokens are set to null.

If the oracle.hadoop.loader.input.fieldNames property is not set, then the DelimitedTextInputFormat's RecordReader uses F0, F1,… Fn as field names (where n is the largest number of tokens encountered by that RecordReader in any line so far).

3.2.1.3 RegexInputFormat

The RegexInputFormat can be thought of as a generalization of DelimitedTextInputFormat. It is useful for data that cannot be handled by the Oracle Loader for Hadoop DelimitedTextInputFormat. An example is a web log where one field is delimited by quotes and another field is delimited by square brackets. It still requires that records be separated by newline characters, but it identifies fields in each text line using the java.util.regex regular expression-based pattern matching engine. For more information about java.util.regex, see the Java Platform Standard Edition 6 Javadoc at:

http://docs.oracle.com/javase/6/docs/api/java/util/regex/package-summary.html

Table 3-3 describes the properties that control the parsing behavior of RegexInputFormat.

Table 3-3 Properties Controlling Pattern Matching in RegexInputFormat

  • oracle.hadoop.loader.input.regexPattern

    Description: Sets the regular expression pattern. For specifications and a description of possible values, see the Java Platform Standard Edition 6 Javadoc at http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html

    Default: (empty string)

  • oracle.hadoop.loader.input.regexCaseInsensitive

    Description: Specifies whether matching is case-insensitive. Possible values: true, false

    Default: false


Note that the regular expression must match each text line in its entirety. For example, a correct regex pattern for input line "a,b,c," would be "([^,]*),([^,]*),([^,]*),", but not "([^,]*)," because the regular expression is not applied repeatedly to a line of input text.

RegexInputFormat uses the regular expression matching's capturing groups as fields. The special group zero is ignored because it stands for the entire input line. Capturing groups are described at: http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#cg
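The whole-line matching behavior is easy to verify directly with java.util.regex. This standalone snippet (illustrative only, not part of the loader) extracts the capturing groups from the "a,b,c," example above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexFieldsDemo {
    public static void main(String[] args) {
        String line = "a,b,c,";
        // The pattern must match the entire line; each capturing group
        // becomes one field (F0, F1, ...).
        Matcher m = Pattern.compile("([^,]*),([^,]*),([^,]*),").matcher(line);
        if (m.matches()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                System.out.println("F" + (g - 1) + " = " + m.group(g));
            }
        }
        // "([^,]*)," alone does NOT match, because the pattern is not
        // applied repeatedly to the line.
        System.out.println(Pattern.matches("([^,]*),", line));
    }
}
```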

If the oracle.hadoop.loader.input.fieldNames property is set, then RegexInputFormat interprets it as a comma-separated list of fields names. In this case the number of field names determines the number of fields in each record. Extra fields are discarded and trailing missing fields are set to null.

If the oracle.hadoop.loader.input.fieldNames property is not set, then the RegexInputFormat RecordReader uses F0, F1,… Fn as field names (where n is the largest number of fields encountered by that RecordReader so far).

3.2.1.4 AvroInputFormat

The AvroInputFormat is an InputFormat class for standard Avro data files containing Avro records.

To process only files with the .avro extension, append *.avro to directories specified in the mapred.input.dir configuration property.
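For example, with a hypothetical input directory, the property might be set as follows:

```xml
<property>
  <name>mapred.input.dir</name>
  <value>/user/oracle/avro_input/*.avro</value>
</property>
```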

3.2.1.5 KVAvroInputFormat

Oracle NoSQL Database 11g, Release 2 includes a class called oracle.kv.hadoop.KVAvroInputFormat which Oracle Loader for Hadoop can use to read data from Oracle NoSQL Database directly. This class is a subclass of org.apache.hadoop.mapreduce.InputFormat<oracle.kv.Key, org.apache.avro.generic.IndexedRecord> and can be specified as a value for the mapreduce.inputformat.class property. You can use this class to read Avro values that are stored in Oracle NoSQL Database into Oracle Loader for Hadoop.

The KVAvroInputFormat class causes the value portion of key-value pairs for Oracle NoSQL Database Avro records to be passed directly to Oracle Loader for Hadoop, but the keys of the Oracle NoSQL Database records are not. The record keys are available as the MapReduce keys in the Oracle Loader for Hadoop MapReduce job, but they are not accessible as a field that can be specified in the Oracle Loader for Hadoop mapping. If you need access to the Oracle NoSQL Database record keys in Oracle Loader for Hadoop (for example, to store them in the target table), then you can specify a class that implements oracle.kv.hadoop.AvroFormatter using the oracle.kv.formatterClass property in the Oracle Loader for Hadoop configuration file.

See Also:

Javadoc for the KVInputFormatBase class at:

http://docs.oracle.com/cd/NOSQL/html/index.html

3.2.2 Creating the loaderMap Document

Oracle Loader for Hadoop loads data into a single database table, referred to as the target table. You can specify the target table, the columns to load, the mapping of input fields to database columns, and the date format specification in the following ways:

  • To load all columns of the database table when the names of the input fields exactly match the database column names, use the configuration property oracle.hadoop.loader.targetTable to specify the name of the target table. For each database column, the loader looks for an input field with the same name as the column and loads the value of that field into the column.

  • To load a subset of the target table columns or to create explicit mappings when the input field names are not exactly the same as the database column names, create a loaderMap document to specify the target table, columns, and how the input fields should be mapped to the database columns. The location of the loaderMap document is specified using the oracle.hadoop.loader.loaderMapFile configuration property.

  • To specify a default date format that applies to all date input fields, use the configuration property oracle.hadoop.loader.defaultDateFormat. A loaderMap document can be used to specify different date formats for different date input fields.
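For example, the properties described above might be set in the job configuration as follows (the table name matches the HR.EMPLOYEES example used later in this chapter; the loaderMap file path is hypothetical):

```xml
<!-- Load all columns of HR.EMPLOYEES, matching input field names to column names. -->
<property>
  <name>oracle.hadoop.loader.targetTable</name>
  <value>HR.EMPLOYEES</value>
</property>

<!-- Alternatively, point at a loaderMap document for explicit field-to-column mappings. -->
<property>
  <name>oracle.hadoop.loader.loaderMapFile</name>
  <value>file:///home/oracle/loaderMap.xml</value>
</property>
```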

3.2.2.1 Example loaderMap Document

The following example loaderMap document specifies a list of columns in the HR.EMPLOYEES table that should be loaded. It includes a mapping of input data field names to table column names. It also specifies the format of input data that should be used for that column.

<?xml version="1.0" encoding="UTF-8"?>
<LOADER_MAP>
<SCHEMA>HR</SCHEMA>
<TABLE>EMPLOYEES</TABLE>
<COLUMN field="empId">EMPLOYEE_ID</COLUMN>
<COLUMN field="lastName">LAST_NAME</COLUMN>
<COLUMN field="email">EMAIL</COLUMN>
<COLUMN field="hireDate" format="MM-dd-yyyy">HIRE_DATE</COLUMN>
<COLUMN field="jobId">JOB_ID</COLUMN>
</LOADER_MAP>

Note:

If all the columns in the target table are used for loading, and if the input data field names in the IndexedRecord input object match the column names exactly, then the loaderMap file is not needed unless a table column is a DATE data type. Input fields mapped to DATE columns are parsed using the default Java date format. If the input is in a different format, then you must create a loaderMap document and use the format attribute to specify the Java date format string to use when parsing input values.

3.2.3 Accessing Table Metadata

Oracle Loader for Hadoop uses table metadata from Oracle Database to control the execution of a loader job. The loader automatically fetches the metadata whenever a JDBC connection can be established. Sometimes it may be impossible for the loader job to access the database. For example, the Hadoop cluster may be on a different network than the database. In this case, the OraLoaderMetadata utility program is used to extract table metadata from the database into an XML document. The metadata document is then transferred to the Hadoop cluster. The configuration property oracle.hadoop.loader.tableMetadataFile is used to specify the location of the metadata document. When the loader job runs, it accesses this document to discover all necessary metadata information about the target table.

3.2.3.1 Running the OraLoaderMetadata Utility

To run the OraLoaderMetadata Java utility, add the following JAR files to the CLASSPATH variable:

  • $OLH_HOME/jlib/oraloader.jar

  • $OLH_HOME/jlib/ojdbc6.jar

  • $OLH_HOME/jlib/oraclepki.jar

Note:

The oraclepki.jar library is required only if you are connecting to the database using credentials stored in an Oracle wallet.

Then run the following command:

java oracle.hadoop.loader.metadata.OraLoaderMetadata \ 
-user <username> -connection_url <connection URL> [-schema <schemaName>] \
-table <tableName> -output <output filename>

OraLoaderMetadata Parameters

The OraLoaderMetadata utility accepts the following parameters:

  • -user is the Oracle database user name. The user is prompted for the password.

  • -connection_url is the connection URL to connect to Oracle Database.

  • -schema is the name of the schema containing the target table. If this parameter is not specified, then the target table is assumed to be in the user schema specified in the connect URL.

  • -table is the name of the target table.

  • -output is the output file name used to store the metadata document.
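Putting the parameters together, a concrete invocation might look like this (host, service name, schema, and file names are all hypothetical):

```shell
# Hypothetical example: extract metadata for HR.EMPLOYEES into a local XML file.
export CLASSPATH="$OLH_HOME/jlib/oraloader.jar:$OLH_HOME/jlib/ojdbc6.jar"

java oracle.hadoop.loader.metadata.OraLoaderMetadata \
    -user HR \
    -connection_url "jdbc:oracle:thin:@//dbhost.example.com:1521/orcl" \
    -table EMPLOYEES \
    -output tableMetadata.xml
```

The resulting tableMetadata.xml file is then copied to the Hadoop cluster and referenced by the oracle.hadoop.loader.tableMetadataFile property.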

3.2.4 Invoking OraLoader

OraLoader is a Hadoop job that you execute using the standard Hadoop tools. OraLoader implements the org.apache.hadoop.util.Tool interface and follows the standard Hadoop methods for building MapReduce applications. OraLoader performs the following actions:

  1. Reads and verifies input configuration parameters.

  2. Retrieves and verifies table and column metadata information for the target table. Metadata is retrieved from the database whenever a JDBC connection can be made. Otherwise, the loader looks for metadata stored in the location specified by the oracle.hadoop.loader.tableMetadataFile property.

  3. Prepares internal configuration information for the MapReduce tasks of OraLoader and stores table metadata information and dependent Java libraries in the distributed cache so that they are available to the map and reduce tasks throughout the cluster.

  4. Submits the MapReduce job to Hadoop.

  5. Consolidates reporting information from individual tasks to create a common log file for the job after the map and reduce tasks are complete. The log file is written to the _olh subdirectory in the job output directory and is named oraloader-report.txt.

OraLoader is invoked from the command line and accepts any of the generic command-line options. The following is an example invocation:

HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$OLH_HOME/jlib/*"

bin/hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \ 
-conf MyConf.xml

When using one of the example input formats shown in Table 3-1, you must include the -libjars option in the OraLoader invocation:

bin/hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
-conf MyConf.xml -libjars $OLH_HOME/jlib/oraloader-examples.jar

When using the KVAvroInputFormat from Oracle NoSQL Database 11g, Release 2, you must include $KVHOME/lib/kvstore-2.0.22.jar in your HADOOP_CLASSPATH and you must include the -libjars option in the OraLoader invocation:

bin/hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
-conf MyConf.xml -libjars $KVHOME/lib/kvstore-2.0.22.jar

See Also:

  • The Apache Hadoop documentation for information about where to find the Hadoop executable and the setting for the HADOOP_CLASSPATH variable

  • Javadoc for the generic options, which is located at the following Apache site: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/util/GenericOptionsParser.html

3.2.5 Loading Files Into an Oracle Database (Offline Loads Only)

For offline loads, Oracle Loader for Hadoop produces files that must be copied to the database server and loaded into an Oracle database. The following section describes the available offline load method.

3.2.5.1 Loading From Delimited Text Files Into an Oracle Database

After you copy the delimited text files to the database system, use the generated control files to invoke SQL*Loader and load the data from the delimited text files into the database. Alternatively, you can use the generated SQL scripts to perform external table loads into the database. See "Delimited Text Output".

3.3 Output Modes During OraLoader Invocation

This section describes the following output options:

  • JDBC Output

  • Oracle OCI Direct Path Output

  • Delimited Text Output

  • Oracle Data Pump Output

3.3.1 JDBC Output

JDBC is an output option in online database mode. The output records of the loader job are loaded directly into the target table by map or reduce tasks as part of the OraLoader process. There is no need to execute additional steps to load the data. A JDBC connection between the Hadoop system and Oracle Database is required for this output option.

The JDBC output option uses standard JDBC batching to increase performance and efficiency. If an error occurs during batch execution (for example a constraint is violated), the JDBC driver stops execution at the first error. Thus, if there are 100 rows in a batch and the tenth row causes an error then nine rows are inserted and 91 rows are not. Moreover, the JDBC driver does not provide information to identify which row caused the error. In this case, Oracle Loader for Hadoop does not know the insert status for any of the rows in the batch. It counts all rows in the batch as "in question" and continues loading the next batch. A load report is produced at the end of the job that details the number of batch errors incurred and the number of rows whose insert status is in question. One way to handle this problem is by using a unique key on the data. After the data is loaded, the key can be enabled and used to discover missing key values. The missing rows must be located in the input data and reloaded after it has been determined why they failed to load.

To select the JDBC output format, set the following Hadoop property:

<property>
  <name>mapreduce.outputformat.class</name>
  <value>oracle.hadoop.loader.lib.output.JDBCOutputFormat</value>
</property>

The relevant property for configuring JDBC output is oracle.hadoop.loader.jdbc.defaultExecuteBatch, which controls the size of the batch.
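For example, the batch size might be set as follows (100 is an illustrative value, not a recommendation):

```xml
<property>
  <name>oracle.hadoop.loader.jdbc.defaultExecuteBatch</name>
  <value>100</value>
</property>
```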

To optimize data movement over InfiniBand between Big Data Appliance (BDA) and Exadata, Sockets Direct Protocol (SDP) may be used. To specify SDP protocol:

  • Add JVM options to enable JDBC SDP export:

    HADOOP_OPTS="-Doracle.net.SDP=true -Djava.net.preferIPv4Stack=true"
    
  • Use a connect URL that specifies the SDP protocol. Port 1522 in this example points to an SDP-enabled listener:

    <property>
      <name>oracle.hadoop.loader.connection.url</name>
      <value>jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=SDP)
       (HOST=example.com)(PORT=1522))
       (CONNECT_DATA=(SERVICE_NAME=example_service)))</value>
    </property>
    

See Also:

  • Big Data Appliance (BDA) documentation for more information about configuring BDA and Exadata for best performance over InfiniBand

3.3.2 Oracle OCI Direct Path Output

The Oracle OCI Direct Path output format is available in online database mode. This output format uses the OCI Direct Path interface to load rows into the target table. Parallel direct path load is possible because each reducer loads into a distinct database partition.

To select the Oracle OCI Direct Path output format, set the following Hadoop property:

<property>
  <name>mapreduce.outputformat.class</name>
  <value>oracle.hadoop.loader.lib.output.OCIOutputFormat</value>
</property>

The size of the direct path stream buffer can be controlled using the following property:

<property>
  <name>oracle.hadoop.loader.output.dirpathBufsize</name>
  <value>131072</value>
  <description>
   This property is used to set the size, in bytes, of the direct path 
   stream buffer for OCIOutputFormat. If needed, values are rounded 
   up to the next nearest multiple of 8k.
  </description>
</property>

To optimize data movement over InfiniBand between Big Data Appliance (BDA) and Exadata, Sockets Direct Protocol (SDP) may be used. To specify SDP protocol:

  • Add JVM options to enable JDBC SDP export:

    HADOOP_OPTS="-Doracle.net.SDP=true -Djava.net.preferIPv4Stack=true"
    
  • Use a connect URL that specifies the SDP protocol. Port 1522 in this example points to an SDP-enabled listener:

    <property>
      <name>oracle.hadoop.loader.connection.url</name>
      <value>jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=SDP)
       (HOST=example.com)(PORT=1522))
       (CONNECT_DATA=(SERVICE_NAME=example_service)))</value>
    </property>
    

See Also:

  • Big Data Appliance (BDA) documentation for more information about configuring BDA and Exadata for best performance over InfiniBand

The Oracle OCI Direct Path output format has the following restrictions:

  • It is available only on the Linux x86_64 platform.

  • The load target table must be partitioned.

  • The number of reducers must be greater than zero.

  • OCI Direct Path output cannot load a composite interval partitioned table where the subpartition key contains a CHAR, VARCHAR2, NCHAR, or NVARCHAR2 column. The loader checks for this condition and stops with an error if the target load table meets this condition. Composite interval partitions where the subpartition key does not contain a character type column are supported.

The Oracle OCI Direct Path output format requires the following configuration steps. These steps enable the loader to locate the C shared libraries that implement the output format. These libraries are automatically distributed to compute nodes using the Hadoop Distributed Cache mechanism.

  1. Create the environment variable JAVA_LIBRARY_PATH to point to the directory $OLH_HOME/lib. This environment variable is required only on the node where the job is submitted. The $HADOOP_HOME/bin/hadoop command in CDH automatically injects this variable value into the Java system property java.library.path when the job is created. For the Apache Hadoop distribution, you must edit the $HADOOP_HOME/bin/hadoop command so that it concatenates new values to an existing value. The Apache hadoop command begins with an empty JAVA_LIBRARY_PATH value and does not import a value from the environment.

  2. Add $OLH_HOME/lib to the LD_LIBRARY_PATH variable on the client where the loader job is submitted.
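The two steps above might look as follows on the job-submission node (a sketch assuming the standard $OLH_HOME installation layout):

```shell
# Point the loader at the C shared libraries that implement OCIOutputFormat.
export JAVA_LIBRARY_PATH="$OLH_HOME/lib"

# Also make the libraries visible to the client JVM that submits the job.
export LD_LIBRARY_PATH="$OLH_HOME/lib:$LD_LIBRARY_PATH"
```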

3.3.3 Delimited Text Output

Delimited text is an output option in offline database mode. Comma-separated value (CSV) format files, or other delimited text files, are generated by map or reduce tasks. These files are then loaded into the target table by using either SQL*Loader or external tables.

To select the Delimited Text output format, set the following Hadoop property:

<property>
  <name>mapreduce.outputformat.class</name>
  <value>oracle.hadoop.loader.lib.output.DelimitedTextOutputFormat</value>
</property>

Each output task generates a delimited text format file and a corresponding SQL*Loader control file (see Example 3-1).

Delimited text files have the following template:

oraloader-${taskId}-csv-${partitionId}.dat

SQL*Loader control file names have the following template:

oraloader-${taskId}-csv-${partitionId}.ctl

Definitions of the template parameters are as follows:

${taskId}: mapper (reducer) Id

${partitionId}: Partition identifier

If the table is not partitioned or if oracle.hadoop.loader.loadByPartition=false, then a single SQL*Loader control file is generated: oraloader-csv.ctl.

The SQL script for loading into an external table is called oraloader-csv.sql.

The formatting of records and fields in the delimited text file is controlled by the following properties:

  • oracle.hadoop.loader.output.fieldTerminator: Identifies a single character that delimits the fields.

  • oracle.hadoop.loader.output.initialFieldEncloser: Identifies the beginning of a field. It indicates that the fields are always enclosed between this character and the trailingFieldEncloser character.

  • oracle.hadoop.loader.output.trailingFieldEncloser: If not set, then the value of initialFieldEncloser is assumed.

  • oracle.hadoop.loader.output.escapeEnclosers: Identifies the character that is used to escape embedded trailing field encloser characters.
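For example, the following fragment produces conventional CSV-style output with double-quoted fields (the delimiter choices are illustrative):

```xml
<property>
  <name>oracle.hadoop.loader.output.fieldTerminator</name>
  <value>,</value>
</property>
<property>
  <name>oracle.hadoop.loader.output.initialFieldEncloser</name>
  <value>"</value>
</property>
```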

Example 3-1 shows a sample SQL*Loader control file that might be generated by an output task.

Example 3-1 Sample SQL*Loader Control File

LOAD DATA CHARACTERSET AL32UTF8
INFILE 'oraloader-csv-1-0.dat'
BADFILE 'oraloader-csv-1-0.bad'
DISCARDFILE 'oraloader-csv-1-0.dsc'
INTO TABLE "SCOTT"."CSV_PART" PARTITION(10) APPEND
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(
"ID"      DECIMAL EXTERNAL,
"NAME"    CHAR,
"DOB"     DATE 'SYYYY-MM-DD HH24:MI:SS'
)

3.3.4 Oracle Data Pump Output

The Oracle Data Pump output format is available in offline database mode. The loader produces binary format files that can be loaded into the target table using an external table and the ORACLE_DATAPUMP access driver. The output files must be copied from the HDFS file system to a local file system that is accessible to Oracle Database.

To select the Oracle Data Pump output format, set the following Hadoop property:

<property>
  <name>mapreduce.outputformat.class</name>
  <value>oracle.hadoop.loader.lib.output.DataPumpOutputFormat</value>
</property>

Oracle Data Pump output file names have the following template:

oraloader-${taskId}-dp-${partitionId}.dat

Oracle Loader for Hadoop also produces a SQL file that contains commands to perform the following tasks:

  1. Create an external table definition using the ORACLE_DATAPUMP access driver. The binary format Oracle Data Pump output files are listed in the LOCATION clause of the external table.

  2. Create a directory object that is used by the external table. This command must be uncommented before it can be used. To specify the directory name that is produced in the SQL file, set the following property:

    <property>
      <name>oracle.hadoop.loader.extTabDirectoryName</name>
      <value>OLH_EXTTAB_DIR</value>
      <description>
       The name of the Oracle directory object for the external table's
       LOCATION data files. This property applies only to the CSV and 
       DataPump output formats.
      </description>
    </property>
     
    
  3. Insert the rows from the external table into the target table. This command must be uncommented before it can be used.

The SQL file is placed in the _olh directory inside the job's output directory.

3.4 Error Handling and Diagnostics

Oracle Loader for Hadoop may reject input records for a variety of reasons.

See "Logging Rejected Records in Bad Files" for instructions on how to explicitly log the rejected records along with the error messages.

See "Setting a Job Reject Limit" for instructions about how to configure Oracle Loader for Hadoop to stop a job early if too many errors are encountered.

3.4.1 Logging Rejected Records in Bad Files

By default, Oracle Loader for Hadoop does not log rejected records in the Hadoop logs; it logs only information about how to identify them. This prevents user-sensitive information from being stored in Hadoop logs scattered across the cluster. See the discussion of getCurrentKey() in "Implementing InputFormat" for more information on how to identify rejected records.

You can direct Oracle Loader for Hadoop to log rejected records by setting the configuration property oracle.hadoop.loader.logBadRecords to true (it is false by default). When the property is set to true, Oracle Loader for Hadoop logs bad records into one or more "bad" files in the _olh/ directory inside the job's output directory.

This property applies to records rejected by input formats and mappers. It does not apply to errors encountered by the sampling feature (see "Balancing Loads When Loading Data into Partitioned Tables") or output formats (see "JDBC Output" and "Oracle OCI Direct Path Output").

3.4.2 Setting a Job Reject Limit

Certain problems, such as using the wrong loaderMap file, are likely to cause Oracle Loader for Hadoop to reject every record in the input. To limit the time wasted in such cases, Oracle Loader for Hadoop aborts the job if more than 1000 records are rejected. To change the maximum number of rejected records allowed, use the configuration property oracle.hadoop.loader.rejectLimit. A negative value for this property turns off the reject limit and allows the job to run to completion regardless of the number of rejected records.
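
The reject-limit rule can be sketched as follows; this is an illustration of the behavior described above, not the loader's actual implementation:

```python
# Illustrative sketch of the reject-limit rule: the job aborts once the
# number of rejected records exceeds the limit (default 1000), and a
# negative limit disables the check entirely.
def should_abort(rejected_count, reject_limit=1000):
    if reject_limit < 0:      # negative value: reject limit turned off
        return False
    return rejected_count > reject_limit
```

For example, 1000 rejected records is still within the default limit, but the 1001st rejection would abort the job.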

Note that input format errors do not count towards the reject limit because they are fatal and cause the map task to be aborted. Errors encountered by the sampling feature (see "Balancing Loads When Loading Data into Partitioned Tables") or output formats (see "JDBC Output" and "Oracle OCI Direct Path Output") also do not count towards the reject limit.

3.5 Balancing Loads When Loading Data into Partitioned Tables

To balance loads across reducers when data is loaded into a partitioned database table, use the sampling feature of Oracle Loader for Hadoop.

The execution time of a reducer is usually proportional to the number of records that it processes—the more records, the longer the execution time. When the sampling feature is disabled, all records from a given database partition are sent to one reducer. This can result in unbalanced reducer loads because some database partitions may have more records than others. Because the execution time of a Hadoop job is the execution time of its slowest reducer, unbalanced reducer loads can slow down the entire job.

Although hashing records uniformly across reducers can generate balanced reducer loads, it does not necessarily group records by database partition before inserting them into the database.

The sampling feature of Oracle Loader for Hadoop generates an efficient MapReduce partitioning scheme that groups records by database partition while also balancing reducer load.

3.5.1 Using the Sampling Feature

To enable the sampling feature, set the configuration property oracle.hadoop.loader.sampler.enableSampling to true.

Even if the enableSampling property is set to true, the loader automatically disables the sampling feature if sampling is not necessary or if the loader determines that a good sample cannot be made. For example, sampling is automatically disabled if the table is not partitioned, the number of reducer tasks is less than two, or there is too little input data to compute a good load balance. In those cases, the loader returns an informational message.
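
The conditions under which sampling stays active can be summarized in a small sketch. This is illustrative only; the real loader applies additional internal checks, such as whether there is enough input data to compute a good balance:

```python
# Illustrative sketch of when the sampling feature remains active, per
# the conditions described above: sampling must be enabled, the table
# must be partitioned, and there must be at least two reducer tasks.
def sampling_active(enable_sampling, table_is_partitioned, num_reducers):
    return (enable_sampling
            and table_is_partitioned
            and num_reducers >= 2)
```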

Note:

The sampler is multithreaded, and each sampler thread instantiates its own copy of the supplied InputFormat class. Any new InputFormat implementation provided to Oracle Loader for Hadoop should synchronize access to any static, mutable data structures.

It is possible for the sampler to return an out-of-memory error on the client node where the loader job is submitted. This can occur when the input splits returned by the InputFormat do not fit in memory.

The following are possible solutions to this problem:

  • Increase the heap size of the JVM where the job is submitted.

  • Adjust the following properties:

    oracle.hadoop.loader.sampler.hintMaxSplitSize
    oracle.hadoop.loader.sampler.hintNumMapTasks
    

    See "XML Document for the Configuration Properties" for descriptions of these properties.

3.5.2 Tuning Load Balancing and Sampling Behavior

Oracle Loader for Hadoop provides properties that you can use to tune load balancing and sampling behavior. These properties are summarized in "Primary Configuration Properties for the Load Balancing Feature".

3.5.2.1 Properties to Tune Load Balancing

The goal of load balancing is to generate a MapReduce partitioning scheme that assigns approximately the same amount of work to all reducers. This scheme is used in the partitioning step during Oracle Loader for Hadoop job execution.

Two properties control the quality of load balancing: maxLoadFactor and loadCI. The sampler uses the expected reducer load factor to evaluate the quality of its partitioning scheme. Load factor is a metric that indicates how much a reducer's load deviates from a perfectly balanced reducer load. A load factor of one indicates a perfectly balanced load (no overload).

Lower values of maxLoadFactor indicate better load balancing. The maxLoadFactor default of 0.05 means that no reducer is ever overloaded by more than 5% (that is, a maximum reducer load factor of 1.05). The sampler guarantees this maxLoadFactor at a statistical confidence level determined by the value of loadCI. The default value of loadCI is 0.95, which means that a reducer's load factor exceeds this bound in only 5% of cases.

There is a trade-off between the execution time of the sampler and the quality of load balancing. Lower values of maxLoadFactor and higher values of loadCI result in more balanced reducer loads at the expense of longer sampling times. The default values of maxLoadFactor=0.05 and loadCI=0.95 provide a good trade-off between load balancing quality and execution time.
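
The load factor metric can be made concrete with a short sketch. Here a reducer's load factor is taken as its record count divided by the ideal (average) record count, so 1.0 is perfectly balanced; this illustrates the definition above and is not sampler code:

```python
# Illustrative computation of the maximum reducer load factor: the
# busiest reducer's record count divided by the ideal (average) count.
def max_load_factor(records_per_reducer):
    ideal = sum(records_per_reducer) / float(len(records_per_reducer))
    return max(records_per_reducer) / ideal

# With hypothetical reducer loads of [100, 100, 110, 90], the ideal load
# is 100, so the maximum load factor is 1.1: a 10% overload, which would
# violate the default maxLoadFactor of 0.05 (a bound of 1.05).
```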

3.5.2.2 Properties to Tune Sampling Behavior

By default, the sampler runs until it collects just enough samples to generate a partitioning scheme that satisfies the maxLoadFactor and loadCI criteria.

However, you can limit the sampler's running time by setting the maxSamplesPct property, which caps the number of records the sampler collects, expressed as a percentage of the total number of input records.

3.5.3 Does Oracle Loader for Hadoop Always Use the Sampler's Partitioning Scheme?

Oracle Loader for Hadoop uses the generated partitioning scheme only if sampling is successful. Sampling is successful if it generates a partitioning scheme with a maximum reducer load factor of (1 + maxLoadFactor), guaranteed at a statistical confidence level of loadCI. The default values of maxLoadFactor, loadCI, and maxSamplesPct allow the sampler to generate high-quality partitioning schemes for a variety of input data distributions. However, in some cases the sampler may fail to generate a scheme that satisfies these constraints (for example, if the constraints are too rigid or the number of samples required exceeds the user-specified maximum of maxSamplesPct). In such cases, Oracle Loader for Hadoop prints a log message stating that there were not enough samples, and then falls back to partitioning records by database partition without any load balancing guarantees (as described in "Tuning Load Balancing and Sampling Behavior").

An alternative approach is to relax the constraints by resetting the configuration properties: increase maxSamplesPct so that the sampler may collect more samples, increase maxLoadFactor to tolerate a larger overload, or decrease loadCI to lower the required confidence level. Any combination of these changes makes the constraints easier to satisfy.

3.5.4 What Happens When a Sampling Feature Property Has an Invalid Value?

If any configuration property of the sampling feature is set to a value outside the accepted range, the sampler does not throw an exception. Instead, it prints a warning message, resets the property to its default value, and continues executing.
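
This reset-to-default behavior can be sketched as follows for the loadCI property; the accepted range is from the property table below, but the helper itself is hypothetical:

```python
import warnings

# Illustrative sketch of the reset-to-default behavior: an out-of-range
# loadCI value produces a warning and the default value, never an
# exception. The helper is hypothetical, not part of the loader.
def effective_load_ci(value, default=0.95):
    if not (0.5 <= value < 1.0):
        warnings.warn("loadCI %s out of range; using default %s"
                      % (value, default))
        return default
    return value
```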

3.5.5 Primary Configuration Properties for the Load Balancing Feature

The following list describes the primary properties available to tune sampling behavior. See "XML Document for the Configuration Properties" for a complete list of properties.

Configuration Properties for Hadoop Sampling 

oracle.hadoop.loader.sampler.maxSamplesPct

Type: Float

Default: 0.01

Accepted Range: [0, 1]

A value of <=0 disables this property.

Description: The maximum sample size as a percentage of the number of records in the input data. A value of 0.05 indicates that the sampler never samples more than 5% of the total number of records. The sampler may collect fewer samples than this amount.

oracle.hadoop.loader.sampler.maxLoadFactor

Type: Float

Default: 0.05

Accepted Range: >0

A value of <=0 resets the property to the default.

Description: Maximum acceptable load factor for reducer workload.

oracle.hadoop.loader.sampler.loadCI

Type: Float

Default: 0.95

Accepted Range: >=0.5 and <1

Recommended values are >= 0.9. A value of < 0.5 resets the property to the default.

Description: The statistical confidence level for the maximum reducer load factor. Commonly used values other than the default are 0.90 and 0.99.

3.6 OraLoader Configuration Properties

OraLoader uses the standard method in Hadoop for specifying configuration properties. These properties can be specified in a configuration file or by using the -D property=value option to GenericOptionsParser and ToolRunner.

The following list provides brief descriptions of the primary job configuration properties for Oracle Loader for Hadoop. For a complete list and detailed descriptions of all configuration properties, see the oraloader-conf.xml document in "XML Document for the Configuration Properties".

3.6.1 Primary Job Configuration Properties

The following list describes the primary job configuration properties for Oracle Loader for Hadoop.

oracle.hadoop.loader.targetTable

Type: String

Default: Not defined

Description: A schema-qualified name for the table to be loaded. Use this option to indicate that all columns of the table are to be loaded and that the names of the input fields match the column names. This property takes precedence over the oracle.hadoop.loader.loaderMapFile property. If the table is not schema-qualified, then Oracle Loader for Hadoop uses the connection user.

oracle.hadoop.loader.loaderMapFile

Type: String

Default: Not defined

Description: Path to the loader map file

oracle.hadoop.loader.tableMetadataFile

Type: String

Default: Not defined

Description: Path to the target table metadata file. Use this option when running in disconnected mode. The table metadata file is created by running the OraLoaderMetadata utility.

oracle.hadoop.loader.olhcachePath

Type: String

Default: ${mapred.output.dir}/../olhcache

Description: Path to a directory where Oracle Loader for Hadoop can create files that are loaded into the DistributedCache. In distributed mode, the value must be an HDFS path.

oracle.hadoop.loader.extTabDirectoryName

Type: String

Default: OLH_EXTTAB_DIR

Description: The name of the database directory object for the external table's LOCATION data files. This property applies only to the Delimited Text and Data Pump output formats.

oracle.hadoop.loader.sampler.enableSampling

Type: Boolean

Default: true

Description: Indicates whether the sampling feature is enabled

oracle.hadoop.loader.enableSorting

Type: Boolean

Default: true

Description: Indicates whether output records within each reducer group should be sorted by the primary key for the table

oracle.hadoop.loader.connection.url

Type: String

Default: Not defined

Description: Specifies the URL of the database connection string. This property takes precedence over all other connection properties. If an Oracle wallet is configured as an external password store, then the property value must start with the driver prefix jdbc:oracle:thin:@ and the db_connect_string must exactly match the credential defined in the wallet.

oracle.hadoop.loader.connection.user

Type: String

Default: Not defined

Description: Name for database login

oracle.hadoop.loader.connection.password

Type: String

Default: Not defined

Description: Password for the connecting user

oracle.hadoop.loader.connection.wallet_location

Type: String

Default: Not defined

Description: File path to an Oracle wallet directory where the connection credential is stored.

When using Oracle Wallet as an external password store, set the following three properties:

  • oracle.hadoop.loader.connection.wallet_location

  • oracle.hadoop.loader.connection.url

  • oracle.hadoop.loader.connection.tns_admin

Or, set the following three properties:

  • oracle.hadoop.loader.connection.wallet_location

  • oracle.hadoop.loader.connection.tnsEntryName

  • oracle.hadoop.loader.connection.tns_admin

oracle.hadoop.loader.connection.tnsEntryName

Type: String

Default: Not defined

Description: Specifies a TNS entry name defined in the tnsnames.ora file. This property is used with the oracle.hadoop.loader.connection.tns_admin property.

oracle.hadoop.loader.connection.tns_admin

Type: String

Default: Not defined

Description: File path to a directory containing SQL*Net configuration files such as sqlnet.ora and tnsnames.ora. If this property is not defined, the value of the environment variable TNS_ADMIN will be used. Define this property in order to use TNS entry names in database connect strings.

This property must be set when using Oracle Wallet as an external password store. See the property oracle.hadoop.loader.connection.wallet_location.

oracle.hadoop.loader.connection.defaultExecuteBatch

Type: Integer

Default: 100

Description: Applicable only for the JDBC and OCI Direct Path output formats. It is the default value for the number of records to be inserted in a batch for each trip to the database. Specify a value greater than 1 to override the default value. If the specified value is less than 1, then this property assumes the default value. Although the maximum value is unlimited, using very large batch sizes is not recommended because it results in a large memory footprint without much increase in performance.
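
The batching behavior can be sketched as follows; this illustrates the property's semantics only, not the output format's implementation:

```python
# Illustrative sketch of defaultExecuteBatch: records are grouped into
# batches of the given size for each round trip to the database; a
# value less than 1 falls back to the default of 100.
def batches(records, batch_size=100):
    if batch_size < 1:
        batch_size = 100
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]
```

With the default batch size, 250 records would be inserted in three trips: two batches of 100 and one batch of 50.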

oracle.hadoop.loader.connection.sessionTimeZone

Type: String

Default: LOCAL

Description: This property is used to alter the session time zone for database connections. Valid values are as follows:

  • [+|-] hh:mm: Hours and minutes before or after UTC

  • LOCAL: The default time zone of the JVM

  • time_zone_region: A valid time zone region

This property also determines the default time zone used when parsing input data that is loaded into the following database column types: TIMESTAMP, TIMESTAMP WITH TIME ZONE and TIMESTAMP WITH LOCAL TIME ZONE.

oracle.hadoop.loader.output.dirpathBufsize

Type: Integer

Default: 131072

Description: Sets the size, in bytes, of the direct path stream buffer for OCIOutputFormat. If needed, values are rounded up to the next multiple of 8 KB.
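
The rounding rule can be expressed as a short sketch (illustrative only):

```python
# Illustrative sketch: round a requested buffer size up to the next
# multiple of 8 KB (8192 bytes), as described above.
def round_to_8kb(nbytes):
    chunk = 8 * 1024
    return ((nbytes + chunk - 1) // chunk) * chunk
```

The default of 131072 bytes is already 16 x 8192, so it is left unchanged; a request of 131073 bytes would be rounded up to 139264.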

oracle.hadoop.loader.output.fieldTerminator

Type: String

Default: , (comma)

Description: A single character to delimit fields for DelimitedTextOutputFormat.

Alternate representation: \uHHHH (where HHHH is the character's UTF-16 encoding).
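
The alternate representation can be decoded as in this sketch; it is illustrative only, and the loader's actual parser is not shown here:

```python
# Illustrative sketch of the \uHHHH alternate representation: the four
# hex digits are the character's UTF-16 code unit, so "\u0009" denotes
# a tab character used as the field terminator.
def decode_terminator(spec):
    if spec.startswith("\\u") and len(spec) == 6:
        return chr(int(spec[2:], 16))
    return spec
```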

oracle.hadoop.loader.output.initialFieldEncloser

Type: String

Default: None

Description: When this value is set, fields are always enclosed between the specified character and ${oracle.hadoop.loader.output.trailingFieldEncloser}, which defaults to this property's value.

If this value is set, it must be either a single character or \uHHHH (where HHHH is the character's UTF-16 encoding). A zero-length value means that there are no enclosers (default value).

${oracle.hadoop.loader.output.initialFieldEncloser} and ${oracle.hadoop.loader.output.trailingFieldEncloser} must be either both not set or both set.

Use these properties when some field may contain the fieldTerminator. If some field may also contain the trailingFieldEncloser, then set the escapeEnclosers property to true.

oracle.hadoop.loader.output.trailingFieldEncloser

Type: String

Default: None

Description: When this value is set, fields are always enclosed between ${oracle.hadoop.loader.output.initialFieldEncloser} and the specified character for this property.

If this value is set, it must be either a single character, or \uHHHH (where HHHH is the character's UTF-16 encoding). A zero-length value means that there are no enclosers (default value).

If this value is not set, the value of ${oracle.hadoop.loader.output.initialFieldEncloser} is used instead.

Do not set this property if ${oracle.hadoop.loader.output.initialFieldEncloser} is not set.

Use these properties when some field may contain the fieldTerminator. If some field may also contain the trailingFieldEncloser, then set the escapeEnclosers property to true.

oracle.hadoop.loader.output.escapeEnclosers

Type: Boolean

Default: false

Description: When this is set to true and both initial and trailing field enclosers are set, fields are scanned and embedded trailing encloser characters are escaped. Use this option when some of the field values may contain the trailing encloser character.
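
How the enclosers and escaping interact can be sketched as follows. This illustrates the property semantics only; in particular, the backslash escape character used here is an assumption for illustration, not documented loader behavior:

```python
# Illustrative sketch: a field value that may contain the field
# terminator is wrapped in the initial and trailing enclosers, and when
# escaping is enabled, embedded trailing enclosers are escaped first.
# The backslash as escape character is an assumption for illustration.
def format_field(value, initial='"', trailing='"', escape_enclosers=True):
    if initial and trailing:
        if escape_enclosers:
            value = value.replace(trailing, "\\" + trailing)
        return initial + value + trailing
    return value
```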

oracle.hadoop.loader.input.fieldTerminator

Type: String

Default: , (comma)

Description: A single character to delimit fields for DelimitedTextInputFormat.

Alternate representation: \uHHHH (where HHHH is the character's UTF-16 encoding).

oracle.hadoop.loader.input.initialFieldEncloser

Type: String

Default: None

Description: When this value is set, fields are allowed to be enclosed between the specified character and ${oracle.hadoop.loader.input.trailingFieldEncloser} (which defaults to this property's value).

If this value is set, it must be either a single character, or \uHHHH (where HHHH is the character's UTF-16 encoding).

A zero-length value means that there are no enclosers (default value).

oracle.hadoop.loader.input.trailingFieldEncloser

Type: String

Default: None

Description: When this value is set, fields can be enclosed between ${oracle.hadoop.loader.input.initialFieldEncloser} and the specified character.

If this value is set, it must be either a single character or \uHHHH (where HHHH is the character's UTF-16 encoding). A zero-length value means no enclosers (default value).

${oracle.hadoop.loader.input.initialFieldEncloser} and ${oracle.hadoop.loader.input.trailingFieldEncloser} must be either both not set, or both set.

oracle.hadoop.loader.input.fieldNames

Type: Comma-separated list of strings

Default: F0,F1,F2,...

Description: Names to assign to input fields. The names are used to create the Avro schema for the record. The strings must be valid JSON name strings.

3.6.2 General Properties

The following list describes the general properties for Oracle Loader for Hadoop.

mapreduce.inputformat.class

Name of the class implementing InputFormat

mapreduce.outputformat.class

Output options supported by Oracle Loader for Hadoop. The values can be:

  • oracle.hadoop.loader.lib.output.DelimitedTextOutputFormat

    Writes data records to delimited text format files such as comma-separated values (CSV) format files

  • oracle.hadoop.loader.lib.output.JDBCOutputFormat

    Inserts data records into the target table using JDBC

  • oracle.hadoop.loader.lib.output.OCIOutputFormat

    Inserts rows into the target table using the Oracle OCI Direct Path interface

  • oracle.hadoop.loader.lib.output.DataPumpOutputFormat

    Writes rows into binary format files that can be loaded into the target table using an external table

3.7 Example of Using Oracle Loader for Hadoop

The example shown in this section uses Oracle Loader for Hadoop in the online database mode using JDBC. It involves the following steps:

  1. Create a table in the database. This example uses the HR.EMPLOYEES table available as part of the HR sample schema in Oracle Database.

  2. Implement an InputFormat class similar to the examples in the oracle.hadoop.loader.examples package.

  3. Set the configuration properties. The MyLoaderMap.xml document contains the mapping of input data fields to columns in the HR.EMPLOYEES table, as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <LOADER_MAP>
    <SCHEMA>HR</SCHEMA>
    <TABLE>EMPLOYEES</TABLE>
    <COLUMN field="empId">EMPLOYEE_ID</COLUMN>
    <COLUMN field="lastName">LAST_NAME</COLUMN>
    <COLUMN field="email">EMAIL</COLUMN>
    <COLUMN field="hireDate" format="MM-dd-yyyy">HIRE_DATE</COLUMN>
    <COLUMN field="jobId">JOB_ID</COLUMN>
    </LOADER_MAP>
    

    The configuration properties in MyConf.xml are as follows:

    <configuration>
      <property>
        <name>mapreduce.inputformat.class</name>
        <value><full_class_name>.MyInputFormat</value>
        <description> Name of the class implementing InputFormat </description>
      </property>
     
      <property>
        <name>mapreduce.outputformat.class</name>
        <value>oracle.hadoop.loader.lib.output.JDBCOutputFormat</value>
        <description> Output mode after the loader job executes on Hadoop  </description>
      </property>
     
      <property>
        <name>oracle.hadoop.loader.loaderMapFile</name>
        <value>MyLoaderMap.xml</value>
        <description> The loaderMap file specifying the mapping of input data
         fields to the table columns </description>
      </property>
     
     <property>
       <name>oracle.hadoop.loader.connection.user</name>
       <value>HR</value>
       <description> Name of the user connecting to the database</description>
     </property>
    
    <property> 
      <name>oracle.hadoop.loader.connection.password</name>
      <value>[HR password]</value>
      <description>Password of the user connecting to the database</description>
    </property>
     
     <property>
       <name>oracle.hadoop.loader.connection.url</name>
       <value>jdbc:oracle:thin:@//example.com:1521/serviceName</value>
       <description> Database connection string </description>
     </property>
    </configuration>
    
  4. Invoke OraLoader.

    bin/hadoop jar oraloader.jar oracle.hadoop.loader.OraLoader -libjars \
    avro-1.4.1.jar,MyInputFormat.jar -conf MyConf.xml \
    -fs [<local|namenode:port>] \
    -jt [<local|jobtracker:port>]
    

3.8 Target Table Characteristics

Oracle Loader for Hadoop supports loads into a single table, which is referred to as the target table. The target table must exist in the Oracle database. It can contain data or it can be empty.

3.8.1 Supported Data Types

Oracle Loader for Hadoop supports the following database built-in data types:

  • VARCHAR2

  • CHAR

  • NVARCHAR2

  • NCHAR

  • NUMBER

  • FLOAT

  • RAW

  • BINARY_FLOAT

  • BINARY_DOUBLE

  • DATE

  • TIMESTAMP

  • TIMESTAMP WITH TIME ZONE

  • TIMESTAMP WITH LOCAL TIME ZONE

  • INTERVAL YEAR TO MONTH

  • INTERVAL DAY TO SECOND

The target table can contain columns with unsupported data types, but those columns must be nullable or must otherwise be set to a value.

3.8.2 Supported Partitioning Strategies

Oracle Loader for Hadoop supports the following single-level partitioning and composite-level partitioning strategies:

  • Range

  • List

  • Hash

  • Interval

  • Range-Range

  • Range-Hash

  • Range-List

  • List-Range

  • List-Hash

  • List-List

  • Hash-Range

  • Hash-Hash

  • Hash-List

  • Interval-Range

  • Interval-Hash

  • Interval-List

Oracle Loader for Hadoop does not support reference partitioning or virtual column-based partitioning.

3.9 Loader Map XML Schema Definition

This section contains the XML schema definition (XSD) for the loader map that specifies the columns to be loaded into the target table:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema elementFormDefault="qualified" attributeFormDefault="unqualified"
           xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:attributeGroup name="columnAttrs">
    <xs:annotation>
      <xs:documentation>Column attributes define how to map input fields to the
                        database column. field - is the name of the field in the
                        IndexedRecord input object. The field name need not be
                        unique. This means that the same input field can map to
                        different columns in the database table. format - is a
                        format string for interpreting the input. For example,
                        if the field is a date then the format is a date format
                        string suitable for interpreting dates</xs:documentation>
    </xs:annotation>
    <xs:attribute name="field" type="xs:token" use="optional"/>
    <xs:attribute name="format" type="xs:token" use="optional"/>
  </xs:attributeGroup>
  <xs:simpleType name="TOKEN_T">
    <xs:restriction base="xs:token">
      <xs:minLength value="1"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:element name="LOADER_MAP">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="SCHEMA" type="TOKEN_T" minOccurs="0"/>
        <xs:element name="TABLE" type="TOKEN_T" nillable="false"/>
        <xs:element name="COLUMN" maxOccurs="unbounded" minOccurs="0">
          <xs:annotation>
            <xs:documentation>specifies the database column name that will be
                              loaded. Each column name must be unique.
            </xs:documentation>
          </xs:annotation>
          <xs:complexType>
            <xs:simpleContent>
              <xs:extension base="TOKEN_T">
                <xs:attributeGroup ref="columnAttrs"/>
              </xs:extension>
            </xs:simpleContent>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
 </xs:schema>

3.10 XML Document for the Configuration Properties

This is the oraloader-conf.xml document, which describes the configuration properties for Oracle Loader for Hadoop:

<?xml version="1.0"?>
<!-- 
 Copyright (c) 2011, 2012, Oracle and/or its affiliates. All rights reserved. 
 
   NAME
     oraloader-conf.xml 
 
   DESCRIPTION     
     This file is loaded as the very first conf resource.
-->
<configuration>
  <property>
    <name>oracle.hadoop.loader.libjars</name>
    <value>${oracle.hadoop.loader.olh_home}/jlib/ojdbc6.jar,
${oracle.hadoop.loader.olh_home}/jlib/orai18n.jar,
${oracle.hadoop.loader.olh_home}/jlib/orai18n-utility.jar,
${oracle.hadoop.loader.olh_home}/jlib/orai18n-mapping.jar,
${oracle.hadoop.loader.olh_home}/jlib/orai18n-collation.jar,
${oracle.hadoop.loader.olh_home}/jlib/oraclepki.jar,
${oracle.hadoop.loader.olh_home}/jlib/osdt_cert.jar,
${oracle.hadoop.loader.olh_home}/jlib/osdt_core.jar,
${oracle.hadoop.loader.olh_home}/jlib/commons-math-2.2.jar,
${oracle.hadoop.loader.olh_home}/jlib/jackson-core-asl-1.8.8.jar,
${oracle.hadoop.loader.olh_home}/jlib/jackson-mapper-asl-1.8.8.jar,
${oracle.hadoop.loader.olh_home}/jlib/avro-1.6.3.jar,
${oracle.hadoop.loader.olh_home}/jlib/avro-mapred-1.6.3.jar</value> 
    <description>
      Comma separated list of library jar files. These jars are 
      appended to the value of the "-libjars" command-line argument. 
      Users can distribute their application jars using this property
      in place of, or in combination with, the "-libjars" option.
                 
      It is invalid for this list to have a leading comma or 
      consecutive commas.
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.sharedLibs</name>
    <value>
${oracle.hadoop.loader.olh_home}/lib/libolh11.so,
${oracle.hadoop.loader.olh_home}/lib/libclntsh.so.11.1,
${oracle.hadoop.loader.olh_home}/lib/libnnz11.so,
${oracle.hadoop.loader.olh_home}/lib/libociei.so
</value>
    <description>
      These files are appended to the value of the "-files" 
      command-line argument.
     
      It is invalid for this list to have a leading comma or 
      consecutive commas.
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.olh_home</name>
    <value/>
    <description>
      A path to the OLH_HOME on the node where the OraLoader job
      is initiated. OraLoader uses this path to locate required libraries.
      If this property is not defined, OraLoader will use the value in the
      environment variable OLH_HOME.
    </description>
  </property>
  <property>
    <name>mapred.job.name</name>
    <value>OraLoader</value>
    <description>
      Hadoop job name for this Oracle loader job.
    </description>
  </property>
  <property> 
    <name>oracle.hadoop.loader.targetTable</name>
    <value/>
    <description>
      A schema qualified name for the table to be loaded. Use this 
      property to indicate that all columns of the table will be 
      loaded and that the names of the input fields match the 
      column names. This property takes precedence over the
      oracle.hadoop.loader.loaderMapFile property. By default
      this property is not defined. If the table is not schema-qualified, 
      then Oracle Loader for Hadoop uses the connection user.
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.defaultDateFormat</name>
    <value>yyyy-MM-dd HH:mm:ss</value>
    <description>
      A java.text.SimpleDateFormat pattern that is used to parse any input 
      field into a DATE column. The format is constructed using 
      the default locale. If you need different patterns for different input 
      fields, then specify the pattern for each input field using the "format" 
      attribute of the COLUMN element definition in the loader map file 
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.loaderMapFile</name>
    <value/>
    <description>
      Path to the loader map file. 
      Use a file:// scheme to indicate a local file.
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.tableMetadataFile</name>
    <value/>
    <description>
      Path to the target table metadata file. Use this property when
      running in disconnected mode. The table metadata file is
      created by running the OraLoaderMetadata utility.
      Use a file:// scheme to indicate a local file.
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.olhcachePath</name>
    <value>${mapred.output.dir}/../olhcache</value>
    <description>
      Path to a directory where Oracle Loader for Hadoop can create 
      files that will be loaded into the DistributedCache. 
              
      The default value is a directory called 'olhcache' in the parent directory 
      of the job's output directory (i.e. ${mapred.output.dir}).
      
      In distributed mode, the value must be a hdfs path
      (see javaDoc for org.apache.hadoop.filecache.DistributedCache).    
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.loadByPartition</name>
    <value>true</value>
    <description>
      Instructs the output format to perform a partition-aware load.
      For the DelimitedText output format, this property controls whether the 
      keyword "PARTITION" appears in the generated .ctl file(s).
    </description>
  </property>  
  <property>
    <name>oracle.hadoop.loader.extTabDirectoryName</name>
    <value>OLH_EXTTAB_DIR</value>
    <description>
      The name of the Oracle directory object for the external table's
      LOCATION data files. This property applies only to the DelimitedText 
      and DataPump output formats.
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.sampler.enableSampling</name>
    <value>true</value>
    <description>
      Indicates whether the sampling feature is enabled. 
      Set the value to false to disable this feature.
    </description>
  </property>  
  <property>
     <name>oracle.hadoop.loader.enableSorting</name>
     <value>true</value>
     <description>
      Indicates whether output records within each reducer group 
      should be sorted. Use the property oracle.hadoop.loader.sortKey
      to specify the columns to use when sorting. If no value is given,
      then records will be sorted by primary key if the table has one.
     </description>
 </property>
 <property>
     <name>oracle.hadoop.loader.sortKey</name>
     <value/>
     <description>
      A comma-separated list of column names that is used to form a 
      key for sorting output records within a reducer group. If no value
      is given, and oracle.hadoop.loader.enableSorting is true, then 
      records will be sorted by primary key if the table has one.
      
      A name may be provided as a quoted or non-quoted identifier. A 
      quoted identifier begins and ends with a double quotation mark.
      The name between the quotes is used to identify the column.  
      A non-quoted identifier will be converted to uppercase before use.
     </description>
  </property>
  <property>
     <name>oracle.hadoop.loader.rejectLimit</name>
     <value>1000</value>
     <description>
      The allowed number of rejected (skipped) records. If this value is 
      exceeded, the job is aborted.
       
      A negative value signals that no limit is imposed on the number of
      rejected records.
 
      Note that if mapper speculative execution is turned on
      (${mapred.map.tasks.speculative.execution} is true by default),
      the number of rejected records may be temporarily inflated,
      causing the job to be aborted prematurely.
     </description>
  </property>
 
  <property>
     <name>oracle.hadoop.loader.logBadRecords</name>
     <value>false</value>
     <description>
      When set to true, Oracle Loader for Hadoop will log bad records to a file.
      This applies to records rejected by input formats and mappers. 
     </description>
  </property>
  
  <property>
     <name>oracle.hadoop.loader.badRecordFlushInterval</name>
     <value>500</value>
     <description>
      Specifies the number of records logged by a task attempt before 
      flushing (syncing) the log.
      
      This bounds the number of logged records that can be lost if 
      the job is killed because ${oracle.hadoop.loader.rejectLimit} 
      is exceeded.
      
      This property is ignored unless ${oracle.hadoop.loader.logBadRecords} 
      is set to true.
     </description>
  </property>
  
  <property>
    <name>oracle.hadoop.loader.log4j.propertyPrefix</name>
    <value>log4j.logger.oracle.hadoop.loader</value>
    <description>
      Oracle Loader for Hadoop allows you to specify log4j properties
      using Hadoop's job configuration mechanism (-conf, -D).
      
      All configuration properties starting with this prefix are loaded into
      log4j. They override any settings with the same property names that 
      log4j loaded from ${log4j.configuration}. 
      
      These overrides apply to the Oracle Loader for Hadoop job driver
      and all its map and reduce tasks.
      
      Example: -D log4j.logger.oracle.hadoop.loader.OraLoader=DEBUG
               -D log4j.logger.oracle.hadoop.loader.metadata=INFO
               
      Expert: properties are copied from the conf to log4j with their
      RAW values; any variable expansion is done in the context of log4j.
      To use configuration variables in the expansion, the variable names
      must start with this prefix.
    </description>
  </property>
 
  <!-- CONNECTION properties -->
  
  <property>
    <name>oracle.hadoop.loader.connection.url</name>
    <value/>
    <description>
      Specifies the database connection URL. This property 
      takes precedence over all other connection properties.      
    
      If Oracle Wallet is configured as an external password store,
      the property value must start with the driver prefix: jdbc:oracle:thin:@ 
      and the db_connect_string must exactly match the credential defined in the 
      wallet.
 
        Example 1 (using Oracle Net syntax):
        jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=
            (ADDRESS=(PROTOCOL=TCP)(HOST=myhost)(PORT=1521)))
                     (CONNECT_DATA=(SERVICE_NAME=my_db_service_name)))
        
        Example 2 (using a TNS entry):
          jdbc:oracle:thin:@myTNS
        
          - Also see documentation for 
            oracle.hadoop.loader.connection.wallet_location          
    
      If Oracle Wallet is NOT used, then set the following configuration 
      properties:
        oracle.hadoop.loader.connection.url
        oracle.hadoop.loader.connection.user
        oracle.hadoop.loader.connection.password

      Examples of connection URL styles:
        thin-style: 
          jdbc:oracle:thin:@//myhost:1521/my_db_service_name  
          jdbc:oracle:thin:user/password@//myhost:1521/my_db_service_name
         
        Oracle Net:
          jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=
              (ADDRESS=(PROTOCOL=TCP)(HOST=myhost)(PORT=1521)))
                       (CONNECT_DATA=(SERVICE_NAME=my_db_service_name)))
        
        TNS entry name:
          jdbc:oracle:thin:@myTNSEntryName
     
      If OCIOutputFormat is configured and Oracle Wallet is not used, then 
      the user name and password must be specified in the separate 
      oracle.hadoop.loader.connection.user and 
      oracle.hadoop.loader.connection.password properties.     
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.connection.user</name>
    <value/>
    <description>
      Name for the database login. If OCIOutputFormat is configured and this 
      user property is not defined, then logon with Oracle Wallet is assumed.
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.connection.password</name>
    <value/>
    <description>Password for the connecting user.</description>
  </property>
  <property>
    <name>oracle.hadoop.loader.connection.wallet_location</name>
    <value/>
    <description>
     File path to an Oracle wallet directory where the connection 
     credential is stored.
     
     When using Oracle Wallet as an external password store, 
     set these three properties:
      oracle.hadoop.loader.connection.wallet_location
      oracle.hadoop.loader.connection.url
      oracle.hadoop.loader.connection.tns_admin
      
     or set these three properties:      
      oracle.hadoop.loader.connection.wallet_location
      oracle.hadoop.loader.connection.tnsEntryName
      oracle.hadoop.loader.connection.tns_admin     
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.connection.tnsEntryName</name>
    <value/>
    <description>
      Specifies a TNS entry name defined in the tnsnames.ora file.
      This property is used together with the 
      oracle.hadoop.loader.connection.tns_admin property.
    </description>
  </property>  
  <property>
    <name>oracle.hadoop.loader.connection.tns_admin</name>
    <value/>
    <description>
      File path to a directory containing
      SQL*Net configuration files like sqlnet.ora and tnsnames.ora.
      If this property is not defined, the value of the environment
      variable TNS_ADMIN will be used. Define this property in order
      to use TNS entry names in database connect strings.
 
      This property must be set when using Oracle Wallet as an
      external password store. See the property 
      oracle.hadoop.loader.connection.wallet_location
    </description>
  </property>  
  <property>
    <name>oracle.hadoop.loader.connection.defaultExecuteBatch</name>
    <value>100</value>
    <description>
        Applies only to the JDBC and OCI output formats. The default
        number of records to be inserted in a batch on each trip 
        to the database. Specify a value >= 1 to
        override the default. If the specified value is less than 1,
        the default value is used. Though the maximum
       value is unlimited, using very large batch sizes is not
       recommended, as it results in a large memory footprint without
       much increase in performance.
     </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.connection.sessionTimezone</name>
    <value>LOCAL</value>
    <description>
      This property is used to alter the session time zone for 
      database connections.  Valid values are:
      
        [+|-] hh:mm      - hours and minutes before or after UTC
        LOCAL            - the default timezone of the JVM 
        time_zone_region - a valid time zone region
      
      This property also determines the default time zone when parsing
      input data that will be loaded into these database column types:
      TIMESTAMP, TIMESTAMP WITH TIME ZONE, and TIMESTAMP WITH LOCAL TIME ZONE.
    </description>
  </property>
   
  <!-- DEBUG properties 
       These properties should not compromise security or
       expose customer data or Oracle proprietary information.
       Remember these properties will be available to anyone.
       
       Currently there are NO DEBUG properties.
  -->
  <!-- properties for OCIOutputFormat -->
  <property>
    <name>oracle.hadoop.loader.output.dirpathBufsize</name>
    <value>131072</value>
    <description>
      Sets the size, in bytes, of the direct path stream buffer for 
      OCIOutputFormat. If needed, values are rounded up to the next 
      multiple of 8 KB. The default is 128 KB.
    </description>
  </property>
  
  <!-- 
    This is currently only used by parallel dirpath output, but it's conceivable
    that this could be used in a read scenario as well. That is why it does not
    have an output prefix.
  -->
  <property>
    <name>oracle.hadoop.loader.compressionFactors</name>
    <value>BASIC=5.0,OLTP=5.0,QUERY_LOW=10.0,QUERY_HIGH=10.0,
           ARCHIVE_LOW=10.0,ARCHIVE_HIGH=10.0</value>
    <description>
      Defines the compression factor for different types of compression.
      The format is a comma-separated list of name=value pairs, where 
      name is one of BASIC, OLTP, QUERY_LOW, QUERY_HIGH, ARCHIVE_LOW, or
      ARCHIVE_HIGH, and value is a decimal number.
    </description>
  </property>
 
  
  <!-- properties for DelimitedTextOutputFormat -->
    <property>
      <name>oracle.hadoop.loader.output.fieldTerminator</name>
      <value>,</value>
      <description>
        A single character to delimit fields for DelimitedTextOutputFormat.
        Alternate representation: \uHHHH (where HHHH is the character's UTF-16 
        encoding).
      </description>
    </property>
    <property>
      <name>oracle.hadoop.loader.output.initialFieldEncloser</name>
      <value></value>
      <description>
        When this value is set, fields are always enclosed between the
        specified character and 
        ${oracle.hadoop.loader.output.trailingFieldEncloser} (which defaults
        to this property's value).
        
        If this value is set, it must be either a single character, or \uHHHH 
        (where HHHH is the character's UTF-16 encoding).
        
        A zero length value means no enclosers (default value).
                           
        Use this when some field may contain the fieldTerminator. 
        If some field may also contain the trailingFieldEncloser, then
        the escapeEnclosers property should be set to true.
      </description>
    </property>
    <property>
      <name>oracle.hadoop.loader.output.trailingFieldEncloser</name>
      <value></value>
      <description>
        When this value is set, fields are always enclosed between 
        ${oracle.hadoop.loader.output.initialFieldEncloser} and the
        specified character for this property.
 
        If this property is not defined, the value of 
        ${oracle.hadoop.loader.output.initialFieldEncloser} is used instead.
 
        If this value is set, it must be either a single character, or \uHHHH 
        (where HHHH is the character's UTF-16 encoding).
        
        Do not set this property if
        ${oracle.hadoop.loader.output.initialFieldEncloser} is not defined.
                
        Use this when some field may contain the fieldTerminator. 
        If some field may also contain the trailingFieldEncloser, then
        the escapeEnclosers property should be set to true.
      </description>
    </property>
    <property>
      <name>oracle.hadoop.loader.output.escapeEnclosers</name>
      <value>false</value>
      <description>
        When this is set to true and both initial and trailing field enclosers 
        are set, fields will be scanned, and embedded trailing encloser 
        characters will be escaped. Use this option when some of the field
        values may contain the trailing encloser character.
      </description>
    </property>
 
  <!-- property for DataPumpOutputFormat -->
    <property>
      <name>oracle.hadoop.loader.output.granuleSize</name>
      <value>10240000</value>
      <description>
        Granule size (in bytes) used in the generated data pump files.
        A granule determines the workload for a parallel query slave
        when loading the file through the ORACLE_DATAPUMP access driver.
      </description>
    </property>
    
  <!-- common property for DelimitedTextInputFormat and RegexInputFormat -->
    <property>
      <name>oracle.hadoop.loader.input.fieldNames</name>
      <value/>
      <description>
        Comma-separated list of names to assign to input fields. 
        The names are used to create the Avro schema for the record. 
        The strings must be valid JSON name strings.
 
        If this property is not defined, the names F0,F1,F2,... will be used
        (consistent with oracle.hadoop.loader.examples.CSVInputFormat).
      </description>
    </property>
 
 
  <!-- properties for DelimitedTextInputFormat -->
    <property>
      <name>oracle.hadoop.loader.input.fieldTerminator</name>
      <value>,</value>
      <description>
        A single character to delimit fields for DelimitedTextInputFormat.
        Alternate representation: \uHHHH (where HHHH is the character's UTF-16 
        encoding).
      </description>
    </property>
    <property>
      <name>oracle.hadoop.loader.input.initialFieldEncloser</name>
      <value></value>
      <description>
        When this value is set, fields are allowed to be enclosed
        between the specified character and 
        ${oracle.hadoop.loader.input.trailingFieldEncloser} (which defaults
        to this property's value).
        
        If this value is set, it must be either a single character, or \uHHHH 
        (where HHHH is the character's UTF-16 encoding).
        
        A zero length value means no enclosers (default value).
      </description>
    </property>
    <property>
      <name>oracle.hadoop.loader.input.trailingFieldEncloser</name>
      <value></value>
      <description>
        When this value is set, fields are allowed to be enclosed
        between ${oracle.hadoop.loader.input.initialFieldEncloser} 
        and the specified character.
        
        If this property is not defined, the value of 
        ${oracle.hadoop.loader.input.initialFieldEncloser} is used instead.
        
        If this value is set, it must be either a single character, or \uHHHH 
        (where HHHH is the character's UTF-16 encoding).
        
        Do not set this property if
        ${oracle.hadoop.loader.input.initialFieldEncloser} is not defined.
      </description>
    </property>
    
    <!-- properties for RegexInputFormat -->
    <property>
      <name>oracle.hadoop.loader.input.regexPattern</name>
      <value/>
      <description>
        The pattern string for a regular expression as defined in 
        http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html.
        
        The pattern must match the entire text line (the matching is applied 
        only once per line).
        
        Same as org.apache.hadoop.hive.contrib.serde2.RegexSerDe's 
        "input.regex" property.
      </description>
    </property>
    <property>
      <name>oracle.hadoop.loader.input.regexCaseInsensitive</name>
      <value>false</value>
      <description>
         Directs the pattern matching to be case insensitive. 
         
         Same as org.apache.hadoop.hive.contrib.serde2.RegexSerDe's 
         "input.regex.case.insensitive" property.
      </description>
    </property>    
 
    <!--Properties for tuning the sampler-->
    <!-- set numThreads > 1 for large datasets -->
    <property>
      <name>oracle.hadoop.loader.sampler.numThreads</name>
      <value>5</value>
      <description>
        Number of sampler threads.  
        
        This value should be set based on the processor and memory resources 
        available on the node where the Oracle Loader for Hadoop job is initiated. 
        A higher number of sampler threads implies higher concurrency in sampling.
        Set this value to 1 to disable multi-threading in the sampler. 
        The default value is 5 threads.  
      </description>
   </property> 
   <property>
     <name>oracle.hadoop.loader.sampler.maxLoadFactor</name>
     <value>0.05</value>
     <description> 
       The maximum acceptable reducer load factor.
       In a perfectly load balanced job, every reducer is assigned 
       an equal amount of work (or load). 
       The load factor is the relative overload per reducer:
       (assigned load - ideal load) / ideal load.
       For example, a value of 0.05 indicates that it is acceptable for 
       reducers to be assigned up to 5% more data than their ideal load. 
       If load balancing is successful, it guarantees this 
       maximum load factor at the specified confidence.
       (see oracle.hadoop.loader.sampler.loadCI)
       Default = 0.05, another common value is 0.1.
     </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.sampler.loadCI</name>
    <value>0.95</value>
    <description> 
      The confidence level for the specified 
      maximum reducer load factor.
      (See oracle.hadoop.loader.sampler.maxLoadFactor)
       Default = 0.95; other common values are 0.90 and 0.99.
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.sampler.minSplits</name>
    <value>5</value>
    <description>
      The minimum number of splits that will be 
      read by the sampler. If the total number of splits 
      is less than this value, then the sampler will read
      all splits. Splits may be read partially. 
      A non-positive value is equivalent to minSplits=1. 
      The default value is 5.
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.sampler.hintMaxSplitSize</name>
    <value>1048576</value>
    <description> 
      The sampler sets Hadoop configuration parameter
      mapred.max.split.size to this value before it calls the InputFormat's 
      getSplits() method.
      
      The value of mapred.max.split.size is set to this value only for the 
      duration of sampling; it is not changed in the actual job 
      configuration. Some InputFormats (e.g. FileInputFormat) use the 
      maximum split size as a hint to determine the number of splits 
      returned by getSplits(). Smaller split sizes imply that more
      chunks of data will be sampled at random (good). While large splits are 
      better for IO performance, they are not necessarily better for sampling. 
      Set this value to be small enough for good sampling performance, 
      but not any smaller: extremely small values can cause inefficient IO 
      performance and cause getSplits() to run out of memory by returning too
      many splits. 
      
      The recommended minimum value for this property is 1048576 bytes (1 MB).
      Note that org.apache.hadoop.mapreduce.lib.input.FileInputFormat will
      always return splits of size at least FileInputFormat::getMinSplitSize(),
      regardless of the value of this property.
 
      This value can be increased for larger datasets (e.g. tens of terabytes) 
      or if the InputFormat's getSplits() method throws an OutOfMemoryError.  
      If the specified value is less than 1, this property is ignored. 
      The default value is 1048576 bytes (1 MB).
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.sampler.hintNumMapTasks</name>
    <value>100</value>
    <description> 
      The sampler sets Hadoop configuration parameter
      mapred.map.tasks to this value for the duration of sampling. 
      The value of mapred.map.tasks is not changed in the actual job 
      configuration. Some InputFormats (e.g. DBInputFormat) use the 
      number of map tasks parameter as a hint to determine the number of 
      splits returned by getSplits(). Higher values imply that more chunks 
      of data will be sampled at random (good). 
      This value should typically be increased for large datasets (e.g. more 
      than a million rows), while keeping in mind that extremely large values
      can cause the InputFormat's getSplits() method to run out of memory by
      returning too many splits.
      
      If the specified value is less than 1, this property is ignored.  
      The default value is 100.
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.sampler.maxSamplesPct</name>
    <value>0.01</value>
    <description> 
      This property specifies the maximum data to sample, as a
      percentage of the total amount of data. In general, the 
      sampler will stop sampling if any one of the following is true:
      (1) it has collected the minimum number of samples 
          required for optimal load-balancing, or 
      (2) the percent of data sampled exceeds 
          oracle.hadoop.loader.sampler.maxSamplesPct, or 
      (3) the number of bytes sampled exceeds 
          oracle.hadoop.loader.sampler.maxHeapBytes.
      If this parameter is set to a negative value, 
      condition (2) is not imposed.
      The default value is 0.01 (1%).
    </description>
  </property>
  <property>
    <name>oracle.hadoop.loader.sampler.maxHeapBytes</name>
    <value>-1</value>
    <description> 
      This value specifies the maximum memory available to 
      the sampler in bytes. In general, the sampler will 
      stop sampling when any one of these conditions is true:
      (1) it has collected the minimum number of samples 
          required for optimal load-balancing, or 
      (2) the percent of data sampled exceeds 
          oracle.hadoop.loader.sampler.maxSamplesPct, or 
      (3) the number of bytes sampled exceeds 
          oracle.hadoop.loader.sampler.maxHeapBytes.
      If this parameter is set to a negative value, 
      condition (3) is not imposed.
      Default = -1 (no memory restrictions on the sampler).
    </description>
  </property>
  
</configuration>
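
As an illustration of how the connection properties described above fit together, the following is a minimal sketch of a job configuration fragment for a wallet-based connection. The host name, service name, and directory paths are placeholders introduced here for illustration, not values from this guide:

```xml
<!-- Hypothetical wallet-based connection: the host, service name,
     and directory paths below are placeholders. -->
<configuration>
  <property>
    <name>oracle.hadoop.loader.connection.wallet_location</name>
    <value>/home/oracle/wallets/olh_wallet</value>
  </property>
  <property>
    <name>oracle.hadoop.loader.connection.url</name>
    <value>jdbc:oracle:thin:@//myhost:1521/my_db_service_name</value>
  </property>
  <property>
    <name>oracle.hadoop.loader.connection.tns_admin</name>
    <value>/home/oracle/network/admin</value>
  </property>
</configuration>
```

Because the user and password properties are omitted, the credential is resolved from the wallet; as noted in the description of oracle.hadoop.loader.connection.url, the URL must exactly match the credential defined in the wallet.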

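These properties are normally supplied when the OraLoader job is submitted, either in a configuration file passed with -conf or individually with -D, as in the log4j example above. The following command is a sketch only; the jar path and configuration file name are assumptions, not values from this section:

```
# Sketch of an OraLoader invocation; $OLH_HOME and MyConf.xml are
# placeholder names. Properties set with -D override values from
# the -conf file.
hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
    -conf MyConf.xml \
    -D oracle.hadoop.loader.sampler.enableSampling=false \
    -D log4j.logger.oracle.hadoop.loader.OraLoader=DEBUG
```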
3.11 Third-Party Licenses for Bundled Software

Oracle Loader for Hadoop installs the following third-party products:

Oracle Loader for Hadoop includes Oracle 11g Release 2 (11.2) client libraries. For information about third-party products included with Oracle Database 11g Release 2 (11.2), refer to Oracle Database Licensing Information.

Unless otherwise specifically noted, or as required under the terms of the third party license (e.g., LGPL), the licenses and statements herein, including all statements regarding Apache-licensed code, are intended as notices only.

3.11.1 Apache Licensed Code

The following is included as a notice in compliance with the terms of the Apache 2.0 License, and applies to all programs licensed under the Apache 2.0 license:

You may not use the identified files except in compliance with the Apache License, Version 2.0 (the "License.")

You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

A copy of the license is also reproduced below.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

See the License for the specific language governing permissions and limitations under the License.

Apache License


Version 2.0, January 2004
http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

  1. Definitions

    "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.

    "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.

    "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.

    "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.

    "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.

    "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.

    "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).

    "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.

    "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."

    "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.

  2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.

  3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.

  4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:

    1. You must give any other recipients of the Work or Derivative Works a copy of this License; and

    2. You must cause any modified files to carry prominent notices stating that You changed the files; and

    3. You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and

    4. If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.

    You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.

  5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.

  6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.

  7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.

  8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.

  9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.

END OF TERMS AND CONDITIONS

APPENDIX: How to apply the Apache License to your work

To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Do not include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives.

Copyright [yyyy] [name of copyright owner]

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

This product includes software developed by The Apache Software Foundation (http://www.apache.org/) (listed below):

3.11.2 Apache Avro 1.6.3

Licensed under the Apache License, Version 2.0 (the "License"); you may not use Apache Avro except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

3.11.3 Apache Commons Mathematics Library 2.2

Copyright 2001-2011 The Apache Software Foundation

Licensed under the Apache License, Version 2.0 (the "License"); you may not use the Apache Commons Mathematics library except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

3.11.4 Jackson JSON 1.8.8

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this library except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.