3 Oracle Loader for Apache Hadoop
This chapter explains how to use Oracle Loader for Apache Hadoop (Oracle Loader for Hadoop) to load data from Apache Hadoop into tables in an Oracle database.
What Is Oracle Loader for Hadoop?
Oracle Loader for Hadoop is an efficient and high-performance loader for fast loading of data from a Hadoop cluster into a table in an Oracle database. It pre-partitions the data if necessary and transforms it into a database-ready format. It can also sort records by primary key or user-specified columns before loading the data or creating output files. Oracle Loader for Hadoop uses the parallel processing framework of Hadoop to perform these preprocessing operations, which other loaders typically perform on the database server as part of the load process. Off-loading these operations to Hadoop reduces the CPU requirements on the database server, thereby lessening the performance impact on other database tasks.
Oracle Loader for Hadoop is a Java MapReduce application that balances the data across reducers to help maximize performance. It works with a range of input data formats that present the data as records with fields. It can read from sources that have the data already in a record format (such as Avro files or Apache Hive tables), or it can split the lines of a text file into fields.
If you have Java programming skills, you can extend the types of data that the loader can handle by defining custom input formats. Then Oracle Loader for Hadoop uses your code to extract the fields and records.
Interfaces to Oracle Loader for Hadoop
There are three ways to use Oracle Loader for Hadoop:
- Oracle Shell for Hadoop Loaders (OHSH)
OHSH is the preferred way to use Oracle Loader for Hadoop. It includes a CLI (whose simple command syntax can also be scripted) for moving data between Hadoop and Oracle Database using various resources, including Oracle Loader for Hadoop (OLH).
- Oracle SQL Developer
Oracle SQL Developer is a free graphical IDE that includes integration with Oracle Database and Oracle Big Data Connectors (among many other products). It provides wizards to help you access and use Oracle Big Data Connectors.
- The hadoop command-line utility
On the command line, you provide the job details with the configuration settings, typically supplied in a job configuration file. Consider this option if you need a feature that is not supported by your preferred UI (OHSH or SQL Developer). For most use cases, OHSH or SQL Developer is sufficient.
See Also:
Use Oracle SQL Developer With Oracle Big Data Connectors in this guide provides instructions for downloading Oracle SQL Developer and configuring it for use with Oracle Big Data Connectors.
Get Started With Oracle Loader for Hadoop
These instructions show how to use Oracle Loader for Hadoop through OHSH.
Before You Start
This is what you need to know before using OLH to load an Oracle Database table with data stored in Hadoop:
- The password of the database schema you are connecting to (which is implied by the database connection URL), or any other relevant database credential information.
- The name of the Oracle Database table.
- The source of the data stored in Hadoop (either a path to an HDFS directory or the name of a Hive table).
- The preferred method for loading. Choose either JDBC or direct path. Direct path load is faster, but requires a partitioned target table; JDBC does not.
About Resources
In OHSH, the term resources refers to the interfaces that OHSH presents for defining the data source, destination, and command language. Four types of resources are available:
- Hadoop resources – for executing HDFS commands and using HDFS as a source or destination.
- Hive resources – for executing Hive commands and specifying Hive as a source or destination.
- JDBC resources – for making JDBC connections to a database.
- SQL resources – for executing SQL commands in a database schema.
Two resources are created upon OHSH startup:
- hive0 – enables access to the default Hive database.
- hadoop0 – enables access to HDFS.
Within a session, you can create SQL and JDBC resources, as well as additional Hive resources (for example, to connect to other Hive databases). Assign a meaningful name to each resource, as in the example below, where the names sql0 and ora_mydatabase clearly indicate the type of resource.
Where resources are invoked in the commands below, the percent sign (%) prefix identifies a resource name.
Loading an Oracle Database Table
- Start an OHSH session.
$ ohsh
ohsh>
- Create the following resources:
  - SQL resource:
    ohsh> create sqlplus resource sql0 connectid="<database connection url>"
    At the prompt, enter the database password.
  - SQL resource when using JDBC SSL:
    You can use a JDBC SSL connection to connect to Oracle Database (for example, to connect to an Oracle Autonomous Database). See Using JDBC SSL to download the client credentials and identify the TNS entry in the tnsnames.ora file.
    For example:
    ohsh> create sqlplus resource sql_ssl connectid="inst1_ssl"
    inst1_ssl is the TNS entry for the JDBC SSL connection. At the prompt, enter the database password.
  - SQL resource when using Java KeyStore:
    You can use a Java KeyStore to store your password, so that you are not prompted for a user name and password. Add this to the scripts you develop to load data. See Using Secure External Java KeyStore and Hadoop credential command to create a Java KeyStore.
    For example:
    ohsh> create sqlplus resource sql_cs user=oracle passwordalias=oracle_passwd provider="jceks://file/home/oracle/passwd.jceks" connectid="inst1"
  - JDBC resource:
    A JDBC resource name that indicates the target schema is recommended.
    ohsh> create jdbc resource ora_mydatabase connectid="<database connection url>"
    At the prompt, enter the database password.
  - JDBC resource when using JDBC SSL:
    You can use a JDBC SSL connection to connect to Oracle Database (for example, to connect to an Oracle Autonomous Database). See Using JDBC SSL to download the client credentials and identify the TNS entry in the tnsnames.ora file.
    For example:
    ohsh> create jdbc resource ora_mydatabase_ssl connectiondir=/home/oracle/ssl_client_wallet connectid="inst1_ssl"
    inst1_ssl is the TNS entry for the JDBC SSL connection. At the prompt, enter the database password.
  - JDBC resource when using Java KeyStore:
    You can use a Java KeyStore to store your password, so that you are not prompted for a user name and password. Add this to the scripts you develop to load data. See Using Secure External Java KeyStore and Hadoop credential command to create a Java KeyStore.
    For example:
    ohsh> create jdbc resource ora_mydatabase_cs connectiondir=oracle passwordalias=oracle_passwd provider="jceks://file/home/oracle/passwd.jceks" connectid="inst1"
  - Additional Hive resources (if required):
    The default Hive resource hive0 connects to the default database in Hive. To connect to another Hive database, create another resource:
    ohsh> create hive resource hive_mydatabase connectionurl="jdbc:hive2:///<Hive database name>"
- Use the load command to load files from HDFS into a target table in the Oracle database.
The following command loads data from a delimited text file in HDFS <HDFS path> into the target table in Oracle Database using the direct path option.
ohsh> load oracle table ora_mydatabase:<target table in the Oracle database> from path hadoop0:/user/<HDFS path> using directpath
Note:
The default direct path method is the fastest way to load a table, but it requires a partitioned target table. Direct path is always recommended for partitioned tables. Use the JDBC option to load into a non-partitioned target table.
If the command does not explicitly state the load method, then OHSH automatically chooses the appropriate method: if the target Oracle table is partitioned, OHSH uses direct path (that is, Oracle OCI) by default; if the table is not partitioned, it uses JDBC.
- After loading, check the number of rows.
You can do this conveniently from the OHSH command line:
ohsh> %sql0 select count(*) from <target table in Oracle Database>
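For illustration, the steps above can be combined into a short session. The connection details, resource names, table, and path below are invented for this sketch, and SESSIONS is assumed to be a partitioned table (so that direct path applies):
ohsh> create sqlplus resource sql0 connectid="myhost:1521/orcl"
ohsh> create jdbc resource ora_moviedemo connectid="myhost:1521/orcl"
ohsh> load oracle table ora_moviedemo:SESSIONS from path hadoop0:/user/oracle/moviedemo/sessions using directpath
ohsh> %sql0 select count(*) from SESSIONS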
Loading a Hive Table Into an Oracle Database Table
You can use OHSH to load a Hive table into a target table in an Oracle database. The command below shows how to do this using the direct path method.
ohsh> load oracle table ora_mydatabase:<target table in Oracle Database> from hive table hive0:<Hive table name>
Note that if the target table is partitioned, then OHSH uses direct path automatically. You do not need to enter using directpath explicitly in the command.
If the target table is non-partitioned, then specify the JDBC method instead:
ohsh> load oracle table ora_mydatabase:<target table in Oracle Database> from hive table hive0:<Hive table name> using jdbc
Note:
The load command assumes that the column names in the Hive table and in the Oracle Database table are identically matched. If they do not match, then use an OHSH loadermap.
Using OHSH Loadermaps
The simple load examples in this section assume the following:
- Where we load data from a text file in Hadoop into an Oracle Database table, the declared order of columns in the target table maps correctly to the physical ordering of the delimited text fields in the file.
- Where we load Hive tables into Oracle Database tables, the Hive and Oracle Database column names are identically matched.
However, in less straightforward cases, where the column names (or the order of column names and delimited text fields) do not match, use the OHSH loadermap construct to correct these mismatches.
You can also use a loadermap to specify a subset of target columns to load or, in the case of a load from a text file, to specify the format of a field in the load.
Loadermaps are not covered in this introduction.
Performance Tuning Oracle Loader for Hadoop in OHSH
Aside from network bandwidth, two factors can have significant impact on Oracle Loader for Hadoop performance. You can tune both in OHSH.
- Degree of parallelism
The degree of parallelism affects performance when Oracle Loader for Hadoop runs in Hadoop. For the default method (direct path), parallelism is determined by the number of reducer tasks: the higher the number of reducer tasks, the faster the performance. The default value is 4. To set the number of tasks:
ohsh> set reducetasks 18
For the JDBC option, parallelism is determined by the number of map tasks, and the optimal number is determined automatically. However, remember that if the target table is partitioned, direct path is faster than JDBC.
- Load balancing
Performance is best when the load is balanced evenly across reduce tasks. The load is detected by sampling. Sampling is always enabled by default for loads using JDBC and the default copy method.
Debugging in OHSH
Several OHSH settings control the availability of debugging information:
- outputlevel
The outputlevel is set to minimal by default. Set it to verbose in order to return a stack trace when a command fails:
ohsh> set outputlevel verbose
- logbadrecords
ohsh> set logbadrecords true
This is set to true by default.
These log files are informative for debugging:
- Oracle Loader for Hadoop log files:
/user/<username>/smartloader/jobhistory/oracle/<target table schema>/<target table name>/<OHSH job ID>/_olh
- Log files generated by the map and reduce tasks.
Other OHSH Properties That are Useful for Oracle Loader for Hadoop
You can set these properties on the OHSH command line or in a script.
- dateformat
ohsh> set dateformat "yyyy-MM-dd HH:mm:ss"
The syntax for this command is dictated by the Java date format.
- rejectlimit
The number of rows that can be rejected before the load of a delimited text file fails.
- fieldterminator
The field terminator in loads of delimited text files.
- hadooptnsadmin
Location of an Oracle TNS admin directory in the Hadoop cluster.
- hadoopwalletlocation
Location of the Oracle Wallet directory in the Hadoop cluster.
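As a sketch, a load script might set several of these properties before issuing the load command. The values here are illustrative assumptions, not defaults:
ohsh> set dateformat "yyyy-MM-dd HH:mm:ss"
ohsh> set fieldterminator ","
ohsh> set rejectlimit 10
ohsh> load oracle table ora_mydatabase:<target table> from path hadoop0:/user/<HDFS path> using directpath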
Additional Information
Using the exttab (External Table) Method to Load Data
A third option for loading data from Hadoop into Oracle Database is exttab.
Note:
The exttab option is available in on-premises deployments of OHSH only. It is not available in Oracle cloud services.
In the exttab method, data is loaded via external tables. OHSH creates the external table using Oracle SQL Connector for HDFS, and then uses a Create Table as Select statement to load the data into the target table:
ohsh> load oracle table ora_mydatabase:<target table in Oracle Database> from hive table hive0:<Hive table name> using exttab
Learning Resources
Blog entries about OHSH can help you get started.
See the Java™ Platform, Standard Edition 7 API Specification for documentation on the SimpleDateFormat class.
Use Oracle Loader for Hadoop With the Hadoop Command Line Utility
The following topics describe the basic steps for using Oracle Loader for Hadoop.
About the Modes of Operation
Oracle Loader for Hadoop operates in two modes:
Online Database Mode
In online database mode, Oracle Loader for Hadoop can connect to the target database using the credentials provided in the job configuration file or in an Oracle wallet. The loader obtains the table metadata from the database. It can insert new records directly into the target table or write them to a file in the Hadoop cluster. You can load records from an output file when the data is needed in the database, or when the database system is less busy.
Offline Database Mode
Offline database mode enables you to use Oracle Loader for Hadoop when the Oracle Database system is on a separate network from the Hadoop cluster, or is otherwise inaccessible. In this mode, Oracle Loader for Hadoop uses the information supplied in a table metadata file, which you generate using a separate utility. The loader job stores the output data in binary or text format output files on the Hadoop cluster. Loading the data into Oracle Database is a separate procedure using another utility, such as Oracle SQL Connector for Hadoop Distributed File System (HDFS) or SQL*Loader.
Create the Target Table
Oracle Loader for Hadoop loads data into one target table, which must exist in the Oracle database. The table can be empty or contain data already; Oracle Loader for Hadoop does not overwrite existing data.
Create the table the same way that you would create one for any other purpose. It must comply with the restrictions described in the following topics.
Supported Data Types for Target Tables
You can define the target table using any of these data types:
- BINARY_DOUBLE
- BINARY_FLOAT
- CHAR
- DATE
- FLOAT
- INTERVAL DAY TO SECOND
- INTERVAL YEAR TO MONTH
- NCHAR
- NUMBER
- NVARCHAR2
- RAW
- TIMESTAMP
- TIMESTAMP WITH LOCAL TIME ZONE
- TIMESTAMP WITH TIME ZONE
- VARCHAR2
The target table can contain columns with unsupported data types, but these columns must be nullable, or otherwise set to a value.
Supported Partitioning Strategies for Target Tables
Partitioning is a database feature for managing and efficiently querying very large tables. It provides a way to decompose a large table into smaller and more manageable pieces called partitions, in a manner entirely transparent to applications.
You can define the target table using any of the following single-level and composite-level partitioning strategies.
- Hash
- Hash-Hash
- Hash-List
- Hash-Range
- Interval
- Interval-Hash
- Interval-List
- Interval-Range
- List
- List-Hash
- List-List
- List-Range
- Range
- Range-Hash
- Range-List
- Range-Range
Oracle Loader for Hadoop does not support reference partitioning or virtual column-based partitioning.
Create a Job Configuration File
A configuration file is an XML document that provides Hadoop with all the information it needs to run a MapReduce job. This file can also provide Oracle Loader for Hadoop with all the information it needs. See "Oracle Loader for Hadoop Configuration Property Reference".
Configuration properties provide the following information, which is required for all Oracle Loader for Hadoop jobs:
- How to establish a secure connection to Oracle Database.
See "Establish Secure Connections to Oracle Database Using SSL and Oracle Wallet."
- The format of the input data.
See "About Input Formats."
- The format of the output data.
See "About Output Formats."
OraLoader
implements the org.apache.hadoop.util.Tool
interface and follows the standard Hadoop methods for building MapReduce applications. Thus, you can supply the configuration properties in a file (as shown here) or on the hadoop
command line. See "Run a Loader Job."
You can use any text or XML editor to create the file. The following shows an example of a job configuration file.
Example 3-1 Job Configuration File
<?xml version="1.0" encoding="UTF-8" ?> <configuration> <!-- Input settings --> <property> <name>mapreduce.job.inputformat.class</name> <value>oracle.hadoop.loader.lib.input.DelimitedTextInputFormat</value> </property> <property> <name>mapreduce.input.fileinputformat.inputdir</name> <value>/user/oracle/moviedemo/session/*00000</value> </property> <property> <name>oracle.hadoop.loader.input.fieldTerminator</name> <value>\u0009</value> </property> <property> <name>oracle.hadoop.loader.input.fieldNames</name> <value>SESSION_ID,TIME_IDDATE,CUST_ID,DURATION_SESSION,NUM_RATED,DURATION_RATED,NUM_COMPLETED,DURATION_COMPLETED,TIME_TO_FIRST_START,NUM_STARTED,NUM_BROWSED,DURATION_BROWSED,NUM_LISTED,DURATION_LISTED,NUM_INCOMPLETE,NUM_SEARCHED</value> </property> <property> <name>oracle.hadoop.loader.defaultDateFormat</name> <value>yyyy-MM-dd:HH:mm:ss</value> </property> <!-- Output settings --> <property> <name>mapreduce.job.outputformat.class</name> <value>oracle.hadoop.loader.lib.output.OCIOutputFormat</value> </property> <property> <name>mapreduce.output.fileoutputformat.outputdir</name> <value>temp_out_session</value> </property> <!-- Table information --> <property> <name>oracle.hadoop.loader.loaderMap.targetTable</name> <value>movie_sessions_tab</value> </property> <!-- Connection information --> <property> <name>oracle.hadoop.loader.connection.url</name> <value>jdbc:oracle:thin:@${HOST}:${TCPPORT}/${SERVICE_NAME}</value> </property> <property> <name>TCPPORT</name> <value>1521</value> </property> <property> <name>HOST</name> <value>myoraclehost.example.com</value> </property> <property> <name>SERVICE_NAME</name> <value>orcl</value> </property> <property> <name>oracle.hadoop.loader.connection.user</name> <value>MOVIEDEMO</value> </property> <property> <name>oracle.hadoop.loader.connection.password</name> <value>oracle</value> <description> Note: Protect this file with 600 permissions since it has password in clear text.</description> </property> </configuration>
Establish Secure Connections to Oracle Database Using SSL and Oracle Wallet
Learn about establishing secure connections to Oracle Database.
This section describes how to create and use an Oracle Wallet or the JDBC SSL protocol to create and establish highly secure connections to Oracle Database.
Use Oracle Wallets
Oracle Wallet is a secure software container that stores authentication and signing credentials. Oracle recommends that you use a wallet to provide your credentials. To use an Oracle wallet, enter the following properties in the job configuration file:
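A hedged sketch of such a configuration, using the connection properties described in the configuration property reference later in this chapter (the TNS alias and directory path are illustrative):
<property>
  <name>oracle.hadoop.loader.connection.tnsEntryName</name>
  <value>orcl_alias</value>
</property>
<property>
  <name>oracle.hadoop.loader.connection.tns_admin</name>
  <value>/etc/oracle/tns</value>
</property>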
Use JDBC SSL
SSL is a widely used industry standard protocol that provides secure communication over a network. SSL provides authentication, data encryption, and data integrity. SSL is required when connecting to Oracle Cloud services such as Oracle Autonomous Data Warehouse Cloud Service.
Generate the Target Table Metadata for Offline Database Mode
Learn to generate the Target table metadata for offline database mode.
Under some circumstances, the loader job cannot access the database, such as when the Hadoop cluster is on a different network than Oracle Database. In such cases, you can use the OraLoaderMetadata utility to extract and store the target table metadata in a file. To learn more about the OraLoaderMetadata utility, see OraLoaderMetadata Utility.
About Input Formats
An input format reads a specific type of data stored in Hadoop. Several input formats are available, covering the data formats most commonly found in Hadoop; they are described in the topics that follow.
You can also use your own custom input formats. The descriptions of the built-in formats provide information that may help you develop custom Java InputFormat
classes. See "Custom Input Formats."
You specify a particular input format for the data that you want to load into a database table by using the mapreduce.job.inputformat.class configuration property in the job configuration file.
Note:
The built-in text formats do not handle header rows or newline characters (\n
) embedded in quoted values.
Delimited Text Input Format
To load data from a delimited text file, set mapreduce.job.inputformat.class
to
oracle.hadoop.loader.lib.input.DelimitedTextInputFormat
About DelimitedTextInputFormat
The input file must comply with these requirements:
-
Records must be separated by newline characters.
-
Fields must be delimited using single-character markers, such as commas or tabs.
A null replaces any empty-string token, whether enclosed or unenclosed.
DelimitedTextInputFormat
emulates the tokenization method of SQL*Loader: Terminated by t, and optionally enclosed by ie, or by ie and te. DelimitedTextInputFormat
uses the following syntax rules, where t is the field terminator, ie is the initial field encloser, te is the trailing field encloser, and c is one character.
- Line = Token t Line | Token\n
- Token = EnclosedToken | UnenclosedToken
- EnclosedToken = (white-space)* ie [(non-te)* te te]* (non-te)* te (white-space)*
- UnenclosedToken = (white-space)* (non-t)*
- white-space = {c | Character.isWhitespace(c) and c != t}
White space around enclosed tokens (data values) is discarded. For unenclosed tokens, the leading white space is discarded, but not the trailing white space (if any).
This implementation enables you to define custom enclosers and terminator characters, but it hard codes the record terminator as a newline, and white space as Java Character.isWhitespace
. A white space can be defined as the field terminator, but then that character is removed from the class of white space characters to prevent ambiguity.
Hadoop automatically decompresses compressed text files when they are read.
Required Configuration Properties
None. The default format separates fields with commas and has no field enclosures.
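To change these defaults, you can set input properties in the job configuration file. The fieldTerminator property appears in Example 3-1; the encloser property names below follow the same oracle.hadoop.loader.input prefix and are assumptions to verify against the configuration property reference:
<!-- Sketch: tab-separated fields, optionally enclosed in double quotes. -->
<property>
  <name>oracle.hadoop.loader.input.fieldTerminator</name>
  <value>\u0009</value>
</property>
<property>
  <name>oracle.hadoop.loader.input.initialFieldEncloser</name>
  <value>"</value>
</property>
<property>
  <name>oracle.hadoop.loader.input.trailingFieldEncloser</name>
  <value>"</value>
</property>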
Complex Text Input Formats
To load data from text files that are more complex than DelimitedTextInputFormat
can handle, set mapreduce.job.inputformat.class
to
oracle.hadoop.loader.lib.input.RegexInputFormat
For example, a web log might delimit one field with quotes and another field with square brackets.
About RegexInputFormat
RegexInputFormat
requires that records be separated by newline characters. It identifies fields in each text line by matching a regular expression:
- The regular expression must match the entire text line.
- The fields are identified using the capturing groups in the regular expression.
RegexInputFormat
uses the java.util.regex
regular expression-based pattern matching engine. Hadoop automatically decompresses compressed files when they are read.
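For the web-log case mentioned above, a pattern with two capturing groups might look like the following sketch. The regex property name is an assumption to verify against the configuration property reference; each capturing group becomes one input field:
<!-- Sketch: matches a line of the form "field1" [field2] -->
<property>
  <name>oracle.hadoop.loader.input.regexPattern</name>
  <value>"([^"]*)" \[([^\]]*)\]</value>
</property>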
See Also:
Java Platform Standard Edition 6 Java Reference for more information about java.util.regex
at
http://docs.oracle.com/javase/6/docs/api/java/util/regex/package-summary.html
Hive Table Input Format
To load data from a Hive table, set mapreduce.job.inputformat.class
to
oracle.hadoop.loader.lib.input.HiveToAvroInputFormat
About HiveToAvroInputFormat
For nonpartitioned tables, HiveToAvroInputFormat
imports the entire table, which is all files in the Hive table directory.
For partitioned tables, HiveToAvroInputFormat
imports one or more of the partitions. You can either load or skip a partition. However, you cannot partially load a partition.
Oracle Loader for Hadoop rejects all rows with complex (non-primitive) column values. UNIONTYPE
fields that resolve to primitive values are supported. See "Handling Rejected Records."
HiveToAvroInputFormat
transforms rows in the Hive table into Avro records, and capitalizes the Hive table column names to form the field names. This automatic capitalization improves the likelihood that the field names match the target table column names. See "Mapping Input Fields to Target Table Columns".
Note:
This input format does not support Hive tables using quoted identifiers for column names. See HIVE-6013.
Also note that HiveToAvroInputFormat does not enforce SQL Standard Based Hive Authorization. For more information, see https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+Based+Hive+Authorization.
Required Configuration Properties
You must specify the Hive database and table names using the following configuration properties:
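For example, the Hive source can be named as follows. The property names are as commonly documented for Oracle Loader for Hadoop (verify them against the configuration property reference for your release); the database and table values are illustrative:
<property>
  <name>oracle.hadoop.loader.input.hive.databaseName</name>
  <value>default</value>
</property>
<property>
  <name>oracle.hadoop.loader.input.hive.tableName</name>
  <value>movie_sessions</value>
</property>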
Avro Input Format
To load data from binary Avro data files containing standard Avro-format records, set mapreduce.job.inputformat.class
to
oracle.hadoop.loader.lib.input.AvroInputFormat
To process only files with the .avro extension, append *.avro
to directories listed in the mapreduce.input.fileinputformat.inputdir
configuration property.
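For example (the directory path is illustrative):
<property>
  <name>mapreduce.input.fileinputformat.inputdir</name>
  <value>/user/oracle/avrodata/*.avro</value>
</property>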
Oracle NoSQL Database Input Format
To load data from Oracle NoSQL Database, set mapreduce.job.inputformat.class
to
oracle.kv.hadoop.KVAvroInputFormat
This input format is defined in Oracle NoSQL Database 11g, Release 2 and later releases.
About KVAvroInputFormat
Oracle Loader for Hadoop uses KVAvroInputFormat
to read data directly from Oracle NoSQL Database.
KVAvroInputFormat
passes the value but not the key from the key-value pairs in Oracle NoSQL Database. If you must access the Oracle NoSQL Database keys as Avro data values, such as storing them in the target table, then you must create a Java InputFormat
class that implements oracle.kv.hadoop.AvroFormatter
. Then you can specify the oracle.kv.formatterClass
property in the Oracle Loader for Hadoop configuration file.
The KVAvroInputFormat class is a subclass of org.apache.hadoop.mapreduce.InputFormat<oracle.kv.Key, org.apache.avro.generic.IndexedRecord>.
See Also:
Javadoc for the KVInputFormatBase class.
Custom Input Formats
If the built-in input formats do not meet your needs, then you can write a Java class for a custom input format. The following information describes the framework in which an input format works in Oracle Loader for Hadoop.
About Implementing a Custom Input Format
Oracle Loader for Hadoop gets its input from a class extending org.apache.hadoop.mapreduce.InputFormat
. You must specify the name of that class in the mapreduce.job.inputformat.class
configuration property.
The input format must create RecordReader
instances that return an Avro IndexedRecord
input object from the getCurrentValue
method. Use this method signature:
public org.apache.avro.generic.IndexedRecord getCurrentValue() throws IOException, InterruptedException;
Oracle Loader for Hadoop uses the schema of the IndexedRecord
input object to discover the names of the input fields and map them to the columns of the target table.
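To make this contract concrete, here is a minimal, hedged sketch of a custom input format that wraps Hadoop's LineRecordReader and exposes each text line as a one-field Avro record. The class name, schema, and field name are invented for illustration; the shipped CSVInputFormat example (described later in this chapter) is the authoritative reference.

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.IndexedRecord;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

/** Illustrative only: presents each text line as a one-field Avro record. */
public class SingleFieldInputFormat
    extends FileInputFormat<NullWritable, IndexedRecord> {

  // Field name F0 mirrors the naming used by the shipped CSV example.
  private static final Schema SCHEMA =
      SchemaBuilder.record("line").fields().requiredString("F0").endRecord();

  @Override
  public RecordReader<NullWritable, IndexedRecord> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new RecordReader<NullWritable, IndexedRecord>() {
      private final LineRecordReader lines = new LineRecordReader();
      private GenericData.Record record;

      @Override
      public void initialize(InputSplit s, TaskAttemptContext c) throws IOException {
        lines.initialize(s, c);
      }

      @Override
      public boolean nextKeyValue() throws IOException {
        if (!lines.nextKeyValue()) {
          return false;
        }
        record = new GenericData.Record(SCHEMA);
        record.put("F0", lines.getCurrentValue().toString());
        return true;
      }

      // Oracle Loader for Hadoop maps record fields to table columns by
      // schema field name, and uses getCurrentKey().toString() in error
      // messages for rejected records.
      @Override
      public NullWritable getCurrentKey() {
        return NullWritable.get();
      }

      @Override
      public IndexedRecord getCurrentValue() {
        return record;
      }

      @Override
      public float getProgress() throws IOException {
        return lines.getProgress();
      }

      @Override
      public void close() throws IOException {
        lines.close();
      }
    };
  }
}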
About Error Handling
If processing an IndexedRecord
value results in an error, Oracle Loader for Hadoop uses the object returned by the getCurrentKey
method of the RecordReader
to provide feedback. It calls the toString
method of the key and formats the result in an error message. InputFormat
developers can assist users in identifying the rejected records by returning one of the following:
- Data file URI
- InputSplit information
- Data file name and the record offset in that file
Oracle recommends that you do not return the record in clear text, because it might contain sensitive information; the returned values can appear in Hadoop logs throughout the cluster. See "Log Rejected Records in Bad Files."
If a record fails and the key is null, then the loader generates no identifying information.
Supporting Data Sampling
Oracle Loader for Hadoop uses a sampler to improve performance of its MapReduce job. The sampler is multithreaded, and each sampler thread instantiates its own copy of the supplied InputFormat
class. When implementing a new InputFormat
, ensure that it is thread-safe. See "Balancing Loads When Loading Data into Partitioned Tables."
InputFormat Source Code Example
Oracle Loader for Hadoop provides the source code for an InputFormat
example.
In order to access the examples, unzip the examples.zip file, which is in $OLH_HOME. You can find the InputFormat example in the examples/jsrc directory.
The sample format loads data from a simple, comma-separated value (CSV) file. To use this input format, add $OLH_HOME/examples/oraloader-examples.jar
to HADOOP_CLASSPATH
and specify oracle.hadoop.loader.examples.CSVInputFormat
as the value of mapreduce.job.inputformat.class
in the job configuration file.
This input format automatically assigns field names of F0, F1, F2, and so forth. It does not have configuration properties.
Mapping Input Fields to Target Table Columns
Mapping identifies which input fields are loaded into which columns of the target table. You may be able to use the automatic mapping facilities, or you can always manually map the input fields to the target columns.
Automatic Mapping
Oracle Loader for Hadoop can automatically map the fields to the appropriate columns when the input data complies with these requirements:
- All columns of the target table are loaded.
- The input data field names in the IndexedRecord input object exactly match the column names. For example, when you load from a Hive table, the names of the Oracle target table columns exactly match the names of the Hive table columns.
- All input fields that are mapped to DATE columns can be parsed using the same Java date format.
Use these configuration properties for automatic mappings:
- oracle.hadoop.loader.loaderMap.targetTable: Identifies the target table.
- oracle.hadoop.loader.defaultDateFormat: Specifies a default date format that applies to all DATE fields.
Manual Mapping
For loads that do not comply with the requirements for automatic mapping, you must define additional properties. These properties enable you to:
- Load data into a subset of the target table columns.
- Create explicit mappings when the input field names are not identical to the database column names.
- Specify different date formats for different input fields.
Use these properties for manual mappings:
- oracle.hadoop.loader.loaderMap.targetTable: Identifies the target table. Required.
- oracle.hadoop.loader.loaderMap.columnNames: Lists the columns to be loaded.
- oracle.hadoop.loader.defaultDateFormat: Specifies a default date format that applies to all DATE fields.
- oracle.hadoop.loader.loaderMap.column_name.format: Specifies the data format for a particular column.
- oracle.hadoop.loader.loaderMap.column_name.field: Identifies the name of an Avro record field mapped to a particular column.
Note:
Manual mapping is particularly useful when different date columns have different formats.
Manual Mapping: Examples
The following are the examples of manual mapping:
Configuration File conf.xml when loading from a text file
When you load delimited text from text files on HDFS, use F0, F1, and so on to refer to the fields in the text files. In this example, F0 maps to EMPLOYEE_ID in the columnNames property, F1 maps to LAST_NAME, and so on.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?> <configuration> <property> <name>oracle.hadoop.loader.loaderMap.targetTable</name> <value>HR.EMPLOYEES</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.columnNames</name> <value>EMPLOYEE_ID,LAST_NAME,EMAIL,HIRE_DATE,JOB_ID</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.EMPLOYEE_ID.field</name> <value>F0</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.EMPLOYEE_ID.format</name> <value></value> </property> <property> <name>oracle.hadoop.loader.loaderMap.LAST_NAME.field</name> <value>F1</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.LAST_NAME.format</name> <value></value> </property> <property> <name>oracle.hadoop.loader.loaderMap.EMAIL.field</name> <value>F2</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.EMAIL.format</name> <value></value> </property> <property> <name>oracle.hadoop.loader.loaderMap.HIRE_DATE.field</name> <value>F3</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.HIRE_DATE.format</name> <value>MM-dd-yyyy</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.JOB_ID.field</name> <value>F4</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.JOB_ID.format</name> <value></value> </property> </configuration>
Configuration File conf.xml when loading from a Hive table
<?xml version="1.0" encoding="UTF-8" standalone="yes"?> <configuration> <property> <name>oracle.hadoop.loader.loaderMap.targetTable</name> <value>HR.EMPLOYEES</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.columnNames</name> <value>EMPLOYEE_ID,LAST_NAME,EMAIL,HIRE_DATE,JOB_ID</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.EMPLOYEE_ID.field</name> <value>EMPLOYEE_ID</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.EMPLOYEE_ID.format</name> <value></value> </property> <property> <name>oracle.hadoop.loader.loaderMap.LAST_NAME.field</name> <value>LAST_NAME</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.LAST_NAME.format</name> <value></value> </property> <property> <name>oracle.hadoop.loader.loaderMap.EMAIL.field</name> <value>EMAIL</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.EMAIL.format</name> <value></value> </property> <property> <name>oracle.hadoop.loader.loaderMap.HIRE_DATE.field</name> <value>HIRE_DATE</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.HIRE_DATE.format</name> <value>MM-dd-yyyy</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.JOB_ID.field</name> <value>JOB_ID</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.JOB_ID.format</name> <value></value> </property> </configuration>
About Output Formats
In online database mode, you can choose between loading the data directly into an Oracle database table or storing it in a file. In offline database mode, you are restricted to storing the output data in a file, which you can load into the target table as a separate procedure. You specify the output format in the job configuration file using the mapreduce.job.outputformat.class
property.
Choose from these output formats:
- JDBC Output Format: Loads the data directly into the target table.
- Oracle OCI Direct Path Output Format: Loads the data directly into the target table.
- Delimited Text Output Format: Stores the data in a local file.
- Oracle Data Pump Output Format: Stores the data in a local file.
JDBC Output Format
You can use a JDBC connection between the Hadoop system and Oracle Database to load the data. The output records of the loader job are loaded directly into the target table by map or reduce tasks as part of the OraLoader
process, in online database mode. No additional steps are required to load the data.
A JDBC connection must be open between the Hadoop cluster and the Oracle Database system for the duration of the job.
To use this output format, set mapreduce.job.outputformat.class
to
oracle.hadoop.loader.lib.output.JDBCOutputFormat
About JDBCOutputFormat
JDBCOutputFormat
uses standard JDBC batching to optimize performance and efficiency. If an error occurs during batch execution, such as a constraint violation, the JDBC driver stops execution immediately. Thus, if there are 100 rows in a batch and the tenth row causes an error, then nine rows are inserted and 91 rows are not.
The JDBC driver does not identify the row that caused the error, and so Oracle Loader for Hadoop does not know the insert status of any of the rows in the batch. It counts all rows in a batch with errors as "in question," that is, the rows may or may not be inserted in the target table. The loader then continues loading the next batch. It generates a load report at the end of the job that details the number of batch errors and the number of rows in question.
One way to handle this problem is to define a unique key in the target table. For example, the HR.EMPLOYEES table has a primary key named EMPLOYEE_ID. After loading the data into HR.EMPLOYEES, you can query it by EMPLOYEE_ID to discover the missing employee IDs. Then you can locate the missing employee IDs in the input data, determine why they failed to load, and try again to load them.
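A hedged SQL sketch of that check, assuming the input employee IDs have been staged in a hypothetical table EMP_INPUT_IDS:
-- Employee IDs present in the input but missing from the target table.
SELECT i.employee_id
FROM   emp_input_ids i
MINUS
SELECT e.employee_id
FROM   hr.employees e;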
Oracle OCI Direct Path Output Format
You can use the direct path interface of Oracle Call Interface (OCI) to load data into the target table. Each reducer loads into a distinct database partition in online database mode, enabling the performance gains of a parallel load. No additional steps are required to load the data.
The OCI connection must be open between the Hadoop cluster and the Oracle Database system for the duration of the job.
To use this output format, set mapreduce.job.outputformat.class
to
oracle.hadoop.loader.lib.output.OCIOutputFormat
About OCIOutputFormat
OCIOutputFormat
has the following restrictions:
- It is available only on the Linux x86.64 platform.
- The MapReduce job must create one or more reducers.
- The target table must be partitioned.
- For Oracle Database 11g (11.2.0.3), apply the patch for bug 13498646 if the target table is a composite interval partitioned table in which the subpartition key contains a CHAR, VARCHAR2, NCHAR, or NVARCHAR2 column. Later versions of Oracle Database do not require this patch.
Delimited Text Output Format
You can create delimited text output files on the Hadoop cluster. The map or reduce tasks generate delimited text files, using the field delimiters and enclosers that you specify in the job configuration properties. Afterward, you can load the data into an Oracle database as a separate procedure. See "About DelimitedTextOutputFormat."
This output format can use either an open connection to the Oracle Database system to retrieve the table metadata in online database mode, or a table metadata file generated by the OraloaderMetadata
utility in offline database mode.
To use this output format, set mapreduce.job.outputformat.class
to
oracle.hadoop.loader.lib.output.DelimitedTextOutputFormat
About DelimitedTextOutputFormat
Output tasks generate delimited text format files, one or more corresponding SQL*Loader control files, and SQL scripts for loading with external tables.
If the target table is not partitioned or if oracle.hadoop.loader.loadByPartition is false, then DelimitedTextOutputFormat generates the following files:
- A data file named oraloader-taskId-csv-0.dat.
- A SQL*Loader control file named oraloader-csv.ctl for the entire job.
- A SQL script named oraloader-csv.sql to load the delimited text file into the target table.
For partitioned tables, multiple output files are created with the following names:
- Data files: oraloader-taskId-csv-partitionId.dat
- SQL*Loader control files: oraloader-taskId-csv-partitionId.ctl
- SQL script: oraloader-csv.sql
In the generated file names, taskId is the mapper or reducer identifier, and partitionId is the partition identifier.
If the Hadoop cluster is connected to the Oracle Database system, then you can use Oracle SQL Connector for HDFS to load the delimited text data into an Oracle database. See Oracle SQL Connector for Hadoop Distributed File System.
Alternatively, you can copy the delimited text files to the database system and load the data into the target table in one of the following ways:
-
Use the generated control files to run SQL*Loader and load the data from the delimited text files.
-
Use the generated SQL scripts to perform external table loads.
The files are located in the ${mapreduce.output.fileoutputformat.outputdir}/_olh
directory.
Configuration Properties
The formatting of records and fields in the output files is controlled by job configuration properties, which are described in the configuration property reference.
The following example shows a sample SQL*Loader control file that might be generated by an output task.
Example 3-2 Sample SQL*Loader Control File
LOAD DATA CHARACTERSET AL32UTF8
INFILE 'oraloader-csv-1-0.dat'
BADFILE 'oraloader-csv-1-0.bad'
DISCARDFILE 'oraloader-csv-1-0.dsc'
INTO TABLE "SCOTT"."CSV_PART" PARTITION(10) APPEND
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(
  "ID" DECIMAL EXTERNAL,
  "NAME" CHAR,
  "DOB" DATE 'SYYYY-MM-DD HH24:MI:SS'
)
Oracle Data Pump Output Format
You can create Data Pump format files on the Hadoop cluster. The map or reduce tasks generate Data Pump files. Afterward, you can load the data into an Oracle database as a separate procedure. See "About DataPumpOutputFormat."
This output format can use either an open connection to the Oracle Database system in online database mode, or a table metadata file generated by the OraloaderMetadata
utility in offline database mode.
To use this output format, set mapreduce.job.outputformat.class
to
oracle.hadoop.loader.lib.output.DataPumpOutputFormat
About DataPumpOutputFormat
DataPumpOutputFormat generates data files with names in this format:
oraloader-taskId-dp-partitionId.dat
In the generated file names, taskId is the mapper or reducer identifier, and partitionId is the partition identifier.
If the Hadoop cluster is connected to the Oracle Database system, then you can use Oracle SQL Connector for HDFS to load the Data Pump files into an Oracle database. See Oracle SQL Connector for Hadoop Distributed File System.
Alternatively, you can copy the Data Pump files to the database system and load them using a SQL script generated by Oracle Loader for Hadoop. The script performs the following tasks:
- Creates an external table definition using the ORACLE_DATAPUMP access driver. The binary format Oracle Data Pump output files are listed in the LOCATION clause of the external table.
- Creates a directory object that is used by the external table. You must uncomment this command before running the script. To specify the directory name used in the script, set the oracle.hadoop.loader.extTabDirectoryName property in the job configuration file.
- Inserts the rows from the external table into the target table. You must uncomment this command before running the script.
The SQL script is located in the ${mapreduce.output.fileoutputformat.outputdir}/_olh
directory.
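The generated script is specific to each job, but its shape is roughly as follows. This is a hedged sketch with invented object names, columns, and file names, not actual generated output:
-- The CREATE DIRECTORY command is generated commented out; uncomment and adjust it.
CREATE DIRECTORY olh_ext_dir AS '/u01/oraloader/output';

CREATE TABLE movie_facts_ext (
  cust_id   NUMBER,
  movie_id  NUMBER,
  rating    NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_DATAPUMP
  DEFAULT DIRECTORY olh_ext_dir
  LOCATION ('oraloader-00000-dp-1.dat', 'oraloader-00001-dp-1.dat')
);

-- The INSERT is also generated commented out; uncomment it to load.
INSERT INTO movie_facts SELECT * FROM movie_facts_ext;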
See Also:
-
Oracle Database Administrator's Guide for more information about creating and managing external tables
-
Oracle Database Utilities for more information about the
ORACLE_DATAPUMP
access driver
Run a Loader Job
To run a job using Oracle Loader for Hadoop, you use the OraLoader
utility in a hadoop
command.
The following is the basic syntax:
hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
  -conf job_config.xml \
  -libjars input_file_format1.jar[,input_file_format2.jar...]
You can include any generic hadoop
command-line option. OraLoader
implements the org.apache.hadoop.util.Tool
interface and follows the standard Hadoop methods for building MapReduce applications.
Unzip the examples.zip file in $OLH_HOME in order to use the oraloader-examples.jar file in the instructions below.
Basic Options
-conf job_config.xml
Identifies the job configuration file. See "Create a Job Configuration File."
-libjars
Identifies the JAR files for the input format:
- When using the example input format, specify $OLH_HOME/jlib/oraloader-examples.jar. (You will first need to set up the example for use, as described in InputFormat Source Code Example.)
- When using the Hive or Oracle NoSQL Database input formats, you must specify additional JAR files, as described later in this section.
- When using a custom input format, specify its JAR file. Also remember to add the JAR to HADOOP_CLASSPATH.
Separate multiple file names with commas, and list each one explicitly. Wildcard characters and spaces are not allowed.
Oracle Loader for Hadoop prepares internal configuration information for the MapReduce tasks. It stores table metadata information and the dependent Java libraries in the distributed cache, so that they are available to the MapReduce tasks throughout the cluster.
Example of Running OraLoader
The following example uses a built-in input format and a job configuration file named MyConf.xml:
HADOOP_CLASSPATH="$OLH_HOME/jlib/*:$OLH_HOME/examples/oraloader-examples.jar:$HADOOP_CLASSPATH"
hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
  -conf MyConf.xml -libjars $OLH_HOME/jlib/oraloader-examples.jar
Specify Hive Input Format JAR Files
When using HiveToAvroInputFormat
, you must add the Hive configuration directory to the HADOOP_CLASSPATH
environment variable:
HADOOP_CLASSPATH="$OLH_HOME/jlib/*:hive_home/lib/*:hive_conf_dir:$HADOOP_CLASSPATH"
You must also add the following Hive JAR files, in a comma-separated list, to the -libjars
option of the hadoop
command. Replace the stars (*) with the complete file names on your system:
- hive-exec-*.jar
- hive-metastore-*.jar
- libfb303*.jar
This example shows the full file names in Cloudera's Distribution including Apache Hadoop (CDH) 5.8:
# hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
  -conf MyConf.xml \
  -libjars hive-exec-1.1.0-cdh5.8.0.jar,hive-metastore-1.1.0-cdh5.8.0.jar,libfb303-0.9.3.jar
Specify Oracle NoSQL Database Input Format JAR Files
When using KVAvroInputFormat
from Oracle NoSQL Database 11g, Release 2, you must include $KVHOME/lib/kvstore.jar
in your HADOOP_CLASSPATH
and you must include the -libjars
option in the hadoop
command:
hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
  -conf MyConf.xml \
  -libjars $KVHOME/lib/kvstore.jar
Job Reporting
Oracle Loader for Hadoop consolidates reporting information from individual tasks into a file named ${mapreduce.output.fileoutputformat.outputdir}/_olh/oraloader-report.txt
. Among other statistics, the report shows the number of errors, broken out by type and task, for each mapper and reducer.
Handling Rejected Records
Oracle Loader for Hadoop may reject input records for a variety of reasons, such as:
- Errors in the mapping properties
- Missing fields in the input data
- Records mapped to invalid table partitions
- Badly formed records, such as dates that do not match the date format or records that do not match regular expression patterns
Log Rejected Records in Bad Files
By default, Oracle Loader for Hadoop does not log the rejected records into Hadoop logs; it only logs information on how to identify the rejected records. This practice prevents user-sensitive information from being stored in Hadoop logs across the cluster.
You can direct Oracle Loader for Hadoop to log rejected records by setting the oracle.hadoop.loader.logBadRecords
configuration property to true
. Then Oracle Loader for Hadoop logs bad records into one or more "bad" files in the _olh/
directory under the job output directory.
Set a Job Reject Limit
Some problems can cause Oracle Loader for Hadoop to reject every record in the input. To mitigate the loss of time and resources, Oracle Loader for Hadoop aborts the job after rejecting 1000 records.
You can change the maximum number of rejected records allowed by setting the oracle.hadoop.loader.rejectLimit
configuration property. A negative value turns off the reject limit and allows the job to run to completion regardless of the number of rejected records.
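For example, you can raise the limit on the hadoop command line with the generic -D option (the value 10000 is an arbitrary illustration):
hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
  -conf MyConf.xml \
  -D oracle.hadoop.loader.rejectLimit=10000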
Balancing Loads When Loading Data into Partitioned Tables
The goal of load balancing is to generate a MapReduce partitioning scheme that assigns approximately the same amount of work to all reducers.
The sampling feature of Oracle Loader for Hadoop balances loads across reducers when data is loaded into a partitioned database table. It generates an efficient MapReduce partitioning scheme that assigns database partitions to the reducers.
The execution time of a reducer is usually proportional to the number of records that it processes—the more records, the longer the execution time. When sampling is disabled, all records from a given database partition are sent to one reducer. This can result in unbalanced reducer loads, because some database partitions may have more records than others. Because the execution time of a Hadoop job is usually dominated by the execution time of its slowest reducer, unbalanced reducer loads slow down the entire job.
Use the Sampling Feature
You can turn the sampling feature on or off by setting the oracle.hadoop.loader.sampler.enableSampling
configuration property. Sampling is turned on by default.
Tuning Load Balancing
These job configuration properties control the quality of load balancing: oracle.hadoop.loader.sampler.maxLoadFactor and oracle.hadoop.loader.sampler.loadCI.
The sampler uses the expected reducer load factor to evaluate the quality of its partitioning scheme. The load factor is the relative overload for each reducer, calculated as (assigned_load - ideal_load)/ideal_load. This metric indicates how much a reducer's load deviates from a perfectly balanced reducer load. A load factor of 0 indicates a perfectly balanced load (no overload); for example, if the ideal load is 1,000,000 records and a reducer is assigned 1,050,000 records, its load factor is 0.05.
Small load factors indicate better load balancing. The maxLoadFactor
default of 0.05 means that no reducer is ever overloaded by more than 5%. The sampler guarantees this maxLoadFactor
with a statistical confidence level determined by the value of loadCI
. The default value of loadCI
is 0.95, which means that any reducer's load factor exceeds maxLoadFactor
in only 5% of the cases.
There is a trade-off between the execution time of the sampler and the quality of load balancing. Lower values of maxLoadFactor
and higher values of loadCI
achieve more balanced reducer loads at the expense of longer sampling times. The default values of maxLoadFactor=0.05
and loadCI=0.95
are a good trade-off between load balancing quality and execution time.
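For instance, a stricter balancing policy could be requested on the command line as follows. The property names combine the short names used above with the oracle.hadoop.loader.sampler prefix; treat them as assumptions to verify in the configuration property reference:
hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
  -conf MyConf.xml \
  -D oracle.hadoop.loader.sampler.maxLoadFactor=0.02 \
  -D oracle.hadoop.loader.sampler.loadCI=0.99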
Tuning Sampling Behavior
By default, the sampler runs until it collects just enough samples to generate a partitioning scheme that satisfies the maxLoadFactor
and loadCI
criteria.
However, you can limit the sampler running time by setting the oracle.hadoop.loader.sampler.maxSamplesPct
property, which specifies the maximum number of sampled records.
When Does Oracle Loader for Hadoop Use the Sampler's Partitioning Scheme?
Oracle Loader for Hadoop uses the generated partitioning scheme only if sampling is successful. A sampling is successful if it generates a partitioning scheme with a maximum reducer load factor of (1+ maxLoadFactor
) guaranteed at a statistical confidence level of loadCI
.
The partition report identifies the keys that are assigned to the various reducers. The report is saved in XML for the sampler to use; it does not contain information of use to you. The report is named ${mapreduce.output.fileoutputformat.outputdir}/_balancer/orabalancer_report.xml. It is generated only for sampled jobs. This XML file contains the information about how to assign map output to the different reducers, as well as the sampling statistics.
The default values of maxLoadFactor
, loadCI
, and maxSamplesPct
allow the sampler to successfully generate high-quality partitioning schemes for a variety of different input data distributions. However, the sampler might be unsuccessful in generating a partitioning scheme using custom property values, such as when the constraints are too rigid or the number of required samples exceeds the user-specified maximum of maxSamplesPct
. In these cases, Oracle Loader for Hadoop generates a log message identifying the problem, partitions the records using the database partitioning scheme, and does not guarantee load balancing.
Alternatively, you can reset the configuration properties to less rigid values: either increase maxSamplesPct, or relax the balancing criteria by increasing maxLoadFactor or decreasing loadCI, or both.
Resolve Memory Issues
A custom input format may return input splits that do not fit in memory. If this happens, the sampler returns an out-of-memory error on the client node where the loader job is submitted.
To resolve this problem:
- Increase the heap size of the JVM where the job is submitted.
- Adjust the sampler configuration properties.
If you are developing a custom input format, then see "Custom Input Formats."
What Happens When a Sampling Feature Property Has an Invalid Value?
If any configuration properties of the sampling feature are set to values outside the accepted range, an exception is not returned. Instead, the sampler prints a warning message, resets the property to its default value, and continues executing.
Optimize Communications Between Oracle Engineered Systems
If you are using Oracle Loader for Hadoop to load data from Oracle Big Data Appliance to Oracle Exadata Database Machine, then you can increase throughput by configuring the systems to use Sockets Direct Protocol (SDP) over the InfiniBand private network. This setup provides an additional connection attribute whose sole purpose is serving connections to Oracle Database to load data.
To specify SDP protocol:
- Add JVM options to the HADOOP_OPTS environment variable to enable JDBC SDP export:
HADOOP_OPTS="-Doracle.net.SDP=true -Djava.net.preferIPv4Stack=true"
- Set this Hadoop configuration property for the child task JVMs:
-D mapred.child.java.opts="-Doracle.net.SDP=true -Djava.net.preferIPv4Stack=true"
Note:
This Hadoop configuration property can either be added to the OLH command line or set in the configuration file.
- Configure standard Ethernet communications. In the job configuration file, set oracle.hadoop.loader.connection.url using this syntax:
jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=
   (ADDRESS=(PROTOCOL=TCP)(HOST=hostName)(PORT=portNumber)))
   (CONNECT_DATA=(SERVICE_NAME=serviceName)))
- Configure the Oracle listener on Exadata to support the SDP protocol and bind it to a specific port address (such as 1522). In the job configuration file, specify the listener address as the value of oracle.hadoop.loader.connection.oci_url using this syntax:
(DESCRIPTION=(ADDRESS=(PROTOCOL=SDP)
   (HOST=hostName) (PORT=portNumber))
   (CONNECT_DATA=(SERVICE_NAME=serviceName)))
Replace hostName, portNumber, and serviceName with the appropriate values to identify the SDP listener on your Oracle Exadata Database Machine.
See Also:
Oracle Big Data Appliance Software User's Guide for more information about configuring communications over InfiniBand
Oracle Loader for Hadoop Configuration Property Reference
OraLoader
uses the standard methods of specifying configuration properties in the hadoop command. You can use the -conf
option to identify configuration files, and the -D
option to specify individual properties.
This section describes the OraLoader configuration properties, the Oracle NoSQL Database configuration properties, and a few generic Hadoop MapReduce properties that you typically must set for an OraLoader job.
A configuration file showing all OraLoader properties is in $OLH_HOME/doc/oraloader-conf.xml
.
See Also:
The Hadoop documentation for job configuration files.
MapReduce Configuration Properties
| Property | Description |
|---|---|
| mapreduce.job.name | Type: String. Default value: OraLoader. Description: The Hadoop job name. A unique name can help you monitor the job using tools such as the Hadoop JobTracker web interface and Cloudera Manager. |
| mapreduce.input.fileinputformat.inputdir | Type: String. Default value: not defined. Description: A comma-separated list of input directories. |
| mapreduce.job.inputformat.class | Type: String. Default value: not defined. Description: Identifies the format of the input data. You can enter one of the built-in input formats, or the name of a custom InputFormat class. See "About Input Formats" for descriptions of the built-in input formats. |
| mapreduce.output.fileoutputformat.outputdir | Type: String. Default value: not defined. Description: A comma-separated list of output directories, which cannot exist before the job runs. Required. |
| mapreduce.job.outputformat.class | Type: String. Default value: not defined. Description: Identifies the output type; the values are the output format classes described in "About Output Formats." |
| mapreduce.job.reduces | Type: Integer. Default value: 1. Description: The number of reduce tasks used by the Oracle Loader for Hadoop job. The default value of 1 does not support parallel processing, so performance improves when the value is increased to support multiple parallel data loads. Choose a value that provides an ample, but not excessive, number of reduce tasks for the job. Beyond a point dictated by the available resources, an excessive increase in the number of reduce tasks yields diminishing improvements and can degrade the performance of other jobs. |
OraLoader Configuration Properties
Property | Description |
---|---|
oracle.hadoop.loader.badRecordFlushInterval |
Type: Integer Default Value: Description: Sets the maximum number of records that a task attempt can log before flushing the log file. This setting limits the number of records that can be lost when the record reject limit (oracle.hadoop.loader.rejectLimit) is reached and the job stops running. The oracle.hadoop.loader.logBadRecords property must be set to true for this interval to take effect. |
oracle.hadoop.loader.compressionFactors |
Type: Decimal Default Value: Description: These values are used by Oracle Loader for Hadoop when sampling is enabled and the target table is compressed. They are the compression factors of the target table. For best performance, the values of this property should match the compression factors of the target table. The values are a comma-delimited list of name=value pairs. The names must be one of the following keywords: BASIC, OLTP, QUERY_LOW, QUERY_HIGH, ARCHIVE_LOW, ARCHIVE_HIGH. |
oracle.hadoop.loader.connection.defaultExecuteBatch |
Type: Integer Default Value: Description: The number of records inserted in one trip to the database. It applies only to JDBCOutputFormat and OCIOutputFormat. Specify a value greater than or equal to 1. Although the maximum value is unlimited, very large batch sizes are not recommended because they result in a large memory footprint without much increase in performance. A value less than 1 sets the property to the default value. |
oracle.hadoop.loader.connection.oci_url |
Type: String Default Value: Value of oracle.hadoop.loader.connection.url Description: The database connection string used by OCIOutputFormat. The following syntax specifies Socket Direct Protocol (SDP) for OCI connections: (DESCRIPTION=(ADDRESS=(PROTOCOL=SDP)(HOST=hostName)(PORT=portNumber))(CONNECT_DATA=(SERVICE_NAME=serviceName)))
This connection string does not require a "jdbc:oracle:thin:@" prefix. All characters up to and including the first at-sign (@) are removed. |
oracle.hadoop.loader.connection.password |
Type: String Default Value: Not defined Description: Password for the connecting user. Oracle recommends that you do not store your password in clear text. Use an Oracle wallet instead. |
oracle.hadoop.loader.connection.sessionTimeZone |
Type: String Default Value: Description: Alters the session time zone for database connections. Valid values are an hours-and-minutes offset from UTC in [+|-]hh:mm format, LOCAL (the default time zone of the JVM), or a JVM time zone region name such as America/New_York.
This property also determines the default time zone for input data that is loaded into TIMESTAMP WITH TIME ZONE and TIMESTAMP WITH LOCAL TIME ZONE database column types. |
oracle.hadoop.loader.connection.cluster.tns_admin |
Type: String Default Value: Not defined. Description: The TNS admin location on the cluster nodes if it is different from the client-side location. By default, the client-side TNS admin location is the same as the location on cluster nodes, and it is specified by oracle.hadoop.loader.connection.tns_admin. It is invalid to specify this property without specifying oracle.hadoop.loader.connection.tns_admin. |
oracle.hadoop.loader.connection.directory |
Type: String Default Value: Not defined Description: File path to a directory on each node of the Hadoop cluster. |
oracle.hadoop.loader.connection.tns_admin |
Type: String Default Value: Not defined Description: File path to a directory on each node of the Hadoop cluster, which contains SQL*Net configuration files such as sqlnet.ora and tnsnames.ora. You must set this property when using an Oracle wallet as an external password store (as Oracle recommends). See oracle.hadoop.loader.connection.wallet_location. |
oracle.hadoop.loader.connection.tnsEntryName |
Type: String Default Value: Not defined Description: A TNS entry name defined in the tnsnames.ora file. Use this property with oracle.hadoop.loader.connection.tns_admin. |
oracle.hadoop.loader.connection.url |
Type: String Default Value: Not defined Description: The URL of the database connection. This property overrides all other connection properties. If an Oracle wallet is configured as an external password store (as Oracle recommends), then the property value must start with the jdbc:oracle:thin:@ driver prefix. The following examples show valid values of connection URLs:
|
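The general form of such URLs is shown in this sketch; the host, port, service, and TNS entry names are placeholders, not values from this manual.

```
# Thin-style URL (host, port, and service name are placeholders):
jdbc:oracle:thin:@//myhost:1521/my_service

# TNS entry name defined in tnsnames.ora (requires
# oracle.hadoop.loader.connection.tns_admin):
jdbc:oracle:thin:@my_tns_entry
```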
oracle.hadoop.loader.connection.user |
Type: String Default Value: Not defined Description: A database user name. This property requires that you also set oracle.hadoop.loader.connection.password. When using online database mode, you must set either this property or oracle.hadoop.loader.connection.wallet_location. |
oracle.hadoop.loader.connection.wallet_location |
Type: String Default Value: Not defined Description: File path to an Oracle wallet directory on each node of the Hadoop cluster, where the connection credentials are stored. When using an Oracle wallet, you must also set the following properties: oracle.hadoop.loader.connection.tns_admin, and either oracle.hadoop.loader.connection.url or oracle.hadoop.loader.connection.tnsEntryName. |
oracle.hadoop.loader.connection.cluster.wallet_location |
Type: String Default Value: Not defined. Description: The wallet location on the cluster nodes if it is different from the client-side location. By default, the client-side wallet location is the same as the location on cluster nodes, and it is specified by oracle.hadoop.loader.connection.wallet_location. It is invalid to specify this property without specifying oracle.hadoop.loader.connection.wallet_location. |
oracle.hadoop.loader.defaultDateFormat |
Type: String Default Value: Description: Parses an input field into a DATE column using a java.text.SimpleDateFormat pattern and the default locale. |
oracle.hadoop.loader.enableSorting |
Type: Boolean Default Value: Description: Controls whether output records within each reducer group are sorted. Use the oracle.hadoop.loader.sortKey property to identify the columns of the sort key. |
oracle.hadoop.loader.enforceClasspath |
Type: Boolean Default Value: Description: To prevent mismatched versions of its JARs from being added to the classpath, Oracle Loader for Hadoop checks that its internal classes are loaded from the expected JAR files. To disable this check, set the property to false. |
oracle.hadoop.loader.extTabDirectoryName |
Type: String Default Value: Description: The name of the database directory object for the external table LOCATION data files. This property applies only to DelimitedTextOutputFormat and DataPumpOutputFormat. |
oracle.hadoop.loader.input.fieldNames |
Type: String Default Value: Description: A comma-delimited list of names for the input fields. For the built-in input formats, specify names for all fields in the data, not just the fields of interest. If an input line has more fields than this property has field names, then the extra fields are discarded. If a line has fewer fields than this property has field names, then the extra fields are set to null. See "Mapping Input Fields to Target Table Columns" for loading only selected fields. The names are used to create the Avro schema for the record, so they must be valid JSON name strings. This property applies to DelimitedTextInputFormat and RegexInputFormat. |
oracle.hadoop.loader.input.fieldTerminator |
Type: String Default Value: Description: A character that indicates the end of an input field for DelimitedTextInputFormat. The value can be either a single character or \uHHHH, where HHHH is the character's UTF-16 encoding. |
oracle.hadoop.loader.input.hive.databaseName |
Type: String Default Value: Not defined Description: The name of the Hive database where the input table is stored |
oracle.hadoop.loader.input.hive.partitionFilter Note: This property is deprecated. Use oracle.hadoop.loader.input.hive.rowFilter instead.
|
Type: String Default Value: Not defined Description: A valid HiveQL expression that is used to filter the source Hive table partitions for HiveToAvroInputFormat. The expression must conform to the following restrictions:
You can use the Hive CLI to test the expression and ensure that it returns the expected results. Three cases are worth noting; an illustrative sketch follows this entry.
Example 1: Nested Expressions. Partition columns can be combined with nested boolean expressions.
Example 2: Built-in Functions. Built-in functions can be applied to partition columns.
Example 3: Bad Usage: Columns That Are Not Used to Partition the Table. Filtering on c, a column that is not used to partition the table, is unnecessary and can cause unexpected results: combining such a filter with AND is equivalent to the same expression without it, while combining it with OR loads all partitions, because any partition could contain matching rows.
|
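The following sketch illustrates the three cases; the table definition and filter expressions are hypothetical stand-ins for the original listings, not values from this manual.

```sql
-- Hypothetical partitioned source table (illustrative only).
CREATE TABLE t (c STRING)
  PARTITIONED BY (p1 STRING, p2 INT);

-- Example 1: nested expressions over partition columns.
--   p1 LIKE 'abc%' OR (p2 >= 10 AND p2 < 20)

-- Example 2: a built-in function applied to a partition column.
--   length(p1) = 3

-- Example 3: bad usage of the non-partition column c.
--   (p1 = 'abc') AND (c LIKE 'abc%')   -- equivalent to: p1 = 'abc'
--   (p1 = 'abc') OR  (c LIKE 'abc%')   -- loads ALL partitions
```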
oracle.hadoop.loader.input.hive.rowFilter |
Type: String Default Value: Not defined Description: A valid HiveQL expression that is used to filter the rows of the source Hive table for HiveToAvroInputFormat. If this value is not set (the default), Oracle Loader for Hadoop attempts to use the value of oracle.hadoop.loader.input.hive.partitionFilter, provided the table is partitioned. Otherwise, Oracle Loader for Hadoop loads the entire source Hive table. The expression must conform to the following restrictions:
In both cases, you can use the Hive CLI to test the expression and ensure that it returns the expected results. The examples parallel those shown for oracle.hadoop.loader.input.hive.partitionFilter (nested expressions and built-in functions), applied here to rows rather than partitions.
Oracle recommends that you turn on hive.optimize.index.filter when importing a subset of rows from a native Hive table.
|
oracle.hadoop.loader.input.hive.tableName |
Type: String Default Value: Not defined Description: The name of the Hive table where the input data is stored. |
oracle.hadoop.loader.input.initialFieldEncloser |
Type: String Default Value: Not defined Description: A character that indicates the beginning of a field. The value can be either a single character or \uHHHH, where HHHH is the character's UTF-16 encoding. When this property is set, the parser attempts to read each field as an enclosed token (value) before reading it as an unenclosed token. If the field enclosers are not set, then the parser reads each field as an unenclosed token. If you set this property but not oracle.hadoop.loader.input.trailingFieldEncloser, then the trailing field encloser defaults to the same character. |
oracle.hadoop.loader.input.regexCaseInsensitive |
Type: Boolean Default Value: Description: Controls whether pattern matching is case-sensitive. Set to true to ignore case, so that "string" matches "String", "STRING", "string", and so on. This property is the same as the input.regex.case.insensitive property of org.apache.hadoop.hive.contrib.serde2.RegexSerDe. |
oracle.hadoop.loader.input.regexPattern |
Type: String Default Value: Not defined Description: The pattern string for a regular expression. The regular expression must match each text line in its entirety; each capturing group becomes one input field. A sketch of a matching pattern appears after this entry.
This property is the same as the input.regex property of org.apache.hadoop.hive.contrib.serde2.RegexSerDe. See Also: For descriptions of regular expressions and capturing groups, see the java.util.regex package documentation in the Java API.
|
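As an illustration (a hypothetical line format, not this manual's original example), a pattern for a two-field comma-separated line could be written as follows; each parenthesized group yields one input field.

```
# Hypothetical input line:
#   Hello,World
# A pattern that matches the entire line, with one capturing
# group per field:
([^,]*),([^,]*)
```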
oracle.hadoop.loader.input.trailingFieldEncloser |
Type: String Default Value: The value of oracle.hadoop.loader.input.initialFieldEncloser Description: Identifies a character that marks the end of a field. The value can be either a single character or \uHHHH, where HHHH is the character's UTF-16 encoding. A field encloser cannot be the terminator or a white-space character defined for the input format. If the trailing field encloser character is embedded in an input field, then the character must be doubled up to be parsed as literal text. For example, with a single-quote encloser, an embedded quote must appear as two consecutive single quotes. If you set this property, then you must also set oracle.hadoop.loader.input.initialFieldEncloser. |
oracle.hadoop.loader.loadByPartition |
Type: Boolean Default Value: Description: Specifies a partition-aware load. Oracle Loader for Hadoop organizes the output by partition for all output formats on the Hadoop cluster; this task does not impact the resources of the database system.
|
oracle.hadoop.loader.loaderMap.columnNames |
Type: String Default Value: Not defined Description: A comma-separated list of column names in the target table, in any order. The names can be quoted or unquoted. Quoted names begin and end with double quotes (") and are used exactly as entered. Unquoted names are converted to upper case. You must also set oracle.hadoop.loader.loaderMap.targetTable. |
oracle.hadoop.loader.loaderMap.column_name.field |
Type: String Default Value: Normalized column name Description: The name of a field that contains Avro records, which is mapped to the column identified in the property name. The column name can be quoted or unquoted. A quoted name begins and ends with double quotes (") and is used exactly as entered. An unquoted name is converted to upper case. Optional. You must also set oracle.hadoop.loader.loaderMap.targetTable. |
oracle.hadoop.loader.loaderMap.column_name.format |
Type: String Default Value: Not defined Description: Specifies the data format of the data being loaded into the column identified in the property name. Use a format appropriate to the column's data type, such as a java.text.SimpleDateFormat pattern for date/time data. You must also set oracle.hadoop.loader.loaderMap.targetTable. |
oracle.hadoop.loader.loaderMap.targetTable |
Type: String Default Value: Not defined Description: A schema-qualified name for the table to be loaded. This property takes precedence over oracle.hadoop.loader.targetTable. To load a subset of columns, also set the oracle.hadoop.loader.loaderMap.columnNames property. |
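Taken together, the loader map properties might be combined as in the following configuration sketch; the table, column, and field names are hypothetical.

```xml
<!-- Hypothetical loader map: load two columns of HR.EMPLOYEES
     from input fields F0 and F1, parsing F1 as a date. -->
<property>
  <name>oracle.hadoop.loader.loaderMap.targetTable</name>
  <value>HR.EMPLOYEES</value>
</property>
<property>
  <name>oracle.hadoop.loader.loaderMap.columnNames</name>
  <value>LAST_NAME,HIRE_DATE</value>
</property>
<property>
  <name>oracle.hadoop.loader.loaderMap.LAST_NAME.field</name>
  <value>F0</value>
</property>
<property>
  <name>oracle.hadoop.loader.loaderMap.HIRE_DATE.field</name>
  <value>F1</value>
</property>
<property>
  <name>oracle.hadoop.loader.loaderMap.HIRE_DATE.format</name>
  <value>yyyy-MM-dd</value>
</property>
```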
oracle.hadoop.loader.loaderMapFile |
Loader maps are deprecated starting with Release 2.3. Use the oracle.hadoop.loader.loaderMap.* properties instead. |
oracle.hadoop.loader.log4j.propertyPrefix |
Type: String Default Value: Description: Identifies the prefix used in Apache log4j properties loaded from the job configuration file. Oracle Loader for Hadoop enables you to specify log4j properties in the hadoop command using the -conf and -D options.
All configuration properties starting with this prefix are loaded into the log4j configuration. The configuration properties are copied to log4j with their values unchanged. |
oracle.hadoop.loader.logBadRecords |
Type: Boolean Default Value: Description: Controls whether Oracle Loader for Hadoop logs bad records to a file. This property applies only to records rejected by input formats and mappers. It does not apply to errors encountered by the output formats or by the sampling feature. |
oracle.hadoop.loader.logRetentionPolicy |
Type: String Default Value: Description: Specifies when Oracle Loader for Hadoop logs should be generated and retained at the end of a job. Valid values are ALWAYS, NEVER, and ON_ERROR.
The following situations are considered errors for the purposes of logRetentionPolicy:
The following files are covered by this property:
|
oracle.hadoop.loader.olh_home |
Type: String Default Value: Value of the OLH_HOME environment variable Description: The path of the Oracle Loader for Hadoop home directory on the node where you start the OraLoader job. This path identifies the location of the required libraries. |
oracle.hadoop.loader.olhcachePath |
Type: String Default Value: Description: Identifies the full path to an HDFS directory where Oracle Loader for Hadoop can create files that are loaded into the MapReduce distributed cache. The distributed cache is a facility for caching large, application-specific files and distributing them efficiently across the nodes in a cluster. See Also: The description of the distributed cache in the Hadoop documentation. |
oracle.hadoop.loader.output.degreeOfParallelism |
Type: String Default Value: Not defined Description: If set, the value of this property controls either the number of reduce tasks for the job or, for a map-only job, the number of map tasks. If this property is not set, then the value of mapreduce.job.reduces determines the degree of parallelism.
This property provides a unified way to limit the number of database connections made by the online output formats (JDBCOutputFormat and OCIOutputFormat). |
oracle.hadoop.loader.output.dirpathBufsize |
Type: Integer Default Value: Description: Sets the size in bytes of the direct path stream buffer for OCIOutputFormat. |
oracle.hadoop.loader.output.escapeEnclosers |
Type: Boolean Default Value: Description: Controls whether the embedded trailing encloser character is handled as literal text (that is, escaped). Set this property to true when the fields may contain the trailing field encloser character. |
oracle.hadoop.loader.output.fieldTerminator |
Type: String Default Value: Description: A character that indicates the end of an output field for DelimitedTextOutputFormat. The value can be either a single character or \uHHHH, where HHHH is the character's UTF-16 encoding. |
oracle.hadoop.loader.output.granuleSize |
Type: Integer Default Value: Description: The granule size in bytes for generated Data Pump files. A granule determines the work load for a parallel process (PQ slave) when loading a file through the ORACLE_DATAPUMP access driver. See Also: Oracle Database Utilities for more information about the ORACLE_DATAPUMP access driver. |
oracle.hadoop.loader.output.initialFieldEncloser |
Type: String Default Value: Not defined Description: A character generated in the output to identify the beginning of a field. The value must be either a single character or \uHHHH, where HHHH is the character's UTF-16 encoding. Use this property when a field may contain the value of oracle.hadoop.loader.output.fieldTerminator. If you set this property, then you must also set oracle.hadoop.loader.output.trailingFieldEncloser. |
oracle.hadoop.loader.output.trailingFieldEncloser |
Type: String Default Value: Value of oracle.hadoop.loader.output.initialFieldEncloser Description: A character generated in the output to identify the end of a field. The value must be either a single character or \uHHHH, where HHHH is the character's UTF-16 encoding. Use this property when a field may contain the value of oracle.hadoop.loader.output.fieldTerminator. If you set this property, then you must also set oracle.hadoop.loader.output.initialFieldEncloser. |
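As an illustration, the following sketch configures pipe-terminated text output with double-quote enclosers; the values are hypothetical choices, not recommendations.

```xml
<!-- Illustrative delimited-text output settings: pipe terminator,
     double-quote enclosers, embedded enclosers escaped. -->
<property>
  <name>oracle.hadoop.loader.output.fieldTerminator</name>
  <value>|</value>
</property>
<property>
  <name>oracle.hadoop.loader.output.initialFieldEncloser</name>
  <value>"</value>
</property>
<property>
  <name>oracle.hadoop.loader.output.trailingFieldEncloser</name>
  <value>"</value>
</property>
<property>
  <name>oracle.hadoop.loader.output.escapeEnclosers</name>
  <value>true</value>
</property>
```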
oracle.hadoop.loader.rejectLimit |
Type: Integer Default Value: Description: The maximum number of rejected or skipped records allowed before the job stops running. A negative value turns off the reject limit and allows the job to run to completion. If speculative execution is enabled (mapreduce.map.speculative, the default), then the number of rejected records may be inflated temporarily, causing the job to stop before reaching the reject limit. Input format errors do not count toward the reject limit because they are irrecoverable and cause the map task to stop. Errors encountered by the sampling feature or the online output formats do not count toward the reject limit either. |
oracle.hadoop.loader.sampler.enableSampling |
Type: Boolean Default Value: Description: Controls whether the sampling feature is enabled. Set this property to false to disable sampling. Even when this property is set to true, the loader automatically disables sampling if it is unnecessary. |
oracle.hadoop.loader.sampler.hintMaxSplitSize |
Type: Integer Default Value: Description: Sets the Hadoop maximum split size for the sampling process only; the job's own configuration is not changed. Some input formats (such as DBInputFormat) do not use this property. Increase this value for data sets with tens of terabytes of data, or if the input format produces an excessive number of splits. Although large splits are better for I/O performance, they are not necessarily better for sampling. Set this value small enough for good sampling performance, but no smaller. Extremely small values can cause inefficient I/O performance without improving the sample. |
oracle.hadoop.loader.sampler.hintNumMapTasks |
Type: Integer Default Value: Description: Sets the value of the Hadoop map-task hint for the sampling process only; the job's own configuration is not changed. Some input formats (such as DBInputFormat) use this property as a hint to determine the number of splits. Increase this value for data sets with more than a million rows, but remember that extremely large values can cause inefficient I/O performance while not improving the sample. |
oracle.hadoop.loader.sampler.loadCI |
Type: Decimal Default Value: Description: The statistical confidence indicator for the maximum reducer load factor. This property accepts values greater than or equal to 0.5 and less than 1; typical values are 0.90, 0.95, and 0.99. |
oracle.hadoop.loader.sampler.maxHeapBytes |
Type: Integer Default Value: Description: Specifies in bytes the maximum amount of memory available to the sampler. Sampling stops when one of these conditions is true: the sampler has collected the minimum number of samples required for load balancing, the percentage of sampled data exceeds oracle.hadoop.loader.sampler.maxSamplesPct, or the number of sampled bytes exceeds this limit (when this property is set to a positive value). |
oracle.hadoop.loader.sampler.maxLoadFactor |
Type: Float Default Value: Description: The maximum acceptable load factor for a reducer. For example, a value of 0.05 means that no reducer should receive more than 5% above the ideal (perfectly balanced) load. This property accepts values greater than 0. In a perfectly balanced load, every reducer is assigned an equal amount of work (or load). The load factor is the relative overload for each reducer, calculated as (assigned_load - ideal_load)/ideal_load. If load balancing is successful, the job runs within the maximum load factor at the specified confidence. |
oracle.hadoop.loader.sampler.maxSamplesPct |
Type: Float Default Value: Description: Sets the maximum sample size as a fraction of the number of records in the input data. For example, a value of 0.05 limits the sample to at most 5% of the input records. This property accepts a range of 0 to 1 (0% to 100%). A negative value disables it. Sampling stops when one of these conditions is true: the sampler has collected the minimum number of samples required for load balancing, or the sample size reaches this limit. |
oracle.hadoop.loader.sampler.minSplits |
Type: Integer Default Value: Description: The minimum number of input splits that the sampler reads from before it makes any evaluation of the stopping condition. If the total number of input splits is less than the value of this property, then the sampler reads from all the input splits. A number less than or equal to 0 is treated the same as a value of 1. |
oracle.hadoop.loader.sampler.numThreads |
Type: Integer Default Value: Description: The number of sampler threads. A higher number of threads allows higher concurrency in sampling. A value of 1 disables multithreaded sampling. Set the value based on the processor and memory resources available on the node where you start the Oracle Loader for Hadoop job. |
oracle.hadoop.loader.sortKey |
Type: String Default Value: Not defined Description: A comma-delimited list of column names that forms a key for sorting output records within a reducer group. The column names can be quoted or unquoted identifiers: a quoted identifier begins and ends with double quotes ("), while an unquoted identifier is converted to upper case. |
oracle.hadoop.loader.tableMetadataFile |
Type: String Default Value: Not defined Description: Path to the target table metadata file. Set this property when running in offline database mode. Use the file:// syntax to specify a local file, for example: file:///home/jdoe/metadata.xml To create the table metadata file, run the OraLoaderMetadata utility. |
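A sketch of creating the metadata file with the OraLoaderMetadata utility follows; the user, connection URL, schema, table, and output file names are placeholders, and the options shown should be verified against your release.

```shell
# Sketch: generate the table metadata file on a system with
# database access, then copy it to where the OraLoader job runs.
java -cp "$OLH_HOME/jlib/*" \
    oracle.hadoop.loader.metadata.OraLoaderMetadata \
    -user HR \
    -connection_url "jdbc:oracle:thin:@//myhost:1521/my_service" \
    -schema HR \
    -table EMPLOYEES \
    -output employees_metadata.xml
```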
oracle.hadoop.loader.targetTable |
Deprecated. Use oracle.hadoop.loader.loaderMap.targetTable instead. |
Oracle NoSQL Database Configuration Properties
Property | Description |
---|---|
oracle.kv.kvstore |
Type: String Default Value: Not defined Description: The name of the KV store with the source data. |
oracle.kv.hosts |
Type: String Default Value: Not defined Description: An array of one or more hostname:port pairs that identify the hosts in the KV store with the source data. Separate multiple pairs with commas. |
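As an illustration, the two properties above might be set together as follows; the store name and host:port pairs are placeholders.

```xml
<!-- Sketch: identify the source KV store; the store name and
     hosts are placeholders. -->
<property>
  <name>oracle.kv.kvstore</name>
  <value>mystore</value>
</property>
<property>
  <name>oracle.kv.hosts</name>
  <value>node1.example.com:5000,node2.example.com:5000</value>
</property>
```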
oracle.kv.batchSize |
Type: Integer Default Value: Not defined Description: The desired number of keys for KVAvroInputFormat to fetch during each network round trip. A value of zero (0) sets the property to a default value. |
oracle.kv.parentKey |
Type: String Default Value: Not defined Description: Restricts the returned values to only the child key-value pairs of the specified key. A major key path must be a partial path, and a minor key path must be empty. A null value (the default) does not restrict the output, and so KVAvroInputFormat returns all keys in the store. |
oracle.kv.subRange |
Type: KeyRange Default Value: Not defined Description: Further restricts the returned values to a particular child under the parent key specified by oracle.kv.parentKey. |
oracle.kv.depth |
Type: Depth Default Value: Description: Restricts the returned values to a particular hierarchical depth under the value of oracle.kv.parentKey. The following keywords are valid values: CHILDREN_ONLY, DESCENDANTS_ONLY, PARENT_AND_CHILDREN, PARENT_AND_DESCENDANTS. |
oracle.kv.consistency |
Type: Consistency Default Value: Description: The consistency guarantee for reading child key-value pairs. The following keywords are valid values: ABSOLUTE, NONE_REQUIRED. |
oracle.kv.timeout |
Type: Long Default Value: Description: Sets a maximum time interval in milliseconds for retrieving a selection of key-value pairs. A value of zero (0) sets the property to its default value. |
oracle.kv.formatterClass |
Type: String Default Value: Not defined Description: Specifies the name of a class that implements the AvroFormatter interface to format key-value pairs into Avro records. Because the Avro records from Oracle NoSQL Database pass directly to Oracle Loader for Hadoop, the NoSQL keys are not available for mapping into the target Oracle Database table. However, the formatter class receives both the NoSQL key and value, enabling the class to create and return a new Avro record that contains both the value and key, which can be passed to Oracle Loader for Hadoop. |