3 Oracle Loader for Apache Hadoop
This chapter explains how to use Oracle Loader for Apache Hadoop (Oracle Loader for Hadoop) to load data from Apache Hadoop into tables in an Oracle database.
What Is Oracle Loader for Hadoop?
Oracle Loader for Hadoop is an efficient and high-performance loader for fast loading of data from a Hadoop cluster into a table in an Oracle database. It pre-partitions the data if necessary and transforms it into a database-ready format. It can also sort records by primary key or user-specified columns before loading the data or creating output files. Oracle Loader for Hadoop uses the parallel processing framework of Hadoop to perform these preprocessing operations, which other loaders typically perform on the database server as part of the load process. Off-loading these operations to Hadoop reduces the CPU requirements on the database server, thereby lessening the performance impact on other database tasks.
Oracle Loader for Hadoop is a Java MapReduce application that balances the data across reducers to help maximize performance. It works with a range of input data formats that present the data as records with fields. It can read from sources that have the data already in a record format (such as Avro files or Apache Hive tables), or it can split the lines of a text file into fields.
If you have Java programming skills, you can extend the types of data that the loader can handle by defining custom input formats. Then Oracle Loader for Hadoop uses your code to extract the fields and records.
Interfaces to Oracle Loader for Hadoop
There are three ways to use Oracle Loader for Hadoop:
- Oracle Shell for Hadoop Loaders (OHSH)
OHSH is the preferred way to use Oracle Loader for Hadoop. It includes a CLI (whose simple command syntax can also be scripted) for moving data between Hadoop and Oracle Database using various resources, including Oracle Loader for Hadoop (OLH).
- Oracle SQL Developer
Oracle SQL Developer is a free graphical IDE that includes integration with Oracle Database and Oracle Big Data Connectors (among many other products). It provides wizards to help you access and use Oracle Big Data Connectors.
- The hadoop command-line utility
On the command line, you provide the job details with the configuration settings, typically supplied in a job configuration file. Consider this option if you need a feature that is not supported by your preferred UI (OHSH or SQL Developer). For most use cases, OHSH or SQL Developer is sufficient.
See Also:
Use Oracle SQL Developer With Oracle Big Data Connectors in this guide provides instructions for downloading Oracle SQL Developer and configuring it for use with Oracle Big Data Connectors.
Get Started With Oracle Loader for Hadoop
These instructions show how to use Oracle Loader for Hadoop through OHSH.
Before You Start
This is what you need to know before using OLH to load an Oracle Database table with data stored in Hadoop:
- The password of the database schema you are connecting to (which is implied by the database connection URL), or any other relevant database credential information.
- The name of the Oracle Database table.
- The source of the data stored in Hadoop (either a path to an HDFS directory or the name of a Hive table).
- The preferred method for loading. Choose either JDBC or direct path. Direct path load is faster, but requires a partitioned target table; JDBC does not.
About Resources
In OHSH, the term resources refers to the interfaces that OHSH presents for defining the data source, destination, and command language. Four types of resources are available:
- Hadoop resources – for executing HDFS commands and using HDFS as a source or destination.
- Hive resources – for executing Hive commands and specifying Hive as a source or destination.
- JDBC resources – for making JDBC connections to a database.
- SQL resources – for executing SQL commands in a database schema.
Two resources are created upon OHSH startup:
- hive0 – enables access to the default Hive database.
- hadoop0 – enables access to HDFS.
Within a session, you can create SQL and JDBC resources, as well as additional Hive resources (for example, to connect to other Hive databases). Assign a meaningful name to each resource, as in the example below, where the names sql0 and ora_mydatabase clearly indicate the type of resource.
Where resources are invoked in the commands below, the percent sign (%) prefix identifies a resource name.
Loading an Oracle Database Table
- Start an OHSH session.
$ ohsh
ohsh>
- Create the following resources:
  - SQL resource:
    ohsh> create sqlplus resource sql0 connectid="<database connection url>"
    At the prompt, enter the database password.
  - SQL resource when using JDBC SSL:
    You can use a JDBC SSL connection to connect to Oracle Database (for example, to connect to an Oracle Autonomous Database). See Using JDBC SSL to download the client credentials and identify the TNS entry in the tnsnames.ora file.
    For example:
    ohsh> create sqlplus resource sql_ssl connectid="inst1_ssl"
    inst1_ssl is the TNS entry for the JDBC SSL connection. At the prompt, enter the database password.
  - SQL resource when using Java KeyStore:
    You can use a Java KeyStore to store your password, so that you are not prompted for a user name and password. Add this to the scripts you develop to load data. See Using Secure External Java KeyStore and Hadoop credential command to create a Java KeyStore.
    For example:
    ohsh> create sqlplus resource sql_cs user=oracle passwordalias=oracle_passwd provider="jceks://file/home/oracle/passwd.jceks" connectid="inst1"
  - JDBC resource:
    A JDBC resource name that indicates the target schema is recommended.
    ohsh> create jdbc resource ora_mydatabase connectid="<database connection url>"
    At the prompt, enter the database password.
  - JDBC resource when using JDBC SSL:
    You can use a JDBC SSL connection to connect to Oracle Database (for example, to connect to an Oracle Autonomous Database). See Using JDBC SSL to download the client credentials and identify the TNS entry in the tnsnames.ora file.
    For example:
    ohsh> create jdbc resource ora_mydatabase_ssl connectiondir=/home/oracle/ssl_client_wallet connectid="inst1_ssl"
    inst1_ssl is the TNS entry for the JDBC SSL connection. At the prompt, enter the database password.
  - JDBC resource when using Java KeyStore:
    You can use a Java KeyStore to store your password, so that you are not prompted for a user name and password. Add this to the scripts you develop to load data. See Using Secure External Java KeyStore and Hadoop credential command to create a Java KeyStore.
    For example:
    ohsh> create jdbc resource ora_mydatabase_cs connectiondir=oracle passwordalias=oracle_passwd provider="jceks://file/home/oracle/passwd.jceks" connectid="inst1"
  - Additional Hive resources (if required):
    The default Hive resource hive0 connects to the default database in Hive. To connect to another Hive database, create another resource:
    ohsh> create hive resource hive_mydatabase connectionurl="jdbc:hive2:///<Hive database name>"
- Use the load command to load files from HDFS into a target table in the Oracle database.
The following command loads data from a delimited text file in HDFS <HDFS path> into the target table in Oracle Database using the direct path option.
ohsh> load oracle table ora_mydatabase:<target table in the Oracle database> from path hadoop0:/user/<HDFS path> using directpath
Note:
The default direct path method is the fastest way to load a table, but it requires a partitioned target table. Direct path is always recommended for partitioned tables. Use the JDBC option to load into a non-partitioned target table.
If the command does not explicitly state the load method, then OHSH automatically chooses the appropriate method: if the target Oracle table is partitioned, OHSH uses direct path (that is, Oracle OCI) by default; if the table is not partitioned, it uses JDBC.
- After loading, check the number of rows.
You can do this conveniently from the OHSH command line:
ohsh> %sql0 select count(*) from <target table in Oracle Database>
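For illustration, the steps above can be combined into a short session. The connection details, resource names, table, and path below are invented for this sketch, and SESSIONS is assumed to be a partitioned table (so that direct path applies):
ohsh> create sqlplus resource sql0 connectid="myhost:1521/orcl"
ohsh> create jdbc resource ora_moviedemo connectid="myhost:1521/orcl"
ohsh> load oracle table ora_moviedemo:SESSIONS from path hadoop0:/user/oracle/moviedemo/sessions using directpath
ohsh> %sql0 select count(*) from SESSIONS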
Loading a Hive Table Into an Oracle Database Table
You can use OHSH to load a Hive table into a target table in an Oracle database. The command below shows how to do this using the direct path method.
ohsh> load oracle table ora_mydatabase:<target table in Oracle Database> from hive table hive0:<Hive table name>
Note that if the target table is partitioned, then OHSH uses direct path automatically. You do not need to enter using directpath explicitly in the command.
If the target table is non-partitioned, then specify the JDBC method instead:
ohsh> load oracle table ora_mydatabase:<target table in Oracle Database> from hive table hive0:<Hive table name> using jdbc
Note:
The load command assumes that the column names in the Hive table and in the Oracle Database table are identically matched. If they do not match, then use an OHSH loadermap.
Using OHSH Loadermaps
The simple load examples in this section assume the following:
- Where we load data from a text file in Hadoop into an Oracle Database table, the declared order of columns in the target table maps correctly to the physical ordering of the delimited text fields in the file.
- Where we load Hive tables into Oracle Database tables, the Hive and Oracle Database column names are identically matched.
However, in less straightforward cases, where the column names (or the order of column names and delimited text fields) do not match, use the OHSH loadermap construct to correct these mismatches.
You can also use a loadermap to specify a subset of target columns to load or, in the case of a load from a text file, to specify the format of a field in the load.
Loadermaps are not covered in this introduction.
Performance Tuning Oracle Loader for Hadoop in OHSH
Aside from network bandwidth, two factors can have significant impact on Oracle Loader for Hadoop performance. You can tune both in OHSH.
- Degree of parallelism
The degree of parallelism affects performance when Oracle Loader for Hadoop runs in Hadoop. For the default method (direct path), parallelism is determined by the number of reducer tasks: the higher the number of reducer tasks, the faster the performance. The default value is 4. To set the number of tasks:
ohsh> set reducetasks 18
For the JDBC option, parallelism is determined by the number of map tasks, and the optimal number is determined automatically. However, remember that if the target table is partitioned, direct path is faster than JDBC.
- Load balancing
Performance is best when the load is balanced evenly across reduce tasks. The load is detected by sampling. Sampling is always enabled by default for loads using JDBC and the default copy method.
Debugging in OHSH
Several OHSH settings control the availability of debugging information:
- outputlevel
The outputlevel is set to minimal by default. Set it to verbose in order to return a stack trace when a command fails:
ohsh> set outputlevel verbose
- logbadrecords
ohsh> set logbadrecords true
This is set to true by default.
These log files are informative for debugging:
- Oracle Loader for Hadoop log files:
/user/<username>/smartloader/jobhistory/oracle/<target table schema>/<target table name>/<OHSH job ID>/_olh
- Log files generated by the map and reduce tasks.
Other OHSH Properties That are Useful for Oracle Loader for Hadoop
You can set these properties on the OHSH command line or in a script.
- dateformat
ohsh> set dateformat "yyyy-MM-dd HH:mm:ss"
The syntax for this command is dictated by the Java date format.
- rejectlimit
The number of rows that can be rejected before the load of a delimited text file fails.
- fieldterminator
The field terminator in loads of delimited text files.
- hadooptnsadmin
Location of an Oracle TNS admin directory in the Hadoop cluster.
- hadoopwalletlocation
Location of the Oracle Wallet directory in the Hadoop cluster.
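As a sketch, a load script might set several of these properties before issuing the load command. The values here are illustrative assumptions, not defaults:
ohsh> set dateformat "yyyy-MM-dd HH:mm:ss"
ohsh> set fieldterminator ","
ohsh> set rejectlimit 10
ohsh> load oracle table ora_mydatabase:<target table> from path hadoop0:/user/<HDFS path> using directpath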
Additional Information
Using the exttab (External Table) Method to Load Data
A third option for loading data from Hadoop into Oracle Database is exttab.
Note:
The exttab option is available in on-premises deployments of OHSH only. It is not available in Oracle cloud services.
In the exttab method, data is loaded via external tables. OHSH creates the external table using Oracle SQL Connector for HDFS, and then uses a Create Table as Select statement to load the data into the target table:
ohsh> load oracle table ora_mydatabase:<target table in Oracle Database> from hive table hive0:<Hive table name> using exttab
Learning Resources
Blog entries about OHSH can help you get started.
See the Java™ Platform, Standard Edition 7 API Specification for documentation on the SimpleDateFormat class.
Use Oracle Loader for Hadoop With the Hadoop Command Line Utility
The following topics describe the basic steps for using Oracle Loader for Hadoop.
About the Modes of Operation
Oracle Loader for Hadoop operates in two modes:
Online Database Mode
In online database mode, Oracle Loader for Hadoop can connect to the target database using the credentials provided in the job configuration file or in an Oracle wallet. The loader obtains the table metadata from the database. It can insert new records directly into the target table or write them to a file in the Hadoop cluster. You can load records from an output file when the data is needed in the database, or when the database system is less busy.
Offline Database Mode
Offline database mode enables you to use Oracle Loader for Hadoop when the Oracle Database system is on a separate network from the Hadoop cluster, or is otherwise inaccessible. In this mode, Oracle Loader for Hadoop uses the information supplied in a table metadata file, which you generate using a separate utility. The loader job stores the output data in binary or text format output files on the Hadoop cluster. Loading the data into Oracle Database is a separate procedure using another utility, such as Oracle SQL Connector for Hadoop Distributed File System (HDFS) or SQL*Loader.
Create the Target Table
Oracle Loader for Hadoop loads data into one target table, which must exist in the Oracle database. The table can be empty or contain data already; Oracle Loader for Hadoop does not overwrite existing data.
Create the table the same way that you would create one for any other purpose. It must comply with the restrictions described in the following topics.
Supported Data Types for Target Tables
You can define the target table using any of these data types:
- BINARY_DOUBLE
- BINARY_FLOAT
- CHAR
- DATE
- FLOAT
- INTERVAL DAY TO SECOND
- INTERVAL YEAR TO MONTH
- NCHAR
- NUMBER
- NVARCHAR2
- RAW
- TIMESTAMP
- TIMESTAMP WITH LOCAL TIME ZONE
- TIMESTAMP WITH TIME ZONE
- VARCHAR2
The target table can contain columns with unsupported data types, but these columns must be nullable, or otherwise set to a value.
Supported Partitioning Strategies for Target Tables
Partitioning is a database feature for managing and efficiently querying very large tables. It provides a way to decompose a large table into smaller and more manageable pieces called partitions, in a manner entirely transparent to applications.
You can define the target table using any of the following single-level and composite-level partitioning strategies.
- Hash
- Hash-Hash
- Hash-List
- Hash-Range
- Interval
- Interval-Hash
- Interval-List
- Interval-Range
- List
- List-Hash
- List-List
- List-Range
- Range
- Range-Hash
- Range-List
- Range-Range
Oracle Loader for Hadoop does not support reference partitioning or virtual column-based partitioning.
Create a Job Configuration File
A configuration file is an XML document that provides Hadoop with all the information it needs to run a MapReduce job. This file can also provide Oracle Loader for Hadoop with all the information it needs. See "Oracle Loader for Hadoop Configuration Property Reference".
Configuration properties provide the following information, which is required for all Oracle Loader for Hadoop jobs:
- How to establish a secure connection to Oracle Database.
See "Establish Secure Connections to Oracle Database Using SSL and Oracle Wallet."
- The format of the input data.
See "About Input Formats."
- The format of the output data.
See "About Output Formats."
OraLoader
implements the org.apache.hadoop.util.Tool
interface and follows the standard Hadoop methods for building MapReduce applications. Thus, you can supply the configuration properties in a file (as shown here) or on the hadoop
command line. See "Run a Loader Job."
You can use any text or XML editor to create the file. The following shows an example of a job configuration file.
Example 3-1 Job Configuration File
<?xml version="1.0" encoding="UTF-8" ?> <configuration> <!-- Input settings --> <property> <name>mapreduce.job.inputformat.class</name> <value>oracle.hadoop.loader.lib.input.DelimitedTextInputFormat</value> </property> <property> <name>mapreduce.input.fileinputformat.inputdir</name> <value>/user/oracle/moviedemo/session/*00000</value> </property> <property> <name>oracle.hadoop.loader.input.fieldTerminator</name> <value>\u0009</value> </property> <property> <name>oracle.hadoop.loader.input.fieldNames</name> <value>SESSION_ID,TIME_IDDATE,CUST_ID,DURATION_SESSION,NUM_RATED,DURATION_RATED,NUM_COMPLETED,DURATION_COMPLETED,TIME_TO_FIRST_START,NUM_STARTED,NUM_BROWSED,DURATION_BROWSED,NUM_LISTED,DURATION_LISTED,NUM_INCOMPLETE,NUM_SEARCHED</value> </property> <property> <name>oracle.hadoop.loader.defaultDateFormat</name> <value>yyyy-MM-dd:HH:mm:ss</value> </property> <!-- Output settings --> <property> <name>mapreduce.job.outputformat.class</name> <value>oracle.hadoop.loader.lib.output.OCIOutputFormat</value> </property> <property> <name>mapreduce.output.fileoutputformat.outputdir</name> <value>temp_out_session</value> </property> <!-- Table information --> <property> <name>oracle.hadoop.loader.loaderMap.targetTable</name> <value>movie_sessions_tab</value> </property> <!-- Connection information --> <property> <name>oracle.hadoop.loader.connection.url</name> <value>jdbc:oracle:thin:@${HOST}:${TCPPORT}/${SERVICE_NAME}</value> </property> <property> <name>TCPPORT</name> <value>1521</value> </property> <property> <name>HOST</name> <value>myoraclehost.example.com</value> </property> <property> <name>SERVICE_NAME</name> <value>orcl</value> </property> <property> <name>oracle.hadoop.loader.connection.user</name> <value>MOVIEDEMO</value> </property> <property> <name>oracle.hadoop.loader.connection.password</name> <value>oracle</value> <description> Note: Protect this file with 600 permissions since it has password in clear text.</description> </property> </configuration>
Establish Secure Connections to Oracle Database Using SSL and Oracle Wallet
Learn about establishing secure connections to Oracle Database.
This section describes how to create and use an Oracle Wallet or the JDBC SSL protocol to create and establish highly secure connections to Oracle Database.
Use Oracle Wallets
Oracle Wallet is a secure software container that stores authentication and signing credentials. Oracle recommends that you use a wallet to provide your credentials. To use an Oracle wallet, enter the following properties in the job configuration file:
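A hedged sketch of such a configuration, using the connection properties described in the configuration property reference later in this chapter (the TNS alias and directory path are illustrative):
<property>
  <name>oracle.hadoop.loader.connection.tnsEntryName</name>
  <value>orcl_alias</value>
</property>
<property>
  <name>oracle.hadoop.loader.connection.tns_admin</name>
  <value>/etc/oracle/tns</value>
</property>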
Use JDBC SSL
SSL is a widely used industry standard protocol that provides secure communication over a network. SSL provides authentication, data encryption, and data integrity. SSL is required when connecting to Oracle Cloud services such as Oracle Autonomous Data Warehouse Cloud Service.
Generate the Target Table Metadata for Offline Database Mode
Learn to generate the Target table metadata for offline database mode.
Under some circumstances, the loader job cannot access the database, such as when the Hadoop cluster is on a different network than Oracle Database. In such cases, you can use the OraLoaderMetadata utility to extract and store the target table metadata in a file. To learn more about the OraLoaderMetadata utility, see OraLoaderMetadata Utility.
About Input Formats
An input format reads a specific type of data stored in Hadoop. Several input formats are available, covering the data formats most commonly found in Hadoop; they are described in the topics that follow.
You can also use your own custom input formats. The descriptions of the built-in formats provide information that may help you develop custom Java InputFormat
classes. See "Custom Input Formats."
You specify a particular input format for the data that you want to load into a database table by using the mapreduce.job.inputformat.class configuration property in the job configuration file.
Note:
The built-in text formats do not handle header rows or newline characters (\n
) embedded in quoted values.
Delimited Text Input Format
To load data from a delimited text file, set mapreduce.job.inputformat.class
to
oracle.hadoop.loader.lib.input.DelimitedTextInputFormat
About DelimitedTextInputFormat
The input file must comply with these requirements:
-
Records must be separated by newline characters.
-
Fields must be delimited using single-character markers, such as commas or tabs.
A null replaces any empty-string token, whether enclosed or unenclosed.
DelimitedTextInputFormat
emulates the tokenization method of SQL*Loader: Terminated by t, and optionally enclosed by ie, or by ie and te. DelimitedTextInputFormat
uses the following syntax rules, where t is the field terminator, ie is the initial field encloser, te is the trailing field encloser, and c is one character.
- Line = Token t Line | Token\n
- Token = EnclosedToken | UnenclosedToken
- EnclosedToken = (white-space)* ie [(non-te)* te te]* (non-te)* te (white-space)*
- UnenclosedToken = (white-space)* (non-t)*
- white-space = {c | Character.isWhitespace(c) and c != t}
White space around enclosed tokens (data values) is discarded. For unenclosed tokens, the leading white space is discarded, but not the trailing white space (if any).
This implementation enables you to define custom enclosers and terminator characters, but it hard codes the record terminator as a newline, and white space as Java Character.isWhitespace
. A white space can be defined as the field terminator, but then that character is removed from the class of white space characters to prevent ambiguity.
Hadoop automatically decompresses compressed text files when they are read.
Required Configuration Properties
None. The default format separates fields with commas and has no field enclosures.
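To change these defaults, you can set input properties in the job configuration file. The fieldTerminator property appears in Example 3-1; the encloser property names below follow the same oracle.hadoop.loader.input prefix and are assumptions to verify against the configuration property reference:
<!-- Sketch: tab-separated fields, optionally enclosed in double quotes. -->
<property>
  <name>oracle.hadoop.loader.input.fieldTerminator</name>
  <value>\u0009</value>
</property>
<property>
  <name>oracle.hadoop.loader.input.initialFieldEncloser</name>
  <value>"</value>
</property>
<property>
  <name>oracle.hadoop.loader.input.trailingFieldEncloser</name>
  <value>"</value>
</property>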
Complex Text Input Formats
To load data from text files that are more complex than DelimitedTextInputFormat
can handle, set mapreduce.job.inputformat.class
to
oracle.hadoop.loader.lib.input.RegexInputFormat
For example, a web log might delimit one field with quotes and another field with square brackets.
About RegexInputFormat
RegexInputFormat
requires that records be separated by newline characters. It identifies fields in each text line by matching a regular expression:
- The regular expression must match the entire text line.
- The fields are identified using the capturing groups in the regular expression.
RegexInputFormat
uses the java.util.regex
regular expression-based pattern matching engine. Hadoop automatically decompresses compressed files when they are read.
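For the web-log case mentioned above, a pattern with two capturing groups might look like the following sketch. The regex property name is an assumption to verify against the configuration property reference; each capturing group becomes one input field:
<!-- Sketch: matches a line of the form "field1" [field2] -->
<property>
  <name>oracle.hadoop.loader.input.regexPattern</name>
  <value>"([^"]*)" \[([^\]]*)\]</value>
</property>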
See Also:
Java Platform Standard Edition 6 Java Reference for more information about java.util.regex
at
http://docs.oracle.com/javase/6/docs/api/java/util/regex/package-summary.html
Hive Table Input Format
To load data from a Hive table, set mapreduce.job.inputformat.class
to
oracle.hadoop.loader.lib.input.HiveToAvroInputFormat
About HiveToAvroInputFormat
For nonpartitioned tables, HiveToAvroInputFormat
imports the entire table, which is all files in the Hive table directory.
For partitioned tables, HiveToAvroInputFormat
imports one or more of the partitions. You can either load or skip a partition. However, you cannot partially load a partition.
Oracle Loader for Hadoop rejects all rows with complex (non-primitive) column values. UNIONTYPE
fields that resolve to primitive values are supported. See "Handling Rejected Records."
HiveToAvroInputFormat
transforms rows in the Hive table into Avro records, and capitalizes the Hive table column names to form the field names. This automatic capitalization improves the likelihood that the field names match the target table column names. See "Mapping Input Fields to Target Table Columns".
Note:
This input format does not support Hive tables using quoted identifiers for column names. See HIVE-6013.
Also note that HiveToAvroInputFormat does not enforce SQL Standard Based Hive Authorization. For more information, see https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+Based+Hive+Authorization.
Required Configuration Properties
You must specify the Hive database and table names using the following configuration properties:
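For example, the Hive source can be named as follows. The property names are as commonly documented for Oracle Loader for Hadoop (verify them against the configuration property reference for your release); the database and table values are illustrative:
<property>
  <name>oracle.hadoop.loader.input.hive.databaseName</name>
  <value>default</value>
</property>
<property>
  <name>oracle.hadoop.loader.input.hive.tableName</name>
  <value>movie_sessions</value>
</property>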
Avro Input Format
To load data from binary Avro data files containing standard Avro-format records, set mapreduce.job.inputformat.class
to
oracle.hadoop.loader.lib.input.AvroInputFormat
To process only files with the .avro extension, append *.avro
to directories listed in the mapreduce.input.fileinputformat.inputdir
configuration property.
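For example (the directory path is illustrative):
<property>
  <name>mapreduce.input.fileinputformat.inputdir</name>
  <value>/user/oracle/avrodata/*.avro</value>
</property>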
Oracle NoSQL Database Input Format
To load data from Oracle NoSQL Database, set mapreduce.job.inputformat.class
to
oracle.kv.hadoop.KVAvroInputFormat
This input format is defined in Oracle NoSQL Database 11g, Release 2 and later releases.
About KVAvroInputFormat
Oracle Loader for Hadoop uses KVAvroInputFormat
to read data directly from Oracle NoSQL Database.
KVAvroInputFormat
passes the value but not the key from the key-value pairs in Oracle NoSQL Database. If you must access the Oracle NoSQL Database keys as Avro data values, such as storing them in the target table, then you must create a Java InputFormat
class that implements oracle.kv.hadoop.AvroFormatter
. Then you can specify the oracle.kv.formatterClass
property in the Oracle Loader for Hadoop configuration file.
The KVAvroInputFormat class is a subclass of org.apache.hadoop.mapreduce.InputFormat<oracle.kv.Key, org.apache.avro.generic.IndexedRecord>.
See Also:
Javadoc for the KVInputFormatBase class.
Custom Input Formats
If the built-in input formats do not meet your needs, then you can write a Java class for a custom input format. The following information describes the framework in which an input format works in Oracle Loader for Hadoop.
About Implementing a Custom Input Format
Oracle Loader for Hadoop gets its input from a class extending org.apache.hadoop.mapreduce.InputFormat
. You must specify the name of that class in the mapreduce.job.inputformat.class
configuration property.
The input format must create RecordReader
instances that return an Avro IndexedRecord
input object from the getCurrentValue
method. Use this method signature:
public org.apache.avro.generic.IndexedRecord getCurrentValue() throws IOException, InterruptedException;
Oracle Loader for Hadoop uses the schema of the IndexedRecord
input object to discover the names of the input fields and map them to the columns of the target table.
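To make this contract concrete, here is a minimal, hedged sketch of a custom input format that wraps Hadoop's LineRecordReader and exposes each text line as a one-field Avro record. The class name, schema, and field name are invented for illustration; the shipped CSVInputFormat example (described later in this chapter) is the authoritative reference.

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.IndexedRecord;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

/** Illustrative only: presents each text line as a one-field Avro record. */
public class SingleFieldInputFormat
    extends FileInputFormat<NullWritable, IndexedRecord> {

  // Field name F0 mirrors the naming used by the shipped CSV example.
  private static final Schema SCHEMA =
      SchemaBuilder.record("line").fields().requiredString("F0").endRecord();

  @Override
  public RecordReader<NullWritable, IndexedRecord> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new RecordReader<NullWritable, IndexedRecord>() {
      private final LineRecordReader lines = new LineRecordReader();
      private GenericData.Record record;

      @Override
      public void initialize(InputSplit s, TaskAttemptContext c) throws IOException {
        lines.initialize(s, c);
      }

      @Override
      public boolean nextKeyValue() throws IOException {
        if (!lines.nextKeyValue()) {
          return false;
        }
        record = new GenericData.Record(SCHEMA);
        record.put("F0", lines.getCurrentValue().toString());
        return true;
      }

      // Oracle Loader for Hadoop maps record fields to table columns by
      // schema field name, and uses getCurrentKey().toString() in error
      // messages for rejected records.
      @Override
      public NullWritable getCurrentKey() {
        return NullWritable.get();
      }

      @Override
      public IndexedRecord getCurrentValue() {
        return record;
      }

      @Override
      public float getProgress() throws IOException {
        return lines.getProgress();
      }

      @Override
      public void close() throws IOException {
        lines.close();
      }
    };
  }
}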
About Error Handling
If processing an IndexedRecord
value results in an error, Oracle Loader for Hadoop uses the object returned by the getCurrentKey
method of the RecordReader
to provide feedback. It calls the toString
method of the key and formats the result in an error message. InputFormat
developers can assist users in identifying the rejected records by returning one of the following:
- Data file URI
- InputSplit information
- Data file name and the record offset in that file
Oracle recommends that you do not return the record in clear text, because it might contain sensitive information; the returned values can appear in Hadoop logs throughout the cluster. See "Log Rejected Records in Bad Files."
If a record fails and the key is null, then the loader generates no identifying information.
Supporting Data Sampling
Oracle Loader for Hadoop uses a sampler to improve performance of its MapReduce job. The sampler is multithreaded, and each sampler thread instantiates its own copy of the supplied InputFormat
class. When implementing a new InputFormat
, ensure that it is thread-safe. See "Balancing Loads When Loading Data into Partitioned Tables."
InputFormat Source Code Example
Oracle Loader for Hadoop provides the source code for an InputFormat
example.
In order to access the examples, unzip the examples.zip file, which is in $OLH_HOME. You can find the InputFormat example in the examples/jsrc directory.
The sample format loads data from a simple, comma-separated value (CSV) file. To use this input format, add $OLH_HOME/examples/oraloader-examples.jar
to HADOOP_CLASSPATH
and specify oracle.hadoop.loader.examples.CSVInputFormat
as the value of mapreduce.job.inputformat.class
in the job configuration file.
This input format automatically assigns field names of F0, F1, F2, and so forth. It does not have configuration properties.
Mapping Input Fields to Target Table Columns
Mapping identifies which input fields are loaded into which columns of the target table. You may be able to use the automatic mapping facilities, or you can always manually map the input fields to the target columns.
Automatic Mapping
Oracle Loader for Hadoop can automatically map the fields to the appropriate columns when the input data complies with these requirements:
- All columns of the target table are loaded.
- The input data field names in the IndexedRecord input object exactly match the column names. For example, when you load from a Hive table, the names of the Oracle target table columns exactly match the names of the Hive table columns.
- All input fields that are mapped to DATE columns can be parsed using the same Java date format.
Use these configuration properties for automatic mappings:
- oracle.hadoop.loader.loaderMap.targetTable: Identifies the target table.
- oracle.hadoop.loader.defaultDateFormat: Specifies a default date format that applies to all DATE fields.
Manual Mapping
For loads that do not comply with the requirements for automatic mapping, you must define additional properties. These properties enable you to:
- Load data into a subset of the target table columns.
- Create explicit mappings when the input field names are not identical to the database column names.
- Specify different date formats for different input fields.
Use these properties for manual mappings:
- oracle.hadoop.loader.loaderMap.targetTable: Identifies the target table. Required.
- oracle.hadoop.loader.loaderMap.columnNames: Lists the columns to be loaded.
- oracle.hadoop.loader.defaultDateFormat: Specifies a default date format that applies to all DATE fields.
- oracle.hadoop.loader.loaderMap.column_name.format: Specifies the data format for a particular column.
- oracle.hadoop.loader.loaderMap.column_name.field: Identifies the name of an Avro record field mapped to a particular column.
Note:
Manual mapping is particularly useful when different date columns have different formats.
Manual Mapping: Examples
The following are the examples of manual mapping:
Configuration File conf.xml when loading from a text file
When you load delimited text from text files on HDFS, use F0, F1, and so on to refer to the fields in the text files. In this example, F0 maps to EMPLOYEE_ID in the columnNames property, F1 maps to LAST_NAME, and so on.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?> <configuration> <property> <name>oracle.hadoop.loader.loaderMap.targetTable</name> <value>HR.EMPLOYEES</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.columnNames</name> <value>EMPLOYEE_ID,LAST_NAME,EMAIL,HIRE_DATE,JOB_ID</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.EMPLOYEE_ID.field</name> <value>F0</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.EMPLOYEE_ID.format</name> <value></value> </property> <property> <name>oracle.hadoop.loader.loaderMap.LAST_NAME.field</name> <value>F1</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.LAST_NAME.format</name> <value></value> </property> <property> <name>oracle.hadoop.loader.loaderMap.EMAIL.field</name> <value>F2</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.EMAIL.format</name> <value></value> </property> <property> <name>oracle.hadoop.loader.loaderMap.HIRE_DATE.field</name> <value>F3</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.HIRE_DATE.format</name> <value>MM-dd-yyyy</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.JOB_ID.field</name> <value>F4</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.JOB_ID.format</name> <value></value> </property> </configuration>
Configuration File conf.xml when loading from a Hive table
<?xml version="1.0" encoding="UTF-8" standalone="yes"?> <configuration> <property> <name>oracle.hadoop.loader.loaderMap.targetTable</name> <value>HR.EMPLOYEES</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.columnNames</name> <value>EMPLOYEE_ID,LAST_NAME,EMAIL,HIRE_DATE,JOB_ID</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.EMPLOYEE_ID.field</name> <value>EMPLOYEE_ID</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.EMPLOYEE_ID.format</name> <value></value> </property> <property> <name>oracle.hadoop.loader.loaderMap.LAST_NAME.field</name> <value>LAST_NAME</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.LAST_NAME.format</name> <value></value> </property> <property> <name>oracle.hadoop.loader.loaderMap.EMAIL.field</name> <value>EMAIL</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.EMAIL.format</name> <value></value> </property> <property> <name>oracle.hadoop.loader.loaderMap.HIRE_DATE.field</name> <value>HIRE_DATE</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.HIRE_DATE.format</name> <value>MM-dd-yyyy</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.JOB_ID.field</name> <value>JOB_ID</value> </property> <property> <name>oracle.hadoop.loader.loaderMap.JOB_ID.format</name> <value></value> </property> </configuration>
About Output Formats
In online database mode, you can choose between loading the data directly into an Oracle database table or storing it in a file. In offline database mode, you are restricted to storing the output data in a file, which you can load into the target table as a separate procedure. You specify the output format in the job configuration file using the mapreduce.job.outputformat.class
property.
Choose from these output formats:
- JDBC Output Format: Loads the data directly into the target table.
- Oracle OCI Direct Path Output Format: Loads the data directly into the target table.
- Delimited Text Output Format: Stores the data in a local file.
- Oracle Data Pump Output Format: Stores the data in a local file.
JDBC Output Format
You can use a JDBC connection between the Hadoop system and Oracle Database to load the data. The output records of the loader job are loaded directly into the target table by map or reduce tasks as part of the OraLoader
process, in online database mode. No additional steps are required to load the data.
A JDBC connection must be open between the Hadoop cluster and the Oracle Database system for the duration of the job.
To use this output format, set mapreduce.job.outputformat.class
to
oracle.hadoop.loader.lib.output.JDBCOutputFormat
About JDBCOutputFormat
JDBCOutputFormat
uses standard JDBC batching to optimize performance and efficiency. If an error occurs during batch execution, such as a constraint violation, the JDBC driver stops execution immediately. Thus, if there are 100 rows in a batch and the tenth row causes an error, then nine rows are inserted and 91 rows are not.
The JDBC driver does not identify the row that caused the error, and so Oracle Loader for Hadoop does not know the insert status of any of the rows in the batch. It counts all rows in a batch with errors as "in question," that is, the rows may or may not be inserted in the target table. The loader then continues loading the next batch. It generates a load report at the end of the job that details the number of batch errors and the number of rows in question.
One way to handle this problem is to define a unique key in the target table. For example, the HR.EMPLOYEES table has a primary key named EMPLOYEE_ID. After loading the data into HR.EMPLOYEES, you can query it by EMPLOYEE_ID to discover the missing employee IDs. Then you can locate the missing employee IDs in the input data, determine why they failed to load, and try again to load them.
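A hedged SQL sketch of that check, assuming the input employee IDs have been staged in a hypothetical table EMP_INPUT_IDS:
-- Employee IDs present in the input but missing from the target table.
SELECT i.employee_id
FROM   emp_input_ids i
MINUS
SELECT e.employee_id
FROM   hr.employees e;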
Oracle OCI Direct Path Output Format
You can use the direct path interface of Oracle Call Interface (OCI) to load data into the target table. Each reducer loads into a distinct database partition in online database mode, enabling the performance gains of a parallel load. No additional steps are required to load the data.
The OCI connection must be open between the Hadoop cluster and the Oracle Database system for the duration of the job.
To use this output format, set mapreduce.job.outputformat.class
to
oracle.hadoop.loader.lib.output.OCIOutputFormat
About OCIOutputFormat
OCIOutputFormat
has the following restrictions:
- It is available only on the Linux x86.64 platform.
- The MapReduce job must create one or more reducers.
- The target table must be partitioned.
- For Oracle Database 11g (11.2.0.3), apply the patch for bug 13498646 if the target table is a composite interval partitioned table in which the subpartition key contains a CHAR, VARCHAR2, NCHAR, or NVARCHAR2 column. Later versions of Oracle Database do not require this patch.
Delimited Text Output Format
You can create delimited text output files on the Hadoop cluster. The map or reduce tasks generate delimited text files, using the field delimiters and enclosers that you specify in the job configuration properties. Afterward, you can load the data into an Oracle database as a separate procedure. See "About DelimitedTextOutputFormat."
This output format can use either an open connection to the Oracle Database system to retrieve the table metadata in online database mode, or a table metadata file generated by the OraloaderMetadata
utility in offline database mode.
To use this output format, set mapreduce.job.outputformat.class
to
oracle.hadoop.loader.lib.output.DelimitedTextOutputFormat
About DelimitedTextOutputFormat
Output tasks generate delimited text format files, one or more corresponding SQL*Loader control files, and SQL scripts for loading with external tables.
If the target table is not partitioned or if oracle.hadoop.loader.loadByPartition is false, then DelimitedTextOutputFormat generates the following files:
- A data file named oraloader-taskId-csv-0.dat.
- A SQL*Loader control file named oraloader-csv.ctl for the entire job.
- A SQL script named oraloader-csv.sql to load the delimited text file into the target table.
For partitioned tables, multiple output files are created with the following names:
- Data files: oraloader-taskId-csv-partitionId.dat
- SQL*Loader control files: oraloader-taskId-csv-partitionId.ctl
- SQL script: oraloader-csv.sql
In the generated file names, taskId is the mapper or reducer identifier, and partitionId is the partition identifier.
If the Hadoop cluster is connected to the Oracle Database system, then you can use Oracle SQL Connector for HDFS to load the delimited text data into an Oracle database. See Oracle SQL Connector for Hadoop Distributed File System.
Alternatively, you can copy the delimited text files to the database system and load the data into the target table in one of the following ways:
-
Use the generated control files to run SQL*Loader and load the data from the delimited text files.
-
Use the generated SQL scripts to perform external table loads.
The files are located in the ${mapreduce.output.fileoutputformat.outputdir}/_olh
directory.
Configuration Properties
The formatting of records and fields in the output files is controlled by job configuration properties, which are described in the configuration property reference.
The following example shows a sample SQL*Loader control file that might be generated by an output task.
Example 3-2 Sample SQL*Loader Control File
LOAD DATA CHARACTERSET AL32UTF8
INFILE 'oraloader-csv-1-0.dat'
BADFILE 'oraloader-csv-1-0.bad'
DISCARDFILE 'oraloader-csv-1-0.dsc'
INTO TABLE "SCOTT"."CSV_PART" PARTITION(10) APPEND
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(
  "ID" DECIMAL EXTERNAL,
  "NAME" CHAR,
  "DOB" DATE 'SYYYY-MM-DD HH24:MI:SS'
)
Oracle Data Pump Output Format
You can create Data Pump format files on the Hadoop cluster. The map or reduce tasks generate Data Pump files. Afterward, you can load the data into an Oracle database as a separate procedure. See "About DataPumpOutputFormat."
This output format can use either an open connection to the Oracle Database system in online database mode, or a table metadata file generated by the OraloaderMetadata
utility in offline database mode.
To use this output format, set mapreduce.job.outputformat.class
to
oracle.hadoop.loader.lib.output.DataPumpOutputFormat
About DataPumpOutputFormat
DataPumpOutputFormat generates data files with names in this format:
oraloader-taskId-dp-partitionId.dat
In the generated file names, taskId is the mapper or reducer identifier, and partitionId is the partition identifier.
If the Hadoop cluster is connected to the Oracle Database system, then you can use Oracle SQL Connector for HDFS to load the Data Pump files into an Oracle database. See Oracle SQL Connector for Hadoop Distributed File System.
Alternatively, you can copy the Data Pump files to the database system and load them using a SQL script generated by Oracle Loader for Hadoop. The script performs the following tasks:
- Creates an external table definition using the ORACLE_DATAPUMP access driver. The binary format Oracle Data Pump output files are listed in the LOCATION clause of the external table.
- Creates a directory object that is used by the external table. You must uncomment this command before running the script. To specify the directory name used in the script, set the oracle.hadoop.loader.extTabDirectoryName property in the job configuration file.
- Inserts the rows from the external table into the target table. You must uncomment this command before running the script.
The SQL script is located in the ${mapreduce.output.fileoutputformat.outputdir}/_olh
directory.
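The generated script is specific to each job, but its shape is roughly as follows. This is a hedged sketch with invented object names, columns, and file names, not actual generated output:
-- The CREATE DIRECTORY command is generated commented out; uncomment and adjust it.
CREATE DIRECTORY olh_ext_dir AS '/u01/oraloader/output';

CREATE TABLE movie_facts_ext (
  cust_id   NUMBER,
  movie_id  NUMBER,
  rating    NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_DATAPUMP
  DEFAULT DIRECTORY olh_ext_dir
  LOCATION ('oraloader-00000-dp-1.dat', 'oraloader-00001-dp-1.dat')
);

-- The INSERT is also generated commented out; uncomment it to load.
INSERT INTO movie_facts SELECT * FROM movie_facts_ext;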
See Also:
-
Oracle Database Administrator's Guide for more information about creating and managing external tables
-
Oracle Database Utilities for more information about the
ORACLE_DATAPUMP
access driver
Run a Loader Job
To run a job using Oracle Loader for Hadoop, you use the OraLoader
utility in a hadoop
command.
The following is the basic syntax:
hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
  -conf job_config.xml \
  -libjars input_file_format1.jar[,input_file_format2.jar...]
You can include any generic hadoop
command-line option. OraLoader
implements the org.apache.hadoop.util.Tool
interface and follows the standard Hadoop methods for building MapReduce applications.
Unzip the examples.zip file in $OLH_HOME in order to use the oraloader-examples.jar file in the instructions below.
Basic Options
-conf job_config.xml
Identifies the job configuration file. See "Create a Job Configuration File."
-libjars
Identifies the JAR files for the input format:
- When using the example input format, specify $OLH_HOME/jlib/oraloader-examples.jar. (You will first need to set up the example for use, as described in InputFormat Source Code Example.)
- When using the Hive or Oracle NoSQL Database input formats, you must specify additional JAR files, as described later in this section.
- When using a custom input format, specify its JAR file. Also remember to add the JAR to HADOOP_CLASSPATH.
Separate multiple file names with commas, and list each one explicitly. Wildcard characters and spaces are not allowed.
Oracle Loader for Hadoop prepares internal configuration information for the MapReduce tasks. It stores table metadata information and the dependent Java libraries in the distributed cache, so that they are available to the MapReduce tasks throughout the cluster.
Example of Running OraLoader
The following example uses a built-in input format and a job configuration file named MyConf.xml:
HADOOP_CLASSPATH="$OLH_HOME/jlib/*:$OLH_HOME/examples/oraloader-examples.jar:$HADOOP_CLASSPATH"
hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
  -conf MyConf.xml -libjars $OLH_HOME/jlib/oraloader-examples.jar
Specify Hive Input Format JAR Files
When using HiveToAvroInputFormat
, you must add the Hive configuration directory to the HADOOP_CLASSPATH
environment variable:
HADOOP_CLASSPATH="$OLH_HOME/jlib/*:hive_home/lib/*:hive_conf_dir:$HADOOP_CLASSPATH"
You must also add the following Hive JAR files, in a comma-separated list, to the -libjars
option of the hadoop
command. Replace the stars (*) with the complete file names on your system:
- hive-exec-*.jar
- hive-metastore-*.jar
- libfb303*.jar
This example shows the full file names in Cloudera's Distribution including Apache Hadoop (CDH) 5.8:
# hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
  -conf MyConf.xml \
  -libjars hive-exec-1.1.0-cdh5.8.0.jar,hive-metastore-1.1.0-cdh5.8.0.jar,libfb303-0.9.3.jar
Specify Oracle NoSQL Database Input Format JAR Files
When using KVAvroInputFormat
from Oracle NoSQL Database 11g, Release 2, you must include $KVHOME/lib/kvstore.jar
in your HADOOP_CLASSPATH
and you must include the -libjars
option in the hadoop
command:
hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
  -conf MyConf.xml \
  -libjars $KVHOME/lib/kvstore.jar
Job Reporting
Oracle Loader for Hadoop consolidates reporting information from individual tasks into a file named ${mapreduce.output.fileoutputformat.outputdir}/_olh/oraloader-report.txt
. Among other statistics, the report shows the number of errors, broken out by type and task, for each mapper and reducer.
Handling Rejected Records
Oracle Loader for Hadoop may reject input records for a variety of reasons, such as:
- Errors in the mapping properties
- Missing fields in the input data
- Records mapped to invalid table partitions
- Badly formed records, such as dates that do not match the date format or records that do not match regular expression patterns
Log Rejected Records in Bad Files
By default, Oracle Loader for Hadoop does not log the rejected records into Hadoop logs; it only logs information on how to identify the rejected records. This practice prevents user-sensitive information from being stored in Hadoop logs across the cluster.
You can direct Oracle Loader for Hadoop to log rejected records by setting the oracle.hadoop.loader.logBadRecords
configuration property to true
. Then Oracle Loader for Hadoop logs bad records into one or more "bad" files in the _olh/
directory under the job output directory.
Set a Job Reject Limit
Some problems can cause Oracle Loader for Hadoop to reject every record in the input. To mitigate the loss of time and resources, Oracle Loader for Hadoop aborts the job after rejecting 1000 records.
You can change the maximum number of rejected records allowed by setting the oracle.hadoop.loader.rejectLimit
configuration property. A negative value turns off the reject limit and allows the job to run to completion regardless of the number of rejected records.
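For example, you can raise the limit on the hadoop command line with the generic -D option (the value 10000 is an arbitrary illustration):
hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
  -conf MyConf.xml \
  -D oracle.hadoop.loader.rejectLimit=10000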
Balancing Loads When Loading Data into Partitioned Tables
The goal of load balancing is to generate a MapReduce partitioning scheme that assigns approximately the same amount of work to all reducers.
The sampling feature of Oracle Loader for Hadoop balances loads across reducers when data is loaded into a partitioned database table. It generates an efficient MapReduce partitioning scheme that assigns database partitions to the reducers.
The execution time of a reducer is usually proportional to the number of records that it processes—the more records, the longer the execution time. When sampling is disabled, all records from a given database partition are sent to one reducer. This can result in unbalanced reducer loads, because some database partitions may have more records than others. Because the execution time of a Hadoop job is usually dominated by the execution time of its slowest reducer, unbalanced reducer loads slow down the entire job.
Use the Sampling Feature
You can turn the sampling feature on or off by setting the oracle.hadoop.loader.sampler.enableSampling
configuration property. Sampling is turned on by default.
Tuning Load Balancing
These job configuration properties control the quality of load balancing: oracle.hadoop.loader.sampler.maxLoadFactor and oracle.hadoop.loader.sampler.loadCI.
The sampler uses the expected reducer load factor to evaluate the quality of its partitioning scheme. The load factor is the relative overload for each reducer, calculated as (assigned_load - ideal_load)/ideal_load. This metric indicates how much a reducer's load deviates from a perfectly balanced reducer load. A load factor of 0 indicates a perfectly balanced load (no overload); for example, if the ideal load is 1,000,000 records and a reducer is assigned 1,050,000 records, its load factor is 0.05.
Small load factors indicate better load balancing. The maxLoadFactor
default of 0.05 means that no reducer is ever overloaded by more than 5%. The sampler guarantees this maxLoadFactor
with a statistical confidence level determined by the value of loadCI
. The default value of loadCI
is 0.95, which means that any reducer's load factor exceeds maxLoadFactor
in only 5% of the cases.
There is a trade-off between the execution time of the sampler and the quality of load balancing. Lower values of maxLoadFactor
and higher values of loadCI
achieve more balanced reducer loads at the expense of longer sampling times. The default values of maxLoadFactor=0.05
and loadCI=0.95
are a good trade-off between load balancing quality and execution time.
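For instance, a stricter balancing policy could be requested on the command line as follows. The property names combine the short names used above with the oracle.hadoop.loader.sampler prefix; treat them as assumptions to verify in the configuration property reference:
hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
  -conf MyConf.xml \
  -D oracle.hadoop.loader.sampler.maxLoadFactor=0.02 \
  -D oracle.hadoop.loader.sampler.loadCI=0.99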
Tuning Sampling Behavior
By default, the sampler runs until it collects just enough samples to generate a partitioning scheme that satisfies the maxLoadFactor
and loadCI
criteria.
However, you can limit the sampler running time by setting the oracle.hadoop.loader.sampler.maxSamplesPct
property, which specifies the maximum number of sampled records.
When Does Oracle Loader for Hadoop Use the Sampler's Partitioning Scheme?
Oracle Loader for Hadoop uses the generated partitioning scheme only if sampling is successful. A sampling is successful if it generates a partitioning scheme with a maximum reducer load factor of (1+ maxLoadFactor
) guaranteed at a statistical confidence level of loadCI
.
The partition report identifies the keys that are assigned to the various reducers. The report is saved in XML for the sampler to use; it does not contain information of use to you. The report is named ${mapreduce.output.fileoutputformat.outputdir}/_balancer/orabalancer_report.xml. It is generated only for sampled jobs. This XML file contains the information about how to assign map output to the different reducers, as well as the sampling statistics.
The default values of maxLoadFactor
, loadCI
, and maxSamplesPct
allow the sampler to successfully generate high-quality partitioning schemes for a variety of different input data distributions. However, the sampler might be unsuccessful in generating a partitioning scheme using custom property values, such as when the constraints are too rigid or the number of required samples exceeds the user-specified maximum of maxSamplesPct
. In these cases, Oracle Loader for Hadoop generates a log message identifying the problem, partitions the records using the database partitioning scheme, and does not guarantee load balancing.
Alternatively, you can reset the configuration properties to less rigid values: either increase maxSamplesPct, or relax the balancing criteria by increasing maxLoadFactor or decreasing loadCI, or both.
Resolve Memory Issues
A custom input format may return input splits that do not fit in memory. If this happens, the sampler returns an out-of-memory error on the client node where the loader job is submitted.
To resolve this problem:
- Increase the heap size of the JVM where the job is submitted.
- Adjust the sampler configuration properties.
If you are developing a custom input format, then see "Custom Input Formats."
What Happens When a Sampling Feature Property Has an Invalid Value?
If any configuration properties of the sampling feature are set to values outside the accepted range, an exception is not returned. Instead, the sampler prints a warning message, resets the property to its default value, and continues executing.
Optimize Communications Between Oracle Engineered Systems
If you are using Oracle Loader for Hadoop to load data from Oracle Big Data Appliance to Oracle Exadata Database Machine, then you can increase throughput by configuring the systems to use Sockets Direct Protocol (SDP) over the InfiniBand private network. This setup provides an additional connection attribute whose sole purpose is serving connections to Oracle Database to load data.
To specify SDP protocol:
- Add JVM options to the HADOOP_OPTS environment variable to enable JDBC SDP export:
HADOOP_OPTS="-Doracle.net.SDP=true -Djava.net.preferIPv4Stack=true"
- Set this Hadoop configuration property for the child task JVMs:
-D mapred.child.java.opts="-Doracle.net.SDP=true -Djava.net.preferIPv4Stack=true"
Note:
This Hadoop configuration property can either be added to the OLH command line or set in the configuration file.
- Configure standard Ethernet communications. In the job configuration file, set oracle.hadoop.loader.connection.url using this syntax:
jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=
   (ADDRESS=(PROTOCOL=TCP)(HOST=hostName)(PORT=portNumber)))
   (CONNECT_DATA=(SERVICE_NAME=serviceName)))
- Configure the Oracle listener on Exadata to support the SDP protocol and bind it to a specific port address (such as 1522). In the job configuration file, specify the listener address as the value of oracle.hadoop.loader.connection.oci_url using this syntax:
(DESCRIPTION=(ADDRESS=(PROTOCOL=SDP)
   (HOST=hostName) (PORT=portNumber))
   (CONNECT_DATA=(SERVICE_NAME=serviceName)))
Replace hostName, portNumber, and serviceName with the appropriate values to identify the SDP listener on your Oracle Exadata Database Machine.
See Also:
Oracle Big Data Appliance Software User's Guide for more information about configuring communications over InfiniBand
Oracle Loader for Hadoop Configuration Property Reference
OraLoader
uses the standard methods of specifying configuration properties in the hadoop command. You can use the -conf
option to identify configuration files, and the -D
option to specify individual properties.
This section describes the OraLoader configuration properties, the Oracle NoSQL Database configuration properties, and a few generic Hadoop MapReduce properties that you typically must set for an OraLoader job.
A configuration file showing all OraLoader properties is in $OLH_HOME/doc/oraloader-conf.xml
.
See Also:
The Hadoop documentation for job configuration files.
MapReduce Configuration Properties
| Property | Description |
|---|---|
| mapreduce.job.name | Type: String. Default value: OraLoader. Description: The Hadoop job name. A unique name can help you monitor the job using tools such as the Hadoop JobTracker web interface and Cloudera Manager. |
| mapreduce.input.fileinputformat.inputdir | Type: String. Default value: not defined. Description: A comma-separated list of input directories. |
| mapreduce.job.inputformat.class | Type: String. Default value: not defined. Description: Identifies the format of the input data. You can enter one of the built-in input formats, or the name of a custom InputFormat class. See "About Input Formats" for descriptions of the built-in input formats. |
| mapreduce.output.fileoutputformat.outputdir | Type: String. Default value: not defined. Description: A comma-separated list of output directories, which cannot exist before the job runs. Required. |
| mapreduce.job.outputformat.class | Type: String. Default value: not defined. Description: Identifies the output type; the values are the output format classes described in "About Output Formats." |
| mapreduce.job.reduces | Type: Integer. Default value: 1. Description: The number of reduce tasks used by the Oracle Loader for Hadoop job. The default value of 1 does not support parallel processing, so performance improves when the value is increased to support multiple parallel data loads. Choose a value that provides an ample, but not excessive, number of reduce tasks for the job. Beyond a point dictated by the available resources, an excessive increase in the number of reduce tasks yields diminishing improvements and can degrade the performance of other jobs. |
OraLoader Configuration Properties
Property | Description |
---|---|
oracle.hadoop.loader.badRecordFlushInterval |
Type: Integer Default Value: Description: Sets the maximum number of records that a task attempt can log before flushing the log file. This setting limits the number of records that can be lost when the record reject limit (oracle.hadoop.loader.rejectLimit) is reached and the job stops running. The oracle.hadoop.loader.logBadRecords property must be set to true for this interval to take effect. |
oracle.hadoop.loader.compressionFactors |
Type: Decimal Default Value: Description: These values are used by Oracle Loader for Hadoop when sampling is enabled and the target table is compressed. They are the compression factors of the target table. For best performance, the values of this property should match the compression factors of the target table. The values are a comma-delimited list of name=value pairs. The names must be one of the following keywords: BASIC, OLTP, QUERY_LOW, QUERY_HIGH, ARCHIVE_LOW, ARCHIVE_HIGH. |
oracle.hadoop.loader.connection.defaultExecuteBatch |
Type: Integer Default Value: Description: The number of records inserted in one trip to the database. It applies only to JDBCOutputFormat and OCIOutputFormat. Specify a value greater than or equal to 1. Although the maximum value is unlimited, very large batch sizes are not recommended because they result in a large memory footprint without much increase in performance. A value less than 1 sets the property to the default value. |
oracle.hadoop.loader.connection.oci_url |
Type: String Default Value: Value of oracle.hadoop.loader.connection.url Description: The database connection string used by OCIOutputFormat. The following syntax specifies Socket Direct Protocol (SDP) for OCI connections: (DESCRIPTION=(ADDRESS=(PROTOCOL=SDP)(HOST=hostName)(PORT=portNumber))(CONNECT_DATA=(SERVICE_NAME=serviceName)))
This connection string does not require a "jdbc:oracle:thin:@" prefix. All characters up to and including the first at-sign (@) are removed. |
oracle.hadoop.loader.connection.password |
Type: String Default Value: Not defined Description: Password for the connecting user. Oracle recommends that you do not store your password in clear text. Use an Oracle wallet instead. |
oracle.hadoop.loader.connection.sessionTimeZone |
Type: String Default Value: Description: Alters the session time zone for database connections. Valid values are an hours-and-minutes offset from UTC in [+|-]hh:mm format, LOCAL (the default time zone of the JVM), or a JVM time zone region name such as America/New_York.
This property also determines the default time zone for input data that is loaded into TIMESTAMP WITH TIME ZONE and TIMESTAMP WITH LOCAL TIME ZONE database column types. |
oracle.hadoop.loader.connection.cluster.tns_admin |
Type: String Default Value: Not defined. Description: The TNS admin location on the cluster nodes if it is different from the client-side location. By default, the client-side TNS admin location is the same as the location on cluster nodes, and it is specified by oracle.hadoop.loader.connection.tns_admin. It is invalid to specify this property without specifying oracle.hadoop.loader.connection.tns_admin. |
oracle.hadoop.loader.connection.directory |
Type: String Default Value: Not defined Description: File path to a directory on each node of the Hadoop cluster. |
oracle.hadoop.loader.connection.tns_admin |
Type: String Default Value: Not defined Description: File path to a directory on each node of the Hadoop cluster, which contains SQL*Net configuration files such as sqlnet.ora and tnsnames.ora. You must set this property when using an Oracle wallet as an external password store (as Oracle recommends). See oracle.hadoop.loader.connection.wallet_location. |
oracle.hadoop.loader.connection.tnsEntryName |
Type: String Default Value: Not defined Description: A TNS entry name defined in the tnsnames.ora file. Use this property with oracle.hadoop.loader.connection.tns_admin. |
oracle.hadoop.loader.connection.url |
Type: String Default Value: Not defined Description: The URL of the database connection. This property overrides all other connection properties. If an Oracle wallet is configured as an external password store (as Oracle recommends), then the property value must start with the jdbc:oracle:thin:@ driver prefix. The following examples show valid values of connection URLs:
|
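The general form of such URLs is shown in this sketch; the host, port, service, and TNS entry names are placeholders, not values from this manual.

```
# Thin-style URL (host, port, and service name are placeholders):
jdbc:oracle:thin:@//myhost:1521/my_service

# TNS entry name defined in tnsnames.ora (requires
# oracle.hadoop.loader.connection.tns_admin):
jdbc:oracle:thin:@my_tns_entry
```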
oracle.hadoop.loader.connection.user |
Type: String Default Value: Not defined Description: A database user name. This property requires that you also set oracle.hadoop.loader.connection.password. When using online database mode, you must set either this property or oracle.hadoop.loader.connection.wallet_location. |
oracle.hadoop.loader.connection.wallet_location |
Type: String Default Value: Not defined Description: File path to an Oracle wallet directory on each node of the Hadoop cluster, where the connection credentials are stored. When using an Oracle wallet, you must also set the following properties: oracle.hadoop.loader.connection.tns_admin, and either oracle.hadoop.loader.connection.url or oracle.hadoop.loader.connection.tnsEntryName. |
oracle.hadoop.loader.connection.cluster.wallet_location |
Type: String Default Value: Not defined. Description: The wallet location on the cluster nodes if it is different from the client-side location. By default, the client-side wallet location is the same as the location on cluster nodes, and it is specified by oracle.hadoop.loader.connection.wallet_location. It is invalid to specify this property without specifying oracle.hadoop.loader.connection.wallet_location. |
oracle.hadoop.loader.defaultDateFormat |
Type: String Default Value: Description: Parses an input field into a DATE column using a java.text.SimpleDateFormat pattern and the default locale. |
oracle.hadoop.loader.enableSorting |
Type: Boolean Default Value: Description: Controls whether output records within each reducer group are sorted. Use the oracle.hadoop.loader.sortKey property to identify the columns of the sort key. |
oracle.hadoop.loader.enforceClasspath |
Type: Boolean Default Value: Description: To prevent mismatched versions of its JARs from being added to the classpath, Oracle Loader for Hadoop checks that its internal classes are loaded from the expected JAR files. To disable this check, set the property to false. |
oracle.hadoop.loader.extTabDirectoryName |
Type: String Default Value: Description: The name of the database directory object for the external table LOCATION data files. This property applies only to DelimitedTextOutputFormat and DataPumpOutputFormat. |
oracle.hadoop.loader.input.fieldNames |
Type: String Default Value: Description: A comma-delimited list of names for the input fields. For the built-in input formats, specify names for all fields in the data, not just the fields of interest. If an input line has more fields than this property has field names, then the extra fields are discarded. If a line has fewer fields than this property has field names, then the extra fields are set to null. See "Mapping Input Fields to Target Table Columns" for loading only selected fields. The names are used to create the Avro schema for the record, so they must be valid JSON name strings. This property applies to DelimitedTextInputFormat and RegexInputFormat. |
oracle.hadoop.loader.input.fieldTerminator |
Type: String Default Value: Description: A character that indicates the end of an input field for DelimitedTextInputFormat. The value can be either a single character or \uHHHH, where HHHH is the character's UTF-16 encoding. |
oracle.hadoop.loader.input.hive.databaseName |
Type: String Default Value: Not defined Description: The name of the Hive database where the input table is stored |
oracle.hadoop.loader.input.hive.partitionFilter Note: This property is deprecated. Use oracle.hadoop.loader.input.hive.rowFilter instead.
|
Type: String Default Value: Not defined Description: A valid HiveQL expression that is used to filter the source Hive table partitions for HiveToAvroInputFormat. The expression must conform to the following restrictions:
You can use the Hive CLI to test the expression and ensure that it returns the expected results. Three cases are worth noting; an illustrative sketch follows this entry.
Example 1: Nested Expressions. Partition columns can be combined with nested boolean expressions.
Example 2: Built-in Functions. Built-in functions can be applied to partition columns.
Example 3: Bad Usage: Columns That Are Not Used to Partition the Table. Filtering on c, a column that is not used to partition the table, is unnecessary and can cause unexpected results: combining such a filter with AND is equivalent to the same expression without it, while combining it with OR loads all partitions, because any partition could contain matching rows.
|
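The following sketch illustrates the three cases; the table definition and filter expressions are hypothetical stand-ins for the original listings, not values from this manual.

```sql
-- Hypothetical partitioned source table (illustrative only).
CREATE TABLE t (c STRING)
  PARTITIONED BY (p1 STRING, p2 INT);

-- Example 1: nested expressions over partition columns.
--   p1 LIKE 'abc%' OR (p2 >= 10 AND p2 < 20)

-- Example 2: a built-in function applied to a partition column.
--   length(p1) = 3

-- Example 3: bad usage of the non-partition column c.
--   (p1 = 'abc') AND (c LIKE 'abc%')   -- equivalent to: p1 = 'abc'
--   (p1 = 'abc') OR  (c LIKE 'abc%')   -- loads ALL partitions
```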
oracle.hadoop.loader.input.hive.rowFilter |
Type: String Default Value: Not defined Description: A valid HiveQL expression that is used to filter the rows of the source Hive table for HiveToAvroInputFormat. If this value is not set (the default), Oracle Loader for Hadoop attempts to use the value of oracle.hadoop.loader.input.hive.partitionFilter, provided the table is partitioned. Otherwise, Oracle Loader for Hadoop loads the entire source Hive table. The expression must conform to the following restrictions:
In both cases, you can use the Hive CLI to test the expression and ensure that it returns the expected results. The examples parallel those shown for oracle.hadoop.loader.input.hive.partitionFilter (nested expressions and built-in functions), applied here to rows rather than partitions.
Oracle recommends that you turn on hive.optimize.index.filter when importing a subset of rows from a native Hive table.
|
oracle.hadoop.loader.input.hive.tableName |
Type: String Default Value: Not defined Description: The name of the Hive table where the input data is stored. |
oracle.hadoop.loader.input.initialFieldEncloser |
Type: String Default Value: Not defined Description: A character that indicates the beginning of a field. The value can be either a single character or \uHHHH, where HHHH is the character's UTF-16 encoding. When this property is set, the parser attempts to read each field as an enclosed token (value) before reading it as an unenclosed token. If the field enclosers are not set, then the parser reads each field as an unenclosed token. If you set this property but not oracle.hadoop.loader.input.trailingFieldEncloser, then the trailing field encloser defaults to the same character. |
oracle.hadoop.loader.input.regexCaseInsensitive |
Type: Boolean Default Value: Description: Controls whether pattern matching is case-sensitive. Set to true to ignore case, so that "string" matches "String", "STRING", "string", and so on. This property is the same as the input.regex.case.insensitive property of org.apache.hadoop.hive.contrib.serde2.RegexSerDe. |
oracle.hadoop.loader.input.regexPattern |
Type: String Default Value: Not defined Description: The pattern string for a regular expression. The regular expression must match each text line in its entirety; each capturing group becomes one input field. A sketch of a matching pattern appears after this entry.
This property is the same as the input.regex property of org.apache.hadoop.hive.contrib.serde2.RegexSerDe. See Also: For descriptions of regular expressions and capturing groups, see the java.util.regex package documentation in the Java API.
|
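As an illustration (a hypothetical line format, not this manual's original example), a pattern for a two-field comma-separated line could be written as follows; each parenthesized group yields one input field.

```
# Hypothetical input line:
#   Hello,World
# A pattern that matches the entire line, with one capturing
# group per field:
([^,]*),([^,]*)
```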
oracle.hadoop.loader.input.trailingFieldEncloser |
Type: String Default Value: The value of oracle.hadoop.loader.input.initialFieldEncloser Description: Identifies a character that marks the end of a field. The value can be either a single character or \uHHHH, where HHHH is the character's UTF-16 encoding. A field encloser cannot be the terminator or a white-space character defined for the input format. If the trailing field encloser character is embedded in an input field, then the character must be doubled up to be parsed as literal text. For example, with a single-quote encloser, an embedded quote must appear as two consecutive single quotes. If you set this property, then you must also set oracle.hadoop.loader.input.initialFieldEncloser. |
oracle.hadoop.loader.loadByPartition |
Type: Boolean Default Value: Description: Specifies a partition-aware load. Oracle Loader for Hadoop organizes the output by partition for all output formats on the Hadoop cluster; this task does not impact the resources of the database system.
|
oracle.hadoop.loader.loaderMap.columnNames |
Type: String Default Value: Not defined Description: A comma-separated list of column names in the target table, in any order. The names can be quoted or unquoted. Quoted names begin and end with double quotes (") and are used exactly as entered. Unquoted names are converted to upper case. You must also set oracle.hadoop.loader.loaderMap.targetTable. |
oracle.hadoop.loader.loaderMap.column_name.field |
Type: String Default Value: Normalized column name Description: The name of a field that contains Avro records, which is mapped to the column identified in the property name. The column name can be quoted or unquoted. A quoted name begins and ends with double quotes (") and is used exactly as entered. An unquoted name is converted to upper case. Optional. You must also set oracle.hadoop.loader.loaderMap.targetTable. |
oracle.hadoop.loader.loaderMap.column_name.format |
Type: String Default Value: Not defined Description: Specifies the data format of the data being loaded into the column identified in the property name. Use a format appropriate to the column's data type, such as a java.text.SimpleDateFormat pattern for date/time data. You must also set oracle.hadoop.loader.loaderMap.targetTable. |
oracle.hadoop.loader.loaderMap.targetTable |
Type: String Default Value: Not defined Description: A schema-qualified name for the table to be loaded. This property takes precedence over oracle.hadoop.loader.targetTable. To load a subset of columns, also set the oracle.hadoop.loader.loaderMap.columnNames property. |
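Taken together, the loader map properties might be combined as in the following configuration sketch; the table, column, and field names are hypothetical.

```xml
<!-- Hypothetical loader map: load two columns of HR.EMPLOYEES
     from input fields F0 and F1, parsing F1 as a date. -->
<property>
  <name>oracle.hadoop.loader.loaderMap.targetTable</name>
  <value>HR.EMPLOYEES</value>
</property>
<property>
  <name>oracle.hadoop.loader.loaderMap.columnNames</name>
  <value>LAST_NAME,HIRE_DATE</value>
</property>
<property>
  <name>oracle.hadoop.loader.loaderMap.LAST_NAME.field</name>
  <value>F0</value>
</property>
<property>
  <name>oracle.hadoop.loader.loaderMap.HIRE_DATE.field</name>
  <value>F1</value>
</property>
<property>
  <name>oracle.hadoop.loader.loaderMap.HIRE_DATE.format</name>
  <value>yyyy-MM-dd</value>
</property>
```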
oracle.hadoop.loader.loaderMapFile |
Loader maps are deprecated starting with Release 2.3. Use the oracle.hadoop.loader.loaderMap.* properties instead. |
oracle.hadoop.loader.log4j.propertyPrefix |
Type: String Default Value: Description: Identifies the prefix used in Apache log4j properties loaded from the job configuration file. Oracle Loader for Hadoop enables you to specify log4j properties in the hadoop command using the -conf and -D options.
All configuration properties starting with this prefix are loaded into the log4j configuration. The configuration properties are copied to log4j with their values unchanged. |
oracle.hadoop.loader.logBadRecords |
Type: Boolean Default Value: Description: Controls whether Oracle Loader for Hadoop logs bad records to a file. This property applies only to records rejected by input formats and mappers. It does not apply to errors encountered by the output formats or by the sampling feature. |
oracle.hadoop.loader.logRetentionPolicy |
Type: String Default Value: Description: Specifies when Oracle Loader for Hadoop logs should be generated and retained at the end of a job. Valid values are ALWAYS, NEVER, and ON_ERROR.
The following situations are considered errors for the purposes of logRetentionPolicy:
The following files are covered by this property:
|
oracle.hadoop.loader.olh_home |
Type: String Default Value: Value of the OLH_HOME environment variable Description: The path of the Oracle Loader for Hadoop home directory on the node where you start the OraLoader job. This path identifies the location of the required libraries. |
oracle.hadoop.loader.olhcachePath |
Type: String Default Value: Description: Identifies the full path to an HDFS directory where Oracle Loader for Hadoop can create files that are loaded into the MapReduce distributed cache. The distributed cache is a facility for caching large, application-specific files and distributing them efficiently across the nodes in a cluster. See Also: The description of the distributed cache in the Hadoop documentation. |
oracle.hadoop.loader.output.degreeOfParallelism |
Type: String Default Value: Not defined Description: If set, the value of this property controls either the number of reduce tasks for the job or, for a map-only job, the number of map tasks. If this property is not set, then the value of mapreduce.job.reduces determines the degree of parallelism.
This property provides a unified way to limit the number of database connections made by the online output formats (JDBCOutputFormat and OCIOutputFormat). |
oracle.hadoop.loader.output.dirpathBufsize |
Type: Integer Default Value: Description: Sets the size in bytes of the direct path stream buffer for OCIOutputFormat. |
oracle.hadoop.loader.output.escapeEnclosers |
Type: Boolean Default Value: Description: Controls whether the embedded trailing encloser character is handled as literal text (that is, escaped). Set this property to true when the fields may contain the trailing field encloser character. |
oracle.hadoop.loader.output.fieldTerminator |
Type: String Default Value: Description: A character that indicates the end of an output field for DelimitedTextOutputFormat. The value can be either a single character or \uHHHH, where HHHH is the character's UTF-16 encoding. |
oracle.hadoop.loader.output.granuleSize |
Type: Integer Default Value: Description: The granule size in bytes for generated Data Pump files. A granule determines the work load for a parallel process (PQ slave) when loading a file through the ORACLE_DATAPUMP access driver. See Also: Oracle Database Utilities for more information about the ORACLE_DATAPUMP access driver. |
oracle.hadoop.loader.output.initialFieldEncloser |
Type: String Default Value: Not defined Description: A character generated in the output to identify the beginning of a field. The value must be either a single character or \uHHHH, where HHHH is the character's UTF-16 encoding. Use this property when a field may contain the value of oracle.hadoop.loader.output.fieldTerminator. If you set this property, then you must also set oracle.hadoop.loader.output.trailingFieldEncloser. |
oracle.hadoop.loader.output.trailingFieldEncloser |
Type: String Default Value: Value of oracle.hadoop.loader.output.initialFieldEncloser Description: A character generated in the output to identify the end of a field. The value must be either a single character or \uHHHH, where HHHH is the character's UTF-16 encoding. Use this property when a field may contain the value of oracle.hadoop.loader.output.fieldTerminator. If you set this property, then you must also set oracle.hadoop.loader.output.initialFieldEncloser. |
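As an illustration, the following sketch configures pipe-terminated text output with double-quote enclosers; the values are hypothetical choices, not recommendations.

```xml
<!-- Illustrative delimited-text output settings: pipe terminator,
     double-quote enclosers, embedded enclosers escaped. -->
<property>
  <name>oracle.hadoop.loader.output.fieldTerminator</name>
  <value>|</value>
</property>
<property>
  <name>oracle.hadoop.loader.output.initialFieldEncloser</name>
  <value>"</value>
</property>
<property>
  <name>oracle.hadoop.loader.output.trailingFieldEncloser</name>
  <value>"</value>
</property>
<property>
  <name>oracle.hadoop.loader.output.escapeEnclosers</name>
  <value>true</value>
</property>
```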
oracle.hadoop.loader.rejectLimit |
Type: Integer Default Value: Description: The maximum number of rejected or skipped records allowed before the job stops running. A negative value turns off the reject limit and allows the job to run to completion. If speculative execution is enabled (mapreduce.map.speculative, the default), then the number of rejected records may be inflated temporarily, causing the job to stop before reaching the reject limit. Input format errors do not count toward the reject limit because they are irrecoverable and cause the map task to stop. Errors encountered by the sampling feature or the online output formats do not count toward the reject limit either. |
oracle.hadoop.loader.sampler.enableSampling |
Type: Boolean Default Value: Description: Controls whether the sampling feature is enabled. Set this property to false to disable sampling. Even when this property is set to true, the loader automatically disables sampling if it is unnecessary. |
oracle.hadoop.loader.sampler.hintMaxSplitSize |
Type: Integer Default Value: Description: Sets the Hadoop maximum split size for the sampling process only; the job's own configuration is not changed. Some input formats (such as DBInputFormat) do not use this property. Increase this value for data sets with tens of terabytes of data, or if the input format produces an excessive number of splits. Although large splits are better for I/O performance, they are not necessarily better for sampling. Set this value small enough for good sampling performance, but no smaller. Extremely small values can cause inefficient I/O performance without improving the sample. |
oracle.hadoop.loader.sampler.hintNumMapTasks |
Type: Integer Default Value: Description: Sets the value of the Hadoop map-task hint for the sampling process only; the job's own configuration is not changed. Some input formats (such as DBInputFormat) use this property as a hint to determine the number of splits. Increase this value for data sets with more than a million rows, but remember that extremely large values can cause inefficient I/O performance while not improving the sample. |
oracle.hadoop.loader.sampler.loadCI |
Type: Decimal Default Value: Description: The statistical confidence indicator for the maximum reducer load factor. This property accepts values greater than or equal to 0.5 and less than 1; typical values are 0.90, 0.95, and 0.99. |
oracle.hadoop.loader.sampler.maxHeapBytes |
Type: Integer Default Value: Description: Specifies in bytes the maximum amount of memory available to the sampler. Sampling stops when one of these conditions is true: the sampler has collected the minimum number of samples required for load balancing, the percentage of sampled data exceeds oracle.hadoop.loader.sampler.maxSamplesPct, or the number of sampled bytes exceeds this limit (when this property is set to a positive value). |
oracle.hadoop.loader.sampler.maxLoadFactor |
Type: Float Default Value: Description: The maximum acceptable load factor for a reducer. For example, a value of 0.05 means that no reducer should receive more than 5% above the ideal (perfectly balanced) load. This property accepts values greater than 0. In a perfectly balanced load, every reducer is assigned an equal amount of work (or load). The load factor is the relative overload for each reducer, calculated as (assigned_load - ideal_load)/ideal_load. If load balancing is successful, the job runs within the maximum load factor at the specified confidence. |
oracle.hadoop.loader.sampler.maxSamplesPct |
Type: Float Default Value: Description: Sets the maximum sample size as a fraction of the number of records in the input data. For example, a value of 0.05 limits the sample to at most 5% of the input records. This property accepts a range of 0 to 1 (0% to 100%). A negative value disables it. Sampling stops when one of these conditions is true: the sampler has collected the minimum number of samples required for load balancing, or the sample size reaches this limit. |
oracle.hadoop.loader.sampler.minSplits |
Type: Integer Default Value: Description: The minimum number of input splits that the sampler reads from before it makes any evaluation of the stopping condition. If the total number of input splits is less than the value of this property, then the sampler reads from all the input splits. A number less than or equal to 0 is treated the same as a value of 1. |
oracle.hadoop.loader.sampler.numThreads |
Type: Integer Default Value: Description: The number of sampler threads. A higher number of threads allows higher concurrency in sampling. A value of 1 disables multithreaded sampling. Set the value based on the processor and memory resources available on the node where you start the Oracle Loader for Hadoop job. |
oracle.hadoop.loader.sortKey |
Type: String Default Value: Not defined Description: A comma-delimited list of column names that forms a key for sorting output records within a reducer group. The column names can be quoted or unquoted identifiers: a quoted identifier begins and ends with double quotes ("), while an unquoted identifier is converted to upper case. |
oracle.hadoop.loader.tableMetadataFile |
Type: String Default Value: Not defined Description: Path to the target table metadata file. Set this property when running in offline database mode. Use the file:// syntax to specify a local file, for example: file:///home/jdoe/metadata.xml To create the table metadata file, run the OraLoaderMetadata utility. |
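A sketch of creating the metadata file with the OraLoaderMetadata utility follows; the user, connection URL, schema, table, and output file names are placeholders, and the options shown should be verified against your release.

```shell
# Sketch: generate the table metadata file on a system with
# database access, then copy it to where the OraLoader job runs.
java -cp "$OLH_HOME/jlib/*" \
    oracle.hadoop.loader.metadata.OraLoaderMetadata \
    -user HR \
    -connection_url "jdbc:oracle:thin:@//myhost:1521/my_service" \
    -schema HR \
    -table EMPLOYEES \
    -output employees_metadata.xml
```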
oracle.hadoop.loader.targetTable |
Deprecated. Use oracle.hadoop.loader.loaderMap.targetTable instead. |
Oracle NoSQL Database Configuration Properties
Property | Description |
---|---|
oracle.kv.kvstore |
Type: String Default Value: Not defined Description: The name of the KV store with the source data. |
oracle.kv.hosts |
Type: String Default Value: Not defined Description: An array of one or more hostname:port pairs that identify the hosts in the KV store with the source data. Separate multiple pairs with commas. |
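As an illustration, the two properties above might be set together as follows; the store name and host:port pairs are placeholders.

```xml
<!-- Sketch: identify the source KV store; the store name and
     hosts are placeholders. -->
<property>
  <name>oracle.kv.kvstore</name>
  <value>mystore</value>
</property>
<property>
  <name>oracle.kv.hosts</name>
  <value>node1.example.com:5000,node2.example.com:5000</value>
</property>
```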
oracle.kv.batchSize |
Type: Integer Default Value: Not defined Description: The desired number of keys for KVAvroInputFormat to fetch during each network round trip. A value of zero (0) sets the property to a default value. |
oracle.kv.parentKey |
Type: String Default Value: Not defined Description: Restricts the returned values to only the child key-value pairs of the specified key. A major key path must be a partial path, and a minor key path must be empty. A null value (the default) does not restrict the output, and so KVAvroInputFormat returns all keys in the store. |
oracle.kv.subRange |
Type: KeyRange Default Value: Not defined Description: Further restricts the returned values to a particular child under the parent key specified by oracle.kv.parentKey. |
oracle.kv.depth |
Type: Depth Default Value: Description: Restricts the returned values to a particular hierarchical depth under the value of oracle.kv.parentKey. The following keywords are valid values: CHILDREN_ONLY, DESCENDANTS_ONLY, PARENT_AND_CHILDREN, PARENT_AND_DESCENDANTS. |
oracle.kv.consistency |
Type: Consistency Default Value: Description: The consistency guarantee for reading child key-value pairs. The following keywords are valid values: ABSOLUTE, NONE_REQUIRED. |
oracle.kv.timeout |
Type: Long Default Value: Description: Sets a maximum time interval in milliseconds for retrieving a selection of key-value pairs. A value of zero (0) sets the property to its default value. |
oracle.kv.formatterClass |
Type: String Default Value: Not defined Description: Specifies the name of a class that implements the AvroFormatter interface to format key-value pairs into Avro records. Because the Avro records from Oracle NoSQL Database pass directly to Oracle Loader for Hadoop, the NoSQL keys are not available for mapping into the target Oracle Database table. However, the formatter class receives both the NoSQL key and value, enabling the class to create and return a new Avro record that contains both the value and key, which can be passed to Oracle Loader for Hadoop. |