3 Hadoop

This chapter describes how to use the knowledge modules in Oracle Data Integrator (ODI) Application Adapter for Hadoop.

It contains the following sections:

3.1 Introduction

Apache Hadoop is designed to store and process data that typically comes from nonrelational data sources and arrives in volumes beyond what relational databases can handle.

Oracle Data Integrator (ODI) Application Adapter for Hadoop enables data integration developers to integrate and transform data easily within Hadoop using Oracle Data Integrator. Employing familiar and easy-to-use tools and preconfigured knowledge modules (KMs), the application adapter provides the following capabilities:

  • Loading data into Hadoop from the local file system, HDFS, HBase (using Hive), and SQL database (using SQOOP)

  • Performing validation and transformation of data within Hadoop

  • Loading processed data from Hadoop to an Oracle database, an SQL database (using SQOOP), or HBase for further processing and generating reports

Knowledge modules (KMs) contain the information needed by Oracle Data Integrator to perform a specific set of tasks against a specific technology. An application adapter is a group of knowledge modules. Thus, Oracle Data Integrator Application Adapter for Hadoop is a group of knowledge modules for accessing data stored in Hadoop.

3.1.1 Concepts

Typical processing in Hadoop includes data validation and transformations that are programmed as MapReduce jobs. Designing and implementing a MapReduce job requires expert programming knowledge. However, when you use Oracle Data Integrator and Oracle Data Integrator Application Adapter for Hadoop, you do not need to write MapReduce jobs. Oracle Data Integrator uses Apache Hive and the Hive Query Language (HiveQL), a SQL-like language for implementing MapReduce jobs.

When you implement a big data processing scenario, the first step is to load the data into Hadoop. The data source is typically the local file system, HDFS, HBase, or Hive tables.

After the data is loaded, you can validate and transform it by using HiveQL like you use SQL. You can perform data validation (such as checking for NULLS and primary keys), and transformations (such as filtering, aggregations, set operations, and derived tables). You can also include customized procedural snippets (scripts) for processing the data.
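
For example, a NULL check combined with an aggregation can be expressed directly in HiveQL and run from the command line; Hive compiles the query into MapReduce jobs. This is only an illustrative sketch: the web_logs table and its columns are assumptions, not objects created elsewhere in this chapter.

    $ hive -e "SELECT session_id, COUNT(*) AS page_views FROM web_logs WHERE session_id IS NOT NULL GROUP BY session_id"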

When the data has been aggregated, condensed, or processed into a smaller data set, you can load it into an Oracle database, other relational database, or HBase for further processing and analysis. Oracle Loader for Hadoop is recommended for optimal loading into an Oracle database.

3.1.2 Knowledge Modules

Oracle Data Integrator provides the knowledge modules (KMs) described in Table 3-1 for use with Hadoop.

Table 3-1 Oracle Data Integrator Application Adapter for Hadoop Knowledge Modules

IKM File to Hive (Load Data)

Loads data from local and HDFS files into Hive tables. It provides options for better performance through Hive partitioning and fewer data movements.

This knowledge module supports wildcards (*, ?).

Source: File system. Target: Hive.

IKM Hive Control Append

Integrates data into a Hive target table in truncate/insert (append) mode. Data can be controlled (validated). Invalid data is isolated in an error table and can be recycled.

Source: Hive. Target: Hive.

IKM Hive Transform

Integrates data into a Hive target table after the data has been transformed by a customized script such as Perl or Python.

Source: Hive. Target: Hive.

IKM File-Hive to Oracle (OLH-OSCH)

Integrates data from an HDFS file or Hive source into an Oracle database target using Oracle Loader for Hadoop, Oracle SQL Connector for HDFS, or both.

Source: File system or Hive. Target: Oracle Database.

IKM File-Hive to SQL (SQOOP)

Integrates data from an HDFS file or Hive data source into an SQL database target using SQOOP. SQOOP uses parallel JDBC connections for loading data.

Source: File system or Hive. Target: SQL Database.

IKM SQL to Hive-HBase-File (SQOOP)

Integrates data from an SQL database into a Hive table, HBase table, or HDFS file using SQOOP. SQOOP uses parallel JDBC connections for unloading data.

Source: SQL Database. Target: Hive, HBase, or File system.

IKM Hive to HBase Incremental Update (HBase-SerDe)

Integrates data from a Hive table into an HBase table. It supports inserting new rows and updating existing rows.

Source: Hive. Target: HBase.

LKM HBase to Hive (HBase-SerDe)

Loads data from an HBase table into a Hive table. It provides read-only access to the source HBase table from Hive.

It defines a temporary load table definition on Hive, which represents all the relevant columns of the HBase source table.

Source: HBase. Target: Hive.

CKM Hive

Validates data against constraints.

Source: NA. Target: Hive.

RKM Hive

Reverse engineers Hive tables.

Source: Hive metadata. Target: NA.

RKM HBase

Reverse engineers HBase tables.

Source: HBase metadata. Target: NA.


3.2 Installation and Configuration

Installation requirements for Oracle Data Integrator (ODI) Application Adapter for Hadoop are provided in these topics:

3.2.1 System Requirements and Certifications

To use Oracle Data Integrator Application Adapter for Hadoop, you must first have Oracle Data Integrator, which is licensed separately from Oracle Big Data Connectors. You can download ODI from the Oracle website at

http://www.oracle.com/technetwork/middleware/data-integrator/downloads/index.html

Oracle Data Integrator Application Adapter for Hadoop requires Oracle Data Integrator version 11.1.1.6.0 or later.

Before performing any installation, read the system requirements and certification documentation to ensure that your environment meets the minimum installation requirements for the products that you are installing.

The list of supported platforms and versions is available on Oracle Technology Network:

http://www.oracle.com/technetwork/middleware/data-integrator/overview/index.html

3.2.2 Technology-Specific Requirements

The list of supported technologies and versions is available on Oracle Technology Network:

http://www.oracle.com/technetwork/middleware/data-integrator/overview/index.html

3.2.3 Location of Oracle Data Integrator Application Adapter for Hadoop

Oracle Data Integrator Application Adapter for Hadoop is available in the ODI_Home/odi/sdk/xml-reference directory.

3.3 Setting Up the Topology

To set up the topology in Oracle Data Integrator, you need to identify the data server and the physical and logical schemas that store the file system, Hive, and HBase information.

This section contains the following topics:

Note:

Many of the environment variables described in the following sections are already configured for Oracle Big Data Appliance. See the configuration script at /opt/oracle/odiagent-version/agent_standalone/odi/agent/bin/HadoopEnvSetup.sh

3.3.1 Setting Up File Data Sources

In the Hadoop context, there is a distinction between files in Hadoop Distributed File System (HDFS) and local files (outside of HDFS).

To define a data source: 

  1. Create a Data Server object under File technology.

  2. Create a Physical Schema object for every directory to be accessed.

  3. Create a Logical Schema object for every directory to be accessed.

  4. Create a Model for every Logical Schema.

  5. Create one or more data stores for each different type of file and wildcard name pattern.

  6. For HDFS files, create a Data Server object under File technology by entering the HDFS name node in the JDBC URL field and leaving the JDBC Driver name empty. For example:

    hdfs://bda1node01.example.com:8020
    

    Test Connection is not supported for this Data Server configuration.

    Note:

    No dedicated technology is defined for HDFS files.
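
    For example, you can confirm from the ODI agent computer that the name node URI entered in the JDBC URL field is reachable by running a simple HDFS listing. This is a sketch only, using the example name node shown above:

    $ hdfs dfs -ls hdfs://bda1node01.example.com:8020/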

3.3.2 Setting Up Hive Data Sources

The following steps in Oracle Data Integrator are required for connecting to a Hive system. Oracle Data Integrator connects to Hive by using JDBC.

Prerequisites

The Hive technology must be included in the standard Oracle Data Integrator technologies. If it is not, then import the technology in INSERT_UPDATE mode from the xml-reference directory.

You must add all Hive-specific flex fields.

To set up a Hive data source: 

  1. Create a Data Server object under Hive technology.

  2. Set the following locations under JDBC:

    JDBC Driver: weblogic.jdbc.hive.HiveDriver

    JDBC URL: jdbc:weblogic:hive://<host>:<port>[; property=value[;...]]

    For example, jdbc:weblogic:hive://localhost:10000;DatabaseName=default;User=default;Password=default

    Note:

    Usually the user ID and password are provided in the respective fields of the ODI Data Server definition. If a Hive user is defined without a password, "password=default" must be added to the URL, and the password field of the Data Server should be left blank.
  3. Set the following under Flexfields:

    Hive Metastore URIs: for example, thrift://BDA:10000

  4. Ensure that the Hive server is up and running.

  5. Test the connection to the Data Server.

  6. Create a Physical Schema. Enter the name of the Hive schema in both schema fields of the Physical Schema definition.

  7. Create a Logical Schema object.

  8. Import RKM Hive into Global Objects or a project.

  9. Create a new model for Hive Technology pointing to the logical schema.

  10. Perform a custom reverse-engineering operation using RKM Hive.

At the end of this process, the Hive Data Model contains all Hive tables with their columns, partitioning, and clustering details stored as flex field values.
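
Independently of ODI, you can sanity-check that HiveServer2 accepts JDBC connections by using the Beeline client that ships with recent Hive distributions. This is only a sketch: the host, port, user, and password are the same placeholder values used in the JDBC URL example above, and ODI itself still connects through the weblogic.jdbc.hive.HiveDriver configured on the Data Server.

    $ beeline -u "jdbc:hive2://localhost:10000/default" -n default -p default -e "SHOW TABLES;"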

3.3.3 Setting Up HBase Data Sources

The following steps in Oracle Data Integrator are required for connecting to an HBase system.

Prerequisites

The HBase technology must be included in the standard Oracle Data Integrator technologies. If it is not, then import the technology in INSERT_UPDATE mode from the xml-reference directory.

You must add all HBase-specific flex fields.

To set up an HBase data source: 

  1. Create a Data Server object under HBase technology.

    JDBC Driver and URL are not available for data servers of this technology.

  2. Set the following under Flexfields:

    HBase Quorum: Quorum of the HBase installation. For example, localhost:2181

  3. Ensure that the HBase server is up and running.

    Note:

    You cannot test the connection to the HBase Data Server.
  4. Create a Physical Schema.

  5. Create a Logical Schema object.

  6. Import RKM HBase into Global Objects or a project.

  7. Create a new model for HBase Technology pointing to the logical schema.

  8. Perform a custom reverse-engineering operation using RKM HBase.

At the end of this process, the HBase Data Model contains all the HBase tables with their columns and data types.
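
Because Test Connection is not available for the HBase Data Server, a quick external check that the HBase server and its ZooKeeper quorum are reachable can be run from the HBase shell. This is a sketch only; it assumes the HBase client is installed on the machine where you run it:

    $ echo "list" | hbase shell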

3.3.4 Connecting to a Secure Cluster

To run the Oracle Data Integrator agent on a Hadoop cluster that is protected by Kerberos authentication, you must perform additional configuration steps.

To use a Kerberos-secured cluster: 

  1. Log in to node04 of Oracle Big Data Appliance, where the Oracle Data Integrator agent runs.

  2. Generate a new Kerberos ticket for the oracle user. Use the following command, replacing realm with the actual Kerberos realm name.

    $ kinit oracle@realm

  3. Set the environment variables by using the following commands. Substitute the appropriate values for your appliance:

    $ export KRB5CCNAME=Kerberos-ticket-cache-directory

    $ export KRB5_CONFIG=Kerberos-configuration-file

    $ export HADOOP_OPTS="$HADOOP_OPTS -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl -Djava.security.krb5.conf=Kerberos-configuration-file"

    In this example, the configuration files are named krb5* and are located in /tmp/oracle_krb/:

    $ export KRB5CCNAME=/tmp/oracle_krb/krb5cc_1000

    $ export KRB5_CONFIG=/tmp/oracle_krb/krb5.conf

    $ export HADOOP_OPTS="$HADOOP_OPTS -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl -Djava.security.krb5.conf=/tmp/oracle_krb/krb5.conf"

  4. Redefine the JDBC connection URL, using syntax like the following:

    jdbc:hive2://node1:10000/default;principal=HiveServer2-Kerberos-Principal

    For example:

    jdbc:hive2://bda1node01.example.com:10000/default;principal=hive/HiveServer2Host@EXAMPLE.COM

    See also, "HiveServer2 Security Configuration" in the CDH5 Security Guide at the following URL:

    http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Security-Guide/cdh5sg_hiveserver2_security.html

  5. Renew the Kerberos ticket for the oracle user on a regular basis to prevent disruptions in service.

    See Oracle Big Data Appliance Software User's Guide for instructions about managing Kerberos on Oracle Big Data Appliance.
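
    For example, a scheduled renewal can be set up with cron by using a keytab, so that kinit does not prompt for a password. The keytab path, realm, and schedule below are assumptions; adjust them to your environment:

    # Illustrative crontab entry: renew the ticket every 8 hours using a keytab
    0 */8 * * * /usr/bin/kinit -k -t /home/oracle/oracle.keytab oracle@EXAMPLE.COM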

3.3.5 Setting Up the Oracle Data Integrator Agent to Execute Hadoop Jobs

After setting up an Oracle Data Integrator agent, configure it to work with Oracle Data Integrator Application Adapter for Hadoop.

To configure the Oracle Data Integrator agent: 

  1. Install Hadoop on your Oracle Data Integrator agent computer.

    For Oracle Big Data Appliance, see Oracle Big Data Appliance Software User's Guide for instructions for setting up a remote Hadoop client.

  2. Install Hive on your Oracle Data Integrator agent computer.

  3. Install SQOOP on your Oracle Data Integrator agent computer.

  4. Set the following base environment variables for Hadoop and Hive on your ODI agent computer.

    Table 3-2 Environment variables mandatory for Hadoop and Hive

    Environment Variable Value

    HADOOP_HOME

    Location of the Hadoop home directory. For example, /usr/lib/hadoop

    HADOOP_CONF

    Location of Hadoop configuration files such as core-default.xml, core-site.xml, and hdfs-site.xml. For example, /home/shared/hadoop-conf

    HIVE_HOME

    Location of the Hive home directory. For example, /usr/lib/hive

    HIVE_CONF

    Location of Hive configuration files such as hive-site.xml. For example, /home/shared/hive-conf

    HADOOP_CLASSPATH

    $HIVE_HOME/lib/hive-metastore-*.jar:$HIVE_HOME/lib/libthrift-*.jar:$HIVE_HOME/lib/libfb*.jar:$HIVE_HOME/lib/hive-exec-*.jar:$HIVE_CONF

    ODI_ADDITIONAL_CLASSPATH

    $HIVE_HOME/lib/'*':$HADOOP_HOME/client/*:$HADOOP_CONF

    ODI_HIVE_SESSION_JARS

    $HIVE_HOME/lib/hive-contrib-*.jar:<ODI library directory>/wlhive.jar

    • The actual path of wlhive.jar can be determined under the ODI installation home.

    • Include other JAR files as required, such as custom SerDes JAR files. These JAR files are added to every Hive JDBC session and thus are added to every Hive MapReduce job.

    • The list of JAR files is separated by ":"; wildcards in file names must not evaluate to more than one file.

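    A minimal sketch of these settings as shell exports in the ODI agent's environment, using the example locations from the table above (all paths are assumptions and must match your installation):

    $ export HADOOP_HOME=/usr/lib/hadoop
    $ export HADOOP_CONF=/home/shared/hadoop-conf
    $ export HIVE_HOME=/usr/lib/hive
    $ export HIVE_CONF=/home/shared/hive-conf
    $ export HADOOP_CLASSPATH=$HIVE_HOME/lib/hive-metastore-*.jar:$HIVE_HOME/lib/libthrift-*.jar:$HIVE_HOME/lib/libfb*.jar:$HIVE_HOME/lib/hive-exec-*.jar:$HIVE_CONF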

  5. If you plan to use HBase features, set the following environment variables on your ODI agent computer. Note that you need to set these environment variables in addition to the base Hadoop and Hive environment variables.

    Table 3-3 Environment Variables mandatory for HBase (In addition to base Hadoop and Hive environment variables)

    Environment Variable Value

    HBASE_HOME

    Location of the HBase home directory. For example, /usr/lib/hbase

    HADOOP_CLASSPATH

    $HBASE_HOME/lib/hbase-*.jar:$HIVE_HOME/lib/hive-hbase-handler*.jar:$HBASE_HOME/hbase.jar

    ODI_ADDITIONAL_CLASSPATH

    $HBASE_HOME/hbase.jar

    ODI_HIVE_SESSION_JARS

    $HBASE_HOME/hbase.jar:$HBASE_HOME/lib/hbase-sep-api-*.jar:$HBASE_HOME/lib/hbase-sep-impl-*hbase*.jar:$HBASE_HOME/lib/hbase-sep-impl-common-*.jar:$HBASE_HOME/lib/hbase-sep-tools-*.jar:$HIVE_HOME/lib/hive-hbase-handler-*.jar


To use Oracle Loader for Hadoop: 

  1. Install Oracle Loader for Hadoop on your Oracle Data Integrator agent system. See Installing Oracle Loader for Hadoop in Oracle Big Data Connectors User's Guide.

  2. To use Oracle SQL Connector for HDFS (OLH_OUTPUT_MODE=DP_OSCH or OSCH), you must first install it. See "Oracle SQL Connector for Hadoop Distributed File System Setup" in Oracle Big Data Connectors User's Guide.

  3. Set the following environment variables for Oracle Loader for Hadoop on your ODI agent computer. Note that you must set these environment variables in addition to the base Hadoop and Hive environment variables.

    Table 3-4 Environment Variables mandatory for Oracle Loader for Hadoop (In addition to base Hadoop and Hive environment variables)

    Environment Variable Value

    OLH_HOME

    Location of OLH installation. For example, /u01/connectors/olh

    OSCH_HOME

    Location of OSCH installation. For example, /u01/connectors/osch

    HADOOP_CLASSPATH

    $OLH_HOME/jlib/*:$OSCH_HOME/jlib/*

    To work with OLH, the Hadoop JAR files in HADOOP_CLASSPATH must be listed explicitly, without wildcards.

    ODI_OLH_JARS

    Comma-separated list of all JAR files required for custom input formats, Hive, Hive SerDes, and so forth, used by Oracle Loader for Hadoop. All filenames have to be expanded without wildcards.

    For example:

    $HIVE_HOME/lib/hive-metastore-0.10.0-cdh4.5.0.jar,$HIVE_HOME/lib/libthrift-0.9.0-cdh4-1.jar,$HIVE_HOME/lib/libfb303-0.9.0.jar

    ODI_OLH_SHAREDLIBS

    $OLH_HOME/lib/libolh12.so,$OLH_HOME/lib/libclntsh.so.12.1,$OLH_HOME/lib/libnnz12.so,$OLH_HOME/lib/libociei.so,$OLH_HOME/lib/libclntshcore.so.12.1,$OLH_HOME/lib/libons.so

    ODI_ADDITIONAL_CLASSPATH

    $OSCH_HOME/jlib/'*'

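    The ODI_OLH_JARS option above requires a comma-separated list in which all file names are expanded, without wildcards. One way to generate such a list on the ODI agent computer is sketched below; it assumes the JAR files exist at the shown wildcard locations and that each pattern matches exactly one file:

    $ ls $HIVE_HOME/lib/hive-metastore-*.jar $HIVE_HOME/lib/libthrift-*.jar $HIVE_HOME/lib/libfb303-*.jar | paste -s -d ',' -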

3.3.6 Configuring Oracle Data Integrator Studio for Executing Hadoop Jobs on the Local Agent

For executing Hadoop jobs on the local agent of an Oracle Data Integrator Studio installation, follow the configuration steps in the previous section with the following change: Copy JAR files into the Oracle Data Integrator userlib directory instead of the drivers directory. For example:

Linux: $USER_HOME/.odi/oracledi/userlib directory.

Windows: C:\Users\<USERNAME>\AppData\Roaming\odi\oracledi\userlib directory
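
For example, on Linux, an additional driver or SerDes JAR file can be copied into the local agent's userlib directory as shown below; the JAR name is a placeholder for whichever file your configuration requires:

    $ cp /path/to/custom-serde.jar $HOME/.odi/oracledi/userlib/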

3.4 Setting Up an Integration Project

Setting up a project follows the standard procedures. See Developing Integration Projects with Oracle Data Integrator.

Import the following KMs into Global Objects or a project:

  • IKM File to Hive (Load Data)

  • IKM Hive Control Append

  • IKM Hive Transform

  • IKM File-Hive to Oracle (OLH-OSCH)

  • IKM File-Hive to SQL (SQOOP)

  • IKM SQL to Hive-HBase-File (SQOOP)

  • IKM Hive to HBase Incremental Update (HBase-SerDe)

  • LKM HBase to Hive (HBase-SerDe)

  • CKM Hive

  • RKM Hive

  • RKM HBase

3.5 Creating an Oracle Data Integrator Model from a Reverse-Engineered Hive and HBase Model

This section contains the following topics:

3.5.1 Creating a Model

To create a model that is based on the technology hosting Hive or HBase and on the logical schema created when you configured the Hive or HBase connection, follow the standard procedure described in Developing Integration Projects with Oracle Data Integrator.

3.5.2 Reverse Engineering Hive Tables

RKM Hive is used to reverse engineer Hive tables and views. To perform a customized reverse-engineering of Hive tables with RKM Hive, follow the usual procedures, as described in Developing Integration Projects with Oracle Data Integrator. This topic details information specific to Hive tables.

The reverse-engineering process creates the data stores for the corresponding Hive tables or views. You can use the data stores as either a source or a target in a mapping.

RKM Hive

RKM Hive reverses these metadata elements:

  • Hive tables and views as Oracle Data Integrator data stores.

    Specify the reverse mask in the Mask field, and then select the tables and views to reverse. The Mask field in the Reverse Engineer tab filters reverse-engineered objects based on their names. The Mask field cannot be empty and must contain at least the percent sign (%).

  • Hive columns as Oracle Data Integrator attributes with their data types.

  • Information about buckets, partitioning, clusters, and sort columns is set in the respective flex fields in the Oracle Data Integrator data store or column metadata.

Table 3-5 describes the created flex fields.

Table 3-5 Flex Fields for Reverse-Engineered Hive Tables and Views

Hive Buckets (Object: DataStore)

Flex field code: HIVE_BUCKETS. Flex field type: String.

Number of buckets to be used for clustering.

Hive Partition Column (Object: Column)

Flex field code: HIVE_PARTITION_COLUMN. Flex field type: Numeric.

All partitioning columns are marked as "1". Partition information can come from the following:

  • Mapped source column

  • Constant value specified in the target column

  • File name fragment

Hive Cluster Column (Object: Column)

Flex field code: HIVE_CLUSTER_COLUMN. Flex field type: Numeric.

All cluster columns are marked as "1".

Hive Sort Column (Object: Column)

Flex field code: HIVE_SORT_COLUMN. Flex field type: Numeric.

All sort columns are marked as "1".


3.5.3 Reverse Engineering HBase Tables

RKM HBase is used to reverse engineer HBase tables. To perform a customized reverse-engineering of HBase tables with RKM HBase, follow the usual procedures, as described in Developing Integration Projects with Oracle Data Integrator. This topic details information specific to HBase tables.

The reverse-engineering process creates the data stores for the corresponding HBase table. You can use the data stores as either a source or a target in a mapping.

RKM HBase

RKM HBase reverses these metadata elements:

  • HBase tables as Oracle Data Integrator data stores.

    Specify the reverse mask in the Mask field, and then select the tables to reverse. The Mask field in the Reverse Engineer tab filters reverse-engineered objects based on their names. The Mask field cannot be empty and must contain at least the percent sign (%).

  • HBase columns as Oracle Data Integrator attributes with their data types.

  • The HBase unique row key as an Oracle Data Integrator attribute called key.

Table 3-6 describes the options for RKM HBase.

Table 3-6 RKM HBase Options

Option Description

SCAN_MAX_ROWS

Specifies the maximum number of rows to be scanned during reversing of a table. The default value is 10000.

SCAN_START_ROW

Specifies the key of the row on which to start the scan. By default, the scan starts at the first row. The row key is specified as a Java expression returning an instance of org.apache.hadoop.hbase.util.Bytes. Example: Bytes.toBytes("EMP000001").

SCAN_STOP_ROW

Specifies the key of the row on which to stop the scan. By default, the scan runs to the last row of the table or until SCAN_MAX_ROWS is reached. The row key is specified as a Java expression returning an instance of org.apache.hadoop.hbase.util.Bytes. Example: Bytes.toBytes("EMP000999").

Only applies if SCAN_START_ROW is specified.

SCAN_ONLY_FAMILY

Restricts the scan to column families whose names match this pattern. The SQL LIKE wildcards percent sign (%) and underscore (_) can be used. By default, all column families are scanned.

LOG_FILE_NAME

Specifies the path and file name of the log file. The default path is the user home directory, and the default file name is reverse.log.


Table 3-7 describes the created flex fields.

Table 3-7 Flex Fields for Reverse-Engineered HBase Tables

HBase Quorum (Object: DataStore)

Flex field code: HBASE_QUORUM. Flex field type: String.

Comma-separated list of ZooKeeper nodes. It is used by the HBase client to locate the HBase Master server and the HBase Region servers.

HBase storage type (Object: Column)

Flex field code: HBASE_STORAGE_TYPE. Flex field type: String.

Defines how a data type is physically stored in HBase.

Permitted values are Binary and String (default).


3.6 Designing a Mapping

After reverse engineering Hive tables and configuring them, you can choose from these mapping configurations:

3.6.1 Loading Data from Files into Hive

To load data from the local file system or the HDFS file system into Hive tables:

  1. Create the data stores for local files and HDFS files.

    Refer to Connectivity and Knowledge Modules Guide for Oracle Data Integrator for information about reverse engineering and configuring local file data sources.

  2. Create a mapping using the file data store as the source and the corresponding Hive table as the target. Use the IKM File to Hive (Load Data) knowledge module specified in the physical diagram of the mapping. This integration knowledge module loads data from flat files into Hive, replacing or appending any existing data.

IKM File to Hive

IKM File to Hive (Load Data) supports:

  • One or more input files. To load multiple source files, enter an asterisk or a question mark as a wildcard character in the resource name of the file DataStore (for example, webshop_*.log).

  • File formats:

    • Fixed length

    • Delimited

    • Customized format

  • Loading options:

    • Immediate or deferred loading

    • Overwrite or append

    • Hive external tables

Table 3-8 describes the options for IKM File to Hive (Load Data). See the knowledge module for additional details.

Table 3-8 IKM File to Hive Options

Option Description

CREATE_TARG_TABLE

Select this option to create the target table. If USE_STAGING_TABLE is set to false, note that data is read correctly only if the target table definition, particularly the row format and file format details, is correct.

TRUNCATE

Set this option to true to replace the target table or partition content with the new data. Otherwise, the new data is appended to the target table. If TRUNCATE and USE_STAGING_TABLE are both set to false, all source file names must be unique and must not collide with any data files already loaded into the target table.

FILE_IS_LOCAL

Defines whether the source file is to be considered local (outside of the current Hadoop cluster). If this option is set to true, the data files are first copied into the Hadoop cluster. The files must be accessible by the Hive server through the local or shared file system. If this option is set to false, the data files are moved into the Hadoop cluster and are therefore no longer available at their source location. If the source file is already in HDFS, setting this option to false results in just a file rename, so the operation is very fast.

This option applies only if EXTERNAL_TABLE is set to false.

EXTERNAL_TABLE

Defines whether to declare the target or staging table as externally managed. For non-external tables, Hive manages all data files; that is, it moves any data files into <hive.metastore.warehouse.dir>/<table_name>. For external tables, Hive does not move or delete any files; it loads data from the location given by the ODI schema.

If this option is set to true:

  • All files in the directory given by the physical data schema are loaded, so any file name or wildcard information from the source DataStore's resource name is ignored.

  • The directory structure and file names must comply with Hive's directory organization for tables, for example, for partitioning and clustering.

  • The directory and its files must reside in HDFS.

  • No Hive LOAD DATA statements are submitted, and thus loading files into a specific partition (using a target-side expression) is not possible.

USE_STAGING_TABLE

Defines whether an intermediate staging table will be created.

A Hive staging table is required if:

  • Target table is partitioned, but data spreads across partitions

  • Target table is clustered

  • Target table (partition) is sorted, but input file is not

  • Target table is already defined and target table definition does not match the definition required by the KM

  • Target column order does not match source file column order

  • There are any unmapped source columns

  • There are any unmapped non-partition target columns

  • The source is a fixed length file and the target has non-string columns

If none of the above applies, this option can be turned off for better performance.

DELETE_TEMPORARY_OBJECTS

Removes temporary objects, such as tables, files, and scripts after integration. Set this option to No if you want to retain the temporary files, which might be useful for debugging.

DEFER_TARGET_LOAD

Defines whether the files that have been declared to the staging table should be loaded into the target table now or during a later execution. Permitted values are START, NEXT, END, or <empty>.

This option applies only if USE_STAGING_TABLE is set to true.

The typical use case for this option is when there are multiple files, each requiring data redistribution or sorting, and the files are gathered by calling the interface several times. For example, the interface is used in a package that retrieves (many small) files from different locations, and the location, stored in an Oracle Data Integrator variable, is to be used in a target partition column. In this case, the first interface execution has DEFER_TARGET_LOAD set to START, the following executions have it set to NEXT, and the last one has it set to END. The interfaces with DEFER_TARGET_LOAD set to START or NEXT just load the data file into HDFS (but not yet into the target table) and can be executed in parallel to accelerate the file upload to the cluster.

OVERRIDE_ROW_FORMAT

Allows you to override the entire Hive row format definition of the staging table (if USE_STAGING_TABLE is set to true) or the target table (if USE_STAGING_TABLE is set to false). It contains the text to be used for the row format definition. Example for reading Apache Combined WebLog files:

ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*) (\".*?\") (\".*?\") (\".*?\")",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s"
)
STORED AS TEXTFILE

The list of columns in the source DataStore must match the list of input groups in the regular expression (same number of columns and appropriate data types). If USE_STAGING_TABLE is set to false, the number of target columns must match the number of columns returned by the SerDe, in the above example, the number of groups in the regular expression. The number of source columns is ignored (at least one column must be mapped to the target). All source data is mapped into the target table structure according to the column order: the SerDe's first column is mapped to the first target column, the SerDe's second column to the second target column, and so on. If USE_STAGING_TABLE is set to true, the source DataStore must have as many columns as the SerDe returns. Only data of mapped columns is transferred.

STOP_ON_FILE_NOT_FOUND

Defines whether the KM should stop if the input file is not found.

HIVE_COMPATIBILE

Specifies the Hive version compatibility. The values permitted for this option are 0.7 and 0.8.

  • 0.7: Simulates the append behavior. Must be used for Hive 0.7 (CDH3).

  • 0.8: Uses Hive's append feature, which provides better performance. Requires Hive 0.8 (CDH4) or later.

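Because the FILE_IS_LOCAL option changes how the KM picks up the source files, it can be useful to confirm where they actually reside before running the mapping. A quick check from the ODI agent computer is sketched below; the webshop_*.log pattern matches the resource name example given earlier, and the directory paths are assumptions:

    $ ls /tmp/weblogs/webshop_*.log
    $ hdfs dfs -ls /user/odi/weblogs/webshop_*.log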

3.6.2 Loading Data from HBase into Hive

To load data from an HBase table into Hive:

  1. Create a data store for the HBase table that you want to load in Hive.

    Refer to "Setting Up HBase Data Sources" for information about reverse engineering and configuring HBase data sources.

  2. Create a mapping using the HBase data store as the source and the corresponding Hive table as the target. Use the LKM HBase to Hive (HBase-SerDe) knowledge module, specified in the Physical diagram of the mapping. This knowledge module provides read access to an HBase table from Hive.

LKM HBase to Hive (HBase-SerDe)

LKM HBase to Hive (HBase-SerDe) supports:

  • A single source HBase table.

Table 3-9 describes the options for LKM HBase to Hive (HBase-SerDe). See the knowledge module for additional details.

Table 3-9 LKM HBase to Hive (HBase-SerDe) Options

Option Description

DELETE_TEMPORARY_OBJECTS

Deletes temporary objects such as tables, files, and scripts post data integration. Set this option to NO if you want to retain the temporary objects, which might be useful for debugging.


3.6.3 Loading Data from Hive into HBase

To load data from a Hive table into HBase:

  1. Create a data store for the Hive tables that you want to load in HBase.

    Refer to "Setting Up Hive Data Sources" for information about reverse engineering and configuring Hive data sources.

  2. Create a mapping using the Hive data store as the source and the corresponding HBase table as the target. Use the IKM Hive to HBase Incremental Update (HBase-SerDe) knowledge module, specified in the physical diagram of the mapping. This integration knowledge module loads data from Hive into HBase. It supports inserting new rows and updating existing rows.

IKM Hive to HBase Incremental Update (HBase-SerDe)

IKM Hive to HBase Incremental Update (HBase-SerDe) supports:

  • Filters, Joins, Datasets, Transformations and Aggregations in Hive

  • Inline views generated by IKM Hive Transform

  • Inline views generated by IKM Hive Control Append

Table 3-10 describes the options for IKM Hive to HBase Incremental Update (HBase-SerDe). See the knowledge module for additional details.

Table 3-10 IKM Hive to HBase Incremental Update (HBase-SerDe) Options

Option Description

CREATE_TARG_TABLE

Creates the HBase target table.

TRUNCATE

Replaces the target table content with the new data. If this option is set to false, the new data is appended to the target table.

DELETE_TEMPORARY_OBJECTS

Deletes temporary objects such as tables, files, and scripts post data integration. Set this option to NO if you want to retain the temporary objects, which might be useful for debugging.

HBASE_WAL

Enables or disables the Write-Ahead-Log (WAL) that HBase uses to protect against data loss. For better performance, WAL can be disabled.


3.6.4 Loading Data from an SQL Database into Hive, HBase, and File

To load data from an SQL database into a Hive, HBase, or File target:

  1. Create a data store for the SQL source that you want to load into Hive, HBase, or File target.

    Refer to Connectivity and Knowledge Modules Guide for Oracle Data Integrator for information about reverse engineering and configuring SQL data sources.

  2. Create a mapping using the SQL source data store as the source and the corresponding HBase table, Hive table, or HDFS files as the target. Use the IKM SQL to Hive-HBase-File (SQOOP) knowledge module, specified in the Physical diagram of the mapping. This integration knowledge module loads data from an SQL source into a Hive, HBase, or File target. It uses SQOOP to load the data, and SQOOP uses parallel JDBC connections.

IKM SQL to Hive-HBase-File (SQOOP)

IKM SQL to Hive-HBase-File (SQOOP) supports:

  • Mappings on staging

  • Joins on staging

  • Filter expressions on staging

  • Datasets

  • Lookups

  • Derived tables

Table 3-11 describes the options for IKM SQL to Hive-HBase-File (SQOOP). See the knowledge module for additional details.

Table 3-11 IKM SQL to Hive-HBase-File (SQOOP) Options

Option Description

CREATE_TARG_TABLE

Creates the target table. This option is applicable only if the target is Hive or HBase.

TRUNCATE

Replaces any existing target table content with the new data. For Hive and HBase targets, the target data is truncated. For File targets, the target directory is removed. For File targets, this option must be set to true.

SQOOP_PARALLELISM

Specifies the degree of parallelism; more precisely, the number of mapper processes used for extraction.

If the SQOOP_PARALLELISM option is set to a value greater than 1, the SPLIT_BY option must be defined.

SPLIT_BY

Specifies the target column to be used for splitting the source data into n chunks for parallel extraction, where n is SQOOP_PARALLELISM. To achieve equally sized data chunks, the split column should contain homogeneously distributed values. To calculate the data chunk boundaries, a query similar to SELECT MIN(EMP.EMPNO), MAX(EMP.EMPNO) FROM EMPLOYEE EMP is used. To avoid an extra full table scan, the split column should be backed by an index.

BOUNDARY_QUERY

To split the source data into chunks for parallel extraction, the minimum and maximum values of the split column are retrieved (see the KM option SPLIT_BY). In certain situations these may not be the best boundaries, or retrieving them may not be the most efficient approach. In such cases, this KM option can be set to a SQL query that returns one row with two columns: the lowest and the highest value to be used for the split column. This range is divided into SQOOP_PARALLELISM chunks for parallel extraction. Example of hard-coded ranges for an Oracle source:

SELECT 1000, 2000 FROM DUAL

To preserve context independence, regular table names should be inserted through odiRef.getObjectName calls. For example:

SELECT MIN(EMPNO), MAX(EMPNO) FROM <%=odiRef.getObjectName("EMP")%>

TEMP_DIR

Specifies the directory used for storing temporary files, such as the sqoop script and the stdout and stderr redirects. Leave this option blank to use the system's default temporary directory:

<?=System.getProperty("java.io.tmp")?>

MAPRED_OUTPUT_BASE_DIR

Specifies an HDFS directory where SQOOP creates subdirectories for temporary files. A subdirectory named after the work table is created here to hold the temporary data.

DELETE_TEMPORARY_OBJECTS

Deletes temporary objects such as tables, files, and scripts after data integration. Set this option to NO if you want to retain the temporary objects, which might be useful for debugging.

USE_HIVE_STAGING_TABLE

Loads data into the Hive work table before loading into the Hive target table. Set this option to false to load data directly into the target table.

Setting this option to false is only possible if all these conditions are true:

  • All target columns are mapped

  • The existing Hive table uses the standard Hive row separator (\n) and column delimiter (\01)

Setting this option to false provides better performance by avoiding an extra data transfer step.

This option is applicable only if the target technology is Hive.

USE_GENERIC_JDBC_CONNECTOR

Specifies whether to use the generic JDBC connector if a connector for the target technology is not available.

For certain technologies SQOOP provides specific connectors. These connectors take care of SQL-dialects and optimize performance. When there is a connector for the respective target technology, this connector should be used. If not, the generic JDBC connector can be used.

EXTRA_HADOOP_CONF_PROPERTIES

Optional generic Hadoop properties.

EXTRA_SQOOP_CONF_PROPERTIES

Optional SQOOP properties.

EXTRA_SQOOP_CONNECTOR_CONF_PROPERTIES

Optional SQOOP connector properties.


3.6.5 Validating and Transforming Data Within Hive

After loading data into Hive, you can validate and transform the data using the following knowledge modules.

3.6.5.1 IKM Hive Control Append

This knowledge module validates and controls the data, and integrates it into a Hive target table in truncate/insert (append) mode. Invalid data is isolated in an error table and can be recycled. IKM Hive Control Append supports inline view mappings that use either this knowledge module or IKM Hive Transform.

Table 3-12 lists the options. See the knowledge module for additional details.

Table 3-12 IKM Hive Control Append Options

Option Description

FLOW_CONTROL

Activates flow control.

RECYCLE_ERRORS

Recycles data rejected from a previous control.

STATIC_CONTROL

Controls the target table after having inserted or updated target data.

CREATE_TARG_TABLE

Creates the target table.

TRUNCATE

Replaces the target table content with the new data. Setting this option to true provides better performance.

DELETE_TEMPORARY_OBJECTS

Removes the temporary objects, such as tables, files, and scripts after data integration. Set this option to NO if you want to retain the temporary objects, which might be useful for debugging.

HIVE_COMPATIBILE

Specifies the Hive version compatibility. The values permitted for this option are 0.7 and 0.8.

  • 0.7: Simulates the append behavior. Must be used for Hive 0.7 (CDH3).

  • 0.8: Uses Hive's append feature, which provides better performance. Requires Hive 0.8 (CDH4) or later.


3.6.5.2 CKM Hive

This knowledge module checks data integrity for Hive tables. It verifies the validity of the constraints of a Hive data store and diverts the invalid records to an error table. You can use CKM Hive for static control and flow control. You must also define these constraints on the stored data.

Table 3-13 lists the options for this check knowledge module. See the knowledge module for additional details.

Table 3-13 CKM Hive Options

Option Description

DROP_ERROR_TABLE

Drops the error table before execution. When this option is set to YES, the error table is dropped each time a control is performed on the target table. This means that any rejected records identified and stored during previous control operations are lost. Otherwise, previous rejects are preserved. In addition to the error table, any table called <error table>_tmp is also dropped.

HIVE_COMPATIBILE

Specifies the Hive version compatibility. The values permitted for this option are 0.7 and 0.8.

  • 0.7: Simulates the append behavior. Must be used for Hive 0.7 (CDH3).

  • 0.8: Uses Hive's append feature, which provides better performance. Requires Hive 0.8 (CDH4) or later.


3.6.5.3 IKM Hive Transform

This knowledge module performs transformations. It uses a shell script to transform the data, and then integrates it into a Hive target table using replace mode. The knowledge module supports inline view mappings and can be used as an inline view for IKM Hive Control Append.

The transformation script must read the input columns in the order defined by the source data store. Only mapped source columns are streamed into the transformations. The transformation script must provide the output columns in the order defined by the target data store.

Table 3-14 lists the options for this integration knowledge module. See the knowledge module for additional details.

Table 3-14 IKM Hive Transform Options

Option Description

CREATE_TARG_TABLE

Creates the target table.

DELETE_TEMPORARY_OBJECTS

Removes the temporary objects, such as tables, files, and scripts post data integration. Set this option to NO if you want to retain the temporary objects, which might be useful for debugging.

TRANSFORM_SCRIPT_NAME

Defines the file name of the transformation script. This transformation script is used to transform the input data into the output structure. Both local and HDFS paths are supported, for example:

Local script location: file:///tmp/odi/script1.pl

HDFS script location: hdfs://namenode:nnPort/tmp/odi/script1.pl

Ensure that the following requirements are met:

  • The path/file must be accessible by both the ODI agent and the Hive server. Read access for the Hive server is required because it is the Hive server that executes the resulting MapReduce job invoking the script.

  • If TRANSFORM_SCRIPT is set (ODI creates the script file during mapping execution), the path/file must be writable by the ODI agent, because it is the ODI agent that writes the script file using the HDFS Java API.

When the KM option TRANSFORM_SCRIPT is set, the following paragraphs provide some configuration help:

  • For HDFS script locations:

    The script file created is owned by the ODI agent user and receives the group of the owning directory. See the Hadoop HDFS Permissions Guide for more details. The standard configuration to cover the above two requirements for HDFS scripts is to ensure that the group of the HDFS script directory includes both the ODI agent user (for example, oracle) and the Hive server user (for example, hive). Assuming that the group hadoop includes oracle and hive, the sample command below, run as the hdfs user, adjusts the ownership of the HDFS script directory:

    hdfs dfs -chown oracle:hadoop /tmp/odi/myscriptdir

  • For local script locations:

    The script file created is owned by the ODI agent user and receives the ODI agent user's default group, unless SGID has been set on the script directory. If the SGID bit has been set, the file receives the group of the script directory instead. The standard configuration to cover the above two requirements for local scripts is similar to the HDFS configuration, using the SGID bit:

    chown oracle:hadoop /tmp/odi/myscriptdir
    chmod g+s /tmp/odi/myscriptdir

TRANSFORM_SCRIPT

Defines the transformation script content. This transformation script is then used to transform the input data into the output structure. If left blank, the file given in TRANSFORM_SCRIPT_NAME must already exist. If not blank, the script file is created.

Script example (1-to-1 transformation):

#! /usr/bin/csh -f
cat

All mapped source columns are spooled as tab-separated data into this script via stdin. This Unix script then transforms the data and writes it out as tab-separated data on stdout. The script must provide as many output columns as there are target columns.

TRANSFORM_SCRIPT_MODE

Unix/HDFS file permissions for the script file, in octal notation with a leading zero. For example, full permissions for owner and group: 0770.

Warning: Using wider permissions like 0777 poses a security risk.

See also KM option description for TRANSFORM_SCRIPT_NAME for details on directory permissions.

PRE_TRANSFORM_DISTRIBUTE

Provides an optional, comma-separated list of source column names, which enables the knowledge module to distribute the data before the transformation script is applied.

PRE_TRANSFORM_SORT

Provides an optional, comma-separated list of source column names, which enables the knowledge module to sort the data before the transformation script is applied.

POST_TRANSFORM_DISTRIBUTE

Provides an optional, comma-separated list of target column names, which enables the knowledge module to distribute the data after the transformation script is applied.

POST_TRANSFORM_SORT

Provides an optional, comma-separated list of target column names, which enables the knowledge module to sort the data after the transformation script is applied.

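Building on the 1-to-1 csh example shown for the TRANSFORM_SCRIPT option, the sketch below is a slightly richer transformation script. It assumes tab-separated input whose second column should be converted to upper case; the number and order of output columns must still match the target data store:

#! /bin/sh
# Read tab-separated rows from stdin, upper-case the second column,
# and write tab-separated rows to stdout.
awk 'BEGIN { FS = OFS = "\t" } { $2 = toupper($2); print }'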

3.6.6 Loading Data into an Oracle Database from Hive and HDFS

IKM File-Hive to Oracle (OLH-OSCH) integrates data from an HDFS file or Hive source into an Oracle database target using Oracle Loader for Hadoop. Using the mapping configuration and the selected options, the knowledge module generates an appropriate Oracle Database target instance. Hive and Hadoop versions must follow the Oracle Loader for Hadoop requirements.


Table 3-15 lists the options for this integration knowledge module. See the knowledge module for additional details.

Table 3-15 IKM File-Hive to Oracle (OLH-OSCH) Options

Option Description

OLH_OUTPUT_MODE

Specifies how to load the Hadoop data into Oracle. Permitted values are JDBC, OCI, DP_COPY, DP_OSCH, and OSCH.

  • JDBC output mode: The data is inserted using a number of direct insert JDBC connections. In very rare cases JDBC mode may result in duplicate records in the target table due to Hadoop trying to restart tasks.

  • OCI output mode: The data is inserted using a number of direct insert OCI connections in direct path mode. If USE_ORACLE_STAGING_TABLE is set to false, the target table must be partitioned. If USE_ORACLE_STAGING_TABLE is set to true, FLOW_TABLE_OPTIONS must explicitly specify partitioning, for example, "PARTITION BY HASH(COL1) PARTITIONS 4". In very rare cases OCI mode may result in duplicate records in the target table due to Hadoop trying to restart tasks.

  • DP_COPY output mode: OLH creates a number of DataPump export files. These files are transferred by a "hadoop fs -copyToLocal" command to the local path specified by EXT_TAB_DIR_LOCATION. Please note that the path must be accessible by the Oracle Database engine. Once the copy job is complete, an external table is defined in the target database, which accesses the files from EXT_TAB_DIR_LOCATION.

  • DP_OSCH output mode: OLH creates a number of DataPump export files. After the export phase an external table is created on the target database, which accesses these output files directly via OSCH. Please note that the path must be accessible by the Oracle Database engine. Once the copy job is complete, an external table is defined in the target database, which accesses the files from EXT_TAB_DIR_LOCATION.

  • OSCH output mode: In OSCH mode loading, OLH is bypassed. ODI creates an external table on the target database, which accesses the input files through OSCH. Please note that only delimited and fixed-length files can be read. There is no support for loading from Hive or from custom InputFormats such as RegexInputFormat, because there is no OLH preprocessing.

REJECT_LIMIT

Specifies the maximum number of errors for Oracle Loader for Hadoop and the external table. Examples: UNLIMITED to accept all errors, or an integer value (for example, 10 to allow 10 rejections). This value is used in Oracle Loader for Hadoop job definitions as well as in external table definitions.

CREATE_TARG_TABLE

Creates the target table.

TRUNCATE

Replaces the target table content with the new data.

DELETE_ALL

Deletes all the data in the target table.

USE_HIVE_STAGING_TABLE

Materializes Hive source data before extraction by Oracle Loader for Hadoop. If this option is set to false, Oracle Loader for Hadoop directly accesses the Hive source data. Setting this option to false is only possible if all these conditions are true:

  • Only a single source table

  • No transformations, filters, joins

  • No datasets

Setting this option to false provides better performance by avoiding an extra data transfer step.

This option is applicable only if the source technology is Hive.

USE_ORACLE_STAGING_TABLE

Uses an intermediate Oracle database staging table.

The extracted data is made available to Oracle by an external table. If USE_ORACLE_STAGING_TABLE is set to true (default), the external table is created as a temporary (I$) table. This I$ table data is then inserted into the target table. Setting this option to false is only possible if all these conditions are true:

  • OLH_OUTPUT_MODE is set to JDBC or OCI

  • All source columns are mapped

  • All target columns are mapped

  • No target-side mapping expressions

Setting this option to false provides better performance by avoiding an extra data transfer step, but may lead to partial data being loaded into the target table, as Oracle Loader for Hadoop loads data in multiple transactions.

EXT_TAB_DIR_LOCATION

Specifies the file system path of the external table. Please note the following:

  • Only applicable, if OLH_OUTPUT_MODE = DP_*|OSCH

  • For OLH_OUTPUT_MODE = DP_*: this path must be accessible both from the ODI agent and from the target database engine.

  • For OLH_OUTPUT_MODE = DP_*: the name of the external directory object is the I$ table name.

  • For OLH_OUTPUT_MODE = DP_COPY: the ODI agent uses a hadoop fs command to copy the DataPump files into this directory.

  • For OLH_OUTPUT_MODE = DP_*|OSCH: this path will contain any external table log/bad/dsc files.

  • The ODI agent removes any files from this directory during cleanup before launching OLH/OSCH.

TEMP_DIR

Specifies the directory used for storing temporary files, such as the sqoop script and the stdout and stderr redirects. Leave this option blank to use the system's default temporary directory:

<?=System.getProperty("java.io.tmp")?>

MAPRED_OUTPUT_BASE_DIR

Specifies an HDFS directory where the Oracle Loader for Hadoop job creates subdirectories for temporary files and DataPump output files.

FLOW_TABLE_OPTIONS

Specifies the attributes for the integration table at create time, used to increase performance. This option is set by default to NOLOGGING. It may be left empty.

DELETE_TEMPORARY_OBJECTS

Removes temporary objects, such as tables, files, and scripts post data integration. Set this option to NO if you want to retain the temporary objects, which might be useful for debugging.

OVERRIDE_INPUTFORMAT

By default, the InputFormat class is derived from the source DataStore/Technology (DelimitedTextInputFormat or HiveToAvroInputFormat). This option allows you to specify the class name of a custom InputFormat. It cannot be used with OLH_OUTPUT_MODE=OSCH.

For example, to read custom file formats like web log files, the OLH RegexInputFormat can be used by assigning the value: oracle.hadoop.loader.lib.input.RegexInputFormat

See KM option EXTRA_OLH_CONF_PROPERTIES for details on how to specify the regular expression.

EXTRA_OLH_CONF_PROPERTIES

Particularly when using custom InputFormats (see the KM option OVERRIDE_INPUTFORMAT for details), the InputFormat may require additional configuration parameters. These are provided in the OLH configuration file. This KM option allows adding extra properties to the OLH configuration file. It cannot be used with OLH_OUTPUT_MODE=OSCH.

Example (loading the Apache web log file format): When the OLH RegexInputFormat is used for reading custom file formats, this KM option specifies the regular expression and other parsing details:

<property>
  <name>oracle.hadoop.loader.input.regexPattern</name>
  <value>([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*) (\".*?\") (\".*?\") (\".*?\")</value>
  <description>RegEx for Apache WebLog format</description>
</property>


3.6.7 Loading Data into a SQL Database from Hive and HDFS

IKM File-Hive to SQL (SQOOP) integrates data from an HDFS file or Hive source into an SQL database target using SQOOP.

IKM File-Hive to SQL (SQOOP)

IKM File-Hive to SQL (SQOOP) supports:

  • Filters, Joins, Datasets, Transformations and Aggregations in Hive

  • Inline views generated by IKM Hive Control Append

  • Inline views generated by IKM Hive Transform

  • Hive-HBase source tables using LKM HBase to Hive (HBase SerDe)

  • File source data (delimited file format only)

Table 3-16 lists the options for this integration knowledge module. See the knowledge module for additional details.

Table 3-16 IKM File-Hive to SQL (SQOOP) Options

Option Description

CREATE_TARG_TABLE

Creates the target table.

TRUNCATE

Replaces the target datastore content with new data. If this option is set to false, the new data is appended to the target datastore.

DELETE_ALL

Deletes all the rows in the target datastore.

SQOOP_PARALLELISM

Specifies the degree of parallelism; more precisely, the number of mappers used during SQOOP export and therefore the number of parallel JDBC connections.

USE_TARGET_STAGING_TABLE

By default, the source data is staged into a target-side staging table before it is moved into the target table. If this option is set to false, SQOOP loads the source data directly into the target table, which provides better performance and reduces the need for tablespace in the target RDBMS by avoiding an extra data transfer step.

For File sources, setting this option to false is only possible if all these conditions are met:

  • All source columns must be mapped

  • Source and target columns have the same order

  • The first file column must map to the first target column

  • There are no mapping gaps

  • Only 1-to-1 mappings are used (no expressions)

Please note the following:

  • SQOOP uses multiple writers, each with its own JDBC connection to the target. Every writer uses multiple transactions for inserting the data. This means that when USE_TARGET_STAGING_TABLE is set to false, changes to the target table are no longer atomic, and writer failures can lead to partially updated target tables.

  • The Teradata Connector for SQOOP always creates an extra staging table during load. This connector staging table is independent of the KM option.

USE_GENERIC_JDBC_CONNECTOR

Specifies whether to use the generic JDBC connector if a connector for the target technology is not available.

For certain technologies SQOOP provides specific connectors. These connectors take care of SQL-dialects and optimize performance. When there is a connector for the respective target technology, this connector should be used. If not, the generic JDBC connector can be used.

FLOW_TABLE_OPTIONS

When creating the target-side work table, RDBMS-specific table options can improve performance. By default, this option is empty, and the knowledge module uses the following table options:

  • For Oracle: NOLOGGING

  • For DB2: NOT LOGGED INITIALLY

  • For Teradata: no fallback, no before journal, no after journal

Any explicit value overrides these defaults.

TEMP_DIR

Specifies the directory used for storing temporary files, such as the sqoop script and the stdout and stderr redirects. Leave this option blank to use the system's default temporary directory:

<?=System.getProperty("java.io.tmp")?>

MAPRED_OUTPUT_BASE_DIR

Specifies an HDFS directory where SQOOP creates subdirectories for temporary files. A subdirectory named after the work table is created here to hold the temporary data.

DELETE_TEMPORARY_OBJECTS

Deletes temporary objects such as tables, files, and scripts after data integration. Set this option to NO if you want to retain the temporary objects, which might be useful for debugging.

TERADATA_PRIMARY_INDEX

Primary index for the target table. Teradata uses the primary index to spread data across AMPs. It is important that the chosen primary index has high cardinality (many distinct values) to ensure evenly spread data and maximum processing performance. Follow Teradata's recommendations on choosing a primary index.

This option is applicable only to Teradata targets.

TERADATA_FLOW_TABLE_TYPE

Type of the Teradata flow table, either SET or MULTISET.

This option is applicable only to Teradata targets.

TERADATA_OUTPUT_METHOD

Specifies the way the Teradata Connector loads the data. Valid values are:

  • batch.insert: multiple JDBC connections using batched prepared statements (simplest to start with)

  • multiple.fastload: multiple FastLoad connections

  • internal.fastload: single coordinated FastLoad connections (most performant)

This option is applicable only to Teradata targets.

EXTRA_HADOOP_CONF_PROPERTIES

Optional generic Hadoop properties.

EXTRA_SQOOP_CONF_PROPERTIES

Optional SQOOP properties.

EXTRA_SQOOP_CONNECTOR_CONF_PROPERTIES

Optional SQOOP connector properties.