4 Oracle Data Integrator Application Adapter for Hadoop

This chapter describes Oracle Data Integrator. It contains these sections:

Introduction
Setting up the Topology
Setting up an Integration Project
Creating Oracle Data Integrator Model from a Reverse-Engineering Hive Model
Designing the Interface

4.1 Introduction

Apache Hadoop is designed to handle and process data from data sources that are typically non-RDBMS and data volumes that are typically beyond what is handled by relational databases.

The Oracle Data Integrator Application Adapter for Hadoop enables data integration developers to integrate and transform data easily within Hadoop using Oracle Data Integrator. Employing familiar and easy-to-use tools and preconfigured knowledge modules, the adapter provides the following capabilities:

Loading data into Hadoop from the local file system and HDFS.
Performing validation and transformation of data within Hadoop.
Loading processed data from Hadoop to Oracle Database for further processing and generating reports.

4.1.1 Concepts

Typical processing in Hadoop includes data validation and transformations that are programmed as MapReduce jobs. Designing and implementing a MapReduce job requires expert programming knowledge. However, using Oracle Data Integrator and the Oracle Data Integrator Application Adapter for Hadoop, you do not need to write MapReduce jobs. Oracle Data Integrator uses Hive and the Hive Query Language (HiveQL), a SQL-like language for implementing MapReduce jobs. The Oracle Data Integrator graphical user interface enhancing the developer's experience and productivity while enabling them to create Hadoop integrations.

When implementing a big data processing scenario, the first step is to load the data into Hadoop. The data source is typically in the local file system, HDFS, Hive tables, or external Hive tables.

After the data is loaded, you can validate and transform the data using HiveQL like you do in SQL. You can perform data validation such as checking for NULLS and primary keys, and transformations such as filtering, aggregations, set operations, and derived tables. You can also include customized procedural snippets (scripts) for processing the data.

When the data has been aggregated, condensed, or crunched down, you can load it into Oracle Database for further processing and analysis. Oracle Loader for Hadoop is recommended for optimal loading into Oracle Database.

4.1.2 Knowledge Modules

Oracle Data Integrator provides the knowledge modules described in Table 4-1 for use with Hadoop.

Table 4-1 Oracle Data Integrator Application Adapter for Hadoop Knowledge Modules

KM Name	Description	Source	Target
IKM File To Hive (Load Data)	Loads data from local and HDFS files into Hive tables. It provides options for better performance through Hive partitioning and fewer data movements. This KM supports wild cards (*,?).	File System	Hive
IKM Hive Control Append	Integrates data into a Hive target table in truncate/ insert (append) mode. Data can be controlled (validated). Invalid data is isolated in an error table and can be recycled.	Hive	Hive
IKM Hive Transform	Integrates data into a Hive target table after the data has been transformed by a customized script such as Perl or Python.	Hive	Hive
IKM File-Hive to Oracle (OLH)	Integrates data from an HDFS file or Hive source into an Oracle Database target using Oracle Loader for Hadoop.	File System or Hive	Oracle Database
CKM Hive	Validates data against constraints.	NA	Hive
RKM Hive	Reverse engineers Hive tables.	Hive Metadata	NA

4.1.3 Security

For security information for Oracle Data Integrator, see the Oracle Fusion Middleware Developer's Guide for Oracle Data Integrator.

4.2 Setting up the Topology

This step declares, in Oracle Data Integrator, the data server and the physical and logical schemas that are used to store the file system and Hive information.

This section contains the following topics:

Setting up the File Data Source
Setting Up the Hive Data Source
Setting Up the Oracle Data Integrator Agent to Execute Hadoop Jobs
Configuring Oracle Data Integrator Studio for Executing Hadoop Jobs on the Local Agent

4.2.1 Setting up the File Data Source

In the Hadoop context there is a distinction between files in the Hadoop Distributed File System (HDFS) and local files (files outside of HDFS).

To define a data source:

Create a DataServer object under File technology.
Create a Physical Schema object for every directory to be accessed.
Create a Logical Schema object for every directory to be accessed.
Create a Model for every LogicalSchema.
Create one or more data stores for each different type of file and wildcard name pattern.
For HDFS files, create a DataServer object under File technology by entering the HDFS name node in the field JDBC URL. For example:
```
hdfs://bda1node01.example.com:9000
```
Note: There is no dedicated technology defined for HDFS files.

4.2.2 Setting Up the Hive Data Source

The following steps in Oracle Data Integrator are required for connecting to a Hive system. Oracle Data Integrator connects to Hive using JDBC.

Prerequisites

The Hive technology must be included in the standard Oracle Data Integrator technologies. If it is not, then import the technology in INSERT_UPDATE mode from the xml-reference folder.

All Hive-specific FlexFields must be added. For pre-11.1.1.6.0 repositories, the FlexFields are added as part of the repository upgrade process.

To set up a Hive data source:

Place all required Hive JDBC jars into the Oracle Data Integrator user lib folder:

$HIVE_HOME/lib/*.jar
$HADOOP_HOME/hadoop-*-core*.jar,
$HADOOP_HOME/Hadoop-*-tools*.jar

Create a DataServer object under Hive technology.
Set the following locations under JDBC:

JDBC Driver: org.apache.hadoop.hive.jdbc.HiveDriver

JDBC URL: for example, jdbc:hive://BDA:10000/default
Set the following under Flexfields:

Hive Metastore URIs: for example, thrift://BDA:10000
Create a Physical Default Schema.

As of Hive 0.7.0, no schemas or databases are supported. Only Default is supported. Enter default in both schema fields of the physical schema definition.
Ensure that the Hive server is up and running.
Test the connection to the DataServer.
Create a Logical Schema object.
Create at least one Model for the LogicalSchema.
Import RKM Hive as a global KM or into a project.
Create a new model for Hive Technology pointing to the logical schema.
Perform a custom reverse using RKM Hive.

At the end of this process, the Hive DataModel contains all Hive tables with their columns, partitioning, and clustering details stored as FlexField values.

4.2.3 Setting Up the Oracle Data Integrator Agent to Execute Hadoop Jobs

After setting up an Oracle Data Integrator agent, configure it to work with the Oracle Data Integrator Application Adapter for Hadoop.

To configure the Oracle Data Integrator agent:

Install Hadoop on your Oracle Data Integrator Agent computer. Ensure that the HADOOP_HOME environment variable is set.

For Oracle Big Data Appliance, see the Oracle Big Data Appliance Software User's Guide for instructions for setting up a remote Hadoop client.
Install Hive on your Oracle Data Integrator Agent computer. Ensure that the HIVE_HOME environment variable is set.
Copy these jar files to the Oracle Data Integrator agent drivers directory.
```
$HIVE_HOME/lib/*.jar,
$HADOOP_HOME/hadoop-*-core*.jar
$HADOOP_HOME/hadoop-*-tools*.jar
```
See the Oracle Fusion Middleware Installation Guide for Oracle Data Integrator for information about adding drivers to the agent. This step enables the Oracle Data Integrator agent to load the Hive JDBC driver.
Set environment variable ODI_HIVE_SESSION_JARS to include Hive RegEx SerDe:
```
ODI_HIVE_SESSION_JARS=$HIVE_HOME/lib/hive-contrib-0.7.1-cdh3u3.jar
```
Include other jars as required, such as custom SerDes jars. These jars are added to every Hive JDBC session and thus are added to every Hive MapReduce job.

Set environment variable HADOOP_CLASSPATH:

HADOOP_CLASSPATH=$HIVE_HOME/lib/hive-metastore-0.7.1-cdh3u3.jar:$HIVE_HOME/lib/libthrift.jar:$HIVE_HOME/lib/libfb303.jar:$HIVE_HOME/lib/hive-common-0.7.1-cdh3u3.jar:$HIVE_HOME/lib/hive-exec-0.7.1-cdh3u3.jar.

This setting enables the Hadoop script to start Hive MapReduce jobs.

To use Oracle Loader for Hadoop:

Install Oracle Loader for Hadoop on your Oracle Data Integrator Agent system.
Install Oracle client on your Oracle Data Integrator Agent system. See the Oracle Loader for Hadoop requirements for the Oracle client version.
Set environment variable OLH_HOME.

Set environment variable ODI_OLH_JARS.

You must list all jar files required for Oracle Loader for Hadoop. See the oracle.hadoop.loader.libjars property in "OraLoader for Hadoop Configuration Properties".

This is a comma-separated list of jar files:

ODI_OLH_JARS=OLH_HOME/jlib/ojdbc6.jar,$OLH_HOME/jlib/orai18n.jar,$OLH_HOME/jlib/orai18n-utility.jar,$OLH_HOME/jlib/orai18n-mapping.jar,$OLH_HOME/jlib/orai18n-collation.jar,$OLH_HOME/jlib/oraclepki.jar,$OLH_HOME/jlib/osdt_cert.jar,$OLH_HOME/jlib/osdt_core.jar,$OLH_HOME/jlib/commons-math-2.2.jar,$OLH_HOME/jlib/jackson-core-asl-1.5.2.jar,$OLH_HOME/jlib/jackson-mapper-asl-1.5.2.jar,$OLH_HOME/jlib/avro-1.5.4.jar,$OLH_HOME/jlib/avro-mapred-1.5.4.jar,$OLH_HOME/jlib/oraloader.jar,$HIVE_HOME/lib/hive-metastore-0.7.1-cdh3u3.jar,$HIVE_HOME/lib/libthrift.jar,$HIVE_HOME/lib/libfb303.jar,$HIVE_HOME/lib/hive-common-0.7.1-cdh3u3.jar,$HIVE_HOME/lib/hive-exec-0.7.1-cdh3u3.jar

Add paths to HADOOP_CLASSPATH:

$HADOOP_CLASSPATH= $OLH_HOME/jlib:$HADOOP_CLASSPATH

Verify that ODI_OLH_SHAREDLIBS lists all native libraries required for Oracle Loader for Hadoop. See the oracle.hadoop.loader.sharedLibs property in the "OraLoader for Hadoop Configuration Properties".

This is a comma separated list of shared libraries files:
```
ODI_OLH_SHAREDLIBS= $OLH_HOME/lib/libolh11.so,$OLH_HOME/lib/libclntsh.so.11.1,$OLH_HOME/lib/libnnz11.so,$OLH_HOME/lib/libociei.so
```
If Oracle Loader for Hadoop is used in OCI mode, then check these variables:
- JAVA_LIBRARY_PATH must include $OLH_HOME/lib. It locates the Oracle Loader for Hadoop native library libolh11.so.
- LD_LIBRARY_PATH must include $ORACLE_HOME/lib.

4.2.4 Configuring Oracle Data Integrator Studio for Executing Hadoop Jobs on the Local Agent

For executing Hadoop jobs on the local agent of an Oracle Data Integrator Studio installation, follow the steps in the previous section with the following change: Copy jar files into the Oracle Data Integrator userlib directory instead of the driver directory.

4.3 Setting up an Integration Project

Setting up a project follows the standard procedures. See Oracle Fusion Middleware Developer's Guide for Oracle Data Integrator.

Import the following KMs into Oracle Data Integrator project:

IKM File To Hive (Load Data)
IKM Hive Control Append
IKM Hive Transform
IKM File-Hive to Oracle (OLH)
CKM Hive
RKM Hive

4.4 Creating Oracle Data Integrator Model from a Reverse-Engineering Hive Model

This section contains the following topics:

Creating a Model
Reverse-Engineering Hive Tables

4.4.1 Creating a Model

Create a model based on the technology hosting Hive and on the logical schema created when configuring the Hive connection using the standard procedure, as described in the Oracle Fusion Middleware Developer's Guide for Oracle Data Integrator.

4.4.2 Reverse-Engineering Hive Tables

The Hive RKM is used to reverse-engineer Hive tables and views. To perform a Customized Reverse-Engineering of Hive tables with the Hive RKM, follow the usual procedures, as described in the Oracle Fusion Middleware Developer's Guide for Oracle Data Integrator. This topic details information specific to Hive tables.

The reverse-engineering process creates the data stores for the corresponding Hive table or views. You can use the data stores as either a source or a target in an integration interface.

The RKM reverses these metadata elements:

Hive tables and views as Oracle Data Integrator data stores.

Specify the reverse mask in the Mask field, then select the tables and views to reverse. The Mask field in the Reverse tab filters reverse-engineered objects based on their names. The Mask field cannot be empty and must contain at least the percentage symbol (%).
Hive columns as Oracle Data Integrator columns with their data types.
Information about buckets, partitioning, cluster, and sort columns are set in the respective FlexFields in the Oracle Data Integrator data store or column metadata.

Table 4-2 describes the created FlexFields.

Table 4-2 FlexFields for Reverse-Engineered Hive Tables and Views

Object	FlexField Name	FlexField Code	FlexField Type	Description
DataStore	Hive Buckets	`HIVE_BUCKETS`	String	Number of Buckets to be used for clustering
Column	Hive Partition Column	`HIVE_PARTITION_COLUMN`	Numeric	All partitioning columns are marked as "1". Partition information can come from the following: Mapped Source Column Constant value specified in the target column File name fragment
Column	Hive Cluster Column	`HIVE_CLUSTER_COLUMN`	Numeric	All cluster columns are marked as "1".
Column	Hive Sort Column	`HIVE_SORT_COLUMN`	Numeric	All sort columns are marked as "1".

Table 4-3 describes the options for Hive RKM.

Table 4-3 Hive RKM Options

Option	Description
`USE_LOG`	Log intermediate results?
`LOG_FILE_NAME`	Path and file name of log file. Default path is the user home and default file name is reverse.log.

4.5 Designing the Interface

After reverse engineering Hive tables and configuring them, you can choose from these interface configurations:

Loading Data from Files into Hive
Validating and Transforming Data Within Hive
Loading into Oracle from Hive and HDFS

4.5.1 Loading Data from Files into Hive

To load data from the local or the HDFS file system into Hive tables:

Create the data stores for local files and HDFS files.

Refer to the Oracle Fusion Middleware Connectivity and Knowledge Modules Guide for Oracle Data Integrator for information on reverse engineering and configuring local file data sources.
Create an interface using the file data store as the source and the corresponding Hive table as the target. Use the IKM File To Hive (Load Data) knowledge module specified in the flow tab of the interface. This IKM loads data from flat files into Hive, replacing or appending to any existing data.

IKM File to Hive

IKM File to Hive (Load Data) supports:

One or more input files. To load multiple source files, enter an asterisk or a question mark as a wildcard character in the resource name of the file DataStore, for example, webshop_*.log.
File formats:
- Fixed length
- Delimited
- Customized format
Loading options:
- Immediate or deferred loading
- Overwrite or append
- Hive external tables

Table 4-4 describes the options for IKM File To Hive (Load Data). See the KM for additional details.

Table 4-4 IKM File To Hive Options

Option	Description
`CREATE_TARG_TABLE`	Create target table.
`TRUNCATE`	Truncate data in target table.
`FILE_IS_LOCAL`	Is the file in the local file system or in HDFS?
`EXTERNAL_TABLE`	Use an externally managed Hive table
`USE_STAGING_TABLE`	Use a Hive staging table. Select this option if the source and target do not match or if the partition column value is part of the data file. If the partitioning value is provided by a file name fragment or a constant in target mapping, then set this value to `false`.
`DELETE_TEMPORARY_OBJECTS`	Remove temporary objects after the interface execution.
`DEFER_TARGET_LOAD`	Load data into the final target now or defer?
`OVERRIDE_ROW_FORMAT`	Provide a parsing expression for handling a custom file format to perform the mapping from source to target.
`STOP_ON_FILE_NOT_FOUND`	Stop if no source file is found?

4.5.2 Validating and Transforming Data Within Hive

After loading data into Hive, you can validate and transform the data using the following KMs.

4.5.2.1 IKM Hive Control Append

This KM validates and controls the data, and integrates it into a Hive target table in truncate/insert (append) mode. Invalid data is isolated in an error table and can be recycled. This KM supports inline view interfaces that use either IKM Hive Control Append or IKM Hive Transform.

Table 4-5 lists the options. See the KM for additional details.

Table 4-5 IKM Hive Control Append Options

Option	Description
`FLOW_CONTROL`	Validate incoming data?
`RECYCLE_ERRORS`	Reintegrate data from error table?
`STATIC_CONTROL`	Validate data after load?
`CREATE_TARG_TABLE`	Create target table?
`DELETE_TEMPORARY_OBJECTS`	Remove temporary objects after execution?
`TRUNCATE`	Truncate data in target table?

4.5.2.2 CKM Hive

This KM checks data integrity for Hive tables. It verifies the validity of the constraints of a Hive data store and rejects the invalid records into an error table. You can use CKM Hive for static control and flow control. You must also define these constraints on the stored data.

Table 4-6 lists the options for this CKM. See the KM for additional details.

Table 4-6 CKM Hive Options

Option	Description
`DROP_ERROR_TABLE`	Drop error table before execution?

4.5.2.3 IKM Hive Transform

This KM performs transformations. It uses a shell script to transform the data, then integrates it into a Hive target table using replace mode. The KM supports inline view interfaces and can be used as an inline-view for IKM Hive Control Append.

The transformation script must expect the input columns in the order defined by the source data store. Only mapped source columns are streamed into the transformations. The transformation script must provide the output columns in the order defined by the target data store.

Table 4-7 lists the options for this IKM. See the KM for additional details.

Table 4-7 IKM Hive Transform Options

Option	Description
`CREATE_TARG_TABLE`	Create target table?
`DELETE_TEMPORARY_OBJECTS`	Remove temporary objects after execution?
`TRANSFORM_SCRIPT_NAME`	Script file name
`TRANSFORM_SCRIPT`	Script content
`PRE_TRANSFORM_DISTRIBUTE`	Provides an optional, comma-separated list of source column names, which enables the KM to distribute the data before the transformation script is applied.
`PRE_TRANSFORM_SORT`	Provide an optional, comma-separated list of source column names, which enables the KM to sort the data before the transformation script is applied
`POST_TRANSFORM_DISTRIBUTE`	Provides an optional, comma-separated list of target column names, which enables the KM to distribute the data after the transformation script is applied.
`POST_TRANSFORM_SORT`	Provides an optional, comma-separated list of target column names, which enables the KM to sort the data after the transformation script is applied.

4.5.3 Loading into Oracle from Hive and HDFS

IKM File-Hive to Oracle (OLH) integrates data from an HDFS file or Hive source into an Oracle Database target using Oracle Loader for Hadoop. Using the interface configuration and the selected options, the KM generates an appropriate Oracle target instance. Hive and Hadoop versions must follow the Oracle Loader for Hadoop requirements.

Option	Description
`OLH_OUTPUT_MODE`	Specify either JDBC, OCI, or Data Pump for data transfer.
`CREATE_TARG_TABLE`	Create target table?
`USE_HIVE_STAGING_TABLE`	Materialize Hive source data before extract?
`USE_ORACLE_STAGING_TABLE`	Use an Oracle database staging table?
`EXT_TAB_DIR_LOCATION`	Shared file path used for Oracle Data Pump transfer.
`TEMP_DIR`	Local path for temporary files.
`MAPRED_OUTPUT_BASE_DIR`	HDFS directory for Oracle Loader for Hadoop output files.
`FLOW_TABLE_OPTIONS`	Options for flow (stage) table creation when an Oracle database staging table is used.
`DELETE_TEMPORARY_OBJECTS`	Remove temporary objects after execution?
`OVERRIDE_INPUTFORMAT`	Set to handle custom file formats.
`EXTRA_OLH_CONF_PROPERTIES`	Optional Oracle Loader for Hadoop configuration file properties
`TRUNCATE`	Truncate data in target table?
`DELETE_ALL`	Delete all data in target table?