Skip Headers
Oracle® Big Data Connectors User's Guide
Release 2 (2.2)

Part Number E41238-03
Go to Documentation Home
Home
Go to Book List
Book List
Go to Table of Contents
Contents
Go to Index
Index
Go to Feedback page
Contact Us

Go to previous page
Previous
Go to next page
Next
PDF · Mobi · ePub

4 Oracle Data Integrator Application Adapter for Hadoop

This chapter describes how to use the knowledge modules in Oracle Data Integrator (ODI) Application Adapter for Hadoop. It contains the following sections:

4.1 Introduction

Apache Hadoop is designed to handle and process data that is typically from data sources that are nonrelational and data volumes that are beyond what is handled by relational databases.

Oracle Data Integrator (ODI) Application Adapter for Hadoop enables data integration developers to integrate and transform data easily within Hadoop using Oracle Data Integrator. Employing familiar and easy-to-use tools and preconfigured knowledge modules (KMs), the application adapter provides the following capabilities:

  • Loading data into Hadoop from the local file system and HDFS

  • Performing validation and transformation of data within Hadoop

  • Loading processed data from Hadoop to an Oracle database for further processing and generating reports

Knowledge modules (KMs) contain the information needed by Oracle Data Integrator to perform a specific set of tasks against a specific technology. An application adapter is a group of knowledge modules. Thus, Oracle Data Integrator Application Adapter for Hadoop is a group of knowledge modules for accessing data stored in Hadoop.

4.1.1 Concepts

Typical processing in Hadoop includes data validation and transformations that are programmed as MapReduce jobs. Designing and implementing a MapReduce job requires expert programming knowledge. However, when you use Oracle Data Integrator and Oracle Data Integrator Application Adapter for Hadoop, you do not need to write MapReduce jobs. Oracle Data Integrator uses Hive and the Hive Query Language (HiveQL), a SQL-like language for implementing MapReduce jobs.

When you implement a big data processing scenario, the first step is to load the data into Hadoop. The data source is typically in the local file system, HDFS, Hive tables, or external Hive tables.

After the data is loaded, you can validate and transform it by using HiveQL like you use SQL. You can perform data validation (such as checking for NULLS and primary keys), and transformations (such as filtering, aggregations, set operations, and derived tables). You can also include customized procedural snippets (scripts) for processing the data.

When the data has been aggregated, condensed, or processed into a smaller data set, you can load it into an Oracle database for further processing and analysis. Oracle Loader for Hadoop is recommended for optimal loading into an Oracle database.

4.1.2 Knowledge Modules

Oracle Data Integrator provides the knowledge modules (KMs) described in Table 4-1 for use with Hadoop.

Table 4-1 Oracle Data Integrator Application Adapter for Hadoop Knowledge Modules

KM Name Description Source Target

IKM File to Hive (Load Data)

Loads data from local and HDFS files into Hive tables. It provides options for better performance through Hive partitioning and fewer data movements.

This knowledge module supports wildcards (*,?).

File system

Hive

IKM Hive Control Append

Integrates data into a Hive target table in truncate/insert (append) mode. Data can be controlled (validated). Invalid data is isolated in an error table and can be recycled.

Hive

Hive

IKM Hive Transform

Integrates data into a Hive target table after the data has been transformed by a customized script such as Perl or Python

Hive

Hive

IKM File-Hive to Oracle (OLH)

Integrates data from an HDFS file or Hive source into an Oracle database target using Oracle Loader for Hadoop, Oracle SQL Connector for HDFS, or both.

File system or Hive

Oracle Database

CKM Hive

Validates data against constraints

NA

Hive

RKM Hive

Reverse engineers Hive tables

Hive metadata

NA


4.1.3 Security

For security information for Oracle Data Integrator, see the Oracle Fusion Middleware Developer's Guide for Oracle Data Integrator.

4.2 Setting Up the Topology

To set up the topology in Oracle Data Integrator, you identify the data server and the physical and logical schemas that are used to store the file system and Hive information.

This section contains the following topics:

Note:

Many of the environment variables described in the following sections are already configured for Oracle Big Data Appliance. See the configuration script at /opt/oracle/odiagent-version/agent_standalone/oracledi/agent/bin/HadoopEnvSetup.sh

4.2.1 Setting Up File Data Sources

In the Hadoop context, there is a distinction between files in Hadoop Distributed File System (HDFS) and local files (files outside of HDFS).

To define a data source: 

  1. Create a DataServer object under File technology.

  2. Create a Physical Schema object for every directory to be accessed.

  3. Create a Logical Schema object for every directory to be accessed.

  4. Create a Model for every Logical Schema.

  5. Create one or more data stores for each different type of file and wildcard name pattern.

  6. For HDFS files, create a DataServer object under File technology by entering the HDFS name node in the field JDBC URL. For example:

    hdfs://bda1node01.example.com:8020
    

    Note:

    No dedicated technology is defined for HDFS files.

4.2.2 Setting Up Hive Data Sources

The following steps in Oracle Data Integrator are required for connecting to a Hive system. Oracle Data Integrator connects to Hive by using JDBC.

Prerequisites

The Hive technology must be included in the standard Oracle Data Integrator technologies. If it is not, then import the technology in INSERT_UPDATE mode from the xml-reference directory.

You must add all Hive-specific flex fields. For pre-11.1.1.6.0 repositories, the flex fields are added during the repository upgrade process.

To set up a Hive data source: 

  1. Ensure that the following environment variables are set, and note their values. The following list shows typical values, although your installation may be different:

    • $HIVE_HOME: /usr/lib/hive

    • $HADOOP_HOME: /usr/lib/hadoop (contains configuration files such as core-site.xml)

    • $OSCH_HOME: /opt/oracle/orahdfs-version

  2. Open ~/.odi/oracledi/userlib/additional_path.txt in a text editor and add the paths listed in Table 4-2. Enter the full path obtained in Step 1 instead of the variable name.

    This step enables ODI Studio to access the JAR files.

    Table 4-2 JAR File Paths

    Description CDH4 Path CDH3 Path

    Hive JAR Files

    $HIVE_HOME/lib/*.jar

    $HIVE_HOME/*.jar

    Hadoop Client JAR Files

    $HADOOP_HOME/client/*.jar

    $HADOOP_HOME/hadoop-*-core*.jarFoot 1 

    $HADOOP_HOME/hadoop-*-tools*.jarFootref 1

    Hadoop Configuration Directory

    $HADOOP_HOME

    $HADOOP_HOME

    Oracle SQL Connector for HDFS JAR Files (optional)

    $OSCH_HOME/jlib/*.jar

    $OSCH_HOME/jlib/*.jar


    Footnote 1 Replace the stars (*) with the full file name.

  3. Ensure that the Hadoop configuration directory is in the ODI class path.

    The Hadoop configuration directory contains files such as core-default.xml, core-site.xml, and hdfs-site.xml.

  4. Create a DataServer object under Hive technology.

  5. Set the following locations under JDBC:

    JDBC Driver: org.apache.hadoop.hive.jdbc.HiveDriver

    JDBC URL: for example, jdbc:hive://BDA:10000/default

  6. Set the following under Flexfields:

    Hive Metastore URIs: for example, thrift://BDA:10000

  7. Create a Physical Default Schema.

    As of Hive 0.7.0, no schemas or databases are supported. Only Default is supported. Enter default in both schema fields of the physical schema definition.

  8. Ensure that the Hive server is up and running.

  9. Test the connection to the DataServer.

  10. Create a Logical Schema object.

  11. Create at least one Model for the LogicalSchema.

  12. Import RKM Hive as a global knowledge module or into a project.

  13. Create a new model for Hive Technology pointing to the logical schema.

  14. Perform a custom reverse-engineering operation using RKM Hive.

At the end of this process, the Hive DataModel contains all Hive tables with their columns, partitioning, and clustering details stored as flex field values.

4.2.3 Setting Up the Oracle Data Integrator Agent to Execute Hadoop Jobs

After setting up an Oracle Data Integrator agent, configure it to work with Oracle Data Integrator Application Adapter for Hadoop.

Note:

Many file names contain the version number. When you see a star (*) in a file name, check your installation and enter the full file name.

To configure the Oracle Data Integrator agent: 

  1. Install Hadoop on your Oracle Data Integrator agent computer. Ensure that the HADOOP_HOME environment variable is set.

    For Oracle Big Data Appliance, see the Oracle Big Data Appliance Software User's Guide for instructions for setting up a remote Hadoop client.

  2. Install Hive on your Oracle Data Integrator agent computer. Ensure that the HIVE_HOME environment variable is set.

  3. Ensure that the Hadoop configuration directory is in the ODI class path.

    The Hadoop configuration directory contains files such as core-default.xml, core-site.xml, and hdfs-site.xml.

  4. Add paths to ODI_ADDITIONAL_CLASSPATH, so that the ODI agent can access the JAR files. If you are not using Oracle SQL Connector for HDFS, then omit $OSCH_HOME from the setting.

    Note:

    In these commands, $HADOOP_CONF points to the directory containing the Hadoop configuration files. This directory is often the same as $HADOOP_HOME.
    • For CDH4, use a command like the following:

      ODI_ADDITIONAL_CLASSPATH=$HIVE_HOME/lib/'*':$HADOOP_HOME/client/'*':$OSCH_HOME/jlib/'*':$HADOOP_CONF
      
    • For CDH3, use a command like the following, replacing hadoop-*-core*.jar and hadoop-*-tools*.jar with the full path names:

      ODI_ADDITIONAL_CLASSPATH=$HIVE_HOME/lib/'*':$HADOOP_HOME/hadoop-*-core*.jar:$HADOOP_HOME/hadoop-*-tools*.jar:$OSCH_HOME/jlib/'*':$HADOOP_CONF
      
  5. Set environment variable ODI_HIVE_SESSION_JARS to include Hive Regex SerDe:

    ODI_HIVE_SESSION_JARS=$HIVE_HOME/lib/hive-contrib-*.jar
    

    Include other JAR files as required, such as custom SerDes JAR files. These JAR files are added to every Hive JDBC session and thus are added to every Hive MapReduce job.

  6. Set environment variable HADOOP_CLASSPATH:

    HADOOP_CLASSPATH=$HIVE_HOME/lib/hive-metastore-*.jar:$HIVE_HOME/lib/libthrift.jar:$HIVE_HOME/lib/libfb*.jar:$HIVE_HOME/lib/hive-common-*.jar:$HIVE_HOME/lib/hive-exec-*.jar. 
    

    This setting enables the Hadoop script to start Hive MapReduce jobs.

To use Oracle Loader for Hadoop: 

  1. Install Oracle Loader for Hadoop on your Oracle Data Integrator agent system. See "Installing Oracle Loader for Hadoop".

  2. Set environment variable OLH_HOME.

  3. Optionally, set environment variable ODI_OLH_JARS. You must list any JAR files required for custom input formats, Hive, Hive SerDes, and so forth, used by Oracle Loader for Hadoop. Do not include the Oracle Loader for Hadoop JAR files.

    For example, for extracting data from Hive, you need the extra jars listed in "Specifying Hive Input Format JAR Files". Enter valid file names for your installation.

    $HIVE_HOME/lib/hive-metastore-*.jar,
    $HIVE_HOME/lib/libthrift.jar,
    $HIVE_HOME/lib/libfb*.jar
    
  4. Add paths to HADOOP_CLASSPATH:

    HADOOP_CLASSPATH=$OLH_HOME/jlib/'*':$HADOOP_CLASSPATH
    
  5. Set environment variable ODI_HIVE_SESSION_JARS to include Hive Regex SerDe:

    ODI_HIVE_SESSION_JARS=$HIVE_HOME/lib/hive-contrib-*.jar
    

    Include other JAR files as required, such as custom SerDes JAR files. These JAR files are added to every Hive JDBC session and thus are added to every Hive MapReduce job.

  6. To use Oracle SQL Connector for HDFS (OLH_OUTPUT_MODE=DP_OSCH or OSCH), you must first install it. See "Oracle SQL Connector for Hadoop Distributed File System Setup."

4.2.4 Configuring Oracle Data Integrator Studio for Executing Hadoop Jobs on the Local Agent

For executing Hadoop jobs on the local agent of an Oracle Data Integrator Studio installation, follow the configuration steps in the previous section with the following change: Copy JAR files into the Oracle Data Integrator userlib directory instead of the drivers directory.

4.3 Setting Up an Integration Project

Setting up a project follows the standard procedures. See the Oracle Fusion Middleware Developer's Guide for Oracle Data Integrator.

Import the following KMs into Oracle Data Integrator project:

  • IKM File to Hive (Load Data)

  • IKM Hive Control Append

  • IKM Hive Transform

  • IKM File-Hive to Oracle (OLH)

  • CKM Hive

  • RKM Hive

4.4 Creating an Oracle Data Integrator Model from a Reverse-Engineered Hive Model

This section contains the following topics:

4.4.1 Creating a Model

To create a model that is based on the technology hosting Hive and on the logical schema created when you configured the Hive connection, follow the standard procedure described in the Oracle Fusion Middleware Developer's Guide for Oracle Data Integrator.

4.4.2 Reverse Engineering Hive Tables

RKM Hive is used to reverse engineer Hive tables and views. To perform a customized reverse-engineering of Hive tables with RKM Hive, follow the usual procedures, as described in the Oracle Fusion Middleware Developer's Guide for Oracle Data Integrator. This topic details information specific to Hive tables.

The reverse-engineering process creates the data stores for the corresponding Hive table or views. You can use the data stores as either a source or a target in an integration interface.

RKM Hive

RKM Hive reverses these metadata elements:

  • Hive tables and views as Oracle Data Integrator data stores.

    Specify the reverse mask in the Mask field, and then select the tables and views to reverse. The Mask field in the Reverse tab filters reverse-engineered objects based on their names. The Mask field cannot be empty and must contain at least the percent sign (%).

  • Hive columns as Oracle Data Integrator columns with their data types.

  • Information about buckets, partitioning, clusters, and sort columns are set in the respective flex fields in the Oracle Data Integrator data store or column metadata.

Table 4-3 describes the options for RKM Hive.

Table 4-3 RKM Hive Options

Option Description

USE_LOG

Log intermediate results?

LOG_FILE_NAME

Path and file name of log file. Default path is the user home and the default file name is reverse.log.


Table 4-4 describes the created flex fields.

Table 4-4 Flex Fields for Reverse-Engineered Hive Tables and Views

Object Flex Field Name Flex Field Code Flex Field Type Description

DataStore

Hive Buckets

HIVE_BUCKETS

String

Number of buckets to be used for clustering

Column

Hive Partition Column

HIVE_PARTITION_COLUMN

Numeric

All partitioning columns are marked as "1". Partition information can come from the following:

  • Mapped source column

  • Constant value specified in the target column

  • File name fragment

Column

Hive Cluster Column

HIVE_CLUSTER_COLUMN

Numeric

All cluster columns are marked as "1".

Column

Hive Sort Column

HIVE_SORT_COLUMN

Numeric

All sort columns are marked as "1".


4.5 Designing the Interface

After reverse engineering Hive tables and configuring them, you can choose from these interface configurations:

4.5.1 Loading Data from Files into Hive

To load data from the local file system or the HDFS file system into Hive tables:

  1. Create the data stores for local files and HDFS files.

    Refer to the Oracle Fusion Middleware Connectivity and Knowledge Modules Guide for Oracle Data Integrator for information about reverse engineering and configuring local file data sources.

  2. Create an interface using the file data store as the source and the corresponding Hive table as the target. Use the IKM File to Hive (Load Data) knowledge module specified in the flow tab of the interface. This integration knowledge module loads data from flat files into Hive, replacing or appending any existing data.

IKM File to Hive

IKM File to Hive (Load Data) supports:

  • One or more input files. To load multiple source files, enter an asterisk or a question mark as a wildcard character in the resource name of the file DataStore (for example, webshop_*.log).

  • File formats:

    • Fixed length

    • Delimited

    • Customized format

  • Loading options:

    • Immediate or deferred loading

    • Overwrite or append

    • Hive external tables

Table 4-5 describes the options for IKM File to Hive (Load Data). See the knowledge module for additional details.

Table 4-5 IKM File to Hive Options

Option Description

CREATE_TARG_TABLE

Create target table.

TRUNCATE

Truncate data in target table.

FILE_IS_LOCAL

Is the file in the local file system or in HDFS?

EXTERNAL_TABLE

Use an externally managed Hive table.

USE_STAGING_TABLE

Use a Hive staging table.

Select this option if the source and target do not match or if the partition column value is part of the data file.

If the partitioning value is provided by a file name fragment or a constant in target mapping, then set this value to false.

DELETE_TEMPORARY_OBJECTS

Remove temporary objects after the interface execution.

DEFER_TARGET_LOAD

Load data into the final target now or defer?

OVERRIDE_ROW_FORMAT

Provide a parsing expression for handling a custom file format to perform the mapping from source to target.

STOP_ON_FILE_NOT_FOUND

Stop if no source file is found?


4.5.2 Validating and Transforming Data Within Hive

After loading data into Hive, you can validate and transform the data using the following knowledge modules.

4.5.2.1 IKM Hive Control Append

This knowledge module validates and controls the data, and integrates it into a Hive target table in truncate/insert (append) mode. Invalid data is isolated in an error table and can be recycled. IKM Hive Control Append supports inline view interfaces that use either this knowledge module or IKM Hive Transform.

Table 4-6 lists the options. See the knowledge module for additional details.

Table 4-6 IKM Hive Control Append Options

Option Description

FLOW_CONTROL

Validate incoming data?

RECYCLE_ERRORS

Reintegrate data from error table?

STATIC_CONTROL

Validate data after load?

CREATE_TARG_TABLE

Create target table?

DELETE_TEMPORARY_OBJECTS

Remove temporary objects after execution?

TRUNCATE

Truncate data in target table?


4.5.2.2 CKM Hive

This knowledge module checks data integrity for Hive tables. It verifies the validity of the constraints of a Hive data store and diverts the invalid records to an error table. You can use CKM Hive for static control and flow control. You must also define these constraints on the stored data.

Table 4-7 lists the options for this check knowledge module. See the knowledge module for additional details.

Table 4-7 CKM Hive Options

Option Description

DROP_ERROR_TABLE

Drop error table before execution?


4.5.2.3 IKM Hive Transform

This knowledge module performs transformations. It uses a shell script to transform the data, and then integrates it into a Hive target table using replace mode. The knowledge module supports inline view interfaces and can be used as an inline-view for IKM Hive Control Append.

The transformation script must read the input columns in the order defined by the source data store. Only mapped source columns are streamed into the transformations. The transformation script must provide the output columns in the order defined by the target data store.

Table 4-8 lists the options for this integration knowledge module. See the knowledge module for additional details.

Table 4-8 IKM Hive Transform Options

Option Description

CREATE_TARG_TABLE

Create target table?

DELETE_TEMPORARY_OBJECTS

Remove temporary objects after execution?

TRANSFORM_SCRIPT_NAME

Script file name

TRANSFORM_SCRIPT

Script content

PRE_TRANSFORM_DISTRIBUTE

Provides an optional, comma-separated list of source column names, which enables the knowledge module to distribute the data before the transformation script is applied

PRE_TRANSFORM_SORT

Provide an optional, comma-separated list of source column names, which enables the knowledge module to sort the data before the transformation script is applied

POST_TRANSFORM_DISTRIBUTE

Provides an optional, comma-separated list of target column names, which enables the knowledge module to distribute the data after the transformation script is applied

POST_TRANSFORM_SORT

Provides an optional, comma-separated list of target column names, which enables the knowledge module to sort the data after the transformation script is applied


4.5.3 Loading Data into an Oracle Database from Hive and HDFS

IKM File-Hive to Oracle (OLH) integrates data from an HDFS file or Hive source into an Oracle database target using Oracle Loader for Hadoop. Using the interface configuration and the selected options, the knowledge module generates an appropriate Oracle Database target instance. Hive and Hadoop versions must follow the Oracle Loader for Hadoop requirements.

See Also:

Table 4-9 lists the options for this integration knowledge module. See the knowledge module for additional details.

Table 4-9 IKM File - Hive to Oracle (OLH)

Option Description

OLH_OUTPUT_MODE

Specify JDBC, OCI, or Data Pump for data transfer.

CREATE_TARG_TABLE

Create target table?

REJECT_LIMIT

Maximum number of errors for Oracle Loader for Hadoop and EXTTAB.

USE_HIVE_STAGING_TABLE

Materialize Hive source data before extract?

USE_ORACLE_STAGING_TABLE

Use an Oracle database staging table?

EXT_TAB_DIR_LOCATION

Shared file path used for Oracle Data Pump transfer.

TEMP_DIR

Local path for temporary files.

MAPRED_OUTPUT_BASE_DIR

HDFS directory for Oracle Loader for Hadoop output files.

FLOW_TABLE_OPTIONS

Options for flow (stage) table creation when you are using an Oracle database staging table.

DELETE_TEMPORARY_OBJECTS

Remove temporary objects after execution?

OVERRIDE_INPUTFORMAT

Set to handle custom file formats.

EXTRA_OLH_CONF_PROPERTIES

Optional Oracle Loader for Hadoop configuration file properties

TRUNCATE

Truncate data in target table?

DELETE_ALL

Delete all data in target table?