3 Setting Up the Environment for Integrating Hadoop Data

This chapter describes the steps you need to perform to set up the environment to integrate Hadoop data.

This chapter includes the following sections:

  • Section 3.1, "Creating and Initializing the Hadoop Data Server"

  • Section 3.2, "Creating a Hadoop Physical Schema"

  • Section 3.3, "Configuring the Oracle Data Integrator Agent to Execute Hadoop Jobs"

  • Section 3.4, "Configuring Oracle Loader for Hadoop"

  • Section 3.5, "Configuring Oracle Data Integrator to Connect to a Secure Cluster"

  • Section 3.6, "Configuring Oracle Data Integrator Studio for Executing Hadoop Jobs on the Local Agent"

3.1 Creating and Initializing the Hadoop Data Server

To create and initialize the Hadoop data server:

  1. Click the Topology tab.

  2. In the Physical Architecture tree, under Technologies, right-click Hadoop and then click New Data Server.

  3. In the Definition tab, specify the details of the Hadoop data server.

    See Section 3.1.1, "Hadoop Data Server Definition" for more information.

  4. Click Test Connection to test the connection to the Hadoop data server.

  5. Click Initialize to initialize the Hadoop data server.

    Initializing the Hadoop data server creates the structure of the ODI Master repository and Work repository in HDFS.

3.1.1 Hadoop Data Server Definition

The following table describes the fields that you need to specify on the Definition tab when creating a new Hadoop data server. A sample set of values is shown after the table.

Note: Only the fields required or specific to defining a Hadoop data server are described.

Table 3-1 Hadoop Data Server Definition

Name: Name of the data server that appears in Oracle Data Integrator.

Data Server: Physical name of the data server.

User/Password: Hadoop user and its password. If the password is not provided, only simple authentication is performed using the user name on HDFS and Oozie.

HDFS Node Name URI: URI of the HDFS node name. For example, hdfs://localhost:8020

Resource Manager/Job Tracker URI: URI of the resource manager or the job tracker. For example, localhost:8032

ODI HDFS Root: Path of the ODI HDFS root directory. For example, /user/<login_username>/odi_home

Additional Class Path: Specify additional classpath entries. Add the following classpaths:

  • /usr/lib/hadoop/*

  • /usr/lib/hadoop/lib/*

  • /usr/lib/hadoop-hdfs/*

  • /usr/lib/hadoop-mapreduce/*

  • /usr/lib/hadoop-yarn/*

  • /usr/lib/oozie/lib/*

  • /etc/hadoop/conf/
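
For example, a Hadoop data server for a single-node development cluster might be defined with values like the following (the name, user, host, and paths are illustrative only):

Name: HADOOP_SANDBOX
Data Server: hadoop_sandbox
User/Password: odi_user / <password>
HDFS Node Name URI: hdfs://localhost:8020
Resource Manager/Job Tracker URI: localhost:8032
ODI HDFS Root: /user/odi_user/odi_home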


Section 3.1, "Creating and Initializing the Hadoop Data Server"

3.2 Creating a Hadoop Physical Schema

Create a Hadoop physical schema using the standard procedure, as described in Creating a Physical Schema in Administering Oracle Data Integrator.

For this physical schema, create a logical schema using the standard procedure, as described in Creating a Logical Schema in Administering Oracle Data Integrator, and associate it with a given context.

3.3 Configuring the Oracle Data Integrator Agent to Execute Hadoop Jobs

You must configure the Oracle Data Integrator agent to execute Hadoop jobs.

To configure the Oracle Data Integrator agent:

  1. Install Hadoop on your Oracle Data Integrator agent computer.

    For Oracle Big Data Appliance, see Oracle Big Data Appliance Software User's Guide for instructions for setting up a remote Hadoop client.

  2. Install Hive on your Oracle Data Integrator agent computer.

  3. Install SQOOP on your Oracle Data Integrator agent computer.

  4. Set the following base environment variables for Hadoop and Hive on your ODI agent computer. A sample shell profile fragment that sets these variables is shown after this procedure.

    Table 3-2 Environment Variables Mandatory for Hadoop and Hive

    HADOOP_HOME: Location of the Hadoop installation directory. For example, /usr/lib/hadoop

    HADOOP_CONF: Location of the Hadoop configuration files, such as core-default.xml, core-site.xml, and hdfs-site.xml. For example, /home/shared/hadoop-conf

    HIVE_HOME: Location of the Hive installation directory. For example, /usr/lib/hive

    HIVE_CONF: Location of the Hive configuration files, such as hive-site.xml. For example, /home/shared/hive-conf

    HADOOP_CLASSPATH: $HIVE_HOME/lib/hive-metastore-*.jar:$HIVE_HOME/lib/libthrift-*.jar:$HIVE_HOME/lib/libfb*.jar:$HIVE_HOME/lib/hive-exec-*.jar:$HIVE_CONF

    ODI_ADDITIONAL_CLASSPATH: $HIVE_HOME/lib/'*':$HADOOP_HOME/client/*:$HADOOP_CONF

    ODI_HIVE_SESSION_JARS: $HIVE_HOME/lib/hive-contrib-*.jar

    • Include other JAR files as required, such as custom SerDes JAR files. These JAR files are added to every Hive JDBC session and thus to every Hive MapReduce job.

    • The list of JARs is separated by ":"; wildcards in file names must not evaluate to more than one file.


  5. If you plan to use HBase features, set the following environment variables on your ODI agent computer. Note that you need to set these environment variables in addition to the base Hadoop and Hive environment variables; they are also included in the sample profile fragment shown after this procedure.

    Table 3-3 Environment Variables Mandatory for HBase (in addition to the base Hadoop and Hive environment variables)

    HBASE_HOME: Location of the HBase installation directory. For example, /usr/lib/hbase

    HADOOP_CLASSPATH: $HBASE_HOME/lib/hbase-*.jar:$HIVE_HOME/lib/hive-hbase-handler*.jar:$HBASE_HOME/hbase.jar

    ODI_ADDITIONAL_CLASSPATH: $HBASE_HOME/hbase.jar

    ODI_HIVE_SESSION_JARS: $HBASE_HOME/hbase.jar:$HBASE_HOME/lib/hbase-sep-api-*.jar:$HBASE_HOME/lib/hbase-sep-impl-*hbase*.jar:$HBASE_HOME/lib/hbase-sep-impl-common-*.jar:$HBASE_HOME/lib/hbase-sep-tools-*.jar:$HIVE_HOME/lib/hive-hbase-handler-*.jar
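
As an illustration, the following shell profile fragment sets the variables from Table 3-2 and, optionally, Table 3-3. The installation and configuration paths are examples only and must be adjusted to your environment; whether the HBase entries replace or extend the base values (shown here as appended) depends on your setup.

export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_CONF=/home/shared/hadoop-conf
export HIVE_HOME=/usr/lib/hive
export HIVE_CONF=/home/shared/hive-conf
export HADOOP_CLASSPATH=$HIVE_HOME/lib/hive-metastore-*.jar:$HIVE_HOME/lib/libthrift-*.jar:$HIVE_HOME/lib/libfb*.jar:$HIVE_HOME/lib/hive-exec-*.jar:$HIVE_CONF
export ODI_ADDITIONAL_CLASSPATH=$HIVE_HOME/lib/'*':$HADOOP_HOME/client/*:$HADOOP_CONF
export ODI_HIVE_SESSION_JARS=$HIVE_HOME/lib/hive-contrib-*.jar

# Only if you use HBase features (Table 3-3); shown here as appended to the base values above
export HBASE_HOME=/usr/lib/hbase
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HBASE_HOME/lib/hbase-*.jar:$HIVE_HOME/lib/hive-hbase-handler*.jar:$HBASE_HOME/hbase.jar
export ODI_ADDITIONAL_CLASSPATH=$ODI_ADDITIONAL_CLASSPATH:$HBASE_HOME/hbase.jar
export ODI_HIVE_SESSION_JARS=$ODI_HIVE_SESSION_JARS:$HBASE_HOME/hbase.jar:$HBASE_HOME/lib/hbase-sep-api-*.jar:$HBASE_HOME/lib/hbase-sep-impl-*hbase*.jar:$HBASE_HOME/lib/hbase-sep-impl-common-*.jar:$HBASE_HOME/lib/hbase-sep-tools-*.jar:$HIVE_HOME/lib/hive-hbase-handler-*.jar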


3.4 Configuring Oracle Loader for Hadoop

If you want to use Oracle Loader for Hadoop, you must install and configure Oracle Loader for Hadoop on your Oracle Data Integrator agent computer.

To install and configure Oracle Loader for Hadoop:

  1. Install Oracle Loader for Hadoop on your Oracle Data Integrator agent computer.

    See Installing Oracle Loader for Hadoop in Oracle Big Data Connectors User's Guide.

  2. To use Oracle SQL Connector for HDFS (OLH_OUTPUT_MODE=DP_OSCH or OSCH), you must first install it.

    See "Oracle SQL Connector for Hadoop Distributed File System Setup" in Oracle Big Data Connectors User's Guide.

  3. Set the following environment variables for Oracle Loader for Hadoop on your ODI agent computer.

    Note that you must set these environment variables in addition to the base Hadoop and Hive environment variables. A sample set of export statements is shown after this procedure.

    Table 3-4 Environment Variables Mandatory for Oracle Loader for Hadoop (in addition to the base Hadoop and Hive environment variables)

    OLH_HOME: Location of the Oracle Loader for Hadoop installation. For example, /u01/connectors/olh

    OSCH_HOME: Location of the Oracle SQL Connector for HDFS installation. For example, /u01/connectors/osch

    HADOOP_CLASSPATH: $OLH_HOME/jlib/*:$OSCH_HOME/jlib/*

    To work with OLH, the Hadoop JARs in the HADOOP_CLASSPATH have to be manually resolved without wildcards.

    ODI_OLH_JARS: Comma-separated list of all JAR files required for custom input formats, Hive, Hive SerDes, and so forth, used by Oracle Loader for Hadoop. All file names have to be expanded without wildcards. For example:

    $HIVE_HOME/lib/hive-metastore-0.10.0-cdh4.5.0.jar,$HIVE_HOME/lib/libthrift-0.9.0-cdh4-1.jar,$HIVE_HOME/lib/libfb303-0.9.0.jar

    ODI_OLH_SHAREDLIBS: $OLH_HOME/lib/libolh12.so,$OLH_HOME/lib/libclntsh.so.12.1,$OLH_HOME/lib/libnnz12.so,$OLH_HOME/lib/libociei.so,$OLH_HOME/lib/libclntshcore.so.12.1,$OLH_HOME/lib/libons.so

    ODI_ADDITIONAL_CLASSPATH: $OSCH_HOME/jlib/'*'
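
As an illustration, the following export statements show one possible Oracle Loader for Hadoop setup. The installation paths and the fully expanded JAR names are examples only and depend on your connector and Hive versions.

export OLH_HOME=/u01/connectors/olh
export OSCH_HOME=/u01/connectors/osch
# For OLH, resolve the Hadoop JARs manually without wildcards; the table value is kept here only for brevity
export HADOOP_CLASSPATH=$OLH_HOME/jlib/*:$OSCH_HOME/jlib/*
# Comma-separated, fully expanded file names; the version numbers below are examples
export ODI_OLH_JARS=$HIVE_HOME/lib/hive-metastore-0.10.0-cdh4.5.0.jar,$HIVE_HOME/lib/libthrift-0.9.0-cdh4-1.jar,$HIVE_HOME/lib/libfb303-0.9.0.jar
export ODI_OLH_SHAREDLIBS=$OLH_HOME/lib/libolh12.so,$OLH_HOME/lib/libclntsh.so.12.1,$OLH_HOME/lib/libnnz12.so,$OLH_HOME/lib/libociei.so,$OLH_HOME/lib/libclntshcore.so.12.1,$OLH_HOME/lib/libons.so
export ODI_ADDITIONAL_CLASSPATH=$OSCH_HOME/jlib/'*'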


3.5 Configuring Oracle Data Integrator to Connect to a Secure Cluster

To run the Oracle Data Integrator agent on a Hadoop cluster that is protected by Kerberos authentication, you must perform the following additional configuration steps.

To use a Kerberos-secured cluster:

  1. Log in to node04 of the Oracle Big Data Appliance, where the Oracle Data Integrator agent runs.

  2. Generate a new Kerberos ticket for the oracle user. Use the following command, replacing realm with the actual Kerberos realm name.

    $ kinit oracle@realm

  3. Set the environment variables by using the following commands. Substitute the appropriate values for your appliance:

    $ export KRB5CCNAME=Kerberos-ticket-cache-directory

    $ export KRB5_CONFIG=Kerberos-configuration-file

    $ export HADOOP_OPTS="$HADOOP_OPTS -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl -Djava.security.krb5.conf=Kerberos-configuration-file"

    In this example, the configuration files are named krb5* and are located in /tmp/oracle_krb/:

    $ export KRB5CCNAME=/tmp/oracle_krb/krb5cc_1000

    $ export KRB5_CONFIG=/tmp/oracle_krb/krb5.conf

    $ export HADOOP_OPTS="$HADOOP_OPTS -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl -Djava.security.krb5.conf=/tmp/oracle_krb/krb5.conf"

  4. Redefine the JDBC connection URL, using syntax like the following:

    jdbc:hive2://node1:10000/default;principal=HiveServer2-Kerberos-Principal

    For example:

    jdbc:hive2://bda1node01.example.com:10000/default;principal=hive/HiveServer2Host@EXAMPLE.COM

    See also, "HiveServer2 Security Configuration" in the CDH5 Security Guide at the following URL:

    http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Security-Guide/cdh5sg_hiveserver2_security.html

  5. Renew the Kerberos ticket for the oracle user on a regular basis to prevent disruptions in service, for example with a scheduled kinit job as sketched after this procedure.

    See Oracle Big Data Appliance Software User's Guide for instructions about managing Kerberos on Oracle Big Data Appliance.
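
How you keep the ticket current depends on your site's security policies. As a minimal sketch only, assuming a keytab for the oracle user at the hypothetical path /home/oracle/oracle.keytab and the EXAMPLE.COM realm, the ticket could be refreshed periodically from cron:

# Illustrative crontab entry: refresh the ticket every 8 hours into the cache used by the agent
0 */8 * * * KRB5CCNAME=/tmp/oracle_krb/krb5cc_1000 /usr/bin/kinit -k -t /home/oracle/oracle.keytab oracle@EXAMPLE.COM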

3.6 Configuring Oracle Data Integrator Studio for Executing Hadoop Jobs on the Local Agent

For executing Hadoop jobs on the local agent of an Oracle Data Integrator Studio installation, follow the configuration steps in the previous section with the following change: Copy JAR files into the Oracle Data Integrator userlib directory instead of the drivers directory. For example:

Linux: $USER_HOME/.odi/oracledi/userlib

Windows: C:\Users\<USERNAME>\AppData\Roaming\odi\oracledi\userlib
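
For example, on Linux, assuming the drivers your configuration requires are the Hive JDBC JARs under $HIVE_HOME/lib (an illustrative choice; copy whichever JARs your setup actually needs, and note that ~ corresponds to $USER_HOME above), you might run:

$ cp $HIVE_HOME/lib/hive-jdbc-*.jar ~/.odi/oracledi/userlib/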