3 Setting Up the Environment for Integrating Hadoop Data

This chapter describes the steps you need to perform to set up the environment to integrate Hadoop data.

This chapter includes the following sections:

  • Section 3.1, "Creating and Initializing the Hadoop Data Server"

  • Section 3.2, "Creating a Hadoop Physical Schema"

  • Section 3.3, "Configuring the Oracle Data Integrator Agent to Execute Hadoop Jobs"

  • Section 3.4, "Configuring Oracle Loader for Hadoop"

  • Section 3.5, "Configuring Oracle Data Integrator to Connect to a Secure Cluster"

  • Section 3.6, "Configuring Oracle Data Integrator Studio for Executing Hadoop Jobs on the Local Agent"

3.1 Creating and Initializing the Hadoop Data Server

To create and initialize the Hadoop data server:

  1. Click the Topology tab.

  2. In the Physical Architecture tree, under Technologies, right-click Hadoop and then click New Data Server.

  3. In the Definition tab, specify the details of the Hadoop data server.

    See Section 3.1.1, "Hadoop Data Server Definition" for more information.

  4. Click Test Connection to test the connection to the Hadoop data server.

  5. Click Initialize to initialize the Hadoop data server.

    Initializing the Hadoop data server creates the structure of the ODI Master repository and Work repository in HDFS.

3.1.1 Hadoop Data Server Definition

The following table describes the fields that you need to specify on the Definition tab when creating a new Hadoop data server. A sample set of values is shown after the table.

Note: Only the fields required or specific to defining a Hadoop data server are described.

Table 3-1 Hadoop Data Server Definition

Name: Name of the data server that appears in Oracle Data Integrator.

Data Server: Physical name of the data server.

User/Password: Hadoop user and its password. If the password is not provided, only simple authentication is performed using the user name on HDFS and Oozie.

HDFS Node Name URI: URI of the HDFS node name. For example, hdfs://localhost:8020

Resource Manager/Job Tracker URI: URI of the resource manager or the job tracker. For example, localhost:8032

ODI HDFS Root: Path of the ODI HDFS root directory. For example, /user/<login_username>/odi_home

Additional Class Path: Specify additional classpath entries. Add the following classpaths:

  • /usr/lib/hadoop/*

  • /usr/lib/hadoop/lib/*

  • /usr/lib/hadoop-hdfs/*

  • /usr/lib/hadoop-mapreduce/*

  • /usr/lib/hadoop-yarn/*

  • /usr/lib/oozie/lib/*

  • /etc/hadoop/conf/
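
For example, a Hadoop data server for a single-node development cluster might be defined with values like the following (the name, user, host, and paths are illustrative only):

Name: HADOOP_SANDBOX
Data Server: hadoop_sandbox
User/Password: odi_user / <password>
HDFS Node Name URI: hdfs://localhost:8020
Resource Manager/Job Tracker URI: localhost:8032
ODI HDFS Root: /user/odi_user/odi_home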


Section 3.1, "Creating and Initializing the Hadoop Data Server"

3.2 Creating a Hadoop Physical Schema

Create a Hadoop physical schema using the standard procedure, as described in Creating a Physical Schema in Administering Oracle Data Integrator.

For this physical schema, create a logical schema using the standard procedure, as described in Creating a Logical Schema in Administering Oracle Data Integrator, and associate it with a given context.

3.3 Configuring the Oracle Data Integrator Agent to Execute Hadoop Jobs

You must configure the Oracle Data Integrator agent to execute Hadoop jobs.

To configure the Oracle Data Integrator agent:

  1. Install Hadoop on your Oracle Data Integrator agent computer.

    For Oracle Big Data Appliance, see Oracle Big Data Appliance Software User's Guide for instructions for setting up a remote Hadoop client.

  2. Install Hive on your Oracle Data Integrator agent computer.

  3. Install SQOOP on your Oracle Data Integrator agent computer.

  4. Set the following base environment variables for Hadoop and Hive on your ODI agent computer. A sample shell profile fragment that sets these variables is shown after this procedure.

    Table 3-2 Environment Variables Mandatory for Hadoop and Hive

    HADOOP_HOME: Location of the Hadoop installation directory. For example, /usr/lib/hadoop

    HADOOP_CONF: Location of the Hadoop configuration files, such as core-default.xml, core-site.xml, and hdfs-site.xml. For example, /home/shared/hadoop-conf

    HIVE_HOME: Location of the Hive installation directory. For example, /usr/lib/hive

    HIVE_CONF: Location of the Hive configuration files, such as hive-site.xml. For example, /home/shared/hive-conf

    HADOOP_CLASSPATH: $HIVE_HOME/lib/hive-metastore-*.jar:$HIVE_HOME/lib/libthrift-*.jar:$HIVE_HOME/lib/libfb*.jar:$HIVE_HOME/lib/hive-exec-*.jar:$HIVE_CONF

    ODI_ADDITIONAL_CLASSPATH: $HIVE_HOME/lib/'*':$HADOOP_HOME/client/*:$HADOOP_CONF

    ODI_HIVE_SESSION_JARS: $HIVE_HOME/lib/hive-contrib-*.jar

    • Include other JAR files as required, such as custom SerDes JAR files. These JAR files are added to every Hive JDBC session and thus to every Hive MapReduce job.

    • The list of JARs is separated by ":"; wildcards in file names must not evaluate to more than one file.


  5. If you plan to use HBase features, set the following environment variables on your ODI agent computer. Note that you need to set these environment variables in addition to the base Hadoop and Hive environment variables; they are also included in the sample profile fragment shown after this procedure.

    Table 3-3 Environment Variables Mandatory for HBase (in addition to the base Hadoop and Hive environment variables)

    HBASE_HOME: Location of the HBase installation directory. For example, /usr/lib/hbase

    HADOOP_CLASSPATH: $HBASE_HOME/lib/hbase-*.jar:$HIVE_HOME/lib/hive-hbase-handler*.jar:$HBASE_HOME/hbase.jar

    ODI_ADDITIONAL_CLASSPATH: $HBASE_HOME/hbase.jar

    ODI_HIVE_SESSION_JARS: $HBASE_HOME/hbase.jar:$HBASE_HOME/lib/hbase-sep-api-*.jar:$HBASE_HOME/lib/hbase-sep-impl-*hbase*.jar:$HBASE_HOME/lib/hbase-sep-impl-common-*.jar:$HBASE_HOME/lib/hbase-sep-tools-*.jar:$HIVE_HOME/lib/hive-hbase-handler-*.jar
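
As an illustration, the following shell profile fragment sets the variables from Table 3-2 and, optionally, Table 3-3. The installation and configuration paths are examples only and must be adjusted to your environment; whether the HBase entries replace or extend the base values (shown here as appended) depends on your setup.

export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_CONF=/home/shared/hadoop-conf
export HIVE_HOME=/usr/lib/hive
export HIVE_CONF=/home/shared/hive-conf
export HADOOP_CLASSPATH=$HIVE_HOME/lib/hive-metastore-*.jar:$HIVE_HOME/lib/libthrift-*.jar:$HIVE_HOME/lib/libfb*.jar:$HIVE_HOME/lib/hive-exec-*.jar:$HIVE_CONF
export ODI_ADDITIONAL_CLASSPATH=$HIVE_HOME/lib/'*':$HADOOP_HOME/client/*:$HADOOP_CONF
export ODI_HIVE_SESSION_JARS=$HIVE_HOME/lib/hive-contrib-*.jar

# Only if you use HBase features (Table 3-3); shown here as appended to the base values above
export HBASE_HOME=/usr/lib/hbase
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HBASE_HOME/lib/hbase-*.jar:$HIVE_HOME/lib/hive-hbase-handler*.jar:$HBASE_HOME/hbase.jar
export ODI_ADDITIONAL_CLASSPATH=$ODI_ADDITIONAL_CLASSPATH:$HBASE_HOME/hbase.jar
export ODI_HIVE_SESSION_JARS=$ODI_HIVE_SESSION_JARS:$HBASE_HOME/hbase.jar:$HBASE_HOME/lib/hbase-sep-api-*.jar:$HBASE_HOME/lib/hbase-sep-impl-*hbase*.jar:$HBASE_HOME/lib/hbase-sep-impl-common-*.jar:$HBASE_HOME/lib/hbase-sep-tools-*.jar:$HIVE_HOME/lib/hive-hbase-handler-*.jar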


3.4 Configuring Oracle Loader for Hadoop

If you want to use Oracle Loader for Hadoop, you must install and configure Oracle Loader for Hadoop on your Oracle Data Integrator agent computer.

To install and configure Oracle Loader for Hadoop:

  1. Install Oracle Loader for Hadoop on your Oracle Data Integrator agent computer.

    See Installing Oracle Loader for Hadoop in Oracle Big Data Connectors User's Guide.

  2. To use Oracle SQL Connector for HDFS (OLH_OUTPUT_MODE=DP_OSCH or OSCH), you must first install it.

    See "Oracle SQL Connector for Hadoop Distributed File System Setup" in Oracle Big Data Connectors User's Guide.

  3. Set the following environment variables for Oracle Loader for Hadoop on your ODI agent computer.

    Note that you must set these environment variables in addition to the base Hadoop and Hive environment variables. A sample set of export statements is shown after this procedure.

    Table 3-4 Environment Variables Mandatory for Oracle Loader for Hadoop (in addition to the base Hadoop and Hive environment variables)

    OLH_HOME: Location of the Oracle Loader for Hadoop installation. For example, /u01/connectors/olh

    OSCH_HOME: Location of the Oracle SQL Connector for HDFS installation. For example, /u01/connectors/osch

    HADOOP_CLASSPATH: $OLH_HOME/jlib/*:$OSCH_HOME/jlib/*

    To work with OLH, the Hadoop JARs in the HADOOP_CLASSPATH have to be manually resolved without wildcards.

    ODI_OLH_JARS: Comma-separated list of all JAR files required for custom input formats, Hive, Hive SerDes, and so forth, used by Oracle Loader for Hadoop. All file names have to be expanded without wildcards. For example:

    $HIVE_HOME/lib/hive-metastore-0.10.0-cdh4.5.0.jar,$HIVE_HOME/lib/libthrift-0.9.0-cdh4-1.jar,$HIVE_HOME/lib/libfb303-0.9.0.jar

    ODI_OLH_SHAREDLIBS: $OLH_HOME/lib/libolh12.so,$OLH_HOME/lib/libclntsh.so.12.1,$OLH_HOME/lib/libnnz12.so,$OLH_HOME/lib/libociei.so,$OLH_HOME/lib/libclntshcore.so.12.1,$OLH_HOME/lib/libons.so

    ODI_ADDITIONAL_CLASSPATH: $OSCH_HOME/jlib/'*'
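
As an illustration, the following export statements show one possible Oracle Loader for Hadoop setup. The installation paths and the fully expanded JAR names are examples only and depend on your connector and Hive versions.

export OLH_HOME=/u01/connectors/olh
export OSCH_HOME=/u01/connectors/osch
# For OLH, resolve the Hadoop JARs manually without wildcards; the table value is kept here only for brevity
export HADOOP_CLASSPATH=$OLH_HOME/jlib/*:$OSCH_HOME/jlib/*
# Comma-separated, fully expanded file names; the version numbers below are examples
export ODI_OLH_JARS=$HIVE_HOME/lib/hive-metastore-0.10.0-cdh4.5.0.jar,$HIVE_HOME/lib/libthrift-0.9.0-cdh4-1.jar,$HIVE_HOME/lib/libfb303-0.9.0.jar
export ODI_OLH_SHAREDLIBS=$OLH_HOME/lib/libolh12.so,$OLH_HOME/lib/libclntsh.so.12.1,$OLH_HOME/lib/libnnz12.so,$OLH_HOME/lib/libociei.so,$OLH_HOME/lib/libclntshcore.so.12.1,$OLH_HOME/lib/libons.so
export ODI_ADDITIONAL_CLASSPATH=$OSCH_HOME/jlib/'*'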


3.5 Configuring Oracle Data Integrator to Connect to a Secure Cluster

To run the Oracle Data Integrator agent on a Hadoop cluster that is protected by Kerberos authentication, you must perform the following additional configuration steps.

To use a Kerberos-secured cluster:

  1. Log in to node04 of the Oracle Big Data Appliance, where the Oracle Data Integrator agent runs.

  2. Generate a new Kerberos ticket for the oracle user. Use the following command, replacing realm with the actual Kerberos realm name.

    $ kinit oracle@realm

  3. Set the environment variables by using the following commands. Substitute the appropriate values for your appliance:

    $ export KRB5CCNAME=Kerberos-ticket-cache-directory

    $ export KRB5_CONFIG=Kerberos-configuration-file

    $ export HADOOP_OPTS="$HADOOP_OPTS -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl -Djava.security.krb5.conf=Kerberos-configuration-file"

    In this example, the configuration files are named krb5* and are located in /tmp/oracle_krb/:

    $ export KRB5CCNAME=/tmp/oracle_krb/krb5cc_1000

    $ export KRB5_CONFIG=/tmp/oracle_krb/krb5.conf

    $ export HADOOP_OPTS="$HADOOP_OPTS -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl -Djava.security.krb5.conf=/tmp/oracle_krb/krb5.conf"

  4. Redefine the JDBC connection URL, using syntax like the following:

    jdbc:hive2://node1:10000/default;principal=HiveServer2-Kerberos-Principal

    For example:

    jdbc:hive2://bda1node01.example.com:10000/default;principal=hive/HiveServer2Host@EXAMPLE.COM

    See also, "HiveServer2 Security Configuration" in the CDH5 Security Guide at the following URL:

    http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Security-Guide/cdh5sg_hiveserver2_security.html

  5. Renew the Kerberos ticket for the oracle user on a regular basis to prevent disruptions in service, for example with a scheduled kinit job as sketched after this procedure.

    See Oracle Big Data Appliance Software User's Guide for instructions about managing Kerberos on Oracle Big Data Appliance.
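
How you keep the ticket current depends on your site's security policies. As a minimal sketch only, assuming a keytab for the oracle user at the hypothetical path /home/oracle/oracle.keytab and the EXAMPLE.COM realm, the ticket could be refreshed periodically from cron:

# Illustrative crontab entry: refresh the ticket every 8 hours into the cache used by the agent
0 */8 * * * KRB5CCNAME=/tmp/oracle_krb/krb5cc_1000 /usr/bin/kinit -k -t /home/oracle/oracle.keytab oracle@EXAMPLE.COM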

3.6 Configuring Oracle Data Integrator Studio for Executing Hadoop Jobs on the Local Agent

For executing Hadoop jobs on the local agent of an Oracle Data Integrator Studio installation, follow the configuration steps in the previous section with the following change: Copy JAR files into the Oracle Data Integrator userlib directory instead of the drivers directory. For example:

Linux: $USER_HOME/.odi/oracledi/userlib

Windows: C:\Users\<USERNAME>\AppData\Roaming\odi\oracledi\userlib
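
For example, on Linux, assuming the drivers your configuration requires are the Hive JDBC JARs under $HIVE_HOME/lib (an illustrative choice; copy whichever JARs your setup actually needs, and note that ~ corresponds to $USER_HOME above), you might run:

$ cp $HIVE_HOME/lib/hive-jdbc-*.jar ~/.odi/oracledi/userlib/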