3 Setting Up the Environment for Integrating Hadoop Data

This chapter describes the steps you need to perform to set up the environment to integrate Hadoop data.

This chapter includes the following sections:

  • Section 3.1, "Configuring Big Data technologies using the Big Data Configurations Wizard"

  • Section 3.2, "Creating and Initializing the Hadoop Data Server"

  • Section 3.3, "Creating a Hadoop Physical Schema"

  • Section 3.4, "Configuring the Oracle Data Integrator Agent to Execute Hadoop Jobs"

  • Section 3.5, "Configuring Oracle Loader for Hadoop"

  • Section 3.6, "Configuring Oracle Data Integrator to Connect to a Secure Cluster"

  • Section 3.7, "Configuring Oracle Data Integrator Studio for Executing Hadoop Jobs on the Local Agent"

3.1 Configuring Big Data technologies using the Big Data Configurations Wizard

The Big Data Configurations wizard provides a single entry point to set up multiple Hadoop technologies. You can quickly create data servers, physical schemas, and logical schemas, and set a context for Hadoop technologies such as Hadoop, HBase, Oozie, Spark, Hive, and Pig.

The default metadata for different distributions, such as properties, host names, and port numbers, and the default values for environment variables are pre-populated for you. This helps you easily create the data servers, along with the physical and logical schemas, without in-depth knowledge of these technologies.

After all the technologies are configured, you can validate the settings against the data servers to test the connection status.

Note:

If you do not want to use the Big Data Configurations wizard, you can set up the data servers for the Big Data technologies manually using the information mentioned in the subsequent sections.

To run the Big Data Configurations Wizard:

  1. In ODI Studio, select File and click New....

  2. In the New Gallery dialog, select Big Data Configurations and click OK.

    The Big Data Configurations wizard appears.

  3. In the General Settings panel of the wizard, specify the required options.

    See Section 3.1.1, "General Settings" for more information.

  4. Click Next.

    A data server panel is displayed for each of the technologies that you selected in the General Settings panel.

  5. In the Hadoop panel of the wizard, do the following:

    • Specify the options required to create the Hadoop data server.

      See Section 3.2.1, "Hadoop Data Server Definition" for more information.

    • In the Properties section, click the + icon to add any data server properties.

    • Select a logical schema, physical schema, and a context from the appropriate drop-down lists.

  6. Click Next.

  7. In the HBase panel of the wizard, do the following:

    • Specify the options required to create the HBase data server.

      See Section 3.1.2, "HBase Data Server Definition" for more information.

    • In the Properties section, click the + icon to add any data server properties.

    • Select a logical schema, physical schema, and a context from the appropriate drop-down lists.

  8. In the Spark panel of the wizard, do the following:

    • Specify the options required to create the Spark data server.

      See Section 6.6.1, "Spark Data Server Definition" for more information.

    • In the Properties section, click the + icon to add any data server properties.

    • Select a logical schema, physical schema, and a context from the appropriate drop-down lists.

  9. Click Next.

  10. In the Pig panel of the wizard, do the following:

    • Specify the options required to create the Pig data server.

      See Section 6.4.1, "Pig Data Server Definition" for more information.

    • In the Properties section, click the + icon to add any data server properties.

    • Select a logical schema, physical schema, and a context from the appropriate drop-down lists.

  11. Click Next.

  12. In the Hive panel of the wizard, do the following:

    • Specify the options required to create the Hive data server.

      See Section 6.2.1, "Hive Data Server Definition" for more information.

    • In the Properties section, click the + icon to add any data server properties.

    • Select a logical schema, physical schema, and a context from the appropriate drop-down lists.

  13. Click Next.

  14. In the Oozie panel of the wizard, do the following:

    • Specify the options required to create the Oozie runtime engine.

      See Section 5.2.1, "Oozie Runtime Engine Definition" for more information.

    • In the Properties section, review the data server properties that are listed.

      Note: You cannot add new properties or remove listed properties. However, if required, you can change the value of listed properties.

      See Section 5.2.2, "Oozie Runtime Engine Properties" for more information.

    • Select a logical schema and a context from the appropriate drop-down lists.

  15. Click Next.

  16. In the Validate all the settings panel, click Test All Settings to validate the settings against the data servers and verify the connection status.

  17. Click Finish.

3.1.1 General Settings

The following table describes the options that you need to set on the General Settings panel of the Big Data Configurations wizard.

Table 3-1 General Settings Options

Option Description

Prefix

Specify a prefix. This prefix is attached to the data server name, logical schema name, and physical schema name.

Distribution

Select a distribution, either Manual or CDH <version>.

Base Directory

Specify the base directory. This base directory is automatically populated in all other panels of the wizard.

Note: This option appears only if the distribution is other than Manual.

Technologies

Select the technologies that you want to configure.

Note: Only the data server creation panels for the selected technologies are displayed.


See Also: Section 3.1, "Configuring Big Data technologies using the Big Data Configurations Wizard".

3.1.2 HBase Data Server Definition

The following table describes the options that you must specify to create an HBase data server.

Note: Only the fields required or specific for defining an HBase data server are described.

Table 3-2 HBase Data Server Definition

Option Description

Name

Type a name for the data server. This name appears in Oracle Data Integrator.

HBase Quorum

Quorum of the HBase installation. For example, localhost:2181.

User/Password

User name with its password.

Hadoop Data Server

Hadoop data server that you want to associate with the HBase data server.

Additional Classpath

By default, the following classpaths are added:

  • /usr/lib/hbase/*

  • /usr/lib/hbase/lib/*

Specify the additional classpaths, if required.


See Also: Section 3.1, "Configuring Big Data technologies using the Big Data Configurations Wizard".

3.2 Creating and Initializing the Hadoop Data Server

To create and initialize the Hadoop data server:

  1. Click the Topology tab.

  2. In the Physical Architecture tree, under Technologies, right-click Hadoop and then click New Data Server.

  3. In the Definition tab, specify the details of the Hadoop data server.

    See Section 3.2.1, "Hadoop Data Server Definition" for more information.

  4. In the Properties tab, specify the properties for the Hadoop data server.

    See Section 3.2.2, "Hadoop Data Server Properties" for more information.

  5. Click Initialize to initialize the Hadoop data server.

    Initializing the Hadoop data server creates the structure of the ODI Master repository and Work repository in HDFS.

  6. Click Test Connection to test the connection to the Hadoop data server.
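    After the connection test succeeds, you can also verify the initialized repository structure from a shell on the cluster. The path below assumes the default ODI HDFS root; adjust it to match your data server definition:

    $ hdfs dfs -ls /user/<login_username>/odi_home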

3.2.1 Hadoop Data Server Definition

The following table describes the fields that you need to specify on the Definition tab when creating a new Hadoop data server.

Note: Only the fields required or specific for defining a Hadoop data server are described.

Table 3-3 Hadoop Data Server Definition

Field Description

Name

Name of the data server that appears in Oracle Data Integrator.

Data Server

Physical name of the data server.

User/Password

Hadoop user with its password.

If a password is not provided, only simple authentication is performed using the user name on HDFS and Oozie.

HDFS Node Name URI

URI of the HDFS node name.

For example, hdfs://localhost:8020.

Resource Manager/Job Tracker URI

URI of the resource manager or the job tracker.

For example, localhost:8032.

ODI HDFS Root

Path of the ODI HDFS root directory.

For example, /user/<login_username>/odi_home.

Additional Class Path

Add the following additional classpaths:

  • /usr/lib/hadoop/*

  • /usr/lib/hadoop/lib/*

  • /usr/lib/hadoop-hdfs/*

  • /usr/lib/hadoop-mapreduce/*

  • /usr/lib/hadoop-yarn/*

  • /usr/lib/oozie/lib/*

  • /etc/hadoop/conf/
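For example, a Hadoop data server for a single-node cluster might be defined with the following values. The data server name and user directory are illustrative only; substitute values for your own environment:

  Name: HADOOP_LOCAL
  HDFS Node Name URI: hdfs://localhost:8020
  Resource Manager/Job Tracker URI: localhost:8032
  ODI HDFS Root: /user/oracle/odi_home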


See Also:

Section 3.2, "Creating and Initializing the Hadoop Data Server".

Section 3.1, "Configuring Big Data technologies using the Big Data Configurations Wizard".

3.2.2 Hadoop Data Server Properties

The following table describes the properties that you can configure in the Properties tab when defining a new Hadoop data server.

Note: These properties can be inherited by other Hadoop technologies, such as Hive or HDFS. To inherit these properties, you must select the configured Hadoop data server when creating data server for other Hadoop technologies.

Table 3-4 Hadoop Data Server Properties

Property Description/Value

Properties mandatory for Hadoop and Hive

The following properties are mandatory for Hadoop and Hive.

HADOOP_HOME

Location of Hadoop dir. For example, /usr/lib/hadoop

HADOOP_CONF

Location of Hadoop configuration files such as core-default.xml, core-site.xml, and hdfs-site.xml. For example, /home/shared/hadoop-conf

HIVE_HOME

Location of Hive dir. For example, /usr/lib/hive

HIVE_CONF

Location of Hive configuration files such as hive-site.xml. For example, /home/shared/hive-conf

HADOOP_CLASSPATH

$HIVE_HOME/lib/hive-metastore-*.jar:$HIVE_HOME/lib/libthrift-*.jar:$HIVE_HOME/lib/libfb*.jar:$HIVE_HOME/lib/hive-exec-*.jar:$HIVE_CONF

HADOOP_CLIENT_OPTS

-Dlog4j.debug -Dhadoop.root.logger=INFO,console -Dlog4j.configuration=file:/etc/hadoop/conf.cloudera.yarn/log4j.properties

ODI_ADDITIONAL_CLASSPATH

$HIVE_HOME/lib/'*':$HADOOP_HOME/client/*:$HADOOP_CONF

HIVE_SESSION_JARS

$HIVE_HOME/lib/hive-contrib-*.jar:<ODI library directory>/wlhive.jar

  • The actual path of wlhive.jar can be determined under the ODI installation home.

  • Include other JAR files as required, such as custom SerDes JAR files. These JAR files are added to every Hive JDBC session and thus are added to every Hive MapReduce job.

  • The list of JARs is separated by colons (:). Wildcards in file names must not evaluate to more than one file.

Properties mandatory for HBase (in addition to the base Hadoop and Hive properties)

The following properties are mandatory for HBase. Note that you need to set these properties in addition to the base Hadoop and Hive properties.

HBASE_HOME

Location of HBase dir. For example, /usr/lib/hbase

HADOOP_CLASSPATH

$HBASE_HOME/lib/hbase-*.jar:$HIVE_HOME/lib/hive-hbase-handler*.jar:$HBASE_HOME/hbase.jar

ODI_ADDITIONAL_CLASSPATH

$HBASE_HOME/hbase.jar

HIVE_SESSION_JARS

$HBASE_HOME/hbase.jar:$HBASE_HOME/lib/hbase-sep-api-*.jar:$HBASE_HOME/lib/hbase-sep-impl-*hbase*.jar:$HBASE_HOME/lib/hbase-sep-impl-common-*.jar:$HBASE_HOME/lib/hbase-sep-tools-*.jar:$HIVE_HOME/lib/hive-hbase-handler-*.jar

Properties mandatory for Oracle Loader for Hadoop (In addition to base Hadoop and Hive properties)

The following properties are mandatory for Oracle Loader for Hadoop. Note that you need to set these properties in addition to the base Hadoop and Hive properties.

OLH_HOME

Location of OLH installation. For example, /u01/connectors/olh

OLH_FILES

/usr/lib/hive/lib/hive-contrib-1.1.0-cdh5.5.1.jar

OSCH_HOME

Location of OSCH installation. For example, /u01/connectors/osch

HADOOP_CLASSPATH

$OLH_HOME/jlib/*:$OSCH_HOME/jlib/*

To work with OLH, the Hadoop JARs in HADOOP_CLASSPATH have to be resolved manually, without wildcards.

OLH_JARS

Comma-separated list of all JAR files required for custom input formats, Hive, Hive SerDes, and so forth, used by Oracle Loader for Hadoop. All filenames have to be expanded without wildcards.

For example:

$HIVE_HOME/lib/hive-metastore-0.10.0-cdh4.5.0.jar,$HIVE_HOME/lib/libthrift-0.9.0-cdh4-1.jar,$HIVE_HOME/lib/libfb303-0.9.0.jar

OLH_SHAREDLIBS

$OLH_HOME/lib/libolh12.so,$OLH_HOME/lib/libclntsh.so.12.1,$OLH_HOME/lib/libnnz12.so,$OLH_HOME/lib/libociei.so,$OLH_HOME/lib/libclntshcore.so.12.1,$OLH_HOME/lib/libons.so

ODI_ADDITIONAL_CLASSPATH

$OSCH_HOME/jlib/'*'

Properties mandatory for SQOOP (In addition to base Hadoop and Hive properties)

The following properties are mandatory for SQOOP. Note that you need to set these properties in addition to the base Hadoop and Hive properties.

SQOOP_HOME

Location of Sqoop dir. For example, /usr/lib/sqoop

SQOOP_LIBJARS

Location of the SQOOP library JARs. For example, /usr/lib/hive/lib/hive-contrib-1.1.0-cdh5.5.1.jar
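For illustration, on a hypothetical CDH installation that uses the default /usr/lib locations, the base Hadoop and Hive properties might resolve to the following values. The configuration directories are assumptions; use the locations of your own cluster:

  HADOOP_HOME=/usr/lib/hadoop
  HADOOP_CONF=/etc/hadoop/conf
  HIVE_HOME=/usr/lib/hive
  HIVE_CONF=/etc/hive/conf
  HADOOP_CLASSPATH=$HIVE_HOME/lib/hive-metastore-*.jar:$HIVE_HOME/lib/libthrift-*.jar:$HIVE_HOME/lib/libfb*.jar:$HIVE_HOME/lib/hive-exec-*.jar:$HIVE_CONF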


See Also: Section 3.2, "Creating and Initializing the Hadoop Data Server".

3.3 Creating a Hadoop Physical Schema

Create a Hadoop physical schema using the standard procedure, as described in Creating a Physical Schema in Administering Oracle Data Integrator.

Create a logical schema for this physical schema using the standard procedure, as described in Creating a Logical Schema in Administering Oracle Data Integrator, and associate it with the physical schema in a given context.

3.4 Configuring the Oracle Data Integrator Agent to Execute Hadoop Jobs

You must configure the Oracle Data Integrator agent to execute Hadoop jobs.

To configure the Oracle Data Integrator agent:

  1. Install Hadoop on your Oracle Data Integrator agent computer.

    For Oracle Big Data Appliance, see Oracle Big Data Appliance Software User's Guide for instructions for setting up a remote Hadoop client.

  2. Install Hive on your Oracle Data Integrator agent computer.

  3. Install SQOOP on your Oracle Data Integrator agent computer.

  4. Set the base properties for Hadoop and Hive on your ODI agent computer.

    These properties must be added as Hadoop data server properties. For more information, see Section 3.2.2, "Hadoop Data Server Properties".

  5. If you plan to use HBase features, set the HBase properties on your ODI agent computer. Note that you need to set these properties in addition to the base Hadoop and Hive properties.

    These properties must be added as Hadoop data server properties. For more information, see Section 3.2.2, "Hadoop Data Server Properties".
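After the installations are complete, you can confirm that the Hadoop, Hive, and SQOOP clients are visible to the agent's environment by running their version commands from a shell on the agent computer, for example:

  $ hadoop version
  $ hive --version
  $ sqoop version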

3.5 Configuring Oracle Loader for Hadoop

If you want to use Oracle Loader for Hadoop, you must install and configure Oracle Loader for Hadoop on your Oracle Data Integrator agent computer.

To install and configure Oracle Loader for Hadoop:

  1. Install Oracle Loader for Hadoop on your Oracle Data Integrator agent computer.

    See Installing Oracle Loader for Hadoop in Oracle Big Data Connectors User's Guide.

  2. To use Oracle SQL Connector for HDFS (OLH_OUTPUT_MODE=DP_OSCH or OSCH), you must first install it.

    See "Oracle SQL Connector for Hadoop Distributed File System Setup" in Oracle Big Data Connectors User's Guide.

  3. Set the properties for Oracle Loader for Hadoop on your ODI agent computer. Note that you must set these properties in addition to the base Hadoop and Hive properties.

    These properties must be added as Hadoop data server properties. For more information, see Section 3.2.2, "Hadoop Data Server Properties".

3.6 Configuring Oracle Data Integrator to Connect to a Secure Cluster

To run the Oracle Data Integrator agent on a Hadoop cluster that is protected by Kerberos authentication, you must configure the agent to work with the Kerberos-secured cluster.

To use a Kerberos-secured cluster:

  1. Log in to node04 of the Oracle Big Data Appliance, where the Oracle Data Integrator agent runs.

  2. Generate a new Kerberos ticket for the oracle user. Use the following command, replacing realm with the actual Kerberos realm name.

    $ kinit oracle@realm
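    You can confirm that the ticket was granted by running klist, which lists the cached credentials and their expiration times:

    $ klist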

  3. Set the environment variables by using the following commands. Substitute the appropriate values for your appliance:

    $ export KRB5CCNAME=Kerberos-ticket-cache-directory

    $ export KRB5_CONFIG=Kerberos-configuration-file

    $ export HADOOP_OPTS="$HADOOP_OPTS -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl -Djava.security.krb5.conf=Kerberos-configuration-file"

    In this example, the configuration files are named krb5* and are located in /tmp/oracle_krb/:

    $ export KRB5CCNAME=/tmp/oracle_krb/krb5cc_1000

    $ export KRB5_CONFIG=/tmp/oracle_krb/krb5.conf

    $ export HADOOP_OPTS="$HADOOP_OPTS -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl -Djava.security.krb5.conf=/tmp/oracle_krb/krb5.conf"

  4. Redefine the JDBC connection URL, using syntax like the following:

    jdbc:hive2://node1:10000/default;principal=HiveServer2-Kerberos-Principal

    For example:

    jdbc:hive2://bda1node01.example.com:10000/default;principal=hive/HiveServer2Host@EXAMPLE.COM

    See also, "HiveServer2 Security Configuration" in the CDH5 Security Guide at the following URL:

    http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Security-Guide/cdh5sg_hiveserver2_security.html
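    Optionally, before updating the ODI topology, you can verify the secured URL outside of Oracle Data Integrator by using the Beeline client. This example reuses the host name and principal shown above:

    $ beeline -u "jdbc:hive2://bda1node01.example.com:10000/default;principal=hive/HiveServer2Host@EXAMPLE.COM"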

  5. Renew the Kerberos ticket for the oracle user on a regular basis to prevent disruptions in service.

    See Oracle Big Data Appliance Software User's Guide for instructions about managing Kerberos on Oracle Big Data Appliance.
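    One common approach, shown here only as a sketch, is to schedule the renewal with cron by using a keytab. The keytab path and principal below are assumptions for illustration:

    # crontab entry: renew the oracle user's ticket every 8 hours (keytab path is illustrative)
    0 */8 * * * kinit -k -t /home/oracle/oracle.keytab oracle@EXAMPLE.COM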

3.7 Configuring Oracle Data Integrator Studio for Executing Hadoop Jobs on the Local Agent

For executing Hadoop jobs on the local agent of an Oracle Data Integrator Studio installation, follow the configuration steps in Section 3.4, "Configuring the Oracle Data Integrator Agent to Execute Hadoop Jobs", with the following change: copy the JAR files into the Oracle Data Integrator userlib directory.

For example:

Linux: $USER_HOME/.odi/oracledi/userlib directory.

Windows: C:\Users\<USERNAME>\AppData\Roaming\odi\oracledi\userlib directory
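For example, on Linux, assuming the required Hive client libraries are under /usr/lib/hive/lib (an illustrative path; copy whichever JAR files your jobs need), the copy might look like this:

$ cp /usr/lib/hive/lib/*.jar $USER_HOME/.odi/oracledi/userlib/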