3 Setting Up the Environment for Integrating Big Data

This chapter provides information on the steps you need to perform to set up the environment to integrate Big Data.

This chapter includes the following sections:

  • Configuring Big Data technologies using the Big Data Configurations Wizard

  • Creating and Initializing the Hadoop Data Server

  • Creating a Hadoop Physical Schema

  • Configuring the Oracle Data Integrator Agent to Execute Hadoop Jobs

  • Configuring Oracle Loader for Hadoop

  • Configuring Oracle Data Integrator to Connect to a Secure Cluster

  • Configuring Oracle Data Integrator Studio for Executing Hadoop Jobs on the Local Agent

Configuring Big Data technologies using the Big Data Configurations Wizard

The Big Data Configurations wizard provides a single entry point to set up multiple Hadoop technologies. You can quickly create data servers, physical schemas, and logical schemas, and set a context for Hadoop technologies such as Hadoop Distributed File System (HDFS), HBase, Oozie, Spark, Hive, and Pig.

The default metadata for different distributions, such as properties, host names, and port numbers, and the default values for environment variables are pre-populated for you. This helps you easily create the data servers, along with the physical and logical schemas, without requiring in-depth knowledge of these technologies.

After all the technologies are configured, you can validate the settings against the data servers to test the connection status.

Note:

If you do not want to use the Big Data Configurations wizard, you can set up the data servers for the Big Data technologies manually using the information mentioned in the subsequent sections.

To run the Big Data Configurations Wizard:

  1. In ODI Studio, select File and click New..., or select the Topology tab and then, from the Topology menu, select Big Data Configurations.

  2. In the New Gallery dialog, select Big Data Configurations and click OK.

    The Big Data Configurations wizard appears.

  3. In the General Settings panel of the wizard, specify the required options.

    See General Settings for more information.

  4. Click Next.

    A data server panel is displayed for each of the technologies that you selected in the General Settings panel.

  5. In the Hadoop panel of the wizard, do the following:
    • Specify the options required to create the Hadoop data server.

      See Hadoop Data Server Definition for more information.

    • In the Properties section, click the + icon to add any data server properties.

    • Select a logical schema, physical schema, and a context from the appropriate drop-down lists.

  6. Click Next.
  7. In the HDFS panel of the wizard, do the following:
    • Specify the options required to create the HDFS data server.

      See HDFS Data Server Definition for more information.

    • In the Properties section, click the + icon to add any data server properties.

    • Select a logical schema, physical schema, and a context from the appropriate drop-down lists.

  8. Click Next.
  9. In the HBase panel of the wizard, do the following:
    • Specify the options required to create the HBase data server.

      See HBase Data Server Definition for more information.

    • In the Properties section, click the + icon to add any data server properties.

    • Select a logical schema, physical schema, and a context from the appropriate drop-down lists.

  10. Click Next.
  11. In the Spark panel of the wizard, do the following:
    • Specify the options required to create the Spark data server.

      See Spark Data Server Definition for more information.

    • In the Properties section, click the + icon to add any data server properties.

    • Select a logical schema, physical schema, and a context from the appropriate drop-down lists.

  12. Click Next.
  13. In the Kafka panel of the wizard, do the following:
    • Specify the options required to create the Kafka data server.

      See Kafka Data Server Definition for more information.

    • In the Properties section, click the + icon to add any data server properties.

    • Select a logical schema, physical schema, and a context from the appropriate drop-down lists.

  14. Click Next.
  15. In the Pig panel of the wizard, do the following:
    • Specify the options required to create the Pig data server.

      See Pig Data Server Definition for more information.

    • In the Properties section, click the + icon to add any data server properties.

    • Select a logical schema, physical schema, and a context from the appropriate drop-down lists.

  16. Click Next.
  17. In the Hive panel of the wizard, do the following:
    • Specify the options required to create the Hive data server.

      See Hive Data Server Definition for more information.

    • In the Properties section, click the + icon to add any data server properties.

    • Select a logical schema, physical schema, and a context from the appropriate drop-down lists.

  18. Click Next.
  19. In the Oozie panel of the wizard, do the following:
    • Specify the options required to create the Oozie run-time engine.

      See Oozie Runtime Engine Definition for more information.

    • In the Properties section, review the data server properties that are listed.

      Note: You cannot add new properties or remove the listed properties. However, if required, you can change the value of a listed property.

      See Oozie Runtime Engine Properties for more information.

    • Select an existing logical agent and context, or enter new names for the logical agent and context.

  20. Click Next.
  21. In the Validate all settings panel, click Validate All Settings to initialize operations and validate the settings against the data servers to confirm the connection status.
  22. Click Finish.

General Settings

The following table describes the options that you need to set on the General Settings panel of the Big Data Configurations wizard.

Table 3-1 General Settings Options

Option Description

Prefix

Specify a prefix. This prefix is attached to the data server name, logical schema name, and physical schema name.

Distribution

Select a distribution, either Manual or Cloudera Distribution for Hadoop (CDH) <version>.

Base Directory

Specify the directory location where CDH is installed. This base directory is automatically populated in all other panels of the wizard.

Note: This option appears only if the distribution is other than Manual.

Distribution Type

Select a distribution type, either Normal or Kerberized.

Technologies

Select the technologies that you want to configure.

Note: Data server creation panels are displayed only for the selected technologies.

HDFS Data Server Definition

The following table describes the options that you must specify to create an HDFS data server.

Note:

Only the fields required or specific for defining an HDFS data server are described.

Table 3-2 HDFS Data Server Definition

Option Description

Name

Type a name for the data server. This name appears in Oracle Data Integrator.

User/Password

HDFS currently does not implement User/Password security. Leave this option blank.

Hadoop Data Server

Hadoop data server that you want to associate with the HDFS data server.

Additional Classpath

Specify additional jar files to add to the classpath, if needed.

HBase Data Server Definition

The following table describes the options that you must specify to create an HBase data server.

Note: Only the fields required or specific for defining an HBase data server are described.

Table 3-3 HBase Data Server Definition

Option Description

Name

Type a name for the data server. This name appears in Oracle Data Integrator.

HBase Quorum

ZooKeeper quorum address, as specified in hbase-site.xml. For example, localhost:2181.

User/Password

HBase currently does not implement User/Password security. Leave these fields blank.

Hadoop Data Server

Hadoop data server that you want to associate with the HBase data server.

Additional Classpath

Specify any additional classes/jar files to be added.

The following classpath entries are built from the Base Directory value:

  • /usr/lib/hbase/*

  • /usr/lib/hbase/lib

Kafka Data Server Definition

The following table describes the options that you must specify to create a Kafka data server.

Note:

Only the fields required or specific for defining a Kafka data server are described.

Table 3-4 Kafka Data Server Definition

Option Description

Name

Type a name for the data server.

User/Password

User name with its password.

Hadoop Data Server

Hadoop data server that you want to associate with the Kafka data server.

If Kafka is not running on the Hadoop server, then there is no need to specify a Hadoop Data Server. This option is useful when Kafka runs on its own server.

Additional Classpath

Specify any additional classes/jar files to be added.

The following classpath entries are built from the Base Directory value:

  • /opt/cloudera/parcels/CDH/lib/kafka/libs/*

If required, you can add more additional classpaths.

If Kafka is not running on the Hadoop server, then specify the absolute path of Kafka libraries in this field.

Note:

This field appears only when you are creating the Kafka data server using the Big Data Configurations wizard.

Kafka Data Server Properties

The following table describes the Kafka data server properties that you need to add on the Properties tab when creating a new Kafka data server.

Table 3-5 Kafka Data Server Properties

Key Value

metadata.broker.list

This is a comma-separated list of Kafka metadata brokers. Each broker is defined by hostname:port. The list of brokers can be found in the server.properties file, typically located in /etc/kafka/conf (see the lookup sketch after this table).

oracle.odi.prefer.dataserver.packages

Specifies the packages whose classes are loaded preferentially from the data server's classpath, so that topics and messages can be retrieved from the Kafka server. Set the value to scala, kafka, oracle.odi.kafka.client.api.impl, org.apache.log4j.

security.protocol

Protocol used to communicate with brokers. Valid values are: PLAINTEXT, SSL, SASL_PLAINTEXT, and SASL_SSL.

zookeeper.connect

Specifies the ZooKeeper connection string in the form hostname:port, where hostname and port are the host and port of a ZooKeeper server. To allow connecting through other ZooKeeper nodes when a ZooKeeper machine is down, you can also specify multiple hosts in the form hostname1:port1,hostname2:port2,hostname3:port3.
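The broker list for metadata.broker.list can usually be read straight from the Kafka server configuration. The following is a minimal lookup sketch, assuming the /etc/kafka/conf location mentioned above; the exact property names (port, listeners, host.name) vary by Kafka version:

  $ grep -E 'listeners|port|host.name' /etc/kafka/conf/server.properties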

Creating and Initializing the Hadoop Data Server

Configure the Hadoop data server definition and properties to create and initialize the Hadoop data server.

To create and initialize the Hadoop data server:

  1. Click the Topology tab.
  2. In the Physical Architecture tree, under Technologies, right-click Hadoop and then click New Data Server.
  3. In the Definition tab, specify the details of the Hadoop data server.

    See Hadoop Data Server Definition for more information.

  4. In the Properties tab, specify the properties for the Hadoop data server.

    See Hadoop Data Server Properties for more information.

  5. Click Initialize to initialize the Hadoop data server.

    Initializing the Hadoop data server creates the structure of the ODI Master repository and Work repository in HDFS (see the verification sketch after these steps).

  6. Click Test Connection to test the connection to the Hadoop data server.
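After initialization, you can confirm that the repository structure was created under the ODI HDFS root. This is a minimal sketch, assuming the default root path of /user/<login_username>/odi_home from the data server definition and a Hadoop client on the local machine:

  $ hdfs dfs -ls /user/<login_username>/odi_home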

Hadoop Data Server Definition

The following table describes the fields that you must specify on the Definition tab when creating a new Hadoop data server.

Note: Only the fields required or specific for defining a Hadoop data server are described.

Table 3-6 Hadoop Data Server Definition

Field Description

Name

Name of the data server that appears in Oracle Data Integrator.

Data Server

Physical name of the data server.

User/Password

Hadoop user with its password.

If password is not provided, only simple authentication is performed using the username on HDFS and Oozie.

Authentication Method

Select one of the following authentication methods:

  • Simple Username Authentication (unsecured)

  • Kerberos Principal Username/Password (secured)

  • Kerberos Credential Ticket Cache (secured)

Note:

The following link helps determine if the Hadoop cluster is secured:

https://www.cloudera.com/documentation/cdh/5-0-x/CDH5-Security-Guide/cdh5sg_hadoop_security_enable.html

HDFS Node Name URI

URI of the HDFS NameNode (a connectivity check is sketched after this table). For example:

hdfs://localhost:8020

Resource Manager/Job Tracker URI

URI of the resource manager or the job tracker. For example:

localhost:8032

ODI HDFS Root

Path of the ODI HDFS root directory. For example:

/user/<login_username>/odi_home

Additional Class Path

Specify additional classpath entries. Add the following:

  • /usr/lib/hadoop/*

  • /usr/lib/hadoop/client/*

  • /usr/lib/hadoop/lib/*

  • /usr/lib/hadoop-hdfs/*

  • /usr/lib/hadoop-mapreduce/*

  • /usr/lib/hadoop-yarn/*

  • /usr/lib/hbase/lib/*

  • /usr/lib/hive/lib/*

  • /usr/lib/oozie/lib/*

  • /etc/hadoop/conf/

  • /etc/hbase/conf

  • /etc/hive/conf

  • /opt/oracle/orahdfs/jlib/*
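With the definition fields above filled in, you can sanity-check the NameNode and resource manager addresses from the agent machine before clicking Test Connection. This is a minimal sketch, assuming the example URIs above and a locally installed Hadoop client whose configuration in /etc/hadoop/conf points at the cluster:

  $ hdfs dfs -ls hdfs://localhost:8020/

  $ yarn node -list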

Hadoop Data Server Properties

The following table describes the properties that you can configure in the Properties tab when defining a new Hadoop data server.

Note:

By default, only the oracle.odi.prefer.dataserver.packages property is displayed. Click the + icon to add the other properties manually.

These properties can be inherited by other Hadoop technologies, such as Hive or HDFS. To inherit these properties, you must select the configured Hadoop data server when creating data servers for other Hadoop technologies.

Table 3-7 Hadoop Data Server Properties Mandatory for Hadoop and Hive

Property Group Property Description/Value

General

HADOOP_HOME

Location of the Hadoop directory. For example, /usr/lib/hadoop

User Defined

HADOOP_CONF

Location of Hadoop configuration files such as core-default.xml, core-site.xml, and hdfs-site.xml. For example, /home/shared/hadoop-conf

Hive

HIVE_HOME

Location of the Hive directory. For example, /usr/lib/hive

User Defined

HIVE_CONF

Location of Hive configuration files such as hive-site.xml. For example, /home/shared/hive-conf

General

HADOOP_CLASSPATH

$HIVE_HOME/lib/hive-metastore-*.jar:$HIVE_HOME/lib/libthrift-*.jar:$HIVE_HOME/lib/libfb*.jar:$HIVE_HOME/lib/hive-exec-*.jar:$HIVE_CONF

General

HADOOP_CLIENT_OPTS

-Dlog4j.debug -Dhadoop.root.logger=INFO,console -Dlog4j.configuration=file:/etc/hadoop/conf.cloudera.yarn/log4j.properties

Hive

HIVE_SESSION_JARS

$HIVE_HOME/lib/hive-contrib-*.jar:<ODI library directory>/wlhive.jar

  • The actual path of wlhive.jar can be determined under the ODI installation home.

  • Include other JAR files as required, such as custom SerDes JAR files. These JAR files are added to every Hive JDBC session and thus are added to every Hive MapReduce job.

  • The list of JARs is separated by ":"; wildcards in file names must not evaluate to more than one file (see the check sketched after this list).

  • Follow the steps for Hadoop security models, such as Apache Sentry, to allow the Hive ADD JAR call used inside ODI Hive KMs:
    • Define the environment variable HIVE_SESSION_JARS as empty.

    • Add all required jars for Hive in the global Hive configuration, hive-site.xml.
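To confirm that each wildcard in HIVE_SESSION_JARS resolves to exactly one file, you can expand it in a shell first. This is a minimal sketch, assuming HIVE_HOME is set as in the table above:

  $ ls $HIVE_HOME/lib/hive-contrib-*.jar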

Table 3-8 Hadoop Data Server Properties Mandatory for HBase (In addition to base Hadoop and Hive Properties)

Property Group Property Description/Value

HBase

HBASE_HOME

Location of the HBase directory. For example, /usr/lib/hbase

General

HADOOP_CLASSPATH

$HBASE_HOME/lib/hbase-*.jar:$HIVE_HOME/lib/hive-hbase-handler*.jar:$HBASE_HOME/hbase.jar

Hive

HIVE_SESSION_JARS

$HBASE_HOME/hbase.jar:$HBASE_HOME/lib/hbase-sep-api-*.jar:$HBASE_HOME/lib/hbase-sep-impl-*hbase*.jar:$HBASE_HOME/lib/hbase-sep-impl-common-*.jar:$HBASE_HOME/lib/hbase-sep-tools-*.jar:$HIVE_HOME/lib/hive-hbase-handler-*.jar

Note:

Follow the steps for Hadoop Security models, such as Apache Sentry, to allow the Hive ADD JAR call used inside ODI Hive KMs:
  • Define the environment variable HIVE_SESSION_JARS as empty.

  • Add all required jars for Hive in the global Hive configuration hive-site.xml.

Table 3-9 Hadoop Data Server Properties Mandatory for Oracle Loader for Hadoop (In addition to base Hadoop and Hive properties)

Property Group Property Description/Value

OLH/OSCH

OLH_HOME

Location of OLH installation. For example, /u01/connectors/olh

OLH/OSCH

OLH_FILES

/usr/lib/hive/lib/hive-contrib-1.1.0-cdh5.5.1.jar

OLH/OSCH

ODCH_HOME

Location of OSCH installation. For example, /u01/connectors/osch

General

HADOOP_CLASSPATH

$OLH_HOME/jlib/*:$OSCH_HOME/jlib/*

OLH/OSCH

OLH_JARS

Comma-separated list of all JAR files required for custom input formats, Hive, Hive SerDes, and so forth, used by Oracle Loader for Hadoop. All file names have to be expanded without wildcards (see the expansion sketch after this table).

For example:

$HIVE_HOME/lib/hive-metastore-0.10.0-cdh4.5.0.jar,$HIVE_HOME/lib/libthrift-0.9.0-cdh4-1.jar,$HIVE_HOME/lib/libfb303-0.9.0.jar

OLH/OSCH

OLH_SHAREDLIBS (deprecated)

$OLH_HOME/lib/libolh12.so,$OLH_HOME/lib/libclntsh.so.12.1,$OLH_HOME/lib/libnnz12.so,$OLH_HOME/lib/libociei.so,$OLH_HOME/lib/libclntshcore.so.12.1,$OLH_HOME/lib/libons.so
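To produce the expanded, comma-separated OLH_JARS value, you can let the shell resolve the wildcards and join the results. This is a minimal sketch, assuming HIVE_HOME is set and each pattern matches the intended files; the patterns shown are illustrative:

  $ ls $HIVE_HOME/lib/hive-metastore-*.jar $HIVE_HOME/lib/libthrift-*.jar $HIVE_HOME/lib/libfb303-*.jar | paste -sd, -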

Table 3-10 Hadoop Data Server Properties Mandatory for SQOOP (In addition to base Hadoop and Hive properties)

Property Group Property Description/Value

SQOOP

SQOOP_HOME

Location of the Sqoop directory. For example, /usr/lib/sqoop

SQOOP

SQOOP_LIBJARS

Location of the SQOOP library jars. For example, /usr/lib/hive/lib/hive-contrib.jar


Creating a Hadoop Physical Schema

Create a Hadoop physical schema and an associated logical schema using the standard procedures.

Create a Hadoop physical schema using the standard procedure, as described in the Creating a Physical Schema section in Administering Oracle Data Integrator.

Then create a logical schema for this physical schema using the standard procedure, as described in the Creating a Logical Schema section in Administering Oracle Data Integrator, and associate it with the physical schema in a given context.

Configuring the Oracle Data Integrator Agent to Execute Hadoop Jobs

You must configure the Oracle Data Integrator agent to execute Hadoop jobs.

For information on creating a physical agent, see the Creating a Physical Agent section in Administering Oracle Data Integrator.

To configure the Oracle Data Integrator agent:

  1. If the ODI agent is not installed on one of the Hadoop cluster nodes, you must install the Hadoop client libraries on the agent computer.

    For instructions on setting up a remote Hadoop client in Oracle Big Data Appliance, see the Providing Remote Client Access to CDH section in the Oracle Big Data Appliance Software User's Guide.

  2. Install Hive on your Oracle Data Integrator agent computer.
  3. Install SQOOP on your Oracle Data Integrator agent computer (see the verification sketch after these steps).
  4. Set the base properties for Hadoop and Hive on your ODI agent computer.

    These properties must be added as Hadoop data server properties. For more information, see Hadoop Data Server Properties.

  5. If you plan to use HBase features, set the HBase properties on your ODI agent computer. You must set these properties in addition to the base Hadoop and Hive properties.

    These properties must be added as Hadoop data server properties. For more information, see Hadoop Data Server Properties.
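After installing the client tools in steps 2 and 3, you can quickly verify that they are available on the agent computer's path. This is a minimal sketch; the exact version output depends on your distribution:

  $ hive --version
  $ sqoop version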

Configuring Oracle Loader for Hadoop

If you want to use Oracle Loader for Hadoop, you must install and configure Oracle Loader for Hadoop on your Oracle Data Integrator agent computer.

Oracle Loader for Hadoop is an efficient and high-performance loader for fast loading of data from a Hadoop cluster into a table in an Oracle database.

To install and configure Oracle Loader for Hadoop:

  1. Install Oracle Loader for Hadoop on your Oracle Data Integrator agent computer.

    See the Installing Oracle Loader for Hadoop section in Oracle Big Data Connectors User's Guide.

  2. To use Oracle SQL Connector for HDFS (OLH_OUTPUT_MODE=DP_OSCH or OSCH), you must first install it.

    See the Oracle SQL Connector for Hadoop Distributed File System Setup section in Oracle Big Data Connectors User's Guide.

  3. Set the properties for Oracle Loader for Hadoop on your ODI agent computer. You must set these properties in addition to the base Hadoop and Hive properties.

    These properties must be added as Hadoop data server properties. For more information, see Hadoop Data Server Properties.

Configuring Oracle Data Integrator to Connect to a Secure Cluster

To run the Oracle Data Integrator agent against a Hadoop cluster that is protected by Kerberos authentication, you must perform the following configuration steps.

To use a Kerberos-secured cluster:

  1. Log in to the node of the Oracle Big Data Appliance where the Oracle Data Integrator agent runs.
  2. Set the environment variables by using the following commands. The user name in the following example is oracle. Substitute the appropriate values for your appliance:

    $ export KRB5CCNAME=Kerberos-ticket-cache-directory

    $ export KRB5_CONFIG=Kerberos-configuration-file

    $ export HADOOP_OPTS="$HADOOP_OPTS -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl -Djava.security.krb5.conf=Kerberos-configuration-file"

    In this example, the configuration files are named krb5* and are located in /tmp/oracle_krb/:

    $ export KRB5CCNAME=/tmp/oracle_krb/krb5cc_1000

    $ export KRB5_CONFIG=/tmp/oracle_krb/krb5.conf

    $ export HADOOP_OPTS="$HADOOP_OPTS -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl -Djava.security.krb5.conf=/tmp/oracle_krb/krb5.conf"

  3. Generate a new Kerberos ticket for the oracle user. Use the following command, replacing realm with the actual Kerberos realm name.

    $ kinit oracle@realm

  4. ODI Studio: To set VM options for ODI Studio, add AddVMOption entries in odi.conf, located in the same folder as odi.sh. For example, to specify the Kerberos configuration file location:

    AddVMOption -Djava.security.krb5.conf=/etc/krb5.conf
    AddVMOption -Dsun.security.krb5.debug=true
    AddVMOption -Dsun.security.krb5.principal=odidemo

  5. Redefine the JDBC connection URL, using syntax like the following:

    Table 3-11 Kerberos Configuration for Data Servers

    Technology Configuration and Example

    Hadoop

    No specific configuration is needed; the general settings are sufficient.

    Hive

    Configure the DataDirect JDBC driver login file, $MW_HOME/oracle_common/modules/datadirect/JDBCDriverLogin.conf.

    Example of configuration file:

    JDBC_DRIVER_01 {
    com.sun.security.auth.module.Krb5LoginModule required
    debug=true
    useTicketCache=true
    ticketCache="/tmp/krb5cc_500"
    doNotPrompt=true
    ;
    };

    Example of Hive URL:

    jdbc:weblogic:hive://<hostname>:10000;DatabaseName=default;AuthenticationMethod=kerberos;ServicePrincipalName=<username>/<fully.qualified.domain.name>@<YOUR-REALM>.COM

    HBase

    Set the following environment variables:

    export HBASE_HOME=/scratch/fully.qualified.domain.name/etc/hbase/conf
    export HBASE_CONF_DIR=$HBASE_HOME/conf
    export HBASE_OPTS="-Djava.security.auth.login.config=$HBASE_CONF_DIR/hbase-client.jaas"
    export HBASE_MASTER_OPTS="-Djava.security.auth.login.config=$HBASE_CONF_DIR/hbase-server.jaas"

    ODI Studio configuration:

    AddVMOption -Djava.security.auth.login.config=$HBASE_CONF_DIR/hbase-client.jaas

    Example of HBase configuration file (hbase-client.jaas):

    Client {
    com.sun.security.auth.module.Krb5LoginModule required
    useKeyTab=false
    useTicketCache=true;
    };

    Spark

    Spark Kerberos configuration is done through spark-submit parameters:

    --principal // principal name
    --keytab    // location of the keytab file

    Example of spark-submit command:

    spark-submit --master yarn --py-files /tmp/pyspark_ext.py --executor-memory 1G --driver-memory 512M --executor-cores 1 --driver-cores 1 --num-executors 2 --principal fully.qualified.domain.name@YOUR-REALM.com --keytab /tmp/fully.qualified.domain.name.tab --queue default /tmp/New_Mapping_Physical.py

    Kafka

    Kafka Kerberos configuration is done through the kafka-client.jaas file, which is placed in the Kafka configuration folder.

    Example of Kafka configuration file:

    KafkaClient {
     com.sun.security.auth.module.Krb5LoginModule required
     useKeyTab=false
     useTicketCache=true
     ticketCache="/tmp/krb5cc_1500"
     serviceName="kafka";
    };

    The location of the Kafka configuration file is set as an ODI Studio VM option:

    AddVMOption -Djava.security.auth.login.config="/etc/kafka-jaas.conf"

    Pig/Oozie

    Pig and Oozie extend the Kerberos configuration of the linked Hadoop data server and do not require specific configuration.

    For more information on these properties and settings, see "HiveServer2 Security Configuration" in the CDH5 Security Guide at the following URL:

    https://www.cloudera.com/documentation/cdh/5-0-x/CDH5-Security-Guide/cdh5sg_hiveserver2_security.html

  6. Renew the Kerberos ticket for the oracle user on a regular basis to prevent disruptions in service (see the renewal sketch after these steps).
  7. Download the unlimited strength JCE security jars.

    For instructions about managing Kerberos on Oracle Big Data Appliance, see the About Accessing a Kerberos-Secured Cluster section in the Oracle Big Data Appliance Software User's Guide.
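For the ticket renewal in step 6, a cron entry can refresh the cache automatically. This is a minimal sketch, assuming the ticket is renewable and uses the cache location from step 2:

    0 */8 * * * KRB5CCNAME=/tmp/oracle_krb/krb5cc_1000 kinit -R

You can check the ticket's expiry time and renewal limit at any point with klist.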

Configuring Oracle Data Integrator Studio for Executing Hadoop Jobs on the Local Agent

Perform the following configuration steps to execute Hadoop jobs on the local agent of Oracle Data Integrator Studio.

For executing Hadoop jobs on the local agent of an Oracle Data Integrator Studio installation, follow the configuration steps in Configuring the Oracle Data Integrator Agent to Execute Hadoop Jobs with the following change:

Copy the following Hadoop client jar files to the local machine.

/usr/lib/hadoop/*.jar
/usr/lib/hadoop/lib/*.jar 
/usr/lib/hadoop/client/*.jar 
/usr/lib/hadoop-hdfs/*.jar 
/usr/lib/hadoop-mapreduce/*.jar 
/usr/lib/hadoop-yarn/*.jar 
/usr/lib/oozie/lib/*.jar 
/usr/lib/hive/*.jar 
/usr/lib/hive/lib/*.jar 
/usr/lib/hbase/*.jar 
/usr/lib/hbase/lib/*.jar

Add the above classpath entries to the additional_path.txt file under the userlib directory, as sketched after the examples below.

For example:

Linux: $USER_HOME/.odi/oracledi/userlib directory.

Windows: C:\Users\<USERNAME>\AppData\Roaming\odi\oracledi\userlib directory