1 Getting Started with Oracle Big Data Connectors

This chapter describes the Oracle Big Data Connectors and provides installation instructions.

This chapter contains the following sections:

About Oracle Big Data Connectors

Oracle Big Data Connectors facilitate access to data stored in an Apache Hadoop cluster. They can be licensed for use on either Oracle Big Data Appliance or a Hadoop cluster running on commodity hardware.

These are the connectors:

  • Oracle SQL Connector for Hadoop Distributed File System: Enables an Oracle external table to access data stored in Hadoop Distributed File System (HDFS) files or a table in Apache Hive. The data can remain in HDFS or the Hive table, or it can be loaded into an Oracle database.

  • Oracle Loader for Apache Hadoop: Provides an efficient and high-performance loader for fast movement of data from a Hadoop cluster into a table in an Oracle database. Oracle Loader for Hadoop prepartitions the data if necessary and transforms it into a database-ready format. It optionally sorts records by primary key or user-defined columns before loading the data or creating output files.

  • Oracle XQuery for Apache Hadoop: Runs transformations expressed in the XQuery language by translating them into a series of MapReduce jobs, which are executed in parallel on the Hadoop cluster. The input data can be located in a file system accessible through the Hadoop File System API, such as the Hadoop Distributed File System (HDFS), or stored in Oracle NoSQL Database. Oracle XQuery for Hadoop can write the transformation results to HDFS, Oracle NoSQL Database, Apache Solr, or Oracle Database. Additional XML processing capabilities are available through XML Extensions for Hive.

  • Oracle Shell for Hadoop Loaders: A helper shell that provides a simple-to-use command line interface to Oracle Loader for Hadoop, Oracle SQL Connector for HDFS, and Copy to Hadoop (a feature of Big Data SQL). It has basic shell features such as command line recall, history, inheriting environment variables from the parent process, setting new or existing environment variables, and performing environment variable substitution in the command line.

  • Oracle R Advanced Analytics for Apache Hadoop: Provides a general computation framework, in which you can use the R language to write your custom logic as mappers or reducers. A collection of R packages provides predictive analytic techniques that run as MapReduce jobs. The code executes in a distributed, parallel manner using the available compute and storage resources on the Hadoop cluster. Oracle R Advanced Analytics for Hadoop includes interfaces to work with Apache Hive tables, the Apache Hadoop compute infrastructure, the local R environment, and Oracle database tables.

  • Oracle Data Integrator: Extracts, loads, and transforms data from sources such as files and databases into Hadoop and from Hadoop into Oracle or third-party databases. Oracle Data Integrator provides a graphical user interface to utilize the native Hadoop tools and transformation engines such as Hive, HBase, Sqoop, Oracle Loader for Hadoop, and Oracle SQL Connector for Hadoop Distributed File System.

  • Oracle Datasource for Hadoop: Provides direct, fast, parallel, secure, and consistent access to master data in Oracle Database using Hive SQL and Spark SQL, as well as Hadoop APIs that support SerDes, HCatalog, InputFormat, and StorageHandler.

Individual connectors may require that software components be installed in Oracle Database and either the Hadoop cluster or an external system set up as a Hadoop client for the cluster. Users may also need additional access privileges in Oracle Database.

See Also:

My Oracle Support Information Center: Big Data Connectors (ID 1487399.2) and its related information centers.

Big Data Concepts and Technologies

Enterprises are seeing large amounts of data coming from multiple sources. Click-stream data in web logs, GPS tracking information, data from retail operations, sensor data, and multimedia streams are just a few examples of vast amounts of data that can be of tremendous value to an enterprise if analyzed. The unstructured and semi-structured information provided by raw data feeds is of little value in and of itself. The data must be processed to extract information of real value, which can then be stored and managed in the database. Analytics of this data along with the structured data in the database can provide new insights into the data and lead to substantial business benefits.

What is MapReduce?

MapReduce is a parallel programming model for processing data on a distributed system. It can process vast amounts of data quickly and can scale linearly. It is particularly effective as a mechanism for batch processing of unstructured and semi-structured data. MapReduce abstracts lower level operations into computations over a set of keys and values.

A simplified definition of a MapReduce job is the successive alternation of two phases, the map phase and the reduce phase. Each map phase applies a transform function over each record in the input data to produce a set of records expressed as key-value pairs. The output from the map phase is input to the reduce phase. In the reduce phase, the map output records are sorted into key-value sets so that all records in a set have the same key value. A reducer function is applied to all the records in a set, and a set of output records is produced as key-value pairs. The map phase is logically run in parallel over each record, while the reduce phase is run in parallel over all key values.

Note:

Oracle Big Data Connectors 3.0 and later supports the Yet Another Resource Negotiator (YARN) implementation of MapReduce.
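The following is a minimal word-count illustration of the model, written as a Hadoop Streaming job so that ordinary shell commands serve as the mapper and reducer. The streaming JAR path and the HDFS directories are assumptions for illustration only; adjust them for your distribution.

# Word count with Hadoop Streaming (paths are illustrative assumptions).
# The mapper emits one word per line; the framework sorts the map output by key,
# so a simple "uniq -c" in the reducer produces a count for each distinct word.
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input /user/oracle/input_text \
    -output /user/oracle/wordcount_out \
    -mapper 'tr -s "[:space:]" "\n"' \
    -reducer 'uniq -c'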

What is Apache Hadoop?

Apache Hadoop is the software framework for the development and deployment of data processing jobs based on the MapReduce programming model. At the core, Hadoop provides a reliable shared storage and analysis system. Analysis is provided by MapReduce. Storage is provided by the Hadoop Distributed File System (HDFS), a shared storage system designed for MapReduce jobs.

The Hadoop ecosystem includes several other projects including Apache Avro, a data serialization system that is used by Oracle Loader for Hadoop.

Cloudera's Distribution including Apache Hadoop (CDH) is installed on Oracle Big Data Appliance. You can use Oracle Big Data Connectors on a Hadoop cluster running CDH or the equivalent Apache Hadoop components, as described in the setup instructions in this chapter.

See Also:

  • For conceptual information about the Hadoop technologies, the following third-party publication:

    Hadoop: The Definitive Guide, Third Edition by Tom White (O'Reilly Media Inc., 2012, ISBN: 978-1449311520).

  • For information about Cloudera's Distribution including Apache Hadoop (CDH5), visit the Cloudera website.

  • For information about Apache Hadoop, visit the Apache Hadoop website.

Download and Install Oracle Big Data Connectors

You can download Oracle Big Data Connectors from Oracle Technology Network or Oracle Software Delivery Cloud. Both sites are cross-browser compatible.

Note:

Oracle Big Data Appliance customers do not need to download Oracle Big Data Connectors from an external source. Oracle Big Data Connectors are included in the Oracle Big Data Appliance deployment bundle. See Enabling and Disabling Oracle Big Data Connectors in the Oracle Big Data Appliance Owner’s Guide. All other customers should download the software as described here.

To download from Oracle Technology Network:

  1. Go to

    http://www.oracle.com/technetwork/bdc/big-data-connectors/downloads/index.html

  2. Click the name of each connector to download a zip file containing the installation files.

To download from Oracle Software Delivery Cloud:

  1. Go to https://edelivery.oracle.com/
  2. Sign in and accept the Export Restrictions.
  3. Type the product name in the Product field and select the platform:

    Product: Oracle Big Data Connectors

    Platform: Linux x86-64

  4. When Oracle Big Data Connectors appears in the Product List, click Continue. The most recent major release of Oracle Big Data Connectors will appear as the selected option.
  5. To choose a different release, click Select Alternate Release and choose another package from the list. Click Continue.
  6. Read the Terms and Conditions. Click the checkbox if you accept them, then click Continue.
  7. On the download site, select the zip files for individual Oracle Big Data Connectors or click Download All.

Each download package includes a README file with installation instructions and a set of examples.

Certified Hadoop Platforms

All Oracle Big Data Connectors run on both CDH (Cloudera Distribution Including Apache Hadoop) and HDP (Hortonworks Data Platform). A few Big Data Connectors also run on CDP (Cloudera Data Platform).

Note:

  • Big Data Connectors 5.x is required to work with CDH 6.x. All components with the exception of Oracle Data Integrator are certified to work with CDH 6.x.
  • Oracle Loader for Hadoop 5.0.1 to 5.1.2 and Oracle SQL Connector for HDFS 5.0.1 to 5.1.2 are certified to work with Hortonworks 3.0.1.
  • Oracle Loader for Hadoop 5.1.2 and Oracle SQL Connector for HDFS 5.1.2 are certified to work with Cloudera Data Platform (CDP) version 7.1.6.

Version 5.1.2 of Oracle Loader for Hadoop and Oracle SQL Connector for HDFS are available from My Oracle Support. The 5.1.2 version of Oracle Shell for Hadoop Loaders, the command-line UI for these connectors, should also be downloaded from My Oracle Support. For details on downloading the 5.1.2 versions of these connectors, please refer to note 2823564.1 on My Oracle Support.

Secure Connection to Oracle Database

Describes using JDBC SSL and Java KeyStore to securely connect to Oracle Database.

Using JDBC SSL

Follow these steps to connect to the Oracle Database using JDBC SSL:
  1. Download the SSL wallet.zip file. For example, see Download Client Credentials for information on downloading client credentials for Oracle Autonomous Data Warehouse.
  2. Copy the wallet.zip file to a directory location on a node in the Hadoop cluster or Hadoop edge node (from where you will run Oracle Loader for Hadoop or Oracle SQL Connector for HDFS or Oracle Shell for Hadoop Loaders). This is your TNS_ADMIN directory on the Hadoop node. For example: /home/oracle/SSL_wallet.
  3. Unzip the wallet.zip file.
  4. Set the permissions of the file to be accessible to the OS user who will run Oracle Loader for Hadoop or Oracle SQL Connector for HDFS or Oracle Shell for Hadoop Loaders.
  5. In the sqlnet.ora file, update the WALLET_LOCATION entry so that the DIRECTORY points to the directory location of the SSL wallet you created in Step 2.
    For example:
    WALLET_LOCATION=
      (SOURCE=
        (METHOD=FILE)
        (METHOD_DATA=
          (DIRECTORY=/home/oracle/SSL_wallet/)))
  6. In the tnsnames.ora file, find the name of the TNS entry using the TCPS protocol. For example:
    inst1_ssl = (DESCRIPTION=
      (ADDRESS=
        (PROTOCOL=tcps)
        (HOST=<hostname>)
        (PORT=1521))
      (CONNECT_DATA=(SERVICE_NAME=<service_name>)))
  7. Create a connection.properties file in the TNS_ADMIN directory and add the following code:
    javax.net.ssl.trustStore=<directory_location>/cwallet.sso 
    javax.net.ssl.trustStoreType=SSO 
    javax.net.ssl.keyStore=<directory_location>/cwallet.sso
    javax.net.ssl.keyStoreType=SSO
    For example:
    javax.net.ssl.trustStore=/home/oracle/SSL_wallet/cwallet.sso
    javax.net.ssl.trustStoreType=SSO
    javax.net.ssl.keyStore=/home/oracle/SSL_wallet/cwallet.sso 
    javax.net.ssl.keyStoreType=SSO
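After completing these steps, you can optionally verify the configuration before running a connector job. The following is a minimal sketch; it assumes SQL*Plus is installed on the node and that <db_user> is a valid database account.

# Point TNS_ADMIN at the wallet directory from Step 2, then connect through the
# TCPS entry found in Step 6 (inst1_ssl is the example entry name used above).
export TNS_ADMIN=/home/oracle/SSL_wallet
sqlplus <db_user>@inst1_ssl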

Using Secure External Java KeyStore and Hadoop credential command

Oracle Shell for Hadoop Loaders (OHSH) can use a secure external Java KeyStore (JKS) file to store database passwords. The Hadoop credential command can be used to create the Java KeyStore file.

  1. Use the following syntax to create the keyStore file using the Hadoop credential command:
    $ hadoop credential create <password alias> \
              -value <password> \
              -provider jceks://file/<directory location where the keystore file will be stored>/<keystorefilename>.jceks
    For example:
    $ hadoop credential create oracle_passwd \
               -value password \
               -provider jceks://file/home/oracle/passwd.jceks
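You can check that the alias was stored by listing the contents of the provider. The stored alias can then be supplied to tools that support Hadoop credential providers instead of a clear-text password.

# Verify the alias exists in the keystore (output format varies by Hadoop version).
$ hadoop credential list -provider jceks://file/home/oracle/passwd.jceks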

Oracle SQL Connector for Hadoop Distributed File System Setup

Oracle recommends running Oracle SQL Connector for Hadoop Distributed File System (HDFS) on a Hadoop client or edge node. In addition, you must install and configure Oracle SQL Connector for HDFS on the system where Oracle Database runs.

On Oracle Big Data Appliance, there is an option to include Oracle Big Data Connectors in the installation. If this was selected in the Configuration Generation Utility, then Oracle SQL Connector for HDFS is already available on the appliance. (See Chapter 4, Using Oracle Big Data Appliance Configuration Generation Utility, in the Big Data Appliance Owner’s Guide.)

This section contains the following topics:

Software Requirements

Oracle SQL Connector for HDFS requires the following software:

Hadoop Requirements:

  • A certified release of either CDH (Cloudera Distribution Including Apache Hadoop) or HDP (Hortonworks Data Platform).

  • Java Development Kit (JDK). Consult the distributor of your Hadoop software (Cloudera or Hortonworks) for the recommended version.

Oracle Big Data Appliance already meets these requirements. If you are using CDH or HDP on a commodity server platform, check to ensure that the system meets them.

Oracle Database System and Hadoop Client System Requirements:

  • A version of Oracle Database that is currently supported by Oracle.

  • The same version of Hadoop that is installed on your Hadoop cluster: CDH 5.7 or higher, CDH6, or Hortonworks Data Platform 2.4.0, 2.5.0, 2.6.1, or 3.0.1.

    If you have a secure Hadoop cluster configured with Kerberos, then the Hadoop client on the database system must be set up to access a secure cluster.

  • The same version of the JDK that is installed on your Hadoop cluster.

Note:

Oracle SQL Connector for HDFS requires a Hadoop client on the OS platform of the database system. This is straightforward for Linux systems. Platforms other than Linux require a tarball installation of the Hadoop client. For details, refer to the Oracle blog post Connecting Hadoop with Oracle and the related documents on My Oracle Support.

Install and Configure a Hadoop Client on the Oracle Database System

Oracle SQL Connector for HDFS requires a Hadoop client on the Oracle Database System. The Hadoop installation can be minimally configured for Hadoop client use only. The full configuration of Hadoop is not needed.

Note:

Even if there is a complete Hadoop installation on the Oracle Database system, do not start Hadoop on this system at any time. If Hadoop is running locally, then Oracle SQL Connector for HDFS attempts to connect to it instead of to the external Hadoop cluster.

For Oracle RAC systems including Oracle Exadata Database Machine, you must install and configure the Hadoop client using identical paths on all systems running Oracle instances.

Installing and configuring a Hadoop client on the Oracle Database system involves several steps, including downloading and installing components such as the Java JDK, the Hadoop software, the Hadoop client configuration files, and the Kerberos configuration and keytab files, and then testing connectivity to the Hadoop cluster. See the following document on My Oracle Support for details: “How to Install and Configure a Hadoop Client on the Oracle Database System for Oracle SQL Connector for HDFS (OSCH)” (My Oracle Support note 2768424.1).

Install and Configure Oracle SQL Connector for HDFS

Follow this procedure to install Oracle SQL Connector for HDFS on the Oracle Database system.

For Oracle RAC systems, including Oracle Exadata Database Machine, you must install and configure Oracle SQL Connector for HDFS using identical paths on all systems running Oracle instances.

In addition to this required installation on the database system, you can also install Oracle SQL Connector for HDFS on any system configured as a compatible Hadoop client. This will give you the option to create Oracle Database external tables from that node.

To install Oracle SQL Connector for HDFS on the Oracle Database system:

  1. Download the zip file to a directory on the system where Oracle Database runs.

  2. Unpack the content of oraosch-<version>.zip.

    $ unzip oraosch-<version>.zip
    Archive:  oraosch-<version>.zip
     extracting: orahdfs-<version>.zip
      inflating: README.txt
  3. Unpack orahdfs-<version>.zip into a permanent directory:

    $ unzip orahdfs-<version>.zip
    Archive:  orahdfs-<version>.zip
       creating: orahdfs-<version>/
       creating: orahdfs-<version>/log/
      inflating: orahdfs-<version>/examples.zip   
       creating: orahdfs-<version>/doc/
      inflating: orahdfs-<version>/doc/README.txt
       creating: orahdfs-<version>/jlib/
      inflating: orahdfs-<version>/jlib/osdt_cert.jar
      inflating: orahdfs-<version>/jlib/oraclepki.jar
      inflating: orahdfs-<version>/jlib/osdt_core.jar
      inflating: orahdfs-<version>/jlib/ojdbc7.jar
      inflating: orahdfs-<version>/jlib/orahdfs.jar
      inflating: orahdfs-<version>/jlib/ora-hadoop-common.jar
       creating: orahdfs-<version>/bin/
      inflating: orahdfs-<version>/bin/hdfs_stream
      inflating: orahdfs-<version>/bin/hdfs_stream.cmd

    The unzipped files have the structure shown in Example 1-1. The examples.zip file is not unzipped. Unzip that file later when you want to work with the examples.

  4. Open the orahdfs-<version>/bin/hdfs_stream Bash shell script in a text editor, and make the changes indicated by the comments in the script, if necessary.

    To configure the hdfs_stream script, see “How to Install and Configure a Hadoop Client on the Oracle Database System for Oracle SQL Connector for HDFS (OSCH)” (My Oracle Support note 2768424.1). See the /u01/app/cloudera/cdhenv Bash shell script defined in Step 7 of that document.

    The hdfs_stream script does not inherit any environment variable settings, and so they are set in the script if Oracle SQL Connector for HDFS needs them:

    • Add the following lines before the line export PATH=:

      CDH_ROOT=/u01/app/cloudera
      CDH_HOME=${CDH_ROOT}/parcels/CDH
      export JAVA_HOME=${CDH_ROOT}/jdk
      export HADOOP_CONF_DIR=${CDH_ROOT}/yarn-conf
      #----------- set Kerberos security variables
      export KRB5_CONFIG=${CDH_ROOT}/krb5.conf
      export HADOOP_CLIENT_OPTS='-Djava.security.krb5.conf='${KRB5_CONFIG}
    • Modify PATH as shown below to add the hadoop script to the PATH:

      export PATH=${CDH_HOME}/bin:/usr/bin:/bin

    See the comments in the hdfs_stream script for more information about these environment variables, and make any additional changes, as needed.

    The hdfs_stream script is the preprocessor for the Oracle Database external table created by Oracle SQL Connector for HDFS.

  5. If your cluster is secured with Kerberos and the account does not already have a Kerberos ticket, then obtain one:

    $ kinit
    

    See Use Oracle SQL Connector for HDFS on a Secure Hadoop Cluster in this guide for information on acquiring and maintaining Kerberos tickets.

  6. Run hdfs_stream to test permissions and usage.

    Run hdfs_stream from the Oracle SQL Connector for HDFS /bin directory. Use a new shell window to ensure that your shell environment does not contain an inherited environment from earlier steps. You should see this usage information:

    $ ./hdfs_stream
    Usage: hdfs_stream locationFile
    

    If you do not see the usage statement, then ensure that the operating system user that Oracle Database is running under (such as oracle) has the following permissions. OSCH_HOME represents the Oracle SQL Connector for HDFS home directory.

    • Read and execute permissions on the hdfs_stream script:

      $ ls -l OSCH_HOME/bin/hdfs_stream
      -rwxr-xr-x 1 oracle oinstall Nov 27 15:51 hdfs_stream
    • Read permission on orahdfs.jar.

      $ ls -l OSCH_HOME/jlib/orahdfs.jar
      -rwxr-xr-x 1 oracle oinstall Nov 27 15:51 orahdfs.jar

    If you do not see these permissions, then enter a chmod command to fix them, for example:

    $ chmod 755 OSCH_HOME/bin/hdfs_stream

    In the previous commands, OSCH_HOME represents the Oracle SQL Connector for HDFS home directory.

  7. For an Oracle RAC system, repeat the previous steps for every Oracle instance, using identical path locations.

  8. Log in to Oracle Database and create a database directory for the orahdfs-<version>/bin directory where hdfs_stream resides. For Oracle RAC systems, this directory must be accessible by all Oracle instances through identical paths.

    In this example, Oracle SQL Connector for HDFS is installed in /etc:

    SQL> CREATE OR REPLACE DIRECTORY osch_bin_path AS '/etc/orahdfs-<version>/bin';

    The directory path used for the OSCH_BIN_PATH database directory must be an absolute path without symbolic links.

  9. To support access to Hive tables:

    1. Ensure that the system is configured as a Hive client.

    2. Add the Hive JAR files and the Hive conf directory to the HADOOP_CLASSPATH environment variable. To avoid JAR conflicts among the various Hadoop products, Oracle recommends that you set HADOOP_CLASSPATH in your local shell initialization script instead of making a global change to HADOOP_CLASSPATH. If there are multiple JAR file paths in HADOOP_CLASSPATH ensure that the JARs for the current product are listed first.
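For example, you might add a line such as the following to your local shell initialization script. The OSCH path matches the /etc installation example used later in this chapter, and the Hive locations shown are typical defaults that may differ on your cluster.

# OSCH jlib first, then the Hive JARs and the Hive conf directory.
export HADOOP_CLASSPATH="/etc/orahdfs-<version>/jlib/*:/usr/lib/hive/lib/*:/etc/hive/conf:$HADOOP_CLASSPATH"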

The following figure shows the flow of data and the component locations.

Figure 1-1 Oracle SQL Connector for HDFS Installation for HDFS and Data Pump Files


The unzipped files have the structure shown in the following example.

Example 1-1 Structure of the orahdfs Directory

orahdfs-<version>
   bin/
      hdfs_stream
      hdfs_stream.cmd   
   doc/
      README.txt
   jlib/ 
      ojdbc7.jar
      oraloader.jar      
      ora-hadoop-common.jar
      oraclepki.jar
      orahdfs.jar
      osdt_cert.jar
      osdt_core.jar
   log/
   examples.zip 

Oracle Database Privileges for OSCH Users

Oracle Database users require these privileges in order to use Oracle SQL Connector for HDFS to create external tables:

  • CREATE SESSION

  • CREATE TABLE

  • CREATE VIEW

  • EXECUTE on the UTL_FILE PL/SQL package

  • READ and EXECUTE on the OSCH_BIN_PATH directory created during the installation of Oracle SQL Connector for HDFS. Do not grant write access to anyone. Grant EXECUTE only to those who intend to use Oracle SQL Connector for HDFS.

  • READ and WRITE on a database directory for storing external tables, or the CREATE ANY DIRECTORY system privilege. For Oracle RAC systems, this directory must be on a shared disk that all Oracle instances can access.

  • A tablespace and quota for copying data into the Oracle database. Optional.

The following example shows the SQL commands granting these privileges to HDFSUSER.

Note:

To query an external table that uses Oracle SQL Connector for HDFS, users need READ privilege for Oracle Database 12c or later and SELECT privilege for older versions of the database.

Example 1-2 Granting Users Access to Oracle SQL Connector for HDFS

CONNECT / AS sysdba;
CREATE USER hdfsuser IDENTIFIED BY password
   DEFAULT TABLESPACE hdfsdata
   QUOTA UNLIMITED ON hdfsdata;
GRANT CREATE SESSION, CREATE TABLE, CREATE VIEW TO hdfsuser;
GRANT EXECUTE ON sys.utl_file TO hdfsuser;
GRANT READ, EXECUTE ON DIRECTORY osch_bin_path TO hdfsuser;
GRANT READ, WRITE ON DIRECTORY external_table_dir TO hdfsuser;

OS-Level Requirements for OSCH Users

Wherever Oracle SQL Connector for HDFS is installed (on the Oracle Database system, a Hadoop cluster node, or a separate system set up as a Hadoop client), the OS-level user account that logs in to use OSCH requires access to the shell variable HADOOP_CLASSPATH. This variable must include the OSCH jlib path: <OSCH installation path>/orahdfs-<version>/jlib/*

Set the HADOOP_CLASSPATH as shown in the following example, where the OSCH path is prepended to the current HADOOP_CLASSPATH. Putting OSCH first gives it precedence over other JARs in the path.

$ export HADOOP_CLASSPATH="/etc/orahdfs-<version>/jlib/*:$HADOOP_CLASSPATH"

Use Oracle SQL Connector for HDFS on a Secure Hadoop Cluster

When users access an external table that was created using Oracle SQL Connector for HDFS, the external table behaves like a Hadoop client. On the system where the Oracle database is running, it connects as the OS user of the Oracle process (usually oracle). For OSCH to work, this account requires read permission on the files of all OSCH users. On a non-secure cluster these files are world-readable, but on a Kerberos-secured cluster this access requires a Kerberos ticket.

For a user to authenticate using kinit:

  • A Hadoop administrator must register the operating system user (such as oracle) and password in the Key Distribution Center (KDC) for the cluster.

  • A system administrator for the Oracle Database system must configure /etc/krb5.conf and add a domain definition that refers to the KDC managed by the secure cluster.

These steps enable the operating system user to authenticate with the kinit utility before submitting Oracle SQL Connector for HDFS jobs. The kinit utility typically uses a Kerberos keytab file for authentication without an interactive prompt for a password.

The system should run kinit on a regular basis, before letting the Kerberos ticket expire, to enable Oracle SQL Connector for HDFS to authenticate transparently. Use cron or a similar utility to run kinit. For example, if Kerberos tickets expire every two weeks, then set up a cron job to renew the ticket weekly.

Be sure to schedule the cron job to run when Oracle SQL Connector for HDFS is not actively being used.
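The following crontab entry is a hedged sketch of such a renewal job; the keytab path, principal, and schedule are assumptions that you must adapt to your KDC policy.

# Renew the Kerberos ticket every Sunday at 01:00 from a keytab, so no
# interactive password prompt is needed.
0 1 * * 0 /usr/bin/kinit -k -t /home/oracle/oracle.keytab oracle@EXAMPLE.COM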

Do not call kinit within the Oracle SQL Connector for HDFS preprocessor script (hdfs_stream), because it could trigger a high volume of concurrent calls to kinit and create internal Kerberos caching errors.

Note:

Oracle Big Data Appliance configures Kerberos security automatically as a configuration option.

Oracle Loader for Hadoop Setup

Software Requirements

Oracle Loader for Hadoop requires the following software:

  • A certified release of CDH or HDP.

  • A target database system running a version of Oracle Database that is currently supported by Oracle.

Oracle Big Data Appliance already meets these requirements. If you are using CDH or HDP on a commodity server platform, check to ensure that the system meets them.

For instructions on configuring OHSH to run on Oracle Big Data Service, see Use Big Data Connectors and Copy to Hadoop to Copy Data Between Big Data Service and a Database Instance in Using Oracle Big Data Service.

Install Oracle Loader for Hadoop

Oracle Loader for Hadoop is packaged with the Oracle Database 12c (12.1.0.2 and 12.2.0.1) client libraries and Oracle Instant Client libraries for connecting to Oracle Database 11.2.0.4, 12.1.0.2, or 12.2.0.1.

To install Oracle Loader for Hadoop:

  1. Unpack the content of oraloader-<version>.x86_64.zip into a directory on your Hadoop cluster or on a system configured as a Hadoop client.
  2. Unzip oraloader-<version>-h2.x86_64.zip into a directory on your Hadoop cluster.

    A directory named oraloader-<version>-h2 is created with the following subdirectories, along with the examples.zip file, which you must unzip yourself.

    doc
    jlib
    lib
    examples.zip
  3. Create a variable named OLH_HOME and set it to the installation directory.
  4. Add the following paths to the HADOOP_CLASSPATH variable (a combined example follows this list):
    • For all installations:

      $OLH_HOME/jlib/*

      When using OLH, $OLH_HOME/jlib/* should always be listed first in HADOOP_CLASSPATH. Alternatively, you can avoid conflict with other scripts by defining HADOOP_CLASSPATH within a script that uses it.

    • To support data loads from Hive tables:

      /usr/lib/hive/lib/*
      /etc/hive/conf

      See "oracle.hadoop.xquery.lib.share."

    • To read data from Oracle NoSQL Database Release 2:

      $KVHOME/lib/kvstore.jar
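The following combined example shows how steps 3 and 4 might look in a shell initialization or job-submission script. The installation path and the Hive locations are assumptions; adjust them for your system.

export OLH_HOME="/opt/oraloader-<version>-h2"
# OLH JARs first, then (for Hive loads) the Hive libraries and configuration.
export HADOOP_CLASSPATH="$OLH_HOME/jlib/*:/usr/lib/hive/lib/*:/etc/hive/conf:$HADOOP_CLASSPATH"
# Only when reading from Oracle NoSQL Database Release 2:
# export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$KVHOME/lib/kvstore.jar"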

Oracle Database Privileges for OLH Users

Oracle Database users require these privileges in order to use Oracle Loader for Hadoop to load data into the table:

  • CREATE SESSION (to connect to the database).
  • A tablespace and quota for inserting rows into the table.

Note:

As an OLH user, you must own the table. If you do not own the table, then you need additional privileges to access DBMS_METADATA. See DBMS_METADATA.

Provide Support for Offline Database Mode

In a typical installation, Oracle Loader for Hadoop can connect to the Oracle Database system from the Hadoop cluster or a Hadoop client. If this connection is impossible—for example, the systems are located on distinct networks—then you can use Oracle Loader for Hadoop in offline database mode.

To support offline database mode, you must install Oracle Loader for Hadoop on two systems:

  • The Hadoop cluster or a system set up as a Hadoop client.

  • The Oracle Database system or a system with network access to Oracle Database, as described in the following procedure.

To support Oracle Loader for Hadoop in offline database mode:

  1. Unpack the content of oraloader-<version>.zip into a directory on the Oracle Database system or a system with network access to Oracle Database. You must use the same version of the software as you installed on the Hadoop cluster.
  2. Unzip oraloader-<version>-h2.x86_64.zip.
  3. Create a variable named OLH_HOME and set it to the installation directory. This example uses the Bash shell syntax:
    $ export OLH_HOME="/usr/bin/oraloader-<version>-h2/"
  4. Add the Oracle Loader for Hadoop JAR files to the HADOOP_CLASSPATH environment variable. If there are other JAR file paths in HADOOP_CLASSPATH, ensure that the Oracle Loader for Hadoop JAR file path is listed first when using Oracle Loader for Hadoop. This example uses the Bash shell syntax:
    $ export HADOOP_CLASSPATH=$OLH_HOME/jlib/*:$HADOOP_CLASSPATH

Use Oracle Loader for Hadoop on a Secure Hadoop Cluster

A secure Hadoop cluster has Kerberos installed and configured to authenticate client activity. An operating system user must be authenticated before initiating an Oracle Loader for Hadoop job to run on a secure Hadoop cluster. For authentication, the user must log in to the operating system where the job will be submitted and use the standard Kerberos kinit utility.

For a user to authenticate using kinit:

  • A Hadoop administrator must register the operating system user and password in the Key Distribution Center (KDC) for the cluster.

  • A system administrator for the client system, where the operating system user will initiate an Oracle Loader for Hadoop job, must configure /etc/krb5.conf and add a domain definition that refers to the KDC managed by the secure cluster.

Typically, the kinit utility obtains an authentication ticket that lasts several days. Subsequent Oracle Loader for Hadoop jobs authenticate transparently using the unexpired ticket.
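For example, an operating system user might obtain a ticket and then submit a load job in the same session. This is a sketch only; myconf.xml is a placeholder for your job configuration file, and the full Oracle Loader for Hadoop job syntax is described later in this guide.

$ kinit oracle@EXAMPLE.COM
$ hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
      -conf myconf.xml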

Oracle Big Data Appliance configures Kerberos security automatically as a configuration option.

Oracle Shell for Hadoop Loaders Setup

Oracle Shell for Hadoop Loaders (OHSH) is integrated with Big Data Connectors. It provides a set of declarative commands you can use to load content from Hadoop and Hive to Oracle Database tables using Oracle Loader for Hadoop (OLH) and Oracle SQL Connector for Hadoop Distributed File System (OSCH). It also enables you to load contents from Oracle Database tables to Hadoop and Hive using the Copy to Hadoop feature of Big Data SQL.

Prerequisites

Oracle recommends running OHSH on a Hadoop client or edge node. It is also possible to run OHSH on a Hadoop node or on the Oracle Database server.

OHSH can be set up in any of the environments above (Hadoop client, Hadoop node, edge node, or Oracle Database server). To use OHSH, you need to install the software in only one of these environments.

Each environment has its own prerequisites. Check the relevant column in table below and install any software packages that are missing from the environment where you choose to run OHSH. As the table indicates, JDBC connectivity is required in all cases.

Table 1-1 Prerequisites for Running OHSH

If You Plan to Run OHSH From: A Hadoop node, a Hadoop client, or an edge node

Note:

When Big Data Connectors is licensed on Oracle Big Data Appliance, all of the prerequisite software and OHSH itself are pre-installed.

The Prerequisites Are:

  • SQL*Plus

  • OLH and OSCH

  • JDBC access to Oracle Database

If You Plan to Run OHSH From: The Oracle Database server

The Prerequisites Are:

  • Hadoop and Hive libraries (installed and configured)

  • OLH and OSCH

  • JDBC access to Oracle Database

Installing Oracle Shell for Hadoop Loaders

Follow these instructions for setting up Oracle Shell for Hadoop Loaders. The instructions apply to setup on a Hadoop client, an edge node, a Hadoop node, or the Oracle Database server.

  1. Extract the contents of ohsh-<version>.zip to a directory on the system where you plan to run OHSH. 

    The extraction creates a directory named ohsh-<version> with a README.txt file, the examples package (examples.zip) and four subdirectories:

    README.txt
    examples.zip
    /bin
    /conf
    /doc
    /jlib

    You must unzip examples.zip yourself.

    The directory ohsh-<version> is referred to as <OHSH_HOME> later in these instructions.

  2. Follow the instructions contained in README.txt to configure Oracle Shell for Hadoop Loaders. Below are instructions to install and configure Oracle Shell for Hadoop Loaders on a Hadoop node.

Install Oracle Shell for Hadoop Loaders on a Hadoop Node

This procedure applies to installation on a Hadoop Node only. See the OHSH README.txt file for installation on other systems.

Note:

These instructions use placeholders for the absolute paths that you will set as the value of some variables. The placeholders are framed in angle brackets. For example, <OHSH_HOME> is the path where OHSH is installed.
  1. Install and set up SQL*Plus if it is not already on the node.
    1. Download the Oracle Instant Client for Linux along with the corresponding Instant Client Package for SQL*Plus from the Oracle Technology Network. Select the client version that matches the version of the Oracle Database.
      For example, you can find the client downloads for Oracle Database versions at this address:

      http://www.oracle.com/technetwork/topics/linuxx86-64soft-092277.html

  2. Extract both clients into the same directory ( <SQLPLUS_DIR> ).
  3. Copy the tnsnames.ora and sqlnet.ora files from ${TNS_ADMIN} on the Oracle Database host to a directory of your choice on the Hadoop node (<TNS_ADMIN_DIR>).
    If an Oracle Wallet is created for the Oracle Database host, copy the wallet file to a directory of your choice on the Hadoop node (<WALLET_LOCATION_DIR>).
  4. Edit sqlnet.ora. Set WALLET_LOCATION to <WALLET_LOCATION_DIR>. Also check to be sure that sqlnet.wallet_override is set to "true".
       WALLET_LOCATION=
          (SOURCE=(METHOD=FILE)(METHOD_DATA=
            (DIRECTORY=<WALLET_LOCATION_DIR>)))
       sqlnet.wallet_override=true
    
  5. Install OLH and OSCH on the Hadoop node if they are not already installed.
    Note that OSCH requires installation and configuration steps on the Oracle Database host as well as on the Hadoop node. For both OLH and OSCH, follow the setup instructions in the Big Data Connectors User’s Guide.
  6. Edit <OHSH_HOME>/bin/ohsh_config.sh to configure the home directories of OHSH dependencies:
       export HADOOP_HOME=<HADOOP_CLIENT_KIT>   
       export HADOOP_CONF_DIR=<HADOOP_CONF>
       export HIVE_HOME=<HIVE_CLIENT_KIT>
       export HIVE_CONF_DIR=<HIVE_CONF>
       export OLH_HOME=<OLH_HOME>
       export OSCH_HOME=<OSCH_HOME>
       export CP2HADOOP_HOME=<CP2HADOOP_HOME>
       export HS2_HOST_PORT=<HS2_HOST>:<HS2_PORT>
       export HIVE_SESS_VAR_LIST=<semicolon_separated_variable_list>
  7. If TNS admin and Oracle Wallet are enabled, then also set the following variables:
    export WALLET_LOCATION="<WALLET_LOCATION_DIR>"
    export TNS_ADMIN="<TNS_ADMIN_DIR>"
    These values are assigned to OHSH tnsadmin and walletlocation defaults at the start of an OHSH session. They are used for all TNS and Oracle Wallet authentication in the session.
  8. Add <OHSH_HOME>/bin to the PATH environment variable.
  9. Start an OHSH session.

    Note: The HADOOP_CLASSPATH environment variable should be cleared before invoking Oracle Shell for Hadoop Loaders.

    $ ohsh
    Under the banner you should see a list of the kits that were found (OSCH, OLH, and CP2HADOOP).
    You can use the show resources command to see what resources are available.
    ohsh> show resources
    You will always see the three predefined resources: hadoop0, hive0, and bash0.

See Also:

Oracle Database Privileges for OHSH Users

OHSH uses Oracle Loader for Hadoop, Oracle SQL Connector for HDFS, and Copy to Hadoop to load data, so OHSH users require the Oracle Database privileges described in "Oracle Database Privileges for OSCH Users" and "Oracle Database Privileges for OLH Users".

Configure OHSH to Enable Job Monitoring

When OHSH jobs are executed, status and other information about each job are recorded in a back-end database. To access this information from the OHSH command line, you must first configure the connection to that database.

Configuration Steps

Configure the following properties in conf/smartloader-conf.xml to identify the database instance where the job history is stored.

  • oracle.hadoop.smartloader.diagnostics.jobhistory.jdbc.driver

    Specifies the JDBC driver for the supported back-end database type. Currently, MYSQL and ORACLE are valid values. If this property is not specified, the job history commands fail.

Additional properties must be set. These differ depending on which database type is defined as the back-end database.

  • If jdbc.driver = ORACLE:

    • oracle.hadoop.smartloader.diagnostics.jobhistory.jdbc.oracleConnectId

      A TNS entry name defined in the tnsnames.ora file.
    • oracle.hadoop.smartloader.diagnostics.jobhistory.jdbc.oracleWalletDir

      The OS directory containing the Oracle Wallet used to connect to an Oracle Database schema through JDBC.

    • oracle.hadoop.smartloader.diagnostics.jobhistory.jdbc.oracleTnsDir

      The file path to a directory on the node where OHSH runs. This directory contains SQL*Net configuration files such as sqlnet.ora and tnsnames.ora. Typically, this is ${ORACLE_HOME}/network/admin.

      Note:

      If you are running OHSH from a Hadoop client and want to use Oracle Wallet, copy tnsnames.ora and the wallet files to any directory on the Hadoop client.
  • If jdbc.driver = MYSQL:

    • oracle.hadoop.smartloader.diagnostics.jobhistory.jdbc.mysqlConnectUrl

      The URL used to make a JDBC connection to the MySQL database

    • oracle.hadoop.smartloader.diagnostics.jobhistory.jdbc.mysqlUser

      MySQL user of job history schema

    • oracle.hadoop.smartloader.diagnostics.jobhistory.jdbc.mysqlPW

      Password of the MySQL user account.

Commands for Monitoring OHSH Jobs

After this configuration is completed, you will be able to execute the following OHSH commands:

  • ohsh> show job <job_id>

    Shows the detailed information of the job specified by ID.

  • ohsh> show job summary <job_id>

    Shows the performance of the completed job specified by ID.

  • ohsh> show job abstract <job_id>

    Provides a functional description of the job.

  • ohsh>  show jobs [failed|running|completed|finished] [extended] [<integer>]

    Shows the last n jobs of a particular job status.

    • The first parameter specifies job status. If the status is not specified, all jobs are shown, regardless of job status.

    • The second parameter specifies whether to show details.  

    • The third parameter specifies that the last n jobs of the specified status should be shown. If n is not specified, then all jobs of that status are shown.

  • ohsh>  truncate jobs [<integer>]

    Removes the last n jobs from the database. If the integer is not specified, the command removes all jobs.
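For example, using the syntax above, the following command lists detailed information about the three most recent failed jobs:

ohsh> show jobs failed extended 3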

Oracle XQuery for Hadoop Setup

You install and configure Oracle XQuery for Hadoop on the Hadoop cluster. If you are using Oracle Big Data Appliance, then the software is already installed.

The following topics describe the software installation:

Software Requirements

Oracle Big Data Appliance Release 4.3 and later releases meet the requirements below. However, if you are installing Oracle XQuery for Hadoop on a third-party cluster, then you must ensure that these components are installed.

  • Java 8.x or 7.x.

  • A certified release of either CDH (Cloudera's Distribution including Apache Hadoop) or HDP (Hortonworks Data Platform).

  • Oracle NoSQL Database 3.x or 2.x to support reading and writing to Oracle NoSQL Database

  • Oracle Loader for Hadoop 3.8.0 or greater to support writing tables in Oracle databases

Install Oracle XQuery for Hadoop

Perform the following steps to install Oracle XQuery for Hadoop.

To install Oracle XQuery for Hadoop:

  1. Unpack the contents of oxh-<version>.zip into the installation directory:

    $ unzip oxh-<version>-cdh-<version>.zip
    Archive:  oxh-<version>-cdh-<version>.zip
       creating: oxh-<version>-cdh<version>/
       creating: oxh-<version>-cdh<version>/lib/
       creating: oxh-<version>-cdh<version>/oozie/
       creating: oxh-<version>-cdh<version>/oozie/lib/
      inflating: oxh-<version>-cdh<version>/lib/ant-launcher.jar
      inflating: oxh-<version>-cdh<version>/lib/ant.jar
         .
         .
         .

    You can now run Oracle XQuery for Hadoop.

  2. For the fastest execution time, copy the libraries into the Hadoop distributed cache:

    1. Copy all Oracle XQuery for Hadoop and third-party libraries into an HDFS directory. To use the -exportliboozie option to copy the files, see "Oracle XQuery for Hadoop Options". Alternatively, you can copy the libraries manually using the HDFS command line interface, as shown in the sketch after this procedure.

      If you use Oozie, then use the same folder for all files. See "Configure Oozie for the Oracle XQuery for Hadoop Action"

    2. Set the oracle.hadoop.xquery.lib.share property or use the -sharelib option on the command line to identify the directory for the Hadoop distributed cache.

  3. To support data loads into Oracle Database, install Oracle Loader for Hadoop:

    1. Unpack the content of oraloader-<version>.x86_64.zip into a directory on your Hadoop cluster or on a system configured as a Hadoop client. This archive contains an archive and a README file.

    2. Unzip the archive into a directory on your Hadoop cluster:

      unzip oraloader-<version>-h2.x86_64.zip

      A directory named oraloader-<version>-h2 is created with the following subdirectories and the examples.zip file:

      doc
      jlib
      lib
      examples.zip

      Unzip the examples.zip file yourself.

    3. Create an environment variable named OLH_HOME and set it to the installation directory. Do not set HADOOP_CLASSPATH.

  4. To support data loads into Oracle NoSQL Database, install it, and then set an environment variable named KVHOME to the Oracle NoSQL Database installation directory.

    Note:

    Do not add NoSQL Database jar files to HADOOP_CLASSPATH.

  5. To support indexing by Apache Solr:

    1. Ensure that Solr is installed and configured in your Hadoop cluster. Solr is included in Cloudera Search, which is installed automatically on Oracle Big Data Appliance.

    2. Create a collection in your Solr installation into which you will load documents. To create a collection, use the solrctl utility.

      See Also:

      For information about the solrctl utility, see the Cloudera Search User Guide.

    3. Configure Oracle XQuery for Hadoop to use your Solr installation by setting the OXH_SOLR_MR_HOME environment variable to the local directory containing search-mr-<version>.jar and search-mr-<version>-job.jar. For example:

      $ export OXH_SOLR_MR_HOME="/usr/lib/solr/contrib/mr"

      Note:

      Before using the Apache Tika adapter, you must also configure Oracle XQuery for Hadoop by setting the OXH_SOLR_MR_HOME environment variable to this local directory.
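The manual copy mentioned in step 2 of this procedure can be as simple as the following sketch. The HDFS directory is an assumption; you then point the oracle.hadoop.xquery.lib.share property (or the -sharelib option) at that directory when running queries.

# Copy the Oracle XQuery for Hadoop libraries into HDFS for the distributed cache.
hdfs dfs -mkdir -p /user/oracle/oxh_lib
hdfs dfs -put oxh-<version>-cdh<version>/lib/*.jar /user/oracle/oxh_lib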

Troubleshoot the File Paths

If Oracle XQuery for Hadoop fails to find its own or third-party libraries when running queries, first ensure that the environment variables were set correctly during Oracle XQuery for Hadoop installation.

Note:

The HADOOP_CLASSPATH environment variable or -libjars command line option must not contain either an OXH or third-party library.

If they are set correctly, then you may need to edit lib/oxh-lib.xml. This file identifies the location of Oracle XQuery for Hadoop system JAR files and other libraries, such as Avro, Oracle Loader for Hadoop, and Oracle NoSQL Database.

If necessary, you can reference environment variables in this file as ${env.variable}, such as ${env.OLH_HOME}. You can also reference Hadoop properties as ${property}, such as ${mapred.output.dir}.

Configure Oozie for the Oracle XQuery for Hadoop Action

You can use Apache Oozie workflows to run your queries, as described in "Run Queries from Apache Oozie". The software is already installed and configured on Oracle Big Data Appliance.

For other Hadoop clusters, you must first configure Oozie to use the Oracle XQuery for Hadoop action. These are the general steps to install the Oracle XQuery for Hadoop action:

  1. Modify the Oozie configuration. If you run CDH on third-party hardware, then use Cloudera Manager to change the Oozie server configuration. For other Hadoop installations, edit oozie-site.xml.
    • Add oracle.hadoop.xquery.oozie.OXHActionExecutor to the value of the oozie.service.ActionService.executor.ext.classes property.

    • Add oxh-action-v1.xsd to the value of the oozie.service.SchemaService.wf.ext.schemas property.

  2. Add oxh-oozie.jar to the Oozie server class path. For example, in a CDH5 installation, copy oxh-oozie.jar to /var/lib/oozie on the server.
  3. Add all Oracle XQuery for Hadoop dependencies to the Oozie shared library in a subdirectory named oxh. You can use the CLI -exportliboozie option. See "Oracle XQuery for Hadoop Options".
  4. Restart Oozie for the changes to take effect.

The specific steps depend on your Oozie installation, such as whether Oozie is already installed and which version you are using.

Oracle R Advanced Analytics for Hadoop Setup

An overview of Oracle R Advanced Analytics for Hadoop (ORAAH) is provided in Part IV of this guide.

Release notes, installation instructions, comprehensive reference material, and a list of changes in the current release are published separately on the Oracle Technology Network.

Each ORAAH release is compatible with a number of Oracle Big Data Appliance releases and releases of CDH running on non-Oracle platforms.

For a complete ORAAH compatibility matrix, see Document 2225633.1 on My Oracle Support.

Install the Software on Hadoop

Oracle Big Data Appliance supports Oracle R Advanced Analytics for Hadoop without any additional software installation or configuration. However, to use Oracle R Advanced Analytics for Hadoop on a third-party Hadoop cluster, you must create the necessary environment.

Software Requirements for a Third-Party Hadoop Cluster

You must install several software components on a third-party Hadoop cluster to support Oracle R Advanced Analytics for Hadoop.

Install the components described in the following topics on the third-party servers:

Note:

Do not set HADOOP_HOME on the Hadoop cluster. CDH5 does not need it, and it interferes with Oracle R Advanced Analytics for Hadoop. If you must set HADOOP_HOME for another application, then also set HADOOP_LIBEXEC_DIR in the /etc/bashrc file. For example:

export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec

Install Sqoop on a Third-Party Hadoop Cluster

Sqoop provides a SQL-like interface to Hadoop, which is a Java-based environment. Oracle R Advanced Analytics for Hadoop uses Sqoop for access to Oracle Database.

Note:

Sqoop is required even when using Oracle Loader for Hadoop as a driver for loading data into Oracle Database. Sqoop performs additional functions, such as copying data from a database to HDFS and sending free-form queries to a database. The driver also uses Sqoop to perform operations that Oracle Loader for Hadoop does not support.

To install and configure Sqoop for use with Oracle Database:

  1. Install Sqoop if it is not already installed on the server.

    For Cloudera's Distribution including Apache Hadoop, see the Sqoop installation instructions in the CDH Installation Guide.

  2. Download the appropriate Java Database Connectivity (JDBC) driver for Oracle Database from Oracle Technology Network.
  3. Copy the driver JAR file to $SQOOP_HOME/lib, which is a directory such as /usr/lib/sqoop/lib.
  4. Provide Sqoop with the connection string to Oracle Database.
    $ sqoop import --connect jdbc_connection_string

    For example, sqoop import --connect jdbc:oracle:thin:@//myhost:1521/orcl.

Install Hive on a Third-Party Hadoop Cluster

Hive provides an alternative storage and retrieval mechanism to HDFS files through a querying language called HiveQL. Oracle R Advanced Analytics for Hadoop uses the data preparation and analysis features of HiveQL, while enabling you to use R language constructs.

To install Hive:

  1. Follow the instructions provided by the distributor (Cloudera or Apache) for installing Hive.
  2. Verify that the installation is working correctly:

    $ hive -H
    usage: hive
     -d,--define <key=value>          Variable substitution to apply to hive
                                      commands. e.g. -d A=B or --define A=B
         .
         .
         .
  3. If the command fails or you see warnings in the output, then fix the Hive installation.

Install R on a Third-Party Hadoop Cluster

You can download Oracle R Distribution 3.0.1 and get the installation instructions from the website at

http://www.oracle.com/technetwork/database/database-technologies/r/r-distribution/downloads/index.html

Install the ORCH Package on a Third-Party Hadoop Cluster

ORCH is the name of the Oracle R Advanced Analytics for Hadoop package.

To install the ORCH package:

  1. Log in as root to the first node of the cluster.
  2. Set the environment variables for the supporting software:
    $ export JAVA_HOME="/usr/lib/jdk7"
    $ export R_HOME="/usr/lib64/R"
    $ export SQOOP_HOME="/usr/lib/sqoop"
  3. Unzip the downloaded file:
    $ unzip orch-<version>.zip
    $ unzip orch-linux-x86_64-<version>.zip 
    Archive:  orch-linux-x86_64-<version>.zip
       creating: ORCH<version>/
     extracting: ORCH<version>/ORCH_<version>_R_x86_64-unknown-linux-gnu.tar.gz  
      inflating: ORCH<version>/ORCHcore_<version>_R_x86_64-unknown-linux-gnu.tar.gz  
         .
         .
         .
  4. Change to the new directory:
    $ cd ORCH<version>
  5. Install the packages in the exact order shown here:
    R --vanilla CMD INSTALL OREbase_<version>_R_x86_64-unknown-linux-gnu.tar.gz
    R --vanilla CMD INSTALL OREstats_<version>_R_x86_64-unknown-linux-gnu.tar.gz
    R --vanilla CMD INSTALL OREmodels_<version>_R_x86_64-unknown-linux-gnu.tar.gz
    R --vanilla CMD INSTALL OREserver_<version>_R_x86_64-unknown-linux-gnu.tar.gz
    R --vanilla CMD INSTALL ORCHcore_<version>_R_x86_64-unknown-linux-gnu.tar.gz
    R --vanilla CMD INSTALL ORCHstats_<version>_R_x86_64-unknown-linux-gnu.tar.gz
    R --vanilla CMD INSTALL ORCH_<version>_R_x86_64-unknown-linux-gnu.tar.gz
  6. You must also install these packages on all other nodes of the cluster:
    • OREbase

    • OREmodels

    • OREserver

    • OREstats

    The following examples use the dcli utility, which is available on Oracle Big Data Appliance but not on third-party clusters, to copy and install the OREserver package:

    $ dcli -C -f OREserver_<version>_R_x86_64-unknown-linux-gnu.tar.gz -d /tmp/OREserver_<version>_R_x86_64-unknown-linux-gnu.tar.gz
    
    $ dcli -C " R --vanilla CMD INSTALL /tmp/OREserver_<version>_R_x86_64-unknown-linux-gnu.tar.gz"

Install Additional R Packages

Your Hadoop cluster must have libpng-devel installed on every node. The following procedure uses the dcli utility to run commands across all nodes; dcli is available on Oracle Big Data Appliance but not on third-party clusters. If you are using a cluster running on commodity hardware, then follow the same basic procedure, but run the commands on each node individually instead of using dcli. See the Oracle Big Data Appliance Owner's Guide for the syntax of the dcli utility.

To install libpng-devel:

  1. Log in as root to any node in your Hadoop cluster.

  2. Check whether libpng-devel is already installed:

    # dcli rpm -qi libpng-devel
    bda1node01: package libpng-devel is not installed
    bda1node02: package libpng-devel is not installed
         .
         .
         .

    If the package is already installed on all servers, then you can skip this procedure.

  3. If you need a proxy server to go outside a firewall, then set the HTTP_PROXY environment variable. This example uses dcli, which is available only on Oracle Big Data Appliance:

    # dcli export HTTP_PROXY="http://proxy.example.com"
  4. Change to the yum directory:

    # cd /etc/yum.repos.d
  5. Download and configure the appropriate configuration file for your version of Linux:

    For Enterprise Linux 5 (EL5):

    1. Download the yum configuration file:

      # wget http://public-yum.oracle.com/public-yum-el5.repo
    2. Open public-yum-el5.repo in a text editor and make these changes:

      Under el5_latest, set enabled=1

      Under el5_addons, set enabled=1

    3. Save your changes and exit.

    4. Copy the file to the other Oracle Big Data Appliance servers:

      # dcli -d /etc/yum.repos.d -f public-yum-el5.repo

    For Oracle Linux 6 (OL6):

    1. Download the yum configuration file:

      # wget http://public-yum.oracle.com/public-yum-ol6.repo
    2. Open public-yum-ol6.repo in a text editor and make these changes:

      Under ol6_latest, set enabled=1

      Under ol6_addons, set enabled=1

    3. Save your changes and exit.

    4. Copy the file to the other Oracle Big Data Appliance servers:

      # dcli -d /etc/yum.repos.d -f public-yum-ol6.repo
  6. Install the package on all servers:

    # dcli yum -y install libpng-devel
    bda1node01: Loaded plugins: rhnplugin, security
    bda1node01: Repository 'bda' is missing name in configuration, using id
    bda1node01: This system is not registered with ULN.
    bda1node01: ULN support will be disabled.
    bda1node01: http://bda1node01-master.abcd.com/bda/repodata/repomd.xml:
    bda1node01: [Errno 14] HTTP Error 502: notresolvable
    bda1node01: Trying other mirror.
         .
         .
         .
    bda1node01: Running Transaction
    bda1node01: Installing     : libpng-devel                    1/2
    bda1node01: Installing     : libpng-devel                    2/2
     
    bda1node01: Installed:
    bda1node01: libpng-devel.i386 2:1.2.10-17.el5_8  libpng-devel.x86_64 2:1.2.10-17.el5_8
     
    bda1node01: Complete!
    bda1node02: Loaded plugins: rhnplugin, security
         .
         .
         .
  7. Verify that the installation was successful on all servers:

    # dcli rpm -qi libpng-devel
    bda1node01: Name        : libpng-devel Relocations: (not relocatable)
    bda1node01: Version     : 1.2.10       Vendor: Oracle America
    bda1node01: Release     : 17.el5_8      Build Date: Wed 25 Apr 2012 06:51:15 AM PDT
    bda1node01: Install Date: Tue 05 Feb 2013 11:41:14 AM PST  Build Host: ca-build56.abcd.com
    bda1node01: Group       : Development/Libraries  Source RPM: libpng-1.2.10-17.el5_8.src.rpm
    bda1node01: Size        : 482483                 License: zlib
    bda1node01: Signature   : DSA/SHA1, Wed 25 Apr 2012 06:51:41 AM PDT, Key ID 66ced3de1e5e0159
    bda1node01: URL         : http://www.libpng.org/pub/png/
    bda1node01: Summary     : Development tools for programs to manipulate PNG image format files.
    bda1node01: Description :
    bda1node01: The libpng-devel package contains the header files and static
    bda1node01: libraries necessary for developing programs using the PNG (Portable
    bda1node01: Network Graphics) library.
         .
         .
         .

Provide Remote Client Access to R Users

Although R users run their programs as MapReduce jobs on the Hadoop cluster, they do not typically have individual accounts on that platform. Instead, an external Linux server provides remote access.

Software Requirements for Remote Client Access

To provide R users with access to a Hadoop cluster, install these components on a Linux server:

  • The same version of Hadoop as your Hadoop cluster; otherwise, unexpected issues and failures can occur

  • The same version of Sqoop as your Hadoop cluster; required only to support copying data in and out of Oracle databases

  • Mahout; required only for the orch.ls function with the Mahout ALS-WS algorithm

  • The same version of the Java Development Kit (JDK) as your Hadoop cluster

  • Oracle R distribution 3.0.1 with all base libraries

  • ORCH R package

To provide access to database objects, you must have the Oracle Advanced Analytics option to Oracle Database. Then you can install this additional component on the Hadoop client:

  • Oracle R Enterprise Client Packages

Configure the Server as a Hadoop Client

You must install Hadoop on the client and minimally configure it for HDFS client use.

To install and configure Hadoop on the client system:

  1. Install and configure CDH5 or Apache Hadoop 2.2.0 on the client system. This system can be the host for Oracle Database. If you are using Oracle Big Data Appliance, then complete the procedures for providing remote client access in the Oracle Big Data Appliance Software User's Guide. Otherwise, follow the installation instructions provided by the distributor (Cloudera or Apache).
  2. Log in to the client system as an R user.
  3. Open a Bash shell and enter this Hadoop file system command:
    $HADOOP_HOME/bin/hdfs dfs -ls /user
  4. If you see a list of files, then you are done. If not, then ensure that the Hadoop cluster is up and running. If that does not fix the problem, then you must debug your client Hadoop installation.

Install Sqoop on a Hadoop Client

Complete the same procedures on the client system for installing and configuring Sqoop as those provided in "Install Sqoop on a Third-Party Hadoop Cluster".

Install R on a Hadoop Client

You can download R 2.13.2 and get the installation instructions from the Oracle R Distribution website at

https://oss.oracle.com/ORD/

When you are done, ensure that users have the necessary permissions to connect to the Linux server and run R.

You may also want to install RStudio Server to facilitate access by R users. See the RStudio website at

http://rstudio.org/

Install the ORCH Package on a Hadoop Client

To install ORCH on your Hadoop client system:

  1. Download the ORCH package and unzip it on the client system.
  2. Change to the installation directory.
  3. Run the client script:
    # ./install-client.sh

Install the Oracle R Enterprise Client Packages (Optional)

To support full access to Oracle Database using R, install the Oracle R Enterprise client packages. Without them, Oracle R Advanced Analytics for Hadoop does not have access to the advanced statistical algorithms provided by Oracle R Enterprise.

See Also:

Oracle R Enterprise User's Guide for information about installing R and Oracle R Enterprise

Oracle Data Integrator

For instructions on setting up and using Oracle Data Integrator, refer to Oracle Fusion Middleware Integrating Big Data with Oracle Data Integrator.

Note:

Oracle Data Integrator is not supported for CDH 6.x.

Oracle Datasource for Apache Hadoop Setup

Software Requirements

Oracle Datasource for Apache Hadoop requires the following software:

  • A target database system running Oracle Database 12c, 11.2.0.4, or earlier Oracle database releases that can be queried with the Oracle JDBC driver for 12c.

    Note that Oracle Database 11.2.0.4 and potentially earlier Oracle Database releases may work. However, some of the SPLIT patterns have dependencies on Oracle Database 12c and might not provide accurate splits for parallel Hadoop jobs when used to query earlier releases of Oracle Database.

  • Cloudera's Distribution including Apache Hadoop version 5 (CDH5), Hortonworks Data Platform (HDP) 2.x, or Apache Hadoop 2.2.0 to 2.6.0.

  • Apache Hive 0.13.0, 0.13.1 or 1.1.0 (in order to query data from Oracle Database tables).

Installing Oracle Datasource for Apache Hadoop

Set HADOOP_CLASSPATH to include $OD4H_HOME/jlib/* in the Hadoop node where you are running the Hive client. Ensure that this is listed first in HADOOP_CLASSPATH.
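For example, on the Hive client node (the OD4H installation path shown is an assumption):

export OD4H_HOME="/opt/od4h"
# The OD4H JARs must come first in HADOOP_CLASSPATH.
export HADOOP_CLASSPATH="$OD4H_HOME/jlib/*:$HADOOP_CLASSPATH"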

Ensure that the OD4H jars in $OD4H_HOME/jlib are accessible to Hive commands using OD4H. You can do this in one of the following ways:

1. Configure HiveServer2 and make the jars available cluster-wide. This enables the Hive client Beeline and other tools (such as SQL Developer) to work with OD4H.

2. To isolate configuration to a particular session, add jars manually. You can use this method for situations when you use OD4H and do not want the OD4H jars to interfere with other applications.

Add jar files to Hive CLI

To enable jar files to be present locally to a Hive CLI session, add the jar files via Hive as follows:

$ hive

hive> ADD JAR <jar name>;

hive> ADD JAR <jar name>;

Configure HiveServer2

Following are the steps to configure HiveServer2:

1. Log in to Cloudera Manager.

2. Click on Hive.

3. Under Status Summary, click HiveServer2.

4. Click on Configuration.

Figure 1-5 Configuring HiveServer2


5. Click on Service Configuration.

Figure 1-6 Service Configuration


6. Type AUX in the search box. Under the Service-Wide/Advanced property category, add the directory that contains the jars you want to add to HiveServer2.

7. Restart Hive.