1 Getting Started with Oracle Big Data Connectors

This chapter introduces you to Oracle Big Data Connectors, provides installation instructions, and identifies the permissions needed for users to access the connectors.

This chapter contains these topics:

About Oracle Big Data Connectors
Downloading the Software
Oracle Direct Connector for Hadoop Distributed File System
Oracle Loader for Hadoop
Oracle Data Integrator Application Adapter for Hadoop
Oracle R Connector for Hadoop

1.1 About Oracle Big Data Connectors

Oracle Big Data Connectors facilitate data access between data stored in a Hadoop cluster and Oracle Database. They can be licensed for use on either Oracle Big Data Appliance or a Hadoop cluster running on commodity hardware.

These are the connectors:

Oracle Direct Connector for Hadoop Distributed File System: Enables Oracle Database to access data stored in a Hadoop Distributed File System (HDFS). The data can remain in HDFS or it can be loaded into Oracle Database.
Oracle Loader for Hadoop: Provides an efficient and high performance loader for fast movement of data from a Hadoop cluster into a table in an Oracle database. Oracle Loader for Hadoop prepartitions the data if necessary and transforms it into an Oracle-ready format. It optionally sorts records by primary key before loading the data or creating output files. Oracle Loader for Hadoop is a MapReduce application that is invoked as a command line utility. It accepts the generic command-line options that are supported by the Tool interface.
Oracle Data Integrator Application Adapter for Hadoop: Extracts, transforms, and loads data from a Hadoop cluster into tables in Oracle Database, as defined using a graphical user interface.
Oracle R Connector for Hadoop: Provides an interface between a local R environment, Oracle Database, and Hadoop, allowing speed-of-thought, interactive analysis on all three platforms. Oracle R Connector for Hadoop is designed to work independently, but if the enterprise data for your analysis is also stored in Oracle Database, then the full power of this connector is achieved when it is used with Oracle R Enterprise.

Individual connectors may require that software components are installed in Oracle Database, the Hadoop cluster, and the user's PC. Users may also need additional access privileges in Oracle Database.

1.2 Downloading the Software

You can download Oracle Big Data Connectors from Oracle Technology Network (OTN) or Oracle Delivery Cloud.

To download from OTN:

Use any browser to visit this website:

http://www.oracle.com/technetwork/bdc/big-data-connectors/downloads/index.html
Click the name of each connector to download a zip file containing the installation files.

To download from Oracle Software Delivery Cloud:

You can also download the software from Oracle Software Delivery Cloud at

https://edelivery.oracle.com/
Accept the Terms and Restrictions to see the Media Pack Search page.
Select the search terms:

Select a Product Pack: Oracle Database

Platform: Linux x86-64
Click Go to display a list of product packs.
Select Oracle Big Data Connectors Media Pack for Linux x86-64 (B65965-0x), then click Continue.
Click Download for each connector to download a zip file containing the installation files.

1.3 Oracle Direct Connector for Hadoop Distributed File System

Oracle Direct Connector for Hadoop Distributed File System (Oracle Direct Connector) is installed and runs on the system where Oracle Database runs. Before installing Oracle Direct Connector, verify that you have the required software.

1.3.1 Required Software

Oracle Direct Connector requires the following software:

Cloudera's Distribution including Apache Hadoop Version CDH3 or Apache Hadoop 0.20.2.
Oracle JDK 1.6.0_8 or higher for CDH3. Cloudera recommends version 1.6.0_26.
Oracle Database Release 11g Release 2 (11.2.0.2 or 11.2.0.3) for Linux.
To support the Data Pump file format, a database one-off patch. To download this patch, go to http://support.oracle.com and search for bug 13079417.
The same version of Hadoop on the database system as your Hadoop cluster, either CDH3 or Apache Hadoop 0.20.2.
The same version of Oracle JDK on the database system as your Hadoop cluster.

1.3.2 Installing and Configuring Hadoop

Oracle Direct Connector works as an HDFS client. You do not need to configure Hadoop on the database system to run MapReduce jobs for Oracle Direct Connector. However, you must install Hadoop on the database system and minimally configure it for HDFS client use only.

To configure the database system as a Hadoop client:

Install CDH3 or Apache Hadoop 0.20.2 on the database system. Follow the installation instructions provided by the distributor (Cloudera or Apache). Do not follow the configuration instructions.
Use a text editor to open conf/hadoop-env.sh in the Hadoop home directory on the database system, then make these changes:
1. Uncomment the line that begins export JAVA_HOME.
2. Set JAVA_HOME to the directory where JDK1.6 is installed.

Edit conf/core-site.xml in the same directory to identify the NameNode of your Hadoop cluster as follows:

<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://host:port</value>
    </property>
</configuration>

Ensure that Oracle Database has access to Hadoop and HDFS:
1. Log in to the system where Oracle Database is running using the Oracle database account.
2. Open a bash shell and issue this command:
```
$HADOOP_HOME/bin/hadoop fs -ls /user
```
  In this command, $HADOOP_HOME is the absolute path to the Hadoop home directory. You should see a list of files. If not, then first ensure that the Hadoop cluster is up and running. If the problem persists, then you must correct the Hadoop client configuration so that Oracle Database has access to the Hadoop cluster file system.

The database system is now ready for use as a Hadoop client. No other Hadoop configuration steps are needed.

1.3.3 Installing Oracle Direct Connector

To install Oracle Direct Connector:

Download the zip file to a directory on the system where Oracle Database runs.
Unzip orahdfs-version.zip into a directory. The unzipped files have the structure shown in Example 1-1.
Open the hdfs_stream bash shell script in a text editor and make these changes:
- HADOOP_HOME: Set to the absolute path of the Hadoop home directory.
- DIRECTHDFS_HOME: Set to the absolute path of the Oracle Direct Connector installation directory.
The hdfs_stream script is the preprocessor script for the HDFS external table. Comments in the script provide complete instructions for making these changes.
Run the hdfs_stream script from the Oracle Direct Connector installation directory. You should see this usage information:
```
$ bin/hdfs_stream
Oracle Direct HDFS Release 1.0.0.0.0 - Production
Copyright (c) 2011, Oracle and/or its affiliates. All rights reserved.
Usage: $HADOOP_HOME/bin/hadoop jar orahdfs.jar oracle.hadoop.hdfs.exttab.HdfsStream <locationPath>
```
If not, then ensure that the operating system user that Oracle is running under has the following permissions:
- Read and execute permissions on the hdfs_stream script:
```
$ ls -l DIRECTHDFS_HOME/bin/hdfs_stream
-rwxr-xr-x 1 oracle oinstall 2273 Apr 27 15:51 hdfs_stream
```
  If you do not see these permissions, then issue a chmod command to fix them:
```
$ chmod 755 DIRECTHDFS_HOME/bin/hdfs_stream
```
  In these commands, DIRECTHDFS_HOME represents the Oracle Direct Connector home directory.
- Read permission on DIRECTHDFS_HOME/jlib/orahdfs.jar.
Create a database directory for the orahdfs-version/bin directory where hdfs_stream resides. In this example, the Oracle Direct Connector kit is installed in /etc:
```
SQL> CREATE OR REPLACE DIRECTORY hdfs_bin_path AS  '/etc/orahdfs-1.0/bin'
```

Example 1-1 Structure of the orahdfs Directory

orahdfs-version
   bin/
      hdfs_stream
   jlib/ 
      orahdfs.jar
   log/
   README.txt

1.3.4 Granting User Access to Oracle Direct Connector

Oracle Database users require these privileges to use Oracle Direct Connector:

CREATE SESSION
EXECUTE on the UTL_FILE PL/SQL package.
READ and EXECUTE on the HDFS_BIN_PATH directory created in Step 0 Do not grant write access to anyone. Grant EXECUTE only to those who intend to use Oracle Direct Connector.

Example 1-2 shows the SQL commands granting these privileges to HDFSUSER.

Example 1-2 Granting Users Access to Oracle Direct Connector

CONNECT / AS sysdba;
CREATE USER hdfsuser IDENTIFIED BY password;
GRANT CREATE SESSION TO hdfsuser;
GRANT EXECUTE ON SYS.UTL_FILE TO hdfsuser;
GRANT READ, EXECUTE ON DIRECTORY hdfs_bin_path TO hdfsuser;

1.4 Oracle Loader for Hadoop

Before installing Oracle Loader for Hadoop, verify that you have the required software.

1.4.1 Required Software

Oracle Loader for Hadoop requires the following software:

A target database system running one of the following:
- Oracle Database 10g Release 2 (10.2.0.5) with required patch
- Oracle Database 11g Release 2 (11.2.0.2) with required patch
- Oracle Database 11g Release 2 (11.2.0.3)
Note:
To use Oracle Loader for Hadoop with Oracle Database 10g Release 2 (10.2.0.5) or Oracle Database 11g Release 2 (11.2.0.2), you must first apply a one-off patch that addresses bug number 11897896. To access this patch, go to http://support.oracle.com and search for the bug number.
Cloudera's Distribution including Apache Hadoop (CDH3) or Apache Hadoop 0.20.2
Hive 0.7.0 or 0.7.1, if using the HiveToAvroInputFormat class

1.4.2 Installing Oracle Loader for Hadoop

Oracle Loader for Hadoop is packaged with the Oracle Database 11g Release 2 client libraries and Oracle Instant Client libraries for connecting to Oracle Database 10.2.0.5, 11.2.0.2, or 11.2.0.3.

To install Oracle Loader for Hadoop:

Unpack the content of the oraloader-version.zip archive into a directory on your Hadoop cluster.

A directory named oraloader-version is created with the following subdirectories:
- doc
- jlib
- lib
- examples
This guide uses the variable ${OLH_HOME} to refer to this installation directory.
Add ${OLH_HOME}/jlib/* to the HADOOP_CLASSPATH variable.

1.5 Oracle Data Integrator Application Adapter for Hadoop

Installation requirements for Oracle Data Integrator Application Adapter for Hadoop are provided in these topics:

System Requirements and Certifications
Technology Specific Requirements

1.5.1 System Requirements and Certifications

To use the Application Adapter for Hadoop, you must first have Oracle Data Integrator, which is licensed separately from Oracle Big Data Connectors. You can download Oracle Data Integrator from the Oracle website at

http://www.oracle.com/technetwork/middleware/data-integrator/downloads/index.html

Oracle Data Integrator Application Adapter for Hadoop Knowledge Modules require a minimum version of Oracle Data Integrator 11.1.1.6.0.

Before performing any installation, read the system requirements and certification documentation to ensure that your environment meets the minimum installation requirements for the products you are installing.

The list of supported platforms and versions is available on Oracle Technical Network:

http://www.oracle.com/technology/products/oracle-data-integrator/index.html

1.5.2 Technology Specific Requirements

The list of supported technologies and versions is available on Oracle Technical Network:

http://www.oracle.com/technology/products/oracle-data-integrator/index.html

1.5.3 Location of the Oracle Data Integrator Application Adapter for Hadoop

Oracle Data Integrator Application Adapter for Hadoop is available in the xml-reference directory of the Oracle Data Integrator Companion CD.

1.5.4 Setting Up the Topology

See Chapter 4, "Oracle Data Integrator Application Adapter for Hadoop."

1.6 Oracle R Connector for Hadoop

Oracle R Connector for Hadoop requires the installation of a software environment on the Hadoop side and on a client Linux system.

1.6.1 Installing the Server Software

Oracle Big Data Appliance supports Oracle R Connector for Hadoop without any additional software installation or configuration.

To use Oracle R Connector for Hadoop on any other Hadoop cluster, you must create the necessary environment.

Install these components on third-party servers:

Java Virtual Machine (JVM), preferably Java HotSpot Virtual Machine 6.
R distribution 2.13.2 with all base libraries on all nodes in the Hadoop cluster.
ORHC package installed on each R engine, which must exist on every node of the Hadoop cluster. See the following instructions.

To install ORHC:

Set the environment variables for the Hadoop and JVM home directories:
```
$ setenv HADOOP_HOME /usr/lib/hadoop-0.2.0
$ setenv JAVA_HOME /usr/lib/jdk6
```
In this example, both home directories are in /usr/lib.

Unzip the downloaded file:

$ unzip orhc.tgz.zip
Archive:  orhc.tgz.zip

Open R and install the package:

> install.packages("/home/tmp/orhc.tgz", repos=NULL)
Installing package(s) into ...
.
.
.
Hadoop is up and running.

Alternatively, you can install the package from the Linux command line:

$ R CMD INSTALL orhc.tgz
* installing *source* package 'ORHC' ...
** R
.
.
.
Hadoop is up and running.
 
* DONE (ORHC)

1.6.2 Installing the Client Software

To provide access to a Hadoop cluster to R users, install these components on a Linux server:

Hadoop Client to allow access to the Hadoop cluster

For Oracle Big Data Appliance, see the Oracle Big Data Appliance Software User's Guide for detailed instructions on setting up remote client access.
Java Virtual Machine, preferably Java HotSpot Virtual Machine 6
R distribution 2.13.2
ORHC R package

Follow the steps for installing ORHC in "Installing the Server Software".
Oracle R Enterprise libraries (optional). They support access to Oracle Database; otherwise, Oracle R Connector for Hadoop operates only with in-memory R objects and local data files without access to the advanced statistical algorithms provided by Oracle R Enterprise. For example:
```
library(DBI)
library(ROracle)
library(OREbase)
library(OREeda)
library(OREgraphics)
library(OREstats)
library(RToXmp)
```

When you are done, ensure that users have the necessary permissions to connect to the Linux server and run R.