1 Getting Started with Oracle Big Data Connectors

This chapter describes the Oracle Big Data Connectors, provides installation instructions, and identifies the permissions needed for users to access the connectors.

This chapter contains the following sections:

About Oracle Big Data Connectors
Big Data Concepts and Technologies
Downloading the Oracle Big Data Connectors Software
Oracle Direct Connector for Hadoop Distributed File System Setup
Oracle Loader for Hadoop Setup
Oracle Data Integrator Application Adapter for Hadoop Setup
Oracle R Connector for Hadoop Setup

1.1 About Oracle Big Data Connectors

Oracle Big Data Connectors facilitate data access between a Hadoop cluster and Oracle Database. They can be licensed for use on either Oracle Big Data Appliance or a Hadoop cluster running on commodity hardware.

These are the connectors:

Oracle Direct Connector for Hadoop Distributed File System: Enables Oracle Database to access data stored in Hadoop Distributed File System (HDFS). The data can remain in HDFS, or it can be loaded into an Oracle database.
Oracle Loader for Hadoop: Provides an efficient and high-performance loader for fast movement of data from a Hadoop cluster into a table in an Oracle database. Oracle Loader for Hadoop prepartitions the data if necessary and transforms it into a database-ready format. It optionally sorts records by primary key before loading the data or creating output files. Oracle Loader for Hadoop is a MapReduce application that is invoked as a command-line utility. It accepts the generic command-line options that are supported by the Tool interface.
Oracle Data Integrator Application Adapter for Hadoop: Extracts, transforms, and loads data from a Hadoop cluster into tables in an Oracle database, as defined using a graphical user interface.
Oracle R Connector for Hadoop: Provides an interface between a local R environment, Oracle Database, and Hadoop, allowing speed-of-thought, interactive analysis on all three platforms. Oracle R Connector for Hadoop is designed to work independently, but if the enterprise data for your analysis is also stored in Oracle Database, then the full power of this connector is achieved when it is used with Oracle R Enterprise.

Individual connectors may require that software components be installed in Oracle Database and the Hadoop cluster. Users may also need additional access privileges in Oracle Database.

1.2 Big Data Concepts and Technologies

Enterprises are seeing large amounts of data coming from multiple sources. Click-stream data in web logs, GPS tracking information, data from retail operations, sensor data, and multimedia streams are just a few examples of vast amounts of data that can be of tremendous value to an enterprise if analyzed. The unstructured and semi-structured information provided by raw data feeds is of little value in and of itself. The data needs to be processed to extract information of real value, which can then be stored and managed in the database. Analyzing this data together with the structured data in the database can provide new insights and lead to substantial business benefits.

1.2.1 What is MapReduce?

MapReduce is a parallel programming model for processing data on a distributed system. It can process vast amounts of data in a timely manner and can scale linearly. It is particularly effective as a mechanism for batch processing of unstructured and semi-structured data. MapReduce abstracts lower level operations into computations over a set of keys and values.

A simplified definition of a MapReduce job is the successive alternation of two phases, the map phase and the reduce phase. Each map phase applies a transform function to each record in the input data to produce a set of records expressed as key-value pairs. The output from the map phase is the input to the reduce phase. In the reduce phase, the map output records are sorted into key-value sets so that all records in a set have the same key value. A reducer function is applied to all the records in a set, and a set of output records is produced as key-value pairs. The map phase is logically run in parallel over each record, while the reduce phase is run in parallel over all key values.
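
The data flow described above can be imitated on a single machine with standard Unix tools. The following is only a sketch of the map, sort, and reduce steps for a word count, not Hadoop code. The function names map_phase, reduce_phase, and word_count and the sample input are hypothetical and do not come from any Oracle or Hadoop API.

```shell
# Map phase: emit one "word<TAB>1" key-value pair for every word in the input.
map_phase() {
  tr -s '[:space:]' '\n' | awk 'NF { print $1 "\t" 1 }'
}

# Reduce phase: the input arrives sorted by key, so all records that share
# a key are adjacent. Sum the values for each key and emit "key<TAB>total".
reduce_phase() {
  awk -F '\t' '
    $1 != key { if (NR > 1) print key "\t" sum; key = $1; sum = 0 }
    { sum += $2 }
    END { if (NR > 0) print key "\t" sum }'
}

# The sort between the two phases stands in for the grouping of map output
# by key that the framework performs before the reducers run.
word_count() {
  map_phase | LC_ALL=C sort -k1,1 | reduce_phase
}

printf 'big data big cluster\ndata big\n' | word_count
```

For the sample input, the pipeline prints three tab-separated records: big 3, cluster 1, and data 2. In a real cluster the map and reduce functions run in parallel on many nodes; this sketch runs them sequentially to show only the shape of the computation.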

1.2.2 What is Apache Hadoop?

Apache Hadoop is the software framework for the development and deployment of data processing jobs based on the MapReduce programming model. At the core, Hadoop provides a reliable shared storage and analysis system^Foot 1. Analysis is provided by MapReduce. Storage is provided by the Hadoop Distributed File System (HDFS), a shared storage system designed for MapReduce jobs.

The Hadoop ecosystem includes several other projects, such as Apache Avro, a data serialization system that is used by Oracle Loader for Hadoop.

Cloudera's Distribution including Apache Hadoop (CDH) is installed on Oracle Big Data Appliance. You can use Oracle Big Data Connectors on a Hadoop cluster running CDH or the equivalent Apache Hadoop components, as described in the setup instructions in this chapter.

1.3 Downloading the Oracle Big Data Connectors Software

You can download Oracle Big Data Connectors from Oracle Technology Network or Oracle Software Delivery Cloud.

To download from Oracle Technology Network:

1. Use any browser to visit this website:

   http://www.oracle.com/technetwork/bdc/big-data-connectors/downloads/index.html
2. Click the name of each connector to download a zip file containing the installation files.

To download from Oracle Software Delivery Cloud:

1. Use any browser to visit this website:

   https://edelivery.oracle.com/
2. Accept the Terms and Restrictions to see the Media Pack Search page.
3. Select the search terms:
   - Select a Product Pack: Oracle Database
   - Platform: Linux x86-64
4. Click Go to display a list of product packs.
5. Select Oracle Big Data Connectors Media Pack for Linux x86-64 (B65965-0x), and then click Continue.
6. Click Download for each connector to download a zip file containing the installation files.

1.4 Oracle Direct Connector for Hadoop Distributed File System Setup

Oracle Direct Connector for Hadoop Distributed File System (HDFS) is installed and runs on the system where Oracle Database runs. Before installing Oracle Direct Connector for HDFS, verify that you have the required software.

1.4.1 Required Software

Oracle Direct Connector for HDFS requires the following software:

Cloudera's Distribution including Apache Hadoop version 3 (CDH3) or Apache Hadoop 0.20.2.
Java Development Kit (JDK) 1.6.0_8 or later for CDH3. Cloudera recommends version 1.6.0_26.
Oracle Database 11g release 2 (11.2.0.2 or 11.2.0.3) for Linux.
An Oracle Database one-off patch, which is needed to support the Oracle Data Pump file format. To download this patch, go to http://support.oracle.com and search for bug 14557588.
The same version of Hadoop on the Oracle Database system as your Hadoop cluster, either CDH3 or Apache Hadoop 0.20.2.
The same version of JDK on the Oracle Database system as your Hadoop cluster.

1.4.2 Installing and Configuring Hadoop

Oracle Direct Connector for HDFS works as a Hadoop client. You do not need to configure Hadoop on the Oracle Database system to run MapReduce jobs for Oracle Direct Connector for HDFS. However, you must install Hadoop on the Oracle Database system and minimally configure it for Hadoop client use only.

To configure the Oracle Database system as a Hadoop client:

Install and configure CDH3 or Apache Hadoop 0.20.2 on the Oracle Database system. If you are using Oracle Big Data Appliance, then complete the procedures for providing remote client access in the Oracle Big Data Appliance Software User's Guide. Otherwise, follow the installation instructions provided by the distributor (Cloudera or Apache).
Ensure that Oracle Database has access to Hadoop and HDFS:
1. Log in to the system where Oracle Database is running by using the Oracle Database account.
2. Open a Bash shell and enter this command:
```
$HADOOP_HOME/bin/hadoop fs -ls /user
```
  In this command, $HADOOP_HOME is the absolute path to the Hadoop home directory. You should see a list of files. If not, then first ensure that the Hadoop cluster is up and running. If the problem persists, then you must correct the Hadoop client configuration so that Oracle Database has access to the Hadoop cluster file system.

The Oracle Database system is now ready for use as a Hadoop client. No other Hadoop configuration steps are needed.
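
The access check in this procedure can be wrapped in a small helper so that it is easy to repeat after configuration changes. This is a minimal sketch: the function name check_hdfs_access and its messages are hypothetical, and the only Hadoop command it relies on is the hadoop fs -ls call shown in the procedure.

```shell
# check_hdfs_access HADOOP_BIN [HDFS_PATH]
# Runs "hadoop fs -ls" on the given HDFS path (default /user) and reports
# whether the listing succeeded. Returns 2 if the client is missing,
# 1 if the listing failed, and 0 otherwise.
check_hdfs_access() {
  hc_bin="$1"
  hc_path="${2:-/user}"
  if [ ! -x "$hc_bin" ]; then
    echo "hadoop client not found or not executable: $hc_bin" >&2
    return 2
  fi
  if "$hc_bin" fs -ls "$hc_path" >/dev/null 2>&1; then
    echo "HDFS access OK: $hc_path"
  else
    echo "cannot list $hc_path; check the cluster and the client configuration" >&2
    return 1
  fi
}
```

A typical call, run as the operating system user that owns Oracle Database, would be check_hdfs_access "$HADOOP_HOME/bin/hadoop".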

1.4.3 Installing Oracle Direct Connector for HDFS

Complete this procedure to install Oracle Direct Connector for HDFS on the Oracle Database system.

To install Oracle Direct Connector for HDFS:

Download the zip file to a directory on the system where Oracle Database runs.
Unzip orahdfs-version.zip into a directory. The unzipped files have the structure shown in Example 1-1.
Open the hdfs_stream Bash shell script in a text editor, and make the changes indicated by the comments in the script.

The hdfs_stream script is the preprocessor script for the HDFS external table.
Run the hdfs_stream script from the Oracle Direct Connector for HDFS installation directory. You should see this usage information:
```
$ bin/hdfs_stream
Oracle Direct HDFS Release 1.0.0.0.0 - Production
Copyright (c) 2011, Oracle and/or its affiliates. All rights reserved.
Usage: $HADOOP_HOME/bin/hadoop jar orahdfs.jar oracle.hadoop.hdfs.exttab.HdfsStream <locationPath>
```
If not, then ensure that the operating system user that Oracle Database is running under has the following permissions:
- Read and execute permissions on the hdfs_stream script:
```
$ ls -l DIRECTHDFS_HOME/bin/hdfs_stream
-rwxr-xr-x 1 oracle oinstall 2273 Apr 27 15:51 hdfs_stream
```
  If you do not see these permissions, then enter a chmod command to fix them:
```
$ chmod 755 DIRECTHDFS_HOME/bin/hdfs_stream
```
  In these commands, DIRECTHDFS_HOME represents the Oracle Direct Connector for HDFS home directory.
- Read permission on DIRECTHDFS_HOME/jlib/orahdfs.jar.
Create a database directory for the orahdfs-version/bin directory where hdfs_stream resides. In this example, the Oracle Direct Connector for HDFS kit is installed in /etc:
```
SQL> CREATE OR REPLACE DIRECTORY hdfs_bin_path AS '/etc/orahdfs-1.0/bin';
```
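
The file permission requirements in this procedure can be checked in one pass. The sketch below assumes the directory layout shown in Example 1-1. The function name check_directhdfs_perms is hypothetical, and the check is meaningful only when it is run as the operating system user that Oracle Database runs under.

```shell
# check_directhdfs_perms DIRECTHDFS_HOME
# Verifies that the current user can read and execute bin/hdfs_stream and
# can read jlib/orahdfs.jar. Returns 0 only if both checks pass.
check_directhdfs_perms() {
  dh_home="$1"
  dh_status=0
  if [ ! -r "$dh_home/bin/hdfs_stream" ] || [ ! -x "$dh_home/bin/hdfs_stream" ]; then
    echo "need read and execute permission: $dh_home/bin/hdfs_stream" >&2
    dh_status=1
  fi
  if [ ! -r "$dh_home/jlib/orahdfs.jar" ]; then
    echo "need read permission: $dh_home/jlib/orahdfs.jar" >&2
    dh_status=1
  fi
  return "$dh_status"
}
```

If the function reports a problem with hdfs_stream, the chmod 755 command shown earlier in the procedure restores the expected permissions.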

Example 1-1 Structure of the orahdfs Directory

orahdfs-version
   bin/
      hdfs_stream
   jlib/ 
      orahdfs.jar
   log/
   README.txt

1.4.4 Granting User Access to Oracle Direct Connector for HDFS

Oracle Database users require these privileges to use Oracle Direct Connector for HDFS:

CREATE SESSION
EXECUTE on the UTL_FILE PL/SQL package
READ and EXECUTE on the HDFS_BIN_PATH directory created during the installation of Oracle Direct Connector for HDFS. Do not grant write access to anyone. Grant EXECUTE only to those who intend to use Oracle Direct Connector for HDFS.

Example 1-2 shows the SQL commands granting these privileges to HDFSUSER.

Example 1-2 Granting Users Access to Oracle Direct Connector for HDFS

CONNECT / AS sysdba;
CREATE USER hdfsuser IDENTIFIED BY password;
GRANT CREATE SESSION TO hdfsuser;
GRANT EXECUTE ON SYS.UTL_FILE TO hdfsuser;
GRANT READ, EXECUTE ON DIRECTORY hdfs_bin_path TO hdfsuser;

1.5 Oracle Loader for Hadoop Setup

Before installing Oracle Loader for Hadoop, verify that you have the required software.

1.5.1 Required Software

Oracle Loader for Hadoop requires the following software:

A target database system running one of the following:
- Oracle Database 10g release 2 (10.2.0.5) with required patch
- Oracle Database 11g release 2 (11.2.0.2) with required patch
- Oracle Database 11g release 2 (11.2.0.3)
Note:
To use Oracle Loader for Hadoop with Oracle Database 10g release 2 (10.2.0.5) or Oracle Database 11g release 2 (11.2.0.2), you must first apply a one-off patch that addresses bug number 11897896. To access this patch, go to http://support.oracle.com and search for the bug number.
Cloudera's Distribution including Apache Hadoop version 3 (CDH3) or Apache Hadoop 0.20.2
Hive 0.7.0 or 0.7.1, if using the HiveToAvroInputFormat class

1.5.2 Installing Oracle Loader for Hadoop

Oracle Loader for Hadoop is packaged with the Oracle Database 11g release 2 client libraries and Oracle Instant Client libraries for connecting to Oracle Database 10.2.0.5, 11.2.0.2, or 11.2.0.3.

To install Oracle Loader for Hadoop:

Unpack the content of the oraloader-version.zip archive into a directory on your Hadoop cluster.

A directory named oraloader-version is created with the following subdirectories:
- doc
- jlib
- lib
- examples
Create an environment variable named OLH_HOME and set it to the installation directory.
Add $OLH_HOME/jlib/* to the HADOOP_CLASSPATH variable.
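
The last two steps might look like the following in a Bourne-compatible shell. This is a sketch under stated assumptions: the function name set_olh_env is hypothetical, and the installation directory is passed as an argument because the actual directory name depends on the version that you unpacked.

```shell
# set_olh_env INSTALL_DIR
# Sets OLH_HOME to the Oracle Loader for Hadoop installation directory and
# appends its jlib directory to HADOOP_CLASSPATH.
set_olh_env() {
  if [ ! -d "$1/jlib" ]; then
    echo "no jlib directory under $1" >&2
    return 1
  fi
  OLH_HOME="$1"
  # Append to any existing HADOOP_CLASSPATH instead of overwriting it.
  # The quotes keep the trailing * literal so that Hadoop expands it.
  HADOOP_CLASSPATH="${HADOOP_CLASSPATH:+$HADOOP_CLASSPATH:}$OLH_HOME/jlib/*"
  export OLH_HOME HADOOP_CLASSPATH
}
```

Calling the function from a login script, with the path to the unpacked oraloader-version directory as its argument, makes the settings available in every session.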

1.6 Oracle Data Integrator Application Adapter for Hadoop Setup

Installation requirements for Oracle Data Integrator (ODI) Application Adapter for Hadoop are provided in these topics:

System Requirements and Certifications
Technology-Specific Requirements
Location of ODI Application Adapter for Hadoop
Setting Up the Topology

1.6.1 System Requirements and Certifications

To use ODI Application Adapter for Hadoop, you must first have Oracle Data Integrator, which is licensed separately from Oracle Big Data Connectors. You can download ODI from the Oracle website at

http://www.oracle.com/technetwork/middleware/data-integrator/downloads/index.html

ODI Application Adapter for Hadoop requires Oracle Data Integrator 11.1.1.6.0 or later.

Before performing any installation, read the system requirements and certification documentation to ensure that your environment meets the minimum installation requirements for the products that you are installing.

The list of supported platforms and versions is available on Oracle Technology Network:

http://www.oracle.com/technetwork/middleware/data-integrator/overview/index.html

1.6.2 Technology-Specific Requirements

The list of supported technologies and versions is available on Oracle Technology Network:

http://www.oracle.com/technetwork/middleware/data-integrator/overview/index.html

1.6.3 Location of ODI Application Adapter for Hadoop

ODI Application Adapter for Hadoop is available in the xml-reference directory of the Oracle Data Integrator Companion CD.

1.6.4 Setting Up the Topology

To set up the topology, see Chapter 4, "Oracle Data Integrator Application Adapter for Hadoop."

1.7 Oracle R Connector for Hadoop Setup

Oracle R Connector for Hadoop requires the installation of a software environment on the Hadoop side and on a client Linux system.

1.7.1 Installing the Software on Hadoop

Oracle Big Data Appliance supports Oracle R Connector for Hadoop without any additional software installation or configuration.

However, to use Oracle R Connector for Hadoop on any other Hadoop cluster, you must create the necessary environment.

1.7.1.1 Software Requirements for a Third-Party Hadoop Cluster

You must install several software components on a third-party Hadoop cluster to support Oracle R Connector for Hadoop.

Install these components on third-party servers:

Cloudera's Distribution including Apache Hadoop version 3 (CDH3) or Apache Hadoop 0.20.2.

Complete the instructions provided by the distributor.
Sqoop, for the execution of functions that connect to Oracle Database. Sqoop is not required to install or load Oracle R Connector for Hadoop.

See "Installing Sqoop on a Hadoop Cluster".
Java Virtual Machine (JVM), preferably Java HotSpot Virtual Machine 6.

Complete the instructions provided at the download site at

http://www.oracle.com/technetwork/java/javase/downloads/index.html
R distribution 2.13.2 with all base libraries on all nodes in the Hadoop cluster.

Go to http://www.r-project.org/ and follow the download and installation instructions.
The ORCH package on each R engine, which must exist on every node of the Hadoop cluster.

See "Installing the ORCH Package on a Hadoop Cluster".
Open-source R bitops package installed on every node of the cluster, to support orch.pack and orch.unpack.

1.7.1.2 Installing Sqoop on a Hadoop Cluster

Sqoop provides a SQL-like interface to Hadoop, which is a Java-based environment. Oracle R Connector for Hadoop uses Sqoop for access to Oracle Database.

To install and configure Sqoop for use with Oracle Database:

Install Sqoop if it is not already installed on the server.

For Cloudera's Distribution including Apache Hadoop, see the Sqoop installation instructions in the CDH Installation Guide at

http://oracle.cloudera.com/
Download the appropriate Java Database Connectivity (JDBC) driver for Oracle Database from Oracle Technology Network at

http://www.oracle.com/technetwork/database/features/jdbc/index-091264.html
Copy the driver JAR file to $SQOOP_HOME/lib, which is a directory such as /usr/lib/sqoop/lib.
Provide Sqoop with the connection string to Oracle Database.
```
$ sqoop import --connect jdbc_connection_string
```
For example, sqoop import --connect jdbc:oracle:thin:@myhost:1521/orcl.
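
A connection string is easy to mistype, so it can help to assemble it from its parts. The sketch below assumes the Oracle thin-driver URL form jdbc:oracle:thin:@host:port/service_name. The function name oracle_thin_url is hypothetical, and the host, port, and service values are placeholders.

```shell
# oracle_thin_url HOST PORT SERVICE
# Prints a JDBC URL for the Oracle thin driver, assuming the
# host:port/service_name form of the URL.
oracle_thin_url() {
  if [ "$#" -ne 3 ]; then
    echo "usage: oracle_thin_url HOST PORT SERVICE" >&2
    return 1
  fi
  printf 'jdbc:oracle:thin:@%s:%s/%s\n' "$1" "$2" "$3"
}

oracle_thin_url myhost 1521 orcl
```

The output can be passed to Sqoop with command substitution, for example sqoop import --connect "$(oracle_thin_url myhost 1521 orcl)", followed by the remaining import options.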

1.7.1.3 Installing R on a Hadoop Cluster

You can download Oracle R Distribution 2.13.2 and get the installation instructions from the website at

http://www.oracle.com/technetwork/indexes/downloads/r-distribution-1532464.html

1.7.1.4 Installing the ORCH Package on a Hadoop Cluster

ORCH is the name of the Oracle R Connector for Hadoop package.

To install the ORCH package:

Set the environment variables for the supporting software:

$ setenv HADOOP_HOME /usr/lib/hadoop-0.2.0
$ setenv JAVA_HOME /usr/lib/jdk6
$ setenv R_HOME /usr/lib64/R
$ setenv SQOOP_HOME /usr/lib/sqoop

Unzip the downloaded file:

$ unzip orch-version.tgz.zip
Archive:  orch-0.1.8.tgz.zip

Open R and install the package:

> install.packages("/home/tmp/orch-0.1.8.tgz", repos=NULL)
Installing package(s) into ...
.
.
.
Hadoop 0.20.2-cdh3u3 is up
Sqoop 1.3.0-cdh3u3 is up

Alternatively, you can install the package from the Linux command line:

$ R CMD INSTALL orch-0.1.8.tgz
* installing *source* package 'ORCH' ...
** R
.
.
.
Hadoop 0.20.2-cdh3u3 is up
Sqoop 1.3.0-cdh3u3 is up
 
* DONE (ORCH)
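
The setenv commands in the first step are C shell syntax; in Bash, the equivalent form is export NAME=value. Before installing the package, it may be worth confirming that all four variables point to real directories. The following sketch does that; the function name check_orch_env is hypothetical.

```shell
# check_orch_env
# Verifies that HADOOP_HOME, JAVA_HOME, R_HOME, and SQOOP_HOME are each set
# to an existing directory. Returns 0 only if all four checks pass.
check_orch_env() {
  orch_status=0
  for orch_var in HADOOP_HOME JAVA_HOME R_HOME SQOOP_HOME; do
    # Look up the value of the variable whose name is stored in orch_var.
    eval "orch_val=\${$orch_var:-}"
    if [ -z "$orch_val" ]; then
      echo "$orch_var is not set" >&2
      orch_status=1
    elif [ ! -d "$orch_val" ]; then
      echo "$orch_var is not a directory: $orch_val" >&2
      orch_status=1
    fi
  done
  return "$orch_status"
}
```

Running check_orch_env in the same shell session that will start R shows which variable, if any, needs to be corrected.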

1.7.2 Providing Remote Client Access to R Users

Although R users run their programs as MapReduce jobs on the Hadoop cluster, they typically do not have individual accounts on that platform. Instead, an external Linux server provides remote access.

1.7.2.1 Software Requirements for Remote Client Access

To provide access to a Hadoop cluster to R users, install these components on a Linux server:

The same version of Hadoop as your Hadoop cluster
The same version of Sqoop as your Hadoop cluster
The same version of the Java Development Kit (JDK) as your Hadoop cluster
R distribution 2.13.2 with all base libraries
ORCH R package

To provide access to database objects, you must have the Oracle Advanced Analytics option to Oracle Database. Then you can install this additional component on the Hadoop client:

Oracle R Enterprise Client Packages

1.7.2.2 Configuring the Server as a Hadoop Client

You must install Hadoop on the client and minimally configure it for HDFS client use.

To install and configure Hadoop on the client system:

Install and configure CDH3 or Apache Hadoop 0.20.2 on the client system. This system can be the host for Oracle Database. If you are using Oracle Big Data Appliance, then complete the procedures for providing remote client access in the Oracle Big Data Appliance Software User's Guide. Otherwise, follow the installation instructions provided by the distributor (Cloudera or Apache).
Log in to the client system as an R user.
Open a Bash shell and enter this Hadoop file system command:
```
$HADOOP_HOME/bin/hadoop fs -ls /user
```
If you see a list of files, then you are done. If not, then ensure that the Hadoop cluster is up and running. If that does not fix the problem, then you must debug your client Hadoop installation.

1.7.2.3 Installing Sqoop on a Hadoop Client

Complete the same procedures on the client system for installing and configuring Sqoop as those provided in "Installing Sqoop on a Hadoop Cluster".

1.7.2.4 Installing R on a Hadoop Client

You can download R 2.13.2 and get the installation instructions from the Oracle R Distribution website at

http://oss.oracle.com/ORD/

When you are done, ensure that users have the necessary permissions to connect to the Linux server and run R.

You may also want to install RStudio Server to facilitate access by R users. See the RStudio website at

http://rstudio.org/

1.7.2.5 Installing the ORCH Package on a Hadoop Client

Complete the procedures on the client system for installing ORCH as described in "Installing the ORCH Package on a Hadoop Cluster".

1.7.2.6 Installing the Oracle R Enterprise Client Packages (Optional)

To support access to Oracle Database, install the Oracle R Enterprise client packages. Without them, Oracle R Connector for Hadoop operates only with in-memory R objects and local data files, and does not have access to Oracle Database or to the advanced statistical algorithms provided by Oracle R Enterprise.