Oracle® Big Data Connectors User's Guide Release 1 (1.0) Part Number E27365-06
This chapter introduces you to Oracle Big Data Connectors, provides installation instructions, and identifies the permissions needed for users to access the connectors.
This chapter contains these topics:
Oracle Big Data Connectors facilitate data access between data stored in a Hadoop cluster and Oracle Database. They can be licensed for use on either Oracle Big Data Appliance or a Hadoop cluster running on commodity hardware.
These are the connectors:
Oracle Direct Connector for Hadoop Distributed File System: Enables Oracle Database to access data stored in a Hadoop Distributed File System (HDFS). The data can remain in HDFS or it can be loaded into Oracle Database.
Oracle Loader for Hadoop: Provides an efficient, high-performance loader for fast movement of data from a Hadoop cluster into a table in an Oracle database. Oracle Loader for Hadoop prepartitions the data if necessary and transforms it into an Oracle-ready format. It optionally sorts records by primary key before loading the data or creating output files. Oracle Loader for Hadoop is a MapReduce application that is invoked as a command-line utility. It accepts the generic command-line options supported by the Tool interface.
Oracle Data Integrator Application Adapter for Hadoop: Extracts, transforms, and loads data from a Hadoop cluster into tables in Oracle Database, as defined using a graphical user interface.
Oracle R Connector for Hadoop: Provides an interface between a local R environment, Oracle Database, and Hadoop, allowing speed-of-thought, interactive analysis on all three platforms. Oracle R Connector for Hadoop is designed to work independently, but if the enterprise data for your analysis is also stored in Oracle Database, then the full power of this connector is achieved when it is used with Oracle R Enterprise.
Individual connectors may require that software components be installed in Oracle Database, on the Hadoop cluster, and on the user's PC. Users may also need additional access privileges in Oracle Database.
See Also:
My Oracle Support Master Note 1416116.1 and its related notes
You can download Oracle Big Data Connectors from Oracle Technology Network (OTN) or Oracle Software Delivery Cloud.
To download from OTN:
Use any browser to visit this website:
http://www.oracle.com/technetwork/bdc/big-data-connectors/downloads/index.html
Click the name of each connector to download a zip file containing the installation files.
To download from Oracle Software Delivery Cloud:
Use any browser to visit the Oracle Software Delivery Cloud website.
Accept the Terms and Restrictions to see the Media Pack Search page.
Select the search terms:
Select a Product Pack: Oracle Database
Platform: Linux x86-64
Click Go to display a list of product packs.
Select Oracle Big Data Connectors Media Pack for Linux x86-64 (B65965-0x), then click Continue.
Click Download for each connector to download a zip file containing the installation files.
Oracle Direct Connector for Hadoop Distributed File System (Oracle Direct Connector) is installed and runs on the system where Oracle Database runs. Before installing Oracle Direct Connector, verify that you have the required software.
Oracle Direct Connector requires the following software:
Cloudera's Distribution including Apache Hadoop version 3 (CDH3), or Apache Hadoop 0.20.2.
Oracle JDK 1.6.0_8 or higher for CDH3. Cloudera recommends version 1.6.0_26.
Oracle Database 11g Release 2 (11.2.0.2 or 11.2.0.3) for Linux.
To support the Data Pump file format, a one-off database patch. To download this patch, go to http://support.oracle.com and search for bug 13079417.
The same version of Hadoop on the database system as your Hadoop cluster, either CDH3 or Apache Hadoop 0.20.2.
The same version of Oracle JDK on the database system as your Hadoop cluster.
Oracle Direct Connector works as an HDFS client. You do not need to configure Hadoop on the database system to run MapReduce jobs for Oracle Direct Connector. However, you must install Hadoop on the database system and minimally configure it for HDFS client use only.
To configure the database system as a Hadoop client:
Install CDH3 or Apache Hadoop 0.20.2 on the database system. Follow the installation instructions provided by the distributor (Cloudera or Apache). Do not follow the configuration instructions.
Use a text editor to open conf/hadoop-env.sh in the Hadoop home directory on the database system, then make these changes:
Uncomment the line that begins export JAVA_HOME.
Set JAVA_HOME to the directory where JDK 1.6 is installed.
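This edit can also be scripted. The following sketch uncomments and sets JAVA_HOME with sed; the JDK path and the sample commented-out line are assumptions for illustration only, not values mandated by the product:

```shell
# Illustrative sketch: uncomment and set JAVA_HOME in conf/hadoop-env.sh.
# The JDK path below is an assumed example; substitute your JDK 1.6 location.
mkdir -p conf
printf '# export JAVA_HOME=/usr/lib/j2sdk1.5-sun\n' > conf/hadoop-env.sh   # stand-in file
sed -i 's|^# *export JAVA_HOME=.*|export JAVA_HOME=/usr/java/jdk1.6.0_26|' conf/hadoop-env.sh
```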
Edit conf/core-site.xml in the same directory to identify the NameNode of your Hadoop cluster as follows:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://host:port</value>
  </property>
</configuration>
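The file can also be generated from the command line. In this sketch, the NameNode host and port are placeholders, not defaults required by the product:

```shell
# Hypothetical sketch: write a minimal core-site.xml pointing at the
# cluster NameNode. Host and port are assumed example values.
NAMENODE_HOST=nn.example.com
NAMENODE_PORT=8020
mkdir -p conf
cat > conf/core-site.xml <<EOF
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://${NAMENODE_HOST}:${NAMENODE_PORT}</value>
  </property>
</configuration>
EOF
```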
Ensure that Oracle Database has access to Hadoop and HDFS:
Log in to the system where Oracle Database is running using the Oracle database account.
Open a bash shell and issue this command:
$HADOOP_HOME/bin/hadoop fs -ls /user
In this command, $HADOOP_HOME is the absolute path to the Hadoop home directory. You should see a list of files. If not, then first ensure that the Hadoop cluster is up and running. If the problem persists, then correct the Hadoop client configuration so that Oracle Database has access to the Hadoop cluster file system.
The database system is now ready for use as a Hadoop client. No other Hadoop configuration steps are needed.
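The access check can be wrapped in a small script for repeated use. This is a hedged sketch: the HADOOP_HOME default is an assumption, and the command only succeeds where a working Hadoop client is installed:

```shell
# Hedged sketch of the HDFS access check. The HADOOP_HOME default is an
# assumed location; set it to your actual Hadoop client install.
HADOOP_HOME=${HADOOP_HOME:-/usr/lib/hadoop}
if "$HADOOP_HOME"/bin/hadoop fs -ls /user >/dev/null 2>&1; then
  hdfs_ok=yes
else
  hdfs_ok=no
  echo "Cannot list /user in HDFS; check the cluster and client configuration" >&2
fi
```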
To install Oracle Direct Connector:
Download the zip file to a directory on the system where Oracle Database runs.
Unzip orahdfs-version.zip into a directory. The unzipped files have the structure shown in Example 1-1.
Open the hdfs_stream bash shell script in a text editor and make these changes:
HADOOP_HOME: Set to the absolute path of the Hadoop home directory.
DIRECTHDFS_HOME: Set to the absolute path of the Oracle Direct Connector installation directory.
The hdfs_stream script is the preprocessor script for the HDFS external table. Comments in the script provide complete instructions for making these changes.
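After editing, the two assignments might look like this (both paths are assumed examples; use the real locations on your system):

```shell
# Hypothetical values for the two variables edited in hdfs_stream;
# substitute the actual paths on your database system.
export HADOOP_HOME=/usr/lib/hadoop
export DIRECTHDFS_HOME=/etc/orahdfs-1.0
```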
Run the hdfs_stream script from the Oracle Direct Connector installation directory. You should see this usage information:
$ bin/hdfs_stream
Oracle Direct HDFS Release 1.0.0.0.0 - Production
Copyright (c) 2011, Oracle and/or its affiliates. All rights reserved.
Usage: $HADOOP_HOME/bin/hadoop jar orahdfs.jar oracle.hadoop.hdfs.exttab.HdfsStream <locationPath>
If not, then ensure that the operating system user under which Oracle Database runs has the following permissions:
Read and execute permissions on the hdfs_stream script:
$ ls -l DIRECTHDFS_HOME/bin/hdfs_stream
-rwxr-xr-x 1 oracle oinstall 2273 Apr 27 15:51 hdfs_stream
If you do not see these permissions, then issue a chmod command to fix them:
$ chmod 755 DIRECTHDFS_HOME/bin/hdfs_stream
In these commands, DIRECTHDFS_HOME represents the Oracle Direct Connector home directory.
Read permission on DIRECTHDFS_HOME/jlib/orahdfs.jar.
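The check-and-fix logic above can be sketched against a scratch file standing in for hdfs_stream, to show the permission bits you should end up with:

```shell
# Illustrative only: a scratch file stands in for
# DIRECTHDFS_HOME/bin/hdfs_stream. Start with wrong permissions, apply
# the documented fix, and capture the resulting mode string.
f=$(mktemp)
chmod 600 "$f"                                  # wrong: no execute bits
[ -r "$f" ] && [ -x "$f" ] || chmod 755 "$f"    # fix as described above
perms=$(ls -l "$f" | cut -c1-10)                # expected: -rwxr-xr-x
rm -f "$f"
```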
Create a database directory for the orahdfs-version/bin directory where hdfs_stream resides. In this example, the Oracle Direct Connector kit is installed in /etc:
SQL> CREATE OR REPLACE DIRECTORY hdfs_bin_path AS '/etc/orahdfs-1.0/bin';
Oracle Database users require these privileges to use Oracle Direct Connector:
CREATE SESSION
EXECUTE on the UTL_FILE PL/SQL package.
READ and EXECUTE on the HDFS_BIN_PATH directory created in the previous step. Do not grant write access to anyone. Grant EXECUTE only to those users who intend to use Oracle Direct Connector.
Example 1-2 shows the SQL commands granting these privileges to HDFSUSER.
Before installing Oracle Loader for Hadoop, verify that you have the required software.
Oracle Loader for Hadoop requires the following software:
A target database system running one of the following:
Oracle Database 10g Release 2 (10.2.0.5) with required patch
Oracle Database 11g Release 2 (11.2.0.2) with required patch
Oracle Database 11g Release 2 (11.2.0.3)
Note:
To use Oracle Loader for Hadoop with Oracle Database 10g Release 2 (10.2.0.5) or Oracle Database 11g Release 2 (11.2.0.2), you must first apply a one-off patch that addresses bug number 11897896. To access this patch, go to http://support.oracle.com and search for the bug number.
Cloudera's Distribution including Apache Hadoop (CDH3) or Apache Hadoop 0.20.2
Hive 0.7.0 or 0.7.1, if using the HiveToAvroInputFormat class
Oracle Loader for Hadoop is packaged with the Oracle Database 11g Release 2 client libraries and Oracle Instant Client libraries for connecting to Oracle Database 10.2.0.5, 11.2.0.2, or 11.2.0.3.
To install Oracle Loader for Hadoop:
Unpack the content of the oraloader-version.zip archive into a directory on your Hadoop cluster.
A directory named oraloader-version is created with the following subdirectories:
doc
jlib
lib
examples
This guide uses the variable ${OLH_HOME} to refer to this installation directory.
Add ${OLH_HOME}/jlib/* to the HADOOP_CLASSPATH variable.
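In a bash environment, the classpath addition might be sketched as follows; the OLH_HOME default is an assumed install path, not a location mandated by the product:

```shell
# Hypothetical sketch: expose the Oracle Loader for Hadoop jars to Hadoop.
# The OLH_HOME default is an assumed example path; adjust for your system.
OLH_HOME=${OLH_HOME:-/opt/oraloader-1.1.0}
export HADOOP_CLASSPATH="${OLH_HOME}/jlib/*:${HADOOP_CLASSPATH}"
```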
Installation requirements for Oracle Data Integrator Application Adapter for Hadoop are provided in these topics:
To use the Application Adapter for Hadoop, you must first have Oracle Data Integrator, which is licensed separately from Oracle Big Data Connectors. You can download Oracle Data Integrator from the Oracle website at
http://www.oracle.com/technetwork/middleware/data-integrator/downloads/index.html
Oracle Data Integrator Application Adapter for Hadoop Knowledge Modules require Oracle Data Integrator version 11.1.1.6.0 or later.
Before performing any installation, read the system requirements and certification documentation to ensure that your environment meets the minimum installation requirements for the products you are installing.
The list of supported platforms and versions is available on Oracle Technology Network:
http://www.oracle.com/technology/products/oracle-data-integrator/index.html
The list of supported technologies and versions is available on Oracle Technology Network:
http://www.oracle.com/technology/products/oracle-data-integrator/index.html
Oracle Data Integrator Application Adapter for Hadoop is available in the xml-reference directory of the Oracle Data Integrator Companion CD.
See Chapter 4, "Oracle Data Integrator Application Adapter for Hadoop."
Oracle R Connector for Hadoop requires the installation of a software environment on the Hadoop side and on a client Linux system.
Oracle Big Data Appliance supports Oracle R Connector for Hadoop without any additional software installation or configuration.
To use Oracle R Connector for Hadoop on any other Hadoop cluster, you must create the necessary environment.
Install these components on third-party servers:
Java Virtual Machine (JVM), preferably Java HotSpot Virtual Machine 6.
R distribution 2.13.2 with all base libraries on all nodes in the Hadoop cluster.
ORHC package installed on each R engine; an R engine must exist on every node of the Hadoop cluster. See the following instructions.
To install ORHC:
Set the environment variables for the Hadoop and JVM home directories:
$ setenv HADOOP_HOME /usr/lib/hadoop-0.20.2
$ setenv JAVA_HOME /usr/lib/jdk6
In this example, both home directories are in /usr/lib.
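The setenv syntax above is for csh; in bash or sh the equivalent is (same example paths, which are not required locations):

```shell
# bash/sh equivalent of the csh setenv commands above; the paths are the
# example's, not locations required by the product.
export HADOOP_HOME=/usr/lib/hadoop-0.20.2
export JAVA_HOME=/usr/lib/jdk6
```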
Unzip the downloaded file:
$ unzip orhc.tgz.zip
Archive: orhc.tgz.zip
Open R and install the package:
> install.packages("/home/tmp/orhc.tgz", repos=NULL)
Installing package(s) into ...
.
.
.
Hadoop is up and running.
Alternatively, you can install the package from the Linux command line:
$ R CMD INSTALL orhc.tgz
* installing *source* package 'ORHC' ...
** R
.
.
.
Hadoop is up and running.
* DONE (ORHC)
To give R users access to a Hadoop cluster, install these components on a Linux server:
Hadoop Client to allow access to the Hadoop cluster
For Oracle Big Data Appliance, see the Oracle Big Data Appliance Software User's Guide for detailed instructions on setting up remote client access.
Java Virtual Machine, preferably Java HotSpot Virtual Machine 6
R distribution 2.13.2
ORHC R package
Follow the steps for installing ORHC in "Installing the Server Software".
Oracle R Enterprise libraries (optional). They provide access to Oracle Database; without them, Oracle R Connector for Hadoop operates only on in-memory R objects and local data files, without access to the advanced statistical algorithms provided by Oracle R Enterprise. For example:
library(DBI)
library(ROracle)
library(OREbase)
library(OREeda)
library(OREgraphics)
library(OREstats)
library(RToXmp)
When you are done, ensure that users have the necessary permissions to connect to the Linux server and run R.