This chapter describes the Oracle Big Data Connectors and provides installation instructions.
This chapter contains the following sections:
Oracle Big Data Connectors facilitate access to data stored in an Apache Hadoop cluster. They can be licensed for use on either Oracle Big Data Appliance or a Hadoop cluster running on commodity hardware.
These are the connectors:
Oracle SQL Connector for Hadoop Distributed File System: Enables an Oracle external table to access data stored in Hadoop Distributed File System (HDFS) files or a table in Apache Hive. The data can remain in HDFS or the Hive table, or it can be loaded into an Oracle database.
Oracle Loader for Hadoop: Provides an efficient and high-performance loader for fast movement of data from a Hadoop cluster into a table in an Oracle database. Oracle Loader for Hadoop prepartitions the data if necessary and transforms it into a database-ready format. It optionally sorts records by primary key or user-defined columns before loading the data or creating output files.
Oracle XQuery for Hadoop: Runs transformations expressed in the XQuery language by translating them into a series of MapReduce jobs, which are executed in parallel on the Hadoop cluster. The input data can be located in a file system accessible through the Hadoop File System API, such as the Hadoop Distributed File System (HDFS), or stored in Oracle NoSQL Database. Oracle XQuery for Hadoop can write the transformation results to HDFS, Oracle NoSQL Database, Apache Solr, or Oracle Database. An additional XML processing capability is through XML Extensions for Hive.
Oracle Shell for Hadoop Loaders: A helper shell that provides a simple-to-use command line interface to Oracle Loader for Hadoop, Oracle SQL Connector for HDFS, and Copy to Hadoop (a feature of Big Data SQL). It has basic shell features such as command line recall, history, inheriting environment variables from the parent process, setting new or existing environment variables, and performing environmental substitution in the command line.
Oracle R Advanced Analytics for Hadoop: Provides a general computation framework, in which you can use the R language to write your custom logic as mappers or reducers. A collection of R packages provides predictive analytic techniques that run as MapReduce jobs. The code executes in a distributed, parallel manner using the available compute and storage resources on the Hadoop cluster. Oracle R Advanced Analytics for Hadoop includes interfaces to work with Apache Hive tables, the Apache Hadoop compute infrastructure, the local R environment, and Oracle database tables.
Oracle Data Integrator: Extracts, loads, and transforms data from sources such as files and databases into Hadoop and from Hadoop into Oracle or third-party databases. Oracle Data Integrator provides a graphical user interface to utilize the native Hadoop tools and transformation engines such as Hive, HBase, Sqoop, Oracle Loader for Hadoop, and Oracle SQL Connector for Hadoop Distributed File System.
Oracle Datasource for Hadoop: Provides direct, fast, parallel, secure and consistent access to master data in Oracle Database using Hive SQL, Spark SQL, as well as Hadoop APIs that support SerDes, HCatalog, InputFormat and StorageHandler.
Individual connectors may require that software components be installed in Oracle Database and either the Hadoop cluster or an external system set up as a Hadoop client for the cluster. Users may also need additional access privileges in Oracle Database. For details on integrating Oracle Database and Apache Hadoop, visit the Certification Matrix.
See Also:
My Oracle Support Information Center: Big Data Connectors (ID 1487399.2) and its related information centers.
Enterprises are seeing large amounts of data coming from multiple sources. Click-stream data in web logs, GPS tracking information, data from retail operations, sensor data, and multimedia streams are just a few examples of vast amounts of data that can be of tremendous value to an enterprise if analyzed. The unstructured and semi-structured information provided by raw data feeds is of little value in and of itself. The data must be processed to extract information of real value, which can then be stored and managed in the database. Analytics of this data along with the structured data in the database can provide new insights into the data and lead to substantial business benefits.
MapReduce is a parallel programming model for processing data on a distributed system. It can process vast amounts of data quickly and can scale linearly. It is particularly effective as a mechanism for batch processing of unstructured and semi-structured data. MapReduce abstracts lower level operations into computations over a set of keys and values.
A simplified definition of a MapReduce job is the successive alternation of two phases, the map phase and the reduce phase. Each map phase applies a transform function over each record in the input data to produce a set of records expressed as key-value pairs. The output from the map phase is input to the reduce phase. In the reduce phase, the map output records are sorted into key-value sets so that all records in a set have the same key value. A reducer function is applied to all the records in a set, and a set of output records is produced as key-value pairs. The map phase is logically run in parallel over each record, while the reduce phase is run in parallel over all key values.
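The phase structure can be sketched with ordinary shell tools standing in for the framework. This is a toy word count, not Hadoop itself; the /tmp paths are arbitrary:

```shell
# Sample input: two records (lines) of words
printf 'apple banana\napple cherry\n' > /tmp/mr_input.txt

# Map phase: emit a (word, 1) key-value pair for every word
tr -s ' ' '\n' < /tmp/mr_input.txt | awk '{print $0 "\t1"}' > /tmp/mr_mapped.txt

# Shuffle/sort: group the pairs so records with the same key are adjacent
sort /tmp/mr_mapped.txt > /tmp/mr_shuffled.txt

# Reduce phase: apply a function (here, a sum) to each key's set of values
awk -F'\t' '{c[$1] += $2} END {for (k in c) print k, c[k]}' \
    /tmp/mr_shuffled.txt | sort > /tmp/mr_result.txt
cat /tmp/mr_result.txt
```

In real MapReduce, many mapper and reducer tasks run these steps in parallel across the cluster; the pipeline above only mimics the data flow on one machine.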
Note:
Oracle Big Data Connectors 3.0 and later releases support the Yet Another Resource Negotiator (YARN) implementation of MapReduce.
Apache Hadoop is the software framework for the development and deployment of data processing jobs based on the MapReduce programming model. At the core, Hadoop provides a reliable shared storage and analysis system. Analysis is provided by MapReduce. Storage is provided by the Hadoop Distributed File System (HDFS), a shared storage system designed for MapReduce jobs.
The Hadoop ecosystem includes several other projects including Apache Avro, a data serialization system that is used by Oracle Loader for Hadoop.
Cloudera's Distribution including Apache Hadoop (CDH) is installed on Oracle Big Data Appliance. You can use Oracle Big Data Connectors on a Hadoop cluster running CDH or the equivalent Apache Hadoop components, as described in the setup instructions in this chapter.
See Also:
For conceptual information about the Hadoop technologies, the following third-party publication:
Hadoop: The Definitive Guide, Third Edition by Tom White (O'Reilly Media Inc., 2012, ISBN: 978-1449311520).
For information about Cloudera's Distribution including Apache Hadoop (CDH5), the Oracle Cloudera website at
For information about Apache Hadoop, the website at
You can download Oracle Big Data Connectors from Oracle Technology Network or Oracle Software Delivery Cloud. Both sites are cross-browser compatible.
To download from Oracle Technology Network:
http://www.oracle.com/technetwork/bdc/big-data-connectors/downloads/index.html
Click the name of each connector to download a zip file containing the installation files.
To download from Oracle Software Delivery Cloud:
You install and configure Oracle SQL Connector for Hadoop Distributed File System (HDFS) on the system where Oracle Database runs. If Hive tables are used as the data source, then you must also install and run Oracle SQL Connector for HDFS on a Hadoop client where users access Hive.
On Oracle Big Data Appliance, there is an option to include Oracle Big Data Connectors in the installation. If this was selected in the Configuration Generation Utility, then Oracle SQL Connector for HDFS is already available on the appliance. (See Chapter 4, Using Oracle Big Data Appliance Configuration Generation Utility, in the Big Data Appliance Owner’s Guide.)
This section contains the following topics:
Oracle SQL Connector for HDFS requires the following software:
On the Hadoop cluster:
Cloudera's Distribution including Apache Hadoop version 5 (CDH5), or Hortonworks Data Platform 2.4.0 or 2.5.0.
Java Development Kit (JDK). Consult the distributor of your Hadoop software (Cloudera or Apache) for the recommended version.
Hive 1.1.0, or 1.2.1 (required for Hive table access, otherwise optional)
This software is already installed on Oracle Big Data Appliance.
On the Oracle Database system and Hadoop client systems:
Oracle Database 12c (12.2.0.1 or 12.1.0.2), or Oracle Database 11g release 2 (11.2.0.4 or later).
The same version of Hadoop as your Hadoop cluster: CDH5, or Hortonworks Data Platform 2.4.0 or 2.5.0.
If you have a secure Hadoop cluster configured with Kerberos, then the Hadoop client on the database system must be set up to access a secure cluster. See "Using Oracle SQL Connector for HDFS on a Secure Hadoop Cluster."
The same version of JDK as your Hadoop cluster.
Note:
Oracle SQL Connector for HDFS requires a Hadoop client on the OS platform of the database system. This is straightforward for Linux systems. Platforms other than Linux require a tarball installation of the Hadoop client. Refer to the Oracle blog post Connecting Hadoop with Oracle and the related documents in My Oracle Support for details. The Hadoop installation can be minimally configured for Hadoop client use only; the full configuration of Hadoop is not needed. The only parts of Hadoop needed for Oracle SQL Connector for HDFS are the Hadoop JAR files and the configuration files from the Hadoop installation.
Note:
Even if there is a complete Hadoop installation on the Oracle Database system, do not start Hadoop on this system at any time. If Hadoop is running locally, then Oracle SQL Connector for HDFS attempts to connect to it instead of to the Hadoop cluster.

For Oracle RAC systems, including Oracle Exadata Database Machine, you must install and configure Oracle SQL Connector for HDFS using identical paths on all systems running Oracle instances.
Adding a Hadoop Client for use with Oracle Big Data Appliance
Oracle Big Data Appliance requires that you follow its own system-supported procedures for installing a Hadoop client. If your Hadoop system is an Oracle Big Data Appliance, see Providing Remote Access to CDH in the Oracle Big Data Appliance Software User's Guide. This section describes how to install the CDH client, configure it for use in a Kerberos-secured or non-secured environment, and verify HDFS access.
Adding a Hadoop Client for use with Other Hadoop Systems
For connections to Hadoop systems other than Oracle Big Data Appliance, download and install the Hadoop client provided by the Hadoop distributor (Cloudera or Apache). The following example shows how to connect a Hadoop client to a CDH system that is not an Oracle Big Data Appliance. You can use these steps to install the client, configure it for use in a Kerberos-secured or non-secured environment, and test to verify HDFS access.
In this case also, you need to set up Kerberos access (if Kerberos is installed) and should include a final test to make sure HDFS access is working.
Use one of these methods to obtain the files:
Download the tarball from the Cloudera tarball downloads page. Check that the Hadoop version in the filename (as in hadoop-2.5.0-cdh5.2.5.tar.gz) matches the version of the Hadoop cluster.
Click on the hdfs service in Cloudera Manager, and select the action Download Client Configuration.
Extract the files and copy them to a permanent path of your choice on the database system.
Set the HADOOP_HOME environment variable to this path and add HADOOP_HOME/bin to the PATH variable.
Ensure that JAVA_HOME points to a JDK installation with the version required by the Hadoop installation.
If your cluster is secured with Kerberos, then configure the Oracle system to permit Kerberos authentication. See "Using Oracle SQL Connector for HDFS on a Secure Hadoop Cluster."
Test HDFS access from the Oracle Database system:
Log in to the system where Oracle Database is running by using the Oracle Database account.
$ hdfs dfs -ls /user
You might need to add the directory containing the Hadoop executable file to the PATH environment variable. The default path for CDH is /usr/bin.
You should see the same list of directories that you see when you run the command directly on the Hadoop cluster. If not, then first ensure that the Hadoop cluster is up and running. If the problem persists, then you must correct the Hadoop client configuration so that Oracle Database has access to the Hadoop cluster file system.
For an Oracle RAC system, repeat this procedure for every Oracle Database instance.
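The environment setup from the client installation steps above might look like the following. The paths are hypothetical; substitute the directory where you extracted the client tarball and your actual JDK location:

```shell
# Point HADOOP_HOME at the permanent client installation directory
# (hypothetical path), and put its bin directory on PATH.
export HADOOP_HOME=/opt/hadoop-client
export PATH="$HADOOP_HOME/bin:$PATH"

# JAVA_HOME must point to a JDK of the version the Hadoop cluster requires.
export JAVA_HOME=/usr/java/default
```

With these variables set, the `hdfs dfs -ls /user` test in the procedure above should list the cluster's directories.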
Setting up Hadoop Clients on Additional Systems
You have the option to set up Hadoop clients on other servers in addition to the Oracle Database system. If you do, use the procedure provided in this section and install the same version of CDH or Apache Hadoop consistently.
Follow this procedure to install Oracle SQL Connector for HDFS on the Oracle Database system.
In addition to this required installation on the database system, you can also install Oracle SQL Connector for HDFS on any system configured as a compatible Hadoop client. This will give you the option to create Oracle Database external tables from that node.
To install Oracle SQL Connector for HDFS on the Oracle Database system:
Download the zip file to a directory on the system where Oracle Database runs.
Unpack the content of oraosch-<version>.zip.
$ unzip oraosch-<version>.zip
Archive: oraosch-<version>.zip
extracting: orahdfs-<version>.zip
inflating: README.txt
Unpack orahdfs-<version>.zip into a permanent directory:
$ unzip orahdfs-<version>.zip
Archive: orahdfs-<version>.zip
creating: orahdfs-<version>/
creating: orahdfs-<version>/log/
creating: orahdfs-<version>/examples/
creating: orahdfs-<version>/examples/sql/
inflating: orahdfs-<version>/examples/sql/mkhive_unionall_view.sql
creating: orahdfs-<version>/doc/
inflating: orahdfs-<version>/doc/README.txt
creating: orahdfs-<version>/jlib/
inflating: orahdfs-<version>/jlib/osdt_cert.jar
inflating: orahdfs-<version>/jlib/oraclepki.jar
inflating: orahdfs-<version>/jlib/osdt_core.jar
inflating: orahdfs-<version>/jlib/ojdbc7.jar
inflating: orahdfs-<version>/jlib/orahdfs.jar
inflating: orahdfs-<version>/jlib/ora-hadoop-common.jar
creating: orahdfs-<version>/bin/
inflating: orahdfs-<version>/bin/hdfs_stream
inflating: orahdfs-<version>/bin/hdfs_stream.cmd
The unzipped files have the structure shown in Example 1-1.
Open the orahdfs-<version>/bin/hdfs_stream Bash shell script in a text editor, and make the changes indicated by the comments in the script, if necessary.
The hdfs_stream script does not inherit any environment variable settings, and so they are set in the script if Oracle SQL Connector for HDFS needs them:
PATH: If the hadoop script is not in /usr/bin:/bin (the path initially set in hdfs_stream), then add the Hadoop bin directory, such as /usr/lib/hadoop/bin.

JAVA_HOME: If Hadoop does not detect Java, then set this variable to the Java installation directory. For example, /usr/bin/java.
See the comments in the script for more information about these environment variables.
The hdfs_stream script is the preprocessor for the Oracle Database external table created by Oracle SQL Connector for HDFS.
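The edits to hdfs_stream typically amount to setting these two variables near the top of the script. The values below are illustrative assumptions, not defaults; follow the script's own comments:

```shell
# Inside hdfs_stream (hypothetical values):
# Make the hadoop script findable by including its bin directory.
export PATH=/usr/lib/hadoop/bin:/usr/bin:/bin
# Point at the Java installation Hadoop should use.
export JAVA_HOME=/usr/java/default
```

Because hdfs_stream inherits no environment, anything the connector needs must be set this way inside the script itself.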
If your cluster is secured with Kerberos and the account does not already have a Kerberos ticket, then obtain one:
$ kinit
See Using Oracle SQL Connector for HDFS on a Secure Hadoop Cluster in this guide for information on acquiring and maintaining Kerberos tickets.
Run hdfs_stream from the Oracle SQL Connector for HDFS /bin directory. You should see this usage information:
$ ./hdfs_stream
Usage: hdfs_stream locationFile
If you do not see the usage statement, then ensure that the operating system user that Oracle Database is running under (such as oracle) has the following permissions:
Read and execute permissions on the hdfs_stream script:
$ ls -l OSCH_HOME/bin/hdfs_stream
-rwxr-xr-x 1 oracle oinstall Nov 27 15:51 hdfs_stream
Read permission on orahdfs.jar.
$ ls -l OSCH_HOME/jlib/orahdfs.jar
-rwxr-xr-x 1 oracle oinstall Nov 27 15:51 orahdfs.jar
If you do not see these permissions, then enter a chmod command to fix them, for example:

$ chmod 755 OSCH_HOME/bin/hdfs_stream

In the previous commands, OSCH_HOME represents the Oracle SQL Connector for HDFS home directory.
For an Oracle RAC system, repeat the previous steps for every Oracle instance, using identical path locations.
Log in to Oracle Database and create a database directory for the orahdfs-<version>/bin directory where hdfs_stream resides. For Oracle RAC systems, this directory must be accessible by all Oracle instances through identical paths.
In this example, Oracle SQL Connector for HDFS is installed in /etc:
SQL> CREATE OR REPLACE DIRECTORY osch_bin_path AS '/etc/orahdfs-<version>/bin';
To support access to Hive tables:
Ensure that the system is configured as a Hive client.
Add the Hive JAR files and the Hive conf directory to the HADOOP_CLASSPATH environment variable. To avoid JAR conflicts among the various Hadoop products, Oracle recommends that you set HADOOP_CLASSPATH in your local shell initialization script instead of making a global change to HADOOP_CLASSPATH. If there are multiple JAR file paths in HADOOP_CLASSPATH, ensure that the JARs for the current product are listed first.
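A shell-initialization entry for the Hive client setup above might look like this. The Hive paths are assumptions typical of a CDH layout; adjust them for your distribution:

```shell
# Prepend the Hive JARs and the Hive conf directory to HADOOP_CLASSPATH
# (hypothetical CDH paths). Set this in your local shell init script
# rather than in a global Hadoop configuration.
export HADOOP_CLASSPATH="/usr/lib/hive/lib/*:/etc/hive/conf:${HADOOP_CLASSPATH}"
```

Keeping the change local to your shell avoids the JAR conflicts that a global HADOOP_CLASSPATH edit can cause for other Hadoop products on the same system.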
The following figure shows the flow of data and the component locations.
Figure 1-1 Oracle SQL Connector for HDFS Installation for HDFS and Data Pump Files
The unzipped files have the structure shown in the following example.
Example 1-1 Structure of the orahdfs Directory
orahdfs-<version>
bin/
hdfs_stream
hdfs_stream.cmd
doc/
README.txt
examples/
sql/
mkhive_unionall_view.sql
jlib/
ojdbc7.jar
oraloader.jar
ora-hadoop-common.jar
oraclepki.jar
orahdfs.jar
osdt_cert.jar
osdt_core.jar
log/
Oracle Database users require these privileges in order to use Oracle SQL Connector for HDFS to create external tables:
CREATE SESSION
CREATE VIEW
READ and EXECUTE on the OSCH_BIN_PATH directory created during the installation of Oracle SQL Connector for HDFS. Do not grant write access to anyone. Grant EXECUTE only to those who intend to use Oracle SQL Connector for HDFS.
READ and WRITE on a database directory for storing external tables, or the CREATE ANY DIRECTORY system privilege. For Oracle RAC systems, this directory must be on a shared disk that all Oracle instances can access.
A tablespace and quota for copying data into the Oracle database (optional).
The following example shows the SQL commands granting these privileges to HDFSUSER.
Note:
To query an external table that uses Oracle SQL Connector for HDFS, users need the READ privilege for Oracle Database 12c or later, and the SELECT privilege for older versions of the database.
Example 1-2 Granting Users Access to Oracle SQL Connector for HDFS
CONNECT / AS sysdba;
CREATE USER hdfsuser IDENTIFIED BY password
DEFAULT TABLESPACE hdfsdata
QUOTA UNLIMITED ON hdfsdata;
GRANT CREATE SESSION, CREATE TABLE, CREATE VIEW TO hdfsuser;
GRANT EXECUTE ON sys.utl_file TO hdfsuser;
GRANT READ, EXECUTE ON DIRECTORY osch_bin_path TO hdfsuser;
GRANT READ, WRITE ON DIRECTORY external_table_dir TO hdfsuser;
Wherever Oracle SQL Connector for HDFS is installed (on the Oracle Database system, a Hadoop cluster node, or a separate system set up as a Hadoop client), the OS-level user account that logs in to use OSCH requires access to the shell variable HADOOP_CLASSPATH. This variable must include the OSCH path on the Hadoop cluster: path/orahdfs-<version>/jlib/*

Set HADOOP_CLASSPATH as shown in the following example, where the OSCH path is prepended to the current HADOOP_CLASSPATH. Putting OSCH first gives it precedence over other JARs in the path.
$ export HADOOP_CLASSPATH="/etc/orahdfs-<version>/jlib/*:$HADOOP_CLASSPATH"
When users access an external table that was created using Oracle SQL Connector for HDFS, the external table behaves like a Hadoop client. On the system where the Oracle database is running, it connects as the OS user of the Oracle process (usually oracle). For OSCH to work, this account requires read permission on the files of all OSCH users. On a non-secure cluster these files are world-readable, but on a Kerberos-secured cluster this access requires a Kerberos ticket.
For a user to authenticate using kinit:

A Hadoop administrator must register the operating system user (such as oracle) and password in the Key Distribution Center (KDC) for the cluster.

A system administrator for the Oracle Database system must configure /etc/krb5.conf and add a domain definition that refers to the KDC managed by the secure cluster.
These steps enable the operating system user to authenticate with the kinit utility before submitting Oracle SQL Connector for HDFS jobs. The kinit utility typically uses a Kerberos keytab file for authentication without an interactive prompt for a password.
The system should run kinit on a regular basis, before the Kerberos ticket expires, to enable Oracle SQL Connector for HDFS to authenticate transparently. Use cron or a similar utility to run kinit. For example, if Kerberos tickets expire every two weeks, then set up a cron job to renew the ticket weekly.

Be sure to schedule the cron job to run when Oracle SQL Connector for HDFS is not actively being used.
Do not call kinit within the Oracle SQL Connector for HDFS preprocessor script (hdfs_stream), because it could trigger a high volume of concurrent calls to kinit and create internal Kerberos caching errors.
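The scheduled renewal described above might be implemented as a small script that cron invokes. The keytab path, principal, and schedule below are hypothetical examples:

```shell
# Write a renewal script; a weekly crontab entry could then run it, e.g.:
#   0 3 * * 0 /home/oracle/renew_krb_ticket.sh
# (/tmp is used here only for illustration.)
cat > /tmp/renew_krb_ticket.sh <<'EOF'
#!/bin/sh
# Renew the Kerberos ticket non-interactively from a keytab
# (hypothetical keytab path and principal).
kinit -k -t /home/oracle/oracle.keytab oracle@EXAMPLE.COM
EOF
chmod +x /tmp/renew_krb_ticket.sh
```

Driving kinit from a keytab avoids any interactive password prompt, which is what allows the renewal to run unattended.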
Note:
Oracle Big Data Appliance configures Kerberos security automatically as a configuration option.
Oracle SQL Developer is a free graphical IDE that includes integration with Oracle Database and Oracle Big Data Connectors (among many other products). It provides wizards to help you access and use Oracle Big Data Connectors. See Using Oracle SQL Developer With Oracle Big Data Connectors in this guide for instructions on downloading Oracle SQL Developer and configuring it for use with Oracle Big Data Connectors.
Follow the instructions in these sections for setting up Oracle Loader for Hadoop:
Oracle Loader for Hadoop requires the following software:
A target database system running one of the following:
Oracle Database 12c (12.1.0.2 or 12.2.0.1)
Oracle Database 11g release 2 (11.2.0.4)
Cloudera's Distribution including Apache Hadoop version 5 (CDH5), or Hortonworks Data Platform 2.4.0 or 2.5.0.
Apache Hive 0.12.0, 0.13.0, 0.13.1, 1.1.0, or 1.2.1 if you are loading data from Hive tables.
In a typical installation, Oracle Loader for Hadoop can connect to the Oracle Database system from the Hadoop cluster or a Hadoop client. If this connection is impossible—for example, the systems are located on distinct networks—then you can use Oracle Loader for Hadoop in offline database mode.
To support offline database mode, you must install Oracle Loader for Hadoop on two systems:
The Hadoop cluster or a system set up as a Hadoop client.
The Oracle Database system or a system with network access to Oracle Database, as described in the following procedure.
To support Oracle Loader for Hadoop in offline database mode:
Related Topics
A secure Hadoop cluster has Kerberos installed and configured to authenticate client activity. An operating system user must be authenticated before initiating an Oracle Loader for Hadoop job to run on a secure Hadoop cluster. For authentication, the user must log in to the operating system where the job will be submitted and use the standard Kerberos kinit utility.

For a user to authenticate using kinit:

A Hadoop administrator must register the operating system user and password in the Key Distribution Center (KDC) for the cluster.

A system administrator for the client system, where the operating system user will initiate an Oracle Loader for Hadoop job, must configure /etc/krb5.conf and add a domain definition that refers to the KDC managed by the secure cluster.

Typically, the kinit utility obtains an authentication ticket that lasts several days. Subsequent Oracle Loader for Hadoop jobs authenticate transparently using the unexpired ticket.
Oracle Big Data Appliance configures Kerberos security automatically as a configuration option.
Oracle Shell for Hadoop Loaders (OHSH) is integrated with Big Data Connectors. It provides a set of declarative commands you can use to copy contents from Hadoop and Hive to Oracle tables using Oracle Loader for Hadoop (OLH) and Oracle SQL Connector for Hadoop Distributed File System (OSCH). It also integrates with Oracle Big Data SQL to copy contents from an Oracle Database table to a Hive table.
Prerequisites
Oracle Shell for Hadoop Loaders can work with OLH and OSCH on either a Hadoop client or edge node, a Hadoop node, or on the Oracle Database server.
OHSH requires the installation of either Copy To Hadoop or the Oracle Big Data Connectors.
The following table identifies OHSH dependencies that are not present by default on the different supported platforms. You need to add these if they do not exist.
Table 1-1 Prerequisites for Running OHSH
| Environment | Prerequisites for OHSH |
|---|---|
| Hadoop clients and Hadoop nodes | |
| Oracle Database servers | |
OHSH can be set up in any of the environments above (Hadoop client, Hadoop node, or Oracle Database Server). You need to install the software in only one of these environments.
Installing Oracle Shell for Hadoop Loaders
Follow these instructions for setting up Oracle Shell for Hadoop Loaders. The instructions are applicable to setup on a Hadoop client, an edge node, a Hadoop node, or the Oracle Database server.
Extract the contents of ohsh-<version>.zip to a directory on the database server.
The extraction creates a directory named ohsh-<version> with a README.txt file and the following subdirectories:
README.txt
/bin
/conf
/doc
/examples
/jlib
Follow the instructions contained in README.txt to configure Oracle Shell for Hadoop Loaders.
You install and configure Oracle XQuery for Hadoop on the Hadoop cluster. If you are using Oracle Big Data Appliance, then the software is already installed.
The following topics describe the software installation:
Oracle Big Data Appliance Release 4.3 and later releases meet the requirements below. However, if you are installing Oracle XQuery for Hadoop on a third-party cluster, then you must ensure that these components are installed.
Java 8.x or 7.x.
Cloudera's Distribution including Apache Hadoop Version 4.1.2 and above (including CDH 5.x)
Oracle NoSQL Database 3.x or 2.x to support reading and writing to Oracle NoSQL Database
Oracle Loader for Hadoop 3.8.0 or greater to support writing tables in Oracle databases
Take the following steps to install Oracle XQuery for Hadoop.
To install Oracle XQuery for Hadoop:
Unpack the contents of oxh-<version>.zip into the installation directory:
$ unzip oxh-<version>-cdh-<version>.zip
Archive: oxh-<version>-cdh-<version>.zip
creating: oxh-<version>-cdh<version>/
creating: oxh-<version>-cdh<version>/lib/
creating: oxh-<version>-cdh<version>/oozie/
creating: oxh-<version>-cdh<version>/oozie/lib/
inflating: oxh-<version>-cdh<version>/lib/ant-launcher.jar
inflating: oxh-<version>-cdh<version>/lib/ant.jar
.
.
.
You can now run Oracle XQuery for Hadoop.
For the fastest execution time, copy the libraries into the Hadoop distributed cache:
Copy all Oracle XQuery for Hadoop and third-party libraries into an HDFS directory. To use the -exportliboozie option to copy the files, see "Oracle XQuery for Hadoop Options". Alternatively, you can copy the libraries manually using the HDFS command line interface.
If you use Oozie, then use the same folder for all files. See "Configuring Oozie for the Oracle XQuery for Hadoop Action".
Set the oracle.hadoop.xquery.lib.share property or use the -sharelib option on the command line to identify the directory for the Hadoop distributed cache.
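A manual staging of the libraries, as an alternative to -exportliboozie, could look like the following. The HDFS target directory is an assumption; the commands are written to a script here so they can be reviewed before running against the cluster:

```shell
# Stage OXH and third-party JARs into an HDFS directory for the
# distributed cache (hypothetical HDFS path and local layout).
cat > /tmp/stage_oxh_libs.sh <<'EOF'
#!/bin/sh
hdfs dfs -mkdir -p /user/oracle/oxh/lib
hdfs dfs -put oxh-*-cdh*/lib/*.jar /user/oracle/oxh/lib
EOF
chmod +x /tmp/stage_oxh_libs.sh
```

The same HDFS directory would then be named by oracle.hadoop.xquery.lib.share or the -sharelib option.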
To support data loads into Oracle Database, install Oracle Loader for Hadoop:
Unpack the content of oraloader-<version>.x86_64.zip into a directory on your Hadoop cluster or on a system configured as a Hadoop client. This archive contains an archive and a README file.
Unzip the archive into a directory on your Hadoop cluster:
$ unzip oraloader-<version>-h2.x86_64.zip
A directory named oraloader-<version>-h2 is created with the following subdirectories:
doc jlib lib examples
Create an environment variable named OLH_HOME and set it to the installation directory. Do not set HADOOP_CLASSPATH.
To support data loads into Oracle NoSQL Database, install it, and then set an environment variable named KVHOME to the Oracle NoSQL Database installation directory.
Note:
Do not add NoSQL Database JAR files to HADOOP_CLASSPATH.
To support indexing by Apache Solr:
Ensure that Solr is installed and configured in your Hadoop cluster. Solr is included in Cloudera Search, which is installed automatically on Oracle Big Data Appliance.
Create a collection in your Solr installation into which you will load documents. To create a collection, use the solrctl utility.
See Also:
For the solrctl utility, the Cloudera Search User Guide at
Configure Oracle XQuery for Hadoop to use your Solr installation by setting the OXH_SOLR_MR_HOME environment variable to the local directory containing search-mr-<version>.jar and search-mr-<version>-job.jar. For example:
$ export OXH_SOLR_MR_HOME="/usr/lib/solr/contrib/mr"
Note:
Configure Oracle XQuery for Hadoop and set the OXH_SOLR_MR_HOME environment variable to the local directory before using the Apache Tika adapter as well.
If Oracle XQuery for Hadoop fails to find its own or third-party libraries when running queries, first ensure that the environment variables were set correctly during Oracle XQuery for Hadoop installation.
Note:
The HADOOP_CLASSPATH environment variable or -libjars command line option must not contain either an OXH or third-party library.
If they are set correctly, then you may need to edit lib/oxh-lib.xml. This file identifies the location of Oracle XQuery for Hadoop system JAR files and other libraries, such as Avro, Oracle Loader for Hadoop, and Oracle NoSQL Database.

If necessary, you can reference environment variables in this file as ${env.variable}, such as ${env.OLH_HOME}. You can also reference Hadoop properties as ${property}, such as ${mapred.output.dir}.
Related Topics
You can use Apache Oozie workflows to run your queries, as described in "Running Queries from Apache Oozie". The software is already installed and configured on Oracle Big Data Appliance.
For other Hadoop clusters, you must first configure Oozie to use the Oracle XQuery for Hadoop action. These are the general steps to install the Oracle XQuery for Hadoop action:
The specific steps depend on your Oozie installation, such as whether Oozie is already installed and which version you are using.
An overview of Oracle R Advanced Analytics for Hadoop (ORAAH) is provided in Part IV of this guide.
Release notes, installation instructions, comprehensive reference material, and a list of changes in the current release are published separately on the Oracle Technology Network:
Each ORAAH release is compatible with a number of Oracle Big Data Appliance releases and releases of CDH running on non-Oracle platforms.
For a complete ORAAH compatibility matrix, see Document 2225633.1 on My Oracle Support.
Oracle Big Data Appliance supports Oracle R Advanced Analytics for Hadoop without any additional software installation or configuration. However, to use Oracle R Advanced Analytics for Hadoop on a third-party Hadoop cluster, you must create the necessary environment.
You must install several software components on a third-party Hadoop cluster to support Oracle R Advanced Analytics for Hadoop.
Install these components on third-party servers:
Cloudera's Distribution including Apache Hadoop version 5 (CDH5) or Apache Hadoop 0.20.2+923.479 or later.
Complete the instructions provided by the distributor.
Apache Hive 0.10.0+67 or later
Sqoop 1.3.0+5.95 or later, for the execution of functions that connect to Oracle Database. Sqoop is not required to install or load Oracle R Advanced Analytics for Hadoop.
Mahout, for the execution of orch_lmf_mahout_als.R.
Java Virtual Machine (JVM), preferably Java HotSpot Virtual Machine 6.
Complete the instructions provided at the download site at
http://www.oracle.com/technetwork/java/javase/downloads/index.html
Oracle R Distribution 3.0.1 with all base libraries on all nodes in the Hadoop cluster.
The ORCH package on each R engine, which must exist on every node of the Hadoop cluster.
See "Installing the ORCH Package on a Third-Party Hadoop Cluster".
Oracle Loader for Hadoop to support the OLH driver (optional).
Note:
Do not set HADOOP_HOME on the Hadoop cluster. CDH5 does not need it, and it interferes with Oracle R Advanced Analytics for Hadoop. If you must set HADOOP_HOME for another application, then also set HADOOP_LIBEXEC_DIR in the /etc/bashrc file. For example:
export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec
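If you do need both variables, a /etc/bashrc fragment pairing them might look like the following. The paths are typical CDH locations and are assumptions; verify them on your system:

```shell
# Assumption: typical CDH install paths; confirm with your distribution.
export HADOOP_HOME=/usr/lib/hadoop
# Setting HADOOP_LIBEXEC_DIR alongside HADOOP_HOME keeps Oracle R Advanced
# Analytics for Hadoop working when another application forces HADOOP_HOME.
export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec
echo "$HADOOP_LIBEXEC_DIR"
```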
Sqoop provides a SQL-like interface to Hadoop, which is a Java-based environment. Oracle R Advanced Analytics for Hadoop uses Sqoop for access to Oracle Database.
Note:
Sqoop is required even when using Oracle Loader for Hadoop as a driver for loading data into Oracle Database. Sqoop performs additional functions, such as copying data from a database to HDFS and sending free-form queries to a database. Oracle R Advanced Analytics for Hadoop also uses Sqoop to perform operations that Oracle Loader for Hadoop does not support.
To install and configure Sqoop for use with Oracle Database:
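The exact steps vary by distribution, but a minimal sketch follows. The Sqoop home, the JAR name, and the connect string are assumptions; the script stages a placeholder JAR into a scratch directory so the sequence can be run as-is, and you would substitute your real paths and the Oracle JDBC driver downloaded from Oracle:

```shell
# Sketch only: SQOOP_HOME, the driver JAR name, and the connect string are
# assumptions; replace them with your real values.
SQOOP_HOME=${SQOOP_HOME:-/tmp/sqoop-demo}   # e.g. /usr/lib/sqoop on CDH
mkdir -p "$SQOOP_HOME/lib"
touch ojdbc6.jar                            # placeholder for the real Oracle JDBC driver
# Sqoop reaches Oracle Database through the Oracle JDBC driver, which must
# be on Sqoop's classpath:
cp ojdbc6.jar "$SQOOP_HOME/lib/"
ls "$SQOOP_HOME/lib"
# Typical connectivity check (hypothetical host, service, and user):
#   sqoop list-tables --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
#     --username SCOTT -P
```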
Hive provides an alternative storage and retrieval mechanism to HDFS files through a querying language called HiveQL. Oracle R Advanced Analytics for Hadoop uses the data preparation and analysis features of HiveQL, while enabling you to use R language constructs.
To install Hive:
You can download R 2.13.2 and get the installation instructions from the Oracle R Distribution website at
When you are done, ensure that users have the necessary permissions to connect to the Linux server and run R.
You may also want to install RStudio Server to facilitate access by R users. See the RStudio website at
Your Hadoop cluster must have libpng-devel installed on every node. If you are using a cluster running on commodity hardware, then you can follow the same basic procedures. However, you cannot use the dcli utility to replicate the commands across all nodes. See Oracle Big Data Appliance Owner's Guide for the syntax of the dcli utility.
To install libpng-devel:
Log in as root to any node in your Hadoop cluster.
Check whether libpng-devel is already installed:
# dcli rpm -qi libpng-devel
bda1node01: package libpng-devel is not installed
bda1node02: package libpng-devel is not installed
.
.
.
If the package is already installed on all servers, then you can skip this procedure.
If you need a proxy server to go outside a firewall, then set the HTTP_PROXY environment variable. This example uses dcli, which is available only on Oracle Big Data Appliance:
# dcli export HTTP_PROXY="http://proxy.example.com"
Change to the yum directory:
# cd /etc/yum.repos.d
Download and configure the appropriate configuration file for your version of Linux:
For Enterprise Linux 5 (EL5):
Download the yum configuration file:
# wget http://public-yum.oracle.com/public-yum-el5.repo
Open public-yum-el5.repo in a text editor and make these changes:
Under el5_latest, set enabled=1
Under el5_addons, set enabled=1
Save your changes and exit.
Copy the file to the other Oracle Big Data Appliance servers:
# dcli -d /etc/yum.repos.d -f public-yum-el5.repo
For Oracle Linux 6 (OL6):
Download the yum configuration file:
# wget http://public-yum.oracle.com/public-yum-ol6.repo
Open public-yum-ol6.repo in a text editor and make these changes:
Under ol6_latest, set enabled=1
Under ol6_addons, set enabled=1
Save your changes and exit.
Copy the file to the other Oracle Big Data Appliance servers:
# dcli -d /etc/yum.repos.d -f public-yum-ol6.repo
Install the package on all servers:
# dcli yum -y install libpng-devel
bda1node01: Loaded plugins: rhnplugin, security
bda1node01: Repository 'bda' is missing name in configuration, using id
bda1node01: This system is not registered with ULN.
bda1node01: ULN support will be disabled.
bda1node01: http://bda1node01-master.abcd.com/bda/repodata/repomd.xml:
bda1node01: [Errno 14] HTTP Error 502: notresolvable
bda1node01: Trying other mirror.
.
.
.
bda1node01: Running Transaction
bda1node01: Installing : libpng-devel 1/2
bda1node01: Installing : libpng-devel 2/2
bda1node01: Installed:
bda1node01: libpng-devel.i386 2:1.2.10-17.el5_8 libpng-devel.x86_64 2:1.2.10-17.el5_8
bda1node01: Complete!
bda1node02: Loaded plugins: rhnplugin, security
.
.
.
Verify that the installation was successful on all servers:
# dcli rpm -qi libpng-devel
bda1node01: Name : libpng-devel Relocations: (not relocatable)
bda1node01: Version : 1.2.10 Vendor: Oracle America
bda1node01: Release : 17.el5_8 Build Date: Wed 25 Apr 2012 06:51:15 AM PDT
bda1node01: Install Date: Tue 05 Feb 2013 11:41:14 AM PST Build Host: ca-build56.abcd.com
bda1node01: Group : Development/Libraries Source RPM: libpng-1.2.10-17.el5_8.src.rpm
bda1node01: Size : 482483 License: zlib
bda1node01: Signature : DSA/SHA1, Wed 25 Apr 2012 06:51:41 AM PDT, Key ID 66ced3de1e5e0159
bda1node01: URL : http://www.libpng.org/pub/png/
bda1node01: Summary : Development tools for programs to manipulate PNG image format files.
bda1node01: Description :
bda1node01: The libpng-devel package contains the header files and static
bda1node01: libraries necessary for developing programs using the PNG (Portable
bda1node01: Network Graphics) library.
.
.
.
Although R users run their programs as MapReduce jobs on the Hadoop cluster, they do not typically have individual accounts on that platform. Instead, an external Linux server provides remote access.
To provide R users with access to a Hadoop cluster, install these components on a Linux server:
The same version of Hadoop as your Hadoop cluster; otherwise, unexpected issues and failures can occur
The same version of Sqoop as your Hadoop cluster; required only to support copying data in and out of Oracle databases
Mahout; required only for the orch.ls function with the Mahout ALS-WS algorithm
The same version of the Java Development Kit (JDK) as your Hadoop cluster
ORCH R package
To provide access to database objects, you must have the Oracle Advanced Analytics option to Oracle Database. Then you can install this additional component on the Hadoop client:
Oracle R Enterprise Client Packages
You must install Hadoop on the client and minimally configure it for HDFS client use.
To install and configure Hadoop on the client system:
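As a rough sketch of the HDFS-client configuration (the directory and NameNode address below are assumptions, and in practice you copy the cluster's own configuration files rather than writing them by hand):

```shell
# Sketch: point HADOOP_CONF_DIR at a copy of the cluster's configuration so
# 'hdfs' commands on the client resolve the cluster's NameNode.
# /tmp/hadoop-conf-demo and namenode.example.com are assumptions.
mkdir -p /tmp/hadoop-conf-demo
cat > /tmp/hadoop-conf-demo/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
EOF
export HADOOP_CONF_DIR=/tmp/hadoop-conf-demo
# Then verify client access to the cluster:
#   hdfs dfs -ls /
```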
Complete the same procedures on the client system for installing and configuring Sqoop as those provided in "Installing Sqoop on a Third-Party Hadoop Cluster".
You can download R 2.13.2 and get the installation instructions from the Oracle R Distribution website at
When you are done, ensure that users have the necessary permissions to connect to the Linux server and run R.
You may also want to install RStudio Server to facilitate access by R users. See the RStudio website at
To install ORCH on your Hadoop client system:
To support full access to Oracle Database using R, install the Oracle R Enterprise Release 1.4 client packages. Without them, Oracle R Advanced Analytics for Hadoop does not have access to the advanced statistical algorithms provided by Oracle R Enterprise.
See Also:
Oracle R Enterprise User's Guide for information about installing R and Oracle R Enterprise
For instructions on setting up and using Oracle Data Integrator, see Oracle Fusion Middleware Integrating Big Data with Oracle Data Integrator.
Software Requirements
Oracle Datasource for Apache Hadoop requires the following software:
A target database system running Oracle Database 12c, Oracle Database 11.2.0.4, or an earlier Oracle Database release that can be queried with the Oracle JDBC driver for 12c.
Note that Oracle Database 11.2.0.4 and potentially earlier Oracle Database releases may work. However, some of the SPLIT patterns have dependencies on Oracle Database 12c and might not provide accurate splits for parallel Hadoop jobs when used to query earlier releases of Oracle Database.
Cloudera's Distribution including Apache Hadoop version 5 (CDH5), Hortonworks Data Platform (HDP) 2.x, or Apache Hadoop 2.2.0 to 2.6.0.
Apache Hive 0.13.0, 0.13.1, or 1.1.0 (required to query data from Oracle Database tables).
Installing Oracle Datasource for Hadoop
Unpack the contents of od4h-*.zip into a directory on your Hadoop cluster or on a system configured as a Hadoop client. A directory named od4h will be created with the following subdirectories:
doc
jlib
Create a variable named OD4H_HOME and set it to the installation directory created in Step 1.
Add OD4H_HOME/jlib/* to the HADOOP_CLASSPATH variable. When using OD4H, OD4H_HOME/jlib should be listed first in HADOOP_CLASSPATH to avoid conflict with other versions of JARs with the same name in HADOOP_CLASSPATH.
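The steps above amount to the following shell setup, where /opt/od4h stands in for the installation directory from Step 1 (it is an assumption, not a required path):

```shell
# /opt/od4h is a hypothetical install location from Step 1.
export OD4H_HOME=/opt/od4h
# List OD4H's JARs first so they take precedence over same-named JARs
# elsewhere on the classpath.
export HADOOP_CLASSPATH="$OD4H_HOME/jlib/*:$HADOOP_CLASSPATH"
echo "$HADOOP_CLASSPATH"
```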