Copy to Hadoop
Copy to Hadoop makes it easy to identify and copy data from an Oracle Database table to the Apache Hadoop Distributed File System (HDFS) and create an Apache Hive external table over that data. In HDFS, the data can be saved in Oracle Data Pump files (which store data as Oracle data types). This provides an exact copy of the data in the database, without the potential for data type conversion errors. You can then query the data via Hive (and Hadoop technologies that access data through Hive) and by Oracle Big Data SQL. Oracle Data Pump data can be converted to Parquet or ORC using Hive
Oracle Shell for Hadoop Loaders (OHSH)
OHSH is an intuitive command line tool to move data between Hadoop and Oracle Database. It provides a set of declarative commands you can use to copy contents from an Oracle Database table to a Hive table.
OHSH is bundled with both Oracle Big Data SQL and Oracle Big Data Connectors. With Oracle Big Data SQL, OHSH works with Copy to Hadoop. With Oracle Big Data Connectors, OHSH provides an interface to copy contents from Hadoop and Hive to Oracle tables using Oracle Loader for Hadoop (OLH) and Oracle SQL Connector for Hadoop Distributed File System (OSCH).
Although OHSH can serve as a front end for OLH and OSCH as well as for Copy to Hadoop, OLH and OSCH are part of the Oracle Big Data Connectors product and are not installed by Oracle Big Data SQL.
Installation and Configuration of Copy to Hadoop and OHSH
You can use Copy to Hadoop and OHSH from any of these environments:
A Hadoop cluster node
An Oracle Database node
An edge node.
Oracle Big Data SQL installs the kits for Copy to Hadoop and Oracle Shell for Hadoop Loaders on the Hadoop cluster management server and on any Oracle Database server node where you install Oracle Big Data SQL.
You can also manually copy the kits for both tools from the Hadoop cluster management server to an edge node and configure them for use from the edge node. The kits are stored at /opt/oracle/orahivedp-<version>.
. See /opt/oracle/ohsh-<version>/doc/OHSH_EDGE_NODE_README.txt
on the Hadoop cluster management server for instructions.
Note:
In the Oracle Big Data SQL installation, the Copy to Hadoop is fully automated except for the setting of theCP2HADOOP_HOME
listed in the table below. You must set this manually.
You must complete configuration of Oracle Shell for Hadoop Loaders manually, as described in this section.
Environment Variables for OHSH
On the Hadoop cluster management server, the OHSH configuration file is/opt/oracle/ohsh-<version>/bin/ohsh_config.sh
. On an Oracle Database node, it is $ORACLE_HOME/ohsh/bin/ohsh_config.sh
. This file contains environment required by OHSH. The set of variables needed the installation on a Hadoop cluster, Oracle Database Node or an edge node differ. The table below shows which variables you need to set for each environment. Some of the required variables are set for you automatically by the Oracle Big Data SQL installer.
The Oracle Big Data SQL installation automatically copies the OHSH installation kit to each node where Oracle Big Data SQL is installed. There are some manual steps you must perform to complete the installation.
Note:
The README file at$ORACLE_HOME/ohsh/doc/OHSH_DB_NODE_README.txt
on the Hadoop cluster management server where you installed Oracle Big Data SQL describes the full set of steps ordinarily needed to enable OHSH on an Oracle Database node. However, when OHSH is installed by Oracle Big Data SQL, some of the setup is done for you by the installer. To complete the configuration in this context, use this shortened procedure instead of the full procedure documented in the README.
Also notice below that patch 23237037 is required for 12.1.0.2 (or earlier Oracle Database releases) only. Do not install this patch if you are running a later release of Oracle Database.
Log on to the Oracle Database server as the oracle
user.
Set up OHSH to work with Copy to Hadoop.
Create a staging directory in the file system on the Oracle Database node:
$ mkdir <OS_STAGE_DIR>
Create an Oracle directory on this staging directory and grant read/write privileges to the Oracle Database user (oracle
or other). In this example, ORACLE_STAGE_DIR refers to the name of the directory object in Oracle Database. OS_STAGE_DIR refers to the path on the server.
SQL> CREATE DIRECTORY <ORACLE_STAGE_DIR> AS '<OS_STAGE_DIR>'; SQL> GRANT READ,WRITE ON DIRECTORY <ORACLE_STAGE_DIR> TO <ORACLE_USER>;
Remember:
When you use OHSH later to copy Oracle tables to Hive using the "stage" method, use the OHSHset locationdirectory
command to point to the Oracle directory object name (the name of the directory you created above in this step).If you are connecting to a CDH-based Hadoop system (Oracle Big Data Appliance or other), perform the substeps below to enable the OHSH “FUSE” method, in which you copy files to Hive tables over an HDFS Mountable FUSE directory. Note that HDP does not support Fuse, therefore you cannot create a mount point to HDFS on HDP using these steps.
If you are using Oracle Database 12.1.0.2 or earlier, install patch 23237037 (available on My Oracle Support) on all database nodes.
On the Oracle Database host, create a FUSE mount to the root of the HDFS system on the Hadoop cluster. Refer to Cloudera’s documentation on installing FUSE. You may find current instructions at Configuring Mountable HDFS on Cloudera’s website.
Create the Oracle directory on the FUSE mount and grant privileges. In this example, ORACLE_FUSE_DIR refers to the name of the directory object in Oracle Database. OS_FUSE_MOUNT refers to the path to the mount point on the server.
SQL> CREATE DIRECTORY <ORACLE_FUSE_DIR> AS '<OS_FUSE_MOUNT>'; SQL> GRANT READ,WRITE ON DIRECTORY <ORACLE_FUSE_DIR> TO <ORACLE_USER>;
Note that when you use OHSH later to copy Oracle tables to Hive using the "FUSE" method, use the OHSH set locationdirectory
command to point to the same <ORACLE_FUSE_DIR>
Edit $ORACLE_HOME/ohsh/bin/ohsh_config.sh
in order to configure the home directories of dependencies. When OHSH is installed by Oracle Big Data SQL, most of the required variables are already set for you. Set HS2_HOST_PORT
and (if needed) HIVE_SESS_VAR_LIST
as follows.
export HS2_HOST_PORT=<HS2_HOST>:<HS2_PORT>
Also set HIVE_SESS_VAR_LIST
to any required Hive session variables. For example:
To connect to HiveServer2 with Kerberos authentication: export HIVE_SESS_VAR_LIST="principal=<The server principal of HiveServer2>"
The Hive principal is specified by the configuration property hive.server2.authentication.kerberos.principal
in hive-site.xml
.
To connect to HiveServer2 running in HTTP mode: export HIVE_SESS_VAR_LIST="transportMode=http;httpPath=<The HTTP endpoint>"
To connect to HiveServer2 when SSL is enabled: export HIVE_SESS_VAR_LIST="ssl=true;sslTrustStore=<Trust Store path>;trustStorePassword=<Trust Store password>"
If you want to configure OHSH to use Oracle Wallet, set environment the variables TNS_ADMIN
and WALLET_LOCATION
to the directory path where TNS and Oracle Wallet are stored.
export WALLET_LOCATION="<wallet_location>" export TNS_ADMIN="<tns_admin_location>"
Note:
Later, when you start an OHSH session, if the TNS_ADMIN
and WALLET_LOCATION
are different on the Hadoop cluster then set hadooptnsadmin
and hadoopwalletlocation
:
set hadooptnsadmin <tns_admin_hadoop_node> set hadoopwalletlocation <wallet_location_hadoop_node>
Start an OHSH session and check for resources. This check indicates whether or not the configuration is correct.
$ ohsh ohsh> show resources
You should see the resources hadoop0
, hive0
, and bash0
.
On the Hadoop cluster management server, Oracle Big Data SQL installs OHSH from an RPM to /opt/oracle/ohsh
. Perform the steps below as root
in order to complete the Hadoop side of the configuration.
Note:
The installation instructions at$ORACLE_HOME/ohsh/doc/OHSH_HADOOP_NODE_README.txt
on the Hadoop cluster management server contain extra information that is not applicable to the installation of Copy to Hadoop and OHSH in the context of the Oracle Big Data SQL installation. Use the instructions below instead.If the Hadoop system is not an Oracle Big Data Appliance, then download the SQL*Plus Instant Client and Oracle Instant Client for Linux Instant Client Downloads for Linux x86.
Check to be sure that the version of the packages you select match the version of the Oracle Database. If the Oracle Database is 12.1.0.2, you can use the following downloads:
Extract both packages into the same directory.
Add the directory to PATH
and LD_LIBRARY_PATH
.
Create a local directory for TNS and from TNS_ADMIN
on the Oracle Database host, copy the tnsnames.ora
and sqlnet.ora
file into this directory.
sqlnet.ora
points to the correct Oracle Wallet location.
WALLET_LOCATION=
(SOURCE=(METHOD=FILE)(METHOD_DATA=
(DIRECTORY=<your wallet directory>)))
sqlnet.wallet_override=true
/opt/oracle/ohsh/bin/ohsh_config.sh
. Set the following environment variables.
export CP2HADOOP_HOME=/opt/oracle/orahivedp-<version export HS2_HOST_PORT=<HiveServer2 hostname>:<Port number>
Also set HIVE_SESS_VAR_LIST
to any required Hive session variables. For example:
To connect to HiveServer2 with Kerberos authentication: export HIVE_SESS_VAR_LIST="principal=<The server principal of HiveServer2>"
To connect to HiveServer2 running in HTTP mode: export HIVE_SESS_VAR_LIST="transportMode=http;httpPath=<The HTTP endpoint>"
To connect to HiveServer2 when SSL is enabled: export HIVE_SESS_VAR_LIST="ssl=true;sslTrustStore=<Trust Store path>;trustStorePassword=<Trust Store password>"
If TNS and Oracle Wallet are enabled, also add the following variables to the file and export them:
export WALLET_LOCATION="<your wallet directory>" export TNS_ADMIN="<your TNS directory>"
Add /opt/oracle/ohsh
to the PATH variable
PATH=/opt/oracle/ohsh-<version>/bin:$PATH
Start an OHSH session and check for resources. This check indicates whether or not the configuration is correct.
$ ohsh ohsh> show resources
You should see the resources hadoop0
, hive0
, and bash0
.
See $ORACLE_HOME/ohsh/doc
and $ORACLE_HOME/ohsh/examples
for examples that show how to run the OHSH load methods.
See the Oracle Big Data SQL User’s Guide more information on Copy to Hadoop and Oracle Shell for Hadoop Loaders.