4 Copying Oracle Tables to Hadoop

This chapter describes how to use Copy to Hadoop to copy tables in an Oracle database to Hadoop. It contains the following sections:

4.1 What Is Copy to Hadoop?

Oracle Big Data SQL includes the Oracle Copy to Hadoop utility. This utility makes it simple to identify and copy Oracle data to the Hadoop Distributed File System. It can be accessed through a command-line interface or through Oracle SQL Developer. Data exported to the Hadoop cluster by Copy to Hadoop is stored in Oracle Data Pump format. This format optimizes queries through Big Data SQL:

  • The data is stored as Oracle data types – eliminating data type conversions

  • The data is queried directly – without requiring the overhead associated with Java SerDes

After generating Data Pump format files from the tables and copying the files to HDFS, you can use Apache Hive to query the data. Hive can process the data locally without accessing Oracle Database. When the Oracle table changes, you can refresh the copy in Hadoop. Copy to Hadoop is primarily useful for Oracle tables that are relatively static, and thus do not require frequent refreshes.

Copy to Hadoop is licensed under Oracle Big Data SQL. You must have an Oracle Big Data SQL license in order to use this utility.

4.2 Getting Started Using Copy to Hadoop

Take the following steps to use Copy to Hadoop (a condensed sketch of the workflow follows the list):

  1. Ensure that your system meets the prerequisites, and that the required software is installed on both the Hadoop cluster (on Oracle Big Data Appliance or another Hadoop system) and on the Oracle Database server (Oracle Exadata Database Machine or other).
  2. On the Oracle Database server, connect to Oracle Database and generate Data Pump format files containing the table data and metadata.
  3. Copy the files to HDFS on the Hadoop cluster.
  4. Connect to Apache Hive and create an external table from the files.
  5. Query this Hive table the same as you would any other Hive table.
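
In outline, steps 2 through 5 look like the following condensed sketch, which borrows the names used later in this chapter: a database directory named EXPDIR, a source table named CUSTOMERS, and an HDFS directory named /user/oracle/customers. The full walkthrough is in "Example Using the Sample Schemas".

-- Step 2: connected to Oracle Database, generate a Data Pump format file
CREATE TABLE export_customers
   ORGANIZATION EXTERNAL
   (
   TYPE oracle_datapump
   DEFAULT DIRECTORY expdir
   LOCATION('customers.dmp')
   )
AS SELECT * FROM customers;

-- Step 3: from a shell on the Oracle Database server, copy the file to HDFS
--   $ hadoop fs -put /expdir/customers.dmp /user/oracle/customers

-- Step 4: in Hive, create an external table over the Data Pump file
CREATE EXTERNAL TABLE customers
   ROW FORMAT SERDE 'oracle.hadoop.hive.datapump.DPSerDe'
   STORED AS
      INPUTFORMAT  'oracle.hadoop.hive.datapump.DPInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
   LOCATION '/user/oracle/customers';

-- Step 5: query the Hive table as you would any other
SELECT COUNT(*) FROM customers;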

4.3 Installing Copy to Hadoop

Where Oracle Big Data SQL is installed, Copy to Hadoop is available on the Oracle Database server that is connected to the Hadoop cluster.

4.3.1 Prerequisites for Copy to Hadoop

For network connections, supported Oracle Database levels, and other requirements, see the general requirements for installing Oracle Big Data SQL on your platform: Installing Oracle Big Data SQL.

4.3.2 Installing Copy to Hadoop in an Oracle Big Data Appliance/Oracle Exadata Database Machine Environment

If Oracle Big Data SQL is installed, Copy to Hadoop is available on the Oracle Exadata Database Machine connected to Oracle Big Data Appliance.

4.3.2.1 Installing Copy to Hadoop on Oracle Big Data Appliance

Copy to Hadoop is a component of Oracle Big Data SQL, which is an installation option on Oracle Big Data Appliance. You can enable Oracle Big Data SQL either during the initial software installation or at a later time using the standard methods for enabling and disabling services. See "Performing the Installation".

4.3.2.2 Installing Copy to Hadoop on Oracle Exadata Database Machine

Copy to Hadoop requires only a Hadoop client on Oracle Exadata Database Machine. It does not use the additional software that Oracle Big Data SQL requires.

If you plan to use Oracle Big Data SQL, then the Hadoop client is created automatically when you run the bds-exa-install.sh installation script. In this case, you do not need to take any additional steps. See "Running the Post-Installation Script for Oracle Big Data SQL".

If you do not plan to use Oracle Big Data SQL at this time, then you can install the Hadoop client manually instead of running the script.

4.3.3 Installing Copy to Hadoop on Other Systems

Where Oracle Big Data SQL is installed, Copy to Hadoop is available on the Oracle Database server that is connected to the Hadoop cluster.

4.3.3.1 Installing Copy to Hadoop on a Hadoop Cluster (Other than an Oracle Big Data Appliance Cluster)

Copy to Hadoop is a component of Oracle Big Data SQL and requires additional setup after the Oracle Big Data SQL installation. See the following MOS note for instructions — Big Data SQL 3.0: Installing and Configuring Copy to Hadoop (Doc ID 2115762.1). Contact Oracle Support if you have any questions.

4.3.3.2 Installing Copy to Hadoop on an Oracle Database Server (Other Than Oracle Exadata Database Machine)

In order to use Copy to Hadoop, additional configuration steps are required after the Oracle Big Data SQL installation. See the following MOS note for instructions — Big Data SQL 3.0: Installing and Configuring Copy to Hadoop (Doc ID 2115762.1). Contact Oracle Support if you have any questions.

4.4 Generating the Data Pump Files

The SQL CREATE TABLE statement has a clause specifically for creating external tables, in which you specify the ORACLE_DATAPUMP access driver. The information that you provide in this clause enables the access driver to generate a Data Pump format file that contains the data and metadata from the Oracle database table.

This section contains the following topics:

4.4.1 About Data Pump Format Files

Data Pump files are typically used to move data and metadata from one database to another. Copy to Hadoop uses this file format to copy data from an Oracle database to HDFS.

To generate Data Pump format files, you create an external table from an existing Oracle table. An external table in Oracle Database is an object that identifies and describes the location of data outside of a database. External tables use access drivers to parse and format the data. For Copy to Hadoop, you use the ORACLE_DATAPUMP access driver. It copies the data and metadata from internal Oracle tables and populates the Data Pump format files of the external table.

4.4.2 Identifying the Target Directory

You must have read and write access to a database directory in Oracle Database. Only Oracle Database users with the CREATE ANY DIRECTORY system privilege can create directories.

This example creates a database directory named EXPORTDIR that points to the /exportdir directory on the Oracle Database server (Oracle Exadata Database Machine or other):

SQL> CREATE DIRECTORY exportdir AS '/exportdir';
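
If the user generating the files is not the one who created the directory, that user also needs explicit access. A minimal sketch, assuming a hypothetical user named SCOTT (the sample schema example later in this chapter grants the same privileges to SH):

SQL> GRANT READ, WRITE ON DIRECTORY exportdir TO scott;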

4.4.3 About the CREATE TABLE Syntax

The following is the basic syntax of the CREATE TABLE statement for Data Pump format files:

CREATE TABLE table_name
   ORGANIZATION EXTERNAL (
     TYPE oracle_datapump
     DEFAULT DIRECTORY database_directory
     LOCATION ('filename1.dmp','filename2.dmp'...)
     ) PARALLEL n
   AS SELECT * FROM tablename;

DEFAULT DIRECTORY

Identifies the database directory that you created for this purpose. See "Identifying the Target Directory".

LOCATION

Lists the names of the Data Pump files to be created. The number of names should match the degree of parallelism (DOP) specified by the PARALLEL clause. Otherwise, the DOP drops to the number of files.

The number of files and the degree of parallelism affect the performance of Oracle Database when generating the Data Pump format files. They do not affect querying performance in Hive.

PARALLEL

Sets the degree of parallelism (DOP). Use the maximum number that your Oracle DBA permits you to use. By default the DOP is 1, which is serial processing. Larger numbers enable parallel processing.

AS SELECT

Use the full SQL SELECT syntax for this clause. It is not restricted. The tablename identifies the Oracle table to be copied to HDFS.
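
For example, the following statement (a sketch that assumes a hypothetical SALES table and the EXPORTDIR directory created earlier) writes four Data Pump files with a DOP of 4. Note that the number of file names in LOCATION matches the PARALLEL value:

CREATE TABLE export_sales
   ORGANIZATION EXTERNAL (
     TYPE oracle_datapump
     DEFAULT DIRECTORY exportdir
     LOCATION ('sales1.dmp','sales2.dmp','sales3.dmp','sales4.dmp')
     ) PARALLEL 4
   AS SELECT * FROM sales;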


4.4.4 Copying the Files to HDFS

The Oracle Big Data SQL installation installs Hadoop client files on the Oracle Database server (Oracle Exadata Database Machine or other). The Hadoop client installation enables you to use Hadoop commands to copy the Data Pump files to HDFS. You must have write privileges on the HDFS directory.

To copy the dmp files into HDFS, use the hadoop fs -put command. This example copies the files into the HDFS customers directory owned by the oracle user:

$ hadoop fs -put customers*.dmp /user/oracle/customers
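
If the target HDFS directory does not already exist, create it first and then confirm the copy. A sketch assuming the same /user/oracle/customers path:

$ hadoop fs -mkdir -p /user/oracle/customers
$ hadoop fs -put customers*.dmp /user/oracle/customers
$ hadoop fs -ls /user/oracle/customers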

4.5 Creating a Hive Table

To provide access to the data in the Data Pump files, you create a Hive external table over the Data Pump files. Copy to Hadoop provides SerDes that enable Hive to read the files. These SerDes are read only, so you cannot use them to write to the files.

4.5.1 About Hive External Tables

For external tables, Hive loads the table metadata into its metastore. The data remains in its original location, which you identify in the LOCATION clause. If you drop an external table using a HiveQL DROP TABLE statement, then only the metadata is discarded, while the external data remains unchanged. In this respect, Hive handles external tables in fundamentally the same way as Oracle Database.

External tables support data sources that are shared by multiple programs. In this case, you use Oracle Database to update the data and then generate a new file. You can overwrite the old HDFS files with the updated files while leaving the Hive metadata intact.
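
For example, after regenerating the Data Pump files in Oracle Database, you can overwrite the old copies in place; the -f option of hadoop fs -put overwrites existing files. A sketch assuming the files are in the local /expdir directory and the Hive table is defined over /user/oracle/customers:

$ hadoop fs -put -f /expdir/customers*.dmp /user/oracle/customers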

The following is the basic syntax of a Hive CREATE TABLE statement for creating a Hive external table for use with a Data Pump format file:

CREATE EXTERNAL TABLE tablename 
ROW FORMAT
   SERDE 'oracle.hadoop.hive.datapump.DPSerDe'
STORED AS
   INPUTFORMAT  'oracle.hadoop.hive.datapump.DPInputFormat'
   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs_directory'

4.5.2 About Column Mappings

The Hive table columns automatically have the same names as the Oracle columns, which are provided by the metadata stored in the Data Pump files. Any user-specified column definitions are ignored.

4.5.3 About Data Type Conversions

Copy to Hadoop automatically converts the data in an Oracle table to an appropriate Hive data type. Table 4-1 shows the default mappings between Oracle and Hive data types.


Table 4-1 Oracle to Hive Data Type Conversions

Oracle Data Type                 Hive Data Type
-------------------------------  ------------------------------------------------------------
NUMBER                           INT when the scale is 0 and the precision is less than 10
                                 BIGINT when the scale is 0 and the precision is less than 19
                                 DECIMAL when the scale is greater than 0 or the precision is
                                 greater than 19
CLOB, NCLOB                      STRING
BINARY_DOUBLE                    DOUBLE
BINARY_FLOAT                     FLOAT
BLOB                             BINARY
CHAR, NCHAR                      CHAR
VARCHAR2, NVARCHAR2              VARCHAR
ROWID, UROWID                    BINARY
DATE                             TIMESTAMP
TIMESTAMP                        TIMESTAMP
TIMESTAMPTZ (1), TIMESTAMPLTZ    Unsupported
RAW                              BINARY
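
Because TIMESTAMP WITH TIME ZONE and TIMESTAMP WITH LOCAL TIME ZONE columns are unsupported, cast them to TIMESTAMP in the AS SELECT clause when you generate the Data Pump files (see footnote 1). A minimal sketch, assuming a hypothetical ORDERS table whose ORDER_TS column is TIMESTAMP WITH TIME ZONE:

CREATE TABLE export_orders
   ORGANIZATION EXTERNAL (
     TYPE oracle_datapump
     DEFAULT DIRECTORY exportdir
     LOCATION ('orders.dmp')
     )
AS SELECT order_id,
          CAST(order_ts AS TIMESTAMP) order_ts
   FROM orders;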


4.6 Example Using the Sample Schemas

This example shows all steps in the process of creating a Hive table from an Oracle table using Copy to Hadoop.

4.6.1 About the Sample Data

The Oracle tables are from the Sales History (SH) sample schema. The CUSTOMERS table provides extensive information about individual customers, including names, addresses, telephone numbers, birth dates, and credit limits. The COUNTRIES table provides a list of countries, and identifies regions and subregions.

This query shows a small selection of data in the CUSTOMERS table:

SELECT cust_first_name first_name,
   cust_last_name last_name,
   cust_gender gender, 
   cust_year_of_birth birth
FROM customers 
ORDER BY cust_city, last_name 
FETCH FIRST 10 ROWS ONLY;

The query returns the following rows:

FIRST_NAME      LAST_NAME            GENDER      BIRTH
--------------- -------------------- ------ ----------
Lise            Abbey                F            1963
Lotus           Alden                M            1958
Emmanuel        Aubrey               M            1933
Phil            Ball                 M            1956
Valentina       Bardwell             F            1965
Lolita          Barkley              F            1966
Heloise         Barnes               M            1980
Royden          Barrett              M            1937
Gilbert         Braun                M            1984
Portia          Capp                 F            1948

To reproduce this example, install the sample schemas in Oracle Database and connect as the SH user.

See Also:

Oracle Database Sample Schemas for descriptions of the tables and installation instructions for the schemas.

4.6.2 Creating the EXPDIR Database Directory

These SQL statements create a local database directory named EXPDIR and grant access to the SH user:

SQL> CREATE DIRECTORY expdir AS '/expdir';
Directory created.
SQL> GRANT READ, WRITE ON DIRECTORY expdir TO SH;
Grant succeeded.

4.6.3 Creating Data Pump Format Files for Customer Data

The following examples show how to create the Data Pump files and check their contents.

Copy to Hadoop supports only the syntax shown in the examples. Data Pump files created with the Export utility or Oracle Data Pump are not compatible.

4.6.3.1 CREATE TABLE Example With a Simple SELECT Statement

This example shows a very simple SQL command for creating a Data Pump format file from the CUSTOMERS table. It selects the entire table and generates a single output file named customers.dmp in the local /expdir directory.

CREATE TABLE export_customers 
   ORGANIZATION EXTERNAL
   (
   TYPE oracle_datapump
   DEFAULT DIRECTORY expdir
   LOCATION('customers.dmp')
   )
AS SELECT * FROM customers;

4.6.3.2 CREATE TABLE Example With a More Complex SQL SELECT Statement

The next example shows more complexity in the syntax. It joins the CUSTOMERS and COUNTRIES tables on the COUNTRY_ID columns to provide the country names. It also limits the rows to customers in the Americas. The command generates two output files in parallel, named americas1.dmp and americas2.dmp, in the local /expdir directory.

CREATE TABLE export_americas 
   ORGANIZATION EXTERNAL
   (
   TYPE oracle_datapump
   DEFAULT DIRECTORY expdir
   LOCATION('americas1.dmp', 'americas2.dmp')
   )
   PARALLEL 2
AS SELECT a.cust_first_name first_name,
   a.cust_last_name last_name,
   a.cust_gender gender,
   a.cust_year_of_birth birth, 
   a.cust_email email, 
   a.cust_postal_code postal_code, 
   b.country_name country
FROM customers a,
     countries b
WHERE a.country_id=b.country_id AND
      b.country_region='Americas'
ORDER BY a.country_id, a.cust_postal_code;

4.6.4 Verifying the Contents of the Data Files

You can check the content of the output data files before copying them to Hadoop. The previous CREATE TABLE statement created an external table named EXPORT_AMERICAS, which you can describe and query the same as any other table.
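
For example, a quick row count confirms that the external table, and therefore the generated files, can be read:

SQL> SELECT COUNT(*) FROM export_americas;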

The DESCRIBE statement shows the selected columns and their modified names:

SQL> DESCRIBE export_americas;
 Name                      Null?    Type
 ------------------------- -------- -----------------
 FIRST_NAME                NOT NULL VARCHAR2(20)
 LAST_NAME                 NOT NULL VARCHAR2(40)
 GENDER                    NOT NULL CHAR(1)
 BIRTH                     NOT NULL NUMBER(4)
 EMAIL                              VARCHAR2(50)
 POSTAL_CODE               NOT NULL VARCHAR2(10)
 COUNTRY                   NOT NULL VARCHAR2(40)

A SELECT statement like the following shows a sample of the data:

SELECT first_name, last_name, gender, birth, country 
   FROM export_americas 
   WHERE birth > 1985 
   ORDER BY last_name 
   FETCH FIRST 5 ROWS ONLY;

FIRST_NAME      LAST_NAME            GENDER      BIRTH COUNTRY
--------------- -------------------- ------ ---------- ------------------------
Opal            Aaron                M            1990 United States of America
KaKit           Abeles               M            1986 United States of America
Mitchel         Alambarati           M            1987 Canada
Jade            Anderson             M            1986 United States of America
Roderica        Austin               M            1986 United States of America

4.6.5 Copying the Files into Hadoop

The following commands list the files in the local expdir directory, create a Hadoop subdirectory named customers, and copy the files to it. The user is connected to the Hadoop cluster (Oracle Big Data Appliance or other) as the oracle user.

$ cd /expdir
$ ls americas*.dmp
americas1.dmp  americas2.dmp
$ hadoop fs -mkdir customers
$ hadoop fs -put americas*.dmp customers
$ hadoop fs -ls customers
Found 2 items
-rw-r--r--   1 oracle oracle     798720 2014-10-13 17:04 customers/americas1.dmp
-rw-r--r--   1 oracle oracle     954368 2014-10-13 17:04 customers/americas2.dmp

4.6.6 Creating a Hive External Table

This HiveQL statement creates an external table using the Copy to Hadoop SerDes. The LOCATION clause identifies the full path to the Hadoop directory containing the Data Pump files:

CREATE EXTERNAL TABLE customers
   ROW FORMAT SERDE 'oracle.hadoop.hive.datapump.DPSerDe'
   STORED AS
      INPUTFORMAT  'oracle.hadoop.hive.datapump.DPInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
   LOCATION '/user/oracle/customers';

The DESCRIBE command shows the columns of the CUSTOMERS external table.

hive> DESCRIBE customers;
OK
first_name               varchar(20)              from deserializer   
last_name                varchar(40)              from deserializer   
gender                   char(1)                  from deserializer   
birth                    int                      from deserializer   
email                    varchar(50)              from deserializer   
postal_code              varchar(10)              from deserializer   
country                  varchar(40)              from deserializer   

4.6.7 Querying the Data in Hive

The following HiveQL SELECT statement shows the same data as the SQL SELECT statement from Oracle Database shown in "Verifying the Contents of the Data Files". The two queries access copies of the same Data Pump files.

SELECT first_name, last_name, gender, birth, country 
  FROM customers 
  WHERE birth > 1985 
  ORDER BY last_name LIMIT 5;

Total MapReduce jobs = 1
Launching Job 1 out of 1
     .
     .
     .
OK
Opal      Aaron       M     1990     United States of America
KaKit     Abeles      M     1986     United States of America
Mitchel   Alambarati  M     1987     Canada
Jade      Anderson    M     1986     United States of America
Roderica  Austin      M     1986     United States of America

4.7 Using Oracle Shell for Hadoop Loaders With Copy to Hadoop

4.7.1 Introduction to Oracle Shell for Hadoop Loaders

What is Oracle Shell for Hadoop Loaders?

Oracle Shell for Hadoop Loaders is a helper shell that provides a simple-to-use command-line interface to Oracle Loader for Hadoop, Oracle SQL Connector for HDFS, and Copy to Hadoop. It has basic shell features such as command-line recall, history, inheriting environment variables from the parent process, setting new or existing environment variables, and performing environment variable substitution in the command line.

The core functionality of Oracle Shell for Hadoop Loaders includes the following:

  • Defining named external resources with which Oracle Shell for Hadoop Loaders interacts to perform loading tasks.

  • Setting default values for load operations.

  • Running load commands.

  • Delegating simple pre- and post-load tasks to the operating system, HDFS, Hive, and Oracle Database. These tasks include viewing the data to be loaded and viewing the data in the target table after loading.

Getting Started with Oracle Shell for Hadoop Loaders

The examples directory in the OHSH kit contains many examples that define resources and load data using Oracle Shell for Hadoop Loaders. See <OHSH_KIT>/examples/README.txt for a description of the examples and instructions for running them.
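
For orientation only, a Copy to Hadoop load from OHSH might look like the following hypothetical sketch. The resource names sql0 and jdbc0, the connect identifier, and the table name are placeholders, hive0 is assumed to be the resource for the default Hive database, and the examples in the kit show the authoritative syntax:

ohsh> create sqlplus resource sql0 connectid="<db_host>:<port>/<service_name>"
ohsh> create jdbc resource jdbc0 connectid="<db_host>:<port>/<service_name>"
ohsh> load hive table hive0:customers from oracle table jdbc0:customers using directcopy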

Getting Help

The OHSH shell provides online help for all commands.

To get a list of all possible OHSH commands:

ohsh> help

To get help on a specific command, enter help, followed by the command:    

ohsh> help show

4.7.2 Oracle Shell for Hadoop Loaders Setup

Prerequisites

The following are prerequisites on the Oracle Database server.

  • Copy to Hadoop is installed

  • Hadoop and Hive client libraries are installed and configured

  • SQL*Plus is installed

  • There is JDBC access to Oracle Database.

Installing the Oracle Big Data Connectors components Oracle Loader for Hadoop (OLH) and Oracle SQL Connector for HDFS (OSCH) is optional.

Installing Oracle Shell for Hadoop Loaders

Follow these steps to set up Oracle Shell for Hadoop Loaders on the Oracle Database server:

  1. After Oracle Big Data SQL is installed, find the Oracle Shell for Hadoop Loaders download package at $ORACLE_HOME/bigdatasql/ohsh-<version>.zip

  2. Extract the contents of ohsh-<version>.zip to a directory of your choice on the database server. 

    The extraction creates a directory named ohsh-<version> with a README.txt file and the following subdirectories:

    README.txt
    /bin
    /conf
    /doc
    /examples
    /jlib
    
  3. Follow the instructions in README.txt to configure Oracle Shell for Hadoop Loaders.

In the /doc directory, there are additional README files with instructions for installing the software on Hadoop nodes, database nodes, and edge nodes, a description of the security model for Hive table loads, and other useful information.
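
For example, steps 1 and 2 might look like this on the database server (a sketch assuming /home/oracle as the target directory):

$ cd /home/oracle
$ unzip $ORACLE_HOME/bigdatasql/ohsh-<version>.zip
$ cat ohsh-<version>/README.txt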

(1) To copy TIMESTAMPTZ and TIMESTAMPLTZ data to Hive, cast the columns to TIMESTAMP when exporting them to the Data Pump files. Hive does not have a data type that supports time zones or time offsets.