3.2 Use Copy to Hadoop
Learn how to use Copy to Hadoop to copy Oracle Database tables to Hadoop.
3.2.1 What Is Copy to Hadoop?
Oracle Big Data SQL includes the Oracle Copy to Hadoop utility. This utility makes it simple to identify and copy Oracle data to the Hadoop Distributed File System. It can be accessed through the command-line interface Oracle Shell for Hadoop Loaders.
Data exported to the Hadoop cluster by Copy to Hadoop is stored in Oracle Data Pump format. The Oracle Data Pump files can be queried by Hive or Big Data SQL. The Oracle Data Pump format optimizes queries through Big Data SQL in the following ways:
- 
                        The data is stored as Oracle data types – eliminating data type conversions. 
- 
                        The data is queried directly – without requiring the overhead associated with Java SerDes. 
After Data Pump format files are in HDFS, you can use Apache Hive to query the data. Hive can process the data locally without accessing Oracle Database. When the Oracle table changes, you can refresh the copy in Hadoop. Copy to Hadoop is primarily useful for Oracle tables that are relatively static, and thus do not require frequent refreshes.
Copy to Hadoop is licensed under Oracle Big Data SQL. You must have an Oracle Big Data SQL license in order to use this utility.
3.2.2 Getting Started Using Copy to Hadoop
To install and start using Copy to Hadoop:
Tip:
For Hadoop power users with specialized requirements, the manual option for Direct Copy is recommended. See Manual Steps for Using Copy to Hadoop for Direct Copies in Appendix B.3.2.2.1 Table Access Requirements for Copy to Hadoop
To copy a table using Copy to Hadoop, an Oracle Database user must meet one of these requirements.
- 
                           The user is the owner of the table, or 
- 
                           The user is accessing a table in another schema and has the following privileges: - 
                                 The SELECTprivilege on the table.
- 
                                 The select_catalog_roleprivilege (which providesSELECTprivileges on data dictionary views).
 
- 
                                 
3.2.3 Using Oracle Shell for Hadoop Loaders With Copy to Hadoop
3.2.3.1 Introducing Oracle Shell for Hadoop Loaders
What is Oracle Shell for Hadoop Loaders?
Oracle Shell for Hadoop Loaders (OHSH) is a helper shell that provides an easy-to-use command line interface to Oracle Loader for Apache Hadoop, Oracle SQL Connector for HDFS, and Copy to Hadoop. It has basic shell features such as command line recall, history, inheriting environment variables from the parent process, setting new or existing environment variables, and performing environmental substitution in the command line.
The core functionality of Oracle Shell for Hadoop Loaders includes the following:
- 
                           Defining named external resources with which Oracle Shell for Hadoop Loaders interacts to perform loading tasks. 
- 
                           Setting default values for load operations. 
- 
                           Running load commands. 
- 
                           Delegating simple pre and post load tasks to the Operating System, HDFS, Hive and Oracle. These tasks include viewing the data to be loaded, and viewing the data in the target table after loading. 
See Also:
The examples directory in the OHSH kit contains many examples that define
                resources and load data using Oracle Shell for Hadoop Loaders.  Unzip
                    <OHSH_KIT>/examples.zip and
                    see<OHSH_KIT>/examples/README.txt for a description
                of the examples and instructions on how to run OHSH load methods.
                        
3.2.4 Copy to Hadoop by Example
3.2.4.1 First Look: Loading an Oracle Table Into Hive and Storing the Data in Hadoop
This set of examples shows how to use Copy to Hadoop to load data from an Oracle table, store the data in Hadooop, and perform related operations within the OHSH shell. It assumes that OHSH and Copy to Hadoop are already installed and configured.
What’s Demonstrated in The Examples
These examples demonstrate the following tasks:
- 
                              Starting an OHSH session and creating the resources you’ll need for Copy to Hadoop. 
- 
                              Using Copy to Hadoop to copy the data from the selected Oracle Database table to a new Hive table in Hadoop (using the resources that you created). 
- 
                              Using the loadoperation to add more data to the Hive table created in the first example.
- 
                              Using the create or replaceoperation to drop the Hive table and replace it with a new one that has a different record set.
- 
                              Querying the data in the Hive table and in the Oracle Database table. 
- 
                              Converting the data into other formats 
Tip:
You may want to create select or create a small table in Oracle Database and work through these steps.Starting OHSH, Creating Resources, and Running Copy to Hadoop
- 
                              Start OHSH. (The startup command below assumes that you’ve added the OHSH path to your PATH variable as recommended.) $ ohsh ohsh>
- 
                              Create the following resources. - 
                                    SQL*Plus resource. ohsh> create sqlplus resource sql0 connectid=”<database_connection_url>”
- 
                                    JDBC resource. ohsh> create jdbc resource jdbc0 connectid=”<database_connection_url>”
 Note: For the Hive access shown in this example, only the defaulthive0resource is needed. This resource is already configured to connect to the default Hive database. If additional Hive resources were required, you would create them as follows:ohsh> create hive resource hive_mydatabase connectionurl=”jdbc:hive2:///<Hive_database_name>”
- 
                                    
- 
                              Include the Oracle Database table name in the create hive tablecommand below and run the command below. This command uses the Copy to Hadoopdirectcopymethod. Note thatdirectcopyis the default mode and you do not actually need to name it explicitly.ohsh> create hive table hive0:<new_Hive_table_name> from oracle table jdbc0:<Oracle_Database_table_name> from oracle table jdbc0:<Oracle_Database_table_name> using directcopyThe Oracle Table data is now stored in Hadoop as a Hive table. 
Adding More Data to the Hive Table
Use the OHSH load method to add data to an existing Hive table. 
                        
Let’s assume that the original Oracle table includes a time field in the format DD-MM-YY and that a number of daily records were added after the Copy to Hadoop operation that created the corresponding Hive table.
Use load to add these new records to the existing Hive table:
                        
ohsh> load hive table hive0:<Hive_table_name> from oracle table jdbc0:<Oracle_Database_table_name> where “(time >= ’01-FEB-18’)”Using OHSH create or replace
The OHSH create or replace operation does the following:
                        
- 
                              Drops the named Hive table (and the associated Data Pump files) if a table by this name already exists. Note: Unlike create or replace, acreateoperation fails and returns an error if the Hive table and the related Data Pump files already exist.
- 
                              Creates a new Hive table using the name provided. 
Suppose some records were deleted from the original Oracle Database table and you want to realign the Hive table with the new state of the Oracle Database table. Hive does not support update or delete operations on records, but the create or replace operation in OHSH can achieve the same end result:
                        
ohsh> create or replace hive table hive0:<new_hive_table_name> from oracle table jdbc0:<Oracle_Database_table_name>Note:
Data copied to Hadoop by Copy to Hadoop can be queried through Hive, but the data itself is actually stored as Oracle Data Pump files. Hive only points to the Data Pump files.Querying the Hive Table
You can invoke a Hive resource in OHSH in order to run HiveQL commands. Likewise, you can invoke an SQL*Plus resource to run SQL commands. For example, these two queries compare the original Oracle Database table with the derivative Hive table:
ohsh> %sql0 select count(*) from <Oracle_Database_table_name>
ohsh> %hive0 select count(*) from <Hive_table_name>Storing Data in Other Formats, Such as Parquet or ORC
By default, Copy to Hadoop outputs Data Pump files. In a create operation, you can use the “stored as” syntax to change the destination format to Parquet or ORC:
                        
ohsh> %hive0 create table <Hive_table_name_parquet> stored as parquet as select * from <Hive_table_name>This example creates the Data Pump files, but then immediately copies them to Parquet format. (The original Data Pump files are not deleted.)
3.2.4.2 Working With the Examples in the Copy to Hadoop Product Kit
The OHSH product kit provides an examples directory at the path where OHSH is installed. This section walks you through several examples from the kit.
                     
3.2.4.2.1 Using Copy to Hadoop With the Default Copy Method
The following examples from the Copy to Hadoop product kit show how to use
        Copy to Hadoop with the default method of loading data. You can find the code in the
        examples directory where the kit is installed
            (<OHSH_KIT>/examples ). 
                        
The following examples assume that OHSH and Copy to Hadoop are installed and configured.
Example 3-1 createreplace_directcopy.ohsh
This script uses the create or replace operation to create a Hive
                external table called cp2hadoop_fivdti from the Oracle table
                    OHSH_CP2HADOOP_FIVDTI. It then loads the Hive table with 10000
                rows.  It uses the default load method directcopy to run a map job
                on Hadoop and split the Oracle table into input splits. The resulting Hive external
                table includes all of the splits.
                           
create or replace hive table hive0:cp2hadoop_fivdti \ 
from oracle table jdbc0:ohsh_cp2hadoop_fivdti using directcopy In the example below and in the code samples that follow, olhp is a user-defined JDBC resource.
                           
Example 3-2 load_directcopy.ohsh
The load_directcopy.ohsh script shows how to load the Hive table that was created in createreplace_directcopy.ohsh with an additional 30 rows. This script also uses the directcopy method.
                           
load hive table hive0:cp2hadoop_fivdti from oracle table jdbc0:ohsh_cp2hadoop_fivdti \
using directcopy where "(i7 < 30)";Tip:
You have the option to convert the storage in Hadoop from the default Data Pump format to Parquet or ORC format. For example:%hive0 create table cp2hadoop_fivdti_parquet stored as parquet as select * from cp2hadoop_fivdtiThe original Data Pump files are not deleted.
3.2.4.2.2 Using Copy to Hadoop With the Staged Copy Method
The first example below shows how to use Oracle Shell for Hadoop Loaders (OHSH) with Copy to Hadoop to do a staged, two-step copy from Oracle Database to Hadoop. The stage method is an alternative to the directcopy method.
                        
The second example shows how to load additional rows into the same table. It also uses the stage method.
Both examples assume that OHSH and Copy to Hadoop have been installed and configured, and that the examples have been configured according to the instructions in README.txt in the examples directory of the OHSH installation. The scripts below and many others are available in the examples directory.
                        
Example 3-3 createreplace_stage.ohsh
This script uses create or replace to create a Hive table called
                    cp2hadoop_fivdti from the Oracle table
                    OHSH_CP2HADOOP_FIVDTI.  It uses the stage
                command, which automatically does the following: 
                           
- 
                                 Exports the contents of the source table in Oracle to Data Pump format files on local disk 
- 
                                 Moves the Data Pump format files to HDFS. 
- 
                                 Creates the Hive external table that maps to the Data Pump format files in HDFS. 
create or replace hive table hive0:cp2hadoop_fivdti \
from oracle table jdbc0:ohsh_cp2hadoop_fivdti using stageIn the command above (and also in the next code example), olhp is a user-defined JDBC resource. 
                           
Example 3-4 load_stage.ohsh
The load_stage.ohsh script shows how to load the Hive table created by createreplace_stage.ohsh with an additional 30 rows using the stage method.
                           
load hive table hive0:cp2hadoop_fivdti from oracle table jdbc0:ohsh_cp2hadoop_fivdti \
using stage where "(i7 < 30)";Manual Option
The two-step method demonstrated in the createreplace_stage.ohsh and load_stage.ohsh example scripts automates some of the tasks required to do staged copies. However, there may be reasons to perform the steps manually, such as:
                           
- 
                                 You want to load columns from multiple Oracle Database source tables. 
- 
                                 You want to load columns of type TIMESTAMPZ or TIMESTAMPLTZ. 
See Appendix A: Manual Steps for Using Copy to Hadoop for Staged Copies.
3.2.5 Querying the Data in Hive
The following OHSH command shows the number of rows in the Hive table after copying from the Oracle table.
                     
%hive0 select count(*) from cp2hadoop_fivdti;
3.2.6 Column Mappings and Data Type Conversions in Copy to Hadoop
Get help with column mappings and data type conversions in Copy to Hadoop.
3.2.6.1 About Column Mappings
The Hive table columns automatically have the same names as the Oracle columns, which are provided by the metadata stored in the Data Pump files. Any user-specified column definitions in the Hive table are ignored.
3.2.6.2 About Data Type Conversions
Copy to Hadoop automatically converts the data in an Oracle table to an appropriate Hive data type. Table 3-2 shows the default mappings between Oracle and Hive data types.
Table 3-2 Oracle to Hive Data Type Conversions
| Oracle Data Type | Hive Data Type | 
|---|---|
| NUMBER | INT when the scale is 0 and the precision is less than 10 BIGINT when the scale is 0 and the precision is less than 19 DECIMAL when the scale is greater than 0 or the precision is greater than 19 | 
| CLOB NCLOB | STRING | 
| INTERVALYM INTERVALDS | STRING | 
| BINARY_DOUBLE | DOUBLE | 
| BINARY_FLOAT | FLOAT | 
| BLOB | BINARY | 
| ROWID UROWID | BINARY | 
| RAW | BINARY | 
| CHAR NCHAR | CHAR | 
| VARCHAR2 NVARCHAR2 | VARCHAR | 
| DATE | TIMESTAMP | 
| TIMESTAMP | TIMESTAMP | 
| TIMESTAMPTZ TIMESTAMPLTZFoot 1 | Unsupported | 
Footnote 1
To copy TIMESTAMPTZ and TIMESTAMPLTZ data to Hive, follow the instructions in Appendix A: Manual Steps for Using Copy to Hadoop to do Staged Copies. Cast the columns to TIMESTAMP when exporting them to the Data Pump files.
3.2.7 Working With Spark
The Oracle Data Pump files exported by Copy to Hadoop can be used in Spark.
prompt> spark-shell --jars orahivedp.jar,ojdbc7.jar,oraloader.jar,orai18n.jar,ora-hadoop-common.jarscala> sqlContext res0:org.apache.spark.sql.SQLContext = org.apache.spark.sql.hive.HiveContext@66ad7167scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)df that points to a Hive external table over Oracle Data Pump files:scala> val df = sqlContext.table("<hive external table>") <hive external table>: 
org.apache.spark.sql.DataFrame = [ <column names> ]scala> df.count
scala> df.headscala> sqlContext.sql(“CREATE EXTERNAL TABLE <hive external table> ROW FORMAT SERDE 
'oracle.hadoop.hive.datapump.DPSerDe' STORED AS INPUTFORMAT
'oracle.hadoop.hive.datapump.DPInputFormat' OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION
'/user/oracle/oracle_warehouse/<hive database name>'")3.2.8 Using Oracle SQL Developer with Copy to Hadoop
Oracle SQL Developer is a free, GUI-based development environment that provides easy to use tools for working Oracle Big Data Connectors, including Copy to Hadoop.
Using Oracle SQL Developer, you can copy data and create a new Hive table, or append data to an existing Hive external table that was created by Copy to Hadoop. In the GUI, you can initiate Copy to Hadoop in Oracle SQL Developer by right-clicking the Tables icon under any Hive schema. You can then append to an existing Hive external table by right-clicking the icon for that Hive table.
See Installing Oracle SQL Developer in this manual for instructions on where to obtain Oracle SQL Developer and how do the basic installation.