Use Copy to Hadoop
The following sections describe how to use Copy to Hadoop to copy Oracle Database tables to Hadoop.
About Copy to Hadoop
Copy to Hadoop makes it simple to identify and copy Oracle data to the Hadoop Distributed File System (HDFS).
Data exported to the Hadoop cluster by Copy to Hadoop is stored in Oracle Data Pump format. The Oracle Data Pump files can be queried by Hive. When the Oracle table changes, you can refresh the copy in Hadoop. Copy to Hadoop is primarily useful for Oracle tables that are relatively static, and thus do not require frequent refreshes.
Oracle Shell for Hadoop Loaders (OHSH) is the preferred way to use Copy to Hadoop. It includes a command line interface (whose simple command syntax can also be scripted) for moving data between Hadoop and Oracle Database using various resources, including Copy to Hadoop. See Use the Oracle Shell for Hadoop Loaders Interface (OHSH).
First Look: Loading an Oracle Table Into Hive and Storing the Data in Hadoop
This set of examples shows how to use Copy to Hadoop to load data from an Oracle table, store the data in Hadooop, and perform related operations within the OHSH shell. It assumes that OHSH and Copy to Hadoop are already installed and configured.
What’s Demonstrated in The Examples
These examples demonstrate the following tasks:
-
Starting an OHSH session and creating the resources you’ll need for Copy to Hadoop.
-
Using Copy to Hadoop to copy the data from the selected Oracle Database table to a new Hive table in Hadoop (using the resources that you created).
-
Using the
load
operation to add more data to the Hive table created in the first example. -
Using the
create or replace
operation to drop the Hive table and replace it with a new one that has a different record set. -
Querying the data in the Hive table and in the Oracle Database table.
-
Converting the data into other formats
Tip:
You may want to create select or create a small table in Oracle Database and work through these steps.Starting OHSH, Creating Resources, and Running Copy to Hadoop
-
Start OHSH. (The startup command below assumes that you’ve added the OHSH path to your PATH variable as recommended.)
$ ohsh ohsh>
-
Create the following resources.
-
SQL*Plus resource.
ohsh> create sqlplus resource sql0 connectid=”<database_connection_url>”
-
JDBC resource.
ohsh> create jdbc resource jdbc0 connectid=”<database_connection_url>”
Note:
For the Hive access shown in this example, only the defaulthive0
resource is needed. This resource is already configured to connect to the default Hive database. If additional Hive resources were required, you would create them as follows:ohsh> create hive resource hive_mydatabase connectionurl=”jdbc:hive2:///<Hive_database_name>”
-
-
Include the Oracle Database table name in the
create hive table
command below and run the command below. This command uses the Copy to Hadoopdirectcopy
method. Note thatdirectcopy
is the default mode and you do not actually need to name it explicitly.ohsh> create hive table hive0:<new_Hive_table_name> from oracle table jdbc0:<Oracle_Database_table_name> from oracle table jdbc0:<Oracle_Database_table_name> using directcopy
The Oracle Table data is now stored in Hadoop as a Hive table.
Adding More Data to the Hive Table
Use the OHSH load
method to add data to an existing Hive table.
Let’s assume that the original Oracle table includes a time field in the format DD-MM-YY and that a number of daily records were added after the Copy to Hadoop operation that created the corresponding Hive table.
Use load
to add these new records to the existing Hive table:
ohsh> load hive table hive0:<Hive_table_name> from oracle table jdbc0:<Oracle_Database_table_name> where “(time >= ’01-FEB-18’)”
Using OHSH create or replace
The OHSH create or replace
operation does the following:
-
Drops the named Hive table (and the associated Data Pump files) if a table by this name already exists.
Note:
Unlike
create or replace
, acreate
operation fails and returns an error if the Hive table and the related Data Pump files already exist. -
Creates a new Hive table using the name provided.
Suppose some records were deleted from the original Oracle Database table and you want to realign the Hive table with the new state of the Oracle Database table. Hive does not support update or delete operations on records, but the create or replace
operation in OHSH can achieve the same end result:
ohsh> create or replace hive table hive0:<new_hive_table_name> from oracle table jdbc0:<Oracle_Database_table_name>
Note:
Data copied to Hadoop by Copy to Hadoop can be queried through Hive, but the data itself is actually stored as Oracle Data Pump files. Hive only points to the Data Pump files.Querying the Hive Table
You can invoke a Hive resource in OHSH in order to run HiveQL commands. Likewise, you can invoke an SQL*Plus resource to run SQL commands. For example, these two queries compare the original Oracle Database table with the derivative Hive table:
ohsh> %sql0 select count(*) from <Oracle_Database_table_name>
ohsh> %hive0 select count(*) from <Hive_table_name>
Storing Data in Other Formats, Such as Parquet or ORC
By default, Copy to Hadoop outputs Data Pump files. In a create
operation, you can use the “stored as
” syntax to change the destination format to Parquet or ORC:
ohsh> %hive0 create table <Hive_table_name_parquet> stored as parquet as select * from <Hive_table_name>
This example creates the Data Pump files, but then immediately copies them to Parquet format. (The original Data Pump files are not deleted.)