Use Copy to Hadoop

The following sections describe how to use Copy to Hadoop to copy Oracle Database tables to Hadoop.

About Copy to Hadoop

Copy to Hadoop makes it simple to identify and copy Oracle data to the Hadoop Distributed File System (HDFS).

Data exported to the Hadoop cluster by Copy to Hadoop is stored in Oracle Data Pump format. The Oracle Data Pump files can be queried by Hive. When the Oracle table changes, you can refresh the copy in Hadoop. Copy to Hadoop is primarily useful for Oracle tables that are relatively static, and thus do not require frequent refreshes.

Oracle Shell for Hadoop Loaders (OHSH) is the preferred way to use Copy to Hadoop. It includes a command line interface (whose simple command syntax can also be scripted) for moving data between Hadoop and Oracle Database using various resources, including Copy to Hadoop. See Use the Oracle Shell for Hadoop Loaders Interface (OHSH).

First Look: Loading an Oracle Table Into Hive and Storing the Data in Hadoop

This set of examples shows how to use Copy to Hadoop to load data from an Oracle table, store the data in Hadoop, and perform related operations within the OHSH shell. It assumes that OHSH and Copy to Hadoop are already installed and configured.

What’s Demonstrated in the Examples

These examples demonstrate the following tasks:

  • Starting an OHSH session and creating the resources you’ll need for Copy to Hadoop.

  • Using Copy to Hadoop to copy the data from the selected Oracle Database table to a new Hive table in Hadoop (using the resources that you created).

  • Using the load operation to add more data to the Hive table created in the first example.

  • Using the create or replace operation to drop the Hive table and replace it with a new one that has a different record set.

  • Querying the data in the Hive table and in the Oracle Database table.

  • Converting the data to other formats, such as Parquet or ORC.

Tip:

You may want to select or create a small table in Oracle Database and work through these steps.

Starting OHSH, Creating Resources, and Running Copy to Hadoop

  1. Start OHSH. (The startup command below assumes that you’ve added the OHSH path to your PATH variable as recommended.)

    $ ohsh
    ohsh>
  2. Create the following resources.

    • SQL*Plus resource.

      ohsh> create sqlplus resource sql0 connectid="<database_connection_url>"
    • JDBC resource.

      ohsh> create jdbc resource jdbc0 connectid="<database_connection_url>"

    Note:

    For the Hive access shown in this example, only the default hive0 resource is needed. This resource is already configured to connect to the default Hive database. If additional Hive resources were required, you would create them as follows:
    ohsh> create hive resource hive_mydatabase connectionurl="jdbc:hive2:///<Hive_database_name>"
  3. Insert the Oracle Database table name into the create hive table command below, and then run the command. This command uses the Copy to Hadoop directcopy method. Note that directcopy is the default method, so you do not need to name it explicitly.

    ohsh> create hive table hive0:<new_Hive_table_name> from oracle table jdbc0:<Oracle_Database_table_name> using directcopy

    The Oracle table data is now stored in Hadoop as a Hive table. A complete worked session is sketched below.
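Putting the steps together, here is a minimal sketch of a complete session. The connect identifier, the Oracle table MOVIE_FACT, and the Hive table movie_fact_hive are hypothetical placeholders; substitute your own values. (OHSH may also prompt for database credentials when the resources are created.)

$ ohsh
ohsh> create sqlplus resource sql0 connectid="myhost:1521/myservice"
ohsh> create jdbc resource jdbc0 connectid="myhost:1521/myservice"
ohsh> create hive table hive0:movie_fact_hive from oracle table jdbc0:MOVIE_FACT using directcopy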

Adding More Data to the Hive Table

Use the OHSH load method to add data to an existing Hive table.

Let’s assume that the original Oracle table includes a time field in the format DD-MON-YY and that a number of daily records were added after the Copy to Hadoop operation that created the corresponding Hive table.

Use load to add these new records to the existing Hive table:

ohsh> load hive table hive0:<Hive_table_name> from oracle table jdbc0:<Oracle_Database_table_name> where "(time >= '01-FEB-18')"
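For example, using the hypothetical tables from the session sketched earlier, the command would be:

ohsh> load hive table hive0:movie_fact_hive from oracle table jdbc0:MOVIE_FACT where "(time >= '01-FEB-18')"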

Using OHSH create or replace

The OHSH create or replace operation does the following:

  1. Drops the named Hive table (and the associated Data Pump files) if a table by this name already exists.

    Note:

    Unlike create or replace, a create operation fails and returns an error if the Hive table and the related Data Pump files already exist.

  2. Creates a new Hive table using the name provided.

Suppose some records were deleted from the original Oracle Database table and you want to realign the Hive table with the new state of the Oracle Database table. Hive does not support update or delete operations on records, but the create or replace operation in OHSH can achieve the same end result:

ohsh> create or replace hive table hive0:<new_hive_table_name> from oracle table jdbc0:<Oracle_Database_table_name>
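Continuing the hypothetical example, the following command drops movie_fact_hive (if it exists) and recreates it from the current state of MOVIE_FACT:

ohsh> create or replace hive table hive0:movie_fact_hive from oracle table jdbc0:MOVIE_FACT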

Note:

Data copied to Hadoop by Copy to Hadoop can be queried through Hive, but the data itself is actually stored as Oracle Data Pump files. Hive only points to the Data Pump files.
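You can see this by listing the files that back the Hive table in HDFS. The path below assumes the default Hive warehouse location and the hypothetical table from the earlier examples; yours may differ:

$ hdfs dfs -ls /user/hive/warehouse/movie_fact_hive

The listing shows Oracle Data Pump files rather than Hive-native data files.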

Querying the Hive Table

You can invoke a Hive resource in OHSH to run HiveQL commands, and likewise invoke a SQL*Plus resource to run SQL commands. For example, these two queries compare the row count of the original Oracle Database table with that of the Hive table derived from it:

ohsh> %sql0 select count(*) from <Oracle_Database_table_name>
ohsh> %hive0 select count(*) from <Hive_table_name>
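For example, with the hypothetical tables used earlier, you can compare the source and the copy directly; if the copy is up to date, the two counts match:

ohsh> %sql0 select count(*) from MOVIE_FACT
ohsh> %hive0 select count(*) from movie_fact_hive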

Storing Data in Other Formats, Such as Parquet or ORC

By default, Copy to Hadoop outputs Data Pump files. In a create operation, you can use the "stored as" syntax to change the destination format to Parquet or ORC:

ohsh> %hive0 create table <Hive_table_name_parquet> stored as parquet as select * from <Hive_table_name>

This example first creates the Data Pump files, and then immediately copies the data into a new Hive table stored as Parquet. (The original Data Pump files are not deleted.)
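Storing the data as ORC follows the same pattern; the ORC table name below is a placeholder:

ohsh> %hive0 create table <Hive_table_name_orc> stored as orc as select * from <Hive_table_name>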