Use Oracle Loader for Hadoop

The following sections describe how to use Oracle Loader for Hadoop to load data from Hadoop into tables in Oracle Database.

About Oracle Loader for Hadoop

Oracle Loader for Hadoop (OLH) is an efficient and high-performance loader for fast loading of data from a Hadoop cluster into a table in an Oracle database.

Oracle Loader for Hadoop prepartitions the data if necessary and transforms it into a database-ready format. It can also sort records by primary key or user-specified columns before loading the data or creating output files. Oracle Loader for Hadoop uses the parallel processing framework of Hadoop to perform these preprocessing operations, which other loaders typically perform on the database server as part of the load process. Off-loading these operations to Hadoop reduces the CPU requirements on the database server, thereby lessening the performance impact on other database tasks.

Oracle Shell for Hadoop Loaders (OHSH) is the preferred way to use Oracle Loader for Hadoop. It includes a command line interface (whose simple command syntax can also be scripted) for moving data between Hadoop and Oracle Database using various resources, including Oracle Loader for Hadoop. See Use the Oracle Shell for Hadoop Loaders Interface (OHSH).

Get Started With Oracle Loader for Hadoop

These instructions show how to use Oracle Loader for Hadoop through OHSH.

Before You Start

This is what you need to know before using OLH to load an Oracle Database table with data stored in Hadoop:

  • The password of the database schema you are connecting to (which is implied by the database connection URL).

  • The name of the Oracle Database table.

  • The source of the data residing in Hadoop (either a path to an HDFS directory or the name of a Hive table).

  • The preferred load method: either JDBC or direct path. Direct path is faster, but requires a partitioned target table; JDBC does not. (Example values for these items appear after this list.)
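
For example, a load might use values like the following, all of which are hypothetical: a JDBC connection URL such as jdbc:oracle:thin:@//dbhost:1521/orclpdb (OHSH connect identifiers typically use the shorter host:port/service form, for example dbhost:1521/orclpdb), a target table such as MOVIEDEMO.MOVIE_FACT, an HDFS source directory such as /user/oracle/moviedata, and the direct path method if MOVIE_FACT is partitioned.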

About Resources

In OHSH, the term resources refers to the interfaces that OHSH presents for defining the data source, destination, and command language. Four types of resources are available:

  • Hadoop resources – for executing HDFS commands to navigate HDFS and use HDFS as a source or destination.

  • Hive resources – for executing Hive commands and specifying Hive as a source or destination.

  • JDBC resources – for making JDBC connections to a database.

  • SQL*Plus resources – for executing SQL commands in a database schema.

Two resources are created upon OHSH startup:

  • hive0 – enables access to the Hive database named default.

  • hadoop0 – enables access to HDFS.

You can create SQL*Plus and JDBC resources within a session, as well as additional Hive resources (for example, to connect to other Hive databases). Assign a resource any name that is meaningful to you. The examples below use the names ora_mydatabase and sql0.

Where resources are invoked in the commands below, the percent sign (%) prefix identifies a resource name.
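
For example, assuming that the Hadoop resource passes its arguments to the hadoop command and that the Hive resource accepts HiveQL statements, the two default resources might be invoked as follows (the directory name is a placeholder):

ohsh> %hadoop0 fs -ls /user/oracle
ohsh> %hive0 show tables;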

Loading an Oracle Database Table

  1. Start an OHSH session.

    $ ohsh
    ohsh>
  2. Create the following resources:

    • SQL*Plus resource

      ohsh> create sqlplus resource sql0 connectid="<database connection url>"

      At the prompt, enter the database password.

    • JDBC resource.

      You can provide any name. A name that indicates the target schema is recommended.

      ohsh> create jdbc resource ora_mydatabase connectid="<database connection url>"

      At the prompt, enter the database password.

    • Additional Hive resources (if required). The default Hive resource hive0 connects to the default database in Hive. If you want to connect to another Hive database, create another resource:

      ohsh> create hive resource hive_mydatabase connectionurl="jdbc:hive2:///<Hive database name>"
  3. Use the load command to load files from HDFS into a target table in the Oracle database.

    The following command loads data from a delimited text file in an HDFS directory (<HDFS path>) into the target table in Oracle Database using the direct path option. A complete worked session that combines these steps appears after this procedure.

    ohsh> load oracle table ora_mydatabase:<target table in the Oracle database> from path hadoop0:/user/<HDFS path> using directpath

    Note:

    The direct path method (the default) is the fastest way to load a table, but it requires a partitioned target table. Direct path is always recommended for partitioned tables. Use the JDBC option to load into a non-partitioned target table.

    If the command does not explicitly state the load method, then OHSH automatically chooses the appropriate one: direct path (that is, Oracle OCI) if the target Oracle table is partitioned, and JDBC if it is not.

  4. After loading, check the number of rows.

    You can do this conveniently from the OHSH command line:

    ohsh> %sql0 select count(*) from <target table in Oracle Database>
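
The following sketch pulls these steps together into a single session. Every value is hypothetical: the connection URL dbhost:1521/orclpdb, the target table MOVIE_FACT, and the HDFS directory /user/oracle/moviedata. Substitute your own values, and use the JDBC method instead of direct path if the target table is not partitioned.

$ ohsh
ohsh> create sqlplus resource sql0 connectid="dbhost:1521/orclpdb"
ohsh> create jdbc resource ora_mydatabase connectid="dbhost:1521/orclpdb"
ohsh> load oracle table ora_mydatabase:MOVIE_FACT from path hadoop0:/user/oracle/moviedata using directpath
ohsh> %sql0 select count(*) from MOVIE_FACT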

Loading a Hive Table Into an Oracle Database Table

You can use OHSH to load a Hive table into a target table in an Oracle database. The command below shows how to do this using the direct path method.

ohsh> load oracle table ora_mydatabase:<target table in Oracle Database> from hive table hive0:<Hive table name>

Note that if the target table is partitioned, then OHSH uses direct path automatically. You do not need to enter using directpath explicitly in the command.

If the target table is non-partitioned, then specify the JDBC method instead:

ohsh> load oracle table ora_mydatabase:<target table in Oracle Database> from hive table hive0:<Hive table name> using jdbc

Note:

The load command assumes that the column names in the Hive table and in the Oracle Database table match identically. If they do not match, then use an OHSH loadermap.

Using OHSH Loadermaps

The simple load examples in this section assume the following:

  • Where we load data from a text file in Hadoop into an Oracle Database table, the declared order of columns in the target table maps correctly to the physical ordering of the delimited text fields in the file.

  • Where we load Hive tables into Oracle Database tables, the Hive and Oracle Database column names are identically matched.

However, in less straightforward cases where the column names (or the order of column names and delimited text fields) do not match, use the OHSH loadermap construct to correct these mismatches.

You can also use a loadermap to specify a subset of target columns to load, or, in the case of a load from a text file, to specify the format of a field in the load.

Loadermaps are not covered in this introduction.

Performance Tuning Oracle Loader for Hadoop in OHSH

Aside from network bandwidth, two factors can have significant impact on Oracle Loader for Hadoop performance. You can tune both in OHSH.

  • Degree of parallelism

    The degree of parallelism affects performance when Oracle Loader for Hadoop runs in Hadoop. For the default method (direct path), parallelism is determined by the number of reducer tasks. The higher the number of reducer tasks, the faster the performance. The default value is 4. To set the number of tasks:

    ohsh> set reducetasks 18

    For the JDBC option, parallelism is determined by the number of map tasks, and the optimal number is determined automatically. However, remember that if the target table is partitioned, direct path is faster than JDBC. (A combined tuning sketch appears after this list.)

  • Load balancing

    Performance is best when the load is balanced evenly across reduce tasks. The balance of the load is achieved by sampling the data. Sampling is enabled by default for loads that use the JDBC and the default copy methods.
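
As a combined sketch, a tuned direct path load might raise the reducer count immediately before issuing the load command. The value 18 and the paths below are illustrative; size the reducer count to your cluster and data volume.

ohsh> set reducetasks 18
ohsh> load oracle table ora_mydatabase:<target table in the Oracle database> from path hadoop0:/user/<HDFS path> using directpath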

Debugging in OHSH

Several OHSH settings control the availability of debugging information:

  • outputlevel

    The outputlevel is set to minimal by default. Set it to verbose in order to return a stack trace when a command fails:

    ohsh> set outputlevel verbose
  • logbadrecords

    ohsh> set logbadrecords true

    This is set to true by default.

These log files are informative for debugging:

  • Oracle Loader for Hadoop log files, which are written to the following HDFS location (one way to inspect them from OHSH is sketched after this list):

    /user/<username>/smartloader/jobhistory/oracle/<target table schema>/<target table name>/<OHSH job ID>/_olh
  • Log files generated by the map and reduce tasks.
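
For example, assuming the default hadoop0 resource passes its arguments through to the hadoop command, you could list a job's Oracle Loader for Hadoop logs from within OHSH (the path components are the placeholders shown above):

ohsh> %hadoop0 fs -ls /user/<username>/smartloader/jobhistory/oracle/<target table schema>/<target table name>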

Other OHSH Properties That are Useful for Oracle Loader for Hadoop

You can set these properties on the OHSH command line or in a script; a combined sketch appears after the list below.

  • dateformat

    ohsh> set dateformat "yyyy-MM-dd HH:mm:ss"

    The format string follows the Java date format syntax.

  • rejectlimit

    The number of rows that can be rejected before the load of a delimited text file fails.

  • fieldterminator

    The field terminator in loads of delimited text files.

  • hadooptnsadmin

    Location of an Oracle TNS admin directory in the Hadoop cluster.

  • hadoopwalletlocation

    Location of the Oracle Wallet directory in the Hadoop cluster.
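
As a sketch, and assuming that each property uses the same set <property> <value> syntax shown above for dateformat, a script might set several of these properties before running a load. All of the values and paths below are illustrative placeholders, not defaults.

ohsh> set dateformat "yyyy-MM-dd HH:mm:ss"
ohsh> set rejectlimit 10
ohsh> set fieldterminator ","
ohsh> set hadooptnsadmin /etc/oracle/tnsadmin
ohsh> set hadoopwalletlocation /etc/oracle/wallet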