B Using Copy to Hadoop With Direct Copy

Copy to Hadoop with the direct copy option copies data from an Oracle Database table directly to Oracle Data Pump files stored in HDFS.

Copy to Hadoop simplifies the data copy because it does not require intermediate storage on the database server. The remaining steps needed to make the data accessible to the Hadoop ecosystem, such as creating a Hive external table over the copied data and running Hive queries, are common to both copy options (stage and direct).

The intended audience for this section is power users of Hadoop with specialized requirements. All other users should use the Oracle Shell for Hadoop Loaders (OHSH) CLI for Copy to Hadoop operations. See Using Oracle Shell for Hadoop Loaders With Copy to Hadoop.

B.1 Manual Steps for Using Copy to Hadoop for Direct Copies

Follow these steps.

Getting Started

  1. First confirm that Copy to Hadoop is installed and configured.

  2. Ensure that the user account has sufficient privileges to copy a database table. (See Table Access Requirements for Copy to Hadoop.)

  3. Make sure that the table contains supported column types. (See Column Mappings and Data Type Conversions in Copy to Hadoop.)

  4. Log in to either a node in the Hadoop cluster or a system set up as a Hadoop client for the cluster.

  5. If you are connecting to a secure cluster, run kinit to authenticate the user.

  6. Run the Copy to Hadoop job using direct copy. See "Running the Copy to Hadoop Job (Using Direct Copy)" below.

  7. After the job succeeds, check the jobReport.log file in the _ctoh subdirectory of the output directory in HDFS. Check that the RowCount listed in the log file is equal to the number of rows in the database table.

  8. Connect to Apache Hive and create an external table from the Data Pump files, as sketched below. (See Creating a Hive Table.)
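
The last two steps can be run from the same client node. The following is a minimal sketch only: <mytab-hdfs-output-dir> and mytable_ext are placeholders, and the SerDe and input format class names shown are the ones Copy to Hadoop uses for Data Pump files; confirm the exact DDL in Creating a Hive Table.

# hdfs dfs -cat <mytab-hdfs-output-dir>/_ctoh/jobReport.log
# hive -e "CREATE EXTERNAL TABLE mytable_ext
  ROW FORMAT SERDE 'oracle.hadoop.hive.datapump.DPSerDe'
  STORED AS
    INPUTFORMAT 'oracle.hadoop.hive.datapump.DPInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION '<mytab-hdfs-output-dir>'"

The first command displays the job report so that you can compare RowCount with the row count of the source table; the second creates a Hive external table over the generated files. A quick SELECT COUNT(*) FROM mytable_ext should then return the same number of rows.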

Running the Copy to Hadoop Job (Using Direct Copy)

  1. Set the environment variables required by Copy to Hadoop.

    Locate the installation directory of Copy to Hadoop and set the CP2HADOOP_HOME Bash shell variable. For example:
    # export CP2HADOOP_HOME="/opt/oracle/orahivedp-3.1.0"
    
    Add the Copy to Hadoop JARs to HADOOP_CLASSPATH. For example:
    # export HADOOP_CLASSPATH="${CP2HADOOP_HOME}/jlib/*:${HADOOP_CLASSPATH}"
    

    Tip:

    When using Copy to Hadoop, you should always list $CP2HADOOP_HOME/jlib/* first in HADOOP_CLASSPATH. Another way to avoid JAR conflicts is to define an appropriately ordered HADOOP_CLASSPATH within a script that uses it, as in the wrapper-script sketch following Example 2 below.
  2. Run the Job.

    This is the command syntax:
    # hadoop jar ${CP2HADOOP_HOME}/jlib/orahivedp.jar oracle.hadoop.ctoh.CtohDriver \
    [-D <configuration-property>=<configuration-value>]+

Example 1: Running the Job on a Secure Cluster Using Oracle Wallet

# hadoop jar ${CP2HADOOP_HOME}/jlib/orahivedp.jar oracle.hadoop.ctoh.CtohDriver \
-D oracle.hadoop.ctoh.connection.tnsEntry=<my-oracle-tns-entry> \
-D oracle.hadoop.ctoh.connection.walletLoc=<local-oracle-wallet-dir> \
-D oracle.hadoop.ctoh.connection.tnsAdmin=<local-oracle-wallet-dir> \
-D oracle.hadoop.ctoh.connection.clusterWalletLoc=<oracle-wallet-dir-on-hadoop-cluster> \
-D oracle.hadoop.ctoh.connection.clusterTnsAdmin=<oracle-wallet-dir-on-hadoop-cluster> \
-D mapreduce.output.fileoutputformat.outputdir=<mytab-hdfs-output-dir> \
-D oracle.hadoop.ctoh.splitterType="BLOCK_SPLITTER" \
-D oracle.hadoop.ctoh.table=<dbSchema.dbTable> \
-D oracle.hadoop.ctoh.maxSplits=10

Example 2: Running the Job on an Unsecured Hadoop Cluster (for Demo Purposes Only)

# hadoop jar ${CP2HADOOP_HOME}/jlib/orahivedp.jar oracle.hadoop.ctoh.CtohDriver \
-D oracle.hadoop.ctoh.jdbc.url="jdbc:oracle:thin:@myhost:1521/myservice" \
-D oracle.hadoop.ctoh.connection.username="myuser" \
-D oracle.hadoop.ctoh.connection.password="mypassword" \
-D mapreduce.output.fileoutputformat.outputdir="mytable_output_dir" \
-D oracle.hadoop.ctoh.splitterType="BLOCK_SPLITTER" \
-D oracle.hadoop.ctoh.table="otherUser.mytable" \
-D oracle.hadoop.ctoh.maxSplits=10
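
The environment setup from step 1 and the job invocation can also be collected into a single wrapper script, which keeps the Copy to Hadoop JARs first on HADOOP_CLASSPATH as recommended in the Tip above. This is only a sketch, assuming the installation path shown earlier; replace the angle-bracket placeholders with your own values as in Example 1.

#!/bin/bash
# Illustrative wrapper for a direct-copy job on a secure cluster using Oracle Wallet.
export CP2HADOOP_HOME="/opt/oracle/orahivedp-3.1.0"
# List the Copy to Hadoop JARs first to avoid JAR conflicts.
export HADOOP_CLASSPATH="${CP2HADOOP_HOME}/jlib/*:${HADOOP_CLASSPATH}"

hadoop jar ${CP2HADOOP_HOME}/jlib/orahivedp.jar oracle.hadoop.ctoh.CtohDriver \
-D oracle.hadoop.ctoh.connection.tnsEntry=<my-oracle-tns-entry> \
-D oracle.hadoop.ctoh.connection.walletLoc=<local-oracle-wallet-dir> \
-D oracle.hadoop.ctoh.connection.tnsAdmin=<local-oracle-wallet-dir> \
-D mapreduce.output.fileoutputformat.outputdir=<mytab-hdfs-output-dir> \
-D oracle.hadoop.ctoh.splitterType="BLOCK_SPLITTER" \
-D oracle.hadoop.ctoh.table=<dbSchema.dbTable> \
-D oracle.hadoop.ctoh.maxSplits=10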

Performance Tuning Tips

You can control the degree of parallelism of the Copy to Hadoop job by specifying the number of map processes using the oracle.hadoop.ctoh.maxSplits property. The higher the number of map processes, the higher the parallelism. Note that each process connects to the database, so this value also determines the number of simultaneous connections to the database. Typically, a number such as 64 works well.
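
For example, to run either of the jobs above with 64 map processes (and therefore up to 64 simultaneous database connections), change the last property setting to:

-D oracle.hadoop.ctoh.maxSplits=64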

Required Configuration Properties

See the Copy to Hadoop Property Reference for more information on these and other properties.

  • oracle.hadoop.ctoh.table

  • mapreduce.output.fileoutputformat.outputdir

  • oracle.hadoop.ctoh.maxSplits

  • oracle.hadoop.ctoh.splitterType

Connection Properties for a secure Hadoop cluster using Oracle Wallet:

  • oracle.hadoop.ctoh.connection.walletLoc

  • oracle.hadoop.ctoh.connection.tnsAdmin

  • oracle.hadoop.ctoh.connection.tnsEntry

  • The following properties are also required if the Oracle Wallet directory on the Hadoop cluster is different from the directory on the Hadoop client:

    • oracle.hadoop.ctoh.connection.clusterWalletLoc

    • oracle.hadoop.ctoh.connection.clusterTnsAdmin

Connection Properties for Unsecured Hadoop clusters (for Demo Purposes Only):

For demo purposes, use the following properties in place of the properties used with secured clusters.

  • oracle.hadoop.ctoh.connection.username

  • oracle.hadoop.ctoh.connection.password

  • oracle.hadoop.ctoh.jdbc.url

An Incremental Copy Using Copy to Hadoop

To incrementally copy data from the same Oracle table to a pre-existing destination directory in HDFS, the following additional properties are required. (This configuration assumes that a Copy to Hadoop job was run initially to copy data from an Oracle Database table to datapump files in an HDFS directory.) An example command follows the property descriptions below.

  • oracle.hadoop.ctoh.whereClause

  • oracle.hadoop.ctoh.datapump.output

  • oracle.hadoop.ctoh.datapump.basename

oracle.hadoop.ctoh.datapump.output specifies a preexisting HDFS location that contains the datapump files from a previous run of Copy to Hadoop.

oracle.hadoop.ctoh.whereClause identifies the subset of rows to be copied from the Oracle table for the incremental load.

oracle.hadoop.ctoh.datapump.basename specifies a unique prefix for the datapump files. This property is used to generate unique datapump file names to prevent file name collisions during an incremental load.
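
Putting these together, an incremental load on an unsecured cluster (connection properties as in Example 2) might look like the following sketch. All values are placeholders, and LAST_MODIFIED is a hypothetical timestamp column. Rows matching the WHERE clause are written as new datapump files, prefixed with the given basename, into the directory that already holds the files from the initial copy, while the job logs go under the new job output directory.

# hadoop jar ${CP2HADOOP_HOME}/jlib/orahivedp.jar oracle.hadoop.ctoh.CtohDriver \
-D oracle.hadoop.ctoh.jdbc.url="jdbc:oracle:thin:@myhost:1521/myservice" \
-D oracle.hadoop.ctoh.connection.username="myuser" \
-D oracle.hadoop.ctoh.connection.password="mypassword" \
-D mapreduce.output.fileoutputformat.outputdir="mytable_incr_job_dir" \
-D oracle.hadoop.ctoh.datapump.output="mytable_output_dir" \
-D oracle.hadoop.ctoh.datapump.basename="dp_incr01" \
-D oracle.hadoop.ctoh.whereClause="LAST_MODIFIED > TO_DATE('2017-01-01','YYYY-MM-DD')" \
-D oracle.hadoop.ctoh.splitterType="BLOCK_SPLITTER" \
-D oracle.hadoop.ctoh.table="otherUser.mytable" \
-D oracle.hadoop.ctoh.maxSplits=10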

B.2 Copy to Hadoop Property Reference

This reference describes customer-accessible properties of Copy to Hadoop.

Copy to Hadoop Configuration Property Reference (for Direct Copy)

oracle.hadoop.ctoh.home

Type: String

Default Value: Value of the CP2HADOOP_HOME environment variable.

Description: This configuration property is used to locate jars required for the Copy to Hadoop job.

oracle.hadoop.ctoh.table

Type: String

Default Value: None.

Description: The name of the database table whose contents are copied to Hadoop as datapump files. The name can be schema qualified. For example, to specify the table EMPLOYEE in schema MANAGER, use MANAGER.EMPLOYEE.

mapreduce.output.fileoutputformat.outputdir

Type: String

Default Value: None.

Description: The name of the output directory where datapump files are created by the Hadoop job. The job output logs are also stored in the _ctoh subdirectory.

oracle.hadoop.ctoh.datapump.output

Type: String

Default Value: None.

Description: Specifies the destination directory for datapump files. If this property is not specified, the datapump files are written to the directory specified by the mapreduce.output.fileoutputformat.outputdir property.

oracle.hadoop.ctoh.datapump.basename

Type: String.

Default Value: dppart

Description: The prefix or base name of generated data pump files. For example, if a user specifies this property as "dp_tbl", then a generated datapump file is dp_tbl-m-00000.dmp.

oracle.hadoop.ctoh.datapump.extension

Type: String.

Default Value: dmp

Description: The suffix of generated data pump files. For example, if a user specifies this property as ".dp", then the generated datapump file is dppart-m-00000.dp.

oracle.hadoop.ctoh.maxSplits

Type: Integer.

Default Value: None.

Description: The number of datapump files that are created by the Hadoop job. This is also the number of mappers that will be produced by the Hadoop job.

oracle.hadoop.ctoh.splitterType

Type: String.

Default Value: None.

Description: The type of splitters that will be used to split the table data.

  • BLOCK_SPLITTER: This splitter divides the table into block ranges.

  • ROW_SPLITTER: This splitter divides the table into row ranges.

  • PARTITION_SPLITTER: If a table is partitioned, the partition splitter can be used. When this splitter is specified, the number of datapump files created by the Hadoop job is at most equal to the number of partitions in the table.

oracle.hadoop.ctoh.columns

Type: String.

Default Value: None.

Description: Specifies the subset of columns to be copied. For example, if a user specifies "NAME,MANAGER", then the data for the NAME and MANAGER columns is copied. If this property is not specified, then all columns are copied (unless filtered by a WHERE clause).

oracle.hadoop.ctoh.whereClause

Type: String.

Default Value: None.

Description: This property is used to copy a subset of rows. For example, to copy employees whose IDs are less than 1000 and greater than 500, specify the following WHERE clause: EMPLOYEE_ID < 1000 AND EMPLOYEE_ID > 500.

oracle.hadoop.ctoh.jdbc.url

Type: String.

Default Value: None.

Description: The JDBC URL used to connect to the database. This property can be used for demo purposes and on unsecured clusters. Use Oracle Wallet with secure Hadoop clusters in production environments.

oracle.hadoop.ctoh.connection.username

Type: String.

Default Value: None.

Description: The name of the Oracle Database user. This property can be used for demo purposes and on unsecured clusters. Use Oracle Wallet with secure Hadoop clusters in production environments.

oracle.hadoop.ctoh.connection.password

Type: String.

Default Value: None.

Description: The password of the Oracle Database user. This property can be used for demo purposes and on unsecured clusters. Use Oracle Wallet with secure Hadoop clusters in production environments.

oracle.hadoop.ctoh.connection.walletLoc

Type: String.

Default Value: None.

Description: Location of the Oracle Wallet directory on the Hadoop client.

When using an Oracle Wallet, you must also set the following properties:

  • oracle.hadoop.ctoh.connection.tnsAdmin

  • oracle.hadoop.ctoh.connection.tnsEntry

oracle.hadoop.ctoh.connection.clusterWalletLoc

Type: String.

Default Value: Value of oracle.hadoop.ctoh.connection.walletLoc.

Description: Location of the Oracle wallet directory on the Hadoop cluster. NOTE: This directory must be visible to all nodes in the Hadoop cluster. When using this property, you must also set the following properties:

  • oracle.hadoop.ctoh.connection.clusterTnsAdmin

  • oracle.hadoop.ctoh.connection.tnsEntry

oracle.hadoop.ctoh.connection.tnsAdmin

Type: String.

Default Value: Not Defined.

Description: Location of the directory on the Hadoop client containing SQL*Net configuration files such as sqlnet.ora and tnsnames.ora. Set this property so that you can use TNS entry names in database connection strings. When using Oracle Wallet, this property value can be the same as the value of the oracle.hadoop.ctoh.connection.walletLoc property.

You must set this property when using an Oracle Wallet as an external password store. See oracle.hadoop.ctoh.connection.walletLoc.

oracle.hadoop.ctoh.connection.clusterTnsAdmin

Type: String.

Default Value: Value of the property oracle.hadoop.ctoh.connection.tnsAdmin.

Description: Location of the directory on the Hadoop cluster containing SQL*Net configuration files such as sqlnet.ora and tnsnames.ora. NOTE: This directory must be visible to all nodes in the Hadoop cluster.

Set this property so that you can use TNS entry names in database connection strings. When using Oracle Wallet, this property value can be the same as the value of oracle.hadoop.ctoh.connection.clusterWalletLoc.

You must set this property when using an Oracle Wallet as an external password store (as Oracle recommends). See oracle.hadoop.ctoh.connection.clusterWalletLoc.

oracle.hadoop.ctoh.connection.tnsEntry

Type: String.

Default Value: None.

Description: The TNS entry name defined in the tnsnames.ora file. Use this property with oracle.hadoop.ctoh.connection.tnsAdmin. When using Oracle Wallet, make sure that the tnsEntry name matches your wallet credential.

oracle.hadoop.ctoh.cachePath

Type: String.

Default Value: ${mapreduce.output.fileoutputformat.outputdir}/../ctohCache

Description: Identifies the full path to an HDFS directory where Copy to Hadoop can create files that are loaded into the MapReduce distributed cache. The distributed cache is a facility for caching large, application-specific files and distributing them efficiently across the nodes in a cluster.