B Using Copy to Hadoop With Direct Copy

Copy to Hadoop with the direct copy option copies data from an Oracle Database table directly to Oracle Data Pump files stored in HDFS.

Copy to Hadoop simplifies the data copy because it does not require intermediate storage on the database server. The remaining steps needed to make the data accessible to the Hadoop ecosystem, such as creating a Hive external table over the copied data and running Hive queries, are common to both copy options (stage and direct).

The intended audience for this section is power users of Hadoop with specialized requirements. All other users should use the Oracle Shell for Hadoop Loaders (OHSH) CLI for Copy to Hadoop operations. See Using Oracle Shell for Hadoop Loaders With Copy to Hadoop.

B.1 Manual Steps for Using Copy to Hadoop for Direct Copies

Follow these steps.

Getting Started

  1. First confirm that Copy to Hadoop is installed and configured.

  2. Ensure that the user account has sufficient privileges to copy a database table. (See Table Access Requirements for Copy to Hadoop.)

  3. Make sure that the table contains supported column types. (See Column Mappings and Data Type Conversions in Copy to Hadoop.)

  4. Log in to either a node in the Hadoop cluster or a system set up as a Hadoop client for the cluster.

  5. If you are connecting to a secure cluster, run kinit to authenticate the user.

  6. Run the Copy to Hadoop job using direct copy. See "Running the Copy to Hadoop Job (Using Direct Copy)" below.

  7. After the job succeeds, check the jobReport.log file in the _ctoh subdirectory of the output directory in HDFS. Check that the RowCount listed in the log file is equal to the number of rows in the database table.

  8. Connect to Apache Hive and create an external table from the Data Pump files, as sketched below. (See Creating a Hive Table.)
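
The last two steps can be run from the same client node. The following is a minimal sketch only: <mytab-hdfs-output-dir> and mytable_ext are placeholders, and the SerDe and input format class names shown are the ones Copy to Hadoop uses for Data Pump files; confirm the exact DDL in Creating a Hive Table.

# hdfs dfs -cat <mytab-hdfs-output-dir>/_ctoh/jobReport.log
# hive -e "CREATE EXTERNAL TABLE mytable_ext
  ROW FORMAT SERDE 'oracle.hadoop.hive.datapump.DPSerDe'
  STORED AS
    INPUTFORMAT 'oracle.hadoop.hive.datapump.DPInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION '<mytab-hdfs-output-dir>'"

The first command displays the job report so that you can compare RowCount with the row count of the source table; the second creates a Hive external table over the generated files. A quick SELECT COUNT(*) FROM mytable_ext should then return the same number of rows.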

Running the Copy to Hadoop Job (Using Direct Copy)

  1. Set the environment variables required by Copy to Hadoop.

    Locate the installation directory of Copy to Hadoop and set the CP2HADOOP_HOME Bash shell variable. For example:
    # export CP2HADOOP_HOME="/opt/oracle/orahivedp-3.1.0"
    
    Add the Copy to Hadoop JARs to HADOOP_CLASSPATH. For example:
    # export HADOOP_CLASSPATH="${CP2HADOOP_HOME}/jlib/*:${HADOOP_CLASSPATH}"
    

    Tip:

    When using Copy to Hadoop, you should always list $CP2HADOOP_HOME/jlib/* first in HADOOP_CLASSPATH. Another way to avoid JAR conflicts is to define an appropriately ordered HADOOP_CLASSPATH within a script that uses it, as in the wrapper-script sketch following Example 2 below.
  2. Run the Job.

    This is the command syntax:
    # hadoop jar ${CP2HADOOP_HOME}/jlib/orahivedp.jar oracle.hadoop.ctoh.CtohDriver \
    [-D <configuration-property>=<configuration-value>]+

Example 1: Running the Job on a Secure Cluster Using Oracle Wallet

# hadoop jar ${CP2HADOOP_HOME}/jlib/orahivedp.jar oracle.hadoop.ctoh.CtohDriver \
-D oracle.hadoop.ctoh.connection.tnsEntry=<my-oracle-tns-entry> \
-D oracle.hadoop.ctoh.connection.walletLoc=<local-oracle-wallet-dir> \
-D oracle.hadoop.ctoh.connection.tnsAdmin=<local-oracle-wallet-dir> \
-D oracle.hadoop.ctoh.connection.clusterWalletLoc=<oracle-wallet-dir-on-hadoop-cluster> \
-D oracle.hadoop.ctoh.connection.clusterTnsAdmin=<oracle-wallet-dir-on-hadoop-cluster> \
-D mapreduce.output.fileoutputformat.outputdir=<mytab-hdfs-output-dir> \
-D oracle.hadoop.ctoh.splitterType="BLOCK_SPLITTER" \
-D oracle.hadoop.ctoh.table=<dbSchema.dbTable> \
-D oracle.hadoop.ctoh.maxSplits=10

Example 2: Running the Job on an Unsecured Hadoop Cluster (for Demo Purposes Only)

# hadoop jar ${CP2HADOOP_HOME}/jlib/orahivedp.jar oracle.hadoop.ctoh.CtohDriver \
-D oracle.hadoop.ctoh.jdbc.url="jdbc:oracle:thin:@myhost:1521/myservice" \
-D oracle.hadoop.ctoh.connection.username="myuser" \
-D oracle.hadoop.ctoh.connection.password="mypassword" \
-D mapreduce.output.fileoutputformat.outputdir="mytable_output_dir" \
-D oracle.hadoop.ctoh.splitterType="BLOCK_SPLITTER" \
-D oracle.hadoop.ctoh.table="otherUser.mytable" \
-D oracle.hadoop.ctoh.maxSplits=10
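
The environment setup from step 1 and the job invocation can also be collected into a single wrapper script, which keeps the Copy to Hadoop JARs first on HADOOP_CLASSPATH as recommended in the Tip above. This is only a sketch, assuming the installation path shown earlier; replace the angle-bracket placeholders with your own values as in Example 1.

#!/bin/bash
# Illustrative wrapper for a direct-copy job on a secure cluster using Oracle Wallet.
export CP2HADOOP_HOME="/opt/oracle/orahivedp-3.1.0"
# List the Copy to Hadoop JARs first to avoid JAR conflicts.
export HADOOP_CLASSPATH="${CP2HADOOP_HOME}/jlib/*:${HADOOP_CLASSPATH}"

hadoop jar ${CP2HADOOP_HOME}/jlib/orahivedp.jar oracle.hadoop.ctoh.CtohDriver \
-D oracle.hadoop.ctoh.connection.tnsEntry=<my-oracle-tns-entry> \
-D oracle.hadoop.ctoh.connection.walletLoc=<local-oracle-wallet-dir> \
-D oracle.hadoop.ctoh.connection.tnsAdmin=<local-oracle-wallet-dir> \
-D mapreduce.output.fileoutputformat.outputdir=<mytab-hdfs-output-dir> \
-D oracle.hadoop.ctoh.splitterType="BLOCK_SPLITTER" \
-D oracle.hadoop.ctoh.table=<dbSchema.dbTable> \
-D oracle.hadoop.ctoh.maxSplits=10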

Performance Tuning Tips

You can control the degree of parallelism of the Copy to Hadoop job by specifying the number of map processes using the oracle.hadoop.ctoh.maxSplits property. The higher the number of map processes, the higher the parallelism. Note that each process connects to the database, so this value also determines the number of simultaneous connections to the database. Typically, a number such as 64 works well.
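
For example, to run either of the jobs above with 64 map processes (and therefore up to 64 simultaneous database connections), change the last property setting to:

-D oracle.hadoop.ctoh.maxSplits=64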

Required Configuration Properties

See the Copy to Hadoop Property Reference for more information on these and other properties.

  • oracle.hadoop.ctoh.table

  • mapreduce.output.fileoutputformat.outputdir

  • oracle.hadoop.ctoh.maxSplits

  • oracle.hadoop.ctoh.splitterType

Connection Properties for a secure Hadoop cluster using Oracle Wallet:

  • oracle.hadoop.ctoh.connection.walletLoc

  • oracle.hadoop.ctoh.connection.tnsAdmin

  • oracle.hadoop.ctoh.connection.tnsEntry

  • The following properties are also required if the Oracle Wallet directory on the Hadoop cluster is different from the directory on the Hadoop client:

    • oracle.hadoop.ctoh.connection.clusterWalletLoc

    • oracle.hadoop.ctoh.connection.clusterTnsAdmin

Connection Properties for Unsecured Hadoop clusters (for Demo Purposes Only):

For demo purposes, use the following properties in place of the properties used with secured clusters.

  • oracle.hadoop.ctoh.connection.username

  • oracle.hadoop.ctoh.connection.password

  • oracle.hadoop.ctoh.jdbc.url

An Incremental Copy Using Copy to Hadoop

To incrementally copy data from the same Oracle table to a pre-existing destination directory in HDFS, the following additional properties are required. (This configuration assumes that a Copy to Hadoop job was run initially to copy data from an Oracle Database table to datapump files in an HDFS directory.) An example command follows the property descriptions below.

  • oracle.hadoop.ctoh.whereClause

  • oracle.hadoop.ctoh.datapump.output

  • oracle.hadoop.ctoh.datapump.basename

oracle.hadoop.ctoh.datapump.output specifies a preexisting HDFS location that contains the datapump files from a previous run of Copy to Hadoop.

oracle.hadoop.ctoh.whereClause identifies the subset of rows to be copied from the Oracle table for the incremental load.

oracle.hadoop.ctoh.datapump.basename specifies a unique prefix for the datapump files. This property is used to generate unique datapump file names to prevent file name collisions during an incremental load.
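
Putting these together, an incremental load on an unsecured cluster (connection properties as in Example 2) might look like the following sketch. All values are placeholders, and LAST_MODIFIED is a hypothetical timestamp column. Rows matching the WHERE clause are written as new datapump files, prefixed with the given basename, into the directory that already holds the files from the initial copy, while the job logs go under the new job output directory.

# hadoop jar ${CP2HADOOP_HOME}/jlib/orahivedp.jar oracle.hadoop.ctoh.CtohDriver \
-D oracle.hadoop.ctoh.jdbc.url="jdbc:oracle:thin:@myhost:1521/myservice" \
-D oracle.hadoop.ctoh.connection.username="myuser" \
-D oracle.hadoop.ctoh.connection.password="mypassword" \
-D mapreduce.output.fileoutputformat.outputdir="mytable_incr_job_dir" \
-D oracle.hadoop.ctoh.datapump.output="mytable_output_dir" \
-D oracle.hadoop.ctoh.datapump.basename="dp_incr01" \
-D oracle.hadoop.ctoh.whereClause="LAST_MODIFIED > TO_DATE('2017-01-01','YYYY-MM-DD')" \
-D oracle.hadoop.ctoh.splitterType="BLOCK_SPLITTER" \
-D oracle.hadoop.ctoh.table="otherUser.mytable" \
-D oracle.hadoop.ctoh.maxSplits=10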

B.2 Copy to Hadoop Property Reference

This reference describes customer-accessible properties of Copy to Hadoop.

Copy to Hadoop Configuration Property Reference (for Direct Copy)

oracle.hadoop.ctoh.home

Type: String

Default Value: Value of the CP2HADOOP_HOME environment variable.

Description: This configuration property is used to locate jars required for the Copy to Hadoop job.

oracle.hadoop.ctoh.table

Type: String

Default Value: None.

Description: The name of the database table whose contents are copied to Hadoop as datapump files. The name can be schema qualified. For example, to specify the table EMPLOYEE in schema MANAGER, use MANAGER.EMPLOYEE.

mapreduce.output.fileoutputformat.outputdir

Type: String

Default Value: None.

Description: The name of the output directory where datapump files are created by the Hadoop job. The job output logs are also stored in the _ctoh subdirectory.

oracle.hadoop.ctoh.datapump.output

Type: String

Default Value: None.

Description: Specifies the destination directory for datapump files. If this property is not specified, the datapump files are written to the directory specified by the mapreduce.output.fileoutputformat.outputdir property.

oracle.hadoop.ctoh.datapump.basename

Type: String.

Default Value: dppart

Description: The prefix or base name of generated data pump files. For example, if a user specifies this property as "dp_tbl", then a generated datapump file is dp_tbl-m-00000.dmp.

oracle.hadoop.ctoh.datapump.extension

Type: String.

Default Value: dmp

Description: The suffix of generated data pump files. For example, if a user specifies this property as ".dp", then the generated datapump file is dppart-m-00000.dp.

oracle.hadoop.ctoh.maxSplits

Type: Integer.

Default Value: None.

Description: The number of datapump files that are created by the Hadoop job. This is also the number of mappers that will be produced by the Hadoop job.

oracle.hadoop.ctoh.splitterType

Type: String.

Default Value: None.

Description: The type of splitters that will be used to split the table data.

  • BLOCK_SPLITTER: This splitter divides the table into block ranges.

  • ROW_SPLITTER: This splitter divides the table into row ranges.

  • PARTITION_SPLITTER: If a table is partitioned, the partition splitter can be used. When this splitter is specified, the number of datapump files created by the Hadoop job is at most equal to the number of partitions in the table.

oracle.hadoop.ctoh.columns

Type: String.

Default Value: None.

Description: Specifies the subset of columns to be copied. For example, if a user specifies "NAME,MANAGER", then the data for the NAME and MANAGER columns is copied. If this property is not specified, then all columns are copied (unless filtered by a WHERE clause).

oracle.hadoop.ctoh.whereClause

Type: String.

Default Value: None.

Description: This property is used to copy a subset of rows. For example, to copy employees whose IDs are less than 1000 and greater than 500, specify the following WHERE clause: EMPLOYEE_ID < 1000 AND EMPLOYEE_ID > 500.

oracle.hadoop.ctoh.jdbc.url

Type: String.

Default Value: None.

Description: The JDBC URL used to connect to the database. This property can be used for demo purposes and on unsecured clusters. Use Oracle Wallet with secure Hadoop clusters in production environments.

oracle.hadoop.ctoh.connection.username

Type: String.

Default Value: None.

Description: The name of the Oracle Database user. This property can be used for demo purposes and on unsecured clusters. Use Oracle Wallet with secure Hadoop clusters in production environments.

oracle.hadoop.ctoh.connection.password

Type: String.

Default Value: None.

Description: The password of the Oracle Database user. This property can be used for demo purposes and on unsecured clusters. Use Oracle Wallet with secure Hadoop clusters in production environments.

oracle.hadoop.ctoh.connection.walletLoc

Type: String.

Default Value: None.

Description: Location of the Oracle Wallet directory on the Hadoop client.

When using an Oracle Wallet, you must also set the following properties:

  • oracle.hadoop.ctoh.connection.tnsAdmin

  • oracle.hadoop.ctoh.connection.tnsEntry

oracle.hadoop.ctoh.connection.clusterWalletLoc

Type: String.

Default Value: Value of oracle.hadoop.ctoh.connection.walletLoc.

Description: Location of the Oracle wallet directory on the Hadoop cluster. NOTE: This directory must be visible to all nodes in the Hadoop cluster. When using this property, you must also set the following properties:

  • oracle.hadoop.ctoh.connection.clusterTnsAdmin

  • oracle.hadoop.ctoh.connection.tnsEntry

oracle.hadoop.ctoh.connection.tnsAdmin

Type: String.

Default Value: Not Defined.

Description: Location of the directory on the Hadoop client containing SQL*Net configuration files such as sqlnet.ora and tnsnames.ora. Set this property so that you can use TNS entry names in database connection strings. When using Oracle Wallet, this property value can be the same as the value of the oracle.hadoop.ctoh.connection.walletLoc property.

You must set this property when using an Oracle Wallet as an external password store. See oracle.hadoop.ctoh.connection.walletLoc.

oracle.hadoop.ctoh.connection.clusterTnsAdmin

Type: String.

Default Value: Value of the property oracle.hadoop.ctoh.connection.tnsAdmin.

Description: Location of the directory on the Hadoop cluster containing SQL*Net configuration files such as sqlnet.ora and tnsnames.ora. NOTE: This directory must be visible to all nodes in the Hadoop cluster.

Set this property so that you can use TNS entry names in database connection strings. When using Oracle Wallet, this property value can be the same as the value of oracle.hadoop.ctoh.connection.clusterWalletLoc.

You must set this property when using an Oracle Wallet as an external password store (as Oracle recommends). See oracle.hadoop.ctoh.connection.clusterWalletLoc.

oracle.hadoop.ctoh.connection.tnsEntry

Type: String.

Default Value: None.

Description: The TNS entry name defined in the tnsnames.ora file. Use this property with oracle.hadoop.ctoh.connection.tnsAdmin. When using Oracle Wallet, make sure that the tnsEntry name matches your wallet credential.

oracle.hadoop.ctoh.cachePath

Type: String.

Default Value: ${mapreduce.output.fileoutputformat.outputdir}/../ctohCache

Description: Identifies the full path to an HDFS directory where Copy to Hadoop can create files that are loaded into the MapReduce distributed cache. The distributed cache is a facility for caching large, application-specific files and distributing them efficiently across the nodes in a cluster.