This topic does not apply to Oracle Big Data SQL Cloud Service.
Copy to Hadoop with the direct copy option copies data from an Oracle Database table directly to Oracle Datapump files stored in HDFS.
Copy to Hadoop simplifies the data copy because it does not require intermediate storage on the database server. The remaining steps needed to make the data accessible to the Hadoop ecosystem, such as creating a Hive external table over the copied data and running Hive queries, are common to both copy options (stage and direct).
The intended audience for this section is power users of Hadoop with specialized requirements. All other users should use the Oracle Shell for Hadoop Loaders (OHSH) CLI for Copy to Hadoop operations. See Using Oracle Shell for Hadoop Loaders With Copy to Hadoop.
Follow these steps.
Getting Started
First confirm that Copy to Hadoop is installed and configured.
Ensure that the user account has sufficient privileges to copy a database table. (See Table Access Requirements for Copy to Hadoop.)
Make sure that the table contains supported column types. (See About Column Mappings and Data Type Conversions.)
Log in to either a node in the Hadoop cluster or a system set up as a Hadoop client for the cluster.
If you are connecting to a secure cluster, run kinit to authenticate the user, as described in the Oracle Big Data Appliance Software User’s Guide.
Run the Copy to Hadoop job using direct copy. See “Running the Copy to Hadoop Job (Using Direct Copy)” below.
After the job succeeds, check the jobReport.log file in the _ctoh subdirectory of the output directory in HDFS. Check that the RowCount listed in the log file is equal to the number of rows in the database table.
Connect to Apache Hive and create an external table from the Data Pump files. (See Creating a Hive Table.)
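The Hive step can be sketched as follows. This writes a Hive DDL script that creates an external table over the copied Data Pump files; the table name and HDFS location are placeholders, and the SerDe and input-format class names are the ones shipped with Copy to Hadoop (verify them against your installation). No column list is needed, because the Data Pump files carry the column metadata.

```shell
# Write a Hive DDL script for an external table over the copied Data Pump files.
# Placeholders: the table name (mytable_hive) and the HDFS LOCATION path.
cat > create_mytable.hql <<'EOF'
CREATE EXTERNAL TABLE mytable_hive
ROW FORMAT SERDE 'oracle.hadoop.hive.datapump.DPSerDe'
STORED AS
  INPUTFORMAT  'oracle.hadoop.hive.datapump.DPInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/user/oracle/mytable_output_dir';
EOF
# Run on a configured Hive client with: hive -f create_mytable.hql
```
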
Running the Copy to Hadoop Job (Using Direct Copy)
Set the environment variables required by Copy to Hadoop.

Set the CP2HADOOP_HOME Bash shell variable. For example:

# export CP2HADOOP_HOME="/opt/oracle/orahivedp-3.1.0"

Add the Copy to Hadoop JARs to HADOOP_CLASSPATH. For example:

# export HADOOP_CLASSPATH="${CP2HADOOP_HOME}/jlib/*:${HADOOP_CLASSPATH}"
Tip:
When using Copy to Hadoop, you should always list $CP2HADOOP_HOME/jlib/* first in HADOOP_CLASSPATH. Another way to avoid JAR conflicts is to define an appropriately ordered HADOOP_CLASSPATH within a script that uses it.

Run the job:
# hadoop jar ${CP2HADOOP_HOME}/jlib/orahivedp.jar oracle.hadoop.ctoh.CtohDriver \
  [-D <configuration-property>=<configuration-value>]+
Example 1: Running the Job on a Secure Cluster Using Oracle Wallet
# hadoop jar ${CP2HADOOP_HOME}/jlib/orahivedp.jar oracle.hadoop.ctoh.CtohDriver \
  -D oracle.hadoop.ctoh.connection.tnsEntry=<my-oracle-tns-entry> \
  -D oracle.hadoop.ctoh.connection.walletLoc=<local-oracle-wallet-dir> \
  -D oracle.hadoop.ctoh.connection.tnsAdmin=<local-oracle-wallet-dir> \
  -D oracle.hadoop.ctoh.connection.clusterWalletLoc=<oracle-wallet-dir-on-hadoop-cluster> \
  -D oracle.hadoop.ctoh.connection.clusterTnsAdmin=<oracle-wallet-dir-on-hadoop-cluster> \
  -D mapreduce.output.fileoutputformat.outputdir=<mytab-hdfs-output-dir> \
  -D oracle.hadoop.ctoh.splitterType="BLOCK_SPLITTER" \
  -D oracle.hadoop.ctoh.table=<dbSchema.dbTable> \
  -D oracle.hadoop.ctoh.maxSplits=10
Example 2: Running the Job on an Unsecured Hadoop Cluster (for Demo Purposes Only)
# hadoop jar ${CP2HADOOP_HOME}/jlib/orahivedp.jar oracle.hadoop.ctoh.CtohDriver \
  -D oracle.hadoop.ctoh.jdbc.url="jdbc:oracle:thin:@myhost:1521/myservice" \
  -D oracle.hadoop.ctoh.connection.username="myuser" \
  -D oracle.hadoop.ctoh.connection.password="mypassword" \
  -D mapreduce.output.fileoutputformat.outputdir="mytable_output_dir" \
  -D oracle.hadoop.ctoh.splitterType="BLOCK_SPLITTER" \
  -D oracle.hadoop.ctoh.table="otherUser.mytable" \
  -D oracle.hadoop.ctoh.maxSplits=10
Performance Tuning Tips
You can control the degree of parallelism of the Copy to Hadoop job by specifying the number of map processes using the oracle.hadoop.ctoh.maxSplits property. The higher the number of map processes, the higher the parallelism. Note that each process connects to the database, so this value also determines the number of simultaneous connections to the database. Typically, a number such as 64 works well.
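A minimal sketch of how maxSplits might be parameterized is shown below, assuming placeholder table and output-directory names. The script only assembles and prints the command so it can be reviewed before being run on a Hadoop client:

```shell
#!/bin/bash
# Assemble (but do not execute) a Copy to Hadoop command with a tunable
# number of splits. Table, directory, and connection settings are placeholders.
MAX_SPLITS="${1:-64}"   # each split opens its own database connection

CMD='hadoop jar ${CP2HADOOP_HOME}/jlib/orahivedp.jar oracle.hadoop.ctoh.CtohDriver'
CMD="$CMD -D oracle.hadoop.ctoh.table=myschema.mytable"
CMD="$CMD -D mapreduce.output.fileoutputformat.outputdir=mytable_output_dir"
CMD="$CMD -D oracle.hadoop.ctoh.splitterType=BLOCK_SPLITTER"
CMD="$CMD -D oracle.hadoop.ctoh.maxSplits=${MAX_SPLITS}"

echo "$CMD"   # review the command, then run it on a Hadoop client
```
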
Required Configuration Properties
See “Copy to Hadoop Configuration Property Reference (for Direct Copy)” for more information on these and other properties.
oracle.hadoop.ctoh.table
mapreduce.output.fileoutputformat.outputdir
oracle.hadoop.ctoh.maxSplits
oracle.hadoop.ctoh.splitterType
Connection Properties for a secure Hadoop cluster using Oracle Wallet:
oracle.hadoop.ctoh.connection.walletLoc
oracle.hadoop.ctoh.connection.tnsAdmin
oracle.hadoop.ctoh.connection.tnsEntry
The following properties are also required if the Oracle Wallet directory on the Hadoop cluster is different from the directory on the Hadoop client:
oracle.hadoop.ctoh.connection.clusterWalletLoc
oracle.hadoop.ctoh.connection.clusterTnsAdmin
Connection Properties for Unsecured Hadoop clusters (for Demo Purposes Only):
For demo purposes, use the following properties in place of the properties used with secured clusters.
oracle.hadoop.ctoh.connection.username
oracle.hadoop.ctoh.connection.password
oracle.hadoop.ctoh.jdbc.url
An Incremental Copy using Copy to Hadoop
To incrementally copy data from the same Oracle table to a pre-existing destination directory in HDFS, the following additional properties are required. (This configuration assumes that a Copy to Hadoop job was run initially to copy data from an Oracle Database table to datapump files in an HDFS directory.)
oracle.hadoop.ctoh.whereClause
oracle.hadoop.ctoh.datapump.output
oracle.hadoop.ctoh.datapump.basename
oracle.hadoop.ctoh.datapump.output specifies a preexisting HDFS location that contains the datapump files from a previous run of Copy to Hadoop.
oracle.hadoop.ctoh.whereClause identifies the subset of rows to be copied from the Oracle table for the incremental load.
oracle.hadoop.ctoh.datapump.basename specifies a unique prefix for the datapump files. This property is used to generate unique datapump file names to prevent file name collisions during an incremental load.
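Putting the incremental properties together, a run might look like the following sketch, using the demo-style username/password connection from Example 2. All values are placeholders, and the WHERE clause assumes a hypothetical ID column; the command is written to a file for review rather than executed directly:

```shell
# Dry run: write the assembled incremental-copy command to a file for review.
# All property values are placeholders; ID is a hypothetical column.
# datapump.output points at the directory populated by the initial load;
# basename keeps the new datapump file names from colliding with the old ones.
tee ctoh_incremental.sh <<'EOF'
hadoop jar ${CP2HADOOP_HOME}/jlib/orahivedp.jar oracle.hadoop.ctoh.CtohDriver \
  -D oracle.hadoop.ctoh.jdbc.url="jdbc:oracle:thin:@myhost:1521/myservice" \
  -D oracle.hadoop.ctoh.connection.username="myuser" \
  -D oracle.hadoop.ctoh.connection.password="mypassword" \
  -D mapreduce.output.fileoutputformat.outputdir="mytable_job2_dir" \
  -D oracle.hadoop.ctoh.datapump.output="mytable_output_dir" \
  -D oracle.hadoop.ctoh.datapump.basename="dp_incr01" \
  -D oracle.hadoop.ctoh.whereClause="ID > 500 AND ID < 1000" \
  -D oracle.hadoop.ctoh.splitterType="BLOCK_SPLITTER" \
  -D oracle.hadoop.ctoh.table="otherUser.mytable" \
  -D oracle.hadoop.ctoh.maxSplits=10
EOF
# Review ctoh_incremental.sh, then run it with: bash ctoh_incremental.sh
```
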
This reference describes customer-accessible properties of Copy to Hadoop.
Copy to Hadoop Configuration Property Reference (for Direct Copy)
Property | Description
---|---
oracle.hadoop.ctoh.home | Type: String. Default Value: Value of the CP2HADOOP_HOME environment variable. Description: This configuration property is used to locate the JARs required for the Copy to Hadoop job.
oracle.hadoop.ctoh.table | Type: String. Default Value: None. Description: The name of the database table whose content is copied to Hadoop as datapump files. The name can also be schema qualified. For example, to specify the table EMPLOYEE in schema MANAGER, use MANAGER.EMPLOYEE.
mapreduce.output.fileoutputformat.outputdir | Type: String. Default Value: None. Description: The name of the output directory where datapump files are created by the Hadoop job. The job output logs are also stored in the _ctoh subdirectory of this directory.
oracle.hadoop.ctoh.datapump.output | Type: String. Default Value: None. Description: Specifies the destination directory for datapump files. If this property is not specified, the datapump files are written to the directory specified by the mapreduce.output.fileoutputformat.outputdir property.
oracle.hadoop.ctoh.datapump.basename | Type: String. Description: The prefix (base name) of generated data pump files. For example, if a user specifies this property as "dp_tbl", the generated datapump file names begin with dp_tbl.
oracle.hadoop.ctoh.datapump.extension | Type: String. Description: The suffix of generated data pump files. For example, if a user specifies this property as ".dp", the generated datapump file names end with .dp.
oracle.hadoop.ctoh.maxSplits | Type: Integer. Default Value: None. Description: The number of datapump files that are created by the Hadoop job. This is also the number of mappers produced by the Hadoop job.
oracle.hadoop.ctoh.splitterType | Type: String. Default Value: None. Description: The type of splitter used to split the table data, for example, BLOCK_SPLITTER.
oracle.hadoop.ctoh.columns | Type: String. Default Value: None. Description: Specifies the subset of columns to be copied. For example, if a user specifies "NAME,MANAGER", then only the data for the NAME and MANAGER columns is copied.
oracle.hadoop.ctoh.whereClause | Type: String. Default Value: None. Description: This property is used to copy a subset of rows. For example, to copy employees whose IDs are less than 1000 and greater than 500, specify a WHERE clause such as ID < 1000 AND ID > 500.
oracle.hadoop.ctoh.jdbc.url | Type: String. Default Value: None. Description: The JDBC URL used to connect to the database. This property can be used for demo purposes and non-secure clusters. Use Oracle Wallet with secure Hadoop clusters in production environments.
oracle.hadoop.ctoh.connection.username | Type: String. Default Value: None. Description: The name of the Oracle Database user. This property can be used for demo purposes and non-secure clusters. Use Oracle Wallet with secure Hadoop clusters in production environments.
oracle.hadoop.ctoh.connection.password | Type: String. Default Value: None. Description: The password of the Oracle Database user. This property can be used for demo purposes and non-secure clusters. Use Oracle Wallet with secure Hadoop clusters in production environments.
oracle.hadoop.ctoh.connection.walletLoc | Type: String. Default Value: None. Description: Location of the Oracle Wallet directory on the Hadoop client. When using an Oracle Wallet, you must also set the oracle.hadoop.ctoh.connection.tnsAdmin and oracle.hadoop.ctoh.connection.tnsEntry properties.
oracle.hadoop.ctoh.connection.clusterWalletLoc | Type: String. Default Value: Value of oracle.hadoop.ctoh.connection.walletLoc. Description: Location of the Oracle Wallet directory on the Hadoop cluster. NOTE: This directory must be visible to all nodes in the Hadoop cluster. When using this property, you must also set the oracle.hadoop.ctoh.connection.clusterTnsAdmin property.
oracle.hadoop.ctoh.connection.tnsAdmin | Type: String. Default Value: Not Defined. Description: Location of the directory on the Hadoop client containing SQL*Net configuration files such as tnsnames.ora and sqlnet.ora. You must set this property when using an Oracle Wallet as an external password store.
oracle.hadoop.ctoh.connection.clusterTnsAdmin | Type: String. Default Value: Value of the property oracle.hadoop.ctoh.connection.tnsAdmin. Description: Location of the directory on the Hadoop cluster containing SQL*Net configuration files such as tnsnames.ora and sqlnet.ora. Set this property so that you can use TNS entry names in database connection strings. When using Oracle Wallet, this property value can be the same as the value of oracle.hadoop.ctoh.connection.clusterWalletLoc. You must set this property when using an Oracle Wallet as an external password store (as Oracle recommends).
oracle.hadoop.ctoh.connection.tnsEntry | Type: String. Default Value: None. Description: The TNS entry name defined in the tnsnames.ora file.
oracle.hadoop.ctoh.cachePath | Type: String. Description: Identifies the full path to an HDFS directory where Copy to Hadoop can create files that are loaded into the MapReduce distributed cache. The distributed cache is a facility for caching large, application-specific files and distributing them efficiently across the nodes in a cluster.