Using odcp Command Line Utility to Copy Data

Use the odcp command line utility to manage jobs that copy data between HDFS on your cluster and remote storage providers.

Note

The odcp CLI can be used only in clusters that use Cloudera Distribution including Hadoop (CDH).

odcp uses Spark to provide parallel transfer of one or more files. It takes the input file and splits it into chunks, which are then transferred in parallel to the destination. By default, transferred chunks are then merged back into one output file.

odcp supports copying files when using the following:
  • Apache Hadoop Distributed File System (HDFS)

  • Apache WebHDFS and Secure WebHDFS (SWebHDFS)

  • Amazon Simple Storage Service (S3)

  • Oracle Cloud Infrastructure Object Storage

  • Hypertext Transfer Protocol (HTTP) and HTTP Secure (HTTPS) — Used for sources only.
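
For example, a basic copy from HDFS to an object storage container might look like the following. This is a minimal sketch: the container name (myContainer), provider name (myProvider), and file paths are placeholders, and the swift:// URI form assumes a Swift-compatible provider configured on the cluster; adjust them to match your own storage configuration.

odcp hdfs:///user/oracle/weblogs/access.log swift://myContainer.myProvider/access.log

odcp splits access.log into chunks, copies the chunks in parallel Spark tasks, and, by default, merges them back into a single object at the destination.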

Before You Begin

The following topics describe how to use odcp command options to copy data between HDFS on your cluster and external storage providers.

For all the operations, every cluster node must have:
  • Access to all running storage services.

  • All required credentials established, for example, credentials for Oracle Cloud Infrastructure Object Storage instances.

See odcp Reference for the odcp syntax, parameters, and options.

Using bda-oss-admin with odcp

Use bda-oss-admin commands to configure the cluster for use with storage providers. This makes it easier and faster to use odcp with the storage provider.

Any user with access privileges to the cluster can run odcp.

Note

To copy data between HDFS and a storage provider, for example, Oracle Cloud Infrastructure Object Storage, you must have an account with the data store and access to it.

To copy data between HDFS and a storage provider:

  1. Open a command shell and connect to the cluster. You can connect to any node for which you have HDFS access rights. See Connecting to a Cluster Node Using SSH.
  2. Set shell environment variable values for Cloudera Manager access. See Understanding bda-oss-admin Environment Variables.

    Set these environment variables:

    • CM_ADMIN — Cloudera Manager administrator (cluster administrator) user name

    • CM_PASSWORD — Cloudera Manager administrator (cluster administrator) password

    • CM_URL — Cloudera Manager URL
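
    For example, in a Bash shell (the values shown are placeholders for your own Cloudera Manager instance):

      export CM_ADMIN="admin"
      export CM_PASSWORD="my_cm_password"
      export CM_URL="https://my-cluster-node-1.example.com:7183"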

  3. You must also have access privileges to the storage provider you want to use.
    • Set the PROVIDER_NAME environment variable to refer to the provider you want to use. For example, if you have a provider named rssfeeds-admin2, use SSH to connect to the cluster and enter:

      PROVIDER_NAME="rssfeeds-admin2"

      Or, in a shell script:

      export PROVIDER_NAME="rssfeeds-admin2"
  4. You can use the hadoop fs -ls command to browse your HDFS and storage data.
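    For example (the object store path is a placeholder, and its exact form depends on the provider configured on your cluster):

      hadoop fs -ls /user/oracle
      hadoop fs -ls swift://myContainer.myProvider/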
  5. Use the odcp command to copy files. See odcp Reference.

Copying Data on a Secure Cluster

Using odcp to copy data on a Kerberos-enabled cluster requires some additional steps.

Note

In Oracle Big Data Service, a cluster is Kerberos-enabled when it's created with the Secure and Highly Available (HA) option selected.

If you want to execute a long-running job or run odcp from an automated shell script or from a workflow service such as Apache Oozie, then you must pass to the odcp command a Kerberos principal and the full path to the principal's keytab file, as described below:

  1. Use SSH to connect to any node on the cluster.
  2. Choose the principal to be used for running the odcp command. In the example below, it's odcp@BDSERVICE.EXAMPLE.COM.
  3. Generate a keytab file for the principal, as shown below: 
    $ ktutil
    ktutil:  addent -password -p odcp@BDSERVICE.EXAMPLE.COM -k 1 -e rc4-hmac
    Password for odcp@BDSERVICE.EXAMPLE.COM: [enter your password]
    ktutil:  addent -password -p odcp@BDSERVICE.EXAMPLE.COM -k 1 -e aes256-cts
    Password for odcp@BDSERVICE.EXAMPLE.COM: [enter your password]
    ktutil:  wkt /home/odcp/odcp.keytab
    ktutil:  quit
  4. Pass the principal and the full path to the keytab file to the odcp command, for example:
    odcp --krb-principal odcp@BDSERVICE.EXAMPLE.COM --krb-keytab /home/odcp/odcp.keytab source destination
If you just want to execute a short-running odcp job from the console, you don't have to generate a keytab file or specify the principal. You just need an active Kerberos ticket (created using the kinit command).
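
For example, a minimal interactive session might look like this (the principal, paths, and container name are placeholders):

kinit odcp@BDSERVICE.EXAMPLE.COM
odcp hdfs:///user/odcp/data.csv swift://myContainer.myProvider/data.csv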

Retrying a Failed Copy Job

If a copy job fails, you can retry it. When the job is retried, the source and destination are automatically synchronized, so odcp doesn't retransfer file parts that were already copied successfully.

Use the following:

odcp --retry <source> <target>
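
For example, if the following copy job was interrupted (the paths and container name are placeholders):

odcp hdfs:///user/oracle/logs swift://myContainer.myProvider/logs

you can rerun the same command with --retry, and only the file parts that weren't copied successfully are transferred again:

odcp --retry hdfs:///user/oracle/logs swift://myContainer.myProvider/logs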

Debugging odcp

You must configure the cluster to enable debugging for odcp.

Configuring a Cluster to Enable Debugging

To configure the cluster:
  1. As the root user, add the following lines to /etc/hadoop/conf/log4j.properties on each node of the cluster:
    log4j.logger.oracle.paas.bdcs.conductor=DEBUG
    log4j.logger.org.apache.hadoop.fs.swift.http=DEBUG
    

    Or, to configure all nodes:

    $ dcli -c $NODES "echo 'log4j.logger.oracle.paas.bdcs.conductor=DEBUG' >> /etc/hadoop/conf/log4j.properties"
    $ dcli -c $NODES "echo 'log4j.logger.org.apache.hadoop.fs.swift.http=DEBUG' >> /etc/hadoop/conf/log4j.properties"
    
    
  2. As the oracle user, find the logs in the following HDFS directory:

     hdfs:///tmp/logs/<username>/logs/<application_id>/

    For example:

    $ hadoop fs -ls /tmp/logs/oracle/logs/
    Found 15 items
    drwxrwx---   - oracle hadoop       0 2016-08-23 07:29 /tmp/logs/oracle/logs/application_14789235086687_0029
    drwxrwx---   - oracle hadoop       0 2016-08-23 08:07 /tmp/logs/oracle/logs/application_14789235086687_0030
    drwxrwx---   - oracle hadoop       0 2016-08-23 08:20 /tmp/logs/oracle/logs/application_14789235086687_0001
    drwxrwx---   - oracle hadoop       0 2016-08-23 10:19 /tmp/logs/oracle/logs/application_14789235086687_0002
    drwxrwx---   - oracle hadoop       0 2016-08-23 10:20 /tmp/logs/oracle/logs/application_14789235086687_0003
    drwxrwx---   - oracle hadoop       0 2016-08-23 10:40 /tmp/logs/oracle/logs/application_14789235086687_0004
    ...
    
    # To view a log:
    hadoop fs -cat /tmp/logs/oracle/logs/application_1469028504906_0032/slclbv0036.em3.oraclecloud.com_8041
    
    # To copy a log to the local file system:
    hadoop fs -copyToLocal /tmp/logs/oracle/logs/application_1469028504906_0032/slclbv0036.em3.oraclecloud.com_8041 /tmp/log/slclbv0036.em3.oraclecloud.com_8041
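
    If YARN log aggregation is enabled on the cluster, you can also fetch the same aggregated logs with the yarn command (the application ID is a placeholder; substitute the ID of your odcp job):

    yarn logs -applicationId application_1469028504906_0032 | grep DEBUG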

Collecting Transfer Rates

You can collect the transfer rates when debugging is enabled.

Transfer rates are reported after every:

  • Read chunk operation

  • Write or upload chunk operation

The summary throughput is reported after a chunk transfer is completed. It includes all:

  • Read operations

  • Write or upload operations

  • Spark framework operations (task distribution, task management, etc.)

Output Example:

./get-transfer-rates.sh application_1476272395108_0054 2>/dev/null
Action,Speed [MBps],Start time,End time,Duration [s],Size [B]
Download from OSS,2.5855451864420473,2016-10-31 11:34:48,2016-10-31 11:38:06,198.024,536870912
Download from OSS,2.548912231791706,2016-10-31 11:34:47,2016-10-31 11:38:08,200.87,536870912
Download from OSS,2.53447780846872,2016-10-31 11:34:47,2016-10-31 11:38:09,202.014,536870912
Download from OSS,2.5130931169717226,2016-10-31 11:34:48,2016-10-31 11:38:11,203.733,536870912
Write to HDFS,208.04550995530275,2016-10-31 14:00:30,2016-10-31 14:00:33,2.46099999999999967435,536870912
Write to HDFS,271.76220806794055,2016-10-31 14:00:38,2016-10-31 14:00:40,1.88400000000000001398,536870912
Write to HDFS,277.5067750677507,2016-10-31 14:00:43,2016-10-31 14:00:45,1.84499999999999985045,536870912
Write to HDFS,218.0579216354344,2016-10-31 14:00:44,2016-10-31 14:00:46,2.34800000000000013207,536870912
Write to HDFS,195.56913674560735,2016-10-31 14:00:44,2016-10-31 14:00:47,2.61799999999999978370,536870912

Use the following command to collect transfer rates:

get-transfer-rates.sh <application_id>

odcp Reference

The odcp command-line utility has the single command odcp, with parameters and options as described below.

Syntax

odcp [<options>] <source1> [<source2> ...] <destination>

Parameters

Parameter Description
<source1> [<source2> ...]

The source can be any of the following:

  • One or more individual files. Wildcard characters are allowed (glob patterns).

  • One or more HDFS directories.

  • One or more storage containers.

If you specify multiple sources, list them one after the other:

odcp <source1> <source2> <source3> <destination>

If two or more source files have the same name, nothing is copied and odcp throws an exception.

Regular expressions are supported through these parameters:

  • --srcPattern <pattern>

    Files with matching names are copied. This parameter is ignored if the --groupBy parameter is set.

  • --groupBy <pattern>

    Files with matching names are copied and are then concatenated into one output file. Set the name of the concatenated output file by using the --groupName <output_file_name> parameter.

    When the --groupBy parameter is used, the --srcPattern parameter is ignored.

<destination>

The destination can be any of the following:

  • A specified file in an HDFS directory or a storage container

    If you don't specify a file name, the name of the source file is used for the copied file at the destination. You can specify a different file name at the destination to prevent overwriting an existing file with the same name.

  • An HDFS directory

  • A storage container

Options

Option Description

-b

--block-size

Destination file block size in bytes.

  • Default = 134217728

  • Minimum = 1048576

  • Maximum = 2147483647

The remainder after dividing partSize by blockSize must be equal to zero.

-c

--concat

Concatenate the file chunks (default).

--executor-cores

Specify the number of executor cores.

The default value is 5.

--executor-memory

Specify the executor memory limit in gigabytes.

The default value is 40 GB.

--extra-conf

Specify extra configuration options. For example:

--extra-conf spark.kryoserializer.buffer.max=128m

--groupBy

Specify files to concatenate to a <destination> file by matching source file names with a regular expression.

-h

--help

Show help for this command.

--krb-keytab

The full path to the keytab file of the Kerberos principal. (Use in a Kerberos-enabled Spark environment only.)

--krb-principal

The Kerberos principal. (Use in a Kerberos-enabled Spark environment only.)

-n

--no-clobber

Don't overwrite an existing file.

--non-recursive

Don't copy files recursively.

--num-executors

Specify the number of executors. The default value is 3 executors.

--progress

Show the progress of the data transfer.

--retry

Retry if the previous transfer failed or was interrupted.

--partSize

Destination file part size in bytes.

  • Default = 536870912

  • Minimum = 1048576

  • Maximum = 2147483647

The remainder after dividing partSize by blockSize must be equal to zero.

--spark-home 

The path to a directory containing an Apache Spark installation. If nothing is specified, odcp tries to find the installation in the /opt/cloudera directory.

--srcPattern

Filters sources by matching the source name with a regular expression.

--srcPattern is ignored when the --groupBy parameter is used.

--sync

Synchronize the <destination> with the <source>.

-V

Enable verbose mode for debugging.
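
For example, the following sketch combines several of these options to copy a directory with more Spark resources and concatenate matching files into one output. All paths, the pattern, the container and provider names, and the resource values are placeholders; adjust them for your cluster and storage configuration.

odcp --num-executors 5 --executor-cores 8 --executor-memory 16 --groupBy ".*2016.*\.log" --groupName merged-2016.log hdfs:///user/oracle/logs swift://myContainer.myProvider/archive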