Using odcp Command Line Utility to Copy Data

Use the odcp command line utility to manage jobs that copy data between HDFS on your cluster and remote storage providers.

Note

The odcp CLI can be used only in clusters that use Cloudera Distribution including Hadoop (CDH).

odcp uses Spark to provide parallel transfer of one or more files. It takes the input file and splits it into chunks, which are then transferred in parallel to the destination. By default, transferred chunks are then merged back into one output file.

odcp supports copying files when using the following:
  • Apache Hadoop Distributed File System (HDFS)

  • Apache WebHDFS and Secure WebHDFS (SWebHDFS)

  • Amazon Simple Storage Service (S3)

  • Oracle Cloud Infrastructure Object Storage

  • Hypertext Transfer Protocol (HTTP) and HTTP Secure (HTTPS) — Used for sources only.
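
For example, a basic copy from HDFS to an object storage container might look like the following. This is a minimal sketch: the container name (myContainer), provider name (myProvider), and file paths are placeholders, and the swift:// URI form assumes a Swift-compatible provider configured on the cluster; adjust them to match your own storage configuration.

odcp hdfs:///user/oracle/weblogs/access.log swift://myContainer.myProvider/access.log

odcp splits access.log into chunks, copies the chunks in parallel Spark tasks, and, by default, merges them back into a single object at the destination.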

Before You Begin

The following topics describe how to use odcp command options to copy data between HDFS on your cluster and external storage providers.

For all the operations, every cluster node must have:
  • Access to all running storage services.

  • All required credentials established, for example, credentials for Oracle Cloud Infrastructure Object Storage instances.

See odcp Reference for the odcp syntax, parameters, and options.

Using bda-oss-admin with odcp

Use bda-oss-admin commands to configure the cluster for use with storage providers. This makes it easier and faster to use odcp with the storage provider.

Any user with access privileges to the cluster can run odcp.

Note

To copy data between HDFS and a storage provider, for example, Oracle Cloud Infrastructure Object Storage, you must have an account with the data store and access to it.

To copy data between HDFS and a storage provider:

  1. Open a command shell and connect to the cluster. You can connect to any node for which you have HDFS access rights. See Connecting to a Cluster Node Using SSH.
  2. Set shell environment variable values for Cloudera Manager access. See Understanding bda-oss-admin Environment Variables.

    Set these environment variables:

    • CM_ADMIN — Cloudera Manager administrator (cluster administrator) user name

    • CM_PASSWORD — Cloudera Manager administrator (cluster administrator) password

    • CM_URL — Cloudera Manager URL
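
    For example, in a Bash shell (the values shown are placeholders for your own Cloudera Manager instance):

      export CM_ADMIN="admin"
      export CM_PASSWORD="my_cm_password"
      export CM_URL="https://my-cluster-node-1.example.com:7183"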

  3. You must also have access privileges to the storage provider you want to use.
    • Set the PROVIDER_NAME environment variable to refer to the provider you want to use. For example, if you have a provider named rssfeeds-admin2, use SSH to connect to the cluster and enter:

      PROVIDER_NAME="rssfeeds-admin2"

      Or, in a shell script:

      export PROVIDER_NAME="rssfeeds-admin2"
  4. You can use the hadoop fs -ls command to browse your HDFS and storage data.
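    For example (the object store path is a placeholder, and its exact form depends on the provider configured on your cluster):

      hadoop fs -ls /user/oracle
      hadoop fs -ls swift://myContainer.myProvider/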
  5. Use the odcp command to copy files. See odcp Reference.

Copying Data on a Secure Cluster

Using odcp to copy data on a Kerberos-enabled cluster requires some additional steps.

Note

In Oracle Big Data Service, a cluster is Kerberos-enabled when it's created with the Secure and Highly Available (HA) option selected.

If you want to execute a long-running job or run odcp from an automated shell script or from a workflow service such as Apache Oozie, then you must pass to the odcp command a Kerberos principal and the full path to the principal's keytab file, as described below:

  1. Use SSH to connect to any node on the cluster.
  2. Choose the principal to be used for running the odcp command. In the example below, it's odcp@BDSERVICE.EXAMPLE.COM.
  3. Generate a keytab file for the principal, as shown below: 
    $ ktutil
    ktutil:  addent -password -p odcp@BDSERVICE.EXAMPLE.COM -k 1 -e rc4-hmac
    Password for odcp@BDSERVICE.EXAMPLE.COM: [enter your password]
    ktutil:  addent -password -p odcp@BDSERVICE.EXAMPLE.COM -k 1 -e aes256-cts
    Password for odcp@BDSERVICE.EXAMPLE.COM: [enter your password]
    ktutil:  wkt /home/odcp/odcp.keytab
    ktutil:  quit
  4. Pass the principal and the full path to the keytab file to the odcp command, for example:
    odcp --krb-principal odcp@BDSERVICE.EXAMPLE.COM --krb-keytab /home/odcp/odcp.keytab source destination
If you just want to execute a short-running odcp job from the console, you don't have to generate a keytab file or specify the principal. You just need an active Kerberos ticket (created using the kinit command).
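
For example, a minimal interactive session might look like this (the principal, paths, and container name are placeholders):

kinit odcp@BDSERVICE.EXAMPLE.COM
odcp hdfs:///user/odcp/data.csv swift://myContainer.myProvider/data.csv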

Retrying a Failed Copy Job

If a copy job fails, you can retry it. When the job is retried, the source and destination are automatically synchronized, so odcp doesn't retransfer file parts that were already copied successfully.

Use the following:

odcp --retry <source> <target>
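
For example, if the following copy job was interrupted (the paths and container name are placeholders):

odcp hdfs:///user/oracle/logs swift://myContainer.myProvider/logs

you can rerun the same command with --retry, and only the file parts that weren't copied successfully are transferred again:

odcp --retry hdfs:///user/oracle/logs swift://myContainer.myProvider/logs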

Debugging odcp

You must configure the cluster to enable debugging for odcp.

Configuring a Cluster to Enable Debugging

To configure the cluster:
  1. As the root user, add the following lines to /etc/hadoop/conf/log4j.properties on each node of the cluster:
    log4j.logger.oracle.paas.bdcs.conductor=DEBUG
    log4j.logger.org.apache.hadoop.fs.swift.http=DEBUG
    

    Or, to configure all nodes:

    $ dcli -c $NODES "echo 'log4j.logger.oracle.paas.bdcs.conductor=DEBUG' >> /etc/hadoop/conf/log4j.properties"
    $ dcli -c $NODES "echo 'log4j.logger.org.apache.hadoop.fs.swift.http=DEBUG' >> /etc/hadoop/conf/log4j.properties"
    
    
  2. As the oracle user, find the logs in the following HDFS directory:

     hdfs:///tmp/logs/<username>/logs/<application_id>/

    For example:

    $ hadoop fs -ls /tmp/logs/oracle/logs/
    Found 15 items
    drwxrwx---   - oracle hadoop       0 2016-08-23 07:29 /tmp/logs/oracle/logs/application_14789235086687_0029
    drwxrwx---   - oracle hadoop       0 2016-08-23 08:07 /tmp/logs/oracle/logs/application_14789235086687_0030
    drwxrwx---   - oracle hadoop       0 2016-08-23 08:20 /tmp/logs/oracle/logs/application_14789235086687_0001
    drwxrwx---   - oracle hadoop       0 2016-08-23 10:19 /tmp/logs/oracle/logs/application_14789235086687_0002
    drwxrwx---   - oracle hadoop       0 2016-08-23 10:20 /tmp/logs/oracle/logs/application_14789235086687_0003
    drwxrwx---   - oracle hadoop       0 2016-08-23 10:40 /tmp/logs/oracle/logs/application_14789235086687_0004
    ...
    
    # To view a log:
    hadoop fs -cat /tmp/logs/oracle/logs/application_1469028504906_0032/slclbv0036.em3.oraclecloud.com_8041
    
    # To copy a log to the local file system:
    hadoop fs -copyToLocal /tmp/logs/oracle/logs/application_1469028504906_0032/slclbv0036.em3.oraclecloud.com_8041 /tmp/log/slclbv0036.em3.oraclecloud.com_8041
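
    If YARN log aggregation is enabled on the cluster, you can also fetch the same aggregated logs with the yarn command (the application ID is a placeholder; substitute the ID of your odcp job):

    yarn logs -applicationId application_1469028504906_0032 | grep DEBUG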

Collecting Transfer Rates

You can collect the transfer rates when debugging is enabled.

Transfer rates are reported after every:

  • Read chunk operation

  • Write or upload chunk operation

The summary throughput is reported after a chunk transfer is completed. It includes all:

  • Read operations

  • Write or upload operations

  • Spark framework operations (task distribution, task management, etc.)

Output Example:

./get-transfer-rates.sh application_1476272395108_0054 2>/dev/null
Action,Speed [MBps],Start time,End time,Duration [s],Size [B]
Download from OSS,2.5855451864420473,2016-10-31 11:34:48,2016-10-31 11:38:06,198.024,536870912
Download from OSS,2.548912231791706,2016-10-31 11:34:47,2016-10-31 11:38:08,200.87,536870912
Download from OSS,2.53447780846872,2016-10-31 11:34:47,2016-10-31 11:38:09,202.014,536870912
Download from OSS,2.5130931169717226,2016-10-31 11:34:48,2016-10-31 11:38:11,203.733,536870912
Write to HDFS,208.04550995530275,2016-10-31 14:00:30,2016-10-31 14:00:33,2.46099999999999967435,536870912
Write to HDFS,271.76220806794055,2016-10-31 14:00:38,2016-10-31 14:00:40,1.88400000000000001398,536870912
Write to HDFS,277.5067750677507,2016-10-31 14:00:43,2016-10-31 14:00:45,1.84499999999999985045,536870912
Write to HDFS,218.0579216354344,2016-10-31 14:00:44,2016-10-31 14:00:46,2.34800000000000013207,536870912
Write to HDFS,195.56913674560735,2016-10-31 14:00:44,2016-10-31 14:00:47,2.61799999999999978370,536870912

Use the following command to collect transfer rates:

get-transfer-rates.sh <application_id>

odcp Reference

The odcp command-line utility has the single command odcp, with parameters and options as described below.

Syntax

odcp [<options>] <source1> [<source2> ...] <destination>

Parameters

Parameter Description
<source1> [<source2> ...]

The source can be any of the following:

  • One or more individual files. Wildcard characters are allowed (glob patterns).

  • One or more HDFS directories.

  • One or more storage containers.

If you specify multiple sources, list them one after the other:

odcp <source1> <source2> <source3> <destination>

If two or more source files have the same name, nothing is copied and odcp throws an exception.

Regular expressions are supported through these parameters:

  • --srcPattern <pattern>

    Files with matching names are copied. This parameter is ignored if the --groupBy parameter is set.

  • --groupBy <pattern>

    Files with matching names are copied and are then concatenated into one output file. Set the name of the concatenated output file by using the --groupName <output_file_name> parameter.

    When the --groupBy parameter is used, the --srcPattern parameter is ignored.

<destination>

The destination can be any of the following:

  • A specified file in an HDFS directory or a storage container

    If you don't specify a file name, the name of the source file is used for the copied file at the destination. You can specify a different file name at the destination to prevent overwriting an existing file with the same name.

  • An HDFS directory

  • A storage container

Options

Option Description

-b

--block-size

Destination file block size in bytes.

  • Default = 134217728

  • Minimum = 1048576

  • Maximum = 2147483647

The remainder after dividing partSize by blockSize must be equal to zero.

-c

--concat

Concatenate the file chunks (default).

--executor-cores

Specify the number of executor cores.

The default value is 5.

--executor-memory

Specify the executor memory limit in gigabytes.

The default value is 40 GB.

--extra-conf

Specify extra configuration options. For example:

--extra-conf spark.kryoserializer.buffer.max=128m

--groupBy

Specify files to concatenate to a <destination> file by matching source file names with a regular expression.

-h

--help

Show help for this command.

--krb-keytab

The full path to the keytab file of the Kerberos principal. (Use in a Kerberos-enabled Spark environment only.)

--krb-principal

The Kerberos principal. (Use in a Kerberos-enabled Spark environment only.)

-n

--no-clobber

Don't overwrite an existing file.

--non-recursive

Don't copy files recursively.

--num-executors

Specify the number of executors. The default value is 3 executors.

--progress

Show the progress of the data transfer.

--retry

Retry if the previous transfer failed or was interrupted.

--partSize

Destination file part size in bytes.

  • Default = 536870912

  • Minimum = 1048576

  • Maximum = 2147483647

The remainder after dividing partSize by blockSize must be equal to zero.

--spark-home 

The path to a directory containing an Apache Spark installation. If nothing is specified, odcp tries to find the installation in the /opt/cloudera directory.

--srcPattern

Filters sources by matching the source name with a regular expression.

--srcPattern is ignored when the --groupBy parameter is used.

--sync

Synchronize the <destination> with the <source>.

-V

Enable verbose mode for debugging.
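
For example, the following sketch combines several of these options to copy a directory with more Spark resources and concatenate matching files into one output. All paths, the pattern, the container and provider names, and the resource values are placeholders; adjust them for your cluster and storage configuration.

odcp --num-executors 5 --executor-cores 8 --executor-memory 16 --groupBy ".*2016.*\.log" --groupName merged-2016.log hdfs:///user/oracle/logs swift://myContainer.myProvider/archive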