8 Use odiff to Compare Large Data Sets

odiff (Oracle Distributed Diff) is a utility that compares large data sets stored in various locations.

odiff runs as a distributed Spark application. It is compatible with Cloudera Distributed Hadoop 5.7.x.
  • Only HDFS and Oracle Storage Cloud Service are supported.

  • When odiff compares two objects, no data is downloaded. Only segment checksums are compared. If objects are equal but have segments with different sizes, they are evaluated as different objects.

  • The default size of a block to compare is 128 MB.

  • The minimum block size to compare is 5 MB. The maximum is 2 GB.
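
The size limits above can be checked before a run. The following is a minimal sketch, assuming that 1 MB means 1024 × 1024 bytes and 1 GB means 1024 MB (the text does not say whether the units are binary). The names mb_to_bytes and check_block_size are hypothetical helpers, not part of odiff. Note that a later example in this chapter passes 1048576 (1 MB) for local files, so the limits may not be enforced in every case.

```shell
# Hypothetical helpers; not part of odiff.
# Assumption: MB and GB are binary units (1 MB = 1024 * 1024 bytes).

# Convert a whole number of megabytes to bytes.
mb_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# Succeed only if a block size in bytes lies within the documented
# range of 5 MB to 2 GB (inclusive).
check_block_size() {
    min=$(mb_to_bytes 5)
    max=$(mb_to_bytes 2048)
    [ "$1" -ge "$min" ] && [ "$1" -le "$max" ]
}

mb_to_bytes 128    # the documented default block size, in bytes
```

The last line prints 134217728. The value 104857600 used in the examples later in this chapter corresponds to 100 MB under the same assumption.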

odiff Reference

Use odiff at the command line, as described below.

Syntax

/usr/bin/odiff [OPTIONS] directory_or_file_1 directory_or_file_2

where

directory_or_file is a directory or a file, qualified by its path, for example, file:///tmp/diff/originalFiles (directory) or file:///tmp/diff/originalFiles/file-b.txt (file).

Environment Variable (Optional)

By default, odiff uses the first provider configured in core-site.xml. If core-site.xml contains more than one provider, you can specify which one to use by declaring the PROVIDER_NAME environment variable.

You can export the environment variable...

# export PROVIDER_NAME="some_value"
# /usr/bin/odiff [OPTIONS] path/file_1 path/file_2

...or inject it:

# PROVIDER_NAME="some_value" /usr/bin/odiff [OPTIONS] path/file_1 path/file_2
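
The two forms differ in scope: an exported variable stays set for every later command in the shell session, while the inline form applies to a single command only. The sketch below illustrates this, with child sh -c processes standing in for odiff so that it runs anywhere. The value myProvider is an example, not a real provider name.

```shell
# Assumption: "myProvider" is an example value; use a provider name that is
# actually configured in your core-site.xml. The child "sh -c" processes
# stand in for odiff so that the sketch runs anywhere.
unset PROVIDER_NAME

# Inline form: only this one command sees the variable.
inline_seen=$(PROVIDER_NAME="myProvider" sh -c 'echo "${PROVIDER_NAME:-unset}"')

# The next command in the same shell does not see it.
after_inline=$(sh -c 'echo "${PROVIDER_NAME:-unset}"')

# Exported form: every later command in this shell session sees it.
export PROVIDER_NAME="myProvider"
after_export=$(sh -c 'echo "${PROVIDER_NAME:-unset}"')

echo "inline=$inline_seen after_inline=$after_inline after_export=$after_export"
```

The inline form is the safer choice when different odiff runs in the same session need different providers.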

Options

Option                  Use

-b, --diffBlockSize     Specify the size, in bytes, of the file blocks to compare. Not used when comparing files stored in Oracle Storage Cloud Service.
-d, --showDetails       Show detailed output.
--executor-cores        Specify the number of executor cores. The default value is 5.
--executor-memory       Specify the executor memory limit in GB. The default value is 40.
--extra-conf            Specify extra configuration options; for example, --extra-conf spark.kryoserializer.buffer.max=128m
-h, --help              Display this help text.
--krb-keytab            Specify the full path to the Kerberos keytab of the principal. Use in a Kerberos-enabled Spark environment only.
--krb-principal         Specify the Kerberos principal. Use in a Kerberos-enabled Spark environment only.
--num-executors         Specify the number of executors. The default value is 3.
-O, --output            Specify an output file.
--spark-home            Specify the path to the directory containing the Spark installation. If this option isn’t specified, odiff tries to find it in the /opt/cloudera directory.
-V                      Enable verbose mode for debugging.

Usage Examples

/usr/bin/odiff hdfs:///user/oracle/data.raw swift://myContainer.myProvider/data.raw
/usr/bin/odiff swift://myContainer.myProvider/data.raw hdfs:///user/oracle/odcp-data.raw

If you have more than three nodes, you can increase transfer speed by specifying a higher number of executors. For example, for six nodes, use the following command:

/usr/bin/odiff --num-executors=6 hdfs:///user/oracle/data.raw swift://myContainer.myProvider/data.raw

For debugging, you can enable verbose mode with the -V switch:

/usr/bin/odiff -V swift://myContainer.myProvider/data.raw hdfs:///user/oracle/odcp-data.raw

Limitations:

/usr/bin/odiff consumes a large share of your cluster's resources. To run other Spark or MapReduce jobs in parallel, decrease the number of executors, the executor memory, or the number of executor cores by using the --num-executors, --executor-memory, and --executor-cores options.
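
To gauge how much of the cluster a run may claim, the executor settings can be multiplied out. The sketch below is a rough estimate only, assuming that the core and memory settings apply per executor (as they do in Spark) and ignoring the driver and any memory overhead. odiff_footprint is a hypothetical helper, not part of odiff.

```shell
# Hypothetical helper; not part of odiff.
# Usage: odiff_footprint [num_executors] [executor_cores] [executor_memory_gb]
# Defaults follow the documented defaults: 3 executors, 5 cores, 40 GB.
# Assumption: cores and memory are per-executor values, as in Spark.
odiff_footprint() {
    executors=${1:-3}
    cores=${2:-5}
    memory_gb=${3:-40}
    echo "cores=$(( executors * cores )) memory_gb=$(( executors * memory_gb ))"
}

odiff_footprint          # documented defaults
odiff_footprint 2 2 8    # a reduced configuration
```

With the documented defaults this prints cores=15 memory_gb=120, and the reduced configuration prints cores=4 memory_gb=16, which shows how quickly lowering any one of the three settings frees capacity for other jobs.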

odiff Examples

This topic examines a more extended example of using odiff to compare data structures.

The Data

Consider the following situation:

  • Files from a directory named originalFiles were copied (by using the odcp distributed-copy tool) to a directory named copiedFiles.

  • After copying, the user:

    • added a few lines of text to originalFiles/file-b.txt

    • deleted a few lines of text in originalFiles/file-c.txt

    • modified one byte of text in originalFiles/file-d.txt

    • created .hiddenFile in the copiedFiles directory

As a result, the directories and files look like this:

originalFiles
    +-- file-a.txt     
    +-- file-b.txt     
    +-- file-c.txt     
    `-- file-d.txt     
 
copiedFiles
    +-- file-a.txt      (same as originalFiles/file-a.txt)
    +-- file-b.txt      (added few lines)
    +-- file-c.txt      (deleted few lines)
    +-- file-d.txt      (modified one byte)
    `-- .hiddenFile     (added)

The remaining sections show different odiff operations performed on the above data.
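
To experiment with a similar layout, the steps listed above can be scripted. The following is a minimal sketch under stated assumptions: plain cp stands in for odcp, the files are small text files rather than large data sets, and make_sample_data is a hypothetical helper name. It follows the bulleted list, so the edits are applied to the files in originalFiles.

```shell
# Hypothetical helper; not part of odiff or odcp.
# Recreates the described layout under the root directory given as $1.
# Assumptions: cp stands in for odcp, and the files are small text files.
make_sample_data() {
    root=$1
    mkdir -p "$root/originalFiles" "$root/copiedFiles" || return 1

    # Create four text files of 1000 numbered lines each.
    for name in a b c d; do
        seq 1 1000 > "$root/originalFiles/file-$name.txt"
    done

    # Copy the files (the source used odcp for this step).
    cp "$root/originalFiles"/file-*.txt "$root/copiedFiles/"

    # Add a few lines to originalFiles/file-b.txt.
    printf 'extra line 1\nextra line 2\n' >> "$root/originalFiles/file-b.txt"

    # Delete the last few lines of originalFiles/file-c.txt.
    head -n 990 "$root/copiedFiles/file-c.txt" > "$root/originalFiles/file-c.txt"

    # Overwrite the first byte of originalFiles/file-d.txt without truncating it.
    printf 'X' | dd of="$root/originalFiles/file-d.txt" bs=1 count=1 conv=notrunc 2>/dev/null

    # Create .hiddenFile in copiedFiles only.
    echo "hidden" > "$root/copiedFiles/.hiddenFile"
}
```

For example, make_sample_data /tmp/diff would create the two directories under the /tmp/diff path used in the sections that follow.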

Compare Two Files (Original and Copied)

odiff Parameters:
    file:///tmp/diff/originalFiles/file-a.txt
    file:///tmp/diff/copiedFiles/file-a.txt
    --diffBlockSize 104857600
Output:
    Files file:///tmp/diff/originalFiles/file-a.txt and file:///tmp/diff/copiedFiles/file-a.txt are same.
Return Code:
    0

odiff Parameters:
    file:///tmp/diff/originalFiles/file-a.txt
    file:///tmp/diff/copiedFiles/file-a.txt
    --diffBlockSize 104857600
    --showDetails
Output:
    Files file:///tmp/diff/originalFiles/file-a.txt and file:///tmp/diff/copiedFiles/file-a.txt are same.
Return Code:
    0

Compare Two Directories (One With Original Files and the Other With Copied Files)

odiff Parameters:
    file:///tmp/diff/originalFiles
    file:///tmp/diff/copiedFiles
    --diffBlockSize 104857600
Output:
    Directories file:///tmp/diff/originalFiles and file:///tmp/diff/originalFiles are same.
Return Code:
    0

odiff Parameters:
    file:///tmp/diff/originalFiles
    file:///tmp/diff/copiedFiles
    --diffBlockSize 104857600
    --showDetails
Output:
    Directories file:///tmp/diff/originalFiles and file:///tmp/diff/originalFiles are same.
Return Code:
    0

Compare Two Different Files

odiff Parameters:
    file:///tmp/diff/originalFiles/file-b.txt
    file:///tmp/diff/copiedFiles/file-b.txt
    --diffBlockSize 104857600
Output:
    Files file:/tmp/diff/originalFiles/file-b.txt and file:/tmp/diff/copiedFiles/file-b.txt are different
Return Code:
    1

odiff Parameters:
    file:///tmp/diff/originalFiles/file-b.txt
    file:///tmp/diff/copiedFiles/file-b.txt
    --diffBlockSize 104857600
    --showDetails
Output:
    Files file:/tmp/diff/originalFiles/file-b.txt and file:/tmp/diff/copiedFiles/file-b.txt are different.
    Block#00000001: equals
    Block#00000002: equals
    Block#00000003: equals
    Block#00000004: missing in revised file
Return Code:
    1
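
Because the return code distinguishes the two outcomes in these examples (0 when the inputs are the same, 1 when they differ), a script can branch on it. The following is a minimal sketch. report_diff is a hypothetical wrapper that takes the full command as its arguments, and treating any code other than 0 or 1 as an error is an assumption, since only those two values appear in this chapter.

```shell
# Hypothetical wrapper; not part of odiff.
# Runs the command given as arguments, prints a one-word verdict based on
# its return code, and passes that return code back to the caller.
report_diff() {
    "$@"
    rc=$?
    case $rc in
        0) echo "same" ;;
        1) echo "different" ;;
        *) echo "error (return code $rc)" ;;   # assumption: other codes mean failure
    esac
    return "$rc"
}
```

A call might look like report_diff /usr/bin/odiff file:///tmp/diff/originalFiles/file-b.txt file:///tmp/diff/copiedFiles/file-b.txt. Passing the command as arguments keeps the wrapper independent of any particular set of odiff options.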

Compare Two Different Directories

odiff Parameters:
    file:///tmp/diff/originalFiles
    file:///tmp/diff/copiedFiles
    --diffBlockSize 1048576
Output:
    Found 1 same file(s), found 4 different file(s):
    Files file:/tmp/diff/originalFiles/file-b.txt and file:/tmp/diff/copiedFiles/file-b.txt are different
    Files file:/tmp/diff/originalFiles/file-c.txt and file:/tmp/diff/copiedFiles/file-c.txt are different
    Files file:/tmp/diff/originalFiles/file-a.txt and file:/tmp/diff/copiedFiles/file-a.txt are same
    Files file:/tmp/diff/originalFiles/file-d.txt and file:/tmp/diff/copiedFiles/file-d.txt are different
    Files file:/tmp/diff/originalFiles/.hiddenFile and file:/tmp/diff/copiedFiles/.hiddenFile are different:
    The original file file:/tmp/diff/originalFiles/.hiddenFile is missing
Return Code:
    1

odiff Parameters:
    file:///tmp/diff/originalFiles
    file:///tmp/diff/copiedFiles
    --diffBlockSize 104857600
    --showDetails
Output:
    Found 1 same file(s), found 4 different file(s):
    Files file:/tmp/diff/originalFiles/file-a.txt and file:/tmp/diff/copiedFiles/file-a.txt are same
    Files file:/tmp/diff/originalFiles/file-d.txt and file:/tmp/diff/copiedFiles/file-d.txt are different
    Block#00000001: different
    Block#00000002: equals
    Block#00000003: equals
    Block#00000004: equals
    Block#00000005: equals
    Block#00000006: equals
    Files file:/tmp/diff/originalFiles/file-b.txt and file:/tmp/diff/copiedFiles/file-b.txt are different
    Block#00000001: equals
    Block#00000002: equals
    Block#00000003: equals
    Block#00000004: missing in revised file
    Files file:/tmp/diff/originalFiles/file-c.txt and file:/tmp/diff/copiedFiles/file-c.txt are different
    Block#00000001: equals
    Block#00000002: equals
    Block#00000003: different
    Files file:/tmp/diff/originalFiles/.hiddenFile and file:/tmp/diff/copiedFiles/.hiddenFile are different: Original file file:/tmp/diff/originalFiles/.hiddenFile is missing
Return Code:
    1
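
The per-block lines in the --showDetails output lend themselves to a quick summary. The sketch below tallies block statuses with awk, assuming each block line has the form Block#NNNNNNNN: status as in the output above. summarize_blocks is a hypothetical helper that reads saved output from standard input.

```shell
# Hypothetical helper; not part of odiff.
# Reads --showDetails output on standard input and prints how many blocks
# were reported with each status, one "count status" pair per line.
# Assumption: block lines look like "Block#00000001: equals".
summarize_blocks() {
    awk -F': ' '
        /^[ \t]*Block#[0-9]+: / { count[$2]++ }
        END { for (status in count) print count[status], status }
    ' | sort -k2
}
```

For the detailed directory comparison shown above, the block lines would add up to 2 different, 10 equals, and 1 missing in revised file. If odiff writes its report to standard output (an assumption), the output could be piped straight into summarize_blocks; otherwise, the file written with the output option can be fed in instead.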