8 Use odiff to Compare Large Data Sets
odiff (Oracle Distributed Diff) is a utility that compares large data sets stored in various locations.
- Only HDFS and Oracle Storage Cloud Service are supported.
- When odiff compares two objects, no data is downloaded; only segment checksums are compared. If two objects have equal content but are stored in segments of different sizes, they're evaluated as different objects.
- The default size of a block to compare is 128 MB.
- The minimum block size to compare is 5 MB. The maximum is 2 GB.
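The block-size limits above can be checked before a value is passed to odiff. The following is a minimal sketch, not part of odiff itself: the helper name `block_size_bytes` is hypothetical, and it assumes that 1 MB means 1024 × 1024 bytes and that both limits are inclusive, neither of which the text confirms.

```shell
#!/bin/sh
# Limits taken from the text above. Assumption: 1 MB = 1024 * 1024 bytes.
MB=$((1024 * 1024))
MIN_BLOCK_BYTES=$((5 * MB))      # 5 MB minimum
MAX_BLOCK_BYTES=$((2048 * MB))   # 2 GB maximum

# Hypothetical helper: prints the block size in bytes for a size given in MB
# (128 MB if no argument is given), or returns 1 if the size is out of range.
# Assumption: both limits are inclusive.
block_size_bytes() {
    size_mb=${1:-128}
    bytes=$((size_mb * MB))
    if [ "$bytes" -lt "$MIN_BLOCK_BYTES" ] || [ "$bytes" -gt "$MAX_BLOCK_BYTES" ]; then
        echo "block size must be between 5 MB and 2 GB" >&2
        return 1
    fi
    echo "$bytes"
}

block_size_bytes 128   # prints 134217728
```

The value is printed in bytes because the block-size option described in the Options table takes its argument in bytes.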
odiff Reference
Use odiff at the command line, as described below.
Syntax
/usr/bin/odiff [OPTIONS] directory_or_file_1 directory_or_file_2
where directory_or_file is a directory or a file, qualified by its path, for example, file:///tmp/diff/originalFiles (directory) or file:///tmp/diff/originalFiles/file-b.txt (file).
Environment Variable (Optional)
By default, odiff uses the first provider configured in core-site.xml. If core-site.xml contains more than one provider, you can specify which one to use by declaring the PROVIDER_NAME environment variable.
You can export the environment variable, so that it applies to every later command in the session...
# export PROVIDER_NAME="some_value"
# /usr/bin/odiff [OPTIONS] path/file_1 path/file_2
...or inject it for a single invocation:
# PROVIDER_NAME="some_value" /usr/bin/odiff [OPTIONS] path/file_1 path/file_2
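The practical difference between the two forms is how long the variable stays set. The sketch below illustrates this with a stand-in child process (`sh -c`) instead of odiff, so it runs on any machine. The stand-in is an assumption for illustration only; it prints the value of PROVIDER_NAME that a child process such as odiff would see.

```shell
#!/bin/sh
# Start from a clean state.
unset PROVIDER_NAME

# Injected form: the variable is visible only to this one child process.
inline_seen=$(PROVIDER_NAME="some_value" sh -c 'echo "${PROVIDER_NAME:-unset}"')

# The next child process no longer sees it.
after_inline=$(sh -c 'echo "${PROVIDER_NAME:-unset}"')

# Exported form: every later child process in this session sees it.
export PROVIDER_NAME="some_value"
after_export=$(sh -c 'echo "${PROVIDER_NAME:-unset}"')

echo "$inline_seen $after_inline $after_export"   # prints: some_value unset some_value
```

Injecting the variable is convenient when different comparisons in the same session use different providers; exporting it avoids repeating the value on every command.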
Options
Option | Use |
---|---|
| | Diff the file block sizes in bytes. Not used when comparing files stored in Oracle Storage Cloud Service. |
| | Shows detailed output. |
--executor-cores | Specify the number of executor cores. The default value is 5. |
--executor-memory | Specify the executor memory limit in GB. The default value is 40. |
--extra-conf | Specify extra configuration options; for example, --extra-conf spark.kryoserializer.buffer.max=128m |
-h --help | Display this help text. |
--krb-keytab | Specify the full path to the Kerberos keytab of the principal. Use in a Kerberos-enabled Spark environment only. |
--krb-principal | Specify the Kerberos principal. Use in a Kerberos-enabled Spark environment only. |
--num-executors | Specify the number of executors. The default value is 3. |
-O --output | Specify an output file. |
--spark-home | Specify the path to the directory containing the Spark installation. If this option isn't specified, odiff tries to find it in the /opt/cloudera directory. |
-V | Enable verbose mode for debugging. |
Usage Examples
/usr/bin/odiff hdfs:///user/oracle/data.raw swift://myContainer.myProvider/data.raw
/usr/bin/odiff swift://jmyContainer.myProvider/data.raw hdfs:///user/oracle/odcp-data.raw
If you have more than three nodes, you can speed up the comparison by specifying a higher number of executors. For example, for six nodes, use the following command:
/usr/bin/odiff --num-executors=6 hdfs:///user/oracle/data.raw swift://myContainer.myProvider/data.raw
For debugging, you can enable verbose mode by using the -V switch:
/usr/bin/odiff -V swift://jmyContainer.myProvider/data.raw hdfs:///user/oracle/odcp-data.raw
Limitations:
/usr/bin/odiff consumes a lot of your cluster's resources. If you want to run other Spark or MapReduce jobs in parallel, decrease the number of executors, the executor memory, or the number of executor cores by using the --num-executors, --executor-memory, and --executor-cores parameters.
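As an illustration of the limitation above, the following sketch prints an odiff command line that uses fewer resources than the defaults (3 executors, 40 GB of executor memory, 5 executor cores). The helper name `build_odiff_cmd` and the values 2, 20, and 2 are hypothetical. The `--option=value` form is shown in this document only for --num-executors; using it for the other two options is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: prints an odiff command line with explicit resource limits.
# Arguments: executors, executor memory (GB), executor cores, path 1, path 2.
# Assumption: --executor-memory and --executor-cores accept the same
# --option=value form that the document shows for --num-executors.
build_odiff_cmd() {
    printf '/usr/bin/odiff --num-executors=%s --executor-memory=%s --executor-cores=%s %s %s\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# Illustrative values below the documented defaults (3 executors, 40 GB, 5 cores).
build_odiff_cmd 2 20 2 hdfs:///user/oracle/data.raw swift://myContainer.myProvider/data.raw
```

The sketch only prints the command so that it can be reviewed first; suitable values depend on what else is running on the cluster.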
odiff Examples
This topic examines a more extended example of using odiff to compare data structures.
The Data
Consider the following situation:
- Files from a directory originalFiles were copied (by using the odcp distributed-copy tool) to a directory named copiedFiles.
- After copying, the user
  - added a few lines of text to originalFiles/file-b.txt
  - deleted a few lines of text in the originalFiles/file-c.txt
  - modified one byte of text in the originalFiles/file-d.txt
  - created .hiddenFile in the copiedFiles directory
As a result, the directories and files look like this:
originalFiles
+-- file-a.txt
+-- file-b.txt
+-- file-c.txt
`-- file-d.txt
copiedFiles
+-- file-a.txt (same as originalFiles/file-a.txt)
+-- file-b.txt (added few lines)
+-- file-c.txt (deleted few lines)
+-- file-d.txt (modified one byte)
`-- .hiddenFile (added)
The remaining sections show different odiff operations performed on the above data.
Compare Two Files (Original and Copied)
odiff Parameters | Output | Return Code |
---|---|---|
| | Files file:///tmp/diff/originalFiles/file-a.txt and file:///tmp/diff/copiedFiles/file-a.txt are same. | 0 |
| | Files file:///tmp/diff/originalFiles/file-a.txt and file:///tmp/diff/copiedFiles/file-a.txt are same. | 0 |
Compare Two Directories (One With Original Files and the Other With Copied Files)
odiff Parameters | Output | Return Code |
---|---|---|
| | Directories file:///tmp/diff/originalFiles and file:///tmp/diff/originalFiles are same. | 0 |
| | Directories file:///tmp/diff/originalFiles and file:///tmp/diff/originalFiles are same. | 0 |
Compare Two Different Files
odiff Parameters | Output | Return Code |
---|---|---|
| | Files file:/tmp/diff/originalFiles/file-b.txt and file:/tmp/diff/copiedFiles/file-b.txt are different | 1 |
| | | 1 |
Compare Two Different Directories
odiff Parameters | Output | Return Code |
---|---|---|
| | | 1 |
| | | 1 |
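The return codes in the tables above (0 when the compared objects are the same, 1 when they differ) make odiff usable in scripts. The following is a minimal sketch with a hypothetical helper, `describe_odiff_status`. Treating any other return code as unexpected is an assumption, because this document lists only 0 and 1.

```shell
#!/bin/sh
# Hypothetical helper: maps an odiff return code to a short message.
# 0 and 1 come from the tables above; handling any other code as
# unexpected is an assumption.
describe_odiff_status() {
    case "$1" in
        0) echo "same" ;;
        1) echo "different" ;;
        *) echo "unexpected return code: $1" ;;
    esac
}

# Typical use (commented out so that the sketch runs without odiff installed):
# /usr/bin/odiff file:///tmp/diff/originalFiles file:///tmp/diff/copiedFiles
# describe_odiff_status "$?"

describe_odiff_status 1   # prints: different
```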