4.7 Comparing Data Sets

In the Oracle Big Data Manager console, you can create, schedule, and run jobs that compare large data sets in different storage providers.

A compare job uses the odiff utility on Oracle Big Data Appliance, and the computation runs as a distributed Spark application.

  1. Click Data on the menu bar to open the Data Explorer.
  2. Click the Explorer tab (on the left side of the page).
  3. Select an item in the left panel and an item in the right panel to compare. You can compare only like items, for example, file to file or directory to directory.
  4. On the toolbar, click Compare.
  5. In the New compare data job dialog box, enter the following values.
    General tab
    • Job name: A name is provided for the job, but you can append to it or replace it with a different name.
    • Job type: This read-only field describes the type of job. In this case, it’s Oracle Distributed Diff — compare.
    • Run immediately: Select this option to run the job immediately and only once.
    • Repeated execution: Select this option to schedule the time and frequency of repeated executions of the job.
    Advanced tab
    • Number of executors: Select the number of executors from the drop-down list. The default number is 3. If you have more than three nodes, you can increase execution speed by specifying a higher number of executors. If you want to run this job in parallel with other Spark or MapReduce jobs, decrease the number of executors to leave resources for the other jobs and improve overall performance.
    • Number of CPU cores per executor: Select the number of cores from the drop-down list. The default number is 5. If you want to run this job in parallel with other Spark or MapReduce jobs, decrease the number of cores to leave resources for the other jobs and improve overall performance.
    • Memory allocated for each execution: Select the amount of memory from the drop-down list. The default value is 40 GB. If you want to run this job in parallel with other Spark or MapReduce jobs, decrease the memory to leave resources for the other jobs and improve overall performance.
    • Memory allocated for driver: Select the memory limit from the drop-down list.
    • Custom logging level: Select this option to log the job’s activity and to select the logging level.
  6. Click Create.
    The Data compare job job_number created dialog box shows minimal status information about the job. Click the View more details link to show more details about the job in the Jobs section of the console.
  7. Review the job results. In the Jobs section of the console, click the Comparison results tab on the left side of the page to see which parts of the compared items match and which differ.
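
The Advanced tab values in step 5 together determine how much of the cluster a compare job requests, which is why lowering them helps when other Spark or MapReduce jobs run at the same time. The following Python sketch only illustrates that arithmetic with the documented defaults (3 executors, 5 cores per executor, 40 GB per executor). The CompareJobResources class is hypothetical and is not part of Oracle Big Data Manager, and the mapping to the standard Spark properties shown is an assumption about how the console passes these values to Spark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompareJobResources:
    """Hypothetical holder for the Advanced tab values (not an Oracle API)."""

    executors: int = 3            # documented default: 3 executors
    cores_per_executor: int = 5   # documented default: 5 cores per executor
    executor_memory_gb: int = 40  # documented default: 40 GB

    def total_cores(self) -> int:
        # Cores requested across all executors (driver not included).
        return self.executors * self.cores_per_executor

    def total_executor_memory_gb(self) -> int:
        # Memory requested across all executors (driver not included).
        return self.executors * self.executor_memory_gb

    def as_spark_conf(self) -> dict:
        # Assumption: the console settings correspond to these standard
        # Spark properties; the source does not confirm this mapping.
        return {
            "spark.executor.instances": str(self.executors),
            "spark.executor.cores": str(self.cores_per_executor),
            "spark.executor.memory": f"{self.executor_memory_gb}g",
        }


defaults = CompareJobResources()
print(defaults.total_cores())               # 15
print(defaults.total_executor_memory_gb())  # 120

# A smaller request that leaves room for other jobs on the cluster.
shared = CompareJobResources(executors=2, cores_per_executor=2, executor_memory_gb=8)
print(shared.total_cores(), shared.total_executor_memory_gb())  # 4 16
```

With the defaults, the executors alone request 15 cores and 120 GB of memory, so on a shared cluster even a modest reduction in any one setting frees a noticeable share of resources for other jobs.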