4.7 Compare Data Sets

In the Oracle Big Data Manager console, you can create, schedule, and run jobs that compare large data sets in different storage providers.

  1. Click the Data tab at the top of the page, and then click the Explorer tab on the left side of the page.
  2. Navigate to and select an item in the left panel and an item in the right panel to compare. You can compare only like items, for example, file to file or directory to directory.
  3. On the toolbar, click Compare.
  4. In the New compare data job dialog box, provide values as described below.

    General tab

    • Job name: A name is provided for the job, but you can change it if you want.
    • Job type: This read-only field describes the type of job. In this case, it’s Oracle Distributed Diff - compare.
    • CPU utilization: Use the slider to specify CPU utilization for the job. The proper job configuration will be calculated based on the cluster's shape. This is set to 30 percent by default. If you set this to a higher value, you'll have more CPUs for the job, which can mean better performance when you're comparing a large number of files. But assigning more CPUs to a job also means there will be fewer CPUs in the cluster available for other tasks.
    • Memory utilization: Use the slider to specify memory utilization for the job. The proper job configuration will be calculated based on the cluster's shape. This is set to 30 percent by default. Assigning more memory to a job can increase its performance, but also leaves less free memory available for other tasks. If the job is given too little memory, it will crash. If the job is given more memory than what's currently available, it remains in a PENDING state until the requested amount of memory becomes available.
    • Run immediately: Select this option to run the job immediately and only once. This is selected by default.
    • Repeated execution: Select this option to schedule the time and frequency of repeated executions of the job. You can specify a simplified entry, or click Advanced entry to enter a cron expression.

    Advanced tab

    • Diff file block size: Select a value from the drop-down list to specify the level of comparison. The lower the value, the more detailed the comparison and the longer it takes. The default value is 512 MB.
    • Number of executors per node: Specify the number of CPU cores. The default is 5. If you want to execute this job in parallel with other Spark or MapReduce jobs, decrease the number of cores to leave resources for those jobs, which can improve overall performance.
    • Memory allocated for driver: Select the memory limit from the drop-down list. Memory allocated for the driver is memory allocated for the Application Driver responsible for task scheduling. The default is 1 GB.
    • Custom logging level: Select this option to log the job’s activity and to select the logging level. The default logging level is INFO. If this check box is not selected, the logging level for the job defaults to whatever is configured in the cluster.
  5. Click Create.

    The Data compare job job_number created dialog box shows minimal status information about the job. When the job completes, click View more details to show more details about the job in the Jobs section of the console. You can also click this link while the job is running.

  6. Review the job results. In particular, click the Comparison results tab on the left side of the page to see which parts of the compared items are the same and which are different.
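
The trade-off behind the Diff file block size setting in step 4 can be illustrated with a small local sketch. This is not how Oracle Distributed Diff is implemented, and the function names below (block_digests, differing_blocks, write_temp) are hypothetical. It only assumes that a comparison works block by block, and shows why a smaller block size pinpoints a difference more precisely while producing more blocks to process.

```python
import hashlib
import os
import tempfile


def block_digests(path, block_size):
    """Return one SHA-256 digest per block_size-byte block of the file."""
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            digests.append(hashlib.sha256(block).hexdigest())
    return digests


def differing_blocks(path_a, path_b, block_size):
    """Return the indexes of blocks that differ between the two files."""
    a = block_digests(path_a, block_size)
    b = block_digests(path_b, block_size)
    diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    # Blocks that exist in only one of the files also count as differences.
    diffs.extend(range(min(len(a), len(b)), max(len(a), len(b))))
    return diffs


def write_temp(data):
    """Write data to a temporary file and return its path (demo helper)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


# Two 1024-byte files that differ in a single byte at offset 700.
original = bytes(1024)
modified = bytearray(original)
modified[700] = 1

path_a = write_temp(original)
path_b = write_temp(bytes(modified))

coarse = differing_blocks(path_a, path_b, 512)  # 2 blocks to hash per file
fine = differing_blocks(path_a, path_b, 64)     # 16 blocks to hash per file

print("512-byte blocks that differ:", coarse)
print("64-byte blocks that differ:", fine)

os.remove(path_a)
os.remove(path_b)
```

With 512-byte blocks the difference is narrowed to a 512-byte range. With 64-byte blocks it is narrowed to a 64-byte range, at the cost of eight times as many blocks to process.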