4.2 Copying Data (Including from Multiple Sources)

In the Data section of the Oracle Big Data Manager console, you can create, schedule, and run a copy job that includes multiple sources. You can also copy data via HTTP(S).

  1. Click the Data tab on the top of the page, and then click the Explorer tab on the left side of the page.
  2. In either panel of the Data Explorer page, select a target location as the destination for the copy job.
  3. On the toolbar for that panel, click Copy here from HTTP(S).
  4. In the New copy data job dialog box, enter information in the Sources row, as follows:
    1. From the first drop-down list, select Direct link to copy a single file, or select Link to list of files to copy multiple files that are listed in a manifest file in comma-separated values (CSV) format.
    2. From the second drop-down list, select the data source from which you are copying. This list shows the data providers registered with Oracle Big Data Manager.
    3. The last control in the Sources row depends on the type of data source selected in the second drop-down list. For HTTP(S), enter the URL of the source in the Enter a valid HTTP(S) text box. For other types of data sources, click the Select file button to navigate to and select a file.
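    If you use Link to list of files, the manifest is a plain CSV file with one source entry per line. The file names and URLs below are hypothetical, and the exact columns accepted may vary by release; this is only a sketch of the expected shape:

    ```
    https://example.com/exports/part-0000.csv
    https://example.com/exports/part-0001.csv
    https://example.com/exports/part-0002.csv
    ```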
  5. If you want to copy from multiple sources in the same copy job, click the Add source button and repeat the tasks in the previous step.
  6. If you want to change the destination for the copy job, click in the Destination field and edit the current location.
  7. In the tabs of the New copy data job dialog box, enter the following values.
    General tab
    • Job name: A name is provided for the job, but you can append to it or replace it with a different name.
    • Job type: This read-only field describes the type of job. In this case, it’s Data transfer — import from HTTP.
    • Run immediately: Select this option to run the job immediately and only once.
    • Repeated execution: Select this option to schedule the time and frequency of repeated executions of the job.
    Advanced tab
    • Number of executors: Select the number of executors from the drop-down list. The default number is 3. If you have more than three nodes, you can increase execution speed by specifying a higher number of executors. If you want to execute this job in parallel with other Spark or MapReduce jobs, decrease the number of executors to increase performance.
    • Number of CPU cores per executor: Select the number of cores from the drop-down list. The default number is 5. If you want to execute this job in parallel with other Spark or MapReduce jobs, decrease the number of cores to increase performance.
    • Memory allocated for each execution: Select the amount of memory from the drop-down list. The default value is 40 GB. If you want to execute this job in parallel with other Spark or MapReduce jobs, decrease the memory to increase performance.
    • Memory allocated for driver: Select the memory limit from the drop-down list.
    • Custom logging level: Select this option to log the job’s activity and to select the logging level.
    • HTTP proxy: If the data transfer type is HTTP(S) and you have HTTP(S) header information stored in a file, you can use that header information in the HTTP(S) request header. From the HTTP headers file drop-down list, select the storage that contains the file. If the file is accessed via HTTP(S), enter its URI in the Enter a valid HTTP(S) URI field. For any other kind of provider, click the Select File button, and then navigate to and choose the file.
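    The executor, core, and memory settings above multiply together into the total cluster capacity the job reserves, which is why lowering any of them leaves more room for concurrent Spark or MapReduce jobs. A minimal sketch of that arithmetic, using the Advanced tab defaults (the function name is hypothetical, for illustration only):

    ```python
    # Sketch: total resources a copy job reserves under the Advanced-tab
    # defaults (3 executors, 5 CPU cores each, 40 GB memory each).
    def job_footprint(executors=3, cores_per_executor=5, memory_gb_per_executor=40):
        """Return (total_cores, total_memory_gb) the job claims on the cluster."""
        return executors * cores_per_executor, executors * memory_gb_per_executor

    # Defaults reserve 15 cores and 120 GB in total.
    print(job_footprint())             # (15, 120)

    # Reducing the executor count frees capacity for other concurrent jobs.
    print(job_footprint(executors=1))  # (5, 40)
    ```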
  8. Click Create.
    The Data copy job job_number created dialog box shows minimal status information about the job. Click the View more details link to show more details about the job in the Jobs section of the console.
  9. Review the job results in the Jobs section of the console.