4.2 Copy Data (Including from Multiple Sources)

In the Data section of the Oracle Big Data Manager console, you can create, schedule, and run copy jobs that include multiple sources. You can also copy data via HTTP(S).

  1. Click the Data tab at the top of the page, and then click the Explorer tab on the left side of the page.
  2. In either panel, select a destination storage provider for the copy job from the Storage drop-down list.
  3. On the toolbar for that panel, click the Copy here from HTTP(S) icon.
  4. In the New copy data job dialog box, select the source to be copied:
    1. Click the Select file or directory button next to Sources.
    2. From the Location drop-down list, select the storage provider from which you are copying. The list shows the storage providers registered with Oracle Big Data Manager.
    3. If desired, select Use file as link to list of files to point to a file that lists the sources, instead of selecting a single file directly.
      If you select this option, the file must be a .csv file, and each line of the file must satisfy the schema link_to_file[,http_headers_encoded_in_Base64], where the ,http_headers_encoded_in_Base64 field is optional. For example (a short sketch for generating such a file appears after these steps):
      http://172.16.253.111/public/big.file
      https://172.16.253.111/public/small.file
      http://172.16.253.111/private/secret.file,QXV0aG9yaXphdGlvbjogQmFzaWMgYjNKaFkyeGxPa2cwY0hCNVJqQjQK
      https://oracle:passwd@172.16.253.111/private/small.file
    4. Navigate to and select the item you're copying. You can also specify a path manually in the Path field. For HTTP(S), enter the URI of the source in the URI field.
    5. Click Select.
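    For illustration, the following Python sketch shows one way to generate a links file in the format described above. It is not part of Oracle Big Data Manager, and the file name, URLs, and credentials are hypothetical placeholders:

    import base64

    # Hypothetical source URL and credentials, for illustration only.
    url = "http://172.16.253.111/private/secret.file"
    header = "Authorization: Basic dXNlcjpwYXNzd29yZA=="

    # The optional second field is the HTTP header line, Base64 encoded.
    encoded = base64.b64encode(header.encode("utf-8")).decode("ascii")

    with open("links.csv", "w") as f:
        f.write("http://172.16.253.111/public/big.file\n")  # no headers
        f.write(url + "," + encoded + "\n")                 # with headers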
  5. If you want to copy from multiple sources in the same copy job, click the Add source button and repeat the tasks in the previous step.
  6. If you want to change the destination for the copy job, click inside the Destination field and edit the current location.
  7. In the New copy data job dialog box, provide values as described below.

    General tab

    • Job name: A name is provided for the job, but you can change it if you want.
    • Job type: This read-only field describes the type of job. In this case, it’s Data transfer - import from HTTP.
    • CPU utilization: Use the slider to specify CPU utilization for the job. The job configuration is calculated from the cluster's shape, and the default is 30 percent. A higher value gives the job more CPUs, which can improve performance when you're copying a large number of files, but it also leaves fewer CPUs available in the cluster for other tasks.
    • Memory utilization: Use the slider to specify memory utilization for the job. The job configuration is calculated from the cluster's shape, and the default is 30 percent. Assigning more memory to a job can improve its performance, but it also leaves less free memory for other tasks. If the job is given too little memory, it will crash. If the job is given more memory than is currently available, it remains in a PENDING state until the requested amount of memory becomes available.
    • Overwrite existing files: Select this option to overwrite existing files of the same name in the target destination. This is selected by default.
    • Run immediately: Select this option to run the job immediately and only once. This is selected by default.
    • Repeated execution: Select this option to schedule the time and frequency of repeated executions of the job. You can specify a simplified entry, or click Advanced entry to enter a cron expression.
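    For example, assuming the Advanced entry field accepts standard five-field cron syntax (an assumption; the dialog documents the exact format it supports), the following expression runs the job daily at 02:00:

    0 2 * * *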
    Advanced tab
    • Block size: Select the HDFS file chunk size from the drop-down list.
    • Number of executors per node: Specify the number of executor cores. The default is 5. If you want to run this job in parallel with other Spark or MapReduce jobs, decrease the number of cores to improve overall performance.
    • Memory allocated for driver: Select the memory limit from the drop-down list. This is the memory allocated to the application driver, which is responsible for task scheduling. The default is 1 GB.
    • Custom logging level: Select this option to log the job’s activity and to select the logging level. The default logging level is INFO. If this check box is not selected, the logging level for the job defaults to whatever is configured in the cluster.
    • HTTP proxy: If the data transfer type is HTTP(S) and you have HTTP(S) header information stored in a file, you can use that header information in the HTTP(S) request header. Click Select file, select the storage provider from the Location drop-down list, and navigate to and select the file that contains the HTTP(S) header information. For HTTP(S), enter the URI for the file in the URI field. Then click Select.

      Each line of a file with HTTP(S) headers has the structure regex_pattern,http_headers. For example, the following line applies specific HTTP(S) headers to files that contain image in their path or name. Note that the HTTP(S) headers must be Base64 encoded:

      .*image.*,QXV0aG9yaXphdGlvbjogQmFzaWMgYjNKaFkyeGxPa2cwY0hCNVJqQjQK
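      As an illustration, the following Python sketch produces such a line. It is not part of the product; the file name, pattern, and credentials are hypothetical placeholders:

      import base64

      # Hypothetical rule: apply an Authorization header to any source whose
      # path or name contains "image". The credentials are placeholders.
      pattern = ".*image.*"
      header = "Authorization: Basic dXNlcjpwYXNzd29yZA=="
      encoded = base64.b64encode(header.encode("utf-8")).decode("ascii")

      with open("http_headers.csv", "w") as f:  # hypothetical file name
          f.write(pattern + "," + encoded + "\n")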

  8. Click Create.

    The Data copy job job_number created dialog box shows minimal status information about the job. When the job completes, click View more details to show more details about the job in the Jobs section of the console. You can also click this link while the job is running.

  9. Review the job results. The tabs on the left provide different types of information. You can also stop or remove running jobs, and rerun or remove completed jobs, from the menu (menu icon) for the job on each tab.
    • The Summary tab shows summary information for the job.
    • The Arguments tab shows the parameters passed to the job.
    • The Job output tab shows job output, which you can also download.

    Also see Manage Jobs in Oracle Big Data Manager.