Before You Begin
In this 15-minute tutorial, you learn how to use Oracle Big Data Manager to copy data files hosted on a Hypertext Transfer Protocol Secure (HTTPS) server to the Hadoop Distributed File System (HDFS) on your cluster.
Background
This is the first tutorial in the Work with Oracle Big Data Manager series. Complete the tutorials in this series sequentially.
- Copy Data from an HTTP(S) Server with Oracle Big Data Manager
- Analyze Data with Oracle Big Data Manager Notebook
- Create a Personal Dashboard in Oracle Big Data Manager
What Do You Need?
- Access to an HTTP server such as Apache HTTP Server, Oracle HTTP Server (OHS), or Oracle WebLogic Server.
- Access to either an instance of Oracle Big Data Cloud Service or to an Oracle Big Data Appliance, and the required login credentials.
- Access to Oracle Big Data Manager, on either an instance of Oracle Big Data Cloud Service or on an Oracle Big Data Appliance, and the required sign-in credentials. A port must be opened to permit access to Oracle Big Data Manager, as described in Enabling Oracle Big Data Manager.
- Read/write privileges on the /user/demo HDFS directory.
- Basic familiarity with HDFS, Spark, and, optionally, Apache Zeppelin.
Configure and Run Your HTTP(S) Server
In this section, you copy two files to a new HTTP(S) web server directory on your cluster, and then configure and run the HTTP(S) server that will host the data files.
- Sign in to Oracle Big Data Manager. See Access Oracle Big Data Manager.
- Copy the taxidropoff_files.zip file to a new directory named taxi_telemetry on the machine where you will run the HTTP(S) server. This file contains 12 .csv data files that were created from several datasets on the NYC Taxi & Limousine Commission website. The taxi_telemetry directory will be accessible through web browsers, HTTP(S), and Oracle Big Data Manager.
- Extract the taxidropoff_files.zip file into the taxi_telemetry directory.
- Make sure that you have the following files in your taxi_telemetry HTTP(S) web server directory.
Description of the illustration files-on-cluster.png
- Right-click the list_of_files.txt file, select Save link as from the context menu, and then save it in the taxi_telemetry directory on the machine where you will run the HTTP(S) server. Edit the list_of_files.txt file: replace the occurrences of your_host_name with your host name, replace the occurrences of your_port_number with your port number, and then save the file.
Description of the illustration list-of-files.png
Your taxi_telemetry HTTP(S) web server directory should now contain the list_of_files.txt file and the 12 .csv data files.
- In this tutorial, we use the SimpleHTTPServer module that is included with Python. You can use this built-in HTTP server (or your own HTTP(S) server) to turn any directory on your cluster into a web server directory. In your terminal window, cd into the taxi_telemetry directory:
$ cd taxi_telemetry
- You can use any port that is available to you with your HTTP(S) server. The HTTP(S) server listens on this port for requests for the data hosted in your web server directory. In this tutorial, we use port 17777. To make sure that your port is available, enter the following command:
$ netstat -tlnp 2>/dev/null | grep [port#]
In the preceding command, substitute [port#] with your port number.
- Start the HTTP(S) server. For example, to start the HTTP(S) server on port 17777, enter the following command at the $ prompt:
$ python -m SimpleHTTPServer 17777
Description of the illustration start-http-server.png
In the preceding command, substitute 17777 with your port number.
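The server setup above can be sketched end to end. This is a minimal sketch, assuming python3 and curl are installed and port 17777 is free; it uses Python 3's http.server module (the successor to the Python 2 SimpleHTTPServer module named in the tutorial), and a stand-in data file takes the place of the real taxidropoff files:

```shell
#!/bin/sh
# Sketch of the steps above: create the web server directory, serve it
# over HTTP, and fetch a file to confirm the server responds.
# Assumptions: python3 and curl are installed; port 17777 is free.
mkdir -p taxi_telemetry
echo "dropoff_lat,dropoff_lon" > taxi_telemetry/taxidropoff_1.csv  # stand-in data file
python3 -m http.server 17777 --directory taxi_telemetry &          # Python 3 form of SimpleHTTPServer
SERVER_PID=$!
sleep 1                                                            # give the server a moment to bind
BODY=$(curl -s http://localhost:17777/taxidropoff_1.csv)
kill "$SERVER_PID"
echo "$BODY"
```

Stopping the server with kill mirrors what Ctrl+C does in the tutorial's foreground session.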
Copy a Data File from an HTTP(S) Server to HDFS
In this section, you copy the taxidropoff_1.csv data file from the taxi_telemetry directory to HDFS by using the Copy here from HTTP(S) feature in the Data explorer.
- On the Oracle Big Data Manager page, click the Data tab.
- In the Data explorer section, select HDFS storage (hdfs) from the Storage drop-down list. Navigate to the /user/demo directory, and then click Copy here from HTTP(S) on the toolbar. The New copy data job dialog box is displayed. It contains the Sources and Destination sections and the General and Advanced tabs.
- In the Sources section, click Select file or directory to display the Select file or directory dialog box. Select HTTP(S) from the Location drop-down list, if not already selected. In the URI field, enter a valid HTTP(S) URI using the following format, and then click Select. Substitute your_host_name with your actual host name and your_port_number with the port number that you are using with your HTTP(S) server.
http://your_host_name:your_port_number/taxidropoff_1.csv
Description of the illustration sources-section.png You can click Add source to add more source locations to this copy data job. In addition to HTTP(S), you can specify other storage source locations such as HDFS storage (hdfs) and Oracle Object Storage Classic (bdcs). In this example, we will only include the HTTP(S) source.
- In the Destination section, make sure that the /user/demo HDFS destination directory is displayed. To make any changes to the destination directory, click Edit destination.
Description of the illustration destination-section.png
- In the General tab, accept the defaults for all the fields, and then click Create.
Description of the illustration general-tab.png
- A Data copy job created window is displayed for the requested data transfer (copy). The window displays job details such as the job number, progress, start date and time, and duration. To display additional details about the data copy job, click View more details. When the file is copied successfully to HDFS, a Succeeded window is displayed.
- In the Data explorer section, select HDFS storage (hdfs) from the Storage drop-down list. Navigate to the /user/demo directory, and then click Refresh on the toolbar. The taxidropoff_1.csv file is now displayed in the /user/demo HDFS directory.
Description of the illustration file-in-hdfs.png
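Before pointing a copy data job at a source URI, you can sanity-check from the command line that the URI answers. A minimal sketch, again assuming python3 and curl; port 17778, the source_dir directory, and the file contents are arbitrary stand-ins for your real server and data:

```shell
#!/bin/sh
# Sketch: confirm an HTTP(S) source URI answers with 200 OK before you
# create a copy data job for it.
# Assumptions: python3 and curl are installed; port 17778 is free.
mkdir -p source_dir
echo "id,ts" > source_dir/taxidropoff_1.csv        # stand-in data file
python3 -m http.server 17778 --directory source_dir &
SERVER_PID=$!
sleep 1
STATUS=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:17778/taxidropoff_1.csv)
kill "$SERVER_PID"
echo "HTTP $STATUS"
```

A non-200 status here usually means a wrong host name, port, or file name in the URI, which is also the most common reason a copy data job fails.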
Copy Multiple Data Files from an HTTP(S) Server to HDFS
In this section, you copy six .csv data files from the taxi_telemetry directory to HDFS by using the Use file as link to list of files checkbox in the New copy data job dialog box. You will use the list_of_files.txt file, which contains the URLs for the six .csv data files.
- On the Oracle Big Data Manager page, click the Data tab.
- In the Data explorer section, select HDFS storage (hdfs) from the Storage drop-down list. Navigate to the /user/demo directory, and then click Copy here from HTTP(S) on the toolbar. The New copy data job dialog box is displayed.
- In the Sources section, click Select file or directory to display the Select file or directory dialog box. Select HTTP(S) from the Location drop-down list, if not already selected. Select the Use file as link to list of files checkbox. In the URI field, enter a valid HTTP(S) URI that references the list_of_files.txt file using the following format, and then click Select. The New copy data job dialog box is displayed.
http://your_host_name:your_port_number/list_of_files.txt
In the preceding URI, substitute your_host_name with your actual host name. Substitute your_port_number with the port number that you are using with your HTTP(S) server.
Description of the illustration copy-using-list-of-files.png
To display the contents of the list_of_files.txt file in the taxi_telemetry directory, enter the following command:
$ cat list_of_files.txt
- In the Destination section, make sure that the /user/demo HDFS destination directory is displayed.
- In the General tab, accept the defaults for all the fields, and then click Create.
- A Data copy job created window is displayed for the requested data transfer. When the files are copied successfully to HDFS, a Succeeded window is displayed. Click Close to close the window.
- The copied .csv files are now displayed in the /user/demo HDFS directory. If the files are not displayed, click Refresh on the toolbar.
Description of the illustration files-in-hdfs.png
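The link-list mechanism above depends only on list_of_files.txt containing one absolute URL per line. A sketch of the format and of iterating it; your_host_name and your_port_number are the same placeholders the tutorial has you replace, and two URLs stand in for the six:

```shell
#!/bin/sh
# Sketch of the list-of-files format: one absolute HTTP(S) URL per line.
# your_host_name and your_port_number are placeholders, as in the tutorial.
cat > list_of_files.txt <<'EOF'
http://your_host_name:your_port_number/taxidropoff_2.csv
http://your_host_name:your_port_number/taxidropoff_3.csv
EOF
# Count the listed URLs the same way a client would iterate them.
COUNT=0
while read -r url; do
  COUNT=$((COUNT + 1))
done < list_of_files.txt
echo "$COUNT URLs listed"
```

Because each line must be a complete URL, a single unreplaced placeholder or a stray blank line in list_of_files.txt is enough to make part of the copy job fail.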
Next Tutorial
Analyze Data with Oracle Big Data Manager Notebook