Before You Begin
In this 15-minute tutorial, you learn how to use Oracle Big Data Manager to copy data files hosted on a Hypertext Transfer Protocol Secure (HTTPS) server to the Hadoop Distributed File System (HDFS) on your cluster.
Background
This is the first tutorial in the Work with Oracle Big Data Manager series. Complete the tutorials in this series sequentially.
- Copy Data from an HTTP(S) Server with Oracle Big Data Manager
- Analyze Data with Oracle Big Data Manager Notebook
- Create a Personal Dashboard in Oracle Big Data Manager
What Do You Need?
- Access to an HTTP server such as Apache HTTP Server, Oracle HTTP Server (OHS), or Oracle WebLogic Server.
- Access to either an instance of Oracle Big Data Cloud Service or to an Oracle Big Data Appliance, and the required login credentials.
- Access to Oracle Big Data Manager, on either an instance of Oracle Big Data Cloud Service or on an Oracle Big Data Appliance, and the required sign-in credentials. A port must be opened to permit access to Oracle Big Data Manager, as described in Enabling Oracle Big Data Manager.
- Read/write privileges on the /user/demo HDFS directory.
- Basic familiarity with HDFS, Spark, and, optionally, Apache Zeppelin.
Configure and Run Your HTTP(S) Server
In this section, you copy two files to a new HTTP(S) web server directory on your cluster, and then configure and run the HTTP(S) server that will host the data files.
- Sign in to Oracle Big Data Manager. See Access Oracle Big Data Manager.
- Copy the taxidropoff_files.zip file to a new directory named taxi_telemetry on the machine where you will run the HTTP(S) server. This file contains 12 .csv data files that were created from several datasets on the NYC Taxi & Limousine Commission website. The taxi_telemetry directory will be accessible through web browsers, HTTP(S), and Oracle Big Data Manager.
- Extract the taxidropoff_files.zip file into the taxi_telemetry directory.
- Make sure that you have the following files in your taxi_telemetry HTTP(S) web server directory.
Description of the illustration files-on-cluster.png
- Right-click the list_of_files.txt file, select Save link as from the context menu, and then save it in the taxi_telemetry directory on the machine where you will run the HTTP(S) server. Edit the list_of_files.txt file: replace the occurrences of your_host_name with your host name, replace the occurrences of your_port_number with your port number, and then save the file.
Description of the illustration list-of-files.png
Your taxi_telemetry HTTP(S) web server directory should now contain the list_of_files.txt file and the 12 .csv data files.
- In this tutorial, we use the SimpleHTTPServer module that is included with Python. You can use this built-in HTTP server (or your own HTTP(S) server) to turn any directory on your cluster into a web server directory. In your terminal window, cd into the taxi_telemetry directory:
$ cd taxi_telemetry
- You can use any port that is available to you with your HTTP(S) server. The HTTP(S) server listens on this port for requests for the data hosted in your web server directory. In this tutorial, we use port 17777. To make sure that your port is available, enter the following command:
$ netstat -tlnp 2>/dev/null | grep [port#]
In the preceding command, substitute [port#] with your port number.
- Start the HTTP(S) server. For example, to start the HTTP(S) server on port 17777, enter the following command at the $ prompt:
$ python -m SimpleHTTPServer 17777
Description of the illustration start-http-server.png
In the preceding command, substitute 17777 with your port number.
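The server setup above can be sketched end to end. This is a minimal sketch, assuming python3 and curl are installed and port 17777 is free; it uses Python 3's http.server module (the successor to the Python 2 SimpleHTTPServer module named in the tutorial), and a stand-in data file takes the place of the real taxidropoff files:

```shell
#!/bin/sh
# Sketch of the steps above: create the web server directory, serve it
# over HTTP, and fetch a file to confirm the server responds.
# Assumptions: python3 and curl are installed; port 17777 is free.
mkdir -p taxi_telemetry
echo "dropoff_lat,dropoff_lon" > taxi_telemetry/taxidropoff_1.csv  # stand-in data file
python3 -m http.server 17777 --directory taxi_telemetry &          # Python 3 form of SimpleHTTPServer
SERVER_PID=$!
sleep 1                                                            # give the server a moment to bind
BODY=$(curl -s http://localhost:17777/taxidropoff_1.csv)
kill "$SERVER_PID"
echo "$BODY"
```

Stopping the server with kill mirrors what Ctrl+C does in the tutorial's foreground session.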
Copy a Data File from an HTTP(S) Server to HDFS
In this section, you copy the taxidropoff_1.csv data file from the taxi_telemetry directory to HDFS by using the Copy here from HTTP(S) feature in the Data explorer.
- On the Oracle Big Data Manager page, click the Data tab.
- In the Data explorer section, select HDFS storage (hdfs) from the Storage drop-down list. Navigate to the /user/demo directory, and then click Copy here from HTTP(S) on the toolbar. The New copy data job dialog box is displayed. It contains the Sources and Destination sections and the General and Advanced tabs.
- In the Sources section, click Select file or directory to display the Select file or directory dialog box. Select HTTP(S) from the Location drop-down list, if not already selected. In the URI field, enter a valid HTTP(S) URI using the following format, and then click Select. Substitute your_host_name with your actual host name and your_port_number with the port number that you are using with your HTTP(S) server.
http://your_host_name:your_port_number/taxidropoff_1.csv
Description of the illustration sources-section.png You can click Add source to add more source locations to this copy data job. In addition to HTTP(S), you can specify other storage source locations such as HDFS storage (hdfs) and Oracle Object Storage Classic (bdcs). In this example, we will only include the HTTP(S) source.
- In the Destination section, make sure that the /user/demo HDFS destination directory is displayed. To make any changes to the destination directory, click Edit destination.
Description of the illustration destination-section.png
- In the General tab, accept the defaults for all the fields, and then click Create.
Description of the illustration general-tab.png
- A Data copy job created window is displayed for the requested data transfer (copy). The window displays job details such as the job number, progress, start date and time, and duration. To display additional details about the data copy job, click View more details. When the file is copied successfully to HDFS, a Succeeded window is displayed.
- In the Data explorer section, select HDFS storage (hdfs) from the Storage drop-down list. Navigate to the /user/demo directory, and then click Refresh on the toolbar. The taxidropoff_1.csv file is now displayed in the /user/demo HDFS directory.
Description of the illustration file-in-hdfs.png
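Before pointing a copy data job at a source URI, you can sanity-check from the command line that the URI answers. A minimal sketch, again assuming python3 and curl; port 17778, the source_dir directory, and the file contents are arbitrary stand-ins for your real server and data:

```shell
#!/bin/sh
# Sketch: confirm an HTTP(S) source URI answers with 200 OK before you
# create a copy data job for it.
# Assumptions: python3 and curl are installed; port 17778 is free.
mkdir -p source_dir
echo "id,ts" > source_dir/taxidropoff_1.csv        # stand-in data file
python3 -m http.server 17778 --directory source_dir &
SERVER_PID=$!
sleep 1
STATUS=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:17778/taxidropoff_1.csv)
kill "$SERVER_PID"
echo "HTTP $STATUS"
```

A non-200 status here usually means a wrong host name, port, or file name in the URI, which is also the most common reason a copy data job fails.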
Copy Multiple Data Files from an HTTP(S) Server to HDFS
In this section, you copy six .csv data files from the taxi_telemetry directory to HDFS by using the Use file as link to list of files checkbox in the New copy data job dialog box. You will use the list_of_files.txt file, which contains the URLs for the six .csv data files.
- On the Oracle Big Data Manager page, click the Data tab.
- In the Data explorer section, select HDFS storage (hdfs) from the Storage drop-down list. Navigate to the /user/demo directory, and then click Copy here from HTTP(S) on the toolbar. The New copy data job dialog box is displayed.
- In the Sources section, click Select file or directory to display the Select file or directory dialog box. Select HTTP(S) from the Location drop-down list, if not already selected. Select the Use file as link to list of files checkbox. In the URI field, enter a valid HTTP(S) URI that references the list_of_files.txt file using the following format, and then click Select. The New copy data job dialog box is displayed.
http://your_host_name:your_port_number/list_of_files.txt
In the preceding URI, substitute your_host_name with your actual host name. Substitute your_port_number with the port number that you are using with your HTTP(S) server.
Description of the illustration copy-using-list-of-files.png
To display the contents of the list_of_files.txt file in the taxi_telemetry directory, enter the following command:
$ cat list_of_files.txt
- In the Destination section, make sure that the /user/demo HDFS destination directory is displayed.
- In the General tab, accept the defaults for all the fields, and then click Create.
- A Data copy job created window is displayed for the requested data transfer. When the files are copied successfully to HDFS, a Succeeded window is displayed. Click Close to close the window.
- The copied .csv files are now displayed in the /user/demo HDFS directory. If the files are not displayed, click Refresh on the toolbar.
Description of the illustration files-in-hdfs.png
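The link-list mechanism above depends only on list_of_files.txt containing one absolute URL per line. A sketch of the format and of iterating it; your_host_name and your_port_number are the same placeholders the tutorial has you replace, and two URLs stand in for the six:

```shell
#!/bin/sh
# Sketch of the list-of-files format: one absolute HTTP(S) URL per line.
# your_host_name and your_port_number are placeholders, as in the tutorial.
cat > list_of_files.txt <<'EOF'
http://your_host_name:your_port_number/taxidropoff_2.csv
http://your_host_name:your_port_number/taxidropoff_3.csv
EOF
# Count the listed URLs the same way a client would iterate them.
COUNT=0
while read -r url; do
  COUNT=$((COUNT + 1))
done < list_of_files.txt
echo "$COUNT URLs listed"
```

Because each line must be a complete URL, a single unreplaced placeholder or a stray blank line in list_of_files.txt is enough to make part of the copy job fail.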
Next Tutorial
Analyze Data with Oracle Big Data Manager Notebook