Before You Begin
In this 15-minute tutorial, you learn how to use Oracle Big Data Manager to upload a local file to the Hadoop Distributed File System (HDFS) on your cluster. You also learn how to import the file into a new Apache Hive table, create a new Oracle Big Data Manager paragraph from the table, and view the data.
Background
The Oracle Big Data Manager console provides a browser-based GUI to simplify data transfer tasks. You can use the Data explorer feature in Oracle Big Data Manager to upload local files and directories to HDFS.
You can also use Data explorer to view the available Hive databases and tables. You can import comma-separated values (CSV) files from HDFS into Hive, which gives you quick access to the data in Apache Hive and enables you to query it.
You can use the Oracle Big Data Manager Notebook feature to load files or Hive tables into a new paragraph in a Notebook where you can view and analyze the data.
What Do You Need?
- Access to either an instance of Oracle Big Data Cloud Service or to an Oracle Big Data Appliance, and the required login credentials.
- Access to Oracle Big Data Manager, on either an instance of Oracle Big Data Cloud Service or on an Oracle Big Data Appliance, and the required sign-in credentials. A port must be opened to permit access to Oracle Big Data Manager, as described in Enabling Oracle Big Data Manager.
- Read/Write privileges to the /tmp HDFS directory.
- Read/Write access to the Hive storage provider.
- Basic familiarity with HDFS, Spark, database concepts and SQL, and optionally, Apache Zeppelin.
Upload a Local File to HDFS
In this section, you upload the local taxidropoff.csv data file to HDFS using the file upload feature in Data explorer. This .csv file contains many rows and columns, which makes it difficult to view, analyze, and query.
- Sign in to Oracle Big Data Manager. See Access Oracle Big Data Manager.
- Right-click the taxidropoff.csv file, select Save link as from the context menu, and then save it to your local machine. This dataset was created from several datasets on the NYC Taxi & Limousine Commission website.
- On the Big Data Manager page, click the Data tab.
- In the Data explorer section, select HDFS storage (hdfs) from the Storage drop-down list. Navigate to the /tmp directory, and then click File upload on the toolbar.
Description of the illustration file-upload.png
- In the Files upload dialog box, select Files upload from the Select files drop-down list. In the Open dialog box, navigate to your local directory that contains the taxidropoff.csv file, and then select the file. The taxidropoff.csv file is displayed in the Name column. Click Upload.
Description of the illustration upload.png
Note: To upload an entire local directory to HDFS, select Directory upload from the Select files drop-down list.
- Expand the Details section in the Files upload dialog box to display the progress of the file upload operation. When the file is uploaded to HDFS successfully, the Upload has finished message is displayed. Click Close to close the dialog box. The taxidropoff.csv file is now displayed in the /tmp HDFS directory. An optional Spark check of the uploaded file is sketched after these steps.
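If you want to confirm the upload from outside the console, you can read the file back from HDFS in a Spark shell or a %spark notebook paragraph on the cluster. The following Scala sketch is illustrative only: the path /tmp/taxidropoff.csv comes from the steps above, while the header and inferSchema options are assumptions about the file layout.

    // Illustrative check from spark-shell or a %spark notebook paragraph,
    // where the SparkSession is predefined as spark: read the uploaded CSV
    // back from HDFS and inspect it.
    val dropoffs = spark.read
      .option("header", "true")       // assumption: first row holds column names
      .option("inferSchema", "true")  // assumption: let Spark infer column types
      .csv("hdfs:///tmp/taxidropoff.csv")

    dropoffs.printSchema()            // column names and inferred types
    println(s"Rows read: ${dropoffs.count()}")
    dropoffs.show(10)                 // first ten rows, similar to the console preview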
Import a File from HDFS into an Apache Hive Table
In this section, you import the taxidropoff.csv file from HDFS into a new Apache Hive table in the default Hive database.
- In the Data explorer section, select HDFS storage (hdfs) from the Storage drop-down list, if not already selected. Navigate to the /tmp directory, if needed. Right-click the taxidropoff.csv file. From the context menu, select Import to Apache Hive, and then Import as CSV. The Create a new job dialog box is displayed. It has four tabs: Source and destination, Hive import options, Advanced job options, and Environment variables.
- In the Source and destination tab, change the Job name to Import taxidropoff.csv to Hive. The taxidropoff.csv file will be imported from HDFS into a new Hive table named taxidropoff_csv in the default Hive database. Accept the default values for all other fields.
- Click the Hive import options tab. This tab contains two main sections: Options and File preview. In the Options section, accept the default selections for all the fields.
- The File preview section shows the first ten rows of the taxidropoff.csv file and the names and data types that Oracle Big Data Manager assigned to each column in the file. You can modify the Hive table column labels and data types as needed. Change the first column label from dropoff_geocode_geo_city to dropoff_city.
- Click Create to create and run a new job. This job creates the new taxidropoff_csv Hive table in the default Hive database and loads the taxidropoff.csv data into the new table. A Spark sketch of an equivalent load follows these steps.
- Click the Jobs tab on the Oracle Big Data Manager page. The new Import taxidropoff.csv to Hive job is displayed in the list of available jobs. If the job executes successfully, the Last execution field shows Succeeded. You can click the Import taxidropoff.csv to Hive link to display the job details.
Description of the illustration job-successful.png
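For context, the same end result can also be reached with the public Spark APIs. The following Scala sketch is not the job that Oracle Big Data Manager generates; it is an illustrative equivalent, and the read options are assumptions about the file layout.

    // Illustrative only: an equivalent CSV-to-Hive load with the public
    // Spark APIs. Big Data Manager generates and runs its own import job.
    val raw = spark.read
      .option("header", "true")       // assumption: first row holds column names
      .option("inferSchema", "true")  // assumption: let Spark infer column types
      .csv("hdfs:///tmp/taxidropoff.csv")

    raw
      .withColumnRenamed("dropoff_geocode_geo_city", "dropoff_city") // same rename as in File preview
      .write
      .mode("overwrite")
      .saveAsTable("default.taxidropoff_csv") // new table in the default Hive database

    // Quick sanity check against the new table.
    spark.sql("SELECT COUNT(*) FROM default.taxidropoff_csv").show()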

Load an Apache Hive Table into a New Oracle Big Data Manager Notebook
In this section, you load the new taxidropoff_csv Hive table into a new Oracle Big Data Manager Notebook.
- In the Data explorer section, select Apache Hive from the Storage drop-down list. In the Name column, double-click the default database to display the tables in this database.
- Right-click the taxidropoff_csv Hive table. From the context menu, select Notebook, and then Load Apache Hive table. The Load table in Notebook wizard is displayed. It has three pages: Paragraph settings, Note settings, and Overview.
- In the Paragraph settings wizard page, enter NYC taxidropoff_csv Dataset in the Paragraph name field. Select spark from the Interpreter drop-down list, and then click Next.
- In the Note settings wizard page, accept the default for Create new note, enter NYC taxidropoff_csv Hive Table Note in the New note name field, and then click Next.
- In the Overview wizard page, review the settings. If you need to make a correction, click the back arrow. If you are satisfied with the settings, click Finish to display the note. The new Oracle Big Data Manager note and paragraph are displayed.
- To control the settings of a paragraph, click Settings, and then select the appropriate options from the context menu.
- Place the mouse cursor at the end of line 2, and then press the Enter key to add a new line. Copy the Scala println command from the print-command.txt file, and then paste it at the end of the paragraph to display the number of rows in the taxidropoff_csv Hive table. Click Run this paragraph. The output is displayed at the bottom of the paragraph.
- In the empty paragraph at the bottom of the note, copy the Hive SQL query from the hive-query.txt file, and then paste it in the paragraph. The %sql interpreter enables you to execute a Spark SQL query. The query displays all rows in the taxidropoff_csv Hive table. Click Run this paragraph. The partial output of the taxidropoff_csv Hive table is displayed in table format. A sketch of both paragraphs follows these steps.
- To view the data in a graph format, click any of the available graph icons on the toolbar. Click Bar Chart.
Description of the illustration bar-chart.png
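The actual commands ship in print-command.txt and hive-query.txt and are not reproduced in this tutorial, so the following Scala sketch is only a hypothetical stand-in that shows the general shape of the two paragraphs: a row count for the %spark paragraph, and the Scala equivalent of the query you run in the %sql paragraph.

    // Hypothetical stand-in for the println command in print-command.txt:
    // report how many rows the new Hive table holds (%spark paragraph).
    println(s"Row count: ${spark.table("default.taxidropoff_csv").count()}")

    // Scala equivalent of the query described for the %sql paragraph
    // (hive-query.txt); in the note itself you run it under %sql instead.
    spark.sql("SELECT * FROM default.taxidropoff_csv").show(10)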


The %spark Spark interpreter creates a SparkContext and provides you with a Scala environment. In this example, the Spark interpreter created a DataFrame that references the shown SQL query. This SQL query returns the number of rows in the taxidropoff_csv Hive table. The DataFrame is stored in a variable named file0. You can reference this variable anywhere in the note.


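As a concrete illustration of that description, the sketch below shows what such a paragraph could look like. The exact query text that the wizard generates is not reproduced in this tutorial, so the SELECT statement is a hypothetical stand-in; only the variable name file0 comes from the description above.

    // Hypothetical sketch of the wizard-generated %spark paragraph.
    // The query text is a stand-in; the variable name file0 comes from
    // the description above.
    val file0 = spark.sql("SELECT COUNT(*) AS row_count FROM default.taxidropoff_csv")
    file0.show()

    // Because file0 is bound in the interpreter session, later %spark
    // paragraphs in the same note can keep referencing it, for example:
    file0.printSchema()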