Before You
Begin
In this 15-minute tutorial, you learn how to use Oracle Big Data Manager to upload a local file to the Hadoop Distributed File System (HDFS) on your cluster. You also learn how to import the file into a new Apache Hive table, create a new Oracle Big Data Manager paragraph from the table, and view the data.
Background
The Oracle Big Data Manager console provides a browser-based GUI to simplify data transfer tasks. You can use the Data explorer feature in Oracle Big Data Manager to upload local files and directories to HDFS.
You can also use Data explorer to view the available Hive databases and tables. You can import local comma-separated-values (csv) files from HDFS to Hive. This provides you with quick access to the data in Apache Hive and it also enables you to query your data.
You can use the Oracle Big Data Manager Notebook feature to load files or Hive tables into a new paragraph in a Notebook where you can view and analyze the data.
What Do You Need?
- Access to either an instance of Oracle Big Data Cloud Service or to an Oracle Big Data Appliance, and the required login credentials.
- Access to Oracle Big Data Manager, on either an instance of Oracle Big Data Cloud Service or on an Oracle Big Data Appliance, and the required sign in credentials. A port must be opened to permit access to Oracle Big Data Manager, as described in Enabling Oracle Big Data Manager.
- Read/Write privileges to the
/tmpHDFS directory. - Read/Write access to the Hive storage provider.
- Basic familiarity with HDFS, Spark, database concepts and SQL, and optionally, Apache Zeppelin.
Upload a Local File to HDFS
In this section, you upload the local taxidropoff.csv data file to
HDFS using the file upload feature in Data explorer. This .csv file contains many
rows and columns which makes it difficult to view, analyze, and query.
- Sign in to Oracle Big Data Manager. See Access Oracle Big Data Manager.
- Right-click the taxidropoff.csv file, select Save link as from the context menu, and then save it to your local machine. This dataset was created from several datasets on the NYC Taxi & Limousine Commission website.
- On the Big Data Manager page, click the Data tab.
- In the Data explorer section, select HDFS storage (hdfs)
from the Storage drop-down list. Navigate to the
/tmpdirectory, and then click File upload
on the toolbar.
Description of the illustration file-upload.png - In the Files upload dialog box, select Files upload from the Select files drop-down list. In the Open dialog box, navigate to your local directory
that contains the
taxidropoff.csvfile, and then select the file. Thetaxidropoff.csvfile is displayed in the Name column. Click Upload.
Description of the illustration upload.png Note: To upload an entire local directory to HDFS, select Directory upload from the Select files drop-down list.
- Drill-down on the Details section in the Files upload
dialog box to display the progress of the file upload operation. When the file is uploaded
successfully to HDFS, the Upload has finished message is displayed. Click
Close to close the dialog box. The
taxidropoff.csvfile is now displayed in the/tmpHDFS directory.
Import a File from HDFS into an Apache Hive Table
In this section, you import the taxidropoff.csv file from HDFS into a new Apache Hive
table in the default Hive database.
- In the Data explorer section, select HDFS storage (hdfs)
from the Storage drop-down list, if not already selected. Navigate to
the
/tmpdirectory, if needed. Right-click thetaxidropoff.csvfile. From the context menu, select Import to Apache Hive, and then Import as CSV. - In the Source and destination tab, change the Job name to
Import taxidropoff.csv to Hive.Thetaxidropoff.csvfile will be imported from HDFS into a new Hive table namedtaxidropoff_csvin thedefaultHive database. Accept the default values for all other fields. - Click the Hive import options tab. This tab contains two main sections: Options and File preview. In the Options section, accept the default selections for all the fields.
- The File preview section previews the first ten rows from the
taxidropoff.csvfile and the names and data types that Oracle Big Data Manager assigned to each column in the file. You can modify the Hive table columns' labels and data types as needed. Change the first column label fromdropoff_geocode_geo_citytodropoff_city. - Click Create to create and run a new job. This job will create the new
taxidropoff_csvHive table in thedefaultHive database and loads thetaxidropoff.csvdata into the new table. - Click the Jobs tab in the Oracle Big Data Manager page. The new
Import taxidropoff.csv to Hivejob is displayed in the list of available jobs. If the job executes successfully, the Last execution field showsSucceeded.You can click theImport taxidropoff.csv to Hivelink to display the job details.
Description of the illustration job-successful.png
The Create a new job dialog box is displayed. It has four tabs: Source and destination, Hive import options, Advanced job options, and Environment variables.
Load an Apache Hive Table into a New Oracle Big Data Manager Notebook
In this section, you load the new taxidropoff_csv Hive table into a new Oracle Big Data
Manager Notebook.
- In the Data explorer section, select Apache Hive
from the Storage drop-down list. In the Name
column, double-click the
defaultdatabase to display the tables in this database. - Right-click the
taxidropoff_csvHive table. From the context menu, select Notebook, and then Load Apache Hive table. - In the Paragraph settings wizard page, enter
NYC taxidropoff_csv Datasetin the Paragraph name field. Selectsparkfrom the Interpreter drop-down list, and then click Next. - In the Note settings wizard page, accept the default for Create
new note, enter
NYC taxidropoff_csv Hive Table Notein the New note name field, and then click Next. - In the Overview wizard page, review the settings. If you need to make a
correction, click the back arrow
. If you are
satisfied with the settings, click Finish to display the note.
- To control the settings of a paragraph, click Settings
, and then select the appropriate options from the context menu.
- Place the mouse cursor at the end of line 2, and then press the Enter key to add a new line. Copy the Scala
printlncommand from the print-command.txt file, and then paste it at the end of the paragraph to display the number of rows in thetaxidropoff_csvHive table. Click Run this paragraph
. The output is displayed at the bottom of
the paragraph.
- In the empty paragraph at the bottom of the note, copy the Hive SQL query from the hive-query.txt file, and then paste it in the paragraph. The
%sqlenables you to execute a Spark SQL query. The query displays all rows in thetaxidropoff_csvHive table. Click Run this paragraph
.
The following image shows the partial output of the taxidropoff_csvHive table in a table format. - To view the data in a graph format, click any of the available graph icons on the toolbar. Click
Bar Chart
.
Description of the illustration bar-chart.png
The Load table in Notebook wizard is displayed. It has three pages: Paragraph settings, Note settings, and Overview.
The new Oracle Big Data Manager note and paragraph are displayed.
The %spark Spark interpreter
creates a SparkContext and provides you with a Scala environment. In this example, the Spark
interpreter created a DataFrame which references the shown SQL query. This SQL query returns the
number of rows in the nyc_taxidropoff_csv Hive table. The DataFrame is stored in a
variable named file0. You can reference this variable anywhere in the Note.
Upload and Analyze Data with Oracle Big Data Manager