Analyze Data from Multiple Sources with Oracle Big Data Manager

Before You Begin

In this 15-minute tutorial, you learn how to import a Zeppelin note from a GitHub repository into Oracle Big Data Manager Notebook. You also learn how to load a comma-separated-value (.csv) file from the Hadoop Distributed File System (HDFS) and an Oracle database table as new paragraphs into the imported note. Finally, you run the note.

Background

This is the third tutorial in the Integrate GitHub and Oracle Database with Oracle Big Data Manager series. Read the tutorials in the order listed.

Register a GitHub Repository as a Storage Provider with Oracle Big Data Manager
Register an Oracle Database as a Storage Provider with Oracle Big Data Manager
Analyze Data from Multiple Sources with Oracle Big Data Manager

What Do You Need?

Access to either an instance of Oracle Big Data Cloud Service or to an Oracle Big Data Appliance, and the required login credentials.
Access to Oracle Big Data Manager, on either an instance of Oracle Big Data Cloud Service or on an Oracle Big Data Appliance, and the required sign in credentials. A port must be opened to permit access to Oracle Big Data Manager, as described in Enabling Oracle Big Data Manager.
Read/Write privileges to the /user/demo HDFS directory.
Complete the first two tutorials in this series: The Register a GitHub Repository as a Storage Provider with Oracle Big Data Manager and the Register an Oracle Database as a Storage Provider with Oracle Big Data Manager tutorials.
Access to an Oracle database and the required login credentials.
Basic familiarity with HDFS, Spark, and optionally, Apache Zeppelin.

Import a Zeppelin Note from GitHub into Oracle Big Data Manager Notebook

In this section, you import a Zeppelin note from a GitHub repository into Oracle Big Data Manager Notebook.

Sign in to Oracle Big Data Manager. See Access Oracle Big Data Manager.
On the Oracle Big Data Manager page, click the Data tab.
In the Data explorer section, select Github (github) from the Storage drop-down list. The bdm-notebook-demo GitHub repository that you registered in the first tutorial is displayed in the Name column. Double-click the repository name and navigate to the master branch. This branch contains the Notebook.json Zeppelin note that you will import into Oracle Big Data Manager Notebook.

Description of the illustration master-branch.png
Right-click the Notebook.json file. From the context menu, select Notebook, and then Import as Notebook Note. The Import as Notebook Note dialog box is displayed.
Enter HDFS and Oracle Database Data Analysis in the Note name field, and then click Create. The HDFS and Oracle Database Data Analysis note is imported and displayed in Notebook. The initial status of each paragraph in the note is READY which indicates that the paragraph has not been executed yet.
The top paragraph in the note uses the %md Markdown interpreter to generate static html from Markdown plain text to display detailed information about this note. Click Hide editor(Ctrl+Alt+E) on the paragraph's toolbar to hide the detailed information.

Description of the illustration
display-note.png — Description of the illustration display-note.png

Load an HDFS File as a Paragraph into the Imported Note

In this section, you upload the edges.csv file to HDFS and load the file as a new paragraph into the imported note. The edges.csv file contains two columns. Each column represents a student id. The two student ids in each row represent two students who are collaborating on a specific class project.

Right-click the edges.csv file, select Save link as... from the context menu, and then save it to your local machine.
Upload the edges.csv file to HDFS. On the Oracle Big Data Manager page, click the Data tab. In the Data explorer section, select HDFS storage (hdfs) from the Storage drop-down list. Navigate to the /user/demo directory, and then click File upload on the toolbar.

Description of the illustration file-upload.png
In the Files upload dialog box, choose Files upload from the Select files drop-down list. In the Open dialog box, navigate to your local directory that contains the edges.csv file, and then select the file. The edges.csv file is displayed in the Name column. Click Upload.
Drill-down on the Details section in the Files upload dialog box to display the progress of the file upload operation. When the file is uploaded successfully to HDFS, the Upload has finished message is displayed. Click Close to close the dialog box. The edges.csv file is now displayed in the /user/demo HDFS directory.
Right-click the edges.csv file. From the context menu, select Notebook, and then Load as general file. The Load file in Notebook wizard is displayed. It has three pages: Paragraph settings, Note settings, and Overview.
In the Paragraph settings wizard page, enter Load Data from HDFS in the Paragraph name field. Select spark from the Interpreter drop-down list, and then click Next.
In the Note settings wizard page, select Add to existing note in the Note field. Select the HDFS and Oracle Database Data Analysis note from the Select note drop-down list. Select Top of note from the New paragraph(s) position drop-down list, and then click Next.

Description of the
illustration note-settings-page.png — Description of the illustration note-settings-page.png

In the Overview wizard page, review the settings. If you need to make a correction, click the back arrow . If you are satisfied with the settings, click Finish to display the updated note. The new Load Data from HDFS paragraph is displayed at the top of the updated note.

Description of the illustration
hdfs-paragraph.png — Description of the illustration hdfs-paragraph.png

The new Load Data from HDFS paragraph uses the %spark Spark interpreter to create a DataFrame that references the edges.csv file that you uploaded earlier to the /user/demo HDFS directory. This DataFrame is stored in the fileDataFrame variable. Rename this variable to graphText. The Process the Data paragraph in this note references the graphText variable.
```
val graphText = sc.textFile("hdfs:/user/demo/edges.csv")
```

Load an Oracle Database Table into the Imported Note

In this section, you load the USERS Oracle database table that you created in the second tutorial as a new paragraph into the imported note. The USERS table contains information about each student's name, id, and class category.

On the Oracle Big Data Manager page, click the Data tab. In the Data explorer section, select Oracle database (oracle-database-demo) from the Storage drop-down list. Navigate to your schema that contains the USERS table that you created earlier. In this example, we created the USERS table in the BDM_DEMO schema. The table is displayed in the Name column.

Description of the illustration
users-table.png — Description of the illustration users-table.png

Right-click the USERS table. From the context menu, select Notebook, and then Load table. The Load table in Notebook wizard is displayed. It has three pages: Paragraph settings, Note settings, and Overview.
In the Paragraph settings wizard page, enter Load Data from Oracle Database Table in the Paragraph name field. Select spark from the Interpreter drop-down list, and then click Next.
In the Note settings wizard page, select Add to existing note in the Note field. Select the HDFS and Oracle Database Data Analysis note from the Select note drop-down list. Select Top of note from the New paragraph(s) position drop-down list, and then click Next.
In the Overview wizard page, review the settings. If you need to make a correction, click the back arrow . If you are satisfied with the settings, click Finish to display updated note. Two new paragraphs are added at the top of the note.
Click the title of the Load Data from Oracle Database Table (paragraph 1) paragraph. The title becomes active in read/write mode. Rename the title to Load Data from Oracle Database Table. Rename the title of the Load Data from Oracle Database Table (paragraph 2) paragraph to Count Number of Students.

Description of the illustration oracle-database-paragraphs.png

When you load a relational database table into a note, Oracle Big Data Manager creates two paragraphs. The first paragraph creates the jdbcDF DataFrame which references the USERS table in your Oracle database. It also registers the jdbcDF DataFrame as a temporary table named USERS. You can run SQL queries on this temporary table. The second paragraph uses the %sql interpreter which enables you to execute a Spark SQL query. This simple query counts the number of rows in the USERS table.
Make sure that the url field in the Load Data from Oracle Database Table paragraph references your host name and port # where your Oracle database is running. Enter your Oracle database username in the empty user field. Enter your Oracle database password in the empty password field.

Description of the illustration
load-oracle-paragraph.png — Description of the illustration load-oracle-paragraph.png

Run the Note

In this section, you run the updated note and review the output.

Click Run all paragraphs on the Note's toolbar to run all paragraphs in this note. A Run all paragraphs? confirmation message is displayed. Click OK. When a paragraph executes successfully, its status changes to FINSIHED.
The output of the Count Number of Students paragraph is displayed in the default table format in the result section at the bottom of the paragraph.

Description of the
number-of-students.png — Description of the illustration number-of-students.png

You can click the SPARK JOB icon in a paragraph to display the Spark job details in your cluster's Spark web UI, only if you configured the firewall to allow access to the Spark web UI.

The Process the Data paragraph performs most of the analysis on the dataset. It joins the HDFS data with the Oracle database table data. The code accesses the two id columns from the edges.csv file, and then uses the USERS table as a look-up table to find the associated names and class categories for those student ids. The code then registers the joined data as a new temporary table named FRIENDS. This new table contains six columns which represent the two students collaborating on a project from the edges.csv file. Finally, the code counts the number total projects, same-class projects, and cross-class projects. These values are used and displayed in the Project Collaboration Statistics paragraph.

Description of the illustration process-data-paragraph.png
The output of the Projects Collaboration Details paragraph is displayed in the default table format in the result section at the bottom of the paragraph. The following image shows the partial output.

Description of the
projects-collaboration-details.png — Description of the illustration projects-collaboration-details.png

The results of the SQL query are returned in an arbitrary order unless you use the order by clause followed by the column(s) name; therefore, your results might not match the result shown in the image.

The output of the Projects Collaboration Statistics paragraph is displayed in the result section at the bottom of the paragraph.

Description of the
projects-collaboration-statistics.png — Description of the illustration projects-collaboration-statistics.png

Clean up (Optional)

If you want to re-run the tutorials in this series again, you will need to perform the following clean up steps.

Remove the Github and the Oracle database registered storage providers.

On the Oracle Big Data Manager page, click the Administration tab to display the Storage providers page.
In the row for the GitHub registered storage provider, click Manage this provider , and then select Remove from the context menu. In the Remove storage dialog box, click Remove.
In the row for the Oracle database registered storage provider, click Manage this provider , and then select Remove from the context menu. In the Remove storage dialog box, click Remove.

Drop the USERS table that you created in your Oracle Database. In a terminal window, connect to your Oracle Database using SQL*Plus, and enter the following command at the SQL> command-line prompt:
```
SQL>drop table users;
```
You can reuse the create-and-populate-users-table.sql script file that you already saved to your machine when you re-run the series.
Delete the HDFS and Oracle Database Data Analysis note. On the Oracle Big Data Manager page, click the Notebook tab to display the Notebook page. In the list of notes, locate the HDFS and Oracle Database Data Analysis note. Click Move note to Trash next to the note. In the Move this note to trash dialog box, click OK.
Delete the edges.csv file from the /user/demo HDFS directory. On the Oracle Big Data Manager page, click the Data tab. In the Data explorer section, select HDFS storage (hdfs) from the Storage drop-down list. Navigate to the /user/demo directory. Right-click the edges.csv file, and then select Delete from the context menu. In the Delete file edges.csv dialog box, click Delete.
You can reuse the edges.csv file that you already saved to your machine when you re-run the series.

Want to Learn More?

Overview of Oracle Big Data Manager

Oracle Big Data Cloud Service

Oracle Database Cloud Service