Oracle by Example brandingUpload and Analyze Data with Oracle Big Data Manager

section 0Before You Begin

In this 15-minute tutorial, you learn how to use Oracle Big Data Manager to upload a local file to the Hadoop Distributed File System (HDFS) on your cluster. You also learn how to import the file into a new Apache Hive table, create a new Oracle Big Data Manager paragraph from the table, and view the data.

Background

The Oracle Big Data Manager console provides a browser-based GUI to simplify data transfer tasks. You can use the Data explorer feature in Oracle Big Data Manager to upload local files and directories to HDFS.

You can also use Data explorer to view the available Hive databases and tables. You can import local comma-separated-values (csv) files from HDFS to Hive. This provides you with quick access to the data in Apache Hive and it also enables you to query your data.

You can use the Oracle Big Data Manager Notebook feature to load files or Hive tables into a new paragraph in a Notebook where you can view and analyze the data.

What Do You Need?

  • Access to either an instance of Oracle Big Data Cloud Service or to an Oracle Big Data Appliance, and the required login credentials.
  • Access to Oracle Big Data Manager, on either an instance of Oracle Big Data Cloud Service or on an Oracle Big Data Appliance, and the required sign in credentials. A port must be opened to permit access to Oracle Big Data Manager, as described in Enabling Oracle Big Data Manager.
  • Read/Write privileges to the /tmp HDFS directory.
  • Read/Write access to the Hive storage provider.
  • Basic familiarity with HDFS, Spark, database concepts and SQL, and optionally, Apache Zeppelin.

section 1Upload a Local File to HDFS

In this section, you upload the local taxidropoff.csv data file to HDFS using the file upload feature in Data explorer. This .csv file contains many rows and columns which makes it difficult to view, analyze, and query.

  1. Sign in to Oracle Big Data Manager. See Access Oracle Big Data Manager.
  2. Right-click the taxidropoff.csv file, select Save link as from the context menu, and then save it to your local machine. This dataset was created from several datasets on the NYC Taxi & Limousine Commission website.
  3. On the Big Data Manager page, click the Data tab.
  4. In the Data explorer section, select HDFS storage (hdfs) from the Storage drop-down list. Navigate to the /tmp directory, and then click File upload File upload on the toolbar.
    Description of the illustration file-upload.png
    Description of the illustration file-upload.png
  5. In the Files upload dialog box, select Files upload from the Select files drop-down list. In the Open dialog box, navigate to your local directory that contains the taxidropoff.csv file, and then select the file. The taxidropoff.csv file is displayed in the Name column. Click Upload.
    Description of the illustration upload.png
    Description of the illustration upload.png

    Note: To upload an entire local directory to HDFS, select Directory upload from the Select files drop-down list.

  6. Drill-down on the Details section in the Files upload dialog box to display the progress of the file upload operation. When the file is uploaded successfully to HDFS, the Upload has finished message is displayed. Click Close to close the dialog box. The taxidropoff.csv file is now displayed in the /tmp HDFS directory.


section 2Import a File from HDFS into an Apache Hive Table

In this section, you import the taxidropoff.csv file from HDFS into a new Apache Hive table in the default Hive database.

  1. In the Data explorer section, select HDFS storage (hdfs) from the Storage drop-down list, if not already selected. Navigate to the /tmp directory, if needed. Right-click the taxidropoff.csv file. From the context menu, select Import to Apache Hive, and then Import as CSV.
  2. Description of the illustration import-to-hive.png
    Description of the illustration import-to-hive.png

    The Create a new job dialog box is displayed. It has four tabs: Source and destination, Hive import options, Advanced job options, and Environment variables.

  3. In the Source and destination tab, change the Job name to Import taxidropoff.csv to Hive. The taxidropoff.csv file will be imported from HDFS into a new Hive table named taxidropoff_csv in the default Hive database. Accept the default values for all other fields.
  4. Description of the illustration source-and-destination.png
    Description of the illustration source-and-destination.png
  5. Click the Hive import options tab. This tab contains two main sections: Options and File preview. In the Options section, accept the default selections for all the fields.
  6. Description of the illustration options-section.png
    Description of the illustration options-section.png
  7. The File preview section previews the first ten rows from the taxidropoff.csv file and the names and data types that Oracle Big Data Manager assigned to each column in the file. You can modify the Hive table columns' labels and data types as needed. Change the first column label from dropoff_geocode_geo_city to dropoff_city.
  8. Description of the illustration file-preview-section.png
    Description of the illustration file-preview-section.png
  9. Click Create to create and run a new job. This job will create the new taxidropoff_csv Hive table in the default Hive database and loads the taxidropoff.csv data into the new table.
  10. Click the Jobs tab in the Oracle Big Data Manager page. The new Import taxidropoff.csv to Hive job is displayed in the list of available jobs. If the job executes successfully, the Last execution field shows Succeeded. You can click the Import taxidropoff.csv to Hive link to display the job details.
    Description of the illustration 
								job-successful.png
    Description of the illustration job-successful.png

section 3Load an Apache Hive Table into a New Oracle Big Data Manager Notebook

In this section, you load the new taxidropoff_csv Hive table into a new Oracle Big Data Manager Notebook.

  1. In the Data explorer section, select Apache Hive from the Storage drop-down list. In the Name column, double-click the default database to display the tables in this database.
  2. Description of the illustration 
                                select-default-db.png
    Description of the illustration select-default-db.png
  3. Right-click the taxidropoff_csv Hive table. From the context menu, select Notebook, and then Load Apache Hive table.
  4. Description of the illustration 
                                load-hive-table.png
    Description of the illustration load-hive-table.png

    The Load table in Notebook wizard is displayed. It has three pages: Paragraph settings, Note settings, and Overview.

  5. In the Paragraph settings wizard page, enter NYC taxidropoff_csv Dataset in the Paragraph name field. Select spark from the Interpreter drop-down list, and then click Next.
  6. Description of the illustration 
                              paragraph-settings.png
    Description of the illustration paragraph-settings.png
  7. In the Note settings wizard page, accept the default for Create new note, enter NYC taxidropoff_csv Hive Table Note in the New note name field, and then click Next.
  8. Description of the 
                                 illustration create-new-note.png
    Description of the illustration create-new-note.png
  9. In the Overview wizard page, review the settings. If you need to make a correction, click the back arrow Go back. If you are satisfied with the settings, click Finish to display the note.
  10. Description of the illustration overview.png
    Description of the illustration overview.png

    The new Oracle Big Data Manager note and paragraph are displayed.

    Description of the illustration view-note.png
    Description of the illustration view-note.png

    The %spark Spark interpreter creates a SparkContext and provides you with a Scala environment. In this example, the Spark interpreter created a DataFrame which references the shown SQL query. This SQL query returns the number of rows in the nyc_taxidropoff_csv Hive table. The DataFrame is stored in a variable named file0. You can reference this variable anywhere in the Note.

  11. To control the settings of a paragraph, click SettingsGear icon, and then select the appropriate options from the context menu.
  12. Place the mouse cursor at the end of line 2, and then press the Enter key to add a new line. Copy the Scala println command from the print-command.txt file, and then paste it at the end of the paragraph to display the number of rows in the taxidropoff_csv Hive table. Click Run this paragraphRun icon. The output is displayed at the bottom of the paragraph.
  13. Description of the illustration print-output.png
    Description of the illustration print-output.png
  14. In the empty paragraph at the bottom of the note, copy the Hive SQL query from the hive-query.txt file, and then paste it in the paragraph. The %sql enables you to execute a Spark SQL query. The query displays all rows in the taxidropoff_csv Hive table. Click Run this paragraphRun icon. The following image shows the partial output of the taxidropoff_csv Hive table in a table format.
  15. Description of the illustration
                                  select-all-table-view.png
    Description of the illustration select-all-table-view.png
  16. To view the data in a graph format, click any of the available graph icons on the toolbar. Click Bar ChartBar chart icon.
    Description of the illustration bar-chart.png
    Description of the illustration bar-chart.png

more informationWant to Learn More?

Oracle Big Data Cloud Service