Getting Started with Oracle Cloud Infrastructure Data Flow

This tutorial introduces you to Oracle Cloud Infrastructure Data Flow, a service that lets you run any Apache Spark Application at any scale with no infrastructure to deploy or manage. If you've used Spark before, you'll get more out of this tutorial, but no prior Spark knowledge is required. All Spark applications and data have been provided for you. This tutorial shows how Data Flow makes running Spark applications easy, repeatable, secure, and simple to share across the enterprise.

In this tutorial you learn:
  1. How to use Java to perform ETL in a Data Flow Application.
  2. How to use SparkSQL in a SQL Application.
  3. How to create and run a Python Application to perform a simple machine learning task.
Data Flow Advantages
Here’s why Data Flow is better than running your own Spark clusters or other Spark services.
  • It's serverless, which means you don’t need experts to provision, patch, upgrade or maintain Spark clusters. That means you focus on your Spark code and nothing else.
  • It has simple operations and tuning. Access to the Spark UI is a click away and is governed by IAM authorization policies. If a user complains that a job is running too slowly, anyone with access to the Run can open the Spark UI and get to the root cause. Accessing the Spark History Server is just as simple for jobs that have already finished.
  • It is great for batch processing. Application output is automatically captured and made available through REST APIs. Do you need to run a four-hour Spark SQL job and load the results into your pipeline management system? In Data Flow, it’s just two REST API calls away (see the sketch after this list).
  • It has consolidated control. Data Flow gives you a consolidated view of all Spark applications, who is running them and how much they consume. Do you want to know which applications are writing the most data and who is running them? Simply sort by the Data Written column. Is a job running for too long? Anyone with the right IAM permissions can see the job and stop it.
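
To make the batch-processing point concrete, here is a minimal sketch of that submit-and-collect pattern using the OCI Python SDK (the oci package). The OCIDs are placeholders, and the exact client and model names should be verified against the current SDK reference; this is an illustration of the pattern, not part of the tutorial exercises.

  import time
  import oci

  # Authenticate using the standard ~/.oci/config profile.
  config = oci.config.from_file()
  client = oci.data_flow.DataFlowClient(config)

  # Call 1: submit a Run of an existing Data Flow Application.
  run = client.create_run(
      oci.data_flow.models.CreateRunDetails(
          compartment_id="ocid1.compartment.oc1..example",          # placeholder
          application_id="ocid1.dataflowapplication.oc1..example",  # placeholder
          display_name="nightly-sql-report",
      )
  ).data

  # Wait for the Run to reach a terminal state.
  while client.get_run(run.id).data.lifecycle_state not in ("SUCCEEDED", "FAILED", "CANCELED"):
      time.sleep(30)

  # Call 2: list the captured logs; stdout (the Run's output) is among them.
  print([log.name for log in client.list_run_logs(run.id).data])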

Before You Begin

To successfully perform this tutorial, you must have Set Up Your Tenancy and be able to Access Data Flow.

Set Up Your Tenancy

Before Data Flow can run, you must grant permissions that allow effective log capture and run management. See the Set Up Administration section of the Data Flow Service Guide, and follow the instructions given there.

Access Data Flow

From the Console, click the hamburger menu to display the list of available services.

Select Data Flow and click Applications.

1. ETL with Java

An exercise to learn how to create a Java application in Data Flow

Overview

The most common first step in data processing applications is to take data from some source and get it into a format that is suitable for reporting and other forms of analytics. In a database, you would load a flat file into the database and create indexes. In Spark, your first step is usually to clean and convert data from a text format into Parquet format. Parquet is an optimized binary format supporting efficient reads, making it ideal for reporting and analytics. In this exercise, you take source data, convert it into Parquet, and then do a few interesting things with it. Your dataset is the Berlin Airbnb Data dataset, downloaded from the Kaggle website under the terms of the Creative Commons CC0 1.0 Universal (CC0 1.0) "Public Domain Dedication" license.

The data is provided in CSV format, and your first step is to convert it to Parquet and store it in object storage for downstream processing. A Spark application, called oow-lab-2019-java-etl-1.0-SNAPSHOT.jar, is provided to make this conversion. Your objective is to create a Data Flow Application that runs this Spark app, and to run it with the correct parameters. Since you’re starting out, this exercise guides you step by step and provides the parameters you need. Later exercises require you to provide the parameters yourself, so you must understand what you’re entering and why.
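
The conversion itself is handled by the provided JAR, so you do not need to write any code for this exercise. For orientation, though, the core of such a job is small. The following is a minimal PySpark sketch of the same CSV-to-Parquet idea; it is an illustration under assumed read options and is not the code inside oow-lab-2019-java-etl-1.0-SNAPSHOT.jar.

  import sys
  from pyspark.sql import SparkSession

  def main():
      # The input CSV path and output Parquet path arrive as command-line
      # arguments, just as the tutorial application receives ${input} and ${output}.
      input_path, output_path = sys.argv[1], sys.argv[2]

      spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

      # Read the raw CSV with a header row, letting Spark infer column types.
      df = (spark.read
            .option("header", "true")
            .option("inferSchema", "true")
            .csv(input_path))

      # Write the data as Parquet for efficient downstream reads.
      df.write.mode("overwrite").parquet(output_path)

      spark.stop()

  if __name__ == "__main__":
      main()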

Create the Java Application

Create a Data Flow Application.

  1. Navigate to the Data Flow service in the Console by expanding the hamburger menu on the top left and scrolling to the bottom.
  2. Highlight Data Flow, then select Applications. Choose a compartment where you want your Data Flow applications to be created. Finally, click Create Application.
  3. Select Java Application and enter a name for your Application, for example, Tutorial Example 1.
  4. Scroll down to Resource Configuration. Leave all these values as their defaults.
  5. Scroll down to Application Configuration. Configure the application as follows:
    1. File URL: the location of the JAR file in object storage. The location for this application is:
      oci://oow_2019_dataflow_lab@bigdatadatasciencelarge/usercontent/oow-lab-2019-java-etl-1.0-SNAPSHOT.jar
    2. Main Class Name: Java applications need a Main Class Name which depends on the application. For this exercise, enter
      convert.Convert
    3. Arguments: The Spark application expects two command line parameters, one for the input and one for the output. In the Arguments field, enter
      ${input} ${output}
      You are prompted for default values, and it’s a good idea to enter them now.
  6. The input and output arguments should be:
    1. Input:
      oci://oow_2019_dataflow_lab@bigdatadatasciencelarge/usercontent/kaggle_berlin_airbnb_listings_summary.csv
    2. Output:
      oci://<yourbucket>@<namespace>/optimized_listings

    Double-check your Application configuration to confirm it looks similar to the following:

    Note

    You must customize the output path to point to your bucket in your tenant.
  7. When done, click Create. When the Application is created, you see it in the Application list.

Congratulations! You've created your first Data Flow Application. Now you can run it.

Run the Data Flow Java Application

Having created a Java application, you can run it.

  1. If you followed the steps precisely, all you need to do is highlight your Application in the list, click the Actions icon, and click Run.
  2. You’re given the chance to customize parameters before running the Application. In this case, you entered the precise values ahead of time, so you can start the Run by clicking Run.
  3. While the Application is running, you can optionally load the Spark UI to monitor progress. From the Actions icon for the run in question, select Spark UI.

  4. You are automatically redirected to the Apache Spark UI, which is useful for debugging and performance tuning.
  5. After a minute or so, your Data Flow Run should show successful completion with a State of Succeeded:

  6. Drill into the Run to see more details, and scroll to the bottom to see a listing of logs.

  7. When you click the spark_application_stdout.log.gz file, you should see the following log output:

  8. You can also navigate to your output object storage bucket to confirm that new files have been created.

    These new files are used by subsequent applications. Ensure you can see them in your bucket before moving on to the next exercises. If you prefer to check from a script, a short sketch follows.
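
Here is a minimal sketch of that check using the OCI Python SDK's Object Storage client. The bucket name is a placeholder for the one in your output path; checking in the Console works just as well.

  import oci

  # Authenticate using the standard ~/.oci/config profile.
  config = oci.config.from_file()
  client = oci.object_storage.ObjectStorageClient(config)

  # Look up your tenancy's Object Storage namespace.
  namespace = client.get_namespace().data

  # List everything written under the optimized_listings prefix.
  listing = client.list_objects(namespace, "<yourbucket>", prefix="optimized_listings")
  for obj in listing.data.objects:
      print(obj.name)  # expect Parquet part files (and a _SUCCESS marker)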

2. SparkSQL Made Simple

In this exercise, you run a SQL script to perform basic profiling of the dataset you generated in 1. ETL with Java. You must have completed that exercise successfully before you can attempt this one.

Overview

As with other Data Flow Applications, SQL files are stored in object storage and may be shared among many SQL users. To facilitate this, Data Flow allows you to parameterize SQL scripts and customize them at run-time. As with other applications you can supply default values for parameters which often serve as valuable clues to people running these scripts.

The SQL script is available for use directly in your Data Flow Application; you do not need to create a copy of it. The script is reproduced here to illustrate a few points.

Reference text of the SparkSQL script:

Important highlights:
  1. The script begins by creating the SQL tables we need. Currently, Data Flow does not have a persistent SQL catalog, so all scripts must begin by defining the tables they require.
  2. The table’s location is set as ${location}. This is a parameter that the user supplies at runtime. It gives Data Flow the flexibility to use one script to process many different locations and to share code among different users. For this lab, we must customize ${location} to point to the output location we used in Exercise 1.
  3. As we will see, the SQL script’s output is captured and made available to us under the Run. A minimal PySpark sketch of the same define-then-query pattern follows this list.
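
You do not need to reproduce the script itself; the File URL in the next section points to the copy in object storage. The sketch below is illustrative only: the column names (neighbourhood, price) are assumptions based on the Kaggle dataset and are not necessarily those used by oow_lab_2019_sparksql_report.sql.

  import sys
  from pyspark.sql import SparkSession

  def main():
      # The parameterized path plays the role of ${location} in the SQL script.
      location = sys.argv[1]

      spark = SparkSession.builder.appName("sparksql-profiling").getOrCreate()

      # There is no persistent catalog, so define the table over the Parquet
      # data produced in Exercise 1 before querying it.
      spark.read.parquet(location).createOrReplaceTempView("listings")

      # A profiling query in the spirit of the report: average price per neighbourhood.
      spark.sql("""
          SELECT neighbourhood, AVG(CAST(price AS DOUBLE)) AS avg_price
          FROM listings
          GROUP BY neighbourhood
          ORDER BY avg_price
      """).show(truncate=False)

      spark.stop()

  if __name__ == "__main__":
      main()
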
Create a SQL Application
  1. In Data Flow, create an Application, select SQL as the type, and accept the default resources.
  2. Under Application Configuration, configure the SQL Application as follows:
    1. File URL: the location of the SQL file in object storage. The location for this application is:
      oci://oow_2019_dataflow_lab@bigdatadatasciencelarge/usercontent/oow_lab_2019_sparksql_report.sql
    2. Arguments: The SQL script expects one parameter, the location of the output from the prior step. Click Add Parameter and enter a parameter named location with the value you used as the output path in Exercise 1, based on the template
      oci://[bucket]@[namespace]/optimized_listings

    When you're done, confirm that your Application configuration looks similar to the following:

  3. Customize the location value to a valid path in your tenancy.
Run a SQL Application
  1. Save your Application and run it from the Applications list.
  2. After your Run is complete, open the Run:
  3. Navigate to the Run logs:
  4. Open spark_application_stdout.log.gz and confirm that your output agrees with the following output.
    Note

    Your rows may be in a different order from the picture, but the values should agree.
  5. Based on your SQL profiling, you can conclude that, in this dataset, Neukolln has the lowest average listing price at $46.57, while Charlottenburg-Wilmersdorf has the highest average at $114.27 (Note: the source dataset has prices in USD rather than EUR.)

This exercise has shown some key aspects of Data Flow: once a SQL Application is in place, anyone can easily run it without worrying about cluster capacity, data access and retention, credential management, or other security considerations. For example, a business analyst can easily use Spark-based reporting with Data Flow.

3. Machine Learning with PySpark

This exercise uses the output from 1. ETL with Java. This time you use PySpark to perform a simple machine learning task over the input data. Your objective is to identify the best bargains among the various Airbnb listings using Spark machine learning algorithms.

Overview

A PySpark application is available for you to use directly in your Data Flow Applications. You do not need to create a copy.

Reference text of the PySpark script is provided here to illustrate a few points:

A few observations from this code:
  1. The Python script expects a command line argument (highlighted in red). When you create the Data Flow Application, you need to create a parameter that the user sets to the input path.
  2. The script uses linear regression to predict a price per listing and finds the best bargains by comparing each listing’s price with the model’s prediction: the listing whose price falls furthest below its predicted price is the best value, per the model.
  3. The model in this script is simplified: it considers only square footage. In a real setting, you would use more predictor variables, such as the neighborhood. A minimal PySpark sketch of this approach follows this list.
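
Here is that minimal sketch: a single-feature linear regression that ranks listings by how far their price falls below the model's prediction. The column names (id, square_feet, price) are assumptions based on the Kaggle dataset and may not match the provided oow_lab_2019_pyspark_ml.py exactly.

  import sys
  from pyspark.sql import SparkSession, functions as F
  from pyspark.ml.feature import VectorAssembler
  from pyspark.ml.regression import LinearRegression

  def main():
      # ${location} from the Application configuration arrives as the only argument.
      location = sys.argv[1]

      spark = SparkSession.builder.appName("listing-bargains").getOrCreate()

      # Load the Parquet output of Exercise 1 and keep only the columns this toy model needs.
      df = (spark.read.parquet(location)
            .select("id",
                    F.col("square_feet").cast("double").alias("square_feet"),
                    F.col("price").cast("double").alias("price"))
            .dropna())

      # Single-feature model: predict price from square footage alone.
      assembled = VectorAssembler(inputCols=["square_feet"], outputCol="features").transform(df)
      model = LinearRegression(featuresCol="features", labelCol="price").fit(assembled)
      predictions = model.transform(assembled)

      # A listing priced far below its prediction is flagged as a bargain.
      (predictions
       .withColumn("value", F.col("price") - F.col("prediction"))
       .orderBy("value")
       .select("id", "price", "prediction", "value")
       .show(10, truncate=False))

      spark.stop()

  if __name__ == "__main__":
      main()
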
Create a PySpark Application
  1. Create an Application, and select the Python type.
  2. In Application Configuration, configure the Application as follows:
    1. File URL: the location of the Python file in object storage. The location for this application is:
      oci://oow_2019_dataflow_lab@bigdatadatasciencelarge/usercontent/oow_lab_2019_pyspark_ml.py
    2. Arguments: The Spark app expects one command line parameter, the location of the data produced in Exercise 1. In the Arguments field, enter
      ${location}
      You are prompted for a default value. Enter the output path you used in Exercise 1, based on the template:
      oci://<bucket>@<namespace>/optimized_listings
  3. Double-check your Application configuration, and confirm it is similar to the following:
  4. Customize the location value to a valid path in your tenancy.
Run a PySpark Application
  1. Run the Application from the Application list.
  2. When the Run completes, open it and navigate to the logs.

  3. Open the spark_application_stdout.log.gz file. Your output should be identical to the following:
  4. From this output, you see that listing ID 690578 is the best bargain, with a predicted price of $313.70 compared to the list price of $35.00 and a listed square footage of 4639 square feet. If it sounds a little too good to be true, the unique ID means you can drill into the data to better understand whether it really is the steal of the century. Again, a business analyst could easily consume the output of this machine learning algorithm to further their analysis.

What's Next

Now you can create and run Java, Python, or SQL applications with Data Flow, and explore the results.

Data Flow handles all details of deployment, teardown, log management, security, and UI access. With Data Flow, you focus on developing your Spark applications without worrying about the infrastructure.