Use OCI Data Flow with Apache Spark Streaming to process a Kafka topic in a scalable and near real-time application

Introduction

Oracle Cloud Infrastructure (OCI) Data Flow is a managed service for Apache Spark, the open-source processing engine. With Spark you can process massive files, run streaming workloads and perform database operations, and you can build applications with very high processing scalability. Spark can scale out over clustered machines to parallelize jobs with minimal configuration.

Using Spark as a managed service (Data Flow), you can combine it with many other scalable cloud services to multiply the power of your processing. Data Flow can also run Spark Streaming applications.

Streaming applications require continuous execution for a long period of time that often extends beyond 24 hours, and might be as long as weeks or even months. In case of unexpected failures, streaming applications must restart from the point of failure without producing incorrect computational results. Data Flow relies on Spark Structured Streaming checkpointing to record the processed offsets, which can be stored in your Object Storage bucket.
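
To make the checkpointing idea concrete, the following is a minimal Java sketch (not code from this tutorial's application) of a Spark Structured Streaming query that stores its checkpoint in an Object Storage bucket; the bucket name and tenancy namespace in the oci:// URI are placeholders.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQuery;
    import org.apache.spark.sql.streaming.Trigger;

    public class CheckpointSketch {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder()
                    .appName("CheckpointSketch")
                    .getOrCreate();

            // Placeholder source: a rate stream that emits rows continuously.
            Dataset<Row> stream = spark.readStream()
                    .format("rate")
                    .option("rowsPerSecond", 10)
                    .load();

            // The checkpoint location records the processed offsets so the job
            // can restart from the point of failure. The bucket (dataflow-logs)
            // and namespace (mytenancynamespace) are hypothetical values.
            StreamingQuery query = stream.writeStream()
                    .format("console")
                    .option("checkpointLocation",
                            "oci://dataflow-logs@mytenancynamespace/checkpoints/demo")
                    .trigger(Trigger.ProcessingTime("10 seconds"))
                    .start();

            query.awaitTermination();
        }
    }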

Note: If you need to process data in batches instead, you can read this article: Process large files in Autonomous Database and Kafka with Oracle Cloud Infrastructure Data Flow

dataflow-use-case.png

In this tutorial, you can see the most common activities used to process a streaming data volume: querying a database and merging/joining the data to form another table in memory, then sending the data to a destination in near real-time. You can write this massive data into your database and into a Kafka queue with very low cost and highly effective performance.

Objectives

Prerequisites

Task 1: Create the Object Storage structure

Object Storage will be used as the default file repository. You can use other types of file repositories, but Object Storage is a simple and low-cost way to manipulate files with good performance. In this tutorial, both applications load a large CSV file from Object Storage, showing how fast and smart Apache Spark is at processing a high volume of data.

  1. Create a compartment: Compartments are important for organizing and isolating your cloud resources. You can isolate your resources with IAM Policies.

    • You can use this link to understand and set up the policies for compartments: Managing Compartments

    • Create one compartment named analytics to host all the resources of the 2 applications in this tutorial.

    • Go to the Oracle Cloud main menu and search for: Identity & Security, Compartments. In the Compartments section, click Create Compartment and enter the name. create-compartment.png

      Note: You need to give access to a group of users and include your user.

    • Click Create Compartment to create your compartment.

  2. Create your buckets in Object Storage: Buckets are logical containers for storing objects, so all files used for this demo will be stored in these buckets.

    • Go to the Oracle Cloud main menu and search for Storage and Buckets. In the Buckets section, select your compartment (analytics), created previously.

      select-compartment.png

    • Click Create Bucket. Create 4 buckets: apps, data, dataflow-logs, Wallet

      create-bucket.png

    • Enter the Bucket Name for each of these 4 buckets and keep the other parameters at their default values.

    • For each bucket, click Create. You can see the buckets you created.

      buckets-dataflow.png

Note: Review the IAM Policies for the buckets. You must set up the policies if you want to use these buckets in your demo applications. You can review the concepts and setup here: Overview of Object Storage and IAM Policies.

Task 2: Create the Autonomous Database

Oracle Cloud Autonomous Database is a managed service for the Oracle Database. For this tutorial, the applications will connect to the database through a Wallet for security reasons.

Note: Review IAM Policies for accessing the Autonomous Database here: IAM Policy for Autonomous Database

Task 3: Upload the CSV Sample Files

To demonstrate the power of Apache Spark, the applications will read a CSV file with 1,000,000 lines. This data will be inserted into the Autonomous Data Warehouse database with just one command line and published to a Kafka stream (Oracle Cloud Streaming). All these resources are scalable and perfect for high data volume.

You can see your new table named GDPPERCAPTA imported successfully.

adw-table-imported.png
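
For illustration only, a "one command line" insert of such a CSV into the database from Spark could look like the minimal Java sketch below; the file name, bucket, wallet directory, credentials and save mode are assumptions, and the Oracle JDBC driver and wallet must be available on the classpath.

    import java.util.Properties;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class CsvToAdwSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("CsvToAdwSketch")
                    .getOrCreate();

            // Read the large CSV straight from the data bucket.
            // File name, bucket and namespace are placeholders.
            Dataset<Row> csv = spark.read()
                    .option("header", "true")
                    .option("inferSchema", "true")
                    .csv("oci://data@mytenancynamespace/sample.csv");

            // Hypothetical ADW connection through the wallet; replace the TNS
            // alias, wallet directory and password with your own values.
            Properties props = new Properties();
            props.setProperty("user", "ADMIN");
            props.setProperty("password", "<password-from-vault>");

            // The "one command line": write the whole DataFrame as a table.
            // GDPPERCAPTA is the table name shown above; adjust as needed.
            csv.write()
               .mode(SaveMode.Overwrite)
               .jdbc("jdbc:oracle:thin:@adwdemo_high?TNS_ADMIN=/opt/wallet",
                     "GDPPERCAPTA", props);
        }
    }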

Task 4: Create a Secret Vault for your ADW ADMIN password

For security reasons, the ADW ADMIN password will be saved in a Vault. Oracle Cloud Vault hosts this password securely, and your application can access it with OCI authentication.
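
As an illustration, reading that secret with the OCI Java SDK might look like the minimal sketch below; the secret OCID, config file path and profile are placeholders, and the actual calls in the demo application may differ.

    import java.util.Base64;
    import com.oracle.bmc.auth.ConfigFileAuthenticationDetailsProvider;
    import com.oracle.bmc.secrets.SecretsClient;
    import com.oracle.bmc.secrets.model.Base64SecretBundleContentDetails;
    import com.oracle.bmc.secrets.requests.GetSecretBundleRequest;
    import com.oracle.bmc.secrets.responses.GetSecretBundleResponse;

    public class ReadSecretSketch {
        public static void main(String[] args) throws Exception {
            // Local OCI CLI configuration; on Data Flow a resource principal
            // provider would be used instead (see Task 8).
            ConfigFileAuthenticationDetailsProvider provider =
                    new ConfigFileAuthenticationDetailsProvider("~/.oci/config", "DEFAULT");

            // Placeholder for your PASSWORD_SECRET_OCID value.
            String secretOcid = "ocid1.vaultsecret.oc1..example";

            try (SecretsClient secrets = SecretsClient.builder().build(provider)) {
                GetSecretBundleResponse response = secrets.getSecretBundle(
                        GetSecretBundleRequest.builder().secretId(secretOcid).build());

                // The secret content comes back base64-encoded.
                Base64SecretBundleContentDetails content =
                        (Base64SecretBundleContentDetails)
                                response.getSecretBundle().getSecretBundleContent();
                String adminPassword =
                        new String(Base64.getDecoder().decode(content.getContent()));

                System.out.println("Secret length: " + adminPassword.length());
            }
        }
    }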

Note: Review the IAM Policy for OCI Vault here: OCI Vault IAM Policy.

Task 5: Create a Kafka Streaming (Oracle Cloud Streaming)

Oracle Cloud Streaming is a Kafka-compatible managed streaming service. You can develop applications using the Kafka APIs and common SDKs. In this tutorial, you will create an instance of Streaming and configure it so that both applications can publish and consume a high volume of data.

  1. From the Oracle Cloud main menu, go to Analytics & AI, Streams.

  2. Change the compartment to analytics. Every resource in this demo will be created in this compartment, which makes IAM control more secure and easier to manage.

  3. Click Create Stream.

    create-stream.png

  4. Enter the name as kafka_like (for example) and keep all other parameters at their default values.

    save-create-stream.png

  5. Click Create to initialize the instance.

  6. Wait for the Active status. Now you can use the instance.

    Note: In the streaming creation process, you can select the Auto-Create a default stream pool option to automatically create your default pool.

  7. Click on the DefaultPool link.

    default-pool-option.png

  8. View the connection settings.

    stream-conn-settings.png

    kafka-conn.png

  9. Note down this information, as you will need it in the next step.

Note: Review the IAM Policies for the OCI Streaming here: IAM Policy for OCI Streaming.

Task 6: Generate an AUTH TOKEN to access Kafka

You can access OCI Streaming (Kafka API) and other resources in Oracle Cloud with an Auth Token associated with your user in OCI IAM. In the Kafka Connection Settings, the SASL Connection String has a parameter named password that takes an AUTH_TOKEN value, as described in the previous task. To enable access to OCI Streaming, go to your user in the OCI Console and create an AUTH TOKEN.
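
To show where the token is used, here is a minimal Java sketch of the Kafka client properties for OCI Streaming; every value is a placeholder, and the username format (tenancy/user/stream pool OCID) and the password must come from your own SASL Connection String and AUTH TOKEN.

    import java.util.Properties;

    public class OciStreamingKafkaProps {
        // All values below are placeholders taken from the Kafka Connection
        // Settings page (Task 5) and the AUTH TOKEN generated in this task.
        public static Properties build() {
            String bootstrapServers =
                    "cell-1.streaming.us-ashburn-1.oci.oraclecloud.com:9092";
            String saslUsername =
                    "mytenancy/my.user@example.com/ocid1.streampool.oc1.iad.example";
            String authToken = "<AUTH TOKEN generated in this task>";

            Properties props = new Properties();
            props.put("bootstrap.servers", bootstrapServers);
            props.put("security.protocol", "SASL_SSL");
            props.put("sasl.mechanism", "PLAIN");
            // The AUTH TOKEN is the password parameter of the SASL string.
            props.put("sasl.jaas.config",
                    "org.apache.kafka.common.security.plain.PlainLoginModule required "
                            + "username=\"" + saslUsername + "\" "
                            + "password=\"" + authToken + "\";");
            return props;
        }
    }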

  1. From the Oracle Cloud main menu, go to Identity & Security, Users.

    Note: Remember that the user for which you create the AUTH TOKEN must be the user configured in your OCI CLI and in all the IAM Policy configurations for the resources created until now. The resources are:

    • Oracle Cloud Autonomous Data Warehouse
    • Oracle Cloud Streaming
    • Oracle Object Storage
    • Oracle Data Flow
  2. Click on your username to view the details.

    auth_token_create.png

  3. Click on the Auth Tokens option on the left side of the console and click Generate Token.

    Note: The token is generated only in this step and will not be visible after you complete the step. So, copy the value and save it. If you lose the token value, you must generate the auth token again.

auth_token_1.png auth_token_2.png

Task 7: Set up the Demo application

This tutorial includes a demo application for which we will set up the required information.

  1. Download the application using the following link:

  2. Find the following details in your Oracle Cloud Console:

    • Tenancy Namespace

      tenancy-namespace-1.png

      tenancy-namespace-detail.png

    • Password Secret

      vault-adw.png

      vault-adw-detail.png

      secret-adw.png

    • Streaming Connection Settings

      kafka-conn.png

    • Auth Token

      auth_token_create.png

      auth_token_2.png

  3. Open the downloaded zip files (Java-CSV-DB.zip and JavaConsumeKafka.zip). Go to the /src/main/java/example folder and find the Example.java code.

    code-variables.png

    These are the variables that need to be changed to your tenancy resource values (a placeholder sketch of this variable block follows the table).

    VARIABLE NAME              RESOURCE NAME                    INFORMATION TITLE
    bootstrapServers           Streaming Connection Settings    Bootstrap Servers
    streamPoolId               Streaming Connection Settings    ocid1.streampool.oc1.iad….. value in SASL Connection String
    kafkaUsername              Streaming Connection Settings    value of username inside " " in SASL Connection String
    kafkaPassword              Auth Token                       The value is displayed only in the creation step
    OBJECT_STORAGE_NAMESPACE   TENANCY NAMESPACE                TENANCY
    NAMESPACE                  TENANCY NAMESPACE                TENANCY
    PASSWORD_SECRET_OCID       PASSWORD_SECRET_OCID             OCID
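
    For orientation only, the variable block in Example.java might look roughly like the sketch below; every value is a placeholder to be replaced with your own resource values.

    // Hypothetical sketch; variable names follow the table above.
    public class ExampleConfigSketch {
        static String bootstrapServers =
                "cell-1.streaming.us-ashburn-1.oci.oraclecloud.com:9092";
        static String streamPoolId = "ocid1.streampool.oc1.iad.example";
        static String kafkaUsername = "my.user@example.com";
        static String kafkaPassword = "<AUTH TOKEN from Task 6>";
        static String OBJECT_STORAGE_NAMESPACE = "mytenancynamespace";
        static String NAMESPACE = "mytenancynamespace";
        static String PASSWORD_SECRET_OCID = "ocid1.vaultsecret.oc1..example";
    }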

Note: All the resources created for this demo are in the US-ASHBURN-1 region. Check the region in which you want to work. If you change the region, you need to change 2 points in 2 code files:

Task 8: Understand the Java code

This tutorial was created in Java, and the code can also be ported to Python. To prove the efficiency and scalability, the application was developed to show some possibilities in a common integration use case. The application code shows the following examples:

This demo can be executed in your local machine and deployed into the Data Flow instance to run as a job execution.

Note: For both the Data Flow job and your local machine, use the OCI CLI configuration to access the OCI resources. On the Data Flow side, everything is pre-configured, so there is no need to change the parameters. On your local machine, you should have the OCI CLI installed and the tenancy, user and private key configured to access your OCI resources.
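
One common pattern for letting the same code authenticate in both environments is sketched below; this is an assumption for illustration, not necessarily how Example.java does it.

    import com.oracle.bmc.auth.BasicAuthenticationDetailsProvider;
    import com.oracle.bmc.auth.ConfigFileAuthenticationDetailsProvider;
    import com.oracle.bmc.auth.ResourcePrincipalAuthenticationDetailsProvider;

    public class AuthProviderSketch {
        // On Data Flow the resource principal is pre-configured; on your local
        // machine the OCI CLI configuration file (~/.oci/config) is used.
        public static BasicAuthenticationDetailsProvider pick() throws Exception {
            if (System.getenv("OCI_RESOURCE_PRINCIPAL_VERSION") != null) {
                return ResourcePrincipalAuthenticationDetailsProvider.builder().build();
            }
            return new ConfigFileAuthenticationDetailsProvider("~/.oci/config", "DEFAULT");
        }
    }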

Let's look at the Example.java code in sections:
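
The downloaded source is the reference. As a condensed sketch of the flow it implements (read the Kafka topic as a stream, join it with an ADW table and write the merged result with a checkpoint), the code might look roughly like this; the connection values, topic name, join column (COUNTRY_NAME) and output sink are placeholders or assumptions.

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.from_json;

    import java.util.Properties;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.Trigger;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    public class ExampleFlowSketch {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder()
                    .appName("ExampleFlowSketch").getOrCreate();

            // Schema matching the JSON test message used in Task 10.
            StructType schema = new StructType()
                    .add("Organization Id", DataTypes.StringType)
                    .add("Name", DataTypes.StringType)
                    .add("Country", DataTypes.StringType);

            // 1. Read the OCI Streaming (Kafka) topic as a stream; connection
            //    values are placeholders (see Tasks 5 and 6).
            Dataset<Row> kafka = spark.readStream()
                    .format("kafka")
                    .option("kafka.bootstrap.servers", "<bootstrapServers>")
                    .option("subscribe", "kafka_like")
                    .option("kafka.security.protocol", "SASL_SSL")
                    .option("kafka.sasl.mechanism", "PLAIN")
                    .option("kafka.sasl.jaas.config",
                            "<SASL connection string with the AUTH TOKEN>")
                    .load();

            Dataset<Row> messages = kafka
                    .selectExpr("CAST(value AS STRING) AS value")
                    .select(from_json(col("value"), schema).alias("m"))
                    .select("m.*");

            // 2. Load the ADW table as a static DataFrame and join it with the
            //    stream. COUNTRY_NAME is a hypothetical column name.
            Properties jdbcProps = new Properties();
            jdbcProps.setProperty("user", "ADMIN");
            jdbcProps.setProperty("password", "<password read from the Vault secret>");
            Dataset<Row> gdp = spark.read().jdbc(
                    "jdbc:oracle:thin:@<tns-alias>?TNS_ADMIN=<wallet-dir>",
                    "GDPPERCAPTA", jdbcProps);

            Dataset<Row> merged = messages.join(
                    gdp, messages.col("Country").equalTo(gdp.col("COUNTRY_NAME")), "left");

            // 3. Write the merged result (the console sink stands in for the
            //    real destinations) with a checkpoint in Object Storage.
            merged.writeStream()
                    .format("console")
                    .option("checkpointLocation",
                            "oci://dataflow-logs@<namespace>/checkpoints/example")
                    .trigger(Trigger.ProcessingTime("30 seconds"))
                    .start()
                    .awaitTermination();
        }
    }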

Task 9: Package your application with Maven

Before you execute the job in Spark, it is necessary to package your application with Maven.

  1. Go to the /DataflowSparkStreamDemo folder and execute this command:

    mvn package

  2. You can see Maven starting the packaging.

    maven-package-1a.png

  3. If everything is correct, you can see the success message.

    maven-success-1a.png

Task 10: Verify the execution

  1. Test your application in your local Spark machine by running this command:

    spark-submit --class example.Example target/consumekafka-1.0-SNAPSHOT.jar

  2. Go to your Oracle Cloud Streaming Kafka instance and click Produce Test Message to generate some data to test your real-time application.

    test-kafka-1.png

  3. You can put this JSON message into the Kafka topic.

    {"Organization Id": "1235", "Name": "Teste", "Country": "Luxembourg"}

    test-kafka-2.png

  4. Every time you click Produce, you send one message to the application. You can see something like this in the application's output log:

    • This is the data read from the Kafka topic.

      test-output-1.png

    • This is the data merged with the ADW table.

      test-output-2.png

Task 11: Create and execute a Data Flow job

Now, with both applications running successfully in your local Spark machine, you can deploy them into Oracle Cloud Data Flow in your tenancy.

Note: See the Spark Streaming documentation to configure access to resources like Oracle Object Storage and Oracle Streaming (Kafka): Enable Access to Data Flow

  1. Upload the packages into Object Storage.

    • Before you create a Data Flow application, you need to upload your Java artifact (your ***-SNAPSHOT.jar file) into the Object Storage bucket named apps.
  2. Create a Data Flow Application.

    • Go to the Oracle Cloud main menu, then select Analytics & AI and Data Flow. Be sure to select your analytics compartment before creating a Data Flow Application.

    • Click Create application.

      create-dataflow-app.png

    • Fill in the parameters as follows.

      dataflow-app.png

    • Click Create.

    • After creation, click on the Scale Demo link to view details. To run a job, click RUN.

      Note: Click Show advanced options to enable OCI security for the Spark Stream execution type.

      advanced-options.png

  3. Activate the following options.

    principal-execution.png

  4. Click Run to execute the job.

  5. Confirm the parameters and click Run again.

    dataflow-run-job.png

    • You can view the Status of the job.

      dataflow-run-status.png

    • Wait until the Status changes to Succeeded, and then you can see the results.

      dataflow-run-success.png

Acknowledgments

More Learning Resources

Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.

For product documentation, visit Oracle Help Center.