Process large files in Autonomous Database and Kafka with Oracle Cloud Infrastructure Data Flow

Introduction

Oracle Cloud Infrastructure (OCI) Data Flow is a fully managed service for running Apache Spark™ applications. You can use Data Flow to process large files, stream data, run database operations, and build applications that need highly scalable processing. Apache Spark can scale out across clustered machines to parallelize jobs with minimal configuration.

Using Apache Spark as a managed service (Data Flow), you can combine it with other scalable Oracle Cloud services to multiply the power of cloud processing. This tutorial shows you how to use:

dataflow-use-case.png

In this tutorial, you can see the most common activities used to process large files: querying a database and merging/joining the data to form another table in memory. You can write this massive data into your database and into a Kafka queue with very low cost and high performance.

Objectives

Prerequisites

Task 1: Create the Object Storage structure

Object Storage will be used as the default file repository. You can use other types of file repositories, but Object Storage is a simple and low-cost way to work with files while keeping good performance. In this tutorial, both applications will load a large CSV file from Object Storage, showing how fast and efficient Apache Spark is at processing a high volume of data.
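For reference, the applications read files from Object Storage through an oci:// path. The following is a minimal sketch (Java with Spark) of such a read, assuming the CSV is stored in the data bucket created below; the namespace and file name are placeholders. On Data Flow the oci:// scheme is resolved natively, while a local run typically also needs the OCI HDFS connector configured.

    // Minimal sketch: read a CSV from OCI Object Storage with Spark (Java).
    // The bucket (data), namespace, and file name are placeholders; adjust them to your tenancy.
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ReadCsvSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("read-csv-from-object-storage")
                    .getOrCreate();

            // Data Flow resolves oci:// paths natively; locally, the OCI HDFS connector must be configured.
            Dataset<Row> csv = spark.read()
                    .option("header", "true")
                    .option("inferSchema", "true")
                    .csv("oci://data@<tenancy-namespace>/<your-large-file>.csv");

            System.out.println("Rows loaded: " + csv.count());
            spark.stop();
        }
    }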

  1. Create a compartment: Compartments are important to organize and isolate your cloud resources. You can isolate your resources with IAM policies.

    • You can use this link to understand and set up the policies for compartments: Managing Compartments

    • Create one compartment named analytics to host all the resources of the 2 applications in this tutorial.

    • Go to the Oracle Cloud main menu and search for: Identity & Security, Compartments. In the Compartments section, click Create Compartment and enter the name.

      create-compartment.png

      Note: You need to give access to a group of users and include your user.

    • Click Create Compartment to include your compartment.

  2. Create your bucket in the Object Storage: Buckets are logical containers for storing objects, so all files used for this demo will be stored in this bucket.

    • Go to the Oracle Cloud main menu and search for Storage and Buckets. In the Buckets section, select your compartment (analytics), created previously.

      select-compartment.png

    • Click Create Bucket. Create 4 buckets: apps, data, dataflow-logs, Wallet

      create-bucket.png

    • Enter the Bucket Name for each of the 4 buckets and keep the other parameters at their default values.

    • For each bucket, click Create. You can see your buckets created.

      buckets-dataflow.png

Note: Review the IAM policies for the buckets. You must set up the policies if you want to use these buckets in your demo applications. You can review the concepts and setup here: Overview of Object Storage and IAM Policies.

Task 2: Create the Autonomous Database

Oracle Cloud Autonomous Database is a managed service for the Oracle Database. For this tutorial, the applications will connect to the database through a Wallet for security reasons.

Note: Review IAM Policies for accessing the Autonomous Database here: IAM Policy for Autonomous Database

Task 3: Upload the CSV sample files

To demonstrate the power of Apache Spark, the applications will read a CSV file with 1,000,000 lines. This data will be inserted into the Autonomous Data Warehouse database with just one command and published to a Kafka stream (Oracle Cloud Streaming). All these resources are scalable and perfect for high data volume.

You can see your new table named GDPPERCAPTA imported successfully.

adw-table-imported.png

Task 4: Create a Secret Vault for your ADW ADMIN password

For security reasons, the ADW ADMIN password will be saved in a Vault. Oracle Cloud Vault can host this password securely, and your application can access it with OCI authentication.

Note: Review the IAM Policy for OCI Vault here: OCI Vault IAM Policy.
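For reference, here is a minimal sketch of how an application can read that password back from the Vault, assuming a recent OCI Java SDK and config-file authentication (~/.oci/config); the secret OCID is a placeholder.

    // Minimal sketch: read a secret value from OCI Vault with the OCI Java SDK.
    // Assumes config-file authentication; the secret OCID below is a placeholder.
    import java.util.Base64;

    import com.oracle.bmc.ConfigFileReader;
    import com.oracle.bmc.auth.ConfigFileAuthenticationDetailsProvider;
    import com.oracle.bmc.secrets.SecretsClient;
    import com.oracle.bmc.secrets.model.Base64SecretBundleContentDetails;
    import com.oracle.bmc.secrets.requests.GetSecretBundleRequest;
    import com.oracle.bmc.secrets.responses.GetSecretBundleResponse;

    public class ReadVaultSecretSketch {
        public static void main(String[] args) throws Exception {
            ConfigFileAuthenticationDetailsProvider provider =
                    new ConfigFileAuthenticationDetailsProvider(ConfigFileReader.parseDefault());

            SecretsClient secrets = SecretsClient.builder().build(provider);

            GetSecretBundleResponse response = secrets.getSecretBundle(
                    GetSecretBundleRequest.builder()
                            .secretId("ocid1.vaultsecret.oc1..<your-secret-ocid>")
                            .build());

            // The secret content comes back base64-encoded.
            Base64SecretBundleContentDetails content =
                    (Base64SecretBundleContentDetails) response.getSecretBundle().getSecretBundleContent();
            String adminPassword = new String(Base64.getDecoder().decode(content.getContent()));

            System.out.println("Password length: " + adminPassword.length());
            secrets.close();
        }
    }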

Task 5: Create a Kafka Streaming with Oracle Cloud Streaming service

Oracle Cloud Streaming is a Kafka-like managed streaming service. You can develop applications using the Kafka APIs and common SDKs. In this tutorial, you will create a Streaming instance and configure it so that both applications can publish and consume a high volume of data.

  1. From the Oracle Cloud main menu, go to Analytics & AI, Streams.

  2. Change the compartment to analytics. Every resource in this demo will be created in this compartment. This is more secure and makes IAM easier to control.

  3. Click Create Stream.

    create-stream.png

  4. Enter the name as kafka_like (for example) and keep all other parameters at their default values.

    save-create-stream.png

  5. Click Create to initialize the instance.

  6. Wait for the Active status. Now you can use the instance.

    Note: In the streaming creation process, you can select the Auto-Create a default stream pool option to automatically create your default pool.

  7. Click on the DefaultPool link.

    default-pool-option.png

  8. View the connection settings.

    stream-conn-settings.png

    kafka-conn.png

  9. Take note of this information as you will need it in the next steps.

Note: Review the IAM Policies for the OCI Streaming here: IAM Policy for OCI Streaming.

Task 6: Generate an AUTH TOKEN to access Kafka

You can access OCI Streaming (Kafka API) and other resources in Oracle Cloud with an Auth Token associated with your user in OCI IAM. In the Kafka Connection Settings, the SASL Connection String has a parameter named password that expects an AUTH_TOKEN value, as described in the previous task. To enable access to OCI Streaming, go to your user on the OCI Console and create an AUTH TOKEN.
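For reference, once you have the stream pool OCID (from Task 5) and the AUTH TOKEN created in this task, the Kafka client settings for OCI Streaming typically look like the following sketch; every value is a placeholder that you copy from your own Kafka Connection Settings.

    // Minimal sketch: Kafka producer properties for OCI Streaming (Kafka compatibility).
    // Every value below is a placeholder; copy the real ones from your Kafka Connection Settings.
    import java.util.Properties;

    public class OciStreamingKafkaConfigSketch {
        public static Properties kafkaProperties() {
            Properties props = new Properties();
            // Bootstrap server shown in the stream pool's Kafka Connection Settings.
            props.put("bootstrap.servers", "cell-1.streaming.us-ashburn-1.oci.oraclecloud.com:9092");
            props.put("security.protocol", "SASL_SSL");
            props.put("sasl.mechanism", "PLAIN");
            // username = "<tenancy name>/<user name>/<stream pool OCID>", password = your AUTH TOKEN.
            props.put("sasl.jaas.config",
                    "org.apache.kafka.common.security.plain.PlainLoginModule required "
                    + "username=\"<tenancy>/<user>/ocid1.streampool.oc1.iad...\" "
                    + "password=\"<AUTH_TOKEN>\";");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            return props;
        }
    }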

  1. From the Oracle Cloud main menu, go to Identity & Security, Users.

    Note: Remember that the user that creates the AUTH TOKEN must be the user configured in your OCI CLI and in all the IAM policy configurations for the resources created so far. The resources are:

    • Oracle Cloud Autonomous Data Warehouse
    • Oracle Cloud Streaming
    • Oracle Object Storage
    • Oracle Data Flow
  2. Click on your username to view the details.

    auth_token_create.png

  3. Click the Auth Tokens option on the left side of the console and click Generate Token.

    Note: The token will be generated only in this step and will not be visible after you complete the step. So, copy the value and save it. If you lose the token value, you must generate the auth token again.

    auth_token_1.png

    auth_token_2.png

Task 7: Set up the Demo applications

This tutorial has 2 demo applications for which we will set up the required information:

  1. Download the applications using the following links:

  2. Find the following details in your Oracle Cloud Console:

    • Tenancy Namespace

      tenancy-namespace-1.png

      tenancy-namespace-detail.png

    • Password Secret

      vault-adw.png

      vault-adw-detail.png

      secret-adw.png

    • Streaming Connection Settings

      kafka-conn.png

    • Auth Token

      auth_token_create.png

      auth_token_2.png

  3. Open the downloaded zip files (Java-CSV-DB.zip and JavaConsumeKafka.zip). Go to the /src/main/java/example folder and find the Example.java code.

    code-variables.png

    • These are the variables that need to be changed with your tenancy resource values (see the illustrative sketch after the table).

      VARIABLE NAME | RESOURCE NAME | INFORMATION TITLE
      NAMESPACE | TENANCY NAMESPACE | TENANCY
      OBJECT_STORAGE_NAMESPACE | TENANCY NAMESPACE | TENANCY
      PASSWORD_SECRET_OCID | PASSWORD_SECRET_OCID | OCID
      streamPoolId | Streaming Connection Settings | ocid1.streampool.oc1.iad….. value in SASL Connection String
      kafkaUsername | Streaming Connection Settings | value of username inside " " in SASL Connection String
      kafkaPassword | Auth Token | The value is displayed only in the creation step
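
      As an illustration only (the declarations in the actual Example.java may be named or typed slightly differently), the variables from the table map to placeholders similar to:

        // Illustration only: the variables from the table above, with placeholder values.
        String NAMESPACE = "<tenancy namespace>";
        String OBJECT_STORAGE_NAMESPACE = "<tenancy namespace>";
        String PASSWORD_SECRET_OCID = "ocid1.vaultsecret.oc1..<secret OCID from Task 4>";
        String streamPoolId = "ocid1.streampool.oc1.iad.<value from the SASL Connection String>";
        String kafkaUsername = "<username inside the quotes in the SASL Connection String>";
        String kafkaPassword = "<AUTH TOKEN generated in Task 6>";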

Note: All the resources created for this tutorial are in the US-ASHBURN-1 region. Check which region you want to work in. If you change the region, you need to change the following details in the 2 code files:

Task 8: Understand the Java code

This tutorial was created in Java, and the code can also be ported to Python. The tutorial is divided into 2 parts:

To prove the efficiency and scalability, both applications were developed to show some possibilities in a common use case of an integration process. So the code for both applications shows the following examples:

This demo can be executed on your local machine and can be deployed to the Data Flow instance to run as a job execution.

Note: For both the Data Flow job and your local machine, use the OCI CLI configuration to access the OCI resources. On the Data Flow side, everything is pre-configured, so there is no need to change the parameters. On your local machine side, install the OCI CLI and configure the tenant, user, and private key to access your OCI resources.

Let’s show the Example.java code in sections:
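The original sections are not reproduced here, but the following condensed sketch illustrates the overall flow they walk through: read the large CSV from Object Storage, query a table from the Autonomous Database, merge/join the datasets in memory, write the result back to the database, and publish it to Kafka. File, table, topic, and join-key names are placeholders, and the JDBC/wallet details are assumptions, not the original code.

    // Condensed sketch of the processing flow (not the original Example.java).
    // All names marked with <...> are placeholders; JDBC through the ADW wallet is assumed.
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class ExampleFlowSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("processing-large-files").getOrCreate();

            // 1. Read the large CSV from Object Storage.
            Dataset<Row> csv = spark.read().option("header", "true")
                    .csv("oci://data@<tenancy-namespace>/<large-file>.csv");

            // 2. Query an existing table from ADW over JDBC, using the unzipped wallet files.
            String jdbcUrl = "jdbc:oracle:thin:@<tns-alias>?TNS_ADMIN=/path/to/wallet";
            Dataset<Row> reference = spark.read().format("jdbc")
                    .option("url", jdbcUrl)
                    .option("dbtable", "ADMIN.<REFERENCE_TABLE>")
                    .option("user", "ADMIN")
                    .option("password", "<password read from the Vault>")
                    .load();

            // 3. Merge/join the two datasets in memory (the join key is a placeholder).
            Dataset<Row> merged = csv.join(reference, "<JOIN_COLUMN>");

            // 4. Write the merged data back to ADW with a single command.
            merged.write().format("jdbc")
                    .option("url", jdbcUrl)
                    .option("dbtable", "ADMIN.<TARGET_TABLE>")
                    .option("user", "ADMIN")
                    .option("password", "<password read from the Vault>")
                    .mode(SaveMode.Overwrite)
                    .save();

            // 5. Publish the merged rows to Kafka (OCI Streaming) as JSON messages.
            //    In addition to the options below, the SASL properties shown in Task 6 are required
            //    (kafka.security.protocol, kafka.sasl.mechanism, kafka.sasl.jaas.config).
            merged.selectExpr("to_json(struct(*)) AS value")
                    .write().format("kafka")
                    .option("kafka.bootstrap.servers", "cell-1.streaming.us-ashburn-1.oci.oraclecloud.com:9092")
                    .option("topic", "kafka_like")
                    .save();

            spark.stop();
        }
    }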

Task 9: Package your application with Maven

Before executing the job in Apache Spark, it is necessary to package your application with Maven. Maven is one of the best-known utilities for packaging applications together with their libraries and plugins.

  1. Java-CSV-DB Package

    1. Go to /Java-CSV-DB folder and execute this command:

      mvn package

    2. You can see Maven starting the packaging.

      maven-package-1a.png

    3. If everything is correct, you can see the Success message.

      maven-success-1a.png

    4. To test your application in your local Apache Spark machine, execute this command:

      spark-submit --class example.Example target/loadadw-1.0-SNAPSHOT.jar

  2. JavaConsumeKafka Package

    1. Go to the /JavaConsumeKafka folder and execute this command:

      mvn package

    2. You can see Maven starting the packaging.

      maven-package-2a.png

    3. If everything is correct, you can see the Success message.

      maven-success-2a.png

    4. To test your application in your local Apache Spark machine, execute this command:

      spark-submit --class example.Example target/loadkafka-1.0-SNAPSHOT.jar

Task 10: Verify the execution

  1. Confirm ADW Insertions

    1. Go to the Oracle Cloud main menu, select Oracle Database and Autonomous Data Warehouse.

    2. Click on the Processed Logs Instance to view the details.

    3. Click Database actions to go to the database utilities.

      adw-actions.png

    4. Enter your credentials for the ADMIN user.

      adw-login.png

    5. Click on the SQL option to go to the Query Utilities.

      adw-select-sql.png

    6. Execute a query to see the 1,000,000 lines in your table.

      ADW-query-organizations.png

  2. Confirm Execution Logs

    • In the execution logs, you can check whether the job was able to access and load the datasets.

      spark-csv-results.png

Task 11: Create and execute a Data Flow job

Now, with both applications running successfully on your local Apache Spark machine, you can deploy them to Oracle Cloud Data Flow in your tenancy.

  1. From the Oracle Cloud main menu, go to Analytics & AI and Data Flow.

  2. Be sure to select your analytics compartment before creating a Data Flow Application.

  3. Click Create application.

    create-dataflow-app.png

  4. Complete the parameters as shown in the following image:

    dataflow-app.png

  5. Click Create.

  6. After creation, click on the Scale Demo link to view details.

  7. Click Run to execute the job.

  8. Confirm the parameters and click Run again.

    dataflow-run-job.png

  9. View the Status of the job and wait until the Status changes to Succeeded. You can then see the results.

    dataflow-run-status.png

    dataflow-run-success.png

Next Steps

The first application publishes data into Kafka Streaming. The second application consumes this data from Kafka.

Acknowledgments

More Learning Resources

Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.

For product documentation, visit Oracle Help Center.