Note:
- This tutorial requires access to Oracle Cloud. To sign up for a free account, see Get started with Oracle Cloud Infrastructure Free Tier.
- It uses example values for Oracle Cloud Infrastructure credentials, tenancy, and compartments. When completing your lab, substitute these values with ones specific to your cloud environment.
Process large files in Autonomous Database and Kafka with Oracle Cloud Infrastructure Data Flow
Introduction
Oracle Cloud Infrastructure (OCI) Data Flow is a fully managed service for running Apache Spark ™ applications. Data Flow is used for processing large files, streaming, database operations, and you can build a lot of applications with very high scalable processing. Apache Spark can scale and use clustered machines to parallelize jobs with minimum configuration.
Using Apache Spark as a managed service (Data Flow), you can add many scalable services to multiply the power of cloud processing and this tutorial shows you how to use:
- Object Storage: As low-cost and scalable a file repository
- Autonomous Database: As a scalable Database in the cloud
- Streaming: As a high scalable Kafka managed service
In this tutorial, you can see the most common activities used to process large files, querying database and merge/join the data to form another table in memory. You can write this massive data into your database and in a Kafka queue with very low-cost and high performance.
Objectives
- Learn how Data Flow can be used to process a large amount of data
- Learn how to integrate scalable services: File Repository, Database and Queue
Prerequisites
-
An operational Oracle Cloud tenant: You can create a free Oracle Cloud account with US$ 300.00 for a month to try this tutorial. See Create a Free Oracle Cloud Account
-
OCI CLI (Oracle Cloud Command Line Interface) installed on your local machine: This is the link to install the OCI CLI.
-
An Apache Spark application installed in your local machine. Review Develop Oracle Cloud Infrastructure Data Flow Applications Locally, Deploy to The Cloud to understand how to develop locally and in Data Flow.
Note: This is the official page to install: Apache Spark. There are alternative procedures to install Apache Spark for each type of Operational System (Linux/Mac OS/Windows).
-
Spark Submit CLI installed. This is the link to install Spark Submit CLI.
-
Maven installed in your local machine.
-
Knowledge of OCI Concepts:
- Compartments
- IAM Policies
- Tenancy
- OCID of your resources
Task 1: Create the Object Storage structure
The Object Storage will be used as a default file repository. You can use other type of file repositories, but Object Storage is a simple and low-cost way to manipulate files with performance. In this tutorial, both applications will load a large CSV file from the object storage, showing how Apache Spark is fast and smart to process a high volume of data.
-
Create a compartment: Compartments are important to organize and isolate your cloud resources. You can isolate your resources by IAM Policies.
-
You can use this link to understand and setup the policies for compartments: Managing Compartments
-
Create one compartment to host all resources of the 2 applications in this tutorial. Create a compartment named analytics.
-
Go to the Oracle Cloud main menu and search for: Identity & Security, Compartments. In the Compartments section, click Create Compartment and enter the name.
Note: You need to give the access to a group of users and include your user.
-
Click Create Compartment to include your compartment.
-
-
Create your bucket in the Object Storage: Buckets are logical containers for storing objects, so all files used for this demo will be stored in this bucket.
-
Go to the Oracle Cloud main menu and search for Storage and Buckets. In the Buckets section, select your compartment (analytics), created previously.
-
Click Create Bucket. Create 4 buckets: apps, data, dataflow-logs, Wallet
-
Enter the Bucket Name information with these 4 buckets and maintain the other parameters with the default selection.
-
For each bucket, click Create. You can see your buckets created.
-
Note: Review the IAM Policies for the bucket. You must set up the policies if you want to use these buckets in your demo applications. You can review the concepts and setup here Overview of Object Storage and IAM Policies.
Task 2: Create the Autonomous Database
Oracle Cloud Autonomous Database is a managed service for the Oracle Database. For this tutorial, the applications will connect to the database through a Wallet for security reasons.
-
Instantiate the Autonomous Database as described here: Provision Autonomous Database.
-
From the Oracle Cloud main menu, select the Data Warehouse option, select Oracle Database and Autonomous Data Warehouse; select your compartment analytics and follow the tutorial to create the database instance.
-
Name your instance Processed Logs, choose logs as the database name and you don’t need to change any code in the applications.
-
Enter the ADMIN password and download the Wallet zip file.
-
After creating the database, you can setup the ADMIN user password and download the Wallet zip file.
-
Save your Wallet zip file (
Wallet_logs.zip
) and annotate your ADMIN password, you will need to setup the application code. -
Go to Storage, Buckets. Change to analytics compartment and you will see the Wallet bucket. Click on it.
-
To upload your Wallet zip file, just click Upload and attach the Wallet_logs.zip file.
Note: Review IAM Policies for accessing the Autonomous Database here: IAM Policy for Autonomous Database
Task 3: Upload the CSV sample files
To demonstrate the power of Apache Spark, the applications will read a CSV file with 1,000,000 lines. This data will be inserted in the Autonomous Data Warehouse database with just one command line and published on a Kafka streaming (Oracle Cloud Streaming). All these resources are scalable and perfect for high data volume.
-
Download these 2 links and Upload to the data bucket:
-
Note:
- organizations.csv has only 100 lines, just to test the applications on your local machine.
- organizations1M.csv contains 1,000,000 lines and will be used to run on the Data Flow instance.
-
From the Oracle Cloud main menu, go to Storage and Buckets. Click on the data bucket and upload the 2 files from the previous step.
-
Upload an auxiliary table to ADW Database
-
Download this file to upload to the ADW Database: GDP PER CAPTA COUNTRY.csv
-
From the Oracle Cloud main menu, select Oracle Database and Autonomous Data Warehouse.
-
Click on the Processed Logs Instance to view the details.
-
Click Database actions to go to the database utilities.
-
Enter your credentials for the ADMIN user.
-
Click on the SQL option to go to the Query Utilities.
-
Click Data Load.
-
Drop the GDP PER CAPTA COUNTRY.csv file into the console panel and proceed to import the data into a table.
-
You can see your new table named GDPPERCAPTA imported successfully.
Task 4: Create a Secret Vault for your ADW ADMIN password
For security reasons, the ADW ADMIN password will be saved on a Vault. Oracle Cloud Vault can host this password with security and can be accessed on your application with OCI Authentication.
-
Create your secret in a vault as described in the following documentation: Add the database admin password to Vault
-
Create a variable named PASSWORD_SECRET_OCID in your applications and enter the OCID.
Note: Review the IAM Policy for OCI Vault here: OCI Vault IAM Policy.
Task 5: Create a Kafka Streaming with Oracle Cloud Streaming service
Oracle Cloud Streaming is a Kafka like managed streaming service. You can develop applications using the Kafka APIs and common SDKs. In this tutorial, you will create an instance of Streaming and configure it to execute in both applications to publish and consume a high volume of data.
-
From the Oracle Cloud main menu, go to Analytics & AI, Streams.
-
Change the compartment to analytics. Every resource in this demo will be created on this compartment. This is more secure and easy to control IAM.
-
Click Create Stream.
-
Enter the name as kafka_like (for example) and you can maintain all other parameters with the default values.
-
Click Create to initialize the instance.
-
Wait for the Active status. Now you can use the instance.
Note: In the streaming creation process, you can select the Auto-Create a default stream pool option to automatically create your default pool.
-
Click on the DefaultPool link.
-
View the connection setting.
-
Annotate this information as you will need it in the next step.
Note: Review the IAM Policies for the OCI Streaming here: IAM Policy for OCI Streaming.
Task 6: Generate a AUTH TOKEN to access Kafka
You can access OCI Streaming (Kafka API) and other resources in Oracle Cloud with an Auth Token associated to your user on OCI IAM. In Kafka Connection Settings, the SASL Connection Strings has a parameter named password and an AUTH_TOKEN value as described in the previous task. To enable access to OCI Streaming, you need to go to your user on OCI Console and create an AUTH TOKEN.
-
From the Oracle Cloud main menu, go to Identity & Security, Users.
Note: Remember that the user you need to create the AUTH TOKEN is the user configured with your OCI CLI and all the IAM Policies configuration for the resources created until now. The resources are:
- Oracle Cloud Autonomous Data Warehouse
- Oracle Cloud Streaming
- Oracle Object Storage
- Oracle Data Flow
-
Click on your username to view the details.
-
Click on the Auth Tokens option in the left side of the console and click Generate Token.
Note: The token will be generated only in this step and will not be visible after you complete the step. So, copy the value and save it. If you lose the token value, you must generate the auth token again.
Task 7: Set up the Demo applications
This tutorial has 2 demo applications for which we will set up the required information:
-
Java-CSV-DB: This application will read 1,000,000 lines of a csv file (organizations1M.csv) and execute some usual processes in a common scenario for integration with a database (Oracle Cloud Autonomous Data Warehouse) and a Kafka streaming (Oracle Cloud Streaming).
The demo shows how a CSV dataset can be merged with a auxiliary table in database and crossing types of tables generating a third dataset in memory. After the execution, the dataset will be inserted on ADW and published on Kafka streaming.
-
JavaConsumeKafka: This application will repeat some steps of the first application just to take CPU and memory for a high volume of processing. The difference is, the first application publishes to the Kafka streaming, whereas this application reads from the Streaming.
-
Download the applications using the following links:
-
Find the following details in your Oracle Cloud Console:
-
Tenancy Namespace
-
Password Secret
-
Streaming Connection Settings
-
Auth Token
-
-
Open the downloaded zip files (
Java-CSV-DB.zip
andJavaConsumeKafka.zip
). Go to the /src/main/java/example folder and find the Example.java code.-
These are the variables that need to be changed with your tenancy resources values.
VARIABLE NAME RESOURCE NAME INFORMATION TITLE NAMESPACE TENANCY NAMESPACE TENANCY OBJECT_STORAGE_NAMESPACE TENANCY NAMESPACE TENANCY PASSWORD_SECRET_OCID PASSWORD_SECRET_OCID OCID streamPoolId Streaming Connection Settings ocid1.streampool.oc1.iad….. value in SASL Connection String kafkaUsername Streaming Connection Settings value of usename inside “ “ in SASL Connection String kafkaPassword Auth Token The value is displayed only in the creation step
-
Note: All the resources created for this tutorial are in the US-ASHBURN-1 region. Check in what region you want to work. If you change the region, you need to change the following details in the 2 code files:
Example.java: Change the bootstrapServers variable, replacing the “us-ashburn-1” with your new region.
OboTokenClientConfigurator.java: Change the CANONICAL_REGION_NAME variable with your new region.
Task 8: Understand the Java code
This tutorial was created in Java and this code can be ported to Python also. The tutorial is divided in 2 parts:
-
Application 1 to publish to Kafka Streaming
-
Application 2 to consume from Kafka Streaming
To prove the efficiency and scalability, both applications were developed to show some possibilities in a common use case of an integration process. So the code for both the applications show the following examples:
-
Read a CSV file with 1,000,000 lines
-
Prepare the ADW Wallet to connect through a JDBC Connection
-
Insert 1,000,000 lines of CSV data into the ADW database
-
Execute a SQL sentence to query an ADW table
-
Execute a SQL sentence to JOIN a CSV dataset with an ADW dataset table
-
Perform a loop of the CSV dataset to demonstrate an iteration with the data
-
Operate with Kafka Streaming
This demo can be executed in your local machine and can be deployed into the Data Flow instance to run as a job execution.
Note: For both the Data Flow job and your local machine, use the OCI CLI configuration to access the OCI resources. In the Data Flow side, everything is pre-configured, so no need to change the parameters. In your local machine side, install the OCI CLI and configure the tenant, user and private key to access your OCI resources.
Let’s show the Example.java
code in sections:
-
Apache Spark initialization: This part of the code represents Spark initialization. Most configurations to perform the execution processes are configured automatically, so it’s very easy to work with the Spark engine.
-
Read a large file in many formats: The Apache Spark engine and SDK permit a fast load and write file formats. A high volume can be manipulated in seconds and even milliseconds. So you can MERGE, FILTER, JOIN datasets in memory and manipulate different data sources.
-
Read the ADW Vault Secret: This part of the code accesses your vault to obtain the secret for your ADW instance.
-
Read the
Wallet.zip
file to connect through JDBC: This section shows how to load theWallet.zip
file from Object Storage and configure the JDBC driver. -
Insert 1,000,000 lines of CSV dataset into ADW Database: From the CSV dataset, it is possible to batch insert into the ADW Database directly. Apache Spark can optimize the execution using all the power of machines clustered, CPUs and memory to obtain the best performance.
-
Data Transformation: Imagine loading many CSVs files, querying some tables in the database in datasets, JOIN, filter, eliminate columns, calculate and many other operations in a few code lines, in a fraction of time and perform a write operation in any format. In this example, a new dataset named oracleDF2 was created from a CSV dataset and an ADW Database dataset.
-
Iterate with a dataset in a loop: This is an example of a loop iteration over the CSV dataset (1,000,000 lines). The row object contains the mapping of the CSV fields structure. So you can obtain the data of each line and can execute API calls and many other operations.
-
Kafka Operations: This is the preparation for connecting to OCI Streaming using the Kafka API.
Note: Oracle Cloud Streaming is compatible with most Kafka APIs.
-
After configuring the connection parameters, the code shows how to produce and consume the streaming.
Task 9: Package your application with Maven
Before executing the job in Apache Spark, it is necessary to package your application with Maven. Maven is one of the most known utilities to package applications with libraries and plugins.
Note:
You can execute a fast test changing the CSV file with another with only 100 lines. To do this, just locate the following code in the Example.java file: private static String INPUT_PATH = “oci://data@” + OBJECT_STORAGE_NAMESPACE + “/organizations1M.csv”;
Replace
organizations1M.csv
withorganizations.csv
and the execution will be significantly faster.
-
Java-CSV-DB Package
-
Go to /Java-CSV-DB folder and execute this command:
mvn package
-
You can see Maven starting the packaging.
-
If everything is correct, you can see the Success message.
-
To test your application in your local Apache Spark machine, execute this command:
spark-submit --class example.Example target/loadadw-1.0-SNAPSHOT.jar
-
-
JavaConsumeKafka Package
-
Go to the /JavaConsumeKafka folder and execute this command:
mvn package
-
You can see Maven starting the packaging.
-
If everything is correct, you can see the Success message.
-
To test your application in yourr local Apache Spark machine, execute this command:
spark-submit --class example.Example target/loadkafka-1.0-SNAPSHOT.jar
-
Task 10: Verify the execution
-
Confirm ADW Insertions
-
Go to the Oracle Cloud main menu, select Oracle Database and Autonomous Data Warehouse.
-
Click on the Processed Logs Instance to view the details.
-
Click Database actions to go to the database utilities.
-
Enter your credentials for the ADMIN user.
-
Click on the SQL option to go to the Query Utilities.
-
Execute a query to see the 1,000,000 of lines in your table.
-
-
Confirm Execution Logs
-
You can see in the execution logs if the job can access and load the datasets.
-
Task 11: Create and execute a Data Flow job
Now, with both applications running successfully in your local Apache Spark machine, you can deploy them into the Oracle Cloud Data Flow in your tenancy.
-
From the Oracle Cloud main menu, go to Analytics & AI and Data Flow.
-
Be sure to select your analytics compartment before create a Data Flow Application.
-
Click Create application.
-
Complete the parameters as shown in the following image:
-
Click Create.
-
After creation, click on the Scale Demo link to view details.
-
Click Run to execute the job.
-
Confirm the parameters and click Run again.
-
View the Status of the job, wait until the Status changes to Succeeded and you can see the results.
Next Steps
The first application publishes data into Kafka Streaming. The second application consumes this data from Kafka.
-
Create another Data Flow Application using the same steps when you created the first Data Flow application.
-
You must change the Name of your application and change the package, from loadadw-1.0-SNAPSHOT.jar to loadkafka-1.0-SNAPSHOT.jar.
-
You can retain the other parameters to be the same as the first Data Flow application and RUN the job.
Related Links
Acknowledgments
- Author - Cristiano Hoshikawa (LAD A-Team Solution Engineer)
More Learning Resources
Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.
For product documentation, visit Oracle Help Center.
Process large files in Autonomous Database and Kafka with Oracle Cloud Infrastructure Data Flow
F79141-01
March 2023
Copyright © 2023, Oracle and/or its affiliates.