Connect Data Flow PySpark apps to Autonomous Database in Oracle Cloud Infrastructure

Introduction

If your PySpark app needs to access Autonomous Database, either Autonomous Data Warehouse or Autonomous Transaction Processing, it must import JDBC drivers. And if your PySpark app needs access to Object Storage, it must import the oci library. Neither of these libraries is pre-installed in Data Flow, so you must make them available to your app at runtime. You do this by creating an object called a dependency archive.

Dependency archives are uploaded to Oracle Cloud Infrastructure Object Storage. You supply your Data Flow Application with a link to the dependency archive in Object Storage. When you run your Application, Data Flow retrieves the archive and makes it available to your PySpark apps. Your PySpark apps import the libraries using the Python import statement exactly as they would in any other Python program.
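
For illustration only, a minimal sketch of a PySpark app that uses both libraries might look like the following. The connect string, table, and credentials are placeholders, and details such as Autonomous Database wallet configuration are omitted:

import oci                               # provided by the dependency archive
from pyspark.sql import SparkSession     # pyspark itself is provided by Data Flow

spark = SparkSession.builder.appName("adb-example").getOrCreate()

# Read a table from Autonomous Database over JDBC.
# The ojdbc8.jar driver is loaded from the java/ folder of the dependency archive.
df = (spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@<connect_string>")
    .option("dbtable", "<schema>.<table>")
    .option("user", "<username>")
    .option("password", "<password>")
    .option("driver", "oracle.jdbc.driver.OracleDriver")
    .load())
df.show()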

Objectives

After you complete this tutorial, you will have a dependency archive in a file called archive.zip. You can use the dependency archive with PySpark apps that access Object Storage and Autonomous Database.

Prerequisites

Before starting this tutorial, download and install Docker. See the Related Links section.

Download the packager tool Docker image

At the command line, enter the following command:

docker pull phx.ocir.io/oracle/dataflow/dependency-packager:latest

Download the JDBC drivers

Download the JDBC drivers from the Oracle JDBC Downloads page. If you’re not sure which database version you have in your tenancy, download the drivers for 19c.

Use the Zipped JDBC driver and companion JARs for version 8, ojdbc8-full.tar.gz. Do not download ojdbc11-full.tar.gz.

Extract the following files and put them in the same directory where you will run the packager tool:

ojdbc8.jar
oraclepki.jar
osdt_cert.jar
osdt_core.jar
ucp.jar

Because these files are in the same directory where you run the packager tool, you don’t need to create a packages.txt file.

Create the configuration file

Create a requirements.txt file. The file contains a list of the libraries that your Data Flow PySpark application depends on. For connecting to Object Storage, the oci library is required. You can include other libraries as needed by your PySpark app.

This example shows how to specify the latest release of the oci library:

oci

If you want to specify a specific release, then the requirements.txt file would be similar to this example:

oci==2.3.0

Notes:

  1. The pyspark and py4j libraries are provided by Data Flow. Do not put these libraries in your requirements.txt file.
  2. You can have blank lines in the requirements.txt file, but the packager warns about an invalid requirement for each blank line.
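
For example, a requirements.txt that pins the oci release and adds another library (pandas is shown only as a hypothetical addition, not something this tutorial requires) might look like this:

oci==2.3.0
pandas==1.1.5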

Run the packager tool

The packager tool downloads the oci library and its dependencies and creates a file called archive.zip. It also scans the current directory for jar files. It should find the five JDBC driver files and ask if you want to include them in archive.zip.

On Windows:

  1. Open the Windows Command Prompt as Administrator.
  2. Change to the directory that has your jar files and the requirements.txt file.
  3. Run the following command:

    docker run --rm -v %CD%:/opt/dataflow -it phx.ocir.io/oracle/dataflow/dependency-packager:latest
    

On Linux or Mac:

  1. Open a terminal.
  2. Change to the directory that has your jar files and the requirements.txt file.
  3. Run the following command:

    docker run --rm -v $(pwd):/opt/dataflow -it phx.ocir.io/oracle/dataflow/dependency-packager:latest
    

When the tool completes, open archive.zip and take a look at its structure. It should be very similar to the following example:

python/lib/python3.6/site-packages/backports
python/lib/python3.6/site-packages/certifi
python/lib/python3.6/site-packages/<...>
java/ojdbc8.jar
java/oraclepki.jar
java/<...>
version.txt
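
If you prefer to inspect the archive from the command line, you can list its contents without extracting it:

unzip -l archive.zip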

Next Steps

To make the dependency archive available to your PySpark apps, you must do two things:

  1. Upload the archive.zip file to a bucket in Object Storage.
  2. Configure your Data Flow Application to link to the archive.zip file in Object Storage.

Refer to the Object Storage documentation and the Data Flow documentation for instructions on how to do that. The archive can be in the same Object Storage bucket as the Python apps that need it.
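
As a rough sketch, assuming the OCI CLI is configured and using placeholder bucket and namespace names, the upload and the link might look like this:

# Upload archive.zip to an Object Storage bucket
oci os object put --bucket-name <your-bucket> --file archive.zip --name archive.zip

# In your Data Flow Application, set the archive URI to a value of this form:
# oci://<your-bucket>@<your-namespace>/archive.zip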

Learn More

Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.

For product documentation, visit Oracle Help Center.
