Adding Third-Party Libraries to Data Flow Applications

Your PySpark applications might need custom dependencies in the form of Python wheels or virtual environments. Your Java or Scala applications might need additional JAR files that you can't, or don't want to, bundle in a fat JAR. Or you might want to make native code or other assets available within your Spark runtime.

Data Flow allows you to provide a ZIP archive in addition to your application. The archive is installed on all Spark nodes before the application launches. If you construct it properly, the Python libraries are added to your runtime, and the JAR files are added to the Spark classpath. The libraries added are completely isolated to one Run, which means they do not interfere with other concurrent or subsequent Runs. Only one archive can be provided per Run.

Anything in the archive must be compatible with the Data Flow runtime. For example, Data Flow runs on Oracle Linux using particular versions of Java and Python. Binary code compiled for other operating systems, or JAR files compiled for other Java versions, might cause your Run to crash. Data Flow provides tools to help you build archives with compatible software. However, these archives are ordinary ZIP files, so you are free to create them any way you want. If you use your own tools, you are responsible for ensuring compatibility.

Dependency archives, like your Spark applications, are uploaded to Oracle Cloud Infrastructure Object Storage. Your Data Flow Application definition contains a link to this archive, which can be overridden at runtime. When you run your Application, the archive is downloaded and installed before the Spark job starts. The archive is completely private to the Run. This means, for example, that you can concurrently run two instances of the same Application with different dependencies, without any conflicts. Dependencies do not persist between Runs, so there won't be any problems with conflicting versions for other Spark applications that you might run.
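
For example, you can point a single Run at a different archive when you launch it. The following is a hedged sketch using the OCI CLI; the OCIDs, bucket, and namespace are placeholders, and the exact options should be confirmed with oci data-flow run create --help for your CLI version:

# Launch a Run that overrides the Application's default dependency archive
oci data-flow run create \
    --compartment-id <compartment_ocid> \
    --application-id <application_ocid> \
    --display-name "run-with-alternate-dependencies" \
    --archive-uri oci://<bucket>@<namespace>/alternate-archive.zip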

Build a Dependency Archive Using the Data Flow Dependency Packager

  1. Download and install Docker.
  2. Download the packager tool image:
    docker pull phx.ocir.io/oracle/dataflow/dependency-packager:latest
  3. For Python dependencies, create a requirements.txt file. For example, it might look like:
    numpy==1.18.1
    pandas==1.0.3
    pyarrow==0.14.0
    Note

    Do not include pyspark or py4j. These dependencies are provided by Data Flow, and including them causes your Runs to fail.
    The Data Flow Dependency Packager uses Python's pip tool to install all dependencies. If you have Python wheels that can't be downloaded from public sources, place them in a directory beneath where you build the package. Refer to them in requirements.txt with a prefix of /opt/dataflow/. For example:
    /opt/dataflow/<my-python-wheel.whl>

    where <my-python-wheel.whl> represents the name of your Python wheel. Pip sees it as a local file and installs it normally.

  4. For Java dependencies, create a file called packages.txt. For example, it might look like:
    ml.dmlc:xgboost4j:0.90
    ml.dmlc:xgboost4j-spark:0.90
    https://repo1.maven.org/maven2/com/nimbusds/nimbus-jose-jwt/8.11/nimbus-jose-jwt-8.11.jar

    The Data Flow Dependency Packager uses Apache Maven to download dependency JAR files. If you have JAR files that cannot be downloaded from public sources, place them in a local directory beneath where you build the package. Any JAR files in any subdirectory where you build the package are included in the archive.

  5. Use the Docker container to create the archive.
    • Use this command if using macOS or Linux:
      docker run --rm -v $(pwd):/opt/dataflow --pull always -it phx.ocir.io/oracle/dataflow/dependency-packager:latest -p 3.6
    • If using the Windows command prompt as Administrator, use this command:
      docker run --rm -v %CD%:/opt/dataflow --pull always -it phx.ocir.io/oracle/dataflow/dependency-packager:latest -p 3.6
    • If using Windows PowerShell as Administrator, use this command:
      docker run --rm -v ${PWD}:/opt/dataflow --pull always -it phx.ocir.io/oracle/dataflow/dependency-packager:latest -p 3.6
    These commands create a file called archive.zip.
    Note

    If the user inside the Docker container has no write permission on the mounted directory, files can't be written back to the host machine or VM. The creation of archive.zip fails, and the Dependency Packager prints error logs similar to the following:
    cp: cannot create regular file '/opt/dataflow/./archive.zip': Permission denied
    cp: cannot create regular file '/opt/dataflow/./version.txt': Permission denied
    To be able to create archive.zip in this case:
    1. Create a temporary folder with read and write permission for all users.
    2. Move packages.txt or requirements.txt to the folder.
    3. Run the command to create archive.zip there.
    For example:
    mkdir /tmp/shared
    chmod -R a+rw /tmp/shared
    cp <packages.txt or requirements.txt> /tmp/shared
    cd /tmp/shared
    docker run --rm -v $(pwd):/opt/dataflow --pull always -it phx.ocir.io/oracle/dataflow/dependency-packager:latest
  6. (Optional) Add static content. You might want to include other content in your archive, for example a data file, an ML model file, or an executable that your Spark program calls at runtime. Add these files to archive.zip after you create it in Step 5, as shown in the example after these steps.
    For Java applications:
    1. Unzip archive.zip.
    2. Add JAR files to the java/ directory only. Only JAR files and .DS_Store files are allowed.
    3. Zip the files back into archive.zip.
    4. Upload it to your object store.
    For Python applications:
    1. Unzip archive.zip.
    2. Add your local modules to only these four subdirectories of the python/ directory:
       python/bin
       python/lib
       python/lib32
       python/lib64
      In the example in The Structure of the Dependency Archive, the subdirectory used is python/lib/user/<your_static_data>.
      Note

      These paths under python/bin are forbidden:
      python/bin/python
      python/bin/python3
      python/bin/pip
      python/bin/pip3
    3. Zip the files back into archive.zip.
    4. Upload it to your object store.
    Note

    Only these five directories are allowed for storing your Java and Python dependencies.
    When your Data Flow application runs, the static content is available on any node under the directory where you chose to place it. For example, if you added files under python/lib/ in your archive, they are available in the /opt/dataflow/python/lib/ directory on any node.
  7. Upload archive.zip to your object store.
  8. Add the archive to your Application. See the Create a Java or Scala Data Flow Application or Create a PySpark Data Flow Application sections for more information.
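
As an illustration of steps 6 and 7, the following hedged sketch adds a hypothetical model file to a Python archive and uploads the result with the OCI CLI. The file, bucket, and object names are placeholders:

# Unpack the archive produced in step 5 into the current directory
unzip archive.zip
# Add static content under one of the permitted python/ subdirectories
mkdir -p python/lib/user
cp my_model.onnx python/lib/user/
# Add the new file to the existing archive, keeping python/ at the top level
zip archive.zip python/lib/user/my_model.onnx
# Step 7: upload the archive to Object Storage
oci os object put --bucket-name <my-bucket> --file archive.zip --name archive.zip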

The Structure of the Dependency Archive

Dependency archives are ordinary ZIP files. Advanced users might choose to build archives with their own tools rather than using the Data Flow Dependency Packager. A properly constructed dependency archive has this general outline:

python
python/lib
python/lib/python3.6/<your_library1>
python/lib/python3.6/<your_library2>
python/lib/python3.6/<...>
python/lib/python3.6/<your_libraryN>
python/lib/user
python/lib/user/<your_static_data>
java
java/<your_jar_file1>
java/<...>
java/<your_jar_fileN>
Note

Data Flow extracts the archive under the /opt/dataflow directory.
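
As a hedged illustration only, an archive with this layout can be assembled by hand. The library, data, and JAR names below are placeholders, and any compiled code must match the Oracle Linux, Java, and Python versions that Data Flow uses:

# Recreate the expected directory layout
mkdir -p python/lib/python3.6 python/lib/user java
# Install Python packages directly into the Python library directory
pip install --target python/lib/python3.6 <your_library>
# Add static data and JAR files
cp <your_static_data> python/lib/user/
cp <your_jar_file> java/
# Zip with python/ and java/ at the top level of the archive
zip -r archive.zip python java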

Validate an Archive.zip File Using the Data Flow Dependency Packager

You can use the Data Flow Dependency Packager to validate an archive.zip file locally, before uploading the file to Object Storage.

Navigate to the directory containing the archive.zip file, and run the following command:
docker run --rm -v $(pwd):/opt/dataflow --pull always -it phx.ocir.io/oracle/dataflow/dependency-packager:latest --validate archive.zip

Example Requirements.txt and Packages.txt Files

This example of a requirements.txt file includes the Oracle Cloud Infrastructure SDK for Python version 2.14.3 in a Data Flow Application:
-i https://pypi.org/simple
certifi==2020.4.5.1
cffi==1.14.0
configparser==4.0.2
cryptography==2.8
oci==2.14.3
pycparser==2.20
pyopenssl==19.1.0
python-dateutil==2.8.1
pytz==2020.1
six==1.15.0
This example of a requirements.txt file includes a mix of PyPI sources, web sources, and local sources for Python wheel files:
-i https://pypi.org/simple
blis==0.4.1
catalogue==1.0.0
certifi==2020.4.5.1
chardet==3.0.4
cymem==2.0.3
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz#egg=en-core-web-sm
idna==2.9
importlib-metadata==1.6.0 ; python_version < '3.8'
murmurhash==1.0.2
numpy==1.18.3
plac==1.1.3
preshed==3.0.2
requests==2.23.0
spacy==2.2.4
srsly==1.0.2
thinc==7.4.0
tqdm==4.45.0
urllib3==1.25.9
wasabi==0.6.0
zipp==3.1.0
/opt/dataflow/mywheel-0.1-py3-none-any.whl
To connect to Oracle databases such as Autonomous Data Warehouse (ADW), you need to include the Oracle JDBC JAR files. Download and extract the compatible driver JAR files into a directory below where you build the package. For example, to package the Oracle 18.3 (18c) JDBC driver, ensure all these JARs are present (a sample layout follows the list):
ojdbc8-18.3.jar
oraclepki-18.3.jar
osdt_cert-18.3.jar
osdt_core-18.3.jar
ucp-18.3.jar
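
For example, assuming the driver was downloaded and extracted to a local directory (the source path is a placeholder), copy the JARs into a subdirectory of the build directory and run the packager as in step 5; any JAR files in subdirectories are included automatically:

# Copy the extracted driver JARs into a subdirectory of the build directory
mkdir -p jdbc
cp <path-to-extracted-driver>/*.jar jdbc/
# Run the packager from the build directory; the JARs are included in archive.zip
docker run --rm -v $(pwd):/opt/dataflow --pull always -it phx.ocir.io/oracle/dataflow/dependency-packager:latest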