Integrating Conda Pack with Data Flow

Follow these steps to integrate Conda pack with Data Flow.

Conda is one of the most widely used Python package managers. With conda-pack, PySpark users can ship third-party Python packages as part of a Conda environment. If you use Data Flow with Spark 3.2.1, you can integrate it with Conda Pack.

  1. Generate your environment's conda-pack tar.gz file by installing and using Conda Pack with Python 3.8.13. You must use Python 3.8.13, because it is the version supported with Spark 3.2.1. For more information on supported versions, see the Before you Begin with Data Flow section.
    Note: Use the Conda Linux installer, as the Data Flow Spark image uses oraclelinux:7-slim at runtime.
    For example, the following steps create a sample conda-pack file with Python 3.8 and NumPy:
    1. Start a Docker container from the oraclelinux:7-slim image, or use an Oracle Linux 7 machine.
      docker run -it --entrypoint /bin/bash oraclelinux:7-slim
    2. Download and run the Anaconda Linux installer.
      curl -O https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh
      chmod u+x Anaconda3-2022.05-Linux-x86_64.sh
      ./Anaconda3-2022.05-Linux-x86_64.sh
    3. Create and activate a Python 3.8 environment.
      source ~/.bashrc
      conda create -n mypython3.8 python=3.8
      conda activate mypython3.8
    4. Install NumPy and conda-pack, then pack the environment.
      pip install numpy conda-pack
      conda pack -f -o mypython3.8.tar.gz
    5. Copy the tar.gz file from the Docker container to your local machine.
      docker cp <container_id>:/mypython3.8.tar.gz .
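    (Optional) As a quick sanity check, you can unpack the archive into a scratch directory (conda_env_test here is an arbitrary name) and confirm that NumPy imports with the bundled interpreter:
      mkdir -p conda_env_test
      tar -xzf mypython3.8.tar.gz -C conda_env_test
      conda_env_test/bin/python -c "import numpy; print(numpy.__version__)"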
  2. Upload your local tar.gz file to Object Storage.
    Make a note of the URI of the file. It is similar to oci://<bucket-name>@<namespace-name>/<path>/conda_env.tar.gz
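    For example, with the OCI CLI installed and configured, the upload might look like this (the bucket and object names are placeholders):
      oci os object put --bucket-name <bucket-name> --file mypython3.8.tar.gz --name <path>/conda_env.tar.gz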
  3. In the Applications and Runs that you create or update, set spark.archives to:
    oci://<bucket-name>@<namespace-name>/<path>/conda_env.tar.gz#conda

    where #conda tells Data Flow to set conda as the effective environment name at /opt/spark/work-dir/conda/ and to use the Python version given at /opt/spark/work-dir/conda/bin/python3 for the driver and executor pods.
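    As an illustration, a spark-submit style Run submitted through the OCI CLI could pass the property with --conf (the compartment OCID and application paths are placeholders):
      oci data-flow run submit \
          --compartment-id <compartment-ocid> \
          --execute "--conf spark.archives=oci://<bucket-name>@<namespace-name>/<path>/conda_env.tar.gz#conda oci://<bucket-name>@<namespace-name>/<path>/my_app.py"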

  4. (Optional) Alternatively, you can use your own environment name, but doing so requires setting PYSPARK_PYTHON in your code, as sketched below. For more information, see Using Conda with Python Packaging.
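    For example, a minimal driver sketch, assuming the archive was attached as conda_env.tar.gz#myenv (myenv is a hypothetical custom name, and the unpack path is assumed to follow the /opt/spark/work-dir/<name>/ pattern described in step 3):
      import os
      from pyspark.sql import SparkSession

      # Point PySpark at the interpreter unpacked from the #myenv archive.
      os.environ["PYSPARK_PYTHON"] = "/opt/spark/work-dir/myenv/bin/python3"

      spark = SparkSession.builder.appName("conda-pack-example").getOrCreate()
      # ... your PySpark job using the packaged third-party libraries ...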