Adding Third-Party Libraries to Data Flow Applications
Your PySpark applications might need custom dependencies in the form of Python wheels or virtual environments. Your Java or Scala applications might need additional JAR files that you can't, or don't want to, bundle in a fat JAR. Or you might want to make native code or other assets available within your Spark runtime.
Data Flow allows you to provide a ZIP archive in addition to your application. The archive is installed on all Spark nodes before the application launches. If you construct it properly, the Python libraries are added to your runtime and the JAR files are added to the Spark classpath. The libraries are completely isolated to one Run, so they do not interfere with other concurrent Runs or subsequent Runs. Only one archive can be provided per Run.
Anything in the archive must be compatible with the Data Flow runtime. For example, Data Flow runs on Oracle Linux using particular versions of Java and Python. Binary code compiled for other operating systems, or JAR files compiled for other Java versions, might cause your Run to crash. Data Flow provides tools to help you build archives with compatible software. However, these archives are ordinary ZIP files, so you are free to create them any way you want. If you use your own tools, you are responsible for ensuring compatibility.
Dependency archives, like your Spark applications, are uploaded to Oracle Cloud Infrastructure Object Storage. Your Data Flow Application definition contains a link to this archive, which can be overridden at runtime. When you run your Application, the archive is downloaded and installed before the Spark job runs. The archive is completely private to the Run. This means, for example, that you can concurrently run two instances of the same Application with different dependencies, without any conflicts. Dependencies do not persist between Runs, so there won't be any conflicting versions for other Spark applications that you might run.
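As a quick illustration of what a properly built archive provides, here is a minimal PySpark sketch, assuming the archive bundles numpy via requirements.txt. The import inside the function runs on the executors, and succeeds only because Data Flow installed the archive on every node:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dependency-check").getOrCreate()

def uses_numpy(rows):
    # numpy comes from the dependency archive, not the base image
    # (assumes numpy was listed in requirements.txt when packaging).
    import numpy as np
    return [float(np.sqrt(r)) for r in rows]

# The function runs on the executors, showing the archive was installed cluster-wide.
print(spark.sparkContext.parallelize([1, 4, 9, 16]).mapPartitions(uses_numpy).collect())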
Build a Dependency Archive Using the Data Flow Dependency Packager
- Download Docker.
- Download the packager tool image:
docker pull phx.ocir.io/oracle/dataflow/dependency-packager:latest
- For Python dependencies, create a requirements.txt file. For example, it might look like:
numpy==1.18.1
pandas==1.0.3
pyarrow==0.14.0
Note
Do not include pyspark or py4j. These dependencies are provided by Data Flow, and including them causes your Runs to fail.
The Data Flow Dependency Packager uses Python's pip tool to install all dependencies. If you have Python wheels that can't be downloaded from public sources, place them in a directory beneath where you build the package. Refer to them in requirements.txt with a prefix of /opt/dataflow/. For example:
/opt/dataflow/<my-python-wheel.whl>
where <my-python-wheel.whl> represents the name of your Python wheel. Pip sees it as a local file and installs it normally.
- For Java dependencies, create a file called packages.txt. For example, it might look like:
ml.dmlc:xgboost4j:0.90
ml.dmlc:xgboost4j-spark:0.90
https://repo1.maven.org/maven2/com/nimbusds/nimbus-jose-jwt/8.11/nimbus-jose-jwt-8.11.jar
The Data Flow Dependency Packager uses Apache Maven to download dependency JAR files. If you have JAR files that can't be downloaded from public sources, place them in a local directory beneath where you build the package. Any JAR files in any subdirectory of where you build the package are included in the archive.
- Use the Docker container to create the archive.
- Use this command if using MacOS or Linux:
docker run --rm -v $(pwd):/opt/dataflow --pull always -it phx.ocir.io/oracle/dataflow/dependency-packager:latest -p 3.6
- If using the Windows command prompt as the Administrator, use this command:
docker run --rm -v %CD%:/opt/dataflow --pull always -it phx.ocir.io/oracle/dataflow/dependency-packager:latest -p 3.6
- If using Windows PowerShell as the Administrator, use this command:
docker run --rm -v ${PWD}:/opt/dataflow --pull always -it phx.ocir.io/oracle/dataflow/dependency-packager:latest -p 3.6
The Dependency Packager creates a file called archive.zip.
Note
If you have no write permission on the Docker container, you cannot write files back to the host machine or VM. The archive.zip creation fails, and the Dependency Packager prints error logs similar to the following:
cp: cannot create regular file '/opt/dataflow/./archive.zip': Permission denied
cp: cannot create regular file '/opt/dataflow/./version.txt': Permission denied
To be able to create archive.zip in this case:
- Create a temporary folder with read and write permission for all users.
- Move packages.txt or requirements.txt to the folder.
- Run the command to create archive.zip there.
mkdir /tmp/shared
chmod -R a+rw /tmp/shared
cp <packages.txt or requirements.txt> /tmp/shared
cd /tmp/shared
docker run --rm -v $(pwd):/opt/dataflow --pull always -it phx.ocir.io/oracle/dataflow/dependency-packager:latest
- (Optional) You can add static content. You might want to include other content in your archive. For example, you might want to deploy a data file, an ML model file, or an executable that your Spark program calls at runtime. Add files to archive.zip after you created it in Step 4. (A sketch of reading packaged static content at runtime follows these steps.)
For Java applications:
- Unzip archive.zip.
- Add the JAR files in the java/ directory only. Only JAR and DS_store files are allowed.
- Zip the file.
- Upload it to your object store.
For Python applications:
- Unzip archive.zip.
- Add your local modules to only these four subdirectories of the python/ directory:
python/bin
python/lib
python/lib32
python/lib64
Note
These subdirectories of python/bin are forbidden:
python/bin/python
python/bin/python3
python/bin/pip
python/bin/pip3
Note
Only these five directories are allowed for storing your Java and Python dependencies.
- Zip the file.
- Upload it to your object store.
When your Data Flow application runs, the static content is available on any node under the directory where you chose to place it. For example, if you added files under python/lib/ in your archive, they are available in the /opt/dataflow/python/lib/ directory on any node. In the example in The Structure of the Dependency Archive, the subdirectory used is python/lib/user/<your_static_data>.
- Upload archive.zip to your object store.
- Add the library to your application. See the Create a Java or Scala Data Flow Application or Create a PySpark Data Flow Application sections for more information.
The Structure of the Dependency Archive
Dependency archives are ordinary ZIP files. Advanced users might choose to build archives with their own tools rather than using the Data Flow Dependency Packager. A properly constructed dependency archive has this general outline:
python
python/lib
python/lib/python3.6/<your_library1>
python/lib/python3.6/<your_library2>
python/lib/python3.6/<...>
python/lib/python3.6/<your_libraryN>
python/lib/user
python/lib/user/<your_static_data>
java
java/<your_jar_file1>
java/<...>
java/<your_jar_fileN>
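If you build the archive yourself, any tool that produces this layout works. Here is a hedged sketch using Python's standard zipfile module; the source directories and JAR path are placeholders for your own build outputs, and python3.6 reflects the runtime version shown in the outline above.

import os
import zipfile

def add_tree(zf, src_dir, arc_prefix):
    # Recursively add every file under src_dir to the archive under arc_prefix.
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            path = os.path.join(root, name)
            arcname = os.path.join(arc_prefix, os.path.relpath(path, src_dir))
            zf.write(path, arcname)

with zipfile.ZipFile("archive.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    # Python libraries: python/lib/python3.6/... (placeholder source directory)
    add_tree(zf, "build/site-packages", "python/lib/python3.6")
    # Static data: python/lib/user/... (placeholder source directory)
    add_tree(zf, "build/static", "python/lib/user")
    # JAR files go directly under java/ (placeholder JAR path).
    zf.write("libs/my-dependency.jar", "java/my-dependency.jar")

Anything packaged this way must still be binary-compatible with the Data Flow runtime; you can check the result with the validation command below.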
Data Flow extracts the archive under the /opt/dataflow directory.
Validate an archive.zip File Using the Data Flow Dependency Packager
You can use the Data Flow Dependency Packager to validate an archive.zip file locally, before uploading the file to Object Storage.
docker run --rm -v $(pwd):/opt/dataflow --pull always -it phx.ocir.io/oracle/dataflow/dependency-packager:latest --validate archive.zip
Example Requirements.txt and Packages.txt Files
This requirements.txt file includes the Oracle Cloud Infrastructure SDK for Python version 2.14.3 in a Data Flow Application:
-i https://pypi.org/simple
certifi==2020.4.5.1
cffi==1.14.0
configparser==4.0.2
cryptography==2.8
oci==2.14.3
pycparser==2.20
pyopenssl==19.1.0
python-dateutil==2.8.1
pytz==2020.1
six==1.15.0
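Once an archive built from this file is attached to a Run, the SDK imports like any locally installed package. A minimal sketch follows; the resource-principal signer is one common way to authenticate from inside a Data Flow Run and is an assumption here, not something the archive itself provides.

import oci

print("OCI SDK version from the archive:", oci.__version__)  # expect 2.14.3

# Hypothetical auth setup: adjust to whatever your tenancy and policies allow.
signer = oci.auth.signers.get_resource_principals_signer()
object_storage = oci.object_storage.ObjectStorageClient(config={}, signer=signer)
print("Object Storage namespace:", object_storage.get_namespace().data)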
This requirements.txt file includes a mix of PyPI sources, web sources, and local sources for Python wheel files:
-i https://pypi.org/simple
blis==0.4.1
catalogue==1.0.0
certifi==2020.4.5.1
chardet==3.0.4
cymem==2.0.3
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz#egg=en-core-web-sm
idna==2.9
importlib-metadata==1.6.0 ; python_version < '3.8'
murmurhash==1.0.2
numpy==1.18.3
plac==1.1.3
preshed==3.0.2
requests==2.23.0
spacy==2.2.4
srsly==1.0.2
thinc==7.4.0
tqdm==4.45.0
urllib3==1.25.9
wasabi==0.6.0
zipp==3.1.0
/opt/dataflow/mywheel-0.1-py3-none-any.whl
This example shows local Oracle JDBC JAR files placed beneath the directory where the package is built, so that they are included in the archive:
ojdbc8-18.3.jar
oraclepki-18.3.jar
osdt_cert-18.3.jar
osdt_core-18.3.jar
ucp-18.3.jar
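With these driver JARs packaged under the archive's java/ directory, they are on the Spark classpath at runtime. Here is a hedged PySpark sketch of using them; the connection URL, credentials, and table name are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oracle-jdbc-example").getOrCreate()

# ojdbc8 from the dependency archive is already on the Spark classpath,
# so no extra --jars or spark.jars configuration is needed for it.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//dbhost.example.com:1521/MYPDB")  # placeholder
    .option("driver", "oracle.jdbc.OracleDriver")
    .option("dbtable", "MY_SCHEMA.MY_TABLE")  # placeholder
    .option("user", "scott")  # placeholder credentials
    .option("password", "example")  # placeholder credentials
    .load()
)
df.show(5)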