Setting Up PySpark Integration

To integrate Jupyter and PySpark, install the findspark package.

ODH cluster nodes include Python 3, the Apache Spark 3 client, and PySpark.

PySpark is the Python interface for Apache Spark. With PySpark, you can write Spark applications using Python APIs. The PySpark shell is an interactive environment for analyzing distributed data.

The findspark package locates PySpark and adds it to the system path, so that Jupyter and PySpark integrate seamlessly.

  1. Access your ODH cluster node:
    • The second utility node of an HA (highly available) cluster.

    • The first (and only) utility node of a non-HA cluster.

  2. Install Py4J (Python for Java), which lets Python programs access Java objects in the JVM.
    sudo python3 -m pip install py4j
  3. Install findspark.
    sudo python3 -m pip install findspark
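Once both packages are installed, a Jupyter notebook typically calls `findspark.init()` before importing `pyspark`. The sketch below illustrates roughly what that call does under the hood; the function name `init_spark` and the default path `/usr/lib/spark` are illustrative assumptions, not the actual findspark implementation.

```python
import glob
import os
import sys


def init_spark(spark_home=None):
    """Simplified sketch of what findspark.init() does: locate a Spark
    installation and put its bundled Python libraries on sys.path so
    that `import pyspark` works in a plain Python or Jupyter session."""
    # Fall back to SPARK_HOME, then a common install location (illustrative).
    spark_home = spark_home or os.environ.get("SPARK_HOME", "/usr/lib/spark")
    python_dir = os.path.join(spark_home, "python")
    # Spark ships py4j as a zip archive under python/lib; it must be on
    # sys.path alongside the pyspark sources.
    py4j_zips = glob.glob(os.path.join(python_dir, "lib", "py4j-*.zip"))
    sys.path.insert(0, python_dir)
    sys.path[1:1] = py4j_zips
    os.environ["SPARK_HOME"] = spark_home
    return spark_home


# In a notebook, the real calls would simply be:
#   import findspark
#   findspark.init()
#   import pyspark
```

After `findspark.init()` succeeds, `import pyspark` resolves normally and you can build a `SparkSession` as usual.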