PySpark
A description of the PySpark 3.0 and Data Flow CPU on Python 3.7 (version 5.0) conda environment.
Released | June 24, 2022
---|---
Description | Apply the power of Apache Spark and MLlib to speed up your model building. Use PySparkSQL to analyze structured and semi-structured data that is stored in Object Storage. These files can be accessed using Resource Principals for easy and secure authentication. PySpark applies the full power of a notebook session by using parallel computing. For larger jobs, you can develop Spark applications and then submit them to the Data Flow service. To get started with this conda environment, review the
Python Version | 3.7
Slug |
Object Storage Path |
Top Libraries | For a complete list of preinstalled Python libraries, see pyspark30_p37_cpu_v5.txt.
Example Notebooks |
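The description above maps to only a handful of PySpark calls. The following sketch, which assumes the Object Storage connector is already configured for your tenancy (see the configuration steps at the end of this page) and uses hypothetical bucket, namespace, and file names, reads a CSV file and analyzes it with PySparkSQL:

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session inside the notebook session.
spark = SparkSession.builder.appName("pyspark-conda-demo").getOrCreate()

# Hypothetical bucket, namespace, and object name; replace with your own.
df = spark.read.csv(
    "oci://my-bucket@my-namespace/sales.csv",
    header=True,
    inferSchema=True,
)

# Register the DataFrame as a view and query it with PySparkSQL.
df.createOrReplaceTempView("sales")
spark.sql(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
).show()
```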
A description of the PySpark 3.0 and Data Flow CPU on Python 3.7 (version 4.0) conda environment.
Released | March 29, 2022
---|---
Description | Apply the power of Apache Spark and MLlib to speed up your model building. Use PySparkSQL to analyze structured and semi-structured data that is stored in Object Storage. These files can be accessed using Resource Principals for easy and secure authentication. PySpark applies the full power of a notebook session by using parallel computing. For larger jobs, you can develop Spark applications and then submit them to the Data Flow service. To get started with this conda environment, review the
Python Version | 3.7
Slug |
Object Storage Path |
Top Libraries | For a complete list of preinstalled Python libraries, see pyspark30_p37_cpu_v4.txt.
Example Notebooks | Using the Notebook Explorer to access Notebook Examples describes how to locate and access the included interactive example notebooks, and what each of them can be used for.
A description of the PySpark 3.0 and Data Flow CPU on Python 3.7 (version 3.0) conda environment.
Released | February 9, 2022
---|---
Description | Apply the power of Apache Spark and MLlib to speed up your model building. Use PySparkSQL to analyze structured and semi-structured data that is stored in Object Storage. These files can be accessed using Resource Principals for easy and secure authentication. PySpark applies the full power of a notebook session by using parallel computing. For larger jobs, you can develop Spark applications and then submit them to the Data Flow service. To get started with this conda environment, review the
Python Version | 3.7
Slug |
Object Storage Path |
Top Libraries | For a complete list of preinstalled Python libraries, see pyspark30_p37_cpu_v3.txt.
Example Notebooks | Using the Notebook Explorer to access Notebook Examples describes how to locate and access the included interactive example notebooks, and what each of them can be used for.
This conda environment has been removed due to a critical vulnerability within the Apache Log4j module (CVE-2021-44228).
If you have created published conda environments by cloning this environment, we strongly encourage you to remediate the vulnerability.
A description of the PySpark 3.0 and Data Flow CPU on Python 3.7 (version 2.0) conda environment.
Released | July 15, 2021
---|---
Description | Apply the power of Apache Spark and MLlib to speed up your model building. Use PySparkSQL to analyze structured and semi-structured data that is stored in Object Storage. These files can be accessed using Resource Principals for easy and secure authentication. PySpark applies the full power of a notebook session by using parallel computing. For larger jobs, you can develop Spark applications and then submit them to the Data Flow service. To get started with this conda environment, review the
Python Version | 3.7
Slug |
Top Libraries | For a complete list of preinstalled Python libraries, see pyspark30_p37_cpu_v2.txt.
Example Notebooks | Using the Notebook Explorer to access Notebook Examples describes how to locate and access the included interactive example notebooks, and what each of them can be used for.
This conda environment has been removed due to a critical vulnerability within the Apache Log4j module (CVE-2021-44228).
If you have created published conda environments by cloning this environment, we strongly encourage you to remediate the vulnerability.
A description of the PySpark 3.0 and Data Flow CPU on Python 3.7 (version 1.0) conda environment.
Released | June 1, 2021
---|---
Description | This conda environment allows data scientists to leverage Apache Spark, including the machine learning algorithms in MLlib. Use PySparkSQL to analyze structured and semi-structured data that is stored in Object Storage. PySpark leverages the full power of a notebook session by using parallel computing. For larger jobs, you can develop Spark applications and submit them to the Data Flow service. Support for PySpark version 3.0.2 was added, and this version is compatible with the OCI Data Flow service. To get started with this conda environment, review the
Python Version | 3.7
Slug |
Top Libraries | For a complete list of preinstalled Python libraries, see pyspark30_p37_cpu_v1.txt.
Example Notebooks | Using the Notebook Explorer to access Notebook Examples describes how to locate and access the included interactive example notebooks, and what each of them can be used for.
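Because this version is compatible with the Data Flow service, a larger job can be kicked off as a Data Flow run once it is packaged as a Data Flow application. A minimal sketch using the OCI Python SDK, where both OCIDs are placeholders for an existing application and compartment:

```python
import oci

# Authenticate with the standard OCI configuration file.
config = oci.config.from_file()
client = oci.data_flow.DataFlowClient(config)

# Placeholder OCIDs; point these at your Data Flow application and compartment.
run_details = oci.data_flow.models.CreateRunDetails(
    application_id="ocid1.dataflowapplication.oc1..example",
    compartment_id="ocid1.compartment.oc1..example",
    display_name="notebook-submitted-run",
)

# Submit the run and report its initial lifecycle state.
run = client.create_run(run_details).data
print(run.id, run.lifecycle_state)
```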
A description of the PySpark 2.4 and Data Flow CPU on Python 3.7 (version 3.0) conda environment.
Released | March 29, 2022
---|---
Description | Apply the power of Apache Spark and MLlib to speed up your model building. Use PySparkSQL to analyze structured and semi-structured data that is stored in Object Storage. These files can be accessed using Resource Principals for easy and secure authentication. PySpark applies the full power of a notebook session by using parallel computing. For larger jobs, you can develop Spark applications and then submit them to the Data Flow service. To get started with this conda environment, review the
Python Version | 3.7
Slug |
Object Storage Path |
Top Libraries | For a complete list of preinstalled Python libraries, see pyspark24_p37_cpu_v3.txt.
Example Notebooks | Using the Notebook Explorer to access Notebook Examples describes how to locate and access the included interactive example notebooks, and what each of them can be used for.
This conda environment has been removed due to a critical vulnerability within the Apache Log4j module (CVE-2021-44228).
If you have created published conda environments by cloning this environment, we strongly encourage you to remediate the vulnerability.
A description of the PySpark 2.4 and Data Flow CPU on Python 3.7 (version 2.0) conda environment.
Released | July 15, 2021
---|---
Description | Apply the power of Apache Spark and MLlib to speed up your model building. Use PySparkSQL to analyze structured and semi-structured data that is stored in Object Storage. These files can be accessed using Resource Principals for easy and secure authentication. PySpark applies the full power of a notebook session by using parallel computing. For larger jobs, you can develop Spark applications and then submit them to the Data Flow service. To get started with this conda environment, review the
Python Version | 3.7
Slug |
Top Libraries | For a complete list of preinstalled Python libraries, see pyspark24_p37_cpu_v2.txt.
Example Notebooks | Using the Notebook Explorer to access Notebook Examples describes how to locate and access the included interactive example notebooks, and what each of them can be used for.
This conda environment has been removed due to a critical vulnerability within the Apache Log4j module (CVE-2021-44228).
If you have created published conda environments by cloning this environment, we strongly encourage you to remediate the vulnerability.
A description of the PySpark 2.4 and Data Flow CPU on Python 3.7 (version 1.0) conda environment.
Released | May 11, 2021
---|---
Description | This conda environment allows data scientists to leverage Apache Spark, including the machine learning algorithms in MLlib. Use PySparkSQL to analyze structured and semi-structured data that is stored in Object Storage. PySpark leverages the power of a notebook session by using parallel computing. For larger jobs, you can develop Spark applications and then submit them to the Data Flow service. To get started with this conda environment, review the
Python Version | 3.7
Slug |
Top Libraries | For a complete list of preinstalled Python libraries, see pyspark24_p37_cpu_v1.txt.
Example Notebooks | Using the Notebook Explorer to access Notebook Examples describes how to locate and access the included interactive example notebooks, and what each of them can be used for.
A description of the PySpark (version 1.0) conda environment.
Released | January 13, 2021
---|---
Description | The PySpark conda environment allows you to apply the power of Apache Spark. Use it to access the full computational power of a notebook session by using parallel computing. For larger jobs, you can interactively develop Spark applications and submit them to Data Flow without blocking the notebook session. PySpark MLlib implements a wide collection of powerful machine-learning algorithms. Use the PySparkSQL SQL-like language to analyze huge amounts of structured and semi-structured data stored in Object Storage. To get started with this conda environment, review the
Python Version | 3.6
Slug |
Object Storage Path |
Top Libraries | For a complete list of preinstalled Python libraries, see pyspv10.txt.
Example Notebooks |
Use these configuration steps so that PySpark can connect to Object Storage:
- Authenticate the user by generating the OCI configuration file and API keys; see SSH keys setup and prerequisites and Authenticating to the OCI APIs from a Notebook Session.
Important
PySpark can't reach Object Storage if you authenticate using resource principals. Also, the key and configuration files can't have a passphrase.
If you must use configuration and key files that have a passphrase, you can download the files from Object Storage using the Python SDK, and then load them into the Spark context.
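As a sketch of that workaround, the following uses the OCI Python SDK to download a key file to local disk before Spark loads it; the namespace, bucket, and object names are hypothetical:

```python
import oci

# Authenticate the SDK with your OCI configuration file.
config = oci.config.from_file()
object_storage = oci.object_storage.ObjectStorageClient(config)

# Hypothetical namespace, bucket, and object holding the key file.
response = object_storage.get_object(
    namespace_name="my-namespace",
    bucket_name="my-secrets-bucket",
    object_name="oci_api_key.pem",
)

# Write the key to local disk so it can be referenced from Spark.
with open("/home/datascience/oci_api_key.pem", "wb") as key_file:
    key_file.write(response.data.content)
```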
- Configure the properties in the /home/datascience/spark_config_dir/core-site.xml file by providing your values between the <value> </value> tags:
  - fs.oci.client.hostname: The address of Object Storage.
  - fs.oci.client.auth.tenantId: The OCID of your tenancy.
  - fs.oci.client.auth.userId: Your user OCID.
  - fs.oci.client.auth.fingerprint: The fingerprint for the key pair being used.
  - fs.oci.client.auth.pemfilepath: The full path and file name of the private key used for authentication.
For details about these properties, see HDFS Connector for Object Storage.
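Putting the properties together, a filled-in core-site.xml might look like the following sketch; every value shown is a placeholder to replace with your own region endpoint, OCIDs, key fingerprint, and key path:

```xml
<configuration>
  <property>
    <name>fs.oci.client.hostname</name>
    <!-- Placeholder endpoint; use the Object Storage address for your region. -->
    <value>https://objectstorage.us-ashburn-1.oraclecloud.com</value>
  </property>
  <property>
    <name>fs.oci.client.auth.tenantId</name>
    <value>ocid1.tenancy.oc1..example</value>
  </property>
  <property>
    <name>fs.oci.client.auth.userId</name>
    <value>ocid1.user.oc1..example</value>
  </property>
  <property>
    <name>fs.oci.client.auth.fingerprint</name>
    <value>aa:bb:cc:dd:ee:ff:00:11:22:33:44:55:66:77:88:99</value>
  </property>
  <property>
    <name>fs.oci.client.auth.pemfilepath</name>
    <value>/home/datascience/.oci/oci_api_key.pem</value>
  </property>
</configuration>
```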