Supported Spark Properties in Data Flow
For every run of a Data Flow application, you can add Spark Properties in the Spark Configuration Properties field.
When you're running in Data Flow, don't change the value of spark.master. If you do, the job doesn't use all the resources you provisioned.
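The same properties can also be supplied programmatically when starting a run. The following is a minimal sketch using the OCI Python SDK's Data Flow client, assuming an existing application; the OCIDs are placeholders, and the configuration field and the property values shown are illustrative rather than prescriptive, so verify them against your environment.

```python
import oci

# Standard OCI config-file authentication (~/.oci/config).
config = oci.config.from_file()
client = oci.data_flow.DataFlowClient(config)

# Spark properties for this run; the keys come from the table below.
# The OCIDs are placeholders, and passing properties through the
# 'configuration' field of CreateRunDetails is an assumption to verify.
run_details = oci.data_flow.models.CreateRunDetails(
    compartment_id="ocid1.compartment.oc1..example",
    application_id="ocid1.dataflowapplication.oc1..example",
    display_name="example-run-with-spark-properties",
    configuration={
        "spark.dataflow.acquireQuotaTimeout": "1h",   # wait up to 1 hour for capacity
        "dataflow.auth": "resource_principal",        # needed for runs longer than 24 hours
    },
)

run = client.create_run(run_details).data
print(run.id, run.lifecycle_state)
```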
Data Flow Proprietary Spark Configuration List
Spark configurations proprietary to Data Flow, and how to use them.
| Spark Configuration | Usage Description | Applicable Spark Versions | 
|---|---|---|
| dataflow.auth | Setting the configuration value to 'resource_principal' enables resource principal authentication for the Data Flow run. This configuration is required for runs that are intended to run for longer than 24 hours. Before enabling resource principal, set up the appropriate policy. See the example after this table. | All |
| spark.dataflow.acquireQuotaTimeout | Data Flow gives you the option to submit jobs when you don't have enough resources to run them. The jobs are held in an internal queue and are released when resources become available. Data Flow keeps checking until the timeout value that you've set is reached. Set the spark.dataflow.acquireQuotaTimeout property to specify this timeout value, under Advanced options when creating an application, or when running an application. For example, spark.dataflow.acquireQuotaTimeout=1h. Use h to represent timeout hours, and m or min to represent timeout minutes. | All |
| spark.archives#conda | The spark.archives configuration serves exactly the same functionality as its open source counterpart. When using Conda as the package manager to submit PySpark jobs in OCI Data Flow, attach #conda to the artifact package entries so that Data Flow extracts the artifacts into a proper directory (see the example after this table). For more information, see Integrating Conda Pack with Data Flow. | 3.2.1 or later |
| spark.dataflow.streaming.restartPolicy.restartPeriod | Note: Applicable to Data Flow Streaming type runs only. This property specifies a minimum delay between restarts for a Streaming application. The default value is 3 minutes, to prevent transient issues from causing many restarts in a short time period. | 3.0.2, 3.2.1 or later |
| spark.dataflow.streaming.restartPolicy.maxConsecutiveFailures | Note: Applicable to Data Flow Streaming type runs only. This property specifies the maximum number of consecutive failures that can occur before Data Flow stops restarting a failed Streaming application. The default value for this is 10. | 3.0.2, 3.2.1 or later | 
| spark.sql.streaming.graceful.shutdown.timeout | Note: Applicable to Data Flow Streaming type runs only. Data Flow Streaming runs use the shutdown duration to preserve the checkpoint data so that the run can restart correctly from the prior state. The configuration specifies the maximum time that Data Flow Streaming runs can spend gracefully preserving the checkpoint state before being forced to shut down. The default is 30 minutes. | 3.0.2, 3.2.1 or later |
| spark.oracle.datasource.enabled | Spark Oracle Datasource is an extension of the Spark JDBC datasource. It simplifies the connection to Oracle databases from Spark. In addition to all the options provided by Spark's JDBC datasource, it offers further simplifications for connecting to Oracle databases from Spark (see the example after this table). For more information, see Spark Oracle Datasource. | 3.0.2 or later |
| spark.scheduler.minRegisteredResourcesRatio | Default: 1.0 Note: Specified as a double between 0.0 and 1.0. The minimum ratio of registered resources to total expected resources to wait for before scheduling a run in the job layer. Adjusting this parameter involves a trade-off between faster job startup and ensuring adequate resource availability. For example, a value of 0.8 means the run waits until 80% of the expected resources have registered before it's scheduled. | All |
| spark.dataflow.overAllocationRatio | Default: 1.0 Note: Specified as a double larger than, or equal to, 1.0. The ratio of excess resources to create, to avoid job failure caused by the failure to create a small portion of the instances. The extra instances are billed only during the creation phase and are released after the job starts. For example, a value of 1.1 means that 10% more resources are created to accommodate the resources expected for customer jobs. | All |
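When dataflow.auth is set to 'resource_principal' for a run, the code inside the job can authenticate to other OCI services without API keys. The following PySpark sketch assumes the run's resource principal has a policy allowing it to read from Object Storage; the namespace and bucket names are placeholders.

```python
import oci
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("resource-principal-example").getOrCreate()

# Resource principal credentials are available inside the run once
# dataflow.auth=resource_principal is set in the run's Spark configuration.
signer = oci.auth.signers.get_resource_principals_signer()
object_storage = oci.object_storage.ObjectStorageClient(config={}, signer=signer)

# Placeholder namespace and bucket; replace with values your policy allows.
objects = object_storage.list_objects("my-namespace", "my-bucket").data
for obj in objects.objects:
    print(obj.name)
```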
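For spark.archives with Conda, the #conda suffix is appended to the archive's Object Storage path. A minimal sketch, assuming a conda-pack archive has already been uploaded to a bucket (the path below is a placeholder); the resulting dictionary can be passed as the run configuration in the same way as the earlier SDK example.

```python
# Placeholder Object Storage path to a conda-pack archive; the #conda suffix
# tells Data Flow to extract the archive into the expected directory.
conda_env = "oci://my-bucket@my-namespace/environments/pyspark_env.tar.gz#conda"

run_configuration = {
    "spark.archives": conda_env,
}
```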
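With spark.oracle.datasource.enabled set to true, a PySpark job can read from an Oracle database through the "oracle" format. The option names below (adbId, dbtable, user, password) are taken as assumptions from the Spark Oracle Datasource documentation, and the database OCID and credentials are placeholders; verify the details in Spark Oracle Datasource before use.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oracle-datasource-example").getOrCreate()

# Requires spark.oracle.datasource.enabled=true in the run's Spark configuration.
# The Autonomous Database OCID, table, and credentials are placeholders.
df = (
    spark.read.format("oracle")
    .option("adbId", "ocid1.autonomousdatabase.oc1..example")
    .option("dbtable", "ADMIN.SALES")
    .option("user", "ADMIN")
    .option("password", "example-password")
    .load()
)
df.show()
```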