Spark Streaming

Learn about Spark streaming in Data Flow.

Streaming applications require continuous execution over long periods, often beyond 24 hours and sometimes for weeks or even months. After an unexpected failure, a streaming application must restart from the point of failure without producing incorrect computational results. Data Flow relies on Spark Structured Streaming checkpointing to record the processed offsets, which can be stored in your Object Storage bucket.
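The checkpointing idea can be sketched in plain Python (this illustrates the concept only, not the Spark implementation; all names and values here are invented for the example):

```python
# Plain-Python sketch of offset checkpointing (not the Spark
# implementation): persist the last processed offset so that a restart
# resumes exactly where the failed run stopped, with no re-processing.
import json
import os
import tempfile

def process(records, checkpoint_path):
    """Process records from the last checkpointed offset onward."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["offset"]
    out = []
    for offset in range(start, len(records)):
        out.append(records[offset].upper())  # the "computation"
        with open(checkpoint_path, "w") as f:
            json.dump({"offset": offset + 1}, f)  # record progress
    return out

ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
data = ["a", "b", "c", "d"]
first = process(data[:2], ckpt)   # run stops after two records
resumed = process(data, ckpt)     # restart picks up at offset 2, not 0
print(first, resumed)             # ['A', 'B'] ['C', 'D']
```

In a Data Flow application, the same role is played by the Structured Streaming `checkpointLocation` option pointing at an Object Storage path.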

To allow for regular Oracle Cloud Infrastructure maintenance, Data Flow gracefully shuts down the Spark cluster running a Structured Streaming application. When maintenance is complete, a new Spark cluster with the updated software is created, and a new run appears in the list. The status of the previous run indicates that it was stopped for maintenance.

Data Flow provides access to the Spark UI and Spark History Server, a suite of web user interfaces (UIs) that you can use to monitor the events, status, and resource consumption of your Spark cluster. Importantly, they let you explore logical and physical execution plans. For streaming, they provide insight into processing progress, for example input and output rates, offsets, batch durations, and statistical distributions. The Spark UI shows information about currently running jobs; the History Server shows finished jobs.
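These streaming metrics come from the per-batch progress report that Structured Streaming exposes (for example through `StreamingQuery.lastProgress`) and that the UI visualizes. A minimal sketch of reading such a report; the payload below is a hand-written sample, not real output:

```python
# Sketch: inspecting a Structured Streaming per-batch progress report.
# The dictionary below imitates the shape of query.lastProgress; the
# numbers and the Kafka topic name are made up for illustration.
progress = {
    "batchId": 42,
    "inputRowsPerSecond": 120.0,
    "processedRowsPerSecond": 115.0,
    "durationMs": {"triggerExecution": 870, "addBatch": 640},
    "sources": [
        {"description": "KafkaV2 [Subscribe[events]]",
         "startOffset": {"events": {"0": 1000}},
         "endOffset": {"events": {"0": 1104}},
         "numInputRows": 104},
    ],
}

# If input rate exceeds processing rate, the query is falling behind.
backlog = progress["inputRowsPerSecond"] - progress["processedRowsPerSecond"]
rows = sum(s["numInputRows"] for s in progress["sources"])
print(f"batch {progress['batchId']}: {rows} rows, "
      f"{progress['durationMs']['triggerExecution']} ms, "
      f"backlog growth {backlog:.1f} rows/s")
```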

Batch runs allow multiple concurrent runs of the same code with mostly the same arguments. But running multiple instances of a streaming application would corrupt its checkpoint data, so Data Flow limits each streaming application to one run at a time. To avoid unintentional corruption of a streaming application, you must stop its run before you can edit it. When the edit is complete, you can restart the streaming application. To help you distinguish batch and streaming applications, each application has an Application Type with the value Batch or Streaming.
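The one-run-per-streaming-application rule can be sketched as a simple admission guard (illustrative Python only, not Data Flow's internals; the names are invented):

```python
# Illustrative sketch (not Data Flow internals): Batch applications admit
# concurrent runs, but a Streaming application admits only one active run.
active_runs = {}  # application id -> number of active runs

def start_run(app_id, app_type):
    """Admit a new run unless a streaming run is already active."""
    if app_type == "Streaming" and active_runs.get(app_id, 0) >= 1:
        raise RuntimeError(f"{app_id}: a streaming run is already active")
    active_runs[app_id] = active_runs.get(app_id, 0) + 1

start_run("batch-etl", "Batch")
start_run("batch-etl", "Batch")          # concurrent batch runs are fine
start_run("click-stream", "Streaming")
try:
    start_run("click-stream", "Streaming")  # second streaming run rejected
except RuntimeError as e:
    rejected = str(e)
print(rejected)
```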

As with batch runs, Data Flow permits streaming applications to connect to private networks.

If a run stops because of an error, Data Flow makes up to 10 attempts to restart it, waiting three minutes between attempts. If the tenth attempt fails, no further attempts are made and the run remains stopped.
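The restart policy above can be sketched as follows (illustrative Python; the function and callback names are invented, and the `sleep` hook stands in for the three-minute wait):

```python
# Sketch of the restart policy described above: up to 10 restart
# attempts, waiting three minutes between attempts; after the tenth
# failure the run stays stopped.
RETRY_LIMIT = 10
RETRY_DELAY_SECONDS = 180  # three minutes

def restart_with_retries(start_run, sleep=lambda seconds: None):
    """Try to restart a failed run; return the winning attempt or None."""
    for attempt in range(1, RETRY_LIMIT + 1):
        if start_run(attempt):
            return attempt              # run restarted successfully
        if attempt < RETRY_LIMIT:
            sleep(RETRY_DELAY_SECONDS)  # wait before the next attempt
    return None                         # all attempts failed

# Example: the run comes back on the fourth attempt.
result = restart_with_retries(lambda attempt: attempt == 4)
print(result)  # 4
```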