Migrating Data Flow Applications from Spark 2.4.4 to Spark 3.0.2

Learn about migrating Data Flow applications to use Spark 3.0.2 rather than Spark 2.4.4.

Data Flow now supports both Spark 2.4.4 and Spark 3.0.2. This chapter describes what you need to do if you are migrating existing Data Flow applications from Spark 2.4.4 to Spark 3.0.2.

Migrating Data Flow to Spark 3.0.2

The steps to migrate Data Flow from Spark 2.4.4 to Spark 3.0.2, and issues to look out for.

  1. Follow the steps in the Spark Migration Guide.
  2. Spark 3.0.2 requires Scala 2.12. Scala applications compiled with a previous version of Scala, and their dependencies, must be updated accordingly.
  3. If you have allowlisted a property for Spark 2.4.4, you must create a separate request to allowlist it for Spark 3.0.2.
  4. Update pom.xml to explicitly declare the dependencies on the Spark libraries.
  5. When you edit an application, or create a Java or Scala, PySpark, or SQL Data Flow application, select Spark 3.0.2 -> Scala 2.12 for Spark Version.
    Note

    If you change the Spark version for an application that already has a Run, and then rerun that prior Run, the rerun uses the new version of Spark, not the version it was originally run with.
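As an illustration of step 4, a pom.xml for a Spark 3.0.2 application might declare the Spark libraries as follows. The version values and the provided scope are an assumption, a sketch to adapt to your own build, not a required configuration:

```xml
<properties>
  <!-- Spark 3.0.2 is built against Scala 2.12 -->
  <scala.version>2.12.10</scala.version>
  <spark.version>3.0.2</spark.version>
</properties>
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <!-- The artifact suffix must match the Scala binary version -->
    <artifactId>spark-core_2.12</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
  </dependency>
</dependencies>
```

Note that the `_2.12` suffix on each artifactId is how Maven distinguishes Scala binary versions; any remaining `_2.11` artifacts in the dependency tree must also be updated.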

Errors when Parsing Timestamp or Date Strings

Learn how to overcome date or timestamp string parsing or formatting errors in Data Flow after migrating from Spark 2.4.4 to Spark 3.0.2.

After migrating your Data Flow applications to Spark 3.0.2, you might get the following error when running an application:
org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'MMMMM' pattern in the DateTimeFormatter.
There is a simple fix: set spark.sql.legacy.timeParserPolicy to LEGACY, then re-run your application.
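One place to apply the setting is when the Spark session is created. A minimal PySpark sketch (the application name is a made-up example; the configuration key and value are the ones named above):

```python
from pyspark.sql import SparkSession

# LEGACY restores the Spark 2.4 (java.text.SimpleDateFormat) behavior
# for parsing and formatting date/timestamp strings.
spark = (
    SparkSession.builder
    .appName("timestamp-migration-example")  # hypothetical name
    .config("spark.sql.legacy.timeParserPolicy", "LEGACY")
    .getOrCreate()
)
```

The same property can also be set on an existing session with spark.conf.set, or passed as a Spark configuration property when creating or editing the application.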