9.2.16 Databricks

Overview

Databricks is a unified, open analytics platform for building, deploying, sharing, and maintaining enterprise-grade data, analytics, and AI solutions at scale.

9.2.16.1 Detailed Functionality

Replication to Databricks uses a stage and merge data flow.

The change data from the Oracle GoldenGate trails is staged in micro-batches at a temporary staging location, typically a cloud object store.

The staged records are then merged into the Databricks target tables using a MERGE SQL statement.
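
For illustration, the following is a minimal sketch of the kind of MERGE statement used in a stage and merge flow; the table, column, and operation-type names are hypothetical, and the exact statement generated by the Databricks Event Handler may differ.

-- Hypothetical stage and merge step; not the exact SQL generated by the handler.
MERGE INTO target_schema.orders AS t
USING orders_stage AS s
  ON t.order_id = s.order_id
WHEN MATCHED AND s.op_type = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET t.status = s.status, t.amount = s.amount
WHEN NOT MATCHED AND s.op_type <> 'D' THEN
  INSERT (order_id, status, amount) VALUES (s.order_id, s.status, s.amount);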

9.2.16.1.1 Staging location

The change data records from the GoldenGate trail files are formatted into the Avro OCF (Object Container File) format and uploaded to the staging location. Change data can be staged in one of the following object stores, based on the Databricks configuration:

  • Azure Data Lake Storage (ADLS) Gen2
  • AWS Simple Storage Service (S3)
  • Google Cloud Storage (GCS)

9.2.16.1.2 Database User Privileges

The database user used for replicating into Databricks must be granted the following privileges:

  • CREATE, INSERT, UPDATE, DELETE, and TRUNCATE on the target tables.
  • CREATE, ALTER, and DROP external tables.

9.2.16.1.3 Prerequisites

  • You must have Azure, Amazon Web Services, or Google Cloud Platform cloud accounts set up for Databricks.
  • Azure storage accounts must have hierarchical namespace enabled for replication to Databricks on Azure.
  • Databricks JDBC driver.

9.2.16.2 Configuration

The configuration of the Databricks replication properties is stored in the Replicat properties file.

9.2.16.2.1 Automatic Configuration

Databricks replication involves configuring multiple components, such as the File Writer Handler, the ABS, S3, or GCS Event Handler, and the target Databricks Event Handler.

The Automatic Configuration functionality autoconfigures these components so that manual configuration is minimal.

The properties modified by autoconfiguration are also logged in the handler log file.

To enable autoconfiguration to replicate to the Databricks target, set the parameter gg.target=databricks.

The parameter gg.stage determines the staging location.

If gg.stage is unset, ADLS Gen2 will be used as the staging location.

If gg.stage is set to abs, s3, or gcs, then ADLS Gen2, AWS S3, or GCS, respectively, is used as the staging location.

The JDBC Metadata provider is also automatically enabled to retrieve target table metadata from Databricks.

Target tables are automatically created if missing.
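
For example, a minimal sketch of the autoconfiguration entries in the Replicat properties file (an S3 stage is assumed here; omit gg.stage to use the default ADLS Gen2 stage):

# Enable autoconfiguration for the Databricks target.
gg.target=databricks
# Optional: staging location; abs (default), s3, or gcs.
gg.stage=s3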

9.2.16.2.1.1 File Writer Handler Configuration
The File Writer Handler name is pre-set to the value databricks and its properties are automatically set to the required values for Databricks.
9.2.16.2.1.2 ABS Event Handler Configuration

The ABS Event Handler name is pre-set to the value abs and must be configured to match your ADLS Gen2 configuration.

The following is an example of editing a property of the ABS Event Handler: gg.eventhandler.abs.bucketMappingTemplate=container1

For more information about integrating with ADLS Gen2, see Azure Blob Storage Configuration.

9.2.16.2.1.3 S3 Event Handler Configuration
The S3 Event Handler name is pre-set to the value s3 and must be configured to match your S3 configuration. The following is an example of editing a property of the S3 Event Handler: gg.eventhandler.s3.bucketMappingTemplate=bucket1

For more information about integrating with S3, see S3 Event Handler Configuration.

9.2.16.2.1.4 GCS Event Handler Configuration

The GCS Event Handler name is pre-set to the value gcs and must be configured to match your GCS configuration. The following is an example of editing a GCS Event Handler property: gg.eventhandler.gcs.bucketMappingTemplate=bucket1

For more information about integrating with GCS, see GCS Event Handler Configuration.

9.2.16.2.1.5 Databricks Event Handler Configuration

The Databricks Event Handler name is pre-set to the value databricks.

The following configuration properties are available for the Databricks Event Handler; the required properties must be set to match your Databricks configuration.

Table 9-14 Databricks Event Handler Configuration Properties

Properties Required/Optional Legal Values Default Explanation
gg.eventhandler.databricks.connectionURL Required jdbc:databricks://<server-hostname>:443;httpPath=<http-path>[;<setting1>=<value1>;<setting2>=<value2>;<settingN>=<valueN>] None JDBC URL to connect to Databricks. See Databricks Authentication Methods.
gg.eventhandler.databricks.UserName Optional Supported database user name string. None Databricks database user or token.
gg.eventhandler.databricks.Password Optional Supported database password string. None Databricks database password or token value.
gg.eventhandler.databricks.credential Optional Storage Credential name. None External Storage credential name to access files on object storage such as ADLS Gen2, S3 or GCS. For more information, see Create a Storage Credential.
gg.eventhandler.databricks.createTable Optional true or false true If the value is set to true, then target tables are automatically created if missing.
gg.eventhandler.databricks.maxConnections Optional Integer value 10 Use this parameter to control the number of concurrent JDBC database connections to the target database.
gg.eventhandler.databricks.connectionRetries Optional Integer value 3 Specifies the number of times connections to the target data warehouse will be retried.
gg.eventhandler.databricks.connectionRetryIntervalSeconds Optional Integer value 30 Specifies the delay in seconds between connection retry attempts.
gg.eventhandler.databricks.deleteInsert Optional true or false false If set to true, Replicat will merge records using SQL DELETE+INSERT statements instead of a SQL MERGE statement. Note: Applicable only if gg.compressed.update is set to false.
gg.eventhandler.databricks.detectMissingBaseRow Optional true or false false Diagnostic parameter to find UPDATE operations without a base row. If set to true, Replicat will ABEND if there are UPDATE operations without a base row. These rows are collected into another table that can be investigated.
gg.eventhandler.databricks.dropStagingTablesOnShutdown Optional true or false false If set to true, the temporary staging tables created by Oracle GoldenGate will be dropped on replicat graceful stop.
gg.handler.databricks.fileRollInterval Optional The default unit of measure is milliseconds. You can stipulate ms, s, m, or h to signify milliseconds, seconds, minutes, or hours, respectively. Examples of legal values include 10000, 10000ms, 10s, 10m, or 1.5h. Values of 0 or less turn off file rolling on time. 3m (three minutes) The parameter determines how often the data will be merged into Databricks.

Note:

Use this parameter with caution. Increasing its default value (3m) increases the amount of data stored in the internal memory of the Replicat, which can cause out-of-memory errors and stop the Replicat.

gg.validate.keyupdate Optional true or false false If set to true, Replicat validates key update operations (optype 115) and corrects them to normal updates if no key values have changed. Compressed key update operations do not qualify for merge.
gg.compressed.update Optional true or false true If set to true, then this indicates that the source trail files contain compressed update operations. If set to false, then the source trail files are expected to contain uncompressed update operations.

9.2.16.2.2 Authentication to Databricks

Databricks JDBC connection URLs use the following format: jdbc:databricks://<Host>:<Port>;httpPath=<http-path>[;<setting1>=<value1>;<setting2>=<value2>;<settingN>=<valueN>]
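
The following is a hypothetical example of such a URL; the server hostname and HTTP path are placeholders that must be replaced with the values from your workspace (see the sections that follow):

jdbc:databricks://adb-1234567890123456.7.azuredatabricks.net:443;httpPath=/sql/1.0/warehouses/1ab2c3d4e5f67890;EnableArrow=0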

9.2.16.2.2.1 Compute Settings for the Databricks JDBC Driver

The driver requires the following compute resource configuration settings:

Table 9-15 Compute Settings for the Databricks JDBC Driver

Setting Description
Host The Databricks compute resource’s Server Hostname value.
Port 443
httpPath The Databricks compute resource’s HTTP Path value.
ssl 1
9.2.16.2.2.1.1 Connection Details for Compute Cluster

To get the connection details for the Databricks compute cluster:

  • Log in to your Databricks workspace.
  • In the sidebar, click Compute.
  • In the list of available clusters, click the target cluster's name.
  • On the Configuration tab, expand Advanced options.
  • In the JDBC/ODBC tab, the Server hostname, Port, and HTTP path can be found.
9.2.16.2.2.1.2 Connection Details for Databricks SQL Warehouse

To get the connection details for the Databricks SQL warehouse:

  • Log in to your Databricks workspace.
  • In the sidebar, click SQL Warehouses.
  • In the list of available warehouses, click the target warehouse’s name.
  • In the Connection details tab, the Server hostname, Port, and HTTP path can be found.
9.2.16.2.2.1.3 Databricks Authentication Methods

The Databricks JDBC Driver supports the following authentication methods:

  • Databricks personal access token
  • Databricks username and password
  • OAuth 2.0 tokens
    • OAuth user-to-machine (U2M) authentication

      Note:

      OAuth U2M or OAuth 2.0 browser-based authentication works only with applications that run locally. It does not work with server-based or cloud-based applications.
    • OAuth machine-to-machine (M2M) authentication
9.2.16.2.2.1.4 Databricks Personal Access Token

To create a Databricks personal access token:

  • In your Databricks workspace, click your Databricks username in the top bar, and then select User Settings from the drop down.
  • Click Developer.
  • Next to Access tokens, click Manage.
  • Click Generate New Token.
  • (Optional) Enter a comment that helps you to identify this token in the future, and change the token’s default lifetime of 90 days.

    To create a token with no lifetime (not recommended), leave the Lifetime (days) box empty (blank).

  • Click Generate.
  • Copy the displayed token to a secure location, and then click Done.
9.2.16.2.2.1.5 Databricks Username and Password

Databricks username and password authentication is also known as Databricks basic authentication. Username and password authentication is possible only if single sign-on is disabled.

To use Databricks username and password for authentication, set the gg.eventhandler.databricks.UserName and gg.eventhandler.databricks.Password properties to the respective values.
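
For example, a minimal sketch with placeholder values (when authenticating with a personal access token instead, set UserName to the literal value token and Password to the token value, as in the end-to-end examples):

gg.eventhandler.databricks.UserName=<databricks-user>
gg.eventhandler.databricks.Password=<databricks-password>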

Note:

In the 23ai release, Oracle has not yet certified authentication using OAuth 2.0 tokens, OAuth user-to-machine (U2M), and OAuth machine-to-machine (M2M).

You can choose to configure the JDBC URL as described in https://docs.databricks.com/en/integrations/jdbc/authentication.html#oauth-20-tokens. For more information about all the authentication mechanisms supported by Databricks, see https://docs.databricks.com/en/integrations/jdbc/authentication.html.

9.2.16.2.3 Unity Catalog

Unity Catalog provides centralized access control, auditing, lineage, and data discovery capabilities across Databricks workspaces.

In Unity Catalog, the hierarchy of primary data objects flows from metastore to table or volume:

  • Metastore: The top-level container for metadata. Each metastore exposes a three-level namespace (catalog.schema.table) that organizes your data.

    Note:

    If your workspace includes a legacy Hive metastore, then the data in that metastore will still be available alongside data defined in Unity Catalog, in a catalog named hive_metastore.
  • Catalog: The first layer of the object hierarchy, used to organize your data assets.
  • Schema: Also known as databases, schemas are the second layer of the object hierarchy and contain tables and views.
  • Tables, views, and volumes: At the lowest level in the data object hierarchy are tables, views, and volumes. Volumes provide governance for non-tabular data.
9.2.16.2.3.1 Managed Tables

Oracle GoldenGate replicates data to Databricks managed tables. Managed tables use the DELTA table format.
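
For illustration, the following is a minimal sketch of a managed table with an informational primary key; the catalog, schema, table, and column names are hypothetical. A primary key (or KEYCOLS in the MAP statement) supplies the key columns needed for stage and merge replication.

-- Hypothetical managed table; managed tables default to the DELTA format.
CREATE TABLE main.gg_target.orders (
  order_id  INT NOT NULL,
  status    STRING,
  amount    DECIMAL(10,2),
  CONSTRAINT orders_pk PRIMARY KEY (order_id)
);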

9.2.16.2.3.1.1 Tables Inside hive_metastore Catalog

If Unity Catalog was enabled on an existing Databricks workspace, then the existing tables are available in the hive_metastore catalog.

The tables under the hive_metastore catalog do not support primary keys.

The Oracle GoldenGate Replicat MAP statement should use KEYCOLS to define the key columns required for stage and merge replication.
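
The following is a minimal sketch of such a MAP statement; the source and target names and the key column are hypothetical.

-- Hypothetical mapping; KEYCOLS supplies the key columns for stage and merge.
MAP PDB1.SALES.ORDERS, TARGET "hive_metastore"."sales"."orders", KEYCOLS (ORDER_ID);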

9.2.16.2.3.2 External Tables

External tables are file-backed tables that reference data stored in an external location.

An external location is an object store such as ADLS Gen2, AWS S3, or GCS.

To manage cloud storage for external tables, Unity Catalog uses the following:
  • Storage Credential: A storage credential allows Databricks to access data in cloud storage.
  • External Location: An external location contains a reference to a storage credential and a cloud storage path.
9.2.16.2.3.3 Create a Storage Credential

A storage credential represents an authentication and authorization mechanism for accessing data stored on your cloud tenant.

9.2.16.2.3.3.1 Storage Credential to Access Azure Storage Account

To create a storage credential to access an Azure storage account:

  • Create an Azure resource called “Access Connector for Azure Databricks”
    • In the Azure portal, search for Access Connector for Azure Databricks and select the Access Connector for Azure Databricks service.
    • Follow the steps to create a connector resource.
    • In the Azure portal, search for “Storage accounts” and select the storage account that should be used as a GoldenGate staging location.
    • In the sidebar, click Access Control (IAM).
    • On Role assignments, click +Add, and then select Add role assignment from the drop-down menu.
    • In the search bar, enter Storage Blob Data Contributor, select the role from the list, and click Next.
    • Click Members: +Select members, in the search bar, enter the name of the Azure Databricks connector resource, and click Select.
    • Back in the Azure portal, search for the Azure Databricks connector resource, and note down its Resource ID.
  • Create a storage credential in Databricks:
    • Log in to your Databricks workspace.
    • In the sidebar, click Catalog.
    • Click External Data, then click Storage Credentials.
    • Click Create Storage Credential.
    • In the Create Storage Credential dialog, enter the following details:
      • Credential Type: Choose Azure Managed Identity from the drop-down menu.
      • Storage credential name: Enter a name for the storage credential.
      • Access connector ID: Enter the resource ID of the Azure Databricks connector resource.
      • Click Advanced Options, check the box Limit to read-only use.
      • Click Create.

Note:

You can create multiple storage credentials for different storage accounts.
9.2.16.2.3.3.2 Storage Credential to Access Google Storage Account

To create a storage credential to access a Google storage account:

  • Create a storage credential to access Google Cloud Storage:
    1. Log in to your Databricks workspace.
    2. In the sidebar, click Catalog.
    3. Click External Data, then click Storage Credentials.
    4. Click Create Storage Credential.
    5. In the Create Storage Credential dialog, enter the following details:
      • Credential Type: Choose Google Service Account from the drop-down menu.
      • Storage credential name: Enter a name for the storage credential.
    6. Click Advanced Options, check the box Limit to read-only use.
    7. Click Create.
    8. On the Storage credential created dialog, make a note of the service account ID, which is in the form of an email address, and click Done.
  • Configure permissions for the service account:
    1. Go to the Google Cloud console and open the GCS bucket that you want to access from your Databricks workspace.

      The bucket should be in the same region as your Databricks workspace.

    2. On the Permissions tab, click + Grant access and assign the service account the following roles:
      • Storage Legacy Bucket Reader
      • Storage Object Admin

        Use the service account’s email address as the principal identifier.

Note:

You can create multiple storage credentials for different storage accounts.

9.2.16.2.4 Create an External Location

External locations associate Unity Catalog storage credentials with cloud object storage containers.

External locations are used to define managed storage locations for catalogs and schemas, and to define locations for external tables and external volumes.

Oracle GoldenGate recommends creating an external location to simplify configuration, but you may decide to skip this step.

9.2.16.2.4.1 External Location to Access Azure Storage Account

To create an external location:

  • Log in to your Databricks workspace.
  • In the sidebar, click Catalog.
  • Click External Data, then click on External Locations.
  • Click Create Location.
  • In the Create a new external location form, enter the following details:
    • External location name: Enter a name for the external location.
    • Storage credential: Choose the storage credential that you created earlier from the drop-down menu.
    • URL: Enter the bucket path that you want to use as the external location. For example:
      abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path>
    • Click Advanced Options, check the box Limit to read-only use.
    • Click Create.

Note:

If an external location is created and tied to a storage credential, then there is no need to set the event handler property gg.eventhandler.databricks.credential.

9.2.16.2.4.2 External Location to Access Google Storage Account

To create an external location:

  • Log in to your Databricks workspace.
  • In the sidebar, click Catalog.
  • Click External Data, and then click External Locations.
  • Click Create Location.
  • In the Create a new external location form, enter the following details:
    • External location name: Enter a name for the external location.
    • Storage credential: Choose the storage credential that you created earlier from the drop-down menu.
    • URL: Enter the bucket path that you want to use as the external location. For example:
      gs://gcs-bucket/ogg
    • Click Advanced Options, check the box Limit to read-only use.
    • Click Create.

9.2.16.2.5 Compute Clusters Without Unity Catalog

Legacy Databricks compute clusters may not support Unity Catalog, or Unity Catalog may not be enabled on some compute clusters.

These compute clusters cannot use storage credentials or external locations.

Access to external object storage should be configured by setting the Spark configuration as follows:
  • Log in to your Databricks workspace.
  • In the sidebar, click Compute.
  • In the list of available clusters, click the target cluster’s name.
  • On the Configuration tab, expand Advanced options.
  • In the Spark tab, you can specify the Spark configuration for the cluster.
9.2.16.2.5.1 Spark Configuration to Access Azure Storage Account

Access to ADLS Gen2 can be configured using the storage account name and key.

The storage account key can be securely stored in Databricks secrets.

The following is an example of setting the Spark configuration for Databricks on Azure:
fs.azure.account.key.storageaccountname.dfs.core.windows.net {{secrets/gg/azureAccountKey-for-storageaccountname}}

In this example, storageaccountname is the Azure storage account name, and {{secrets/gg/azureAccountKey-for-storageaccountname}} is the Databricks secret that contains the storage account key.

9.2.16.2.5.2 Spark Configuration to Access Google Cloud Storage

Access to Google Cloud Storage can be configured using a Google service account that has permission to access the storage bucket.

The following information is required to configure access to GCS:
  • Service account email
  • Google project ID
  • Service account private key
  • Service account private key ID

    The service account private key and private key ID can be securely stored in Databricks secrets.

    The following is an example of setting the Spark configuration for Databricks on Google Cloud:
    google.cloud.auth.service.account.enable true
    fs.gs.auth.service.account.email <client-email>
    fs.gs.project.id <project-id>
    fs.gs.auth.service.account.private.key {{secrets/scope/gsa_private_key}}
    fs.gs.auth.service.account.private.key.id {{secrets/scope/gsa_private_key_id}}
9.2.16.2.5.3 Spark Configuration to Access AWS S3

Access to AWS S3 can be configured in multiple ways. The following environment variables can be set in the Spark configuration to access S3:

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
Edit the Spark Environment variables and set the following:
AWS_SECRET_ACCESS_KEY={{secrets/gg/aws_secret_access_key}}
AWS_ACCESS_KEY_ID={{secrets/gg/aws_access_key_id}}
9.2.16.2.5.4 Creating Databricks Secrets

To create a Databricks secret:

  • Log in to your Databricks workspace.
  • In the sidebar, click Compute.
  • In the list of available clusters, click the target cluster’s name.
  • On the Apps tab, click Web Terminal. This should open up a terminal session tab in the web browser.
  • Create a Databricks secrets scope using databricks secrets create-scope <scope-name>. For example, databricks secrets create-scope gg.
  • Create a secret using
    databricks secrets put-secret --json '{
      "scope": "<scope-name>",
      "key": "<key-name>",
      "string_value": "<secret>"
      }'
    For example:
    databricks secrets put-secret --json '{
      "scope": "gg",
      "key": "storageaccountname",
      "string_value": "---------storage-account-key-----------------"
      }'

    Note:

    These commands were run using Databricks Runtime (DBR) 15.0.

Note:

On Unity Catalog enabled workspaces, the tables inside the hive_metastore catalog cannot use external location or external storage credentials.

The tables inside the hive_metastore catalog also require the Spark cluster configuration listed in this section.

9.2.16.2.6 Classpath Configuration

The Databricks Event Handler uses the Databricks JDBC driver. Ensure that the classpath includes the path to the JDBC driver. You must also include the respective object store Event Handler’s dependencies in the classpath.
9.2.16.2.6.1 Dependencies
Databricks JDBC driver: You can use the Dependency Downloader tool to download the JDBC driver by running the following script: <OGGDIR>/DependencyDownloader/databricks.sh.

Running this script without any input parameters downloads JDBC driver version 2.6.36. The script can also be run with a single argument to download a specific version of the JDBC driver.

For more information about Dependency Downloader, see Dependency Downloader.

Alternatively, you can also download the JDBC driver from Maven Central using the following coordinates:
  <dependency>
    <groupId>com.databricks</groupId>
    <artifactId>databricks-jdbc</artifactId>
    <version>2.6.36</version>
  </dependency>
  • If the staging location is set to ADLS Gen2, the classpath should include the ABS Event Handler dependencies. See ABS Event Handler dependencies.
  • If the staging location is set to S3, the classpath should include the S3 Event Handler dependencies. See S3 Event Handler.
  • If the staging location is set to GCS, the classpath should include the GCS Event Handler dependencies. See GCS Event Handler.
Edit the gg.classpath configuration parameter to include the path to the object store Event Handler dependencies (if external stage is in use) and the Databricks JDBC driver.
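
For example, a minimal sketch of gg.classpath for an ADLS Gen2 stage, assuming the object store dependencies and the JDBC driver were downloaded into the third-party directory (adjust the paths to your environment):

gg.classpath=$THIRD_PARTY_DIR/abs/*:$THIRD_PARTY_DIR/databricks/*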

9.2.16.2.7 Proxy Configuration

When the Replicat process runs behind a proxy server, the JDBC connection URL must be appended with the following property values:

  • UseProxy
  • ProxyHost
  • ProxyPort
For example:
jdbc:databricks://<server-hostname>:443;httpPath=<http-path>[;<setting1>=<value1>;<setting2>=<value2>;<settingN>=<valueN>]
      ;EnableArrow=0;UseProxy=1;ProxyHost=<proxy_host>;ProxyPort=<proxy_port>

9.2.16.2.8 Operation Aggregation

Operation aggregation is the process of aggregating (merging/compressing) multiple operations on the same row into a single output operation based on a threshold.

9.2.16.2.8.1 In-memory Operation Aggregation
  • Operation records can be aggregated in memory; this is the default configuration.
  • You can tune the frequency of the merge interval using the File Writer gg.handler.databricks.fileRollInterval property; the default value is 3m (three minutes).

  • In-memory operation aggregation requires additional JVM memory configuration.
9.2.16.2.8.2 Operation Aggregation using SQL
  • To use SQL aggregation, it is mandatory that the trail files contain uncompressed UPDATE operation records, which means that the UPDATE operations contain the full image of the row being updated.
  • Operation aggregation using SQL can provide better throughput if the trail files contain uncompressed update records.
  • Replicat can aggregate operations using SQL statements by setting gg.aggregate.operations.using.sql=true, as shown in the sketch after this list.
  • You can tune the frequency of the merge interval using the File Writer gg.handler.databricks.fileRollInterval property; the default value is 3m (three minutes).
  • Operation aggregation using SQL does not require additional JVM memory configuration.
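
The following is a minimal sketch of the properties involved in SQL-based aggregation, assuming the trail files contain uncompressed updates:

# Aggregate operations using SQL instead of in-memory aggregation.
gg.aggregate.operations.using.sql=true
# SQL aggregation requires uncompressed update records in the trail.
gg.compressed.update=false
# Optional: tune the merge interval (default 3m).
gg.handler.databricks.fileRollInterval=3m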

9.2.16.2.9 Compressed Update Handling

A compressed update record contains values for the key columns and the modified columns.

An uncompressed update record contains values for all the columns.

GoldenGate trails may contain compressed or uncompressed update records. The default extract configuration writes compressed updates to the trails.

The parameter gg.compressed.update can be set to true or false to indicate compressed/uncompressed update records.
9.2.16.2.9.1 MERGE Statement with Uncompressed Updates

In some use cases, if the trail contains uncompressed update records, then the MERGE SQL statement can be optimized for better performance by setting gg.compressed.update=false.

If you want to use DELETE+INSERT SQL statements instead of a MERGE SQL statement, then set gg.eventhandler.databricks.deleteInsert=true.
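
A minimal sketch of these two settings together (valid only when the trail actually contains uncompressed updates):

gg.compressed.update=false
# Optional: apply changes with DELETE+INSERT instead of MERGE.
gg.eventhandler.databricks.deleteInsert=true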

9.2.16.2.10 End-to-End Configuration

The following is an end-to-end configuration example which uses autoconfiguration.

The sample properties files can also be found in the directory <OGGDIR>/AdapterExamples/big-data/databricks/.

  • dbx-az.props: Configuration using ADLS Gen2 stage for Databricks on Azure.
  • dbx-s3.props: Configuration using S3 stage for Databricks on AWS.
  • dbx-gcs.props: Configuration using GCS stage for Databricks on GCP.
9.2.16.2.10.1 Databricks on Azure
# Configuration to load GoldenGate trail operation records into Databricks using ADLS Gen2 stage.
# Note: Recommended to only edit the configuration marked as  TODO
gg.target=databricks
# Azure Blob Event handler.
gg.eventhandler.abs.bucketMappingTemplate=<azure_adls_gen2_container_name>
gg.eventhandler.abs.accountName=<azure_storage_account_name>
gg.eventhandler.abs.accountKey=<azure_storage_account_key>

# Databricks Event Handler.
#TODO: Edit JDBC ConnectionUrl
gg.eventhandler.databricks.connectionURL=jdbc:databricks://<server-hostname>:443;httpPath=<http-path>[;<setting1>=<value1>;<setting2>=<value2>;<settingN>=<valueN>];EnableArrow=0
#TODO: Edit JDBC username or 'token'
gg.eventhandler.databricks.UserName=token
#TODO: Edit JDBC password
gg.eventhandler.databricks.Password=<password>

#TODO: Edit the classpath to include Azure Blob Event Handler dependencies and Databricks JDBC driver.
gg.classpath=$THIRD_PARTY_DIR/abs/*:$THIRD_PARTY_DIR/databricks/*
#TODO: Provide sufficient memory (at least 8GB).
jvm.bootoptions=-Xmx8g -Xms1g
9.2.16.2.10.2 Databricks on AWS
# Configuration to load GoldenGate trail operation records into Databricks using S3 stage.
# Note: Recommended to only edit the configuration marked as  TODO
gg.target=databricks
gg.stage=s3
#The S3 Event Handler
#TODO: Edit the AWS region
gg.eventhandler.s3.region=<aws region>
#TODO: Edit the AWS S3 bucket
gg.eventhandler.s3.bucketMappingTemplate=<s3 bucket>

# Databricks Event Handler.
#TODO: Edit JDBC ConnectionUrl
gg.eventhandler.databricks.connectionURL=jdbc:databricks://<server-hostname>:443;httpPath=<http-path>[;<setting1>=<value1>;<setting2>=<value2>;<settingN>=<valueN>];EnableArrow=0
#TODO: Edit JDBC username or 'token'
gg.eventhandler.databricks.UserName=token
#TODO: Edit JDBC password
gg.eventhandler.databricks.Password=<password>

#TODO: Edit the classpath to include S3 Event Handler dependencies and Databricks JDBC driver.
gg.classpath=$THIRD_PARTY_DIR/s3/*:$THIRD_PARTY_DIR/databricks/*
#TODO: Provide sufficient memory (at least 8GB).
jvm.bootoptions=-Xmx8g -Xms1g
9.2.16.2.10.3 Databricks on GCP
# Configuration to load GoldenGate trail operation records into Databricks using GCS stage.
# Note: Recommended to only edit the configuration marked as  TODO
gg.target=databricks
gg.stage=gcs
## The GCS Event handler
#TODO: Edit the GCS bucket name
gg.eventhandler.gcs.bucketMappingTemplate=<gcs bucket>
#TODO: Edit the GCS credentialsFile
gg.eventhandler.gcs.credentialsFile=<oggbd-project-credentials.json>

# Databricks Event Handler.
#TODO: Edit JDBC ConnectionUrl
gg.eventhandler.databricks.connectionURL=jdbc:databricks://<server-hostname>:443;httpPath=<http-path>[;<setting1>=<value1>;<setting2>=<value2>;<settingN>=<valueN>];EnableArrow=0
#TODO: Edit JDBC username or 'token'
gg.eventhandler.databricks.UserName=token
#TODO: Edit JDBC password
gg.eventhandler.databricks.Password=<password>

#TODO: Edit the classpath to include GCS Event Handler dependencies and Databricks JDBC driver.
gg.classpath=$THIRD_PARTY_DIR/gcs/*:$THIRD_PARTY_DIR/databricks/*
#TODO: Provide sufficient memory (at least 8GB).
jvm.bootoptions=-Xmx8g -Xms1g

9.2.16.2.11 Table Mapping

If the MAP statement does not specify a target catalog, then the default catalog for the Databricks workspace will be used. The handler will log the default catalog during initialization. Example log message: Connection catalog is set to [dbx-catalog].

9.2.16.2.11.1 Mapping Table

Table 9-16

MAP statement in the Replicat parameter file Databricks Catalog Databricks Schema Databricks Table
MAP SCHEMA_1.TABLE_1, TARGET "schema_1"."table_1"; Default catalog schema_1 table_1
MAP DB_1.SCHEMA_1.TABLE_1, TARGET "db_1"."schema_1"."table_1"; db_1 schema_1 table_1

9.2.16.3 Troubleshooting and Diagnostics

  • Unsupported Databricks data types:
    • ARRAY
    • MAP
    • STRUCT
    • INTERVAL
    • VOID
  • Databricks JDBC Driver Exception:
    java.lang.ClassCastException: class
              org.apache.logging.log4j.core.lookup.ResourceBundleLookup

    This exception has been observed with the Databricks JDBC driver version 2.6.36. The workaround is to set the property EnableArrow=0 in the JDBC connection URL.

    Oracle is working with Databricks to address this.

  • org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.AnalysisException: Cannot add column ‘<column name>’ with type ‘void’. Please explicitly specify a non-void type.

    If the Databricks table defines a column with data type VOID, all DML operations will fail even if the VOID column is not mapped by Oracle GoldenGate.

    To proceed with the replication, you should drop the VOID column from the target table.

  • Databricks INTERVAL data type:

    If the target table contains an INTERVAL type, the Databricks JDBC driver ignores the presence of such a column.

    You cannot map any source column to an INTERVAL type. You should also avoid defining INTERVAL types in the target table without a default value and with a NOT NULL constraint.

  • Connectivity issues to Databricks:
    • Validate JDBC connection URL, username, and password.
    • Check proxy configuration if running Replicat process behind a proxy.
  • DDL not applied on the target table: GoldenGate for Distributed Applications and Analytics does not support DDL replication.
  • SQL Errors: In case there are any errors while executing any SQL, the SQL statements along with the bind parameter values are logged into the GoldenGate for Distributed Applications and Analytics handler log file.
  • Co-existence of the components: When using an external stage location (ADLS Gen2, S3, or GCS), the region of the machine where the Replicat process is running and the object store’s region have an impact on the overall throughput of the apply process.

    For the best possible throughput, the components need to be located ideally in the same region or as close as possible.

  • Replicat ABEND due to partial LOB records in the trail file: Oracle GoldenGate for Distributed Applications and Analytics does not support replication of partial LOB data.

    The trail file needs to be regenerated by Oracle Integrated capture using TRANLOGOPTIONS FETCHPARTIALLOB option in the Extract parameter file.

  • When replicating to more than ten target tables, the parameter maxConnections can be increased to a higher value, which can improve throughput.
  • Identity column in the target table:
    • If the target table contains an Identity column, then the MERGE statement would fail because the Identity column cannot be updated.
      Example Error Message:
      Query: MERGE INTO***, Error message from Server: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.AnalysisException: UPDATE on IDENTITY column "col9" is not supported.
      To proceed, review the following points:
    • If the Identity column is defined using GENERATED ALWAYS AS IDENTITY, then Replicat would result in the following error: Error message from Server: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.AnalysisException: Providing values for GENERATED ALWAYS AS IDENTITY column col9 is not supported. To proceed further, the Identity column should be excluded from mapping on the source database using COLSEXCEPT (see the example after this list) or removed from the target table.
    • If the Identity column is defined using GENERATED BY DEFAULT AS IDENTITY, then Replicat can be configured to use DELETE-INSERT instead of MERGE by setting gg.eventhandler.databricks.deleteInsert=true provided the prerequisites for enabling DELETE-INSERT are met.
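
For example, a minimal sketch of excluding a hypothetical Identity column col9 on the source using COLSEXCEPT in the Extract parameter file (the source table name is also hypothetical):

-- Hypothetical Extract TABLE statement; all columns except col9 are captured.
TABLE PDB1.SALES.ORDERS, COLSEXCEPT (col9);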