8.2.13 Azure Synapse Analytics

Microsoft Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing and big data analytics.

8.2.13.1 Detailed Functionality

Replication to Synapse uses stage and merge data flow.

The change data is staged in a temporary location in micro-batches and eventually merged into the target table.

Azure Data Lake Storage (ADLS) Gen 2 is used as the staging area for change data.

The Synapse Event handler is used as a downstream Event handler connected to the output of the Parquet Event handler.

The Parquet Event handler loads files generated by the File Writer Handler into ADLS Gen2.

The Synapse Event handler executes SQL statements to merge the operation records staged in ADLS Gen2.

The SQL operations are performed in batches providing better throughput.

Oracle GoldenGate for BigData uses the MERGE SQL statement or a combination of DELETE and INSERT SQL statements to perform the merge operation.

8.2.13.1.1 Database User Privileges

Database user used for replication has to be granted the following privileges:

  • INSERT, UPDATE, DELETE, and TRUNCATE on the target tables.
  • CREATE and DROP Synapse external file format.
  • CREATE and DROP Synapse external data source.
  • CREATE and DROP Synapse external table.

8.2.13.1.2 Merge SQL Statement

The merge SQL statement for Azure Synapse Analytics was made generally available during the later part of the year 2022 and therefore Oracle GoldenGate for Big Data uses merge statement by default. To disable merge SQL, ensure that a Java System property is set in the jvm.bootoptions parameter.

For example:
jvm.bootoptions=-Dsynapse.use.merge.sql=false

8.2.13.1.3 Prerequisites

The following are the prerequisites:

  • Uncompressed UPDATE records: If Oracle GoldenGate is configured to not use merge statement (see Merge SQL Statement), then it is mandatory that the trail files used to apply to Synapse contain uncompressed UPDATE operation records, which means that the UPDATE operations contain full image of the row being updated. If UPDATE records have missing columns, then replicat will ABEND on detecting a compressed UPDATE trail record.
  • If Oracle GoldenGate is configured to use merge statement (see Merge SQL Statement), then the target table must be a hash distributed table.
  • Target table existence: The target tables should exist on the Synapse database.
  • Azure storage account: An Azure storage account and container should exist.

    Oracle recommends co-locating the Azure Synapse workspace, and the Azure storage account in the same azure region.

  • If Oracle GoldenGate is configured to use merge statement, then the target table cannot define IDENTITY columns because Synapse merge statement does not support inserting data into IDENTITY columns. For more information about merging SQL statement, see Merge SQL Statement.

8.2.13.2 Configuration

8.2.13.2.1 Automatic Configuration

Synapse replication involves configuration of multiple components, such as File Writer handler, Parquet Event handler, and Synapse Event handler.

The Automatic Configuration functionality helps to auto configure these components so that the user configuration is minimal.

The properties modified by auto configuration will also be logged in the handler log file.

To enable auto-configuration to replicate to Synapse target we need to set the parameter as follows: gg.target=synapse.

When replicating to Synapse target, customization of Parquet Event handler name and Synapse Event handler name is not allowed.

8.2.13.2.1.1 File Writer Handler Configuration
File writer handler name is pre-set to the value synapse. The following is an example to edit a property of File Writer handler:
gg.handler.synapse.pathMappingTemplate=./dirout
8.2.13.2.1.2 Synapse Event Handler Configuration

Synapse Event Handler name is pre-set to the value synapse.

Table 8-13 Synapse Event Handler Configuration

Properties Required/Optional Legal Values Default Explanation
gg.eventhandler.synapse.connectionURL Required
jdbc:sqlserver://<synapse-
workspace>.sql.azuresynapse.net:1433;database=
<db-name>;encrypt=true;
trustServerCertificate=false;
hostNameInCertificate=*.sql.azuresynapse.net;
loginTimeout=300;
None JDBC URL to connect to Synapse.
gg.eventhandler.synapse.UserName Required Database username. None Synapse database user in the Synapse workspace. The username has to be qualified with the Synapse workspace name. Example: sqladminuser@synapseworkspace.
gg.eventhandler.synapse.Password Required Supported database string. None Synapse database password.
gg.eventhandler.synapse.credential Required Credential name. None Synapse database credential name to access Azure Data Lake Gen2 files. See Synapse Database Credentials for steps to create credential.
gg.eventhandler.synapse.maxConnnections Optional Integer value 10 Use this parameter to control the number of concurrent JDBC database connections to the target Synapse database.
gg.eventhandler.synapse.dropStagingTablesOnShutdown Optional true or false false If set to true, the temporary staging tables created by GoldenGate will be dropped on replicat graceful stop.
gg.maxInlineLobSize Optional Integer Value 16000 This parameter can be used to set the maximum inline size of large object (LOB) columns in bytes. For more information, see Large Object (LOB) Performance.
gg.aggregate.operations.flush.interval Optional Integer 30000 The flush interval parameter determines how often the data gets merged into Synapse. The value is set in milliseconds. Use with caution! The higher the value, larger data will have to be stored in the memory of the Replicat process.

Use the flush interval parameter with caution. Increasing its default value increases the amount of data stored in the internal memory of the Replicat. This can cause out-of-memory errors and stop the Replicat if it runs out of memory.

gg.operation.aggregator.validate.keyupdate Optional true or false false If set to true, Operation Aggregator will validate key update operations (optype 115) and correct to normal update if no key values have changed. Compressed key update operations do not qualify for merge.
gg.compressed.update Optional true or false true If set the true, then this indicates that the source trail files contain compressed update operations. If set to true, then the source trail files are expected to contain uncompressed update operations.
gg.eventhandler.synapse.connectionRetryIntervalSeconds Optional Integer Value 30 Specifies the delay in seconds between connection retry attempts.
gg.eventhandler.synapse.connectionRetries Optional Integer Value 3 Specifies the number of times connections to the target data warehouse will be retried.

8.2.13.2.2 Synapse Database Credentials

To allow Synapse to access the data files in Azure Data Lake Gen2 storage account, follow the steps to create a database credential:
  1. Connect to the respective Synapse SQL dedicated pool using the Azure Web SQL console (https://web.azuresynapse.net/en-us/).
  2. Create a DB master key if one does not already exist, using your own password.
  3. Create a database scoped credential. This credential allows Oracle GoldenGate replicat process to access Azure Storage Account.

    Provide the Azure Storage Account name and Access key when creating this credential.

    Storage Account Access keys can be retrieved from the Azure cloud console.

For example:
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Your own password' ;
CREATE DATABASE SCOPED CREDENTIAL OGGBD_ADLS_credential
WITH
-- IDENTITY = '<storage_account_name>' ,
  IDENTITY = 'sanavaccountuseast' ,
-- SECRET = '<storage_account_key>'
  SECRET = 'c8C0yR-this-is-a-fake-access-key-Gc9c5mENOJ1mLyxlO1vSRDlRG0/Ke+tbAvi6xe73HAAhLtdMFZRA=='
;

8.2.13.2.3 Classpath Configuration

Synapse Event handler relies on the upstream File Writer handler and the Parquet Event handler.

8.2.13.2.3.1 Dependencies
  • Microsoft SQLServer JDBC driver: The JDBC driver can be downloaded from Maven central using the following co-ordinates.
      <dependency>
                    <groupId>com.microsoft.sqlserver</groupId>
                    <artifactId>mssql-jdbc</artifactId>
                    <version>8.4.1.jre8</version>
                    <scope>provided</scope>
                </dependency>
    
Alternatively, the JDBC driver can also be downloaded using the script <OGGDIR>/DependencyDownloader/synapse.sh.
  • Parquet Event handler dependencies: See unresolvable-reference.html#GUID-E62904D1-EBFD-4C84-B112-97D1F5D4BE6F to configure classpath to include Parquet dependencies.

  • Hadoop Depedencies: Hadoop dependencies can be downloaded using dependency downloader <OGGDIR>/DependencyDownloader/hadoop.sh
  • Azure Storage dependencies: Azure Storage dependencies can be downloaded from Maven using the following co-ordinates.
<dependencies>
    <dependency>
      <groupId>com.azure</groupId>
      <artifactId>azure-storage-blob</artifactId>
      <version>12.13.0</version>
    </dependency>
    <dependency>
      <groupId>com.azure</groupId>
      <artifactId>azure-identity</artifactId>
      <version>1.3.3</version>
    </dependency>
</dependencies>
8.2.13.2.3.2 Classpath
Edit the gg.classpath configuration parameter to include the path to the Parquet Event Handler, Synapse JDBC, Azure Storage and hadoop dependencies.
For example:
gg.classpath=/path/to/parquet-deps/*:/path/to/abs-deps/*:path/to/synapse-deps/mssql-jdbc-8.4.1.jre8.jar:/path/to/hadoop-deps/*

8.2.13.2.4 INSERTALLRECORDS Support

Stage and merge targets supports INSERTALLRECORDS parameter.

See INSERTALLRECORDS in Reference for Oracle GoldenGate. Set the INSERTALLRECORDS parameter in the Replicat parameter file (.prm). Set the INSERTALLRECORDS parameter in the Replicat parameter file (.prm)

Setting this property directs the Replicat process to use bulk insert operations to load operation data into the target table. You can tune the batch size of bulk inserts using the File Writer property gg.handler.synapse.maxFileSize. The default value is set to 1GB. The frequency of bulk inserts can be tuned using the File Writer property gg.handler.synapse.fileRollInterval, the default value is set to 3m (three minutes).

Note:

  • When using the Synapse internal stage, the staging files can be compressed by setting gg.handler.synapse.putSQLAutoCompress to true.

8.2.13.2.5 Large Object (LOB) Performance

The presence of large object (LOB) columns can impact Replicat's apply performance. Any LOB column changes that exceed the inline threshold gg.maxInlineLobSize does not qualify for batch processing and such operations gets slower.

If the compute machine has sufficient RAM, you can increase this parameter to speed up processing.

8.2.13.2.6 End-to-End Configuration

The following is an end-end configuration example which uses auto-configuration for FW handler, Parquet and Synapse Event handlers.

This sample properties file can also be found in the directory AdapterExamples/big-data/synapse/synapse.props:

# Azure Synapse Analytics Data Warehouse Template
# Configuration to load GoldenGate trail operation records into Azure Synapse Analytics by chaining
# File writer handler -> Parquet Event handler -> Synapse Event handler.
# Note: Recommended to only edit the configuration marked as  TODO

gg.target=synapse

#The Parquet Event Handler
No properties are required for the Parquet Event handler.
#gg.eventhandler.parquet.finalizeAction=delete

ADLS Gen 2 stage (Using Azure Blob SDK). 
#Azure Blob Event handler 
#TODO: Edit the Azure Blob Storage container name
gg.eventhandler.abs.bucketMappingTemplate=<abs-container-name> 
#TODO: Edit the Azure storage.account name. gg.eventhandler.abs.accountName=<storage-account-name> 
#TODO: Edit the Azure storage account key.
gg.eventhandler.abs.accountKey=<storage-account-key>

#The Synapse Event Handler
#TODO: Edit JDBC ConnectionUrl
gg.eventhandler.synapse.connectionURL=jdbc:sqlserver://<synapse-workspace>.sql.azuresynapse.net:1433;database=<db-name>;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.sql.azuresynapse.net;loginTimeout=300;
#TODO: Edit JDBC user name
gg.eventhandler.synapse.UserName=<db user name>@<synapse-workspace>
#TODO: Edit JDBC password
gg.eventhandler.synapse.Password=<db password>
#TODO: Edit Credential to access Azure storage.
gg.eventhandler.synapse.credential=OGGBD_ADLS_credential
#TODO: Edit the classpath to include dependencies for Parquet Event Handler, ABS Event handler and the Synapse JDBC driver.
gg.classpath=/path/to/parquet-deps/:/path/to/abs-deps/:path/to/synapse-deps/mssql-jdbc-8.4.1.jre8.jar:/path/to/hadoop-deps/*
#TODO: Provide sufficient memory (at least 8GB).
jvm.bootoptions=-Xmx8g -Xms8g -DSYNAPSE_STAGE=parquet,abs

8.2.13.2.7 Compressed Update Handling

A compressed update record contains values for the key columns and the modified columns.

An uncompressed update record contains values for all the columns.

Oracle GoldenGate trails may contain compressed or uncompressed update records. The default extract configuration writes compressed updates to the trails.

The parameter gg.compressed.update can be set to true or false to indicate compressed/uncompressed update records.

8.2.13.2.7.1 MERGE Statement with Uncompressed Updates

In some use cases, if the trail contains uncompressed update records, then the MERGE SQL statement can be optimized for better performance by setting gg.compressed.update=false.

8.2.13.3 Troubleshooting and Diagnostics

  • Connectivity Issues to Synapse:
    • Validate JDBC connection URL, username and password.
    • Check if http/https proxy is enabled. Synapse does not support connections over http(s) proxy.
  • DDL not applied on the target table: Oracle GoldenGate for BigData does not support DDL replication.
  • Target table existence: It is expected that the Synapse target table exists before starting the replicat process. replicat process will ABEND if the target table is missing.
  • SQL Errors: In case there are any errors while executing any SQL, the entire SQL statement along with the bind parameter values are logged into the OGGBD handler log file.
  • Co-existence of the components: The location/region of the machine where replicat process is running, Azure Data Lake Storage container region and the Synapse region would impact the overall throughput of the apply process. Data flow is as follows: Oracle GoldenGate -> Azure Data Lake Gen 2 -> Synapse. For best throughput, the components need to located as close as possible.
  • Replicat ABEND due to partial LOB records in the trail file: Oracle GoldenGate for Big Data Synapse apply does not support replication of partial LOB. The trail file needs to be regenerated by Oracle Integrated capture using TRANLOGOPTIONS FETCHPARTIALLOB option in the extract parameter file.
  • Error:com.microsoft.sqlserver.jdbc.SQLServerException: Conversion failed when converting date and/or time from character string:

    This occurs when the source datetime column and target datetime column are incompatible.

    For example: A case where the source column is a timestamp type, and the target column is Synapse time.

  • If the Synapse table or column names contain double quotes, then Oracle GoldenGate for Big Data replicat will ABEND.
  • Error: com.microsoft.sqlserver.jdbc.SQLServerException: HdfsBridge::recordReaderFillBuffer. This indicates that the data in the external table backed by Azure Data Lake file is not readable. Contact Oracle support.
  • IDENTITY column in the target table: The Synapse MERGE statement does not support inserting data into IDENTITY columns. Therefore, if MERGE statement is enabled using jvm.bootoptions=-Dsynapse.use.merge.sql=true, then Replicat will ABEND with following error message:
    Exception:
    com.microsoft.sqlserver.jdbc.SQLServerException: Cannot update identity column 'ORDER_ID'
  • Error: com.microsoft.sqlserver.jdbc.SQLServerException: Merge statements with a WHEN NOT MATCHED [BY TARGET] clause must target a hash distributed table:

    This indicates that merge SQL statement is on and Synapse target table is not a hash distributed table. You need to create the target table with a hash distribution.