Azure Synapse Analytics Data Warehouse

8.2.14 Azure Synapse Analytics Data Warehouse

Microsoft Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing and Oracle GoldenGate for Distributed Applications and Analytics (GG for DAA) analytics.

Parent topic: Target

8.2.14.1 Detailed Functionality

Replication to Synapse uses stage and merge data flow.

The change data is staged in a temporary location in micro-batches and eventually merged into the target table.

Azure Data Lake Storage (ADLS) Gen 2 is used as the staging area for change data.

The Synapse Event handler is used as a downstream Event handler connected to the output of the Parquet Event handler.

The Parquet Event handler loads files generated by the File Writer Handler into ADLS Gen2.

The Synapse Event handler executes SQL statements to merge the operation records staged in ADLS Gen2.

The SQL operations are performed in batches providing better throughput.

Oracle GoldenGate for Distributed Applications and Analytics (GG for DAA) uses the MERGE SQL statement or a combination of DELETE and INSERT SQL statements to perform the merge operation.

Parent topic: Azure Synapse Analytics Data Warehouse

8.2.14.1.1 Database User Privileges

Database user used for replication has to be granted the following privileges:

INSERT, UPDATE, DELETE, and TRUNCATE on the target tables.
CREATE and DROP Synapse external file format.
CREATE and DROP Synapse external data source.
CREATE and DROP Synapse external table.

Parent topic: Detailed Functionality

8.2.14.1.2 Merge SQL Statement

The merge SQL statement for Azure Synapse Analytics was made generally available during the later part of the year 2022 and therefore Oracle GoldenGate for Distributed Applications and Analytics (GG for DAA) uses merge statement by default. To disable merge SQL, ensure that a Java System property is set in the jvm.bootoptions parameter.

For example:

jvm.bootoptions=-Dsynapse.use.merge.sql=false

Parent topic: Detailed Functionality

8.2.14.1.3 Prerequisites

The following are the prerequisites:

Uncompressed UPDATE records: If Oracle GoldenGate is configured to not use merge statement (see Merge SQL Statement), then it is mandatory that the trail files used to apply to Synapse contain uncompressed UPDATE operation records, which means that the UPDATE operations contain full image of the row being updated. If UPDATE records have missing columns, then replicat will ABEND on detecting a compressed UPDATE trail record.
If Oracle GoldenGate is configured to use merge statement (see Merge SQL Statement), then the target table must be a hash distributed table.
Target table existence: The target tables should exist on the Synapse database.
Azure storage account: An Azure storage account and container should exist.
Oracle recommends co-locating the Azure Synapse workspace, and the Azure storage account in the same azure region.
If Oracle GoldenGate is configured to use merge statement, then the target table cannot define IDENTITY columns because Synapse merge statement does not support inserting data into IDENTITY columns. For more information about merging SQL statement, see Merge SQL Statement.

Parent topic: Detailed Functionality

8.2.14.2 Operation Aggregation

Operation aggregation is the process of aggregating (merging/compressing) multiple operations on the same row into a single output operation based on a threshold.

Parent topic: Azure Synapse Analytics Data Warehouse

8.2.14.2.1 In-Memory Operation Aggregation

Operation records are aggregated in-memory by default.
The gg.aggregate.operations.flush.interval property has been deprecated and is no longer supported. If gg.aggregate.operations.flush.interval is used in GG for DAA 23ai, then replicat will run; but add a warning to log file about the property being deprecated and not supported.
To control the time window for aggregation, use the gg.handler.snowflake.fileRollInterval property. By default, it is set to 3 minutes. Longer intervals will increase latency, and may increase memory usage. Shorter intervals will increase overhead in Oracle GoldenGate and the target database.
Operation aggregation in-memory requires additional JVM memory configuration.

Parent topic: Operation Aggregation

8.2.14.2.2 Operation Aggregation Using SQL

To use SQL aggregation, it is mandatory that the trail files contain uncompressed UPDATE operation records, which means that the UPDATE operations contain full image of the row being updated.
Operation aggregation using SQL can provide better throughput if the trails files contains uncompressed update records.
Replicat can aggregate operations using SQL statements by setting the gg.aggregate.operations.using.sql=true.
You can tune the frequency of merge interval using the File writer gg.handler.synapse.fileRollInterval property, the default value is set to 3m (three minutes).
Operation aggregation using SQL does not require additional JVM memory configuration.

Parent topic: Operation Aggregation

8.2.14.3 Compressed Update Handling

A compressed update record contains values for the key columns and the modified columns.

An uncompressed update record contains values for all the columns.

Oracle GoldenGate trails may contain compressed or uncompressed update records. The default extract configuration writes compressed updates to the trails. The parameter gg.compressed.update can be set to true or false to indicate compressed or uncompressed update records.

The default extract configuration writes compressed updates to the trails. The parameter gg.compressed.update can be set to true or false to indicate compressed/uncompressed update records.

MERGE statement with Uncompressed Updates

Parent topic: Azure Synapse Analytics Data Warehouse

8.2.14.3.1 MERGE statement with Uncompressed Updates

In some use cases, if the trail contains uncompressed update records, then the MERGE SQL statement can be optimized for better performance by setting gg.compressed.update=false.

If you want to use DELETE+INSERT SQL statements instead of a MERGE SQL statement, then set gg.eventhandler.synapse.deleteInsert=true.

Parent topic: Compressed Update Handling

8.2.14.4 Configuration

Parent topic: Azure Synapse Analytics Data Warehouse

8.2.14.4.1 Automatic Configuration

Synapse replication involves configuration of multiple components, such as File Writer handler, Parquet Event handler, and Synapse Event handler.

The Automatic Configuration functionality helps to auto configure these components so that the user configuration is minimal.

The properties modified by auto configuration will also be logged in the handler log file.

To enable auto-configuration to replicate to Synapse target we need to set the parameter as follows: gg.target=synapse.

When replicating to Synapse target, customization of Parquet Event handler name and Synapse Event handler name is not allowed.

Parent topic: Configuration

8.2.14.4.1.1 File Writer Handler Configuration

File writer handler name is pre-set to the value synapse. The following is an example to edit a property of File Writer handler:

gg.handler.synapse.pathMappingTemplate=./dirout

Parent topic: Automatic Configuration

8.2.14.4.1.2 Parquet Event Handler Configuration

The Parquet Event Handler name is pre-set to the value parquet. The Parquet Event Handler is auto-configured to write to HDFS. The hadoop configuration file core-site.xml must be configured to write data files to the respective container in the Azure Data Lake Storage(ADLS) Gen2 account. See Azure Data Lake Gen2 using Hadoop Client and ABFS.

The following is an example to edit a property of Parquet Event handler:

gg.eventhandler.parquet.finalizeAction=delete

Parent topic: Automatic Configuration

8.2.14.4.1.3 Synapse Event Handler Configuration

Synapse Event Handler name is pre-set to the value synapse.

Table 8-12 Synapse Event Handler Configuration

Properties	Required/Optional	Legal Values	Default	Explanation
`gg.eventhandler.synapse.connectionURL`	Required	jdbc:sqlserver://<synapse- workspace>.sql.azuresynapse.net:1433;database= <db-name>;encrypt=true; trustServerCertificate=false; hostNameInCertificate=*.sql.azuresynapse.net; loginTimeout=300;	None	JDBC URL to connect to Synapse.
`gg.eventhandler.synapse.UserName`	Required	Database username.	None	Synapse database user in the Synapse workspace. The username has to be qualified with the Synapse workspace name. Example: `sqladminuser@synapseworkspace`.
`gg.eventhandler.synapse.Password`	Required	Supported database string.	None	Synapse database password.
`gg.eventhandler.synapse.credential`	Required	Credential name.	None	Synapse database credential name to access Azure Data Lake Gen2 files. See Synapse Database Credentials for steps to create credential.
`gg.eventhandler.synapse.maxConnnections`	Optional	Integer value	10	Use this parameter to control the number of concurrent JDBC database connections to the target Synapse database.
`gg.eventhandler.synapse.dropStagingTablesOnShutdown`	Optional	`true` or `false`	`false`	If set to `true`, the temporary staging tables created by GoldenGate will be dropped on replicat graceful stop.
`gg.maxInlineLobSize`	Optional	Integer Value	16000	This parameter can be used to set the maximum inline size of large object (LOB) columns in bytes. For more information, see Large Object (LOB) Performance.
`gg.handler.synapse.fileRollInterval`	Optional	The default unit of measure is milliseconds. You can stipulate ms, s, m, h to signify milliseconds, seconds, minutes, or hours respectively. Examples of legal values include 10000, 10000ms, 10s, 10m, or 1.5h. Values of 0 or less indicate that file rolling on time is turned off.	3m (three minutes)	The parameter determines how often the data will be merged into Synapse. Use with caution, the higher this value is the more data will need to be stored in the memory of the Replicat process. Note: Use the parameter with caution. Increasing its default value (3m) will increase the amount of data stored in the internal memory of the Replicat. This can cause out of memory errors and stop the Replicat if it runs out of memory.
`gg.compressed.update`	Optional	`true` or `false`	`true`	If set the `true`, then this indicates that the source trail files contain compressed update operations. If set to `true`, then the source trail files are expected to contain uncompressed update operations.
`gg.eventhandler.synapse.connectionRetryIntervalSeconds`	Optional	Integer Value	30	Specifies the delay in seconds between connection retry attempts.
`gg.eventhandler.synapse.connectionRetries`	Optional	Integer Value	3	Specifies the number of times connections to the target data warehouse will be retried.
`gg.validate.keyupdate`	Optional	`true` or `false`	`false`	If set to `true`, Replicat will validate key update operations (optype 115) and correct to normal update if no key values have changed. Compressed key update operations do not qualify for merge.
`gg.eventhandler.synapse.deleteInsert`	Optional	`true` or `false`	`false`	If set to `true`, Replicat will merge records using SQL `DELETE+INSERT` statements instead of SQL `MERGE` statement. Note: Applicable only if `gg.compressed.update` is set to `false`.

Parent topic: Automatic Configuration

8.2.14.4.2 Synapse Database Credentials

To allow Synapse to access the data files in Azure Data Lake Gen2 storage account, follow the steps to create a database credential:

Connect to the respective Synapse SQL dedicated pool using the Azure Web SQL console (https://web.azuresynapse.net/en-us/).
Create a DB master key if one does not already exist, using your own password.
Create a database scoped credential. This credential allows Oracle GoldenGate replicat process to access Azure Storage Account.
Provide the Azure Storage Account name and Access key when creating this credential.

Storage Account Access keys can be retrieved from the Azure cloud console.

For example:

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Your own password' ;
CREATE DATABASE SCOPED CREDENTIAL OGGBD_ADLS_credential
WITH
-- IDENTITY = '<storage_account_name>' ,
  IDENTITY = 'sanavaccountuseast' ,
-- SECRET = '<storage_account_key>'
  SECRET = 'c8C0yR-this-is-a-fake-access-key-Gc9c5mENOJ1mLyxlO1vSRDlRG0/Ke+tbAvi6xe73HAAhLtdMFZRA=='
;

Parent topic: Configuration

8.2.14.4.3 Classpath Configuration

Synapse Event handler relies on the upstream File Writer handler and the Parquet Event handler.

Parent topic: Configuration

8.2.14.4.3.1 Dependencies

Microsoft SQLServer JDBC driver: The JDBC driver can be downloaded from Maven central using the following co-ordinates.

  <dependency>
                <groupId>com.microsoft.sqlserver</groupId>
                <artifactId>mssql-jdbc</artifactId>
                <version>8.4.1.jre8</version>
                <scope>provided</scope>
            </dependency>

Alternatively, the JDBC driver can also be downloaded using the script <OGGDIR>/DependencyDownloader/synapse.sh.

Parquet Event handler dependencies: See Parquet Event Handler Configuration to configure classpath to include Parquet dependencies.
Hadoop Depedencies: Hadoop dependencies can be downloaded using dependency downloader<OGGDIR>/DependencyDownloader/hadoop.sh
Azure Storage dependencies: Azure Storage dependencies can be downloaded from Maven using the following co-ordinates.

<dependencies>
    <dependency>
      <groupId>com.azure</groupId>
      <artifactId>azure-storage-blob</artifactId>
      <version>12.13.0</version>
    </dependency>
    <dependency>
      <groupId>com.azure</groupId>
      <artifactId>azure-identity</artifactId>
      <version>1.3.3</version>
    </dependency>
</dependencies>

Parent topic: Classpath Configuration

8.2.14.4.3.2 Classpath

Edit the gg.classpath configuration parameter to include the path to the Parquet Event Handler, Synapse JDBC, Azure Storage and hadoop dependencies.

For example:

gg.classpath=/path/to/parquet-deps/*:/path/to/abs-deps/*:path/to/synapse-deps/mssql-jdbc-8.4.1.jre8.jar:/path/to/hadoop-deps/*

Parent topic: Classpath Configuration

8.2.14.4.4 INSERTALLRECORDS Support

Stage and merge targets supports INSERTALLRECORDS parameter.

See INSERTALLRECORDS in Reference for Oracle GoldenGate. Set the INSERTALLRECORDS parameter in the Replicat parameter file (.prm). Set the INSERTALLRECORDS parameter in the Replicat parameter file (.prm)

Setting this property directs the Replicat process to use bulk insert operations to load operation data into the target table. You can tune the batch size of bulk inserts using the File Writer property gg.handler.synapse.maxFileSize. The default value is set to 1GB. The frequency of bulk inserts can be tuned using the File Writer property gg.handler.synapse.fileRollInterval, the default value is set to 3m (three minutes).

Note:

When using the Synapse internal stage, the staging files can be compressed by setting gg.handler.synapse.putSQLAutoCompress to true.

Parent topic: Configuration

8.2.14.4.5 Large Object (LOB) Performance

The presence of large object (LOB) columns can impact Replicat's apply performance. Any LOB column changes that exceed the inline threshold gg.maxInlineLobSize does not qualify for batch processing and such operations gets slower.

If the compute machine has sufficient RAM, you can increase this parameter to speed up processing.

Parent topic: Configuration

8.2.14.4.6 End-to-End Configuration

The following is an end-end configuration example which uses auto-configuration for FW handler, Parquet and Synapse Event handlers.

This sample properties file can also be found in the directory AdapterExamples/big-data/synapse/synapse.props:

# Azure Synapse Analytics Data Warehouse Template
# Configuration to load GoldenGate trail operation records into Azure Synapse Analytics by chaining
# File writer handler -> Parquet Event handler -> Synapse Event handler.
# Note: Recommended to only edit the configuration marked as  TODO

gg.target=synapse

#The Parquet Event Handler
No properties are required for the Parquet Event handler.
#gg.eventhandler.parquet.finalizeAction=delete

ADLS Gen 2 stage (Using Azure Blob SDK). 
#Azure Blob Event handler 
#TODO: Edit the Azure Blob Storage container name
gg.eventhandler.abs.bucketMappingTemplate=<abs-container-name> 
#TODO: Edit the Azure storage.account name. gg.eventhandler.abs.accountName=<storage-account-name> 
#TODO: Edit the Azure storage account key.
gg.eventhandler.abs.accountKey=<storage-account-key>

#The Synapse Event Handler
#TODO: Edit JDBC ConnectionUrl
gg.eventhandler.synapse.connectionURL=jdbc:sqlserver://<synapse-workspace>.sql.azuresynapse.net:1433;database=<db-name>;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.sql.azuresynapse.net;loginTimeout=300;
#TODO: Edit JDBC user name
gg.eventhandler.synapse.UserName=<db user name>@<synapse-workspace>
#TODO: Edit JDBC password
gg.eventhandler.synapse.Password=<db password>
#TODO: Edit Credential to access Azure storage.
gg.eventhandler.synapse.credential=OGGBD_ADLS_credential
#TODO: Edit the classpath to include dependencies for Parquet Event Handler, ABS Event handler and the Synapse JDBC driver.
gg.classpath=/path/to/parquet-deps/:/path/to/abs-deps/:path/to/synapse-deps/mssql-jdbc-8.4.1.jre8.jar:/path/to/hadoop-deps/*
#TODO: Provide sufficient memory (at least 8GB).
jvm.bootoptions=-Xmx8g -Xms8g -DSYNAPSE_STAGE=parquet,abs

Parent topic: Configuration

8.2.14.5 Troubleshooting and Diagnostics

Connectivity Issues to Synapse:
- Validate JDBC connection URL, username and password.
- Check if http/https proxy is enabled. Synapse does not support connections over http(s) proxy.
DDL not applied on the target table: Oracle GoldenGate for Distributed Applications and Analytics (GG for DAA) does not support DDL replication.
Target table existence: It is expected that the Synapse target table exists before starting the replicat process. replicat process will ABEND if the target table is missing.
SQL Errors: In case there are any errors while executing any SQL, the entire SQL statement along with the bind parameter values are logged into the GG for DAA handler log file.
Co-existence of the components: The location/region of the machine where replicat process is running, Azure Data Lake Storage container region and the Synapse region would impact the overall throughput of the apply process. Data flow is as follows: Oracle GoldenGate -> Azure Data Lake Gen 2 -> Synapse. For best throughput, the components need to located as close as possible.
Replicat ABEND due to partial LOB records in the trail file: GG for DAA Synapse apply does not support replication of partial LOB. The trail file needs to be regenerated by Oracle Integrated capture using TRANLOGOPTIONS FETCHPARTIALLOB option in the extract parameter file.
Error:com.microsoft.sqlserver.jdbc.SQLServerException: Conversion failed when converting date and/or time from character string:
This occurs when the source datetime column and target datetime column are incompatible.

For example: A case where the source column is a timestamp type, and the target column is Synapse time.
If the Synapse table or column names contain double quotes, then GG for DAA replicat will ABEND.
Error: com.microsoft.sqlserver.jdbc.SQLServerException: HdfsBridge::recordReaderFillBuffer. This indicates that the data in the external table backed by Azure Data Lake file is not readable. Contact Oracle support.
IDENTITY column in the target table: The Synapse MERGE statement does not support inserting data into IDENTITY columns. Therefore, if MERGE statement is enabled using jvm.bootoptions=-Dsynapse.use.merge.sql=true, then Replicat will ABEND with following error message:
Exception:
```
com.microsoft.sqlserver.jdbc.SQLServerException: Cannot update identity column 'ORDER_ID'
```
Error: com.microsoft.sqlserver.jdbc.SQLServerException: Merge statements with a WHEN NOT MATCHED [BY TARGET] clause must target a hash distributed table:
This indicates that merge SQL statement is on and Synapse target table is not a hash distributed table. You need to create the target table with a hash distribution.

Parent topic: Azure Synapse Analytics Data Warehouse