32 Stage and Merge Data Warehouse Replication

Data warehouse targets typically support Massively Parallel Processing (MPP). In such systems, the cost of executing a single Data Manipulation Language (DML) operation is comparable to the cost of executing a batch of DML operations.

Therefore, for better throughput, the change data from the Oracle GoldenGate trails can be staged in micro batches at a temporary staging location, and the staged data records can then be merged into the data warehouse target table using the respective data warehouse's merge SQL statement. This chapter outlines an approach to replicate change data records from source databases to target data warehouses using stage and merge. The solution uses the Command Event handler to invoke custom bash-shell scripts.

This chapter contains examples of what you can do with the Command Event handler feature.

32.1 Steps for Stage and Merge

32.1.1 Stage

In this step the change data records in the Oracle GoldenGate trail files are pushed into a staging location. The staging location is typically a cloud object store such as OCI, AWS S3, Azure Data Lake, or Google Cloud Storage.

This can be achieved using the File Writer handler and one of the Oracle GoldenGate for Big Data object store Event handlers.

32.1.2 Merge

In this step, the change data files in the object store are exposed as an external table defined in the data warehouse. The data in the external staging table is then merged into the target table.

Merge SQL uses the external table as the staging table. The merge is a batch operation leading to better throughput.

32.1.3 Configuration of Handlers

The File Writer (FW) handler needs to be configured to generate local staging files that contain the change data from the GoldenGate trail files.

The FW handler needs to be chained to an object store Event handler that can upload the staging files into a staging location.

The staging location is typically a cloud object store, such as AWS S3 or Azure Data Lake.

The object store Event handler is in turn chained with the Command Event handler, which invokes custom scripts to execute merge SQL statements on the target data warehouse.
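
For illustration, the chaining between these handlers is typically expressed in the Replicat properties file along the following lines. The handler and Event handler labels (filewriter, s3, command) are user-chosen names, the property list is abridged, and the script path is a placeholder; see the shipped sample properties files for the complete configuration.

gg.handlerlist=filewriter
gg.handler.filewriter.type=filewriter
# Chain the object store Event handler to the File Writer handler.
gg.handler.filewriter.eventHandler=s3
gg.eventhandler.s3.type=s3
# Chain the Command Event handler to the object store Event handler.
gg.eventhandler.s3.eventHandler=command
gg.eventhandler.command.type=command
# Custom merge script invoked by the Command Event handler.
gg.eventhandler.command.command=<path to merge script>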

32.1.4 File Writer Handler

The File Writer (FW) handler is typically configured to generate files partitioned by table by setting gg.handler.{name}.partitionByTable=true.

In most cases, the FW handler is configured to use the Avro Object Container Format (OCF) formatter.

The output file format could change based on the specific data warehouse target.
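
Extending the sketch in Configuration of Handlers, the FW handler portion of the properties file typically includes settings along these lines (the values shown are illustrative; consult the shipped sample properties file for the exact set used for your target):

gg.handler.filewriter.partitionByTable=true
# Avro OCF output; the formatter value can differ based on the data warehouse target.
gg.handler.filewriter.format=avro_row_ocf
# Local directory that holds the staging files before they are uploaded.
gg.handler.filewriter.pathMappingTemplate=./dirout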

32.1.5 Operation Aggregation

Operation aggregation is the process of aggregating (merging/compressing) multiple operations on the same row into a single output operation based on a threshold.

Operation Aggregation needs to be enabled for stage and merge replication using the configuration gg.aggregate.operations=true.

32.1.6 Object Store Event handler

The File Writer handler needs to be chained with an object store Event handler. Oracle GoldenGate for Big Data supports uploading files to most cloud object stores, such as OCI, AWS S3, and Azure Data Lake.

32.1.7 JDBC Metadata Provider

If the data warehouse supports JDBC connections, then the JDBC Metadata Provider needs to be enabled.
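
For example, for a Hive target the JDBC Metadata Provider could be enabled in the properties file as shown below. The connection values reuse the Hive example from the Hive Stage and Merge section and are illustrative only.

gg.mdp.type=jdbc
gg.mdp.ConnectionUrl=jdbc:hive2://localhost:10000/default
gg.mdp.DriverClassName=org.apache.hive.jdbc.HiveDriver
gg.mdp.UserName=APP
gg.mdp.Password=mine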

32.1.8 Command Event handler Merge Script

The Command Event handler is configured to invoke a bash-shell script. Oracle provides a bash-shell script that executes the SQL statements so that the change data in the staging files is merged into the target tables.

The shell script needs to be customized as required before starting the Replicat process.

32.1.9 Stage and Merge Sample Configuration

A working configuration for the respective data warehouse is available under the directory AdapterExamples/big-data/data-warehouse-utils/<target>/.

This directory contains the following:
  • replicat parameter (.prm) file.
  • replicat properties file that contains the FW handler and all the Event handler configuration.
  • DDL file for the sample table used in the merge script.
  • Merge script for the specific data warehouse. This script contains SQL statements tested using the sample table defined in the DDL file.

32.1.10 Variables in the Merge Script

Typically, the variables appear at the beginning of the Oracle-provided script. Lines starting with #TODO: document the changes required for the variables in the script.

Example:
#TODO: Edit this. Provide the replicat group name.
repName=RBD

#TODO: Edit this. Ensure each replicat uses a unique prefix.
stagingTablePrefix=${repName}_STAGE_

#TODO: Edit the AWS S3 bucket name.
bucket=<AWS S3 bucket name>

#TODO: Edit this variable as needed.
s3Location="'s3://${bucket}/${dir}/'"

#TODO: Edit AWS credentials awsKeyId and awsSecretKey
awsKeyId=<AWS Access Key Id>
awsSecretKey=<AWS Secret key>

The variables repName and stagingTablePrefix are relevant for all the data warehouse targets.

32.1.11 SQL Statements in the Merge Script

The SQL statements in the shell script need to be customized. Lines starting with #TODO: document the changes required for the SQL statements.

In most cases, identifiers in the SQL statements need to be double quoted ("). The double quote needs to be escaped in the script using a backslash, for example: \".
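
For example, a double-quoted identifier embedded in a shell string could be written as follows. The table name is taken from the sample DDL file; the variable name sqlQuery and the statement itself are illustrative only.

sqlQuery="SELECT COUNT(*) FROM \"DBO\".\"TCUSTORD\""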

Oracle provides a working example of SQL statements for a single table with a pre-defined set of columns defined in the sample DDL file. You need to add new sections for your own tables as part of the if-elif code block in the script.

Example:
if [ "${tableName}" == "DBO.TCUSTORD" ]
then
  #TODO: Edit all the column names of the staging and target tables.
  # The merge SQL example here is configured for the example table defined in the DDL file.
  # Oracle provided SQL statements

# TODO: Add similar SQL queries for each table.
elif [ "${tableName}" == "DBO.ANOTHER_TABLE" ]
then
  
  #TODO: Edit the SQL statements for this table.
fi

32.1.12 Merge Script Functions

The script is coded to include the following shell functions:

  • main
  • validateParams
  • process
  • processTruncate
  • processDML
  • dropExternalTable
  • createExternalTable
  • merge

The script has code comments for you to infer the purpose of each function.

Merge Script main function

The function main is the entry point of the script. The processing of the staged change data file begins here.

This function invokes two functions: validateParams and process.

The input parameters to the script are validated in the function validateParams.

Processing resumes in the process function if validation is successful.

Merge Script process function

This function processes the operation records in the staged change data file and invokes processTruncate or processDML as needed.

Truncate operation records are handled in the function processTruncate. Insert, Update, and Delete operation records are handled in the function processDML.
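
The following simplified skeleton illustrates this control flow. It is only a sketch: the argument handling and the variable name operation are illustrative, and the Oracle-provided script is more elaborate.

main() {
  validateParams "$@"
  process
}

validateParams() {
  # The Command Event handler passes metadata about the staged file
  # (for example, the table name and the staged file path) as arguments.
  if [ "$#" -lt 1 ]; then
    echo "Usage: $0 <staged file metadata>" >&2
    exit 1
  fi
}

process() {
  # operation is an illustrative variable derived from the script arguments.
  if [ "${operation}" == "TRUNCATE" ]; then
    processTruncate
  else
    processDML
  fi
}

main "$@"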

Merge Script merge function

The merge function invoked by the function processDML contains the merge SQL statement that will be executed for each table.

The key columns to be used in the merge SQL's ON clause need to be customized.

To handle key columns with null values, the ON clause uses data warehouse-specific NVL functions. For example, for a single key column "C01Key":
ON ((NVL(CAST(TARGET.\"C01Key\" AS VARCHAR(4000)),'${uuid}')=NVL(CAST(STAGE.\"C01Key\" AS VARCHAR(4000)),'${uuid}')))

The column names in the merge statement's update and insert clauses also need to be customized for every table.
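
Putting these pieces together, a customized merge SQL for a table with key column "C01Key" and one hypothetical non-key column "C02Val" could look roughly as follows. The optype values, the staging table variable, and the exact clause structure vary by data warehouse; the Oracle-provided script contains the tested statements.

MERGE INTO \"DBO\".\"TCUSTORD\" TARGET
USING ${stagingTableName} STAGE
ON ((NVL(CAST(TARGET.\"C01Key\" AS VARCHAR(4000)),'${uuid}')=NVL(CAST(STAGE.\"C01Key\" AS VARCHAR(4000)),'${uuid}')))
WHEN MATCHED AND STAGE.optype='D' THEN DELETE
WHEN MATCHED AND STAGE.optype<>'D' THEN UPDATE SET TARGET.\"C02Val\"=STAGE.\"C02Val\"
WHEN NOT MATCHED AND STAGE.optype<>'D' THEN INSERT (\"C01Key\",\"C02Val\") VALUES (STAGE.\"C01Key\",STAGE.\"C02Val\")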

Merge Script createExternalTable function

The createExternalTable function invoked by the function processDML creates an external table that is backed by the staged file in the respective object store.

In this function, the DDL SQL statement for the external table should be customized for every target table to include all the target table columns.

In addition to the target table columns, the external table definition also includes three meta-columns: optype, position, and fieldmask.

The data types and the positions of the meta-columns in the DDL statement should not be modified.

32.1.13 Prerequisites

  • The Command Event handler merge scripts are available starting from Oracle GoldenGate for Big Data release 19.1.0.0.8.
  • The respective data warehouse's command line programs for executing SQL queries must be installed on the machine where Oracle GoldenGate for Big Data is installed.

32.1.14 Limitations

Primary key update operations are split into a delete and insert pair. If the Oracle GoldenGate trail file does not contain values for all the columns of the respective table, then the missing columns are updated to null on the target table.

32.2 Snowflake Stage and Merge

Snowflake is a serverless data warehouse. Snowflake can run its compute nodes on any of the following cloud providers: AWS, GCP, or Azure. Snowflake external tables can be used to read object store files.

Snowflake can read data files from object stores in AWS S3, Azure Data Lake, or Google Cloud Storage.

This topic contains examples of what you can do with the Snowflake command event handler.

32.2.1 Configuration

The directory AdapterExamples/big-data/data-warehouse-utils/snowflake/ in the Oracle GoldenGate for Big Data installation contains all the configuration and scripts needed for Snowflake replication using stage and merge.

The following are the files:
  • sf.prm: The replicat parameter file.
  • sf.props: The replicat properties file that stages data to AWS S3 and runs the Command Event handler.
  • sf.sh: The bash-shell script that reads data staged in AWS S3 and merges data into the Snowflake target table.
  • sf-az.props: The replicat properties file that stages data to Azure Data Lake Gen 2 and runs the Command Event handler.
  • sf-az.sh: The bash-shell script that reads data staged in Azure Data Lake Gen 2 and merges data into the Snowflake target table.
  • sf-ddl.sql: The DDL statement of the sample target table used in the scripts sf.sh and sf-az.sh.

Edit the properties indicated by the #TODO: comments in the properties files sf.props and sf-az.props.

The bash-shell script functions createExternalTable() and merge() contain SQL statements that need to be customized for your target tables.

32.3 Snowflake on AWS

This topic contains examples of what you can do with Snowflake on AWS.

32.3.1 Data Flow

  • File Writer (FW) handler is configured to generate files in Avro Object Container Format (OCF).
  • The Avro OCF files are uploaded to an S3 bucket using the GoldenGate S3 Event handler.
  • The Command Event handler passes the S3 object store file metadata to the sf.sh script.

32.3.2 Merge Script Variables

The following variables need to be modified as required:

#TODO: Edit this. Provide the replicat group name.
repName=RBD

#TODO: Edit this. Ensure each replicat uses a unique prefix.
stagingTablePrefix=${repName}_STAGE_

#TODO: Edit the AWS S3 bucket name.
bucket=<AWS S3 bucket name>

#TODO: Edit this variable as needed.
s3Location="'s3://${bucket}/${dir}/'"#TODO: Edit AWS credentials awsKeyId and awsSecretKey
awsKeyId=<AWS Access Key Id>
awsSecretKey=<AWS Secret key>

#TODO: Edit the Snowflake account name, database, username and password in the function executeQuery()
    sfAccount=<account>
    sfRegion=<region>
    sfDatabase=<database>
    sfUser=<user>
    sfPassword=<password>
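
For reference, the executeQuery() function typically wraps a snowsql invocation along these lines. This is a simplified sketch: the shipped sf.sh may structure it differently, and how sfAccount and sfRegion combine depends on your Snowflake account identifier.

executeQuery() {
  # The password is supplied through the SNOWSQL_PWD environment variable;
  # snowsql does not accept a password as a command line flag.
  SNOWSQL_PWD="${sfPassword}" snowsql -a "${sfAccount}.${sfRegion}" \
    -u "${sfUser}" -d "${sfDatabase}" -q "$1"
}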

32.4 Snowflake on Azure

This topic contains examples of what you can do with the Snowflake on Azure command event handler.

32.4.1 Data Flow

  • File Writer (FW) handler is configured to generate files in Avro Object Container Format (OCF).
  • The Avro OCF files are uploaded to a container in an Azure Storage Account (Azure Data Lake Gen 2) using the HDFS Event handler.
  • The Command Event handler passes the Azure Data Lake Gen 2 object store file metadata to the sf-az.sh script.

32.4.2 Merge Script Variables

The following variables need to be modified as required:

#TODO: Edit this. Provide the replicat group name.
repName=RBD

#TODO: Edit this. Ensure each replicat uses a unique prefix.
stagingTablePrefix=${repName}_STAGE_

#TODO: Edit the Azure Storage account.
azureStorageAccount=<Azure Storage account>
#TODO: Edit the Azure Container.
azureContainer=<Azure Container name>

#TODO: Edit the Snowflake storage integration used to access Azure Data Lake.
#TODO: Instructions to create a storage integration are documented here: https://docs.snowflake.com/en/user-guide/data-load-azure-config.html#option-1-configuring-a-snowflake-storage-integration
storageIntegration=<Snowflake Storage integration>

#TODO: Edit the Snowflake account name, database, username and password in the function executeQuery()
    sfAccount=<account>
    sfRegion=<region>
    sfDatabase=<database>
    sfUser=<user>
    sfPassword=<password>

32.4.3 Prerequisites

The merge scripts require the snowsql command line program to be installed on the machine where the Oracle GoldenGate for Big Data Replicat is installed.

32.5 Google BigQuery Stage and Merge

BigQuery is Google Cloud’s fully managed, petabyte-scale, and cost-effective analytics data warehouse that lets you run analytics over vast amounts of data in near real time.

This topic contains examples of what you can do with the Google BigQuery command event handler.

32.5.1 Data Flow

  • File Writer (FW) handler is configured to generate files in Avro Object Container Format (OCF).
  • The Command Event handler passes the local Avro OCF file metadata to the script bq.sh.

32.5.2 Configuration

The directory AdapterExamples/big-data/data-warehouse-utils/bigquery/ in the Oracle GoldenGate for Big Data installation contains all the configuration and scripts needed for replication to BigQuery using stage and merge.

The following are the files:
  • bq.prm: The replicat parameter file.
  • bq.props: The replicat properties file that generates local files in Avro OCF format and runs the Command Event handler.
  • bq.sh: The bash-shell script that uploads the files to Google Cloud Storage (GCS) and merges the change data in GCS onto BigQuery target tables.
  • bq-ddl.sql: The DDL statement for the sample target table used in the script bq.sh.

The bash-shell script function mergeIntoBQ() contains SQL statements that need to be customized for your target tables.

32.5.3 Merge Script Variables

The following variables need to be modified as required:

#TODO: Edit the replicat group name.
repName=BQ

#TODO: Edit this. Ensure each replicat uses a unique prefix.
stagingTablePrefix=${repName}_STAGE_

#TODO: Edit the GCS bucket name.
bucket=sanav_bucket_us

32.5.4 Prerequisites

The merge script bq.sh requires the Google Cloud command line programs gsutil and bq to be installed on the machine where the Oracle GoldenGate for Big Data Replicat is installed.
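
Conceptually, bq.sh uses these programs to stage and merge the data roughly as follows. The mergeSql variable and the paths are illustrative; the shipped script contains the tested commands and SQL.

# Upload the local Avro OCF file to Google Cloud Storage.
gsutil cp "${localFile}" "gs://${bucket}/${dir}/"
# Execute the merge SQL, which reads the staged data, against the BigQuery target table.
bq query --use_legacy_sql=false "${mergeSql}"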

32.6 Hive Stage and Merge

Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools to enable easy data ETL, a mechanism to put structures on the data, and the capability for querying and analysis of large data sets stored in Hadoop files.

This topic contains examples of what you can do with the Hive command event handler.

32.6.1 Data Flow

  • File Writer (FW) handler is configured to generate files in Avro Object Container Format (OCF).
  • The HDFS Event handler is used to push the Avro OCF files into Hadoop.
  • The Command Event handler passes the Hadoop file metadata to the hive.sh script.

32.6.2 Configuration

The directory AdapterExamples/big-data/data-warehouse-utils/hive/ in the Oracle GoldenGate for Big Data installation contains all the configuration and scripts needed for replication to Hive using stage and merge.

The following are the files:
  • hive.prm: The replicat parameter file.
  • hive.props: The replicat properties file that stages data to Hadoop and runs the Command Event handler.
  • hive.sh: The bash-shell script that reads data staged in Hadoop and merges data into the Hive target table.
  • hive-ddl.sql: The DDL statement for the sample target table used in the script hive.sh.

Edit the properties indicated by the #TODO: comments in the properties file hive.props.

The bash-shell script function merge() contains SQL statements that need to be customized for your target tables.

32.6.3 Merge Script Variables

Modify the following variables as needed:

#TODO: Modify the location of the OGGBD dirdef directory where the Avro schema files exist.
avroSchemaDir=/opt/ogg/dirdef

#TODO: Edit the JDBC URL to connect to hive.
hiveJdbcUrl=jdbc:hive2://localhost:10000/default
#TODO: Edit the JDBC user to connect to hive.
hiveJdbcUser=APP
#TODO: Edit the JDBC password to connect to hive.
hiveJdbcPassword=mine

#TODO: Edit the replicat group name.
repName=HIVE

#TODO: Edit this. Ensure each replicat uses a unique prefix.
stagingTablePrefix=${repName}_STAGE_

32.6.4 Prerequisites

The following are the prerequisites:

  • The merge script hive.sh requires the beeline command line program to be installed on the machine where the Oracle GoldenGate for Big Data Replicat is installed.
  • The custom script hive.sh uses the merge SQL statement.

    Hive Query Language (Hive QL) introduced support for merge in Hive version 2.2.
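
For reference, the JDBC variables listed in Merge Script Variables are typically consumed by a beeline invocation along these lines. This is a simplified sketch, and the function name executeQuery is illustrative; the shipped hive.sh may structure it differently.

executeQuery() {
  # Run a single Hive QL statement through beeline using the configured JDBC connection.
  beeline -u "${hiveJdbcUrl}" -n "${hiveJdbcUser}" -p "${hiveJdbcPassword}" -e "$1"
}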