32 Stage and Merge Data Warehouse Replication
Data warehouse targets typically support Massively Parallel Processing (MPP). In an MPP system, the cost of a single Data Manipulation Language (DML) operation is comparable to the cost of executing a batch of DMLs.
Therefore, for better throughput, the change data from the Oracle GoldenGate trails can be staged in micro-batches at a temporary staging location, and the staged data records can then be merged into the data warehouse target table using the respective data warehouse's merge SQL statement. This section outlines an approach to replicate change data records from source databases to target data warehouses using stage and merge. The solution uses the Command Event handler to invoke custom bash-shell scripts.
This chapter contains examples of what you can do with the Command Event handler feature.
- Steps for Stage and Merge
- Snowflake Stage and Merge
  Snowflake is a serverless data warehouse. Snowflake can run its compute nodes on any of the following cloud providers: AWS, GCP, or Azure. Snowflake external tables can be used to read object store files.
- Snowflake on AWS
- Snowflake on Azure
- Google BigQuery Stage and Merge
  BigQuery is Google Cloud’s fully managed, petabyte-scale, and cost-effective analytics data warehouse that lets you run analytics over vast amounts of data in near real time.
- Hive Stage and Merge
  Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools to enable easy data ETL, a mechanism to put structures on the data, and the capability for querying and analysis of large data sets stored in Hadoop files.
32.1 Steps for Stage and Merge
- Stage
  In this step the change data records in the Oracle GoldenGate trail files are pushed into a staging location. The staging location is typically a cloud object store such as OCI, AWS S3, Azure Data Lake, or Google Cloud Storage.
- Merge
  In this step the change data files in the object store are viewed as an external table defined in the data warehouse. The data in the external staging table is merged onto the target table.
- Configuration of Handlers
  The File Writer (FW) handler needs to be configured to generate local staging files that contain change data from the GoldenGate trail files.
- File Writer Handler
  The File Writer (FW) handler is typically configured to generate files partitioned by table using the configuration gg.handler.{name}.partitionByTable=true.
- Operation Aggregation
  Operation aggregation is the process of aggregating (merging/compressing) multiple operations on the same row into a single output operation based on a threshold.
- Object Store Event handler
  The File Writer handler needs to be chained with an object store Event handler. Oracle GoldenGate for Big Data supports uploading files to most cloud object stores such as OCI, AWS S3, and Azure Data Lake.
- JDBC Metadata Provider
  If the data warehouse supports JDBC connections, then the JDBC metadata provider needs to be enabled.
- Command Event handler Merge Script
  The Command Event handler is configured to invoke a bash-shell script. Oracle provides a bash-shell script that can execute the SQL statements so that the change data in the staging files is merged into the target tables.
- Stage and Merge Sample Configuration
  A working configuration for the respective data warehouse is available under the directory AdapterExamples/big-data/data-warehouse-utils/<target>/.
- Variables in the Merge Script
  Typically, variables appear at the beginning of the Oracle-provided script. Lines starting with #TODO: document the changes required for variables in the script.
- SQL Statements in the Merge Script
  The SQL statements in the shell script need to be customized. Lines starting with #TODO: document the changes required for SQL statements.
- Merge Script Functions
- Prerequisites
- Limitations
32.1.1 Stage
In this step the change data records in the Oracle GoldenGate trail files are pushed into a staging location. The staging location is typically a cloud object store such as OCI, AWS S3, Azure Data Lake, or Google Cloud Storage.
32.1.2 Merge
In this step the change data files in the object store are viewed as an external table defined in the data warehouse. The data in the external staging table is merged onto the target table.
32.1.3 Configuration of Handlers
The File Writer (FW) handler needs to be configured to generate local staging files that contain change data from the GoldenGate trail files.
The FW handler needs to be chained to an object store Event handler that can upload the staging files into the staging location. The staging location is typically a cloud object store, such as AWS S3 or Azure Data Lake.
The output of the object store Event handler is chained with the Command Event handler, which can invoke custom scripts to execute merge SQL statements on the target data warehouse.
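For orientation, the following is a minimal properties sketch of this handler chain, assuming handler and Event handler names filewriter, s3, and command; the region, bucket, and script path are placeholders. The working per-warehouse samples under AdapterExamples/big-data/data-warehouse-utils/ remain the authoritative reference.

# Minimal sketch of the stage-and-merge handler chain.
# Handler names, region, bucket, and script path are placeholders.
gg.handlerlist=filewriter

# 1. Stage: write local Avro OCF files, partitioned by table.
gg.handler.filewriter.type=filewriter
gg.handler.filewriter.mode=op
gg.handler.filewriter.format=avro_row_ocf
gg.handler.filewriter.partitionByTable=true
gg.handler.filewriter.eventHandler=s3

# 2. Upload: push each finalized staging file to the object store.
gg.eventhandler.s3.type=s3
gg.eventhandler.s3.region=<AWS region>
gg.eventhandler.s3.bucketMappingTemplate=<AWS S3 bucket name>
gg.eventhandler.s3.eventHandler=command

# 3. Merge: invoke the merge script with the uploaded file's metadata.
gg.eventhandler.command.type=command
gg.eventhandler.command.command=<path to the merge script>

# Required for stage and merge (see Operation Aggregation).
gg.aggregate.operations=true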
32.1.4 File Writer Handler
The File Writer (FW) handler is typically configured to generate files partitioned by table using the configuration gg.handler.{name}.partitionByTable=true.
In most cases, the FW handler is configured to use the Avro Object Container Format (OCF) formatter. The output file format could change based on the specific data warehouse target.
32.1.5 Operation Aggregation
Operation aggregation is the process of aggregating (merging/compressing) multiple operations on the same row into a single output operation based on a threshold. For example, an insert followed by updates to the same row within the threshold can be collapsed into a single output operation carrying the final column values.
Operation aggregation needs to be enabled for stage and merge replication using the configuration gg.aggregate.operations=true.
32.1.6 Object Store Event handler
The File Writer handler needs to be chained with an object store Event handler. Oracle GoldenGate for Big Data supports uploading files to most cloud object stores, such as OCI, AWS S3, and Azure Data Lake.
32.1.7 JDBC Metadata Provider
If the data warehouse supports JDBC connections, then the JDBC metadata provider needs to be enabled.
32.1.8 Command Event handler Merge Script
The Command Event handler is configured to invoke a bash-shell script. Oracle provides a bash-shell script that can execute the SQL statements so that the change data in the staging files is merged into the target tables.
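The script receives the metadata of each staged file as command-line arguments. The following is a hypothetical sketch of how such a script might read them; the actual argument layout is defined by the Command Event handler configuration in the Oracle-provided samples.

#!/bin/bash
# Hypothetical argument layout; the real order and contents are set by
# the Command Event handler configuration in the shipped samples.
tableName=$1   # fully qualified source table name, for example DBO.TCUSTORD
dataFile=$2    # location of the staged change data file

echo "Merging ${dataFile} for table ${tableName}"
# Control then flows through validateParams and process
# (see Merge Script Functions).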
32.1.9 Stage and Merge Sample Configuration
A working configuration for the respective data warehouse is available under the directory AdapterExamples/big-data/data-warehouse-utils/<target>/. Each sample directory contains the following:
- The replicat parameter (.prm) file.
- The replicat properties file that contains the FW handler and all the Event handler configuration.
- The DDL file for the sample table used in the merge script.
- The merge script for the specific data warehouse. This script contains SQL statements tested using the sample table defined in the DDL file.
32.1.10 Variables in the Merge Script
Typically, variables appear at the beginning of the Oracle-provided script. Lines starting with #TODO: document the changes required for the variables in the script.

#TODO: Edit this. Provide the replicat group name.
repName=RBD
#TODO: Edit this. Ensure each replicat uses a unique prefix.
stagingTablePrefix=${repName}_STAGE_
#TODO: Edit the AWS S3 bucket name.
bucket=<AWS S3 bucket name>
#TODO: Edit this variable as needed.
s3Location="'s3://${bucket}/${dir}/'"
#TODO: Edit AWS credentials awsKeyId and awsSecretKey
awsKeyId=<AWS Access Key Id>
awsSecretKey=<AWS Secret key>

The variables repName and stagingTablePrefix are relevant for all the data warehouse targets.
32.1.11 SQL Statements in the Merge Script
The SQL statements in the shell script need to be customized. Lines starting with #TODO: document the changes required for the SQL statements.
In most cases, identifiers in the SQL statements need to be enclosed in double quotes ("). The double quotes need to be escaped in the script using a backslash, for example: \".
Oracle provides a working example of SQL statements for a single table with a pre-defined set of columns defined in the sample DDL file. You need to add new sections for your own tables as part of the if-else code block in the script:
if [ "${tableName}" == "DBO.TCUSTORD" ] then #TODO: Edit all the column names of the staging and target tables. # The merge SQL example here is configured for the example table defined in the DDL file. # Oracle provided SQL statements # TODO: Add similar SQL queries for each table. elif [ "${tableName}" == "DBO.ANOTHER_TABLE" ] then #Edit SQLs for this table. fi
32.1.12 Merge Script Functions
The script is coded to include the following shell functions:
- main
- validateParams
- process
- processTruncate
- processDML
- dropExternalTable
- createExternalTable
- merge
The script has code comments for you to infer the purpose of each function.
Merge Script main function
The function main is the entry point of the script. The processing of the staged change data file begins here.
This function invokes two functions: validateParams and process. The input parameters to the script are validated in the function validateParams. Processing resumes in the process function if validation is successful.
Merge Script process function
This function processes the operation records in the staged change data file and invokes processTruncate or processDML as needed.
Truncate operation records are handled in the function processTruncate. Insert, Update, and Delete operation records are handled in the function processDML.
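A skeletal sketch of this control flow, with placeholder bodies rather than the Oracle-provided implementation, might look as follows:

# Skeletal sketch of the script's control flow; function bodies and the
# optype variable are placeholders, not the shipped implementation.
main() {
  validateParams "$@"   # abort early on missing or malformed arguments
  process               # continue only when validation succeeds
}

process() {
  if [ "${optype}" == "TRUNCATE" ]
  then
    processTruncate     # truncate the target table
  else
    processDML          # stage the file, then merge the DML operations
  fi
}

main "$@"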
Merge Script merge function
The merge function invoked by the function processDML contains the merge SQL statement that is executed for each table.
The key columns to be used in the merge SQL's ON clause need to be customized. To match key columns that contain null values, the ON clause uses data warehouse-specific NVL functions. Example for a single key column "C01Key":

ON ((NVL(CAST(TARGET.\"C01Key\" AS VARCHAR(4000)),'${uuid}')=NVL(CAST(STAGE.\"C01Key\" AS VARCHAR(4000)),'${uuid}')))

The column names in the merge statement's update and insert clauses also need to be customized for every table.
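For illustration, a hedged sketch of a per-table merge follows, loosely modeled on the sample table DBO.TCUSTORD; the column names, the staging table naming, and the executeQuery helper are assumptions to be replaced with the values from the Oracle-provided script for your data warehouse.

# Hypothetical per-table merge; column names, staging table name, and
# executeQuery are assumptions, not the shipped implementation.
merge() {
  uuid=$(uuidgen)   # sentinel value so NVL can match null keys on both sides
  executeQuery "
    MERGE INTO TCUSTORD TARGET
    USING ${stagingTablePrefix}TCUSTORD STAGE
    ON ((NVL(CAST(TARGET.\"CUST_CODE\" AS VARCHAR(4000)),'${uuid}') =
         NVL(CAST(STAGE.\"CUST_CODE\" AS VARCHAR(4000)),'${uuid}')))
    WHEN MATCHED AND STAGE.optype = 'D' THEN DELETE
    WHEN MATCHED AND STAGE.optype = 'U' THEN
      UPDATE SET \"ORDER_DATE\" = STAGE.\"ORDER_DATE\"
    WHEN NOT MATCHED AND STAGE.optype = 'I' THEN
      INSERT (\"CUST_CODE\", \"ORDER_DATE\")
      VALUES (STAGE.\"CUST_CODE\", STAGE.\"ORDER_DATE\")"
}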
Merge Script createExternalTable function
The createExternalTable function invoked by the function processDML creates an external table that is backed by the staged file in the respective object store.
In this function, the DDL SQL statement for the external table should be customized for every target table to include all the target table columns.
In addition to the target table columns, the external table definition also consists of three meta-columns: optype, position, and fieldmask.
The data type of the meta-columns should not be modified, and the position of the meta-columns should not be modified in the DDL statement.
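As an illustration only, a Hive-flavored sketch of such a DDL follows; the meta-column data types shown are placeholders, and the exact meta-column definitions, target columns, and location should be copied from the Oracle-provided sample DDL for your data warehouse.

# Hypothetical Hive-flavored external table over a staged Avro OCF file.
# Meta-column types are placeholders: keep the shipped definitions, and
# keep the three meta-columns first.
createExternalTable() {
  executeQuery "
    CREATE EXTERNAL TABLE ${stagingTablePrefix}TCUSTORD (
      optype    STRING,
      position  STRING,
      fieldmask STRING,
      CUST_CODE  STRING,
      ORDER_DATE STRING
    )
    STORED AS AVRO
    LOCATION '${stagingLocation}'"
}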
32.1.13 Prerequisites
- The Command handler merge scripts are available starting from Oracle GoldenGate for Big Data release 19.1.0.0.8.
- The respective data warehouse's command-line programs to execute SQL queries must be installed on the machine where Oracle GoldenGate for Big Data is installed.
32.1.14 Limitations
Primary key update operations are split into a delete and insert pair. If the Oracle GoldenGate trail file doesn't contain column values for all the columns in the respective table, then the missing columns get updated to null on the target table.
32.2 Snowflake Stage and Merge
Snowflake is a serverless data warehouse. Snowflake can run its compute nodes on any of the following cloud providers: AWS, GCP, or Azure. Snowflake external tables can be used to read object store files.
Snowflake can read data files from object stores in AWS S3, Azure Data Lake, or Google Cloud Storage.
This topic contains examples of what you can do with the Snowflake Command Event handler.
- Configuration
  The directory AdapterExamples/big-data/data-warehouse-utils/snowflake/ in the Oracle GoldenGate for Big Data install contains all the configuration and scripts needed for Snowflake replication using stage and merge.
32.2.1 Configuration
The directory AdapterExamples/big-data/data-warehouse-utils/snowflake/ in the Oracle GoldenGate for Big Data install contains all the configuration and scripts needed for Snowflake replication using stage and merge. The directory contains the following files:
- sf.prm: The replicat parameter file.
- sf.props: The replicat properties file that stages data to AWS S3 and runs the Command Event handler.
- sf.sh: The bash-shell script that reads data staged in AWS S3 and merges data to the Snowflake target table.
- sf-az.props: The replicat properties file that stages data to Azure Data Lake Gen 2 and runs the Command Event handler.
- sf-az.sh: The bash-shell script that reads data staged in Azure Data Lake Gen 2 and merges data to the Snowflake target table.
- sf-ddl.sql: The DDL statement of the sample target table used in the scripts sf.sh and sf-az.sh.
Edit the properties indicated by the #TODO: comments in the properties files sf.props and sf-az.props.
The bash-shell script functions createExternalTable() and merge() contain SQL statements that need to be customized for your target tables.
32.3 Snowflake on AWS
This topic contains examples of what you can do with Snowflake on AWS.
32.3.1 Data Flow
- File Writer (FW) handler is configured to generate files in Avro Object Container Format (OCF).
- The Avro OCF files are uploaded to an S3 bucket using the GoldenGate S3 Event handler.
- The Command Event handler passes the S3 object store file metadata to the sf.sh script.
32.3.2 Merge Script Variables
The following variables need to be modified as required:

#TODO: Edit this. Provide the replicat group name.
repName=RBD
#TODO: Edit this. Ensure each replicat uses a unique prefix.
stagingTablePrefix=${repName}_STAGE_
#TODO: Edit the AWS S3 bucket name.
bucket=<AWS S3 bucket name>
#TODO: Edit this variable as needed.
s3Location="'s3://${bucket}/${dir}/'"
#TODO: Edit AWS credentials awsKeyId and awsSecretKey
awsKeyId=<AWS Access Key Id>
awsSecretKey=<AWS Secret key>
#TODO: Edit the Snowflake account name, database, username and password in the function executeQuery()
sfAccount=<account>
sfRegion=<region>
sfDatabase=<database>
sfUser=<user>
sfPassword=<password>
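These variables are consumed in the script's executeQuery() function, which runs SQL through the snowsql client. A minimal sketch of such a function is shown below, assuming standard snowsql options and the SNOWSQL_PWD environment variable; the exact invocation in sf.sh may differ.

# Minimal sketch of executeQuery() over the snowsql CLI; the exact
# invocation in the shipped sf.sh may differ.
executeQuery() {
  SNOWSQL_PWD="${sfPassword}" snowsql \
    -a "${sfAccount}.${sfRegion}" \
    -u "${sfUser}" \
    -d "${sfDatabase}" \
    -q "$1"
}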
32.4 Snowflake on Azure
This topic contains examples of what you can do with the Snowflake on Azure Command Event handler.
32.4.1 Data Flow
- File Writer (FW) handler is configured to generate files in Avro Object Container Format (OCF).
- The Avro OCF files are uploaded to a container in the Azure Storage Account (Azure Data Lake Gen 2) using the HDFS Event handler.
- The Command Event handler passes the Azure Data Lake Gen 2 object store file metadata to the sf-az.sh script.
32.4.2 Merge Script Variables
The following variables need to be modified as required:

#TODO: Edit this. Provide the replicat group name.
repName=RBD
#TODO: Edit this. Ensure each replicat uses a unique prefix.
stagingTablePrefix=${repName}_STAGE_
#TODO: Edit the Azure Storage account.
azureStorageAccount=<Azure Storage account>
#TODO: Edit the Azure Container.
azureContainer=<Azure Container name>
#TODO: Edit Snowflake storage integration to access Azure Data Lake.
#TODO: Instructions to create a storage integration are documented here: https://docs.snowflake.com/en/user-guide/data-load-azure-config.html#option-1-configuring-a-snowflake-storage-integration
storageIntegration=<Snowflake Storage integration>
#TODO: Edit the Snowflake account name, database, username and password in the function executeQuery()
sfAccount=<account>
sfRegion=<region>
sfDatabase=<database>
sfUser=<user>
sfPassword=<password>
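The storage integration authorizes Snowflake to read the files staged in Azure Data Lake Gen 2. As an illustration, sf-az.sh might reference it when creating a stage over the container; the stage name is an assumption, while the CREATE STAGE syntax follows the Snowflake documentation linked above.

# Hypothetical sketch: a Snowflake stage over the Azure container,
# authorized through the storage integration configured above.
executeQuery "
  CREATE OR REPLACE STAGE ${stagingTablePrefix}AZ_STAGE
  URL = 'azure://${azureStorageAccount}.blob.core.windows.net/${azureContainer}/'
  STORAGE_INTEGRATION = ${storageIntegration}"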
32.4.3 Prerequisites
The merge scripts require the snowsql command-line program to be installed on the machine where the Oracle GoldenGate for Big Data replicat is installed.
32.5 Google BigQuery Stage and Merge
BigQuery is Google Cloud’s fully managed, petabyte-scale, and cost-effective analytics data warehouse that lets you run analytics over vast amounts of data in near real time.
This topic contains examples of what you can do with the Google BigQuery Command Event handler.
- Data Flow
- Configuration
  The directory AdapterExamples/big-data/data-warehouse-utils/bigquery/ in the Oracle GoldenGate for Big Data install contains all the configuration and scripts needed for replication to BigQuery using stage and merge.
- Merge Script Variables
- Prerequisites
32.5.1 Data Flow
- File Writer (FW) handler is configured to generate files in Avro Object Container Format (OCF).
- The Command Event handler passes the local Avro OCF file metadata to the script bq.sh.
32.5.2 Configuration
The directory AdapterExamples/big-data/data-warehouse-utils/bigquery/ in the Oracle GoldenGate for Big Data install contains all the configuration and scripts needed for replication to BigQuery using stage and merge. The directory contains the following files:
- bq.prm: The replicat parameter file.
- bq.props: The replicat properties file that generates local files in Avro OCF format and runs the Command Event handler.
- bq.sh: The bash-shell script that uploads the files to Google Cloud Storage (GCS) and merges the change data in GCS onto the BigQuery target tables.
- bq-ddl.sql: The DDL statement that contains the sample target table used in the script bq.sh.
The bash-shell script function mergeIntoBQ() contains SQL statements that need to be customized for your target tables.
32.5.3 Merge Script Variables
The following variables need to be modified as required:

#TODO: Edit the replicat group name.
repName=BQ
#TODO: Edit this. Ensure each replicat uses a unique prefix.
stagingTablePrefix=${repName}_STAGE_
#TODO: Edit the GCS bucket name.
bucket=sanav_bucket_us
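bq.sh drives these variables through the gsutil and bq command-line programs listed in the Prerequisites. The following is a hedged sketch of the staging steps such a script might perform; the dataset name and file variable are assumptions.

# Hypothetical staging steps: upload the local Avro OCF file to GCS,
# then load it into a BigQuery staging table. Dataset name is assumed.
gsutil cp "${localFile}" "gs://${bucket}/${dir}/"
bq load --source_format=AVRO \
  "mydataset.${stagingTablePrefix}TCUSTORD" \
  "gs://${bucket}/${dir}/$(basename "${localFile}")"
# The merge itself then runs via: bq query --use_legacy_sql=false "MERGE ..."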
32.5.4 Prerequisites
The merge script bq.sh requires the Google Cloud command-line programs gsutil and bq to be installed on the machine where the Oracle GoldenGate for Big Data replicat is installed.
32.6 Hive Stage and Merge
Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools to enable easy data ETL, a mechanism to put structures on the data, and the capability for querying and analysis of large data sets stored in Hadoop files.
This topic contains examples of what you can do with the Hive Command Event handler.
- Data Flow
- Configuration
  The directory AdapterExamples/big-data/data-warehouse-utils/hive/ in the Oracle GoldenGate for Big Data install contains all the configuration and scripts needed for replication to Hive using stage and merge.
- Merge Script Variables
- Prerequisites
32.6.1 Data Flow
- File Writer (FW) handler is configured to generate files in Avro Object Container Format (OCF).
- The HDFS Event handler is used to push the Avro OCF files into Hadoop.
- The Command Event handler passes the Hadoop file metadata to the hive.sh script.
32.6.2 Configuration
The directory AdapterExamples/big-data/data-warehouse-utils/hive/ in the Oracle GoldenGate for Big Data install contains all the configuration and scripts needed for replication to Hive using stage and merge. The directory contains the following files:
- hive.prm: The replicat parameter file.
- hive.props: The replicat properties file that stages data to Hadoop and runs the Command Event handler.
- hive.sh: The bash-shell script that reads data staged in Hadoop and merges data to the Hive target table.
- hive-ddl.sql: The DDL statement that contains the sample target table used in the script hive.sh.
Edit the properties indicated by the #TODO: comments in the properties file hive.props.
The bash-shell script function merge() contains SQL statements that need to be customized for your target tables.
32.6.3 Merge Script Variables
Modify the variables as needed:

#TODO: Modify the location of the OGGBD dirdef directory where the Avro schema files exist.
avroSchemaDir=/opt/ogg/dirdef
#TODO: Edit the JDBC URL to connect to hive.
hiveJdbcUrl=jdbc:hive2://localhost:10000/default
#TODO: Edit the JDBC user to connect to hive.
hiveJdbcUser=APP
#TODO: Edit the JDBC password to connect to hive.
hiveJdbcPassword=mine
#TODO: Edit the replicat group name.
repName=HIVE
#TODO: Edit this. Ensure each replicat uses a unique prefix.
stagingTablePrefix=${repName}_STAGE_
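hive.sh submits its SQL through the beeline client named in the Prerequisites, using the JDBC variables above. A minimal sketch of such an invocation follows; the helper name is an assumption.

# Minimal sketch: running a HiveQL statement through beeline using the
# JDBC variables above; the helper name is an assumption.
executeQuery() {
  beeline -u "${hiveJdbcUrl}" \
          -n "${hiveJdbcUser}" \
          -p "${hiveJdbcPassword}" \
          -e "$1"
}

executeQuery "SHOW TABLES"   # example call; merge() passes the MERGE statement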
32.6.4 Prerequisites
The following are the prerequisites:
- The merge script hive.sh requires the command-line program beeline to be installed on the machine where the Oracle GoldenGate for Big Data replicat is installed.
- The custom script hive.sh uses the merge SQL statement. Hive Query Language (HiveQL) introduced support for merge in Hive version 2.2.