8 Replicate Data
Oracle GoldenGate for Big Data supports specific configurations for replicating data: the handlers, each of which is compatible with clearly defined software versions.
Handlers in Oracle GoldenGate for Big Data are components that manage the data flow between various sources and targets. They are responsible for reading data from sources such as databases, log files, or message queues, and writing the data to a wide range of target systems. Oracle GoldenGate for Big Data uses Handlers to perform various tasks, such as data ingestion, data transformation, and data integration. Handlers are essential for enabling real-time data movement and data replication across Big Data environments.
This article describes the following Source and Target Handlers in Oracle GoldenGate for Big Data:
8.1 Source
- Amazon MSK
- Apache Cassandra
  The Oracle GoldenGate capture (Extract) for Cassandra is used to get changes from Apache Cassandra databases.
- Apache Kafka
  The Oracle GoldenGate capture (Extract) for Kafka is used to read messages from a Kafka topic or topics and convert data into logical change records written to GoldenGate trail files. This section explains how to use Oracle GoldenGate capture for Kafka.
- Azure Event Hubs
- Confluent Kafka
- DataStax
- Java Message Service (JMS)
- MongoDB
  The Oracle GoldenGate capture (Extract) for MongoDB is used to get changes from MongoDB databases.
- OCI Streaming
Parent topic: Replicate Data
8.1.1 Amazon MSK
To capture messages from Amazon MSK and parse into logical change records with Oracle GoldenGate for Big Data, you can use Kafka Extract. For more information, see Apache Kafka as source.
Parent topic: Source
8.1.2 Apache Cassandra
The Oracle GoldenGate capture (Extract) for Cassandra is used to get changes from Apache Cassandra databases.
- Overview
- Setting Up Cassandra Change Data Capture
- Deduplication
- Topology Changes
- Data Availability in the CDC Logs
- Using Extract Initial Load
- Using Change Data Capture Extract
- Replicating to RDBMS Targets
- Partition Update or Insert of Static Columns
- Partition Delete
- Security and Authentication
- Cleanup of CDC Commit Log Files
  You can use the Cassandra CDC commit log purger program to purge the CDC commit log files that are not in use.
- Multiple Extract Support
- CDC Configuration Reference
- Troubleshooting
- Cassandra Capture Client Dependencies
  What are the dependencies for the Cassandra Capture (Extract) to connect to Apache Cassandra databases?
Parent topic: Source
8.1.2.1 Overview
Apache Cassandra is a NoSQL Database Management System designed to store large amounts of data. A Cassandra cluster configuration provides horizontal scaling and replication of data across multiple machines. It can provide high availability and eliminate a single point of failure by replicating data to multiple nodes within a Cassandra cluster. Apache Cassandra is open source and designed to run on low-cost commodity hardware.
Cassandra relaxes the axioms of a traditional relational database management system (RDBMS) regarding atomicity, consistency, isolation, and durability. When considering implementing Cassandra, it is important to understand its differences from a traditional RDBMS and how those differences affect your specific use case.
Cassandra provides eventual consistency. Under the eventual consistency model, accessing the state of data for a specific row eventually returns the latest state of the data for that row as defined by the most recent change. However, there may be a latency period between the creation or modification of the state of a row and what is returned when the state of that row is queried. The benefit of eventual consistency is that the latency period can be predicted based on your Cassandra configuration and the workload that your Cassandra cluster is currently under. For more information, see http://cassandra.apache.org/.
To review the data type support, see About the Cassandra Data Types.
Parent topic: Apache Cassandra
8.1.2.2 Setting Up Cassandra Change Data Capture
Prerequisites

- Apache Cassandra cluster must have at least one node up and running.
- Read and write access to CDC commit log files on every live node in the cluster is done through SFTP or NFS. For more information, see Setup SSH Connection to the Cassandra Nodes.
- Every node in the Cassandra cluster must have the cdc_enabled parameter set to true in the cassandra.yaml configuration file (see the cassandra.yaml snippet after this list).
- Virtual nodes must be enabled on every Cassandra node by setting the num_tokens parameter in cassandra.yaml.
- You must download the third party libraries using the Dependency Downloader scripts. For more information, see Cassandra Capture Client Dependencies.
- New tables can be created with Change Data Capture (CDC) enabled using the WITH CDC=true clause in the CREATE TABLE command. For example:
  CREATE TABLE ks_demo_rep1.mytable (col1 int, col2 text, col3 text, col4 text, PRIMARY KEY (col1)) WITH cdc=true;
  You can enable CDC on existing tables as follows:
  ALTER TABLE ks_demo_rep1.mytable WITH cdc=true;
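A minimal cassandra.yaml excerpt showing the two settings named above; the num_tokens value is only illustrative, so keep whatever value your cluster already uses:

cdc_enabled: true
num_tokens: 256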
- Setup SSH Connection to the Cassandra Nodes
  Oracle GoldenGate for Big Data transfers Cassandra commit log files from all the Cassandra nodes. To allow Oracle GoldenGate to transfer commit log files using the secure shell protocol (SFTP), generate a known_hosts SSH file.
- Data Types
- Cassandra Database Operations
- Set up Credential Store Entry to Detect Source Type
Parent topic: Apache Cassandra
8.1.2.2.1 Setup SSH Connection to the Cassandra Nodes
Oracle GoldenGate for Big Data transfers Cassandra commit log files from all the Cassandra nodes. To allow Oracle GoldenGate to transfer commit log files using the secure shell protocol (SFTP), generate a known_hosts SSH file.
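The known_hosts file can be generated with the standard ssh-keyscan utility; a minimal sketch, where node1, node2, and node3 are placeholder host names for the Cassandra nodes:

# Collect the RSA host keys of every Cassandra node into a known_hosts file
ssh-keyscan -t rsa node1 node2 node3 > known_hosts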
Parent topic: Setting Up Cassandra Change Data Capture
8.1.2.2.2 Data Types
Supported Cassandra Data Types
The following are the supported data types:
- ASCII
- BIGINT
- BLOB
- BOOLEAN
- DATE
- DECIMAL
- DOUBLE
- DURATION
- FLOAT
- INET
- INT
- SMALLINT
- TEXT
- TIME
- TIMESTAMP
- TIMEUUID
- TINYINT
- UUID
- VARCHAR
- VARINT
Unsupported Data Types
The following are the unsupported data types:
- COUNTER
- MAP
- SET
- LIST
- UDT (user defined type)
- TUPLE
- CUSTOM_TYPE
Parent topic: Setting Up Cassandra Change Data Capture
8.1.2.2.3 Cassandra Database Operations
Supported Operations
The following are the supported operations:
- INSERT
- UPDATE (captured as INSERT; see the example at the end of this topic)
- DELETE
Unsupported Operations
The TRUNCATE and DDL (CREATE, ALTER, and DROP) operations are not supported. The Cassandra commit log files do not record any before images for UPDATE or DELETE operations, so the captured operations can never have a before image. Oracle GoldenGate features that rely on before image records, such as Conflict Detection and Resolution, are not available.
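For example, assuming the ks_demo_rep1.mytable table created earlier, the following UPDATE is written to the trail as an INSERT operation carrying only the after image:

UPDATE ks_demo_rep1.mytable SET col2 = 'updated' WHERE col1 = 1; /** captured as an INSERT, no before image **/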
Parent topic: Setting Up Cassandra Change Data Capture
8.1.2.2.4 Set up Credential Store Entry to Detect Source Type
The source database type is detected from the prefix of the userid. The generic format for userid is as follows:
<dbtype>://<db-user>@<comma separated list of server addresses>:<port>
The userid can have multiple server/node addresses.
Microservices Build
More than one node address can be configured in the userid. For example:
alter credentialstore add user cassandra://db-user@127.0.0.1,127.0.0.2:9042 password db-passwd alias cass
Classic Build
- The userid should contain a single node address.
- If more than one node address needs to be configured for the connection, then use the GLOBALS parameter CLUSTERCONTACTPOINTS.
- The connection to the cluster concatenates the node addresses specified in the userid and the CLUSTERCONTACTPOINTS parameter.
For example:
alter credentialstore add user cassandra://db-user@127.0.0.1:9042 password db-passwd alias cass
CLUSTERCONTACTPOINTS 127.0.0.2
In this case, the connection will be attempted using 127.0.0.1,127.0.0.2:9042.
Parent topic: Setting Up Cassandra Change Data Capture
8.1.2.3 Deduplication
One of the features of a Cassandra cluster is its high availability. To support high availability, multiple redundant copies of table data are stored on different nodes in the cluster. Oracle GoldenGate for Big Data Cassandra Capture automatically filters out duplicate rows (deduplication). Deduplication is active by default. Oracle recommends keeping it enabled if your data is captured and applied to targets where duplicate records are discouraged (for example, RDBMS targets).
Parent topic: Apache Cassandra
8.1.2.4 Topology Changes
Cassandra nodes can change their status (topology change) and the cluster can still be alive. Oracle GoldenGate for Big Data Cassandra Capture can detect the node status changes and react to these changes when applicable. The Cassandra capture process can detect the following events happening in the cluster:
- Node shutdown and boot.
- Node decommission and commission.
- New keyspace and table created.
Due to topology changes, if the capture process detects that an active producer node goes down, it tries to recover any missing rows from an available replica node. During this process, there is a possibility of data duplication for some rows. This is a transient data duplication due to the topology change. For more details about reacting to changes in topology, see Troubleshooting.
Parent topic: Apache Cassandra
8.1.2.5 Data Availability in the CDC Logs
The Cassandra CDC API can only read data from commit log files in the CDC directory. There is a latency for the data in the active commit log directory to be archived (moved) to the CDC commit log directory.
The input data source for the Cassandra capture process is the CDC commit log directory. There could be delays for the data to be captured mainly due to the commit log files not yet visible to the capture process.
On a production cluster with a lot of activity, this latency is very minimal as the data is archived from the active commit log directory to the CDC commit log directory in the order of microseconds.
Parent topic: Apache Cassandra
8.1.2.6 Using Extract Initial Load
Cassandra Extract supports the standard initial load capability to extract source table data to Oracle GoldenGate trail files.
Initial load for Cassandra can be performed to synchronize tables, either as a prerequisite step to replicating changes or as a standalone function.
Direct loading from a source Cassandra table to any target table is not supported.
Configuring the Initial Load
Initial load extract parameter file:
-- ggsci> alter credentialstore add user cassandra://db-user@127.0.0.1 password db-passwd alias cass
EXTRACT load
-- When using sdk 3.11 or 3.10 or 3.9
JVMOPTIONS CLASSPATH ggjava/ggjava.jar:/path/to/cassandra-driver-core/3.3.1/cassandra-driver-core-3.3.1.jar:dirprm:/path/to/apache-cassandra-3.11.0/lib/*:/path/to/gson/2.3/gson-2.3.jar:/path/to/jsch/0.1.54/jsch-0.1.54.jar
-- When using sdk 3.9
--JVMOPTIONS CLASSPATH ggjava/ggjava.jar:/path/to/cassandra-driver-core/3.3.1/cassandra-driver-core-3.3.1.jar:dirprm:/path/to/apache-cassandra-3.9/lib/*:/path/to/gson/2.3/gson-2.3.jar:/path/to/jsch/0.1.54/jsch-0.1.54.jar
SOURCEDB USERIDALIAS cass
SOURCEISTABLE
EXTFILE ./dirdat/la, megabytes 2048, MAXFILES 999
TABLE keyspace1.table1;
Note:
Save the file with the name specified in the example (load.prm) into the dirprm directory.
Then you would run these commands in GGSCI:
ADD EXTRACT load, SOURCEISTABLE
START EXTRACT load
Parent topic: Apache Cassandra
8.1.2.7 Using Change Data Capture Extract
Review the example .prm files from the Oracle GoldenGate for Big Data installation directory under $HOME/AdapterExamples/big-data/cassandracapture.
- When adding the Cassandra Extract trail, you need to use EXTTRAIL to create a local trail file. The Cassandra Extract trail file should not be configured with the RMTTRAIL option.
  ggsci> ADD EXTRACT groupname, TRANLOG
  ggsci> ADD EXTTRAIL trailprefix, EXTRACT groupname
  Example:
  ggsci> ADD EXTRACT cass, TRANLOG
  ggsci> ADD EXTTRAIL ./dirdat/z1, EXTRACT cass
- To configure the Extract, see the example .prm files in the Oracle GoldenGate for Big Data installation directory in $HOME/AdapterExamples/big-data/cassandracapture.
- Position the Extract.
  ggsci> ADD EXTRACT groupname, TRANLOG, BEGIN NOW
  ggsci> ADD EXTRACT groupname, TRANLOG, BEGIN 'yyyy-mm-dd hh:mm:ss'
  ggsci> ALTER EXTRACT groupname, BEGIN 'yyyy-mm-dd hh:mm:ss'
- Manage the transaction data logging for the tables.
  ggsci> DBLOGIN SOURCEDB nodeaddress USERID userid PASSWORD password
  ggsci> alter credentialstore add user cassandra://db-user@127.0.0.1 password db-passwd alias cass
  ggsci> dblogin useridalias cass
  ggsci> ADD TRANDATA keyspace.tablename
  ggsci> INFO TRANDATA keyspace.tablename
  ggsci> DELETE TRANDATA keyspace.tablename
  Examples:
  ggsci> dblogin SOURCEDB 127.0.0.1
  ggsci> dblogin useridalias cass
  ggsci> INFO TRANDATA ks_demo_rep1.mytable
  ggsci> INFO TRANDATA ks_demo_rep1.*
  ggsci> INFO TRANDATA *.*
  ggsci> INFO TRANDATA ks_demo_rep1."CamelCaseTab"
  ggsci> ADD TRANDATA ks_demo_rep1.mytable
  ggsci> DELETE TRANDATA ks_demo_rep1.mytable
- Configure the Extract parameter file:
  - Apache Cassandra 4x SDK, compatible with Apache Cassandra 4.0 version
    Extract parameter file:
    -- ggsci> alter credentialstore add user cassandra://db-user@127.0.0.1 password db-passwd alias cass
    EXTRACT groupname
    JVMOPTIONS CLASSPATH ggjava/ggjava.jar:DependencyDownloader/dependencies/cassandra_capture_4x/*
    JVMOPTIONS BOOTOPTIONS -Dcassandra.config=file://{/path/to/apache-cassandra-4.x}/config/cassandra.yaml -Dcassandra.datacenter={datacenter-name}
    TRANLOGOPTIONS CDCREADERSDKVERSION 4x
    TRANLOGOPTIONS CDCLOGDIRTEMPLATE /path/to/data/cdc_raw
    SOURCEDB USERIDALIAS cass
    EXTTRAIL trailprefix
    TABLE source.*;
    - Provide the cassandra.yaml file path using JVMOPTIONS BOOTOPTIONS:
      JVMOPTIONS BOOTOPTIONS -Dcassandra.config=file://{/path/to/apache-cassandra-4.x}/config/cassandra.yaml -Dcassandra.datacenter={datacenter-name}
    - Configure the Cassandra datacenter name under JVMOPTIONS BOOTOPTIONS. If you do not provide a value, then the default datacenter1 is used.
  - Apache Cassandra 3x SDK, compatible with Apache Cassandra 3.9, 3.10, 3.11
    Extract parameter file:
    -- ggsci> alter credentialstore add user cassandra://db-user@127.0.0.1 password db-passwd alias cass
    JVMOPTIONS CLASSPATH ggjava/ggjava.jar:DependencyDownloader/dependencies/cassandra_capture_3x/*
    TRANLOGOPTIONS CDCREADERSDKVERSION 3x
    TRANLOGOPTIONS CDCLOGDIRTEMPLATE /path/to/data/cdc_raw
    SOURCEDB USERIDALIAS cass
    EXTTRAIL trailprefix
    TABLE source.*;
  - DSE Cassandra SDK, compatible with DSE Cassandra 6.x versions
    Extract parameter file:
    -- ggsci> alter credentialstore add user cassandra://db-user@127.0.0.1 password db-passwd alias cass
    EXTRACT groupname
    JVMOPTIONS CLASSPATH ggjava/ggjava.jar:{/path/to/dse-6.x}/resources/cassandra/lib/*:{/path/to/dse-6.x}/lib/*:{/path/to/dse-6.x}/resources/dse/lib/*:DependencyDownloader/dependencies/cassandra_capture_dse/*
    JVMOPTIONS BOOTOPTIONS -Dcassandra.config=file://{/path/to/dse-6.x}/resources/cassandra/conf/cassandra.yaml -Dcassandra.datacenter={datacenter-name}
    TRANLOGOPTIONS CDCREADERSDKVERSION dse
    TRANLOGOPTIONS CDCLOGDIRTEMPLATE /path/to/data/cdc_raw
    SOURCEDB USERIDALIAS cass
    EXTTRAIL trailprefix
    TABLE source.*;
    - Provide the cassandra.yaml file path using JVMOPTIONS BOOTOPTIONS:
      JVMOPTIONS BOOTOPTIONS -Dcassandra.config=file://{/path/to/dse-6.x}/resources/cassandra/conf/cassandra.yaml -Dcassandra.datacenter={datacenter-name}
    - Configure the Cassandra datacenter name under JVMOPTIONS BOOTOPTIONS. If you do not provide a value, then the default Cassandra is used.
Note:
For DSE 5.x version, configure the extract with Apache 3x SDK as explained in the Apache 3x section.
Parent topic: Apache Cassandra
8.1.2.7.1 Handling Schema Evolution
Syntax:
TRANLOGOPTIONS TRACKSCHEMACHANGES
This enables Extract to capture table-level DDL changes from the source at runtime. Enable it to ensure that the table metadata within the trail stays in sync with the source without any downtime. When TRACKSCHEMACHANGES is disabled, the capture process will ABEND if a DDL change is detected at the source table.
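A minimal Extract parameter sketch with schema-change tracking enabled; the group name, alias, classpath, and CDC directory reuse the illustrative values from the Apache Cassandra 4x SDK example earlier in this chapter:

EXTRACT groupname
JVMOPTIONS CLASSPATH ggjava/ggjava.jar:DependencyDownloader/dependencies/cassandra_capture_4x/*
TRANLOGOPTIONS CDCREADERSDKVERSION 4x
TRANLOGOPTIONS CDCLOGDIRTEMPLATE /path/to/data/cdc_raw
TRANLOGOPTIONS TRACKSCHEMACHANGES
SOURCEDB USERIDALIAS cass
EXTTRAIL trailprefix
TABLE source.*;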
Note:
This feature is disabled by default. To enable it, update the Extract .prm file as shown in the syntax above.
Parent topic: Using Change Data Capture Extract
8.1.2.8 Replicating to RDBMS Targets
You must take additional care when replicating source UPDATE operations from Cassandra trail files to RDBMS targets. Any source UPDATE operation appears as an INSERT record in the Oracle GoldenGate trail file. Replicat may abend when a source UPDATE operation is applied as an INSERT operation on the target database.
You have these options:
- OVERRIDEDUPS: If you expect that the source database contains mostly INSERT operations and very few UPDATE operations, then OVERRIDEDUPS is the recommended option. Replicat can recover from duplicate key errors while replicating the small number of source UPDATE operations. See OVERRIDEDUPS | NOOVERRIDEDUPS.
- UPDATEINSERTS and INSERTMISSINGUPDATES: Use this configuration if the source database is expected to contain mostly UPDATE operations and very few INSERT operations. With this configuration, Replicat has fewer missing-row errors to recover from, which leads to better throughput. See UPDATEINSERTS | NOUPDATEINSERTS and INSERTMISSINGUPDATES | NOINSERTMISSINGUPDATES. A parameter sketch appears at the end of this section.
- No additional configuration is required if the target table can accept duplicate rows or you want to abend Replicat on duplicate rows.
If you configure Replicat to use BATCHSQL, there may be duplicate row or missing row errors in batch mode. Although there is a reduction in the Replicat throughput due to these errors, Replicat automatically recovers from them. If the source operations are mostly INSERTS, then BATCHSQL is a good option.
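A minimal Replicat parameter sketch for the UPDATEINSERTS and INSERTMISSINGUPDATES option; the group name rcass, the target alias tgtdb, and the target table hr.mytable are illustrative placeholders, not values from the product examples:

REPLICAT rcass
USERIDALIAS tgtdb
UPDATEINSERTS
INSERTMISSINGUPDATES
MAP ks_demo_rep1.mytable, TARGET hr.mytable;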
Parent topic: Apache Cassandra
8.1.2.9 Partition Update or Insert of Static Columns
When the source Cassandra table has static columns, the static column values can be modified by skipping any clustering key columns that are in the table.
For example:
create table ks_demo_rep1.nls_staticcol
(
teamname text,
manager text static,
location text static,
membername text,
nationality text,
position text,
PRIMARY KEY ((teamname), membername)
)
WITH cdc=true;
insert into ks_demo_rep1.nls_staticcol (teamname, manager, location) VALUES ('Red Bull', 'Christian Horner', '<unknown>');
The INSERT CQL statement is missing the clustering key membername. Such an operation is a partition insert.
Similarly, you could also update a static column with just the partition keys in the WHERE clause of the CQL statement; that is a partition update operation. Cassandra Extract cannot write an INSERT or UPDATE operation into the trail with missing key columns. It abends on detecting a partition INSERT or UPDATE operation.
Parent topic: Apache Cassandra
8.1.2.10 Partition Delete
A Cassandra table may have a primary key composed of one or more partition key columns and clustering key columns. When a DELETE operation is performed on a Cassandra table by skipping the clustering key columns from the WHERE clause, it results in a partition delete operation.
For example:
create table ks_demo_rep1.table1
(
col1 ascii, col2 bigint, col3 boolean, col4 int,
PRIMARY KEY((col1, col2), col4)
) with cdc=true;
delete from ks_demo_rep1.table1 where col1 = 'asciival' and col2 = 9876543210; /** skipped clustering key column col4 **/
Cassandra Extract cannot write a DELETE
operation into the trail with missing key columns and abends on detecting a partition DELETE
operation.
Parent topic: Apache Cassandra
8.1.2.11 Security and Authentication
- Cassandra Extract can connect to a Cassandra cluster using username and password based authentication and SSL authentication.
- Connection to Kerberos-enabled Cassandra clusters is not supported in this release.
Parent topic: Apache Cassandra
8.1.2.11.1 Configuring SSL
To enable SSL, add the SSL parameter to your GLOBALS
file or Extract parameter file. Additionally, a separate configuration is required for the Java and CPP drivers, see CDC Configuration Reference.
SSL configuration for Java driver (GLOBALS file)
JVMBOOTOPTIONS -Djavax.net.ssl.trustStore=/path/to/SSL/truststore.file -Djavax.net.ssl.trustStorePassword=password -Djavax.net.ssl.keyStore=/path/to/SSL/keystore.file -Djavax.net.ssl.keyStorePassword=password
SSL configuration for Java driver (Extract parameter file)
JVMOPTIONS BOOTOPTIONS -Djavax.net.ssl.trustStore=/path/to/SSL/truststore.file -Djavax.net.ssl.trustStorePassword=password -Djavax.net.ssl.keyStore=/path/to/SSL/keystore.file -Djavax.net.ssl.keyStorePassword=password
Note:
The Extract parameter file configuration has a higher precedence.
For more information, see https://docs.datastax.com/en/cassandra/3.0/cassandra/configuration/secureSSLIntro.html
Using Apache Cassandra 4x SDK / DSE Cassandra SDK
- Create the application.conf file with the following properties and override them with appropriate values:
  datastax-java-driver {
    advanced.ssl-engine-factory {
      class = DefaultSslEngineFactory
      # Whether or not to require validation that the hostname of the server certificate's common
      # name matches the hostname of the server being connected to. If not set, defaults to true.
      hostname-validation = false
      # The locations and passwords used to access truststore and keystore contents.
      # These properties are optional. If either truststore-path or keystore-path are specified,
      # the driver builds an SSLContext from these files. If neither option is specified, the
      # default SSLContext is used, which is based on system property configuration.
      truststore-path = {path to truststore file}
      truststore-password = password
      keystore-path = {path to keystore file}
      keystore-password = cassandra
    }
  }
- Provide the path of the directory containing the application.conf file under JVMCLASSPATH as follows:
  JVMCLASSPATH ggjava/ggjava.jar:DependencyDownloader/dependencies/cassandra_capture_4x/*:/path/to/driver/config
Note:
JVMCLASSPATH is valid only in the case of the GLOBALS file. In the Extract parameter file, use:
JVMOPTIONS CLASSPATH ggjava/ggjava.jar:DependencyDownloader/dependencies/cassandra_capture_4x/*:/path/to/driver/config/
For more information, see https://github.com/datastax/java-driver/blob/4.x/core/src/main/resources/reference.conf.
SSL configuration for Cassandra CPP driver
To operate with an SSL configuration, you have to add the following parameter in the Oracle GoldenGate GLOBALS
file or Extract parameter file:
CPPDRIVEROPTIONS SSL PEMPUBLICKEYFILE /path/to/PEM/formatted/public/key/file/cassandra.pem
CPPDRIVEROPTIONS SSL PEERCERTVERIFICATIONFLAG 0
This configuration is required to connect to a Cassandra cluster with SSL enabled. Additionally, you need to add these settings to your cassandra.yaml
file:
client_encryption_options:
enabled: true
# If enabled and optional is set to true encrypted and unencrypted connections are handled.
optional: false
keystore: /path/to/keystore
keystore_password: password
require_client_auth: false
The PEM formatted certificates can be generated using these instructions:
https://docs.datastax.com/en/developer/cpp-driver/2.8/topics/security/ssl/
Parent topic: Security and Authentication
8.1.2.12 Cleanup of CDC Commit Log Files
You can use the Cassandra CDC commit log purger program to purge the CDC commit log files that are not in use.
For more information, see How to Run the Purge Utility.
- Cassandra CDC Commit Log Purger
  A purge utility for the Cassandra Handler to purge the staged CDC commit log files. Cassandra Extract moves the CDC commit log files (located at $CASSANDRA/data/cdc_raw) on each node to a staging directory for processing.
Parent topic: Apache Cassandra
8.1.2.12.1 Cassandra CDC Commit Log Purger
A purge utility for the Cassandra Handler to purge the staged CDC commit log files. Cassandra Extract moves the CDC commit log files (located at $CASSANDRA/data/cdc_raw) on each node to a staging directory for processing.
For example, if the cdc_raw commit log directory is /path/to/cassandra/home/data/cdc_raw, the staging directory is /path/to/cassandra/home/data/cdc_raw/../cdc_raw_staged. The CDC commit log purger purges the files inside cdc_raw_staged based on the following logic.
The purge program scans the Oracle GoldenGate directory for the JSON checkpoint files under dirchk/<EXTGRP>_casschk.json. A sample JSON checkpoint file under dirchk looks similar to the following:
{
  "start_timestamp": -1,
  "sequence_id": 34010434,
  "updated_datetime": "2018-04-19 23:24:57.164-0700",
  "nodes": [
    { "address": "10.247.136.146", "offset": 0, "id": 0 },
    { "address": "10.247.136.142", "file": "CommitLog-6-1524110205398.log", "offset": 33554405, "id": 1524110205398 },
    { "address": "10.248.10.24", "file": "CommitLog-6-1524110205399.log", "offset": 33554406, "id": 1524110205399 }
  ]
}
For each node address in the JSON checkpoint file, the purge program captures the CDC file name and ID. For each ID obtained from the JSON checkpoint file, the purge program looks into the staged CDC commit log directory and purges the commit log files with IDs less than the ID captured in the JSON checkpoint file.
Example:
In the JSON file, the ID is 1524110205398.
In the CDC staging directory, the files are CommitLog-6-1524110205396.log, CommitLog-6-1524110205397.log, and CommitLog-6-1524110205398.log.
The IDs derived from the CDC staging directory are 1524110205396, 1524110205397, and 1524110205398. The purge utility purges the files in the CDC staging directory whose IDs are less than the ID read from the JSON file, which is 1524110205398. The files associated with the IDs 1524110205396 and 1524110205397 are purged.
- How to Run the Purge Utility
- Sample config.properties for Local File System
- Argument cassCommitLogPurgerConfFile
- Argument purgeInterval
  Setting the optional argument purgeInterval helps in configuring the process to run as a daemon.
- Argument cassUnProcessedFilesPurgeInterval
  Setting the optional argument cassUnProcessedFilesPurgeInterval helps in purging historical commit logs for all the nodes that do not have a last processed file.
Parent topic: Cleanup of CDC Commit Log Files
8.1.2.12.1.1 How to Run the Purge Utility
Parent topic: Cassandra CDC Commit Log Purger
8.1.2.12.1.1.1 Third Party Libraries Needed to Run this Program
<dependency>
  <groupId>com.jcraft</groupId>
  <artifactId>jsch</artifactId>
  <version>0.1.54</version>
  <scope>provided</scope>
</dependency>
Parent topic: How to Run the Purge Utility
8.1.2.12.1.1.2 Command to Run the Program
java -Dlog4j.configurationFile=log4j-purge.properties -Dgg.log.level=INFO -cp <OGG_HOME>/ggjava/resources/lib/*:<OGG_HOME>/thirdparty/cass/jsch-0.1.54.jar oracle.goldengate.cassandra.commitlogpurger.CassandraCommitLogPurger --cassCommitLogPurgerConfFile <OGG_HOME>/cassandraPurgeUtil/commitlogpurger.properties --purgeInterval 1 --cassUnProcessedFilesPurgeInterval 3
- <OGG_HOME>/ggjava/resources/lib/* is the directory where the purger utility is located.
- <OGG_HOME>/thirdparty/cass/jsch-0.1.54.jar is the dependent jar needed to execute the purger program.
- --cassCommitLogPurgerConfFile, --purgeInterval, and --cassUnProcessedFilesPurgeInterval are runtime arguments.
Sample script to run the commit log purger utility:
#!/bin/bash
echo "fileSystemType=remote" > commitlogpurger.properties
echo "chkDir=dirchk" >> commitlogpurger.properties
echo "cdcStagingDir=data/cdc_raw_staged" >> commitlogpurger.properties
echo "userName=username" >> commitlogpurger.properties
echo "password=password" >> commitlogpurger.properties
java -cp ogghome/ggjava/resources/lib/*:ogghome/thirdparty/cass/jsch-0.1.54.jar oracle.goldengate.cassandra.commitlogpurger.CassandraCommitLogPurger --cassCommitLogPurgerConfFile commitlogpurger.properties --purgeInterval 1 --cassUnProcessedFilesPurgeInterval 3
Parent topic: How to Run the Purge Utility
8.1.2.12.1.1.3 Runtime Arguments
To execute, the utility class CassandraCommitLogPurger requires a mandatory run-time argument, cassCommitLogPurgerConfFile.
Available runtime arguments to the CassandraCommitLogPurger class are:
[required] --cassCommitLogPurgerConfFile path to config.properties
[optional] --purgeInterval
[optional] --cassUnProcessedFilesPurgeInterval
Parent topic: How to Run the Purge Utility
8.1.2.12.1.2 Sample config.properties for Local File System
fileSystemType=local
chkDir=apache-cassandra-3.11.2/data/chkdir/
cdcStagingDir=apache-cassandra-3.11.2/data/$nodeAddress/commitlog/
Parent topic: Cassandra CDC Commit Log Purger
8.1.2.12.1.3 Argument cassCommitLogPurgerConfFile
The cassCommitLogPurgerConfFile argument takes a config file with the following fields.
Table 8-1 Argument cassCommitLogPurgerConfFile
| Parameters | Description |
|---|---|
| fileSystemType | Default: local. Mandatory: Yes. Legal Values: remote / local. Description: On every live node in the cluster, the CDC staged commit logs can be accessed through SFTP or NFS. |
| chkDir | Default: None. Mandatory: Yes. Legal Values: checkpoint directory path. Description: Location of the Cassandra checkpoint directory. |
| cdcStagingDir | Default: None. Mandatory: Yes. Legal Values: staging directory path. Description: Location of the Cassandra staging directory where the CDC commit logs are present. |
| userName | Default: None. Mandatory: No. Legal Values: Valid SFTP auth username. Description: SFTP user name to connect to the server. |
| password | Default: None. Mandatory: No. Legal Values: Valid SFTP auth password. Description: SFTP password to connect to the server. |
| port | Default: 22. Mandatory: No. Legal Values: Valid SFTP auth port. Description: SFTP port number. |
| privateKey | Default: None. Mandatory: No. Legal Values: valid path to the privateKey file. Description: The private key is used to perform the authentication, allowing you to log in without having to specify a password. |
| passPhase | Default: None. Mandatory: No. Legal Values: valid password for the privateKey. Description: The private key is typically password protected; if so, provide the passPhase used to protect it. |
Parent topic: Cassandra CDC Commit Log Purger
8.1.2.12.1.3.1 Sample config.properties for Local File System
fileSystemType=local
chkDir=apache-cassandra-3.11.2/data/chkdir/
cdcStagingDir=apache-cassandra-3.11.2/data/$nodeAddress/commitlog/
Parent topic: Argument cassCommitLogPurgerConfFile
8.1.2.12.1.3.2 Sample config.properties for Remote File System
fileSystemType=remote
chkDir=apache-cassandra-3.11.2/data/chkdir/
cdcStagingDir=apache-cassandra-3.11.2/data/$nodeAddress/commitlog/
username=username
password=@@@@@
port=22
Parent topic: Argument cassCommitLogPurgerConfFile
8.1.2.12.1.4 Argument purgeInterval
Setting the optional argument purgeInterval helps in configuring the process to run as a daemon.
This argument is an integer value representing the interval, in days, between clean-up runs. For example, if purgeInterval is set to 1, then the process runs every day at the time the process started.
Parent topic: Cassandra CDC Commit Log Purger
8.1.2.12.1.5 Argument cassUnProcessedFilesPurgeInterval
Setting the optional argument cassUnProcessedFilesPurgeInterval helps in purging historical commit logs for all the nodes that do not have a last processed file.
If cassUnProcessedFilesPurgeInterval is not set, then the default value is 2 days; commit log files older than 2 days, or older than the configured number of days, are purged. The CassandraCommitLogPurger utility cannot purge files that are only a day old; the interval should be either the default 2 days or more.
{ "start_timestamp": -1, "sequence_id": 34010434, "updated_datetime": "2018-04-19 23:24:57.164-0700", "nodes": [ { "address": "10.247.136.146", "offset": 0, "id": 0 } , { "address": "10.247.136.142", "file": "CommitLog-6-1524110205398.log", "offset": 33554405, "id": 1524110205398 } , { "address": "10.248.10.24", "file": "CommitLog-6-1524110205399.log", "offset": 33554406, "id": 1524110205399 } , { "address": "10.248.10.25", "offset": 0, "id": 0 } , { "address": "10.248.10.26", "offset": 0, "id": 0 } ] }
In this example, the Cassandra node addresses 10.248.10.25 and 10.248.10.26 do not have a last processed file. The commit log files on those nodes will be purged as per the configured number of days in the cassUnProcessedFilesPurgeInterval argument value.
Note:
The last processed file may not be available due to the following reasons:
- A new node was added into the cluster and no commit log files have been processed through Cassandra Extract yet.
- All the commit log files processed from this node do not contain operation data as per the table wildcard match.
- All the commit log files processed from this node contain operation records that were not written to the trail file due to de-duplication.
Parent topic: Cassandra CDC Commit Log Purger
8.1.2.13 Multiple Extract Support
Multiple Extract groups in a single Oracle GoldenGate for Big Data installation can be configured to connect to the same Cassandra cluster.
To run multiple Extract groups:
- One (and only one) Extract group can be configured to move the commit log files in the cdc_raw directory on the Cassandra nodes to a staging directory. The movecommitlogstostagingdir parameter is enabled by default and no additional configuration is required for this Extract group.
- All the other Extract groups should be configured with the nomovecommitlogstostagingdir parameter in the Extract parameter (.prm) file, as in the sketch after this list.
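A minimal parameter sketch for a second Extract group against the same cluster; the group name cass2 and the trail prefix are illustrative, the remaining lines reuse values from the earlier examples, and the parameter is written standalone in the .prm as the list above suggests:

EXTRACT cass2
NOMOVECOMMITLOGSTOSTAGINGDIR
JVMOPTIONS CLASSPATH ggjava/ggjava.jar:DependencyDownloader/dependencies/cassandra_capture_4x/*
TRANLOGOPTIONS CDCREADERSDKVERSION 4x
TRANLOGOPTIONS CDCLOGDIRTEMPLATE /path/to/data/cdc_raw
SOURCEDB USERIDALIAS cass
EXTTRAIL ./dirdat/z2
TABLE source.*;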
Parent topic: Apache Cassandra
8.1.2.14 CDC Configuration Reference
The following properties are used with Cassandra change data capture.
| Properties | Required/Optional | Location | Default | Explanation |
|---|---|---|---|---|
| | Optional | Extract parameter (.prm) file | | Use only during the initial load process. |
| DBOPTIONS FETCHBATCHSIZE | Optional | Extract parameter (.prm) file | | Use only during the initial load process. Specifies the number of rows of data the driver attempts to fetch on each request submitted to the database server. The parameter value should be lower than the database configuration parameter tombstone_warn_threshold. Oracle recommends that you set this parameter value to 5000 for optimum initial load Extract performance. |
| TRANLOGOPTIONS CDCLOGDIRTEMPLATE | Required | Extract parameter (.prm) file | None | The CDC commit log directory path template. |
| | Optional | Extract parameter (.prm) file | None | The secure file transfer protocol (SFTP) connection details used to pull and transfer the commit log files. |
| CLUSTERCONTACTPOINTS | Optional | GLOBALS file. Note: Starting from Oracle GoldenGate for Big Data release 23.1.0.0.0, this parameter will be deprecated. | | A comma separated list of nodes to be used for a connection to the Cassandra cluster. You should provide at least one node address. |
| TRANLOGOPTIONS CDCREADERSDKVERSION | Optional | Extract parameter (.prm) file | | The SDK version for the CDC reader capture API. |
| | Optional | Extract parameter (.prm) file | | Purge CDC commit log files post Extract processing. |
| JVMOPTIONS [CLASSPATH <classpath> \| BOOTOPTIONS <options>] | Mandatory | Extract parameter (.prm) file | None | Sets the Java classpath and the Java Virtual Machine boot options for the Extract process. |
| JVMBOOTOPTIONS | Optional | GLOBALS file. Note: Starting from Oracle GoldenGate for Big Data release 23.1.0.0.0, this parameter will be deprecated. | None | The boot options for the Java Virtual Machine. Multiple options are delimited by a space character. |
| JVMCLASSPATH | Required | GLOBALS file. Note: Starting from Oracle GoldenGate for Big Data release 23.1.0.0.0, this parameter will be deprecated. | None | The classpath for the Java Virtual Machine. You can include an asterisk (*) wildcard. |
| OGGSOURCE | Required | GLOBALS file. Note: Starting from Oracle GoldenGate for Big Data release 23.1.0.0.0, this parameter will be deprecated. | None | The source database for CDC capture or database queries. The valid value is CASSANDRA. |
| SOURCEDB | Required | Extract parameter (.prm) file | None | A single Cassandra node address that is used for a connection to the Cassandra cluster and to query the metadata for the captured tables. |
| MOVECOMMITLOGSTOSTAGINGDIR \| NOMOVECOMMITLOGSTOSTAGINGDIR | Optional | Extract parameter (.prm) file | Enabled | Enabled by default; instructs the Extract group to move the commit log files in the cdc_raw directory on the Cassandra nodes to a staging directory. |
| SSL | Optional | GLOBALS or Extract parameter (.prm) file | | Use for basic SSL support during connection. Additional JSSE configuration through Java System properties is expected when enabling this. |
| CPPDRIVEROPTIONS SSL PEMPUBLICKEYFILE | Optional | GLOBALS or Extract parameter (.prm) file. A string that indicates the absolute path with the fully qualified file name; this file is a must for the SSL connection. | None | Indicates the PEM formatted public key file used to verify the peer's certificate. This property is needed for a one-way handshake or basic SSL connection. |
| CPPDRIVEROPTIONS SSL ENABLECLIENTAUTH | Optional | GLOBALS or Extract parameter (.prm) file | | Enabled indicates two-way SSL encryption between client and server. It is required to authenticate both the client and the server through PEM formatted certificates. |
| CPPDRIVEROPTIONS SSL PEMCLIENTPUBLICKEYFILE | Optional | GLOBALS or Extract parameter (.prm) file. A string that indicates the absolute path with the fully qualified file name; this file is a must for the SSL connection. | None | Use for a PEM formatted public key file name used to verify the client's certificate. This is a must if you are using ENABLECLIENTAUTH. |
| CPPDRIVEROPTIONS SSL PEMCLIENTPRIVATEKEYFILE | Optional | GLOBALS or Extract parameter (.prm) file. A string that indicates the absolute path with the fully qualified file name; this file is a must for the SSL connection. | None | Use for a PEM formatted private key file name used to verify the client's certificate. This is a must if you are using ENABLECLIENTAUTH. |
| CPPDRIVEROPTIONS SSL PEMCLIENTPRIVATEKEYPASSWD | Optional | GLOBALS or Extract parameter (.prm) file. A string. | None | Sets the password for the PEM formatted private key file used to verify the client's certificate. This is a must if the private key file is protected with a password. |
| CPPDRIVEROPTIONS SSL PEERCERTVERIFICATIONFLAG | Optional | GLOBALS or Extract parameter (.prm) file. An integer. | | Sets the verification required on the peer's certificate. The range is 0-4: 0 disables certificate identity verification, 1 verifies the peer certificate, 2 verifies the peer identity, 3 is not used (similar to disabling certificate identity verification), and 4 verifies the peer identity by its domain name. |
| | Optional | GLOBALS or Extract parameter (.prm) file | | Enables retrieving the host name for IP addresses using reverse IP lookup. |
| TRANLOGOPTIONS TRACKSCHEMACHANGES | Optional | Extract parameter (.prm) file | By default, the property is disabled. | Enables Extract to capture table-level DDL changes from the source at runtime. Enable this to ensure that the table metadata within the trail stays in sync with the source without any downtime. When TRACKSCHEMACHANGES is disabled, the capture process will ABEND if a DDL change is detected at the source table. |
Parent topic: Apache Cassandra
8.1.2.15 Troubleshooting
No data captured by the Cassandra Extract process.
- The Cassandra database has not flushed the data from the active commit log files to the CDC commit log files. The flush is dependent on the load of the Cassandra cluster.
- The Cassandra Extract captures data from the CDC commit log files only.
- Check the CDC property of the source table. The CDC property of the source table should be set to true (see the example after this list).
- Data is not captured if the TRANLOGOPTIONS CDCREADERSDKVERSION 3.9 parameter is in use and the JVMCLASSPATH is configured to point to Cassandra 3.10 or 3.11 JAR files.
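One way to verify the CDC property is to describe the table from cqlsh and confirm that cdc = true appears among the table options; a sketch using the sample table from earlier sections:

cqlsh> DESC TABLE ks_demo_rep1.mytable;

The table definition printed by cqlsh should include cdc = true in the WITH options of the table.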
Error: OGG-01115 Function getInstance not implemented.
-
The following line is missing from the GLOBALS file.
OGGSOURCE CASSANDRA
Error: Unable to connect to Cassandra cluster, Exception: com.datastax.driver.core.exceptions.NoHostAvailableException
This indicates that the connection to the Cassandra cluster was unsuccessful.
Check the following parameters:
CLUSTERCONTACTPOINTS
Error: Exception in thread "main" java.lang.NoClassDefFoundError: oracle/goldengate/capture/cassandra/CassandraCDCProcessManager
Check the JVMOPTIONS CLASSPATH
parameter in the GLOBALS
file.
Error: oracle.goldengate.util.Util - Unable to invoke method while constructing object. Unable to create object of class "oracle.goldengate.capture.cassandracapture311.SchemaLoader3DOT11" Caused by: java.lang.NoSuchMethodError: org.apache.cassandra.config.DatabaseDescriptor.clientInitialization()V
There is a mismatch in the Cassandra SDK version configuration. The
TRANLOGOPTIONS CDCREADERSDKVERSION 3.11
parameter is in use and
the JVMCLASSPATH
may have the Cassandra 3.9 JAR file path.
Error: OGG-25171 Trail file '/path/to/trail/gg' is remote. Only local trail allowed for this extract.
A Cassandra Extract should only be configured to write to local trail
files. When adding trail files for Cassandra Extract, use the
EXTTRAIL
option. For example:
ADD EXTTRAIL ./dirdat/z1, EXTRACT cass
Errors: OGG-868 error message or OGG-4510 error message
The cause could be any of the following:
- Unknown user or invalid password
- Unknown node address
- Insufficient memory
Another cause could be that the connection to the Cassandra database is broken. The error message indicates the database error that has occurred.
Error: OGG-251712 Keyspace keyspacename does not exist in the database.
The issue could be due to these conditions:
- During the Extract initial load process, you may have deleted the KEYSPACE keyspacename from the Cassandra database.
- The KEYSPACE keyspacename does not exist in the Cassandra database.
Error: OGG-25175 Unexpected error while fetching row.
This can occur if the connection to the Cassandra database is broken during initial load process.
Error: “Server-side warning: Read 915936 live rows and 12823104 tombstone cells for query SELECT * FROM keyspace.table(see tombstone_warn_threshold)”.
This is likely to occur when the value of the initial load DBOPTIONS FETCHBATCHSIZE parameter is greater than the Cassandra database configuration parameter tombstone_warn_threshold.
Increase the value of tombstone_warn_threshold or reduce the DBOPTIONS FETCHBATCHSIZE value to get around this issue.
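For example, the batch size can be lowered in the initial load Extract parameter file; 5000 is the value recommended elsewhere in this chapter, and it should stay below your tombstone_warn_threshold:

DBOPTIONS FETCHBATCHSIZE 5000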
Duplicate records in the Cassandra Extract trail.
Internal tests on a multi-node Cassandra cluster have revealed that there is a possibility of duplicate records in the Cassandra CDC commit log files. The duplication in the Cassandra commit log files is more common when there is heavy write parallelism, write errors on nodes, and multiple retry attempts on the Cassandra nodes. In these cases, it is expected that Cassandra trail file will have duplicate records.
JSchException or SftpException in the Extract Report File
Verify that the SFTP credentials (user
,
password
, and privatekey
) are correct. Check
that the SFTP user has read and write permissions for the cdc_raw
directory on each of the nodes in the Cassandra cluster.
ERROR o.g.c.c.CassandraCDCProcessManager - Exception during creation of CDC staging directory [{}]java.nio.file.AccessDeniedException
The Extract process does not have permission to create CDC commit log
staging directory. For example, if the cdc_raw
commit log directory
is /path/to/cassandra/home/data/cdc_raw
, then the staging directory
would be
/path/to/cassandra/home/data/cdc_raw/../cdc_raw_staged
.
Extract report file shows a lot of DEBUG log statements
On a production system, you do not need to enable debug logging. To use INFO level logging, make sure that the Extract parameter file includes this line:
JVMBOOTOPTIONS -Dlogback.configurationFile=AdapterExamples/big-data/cassandracapture/logback.xml
To enable SSL in Oracle GoldenGate Cassandra Extract, you have to enable SSL in the GLOBALS file or in the Extract parameter file.
If the SSL keyword is missing, then Extract assumes that you wanted to connect without SSL. So, if the cassandra.yaml file has an SSL configuration entry, then the connection fails.
SSL is enabled and it is one-way handshake
You must specify the CPPDRIVEROPTIONS SSL PEMPUBLICKEYFILE /scratch/testcassandra/testssl/ssl/cassandra.pem property.
If this property is missing, then Extract generates this error:
2018-06-09 01:55:37 ERROR OGG-25180 The PEM formatted public key file used to verify the peer's certificate is missing. If SSL is enabled, then it is must to set PEMPUBLICKEYFILE in your Oracle GoldenGate GLOBALS file or in Extract parameter file
SSL is enabled and it is two-way handshake
You must specify these properties for SSL two-way handshake:
CPPDRIVEROPTIONS SSL ENABLECLIENTAUTH CPPDRIVEROPTIONS SSL PEMCLIENTPUBLICKEYFILE /scratch/testcassandra/testssl/ssl/datastax-cppdriver.pem CPPDRIVEROPTIONS SSL PEMCLIENTPRIVATEKEYFILE /scratch/testcassandra/testssl/ssl/datastax-cppdriver-private.pem CPPDRIVEROPTIONS SSL PEMCLIENTPRIVATEKEYPASSWD cassandra
Additionally, consider the following:
- If ENABLECLIENTAUTH is missing, then Extract assumes that it is a one-way handshake, so it ignores PEMCLIENTPUBLICKEYFILE and PEMCLIENTPRIVATEKEYFILE. The following error occurs because the cassandra.yaml file should have require_client_auth set to true:
  2018-06-09 02:00:35 ERROR OGG-00868 No hosts available for the control connection.
- If ENABLECLIENTAUTH is used and PEMCLIENTPRIVATEKEYFILE is missing, then this error occurs:
  2018-06-09 02:04:46 ERROR OGG-25178 The PEM formatted private key file used to verify the client's certificate is missing. For two way handshake or if ENABLECLIENTAUTH is set, then it is mandatory to set PEMCLIENTPRIVATEKEYFILE in your Oracle GoldenGate GLOBALS file or in Extract parameter file.
- If ENABLECLIENTAUTH is used and PEMCLIENTPUBLICKEYFILE is missing, then this error occurs:
  2018-06-09 02:06:20 ERROR OGG-25179 The PEM formatted public key file used to verify the client's certificate is missing. For two way handshake or if ENABLECLIENTAUTH is set, then it is mandatory to set PEMCLIENTPUBLICKEYFILE in your Oracle GoldenGate GLOBALS file or in Extract parameter file.
- If the password was set while generating the client private key file, then you must add PEMCLIENTPRIVATEKEYPASSWD to avoid this error:
  2018-06-09 02:09:48 ERROR OGG-25177 The SSL certificate: /scratch/jitiwari/testcassandra/testssl/ssl/datastax-cppdriver-private.pem can not be loaded. Unable to load private key.
- If any of the PEM files is missing from the specified absolute path, then this error occurs:
  2018-06-09 02:12:39 ERROR OGG-25176 Can not open the SSL certificate: /scratch/jitiwari/testcassandra/testssl/ssl/cassandra.pem.
com.jcraft.jsch.JSchException: UnknownHostKey
If the Extract process abends with this issue, then it is likely that some or all of the Cassandra node addresses are missing in the SSH known_hosts file.
For more information, see Setup SSH Connection to the Cassandra Nodes.
General SSL Errors
Consider these general errors:
- The SSL connection may fail if you have enabled all SSL required parameters in the Extract or GLOBALS file and SSL is not configured in the cassandra.yaml file.
- The absolute path or the qualified name of the PEM file may not be correct. There could be an access issue on the PEM file storage location.
- The password added while generating the client private key file may not be correct or you may not have enabled it in the Extract parameter or GLOBALS file.
Parent topic: Apache Cassandra
8.1.2.16 Cassandra Capture Client Dependencies
What are the dependencies for the Cassandra Capture (Extract) to connect to Apache Cassandra databases?
The following third party libraries are needed to run Cassandra Change Data Capture.
Capturing from Apache Cassandra 3.x versions:
- cassandra-driver-core (com.datastax.cassandra) version 3.3.1
- cassandra-all (org.apache.cassandra) version 3.11.0
- gson (com.google.code.gson) version 2.8.0
- jsch (com.jcraft) version 0.1.54
Capturing from Apache Cassandra 4.x versions:
- java-driver-core (com.datastax.oss) version 4.14.1
- cassandra-all (org.apache.cassandra) version 4.0.5
- gson (com.google.code.gson) version 2.8.0
- jsch (com.jcraft) version 0.1.54
You can use the Dependency Downloader scripts to download the Datastax Java Driver and its associated dependencies. For more information, see Dependency Downloader.
Parent topic: Apache Cassandra
8.1.3 Apache Kafka
The Oracle GoldenGate capture (Extract) for Kafka is used to read messages from a Kafka topic or topics and convert data into logical change records written to GoldenGate trail files. This section explains how to use Oracle GoldenGate capture for Kafka.
- Overview
- Prerequisites
- General Terms and Functionality of Kafka Capture
- Generic Mutation Builder
- Kafka Connect Mutation Builder
- Example Configuration Files
Parent topic: Source
8.1.3.1 Overview
Kafka has gained market traction in recent years and has become a leader in the enterprise messaging space. Kafka is a cluster-based messaging system that provides high availability, failover, data integrity through redundancy, and high performance. Kafka is now the leading application for implementations of the Enterprise Service Bus architecture. The Kafka Capture Extract process reads messages from Kafka and transforms those messages into logical change records, which are written to Oracle GoldenGate trail files. The generated trail files can then be used to propagate the data in the trail files to various RDBMS implementations or other integrations supported by Oracle GoldenGate Replicat processes.
Parent topic: Apache Kafka
8.1.3.2 Prerequisites
8.1.3.2.1 Set up Credential Store Entry to Detect Source Type
The source database type is detected from the prefix of the userid. The generic format for userid is as follows:
<dbtype>://<db-user>@<comma separated list of server addresses>:<port>
The userid value for Kafka capture should be any value with the prefix kafka://. For example:
alter credentialstore add user kafka:// password somepass alias kafka
Note:
You can specify a dummy password for Kafka while setting up the credentials.
Parent topic: Prerequisites
8.1.3.3 General Terms and Functionality of Kafka Capture
- Kafka Streams
- Kafka Message Order
- Kafka Message Timestamps
- Kafka Message Coordinates
- Start Extract Modes
- General Configuration Overview
- OGGSOURCE parameter
- The Extract Parameter File
- Kafka Consumer Properties File
Parent topic: Apache Kafka
8.1.3.3.1 Kafka Streams
As a Kafka consumer, you can read from one or more topics. Additionally, each topic can be divided into one or more partitions. Each discrete topic and partition combination is a Kafka stream. This topic discusses Kafka streams extensively, so it is important to clearly define the term here. For example, the following topic and partition combinations represent five distinct Kafka streams:
- Topic: TEST1 Partition: 0
- Topic: TEST1 Partition: 1
- Topic: TEST2 Partition: 0
- Topic: TEST2 Partition: 1
- Topic: TEST2 Partition: 2
Parent topic: General Terms and Functionality of Kafka Capture
8.1.3.3.2 Kafka Message Order
Messages received from the KafkaConsumer for an individual stream should be in the order as stored in the Kafka commit log. However, Kafka streams move independently from one another and the order in which messages are received from different streams is nondeterministic.
For example, consider two streams:
- Stream 1: Topic TEST1, partition 0
- Stream 2: Topic TEST1, partition 1
Messages in stream 1 (topic|partition|offset|timestamp):
TEST1|0|0|1588888086210
TEST1|0|1|1588888086220
TEST1|0|2|1588888086230
TEST1|0|3|1588888086240
TEST1|0|4|1588888086250
Messages in stream 2:
TEST1|1|0|1588888086215
TEST1|1|1|1588888086225
TEST1|1|2|1588888086235
TEST1|1|3|1588888086245
TEST1|1|4|1588888086255
One possible order in which the messages are received in a run:
TEST1|1|0|1588888086215
TEST1|1|1|1588888086225
TEST1|0|0|1588888086210
TEST1|0|1|1588888086220
TEST1|0|2|1588888086230
TEST1|0|3|1588888086240
TEST1|0|4|1588888086250
TEST1|1|2|1588888086235
TEST1|1|3|1588888086245
TEST1|1|4|1588888086255
Another possible order in a different run:
TEST1|0|0|1588888086210
TEST1|0|1|1588888086220
TEST1|1|0|1588888086215
TEST1|1|1|1588888086225
TEST1|0|2|1588888086230
TEST1|0|3|1588888086240
TEST1|0|4|1588888086250
TEST1|1|2|1588888086235
TEST1|1|3|1588888086245
TEST1|1|4|1588888086255
Note:
In both runs, the messages belonging to the same Kafka stream are delivered in the order in which they occur in that stream. However, messages from different streams are interlaced in a nondeterministic manner.
Parent topic: General Terms and Functionality of Kafka Capture
8.1.3.3.3 Kafka Message Timestamps
Each Kafka message has a timestamp associated with it. The timestamp on the Kafka message maps to the operation timestamp for the record in the generated trail file. Timestamps on Kafka messages are not guaranteed to be monotonically increasing even in the case where extract is reading from only one stream (single topic and partition). Kafka has no requirement that Kafka message timestamps are monotonically increasing even within a stream. The Kafka Producer provides an API whereby the message timestamp can be explicitly set on messages. This means a Kafka Producer can set the Kafka message timestamp to any value.
When reading from multiple topics and/or a topic with multiple partitions it is almost certain that trail files generated by Kafka capture will not have operation timestamps that are monotonically increasing. Kafka streams move independently from one another and there is no guarantee of delivery order for messages received from different streams. Messages from different streams can interlace in any random order when the Kafka Consumer is reading them from a Kafka cluster.
Parent topic: General Terms and Functionality of Kafka Capture
8.1.3.3.4 Kafka Message Coordinates
Kafka Capture performs message gap checking to ensure message consistency within the context of a message stream. For every Kafka stream from which Kafka Capture is consuming messages, there should be no gap in the Kafka message offset sequence.
If a gap is found in the message offset sequence, then the Kafka capture logs an error and the Kafka Capture extract process will abend.
Message gap checking can be disabled by setting the following in the .prm file:
SETENV (PERFORMMESSAGEGAPCHECK = "false")
Parent topic: General Terms and Functionality of Kafka Capture
8.1.3.3.5 Start Extract Modes
Extract can be configured to start replication from two distinct points.
8.1.3.3.5.1 Start Earliest
ggsci> ADD EXTRACT kafka, TRANLOG
ggsci> ADD EXTTRAIL dirdat/kc, EXTRACT kafka
ggsci> START EXTRACT kafka
Parent topic: Start Extract Modes
8.1.3.3.5.2 Start Timestamp
ggsci> ADD EXTRACT kafka, TRANLOG BEGIN 2019-03-27 23:05:05.123456
ggsci> ADD EXTTRAIL dirdat/kc, EXTRACT kafka
ggsci> START EXTRACT kafka
ggsci> ADD EXTRACT kafka, TRANLOG BEGIN NOW
ggsci> ADD EXTTRAIL dirdat/kc, EXTRACT kafka
ggsci> START EXTRACT kafka
Note:
When starting from a point in time, Kafka Capture starts from the first available record in the stream that fits the criteria (time equal to or greater than the configured time). Extract then continues from that first message regardless of the timestamps of subsequent messages. As previously discussed, there is no guarantee or requirement that Kafka message timestamps are monotonically increasing.
Alter Extract
ggsci> STOP EXTRACT kafka
ggsci> ALTER EXTRACT kafka BEGIN {Timestamp}
ggsci> START EXTRACT kafka
Alter Now
ggsci> STOP EXTRACT kafka
ggsci> ALTER EXTRACT kafka BEGIN NOW
ggsci> START EXTRACT kafka
Parent topic: Start Extract Modes
8.1.3.3.6 General Configuration Overview
Parent topic: General Terms and Functionality of Kafka Capture
8.1.3.3.7 OGGSOURCE parameter
OGGSOURCE KAFKA
JVMCLASSPATH ggjava/ggjava.jar:/kafka/client/path/*:dirprm
JVMBOOTOPTIONS -Xmx512m -Dlog4j.configurationFile=log4j-default.properties -Dgg.log.level=INFO
OGGSOURCE KAFKA
: The first line indicates that the
source of replication is Kafka.
JVMCLASSPATH
ggjava/ggjava.jar:/kafka/client/path/*:dirprm
: The second line sets the
Java JVM classpath. The Java classpath provides the pathing to load all the required
Oracle GoldenGate for Big Data libraries and Kafka client libraries. The Oracle
GoldenGate for Big Data library should be first in the list
(ggjava.jar
). The Kafka client libraries, the Kafka Connect
framework, and the Kafka Connect converters are not included with the Oracle
GoldenGate for Big Data installation. These libraries must be obtained
independently. Oracle recommends you to use the same version of the Kafka client as
the version of the Kafka broker to which you are connecting. The Dependency
Downloading tool can be used to download the dependency libraries. Alternately, the
pathing can be set to a Kafka installation. For more information about Dependency
Downloader, see Dependency Downloader in the
Installing and Upgrading Oracle GoldenGate for Big Data guide.
JVMBOOTOPTIONS -Xmx512m
-Dlog4j.configurationFile=log4j-default.properties -Dgg.log.level=INFO
:
The third line is the JVM boot options. Use this to configure the maximum Java heap
size (-Xmx512m) and the log4j logging parameters to generate the
.log
file
(-Dlog4j.configurationFile=log4j-default.properties
-Dgg.log.level=INFO
)
Note:
Starting from Oracle GoldenGate for Big Data release 23.1.0.0.0, this parameter will be deprecated.
Parent topic: General Terms and Functionality of Kafka Capture
8.1.3.3.8 The Extract Parameter File
The Extract parameter file is named <extract name>.prm. For example, the Extract parameter file for the Extract process kc would be kc.prm.
EXTRACT KC
-- alter credentialstore add user kafka:// password <somepass> alias kafka
SOURCEDB USERIDALIAS kafka
JVMOPTIONS CLASSPATH ggjava/ggjava.jar:/kafka/client/path/*
JVMOPTIONS BOOTOPTIONS -Xmx512m -Dlog4j.configurationFile=log4j-default.properties -Dgg.log.level=INFO
TRANLOGOPTIONS GETMETADATAFROMVAM
TRANLOGOPTIONS KAFKACONSUMERPROPERTIES kafka_consumer.properties
EXTTRAIL dirdat/kc
TABLE QASOURCE.TOPIC1;
EXTRACT KC: This line sets the name of the Extract process.
TRANLOGOPTIONS KAFKACONSUMERPROPERTIES kafka_consumer.properties: This line sets the name and location of the Kafka Consumer properties file. The Kafka Consumer properties file contains the Kafka-specific configuration for connectivity and security to the Kafka cluster. Documentation on the Kafka Consumer properties can be found in the Kafka Documentation.
EXTTRAIL dirdat/kc: This line sets the location and prefix of the trail files to be generated.
TABLE QASOURCE.TOPIC1;: This line is the Extract TABLE statement. There can be one or more TABLE statements. The schema name in the example is QASOURCE. The schema name is an Oracle GoldenGate artifact and it is required. It can be set to any legal string. The schema name cannot be wildcarded. Each Extract process only supports one schema name. The configured table name maps to the Kafka topic name. The table configuration does support wildcards. Legal Kafka topic names can have the following characters:
- a-z (lowercase a to z)
- A-Z (uppercase A to Z)
- 0-9 (digits 0 to 9)
- . (period)
- _ (underscore)
- - (hyphen)
Kafka topic names are case sensitive, so MYTOPIC1 and MyTopic1 are different Kafka topics.
Examples of legal Extract TABLE statements:
TABLE TESTSCHEMA.TEST*;
TABLE TESTSCHEMA.MyTopic1;
TABLE TESTSCHEMA."My.Topic1";
Example of an illegal configuration (multiple schema names are used):
TABLE QASOURCE.TEST*;
TABLE TESTSCHEMA.MYTOPIC1;
Example of an illegal configuration (a topic name containing a period is not quoted):
TABLE QASOURE.My.Topic1;
Example of an illegal configuration (the schema name cannot be wildcarded):
TABLE *.*;
Optional .prm configuration:
SETENV (PERFORMMESSAGEGAPCHECK = "false")
Parent topic: General Terms and Functionality of Kafka Capture
8.1.3.3.9 Kafka Consumer Properties File
The Kafka Consumer properties file contains the properties to configure the Kafka Consumer including how to connect to the Kafka cluster and security parameters.
#Kafka Properties
bootstrap.servers=den02box:9092
group.id=mygroupid
key.deserializer=org.apache.kafka.common.serialization.ByteArrayDeserializer
value.deserializer=org.apache.kafka.common.serialization.ByteArrayDeserializer
8.1.3.3.9.1 Encrypt Kafka Producer Properties
Sensitive values, such as passwords, in the Kafka properties file can be replaced with references to credentials stored in the Oracle GoldenGate Credential Store. For more information about how to use the Credential Store, see Using Identities in Oracle GoldenGate Credential Store.
For example:
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="alice" password="alice";
can be replaced with:
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username=ORACLEWALLETUSERNAME[alias domain_name] password=ORACLEWALLETPASSWORD[alias domain_name];
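As an illustration only (the user name alice, the alias alias, and the domain domain_name are placeholder assumptions), the referenced wallet entry could be created with Oracle GoldenGate credential store commands such as the following:
ggsci> ADD CREDENTIALSTORE
ggsci> ALTER CREDENTIALSTORE ADD USER alice PASSWORD alice ALIAS alias DOMAIN domain_name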
Parent topic: Kafka Consumer Properties File
8.1.3.4 Generic Mutation Builder
The Generic Mutation Builder maps each source Kafka message into a logical change record with the following three fields:
- id: This is the primary key for the table. It is typed as a string. The value is the coordinates of the message in Kafka in the following format: topic name:partition number:offset. For example, the value for topic TEST, partition 1, and offset 245 would be TEST:1:245.
- key: This is the message key field from the source Kafka message. The field is typed as binary. The value of the field is the key from the source Kafka message propagated as bytes.
- payload: This is the message payload or value from the source Kafka message. The field is typed as binary. The value of the field is the payload from the source Kafka message propagated as bytes.
The Generic Mutation Builder has the following additional characteristics:
- All records are propagated as insert operations.
- Each Kafka message creates an operation in its own transaction.
Sample metadata from the trail file using logdump:
Logdump 2666 >n ___________________________________________________________________ Hdr-Ind : E (x45) Partition : . (x00) UndoFlag : . (x00) BeforeAfter: A (x41) RecLength : 196 (x00c4) IO Time : 2021/07/22 14:57:25.085.436 IOType : 170 (xaa) OrigNode : 2 (x02) TransInd : . (x03) FormatType : R (x52) SyskeyLen : 0 (x00) Incomplete : . (x00) DDR/TDR index: (001, 001) AuditPos : 0 Continued : N (x00) RecCount : 1 (x01) 2021/07/22 14:57:25.085.436 Metadata Len 196 RBA 1335 Table Name: QASOURCE.TOPIC1 * 1)Name 2)Data Type 3)External Length 4)Fetch Offset 5)Scale 6)Level 7)Null 8)Bump if Odd 9)Internal Length 10)Binary Length 11)Table Length 12)Most Sig DT 13)Least Sig DT 14)High Precision 15)Low Precision 16)Elementary Item 17)Occurs 18)Key Column 19)Sub DataType 20)Native DataType 21)Character Set 22)Character Length 23)LOB Type 24)Partial Type 25)Remarks * TDR version: 11 Definition for table QASOURCE.TOPIC1 Record Length: 20016 Columns: 3 id 64 8000 0 0 0 0 0 8000 8000 0 0 0 0 0 1 0 1 0 12 -1 0 0 0 key 64 16000 8005 0 0 1 0 8000 8000 0 0 0 0 0 1 0 0 4 -3 -1 0 0 0 payload 64 8000 16010 0 0 1 0 4000 4000 0 0 0 0 0 1 0 0 4 -4 -1 0 1 0 End of definition
Parent topic: Apache Kafka
8.1.3.5 Kafka Connect Mutation Builder
The Kafka Connect Mutation Builder parses Kafka Connect messages into logical change records that are then written to Oracle GoldenGate trail files.
- Functionality and Limitations of the Kafka Connect Mutation Builder
- Primary Key
- Kafka Message Key
- Kafka Connect Supported Types
- How to Enable the Kafka Connect Mutation Builder
Parent topic: Apache Kafka
8.1.3.5.1 Functionality and Limitations of the Kafka Connect Mutation Builder
- All records are propagated as insert operations.
- Each Kafka message creates an operation in its own transaction.
- The Kafka message key must be a Kafka Connect primitive type or logical type.
- The Kafka message value must be either a primitive type/logical type or a record containing only primitive types, logical types, and container types. A record cannot contain another record as nested records are not currently supported.
- Kafka Connect array data types are mapped into binary fields. The content of the binary field will be the source array converted into a serialized JSON array.
- Kafka Connect map data types are mapped into binary fields. The contents of the binary field will be the source map converted into a serialized JSON.
- The source Kafka messages must be Kafka Connect messages.
- Kafka Connect Protobuf messages are not currently supported. (The current Kafka Capture functionality only supports primitive or logical types for the Kafka message key. The Kafka Connect Protobuf Converter does not support stand-alone primitives or logical types.)
- Each source topic must contain messages which conform to the same schema. Interlacing messages in the same Kafka topic which conform to different Kafka Connect schema is not currently supported.
- Schema changes are not currently supported.
Parent topic: Kafka Connect Mutation Builder
8.1.3.5.2 Primary Key
A primary key field is created in the output as a column named gg_id. The value of this field is the concatenated topic name, partition, and offset delimited by the : character. For example: TOPIC1:0:1001.
Parent topic: Kafka Connect Mutation Builder
8.1.3.5.3 Kafka Message Key
The message key is mapped into a column named gg_key.
Parent topic: Kafka Connect Mutation Builder
8.1.3.5.4 Kafka Connect Supported Types
- String
- 8 bit Integer
- 16 bit Integer
- 32 bit Integer
- 64 bit Integer
- Boolean
- 32 bit Float
- 64 bit Float
- Bytes (binary)
- Decimal
- Timestamp
- Date
- Time
Supported Container Types
- Array – Only arrays of primitive or logical types are supported. Data is mapped as a binary field the value of which is a JSON array document containing the contents of the source array.
- List – Only lists of primitive or logical types are supported. Data is mapped as a binary field the value of which is a JSON document containing the contents of the source list.
Parent topic: Kafka Connect Mutation Builder
8.1.3.5.5 How to Enable the Kafka Connect Mutation Builder
The Kafka Connect Mutation Builder is enabled by configuring the Kafka Connect key and value converters in the Kafka Consumer properties file.
For the Kafka Connect JSON Converter:
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
For the Kafka Connect Avro Converter:
key.converter=io.confluent.connect.avro.AvroConverter
value.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter.schema.registry.url=http://localhost:8081
The Kafka Capture functionality reads the Kafka Consumer properties file. If the Kafka Connect converters are configured, then the Kafka Connect Mutation Builder is invoked.
Sample metadata from the trail file using logdump
2021/08/03 09:06:05.243.881 Metadata Len 1951 RBA 1335 Table Name: TEST.KC * 1)Name 2)Data Type 3)External Length 4)Fetch Offset 5)Scale 6)Level 7)Null 8)Bump if Odd 9)Internal Length 10)Binary Length 11)Table Length 12)Most Sig DT 13)Least Sig DT 14)High Precision 15)Low Precision 16)Elementary Item 17)Occurs 18)Key Column 19)Sub DataType 20)Native DataType 21)Character Set 22)Character Length 23)LOB Type 24)Partial Type 25)Remarks * TDR version: 11 Definition for table TEST.KC Record Length: 36422 Columns: 30 gg_id 64 8000 0 0 0 0 0 8000 8000 0 0 0 0 0 1 0 1 0 12 -1 0 0 0 gg_key 64 4000 8005 0 0 1 0 4000 4000 0 0 0 0 0 1 0 0 0 -1 -1 0 1 0 string_required 64 4000 12010 0 0 1 0 4000 4000 0 0 0 0 0 1 0 0 0 -1 -1 0 1 0 string_optional 64 4000 16015 0 0 1 0 4000 4000 0 0 0 0 0 1 0 0 0 -1 -1 0 1 0 byte_required 134 23 20020 0 0 1 0 8 8 8 0 0 0 0 1 0 0 0 4 -1 0 0 0 byte_optional 134 23 20031 0 0 1 0 8 8 8 0 0 0 0 1 0 0 0 4 -1 0 0 0 short_required 134 23 20042 0 0 1 0 8 8 8 0 0 0 0 1 0 0 0 4 -1 0 0 0 short_optional 134 23 20053 0 0 1 0 8 8 8 0 0 0 0 1 0 0 0 4 -1 0 0 0 integer_required 134 23 20064 0 0 1 0 8 8 8 0 0 0 0 1 0 0 0 4 -1 0 0 0 integer_optional 134 23 20075 0 0 1 0 8 8 8 0 0 0 0 1 0 0 0 4 -1 0 0 0 long_required 134 23 20086 0 0 1 0 8 8 8 0 0 0 0 1 0 0 0 -5 -1 0 0 0 long_optional 134 23 20097 0 0 1 0 8 8 8 0 0 0 0 1 0 0 0 -5 -1 0 0 0 boolean_required 0 2 20108 0 0 1 0 1 1 0 0 0 0 0 1 0 0 4 -2 -1 0 0 0 boolean_optional 0 2 20112 0 0 1 0 1 1 0 0 0 0 0 1 0 0 4 -2 -1 0 0 0 float_required 141 50 20116 0 0 1 0 8 8 8 0 0 0 0 1 0 0 0 6 -1 0 0 0 float_optional 141 50 20127 0 0 1 0 8 8 8 0 0 0 0 1 0 0 0 6 -1 0 0 0 double_required 141 50 20138 0 0 1 0 8 8 8 0 0 0 0 1 0 0 0 8 -1 0 0 0 double_optional 141 50 20149 0 0 1 0 8 8 8 0 0 0 0 1 0 0 0 8 -1 0 0 0 bytes_required 64 8000 20160 0 0 1 0 4000 4000 0 0 0 0 0 1 0 0 4 -4 -1 0 1 0 bytes_optional 64 8000 24165 0 0 1 0 4000 4000 0 0 0 0 0 1 0 0 4 -4 -1 0 1 0 decimal_required 64 50 28170 0 0 1 0 50 50 0 0 0 0 0 1 0 0 0 12 -1 0 0 0 decimal_optional 64 50 28225 0 0 1 0 50 50 0 0 0 0 0 1 0 0 0 12 -1 0 0 0 timestamp_required 192 29 28280 0 0 1 0 29 29 29 0 6 0 0 1 0 0 0 11 -1 0 0 0 timestamp_optional 192 29 28312 0 0 1 0 29 29 29 0 6 0 0 1 0 0 0 11 -1 0 0 0 date_required 192 10 28344 0 0 1 0 10 10 10 0 2 0 0 1 0 0 0 9 -1 0 0 0 date_optional 192 10 28357 0 0 1 0 10 10 10 0 2 0 0 1 0 0 0 9 -1 0 0 0 time_required 192 18 28370 0 0 1 0 18 18 18 3 6 0 0 1 0 0 0 10 -1 0 0 0 time_optional 192 18 28391 0 0 1 0 18 18 18 3 6 0 0 1 0 0 0 10 -1 0 0 0 array_optional 64 8000 28412 0 0 1 0 4000 4000 0 0 0 0 0 1 0 0 4 -4 -1 0 1 0 map_optional 64 8000 32417 0 0 1 0 4000 4000 0 0 0 0 0 1 0 0 4 -4 -1 0 1 0 End of definition
Parent topic: Kafka Connect Mutation Builder
8.1.3.6 Example Configuration Files
8.1.3.6.1 Example kc.prm file
EXTRACT KC
OGGSOURCE KAFKA
JVMOPTIONS CLASSPATH ggjava/ggjava.jar:/path/to/kafka/libs/*
TRANLOGOPTIONS GETMETADATAFROMVAM
--Uncomment the following line to disable Kafka message gap checking.
--SETENV (PERFORMMESSAGEGAPCHECK = "false")
TRANLOGOPTIONS KAFKACONSUMERPROPERTIES kafka_consumer.properties
EXTTRAIL dirdat/kc
TABLE TEST.KC;
Parent topic: Example Configuration Files
8.1.3.6.2 Example Kafka Consumer Properties File
#Kafka Properties
bootstrap.servers=localhost:9092
group.id=someuniquevalue
key.deserializer=org.apache.kafka.common.serialization.ByteArrayDeserializer
value.deserializer=org.apache.kafka.common.serialization.ByteArrayDeserializer
#JSON Converter Settings
#Uncomment to use the Kafka Connect Mutation Builder with JSON Kafka Connect Messages
#key.converter=org.apache.kafka.connect.json.JsonConverter
#value.converter=org.apache.kafka.connect.json.JsonConverter
#Avro Converter Settings
#Uncomment to use the Kafka Connect Mutation Builder with Avro Kafka Connect Messages
#key.converter=io.confluent.connect.avro.AvroConverter
#value.converter=io.confluent.connect.avro.AvroConverter
#key.converter.schema.registry.url=http://localhost:8081
#value.converter.schema.registry.url=http://localhost:8081
Parent topic: Example Configuration Files
8.1.4 Azure Event Hubs
To capture messages from Azure Event Hubs and parse into logical change records with Oracle GoldenGate for Big Data, you can use Kafka Extract. For more information, see Apache Kafka as source.
Parent topic: Source
8.1.5 Confluent Kafka
To capture Kafka Connect messages from Confluent Kafka and parse into logical change records with Oracle GoldenGate for Big Data, you can use Kafka Connect Mutation Builder. For more information, see Kafka Connect Mutation Builder.
Parent topic: Source
8.1.6 DataStax
DataStax Enterprise is a NoSQL database built on Apache Cassandra. For more information about configuring change data capture from DataStax Enterprise, see Apache Cassandra.
Parent topic: Source
8.1.7 Java Message Service (JMS)
This article explains how to use Oracle GoldenGate for Big Data to capture Java Message Service (JMS) messages and write them to an Oracle GoldenGate trail.
- Prerequisites
- Configuring Message Capture
This chapter explains how to configure the VAM Extract to capture JMS messages.
Parent topic: Source
8.1.7.1 Prerequisites
8.1.7.1.1 Set up Credential Store Entry to Detect Source Type
JMS Capture
Similar to Kafka, to enable detection of the source type, create a credential store entry with the prefix jms://.
alter credentialstore add user jms:// password <anypassword> alias jms
If this alias is used in the SOURCEDB parameter with the USERIDALIAS option, then the source type is assumed to be JMS, and a warning message is logged to indicate this.
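For example, assuming the alias jms created above, the Extract parameter file would reference the credential as follows (a sketch of the relevant line only):
SOURCEDB USERIDALIAS jms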
Parent topic: Prerequisites
8.1.7.2 Configuring Message Capture
Parent topic: Java Message Service (JMS)
8.1.7.2.1 Configuring the VAM Extract
JMS Capture only works with the Oracle GoldenGate Extract process. To run the Java message capture application you need the following:
- Oracle GoldenGate for Java Adapter
- Extract process
- Extract parameter file configured for message capture
- Description of the incoming data format, such as a source definitions file
- Java 8 installed on the host machine
Parent topic: Configuring Message Capture
8.1.7.2.1.1 Adding the Extract
To add the message capture VAM to the Oracle GoldenGate installation, add an Extract and the trail that it will create using GGSCI commands:
ADD EXTRACT jmsvam, VAM
ADD EXTTRAIL dirdat/id, EXTRACT jmsvam, MEGABYTES 100
The process name (jmsvam) can be replaced with any process name that is no more than 8 characters. The trail identifier (id) can be any two characters.
Note:
Commands to position the Extract, such as BEGIN or EXTRBA, are not supported for message capture. The Extract will always resume by reading messages from the end of the message queue.
Parent topic: Configuring the VAM Extract
8.1.7.2.1.2 Configuring the Extract Parameters
The Extract parameter file contains the parameters needed to define and invoke the VAM. Sample Extract parameters for communicating with the VAM are shown in the table.
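Since the reference table is not reproduced here, the following is only a rough sketch of a VAM Extract parameter file; the VAM library name (libggjava_vam.so), the properties file path (dirprm/jmsvam.properties), and the schema in the TABLE statement are assumptions that must be checked against your installation:
EXTRACT jmsvam
-- Load the Java adapter VAM and point it at the adapter properties file (paths are assumptions)
VAM libggjava_vam.so, PARAMS dirprm/jmsvam.properties
-- Write captured JMS messages to the local trail created earlier
EXTTRAIL dirdat/id
TABLE GG.*;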
Parent topic: Configuring the VAM Extract
8.1.7.2.1.3 Configuring Message Capture
Message capture is configured by the properties in the VAM properties file (Adapter Properties file). This file is identified by the PARAMS option of the Extract VAM parameter and is used to determine logging characteristics, parser mappings, and JMS connection settings.
Parent topic: Configuring the VAM Extract
8.1.7.2.2 Connecting and Retrieving the Messages
To process JMS messages you must configure the connection to the JMS interface, retrieve and parse the messages in a transaction, write each message to a trail, commit the transaction, and remove its messages from the queue.
Parent topic: Configuring Message Capture
8.1.7.2.2.1 Connecting to JMS
Connectivity to JMS is through a generic JMS interface. Properties can be set to configure the following characteristics of the connection:
- Java classpath for the JMS client
- Name of the JMS queue or topic source destination
- Java Naming and Directory Interface (JNDI) connection properties:
  - Connection properties for Initial Context
  - Connection factory name
  - Destination name
- Security information:
  - JNDI authentication credentials
  - JMS user name and password
The Extract process that is configured to work with the VAM (such as the jmsvam in the example) connects to the message system when it starts up.
Note:
The Extract may be included in the Manager's AUTORESTART list so that it is automatically restarted if there are connection problems during processing.
Currently the Oracle GoldenGate for Java message capture adapter supports only JMS text messages.
Parent topic: Connecting and Retrieving the Messages
8.1.7.2.2.2 Retrieving Messages
The connection processing performs the following steps when asked for the next message:
- Start a local JMS transaction if one is not already started.
- Read a message from the message queue.
- If the read fails because no message exists, return an end-of-file message.
- Otherwise, return the contents of the message.
Parent topic: Connecting and Retrieving the Messages
8.1.7.2.2.3 Completing the Transaction
Once all of the messages that make up a transaction have been successfully retrieved, parsed, and written to the Oracle GoldenGate trail, the local JMS transaction is committed and the messages removed from the queue or topic. If there is an error the local transaction is rolled back leaving the messages in the JMS queue.
Parent topic: Connecting and Retrieving the Messages
8.1.8 MongoDB
The Oracle GoldenGate capture (Extract) for MongoDB is used to get changes from MongoDB databases.
This chapter describes how to use the Oracle GoldenGate Capture for MongoDB.
- Overview
- Prerequisites to Setting up MongoDB
- MongoDB Database Operations
- Using Extract Initial Load
- Using Change Data Capture Extract
- Positioning the Extract
- Security and Authentication
- MongoDB Bidirectional Replication
- Mongo DB Configuration Reference
- Columns in Trail File
- Update Operation Behavior
- Oplog Size Recommendations
- Troubleshooting
- MongoDB Capture Client Dependencies
What are the dependencies for the MongoDB Capture to connect to MongoDB databases?
Parent topic: Source
8.1.8.1 Overview
MongoDB is a document-oriented NoSQL database used for high-volume data storage. It provides high performance and scalability along with data modelling and data management of huge data sets in enterprise applications. MongoDB provides:
- High availability through built-in replication and failover
- Horizontal scalability with native sharding
- End-to-end security, and more
Parent topic: MongoDB
8.1.8.2 Prerequisites to Setting up MongoDB
- A MongoDB cluster or a MongoDB node must have a replica set. The minimum recommended configuration for a replica set is a three-member replica set with three data-bearing members: one primary and two secondary members.
Create mongod instances with the replica set as follows:
bin/mongod --bind_ip localhost --port 27017 --replSet rs0 --dbpath ../data/d1/
bin/mongod --bind_ip localhost --port 27018 --replSet rs0 --dbpath ../data/d2/
bin/mongod --bind_ip localhost --port 27019 --replSet rs0 --dbpath ../data/d3/
bin/mongo --host localhost --port 27017
Adding a replica set:
rs.initiate( {
   _id : "rs0",
   members: [
      { _id: 0, host: "localhost:27017" },
      { _id: 1, host: "localhost:27018" },
      { _id: 2, host: "localhost:27019" }
   ]
})
- Replica Set Oplog
MongoDB capture uses the oplog to read the CDC records. The operations log (oplog) is a capped collection that keeps a rolling record of all operations that modify the data stored in your databases. MongoDB removes an oplog entry only if the oplog has reached the maximum configured size and the oplog entry is older than the configured number of hours based on the host system clock.
You can control the retention of oplog entries using oplogMinRetentionHours and replSetResizeOplog. For more information about the oplog, see Oplog Size Recommendations.
- You must download and provide the third party libraries listed in MongoDB Capture Client Dependencies: Reactive Streams Java Driver 4.4.1.
Parent topic: MongoDB
8.1.8.2.1 Set up Credential Store Entry to Detect Source Type
The source database type is detected from the userid. The generic format for the userid is as follows: <dbtype>://<db-user>@<comma separated list of server addresses>:<port>. The userid value for MongoDB is any valid MongoDB clientURI without the password.
MongoDB Capture
alter credentialstore add user "mongodb+srv://user@127.0.0.1:27017" password db-passwd alias mongo
Note:
Ensure that the userid value is in double quotes.
MongoDB Atlas
Example:
alter credentialstore add user "mongodb+srv://user@127.0.0.1:27017" password db-passwd alias mongo
Parent topic: Prerequisites to Setting up MongoDB
8.1.8.3 MongoDB Database Operations
Supported Operations
- INSERT
- UPDATE
- DELETE
Unsupported Operations
- CREATE collection
- RENAME collection
- DROP collection
Parent topic: MongoDB
8.1.8.4 Using Extract Initial Load
MongoDB Extract supports the standard initial load capability to extract source table data to Oracle GoldenGate trail files.
Initial load for MongoDB can be performed to synchronize tables, either as a prerequisite step to replicating changes or as a standalone function.
Configuring the Initial Load
Initial Load Parameter file:
-- ggsci> alter credentialstore add user mongodb://db-user@localhost:27017/admin password db-passwd alias mongo
EXTRACT LOAD
JVMOPTIONS CLASSPATH ggjava/ggjava.jar:/path/to/mongo-capture/libs/*
SOURCEISTABLE
SOURCEDB USERIDALIAS mongo
TABLE database.collection;
adminclient> ADD EXTRACT load, SOURCEISTABLE
adminclient> START EXTRACT load
Parent topic: MongoDB
8.1.8.5 Using Change Data Capture Extract
Review the example .prm files from the Oracle GoldenGate for Big Data installation directory here: AdapterExamples/big-data/mongodbcapture.
When adding the MongoDB Extract trail, you need to use EXTTRAIL to create a local trail file; the RMTTRAIL option is not supported.
adminclient> ADD EXTRACT groupname, TRANLOG
adminclient> ADD EXTTRAIL trailprefix, EXTRACT groupname
Example:
adminclient> ADD EXTRACT mongo, TRANLOG
adminclient> ADD EXTTRAIL ./dirdat/z1, EXTRACT mongo
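For illustration, a minimal CDC Extract parameter file for the group mongo added above might look like the following sketch (the classpath and collection name are assumptions; see the shipped examples under AdapterExamples/big-data/mongodbcapture for the authoritative version):
EXTRACT mongo
-- Authenticate using the credential store alias created earlier
SOURCEDB USERIDALIAS mongo
-- Path to the MongoDB Reactive Streams Java driver libraries (assumed location)
JVMOPTIONS CLASSPATH ggjava/ggjava.jar:/path/to/mongo-capture/libs/*
EXTTRAIL ./dirdat/z1
TABLE database.collection;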
Parent topic: MongoDB
8.1.8.6 Positioning the Extract
The MongoDB Extract process can be positioned from EARLIEST, TIMESTAMP, EOF, or LSN.
EARLIEST: Positions to the start of the Oplog for a given collection.
Syntax:
ADD EXTRACT groupname, TRANLOG, EARLIEST
TIMESTAMP: Positions to a given timestamp. The BEGIN token can either use NOW to start from the present time or use a given timestamp.
BEGIN {NOW | yyyy-mm-dd[ hh:mi:[ss[.cccccc]]]}
Syntax
ADD EXTRACT groupname, TRANLOG, BEGIN NOW
ADD EXTRACT groupname, TRANLOG, BEGIN 'yyyy-mm-dd hh:mm:ss'
EOF: Positions to end of oplog.
Syntax
ADD EXTRACT groupname, TRANLOG, EOF
LSN: Positions to a given LSN.
LSN in MongoDB Capture is the operation time in the oplog, which is unique for each record. The time is represented as seconds plus an increment, formatted as a 20-digit long value.
ADD EXTRACT groupname, TRANLOG, LSN "06931975403544248321"
Parent topic: MongoDB
8.1.8.7 Security and Authentication
MongoDB capture uses Oracle GoldenGate credential store to manage user IDs and their encrypted passwords (together known as credentials) that are used by Oracle GoldenGate processes to interact with the MongoDB database. The credential store eliminates the need to specify user names and clear-text passwords in the Oracle GoldenGate parameter files.
An optional alias can be used in the parameter file instead of the user ID to map to a userid and password pair in the credential store.
In Oracle GoldenGate for Big Data, you specify the alias and domain in the property file and not the actual user ID or password. User credentials are maintained in secure wallet storage.
To set up the credential store and DBLOGIN, run the following commands in the adminclient:
adminclient> add credentialstore
adminclient> alter credentialstore add user "<userid>" password <pwd> alias mongo
Example value of userid:
mongodb://myUserAdmin@localhost:27017/admin?replicaSet=rs0
Note:
Ensure that the userid value is in double quotes.
adminclient> dblogin useridalias mongo
To test DBLOGIN, run the following command:
adminclient> list tables tcust*
After the credentials have been successfully added to the credential store, add the alias in the Extract parameter file:
SOURCEDB USERIDALIAS mongo
MongoDB Capture uses a connection URI to connect to a MongoDB deployment. Authentication and security options are passed as a query string as part of the connection URI. See SSL Configuration Setup to configure SSL.
mongodb://<user>@<hostname1>:<port>,<hostname2>:<port>,<hostname3>:<port>/?replicaSet=<replicatName>
To specify TLS/SSL: Using "+srv" as mongodb+srv automatically sets the tls option to true, for example: mongodb+srv://server.example.com/. To disable TLS/SSL, specify tls=false in the query string.
mongodb://<user>@<hostname1>:<port>/?replicaSet=<replicatName>&tls=false
To specify Authentication:
mongodb://<user>@<hostname1>:<port>,<hostname2>:<port>,<hostname3>:<port>/?replicaSet=<replicatName>&authSource=admin
mongodb://<user>@<hostname1>:<port>,<hostname2>:<port>,<hostname3>:<port>/?replicaSet=<replicatName>&authSource=admin&authMechanism=GSSAPI
For more information about security and authentication using the connection URL, see the MongoDB Documentation.
Parent topic: MongoDB
8.1.8.7.1 SSL Configuration Setup
To configure SSL between the MongoDB instance and Oracle GoldenGate for Big Data MongoDB Capture, do the following:
Create a root certificate authority (CA):
openssl req -passout pass:password -new -x509 -days 3650 -extensions v3_ca -keyout ca_private.pem -out ca.pem -subj "/CN=CA/OU=GOLDENGATE/O=ORACLE/L=BANGALORE/ST=KA/C=IN"
Create key and certificate signing requests (CSR) for client and all server nodes
openssl req -newkey rsa:4096 -nodes -out client.csr -keyout client.key -subj '/CN=certName/OU=OGGBDCLIENT/O=ORACLE/L=BANGALORE/ST=AP/C=IN'
openssl req -newkey rsa:4096 -nodes -out server.csr -keyout server.key -subj '/CN=slc13auo.us.oracle.com/OU=GOLDENGATE/O=ORACLE/L=BANGALORE/ST=TN/C=IN'
Sign the certificate signing requests with CA
openssl x509 -passin pass:password -sha256 -req -days 365 -in client.csr -CA ca.pem -CAkey ca_private.pem -CAcreateserial -out client-signed.crt
openssl x509 -passin pass:password -sha256 -req -days 365 -in server.csr -CA ca.pem -CAkey ca_private.pem -CAcreateserial -out server-signed.crt -extensions v3_req -extfile <(cat << EOF
[ v3_req ]
subjectAltName = @alt_names
[ alt_names ]
DNS.1 = 127.0.0.1
DNS.2 = localhost
DNS.3 = hostname
EOF
)
cat client-signed.crt client.key > client.pem
cat server-signed.crt server.key > server.pem
Create trust store and keystore
openssl pkcs12 -export -out server.pkcs12 -in server.pem
openssl pkcs12 -export -out client.pkcs12 -in client.pem
bash-4.2$ ls
ca.pem ca_private.pem client.csr client.pem server-signed.crt server.key server.pkcs12
ca.srl client-signed.crt client.key client.pkcs12 server.csr server.pem
Start instances of mongod with the following options:
--tlsMode requireTLS --tlsCertificateKeyFile ../opensslKeys/server.pem --tlsCAFile ../opensslKeys/ca.pem
Credential store connection string:
alter credentialstore add user mongodb://myUserAdmin@localhost:27017/admin?ssl=true&tlsCertificateKeyFile=../mcopensslkeys/client.pem&tlsCertificateKeyFilePassword=password&tlsCAFile=../mcopensslkeys/ca.pem password root alias mongo
Note:
The length of connectionString should not exceed 256 characters.
For CDC Extract, add the key store and trust store as part of the JVM options.
JVM options
-Xms512m -Xmx4024m -Xss32m
-Djavax.net.ssl.trustStore=../mcopensslkeys/server.pkcs12
-Djavax.net.ssl.trustStorePassword=password
-Djavax.net.ssl.keyStore=../mcopensslkeys/client.pkcs12
-Djavax.net.ssl.keyStorePassword=password
Parent topic: Security and Authentication
8.1.8.8 MongoDB Bidirectional Replication
Oracle GoldenGate for Big Data can capture changes from a MongoDB source database and also apply changes to a MongoDB target database. In bidirectional replication, changes made to one copy of a collection are replicated to the target collection, and changes made to the second copy are replicated back to the first copy.
This topic explains the design to support bidirectional replication for MongoDB.
Note:
MongoDB Version 6 or above is required to support bi-directional replication. With versions before 6.0, MongoDB bi-directional is not supported and it fails with the following error message: MONGODB-000XX MongoDB version should be 6 or greater to support bi-directional replication.
- Enabling Trandata
- Enabling MongoDB Bi-directional Replication
- Extracting from Target Replicat which is Bidirectionally Processed
- Troubleshooting
Parent topic: MongoDB
8.1.8.8.1 Enabling Trandata
Before starting the Replicat process with bidirectional replication enabled, enable trandata for the collection to which the data is replicated. Enabling trandata on the collection before starting the Replicat process captures the before image of each operation, which allows the Oracle GoldenGate for Big Data Extract process to identify whether a document was processed by Oracle GoldenGate for Big Data.
The Extract abends if trandata is not enabled on the collection that is used in the bidirectionally enabled Replicat process.
Command to Enable Trandata
dblogin useridalias <aliasname>
add trandata <schema>.<collectionname>
Note:
The target collection should be available before the Replicat process is executed with bidirectional replication enabled.
Parent topic: MongoDB Bidirectional Replication
8.1.8.8.2 Enabling MongoDB Bi-directional Replication
To enable MongoDB bi-directional replication, set gg.handler.mongodb.bidirectional to true (gg.handler.mongodb.bidirectional=true) in the Replicat properties.
When the gg.handler.mongodb.bidirectional property is set to true, the Replicat process adds a filterAttribute and filterAttributeValue key-value pair to the document. filterAttribute and filterAttributeValue are needed for loop detection. Ensure that the filterAttributeValue contains only ASCII characters [A-Za-z] and numbers [0-9] with a maximum length of 256 characters. If the document has the key-value pair of filterAttribute and filterAttributeValue, then it shows that the document was processed by the Oracle GoldenGate for Big Data Replicat process.
When the gg.handler.mongodb.bidirectional property is set to true, the Replicat uses the default value of filterAttribute as oggApply and the default filterAttributeValue as true if not specified explicitly. You can enable MongoDB bi-directional replication with default settings. For example: gg.handler.mongodb.bidirectional=true
Sample insert doc with default key-value pair:
{ "_id" : ObjectId("65544aa60b0a066d021ba508"), "CUST_CODE" : "test65", "name" : "hello world", "cost" : 3000, "oggApply":"true"}
You can also configure custom values for filterAttribute and filterAttributeValue. For example:
gg.handler.mongodb.bidirectional=true
gg.handler.mongodb.filterAttribute=region
gg.handler.mongodb.filterAttributeValue=westcentral
Sample insert doc with custom key-value pair:
{ "_id" : ObjectId("65544aa60b0a066d021ba508"), "CUST_CODE" : "test65", "name" : "hello world", "cost" : 3000, "region":"westcentral"}
Parent topic: MongoDB Bidirectional Replication
8.1.8.8.3 Extracting from Target Replicat which is Bidirectionally Processed
In the Extract process, you can use the TRANLOGOPTIONS FILTERATTRIBUTE parameter to decide whether to process or filter operations. You can specify multiple TRANLOGOPTIONS FILTERATTRIBUTE options with different key-value pairs.
This option may be used to avoid data looping in a bidirectional configuration of MongoDB capture by specifying the FILTERATTRIBUTE name with the value that was used by the MongoDB Replicat. The attribute name is optional, with a default value of oggApply.
TRANLOGOPTIONS FILTERATTRIBUTE: filters the default attribute oggApply with the default value true.
For example:
TRANLOGOPTIONS FILTERATTRIBUTE region=westcentral: filters the attribute region with the value westcentral. If the source document contains the specified FILTERATTRIBUTE, the document is identified as a replicated operation.
Note:
The TRANLOGOPTIONS FILTERATTRIBUTE parameter value should be in line with the Replicat's FILTERATTRIBUTE and FILTERATTRIBUTEVALUE to detect the loop or decide whether to process or filter the operations.
If the source document contains the specified FILTERATTRIBUTE, the document is identified as a replicated operation. Operation filtering is based on the GETREPLICATES/IGNOREREPLICATES and GETAPPLOPS/IGNOREAPPLOPS parameters.
- Use parameters IGNOREAPPLOPS and IGNOREREPLICATES to capture no operations.
- Use parameters GETAPPLOPS and GETREPLICATES to capture all operations.
- Use parameters GETREPLICATES and IGNOREAPPLOPS to capture only replicated operations.
- Use parameters GETAPPLOPS and IGNOREREPLICATES to capture only application operations, filtering out replicated operations.
Example 1
The following Extract parameters filter the replicated operations marked with the default attribute oggApply.
TRANLOGOPTIONS FILTERATTRIBUTE
GETAPPLOPS
IGNOREREPLICATES
Filtered sample message:
{ "_id" : ObjectId("65544aa60b0a066d021ba508"), "CUST_CODE" : "test65", "name" : "hello world", "cost" : 3000, "oggApply":"true" }
The following Extract parameters filter the replicated operations marked with the attribute value westcentral and capture only the application operations. If there are other operations marked with a different attribute value, they will also be extracted.
TRANLOGOPTIONS FILTERATTRIBUTE region=westcentral
GETAPPLOPS
IGNOREREPLICATES
Example 2:
Filtered sample message:
{ "_id" : ObjectId("65544aa60b0a066d021ba508"), "CUST_CODE" : "test65", "name" : "hello world", "cost" : 3000, "region":"westcentral"}
Extracted sample message:
{ "_id" : ObjectId("1881aa60bMKA66d021b1938"), "CUST_CODE" : "test38", "name" : "hello world", "cost" : 2000 }
Parent topic: MongoDB Bidirectional Replication
8.1.8.8.4 Troubleshooting
- In bidirectional replication, if no before image is available for a delete document, the process abends with an error.
Sample error:
MONGODB-000XX No before image is available for collection [ <collection name> ] with the document [ <document> ]
- If the MongoDB version used is less than 6, then the process abends with the following error:
MONGODB-000XX MongoDB version should be 6 or greater to support bi-directional replication.
Parent topic: MongoDB Bidirectional Replication
8.1.8.9 Mongo DB Configuration Reference
The following properties are used with MongoDB change data capture.
| Properties | Required/Optional | Location | Default | Explanation |
|---|---|---|---|---|
| OGGSOURCE <source> | Required | GLOBALS file. Note: Starting from Oracle GoldenGate for Big Data release 23.1.0.0.0, this parameter will be deprecated. | None | The source database for CDC capture or database queries. The valid value is MONGODB. |
| JVMOPTIONS CLASSPATH <classpath> / JVMOPTIONS BOOTOPTIONS <options> | Optional | Extract parameter file | None | CLASSPATH: The classpath for the Java Virtual Machine. You can include an asterisk (*) wildcard to match all JAR files in any directory. Multiple paths should be delimited with a colon (:) character. BOOTOPTIONS: The boot options for the Java Virtual Machine. Multiple options are delimited by a space character. |
| JVMBOOTOPTIONS <options> | Optional | GLOBALS file. Note: Starting from Oracle GoldenGate for Big Data release 23.1.0.0.0, this parameter will be deprecated. | None | The boot options for the Java Virtual Machine. Multiple options are delimited by a space character. |
| JVMCLASSPATH <classpath> | Required | GLOBALS file. Note: Starting from Oracle GoldenGate for Big Data release 23.1.0.0.0, this parameter will be deprecated. | None | The classpath for the Java Virtual Machine. You can include an asterisk (*) wildcard to match all JAR files in any directory. Multiple paths should be delimited with a colon (:) character. |
| SOURCEDB USERIDALIAS <alias name> | Required | Extract parameter (.prm) file | None | This parameter is used by the Extract process for authentication into the source MongoDB database. The alias name refers to the alias that should exist in Oracle Wallet. See Security and Authentication. |
| ABEND_ON_DDL | Optional | CDC Extract parameter (.prm) file | None | This is the default behaviour of the MongoDB Capture Extract. On detection of CREATE collection, RENAME collection, and DROP collection, the Extract process will be abended. |
| NO_ABEND_ON_DDL | Optional | CDC Extract parameter (.prm) file | None | On detection of CREATE collection, RENAME collection, and DROP collection, the Extract process will skip these operations and continue processing the next operation. |
| ABEND_ON_DROP_DATABASE | Optional | CDC Extract parameter (.prm) file | None | This is the default behaviour of the MongoDB Capture Extract. On detection of a Drop Database operation, the Extract process will be abended. |
| NO_ABEND_ON_DROP_DATABASE | Optional | CDC Extract parameter (.prm) file | None | On detection of a Drop Database operation, the Extract process will skip these operations and continue processing the next operation. |
| BINARY_JSON_FORMAT | Optional | CDC Extract parameter (.prm) file | None | When BINARY_JSON_FORMAT is configured, the captured documents are written to the trail in BSON (binary JSON) format; when it is not configured, documents are written in Extended JSON format. See Columns in Trail File. |
| TRANLOGOPTIONS FETCHPARTIALJSON | Optional | CDC Extract parameter (.prm) file | None | On configuring TRANLOGOPTIONS FETCHPARTIALJSON, the Extract process does a DB lookup and fetches the full document for the given update operation. See MongoDB Bidirectional Replication. |
Table Metadata
When BINARY_JSON_FORMAT is configured, the column metadata should have data_type as 64, sub_data_type as 4, and JSON as the Remarks.
Example:
2021/11/11 06:45:06.311.849 Metadata Len 143 RBA 1533 Table Name: MYTEST.TEST * 1)Name 2)Data Type 3)External Length 4)Fetch Offset 5)Scale 6)Level 7)Null 8)Bump if Odd 9)Internal Length 10)Binary Length 11)Table Length 12)Most Sig DT 13)Least Sig DT 14)High Precision 15)Low Precision 16)Elementary Item 17)Occurs 18)Key Column 19)Sub DataType 20)Native DataType 21)Character Set 22)Character Length 23)LOB Type 24)Partial Type 25)Remarks * TDR version: 11 Definition for table MYTEST.TEST Record Length: 16010 Columns: 2 id 64 8000 0 0 0 0 0 8000 8000 0 0 0 0 0 1 0 1 4 -4 -1 0 0 0 JSON payload 64 8000 8005 0 0 1 0 8000 8000 0 0 0 0 0 1 0 0 4 -4 -1 0 1 0 JSON End of definition
When BINARY_JSON_FORMAT is not configured, the column metadata should have data_type as 64, sub_data_type as 0, and JSON as the Remarks.
Example:
2021/11/11 06:45:06.311.849 Metadata Len 143 RBA 1533 Table Name: MYTEST.TEST * 1)Name 2)Data Type 3)External Length 4)Fetch Offset 5)Scale 6)Level 7)Null 8)Bump if Odd 9)Internal Length 10)Binary Length 11)Table Length 12)Most Sig DT 13)Least Sig DT 14)High Precision 15)Low Precision 16)Elementary Item 17)Occurs 18)Key Column 19)Sub DataType 20)Native DataType 21)Character Set 22)Character Length 23)LOB Type 24)Partial Type 25)Remarks * TDR version: 11 Definition for table MYTEST.TEST Record Length: 16010 Columns: 2 id 64 8000 0 0 0 0 0 8000 8000 0 0 0 0 0 1 0 1 0 -4 -1 0 0 0 JSON payload 64 8000 8005 0 0 1 0 8000 8000 0 0 0 0 0 1 0 0 0 -4 -1 0 1 0 JSON End of definition
Parent topic: MongoDB
8.1.8.10 Columns in Trail File
- Column 0 as '_id', which identifies a document in a collection.
- Column 1 as 'payload', which holds all the columns (fields of a collection).
Based on the property BINARY_JSON_FORMAT, columns are presented in BSON format or Extended JSON format. When BINARY_JSON_FORMAT is configured, the captured documents are represented in the BSON format as follows:
2021/10/26 06:21:33.000.000 Insert Len 329 RBA 1921 Name: MYTEST.TEST (TDR Index: 1) After Image: Partition x0c G s 0000 1a00 0000 1600 1600 0000 075f 6964 0061 7800 | ..............ax. ddc2 d894 d2f5 fca4 9e00 0100 2701 0000 2301 2301 | ............'...#.#. 0000 075f 6964 0061 7800 ddc2 d894 d2f5 fca4 9e02 | ..._id.ax........... 4355 5354 5f43 4f44 4500 0500 0000 7361 6162 0002 | CUST_CODE.....saab.. 6e61 6d65 0005 0000 006a 6f68 6e00 026c 6173 746e | name.....john..lastn 616d 6500 0500 0000 7769 6c6c 0003 6164 6472 6573 | ame.....will..addres 7365 7300 8300 0000 0373 7472 6565 7464 6574 6169 | ses......streetdetai Column 0 (0x0000), Length 26 (0x001a) id. 0000 1600 1600 0000 075f 6964 0061 7800 ddc2 d894 | ..........ax..... d2f5 fca4 9e00 | ...... Column 1 (0x0001), Length 295 (0x0127) payload. 0000 2301 2301 0000 075f 6964 0061 7800 ddc2 d894 | ..#.#.....ax..... d2f5 fca4 9e02 4355 5354 5f43 4f44 4500 0500 0000 | ......CUST_CODE..... 7361 6162 0002 6e61 6d65 0005 0000 006a 6f68 6e00 | saab..name.....john. 026c 6173 746e 616d 6500 0500 0000 7769 6c6c 0003 | .lastname.....will.. 6164 6472 6573 7365 7300 8300 0000 0373 7472 6565 | addresses......stree 7464 6574 6169 6c73 006f 0000 0003 6172 6561 0020 | tdetails.o....area. 0000 0003 5374 7265 6574 0013 0000 0001 6c61 6e65 | ....Street......lane 0000 0000 0000 005e 4000 0003 666c 6174 6465 7461 | .......^@...flatdeta 696c 7300 3700 0000 0166 6c61 746e 6f00 0000 0000 | ils.7....flatno..... 0040 6940 0270 6c6f 746e 6f00 0300 0000 3262 0002 | .@i@.plotno.....2b.. 6c61 6e65 0009 0000 0032 6e64 7068 6173 6500 0000 | lane.....2ndphase... 0003 7072 6f76 6973 696f 6e00 3000 0000 0373 7461 | ..provision.0....sta 7465 0024 0000 0003 6b61 001b 0000 0002 6b61 726e | te.$....ka......karn 6174 616b 6100 0700 0000 3537 3031 3032 0000 0000 | ataka.....570102.... 0263 6974 7900 0400 0000 626c 7200 00 | .city.....blr..
When BINARY_JSON_FORMAT is not configured, the captured documents are represented in the JSON format as follows:
2021/10/01 01:09:35.000.000 Insert Len 366 RBA 1711 Name: MYTEST.testarr (TDR Index: 1) After Image: Partition x0c G s 0000 2700 0000 2300 7b22 246f 6964 223a 2236 3135 | ..'...#.{"$oid":"615 3663 3233 6633 3466 3061 3965 3661 3735 3536 3930 | 6c23f34f0a9e6a755690 6422 7d01 003f 0100 003b 017b 225f 6964 223a 207b | d"}..?...;.{"_id": { 2224 6f69 6422 3a20 2236 3135 3663 3233 6633 3466 | "$oid": "6156c23f34f 3061 3965 3661 3735 3536 3930 6422 7d2c 2022 4355 | 0a9e6a755690d"}, "CU 5354 5f43 4f44 4522 3a20 2265 6d70 3122 2c20 226e | ST_CODE": "emp1", "n 616d 6522 3a20 226a 6f68 6e22 2c20 226c 6173 746e | ame": "john", "lastn Column 0 (0x0000), Length 39 (0x0027). 0000 2300 7b22 246f 6964 223a 2236 3135 3663 3233 | ..#.{"$oid":"6156c23 6633 3466 3061 3965 3661 3735 3536 3930 6422 7d | f34f0a9e6a755690d"} Column 1 (0x0001), Length 319 (0x013f). 0000 3b01 7b22 5f69 6422 3a20 7b22 246f 6964 223a | ..;.{"_id": {"$oid": 2022 3631 3536 6332 3366 3334 6630 6139 6536 6137 | "6156c23f34f0a9e6a7 3535 3639 3064 227d 2c20 2243 5553 545f 434f 4445 | 55690d"}, "CUST_CODE 223a 2022 656d 7031 222c 2022 6e61 6d65 223a 2022 | ": "emp1", "name": " 6a6f 686e 222c 2022 6c61 7374 6e61 6d65 223a 2022 | john", "lastname": " 7769 6c6c 222c 2022 6164 6472 6573 7365 7322 3a20 | will", "addresses": 7b22 7374 7265 6574 6465 7461 696c 7322 3a20 7b22 | {"streetdetails": {" 6172 6561 223a 207b 2253 7472 6565 7422 3a20 7b22 | area": {"Street": {" 6c61 6e65 223a 2031 3230 2e30 7d7d 2c20 2266 6c61 | lane": 120.0}}, "fla 7464 6574 6169 6c73 223a 207b 2266 6c61 746e 6f22 | tdetails": {"flatno" 3a20 3230 322e 302c 2022 706c 6f74 6e6f 223a 2022 | : 202.0, "plotno": " 3262 222c 2022 6c61 6e65 223a 2022 326e 6470 6861 | 2b", "lane": "2ndpha 7365 227d 7d7d 2c20 2270 726f 7669 7369 6f6e 223a | se"}}}, "provision": 207b 2273 7461 7465 223a 207b 226b 6122 3a20 7b22 | {"state": {"ka": {" 6b61 726e 6174 616b 6122 3a20 2235 3730 3130 3222 | karnataka": "570102" 7d7d 7d2c 2022 6369 7479 223a 2022 626c 7222 7d | }}}, "city": "blr"}
Parent topic: MongoDB
8.1.8.11 Update Operation Behavior
The MongoDB Capture Extract reads change records from the capped collection oplog.rs. For update operations, the collection contains information on the modified fields only. Thus, the MongoDB Capture Extract writes only the modified fields to the trail for update operations, as MongoDB native $set and $unset documents.
Example trail record:
2022/02/22 01:26:52.000.000 FieldComp Len 243 RBA 1711 Name: lobt.MNGUPSRT (TDR Index: 1) Min. Replicat version: 21.5, Min. GENERIC version: 0.0, Incompatible Replicat: Abend Column 0 (0x0000), Length 55 (0x0037) id. 0000 3300 7b20 225f 6964 2220 3a20 7b20 2224 6f69 | ..3.{ "_id" : { "$oi 6422 203a 2022 3632 3133 3633 3064 3931 3561 6631 | d" : "6213630d915af1 3633 3265 6264 6461 3766 2220 7d20 7d | 632ebdda7f" } } Column 1 (0x0001), Length 180 (0x00b4) payload. 0000 b000 7b22 2476 223a 207b 2224 6e75 6d62 6572 | ....{"$v": {"$number 496e 7422 3a20 2231 227d 2c20 2224 7365 7422 3a20 | Int": "1"}, "$set": 7b22 6c61 7374 4d6f 6469 6669 6564 223a 207b 2224 | {"lastModified": {"$ 6461 7465 223a 207b 2224 6e75 6d62 6572 4c6f 6e67 | date": {"$numberLong 223a 2022 3136 3435 3532 3230 3132 3238 3522 7d7d | ": "1645522012285"}} 2c20 2273 697a 652e 756f 6d22 3a20 2263 6d22 2c20 | , "size.uom": "cm", 2273 7461 7475 7322 3a20 2250 227d 2c20 225f 6964 | "status": "P"}, "_id 223a 207b 2224 6f69 6422 3a20 2236 3231 3336 3330 | ": {"$oid": "6213630 6439 3135 6166 3136 3332 6562 6464 6137 6622 7d7d | d915af1632ebdda7f"}} GGS tokens: TokenID x50 'P' COLPROPERTY Info x01 Length 6 Column: 1, Property: 0x02, Remarks: Partial TokenID x74 't' ORATAG Info x01 Length 0 TokenID x4c 'L' LOGCSN Info x00 Length 20 3037 3036 3734 3633 3232 3633 3838 3131 3935 3533 | 07067463226388119553 TokenID x36 '6' TRANID Info x00 Length 19 3730 3637 3436 3332 3236 3338 3831 3139 3535 33 | 7067463226388119553
Here, the GGS token x50 with Remarks as Partial indicates that this record is a partial record.
On configuring TRANLOGOPTIONS FETCHPARTIALJSON, the Extract process does a database lookup and fetches the full document for the given update operation.
Example
2022/02/22 01:26:59.000.000 FieldComp Len 377 RBA 2564 Name: lobt.MNGUPSRT (TDR Index: 1) Column 0 (0x0000), Length 55 (0x0037) id. 0000 3300 7b20 225f 6964 2220 3a20 7b20 2224 6f69 | ..3.{ "_id" : { "$oi 6422 203a 2022 3632 3133 3633 3064 3931 3561 6631 | d" : "6213630d915af1 3633 3265 6264 6461 3764 2220 7d20 7d | 632ebdda7d" } } Column 1 (0x0001), Length 314 (0x013a) payload. 0000 3601 7b20 225f 6964 2220 3a20 7b20 2224 6f69 | ..6.{ "_id" : { "$oi 6422 203a 2022 3632 3133 3633 3064 3931 3561 6631 | d" : "6213630d915af1 3633 3265 6264 6461 3764 2220 7d2c 2022 6974 656d | 632ebdda7d" }, "item 2220 3a20 226d 6f75 7365 7061 6422 2c20 2271 7479 | " : "mousepad", "qty 2220 3a20 7b20 2224 6e75 6d62 6572 446f 7562 6c65 | " : { "$numberDouble 2220 3a20 2232 352e 3022 207d 2c20 2273 697a 6522 | " : "25.0" }, "size" 203a 207b 2022 6822 203a 207b 2022 246e 756d 6265 | : { "h" : { "$numbe 7244 6f75 626c 6522 203a 2022 3139 2e30 2220 7d2c | rDouble" : "19.0" }, 2022 7722 203a 207b 2022 246e 756d 6265 7244 6f75 | "w" : { "$numberDou 626c 6522 203a 2022 3232 2e38 3530 3030 3030 3030 | ble" : "22.850000000 3030 3030 3031 3432 3122 207d 2c20 2275 6f6d 2220 | 000001421" }, "uom" 3a20 2269 6e22 207d 2c20 2273 7461 7475 7322 203a | : "in" }, "status" : 2022 5022 2c20 226c 6173 744d 6f64 6966 6965 6422 | "P", "lastModified" 203a 207b 2022 2464 6174 6522 203a 207b 2022 246e | : { "$date" : { "$n 756d 6265 724c 6f6e 6722 203a 2022 3136 3435 3532 | umberLong" : "164552 3230 3139 3936 3122 207d 207d 207d | 2019961" } } } GGS tokens: TokenID x46 'F' FETCHEDDATA Info x01 Length 1 6 | Current by key TokenID x4c 'L' LOGCSN Info x00 Length 20 3037 3036 3734 3633 3235 3634 3532 3839 3036 3236 | 07067463256452890626 TokenID x36 '6' TRANID Info x00 Length 19 3730 3637 3436 3332 3536 3435 3238 3930 3632 36 | 7067463256452890626
Here, the GGS token x46 FETCHEDDATA indicates that this record is the full image for the update operation.
Parent topic: MongoDB
8.1.8.12 Oplog Size Recommendations
By default, MongoDB uses 5% of disk space as oplog size.
The oplog should be large enough to hold all transactions for the longest downtime you expect on a secondary. At a minimum, an oplog should be able to hold 72 hours of operations, and preferably a week's worth.
Before mongod creates an oplog, you can specify its size with the --oplogSize option. After you have started a replica set member for the first time, use the replSetResizeOplog administrative command to change the oplog size. replSetResizeOplog enables you to resize the oplog dynamically without restarting the mongod process.
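As a brief illustration of these two options (the 16000 MB size and the data path are arbitrary assumptions), the oplog size can be set at startup or resized on a running member:
# Start a replica set member with a 16000 MB oplog (size is an assumption)
bin/mongod --replSet rs0 --oplogSize 16000 --dbpath ../data/d1/
# From the mongo shell, resize the oplog of a running member to 16000 MB without restarting mongod
db.adminCommand( { replSetResizeOplog: 1, size: 16000 } )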
Workloads Requiring Larger Oplog Size
If you can predict your replica set's workload to resemble one of the following patterns, then you might want to create an oplog that is larger than the default. Conversely, if your application predominantly performs reads with a minimal amount of write operations, a smaller oplog may be sufficient.
The following workloads might require a larger oplog size.
Updates to Multiple Documents at Once
The oplog must translate multi-updates into individual operations in order to maintain idempotency. This can use a great deal of oplog space without a corresponding increase in data size or disk use.
Deletions Equal the Same Amount of Data as Inserts
If you delete roughly the same amount of data as you insert, then the database doesn't grow significantly in disk use, but the size of the operation log can be quite large.
Significant Number of In-Place Updates
If a significant portion of the workload is updates that do not increase the size of the documents, then the database records a large number of operations but does not change the quantity of data on disk.
Parent topic: MongoDB
8.1.8.13 Troubleshooting
- Error: com.mongodb.MongoQueryException: Query failed with error code 11600 and error message 'interrupted at shutdown' on server localhost:27018
The MongoDB server was killed or shut down. Restart the mongod instances and the MongoDB capture.
- Error: java.lang.IllegalStateException: state should be: open
The active session was closed because the session's idle time-out value was exceeded. Increase the mongod instance's logicalSessionTimeoutMinutes parameter value and restart the mongod instances and the MongoDB capture.
paramater value and restart the Mongod instances and MongoDB capture. - Error:Exception in thread "main"
com.mongodb.MongoQueryException: Query failed with
error code 136 and error message 'CollectionScan
died due to position in capped collection being
deleted. Last seen record id:
RecordId(6850088381712443337)' on server
localhost:27018 at
com.mongodb.internal.operation.QueryHelper.translateCommandException(QueryHelper.java:29)
This Exception happens when we have Fast writes to mongod and insufficient oplog size. See Oplog Size Recommendations.
- Error: not authorized on DB to execute command
This error occurs due to insufficient privileges for the user. The user must be authenticated to run the specified command.
- Error: com.mongodb.MongoClientException: Sessions are not supported by the MongoDB cluster to which this client is connected
Ensure that the replica set is available and accessible. In the case of a MongoDB instance migration from a different version, set the FeatureCompatibilityVersion property as follows:
db.adminCommand( { setFeatureCompatibilityVersion: "3.6" } )
Parent topic: MongoDB
8.1.8.14 MongoDB Capture Client Dependencies
What are the dependencies for the MongoDB Capture to connect to MongoDB databases?
Oracle GoldenGate requires that you use the MongoDB Reactive Streams Java Driver 4.4.1 or higher to integrate with MongoDB. You can download this driver from: https://search.maven.org/artifact/org.mongodb/mongodb-driver-reactivestream
- MongoDB Capture Client Dependencies: Reactive Streams Java Driver 4.4.1
- MongoDB Reactive Streams Java Driver 4.4.1
Parent topic: MongoDB
8.1.8.14.1 MongoDB Capture Client Dependencies: Reactive Streams Java Driver 4.4.1
The required dependent client libraries are: bson.jar, mongodb-driver-core.jar, mongodb-driver-reactivestreams.jar, reactive-streams.jar, and reactor-core.jar.
You must include the path to the MongoDB Reactive Streams Java driver in the gg.classpath property. To automatically download the Java driver from the Maven central repository, add the following Maven coordinates of the third-party libraries that are needed to run MongoDB Change Data Capture to the pom.xml file:
<dependency>
    <groupId>org.mongodb</groupId>
    <artifactId>mongodb-driver-reactivestreams</artifactId>
    <version>4.4.1</version>
</dependency>
<dependency>
    <groupId>org.mongodb</groupId>
    <artifactId>bson</artifactId>
    <version>4.4.1</version>
</dependency>
<dependency>
    <groupId>org.mongodb</groupId>
    <artifactId>mongodb-driver-core</artifactId>
    <version>4.4.1</version>
</dependency>
<dependency>
    <groupId>org.reactivestreams</groupId>
    <artifactId>reactive-streams</artifactId>
    <version>1.0.3</version>
</dependency>
<dependency>
    <groupId>io.projectreactor</groupId>
    <artifactId>reactor-core</artifactId>
</dependency>
Example
Download version 4.4.1 from Maven central at: https://mvnrepository.com/artifact/org.mongodb/mongodb-driver-reactivestreams.
Parent topic: MongoDB Capture Client Dependencies
8.1.8.14.2 MongoDB Reactive Streams Java Driver 4.4.1
You must include the path to the MongoDB reactivestreams Java driver in the gg.classpath property. To automatically download the Java driver from the Maven central repository, add the following lines in the pom.xml file, substituting your correct information:
<!-- https://search.maven.org/artifact/org.mongodb/mongodb-driver-reactivestreams -->
<dependency>
    <groupId>org.mongodb</groupId>
    <artifactId>mongodb-driver-reactivestreams</artifactId>
    <version>4.4.1</version>
</dependency>
<dependency>
    <groupId>org.mongodb</groupId>
    <artifactId>bson</artifactId>
    <version>4.4.1</version>
</dependency>
<dependency>
    <groupId>org.mongodb</groupId>
    <artifactId>mongodb-driver-core</artifactId>
    <version>4.4.1</version>
</dependency>
<dependency>
    <groupId>org.reactivestreams</groupId>
    <artifactId>reactive-streams</artifactId>
    <version>1.0.3</version>
</dependency>
<dependency>
    <groupId>io.projectreactor</groupId>
    <artifactId>reactor-core</artifactId>
</dependency>
Parent topic: MongoDB Capture Client Dependencies
8.1.9 OCI Streaming
To capture messages from OCI Streaming and parse into logical change records with Oracle GoldenGate for Big Data, you can use Kafka Extract. For more information, see Apache Kafka as source.
Parent topic: Source
8.2 Target
- Amazon Kinesis
The Kinesis Streams Handler streams data to applications hosted on the Amazon Cloud or in your environment. - Amazon MSK
- Amazon Redshift
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. The purpose of the Redshift Event Handler is to apply operations into Redshift tables. - Amazon S3
Learn how to use the S3 Event Handler, which provides the interface to Amazon S3 web services. - Apache Cassandra
The Cassandra Handler provides the interface to Apache Cassandra databases. - Apache HBase
The HBase Handler is used to populate HBase tables from existing Oracle GoldenGate supported sources. - Apache HDFS
The HDFS Handler is designed to stream change capture data into the Hadoop Distributed File System (HDFS). - Apache Kafka
The Kafka Handler is designed to stream change capture data from an Oracle GoldenGate trail to a Kafka topic. - Apache Hive
- Azure Blob Storage
- Azure Data Lake Storage
- Azure Event Hubs
Kafka handler supports connectivity to Microsoft Azure Event Hubs. - Azure Synapse Analytics
Microsoft Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing and big data analytics. - Confluent Kafka
- DataStax
- Elasticsearch
- Flat Files
- Google BigQuery
- Google Cloud Storage
- Java Message Service (JMS)
The Java Message Service (JMS) Handler allows operations from a trail file to be formatted in messages, and then published to JMS providers like Oracle Weblogic Server, Websphere, and ActiveMQ. - Java Database Connectivity
Learn how to use the Java Database Connectivity (JDBC) Handler, which can replicate source transactional data to a target or database. - Map(R)
- MongoDB
Learn how to use the MongoDB Handler, which can replicate transactional data from Oracle GoldenGate to a target MongoDB and Autonomous JSON databases (AJD and ATP) . - Netezza
- OCI Streaming
Oracle Cloud Infrastructure Streaming (OCI Streaming) supports putting messages to and receiving messages from a stream using the Kafka client. Therefore, Oracle GoldenGate for Big Data can be used to publish change data capture operation messages to OCI Streaming.
The Oracle NoSQL Handler can replicate transactional data from Oracle GoldenGate to a target Oracle NoSQL Database. - OCI Autonomous Data Warehouse
Oracle Autonomous Data Warehouse (ADW) is a fully managed database tuned and optimized for data warehouse workloads with the market-leading performance of Oracle Database. - Oracle Cloud Infrastructure Object Storage
The Oracle Cloud Infrastructure Event Handler is used to load files generated by the File Writer Handler into an Oracle Cloud Infrastructure Object Store. - Redis
Redis is an in-memory data structure store which supports optional durability. Redis is simply a key/value data store where a unique key identifies the data structure stored. The value is the data structure that is stored. - Snowflake
- Additional Details
Parent topic: Replicate Data
8.2.1 Amazon Kinesis
The Kinesis Streams Handler streams data to applications hosted on the Amazon Cloud or in your environment.
This chapter describes how to use the Kinesis Streams Handler.
- Overview
- Detailed Functionality
- Setting Up and Running the Kinesis Streams Handler
- Kinesis Handler Performance Considerations
- Troubleshooting
Parent topic: Target
8.2.1.1 Overview
Amazon Kinesis is a messaging system that is hosted in the Amazon Cloud. Kinesis streams can be used to stream data to other Amazon Cloud applications such as Amazon S3 and Amazon Redshift. Using the Kinesis Streams Handler, you can also stream data to applications hosted on the Amazon Cloud or at your site. Amazon Kinesis streams provide functionality similar to Apache Kafka.
The logical concepts map as follows:
- Kafka Topics = Kinesis Streams
- Kafka Partitions = Kinesis Shards
A Kinesis stream must have at least one shard.
Parent topic: Amazon Kinesis
8.2.1.2 Detailed Functionality
8.2.1.2.1 Amazon Kinesis Java SDK
The Oracle GoldenGate Kinesis Streams Handler uses the AWS Kinesis Java SDK to push data to Amazon Kinesis, see Amazon Kinesis Streams Developer Guide at:
http://docs.aws.amazon.com/streams/latest/dev/developing-producers-with-sdk.html.
The Kinesis Streams Handler was designed and tested with the latest AWS Kinesis Java SDK version 1.11.107. These are the dependencies:
- Group ID: com.amazonaws
- Artifact ID: aws-java-sdk-kinesis
- Version: 1.11.107
Oracle GoldenGate for Big Data does not ship with the AWS Kinesis Java SDK. Oracle recommends that you use the AWS Kinesis Java SDK identified in the Certification Matrix, see Verifying Certification, System, and Interoperability Requirements.
Note:
Moving to the latest AWS Kinesis Java SDK assumes that there are no interface changes that break compatibility with the Kinesis Streams Handler. You can download the AWS Java SDK, including Kinesis, from:
Parent topic: Detailed Functionality
8.2.1.2.2 Kinesis Streams Input Limits
The upper input limit for a Kinesis stream with a single shard is 1000 messages per second up to a total data size of 1MB per second. Adding streams or shards can increase the potential throughput, as follows:
- 1 stream with 2 shards = 2000 messages per second up to a total data size of 2MB per second
- 3 streams of 1 shard each = 3000 messages per second up to a total data size of 3MB per second
The scaling that you can achieve with the Kinesis Streams Handler depends on how you configure the handler. Kinesis stream names are resolved at runtime based on the configuration of the Kinesis Streams Handler.
Shards are selected by a hash of the partition key. The partition key for a Kinesis message cannot be null or an empty string (""). A null or empty string partition key results in a Kinesis error that causes the Replicat process to abend.
Maximizing throughput requires that the Kinesis Streams Handler configuration evenly distributes messages across streams and shards.
To achieve the best distribution across shards in a Kinesis stream, select a partitioning key that changes rapidly. You can select ${primaryKeys} because it is unique per row in the source database. Additionally, operations for the same row are sent to the same Kinesis stream and shard. When DEBUG logging is enabled, the Kinesis stream name, sequence number, and shard number are logged to the log file for successfully sent messages.
Parent topic: Detailed Functionality
8.2.1.3 Setting Up and Running the Kinesis Streams Handler
Instructions for configuring the Kinesis Streams Handler components and running the handler are described in the following sections.
Use the following steps to set up the Kinesis Streams Handler:
- Create an Amazon AWS account at https://aws.amazon.com/.
- Log into Amazon AWS.
- From the main page, select Kinesis (under the Analytics subsection).
- Select Amazon Kinesis Streams, and then go to Streams to create Amazon Kinesis streams and shards within streams.
- Create a client ID and secret to access Kinesis.
The Kinesis Streams Handler requires these credentials at runtime to successfully connect to Kinesis.
- Create the client ID and secret:
- Select your name in AWS (upper right), and then in the list select My Security Credentials.
- Select Access Keys to create and manage access keys.
Note your client ID and secret upon creation.
The client ID and secret can only be accessed upon creation. If lost, you have to delete the access key, and then recreate it.
- Set the Classpath in Kinesis Streams Handler
- Kinesis Streams Handler Configuration
- Using Templates to Resolve the Stream Name and Partition Name
- Resolving AWS Credentials
- Configuring the Proxy Server for Kinesis Streams Handler
- Configuring Security in Kinesis Streams Handler
Parent topic: Amazon Kinesis
8.2.1.3.1 Set the Classpath in Kinesis Streams Handler
You must configure the gg.classpath
property in the Java Adapter properties file to specify the JARs for the AWS Kinesis Java SDK as follows:
gg.classpath={download_dir}/aws-java-sdk-1.11.107/lib/*:{download_dir}/aws-java-sdk-1.11.107/third-party/lib/*
Parent topic: Setting Up and Running the Kinesis Streams Handler
8.2.1.3.2 Kinesis Streams Handler Configuration
You configure the Kinesis Streams Handler operation using the properties file. These properties are located in the Java Adapter properties file (not in the Replicat properties file).
To enable the selection of the Kinesis Streams Handler, you must first configure the
handler type by specifying
gg.handler.name.type=kinesis_streams
and the other
Kinesis Streams properties as follows:
Table 8-2 Kinesis Streams Handler Configuration Properties
Properties | Required/ Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.handler.name.type | Required | kinesis_streams | None | Selects the Kinesis Streams Handler for streaming change data capture into Kinesis. |
gg.handler.name.mode | Optional | op or tx | op | Choose the operating mode. |
gg.handler.name.region | Required | The Amazon region name which is hosting your Kinesis instance. | None | Setting of the Amazon AWS region name is required. |
gg.handler.name.proxyServer | Optional | The host name of the proxy server. | None | Set the host name of the proxy server if connectivity to AWS is required to go through a proxy server. |
gg.handler.name.proxyPort | Optional | The port number of the proxy server. | None | Set the port number of the proxy server if connectivity to AWS is required to go through a proxy server. |
gg.handler.name.proxyUsername | Optional | The username of the proxy server (if credentials are required). | None | Set the username of the proxy server if connectivity to AWS is required to go through a proxy server and the proxy server requires credentials. |
gg.handler.name.proxyPassword | Optional | The password of the proxy server (if credentials are required). | None | Set the password of the proxy server if connectivity to AWS is required to go through a proxy server and the proxy server requires credentials. |
gg.handler.name.deferFlushAtTxCommit | Optional | true or false | false | When set to false, the Kinesis Streams Handler flushes data to Kinesis at transaction commit for write durability. However, it may be preferable to defer the flush beyond the transaction commit for performance purposes, see Kinesis Handler Performance Considerations. |
gg.handler.name.deferFlushOpCount | Optional | Integer | None | Only applicable if gg.handler.name.deferFlushAtTxCommit is set to true. The flush to Kinesis is deferred until this number of operations has been processed. |
gg.handler.name.formatPerOp | Optional | true or false | true | When set to true, each operation is sent as an individual Kinesis message. When set to false, operations are grouped into Kinesis messages at the transaction level. |
gg.handler.name.customMessageGrouper | Optional | oracle.goldengate.handler.kinesis.KinesisJsonTxMessageGrouper | None | This configuration parameter provides the ability to group Kinesis messages using custom logic. Only one implementation is included in the distribution at this time. |
gg.handler.name.streamMappingTemplate | Required | A template string value to resolve the Kinesis stream name at runtime. | None | See Using Templates to Resolve the Stream Name and Partition Name for more information. |
gg.handler.name.partitionMappingTemplate | Required | A template string value to resolve the Kinesis message partition key (message key) at runtime. | None | See Using Templates to Resolve the Stream Name and Partition Name for more information. |
gg.handler.name.format | Required | Any supported pluggable formatter. | | Selects the operations message formatter. JSON is likely the best fit for Kinesis. |
 | Optional | | | By default, the Kinesis Handler automatically creates Kinesis streams if they do not already exist. Set to false to disable automatic stream creation. |
 | Optional | Positive integer. | | A Kinesis stream contains one or more shards. Controls the number of shards on Kinesis streams that the Kinesis Handler creates. Multiple shards can help improve the ingest performance to a Kinesis stream. Use only when the handler is configured to create streams. |
 | Optional | | | Sets the proxy protocol connection to the proxy server for additional level of security. The client first performs an SSL handshake with the proxy server, and then an SSL handshake with Amazon AWS. This feature was added into the Amazon SDK in version 1.11.396 so you must use at least that version to use this property. |
gg.handler.name.enableSTS | Optional | true or false | false | Set to true to enable the Kinesis Handler to access Kinesis credentials from the AWS Security Token Service. Ensure that the AWS Security Token Service is enabled if you set this property to true. |
gg.handler.name.STSRegion | Optional | Any legal AWS region specifier. | The region is obtained from the gg.handler.name.region property. | Use to resolve the region for the STS call. It is only valid if the gg.handler.name.enableSTS property is set to true. You can set a different AWS region for resolving credentials from STS than the configured Kinesis region. |
gg.handler.name.accessKeyId | Optional | A valid AWS access key. | None | Set this parameter to explicitly set the access key for AWS. This parameter has no effect if gg.handler.name.enableSTS is set to true. If unset, credentials resolution falls back to the AWS default credentials provider chain. |
gg.handler.name.secretKey | Optional | A valid AWS secret key. | None | Set this parameter to explicitly set the secret key for AWS. This parameter has no effect if gg.handler.name.enableSTS is set to true. If unset, credentials resolution falls back to the AWS default credentials provider chain. |
Parent topic: Setting Up and Running the Kinesis Streams Handler
8.2.1.3.3 Using Templates to Resolve the Stream Name and Partition Name
The Kinesis Streams Handler provides the functionality to resolve the stream name and the partition key at runtime using a template configuration value. Templates allow you to configure static values and keywords. Keywords are used to dynamically replace the keyword with the context of the current processing. Templates are applicable to the following configuration parameters:
gg.handler.name.streamMappingTemplate
gg.handler.name.partitionMappingTemplate
Source database transactions are made up of one or more individual operations: the individual inserts, updates, and deletes. The Kinesis Handler can be configured to send one message per operation (insert, update, delete). Alternatively, it can be configured to group operations into messages at the transaction level. Many of the template keywords resolve data based on the context of an individual source database operation. Therefore, many of the keywords do not work when sending messages at the transaction level. For example, ${fullyQualifiedTableName} does not work when sending messages at the transaction level. The ${fullyQualifiedTableName} property resolves to the qualified source table name for an operation. Transactions can contain multiple operations for many source tables. Resolving the fully qualified table name for messages at the transaction level is non-deterministic, so it abends at runtime.
Example Templates
The following describes example template configuration values and the resolved values.
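A minimal sketch, assuming the handler is named kinesis (as in the sample configuration later in this chapter) and using the ${fullyQualifiedTableName} and ${primaryKeys} keywords described above:
gg.handler.kinesis.streamMappingTemplate=${fullyQualifiedTableName}
gg.handler.kinesis.partitionMappingTemplate=${primaryKeys}
With this configuration, each operation is written to a stream named after its fully qualified source table, and the partition key is built from the row's primary key values.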
Parent topic: Setting Up and Running the Kinesis Streams Handler
8.2.1.3.4 Resolving AWS Credentials
- AWS Kinesis Client Authentication
The Kinesis Handler is a client connection to the AWS Kinesis cloud service. The AWS cloud must be able to successfully authenticate the AWS client in order to successfully interface with Kinesis.
Parent topic: Setting Up and Running the Kinesis Streams Handler
8.2.1.3.4.1 AWS Kinesis Client Authentication
The Kinesis Handler is a client connection to the AWS Kinesis cloud service. The AWS cloud must be able to successfully authenticate the AWS client in order to successfully interface with Kinesis.
The AWS client authentication has become increasingly complicated as more authentication options have been added to the Kinesis Stream Handler. This topic explores the different use cases for AWS client authentication.
- Explicit Configuration of the Client ID and Secret
A client ID and secret are generally the required credentials for the Kinesis Handler to interact with Amazon Kinesis. A client ID and secret are generated using the Amazon AWS website. - Use of the AWS Default Credentials Provider Chain
If the gg.handler.name.accessKeyId and gg.handler.name.secretKey properties are unset, then credentials resolution reverts to the AWS default credentials provider chain. The AWS default credentials provider chain provides various ways by which the AWS credentials can be resolved. - AWS Federated Login
The use case is when you have your on-premise system login integrated with AWS. This means that when you log into an on-premise machine, you are also logged into AWS.
Parent topic: Resolving AWS Credentials
8.2.1.3.4.1.1 Explicit Configuration of the Client ID and Secret
A client ID and secret are generally the required credentials for the Kinesis Handler to interact with Amazon Kinesis. A client ID and secret are generated using the Amazon AWS website.
gg.handler.name.accessKeyId=
gg.handler.name.secretKey=
Furthermore, the Oracle Wallet functionality can be used to encrypt these credentials.
Parent topic: AWS Kinesis Client Authentication
8.2.1.3.4.1.2 Use of the AWS Default Credentials Provider Chain
If the gg.handler.name.accessKeyId and gg.handler.name.secretKey properties are unset, then credentials resolution reverts to the AWS default credentials provider chain. The AWS default credentials provider chain provides various ways by which the AWS credentials can be resolved.
When Oracle GoldenGate for Big Data runs on an AWS Elastic Compute Cloud (EC2) instance, the general use case is to resolve the credentials from the EC2 metadata service. The AWS default credentials provider chain provides resolution of credentials from the EC2 metadata service as one of the options.
Parent topic: AWS Kinesis Client Authentication
8.2.1.3.4.1.3 AWS Federated Login
The use case is when you have your on-premise system login integrated with AWS. This means that when you log into an on-premise machine, you are also logged into AWS.
- You may not want to generate client IDs and secrets. (Some users disable this feature in the AWS portal.)
- The client AWS applications need to interact with the AWS Security Token Service (STS) to obtain an authentication token for programmatic calls made to Kinesis. To enable this, set gg.handler.name.enableSTS=true, as shown in the sketch after this list.
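A minimal sketch of the related handler properties, assuming the handler is named kinesis (as in the sample configuration later in this chapter) and an illustrative STS region:
gg.handler.kinesis.enableSTS=true
gg.handler.kinesis.STSRegion=us-east-1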
Parent topic: AWS Kinesis Client Authentication
8.2.1.3.5 Configuring the Proxy Server for Kinesis Streams Handler
Oracle GoldenGate can be used with a proxy server using the following parameters to enable the proxy server:
gg.handler.name.proxyServer=
gg.handler.name.proxyPort=80
gg.handler.name.proxyUsername=username
gg.handler.name.proxyPassword=password
Sample configurations:
gg.handlerlist=kinesis
gg.handler.kinesis.type=kinesis_streams
gg.handler.kinesis.mode=op
gg.handler.kinesis.format=json
gg.handler.kinesis.region=us-west-2
gg.handler.kinesis.partitionMappingTemplate=TestPartitionName
gg.handler.kinesis.streamMappingTemplate=TestStream
gg.handler.kinesis.deferFlushAtTxCommit=true
gg.handler.kinesis.deferFlushOpCount=1000
gg.handler.kinesis.formatPerOp=true
#gg.handler.kinesis.customMessageGrouper=oracle.goldengate.handler.kinesis.KinesisJsonTxMessageGrouper
gg.handler.kinesis.proxyServer=www-proxy.myhost.com
gg.handler.kinesis.proxyPort=80
Parent topic: Setting Up and Running the Kinesis Streams Handler
8.2.1.3.6 Configuring Security in Kinesis Streams Handler
The Amazon Web Services (AWS) Kinesis Java SDK uses HTTPS to communicate with Kinesis. Mutual authentication is enabled. The AWS server passes a Certificate Authority (CA) signed certificate to the AWS client, which allows the client to authenticate the server. The AWS client passes credentials (client ID and secret) to the AWS server, which allows the server to authenticate the client.
Parent topic: Setting Up and Running the Kinesis Streams Handler
8.2.1.4 Kinesis Handler Performance Considerations
Parent topic: Amazon Kinesis
8.2.1.4.1 Kinesis Streams Input Limitations
The maximum write rate to a Kinesis stream with a single shard is 1000 messages per second, up to a maximum of 1MB of data per second. You can scale input to Kinesis by adding additional Kinesis streams or adding shards to streams. Both adding streams and adding shards can linearly increase the Kinesis input capacity and thereby improve performance of the Oracle GoldenGate Kinesis Streams Handler.
Adding streams or shards can linearly increase the potential throughput, as follows:
- 1 stream with 2 shards = 2000 messages per second up to a total data size of 2MB per second.
- 3 streams of 1 shard each = 3000 messages per second up to a total data size of 3MB per second.
To fully take advantage of streams and shards, you must configure the Oracle GoldenGate Kinesis Streams Handler to distribute messages as evenly as possible across streams and shards.
Adding Kinesis streams or shards does nothing to scale Kinesis input if all data is sent using a static partition key into a single Kinesis stream. Kinesis streams are resolved at runtime using the selected mapping methodology. For example, mapping the source table name as the Kinesis stream name may provide good distribution of messages across Kinesis streams if operations from the source trail file are evenly distributed across tables. Shards are selected by a hash of the partition key. Partition keys are resolved at runtime using the selected mapping methodology. Therefore, it is best to choose a mapping methodology to a partition key that changes rapidly to ensure a good distribution of messages across shards.
Parent topic: Kinesis Handler Performance Considerations
8.2.1.4.2 Transaction Batching
The Oracle GoldenGate Kinesis Streams Handler receives messages and then batches messages together by Kinesis stream before sending them via synchronous HTTPS calls to Kinesis. At transaction commit, all outstanding messages are flushed to Kinesis. The flush call to Kinesis impacts performance. Therefore, deferring the flush call can dramatically improve performance.
The recommended way to defer the flush call is to use the GROUPTRANSOPS configuration in the Replicat configuration. The GROUPTRANSOPS parameter groups multiple small transactions into a single larger transaction, deferring the transaction commit call until the larger transaction is completed. The GROUPTRANSOPS parameter works by counting the database operations (inserts, updates, and deletes) and only commits the transaction group when the number of operations equals or exceeds the GROUPTRANSOPS configuration setting. The default GROUPTRANSOPS setting for Replicat is 1000.
Interim flushes to Kinesis may be required when the GROUPTRANSOPS setting is set to a large amount. An individual call to send batch messages for a Kinesis stream cannot exceed 500 individual messages or 5MB. If the count of pending messages exceeds 500 messages or 5MB on a per-stream basis, then the Kinesis Handler is required to perform an interim flush.
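A minimal sketch of a Replicat parameter file that raises the grouping threshold; the Replicat name, properties file path, and MAP statement are illustrative:
REPLICAT rkinesis
-- Java Adapter properties file for the Kinesis Streams Handler
TARGETDB LIBFILE libggjava.so SET property=dirprm/kinesis.props
-- Group many small source transactions into one target transaction
GROUPTRANSOPS 2500
MAP qasource.*, TARGET qasource.*;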
Parent topic: Kinesis Handler Performance Considerations
8.2.1.4.3 Deferring Flush at Transaction Commit
The messages are by default flushed to Kinesis at transaction commit to ensure write durability. However, it is possible to defer the flush beyond the transaction commit. This is only advisable when messages are grouped and sent to Kinesis at the transaction level (that is, one transaction = one Kinesis message, or chunked into a small number of Kinesis messages), so that the transaction is captured as a single messaging unit.
This may require setting the GROUPTRANSOPS Replicat parameter to 1 so as not to group multiple smaller transactions from the source trail file into a larger output transaction. This can impact performance because only one or a few messages are sent per transaction, and then the transaction commit call is invoked, which in turn triggers the flush call to Kinesis.
To maintain good performance, the Oracle GoldenGate Kinesis Streams Handler allows you to defer the Kinesis flush call beyond the transaction commit call. The Oracle GoldenGate Replicat process maintains its checkpoint in the .cpr file in the {GoldenGate Home}/dirchk directory. The Java Adapter also maintains a checkpoint file in this directory named .cpj. In this mode of operation, the Replicat checkpoint may move beyond the point for which the Oracle GoldenGate Kinesis Handler can guarantee that message loss will not occur. However, the GoldenGate Kinesis Streams Handler maintains the correct checkpoint in the .cpj file. Running in this mode does not result in message loss even after a crash, because on restart the checkpoint in the .cpj file is used if it is before the checkpoint in the .cpr file.
Parent topic: Kinesis Handler Performance Considerations
8.2.1.5 Troubleshooting
Topics:
8.2.1.5.1 Java Classpath
The most common initial error is an incorrect classpath that fails to include all the required AWS Kinesis Java SDK client libraries, which results in a ClassNotFound exception in the log file.
You can troubleshoot by setting the Java Adapter logging to DEBUG, and then rerunning the process. At the debug level, the logging includes information about which JARs were added to the classpath from the gg.classpath configuration variable.
The gg.classpath variable supports the wildcard asterisk (*) character to select all JARs in a configured directory. For example, /usr/kinesis/sdk/*, see Setting Up and Running the Kinesis Streams Handler.
Parent topic: Troubleshooting
8.2.1.5.2 Kinesis Handler Connectivity Issues
If the Kinesis Streams Handler is unable to connect to Kinesis when running on premise, the problem may be that connectivity to the public Internet is protected by a proxy server. Proxy servers act as a gateway between the private network of a company and the public Internet. Contact your network administrator to get the URLs of your proxy server, and then follow the directions in Configuring the Proxy Server for Kinesis Streams Handler.
Parent topic: Troubleshooting
8.2.1.5.3 Logging
The Kinesis Streams Handler logs the state of its configuration to the Java log file.
This is helpful because you can review the configuration values for the handler. Following is a sample of the logging of the state of the configuration:
**** Begin Kinesis Streams Handler - Configuration Summary ****
Mode of operation is set to op.
The AWS region name is set to [us-west-2].
A proxy server has been set to [www-proxy.us.oracle.com] using port [80].
The Kinesis Streams Handler will flush to Kinesis at transaction commit.
Messages from the GoldenGate source trail file will be sent at the operation level.
One operation = One Kinesis Message
The stream mapping template of [${fullyQualifiedTableName}] resolves to [fully qualified table name].
The partition mapping template of [${primaryKeys}] resolves to [primary keys].
**** End Kinesis Streams Handler - Configuration Summary ****
Parent topic: Troubleshooting
8.2.2 Amazon MSK
Amazon MSK is a fully managed, secure, and highly available Apache Kafka service. You can use the Apache Kafka Handler to replicate to Amazon MSK.
Parent topic: Target
8.2.3 Amazon Redshift
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. The purpose of the Redshift Event Handler is to apply operations into Redshift tables.
See Flat Files.
- Detailed Functionality
Ensure that you use the Redshift Event handler as a downstream Event handler connected to the output of the S3 Event handler. The S3 Event handler loads files generated by the File Writer Handler into Amazon S3. - Operation Aggregation
- Unsupported Operations and Limitations
- Uncompressed UPDATE records
It is mandatory that the trail files used to apply to Redshift contain uncompressed UPDATE operation records, which means that the UPDATE operations contain the full image of the row being updated. - Error During the Data Load Process
Staging operation data from AWS S3 onto temporary staging tables and updating the target table occurs inside a single transaction. In case of any error(s), the entire transaction is rolled back and the replicat process will ABEND. - Troubleshooting and Diagnostics
- Classpath
Redshift apply relies on the upstream File Writer handler and the S3 Event handler. - Configuration
- INSERTALLRECORDS Support
- Redshift COPY SQL Authorization
The Redshift event handler uses COPY SQL to read staged files in Amazon Web Services (AWS) S3 buckets. The COPY SQL query may need authorization credentials to access files in AWS S3. - Co-ordinated Apply Support
Parent topic: Target
8.2.3.1 Detailed Functionality
Ensure that you use the Redshift Event handler as a downstream Event handler connected to the output of the S3 Event handler. The S3 Event handler loads files generated by the File Writer Handler into Amazon S3.
The Redshift Event handler uses COPY SQL to bulk load operation data staged in S3 into temporary Redshift staging tables. The staging table data is then used to update the target table. All the SQL operations are performed in batches, providing better throughput.
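For illustration, the staging step issues COPY statements of the following general form; the table, bucket, file, and IAM role names are hypothetical, and the handler generates the actual SQL and options:
COPY stage_schema.tcustord_stg
FROM 's3://my-ogg-bucket/ogg/tcustord/part-0001.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
FORMAT AS CSV;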
Parent topic: Amazon Redshift
8.2.3.2 Operation Aggregation
- Aggregation In Memory
Before loading the operation data into S3, the operations in the trail file are aggregated. Operation aggregation is the process of aggregating (merging/compressing) multiple operations on the same row into a single output operation based on a threshold. - Aggregation using SQL post loading data into the staging table
In this aggregation operation, the in-memory operation aggregation need not be performed. The operation data loaded into the temporary staging table is aggregated using SQL queries, such that the staging table contains just one row per key.
Parent topic: Amazon Redshift
8.2.3.2.1 Aggregation In Memory
Before loading the operation data into S3, the operations in the trail file are aggregated. Operation aggregation is the process of aggregating (merging/compressing) multiple operations on the same row into a single output operation based on a threshold.
Table 8-3 Configuration Properties
Properties | Required/ Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.aggregate.operations | Optional | true or false | | Aggregate operations based on the primary key of the operation record. |
Parent topic: Operation Aggregation
8.2.3.2.2 Aggregation using SQL post loading data into the staging table
In this aggregation operation, the in-memory operation aggregation need not be performed. The operation data loaded into the temporary staging table is aggregated using SQL queries, such that the staging table contains just one row per key.
Table 8-4 Configuration Properties
Properties | Required/ Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.eventhandler.name.aggregateStagingTableRows | Optional | True or False | False | Use SQL to aggregate staging table data before updating the target table. |
Parent topic: Operation Aggregation
8.2.3.3 Unsupported Operations and Limitations
The following operations are not supported by the Redshift Handler:
- DDL changes are not supported.
- Timestamp and Timestamp with Time zone data types: The maximum precision supported is up to microseconds; the nanoseconds portion is truncated. This is a limitation observed with the Redshift COPY SQL.
- Redshift COPY SQL limits the maximum size of a single input row from any source to 4MB.
Parent topic: Amazon Redshift
8.2.3.4 Uncompressed UPDATE records
It is mandatory that the trail files used to apply to Redshift contain uncompressed UPDATE operation records, which means that the UPDATE operations contain the full image of the row being updated.
If UPDATE records have missing columns, then such columns are updated in the target as null. By setting the parameter gg.abend.on.missing.columns=true, Replicat can fail fast on detecting a compressed update trail record. This is the recommended setting.
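For example, in the Replicat's Java Adapter properties file:
# Abend if a compressed (partial-image) UPDATE record is detected
gg.abend.on.missing.columns=true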
Parent topic: Amazon Redshift
8.2.3.5 Error During the Data Load Process
Staging operation data from AWS S3 onto temporary staging tables and updating the target table occurs inside a single transaction. In case of any error(s), the entire transaction is rolled back and the replicat process will ABEND.
If there are errors with the COPY SQL, then the Redshift system table stl_load_errors is also queried and the error traces are made available in the handler log file.
Parent topic: Amazon Redshift
8.2.3.6 Troubleshooting and Diagnostics
- Connectivity issues to Redshift
- Validate JDBC connection URL, user name and password.
- Check if http/https proxy is enabled. Generally, Redshift endpoints cannot be accessed via proxy.
- DDL and Truncate operations not applied on the target table: The Redshift handler will ignore DDL and truncate records in the source trail file.
- Target table existence: It is expected that the Redshift target table exists before starting the apply process. Target tables need to be designed with primary keys, sort keys, and partition distribution key columns. Approximations based on the column metadata in the trail file may not always be correct. Therefore, Redshift apply will ABEND if the target table is missing.
- Operation aggregation in memory (gg.aggregate.operations=true) is memory intensive, whereas operation aggregation using SQL (gg.eventhandler.name.aggregateStagingTableRows=true) requires more SQL processing on the Redshift database. These configurations are mutually exclusive and only one of them should be enabled at a time. Tests within Oracle have revealed that operation aggregation in memory delivers a better apply rate. This may not be the case on all customer deployments.
- Diagnostic information on the apply process is logged onto the handler log file.
- Operation aggregation time (in milli-seconds) in-memory:
INFO 2018-10-22 02:53:57.000980 [pool-5-thread-1] - Merge statistics ********START*********************************
INFO 2018-10-22 02:53:57.000980 [pool-5-thread-1] - Number of update operations merged into an existing update operation: [232653]
INFO 2018-10-22 02:53:57.000980 [pool-5-thread-1] - Time spent aggregating operations : [22064]
INFO 2018-10-22 02:53:57.000980 [pool-5-thread-1] - Time spent flushing aggregated operations : [36382]
INFO 2018-10-22 02:53:57.000980 [pool-5-thread-1] - Merge statistics ********END***********************************
- Stage and load processing time (in milli-seconds) for SQL queries
INFO 2018-10-22 02:54:19.000338 [pool-4-thread-1] - Stage and load statistics ********START*********************************
INFO 2018-10-22 02:54:19.000338 [pool-4-thread-1] - Time spent for staging process [277093]
INFO 2018-10-22 02:54:19.000338 [pool-4-thread-1] - Time spent for load process [32650]
INFO 2018-10-22 02:54:19.000338 [pool-4-thread-1] - Stage and load statistics ********END***********************************
- Stage time (in milli-seconds) will also include additional statistics if operation aggregation using SQL is enabled.
- Co-existence of the components: The location/region of the machine where the Replicat process is running, the AWS S3 bucket region, and the Redshift cluster region impact the overall throughput of the apply process. The data flow is as follows: GoldenGate => AWS S3 => AWS Redshift. For best throughput, the components need to be located as close to one another as possible.
Parent topic: Amazon Redshift
8.2.3.7 Classpath
Redshift apply relies on the upstream File Writer handler and the S3 Event handler.
Include the required JARs needed to run the S3 Event handler in gg.classpath. See Amazon S3. The Redshift Event handler uses the Redshift JDBC driver. Ensure that you include the JAR file in gg.classpath as shown in the following example:
gg.classpath=aws-java-sdk-1.11.356/lib/*:aws-java-sdk-1.11.356/third-party/lib/*:./RedshiftJDBC42-no-awssdk-1.2.8.1005.jar
Parent topic: Amazon Redshift
8.2.3.8 Configuration
Automatic Configuration
AWS Redshift data warehouse replication involves configuring multiple components, such as the File Writer handler, the S3 event handler, and the Redshift event handler. The Automatic Configuration feature auto-configures these components so that you need to perform only minimal configuration. The properties modified by auto configuration are also logged in the handler log file.
gg.target=redshift
gg.target
Required
Legal Value: redshift
Default: None
Explanation: Enables replication to Redshift target
When replicating to a Redshift target, the customization of the S3 event handler name and the Redshift event handler name is not allowed.
File Writer Handler Configuration
The File Writer handler name is pre-set to the value redshift. The following is an example of editing a property of the File Writer handler:
gg.handler.redshift.pathMappingTemplate=./dirout
S3 Event Handler Configuration
The S3 event handler name is pre-set to the value s3. The following is an example of editing a property of the S3 event handler:
gg.eventhandler.s3.bucketMappingTemplate=bucket1
Redshift Event Handler Configuration
The Redshift event handler name is pre-set to the value redshift.
Table 8-5 Properties
Properties | Required/Optional | Legal Value | Default | Explanation |
---|---|---|---|---|
gg.eventhandler.redshift.connectionURL | Required | Redshift JDBC Connection URL | None | Sets the Redshift JDBC connection URL. For example: jdbc:redshift://aws-redshift-instance.cjoaij3df5if.us-east-2.redshift.amazonaws.com:5439/mydb |
gg.eventhandler.redshift.UserName | Required | JDBC User Name | None | Sets the Redshift database user name. |
gg.eventhandler.redshift.Password | Required | JDBC Password | None | Sets the Redshift database password. |
gg.eventhandler.redshift.awsIamRole | Optional | AWS role ARN in the format: arn:aws:iam::<aws_account_id>:role/<role_name> | None | AWS IAM role ARN that the Redshift cluster uses for authentication and authorization for executing COPY SQL to access objects in AWS S3 buckets. |
gg.eventhandler.redshift.useAwsSecurityTokenService | Optional | true or false | Value is set from the configuration property set in the upstream S3 Event handler gg.eventhandler.s3.enableSTS. | Use AWS Security Token Service for authorization. For more information, see Redshift COPY SQL Authorization. |
gg.eventhandler.redshift.awsSTSEndpoint | Optional | A valid HTTPS URL. | Value is set from the configuration property set in the upstream S3 Event handler gg.eventhandler.s3.stsURL. | The AWS STS endpoint string. For example: https://sts.us-east-1.amazonaws.com. For more information, see Redshift COPY SQL Authorization. |
gg.eventhandler.redshift.awsSTSRegion | Optional | A valid AWS region. | Value is set from the configuration property set in the upstream S3 Event handler gg.eventhandler.s3.stsRegion. | The AWS STS region. For example, us-east-1. For more information, see Redshift COPY SQL Authorization. |
gg.initialLoad | Optional | true or false | false | If set to true, initial load mode is enabled. See INSERTALLRECORDS Support. |
gg.operation.aggregator.validate.keyupdate | Optional | true or false | false | If set to true, Operation Aggregator will validate key update operations (optype 115) and correct to normal update if no key values have changed. Compressed key update operations do not qualify for merge. |
End-to-End Configuration
The following is an end-end configuration example which uses auto configuration for FW handler, S3 and Redshift Event handlers.
The sample properties are available at the following location
- In an Oracle GoldenGate Classic install:
<oggbd_install_dir>/AdapterExamples/big-data/redshift-via-s3/rs.props
- In an Oracle GoldenGate Microservices install:
<oggbd_install_dir>/opt/AdapterExamples/big-data/redshift-via-s3/rs.props
# Configuration to load GoldenGate trail operation records
# into Amazon Redshift by chaining
# File writer handler -> S3 Event handler -> Redshift Event handler.
# Note: Recommended to only edit the configuration marked as TODO
gg.target=redshift

#The S3 Event Handler
#TODO: Edit the AWS region
gg.eventhandler.s3.region=<aws region>
#TODO: Edit the AWS S3 bucket
gg.eventhandler.s3.bucketMappingTemplate=<s3bucket>

#The Redshift Event Handler
#TODO: Edit ConnectionUrl
gg.eventhandler.redshift.connectionURL=jdbc:redshift://aws-redshift-instance.cjoaij3df5if.us-east-2.redshift.amazonaws.com:5439/mydb
#TODO: Edit Redshift user name
gg.eventhandler.redshift.UserName=<db user name>
#TODO: Edit Redshift password
gg.eventhandler.redshift.Password=<db password>
#TODO: Set the classpath to include AWS Java SDK and Redshift JDBC driver.
gg.classpath=aws-java-sdk-1.11.356/lib/*:aws-java-sdk-1.11.356/third-party/lib/*:./RedshiftJDBC42-no-awssdk-1.2.8.1005.jar

jvm.bootoptions=-Xmx8g -Xms32m
Parent topic: Amazon Redshift
8.2.3.9 INSERTALLRECORDS Support
Stage and merge targets support the INSERTALLRECORDS parameter. See INSERTALLRECORDS in Reference for Oracle GoldenGate. Set the INSERTALLRECORDS parameter in the Replicat parameter file (.prm).
Setting this property directs the Replicat process to use bulk insert operations to load operation data into the target table. You can tune the batch size of bulk inserts using the File Writer property gg.handler.redshift.maxFileSize. The default value is set to 1GB. The frequency of bulk inserts can be tuned using the File Writer property gg.handler.redshift.fileRollInterval; the default value is set to 3m (three minutes).
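A minimal sketch of a Replicat parameter file that applies all records as inserts; the Replicat name, properties file path, and MAP statement are illustrative:
REPLICAT rsrep
TARGETDB LIBFILE libggjava.so SET property=dirprm/rs.props
-- Direct the Replicat to apply all operations as inserts
INSERTALLRECORDS
MAP qasource.*, TARGET qasource.*;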
Parent topic: Amazon Redshift
8.2.3.10 Redshift COPY SQL Authorization
The Redshift event handler uses COPY SQL to read staged files in Amazon Web Services (AWS) S3 buckets. The COPY SQL query may need authorization credentials to access files in AWS S3.
Authorization can be provided by using an AWS Identity and Access Management (IAM) role that is attached to the Redshift cluster, or by providing an AWS access key and a secret for the access key. As a security consideration, it is a best practice to use role-based access when possible.
AWS Key-Based Authorization
With key-based access control, you provide the access key ID and secret access key for an AWS IAM user that is authorized to access AWS S3. The access key id and secret access key are retrieved by looking up the credentials as follows:
- Environment variables - AWS_ACCESS_KEY/AWS_ACCESS_KEY_ID and AWS_SECRET_KEY/AWS_SECRET_ACCESS_KEY.
- Java System Properties - aws.accessKeyId and aws.secretKey.
- Credential profiles file at the default location (~/.aws/credentials).
- Amazon Elastic Container Service (ECS) container credentials loaded from Amazon ECS if the environment variable AWS_CONTAINER_CREDENTIALS_RELATIVE_URI is set.
- Instance profile credentials retrieved from the Amazon Elastic Compute Cloud (EC2) metadata service.
Running Replicat on an AWS EC2 Instance
If the Replicat process is started on an AWS EC2 instance, then the access key ID and secret access key are automatically retrieved by Oracle GoldenGate for Big Data and no explicit user configuration is required.
Temporary Security Credentials using AWS Security Token Service (STS)
If you use the key-based access control, then you can further limit the access users have to your data by retrieving temporary security credentials using AWS Security Token Service. The auto configure feature of the Redshift event handler automatically picks up the AWS Security Token Service (STS) configuration from S3 event handler.
Table 8-6 S3 Event Handler Configuration and Redshift Event Handler Configuration
S3 Event Handler Configuration | Redshift Event Handler Configuration |
---|---|
enableSTS | useAwsSecurityTokenService |
stsURL | awsSTSEndpoint |
stsRegion | awsSTSRegion |
AWS IAM Role-based Authorization
With role-based authorization, the Redshift cluster temporarily assumes an IAM role when executing COPY SQL. You need to provide the role Amazon Resource Name (ARN) as a configuration value in gg.eventhandler.redshift.AwsIamRole. For example:
gg.eventhandler.redshift.AwsIamRole=arn:aws:iam::<aws_account_id>:role/<role_name>
The role needs to be authorized to read the respective S3 bucket. Ensure that the trust relationship of the role contains the AWS Redshift service. Additionally, attach this role to the Redshift cluster before starting the Redshift cluster. For example, the following AWS IAM policy can be used in the trust relationship of the role:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": [ "redshift.amazonaws.com" ] }, "Action": "sts:AssumeRole" } ] }
If the role-based authorization is configured (gg.eventhandler.redshift.AwsIamRole), then it is given priority over key-based authorization.
Parent topic: Amazon Redshift
8.2.3.11 Co-ordinated Apply Support
To enable co-ordinated apply for Redshift, ensure that the Redshift database's isolation level is set to SNAPSHOT. The Redshift SNAPSHOT ISOLATION option allows higher concurrency, where concurrent modifications to different rows in the same table can complete successfully.
SQL Query to Alter the Database's Isolation Level
ALTER DATABASE <sampledb> ISOLATION LEVEL SNAPSHOT;
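For reference, a coordinated Replicat is registered with the COORDINATED option; the group name, thread count, and trail below are illustrative and the exact command depends on your deployment:
ADD REPLICAT rsrd, COORDINATED MAXTHREADS 5, EXTTRAIL ./dirdat/rs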
Parent topic: Amazon Redshift
8.2.4 Amazon S3
Learn how to use the S3 Event Handler, which provides the interface to Amazon S3 web services.
8.2.4.1 Overview
Amazon S3 is object storage hosted in the Amazon cloud. The purpose of the S3 Event Handler is to load data files generated by the File Writer Handler into Amazon S3, see https://aws.amazon.com/s3/.
You can use any format that the File Writer Handler supports, see Flat Files.
Parent topic: Amazon S3
8.2.4.2 Detailing Functionality
The S3 Event Handler requires the Amazon Web Services (AWS) Java SDK to transfer files to S3 object storage. Oracle GoldenGate for Big Data does not include the AWS Java SDK. You have to download and install the AWS Java SDK from:
https://aws.amazon.com/sdk-for-java/
Then you must configure the gg.classpath variable to include the JAR files in the AWS Java SDK, which are divided into two directories. Both directories must be in gg.classpath, for example:
gg.classpath=/usr/var/aws-java-sdk-1.11.240/lib/*:/usr/var/aws-java-sdk-1.11.240/third-party/lib/*
8.2.4.2.1 Resolving AWS Credentials
- Amazon Web Services Simple Storage Service Client Authentication
The S3 Event Handler is a client connection to the Amazon Web Services (AWS) Simple Storage Service (S3) cloud service. The AWS cloud must be able to successfully authenticate the AWS client in order to successfully interface with S3.
Parent topic: Detailing Functionality
8.2.4.2.1.1 Amazon Web Services Simple Storage Service Client Authentication
The S3 Event Handler is a client connection to the Amazon Web Services (AWS) Simple Storage Service (S3) cloud service. The AWS cloud must be able to successfully authenticate the AWS client in order to successfully interface with S3.
- Explicit Configuration of the Client ID and Secret
A client ID and secret are generally the required credentials for the S3 Event Handler to interact with Amazon S3. A client ID and secret are generated using the Amazon AWS website. - Use of the AWS Default Credentials Provider Chain
If the gg.eventhandler.name.accessKeyId and gg.eventhandler.name.secretKey are unset, then credentials resolution reverts to the AWS default credentials provider chain. The AWS default credentials provider chain provides various ways by which the AWS credentials can be resolved. - AWS Federated Login
The use case is when you have your on-premise system login integrated with AWS. This means that when you log into an on-premise machine, you are also logged into AWS.
Parent topic: Resolving AWS Credentials
8.2.4.2.1.1.1 Explicit Configuration of the Client ID and Secret
A client ID and secret are generally the required credentials for the S3 Event Handler to interact with Amazon S3. A client ID and secret are generated using the Amazon AWS website.
gg.eventhandler.name.accessKeyId=
gg.eventhandler.name.secretKey=
Furthermore, the Oracle Wallet functionality can be used to encrypt these credentials.
8.2.4.2.1.1.2 Use of the AWS Default Credentials Provider Chain
If the gg.eventhandler.name.accessKeyId and gg.eventhandler.name.secretKey are unset, then credentials resolution reverts to the AWS default credentials provider chain. The AWS default credentials provider chain provides various ways by which the AWS credentials can be resolved.
When Oracle GoldenGate for Big Data runs on an AWS Elastic Compute Cloud (EC2) instance, the general use case is to resolve the credentials from the EC2 metadata service. The AWS default credentials provider chain provides resolution of credentials from the EC2 metadata service as one of the options.
8.2.4.2.1.1.3 AWS Federated Login
The use case is when you have your on-premise system login integrated with AWS. This means that when you log into an on-premise machine, you are also logged into AWS.
- You may not want to generate client IDs and secrets. (Some users disable this feature in the AWS portal).
- The client AWS applications need to interact with the AWS Security Token Service (STS) to obtain an authentication token for programmatic calls made to S3. To enable this, set gg.eventhandler.name.enableSTS=true.
8.2.4.2.2 About the AWS S3 Buckets
AWS divides S3 storage into separate file systems called buckets. The S3 Event Handler can write to pre-created buckets. Alternatively, if the S3 bucket does not exist, the S3 Event Handler attempts to create the specified S3 bucket. AWS requires that S3 bucket names are lowercase. Amazon S3 bucket names must be globally unique. If you attempt to create an S3 bucket that already exists in any Amazon account, it causes the S3 Event Handler to abend.
Parent topic: Detailing Functionality
8.2.4.2.3 Troubleshooting
Connectivity Issues
If the S3 Event Handler is unable to connect to the S3 object storage when running on premise, it is likely that your connectivity to the public internet is protected by a proxy server. Proxy servers act as a gateway between the private network of a company and the public internet. Contact your network administrator to get the URLs of your proxy server.
Oracle GoldenGate can be used with a proxy server using the following parameters to enable the proxy server:
gg.eventhandler.name.proxyServer=
gg.eventhandler.name.proxyPort=80
gg.eventhandler.name.proxyUsername=username
gg.eventhandler.name.proxyPassword=password
Sample configuration:
gg.eventhandler.s3.type=s3
gg.eventhandler.s3.region=us-west-2
gg.eventhandler.s3.proxyServer=www-proxy.us.oracle.com
gg.eventhandler.s3.proxyPort=80
gg.eventhandler.s3.proxyProtocol=HTTP
gg.eventhandler.s3.bucketMappingTemplate=yourbucketname
gg.eventhandler.s3.pathMappingTemplate=thepath
gg.eventhandler.s3.finalizeAction=none
Parent topic: Detailing Functionality
8.2.4.3 Configuring the S3 Event Handler
You can configure the S3 Event Handler operation using the properties file. These properties are located in the Java Adapter properties file (not in the Replicat properties file).
To enable the selection of the S3 Event Handler, you must first configure the
handler type by specifying gg.eventhandler.name.type=s3
and
the other S3 Event properties as follows:
Table 8-7 S3 Event Handler Configuration Properties
Properties | Required/ Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.eventhandler.name.type | Required | s3 | None | Selects the S3 Event Handler for use with Replicat. |
gg.eventhandler.name.region | Required | The AWS region name that is hosting your S3 instance. | None | Setting the legal AWS region name is required. |
gg.eventhandler.name.cannedACL | Optional | Accepts one of the following values: | None | Amazon S3 supports a set of predefined grants, known as canned Access Control Lists. Each canned ACL has a predefined set of grantees and permissions. For more information, see Managing access with ACLs |
gg.eventhandler.name.proxyServer | Optional | The host name of your proxy server. | None | Sets the host name of your proxy server if connectivity to AWS requires a proxy server. |
gg.eventhandler.name.proxyPort | Optional | The port number of the proxy server. | None | Sets the port number of the proxy server if connectivity to AWS requires a proxy server. |
gg.eventhandler.name.proxyUsername | Optional | The username of the proxy server. | None | Sets the user name of the proxy server if connectivity to AWS requires a proxy server and the proxy server requires credentials. |
gg.eventhandler.name.proxyPassword | Optional | The password of the proxy server. | None | Sets the password for the user name of the proxy server if connectivity to AWS requires a proxy server and the proxy server requires credentials. |
gg.eventhandler.name.bucketMappingTemplate | Required | A string with resolvable keywords and constants used to dynamically generate the S3 bucket name at runtime. | None | Use resolvable keywords and constants used to dynamically generate the S3 bucket name at runtime. The handler attempts to create the S3 bucket if it does not exist. AWS requires bucket names to be all lowercase. A bucket name with uppercase characters results in a runtime exception. See Template Keywords. |
gg.eventhandler.name.pathMappingTemplate | Required | A string with resolvable keywords and constants used to dynamically generate the path in the S3 bucket to write the file. | None | Use keywords interlaced with constants to dynamically generate unique S3 path names at runtime. See Template Keywords. |
 | Optional | A string with resolvable keywords and constants used to dynamically generate the S3 file name at runtime. | None | Use resolvable keywords and constants used to dynamically generate the S3 data file name at runtime. If not set, the upstream file name is used. See Template Keywords. |
 | Optional | | None | |
 | Optional | A unique string identifier cross referencing a child event handler. | No event handler configured. | Sets the event handler that is invoked on the file roll event. Event handlers can do file roll event actions like loading files to S3, converting to Parquet or ORC format, or loading files to HDFS. |
 | Optional (unless Dell ECS, then required) | A legal URL to connect to cloud storage. | None | Not required for Amazon AWS S3. Required for Dell ECS. Sets the URL to connect to cloud storage. |
gg.eventhandler.name.proxyProtocol | Optional | | | Sets the proxy protocol connection to the proxy server for additional level of security. The client first performs an SSL handshake with the proxy server, and then an SSL handshake with Amazon AWS. This feature was added into the Amazon SDK in version 1.11.396 so you must use at least that version to use this property. |
 | Optional | | Empty | Set only if you are enabling S3 server side encryption. Use the parameters to set the algorithm for server side encryption in S3. |
 | Optional | A legal AWS key management system server side management key or the alias that represents that key. | Empty | Set only if you are enabling S3 server side encryption and the S3 algorithm is aws:kms. |
gg.eventhandler.name.enableSTS | Optional | true or false | false | Set to true to enable the S3 Event Handler to obtain credentials from the AWS Security Token Service. |
gg.eventhandler.name.STSAssumeRole | Optional | AWS user and role in the following format: {user arn}:role/{role name} | None | Set configuration if you want to assume a different user/role. Only valid with STS enabled. |
gg.eventhandler.name.STSAssumeRoleSessionName | Optional | Any string. | AssumeRoleSession1 | The assumed role requires a session name for session logging. However, this can be any value. Only valid if both gg.eventhandler.name.enableSTS=true and gg.eventhandler.name.STSAssumeRole are configured. |
gg.eventhandler.name.STSRegion | Optional | Any legal AWS region specifier. | The region is obtained from the gg.eventhandler.name.region property. | Use to resolve the region for the STS call. It is only valid if the gg.eventhandler.name.enableSTS property is set to true. |
gg.eventhandler.name.enableBucketAdmin | Optional | | | |
gg.eventhandler.name.accessKeyId | Optional | A valid AWS access key. | None | Set this parameter to explicitly set the access key for AWS. This parameter has no effect if gg.eventhandler.name.enableSTS is set to true. If this property is not set, then credentials resolution falls back to the AWS default credentials provider chain. |
gg.eventhandler.name.secretKey | Optional | A valid AWS secret key. | None | Set this parameter to explicitly set the secret key for AWS. This parameter has no effect if gg.eventhandler.name.enableSTS is set to true. If this property is not set, then credentials resolution falls back to the AWS default credentials provider chain. |
gg.eventhandler.s3.enableAccelerateMode | Optional | true or false | false | Enable/Disable Amazon S3 Transfer Acceleration to transfer files quickly and securely over long distances between your client and an S3 bucket. |
Parent topic: Amazon S3
8.2.5 Apache Cassandra
The Cassandra Handler provides the interface to Apache Cassandra databases.
This chapter describes how to use the Cassandra Handler.
- Overview
- Detailing the Functionality
- Setting Up and Running the Cassandra Handler
- About Automated DDL Handling
The Cassandra Handler performs the table check and reconciliation process the first time an operation for a source table is encountered. Additionally, a DDL event or a metadata change event causes the table definition in the Cassandra Handler to be marked as not suitable. - Performance Considerations
- Additional Considerations
- Troubleshooting
- Cassandra Handler Client Dependencies
What are the dependencies for the Cassandra Handler to connect to Apache Cassandra databases?
Parent topic: Target
8.2.5.1 Overview
Apache Cassandra is a NoSQL Database Management System designed to store large amounts of data. A Cassandra cluster configuration provides horizontal scaling and replication of data across multiple machines. It can provide high availability and eliminate a single point of failure by replicating data to multiple nodes within a Cassandra cluster. Apache Cassandra is open source and designed to run on low-cost commodity hardware.
Cassandra relaxes the axioms of a traditional relational database management systems (RDBMS) regarding atomicity, consistency, isolation, and durability. When considering implementing Cassandra, it is important to understand its differences from a traditional RDBMS and how those differences affect your specific use case.
Cassandra provides eventual consistency. Under the eventual consistency model, accessing the state of data for a specific row eventually returns the latest state of the data for that row as defined by the most recent change. However, there may be a latency period between the creation and modification of the state of a row and what is returned when the state of that row is queried. The benefit of eventual consistency is that the latency period is predicted based on your Cassandra configuration and the level of work load that your Cassandra cluster is currently under, see http://cassandra.apache.org/.
The Cassandra Handler provides some control over consistency with the configuration of the gg.handler.name.consistencyLevel
property in the Java Adapter properties file.
Parent topic: Apache Cassandra
8.2.5.2 Detailing the Functionality
- About the Cassandra Data Types
- About Catalog, Schema, Table, and Column Name Mapping
Traditional RDBMSs separate structured data into tables. Related tables are included in higher-level collections called databases. Cassandra contains both of these concepts. Tables in an RDBMS are also tables in Cassandra, while database schemas in an RDBMS are keyspaces in Cassandra. - About DDL Functionality
- How Operations are Processed
- About Compressed Updates vs. Full Image Updates
- About Primary Key Updates
Parent topic: Apache Cassandra
8.2.5.2.1 About the Cassandra Data Types
Cassandra provides a number of column data types and most of these data types are supported by the Cassandra Handler.
- Supported Cassandra Data Types
ASCII BIGINT BLOB BOOLEAN DATE DECIMAL DOUBLE DURATION FLOAT INET INT SMALLINT TEXT TIME TIMESTAMP TIMEUUID TINYINT UUID VARCHAR VARINT
- Unsupported Cassandra Data Types
COUNTER MAP SET LIST UDT (user defined type) TUPLE CUSTOM_TYPE
- Supported Database Operations
INSERT UPDATE (captured as INSERT) DELETE
The Cassandra commit log files do not record any before images for the UPDATE or DELETE operations, so the captured operations never have a before image section. Oracle GoldenGate features that rely on before image records, such as Conflict Detection and Resolution, are not available.
- Unsupported Database Operations
TRUNCATE DDL (CREATE, ALTER, DROP)
The data type of the column value in the source trail file must be converted to the corresponding Java type representing the Cassandra column type in the Cassandra Handler. This data conversion introduces the risk of a runtime conversion error. A poorly mapped field (such as a source varchar column containing alphanumeric data mapped to a Cassandra int column) may cause a runtime error and cause the Cassandra Handler to abend. You can view the Cassandra Java type mappings in the Datastax Java Driver documentation.
It is possible that the data may require specialized processing to get converted to the corresponding Java type for intake into Cassandra. If this is the case, you have two options:
-
Try to use the general regular expression search and replace functionality to format the source column value data in a way that can be converted into the Java data type for use in Cassandra.
Or
-
Implement or extend the default data type conversion logic to override it with custom logic for your use case. Contact Oracle Support for guidance.
Parent topic: Detailing the Functionality
8.2.5.2.2 About Catalog, Schema, Table, and Column Name Mapping
Traditional RDBMSs separate structured data into tables. Related tables are included in higher-level collections called databases. Cassandra contains both of these concepts. Tables in an RDBMS are also tables in Cassandra, while database schemas in an RDBMS are keyspaces in Cassandra.
It is important to understand how data maps from the metadata definition in the source trail file are mapped to the corresponding keyspace and table in Cassandra. Source tables are generally either two-part names defined as schema.table
,or three-part names defined as catalog.schema.table
.
The following table explains how catalog, schema, and table names map into Cassandra. Unless you use special syntax, Cassandra converts all keyspace names, table names, and column names to lowercase.
Table Name in Source Trail File | Cassandra Keyspace Name | Cassandra Table Name |
---|---|---|
schema.table (two-part name) | schema (converted to lowercase) | table (converted to lowercase) |
catalog.schema.table (three-part name) | catalog_schema (converted to lowercase) | table (converted to lowercase) |
Parent topic: Detailing the Functionality
8.2.5.2.3 About DDL Functionality
Parent topic: Detailing the Functionality
8.2.5.2.3.1 About the Keyspaces
The Cassandra Handler does not automatically create keyspaces in Cassandra. Keyspaces in Cassandra define a replication factor, the replication strategy, and topology. The Cassandra Handler does not have enough information to create the keyspaces, so you must manually create them.
You can create keyspaces in Cassandra by using the CREATE KEYSPACE
command from the Cassandra shell.
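The following is a minimal sketch of such a command; the keyspace name and replication settings are illustrative only and should be chosen to match your cluster topology and durability requirements:
CREATE KEYSPACE IF NOT EXISTS qasource
  WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 3};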
Parent topic: About DDL Functionality
8.2.5.2.3.2 About the Tables
The Cassandra Handler can automatically create tables in Cassandra if you configure it to do so. The source table definition may be a poor source of information to create tables in Cassandra. Primary keys in Cassandra are divided into:
-
Partitioning keys that define how data for a table is separated into partitions in Cassandra.
-
Clustering keys that define the order of items within a partition.
In the default mapping for automated table creation, the first primary key is the partition key, and any additional primary keys are mapped as clustering keys.
Automated table creation by the Cassandra Handler may be fine for proof of concept, but it may result in data definitions that do not scale well. When the Cassandra Handler creates tables with poorly constructed primary keys, the performance of ingest and retrieval may decrease as the volume of data stored in Cassandra increases. Oracle recommends that you analyze the metadata of your replicated tables, then manually create corresponding tables in Cassandra that are properly partitioned and clustered for higher scalability.
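For example, a manually created table with an explicit partition key and clustering columns might look like the following sketch; the keyspace, table, and column names are illustrative only and must be aligned with your own source metadata:
CREATE TABLE IF NOT EXISTS qasource.tcustord (
  cust_code    text,
  order_date   timestamp,
  order_id     bigint,
  product_code text,
  PRIMARY KEY ((cust_code), order_date, order_id)
);
Here cust_code is the partition key and order_date and order_id are clustering keys that order rows within each partition.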
Primary key definitions for tables in Cassandra are immutable after they are created. Changing a Cassandra table primary key definition requires the following manual steps:
-
Create a staging table.
-
Populate the data in the staging table from original table.
-
Drop the original table.
-
Re-create the original table with the modified primary key definitions.
-
Populate the data in the original table from the staging table.
-
Drop the staging table.
Parent topic: About DDL Functionality
8.2.5.2.3.3 Adding Column Functionality
You can configure the Cassandra Handler to add columns that exist in the source trail file table definition but are missing in the Cassandra table definition. The Cassandra Handler can accommodate metadata change events of this kind. A reconciliation process reconciles the source table definition to the Cassandra table definition. When the Cassandra Handler is configured to add columns, any columns found in the source table definition that do not exist in the Cassandra table definition are added. The reconciliation process for a table occurs after application startup the first time an operation for the table is encountered. The reconciliation process reoccurs after a metadata change event on a source table, when the first operation for the source table is encountered after the change event.
Parent topic: About DDL Functionality
8.2.5.2.3.4 Dropping Column Functionality
You can configure the Cassandra Handler to drop columns that do not exist in the source trail file definition but exist in the Cassandra table definition. The Cassandra Handler can accommodate metadata change events of this kind. A reconciliation process reconciles the source table definition to the Cassandra table definition. When the Cassandra Handler is configured to drop columns, any columns found in the Cassandra table definition that are not in the source table definition are dropped.
Caution:
Dropping a column permanently removes data from a Cassandra table. Carefully consider your use case before you configure this mode.
Note:
Primary key columns cannot be dropped. Attempting to do so results in an abend.
Note:
Column name changes are not handled well because no DDL is processed. When a column name changes in the source database, the Cassandra Handler interprets it as dropping an existing column and adding a new column.
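As an illustration of the adding and dropping column functionality described above, the following Java Adapter property (for a handler named cassandra, as in the sample configuration) is a sketch that enables table creation and column addition while leaving the destructive DROP option disabled:
# Create missing tables and add missing columns, but never drop columns
gg.handler.cassandra.ddlHandling=CREATE,ADD
# Add DROP only after considering that dropped columns permanently lose their data
# gg.handler.cassandra.ddlHandling=CREATE,ADD,DROP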
Parent topic: About DDL Functionality
8.2.5.2.4 How Operations are Processed
The Cassandra Handler pushes operations to Cassandra using either the asynchronous or synchronous API. In asynchronous mode, operations are flushed at transaction commit (grouped transaction commit using GROUPTRANSOPS
) to ensure write durability. The Cassandra Handler does not interface with Cassandra in a transactional way.
- Supported Database Operations
INSERT UPDATE (captured as INSERT) DELETE
The Cassandra commit log files do not record any before images for the UPDATE or DELETE operations, so the captured operations never have a before image section. Oracle GoldenGate features that rely on before image records, such as Conflict Detection and Resolution, are not available.
- Unsupported Database Operations
TRUNCATE DDL (CREATE, ALTER, DROP)
Insert, update, and delete operations are processed differently in Cassandra than a traditional RDBMS. The following explains how insert, update, and delete operations are interpreted by Cassandra:
-
Inserts: If the row does not exist in Cassandra, then an insert operation is processed as an insert. If the row already exists in Cassandra, then an insert operation is processed as an update.
-
Updates: If a row does not exist in Cassandra, then an update operation is processed as an insert. If the row already exists in Cassandra, then an update operation is processed as an update.
-
Deletes: If the row does not exist in Cassandra, then a delete operation has no effect. If the row exists in Cassandra, then a delete operation is processed as a delete.
The state of the data in Cassandra is idempotent. You can replay the source trail files or replay sections of the trail files. The state of the Cassandra database must be the same regardless of the number of times that the trail data is written into Cassandra.
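This idempotence follows from Cassandra write semantics: both INSERT and UPDATE statements are upserts keyed on the primary key. As a simple sketch using the hypothetical qasource.tcustord table shown earlier, replaying the following statement any number of times leaves the row in the same final state:
INSERT INTO qasource.tcustord (cust_code, order_date, order_id, product_code)
VALUES ('WILL', '2024-01-15', 144, 'CAR');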
Parent topic: Detailing the Functionality
8.2.5.2.5 About Compressed Updates vs. Full Image Updates
Oracle GoldenGate allows you to control the data that is propagated to the source trail file in the event of an update. The data for an update in the source trail file is either a compressed or a full image of the update, and the column information is provided as follows:
- Compressed
-
For the primary keys and the columns for which the value changed. Data for columns that have not changed is not provided in the trail file.
- Full Image
-
For all columns, including primary keys, columns for which the value has changed, and columns for which the value has not changed.
The amount of information about an update is important to the Cassandra Handler. If the source trail file contains full images of the change data, then the Cassandra Handler can use prepared statements to perform row updates in Cassandra. Full images also allow the Cassandra Handler to perform primary key updates for a row in Cassandra. In Cassandra, primary keys are immutable, so an update that changes a primary key must be treated as a delete and an insert. Conversely, when compressed updates are used, prepared statements cannot be used for Cassandra row updates. Simple statements identifying the changing values and primary keys must be dynamically created and then executed. With compressed updates, primary key updates are not possible and as a result, the Cassandra Handler will abend.
You must set the control property gg.handler.name.compressedUpdates so that the handler expects either compressed or full image updates.
The default value, true, sets the Cassandra Handler to expect compressed updates. Prepared statements are not used for updates, and primary key updates cause the handler to abend.
When the value is false, prepared statements are used for updates and primary key updates can be processed. However, if the source trail file does not actually contain full image data, missing column values are treated as null, and the null value is pushed to Cassandra. If you are not sure whether the source trail files contain compressed or full image data, set gg.handler.name.compressedUpdates to true.
CLOB and BLOB data types do not propagate LOB data in updates unless the LOB column value changed. Therefore, if the source tables contain LOB data, set gg.handler.name.compressedUpdates
to true
.
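For example, if the trail is known to contain full image updates, the handler (named cassandra here, as in the sample configuration) could be configured as follows; this is a sketch only, and the property should be left at its default of true whenever there is any doubt about the trail contents:
# Full image updates: prepared statements are used and primary key updates are possible
gg.handler.cassandra.compressedUpdates=false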
Parent topic: Detailing the Functionality
8.2.5.2.6 About Primary Key Updates
Primary key values for a row in Cassandra are immutable. An update operation that changes any primary key value for a Cassandra row must be treated as a delete and insert. The Cassandra Handler can process update operations that result in the change of a primary key in Cassandra only as a delete and insert. To successfully process this operation, the source trail file must contain the complete before and after change data images for all columns. The gg.handler.name.compressedUpdates configuration property of the Cassandra Handler must be set to false for primary key updates to be successfully processed.
Parent topic: Detailing the Functionality
8.2.5.3 Setting Up and Running the Cassandra Handler
Instructions for configuring the Cassandra Handler components and running the handler are described in the following sections.
Before you run the Cassandra Handler, you must install the Datastax Driver for Cassandra and set the gg.classpath
configuration property.
Get the Driver Libraries
The Cassandra Handler has been updated to use the newer 4.x versions of the Datastax Java Driver or 2.x versions of the Datastax Enterprise Java Driver. The Datastax Java Driver for Cassandra does not ship with Oracle GoldenGate for Big Data. For more information, see
Datastax Java Driver for Apache Cassandra.
You can use the Dependency Downloader scripts to download the Datastax Java Driver and its associated dependencies.
Set the Classpath
You must configure the gg.classpath
configuration property in the
Java Adapter properties file to specify the JARs for the Datastax Java Driver for
Cassandra. Ensure that this JAR is first in the list.
gg.classpath=/path/to/4.x/cassandra-java-driver/*
- Understanding the Cassandra Handler Configuration
- Review a Sample Configuration
- Configuring Security
Parent topic: Apache Cassandra
8.2.5.3.1 Understanding the Cassandra Handler Configuration
The following are the configurable values for the Cassandra Handler. These properties are located in the Java Adapter properties file (not in the Replicat properties file).
To enable the selection of the Cassandra Handler, you must first configure the
handler type by specifying gg.handler.name.type=cassandra
and the other Cassandra properties as follows:
Table 8-8 Cassandra Handler Configuration Properties
Properties | Required/ Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.handlerlist | Required | Any string | None | Provides a name for the Cassandra Handler. The Cassandra Handler name then becomes part of the property names listed in this table. |
gg.handler.name.type | Required | cassandra | None | Selects the Cassandra Handler for streaming change data capture into Cassandra. |
gg.handler.name.mode | Optional | op or tx | op | The default is recommended. In op mode, operations are processed as they are received. In tx mode, operations are cached and processed at transaction commit. |
gg.handler.name.contactPoints | Optional | A comma separated list of host names that the Cassandra Handler will connect to. | localhost | A comma-separated list of the Cassandra host machines for the driver to establish an initial connection to the Cassandra cluster. This configuration property does not need to include all the machines enlisted in the Cassandra cluster. By connecting to a single machine, the driver can learn about other machines in the Cassandra cluster and establish connections to those machines as required. |
gg.handler.name.username | Optional | A legal username string. | None | A user name for the connection to Cassandra. Required if Cassandra is configured to require credentials. |
gg.handler.name.password | Optional | A legal password string. | None | A password for the connection to Cassandra. Required if Cassandra is configured to require credentials. |
gg.handler.name.compressedUpdates | Optional | true or false | true | Sets whether the Cassandra Handler expects compressed or full image updates from the source trail file. A value of true means compressed updates are expected; prepared statements are not used for updates and primary key updates cause an abend. A value of false means full image updates are expected; prepared statements are used for updates and primary key updates can be processed. |
gg.handler.name.ddlHandling | Optional | CREATE, ADD, and DROP in any combination, comma separated | None | Configures the DDL functionality that the Cassandra Handler provides. When CREATE is enabled, the handler creates the target table in Cassandra if it does not exist. When ADD is enabled, columns that exist in the source table definition but not in the Cassandra table definition are added. When DROP is enabled, columns that exist in the Cassandra table definition but not in the source table definition are dropped. |
gg.handler.name.cassandraMode | Optional | async or sync | async | Sets the interaction between the Cassandra Handler and Cassandra. Set to async for asynchronous interaction; operations are sent to Cassandra asynchronously and are flushed at transaction commit to ensure durability. Set to sync for synchronous interaction; operations are sent to Cassandra synchronously. |
gg.handler.name.consistencyLevel | Optional | A valid Cassandra consistency level (for example, ONE, QUORUM, or ALL). | The Cassandra default. | Sets the consistency level for operations with Cassandra. It configures the criteria that must be met for storage on the Cassandra cluster when an operation is executed. Lower levels of consistency may provide better performance, while higher levels of consistency are safer. |
gg.handler.name.port | Optional | Integer | 9042 | Set to configure the port number that the Cassandra Handler attempts to connect to Cassandra server instances. You can override the default in the Cassandra YAML files. |
gg.handler.name.batchType | Optional | String | unlogged | Sets the type for Cassandra batch processing. |
gg.handler.name.abendOnUnmappedColumns | Optional | Boolean | true | Only applicable when gg.handler.name.ddlHandling is not configured with ADD. When set to true, the Replicat process abends if a column exists in the source table but does not exist in the target Cassandra table. When set to false, the Replicat process does not abend if a column exists in the source table but does not exist in the target Cassandra table; instead, that column is not replicated. |
gg.handler.name.DatastaxJSSEConfigPath | Optional | String | None | Set the path and file name of a properties file containing the Cassandra driver configuration. Use when the Cassandra driver configuration needs to be configured for non-default values and potentially SSL connectivity. For more information, see the Cassandra Driver Configuration documentation. You need to follow the syntax of the configuration file for the driver version you are using. The suffix of the Cassandra driver configuration file must be .conf. |
gg.handler.name.dataCenter | Optional | The datacenter name | datacenter1 | Sets the datacenter name. If the datacenter name does not match the configured name on the server, then the handler does not connect to the database. |
Parent topic: Setting Up and Running the Cassandra Handler
8.2.5.3.2 Review a Sample Configuration
The following is a sample configuration for the Cassandra Handler from the Java Adapter properties file:
gg.handlerlist=cassandra #The handler properties gg.handler.cassandra.type=cassandra gg.handler.cassandra.mode=op gg.handler.cassandra.contactPoints=localhost gg.handler.cassandra.ddlHandling=CREATE,ADD,DROP gg.handler.cassandra.compressedUpdates=true gg.handler.cassandra.cassandraMode=async gg.handler.cassandra.consistencyLevel=ONE
Parent topic: Setting Up and Running the Cassandra Handler
8.2.5.3.3 Configuring Security
The Cassandra Handler connection to the Cassandra Cluster can be secured using user name and password credentials. These are set using the following configuration properties:
gg.handler.name.username
gg.handler.name.password
To configure SSL, the recommendation is to configure the SSL properties
via the Datastax Java Driver configuration file and point to the configuration file
via the gg.handler.name.DatastaxJSSEConfigPath
property. See https://docs.datastax.com/en/developer/java-driver/4.14/manual/core/ssl/
for the SSL settings instructions.
datastax-java-driver { advanced.ssl-engine-factory { class = DefaultSslEngineFactory # This property is optional. If it is not present, the driver won't explicitly enable cipher # suites on the engine, which according to the JDK documentations results in "a minimum quality # of service". // cipher-suites = [ "TLS_RSA_WITH_AES_128_CBC_SHA", "TLS_RSA_WITH_AES_256_CBC_SHA" ] # Whether or not to require validation that the hostname of the server certificate's common # name matches the hostname of the server being connected to. If not set, defaults to true. // hostname-validation = true # The locations and passwords used to access truststore and keystore contents. # These properties are optional. If either truststore-path or keystore-path are specified, # the driver builds an SSLContext from these files. If neither option is specified, the # default SSLContext is used, which is based on system property configuration. // truststore-path = /path/to/client.truststore // truststore-password = password123 // keystore-path = /path/to/client.keystore // keystore-password = password123 } }
Parent topic: Setting Up and Running the Cassandra Handler
8.2.5.4 About Automated DDL Handling
The Cassandra Handler performs the table check and reconciliation process the first time an operation for a source table is encountered. Additionally, a DDL event or a metadata change event causes the table definition in the Cassandra Handler to be marked as not suitable.
Therefore, the next time an operation for the table is encountered, the handler repeats the table check and reconciliation process as described in this topic.
Parent topic: Apache Cassandra
8.2.5.4.1 About the Table Check and Reconciliation Process
The Cassandra Handler first interrogates the target Cassandra database to determine whether the target Cassandra keyspace exists. If the target Cassandra keyspace does not exist, then the Cassandra Handler abends. Keyspaces must be created by the user. The log file contains an error message with the exact keyspace name that the Cassandra Handler expects.
Next, the Cassandra Handler interrogates the target Cassandra database for the table definition. If the table does not exist, the Cassandra Handler either creates a table if gg.handler.name.ddlHandling
includes the CREATE
option or abends the process. A message is logged that shows you the table that does not exist in Cassandra.
If the table exists in Cassandra, then the Cassandra Handler reconciles the table definition from the source trail file and the table definition in Cassandra. This reconciliation process searches for columns that exist in the source table definition and not in the corresponding Cassandra table definition. If it locates columns fitting this criteria and the gg.handler.name.ddlHandling
property includes ADD
, then the Cassandra Handler adds the columns to the target table in Cassandra. Otherwise, it ignores these columns.
Next, the Cassandra Handler searches for columns that exist in the target Cassandra table but do not exist in the source table definition. If it locates columns that fit this criteria and the gg.handler.name.ddlHandling
property includes DROP
, then the Cassandra Handler removes these columns from the target table in Cassandra. Otherwise those columns are ignored.
Finally, the prepared statements are built.
Parent topic: About Automated DDL Handling
8.2.5.4.2 Capturing New Change Data
You can capture all of the new change data into your Cassandra database, including the DDL changes in the trail, for the target apply. The following are the acceptance criteria:
- AC1: Support Cassandra as a bulk extract
- AC2: Support Cassandra as a CDC source
- AC4: All Cassandra supported data types are supported
- AC5: Should be able to write into different tables based on any filter conditions, like Updates to Update tables or based on primary keys
- AC7: Support parallel processing with multiple threads
- AC8: Support filtering based on keywords
- AC9: Support for Metadata provider
- AC10: Support for DDL handling on sources and target
- AC11: Support for target creation and updating of metadata
- AC12: Support for error handling and extensive logging
- AC13: Support for Conflict Detection and Resolution
- AC14: Performance should be on par or better than HBase
Parent topic: About Automated DDL Handling
8.2.5.5 Performance Considerations
Configuring the Cassandra Handler for async
mode provides better performance than sync
mode. The Replicat property GROUPTRANSOPS must be set to the default value of 1000.
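The following is a minimal sketch of a Replicat parameter file showing where GROUPTRANSOPS is set; the Replicat name, properties file, and MAP statement are illustrative assumptions only:
REPLICAT rcass
TARGETDB LIBFILE libggjava.so SET property=dirprm/cassandra.props
GROUPTRANSOPS 1000
MAP QASOURCE.*, TARGET QASOURCE.*;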
Setting the consistency level directly affects performance. The higher the consistency level, the more work must occur on the Cassandra cluster before the transmission of a given operation can be considered complete. Select the minimum consistency level that still satisfies the requirements of your use case.
The Cassandra Handler can work in either operation (op
) or transaction (tx
) mode. For the best performance, operation mode is recommended:
gg.handler.name.mode=op
Parent topic: Apache Cassandra
8.2.5.6 Additional Considerations
-
Cassandra database requires at least one primary key. The value of any primary key cannot be null. Automated table creation fails for source tables that do not have a primary key.
-
When
gg.handler.name.compressedUpdates=false
is set, the Cassandra Handler expects to update full before and after images of the data.Note:
Using this property setting with a source trail file with partial image updates results in null values being updated for columns for which the data is missing. This configuration is incorrect and update operations pollute the target data with null values in columns that did not change. -
The Cassandra Handler does not process DDL from the source database, even if the source database provides DDL. Instead, it reconciles between the source table definition and the target Cassandra table definition. A DDL statement executed at the source database that changes a column name appears to the Cassandra Handler as if a column is dropped from the source table and a new column is added. This behavior depends on how the gg.handler.name.ddlHandling property is configured.
gg.handler.name.ddlHandling Configuration | Behavior |
---|---|
Not configured for ADD or DROP | Old column name and data are maintained in Cassandra. The new column is not created in Cassandra, so no data is replicated for the new column name from the DDL change forward. |
Configured for ADD only | Old column name and data are maintained in Cassandra. The new column is created in Cassandra and data is replicated for the new column name from the DDL change forward. The data is located in different columns before and after the DDL change. |
Configured for DROP only | Old column name and data are dropped in Cassandra. The new column is not created in Cassandra, so no data is replicated for the new column name. |
Configured for ADD and DROP | Old column name and data are dropped in Cassandra. The new column is created in Cassandra, and data is replicated for the new column name from the DDL change forward. |
Parent topic: Apache Cassandra
8.2.5.7 Troubleshooting
This section contains information to help you troubleshoot various issues.
8.2.5.7.1 Java Classpath
When the classpath that is intended to include the required client libraries does not include them, a ClassNotFoundException appears in the log file. To troubleshoot, set the Java Adapter logging to DEBUG, and then run the process again. At the debug level, the log contains data about the JARs that were added to the classpath from the gg.classpath configuration variable. The gg.classpath variable supports the asterisk (*) wildcard character to select all JARs in a configured directory. For example, /usr/cassandra/cassandra-java-driver-4.9.0/*:/usr/cassandra/cassandra-java-driver-4.9.0/lib/*.
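As a minimal sketch, assuming the driver was installed under /usr/cassandra/cassandra-java-driver-4.9.0, the relevant Java Adapter properties might look like the following:
gg.log.level=debug
gg.classpath=/usr/cassandra/cassandra-java-driver-4.9.0/*:/usr/cassandra/cassandra-java-driver-4.9.0/lib/*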
For more information about setting the classpath, see Setting Up and Running the Cassandra Handler and Cassandra Handler Client Dependencies.
Parent topic: Troubleshooting
8.2.5.7.2 Write Timeout Exception
When running the Cassandra handler, you may experience a com.datastax.driver.core.exceptions.WriteTimeoutException
exception that causes the Replicat process to abend. It is likely to occur under some or all of the following conditions:
-
The Cassandra Handler processes large numbers of operations, putting the Cassandra cluster under a significant processing load.
-
GROUPTRANSOPS
is configured higher than the default value of 1000. -
The Cassandra Handler is configured in asynchronous mode.
-
The Cassandra Handler is configured with a consistency level higher than
ONE
.
When this problem occurs, the Cassandra Handler is streaming data faster than the Cassandra cluster can process it. The write latency in the Cassandra cluster finally exceeds the write request timeout period, which in turn results in the exception.
The following are potential solutions:
-
Increase the write request timeout period. This is controlled with the
write_request_timeout_in_ms
property in Cassandra and is located in thecassandra.yaml
file in thecassandra_install/conf
directory. The default is 2000 (2 seconds). You can increase this value to move past the error, and then restart the Cassandra node or nodes for the change to take effect. -
Decrease the
GROUPTRANSOPS
configuration value of the Replicat process. Typically, decreasing theGROUPTRANSOPS
configuration decreases the size of transactions processed and reduces the likelihood that the Cassandra Handler can overtax the Cassandra cluster. -
Reduce the consistency level of the Cassandra Handler. This in turn reduces the amount of work the Cassandra cluster has to complete for an operation to be considered as written.
Parent topic: Troubleshooting
8.2.5.7.3 Datastax Driver Error
ClassNotFound
exceptions can occur under either of the
following conditions:
- The
gg.classpath configuration
is set to point at the old 3.x version of the Java Driver. - The
gg.classpath
has not been configured to include the 4.x version of the Java Driver.
Parent topic: Troubleshooting
8.2.5.8 Cassandra Handler Client Dependencies
What are the dependencies for the Cassandra Handler to connect to Apache Cassandra databases?
The following Maven dependencies are required for the Cassandra Handler:
Artifact: java-driver-core
GroupId:
com.datastax.oss
ArtifactId:
java-driver-core
Version: 4.x
Artifact: java-driver-query-builder
GroupId: com.datastax.oss
Artifact ID: java-driver-query-builder
Version: 4.x
Parent topic: Apache Cassandra
8.2.5.8.1 Cassandra Datastax Java Driver 4.12.0
asm-9.1.jar asm-analysis-9.1.jar asm-commons-9.1.jar asm-tree-9.1.jar asm-util-9.1.jar config-1.4.1.jar esri-geometry-api-1.2.1.jar HdrHistogram-2.1.12.jar jackson-annotations-2.12.2.jar jackson-core-2.12.2.jar jackson-core-asl-1.9.12.jar jackson-databind-2.12.2.jar java-driver-core-4.12.0.jar java-driver-query-builder-4.12.0.jar java-driver-shaded-guava-25.1-jre-graal-sub-1.jar jcip-annotations-1.0-1.jar jffi-1.3.1.jar jffi-1.3.1-native.jar jnr-a64asm-1.0.0.jar jnr-constants-0.10.1.jar jnr-ffi-2.2.2.jar jnr-posix-3.1.5.jar jnr-x86asm-1.0.2.jar json-20090211.jar jsr305-3.0.2.jar metrics-core-4.1.18.jar native-protocol-1.5.0.jar netty-buffer-4.1.60.Final.jar netty-codec-4.1.60.Final.jar netty-common-4.1.60.Final.jar netty-handler-4.1.60.Final.jar netty-resolver-4.1.60.Final.jar netty-transport-4.1.60.Final.jar reactive-streams-1.0.3.jar slf4j-api-1.7.26.jar spotbugs-annotations-3.1.12.jar
Parent topic: Cassandra Handler Client Dependencies
8.2.5.8.2 Cassandra Datastax Java Driver 4.9.0
asm-7.1.jar asm-analysis-7.1.jar asm-commons-7.1.jar asm-tree-7.1.jar asm-util-7.1.jar commons-collections-3.2.2.jar commons-configuration-1.10.jar commons-lang-2.6.jar commons-lang3-3.8.1.jar config-1.3.4.jar esri-geometry-api-1.2.1.jar gremlin-core-3.4.8.jar gremlin-shaded-3.4.8.jar HdrHistogram-2.1.11.jar jackson-annotations-2.11.0.jar jackson-core-2.11.0.jar jackson-core-asl-1.9.12.jar jackson-databind-2.11.0.jar java-driver-core-4.9.0.jar java-driver-query-builder-4.9.0.jar java-driver-shaded-guava-25.1-jre-graal-sub-1.jar javapoet-1.8.0.jar javatuples-1.2.jar jcip-annotations-1.0-1.jar jcl-over-slf4j-1.7.25.jar jffi-1.2.19.jar jffi-1.2.19-native.jar jnr-a64asm-1.0.0.jar jnr-constants-0.9.12.jar jnr-ffi-2.1.10.jar jnr-posix-3.0.50.jar jnr-x86asm-1.0.2.jar json-20090211.jar jsr305-3.0.2.jar metrics-core-4.0.5.jar native-protocol-1.4.11.jar netty-buffer-4.1.51.Final.jar netty-codec-4.1.51.Final.jar netty-common-4.1.51.Final.jar netty-handler-4.1.51.Final.jar netty-resolver-4.1.51.Final.jar netty-transport-4.1.51.Final.jar reactive-streams-1.0.2.jar slf4j-api-1.7.26.jar spotbugs-annotations-3.1.12.jar tinkergraph-gremlin-3.4.8.jar
Parent topic: Cassandra Handler Client Dependencies
8.2.6 Apache HBase
The HBase Handler is used to populate HBase tables from existing Oracle GoldenGate supported sources.
This chapter describes how to use the HBase Handler.
- Overview
- Detailed Functionality
- Setting Up and Running the HBase Handler
- Security
- Metadata Change Events
The HBase Handler seamlessly accommodates metadata change events including adding a column or dropping a column. The only requirement is that the source trail file contains the metadata. - Additional Considerations
- Troubleshooting the HBase Handler
Troubleshooting of the HBase Handler begins with the contents of the Java log4j
file. Follow the directions in the Java Logging Configuration to configure the runtime to correctly generate the Java log4j
log file. - HBase Handler Client Dependencies
What are the dependencies for the HBase Handler to connect to Apache HBase databases?
Parent topic: Target
8.2.6.1 Overview
HBase is an open source Big Data application that emulates much of the functionality of a relational database management system (RDBMS). Hadoop is specifically designed to store large amounts of unstructured data. Conversely, data stored in databases and replicated through Oracle GoldenGate is highly structured. HBase provides a way to maintain the important structure of data while taking advantage of the horizontal scaling that is offered by the Hadoop Distributed File System (HDFS).
Parent topic: Apache HBase
8.2.6.2 Detailed Functionality
The HBase Handler takes operations from the source trail file and creates corresponding tables in HBase, and then loads change capture data into those tables.
HBase Table Names
Table names created in HBase map to the corresponding table name of the operation from the source trail file. Table names are case-sensitive.
HBase Table Namespace
For two-part table names (schema name and table name), the schema name maps to the HBase table namespace. For a three-part table name like Catalog.Schema.MyTable
, the created HBase namespace would be Catalog_Schema
. HBase table namespaces are case sensitive. A null schema name is supported and maps to the default HBase namespace.
HBase Row Key
HBase has a similar concept to the database primary keys, called the HBase row key.
The HBase row key is the unique identifier for a table row. HBase only supports a
single row key per row and it cannot be empty or null. The HBase Handler maps the
primary key value into the HBase row key value. If the source table has multiple
primary keys, then the primary key values are concatenated, separated by a pipe
delimiter (|
). You can configure the HBase row key delimiter.
- If
KEYCOLS
is specified, then it constructs the key based on the specifications defined in theKEYCOLS
clause. -
If
KEYCOLS
is not specified, then it constructs a key based on the concatenation of all eligible columns of the table.
The result is that the value of every column is concatenated to generate the HBase rowkey. However, this is not a good practice.
Workaround: Use the replicat mapping statement to identify one or
more primary key columns. For example: MAP QASOURCE.TCUSTORD, TARGET
QASOURCE.TCUSTORD, KEYCOLS (CUST_CODE);
HBase Column Family
HBase has the concept of a column family. A column family is a way to group column data. Only a single column family is supported. Every HBase column must belong to a single column family. The HBase Handler provides a single column family per table that defaults to cf
. You can configure the column family name. However, after a table is created with a specific column family name, attempting to reconfigure the column family name in the HBase Handler without first modifying or dropping the table results in an abend of the Oracle GoldenGate Replicat process.
Parent topic: Apache HBase
8.2.6.3 Setting Up and Running the HBase Handler
HBase must run either collocated with the HBase Handler process or on a machine that is reachable over the network from the host running the HBase Handler process. The underlying HDFS single instance or clustered instance serving as the repository for HBase data must also run.
Instructions for configuring the HBase Handler components and running the handler are described in this topic.
Parent topic: Apache HBase
8.2.6.3.1 Classpath Configuration
For the HBase Handler to connect to HBase and stream data, the hbase-site.xml file and the HBase client jars must be configured in the gg.classpath
variable. The HBase client jars must match the version of HBase to which the HBase Handler is connecting. The HBase client jars are not shipped with the Oracle GoldenGate for Big Data product.
HBase Handler Client Dependencies lists the required HBase client jars by version.
The default location of the hbase-site.xml
file is HBase_Home
/conf
.
The default location of the HBase client JARs is HBase_Home
/lib/*
.
If the HBase Handler is running on Windows, follow the Windows classpathing syntax.
The gg.classpath
must be configured exactly as described. The path to the hbase-site.xml
file must contain only the path with no wild card appended. The inclusion of the * wildcard in the path to the hbase-site.xml
file will cause it to be inaccessible. Conversely, the path to the dependency jars must include the (*) wildcard character in order to include all the jar files in that directory, in the associated classpath. Do not use *.jar
. The following is an example of a correctly configured gg.classpath
variable:
gg.classpath=/var/lib/hbase/lib/*:/var/lib/hbase/conf
Parent topic: Setting Up and Running the HBase Handler
8.2.6.3.2 HBase Handler Configuration
The following are the configurable values for the HBase Handler. These properties are located in the Java Adapter properties file (not in the Replicat properties file).
To enable the selection of the HBase Handler, you must first configure the handler type by specifying gg.handler.name.type=hbase
and the other HBase properties as follows:
Table 8-9 HBase Handler Configuration Properties
Properties | Required/ Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.handlerlist | Required | Any string. | None | Provides a name for the HBase Handler. The HBase Handler name then becomes part of the property names listed in this table. |
gg.handler.name.type | Required | hbase | None | Selects the HBase Handler for streaming change data capture into HBase. |
gg.handler.name.hBaseColumnFamilyName | Optional | Any string legal for an HBase column family name. | cf | Column family is a grouping mechanism for columns in HBase. The HBase Handler only supports a single column family. |
gg.handler.name.HBase20Compatible | Optional | true or false | false (HBase 1.0 compatible) | HBase 2.x removed methods and changed object hierarchies. The result is that it broke the binary compatibility with HBase 1.x. Set this property to true to correctly interface with HBase 2.x, otherwise HBase 1.x compatibility is used. |
gg.handler.name.includeTokens | Optional | true or false | false | Using true includes token data from the source trail file in the output to HBase. Using false excludes token data from the output. |
gg.handler.name.keyValueDelimiter | Optional | Any string. | = | Provides a delimiter between key values in a map. For example, key=value. |
gg.handler.name.keyValuePairDelimiter | Optional | Any string. | , | Provides a delimiter between key value pairs in a map. For example, key=value,key1=value1. |
gg.handler.name.encoding | Optional | Any encoding name or alias supported by Java. For a list of supported options, see Footnote 1. | The native system encoding of the machine hosting the Oracle GoldenGate process | Determines the encoding of values written to HBase. HBase values are written as bytes. |
gg.handler.name.pkUpdateHandling | Optional | abend, update, or delete-insert | abend | Provides configuration for how the HBase Handler should handle update operations that change a primary key. Primary key operations can be problematic for the HBase Handler and require special consideration by you. |
gg.handler.name.nullValueRepresentation | Optional | Any string. | NULL | Allows you to configure what will be sent to HBase in the case of a NULL column value. The default is NULL. |
gg.handler.name.authType | Optional | kerberos | None | Setting this property to kerberos enables Kerberos authentication. |
gg.handler.name.kerberosKeytabFile | Optional (Required if authType=kerberos) | Relative or absolute path to a Kerberos keytab file. | - | The keytab file allows the HBase Handler to access a password to programmatically perform a kinit operation for Kerberos security. |
gg.handler.name.kerberosPrincipalName | Optional (Required if authType=kerberos) | A legal Kerberos principal name (for example, user/FQDN@MY.REALM). | - | The Kerberos principal name for Kerberos authentication. |
gg.handler.name.rowkeyDelimiter | Optional | Any string. | The pipe character (|) | Configures the delimiter between primary key values from the source table when generating the HBase rowkey. |
gg.handler.name.setHBaseOperationTimestamp | Optional | true or false | true | Set to true to have the HBase Handler set the timestamp for row operations. Set to false to allow the HBase server to supply the operation timestamp. For more information, see HBase Handler Delete-Insert Problem. |
 | Optional | | | Set to |
gg.handler.name.metaColumnsTemplate | Optional | A legal string | None | A legal string specifying the metaColumns to be included. For more information, see Metacolumn Keywords. |
Footnote 1
See Java Internalization Support at https://docs.oracle.com/javase/8/docs/technotes/guides/intl/
.
Parent topic: Setting Up and Running the HBase Handler
8.2.6.3.3 Sample Configuration
The following is a sample configuration for the HBase Handler from the Java Adapter properties file:
gg.handlerlist=hbase gg.handler.hbase.type=hbase gg.handler.hbase.mode=tx gg.handler.hbase.hBaseColumnFamilyName=cf gg.handler.hbase.includeTokens=true gg.handler.hbase.keyValueDelimiter=CDATA[=] gg.handler.hbase.keyValuePairDelimiter=CDATA[,] gg.handler.hbase.encoding=UTF-8 gg.handler.hbase.pkUpdateHandling=abend gg.handler.hbase.nullValueRepresentation=CDATA[NULL] gg.handler.hbase.authType=none
Parent topic: Setting Up and Running the HBase Handler
8.2.6.3.4 Performance Considerations
At each transaction commit, the HBase Handler performs a flush call to flush any buffered data to the HBase region server. This must be done to maintain write durability. Flushing to the HBase region server is an expensive call and performance can be greatly improved by using the Replicat GROUPTRANSOPS
parameter to group multiple smaller transactions in the source trail file into a larger single transaction applied to HBase. You can use Replicat batching by adding the GROUPTRANSOPS parameter to the Replicat configuration file.
Operations from multiple transactions are grouped together into a larger transaction, and it is only at the end of the grouped transaction that the transaction is committed.
Parent topic: Setting Up and Running the HBase Handler
8.2.6.4 Security
You can secure HBase connectivity using Kerberos authentication. Follow the associated documentation for the HBase release to secure the HBase cluster. The HBase Handler can connect to Kerberos secured clusters. The HBase hbase-site.xml
must be in handlers classpath with the hbase.security.authentication
property set to kerberos
and hbase.security.authorization
property set to true
.
You have to include the directory containing the HDFS core-site.xml
file in the classpath. Kerberos authentication is performed using the Hadoop UserGroupInformation
class. This class relies on the Hadoop configuration property hadoop.security.authentication
being set to kerberos
to successfully perform the kinit
command.
Additionally, you must set the following properties in the HBase Handler Java configuration file:
gg.handler.{name}.authType=kerberos gg.handler.{name}.kerberosPrincipalName={legal Kerberos principal name} gg.handler.{name}.kerberosKeytabFile={path to a keytab file that contains the password for the Kerberos principal so that the Oracle GoldenGate HBase handler can programmatically perform the Kerberos kinit operations to obtain a Kerberos ticket}.
You may encounter the inability to decrypt the Kerberos password from the keytab
file. This causes the Kerberos authentication to fall back to interactive mode which cannot work because it is being invoked programmatically. The cause of this problem is that the Java Cryptography Extension (JCE) is not installed in the Java Runtime Environment (JRE). Ensure that the JCE is loaded in the JRE, see http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html.
Parent topic: Apache HBase
8.2.6.5 Metadata Change Events
The HBase Handler seamlessly accommodates metadata change events including adding a column or dropping a column. The only requirement is that the source trail file contains the metadata.
Parent topic: Apache HBase
8.2.6.6 Additional Considerations
Classpath issues are common during the initial setup of the HBase Handler. The typical indicators are occurrences of the ClassNotFoundException
in the Java log4j
log file. The HBase client jars do not ship with Oracle GoldenGate for Big Data. You must resolve the required HBase client jars. HBase Handler Client Dependencies includes a list of HBase client jars for each supported version. Either the hbase-site.xml
or one or more of the required client JARS are not included in the classpath. For instructions on configuring the classpath of the HBase Handler, see Classpath Configuration.
Parent topic: Apache HBase
8.2.6.7 Troubleshooting the HBase Handler
Troubleshooting of the HBase Handler begins with the contents for the Java
log4j
file. Follow the directions in the Java Logging Configuration to
configure the runtime to correctly generate the Java log4j
log file.
- Java Classpath
- HBase Connection Properties
- Logging of Handler Configuration
- HBase Handler Delete-Insert Problem
Parent topic: Apache HBase
8.2.6.7.1 Java Classpath
Issues with the Java classpath are common. A ClassNotFoundException
in the Java log4j
log file indicates a classpath problem. You can use the Java log4j
log file to troubleshoot this issue. Setting the log level to DEBUG
logs each of the jars referenced in the gg.classpath
variable to the log file. You can make sure that all of the required dependency jars are resolved by enabling DEBUG
level logging, and then searching the log file for messages like the following:
2015-09-29 13:04:26 DEBUG ConfigClassPath:74 - ...adding to classpath: url="file:/ggwork/hbase/hbase-1.0.1.1/lib/hbase-server-1.0.1.1.jar"
Parent topic: Troubleshooting the HBase Handler
8.2.6.7.2 HBase Connection Properties
The contents of the HDFS hbase-site.xml
file (including default settings) are output to the Java log4j
log file when the logging level is set to DEBUG
or TRACE
. This file shows the connection properties to HBase. Search for the following in the Java log4j
log file.
2015-09-29 13:04:27 DEBUG HBaseWriter:449 - Begin - HBase configuration object contents for connection troubleshooting. Key: [hbase.auth.token.max.lifetime] Value: [604800000].
A common problem is that the hbase-site.xml
file is not included in the classpath or the path to the hbase-site.xml
file is incorrect. In this case, the HBase Handler cannot establish a connection to HBase, and the Oracle GoldenGate process abends. The following error is reported in the Java log4j
log.
2015-09-29 12:49:29 ERROR HBaseHandler:207 - Failed to initialize the HBase handler. org.apache.hadoop.hbase.ZooKeeperConnectionException: Can't connect to ZooKeeper
Verify that the classpath correctly includes the hbase-site.xml
file and that HBase is running.
Parent topic: Troubleshooting the HBase Handler
8.2.6.7.3 Logging of Handler Configuration
The Java log4j
log file contains information on the configuration state of the HBase Handler. This information is output at the INFO
log level. The following is a sample output:
2015-09-29 12:45:53 INFO HBaseHandler:194 - **** Begin HBase Handler - Configuration Summary **** Mode of operation is set to tx. HBase data will be encoded using the native system encoding. In the event of a primary key update, the HBase Handler will ABEND. HBase column data will use the column family name [cf]. The HBase Handler will not include tokens in the HBase data. The HBase Handler has been configured to use [=] as the delimiter between keys and values. The HBase Handler has been configured to use [,] as the delimiter between key values pairs. The HBase Handler has been configured to output [NULL] for null values. Hbase Handler Authentication type has been configured to use [none]
Parent topic: Troubleshooting the HBase Handler
8.2.6.7.4 HBase Handler Delete-Insert Problem
If you are using the HBase Handler with the
gg.handler.name.setHBaseOperationTimestamp=false
configuration
property, then the source database may get out of sync with data in the HBase tables.
This is caused by the deletion of a row followed by the immediate reinsertion of the
row. HBase creates a tombstone marker for the delete that is identified by a specific
timestamp. This tombstone marker marks any row records in HBase with the same row key as
deleted that have a timestamp before or the same as the tombstone marker. This can occur
when the deleted row is immediately reinserted. The insert operation can inadvertently
have the same timestamp as the delete operation so the delete operation causes the
subsequent insert operation to incorrectly appear as deleted.
To work around this issue, you need to set the
gg.handler.name.setHbaseOperationTimestamp=true
, which does
two things:
-
Sets the timestamp for row operations in the HBase Handler.
-
Detects a delete-insert operation and ensures that the insert operation has a timestamp that is after the delete.
The default for gg.handler.name.setHbaseOperationTimestamp is true, which means that the HBase Handler supplies the timestamp for a row. This prevents the HBase delete-reinsert out-of-sync problem.
Setting the row operation timestamp in the HBase Handler can have these consequences:
-
Since the timestamp is set on the client side, this could create problems if multiple applications are feeding data to the same HBase table.
-
If delete and reinsert is a common pattern in your use case, then the HBase Handler has to increment the timestamp 1 millisecond each time this scenario is encountered.
Processing cannot be allowed to get too far ahead of real time, so the HBase Handler only allows the timestamp to increment 100 milliseconds into the future before it pauses the process so that the client-side HBase operation timestamp and real time are back in sync. If delete-insert is used instead of update in the source database, this sync scenario can be quite common, and processing speeds may be affected by not allowing the HBase timestamp to go more than 100 milliseconds into the future.
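A minimal sketch of the workaround in the Java Adapter properties file, assuming a handler named hbase as in the sample configuration, is:
gg.handler.hbase.setHBaseOperationTimestamp=true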
Parent topic: Troubleshooting the HBase Handler
8.2.6.8 HBase Handler Client Dependencies
What are the dependencies for the HBase Handler to connect to Apache HBase databases?
The maven central repository artifacts for HBase databases are:
-
Maven groupId:
org.apache.hbase
-
Maven artifactId:
hbase-client
-
Maven version: the HBase version numbers listed for each section
The hbase-client-x.x.x.jar
file is not distributed with Apache HBase, nor is it mandatory to be in the classpath. The hbase-client-x.x.x.jar
file is an empty Maven project whose purpose is to aggregate all of the HBase client dependencies.
- HBase 2.4.4
- HBase 2.3.3
- HBase 2.2.0
- HBase 2.1.5
- HBase 2.0.5
- HBase 1.4.10
- HBase 1.3.3
- HBase 1.2.5
- HBase 1.1.1
- HBase 1.0.1.1
Parent topic: Apache HBase
8.2.6.8.1 HBase 2.4.4
apacheds-i18n-2.0.0-M15.jar apacheds-kerberos-codec-2.0.0-M15.jar api-asn1-api-1.0.0-M20.jar api-util-1.0.0-M20.jar audience-annotations-0.5.0.jar avro-1.7.7.jar commons-beanutils-1.9.4.jar commons-cli-1.2.jar commons-codec-1.13.jar commons-collections-3.2.2.jar commons-compress-1.19.jar commons-configuration-1.6.jar commons-crypto-1.0.0.jar commons-digester-1.8.jar commons-io-2.6.jar commons-lang-2.6.jar commons-lang3-3.9.jar commons-logging-1.1.3.jar commons-math3-3.1.1.jar commons-net-3.1.jar curator-client-2.7.1.jar curator-framework-2.7.1.jar curator-recipes-2.7.1.jar error_prone_annotations-2.3.4.jar gson-2.2.4.jar guava-11.0.2.jar hadoop-annotations-2.10.0.jar hadoop-auth-2.10.0.jar hadoop-common-2.10.0.jar hbase-client-2.4.4.jar hbase-common-2.4.4.jar hbase-hadoop2-compat-2.4.4.jar hbase-hadoop-compat-2.4.4.jar hbase-logging-2.4.4.jar hbase-metrics-2.4.4.jar hbase-metrics-api-2.4.4.jar hbase-protocol-2.4.4.jar hbase-protocol-shaded-2.4.4.jar hbase-shaded-gson-3.4.1.jar hbase-shaded-miscellaneous-3.4.1.jar hbase-shaded-netty-3.4.1.jar hbase-shaded-protobuf-3.4.1.jar htrace-core4-4.2.0-incubating.jar httpclient-4.5.2.jar httpcore-4.4.4.jar jackson-core-asl-1.9.13.jar jackson-mapper-asl-1.9.13.jar javax.activation-api-1.2.0.jar jcip-annotations-1.0-1.jar jcodings-1.0.55.jar jdk.tools-1.8.jar jetty-sslengine-6.1.26.jar joni-2.1.31.jar jsch-0.1.54.jar jsr305-3.0.0.jar log4j-1.2.17.jar metrics-core-3.2.6.jar netty-buffer-4.1.45.Final.jar netty-codec-4.1.45.Final.jar netty-common-4.1.45.Final.jar netty-handler-4.1.45.Final.jar netty-resolver-4.1.45.Final.jar netty-transport-4.1.45.Final.jar netty-transport-native-epoll-4.1.45.Final.jar netty-transport-native-unix-common-4.1.45.Final.jar nimbus-jose-jwt-4.41.1.jar paranamer-2.3.jar protobuf-java-2.5.0.jar slf4j-api-1.7.30.jar slf4j-log4j12-1.7.25.jar snappy-java-1.0.5.jar stax2-api-3.1.4.jar woodstox-core-5.0.3.jar xmlenc-0.52.jar zookeeper-3.5.7.jar zookeeper-jute-3.5.7.jar
Parent topic: HBase Handler Client Dependencies
8.2.6.8.2 HBase 2.3.3
apacheds-i18n-2.0.0-M15.jar apacheds-kerberos-codec-2.0.0-M15.jar api-asn1-api-1.0.0-M20.jar api-util-1.0.0-M20.jar audience-annotations-0.5.0.jar avro-1.7.7.jar commons-beanutils-1.9.4.jar commons-cli-1.2.jar commons-codec-1.13.jar commons-collections-3.2.2.jar commons-compress-1.19.jar commons-configuration-1.6.jar commons-crypto-1.0.0.jar commons-digester-1.8.jar commons-io-2.6.jar commons-lang-2.6.jar commons-lang3-3.9.jar commons-logging-1.1.3.jar commons-math3-3.1.1.jar commons-net-3.1.jar curator-client-2.7.1.jar curator-framework-2.7.1.jar curator-recipes-2.7.1.jar error_prone_annotations-2.3.4.jar gson-2.2.4.jar guava-11.0.2.jar hadoop-annotations-2.10.0.jar hadoop-auth-2.10.0.jar hadoop-common-2.10.0.jar hbase-client-2.3.3.jar hbase-common-2.3.3.jar hbase-hadoop2-compat-2.3.3.jar hbase-hadoop-compat-2.3.3.jar hbase-logging-2.3.3.jar hbase-metrics-2.3.3.jar hbase-metrics-api-2.3.3.jar hbase-protocol-2.3.3.jar hbase-protocol-shaded-2.3.3.jar hbase-shaded-gson-3.3.0.jar hbase-shaded-miscellaneous-3.3.0.jar hbase-shaded-netty-3.3.0.jar hbase-shaded-protobuf-3.3.0.jar htrace-core4-4.2.0-incubating.jar httpclient-4.5.2.jar httpcore-4.4.4.jar jackson-core-asl-1.9.13.jar jackson-mapper-asl-1.9.13.jar javax.activation-api-1.2.0.jar jcip-annotations-1.0-1.jar jcodings-1.0.18.jar jdk.tools-1.8.jar
Parent topic: HBase Handler Client Dependencies
8.2.6.8.3 HBase 2.2.0
apacheds-i18n-2.0.0-M15.jar apacheds-kerberos-codec-2.0.0-M15.jar api-asn1-api-1.0.0-M20.jar api-util-1.0.0-M20.jar audience-annotations-0.5.0.jar avro-1.7.4.jar commons-beanutils-1.7.0.jar commons-beanutils-core-1.8.0.jar commons-cli-1.2.jar commons-codec-1.10.jar commons-collections-3.2.2.jar commons-compress-1.4.1.jar commons-configuration-1.6.jar commons-crypto-1.0.0.jar commons-digester-1.8.jar commons-io-2.5.jar commons-lang-2.6.jar commons-lang3-3.6.jar commons-logging-1.1.3.jar commons-math3-3.1.1.jar commons-net-3.1.jar curator-client-2.7.1.jar curator-framework-2.7.1.jar curator-recipes-2.7.1.jar error_prone_annotations-2.3.3.jar findbugs-annotations-1.3.9-1.jar gson-2.2.4.jar guava-11.0.2.jar hadoop-annotations-2.8.5.jar hadoop-auth-2.8.5.jar hadoop-common-2.8.5.jar hamcrest-core-1.3.jar hbase-client-2.2.0.jar hbase-common-2.2.0.jar hbase-hadoop2-compat-2.2.0.jar hbase-hadoop-compat-2.2.0.jar hbase-metrics-2.2.0.jar hbase-metrics-api-2.2.0.jar hbase-protocol-2.2.0.jar hbase-protocol-shaded-2.2.0.jar hbase-shaded-miscellaneous-2.2.1.jar hbase-shaded-netty-2.2.1.jar hbase-shaded-protobuf-2.2.1.jar htrace-core4-4.2.0-incubating.jar httpclient-4.5.2.jar httpcore-4.4.4.jar jackson-core-asl-1.9.13.jar jackson-mapper-asl-1.9.13.jar jcip-annotations-1.0-1.jar jcodings-1.0.18.jar jdk.tools-1.8.jar jetty-sslengine-6.1.26.jar joni-2.1.11.jar jsch-0.1.54.jar jsr305-3.0.0.jar junit-4.12.jar log4j-1.2.17.jar metrics-core-3.2.6.jar nimbus-jose-jwt-4.41.1.jar paranamer-2.3.jar protobuf-java-2.5.0.jar slf4j-api-1.7.25.jar slf4j-log4j12-1.6.1.jar snappy-java-1.0.4.1.jar xmlenc-0.52.jar xz-1.0.jar zookeeper-3.4.10.jar
Parent topic: HBase Handler Client Dependencies
8.2.6.8.4 HBase 2.1.5
apacheds-i18n-2.0.0-M15.jar apacheds-kerberos-codec-2.0.0-M15.jar api-asn1-api-1.0.0-M20.jar api-util-1.0.0-M20.jar audience-annotations-0.5.0.jar avro-1.7.4.jar commons-beanutils-1.7.0.jar commons-beanutils-core-1.8.0.jar commons-cli-1.2.jar commons-codec-1.10.jar commons-collections-3.2.2.jar commons-compress-1.4.1.jar commons-configuration-1.6.jar commons-crypto-1.0.0.jar commons-digester-1.8.jar commons-httpclient-3.1.jar commons-io-2.5.jar commons-lang-2.6.jar commons-lang3-3.6.jar commons-logging-1.1.3.jar commons-math3-3.1.1.jar commons-net-3.1.jar curator-client-2.7.1.jar curator-framework-2.7.1.jar curator-recipes-2.7.1.jar findbugs-annotations-1.3.9-1.jar gson-2.2.4.jar guava-11.0.2.jar hadoop-annotations-2.7.7.jar hadoop-auth-2.7.7.jar hadoop-common-2.7.7.jar hamcrest-core-1.3.jar hbase-client-2.1.5.jar hbase-common-2.1.5.jar hbase-hadoop2-compat-2.1.5.jar hbase-hadoop-compat-2.1.5.jar hbase-metrics-2.1.5.jar hbase-metrics-api-2.1.5.jar hbase-protocol-2.1.5.jar hbase-protocol-shaded-2.1.5.jar hbase-shaded-miscellaneous-2.1.0.jar hbase-shaded-netty-2.1.0.jar hbase-shaded-protobuf-2.1.0.jar htrace-core-3.1.0-incubating.jar htrace-core4-4.2.0-incubating.jar httpclient-4.2.5.jar httpcore-4.2.4.jar jackson-annotations-2.9.0.jar jackson-core-2.9.2.jar jackson-core-asl-1.9.13.jar jackson-databind-2.9.2.jar jackson-mapper-asl-1.9.13.jar jcodings-1.0.18.jar jdk.tools-1.8.jar jetty-sslengine-6.1.26.jar joni-2.1.11.jar jsch-0.1.54.jar jsr305-3.0.0.jar junit-4.12.jar log4j-1.2.17.jar metrics-core-3.2.6.jar paranamer-2.3.jar protobuf-java-2.5.0.jar slf4j-api-1.7.25.jar slf4j-log4j12-1.6.1.jar snappy-java-1.0.4.1.jar xmlenc-0.52.jar xz-1.0.jar zookeeper-3.4.10.jar
Parent topic: HBase Handler Client Dependencies
8.2.6.8.5 HBase 2.0.5
apacheds-i18n-2.0.0-M15.jar apacheds-kerberos-codec-2.0.0-M15.jar api-asn1-api-1.0.0-M20.jar api-util-1.0.0-M20.jar audience-annotations-0.5.0.jar avro-1.7.4.jar commons-beanutils-1.7.0.jar commons-beanutils-core-1.8.0.jar commons-cli-1.2.jar commons-codec-1.10.jar commons-collections-3.2.2.jar commons-compress-1.4.1.jar commons-configuration-1.6.jar commons-crypto-1.0.0.jar commons-digester-1.8.jar commons-httpclient-3.1.jar commons-io-2.5.jar commons-lang-2.6.jar commons-lang3-3.6.jar commons-logging-1.1.3.jar commons-math3-3.1.1.jar commons-net-3.1.jar curator-client-2.7.1.jar curator-framework-2.7.1.jar curator-recipes-2.7.1.jar findbugs-annotations-1.3.9-1.jar gson-2.2.4.jar guava-11.0.2.jar hadoop-annotations-2.7.7.jar hadoop-auth-2.7.7.jar hadoop-common-2.7.7.jar hamcrest-core-1.3.jar hbase-client-2.0.5.jar hbase-common-2.0.5.jar hbase-hadoop2-compat-2.0.5.jar hbase-hadoop-compat-2.0.5.jar hbase-metrics-2.0.5.jar hbase-metrics-api-2.0.5.jar hbase-protocol-2.0.5.jar hbase-protocol-shaded-2.0.5.jar hbase-shaded-miscellaneous-2.1.0.jar hbase-shaded-netty-2.1.0.jar hbase-shaded-protobuf-2.1.0.jar htrace-core4-4.2.0-incubating.jar httpclient-4.2.5.jar httpcore-4.2.4.jar jackson-annotations-2.9.0.jar jackson-core-2.9.2.jar jackson-core-asl-1.9.13.jar jackson-databind-2.9.2.jar jackson-mapper-asl-1.9.13.jar jcodings-1.0.18.jar jdk.tools-1.8.jar jetty-sslengine-6.1.26.jar joni-2.1.11.jar jsch-0.1.54.jar jsr305-3.0.0.jar junit-4.12.jar log4j-1.2.17.jar metrics-core-3.2.1.jar paranamer-2.3.jar protobuf-java-2.5.0.jar slf4j-api-1.7.25.jar slf4j-log4j12-1.6.1.jar snappy-java-1.0.4.1.jar xmlenc-0.52.jar xz-1.0.jar zookeeper-3.4.10.jar
Parent topic: HBase Handler Client Dependencies
8.2.6.8.6 HBase 1.4.10
activation-1.1.jar apacheds-i18n-2.0.0-M15.jar apacheds-kerberos-codec-2.0.0-M15.jar api-asn1-api-1.0.0-M20.jar api-util-1.0.0-M20.jar avro-1.7.7.jar commons-beanutils-1.7.0.jar commons-beanutils-core-1.8.0.jar commons-cli-1.2.jar commons-codec-1.9.jar commons-collections-3.2.2.jar commons-compress-1.4.1.jar commons-configuration-1.6.jar commons-digester-1.8.jar commons-httpclient-3.1.jar commons-io-2.4.jar commons-lang-2.6.jar commons-logging-1.2.jar commons-math3-3.1.1.jar commons-net-3.1.jar curator-client-2.7.1.jar curator-framework-2.7.1.jar curator-recipes-2.7.1.jar findbugs-annotations-1.3.9-1.jar gson-2.2.4.jar guava-12.0.1.jar hadoop-annotations-2.7.4.jar hadoop-auth-2.7.4.jar hadoop-common-2.7.4.jar hadoop-mapreduce-client-core-2.7.4.jar hadoop-yarn-api-2.7.4.jar hadoop-yarn-common-2.7.4.jar hamcrest-core-1.3.jar hbase-annotations-1.4.10.jar hbase-client-1.4.10.jar hbase-common-1.4.10.jar hbase-protocol-1.4.10.jar htrace-core-3.1.0-incubating.jar httpclient-4.2.5.jar httpcore-4.2.4.jar jackson-core-asl-1.9.13.jar jackson-mapper-asl-1.9.13.jar jaxb-api-2.2.2.jar jcodings-1.0.8.jar jdk.tools-1.8.jar jetty-sslengine-6.1.26.jar jetty-util-6.1.26.jar joni-2.1.2.jar jsch-0.1.54.jar jsr305-3.0.0.jar junit-4.12.jar log4j-1.2.17.jar metrics-core-2.2.0.jar netty-3.6.2.Final.jar netty-all-4.1.8.Final.jar paranamer-2.3.jar protobuf-java-2.5.0.jar slf4j-api-1.6.1.jar slf4j-log4j12-1.6.1.jar snappy-java-1.0.5.jar stax-api-1.0-2.jar xmlenc-0.52.jar xz-1.0.jar zookeeper-3.4.10.jar
Parent topic: HBase Handler Client Dependencies
8.2.6.8.7 HBase 1.3.3
activation-1.1.jar apacheds-i18n-2.0.0-M15.jar apacheds-kerberos-codec-2.0.0-M15.jar api-asn1-api-1.0.0-M20.jar api-util-1.0.0-M20.jar avro-1.7.4.jar commons-beanutils-1.7.0.jar commons-beanutils-core-1.8.0.jar commons-cli-1.2.jar commons-codec-1.9.jar commons-collections-3.2.2.jar commons-compress-1.4.1.jar commons-configuration-1.6.jar commons-digester-1.8.jar commons-el-1.0.jar commons-httpclient-3.1.jar commons-io-2.4.jar commons-lang-2.6.jar commons-logging-1.2.jar commons-math3-3.1.1.jar commons-net-3.1.jar findbugs-annotations-1.3.9-1.jar guava-12.0.1.jar hadoop-annotations-2.5.1.jar hadoop-auth-2.5.1.jar hadoop-common-2.5.1.jar hadoop-mapreduce-client-core-2.5.1.jar hadoop-yarn-api-2.5.1.jar hadoop-yarn-common-2.5.1.jar hamcrest-core-1.3.jar hbase-annotations-1.3.3.jar hbase-client-1.3.3.jar hbase-common-1.3.3.jar hbase-protocol-1.3.3.jar htrace-core-3.1.0-incubating.jar httpclient-4.2.5.jar httpcore-4.2.4.jar jackson-core-asl-1.9.13.jar jackson-mapper-asl-1.9.13.jar jaxb-api-2.2.2.jar jcodings-1.0.8.jar jdk.tools-1.6.jar jetty-util-6.1.26.jar joni-2.1.2.jar jsch-0.1.42.jar jsr305-1.3.9.jar junit-4.12.jar log4j-1.2.17.jar metrics-core-2.2.0.jar netty-3.6.2.Final.jar netty-all-4.0.50.Final.jar paranamer-2.3.jar protobuf-java-2.5.0.jar slf4j-api-1.6.1.jar slf4j-log4j12-1.6.1.jar snappy-java-1.0.4.1.jar stax-api-1.0-2.jar xmlenc-0.52.jar xz-1.0.jar zookeeper-3.4.6.jar
Parent topic: HBase Handler Client Dependencies
8.2.6.8.8 HBase 1.2.5
activation-1.1.jar apacheds-i18n-2.0.0-M15.jar apacheds-kerberos-codec-2.0.0-M15.jar api-asn1-api-1.0.0-M20.jar api-util-1.0.0-M20.jar avro-1.7.4.jar commons-beanutils-1.7.0.jar commons-beanutils-core-1.8.0.jar commons-cli-1.2.jar commons-codec-1.9.jar commons-collections-3.2.2.jar commons-compress-1.4.1.jar commons-configuration-1.6.jar commons-digester-1.8.jar commons-el-1.0.jar commons-httpclient-3.1.jar commons-io-2.4.jar commons-lang-2.6.jar commons-logging-1.2.jar commons-math3-3.1.1.jar commons-net-3.1.jar findbugs-annotations-1.3.9-1.jar guava-12.0.1.jar hadoop-annotations-2.5.1.jar hadoop-auth-2.5.1.jar hadoop-common-2.5.1.jar hadoop-mapreduce-client-core-2.5.1.jar hadoop-yarn-api-2.5.1.jar hadoop-yarn-common-2.5.1.jar hamcrest-core-1.3.jar hbase-annotations-1.2.5.jar hbase-client-1.2.5.jar hbase-common-1.2.5.jar hbase-protocol-1.2.5.jar htrace-core-3.1.0-incubating.jar httpclient-4.2.5.jar httpcore-4.2.4.jar jackson-core-asl-1.9.13.jar jackson-mapper-asl-1.9.13.jar jaxb-api-2.2.2.jar jcodings-1.0.8.jar jdk.tools-1.6.jar jetty-util-6.1.26.jar joni-2.1.2.jar jsch-0.1.42.jar jsr305-1.3.9.jar junit-4.12.jar log4j-1.2.17.jar metrics-core-2.2.0.jar netty-3.6.2.Final.jar netty-all-4.0.23.Final.jar paranamer-2.3.jar protobuf-java-2.5.0.jar slf4j-api-1.6.1.jar slf4j-log4j12-1.6.1.jar snappy-java-1.0.4.1.jar stax-api-1.0-2.jar xmlenc-0.52.jar xz-1.0.jar zookeeper-3.4.6.jar
Parent topic: HBase Handler Client Dependencies
8.2.6.8.9 HBase 1.1.1
HBase 1.1.1 is effectively the same as HBase 1.1.0.1. You can substitute 1.1.0.1 in the libraries that are versioned as 1.1.1.
activation-1.1.jar apacheds-i18n-2.0.0-M15.jar apacheds-kerberos-codec-2.0.0-M15.jar api-asn1-api-1.0.0-M20.jar api-util-1.0.0-M20.jar avro-1.7.4.jar commons-beanutils-1.7.0.jar commons-beanutils-core-1.8.0.jar commons-cli-1.2.jar commons-codec-1.9.jar commons-collections-3.2.1.jar commons-compress-1.4.1.jar commons-configuration-1.6.jar commons-digester-1.8.jar commons-el-1.0.jar commons-httpclient-3.1.jar commons-io-2.4.jar commons-lang-2.6.jar commons-logging-1.2.jar commons-math3-3.1.1.jar commons-net-3.1.jar findbugs-annotations-1.3.9-1.jar guava-12.0.1.jar hadoop-annotations-2.5.1.jar hadoop-auth-2.5.1.jar hadoop-common-2.5.1.jar hadoop-mapreduce-client-core-2.5.1.jar hadoop-yarn-api-2.5.1.jar hadoop-yarn-common-2.5.1.jar hamcrest-core-1.3.jar hbase-annotations-1.1.1.jar hbase-client-1.1.1.jar hbase-common-1.1.1.jar hbase-protocol-1.1.1.jar htrace-core-3.1.0-incubating.jar httpclient-4.2.5.jar httpcore-4.2.4.jar jackson-core-asl-1.9.13.jar jackson-mapper-asl-1.9.13.jar jaxb-api-2.2.2.jar jcodings-1.0.8.jar jdk.tools-1.7.jar jetty-util-6.1.26.jar joni-2.1.2.jar jsch-0.1.42.jar jsr305-1.3.9.jar junit-4.11.jar log4j-1.2.17.jar netty-3.6.2.Final.jar netty-all-4.0.23.Final.jar paranamer-2.3.jar protobuf-java-2.5.0.jar slf4j-api-1.6.1.jar slf4j-log4j12-1.6.1.jar snappy-java-1.0.4.1.jar stax-api-1.0-2.jar xmlenc-0.52.jar xz-1.0.jar zookeeper-3.4.6.jar
Parent topic: HBase Handler Client Dependencies
8.2.6.8.10 HBase 1.0.1.1
activation-1.1.jar apacheds-i18n-2.0.0-M15.jar apacheds-kerberos-codec-2.0.0-M15.jar api-asn1-api-1.0.0-M20.jar api-util-1.0.0-M20.jar avro-1.7.4.jar commons-beanutils-1.7.0.jar commons-beanutils-core-1.8.0.jar commons-cli-1.2.jar commons-codec-1.9.jar commons-collections-3.2.1.jar commons-compress-1.4.1.jar commons-configuration-1.6.jar commons-digester-1.8.jar commons-el-1.0.jar commons-httpclient-3.1.jar commons-io-2.4.jar commons-lang-2.6.jar commons-logging-1.2.jar commons-math3-3.1.1.jar commons-net-3.1.jar findbugs-annotations-1.3.9-1.jar guava-12.0.1.jar hadoop-annotations-2.5.1.jar hadoop-auth-2.5.1.jar hadoop-common-2.5.1.jar hadoop-mapreduce-client-core-2.5.1.jar hadoop-yarn-api-2.5.1.jar hadoop-yarn-common-2.5.1.jar hamcrest-core-1.3.jar hbase-annotations-1.0.1.1.jar hbase-client-1.0.1.1.jar hbase-common-1.0.1.1.jar hbase-protocol-1.0.1.1.jar htrace-core-3.1.0-incubating.jar httpclient-4.2.5.jar httpcore-4.2.4.jar jackson-core-asl-1.8.8.jar jackson-mapper-asl-1.8.8.jar jaxb-api-2.2.2.jar jcodings-1.0.8.jar jdk.tools-1.7.jar jetty-util-6.1.26.jar joni-2.1.2.jar jsch-0.1.42.jar jsr305-1.3.9.jar junit-4.11.jar log4j-1.2.17.jar netty-3.6.2.Final.jar netty-all-4.0.23.Final.jar paranamer-2.3.jar protobuf-java-2.5.0.jar slf4j-api-1.6.1.jar slf4j-log4j12-1.6.1.jar snappy-java-1.0.4.1.jar stax-api-1.0-2.jar xmlenc-0.52.jar xz-1.0.jar zookeeper-3.4.6.jar
Parent topic: HBase Handler Client Dependencies
8.2.7 Apache HDFS
The HDFS Handler is designed to stream change capture data into the Hadoop Distributed File System (HDFS).
This chapter describes how to use the HDFS Handler.
- Overview
- Writing into HDFS in SequenceFile Format
The HDFS SequenceFile is a flat file consisting of binary key and value pairs. You can enable writing data in SequenceFile format by setting the gg.handler.name.format property to sequencefile.
- Setting Up and Running the HDFS Handler
- Writing in HDFS in Avro Object Container File Format
- Generating HDFS File Names Using Template Strings
- Metadata Change Events
- Partitioning
The partitioning functionality uses the template mapper functionality to resolve partitioning strings. The result is that you have more control over how to partition source trail data. Starting with Oracle GoldenGate for Big Data 21.1, all the keywords that are supported by the templating functionality are supported in HDFS partitioning.
- HDFS Additional Considerations
- Best Practices
- Troubleshooting the HDFS Handler
Troubleshooting of the HDFS Handler begins with the contents of the Java log4j log file. Follow the directions in the Java Logging Configuration to configure the runtime to correctly generate the Java log4j log file.
- HDFS Handler Client Dependencies
Parent topic: Target
8.2.7.1 Overview
HDFS is the primary file system for Big Data. Hadoop is typically installed on multiple machines that work together as a Hadoop cluster. Hadoop allows you to store very large amounts of data that is horizontally scaled across the machines in the cluster. You can then perform analytics on that data using a variety of Big Data applications.
Parent topic: Apache HDFS
8.2.7.2 Writing into HDFS in SequenceFile Format
The HDFS SequenceFile is a flat file consisting of binary key and value pairs. You can enable writing data in SequenceFile format by setting the gg.handler.name.format property to sequencefile.
The key part of the record is set to null, and the actual data is set in the value part. For information about the Hadoop SequenceFile, see https://cwiki.apache.org/confluence/display/HADOOP2/SequenceFile.
Parent topic: Apache HDFS
8.2.7.2.1 Integrating with Hive
The Oracle GoldenGate for Big Data release does not include a Hive storage handler because the HDFS Handler provides all of the necessary Hive functionality.
You can create a Hive integration to create tables and update table definitions in case of DDL events. This is limited to data formatted in Avro Object Container File format. For more information, see Writing in HDFS in Avro Object Container File Format and HDFS Handler Configuration.
For Hive to consume sequence files, the DDL creates Hive tables including STORED as sequencefile. The following is a sample create table script:
CREATE EXTERNAL TABLE table_name (
col1 string,
...
...
col2 string)
ROW FORMAT DELIMITED
STORED as sequencefile
LOCATION '/path/to/hdfs/file';
Note:
If files are intended to be consumed by Hive, then the gg.handler.name.partitionByTable property should be set to true.
Parent topic: Writing into HDFS in SequenceFile Format
8.2.7.2.2 Understanding the Data Format
The data written in the value part of each record is in delimited text format. All of the options described in the Using the Delimited Text Row Formatter section are applicable to the HDFS SequenceFile when writing data to it.
For example:
gg.handler.name.format=sequencefile
gg.handler.name.format.includeColumnNames=true
gg.handler.name.format.includeOpType=true
gg.handler.name.format.includeCurrentTimestamp=true
gg.handler.name.format.updateOpKey=U
Parent topic: Writing into HDFS in SequenceFile Format
8.2.7.3 Setting Up and Running the HDFS Handler
To run the HDFS Handler, a Hadoop single instance or Hadoop cluster must be installed, running, and network-accessible from the machine running the HDFS Handler. Apache Hadoop is open source and you can download it from http://hadoop.apache.org.
Follow the Getting Started links for information on how to install a single-node cluster (for pseudo-distributed operation mode) or a clustered setup (for fully-distributed operation mode).
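After installing Hadoop, it can help to confirm that HDFS is reachable from the machine that will run the HDFS Handler before configuring the handler itself. The following is a minimal sketch using standard Hadoop shell commands; the /user/ogg directory is only an illustrative example.
hadoop version
hdfs dfs -mkdir -p /user/ogg
hdfs dfs -ls /user/ogg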
Instructions for configuring the HDFS Handler components and running the handler are described in the following sections.
- Classpath Configuration
- HDFS Handler Configuration
- Review a Sample Configuration
- Performance Considerations
- Security
Parent topic: Apache HDFS
8.2.7.3.1 Classpath Configuration
For the HDFS Handler to connect to HDFS and run, the HDFS core-site.xml file and the HDFS client jars must be configured in the gg.classpath variable. The HDFS client jars must match the version of HDFS that the HDFS Handler is connecting to. For a list of the required client jar files by release, see HDFS Handler Client Dependencies.
The default location of the core-site.xml file is Hadoop_Home/etc/hadoop.
The default locations of the HDFS client jars are the following directories:
Hadoop_Home/share/hadoop/common/lib/*
Hadoop_Home/share/hadoop/common/*
Hadoop_Home/share/hadoop/hdfs/lib/*
Hadoop_Home/share/hadoop/hdfs/*
The gg.classpath must be configured exactly as shown. The path to the core-site.xml file must contain the path to the directory containing the core-site.xml file with no wildcard appended. If you include a (*) wildcard in the path to the core-site.xml file, the file is not picked up. Conversely, the path to the dependency jars must include the (*) wildcard character in order to include all the jar files in that directory in the associated classpath. Do not use *.jar.
The following is an example of a correctly configured gg.classpath variable:
gg.classpath=/ggwork/hadoop/hadoop-2.6.0/etc/hadoop:/ggwork/hadoop/hadoop-2.6.0/share/hadoop/common/lib/*:/ggwork/hadoop/hadoop-2.6.0/share/hadoop/common/*:/ggwork/hadoop/hadoop-2.6.0/share/hadoop/hdfs/*:/ggwork/hadoop/hadoop-2.6.0/share/hadoop/hdfs/lib/*
The HDFS configuration file hdfs-site.xml must also be in the classpath if Kerberos security is enabled. By default, the hdfs-site.xml file is located in the Hadoop_Home/etc/hadoop directory. If the HDFS Handler is not collocated with Hadoop, either or both files can be copied to another machine.
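For example, if the HDFS Handler runs on an edge node and the Hadoop configuration files and client jars were copied to local directories, the gg.classpath might look like the following sketch. The /u01/hadoop-conf and /u01/hadoop-client paths are only illustrative; use the directories to which you copied the files.
gg.classpath=/u01/hadoop-conf:/u01/hadoop-client/common/lib/*:/u01/hadoop-client/common/*:/u01/hadoop-client/hdfs/*:/u01/hadoop-client/hdfs/lib/*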
Parent topic: Setting Up and Running the HDFS Handler
8.2.7.3.2 HDFS Handler Configuration
The following are the configurable values for the HDFS Handler. These properties are located in the Java Adapter properties file (not in the Replicat properties file).
To enable the selection of the HDFS Handler, you must first configure the handler type by specifying gg.handler.name.type=hdfs and the other HDFS properties as follows:
Property | Optional / Required | Legal Values | Default | Explanation
---|---|---|---|---
gg.handlerlist | Required | Any string | None | Provides a name for the HDFS Handler. The HDFS Handler name then becomes part of the property names listed in this table.
gg.handler.name.type | Required | hdfs | None | Selects the HDFS Handler for streaming change data capture into HDFS.
gg.handler.name.mode | Optional | op or tx | - | Selects operation (op) mode or transaction (tx) mode for the handler.
gg.handler.name.maxFileSize | Optional | The default unit of measure is bytes. | - | Selects the maximum file size of the created HDFS files.
gg.handler.name.pathMappingTemplate | Optional | Any legal templated string to resolve the target write directory in HDFS. Templates can contain a mix of constants and keywords which are dynamically resolved at runtime to generate the HDFS write directory. | - | You can use keywords interlaced with constants to dynamically generate the HDFS write directory at runtime, see Generating HDFS File Names Using Template Strings.
gg.handler.name.fileRollInterval | Optional | The default unit of measure is milliseconds. | File rolling on time is off. | The timer starts when an HDFS file is created. If the file is still open when the interval elapses, then the file is closed. A new file is not immediately opened. New HDFS files are created on a just-in-time basis.
gg.handler.name.inactivityRollInterval | Optional | The default unit of measure is milliseconds. | File inactivity rolling on time is off. | The timer starts from the latest write to an HDFS file. New writes to an HDFS file restart the counter. If the file is still open when the counter elapses, the HDFS file is closed. A new file is not immediately opened. New HDFS files are created on a just-in-time basis.
gg.handler.name.fileNameMappingTemplate | Optional | A string with resolvable keywords and constants used to dynamically generate HDFS file names at runtime. | ${fullyQualifiedTableName}_${groupName}_${currentTimestamp}.txt | You can use keywords interlaced with constants to dynamically generate unique HDFS file names at runtime, see Generating HDFS File Names Using Template Strings.
gg.handler.name.partitionByTable | Optional | true or false | - | Determines whether data written into HDFS is partitioned by table. If set to true, data for different source tables is written to different HDFS files.
gg.handler.name.rollOnMetadataChange | Optional | true or false | - | Determines whether HDFS files are rolled in the case of a metadata change. True means the HDFS file is rolled, false means the HDFS file is not rolled.
gg.handler.name.format | Optional | - | - | Selects the formatter for the HDFS Handler for how output data is formatted.
gg.handler.name.includeTokens | Optional | true or false | - | Controls whether token data from the source trail file is included in the output.
gg.handler.name.partitioner.fully qualified table name | Optional | A mixture of templating keywords and constants to resolve a subdirectory at runtime to partition the data. | - | The configuration resolves a subdirectory or subdirectories, which are appended to the resolved HDFS target path. These subdirectories are used to partition the data.
gg.handler.name.authType | Optional | kerberos | - | Setting this property to kerberos enables Kerberos authentication.
gg.handler.name.kerberosKeytabFile | Optional (Required if authType=kerberos) | Relative or absolute path to a Kerberos keytab file. | - | The keytab file that contains the password for the Kerberos principal so that the HDFS Handler can programmatically perform the Kerberos kinit operation to obtain a Kerberos ticket.
gg.handler.name.kerberosPrincipalName | Optional (Required if authType=kerberos) | A legal Kerberos principal name. | - | The Kerberos principal name for Kerberos authentication.
gg.handler.name.schemaFilePath | Optional | - | - | Set to a legal path in HDFS so that schemas (if available) are written in that HDFS directory. Schemas are currently only available for Avro and JSON formatters. In the case of a metadata change event, the schema is overwritten to reflect the schema change.
gg.handler.name.compressionType (applicable to Sequence File format only) | Optional | - | - | Hadoop Sequence File Compression Type.
gg.handler.name.compressionCodec (applicable to Sequence File format and writing to HDFS in Avro OCF format only) | Optional | - | - | Hadoop Sequence File Compression Codec.
gg.handler.name.compressionCodec (applicable when writing to HDFS in Avro OCF format) | Optional | - | - | Avro OCF Formatter Compression Codec. This configuration controls the selection of the compression library to be used for Avro OCF files. Snappy includes native binaries in the Snappy JAR file and performs a Java-native traversal when compressing or decompressing. Use of Snappy may introduce runtime issues and platform porting issues that you may not experience when working with Java. You may need to perform additional testing to ensure that Snappy works on all of your required platforms. Snappy is an open source library, so Oracle cannot guarantee its ability to operate on all of your required platforms.
gg.handler.name.openNextFileAtRoll | Optional | true or false | - | Applicable only to the HDFS Handler that is not writing an Avro OCF or sequence file, to support extract, load, transform (ELT) situations. When set to true, the next file is opened immediately when the current file is rolled. File rolls can be triggered by a metadata change, the file roll interval, or the inactivity roll interval. Data files are loaded into HDFS and a monitor program is monitoring the write directories waiting to consume the data. The monitoring programs use the appearance of a new file as a trigger so that the previous file can be consumed by the consuming application.
Parent topic: Setting Up and Running the HDFS Handler
8.2.7.3.3 Review a Sample Configuration
The following is a sample configuration for the HDFS Handler from the Java Adapter properties file:
gg.handlerlist=hdfs
gg.handler.hdfs.type=hdfs
gg.handler.hdfs.mode=tx
gg.handler.hdfs.includeTokens=false
gg.handler.hdfs.maxFileSize=1g
gg.handler.hdfs.pathMappingTemplate=/ogg/${fullyQualifiedTableName}
gg.handler.hdfs.fileRollInterval=0
gg.handler.hdfs.inactivityRollInterval=0
gg.handler.hdfs.partitionByTable=true
gg.handler.hdfs.rollOnMetadataChange=true
gg.handler.hdfs.authType=none
gg.handler.hdfs.format=delimitedtext
Parent topic: Setting Up and Running the HDFS Handler
8.2.7.3.4 Performance Considerations
The HDFS Handler calls the HDFS flush method on the HDFS write stream to flush data to the HDFS data nodes at the end of each transaction in order to maintain write durability. This is an expensive call, and performance can be adversely affected, especially in the case of transactions of one or a few operations, which result in numerous HDFS flush calls.
Performance of the HDFS Handler can be greatly improved by batching multiple small transactions into a single larger transaction. If you require high performance, configure batching functionality for the Replicat process. For more information, see Replicat Grouping.
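As a sketch, a Replicat parameter file for Oracle GoldenGate for Big Data might group many small source transactions into larger target transactions using the GROUPTRANSOPS parameter; the Replicat name, properties file path, mapping, and value shown here are illustrative only and should be tuned for your workload.
REPLICAT rhdfs
TARGETDB LIBFILE libggjava.so SET property=dirprm/hdfs.props
GROUPTRANSOPS 1000
MAP SOURCE.*, TARGET SOURCE.*;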
The HDFS client libraries spawn threads for every HDFS file stream opened by the HDFS Handler. Therefore, the number of threads executing in the JVM grows proportionally to the number of HDFS file streams that are open. Performance of the HDFS Handler may degrade as more HDFS file streams are opened. Configuring the HDFS Handler to write to many HDFS files (due to many source replication tables or extensive use of partitioning) may result in degraded performance. If your use case requires writing to many tables, then Oracle recommends that you enable the roll on time or roll on inactivity features to close HDFS file streams. Closing an HDFS file stream causes the HDFS client threads to terminate, and the associated resources can be reclaimed by the JVM.
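The following is a minimal sketch of the roll-on-time and roll-on-inactivity properties; the values (in milliseconds, the default unit of measure) are illustrative only.
# Close HDFS files that have been open for one hour.
gg.handler.hdfs.fileRollInterval=3600000
# Close HDFS files that have received no writes for ten minutes.
gg.handler.hdfs.inactivityRollInterval=600000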
Parent topic: Setting Up and Running the HDFS Handler
8.2.7.3.5 Security
The HDFS cluster can be secured using Kerberos authentication, and the HDFS Handler can connect to a Kerberos-secured cluster. The HDFS core-site.xml must be in the handler's classpath with the hadoop.security.authentication property set to kerberos and the hadoop.security.authorization property set to true. Additionally, you must set the following properties in the HDFS Handler Java configuration file:
gg.handler.name.authType=kerberos
gg.handler.name.kerberosPrincipalName=legal Kerberos principal name
gg.handler.name.kerberosKeytabFile=path to a keytab file that contains the password for the Kerberos principal so that the HDFS Handler can programmatically perform the Kerberos kinit operations to obtain a Kerberos ticket
You may encounter the inability to decrypt the Kerberos password from the keytab
file. This causes the Kerberos authentication to fall back to interactive mode which cannot work because it is being invoked programmatically. The cause of this problem is that the Java Cryptography Extension (JCE) is not installed in the Java Runtime Environment (JRE). Ensure that the JCE is loaded in the JRE, see http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html.
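The following is a minimal sketch of a Kerberos-enabled handler configuration; the principal name and keytab path are placeholders for illustration only.
gg.handler.hdfs.authType=kerberos
gg.handler.hdfs.kerberosPrincipalName=ogg/bigdatahost@EXAMPLE.COM
gg.handler.hdfs.kerberosKeytabFile=/u01/ogg/dirprm/ogg.keytab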
Parent topic: Setting Up and Running the HDFS Handler
8.2.7.4 Writing in HDFS in Avro Object Container File Format
The HDFS Handler includes specialized functionality to write to HDFS in Avro Object Container File (OCF) format. This Avro OCF is part of the Avro specification and is detailed in the Avro documentation at:
https://avro.apache.org/docs/current/spec.html#Object+Container+Files
Avro OCF format may be a good choice because it:
- integrates with Apache Hive (raw Avro written to HDFS is not supported by Hive)
- provides good support for schema evolution
Configure the following to enable writing to HDFS in Avro OCF format:
To write row data to HDFS in Avro OCF format, configure the gg.handler.name.format=avro_row_ocf property.
To write operation data to HDFS in Avro OCF format, configure the gg.handler.name.format=avro_op_ocf property.
The HDFS and Avro OCF integration includes functionality to create the corresponding tables in Hive and update the schema for metadata change events. The configuration section provides information on the properties to enable integration with Hive. The Oracle GoldenGate Hive integration accesses Hive using the JDBC interface, so the Hive JDBC server must be running to enable this integration.
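For example, a handler that writes row data in Avro OCF format might be configured as in the following sketch; the path and file-name templates are illustrative and reuse only keywords described in this chapter.
gg.handler.hdfs.type=hdfs
gg.handler.hdfs.format=avro_row_ocf
gg.handler.hdfs.pathMappingTemplate=/ogg/${fullyQualifiedTableName}
gg.handler.hdfs.fileNameMappingTemplate=${fullyQualifiedTableName}_${groupName}_${currentTimestamp}.avro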
Parent topic: Apache HDFS
8.2.7.5 Generating HDFS File Names Using Template Strings
The HDFS Handler can dynamically generate HDFS file names using a template
string. The template string allows you to generate a combination of keywords that are
dynamically resolved at runtime with static strings to provide you more control of
generated HDFS file names. You can control the template file name using the
gg.handler.name.fileNameMappingTemplate
configuration
property. The default value for this parameters is:
${fullyQualifiedTableName}_${groupName}_${currentTimestamp}.txt
See Template Keywords.
Following are examples of legal templates and the resolved strings:

Legal Template | Replacement
---|---
${schemaName}.${tableName}__${groupName}_${currentTimestamp}.txt | TEST.TABLE1__HDFS001_2017-07-05_04-31-23.123.txt
${fullyQualifiedTableName}--${currentTimestamp}.avro | ORACLE.TEST.TABLE1--2017-07-05_04-31-23.123.avro
${fullyQualifiedTableName}_${currentTimestamp[yyyy-MM-ddTHH-mm-ss.SSS]}.json | ORACLE.TEST.TABLE1_2017-07-05T04-31-23.123.json
Be aware of these restrictions when generating HDFS file names using templates:
- Generated HDFS file names must be legal HDFS file names.
- Oracle strongly recommends that you use ${groupName} as part of the HDFS file naming template when using coordinated apply and breaking down source table data to different Replicat threads. The group name provides uniqueness of generated HDFS names that ${currentTimestamp} alone does not guarantee. HDFS file name collisions result in an abend of the Replicat process. See the sample configuration following this list.
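The following is a minimal sketch of a file name template that follows this recommendation; the keywords are those described in this chapter, and the .json suffix is only an example.
gg.handler.hdfs.fileNameMappingTemplate=${tableName}_${groupName}_${currentTimestamp}.json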
Parent topic: Apache HDFS
8.2.7.6 Metadata Change Events
Metadata change events are now handled in the HDFS Handler. The default behavior of the HDFS Handler is to roll the current relevant file in the event of a metadata change event. This behavior allows for the results of metadata changes to at least be separated into different files. File rolling on metadata change is configurable and can be turned off.
To support metadata change events, the process capturing changes in the source database must support both DDL changes and metadata in trail. Oracle GoldenGate does not support DDL replication for all database implementations. See the Oracle GoldenGate installation and configuration guide for the appropriate database to determine whether DDL replication is supported.
Parent topic: Apache HDFS
8.2.7.7 Partitioning
The partitioning functionality uses the template mapper functionality to resolve partitioning strings. The result is that you have more control over how to partition source trail data. Starting with Oracle GoldenGate for Big Data 21.1, all the keywords that are supported by the templating functionality are supported in HDFS partitioning.
Precondition
To use the partitioning functionality, ensure that the data is partitioned by table. You cannot set the following configuration:
gg.handler.name.partitionByTable=false
Path Configuration
Assume that the path mapping template is configured as follows:
gg.handler.hdfs.pathMappingTemplate=/ogg/${fullyQualifiedTableName}
At runtime, the path resolves as follows for the source table DBO.ORDERS:
/ogg/DBO.ORDERS
Partitioning Configuration
Configure the HDFS partitioning as follows; any of the keywords that are legal for templating are now legal for partitioning:
gg.handler.name.partitioner.fully qualified table name=templating keywords and/or constants
For example, if partitioning for the DBO.ORDERS table is set to the following:
gg.handler.hdfs.partitioner.DBO.ORDERS=par_sales_region=${columnValue[SALES_REGION]}
then this can result in the following breakdown of files in HDFS:
/ogg/DBO.ORDERS/par_sales_region=west/data files
/ogg/DBO.ORDERS/par_sales_region=east/data files
/ogg/DBO.ORDERS/par_sales_region=north/data files
/ogg/DBO.ORDERS/par_sales_region=south/data files
If partitioning for the DBO.ORDERS table is instead set to the following:
gg.handler.hdfs.partitioner.DBO.ORDERS=par_sales_region=${columnValue[SALES_REGION]}/par_state=${columnValue[STATE]}
This example can result in the following breakdown of files in HDFS:
/ogg/DBO.ORDERS/par_sales_region=west/par_state=CA/data files
/ogg/DBO.ORDERS/par_sales_region=east/par_state=FL/data files
/ogg/DBO.ORDERS/par_sales_region=north/par_state=MN/data files
/ogg/DBO.ORDERS/par_sales_region=south/par_state=TX/data files
Be extra vigilant when configuring HDFS partitioning. If you choose partitioning column values that have a very large range of data values, then partitioning results in a proportional number of output data files. The HDFS client spawns multiple threads to service each open HDFS write stream. Partitioning to very large numbers of HDFS files can result in resource exhaustion of memory and/or threads.
Note:
Starting with Oracle GoldenGate for Big Data 21.1, the Automated Hive integration has been removed with the changes to support templating to control partitioning.
Parent topic: Apache HDFS
8.2.7.8 HDFS Additional Considerations
The Oracle HDFS Handler requires certain HDFS client libraries to be resolved in its classpath as a prerequisite for streaming data to HDFS.
For a list of required client JAR files by version, see HDFS Handler Client Dependencies. The HDFS client jars do not ship with the Oracle GoldenGate for Big Data product. The HDFS Handler supports multiple versions of HDFS, and the HDFS client jars must be the same version as the HDFS version to which the HDFS Handler is connecting. The HDFS client jars are open source and are freely available to download from sites such as the Apache Hadoop site or the Maven Central repository.
In order to establish connectivity to HDFS, the HDFS core-site.xml file must be in the classpath of the HDFS Handler. If the core-site.xml file is not in the classpath, the HDFS client code defaults to a mode that attempts to write to the local file system. Writing to the local file system instead of HDFS can be advantageous for troubleshooting, building a proof of concept (POC), or as a step in the process of building an HDFS integration.
Another common issue is that data streamed to HDFS using the HDFS Handler may not be immediately available to Big Data analytic tools such as Hive. This behavior commonly occurs when the HDFS Handler is in possession of an open write stream to an HDFS file. HDFS writes in blocks of 128 MB by default. HDFS blocks under construction are not always visible to analytic tools. Additionally, inconsistencies between file sizes when using the -ls, -cat, and -get commands in the HDFS shell may occur. This is an anomaly of HDFS streaming and is discussed in the HDFS specification. This anomaly of HDFS leads to a potential 128 MB per file blind spot in analytic data. This may not be an issue if you have a steady stream of replication data and do not require low levels of latency for analytic data from HDFS. However, this may be a problem in some use cases because closing the HDFS write stream finalizes the block writing. Data is immediately visible to analytic tools, and file sizing metrics become consistent again. Therefore, the file rolling feature in the HDFS Handler can be used to close HDFS write streams, making all data visible.
Important:
The file rolling solution may present its own problems. Extensive use of file rolling can result in many small files in HDFS. Many small files in HDFS may result in performance issues in analytic tools.
You may also notice the HDFS inconsistency problem in the following scenarios.
-
The HDFS Handler process crashes.
-
A forced shutdown is called on the HDFS Handler process.
-
A network outage or other issue causes the HDFS Handler process to abend.
In each of these scenarios, it is possible for the HDFS Handler to end without explicitly closing the HDFS write stream and finalizing the writing block. HDFS in its internal process ultimately recognizes that the write stream has been broken, so HDFS finalizes the write block. In this scenario, you may experience a short term delay before the HDFS process finalizes the write block.
Parent topic: Apache HDFS
8.2.7.9 Best Practices
It is considered a Big Data best practice for the HDFS cluster to operate on dedicated servers called cluster nodes. Edge nodes are server machines that host the applications to stream data to and retrieve data from the HDFS cluster nodes. Because the HDFS cluster nodes and the edge nodes are different servers, the following benefits are seen:
-
The HDFS cluster nodes do not compete for resources with the applications interfacing with the cluster.
-
The requirements for the HDFS cluster nodes and edge nodes probably differ. This physical topology allows the appropriate hardware to be tailored to specific needs.
It is a best practice for the HDFS Handler to be installed and running on an edge node and streaming data to the HDFS cluster using a network connection. The HDFS Handler can run on any machine that has network visibility to the HDFS cluster. The installation of the HDFS Handler on an edge node requires that the core-site.xml file and the dependency jars be copied to the edge node so that the HDFS Handler can access them. The HDFS Handler can also run collocated on an HDFS cluster node if required.
Parent topic: Apache HDFS
8.2.7.10 Troubleshooting the HDFS Handler
Troubleshooting of the HDFS Handler begins with the contents of the Java log4j log file. Follow the directions in the Java Logging Configuration to configure the runtime to correctly generate the Java log4j log file.
Parent topic: Apache HDFS
8.2.7.10.1 Java Classpath
Problems with the Java classpath are common. The usual indication of a Java classpath problem is a ClassNotFoundException in the Java log4j log file. The Java log4j log file can be used to troubleshoot this issue. Setting the log level to DEBUG causes each of the jars referenced in the gg.classpath object to be logged to the log file. In this way, you can ensure that all of the required dependency jars are resolved by enabling DEBUG level logging and searching the log file for messages, as in the following:
2015-09-21 10:05:10 DEBUG ConfigClassPath:74 - ...adding to classpath: url="file:/ggwork/hadoop/hadoop-2.6.0/share/hadoop/common/lib/guava-11.0.2.jar
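One way to enable DEBUG logging is in the Java Adapter properties file; the following is a sketch assuming the standard gg.log and gg.log.level logging properties are available in your release.
gg.log=log4j
gg.log.level=DEBUG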
Parent topic: Troubleshooting the HDFS Handler
8.2.7.10.2 Java Boot Options
When running the HDFS Replicat with JRE 11, a StackOverflowError is thrown. You can fix this issue by editing the jvm.bootoptions property in the Java Adapter properties file as follows:
jvm.bootoptions=-Djdk.lang.processReaperUseDefaultStackSize=true
Parent topic: Troubleshooting the HDFS Handler
8.2.7.10.3 HDFS Connection Properties
The contents of the HDFS core-site.xml file (including default settings) are output to the Java log4j log file when the logging level is set to DEBUG or TRACE. This output shows the connection properties to HDFS. Search for the following in the Java log4j log file:
2015-09-21 10:05:11 DEBUG HDFSConfiguration:58 - Begin - HDFS configuration object contents for connection troubleshooting.
If the fs.defaultFS property points to the local file system, then the core-site.xml file is not properly set in the gg.classpath property.
Key: [fs.defaultFS] Value: [file:///].
The following shows the fs.defaultFS property properly pointed at an HDFS host and port.
Key: [fs.defaultFS] Value: [hdfs://hdfshost:9000].
Parent topic: Troubleshooting the HDFS Handler
8.2.7.10.4 Handler and Formatter Configuration
The Java log4j log file contains information on the configuration state of the HDFS Handler and the selected formatter. This information is output at the INFO log level. The output resembles the following:
2015-09-21 10:05:11 INFO AvroRowFormatter:156 - **** Begin Avro Row Formatter - Configuration Summary ****
Operation types are always included in the Avro formatter output.
The key for insert operations is [I].
The key for update operations is [U].
The key for delete operations is [D].
The key for truncate operations is [T].
Column type mapping has been configured to map source column types to an appropriate corresponding Avro type.
Created Avro schemas will be output to the directory [./dirdef].
Created Avro schemas will be encoded using the [UTF-8] character set.
In the event of a primary key update, the Avro Formatter will ABEND.
Avro row messages will not be wrapped inside a generic Avro message.
No delimiter will be inserted after each generated Avro message.
**** End Avro Row Formatter - Configuration Summary ****
2015-09-21 10:05:11 INFO HDFSHandler:207 - **** Begin HDFS Handler - Configuration Summary ****
Mode of operation is set to tx.
Data streamed to HDFS will be partitioned by table.
Tokens will be included in the output.
The HDFS root directory for writing is set to [/ogg].
The maximum HDFS file size has been set to 1073741824 bytes.
Rolling of HDFS files based on time is configured as off.
Rolling of HDFS files based on write inactivity is configured as off.
Rolling of HDFS files in the case of a metadata change event is enabled.
HDFS partitioning information:
The HDFS partitioning object contains no partitioning information.
HDFS Handler Authentication type has been configured to use [none]
**** End HDFS Handler - Configuration Summary ****
Parent topic: Troubleshooting the HDFS Handler
8.2.7.11 HDFS Handler Client Dependencies
This appendix lists the HDFS client dependencies for Apache Hadoop. The hadoop-client-x.x.x.jar is not distributed with Apache Hadoop, nor is it mandatory to be in the classpath. The hadoop-client-x.x.x.jar is an empty Maven project with the purpose of aggregating all of the Hadoop client dependencies.
Maven groupId: org.apache.hadoop
Maven artifactId: hadoop-client
Maven version: the HDFS version numbers listed for each section
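For example, to pull the HDFS 3.3.0 client dependencies with Maven, a pom.xml might declare the following dependency; this is only a sketch using the coordinates above, and you should substitute the version that matches your HDFS cluster.
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>3.3.0</version>
</dependency>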
Parent topic: Apache HDFS
8.2.7.11.1 Hadoop Client Dependencies
This section lists the Hadoop client dependencies for each HDFS version.
- HDFS 3.3.0
- HDFS 3.2.0
- HDFS 3.1.4
- HDFS 3.0.3
- HDFS 2.9.2
- HDFS 2.8.5
- HDFS 2.7.7
- HDFS 2.6.0
- HDFS 2.5.2
- HDFS 2.4.1
- HDFS 2.3.0
- HDFS 2.2.0
Parent topic: HDFS Handler Client Dependencies
8.2.7.11.1.1 HDFS 3.3.0
accessors-smart-1.2.jar animal-sniffer-annotations-1.17.jar asm-5.0.4.jar avro-1.7.7.jar azure-keyvault-core-1.0.0.jar azure-storage-7.0.0.jar checker-qual-2.5.2.jar commons-beanutils-1.9.4.jar commons-cli-1.2.jar commons-codec-1.11.jar commons-collections-3.2.2.jar commons-compress-1.19.jar commons-configuration2-2.1.1.jar commons-io-2.5.jar commons-lang3-3.7.jar commons-logging-1.1.3.jar commons-math3-3.1.1.jar commons-net-3.6.jar commons-text-1.4.jar curator-client-4.2.0.jar curator-framework-4.2.0.jar curator-recipes-4.2.0.jar dnsjava-2.1.7.jar failureaccess-1.0.jar gson-2.2.4.jar guava-27.0-jre.jar hadoop-annotations-3.3.0.jar hadoop-auth-3.3.0.jar hadoop-azure-3.3.0.jar hadoop-client-3.3.0.jar hadoop-common-3.3.0.jar hadoop-hdfs-client-3.3.0.jar hadoop-mapreduce-client-common-3.3.0.jar hadoop-mapreduce-client-core-3.3.0.jar hadoop-mapreduce-client-jobclient-3.3.0.jar hadoop-shaded-protobuf_3_7-1.0.0.jar hadoop-yarn-api-3.3.0.jar hadoop-yarn-client-3.3.0.jar hadoop-yarn-common-3.3.0.jar htrace-core4-4.1.0-incubating.jar httpclient-4.5.6.jar httpcore-4.4.10.jar j2objc-annotations-1.1.jar jackson-annotations-2.10.3.jar jackson-core-2.6.0.jar jackson-core-asl-1.9.13.jar jackson-databind-2.10.3.jar jackson-jaxrs-base-2.10.3.jar jackson-jaxrs-json-provider-2.10.3.jar jackson-mapper-asl-1.9.13.jar jackson-module-jaxb-annotations-2.10.3.jar jakarta.activation-api-1.2.1.jar jakarta.xml.bind-api-2.3.2.jar javax.activation-api-1.2.0.jar javax.servlet-api-3.1.0.jar jaxb-api-2.2.11.jar jcip-annotations-1.0-1.jar jersey-client-1.19.jar jersey-core-1.19.jar jersey-servlet-1.19.jar jetty-client-9.4.20.v20190813.jar jetty-http-9.4.20.v20190813.jar jetty-io-9.4.20.v20190813.jar jetty-security-9.4.20.v20190813.jar jetty-servlet-9.4.20.v20190813.jar jetty-util-9.4.20.v20190813.jar jetty-util-ajax-9.4.20.v20190813.jar jetty-webapp-9.4.20.v20190813.jar jetty-xml-9.4.20.v20190813.jar jline-3.9.0.jar json-smart-2.3.jar jsp-api-2.1.jar jsr305-3.0.2.jar jsr311-api-1.1.1.jar kerb-admin-1.0.1.jar kerb-client-1.0.1.jar kerb-common-1.0.1.jar kerb-core-1.0.1.jar kerb-crypto-1.0.1.jar kerb-identity-1.0.1.jar kerb-server-1.0.1.jar kerb-simplekdc-1.0.1.jar kerb-util-1.0.1.jar kerby-asn1-1.0.1.jar kerby-config-1.0.1.jar kerby-pkix-1.0.1.jar kerby-util-1.0.1.jar kerby-xdr-1.0.1.jar listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar log4j-1.2.17.jar nimbus-jose-jwt-7.9.jar okhttp-2.7.5.jar okio-1.6.0.jar paranamer-2.3.jar protobuf-java-2.5.0.jar re2j-1.1.jar slf4j-api-1.7.25.jar snappy-java-1.0.5.jar stax2-api-3.1.4.jar token-provider-1.0.1.jar websocket-api-9.4.20.v20190813.jar websocket-client-9.4.20.v20190813.jar websocket-common-9.4.20.v20190813.jar wildfly-openssl-1.0.7.Final.jar woodstox-core-5.0.3.jar
Parent topic: Hadoop Client Dependencies
8.2.7.11.1.2 HDFS 3.2.0
accessors-smart-1.2.jar asm-5.0.4.jar avro-1.7.7.jar azure-keyvault-core-1.0.0.jar azure-storage-7.0.0.jar commons-beanutils-1.9.3.jar commons-cli-1.2.jar commons-codec-1.11.jar commons-collections-3.2.2.jar commons-compress-1.4.1.jar commons-configuration2-2.1.1.jar commons-io-2.5.jar commons-lang3-3.7.jar commons-logging-1.1.3.jar commons-math3-3.1.1.jar commons-net-3.6.jar commons-text-1.4.jar curator-client-2.12.0.jar curator-framework-2.12.0.jar curator-recipes-2.12.0.jar dnsjava-2.1.7.jar gson-2.2.4.jar guava-11.0.2.jar hadoop-annotations-3.2.0.jar hadoop-auth-3.2.0.jar hadoop-azure-3.2.0.jar hadoop-client-3.2.0.jar hadoop-common-3.2.0.jar hadoop-hdfs-client-3.2.0.jar hadoop-mapreduce-client-common-3.2.0.jar hadoop-mapreduce-client-core-3.2.0.jar hadoop-mapreduce-client-jobclient-3.2.0.jar hadoop-yarn-api-3.2.0.jar hadoop-yarn-client-3.2.0.jar hadoop-yarn-common-3.2.0.jar htrace-core4-4.1.0-incubating.jar httpclient-4.5.2.jar httpcore-4.4.4.jar jackson-annotations-2.9.5.jar jackson-core-2.6.0.jar jackson-core-asl-1.9.13.jar jackson-databind-2.9.5.jar jackson-jaxrs-base-2.9.5.jar jackson-jaxrs-json-provider-2.9.5.jar jackson-mapper-asl-1.9.13.jar jackson-module-jaxb-annotations-2.9.5.jar javax.servlet-api-3.1.0.jar jaxb-api-2.2.11.jar jcip-annotations-1.0-1.jar jersey-client-1.19.jar jersey-core-1.19.jar jersey-servlet-1.19.jar jetty-security-9.3.24.v20180605.jar jetty-servlet-9.3.24.v20180605.jar jetty-util-9.3.24.v20180605.jar jetty-util-ajax-9.3.24.v20180605.jar jetty-webapp-9.3.24.v20180605.jar jetty-xml-9.3.24.v20180605.jar json-smart-2.3.jar jsp-api-2.1.jar jsr305-3.0.0.jar jsr311-api-1.1.1.jar kerb-admin-1.0.1.jar kerb-client-1.0.1.jar kerb-common-1.0.1.jar kerb-core-1.0.1.jar kerb-crypto-1.0.1.jar kerb-identity-1.0.1.jar kerb-server-1.0.1.jar kerb-simplekdc-1.0.1.jar kerb-util-1.0.1.jar kerby-asn1-1.0.1.jar kerby-config-1.0.1.jar kerby-pkix-1.0.1.jar kerby-util-1.0.1.jar kerby-xdr-1.0.1.jar log4j-1.2.17.jar nimbus-jose-jwt-4.41.1.jar okhttp-2.7.5.jar okio-1.6.0.jar paranamer-2.3.jar protobuf-java-2.5.0.jar re2j-1.1.jar slf4j-api-1.7.25.jar snappy-java-1.0.5.jar stax2-api-3.1.4.jar token-provider-1.0.1.jar wildfly-openssl-1.0.4.Final.jar woodstox-core-5.0.3.jar xz-1.0.jar
Parent topic: Hadoop Client Dependencies
8.2.7.11.1.3 HDFS 3.1.4
accessors-smart-1.2.jar animal-sniffer-annotations-1.17.jar asm-5.0.4.jar avro-1.7.7.jar azure-keyvault-core-1.0.0.jar azure-storage-7.0.0.jar checker-qual-2.5.2.jar commons-beanutils-1.9.4.jar commons-cli-1.2.jar commons-codec-1.11.jar commons-collections-3.2.2.jar commons-compress-1.19.jar commons-configuration2-2.1.1.jar commons-io-2.5.jar commons-lang-2.6.jar commons-lang3-3.4.jar commons-logging-1.1.3.jar commons-math3-3.1.1.jar commons-net-3.6.jar curator-client-2.13.0.jar curator-framework-2.13.0.jar curator-recipes-2.13.0.jar error_prone_annotations-2.2.0.jar failureaccess-1.0.jar gson-2.2.4.jar guava-27.0-jre.jar hadoop-annotations-3.1.4.jar hadoop-auth-3.1.4.jar hadoop-azure-3.1.4.jar hadoop-client-3.1.4.jar hadoop-common-3.1.4.jar hadoop-hdfs-client-3.1.4.jar hadoop-mapreduce-client-common-3.1.4.jar hadoop-mapreduce-client-core-3.1.4.jar hadoop-mapreduce-client-jobclient-3.1.4.jar hadoop-yarn-api-3.1.4.jar hadoop-yarn-client-3.1.4.jar hadoop-yarn-common-3.1.4.jar htrace-core4-4.1.0-incubating.jar httpclient-4.5.2.jar httpcore-4.4.4.jar j2objc-annotations-1.1.jar jackson-annotations-2.9.10.jar jackson-core-2.9.10.jar jackson-core-asl-1.9.13.jar jackson-databind-2.9.10.4.jar jackson-jaxrs-base-2.9.10.jar jackson-jaxrs-json-provider-2.9.10.jar jackson-mapper-asl-1.9.13.jar jackson-module-jaxb-annotations-2.9.10.jar javax.servlet-api-3.1.0.jar jaxb-api-2.2.11.jar jcip-annotations-1.0-1.jar jersey-client-1.19.jar jersey-core-1.19.jar jersey-servlet-1.19.jar jetty-security-9.4.20.v20190813.jar jetty-servlet-9.4.20.v20190813.jar jetty-util-9.4.20.v20190813.jar jetty-util-ajax-9.4.20.v20190813.jar jetty-webapp-9.4.20.v20190813.jar jetty-xml-9.4.20.v20190813.jar json-smart-2.3.jar jsp-api-2.1.jar jsr305-3.0.2.jar jsr311-api-1.1.1.jar kerb-admin-1.0.1.jar kerb-client-1.0.1.jar kerb-common-1.0.1.jar kerb-core-1.0.1.jar kerb-crypto-1.0.1.jar kerb-identity-1.0.1.jar kerb-server-1.0.1.jar kerb-simplekdc-1.0.1.jar kerb-util-1.0.1.jar kerby-asn1-1.0.1.jar kerby-config-1.0.1.jar kerby-pkix-1.0.1.jar kerby-util-1.0.1.jar kerby-xdr-1.0.1.jar listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar log4j-1.2.17.jar nimbus-jose-jwt-7.9.jar okhttp-2.7.5.jar okio-1.6.0.jar paranamer-2.3.jar protobuf-java-2.5.0.jar re2j-1.1.jar slf4j-api-1.7.25.jar snappy-java-1.0.5.jar stax2-api-3.1.4.jar token-provider-1.0.1.jar woodstox-core-5.0.3.jar
Parent topic: Hadoop Client Dependencies
8.2.7.11.1.4 HDFS 3.0.3
accessors-smart-1.2.jar asm-5.0.4.jar avro-1.7.7.jar azure-keyvault-core-0.8.0.jar azure-storage-5.4.0.jar commons-beanutils-1.9.3.jar commons-cli-1.2.jar commons-codec-1.4.jar commons-collections-3.2.2.jar commons-compress-1.4.1.jar commons-configuration2-2.1.1.jar commons-io-2.4.jar commons-lang-2.6.jar commons-lang3-3.4.jar commons-logging-1.1.3.jar commons-math3-3.1.1.jar commons-net-3.6.jar curator-client-2.12.0.jar curator-framework-2.12.0.jar curator-recipes-2.12.0.jar gson-2.2.4.jar guava-11.0.2.jar hadoop-annotations-3.0.3.jar hadoop-auth-3.0.3.jar hadoop-azure-3.0.3.jar hadoop-client-3.0.3.jar hadoop-common-3.0.3.jar hadoop-hdfs-client-3.0.3.jar hadoop-mapreduce-client-common-3.0.3.jar hadoop-mapreduce-client-core-3.0.3.jar hadoop-mapreduce-client-jobclient-3.0.3.jar hadoop-yarn-api-3.0.3.jar hadoop-yarn-client-3.0.3.jar hadoop-yarn-common-3.0.3.jar htrace-core4-4.1.0-incubating.jar httpclient-4.5.2.jar httpcore-4.4.4.jar jackson-annotations-2.7.8.jar jackson-core-2.7.8.jar jackson-core-asl-1.9.13.jar jackson-databind-2.7.8.jar jackson-jaxrs-base-2.7.8.jar jackson-jaxrs-json-provider-2.7.8.jar jackson-mapper-asl-1.9.13.jar jackson-module-jaxb-annotations-2.7.8.jar javax.servlet-api-3.1.0.jar jaxb-api-2.2.11.jar jcip-annotations-1.0-1.jar jersey-client-1.19.jar jersey-core-1.19.jar jersey-servlet-1.19.jar jetty-security-9.3.19.v20170502.jar jetty-servlet-9.3.19.v20170502.jar jetty-util-9.3.19.v20170502.jar jetty-util-ajax-9.3.19.v20170502.jar jetty-webapp-9.3.19.v20170502.jar jetty-xml-9.3.19.v20170502.jar json-smart-2.3.jar jsp-api-2.1.jar jsr305-3.0.0.jar jsr311-api-1.1.1.jar kerb-admin-1.0.1.jar kerb-client-1.0.1.jar kerb-common-1.0.1.jar kerb-core-1.0.1.jar kerb-crypto-1.0.1.jar kerb-identity-1.0.1.jar kerb-server-1.0.1.jar kerb-simplekdc-1.0.1.jar kerb-util-1.0.1.jar kerby-asn1-1.0.1.jar kerby-config-1.0.1.jar kerby-pkix-1.0.1.jar kerby-util-1.0.1.jar kerby-xdr-1.0.1.jar log4j-1.2.17.jar nimbus-jose-jwt-4.41.1.jar okhttp-2.7.5.jar okio-1.6.0.jar paranamer-2.3.jar protobuf-java-2.5.0.jar re2j-1.1.jar slf4j-api-1.7.25.jar snappy-java-1.0.5.jar stax2-api-3.1.4.jar token-provider-1.0.1.jar woodstox-core-5.0.3.jar xz-1.0.jar
Parent topic: Hadoop Client Dependencies
8.2.7.11.1.5 HDFS 2.9.2
accessors-smart-1.2.jar activation-1.1.jar apacheds-i18n-2.0.0-M15.jar apacheds-kerberos-codec-2.0.0-M15.jar api-asn1-api-1.0.0-M20.jar api-util-1.0.0-M20.jar asm-5.0.4.jar avro-1.7.7.jar azure-keyvault-core-0.8.0.jar azure-storage-5.4.0.jar commons-beanutils-1.7.0.jar commons-beanutils-core-1.8.0.jar commons-cli-1.2.jar commons-codec-1.4.jar commons-collections-3.2.2.jar commons-compress-1.4.1.jar commons-configuration-1.6.jar commons-digester-1.8.jar commons-io-2.4.jar commons-lang-2.6.jar commons-lang3-3.4.jar commons-logging-1.1.3.jar commons-math3-3.1.1.jar commons-net-3.1.jar curator-client-2.7.1.jar curator-framework-2.7.1.jar curator-recipes-2.7.1.jar ehcache-3.3.1.jar geronimo-jcache_1.0_spec-1.0-alpha-1.jar gson-2.2.4.jar guava-11.0.2.jar hadoop-annotations-2.9.2.jar hadoop-auth-2.9.2.jar hadoop-azure-2.9.2.jar hadoop-client-2.9.2.jar hadoop-common-2.9.2.jar hadoop-hdfs-client-2.9.2.jar hadoop-mapreduce-client-app-2.9.2.jar hadoop-mapreduce-client-common-2.9.2.jar hadoop-mapreduce-client-core-2.9.2.jar hadoop-mapreduce-client-jobclient-2.9.2.jar hadoop-mapreduce-client-shuffle-2.9.2.jar hadoop-yarn-api-2.9.2.jar hadoop-yarn-client-2.9.2.jar hadoop-yarn-common-2.9.2.jar hadoop-yarn-registry-2.9.2.jar hadoop-yarn-server-common-2.9.2.jar HikariCP-java7-2.4.12.jar htrace-core4-4.1.0-incubating.jar httpclient-4.5.2.jar httpcore-4.4.4.jar jackson-annotations-2.4.0.jar jackson-core-2.7.8.jar jackson-core-asl-1.9.13.jar jackson-databind-2.4.0.jar jackson-jaxrs-1.9.13.jar jackson-mapper-asl-1.9.13.jar jackson-xc-1.9.13.jar jaxb-api-2.2.2.jar jcip-annotations-1.0-1.jar jersey-client-1.9.jar jersey-core-1.9.jar jetty-sslengine-6.1.26.jar jetty-util-6.1.26.jar json-smart-2.3.jar jsp-api-2.1.jar jsr305-3.0.0.jar leveldbjni-all-1.8.jar log4j-1.2.17.jar mssql-jdbc-6.2.1.jre7.jar netty-3.7.0.Final.jar nimbus-jose-jwt-4.41.1.jar okhttp-2.7.5.jar okio-1.6.0.jar paranamer-2.3.jar protobuf-java-2.5.0.jar servlet-api-2.5.jar slf4j-api-1.7.25.jar slf4j-log4j12-1.7.25.jar snappy-java-1.0.5.jar stax2-api-3.1.4.jar stax-api-1.0-2.jar woodstox-core-5.0.3.jar xmlenc-0.52.jar xz-1.0.jar zookeeper-3.4.6.jar
Parent topic: Hadoop Client Dependencies
8.2.7.11.1.6 HDFS 2.8.5
accessors-smart-1.2.jar activation-1.1.jar apacheds-i18n-2.0.0-M15.jar apacheds-kerberos-codec-2.0.0-M15.jar api-asn1-api-1.0.0-M20.jar api-util-1.0.0-M20.jar asm-5.0.4.jar avro-1.7.4.jar azure-storage-2.2.0.jar commons-beanutils-1.7.0.jar commons-beanutils-core-1.8.0.jar commons-cli-1.2.jar commons-codec-1.4.jar commons-collections-3.2.2.jar commons-compress-1.4.1.jar commons-configuration-1.6.jar commons-digester-1.8.jar commons-io-2.4.jar commons-lang-2.6.jar commons-lang3-3.3.2.jar commons-logging-1.1.3.jar commons-math3-3.1.1.jar commons-net-3.1.jar curator-client-2.7.1.jar curator-framework-2.7.1.jar curator-recipes-2.7.1.jar gson-2.2.4.jar guava-11.0.2.jar hadoop-annotations-2.8.5.jar hadoop-auth-2.8.5.jar hadoop-azure-2.8.5.jar hadoop-client-2.8.5.jar hadoop-common-2.8.5.jar hadoop-hdfs-client-2.8.5.jar hadoop-mapreduce-client-app-2.8.5.jar hadoop-mapreduce-client-common-2.8.5.jar hadoop-mapreduce-client-core-2.8.5.jar hadoop-mapreduce-client-jobclient-2.8.5.jar hadoop-mapreduce-client-shuffle-2.8.5.jar hadoop-yarn-api-2.8.5.jar hadoop-yarn-client-2.8.5.jar hadoop-yarn-common-2.8.5.jar hadoop-yarn-server-common-2.8.5.jar htrace-core4-4.0.1-incubating.jar httpclient-4.5.2.jar httpcore-4.4.4.jar jackson-core-2.2.3.jar jackson-core-asl-1.9.13.jar jackson-jaxrs-1.9.13.jar jackson-mapper-asl-1.9.13.jar jackson-xc-1.9.13.jar jaxb-api-2.2.2.jar jcip-annotations-1.0-1.jar jersey-client-1.9.jar jersey-core-1.9.jar jetty-sslengine-6.1.26.jar jetty-util-6.1.26.jar json-smart-2.3.jar jsp-api-2.1.jar jsr305-3.0.0.jar leveldbjni-all-1.8.jar log4j-1.2.17.jar netty-3.7.0.Final.jar nimbus-jose-jwt-4.41.1.jar okhttp-2.4.0.jar okio-1.4.0.jar paranamer-2.3.jar protobuf-java-2.5.0.jar servlet-api-2.5.jar slf4j-api-1.7.10.jar slf4j-log4j12-1.7.10.jar snappy-java-1.0.4.1.jar stax-api-1.0-2.jar xmlenc-0.52.jar xz-1.0.jar zookeeper-3.4.6.jar
Parent topic: Hadoop Client Dependencies
8.2.7.11.1.7 HDFS 2.7.7
HDFS 2.7.7 (HDFS 2.7.0 is effectively the same; simply substitute 2.7.0 for the libraries versioned as 2.7.7)
activation-1.1.jar apacheds-i18n-2.0.0-M15.jar apacheds-kerberos-codec-2.0.0-M15.jar api-asn1-api-1.0.0-M20.jar api-util-1.0.0-M20.jar avro-1.7.4.jar azure-storage-2.0.0.jar commons-beanutils-1.7.0.jar commons-beanutils-core-1.8.0.jar commons-cli-1.2.jar commons-codec-1.4.jar commons-collections-3.2.2.jar commons-compress-1.4.1.jar commons-configuration-1.6.jar commons-digester-1.8.jar commons-httpclient-3.1.jar commons-io-2.4.jar commons-lang-2.6.jar commons-lang3-3.3.2.jar commons-logging-1.1.3.jar commons-math3-3.1.1.jar commons-net-3.1.jar curator-client-2.7.1.jar curator-framework-2.7.1.jar curator-recipes-2.7.1.jar gson-2.2.4.jar guava-11.0.2.jar hadoop-annotations-2.7.7.jar hadoop-auth-2.7.7.jar hadoop-azure-2.7.7.jar hadoop-client-2.7.7.jar hadoop-common-2.7.7.jar hadoop-hdfs-2.7.7.jar hadoop-mapreduce-client-app-2.7.7.jar hadoop-mapreduce-client-common-2.7.7.jar hadoop-mapreduce-client-core-2.7.7.jar hadoop-mapreduce-client-jobclient-2.7.7.jar hadoop-mapreduce-client-shuffle-2.7.7.jar hadoop-yarn-api-2.7.7.jar hadoop-yarn-client-2.7.7.jar hadoop-yarn-common-2.7.7.jar hadoop-yarn-server-common-2.7.7.jar htrace-core-3.1.0-incubating.jar httpclient-4.2.5.jar httpcore-4.2.4.jar jackson-core-2.2.3.jar jackson-core-asl-1.9.13.jar jackson-jaxrs-1.9.13.jar jackson-mapper-asl-1.9.13.jar jackson-xc-1.9.13.jar jaxb-api-2.2.2.jar jersey-client-1.9.jar jersey-core-1.9.jar jetty-sslengine-6.1.26.jar jetty-util-6.1.26.jar jsp-api-2.1.jar jsr305-3.0.0.jar leveldbjni-all-1.8.jar log4j-1.2.17.jar netty-3.6.2.Final.jar netty-all-4.0.23.Final.jar paranamer-2.3.jar protobuf-java-2.5.0.jar servlet-api-2.5.jar slf4j-api-1.7.10.jar slf4j-log4j12-1.7.10.jar snappy-java-1.0.4.1.jar stax-api-1.0-2.jar xercesImpl-2.9.1.jar xml-apis-1.3.04.jar xmlenc-0.52.jar xz-1.0.jar zookeeper-3.4.6.jar
Parent topic: Hadoop Client Dependencies
8.2.7.11.1.8 HDFS 2.6.0
activation-1.1.jar apacheds-i18n-2.0.0-M15.jar apacheds-kerberos-codec-2.0.0-M15.jar api-asn1-api-1.0.0-M20.jar api-util-1.0.0-M20.jar avro-1.7.4.jar commons-beanutils-1.7.0.jar commons-beanutils-core-1.8.0.jar commons-cli-1.2.jar commons-codec-1.4.jar commons-collections-3.2.1.jar commons-compress-1.4.1.jar commons-configuration-1.6.jar commons-digester-1.8.jar commons-httpclient-3.1.jar commons-io-2.4.jar commons-lang-2.6.jar commons-logging-1.1.3.jar commons-math3-3.1.1.jar commons-net-3.1.jar curator-client-2.6.0.jar curator-framework-2.6.0.jar curator-recipes-2.6.0.jar gson-2.2.4.jar guava-11.0.2.jar hadoop-annotations-2.6.0.jar hadoop-auth-2.6.0.jar hadoop-client-2.6.0.jar hadoop-common-2.6.0.jar hadoop-hdfs-2.6.0.jar hadoop-mapreduce-client-app-2.6.0.jar hadoop-mapreduce-client-common-2.6.0.jar hadoop-mapreduce-client-core-2.6.0.jar hadoop-mapreduce-client-jobclient-2.6.0.jar hadoop-mapreduce-client-shuffle-2.6.0.jar hadoop-yarn-api-2.6.0.jar hadoop-yarn-client-2.6.0.jar hadoop-yarn-common-2.6.0.jar hadoop-yarn-server-common-2.6.0.jar htrace-core-3.0.4.jar httpclient-4.2.5.jar httpcore-4.2.4.jar jackson-core-asl-1.9.13.jar jackson-jaxrs-1.9.13.jar jackson-mapper-asl-1.9.13.jar jackson-xc-1.9.13.jar jaxb-api-2.2.2.jar jersey-client-1.9.jar jersey-core-1.9.jar jetty-util-6.1.26.jar jsr305-1.3.9.jar leveldbjni-all-1.8.jar log4j-1.2.17.jar netty-3.6.2.Final.jar paranamer-2.3.jar protobuf-java-2.5.0.jar servlet-api-2.5.jar slf4j-api-1.7.5.jar slf4j-log4j12-1.7.5.jar snappy-java-1.0.4.1.jar stax-api-1.0-2.jar xercesImpl-2.9.1.jar xml-apis-1.3.04.jar xmlenc-0.52.jar xz-1.0.jar zookeeper-3.4.6.jar
Parent topic: Hadoop Client Dependencies
8.2.7.11.1.9 HDFS 2.5.2
HDFS 2.5.2 (HDFS 2.5.1 and 2.5.0 are effectively the same; simply substitute 2.5.1 or 2.5.0 for the libraries versioned as 2.5.2)
activation-1.1.jar apacheds-i18n-2.0.0-M15.jar apacheds-kerberos-codec-2.0.0-M15.jar api-asn1-api-1.0.0-M20.jar api-util-1.0.0-M20.jar avro-1.7.4.jar commons-beanutils-1.7.0.jar commons-beanutils-core-1.8.0.jar commons-cli-1.2.jar commons-codec-1.4.jar commons-collections-3.2.1.jar commons-compress-1.4.1.jar commons-configuration-1.6.jar commons-digester-1.8.jar commons-httpclient-3.1.jar commons-io-2.4.jar commons-lang-2.6.jar commons-logging-1.1.3.jar commons-math3-3.1.1.jar commons-net-3.1.jar guava-11.0.2.jar hadoop-annotations-2.5.2.jar hadoop-auth-2.5.2.jar hadoop-client-2.5.2.jar hadoop-common-2.5.2.jar hadoop-hdfs-2.5.2.jar hadoop-mapreduce-client-app-2.5.2.jar hadoop-mapreduce-client-common-2.5.2.jar hadoop-mapreduce-client-core-2.5.2.jar hadoop-mapreduce-client-jobclient-2.5.2.jar hadoop-mapreduce-client-shuffle-2.5.2.jar hadoop-yarn-api-2.5.2.jar hadoop-yarn-client-2.5.2.jar hadoop-yarn-common-2.5.2.jar hadoop-yarn-server-common-2.5.2.jar httpclient-4.2.5.jar httpcore-4.2.4.jar jackson-core-asl-1.9.13.jar jackson-jaxrs-1.9.13.jar jackson-mapper-asl-1.9.13.jar jackson-xc-1.9.13.jar jaxb-api-2.2.2.jar jersey-client-1.9.jar jersey-core-1.9.jar jetty-util-6.1.26.jar jsr305-1.3.9.jar leveldbjni-all-1.8.jar log4j-1.2.17.jar netty-3.6.2.Final.jar paranamer-2.3.jar protobuf-java-2.5.0.jar servlet-api-2.5.jar slf4j-api-1.7.5.jar slf4j-log4j12-1.7.5.jar snappy-java-1.0.4.1.jar stax-api-1.0-2.jar xmlenc-0.52.jar xz-1.0.jar zookeeper-3.4.6.jar
Parent topic: Hadoop Client Dependencies
8.2.7.11.1.10 HDFS 2.4.1
HDFS 2.4.1 (HDFS 2.4.0 is effectively the same; simply substitute 2.4.0 for the libraries versioned as 2.4.1)
activation-1.1.jar avro-1.7.4.jar commons-beanutils-1.7.0.jar commons-beanutils-core-1.8.0.jar commons-cli-1.2.jar commons-codec-1.4.jar commons-collections-3.2.1.jar commons-compress-1.4.1.jar commons-configuration-1.6.jar commons-digester-1.8.jar commons-httpclient-3.1.jar commons-io-2.4.jar commons-lang-2.6.jar commons-logging-1.1.3.jar commons-math3-3.1.1.jar commons-net-3.1.jar guava-11.0.2.jar hadoop-annotations-2.4.1.jar hadoop-auth-2.4.1.jar hadoop-client-2.4.1.jar hadoop-hdfs-2.4.1.jar hadoop-mapreduce-client-app-2.4.1.jar hadoop-mapreduce-client-common-2.4.1.jar hadoop-mapreduce-client-core-2.4.1.jar hadoop-mapreduce-client-jobclient-2.4.1.jar hadoop-mapreduce-client-shuffle-2.4.1.jar hadoop-yarn-api-2.4.1.jar hadoop-yarn-client-2.4.1.jar hadoop-yarn-common-2.4.1.jar hadoop-yarn-server-common-2.4.1.jar httpclient-4.2.5.jar httpcore-4.2.4.jar jackson-core-asl-1.8.8.jar jackson-mapper-asl-1.8.8.jar jaxb-api-2.2.2.jar jersey-client-1.9.jar jersey-core-1.9.jar jetty-util-6.1.26.jar jsr305-1.3.9.jar log4j-1.2.17.jar paranamer-2.3.jar protobuf-java-2.5.0.jar servlet-api-2.5.jar slf4j-api-1.7.5.jar slf4j-log4j12-1.7.5.jar snappy-java-1.0.4.1.jar stax-api-1.0-2.jar xmlenc-0.52.jar xz-1.0.jar zookeeper-3.4.5.jar hadoop-common-2.4.1.jar
Parent topic: Hadoop Client Dependencies
8.2.7.11.1.11 HDFS 2.3.0
activation-1.1.jar avro-1.7.4.jar commons-beanutils-1.7.0.jar commons-beanutils-core-1.8.0.jar commons-cli-1.2.jar commons-codec-1.4.jar commons-collections-3.2.1.jar commons-compress-1.4.1.jar commons-configuration-1.6.jar commons-digester-1.8.jar commons-httpclient-3.1.jar commons-io-2.4.jar commons-lang-2.6.jar commons-logging-1.1.3.jar commons-math3-3.1.1.jar commons-net-3.1.jar guava-11.0.2.jar hadoop-annotations-2.3.0.jar hadoop-auth-2.3.0.jar hadoop-client-2.3.0.jar hadoop-common-2.3.0.jar hadoop-hdfs-2.3.0.jar hadoop-mapreduce-client-app-2.3.0.jar hadoop-mapreduce-client-common-2.3.0.jar hadoop-mapreduce-client-core-2.3.0.jar hadoop-mapreduce-client-jobclient-2.3.0.jar hadoop-mapreduce-client-shuffle-2.3.0.jar hadoop-yarn-api-2.3.0.jar hadoop-yarn-client-2.3.0.jar hadoop-yarn-common-2.3.0.jar hadoop-yarn-server-common-2.3.0.jar httpclient-4.2.5.jar httpcore-4.2.4.jar jackson-core-asl-1.8.8.jar jackson-mapper-asl-1.8.8.jar jaxb-api-2.2.2.jar jersey-core-1.9.jar jetty-util-6.1.26.jar jsr305-1.3.9.jar log4j-1.2.17.jar paranamer-2.3.jar protobuf-java-2.5.0.jar servlet-api-2.5.jar slf4j-api-1.7.5.jar slf4j-log4j12-1.7.5.jar snappy-java-1.0.4.1.jar stax-api-1.0-2.jar xmlenc-0.52.jar xz-1.0.jar zookeeper-3.4.5.jar
Parent topic: Hadoop Client Dependencies
8.2.7.11.1.12 HDFS 2.2.0
activation-1.1.jar aopalliance-1.0.jar asm-3.1.jar avro-1.7.4.jar commons-beanutils-1.7.0.jar commons-beanutils-core-1.8.0.jar commons-cli-1.2.jar commons-codec-1.4.jar commons-collections-3.2.1.jar commons-compress-1.4.1.jar commons-configuration-1.6.jar commons-digester-1.8.jar commons-httpclient-3.1.jar commons-io-2.1.jar commons-lang-2.5.jar commons-logging-1.1.1.jar commons-math-2.1.jar commons-net-3.1.jar gmbal-api-only-3.0.0-b023.jar grizzly-framework-2.1.2.jar grizzly-http-2.1.2.jar grizzly-http-server-2.1.2.jar grizzly-http-servlet-2.1.2.jar grizzly-rcm-2.1.2.jar guava-11.0.2.jar guice-3.0.jar hadoop-annotations-2.2.0.jar hadoop-auth-2.2.0.jar hadoop-client-2.2.0.jar hadoop-common-2.2.0.jar hadoop-hdfs-2.2.0.jar hadoop-mapreduce-client-app-2.2.0.jar hadoop-mapreduce-client-common-2.2.0.jar hadoop-mapreduce-client-core-2.2.0.jar hadoop-mapreduce-client-jobclient-2.2.0.jar hadoop-mapreduce-client-shuffle-2.2.0.jar hadoop-yarn-api-2.2.0.jar hadoop-yarn-client-2.2.0.jar hadoop-yarn-common-2.2.0.jar hadoop-yarn-server-common-2.2.0.jar jackson-core-asl-1.8.8.jar jackson-jaxrs-1.8.3.jar jackson-mapper-asl-1.8.8.jar jackson-xc-1.8.3.jar javax.inject-1.jar javax.servlet-3.1.jar javax.servlet-api-3.0.1.jar jaxb-api-2.2.2.jar jaxb-impl-2.2.3-1.jar jersey-client-1.9.jar jersey-core-1.9.jar jersey-grizzly2-1.9.jar jersey-guice-1.9.jar jersey-json-1.9.jar jersey-server-1.9.jar jersey-test-framework-core-1.9.jar jersey-test-framework-grizzly2-1.9.jar jettison-1.1.jar jetty-util-6.1.26.jar jsr305-1.3.9.jar log4j-1.2.17.jar management-api-3.0.0-b012.jar paranamer-2.3.jar protobuf-java-2.5.0.jar slf4j-api-1.7.5.jar slf4j-log4j12-1.7.5.jar snappy-java-1.0.4.1.jar stax-api-1.0.1.jar xmlenc-0.52.jar xz-1.0.jar zookeeper-3.4.5.jar
Parent topic: Hadoop Client Dependencies
8.2.8 Apache Kafka
The Kafka Handler is designed to stream change capture data from an Oracle GoldenGate trail to a Kafka topic.
This chapter describes how to use the Kafka Handler.
- Apache Kafka
The Kafka Handler is designed to stream change capture data from an Oracle GoldenGate trail to a Kafka topic. - Apache Kafka Connect Handler
The Kafka Connect Handler is an extension of the standard Kafka messaging functionality. - Apache Kafka REST Proxy
The Kafka REST Proxy Handler streams messages to the Kafka REST Proxy distributed by Confluent.
Parent topic: Target
8.2.8.1 Apache Kafka
The Kafka Handler is designed to stream change capture data from an Oracle GoldenGate trail to a Kafka topic.
This chapter describes how to use the Kafka Handler.
- Overview
- Detailed Functionality
- Setting Up and Running the Kafka Handler
- Schema Propagation
- Performance Considerations
- About Security
- Metadata Change Events
- Snappy Considerations
- Kafka Interceptor Support
The Kafka Producer client framework supports the use of Producer Interceptors. A Producer Interceptor is simply a user exit from the Kafka Producer client whereby the Interceptor object is instantiated and receives notifications of Kafka message send calls and Kafka message send acknowledgement calls. - Kafka Partition Selection
Kafka topics comprise one or more partitions. Distribution to multiple partitions is a good way to improve Kafka ingest performance, because the Kafka client parallelizes message sending to different topic/partition combinations. Partition selection is controlled by the following calculation in the Kafka client. - Troubleshooting
- Kafka Handler Client Dependencies
What are the dependencies for the Kafka Handler to connect to Apache Kafka?
Parent topic: Apache Kafka
8.2.8.1.1 Overview
The Oracle GoldenGate for Big Data Kafka Handler streams change capture data from an Oracle GoldenGate trail to a Kafka topic. Additionally, the Kafka Handler provides functionality to publish messages to a separate schema topic. Schema publication for Avro and JSON is supported.
Apache Kafka is an open source, distributed, partitioned, and replicated messaging service, see http://kafka.apache.org/.
Kafka can be run as a single instance or as a cluster on multiple servers. Each Kafka server instance is called a broker. A Kafka topic is a category or feed name to which messages are published by the producers and retrieved by consumers.
The Kafka Handler implements a Kafka producer. The Kafka producer writes serialized change data capture from multiple source tables either to a single configured topic or, when the topic name is resolved from the fully-qualified source table name, to a separate Kafka topic for each source table.
Parent topic: Apache Kafka
8.2.8.1.2 Detailed Functionality
Transaction Versus Operation Mode
The Kafka Handler sends instances of the Kafka ProducerRecord
class to the Kafka producer API, which in turn publishes the ProducerRecord
to a Kafka topic. The Kafka ProducerRecord
effectively is the implementation of a Kafka message. The ProducerRecord
has two components: a key and a value. Both the key and value are represented as byte arrays by the Kafka Handler. This section describes how the Kafka Handler publishes data.
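The following is a minimal, illustrative Java sketch (not part of the product) showing how a ProducerRecord with a byte-array key and value is built and sent through the Kafka producer API; the topic name, key, and payload are hypothetical stand-ins for what the Kafka Handler generates internally.
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerRecordSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            // In operation mode the key is the fully qualified source table name (hypothetical value below).
            byte[] key = "QASOURCE.TCUSTMER".getBytes(StandardCharsets.UTF_8);
            // The value is the serialized operation data (or the concatenated transaction data in transaction mode).
            byte[] value = "<serialized change data>".getBytes(StandardCharsets.UTF_8);
            producer.send(new ProducerRecord<>("oggtopic", key, value));
        }
    }
}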
Transaction Mode
The following configuration sets the Kafka Handler to transaction mode:
gg.handler.name.Mode=tx
In transaction mode, the serialized data is concatenated for every operation in a transaction from the source Oracle GoldenGate trail files. The contents of the concatenated operation data is the value of the Kafka ProducerRecord
object. The key of the Kafka ProducerRecord
object is NULL. The result is that Kafka messages comprise data from 1 to N operations, where N is the number of operations in the transaction.
For grouped transactions, all the data for all the operations are concatenated into a single Kafka message. Therefore, grouped transactions may result in very large Kafka messages that contain data for a large number of operations.
Operation Mode
The following configuration sets the Kafka Handler to operation mode:
gg.handler.name.Mode=op
In operation mode, the serialized data for each operation is placed into an individual ProducerRecord
object as the value. The ProducerRecord
key is the fully qualified table name of the source operation. The ProducerRecord
is immediately sent using the Kafka Producer API. This means that there is a 1 to 1 relationship between the incoming operations and the number of Kafka messages produced.
Topic Name Selection
The topic is resolved at runtime using this configuration parameter:
gg.handler.name.topicMappingTemplate
You can configure a static string, keywords, or a combination of static strings and keywords to dynamically resolve the topic name at runtime based on the context of the current operation, see Using Templates to Resolve the Topic Name and Message Key.
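For example, the following is a minimal sketch (the handler name kafkahandler and the ogg_ prefix are illustrative) that routes each source table to its own topic by combining a static string with the ${fullyQualifiedTableName} keyword:
gg.handler.kafkahandler.topicMappingTemplate=ogg_${fullyQualifiedTableName}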
Kafka Broker Settings
To configure topics to be created automatically, set the auto.create.topics.enable
property to true
. This is the default setting.
If you set the auto.create.topics.enable
property to false
, then you must manually create topics before you start the Replicat process.
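If automatic topic creation is disabled, a topic can be created manually with the Kafka topic tool. The following is an illustrative sketch only; the topic name, partition count, and replication factor are examples, and older Kafka releases use --zookeeper instead of --bootstrap-server:
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --topic oggtopic --partitions 3 --replication-factor 1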
Schema Propagation
The schema data for all tables is delivered to the schema topic that is configured with the schemaTopicName
property. For more information, see Schema Propagation.
Parent topic: Apache Kafka
8.2.8.1.3 Setting Up and Running the Kafka Handler
Instructions for configuring the Kafka Handler components and running the handler are described in this section.
You must install and correctly configure Kafka either as a single node or a clustered instance, see http://kafka.apache.org/documentation.html.
If you are using a Kafka distribution other than Apache Kafka, then consult the documentation for your Kafka distribution for installation and configuration instructions.
Zookeeper, a prerequisite component for Kafka and Kafka broker (or brokers), must be up and running.
Oracle recommends and considers it best practice that the data topic and the schema topic (if applicable) are preconfigured on the running Kafka brokers. You can create Kafka topics dynamically. However, this relies on the Kafka brokers being configured to allow dynamic topics.
If the Kafka broker is not collocated with the Kafka Handler process, then the remote host port must be reachable from the machine running the Kafka Handler.
- Classpath Configuration
- Kafka Handler Configuration
- Java Adapter Properties File
- Kafka Producer Configuration File
- Using Templates to Resolve the Topic Name and Message Key
The Kafka Handler provides functionality to resolve the topic name and the message key at runtime using a template configuration value. Templates allow you to configure static values and keywords. Keywords are used to dynamically resolve content at runtime and inject that resolved value into the resolved string. - Kafka Configuring with Kerberos
- Kafka SSL Support
Kafka supports SSL connectivity between Kafka clients and the Kafka cluster. SSL connectivity provides both authentication and encryption of messages transported between the client and the server.
Parent topic: Apache Kafka
8.2.8.1.3.1 Classpath Configuration
For the Kafka Handler to connect to Kafka and run, the Kafka Producer properties
file and the Kafka client JARs must be configured in the
gg.classpath
configuration variable. The Kafka client JARs must
match the version of Kafka that the Kafka Handler is connecting to. For a list of
the required client JAR files by version, see Kafka Handler Client Dependencies.
The recommended storage location for the Kafka Producer properties file is the Oracle GoldenGate dirprm
directory.
The default location of the Kafka client JARs is Kafka_Home
/libs/*.
The gg.classpath
must be configured precisely. The path to the Kafka Producer properties file must be specified with no wildcard appended. If the *
wildcard is included in the path to the Kafka Producer properties file, the file is not picked up. Conversely, the path to the dependency JARs must include the *
wildcard character in order to include all of the JAR files in that directory in the associated classpath. Do not use *.jar.
The following is an example of the correctly configured classpath:
gg.classpath={kafka install dir}/libs/*
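Because the Kafka Producer properties file is recommended to be stored in the Oracle GoldenGate dirprm directory, that directory typically also appears on the classpath. The following is an illustrative sketch only (the Kafka installation path is an example):
gg.classpath=dirprm:/opt/kafka/libs/*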
Parent topic: Setting Up and Running the Kafka Handler
8.2.8.1.3.2 Kafka Handler Configuration
The following are the configurable values for the Kafka Handler. These properties are located in the Java Adapter properties file (not in the Replicat properties file).
To enable the selection of the Kafka Handler, you must first configure the handler
type by specifying gg.handler.name.type=kafka
and the other
Kafka properties as follows:
Table 8-10 Configuration Properties for Kafka Handler
Property Name | Required / Optional | Property Value | Default | Description
---|---|---|---|---
gg.handlerlist | Required | One or more handler names | None | List of handlers to be used.
gg.handler.name.type | Required | kafka | None | Type of handler to use.
gg.handler.name.KafkaProducerConfigFile | Optional | Any custom file name | — | Filename in classpath that holds Apache Kafka properties to configure the Apache Kafka producer.
gg.handler.name.format | Optional | Formatter class or short code | — | Formatter to use to format the payload.
gg.handler.name.schemaTopicName | Required when schema delivery is required | Name of the schema topic | None | Topic name where schema data will be delivered. If this property is not set, the schema is not propagated. Schemas are propagated only for Avro formatters.
gg.handler.name.SchemaPrClassName | Optional | Fully qualified class name of a custom class that implements the Oracle GoldenGate for Big Data Kafka Handler's interface for creating producer records | A default implementation class is provided | The schema is also propagated as a producer record whose key is the fully qualified table name.
gg.handler.name.mode | Optional | tx or op | — | With the Kafka Handler in operation mode, each change capture data record (Insert, Update, Delete, and so on) payload is represented as a Kafka producer record and is flushed one at a time. With the Kafka Handler in transaction mode, all operations within a source transaction are represented as a single Kafka producer record. This combined byte payload is flushed on a transaction commit event.
gg.handler.name.topicMappingTemplate | Required | A template string value to resolve the Kafka topic name at runtime | None | See Using Templates to Resolve the Topic Name and Message Key.
gg.handler.name.keyMappingTemplate | Required | A template string value to resolve the Kafka message key at runtime | None | See Using Templates to Resolve the Topic Name and Message Key.
gg.handler.name.metaHeadersTemplate | Optional | Comma-delimited list of metacolumn keywords | None | Allows the user to select metacolumns to inject context-based key-value pairs into Kafka message headers using the metacolumn keyword syntax.
Parent topic: Setting Up and Running the Kafka Handler
8.2.8.1.3.3 Java Adapter Properties File
The following is a sample configuration for the Kafka Handler from the Adapter properties file:
gg.handlerlist = kafkahandler
gg.handler.kafkahandler.Type = kafka
gg.handler.kafkahandler.KafkaProducerConfigFile = custom_kafka_producer.properties
gg.handler.kafkahandler.topicMappingTemplate=oggtopic
gg.handler.kafkahandler.keyMappingTemplate=${currentTimestamp}
gg.handler.kafkahandler.Format = avro_op
gg.handler.kafkahandler.SchemaTopicName = oggSchemaTopic
gg.handler.kafkahandler.SchemaPrClassName = com.company.kafkaProdRec.SchemaRecord
gg.handler.kafkahandler.Mode = tx
You can find a sample Replicat configuration and a Java Adapter Properties file for a Kafka integration in the following directory:
GoldenGate_install_directory
/AdapterExamples/big-data/kafka
Parent topic: Setting Up and Running the Kafka Handler
8.2.8.1.3.4 Kafka Producer Configuration File
The Kafka Handler must access a Kafka producer configuration file in order to publish messages to Kafka. The file name of the Kafka producer configuration file is controlled by the following configuration in the Kafka Handler properties.
gg.handler.kafkahandler.KafkaProducerConfigFile=custom_kafka_producer.properties
The Kafka Handler attempts to locate and load the Kafka producer configuration file by using the Java classpath. Therefore, the Java classpath must include the directory containing the Kafka Producer Configuration File.
The Kafka producer configuration file contains Kafka proprietary properties. The Kafka documentation provides configuration information for the 0.8.2.0 Kafka producer interface properties. The Kafka Handler uses these properties to resolve the host and port of the Kafka brokers, and properties in the Kafka producer configuration file control the behavior of the interaction between the Kafka producer client and the Kafka brokers.
A sample configuration file for the Kafka producer is as follows:
bootstrap.servers=localhost:9092
acks = 1
compression.type = gzip
reconnect.backoff.ms = 1000
value.serializer = org.apache.kafka.common.serialization.ByteArraySerializer
key.serializer = org.apache.kafka.common.serialization.ByteArraySerializer
# 100KB per partition
batch.size = 102400
linger.ms = 0
max.request.size = 1048576
send.buffer.bytes = 131072
8.2.8.1.3.4.1 Encrypt Kafka Producer Properties
Sensitive properties in the Kafka Producer configuration file, such as user names and passwords, can be replaced with references to credentials stored in the Oracle GoldenGate credential store (wallet). For more information about how to use Credential Store, see Using Identities in Oracle GoldenGate Credential Store.
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required
username="alice" password="alice";
can be replaced with:
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required
username=ORACLEWALLETUSERNAME[alias domain_name] password=ORACLEWALLETPASSWORD[alias
domain_name];
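The following is an illustrative sketch only of adding a credential to the credential store before referencing it from the Kafka Producer properties file; the user name, password, alias, and domain are hypothetical, and the exact GGSCI syntax should be confirmed against the Credential Store documentation referenced above:
GGSCI> ADD CREDENTIALSTORE
GGSCI> ALTER CREDENTIALSTORE ADD USER alice PASSWORD alice ALIAS alice_alias DOMAIN kafka_domain
The alias and domain configured here are the values referenced by ORACLEWALLETUSERNAME[alias domain_name] and ORACLEWALLETPASSWORD[alias domain_name].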
Parent topic: Kafka Producer Configuration File
8.2.8.1.3.5 Using Templates to Resolve the Topic Name and Message Key
The Kafka Handler provides functionality to resolve the topic name and the message key at runtime using a template configuration value. Templates allow you to configure static values and keywords. Keywords are used to dynamically resolve content at runtime and inject that resolved value into the resolved string.
The templates use the following configuration properties:
gg.handler.name.topicMappingTemplate
gg.handler.name.keyMappingTemplate
Template Modes
Source database transactions are made up of one or more individual operations: the individual inserts, updates, and deletes. The Kafka Handler can be configured to send one message per operation (insert, update, delete), or it can be configured to group operations into messages at the transaction level. Many template keywords resolve data based on the context of an individual source database operation, so many of the keywords do not work when messages are sent at the transaction level. For example, ${fullyQualifiedTableName} resolves to the qualified source table name of an operation, but a transaction can contain operations for many source tables. Resolving the fully qualified table name for messages at the transaction level is therefore non-deterministic, and the process abends at runtime.
Parent topic: Setting Up and Running the Kafka Handler
8.2.8.1.3.6 Kafka Configuring with Kerberos
Use these steps to configure a Kafka Handler Replicat with Kerberos to enable a Cloudera instance to process an Oracle GoldenGate for Big Data trail to a Kafka topic:
- In GGSCI, add a Kafka
Replicat:
GGSCI> add replicat kafka, exttrail dirdat/gg
- Configure a
prm
file with these properties:
replicat kafka
discardfile ./dirrpt/kafkax.dsc, purge
SETENV (TZ=PST8PDT)
GETTRUNCATES
GETUPDATEBEFORES
ReportCount Every 1000 Records, Rate
MAP qasource.*, target qatarget.*;
- Configure a Replicat properties file as
follows:
###KAFKA Properties file ###
gg.log=log4j
gg.log.level=info
gg.report.time=30sec
###Kafka Classpath settings ###
gg.classpath=/opt/cloudera/parcels/KAFKA-2.1.0-1.2.1.0.p0.115/lib/kafka/libs/*
jvm.bootoptions=-Xmx64m -Xms64m -Djava.class.path=./ggjava/ggjava.jar -Dlog4j.configuration=log4j.properties -Djava.security.auth.login.config=/scratch/ydama/ogg/v123211/dirprm/jaas.conf -Djava.security.krb5.conf=/etc/krb5.conf
### Kafka handler properties ###
gg.handlerlist = kafkahandler
gg.handler.kafkahandler.type=kafka
gg.handler.kafkahandler.KafkaProducerConfigFile=kafka-producer.properties
gg.handler.kafkahandler.format=delimitedtext
gg.handler.kafkahandler.format.PkUpdateHandling=update
gg.handler.kafkahandler.mode=op
gg.handler.kafkahandler.format.includeCurrentTimestamp=false
gg.handler.kafkahandler.format.fieldDelimiter=|
gg.handler.kafkahandler.format.lineDelimiter=CDATA[\n]
gg.handler.kafkahandler.topicMappingTemplate=myoggtopic
gg.handler.kafkahandler.keyMappingTemplate=${position}
- Configure a Kafka Producer file with these
properties:
bootstrap.servers=10.245.172.52:9092
acks=1
#compression.type=snappy
reconnect.backoff.ms=1000
value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
key.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
batch.size=1024
linger.ms=2000
security.protocol=SASL_PLAINTEXT
sasl.kerberos.service.name=kafka
sasl.mechanism=GSSAPI
-
Configure a
jaas.conf
file with these properties:
KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
storeKey=true
keyTab="/scratch/ydama/ogg/v123211/dirtmp/keytabs/slc06unm/kafka.keytab"
principal="kafka/slc06unm.us.oracle.com@HADOOPTEST.ORACLE.COM";
};
-
Ensure that you have the latest
key.tab
files from the Cloudera instance to connect to secured Kafka topics. -
Start the Replicat from GGSCI and make sure that it is running with
INFO ALL
. -
Review the Replicat report to see the total number of records processed. The report is similar to:
Oracle GoldenGate for Big Data, 12.3.2.1.1.005 Copyright (c) 2007, 2018. Oracle and/or its affiliates. All rights reserved Built with Java 1.8.0_161 (class version: 52.0) 2018-08-05 22:15:28 INFO OGG-01815 Virtual Memory Facilities for: COM anon alloc: mmap(MAP_ANON) anon free: munmap file alloc: mmap(MAP_SHARED) file free: munmap target directories: /scratch/ydama/ogg/v123211/dirtmp. Database Version: Database Language and Character Set: *********************************************************************** ** Run Time Messages ** *********************************************************************** 2018-08-05 22:15:28 INFO OGG-02243 Opened trail file /scratch/ydama/ogg/v123211/dirdat/kfkCustR/gg000000 at 2018-08-05 22:15:28.258810. 2018-08-05 22:15:28 INFO OGG-03506 The source database character set, as determined from the trail file, is UTF-8. 2018-08-05 22:15:28 INFO OGG-06506 Wildcard MAP resolved (entry qasource.*): MAP "QASOURCE"."BDCUSTMER1", target qatarget."BDCUSTMER1". 2018-08-05 22:15:28 INFO OGG-02756 The definition for table QASOURCE.BDCUSTMER1 is obtained from the trail file. 2018-08-05 22:15:28 INFO OGG-06511 Using following columns in default map by name: CUST_CODE, NAME, CITY, STATE. 2018-08-05 22:15:28 INFO OGG-06510 Using the following key columns for target table qatarget.BDCUSTMER1: CUST_CODE. 2018-08-05 22:15:29 INFO OGG-06506 Wildcard MAP resolved (entry qasource.*): MAP "QASOURCE"."BDCUSTORD1", target qatarget."BDCUSTORD1". 2018-08-05 22:15:29 INFO OGG-02756 The definition for table QASOURCE.BDCUSTORD1 is obtained from the trail file. 2018-08-05 22:15:29 INFO OGG-06511 Using following columns in default map by name: CUST_CODE, ORDER_DATE, PRODUCT_CODE, ORDER_ID, PRODUCT_PRICE, PRODUCT_AMOUNT, TRANSACTION_ID. 2018-08-05 22:15:29 INFO OGG-06510 Using the following key columns for target table qatarget.BDCUSTORD1: CUST_CODE, ORDER_DATE, PRODUCT_CODE, ORDER_ID. 2018-08-05 22:15:33 INFO OGG-01021 Command received from GGSCI: STATS. 2018-08-05 22:16:03 INFO OGG-01971 The previous message, 'INFO OGG-01021', repeated 1 times. 2018-08-05 22:43:27 INFO OGG-01021 Command received from GGSCI: STOP. *********************************************************************** * ** Run Time Statistics ** * *********************************************************************** Last record for the last committed transaction is the following: ___________________________________________________________________ Trail name : /scratch/ydama/ogg/v123211/dirdat/kfkCustR/gg000000 Hdr-Ind : E (x45) Partition : . (x0c) UndoFlag : . (x00) BeforeAfter: A (x41) RecLength : 0 (x0000) IO Time : 2015-08-14 12:02:20.022027 IOType : 100 (x64) OrigNode : 255 (xff) TransInd : . (x03) FormatType : R (x52) SyskeyLen : 0 (x00) Incomplete : . (x00) AuditRBA : 78233 AuditPos : 23968384 Continued : N (x00) RecCount : 1 (x01) 2015-08-14 12:02:20.022027 GGSPurgedata Len 0 RBA 6473 TDR Index: 2 ___________________________________________________________________ Reading /scratch/ydama/ogg/v123211/dirdat/kfkCustR/gg000000, current RBA 6556, 20 records, m_file_seqno = 0, m_file_rba = 6556 Report at 2018-08-05 22:43:27 (activity since 2018-08-05 22:15:28) From Table QASOURCE.BDCUSTMER1 to qatarget.BDCUSTMER1: # inserts: 5 # updates: 1 # deletes: 0 # discards: 0 From Table QASOURCE.BDCUSTORD1 to qatarget.BDCUSTORD1: # inserts: 5 # updates: 3 # deletes: 5 # truncates: 1 # discards: 0
-
Ensure that the secure Kafka topic is created:
/kafka/bin/kafka-topics.sh --zookeeper slc06unm:2181 --list myoggtopic
-
Review the contents of the secure Kafka topic:
-
Create a
consumer.properties
file containing:security.protocol=SASL_PLAINTEXT sasl.kerberos.service.name=kafka
-
Set this environment variable:
export KAFKA_OPTS="-Djava.security.auth.login.config=/scratch/ogg/v123211/dirprm/jaas.conf"
-
Run the consumer utility to check the records:
/kafka/bin/kafka-console-consumer.sh --bootstrap-server sys06:9092 --topic myoggtopic --new-consumer --consumer.config consumer.properties
Parent topic: Setting Up and Running the Kafka Handler
8.2.8.1.3.7 Kafka SSL Support
Kafka supports SSL connectivity between Kafka clients and the Kafka cluster. SSL connectivity provides both authentication and encryption of messages transported between the client and the server.
- Set up the Kafka cluster for SSL
- Create self signed certificates in a keystore/truststore file
- Configure the Kafka clients for SSL
bootstrap.servers=localhost:9092
acks=1
value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
key.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
security.protocol=SSL
ssl.keystore.location=/var/private/ssl/server.keystore.jks
ssl.keystore.password=test1234
ssl.key.password=test1234
ssl.truststore.location=/var/private/ssl/server.truststore.jks
ssl.truststore.password=test1234
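The following is an illustrative sketch only of creating a self-signed certificate in a keystore and importing it into a truststore with the JDK keytool utility; the file names, alias, and passwords are hypothetical, and a production deployment typically uses CA-signed certificates:
keytool -genkeypair -keystore client.keystore.jks -alias kafkaclient -keyalg RSA -validity 365 -storepass test1234
keytool -exportcert -keystore client.keystore.jks -alias kafkaclient -file kafkaclient.cer -storepass test1234
keytool -importcert -keystore client.truststore.jks -alias kafkaclient -file kafkaclient.cer -storepass test1234 -noprompt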
Parent topic: Setting Up and Running the Kafka Handler
8.2.8.1.4 Schema Propagation
The Kafka Handler provides the ability to publish schemas to a schema topic. Currently, the Avro Row and Operation formatters are the only formatters that are enabled for schema publishing. If the Kafka Handler schemaTopicName
property is set, then the schema is published for the following events:
-
The Avro schema for a specific table is published the first time an operation for that table is encountered.
-
If the Kafka Handler receives a metadata change event, the schema is flushed. The regenerated Avro schema for a specific table is published the next time an operation for that table is encountered.
-
If the Avro wrapping functionality is enabled, then the generic wrapper Avro schema is published the first time that any operation is encountered. The generic wrapper Avro schema functionality is enabled in the Avro formatter configuration, see Avro Row Formatter and The Avro Operation Formatter.
The Kafka ProducerRecord
value is the schema, and the key is the fully qualified table name.
Because Avro messages directly depend on an Avro schema, users of Avro over Kafka may encounter issues. Avro messages are not human readable because they are binary. To deserialize an Avro message, the receiver must first have the correct Avro schema, but because each table from the source database results in a separate Avro schema, this can be difficult. The receiver of a Kafka message cannot determine which Avro schema to use to deserialize individual messages when the source Oracle GoldenGate trail file includes operations from multiple tables. To solve this problem, you can wrap the specialized Avro messages in a generic Avro message wrapper. This generic Avro wrapper provides the fully-qualified table name, the hashcode of the schema string, and the wrapped Avro message. The receiver can use the fully-qualified table name and the hashcode of the schema string to resolve the associated schema of the wrapped message, and then use that schema to deserialize the wrapped message.
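For example, the following minimal sketch (the handler and topic names are taken from the sample configuration earlier in this chapter) enables schema publication for the Avro Operation formatter:
gg.handler.kafkahandler.format=avro_op
gg.handler.kafkahandler.schemaTopicName=oggSchemaTopic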
Parent topic: Apache Kafka
8.2.8.1.5 Performance Considerations
For the best performance, Oracle recommends that you set the Kafka Handler to operate in operation mode.
gg.handler.name.mode = op
Additionally, Oracle recommends that you set the
batch.size
and linger.ms values in the Kafka Producer
properties file. These values are highly dependent upon the use case scenario.
Typically, higher values result in better throughput, but latency is increased.
Smaller values in these properties reduce latency but overall throughput
decreases.
Use of the Replicat variable GROUPTRANSOPS
also
improves performance. The recommended setting is 10000
.
If the serialized operations from the source trail file must be delivered in individual Kafka messages, then the Kafka Handler must be set to operation mode.
gg.handler.name.mode = op
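The following is an illustrative sketch only that combines these recommendations; the batch.size and linger.ms values are starting points to tune for your workload, not prescribed settings.
In the Replicat parameter file:
GROUPTRANSOPS 10000
In the Kafka Producer properties file:
batch.size=102400
linger.ms=10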
Parent topic: Apache Kafka
8.2.8.1.6 About Security
Kafka version 0.9.0.0 introduced security through SSL/TLS and SASL (Kerberos). You can secure the Kafka Handler using one or both of the SSL/TLS and SASL security offerings. The Kafka producer client libraries provide an abstraction of security functionality from the integrations that use those libraries. The Kafka Handler is effectively abstracted from security functionality. Enabling security requires setting up security for the Kafka cluster, connecting machines, and then configuring the Kafka producer properties file with the required security properties. For detailed instructions about securing the Kafka cluster, see the Kafka documentation at http://kafka.apache.org/documentation.html.
You may encounter the inability to decrypt the Kerberos password from the keytab
file. This causes the Kerberos authentication to fall back to interactive mode which cannot work because it is being invoked programmatically. The cause of this problem is that the Java Cryptography Extension (JCE) is not installed in the Java Runtime Environment (JRE). Ensure that the JCE is loaded in the JRE, see http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html.
Parent topic: Apache Kafka
8.2.8.1.7 Metadata Change Events
Metadata change events are now handled in the Kafka Handler. This is relevant only if you have configured a schema topic and the formatter used supports schema propagation (currently Avro row and Avro Operation formatters). The next time an operation is encountered for a table for which the schema has changed, the updated schema is published to the schema topic.
To support metadata change events, the Oracle GoldenGate process capturing changes in the source database must support the Oracle GoldenGate metadata in trail feature, which was introduced in Oracle GoldenGate 12c (12.2).
Parent topic: Apache Kafka
8.2.8.1.8 Snappy Considerations
The Kafka Producer Configuration file supports the use of compression. One of the configurable options is Snappy, an open source compression and decompression (codec
) library that provides better performance than other codec
libraries. The Snappy JAR does not run on all platforms. Snappy may work on Linux systems, but may or may not work on other UNIX and Windows implementations. If you want to use Snappy compression, test Snappy on all required systems before implementing compression using Snappy. If Snappy does not port to all required systems, then Oracle recommends using an alternate codec
library.
Parent topic: Apache Kafka
8.2.8.1.9 Kafka Interceptor Support
The Kafka Producer client framework supports the use of Producer Interceptors. A Producer Interceptor is simply a user exit from the Kafka Producer client whereby the Interceptor object is instantiated and receives notifications of Kafka message send calls and Kafka message send acknowledgement calls.
The typical use case for Interceptors is monitoring. Kafka Producer Interceptors
must conform to the interface
org.apache.kafka.clients.producer.ProducerInterceptor
. The Kafka
Handler supports Producer Interceptor usage.
The requirements to use Interceptors in the Kafka Handler are as follows:
- The Kafka Producer configuration property interceptor.classes must be configured with the class name of the Interceptor(s) to be invoked.
- In order to invoke the Interceptor(s), the JAR files plus any dependency JARs must be available to the JVM. Therefore, the JAR files containing the Interceptor(s) plus any dependency JARs must be added to the gg.classpath in the Handler configuration file, as illustrated in the sketch after this list.
For more information, see the Kafka documentation.
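The following is a minimal, hypothetical Java sketch of a monitoring Interceptor (the class and package names are examples, not part of the product); it conforms to the org.apache.kafka.clients.producer.ProducerInterceptor interface named above:
package com.example.ogg;

import java.util.Map;
import org.apache.kafka.clients.producer.ProducerInterceptor;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class CountingInterceptor implements ProducerInterceptor<byte[], byte[]> {
    private long sent;
    private long acknowledged;

    @Override
    public ProducerRecord<byte[], byte[]> onSend(ProducerRecord<byte[], byte[]> record) {
        sent++;            // notified for every Kafka message send call
        return record;     // return the record unchanged
    }

    @Override
    public void onAcknowledgement(RecordMetadata metadata, Exception exception) {
        if (exception == null) {
            acknowledged++; // notified for every Kafka message send acknowledgement
        }
    }

    @Override
    public void close() {
        System.out.println("sent=" + sent + ", acknowledged=" + acknowledged);
    }

    @Override
    public void configure(Map<String, ?> configs) {
        // no configuration needed for this sketch
    }
}
The Interceptor is then registered in the Kafka Producer properties file, for example interceptor.classes=com.example.ogg.CountingInterceptor, and its JAR is added to gg.classpath.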
Parent topic: Apache Kafka
8.2.8.1.10 Kafka Partition Selection
Kafka topics comprise one or more partitions. Distribution to multiple partitions is a good way to improve Kafka ingest performance, because the Kafka client parallelizes message sending to different topic/partition combinations. Partition selection is controlled by the following calculation in the Kafka client.
(Hash of the Kafka message key) modulus (the number of partitions) = selected partition number
The Kafka message key is selected by the following configuration value:
gg.handler.{your handler name}.keyMappingTemplate=
If this parameter is set to a value which generates a static key, all messages will go to the same partition. The following is an example of a static key:
gg.handler.{your handler name}.keyMappingTemplate=StaticValue
If this parameter is set to a value which generates a key that changes infrequently, partition selection changes infrequently. In the following example the table name is used as the message key. Every operation for a specific source table will have the same key and thereby route to the same partition:
gg.handler.{your handler name}.keyMappingTemplate=${tableName}
A key mapping of ${null} generates a null Kafka message key; messages with a null key are distributed across the topic partitions by the Kafka producer client:
gg.handler.{your handler name}.keyMappingTemplate=${null}
The recommended setting for configuration of the mapping key is the following:
gg.handler.{your handler name}.keyMappingTemplate=${primaryKeys}
This generates a Kafka message key that is the concatenated and delimited primary key values.
Operations for each row should have unique primary key values, thereby generating a unique Kafka message key for each row. Another important consideration is that Kafka messages sent to different partitions are not guaranteed to be delivered to a Kafka consumer in the original order sent. This is part of the Kafka specification. Order is only maintained within a partition. Using primary keys as the Kafka message key means that operations for the same row, which have the same primary key values, generate the same Kafka message key and are therefore sent to the same Kafka partition. In this way, the order is maintained for operations for the same row.
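The following is an illustrative Java sketch only of the partition selection calculation; the real Kafka producer uses a murmur2 hash of the serialized key bytes, so the simple hash below is a stand-in used purely to demonstrate the modulus step:
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PartitionSelectionSketch {
    // Hash of the Kafka message key, modulus the number of partitions, selects the partition.
    static int selectPartition(byte[] keyBytes, int numPartitions) {
        int hash = Arrays.hashCode(keyBytes);        // stand-in for the Kafka client's murmur2 hash
        return (hash & 0x7fffffff) % numPartitions;  // mask the sign bit, then apply the modulus
    }

    public static void main(String[] args) {
        byte[] key = "CUST_CODE=1001".getBytes(StandardCharsets.UTF_8); // hypothetical primary key value
        // The same key always resolves to the same partition, which preserves per-row ordering.
        System.out.println(selectPartition(key, 6));
    }
}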
At the DEBUG
log level the Kafka message coordinates (topic,
partition, and offset) are logged to the .log
file for successfully
sent messages.
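For example, assuming the log4j-based logging configuration shown earlier in this chapter, the following sketch raises the adapter log level so that the message coordinates are written to the log:
gg.log=log4j
gg.log.level=debug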
Parent topic: Apache Kafka
8.2.8.1.11 Troubleshooting
- Verify the Kafka Setup
- Classpath Issues
- Invalid Kafka Version
- Kafka Producer Properties File Not Found
- Kafka Connection Problem
Parent topic: Apache Kafka
8.2.8.1.11.1 Verify the Kafka Setup
You can use the command line Kafka producer to write dummy data to a Kafka topic, and you can use a Kafka consumer to read this data from the Kafka topic. Use this method to verify the setup and read/write permissions to Kafka topics on disk, see http://kafka.apache.org/documentation.html#quickstart.
Parent topic: Troubleshooting
8.2.8.1.11.2 Classpath Issues
Java classpath problems are common. Such problems may include a ClassNotFoundException
problem in the log4j
log file or may be an error resolving the classpath because of a typographic error in the gg.classpath
variable. The Kafka client libraries do not ship with the Oracle GoldenGate for Big Data product. You must obtain the correct version of the Kafka client libraries and properly configure the gg.classpath
property in the Java Adapter Properties file to correctly resolve the Kafka client libraries as described in Classpath Configuration.
Parent topic: Troubleshooting
8.2.8.1.11.3 Invalid Kafka Version
The Kafka Handler does not support Kafka versions 0.8.2.2 or older. If you run an unsupported version of Kafka, a runtime Java exception, java.lang.NoSuchMethodError
, occurs. It implies that the org.apache.kafka.clients.producer.KafkaProducer.flush()
method cannot be found. If you encounter this error, migrate to Kafka version 0.9.0.0 or later.
Parent topic: Troubleshooting
8.2.8.1.11.4 Kafka Producer Properties File Not Found
This problem typically results in the following exception:
ERROR 2015-11-11 11:49:08,482 [main] Error loading the kafka producer properties
Check the gg.handler.kafkahandler.KafkaProducerConfigFile
configuration variable to ensure that the Kafka Producer Configuration file name is set correctly. Check the gg.classpath
variable to verify that the classpath includes the path to the Kafka Producer properties file, and that the path to the properties file does not contain a *
wildcard at the end.
Parent topic: Troubleshooting
8.2.8.1.11.5 Kafka Connection Problem
This problem occurs when the Kafka Handler is unable to connect to Kafka. You receive the following warnings:
WARN 2015-11-11 11:25:50,784 [kafka-producer-network-thread | producer-1] WARN (Selector.java:276) - Error in I/O with localhost/127.0.0.1 java.net.ConnectException: Connection refused
The connection retry interval expires, and the Kafka Handler process abends. Ensure that the Kafka Broker is running and that the host and port provided in the Kafka Producer Properties file are correct. You can use network shell commands (such as netstat -l
) on the machine hosting the Kafka broker to verify that Kafka is listening on the expected port.
Parent topic: Troubleshooting
8.2.8.1.12 Kafka Handler Client Dependencies
What are the dependencies for the Kafka Handler to connect to Apache Kafka?
The Maven Central repository artifacts for Kafka are:
Maven groupId: org.apache.kafka
Maven artifactId: kafka-clients
Maven version: the Kafka version numbers listed for each section
Parent topic: Apache Kafka
8.2.8.1.12.1 Kafka 2.8.0
kafka-clients-2.8.0.jar lz4-java-1.7.1.jar slf4j-api-1.7.30.jar snappy-java-1.1.8.1.jar zstd-jni-1.4.9-1.jar
Parent topic: Kafka Handler Client Dependencies
8.2.8.1.12.2 Kafka 2.7.0
kafka-clients-2.7.0.jar lz4-java-1.7.1.jar slf4j-api-1.7.30.jar snappy-java-1.1.7.7.jar zstd-jni-1.4.5-6.jar
Parent topic: Kafka Handler Client Dependencies
8.2.8.1.12.3 Kafka 2.6.0
kafka-clients-2.6.0.jar lz4-java-1.7.1.jar slf4j-api-1.7.30.jar snappy-java-1.1.7.3.jar zstd-jni-1.4.4-7.jar
Parent topic: Kafka Handler Client Dependencies
8.2.8.1.12.4 Kafka 2.5.1
kafka-clients-2.5.1.jar lz4-java-1.7.1.jar slf4j-api-1.7.30.jar snappy-java-1.1.7.3.jar zstd-jni-1.4.4-7.jar
Parent topic: Kafka Handler Client Dependencies
8.2.8.1.12.5 Kafka 2.4.1
kafka-clients-2.4.1.jar lz4-java-1.6.0.jar slf4j-api-1.7.28.jar snappy-java-1.1.7.3.jar zstd-jni-1.4.3-1.jar
Parent topic: Kafka Handler Client Dependencies
8.2.8.1.12.6 Kafka 2.3.1
kafka-clients-2.3.1.jar lz4-java-1.6.0.jar slf4j-api-1.7.26.jar snappy-java-1.1.7.3.jar zstd-jni-1.4.0-1.jar
Parent topic: Kafka Handler Client Dependencies
8.2.8.2 Apache Kafka Connect Handler
The Kafka Connect Handler is an extension of the standard Kafka messaging functionality.
This chapter describes how to use the Kafka Connect Handler.
- Overview
- Detailed Functionality
- Setting Up and Running the Kafka Connect Handler
- Connecting to a Secure Schema Registry
- Kafka Connect Handler Performance Considerations
- Kafka Interceptor Support
The Kafka Producer client framework supports the use of Producer Interceptors. A Producer Interceptor is simply a user exit from the Kafka Producer client whereby the Interceptor object is instantiated and receives notifications of Kafka message send calls and Kafka message send acknowledgement calls. - Kafka Partition Selection
Kafka topics comprise one or more partitions. Distribution to multiple partitions is a good way to improve Kafka ingest performance, because the Kafka client parallelizes message sending to different topic/partition combinations. Partition selection is controlled by a following calculation in the Kafka client. - Troubleshooting the Kafka Connect Handler
- Kafka Connect Handler Client Dependencies
What are the dependencies for the Kafka Connect Handler to connect to Apache Kafka?
Parent topic: Apache Kafka
8.2.8.2.1 Overview
The Oracle GoldenGate Kafka Connect Handler is an extension of the standard Kafka messaging functionality. Kafka Connect is a functional layer on top of the standard Kafka Producer and Consumer interfaces. It provides standardization for messaging to make it easier to add new source and target systems into your topology.
Confluent is a primary adopter of Kafka Connect and their Confluent Platform offering includes extensions over the standard Kafka Connect functionality. This includes Avro serialization and deserialization, and an Avro schema registry. Much of the Kafka Connect functionality is available in Apache Kafka. A number of open source Kafka Connect integrations are found at:
https://www.confluent.io/product/connectors/
The Kafka Connect Handler is a Kafka Connect source connector. You can capture database changes from any database supported by Oracle GoldenGate and stream that change of data through the Kafka Connect layer to Kafka. You can also connect to Oracle Event Hub Cloud Services (EHCS) with this handler.
Kafka Connect uses proprietary objects to define the schemas (org.apache.kafka.connect.data.Schema
) and the messages (org.apache.kafka.connect.data.Struct
). The Kafka Connect Handler can be configured to manage what data is published and the structure of the published data.
The Kafka Connect Handler does not support any of the pluggable formatters that are supported by the Kafka Handler.
Topics:
Parent topic: Apache Kafka Connect Handler
8.2.8.2.2 Detailed Functionality
JSON Converter
The Kafka Connect framework provides converters to convert in-memory Kafka Connect messages to a serialized format suitable for transmission over a network. These converters are selected using configuration in the Kafka Producer properties file.
Kafka Connect and the JSON converter are available as part of the Apache Kafka download. The JSON Converter converts the Kafka keys and values to JSON, which is then sent to a Kafka topic. You identify the JSON Converters with the following configuration in the Kafka Producer properties file:
key.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=true
The format of the messages is the message schema information followed by the payload information. JSON is a self-describing format, so you probably do not want to include the schema information in each message published to Kafka.
To omit the JSON schema information from the messages set the following:
key.converter.schemas.enable=false
value.converter.schemas.enable=false
Avro Converter
Confluent provides Kafka installations, support for Kafka, and extended functionality built on top of Kafka to help realize the full potential of Kafka. Confluent provides both open source versions of Kafka (Confluent Open Source) and an enterprise edition (Confluent Enterprise), which is available for purchase.
A common Kafka use case is to send Avro messages over Kafka. This can create a problem on the receiving end as there is a dependency for the Avro schema in order to deserialize an Avro message. Schema evolution can increase the problem because received messages must be matched up with the exact Avro schema used to generate the message on the producer side. Deserializing Avro messages with an incorrect Avro schema can cause runtime failure, incomplete data, or incorrect data. Confluent has solved this problem by using a schema registry and the Confluent schema converters.
The following shows the configuration of the Kafka Producer properties file.
key.converter=io.confluent.connect.avro.AvroConverter
value.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter.schema.registry.url=http://localhost:8081
When messages are published to Kafka, the Avro schema is registered and stored in the schema registry. When messages are consumed from Kafka, the exact Avro schema used to create the message can be retrieved from the schema registry to deserialize the Avro message. This creates matching of Avro messages to corresponding Avro schemas on the receiving side, which solves this problem.
Following are the requirements to use the Avro Converters:
- This functionality is available in both versions of Confluent Kafka (open source or enterprise).
- The Confluent schema registry service must be running.
- Source database tables must have an associated Avro schema. Messages associated with different Avro schemas must be sent to different Kafka topics.
- The Confluent Avro converters and the schema registry client must be available in the classpath.
The schema registry keeps track of Avro schemas by topic. Messages must be sent to a topic that has the same schema or evolving versions of the same schema. Source messages have Avro schemas based on the source database table schema so Avro schemas are unique for each source table. Publishing messages for multiple source tables to a single topic appears to the schema registry as a schema that is evolving every time a message arrives from a source table that differs from the source table of the previous message.
Protobuf Converter
The Protobuf Converter allows Kafka Connect messages to be formatted as Google Protocol Buffers format. The Protobuf Converter integrates with the Confluent schema registry and this functionality is available in both the open source and enterprise versions of Confluent. Confluent added the Protobuf Converter starting in Confluent version 5.5.0.
key.converter=io.confluent.connect.protobuf.ProtobufConverter
value.converter=io.confluent.connect.protobuf.ProtobufConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter.schema.registry.url=http://localhost:8081
The requirements to use the Protobuf Converter are as follows:
- This functionality is available in both versions of Confluent Kafka (open source or enterprise) starting in 5.5.0.
- The Confluent schema registry service must be running.
- Messages with different schemas (source tables) should be sent to different Kafka topics.
- The Confluent Protobuf converter and the schema registry client must be available in the classpath.
The schema registry keeps track of Protobuf schemas by topic. Messages must be sent to a topic that has the same schema or evolving versions of the same schema. Source messages have Protobuf schemas based on the source database table schema so Protobuf schemas are unique for each source table. Publishing messages for multiple source tables to a single topic appears to the schema registry as a schema that is evolving every time a message arrives from a source table that differs from the source table of the previous message.
Parent topic: Apache Kafka Connect Handler
8.2.8.2.3 Setting Up and Running the Kafka Connect Handler
Instructions for configuring the Kafka Connect Handler components and running the handler are described in this section.
Classpath Configuration
Two things must be configured in the gg.classpath
configuration variable so that the Kafka Connect Handler can connect to Kafka and run. The required items are the Kafka Producer properties file and the Kafka client JARs. The Kafka client JARs must match the version of Kafka that the Kafka Connect Handler is connecting to. For a listing of the required client JAR files by version, see Kafka Connect Handler Client Dependencies. The recommended storage location for the Kafka Producer properties file is the Oracle GoldenGate dirprm
directory.
The default location of the Kafka Connect client JARs is the Kafka_Home/libs/*
directory.
The gg.classpath
variable must be configured precisely. Pathing to the Kafka Producer properties file should contain the path with no wildcard appended. The inclusion of the asterisk (*) wildcard in the path to the Kafka Producer properties file causes it to be discarded. Pathing to the dependency JARs should include the * wildcard character to include all of the JAR files in that directory in the associated classpath. Do not use *.jar
.
Following is an example of a correctly configured Apache Kafka classpath:
gg.classpath=dirprm:{kafka_install_dir}/libs/*
Following is an example of a correctly configured Confluent Kafka classpath:
gg.classpath={confluent_install_dir}/share/java/kafka-serde-tools/*:{confluent_install_dir}/share/java/kafka/*:{confluent_install_dir}/share/java/confluent-common/*
- Kafka Connect Handler Configuration
The automated output of meta-column fields in generated Kafka Connect messages has been removed as of Oracle GoldenGate for Big Data release 21.1. - Using Templates to Resolve the Topic Name and Message Key
- Configuring Security in the Kafka Connect Handler
Parent topic: Apache Kafka Connect Handler
8.2.8.2.3.1 Kafka Connect Handler Configuration
The automated output of meta-column fields in generated Kafka Connect messages has been removed as of Oracle GoldenGate for Big Data release 21.1.
Meta-column fields can be configured using the following property:
gg.handler.name.metaColumnsTemplate
To output the metacolumns as in previous versions, configure the following:
gg.handler.name.metaColumnsTemplate=${objectname[table]},${optype[op_type]},${timestamp[op_ts]},${currenttimestamp[current_ts]},${position[pos]}
To also include the primary key columns and the tokens, configure as follows:
gg.handler.name.metaColumnsTemplate=${objectname[table]},${optype[op_type]},${timestamp[op_ts]},${currenttimestamp[current_ts]},${position[pos]},${primarykeycolumns[primary_keys]},${alltokens[tokens]}
For more information, see the configuration property:
gg.handler.name.metaColumnsTemplate
Table 8-11 Kafka Connect Handler Configuration Properties
Properties | Required / Optional | Legal Values | Default | Explanation
---|---|---|---|---
gg.handler.name.type | Required | kafkaconnect | None | The configuration to select the Kafka Connect Handler.
gg.handler.name.kafkaProducerConfigFile | Required | string | None | Name of the properties file containing the Kafka and Kafka Connect configuration properties. This file must be part of the classpath configured by the gg.classpath property.
gg.handler.name.topicMappingTemplate | Required | A template string value to resolve the Kafka topic name at runtime. | None | See Using Templates to Resolve the Topic Name and Message Key.
gg.handler.name.keyMappingTemplate | Required | A template string value to resolve the Kafka message key at runtime. | None | See Using Templates to Resolve the Topic Name and Message Key.
gg.handler.name.includeTokens | Optional | true or false | false | Set to true to include a map of the token values from the source trail file in the output. Set to false to suppress token output.
gg.handler.name.messageFormatting | Optional | row or op | row | Controls how output messages are modeled. Set to row and the output messages are modeled as rows. Set to op and the output messages are modeled as operation messages.
gg.handler.name.insertOpKey | Optional | Any string | I | The value of the operation type field to indicate an insert operation.
gg.handler.name.updateOpKey | Optional | Any string | U | The value of the operation type field to indicate an update operation.
gg.handler.name.deleteOpKey | Optional | Any string | D | The value of the operation type field to indicate a delete operation.
gg.handler.name.truncateOpKey | Optional | Any string | T | The value of the operation type field to indicate a truncate operation.
gg.handler.name.treatAllColumnsAsStrings | Optional | true or false | false | Set to true to treat all output fields as strings. Set to false and the handler maps the corresponding field type from the source trail file to the best corresponding Kafka Connect data type.
gg.handler.name.mapLargeNumbersAsStrings | Optional | true or false | false | Large numbers are mapped to number fields as doubles, so it is possible to lose precision in certain scenarios. If set to true, these fields are mapped as strings in order to preserve precision.
gg.handler.name.pkUpdateHandling | Optional | abend, update, or delete-insert | abend | Only applicable if modeling row messages (gg.handler.name.messageFormatting=row).
gg.handler.name.metaColumnsTemplate | Optional | Any of the metacolumn keywords. | None | A comma-delimited string consisting of one or more templated values that represent the template. See Metacolumn Keywords.
— | Optional | true or false | — | Set this property for each column to allow downstream applications to differentiate whether a null value is actually null in the source trail file or whether it is missing in the source trail file.
gg.handler.name.enableDecimalLogicalType | Optional | true or false | false | Set to true to enable decimal logical types in Kafka Connect. Decimal logical types allow numbers that do not fit in a 64-bit data type to be represented.
gg.handler.name.oracleNumberScale | Optional | Positive integer | 38 | Only applicable if gg.handler.name.enableDecimalLogicalType=true. Some source data types do not have a fixed scale associated with them. Scale must be set for Kafka Connect decimal logical types. For source types that do not have a scale in the metadata, the value of this parameter is used to set the scale.
gg.handler.name.EnableTimestampLogicalType | Optional | true or false | false | Set to true to enable the Kafka Connect timestamp logical type. The Kafka Connect timestamp logical type is an integer measurement of milliseconds since the Java epoch, which means that precision greater than milliseconds is not possible if the timestamp logical type is used. Use of this property requires that the gg.format.timestamp property be set. This property is the timestamp formatting string, which is used to determine the output of timestamps in string format (for example, gg.format.timestamp=yyyy-MM-dd HH:mm:ss.SSS). Ensure that the goldengate.userexit.timestamp property is not set in the configuration file, because setting it prevents parsing the input timestamp into a Java object, which is required for logical timestamps.
gg.handler.name.metaHeadersTemplate | Optional | Comma-delimited list of metacolumn keywords. | None | Allows the user to select metacolumns to inject context-based key-value pairs into Kafka message headers using the metacolumn keyword syntax. See Metacolumn Keywords.
gg.handler.name.schemaNamespace | Optional | Any string without characters that violate the Kafka Connect Avro schema naming requirements. | None | Used to control the generated Kafka Connect schema name. If it is not set, then the schema name is the same as the qualified source table name. For example, if the source table is QASOURCE.TCUSTMER, then the Kafka Connect schema name is the same. This property allows you to control the generated schema name.
gg.handler.name.enableNonnullable | Optional | true or false | false | The default behavior is to set all fields as nullable in the generated Kafka Connect schema. Set this parameter to true to honor the nullable setting configured in the target metadata provided by the metadata provider. Setting this property to true can have some adverse side effects.
See Using Templates to Resolve the Topic Name and Message Key for more information.
Review a Sample Configuration
gg.handlerlist=kafkaconnect
#The handler properties
gg.handler.kafkaconnect.type=kafkaconnect
gg.handler.kafkaconnect.kafkaProducerConfigFile=kafkaconnect.properties
gg.handler.kafkaconnect.mode=op
#The following selects the topic name based on the fully qualified table name
gg.handler.kafkaconnect.topicMappingTemplate=${fullyQualifiedTableName}
#The following selects the message key using the concatenated primary keys
gg.handler.kafkaconnect.keyMappingTemplate=${primaryKeys}
#The formatter properties
gg.handler.kafkaconnect.messageFormatting=row
gg.handler.kafkaconnect.insertOpKey=I
gg.handler.kafkaconnect.updateOpKey=U
gg.handler.kafkaconnect.deleteOpKey=D
gg.handler.kafkaconnect.truncateOpKey=T
gg.handler.kafkaconnect.treatAllColumnsAsStrings=false
gg.handler.kafkaconnect.pkUpdateHandling=abend
Parent topic: Setting Up and Running the Kafka Connect Handler
8.2.8.2.3.2 Using Templates to Resolve the Topic Name and Message Key
The Kafka Connect Handler provides functionality to resolve the topic name and the message key at runtime using a template configuration value. Templates allow you to configure static values and keywords. Keywords are used to dynamically replace the keyword with the context of the current processing. Templates are applicable to the following configuration parameters:
gg.handler.name.topicMappingTemplate
gg.handler.name.keyMappingTemplate
Template Modes
The Kafka Connect Handler can only send operation messages. The Kafka Connect Handler cannot group operation messages into a larger transaction message.
Parent topic: Setting Up and Running the Kafka Connect Handler
8.2.8.2.3.3 Configuring Security in the Kafka Connect Handler
Kafka version 0.9.0.0 introduced security through SSL/TLS or Kerberos. The Kafka Connect Handler can be secured using SSL/TLS or Kerberos. The Kafka producer client libraries provide an abstraction of security functionality from the integrations utilizing those libraries. The Kafka Connect Handler is effectively abstracted from security functionality. Enabling security requires setting up security for the Kafka cluster, connecting machines, and then configuring the Kafka Producer properties file, which the Kafka Connect Handler uses for processing, with the required security properties.
You may encounter the inability to decrypt the Kerberos password from the keytab
file. This causes the Kerberos authentication to fall back to interactive mode which cannot work because it is being invoked programmatically. The cause of this problem is that the Java Cryptography Extension (JCE) is not installed in the Java Runtime Environment (JRE). Ensure that the JCE is loaded in the JRE, see http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html.
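The security properties themselves go into the Kafka producer properties file that the handler references. The following is a minimal sketch for an SSL/TLS-secured cluster; the broker address, file paths, and passwords are placeholders, and these are standard Kafka producer security properties rather than handler-specific settings:
bootstrap.servers=localhost:9093
security.protocol=SSL
ssl.truststore.location=/path/to/kafka.client.truststore.jks
ssl.truststore.password=<truststore-password>
ssl.keystore.location=/path/to/kafka.client.keystore.jks
ssl.keystore.password=<keystore-password>
ssl.key.password=<key-password>
For a Kerberos-secured cluster, security.protocol is typically SASL_PLAINTEXT or SASL_SSL together with the SASL/Kerberos client properties; consult the Kafka security documentation for the complete list.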
Parent topic: Setting Up and Running the Kafka Connect Handler
8.2.8.2.4 Connecting to a Secure Schema Registry
The customer topology for Kafka Connect may include a schema registry which is secured. This topic shows how to set the Kafka producer properties configured for connectivity to a secured schema registry.
SSL Mutual Auth
key.converter.schema.registry.ssl.truststore.location=
key.converter.schema.registry.ssl.truststore.password=
key.converter.schema.registry.ssl.keystore.location=
key.converter.schema.registry.ssl.keystore.password=
key.converter.schema.registry.ssl.key.password=
value.converter.schema.registry.ssl.truststore.location=
value.converter.schema.registry.ssl.truststore.password=
value.converter.schema.registry.ssl.keystore.location=
value.converter.schema.registry.ssl.keystore.password=
value.converter.schema.registry.ssl.key.password=
SSL Basic Auth
key.converter.basic.auth.credentials.source=USER_INFO
key.converter.basic.auth.user.info=username:password
key.converter.schema.registry.ssl.truststore.location=
key.converter.schema.registry.ssl.truststore.password=
value.converter.basic.auth.credentials.source=USER_INFO
value.converter.basic.auth.user.info=username:password
value.converter.schema.registry.ssl.truststore.location=
value.converter.schema.registry.ssl.truststore.password=
Parent topic: Apache Kafka Connect Handler
8.2.8.2.5 Kafka Connect Handler Performance Considerations
There are multiple configuration settings, both in the Oracle GoldenGate for Big Data configuration and in the Kafka producer, that affect performance.
The Oracle GoldenGate parameter that has the greatest effect on performance is the Replicat GROUPTRANSOPS parameter. The GROUPTRANSOPS parameter allows Replicat to group multiple source transactions into a single target transaction. At transaction commit, the Kafka Connect Handler calls flush on the Kafka producer to push the messages to Kafka for write durability, followed by a checkpoint. The flush call is expensive, so setting the Replicat GROUPTRANSOPS parameter to a larger value allows the Replicat to call flush less frequently, thereby improving performance.
The default setting for GROUPTRANSOPS is 1000, and performance improvements can be obtained by increasing the value to 2500, 5000, or even 10000.
Op mode (gg.handler.kafkaconnect.mode=op) can also provide better performance than Tx mode (gg.handler.kafkaconnect.mode=tx).
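As an illustrative sketch only (the Replicat name and MAP statement are hypothetical; the values are examples, not recommendations), the relevant tuning settings might look as follows:
-- Replicat parameter file (excerpt)
REPLICAT rkconn
GROUPTRANSOPS 2500
MAP QASOURCE.*, TARGET QASOURCE.*;

#Java Adapter properties (excerpt)
gg.handler.kafkaconnect.mode=op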
A number of Kafka Producer properties can affect performance. The following are the parameters with significant impact:
-
linger.ms
-
batch.size
-
acks
-
buffer.memory
-
compression.type
Oracle recommends that you start with the default values for these parameters and perform performance testing to obtain a baseline for performance. Review the Kafka documentation for each of these parameters to understand its role, then adjust the parameters and perform additional performance testing to ascertain the performance effect of each change.
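For reference, a hedged example of these producer properties with purely illustrative values follows; appropriate values depend on your message sizes, durability requirements, and performance testing:
linger.ms=100
batch.size=65536
acks=1
buffer.memory=67108864
compression.type=gzip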
Parent topic: Apache Kafka Connect Handler
8.2.8.2.6 Kafka Interceptor Support
The Kafka Producer client framework supports the use of Producer Interceptors. A Producer Interceptor is simply a user exit from the Kafka Producer client whereby the Interceptor object is instantiated and receives notifications of Kafka message send calls and Kafka message send acknowledgement calls.
The typical use case for Interceptors is monitoring. Kafka Producer
Interceptors must conform to the interface
org.apache.kafka.clients.producer.ProducerInterceptor
. The Kafka
Connect Handler supports Producer Interceptor usage.
The requirements to using Interceptors in the Handlers are as follows:
- The Kafka Producer configuration property interceptor.classes must be configured with the class name of the Interceptor(s) to be invoked.
- In order to invoke the Interceptor(s), the jar files plus any dependency jars must be available to the JVM. Therefore, the jar files containing the Interceptor(s) plus any dependency jars must be added to the gg.classpath in the Handler configuration file, as shown in the sketch after this list. For more information, see the Kafka documentation.
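The following sketch shows both settings together; the Interceptor class name and the paths are hypothetical placeholders:
#Kafka producer properties file
interceptor.classes=com.example.monitoring.MyProducerInterceptor

#Java Adapter properties file
gg.classpath=dirprm:{kafka_install_dir}/libs/*:/path/to/interceptor/lib/*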
Parent topic: Apache Kafka Connect Handler
8.2.8.2.7 Kafka Partition Selection
Kafka topics comprise one or more partitions. Distribution to multiple partitions is a good way to improve Kafka ingest performance, because the Kafka client parallelizes message sending to different topic/partition combinations. Partition selection is controlled by the following calculation in the Kafka client:
(Hash of the Kafka message key) modulus (the number of partitions) = selected partition number
The Kafka message key is selected by the following configuration value:
gg.handler.{your handler name}.keyMappingTemplate=
If this parameter is set to a value which generates a static key, all messages go to the same partition. The following is an example of a static key:
gg.handler.{your handler name}.keyMappingTemplate=StaticValue
If this parameter is set to a value which generates a key that changes infrequently, partition selection also changes infrequently. In the following example, the table name is used as the message key. Every operation for a specific source table has the same key and thereby routes to the same partition:
gg.handler.{your handler name}.keyMappingTemplate=${tableName}
gg.handler.{your handler name}.keyMappingTemplate=${null}
The recommended setting for configuration of the mapping key is the following:
gg.handler.{your handler name}.keyMappingTemplate=${primaryKeys}
This generates a Kafka message key that is the concatenated and delimited primary key values.
Each row should have unique primary key values, thereby generating a unique Kafka message key for each row. Another important consideration is that Kafka messages sent to different partitions are not guaranteed to be delivered to a Kafka consumer in the original order sent. This is part of the Kafka specification; order is only maintained within a partition. Using primary keys as the Kafka message key means that operations for the same row, which have the same primary key(s), generate the same Kafka message key and are therefore sent to the same Kafka partition. In this way, order is maintained for operations on the same row.
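The following self-contained Java sketch illustrates the hash-modulus selection described above. It is not the producer's actual partitioner (the Kafka default partitioner hashes the serialized key with murmur2), and the sample keys and delimiter are hypothetical, but it demonstrates why identical keys always map to the same partition:
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PartitionSelectionSketch {

    // Illustrative only: hash the key bytes and take the modulus of the
    // partition count. Kafka's default partitioner uses a murmur2 hash of
    // the serialized key, but the mapping principle is the same.
    static int selectPartition(String messageKey, int numPartitions) {
        int hash = Arrays.hashCode(messageKey.getBytes(StandardCharsets.UTF_8));
        return Math.floorMod(hash, numPartitions);
    }

    public static void main(String[] args) {
        int partitions = 6;
        // A key built from primary key values routes every operation for the
        // same row to the same partition, preserving per-row ordering.
        System.out.println(selectPartition("QASOURCE.TCUSTMER|WILL", partitions));
        System.out.println(selectPartition("QASOURCE.TCUSTMER|WILL", partitions)); // same partition as above
        System.out.println(selectPartition("QASOURCE.TCUSTMER|JANE", partitions)); // may differ
    }
}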
At the DEBUG
log level the Kafka message coordinates (topic,
partition, and offset) are logged to the .log
file for successfully
sent messages.
Parent topic: Apache Kafka Connect Handler
8.2.8.2.8 Troubleshooting the Kafka Connect Handler
- Java Classpath for Kafka Connect Handler
- Invalid Kafka Version
- Kafka Producer Properties File Not Found
- Kafka Connection Problem
Parent topic: Apache Kafka Connect Handler
8.2.8.2.8.1 Java Classpath for Kafka Connect Handler
Issues with the Java classpath are one of the most common problems. The indication of a classpath problem is a ClassNotFoundException in the Oracle GoldenGate Java log4j log file, or an error while resolving the classpath if there is a typographic error in the gg.classpath variable.
The Kafka client libraries do not ship with the Oracle GoldenGate for Big Data product. You are required to obtain the correct version of the Kafka client libraries and to properly configure the gg.classpath property in the Java Adapter properties file to correctly resolve the Kafka client libraries as described in Setting Up and Running the Kafka Connect Handler.
Parent topic: Troubleshooting the Kafka Connect Handler
8.2.8.2.8.2 Invalid Kafka Version
Kafka Connect was introduced in Kafka 0.9.0.0 version. The Kafka Connect Handler does not work with Kafka versions 0.8.2.2 and older. Attempting to use Kafka Connect with Kafka 0.8.2.2 version typically results in a ClassNotFoundException
error at runtime.
Parent topic: Troubleshooting the Kafka Connect Handler
8.2.8.2.8.3 Kafka Producer Properties File Not Found
Typically, the following exception message occurs:
ERROR 2015-11-11 11:49:08,482 [main] Error loading the kafka producer properties
Verify that the gg.handler.kafkahandler.KafkaProducerConfigFile
configuration property for the Kafka Producer Configuration file name is set correctly.
Ensure that the gg.classpath
variable includes the path to the Kafka Producer properties file and that the path to the properties file does not contain a * wildcard at the end.
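For example, assuming the producer properties file resides in the Oracle GoldenGate dirprm directory (the Kafka installation path is a placeholder), a consistent configuration looks like this; note that dirprm is listed without a trailing wildcard so the directory itself is on the classpath:
gg.handler.kafkaconnect.kafkaProducerConfigFile=kafkaconnect.properties
gg.classpath=dirprm:{kafka_install_dir}/libs/*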
Parent topic: Troubleshooting the Kafka Connect Handler
8.2.8.2.8.4 Kafka Connection Problem
Typically, the following exception message appears:
WARN 2015-11-11 11:25:50,784 [kafka-producer-network-thread | producer-1] WARN (Selector.java:276) - Error in I/O with localhost/127.0.0.1 java.net.ConnectException: Connection refused
When this occurs, the connection retry interval expires and the Kafka Connect Handler process abends. Ensure that the Kafka Brokers are running and that the host and port provided in the Kafka Producer properties file are correct.
Network shell commands (such as netstat -l) can be used on the machine hosting the Kafka broker to verify that Kafka is listening on the expected port.
Parent topic: Troubleshooting the Kafka Connect Handler
8.2.8.2.9 Kafka Connect Handler Client Dependencies
What are the dependencies for the Kafka Connect Handler to connect to Apache Kafka?
The Maven central repository artifacts for the Kafka Connect Handler are:
Maven groupId: org.apache.kafka
Maven artifactId: kafka-clients & connect-json
Maven version: the Kafka Connect version numbers listed for each section
- Kafka 2.8.0
- Kafka 2.7.1
- Kafka 2.6.0
- Kafka 2.5.1
- Kafka 2.4.1
- Kafka 2.3.1
- Kafka 2.2.1
- Kafka 2.1.1
- Kafka 2.0.1
- Kafka 1.1.1
- Kafka 1.0.2
- Kafka 0.11.0.0
- Kafka 0.10.2.0
- Kafka 0.10.1.1
- Kafka 0.10.0.0
- Kafka 0.9.0.1
Parent topic: Apache Kafka Connect Handler
8.2.8.2.9.1 Kafka 2.8.0
connect-api-2.8.0.jar connect-json-2.8.0.jar jackson-annotations-2.10.5.jar jackson-core-2.10.5.jar jackson-databind-2.10.5.1.jar jackson-datatype-jdk8-2.10.5.jar javax.ws.rs-api-2.1.1.jar kafka-clients-2.8.0.jar lz4-java-1.7.1.jar slf4j-api-1.7.30.jar snappy-java-1.1.8.1.jar zstd-jni-1.4.9-1.jar
Parent topic: Kafka Connect Handler Client Dependencies
8.2.8.2.9.2 Kafka 2.7.1
connect-api-2.7.1.jar connect-json-2.7.1.jar jackson-annotations-2.10.5.jar jackson-core-2.10.5.jar jackson-databind-2.10.5.1.jar jackson-datatype-jdk8-2.10.5.jar javax.ws.rs-api-2.1.1.jar kafka-clients-2.7.1.jar lz4-java-1.7.1.jar slf4j-api-1.7.30.jar snappy-java-1.1.7.7.jar zstd-jni-1.4.5-6.jar
Parent topic: Kafka Connect Handler Client Dependencies
8.2.8.2.9.3 Kafka 2.6.0
connect-api-2.6.0.jar connect-json-2.6.0.jar jackson-annotations-2.10.2.jar jackson-core-2.10.2.jar jackson-databind-2.10.2.jar jackson-datatype-jdk8-2.10.2.jar javax.ws.rs-api-2.1.1.jar kafka-clients-2.6.0.jar lz4-java-1.7.1.jar slf4j-api-1.7.30.jar snappy-java-1.1.7.3.jar zstd-jni-1.4.4-7.jar
Parent topic: Kafka Connect Handler Client Dependencies
8.2.8.2.9.4 Kafka 2.5.1
connect-api-2.5.1.jar connect-json-2.5.1.jar jackson-annotations-2.10.2.jar jackson-core-2.10.2.jar jackson-databind-2.10.2.jar jackson-datatype-jdk8-2.10.2.jar javax.ws.rs-api-2.1.1.jar kafka-clients-2.5.1.jar lz4-java-1.7.1.jar slf4j-api-1.7.30.jar snappy-java-1.1.7.3.jar zstd-jni-1.4.4-7.jar
Parent topic: Kafka Connect Handler Client Dependencies
8.2.8.2.9.5 Kafka 2.4.1
kafka-clients-2.4.1.jar lz4-java-1.6.0.jar slf4j-api-1.7.28.jar snappy-java-1.1.7.3.jar zstd-jni-1.4.3-1.jar
Parent topic: Kafka Connect Handler Client Dependencies
8.2.8.2.9.6 Kafka 2.3.1
connect-api-2.3.1.jar connect-json-2.3.1.jar jackson-annotations-2.10.0.jar jackson-core-2.10.0.jar jackson-databind-2.10.0.jar jackson-datatype-jdk8-2.10.0.jar javax.ws.rs-api-2.1.1.jar kafka-clients-2.3.1.jar lz4-java-1.6.0.jar slf4j-api-1.7.26.jar snappy-java-1.1.7.3.jar zstd-jni-1.4.0-1.jar
Parent topic: Kafka Connect Handler Client Dependencies
8.2.8.2.9.7 Kafka 2.2.1
kafka-clients-2.2.1.jar lz4-java-1.5.0.jar slf4j-api-1.7.25.jar snappy-java-1.1.7.2.jar zstd-jni-1.3.8-1.jar
Parent topic: Kafka Connect Handler Client Dependencies
8.2.8.2.9.8 Kafka 2.1.1
audience-annotations-0.5.0.jar connect-api-2.1.1.jar connect-json-2.1.1.jar jackson-annotations-2.9.0.jar jackson-core-2.9.8.jar jackson-databind-2.9.8.jar javax.ws.rs-api-2.1.1.jar jopt-simple-5.0.4.jar kafka_2.12-2.1.1.jar kafka-clients-2.1.1.jar lz4-java-1.5.0.jar metrics-core-2.2.0.jar scala-library-2.12.7.jar scala-logging_2.12-3.9.0.jar scala-reflect-2.12.7.jar slf4j-api-1.7.25.jar snappy-java-1.1.7.2.jar zkclient-0.11.jar zookeeper-3.4.13.jar zstd-jni-1.3.7-1.jar
Parent topic: Kafka Connect Handler Client Dependencies
8.2.8.2.9.9 Kafka 2.0.1
audience-annotations-0.5.0.jar connect-api-2.0.1.jar connect-json-2.0.1.jar jackson-annotations-2.9.0.jar jackson-core-2.9.7.jar jackson-databind-2.9.7.jar javax.ws.rs-api-2.1.jar jopt-simple-5.0.4.jar kafka_2.12-2.0.1.jar kafka-clients-2.0.1.jar lz4-java-1.4.1.jar metrics-core-2.2.0.jar scala-library-2.12.6.jar scala-logging_2.12-3.9.0.jar scala-reflect-2.12.6.jar slf4j-api-1.7.25.jar snappy-java-1.1.7.1.jar zkclient-0.10.jar zookeeper-3.4.13.jar
Parent topic: Kafka Connect Handler Client Dependencies
8.2.8.2.9.10 Kafka 1.1.1
kafka-clients-1.1.1.jar lz4-java-1.4.1.jar slf4j-api-1.7.25.jar snappy-java-1.1.7.1.jar
Parent topic: Kafka Connect Handler Client Dependencies
8.2.8.2.9.11 Kafka 1.0.2
kafka-clients-1.0.2.jar lz4-java-1.4.jar slf4j-api-1.7.25.jar snappy-java-1.1.4.jar
Parent topic: Kafka Connect Handler Client Dependencies
8.2.8.2.9.12 Kafka 0.11.0.0
connect-api-0.11.0.0.jar connect-json-0.11.0.0.jar jackson-annotations-2.8.0.jar jackson-core-2.8.5.jar jackson-databind-2.8.5.jar jopt-simple-5.0.3.jar kafka_2.11-0.11.0.0.jar kafka-clients-0.11.0.0.jar log4j-1.2.17.jar lz4-1.3.0.jar metrics-core-2.2.0.jar scala-library-2.11.11.jar scala-parser-combinators_2.11-1.0.4.jar slf4j-api-1.7.25.jar slf4j-log4j12-1.7.25.jar snappy-java-1.1.2.6.jar zkclient-0.10.jar zookeeper-3.4.10.jar
Parent topic: Kafka Connect Handler Client Dependencies
8.2.8.2.9.13 Kafka 0.10.2.0
connect-api-0.10.2.0.jar connect-json-0.10.2.0.jar jackson-annotations-2.8.0.jar jackson-core-2.8.5.jar jackson-databind-2.8.5.jar jopt-simple-5.0.3.jar kafka_2.11-0.10.2.0.jar kafka-clients-0.10.2.0.jar log4j-1.2.17.jar lz4-1.3.0.jar metrics-core-2.2.0.jar scala-library-2.11.8.jar scala-parser-combinators_2.11-1.0.4.jar slf4j-api-1.7.21.jar slf4j-log4j12-1.7.21.jar snappy-java-1.1.2.6.jar zkclient-0.10.jar zookeeper-3.4.9.jar
Parent topic: Kafka Connect Handler Client Dependencies
8.2.8.2.9.14 Kafka 0.10.1.1
connect-api-0.10.1.1.jar connect-json-0.10.1.1.jar jackson-annotations-2.6.0.jar jackson-core-2.6.3.jar jackson-databind-2.6.3.jar jline-0.9.94.jar jopt-simple-4.9.jar kafka_2.11-0.10.1.1.jar kafka-clients-0.10.1.1.jar log4j-1.2.17.jar lz4-1.3.0.jar metrics-core-2.2.0.jar netty-3.7.0.Final.jar scala-library-2.11.8.jar scala-parser-combinators_2.11-1.0.4.jar slf4j-api-1.7.21.jar slf4j-log4j12-1.7.21.jar snappy-java-1.1.2.6.jar zkclient-0.9.jar zookeeper-3.4.8.jar
Parent topic: Kafka Connect Handler Client Dependencies
8.2.8.2.9.15 Kafka 0.10.0.0
activation-1.1.jar connect-api-0.10.0.0.jar connect-json-0.10.0.0.jar jackson-annotations-2.6.0.jar jackson-core-2.6.3.jar jackson-databind-2.6.3.jar jline-0.9.94.jar jopt-simple-4.9.jar junit-3.8.1.jar kafka_2.11-0.10.0.0.jar kafka-clients-0.10.0.0.jar log4j-1.2.15.jar lz4-1.3.0.jar mail-1.4.jar metrics-core-2.2.0.jar netty-3.7.0.Final.jar scala-library-2.11.8.jar scala-parser-combinators_2.11-1.0.4.jar slf4j-api-1.7.21.jar slf4j-log4j12-1.7.21.jar snappy-java-1.1.2.4.jar zkclient-0.8.jar zookeeper-3.4.6.jar
Parent topic: Kafka Connect Handler Client Dependencies
8.2.8.2.9.16 Kafka 0.9.0.1
activation-1.1.jar connect-api-0.9.0.1.jar connect-json-0.9.0.1.jar jackson-annotations-2.5.0.jar jackson-core-2.5.4.jar jackson-databind-2.5.4.jar jline-0.9.94.jar jopt-simple-3.2.jar junit-3.8.1.jar kafka_2.11-0.9.0.1.jar kafka-clients-0.9.0.1.jar log4j-1.2.15.jar lz4-1.2.0.jar mail-1.4.jar metrics-core-2.2.0.jar netty-3.7.0.Final.jar scala-library-2.11.7.jar scala-parser-combinators_2.11-1.0.4.jar scala-xml_2.11-1.0.4.jar slf4j-api-1.7.6.jar slf4j-log4j12-1.7.6.jar snappy-java-1.1.1.7.jar zkclient-0.7.jar zookeeper-3.4.6.jar
8.2.8.2.9.16.1 Confluent Dependencies
Note:
The Confluent dependencies listed below are for the Kafka Connect Avro Converter and the associated Avro Schema Registry client. When integrating with Confluent Kafka Connect, the following dependencies are required in addition to the Kafka Connect dependencies for the corresponding Kafka version, which are listed in the previous sections.
- Confluent 6.2.0
- Confluent 6.1.0
- Confluent 6.0.0
- Confluent 5.5.0
- Confluent 5.4.0
- Confluent 5.3.0
- Confluent 5.2.1
- Confluent 5.1.3
- Confluent 5.0.3
- Confluent 4.1.2
Parent topic: Kafka 0.9.0.1
8.2.8.2.9.16.1.1 Confluent 6.2.0
avro-1.10.1.jar commons-compress-1.20.jar common-utils-6.2.0.jar connect-api-6.2.0-ccs.jar connect-json-6.2.0-ccs.jar jackson-annotations-2.10.5.jar jackson-core-2.11.3.jar jackson-databind-2.10.5.1.jar jackson-datatype-jdk8-2.10.5.jar jakarta.annotation-api-1.3.5.jar jakarta.inject-2.6.1.jar jakarta.ws.rs-api-2.1.6.jar javax.ws.rs-api-2.1.1.jar jersey-common-2.34.jar kafka-avro-serializer-6.2.0.jar kafka-clients-6.2.0-ccs.jar kafka-connect-avro-converter-6.2.0.jar kafka-connect-avro-data-6.2.0.jar kafka-schema-registry-client-6.2.0.jar kafka-schema-serializer-6.2.0.jar lz4-java-1.7.1.jar osgi-resource-locator-1.0.3.jar slf4j-api-1.7.30.jar snappy-java-1.1.8.1.jar swagger-annotations-1.6.2.jar zstd-jni-1.4.9-1.jar
Parent topic: Confluent Dependencies
8.2.8.2.9.16.1.2 Confluent 6.1.0
avro-1.9.2.jar commons-compress-1.19.jar common-utils-6.1.0.jar connect-api-6.1.0-ccs.jar connect-json-6.1.0-ccs.jar jackson-annotations-2.10.5.jar jackson-core-2.10.2.jar jackson-databind-2.10.5.1.jar jackson-datatype-jdk8-2.10.5.jar jakarta.annotation-api-1.3.5.jar jakarta.inject-2.6.1.jar jakarta.ws.rs-api-2.1.6.jar javax.ws.rs-api-2.1.1.jar jersey-common-2.31.jar kafka-avro-serializer-6.1.0.jar kafka-clients-6.1.0-ccs.jar kafka-connect-avro-converter-6.1.0.jar kafka-connect-avro-data-6.1.0.jar kafka-schema-registry-client-6.1.0.jar kafka-schema-serializer-6.1.0.jar lz4-java-1.7.1.jar osgi-resource-locator-1.0.3.jar slf4j-api-1.7.30.jar snappy-java-1.1.7.7.jar swagger-annotations-1.6.2.jar zstd-jni-1.4.5-6.jar
Parent topic: Confluent Dependencies
8.2.8.2.9.16.1.3 Confluent 6.0.0
avro-1.9.2.jar commons-compress-1.19.jar common-utils-6.0.0.jar connect-api-6.0.0-ccs.jar connect-json-6.0.0-ccs.jar jackson-annotations-2.10.5.jar jackson-core-2.10.2.jar jackson-databind-2.10.5.jar jackson-datatype-jdk8-2.10.5.jar jakarta.annotation-api-1.3.5.jar jakarta.inject-2.6.1.jar jakarta.ws.rs-api-2.1.6.jar javax.ws.rs-api-2.1.1.jar jersey-common-2.30.jar kafka-avro-serializer-6.0.0.jar kafka-clients-6.0.0-ccs.jar kafka-connect-avro-converter-6.0.0.jar kafka-connect-avro-data-6.0.0.jar kafka-schema-registry-client-6.0.0.jar kafka-schema-serializer-6.0.0.jar lz4-java-1.7.1.jar osgi-resource-locator-1.0.3.jar slf4j-api-1.7.30.jar snappy-java-1.1.7.3.jar swagger-annotations-1.6.2.jar zstd-jni-1.4.4-7.jar
Parent topic: Confluent Dependencies
8.2.8.2.9.16.1.4 Confluent 5.5.0
avro-1.9.2.jar classmate-1.3.4.jar common-config-5.5.0.jar commons-compress-1.19.jar commons-lang3-3.2.1.jar common-utils-5.5.0.jar connect-api-5.5.0-ccs.jar connect-json-5.5.0-ccs.jar guava-18.0.jar hibernate-validator-6.0.17.Final.jar jackson-annotations-2.10.2.jar jackson-core-2.10.2.jar jackson-databind-2.10.2.jar jackson-dataformat-yaml-2.4.5.jar jackson-datatype-jdk8-2.10.2.jar jackson-datatype-joda-2.4.5.jar jakarta.annotation-api-1.3.5.jar jakarta.el-3.0.2.jar jakarta.el-api-3.0.3.jar jakarta.inject-2.6.1.jar jakarta.validation-api-2.0.2.jar jakarta.ws.rs-api-2.1.6.jar javax.ws.rs-api-2.1.1.jar jboss-logging-3.3.2.Final.jar jersey-bean-validation-2.30.jar jersey-client-2.30.jar jersey-common-2.30.jar jersey-media-jaxb-2.30.jar jersey-server-2.30.jar joda-time-2.2.jar kafka-avro-serializer-5.5.0.jar kafka-clients-5.5.0-ccs.jar kafka-connect-avro-converter-5.5.0.jar kafka-connect-avro-data-5.5.0.jar kafka-schema-registry-client-5.5.0.jar kafka-schema-serializer-5.5.0.jar lz4-java-1.7.1.jar osgi-resource-locator-1.0.3.jar slf4j-api-1.7.30.jar snakeyaml-1.12.jar snappy-java-1.1.7.3.jar swagger-annotations-1.5.22.jar swagger-core-1.5.3.jar swagger-models-1.5.3.jar zstd-jni-1.4.4-7.jar
Parent topic: Confluent Dependencies
8.2.8.2.9.16.1.5 Confluent 5.4.0
avro-1.9.1.jar common-config-5.4.0.jar commons-compress-1.19.jar commons-lang3-3.2.1.jar common-utils-5.4.0.jar connect-api-5.4.0-ccs.jar connect-json-5.4.0-ccs.jar guava-18.0.jar jackson-annotations-2.9.10.jar jackson-core-2.9.9.jar jackson-databind-2.9.10.1.jar jackson-dataformat-yaml-2.4.5.jar jackson-datatype-jdk8-2.9.10.jar jackson-datatype-joda-2.4.5.jar javax.ws.rs-api-2.1.1.jar joda-time-2.2.jar kafka-avro-serializer-5.4.0.jar kafka-clients-5.4.0-ccs.jar kafka-connect-avro-converter-5.4.0.jar kafka-schema-registry-client-5.4.0.jar lz4-java-1.6.0.jar slf4j-api-1.7.28.jar snakeyaml-1.12.jar snappy-java-1.1.7.3.jar swagger-annotations-1.5.22.jar swagger-core-1.5.3.jar swagger-models-1.5.3.jar zstd-jni-1.4.3-1.jar
Parent topic: Confluent Dependencies
8.2.8.2.9.16.1.6 Confluent 5.3.0
audience-annotations-0.5.0.jar avro-1.8.1.jar common-config-5.3.0.jar commons-compress-1.8.1.jar common-utils-5.3.0.jar connect-api-5.3.0-ccs.jar connect-json-5.3.0-ccs.jar jackson-annotations-2.9.0.jar jackson-core-2.9.9.jar jackson-core-asl-1.9.13.jar jackson-databind-2.9.9.jar jackson-datatype-jdk8-2.9.9.jar jackson-mapper-asl-1.9.13.jar javax.ws.rs-api-2.1.1.jar jline-0.9.94.jar jsr305-3.0.2.jar kafka-avro-serializer-5.3.0.jar kafka-clients-5.3.0-ccs.jar kafka-connect-avro-converter-5.3.0.jar kafka-schema-registry-client-5.3.0.jar lz4-java-1.6.0.jar netty-3.10.6.Final.jar paranamer-2.7.jar slf4j-api-1.7.26.jar snappy-java-1.1.1.3.jar spotbugs-annotations-3.1.9.jar xz-1.5.jar zkclient-0.10.jar zookeeper-3.4.14.jar zstd-jni-1.4.0-1.jar
Parent topic: Confluent Dependencies
8.2.8.2.9.16.1.7 Confluent 5.2.1
audience-annotations-0.5.0.jar avro-1.8.1.jar common-config-5.2.1.jar commons-compress-1.8.1.jar common-utils-5.2.1.jar connect-api-2.2.0-cp2.jar connect-json-2.2.0-cp2.jar jackson-annotations-2.9.0.jar jackson-core-2.9.8.jar jackson-core-asl-1.9.13.jar jackson-databind-2.9.8.jar jackson-datatype-jdk8-2.9.8.jar jackson-mapper-asl-1.9.13.jar javax.ws.rs-api-2.1.1.jar jline-0.9.94.jar kafka-avro-serializer-5.2.1.jar kafka-clients-2.2.0-cp2.jar kafka-connect-avro-converter-5.2.1.jar kafka-schema-registry-client-5.2.1.jar lz4-java-1.5.0.jar netty-3.10.6.Final.jar paranamer-2.7.jar slf4j-api-1.7.25.jar snappy-java-1.1.1.3.jar xz-1.5.jar zkclient-0.10.jar zookeeper-3.4.13.jar zstd-jni-1.3.8-1.jar
Parent topic: Confluent Dependencies
8.2.8.2.9.16.1.8 Confluent 5.1.3
audience-annotations-0.5.0.jar avro-1.8.1.jar common-config-5.1.3.jar commons-compress-1.8.1.jar common-utils-5.1.3.jar connect-api-2.1.1-cp3.jar connect-json-2.1.1-cp3.jar jackson-annotations-2.9.0.jar jackson-core-2.9.8.jar jackson-core-asl-1.9.13.jar jackson-databind-2.9.8.jar jackson-mapper-asl-1.9.13.jar javax.ws.rs-api-2.1.1.jar jline-0.9.94.jar kafka-avro-serializer-5.1.3.jar kafka-clients-2.1.1-cp3.jar kafka-connect-avro-converter-5.1.3.jar kafka-schema-registry-client-5.1.3.jar lz4-java-1.5.0.jar netty-3.10.6.Final.jar paranamer-2.7.jar slf4j-api-1.7.25.jar snappy-java-1.1.1.3.jar xz-1.5.jar zkclient-0.10.jar zookeeper-3.4.13.jar zstd-jni-1.3.7-1.jar
Parent topic: Confluent Dependencies
8.2.8.2.9.16.1.9 Confluent 5.0.3
audience-annotations-0.5.0.jar avro-1.8.1.jar common-config-5.0.3.jar commons-compress-1.8.1.jar common-utils-5.0.3.jar connect-api-2.0.1-cp4.jar connect-json-2.0.1-cp4.jar jackson-annotations-2.9.0.jar jackson-core-2.9.7.jar jackson-core-asl-1.9.13.jar jackson-databind-2.9.7.jar jackson-mapper-asl-1.9.13.jar javax.ws.rs-api-2.1.jar jline-0.9.94.jar kafka-avro-serializer-5.0.3.jar kafka-clients-2.0.1-cp4.jar kafka-connect-avro-converter-5.0.3.jar kafka-schema-registry-client-5.0.3.jar lz4-java-1.4.1.jar netty-3.10.6.Final.jar paranamer-2.7.jar slf4j-api-1.7.25.jar snappy-java-1.1.1.3.jar xz-1.5.jar zkclient-0.10.jar zookeeper-3.4.13.jar
Parent topic: Confluent Dependencies
8.2.8.2.9.16.1.10 Confluent 4.1.2
avro-1.8.1.jar common-config-4.1.2.jar commons-compress-1.8.1.jar common-utils-4.1.2.jar connect-api-1.1.1-cp1.jar connect-json-1.1.1-cp1.jar jackson-annotations-2.9.0.jar jackson-core-2.9.6.jar jackson-core-asl-1.9.13.jar jackson-databind-2.9.6.jar jackson-mapper-asl-1.9.13.jar jline-0.9.94.jar kafka-avro-serializer-4.1.2.jar kafka-clients-1.1.1-cp1.jar kafka-connect-avro-converter-4.1.2.jar kafka-schema-registry-client-4.1.2.jar log4j-1.2.16.jar lz4-java-1.4.1.jar netty-3.10.5.Final.jar paranamer-2.7.jar slf4j-api-1.7.25.jar slf4j-log4j12-1.6.1.jar snappy-java-1.1.1.3.jar xz-1.5.jar zkclient-0.10.jar zookeeper-3.4.10.jar
Parent topic: Confluent Dependencies
8.2.8.3 Apache Kafka REST Proxy
The Kafka REST Proxy Handler streams messages to the Kafka REST Proxy distributed by Confluent.
This section describes how to use the Kafka REST Proxy Handler.
- Overview
- Setting Up and Starting the Kafka REST Proxy Handler Services
- Consuming the Records
- Performance Considerations
- Kafka REST Proxy Handler Metacolumns Template Property
Parent topic: Apache Kafka
8.2.8.3.1 Overview
The Kafka REST Proxy Handler allows Kafka messages to be streamed using the HTTPS protocol. The use case for this functionality is to stream Kafka messages from an Oracle GoldenGate On Premises installation to the cloud, or alternatively from cloud to cloud.
The Kafka REST proxy provides a RESTful interface to a Kafka cluster. It makes it easy for you to:
- produce and consume messages,
- view the state of the cluster,
- and perform administrative actions without using the native Kafka protocol or clients.
Kafka REST Proxy is part of the Confluent Open Source and Confluent Enterprise distributions. It is not available in the Apache Kafka distribution. To access Kafka through the REST proxy, you must install a Confluent Kafka distribution. See https://docs.confluent.io/current/kafka-rest/docs/index.html.
Parent topic: Apache Kafka REST Proxy
8.2.8.3.2 Setting Up and Starting the Kafka REST Proxy Handler Services
You have several installation formats to choose from including ZIP or tar archives, Docker, and Packages.
- Using the Kafka REST Proxy Handler
- Downloading the Dependencies
- Classpath Configuration
- Kafka REST Proxy Handler Configuration
- Review a Sample Configuration
- Security
- Generating a Keystore or Truststore
- Using Templates to Resolve the Topic Name and Message Key
- Kafka REST Proxy Handler Formatter Properties
Parent topic: Apache Kafka REST Proxy
8.2.8.3.2.1 Using the Kafka REST Proxy Handler
You must download and install the Confluent Open Source or Confluent Enterprise Distribution because the Kafka REST Proxy is not included in Apache, Cloudera, or Hortonworks. You have several installation formats to choose from including ZIP or TAR archives, Docker, and Packages.
The Kafka REST Proxy has dependencies on ZooKeeper, Kafka, and the Schema Registry.
8.2.8.3.2.2 Downloading the Dependencies
You can review and download the Jersey RESTful Web Services in Java client dependency from:
https://eclipse-ee4j.github.io/jersey/.
You can review and download the Jersey Apache Connector dependencies from the maven repository: https://mvnrepository.com/artifact/org.glassfish.jersey.connectors/jersey-apache-connector.
8.2.8.3.2.3 Classpath Configuration
The Kafka REST Proxy handler uses the Jersey project jersey-client
version 2.27 and jersey-connectors-apache
version 2.27 to connect to Kafka. Oracle GoldenGate for Big Data does not include the required dependencies so you must obtain them, see Downloading the Dependencies.
You have to configure these dependencies using the gg.classpath
property in the Java Adapter properties file. This is an example of a correctly configured classpath for the Kafka REST Proxy Handler:
gg.classpath=dirprm:{path_to_jersey_client_jars}/jaxrs-ri/lib/*:{path_to_jersey_client_jars}/jaxrs-ri/api/*:{path_to_jersey_client_jars}/jaxrs-ri/ext/*:{path_to_jersey_client_jars}/connector/*
8.2.8.3.2.4 Kafka REST Proxy Handler Configuration
The following are the configurable values for the Kafka REST Proxy Handler. Oracle recommends that you store the Kafka REST Proxy properties file in the Oracle GoldenGate dirprm directory.
To enable the selection of the Kafka REST Proxy Handler, you must first configure the handler type by specifying gg.handler.name.type=kafkarestproxy
and the other Kafka REST Proxy Handler properties as follows:
Properties | Required/ Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
|
Required |
|
None |
The configuration to select the Kafka REST Proxy Handler. |
|
Required |
A template string value to resolve the Kafka topic name at runtime. |
None |
See Using Templates to Resolve the Topic Name and Message Key. |
|
Required |
A template string value to resolve the Kafka message key at runtime. |
None |
See Using Templates to Resolve the Topic Name and Message Key. |
|
Required |
The Listener address of the Rest Proxy. |
None |
Set to the URL of the Kafka REST proxy. |
|
Required |
|
None |
Set to the REST proxy payload data format |
|
Optional |
A value representing the payload size in mega bytes. |
|
Set to the maximum size of the payload of the HTTP messages. |
|
Optional |
|
|
Sets the API version to use. |
|
Optional |
|
|
Sets how operations are processed. In |
|
Optional |
Path to the truststore. |
None |
Path to the truststore file that holds certificates from trusted certificate authorities (CA). These CAs are used to verify certificates presented by the server during an SSL connection, see Generating a Keystore or Truststore. |
|
Optional |
Password of the truststore. |
None |
The truststore password. |
|
Optional |
Path to the keystore. |
None |
Path to the keystore file that the private key and identity certificate, which are presented to other parties (server or client) to verify its identity, see Generating a Keystore or Truststore. |
|
Optional |
Password of the keystore. |
None |
The keystore password. |
|
Optional |
|
None |
Proxy URL in the following format: |
|
Optional |
Any string. |
None |
The proxy user name. |
|
Optional |
Any string. |
None |
The proxy password. |
|
Optional |
Integer value. |
None |
The amount of time allowed for the server to respond. |
|
Optional |
Integer value. |
None |
The amount of time to wait to establish the connection to the host. |
gg.handler.name.format.metaColumnsTemplate |
Optional |
|
|
None |
|
${optype}, ${token.ROWID}, ${sys.username}, ${currenttimestamp} |
See Using Templates to Resolve the Topic Name and Message Key for more information.
8.2.8.3.2.5 Review a Sample Configuration
The following is a sample configuration for the Kafka REST Proxy Handler from the Java Adapter properties file:
gg.handlerlist=kafkarestproxy
#The handler properties
gg.handler.kafkarestproxy.type=kafkarestproxy
#The following selects the topic name based on the fully qualified table name
gg.handler.kafkarestproxy.topicMappingTemplate=${fullyQualifiedTableName}
#The following selects the message key using the concatenated primary keys
gg.handler.kafkarestproxy.keyMappingTemplate=${primaryKeys}
gg.handler.kafkarestproxy.postDataUrl=http://localhost:8083
gg.handler.kafkarestproxy.apiVersion=v1
gg.handler.kafkarestproxy.format=json
gg.handler.kafkarestproxy.payloadsize=1
gg.handler.kafkarestproxy.mode=tx
#Server auth properties
#gg.handler.kafkarestproxy.trustStore=/keys/truststore.jks
#gg.handler.kafkarestproxy.trustStorePassword=test1234
#Client auth properties
#gg.handler.kafkarestproxy.keyStore=/keys/keystore.jks
#gg.handler.kafkarestproxy.keyStorePassword=test1234
#Proxy properties
#gg.handler.kafkarestproxy.proxy=http://proxyurl:80
#gg.handler.kafkarestproxy.proxyUserName=username
#gg.handler.kafkarestproxy.proxyPassword=password
#The MetaColumnTemplate formatter properties
gg.handler.kafkarestproxy.format.metaColumnsTemplate=${optype},${timestampmicro},${currenttimestampmicro}
8.2.8.3.2.6 Security
Security is possible between the following:
-
Kafka REST Proxy clients and the Kafka REST Proxy server. The Oracle GoldenGate REST Proxy Handler is a Kafka REST Proxy client.
-
The Kafka REST Proxy server and Kafka Brokers. Oracle recommends that you thoroughly review the security documentation and configuration of the Kafka REST Proxy server, see https://docs.confluent.io/current/kafka-rest/docs/index.html
REST Proxy supports SSL for securing communication between clients and the Kafka REST Proxy Handler. To configure SSL:
-
Generate a keystore using the scripts, see Generating a Keystore or Truststore.
-
Update the Kafka REST Proxy server configuration in the kafka-rest.properties file with these properties:
listeners=https://hostname:8083
confluent.rest.auth.propagate.method=SSL
# Configuration Options for HTTPS
ssl.client.auth=true
ssl.keystore.location={keystore_file_path}/server.keystore.jks
ssl.keystore.password=test1234
ssl.key.password=test1234
ssl.truststore.location={keystore_file_path}/server.truststore.jks
ssl.truststore.password=test1234
ssl.keystore.type=JKS
ssl.truststore.type=JKS
ssl.enabled.protocols=TLSv1.2,TLSv1.1,TLSv1
-
Restart your server.
To disable mutual authentication, update the ssl.client.auth property from true to false.
8.2.8.3.2.7 Generating a Keystore or Truststore
Generating a Truststore
You execute this script to generate the ca-cert
, ca-key
, and truststore.jks
truststore files.
#!/bin/bash
PASSWORD=password
CLIENT_PASSWORD=password
VALIDITY=365
Then you generate a CA as in this example:
openssl req -new -x509 -keyout ca-key -out ca-cert -days $VALIDITY -passin pass:$PASSWORD -passout pass:$PASSWORD -subj "/C=US/ST=CA/L=San Jose/O=Company/OU=Org/CN=FQDN" -nodes
Lastly, you add the CA to the server's truststore using keytool
:
keytool -keystore truststore.jks -alias CARoot -import -file ca-cert -storepass $PASSWORD -keypass $PASSWORD
Generating a Keystore
You run this script and pass the fqdn
as argument to generate the ca-cert.srl
, cert-file
, cert-signed
, and keystore.jks
keystore files.
#!/bin/bash
PASSWORD=password
VALIDITY=365

if [ $# -lt 1 ]; then
  echo "`basename $0` host fqdn|user_name|app_name"
  exit 1
fi

CNAME=$1
ALIAS=`echo $CNAME|cut -f1 -d"."`
Then you generate the keystore with keytool
as in this example:
keytool -noprompt -keystore keystore.jks -alias $ALIAS -keyalg RSA -validity $VALIDITY -genkey -dname "CN=$CNAME,OU=BDP,O=Company,L=San Jose,S=CA,C=US" -storepass $PASSWORD -keypass $PASSWORD
Next, you sign all the certificates in the keystore with the CA:
keytool -keystore keystore.jks -alias $ALIAS -certreq -file cert-file -storepass $PASSWORD
openssl x509 -req -CA ca-cert -CAkey ca-key -in cert-file -out cert-signed -days $VALIDITY -CAcreateserial -passin pass:$PASSWORD
Lastly, you import both the CA and the signed certificate into the keystore:
keytool -keystore keystore.jks -alias CARoot -import -file ca-cert -storepass $PASSWORD
keytool -keystore keystore.jks -alias $ALIAS -import -file cert-signed -storepass $PASSWORD
8.2.8.3.2.8 Using Templates to Resolve the Topic Name and Message Key
The Kafka REST Proxy Handler provides functionality to resolve the topic name and the message key at runtime using a template configuration value. Templates allow you to configure static values and keywords. Keywords are used to dynamically replace the keyword with the context of the current processing. The templates use the following configuration properties:
gg.handler.name.topicMappingTemplate
gg.handler.name.keyMappingTemplate
Template Modes
The Kafka REST Proxy Handler can be configured to send one message per operation (insert, update, delete). Alternatively, it can be configured to group operations into messages at the transaction level.
Example Templates
The following describes example template configuration values and the resolved values.
Example Template | Resolved Value |
---|---|
|
|
|
|
|
|
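As a hedged illustration, using keywords that appear elsewhere in this document and assuming a source table QASOURCE.TCUSTMER, templates resolve along these lines:
${fullyQualifiedTableName}   resolves to QASOURCE.TCUSTMER
prefix_${tableName}          resolves to prefix_TCUSTMER
${primaryKeys}               resolves to the concatenated primary key values of the current operation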
8.2.8.3.2.9 Kafka REST Proxy Handler Formatter Properties
The following are the configurable values for the Kafka REST Proxy Handler Formatter.
Table 8-12 Kafka REST Proxy Handler Formatter Properties
Properties | Required/Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.handler.name.format.includeOpType |
Optional |
|
|
Set to Set to |
gg.handler.name.format.includeOpTimestamp |
Optional |
|
|
Set to Set to |
gg.handler.name.format.includeCurrentTimestamp |
Optional |
|
|
Set to Set to |
gg.handler.name.format.includePosition |
Optional |
|
|
Set to Set to |
gg.handler.name.format.includePrimaryKeys |
Optional |
|
|
Set to Set to |
gg.handler.name.format.includeTokens |
Optional |
|
|
Set to Set to |
gg.handler.name.format.insertOpKey |
Optional |
Any string. |
|
The value of the field |
gg.handler.name.format.updateOpKey |
Optional |
Any string. |
|
The value of the field |
gg.handler.name.format.deleteOpKey |
Optional |
Any string. |
|
The value of the field |
gg.handler.name.format.truncateOpKey |
Optional |
Any string. |
|
The value of the field |
gg.handler.name.format.treatAllColumnsAsStrings |
Optional |
|
|
Set to Set to |
gg.handler.name.format.mapLargeNumbersAsStrings |
Optional |
|
|
Set to |
gg.handler.name.format.iso8601Format |
Optional |
|
|
Set to |
gg.handler.name.format.pkUpdateHandling |
Optional |
|
|
It is only applicable if you are modeling row messages with the |
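A short hedged example of these formatter properties, using the handler name kafkarestproxy from the earlier sample configuration (the values shown are illustrative only):
gg.handler.kafkarestproxy.format.insertOpKey=I
gg.handler.kafkarestproxy.format.updateOpKey=U
gg.handler.kafkarestproxy.format.deleteOpKey=D
gg.handler.kafkarestproxy.format.truncateOpKey=T
gg.handler.kafkarestproxy.format.includePrimaryKeys=true
gg.handler.kafkarestproxy.format.treatAllColumnsAsStrings=false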
8.2.8.3.3 Consuming the Records
A simple way to consume data from Kafka topics through the Kafka REST Proxy is to use curl.
Consume JSON Data
-
Create a consumer for JSON data.
curl -k -X POST -H "Content-Type: application/vnd.kafka.v2+json" https://localhost:8082/consumers/my_json_consumer
-
Subscribe to a topic.
curl -k -X POST -H "Content-Type: application/vnd.kafka.v2+json" --data '{"topics":["topicname"]}' \ https://localhost:8082/consumers/my_json_consumer/instances/my_consumer_instance/subscription
-
Consume records.
curl -k -X GET -H "Accept: application/vnd.kafka.json.v2+json" \
 https://localhost:8082/consumers/my_json_consumer/instances/my_consumer_instance/records
Consume Avro Data
-
Create a consumer for Avro data.
curl -k -X POST -H "Content-Type: application/vnd.kafka.v2+json" \ --data '{"name": "my_consumer_instance", "format": "avro", "auto.offset.reset": "earliest"}' \ https://localhost:8082/consumers/my_avro_consumer
-
Subscribe to a topic.
curl -k -X POST -H "Content-Type: application/vnd.kafka.v2+json" --data '{"topics":["topicname"]}' \
 https://localhost:8082/consumers/my_avro_consumer/instances/my_consumer_instance/subscription
-
Consume records.
curl -X GET -H "Accept: application/vnd.kafka.avro.v2+json" \ https://localhost:8082/consumers/my_avro_consumer/instances/my_consumer_instance/records
Note:
If you are using curl from the machine hosting the REST proxy, then unset the http_proxy environment variable before consuming the messages. If you are using curl from the local machine to get messages from the Kafka REST Proxy, then setting the http_proxy environment variable may be required.
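When you are finished consuming, the consumer instance can be deleted so that the REST proxy releases its resources. The following is a hedged example against the same endpoint used above; consult the Confluent REST Proxy documentation for the authoritative API:
curl -k -X DELETE -H "Content-Type: application/vnd.kafka.v2+json" \
 https://localhost:8082/consumers/my_json_consumer/instances/my_consumer_instance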
Parent topic: Apache Kafka REST Proxy
8.2.8.3.4 Performance Considerations
There are several configuration settings, both in the Oracle GoldenGate for Big Data configuration and in the Kafka producer, that affect performance.
The Oracle GoldenGate parameter that has the greatest effect on performance is the Replicat GROUPTRANSOPS parameter. It allows Replicat to group multiple source transactions into a single target transaction. At transaction commit, the Kafka REST Proxy Handler POSTs the data to the Kafka Producer.
Setting the Replicat GROUPTRANSOPS parameter to a larger value allows the Replicat to call POST less frequently, thereby improving performance. The default value for GROUPTRANSOPS is 1000, and performance can be improved by increasing the value to 2500, 5000, or even 10000.
Parent topic: Apache Kafka REST Proxy
8.2.8.3.5 Kafka REST Proxy Handler Metacolumns Template Property
Problems Starting Kafka REST Proxy server
The script that starts the Kafka REST Proxy server appends its CLASSPATH to the environment CLASSPATH variable. If set, the environment CLASSPATH can contain JAR files that conflict with the correct execution of the Kafka REST Proxy server and may prevent it from starting. Oracle recommends that you unset the CLASSPATH environment variable before starting your Kafka REST Proxy server, or reset the CLASSPATH to "" to overcome the problem.
Parent topic: Apache Kafka REST Proxy
8.2.9 Apache Hive
Integrating with Hive
The Oracle GoldenGate for Big Data release does not include a Hive storage handler because the HDFS Handler provides all of the necessary Hive functionality.
You can create a Hive integration to create tables and update table definitions in case of DDL events. This is limited to data formatted in Avro Object Container File format. For more information, see Writing in HDFS in Avro Object Container File Format and HDFS Handler Configuration.
For Hive to consume sequence files, the DDL must create Hive tables that include STORED as sequencefile. The following is a sample create table script:
CREATE EXTERNAL TABLE table_name (
col1 string,
...
...
col2 string)
ROW FORMAT DELIMITED
STORED as sequencefile
LOCATION '/path/to/hdfs/file';
Note:
If files are intended to be consumed by Hive, then the gg.handler.name.partitionByTable
property should be set to true
.
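A minimal sketch of the related Java Adapter properties, assuming a handler named hdfs (only the partitionByTable property comes from the note above; the other lines are standard handler selection):
gg.handlerlist=hdfs
gg.handler.hdfs.type=hdfs
gg.handler.hdfs.partitionByTable=true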
Parent topic: Target
8.2.10 Azure Blob Storage
Topics:
- Overview
- Prerequisites
- Storage Account, Container, and Objects
- Configuration
- Troubleshooting and Diagnostics
Parent topic: Target
8.2.10.1 Overview
Azure Blob Storage (ABS) is a service for storing objects in Azure cloud. It is highly scalable and is a secure object storage for cloud-native workloads, archives, data lakes, high-performance computing, and machine learning. You can use the Azure Blob Storage Event handler to load files generated by the File Writer handler into ABS.
Parent topic: Azure Blob Storage
8.2.10.2 Prerequisites
- Azure cloud account set up.
- Java Software Development Kit (SDK) for Azure Blob Storage.
Parent topic: Azure Blob Storage
8.2.10.3 Storage Account, Container, and Objects
- Storage Account: An Azure storage account contains all of your Azure Storage data objects: blobs, file shares, queues, tables, and disks.
- Container: A container organizes a set of blobs, similar to a directory in a file system. A storage account can include an unlimited number of containers, and a container can store an unlimited number of blobs.
- Objects/blobs: Objects or blobs are the individual pieces of data that you store in a storage account container.
Parent topic: Azure Blob Storage
8.2.10.4 Configuration
To enable the selection of the ABS Event Handler, you must first configure the Event Handler type by specifying gg.eventhandler.name.type=abs and the following ABS properties:
Properties | Required/Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.eventhandler.name.type |
Required | abs | None | Selects the ABS Event Handler for use with File Writer handler. |
gg.eventhandler.name.bucketMappingTemplate |
Required | A string with resolvable keywords and constants used to dynamically generate an Azure storage account container name. | None | The ABS Event handler creates a container with this name if it does not already exist. See https://docs.microsoft.com/en-us/rest/api/storageservices/naming-and-referencing-containers--blobs--and-metadata#container-names. For supported keywords, see Template Keywords. |
gg.eventhandler.name.pathMappingTemplate |
Required | A string with resolvable keywords and constants used to dynamically generate the path in the Azure storage account container to write the file. | None | Use keywords interlaced with constants to dynamically generate unique Azure storage account container path names at runtime. Sample path name: ogg/data/${groupName}/${fullyQualifiedTableName}. For supported keywords, see Template Keywords. |
gg.eventhandler.name.fileNameMappingTemplate |
Optional | A string with resolvable keywords and constants used to dynamically generate a file name for the Azure Blob object. | None | Use resolvable keywords and constants used to dynamically generate the Azure Blob object file name. If not set, the upstream file name is used. For supported keywords, see Template Keywords |
gg.eventhandler.name.finalizeAction |
Optional | none | delete |
none |
Set to none to leave the Azure Blob
data file in place on the finalize action. Set to
delete if you want to delete the Azure Blob
data file with the finalize action.
|
gg.eventhandler.name.eventHandler |
Optional | A unique string identifier cross referencing a child event handler. | No event handler configured. | Sets the downstream event handler that is invoked on the file roll event. |
gg.eventhandler.name.accountName |
Required | String | None | Azure storage account name. |
gg.eventhandler.name.accountKey |
Optional | String | None | Azure storage account key. |
gg.eventhandler.name.sasToken |
Optional | String | None | Sets a credential that uses a shared access signature (SAS) to authenticate to an Azure Service. |
gg.eventhandler.name.tenantId |
Optional | String | None | Sets the Azure tenant ID of the application. |
gg.eventhandler.name.clientId |
Optional | String | None | Sets the Azure client ID of the application. |
gg.eventhandler.name.clientSecret |
Optional | String | None | Sets the Azure client secret for the authentication. |
gg.eventhandler.name.accessTier |
Optional | Hot | Cool | Archive |
None | Sets the tier on an Azure blob/object. Azure storage offers different access tiers, allowing you to store blob object data in the most cost-effective manner. Available access tiers include Hot, Cool, and Archive. For more information, see https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-storage-tiers. |
gg.eventhandler.name.endpoint |
Optional | String |
https://<accountName>.blob.core.windows.net |
Sets the Azure Storage service endpoint. See Azure Government Cloud Configuration |
- Classpath Configuration
- Dependencies
- Authentication
- Proxy Configuration
- Sample Configuration
- Azure Government Cloud Configuration
Parent topic: Azure Blob Storage
8.2.10.4.1 Classpath Configuration
The ABS Event handler uses the Java SDK for Azure Blob Storage.
Note:
Ensure that the classpath includes the path to the Azure Blob Storage Java SDK.
Parent topic: Configuration
8.2.10.4.2 Dependencies
<dependencies>
    <dependency>
        <groupId>com.azure</groupId>
        <artifactId>azure-storage-blob</artifactId>
        <version>12.13.0</version>
    </dependency>
    <dependency>
        <groupId>com.azure</groupId>
        <artifactId>azure-identity</artifactId>
        <version>1.3.3</version>
    </dependency>
</dependencies>
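One hedged way to gather these jars and their transitive dependencies into a single directory is the Maven dependency plugin, run against a pom.xml that contains the dependencies above; the output directory is a placeholder chosen to match the Sample Configuration later in this topic:
mvn dependency:copy-dependencies -DoutputDirectory=/path/to/abs-deps
The resulting directory can then be referenced with a wildcard in gg.classpath, for example gg.classpath=/path/to/abs-deps/*.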
Parent topic: Configuration
8.2.10.4.3 Authentication
The ABS Event handler can authenticate to the Azure storage account using one of the following:
- accountKey
- sasToken
- tenantId, clientId, and clientSecret
accountKey has the highest precedence, followed by sasToken. If accountKey and sasToken are not set, then the tuple tenantId, clientId, and clientSecret is used.
Parent topic: Configuration
8.2.10.4.3.1 Azure Tenant ID, Client ID, and Client Secret
To locate the Azure tenant ID:
- Go to the Microsoft Azure portal.
- Select Azure Active Directory from the list on the left to view the Azure Active Directory panel.
- Select Properties in the Azure Active Directory panel to view the Azure Active Directory properties.
To locate the Azure client ID and client secret:
- Go to the Microsoft Azure portal.
- Select All Services from the list on the left to view the Azure Services Listing.
- Enter App into the filter command box and select App Registrations from the listed services.
- Select the App Registration you created to access Azure Storage.
Parent topic: Authentication
8.2.10.4.4 Proxy Configuration
When the process is run behind a proxy server, the jvm.bootoptions
property can be used to set proxy server configuration using well-known Java proxy
properties.
For example:
jvm.bootoptions=-Dhttps.proxyHost=some-proxy-address.com -Dhttps.proxyPort=80
-Djava.net.useSystemProxies=true
Parent topic: Configuration
8.2.10.4.5 Sample Configuration
#The ABS Event Handler
gg.eventhandler.abs.type=abs
gg.eventhandler.abs.pathMappingTemplate=${fullyQualifiedTableName}
#TODO: Edit the Azure Blob Storage container name
gg.eventhandler.abs.bucketMappingTemplate=<abs-container-name>
gg.eventhandler.abs.finalizeAction=none
#TODO: Edit the Azure storage account name.
gg.eventhandler.abs.accountName=<storage-account-name>
#TODO: Edit the Azure storage account key.
#gg.eventhandler.abs.accountKey=<storage-account-key>
#TODO: Edit the Azure shared access signature (SAS) to authenticate to an Azure Service.
#gg.eventhandler.abs.sasToken=<sas-token>
#TODO: Edit the tenant ID of the application.
gg.eventhandler.abs.tenantId=<azure-tenant-id>
#TODO: Edit the client ID of the application.
gg.eventhandler.abs.clientId=<azure-client-id>
#TODO: Edit the client secret for the authentication.
gg.eventhandler.abs.clientSecret=<azure-client-secret>
gg.classpath=/path/to/abs-deps/*
#TODO: Edit the proxy configuration.
#jvm.bootoptions=-Dhttps.proxyHost=some-proxy-address.com -Dhttps.proxyPort=80 -Djava.net.useSystemProxies=true
Parent topic: Configuration
8.2.10.4.6 Azure Government Cloud Configuration
Additional configuration is required if Oracle GoldenGate for Big Data has to replicate data to storage accounts that reside in an Azure Government cloud.
Set the environment variable AZURE_AUTHORITY_HOST and the configuration property gg.eventhandler.{name}.endpoint as per the following table:
Government cloud | AZURE_AUTHORITY_HOST | gg.eventhandler.{name}.endpoint |
---|---|---|
Azure US Government Cloud |
|
|
Azure German Cloud |
|
https://<storage-account-name>.blob.core.cloudapi.de |
Azure China Cloud |
https://login.chinacloudapi.cn |
https://<storage-account-name>.blob.core.chinacloudapi.cn |
The environment variable can be set in the replicat prm file using the Oracle
GoldenGate setenv
parameter.
Example:
setenv (AZURE_AUTHORITY_HOST = "https://login.microsoftonline.us")
Parent topic: Configuration
8.2.10.5 Troubleshooting and Diagnostics
- Error: Confidential Client is not supported in Cross Cloud request.
This indicates that the target Azure storage account resides in one of the Azure Government clouds. Set the required configuration as per Azure Government Cloud Configuration.
Parent topic: Azure Blob Storage
8.2.11 Azure Data Lake Storage
- Azure Data Lake Gen1 (ADLS Gen1)
Microsoft Azure Data Lake supports streaming data through the Hadoop client. Therefore, data files can be sent to Azure Data Lake using either the Oracle GoldenGate for Big Data Hadoop Distributed File System (HDFS) Handler or the File Writer Handler in conjunction with the HDFS Event Handler. - Azure Data Lake Gen2 using Hadoop Client and ABFS
Microsoft Azure Data Lake Gen 2 (using Hadoop Client and ABFS) supports streaming data via the Hadoop client. Therefore, data files can be sent to Azure Data Lake Gen 2 using either the Oracle GoldenGate for Big Data HDFS Handler or the File Writer Handler in conjunction with the HDFS Event Handler. - Azure Data Lake Gen2 using BLOB endpoint
Parent topic: Target
8.2.11.1 Azure Data Lake Gen1 (ADLS Gen1)
Microsoft Azure Data Lake supports streaming data through the Hadoop client. Therefore, data files can be sent to Azure Data Lake using either the Oracle GoldenGate for Big Data Hadoop Distributed File System (HDFS) Handler or the File Writer Handler in conjunction with the HDFS Event Handler.
The preferred mechanism for ingest to Microsoft Azure Data Lake is the File Writer Handler in conjunction with the HDFS Event Handler.
Use these steps to connect to Microsoft Azure Data Lake from Oracle GoldenGate for Big Data.
- Download Hadoop 2.9.1 from http://hadoop.apache.org/releases.html.
- Unzip the file in a temporary directory. For example,
/ggwork/hadoop/hadoop-2.9
. - Edit the
/ggwork/hadoop/hadoop-2.9/hadoop-env.sh
file in the directory.
- Add entries for the
JAVA_HOME
andHADOOP_CLASSPATH
environment variables:export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64 export HADOOP_CLASSPATH=/ggwork/hadoop/hadoop-2.9.1/share/hadoop/tools/lib/*:$HADOOP_CLASSPATH
This points to Java 8 and adds the
share/hadoop/tools/lib
to the Hadoop classpath. The library path is not in the variable by default and the required Azure libraries are in this directory. - Edit the
/ggwork/hadoop/hadoop-2.9.1/etc/hadoop/core-site.xml
file and add:<configuration> <property> <name>fs.adl.oauth2.access.token.provider.type</name> <value>ClientCredential</value> </property> <property> <name>fs.adl.oauth2.refresh.url</name> <value>Insert the Azure https URL here to obtain the access token</value> </property> <property> <name>fs.adl.oauth2.client.id</name> <value>Insert the client id here</value> </property> <property> <name>fs.adl.oauth2.credential</name> <value>Insert the password here</value> </property> <property> <name>fs.defaultFS</name> <value>adl://Account Name.azuredatalakestore.net</value> </property> </configuration>
- Open your firewall to connect to both the Azure URL to get the token and the Azure Data Lake URL. Or disconnect from your network or VPN. Access to Azure Data Lake does not currently support using a proxy server per the Apache Hadoop documentation.
- Use the Hadoop shell commands to prove connectivity to Azure Data Lake. For
example, in the 2.9.1 Hadoop installation directory, execute this command to get
a listing of the root HDFS
directory.
./bin/hadoop fs -ls /
- Verify connectivity to Azure Data Lake.
- Configure either the HDFS Handler or the File Writer Handler using the HDFS Event
Handler to push data to Azure Data Lake, see Flat Files. Oracle recommends that
you use the File Writer Handler with the HDFS Event Handler.
Setting the
gg.classpath
example:gg.classpath=/ggwork/hadoop/hadoop-2.9.1/share/hadoop/common/:/ggwork/hadoop/hadoop- 2.9.1/share/hadoop/common/lib/:/ggwork/hadoop/hadoop- 2.9.1/share/hadoop/hdfs/:/ggwork/hadoop/hadoop-2.9.1/share/hadoop/hdfs/lib/:/ggwork/hadoop/hadoop- 2.9.1/etc/hadoop:/ggwork/hadoop/hadoop-2.9.1/share/hadoop/tools/lib/*
See https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html.
Parent topic: Azure Data Lake Storage
8.2.11.2 Azure Data Lake Gen2 using Hadoop Client and ABFS
Microsoft Azure Data Lake Gen 2 (using Hadoop Client and ABFS) supports streaming data via the Hadoop client. Therefore, data files can be sent to Azure Data Lake Gen 2 using either the Oracle GoldenGate for Big Data HDFS Handler or the File Writer Handler in conjunction with the HDFS Event Handler.
Hadoop 3.3.0 (or higher) is recommended for connectivity to Azure Data Lake Gen 2. Hadoop 3.3.0 contains an important fix to correctly fire Azure events on file close using the "abfss" scheme. For more information, see Hadoop Jira issue Hadoop-16182.
Use the File Writer Handler in conjunction with the HDFS Event Handler. This is the preferred mechanism for ingest to Azure Data Lake Gen 2.
Prerequisites
Part 1:
- Connectivity to Azure Data Lake Gen 2 assumes that you have correctly provisioned an Azure Data Lake Gen 2 account in the Azure portal.
From the Azure portal select Storage Accounts from the commands on the left to view/create/delete storage accounts.
In the Azure Data Lake Gen 2 provisioning process, it is recommended that the Hierarchical namespace is enabled in the Advanced tab.
It is not mandatory to enable the Hierarchical namespace for the Azure storage account.
- Ensure that you have created a Web app/API App Registration to
connect to the storage account.
From the Azure portal select All services from the list of commands on the left, type app into the filter command box and select App registrations from the filtered list of services. Create an App registration of type Web app/API.
Add permissions to access Azure Storage. Assign the App registration to an Azure account. Generate a Key for the App Registration as follows:The generated key string is your client secret and is only available at the time the key is created. Therefore, ensure you document the generated key string.- Navigate to the respective App registration page.
- On the left pane, select Certificates & secrets.
- Click + New client secret (This should show a new key under the column Value).
Part 2:
- In the Azure Data Lake Gen 2 account, ensure that the App Registration is given access.
In the Azure portal, select Storage accounts from the left panel. Select the Azure Data Lake Gen 2 account that you have created.
Select the Access Control (IAM) command to bring up the Access Control (IAM) panel. Select the Role Assignments tab and add a role assignment for the created App Registration.
The App Registration assigned to the storage account must be given read and write access to the Azure storage account.
You can use either of the following roles: the built-in Azure role Storage Blob Data Contributor, or a custom role with the required permissions.
- Connectivity to Azure Data Lake Gen 2 can be routed through a proxy server.
Three parameters need to be set in the Java boot options to enable proxy routing:
jvm.bootoptions=-Xmx512m -Xms32m -Djava.class.path=ggjava/ggjava.jar -DproxySet=true -Dhttps.proxyHost={insert your proxy server} -Dhttps.proxyPort={insert your proxy port}
- Two connectivity schemes to Azure Data Lake Gen 2 are supported: abfs and abfss. The preferred method is abfss because it uses HTTPS calls, thereby providing security and payload encryption.
Connecting to Microsoft Azure Data Lake 2
To connect to Microsoft Azure Data Lake 2 from Oracle GoldenGate for Big Data:
- Download Hadoop 3.3.0 from http://hadoop.apache.org/releases.html.
- Unzip the file in a temporary directory. For example, /usr/home/hadoop/hadoop-3.3.0.
- Edit the {hadoop install dir}/etc/hadoop/hadoop-env.sh file to point to Java 8 and add the Azure Hadoop libraries to the Hadoop classpath. These are entries in the hadoop-env.sh file:
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_202
export HADOOP_OPTIONAL_TOOLS="hadoop-azure"
- Private networks often require routing through a proxy server to access the public internet. Therefore, you may have to configure proxy server settings for the hadoop command line utility to test the connectivity to Azure. To configure proxy server settings, set the following in the hadoop-env.sh file:
export HADOOP_CLIENT_OPTS="-Dhttps.proxyHost={insert your proxy server} -Dhttps.proxyPort={insert your proxy port}"
Note:
These proxy settings only work for the hadoop command line utility. The proxy server settings for Oracle GoldenGate for Big Data connectivity to Azure are set in jvm.bootoptions, as described in the prerequisites.
- Edit the {hadoop install dir}/etc/hadoop/core-site.xml file and add the following configuration:
<configuration>
  <property>
    <name>fs.azure.account.auth.type</name>
    <value>OAuth</value>
  </property>
  <property>
    <name>fs.azure.account.oauth.provider.type</name>
    <value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value>
  </property>
  <property>
    <name>fs.azure.account.oauth2.client.endpoint</name>
    <value>https://login.microsoftonline.com/{insert the Azure Tenant id here}/oauth2/token</value>
  </property>
  <property>
    <name>fs.azure.account.oauth2.client.id</name>
    <value>{insert your client id here}</value>
  </property>
  <property>
    <name>fs.azure.account.oauth2.client.secret</name>
    <value>{insert your client secret here}</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>abfss://{insert your container name here}@{insert your ADL gen2 storage account name here}.dfs.core.windows.net</value>
  </property>
  <property>
    <name>fs.azure.createRemoteFileSystemDuringInitialization</name>
    <value>true</value>
  </property>
</configuration>
To obtain your Azure Tenant Id, go to the Microsoft Azure portal. Enter Azure Active Directory in the Search bar and select Azure Active Directory from the list of services. The Tenant Id is located in the center of the main Azure Active Directory service page.
To obtain your Azure Client Id and Client Secret go to the Microsoft Azure portal. Select All Services from the list on the left to view the Azure Services Listing. Type App into the filter command box and select App Registrations from the listed services. Select the App Registration that you have created to access Azure Storage. The Application Id displayed for the App Registration is the Client Id. The Client Secret is the generated key string when a new key is added. This generated key string is available only once when the key is created. If you do not know the generated key string, create another key making sure you capture the generated key string.
The ADL gen2 account name is the account name you generated when you created the Azure ADL gen2 account.
File systems are sub-partitions within an Azure Data Lake Gen 2 storage account. You can create and access new file systems on the fly, but only if the following Hadoop configuration is set:
<property>
  <name>fs.azure.createRemoteFileSystemDuringInitialization</name>
  <value>true</value>
</property>
- Verify connectivity using Hadoop shell commands:
./bin/hadoop fs -ls /
./bin/hadoop fs -mkdir /tmp
- Configure either the HDFS Handler or the File Writer Handler using the HDFS Event Handler to push data to Azure Data Lake, see Flat Files. Oracle recommends that you use the File Writer Handler with the HDFS Event Handler. A handler configuration sketch follows this procedure.
The following is an example of setting the gg.classpath:
gg.classpath=/ggwork/hadoop/hadoop-3.3.0/share/hadoop/common/*:/ggwork/hadoop/hadoop-3.3.0/share/hadoop/common/lib/*:/ggwork/hadoop/hadoop-3.3.0/share/hadoop/hdfs/*:/ggwork/hadoop/hadoop-3.3.0/share/hadoop/hdfs/lib/*:/ggwork/hadoop/hadoop-3.3.0/etc/hadoop/:/ggwork/hadoop/hadoop-3.3.0/share/hadoop/tools/lib/*
See https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html.
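The following is a minimal sketch of a File Writer Handler chained to the HDFS Event Handler for pushing data files into ADLS Gen2 through the abfss file system configured in core-site.xml. The handler names (filewriter, hdfs), format, templates, and paths are illustrative assumptions rather than a prescribed configuration; adapt them to your environment and see Flat Files for the full property reference.
gg.handlerlist=filewriter
gg.handler.filewriter.type=filewriter
gg.handler.filewriter.format=json
gg.handler.filewriter.pathMappingTemplate=./dirout
gg.handler.filewriter.stateFileDirectory=./dirsta
gg.handler.filewriter.fileRollInterval=3m
gg.handler.filewriter.finalizeAction=delete
gg.handler.filewriter.eventHandler=hdfs
gg.eventhandler.hdfs.type=hdfs
gg.eventhandler.hdfs.pathMappingTemplate=/ogg/${fullyQualifiedTableName}
# gg.classpath must include the Hadoop client jars and the directory containing core-site.xml, as shown above.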
Parent topic: Azure Data Lake Storage
8.2.11.3 Azure Data Lake Gen2 using BLOB endpoint
Oracle GoldenGate for Big Data can connect to ADLS Gen2 using the BLOB endpoint. Oracle GoldenGate for Big Data ADLS Gen2 replication using the BLOB endpoint does not require any Hadoop installation. For more information, see Azure Blob Storage.
Parent topic: Azure Data Lake Storage
8.2.12 Azure Event Hubs
The Kafka Handler supports connectivity to Microsoft Azure Event Hubs.
Connectivity to the Azure Event Hubs cannot be routed through a proxy server. Therefore, when you run Oracle GoldenGate for Big Data on premise to push data to Azure Event Hubs, you need to open your firewall to allow connectivity.
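Connectivity itself is configured in the Kafka producer properties file referenced by the Kafka Handler. The following is a hedged sketch using the Event Hubs Kafka-compatible endpoint with SASL; the namespace name and connection string are placeholders and the exact settings depend on your Event Hubs configuration.
bootstrap.servers={your-eventhubs-namespace}.servicebus.windows.net:9093
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="{your Event Hubs connection string}";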
Parent topic: Target
8.2.13 Azure Synapse Analytics
Microsoft Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing and big data analytics.
8.2.13.1 Detailed Functionality
Replication to Synapse uses the stage and merge data flow.
The change data is staged in a temporary location in micro-batches and eventually merged into the target table.
Azure Data Lake Storage (ADLS) Gen 2 is used as the staging area for change data.
The Synapse Event handler is used as a downstream Event handler connected to the output of the Parquet Event handler.
The Parquet Event handler loads files generated by the File Writer Handler into ADLS Gen2.
The Synapse Event handler executes SQL statements to merge the operation records staged in ADLS Gen2.
The SQL operations are performed in batches, providing better throughput.
Oracle GoldenGate for Big Data uses the MERGE SQL statement, or a combination of DELETE and INSERT SQL statements, to perform the merge operation.
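The exact SQL is generated internally by the Synapse Event handler, but conceptually the merge resembles the following sketch, in which the staged operations are exposed through an external table over the Parquet files in ADLS Gen2. The table, column, and staging names here are illustrative assumptions only.
MERGE INTO dbo.ORDERS t
USING stage.ORDERS_STG s
  ON t.ORDER_ID = s.ORDER_ID
WHEN MATCHED AND s.op_type = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET t.STATUS = s.STATUS, t.AMOUNT = s.AMOUNT
WHEN NOT MATCHED AND s.op_type <> 'D' THEN
  INSERT (ORDER_ID, STATUS, AMOUNT) VALUES (s.ORDER_ID, s.STATUS, s.AMOUNT);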
8.2.13.1.1 Database User Privileges
The database user used for replication has to be granted the following privileges, as illustrated in the sketch after this list:
- INSERT, UPDATE, DELETE, and TRUNCATE on the target tables.
- CREATE and DROP Synapse external file format.
- CREATE and DROP Synapse external data source.
- CREATE and DROP Synapse external table.
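A hedged example of granting these privileges follows. The user, schema, and table names are placeholders, and the exact permission names depend on your Synapse pool and security model (for example, TRUNCATE in Synapse dedicated SQL pools is covered by ALTER on the table or schema); your DBA may prefer roles instead of direct grants.
GRANT INSERT, UPDATE, DELETE ON dbo.ORDERS TO ogg_replicat_user;
GRANT ALTER ON SCHEMA::dbo TO ogg_replicat_user;            -- allows TRUNCATE on tables in the schema
GRANT CREATE TABLE TO ogg_replicat_user;                     -- external and staging tables
GRANT ALTER ANY EXTERNAL FILE FORMAT TO ogg_replicat_user;   -- create/drop external file formats
GRANT ALTER ANY EXTERNAL DATA SOURCE TO ogg_replicat_user;   -- create/drop external data sources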
Parent topic: Detailed Functionality
8.2.13.1.2 Merge SQL Statement
The merge SQL statement for Azure Synapse Analytics became generally available in the latter part of 2022, and therefore Oracle GoldenGate for Big Data uses the merge statement by default. To disable merge SQL, set the following Java System property in the jvm.bootoptions parameter:
jvm.bootoptions=-Dsynapse.use.merge.sql=false
Parent topic: Detailed Functionality
8.2.13.1.3 Prerequisites
The following are the prerequisites:
- Uncompressed UPDATE records: If Oracle GoldenGate is configured not to use the merge statement (see Merge SQL Statement), then it is mandatory that the trail files used to apply to Synapse contain uncompressed UPDATE operation records, which means that the UPDATE operations contain the full image of the row being updated. If UPDATE records have missing columns, then the Replicat will ABEND on detecting a compressed UPDATE trail record.
- If Oracle GoldenGate is configured to use the merge statement (see Merge SQL Statement), then the target table must be a hash distributed table.
- Target table existence: The target tables should exist on the Synapse database.
- Azure storage account: An Azure storage account and container should exist. Oracle recommends co-locating the Azure Synapse workspace and the Azure storage account in the same Azure region.
- If Oracle GoldenGate is configured to use the merge statement, then the target table cannot define IDENTITY columns, because the Synapse merge statement does not support inserting data into IDENTITY columns. For more information about the merge SQL statement, see Merge SQL Statement. A hash distributed table can be created as shown in the sketch after this list.
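The following is an illustrative sketch of creating a hash distributed target table in a Synapse dedicated SQL pool; the table, columns, and distribution key are assumptions for the example only.
CREATE TABLE dbo.ORDERS
(
    ORDER_ID  BIGINT NOT NULL,
    STATUS    NVARCHAR(20),
    AMOUNT    DECIMAL(18,2)
)
WITH
(
    DISTRIBUTION = HASH(ORDER_ID),
    CLUSTERED COLUMNSTORE INDEX
);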
Parent topic: Detailed Functionality
8.2.13.2 Configuration
- Automatic Configuration
- Synapse Database Credentials
- Classpath Configuration
- INSERTALLRECORDS Support
- Large Object (LOB) Performance
- End-to-End Configuration
- Compressed Update Handling
Parent topic: Azure Synapse Analytics
8.2.13.2.1 Automatic Configuration
Synapse replication involves configuration of multiple components, such as File Writer handler, Parquet Event handler, and Synapse Event handler.
The Automatic Configuration functionality helps to auto configure these components so that the user configuration is minimal.
The properties modified by auto configuration will also be logged in the handler log file.
To enable auto-configuration to replicate to a Synapse target, set the parameter gg.target=synapse.
When replicating to Synapse target, customization of Parquet Event handler name and Synapse Event handler name is not allowed.
- File Writer Handler Configuration
- Parquet Event Handler Configuration
- Synapse Event Handler Configuration
Parent topic: Configuration
8.2.13.2.1.1 File Writer Handler Configuration
The File Writer Handler name is pre-set to the value synapse. The following is an example of editing a property of the File Writer handler:
gg.handler.synapse.pathMappingTemplate=./dirout
Parent topic: Automatic Configuration
8.2.13.2.1.2 Parquet Event Handler Configuration
The Parquet Event Handler name is pre-set to the value parquet. The Parquet Event Handler is auto-configured to write to HDFS. The Hadoop configuration file core-site.xml must be configured to write data files to the respective container in the Azure Data Lake Storage (ADLS) Gen2 account. See Azure Data Lake Gen2 using Hadoop Client and ABFS.
The following is an example of editing a property of the Parquet Event handler:
gg.eventhandler.parquet.finalizeAction=delete
Parent topic: Automatic Configuration
8.2.13.2.1.3 Synapse Event Handler Configuration
The Synapse Event Handler name is pre-set to the value synapse.
Table 8-13 Synapse Event Handler Configuration

Properties | Required/Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.eventhandler.synapse.connectionURL | Required | jdbc:sqlserver://<synapse-workspace>.sql.azuresynapse.net:1433;database=<db-name>;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.sql.azuresynapse.net;loginTimeout=300; | None | JDBC URL to connect to Synapse. |
gg.eventhandler.synapse.UserName | Required | Database username. | None | Synapse database user in the Synapse workspace. The username has to be qualified with the Synapse workspace name. Example: sqladminuser@synapseworkspace. |
gg.eventhandler.synapse.Password | Required | Supported database string. | None | Synapse database password. |
gg.eventhandler.synapse.credential | Required | Credential name. | None | Synapse database credential name to access Azure Data Lake Gen2 files. See Synapse Database Credentials for steps to create the credential. |
gg.eventhandler.synapse.maxConnnections | Optional | Integer value | 10 | Use this parameter to control the number of concurrent JDBC database connections to the target Synapse database. |
gg.eventhandler.synapse.dropStagingTablesOnShutdown | Optional | true or false | false | If set to true, the temporary staging tables created by GoldenGate are dropped on Replicat graceful stop. |
gg.maxInlineLobSize | Optional | Integer value | 16000 | Sets the maximum inline size of large object (LOB) columns in bytes. For more information, see Large Object (LOB) Performance. |
gg.aggregate.operations.flush.interval | Optional | Integer | 30000 | Determines how often the data gets merged into Synapse. The value is set in milliseconds. Use this parameter with caution: increasing its default value increases the amount of data stored in the internal memory of the Replicat, which can cause out-of-memory errors and stop the Replicat. |
gg.operation.aggregator.validate.keyupdate | Optional | true or false | false | If set to true, the Operation Aggregator validates key update operations (optype 115) and corrects them to normal updates if no key values have changed. Compressed key update operations do not qualify for merge. |
gg.compressed.update | Optional | true or false | true | If set to true, this indicates that the source trail files contain compressed update operations. If set to false, the source trail files are expected to contain uncompressed update operations. |
gg.eventhandler.synapse.connectionRetryIntervalSeconds | Optional | Integer value | 30 | Specifies the delay in seconds between connection retry attempts. |
gg.eventhandler.synapse.connectionRetries | Optional | Integer value | 3 | Specifies the number of times connections to the target data warehouse will be retried. |
Parent topic: Automatic Configuration
8.2.13.2.2 Synapse Database Credentials
- Connect to the respective Synapse SQL dedicated pool using the Azure Web SQL console (https://web.azuresynapse.net/en-us/).
- Create a DB master key if one does not already exist, using your own password.
- Create a database scoped credential. This credential allows the Oracle GoldenGate Replicat process to access the Azure Storage Account. Provide the Azure Storage Account name and Access key when creating this credential. Storage Account Access keys can be retrieved from the Azure cloud console.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Your own password';
CREATE DATABASE SCOPED CREDENTIAL OGGBD_ADLS_credential
WITH
  -- IDENTITY = '<storage_account_name>',
  IDENTITY = 'sanavaccountuseast',
  -- SECRET = '<storage_account_key>'
  SECRET = 'c8C0yR-this-is-a-fake-access-key-Gc9c5mENOJ1mLyxlO1vSRDlRG0/Ke+tbAvi6xe73HAAhLtdMFZRA==';
Parent topic: Configuration
8.2.13.2.3 Classpath Configuration
Synapse Event handler relies on the upstream File Writer handler and the Parquet Event handler.
Parent topic: Configuration
8.2.13.2.3.1 Dependencies
- Microsoft SQL Server JDBC driver: The JDBC driver can be downloaded from Maven central using the following coordinates:
<dependency>
  <groupId>com.microsoft.sqlserver</groupId>
  <artifactId>mssql-jdbc</artifactId>
  <version>8.4.1.jre8</version>
  <scope>provided</scope>
</dependency>
Alternatively, the driver can be downloaded using the dependency downloader script <OGGDIR>/DependencyDownloader/synapse.sh.
- Parquet Event handler dependencies: See Parquet Event Handler Configuration to configure the classpath to include Parquet dependencies.
Parent topic: Classpath Configuration
8.2.13.2.3.2 Classpath
Edit the gg.classpath configuration parameter to include the path to the Parquet Event Handler dependencies and the Synapse JDBC driver.
gg.classpath=./synapse-deps/mssql-jdbc-8.4.1.jre8.jar:hadoop-3.2.1/share/hadoop/common/*:hadoop-3.2.1/share/hadoop/common/lib/*:hadoop-3.2.1/share/hadoop/hdfs/*:hadoop-3.2.1/share/hadoop/hdfs/lib/*:hadoop-3.2.1/etc/hadoop/:hadoop-3.2.1/share/hadoop/tools/lib/*:/path/to/parquet-deps/*
Parent topic: Classpath Configuration
8.2.13.2.4 INSERTALLRECORDS Support
Stage and merge targets support the INSERTALLRECORDS parameter. See INSERTALLRECORDS in Reference for Oracle GoldenGate. Set the INSERTALLRECORDS parameter in the Replicat parameter file (.prm).
Setting this property directs the Replicat process to use bulk insert operations to load operation data into the target table. You can tune the batch size of bulk inserts using the File Writer property gg.handler.synapse.maxFileSize; the default value is set to 1GB. The frequency of bulk inserts can be tuned using the File Writer property gg.handler.synapse.fileRollInterval; the default value is set to 3m (three minutes).
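For example, an illustrative combination of the Replicat parameter and the tuning properties might look like the following; the values are assumptions for the sketch, not recommendations.
Replicat parameter file (.prm):
INSERTALLRECORDS
Java Adapter properties:
gg.handler.synapse.maxFileSize=512m
gg.handler.synapse.fileRollInterval=2m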
Note:
- When using the Synapse internal stage, the staging files can be compressed by setting gg.handler.synapse.putSQLAutoCompress to true.
Parent topic: Configuration
8.2.13.2.5 Large Object (LOB) Performance
Operations containing LOB columns whose size exceeds the value of gg.maxInlineLobSize do not qualify for batch processing, and such operations are processed more slowly.
If the compute machine has sufficient RAM, you can increase this parameter to speed up processing.
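For example, to raise the inline LOB threshold to roughly 32 KB (the value shown is illustrative only):
gg.maxInlineLobSize=32000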
Parent topic: Configuration
8.2.13.2.6 End-to-End Configuration
The following is an end-to-end configuration example which uses auto-configuration for the File Writer handler, Parquet Event handler, and Synapse Event handler.
This sample properties file can also be found in the directory
AdapterExamples/big-data/synapse/synapse.props
:
# Configuration to load GoldenGate trail operation records
# into Azure Synapse Analytics by chaining
# File writer handler -> Parquet Event handler -> Synapse Event handler.
# Note: Recommended to only edit the configuration marked as TODO
gg.target=synapse

#The Parquet Event Handler
# No properties are required for the Parquet Event handler. Configure core-site.xml to point to ADLS Gen2.
#gg.eventhandler.parquet.finalizeAction=delete

#The Synapse Event Handler
#TODO: Edit JDBC ConnectionUrl
gg.eventhandler.synapse.connectionURL=jdbc:sqlserver://<synapse-workspace>.sql.azuresynapse.net:1433;database=<db-name>;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.sql.azuresynapse.net;loginTimeout=300;
#TODO: Edit JDBC user name
gg.eventhandler.synapse.UserName=<db user name>@<synapse-workspace>
#TODO: Edit JDBC password
gg.eventhandler.synapse.Password=<db password>
#TODO: Edit Credential to access Azure storage.
gg.eventhandler.synapse.credential=OGGBD_ADLS_credential
#TODO: Edit the classpath to include Parquet Event Handler dependencies and Synapse JDBC driver.
gg.classpath=./synapse-deps/mssql-jdbc-8.4.1.jre8.jar:hadoop-3.2.1/share/hadoop/common/*:hadoop-3.2.1/share/hadoop/common/lib/*:hadoop-3.2.1/share/hadoop/hdfs/*:hadoop-3.2.1/share/hadoop/hdfs/lib/*:hadoop-3.2.1/etc/hadoop/:hadoop-3.2.1/share/hadoop/tools/lib/*:/path/to/parquet-deps/*
#TODO: Provide sufficient memory (at least 8GB).
jvm.bootoptions=-Xmx8g -Xms8g
Parent topic: Configuration
8.2.13.2.7 Compressed Update Handling
A compressed update record contains values for the key columns and the modified columns.
An uncompressed update record contains values for all the columns.
Oracle GoldenGate trails may contain compressed or uncompressed update records. The default extract configuration writes compressed updates to the trails.
The parameter gg.compressed.update can be set to true or false to indicate compressed or uncompressed update records.
Parent topic: Configuration
8.2.13.2.7.1 MERGE Statement with Uncompressed Updates
In some use cases, if the trail contains uncompressed update records, then the MERGE SQL statement can be optimized for better performance by setting gg.compressed.update=false.
Parent topic: Compressed Update Handling
8.2.13.3 Troubleshooting and Diagnostics
- Connectivity Issues to Synapse:
- Validate JDBC connection URL, username and password.
- Check if http/https proxy is enabled. Synapse does not support connections over http(s) proxy.
- DDL not applied on the target table: Oracle GoldenGate for Big Data does not support DDL replication.
- Target table existence: It is expected that the Synapse target table exists before starting the Replicat process. The Replicat process will ABEND if the target table is missing.
- SQL Errors: In case there are any errors while executing any SQL, the entire SQL statement along with the bind parameter values are logged into the Oracle GoldenGate for Big Data handler log file.
- Co-existence of the components: The location/region of the machine where the Replicat process is running, the Azure Data Lake Storage container region, and the Synapse region impact the overall throughput of the apply process. Data flow is as follows: Oracle GoldenGate -> Azure Data Lake Gen 2 -> Synapse. For best throughput, the components need to be located as close to each other as possible.
- Replicat ABEND due to partial LOB records in the trail file: Oracle
GoldenGate for Big Data Synapse apply does not support replication of partial
LOB. The trail file needs to be regenerated by Oracle Integrated capture using
TRANLOGOPTIONS FETCHPARTIALLOB
option in the extract parameter file. - Error:
com.microsoft.sqlserver.jdbc.SQLServerException: Conversion failed when converting date and/or time from character string
:This occurs when the source datetime column and target datetime column are incompatible.
For example: A case where the source column is a timestamp type, and the target column is Synapse time.
- If the Synapse table or column names contain double quotes, then Oracle GoldenGate for Big Data replicat will ABEND.
- Error:
com.microsoft.sqlserver.jdbc.SQLServerException: HdfsBridge::recordReaderFillBuffer
. This indicates that the data in the external table backed by Azure Data Lake file is not readable. Contact Oracle support. - IDENTITY column in the target table: The Synapse
MERGE
statement does not support inserting data intoIDENTITY
columns. Therefore, ifMERGE
statement is enabled usingjvm.bootoptions=-Dsynapse.use.merge.sql=true
, then Replicat will ABEND with following error message:Exception:com.microsoft.sqlserver.jdbc.SQLServerException: Cannot update identity column 'ORDER_ID'
- Error: com.microsoft.sqlserver.jdbc.SQLServerException: Merge statements with a WHEN NOT MATCHED [BY TARGET] clause must target a hash distributed table: This indicates that the merge SQL statement is on and the Synapse target table is not a hash distributed table. You need to create the target table with a hash distribution.
Parent topic: Azure Synapse Analytics
8.2.14 Confluent Kafka
- Confluent is a primary adopter of Kafka Connect and their Confluent Platform offering includes extensions over the standard Kafka Connect functionality. This includes Avro serialization and deserialization, and an Avro schema registry. Much of the Kafka Connect functionality is available in Apache Kafka.
- You can use Oracle GoldenGate for Big Data Kafka Connect Handler to replicate to Confluent Kafka. The Kafka Connect Handler is a Kafka Connect source connector. You can capture database changes from any database supported by Oracle GoldenGate and stream that change of data through the Kafka Connect layer to Kafka.
- Kafka Connect uses proprietary objects to define the schemas (org.apache.kafka.connect.data.Schema) and the messages (org.apache.kafka.connect.data.Struct). The Kafka Connect Handler can be configured to manage what data is published and the structure of the published data.
- The Kafka Connect Handler does not support any of the pluggable formatters that are supported by the Kafka Handler. A configuration sketch follows this list.
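The following is a minimal sketch of Kafka Connect Handler properties pointing at a Confluent cluster with the Avro converters and schema registry. The handler name, host names, file names, and templates are illustrative assumptions; see the Kafka Connect Handler documentation for the authoritative property set.
Java Adapter properties:
gg.handlerlist=kafkaconnect
gg.handler.kafkaconnect.type=kafkaconnect
gg.handler.kafkaconnect.kafkaProducerConfigFile=kafkaconnect.properties
gg.handler.kafkaconnect.topicMappingTemplate=${fullyQualifiedTableName}
gg.handler.kafkaconnect.keyMappingTemplate=${primaryKeys}
kafkaconnect.properties (producer and converter settings):
bootstrap.servers=localhost:9092
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081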
Parent topic: Target
8.2.15 DataStax
DataStax Enterprise is a NoSQL database built on Apache Cassandra. For more information about configuring replication to DataStax Enterprise, see Apache Cassandra.
Parent topic: Target
8.2.16 Elasticsearch
- Elasticsearch with Elasticsearch 7x and 6x
The Elasticsearch Handler allows you to store, search, and analyze large volumes of data quickly and in near real time. - Elasticsearch 8x
The Elasticsearch Handler allows you to store, search, and analyze large volumes of data quickly and in near real time.
Parent topic: Target
8.2.16.1 Elasticsearch with Elasticsearch 7x and 6x
The Elasticsearch Handler allows you to store, search, and analyze large volumes of data quickly and in near real time.
Note:
This section on the Elasticsearch Handler pertains to Oracle GoldenGate for Big Data versions 21.9.0.0.0 and before. Starting with Oracle GoldenGate for Big Data 21.10.0.0.0, the Elasticsearch client was changed in order to support Elasticsearch 8.x.
- Overview
- Detailing the Functionality
- Setting Up and Running the Elasticsearch Handler
- Troubleshooting
- Performance Consideration
- About the Shield Plug-In Support
- About DDL Handling
- Known Issues in the Elasticsearch Handler
- Elasticsearch Handler Transport Client Dependencies
What are the dependencies for the Elasticsearch Handler to connect to Elasticsearch databases?
- Elasticsearch High Level REST Client Dependencies
Parent topic: Elasticsearch
8.2.16.1.1 Overview
Elasticsearch is a highly scalable open-source full-text search and analytics engine. Elasticsearch allows you to store, search, and analyze large volumes of data quickly and in near real time. It is generally used as the underlying engine or technology that drives applications with complex search features.
The Elasticsearch Handler uses the Elasticsearch Java client to connect and receive data into Elasticsearch node, see https://www.elastic.co.
Parent topic: Elasticsearch with Elasticsearch 7x and 6x
8.2.16.1.2 Detailing the Functionality
- About the Elasticsearch Version Property
- About the Index and Type
- About the Document
- About the Primary Key Update
- About the Data Types
- Operation Mode
- Operation Processing Support
- About the Connection
Parent topic: Elasticsearch with Elasticsearch 7x and 6x
8.2.16.1.2.1 About the Elasticsearch Version Property
The Elasticsearch Handler supports two different clients to communicate with the Elasticsearch cluster: The Elasticsearch transport client and the Elasticsearch High Level REST client.
The Elasticsearch Handler can be configured for either of the two supported clients by specifying the appropriate version in the Elasticsearch Handler properties file. Older versions of Elasticsearch (6.x) support only the Transport client, and the Elasticsearch Handler can be configured for them by setting the configurable property version to 6.x. For the latest version of Elasticsearch (7.x), both the Transport client and the High Level REST client are supported. Therefore, for the latest version, the Elasticsearch Handler can be configured for the Transport client by setting the value of the configurable property version to 7.x, and for the High Level REST client by setting the value to REST7.x.
The configurable parameters for each of them are as follows:
- Set the gg.handler.name.version configuration value to 6.x or 7.x to connect to the Elasticsearch cluster using the transport client of the respective version.
- Set the gg.handler.name.version configuration value to REST7.x to connect to the Elasticsearch cluster using the Elasticsearch High Level REST client. The REST client supports Elasticsearch version 7.x.
Parent topic: Detailing the Functionality
8.2.16.1.2.2 About the Index and Type
An Elasticsearch index is a collection of documents with similar characteristics. An index can only be created in lowercase. An Elasticsearch type is a logical group within an index. All the documents within an index or type should have same number and type of fields.
The Elasticsearch Handler maps the source trail schema concatenated with source trail table name to construct the index. For three-part table names in source trail, the index is constructed by concatenating source catalog, schema, and table name.
Note:
Elasticsearch field names are case-sensitive. If the field names in the data to be updated or inserted are in uppercase and the existing fields in the Elasticsearch server are in lowercase, then they are treated as new fields and not updated as existing fields. The workaround is to use the parameter gg.schema.normalize=lowercase, which updates the field names to lowercase, thus resolving the issue.
Table 8-14 Elasticsearch Mapping
Source Trail | Elasticsearch Index | Elasticsearch Type |
---|---|---|
|
|
|
|
|
|
If an index does not already exist in the Elasticsearch cluster, a new index is created when Elasticsearch Handler receives (INSERT
or UPDATE
operation in source trail) data.
Parent topic: Detailing the Functionality
8.2.16.1.2.3 About the Document
An Elasticsearch document is a basic unit of information that can be indexed. Within an index or type, you can store as many documents as you want. Each document has an unique identifier based on the _id
field.
The Elasticsearch Handler maps the source trail primary key column value as the document identifier.
Parent topic: Detailing the Functionality
8.2.16.1.2.4 About the Primary Key Update
The Elasticsearch document identifier is created based on the source table's primary key column value. The document identifier cannot be modified. The Elasticsearch handler processes a source primary key's update operation by performing a DELETE
followed by an INSERT
. While performing the INSERT
, there is a possibility that the new document may contain fewer fields than required. For the INSERT
operation to contain all the fields in the source table, enable trail Extract to capture the full data before images for update operations or use GETBEFORECOLS
to write the required column’s before images.
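As a hedged sketch, an Extract TABLE clause that writes before images of all columns for updates might look like the following; the schema and table names are placeholders, and depending on the source database you may instead rely on LOGALLSUPCOLS or full supplemental logging.
TABLE src_schema.orders, GETBEFORECOLS (ON UPDATE ALL);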
Parent topic: Detailing the Functionality
8.2.16.1.2.5 About the Data Types
Elasticsearch supports the following data types:
-
32-bit integer
-
64-bit integer
-
Double
-
Date
-
String
-
Binary
Parent topic: Detailing the Functionality
8.2.16.1.2.6 Operation Mode
The Elasticsearch Handler uses the operation mode for better performance. The gg.handler.name.mode
property is not used by the handler.
Parent topic: Detailing the Functionality
8.2.16.1.2.7 Operation Processing Support
The Elasticsearch Handler maps the source table name to the Elasticsearch type. The type name is case-sensitive.
For three-part table names in source trail, the index is constructed by concatenating source catalog, schema, and table name.
-
INSERT
-
The Elasticsearch Handler creates a new index if the index does not exist, and then inserts a new document.
-
UPDATE
-
If an Elasticsearch index or document exists, the document is updated. If an Elasticsearch index or document does not exist, a new index is created and the column values in the
UPDATE
operation are inserted as a new document. -
DELETE
-
If an Elasticsearch index or document exists, the document is deleted. If Elasticsearch index or document does not exist, a new index is created with zero fields.
The TRUNCATE
operation is not supported.
Parent topic: Detailing the Functionality
8.2.16.1.2.8 About the Connection
A cluster is a collection of one or more nodes (servers) that holds the entire data. It provides federated indexing and search capabilities across all nodes.
A node is a single server that is part of the cluster, stores the data, and participates in the cluster’s indexing and searching.
The Elasticsearch Handler property gg.handler.name.ServerAddressList
can be set to point to the nodes available in the cluster.
Parent topic: Detailing the Functionality
8.2.16.1.3 Setting Up and Running the Elasticsearch Handler
You must ensure that the Elasticsearch cluster is setup correctly and the cluster is up and running, see https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html. Alternatively, you can use Kibana to verify the setup.
Set the Classpath
The property gg.classpath must include all the jars required by the Java transport client. For a listing of the required client JAR files by version, see Elasticsearch Handler Transport Client Dependencies. For a listing of the required client JAR files for the Elasticsearch High Level REST client, see Elasticsearch High Level REST Client Dependencies.
You can use the * wildcard character at the end of a path to include all of the JAR files in that directory in the associated classpath. Do not use *.jar.
The following is an example of the correctly configured classpath:
gg.classpath=Elasticsearch_Home/lib/*
8.2.16.1.3.1 Configuring the Elasticsearch Handler
The Elasticsearch Handler can be configured for different versions of Elasticsearch. For the latest version (7.x), two types of clients are supported: the Transport client and the High Level REST client. When the configurable property version is set to 6.x or 7.x, the handler uses the Elasticsearch Transport client for connecting to the Elasticsearch cluster and performing all other handler operations. When the configurable property version is set to rest7.x, the handler uses the Elasticsearch High Level REST client for connecting to the Elasticsearch 7.x cluster and performing all other handler operations. The configurable parameters for each of them are given separately below:
Table 8-15 Common Configurable Properties

Properties | Required/Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.handlerlist | Required | Name (Any name of your choice for the handler) | None | The list of handlers to be used. |
gg.handler.<name>.type | Required | elasticsearch | None | Type of handler to use. For example, Elasticsearch, Kafka, or Flume. |
gg.handler.name.ServerAddressList | Optional | | | Comma-separated list of contact points of the nodes. The allowed port for version REST7.x is 9200. For other versions, it is 9300. |
gg.handler.name.version | Required | | 7.x | The version values 5.x, 6.x, and 7.x indicate using the Elasticsearch Transport client to communicate with Elasticsearch versions 5.x, 6.x, and 7.x respectively. The version REST7.x indicates using the Elasticsearch High Level REST client to communicate with Elasticsearch version 7.x. |
gg.handler.name.bulkWrite | Optional | true or false | false | When this property is true, the Elasticsearch Handler uses the bulk write API to ingest data into the Elasticsearch cluster. The batch size of bulk write can be controlled using the MAXTRANSOPS Replicat parameter. |
gg.handler.name.numberAsString | Optional | true or false | false | When this property is true, the Elasticsearch Handler receives all the number column values (Long, Integer, or Double) in the source trail as strings into the Elasticsearch cluster. |
gg.handler.elasticsearch.upsert | Optional | true or false | true | When this property is true, a new document is inserted if the document does not already exist when performing an UPDATE operation. |
Example 8-1 Sample Handler Properties file:
Sample Replicat configuration and a Java Adapter Properties files can be found at the following directory:
GoldenGate_install_directory/AdapterExamples/big-data/elasticsearch
For Elasticsearch REST handler
gg.handlerlist=elasticsearch
gg.handler.elasticsearch.type=elasticsearch
gg.handler.elasticsearch.ServerAddressList=localhost:9200
gg.handler.elasticsearch.version=rest7.x
gg.classpath=/path/to/elasticsearch/lib/*:/path/to/elasticsearch/modules/reindex/*:/path/to/elasticsearch/modules/lang-mustache/*:/path/to/elasticsearch/modules/rank-eval/*
- Common Configurable Properties
- Transport Client Configurable Properties
- Transport Client Setting Properties File
- Classpath Settings for Transport Client
- REST Client Configurable Properties
- Authentication for REST Client
- Classpath Settings for REST Client
Parent topic: Setting Up and Running the Elasticsearch Handler
8.2.16.1.3.1.1 Common Configurable Properties
Table 8-16 Common Configurable Properties

Properties | Required/Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.handlerlist | Required | Name (Any name of your choice for the handler) | None | The list of handlers to be used. |
gg.handler.<name>.type | Required | elasticsearch | None | Type of handler to use. For example, Elasticsearch, Kafka, or Flume. |
gg.handler.name.ServerAddressList | Optional | | | Comma-separated list of contact points of the nodes. The allowed port for version REST7.x is 9200. For other versions, it is 9300. |
gg.handler.name.version | Required | | 7.x | The version values 6.x and 7.x indicate using the Elasticsearch Transport client to communicate with Elasticsearch versions 6.x and 7.x respectively. The version REST7.x indicates using the Elasticsearch High Level REST client to communicate with Elasticsearch version 7.x. |
gg.handler.name.bulkWrite | Optional | true or false | false | When this property is true, the Elasticsearch Handler uses the bulk write API to ingest data into the Elasticsearch cluster. The batch size of bulk write can be controlled using the MAXTRANSOPS Replicat parameter. |
gg.handler.name.numberAsString | Optional | true or false | false | When this property is true, the Elasticsearch Handler receives all the number column values (Long, Integer, or Double) in the source trail as strings into the Elasticsearch cluster. |
gg.handler.elasticsearch.upsert | Optional | true or false | true | When this property is true, a new document is inserted if the document does not already exist when performing an UPDATE operation. |
Parent topic: Configuring the Elasticsearch Handler
8.2.16.1.3.1.2 Transport Client Configurable Properties
When the configurable property version is set to the value 6.x or 7.x, it uses Transport client to communicate with the corresponding version of Elasticsearch cluster. The configurable properties applicable when using Transport client only are as follows:
Table 8-17 Transport Client Configurable Properties

Properties | Required/Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.handler.name.clientSettingsFile | Required | Transport client properties file. | None | The filename in the classpath that holds the Elasticsearch transport client properties used by the Elasticsearch Handler. |

gg.handlerlist=elasticsearch
gg.handler.elasticsearch.type=elasticsearch
gg.handler.elasticsearch.ServerAddressList=localhost:9300
gg.handler.elasticsearch.clientSettingsFile=client.properties
gg.handler.elasticsearch.version=[6.x | 7.x]
gg.classpath=/path/to/elastic/lib/*:/path/to/elastic/modules/transport-netty4/*:/path/to/elastic/modules/reindex/*:/path/to/elastic/plugins/x-pack/*:
Parent topic: Configuring the Elasticsearch Handler
8.2.16.1.3.1.3 Transport Client Setting Properties File
The Elasticsearch Handler uses a Java Transport client to interact with Elasticsearch cluster. The Elasticsearch cluster may have additional plug-ins like shield or x-pack, which may require additional configuration.
The gg.handler.name.clientSettingsFile property should point to a file that has additional client settings based on the version of the Elasticsearch cluster. The Elasticsearch Handler attempts to locate and load the client settings file using the Java classpath. The Java classpath must include the directory containing the properties file.
The client properties file for Elasticsearch (without any plug-in) is:
cluster.name=Elasticsearch_cluster_name
The Shield plug-in also supports additional capabilities like SSL and IP
filtering. The properties can be set in the client.properties
file, see https://www.elastic.co/guide/en/shield/current/_using_elasticsearch_java_clients_with_shield.html.
Example of the client.properties file for the Elasticsearch Handler with the X-Pack plug-in:
cluster.name=Elasticsearch_cluster_name
xpack.security.user=x-pack_username:x-pack-password
The X-Pack plug-in also supports additional capabilities. The properties can be set
in the client.properties
file, see
https://www.elastic.co/guide/en/elasticsearch/client/java-api/5.1/transport-client.html and https://www.elastic.co/guide/en/x-pack/current/java-clients.html
Parent topic: Configuring the Elasticsearch Handler
8.2.16.1.3.1.4 Classpath Settings for Transport Client
The gg.classpath setting for the Elasticsearch handler with the Transport client should contain the path to the jars from the library (lib) and modules (transport-netty4 and reindex) folders inside the Elasticsearch installation directory. If the x-pack plug-in is used for authentication, then the classpath should also include the jars inside the plugins (x-pack) folder of the Elasticsearch installation directory. The paths are as follows:
1. /path/to/elastic/lib/*
2. /path/to/elastic/modules/transport-netty4/*
3. /path/to/elastic/modules/reindex/*
4. /path/to/elastic/plugins/x-pack/* (needs to be added only if the x-pack plug-in is configured in Elasticsearch)
Parent topic: Configuring the Elasticsearch Handler
8.2.16.1.3.1.5 REST Client Configurable Properties
When the configurable property version is set to value rest7.x, the handler uses Elasticsearch High Level REST client to connect to Elasticsearch 7.x cluster. The configurable properties that are supported for REST client only are as follows:
Properties | Required/Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
 | Optional | | None | The template to be used for deciding the routing algorithm. |
gg.handler.name.authType | Optional | none, basic, or ssl | None | Controls the authentication type for the Elasticsearch REST client. |
gg.handler.name.basicAuthUsername | Required (for auth-type basic) | A valid username | None | The username for the server to authenticate the Elasticsearch REST client. Must be provided for auth type basic. |
gg.handler.name.basicAuthPassword | Required (for auth-type basic) | A valid password | None | The password for the server to authenticate the Elasticsearch REST client. Must be provided for auth type basic. |
gg.handler.name.trustStore | Required (for auth-type ssl) | The fully qualified name (path + name) of the truststore file | None | The truststore for the Elasticsearch client to validate the certificate received from the Elasticsearch server. Must be provided if the auth type is set to ssl. Valid only for the Elasticsearch REST client. |
gg.handler.name.trustStorePassword | Required (for auth-type ssl) | A valid truststore password | None | The password of the truststore for the Elasticsearch REST client to validate the certificate received from the Elasticsearch server. Must be provided if the auth type is set to ssl. |
gg.handler.name.maxConnectTimeout | Optional | Positive integer | Default value of the Apache HTTP Components framework. | Sets the maximum wait period for a connection to be established from the Elasticsearch REST client to the Elasticsearch server. Valid only for the Elasticsearch REST client. |
gg.handler.name.maxSocketTimeout | Optional | Positive integer | Default value of the Apache HTTP Components framework. | Sets the maximum wait period in milliseconds to wait for a response from the service after issuing a request. May need to be increased when pushing large data volumes. Valid only for the Elasticsearch REST client. |
gg.handler.name.proxyUsername | Optional | The proxy server username | None | If the connectivity to Elasticsearch uses the REST client and routes through a proxy server, then this property sets the username of your proxy server. Most proxy servers do not require credentials. |
gg.handler.name.proxyPassword | Optional | The proxy server password | None | If the connectivity to Elasticsearch uses the REST client and routes through a proxy server, then this property sets the password of your proxy server. Most proxy servers do not require credentials. |
gg.handler.name.proxyProtocol | Optional | http or https | None | If the connectivity to Elasticsearch uses the REST client and routes through a proxy server, then this property sets the protocol of your proxy server. |
gg.handler.name.proxyPort | Optional | The port number of your proxy server. | None | If the connectivity to Elasticsearch uses the REST client and routes through a proxy server, then this property sets the port number of your proxy server. |
gg.handler.name.proxyServer | Optional | The host name of your proxy server. | None | If the connectivity to Elasticsearch uses the REST client and routes through a proxy server, then this property sets the host name of your proxy server. |
Sample Properties for Elasticsearch Handler using REST Client
gg.handlerlist=elasticsearch
gg.handler.elasticsearch.type=elasticsearch
gg.handler.elasticsearch.ServerAddressList=localhost:9200
gg.handler.elasticsearch.version=rest7.x
gg.classpath=/path/to/elasticsearch/lib/*:/path/to/elasticsearch/modules/reindex/*:/path/to/elasticsearch/modules/lang-mustache/*:/path/to/elasticsearch/modules/rank-eval/*
Parent topic: Configuring the Elasticsearch Handler
8.2.16.1.3.1.6 Authentication for REST Client
The configurable property authType with the value ssl can be used to configure the SSL authentication mechanism for communicating with the Elasticsearch cluster. Basic authentication over SSL can be configured by providing the basic username/password properties along with the truststore properties.
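The following is a minimal sketch of the relevant handler properties for basic authentication over SSL, based on the properties listed in REST Client Configurable Properties. The handler name elasticsearch, the credentials, and the truststore path are placeholders.
gg.handler.elasticsearch.version=rest7.x
gg.handler.elasticsearch.authType=ssl
gg.handler.elasticsearch.basicAuthUsername=elastic
gg.handler.elasticsearch.basicAuthPassword=<password>
gg.handler.elasticsearch.trustStore=/path/to/truststore.jks
gg.handler.elasticsearch.trustStorePassword=<truststore password>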
Parent topic: Configuring the Elasticsearch Handler
8.2.16.1.3.1.7 Classpath Settings for REST Client
The classpath for the High Level REST client must contain the jars from the library (lib) folder and the modules folders (reindex, lang-mustache, and rank-eval) inside the Elasticsearch installation directory. The REST client depends on these libraries, and they must be included in gg.classpath for the handler to work. The following is the list of dependencies:
1. /path/to/elasticsearch/lib/*
2. /path/to/elasticsearch/modules/reindex/*
3. /path/to/elasticsearch/modules/lang-mustache/*
4. /path/to/elasticsearch/modules/rank-eval/*
Parent topic: Configuring the Elasticsearch Handler
8.2.16.1.4 Troubleshooting
This section contains information to help you troubleshoot various issues.
Transport Client Properties File Not Found
This is applicable to the Transport Client only, when the property version is set to 6.x or 7.x.
ERROR 2017-01-30 22:33:10,058 [main] Unable to establish connection. Check handler properties and client settings configuration.
To resolve this exception, verify that the gg.handler.name.clientSettingsFile configuration property is correctly setting the Elasticsearch transport client settings file name. Verify that the gg.classpath variable includes the path to the correct file name and that the path to the properties file does not contain an asterisk (*) wildcard at the end.
- Incorrect Java Classpath
- Elasticsearch Version Mismatch
- Transport Client Properties File Not Found
- Cluster Connection Problem
- Unsupported Truncate Operation
- Bulk Execute Errors
Parent topic: Elasticsearch with Elasticsearch 7x and 6x
8.2.16.1.4.1 Incorrect Java Classpath
The most common initial error is an incorrect classpath that does not include all the required client libraries; this creates a ClassNotFound exception in the log4j log file.
It may also be due to an error resolving the classpath if there is a typographic error in the gg.classpath variable.
The Elasticsearch transport client libraries do not ship with the Oracle GoldenGate for Big Data product. You should properly configure the gg.classpath
property in the Java Adapter Properties file to correctly resolve the client libraries, see Setting Up and Running the Elasticsearch Handler.
Parent topic: Troubleshooting
8.2.16.1.4.2 Elasticsearch Version Mismatch
The Elasticsearch Handler gg.handler.name.version property must be set to one of the following values: 6.x, 7.x, or REST7.x, to match the major version number of the Elasticsearch cluster. For example, gg.handler.name.version=7.x.
The following errors may occur when there is a wrong version configuration:
Error: NoNodeAvailableException[None of the configured nodes are available:] ERROR 2017-01-30 22:35:07,240 [main] Unable to establish connection. Check handler properties and client settings configuration. java.lang.IllegalArgumentException: unknown setting [shield.user]
Ensure that all required plug-ins are installed and review documentation changes for any removed settings.
Parent topic: Troubleshooting
8.2.16.1.4.3 Transport Client Properties File Not Found
To resolve this exception:
ERROR 2017-01-30 22:33:10,058 [main] Unable to establish connection. Check handler properties and client settings configuration.
Verify that the gg.handler.name.clientSettingsFile
configuration property is correctly setting the Elasticsearch transport client settings file name. Verify that the gg.classpath
variable includes the path to the correct file name and that the path to the properties file does not contain an asterisk (*) wildcard at the end.
Parent topic: Troubleshooting
8.2.16.1.4.4 Cluster Connection Problem
This error occurs when the Elasticsearch Handler is unable to connect to the Elasticsearch cluster:
Error: NoNodeAvailableException[None of the configured nodes are available:]
Use the following steps to debug the issue:
-
Ensure that the Elasticsearch server process is running.
-
Validate the
cluster.name
property in the client properties configuration file. -
Validate the authentication credentials for the x-Pack or Shield plug-in in the client properties file.
-
Validate the
gg.handler.name.ServerAddressList
handler property.
Parent topic: Troubleshooting
8.2.16.1.4.5 Unsupported Truncate Operation
The following error occurs when the Elasticsearch Handler finds a TRUNCATE
operation in the source trail:
oracle.goldengate.util.GGException: Elasticsearch Handler does not support the operation: TRUNCATE
This exception error message is written to the handler log file before the Replicat process abends. Removing the GETTRUNCATES parameter from the Replicat parameter file resolves this error.
Parent topic: Troubleshooting
8.2.16.1.4.6 Bulk Execute Errors
DEBUG [main] (ElasticSearch5DOTX.java:130) - Bulk execute status: failures:[true] buildFailureMessage:[failure in bulk execution: [0]: index [cs2cat_s1sch_n1tab], type [N1TAB], id [83], message [RemoteTransportException[[UOvac8l][127.0.0.1:9300][indices:data/write/bulk[s][p]]]; nested: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$7@43eddfb2 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@5ef5f412[Running, pool size = 4, active threads = 4, queued tasks = 50, completed tasks = 84]]];]
It may be due to the Elasticsearch running out of resources to process the operation. You can limit the Replicat batch size using MAXTRANSOPS
to match the value of the thread_pool.bulk.queue_size
Elasticsearch configuration parameter.
Note:
Changes to the Elasticsearch parameter,thread_pool.bulk.queue_size
, are effective only after the Elasticsearch node is restarted.
Parent topic: Troubleshooting
8.2.16.1.5 Performance Consideration
The Elasticsearch Handler gg.handler.name.bulkWrite
property is used to determine whether the source trail records should be pushed to the Elasticsearch cluster one at a time or in bulk using the bulk write API. When this property is true, the source trail operations are pushed to the Elasticsearch cluster in batches whose size can be controlled by the MAXTRANSOPS
parameter in the generic Replicat parameter file. Using the bulk write API provides better performance.
Elasticsearch uses different thread pools to improve how memory consumption of threads are managed within a node. Many of these pools also have queues associated with them, which allow pending requests to be held instead of discarded.
For bulk operations, the default queue size is 50 (in version 5.2) and 200 (in version 5.3).
To avoid bulk API errors, you must set the Replicat MAXTRANSOPS
size to match the bulk thread pool queue size at a minimum. The configuration thread_pool.bulk.queue_size
property can be modified in the elasticsearch.yaml
file.
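As an illustrative sketch (values are assumptions, not recommendations), if the Elasticsearch bulk queue size is 200, the Replicat batch size could be capped to match it:
Replicat parameter file (.prm):
MAXTRANSOPS 200
elasticsearch.yaml:
thread_pool.bulk.queue_size: 200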
Parent topic: Elasticsearch with Elasticsearch 7x and 6x
8.2.16.1.6 About the Shield Plug-In Support
Elasticsearch versions 6.x and 7.x support the Shield plug-in, which provides basic authentication, SSL, and IP filtering. Similar capabilities exist in the X-Pack plug-in for Elasticsearch 6.x and 7.x. The additional transport client settings can be configured in the Elasticsearch Handler using the gg.handler.name.clientSettingsFile property.
Parent topic: Elasticsearch with Elasticsearch 7x and 6x
8.2.16.1.7 About DDL Handling
The Elasticsearch Handler does not react to any DDL records in the source trail. Any data manipulation records for a new source table results in auto-creation of index or type in the Elasticsearch cluster.
Parent topic: Elasticsearch with Elasticsearch 7x and 6x
8.2.16.1.8 Known Issues in the Elasticsearch Handler
Elasticsearch: Trying to input very large number
Very large numbers result in inaccurate values in Elasticsearch documents. For example, 9223372036854775807, -9223372036854775808. This is an issue with the Elasticsearch server and not a limitation of the Elasticsearch Handler.
The workaround for this issue is to ingest all the number values as strings using the gg.handler.name.numberAsString=true
property.
Elasticsearch: Issue with index
The Elasticsearch Handler is not able to input data into the same index if there are more than one table with similar column names and different column data types.
Index names are always lowercase though the catalog/schema/tablename
in the trail may be case-sensitive.
Parent topic: Elasticsearch with Elasticsearch 7x and 6x
8.2.16.1.9 Elasticsearch Handler Transport Client Dependencies
What are the dependencies for the Elasticsearch Handler to connect to Elasticsearch databases?
The maven central repository artifacts for Elasticsearch databases are:
Maven groupId: org.elasticsearch.client
Maven artifactId: transport
Maven groupId: org.elasticsearch.client
Maven artifactId: x-pack-transport
Parent topic: Elasticsearch with Elasticsearch 7x and 6x
8.2.16.1.10 Elasticsearch High Level REST Client Dependencies
The maven coordinates for the Elasticsearch High Level REST client are:
Maven groupId: org.elasticsearch.client
Maven artifactId: elasticsearch-rest-high-level-client
Maven version: 7.13.3
Note:
Ensure not to mix the versions in the jar file dependency stack for the Elasticsearch High Level REST Client. Mixing versions results in dependency conflicts.
Parent topic: Elasticsearch with Elasticsearch 7x and 6x
8.2.16.2 Elasticsearch 8x
The Elasticsearch Handler allows you to store, search, and analyze large volumes of data quickly and in near real time.
This article describes how to use the Elasticsearch handler starting with Oracle GoldenGate for Big Data 21.10.0.0.0. In Oracle GoldenGate for Big Data version 21.10.0.0.0, the Elasticsearch handler was modified to support a new Elasticsearch client. The new client supports Elasticsearch 8.x.
- Overview
- Detailing the Functionality
- About the Index
- About the Document
- About the Data Types
- About the Connection
- About Supported Operation
- About DDL Handling
- About the Primary Key Update
- About UPSERT
- About Bulk Write
- About Routing
- About Request Headers
- About Java API Client
- Setting Up the Elasticsearch Handler
- Elasticsearch Handler Configuration
- Enabling Security for Elasticsearch
The Elasticsearch cluster must be accessed in a secured manner in a production environment. Security features must first be enabled in the Elasticsearch cluster, and those security configurations must be added to the Elasticsearch handler properties file.
- Security Configuration for Elasticsearch Cluster
The latest version of Elasticsearch has security auto-configured when it is installed and started. The logs print security details for the auto-configured cluster.
- Security Configuration for Elasticsearch Handler
- Troubleshooting
- Elasticsearch Handler Client Dependencies
What are the dependencies for the Elasticsearch Handler to connect to Elasticsearch databases?
Parent topic: Elasticsearch
8.2.16.2.1 Overview
Elasticsearch is a highly scalable open-source full-text search and analytics engine. Elasticsearch allows you to store, search, and analyze large volumes of data quickly and in near real time. It is generally used as the underlying engine or technology that drives applications with complex search features.
The Elasticsearch Handler uses the Elasticsearch Java client to connect and receive data into Elasticsearch node, see https://www.elastic.co.
Parent topic: Elasticsearch 8x
8.2.16.2.2 Detailing the Functionality
Parent topic: Elasticsearch 8x
8.2.16.2.3 About the Index
An Elasticsearch index is a collection of documents with similar characteristics. An index can only be created in lowercase. An Elasticsearch type is a logical group within an index. All the documents within an index or type should have same number and type of fields. Index in Elasticsearch is equivalent to table in RDBMS.
For three-part table names in source trail, the index is constructed by concatenating source catalog, schema, and table name. The Elasticsearch Handler maps the source trail schema concatenated with source trail table name to construct the index when there is no catalog in source table.
Table 8-18 Elasticsearch Mapping
Source Trail | Elasticsearch Index |
---|---|
If an index does not already exist in the Elasticsearch cluster, a new index is created when the Elasticsearch Handler receives data (an INSERT or UPDATE operation in the source trail). If the handler receives a DELETE operation in the source trail but the index does not exist in the Elasticsearch cluster, then the handler ABENDs.
Parent topic: Elasticsearch 8x
8.2.16.2.4 About the Document
An Elasticsearch document is a basic unit of information that can be indexed. Within an index or type, you can store as many documents as you want. Each document has a unique identifier based on the _id field.
If the handler receives a DELETE operation in the source trail but the index does not exist in the Elasticsearch cluster, then the handler ABENDs.
Parent topic: Elasticsearch 8x
8.2.16.2.5 About the Data Types
Elasticsearch supports the following data types:
- 32-bit integer
- 64-bit integer
- Double
- Date
- String
- Binary
Parent topic: Elasticsearch 8x
8.2.16.2.6 About the Connection
A cluster is a collection of one or more nodes (servers) that holds the entire data. It provides federated indexing and search capabilities across all nodes.
A node is a single server that is part of the cluster, stores the data, and participates in the cluster’s indexing and searching.
The Elasticsearch Handler property gg.handler.name.ServerAddressList can be set to point to the nodes available in the cluster.
The Elasticsearch Handler uses the Java API client to connect to the Elasticsearch cluster nodes configured in this handler property over the http/https protocol, even though the cluster nodes internally communicate with each other using the transport layer protocol.
The http/https port (instead of the transport port) must be configured in the handler property for the connection through the Elasticsearch client.
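The following is a minimal sketch of the connection configuration; the node host names and ports are placeholders, and the handler name elasticsearch matches the mandatory properties shown later in this article:
gg.handlerlist=elasticsearch
gg.handler.elasticsearch.type=elasticsearch
# HTTP/HTTPS ports of the cluster nodes, not the transport ports
gg.handler.elasticsearch.ServerAddressList=es-node1:9200,es-node2:9200,es-node3:9200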
Parent topic: Elasticsearch 8x
8.2.16.2.7 About Supported Operation
The Elasticsearch Handler supports the following operations for replication to the Elasticsearch cluster in the target:
- INSERT: The Elasticsearch Handler creates a new index if the index does not exist, and then inserts a new document. If the _id is already present, it overwrites (replaces) the existing document with the new document with the same _id.
- UPDATE: If an Elasticsearch index or document exists, the document is updated. If an Elasticsearch index or document does not exist, then a new index is created and the column values in the UPDATE operation are inserted as a new document.
- DELETE: If an Elasticsearch index or document _id exists, then the document is deleted. If the document _id does not exist, the handler continues without doing anything. If the Elasticsearch index is missing, the handler ABENDs.
The TRUNCATE operation is not supported.
Parent topic: Elasticsearch 8x
8.2.16.2.8 About DDL Handling
The Elasticsearch Handler does not react to any DDL records in the source trail. Any data manipulation records for a new source table result in the auto-creation of an index or type in the Elasticsearch cluster.
Parent topic: Elasticsearch 8x
8.2.16.2.9 About the Primary Key Update
The Elasticsearch document identifier is created based on the source table's primary key column value. The document identifier cannot be modified.
The Elasticsearch handler processes a source primary key update operation by performing a DELETE followed by an INSERT. While performing the INSERT, there is a possibility that the new document may contain fewer fields than required. For the INSERT operation to contain all the fields in the source table, enable trail Extract to capture the full data before images for update operations, or use GETBEFORECOLS to write the required columns' before images, as shown in the following sketch.
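The following is a minimal sketch of an Extract TABLE clause that captures the before images of all columns for update operations; the schema and table names are placeholders, and the exact GETBEFORECOLS options depend on your Extract configuration:
TABLE src_schema.orders, GETBEFORECOLS (ON UPDATE, ALL);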
Parent topic: Elasticsearch 8x
8.2.16.2.10 About UPSERT
The Elasticsearch handler supports UPSERT mode for UPDATE operations. This mode can be enabled by setting the Elasticsearch handler property gg.handler.name.upsert to true. It is enabled by default.
The UPSERT mode ensures that for an UPDATE operation from the source trail, if the index or the _id of the document is missing from the Elasticsearch cluster, the handler creates the index and converts the operation to an INSERT to add it as a new record.
The Elasticsearch Handler ABENDs in the same scenario when UPSERT is false. Set upsert to false when you require the HANDLECOLLISION mode of Oracle GoldenGate, where:
- An insert collision should result in a duplicate error.
- A missing update or delete should result in a not-found error.
A minimal configuration sketch follows this list.
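The following sketch shows both settings; the handler name elasticsearch is assumed to match the mandatory properties shown later in this article:
# Convert UPDATE operations on missing documents into inserts (the default)
gg.handler.elasticsearch.upsert=true
# Or, ABEND when the target index or document is missing
# gg.handler.elasticsearch.upsert=false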
Parent topic: Elasticsearch 8x
8.2.16.2.11 About Bulk Write
The Elasticsearch handler supports a bulk operation mode in which multiple operations are grouped into a batch and the whole batch is applied to the target Elasticsearch cluster in one shot. This improves performance.
Bulk mode can be enabled by setting the Elasticsearch handler property gg.handler.name.bulkWrite to true. It is disabled by default.
Bulk mode has a few limitations. If there is any failure (exception thrown) for an operation in the bulk, it can result in inconsistent data at the target. For example, a delete operation for which the index is missing from the target Elasticsearch cluster results in an exception. If such an operation is part of a batch in bulk mode, then the batch is not applied after the failure of that operation, resulting in inconsistency.
To avoid bulk API errors, you must set the Replicat parameter MAXTRANSOPS to match the bulk thread pool queue size at a minimum. The thread_pool.bulk.queue_size property can be modified in the elasticsearch.yml file. A minimal sketch follows this section.
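A minimal sketch of enabling bulk mode; the handler name and the MAXTRANSOPS value are illustrative only, and MAXTRANSOPS should be sized against the cluster's bulk queue size:
# elasticsearch.props
gg.handler.elasticsearch.bulkWrite=true

-- Replicat parameter file (for example, res.prm)
MAXTRANSOPS 1000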
Parent topic: Elasticsearch 8x
8.2.16.2.12 About Routing
A document is routed to a particular shard in an index using the _routing value. The default _routing value is the document's _id field. Custom routing patterns can be implemented by specifying a custom routing value per document.
The Elasticsearch Handler supports custom routing by specifying the mapping field key in the property gg.handler.name.routingKeyMappingTemplate of the Elasticsearch handler properties file.
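A minimal sketch, assuming a hypothetical source column named REGION whose value should determine shard routing, and a handler named elasticsearch:
gg.handler.elasticsearch.routingKeyMappingTemplate=${columnValue[REGION]}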
Parent topic: Elasticsearch 8x
8.2.16.2.13 About Request Headers
Custom request headers can be sent with each REST call to the Elasticsearch cluster by configuring the property gg.handler.name.headers in the properties file.
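A minimal sketch; the header name and value are purely illustrative, and the exact list syntax is an assumption based on the name and value pairs this property accepts:
gg.handler.elasticsearch.headers=X-Application:GoldenGate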
Parent topic: Elasticsearch 8x
8.2.16.2.14 About Java API Client
The Elasticsearch Handler now uses the Java API Client to connect to the Elasticsearch cluster and perform all replication operations. The Java API Client internally uses the Elasticsearch REST client for communication. The older clients, such as the REST High-Level Client and the Transport Client, are deprecated and have been removed.
Supported Versions of Elasticsearch Cluster
To configure this handler, Elasticsearch cluster version 7.16.x or above must be configured and running. To configure Elasticsearch cluster, see Get Elasticsearch up and running
Parent topic: Elasticsearch 8x
8.2.16.2.15 Setting Up the Elasticsearch Handler
You must ensure that the Elasticsearch cluster is set up correctly and that the cluster is up and running. Supported versions of the Elasticsearch cluster are 7.16.x and above. See https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html. Alternatively, you can use Kibana to verify the setup.
Parent topic: Elasticsearch 8x
8.2.16.2.16 Elasticsearch Handler Configuration
To configure the Elasticsearch Handler, the parameter file (res.prm) and the properties file (elasticsearch.props) must be configured with valid values.
Parameter File:
The parameter file must point to the correct properties file for the Elasticsearch Handler. The following are the minimal parameters in the parameter file (res.prm) necessary for running the Elasticsearch Handler:
REPLICAT replicat-name
TARGETDB LIBFILE libggjava.so SET property=dirprm/elasticsearch.props
MAP schema-name.table-name, TARGET schema-name.table-name
Properties File:
The following are the mandatory properties in the properties file (elasticsearch.props), which are necessary for running the Elasticsearch handler:
gg.handlerlist=elasticsearch
gg.handler.elasticsearch.type=elasticsearch
gg.handler.elasticsearch.ServerAddressList=127.0.0.1:9200
Table 8-19 Elasticsearch Handler Configuration Properties
Property Name | Required (Yes/No) | Legal Values (Default value) | Explanation |
---|---|---|---|
gg.handler.name.ServerAddressList | Yes | [<Hostname|ip>:<port>, <Hostname|ip>:<port>, …] | List of valid hostnames (or IPs) and port numbers, separated by ':', of the cluster nodes of the Elasticsearch cluster. |
gg.handler.name.BulkWrite | No | true | false (false) | If Bulk Write mode is enabled (set to true), the operations of a transaction are stored in a batch and applied to the target ES cluster in one shot per batch (transaction), depending on the batch size. |
gg.handler.name.Upsert | No | true | false (true) | If upsert mode is enabled (set to true), the update operation is inserted as a new document when it is missing on the target ES cluster. |
gg.handler.name.NumberAsString | No | true | false | Set if numbers are to be stored as strings. |
gg.handler.name.ProxyServer | No | [Proxy-Hostname | Proxy-IP] | Proxy server hostname (or IP) to connect to the Elasticsearch cluster. |
gg.handler.name.ProxyPort | No | [Port number] | Port number of the proxy server. Required if a proxy is configured. |
gg.handler.name.ProxyProtocol | No | | Protocol for the proxy server connection. |
gg.handler.name.ProxyUsername | No | [Username of proxy server] | Username for connecting to the proxy server. |
gg.handler.name.ProxyPassword | No | [Password of proxy server] | Password for connecting to the proxy server. This can be encrypted using ORACLEWALLET. |
gg.handler.name.AuthType | No | [none | basic | ssl] | Authentication type to be used for connecting to the Elasticsearch cluster. |
gg.handler.name.BasicAuthUsername | No | [username of ES cluster] | Username credential for basic authentication to connect to the ES server. This can be encrypted using ORACLEWALLET. |
gg.handler.name.BasicAuthPassword | No | [password of ES cluster] | Password credential for basic authentication to connect to the ES server. This can be encrypted using ORACLEWALLET. |
gg.handler.name.Fingerprint | No | [fingerprint hash code] | The hash of a certificate calculated on all of the certificate's data and its signature. Applicable for authentication type SSL. This can be encrypted using ORACLEWALLET. |
gg.handler.name.CertFilePath | No | [/path/to/CA_certificate_file.crt] | CA certificate file (.crt) for SSL/TLS authentication. |
gg.handler.name.TrustStore | No | [/Path/to/trust-store-file] | Path to the trust-store file on the server for SSL/TLS server authentication. Applicable for authentication type SSL. |
gg.handler.name.TrustStorePassword | No | [trust-store password] | Password for the trust-store file for SSL/TLS authentication. Applicable for authentication type SSL. This can be encrypted using ORACLEWALLET. |
gg.handler.name.TrustStoreType | No | [jks | pkcs12] | The key-store type for SSL/TLS authentication. Applicable if the authentication type is SSL. |
gg.handler.name.RoutingKeyMappingTemplate | No | [Routing field-name] | Defines the field name whose value is mapped for routing to a particular shard in an index of the ES cluster. |
gg.handler.name.Headers | No | | List of name and value pairs of headers to be sent with REST calls. |
| No | Time in seconds | Time in seconds that a request waits for a connection to the Elasticsearch server. |
gg.handler.name.MaxSocketTimeout | No | Time in seconds | Time in seconds that a request waits for a response from the Elasticsearch server. |
gg.handler.name.IOThreadCount | No | Count | Count of threads to handle IO requests. |
gg.handler.name.NodeSelector | No | | Predefined strategy ANY or SKIP_DEDICATED_MASTERS, or the fully qualified name of a class that implements a custom strategy (by implementing the NodeSelector.java interface). |
Set the Classpath
The Elasticsearch handler property gg.classpath must include all the dependency jars required by the Java API client. For a listing and download of the required client JAR files, use the Dependency Downloader script elasticsearch_java.sh in the OGG_HOME/DependencyDownloader directory and pass the version 8.7.0 as the argument. For more information about Elasticsearch client dependencies, see Elasticsearch Handler Client Dependencies.
The script creates a directory OGG_HOME/DependencyDownloader/dependencies/elasticsearch_rest_8.7.0 and downloads all the dependency jars into it. The client library version 8.7.0 can be used for all supported Elasticsearch clusters.
This location can be configured in the classpath as:
gg.classpath=/path/to/OGG_HOME/DependencyDownloader/dependencies/elasticsearch_rest_8.7.0/*
The * wildcard character at the end of the path includes all of the JAR files in that directory in the associated classpath. Do not use *.jar.
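A minimal sketch of running the Dependency Downloader; the invocation assumes the script takes the client version as its only argument, as described above:
cd $OGG_HOME/DependencyDownloader
./elasticsearch_java.sh 8.7.0
# Then reference the downloaded jars, for example:
# gg.classpath=$OGG_HOME/DependencyDownloader/dependencies/elasticsearch_rest_8.7.0/*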
Sample Configuration of Elasticsearch Handler:
For reference, a sample parameter file (res.prm) and a sample properties file (elasticsearch.props) for the Elasticsearch handler are available in the following directory:
OGG_HOME/AdapterExamples/big-data/elasticsearch
Parent topic: Elasticsearch 8x
8.2.16.2.17 Enabling Security for Elasticsearch
The Elasticsearch cluster must be accessed in a secured manner in a production environment. Security features must first be enabled in the Elasticsearch cluster, and those security configurations must be added to the Elasticsearch handler properties file.
Parent topic: Elasticsearch 8x
8.2.16.2.18 Security Configuration for Elasticsearch Cluster
The latest version of Elasticsearch has security auto-configured when it is installed and started. The logs print the security details for the auto-configured cluster as follows:
- Elasticsearch security features have been automatically configured!
- Authentication is enabled and cluster connections are encrypted.
- Password for the elastic user (reset with `bin/elasticsearch-reset-password -u elastic`): nnh0LWKZMLkw_QD5jxhE
- HTTP CA certificate SHA-256 fingerprint: 862e3f117c386a63f8f43db88760d463900e4c814590b8920e1c0e25f6db4df4
- Configure Kibana to use this cluster:
- Run Kibana and click the configuration link in the terminal when Kibana starts.
- Copy the following enrollment token and paste it into Kibana in your browser (valid for the next 30 minutes): eyJ2ZXIiOiI4LjYuMiIsImFkciI6WyIxMDAuNzAuOTguNzM6OTIwMCJdLCJmZ3IiOiI4NjJlM2YxMTdjMzg2YTYzZjhmNDNkYjg4NzYwZDQ2MzkwMGU0YzgxNDU5MGI4OTIwZTFjMGUyNWY2ZGI0ZGY0Iiwia2V5IjoiUTVCVF9vWUJ2TnZDVXBSSkNTWEM6NkJNc3ZXanBUYWUwa0l6V1pDU1JPQSJ9
These security parameter values must be noted down and used to configure the Elasticsearch handler. All the auto-generated certificates are created inside the ElasticSearch-install-directory/config/cert folder.
If security is not auto-configured for older versions of Elasticsearch, you must manually enable the security features, such as basic and encrypted (SSL) authentication, in the following configuration file of the Elasticsearch cluster before running it:
Elasticsearch-installation-directory/config/elasticsearch.yml
#----------------------- BEGIN SECURITY AUTO CONFIGURATION ----------------
# The following settings, TLS certificates and keys have been
# configured for SSL/TLS authentication.
# -----------------------------------------------------------------------
# Enable security features
xpack.security.enabled: true
xpack.security.enrollment.enabled: true
# Enable encryption for HTTP API client connections
xpack.security.http.ssl:
enabled: true
keystore.path: certs/http.p12
# Enable encryption and mutual authentication between cluster nodes
xpack.security.transport.ssl:
enabled: true
verification_mode: certificate
keystore.path: certs/transport.p12
truststore.path: certs/transport.p12
# Create a new cluster with the current node only
# Additional nodes can still join the cluster later
cluster.initial_master_nodes: ["cluster-host-name"]
# Allow HTTP API connections from anywhere
# Connections are encrypted and require user authentication
http.host: 0.0.0.0
#----------------------- END SECURITY AUTO CONFIGURATION --------------
For more information about the security settings of the Elasticsearch cluster, see https://www.elastic.co/guide/en/elasticsearch/reference/current/manually-configure-security.html
Parent topic: Elasticsearch 8x
8.2.16.2.19 Security Configuration for Elasticsearch Handler
The Elasticsearch Handler supports the following authentication modes, which are selected by setting the property gg.handler.name.authType to one of the following values:
- None: This mode is used when no security feature is enabled in Elasticsearch stack. No other configuration is required for this mode and Elasticsearch can be accessed directly using http protocol.
- Basic: This mode is used when only the basic security feature is enabled for a user by setting a username and password for the user. The basic authentication username and password properties must be provided in the properties file in order to access the Elasticsearch cluster.
gg.handler.name.authType=basic
gg.handler.name.basicAuthUsername=elastic
gg.handler.name.basicAuthPassword=changeme
- SSL: This mode is used when SSL/TLS authentication is configured for encryption in the Elasticsearch stack. You must provide either the CA fingerprint hash, the path to the CA certificate file (.crt), or the path to the trust-store file (along with the trust-store type and trust-store password) for the handler to be able to connect to the Elasticsearch cluster. This mode also supports the combination of SSL/TLS authentication and Basic authentication configured in the Elasticsearch stack. You must configure both the basic authentication properties (username and password) and the SSL-related properties (fingerprint, certificate file, or trust-store) if both are configured in the Elasticsearch cluster.
gg.handler.name.authType=ssl
# if basic authentication username and password is configured.
gg.handler.name.basicAuthUsername=username
gg.handler.name.basicAuthPassword=password
# for SSL one of these three must be configured
gg.handler.name.certFilePath=/path/to/ESHome/config/certs/http_ca.crt
OR
gg.handler.name.fingerprint=862e3f117c386a63f8f43db88760d463900e4c814590b8920e1c0e25f6db4df4
OR
gg.handler.name.trustStore=/path/to/http.p12
gg.handler.name.trustStoreType=pkcs12
gg.handler.name.trustStorePassword=pass
All of the above security-related properties that contain confidential information can be configured to use Oracle Wallet to encrypt their confidential values in the properties file.
Parent topic: Elasticsearch 8x
8.2.16.2.20 Troubleshooting
- Error: org.elasticsearch.ElasticsearchException[Index [index-name] is not found]: This exception occurs when there is a delete operation and the corresponding index of the delete operation is not present in the Elasticsearch cluster. This can also occur for an update operation if upsert=false and the index is missing.
- Error: javax.net.ssl.SSLHandshakeException:[ Connection failed ]: This can happen when the properties for enabling authentication in the elasticsearch.yml file mentioned above are missing for authentication type SSL.
- Error: javax.net.ssl.SSLException: [Received fatal alert: bad_certificate]: This issue occurs when host validation fails. Check that the certificates generated using cert-utils in Elasticsearch contain the host information.
Parent topic: Elasticsearch 8x
8.2.16.2.21 Elasticsearch Handler Client Dependencies
What are the dependencies for the Elasticsearch Handler to connect to Elasticsearch databases?
The maven central repository artifacts for Elasticsearch databases are:
Maven groupId: co.elastic.clients
Maven artifactId: elasticsearch-java
Version: 8.7.0
Parent topic: Elasticsearch 8x
8.2.16.2.21.1 Elasticsearch 8.7.0
commons-codec-1.15.jar commons-logging-1.2.jar elasticsearch-java-8.7.0.jar elasticsearch-rest-client-8.7.0.jar httpasyncclient-4.1.5.jar httpclient-4.5.13.jar httpcore-4.4.13.jar httpcore-nio-4.4.13.jar jakarta.json-api-2.0.1.jar jsr305-3.0.2.jar parsson-1.0.0.jar
Parent topic: Elasticsearch Handler Client Dependencies
8.2.17 Flat Files
Oracle GoldenGate for Big Data supports writing data files to a local file system with File Writer Handler.
- Overview
You can use the File Writer Handler and the event handlers to transform data. - Optimized Row Columnar (ORC)
Use the Optimized Row Columnar (ORC) Event Handler to generate data files in ORC format. - Parquet
Learn how to use the Parquet Event Handler to load files generated by the File Writer Handler into HDFS.
Parent topic: Target
8.2.17.1 Overview
You can use the File Writer Handler and the event handlers to transform data.
The File Writer Handler supports generating data files in delimited text, XML, JSON, Avro, and Avro Object Container File formats. It is intended to fulfill an extraction, load, and transform use case. Data files are staged on your local file system. Then when writing to a data file is complete, you can use a third party application to read the file to perform additional processing.
The File Writer Handler also supports the event handler framework. The event handler framework allows data files generated by the File Writer Handler to be transformed into other formats, such as Optimized Row Columnar (ORC) or Parquet. Data files can be loaded into third party applications, such as HDFS or Amazon S3. The event handler framework is extensible allowing more event handlers performing different transformations or loading to different targets to be developed. Additionally, you can develop a custom event handler for your big data environment.
Oracle GoldenGate for Big Data provides two handlers to write to HDFS. Oracle recommends that you use the HDFS Handler or the File Writer Handler in the following situations:
- The HDFS Handler is designed to stream data directly to HDFS. Use the HDFS Handler when:
- No post write processing is occurring in HDFS. The HDFS Handler does not change the contents of the file, it simply uploads the existing file to HDFS.
- Analytical tools are accessing data written to HDFS in real time, including data in files that are open and actively being written to.
- The File Writer Handler is designed to stage data to the local file system and then load completed data files to HDFS when writing for a file is complete. Use the File Writer Handler when:
- Analytic tools are not accessing data written to HDFS in real time.
- Post write processing is occurring in HDFS to transform, reformat, merge, and move the data to a final location.
- You want to write data files to HDFS in ORC or Parquet format.
- Detailing the Functionality
- Configuring the File Writer Handler
- Stopping the File Writer Handler
- Review a Sample Configuration
- File Writer Handler Partitioning
Partitioning functionality was added to the File Writer Handler in Oracle GoldenGate for Big Data 21.1. The partitioning functionality uses the template mapper functionality to resolve partitioning strings. The result is that you are afforded control in how to partition source trail data.
Parent topic: Flat Files
8.2.17.1.1 Detailing the Functionality
- Using File Roll Events
- Automatic Directory Creation
- About the Active Write Suffix
- Maintenance of State
Parent topic: Overview
8.2.17.1.1.1 Using File Roll Events
A file roll event occurs when writing to a specific data file is completed. No more data is written to that specific data file.
Finalize Action Operation
You can configure the finalize action operation to clean up a specific data file after a successful file roll action using the finalizeAction property with the following options:
- none: Leave the data file in place (removing any active write suffix, see About the Active Write Suffix).
- delete: Delete the data file (for example, if the data file has been converted to another format or loaded to a third party application).
- move: Maintain the file name (removing any active write suffix), but move the file to the directory resolved using the movePathMappingTemplate property.
- rename: Maintain the current directory, but rename the data file using the fileRenameMappingTemplate property.
- move-rename: Rename the file using the file name generated by the fileRenameMappingTemplate property and move the file to the directory resolved using the movePathMappingTemplate property.
Typically, event handlers offer a subset of these same actions.
A sample Configuration of a finalize action operation:
gg.handlerlist=filewriter
#The File Writer Handler
gg.handler.filewriter.type=filewriter
gg.handler.filewriter.mode=op
gg.handler.filewriter.pathMappingTemplate=./dirout/evActParamS3R
gg.handler.filewriter.stateFileDirectory=./dirsta
gg.handler.filewriter.fileNameMappingTemplate=${fullyQualifiedTableName}_${currentTimestamp}.txt
gg.handler.filewriter.fileRollInterval=7m
gg.handler.filewriter.finalizeAction=delete
gg.handler.filewriter.inactivityRollInterval=7m
File Rolling Actions
Any of the following actions trigger a file roll event:
- A metadata change event.
- The maximum configured file size is exceeded.
- The file roll interval is exceeded (the current time minus the time of first file write is greater than the file roll interval).
- The inactivity roll interval is exceeded (the current time minus the time of last file write is greater than the inactivity roll interval).
- The File Writer Handler is configured to roll on shutdown and the Replicat process is stopped.
A minimal sketch of the roll-related settings follows this list.
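The following sketch shows the roll-related properties discussed above; the handler name fw follows the convention used elsewhere in this section, the maxFileSize property name is an assumption, and the values are arbitrary:
gg.handler.fw.maxFileSize=500m
gg.handler.fw.fileRollInterval=7m
gg.handler.fw.inactivityRollInterval=7m
gg.handler.fw.rollOnShutdown=true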
Operation Sequence
The file roll event triggers a sequence of operations to occur. It is important that you understand the order of the operations that occur when an individual data file is rolled:
- The active data file is switched to inactive, the data file is flushed, and the state data file is flushed.
- The configured event handlers are called in the sequence that you specified.
- The finalize action is executed on all the event handlers in the reverse order in which you configured them. Any finalize action that you configured is executed.
- The finalize action is executed on the data file and the state file. If all actions are successful, the state file is removed. Any finalize action that you configured is executed.
For example, if you configured the File Writer Handler with the Parquet Event Handler and then the S3 Event Handler, the order for a roll event is:
- The active data file is switched to inactive, the data file is flushed, and the state data file is flushed.
- The Parquet Event Handler is called to generate a Parquet file from the source data file.
- The S3 Event Handler is called to load the generated Parquet file to S3.
- The finalize action is executed on the S3 Event Handler. Any finalize action that you configured is executed.
- The finalize action is executed on the Parquet Event Handler. Any finalize action that you configured is executed.
- The finalize action is executed for the data file in the File Writer Handler.
Parent topic: Detailing the Functionality
8.2.17.1.1.2 Automatic Directory Creation
Parent topic: Detailing the Functionality
8.2.17.1.1.3 About the Active Write Suffix
A common use case is using a third party application to monitor the write directory to read data files. A third party application can only read a data file when writing to that file has completed. These applications need a way to determine whether writing to a data file is active or complete. The File Writer Handler allows you to configure an active write suffix using this property:
gg.handler.name.fileWriteActiveSuffix=.tmp
The value of this property is appended to the generated file name. When writing to the file is complete, the data file is renamed and the active write suffix is removed from the file name. You can set your third party application to monitor your data file names to identify when the active write suffix is removed.
Parent topic: Detailing the Functionality
8.2.17.1.1.4 Maintenance of State
Previously, all Oracle GoldenGate for Big Data Handlers have been stateless. These stateless handlers only maintain state in the context of the Replicat process in which they were running. If the Replicat process was stopped and restarted, then all of that state was lost. With a Replicat restart, the handler began writing with no contextual knowledge of the previous run.
The File Writer Handler provides the ability of maintaining state between invocations of the Replicat process. By default with a restart:
- the saved state files are read,
- the state is restored,
- and appending to the active data files continues where the previous run stopped.
You can change this default action to require all files be rolled on shutdown by setting this property:
gg.handler.name.rollOnShutdown=true
Parent topic: Detailing the Functionality
8.2.17.1.2 Configuring the File Writer Handler
Lists the configurable values for the File Writer Handler. These properties are located in the Java Adapter properties file (not in the Replicat properties file).
To enable the selection of the File Writer Handler, you must first configure the
handler type by specifying gg.handler.name.type=filewriter
and the other File Writer properties as follows:
Table 8-20 File Writer Handler Configuration Properties
Properties | Required/ Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
|
Required |
|
None |
Selects the File Writer Handler for use. |
|
Optional |
Default unit of measure is bytes. You can stipulate |
1g |
Sets the maximum file size of files generated by the File Writer Handler. When the file size is exceeded, a roll event is triggered. |
|
Optional |
The default unit of measure is milliseconds. You can stipulate |
File rolling on time is off. |
The timer starts when a file is created. If the file is still open when the interval elapses, then a file roll event is triggered. |
|
Optional |
The default unit of measure is milliseconds. You can stipulate |
File inactivity rolling is turned off. |
The timer starts from the latest write to a generated file. New writes to a generated file restart the counter. If the file is still open when the timer elapses, a roll event is triggered. |
|
Required |
A string with resolvable keywords and constants used to dynamically generate File Writer Handler data file names at runtime. |
None |
Use keywords interlaced with constants to dynamically generate unique file names at
runtime. Typically, file names follow the format, |
|
Required |
A string with resolvable keywords and constants used to dynamically generate the directory to which a file is written. |
None |
Use keywords interlaced with constants to dynamically generate unique path names at
runtime. Typically, path names follow the format,
|
|
Optional |
A string. |
None |
An optional suffix that is appended to files generated by the File Writer Handler to indicate that writing to the file is active. At the finalize action the suffix is removed. |
|
Required |
A directory on the local machine to store the state files of the File Writer Handler. |
None |
Sets the directory on the local machine to store the state files of the File Writer Handler. The group name is appended to the directory to ensure that the functionality works when operating in a coordinated apply environment. |
|
Optional |
|
|
Set to |
|
Optional |
|
|
Indicates what the File Writer Handler should do at the finalize action.
|
|
Optional |
|
|
Set to |
|
Optional |
|
No event handler configured. |
A unique string identifier cross referencing an event handler. The event handler will be invoked on the file roll event. Event handlers can do file roll event actions like loading files to S3, converting to Parquet or ORC format, or loading files to HDFS. |
|
Required if |
A string with resolvable keywords and constants used to dynamically generate File Writer Handler data file names for file renaming in the finalize action. |
None. |
Use keywords interlaced with constants to dynamically generate unique file names at
runtime. Typically, file names follow the format,
|
|
Required if |
A string with resolvable keywords and constants used to dynamically generate the directory to which a file is written. |
None |
Use keywords interlaced with constants to dynamically generate unique path names at runtime. Typically, path names follow the format,
|
|
Required |
|
|
Selects the formatter for how the output data will be formatted.
If you want to use the Parquet or ORC Event Handlers, then the selected format must be |
|
Optional |
An even number of hex characters. |
None |
Enter an even number of hex characters where every two characters correspond to a single byte in the byte order mark (BOM). For example, the string |
|
Optional |
|
|
Set to |
|
Optional |
Any string |
new line ( |
Allows you to control the delimiter separating file names in the control file. You can use |
|
Optional |
A path to a directory to hold the control file. |
A period ( |
Set to specify where you want to write the control file. |
|
Optional |
|
|
Set to |
|
Optional |
One or more times to trigger a roll action of all open files. |
None |
Configure one or more trigger times in the following format: HH:MM,HH:MM,HH:MM Entries are based on a 24 hour clock. For example, an entry to configure rolled actions at three discrete times of day is: gg.handler.fw.atTime=03:30,21:00,23:51 |
|
Optional |
no compression. |
|
Enables the corresponding compression algorithm for generated Avro
OCF files. The corresponding compression library must be added to
the |
|
Optional |
|
Positive Integer >= 512 |
Sets the size the |
gg.handler.name.rollOnTruncate | Optional | true | false | false | Controls whether the occurrence of a truncate operation causes a rollover of the corresponding data file by the handler. The default is false, which means the corresponding data file is not rolled when a truncate operation is presented. Set to true to roll the data file on a truncate operation. To propagate truncate operations, ensure that the Replicat property GETTRUNCATES is set. |
gg.handler.name.logEventHandlerStatus | Optional | true | false | false | When set to true, it logs the status of completed event handlers at the info logging level. Can be used for debugging and troubleshooting of the event handlers. |
gg.handler.name.eventHandlerTimeoutMinutes | Optional | Long integer | 120 | The event handler thread timeout in minutes. The event handler threads spawned by the File Writer Handler are provided a maximum execution time to complete their work. If the timeout value is exceeded, then Replicat assumes that the event handler thread is hung and will ABEND. For stage and merge use cases, event handler threads may take longer to complete their work. The default value is set to 120 (2 hours). |
Parent topic: Overview
8.2.17.1.3 Stopping the File Writer Handler
- A force stop should never be executed on the Replicat process.
- The Unix kill command should never be used to kill the Replicat process.
An inconsistent state may mean that the Replicat process abends on startup and requires manual removal of state files. The following is an example of such a startup error:
ERROR 2022-07-11 19:05:23.000367 [main]- Failed to restore state for UUID [d35f117f-ffab-4e60-aa93-f7ef860bf280] table name [QASOURCE.TCUSTORD] data file name [QASOURCE.TCUSTORD_2022-07-11_19-04-27.900.txt]
This error occurs when the data file has been removed but the associated .state file has not yet been removed. Three scenarios can generally cause this problem:
- The Replicat process was force stopped, was killed using the kill command, or crashed while it was in the processing window between when the data file was removed and when the associated .state file was removed.
- The user has manually removed the data file or files but left the associated .state file in place.
- There are two instances of the same Replicat process running. A lock file is created to prevent this, but there is a window on Replicat startup which allows multiple instances of a Replicat process to be started.
If this problem occurs, then you should manually determine whether or not the data file associated with the .state file has been successfully processed. If the data has been successfully processed, then you can manually remove the .state file and restart the Replicat process.
If the data file associated with the problematic .state file has been determined not to have been processed, then do the following:
- Delete all the .state files.
- Alter the seqno and rba of the Replicat process to back it up to a period for which it is known that processing successfully occurred (see the sketch after this list).
- Restart the Replicat process to reprocess the data.
Parent topic: Overview
8.2.17.1.4 Review a Sample Configuration
This File Writer Handler configuration example is using the Parquet Event Handler to convert data files to Parquet, and then for the S3 Event Handler to load Parquet files into S3:
gg.handlerlist=filewriter
#The handler properties
gg.handler.name.type=filewriter
gg.handler.name.mode=op
gg.handler.name.pathMappingTemplate=./dirout
gg.handler.name.stateFileDirectory=./dirsta
gg.handler.name.fileNameMappingTemplate=${fullyQualifiedTableName}_${currentTimestamp}.txt
gg.handler.name.fileRollInterval=7m
gg.handler.name.finalizeAction=delete
gg.handler.name.inactivityRollInterval=7m
gg.handler.name.format=avro_row_ocf
gg.handler.name.includetokens=true
gg.handler.name.partitionByTable=true
gg.handler.name.eventHandler=parquet
gg.handler.name.rollOnShutdown=true
gg.eventhandler.parquet.type=parquet
gg.eventhandler.parquet.pathMappingTemplate=./dirparquet
gg.eventhandler.parquet.writeToHDFS=false
gg.eventhandler.parquet.finalizeAction=delete
gg.eventhandler.parquet.eventHandler=s3
gg.eventhandler.parquet.fileNameMappingTemplate=${tableName}_${currentTimestamp}.parquet
gg.handler.filewriter.eventHandler=s3
gg.eventhandler.s3.type=s3
gg.eventhandler.s3.region=us-west-2
gg.eventhandler.s3.proxyServer=www-proxy.us.oracle.com
gg.eventhandler.s3.proxyPort=80
gg.eventhandler.s3.bucketMappingTemplate=tomsfunbucket
gg.eventhandler.s3.pathMappingTemplate=thepath
gg.eventhandler.s3.finalizeAction=none
Parent topic: Overview
8.2.17.1.5 File Writer Handler Partitioning
Partitioning functionality was added to the File Writer Handler in Oracle GoldenGate for Big Data 21.1. The partitioning functionality uses the template mapper functionality to resolve partitioning strings. The result is that you are afforded control in how to partition source trail data.
All of the keywords that are supported by the templating functionality are now supported in File Writer Handler partitioning.
- File Writer Handler Partitioning Precondition
In order to use the partitioning functionality, data must first be partitioned by table. The following configuration cannot be set: gg.handler.filewriter.partitionByTable=false. - Path Configuration
Assume that the path mapping template is configured as follows: gg.handler.filewriter.pathMappingTemplate=/ogg/${fullyQualifiedTableName}. At runtime the path resolves as follows for the DBO.ORDERS source table: /ogg/DBO.ORDERS. - Partitioning Configuration
Any of the keywords that are legal for templating are now legal for partitioning: gg.handler.filewriter.partitioner.fully qualified table name=templating keywords and/or constants. - Partitioning Effect on Event Handler
The resolved partitioning path is carried forward to the corresponding Event Handlers as well.
Parent topic: Overview
8.2.17.1.5.1 File Writer Handler Partitioning Precondition
In order to use the partitioning functionality, data must first be partitioned by table. The following configuration cannot be set: gg.handler.filewriter.partitionByTable=false.
Parent topic: File Writer Handler Partitioning
8.2.17.1.5.2 Path Configuration
Assume that the path mapping template is configured as follows:
gg.handler.filewriter.pathMappingTemplate=/ogg/${fullyQualifiedTableName}
At runtime, the path resolves as follows for the DBO.ORDERS source table: /ogg/DBO.ORDERS
Parent topic: File Writer Handler Partitioning
8.2.17.1.5.3 Partitioning Configuration
Any of the keywords that are legal for templating are now legal for partitioning: gg.handler.filewriter.partitioner.fully qualified table name=templating keywords and/or constants.
Example 1
Partitioning for the DBO.ORDERS table is set to the following:
gg.handler.filewriter.partitioner.DBO.ORDERS=par_sales_region=${columnValue[SALES_REGION]}
This example can result in the following breakdown of files on the file system:
/ogg/DBO.ORDERS/par_sales_region=west/data files
/ogg/DBO.ORDERS/par_sales_region=east/data files
/ogg/DBO.ORDERS/par_sales_region=north/data files
/ogg/DBO.ORDERS/par_sales_region=south/data files
Example 2
Partitioning for the DBO.ORDERS table is set to the following:
gg.handler.filewriter.partitioner.DBO.ORDERS=par_sales_region=${columnValue[SALES_REGION]}/par_state=${columnValue[STATE]}
This example can result in the following breakdown of files on the file system:
/ogg/DBO.ORDERS/par_sales_region=west/par_state=CA/data files
/ogg/DBO.ORDERS/par_sales_region=east/par_state=FL/data files
/ogg/DBO.ORDERS/par_sales_region=north/par_state=MN/data files
/ogg/DBO.ORDERS/par_sales_region=south/par_state=TX/data files
Caution:
Ensure that you are extra vigilant while configuring partitioning. Choosing partitioning column values that have a very large range of data values results in partitioning to a proportional number of output data files.
Parent topic: File Writer Handler Partitioning
8.2.17.1.5.4 Partitioning Effect on Event Handler
The resolved partitioning path is carried forward to the corresponding Event Handlers as well.
If partitioning is configured as follows: gg.handler.filewriter.partitioner.DBO.ORDERS=par_sales_region=${columnValue[SALES_REGION]}, then the partition string might resolve to the following:
par_sales_region=west
par_sales_region=east
par_sales_region=north
par_sales_region=south
Example 2
If the S3 Event Handler is used and the path mapping template of the S3 Event Handler is configured as follows: gg.eventhandler.s3.pathMappingTemplate=output/dir, then the target directories in S3 are as follows:
output/dir/par_sales_region=west/data files
output/dir/par_sales_region=east/data files
output/dir/par_sales_region=north/data files
output/dir/par_sales_region=south/data files
Parent topic: File Writer Handler Partitioning
8.2.17.2 Optimized Row Columnar (ORC)
Use the Optimized Row Columnar (ORC) Event Handler to generate data files in ORC format.
This topic describes how to use the ORC Event Handler.
- Overview
- Detailing the Functionality
- Configuring the ORC Event Handler
- Optimized Row Columnar Event Handler Client Dependencies
What are the dependencies for the Optimized Row Columnar (ORC) Event Handler?
Parent topic: Flat Files
8.2.17.2.1 Overview
ORC is a row columnar format that can substantially improve data retrieval times and the performance of Big Data analytics. You can use the ORC Event Handler to write ORC files to either a local file system or directly to HDFS. For information, see https://orc.apache.org/.
Parent topic: Optimized Row Columnar (ORC)
8.2.17.2.2 Detailing the Functionality
Parent topic: Optimized Row Columnar (ORC)
8.2.17.2.2.1 About the Upstream Data Format
The ORC Event Handler can only convert Avro Object Container File (OCF) generated by the File Writer Handler. The ORC Event Handler cannot convert other formats to ORC data files. The format of the File Writer Handler must be avro_row_ocf
or avro_op_ocf
, see Flat Files.
Parent topic: Detailing the Functionality
8.2.17.2.2.2 About the Library Dependencies
Generating ORC files requires both the Apache ORC libraries and the HDFS client libraries, see Optimized Row Columnar Event Handler Client Dependencies and HDFS Handler Client Dependencies.
Oracle GoldenGate for Big Data does not include the Apache ORC libraries nor does it include the HDFS client libraries. You must configure the gg.classpath
variable to include the dependent libraries.
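A minimal sketch of such a gg.classpath setting; the directory locations are hypothetical and depend on where the Apache ORC libraries, the HDFS client libraries, and the Hadoop configuration are installed:
gg.classpath=/path/to/orc-client-libs/*:/path/to/hdfs-client-libs/*:/path/to/hadoop/etc/hadoop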
Parent topic: Detailing the Functionality
8.2.17.2.2.3 Requirements
The ORC Event Handler can write ORC files directly to HDFS. You must set the writeToHDFS
property to true
:
gg.eventhandler.orc.writeToHDFS=true
Ensure that the directory containing the HDFS core-site.xml
file is in gg.classpath
. This is so the core-site.xml
file can be read at runtime and the connectivity information to HDFS can be resolved. For example:
gg.classpath=/{HDFS_install_directory}/etc/hadoop
If Kerberos authentication is enabled on the HDFS cluster, you must configure the Kerberos principal and the location of the keytab file so that the password can be resolved at runtime:
gg.eventHandler.name.kerberosPrincipal=principal
gg.eventHandler.name.kerberosKeytabFile=path_to_the_keytab_file
Parent topic: Detailing the Functionality
8.2.17.2.3 Configuring the ORC Event Handler
You configure the ORC Handler operation using the properties file. These properties are located in the Java Adapter properties file (not in the Replicat properties file).
The ORC Event Handler works only in conjunction with the File Writer Handler.
To enable the selection of the ORC Handler, you must first configure the handler
type by specifying gg.eventhandler.name.type=orc
and the
other ORC properties as follows:
Table 8-21 ORC Event Handler Configuration Properties
Properties | Required/ Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
|
Required |
|
None |
Selects the ORC Event Handler. |
|
Optional |
|
|
The ORC framework allows direct writing to HDFS. Set to |
|
Required |
A string with resolvable keywords and constants used to dynamically generate the path in the ORC bucket to write the file. |
None |
Use keywords interlaced with constants to dynamically generate unique ORC path names
at runtime. Typically, path names follow the format,
|
|
Optional |
A string with resolvable keywords and constants used to dynamically generate the ORC file name at runtime. |
None |
Use resolvable keywords and constants used to dynamically generate the ORC data file name at runtime. If not set, the upstream file name is used. See Template Keywords. |
|
Optional |
|
|
Sets the compression codec of the generated ORC file. |
|
Optional |
|
|
Set to |
|
Optional |
The Kerberos principal name. |
None |
Sets the Kerberos principal when writing directly to HDFS and Kerberos authentication is enabled. |
|
Optional |
The path to the Kerberos |
|
Sets the path to the Kerberos |
|
Optional |
|
|
Set to |
|
Optional |
|
The ORC default. |
Sets the block size of generated ORC files. |
|
Optional |
|
The ORC default. |
Sets the buffer size of generated ORC files. |
|
Optional |
|
The ORC default. |
Set if the ORC encoding strategy is optimized for compression or for speed. |
|
Optional |
A percentage represented as a floating point number. |
The ORC default. |
Sets the percentage for padding tolerance of generated ORC files. |
|
Optional |
|
The ORC default. |
Sets the row index stride of generated ORC files. |
|
Optional |
|
The ORC default. |
Sets the stripe size of generated ORC files. |
|
Optional |
A unique string identifier cross referencing a child event handler. |
No event handler configured. |
The event handler that is invoked on the file roll event. Event handlers can do file roll event actions like loading files to S3 or HDFS. |
|
Optional |
The false positive probability must be greater than
zero and less than one. For example, |
The Apache ORC default. |
Sets the false positive probability that a query of a bloom filter index indicates the value being searched for is in the block when the value is actually not in the block. The user selects on which tables and columns to set bloom filters with the following configuration syntax: gg.eventhandler.orc.bloomFilter.QASOURCE.TCUSTMER=CUST_CODE gg.eventhandler.orc.bloomFilter.QASOURCE.TCUSTORD=CUST_CODE,ORDER_DATE
|
|
Optional |
|
|
Sets the version of the ORC bloom filter. |
Parent topic: Optimized Row Columnar (ORC)
8.2.17.2.4 Optimized Row Columnar Event Handler Client Dependencies
What are the dependencies for the Optimized Row Columnar (ORC) Event Handler?
The maven central repository artifacts for ORC are:
Maven groupId: org.apache.orc
Maven artifactId: orc-core
Maven version: 1.6.9
The Hadoop client dependencies are also required for the ORC Event Handler, see Hadoop Client Dependencies.
8.2.17.2.4.1 ORC Client 1.6.9
aircompressor-0.19.jar annotations-17.0.0.jar commons-lang-2.6.jar commons-lang3-3.12.0.jar hive-storage-api-2.7.1.jar jaxb-api-2.2.11.jar orc-core-1.6.9.jar orc-shims-1.6.9.jar protobuf-java-2.5.0.jar slf4j-api-1.7.5.jar threeten-extra-1.5.0.jar
Parent topic: Optimized Row Columnar Event Handler Client Dependencies
8.2.17.2.4.2 ORC Client 1.5.5
aircompressor-0.10.jar asm-3.1.jar commons-cli-1.2.jar commons-codec-1.4.jar commons-collections-3.2.1.jar commons-compress-1.4.1.jar commons-configuration-1.6.jar commons-httpclient-3.1.jar commons-io-2.1.jar commons-lang-2.6.jar commons-logging-1.1.1.jar commons-math-2.1.jar commons-net-3.1.jar guava-11.0.2.jar hadoop-annotations-2.2.0.jar hadoop-auth-2.2.0.jar hadoop-common-2.2.0.jar hadoop-hdfs-2.2.0.jar hive-storage-api-2.6.0.jar jackson-core-asl-1.8.8.jar jackson-mapper-asl-1.8.8.jar jaxb-api-2.2.11.jar jersey-core-1.9.jar jersey-server-1.9.jar jsch-0.1.42.jar log4j-1.2.17.jar orc-core-1.5.5.jar orc-shims-1.5.5.jar protobuf-java-2.5.0.jar slf4j-api-1.7.5.jar slf4j-log4j12-1.7.5.jar xmlenc-0.52.jar zookeeper-3.4.5.jar
Parent topic: Optimized Row Columnar Event Handler Client Dependencies
8.2.17.2.4.3 ORC Client 1.4.0
aircompressor-0.3.jar apacheds-i18n-2.0.0-M15.jar apacheds-kerberos-codec-2.0.0-M15.jar api-asn1-api-1.0.0-M20.jar api-util-1.0.0-M20.jar asm-3.1.jar commons-beanutils-core-1.8.0.jar commons-cli-1.2.jar commons-codec-1.4.jar commons-collections-3.2.2.jar commons-compress-1.4.1.jar commons-configuration-1.6.jar commons-httpclient-3.1.jar commons-io-2.4.jar commons-lang-2.6.jar commons-logging-1.1.3.jar commons-math3-3.1.1.jar commons-net-3.1.jar curator-client-2.6.0.jar curator-framework-2.6.0.jar gson-2.2.4.jar guava-11.0.2.jar hadoop-annotations-2.6.4.jar hadoop-auth-2.6.4.jar hadoop-common-2.6.4.jar hive-storage-api-2.2.1.jar htrace-core-3.0.4.jar httpclient-4.2.5.jar httpcore-4.2.4.jar jackson-core-asl-1.9.13.jar jdk.tools-1.6.jar jersey-core-1.9.jar jersey-server-1.9.jar jsch-0.1.42.jar log4j-1.2.17.jar netty-3.7.0.Final.jar orc-core-1.4.0.jar protobuf-java-2.5.0.jar slf4j-api-1.7.5.jar slf4j-log4j12-1.7.5.jar xmlenc-0.52.jar xz-1.0.jar zookeeper-3.4.6.jar
Parent topic: Optimized Row Columnar Event Handler Client Dependencies
8.2.17.3 Parquet
Learn how to use the Parquet Event Handler to load files generated by the File Writer Handler into HDFS.
See Flat Files.
- Overview
- Detailing the Functionality
- Configuring the Parquet Event Handler
- Parquet Event Handler Client Dependencies
What are the dependencies for the Parquet Event Handler?
Parent topic: Flat Files
8.2.17.3.1 Overview
The Parquet Event Handler enables you to generate data files in Parquet format. Parquet files can be written to either the local file system or directly to HDFS. Parquet is a columnar data format that can substantially improve data retrieval times and improve the performance of Big Data analytics, see https://parquet.apache.org/.
Parent topic: Parquet
8.2.17.3.2 Detailing the Functionality
Parent topic: Parquet
8.2.17.3.2.1 Configuring the Parquet Event Handler to Write to HDFS
The Apache Parquet framework supports writing directly to HDFS. The Parquet Event Handler can write Parquet files directly to HDFS. These additional configuration steps are required:
The Parquet Event Handler dependencies and considerations are the same as the HDFS Handler, see HDFS Additional Considerations.
Set the writeToHDFS
property to true
:
gg.eventhandler.parquet.writeToHDFS=true
Ensure that gg.classpath
includes the HDFS client libraries.
Ensure that the directory containing the HDFS core-site.xml
file is in gg.classpath
. This is so the core-site.xml
file can be read at runtime and the connectivity information to HDFS can be resolved. For example:
gg.classpath=/{HDFS_install_directory}/etc/hadoop
If Kerberos authentication is enabled on the HDFS cluster, you have to configure the Kerberos principal and the location of the keytab
file so that the password can be resolved at runtime:
gg.eventHandler.name.kerberosPrincipal=principal
gg.eventHandler.name.kerberosKeytabFile=path_to_the_keytab_file
Parent topic: Detailing the Functionality
8.2.17.3.2.2 About the Upstream Data Format
The Parquet Event Handler can only convert Avro Object Container File (OCF) generated by the File Writer Handler. The Parquet Event Handler cannot convert other formats to Parquet data files. The format of the File Writer Handler must be avro_row_ocf
or avro_op_ocf
, see Flat Files.
Parent topic: Detailing the Functionality
8.2.17.3.3 Configuring the Parquet Event Handler
You configure the Parquet Event Handler operation using the properties file. These properties are located in the Java Adapter properties file (not in the Replicat properties file).
The Parquet Event Handler works only in conjunction with the File Writer Handler.
To enable the selection of the Parquet Event Handler, you must first configure the
handler type by specifying gg.eventhandler.name.type=parquet
and the other Parquet Event properties as follows:
Table 8-22 Parquet Event Handler Configuration Properties
Properties | Required/ Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
|
Required |
|
None |
Selects the Parquet Event Handler for use. |
|
Optional |
|
|
Set to |
|
Required |
A string with resolvable keywords and constants used to dynamically generate the path to write generated Parquet files. |
None |
Use keywords interlaced with constants to dynamically generate unique path names at
runtime. Typically, path names follow the format,
|
|
Optional |
A string with resolvable keywords and constants used to dynamically generate the Parquet file name at runtime |
None |
Sets the Parquet file name. If not set, the upstream file name is used. See Template Keywords. |
|
Optional |
|
|
Sets the compression codec of the generated Parquet file. |
|
Optional |
|
|
Indicates what the Parquet Event Handler should do at the finalize action. |
|
Optional |
|
The Parquet default. |
Set to |
|
Optional |
|
The Parquet default. |
Set to |
|
Optional |
Integer |
The Parquet default. |
Sets the Parquet dictionary page size. |
|
Optional |
Integer |
The Parquet default. |
Sets the Parquet padding size. |
|
Optional |
Integer |
The Parquet default. |
Sets the Parquet page size. |
|
Optional |
Integer |
The Parquet default. |
Sets the Parquet row group size. |
|
Optional |
The Kerberos principal name. |
None |
Set to the Kerberos principal when writing directly to HDFS and Kerberos authentication is enabled. |
|
Optional |
The path to the Kerberos |
The Parquet default. |
Set to the path to the Kerberos |
|
Optional |
A unique string identifier cross referencing a child event handler. |
No event handler configured. |
The event handler that is invoked on the file roll event. Event handlers can do file roll event actions like loading files to S3, converting to Parquet or ORC format, or loading files to HDFS. |
gg.eventhandler.name.writerVersion |
Optional | v1|v2 |
The Parquet library default
which is up through Parquet version 1.11.0 is
v1 .
|
Allows the ability to set the Parquet writer version. |
Parent topic: Parquet
8.2.17.3.4 Parquet Event Handler Client Dependencies
What are the dependencies for the Parquet Event Handler?
The maven central repository artifacts for Parquet are:
Maven groupId: org.apache.parquet
Maven artifactId: parquet-avro
Maven version: 1.9.0
Maven groupId: org.apache.parquet
Maven artifactId: parquet-hadoop
Maven version: 1.9.0
The Hadoop client dependencies are also required for the Parquet Event Handler, see Hadoop Client Dependencies.
Parent topic: Parquet
8.2.17.3.4.1 Parquet Client 1.12.0
audience-annotations-0.12.0.jar avro-1.10.1.jar commons-compress-1.20.jar commons-pool-1.6.jar jackson-annotations-2.11.3.jar jackson-core-2.11.3.jar jackson-databind-2.11.3.jar javax.annotation-api-1.3.2.jar parquet-avro-1.12.0.jar parquet-column-1.12.0.jar parquet-common-1.12.0.jar parquet-encoding-1.12.0.jar parquet-format-structures-1.12.0.jar parquet-hadoop-1.12.0.jar parquet-jackson-1.12.0.jar slf4j-api-1.7.22.jar snappy-java-1.1.8.jar zstd-jni-1.4.9-1.jar
Parent topic: Parquet Event Handler Client Dependencies
8.2.17.3.4.2 Parquet Client 1.11.1
audience-annotations-0.11.0.jar avro-1.9.2.jar commons-compress-1.19.jar commons-pool-1.6.jar jackson-annotations-2.10.2.jar jackson-core-2.10.2.jar jackson-databind-2.10.2.jar javax.annotation-api-1.3.2.jar parquet-avro-1.11.1.jar parquet-column-1.11.1.jar parquet-common-1.11.1.jar parquet-encoding-1.11.1.jar parquet-format-structures-1.11.1.jar parquet-hadoop-1.11.1.jar parquet-jackson-1.11.1.jar slf4j-api-1.7.22.jar snappy-java-1.1.7.3.jar
Parent topic: Parquet Event Handler Client Dependencies
8.2.17.3.4.3 Parquet Client 1.10.1
avro-1.8.2.jar commons-codec-1.10.jar commons-compress-1.8.1.jar commons-pool-1.6.jar fastutil-7.0.13.jar jackson-core-asl-1.9.13.jar jackson-mapper-asl-1.9.13.jar paranamer-2.7.jar parquet-avro-1.10.1.jar parquet-column-1.10.1.jar parquet-common-1.10.1.jar parquet-encoding-1.10.1.jar parquet-format-2.4.0.jar parquet-hadoop-1.10.1.jar parquet-jackson-1.10.1.jar slf4j-api-1.7.2.jar snappy-java-1.1.2.6.jar xz-1.5.jar
Parent topic: Parquet Event Handler Client Dependencies
8.2.17.3.4.4 Parquet Client 1.9.0
avro-1.8.0.jar commons-codec-1.5.jar commons-compress-1.8.1.jar commons-pool-1.5.4.jar fastutil-6.5.7.jar jackson-core-asl-1.9.11.jar jackson-mapper-asl-1.9.11.jar paranamer-2.7.jar parquet-avro-1.9.0.jar parquet-column-1.9.0.jar parquet-common-1.9.0.jar parquet-encoding-1.9.0.jar parquet-format-2.3.1.jar parquet-hadoop-1.9.0.jar parquet-jackson-1.9.0.jar slf4j-api-1.7.7.jar snappy-java-1.1.1.6.jar xz-1.5.jar
Parent topic: Parquet Event Handler Client Dependencies
8.2.18 Google BigQuery
Topics:
- Using Streaming API
Learn how to use the Google BigQuery Handler, which streams change data capture data from source trail files into Google BigQuery. - Google BigQuery Stage and Merge
Parent topic: Target
8.2.18.1 Using Streaming API
Learn how to use the Google BigQuery Handler, which streams change data capture data from source trail files into Google BigQuery.
BigQuery is a RESTful web service that enables interactive analysis of massively large datasets working in conjunction with Google Storage, see https://cloud.google.com/bigquery/.
- Detailing the Functionality
- Setting Up and Running the BigQuery Handler
The Google BigQuery Handler uses the Java BigQuery client libraries to connect to Big Query. - Google BigQuery Dependencies
The Google BigQuery client libraries are required for integration with BigQuery.
Parent topic: Google BigQuery
8.2.18.1.1 Detailing the Functionality
- Data Types
- Metadata Support
- Operation Modes
- Operation Processing Support
- Proxy Settings
- Mapping to Google Datasets
A dataset is contained within a specific Google cloud project. Datasets are top-level containers that are used to organize and control access to your tables and views.
Parent topic: Using Streaming API
8.2.18.1.1.1 Data Types
The BigQuery Handler supports most of the standard SQL data types. The handler converts the column value in the trail file to the corresponding Java type representing the BigQuery column type.
The following data types are supported:
STRING
BYTES
INTEGER
FLOAT
NUMERIC
BOOLEAN
TIMESTAMP
DATE
TIME
DATETIME
The BigQuery Handler does not support complex data types, such as ARRAY and STRUCT.
Parent topic: Detailing the Functionality
8.2.18.1.1.2 Metadata Support
The BigQuery Handler creates tables in BigQuery if the tables do not exist.
The BigQuery Handler alters tables to add columns that exist in the source metadata, or configured metacolumns, but do not exist in the target metadata. The BigQuery Handler also adds columns dynamically at runtime if it detects a metadata change.
The BigQuery Handler does not drop columns in the BigQuery table that do not exist in the source table definition. BigQuery neither supports dropping existing columns, nor changing the data type of existing columns. Once a column is created in BigQuery, it is immutable.
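The gg.handler.name.enableAlter property described in the configuration table later in this section controls whether the handler may alter the target table. A minimal sketch, assuming the handler is named bigquery:
#Allow the BigQuery Handler to add columns that exist on the source
#(or configured metacolumns) but are missing from the target table.
gg.handler.bigquery.enableAlter=true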
Truncate operations are not supported.
Parent topic: Detailing the Functionality
8.2.18.1.1.3 Operation Modes
You can configure the BigQuery Handler in one of these two modes:
-
Audit Log Mode = true
gg.handler.name.auditLogMode=true
When the handler is configured to run with audit log mode true, the data is pushed into Google BigQuery without a unique row identification key. As a result, Google BigQuery is not able to merge different operations on the same row. For example, a source row with an insert operation, two update operations, and then a delete operation would show up in BigQuery as four rows, one for each operation.
Also, the order in which the audit log is displayed in the BigQuery data set is not deterministic.
To overcome these limitations, users should specify optype and position in the meta columns template for the handler. This adds two columns of the same names to the schema for the table in Google BigQuery. For example:
gg.handler.bigquery.metaColumnsTemplate = ${optype}, ${position}
The optype is important to determine the operation type for the row in the audit log. To view the audit log in the order of the operations processed in the trail file, specify position, which can be used in the ORDER BY clause while querying the table in Google BigQuery. For example:
SELECT * FROM [projectId:datasetId.tableId] ORDER BY position
-
auditLogMode = false
-
gg.handler.name.auditLogMode=false
When the handler is configured to run with audit log mode false, the data is pushed into Google BigQuery using a unique row identification key. Google BigQuery is then able to merge different operations for the same row. However, the behavior is complex. Google BigQuery maintains a finite deduplication period in which it merges operations for a given row. Therefore, the results can be somewhat non-deterministic. The trail source needs to have a full image of the records in order to merge correctly.
Example 1
An insert operation is sent to BigQuery and before the deduplication period expires, an update operation for the same row is sent to BigQuery. The resultant is a single row in BigQuery for the update operation.
Example 2
An insert operation is sent to BigQuery and after the deduplication period expires, an update operation for the same row is sent to BigQuery. The resultant is that both the insert and the update operations show up in BigQuery.
This behavior has confounded many users; it is the documented behavior when using the BigQuery SDK and is a feature as opposed to a defect. The documented length of the deduplication period is at least one minute. However, Oracle testing has shown that the period can be significantly longer. Therefore, unless you can guarantee that all operations for a given row occur within a very short period, it is likely that there will be multiple entries for a given row in BigQuery. It is therefore just as important to configure the meta columns with the optype and position so that you can determine the latest state for a given row. To read more about audit log mode, see the Google BigQuery documentation: Streaming data into BigQuery.
Parent topic: Detailing the Functionality
8.2.18.1.1.4 Operation Processing Support
The BigQuery Handler pushes operations to Google BigQuery using a synchronous API. Insert, update, and delete operations are processed differently in BigQuery than in a traditional RDBMS.
The following explains how insert, update, and delete operations are interpreted by the handler depending on the mode of operation:
- auditLogMode = true
  - insert – Inserts the record with optype as an insert operation in the BigQuery table.
  - update – Inserts the record with optype as an update operation in the BigQuery table.
  - delete – Inserts the record with optype as a delete operation in the BigQuery table.
  - pkUpdate – When the pkUpdateHandling property is configured as delete-insert, the handler sends out a delete operation followed by an insert operation. Both these rows have the same position in the BigQuery table, which helps to identify it as a primary key operation and not a separate delete and insert operation.
- auditLogMode = false
  - insert – If the row does not already exist in Google BigQuery, then an insert operation is processed as an insert. If the row already exists in Google BigQuery, then an insert operation is processed as an update. The handler sets the deleted column to false.
  - update – If the row does not exist in Google BigQuery, then an update operation is processed as an insert. If the row already exists in Google BigQuery, then an update operation is processed as an update. The handler sets the deleted column to false.
  - delete – If the row does not exist in Google BigQuery, then a delete operation is added. If the row exists in Google BigQuery, then a delete operation is processed as a delete. The handler sets the deleted column to true.
  - pkUpdate – When the pkUpdateHandling property is configured as delete-insert, the handler sets the deleted column to true for the row whose primary key is updated. It is followed by a separate insert operation with the new primary key and the deleted column set to false for this row.
Do not toggle the audit log mode because it forces the BigQuery Handler to abend, as Google BigQuery cannot alter the schema of an existing table. The existing table needs to be deleted before switching audit log modes.
Note:
The BigQuery Handler does not support the truncate operation. It abends when it encounters a truncate operation.
Parent topic: Detailing the Functionality
8.2.18.1.1.5 Proxy Settings
To connect to BigQuery using a proxy server, you must configure the proxy host and the proxy port in the properties file as follows:
jvm.bootoptions= -Dhttps.proxyHost=proxy_host_name -Dhttps.proxyPort=proxy_port_number
Parent topic: Detailing the Functionality
8.2.18.1.1.6 Mapping to Google Datasets
A dataset is contained within a specific Google cloud project. Datasets are top-level containers that are used to organize and control access to your tables and views.
A table or view must belong to a dataset, so you need to create at least one dataset before loading data into BigQuery.
The BigQuery Handler can use existing datasets, or it creates datasets if they are not found.
The BigQuery Handler maps the table's schema name to the dataset name. For three-part table names, the dataset is constructed by concatenating the catalog and schema.
Parent topic: Detailing the Functionality
8.2.18.1.2 Setting Up and Running the BigQuery Handler
The Google BigQuery Handler uses the Java BigQuery client libraries to connect to Big Query.
- Group ID:
com.google.cloud
- Artifact ID: google-cloud-bigquery
- Version: 2.7.1
The BigQuery Client libraries do not ship with Oracle GoldenGate for Big Data. Additionally, Google appears to have removed the link to download the BigQuery Client libraries. You can download the BigQuery Client libraries using Maven and the Maven coordinates listed above. However, this requires proficiency with Maven. The Google BigQuery client libraries can be downloaded using the Dependency downloading scripts. For more information, see Google BigQuery Dependencies.
For more information about Dependency Downloader, see Dependency Downloader.
- Schema Mapping for BigQuery
- Understanding the BigQuery Handler Configuration
- Review a Sample Configuration
- Configuring Handler Authentication
Parent topic: Using Streaming API
8.2.18.1.2.1 Schema Mapping for BigQuery
The table schema name specified in the Replicat MAP statement is mapped to the BigQuery dataset name. For example: map QASOURCE.*, target "dataset_US".*;
This MAP statement replicates tables to the BigQuery dataset "dataset_US". Oracle GoldenGate for Big Data normalizes schema and table names to uppercase. Lowercase and mixed case dataset and table names are supported, but need to be quoted in the Replicat mapping statement.
Parent topic: Setting Up and Running the BigQuery Handler
8.2.18.1.2.2 Understanding the BigQuery Handler Configuration
The following are the configurable values for the BigQuery Handler. These properties are located in the Java Adapter properties file (not in the Replicat properties file).
To enable the selection of the BigQuery Handler, you must first configure the handler type by specifying gg.handler.name.type=bigquery and the other BigQuery properties as follows:
Properties | Required/ Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
|
Required |
Any string |
None |
Provides a name for the BigQuery Handler. The BigQuery Handler name then becomes part of the property names listed in this table. |
|
Required |
|
None |
Selects the BigQuery Handler for streaming change data capture into Google BigQuery. |
|
Optional |
Relative or absolute path to the credentials file |
None |
The credentials file downloaded from Google BigQuery for authentication. If you do not specify the path to the credentials file, you need to set it as an environment variable, see Configuring Handler Authentication. |
|
Required |
Any string |
None |
The name of the project in Google BigQuery. The handler needs project ID to connect to Google BigQuery store. |
|
Optional |
Any number |
|
The maximum number of operations to be batched together. This is applicable for all target table batches. |
|
Optional |
Any number |
|
The maximum amount of time in milliseconds to wait before executing the next batch of operations. This is applicable for all target table batches. |
|
Optional |
|
|
Sets whether to insert all valid rows of a request, even if invalid rows exist. If not set, the entire insert request fails if it contains an invalid row. |
gg.handler.name.ignoreUnknownValues |
Optional |
|
|
Sets whether to accept rows that contain values that do not match the schema. If not set, rows with unknown values are considered to be invalid. |
gg.handler.name.connectionTimeout |
Optional |
Positive integer |
|
The maximum amount of time, in milliseconds, to wait for the handler to establish a connection with Google BigQuery. |
gg.handler.name.readTimeout |
Optional |
Positive integer |
|
The maximum amount of time in milliseconds to wait for the handler to read data from an established connection. |
|
Optional |
A legal string |
None |
A legal string specifying the |
gg.handler.name.auditLogMode |
Optional |
|
|
Set to Set to |
gg.handler.name.pkUpdateHandling |
Optional |
|
|
Sets how the handler handles update operations that change a primary key. Primary key operations can be problematic for the BigQuery Handler and require special consideration:
|
gg.handler.name.adjustScale |
Optional |
|
false |
The BigQuery numeric data type supports a maximum scale of 9 digits.
If a field is mapped into a BigQuery numeric data type, then it fails if the scale is
larger than 9 digits. Set this property to true to round fields
mapped to BigQuery numeric data types to a scale of 9 digits. Enabling this property
results in a loss of precision for source data values with a scale larger than
9.
|
gg.handler.name.includeDeletedColumn |
Optional |
|
false |
Set to true to include a boolean column in the
output called deleted. The value of this column is set to false for
insert and update operations, and is set to true for delete
operations.
|
gg.handler.name.enableAlter |
Optional | true | false |
false |
Set to true to enable altering the target BigQuery
table. This will allow the BigQuery Handler to add columns or metacolumns configured
on the source, which are not currently in the target BigQuery table.
|
gg.handler.name.clientId |
Optional | String | None | Use to set the client ID if the configuration property gg.handler.name.credentialsFile, which resolves the Google BigQuery credentials, is not set. You may wish to use this property instead of the credentials file in order to use Oracle Wallet to secure credentials. |
gg.handler.name.clientEmail |
Optional | String | None | Use to set the client email if the configuration property gg.handler.name.credentialsFile, which resolves the Google BigQuery credentials, is not set. You may wish to use this property instead of the credentials file in order to use Oracle Wallet to secure credentials. |
gg.handler.name.privateKey |
Optional | String | None | Use to set the private key if the configuration property gg.handler.name.credentialsFile, which resolves the Google BigQuery credentials, is not set. You may wish to use this property instead of the credentials file in order to use Oracle Wallet to secure credentials. |
gg.handler.name.privateKeyId |
Optional | String | None | Use to set the private key ID if the configuration property gg.handler.name.credentialsFile, which resolves the Google BigQuery credentials, is not set. You may wish to use this property instead of the credentials file in order to use Oracle Wallet to secure credentials. |
gg.handler.name.url |
Optional | A legal URL to connect to BigQuery including scheme, server name and port (if not the default port). The default is https://www.googleapis.com. | https://www.googleapis.com | Allows the user to set a URL for a private endpoint to connect to BigQuery. |
To be able to connect to the Google Cloud service account, ensure that either of the following is configured: the credentials file property with the relative or absolute path to the credentials JSON file, or the properties for the individual credential keys. The configuration properties that individually set the Google service account credential keys enable them to be encrypted using the Oracle wallet.
Parent topic: Setting Up and Running the BigQuery Handler
8.2.18.1.2.3 Review a Sample Configuration
The following is a sample configuration for the BigQuery Handler:
gg.handlerlist = bigquery
#The handler properties
gg.handler.bigquery.type = bigquery
gg.handler.bigquery.projectId = festive-athlete-201315
gg.handler.bigquery.credentialsFile = credentials.json
gg.handler.bigquery.auditLogMode = true
gg.handler.bigquery.pkUpdateHandling = delete-insert
gg.handler.bigquery.metaColumnsTemplate = ${optype}, ${position}
Parent topic: Setting Up and Running the BigQuery Handler
8.2.18.1.2.4 Configuring Handler Authentication
You have to configure the BigQuery Handler authentication using the credentials in the JSON file downloaded from Google BigQuery.
Download the credentials file:
-
Log in to your Google account at cloud.google.com.
-
Click Console, and then go to the Dashboard where you can select your project.
-
From the navigation menu, click APIs & Services then select Credentials.
-
From the Create Credentials menu, choose Service account key.
-
Choose the JSON key type to download the JSON credentials file for your system.
After you have the credentials file, you can authenticate the handler using one of the following methods:
- Specify the path to the credentials file in the properties file with the gg.handler.name.credentialsFile configuration property. The path to the credentials file must not include a wildcard. If you include the * wildcard in the path to the credentials file, the file is not recognized.
Or
- Set the credentials file keys (clientId, clientEmail, privateKeyId, and privateKey) in the corresponding handler properties.
Or
- Set the GOOGLE_APPLICATION_CREDENTIALS environment variable on your system. For example:
export GOOGLE_APPLICATION_CREDENTIALS=credentials.json
Then restart the Oracle GoldenGate manager process.
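The following is a minimal sketch of the second method, assuming the handler is named bigquery and that the placeholder values are substituted with the corresponding fields from the downloaded JSON key (or secured in an Oracle wallet):
#Supply the Google service account credential keys through handler properties
#instead of pointing to a credentials JSON file (placeholder values shown).
gg.handler.bigquery.clientId=<client_id from the JSON key>
gg.handler.bigquery.clientEmail=<client_email from the JSON key>
gg.handler.bigquery.privateKeyId=<private_key_id from the JSON key>
gg.handler.bigquery.privateKey=<private_key from the JSON key>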
Parent topic: Setting Up and Running the BigQuery Handler
8.2.18.1.3 Google BigQuery Dependencies
The Google BigQuery client libraries are required for integration with BigQuery.
The maven coordinates are as follows:
Maven groupId: com.google.cloud
Maven artifactId: google-cloud-bigquery
Version: 2.7.1
Parent topic: Using Streaming API
8.2.18.1.3.1 BigQuery 2.7.1
The required BigQuery Client libraries for the 2.7.1 version are as follows:
api-common-2.1.3.jar checker-compat-qual-2.5.5.jar checker-qual-3.21.1.jar commons-codec-1.15.jar commons-logging-1.2.jar error_prone_annotations-2.11.0.jar failureaccess-1.0.1.jar gax-2.11.0.jar gax-httpjson-0.96.0.jar google-api-client-1.33.1.jar google-api-services-bigquery-v2-rev20211129-1.32.1.jar google-auth-library-credentials-1.4.0.jar google-auth-library-oauth2-http-1.4.0.jar google-cloud-bigquery-2.7.1.jar google-cloud-core-2.4.0.jar google-cloud-core-http-2.4.0.jar google-http-client-1.41.2.jar google-http-client-apache-v2-1.41.2.jar google-http-client-appengine-1.41.2.jar google-http-client-gson-1.41.2.jar google-http-client-jackson2-1.41.2.jar google-oauth-client-1.33.0.jar grpc-context-1.44.0.jar gson-2.8.9.jar guava-31.0.1-jre.jar httpclient-4.5.13.jar httpcore-4.4.15.jar j2objc-annotations-1.3.jar jackson-core-2.13.1.jar javax.annotation-api-1.3.2.jar jsr305-3.0.2.jar listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar opencensus-api-0.31.0.jar opencensus-contrib-http-util-0.31.0.jar protobuf-java-3.19.3.jar protobuf-java-util-3.19.3.jar proto-google-common-protos-2.7.2.jar proto-google-iam-v1-1.2.1.jar
Parent topic: Google BigQuery Dependencies
8.2.18.2 Google BigQuery Stage and Merge
Topics:
- Overview
BigQuery is Google Cloud’s fully managed, petabyte-scale, and cost-effective analytics data warehouse that lets you run analytics over vast amounts of data in near real time. - Detailed Functionality
- Prerequisites
- Differences between BigQuery Handler and Stage and Merge BigQuery Event Handler
- Authentication or Authorization
- Configuration
- Troubleshooting and Diagnostics
Parent topic: Google BigQuery
8.2.18.2.1 Overview
BigQuery is Google Cloud’s fully managed, petabyte-scale, and cost-effective analytics data warehouse that lets you run analytics over vast amounts of data in near real time.
Parent topic: Google BigQuery Stage and Merge
8.2.18.2.2 Detailed Functionality
The BigQuery Event handler uses the stage and merge data flow.
The change data is staged in a temporary location in microbatches and eventually merged into the target table. Google Cloud Storage (GCS) is used as the staging area for change data.
This Event handler is used as a downstream Event handler connected to the output of the GCS Event handler.
The GCS Event handler loads files generated by the File Writer Handler into Google Cloud Storage.
The Event handler runs BigQuery Query jobs to execute MERGE SQL. The SQL operations are performed in batches, providing better throughput.
Note:
The BigQuery Event handler doesn't use the Google BigQuery streaming API.
Parent topic: Google BigQuery Stage and Merge
8.2.18.2.3 Prerequisites
- Target table existence: Ensure that the target tables exist in the BigQuery dataset.
- Google Cloud Storage (GCS) bucket and dataset location: Ensure that the GCS bucket and the BigQuery dataset exist in the same location/region.
Parent topic: Google BigQuery Stage and Merge
8.2.18.2.4 Differences between BigQuery Handler and Stage and Merge BigQuery Event Handler
Table 8-23 BigQuery Handler v/s Stage and Merge BigQuery Event Handler
Feature/Limitation | BigQuery Handler | Stage And Merge BigQuery Event Handler |
---|---|---|
Compressed update support | Partially supported with limitations. | YES |
Audit log mode | Processes all the operations as INSERT. | No need to enable audit log mode. |
GCP Quotas/Limits | Maximum rows per second per table: 100000. See Google BigQuery Documentation. | Daily destination table update limit: 1500 updates per table per day. See Google BigQuery Documentation. |
Approximate pricing with 1TB storage (for exact pricing, refer to the GCP Pricing calculator) | Streaming inserts for 1TB cost ~72.71 USD per month. | A query job for 1TB costs ~20.28 USD per month. |
Duplicate rows replicated to BigQuery | YES | NO |
Replication of TRUNCATE operation | Not supported | Supported |
API used | BigQuery Streaming API | BigQuery Query job |
Parent topic: Google BigQuery Stage and Merge
8.2.18.2.5 Authentication or Authorization
For more information about using the Google service account key, see Authentication and Authorization in the Google Cloud Storage (GCS) Event Handler topic. In addition to the permissions needed to access GCS, the service account also needs permissions to access BigQuery. You may choose to use a pre-defined IAM role, such as roles/bigquery.dataEditor or roles/bigquery.dataOwner. When creating a custom role, the following are the IAM permissions used to run the BigQuery Event handler. For more information, see Configuring Handler Authentication.
Parent topic: Google BigQuery Stage and Merge
8.2.18.2.5.1 BigQuery Permissions
Table 8-24 BigQuery Permissions
Permission | Description |
---|---|
bigquery.connections.create |
Create new connections in a project. |
bigquery.connections.delete |
Delete a connection. |
bigquery.connections.get |
Gets connection metadata. Credentials are excluded. |
bigquery.connections.list |
List connections in a project. |
bigquery.connections.update |
Update a connection and its credentials. |
bigquery.connections.use |
Use a connection configuration to connect to a remote data source. |
bigquery.datasets.create |
Create new datasets. |
bigquery.datasets.get |
Get metadata about a dataset. |
bigquery.datasets.getIamPolicy |
Reserved for future use. |
bigquery.datasets.update |
Update metadata for a dataset. |
bigquery.datasets.updateTag |
Update tags for a dataset. |
bigquery.jobs.create |
Run jobs (including queries) within the project. |
bigquery.jobs.get |
Get data and metadata on any job. |
bigquery.jobs.list |
List all jobs and retrieve metadata on any job submitted by any user. For jobs submitted by other users, details and metadata are redacted. |
bigquery.jobs.listAll |
List all jobs and retrieve metadata on any job submitted by any user. |
bigquery.jobs.update |
Cancel any job. |
bigquery.readsessions.create |
Create a new read session via the BigQuery Storage API. |
bigquery.readsessions.getData |
Read data from a read session via the BigQuery Storage API. |
bigquery.readsessions.update |
Update a read session via the BigQuery Storage API. |
bigquery.reservations.create |
Create a reservation in a project. |
bigquery.reservations.delete |
Delete a reservation. |
bigquery.reservations.get |
Retrieve details about a reservation. |
bigquery.reservations.list |
List all reservations in a project. |
bigquery.reservations.update |
Update a reservation’s properties. |
bigquery.reservationAssignments.create |
Create a reservation assignment. This permission is
required on the owner project and assignee resource. To move a
reservation assignment, you need
bigquery.reservationAssignments.create on the new
owner project and assignee resource.
|
bigquery.reservationAssignments.delete |
Delete a reservation assignment. This permission is
required on the owner project and assignee resource. To move a
reservation assignment, you need
bigquery.reservationAssignments.delete on the old
owner project and assignee resource.
|
bigquery.reservationAssignments.list |
List all reservation assignments in a project. |
bigquery.reservationAssignments.search |
Search for a reservation assignment for a given project, folder, or organization. |
bigquery.routines.create |
Create new routines (functions and stored procedures). |
bigquery.routines.delete |
Delete routines. |
bigquery.routines.list |
List routines and metadata on routines. |
bigquery.routines.update |
Update routine definitions and metadata. |
bigquery.savedqueries.create |
Create saved queries. |
bigquery.savedqueries.delete |
Delete saved queries. |
bigquery.savedqueries.get |
Get metadata on saved queries. |
bigquery.savedqueries.list |
Lists saved queries. |
bigquery.savedqueries.update |
Updates saved queries. |
bigquery.tables.create |
Create new tables. |
bigquery.tables.delete |
Delete tables |
bigquery.tables.export |
Export table data out of BigQuery. |
bigquery.tables.get |
Get table metadata. To get table data, you need
bigquery.tables.getData .
|
bigquery.tables.getData |
Get table data. This permission is required for querying
table data. To get table metadata, you need
bigquery.tables.get .
|
bigquery.tables.getIamPolicy |
Read a table’s IAM policy. |
bigquery.tables.list |
List tables and metadata on tables. |
bigquery.tables.setCategory |
Set policy tags in table schema. |
bigquery.tables.setIamPolicy |
Changes a table’s IAM policy. |
bigquery.tables.update |
Update table metadata. To update table data, you need
bigquery.tables.updateData .
|
bigquery.tables.updateData |
Update table data. To update table metadata, you need
bigquery.tables.update .
|
bigquery.tables.updateTag |
Update tags for a table. |
In addition to these permissions, ensure that
resourcemanager.projects.get/list
is always granted as a pair.
Parent topic: Authentication or Authorization
8.2.18.2.6 Configuration
- Automatic Configuration
- Classpath Configuration
The GCS Event handler and the BigQuery Event handler use the Java SDK provided by Google. Google does not provide a direct link to download the SDK. - Proxy Configuration
- INSERTALLRECORDS Support
- BigQuery Dataset and GCP ProjectId Mapping
- End-to-End Configuration
- Compressed Update Handling
Parent topic: Google BigQuery Stage and Merge
8.2.18.2.6.1 Automatic Configuration
Replication to BigQuery involves configuring multiple components, such as the File Writer handler, the Google Cloud Storage (GCS) Event handler, and the BigQuery Event handler.
The Automatic Configuration functionality helps to auto configure these components so that the user configuration is minimal.
The properties modified by auto configuration are also logged in the handler log file. To enable auto configuration to replicate to the BigQuery target, set the parameter gg.target=bq.
When replicating to the BigQuery target, you cannot customize the GCS Event handler name and the BigQuery Event handler name.
- File Writer Handler Configuration
The File Writer handler name is preset to the value bq. The following is an example of editing a File Writer handler property: gg.handler.bq.pathMappingTemplate=./dirout. - GCS Event Handler Configuration
The GCS Event handler name is preset to the value gcs. The following is an example of editing a GCS Event handler property: gg.eventhandler.gcs.concurrency=5. - BigQuery Event Handler Configuration
The BigQuery Event handler name is preset to the value bq. There are no mandatory parameters required for the BigQuery Event handler. Mostly, auto configure derives the required parameters.
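Putting these together, a minimal sketch of an auto-configured properties file; the bucket name and credentials path are placeholders, and the complete example is shown in End-to-End Configuration:
#Enable auto configuration for the BigQuery target
gg.target=bq
#Optional overrides using the preset handler names (bq, gcs, bq)
gg.handler.bq.pathMappingTemplate=./dirout
gg.eventhandler.gcs.bucketMappingTemplate=<gcs-bucket-name>
gg.eventhandler.gcs.credentialsFile=/path/to/gcp/credentialsFile
gg.eventhandler.gcs.concurrency=5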
Parent topic: Configuration
8.2.18.2.6.1.1 File Writer Handler Configuration
File Writer handler name is preset to the value bq
. The
following is an example to edit a property of File Writer handler:
gg.handler.bq.pathMappingTemplate=./dirout
.
Parent topic: Automatic Configuration
8.2.18.2.6.1.2 GCS Event Handler Configuration
The GCS Event handler name is preset to the value gcs
. The following
is an example to edit a property of GCS Event handler:
gg.eventhandler.gcs.concurrency=5
.
Parent topic: Automatic Configuration
8.2.18.2.6.1.3 BigQuery Event Handler Configuration
BigQuery Event handler name is preset to the value bq
.
There are no mandatory parameters required for BigQuery Event handler. Mostly, auto
configure derives the required parameters.
The following are the BigQuery Event handler configurations:
Properties | Required/ Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.eventhandler.bq.credentialsFile |
Optional | Relative or absolute path to the service account key file. | Value from property
gg.eventhandler.gcs.credentialsFile |
Sets the path to the service account key file.
Autoconfigure will automatically configure this property based on the
configuration gg.eventhandler.gcs.credentialsFile,
unless the user wants to use a different service account key file for
BigQuery access. Alternatively, if the environment variable
GOOGLE_APPLICATION_CREDENTIALS is set to the path
to the service account key file, this parameter need not be set.
|
gg.eventhandler.bq.projectId |
Optional | The Google project-id | project-id associated with the service account. | Sets the project-id of the Google Cloud project that houses BigQuery. Autoconfigure will automatically configure this property by accessing the service account key file unless user wants to override this explicitly. |
gg.eventhandler.bq.kmsKey |
Optional | Key names in the format:
projects/<PROJECT>/locations/<LOCATION>/keyRings/<RING_NAME>/cryptoKeys/<KEY_NAME>
|
Value from property
gg.eventhandler.gcs.kmsKey |
Set a customer managed Cloud KMS key to encrypt data in
BigQuery. Autoconfigure will automatically configure this property based
on the configuration
gg.eventhandler.gcs.kmsKey .
|
gg.eventhandler.bq.connectionTimeout |
Optional | Positive integer. | 20000 |
The maximum amount of time, in milliseconds, to wait for the handler to establish a connection with Google BigQuery. |
gg.eventhandler.bq.readTimeout |
Optional | Positive integer. | 30000 |
The maximum amount of time in milliseconds to wait for the handler to read data from an established connection. |
gg.eventhandler.bq.totalTimeout |
Optional | Positive integer. | 120000 |
The total timeout parameter in seconds. The TotalTimeout parameter has the ultimate control over how long the logic should keep trying the remote call until it gives up completely. |
gg.eventhandler.bq.retries |
Optional | Positive integer. | 3 |
The maximum number of retry attempts to perform. |
gg.eventhandler.bq.createDataset |
Optional | true | false |
true |
Set to true to automatically create the
BigQuery dataset if it does not exist.
|
gg.eventhandler.bq.createTable |
Optional | true | false |
true |
Set to true to automatically create the
BigQuery target table if it does not exist.
|
gg.aggregate.operations.flush.interval |
Optional | Integer | 30000 | The flush interval parameter determines how often the data will be merged into BigQuery. The value is set in milliseconds. Note: Use the flush interval parameter with caution. Increasing its default value will increase the amount of data stored in the internal memory of the Replicat process, which can cause out-of-memory errors and stop the Replicat if it runs out of memory. |
gg.compressed.update |
Optional | true or false |
true |
If set to true, then this indicates that the source trail files contain compressed update operations. If set to false, then the source trail files are expected to contain uncompressed update operations. |
gg.eventhandler.bq.connectionRetryIntervalSeconds
|
Optional | Integer Value | 30 | Specifies the delay in seconds between connection retry attempts. |
gg.eventhandler.bq.connectionRetries
|
Optional | Integer Value | 3 | Specifies the number of times connections to the target data warehouse will be retried. |
gg.eventhandler.bq.url |
Optional | An absolute URL to connect to Google BigQuery. | https://googleapis.com | A legal URL to connect to Google BigQuery including scheme, server name and port (if not the default port). The default is https://googleapis.com. |
Parent topic: Automatic Configuration
8.2.18.2.6.2 Classpath Configuration
The GCS Event handler and the BigQuery Event handler use the Java SDK provided by Google. Google does not provide a direct link to download the SDK.
You can download the SDKs using the following maven co-ordinates:
<dependency>
  <groupId>com.google.cloud</groupId>
  <artifactId>google-cloud-storage</artifactId>
  <version>1.113.9</version>
</dependency>
To download the GCS dependencies, execute the following script
<OGGDIR>/DependencyDownloader/gcs.sh
.
<dependency>
  <groupId>com.google.cloud</groupId>
  <artifactId>google-cloud-bigquery</artifactId>
  <version>1.111.1</version>
</dependency>
To download the BigQuery dependencies, execute the following script
<OGGDIR>/DependencyDownloader/bigquery.sh
. For more information, see
gcs.sh
in Dependency Downloader Scripts.
Set the path to the GCS and BigQuery SDK in the
gg.classpath
configuration parameter. For example:
gg.classpath=./gcs-deps/*:./bq-deps/*
.
For more information, see Dependency Downloader Scripts.
Parent topic: Configuration
8.2.18.2.6.3 Proxy Configuration
When the Replicat process is run behind a proxy server, you can use the jvm.bootoptions property to set the proxy server configuration. For example:
jvm.bootoptions=-Dhttps.proxyHost=some-proxy-address.com -Dhttps.proxyPort=80
Parent topic: Configuration
8.2.18.2.6.4 INSERTALLRECORDS Support
Stage and merge targets support the INSERTALLRECORDS parameter. See INSERTALLRECORDS in Reference for Oracle GoldenGate. Set the INSERTALLRECORDS parameter in the Replicat parameter file (.prm).
Setting this property directs the Replicat process to use bulk insert operations to load operation data into the target table.
To process initial load trail files, set the INSERTALLRECORDS parameter in the Replicat parameter file (.prm). Setting this property directs the Replicat process to use bulk insert operations to load operation data into the target table. You can tune the batch size of bulk inserts using the gg.handler.bq.maxFileSize File Writer property; the default value is 1GB. The frequency of bulk inserts can be tuned using the File Writer gg.handler.bq.fileRollInterval property; the default value is 3m (three minutes).
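For illustration, a hedged sketch of these two tuning properties; the values simply restate the documented defaults and assume that the size and interval syntax shown (1g and 3m) is accepted by your File Writer configuration:
#Tune bulk insert batching for initial load trails (illustrative values)
gg.handler.bq.maxFileSize=1g
gg.handler.bq.fileRollInterval=3m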
Parent topic: Configuration
8.2.18.2.6.5 BigQuery Dataset and GCP ProjectId Mapping
The table catalog name is mapped to the GCP
projectId
.
Parent topic: Configuration
8.2.18.2.6.5.1 Three-Part Table Names
Parent topic: BigQuery Dataset and GCP ProjectId Mapping
8.2.18.2.6.5.2 Mapping Table
Table 8-25 Mapping Table
MAP statement in the Replicat parameter file | BigQuery Dataset | GCP ProjectId |
---|---|---|
MAP SCHEMA1.*, TARGET "bq-project-1".*.*; | SCHEMA1 | bq-project-1 |
MAP "bq-project-2".SCHEMA2.*, TARGET *.*.*; | SCHEMA2 | bq-project-2 |
MAP SCHEMA3.*, TARGET *.*; | SCHEMA3 | The default projectId from the GCP service account key file or the configuration gg.eventhandler.bq.projectId. |
Parent topic: BigQuery Dataset and GCP ProjectId Mapping
8.2.18.2.6.6 End-to-End Configuration
The following is an end-to-end configuration example which uses auto configuration for the File Writer (FW) handler, GCS, and BigQuery Event handlers.
AdapterExamples/big-data/bigquery-via-gcs/bq.props:
# Configuration to load GoldenGate trail operation records
# into Google Big Query by chaining
# File writer handler -> GCS Event handler -> BQ Event handler.
# Note: Recommended to only edit the configuration marked as TODO
# The property gg.eventhandler.gcs.credentialsFile need not be set if
# the GOOGLE_APPLICATION_CREDENTIALS environment variable is set.
gg.target=bq
## The GCS Event handler
#TODO: Edit the GCS bucket name
gg.eventhandler.gcs.bucketMappingTemplate=<gcs-bucket-name>
#TODO: Edit the GCS credentialsFile
gg.eventhandler.gcs.credentialsFile=/path/to/gcp/credentialsFile
## The BQ Event handler
## No mandatory configuration required.
#TODO: Edit to include the GCS Java SDK and BQ Java SDK.
gg.classpath=/path/to/gcs-deps/*:/path/to/bq-deps/*
#TODO: Edit to provide sufficient memory (at least 8GB).
jvm.bootoptions=-Xmx8g -Xms8g
#TODO: If running OGGBD behind a proxy server.
#jvm.bootoptions=-Xmx8g -Xms512m -Dhttps.proxyHost=<ip-address> -Dhttps.proxyPort=<port>
Parent topic: Configuration
8.2.18.2.6.7 Compressed Update Handling
A compressed update record contains values for the key columns and the modified columns.
An uncompressed update record contains values for all the columns.
Oracle GoldenGate trails may contain compressed or uncompressed update records. The default extract configuration writes compressed updates to the trails.
The parameter gg.compressed.update can be set to true or false to indicate compressed or uncompressed update records.
Parent topic: Configuration
8.2.18.2.6.7.1 MERGE Statement with Uncompressed Updates
In some use cases, if the trail contains uncompressed update records, then the MERGE SQL statement can be optimized for better performance by setting gg.compressed.update=false.
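For example, assuming the Extract is configured to write uncompressed (full-image) updates to the trail:
#Indicate that the source trail contains uncompressed update records
gg.compressed.update=false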
Parent topic: Compressed Update Handling
8.2.18.2.7 Troubleshooting and Diagnostics
- DDL not applied on the target table: Oracle GoldenGate for Big Data does not support DDL replication.
- SQL Errors: In case there are any errors while executing any SQL, the entire SQL statement along with the bind parameter values are logged into the Oracle GoldenGate for Big Data handler log file.
- Co-existence of the components: The location/region of the
machine where Replicat process is running and the BigQuery dataset/GCS bucket
impacts the overall throughput of the apply process.
Data flow is as follows: GoldenGate -> GCS bucket -> BigQuery. For best throughput, ensure that the components are located as close as possible.
- Error: com.google.cloud.bigquery.BigQueryException: Access Denied: Project <any-gcp-project>: User does not have bigquery.datasets.create permission in project <any-gcp-project>. The service account key used by Oracle GoldenGate for Big Data does not have permission to create datasets in this project. Grant the permission bigquery.datasets.create and restart the Replicat process. The privileges are listed in BigQuery Permissions.
Parent topic: Google BigQuery Stage and Merge
8.2.19 Google Cloud Storage
Topics:
Parent topic: Target
8.2.19.1 Overview
You can use the GCS Event handler to load files generated by the File Writer handler into GCS.
Parent topic: Google Cloud Storage
8.2.19.2 Prerequisites
- Google Cloud Platform (GCP) account set up.
- Google service account key with the relevant permissions.
- GCS Java Software Development Kit (SDK)
Parent topic: Google Cloud Storage
8.2.19.3 Buckets and Objects
Parent topic: Google Cloud Storage
8.2.19.4 Authentication and Authorization
You need to create a service account key with the relevant Identity and Access Management (IAM) permissions.
Use the JSON key type to generate the service account key file.
You can either set the path to the service account key file in the environment variable GOOGLE_APPLICATION_CREDENTIALS, or in the GCS Event handler property gg.eventhandler.name.credentialsFile. You can also specify the individual keys of the credentials file (clientId, clientEmail, privateKeyId, and privateKey) in the corresponding handler properties instead of specifying the credentials file path directly. This enables the credential keys to be encrypted using the Oracle wallet.
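A minimal sketch of the credential-key alternative, assuming the event handler is named gcs and that the placeholder values are taken from the downloaded JSON key (or secured in an Oracle wallet):
#Supply the service account credential keys through event handler properties
#instead of pointing to a credentials JSON file (placeholder values shown).
gg.eventhandler.gcs.clientId=<client_id from the JSON key>
gg.eventhandler.gcs.clientEmail=<client_email from the JSON key>
gg.eventhandler.gcs.privateKeyId=<private_key_id from the JSON key>
gg.eventhandler.gcs.privateKey=<private_key from the JSON key>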
The following are the IAM permissions to be added to the service account used to run the GCS Event handler.
Parent topic: Google Cloud Storage
8.2.19.4.1 Bucket Permissions
Table 8-26 Bucket Permissions
Bucket Permission Name | Description |
---|---|
storage.buckets.create |
Create new buckets in a project. |
storage.buckets.delete |
Delete buckets. |
storage.buckets.get |
Read bucket metadata, excluding IAM policies. |
storage.buckets.list |
List buckets in a project. Also read bucket metadata, excluding IAM policies, when listing. |
storage.buckets.update |
Update bucket metadata, excluding IAM policies. |
Parent topic: Authentication and Authorization
8.2.19.4.2 Object Permissions
Table 8-27 Object Permissions
Object Permission Name | Description |
---|---|
storage.objects.create |
Add new objects to a bucket. |
storage.objects.delete |
Delete objects. |
storage.objects.get |
Read object data and metadata, excluding ACLs. |
storage.objects.list |
List objects in a bucket. Also read object metadata, excluding ACLs, when listing. |
storage.objects.update |
Update object metadata, excluding ACLs. |
Parent topic: Authentication and Authorization
8.2.19.5 Configuration
Table 8-28 GCS Event Handler Configuration Properties
Properties | Required/Optional | Legal Values | Default | Explanation | |
---|---|---|---|---|---|
gg.eventhandler.name.type |
Required | gcs |
None | Selects the GCS Event Handler for use with File Writer handler. | |
gg.eventhandler.name.location |
Optional | A valid GCS location. | None | If the GCS bucket does not exist, a new bucket will be created in this GCS location. If the location is not specified, new bucket creation will fail. GCS location reference: GCS locations. | |
gg.eventhandler.name.bucketMappingTemplate |
Required | A string with resolvable keywords and constants used to dynamically generate a GCS bucket name. | None | If a GCS bucket with this name does not exist, the GCS Event handler creates it. See Bucket Naming Guidelines. For more information about supported keywords, see Template Keywords. | |
gg.eventhandler.name.pathMappingTemplate |
Required | A string with resolvable keywords and constants used to dynamically generate the path in the GCS bucket to write the file. | None | Use keywords interlaced with constants to dynamically generate unique GCS path names at runtime. Example path name: ogg/data/${groupName}/${fullyQualifiedTableName}. For more information about supported keywords, see Template Keywords. | |
gg.eventhandler.name.fileNameMappingTemplate |
Optional | A string with resolvable keywords and constants used to dynamically generate a file name for the GCS object. | None | Use resolvable keywords and constants used to dynamically generate the GCS object file name. If not set, the upstream file name is used. For more information about supported keywords, see Template Keywords | |
gg.eventhandler.name.finalizeAction |
Optional | A unique string identifier cross referencing a child event handler. | No event handler configured. | Sets the downstream event handler that is invoked on the file roll event. A typical example would be to use a downstream BigQuery Event handler to load the GCS data into Google BigQuery. | |
gg.eventhandler.name.credentialsFile |
Optional | Relative or absolute path to the service account key file. | None | Sets the path to the service account key file. Alternatively, if
the environment variable GOOGLE_APPLICATION_CREDENTIALS is set to
the path to the service account key file, then you need not set this parameter.
|
|
gg.eventhandler.name.storageClass |
Optional | STANDARD|NEARLINE |COLDLINE|ARCHIVE|
REGIONAL|MULTI_REGIONAL| DURABLE_REDUCED_AVAILABILITY |
None | The storage class you set for an object affects the object’s availability and pricing model. If this property is not set, then the storage class for the file is set to the default storage class for the respective bucket. If the bucket does not exist and storage class is specified, then a new bucket is created with this storage class as its default. | |
gg.eventhandler.name.kmsKey |
Optional | Key names in the format:
projects/<PROJECT>/locations/<LOCATION>/keyRings/<RING_NAME>/cryptoKeys/<KEY_NAME> .
<PROJECT> : Google project-id.
<LOCATION> : Location of the GCS bucket.
<RING_NAME> : Google Cloud KMS key ring
name. <KEY_NAME> : Google Cloud KMS key
name.
|
None | Google Cloud Storage always encrypts your data on the server
side, before it is written to disk using Google-managed encryption keys. As an
additional layer of security, customers may choose to use keys generated by Google
Cloud Key Management Service (KMS). This property can be used to set a customer
managed Cloud KMS key to encrypt GCS objects. When using customer managed keys,
the gg.eventhandler.name.concurrency property cannot be set to a
value greater than one because with customer managed keys GCP does not allow
multi-part uploads using object composition.
|
|
gg.eventhandler.name.concurrency |
Optional | Any number in the range 1 to 32. | 10 |
If concurrency is set to a value greater than one, then the GCS
Event handler performs multi-part uploads using composition. The multi-part
uploads spawn concurrent threads to upload each part. The individual parts are
uploaded to the following directory
<bucketMappingTemplate>/oggtmp . This directory
is reserved for use by Oracle GoldenGate for Big Data. This provides better
throughput rates for uploading large files. Multi-part uploads are used for files
with size greater than 10 mega bytes.
|
|
gg.eventhandler.gcs.clientId |
Optional | Valid Big Query Credentials Client Id | NA | Provides the client ID key from the credentials file for connecting to Google Big Query service account. | |
gg.eventhandler.gcs.clientEmail |
Optional | Valid Big Query Credentials Client Email | NA | Provides the client Email key from the credentials file for connecting to Google Big Query service account. | |
gg.eventhandler.gcs.privateKeyId |
Optional | Valid Big Query Credentials Private Key ID | NA | Provides the private key ID from the credentials file for connecting to Google Big Query service account. | |
gg.eventhandler.gcs.privateKey |
Optional | Valid Big Query Credentials Private Key. | NA | Provides the Private Key from the credentials file for connecting to Google Big Query service account. | |
gg.eventhandler.name.projectId |
Optional | The Google project-id | project-id associated
with the service account.
|
NA | Sets the project-id of the Google Cloud project
that houses the storage bucket. Auto configure will automatically configure this
property by accessing the service account key file unless user wants to override
this explicitly.
|
|
gg.eventhandler.name.url |
Optional | A legal URL to connect to Google Cloud Storage including scheme, server name and port (if not the default port). The default is https://storage.googleapis.com. | https://storage.googleapis.com | Allows the user to set a URL for a private endpoint to connect to GCS. |
Note:
To be able to connect GCS to the Google Cloud service account, ensure that either of the following is configured: the credentials file property with the relative or absolute path to the credentials JSON file, or the properties for the individual credential keys. The configuration properties that individually set the Google service account credential keys enable them to be encrypted using the Oracle wallet.
8.2.19.5.1 Classpath Configuration
The GCS Event handler uses the Java SDK for Google Cloud Storage. The classpath must include the path to the GCS SDK.
Parent topic: Configuration
8.2.19.5.1.1 Dependencies
<dependency>
  <groupId>com.google.cloud</groupId>
  <artifactId>google-cloud-storage</artifactId>
  <version>1.113.9</version>
</dependency>
Alternatively, you can download the GCS dependencies by running the script:
<OGGDIR>/DependencyDownloader/gcs.sh
.
Edit the gg.classpath
configuration parameter to include the path to the
GCS SDK.
Parent topic: Classpath Configuration
8.2.19.5.2 Proxy Configuration
When the Replicat process is run behind a proxy server, you can use the jvm.bootoptions property to set the proxy server configuration. For example:
jvm.bootoptions=-Dhttps.proxyHost=some-proxy-address.com -Dhttps.proxyPort=80
Parent topic: Configuration
8.2.19.5.3 Sample Configuration
#The GCS Event handler
gg.eventhandler.gcs.type=gcs
gg.eventhandler.gcs.pathMappingTemplate=${fullyQualifiedTableName}
#TODO: Edit the GCS bucket name
gg.eventhandler.gcs.bucketMappingTemplate=<gcs-bucket-name>
#TODO: Edit the GCS credentialsFile
gg.eventhandler.gcs.credentialsFile=/path/to/gcs/credentials-file
gg.eventhandler.gcs.finalizeAction=none
gg.classpath=/path/to/gcs-deps/*
jvm.bootoptions=-Xmx8g -Xms8g
Parent topic: Configuration
8.2.20 Java Message Service (JMS)
The Java Message Service (JMS) Handler allows operations from a trail file to be formatted into messages, and then published to JMS providers such as Oracle WebLogic Server, WebSphere, and ActiveMQ.
This chapter describes how to use the JMS Handler.
Parent topic: Target
8.2.20.1 Overview
The Java Message Service is a Java API that allows applications to create, send, receive, and read messages. The JMS API defines a common set of interfaces and associated semantics that allow programs written in the Java programming language to communicate with other messaging implementations.
The JMS Handler captures the Oracle GoldenGate trail and sends those messages to the configured JMS providers.
Note:
The Java Message Service (JMS) Handler does not support DDL operations. In case of DDL operations, the Replicat or Extract process is expected to fail.
Parent topic: Java Message Service (JMS)
8.2.20.2 Setting Up and Running the JMS Handler
The JMS Handler setup (JNDI configuration) depends on the JMS provider that you use.
The following sections provide instructions for configuring the JMS Handler components and running the handler.
Runtime Prerequisites
The JMS provider should be up and running with the required ConnectionFactory, QueueConnectionFactory, and TopicConnectionFactory configured.
Security
Configure SSL according to the JMS provider used.
- Classpath Configuration
- Java Naming and Directory Interface Configuration
- Handler Configuration
- Sample Configuration Using Oracle WebLogic Server
Parent topic: Java Message Service (JMS)
8.2.20.2.1 Classpath Configuration
Oracle recommends that you store the JMS Handler properties file in the Oracle GoldenGate dirprm directory. The JMS Handler requires the JMS provider client JARs to be in the classpath in order to execute. The location of the provider's client JARs is set similar to the following:
gg.classpath= path_to_the_providers_client_jars
Parent topic: Setting Up and Running the JMS Handler
8.2.20.2.2 Java Naming and Directory Interface Configuration
You configure the Java Naming and Directory Interface (JNDI) properties to connect to an Initial Context to look up the connection factory and initial destination.
Table 8-29 JNDI Configuration Properties
Properties | Required/ Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
|
Required |
Valid provider URL with port |
None |
Specifies the URL that the handler uses to look up objects on the server. For
example, |
|
Required |
Initial Context factory class name |
None |
Specifies which initial context factory to use when
creating a new initial context object. For Oracle WebLogic
Server, the value is
|
|
Required |
Valid user name |
None |
Specifies the user name to use. |
|
Required |
Valid password |
None |
Specifies the password for the user. |
Parent topic: Setting Up and Running the JMS Handler
8.2.20.2.3 Handler Configuration
You configure the JMS Handler operation using the properties file. These properties are located in the Java Adapter properties file (not in the Replicat properties file).
To enable the selection of the JMS Handler, you must first configure the handler type by specifying gg.handler.name.type=jms and the other JMS properties as follows:
Table 8-30 JMS Handler Configuration Properties
Parent topic: Setting Up and Running the JMS Handler
8.2.20.2.4 Sample Configuration Using Oracle WebLogic Server
#JMS Handler Template
gg.handlerlist=jms
gg.handler.jms.type=jms
#TODO: Set the message formatter type
gg.handler.jms.format=
#TODO: Set the destination for resolving the queue/topic name.
gg.handler.jms.destination=
#Start of JMS handler properties when JNDI is used.
gg.handler.jms.useJndi=true
#TODO: Set the connectionFactory for resolving the queue/topic name.
gg.handler.jms.connectionFactory=
#TODO: Set the standard JNDI properties url, initial factory name, principal and credentials.
java.naming.provider.url=
java.naming.factory.initial=
java.naming.security.principal=
java.naming.security.credentials=
#End of JMS handler properties when JNDI is used.
#Start of JMS handler properties when JNDI is not used.
#TODO: Comment the above properties related to useJndi is true.
#TODO: Uncomment the below properties to configure when useJndi is false.
#gg.handler.jms.useJndi=false
#TODO: Set connectionURL of MQ.
#gg.handler.jms.connectionUrl=
#TODO: Set the connection Factory Class of the MQ.
#gg.handler.jms.connectionFactoryClass=
#TODO: Set the path to the JMS client library wlthint3client.jar
gg.classpath=
jvm.bootoptions=-Xmx512m -Xms32m
Parent topic: Setting Up and Running the JMS Handler
8.2.20.3 JMS Dependencies
The Java EE Specification APIs have moved out of the JDK in Java 8. JMS is a part of this specification, and therefore this dependency is required.
Maven groupId: javax
Maven artifactId: javaee-api
Version: 8.0
You can download the jar from Maven Central Repository.
Parent topic: Java Message Service (JMS)
8.2.21 Java Database Connectivity
Learn how to use the Java Database Connectivity (JDBC) Handler, which can replicate source transactional data to a target or database.
This chapter describes how to use the JDBC Handler.
- Overview
- Detailed Functionality
The JDBC Handler replicates source transactional data to a target or database by using a JDBC interface. - Setting Up and Running the JDBC Handler
Use the JDBC Metadata Provider with the JDBC Handler to obtain column mapping features, column function features, and better data type mapping. - Sample Configurations
Parent topic: Target
8.2.21.1 Overview
The Generic Java Database Connectivity (JDBC) Handler lets you replicate source transactional data to a target system or database by using a JDBC interface. You can use it with targets that support JDBC connectivity.
You can use the JDBC API to access virtually any data source, from relational databases to spreadsheets and flat files. JDBC technology also provides a common base on which the JDBC Handler was built. The JDBC handler with the JDBC metadata provider also lets you use Replicat features such as column mapping and column functions. For more information about using these features, see Metadata Providers
For more information about using the JDBC API, see http://docs.oracle.com/javase/8/docs/technotes/guides/jdbc/index.html.
Parent topic: Java Database Connectivity
8.2.21.2 Detailed Functionality
The JDBC Handler replicates source transactional data to a target or database by using a JDBC interface.
- Single Operation Mode
- Oracle Database Data Types
- MySQL Database Data Types
- Netezza Database Data Types
- Redshift Database Data Types
Parent topic: Java Database Connectivity
8.2.21.2.1 Single Operation Mode
The JDBC Handler performs SQL operations on every single trail record (row operation) when the trail record is processed by the handler. The JDBC Handler does not use the BATCHSQL
feature of the JDBC API to batch operations.
Parent topic: Detailed Functionality
8.2.21.2.3 MySQL Database Data Types
The following column data types are supported for MySQL Database targets:
INT
REAL
FLOAT
DOUBLE
NUMERIC
DATE
DATETIME
TIMESTAMP
TINYINT
BOOLEAN
SMALLINT
BIGINT
MEDIUMINT
DECIMAL
BIT
YEAR
ENUM
CHAR
VARCHAR
Parent topic: Detailed Functionality
8.2.21.2.4 Netezza Database Data Types
The following column data types are supported for Netezza database targets:
byteint
smallint
integer
bigint
numeric(p,s)
numeric(p)
float(p)
real
double
char
varchar
nchar
nvarchar
date
time
timestamp
Parent topic: Detailed Functionality
8.2.21.2.5 Redshift Database Data Types
The following column data types are supported for Redshift database targets:
SMALLINT
INTEGER
BIGINT
DECIMAL
REAL
DOUBLE
CHAR
VARCHAR
DATE
TIMESTAMP
Parent topic: Detailed Functionality
8.2.21.3 Setting Up and Running the JDBC Handler
Use the JDBC Metadata Provider with the JDBC Handler to obtain column mapping features, column function features, and better data type mapping.
The following topics provide instructions for configuring the JDBC Handler components and running the handler.
Parent topic: Java Database Connectivity
8.2.21.3.1 Java Classpath
The JDBC Java Driver location must be included in the class path of the handler using the gg.classpath
property.
For example, the configuration for a MySQL database could be:
gg.classpath= /path/to/jdbc/driver/jar/mysql-connector-java-5.1.39-bin.jar
Parent topic: Setting Up and Running the JDBC Handler
8.2.21.3.2 Handler Configuration
You configure the JDBC Handler operation using the properties file. These properties are located in the Java Adapter properties file (not in the Replicat properties file).
To enable the selection of the JDBC Handler, you must first configure the handler
type by specifying gg.handler.name.type=jdbc
and the other
JDBC properties as follows:
Table 8-31 JDBC Handler Configuration Properties
Properties | Required/ Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.handler.name.type | Required | jdbc | None | Selects the JDBC Handler for streaming change data capture into name. |
gg.handler.name.connectionURL | Required | A valid JDBC connection URL | None | The target specific JDBC connection URL. |
gg.handler.name.DriverClass | Target database dependent. | The target specific JDBC driver class name | None | The target specific JDBC driver class name. |
gg.handler.name.userName | Target database dependent. | A valid user name | None | The user name used for the JDBC connection to the target database. |
gg.handler.name.password | Target database dependent. | A valid password | None | The password used for the JDBC connection to the target database. |
| Optional | Unsigned integer | Target database dependent | If this property is not specified, the JDBC Handler queries the target dependent database metadata indicating the maximum number of active prepared SQL statements. Some targets do not provide this metadata, so the default value of 256 active SQL statements is used. If this property is specified, the JDBC Handler does not query the target database for such metadata and uses the configured value. In either case, when the JDBC Handler finds that the total number of active SQL statements is about to be exceeded, the oldest SQL statement is removed from the cache to add one new SQL statement. |
Parent topic: Setting Up and Running the JDBC Handler
8.2.21.3.3 Statement Caching
To speed up DML operations, JDBC driver implementations typically allow multiple statements to be cached. This configuration avoids repreparing a statement for operations that share the same profile or template.
The JDBC Handler uses statement caching to speed up the process and caches as many statements as the underlying JDBC driver supports. The cache is implemented by using an LRU cache where the key is the profile of the operation (stored internally in the memory as an instance of StatementCacheKey
class), and the value is the PreparedStatement
object itself.
A StatementCacheKey
object contains the following information for the various DML profiles that are supported in the JDBC Handler:
DML operation type | StatementCacheKey contains a tuple of: |
---|---|
INSERT | (table name, operation type, ordered after-image column indices) |
UPDATE | (table name, operation type, ordered after-image column indices) |
DELETE | (table name, operation type) |
TRUNCATE | (table name, operation type) |
Parent topic: Setting Up and Running the JDBC Handler
8.2.21.3.4 Setting Up Error Handling
The JDBC Handler supports using the REPERROR
and HANDLECOLLISIONS
Oracle GoldenGate parameters. See Reference for Oracle GoldenGate.
You must configure the following properties in the handler properties file to define the mapping of different error codes for the target database.
-
gg.error.duplicateErrorCodes
-
A comma-separated list of error codes defined in the target database that indicate a duplicate key violation error. Most JDBC drivers return a valid error code, so REPERROR actions can be configured based on the error code. For example: gg.error.duplicateErrorCodes=1062,1088,1092,1291,1330,1331,1332,1333
-
gg.error.notFoundErrorCodes
-
A comma-separated list of error codes that indicate missed DELETE or UPDATE operations on the target database.
In some cases, the JDBC driver returns an error when an UPDATE or DELETE operation does not modify any rows in the target database, so no additional handling is required by the JDBC Handler.
Most JDBC drivers do not return an error when a DELETE or UPDATE affects zero rows, so the JDBC Handler automatically detects a missed UPDATE or DELETE operation and triggers an error to indicate a not-found error to the Replicat process. The Replicat process can then execute the specified REPERROR action.
The default error code used by the handler is zero. When you configure this property to a non-zero value, the configured error code value is used when the handler triggers a not-found error. For example:
gg.error.notFoundErrorCodes=1222
-
gg.error.deadlockErrorCodes
-
A comma-separated list of error codes that indicate a deadlock error in the target database. For example:
gg.error.deadlockErrorCodes=1213
- Setting Codes
-
Oracle recommends that you set a non-zero error code for the
gg.error.duplicateErrorCodes
,gg.error.notFoundErrorCodes
, andgg.error.deadlockErrorCodes
properties because Replicat does not respond toREPERROR
andHANDLECOLLISIONS
configuration when the error code is set to zero.
Sample Oracle Database Target Error Codes
gg.error.duplicateErrorCodes=1
gg.error.notFoundErrorCodes=0
gg.error.deadlockErrorCodes=60
Sample MySQL Database Target Error Codes
gg.error.duplicateErrorCodes=1022,1062
gg.error.notFoundErrorCodes=1329
gg.error.deadlockErrorCodes=1213,1614
Parent topic: Setting Up and Running the JDBC Handler
8.2.21.4 Sample Configurations
The following topics contain sample configurations for the databases supported by the JDBC Handler from the Java Adapter properties file.
- Sample Oracle Database Target
- Sample Oracle Database Target with JDBC Metadata Provider
- Sample MySQL Database Target
- Sample MySQL Database Target with JDBC Metadata Provider
Parent topic: Java Database Connectivity
8.2.21.4.1 Sample Oracle Database Target
gg.handlerlist=jdbcwriter
gg.handler.jdbcwriter.type=jdbc
#Handler properties for Oracle database target
gg.handler.jdbcwriter.DriverClass=oracle.jdbc.driver.OracleDriver
gg.handler.jdbcwriter.connectionURL=jdbc:oracle:thin:@<DBServer address>:1521:<database name>
gg.handler.jdbcwriter.userName=<dbuser>
gg.handler.jdbcwriter.password=<dbpassword>
gg.classpath=/path/to/oracle/jdbc/driver/ojdbc5.jar
goldengate.userexit.timestamp=utc
goldengate.userexit.writers=javawriter
javawriter.stats.display=TRUE
javawriter.stats.full=TRUE
gg.log=log4j
gg.log.level=INFO
gg.report.time=30sec
javawriter.bootoptions=-Xmx512m -Xms32m -Djava.class.path=.:ggjava/ggjava.jar:./dirprm
Parent topic: Sample Configurations
8.2.21.4.2 Sample Oracle Database Target with JDBC Metadata Provider
gg.handlerlist=jdbcwriter
gg.handler.jdbcwriter.type=jdbc
#Handler properties for Oracle database target with JDBC Metadata provider
gg.handler.jdbcwriter.DriverClass=oracle.jdbc.driver.OracleDriver
gg.handler.jdbcwriter.connectionURL=jdbc:oracle:thin:@<DBServer address>:1521:<database name>
gg.handler.jdbcwriter.userName=<dbuser>
gg.handler.jdbcwriter.password=<dbpassword>
gg.classpath=/path/to/oracle/jdbc/driver/ojdbc5.jar
#JDBC Metadata provider for Oracle target
gg.mdp.type=jdbc
gg.mdp.ConnectionUrl=jdbc:oracle:thin:@<DBServer address>:1521:<database name>
gg.mdp.DriverClassName=oracle.jdbc.driver.OracleDriver
gg.mdp.UserName=<dbuser>
gg.mdp.Password=<dbpassword>
goldengate.userexit.timestamp=utc
goldengate.userexit.writers=javawriter
javawriter.stats.display=TRUE
javawriter.stats.full=TRUE
gg.log=log4j
gg.log.level=INFO
gg.report.time=30sec
javawriter.bootoptions=-Xmx512m -Xms32m -Djava.class.path=.:ggjava/ggjava.jar:./dirprm
Parent topic: Sample Configurations
8.2.21.4.3 Sample MySQL Database Target
gg.handlerlist=jdbcwriter
gg.handler.jdbcwriter.type=jdbc
#Handler properties for MySQL database target
gg.handler.jdbcwriter.DriverClass=com.mysql.jdbc.Driver
gg.handler.jdbcwriter.connectionURL=jdbc:mysql://<DBServer address>:3306/<database name>
gg.handler.jdbcwriter.userName=<dbuser>
gg.handler.jdbcwriter.password=<dbpassword>
gg.classpath=/path/to/mysql/jdbc/driver/mysql-connector-java-5.1.39-bin.jar
goldengate.userexit.timestamp=utc
goldengate.userexit.writers=javawriter
javawriter.stats.display=TRUE
javawriter.stats.full=TRUE
gg.log=log4j
gg.log.level=INFO
gg.report.time=30sec
javawriter.bootoptions=-Xmx512m -Xms32m -Djava.class.path=.:ggjava/ggjava.jar:./dirprm
Parent topic: Sample Configurations
8.2.21.4.4 Sample MySQL Database Target with JDBC Metadata Provider
gg.handlerlist=jdbcwriter
gg.handler.jdbcwriter.type=jdbc
#Handler properties for MySQL database target with JDBC Metadata provider
gg.handler.jdbcwriter.DriverClass=com.mysql.jdbc.Driver
gg.handler.jdbcwriter.connectionURL=jdbc:mysql://<DBServer address>:3306/<database name>
gg.handler.jdbcwriter.userName=<dbuser>
gg.handler.jdbcwriter.password=<dbpassword>
gg.classpath=/path/to/mysql/jdbc/driver/mysql-connector-java-5.1.39-bin.jar
#JDBC Metadata provider for MySQL target
gg.mdp.type=jdbc
gg.mdp.ConnectionUrl=jdbc:mysql://<DBServer address>:3306/<database name>
gg.mdp.DriverClassName=com.mysql.jdbc.Driver
gg.mdp.UserName=<dbuser>
gg.mdp.Password=<dbpassword>
goldengate.userexit.timestamp=utc
goldengate.userexit.writers=javawriter
javawriter.stats.display=TRUE
javawriter.stats.full=TRUE
gg.log=log4j
gg.log.level=INFO
gg.report.time=30sec
javawriter.bootoptions=-Xmx512m -Xms32m -Djava.class.path=.:ggjava/ggjava.jar:./dirprm
Parent topic: Sample Configurations
8.2.22 MapR
Oracle GoldenGate for Big Data supports MapR through the HDFS Handler. For more information, see HDFS Event Handler.
Parent topic: Target
8.2.23 MongoDB
Learn how to use the MongoDB Handler, which can replicate transactional data from Oracle GoldenGate to target MongoDB and Autonomous JSON databases (AJD and ATP).
- Overview
- MongoDB Wire Protocol
- Supported Target Types
- Detailed Functionality
- Setting Up and Running the MongoDB Handler
- Security and Authentication
- Reviewing Sample Configurations
- MongoDB to AJD/ATP Migration
- MongoDB Handler Client Dependencies
What are the dependencies for the MongoDB Handler to connect to MongoDB databases?
Parent topic: Target
8.2.23.1 Overview
The MongoDB Handler can be used to replicate data from an RDBMS, as well as from document-based databases such as MongoDB or Cassandra, to the supported target databases using the MongoDB wire protocol.
Parent topic: MongoDB
8.2.23.2 MongoDB Wire Protocol
The MongoDB Wire Protocol is a simple socket-based, request-response style protocol. Clients communicate with the database server through a regular TCP/IP socket, see https://docs.mongodb.com/manual/reference/mongodb-wire-protocol/.
Parent topic: MongoDB
8.2.23.3 Supported Target Types
-
MongoDB is an open-source document database that provides high performance, high availability, and automatic scaling, see https://www.mongodb.com/.
-
Oracle Autonomous JSON Database (AJD) is a cloud document database service that makes it simple to develop JSON-centric applications, see Autonomous JSON Database | Oracle.
-
Autonomous Database for transaction processing and mixed workloads (ATP) is a fully automated database service optimized to run transactional, analytical, and batch workloads concurrently, see Autonomous Transaction Processing | Oracle.
- On-premises Oracle Database 21c with Database API for MongoDB is also a supported target. See Installing Database API for MongoDB for any Oracle Database.
Parent topic: MongoDB
8.2.23.4 Detailed Functionality
The MongoDB Handler takes operations from the source trail file and creates corresponding documents in the target MongoDB or Autonomous databases (AJD and ATP).
A record in MongoDB is a Binary JSON (BSON) document, which is a data structure composed of field and value pairs. A BSON data structure is a binary representation of JSON documents. MongoDB documents are similar to JSON objects. The values of fields may include other documents, arrays, and arrays of documents.
A collection is a grouping of MongoDB or AJD/ATP documents and is the equivalent of an RDBMS table. In MongoDB or AJD/ATP databases, a collection holds a set of documents. Collections do not enforce a schema, so MongoDB or AJD/ATP documents within a collection can have different fields.
8.2.23.4.1 Document Key Column
MongoDB or AJD/ATP databases require every document (row) to have a
column named _id
whose value should be unique in a collection
(table). This is similar to a primary key for RDBMS tables. If a document does not
contain a top-level _id
column during an insert, the MongoDB driver
adds this column.
The MongoDB Handler builds custom _id
field values for
every document based on the primary key column values in the trail record. This
custom _id
is built using all the key column values concatenated
by a :
(colon) separator. For example:
KeyColValue1:KeyColValue2:KeyColValue3
The MongoDB Handler enforces uniqueness based on these custom
_id
values. This means that every record in the trail must be
unique based on the primary key columns values. Existence of non-unique records for
the same table results in a MongoDB Handler failure and in Replicat abending with a
duplicate key error.
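For illustration only, assuming a hypothetical source row with key columns CUST_ID and ORDER_ID (the column names and values below are not from the product documentation), the resulting document might look similar to:
{
  "_id" : "1001:2002",
  "CUST_ID" : "1001",
  "ORDER_ID" : "2002",
  "STATUS" : "SHIPPED"
}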
The behavior of the _id
field is:
-
By default, MongoDB creates a unique index on the column during the creation of a collection.
-
It is always the first column in a document.
-
It may contain values of any BSON data type except an array.
Parent topic: Detailed Functionality
8.2.23.4.2 Primary Key Update Operation
MongoDB or AJD/ATP databases do not allow the _id column to be modified. This means a primary key update operation record in the trail needs special
handling. The MongoDB Handler converts a primary key update operation into a combination
of a DELETE
(with old key) and an INSERT
(with new
key). To perform the INSERT
, a complete before-image of the update
operation in trail is recommended. You can generate the trail to populate a complete
before image for update operations by enabling the Oracle GoldenGate
GETUPDATEBEFORES
and NOCOMPRESSUPDATES
parameters,
see Reference for Oracle GoldenGate.
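As a minimal sketch, the parameters can be added to the source Extract parameter file as follows (placement and surrounding parameters are illustrative; see Reference for Oracle GoldenGate for exact usage):
-- Source Extract parameter file (illustrative excerpt)
GETUPDATEBEFORES
NOCOMPRESSUPDATES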
Parent topic: Detailed Functionality
8.2.23.4.3 MongoDB Trail Data Types
The MongoDB Handler supports delivery to the BSON data types as follows:
-
32-bit integer
-
64-bit integer
-
Double
-
Date
-
String
-
Binary data
Parent topic: Detailed Functionality
8.2.23.5 Setting Up and Running the MongoDB Handler
The following topics provide instructions for configuring the MongoDB Handler components and running the handler.
- Classpath Configuration
- MongoDB Handler Configuration
- Using Bulk Write
- Using Write Concern
- Using Three-Part Table Names
- Using Undo Handling
Parent topic: MongoDB
8.2.23.5.1 Classpath Configuration
The MongoDB Java Driver is required for Oracle GoldenGate for Big Data to connect and stream data to MongoDB. If the Oracle GoldenGate for Big Data version is 21.7.0.0.0 and below, then you need to use the 3.x driver (MongoDB Java Driver 3.12.8). If the Oracle GoldenGate for Big Data version is 21.8.0.0.0 and above, then you need to use MongoDB Java Driver 4.6.0. The MongoDB Java Driver is not included in the Oracle GoldenGate for Big Data product. You must download the driver from: mongo java driver.
Select mongo-java-driver and the version to download the recommended driver JAR file.
You must configure the gg.classpath
variable to load the MongoDB
Java Driver JAR at runtime. For example:
gg.classpath=/home/mongodb/mongo-java-driver-3.12.8.jar
Oracle GoldenGate for Big Data supports the MongoDB Decimal
128 data type that was added in MongoDB 3.4. Use of a MongoDB Java Driver prior to
3.12.8 results in a ClassNotFound
exception.
Parent topic: Setting Up and Running the MongoDB Handler
8.2.23.5.2 MongoDB Handler Configuration
You configure the MongoDB Handler operation using the properties file. These properties are located in the Java Adapter properties file (not in the Replicat properties file).
To enable the selection of the MongoDB Handler, you must first configure the handler type by specifying gg.handler.name.type=mongodb
and the other MongoDB properties as follows:
Table 8-32 MongoDB Handler Configuration Properties
Properties | Required/ Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
|
Required |
|
None |
Selects the MongoDB Handler for use with Replicat. |
|
Optional |
|
|
Set to Set to |
|
Optional |
|
None |
Sets the required write concern for all the operations performed by the MongoDB Handler. The property value is in JSON format and can only accept keys as |
|
Optional |
Valid MongoDB client URI |
None |
Sets the MongoDB client URI. A client URI can also be used to set other MongoDB
connection properties, such as authentication and
|
|
Optional |
|
|
When set to If the size of the document exceeds the MongoDB limit, an exception occurs and Replicat abends. |
|
Optional |
|
|
Set to |
|
Optional |
|
|
MongoDB version 3.4 added support for a 128-bit decimal data type called Decimal128.
This data type was needed since Oracle GoldenGate for Big Data supports both integer and
decimal data types that do not fit into a 64-bit Long or Double.
Setting this property to |
|
Optional |
|
|
Set to Note: MongoDB added support for transactions in MongoDB version 4.0. Additionally, the minimum version of the MongoDB client driver is 3.10.1. |
Parent topic: Setting Up and Running the MongoDB Handler
8.2.23.5.3 Using Bulk Write
Bulk write is enabled by default. For better throughput, Oracle recommends that you use bulk write.
You can enable or disable bulk write by using the BulkWrite handler property: gg.handler.handler.BulkWrite=true | false. The MongoDB Handler does not use the gg.handler.handler.mode=op | tx property that is used by Oracle GoldenGate for Big Data.
With bulk write, the MongoDB Handler uses the GROUPTRANSOPS
parameter to retrieve the batch size. The handler converts a batch of trail records to MongoDB documents, which are then written to the database in one request.
Parent topic: Setting Up and Running the MongoDB Handler
8.2.23.5.4 Using Write Concern
Write concern describes the level of acknowledgement that is requested from MongoDB for write operations to a standalone MongoDB, replica sets, and sharded-clusters. With sharded-clusters, Mongo instances pass the write concern on to the shards, see https://docs.mongodb.com/manual/reference/write-concern/.
Use the following configuration:
w: value
wtimeout: number
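For example, a hedged property setting modeled on the commented sample later in this section (the handler name and the values shown are illustrative only):
gg.handler.mongodb.WriteConcern={w: 1, wtimeout: 5000}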
Parent topic: Setting Up and Running the MongoDB Handler
8.2.23.5.5 Using Three-Part Table Names
An Oracle GoldenGate trail may have data for sources that support three-part table names, such as Catalog.Schema.Table
. MongoDB only supports two-part names, such as DBName.Collection
. To support the mapping of source three-part names to MongoDB two-part names, the source Catalog
and Schema
is concatenated with an underscore delimiter to construct the Mongo DBName
.
For example, Catalog.Schema.Table
would become catalog1_schema1.table1
.
Parent topic: Setting Up and Running the MongoDB Handler
8.2.23.5.6 Using Undo Handling
The MongoDB Handler can recover from bulk write errors using a lightweight undo engine. This engine works differently from typical RDBMS undo engines; rather, it makes a best effort to assist you in error recovery. Error recovery works well when there are primary key violations or any other bulk write errors where the MongoDB database provides information about the point of failure through BulkWriteException.
Table 8-33 lists the requirements to make the best use of this functionality.
Table 8-33 Undo Handling Requirements
Operation to Undo | Require Full Before Image in the Trail? |
---|---|
|
No |
|
Yes |
|
No (before image of fields in the |
If there are errors during undo operations, it may not be possible to get the MongoDB collections to a consistent state. In this case, you must manually reconcile the data.
Parent topic: Setting Up and Running the MongoDB Handler
8.2.23.6 Security and Authentication
MongoDB Handler uses Oracle GoldenGate credential store to manage user IDs and their encrypted passwords (together known as credentials) that are used by Oracle GoldenGate processes to interact with the MongoDB database. The credential store eliminates the need to specify user names and clear-text passwords in the Oracle GoldenGate parameter files.
An optional alias can be used in the parameter file instead of the user ID to map to a userid and password pair in the credential store.
In Oracle GoldenGate for Big Data, you specify the alias and domain in the property file and not the actual user ID or password. User credentials are maintained in secure wallet storage.
To set up the credential store (CREDENTIAL STORE) and DBLOGIN, run the following commands in the Admin Client:
adminclient> add credentialstore
adminclient> alter credentialstore add user <userid> password <pwd> alias mongo
Example value of userid: mongodb://myUserAdmin@localhost:27017/admin?replicaSet=rs0
adminclient > dblogin useridalias mongo
To test DBLOGIN, run the following command:
adminclient> list tables tcust*
After the credentials are successfully added to the credential store, specify the alias in the Extract parameter file.
SOURCEDB USERIDALIAS mongo
The MongoDB Handler uses a connection URI to connect to a MongoDB deployment. Authentication and security settings are passed as a query string in the connection URI. See SSL Configuration Setup to configure SSL. For example:
mongodb://<user>@<hostname1>:<port>,<hostname2>:<port>,<hostname3>:<port>/?replicaSet=<replicatName>
To specify TLS/SSL: using "+srv" as in mongodb+srv automatically sets the tls option to true. For example:
mongodb+srv://server.example.com/
To disable TLS, add tls=false in the query string. For example:
mongodb://<user>@<hostname1>:<port>/?replicaSet=<replicatName>&tls=false
To specify Authentication:
mongodb://<user>@<hostname1>:<port>,<hostname2>:<port>,<hostname3>:<port>/?replicaSet=<replicatName>&authSource=admin
mongodb://<user>@<hostname1>:<port>,<hostname2>:<port>,<hostname3>:<port>/?replicaSet=<replicatName>&authSource=admin&authMechanism=GSSAPI
For more information about security and authentication using the connection URL, see the MongoDB documentation.
Parent topic: MongoDB
8.2.23.6.1 SSL Configuration Setup
To configure SSL between the MongoDB instance and the Oracle GoldenGate for Big Data MongoDB Handler, do the following:
Create the CA certificate and private key:
openssl req -passout pass:password -new -x509 -days 3650 -extensions v3_ca -keyout
ca_private.pem -out ca.pem -subj
"/CN=CA/OU=GOLDENGATE/O=ORACLE/L=BANGALORE/ST=KA/C=IN"
Create key and certificate signing requests (CSR) for client and all server nodes
openssl req -newkey rsa:4096 -nodes -out client.csr -keyout client.key -subj
'/CN=certName/OU=OGGBDCLIENT/O=ORACLE/L=BANGALORE/ST=AP/C=IN'
openssl req -newkey rsa:4096 -nodes -out server.csr -keyout server.key -subj
'/CN=slc13auo.us.oracle.com/OU=GOLDENGATE/O=ORACLE/L=BANGALORE/ST=TN/C=IN'
Sign the certificate signing requests with CA
openssl x509 -passin pass:password -sha256 -req -days 365 -in client.csr -CA ca.pem -CAkey
ca_private.pem -CAcreateserial -out client-signed.crtopenssl x509 -passin pass:password -sha256 -req -days 365 -in server.csr -CA ca.pem -CAkey
ca_private.pem -CAcreateserial -out server-signed.crt -extensions v3_req -extfile
<(cat << EOF
[ v3_req ]
subjectAltName = @alt_names
[ alt_names ]
DNS.1 = 127.0.0.1
DNS.2 = localhost
DNS.3 = hostname
EOF
)
cat client-signed.crt client.key > client.pem
cat server-signed.crt server.key > server.pem
Create trust store and keystore
openssl pkcs12 -export -out server.pkcs12 -in server.pem
openssl pkcs12 -export -out client.pkcs12 -in client.pem
bash-4.2$ ls
ca.pem ca_private.pem client.csr client.pem server-signed.crt server.key server.pkcs12
ca.srl client-signed.crt client.key client.pkcs12 server.csr server.pem
Start instances of mongod with the following options:
--tlsMode requireTLS --tlsCertificateKeyFile ../opensslKeys/server.pem --tlsCAFile
../opensslKeys/ca.pem
Add the connection string to the credential store:
alter credentialstore add user
mongodb://myUserAdmin@localhost:27017/admin?ssl=true&tlsCertificateKeyFile=../mcopensslkeys/client.pem&tlsCertificateKeyFilePassword=password&tlsCAFile=../mcopensslkeys/ca.pem
password root alias mongo
Note:
The length of the connectionString should not exceed 256 characters.
For CDC Extract, add the key store and trust store as part of the JVM options.
JVM options
-Xms512m -Xmx4024m -Xss32m -Djavax.net.ssl.trustStore=../mcopensslkeys/server.pkcs12
-Djavax.net.ssl.trustStorePassword=password
-Djavax.net.ssl.keyStore=../mcopensslkeys/client.pkcs12
-Djavax.net.ssl.keyStorePassword=password
Parent topic: Security and Authentication
8.2.23.7 Reviewing Sample Configurations
Basic Configuration
The following is a sample configuration for the MongoDB Handler from the Java adapter properties file:
gg.handlerlist=mongodb
gg.handler.mongodb.type=mongodb
#The following handler properties are optional.
#Refer to the Oracle GoldenGate for BigData documentation
#for details about the configuration.
#gg.handler.mongodb.clientURI=mongodb://localhost:27017/
#gg.handler.mongodb.WriteConcern={w:value, wtimeout: number }
#gg.handler.mongodb.BulkWrite=false
#gg.handler.mongodb.CheckMaxRowSizeLimit=true
goldengate.userexit.timestamp=utc
goldengate.userexit.writers=javawriter
javawriter.stats.display=TRUE
javawriter.stats.full=TRUE
gg.log=log4j
gg.log.level=INFO
gg.report.time=30sec
#Path to MongoDB Java driver.
# maven co-ordinates
# <dependency>
#   <groupId>org.mongodb</groupId>
#   <artifactId>mongo-java-driver</artifactId>
#   <version>3.10.1</version>
# </dependency>
gg.classpath=/path/to/mongodb/java/driver/mongo-java-driver-3.10.1.jar
javawriter.bootoptions=-Xmx512m -Xms32m -Djava.class.path=.:ggjava/ggjava.jar:./dirprm
Oracle or MongoDB Database Source to MongoDB, AJD, and ATP Target
You can map an Oracle or MongoDB Database source table name in uppercase to a table in MongoDB that is in lowercase. This applies to both table names and schemas. There are two methods that you can use:
- Create a Data Pump
-
You can create a data pump before the Replicat, which translates names to lowercase. Then you configure a MongoDB Replicat to use the output from the pump:
extract pmp
exttrail ./dirdat/le
map RAMOWER.EKKN, target "ram"."ekkn";
- Convert When Replicating
-
You can convert table column names to lowercase when replicating to the MongoDB table by adding this parameter to your MongoDB properties file:
gg.schema.normalize=lowercase
Parent topic: MongoDB
8.2.23.8 MongoDB to AJD/ATP Migration
Parent topic: MongoDB
8.2.23.8.1 Overview
Oracle Autonomous JSON Database (AJD) and Autonomous Database for transaction processing (ATP) also use the MongoDB wire protocol to connect. The wire protocol provides the same MongoDB CRUD APIs.
Parent topic: MongoDB to AJD/ATP Migration
8.2.23.8.2 Configuring MongoDB handler to Write to AJD/ATP
The basic configuration remains the same, including the optional properties mentioned in this chapter.
The handler uses the same protocol (the MongoDB wire protocol) and the same driver JAR for Autonomous databases as for MongoDB, performing all replication operations in a target-agnostic manner. The properties can also be used for any of the supported targets.
gg.handlerlist=mongodb
gg.handler.mongodb.type=mongodb
#URL mentioned below should be an AJD instance URL
gg.handler.mongodb.clientURI=mongodb://[username]:[password]@[url]?authSource=$external&authMechanism=PLAIN&ssl=true
#Path to MongoDB Java driver. Maven co-ordinates
# <dependency>
#   <groupId>org.mongodb</groupId>
#   <artifactId>mongo-java-driver</artifactId>
#   <version>3.10.1</version>
# </dependency>
gg.classpath=/path/to/mongodb/java/driver/mongo-java-driver-3.10.1.jar
javawriter.bootoptions=-Xmx512m -Xms32m -Djava.class.path=.:ggjava/ggjava.jar:./dirprm
Parent topic: MongoDB to AJD/ATP Migration
8.2.23.8.3 Steps for Migration
To migrate from MongoDB to AJD, you must first run an initial load. The initial load comprises insert operations only. After running the initial load, start CDC, which keeps the source and target databases synchronized.
- Start CDC extract and generate trails. Do not start replicat to consume these trail files.
- Start Initial load extract and wait for initial load to complete.
- Create a new replicat to consume the initial load trails generated in Step 2. Wait for completion and then stop replicat.
- Create a new Replicat to consume the CDC trails. Configure this Replicat to use
HANDLECOLLISIONS
and then start replicat. - Wait for the CDC replicat (Step 4) to consume all the trails, check replicat lag, and replicat RBA to ensure that the CDC replicat has caught up. At this point, the source and target databases should be in sync.
- Stop the CDC replicat, remove
HANDLECOLLISIONS
parameter, and then restart the CDC replicat.
Parent topic: MongoDB to AJD/ATP Migration
8.2.23.8.4 Best Practices
- Before running CDC, ensure that you run the initial load, which loads the initial data using insert operations.
- Use bulk mode when running the MongoDB handler in order to achieve better throughput.
- Enable handle-collisions during migration to allow Replicat to handle any collision errors automatically.
- To insert missing updates, add the INSERTMISSINGUPDATES property in the .prm file.
Parent topic: MongoDB to AJD/ATP Migration
8.2.23.9 MongoDB Handler Client Dependencies
What are the dependencies for the MongoDB Handler to connect to MongoDB databases?
Note:
If the Oracle GoldenGate for Big Data version is 21.7.0.0.0 and below, the driver version is MongoDB Java Driver 3.12.8. For Oracle GoldenGate for Big Data versions 21.8.0.0.0 and above, the driver version is MongoDB Java Driver 4.6.0.
Parent topic: MongoDB
8.2.23.9.1 MongoDB Java Driver 4.6.0
The required dependent client libraries are:
bson-4.6.0.jar
bson-record-codec-4.6.0.jar
mongodb-driver-core-4.6.0.jar
mongodb-driver-legacy-4.6.0.jar
mongodb-driver-sync-4.6.0.jar
The Maven coordinates of these third-party libraries that are needed to run MongoDB replicat are:
<dependency>
  <groupId>org.mongodb</groupId>
  <artifactId>mongodb-driver-legacy</artifactId>
  <version>4.6.0</version>
</dependency>
<dependency>
  <groupId>org.mongodb</groupId>
  <artifactId>mongodb-driver-sync</artifactId>
  <version>4.6.0</version>
</dependency>
For example, you can download the latest version from Maven Central at: https://central.sonatype.com/artifact/org.mongodb/mongodb-driver-reactivestreams/4.6.0.
Parent topic: MongoDB Handler Client Dependencies
8.2.23.9.2 MongoDB Java Driver 3.12.8
Include the following dependency in the pom.xml file, substituting your correct information:
<!-- https://mvnrepository.com/artifact/org.mongodb/mongo-java-driver -->
<dependency>
  <groupId>org.mongodb</groupId>
  <artifactId>mongo-java-driver</artifactId>
  <version>3.12.8</version>
</dependency>
Parent topic: MongoDB Handler Client Dependencies
8.2.24 Netezza
You can replicate to Netezza by using the Command Event Handler in conjunction with flat files.
Parent topic: Target
8.2.25 OCI Streaming
Oracle Cloud Infrastructure Streaming (OCI Streaming) supports putting messages into and receiving messages from streams using the Kafka client. Therefore, Oracle GoldenGate for Big Data can be used to publish change data capture operation messages to OCI Streaming.
Note:
The Oracle Streaming Service currently does not have a schema registry to which the Kafka Connect Avro converter can connect. Streams to which the Kafka Handlers or the Kafka Connect Handlers publish messages must be pre-created in Oracle Cloud Infrastructure (OCI). Using the Kafka Handler to publish messages to a stream in OSS which does not already exist results in a runtime exception.
- To create a stream in OCI, in the OCI console, select Analytics, click Streaming, and then click Create Stream. Streams are created by default in the DefaultPool.
Figure 8-1 Example Image of Stream Creation
- The Kafka Producer client requires certain Kafka producer configuration
properties to connect to OSS streams. To obtain this connectivity information, click
the pool name in the OSS panel. If
DefaultPool
is used, then click DefaultPool in the OSS panel.Figure 8-2 Example OSS Panel showing DefaultPool
Figure 8-3 Example DefaultPool Properties
- The Kafka Producer also requires an AUTH-TOKEN (password) to connect to
OSS. To obtain an
AUTH-TOKEN
go to the User Details page and generate anAUTH-TOKEN
. AUTH-TOKENs are only viewable at creation and are not subsequently viewable. Ensure that you store theAUTH-TOKEN
in a safe place.Figure 8-4 Auth-Tokens
Once you have these configurations, you can publish messages to OSS.
For example, kafka.prm
file:
replicat kafka
TARGETDB LIBFILE libggjava.so SET property=dirprm/kafka.properties
map *.*, target qatarget.*;
kafka.properties file:
gg.log=log4j
gg.log.level=debug
gg.report.time=30sec
gg.handlerlist=kafkahandler
gg.handler.kafkahandler.type=kafka
gg.handler.kafkahandler.mode=op
gg.handler.kafkahandler.format=json
gg.handler.kafkahandler.kafkaProducerConfigFile=oci_kafka.properties
# The following dictates how we'll map the workload to the target OSS streams
gg.handler.kafkahandler.topicMappingTemplate=OGGBD-191002
gg.handler.kafkahandler.keyMappingTemplate=${tableName}
gg.classpath=/home/opc/dependencyDownloader/dependencies/kafka_2.2.0/*
jvm.bootoptions=-Xmx512m -Xms32m -Djava.class.path=ggjava/ggjava.jar:dirprm
Example Kafka Producer Properties
(oci_kafka.properties
)
bootstrap.servers=cell-1.streaming.us-phoenix-1.oci.oraclecloud.com:9092
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
key.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="paasdevgg/oracleidentitycloudservice/user.name@oracle.com/ocid1.streampool.oc1.phx.amaaaaaa3p5c3vqa4hfyl7uv465pay4audmoajughhxlsgj7afc2an5u3xaq" password="YOUR-AUTH-TOKEN";
To view the messages, click Load Messages in OSS.
Figure 8-5 Viewing the Messages
Parent topic: Target
8.2.26 Oracle NoSQL
The Oracle NoSQL Handler can replicate transactional data from Oracle GoldenGate to a target Oracle NoSQL Database.
This chapter describes how to use the Oracle NoSQL Handler.
- Overview
- On-Premise Connectivity
- OCI Cloud Connectivity
- Oracle NoSQL Types
- Oracle NoSQL Handler Configuration
- Performance Considerations
- Operation Processing Support
- Column Processing
- Table Check and Reconciliation Process
- Oracle NoSQL SDK Dependencies
Parent topic: Target
8.2.26.1 Overview
Oracle NoSQL Database is a NoSQL-type distributed key-value database. It provides a powerful and flexible transaction model that greatly simplifies the process of developing a NoSQL-based application. It scales horizontally with high availability and transparent load balancing even when dynamically adding new capacity.
Starting from the Oracle GoldenGate for Big Data 21.3.0.0.0 release, the Oracle NoSQL Handler uses the Oracle NoSQL Java SDK to communicate with Oracle NoSQL. The Oracle NoSQL Java SDK supports both on-premise and OCI cloud instances of Oracle NoSQL. Make sure to read the documentation carefully, because on-premise and OCI cloud instances of Oracle NoSQL each require specialized configuration parameters and possibly some setup.
For more information about Oracle NoSQL Java SDK, see Oracle NoSQL SDK for Java.
Parent topic: Oracle NoSQL
8.2.26.2 On-Premise Connectivity
The Oracle NoSQL Java SDK requires that connectivity route through the Oracle NoSQL Database Proxy. The Oracle NoSQL Database Proxy is a separate process which enables the http/https interface of Oracle NoSQL. The Oracle NoSQL Java SDK uses the http/https interface. Oracle GoldenGate effectively communicates with the on-premise Oracle NoSQL instance through the Oracle NoSQL Database Proxy process.
For more information on the Oracle NoSQL Database Proxy including setup instructions, see Connecting to the Oracle NoSQL Database On-premise.
Connectivity to the Oracle NoSQL Database Proxy requires mutual authentication whereby the client authenticates the server and the server authenticates the client.
Parent topic: Oracle NoSQL
8.2.26.2.1 Server Authentication
Upon initial connection, the Oracle NoSQL Database Proxy process passes a certificate to the Oracle NoSQL Java SDK (Oracle NoSQL Handler). The Oracle NoSQL Java SDK then verifies the certificate against a certificate in a configured trust store. After the certificate received from the proxy has been verified against the trust store, the client has authenticated the server.
Parent topic: On-Premise Connectivity
8.2.26.2.2 Client Authentication
Upon initial connection, the Oracle NoSQL Java SDK (Oracle NoSQL Handler) passes credentials (username and password) to the Oracle NoSQL Database Proxy. These credentials are used by the on-premise NoSQL instance to authenticate the client.
Parent topic: On-Premise Connectivity
8.2.26.2.3 Sample On-Premise Oracle NoSQL Configuration
gg.handlerlist=nosql
gg.handler.nosql.type=nosql
gg.handler.nosql.nosqlURL=https://localhost:5555
gg.handler.nosql.ddlHandling=CREATE,ADD,DROP
gg.handler.nosql.interactiveMode=false
#Client Credentials
gg.handler.nosql.username={your username}
gg.handler.nosql.password={your password}
gg.handler.nosql.mode=op
# Set the gg.classpath to pick up the Oracle NoSQL Java SDK
gg.classpath=/path/to/the/SDK/*
# Set the -D options in the bootoptions to resolve the trust store location and password
jvm.bootoptions=-Xmx512m -Xms32m -Djavax.net.ssl.trustStore=/usr/nosql/kv-20.3.17/USER/security/driver.trust -Djavax.net.ssl.trustStorePassword={your trust store password}
Parent topic: On-Premise Connectivity
8.2.26.3 OCI Cloud Connectivity
- Server Authentication
- Client Authentication
- Sample Cloud Oracle NoSQL Configuration
- Sample OCI Configuration file
Parent topic: Oracle NoSQL
8.2.26.3.1 Server Authentication
Parent topic: OCI Cloud Connectivity
8.2.26.3.2 Client Authentication
Upon initial connection, the fingerprint
,
keyfile
, and pass_phrase
properties are used
for the server to authenticate the client.
Parent topic: OCI Cloud Connectivity
8.2.26.3.3 Sample Cloud Oracle NoSQL Configuration
gg.handlerlist=nosql
gg.handler.nosql.type=nosqlNoSQLSdkHandler
#gg.handler.nosql.type=nosql
gg.handler.nosql.ddlHandling=CREATE,ADD,DROP
gg.handler.nosql.interactiveMode=false
gg.handler.nosql.region=us-sanjose-1
gg.handler.nosql.configFilePath=/path/to/the/OCI/conf/file/nosql.conf
gg.handler.nosql.compartmentId=ocid1.compartment.oc1..aaaaaaaae2aedhka4jlb3h6zhpaonaoktmg53adwkhwjflvv6hihz5cvwfeq
gg.handler.nosql.storageGb=10
gg.handler.nosql.readUnits=50
gg.handler.nosql.writeUnits=50
gg.handler.nosql.mode=op
# Set the gg.classpath to pick up the Oracle NoSQL Java SDK
gg.classpath=/path/to/the/SDK/*
Parent topic: OCI Cloud Connectivity
8.2.26.3.4 Sample OCI Configuration file
[DEFAULT]
user=ocid1.user.oc1..aaaaaaaaammf6u5h4wsmiuk52us5vnqhnnyzexkn56cqijlyo4vaao2jzi3a
fingerprint=77:53:2c:e5:31:81:48:c3:3d:af:60:cf:e0:42:5c:7f
tenancy=ocid1.tenancy.oc1..aaaaaaaattuxbj75pnn3nksvzyidshdbrfmmeflv4kkemajroz2thvca4kba
region=us-sanjose-1
key_file=/home/username/OracleNoSQL/lastname.firstname-04-13-18-51.pem
openssl rsa -aes256 -in in.pem -out out.pem
-
tenancy
-
The Tenancy ID is displayed at the bottom of the Console page.
-
region
-
The region is displayed with the header session drop-down menu in the Console.
-
fingerprint
-
To generate the fingerprint, use the How to Get the Key's Fingerprint instructions at:
https://docs.cloud.oracle.com/iaas/Content/API/Concepts/apisigningkey.htm
-
key_file
-
You need to share the public and private key to establish a connection with Oracle Cloud Infrastructure. To generate the keys, use the How to Generate an API Signing Keyat:
https://docs.cloud.oracle.com/iaas/Content/API/Concepts/apisigningkey.htm
-
pass_phrase
- This is an optional property. It is used to configure the
passphrase if the private key in the pem file is protected with a passphase.
The following openssl command can be used to take an unprotected private key
pem file and add a passphrase.
The following command prompts the user for the passphrase:
openssl rsa -aes256 -in in.pem -out out.pem
Parent topic: OCI Cloud Connectivity
8.2.26.4 Oracle NoSQL Types
Oracle NoSQL provides a number of column data types and most of these data types are supported by the Oracle NoSQL Handler. A data type conversion from the column value in the trail file to the corresponding Java type representing the Oracle NoSQL column type in the Oracle NoSQL Handler is required.
The Oracle NoSQL Handler does not support the Array, Map, and Record data types by default. To support them, you can implement a custom data converter that overrides the default data type conversion logic with your own custom logic for your use case. Contact Oracle Support for guidance.
The following Oracle NoSQL data types are supported:
- Binary
- Boolean
- Double
- Integer
- Number
- String
- Timestamp
The following Oracle NoSQL data types are not supported:
- Array
- Map
- Record
Parent topic: Oracle NoSQL
8.2.26.5 Oracle NoSQL Handler Configuration
Properties | Required/Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
|
Required |
nosql |
None |
Selects the Oracle NoSQL Handler. |
|
Optional |
|
|
When set to true, the NoSQL Handler processes one operation at a time. When set to false, the NoSQL Handler processes the batch of operations at transaction commit. Batching has limitations: batched operations must be separated by table, and all batch operations for a table must have a common shared key(s).
|
|
Optional |
|
None |
Configure the Oracle NoSQL
Handler for the DDL functionality to provide. Options include
CREATE , ADD , and DROP .
|
|
Optional |
Positive Integer |
3 |
The number of retries on any read or write exception that the Oracle NoSQL Handler encounters. |
|
Optional |
Positive Integer |
30000 |
The maximum time in milliseconds for a NoSQL request to wait for a response. If the timeout is exceeded, the call is assumed to have failed. |
|
Optional |
A valid URL including protocol. |
None |
On-premise only. Used to set the connectivity URL for the NoSQL proxy instance. |
|
Optional |
String |
None |
On-premise only. Used to set the username for connectivity to an on-premise NoSQL instance through the NoSQL proxy process. |
|
Optional |
String |
None |
On-premise only. Used to set the password for connectivity to an on-premise NoSQL instance through the NoSQL proxy process. |
|
Optional |
The OCID of an Oracle NoSQL compartment on OCI. |
None |
Cloud only. The OCID of an Oracle NoSQL cloud instance compartment on OCI. |
|
Optional |
Legal Oracle OCI region name. |
None |
Cloud only. The OCI region name of an Oracle NoSQL cloud instance. |
|
Optional |
A legal path and file name. |
None |
Cloud only. Set the path and file name of the config file containing the Oracle OCI information on the user, fingerprint, tenancy, region, and key-file. |
|
Optional |
None |
"DEFAULT" |
Cloud only. Sets the named
sub-section in the gg.handler.name.configFilePath . OCI config files
can contain multiple entries and the naming specifies which entry to use.
|
|
Optional |
Positive Integer |
10 |
Cloud only. Oracle NoSQL tables created in a cloud instance must be configured with a maximum storage size. This sets that configuration for tables created by the Oracle NoSQL Handler. |
|
Optional |
Positive Integer |
50 |
Cloud only. Oracle NoSQL tables created in an OCI cloud instance must be configured with read units which is the maximum read throughput. Each unit is 1KB per second. |
|
Optional |
Positive Integer |
50 |
Cloud only. Oracle NoSQL tables created in an OCI cloud instance must be configured with write units which is the maximum write throughput. Each unit is 1KB per second. |
|
Optional |
|
|
Set to true if
the desired behavior of the handler is to abend when a column is found in the source
table but the column does not exist in the target NoSQL table. Set to
false if the desired behavior is for the handler to ignore
columns found in the source table for which no corresponding column exists in the
target NoSQL table.
|
|
Optional |
The fully qualified data converter class name. |
The default data converter. |
The custom data converter can be
implemented to override the default data conversion logic to support your specific
use case. Must be included in the gg.classpath to be used.
|
gg.handler.name.timestampPattern |
Optional | A legal pattern for parsing timestamps as they exist in the source trail file. |
yyyy-MM-dd HH:mm:ss |
This feature can be used to parse source field data into timestamps for timestamp target fields. The pattern needs to follow the Java convention for timestamp patterns and source data needs to conform to the pattern. |
gg.handler.name.proxyServer | Optional | The proxy server host name. | None | Used to configure the forwarding proxy server host name for connectivity of on-premise Oracle GoldenGate for Big Data to Oracle Cloud Infrastructure (OCI) cloud instances of Oracle NoSQL. You must use at least version 5.2.27 of the Oracle NoSQL Java SDK. |
gg.handler.name.proxyPort | Optional | Positive Integer | 80 | Used to configure the forwarding proxy server port number for connectivity of on-premise Oracle GoldenGate for Big Data to OCI cloud instances of Oracle NoSQL. You must use at least version 5.2.27 of the Oracle NoSQL Java SDK. |
gg.handler.name.proxyUsername | Optional | String | None | Used to configure the username of the forwarding proxy for connectivity of on-premise Oracle GoldenGate for Big Data to OCI cloud instances of Oracle NoSQL, if applicable. Most proxy servers do not require credentials. You must use at least version 5.2.27 of the Oracle NoSQL Java SDK. |
gg.handler.name.proxyPassword | Optional | String | None | Used to configure the password of the forwarding proxy for connectivity of on-premise Oracle GoldenGate for Big Data to OCI cloud instances of Oracle NoSQL, if applicable. Most proxy servers do not require credentials. You must use at least version 5.2.27 of the Oracle NoSQL Java SDK. |
Parent topic: Oracle NoSQL
8.2.26.6 Performance Considerations
When the NoSQL Handler is processing in interactive mode, operations are processed one at a time as they are received by the NoSQL Handler.
The NoSQL Handler processes in bulk mode if the following parameter is set:
gg.handler.name.interactiveMode=false
In bulk mode, batched operations have the following limitations:
- Operations must be for the same NoSQL table.
- Operations must be in the same NoSQL shard (have the same shard key or shard key values).
- Only one operation per row exists in the batch.
When interactiveMode is set to false, the NoSQL Handler groups operations by table and shard key, and deduplicates operations for the same row.
An example of Deduplication: If there is an insert and an update for a row, then only the update operation is processed if the operations fall within the same transaction or replicat grouped transaction.
The NoSQL Handler may provide better performance when interactive mode is set to false. However, for bulk mode to provide better performance, operations need to be groupable by the above criteria. If operations are not groupable by the above criteria, or if bulk mode only produces very small batches, then bulk mode may not provide much or any improvement in performance.
Parent topic: Oracle NoSQL
8.2.26.7 Operation Processing Support
The Oracle NoSQL Handler moves operations to Oracle NoSQL using a synchronous API. The insert, update, and delete operations are processed differently in Oracle NoSQL databases than in a traditional RDBMS:
- insert: If the row does not exist in your database, then an insert operation is processed as an insert. If the row exists, then an insert operation is processed as an update.
- update: If a row does not exist in your database, then an update operation is processed as an insert. If the row exists, then an update operation is processed as update.
- delete: If the row does not exist in your database, then a delete operation has no effect. If the row exists, then a delete operation is processed as a delete.
The state of the data in Oracle NoSQL databases is idempotent. You can replay the source trail files or replay sections of the trail files. Ultimately, the state of an Oracle NoSQL database is the same regardless of the number of times the trail data was written into Oracle NoSQL.
Primary key values for a row in Oracle NoSQL databases are immutable. An update operation that changes any primary key value for a Oracle NoSQL row must be treated as a delete and insert. The Oracle NoSQL Handler can process update operations that result in the change of a primary key in an Oracle NoSQL database only as a delete and insert. To successfully process this operation, the source trail file must contain the complete before and after change data images for all columns.
Parent topic: Oracle NoSQL
8.2.26.8 Column Processing
Drop Column Functionality
Caution:
Dropping a column is potentially dangerous because it permanently removes data from an Oracle NoSQL Database. Carefully consider your use case before enabling column dropping.
Primary key columns cannot be dropped.
Column name changes are not handled well because there is no DDL-processing. The Oracle NoSQL Handler can handle any case change for the column name. A column name change event on the source database appears to the handler like dropping an existing column and adding a new column.
Parent topic: Oracle NoSQL
8.2.26.9 Table Check and Reconciliation Process
- The Oracle NoSQL Handler interrogates the target Oracle NoSQL database for the table definition. If the table does not exist, the Oracle NoSQL Handler does one of two things. If gg.handler.name.ddlHandling includes CREATE, then a table is created in the database. Otherwise, the process abends and a message is logged that tells you the table that does not exist.
- If the table exists in the Oracle NoSQL database, then the Oracle
NoSQL Handler performs a reconciliation between the table definition from the
source trail file and the table definition in the database. This reconciliation
process searches for columns that exist in the source table definition and not
in the corresponding database table definition. If it locates columns fitting
this criteria and the
gg.handler.name.ddlHandling property
includes ADD, then the Oracle NoSQL Handler alters the target table in the database to add the new columns. Otherwise, the columns missing in the target will not be added. If the property gg.handler.name.abendOnUnmappedColumns is set to true, then the NoSQL Handler will abend. Otherwise, if the configuration property gg.handler.name.abendOnUnmappedColumns is set to false, then the NoSQL Handler will continue processing and will not replicate data for the columns that exist in the source table but do not exist in the target NoSQL table (a configuration sketch follows this list). - The reconciliation process searches for columns that exist in the
target Oracle NoSQL and do not exist in the source table definition. If it
locates columns fitting this criteria and the
gg.handler.name.ddlHandling
property includes DROP, then the Oracle NoSQL Handler alters the target table in Oracle NoSQL to drop these columns. Otherwise, those columns are ignored.
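A minimal configuration sketch combining the properties referenced above (the handler name nosql matches the sample configurations in this chapter; the values shown are only one possible choice):
gg.handler.nosql.ddlHandling=CREATE,ADD,DROP
gg.handler.nosql.abendOnUnmappedColumns=false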
Parent topic: Oracle NoSQL
8.2.26.9.1 Full Image Data Requirements
In Oracle NoSQL, update operations perform a complete reinsertion of the data for the entire row. This Oracle NoSQL feature improves ingest performance, but in turn levies a critical requirement. Updates must include data for all columns, also known as full image updates. Partial image updates are not supported (updates with just the primary key information and data for the columns that changed). Using the Oracle NoSQL Handler with partial image update information results in incomplete data in the target NoSQL table.
Parent topic: Table Check and Reconciliation Process
8.2.26.10 Oracle NoSQL SDK Dependencies
The maven coordinates are as follows:
Maven groupId: com.oracle.nosql.sdk
Maven artifactId: nosqldriver
Version: 5.2.27
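Based on these coordinates, the dependency can be declared in a Maven pom.xml as follows (a minimal sketch):
<dependency>
  <groupId>com.oracle.nosql.sdk</groupId>
  <artifactId>nosqldriver</artifactId>
  <version>5.2.27</version>
</dependency>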
Parent topic: Oracle NoSQL
8.2.26.10.1 Oracle NoSQL SDK Dependencies 5.2.27
bcpkix-jdk15on-1.68.jar
bcprov-jdk15on-1.68.jar
jackson-core-2.12.1.jar
netty-buffer-4.1.63.Final.jar
netty-codec-4.1.63.Final.jar
netty-codec-http-4.1.63.Final.jar
netty-codec-socks-4.1.63.Final.jar
netty-common-4.1.63.Final.jar
netty-handler-4.1.63.Final.jar
netty-handler-proxy-4.1.63.Final.jar
netty-resolver-4.1.63.Final.jar
netty-transport-4.1.63.Final.jar
nosqldriver-5.2.27.jar
Parent topic: Oracle NoSQL SDK Dependencies
8.2.27 OCI Autonomous Data Warehouse
Oracle Autonomous Data Warehouse (ADW) is a fully managed database tuned and optimized for data warehouse workloads with the market-leading performance of Oracle Database.
- Detailed Functionality
The ADW Event handler is used as a downstream Event handler connected to the output of the OCI Object Storage Event handler. The OCI Event handler loads files generated by the File Writer Handler into Oracle OCI Object storage. All the SQL operations are performed in batches providing better throughput. - ADW Database Credential to Access OCI ObjectStore File
- ADW Database User Privileges
ADW databases come with a predefined database role namedDWROLE
. If the ADW 'admin' user is not being used, then the database user needs to be granted the roleDWROLE
. - Unsupported Operations/ Limitations
- Troubleshooting and Diagnostics
- Classpath
ADW apply relies on the upstream File Writer handler and the OCI Event handler. Include the required jars needed to run the OCI Event handler ingg.classpath
. - Configuration
Parent topic: Target
8.2.27.1 Detailed Functionality
The ADW Event handler is used as a downstream Event handler connected to the output of the OCI Object Storage Event handler. The OCI Event handler loads files generated by the File Writer Handler into Oracle OCI Object storage. All the SQL operations are performed in batches providing better throughput.
Parent topic: OCI Autonomous Data Warehouse
8.2.27.2 ADW Database Credential to Access OCI ObjectStore File
To access the OCI ObjectStore File:
- A PL/SQL procedure needs to be run to create a credential to access Oracle Cloud Infrastructure (OCI) Object store files.
- An OCI authentication token needs to be generated under User settings
from the OCI console. See
CREATE_CREDENTIAL
in Using Oracle Autonomous Data WareHouse on Shared Exadata Infrastructure. For example:BEGIN DBMS_CLOUD.create_credential ( credential_name => 'OGGBD-CREDENTIAL', username => 'oci-user', password => 'oci-user'); END; /
- The credential name can be configured using the following property: gg.eventhandler.adw.objectStoreCredential. For example: gg.eventhandler.adw.objectStoreCredential=OGGBD-CREDENTIAL.
Parent topic: OCI Autonomous Data Warehouse
8.2.27.3 ADW Database User Privileges
ADW databases come with a predefined database role named DWROLE. If the ADW 'admin' user is not being used, then the database user needs to be granted the role DWROLE.
This role provides the privileges required for data warehouse operations. For example, the following command grants DWROLE to the user dbuser-1:
GRANT DWROLE TO dbuser-1;
Note:
Ensure that you do not use the Oracle-created database user ggadmin for ADW replication, because this user lacks the INHERIT privilege.
Parent topic: OCI Autonomous Data Warehouse
8.2.27.4 Unsupported Operations/ Limitations
- DDL changes are not supported.
- Replication of Oracle Object data types is not supported.
- If the GoldenGate trail is generated by Oracle Integrated capture, then for the UPDATE operations on the source LOB column, only the changed portion of the LOB is written to the trail file. Oracle GoldenGate for Big Data Autonomous Data Warehouse (ADW) apply doesn't support replication of partial LOB columns in the trail file.
Parent topic: OCI Autonomous Data Warehouse
8.2.27.5 Troubleshooting and Diagnostics
- Connectivity Issues to ADW
- Validate JDBC connection URL, user name and password.
- Check if http/https proxy is enabled. See ADW proxy configuration: Prepare for Oracle Call Interface (OCI), ODBC, and JDBC OCI Connections in Using Oracle Autonomous Data Warehouse on Shared Exadata Infrastructure.
- DDL not applied on the target table: The ADW handler will ignore DDL.
- Target table existence: It is expected that the ADW target table exists before starting the apply process. Target tables need to be designed with appropriate primary keys, indexes, and partitions. Approximations based on the column metadata in the trail file may not always be correct. Therefore, replicat will ABEND if the target table is missing.
- Diagnostic throughput information on the apply process is logged into
the handler log file.
For example:
File Writer finalized 29525834 records (rate: 31714) (start time: 2020-02-10 01:25:32.000579) (end time: 2020-02-10 01:41:03.000606).
In this sample log message:
- This message provides details about the end-to-end throughput of the File Writer handler and the downstream event handlers (OCI Event handler and ADW event handler).
- The throughput rate also takes into account the wait-times incurred before rolling over files.
- The throughput rate also takes into account the time taken by the OCI event handler and the ADW event handler to process operations.
- The above example indicates that 29525834 operations were finalized at the rate of 31714 operations per second between start time [2020-02-10 01:25:32.000579] and end time [2020-02-10 01:41:03.000606].
INFO 2019-10-01 00:36:49.000490 [pool-8-thread-1] – Begin DWH Apply stage and load statistics ********START*********************************
INFO 2019-10-01 00:36:49.000490 [pool-8-thread-1] - Time spent for staging process [2074 ms]
INFO 2019-10-01 00:36:49.000490 [pool-8-thread-1] - Time spent for merge process [992550 ms]
INFO 2019-10-01 00:36:49.000490 [pool-8-thread-1] - [31195516] operations processed, rate[31,364] operations/sec.
INFO 2019-10-01 00:36:49.000490 [pool-8-thread-1] – End DWH Apply stage and load statistics ********END***********************************
INFO 2019-10-01 00:37:18.000230 [pool-6-thread-1] – Begin OCI Event handler upload statistics ********START*********************************
INFO 2019-10-01 00:37:18.000230 [pool-6-thread-1] - Time spent loading files into ObjectStore [71789 ms]
INFO 2019-10-01 00:37:18.000230 [pool-6-thread-1] - [31195516] operations processed, rate[434,545] operations/sec.
INFO 2019-10-01 00:37:18.000230 [pool-6-thread-1] – End OCI Event handler upload statistics ********END***********************************
In this example:
ADW Event handler throughput:
- In the above log message, the statistics for the ADW event handler are reported as DWH Apply stage and load statistics. ADW is classified as a Data Warehouse (DWH), hence the name.
- Here 31195516 operations from the source trail file were applied to ADW database at the rate of 31364 operations per second.
- ADW uses stage and merge. The time spent on staging is 2074 milliseconds and the time spent on executing merge SQL is 992550 milliseconds.
- In the above log message, the statistics for the OCI event handler are reported as OCI Event handler upload statistics.
- Here 31195516 operations from the source trail file were uploaded to the OCI object store at the rate of 434545 operations per second.
- Errors due to the ADW credential missing grants to read OCI object store files: A SQL exception indicating authorization failure is logged in the handler log file. For example:
java.sql.SQLException: ORA-20401: Authorization failed for URI - https://objectstorage.us-ashburn-1.oraclecloud.com/n/some_namespace/b/some_bucket/o/ADMIN.NLS_AllTypes/ADMIN.NLS_AllTypes_2019-12-16_11-44-01.237.avro
- Errors in file format/column data:
In case the ADW Event handler is unable to read data from the external staging table due to column data errors, the Oracle GoldenGate for Big Data handler log file provides diagnostic information to debug the issue.
The following details are available in the log file:
JOB ID
SID
SERIAL #
ROWS_LOADED
START_TIME
UPDATE_TIME
STATUS
TABLE_NAME
OWNER_NAME
FILE_URI_LIST
LOGFILE_TABLE
BADFILE_TABLE
The contents of the LOGFILE_TABLE and BADFILE_TABLE should indicate the specific record, the column(s) in the record that are in error, and the cause of the error. This information is also queried automatically by the ADW Event handler and logged into the OGGBD FW handler log file. Based on the root cause of the error, customers can take action. In many cases, customers would have to modify the target table definition based on the source column data types and restart replicat. In other cases, customers may also want to modify the mapping in the replicat prm file. For this, Oracle recommends that they re-position replicat to start from the beginning. - Any other SQL Errors:
In case there are any errors while executing any SQL, the entire SQL statement along with the bind parameter values are logged into the OGGBD handler log file.
- Co-existence of the components:
The location/region of the machine where the replicat process is running, the OCI Object storage bucket region, and the ADW region impact the overall throughput of the apply process. The data flow is as follows: GoldenGate -> OCI Object store -> ADW. For best throughput, the components need to be located as close to each other as possible.
- Debugging row count mismatch on the target table
For better throughput, the ADW event handler does not validate the row counts modified on the target table. Row count validation can be enabled using the Java System property disable.row.count.validation. To enable row count validation, provide this property in the jvm.bootoptions as follows:
jvm.bootoptions=-Xmx512m -Xms32m -Djava.class.path=.:ggjava/ggjava.jar:./dirprm -Ddisable.row.count.validation=false
- Replicat ABEND due to partial LOB records in the trail file:
Oracle GoldenGate for Big Data ADW apply does not support replication of partial LOB. The trail file needs to be regenerated by Oracle Integrated capture using the TRANLOGOPTIONS FETCHPARTIALLOB option in the extract parameter file. - Throughput gain with uncompressed UPDATE trails:
If the source trail files contain the full image (all the column values of the respective table) of the row being updated, then you can include the JVM boot option -Dcompressed.update=false in the configuration property jvm.bootoptions (see the example after this list). For certain workloads and ADW instance shapes, this configuration may provide better throughput. You may need to test the throughput gain in your environment.
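For example, the following illustrative jvm.bootoptions value adds this option; the memory settings and class path shown are placeholders that may need to be adapted to your environment:
jvm.bootoptions=-Xmx512m -Xms32m -Djava.class.path=.:ggjava/ggjava.jar:./dirprm -Dcompressed.update=false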
Parent topic: OCI Autonomous Data Warehouse
8.2.27.6 Classpath
ADW apply relies on the upstream File Writer handler and the OCI Event
handler. Include the required jars needed to run the OCI Event handler in
gg.classpath
.
ADW Event handler uses the Oracle JDBC driver and its dependencies. The Autonomous Data Warehouse JDBC driver and other required dependencies are packaged with Oracle GoldenGate for Big Data.
For example:
gg.classpath=./oci-java-sdk/lib/*:./oci-java-sdk/third-party/lib/*
Parent topic: OCI Autonomous Data Warehouse
8.2.27.7 Configuration
- Automatic Configuration
Autonomous Data Warehouse (ADW) replication involves configuring multiple components, such as the file writer handler, OCI event handler, and ADW event handler. - File Writer Handler Configuration
The file writer handler name is pre-set to the value adw. The following is an example of editing a property of the file writer handler: gg.handler.adw.pathMappingTemplate=./dirout
- OCI Event Handler Configuration
The OCI event handler name is pre-set to the value 'oci'. - ADW Event Handler Configuration
The ADW event handler name is pre-set to the value adw. - INSERTALLRECORDS Support
- End-to-End Configuration
- Compressed Update Handling
Parent topic: OCI Autonomous Data Warehouse
8.2.27.7.1 Automatic Configuration
Autonomous Data Warehouse (ADW) replication involves configuring multiple components, such as the file writer handler, OCI event handler, and ADW event handler.
The Automatic Configuration functionality helps to auto configure these components so that the user configuration is minimal. The properties modified by auto configuration will also be logged in the handler log file.
To enable auto configuration to replicate to the ADW target, set the parameter gg.target=adw.
Property | Required/Optional | Legal Values | Default | Explanation
---|---|---|---|---
gg.target | Required | adw | None | Enables replication to the ADW target.
When replicating to the ADW target, customization of the OCI event handler name and ADW event handler name is not allowed.
Parent topic: Configuration
8.2.27.7.2 File Writer Handler Configuration
The file writer handler name is pre-set to the value adw. The following is an example of editing a property of the file writer handler:
gg.handler.adw.pathMappingTemplate=./dirout
Parent topic: Configuration
8.2.27.7.3 OCI Event Handler Configuration
OCI event handler name is pre-set to the value ‘oci’.
The following is an example to edit a property of the OCI event handler:
gg.eventhandler.oci.profile=DEFAULT
Parent topic: Configuration
8.2.27.7.4 ADW Event Handler Configuration
ADW event handler name is pre-set to the value adw
.
The following are the ADW event handler configurations:
Property | Required/Optional | Legal Values | Default | Explanation
---|---|---|---|---
gg.eventhandler.adw.connectionURL | Required | ADW JDBC connection URL | None | Sets the ADW JDBC connection URL. Example: jdbc:oracle:thin:@adw20190410ns_medium?TNS_ADMIN=/home/sanav/projects/adw/wallet
gg.eventhandler.adw.UserName | Required | JDBC User name | None | Sets the ADW database user name.
gg.eventhandler.adw.Password | Required | JDBC Password | None | Sets the ADW database password.
gg.eventhandler.adw.maxStatements | Optional | Integer value between 1 and 250. | 250 | Use this parameter to control the number of prepared SQL statements that can be used.
gg.eventhandler.adw.maxConnnections | Optional | Integer value. | 10 | Use this parameter to control the number of concurrent JDBC database connections to the target ADW database.
gg.eventhandler.adw.dropStagingTablesOnShutdown | Optional | true or false | false | If set to true, the temporary staging tables created by the ADW event handler are dropped on replicat graceful stop.
gg.eventhandler.adw.objectStoreCredential | Required | A database credential name. | None | ADW database credential to access OCI object-store files.
gg.initialLoad | Optional | true or false | false | If set to true, initial load mode is enabled. See INSERTALLRECORDS Support.
gg.operation.aggregator.validate.keyupdate | Optional | true or false | false | If set to true, Operation Aggregator validates key update operations (optype 115) and corrects them to normal updates if no key values have changed. Compressed key update operations do not qualify for merge.
gg.compressed.update | Optional | true or false | true | If set to true, this indicates that the source trail files contain compressed update operations. If set to false, the source trail files are expected to contain uncompressed update operations.
gg.eventhandler.adw.connectionRetries | Optional | Integer Value | 3 | Specifies the number of times connections to the target data warehouse will be retried.
gg.eventhandler.adw.connectionRetryIntervalSeconds | Optional | Integer Value | 30 | Specifies the delay in seconds between connection retry attempts.
Parent topic: Configuration
8.2.27.7.5 INSERTALLRECORDS Support
Stage and merge targets support the INSERTALLRECORDS parameter. See INSERTALLRECORDS in Reference for Oracle GoldenGate. Set the INSERTALLRECORDS parameter in the Replicat parameter file (.prm).
Setting this property directs the Replicat process to use bulk insert operations to load operation data into the target table.
You can tune the batch size of bulk inserts using the File Writer property gg.handler.adw.maxFileSize. The default value is set to 1GB.
The frequency of bulk inserts can be tuned using the File Writer property gg.handler.adw.fileRollInterval; the default value is set to 3m (three minutes).
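For example, the following illustrative Replicat parameter file enables this behavior; the Replicat name, properties file path, and MAP statement are placeholders that must match your environment:
REPLICAT radw
TARGETDB LIBFILE libggjava.so SET property=dirprm/adw.props
INSERTALLRECORDS
MAP SRC.*, TARGET TGT.*;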
Parent topic: Configuration
8.2.27.7.6 End-to-End Configuration
- In an Oracle GoldenGate Classic install:
<oggbd_install_dir>/AdapterExamples/big-data/adw-via-oci/adw.props
. - In an Oracle GoldenGate Microservices install:
<oggbd_install_dir>/opt/AdapterExamples/big-data/adw-via-oci/adw.props
.
# Configuration to load GoldenGate trail operation records
# into Autonomous Data Warehouse (ADW) by chaining
# File writer handler -> OCI Event handler -> ADW Event handler.
# Note: Recommended to only edit the configuration marked as TODO
gg.target=adw
##The OCI Event handler
# TODO: Edit the OCI config file path.
gg.eventhandler.oci.configFilePath=<path/to/oci/config>
# TODO: Edit the OCI profile name.
gg.eventhandler.oci.profile=DEFAULT
# TODO: Edit the OCI namespace.
gg.eventhandler.oci.namespace=<OCI namespace>
# TODO: Edit the OCI region.
gg.eventhandler.oci.region=<oci-region>
# TODO: Edit the OCI compartment identifier.
gg.eventhandler.oci.compartmentID=<OCI compartment id>
gg.eventhandler.oci.pathMappingTemplate=${fullyQualifiedTableName}
# TODO: Edit the OCI bucket name.
gg.eventhandler.oci.bucketMappingTemplate=<ogg-bucket>
##The ADW Event Handler
# TODO: Edit the ADW JDBC connectionURL
gg.eventhandler.adw.connectionURL=jdbc:oracle:thin:@adw20190410ns_medium?TNS_ADMIN=/path/to/ /adw/wallet
# TODO: Edit the ADW JDBC user
gg.eventhandler.adw.UserName=<db user>
# TODO: Edit the ADW JDBC password
gg.eventhandler.adw.Password=<db password>
# TODO: Edit the ADW Credential that can access the OCI Object Store.
gg.eventhandler.adw.objectStoreCredential=<ADW Object Store credential>
# TODO: Set the classpath to include OCI Java SDK.
gg.classpath=./oci-java-sdk/lib/*:./oci-java-sdk/third-party/lib/*
# TODO: Edit to provide sufficient memory (at least 8GB).
jvm.bootoptions=-Xmx8g -Xms8g
Parent topic: Configuration
8.2.27.7.7 Compressed Update Handling
A compressed update record contains values for the key columns and the modified columns.
An uncompressed update record contains values for all the columns.
Oracle GoldenGate trails may contain compressed or uncompressed update records. The default extract configuration writes compressed updates to the trails.
The parameter gg.compressed.update
can be set to
true
or false
to indicate
compressed/uncompressed update records.
Parent topic: Configuration
8.2.27.7.7.1 MERGE Statement with Uncompressed Updates
In some use cases, if the trail contains uncompressed update records, then the
MERGE SQL
statement can be optimized for better performance by
setting gg.compressed.update=false
.
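For example, the following illustrative property setting indicates that the trail contains uncompressed (full image) updates:
gg.compressed.update=false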
Parent topic: Compressed Update Handling
8.2.28 Oracle Cloud Infrastructure Object Storage
The Oracle Cloud Infrastructure Event Handler is used to load files generated by the File Writer Handler into an Oracle Cloud Infrastructure Object Store. This topic describes how to use the OCI Event Handler.
- Overview
- Detailing the Functionality
- Configuration
- Configuring Credentials for Oracle Cloud Infrastructure
- Troubleshooting
- OCI Dependencies
Parent topic: Target
8.2.28.1 Overview
The Oracle Cloud Infrastructure Object Storage service is an internet-scale, high-performance storage platform that offers reliable and cost-efficient data durability. The Object Storage service can store an unlimited amount of unstructured data of any content type, including analytic data and rich content, like images and videos, see https://cloud.oracle.com/en_US/cloud-infrastructure.
You can use any format handler that the File Writer Handler supports.
Parent topic: Oracle Cloud Infrastructure Object Storage
8.2.28.2 Detailing the Functionality
The Oracle Cloud Infrastructure Event Handler requires the Oracle Cloud Infrastructure Java software development kit (SDK) to transfer files to Oracle Cloud Infrastructure Object Storage. Oracle GoldenGate for Big Data does not include the Oracle Cloud Infrastructure Java SDK, see https://docs.cloud.oracle.com/iaas/Content/API/Concepts/sdkconfig.htm.
You must download the Oracle Cloud Infrastructure Java SDK at:
https://docs.us-phoenix-1.oraclecloud.com/Content/API/SDKDocs/javasdk.htm
Extract the JAR files to a permanent directory. The handler requires two directories: the JAR library directory that contains the Oracle Cloud Infrastructure SDK JAR and the third-party JAR library directory. Both directories must be in the gg.classpath.
Specify the gg.classpath
environment variable to include the JAR files of the Oracle Cloud Infrastructure Java SDK.
Example
gg.classpath=/usr/var/oci/lib/*:/usr/var/oci/third-party/lib/*
Setting of the proxy server settings requires additional dependency libraries identified by the following Maven coordinates:
Group ID: com.oracle.oci.sdk
Artifact ID: oci-java-sdk-addons-apache
The best way to get all of the dependencies is to use the Dependency Downloading utility scripts. The OCI script downloads both the OCI Java SDK and the Apache Addons libraries.
For more information on this dependency, see OCI Documentation - README.
Parent topic: Oracle Cloud Infrastructure Object Storage
8.2.28.3 Configuration
You configure the Oracle Cloud Infrastructure Event Handler operation using the properties file. These properties are located in the Java Adapter properties file (and not in the Replicat properties file).
The Oracle Cloud Infrastructure Event Handler works only in conjunction with the File Writer Handler.
To enable the selection of the Oracle Cloud Infrastructure Event Handler, you must
first configure the handler type by specifying
gg.eventhandler.name.type=oci
and the other Oracle Cloud
Infrastructure properties as follows:
Table 8-34 Oracle Cloud Infrastructure Event Handler Configuration Properties
Properties | Required/ Optional | Legal Values | Default | Explanation
---|---|---|---|---
gg.eventhandler.name.type | Required | oci | None | Selects the Oracle Cloud Infrastructure Event Handler.
gg.eventhandler.name.contentType | Optional | Valid content type value which is used to indicate the media type of the resource. | application/octet-stream | The content type of the object.
gg.eventhandler.name.contentEncoding | Optional | Valid values indicate which encoding to be applied. | utf-8 | The content encoding of the object.
gg.eventhandler.name.contentLanguage | Optional | Valid language intended for the audience. | en | The content language of the object.
gg.eventhandler.name.configFilePath | Optional | Path to the event handler configuration file. | None | The configuration file name and location. If gg.eventhandler.name.configFilePath is not set, then the authentication properties gg.eventhandler.name.userId, gg.eventhandler.name.tenancyId, gg.eventhandler.name.privateKeyFile, and gg.eventhandler.name.publicKeyFingerprint are required.
gg.eventhandler.name.userId | Optional | Valid user ID | None | OCID of the user calling the API. To get the value, see Required Keys and OCIDs: https://docs.oracle.com/en-us/iaas/Content/API/Concepts/apisigningkey.htm#Required_Keys_and_OCIDs. Example: ocid1.user.oc1..<unique_ID> (shortened for brevity)
gg.eventhandler.name.tenancyId | Optional | Valid tenancy ID | None | OCID of your tenancy. To get the value, see Required Keys and OCIDs: https://docs.oracle.com/en-us/iaas/Content/API/Concepts/apisigningkey.htm#Required_Keys_and_OCIDs in the Oracle Cloud Infrastructure documentation. Example: ocid1.tenancy.oc1..<unique_ID>
gg.eventhandler.name.privateKeyFile | Optional | A valid path to the file | None | Full path and filename of the private key. Note: The key pair must be in PEM format. For more information about generating a key pair in PEM format, see Required Keys and OCIDs: https://docs.oracle.com/en-us/iaas/Content/API/Concepts/apisigningkey.htm#Required_Keys_and_OCIDs in the Oracle Cloud Infrastructure documentation. Example: /home/opc/.oci/oci_api_key.pem
gg.eventhandler.name.publicKeyFingerprint | Optional | String | None | Fingerprint for the public key that was added to this user. To get the value, see Required Keys and OCIDs: https://docs.oracle.com/en-us/iaas/Content/API/Concepts/apisigningkey.htm#Required_Keys_and_OCIDs in the Oracle Cloud Infrastructure documentation.
gg.eventhandler.name.profile | Required | Valid string representing the profile name. |  | The profile name in the Oracle Cloud Infrastructure configuration file.
gg.eventhandler.name.region | Required | Oracle Cloud Infrastructure region | None | Oracle Cloud Infrastructure Servers and Data are hosted in a region, which is a localized geographic area. The valid Region Identifiers are listed at Oracle Cloud Infrastructure Documentation - Regions and Availability Domains.
gg.eventhandler.name.compartmentID | Required | Valid compartment id. | None | A compartment is a logical container to organize Oracle Cloud Infrastructure resources.
gg.eventhandler.name.pathMappingTemplate | Required | A string with resolvable keywords and constants used to dynamically generate the path in the Oracle Cloud Infrastructure bucket to write the file. | None | Use keywords interlaced with constants to dynamically generate unique Oracle Cloud Infrastructure path names at runtime. See Template Keywords.
gg.eventhandler.name.fileNameMappingTemplate | Optional | A string with resolvable keywords and constants used to dynamically generate the Oracle Cloud Infrastructure file name at runtime. | None | Use resolvable keywords and constants to dynamically generate the Oracle Cloud Infrastructure data file name at runtime. If not set, the upstream file name is used. See Template Keywords.
gg.eventhandler.name.bucketMappingTemplate | Required | A string with resolvable keywords and constants used to dynamically generate the path in the Oracle Cloud Infrastructure bucket to write the file. | None | Use resolvable keywords and constants to dynamically generate the Oracle Cloud Infrastructure bucket name at runtime. The event handler attempts to create the Oracle Cloud Infrastructure bucket if it does not exist. See Template Keywords.
 | Optional |  | None | Set to
gg.eventhandler.name.eventHandler | Optional | A unique string identifier cross referencing a child event handler. | No event handler is configured. | Sets the event handler that is invoked on the file roll event. Event handlers can do file roll event actions like loading files to S3, converting to Parquet or ORC format, loading files to HDFS, loading files to Oracle Cloud Infrastructure Storage Classic, or loading files to Oracle Cloud Infrastructure.
gg.eventhandler.name.proxyServer | Optional | The host name of your proxy server. | None | Set to the host name of the proxy server if OCI connectivity requires routing through a proxy server.
gg.eventhandler.name.proxyPort | Optional | The port number of the proxy server. | None | Set to the port number of the proxy server if OCI connectivity requires routing through a proxy server.
gg.eventhandler.name.proxyProtocol | Optional | HTTP or HTTPS | HTTP | Sets the proxy protocol connection to the proxy server for an additional level of security. The majority of proxy servers support HTTP. Only set this if the proxy server supports HTTPS and HTTPS is required.
gg.eventhandler.name.proxyUsername | Optional | The username for the proxy server. | None | Sets the username for connectivity to the proxy server if credentials are required. Most proxy servers do not require credentials.
gg.eventhandler.name.proxyPassword | Optional | The password for the proxy server. | None | Sets the password for connectivity to the proxy server if credentials are required. Most proxy servers do not require credentials.
gg.handler.name.SSEKey | Optional | A legal Base64 encoded OCI server side encryption key. | None | Allows you to control the encryption of data files loaded to OCI. OCI encrypts by default. This property allows an additional level of control by supporting encryption with a specific key. That key must also be used to decrypt data files.
Sample Configuration
gg.eventhandler.oci.type=oci
gg.eventhandler.oci.configFilePath=~/.oci/config
gg.eventhandler.oci.profile=DEFAULT
gg.eventhandler.oci.namespace=dwcsdemo
gg.eventhandler.oci.region=us-ashburn-1
gg.eventhandler.oci.compartmentID=ocid1.compartment.oc1..aaaaaaaajdg6iblwgqlyqpegf6kwdais2gyx3guspboa7fsi72tfihz2wrba
gg.eventhandler.oci.pathMappingTemplate=${schemaName}
gg.eventhandler.oci.bucketMappingTemplate=${schemaName}
gg.eventhandler.oci.fileNameMappingTemplate=${tableName}_${currentTimestamp}.txt
gg.eventhandler.oci.finalizeAction=NONE
goldengate.userexit.writers=javawriter
8.2.28.3.1 Automatic Configuration
OCI Object storage replication involves configuring multiple components, such as the File Writer Handler, formatter, and the target OCI Object Storage Event Handler.
The Automatic Configuration functionality helps you to auto configure these components so that the manual configuration is minimal.
The properties modified by auto-configuration are also logged in the handler log file.
To enable auto configuration to replicate to the OCI Object Storage
target, set the parameter gg.target=oci
.
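For example, the following illustrative properties enable auto configuration for an OCI Object Storage target; this assumes the OCI Event Handler keeps the pre-set name oci (as in the manual sample configuration), and the values are placeholders to adapt to your environment:
gg.target=oci
gg.eventhandler.oci.configFilePath=~/.oci/config
gg.eventhandler.oci.profile=DEFAULT
gg.eventhandler.oci.region=us-ashburn-1
gg.eventhandler.oci.bucketMappingTemplate=<ogg-bucket>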
8.2.28.3.1.1 File Writer Handler Configuration
The File Writer Handler name is pre-set to the value oci.
You can add or edit a property of the File Writer Handler. For example:
gg.handler.oci.pathMappingTemplate=./dirout
Parent topic: Automatic Configuration
8.2.28.3.1.2 Formatter Configuration
The JSON row formatter (json_row) is set by default.
You can add or edit a property of the formatter. For example:
gg.handler.oci.format=json_row
Parent topic: Automatic Configuration
8.2.28.4 Configuring Credentials for Oracle Cloud Infrastructure
Basic configuration information like user credentials and tenancy Oracle Cloud IDs (OCIDs) of Oracle Cloud Infrastructure is required for the Java SDKs to work, see https://docs.cloud.oracle.com/iaas/Content/General/Concepts/identifiers.htm.
The ideal configuration file includes the keys user, fingerprint, key_file, tenancy, and region with their respective values. The default configuration file name and location is ~/.oci/config.
Create the config
file as follows:
- Create a directory called .oci in the Oracle GoldenGate for Big Data home directory.
- Create a text file and name it config.
- Obtain the values for these properties:
- user: Log in to the Oracle Cloud Infrastructure Console https://console.us-ashburn-1.oraclecloud.com. Click Username, and then click User Settings. The user's OCID is displayed and is the value for the key user.
- tenancy: The Tenancy ID is displayed at the bottom of the Console page.
- region: The region is displayed with the header session drop-down menu in the Console.
- fingerprint: To generate the fingerprint, use the How to Get the Key's Fingerprint instructions at: https://docs.cloud.oracle.com/iaas/Content/API/Concepts/apisigningkey.htm
- key_file: You need to share the public and private key to establish a connection with Oracle Cloud Infrastructure. To generate the keys, use How to Generate an API Signing Key at: https://docs.cloud.oracle.com/iaas/Content/API/Concepts/apisigningkey.htm
- pass_phrase: This is an optional property. It is used to configure the passphrase if the private key in the PEM file is protected with a passphrase. The following openssl command can be used to take an unprotected private key PEM file and add a passphrase; the command prompts the user for the passphrase:
openssl rsa -aes256 -in in.pem -out out.pem
Sample Configuration File
user=ocid1.user.oc1..aaaaaaaat5nvwcna5j6aqzqedqw3rynjq
fingerprint=20:3b:97:13::4e:c5:3a:34
key_file=~/.oci/oci_api_key.pem
tenancy=ocid1.tenancy.oc1..aaaaaaaaba3pv6wkcr44h25vqstifs
Parent topic: Oracle Cloud Infrastructure Object Storage
8.2.28.5 Troubleshooting
Connectivity Issues
If the OCI Event Handler is unable to connect to the OCI object storage when running on premise, it is likely that your connectivity to the public internet is protected by a proxy server. Proxy servers act as a gateway between the private network of a company and the public internet. Contact your network administrator to get the URL of your proxy server.
Oracle GoldenGate for Big Data connectivity to OCI can be routed through a proxy server by setting the following configuration properties:
gg.eventhandler.name.proxyServer={insert your proxy server name}
gg.eventhandler.name.proxyPort={insert your proxy server port number}
ClassNotFoundException Error
The most common initial error is an incorrect classpath that does not include all the required client libraries, which results in a ClassNotFoundException error. Specify the gg.classpath variable to include all of the required JAR files for the Oracle Cloud Infrastructure Java SDK, see Detailing the Functionality.
Parent topic: Oracle Cloud Infrastructure Object Storage
8.2.28.6 OCI Dependencies
The maven coordinates for OCI are as follows:
Maven groupId: com.oracle.oci.sdk
Maven artifactId: oci-java-sdk-full
Version: 1.34.0
The following are the Apache add-ons which support routing through a proxy server:
Maven groupId: com.oracle.oci.sdk
Maven artifactId: oci-java-sdk-addons-apache
Version: 1.34.0
Parent topic: Oracle Cloud Infrastructure Object Storage
8.2.28.6.1 OCI 1.34.0
accessors-smart-1.2.jar aopalliance-repackaged-2.6.1.jar asm-5.0.4.jar bcpkix-jdk15on-1.68.jar bcprov-jdk15on-1.68.jar checker-qual-3.5.0.jar commons-codec-1.15.jar commons-io-2.8.0.jar commons-lang3-3.8.1.jar commons-logging-1.2.jar error_prone_annotations-2.3.4.jar failureaccess-1.0.1.jar guava-30.1-jre.jar hk2-api-2.6.1.jar hk2-locator-2.6.1.jar hk2-utils-2.6.1.jar httpclient-4.5.13.jar httpcore-4.4.13.jar j2objc-annotations-1.3.jar jackson-annotations-2.12.0.jar jackson-core-2.12.0.jar jackson-databind-2.12.0.jar jackson-datatype-jdk8-2.12.0.jar jackson-datatype-jsr310-2.12.0.jar jackson-module-jaxb-annotations-2.10.1.jar jakarta.activation-api-1.2.1.jar jakarta.annotation-api-1.3.5.jar jakarta.inject-2.6.1.jar jakarta.ws.rs-api-2.1.6.jar jakarta.xml.bind-api-2.3.2.jar javassist-3.25.0-GA.jar jcip-annotations-1.0-1.jar jersey-apache-connector-2.32.jar jersey-client-2.32.jar jersey-common-2.32.jar jersey-entity-filtering-2.32.jar jersey-hk2-2.32.jar jersey-media-json-jackson-2.32.jar json-smart-2.3.jar jsr305-3.0.2.jar listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar nimbus-jose-jwt-8.5.jar oci-java-sdk-addons-apache-1.34.0.jar oci-java-sdk-analytics-1.34.0.jar oci-java-sdk-announcementsservice-1.34.0.jar oci-java-sdk-apigateway-1.34.0.jar oci-java-sdk-apmcontrolplane-1.34.0.jar oci-java-sdk-apmsynthetics-1.34.0.jar oci-java-sdk-apmtraces-1.34.0.jar oci-java-sdk-applicationmigration-1.34.0.jar oci-java-sdk-artifacts-1.34.0.jar oci-java-sdk-audit-1.34.0.jar oci-java-sdk-autoscaling-1.34.0.jar oci-java-sdk-bds-1.34.0.jar oci-java-sdk-blockchain-1.34.0.jar oci-java-sdk-budget-1.34.0.jar oci-java-sdk-cims-1.34.0.jar oci-java-sdk-circuitbreaker-1.34.0.jar oci-java-sdk-cloudguard-1.34.0.jar oci-java-sdk-common-1.34.0.jar oci-java-sdk-computeinstanceagent-1.34.0.jar oci-java-sdk-containerengine-1.34.0.jar oci-java-sdk-core-1.34.0.jar oci-java-sdk-database-1.34.0.jar oci-java-sdk-databasemanagement-1.34.0.jar oci-java-sdk-datacatalog-1.34.0.jar oci-java-sdk-dataflow-1.34.0.jar oci-java-sdk-dataintegration-1.34.0.jar oci-java-sdk-datasafe-1.34.0.jar oci-java-sdk-datascience-1.34.0.jar oci-java-sdk-dns-1.34.0.jar oci-java-sdk-dts-1.34.0.jar oci-java-sdk-email-1.34.0.jar oci-java-sdk-events-1.34.0.jar oci-java-sdk-filestorage-1.34.0.jar oci-java-sdk-full-1.34.0.jar oci-java-sdk-functions-1.34.0.jar oci-java-sdk-goldengate-1.34.0.jar oci-java-sdk-healthchecks-1.34.0.jar oci-java-sdk-identity-1.34.0.jar oci-java-sdk-integration-1.34.0.jar oci-java-sdk-keymanagement-1.34.0.jar oci-java-sdk-limits-1.34.0.jar oci-java-sdk-loadbalancer-1.34.0.jar oci-java-sdk-loganalytics-1.34.0.jar oci-java-sdk-logging-1.34.0.jar oci-java-sdk-loggingingestion-1.34.0.jar oci-java-sdk-loggingsearch-1.34.0.jar oci-java-sdk-managementagent-1.34.0.jar oci-java-sdk-managementdashboard-1.34.0.jar oci-java-sdk-marketplace-1.34.0.jar oci-java-sdk-monitoring-1.34.0.jar oci-java-sdk-mysql-1.34.0.jar oci-java-sdk-networkloadbalancer-1.34.0.jar oci-java-sdk-nosql-1.34.0.jar oci-java-sdk-objectstorage-1.34.0.jar oci-java-sdk-objectstorage-extensions-1.34.0.jar oci-java-sdk-objectstorage-generated-1.34.0.jar oci-java-sdk-oce-1.34.0.jar oci-java-sdk-ocvp-1.34.0.jar oci-java-sdk-oda-1.34.0.jar oci-java-sdk-ons-1.34.0.jar oci-java-sdk-opsi-1.34.0.jar oci-java-sdk-optimizer-1.34.0.jar oci-java-sdk-osmanagement-1.34.0.jar oci-java-sdk-resourcemanager-1.34.0.jar oci-java-sdk-resourcesearch-1.34.0.jar oci-java-sdk-rover-1.34.0.jar oci-java-sdk-sch-1.34.0.jar oci-java-sdk-secrets-1.34.0.jar
oci-java-sdk-streaming-1.34.0.jar oci-java-sdk-tenantmanagercontrolplane-1.34.0.jar oci-java-sdk-usageapi-1.34.0.jar oci-java-sdk-vault-1.34.0.jar oci-java-sdk-waas-1.34.0.jar oci-java-sdk-workrequests-1.34.0.jar osgi-resource-locator-1.0.3.jar resilience4j-circuitbreaker-1.2.0.jar resilience4j-core-1.2.0.jar slf4j-api-1.7.29.jar vavr-0.10.0.jar vavr-match-0.10.0.jar
Parent topic: OCI Dependencies
8.2.29 Redis
Redis is an in-memory data structure store that supports optional durability. Redis is a key/value data store in which a unique key identifies the stored data structure; the value is the data structure itself.
The Redis Handler supports the replication of change data capture to Redis and the storage of that data in three different data structures: Hash Maps, Streams, JSONs.
- Data Structures Supported by the Redis Handler
- Redis Handler Configuration Properties
- Security
- Authentication Using Credentials
- SSL Basic Auth
- SSL Mutual Auth
- Redis Handler Dependencies
The Redis Handler uses the Jedis client libraries to connect to the Redis server. - Redis Handler Client Dependencies
The Redis Handler uses the Jedis client to connect to Redis.
Parent topic: Target
8.2.29.1 Data Structures Supported by the Redis Handler
8.2.29.1.1 Hash Maps
Behavior on Inserts, Updates, and Deletes
The source trail file will contain insert, update, and delete operations for which the data can be pushed into Redis. The Redis Handler will process inserts, updates, and deletes as follows:
Inserts – The Redis Handler will create a new key in Redis the value of which is a hash map for which the hash map key is the column name and the hash map value is the column value.
Updates – The Redis Handler will update an existing hash map structure in Redis. The existing hash map will be updated with the column names and values from the update operation processed. Because hash map data is updated and not replaced, full image updates are not required.
Primary Key Updates – The Redis Handler will move the old key to the new key name along with the data structure, and then an update will be performed on the hash map.
Deletes – The Redis Handler will delete the key and its corresponding data structure from Redis.
Handling of Null Values
Redis hash maps cannot store null as a value. A Redis hash map must have a non-null value. The default behavior is to omit columns with a null value from the generated hash map. If an update changes a column value from a non-null value to a null value, then the column key and value is removed from the hash map.
To replicate null values instead, set the following properties:
gg.handler.redis.omitNullValues=false
gg.handler.redis.nullValueRepresentation=null
The user will need to designate some value to represent null, and the following are legal too.
In this case, the null value representation is an empty string ("").
gg.handler.redis.nullValueRepresentation=CDATA[]
In this case, the null value representation is set to a tab.
gg.handler.redis.nullValueRepresentation=CDATA[\t]
Support for Binary Values
The default functionality is to push all data into Redis hash maps as Java strings. Binary values must be converted to Base64 to be represented as a Java String. Consequently, binary values will be represented as Base64. Alternatively, users can push bytes into Redis hash maps to retain the original bytes values by setting the following configuration property.
gg.handler.redis.dataType=bytes
Example hash map data in Redis:
127.0.0.1:6379> hgetall TCUSTMER:JANE
 1) "optype"
 2) "I"
 3) "CITY"
 4) "DENVER"
 5) "primarykeycolumns"
 6) "CUST_CODE"
 7) "STATE"
 8) "CO"
 9) "CUST_CODE"
10) "JANE"
11) "position"
12) "00000000000000002126"
13) "NAME"
14) "ROCKY FLYER INC."
Example Configuration
gg.handlerlist=redis
gg.handler.redis.type=redis
gg.handler.redis.hostPortList=localhost:6379
gg.handler.redis.createIndexes=true
gg.handler.redis.mode=op
gg.handler.redis.metacolumnsTemplate=${position},${optype},${primarykeycolumns}
Parent topic: Data Structures Supported by the Redis Handler
8.2.29.1.2 Streams
Redis streams are analogous to Kafka topics. The Redis key is the stream name. The values of the stream are the individual messages pushed to the Redis stream. Individual messages are identified by a timestamp and offset of when the message was pushed to Redis. The value of each individual message is a hash map for which the key is the column name and the value is the column value.
Behavior on Inserts, Updates, and Deletes
Each and every operation and its associated data is propagated to Redis Streams. Therefore, every operation will show up as a new message in Redis Streams.
Handling of Null Values
Redis streams stores hash maps as the value for each message. A Redis hash map cannot store null as a value. Null values work exactly as they do in hash maps functionality.
Support for Binary Values
The default functionality is to push all data into Redis hash maps as Java strings. Binary values must be converted to Base64 to be represented as a Java String. Consequently, binary values will be represented as Base64. Alternatively, users can push bytes into Redis hash maps to retain the original bytes values by setting the following configuration property.
gg.handler.redis.dataType=bytes
Stream data appears in Redis as follows:
127.0.0.1:6379> xread STREAMS TCUSTMER 0-0 1) 1) "TCUSTMER" 2) 1) 1) "1664399290398-0" 2) 1) "optype" 2) "I" 3) "CITY" 4) "SEATTLE" 5) "primarykeycolumns" 6) "CUST_CODE" 7) "STATE" 8) "WA" 9) "CUST_CODE" 10) "WILL" 11) "position" 12) "00000000000000001956" 13) "NAME" 14) "BG SOFTWARE CO."
2) 1) "1664399290398-1" 2) 1) "optype" 2) "I" 3) "CITY" 4) "DENVER" 5) "primarykeycolumns" 6) "CUST_CODE" 7) "STATE" 8) "CO" 9) "CUST_CODE" 10) "JANE" 11) "position" 12) "00000000000000002126" 13) "NAME" 14) "ROCKY FLYER INC."
Example Configuration
gg.handlerlist=redis
gg.handler.redis.type=redis
gg.handler.redis.hostportlist=localhost:6379
gg.handler.redis.mode=op
gg.handler.redis.integrationType=streams
gg.handler.redis.metacolumnsTemplate=${position},${optype},${primarykeycolumns}
Parent topic: Data Structures Supported by the Redis Handler
8.2.29.1.3 JSONs
The key is a unique identifier for the table and row of the data which is being pushed to Redis. The value is a JSON object. The keys in the JSON object are the column names while the values in the JSON object are the column values.
The source trail file will contain insert, update, and delete operations for which the data can be pushed into Redis. The Redis Handler will process inserts, updates, and deletes as follows:
Inserts – The Redis Handler will create a new JSON at the key.
Updates – The Redis Handler will replace the JSON at the given key with the new JSON reflecting the data of update. Because the JSON is replaced, full image updates are recommended in the source trail file.
Deletes – The key in Redis along with its corresponding JSON data structure are deleted.
Handling of Null Values
The JSON specification supports null values as JSON null. Therefore,
null values in the data will be propagated as JSON null. Null value replacement is
not supported since the JSON specification supports null values. Neither
gg.handler.redis.omitNullValues
nor
gg.handler.redis.nullValueRepresentation
configuration
properties have any effect when the Redis Handler is configured to send JSONs. JSON
per the specification is represented as follows: “fieldname”:
null
Support for Binary Values
Per the JSON specification, binary values are
represented as Base64. Therefore, all binary values will be converted and propagated
as Base64. Setting the property gg.handler.redis.dataType
has no
effect. JSONs will generally appear in Redis as follows:
127.0.0.1:6379> JSON.GET TCUSTMER:JANE
"{\"position\":\"00000000000000002126\",\"optype\":\"I\",\"primarykeycolumns\":[\"CUST_CODE\"],\"CUST_CODE\":\"JANE\",\"NAME\":\"ROCKY FLYER INC.\",\"CITY\":\"DENVER\",\"STATE\":\"CO\"}"
Example Configuration:
gg.handlerlist=redis
gg.handler.redis.type=redis
gg.handler.redis.hostportlist=localhost:6379
gg.handler.redis.mode=op
gg.handler.redis.integrationType=jsons
gg.handler.redis.createIndexes=true
gg.handler.redis.metacolumnsTemplate=${position},${optype},${primarykeycolumns}
Parent topic: Data Structures Supported by the Redis Handler
8.2.29.2 Redis Handler Configuration Properties
Table 8-35 Redis Handler Configuration Properties
Properties | Required/Optional | Legal Values | Default | Explanation
---|---|---|---|---
gg.handlerlist=name | Required | Any String | none | Provides the name for the Redis Handler.
gg.handler.name.type | Required | redis | none | Selects the Redis Handler.
gg.handler.name.mode | Optional | op or tx | op | The default is recommended.
gg.handler.name.integrationType | Optional | hashmaps, streams, or jsons | hashmaps | Sets the integration type for Redis. Select hashmaps and the data will be pushed into Redis as hash maps. Select streams and the data will be pushed into Redis streams. Select jsons and the data will be pushed into Redis as JSONs.
gg.handler.name.dataType | Optional | string or bytes | string | Only valid for the hashmaps and streams integration types. Controls whether string data or byte data is pushed to Redis. If string is selected, all binary data will be pushed to Redis Base64 encoded. If bytes is selected, binary data is pushed to Redis without conversion.
gg.handler.name.keyMappingTempate | Optional | Any combination of string and templating keywords. | For hashmaps and jsons: … For streams: … | Redis is a key value data store. The resolved value of this template determines the key for an operation.
gg.handler.name.createIndexes | Optional | true or false | true | Will automatically create an index for each replicated table for the hashmaps and jsons integration types.
gg.handler.name.omitNullValues | Optional | true or false | true | Null values cannot be stored as values in a Redis hash map structure. Both the integration types hashmaps and streams store hash maps. By default, if a column value is null, it cannot be replicated to Redis. By default, if a column value is changed to null, it has to be removed from the hash map. Setting this to false will replicate a configured value representing a null value to Redis.
gg.handler.name.nullValueRepresentation | Optional | Any String |  | Only valid if the integration type is hashmaps or streams. Only valid if gg.handler.name.omitNullValues is set to false.
gg.handler.name.metaColumnsTemplate | Optional | Any string of comma separated metacolumn keywords. | none | This can be configured to select one or more metacolumns to be added to the output to Redis. See Metacolumn Keywords.
gg.handler.name.insertOpKey | Optional | Any string | "I" | This is the value of the operation type for inserts which is replicated if the metacolumn ${optype} is configured.
gg.handler.name.updateOpKey | Optional | Any string | "U" | This is the value of the operation type for updates which is replicated if the metacolumn ${optype} is configured.
gg.handler.name.deleteOpKey | Optional | Any string | "D" | This is the value of the operation type for deletes which is replicated if the metacolumn ${optype} is configured.
gg.handler.name.trucateOpKey | Optional | Any string | "T" | This is the value of the operation type for truncates which is replicated if the metacolumn ${optype} is configured.
gg.handler.name.maxStreamLength | Optional | Positive Integer | 0 | Sets the maximum length of streams. If more messages are pushed to a stream than this value, then the oldest messages will be deleted so that the maximum stream size is enforced. The default value is 0, which means there is no limit on the maximum stream length.
gg.handler.name.username | Optional | Any string | None | Used to set the username, if required, for connectivity to Redis.
gg.handler.name.password | Optional | Any string | None | Used to set the password, if required, for connectivity to Redis.
gg.handler.name.timeout | Optional | integer | 15000 | Property to set both the connection and socket timeouts in milliseconds.
gg.handler.name.enableSSL | Optional | true or false | false | Set to true to enable SSL connectivity to Redis.
Parent topic: Redis
8.2.29.3 Security
Connectivity to Redis can be secured in multiple ways. It is the Redis server which is configured for, and thereby selects, the type of security. The Redis Handler, which is the Redis client, must be configured to match the security of the server.
Redis server – connection listener – This is the Redis application.
Redis client – connection caller – This is the Oracle GoldenGate Redis Handler.
Check with your Redis administrator as to what security has been configured on the Redis server. Then, configure the Redis Handler to follow the security configuration of the Redis server.
Parent topic: Redis
8.2.29.4 Authentication Using Credentials
This is a simple security mechanism in which the Redis client provides credentials (username and password) that the Redis server uses to authenticate the Redis client. This security does not provide any encryption of inflight messages.
gg.handler.name.username=<username>
gg.handler.name.password=<password>
Parent topic: Redis
8.2.29.5 SSL Basic Auth
In this use case the Redis server passes a certificate to the Redis client. This allows the client to authenticate the server. The client passes credentials to the server, which allows the Redis server to authenticate the client. This connection is SSL and provides encryption of inflight messages.
gg.handler.name.enableSSL=true
gg.handler.name.username=<username>
gg.handler.name.password=<password>
If the Redis server passes an unsigned certificate to the Redis client, then the Redis Handler will need to be configured with a truststore. If the Redis server passes a certificate signed by a Certificate Authority, then a truststore is not required.
To configure a truststore on the Redis Handler:
jvm.bootoptions=-Djavax.net.ssl.trustStore=<absolute path to truststore> -Djavax.net.ssl.trustStorePassword=<truststore password>
Parent topic: Redis
8.2.29.6 SSL Mutual Auth
In this use case the Redis server passes a certificate to the Redis client. This allows the client to authenticate the server. The Redis client then passes a certificate to the Redis server. This allows the server to authenticate the Redis client. This connection is SSL and provides encryption of inflight messages.
gg.handler.name.enableSSL=true
Typically with this setup, the Redis client will need both a truststore and a keystore. The configuration is as follows:
To configure a truststore on the Redis Handler:
jvm.bootoptions=-Djavax.net.ssl.keyStore=<absolute path to keystore> -Djavax.net.ssl.keyStorePassword=<keystore password> -Djavax.net.ssl.trustStore=<absolute path to truststore> -Djavax.net.ssl.trustStorePassword=<truststore password>
Parent topic: Redis
8.2.29.7 Redis Handler Dependencies
The Redis Handler uses the Jedis client libraries to connect to the Redis server.
The following is a link to Jedis: https://github.com/redis/jedis
The Jedis libraries do not ship with Oracle GoldenGate for Big Data and need to be obtained; the gg.classpath configuration property must then be configured to resolve the Jedis client. The dependency downloader utility which ships with Oracle GoldenGate for Big Data can be used to download Jedis. The Redis Handler was developed using Jedis 4.2.3. The following shows an example configuration of the gg.classpath:
gg.classpath=/OGGBDinstall/DependencyDownloader/dependencies/jedis_4.2.3/*
Parent topic: Redis
8.2.29.8 Redis Handler Client Dependencies
The Redis Handler uses the Jedis client to connect to Redis.
Group ID: redis.clients
Artifact ID: jedis
Parent topic: Redis
8.2.29.8.1 jedis 4.2.3
commons-pool2-2.11.1.jar
gson-2.8.9.jar
jedis-4.2.3.jar
json-20211205.jar
slf4j-api-1.7.32.jar
Parent topic: Redis Handler Client Dependencies
8.2.30 Snowflake
Topics:
8.2.30.1 Overview
Snowflake is a serverless data warehouse that runs on any of the following cloud providers: Amazon Web Services (AWS), Google Cloud Platform (GCP), or Microsoft Azure.
The Snowflake Event Handler is used to replicate data into Snowflake.
Parent topic: Snowflake
8.2.30.2 Detailed Functionality
- The change data from the Oracle GoldenGate trails is staged in micro-batches at a temporary staging location (internal or external stage).
- The staged records are then merged into the Snowflake target tables using a merge SQL statement.
This topic contains the following:
Parent topic: Snowflake
8.2.30.2.1 Staging Location
The change data records from the Oracle GoldenGate trail files are formatted into Avro OCF (Object Container Format) and are then uploaded to the staging location.
Change data can be staged in one of the following object stores:
- Snowflake internal stage
- Snowflake external stage
- AWS Simple Storage Service (S3)
- Azure Data Lake Storage (ADLS) Gen2
- Google Cloud Storage (GCS)
Parent topic: Detailed Functionality
8.2.30.2.2 Database User Privileges
The database user used for replicating into Snowflake has to be granted the following privileges (illustrative grant statements follow this list):
- INSERT, UPDATE, DELETE, and TRUNCATE on the target tables.
- CREATE and DROP on the Snowflake named stage and external stage.
- If using an external stage (S3, ADLS, GCS): CREATE, ALTER, and DROP external table.
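For example, the following illustrative Snowflake SQL grants these privileges to a hypothetical role oggrole on a schema named tgt; the role, schema, and grant scope are placeholders and may differ in your security model:
GRANT INSERT, UPDATE, DELETE, TRUNCATE ON ALL TABLES IN SCHEMA tgt TO ROLE oggrole;
GRANT CREATE STAGE ON SCHEMA tgt TO ROLE oggrole;
GRANT CREATE EXTERNAL TABLE ON SCHEMA tgt TO ROLE oggrole;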
Parent topic: Detailed Functionality
8.2.30.2.3 Prerequisites
- Verify that the target tables exist on the Snowflake database.
- You must have Amazon Web Services, Google Cloud Platform, or Azure cloud accounts set up if you intend to use any of the external stage locations such as, S3, ADLS Gen2, or GCS.
- Snowflake JDBC driver
Parent topic: Detailed Functionality
8.2.30.3 Configuration
Note:
Ensure that you specify the path to the properties file in the parameter file only when using Coordinated Replicat. Add the following line to the parameter file: TARGETDB LIBFILE libggjava.so SET property=<parameter file directory>/<properties file name>
- Automatic Configuration
- Snowflake Storage Integration
- Classpath Configuration
- Proxy Configuration
- INSERTALLRECORDS Support
- Snowflake Key Pair Authentication
- Mapping Source JSON/XML to Snowflake VARIANT
- Operation Aggregation
- Compressed Update Handling
- End-to-End Configuration
Parent topic: Snowflake
8.2.30.3.1 Automatic Configuration
Snowflake replication involves configuring multiple components, such as the File Writer Handler, S3 or HDFS or GCS Event Handler, and the target Snowflake Event Handler.
The Automatic Configuration functionality helps you to auto-configure these components so that the manual configuration is minimal.
The properties modified by auto-configuration are also logged in the handler log file.
To enable auto-configuration to replicate to the Snowflake target, set
the parameter gg.target=snowflake
.
The Java system property SF_STAGE determines the staging location. If SF_STAGE is not set, then the Snowflake internal stage is used.
If SF_STAGE is set to either s3, hdfs, or gcs, then AWS S3, ADLS Gen2, or GCS is used as the staging location, respectively.
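For example, the following illustrative jvm.bootoptions setting selects AWS S3 as the staging location by passing SF_STAGE as a Java system property; the memory setting is a placeholder:
jvm.bootoptions=-Xmx8g -DSF_STAGE=s3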
The JDBC Metadata provider is also automatically enabled to retrieve target table metadata from Snowflake.
- File Writer Handler Configuration
- S3 Handler Configuration
- HDFS Event Handler Configuration
- Google Cloud Storage Event Handler Configuration
- Snowflake Event Handler Configuration
Parent topic: Configuration
8.2.30.3.1.1 File Writer Handler Configuration
The File Writer Handler name is pre-set to the value snowflake
and
its properties are automatically set to the required values for Snowflake.
You can add or edit a property of the File Writer Handler. For example:
gg.handler.snowflake.pathMappingTemplate=./dirout
Parent topic: Automatic Configuration
8.2.30.3.1.2 S3 Handler Configuration
The S3 Event Handler name is pre-set to the value
s3
and must be configured to match your S3 configuration.
The following is an example of editing a property of the S3 Event Handler:
gg.eventhandler.s3.bucketMappingTemplate=bucket1
For
more information, see Amazon S3.
Parent topic: Automatic Configuration
8.2.30.3.1.3 HDFS Event Handler Configuration
The Hadoop Distributed File System (HDFS) Event Handler name is pre-set
to the value hdfs
and it is auto-configured to write to HDFS.
Ensure that the Hadoop configuration file core-site.xml
is configured to write data files to the respective container in the Azure Data Lake
Storage (ADLS) Gen2 storage account. For more information, see Azure Data Lake Gen2 using Hadoop Client and ABFS.
The following is an example of editing a property of the HDFS Event handler:
gg.eventhandler.hdfs.finalizeAction=delete
Parent topic: Automatic Configuration
8.2.30.3.1.4 Google Cloud Storage Event Handler Configuration
The Google Cloud Storage (GCS) Event Handler name is pre-set to the
value gcs
and must be configured to match your GCS
configuration.
The following is an example of editing a GCS Event Handler property:
gg.eventhandler.gcs.bucketMappingTemplate=bucket1
Parent topic: Automatic Configuration
8.2.30.3.1.5 Snowflake Event Handler Configuration
The Snowflake Event Handler name is pre-set to the value
snowflake
.
The following are configuration properties available for the Snowflake Event handler, the required ones must be changed to match your Snowflake configuration:
Table 8-36 Snowflake Event Handler Configuration
Properties | Required/Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.eventhandler.snowflake.connectionURL |
Required | jdbc:snowflake://<account_name>.snowflakecomputing.com/?warehouse=<warehouse-name>&db=<database-name> |
None | JDBC URL to connect to Snowflake. Snowflake account name, warehouse and database must be set in the JDBC URL. |
gg.eventhandler.snowflake.connectionURL |
Required | Supported connection URL. | None | JDBC URL to connect to Snowflake. Snowflake account name, warehouse and database must be set in the JDBC URL. The warehouse can be set using `warehouse=<warehouse name>`, database can set using `db=<db name>`. In some cases for authorization, a role should be set using `role=<rolename>`. |
gg.eventhandler.snowflake.UserName |
Required | Supported database user name string. | None | Snowflake database user. |
gg.eventhandler.snowflake.Password |
Required | Supported database password string. | None | Snowflake database password. |
gg.eventhandler.snowflake.storageIntegration |
Optional | Storage integration name. | None | This parameter is required when using an external stage such as ADLS Gen2 or GCS or S3. This is the credential for Snowflake data warehouse to access the respective Object store files. For more information, see Snowflake Storage Integration. |
gg.eventhandler.snowflake.maxConnections |
Optional | Integer Value | 10 | Use this parameter to control the number of concurrent JDBC database connections to the target Snowflake database. |
gg.eventhandler.snowflake.dropStagingTablesOnShutdown |
Optional | true | false |
false |
If set to true, the temporary
staging tables created by Oracle GoldenGate are dropped when the Replicat stops gracefully.
|
gg.aggregate.operations.flush.interval |
Optional | Integer | 30000 | The flush interval determines how often accumulated operations are merged into Snowflake. The value is set in milliseconds.
Note: Use this parameter with caution. Increasing the value beyond the default increases the amount of data that the Replicat process must hold in memory, which can lead to out of memory errors and stop the Replicat. |
gg.eventhandler.snowflake.putSQLThreads |
Optional | Integer Value | 4 | Specifies the number of threads
(`PARALLEL ` clause) to use for uploading files
using PUT SQL . This is only relevant when Snowflake
internal stage (named stage) is used.
|
gg.eventhandler.snowflake.putSQLAutoCompress |
Optional | true | false |
false |
Specifies whether Snowflake uses
gzip to compress files
(AUTO_COMPRESS clause) during upload using
PUT SQL .
true : Files are compressed (if they are not already
compressed).
false : Files are not compressed
(which means, the files are uploaded as is). This is only relevant
when Snowflake internal stage (named stage) is used.
|
gg.operation.aggregator.validate.keyupdate
|
Optional | true or
false |
false |
If set to true , Operation
Aggregator will validate key update operations (optype 115) and
correct to normal update if no key values have changed. Compressed
key update operations do not qualify for merge.
|
gg.eventhandler.snowflake.useCopyForInitialLoad |
Optional | true or
false |
true |
If set to true , then COPY
SQL statement will be used during initial load. If set
to false , then INSERT SQL
statement will be used during initial load.
|
gg.compressed.update |
Optional | true or
false |
true |
If set to true, then this
indicates that the source trail files contain compressed update
operations. If set to false, then the source trail
files are expected to contain uncompressed update operations.
|
gg.eventhandler.snowflake.connectionRetries
|
Optional | Integer Value | 3 | Specifies the number of times connections to the target data warehouse will be retried. |
gg.eventhandler.snowflake.connectionRetryIntervalSeconds
|
Optional | Integer Value | 30 | Specifies the delay in seconds between connection retry attempts. |
Parent topic: Automatic Configuration
8.2.30.3.2 Snowflake Storage Integration
When you use an external staging location, ensure that you set up a Snowflake storage integration to grant the Snowflake database read permission on the files located in the cloud object store.
If the Java system property SF_STAGE
is not set, then
the storage integration is not required, and Oracle GoldenGate defaults to internal
stage.
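For example, when S3 is used as the external stage, the SF_STAGE system property can be passed to the Replicat JVM through jvm.bootoptions, as in the end-to-end example later in this section (the memory settings shown are illustrative):
jvm.bootoptions=-DSF_STAGE=s3 -Xmx8g -Xms8g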
- Azure Data Lake Storage (ADLS) Gen2 Storage Integration: For
more information about creating the storage integration for Azure, see Snowflake documentation to
create the storage integration for Azure.
Example:
-- AS ACCOUNTADMIN
create storage integration azure_int
  type = external_stage
  storage_provider = azure
  enabled = true
  azure_tenant_id = '<azure tenant id>'
  storage_allowed_locations = ('azure://<azure-account-name>.blob.core.windows.net/<azure-container>/');
desc storage integration azure_int;
-- Read AZURE_CONSENT_URL and accept the terms and conditions specified in the link.
-- Read AZURE_MULTI_TENANT_APP_NAME to get the Snowflake app name to be granted Blob Read permission.
grant create stage on schema <schema name> to role <role name>;
grant usage on integration azure_int to role <role name>;
- Google Cloud Storage (GCS) Storage Integration: For more
information about creating the storage integration for GCS, see Snowflake Documentation.
Example:
create storage integration gcs_int
  type = external_stage
  storage_provider = gcs
  enabled = true
  storage_allowed_locations = ('gcs://<gcs-bucket-name>/');
desc storage integration gcs_int;
-- Read the column STORAGE_GCP_SERVICE_ACCOUNT to get the GCP Service Account email for Snowflake.
-- Create a GCP role with storage read permission and assign the role to the Snowflake Service account.
grant create stage on schema <schema name> to role <role name>;
grant usage on integration gcs_int to role <role name>;
- AWS S3 Storage Integration: For more information about
creating the storage integration for S3, see Snowflake
Documentation.
Note:
When you use S3 as the external stage, you do not need to create a storage integration if you already have access to the following AWS credentials: AWS Access Key ID and Secret Key. You can set the AWS credentials in the jvm.bootoptions
property. - The storage integration name must start with an alphabetic character and cannot
contain spaces or special characters unless the entire identifier string is
enclosed in double quotes, for example,
"My object"
. Identifiers enclosed in double quotes are also case-sensitive.
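After the storage integration is created, reference it in the Snowflake Event Handler configuration. A minimal example, reusing the azure_int integration name from the sample above:
gg.eventhandler.snowflake.storageIntegration=azure_int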
Parent topic: Configuration
8.2.30.3.3 Classpath Configuration
Snowflake Event Handler uses the Snowflake JDBC driver. Ensure that the classpath includes the path to the JDBC driver. If an external stage is used, then you need to also include the respective object store Event Handler’s dependencies in the classpath.
Parent topic: Configuration
8.2.30.3.3.1 Dependencies
Snowflake JDBC driver: You can use the Dependency Downloader
tool to download the JDBC driver by running the following script:
<OGGDIR>/DependencyDownloader/snowflake.sh
.
For more information about Dependency Downloader, see Dependency Downloader in the Installing and Upgrading Oracle GoldenGate for Big Data guide.
Alternatively, you can download the JDBC driver from Maven central using the following coordinates:
<dependency>
<groupId>net.snowflake</groupId>
<artifactId>snowflake-jdbc</artifactId>
<version>3.13.19</version>
</dependency>
- If staging location is set to S3, then the classpath should include the S3 Event handler dependencies. See S3 Handler Configuration.
- If staging location is set to HDFS, then the classpath should include the HDFS Event handler dependencies. See HDFS Event Handler Configuration.
- If staging location is set to Google Cloud Storage (GCS), then the classpath should include the GCS Event handler dependencies. See Google Cloud Storage Event Handler Configuration.
Edit the gg.classpath
configuration parameter to include
the path to the object store Event Handler dependencies (if external
stage is in use) and the Snowflake JDBC driver.
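For example, a minimal classpath for the Snowflake internal stage might reference only the JDBC driver (the jar path and version are illustrative; point it at the driver you downloaded):
gg.classpath=./snowflake-jdbc-3.13.19.jar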
Parent topic: Classpath Configuration
8.2.30.3.4 Proxy Configuration
When the Replicat process runs behind a proxy server, you can use the
jvm.bootoptions
property to pass the proxy server configuration.
Example:
jvm.bootoptions=-Dhttp.useProxy=true -Dhttps.proxyHost=<some-proxy-address.com> -Dhttps.proxyPort=80 -Dhttp.proxyHost=<some-proxy-address.com> -Dhttp.proxyPort=80
Parent topic: Configuration
8.2.30.3.5 INSERTALLRECORDS Support
Stage and merge targets support the INSERTALLRECORDS
parameter.
See INSERTALLRECORDS in Reference for
Oracle GoldenGate. Set the INSERTALLRECORDS
parameter in
the Replicat parameter file (.prm
).
Setting this property directs the Replicat process to use bulk insert
operations to load operation data into the target table. You can tune the batch size
of bulk inserts using the File Writer property
gg.handler.snowflake.maxFileSize
. The default value is set to
1GB. The frequency of bulk inserts can be tuned using the File writer property
gg.handler.snowflake.fileRollInterval
, the default value is set
to 3m (three minutes).
Note:
- When using the Snowflake internal stage, the staging files can
be compressed by setting
gg.eventhandler.snowflake.putSQLAutoCompress
to true
.
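A minimal sketch pulling these settings together (the values shown are the documented defaults from this section, with compression enabled for the internal stage):

Replicat parameter file (.prm):
INSERTALLRECORDS

Adapter properties file:
gg.handler.snowflake.maxFileSize=1g
gg.handler.snowflake.fileRollInterval=3m
gg.eventhandler.snowflake.putSQLAutoCompress=true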
Parent topic: Configuration
8.2.30.3.6 Snowflake Key Pair Authentication
Snowflake supports key pair authentication as an alternative to basic authentication using username and password.
The path to the private key file must be set in the JDBC connection URL
using the property: private_key_file
.
If the private key file is encrypted, then the connection URL should also include the
property: private_key_file_pwd
.
Additionally, the connection URL should also include the Snowflake user that is
assigned the respective public key by setting the property
user
.
jdbc:snowflake://<account_name>.snowflakecomputing.com/?warehouse=<warehouse-name> &db=<database-name>&private_key_file=/path/to/private/key/rsa_key.p8 &private_key_file_pwd=<private-key-password>&user=<db-user>
When key pair authentication is used, the UserName
and Password
properties are not
set.
Note:
Oracle recommends that you upgrade Oracle GoldenGate for Big Data to version 21.10.0.0.0. If you cannot upgrade to 21.10.0.0.0, then modify the JDBC URL to replace '\' characters with '/'.
Parent topic: Configuration
8.2.30.3.7 Mapping Source JSON/XML to Snowflake VARIANT
JSON
and XML
source column types in the
Oracle GoldenGate trail are automatically detected and mapped into Snowflake
VARIANT
.
You can inspect the metadata in the Oracle
GoldenGate trail file for JSON
and XML
types
using logdump
.
logdump
output showing JSON
and XML
types:
2022/01/06 01:38:54.717.464 Metadata Len 679 RBA 6032
Table Name: CDB1_PDB1.TKGGU1.JSON_TAB1
*
 1)Name  2)Data Type  3)External Length  4)Fetch Offset  5)Scale  6)Level
 7)Null  8)Bump if Odd  9)Internal Length  10)Binary Length  11)Table Length  12)Most Sig DT
13)Least Sig DT  14)High Precision  15)Low Precision  16)Elementary Item  17)Occurs  18)Key Column
19)Sub DataType  20)Native DataType  21)Character Set  22)Character Length  23)LOB Type  24)Partial Type
25)Remarks
*
TDR version: 11
Definition for table CDB1_PDB1.TKGGU1.JSON_TAB1
Record Length: 81624
Columns: 7
ID            64    50     0  0  0 0 0    50    50    50 0 0 0 0 1 0 1  2   2  -1 0 0 0
COL           64  4000    56  0  0 1 0  4000  8200     0 0 0 0 0 1 0 0  0 119   0 0 1 1 JSON
COL2          64  4000  4062  0  0 1 0  4000  8200     0 0 0 0 0 1 0 0  0 119   0 0 1 1 JSON
COL3          64  4000  8068  0  0 1 0  4000  4000     0 0 0 0 0 1 0 0 10 112  -1 0 1 1 XML
SYS_NC00005$  64  8000 12074  0  0 1 0  4000  4000     0 0 0 0 0 1 0 0  4 113  -1 0 1 1 Hidden
SYS_IME_OSON_CF27CFDF1CEB4FA2BF85A3D6239A433C 64 65534 16080 0 0 1 0 32767 32767 0 0 0 0 0 1 0 0 4 23 -1 0 0 0 Hidden
SYS_IME_OSON_CEE1B31BB4494F6ABF31AC002BEBE941 64 65534 48852 0 0 1 0 32767 32767 0 0 0 0 0 1 0 0 4 23 -1 0 0 0 Hidden
End of definition
In this example, COL
and COL2
are
JSON
columns and COL3
is an
XML
column.
Additionally, mapping to
Snowflake VARIANT
is supported only if the source columns are
stored as text.
Parent topic: Configuration
8.2.30.3.8 Operation Aggregation
Operation aggregation is the process of aggregating (merging/compressing) multiple operations on the same row into a single output operation based on a threshold.
8.2.30.3.8.1 In-Memory Operation Aggregation
- Operation records can be aggregated in-memory by setting
gg.aggregate.operations=true.
This is the default configuration.
- You can tune the frequency of the merge interval using the
gg.aggregate.operations.flush.interval
property; the default value is 30000 milliseconds (thirty seconds). - In-memory operation aggregation requires additional JVM memory configuration, as shown in the sketch after this list.
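A minimal properties sketch for in-memory aggregation (the values shown are the documented defaults; the heap size mirrors the memory guidance in the end-to-end example and is only illustrative):
gg.aggregate.operations=true
gg.aggregate.operations.flush.interval=30000
jvm.bootoptions=-Xmx8g -Xms8g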
Parent topic: Operation Aggregation
8.2.30.3.8.2 Operation Aggregation Using SQL
- To use SQL aggregation, it is mandatory that the trail files
contain uncompressed
UPDATE
operation records, which means that the UPDATE
operations contain the full image of the row being updated. - Operation aggregation using SQL can provide better throughput if the trail files contain uncompressed update records.
- Replicat can aggregate operations using SQL statements by setting
gg.aggregate.operations.using.sql=true
(see the sketch after this list). - You can tune the frequency of the merge interval using the File Writer
gg.handler.snowflake.fileRollInterval
property; the default value is 3m (three minutes). - Operation aggregation using SQL does not require additional JVM memory configuration.
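A minimal properties sketch for SQL-based aggregation (the property names are taken from this section; the roll interval value shown is the documented default):
gg.compressed.update=false
gg.aggregate.operations.using.sql=true
gg.handler.snowflake.fileRollInterval=3m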
Parent topic: Operation Aggregation
8.2.30.3.9 Compressed Update Handling
A compressed update record contains values for the key columns and the modified
columns. An uncompressed update record contains values for all the columns. Oracle
GoldenGate trails may contain compressed or uncompressed update records. The default
Extract configuration writes compressed updates to the trails. The parameter
gg.compressed.update
can be set to true/false to indicate
compressed/uncompressed update records.
Parent topic: Configuration
8.2.30.3.9.1 MERGE Statement with Uncompressed Updates
In some use cases, if the trail contains uncompressed update records,
then the MERGE SQL
statement can be optimized for better
performance by setting gg.compressed.update=false
.
Parent topic: Compressed Update Handling
8.2.30.3.10 End-to-End Configuration
The following is an end-to-end configuration example that uses auto-configuration.
Location of the sample properties file: <OGGDIR>/AdapterExamples/big-data/snowflake/
- sf.props: Configuration using internal stage.
- sf-s3.props: Configuration using S3 stage.
- sf-az.props: Configuration using ADLS Gen2 stage.
- sf-gcs.props: Configuration using GCS stage.
# Note: Recommended to only edit the configuration marked as TODO
gg.target=snowflake

#The Snowflake Event Handler
#TODO: Edit JDBC ConnectionUrl
gg.eventhandler.snowflake.connectionURL=jdbc:snowflake://<account_name>.snowflakecomputing.com/?warehouse=<warehouse-name>&db=<database-name>
#TODO: Edit JDBC user name
gg.eventhandler.snowflake.UserName=<db user name>
#TODO: Edit JDBC password
gg.eventhandler.snowflake.Password=<db password>

# Using Snowflake internal stage.
# Configuration to load GoldenGate trail operation records
# into Snowflake Data warehouse by chaining
# File writer handler -> Snowflake Event handler.
#TODO: Set the classpath to include Snowflake JDBC driver.
gg.classpath=./snowflake-jdbc-3.13.7.jar
#TODO: Provide sufficient memory (at least 8GB).
jvm.bootoptions=-Xmx8g -Xms8g

# Using Snowflake S3 External Stage.
# Configuration to load GoldenGate trail operation records
# into Snowflake Data warehouse by chaining
# File writer handler -> S3 Event handler -> Snowflake Event handler.
#The S3 Event Handler
#TODO: Edit the AWS region
#gg.eventhandler.s3.region=<aws region>
#TODO: Edit the AWS S3 bucket
#gg.eventhandler.s3.bucketMappingTemplate=<s3 bucket>
#TODO: Set the classpath to include AWS Java SDK and Snowflake JDBC driver.
#gg.classpath=aws-java-sdk-1.11.356/lib/*:aws-java-sdk-1.11.356/third-party/lib/*:./snowflake-jdbc-3.13.7.jar
#TODO: Set the AWS access key and secret key. Provide sufficient memory (at least 8GB).
#jvm.bootoptions=-Daws.accessKeyId=<AWS access key> -Daws.secretKey=<AWS secret key> -DSF_STAGE=s3 -Xmx8g -Xms8g

# Using Snowflake ADLS Gen2 External Stage.
# Configuration to load GoldenGate trail operation records
# into Snowflake Data warehouse by chaining
# File writer handler -> HDFS Event handler -> Snowflake Event handler.
#The HDFS Event Handler
# No properties are required for the HDFS Event handler.
# If there is a need to edit properties, check example in the following line.
#gg.eventhandler.hdfs.finalizeAction=delete
#TODO: Edit snowflake storage integration to access Azure Blob Storage.
#gg.eventhandler.snowflake.storageIntegration=<azure_int>
#TODO: Edit the classpath to include HDFS Event Handler dependencies and Snowflake JDBC driver.
#gg.classpath=./snowflake-jdbc-3.13.7.jar:hadoop-3.2.1/share/hadoop/common/*:hadoop-3.2.1/share/hadoop/common/lib/*:hadoop-3.2.1/share/hadoop/hdfs/*:hadoop-3.2.1/share/hadoop/hdfs/lib/*:hadoop-3.2.1/etc/hadoop/:hadoop-3.2.1/share/hadoop/tools/lib/*
#TODO: Set property SF_STAGE=hdfs. Provide sufficient memory (at least 8GB).
#jvm.bootoptions=-DSF_STAGE=hdfs -Xmx8g -Xms8g

# Using Snowflake GCS External Stage.
# Configuration to load GoldenGate trail operation records
# into Snowflake Data warehouse by chaining
# File writer handler -> GCS Event handler -> Snowflake Event handler.
## The GCS Event handler
#TODO: Edit the GCS bucket name
#gg.eventhandler.gcs.bucketMappingTemplate=<gcs bucket>
#TODO: Edit the GCS credentialsFile
#gg.eventhandler.gcs.credentialsFile=<oggbd-project-credentials.json>
#TODO: Edit snowflake storage integration to access GCS.
#gg.eventhandler.snowflake.storageIntegration=<gcs_int>
#TODO: Edit the classpath to include GCS Java SDK and Snowflake JDBC driver.
#gg.classpath=gcs-deps/*:./snowflake-jdbc-3.13.7.jar
#TODO: Set property SF_STAGE=gcs. Provide sufficient memory (at least 8GB).
#jvm.bootoptions=-DSF_STAGE=gcs -Xmx8g -Xms8g
Parent topic: Configuration
8.2.30.4 Troubleshooting and Diagnostics
- Connectivity issues to Snowflake:
- Validate JDBC connection URL, username, and password.
- Check HTTP(S) proxy configuration if running Replicat process behind a proxy.
- DDL not applied on the target table: Oracle GoldenGate for Big Data does not support DDL replication.
- Target table existence: It is expected that the target
table exists before starting the Replicat process.
Replicat process will ABEND if the target table is missing.
- SQL Errors: In case there are any errors while executing any SQL, the SQL statements along with the bind parameter values are logged into the Oracle GoldenGate for Big Data handler log file.
- Co-existence of the components: When using an external
stage location (S3, ADLS Gen 2 or GCS), the location/region of the machine
where the Replicat process is running and the object store’s region have an
impact on the overall throughput of the apply process.
For the best possible throughput, the components should ideally be located in the same region, or as close together as possible.
- Replicat ABEND due to partial LOB records in the trail
file: Oracle GoldenGate for Big Data does not support replication of
partial LOB data. The trail file needs to be regenerated by Oracle
Integrated capture using
TRANLOGOPTIONS FETCHPARTIALLOB
option in the Extract parameter file. - When replicating to more than ten target tables, the parameter
maxConnnections
can be increased to a higher value which can improve throughput.Note:
When tuning this, increasing the parameter value would create more JDBC connections on the Snowflake data warehouse.You can consult your Snowflake Database administrators so that the data warehouse health is not compromised. - The Snowflake JDBC driver uses the standard Java log utility. The log levels
of the JDBC driver can be set using the JDBC connection parameter tracing.
The tracing level can be set in the Snowflake Event handler property
gg.eventhandler.snowflake.connectionURL
. The following is an example of editing this property:
jdbc:snowflake://<account_name>.snowflakecomputing.com/?warehouse=<warehouse-name>&db=<database-name>&tracing=SEVERE
For more information, see https://docs.snowflake.com/en/user-guide/jdbc-parameters.html#tracing.
- Exception: net.snowflake.client.jdbc.SnowflakeReauthenticationRequest:
Authentication token has expired. The user must authenticate
again.
This error occurs when there are extended periods of inactivity. To resolve this, you can set the JDBC parameter
CLIENT_SESSION_KEEP_ALIVE
to true to keep the session active during periods of inactivity so that the database user is not forced to log in again. For example, jdbc:snowflake://<account_name>.snowflakecomputing.com/?warehouse=<warehouse-name>&db=<database-name>&CLIENT_SESSION_KEEP_ALIVE=true
- Replicat stops with an out of memory error: Decrease the
gg.aggregate.operations.flush.interval
value if you are not using its default value (30000). - Performance issue while replicating Large Object (LOB) column
values: LOB processing can lead to slowness. For every LOB column
that exceeds the inline LOB threshold, an
UPDATE SQL
is executed. Look for the following message to tune throughput during LOB processing:The current operation at position
Check the trail files that contain LOB data and get a maximum size of[<seqno>/<rba>]
for table [<tablename>] contains a LOB column [<column name>] of length [<N>] bytes that exceeds the threshold of maximum inline LOB size [<N>]. Operation Aggregator will flush merged operations, which can degrade performance. The maximum inline LOB size in bytes can be tuned using the configurationgg.maxInlineLobSize
.BLOB/CLOB
columns. Alternatively, check the source table definitions to determine the maximum size of LOB data. The default inline LOB size is set to 16000 bytes, which can be increased to a higher value so that all LOB column updates are processed in batches. The configuration property isgg.maxInlineLobSize
`. For example: Ingg.maxInlineLobSize=24000000 -->
, all LOBs up to 24 MB are processed inline. You need to reposition the Replicat, purge the state files, data directory, and start over, so that bigger staging files are generated. - Error message: No database is set in the current session. Please set a
database in the JDBC connection url
[gg.eventhandler.snowflake.connectionURL] using the option
'db=<database name>'.
Resolution: Set the database name in the configuration property
gg.eventhandler.snowflake.connectionURL
. - Warning message: No role is set in the current session. Please set a
custom role name in the JDBC connection url
[gg.eventhandler.snowflake.connectionURL] using the option
'role=<role name>' if the warehouse [{}] requires a custom role to
access it.
Resolution: In some cases, a custom role is required to access the Snowflake warehouse. Set the role in the configuration property
gg.eventhandler.snowflake.connectionURL
. - Error message: No active warehouse selected in the current session.
Please set the warehouse name (and custom role name if required to
access the respective warehouse) in the JDBC connection url
[gg.eventhandler.snowflake.connectionURL] using the options
'warehouse=<warehouse name>' and 'role=<role
name>'.
Resolution: Set the warehouse and role in the configuration property
gg.eventhandler.snowflake.connectionURL
.
Parent topic: Snowflake
8.2.31 Additional Details
- Command Event Handler
This chapter describes how to use the Command Event Handler. The Command Event Handler provides the interface to synchronously execute an external program or script. - HDFS Event Handler
The HDFS Event Handler is used to load files generated by the File Writer Handler into HDFS. - Metacolumn Keywords
- Metadata Providers
The Metadata Providers can replicate from a source to a target using a Replicat parameter file. - Pluggable Formatters
The pluggable formatters are used to convert operations from the Oracle GoldenGate trail file into formatted messages that you can send to Big Data targets using one of the Oracle GoldenGate for Big Data Handlers. - Stage and Merge Data Warehouse Replication
Data warehouse targets typically support Massively Parallel Processing (MPP). The cost of a single Data Manipulation Language (DML) operation is comparable to the cost of execution of batch DMLs. - Template Keywords
- Velocity Dependencies
Starting with Oracle GoldenGate for Big Data release 21.1.0.0.0, the Velocity jar files have been removed from the packaging.
Parent topic: Target
8.2.31.1 Command Event Handler
This chapter describes how to use the Command Event Handler. The Command Event Handler provides the interface to synchronously execute an external program or script.
- Overview - Command Event Handler
The purpose of the Command Event Handler is to load data files generated by the File Writer Handler into respective targets by executing an external program or a script provided. - Configuring the Command Event Handler
You can configure the Command Event Handler operation using the File Writer Handler properties file. - Using Command Argument Template Strings
Command Argument Templated Strings consist of keywords that are dynamically resolved at runtime. Command Argument Templated Strings are passed as arguments to the script in the same order as mentioned in the commandArgumentTemplate
property.
Parent topic: Additional Details
8.2.31.1.1 Overview - Command Event Handler
The purpose of the Command Event Handler is to load data files generated by the File Writer Handler into respective targets by executing an external program or a script provided.
Parent topic: Command Event Handler
8.2.31.1.2 Configuring the Command Event Handler
You can configure the Command Event Handler operation using the File Writer Handler properties file.
The Command Event Handler works only in conjunction with the File Writer Handler.
To enable the selection of the Command Event Handler, you must first configure the
handler type by specifying gg.eventhandler.name.type=command
and
the other Command Event properties as follows:
Table 8-37 Command Event Handler Configuration Properties
Properties | Required/ Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.eventhandler.name.type | Required | command | None | Selects the Command Event Handler for use with Replicat. |
gg.eventhandler.name.command | Required | Valid path of the external program or script to be executed. | None | The script or external program that is executed by the Command Event Handler. |
gg.eventhandler.name.cmdWaitMilli | Optional | Integer value representing milliseconds | Indefinitely | The Command Event Handler waits for this period of time for the called commands in the script or external program to complete. If the Command Event Handler fails to complete the command within the configured timeout period, the process abends. |
gg.eventhandler.name.multithreaded | Optional | true | false | true | If set to true, the configured commands in the script or external program are executed in a multithreaded way; otherwise, they are executed in a single thread. |
gg.eventhandler.name.commandArgumentTemplate | Optional | See Using Command Argument Template Strings. | None | The Command Event Handler uses the command argument template strings during script or external program execution as input arguments. For a list of valid argument strings, see Using Command Argument Template Strings. |
gg.eventhandler.command.type=command
gg.eventhandler.command.command=<path of the script to be executed>
#gg.eventhandler.command.cmdWaitMilli=10000
gg.eventhandler.command.multithreaded=true
gg.eventhandler.command.commandArgumentTemplate=${tablename},${datafilename},${countoperations}
Parent topic: Command Event Handler
8.2.31.1.3 Using Command Argument Template Strings
Command Argument Templated Strings consist of keywords that are dynamically
resolved at runtime. Command Argument Templated Strings are passed as arguments to the
script in the same order as mentioned in the commandArgumentTemplate
property.
The valid tokens that can be used as command argument template strings are as follows:
UUID
, TableName
, DataFileName
,
DataFileDir
, DataFileDirandName
,
Offset
, Format
, CountOperations
,
CountInserts
, CountUpdates
,
CountDeletes
, CountTruncates
. Invalid Templated
string results in an Abend.
Supported Template Strings
- ${uuid}
- The File Writer Handler assigns a uuid to internally track the state of generated files. The usefulness of the uuid may be limited to troubleshooting scenarios.
- ${dataFileDirandName}
- The source file name with complete path and filename along with the file extension.
- ${format}
- The format of the file. For example:
delimitedtext | json | json_row | xml | avro_row | avro_op | avro_row_ocf | avro_op_ocf
- ${countOperations}
- The total count of operations in the data file. It must be either renamed or used by the event handlers or it becomes zero (0) because nothing is written. For example, 1024.
- ${countInserts}
- The total count of insert operations in the data file. It must be either renamed or used by the event handlers or it becomes zero (0) because nothing is written. For example, 125.
- ${countUpdates}
- The total count of update operations in the data file. It must be either renamed or used by the event handlers or it becomes zero (0) because nothing is written. For example, 265.
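For illustration, the following is a minimal, hypothetical shell script that the Command Event Handler could invoke when commandArgumentTemplate is set to ${tablename},${datafilename},${countoperations}, as in the earlier sample configuration; the script name and the loader invocation are assumptions, not part of the product:
#!/bin/bash
# load_file.sh - hypothetical loader invoked by the Command Event Handler.
# Arguments arrive in the same order as listed in commandArgumentTemplate.
TABLE_NAME="$1"   # resolved from ${tablename}
DATA_FILE="$2"    # resolved from ${datafilename}
OP_COUNT="$3"     # resolved from ${countoperations}
echo "Loading ${OP_COUNT} operations for table ${TABLE_NAME} from ${DATA_FILE}"
# Replace the echo above with your own load command, for example:
# my_loader --table "${TABLE_NAME}" --file "${DATA_FILE}"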
Note:
On successful execution of the script or command, the Command Event Handler logs a message with the following statement: The command completed successfully, along with the command statement that was executed. If there is an error when the command is executed, the Command Event Handler abends the Replicat process and logs the error message.
Parent topic: Command Event Handler
8.2.31.2 HDFS Event Handler
The HDFS Event Handler is used to load files generated by the File Writer Handler into HDFS.
This topic describes how to use the HDFS Event Handler. See Flat Files.
Parent topic: Additional Details
8.2.31.2.1 Detailing the Functionality
8.2.31.2.1.1 Configuring the Handler
The HDFS Event Handler can upload data files to HDFS. These additional configuration steps are required:
The HDFS Event Handler dependencies and considerations are the same as the HDFS Handler, see HDFS Additional Considerations.
Ensure that gg.classpath
includes the HDFS client libraries.
Ensure that the directory containing the HDFS core-site.xml
file is in gg.classpath
. This is so the core-site.xml
file can be read at runtime and the connectivity information to HDFS can be resolved. For example:
gg.classpath=/{HDFSinstallDirectory}/etc/hadoop
If Kerberos authentication is enabled on the HDFS cluster, you have to configure the Kerberos principal and the location of the keytab
file so that the password can be resolved at runtime:
gg.eventHandler.name.kerberosPrincipal=principal
gg.eventHandler.name.kerberosKeytabFile=pathToTheKeytabFile
Parent topic: Detailing the Functionality
8.2.31.2.1.2 Configuring the HDFS Event Handler
You configure the HDFS Handler operation using the properties file. These properties are located in the Java Adapter properties file (not in the Replicat properties file).
To enable the selection of the HDFS Event Handler, you must first configure the
handler type by specifying gg.eventhandler.name.type=hdfs
and the other HDFS Event properties as follows:
Table 8-38 HDFS Event Handler Configuration Properties
Properties | Required/ Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.eventhandler.name.type | Required | hdfs | None | Selects the HDFS Event Handler for use. |
gg.eventhandler.name.pathMappingTemplate | Required | A string with resolvable keywords and constants used to dynamically generate the path in HDFS to write data files. | None | Use keywords interlaced with constants to dynamically generate unique path names at runtime. See Template Keywords. |
gg.eventhandler.name.fileNameMappingTemplate | Optional | A string with resolvable keywords and constants used to dynamically generate the HDFS file name at runtime. | None | Use keywords interlaced with constants to dynamically generate unique file names at runtime. If not set, the upstream file name is used. See Template Keywords. |
gg.eventhandler.name.finalizeAction | Optional | none | delete | none | Indicates what the File Writer Handler should do at the finalize action. |
gg.eventhandler.name.kerberosPrincipal | Optional | The Kerberos principal name. | None | Set to the Kerberos principal when HDFS Kerberos authentication is enabled. |
gg.eventhandler.name.kerberosKeytabFile | Optional | The path to the Kerberos keytab file. | None | Set to the path to the Kerberos keytab file when HDFS Kerberos authentication is enabled. |
gg.eventhandler.name.eventHandler | Optional | A unique string identifier cross referencing a child event handler. | No event handler configured. | A unique string identifier cross referencing an event handler. The event handler is invoked on the file roll event. Event handlers can perform file roll event actions, such as loading files to S3, converting to Parquet or ORC format, or loading files to HDFS. |
Parent topic: Detailing the Functionality
8.2.31.3 Metacolumn Keywords
The metacolumns functionality allows you to select the metadata fields that you want to see in the generated output messages. The format of the metacolumn syntax is:
-
${keyword[fieldName].argument}
-
The keyword is fixed based on the metacolumn syntax. Optionally, you can provide a field name between the square brackets. If a field name is not provided, then the default field name is used.
Keywords are separated by a comma. Following is an example configuration of metacolumns:
gg.handler.filewriter.format.metaColumnsTemplate=${objectname[table]},${optype[op_type]},${timestamp[op_ts]},${currenttimestamp[current_ts]},${position[pos]}
An argument may be required for a few metacolumn keywords. For example, it is required where specific token values are resolved or specific environmental variable values are resolved.
-
${alltokens}
-
All of the tokens for an operation delivered as a map where the token keys are the keys in the map and the token values are the map values.
-
${token}
-
The value of a specific Oracle GoldenGate token. The token key should follow the token keyword using the period (
.
) operator. For example:${token.MYTOKEN}
-
${sys}
-
A system environmental variable. The variable name should follow sys using the period (.) operator. For example: ${sys.MYVAR}
${env}
-
An Oracle GoldenGate environment variable. The variable name should follow
env
using the period (.
) operator. For example:${env.someVariable}
-
${javaprop}
-
A Java JVM variable. The variable name should follow
javaprop
using the period (.
) operator. For example:${javaprop.MYVAR}
-
${optype}
-
The operation type. This is generally
I
for inserts,U
for updates,D
for deletes, andT
for truncates. -
${position}
-
The record position. This is the location of the record in the source trail file. It is a 20-character string. The first 10 characters are the trail file sequence number. The last 10 characters are the offset or RBA of the record in the trail file.
-
${timestamp}
-
Record timestamp.
-
${catalog}
-
Catalog name.
-
${schema}
-
Schema name.
-
${table}
-
Table name.
-
${objectname}
-
The fully qualified table name.
-
${csn}
-
Source Commit Sequence Number.
-
${xid}
-
Source transaction ID.
-
${currenttimestamp}
-
Current timestamp.
-
${currenttimestampiso8601}
-
Current timestamp in ISO 8601 format.
-
${opseqno}
-
Record sequence number within the transaction.
-
${timestampmicro}
-
Record timestamp in microseconds after epoch.
-
${currenttimestampmicro}
-
Current timestamp in microseconds after epoch.
-
${txind}
-
This is the transactional indicator from the source trail file. The values of a transaction are
B
for the first operation,M
for the middle operations,E
for the last operation, orW
for whole if there is only one operation. Filtering operations or the use of coordinated apply negate the usefulness of this field. -
${primarykeycolumns}
-
Use to inject a field with a list of the primary key column names.
-
${static}
-
Use to inject a field with a static value into the output. The value desired should be the argument. If the desired value is
abc
, then the syntax is${static.abc}
or${static[FieldName].abc}
. -
${seqno}
-
Used to inject a field containing the sequence number of the source trail file for the given operation.
-
${rba}
-
Used to inject a field containing the rba (offset) of the operation in the source trail file for the given operation.
Parent topic: Additional Details
8.2.31.4 Metadata Providers
The Metadata Providers can replicate from a source to a target using a Replicat parameter file.
This chapter describes how to use the Metadata Providers.
- About the Metadata Providers
- Avro Metadata Provider
The Avro Metadata Provider is used to retrieve the table metadata from Avro Schema files. For every table mapped in Replicat usingCOLMAP
, the metadata is retrieved from Avro Schema. Retrieved metadata is then used by Replicat for column mapping. - Java Database Connectivity Metadata Provider
- Hive Metadata Provider
The Hive Metadata Provider is used to retrieve the table metadata from a Hive metastore. The metadata is retrieved from Hive for every target table that is mapped in the Replicat properties file using theCOLMAP
parameter. The retrieved target metadata is used by Replicat for the column mapping functionality. - Google BigQuery Metadata Provider
Google metadata provider uses the Google Query Job to retrieve the metadata schema information from the Google BigQuery Table. The Table should already be created on the target for BigQuery to fetch the metadata.
Parent topic: Additional Details
8.2.31.4.1 About the Metadata Providers
Metadata Providers work only if handlers are configured to run with a Replicat process.
The Replicat process maps source tables to target tables and source columns to target columns using syntax in the Replicat configuration file. The source metadata definitions are included in the Oracle GoldenGate trail file (or by source definitions files in Oracle GoldenGate releases 12.2 and later). When the replication target is a database, the Replicat process obtains the target metadata definitions from the target database. However, this is a shortcoming when pushing data to Big Data applications or during Java delivery in general. Typically, Big Data applications provide no target metadata, so Replicat mapping is not possible. The metadata providers exist to address this deficiency. You can use a metadata provider to define target metadata, for example using Avro or Hive, which enables Replicat mapping of source table to target table and source column to target column.
The use of the metadata provider is optional and is enabled if the gg.mdp.type
property is specified in the Java Adapter Properties file. If the metadata included in the source Oracle GoldenGate trail file is acceptable for output, then do not use the metadata provider. A metadata provider should be used in the following cases:
-
You need to map source table names into target table names that do not match.
-
You need to map source column names into target column name that do not match.
-
You need to include certain columns from the source trail file and omit other columns.
A limitation of Replicat mapping is that the mapping defined in the Replicat configuration file is static. Oracle GoldenGate provides functionality for DDL propagation when using an Oracle database as the source. The proper handling of schema evolution can be problematic when the Metadata Provider and Replicat mapping are used. Consider your use cases for schema evolution and plan for how you want to update the Metadata Provider and the Replicat mapping syntax for required changes.
For every table mapped in Replicat using COLMAP
, the metadata is retrieved from a configured metadata provider, and the retrieved metadata is then used by Replicat for column mapping.
Several metadata providers are available (Avro, JDBC, Hive, and Google BigQuery, as described in this chapter), and you must choose one of them to use in your metadata provider implementation.
Scenarios - When to use a metadata provider
-
The following scenarios do not require a metadata provider to be configured:
A mapping in which the source schema named
GG
is mapped to the target schema named GGADP
.
A mapping in which the schema and table name, whereby the schema
GG.TCUSTMER
is mapped to the table name GGADP.TCUSTMER_NEW
.
MAP GG.*, TARGET GGADP.*;
(OR)
MAP GG.TCUSTMER, TARGET GG_ADP.TCUSTMER_NEW;
-
The following scenario requires a metadata provider to be configured:
A mapping in which the source column name does not match the target column name. For example, a source column of
CUST_CODE
mapped to a target column ofCUST_CODE_NEW
.MAP GG.TCUSTMER, TARGET GG_ADP.TCUSTMER_NEW, COLMAP(USEDEFAULTS, CUST_CODE_NEW=CUST_CODE, CITY2=CITY);
Parent topic: Metadata Providers
8.2.31.4.2 Avro Metadata Provider
The Avro Metadata Provider is used to retrieve the table metadata from Avro Schema
files. For every table mapped in Replicat using COLMAP
, the metadata is
retrieved from Avro Schema. Retrieved metadata is then used by Replicat for column
mapping.
- Detailed Functionality
- Runtime Prerequisites
- Classpath Configuration
- Avro Metadata Provider Configuration
- Review a Sample Configuration
- Metadata Change Events
- Limitations
- Troubleshooting
Parent topic: Metadata Providers
8.2.31.4.2.1 Detailed Functionality
The Avro Metadata Provider uses Avro schema definition files to retrieve metadata.
Avro schemas are defined using JSON. For each table mapped in the
process_name. prm
file, you must create a corresponding Avro
schema definition file.
Avro Metadata Provider Schema Definition Syntax
{"namespace": "[$catalogname.]$schemaname", "type": "record", "name": "$tablename", "fields": [ {"name": "$col1", "type": "$datatype"}, {"name": "$col2 ", "type": "$datatype ", "primary_key":true}, {"name": "$col3", "type": "$datatype ", "primary_key":true}, {"name": "$col4", "type": ["$datatype","null"]} ] } namespace - name of catalog/schema being mapped name - name of the table being mapped fields.name - array of column names fields.type - datatype of the column fields.primary_key - indicates the column is part of primary key. Representing nullable and not nullable columns: "type":"$datatype" - indicates the column is not nullable, where "$datatype" is the actual datatype. "type": ["$datatype","null"] - indicates the column is nullable, where "$datatype" is the actual datatype
The names of schema files that are accessed by the Avro Metadata Provider must be in the following format:
[$catalogname.]$schemaname.$tablename.mdp.avsc

$catalogname - name of the catalog if it exists
$schemaname - name of the schema
$tablename - name of the table
.mdp.avsc - constant, which should always be appended
Supported Avro Primitive Data Types
- boolean
- bytes
- double
- float
- int
- long
- string
See https://avro.apache.org/docs/1.7.5/spec.html#schema_primitive
.
Supported Avro Logical Data Types
- decimal
- timestamp
{"name":"DECIMALFIELD","type": {"type":"bytes","logicalType":"decimal","precision":15,"scale":5}}
{"name":"TIMESTAMPFIELD","type": {"type":"long","logicalType":"timestamp-micros"}}
Parent topic: Avro Metadata Provider
8.2.31.4.2.2 Runtime Prerequisites
Before you start the Replicat process, create Avro schema definitions for all tables mapped in Replicat's parameter file.
Parent topic: Avro Metadata Provider
8.2.31.4.2.3 Classpath Configuration
The Avro Metadata Provider requires no additional classpath setting.
Parent topic: Avro Metadata Provider
8.2.31.4.2.4 Avro Metadata Provider Configuration
Property | Required/Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.mdp.type | Required | avro | | Selects the Avro Metadata Provider. |
gg.mdp.schemaFilesPath | Required | Example: /home/ggadp/avromdp | | The path to the Avro schema files directory. |
gg.mdp.charset | Optional | Valid character set | | Specifies the character set of the column with character data type. Used to convert the source data from the trail file to the correct target character set. |
gg.mdp.nationalCharset | Optional | Valid character set | | Specifies the character set of the column with character data type. Used to convert the source data from the trail file to the correct target character set. Example: Used to indicate the character set of columns such as NCHAR and NVARCHAR. |
Parent topic: Avro Metadata Provider
8.2.31.4.2.5 Review a Sample Configuration
This is an example for configuring the Avro Metadata Provider. Consider a source that includes the following table:
TABLE GG.TCUSTMER { CUST_CODE VARCHAR(4) PRIMARY KEY, NAME VARCHAR(100), CITY VARCHAR(200), STATE VARCHAR(200) }
This example maps the column CUST_CODE (GG.TCUSTMER)
in the source to CUST_CODE2 (GG_AVRO.TCUSTMER_AVRO)
on the target and the column CITY
(GG.TCUSTMER)
in source to CITY2
(GG_AVRO.TCUSTMER_AVRO)
on the target. Therefore, the mapping in the process_name. prm
file is:
MAP GG.TCUSTMER, TARGET GG_AVRO.TCUSTMER_AVRO, COLMAP(USEDEFAULTS, CUST_CODE2=CUST_CODE, CITY2=CITY);
In this example the mapping definition is as follows:
-
Source schema
GG
is mapped to target schemaGG_AVRO
. -
Source column
CUST_CODE
is mapped to target columnCUST_CODE2
. -
Source column
CITY
is mapped to target columnCITY2
. -
USEDEFAULTS
specifies that the rest of the column names are the same on both the source and the target (NAME
andSTATE
columns).
This example uses the following Avro schema definition file:
File path: /home/ggadp/avromdp/GG_AVRO.TCUSTMER_AVRO.mdp.avsc
{"namespace": "GG_AVRO", "type": "record", "name": "TCUSTMER_AVRO", "fields": [ {"name": "NAME", "type": "string"}, {"name": "CUST_CODE2", "type": "string", "primary_key":true}, {"name": "CITY2", "type": "string"}, {"name": "STATE", "type": ["string","null"]} ] }
The configuration in the Java Adapter properties file includes the following:
gg.mdp.type = avro gg.mdp.schemaFilesPath = /home/ggadp/avromdp
The following sample output uses a delimited text formatter with a semi-colon as the delimiter:
I;GG_AVRO.TCUSTMER_AVRO;2013-06-02 22:14:36.000000;NAME;BG SOFTWARE CO;CUST_CODE2;WILL;CITY2;SEATTLE;STATE;WA
Oracle GoldenGate for Big Data includes a sample Replicat configuration file, a sample Java Adapter properties file, and sample Avro schemas at the following location:
GoldenGate_install_directory
/AdapterExamples/big-data/metadata_provider/avro
Parent topic: Avro Metadata Provider
8.2.31.4.2.6 Metadata Change Events
If the DDL changes in the source database tables, you may need to modify the Avro schema definitions and the mappings in the Replicat configuration file. You may also want to stop or suspend the Replicat process in the case of a metadata change event. You can stop the Replicat process by adding the following line to the Replicat configuration file (process_name. prm
):
DDL INCLUDE ALL, EVENTACTIONS (ABORT)
Alternatively, you can suspend the Replicat process by adding the following line to the Replication configuration file:
DDL INCLUDE ALL, EVENTACTIONS (SUSPEND)
Parent topic: Avro Metadata Provider
8.2.31.4.2.7 Limitations
Avro bytes data type cannot be used as primary key.
The source-to-target mapping that is defined in the Replicat configuration file is static. Oracle GoldenGate 12.2 and later support DDL propagation and source schema evolution for Oracle Databases as replication source. If you use DDL propagation and source schema evolution, you lose the ability to seamlessly handle changes to the source metadata.
Parent topic: Avro Metadata Provider
8.2.31.4.2.8 Troubleshooting
This topic contains the information about how to troubleshoot the following issues:
- Invalid Schema Files Location
- Invalid Schema File Name
- Invalid Namespace in Schema File
- Invalid Table Name in Schema File
Parent topic: Avro Metadata Provider
8.2.31.4.2.8.1 Invalid Schema Files Location
The Avro schema files directory specified in the gg.mdp.schemaFilesPath
configuration property must be a valid directory. If the path is not valid, you encounter the following exception:
oracle.goldengate.util.ConfigException: Error initializing Avro metadata provider Specified schema location does not exist. {/path/to/schema/files/dir}
Parent topic: Troubleshooting
8.2.31.4.2.8.2 Invalid Schema File Name
For every table that is mapped in the process_name.prm
file, you must create a corresponding Avro schema file in the directory that is specified in gg.mdp.schemaFilesPath
.
For example, consider the following scenario:
Mapping:
MAP GG.TCUSTMER, TARGET GG_AVRO.TCUSTMER_AVRO, COLMAP(USEDEFAULTS, cust_code2=cust_code, CITY2 = CITY);
Property:
gg.mdp.schemaFilesPath=/home/usr/avro/
In this scenario, you must create a file called GG_AVRO.TCUSTMER_AVRO.mdp.avsc
in the /home/usr/avro/
directory.
If you do not create the /home/usr/avro/GG_AVRO.TCUSTMER_AVRO.mdp.avsc
file, you encounter the following exception:
java.io.FileNotFoundException: /home/usr/avro/GG_AVRO.TCUSTMER_AVRO.mdp.avsc
Parent topic: Troubleshooting
8.2.31.4.2.8.3 Invalid Namespace in Schema File
The target schema name specified in the Replicat mapping must be the same as the namespace in the Avro schema definition file.
For example, consider the following scenario:
Mapping:
MAP GG.TCUSTMER, TARGET GG_AVRO.TCUSTMER_AVRO, COLMAP(USEDEFAULTS, cust_code2 = cust_code, CITY2 = CITY);

Avro Schema Definition:
{
"namespace": "GG_AVRO",
..
}
In this scenario, Replicat abends with following exception:
Unable to retrieve table matadata. Table : GG_AVRO.TCUSTMER_AVRO Mapped [catalogname.]schemaname (GG_AVRO) does not match with the schema namespace {schema namespace}
Parent topic: Troubleshooting
8.2.31.4.2.8.4 Invalid Table Name in Schema File
The target table name that is specified in the Replicat mapping must be the same as the name in the Avro schema definition file.
For example, consider the following scenario:
Mapping:
MAP GG.TCUSTMER, TARGET GG_AVRO.TCUSTMER_AVRO, COLMAP(USEDEFAULTS, cust_code2 = cust_code, CITY2 = CITY);
Avro Schema Definition:
{ "namespace": "GG_AVRO", "name": "TCUSTMER_AVRO", .. }
In this scenario, if the target table name specified in Replicat mapping does not match with the Avro schema name, then REPLICAT abends with following exception:
Unable to retrieve table matadata. Table : GG_AVRO.TCUSTMER_AVRO Mapped table name (TCUSTMER_AVRO) does not match with the schema table name {table name}
Parent topic: Troubleshooting
8.2.31.4.3 Java Database Connectivity Metadata Provider
The Java Database Connectivity (JDBC) Metadata Provider is used to retrieve the table metadata from any target database that supports a JDBC connection and has a database schema. The JDBC Metadata Provider is the preferred metadata provider for any target database that is an RDBMS, although various other non-RDBMS targets also provide a JDBC driver.
Topics:
- JDBC Detailed Functionality
- Java Classpath
- JDBC Metadata Provider Configuration
- Review a Sample Configuration
Parent topic: Metadata Providers
8.2.31.4.3.1 JDBC Detailed Functionality
The JDBC Metadata Provider uses the JDBC driver that is provided with your target database. The JDBC driver retrieves the metadata for every target table that is mapped in the Replicat properties file. Replicat processes use the retrieved target metadata to map columns.
You can enable this feature for JDBC Handler by configuring the REPERROR
property in your Replicat parameter file. In addition, you need to define the error codes specific to your RDBMS JDBC target in the JDBC Handler properties file as follows:
Table 8-39 JDBC REPERROR
Codes
Property | Value | Required |
---|---|---|
gg.error.duplicateErrorCodes |
Comma-separated integer values of error codes that indicate duplicate errors |
No |
gg.error.notFoundErrorCodes |
Comma-separated integer values of error codes that indicate Not Found errors |
No |
gg.error.deadlockErrorCodes |
Comma-separated integer values of error codes that indicate deadlock errors |
No |
For example:
#ErrorCode
gg.error.duplicateErrorCodes=1062,1088,1092,1291,1330,1331,1332,1333
gg.error.notFoundErrorCodes=0
gg.error.deadlockErrorCodes=1213
To understand how the various JDBC types are mapped to database-specific SQL types, see https://docs.oracle.com/javase/6/docs/technotes/guides/jdbc/getstart/mapping.html#table1.
Parent topic: Java Database Connectivity Metadata Provider
8.2.31.4.3.2 Java Classpath
The JDBC Java Driver location must be included in the class path of the handler using the gg.classpath
property.
For example, the configuration for a MySQL database might be:
gg.classpath= /path/to/jdbc/driver/jar/mysql-connector-java-5.1.39-bin.jar
Parent topic: Java Database Connectivity Metadata Provider
8.2.31.4.3.3 JDBC Metadata Provider Configuration
The following are the configurable values for the JDBC Metadata Provider. These properties are located in the Java Adapter properties file (not in the Replicat properties file).
Table 8-40 JDBC Metadata Provider Properties
Properties | Required/ Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.mdp.type | Required | jdbc | None | Entering jdbc selects the JDBC Metadata Provider. |
gg.mdp.ConnectionUrl | Required | A valid JDBC connection URL. | None | The target database JDBC URL. |
gg.mdp.DriverClassName | Required | Java class name of the JDBC driver | None | The fully qualified Java class name of the JDBC driver. |
gg.mdp.UserName | Optional | A legal username string. | None | The user name for the JDBC connection. Alternatively, you can provide the user name using the ConnectionUrl property. |
gg.mdp.Password | Optional | A legal password string | None | The password for the JDBC connection. Alternatively, you can provide the password using the ConnectionUrl property. |
Parent topic: Java Database Connectivity Metadata Provider
8.2.31.4.3.4 Review a Sample Configuration
Oracle Thin Driver Configuration
gg.mdp.type=jdbc gg.mdp.ConnectionUrl=jdbc:oracle:thin:@myhost:1521:orcl gg.mdp.DriverClassName=oracle.jdbc.driver.OracleDriver gg.mdp.UserName=username gg.mdp.Password=password
Netezza Driver Configuration
gg.mdp.type=jdbc gg.mdp.ConnectionUrl=jdbc:netezza://hostname:port/databaseName gg.mdp.DriverClassName=org.netezza.Driver gg.mdp.UserName=username gg.mdp.Password=password
Oracle OCI Driver configuration
gg.mdp.type=jdbc gg.mdp.ConnectionUrl=jdbc:oracle:oci:@myhost:1521:orcl gg.mdp.DriverClassName=oracle.jdbc.driver.OracleDriver gg.mdp.UserName=username gg.mdp.Password=password
Oracle Teradata Driver configuration
gg.mdp.type=jdbc gg.mdp.ConnectionUrl=jdbc:teradata://10.111.11.111/USER=username,PASSWORD=password gg.mdp.DriverClassName=com.teradata.jdbc.TeraDriver gg.mdp.UserName=username gg.mdp.Password=password
MySQL Driver Configuration
gg.mdp.type=jdbc gg.mdp.ConnectionUrl=jdbc:mysql://localhost/databaseName?user=username&password=password gg.mdp.DriverClassName=com.mysql.jdbc.Driver gg.mdp.UserName=username gg.mdp.Password=password
Redshift Driver Configuration
gg.mdp.type=jdbc gg.mdp.ConnectionUrl=jdbc:redshift://hostname:port/databaseName gg.mdp.DriverClassName=com.amazon.redshift.jdbc42.Driver gg.mdp.UserName=username gg.mdp.Password=password
Parent topic: Java Database Connectivity Metadata Provider
8.2.31.4.4 Hive Metadata Provider
The Hive Metadata Provider is used to retrieve the table metadata from a Hive
metastore. The metadata is retrieved from Hive for every target table that is mapped in the
Replicat properties file using the COLMAP
parameter. The retrieved target
metadata is used by Replicat for the column mapping functionality.
- Detailed Functionality
- Configuring Hive with a Remote Metastore Database
- Classpath Configuration
- Hive Metadata Provider Configuration Properties
- Review a Sample Configuration
- Security
- Metadata Change Event
- Limitations
- Additional Considerations
- Troubleshooting
Parent topic: Metadata Providers
8.2.31.4.4.1 Detailed Functionality
The Hive Metadata Provider uses both Hive JDBC and HCatalog interfaces to retrieve metadata from the Hive metastore. For each table mapped in the process_name.prm
file, a corresponding table is created in Hive.
The default Hive configuration starts an embedded, local metastore Derby database. Because Apache Derby is designed to be an embedded database, it allows only a single connection. This limitation of the Derby database means that it cannot function when working with the Hive Metadata Provider. To work around this limitation, you must configure Hive with a remote metastore database. For more information about how to configure Hive with a remote metastore database, see https://cwiki.apache.org/confluence/display/Hive/AdminManual+Metastore+Administration.
Hive does not support Primary Key semantics, so the metadata retrieved from Hive metastore does not include a primary key definition. When you use the Hive Metadata Provider, use the Replicat KEYCOLS
parameter to define primary keys.
KEYCOLS
The KEYCOLS
parameter must be used to define primary keys in the target schema. The Oracle GoldenGate HBase Handler requires primary keys. Therefore, you must set primary keys in the target schema when you use Replicat mapping with HBase as the target.
The output of the Avro formatters includes an Array field to hold the primary column names. If you use Replicat mapping with the Avro formatters, consider using KEYCOLS
to identify the primary key columns.
For example configurations of KEYCOLS
, see Review a Sample Configuration.
Supported Hive Data types
-
BIGINT
-
BINARY
-
BOOLEAN
-
CHAR
-
DATE
-
DECIMAL
-
DOUBLE
-
FLOAT
-
INT
-
SMALLINT
-
STRING
-
TIMESTAMP
-
TINYINT
-
VARCHAR
See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types
.
Parent topic: Hive Metadata Provider
8.2.31.4.4.2 Configuring Hive with a Remote Metastore Database
A list of supported databases that you can use to configure a remote Hive metastore can be found at https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin#AdminManualMetastoreAdmin-SupportedBackendDatabasesforMetastore.
The following example shows how a MySQL database is configured as the Hive metastore using properties in the ${HIVE_HOME}/conf/hive-site.xml
Hive configuration file.
Note:
The ConnectionURL
and driver class used in this example are specific to MySQL database. If you use a database other than MySQL, then change the values to fit your configuration.
<property> <name>javax.jdo.option.ConnectionURL</name> <value>jdbc:mysql://MYSQL_DB_IP:MYSQL_DB_PORT/DB_NAME?createDatabaseIfNotExist=false</value> </property> <property> <name>javax.jdo.option.ConnectionDriverName</name> <value>com.mysql.jdbc.Driver</value> </property> <property> <name>javax.jdo.option.ConnectionUserName</name> <value>MYSQL_CONNECTION_USERNAME</value> </property> <property> <name>javax.jdo.option.ConnectionPassword</name> <value>MYSQL_CONNECTION_PASSWORD</value> </property>
To see a list of parameters to configure in the hive-site.xml
file for a remote metastore, see https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin#AdminManualMetastoreAdmin-RemoteMetastoreDatabase.
Note:
Follow these steps to add the MySQL JDBC connector JAR in the Hive classpath:
- Place the MySQL JDBC connector JAR in the HIVE_HOME/lib/ directory. DB_NAME should be replaced by a valid database name created in MySQL.
- Start the Hive Server:
HIVE_HOME/bin/hiveserver2
- Start the Hive Remote Metastore Server:
HIVE_HOME/bin/hive --service metastore
Parent topic: Hive Metadata Provider
8.2.31.4.4.3 Classpath Configuration
For the Hive Metadata Provider to connect to Hive, you must configure the hive-site.xml file and the Hive and HDFS client jars in the gg.classpath variable. The client JARs must match the version of Hive to which the Hive Metadata Provider is connecting.
For example, if the hive-site.xml file is created in the /home/user/oggadp/dirprm directory, then the gg.classpath entry is gg.classpath=/home/user/oggadp/dirprm/
- Create a hive-site.xml file that has the following properties:
<configuration>
<!-- Mandatory Property -->
<property>
<name>hive.metastore.uris</name>
<value>thrift://HIVE_SERVER_HOST_IP:9083</value>
</property>
<!-- Optional Property. Default value is 5 -->
<property>
<name>hive.metastore.connect.retries</name>
<value>3</value>
</property>
<!-- Optional Property. Default value is 1 -->
<property>
<name>hive.metastore.client.connect.retry.delay</name>
<value>10</value>
</property>
<!-- Optional Property. Default value is 600 seconds -->
<property>
<name>hive.metastore.client.socket.timeout</name>
<value>50</value>
</property>
</configuration>
- By default, the following directories contain the Hive and HDFS client jars:
HIVE_HOME/hcatalog/share/hcatalog/*
HIVE_HOME/lib/*
HIVE_HOME/hcatalog/share/webhcat/java-client/*
HADOOP_HOME/share/hadoop/common/*
HADOOP_HOME/share/hadoop/common/lib/*
HADOOP_HOME/share/hadoop/mapreduce/*
Configure the gg.classpath exactly as shown in step 1. The path to the hive-site.xml file must be the path with no wildcard appended. If you include the * wildcard in the path to the hive-site.xml file, it will not be located. The path to the dependency JARs must include the * wildcard character to include all of the JAR files in that directory in the associated classpath. Do not use *.jar.
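For illustration only, a complete gg.classpath entry that combines the hive-site.xml directory (no wildcard) with the client JAR directories (with wildcards) might resemble the following; the HIVE_HOME and HADOOP_HOME segments are placeholders for your installation paths:
gg.classpath=/home/user/oggadp/dirprm/:HIVE_HOME/hcatalog/share/hcatalog/*:HIVE_HOME/lib/*:HIVE_HOME/hcatalog/share/webhcat/java-client/*:HADOOP_HOME/share/hadoop/common/*:HADOOP_HOME/share/hadoop/common/lib/*:HADOOP_HOME/share/hadoop/mapreduce/*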
Parent topic: Hive Metadata Provider
8.2.31.4.4.4 Hive Metadata Provider Configuration Properties
Property | Required/Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
|
Required |
|
|
Selects the Hive Metadata Provider |
|
Required |
Format without Kerberos Authentication: j Format with Kerberos Authentication:
|
|
The JDBC connection URL of the Hive server |
|
Required |
|
|
The fully qualified Hive JDBC driver class name |
|
Optional |
Valid username |
|
The user name for connecting to the Hive database. The |
|
Optional |
Valid Password |
|
The password for connecting to the Hive database |
|
Optional |
Valid character set |
|
The character set of the column with the character data type. Used to convert the source data from the trail file to the correct target character set. |
|
Optional |
Valid character set |
|
The character set of the column with the national character data type. Used to convert the source data from the trail file to the correct target character set. For example, this property may indicate the character set of columns, such as |
|
Optional |
Kerberos |
none |
Allows you to designate Kerberos authentication to Hive. |
|
Optional (Required if |
Relative or absolute path to a Kerberos keytab file. |
|
The |
|
Optional (Required if |
A legal Kerberos principal name( |
|
The Kerberos principal name for Kerberos authentication. |
Parent topic: Hive Metadata Provider
8.2.31.4.4.5 Review a Sample Configuration
This is an example of configuring the Hive Metadata Provider. Consider a source with the following table:
TABLE GG.TCUSTMER {
  CUST_CODE VARCHAR(4) PRIMARY KEY,
  NAME VARCHAR(100),
  CITY VARCHAR(200),
  STATE VARCHAR(200)
}
The example maps the column CUST_CODE (GG.TCUSTMER) in the source to CUST_CODE2 (GG_HIVE.TCUSTMER_HIVE) on the target, and the column CITY (GG.TCUSTMER) in the source to CITY2 (GG_HIVE.TCUSTMER_HIVE) on the target.
The mapping configuration in the process_name.prm file includes the following configuration:
MAP GG.TCUSTMER, TARGET GG_HIVE.TCUSTMER_HIVE, COLMAP(USEDEFAULTS, CUST_CODE2=CUST_CODE, CITY2=CITY) KEYCOLS(CUST_CODE2);
In this example:
- The source schema GG is mapped to the target schema GG_HIVE.
- The source column CUST_CODE is mapped to the target column CUST_CODE2.
- The source column CITY is mapped to the target column CITY2.
- USEDEFAULTS specifies that the rest of the column names are the same on both source and target (the NAME and STATE columns).
- KEYCOLS is used to specify that CUST_CODE2 should be treated as the primary key.
Because primary keys cannot be specified in the Hive DDL, the KEYCOLS
parameter is used to specify the primary keys.
Note:
You can choose any schema name and are not restricted to the gg_hive schema name. The Hive schema can be pre-existing or newly created. If you change the schema name, update the connection URL (gg.mdp.connectionUrl) in the Java Adapter properties file and the mapping configuration in the Replicat .prm file.
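For instance (an illustrative sketch; sales_hive is a placeholder schema name), switching to a different Hive schema requires only a change such as the following in the Java Adapter properties file:
gg.mdp.connectionUrl=jdbc:hive2://HIVE_SERVER_IP:10000/sales_hive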
You can create the schema and tables for this example in Hive by using the following commands. To start the Hive CLI, use the following command:
HIVE_HOME/bin/hive
To create the GG_HIVE schema in Hive, use the following command:
hive> create schema gg_hive;
OK
Time taken: 0.02 seconds
To create the TCUSTMER_HIVE
table in the GG_HIVE
database, use the following command:
hive> CREATE EXTERNAL TABLE `TCUSTMER_HIVE`(
    > "CUST_CODE2" VARCHAR(4),
    > "NAME" VARCHAR(30),
    > "CITY2" VARCHAR(20),
    > "STATE" STRING);
OK
Time taken: 0.056 seconds
Configure the .properties
file in a way that resembles the following:
gg.mdp.type=hive
gg.mdp.connectionUrl=jdbc:hive2://HIVE_SERVER_IP:10000/gg_hive
gg.mdp.driverClassName=org.apache.hive.jdbc.HiveDriver
The following sample output uses the delimited text formatter, with a comma as the delimiter:
I;GG_HIVE.TCUSTMER_HIVE;2015-10-07T04:50:47.519000;cust_code2;WILL;name;BG SOFTWARE CO;city2;SEATTLE;state;WA
A sample Replicat configuration file, Java Adapter properties file, and Hive create table SQL script are included with the installation at the following location:
GoldenGate_install_directory
/AdapterExamples/big-data/metadata_provider/hive
Parent topic: Hive Metadata Provider
8.2.31.4.4.6 Security
You can secure the Hive server using Kerberos authentication. For information about how to secure the Hive server, see the Hive documentation for the specific Hive release. The Hive Metadata Provider can connect to a Kerberos secured Hive server.
Make sure that the paths to the HDFS core-site.xml
file and the hive-site.xml
file are in the handler's classpath.
Enable the following properties in the core-site.xml
file:
<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value>
</property>
<property>
<name>hadoop.security.authorization</name>
<value>true</value>
</property>
Enable the following properties in the hive-site.xml
file:
<property>
<name>hive.metastore.sasl.enabled</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.kerberos.keytab.file</name>
<value>/path/to/keytab</value> <!-- Change this value -->
</property>
<property>
<name>hive.metastore.kerberos.principal</name>
<value>Kerberos Principal</value> <!-- Change this value -->
</property>
<property>
<name>hive.server2.authentication</name>
<value>KERBEROS</value>
</property>
<property>
<name>hive.server2.authentication.kerberos.principal</name>
<value>Kerberos Principal</value> <!-- Change this value -->
</property>
<property>
<name>hive.server2.authentication.kerberos.keytab</name>
<value>/path/to/keytab</value> <!-- Change this value -->
</property>
Parent topic: Hive Metadata Provider
8.2.31.4.4.7 Metadata Change Event
Tables in the Hive metastore should be updated, altered, or created manually if the source database tables change. In the case of a metadata change event, you may wish to terminate or suspend the Replicat process. You can terminate the Replicat process by adding the following to the Replicat configuration file (process_name.prm):
DDL INCLUDE ALL, EVENTACTIONS (ABORT)
You can suspend the Replicat process by adding the following to the Replicat configuration file:
DDL INCLUDE ALL, EVENTACTIONS (SUSPEND)
Parent topic: Hive Metadata Provider
8.2.31.4.4.8 Limitations
Columns with binary data type cannot be used as primary keys.
The source-to-target mapping that is defined in the Replicat configuration file is static. Oracle GoldenGate 12.2 and later versions support DDL propagation and source schema evolution for Oracle databases as replication sources. If you use DDL propagation and source schema evolution, you lose the ability to seamlessly handle changes to the source metadata.
Parent topic: Hive Metadata Provider
8.2.31.4.4.9 Additional Considerations
The most common problems encountered are Java classpath issues. The Hive Metadata Provider requires certain Hive and HDFS client libraries to be resolved in its classpath.
The required client JAR directories are listed in Classpath Configuration. Hive and HDFS client JARs do not ship with Oracle GoldenGate for Big Data. The client JARs should be of the same version as the Hive version to which the Hive Metadata Provider is connecting.
To establish a connection to the Hive server, the hive-site.xml
file must be in the classpath.
Parent topic: Hive Metadata Provider
8.2.31.4.4.10 Troubleshooting
If the mapped target table is not present in Hive, the Replicat process will terminate with a "Table metadata resolution exception".
For example, consider the following mapping:
MAP GG.TCUSTMER, TARGET GG_HIVE.TCUSTMER_HIVE, COLMAP(USEDEFAULTS, CUST_CODE2=CUST_CODE, CITY2=CITY) KEYCOLS(CUST_CODE2);
This mapping requires a table called TCUSTMER_HIVE
to be created in the schema GG_HIVE
in the Hive metastore. If this table is not present in Hive, then the following exception occurs:
ERROR [main) - Table Metadata Resolution Exception
Unable to retrieve table matadata. Table : GG_HIVE.TCUSTMER_HIVE
NoSuchObjectException(message:GG_HIVE.TCUSTMER_HIVE table not found)
Parent topic: Hive Metadata Provider
8.2.31.4.5 Google BigQuery Metadata Provider
The Google BigQuery Metadata Provider uses a Google BigQuery query job to retrieve the metadata schema information from the Google BigQuery table. The table must already be created on the target for BigQuery to fetch the metadata.
Google BigQuery does not support primary key semantics, so the metadata retrieved from the BigQuery table does not include any primary key definition. You can identify the primary keys using the KEYCOLS syntax in the Replicat mapping statement. If KEYCOLS is not present, then the key information from the source table is used.
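For example (an illustrative sketch; the schema, table, and ID column names are placeholders), primary keys can be designated in the Replicat mapping statement as follows:
MAP schema.tableName, TARGET schema.tableName, KEYCOLS(ID);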
- Authentication
- Supported BigQuery Datatypes
- Parameterized BigQuery Datatypes
The BigQuery datatypes that can be parameterized to add constraints are STRING, BYTES, NUMERIC, and BIGNUMERIC. The STRING and BYTES datatypes can have length constraints. NUMERIC and BIGNUMERIC can have scale and precision constraints. - Unsupported BigQuery Datatypes
- Configuring BigQuery Metadata Provider
- Sample Configuration
- Proxy Settings
- Classpath Settings
- Limitations
Parent topic: Metadata Providers
8.2.31.4.5.1 Authentication
You can connect to the Google BigQuery cloud service account either by setting the path to the credentials JSON file in the metadata provider property, or by setting the individual keys of the credentials JSON in the BigQuery metadata provider properties. The individual BigQuery metadata provider properties for configuring the service account credential keys can be encrypted using Oracle wallet.
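For example (an illustrative sketch; the file path and key values are placeholders), either of the following approaches can be used in the properties file:
gg.mdp.type=bq
gg.mdp.credentialsFile=/path/to/credFile.json
Or, using the individual credential keys:
gg.mdp.type=bq
gg.mdp.projectId=your-project-id
gg.mdp.clientId=your-client-id
gg.mdp.clientEmail=service-account@your-project-id.iam.gserviceaccount.com
gg.mdp.privateKeyId=your-private-key-id
gg.mdp.privateKey=your-private-key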
Parent topic: Google BigQuery Metadata Provider
8.2.31.4.5.2 Supported BigQuery Datatypes
The following table lists the Google BigQuery datatypes that are supported and their default scale and precision values:
Data Type | Range | Max Scale | Max Precision | Max Bytes |
---|---|---|---|---|
BOOL | | NA | NA | 1 |
INT64 | | NA | NA | 8 |
FLOAT64 | NA | NA | None | 8 |
NUMERIC | Min: Max: | 9 | 38 | 64 |
BIGNUMERIC | Min: Max: | 38 | 77 | 255 |
STRING | Unlimited | NA | NA | 2147483647L |
BYTES | Unlimited | NA | NA | 2147483647L |
DATE | 0001-01-01 to 9999-12-31 | NA | NA | NA |
TIME | 00:00:00 to 23:59:59.999999 | NA | NA | NA |
TIMESTAMP | 0001-01-01 00:00:00 to 9999-12-31 23:59:59.999999 UTC | NA | NA | NA |
Parent topic: Google BigQuery Metadata Provider
8.2.31.4.5.3 Parameterized BigQuery Datatypes
The BigQuery datatypes that can be parameterized to add constraints are STRING, BYTES, NUMERIC, and BIGNUMERIC. The STRING and BYTES datatypes can have length constraints. NUMERIC and BIGNUMERIC can have scale and precision constraints.
- STRING(L): L is the maximum number of Unicode characters allowed.
- BYTES(L): L is the maximum number of bytes allowed.
- NUMERIC(P[, S]) or BIGNUMERIC(P[, S]): P is maximum precision (total number of digits) and S is maximum scale (number of digits after decimal) that is allowed.
The parameterized datatypes are supported in the BigQuery Metadata Provider. If there is a datatype with a user-defined precision, scale, or max-length, then the metadata provider calculates the data based on those values.
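For example (an illustrative sketch; the dataset, table, and column names are placeholders), a BigQuery table using parameterized datatypes might be declared as follows:
CREATE TABLE mydataset.example_tab (
  code STRING(4),
  payload BYTES(1024),
  price NUMERIC(10, 2),
  big_value BIGNUMERIC(40, 5)
);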
Parent topic: Google BigQuery Metadata Provider
8.2.31.4.5.4 Unsupported BigQuery Datatypes
The BigQuery datatypes that are not supported by the metadata provider are complex datatypes, such as GEOGRAPHY, JSON, ARRAY, INTERVAL, and STRUCT. The metadata provider abends with an invalid datatype exception if it encounters any of these datatypes.
Parent topic: Google BigQuery Metadata Provider
8.2.31.4.5.5 Configuring BigQuery Metadata Provider
The following table lists the configuration properties for BigQuery metadata provider:
Property | Required/Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.mdp.type | Required | bq | NA | Selects the BigQuery Metadata Provider. |
gg.mdp.credentialsFile | Optional | File path to the credentials JSON file. | NA | Provides the path to the credentials JSON file for connecting to the Google BigQuery service account. |
gg.mdp.clientId | Optional | Valid BigQuery Credentials Client Id | NA | Provides the client Id key from the credentials file for connecting to the Google BigQuery service account. |
gg.mdp.clientEmail | Optional | Valid BigQuery Credentials Client Email | NA | Provides the client Email key from the credentials file for connecting to the Google BigQuery service account. |
gg.mdp.privateKeyId | Optional | Valid BigQuery Credentials Private Key ID | NA | Provides the Private Key ID from the credentials file for connecting to the Google BigQuery service account. |
gg.mdp.privateKey | Optional | Valid BigQuery Credentials Private Key | NA | Provides the Private Key from the credentials file for connecting to the Google BigQuery service account. |
gg.mdp.projectId | Optional | Unique BigQuery project Id | NA | Unique project Id of BigQuery. |
gg.mdp.connectionTimeout | Optional | Time in seconds | 5 | Connect timeout for the BigQuery connection. |
gg.mdp.readTimeout | Optional | Time in seconds | 6 | Timeout to read from the BigQuery connection. |
gg.mdp.totalTimeout | Optional | Time in seconds | 9 | Total timeout for the BigQuery connection. |
gg.mdp.retryCount | Optional | Maximum number of retries | 3 | Maximum number of retries for connecting to BigQuery. |
Parent topic: Google BigQuery Metadata Provider
8.2.31.4.5.6 Sample Configuration
Sample properties file content:
gg.mdp.type=bq
gg.mdp.credentialsFile=/path/to/credFile.json
Sample parameter file:
REPLICAT bqeh
TARGETDB LIBFILE libggjava.so SET property=dirprm/bqeh.props
MAP schema.tableName, TARGET schema.tableName;
Parent topic: Google BigQuery Metadata Provider
8.2.31.4.5.7 Proxy Settings
If the connection to BigQuery must be routed through a proxy server, set the proxy host and port in the jvm.bootoptions property. For example:
jvm.bootoptions= -Dhttps.proxyHost=www-proxy.us.oracle.com -Dhttps.proxyPort=80
Parent topic: Google BigQuery Metadata Provider
8.2.31.4.5.8 Classpath Settings
The dependencies of the BigQuery metadata provider are the same as the Google BigQuery stage-and-merge Event Handler dependencies. The dependencies added to the Oracle GoldenGate classpath for the BigQuery Event Handler are sufficient for running the BigQuery metadata provider; no extra dependencies need to be configured.
Parent topic: Google BigQuery Metadata Provider
8.2.31.4.5.9 Limitations
The complex BigQuery datatypes are not yet supported by the metadata provider. It abends if any unsupported datatype is encountered.
Even if the BigQuery handler or event handler is configured to auto-create the table and dataspace, the metadata provider expects the table to exist in order to fetch the metadata; the auto-create feature of the BigQuery handler and event handler does not work with the BigQuery metadata provider. Metadata change events are not supported by the BigQuery metadata provider. It can be configured to abend or suspend when there is a metadata change.
Parent topic: Google BigQuery Metadata Provider
8.2.31.5 Pluggable Formatters
The pluggable formatters are used to convert operations from the Oracle GoldenGate trail file into formatted messages that you can send to Big Data targets using one of the Oracle GoldenGate for Big Data Handlers.
This chapter describes how to use the pluggable formatters.
- Using Operation-Based versus Row-Based Formatting
The Oracle GoldenGate for Big Data formatters include operation-based and row-based formatters. - Using the Avro Formatter
Apache Avro is an open source data serialization and deserialization framework known for its flexibility, compactness of serialized data, and good serialization and deserialization performance. Apache Avro is commonly used in Big Data applications. - Using the Delimited Text Formatter
- Using the JSON Formatter
- Using the Length Delimited Value Formatter
The Length Delimited Value (LDV) Formatter is a row-based formatter. It formats database operations from the source trail file into a length delimited value output. Each insert, update, delete, or truncate operation from the source trail is formatted into an individual length delimited message. - Using the XML Formatter
The XML Formatter formats before-image and after-image data from the source trail file into an XML document representation of the operation data. The format of the XML document is effectively the same as the XML format in the previous releases of the Oracle GoldenGate Java Adapter.
Parent topic: Additional Details
8.2.31.5.1 Using Operation-Based versus Row-Based Formatting
The Oracle GoldenGate for Big Data formatters include operation-based and row-based formatters.
The operation-based formatters represent the individual insert, update, and delete events that occur on table data in the source database. Insert operations only provide after-change data (or images), because a new row is being added to the source database. Update operations provide both before-change and after-change data that shows how existing row data is modified. Delete operations only provide before-change data to identify the row being deleted. The operation-based formatters model the operation as it exists in the source trail file. Operation-based formats include fields for the before-change and after-change images.
The row-based formatters model the row data as it exists after the operation data is applied. Row-based formatters contain only a single image of the data. The following sections describe what data is displayed for both the operation-based and the row-based formatters.
Parent topic: Pluggable Formatters
8.2.31.5.1.1 Operation Formatters
The formatters that support operation-based formatting are JSON, Avro Operation, and XML. The output of the operation-based formatters is as follows:
- Insert operation: Before-image data is null. After-image data is output.
- Update operation: Both before-image and after-image data is output.
- Delete operation: Before-image data is output. After-image data is null.
- Truncate operation: Both before-image and after-image data is null.
Parent topic: Using Operation-Based versus Row-Based Formatting
8.2.31.5.1.2 Row Formatters
The formatters that support row-based formatting are Delimited Text and Avro Row. Row-based formatters output the following information for the following operations:
- Insert operation: After-image data only.
- Update operation: After-image data only. Primary key updates are a special case which will be discussed in individual sections for the specific formatters.
- Delete operation: Before-image data only.
- Truncate operation: The table name is provided, but both before-image and after-image data are null. Truncate table is a DDL operation, and it may not be supported by all database implementations. Refer to the Oracle GoldenGate documentation for your database implementation.
Parent topic: Using Operation-Based versus Row-Based Formatting
8.2.31.5.1.3 Table Row or Column Value States
In an RDBMS, table data for a specific row and column can only have one of two states: either the data has a value, or it is null. However, when data is transferred to the Oracle GoldenGate trail file by the Oracle GoldenGate capture process, the data can have three possible states: it can have a value, it can be null, or it can be missing.
For an insert operation, the after-image contains data for all column values regardless of whether the data is null. However, the data included for update and delete operations may not always contain complete data for all columns. When replicating data to an RDBMS, for an update operation only the primary key values and the values of the columns that changed are required to modify the data in the target database. In addition, only the primary key values are required to delete the row from the target database. Therefore, even though values are present in the source database, the values may be missing in the source trail file. Because data in the source trail file may have three states, the Pluggable Formatters must also be able to represent data in all three states.
Because the row and column data in the Oracle GoldenGate trail file has an important effect on a Big Data integration, it is important to understand the data that is required. Typically, you can control the data that is included for operations in the Oracle GoldenGate trail file. In an Oracle database, this data is controlled by the supplemental logging level. To understand how to control the row and column values that are included in the Oracle GoldenGate trail file, see the Oracle GoldenGate documentation for your source database implementation.
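For example (an illustrative sketch; the schema and table names are placeholders), for an Oracle source you can typically request logging of all column values with the following GGSCI command:
ADD TRANDATA GG.TCUSTMER ALLCOLS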
Parent topic: Using Operation-Based versus Row-Based Formatting
8.2.31.5.2 Using the Avro Formatter
Apache Avro is an open source data serialization and deserialization framework known for its flexibility, compactness of serialized data, and good serialization and deserialization performance. Apache Avro is commonly used in Big Data applications.
Parent topic: Pluggable Formatters
8.2.31.5.2.1 Avro Row Formatter
The Avro Row Formatter formats operation data from the source trail file into messages in an Avro binary array format. Each individual insert, update, delete, and truncate operation is formatted into an individual Avro message. The source trail file contains the before and after images of the operation data. The Avro Row Formatter takes the before-image and after-image data and formats it into an Avro binary representation of the operation data.
The Avro Row Formatter formats operations from the source trail file into a format that represents the row data. This format is more compact than the output from the Avro Operation Formatter, for which the Avro messages model the change data operation.
The Avro Row Formatter may be a good choice when streaming Avro data to HDFS. Hive supports data files in HDFS in an Avro format.
This section contains the following topics:
- Operation Metadata Formatting Details
The automated output of meta-column fields in generated Avro messages has been removed as of Oracle GoldenGate for Big Data release 21.1. Meta-column fields can still be output; however, they need to be explicitly configured using the property gg.handler.name.format.metaColumnsTemplate. - Operation Data Formatting Details
- Sample Avro Row Messages
- Avro Schemas
- Avro Row Configuration Properties
- Review a Sample Configuration
- Metadata Change Events
- Special Considerations
Parent topic: Using the Avro Formatter
8.2.31.5.2.1.1 Operation Metadata Formatting Details
The automated output of meta-column fields in generated Avro messages has been removed as of Oracle GoldenGate for Big Data release 21.1. Meta-column fields can still be output; however, they need to be explicitly configured using the property gg.handler.name.format.metaColumnsTemplate.
To output the metacolumns, configure the following:
gg.handler.name.format.metaColumnsTemplate=${objectname[table]},${optype[op_type]},${timestamp[op_ts]},${currenttimestamp[current_ts]},${position[pos]}
To also include the primary key columns and the tokens, configure as follows:
gg.handler.name.format.metaColumnsTemplate=${objectname[table]},${optype[op_type]},${timestamp[op_ts]},${currenttimestamp[current_ts]},${position[pos]},${primarykeycolumns[primary_keys]},${alltokens[tokens]}
For more information, see the configuration property gg.handler.name.format.metaColumnsTemplate.
Table 8-41 Avro Formatter Metadata
Value | Description |
---|---|
table | The fully qualified table name. |
op_type | The type of database operation from the source trail file. Default values are I for insert, U for update, D for delete, and T for truncate. |
op_ts | The timestamp of the operation from the source trail file. Since this timestamp is from the source trail, it is fixed. Replaying the trail file results in the same timestamp for the same operation. |
current_ts | The time when the formatter processed the current operation record. This timestamp follows the ISO-8601 format and includes microsecond precision. Replaying the trail file will not result in the same timestamp for the same operation. |
pos | The concatenated sequence number and the RBA number from the source trail file. This trail position lets you trace the operation back to the source trail file. The sequence number is the source trail file number. The RBA number is the offset in the trail file. |
primary_keys | An array variable that holds the column names of the primary keys of the source table. |
tokens | A map variable that holds the token key value pairs from the source trail file. |
Parent topic: Avro Row Formatter
8.2.31.5.2.1.2 Operation Data Formatting Details
The operation data follows the operation metadata. This data is represented as individual fields identified by the column names.
Column values for an operation from the source trail file can have one of three states: the column has a value, the column value is null, or the column value is missing. Avro attributes only support two states, the column has a value or the column value is null. Missing column values are handled the same as null values. Oracle recommends that when you use the Avro Row Formatter, you configure the Oracle GoldenGate capture process to provide full image data for all columns in the source trail file.
By default, the setting of the Avro Row Formatter maps the data types from the source trail file to the associated Avro data type. Because Avro provides limited support for data types, source columns map into Avro long, double, float, binary, or string data types. You can also configure data type mapping to handle all data as strings.
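For example (an illustrative sketch; hdfs is the handler name used in the sample configuration later in this section), all source data can be treated as strings with the following settings:
gg.handler.hdfs.format=avro_row
gg.handler.hdfs.format.treatAllColumnsAsStrings=true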
Parent topic: Avro Row Formatter
8.2.31.5.2.1.3 Sample Avro Row Messages
Because Avro messages are binary, they are not human readable. The following sample messages show the JSON representation of the messages.
Parent topic: Avro Row Formatter
8.2.31.5.2.1.3.1 Sample Insert Message
{"table": "GG.TCUSTORD", "op_type": "I", "op_ts": "2013-06-02 22:14:36.000000", "current_ts": "2015-09-18T10:13:11.172000", "pos": "00000000000000001444", "primary_keys": ["CUST_CODE", "ORDER_DATE", "PRODUCT_CODE", "ORDER_ID"], "tokens": {"R": "AADPkvAAEAAEqL2AAA"}, "CUST_CODE": "WILL", "ORDER_DATE": "1994-09-30:15:33:00", "PRODUCT_CODE": "CAR", "ORDER_ID": "144", "PRODUCT_PRICE": 17520.0, "PRODUCT_AMOUNT": 3.0, "TRANSACTION_ID": "100"}
Parent topic: Sample Avro Row Messages
8.2.31.5.2.1.3.2 Sample Update Message
{"table": "GG.TCUSTORD", "op_type": "U", "op_ts": "2013-06-02 22:14:41.000000", "current_ts": "2015-09-18T10:13:11.492000", "pos": "00000000000000002891", "primary_keys": ["CUST_CODE", "ORDER_DATE", "PRODUCT_CODE", "ORDER_ID"], "tokens": {"R": "AADPkvAAEAAEqLzAAA"}, "CUST_CODE": "BILL", "ORDER_DATE": "1995-12-31:15:00:00", "PRODUCT_CODE": "CAR", "ORDER_ID": "765", "PRODUCT_PRICE": 14000.0, "PRODUCT_AMOUNT": 3.0, "TRANSACTION_ID": "100"}
Parent topic: Sample Avro Row Messages
8.2.31.5.2.1.3.3 Sample Delete Message
{"table": "GG.TCUSTORD", "op_type": "D", "op_ts": "2013-06-02 22:14:41.000000", "current_ts": "2015-09-18T10:13:11.512000", "pos": "00000000000000004338", "primary_keys": ["CUST_CODE", "ORDER_DATE", "PRODUCT_CODE", "ORDER_ID"], "tokens": {"L": "206080450", "6": "9.0.80330", "R": "AADPkvAAEAAEqLzAAC"}, "CUST_CODE": "DAVE", "ORDER_DATE": "1993-11-03:07:51:35", "PRODUCT_CODE": "PLANE", "ORDER_ID": "600", "PRODUCT_PRICE": null, "PRODUCT_AMOUNT": null, "TRANSACTION_ID": null}
Parent topic: Sample Avro Row Messages
8.2.31.5.2.1.3.4 Sample Truncate Message
{"table": "GG.TCUSTORD", "op_type": "T", "op_ts": "2013-06-02 22:14:41.000000", "current_ts": "2015-09-18T10:13:11.514000", "pos": "00000000000000004515", "primary_keys": ["CUST_CODE", "ORDER_DATE", "PRODUCT_CODE", "ORDER_ID"], "tokens": {"R": "AADPkvAAEAAEqL2AAB"}, "CUST_CODE": null, "ORDER_DATE": null, "PRODUCT_CODE": null, "ORDER_ID": null, "PRODUCT_PRICE": null, "PRODUCT_AMOUNT": null, "TRANSACTION_ID": null}
Parent topic: Sample Avro Row Messages
8.2.31.5.2.1.4 Avro Schemas
Avro uses JSON to represent schemas. Avro schemas define the format of generated Avro messages and are required to serialize and deserialize Avro messages. Schemas are generated on a just-in-time basis when the first operation for a table is encountered. Because generated Avro schemas are specific to a table definition, a separate Avro schema is generated for every table encountered for processed operations. By default, Avro schemas are written to the GoldenGate_Home/dirdef directory, although the write location is configurable. Avro schema file names adhere to the following naming convention: Fully_Qualified_Table_Name.avsc.
The following is a sample Avro schema for the Avro Row Format for the references examples in the previous section:
{ "type" : "record", "name" : "TCUSTORD", "namespace" : "GG", "fields" : [ { "name" : "table", "type" : "string" }, { "name" : "op_type", "type" : "string" }, { "name" : "op_ts", "type" : "string" }, { "name" : "current_ts", "type" : "string" }, { "name" : "pos", "type" : "string" }, { "name" : "primary_keys", "type" : { "type" : "array", "items" : "string" } }, { "name" : "tokens", "type" : { "type" : "map", "values" : "string" }, "default" : { } }, { "name" : "CUST_CODE", "type" : [ "null", "string" ], "default" : null }, { "name" : "ORDER_DATE", "type" : [ "null", "string" ], "default" : null }, { "name" : "PRODUCT_CODE", "type" : [ "null", "string" ], "default" : null }, { "name" : "ORDER_ID", "type" : [ "null", "string" ], "default" : null }, { "name" : "PRODUCT_PRICE", "type" : [ "null", "double" ], "default" : null }, { "name" : "PRODUCT_AMOUNT", "type" : [ "null", "double" ], "default" : null }, { "name" : "TRANSACTION_ID", "type" : [ "null", "string" ], "default" : null } ] }
Parent topic: Avro Row Formatter
8.2.31.5.2.1.5 Avro Row Configuration Properties
Table 8-42 Avro Row Configuration Properties
Properties | Optional/ Required | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.handler.name.format.insertOpKey |
Optional |
Any string |
|
Indicator to be inserted into the output record to indicate an insert operation. |
|
Optional |
Any string |
|
Indicator to be inserted into the output record to indicate an update operation. |
|
Optional |
Any string |
|
Indicator to be inserted into the output record to indicate a delete operation. |
|
Optional |
Any string |
|
Indicator to be inserted into the output record to indicate a truncate operation. |
|
Optional |
Any legal encoding name or alias supported by Java. |
UTF-8 (the JSON default) |
Controls the output encoding of generated JSON Avro schema. The JSON default is UTF-8. Avro messages are binary and support their own internal representation of encoding. |
gg.handler.name.format.treatAllColumnsAsStrings |
Optional |
|
|
Controls the output typing of generated Avro messages. If set to false then the formatter will attempt to map Oracle GoldenGate types to the corresponding AVRO type. If set to true then all data will be treated as Strings in the generated Avro messages and schemas. |
gg.handler.name.format.pkUpdateHandling |
Optional |
|
|
Specifies how the formatter handles update operations that change a primary key. Primary key operations for the Avro Row formatter require special consideration.
|
|
Optional |
Any string |
no value |
Inserts a delimiter after each Avro message. This is not
a best practice, but in certain cases you may want to parse a stream
of data and extract individual Avro messages from the stream. Select
a unique delimiter that cannot occur in any Avro message. This
property supports |
|
Optional |
|
|
Avro schemas always follow
the |
|
Optional |
|
|
Wraps the Avro messages for operations from the source trail file in a generic Avro wrapper message. For more information, see Generic Wrapper Functionality. |
|
Optional |
Any legal, existing file system path. |
|
The output location of generated Avro schemas. |
|
Optional |
Any legal encoding name or alias supported by Java. |
|
The directory in the HDFS where schemas are output. A
metadata change overwrites the schema during the next operation for
the associated table. Schemas follow the same naming convention as
schemas written to the local file
system: |
|
Optional |
|
|
The format of the current timestamp. The default is the
ISO 8601 format. A setting of false removes the
T between the date and time in the current
timestamp, which outputs a space instead.
|
|
Optional |
|
|
Set to true to include a
{column_name}_isMissing boolean field
for each source field. This field allows downstream applications to
differentiate if a null value is null in the source trail file (value is
false ) or is missing in the source trail file
(value is true ).
|
|
Optional |
|
|
Enables the use of Avro |
|
Optional |
Any integer value from 0 to 38. |
None |
Allows you to set the scale on the Avro
|
gg.handler.name.format.mapOracleNumbersAsStrings |
Optional |
|
false |
This property is only applicable if decimal logical
types are enabled via the property
gg.handler.name.format.enableDecimalLogialType=true .
Oracle numbers are especially problematic because they have a large
precision (168) and floating scale of up to 38. Some analytical tools,
such as Spark cannot read numbers that large. This property allows you
to map those Oracle numbers as strings while still mapping the smaller
numbers as decimal logical types.
|
|
Optional |
|
|
Set to |
gg.handler.name.format.mapLargeNumbersAsStrings |
Optional | true | false |
false |
Oracle GoldenGate supports the floating point and
integer source datatypes. Some of these datatypes may not fit into the
Avro primitive double or long datatypes. Set this property to
true to map the fields that do not fit into the
Avro primitive double or long datatypes to Avro string.
|
gg.handler.name.format.metaColumnsTemplate |
Optional | See Metacolumn Keywords. | None |
The current meta column information can be configured in a simple manner and removes the explicit need to use: insertOpKey | updateOpKey | deleteOpKey |
truncateOpKey | includeTableName | includeOpTimestamp |
includeOpType | includePosition | includeCurrentTimestamp,
useIso8601Format It is a comma-delimited string consisting of one or more templated values that represent the template. For more information about the Metacolumn keywords, see Metacolumn Keywords. |
gg.handler.name.format.maxPrecision |
Optional | None | Positive Integer | Allows you to set the maximum precision for Avro decimal
logical types. Consuming applications may have limitations on Avro
precision (that is, Apache Spark supports a maximum precision of
38).
WARNING: Configuration of this property is not without risk. |
Parent topic: Avro Row Formatter
8.2.31.5.2.1.6 Review a Sample Configuration
The following is a sample configuration for the Avro Row Formatter in the Java Adapter properties file:
gg.handler.hdfs.format=avro_row
gg.handler.hdfs.format.insertOpKey=I
gg.handler.hdfs.format.updateOpKey=U
gg.handler.hdfs.format.deleteOpKey=D
gg.handler.hdfs.format.truncateOpKey=T
gg.handler.hdfs.format.encoding=UTF-8
gg.handler.hdfs.format.pkUpdateHandling=abend
gg.handler.hdfs.format.wrapMessageInGenericAvroMessage=false
Parent topic: Avro Row Formatter
8.2.31.5.2.1.7 Metadata Change Events
If the replicated database and upstream Oracle GoldenGate replication process can propagate metadata change events, the Avro Row Formatter can take action when metadata changes. Because Avro messages depend closely on their corresponding schema, metadata changes are important when you use Avro formatting.
An updated Avro schema is generated as soon as a table operation occurs after a metadata change event. You must understand the impact of a metadata change event and change downstream targets to the new Avro schema. The tight dependency of Avro messages to Avro schemas may result in compatibility issues. Avro messages generated before the schema change may not be able to be deserialized with the newly generated Avro schema.
Conversely, Avro messages generated after the schema change may not be able to be deserialized with the previous Avro schema. It is a best practice to use the same version of the Avro schema that was used to generate the message. For more information, consult the Apache Avro documentation.
Parent topic: Avro Row Formatter
8.2.31.5.2.1.8 Special Considerations
This section describes these special considerations:
8.2.31.5.2.1.8.1 Troubleshooting
Because Avro is a binary format, it is not human readable, which makes issues difficult to debug. To help debug issues, the Avro Row Formatter provides a special feature: when the log4j Java logging level is set to TRACE, Avro messages are deserialized and displayed in the log file as a JSON object, letting you view the structure and contents of the created Avro messages. Do not enable TRACE in a production environment, as it has a substantial negative impact on performance. To troubleshoot content, you may want to consider switching to a formatter that produces human-readable content. The XML and JSON formatters both produce content in a human-readable format.
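For example (a hedged sketch; this assumes the gg.log.level property controls the adapter logging level in your installation), trace logging can be enabled temporarily in the Java Adapter properties file with:
gg.log.level=TRACE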
Parent topic: Special Considerations
8.2.31.5.2.1.8.2 Primary Key Updates
In Big Data integrations, primary key update operations require special consideration and planning. Primary key updates modify one or more of the primary keys of a given row in the source database. Because data is appended in Big Data applications, a primary key update operation looks more like a new insert than like an update without special handling. You can use the following properties to configure the Avro Row Formatter to handle primary keys:
Table 8-43 Configurable behavior
Value | Description |
---|---|
abend | The formatter terminates. This behavior is the default behavior. |
update | With this configuration, the primary key update is treated like any other update operation. Use this configuration only if you can guarantee that the primary key is not used as selection criteria for row data in a Big Data system. |
delete-insert | The primary key update is treated as a special case of a delete, using the before-image data, and an insert, using the after-image data. This configuration may more accurately model the effect of a primary key update in a Big Data application. However, if this configuration is selected, it is important to have full supplemental logging enabled on Replication at the source database. Without full supplemental logging, the delete operation will be correct, but the insert operation will not contain all of the data for all of the columns for a full representation of the row data in the Big Data application. |
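For example (an illustrative sketch; hdfs is the handler name used in the earlier sample configuration), primary key updates can be modeled as a delete followed by an insert with:
gg.handler.hdfs.format.pkUpdateHandling=delete-insert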
Parent topic: Special Considerations
8.2.31.5.2.1.8.3 Generic Wrapper Functionality
Because Avro messages are not self describing, the receiver of the message must know the schema associated with the message before the message can be deserialized. Avro messages are binary and provide no consistent or reliable way to inspect the message contents in order to ascertain the message type. Therefore, Avro can be troublesome when messages are interlaced into a single stream of data such as Kafka.
The Avro formatter provides a special feature to wrap the Avro message in a generic Avro message. You can enable this functionality by setting the following configuration property.
gg.handler.name.format.wrapMessageInGenericAvroMessage=true
The generic message is an Avro message, common to all Avro messages that are output, that wraps the Avro payload message. The schema for the generic message is named generic_wrapper.avsc and is written to the output schema directory. This message has the following three fields:
- table_name: The fully qualified source table name.
- schema_fingerprint: The fingerprint of the Avro schema of the wrapped message. The fingerprint is generated using the Avro SchemaNormalization.parsingFingerprint64(schema) call.
- payload: The wrapped Avro message.
The following is the Avro Formatter generic wrapper schema.
{ "type" : "record", "name" : "generic_wrapper", "namespace" : "oracle.goldengate", "fields" : [ { "name" : "table_name", "type" : "string" }, { "name" : "schema_fingerprint", "type" : "long" }, { "name" : "payload", "type" : "bytes" } ] }
Parent topic: Special Considerations
8.2.31.5.2.2 The Avro Operation Formatter
The Avro Operation Formatter formats operation data from the source trail file into messages in an Avro binary array format. Each individual insert, update, delete, and truncate operation is formatted into an individual Avro message. The source trail file contains the before and after images of the operation data. The Avro Operation Formatter formats this data into an Avro binary representation of the operation data.
This format is more verbose than the output of the Avro Row Formatter, for which the Avro messages model the row data.
- Operation Metadata Formatting Details
- Operation Data Formatting Details
- Sample Avro Operation Messages
- Avro Schema
- Avro Operation Formatter Configuration Properties
- Review a Sample Configuration
- Metadata Change Events
- Special Considerations
Parent topic: Using the Avro Formatter
8.2.31.5.2.2.1 Operation Metadata Formatting Details
The automated output of meta-column fields in generated Avro messages has been removed as of Oracle GoldenGate for Big Data release 21.1. Meta-column fields can still be output; however, they need to be explicitly configured using the property gg.handler.name.format.metaColumnsTemplate.
To output the metacolumns as in previous versions, configure the following:
gg.handler.name.format.metaColumnsTemplate=${objectname[table]},${optype[op_type]},${timestamp[op_ts]},${currenttimestamp[current_ts]},${position[pos]}
To also include the primary key columns and the tokens, configure as follows:
gg.handler.name.format.metaColumnsTemplate=${objectname[table]},${optype[op_type]},${timestamp[op_ts]},${currenttimestamp[current_ts]},${position[pos]},${primarykeycolumns[primary_keys]},${alltokens[tokens]}
For more information, see the configuration property gg.handler.name.format.metaColumnsTemplate.
Table 8-44 Avro Messages and its Metadata
Fields | Description |
---|---|
table | The fully qualified table name. |
op_type | The type of database operation from the source trail file. Default values are I for insert, U for update, D for delete, and T for truncate. |
op_ts | The timestamp of the operation from the source trail file. Since this timestamp is from the source trail, it is fixed. Replaying the trail file results in the same timestamp for the same operation. |
current_ts | The time when the formatter processed the current operation record. This timestamp follows the ISO-8601 format and includes microsecond precision. Replaying the trail file will not result in the same timestamp for the same operation. |
pos | The concatenated sequence number and rba number from the source trail file. The trail position provides traceability of the operation back to the source trail file. The sequence number is the source trail file number. The rba number is the offset in the trail file. |
primary_keys | An array variable that holds the column names of the primary keys of the source table. |
tokens | A map variable that holds the token key value pairs from the source trail file. |
Parent topic: The Avro Operation Formatter
8.2.31.5.2.2.2 Operation Data Formatting Details
The operation data is represented as individual fields identified by the column names.
Column values for an operation from the source trail file can have one of three states: the column has a value, the column value is null, or the column value is missing. Avro attributes only support two states: the column has a value or the column value is null. The Avro Operation Formatter contains an additional Boolean field COLUMN_NAME_isMissing for each column to indicate whether the column value is missing or not. Using the COLUMN_NAME field together with the COLUMN_NAME_isMissing field, all three states can be defined.
- State 1: The column has a value. The COLUMN_NAME field has a value and the COLUMN_NAME_isMissing field is false.
- State 2: The column value is null. The COLUMN_NAME field value is null and the COLUMN_NAME_isMissing field is false.
- State 3: The column value is missing. The COLUMN_NAME field value is null and the COLUMN_NAME_isMissing field is true.
By default, the Avro Operation Formatter maps the data types from the source trail file to the associated Avro data type. Because Avro supports few data types, this functionality usually results in the mapping of numeric fields from the source trail file to members typed as numbers. You can also configure this data type mapping to handle all data as strings.
Parent topic: The Avro Operation Formatter
8.2.31.5.2.2.3 Sample Avro Operation Messages
Because Avro messages are binary, they are not human readable. The following topics show example Avro messages in JSON format:
Parent topic: The Avro Operation Formatter
8.2.31.5.2.2.3.1 Sample Insert Message
{"table": "GG.TCUSTORD", "op_type": "I", "op_ts": "2013-06-02 22:14:36.000000", "current_ts": "2015-09-18T10:17:49.570000", "pos": "00000000000000001444", "primary_keys": ["CUST_CODE", "ORDER_DATE", "PRODUCT_CODE", "ORDER_ID"], "tokens": {"R": "AADPkvAAEAAEqL2AAA"}, "before": null, "after": { "CUST_CODE": "WILL", "CUST_CODE_isMissing": false, "ORDER_DATE": "1994-09-30:15:33:00", "ORDER_DATE_isMissing": false, "PRODUCT_CODE": "CAR", "PRODUCT_CODE_isMissing": false, "ORDER_ID": "144", "ORDER_ID_isMissing": false, "PRODUCT_PRICE": 17520.0, "PRODUCT_PRICE_isMissing": false, "PRODUCT_AMOUNT": 3.0, "PRODUCT_AMOUNT_isMissing": false, "TRANSACTION_ID": "100", "TRANSACTION_ID_isMissing": false}}
Parent topic: Sample Avro Operation Messages
8.2.31.5.2.2.3.2 Sample Update Message
{"table": "GG.TCUSTORD", "op_type": "U", "op_ts": "2013-06-02 22:14:41.000000", "current_ts": "2015-09-18T10:17:49.880000", "pos": "00000000000000002891", "primary_keys": ["CUST_CODE", "ORDER_DATE", "PRODUCT_CODE", "ORDER_ID"], "tokens": {"R": "AADPkvAAEAAEqLzAAA"}, "before": { "CUST_CODE": "BILL", "CUST_CODE_isMissing": false, "ORDER_DATE": "1995-12-31:15:00:00", "ORDER_DATE_isMissing": false, "PRODUCT_CODE": "CAR", "PRODUCT_CODE_isMissing": false, "ORDER_ID": "765", "ORDER_ID_isMissing": false, "PRODUCT_PRICE": 15000.0, "PRODUCT_PRICE_isMissing": false, "PRODUCT_AMOUNT": 3.0, "PRODUCT_AMOUNT_isMissing": false, "TRANSACTION_ID": "100", "TRANSACTION_ID_isMissing": false}, "after": { "CUST_CODE": "BILL", "CUST_CODE_isMissing": false, "ORDER_DATE": "1995-12-31:15:00:00", "ORDER_DATE_isMissing": false, "PRODUCT_CODE": "CAR", "PRODUCT_CODE_isMissing": false, "ORDER_ID": "765", "ORDER_ID_isMissing": false, "PRODUCT_PRICE": 14000.0, "PRODUCT_PRICE_isMissing": false, "PRODUCT_AMOUNT": 3.0, "PRODUCT_AMOUNT_isMissing": false, "TRANSACTION_ID": "100", "TRANSACTION_ID_isMissing": false}}
Parent topic: Sample Avro Operation Messages
8.2.31.5.2.2.3.3 Sample Delete Message
{"table": "GG.TCUSTORD", "op_type": "D", "op_ts": "2013-06-02 22:14:41.000000", "current_ts": "2015-09-18T10:17:49.899000", "pos": "00000000000000004338", "primary_keys": ["CUST_CODE", "ORDER_DATE", "PRODUCT_CODE", "ORDER_ID"], "tokens": {"L": "206080450", "6": "9.0.80330", "R": "AADPkvAAEAAEqLzAAC"}, "before": { "CUST_CODE": "DAVE", "CUST_CODE_isMissing": false, "ORDER_DATE": "1993-11-03:07:51:35", "ORDER_DATE_isMissing": false, "PRODUCT_CODE": "PLANE", "PRODUCT_CODE_isMissing": false, "ORDER_ID": "600", "ORDER_ID_isMissing": false, "PRODUCT_PRICE": null, "PRODUCT_PRICE_isMissing": true, "PRODUCT_AMOUNT": null, "PRODUCT_AMOUNT_isMissing": true, "TRANSACTION_ID": null, "TRANSACTION_ID_isMissing": true}, "after": null}
Parent topic: Sample Avro Operation Messages
8.2.31.5.2.2.3.4 Sample Truncate Message
{"table": "GG.TCUSTORD", "op_type": "T", "op_ts": "2013-06-02 22:14:41.000000", "current_ts": "2015-09-18T10:17:49.900000", "pos": "00000000000000004515", "primary_keys": ["CUST_CODE", "ORDER_DATE", "PRODUCT_CODE", "ORDER_ID"], "tokens": {"R": "AADPkvAAEAAEqL2AAB"}, "before": null, "after": null}
Parent topic: Sample Avro Operation Messages
8.2.31.5.2.2.4 Avro Schema
Avro schemas are represented as JSON. Avro schemas define the format of generated Avro messages and are required to serialize and deserialize Avro messages. Avro schemas are generated on a just-in-time basis when the first operation for a table is encountered. Because Avro schemas are specific to a table definition, a separate Avro schema is generated for every table encountered for processed operations. By default, Avro schemas are written to the GoldenGate_Home/dirdef directory, although the write location is configurable. Avro schema file names adhere to the following naming convention: Fully_Qualified_Table_Name.avsc.
The following is a sample Avro schema for the Avro Operation Format for the samples in the preceding sections:
{ "type" : "record", "name" : "TCUSTORD", "namespace" : "GG", "fields" : [ { "name" : "table", "type" : "string" }, { "name" : "op_type", "type" : "string" }, { "name" : "op_ts", "type" : "string" }, { "name" : "current_ts", "type" : "string" }, { "name" : "pos", "type" : "string" }, { "name" : "primary_keys", "type" : { "type" : "array", "items" : "string" } }, { "name" : "tokens", "type" : { "type" : "map", "values" : "string" }, "default" : { } }, { "name" : "before", "type" : [ "null", { "type" : "record", "name" : "columns", "fields" : [ { "name" : "CUST_CODE", "type" : [ "null", "string" ], "default" : null }, { "name" : "CUST_CODE_isMissing", "type" : "boolean" }, { "name" : "ORDER_DATE", "type" : [ "null", "string" ], "default" : null }, { "name" : "ORDER_DATE_isMissing", "type" : "boolean" }, { "name" : "PRODUCT_CODE", "type" : [ "null", "string" ], "default" : null }, { "name" : "PRODUCT_CODE_isMissing", "type" : "boolean" }, { "name" : "ORDER_ID", "type" : [ "null", "string" ], "default" : null }, { "name" : "ORDER_ID_isMissing", "type" : "boolean" }, { "name" : "PRODUCT_PRICE", "type" : [ "null", "double" ], "default" : null }, { "name" : "PRODUCT_PRICE_isMissing", "type" : "boolean" }, { "name" : "PRODUCT_AMOUNT", "type" : [ "null", "double" ], "default" : null }, { "name" : "PRODUCT_AMOUNT_isMissing", "type" : "boolean" }, { "name" : "TRANSACTION_ID", "type" : [ "null", "string" ], "default" : null }, { "name" : "TRANSACTION_ID_isMissing", "type" : "boolean" } ] } ], "default" : null }, { "name" : "after", "type" : [ "null", "columns" ], "default" : null } ] }
Parent topic: The Avro Operation Formatter
8.2.31.5.2.2.5 Avro Operation Formatter Configuration Properties
Table 8-45 Configuration Properties
Properties | Optional Y/N | Legal Values | Default | Explanation |
---|---|---|---|---|
|
Optional |
Any string |
|
Indicator to be inserted into the output record to indicate an insert operation |
|
Optional |
Any string |
|
Indicator to be inserted into the output record to indicate an update operation. |
|
Optional |
Any string |
|
Indicator to be inserted into the output record to indicate a delete operation. |
|
Optional |
Any string |
|
Indicator to be inserted into the output record to indicate a truncate operation. |
|
Optional |
Any legal encoding name or alias supported by Java |
UTF-8 (the JSON default) |
Controls the output encoding of generated JSON Avro schema. The JSON default is UTF-8. Avro messages are binary and support their own internal representation of encoding. |
|
Optional |
|
|
Controls the output typing of generated Avro messages. If set to |
|
Optional |
Any string |
no value |
Inserts delimiter after each Avro message. This is not a best practice, but in certain cases you may want to parse a stream of data and extract individual Avro messages from the stream, use this property to help. Select a unique delimiter that cannot occur in any Avro message. This property supports |
|
Optional |
Any legal, existing file system path. |
|
The output location of generated Avro schemas. |
|
Optional |
|
|
Wraps Avro messages for operations from the source trail file in a generic Avro wrapper message. For more information, see Generic Wrapper Functionality. |
|
Optional |
|
|
The format of the current timestamp. By default the ISO 8601 is set to false , removes the T between the date and time in the current timestamp, which outputs a space instead.
|
|
Optional |
|
|
Set to |
|
Optional |
Any integer value from 0 to 38. |
None |
Allows you to set the scale on the Avro |
gg.handler.name.format.mapOracleNumbersAsStrings |
Optional |
|
|
This property is only applicable if decimal logical
types are enabled via the property
gg.handler.name.format.enableDecimalLogialType=true .
Oracle numbers are especially problematic because they have a large
precision (168) and floating scale of up to 38. Some analytical tools,
such as Spark cannot read numbers that large. This property allows you
to map those Oracle numbers as strings while still mapping the smaller
numbers as decimal logical types.
|
|
Optional |
|
|
Set to |
|
Optional |
|
|
Enables the use of Avro |
gg.handler.name.format.mapLargeNumbersAsStrings |
Optional |
|
false
|
Oracle GoldenGate supports the floating point and
integer source datatypes. Some of these datatypes may not fit into the
Avro primitive double or long datatypes. Set this property to
true to map the fields that do not fit into the
Avro primitive double or long datatypes to Avro string.
|
gg.handler.name.format.metaColumnsTemplate |
Optional | See Metacolumn Keywords | None |
The current meta column information can be configured in a simple manner and removes the explicit need to use: insertOpKey | updateOpKey | deleteOpKey |
truncateOpKey | includeTableName | includeOpTimestamp |
includeOpType | includePosition | includeCurrentTimestamp,
useIso8601Format It is a comma-delimited string consisting of one or more templated values that represent the template. For more information about the Metacolumn keywords, see Metacolumn Keywords. |
gg.handler.name.format.maxPrecision |
Optional | None | Positive Integer | Allows you to set the maximum precision for Avro decimal
logical types. Consuming applications may have limitations on Avro
precision (that is, Apache Spark supports a maximum precision of
38).
WARNING: Configuration of this property is not without risk. |
Parent topic: The Avro Operation Formatter
8.2.31.5.2.2.6 Review a Sample Configuration
The following is a sample configuration for the Avro Operation Formatter in the Java Adapter properties file:
gg.handler.hdfs.format=avro_op
gg.handler.hdfs.format.insertOpKey=I
gg.handler.hdfs.format.updateOpKey=U
gg.handler.hdfs.format.deleteOpKey=D
gg.handler.hdfs.format.truncateOpKey=T
gg.handler.hdfs.format.encoding=UTF-8
gg.handler.hdfs.format.wrapMessageInGenericAvroMessage=false
Parent topic: The Avro Operation Formatter
8.2.31.5.2.2.7 Metadata Change Events
If the replicated database and upstream Oracle GoldenGate replication process can propagate metadata change events, the Avro Operation Formatter can take action when metadata changes. Because Avro messages depend closely on their corresponding schema, metadata changes are important when you use Avro formatting.
An updated Avro schema is generated as soon as a table operation occurs after a metadata change event.
You must understand the impact of a metadata change event and change downstream targets to the new Avro schema. The tight dependency of Avro messages on Avro schemas may result in compatibility issues. Avro messages generated before the schema change may not be able to be deserialized with the newly generated Avro schema. Conversely, Avro messages generated after the schema change may not be able to be deserialized with the previous Avro schema. It is a best practice to use the same version of the Avro schema that was used to generate the message.
For more information, consult the Apache Avro documentation.
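Because messages must be read with the schema version that produced them, a consumer typically keeps the generated .avsc files and deserializes each message with the matching writer schema. The following is a minimal, illustrative Java sketch of that idea; it assumes the standard Apache Avro library is available to the consumer and that the schema file and message bytes are supplied by whatever means your pipeline provides.

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

// Hypothetical consumer-side helper: deserialize one Avro message using the
// schema version that produced it. The schema file and byte array are placeholders.
public class AvroMessageReader {
    public static GenericRecord read(File writerSchemaFile, byte[] messageBytes) throws IOException {
        // Parse the .avsc file that was generated when the message was written.
        Schema writerSchema = new Schema.Parser().parse(writerSchemaFile);
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(writerSchema);
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(messageBytes, null);
        // Fails if messageBytes were produced with an incompatible schema version.
        return reader.read(null, decoder);
    }
}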
Parent topic: The Avro Operation Formatter
8.2.31.5.2.2.8 Special Considerations
This section describes these special considerations:
Parent topic: The Avro Operation Formatter
8.2.31.5.2.2.8.1 Troubleshooting
Because Avro is a binary format, it is not human readable. However, when the log4j Java logging level is set to TRACE, Avro messages are deserialized and displayed in the log file as a JSON object, letting you view the structure and contents of the created Avro messages. Do not enable TRACE in a production environment, as it has a substantial impact on performance.
Parent topic: Special Considerations
8.2.31.5.2.2.8.2 Primary Key Updates
The Avro Operation Formatter creates messages with complete data of before-image and after-images for update operations. Therefore, the Avro Operation Formatter requires no special treatment for primary key updates.
Parent topic: Special Considerations
8.2.31.5.2.2.8.3 Generic Wrapper Message
Because Avro messages are not self describing, the receiver of the message must know the schema associated with the message before the message can be deserialized. Avro messages are binary and provide no consistent or reliable way to inspect the message contents in order to ascertain the message type. Therefore, Avro can be troublesome when messages are interlaced into a single stream of data such as Kafka.
The Avro formatter provides a special feature to wrap the Avro message in a generic Avro message. You can enable this functionality by setting the following configuration property:
gg.handler.name.format.wrapMessageInGenericAvroMessage=true
The generic message is an Avro message wrapping the Avro payload message that is common to all Avro messages that are output. The schema for the generic message is named generic_wrapper.avsc and is written to the output schema directory. This message has the following three fields:
- table_name: The fully qualified source table name.
- schema_fingerprint: The fingerprint of the Avro schema that generated the message. The fingerprint is generated using the parsingFingerprint64(Schema s) method on the org.apache.avro.SchemaNormalization class.
- payload: The wrapped Avro message.
The following is the Avro Formatter generic wrapper schema:
{ "type" : "record", "name" : "generic_wrapper", "namespace" : "oracle.goldengate", "fields" : [ { "name" : "table_name", "type" : "string" }, { "name" : "schema_fingerprint", "type" : "long" }, { "name" : "payload", "type" : "bytes" } ] }
Parent topic: Special Considerations
8.2.31.5.2.3 Avro Object Container File Formatter
Oracle GoldenGate for Big Data can write to HDFS in Avro Object Container File (OCF) format. Avro OCF handles schema evolution more efficiently than other formats. The Avro OCF Formatter also supports compression and decompression to allow more efficient use of disk space.
The HDFS Handler integrates with the Avro formatters to write files to HDFS in Avro OCF format. The Avro OCF format is required for Hive to read Avro data in HDFS. The Avro OCF format is detailed in the Avro specification, see http://avro.apache.org/docs/current/spec.html#Object+Container+Files.
You can configure the HDFS Handler to stream data in Avro OCF format, generate table definitions in Hive, and update table definitions in Hive in the case of a metadata change event.
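Because OCF files embed their schema, they can be read back without a separate schema file. The following is a small, illustrative Java sketch (not part of the handler) that uses the standard Apache Avro DataFileReader to dump the records of an OCF file copied out of HDFS; the file path argument is a placeholder.

import java.io.File;
import java.io.IOException;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

// Hypothetical verification sketch: read an Avro OCF file and print each record.
// OCF files carry their own schema, so no separate .avsc file is required to read them.
public class OcfFileDump {
    public static void main(String[] args) throws IOException {
        File ocfFile = new File(args[0]); // path to a local copy of an OCF file
        try (DataFileReader<GenericRecord> reader =
                new DataFileReader<>(ocfFile, new GenericDatumReader<GenericRecord>())) {
            System.out.println("Embedded schema: " + reader.getSchema());
            for (GenericRecord record : reader) {
                System.out.println(record);
            }
        }
    }
}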
8.2.31.5.2.3.1 Avro OCF Formatter Configuration Properties
Properties | Optional / Required | Legal Values | Default | Explanation |
---|---|---|---|---|
|
Optional |
Any string |
|
Indicator to be inserted into the output record to indicate an insert operation. |
|
Optional |
Any string |
|
Indicator to be inserted into the output record to indicate an update operation. |
|
Optional |
Any string |
|
Indicator to be inserted into the output record to indicate a delete operation. |
|
Optional |
Any string |
|
Indicator to be inserted into the output record to indicate a truncate operation. |
|
Optional |
Any legal encoding name or alias supported by Java. |
|
Controls the output encoding of generated JSON Avro schema. The JSON default is UTF-8. Avro messages are binary and support their own internal representation of encoding. |
|
Optional |
|
|
Controls the output typing of generated Avro messages. When the setting is |
|
Optional |
|
|
Controls how the formatter should handle update operations that change a primary key. Primary key operations can be problematic for the Avro Row formatter and require special consideration by you.
|
|
Optional |
|
|
Because schemas must be generated for Avro serialization to |
|
Optional |
Any legal, existing file system path |
|
The directory where generated Avro schemas are saved to the local file system. This property does not control where the Avro schema is written to in HDFS; that is controlled by an HDFS Handler property. |
|
Optional |
|
|
By default, the value of this property is true, and the format for the current timestamp is ISO8601. Set to |
|
Optional |
|
|
If set to
|
Parent topic: Avro Object Container File Formatter
8.2.31.5.3 Using the Delimited Text Formatter
- Column has a value: The column value is output.
- Column value is null: The default output value is NULL. The output for the case of a null column value is configurable.
- Column value is missing: The default output value is an empty string (""). The output for the case of a missing column value is configurable.
- Using the Delimited Text Row Formatter
The Delimited Text Row Formatter is the Delimited Text Formatter that was included in releases prior to the Oracle GoldenGate for Big Data 19.1.0.0 release. It writes the after change data for inserts and updates, and before change data for deletes. - Delimited Text Operation Formatter
The Delimited Text Operation Formatter is new functionality in the Oracle GoldenGate for Big Data 19.1.0.0.0 release. It outputs both before and after change data for insert, update and delete operations.
Parent topic: Pluggable Formatters
8.2.31.5.3.1 Using the Delimited Text Row Formatter
The Delimited Text Row Formatter is the Delimited Text Formatter that was included in releases prior to the Oracle GoldenGate for Big Data 19.1.0.0 release. It writes the after change data for inserts and updates, and before change data for deletes.
- Message Formatting Details
- Sample Formatted Messages
- Output Format Summary Log
- Configuration
- Metadata Change Events
- Additional Considerations
Parent topic: Using the Delimited Text Formatter
8.2.31.5.3.1.1 Message Formatting Details
The automated output of meta-column fields in generated delimited text messages has been removed as of Oracle GoldenGate for Big Data release 21.1. Meta-column fields can still be output; however, they need to be explicitly configured using the following property:
gg.handler.name.format.metaColumnsTemplate
To output the metacolumns as in previous versions, configure the following:
gg.handler.name.format.metaColumnsTemplate=${optype[op_type]},${objectname[table]},${timestamp[op_ts]},${currenttimestamp[current_ts]},${position[pos]}
To also include the primary key columns and the tokens, configure as follows:
gg.handler.name.format.metaColumnsTemplate=${optype[op_type]},${objectname[table]},${timestamp[op_ts]},${currenttimestamp[current_ts]},${position[pos]},${primarykeycolumns[primary_keys]},${alltokens[tokens]}
For more information, see the configuration property gg.handler.name.format.metaColumnsTemplate in the Delimited Text Formatter Configuration Properties table.
Formatting details (a parsing sketch follows this list):
- Operation Type: Indicates the type of database operation from the source trail file. Default values are I for insert, U for update, D for delete, and T for truncate. Output of this field is suppressible.
- Fully Qualified Table Name: The fully qualified table name is the source database table including the catalog name and the schema name. The format of the fully qualified table name is catalog_name.schema_name.table_name. The output of this field is suppressible.
- Operation Timestamp: The commit record timestamp from the source system. All operations in a transaction (unbatched transaction) will have the same operation timestamp. This timestamp is fixed, and the operation timestamp is the same if the trail file is replayed. The output of this field is suppressible.
- Current Timestamp: The timestamp of the current time when the delimited text formatter processes the current operation record. This timestamp follows the ISO-8601 format and includes microsecond precision. Replaying the trail file does not result in the same timestamp for the same operation. The output of this field is suppressible.
- Trail Position: The concatenated sequence number and RBA number from the source trail file. The trail position lets you trace the operation back to the source trail file. The sequence number is the source trail file number. The RBA number is the offset in the trail file. The output of this field is suppressible.
- Tokens: The token key value pairs from the source trail file. The output of this field in the delimited text output is suppressed unless the includeTokens configuration property on the corresponding handler is explicitly set to true.
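The field order above can be consumed programmatically. The following Java sketch is illustrative only; it assumes the metacolumn layout shown in the sample messages that follow (operation type, table name, operation timestamp, current timestamp, trail position, tokens, then column values) and the pipe field delimiter used in those samples.

import java.util.Arrays;
import java.util.List;

// Illustrative consumer-side sketch: split one delimited text row message into its
// fields, assuming the metacolumn layout and pipe delimiter used in the samples below.
public class DelimitedRowParser {
    public static void main(String[] args) {
        String message = "I|GG.TCUSTORD|2013-06-02 22:14:36.000000"
                + "|2015-09-18T13:23:01.612001|00000000000000001444"
                + "|R=AADPkvAAEAAEqL2AAA|WILL|1994-09-30:15:33:00|CAR|144|17520.00|3|100";

        // -1 keeps trailing empty fields, which matter for truncate messages.
        String[] fields = message.split("\\|", -1);
        String opType = fields[0];
        String table = fields[1];
        String opTimestamp = fields[2];
        String currentTimestamp = fields[3];
        String trailPosition = fields[4];
        String tokens = fields[5];
        List<String> columnValues = Arrays.asList(fields).subList(6, fields.length);

        System.out.printf("%s on %s at %s (pos %s, current %s, tokens %s): %s%n",
                opType, table, opTimestamp, trailPosition, currentTimestamp, tokens, columnValues);
    }
}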
Parent topic: Using the Delimited Text Row Formatter
8.2.31.5.3.1.2 Sample Formatted Messages
The following sections contain sample messages from the Delimited Text Formatter. The default field delimiter has been changed to a pipe character, |
, to more clearly display the message.
Parent topic: Using the Delimited Text Row Formatter
8.2.31.5.3.1.2.1 Sample Insert Message
I|GG.TCUSTORD|2013-06-02 22:14:36.000000|2015-09-18T13:23:01.612001|00000000000000001444|R=AADPkvAAEAAEqL2AAA|WILL|1994-09-30:15:33:00|CAR|144|17520.00|3|100
Parent topic: Sample Formatted Messages
8.2.31.5.3.1.2.2 Sample Update Message
U|GG.TCUSTORD|2013-06-02 22:14:41.000000|2015-09-18T13:23:01.987000|00000000000000002891|R=AADPkvAAEAAEqLzAAA|BILL|1995-12-31:15:00:00|CAR|765|14000.00|3|100
Parent topic: Sample Formatted Messages
8.2.31.5.3.1.2.3 Sample Delete Message
D,GG.TCUSTORD,2013-06-02 22:14:41.000000,2015-09-18T13:23:02.000000,00000000000000004338,L=206080450,6=9.0.80330,R=AADPkvAAEAAEqLzAAC,DAVE,1993-11-03:07:51:35,PLANE,600,,,
Parent topic: Sample Formatted Messages
8.2.31.5.3.1.2.4 Sample Truncate Message
T|GG.TCUSTORD|2013-06-02 22:14:41.000000|2015-09-18T13:23:02.001000|00000000000000004515|R=AADPkvAAEAAEqL2AAB|||||||
Parent topic: Sample Formatted Messages
8.2.31.5.3.1.3 Output Format Summary Log
If INFO level logging is enabled, the Java log4j logging logs a summary of the delimited text output format. A summary of the delimited fields is logged for each source table encountered and occurs when the first operation for that table is received by the Delimited Text formatter. This detailed explanation of the fields of the delimited text output may be useful when you perform an initial setup. When a metadata change event occurs, the summary of the delimited fields is regenerated and logged again at the first subsequent operation for that table.
Parent topic: Using the Delimited Text Row Formatter
8.2.31.5.3.1.4 Configuration
8.2.31.5.3.1.4.1 Review a Sample Configuration
The following is a sample configuration for the Delimited Text formatter in the Java Adapter configuration file:
gg.handler.name.format.includeColumnNames=false
gg.handler.name.format.insertOpKey=I
gg.handler.name.format.updateOpKey=U
gg.handler.name.format.deleteOpKey=D
gg.handler.name.format.truncateOpKey=T
gg.handler.name.format.encoding=UTF-8
gg.handler.name.format.fieldDelimiter=CDATA[\u0001]
gg.handler.name.format.lineDelimiter=CDATA[\n]
gg.handler.name.format.keyValueDelimiter=CDATA[=]
gg.handler.name.format.kevValuePairDelimiter=CDATA[,]
gg.handler.name.format.pkUpdateHandling=abend
gg.handler.name.format.nullValueRepresentation=NULL
gg.handler.name.format.missingValueRepresentation=CDATA[]
gg.handler.name.format.includeGroupCols=false
gg.handler.name.format=delimitedtext
Parent topic: Configuration
8.2.31.5.3.1.5 Metadata Change Events
Oracle GoldenGate for Big Data now handles metadata change events at runtime. This assumes that the replicated database and upstream replication processes are propagating metadata change events. The Delimited Text Formatter changes the output format to accommodate the change and the Delimited Text Formatter continues running.
Note:
A metadata change may affect downstream applications. Delimited text formats include a fixed number of fields that are positionally relevant. Deleting a column in the source table can be handled seamlessly during Oracle GoldenGate runtime, but results in a change in the total number of fields, and potentially changes the positional relevance of some fields. Adding an additional column or columns is probably the least impactful metadata change event, assuming that the new column is added to the end. Consider the impact of a metadata change event before executing the event. When metadata change events are frequent, Oracle recommends that you consider a more flexible and self-describing format, such as JSON or XML.
Parent topic: Using the Delimited Text Row Formatter
8.2.31.5.3.1.6 Additional Considerations
Exercise care when you choose field and line delimiters. It is important to choose delimiter values that will not occur in the content of the data.
The Java Adapter configuration trims leading and trailing characters from configuration values when they are determined to be whitespace. However, you may want to choose field delimiters, line delimiters, null value representations, and missing value representations that include or are fully considered to be whitespace. In these cases, you must employ specialized syntax in the Java Adapter configuration file to preserve the whitespace. To preserve the whitespace, when your configuration values contain leading or trailing characters that are considered whitespace, wrap the configuration value in a CDATA[] wrapper. For example, a configuration value of \n should be configured as CDATA[\n].
You can use regular expressions to search column values then replace matches with a specified value. You can use this search and replace functionality together with the Delimited Text Formatter to ensure that there are no collisions between column value contents and field and line delimiters. For more information, see Using Regular Expression Search and Replace.
Big Data applications store data differently from RDBMSs. Update and delete operations in an RDBMS result in a change to the existing data. However, in Big Data applications, data is appended instead of changed. Therefore, the current state of a given row consolidates all of the existing operations for that row in the HDFS system. This leads to some special scenarios as described in the following sections.
8.2.31.5.3.1.6.1 Primary Key Updates
In Big Data integrations, primary key update operations require special consideration and planning. Primary key updates modify one or more of the primary keys for the given row from the source database. Because data is appended in Big Data applications, a primary key update operation looks more like an insert than an update without any special handling. You can configure how the Delimited Text formatter handles primary key updates. These are the configurable behaviors:
Table 8-46 Configurable Behavior
Value | Description |
---|---|
|
By default the delimited text formatter terminates in the case of a primary key update. |
|
The primary key update is treated like any other update operation. Use this configuration alternative only if you can guarantee that the primary key is not used as selection criteria to select row data from a Big Data system. |
|
The primary key update is treated as a special case of a delete, using the before-image data and an insert using the after-image data. This configuration may more accurately model the effect of a primary key update in a Big Data application. However, if this configuration is selected it is important to have full supplemental logging enabled on replication at the source database. Without full supplemental logging, the delete operation will be correct, but the insert operation will not contain all of the data for all of the columns for a full representation of the row data in the Big Data application. |
Parent topic: Additional Considerations
8.2.31.5.3.1.6.2 Data Consolidation
Big Data applications append data to the underlying storage. Analytic tools generally spawn MapReduce programs that traverse the data files and consolidate all the operations for a given row into a single output. Therefore, it is important to specify the order of operations. The Delimited Text formatter provides a number of metadata fields to do this. The operation timestamp may be sufficient to fulfill this requirement, with the trail position providing a tie-breaking field when operations share the same operation timestamp. Alternatively, the current timestamp may be the best indicator of the order of operations in Big Data.
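As a rough illustration of that consolidation, the following Java sketch keeps only the latest operation per key, ordering by operation timestamp with the trail position as a tie-breaker. The Op type and its fields are assumptions standing in for one parsed delimited message; real jobs typically express the same logic in MapReduce, Hive, or Spark.

import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: consolidate appended operations to the latest state per key,
// ordering by operation timestamp and using the trail position as a tie-breaker.
public class RowConsolidator {
    record Op(String key, String opTimestamp, String trailPosition, List<String> columnValues) {}

    static final Comparator<Op> OPERATION_ORDER =
            Comparator.comparing(Op::opTimestamp).thenComparing(Op::trailPosition);

    public static Map<String, Op> latestByKey(List<Op> operations) {
        Map<String, Op> latest = new HashMap<>();
        for (Op op : operations) {
            // Keep whichever operation sorts later for the same key.
            latest.merge(op.key(), op,
                    (current, candidate) ->
                            OPERATION_ORDER.compare(candidate, current) >= 0 ? candidate : current);
        }
        return latest;
    }
}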
Parent topic: Additional Considerations
8.2.31.5.3.2 Delimited Text Operation Formatter
The Delimited Text Operation Formatter is new functionality in the Oracle GoldenGate for Big Data 19.1.0.0.0 release. It outputs both before and after change data for insert, update and delete operations.
- Message Formatting Details
- Sample Formatted Messages
- Output Format Summary Log
- Delimited Text Formatter Configuration Properties
- Review a Sample Configuration
- Metadata Change Events
Oracle GoldenGate for Big Data now handles metadata change events at runtime. This assumes that the replicated database and upstream replication processes are propagating metadata change events. The Delimited Text Formatter changes the output format to accommodate the change and the Delimited Text Formatter continues running. - Additional Considerations
Exercise care when you choose field and line delimiters. It is important to choose delimiter values that do not occur in the content of the data.
Parent topic: Using the Delimited Text Formatter
8.2.31.5.3.2.1 Message Formatting Details
The automated output of meta-column fields in generated delimited text messages has been removed as of Oracle GoldenGate for Big Data release 21.1. Meta-column fields can still be output; however, they need to be explicitly configured using the gg.handler.name.format.metaColumnsTemplate property. For more information, see the configuration property gg.handler.name.format.metaColumnsTemplate in the Delimited Text Formatter Configuration Properties table.
To output the metacolumns as in previous versions, configure the following:
gg.handler.name.format.metaColumnsTemplate=${optype[op_type]},${objectname[table]},${timestamp[op_ts]},${currenttimestamp[current_ts]},${position[pos]}
To also include the primary key columns and the tokens, configure as follows:
gg.handler.name.format.metaColumnsTemplate=${optype[op_type]},${objectname[table]},${timestamp[op_ts]},${currenttimestamp[current_ts]},${position[pos]},${primarykeycolumns[primary_keys]},${alltokens[tokens]}
Formatting details:
- Operation Type: Indicates the type of database operation from the source trail file. Default values are I for insert, U for update, D for delete, and T for truncate. Output of this field is suppressible.
- Fully Qualified Table Name: The fully qualified table name is the source database table including the catalog name and the schema name. The format of the fully qualified table name is catalog_name.schema_name.table_name. The output of this field is suppressible.
- Operation Timestamp: The commit record timestamp from the source system. All operations in a transaction (unbatched transaction) will have the same operation timestamp. This timestamp is fixed, and the operation timestamp is the same if the trail file is replayed. The output of this field is suppressible.
- Current Timestamp: The timestamp of the current time when the delimited text formatter processes the current operation record. This timestamp follows the ISO-8601 format and includes microsecond precision. Replaying the trail file does not result in the same timestamp for the same operation. The output of this field is suppressible.
- Trail Position: The concatenated sequence number and RBA number from the source trail file. The trail position lets you trace the operation back to the source trail file. The sequence number is the source trail file number. The RBA number is the offset in the trail file. The output of this field is suppressible.
- Tokens: The token key value pairs from the source trail file. The output of this field in the delimited text output is suppressed unless the includeTokens configuration property on the corresponding handler is explicitly set to true.
Parent topic: Delimited Text Operation Formatter
8.2.31.5.3.2.2 Sample Formatted Messages
The following sections contain sample messages from the Delimited Text Formatter. The default field delimiter has been changed to a pipe character, |
, to more clearly display the message.
Parent topic: Delimited Text Operation Formatter
8.2.31.5.3.2.2.1 Sample Insert Message
I|GG.TCUSTMER|2015-11-05 18:45:36.000000|2019-04-17T04:49:00.156000|00000000000000001956|R=AAKifQAAKAAAFDHAAA,t=,L=7824137832,6=2.3.228025||WILL||BG SOFTWARE CO.||SEATTLE||WA
Parent topic: Sample Formatted Messages
8.2.31.5.3.2.2.2 Sample Update Message
U|QASOURCE.TCUSTMER|2015-11-05 18:45:39.000000|2019-07-16T11:54:06.008002|00000000000000005100|R=AAKifQAAKAAAFDHAAE|ANN|ANN|ANN'S BOATS||SEATTLE|NEW YORK|WA|NY
Parent topic: Sample Formatted Messages
8.2.31.5.3.2.2.3 Sample Delete Message
D|QASOURCE.TCUSTORD|2015-11-05 18:45:39.000000|2019-07-16T11:54:06.009000|00000000000000005272|L=7824137921,R=AAKifSAAKAAAMZHAAE,6=9.9.479055|DAVE||1993-11-03 07:51:35||PLANE||600||135000.00||2||200|
Parent topic: Sample Formatted Messages
8.2.31.5.3.2.2.4 Sample Truncate Message
T|QASOURCE.TCUSTMER|2015-11-05 18:45:39.000000|2019-07-16T11:54:06.004002|00000000000000003600|R=AAKifQAAKAAAFDHAAE||||||||
Parent topic: Sample Formatted Messages
8.2.31.5.3.2.3 Output Format Summary Log
If INFO level logging is enabled, the Java log4j logging logs a summary of the delimited text output format. A summary of the delimited fields is logged for each source table encountered and occurs when the first operation for that table is received by the Delimited Text formatter. This detailed explanation of the fields of the delimited text output may be useful when you perform an initial setup. When a metadata change event occurs, the summary of the delimited fields is regenerated and logged again at the first subsequent operation for that table.
Parent topic: Delimited Text Operation Formatter
8.2.31.5.3.2.4 Delimited Text Formatter Configuration Properties
Table 8-47 Delimited Text Formatter Configuration Properties
Properties | Optional / Required | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.handler.name.format |
Required |
delimitedtext_op |
None |
Selects the Delimited Text Operation Formatter as the formatter. |
gg.handler.name.format.includeColumnNames |
Optional |
|
false |
Controls the output of writing the column names as a
delimited field preceding the column value. When
When
|
gg.handler.name.format.disableEscaping |
Optional |
|
false | Set to true to disable the
escaping of characters which conflict with the configured delimiters.
Ensure that it is set to true if
gg.handler.name.format.fieldDelimiter is set to a
value of multiple characters.
|
|
Optional |
Any string |
|
Indicator to be inserted into the output record to indicate an insert operation. |
|
Optional |
Any string |
|
Indicator to be inserted into the output record to indicate an update operation. |
|
Optional |
Any string |
|
Indicator to be inserted into the output record to indicate a delete operation. |
|
Optional |
Any string |
|
Indicator to be inserted into the output record to indicate a truncate operation. |
|
Optional |
Any encoding name or alias supported by Java. |
The native system encoding of the machine hosting the Oracle GoldenGate process. |
Determines the encoding of the output delimited text. |
|
Optional |
Any String |
|
The delimiter used between delimited fields. This value
supports |
|
Optional |
Any String |
Newline (the default Hive delimiter) |
The delimiter used after each record. This value supports CDATA[] wrapping.
|
|
Optional |
Any string |
|
Specifies a delimiter between keys and values in a map.
Key1=value1. Tokens are mapped values. Configuration value supports
|
|
Optional |
Any string |
|
Specifies a delimiter between key value pairs in a map.
|
|
Optional |
Any string |
NULL |
Specifies what is included in the delimited output in the
case of a NULL value. Configuration value supports
|
|
Optional |
Any string |
|
Specifies what is included in the delimited text output
in the case of a missing value. Configuration value supports
|
|
Optional |
|
|
Set to |
|
Optional |
|
|
Set to |
gg.handler.name.format.includeGroupCols |
Optional | true | false |
false |
If set to true , the columns are grouped
into sets of all names, all before values, and all after values
U,QASOURCE.TCUSTMER,2015-11-05 18:45:39.000000,2019-04-17T05:19:30.556000,00000000000000005100,R=AAKifQAAKAAAFDHAAE,CUST_CODE,NAME,CITY,STATE,ANN,ANN'S BOATS,SEATTLE,WA,ANN,,NEW YORK,NY |
gg.handler.name.format.enableFieldDescriptorHeaders |
Optional | true | false |
false |
Set to true to add a descriptive header
to each data file for delimited text output. The header will be the
individual field names separated by the field delimiter.
|
gg.handler.name.format.metaColumnsTemplate |
Optional | See Metacolumn Keywords. | None | The current meta column information can be configured in a simple manner and removes the explicit need to use the individual metacolumn properties. It is a comma-delimited string consisting of one or more templated values that represent the template. For more information about the Metacolumn keywords, see Metacolumn Keywords. This is an example that would produce a list of metacolumns: ${optype}, ${token.ROWID}, ${sys.username}, ${currenttimestamp} |
Parent topic: Delimited Text Operation Formatter
8.2.31.5.3.2.5 Review a Sample Configuration
The following is a sample configuration for the Delimited Text formatter in the Java Adapter configuration file:
gg.handler.name.format.includeColumnNames=false
gg.handler.name.format.insertOpKey=I
gg.handler.name.format.updateOpKey=U
gg.handler.name.format.deleteOpKey=D
gg.handler.name.format.truncateOpKey=T
gg.handler.name.format.encoding=UTF-8
gg.handler.name.format.fieldDelimiter=CDATA[\u0001]
gg.handler.name.format.lineDelimiter=CDATA[\n]
gg.handler.name.format.keyValueDelimiter=CDATA[=]
gg.handler.name.format.kevValuePairDelimiter=CDATA[,]
gg.handler.name.format.nullValueRepresentation=NULL
gg.handler.name.format.missingValueRepresentation=CDATA[]
gg.handler.name.format.includeGroupCols=false
gg.handler.name.format=delimitedtext_op
Parent topic: Delimited Text Operation Formatter
8.2.31.5.3.2.6 Metadata Change Events
Oracle GoldenGate for Big Data now handles metadata change events at runtime. This assumes that the replicated database and upstream replication processes are propagating metadata change events. The Delimited Text Formatter changes the output format to accommodate the change and the Delimited Text Formatter continues running.
Note:
A metadata change may affect downstream applications. Delimited text formats include a fixed number of fields that are positionally relevant. Deleting a column in the source table can be handled seamlessly during Oracle GoldenGate runtime, but results in a change in the total number of fields, and potentially changes the positional relevance of some fields. Adding an additional column or columns is probably the least impactful metadata change event, assuming that the new column is added to the end. Consider the impact of a metadata change event before executing the event. When metadata change events are frequent, Oracle recommends that you consider a more flexible and self-describing format, such as JSON or XML.
Parent topic: Delimited Text Operation Formatter
8.2.31.5.3.2.7 Additional Considerations
Exercise care when you choose field and line delimiters. It is important to choose delimiter values that do not occur in the content of the data.
The Java Adapter configuration trims leading and trailing characters from configuration values when they are determined to be whitespace. However, you may want to choose field delimiters, line delimiters, null value representations, and missing value representations that include or are fully considered to be whitespace. In these cases, you must employ specialized syntax in the Java Adapter configuration file to preserve the whitespace. To preserve the whitespace, when your configuration values contain leading or trailing characters that are considered whitespace, wrap the configuration value in a CDATA[] wrapper. For example, a configuration value of \n should be configured as CDATA[\n].
You can use regular expressions to search column values then replace matches with a specified value. You can use this search and replace functionality together with the Delimited Text Formatter to ensure that there are no collisions between column value contents and field and line delimiters. For more information, see Using Regular Expression Search and Replace.
Big Data applications store data differently from RDBMSs. Update and delete operations in an RDBMS result in a change to the existing data. However, in Big Data applications, data is appended instead of changed. Therefore, the current state of a given row consolidates all of the existing operations for that row in the HDFS system. This leads to some special scenarios as described in the following sections.
Parent topic: Delimited Text Operation Formatter
8.2.31.5.4 Using the JSON Formatter
The JavaScript Object Notation (JSON) formatter can output operations from the source trail file in either row-based format or operation-based format. It formats operation data from the source trail file into JSON objects. Each insert, update, delete, and truncate operation is formatted into an individual JSON message.
- Operation Metadata Formatting Details
The automated output of meta-column fields in generated JSONs has been removed as of Oracle GoldenGate for Big Data release 21.1. Meta-column fields can still be output. However, they need to be explicitly configured using the gg.handler.name.format.metaColumnsTemplate property. - Operation Data Formatting Details
- Row Data Formatting Details
- Sample JSON Messages
- JSON Schemas
- JSON Formatter Configuration Properties
- Review a Sample Configuration
- Metadata Change Events
- JSON Primary Key Updates
- Integrating Oracle Stream Analytics
Parent topic: Pluggable Formatters
8.2.31.5.4.1 Operation Metadata Formatting Details
The automated output of meta-column fields in generated JSONs has been removed as of Oracle GoldenGate for Big Data release 21.1. Meta-column fields can still be output. However, they need to be explicitly configured using the gg.handler.name.format.metaColumnsTemplate property.
To output the metacolumns as in previous versions, configure the following:
gg.handler.name.format.metaColumnsTemplate=${objectname[table]},${optype[op_type]},${timestamp[op_ts]},${currenttimestamp[current_ts]},${position[pos]}
To also include the primary key columns and the tokens, configure as follows:
gg.handler.name.format.metaColumnsTemplate=${objectname[table]},${optype[op_type]},${timestamp[op_ts]},${currenttimestamp[current_ts]},${position[pos]},${primarykeycolumns[primary_keys]},${alltokens[tokens]}
For more information, see the configuration property gg.handler.name.format.metaColumnsTemplate.
Parent topic: Using the JSON Formatter
8.2.31.5.4.2 Operation Data Formatting Details
JSON messages begin with the operation metadata fields, which are followed by the operation data fields. This data is represented by before
and after
members that are objects. These objects contain members whose keys are the column names and whose values are the column values.
Operation data is modeled as follows:
-
Inserts: Includes the after-image data.
-
Updates: Includes both the before-image and the after-image data.
-
Deletes: Includes the before-image data.
Column values for an operation from the source trail file can have one of three states: the column has a value, the column value is null, or the column value is missing. The JSON Formatter maps these column value states into the created JSON objects as follows:
-
The column has a value: The column value is output. In the following example, the member
STATE
has a value."after":{ "CUST_CODE":"BILL", "NAME":"BILL'S USED CARS", "CITY":"DENVER", "STATE":"CO" }
-
The column value is null: The default output value is a JSON NULL. In the following example, the member
STATE
is null."after":{ "CUST_CODE":"BILL", "NAME":"BILL'S USED CARS", "CITY":"DENVER", "STATE":null }
-
The column value is missing: The JSON contains no element for a missing column value. In the following example, the member
STATE
is missing."after":{ "CUST_CODE":"BILL", "NAME":"BILL'S USED CARS", "CITY":"DENVER", }
The default setting of the JSON Formatter is to map the data types from the source trail file to the associated JSON data type. JSON supports few data types, so this functionality usually results in the mapping of numeric fields from the source trail file to members typed as numbers. This data type mapping can be configured to treat all data as strings.
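As an illustration of how a consumer might read one operation-modeled message, the following Java sketch walks the metadata and the after image described above using Jackson. Jackson is an assumed consumer-side dependency, not something this formatter requires; any JSON parser can be used.

import java.util.Iterator;
import java.util.Map;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Illustrative consumer-side sketch: parse one operation-modeled JSON message and
// print the operation metadata plus the after-image column values.
public class OperationJsonReader {
    public static void main(String[] args) throws Exception {
        String message = "{\"table\":\"QASOURCE.TCUSTORD\",\"op_type\":\"U\","
                + "\"op_ts\":\"2015-11-05 18:45:39.000000\","
                + "\"before\":{\"PRODUCT_PRICE\":15000.00},"
                + "\"after\":{\"PRODUCT_PRICE\":14000.00}}";

        JsonNode op = new ObjectMapper().readTree(message);
        System.out.println(op.path("op_type").asText() + " on " + op.path("table").asText());

        // A missing column is simply absent; a null column appears with a JSON null value.
        for (Iterator<Map.Entry<String, JsonNode>> it = op.path("after").fields(); it.hasNext(); ) {
            Map.Entry<String, JsonNode> column = it.next();
            System.out.println("after." + column.getKey() + " = " + column.getValue());
        }
    }
}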
Parent topic: Using the JSON Formatter
8.2.31.5.4.3 Row Data Formatting Details
JSON messages begin with the operation metadata fields, which are followed by the operation data fields. For row data formatting, these are the source column names and source column values as JSON key value pairs. This data is represented by before
and after
members that are objects. These objects contain members whose keys are the column names and whose values are the column values.
Row data is modeled as follows:
-
Inserts: Includes the after-image data.
-
Updates: Includes the after-image data.
-
Deletes: Includes the before-image data.
Column values for an operation from the source trail file can have one of three states: the column has a value, the column value is null, or the column value is missing. The JSON Formatter maps these column value states into the created JSON objects as follows:
-
The column has a value: The column value is output. In the following example, the member
STATE
has a value."CUST_CODE":"BILL", "NAME":"BILL'S USED CARS", "CITY":"DENVER", "STATE":"CO" }
-
The column value is null :The default output value is a JSON NULL. In the following example, the member
STATE
is null."CUST_CODE":"BILL", "NAME":"BILL'S USED CARS", "CITY":"DENVER", "STATE":null }
-
The column value is missing: The JSON contains no element for a missing column value. In the following example, the member
STATE
is missing."CUST_CODE":"BILL", "NAME":"BILL'S USED CARS", "CITY":"DENVER", }
The default setting of the JSON Formatter is to map the data types from the source trail file to the associated JSON data type. JSON supports few data types, so this functionality usually results in the mapping of numeric fields from the source trail file to members typed as numbers. This data type mapping can be configured to treat all data as strings.
Parent topic: Using the JSON Formatter
8.2.31.5.4.4 Sample JSON Messages
The following topics are sample JSON messages created by the JSON Formatter for insert, update, delete, and truncate operations.
- Sample Operation Modeled JSON Messages
- Sample Flattened Operation Modeled JSON Messages
- Sample Row Modeled JSON Messages
- Sample Primary Key Output JSON Message
Parent topic: Using the JSON Formatter
8.2.31.5.4.4.1 Sample Operation Modeled JSON Messages
Insert
{
"table":"QASOURCE.TCUSTORD",
"op_type":"I",
"op_ts":"2015-11-05 18:45:36.000000",
"current_ts":"2016-10-05T10:15:51.267000",
"pos":"00000000000000002928",
"after":{
"CUST_CODE":"WILL",
"ORDER_DATE":"1994-09-30:15:33:00",
"PRODUCT_CODE":"CAR",
"ORDER_ID":144,
"PRODUCT_PRICE":17520.00,
"PRODUCT_AMOUNT":3,
"TRANSACTION_ID":100
}
}
Update
{
"table":"QASOURCE.TCUSTORD",
"op_type":"U",
"op_ts":"2015-11-05 18:45:39.000000",
"current_ts":"2016-10-05T10:15:51.310002",
"pos":"00000000000000004300",
"before":{
"CUST_CODE":"BILL",
"ORDER_DATE":"1995-12-31:15:00:00",
"PRODUCT_CODE":"CAR",
"ORDER_ID":765,
"PRODUCT_PRICE":15000.00,
"PRODUCT_AMOUNT":3,
"TRANSACTION_ID":100
},
"after":{
"CUST_CODE":"BILL",
"ORDER_DATE":"1995-12-31:15:00:00",
"PRODUCT_CODE":"CAR",
"ORDER_ID":765,
"PRODUCT_PRICE":14000.00
}
}
Delete
{
"table":"QASOURCE.TCUSTORD",
"op_type":"D",
"op_ts":"2015-11-05 18:45:39.000000",
"current_ts":"2016-10-05T10:15:51.312000",
"pos":"00000000000000005272",
"before":{
"CUST_CODE":"DAVE",
"ORDER_DATE":"1993-11-03:07:51:35",
"PRODUCT_CODE":"PLANE",
"ORDER_ID":600,
"PRODUCT_PRICE":135000.00,
"PRODUCT_AMOUNT":2,
"TRANSACTION_ID":200
}
}
Truncate
{
"table":"QASOURCE.TCUSTORD",
"op_type":"T",
"op_ts":"2015-11-05 18:45:39.000000",
"current_ts":"2016-10-05T10:15:51.312001",
"pos":"00000000000000005480",
}
Parent topic: Sample JSON Messages
8.2.31.5.4.4.2 Sample Flattened Operation Modeled JSON Messages
Insert
{
"table":"QASOURCE.TCUSTORD",
"op_type":"I",
"op_ts":"2015-11-05 18:45:36.000000",
"current_ts":"2016-10-05T10:34:47.956000",
"pos":"00000000000000002928",
"after.CUST_CODE":"WILL",
"after.ORDER_DATE":"1994-09-30:15:33:00",
"after.PRODUCT_CODE":"CAR",
"after.ORDER_ID":144,
"after.PRODUCT_PRICE":17520.00,
"after.PRODUCT_AMOUNT":3,
"after.TRANSACTION_ID":100
}
Update
{
"table":"QASOURCE.TCUSTORD",
"op_type":"U",
"op_ts":"2015-11-05 18:45:39.000000",
"current_ts":"2016-10-05T10:34:48.192000",
"pos":"00000000000000004300",
"before.CUST_CODE":"BILL",
"before.ORDER_DATE":"1995-12-31:15:00:00",
"before.PRODUCT_CODE":"CAR",
"before.ORDER_ID":765,
"before.PRODUCT_PRICE":15000.00,
"before.PRODUCT_AMOUNT":3,
"before.TRANSACTION_ID":100,
"after.CUST_CODE":"BILL",
"after.ORDER_DATE":"1995-12-31:15:00:00",
"after.PRODUCT_CODE":"CAR",
"after.ORDER_ID":765,
"after.PRODUCT_PRICE":14000.00
}
Delete
{
"table":"QASOURCE.TCUSTORD",
"op_type":"D",
"op_ts":"2015-11-05 18:45:39.000000",
"current_ts":"2016-10-05T10:34:48.193000",
"pos":"00000000000000005272",
"before.CUST_CODE":"DAVE",
"before.ORDER_DATE":"1993-11-03:07:51:35",
"before.PRODUCT_CODE":"PLANE",
"before.ORDER_ID":600,
"before.PRODUCT_PRICE":135000.00,
"before.PRODUCT_AMOUNT":2,
"before.TRANSACTION_ID":200
}
Truncate
{
"table":"QASOURCE.TCUSTORD",
"op_type":"D",
"op_ts":"2015-11-05 18:45:39.000000",
"current_ts":"2016-10-05T10:34:48.193001",
"pos":"00000000000000005480",
"before.CUST_CODE":"JANE",
"before.ORDER_DATE":"1995-11-11:13:52:00",
"before.PRODUCT_CODE":"PLANE",
"before.ORDER_ID":256,
"before.PRODUCT_PRICE":133300.00,
"before.PRODUCT_AMOUNT":1,
"before.TRANSACTION_ID":100
}
Parent topic: Sample JSON Messages
8.2.31.5.4.4.3 Sample Row Modeled JSON Messages
Insert
{
"table":"QASOURCE.TCUSTORD",
"op_type":"I",
"op_ts":"2015-11-05 18:45:36.000000",
"current_ts":"2016-10-05T11:10:42.294000",
"pos":"00000000000000002928",
"CUST_CODE":"WILL",
"ORDER_DATE":"1994-09-30:15:33:00",
"PRODUCT_CODE":"CAR",
"ORDER_ID":144,
"PRODUCT_PRICE":17520.00,
"PRODUCT_AMOUNT":3,
"TRANSACTION_ID":100
}
Update
{
"table":"QASOURCE.TCUSTORD",
"op_type":"U",
"op_ts":"2015-11-05 18:45:39.000000",
"current_ts":"2016-10-05T11:10:42.350005",
"pos":"00000000000000004300",
"CUST_CODE":"BILL",
"ORDER_DATE":"1995-12-31:15:00:00",
"PRODUCT_CODE":"CAR",
"ORDER_ID":765,
"PRODUCT_PRICE":14000.00
}
Delete
{
"table":"QASOURCE.TCUSTORD",
"op_type":"D",
"op_ts":"2015-11-05 18:45:39.000000",
"current_ts":"2016-10-05T11:10:42.351002",
"pos":"00000000000000005272",
"CUST_CODE":"DAVE",
"ORDER_DATE":"1993-11-03:07:51:35",
"PRODUCT_CODE":"PLANE",
"ORDER_ID":600,
"PRODUCT_PRICE":135000.00,
"PRODUCT_AMOUNT":2,
"TRANSACTION_ID":200
}
Truncate
{
"table":"QASOURCE.TCUSTORD",
"op_type":"T",
"op_ts":"2015-11-05 18:45:39.000000",
"current_ts":"2016-10-05T11:10:42.351003",
"pos":"00000000000000005480",
}
Parent topic: Sample JSON Messages
8.2.31.5.4.4.4 Sample Primary Key Output JSON Message
{ "table":"DDL_OGGSRC.TCUSTMER", "op_type":"I", "op_ts":"2015-10-26 03:00:06.000000", "current_ts":"2016-04-05T08:59:23.001000", "pos":"00000000000000006605", "primary_keys":[ "CUST_CODE" ], "after":{ "CUST_CODE":"WILL", "NAME":"BG SOFTWARE CO.", "CITY":"SEATTLE", "STATE":"WA" } }
Parent topic: Sample JSON Messages
8.2.31.5.4.5 JSON Schemas
By default, JSON schemas are generated for each source table encountered. JSON schemas are generated on a just in time basis when an operation for that table is first encountered. A JSON schema is not required to parse a JSON object. However, many JSON parsers can use a JSON schema to perform a validating parse of a JSON object. Alternatively, you can review the JSON schemas to understand the layout of output JSON objects. By default, the JSON schemas are created in the GoldenGate_Home/dirdef directory and are named by the following convention: FULLY_QUALIFIED_TABLE_NAME.schema.json
The generation of the JSON schemas is suppressible.
- The following JSON schema example is for the JSON object listed in Sample Operation Modeled JSON Messages.
{ "$schema":"http://json-schema.org/draft-04/schema#", "title":"QASOURCE.TCUSTORD", "description":"JSON schema for table QASOURCE.TCUSTORD", "definitions":{ "row":{ "type":"object", "properties":{ "CUST_CODE":{ "type":[ "string", "null" ] }, "ORDER_DATE":{ "type":[ "string", "null" ] }, "PRODUCT_CODE":{ "type":[ "string", "null" ] }, "ORDER_ID":{ "type":[ "number", "null" ] }, "PRODUCT_PRICE":{ "type":[ "number", "null" ] }, "PRODUCT_AMOUNT":{ "type":[ "integer", "null" ] }, "TRANSACTION_ID":{ "type":[ "number", "null" ] } }, "additionalProperties":false }, "tokens":{ "type":"object", "description":"Token keys and values are free form key value pairs.", "properties":{ }, "additionalProperties":true } }, "type":"object", "properties":{ "table":{ "description":"The fully qualified table name", "type":"string" }, "op_type":{ "description":"The operation type", "type":"string" }, "op_ts":{ "description":"The operation timestamp", "type":"string" }, "current_ts":{ "description":"The current processing timestamp", "type":"string" }, "pos":{ "description":"The position of the operation in the data source", "type":"string" }, "primary_keys":{ "description":"Array of the primary key column names.", "type":"array", "items":{ "type":"string" }, "minItems":0, "uniqueItems":true }, "tokens":{ "$ref":"#/definitions/tokens" }, "before":{ "$ref":"#/definitions/row" }, "after":{ "$ref":"#/definitions/row" } }, "required":[ "table", "op_type", "op_ts", "current_ts", "pos" ], "additionalProperties":false }
- The following JSON schema example is for the JSON object listed in Sample Flattened Operation Modeled JSON Messages.
{ "$schema":"http://json-schema.org/draft-04/schema#", "title":"QASOURCE.TCUSTORD", "description":"JSON schema for table QASOURCE.TCUSTORD", "definitions":{ "tokens":{ "type":"object", "description":"Token keys and values are free form key value pairs.", "properties":{ }, "additionalProperties":true } }, "type":"object", "properties":{ "table":{ "description":"The fully qualified table name", "type":"string" }, "op_type":{ "description":"The operation type", "type":"string" }, "op_ts":{ "description":"The operation timestamp", "type":"string" }, "current_ts":{ "description":"The current processing timestamp", "type":"string" }, "pos":{ "description":"The position of the operation in the data source", "type":"string" }, "primary_keys":{ "description":"Array of the primary key column names.", "type":"array", "items":{ "type":"string" }, "minItems":0, "uniqueItems":true }, "tokens":{ "$ref":"#/definitions/tokens" }, "before.CUST_CODE":{ "type":[ "string", "null" ] }, "before.ORDER_DATE":{ "type":[ "string", "null" ] }, "before.PRODUCT_CODE":{ "type":[ "string", "null" ] }, "before.ORDER_ID":{ "type":[ "number", "null" ] }, "before.PRODUCT_PRICE":{ "type":[ "number", "null" ] }, "before.PRODUCT_AMOUNT":{ "type":[ "integer", "null" ] }, "before.TRANSACTION_ID":{ "type":[ "number", "null" ] }, "after.CUST_CODE":{ "type":[ "string", "null" ] }, "after.ORDER_DATE":{ "type":[ "string", "null" ] }, "after.PRODUCT_CODE":{ "type":[ "string", "null" ] }, "after.ORDER_ID":{ "type":[ "number", "null" ] }, "after.PRODUCT_PRICE":{ "type":[ "number", "null" ] }, "after.PRODUCT_AMOUNT":{ "type":[ "integer", "null" ] }, "after.TRANSACTION_ID":{ "type":[ "number", "null" ] } }, "required":[ "table", "op_type", "op_ts", "current_ts", "pos" ], "additionalProperties":false }
- The following JSON schema example is for the JSON object listed in Sample Row Modeled JSON Messages.
{ "$schema":"http://json-schema.org/draft-04/schema#", "title":"QASOURCE.TCUSTORD", "description":"JSON schema for table QASOURCE.TCUSTORD", "definitions":{ "tokens":{ "type":"object", "description":"Token keys and values are free form key value pairs.", "properties":{ }, "additionalProperties":true } }, "type":"object", "properties":{ "table":{ "description":"The fully qualified table name", "type":"string" }, "op_type":{ "description":"The operation type", "type":"string" }, "op_ts":{ "description":"The operation timestamp", "type":"string" }, "current_ts":{ "description":"The current processing timestamp", "type":"string" }, "pos":{ "description":"The position of the operation in the data source", "type":"string" }, "primary_keys":{ "description":"Array of the primary key column names.", "type":"array", "items":{ "type":"string" }, "minItems":0, "uniqueItems":true }, "tokens":{ "$ref":"#/definitions/tokens" }, "CUST_CODE":{ "type":[ "string", "null" ] }, "ORDER_DATE":{ "type":[ "string", "null" ] }, "PRODUCT_CODE":{ "type":[ "string", "null" ] }, "ORDER_ID":{ "type":[ "number", "null" ] }, "PRODUCT_PRICE":{ "type":[ "number", "null" ] }, "PRODUCT_AMOUNT":{ "type":[ "integer", "null" ] }, "TRANSACTION_ID":{ "type":[ "number", "null" ] } }, "required":[ "table", "op_type", "op_ts", "current_ts", "pos" ], "additionalProperties":false }
Parent topic: Using the JSON Formatter
8.2.31.5.4.6 JSON Formatter Configuration Properties
Table 8-48 JSON Formatter Configuration Properties
Properties | Required/ Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
|
Optional |
|
None |
Controls whether the generated JSON output messages are
operation modeled or row modeled. Set to |
|
Optional |
Any string |
|
Indicator to be inserted into the output record to indicate an insert operation. |
|
Optional |
Any string |
|
Indicator to be inserted into the output record to indicate an update operation. |
|
Optional |
Any string |
|
Indicator to be inserted into the output record to indicate a delete operation. |
|
Optional |
Any string |
|
Indicator to be inserted into the output record to indicate a truncate operation. |
|
Optional |
|
|
Controls the output format of the JSON data. True formats the data with white space for easy reading. False generates more compact output that is difficult to read. |
|
Optional |
Any string |
|
Inserts a delimiter between generated JSONs so that they
can be more easily parsed in a continuous stream of data.
Configuration value supports |
|
Optional |
|
|
Controls the generation of JSON schemas for the generated JSON documents. JSON schemas are generated on a table-by-table basis. A JSON schema is not required to parse a JSON document. However, a JSON schema can help indicate what the JSON documents look like and can be used for a validating JSON parse. |
|
Optional |
Any legal, existing file system path |
|
Controls the output location of generated JSON schemas. |
|
Optional |
|
|
Controls the output typing of generated JSON documents.
When |
|
Optional |
Any legal encoding name or alias supported by Java. |
|
Controls the output encoding of generated JSON schemas and documents. |
|
Optional |
|
|
Controls the version of created schemas. Schema
versioning creates a schema with a timestamp in the schema directory
on the local file system every time a new schema is created.
|
|
Optional |
|
|
Controls the format of the current timestamp. The default
is the ISO 8601 format. A setting of |
|
Optional |
|
|
Controls sending flattened JSON formatted data to the
target entity. Must be set to This property is applicable only to Operation Formatted
JSON ( |
|
Optional |
Any legal character or character string for a JSON field name. |
|
Controls the delimiter for concatenated JSON element
names. This property supports |
|
Optional |
Any legal character or character string for a JSON field name. |
Any legal JSON attribute name. |
Allows you to set whether the JSON element that contains the before-change column values can be renamed. This property is only applicable to Operation Formatted
JSON ( |
|
Optional |
Any legal character or character string for a JSON field name. |
Any legal JSON attribute name. |
Allows you to set whether the JSON element that contains the after-change column values can be renamed. This property is only applicable to Operation Formatted
JSON ( |
|
Optional |
|
|
Specifies how the formatter handles update operations that change a primary key. Primary key operations can be problematic for the JSON formatter and require special consideration. You can only use this property in conjunction with the row modeled JSON output messages. This property is only applicable to Row Formatted JSON
(
|
gg.handler.name.format.omitNullValues |
Optional |
true | false |
|
Set to |
gg.handler.name.format.omitNullValuesSpecialUpdateHandling |
Optional | true | false |
false |
Only applicable if
gg.handler.name.format.omitNullValues=true . When
set to true , it provides special handling to propagate
the null value on the update after image if the before image data is
missing or has a value.
|
gg.handler.name.format.enableJsonArrayOutput |
Optional | true | false |
false |
Set to true to nest JSON documents
representing the operation data into a JSON array. This works for file
output and Kafka messages in transaction mode.
|
gg.handler.name.format.metaColumnsTemplate |
Optional | See Metacolumn Keywords | None |
The current meta column information can be configured in a simple manner and removes the explicit need to use insertOpKey, updateOpKey, deleteOpKey, truncateOpKey, includeTableName, includeOpTimestamp, includeOpType, includePosition, includeCurrentTimestamp, and useIso8601Format. It is a comma-delimited string consisting of one or more templated values that represent the template. For more information about the Metacolumn keywords, see Metacolumn Keywords. This is an example that would produce a list of metacolumns: |
Parent topic: Using the JSON Formatter
8.2.31.5.4.7 Review a Sample Configuration
The following is a sample configuration for the JSON Formatter in the Java Adapter configuration file:
gg.handler.hdfs.format=json
gg.handler.hdfs.format.insertOpKey=I
gg.handler.hdfs.format.updateOpKey=U
gg.handler.hdfs.format.deleteOpKey=D
gg.handler.hdfs.format.truncateOpKey=T
gg.handler.hdfs.format.prettyPrint=false
gg.handler.hdfs.format.jsonDelimiter=CDATA[]
gg.handler.hdfs.format.generateSchema=true
gg.handler.hdfs.format.schemaDirectory=dirdef
gg.handler.hdfs.format.treatAllColumnsAsStrings=false
Parent topic: Using the JSON Formatter
8.2.31.5.4.8 Metadata Change Events
Metadata change events are handled at runtime. When metadata is changed in a table, the JSON schema is regenerated the next time an operation for the table is encountered. The content of created JSON messages changes to reflect the metadata change. For example, if an additional column is added, the new column is included in created JSON messages after the metadata change event.
Parent topic: Using the JSON Formatter
8.2.31.5.4.9 JSON Primary Key Updates
When the JSON formatter is configured to model operation data, primary key updates require no special treatment and are treated like any other update. The before and after values reflect the change in the primary key.
When the JSON formatter is configured to model row data, primary key updates must be specially handled. The default behavior is to abend. However, by using the gg.handler.name.format.pkUpdateHandling
configuration property, you can configure the JSON formatter to model row data to treat primary key updates as either a regular update or as delete and then insert operations. When you configure the formatter to handle primary key updates as delete and insert operations, Oracle recommends that you configure your replication stream to contain the complete before-image and after-image data for updates. Otherwise, the generated insert operation for a primary key update will be missing data for fields that did not change.
Parent topic: Using the JSON Formatter
8.2.31.5.4.10 Integrating Oracle Stream Analytics
You can integrate Oracle GoldenGate for Big Data with Oracle Stream Analytics (OSA) by sending operation-modeled JSON messages to the Kafka Handler. This works only when the JSON formatter is configured to output operation-modeled JSON messages.
Because OSA requires flattened JSON objects, a new feature in the JSON formatter generates flattened JSONs. To use this feature, set the gg.handler.name.format.flatten property to true (the default setting is false). The following is an example of a flattened JSON file:
{
"table":"QASOURCE.TCUSTMER",
"op_type":"U",
"op_ts":"2015-11-05 18:45:39.000000",
"current_ts":"2016-06-22T13:38:45.335001",
"pos":"00000000000000005100",
"before.CUST_CODE":"ANN",
"before.NAME":"ANN'S BOATS",
"before.CITY":"SEATTLE",
"before.STATE":"WA",
"after.CUST_CODE":"ANN",
"after.CITY":"NEW YORK",
"after.STATE":"NY"
}
Parent topic: Using the JSON Formatter
8.2.31.5.5 Using the Length Delimited Value Formatter
The Length Delimited Value (LDV) Formatter is a row-based formatter. It formats database operations from the source trail file into a length delimited value output. Each insert, update, delete, or truncate operation from the source trail is formatted into an individual length delimited message.
With the length delimited format, there are no field delimiters. The fields are variable in size based on the data.
By default, the length delimited formatter maps these column value states into the length delimited value output. Column values for an operation from the source trail file can have one of three states:
- Column has a value: The column value is output with the prefix indicator P.
- Column value is NULL: The default output value is N. The output for the case of a NULL column value is configurable.
- Column value is missing: The default output value is M. The output for the case of a missing column value is configurable.
- Formatting Message Details
- Sample Formatted Messages
- LDV Formatter Configuration Properties
- Additional Considerations
Parent topic: Pluggable Formatters
8.2.31.5.5.1 Formatting Message Details
The default format for output of data is the following:
- First is the row length, followed by metadata:
<ROW LENGTH><PRESENT INDICATOR><FIELD LENGTH><OPERATION TYPE><PRESENT INDICATOR><FIELD LENGTH><FULLY QUALIFIED TABLE NAME><PRESENT INDICATOR><FIELD LENGTH><OPERATION TIMESTAMP><PRESENT INDICATOR><FIELD LENGTH><CURRENT TIMESTAMP><PRESENT INDICATOR><FIELD LENGTH><TRAIL POSITION><PRESENT INDICATOR><FIELD LENGTH><TOKENS>
Or
<ROW LENGTH><FIELD LENGTH><FULLY QUALIFIED TABLE NAME><FIELD LENGTH><OPERATION TIMESTAMP><FIELD LENGTH><CURRENT TIMESTAMP><FIELD LENGTH><TRAIL POSITION><FIELD LENGTH><TOKENS>
- Next is the row data:
<PRESENT INDICATOR><FIELD LENGTH><COLUMN 1 VALUE><PRESENT INDICATOR><FIELD LENGTH><COLUMN N VALUE>
Parent topic: Using the Length Delimited Value Formatter
8.2.31.5.5.2 Sample Formatted Messages
- Insert Message:
0133P01IP161446749136000000P161529311765024000P262015-11-05 18:45:36.000000P04WILLP191994-09-30 15:33:00P03CARP03144P0817520.00P013P03100
- Update Message
0133P01UP161446749139000000P161529311765035000P262015-11-05 18:45:39.000000P04BILLP191995-12-31 15:00:00P03CARP03765P0814000.00P013P03100
- Delete Message
0136P01DP161446749139000000P161529311765038000P262015-11-05 18:45:39.000000P04DAVEP191993-11-03 07:51:35P05PLANEP03600P09135000.00P012P03200
Parent topic: Using the Length Delimited Value Formatter
8.2.31.5.5.3 LDV Formatter Configuration Properties
Table 8-49 LDV Formatter Configuration Properties
Properties | Required/ Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.handler.name.format.binaryLengthMode | Optional | true / false | false | The output can be controlled to display the field or record length in either binary or ASCII format. If set to true, the length is output in binary format; if set to false, the length is output in ASCII format. |
gg.handler.name.format.recordLength | Optional | 4 / 8 | 4 | Set to the number of bytes (4 or 8) used to output the record length. |
gg.handler.name.format.fieldLength | Optional | 2 / 4 | 2 | Set to the number of bytes (2 or 4) used to output the field length. |
gg.handler.name.format.legacyFormat | Optional | true / false | false | Set to true to use the legacy output format. |
gg.handler.name.format.presentValue | Optional | Any string | P | Use to configure what is included in the output when a column value is present. This value supports CDATA[] wrapping. |
gg.handler.name.format.missingValue | Optional | Any string | M | Use to configure what is included in the output when a missing value is present. This value supports CDATA[] wrapping. |
gg.handler.name.format.nullValue | Optional | Any string | N | Use to configure what is included in the output when a NULL value is present. This value supports CDATA[] wrapping. |
gg.handler.name.format.metaColumnsTemplate | Optional | See Metacolumn Keywords. | None | Use to configure the current meta column information in a simple manner and remove the explicit need for individual metadata configuration properties. A comma-delimited string consisting of one or more templated values represents the template. This example produces a list of meta columns: ${optype}, ${token.ROWID}, ${sys.username}, ${currenttimestamp}. See Metacolumn Keywords. |
gg.handler.name.format.pkUpdateHandling | Optional | abend / update / delete-insert | abend | Specifies how the formatter handles update operations that change a primary key. Primary key operations can be problematic for the text formatter and require special consideration by you. |
gg.handler.name.format.encoding | Optional | Any encoding name or alias supported by Java. | The native system encoding of the machine hosting the Oracle GoldenGate process. | Use to set the output encoding for character data and columns. |
Review a Sample Configuration
#The LDV Handler
gg.handler.filewriter.format=binary
gg.handler.filewriter.format.binaryLengthMode=false
gg.handler.filewriter.format.recordLength=4
gg.handler.filewriter.format.fieldLength=2
gg.handler.filewriter.format.legacyFormat=false
gg.handler.filewriter.format.presentValue=CDATA[P]
gg.handler.filewriter.format.missingValue=CDATA[M]
gg.handler.filewriter.format.nullValue=CDATA[N]
gg.handler.filewriter.format.metaColumnsTemplate=${optype},${timestampmicro},${currenttimestampmicro},${timestamp}
gg.handler.filewriter.format.pkUpdateHandling=abend
Parent topic: Using the Length Delimited Value Formatter
8.2.31.5.5.4 Additional Considerations
Big Data applications differ from RDBMSs in how data is stored. Update and delete operations in an RDBMS result in a change to the existing data. In Big Data applications, data is not changed; it is simply appended to existing data. The current state of a given row becomes a consolidation of all of the existing operations for that row in the HDFS system.
Primary Key Updates
Primary key update operations require special consideration and planning for Big Data integrations. Primary key updates are update operations that modify one or more of the primary keys of a given row in the source database. Because data is simply appended in Big Data applications, a primary key update operation looks more like a new insert than an update without special handling. The Length Delimited Value Formatter provides specialized, configurable handling for primary key updates. These are the configurable behaviors:
Table 8-50 Primary Key Update Behaviors
Value | Description |
---|---|
Abend |
The default behavior is that the length delimited value formatter will abend in the case of a primary key update. |
Update |
With this configuration the primary key update will be treated just like any other update operation. This configuration alternative should only be selected if you can guarantee that the primary key that is being changed is not being used as the selection criteria when selecting row data from a Big Data system. |
Delete-Insert |
Using this configuration, the primary key update is treated as a special case of a delete using the before-image data and an insert using the after-image data. This configuration may more accurately model the effect of a primary key update in a Big Data application. However, if this configuration is selected, it is important to have full supplemental logging enabled on replication at the source database. Without full supplemental logging, the delete operation will be correct, but the insert operation will not contain all of the data for all of the columns for a full representation of the row data in the Big Data application. |
Consolidating Data
Big Data applications simply append data to the underlying storage. Typically, analytic tools spawn MapReduce programs that traverse the data files and consolidate all the operations for a given row into a single output. It is important to have an indicator of the order of operations. The operation timestamp may be sufficient to fulfill this requirement. However, two update operations may have the same operation timestamp, especially if they share a common transaction. The trail position can provide a tie-breaking field for operations with the same operation timestamp. Lastly, the current timestamp may provide the best indicator of order of operations in Big Data.
Parent topic: Using the Length Delimited Value Formatter
8.2.31.5.6 Using the XML Formatter
The XML Formatter formats before-image and after-image data from the source trail file into an XML document representation of the operation data. The format of the XML document is effectively the same as the XML format in the previous releases of the Oracle GoldenGate Java Adapter.
- Message Formatting Details
- Sample XML Messages
- XML Schema
- XML Formatter Configuration Properties
- Review a Sample Configuration
- Metadata Change Events
- Primary Key Updates
Parent topic: Pluggable Formatters
8.2.31.5.6.1 Message Formatting Details
The XML formatted messages contain the following information:
Table 8-51 XML formatting details
Value | Description |
---|---|
table | The fully qualified table name. |
type | The operation type. |
current_ts | The current timestamp is the time when the formatter processed the current operation record. This timestamp follows the ISO-8601 format and includes microsecond precision. Replaying the trail file does not result in the same timestamp for the same operation. |
pos | The position from the source trail file. |
numCols | The total number of columns in the source table. |
col | The col elements hold the before and after images of the column values for the operation. |
tokens | The tokens element holds the token keys and values from the source trail file. |
Parent topic: Using the XML Formatter
8.2.31.5.6.2 Sample XML Messages
The following sections provide sample XML messages.
Parent topic: Using the XML Formatter
8.2.31.5.6.2.1 Sample Insert Message
<?xml version='1.0' encoding='UTF-8'?> <operation table='GG.TCUSTORD' type='I' ts='2013-06-02 22:14:36.000000' current_ts='2015-10-06T12:21:50.100001' pos='00000000000000001444' numCols='7'> <col name='CUST_CODE' index='0'> <before missing='true'/> <after><![CDATA[WILL]]></after> </col> <col name='ORDER_DATE' index='1'> <before missing='true'/> <after><![CDATA[1994-09-30:15:33:00]]></after> </col> <col name='PRODUCT_CODE' index='2'> <before missing='true'/> <after><![CDATA[CAR]]></after> </col> <col name='ORDER_ID' index='3'> <before missing='true'/> <after><![CDATA[144]]></after> </col> <col name='PRODUCT_PRICE' index='4'> <before missing='true'/> <after><![CDATA[17520.00]]></after> </col> <col name='PRODUCT_AMOUNT' index='5'> <before missing='true'/> <after><![CDATA[3]]></after> </col> <col name='TRANSACTION_ID' index='6'> <before missing='true'/> <after><![CDATA[100]]></after> </col> <tokens> <token> <Name><![CDATA[R]]></Name> <Value><![CDATA[AADPkvAAEAAEqL2AAA]]></Value> </token> </tokens> </operation>
Parent topic: Sample XML Messages
8.2.31.5.6.2.2 Sample Update Message
<?xml version='1.0' encoding='UTF-8'?> <operation table='GG.TCUSTORD' type='U' ts='2013-06-02 22:14:41.000000' current_ts='2015-10-06T12:21:50.413000' pos='00000000000000002891' numCols='7'> <col name='CUST_CODE' index='0'> <before><![CDATA[BILL]]></before> <after><![CDATA[BILL]]></after> </col> <col name='ORDER_DATE' index='1'> <before><![CDATA[1995-12-31:15:00:00]]></before> <after><![CDATA[1995-12-31:15:00:00]]></after> </col> <col name='PRODUCT_CODE' index='2'> <before><![CDATA[CAR]]></before> <after><![CDATA[CAR]]></after> </col> <col name='ORDER_ID' index='3'> <before><![CDATA[765]]></before> <after><![CDATA[765]]></after> </col> <col name='PRODUCT_PRICE' index='4'> <before><![CDATA[15000.00]]></before> <after><![CDATA[14000.00]]></after> </col> <col name='PRODUCT_AMOUNT' index='5'> <before><![CDATA[3]]></before> <after><![CDATA[3]]></after> </col> <col name='TRANSACTION_ID' index='6'> <before><![CDATA[100]]></before> <after><![CDATA[100]]></after> </col> <tokens> <token> <Name><![CDATA[R]]></Name> <Value><![CDATA[AADPkvAAEAAEqLzAAA]]></Value> </token> </tokens> </operation>
Parent topic: Sample XML Messages
8.2.31.5.6.2.3 Sample Delete Message
<?xml version='1.0' encoding='UTF-8'?> <operation table='GG.TCUSTORD' type='D' ts='2013-06-02 22:14:41.000000' current_ts='2015-10-06T12:21:50.415000' pos='00000000000000004338' numCols='7'> <col name='CUST_CODE' index='0'> <before><![CDATA[DAVE]]></before> <after missing='true'/> </col> <col name='ORDER_DATE' index='1'> <before><![CDATA[1993-11-03:07:51:35]]></before> <after missing='true'/> </col> <col name='PRODUCT_CODE' index='2'> <before><![CDATA[PLANE]]></before> <after missing='true'/> </col> <col name='ORDER_ID' index='3'> <before><![CDATA[600]]></before> <after missing='true'/> </col> <col name='PRODUCT_PRICE' index='4'> <missing/> </col> <col name='PRODUCT_AMOUNT' index='5'> <missing/> </col> <col name='TRANSACTION_ID' index='6'> <missing/> </col> <tokens> <token> <Name><![CDATA[L]]></Name> <Value><![CDATA[206080450]]></Value> </token> <token> <Name><![CDATA[6]]></Name> <Value><![CDATA[9.0.80330]]></Value> </token> <token> <Name><![CDATA[R]]></Name> <Value><![CDATA[AADPkvAAEAAEqLzAAC]]></Value> </token> </tokens> </operation>
Parent topic: Sample XML Messages
8.2.31.5.6.2.4 Sample Truncate Message
<?xml version='1.0' encoding='UTF-8'?> <operation table='GG.TCUSTORD' type='T' ts='2013-06-02 22:14:41.000000' current_ts='2015-10-06T12:21:50.415001' pos='00000000000000004515' numCols='7'> <col name='CUST_CODE' index='0'> <missing/> </col> <col name='ORDER_DATE' index='1'> <missing/> </col> <col name='PRODUCT_CODE' index='2'> <missing/> </col> <col name='ORDER_ID' index='3'> <missing/> </col> <col name='PRODUCT_PRICE' index='4'> <missing/> </col> <col name='PRODUCT_AMOUNT' index='5'> <missing/> </col> <col name='TRANSACTION_ID' index='6'> <missing/> </col> <tokens> <token> <Name><![CDATA[R]]></Name> <Value><![CDATA[AADPkvAAEAAEqL2AAB]]></Value> </token> </tokens> </operation>
Parent topic: Sample XML Messages
8.2.31.5.6.3 XML Schema
The XML Formatter does not generate an XML schema (XSD) as part of its output. A single XSD applies to all messages generated by the XML Formatter. The following XSD defines the structure of the XML documents that are generated by the XML Formatter.
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema"> <xs:element name="operation"> <xs:complexType> <xs:sequence> <xs:element name="col" maxOccurs="unbounded" minOccurs="0"> <xs:complexType> <xs:sequence> <xs:element name="before" minOccurs="0"> <xs:complexType> <xs:simpleContent> <xs:extension base="xs:string"> <xs:attribute type="xs:string" name="missing" use="optional"/> </xs:extension> </xs:simpleContent> </xs:complexType> </xs:element> <xs:element name="after" minOccurs="0"> <xs:complexType> <xs:simpleContent> <xs:extension base="xs:string"> <xs:attribute type="xs:string" name="missing" use="optional"/> </xs:extension> </xs:simpleContent> </xs:complexType> </xs:element> <xs:element type="xs:string" name="missing" minOccurs="0"/> </xs:sequence> <xs:attribute type="xs:string" name="name"/> <xs:attribute type="xs:short" name="index"/> </xs:complexType> </xs:element> <xs:element name="tokens" minOccurs="0"> <xs:complexType> <xs:sequence> <xs:element name="token" maxOccurs="unbounded" minOccurs="0"> <xs:complexType> <xs:sequence> <xs:element type="xs:string" name="Name"/> <xs:element type="xs:string" name="Value"/> </xs:sequence> </xs:complexType> </xs:element> </xs:sequence> </xs:complexType> </xs:element> </xs:sequence> <xs:attribute type="xs:string" name="table"/> <xs:attribute type="xs:string" name="type"/> <xs:attribute type="xs:string" name="ts"/> <xs:attribute type="xs:dateTime" name="current_ts"/> <xs:attribute type="xs:long" name="pos"/> <xs:attribute type="xs:short" name="numCols"/> </xs:complexType> </xs:element> </xs:schema>
Parent topic: Using the XML Formatter
8.2.31.5.6.4 XML Formatter Configuration Properties
Table 8-52 XML Formatter Configuration Properties
Properties | Required/ Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.handler.name.format.insertOpKey | Optional | Any string | I | Indicator to be inserted into the output record to indicate an insert operation. |
gg.handler.name.format.updateOpKey | Optional | Any string | U | Indicator to be inserted into the output record to indicate an update operation. |
gg.handler.name.format.deleteOpKey | Optional | Any string | D | Indicator to be inserted into the output record to indicate a delete operation. |
gg.handler.name.format.truncateOpKey | Optional | Any string | T | Indicator to be inserted into the output record to indicate a truncate operation. |
gg.handler.name.format.encoding | Optional | Any legal encoding name or alias supported by Java. | UTF-8 (the XML default) | The output encoding of generated XML documents. |
gg.handler.name.format.includeProlog | Optional | true / false | false | Determines whether an XML prolog is included in generated XML documents. An XML prolog is optional for well-formed XML. An XML prolog resembles the following: <?xml version='1.0' encoding='UTF-8'?> |
gg.handler.name.format.iso8601Format | Optional | true / false | true | Controls the format of the current timestamp in the XML message. The default adds a T between the date and time. |
| Optional | | | Set to |
| Optional | | | Set to |
| Optional | | | Set to |
gg.handler.name.format.metaColumnsTemplate | Optional | See Metacolumn Keywords. | None | The current meta column information can be configured in a simple manner and removes the explicit need to use: insertOpKey, updateOpKey, deleteOpKey, truncateOpKey, includeTableName, includeOpTimestamp, includeOpType, includePosition, includeCurrentTimestamp, useIso8601Format. It is a comma-delimited string consisting of one or more templated values that represent the template. For more information about the Metacolumn keywords, see Metacolumn Keywords. |
Parent topic: Using the XML Formatter
8.2.31.5.6.5 Review a Sample Configuration
The following is a sample configuration for the XML Formatter in the Java Adapter properties file:
gg.handler.hdfs.format=xml
gg.handler.hdfs.format.insertOpKey=I
gg.handler.hdfs.format.updateOpKey=U
gg.handler.hdfs.format.deleteOpKey=D
gg.handler.hdfs.format.truncateOpKey=T
gg.handler.hdfs.format.encoding=ISO-8859-1
gg.handler.hdfs.format.includeProlog=false
Parent topic: Using the XML Formatter
8.2.31.5.6.6 Metadata Change Events
The XML Formatter seamlessly handles metadata change events. A metadata change event does not result in a change to the XML schema. The XML schema is designed to be generic so that the same schema represents the data of any operation from any table.
If the replicated database and upstream Oracle GoldenGate replication process can propagate metadata change events, the XML Formatter can take action when metadata changes. Changes in the metadata are reflected in messages after the change. For example, when a column is added, the new column data appears in XML messages for the table.
Parent topic: Using the XML Formatter
8.2.31.5.6.7 Primary Key Updates
Updates to a primary key require no special handling by the XML formatter. The XML formatter creates messages that model database operations. For update operations, this includes before and after images of column values. Primary key changes are represented in this format as a change to a column value just like a change to any other column value.
Parent topic: Using the XML Formatter
8.2.31.6 Stage and Merge Data Warehouse Replication
Data warehouse targets typically support Massively Parallel Processing (MPP). The cost of a single Data Manipulation Language (DML) operation is comparable to the cost of executing a batch of DML operations.
Therefore, for better throughput, the change data from the Oracle GoldenGate trails can be staged in micro batches at a temporary staging location, and the staged data records are then merged into the data warehouse target table using the respective data warehouse's merge SQL statement. This section outlines an approach to replicate change data records from source databases to target data warehouses using stage and merge. The solution uses the Command Event handler to invoke custom bash-shell scripts.
This section contains examples of what you can do with the Command Event handler feature.
- Steps for Stage and Merge
- Hive Stage and Merge
Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools to enable easy data ETL, a mechanism to put structures on the data, and the capability for querying and analysis of large data sets stored in Hadoop files.
Parent topic: Additional Details
8.2.31.6.1 Steps for Stage and Merge
- Stage
In this step the change data records in the Oracle GoldenGate trail files are pushed into a staging location. The staging location is typically a cloud object store such as OCI, AWS S3, Azure Data Lake, or Google Cloud Storage.
- Merge
In this step the change data files in the object store are viewed as an external table defined in the data warehouse. The data in the external staging table is merged onto the target table.
- Configuration of Handlers
The File Writer (FW) handler needs to be configured to generate local staging files that contain change data from the GoldenGate trail files.
- File Writer Handler
The File Writer (FW) handler is typically configured to generate files partitioned by table using the configuration gg.handler.{name}.partitionByTable=true.
- Operation Aggregation
Operation aggregation is the process of aggregating (merging/compressing) multiple operations on the same row into a single output operation based on a threshold.
- Object Store Event handler
The File Writer handler needs to be chained with an object store Event handler. Oracle GoldenGate for Big Data supports uploading files to most cloud object stores such as OCI, AWS S3, and Azure Data Lake.
- JDBC Metadata Provider
If the data warehouse supports JDBC connections, then the JDBC metadata provider needs to be enabled.
- Command Event handler Merge Script
The Command Event handler is configured to invoke a bash-shell script. Oracle provides a bash-shell script that can execute the SQL statements so that the change data in the staging files is merged into the target tables.
- Stage and Merge Sample Configuration
A working configuration for the respective data warehouse is available under the directory AdapterExamples/big-data/data-warehouse-utils/<target>/.
- Variables in the Merge Script
Typically, variables appear at the beginning of the Oracle-provided script. Lines starting with #TODO: document the changes required for variables in the script.
- SQL Statements in the Merge Script
The SQL statements in the shell script need to be customized. Lines starting with #TODO: document the changes required for SQL statements.
- Merge Script Functions
- Prerequisites
- Limitations
Parent topic: Stage and Merge Data Warehouse Replication
8.2.31.6.1.1 Stage
In this step the change data records in the Oracle GoldenGate trail files are pushed into a staging location. The staging location is typically a cloud object store such as OCI, AWS S3, Azure Data Lake, or Google Cloud Storage.
Parent topic: Steps for Stage and Merge
8.2.31.6.1.2 Merge
In this step the change data files in the object store are viewed as an external table defined in the data warehouse. The data in the external staging table is merged onto the target table.
Parent topic: Steps for Stage and Merge
8.2.31.6.1.3 Configuration of Handlers
The File Writer (FW) handler needs to be configured to generate local staging files that contain change data from the GoldenGate trail files.
The FW handler needs to be chained to an object store Event handler that can upload the staging files into a staging location.
The staging location is typically a cloud object store, such as AWS S3 or Azure Data Lake.
The output of the object store event handler is chained with the Command Event handler that can invoke custom scripts to execute merge SQL statements on the target data warehouse.
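The following is an illustrative sketch of this handler chaining, using an S3 Event handler as the object store and placeholder handler names (filewriter, s3, and command). Refer to the sample configurations shipped under AdapterExamples for the authoritative settings for your target.
gg.handlerlist=filewriter
gg.handler.filewriter.type=filewriter
gg.handler.filewriter.eventHandler=s3
gg.eventhandler.s3.type=s3
gg.eventhandler.s3.eventHandler=command
gg.eventhandler.command.type=command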
Parent topic: Steps for Stage and Merge
8.2.31.6.1.4 File Writer Handler
The File Writer (FW) handler is typically configured to generate files partitioned by table using the configuration gg.handler.{name}.partitionByTable=true.
In most cases, the FW handler is configured to use the Avro Object Container Format (OCF) formatter.
The output file format could change based on the specific data warehouse target.
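For example, a typical File Writer handler sketch for stage and merge (the handler name filewriter is a placeholder, and the formatter may differ for your data warehouse target):
gg.handler.filewriter.type=filewriter
gg.handler.filewriter.partitionByTable=true
gg.handler.filewriter.format=avro_row_ocf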
Parent topic: Steps for Stage and Merge
8.2.31.6.1.5 Operation Aggregation
Operation aggregation is the process of aggregating (merging/compressing) multiple operations on the same row into a single output operation based on a threshold.
Operation aggregation needs to be enabled for stage and merge replication using the configuration gg.aggregate.operations=true.
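For example, add the following to the Replicat properties file:
gg.aggregate.operations=true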
Parent topic: Steps for Stage and Merge
8.2.31.6.1.6 Object Store Event handler
The File Writer handler needs to be chained with an object store Event handler. Oracle GoldenGate for Big Data supports uploading files to most cloud object stores such as OCI, AWS S3, and Azure Data Lake.
Parent topic: Steps for Stage and Merge
8.2.31.6.1.7 JDBC Metadata Provider
If the data warehouse supports JDBC connections, then the JDBC metadata provider needs to be enabled.
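The following is a sketch of enabling the JDBC metadata provider; the connection URL, driver class, and credentials shown are placeholders for your data warehouse:
gg.mdp.type=jdbc
gg.mdp.ConnectionUrl=<JDBC connection URL for the data warehouse>
gg.mdp.DriverClassName=<JDBC driver class name>
gg.mdp.UserName=<database user>
gg.mdp.Password=<database password>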
Parent topic: Steps for Stage and Merge
8.2.31.6.1.8 Command Event handler Merge Script
The Command Event handler is configured to invoke a bash-shell script. Oracle provides a bash-shell script that can execute the SQL statements so that the change data in the staging files is merged into the target tables.
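For example, a sketch of the Command Event handler configuration (the event handler name command and the script path are placeholders; see the Command Event handler documentation for the full set of properties, such as the command argument template):
gg.eventhandler.command.type=command
gg.eventhandler.command.command=<full path to the Oracle-provided merge script>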
Parent topic: Steps for Stage and Merge
8.2.31.6.1.9 Stage and Merge Sample Configuration
A working configuration for the respective data warehouse is available under the directory AdapterExamples/big-data/data-warehouse-utils/<target>/. This directory contains the following:
- replicat parameter (.prm) file.
- replicat properties file that contains the FW handler and all the Event handler configuration.
- DDL file for the sample table used in the merge script.
- Merge script for the specific data warehouse. This script contains SQL statements tested using the sample table defined in the DDL file.
Parent topic: Steps for Stage and Merge
8.2.31.6.1.10 Variables in the Merge Script
Typically, variables appear at the beginning of the Oracle-provided script. Lines starting with #TODO: document the changes required for variables in the script.
#TODO: Edit this. Provide the replicat group name.
repName=RBD
#TODO: Edit this. Ensure each replicat uses a unique prefix.
stagingTablePrefix=${repName}_STAGE_
#TODO: Edit the AWS S3 bucket name.
bucket=<AWS S3 bucket name>
#TODO: Edit this variable as needed.
s3Location="'s3://${bucket}/${dir}/'"
#TODO: Edit AWS credentials awsKeyId and awsSecretKey
awsKeyId=<AWS Access Key Id>
awsSecretKey=<AWS Secret key>
The variables repName
and stagingTablePrefix
are
relevant for all the data warehouse targets.
Parent topic: Steps for Stage and Merge
8.2.31.6.1.11 SQL Statements in the Merge Script
The SQL statements in the shell script need to be customized. Lines starting with #TODO: document the changes required for SQL statements.
In most cases, identifiers in the SQL statements need to be double quoted ("). The double quote needs to be escaped in the script using a backslash, for example: \".
Oracle provides a working example of SQL statements for a single table with a pre-defined set of columns defined in the sample DDL file. You need to add new sections for your own tables as part of the if-else code block in the script.
if [ "${tableName}" == "DBO.TCUSTORD" ] then #TODO: Edit all the column names of the staging and target tables. # The merge SQL example here is configured for the example table defined in the DDL file. # Oracle provided SQL statements # TODO: Add similar SQL queries for each table. elif [ "${tableName}" == "DBO.ANOTHER_TABLE" ] then #Edit SQLs for this table. fi
Parent topic: Steps for Stage and Merge
8.2.31.6.1.12 Merge Script Functions
The script is coded to include the following shell functions:
main
validateParams
process
processTruncate
processDML
dropExternalTable
createExternalTable
merge
The script has code comments for you to infer the purpose of each function.
Merge Script main
function
The function main is the entry point of the script. The processing of the staged change data file begins here.
This function invokes two functions: validateParams and process.
The input parameters to the script are validated in the function validateParams.
Processing resumes in the process function if validation is successful.
Merge Script process
function
This function processes the operation records in the staged change data file and
invokes processTruncate
or processDML
as needed.
Truncate operation records are handled in the function
processTruncate
. Insert
,
Update
, and Delete
operation records are
handled in the function processDML
.
Merge Script merge
function
The merge
function invoked by the function
processDML
contains the merge SQL statement that will be
executed for each table.
The key columns to be used in the merge SQL's ON clause need to be customized.
To handle null values, the ON clause uses data warehouse specific NVL functions. Example for a single key column "C01Key":
ON ((NVL(CAST(TARGET.\"C01Key\" AS VARCHAR(4000)),'${uuid}')=NVL(CAST(STAGE.\"C01Key\" AS VARCHAR(4000)),'${uuid}')))
The column names in the merge statement's update and insert clauses also need to be customized for every table.
Merge Script createExternalTable
function
The createExternalTable function invoked by the function processDML creates an external table that is backed by the staged file in the respective object store.
In this function, the DDL SQL statement for the external table should be customized for every target table to include all the target table columns.
In addition to the target table columns, the external table definition
also consists of three meta-columns: optype
,
position
, and fieldmask
.
The data type of the meta-columns should not be modified. The position of the meta-columns should not be modified in the DDL statement.
Parent topic: Steps for Stage and Merge
8.2.31.6.1.13 Prerequisites
- The Command handler merge scripts are available starting from Oracle GoldenGate for Big Data release 19.1.0.0.8.
- The respective data warehouse’s command line programs to execute SQL queries must be installed on the machine where GoldenGate for Big Data is installed.
Parent topic: Steps for Stage and Merge
8.2.31.6.1.14 Limitations
Primary key update operations are split into a delete and insert pair. If the Oracle GoldenGate trail file does not contain column values for all the columns in the respective table, then the missing columns get updated to null on the target table.
Parent topic: Steps for Stage and Merge
8.2.31.6.2 Hive Stage and Merge
Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools to enable easy data ETL, a mechanism to put structures on the data, and the capability for querying and analysis of large data sets stored in Hadoop files.
This topic contains examples of what you can do with the Hive Command Event handler.
- Data Flow
- Configuration
The directory AdapterExamples/big-data/data-warehouse-utils/hive/ in the Oracle GoldenGate for Big Data install contains all the configuration and scripts needed for replication to Hive using stage and merge.
- Merge Script Variables
- Prerequisites
Parent topic: Stage and Merge Data Warehouse Replication
8.2.31.6.2.1 Data Flow
- File Writer (FW) handler is configured to generate files in Avro Object Container Format (OCF).
- The HDFS Event handler is used to push the Avro OCF files into Hadoop.
- The Command Event handler passes the Hadoop file metadata to the
hive.sh
script.
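The following properties sketch illustrates this flow. The handler and event handler names are placeholders, and the hive.props file in AdapterExamples is the authoritative sample:
gg.handlerlist=filewriter
gg.handler.filewriter.type=filewriter
gg.handler.filewriter.format=avro_row_ocf
gg.handler.filewriter.partitionByTable=true
gg.handler.filewriter.eventHandler=hdfs
gg.eventhandler.hdfs.type=hdfs
gg.eventhandler.hdfs.eventHandler=command
gg.eventhandler.command.type=command
gg.eventhandler.command.command=hive.sh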
Parent topic: Hive Stage and Merge
8.2.31.6.2.2 Configuration
The directory AdapterExamples/big-data/data-warehouse-utils/hive/ in the Oracle GoldenGate for Big Data install contains all the configuration and scripts needed for replication to Hive using stage and merge.
- hive.prm: The replicat parameter file.
- hive.props: The replicat properties file that stages data to Hadoop and runs the Command Event handler.
- hive.sh: The bash-shell script that reads data staged in Hadoop and merges data to the Hive target table.
- hive-ddl.sql: The DDL statement that contains the sample target table used in the script hive.sh.
Edit the properties indicated by the #TODO: comments in the properties file hive.props.
The bash-shell script function merge() contains SQL statements that need to be customized for your target tables.
Parent topic: Hive Stage and Merge
8.2.31.6.2.3 Merge Script Variables
Modify the variables as needed:
#TODO: Modify the location of the OGGBD dirdef directory where the Avro schema files exist.
avroSchemaDir=/opt/ogg/dirdef
#TODO: Edit the JDBC URL to connect to hive.
hiveJdbcUrl=jdbc:hive2://localhost:10000/default
#TODO: Edit the JDBC user to connect to hive.
hiveJdbcUser=APP
#TODO: Edit the JDBC password to connect to hive.
hiveJdbcPassword=mine
#TODO: Edit the replicat group name.
repName=HIVE
#TODO: Edit this. Ensure each replicat uses a unique prefix.
stagingTablePrefix=${repName}_STAGE_
Parent topic: Hive Stage and Merge
8.2.31.6.2.4 Prerequisites
The following are the prerequisites:
- The merge script hive.sh requires the command line program beeline to be installed on the machine where the Oracle GoldenGate for Big Data replicat is installed.
- The custom script hive.sh uses the merge SQL statement. Hive Query Language (Hive QL) introduced support for merge in Hive version 2.2.
Parent topic: Hive Stage and Merge
8.2.31.7 Template Keywords
The templating functionality allows you to use a mix of constants and/or keywords for context-based resolution of string values at runtime. The templating functionality is used extensively in Oracle GoldenGate for Big Data to resolve file paths, file names, topic names, or message keys. This section describes the keywords and their associated arguments, if applicable. Additionally, there are examples showing templates and resolved values.
Template Keywords
This table includes a column that indicates whether the keyword is supported for transaction-level messages.
Keyword | Explanation | Transaction Message Support |
---|---|---|
${fullyQualifiedTableName} | Resolves to the fully qualified table name including the period (.) delimiter between the catalog, schema, and table names. | No |
${catalogName} | Resolves to the catalog name. | No |
${schemaName} | Resolves to the schema name. | No |
${tableName} | Resolves to the short table name. | No |
${opType} | Resolves to the type of the operation (insert, update, delete, or truncate). | No |
${primaryKeys} or ${primaryKeys[]} | Resolves to the concatenated primary key values. The first parameter is optional and allows you to set the delimiter between primary key values. The default is _. | No |
${position} | The sequence number of the source trail file followed by the offset (RBA). | Yes |
${opTimestamp} | The operation timestamp from the source trail file. | Yes |
${emptyString} | Resolves to "". | Yes |
${groupName} | Resolves to the name of the Replicat process. If using coordinated delivery, it resolves to the name of the Replicat process with the Replicat thread number appended. | Yes |
${staticMap[]} | Resolves to a static value where the key is the fully-qualified table name. The keys and values are designated inside of the square braces. | No |
${xid} | Resolves to the transaction ID. | Yes |
${columnValue[]} or ${columnValue[][][]} | Resolves to a column value where the key is the fully-qualified table name and the value is the column name to be resolved. For example: ${columnValue[DBO.TABLE1=COL1,DBO.TABLE2=COL2]}. The second parameter is optional and allows you to set the value to use if the column value is null; the default is an empty string. The third parameter is optional and allows you to set the value to use if the column value is missing; the default is an empty string. Examples: ${columnValue[COL1]} or ${columnValue[COL2][NULL][MISSING]}. | No |
${currentTimestamp} or ${currentTimestamp[]} | Resolves to the current timestamp. You can control the format of the current timestamp using Java-based formatting as described in the SimpleDateFormat class documentation. Examples: ${currentTimestamp} and ${currentTimestamp[yyyy-MM-dd HH:mm:ss.SSS]}. | Yes |
${null} | Resolves to a NULL string. | Yes |
${custom[]} | It is possible to write a custom value resolver. If required, contact Oracle Support. | Implementation dependent |
${token[]} | Resolves a token value. | No |
${toLowerCase[]} | Keyword to convert the argument to lower case. The argument can be constants, keywords, or a combination of both. | Yes |
${toUpperCase[]} | Keyword to convert the argument to upper case. The argument can be constants, keywords, or a combination of both. | Yes |
${substring[][]} or ${substring[][][]} | Keyword to perform a substring operation on the configured content. Note: Performing a substring function means that an array index out of bounds condition can occur at runtime. This occurs if the configured starting index or ending index is beyond the length of the string currently being acted upon. The ${substring} function does not throw a runtime exception. It instead detects an array index out of bounds condition and, in that case, does not execute the substring function. | Yes |
${regex[][][]} | Keyword to apply a regular expression to search and replace content. This keyword has three required parameters. | Yes |
${operationCount} | Keyword to resolve the count of operations. | Yes |
${insertCount} | Keyword to resolve the count of insert operations. | Yes |
${deleteCount} | Keyword to resolve the count of delete operations. | Yes |
${updateCount} | Keyword to resolve the count of update operations. | Yes |
${truncateCount} | Keyword to resolve the count of truncate operations. | Yes |
${uuid} | Keyword to resolve a universally unique identifier (UUID). This is a 36 character string guaranteed to be unique. An example UUID: 7f6e4529-e387-48c1-a1b6-3e7a4146b211 | Yes |
Example Templates
The following describes example template configuration values and the resolved values.
Example Template | Resolved Value |
---|---|
A_STATIC_VALUE | A_STATIC_VALUE |
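As an illustration, the Kafka Handler topicMappingTemplate and keyMappingTemplate properties accept templates such as the following (the handler name kafkahandler is a placeholder):
gg.handler.kafkahandler.topicMappingTemplate=${schemaName}_${tableName}
gg.handler.kafkahandler.keyMappingTemplate=${primaryKeys}
For a source table QASOURCE.TCUSTMER, the topic name resolves to QASOURCE_TCUSTMER, and the message key resolves to the concatenated primary key values of the row.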
Parent topic: Additional Details
8.2.31.8 Velocity Dependencies
Starting with Oracle GoldenGate for Big Data release 21.1.0.0.0, the Velocity jar files have been removed from the packaging.
For the Velocity formatting to work, you need to download the jars and include them in the runtime by modifying the gg.classpath property.
The maven coordinates for Velocity are as follows:
Maven groupId: org.apache.velocity
Maven artifactId: velocity
Version: 1.7
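For example, if the Velocity jar file has been downloaded to /opt/dependencies/velocity (the path is a placeholder), the classpath can be extended as follows:
gg.classpath=/opt/dependencies/velocity/velocity-1.7.jar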
Parent topic: Additional Details
Footnote Legend
Footnote 2: Time zone with a two-digit hour and a two-digit minute offset.
Footnote 3: Time zone with a two-digit hour and a two-digit minute offset.