This chapter explains the Cassandra Handler and includes examples so that you can understand this functionality.
The Cassandra Handler provides the interface to Apache Cassandra databases.
Apache Cassandra is a NoSQL Database Management System designed to store large amounts of data. A Cassandra cluster configuration provides horizontal scaling and replication of data across multiple machines. It can provide high-availability and eliminate a single point of failure by replicating data to multiple nodes within a Cassandra cluster. Apache Cassandra is open-source and designed to run on low cost commodity hardware.
Cassandra relaxes the axioms of traditional Relational Database Management Systems regarding atomicity, consistency, isolation, and durability. When you consider implementing Cassandra, it is important to understand its differences from a traditional RDBMS and how those differences affect your specific use case.
Cassandra provides eventual consistency. Under the eventual consistency model, accessing the state of data for a specific row eventually returns the latest state of the data for that row, as defined by the most recent change. However, there may be a latency period between the creation or modification of the state of a row and what is returned when the state of that row is queried. The promise of eventual consistency is that the latency period is predictable based on your Cassandra configuration and the workload that your Cassandra cluster is currently under. See the Apache Cassandra website for more information:
The Cassandra Handler provides some control over consistency with the configuration of the gg.handler.name.consistencyLevel property in the Java Adapter properties file.
Cassandra provides a number of column data types, and most of these data types are supported by the Cassandra Handler. A data type conversion from the column value in the source trail file to the corresponding Java type representing the Cassandra column type in the Cassandra Handler is required. This data conversion process introduces the risk of a runtime conversion error. A poorly mapped field (for example, a source varchar column containing alphanumeric data mapped to a Cassandra int column) may cause a runtime error and cause the Cassandra Handler to abend. The Cassandra Java type mappings are documented at:
https://github.com/datastax/java-driver/tree/3.x/manual#cql-to-java-type-mapping
The following Cassandra column data types are not supported:
list
map
set
tuple
Certain use cases may exist where the data may require specialized processing to convert it to the corresponding Java type for intake into Cassandra. If this is the case, you have these options:
You may be able to use the general regular expression search and replace functionality to format the source column value data in a way that can then be converted into the Java data type for use in Cassandra.
You could implement or extend the default data type conversion logic to override it with your custom logic for your use case. If this is required, contact Oracle Support for guidance.
Traditional RDBMSs separate structured data into tables. Related tables are included in higher-level collections called databases. Cassandra contains both of these concepts. Tables in an RDBMS are also tables in Cassandra while databases in an RDBMS are keyspaces in Cassandra.
It is important to understand how the metadata definitions in the source trail file map to the corresponding keyspace and table in Cassandra. Source tables generally have either two-part names defined as schema.table, or three-part names defined as catalog.schema.table. The following table explains how catalog, schema, and table names map into Cassandra. Note that unless special syntax is used, Cassandra converts all keyspace names, table names, and column names to lowercase.
Table Name in Source Trail File | Cassandra Keyspace Name | Cassandra Table Name |
---|---|---|
`schema.table` | `schema` (lowercased) | `table` (lowercased) |
`catalog.schema.table` | `schema` (lowercased) | `table` (lowercased) |
The Cassandra Handler does not automatically create keyspaces in Cassandra. Keyspaces in Cassandra define a replication factor, the replication strategy, and topology. The Cassandra Handler does not possess enough information to create the keyspaces so you must manually create keyspaces.
You can create keyspaces in Cassandra using the CREATE KEYSPACE command from the Cassandra shell (cqlsh).
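For example, a minimal keyspace can be created as follows (the keyspace name, replication strategy, and replication factor shown here are illustrative only — choose values appropriate to your cluster topology):

```
-- Create a keyspace with simple replication, suitable for a
-- single-datacenter or development cluster.
CREATE KEYSPACE IF NOT EXISTS qasource
  WITH REPLICATION = {
    'class': 'SimpleStrategy',
    'replication_factor': 3
  };
```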
The Cassandra Handler can automatically create tables in Cassandra if you configure it to do so. The source table definition may be a poor source of information to create tables in Cassandra. Primary keys in Cassandra are divided into:
Partitioning keys that define how data for a table is separated into partitions in Cassandra.
Clustering keys that define the order of items within a partition.
The default mapping for automated table creation is that the first primary key is the partition key and any additional primary keys are mapped as clustering keys.
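As a sketch of that default mapping, a source table with primary key columns ID and SEQ (hypothetical names for illustration) would yield a Cassandra table similar to:

```
-- "id" (the first primary key column) becomes the partition key;
-- "seq" (an additional primary key column) becomes a clustering key.
CREATE TABLE IF NOT EXISTS qasource.tcustord (
  id   int,
  seq  int,
  name varchar,
  PRIMARY KEY ((id), seq)
);
```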
Automated table creation by the Cassandra Handler may be acceptable for a proof of concept, but it can result in data definitions that do not scale well. Creating tables in Cassandra with poorly constructed primary keys can reduce ingest and retrieval performance as the volume of data stored in Cassandra increases. Oracle recommends that you analyze the metadata of your replicated tables and then manually create corresponding tables in Cassandra that are properly partitioned and clustered so that they scale well.
Primary key definitions for tables in Cassandra are immutable once created. Changing a Cassandra table primary key definition requires the following manual steps:
Create a staging table.
Populate the data in the staging table from the original table.
Drop the original table.
Recreate the original table with the modified primary key definitions.
Populate the data in the original table from the staging table.
Drop the staging table.
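The steps above can be sketched in CQL (the keyspace, table, and column names here are hypothetical; COPY is a cqlsh command suitable only for modest data volumes — use a proper bulk-load tool for large tables):

```
-- 1. Create a staging table with the same columns.
CREATE TABLE ks.orders_staging (id int, seq int, name varchar,
                                PRIMARY KEY ((id), seq));
-- 2. Populate the staging table from the original table
--    (cqlsh COPY via CSV shown for illustration).
COPY ks.orders TO 'orders.csv';
COPY ks.orders_staging FROM 'orders.csv';
-- 3. Drop the original table.
DROP TABLE ks.orders;
-- 4. Recreate the original table with the modified primary key definition.
CREATE TABLE ks.orders (id int, seq int, name varchar,
                        PRIMARY KEY ((id, seq)));
-- 5. Populate the original table from the staging table, then drop staging.
COPY ks.orders FROM 'orders.csv';
DROP TABLE ks.orders_staging;
```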
You can configure the Cassandra Handler to add columns that exist in the source trail file table definition but are missing in the Cassandra table definition. The Cassandra Handler can accommodate metadata change events that add a column. A reconciliation process occurs that reconciles the source table definition to the Cassandra table definition. When configured to add columns, any columns found in the source table definition that do not exist in the Cassandra table definition are added. The reconciliation process for a table occurs after application startup, the first time an operation for the table is encountered. The reconciliation process reoccurs after a metadata change event on a source table, when the first operation for the source table is encountered after the change event.
You can configure the Cassandra Handler to drop columns that do not exist in the source trail file definition but exist in the Cassandra table definition. The Cassandra Handler can accommodate metadata change events that drop a column. A reconciliation process occurs that reconciles the source table definition to the Cassandra table definition. When configured to drop columns, any columns found in the Cassandra table definition that are not in the source table definition are dropped.
Caution:
Dropping a column is potentially dangerous because it is permanently removing data from a Cassandra table. You should carefully consider your use case before configuring this mode.
Note:
Primary key columns cannot be dropped. Attempting to do so results in an abend.
Note:
Column name changes are not well-handled because there is no actual DDL processing. A column name change event on the source database appears to the Cassandra Handler like dropping an existing column and adding a new column.
The Cassandra Handler pushes operations to Cassandra using either the asynchronous or the synchronous API. In asynchronous mode, operations are flushed at transaction commit (grouped transaction commit using GROUPTRANSOPS) to ensure write durability. The Cassandra Handler does not interface with Cassandra in a transactional way.
Insert, update, and delete operations are processed differently in Cassandra than a traditional RDBMS. The following explains how insert, update, and delete operations are interpreted by Cassandra:
Inserts – If the row does not already exist in Cassandra, then an insert operation is processed as an insert. If the row already exists in Cassandra, then an insert operation is processed as an update.
Updates – If a row does not exist in Cassandra, then an update operation is processed as an insert. If the row already exists in Cassandra, then an update operation is processed as an update.
Delete – If the row does not exist in Cassandra, then a delete operation has no effect. If the row exists in Cassandra, then a delete operation is processed as a delete.
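These upsert semantics can be observed directly in CQL (the keyspace, table, and column names here are illustrative):

```
INSERT INTO ks.t (id, name) VALUES (1, 'a');
-- A second INSERT with the same primary key overwrites the row,
-- behaving as an update.
INSERT INTO ks.t (id, name) VALUES (1, 'b');
-- An UPDATE for a primary key that does not yet exist creates the row,
-- behaving as an insert.
UPDATE ks.t SET name = 'c' WHERE id = 2;
```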
The state of the data in Cassandra is eventually idempotent. You can replay the source trail files or replay sections of the trail files. Ultimately, the state of the Cassandra database should be the same regardless of the number of times the trail data was written into Cassandra.
Oracle GoldenGate allows you to control the data that is propagated to the source trail file in the event of an update. The data for an update in the source trail file is either a compressed or a full image of the update, and the column information is provided as follows:
Compressed – Data is provided for the primary keys and the columns for which the value changed. Data for columns that did not change is not provided in the trail file.
Full image – Data is provided for all columns, including primary keys, columns for which the value changed, and columns for which the value did not change.
The amount of information available for an update is important to the Cassandra Handler. If the source trail file contains full images of the change data, then the Cassandra Handler can use prepared statements to perform row updates in Cassandra. Full images also allow the Cassandra Handler to perform primary key updates for a row in Cassandra. Because primary keys in Cassandra are immutable, an update that changes a primary key must be treated as a delete and an insert. Conversely, with compressed updates, prepared statements cannot be used for Cassandra row updates; simple statements identifying the changed values and primary keys must be dynamically created and then executed. Compressed updates also mean that primary key updates are not possible, so the Cassandra Handler abends if one is encountered.
You must set the control property, gg.handler.name.compressedUpdates, so that the handler expects either compressed or full image updates.
The default value, true, means that the Cassandra Handler expects compressed updates. Prepared statements are not used for updates, and primary key updates cause the handler to abend.
Setting the value to false means that prepared statements are used for updates and primary key updates can be processed. This setting is dangerous when the source trail file does not contain full image data, and it can lead to corrupted data: columns for which the data is missing are considered null, and the null value is pushed to Cassandra. If you have doubts about whether the source trail files contain compressed or full image data, then set gg.handler.name.compressedUpdates to true.
In addition, CLOB and BLOB data types do not propagate LOB data in updates unless the LOB column value changed. Therefore, if the source tables contain LOB data, you should set gg.handler.name.compressedUpdates to true.
Primary Key Updates
Primary key values for a row in Cassandra are immutable. An update operation that changes any primary key value for a Cassandra row must be treated as a delete and an insert. The Cassandra Handler can process update operations that change a primary key in Cassandra only as a delete and an insert. To process such an operation successfully, the source trail file must contain the complete before and after change data images for all columns. The gg.handler.name.compressedUpdates configuration property of the Cassandra Handler must be set to false for primary key updates to be processed successfully.
Get the Driver Libraries
The Datastax Java Driver for Cassandra does not ship with Oracle GoldenGate for Big Data. The recommended version of the Datastax Java Driver for Cassandra is 3.1, and you can download it from:
https://github.com/datastax/java-driver
Set the Classpath
You must configure the gg.classpath configuration property in the Java Adapter properties file to specify the JARs for the Datastax Java Driver for Cassandra. For example:
gg.classpath={download_dir}/cassandra-java-driver-3.1.0/*:{download_dir}/cassandra-java-driver-3.1.0/lib/*
Instructions for configuring the Cassandra Handler components and running the handler are described in the following sections.
The following are the configurable values for the Cassandra Handler. These properties are located in the Java Adapter properties file (not in the Replicat properties file).
Table 6-1 Cassandra Handler Configuration Properties
Properties | Required/ Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
`gg.handlerlist` | Required | Any string | None | Provides a name for the Cassandra Handler. The Cassandra Handler name then becomes part of the property names listed in this table. |
`gg.handler.name.type` | Required | `cassandra` | None | Selects the Cassandra Handler for streaming change data capture into Cassandra. |
`gg.handler.name.mode` | Optional | `op` or `tx` | `op` | The default is recommended. In `op` mode, operations are processed as they are received; `op` mode provides the best performance. In `tx` mode, operations are processed at transaction commit. |
`gg.handler.name.contactPoints` | Optional | A comma-separated list of host names that the Cassandra Handler will connect to. | `localhost` | A comma-separated list of the Cassandra host machines for the driver to establish an initial connection to the Cassandra cluster. This configuration property does not need to include all the machines enlisted in the Cassandra cluster. By connecting to a single machine, the driver can learn about other machines in the Cassandra cluster and establish connections to those machines as required. |
`gg.handler.name.username` | Optional | A legal username string. | None | A username for the connection to Cassandra. It is required if Cassandra is configured to require credentials. |
`gg.handler.name.password` | Optional | A legal password string. | None | A password for the connection to Cassandra. It is required if Cassandra is configured to require credentials. |
`gg.handler.name.compressedUpdates` | Optional | `true` or `false` | `true` | Sets whether the Cassandra Handler expects full image updates from the source trail file. Set to `true`, the handler expects compressed updates; prepared statements are not used for updates, and primary key updates cause an abend. Set to `false`, the handler expects full image updates; prepared statements are used for updates, and primary key updates can be processed. |
`gg.handler.name.ddlHandling` | Optional | `CREATE`, `ADD`, and `DROP` in any combination, comma separated | None | Configures the DDL functionality that the Cassandra Handler provides. When `CREATE` is enabled, the handler creates tables in Cassandra if they do not exist. When `ADD` is enabled, the handler adds columns that exist in the source table definition but not in the Cassandra table definition. When `DROP` is enabled, the handler drops columns that exist in the Cassandra table definition but not in the source table definition. |
`gg.handler.name.cassandraMode` | Optional | `async` or `sync` | `async` | Sets the interaction between the Cassandra Handler and Cassandra. Set to `async`, operations are sent asynchronously and flushed at transaction commit to ensure write durability; this mode provides the best performance. Set to `sync`, operations are sent synchronously. |
`gg.handler.name.consistencyLevel` | Optional | A legal Cassandra consistency level (for example, `ONE` or `QUORUM`). | The Cassandra default. | Sets the consistency level for operations with Cassandra. It configures the criteria that must be met for storage on the Cassandra cluster when an operation is executed. Lower levels of consistency can provide better performance, while higher levels of consistency are safer. You can review details about the write consistency levels for Cassandra at: https://docs.datastax.com/en/cassandra/3.x/cassandra/dml/dmlConfigConsistency.html |
The following is sample configuration for the Cassandra Handler from the Java Adapter properties file:
```
gg.handlerlist=cassandra
#The handler properties
gg.handler.cassandra.type=cassandra
gg.handler.cassandra.mode=op
gg.handler.cassandra.contactPoints=localhost
gg.handler.cassandra.ddlHandling=CREATE,ADD,DROP
gg.handler.cassandra.compressedUpdates=true
gg.handler.cassandra.cassandraMode=async
gg.handler.cassandra.consistencyLevel=ONE
```
The Cassandra Handler connection to the Cassandra cluster can be secured using username and password credentials. These credentials are configured as part of the Cassandra Handler configuration in the Adapter properties file; see Cassandra Handler Configuration.
When started, the Cassandra Handler performs the table check and reconciliation process the first time an operation for a source table is encountered. Additionally, a DDL event or a metadata change event causes the table definition in the Cassandra Handler to be marked as dirty, so the next time an operation for the table is encountered, the handler repeats the table check and reconciliation process described in the following section.
The Cassandra Handler first interrogates the target Cassandra database to see whether the target Cassandra keyspace exists. If the target Cassandra keyspace does not exist, then the Cassandra Handler abends. Keyspaces must be created by the user. The log file contains an error message identifying the exact keyspace name the Cassandra Handler is expecting.
Next, the Cassandra Handler interrogates the target Cassandra database for the table definition. If the table does not exist, the Cassandra Handler does one of two things: if gg.handler.name.ddlHandling includes CREATE, then a table is created in Cassandra; otherwise, the process abends and a message is logged that identifies the table that does not exist in Cassandra.
If the table exists in Cassandra, then the Cassandra Handler performs a reconciliation between the table definition from the source trail file and the table definition in Cassandra. This reconciliation process searches for columns that exist in the source table definition but not in the corresponding Cassandra table definition. If it locates such columns and the gg.handler.name.ddlHandling property includes ADD, then the Cassandra Handler alters the target table in Cassandra to add the new columns; otherwise, it ignores these columns.
Next, the reconciliation process searches for columns that exist in the target Cassandra table but not in the source table definition. If it locates such columns and the gg.handler.name.ddlHandling property includes DROP, then the Cassandra Handler alters the target table in Cassandra to drop these columns; otherwise, those columns are ignored.
Finally, the prepared statements are built.
Configuring the Cassandra Handler for async mode provides better performance than sync mode. The Replicat property GROUPTRANSOPS should be set to the default of 1000.
Setting of the consistency level directly affects performance. The higher the consistency level, the more work must occur on the Cassandra cluster before the transmission of a given operation can be considered complete. You should select the minimum consistency level that still satisfies the requirements of your use case. Consistency level information is found at:
https://docs.datastax.com/en/cassandra/3.x/cassandra/dml/dmlConfigConsistency.html
The Cassandra Handler can work in either operation (op) or transaction (tx) mode. For the best performance, operation mode is recommended:
gg.handler.name.mode=op
The Cassandra database requires at least one primary key, and the value for any primary key cannot be null. Automated table creation fails for source tables that do not have a primary key.
When gg.handler.name.compressedUpdates=false is set, the Cassandra Handler expects full before and after images of the data. Using this setting with a source trail file that contains partial image updates results in null values being written for columns for which the data is missing. This configuration is incorrect, and update operations pollute the target data with null values in columns that did not change.
The Cassandra Handler does not process DDL from the source database, even if the source database provides DDL. Instead, it performs a reconciliation process between the source table definition and the target Cassandra table definition. A DDL statement executed at the source database that changes a column name appears to the Cassandra Handler the same as if a column were dropped from the source table and a new column were added. This behavior depends on how the gg.handler.name.ddlHandling property is configured:
gg.handler.name.ddlHandling Configuration | Behavior |
---|---|
Not configured for `ADD` or `DROP` | Old column name and data are maintained in Cassandra. The new column is not created in Cassandra, so no data is replicated for the new column name from the DDL change forward. |
Configured for `ADD` only | Old column name and data are maintained in Cassandra. The new column is created in Cassandra, and data is replicated for the new column name from the DDL change forward. The data is located in mismatched columns before and after the DDL change. |
Configured for `DROP` only | Old column name and data are dropped in Cassandra. The new column is not created in Cassandra, so no data is replicated for the new column name. |
Configured for `ADD` and `DROP` | Old column name and data are dropped in Cassandra. The new column is created in Cassandra, and data is replicated for the new column name from the DDL change forward. |
This section contains information to help you troubleshoot various issues. Review the following topics for additional help:
The most common initial error is an incorrect classpath that does not include all the required client libraries, which results in a ClassNotFoundException in the log file. You can troubleshoot by setting the Java Adapter logging to DEBUG and then rerunning the process. At the debug level, the logging includes information about which JARs were added to the classpath from the gg.classpath configuration variable. The gg.classpath variable supports the wildcard (*) character to select all JARs in a configured directory. For example: /usr/cassandra/cassandra-java-driver-3.1.0/*:/usr/cassandra/cassandra-java-driver-3.1.0/lib/*.
For more information about setting the classpath, see Setting Up and Running the Cassandra Handler and Cassandra Handler Client Dependencies.
On startup, the Cassandra Handler logs a configuration summary similar to the following:

```
**** Begin Cassandra Handler - Configuration Summary ****
  Mode of operation is set to op.
  The Cassandra cluster contact point(s) is [localhost].
  The handler has been configured for GoldenGate compressed updates (partial image updates).
  Sending data to Cassandra in [ASYNC] mode.
  The Cassandra consistency level has been set to [ONE].
  Cassandra Handler DDL handling:
    The handler will create tables in Cassandra if they do not exist.
    The handler will add columns to Cassandra tables for columns in the source metadata that do not exist in Cassandra.
    The handler will drop columns in Cassandra tables for columns that do not exist in the source metadata.
**** End Cassandra Handler - Configuration Summary ****
```
When running the Cassandra Handler, you may experience a com.datastax.driver.core.exceptions.WriteTimeoutException exception that causes the Replicat process to abend. It is most likely to occur under some or all of the following conditions:
The Cassandra Handler is processing large numbers of operations putting the Cassandra cluster under a significant processing load.
GROUPTRANSOPS is configured higher than the default of 1000.
The Cassandra Handler is configured in asynchronous mode.
The Cassandra Handler is configured for a consistency level higher than ONE.
The problem is that the Cassandra Handler is streaming data faster than the Cassandra cluster can process it. The write latency in the Cassandra cluster then finally exceeds the write request timeout period, which in turn results in the exception.
The following are potential solutions:
Increase the write request timeout period. This is controlled by the write_request_timeout_in_ms property in the cassandra.yaml file, in the cassandra_install/conf directory. The default is 2000 (2 seconds). You can increase this value to move past the error, and then restart the Cassandra node or nodes for the change to take effect.
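For example, the relevant setting in cassandra.yaml could be raised as follows (5000 ms is an illustrative value, not a recommendation):

```
# cassandra.yaml — how long the coordinator waits for writes to
# complete before timing out. The default is 2000 ms.
write_request_timeout_in_ms: 5000
```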
It is also considered a good practice to decrease the GROUPTRANSOPS configuration value of the Replicat process if this error occurs. Decreasing GROUPTRANSOPS typically decreases the size of the transactions processed and reduces the likelihood that the Cassandra Handler can overtax the Cassandra cluster.
Reduce the consistency level of the Cassandra Handler, which in turn reduces the amount of work the Cassandra cluster must complete for an operation to be considered written.