5 Using the HBase Handler

The HBase Handler allows you to populate HBase tables from existing Oracle GoldenGate supported sources.

5.1 Overview

HBase is an open source Big Data application that emulates much of the functionality of a relational database management system (RDBMS). Hadoop is specifically designed to store large amounts of unstructured data. Conversely, data stored in databases and being replicated through Oracle GoldenGate is highly structured. HBase provides a method of maintaining the important structure of data, while taking advantage of the horizontal scaling that is offered by the Hadoop Distributed File System (HDFS).

5.2 Detailed Functionality

The HBase Handler takes operations from the source trail file and creates corresponding tables in HBase, and then loads change capture data into those tables.

HBase Table Names

Table names created in HBase map to the corresponding table names of the operations in the source trail file. Table names are case-sensitive.

HBase Table Namespace

For two-part table names (schema name and table name), the schema name maps to the HBase table namespace. For a three-part table name such as Catalog.Schema.MyTable, the created HBase namespace is Catalog_Schema. HBase table namespaces are case-sensitive. A NULL schema name is supported and maps to the default HBase namespace.
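The namespace mapping described above can be sketched as follows. This is an illustrative Python sketch, not the handler's code: the function name hbase_namespace is hypothetical, and the sketch assumes that the HBase table name is the final part of the source name and that the default HBase namespace is the one named default.

```python
def hbase_namespace(qualified_name, default_namespace="default"):
    """Return (namespace, table) for a one-, two-, or three-part source table name.

    Hypothetical helper: mirrors the mapping rules described in the text.
    Case is preserved because HBase namespaces and table names are case-sensitive.
    """
    parts = qualified_name.split(".")
    if len(parts) > 3:
        raise ValueError(f"unexpected table name: {qualified_name!r}")
    if len(parts) == 1:
        # NULL schema name: fall back to the default HBase namespace.
        return default_namespace, parts[0]
    *qualifiers, table = parts
    # Two-part names yield the schema; three-part names yield Catalog_Schema.
    return "_".join(qualifiers), table


print(hbase_namespace("Catalog.Schema.MyTable"))
print(hbase_namespace("Schema.MyTable"))
print(hbase_namespace("MyTable"))
```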

HBase Row Key

HBase has a concept similar to database primary keys, called the HBase row key. The HBase row key is the unique identifier for a table row. HBase supports only a single row key per row, and it cannot be empty or NULL. The HBase Handler maps the primary key value into the HBase row key value. If the source table has multiple primary key columns, then the primary key values are concatenated, separated by a pipe delimiter (|). You can configure the HBase row key delimiter.

The source table must have at least one primary key column. Replication of a table without a primary key causes the HBase Handler to abend.
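The row key rule above amounts to a delimiter join over the primary key values. The following Python sketch illustrates the idea only; build_row_key is a hypothetical name, values are converted with str() as an assumption, and the missing-key case is modeled as an exception where the handler would abend.

```python
def build_row_key(primary_key_values, delimiter="|"):
    """Concatenate primary key values into a single row key string.

    Hypothetical helper: the real handler works on trail-file column values;
    converting each value with str() is an assumption made for illustration.
    """
    if not primary_key_values:
        # A table without a primary key cannot produce a row key.
        raise ValueError("source table has no primary key columns")
    return delimiter.join(str(value) for value in primary_key_values)


print(build_row_key([1001, "US"]))
print(build_row_key([1001, "US"], delimiter="::"))
```

A delimiter that can also occur inside key values makes distinct keys collide (for example, the values "a|b" and "c" versus "a" and "b|c"), which is one reason to choose the delimiter deliberately.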

HBase Column Family

HBase has the concept of a column family. A column family is a grouping mechanism for column data. Every HBase column must belong to a single column family. The HBase Handler supports only a single column family per table, which defaults to cf. You can configure the column family name. However, once a table is created with a specific column family name, reconfiguring the column family name in the HBase Handler configuration without first modifying or dropping the table results in an abend of the Oracle GoldenGate Replicat process.

5.3 Setting Up and Running the HBase Handler

Instructions for configuring the HBase Handler components and running the handler are described in this section.

HBase must be up and running either collocated with the HBase Handler process or on a machine that is network connectable from the machine hosting the HBase Handler process. The underlying HDFS single instance or clustered instance serving as the repository for HBase data must be up and running.

5.3.1 Classpath Configuration

You must include two things in the gg.classpath configuration variable for the HBase Handler to connect to HBase and stream data. The first is the hbase-site.xml file and the second is the set of HBase client JARs. The HBase client JARs must match the version of HBase to which the HBase Handler is connecting. The HBase client JARs are not shipped with the Oracle GoldenGate for Big Data product.

HBase Handler Client Dependencies includes the listing of required HBase client jars by version.

The default location of the hbase-site.xml file is HBase_Home/conf.

The default location of the HBase client JARs is HBase_Home/lib/*.

If the HBase Handler is running on Windows, follow the Windows classpath syntax.

The gg.classpath must be configured exactly as described. The path to the hbase-site.xml file must contain only the directory path, with no wildcard appended; including the * wildcard in that path makes the file inaccessible. Conversely, the path to the dependency JARs must end with the * wildcard character so that all of the JAR files in that directory are included in the classpath. Do not use *.jar. The following is an example of a correctly configured gg.classpath variable:

gg.classpath=/var/lib/hbase/lib/*:/var/lib/hbase/conf
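The classpath rules above can be checked mechanically before starting Replicat. The following Python sketch is a hypothetical lint, not part of Oracle GoldenGate: lint_gg_classpath is an invented name, and the checks cover only the three rules stated in this section (no *.jar, one wildcard entry for the JAR directory, one plain directory entry for hbase-site.xml).

```python
def lint_gg_classpath(value, separator=":"):
    """Return a list of warnings for a gg.classpath value.

    Hypothetical checker. Pass separator=";" for Windows-style classpaths.
    """
    warnings = []
    entries = [entry for entry in value.split(separator) if entry]
    for entry in entries:
        if entry.endswith("*.jar"):
            warnings.append(f"{entry}: use '*' rather than '*.jar'")
    # The client JAR directory must be referenced with a trailing wildcard.
    if not any(entry.endswith(("/*", "\\*")) for entry in entries):
        warnings.append("no wildcard entry for the HBase client JAR directory")
    # The directory holding hbase-site.xml must be listed without a wildcard.
    if not any("*" not in entry for entry in entries):
        warnings.append("no plain directory entry for hbase-site.xml")
    return warnings


print(lint_gg_classpath("/var/lib/hbase/lib/*:/var/lib/hbase/conf"))
print(lint_gg_classpath("/var/lib/hbase/lib/*.jar:/var/lib/hbase/conf/*"))
```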

5.3.2 HBase Handler Configuration

The following are the configurable values for the HBase Handler. These properties are located in the Java Adapter properties file (not in the Replicat properties file).

Table 5-1 HBase Handler Configuration Properties

gg.handlerlist
  Required/Optional: Required
  Legal values: Any string
  Default: None
  Explanation: Provides a name for the HBase Handler. The HBase Handler name then becomes part of the property names listed in this table.

gg.handler.name.type=hbase
  Required/Optional: Required
  Legal values: -
  Default: -
  Explanation: Selects the HBase Handler for streaming change data capture into HBase.

gg.handler.name.hBaseColumnFamilyName
  Required/Optional: Optional
  Legal values: Any string legal for an HBase column family name
  Default: cf
  Explanation: Column family is a grouping mechanism for columns in HBase. The HBase Handler only supports a single column family in the 12.2 release.

gg.handler.name.includeTokens
  Required/Optional: Optional
  Legal values: true | false
  Default: false
  Explanation: Using true indicates that token values are included in the output to HBase. Using false means token values are not included.

gg.handler.name.keyValueDelimiter
  Required/Optional: Optional
  Legal values: Any string
  Default: =
  Explanation: Provides a delimiter between keys and values in a map. For example, key=value,key1=value1,key2=value2. Tokens are mapped values. Configuration value supports CDATA[] wrapping.

gg.handler.name.keyValuePairDelimiter
  Required/Optional: Optional
  Legal values: Any string
  Default: ,
  Explanation: Provides a delimiter between key value pairs in a map. For example, key=value,key1=value1,key2=value2. Tokens are mapped values. Configuration value supports CDATA[] wrapping.

gg.handler.name.encoding
  Required/Optional: Optional
  Legal values: Any encoding name or alias supported by Java (see Footnote 1). For a list of supported options, visit the Oracle Java Documentation website at https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html
  Default: The native system encoding of the machine hosting the Oracle GoldenGate process
  Explanation: Determines the encoding of values written to HBase. HBase values are written as bytes.

gg.handler.name.pkUpdateHandling
  Required/Optional: Optional
  Legal values: abend | update | delete-insert
  Default: abend
  Explanation: Provides configuration for how the HBase Handler should handle update operations that change a primary key. Primary key update operations can be problematic for the HBase Handler and require special consideration.
  • abend - indicates the process will abend
  • update - indicates the process will treat this as a normal update
  • delete-insert - indicates the process will treat this as a delete and an insert. The full before image is required for this feature to work properly. This can be achieved by using full supplemental logging in Oracle Database. Without full before and after row images the insert data will be incomplete.

gg.handler.name.nullValueRepresentation
  Required/Optional: Optional
  Legal values: Any string
  Default: NULL
  Explanation: Allows you to configure what is sent to HBase in the case of a NULL column value. The default is NULL. Configuration value supports CDATA[] wrapping.

gg.handler.name.authType
  Required/Optional: Optional
  Legal values: kerberos
  Default: None
  Explanation: Setting this property to kerberos enables Kerberos authentication.

gg.handler.name.kerberosKeytabFile
  Required/Optional: Optional (Required if authType=kerberos)
  Legal values: Relative or absolute path to a Kerberos keytab file
  Default: -
  Explanation: The keytab file allows the HBase Handler to access a password to perform a kinit operation for Kerberos security.

gg.handler.name.kerberosPrincipal
  Required/Optional: Optional (Required if authType=kerberos)
  Legal values: A legal Kerberos principal name (for example, user/FQDN@MY.REALM)
  Default: -
  Explanation: The Kerberos principal name for Kerberos authentication.

gg.handler.name.hBase98Compatible
  Required/Optional: Optional
  Legal values: true | false
  Default: false
  Explanation: Set this configuration property to true to enable integration with the HBase 0.98.x and 0.96.x releases.

gg.handler.name.rowkeyDelimiter
  Required/Optional: Optional
  Legal values: Any string
  Default: |
  Explanation: Configures the delimiter between primary key values from the source table when generating the HBase row key. This property supports CDATA[] wrapping of the value to preserve whitespace if you want to delimit incoming primary key values with one or more whitespace characters.

gg.handler.name.setHBaseOperationTimestamp
  Required/Optional: Optional
  Legal values: true | false
  Default: false
  Explanation: Set to true to set the timestamp for HBase operations in the HBase Handler instead of allowing HBase to assign the timestamps on the server side. This property can be used to solve the problem of a row delete followed by an immediate reinsert of the row not showing up in HBase; see HBase Handler Delete-Insert Problem.

Footnote 1

For more Java information, see Java Internationalization Support at https://docs.oracle.com/javase/8/docs/technotes/guides/intl/.
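The effect of the keyValueDelimiter and keyValuePairDelimiter properties in Table 5-1 can be illustrated with a short Python sketch. It only reproduces the key=value,key1=value1,key2=value2 example from the table; format_tokens is a hypothetical name, and how the handler actually serializes tokens internally is not shown here.

```python
def format_tokens(tokens, key_value_delimiter="=", pair_delimiter=","):
    """Render a token map as one string using the two configurable delimiters.

    Hypothetical helper: defaults match the defaults listed in Table 5-1.
    """
    return pair_delimiter.join(
        f"{key}{key_value_delimiter}{value}" for key, value in tokens.items()
    )


tokens = {"key": "value", "key1": "value1", "key2": "value2"}
print(format_tokens(tokens))
print(format_tokens(tokens, key_value_delimiter=":", pair_delimiter=";"))
```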

5.3.3 Sample Configuration

The following is a sample configuration for the HBase Handler from the Java Adapter properties file:

gg.handlerlist=hbase
gg.handler.hbase.type=hbase
gg.handler.hbase.mode=tx
gg.handler.hbase.hBaseColumnFamilyName=cf
gg.handler.hbase.includeTokens=true
gg.handler.hbase.keyValueDelimiter=CDATA[=]
gg.handler.hbase.keyValuePairDelimiter=CDATA[,]
gg.handler.hbase.encoding=UTF-8
gg.handler.hbase.pkUpdateHandling=abend
gg.handler.hbase.nullValueRepresentation=CDATA[NULL]
gg.handler.hbase.authType=none
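The sample shows two conventions: the handler name from gg.handlerlist becomes part of each property name, and some values are wrapped in CDATA[]. The following Python sketch reads such a file and resolves both. It is a simplified illustration, not the Java Adapter's parser; all function names are hypothetical, and treating CDATA[x] as the literal value x is an assumption based on the wrapping described in Table 5-1.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values such as CDATA[=] stay intact.
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def unwrap_cdata(value):
    """Return the text inside CDATA[...], or the value unchanged (assumed behavior)."""
    if value.startswith("CDATA[") and value.endswith("]"):
        return value[len("CDATA["):-1]
    return value


def handler_settings(props, handler_name):
    """Collect the settings that belong to one named handler."""
    prefix = f"gg.handler.{handler_name}."
    return {
        key[len(prefix):]: unwrap_cdata(value)
        for key, value in props.items()
        if key.startswith(prefix)
    }


sample = """
gg.handlerlist=hbase
gg.handler.hbase.type=hbase
gg.handler.hbase.keyValueDelimiter=CDATA[=]
gg.handler.hbase.nullValueRepresentation=CDATA[NULL]
"""
props = parse_properties(sample)
print(handler_settings(props, props["gg.handlerlist"]))
```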

5.3.4 Performance Considerations

At each transaction commit, the HBase Handler performs a flush call to flush any buffered data to the HBase region server. This must be done to maintain write durability. Flushing to the HBase region server is an expensive call, and performance can be greatly improved by using the Replicat GROUPTRANSOPS parameter to group multiple smaller transactions in the source trail file into a larger single transaction applied to HBase. You can enable Replicat-based batching by adding the GROUPTRANSOPS parameter to the Replicat configuration file.

Operations from multiple transactions are grouped together into a larger transaction, and it is only at the end of the grouped transaction that transaction commit is executed.
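The benefit of grouping can be estimated with a small model. The Python sketch below counts how many flushes (one per applied commit) result with and without grouping. It is a deliberate simplification: count_flushes is a hypothetical name, and the rule used here (commit once the accumulated operation count reaches the threshold, never splitting a source transaction) is an assumption about grouping behavior made for illustration.

```python
def count_flushes(transaction_sizes, group_trans_ops=None):
    """Count commits (and therefore flushes) for a list of source transaction sizes.

    group_trans_ops=None models no grouping: one flush per source transaction.
    """
    if group_trans_ops is None:
        return len(transaction_sizes)
    flushes = 0
    pending_ops = 0
    for size in transaction_sizes:
        pending_ops += size
        if pending_ops >= group_trans_ops:
            # The grouped transaction is committed, which triggers one flush.
            flushes += 1
            pending_ops = 0
    if pending_ops:
        # Whatever remains at the end is committed as a final, smaller group.
        flushes += 1
    return flushes


small_transactions = [5] * 100
print(count_flushes(small_transactions))
print(count_flushes(small_transactions, group_trans_ops=50))
```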

5.3.5 Security

HBase connectivity can be secured using Kerberos authentication. Follow the associated documentation for the HBase release to secure the HBase cluster. The HBase Handler can connect to a Kerberos-secured cluster. The HBase hbase-site.xml file should be in the handler's classpath, with the hbase.security.authentication property set to kerberos and the hbase.security.authorization property set to true.

Additionally, you must set the following properties in the HBase Handler Java configuration file:

gg.handler.{name}.authType=kerberos
gg.handler.{name}.kerberosPrincipal={legal Kerberos principal name}
gg.handler.{name}.kerberosKeytabFile={path to a keytab file that contains the password for the Kerberos principal so that the Oracle GoldenGate HBase Handler can programmatically perform the Kerberos kinit operations to obtain a Kerberos ticket}
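A configuration check for these settings can be sketched as follows. The Python function below is a hypothetical pre-flight check, not part of the product; it uses the property names from Table 5-1 (kerberosPrincipal and kerberosKeytabFile) and assumes the properties have already been loaded into a dictionary.

```python
def missing_kerberos_properties(props, handler_name):
    """Return the Kerberos properties that are required but absent or empty.

    Hypothetical helper: returns an empty list when authType is not kerberos.
    """
    prefix = f"gg.handler.{handler_name}."
    if props.get(prefix + "authType") != "kerberos":
        return []
    required = ("kerberosPrincipal", "kerberosKeytabFile")
    return [prefix + name for name in required if not props.get(prefix + name)]


config = {
    "gg.handler.hbase.authType": "kerberos",
    "gg.handler.hbase.kerberosPrincipal": "user/FQDN@MY.REALM",
}
print(missing_kerberos_properties(config, "hbase"))
```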

5.4 Metadata Change Events

Oracle GoldenGate 12.2 includes metadata in the trail file and can handle metadata change events at runtime. The HBase Handler can handle metadata change events at runtime as well. One of the most common scenarios is the addition of a new column. In that case, the new column and its associated data begin streaming to HBase after the metadata change event.

To enable metadata change events, the entire replication chain must be upgraded to Oracle GoldenGate 12.2. The 12.2 HBase Handler can work with trail files produced by Oracle GoldenGate 12.1 and greater. However, these trail files do not include metadata in the trail file, and therefore metadata change events cannot be handled at runtime.

5.5 Additional Considerations

The HBase client interface has changed over the last few releases. HBase 1.0.0 introduced a new recommended client interface, and the 12.2 HBase Handler has moved to the new interface to keep abreast of the most current changes. However, this creates a backward compatibility issue. The HBase Handler is not compatible with HBase versions older than 1.0.0. If an Oracle GoldenGate integration is required with 0.99.x or older versions of HBase, this can be accomplished using the 12.1.2.1.x HBase Handler. Contact Oracle Support to obtain a ZIP file of the 12.1.2.1.x HBase Handler.

Classpath issues are common errors during the initial setup of the HBase Handler. The typical indicator is a ClassNotFoundException in the Java log4j log file, which means that either the hbase-site.xml file or one or more of the required client JARs are not included in the classpath. The HBase client JARs do not ship with the Oracle GoldenGate for Big Data product, so you must obtain the required HBase client JARs yourself. HBase Handler Client Dependencies lists the HBase client JARs for each supported version. For instructions on configuring the classpath of the HBase Handler, see Classpath Configuration.

5.6 Troubleshooting the HBase Handler

Troubleshooting of the HBase Handler begins with the contents of the Java log4j log file. Follow the directions in Java Logging Configuration to configure the runtime to correctly generate the Java log4j log file.

5.6.1 Java Classpath

Issues with the Java classpath are among the most common problems. An indication of a classpath problem is a ClassNotFoundException in the Java log4j log file. The Java log4j log file can be used to troubleshoot this issue. Setting the log level to DEBUG logs each of the JARs referenced in the gg.classpath variable to the log file. To make sure that all of the required dependency JARs are resolved, enable DEBUG level logging, and then search the log file for messages like the following:

2015-09-29 13:04:26 DEBUG ConfigClassPath:74 -  ...adding to classpath:
 url="file:/ggwork/hbase/hbase-1.0.1.1/lib/hbase-server-1.0.1.1.jar"
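Searching the log for these messages can be automated. The Python sketch below extracts the file paths from url="file:..." entries like the one above and reports which expected JAR name prefixes are missing. It is a hypothetical helper: the function names are invented, the regular expression is based only on the sample message shown here, and the list of required prefixes is an example that you would replace with the JARs listed in HBase Handler Client Dependencies.

```python
import posixpath
import re

# Matches the url="file:/path/to/some.jar" fragment from the sample DEBUG message.
_CLASSPATH_URL = re.compile(r'url="file:([^"]+)"')


def jars_added_to_classpath(log_text):
    """Return every file path that the log reports as added to the classpath."""
    return [match.group(1) for match in _CLASSPATH_URL.finditer(log_text)]


def missing_jars(log_text, required_prefixes):
    """Return the required JAR name prefixes that never appear in the log."""
    names = [posixpath.basename(path) for path in jars_added_to_classpath(log_text)]
    return [
        prefix
        for prefix in required_prefixes
        if not any(name.startswith(prefix) for name in names)
    ]


log = (
    "2015-09-29 13:04:26 DEBUG ConfigClassPath:74 -  ...adding to classpath:\n"
    ' url="file:/ggwork/hbase/hbase-1.0.1.1/lib/hbase-server-1.0.1.1.jar"\n'
)
print(jars_added_to_classpath(log))
print(missing_jars(log, ["hbase-server", "hbase-client"]))
```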

5.6.2 HBase Connection Properties

The contents of the hbase-site.xml file (including default settings) are output to the Java log4j log file when the logging level is set to DEBUG or TRACE. This output shows the connection properties to HBase. Search for the following in the Java log4j log file:

2015-09-29 13:04:27 DEBUG HBaseWriter:449 - Begin - HBase configuration object contents for connection troubleshooting. 
Key: [hbase.auth.token.max.lifetime] Value: [604800000].

A common error is that the hbase-site.xml file is either not included in the classpath or is referenced with an incorrect path. In this case, the HBase Handler cannot establish a connection to HBase, and the Oracle GoldenGate process abends. The following error is reported in the Java log4j log:

2015-09-29 12:49:29 ERROR HBaseHandler:207 - Failed to initialize the HBase handler.
org.apache.hadoop.hbase.ZooKeeperConnectionException: Can't connect to ZooKeeper

Verify that the classpath correctly includes the hbase-site.xml file and that HBase is running.

5.6.3 Logging of Handler Configuration

The Java log4j log file contains information on the configuration state of the HBase Handler. This information is output at the INFO log level. Sample output is as follows:

2015-09-29 12:45:53 INFO HBaseHandler:194 - **** Begin HBase Handler - Configuration Summary ****
  Mode of operation is set to tx.
  HBase data will be encoded using the native system encoding.
  In the event of a primary key update, the HBase Handler will ABEND.
  HBase column data will use the column family name [cf].
  The HBase Handler will not include tokens in the HBase data.
  The HBase Handler has been configured to use [=] as the delimiter between keys and values.
  The HBase Handler has been configured to use [,] as the delimiter between key values pairs.
  The HBase Handler has been configured to output [NULL] for null values.
Hbase Handler Authentication type has been configured to use [none]

5.6.4 HBase Handler Delete-Insert Problem

If you are not using the HBase Handler gg.handler.name.setHBaseOperationTimestamp configuration property (it defaults to false), the source database may get out of sync with data in the HBase tables. This is caused by the deletion of a row followed by the immediate reinsertion of the row. HBase creates a tombstone marker for the delete that is identified by a specific timestamp. This tombstone marker marks as deleted any row records in HBase that have the same row key and a timestamp before or the same as the tombstone marker. When the deleted row is immediately reinserted, the insert operation can inadvertently have the same timestamp as the delete operation, so the delete operation causes the subsequent insert operation to incorrectly appear as deleted.

To work around this issue, set gg.handler.name.setHBaseOperationTimestamp to true, which does two things:

  • Sets the timestamp for row operations in the HBase Handler.

  • Detects a delete-insert operation and ensures that the insert operation has a timestamp that is after the delete.

The default for gg.handler.name.setHBaseOperationTimestamp is false, which means that the HBase server supplies the timestamp for a row. This can cause the out-of-sync problem.

Setting the row operation timestamp in the HBase Handler can have these consequences:

  1. Since the timestamp is set on the client side, this could create problems if multiple applications are feeding data to the same HBase table.

  2. If delete and reinsert is a common pattern in your use case, then the HBase Handler has to increment the timestamp by 1 millisecond each time this scenario is encountered.

Processing cannot be allowed to get too far into the future, so the HBase Handler only allows the timestamp to increment 100 milliseconds into the future before it pauses processing so that the client-side HBase operation timestamp and real time are back in sync. When a delete-insert is used instead of an update in the source database, this scenario would be quite common. Processing speeds may be affected by not allowing the HBase timestamp to go more than 100 milliseconds into the future if this scenario is common.
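The timestamp behavior described in this section can be modeled in a few lines. The Python sketch below is an illustration under stated assumptions, not the handler's implementation: the class and method names are hypothetical, the clock and sleep functions are injected so the logic can be exercised without real waiting, and only the two rules from the text are modeled (an insert after a delete of the same row key gets a later timestamp, and the handler waits once timestamps run more than 100 milliseconds ahead of real time).

```python
class OperationTimestamper:
    """Assigns client-side timestamps (in ms) so an insert lands after a prior delete."""

    MAX_DRIFT_MS = 100  # limit stated in the text above

    def __init__(self, clock_ms, sleep_ms):
        self._clock_ms = clock_ms   # callable returning the current time in ms
        self._sleep_ms = sleep_ms   # callable that waits for the given number of ms
        self._last_ts = 0           # highest timestamp handed out so far
        self._deleted_at = {}       # row key -> timestamp of its most recent delete

    def next_timestamp(self, op, row_key):
        # Never hand out a timestamp older than one already used.
        ts = max(self._clock_ms(), self._last_ts)
        if op == "insert":
            deleted = self._deleted_at.pop(row_key, None)
            if deleted is not None and ts <= deleted:
                # Move the insert 1 ms past the tombstone so it stays visible.
                ts = deleted + 1
        drift = ts - self._clock_ms()
        if drift > self.MAX_DRIFT_MS:
            # Too far ahead of real time: wait until the drift is back at the limit.
            self._sleep_ms(drift - self.MAX_DRIFT_MS)
        if op == "delete":
            self._deleted_at[row_key] = ts
        self._last_ts = ts
        return ts


# Usage with a fake clock that only advances when the sketch "sleeps".
now = [1000]
slept = []


def fake_clock():
    return now[0]


def fake_sleep(ms):
    slept.append(ms)
    now[0] += ms


stamper = OperationTimestamper(fake_clock, fake_sleep)
first_delete = stamper.next_timestamp("delete", "row1")
first_insert = stamper.next_timestamp("insert", "row1")
print(first_delete, first_insert)
```

With the fake clock frozen, every further delete-insert pair on the same row pushes the timestamp one more millisecond ahead, which is why a workload dominated by this pattern eventually has to wait.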

5.6.5 Cloudera CDH HBase Compatibility

Cloudera CDH moved to HBase 1.0.0 in the CDH 5.4.0 version. To keep backward compatibility with HBase 0.98.x and earlier, the HBase client in CDH broke binary compatibility with Apache HBase 1.0.0. This created a compatibility problem for the HBase Handler when connecting to Cloudera CDH HBase for CDH versions 5.4 - 5.11. You may have been advised to solve this problem by using the old 0.98 HBase interface and setting the following configuration parameter:

gg.handler.name.hBase98Compatible=true 

This compatibility problem is now solved using Java Reflection. If you are using the HBase Handler to connect to CDH 5.4x, then you should change the HBase Handler configuration property to the following:

gg.handler.name.hBase98Compatible=false

Optionally, you can omit the property entirely because the default value is false.