4 Using the Flume Handler

This chapter includes the following sections:

  • Overview

  • Runtime Prerequisites

  • Classpath Configuration

  • Pluggable Formatters

  • Flume Handler Configuration

  • Sample Configuration

  • Troubleshooting

  • Data Mapping of Operations to Flume Events

  • Flume Handler Certification Matrix

  • Performance Considerations

  • Metadata Change Events

  • Example Flume Source Configuration

  • Advanced Features

4.1 Overview

The Oracle GoldenGate for Big Data Flume Handler is designed to stream change capture data from an Oracle GoldenGate trail to a Flume source. Apache Flume is an open source application whose primary purpose is streaming data into Big Data applications. The Flume architecture contains three main components, namely Sources, Channels, and Sinks, which collectively make up a pipeline for data. A Flume source publishes data to a Flume channel. A Flume sink retrieves data from a Flume channel and streams it to different targets. A Flume Agent is a container process that owns and manages a source, channel, and sink. A single Flume installation can host many agent processes. The Flume Handler can stream data from a trail file to Avro or Thrift RPC Flume sources.

4.2 Runtime Prerequisites

To run the Flume Handler, a Flume Agent configured with an Avro or Thrift Flume source must be up and running. Oracle GoldenGate can be collocated with Flume or located on a different machine. If it is located on a different machine, the host and port of the Flume source must be reachable over the network. For instructions on how to configure and start a Flume Agent process, see the Flume User Guide at

https://flume.apache.org/releases/content/1.6.0/FlumeUserGuide.pdf
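As a point of reference, a minimal Flume Agent configuration with an Avro source might look like the following. This is a generic, illustrative Flume example; the agent name a1, port 41414, and the memory channel and logger sink are arbitrary choices, not Oracle GoldenGate requirements:

```
# Illustrative Flume Agent configuration: Avro source -> memory channel -> logger sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Avro RPC source that the Flume Handler can connect to
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414
a1.sources.r1.channels = c1

# In-memory channel buffering events between source and sink
a1.channels.c1.type = memory

# Logger sink; replace with an HDFS or other sink in a real deployment
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

An agent configured this way is typically started with a command such as bin/flume-ng agent --conf conf --conf-file agent.conf --name a1.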

4.3 Classpath Configuration

You must configure two things in the gg.classpath configuration variable so that the Flume Handler can connect to the Flume source and run: the Flume Agent configuration file and the Flume client JARs. The Flume Handler uses the contents of the Flume Agent configuration file to resolve the host, port, and source type for the connection to the Flume source. The Flume client libraries do not ship with Oracle GoldenGate for Big Data. The Flume client library versions must match the version of Flume to which the Flume Handler is connecting. For a listing of the required Flume client JAR files by version, see Flume Handler Client Dependencies.

The Oracle GoldenGate property, gg.classpath, needs to be set to include the following default locations:

  • The default location of the Flume Agent configuration file is Flume_Home/conf.

  • The default location of the Flume client jars is Flume_Home/lib/*.

The gg.classpath variable must be configured exactly as shown in the following example. The path to the Flume Agent configuration file should contain only the directory, with no wildcard appended; including the * wildcard in that path makes the configuration file inaccessible. Conversely, the path to the dependency JARs should include the * wildcard character so that all of the JAR files in that directory are included in the associated classpath. Do not use *.jar. An example of a correctly configured gg.classpath variable is the following:

gg.classpath=dirprm/:/var/lib/flume/lib/*

If the Oracle GoldenGate for Big Data Flume Handler and Flume are not collocated, then the Flume Agent configuration file and the Flume client libraries will need to be copied to the machine hosting the Oracle GoldenGate for Big Data Flume Handler process.
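The wildcard rules above can be sanity-checked with a small script before starting the handler. The following is an illustrative sketch only, not part of Oracle GoldenGate; the function name and warning messages are invented:

```python
# Sketch: flag common gg.classpath mistakes described above.
import os

def check_gg_classpath(value):
    """Return a list of warnings for a gg.classpath string."""
    warnings = []
    for entry in value.split(":"):
        if entry.endswith("*.jar"):
            # The handler expects '*' for library directories, not '*.jar'.
            warnings.append(entry + ": use '*' rather than '*.jar'")
        elif entry.endswith("*"):
            # Library directory with wildcard: the correct form, nothing to flag.
            continue
        elif not os.path.isdir(entry):
            # Configuration-file directories must exist and carry no wildcard.
            warnings.append(entry + ": directory not found")
    return warnings

print(check_gg_classpath("dirprm/:/var/lib/flume/lib/*.jar"))
```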

4.4 Pluggable Formatters

The Oracle GoldenGate for Big Data Flume Handler supports all of the Big Data formatters included with the Oracle GoldenGate for Big Data release. The formatters are:

  • Avro Row

  • Avro Operation

  • JSON

  • XML

  • Delimited Text

4.5 Flume Handler Configuration

The configuration properties for the 12.2.0.1 Flume Handler are outlined as follows:

gg.handlerlist
  Property Value: flumehandler (choice of any name)
  Mandatory: Yes
  Description: List of handlers. Only one is allowed with grouping properties ON.

gg.handler.flumehandler.type
  Property Value: flume
  Mandatory: Yes
  Description: Type of handler to use.

gg.handler.flumehandler.format
  Property Value: Formatter class or short code
  Mandatory: No. Defaults to delimitedtext.
  Description: The formatter to be used. Can be one of the following:

  • avro_row

  • avro_op

  • delimitedtext

  • xml

  • json

  Alternatively, it is possible to write a custom formatter and include the fully qualified class name here.

gg.handler.flumehandler.RpcClientPropertiesFile
  Property Value: Any choice of filename
  Mandatory: No. Defaults to default-flume-rpc.properties.
  Description: Either the default default-flume-rpc.properties file or a specified custom RPC client properties file must exist in the classpath.

gg.handler.flumehandler.mode
  Property Value: op | tx
  Mandatory: No. Defaults to op.
  Description: Operation mode or transaction mode. Java Adapter grouping options can be used only in tx mode.

gg.handler.flumehandler.EventHeaderClass
  Property Value: The fully qualified class name of a custom implementation
  Mandatory: No. Defaults to DefaultFlumeEventHeader.
  Description: Class that defines which header properties are added to a Flume event.

gg.handler.flumehandler.EventMapsTo
  Property Value: op | tx
  Mandatory: No. Defaults to op.
  Description: Defines whether each Flume event represents an operation or a transaction. If the handler mode is op, EventMapsTo is always op.

gg.handler.flumehandler.PropagateSchema
  Property Value: true | false
  Mandatory: No. Defaults to false.
  Description: When set to true, the Flume Handler publishes schema events.

gg.handler.flumehandler.includeTokens
  Property Value: true | false
  Mandatory: No. Defaults to false.
  Description: When set to true, includes token data from the source trail files in the output; when set to false, excludes it.

4.6 Sample Configuration

gg.handlerlist=flumehandler
gg.handler.flumehandler.type=flume
gg.handler.flumehandler.RpcClientPropertiesFile=custom-flume-rpc.properties
gg.handler.flumehandler.format=avro_op
gg.handler.flumehandler.mode=tx
gg.handler.flumehandler.EventMapsTo=tx
gg.handler.flumehandler.PropagateSchema=true
gg.handler.flumehandler.includeTokens=false

A sample Replicat configuration and a Java Adapter properties file for a Flume integration can be found in the following directory:

GoldenGate_install_directory/AdapterExamples/big-data/flume

4.7 Troubleshooting

4.7.1 Java Classpath

Issues with the Java classpath are among the most common problems. The indication of a classpath problem is a ClassNotFoundException in the Oracle GoldenGate Java log4j log file. Setting the log level to DEBUG causes each of the JARs referenced in the gg.classpath variable to be logged, so that you can verify that all of the required dependency JARs are resolved.
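For example, debug logging is commonly enabled in the Java Adapter properties file with settings similar to the following (property names can vary by release; verify them against your Java Adapter documentation):

```
gg.log=log4j
gg.log.level=DEBUG
```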

4.7.2 Flume Flow Control Issues

The Flume Handler may in certain scenarios write to the Flume source faster than the Flume sink can dispatch messages. In this scenario, the Flume Handler works for a while, but once Flume can no longer accept messages it abends. The cause reported in the Oracle GoldenGate Java log file will likely be an EventDeliveryException indicating that the Flume Handler was unable to send an event. Check the Flume log for the exact cause of the problem. You may be able to reconfigure the Flume channel to increase capacity, or increase the Java heap configuration if the Flume Agent is experiencing an OutOfMemoryException. However, this may not entirely solve the problem: if the Flume Handler can push data to the Flume source faster than the Flume sink dispatches messages, any change may simply extend the period the Oracle GoldenGate handler can run before failing.
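For example, the buffering capacity of a Flume memory channel can be raised in the Flume Agent configuration file. The agent and channel names below are illustrative; capacity and transactionCapacity are standard Flume memory channel properties:

```
a1.channels.c1.type = memory
# Maximum number of events held in the channel
a1.channels.c1.capacity = 100000
# Maximum number of events per source/sink transaction
a1.channels.c1.transactionCapacity = 1000
```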

4.7.3 Flume Agent Configuration File Not Found

The Flume Handler abends at startup if the Flume Agent configuration file is not in the classpath. The result is generally a ConfigException listing the issue as an error loading the Flume producer properties. Check the gg.handler.{name}.RpcClientPropertiesFile configuration property to ensure that the name of the Flume Agent properties file is correct. Check the gg.classpath property to ensure that the classpath contains the directory containing the Flume Agent properties file, and that the path to that directory does not end with a wildcard * character.

4.7.4 Flume Connection Exception

The Flume Handler abends at startup if it is unable to connect to the Flume source. The root cause of this problem will likely be reported as an IOException in the Oracle GoldenGate Java log4j file, indicating a problem connecting to Flume at a given host and port. Check the following:

  • The Flume Agent process is running.

  • The Flume Agent configuration file that the Oracle GoldenGate for Big Data Flume Handler is accessing contains the correct host and port.

4.7.5 Other Failures

Review the contents of the Oracle GoldenGate Java log4j file.

4.8 Data Mapping of Operations to Flume Events

This section explains how operation data from the Oracle GoldenGate trail file is mapped by the Flume Handler into Flume Events based on different configurations. A Flume Event is a unit of data that flows through a Flume agent. The Event flows from Source to Channel to Sink, and is represented by an implementation of the Event interface. An Event carries a payload (byte array) that is accompanied by an optional set of headers (string attributes).

4.8.1 Operation Mode

The configuration for the Flume Handler is the following in the Oracle GoldenGate Java configuration file.

gg.handler.{name}.mode=op

The data for each individual operation from the Oracle GoldenGate trail file maps into a single Flume Event. Each event is immediately flushed to Flume. Each Flume Event has the following headers:

  • TABLE_NAME: The table name for the operation.

  • SCHEMA_NAME: The catalog name (if available) and the schema name of the operation.

  • SCHEMA_HASH: The hash code of the Avro schema. (Only applicable for Avro Row and Avro Operation formatters.)

4.8.2 Transaction Mode and EventMapsTo Operation

The configuration for the Flume Handler is the following in the Oracle GoldenGate Java configuration file.

gg.handler.flume_handler_name.mode=tx
gg.handler.flume_handler_name.EventMapsTo=op

The data for each individual operation from the Oracle GoldenGate trail file maps into a single Flume Event. Events are flushed to Flume at transaction commit. Each Flume Event has the following headers:

  • TABLE_NAME: The table name for the operation.

  • SCHEMA_NAME: The catalog name (if available) and the schema name of the operation.

  • SCHEMA_HASH: The hash code of the Avro schema. (Only applicable for Avro Row and Avro Operation formatters.)

It is suggested to use this mode when formatting data as Avro or delimited text. It is important to understand that configuring Extract or Replicat batching functionality will increase the number of operations processed in a transaction.

4.8.3 Transaction Mode and EventMapsTo Transaction

The configuration for the Flume Handler is the following in the Oracle GoldenGate Java configuration file.

gg.handler.flume_handler_name.mode=tx
gg.handler.flume_handler_name.EventMapsTo=tx

The data for all operations for a transaction from the source trail file are concatenated and mapped into a single Flume Event. The event is flushed at transaction commit. Each Flume Event has the following headers.

  • GG_TRANID: The transaction ID of the transaction

  • OP_COUNT: The number of operations contained in this Flume payload event

It is suggested to use this mode only when using self-describing formats such as JSON or XML. It is important to understand that configuring Extract or Replicat batching functionality will increase the number of operations processed in a transaction.
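The mappings described in the preceding sections can be modeled with a short sketch. The following is illustrative logic only, not Oracle GoldenGate code; the function and field names are invented:

```python
# Illustrative model of how the EventMapsTo setting affects the mapping of
# trail operations to Flume events (all names invented for this sketch).

def map_to_events(transaction_id, operations, event_maps_to="op"):
    """Map a committed transaction's operations to (headers, payload) pairs."""
    if event_maps_to == "op":
        # One event per operation; headers identify the source table.
        return [({"TABLE_NAME": op["table"]}, op["data"]) for op in operations]
    # event_maps_to == "tx": concatenate all operation payloads into one event.
    payload = b"".join(op["data"] for op in operations)
    headers = {"GG_TRANID": transaction_id, "OP_COUNT": str(len(operations))}
    return [(headers, payload)]

ops = [{"table": "QASOURCE.T1", "data": b'{"op":"I"}'},
       {"table": "QASOURCE.T1", "data": b'{"op":"U"}'}]
events = map_to_events("tx-42", ops, event_maps_to="tx")
```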

4.9 Flume Handler Certification Matrix

The Oracle GoldenGate for Big Data Flume Handler works with versions 1.6.x, 1.5.x and 1.4.x of Apache Flume. Compatibility with versions of Flume before 1.4.0 is not guaranteed.

The Flume Handler is compatible with the following versions of the Hortonworks Data Platform (HDP):

  • HDP 2.4 (Flume 1.5.2)

  • HDP 2.3 (Flume 1.5.2)

  • HDP 2.2 (Flume 1.5.2)

  • HDP 2.1 (Flume 1.4.0)

The Flume Handler is compatible with the following versions of the Cloudera Distributions of Hadoop (CDH):

  • CDH 5.7.x (Flume 1.6.0)

  • CDH 5.6.x (Flume 1.6.0)

  • CDH 5.5.x (Flume 1.6.0)

  • CDH 5.4.x (Flume 1.5.0)

  • CDH 5.3.x (Flume 1.5.0)

  • CDH 5.2.x (Flume 1.5.0)

  • CDH 5.1.x (Flume 1.5.0)

4.10 Performance Considerations

  • Replicat-based grouping is recommended to improve performance.

  • Extract-based grouping uses the grouping in the Java Adapter. Message-size-based grouping with the Java Adapter may be slower than operation-count-based grouping. If Adapter-based grouping is needed, operation-count-based grouping is recommended.

  • Transaction mode with the gg.handler.flume_handler_name.EventMapsTo=tx setting is recommended for best performance.

  • The maximum heap size of the Flume Handler may affect performance. Too little heap may result in frequent garbage collections by the JVM. Increasing the maximum heap size of the JVM in the Oracle GoldenGate Java properties file may improve performance.
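As an illustration, transaction mode with transaction-level events might be combined with Java Adapter operation-count grouping as follows. The maxGroupSize and minGroupSize names below are Java Adapter grouping properties; verify them against your release's documentation before use:

```
gg.handler.flumehandler.mode=tx
gg.handler.flumehandler.EventMapsTo=tx
# Group up to 1000 operations per grouped transaction (names per the
# Java Adapter grouping documentation; confirm for your release)
gg.handler.flumehandler.maxGroupSize=1000
gg.handler.flumehandler.minGroupSize=1000
```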

4.11 Metadata Change Events

The Oracle GoldenGate for Big Data 12.2.0.1 Flume Handler is adaptive to changes in DDL at the source. However, this functionality depends on the source replicated database and the upstream Oracle GoldenGate Capture process to capture and replicate DDL events. This feature is not immediately available for all database implementations in Oracle GoldenGate 12.2. Refer to the Oracle GoldenGate documentation for your database implementation for information about DDL replication.

Whenever a metadata change occurs at the source, the Flume Handler notifies the associated formatter of the metadata change event. Any cached schema that the formatter is holding for that table is deleted, and the next time the formatter encounters an operation for that table, the schema is regenerated.
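This caching behavior can be sketched as follows. The class and method names here are invented for illustration and do not correspond to actual Oracle GoldenGate classes:

```python
# Sketch of the schema-caching behavior described above: a metadata change
# event evicts the cached schema, and the next operation for that table
# triggers regeneration.

class SchemaCache:
    def __init__(self, generate):
        self._generate = generate   # callable: table name -> schema
        self._cache = {}
        self.generated = 0          # regeneration count, for illustration

    def schema_for(self, table):
        if table not in self._cache:
            self._cache[table] = self._generate(table)
            self.generated += 1
        return self._cache[table]

    def on_metadata_change(self, table):
        # Drop the cached schema; it is rebuilt on the next operation.
        self._cache.pop(table, None)

cache = SchemaCache(lambda t: {"name": t, "fields": []})
cache.schema_for("HR.EMP")          # generated
cache.schema_for("HR.EMP")          # served from cache
cache.on_metadata_change("HR.EMP")  # DDL change at the source
cache.schema_for("HR.EMP")          # regenerated
```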

4.12 Example Flume Source Configuration

4.12.1 Avro Flume Source

The following is sample configuration for an Avro Flume source from the Flume Agent configuration file:

client.type = default
hosts = h1
hosts.h1 = host_ip:host_port
batch-size = 100
connect-timeout = 20000
request-timeout = 20000

4.12.2 Thrift Flume Source

The following is sample configuration for a Thrift Flume source from the Flume Agent configuration file:

client.type = thrift
hosts = h1
hosts.h1 = host_ip:host_port

4.13 Advanced Features

4.13.1 Schema Propagation

The Flume Handler can propagate schemas to Flume. This is currently only supported for the Avro Row and Operation formatters. To enable this feature set the following property:

gg.handler.flume_handler_name.PropagateSchema=true

The Avro Row or Avro Operation formatters generate Avro schemas on a just-in-time basis: a schema is generated the first time an operation for a table is encountered. A metadata change event results in the cached schema reference for a table being cleared, so a new schema is generated the next time an operation is encountered for that table.

When schema propagation is enabled, the Flume Handler propagates each schema as a Flume Event when it is encountered.

Default Flume Schema Event headers for Avro include the following information:

  • SCHEMA_EVENT: TRUE

  • GENERIC_WRAPPER: TRUE/FALSE

  • TABLE_NAME: The table name as seen in the trail

  • SCHEMA_NAME: The catalog name (if available) and the schema name

  • SCHEMA_HASH: The hash code of the Avro schema

4.13.2 Security

Kerberos authentication for the Oracle GoldenGate for Big Data Flume Handler connection to the Flume source is possible, but this feature is supported only in Flume 1.6.0 (and assumed higher) using the Thrift Flume source. The feature is enabled solely by changing the configuration of the Flume source in the Flume Agent configuration file. The following is an example of the Flume source configuration from the Flume Agent configuration file showing how to enable Kerberos authentication. The Kerberos principal names of the client and the server must be provided, and the path to a Kerberos keytab file must be provided so that the password of the client principal can be resolved at runtime. For information on how to administer Kerberos, Kerberos principals and their associated passwords, and the creation of a Kerberos keytab file, see the Kerberos documentation.

client.type = thrift
hosts = h1
hosts.h1 =host_ip:host_port
kerberos=true
client-principal=flumeclient/client.example.org@EXAMPLE.ORG
client-keytab=/tmp/flumeclient.keytab
server-principal=flume/server.example.org@EXAMPLE.ORG

4.13.3 Fail Over Functionality

It is possible to configure the Flume Handler so that it will fail over in the event that the primary Flume source becomes unavailable. This feature is currently only supported in Flume 1.6.0 (and assumed higher) using the Avro Flume source. This feature is enabled solely with Flume source configuration in the Flume Agent configuration file. The following is sample configuration for enabling fail over functionality:

client.type=default_failover
hosts=h1 h2 h3
hosts.h1=host_ip1:host_port1
hosts.h2=host_ip2:host_port2
hosts.h3=host_ip3:host_port3
max-attempts = 3
batch-size = 100
connect-timeout = 20000
request-timeout = 20000

4.13.4 Load Balancing Functionality

It is possible to configure the Oracle GoldenGate for Big Data Flume Handler so that produced Flume events will be load balanced across multiple Flume sources. This feature is currently only supported in Flume 1.6.0 (and assumed higher) using the Avro Flume source. This feature is enabled solely with Flume source configuration in the Flume Agent configuration file. The following is sample configuration for enabling load balancing functionality:

client.type = default_loadbalance
hosts = h1 h2 h3
hosts.h1 = host_ip1:host_port1
hosts.h2 = host_ip2:host_port2
hosts.h3 = host_ip3:host_port3
backoff = false
maxBackoff = 0
host-selector = round_robin
batch-size = 100
connect-timeout = 20000
request-timeout = 20000