4 Using the Elasticsearch Handler

The Elasticsearch Handler allows you to store, search, and analyze large volumes of data quickly and in near real time.

This chapter describes how to use the Elasticsearch Handler.

4.1 Overview

Elasticsearch is a highly scalable open-source full-text search and analytics engine. Elasticsearch allows you to store, search, and analyze large volumes of data quickly and in near real time. It is generally used as the underlying engine or technology that drives applications with complex search features.

The Elasticsearch Handler uses the Elasticsearch Java client to connect to an Elasticsearch node and write data to it, see https://www.elastic.co.

4.2 Detailing the Functionality

This topic details the Elasticsearch Handler functionality.

4.2.1 About the Elasticsearch Version Property

The Elasticsearch Handler supports two different clients to communicate with the Elasticsearch cluster: the Elasticsearch transport client and the Elasticsearch High Level REST client.

  1. Set the gg.handler.name.version configuration value to 5.x, 6.x, or 7.x to connect to the Elasticsearch cluster using the transport client of the respective version.
  2. Set the gg.handler.name.version configuration value to REST7.x to connect to the Elasticsearch cluster using the Elasticsearch High Level REST client. The REST client supports Elasticsearch versions 7.x.

4.2.2 About the Index and Type

An Elasticsearch index is a collection of documents with similar characteristics. An index name must be lowercase. An Elasticsearch type is a logical group within an index. All the documents within an index or type should have the same number and type of fields.

The Elasticsearch Handler constructs the index name by concatenating the source trail schema name with the source trail table name. For three-part table names in the source trail, the index is constructed by concatenating the source catalog, schema, and table name.

The Elasticsearch Handler maps the source table name to the Elasticsearch type. The type name is case-sensitive.

Table 4-1 Elasticsearch Mapping

Source Trail               Elasticsearch Index         Elasticsearch Type
schema.tablename           schema_tablename            tablename
catalog.schema.tablename   catalog_schema_tablename    tablename

If an index does not already exist in the Elasticsearch cluster, a new index is created when the Elasticsearch Handler receives data (an INSERT or UPDATE operation in the source trail).
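
The naming rule in Table 4-1 can be illustrated with a short Python sketch. This is not the handler's code. The function name index_and_type is hypothetical, and lowercasing the index follows from the statement that index names are lowercase while type names are case-sensitive.

```python
def index_and_type(source_table):
    """Return (index, type) for a two- or three-part source table name."""
    parts = source_table.split(".")
    if len(parts) not in (2, 3):
        raise ValueError("expected schema.table or catalog.schema.table")
    # Index: all name parts joined with underscores, in lowercase.
    index = "_".join(parts).lower()
    # Type: the table name only, with its case preserved.
    doc_type = parts[-1]
    return index, doc_type


print(index_and_type("schema.tablename"))    # ('schema_tablename', 'tablename')
print(index_and_type("CS2CAT.S1SCH.N1TAB"))  # ('cs2cat_s1sch_n1tab', 'N1TAB')
```

The second call uses the catalog, schema, and table names that appear in the bulk execute log message later in this chapter.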

4.2.3 About the Document

An Elasticsearch document is a basic unit of information that can be indexed. Within an index or type, you can store as many documents as you want. Each document has a unique identifier based on the _id field.

The Elasticsearch Handler maps the source trail primary key column value as the document identifier.

4.2.4 About the Primary Key Update

The Elasticsearch document identifier is created based on the source table's primary key column value. The document identifier cannot be modified. The Elasticsearch Handler processes an update to a source primary key by performing a DELETE followed by an INSERT. While performing the INSERT, the new document may contain fewer fields than required. For the INSERT operation to contain all the fields in the source table, configure Extract to capture full before images for update operations, or use GETBEFORECOLS to write the before images of the required columns to the trail.
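
The following Python sketch illustrates why the new document can end up with fewer fields. It models an index as a plain dictionary keyed by document identifier. The function name apply_pk_update, the column names, and the values are hypothetical, and the sketch assumes that the INSERT is built only from the columns available in the trail record.

```python
def apply_pk_update(documents, old_id, new_id, columns_in_trail):
    """Model a primary key update as a DELETE followed by an INSERT."""
    # DELETE the document stored under the old identifier, if present.
    documents.pop(old_id, None)
    # INSERT a new document built only from the columns in the trail record.
    documents[new_id] = dict(columns_in_trail)


# Trail record that carries only the changed key column.
documents = {"83": {"ID": "83", "NAME": "pen", "PRICE": "1.50"}}
apply_pk_update(documents, "83", "84", {"ID": "84"})
print(documents)  # {'84': {'ID': '84'}}

# Trail record that carries every column (full images available).
full_documents = {"83": {"ID": "83", "NAME": "pen", "PRICE": "1.50"}}
apply_pk_update(full_documents, "83", "84",
                {"ID": "84", "NAME": "pen", "PRICE": "1.50"})
print(full_documents)  # {'84': {'ID': '84', 'NAME': 'pen', 'PRICE': '1.50'}}
```

In the first case the NAME and PRICE fields are lost, which is the situation that capturing full before images is meant to avoid.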

4.2.5 About the Data Types

Elasticsearch supports the following data types:

  • 32-bit integer

  • 64-bit integer

  • Double

  • Date

  • String

  • Binary

4.2.6 Operation Mode

The Elasticsearch Handler uses the operation mode for better performance. The gg.handler.name.mode property is not used by the handler.

4.2.7 Operation Processing Support

The Elasticsearch Handler maps the source table name to the Elasticsearch type. The type name is case-sensitive.

For three-part table names in the source trail, the index is constructed by concatenating the source catalog, schema, and table name.

INSERT

The Elasticsearch Handler creates a new index if the index does not exist, and then inserts a new document.

UPDATE

If an Elasticsearch index or document exists, the document is updated. If an Elasticsearch index or document does not exist, a new index is created and the column values in the UPDATE operation are inserted as a new document.

DELETE

If an Elasticsearch index or document exists, the document is deleted. If an Elasticsearch index or document does not exist, a new index is created with zero fields.

The TRUNCATE operation is not supported.
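
The operation rules above can be modeled with a small in-memory Python sketch. The cluster is represented as a dictionary of index name to documents. The names apply_operation and UnsupportedOperation are hypothetical, and the real handler works through the Elasticsearch client rather than a dictionary.

```python
class UnsupportedOperation(Exception):
    """Raised for operations that are not supported, such as TRUNCATE."""


def apply_operation(cluster, op, index, doc_id=None, columns=None):
    """Apply one source operation to cluster, a dict of index -> {id: document}."""
    if op == "TRUNCATE":
        raise UnsupportedOperation("TRUNCATE is not supported")
    if op not in ("INSERT", "UPDATE", "DELETE"):
        raise ValueError(f"unknown operation: {op}")
    # Every supported operation creates the index when it is missing.
    documents = cluster.setdefault(index, {})
    if op == "INSERT":
        documents[doc_id] = dict(columns or {})
    elif op == "UPDATE":
        # Update the existing document, or insert the UPDATE columns as a new one.
        documents.setdefault(doc_id, {}).update(columns or {})
    else:
        # DELETE: remove the document if present; a new index stays empty.
        documents.pop(doc_id, None)


cluster = {}
apply_operation(cluster, "DELETE", "schema_tablename", "1")
print(cluster)  # {'schema_tablename': {}}
apply_operation(cluster, "UPDATE", "schema_tablename", "1", {"NAME": "pen"})
print(cluster)  # {'schema_tablename': {'1': {'NAME': 'pen'}}}
```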

4.2.8 About the Connection

A cluster is a collection of one or more nodes (servers) that together hold all of the data. It provides federated indexing and search capabilities across all nodes.

A node is a single server that is part of the cluster, stores the data, and participates in the cluster’s indexing and searching.

The Elasticsearch Handler property gg.handler.name.ServerAddressList can be set to point to the nodes available in the cluster.
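
The property value is a comma-separated list in the form Server:Port[, Server:Port]. The following Python sketch shows one way to read such a value into host and port pairs. The function name and the host names in the example are hypothetical, and the handler's own parsing may differ.

```python
def parse_server_address_list(value="localhost:9300"):
    """Split a Server:Port[, Server:Port] value into (host, port) tuples."""
    nodes = []
    for entry in value.split(","):
        entry = entry.strip()
        # Split on the last colon so the port is always the final component.
        host, separator, port = entry.rpartition(":")
        if not separator or not host or not port.isdigit():
            raise ValueError(f"expected Server:Port, got {entry!r}")
        nodes.append((host, int(port)))
    return nodes


print(parse_server_address_list("node1.example.com:9300, node2.example.com:9300"))
# [('node1.example.com', 9300), ('node2.example.com', 9300)]
```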

4.3 Setting Up and Running the Elasticsearch Handler

You must ensure that the Elasticsearch cluster is set up correctly and that the cluster is up and running, see https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html. Alternatively, you can use Kibana to verify the setup.

Set the Classpath

The property gg.classpath must include all the JAR files required by the Java transport client. For a listing of the required client JAR files by version, see Elasticsearch Handler Transport Client Dependencies. For a listing of the required client JAR files for the Elasticsearch High Level REST client, see Elasticsearch High Level REST Client Dependencies.

Default location of 5.X JARs:

	Elasticsearch_Home/lib/*
	Elasticsearch_Home/plugins/x-pack/*
	Elasticsearch_Home/modules/transport-netty3/*
	Elasticsearch_Home/modules/transport-netty4/*
	Elasticsearch_Home/modules/reindex/*

The path can include the * wildcard character to include all of the JAR files in that directory in the associated classpath. Do not use *.jar.

The following is an example of the correctly configured classpath:

	gg.classpath=Elasticsearch_Home/lib/*
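
As a quick sanity check before starting Replicat, the wildcard rule above can be tested with a short Python sketch. The function name find_classpath_problems is hypothetical, and the colon separator is taken from the sample properties in this chapter, so adjust it if your platform uses a different separator.

```python
def find_classpath_problems(gg_classpath, separator=":"):
    """Return a list of messages for entries that break the wildcard rule."""
    problems = []
    for entry in gg_classpath.split(separator):
        entry = entry.strip()
        if not entry:
            problems.append("empty classpath entry")
        elif entry.endswith("*.jar"):
            # The documented form is dir/*, not dir/*.jar.
            problems.append(f"{entry}: use * instead of *.jar")
    return problems


print(find_classpath_problems("Elasticsearch_Home/lib/*"))      # []
print(find_classpath_problems("Elasticsearch_Home/lib/*.jar"))
# ['Elasticsearch_Home/lib/*.jar: use * instead of *.jar']
```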

4.3.1 Configuring the Elasticsearch Handler

The following are the configurable values for the Elasticsearch Handler. These properties are located in the Java Adapter properties file (not in the Replicat properties file).

To enable the selection of the Elasticsearch Handler, you must first configure the handler type by specifying gg.handler.name.type=elasticsearch and the other Elasticsearch properties as follows:

Table 4-2 Elasticsearch Handler Configuration Properties

gg.handlerlist
  Required/Optional: Required
  Legal Values: Name (any name of your choice)
  Default: None
  Explanation: The list of handlers to be used.

gg.handler.name.type
  Required/Optional: Required
  Legal Values: elasticsearch
  Default: None
  Explanation: Type of handler to use. For example, Elasticsearch, Kafka, Flume, or HDFS.

gg.handler.name.ServerAddressList
  Required/Optional: Optional
  Legal Values: Server:Port[, Server:Port]
  Default: localhost:9300
  Explanation: Comma-separated list of contact points of the nodes to connect to the Elasticsearch cluster.

gg.handler.name.clientSettingsFile
  Required/Optional: Required
  Legal Values: Transport client properties file.
  Default: None
  Explanation: The filename in classpath that holds Elasticsearch transport client properties used by the Elasticsearch Handler.

gg.handler.name.version
  Required/Optional: Optional
  Legal Values: 5.x|6.x|7.x|REST7.x
  Default: 7.x
  Explanation: The legal values 5.x, 6.x, and 7.x indicate using the Elasticsearch transport client to communicate with the Elasticsearch cluster. REST7.x indicates using the Elasticsearch High Level REST client to communicate with the Elasticsearch cluster.

gg.handler.name.bulkWrite
  Required/Optional: Optional
  Legal Values: true | false
  Default: false
  Explanation: When this property is true, the Elasticsearch Handler uses the bulk write API to ingest data into the Elasticsearch cluster. The batch size of bulk write can be controlled using the MAXTRANSOPS Replicat parameter.

gg.handler.name.numberAsString
  Required/Optional: Optional
  Legal Values: true | false
  Default: false
  Explanation: When this property is true, the Elasticsearch Handler writes all the number column values (Long, Integer, or Double) in the source trail to the Elasticsearch cluster as strings.

gg.handler.name.routingKeyMappingTemplate
  Required/Optional: Optional
  Legal Values: A string made up of constant values and templating keywords so that a value for the routing key can be resolved at runtime.
  Default: None
  Explanation: Set a template to dynamically resolve the routing key at runtime to control the shard in Elasticsearch to which the message is sent. The default is to use the id that is used by Elasticsearch as the routing key.

gg.handler.elasticsearch.upsert
  Required/Optional: Optional
  Legal Values: true | false
  Default: true
  Explanation: When this property is true, a new document is inserted if the document does not already exist when performing an UPDATE operation.

gg.handler.elasticsearch.routingTemplate
  Required/Optional: Optional
  Legal Values: ${columnValue[table1=column1,table2=column2,…]
  Default: None
  Explanation: N/A

gg.handler.name.authType
  Required/Optional: Optional
  Legal Values: none | basic | ssl
  Default: None
  Explanation: Controls the authentication type for the Elasticsearch REST client.
    • none - No authentication.
    • basic - Client authentication using username and password without message encryption.
    • ssl - Mutual authentication. The client authenticates the server using a truststore. The server authenticates the client using username and password. Messages are encrypted.

gg.handler.name.basicAuthUsername
  Required/Optional: Optional
  Legal Values: A valid username
  Default: None
  Explanation: The username for the server to authenticate the Elasticsearch REST client. Must be provided for auth types basic and ssl.

gg.handler.name.basicAuthPassword
  Required/Optional: Optional
  Legal Values: A valid password
  Default: None
  Explanation: The password for the server to authenticate the Elasticsearch REST client. Must be provided for auth types basic and ssl.

gg.handler.name.trustStore
  Required/Optional: Optional
  Legal Values: The path and name of the truststore file.
  Default: None
  Explanation: The truststore for the Elasticsearch client to validate the certificate received from the Elasticsearch server. Must be provided if the auth type is set to ssl. Valid only for the Elasticsearch REST client.

gg.handler.name.trustStorePassword
  Required/Optional: Optional
  Legal Values: The password to access the truststore.
  Default: None
  Explanation: The password for the truststore for the Elasticsearch REST client to validate the certificate received from the Elasticsearch server. Must be provided if the auth type is set to ssl.

gg.handler.name.maxConnectTimeout
  Required/Optional: Optional
  Legal Values: Positive integer
  Default: The default value of the Apache HTTP Components framework.
  Explanation: Sets the maximum wait period for a connection to be established from the Elasticsearch REST client to the Elasticsearch server. Valid only for the Elasticsearch REST client.

gg.handler.name.maxSocketTimeout
  Required/Optional: Optional
  Legal Values: Positive integer
  Default: The default value of the Apache HTTP Components framework.
  Explanation: Sets the maximum wait period in milliseconds to wait for a response from the service after issuing a request. May need to be increased when pushing large data volumes. Valid only for the Elasticsearch REST client.

gg.handler.name.proxyUsername
  Required/Optional: Optional
  Legal Values: The proxy server username.
  Default: None
  Explanation: If connectivity to Elasticsearch uses the REST client and is routed through a proxy server, then this property sets the username of your proxy server. Most proxy servers do not require credentials.

gg.handler.name.proxyPassword
  Required/Optional: Optional
  Legal Values: The proxy server password.
  Default: None
  Explanation: If connectivity to Elasticsearch uses the REST client and is routed through a proxy server, then this property sets the password of your proxy server. Most proxy servers do not require credentials.

gg.handler.name.proxyProtocol
  Required/Optional: Optional
  Legal Values: http | https
  Default: http
  Explanation: If connectivity to Elasticsearch uses the REST client and is routed through a proxy server, then this property sets the protocol of your proxy server.

gg.handler.name.proxyPort
  Required/Optional: Optional
  Legal Values: The port number of your proxy server.
  Default: None
  Explanation: If connectivity to Elasticsearch uses the REST client and is routed through a proxy server, then this property sets the port number of your proxy server.

gg.handler.name.proxyServer
  Required/Optional: Optional
  Legal Values: The host name of your proxy server.
  Default: None
  Explanation: If connectivity to Elasticsearch uses the REST client and is routed through a proxy server, then this property sets the host name of your proxy server.
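
The dependencies between the authentication properties in Table 4-2 can be checked with a short Python sketch before the handler is started. The function name missing_auth_properties is hypothetical, the handler name elasticsearch is taken from the samples in this chapter, and treating an unset authType as none is an assumption.

```python
def missing_auth_properties(props, handler="elasticsearch"):
    """Return the authentication properties that authType requires but are unset."""
    prefix = f"gg.handler.{handler}."
    # Assumption: an unset authType behaves like none.
    auth_type = props.get(prefix + "authType", "none")
    if auth_type not in ("none", "basic", "ssl"):
        raise ValueError(f"unsupported authType: {auth_type}")
    required = []
    if auth_type in ("basic", "ssl"):
        # Username and password are needed for both basic and ssl.
        required += ["basicAuthUsername", "basicAuthPassword"]
    if auth_type == "ssl":
        # The truststore and its password are needed only for ssl.
        required += ["trustStore", "trustStorePassword"]
    return [prefix + name for name in required if not props.get(prefix + name)]


props = {
    "gg.handler.elasticsearch.authType": "ssl",
    "gg.handler.elasticsearch.basicAuthUsername": "example_user",
    "gg.handler.elasticsearch.basicAuthPassword": "example_password",
}
print(missing_auth_properties(props))
# ['gg.handler.elasticsearch.trustStore', 'gg.handler.elasticsearch.trustStorePassword']
```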

Example 4-1 Sample Handler Properties file:

For 5.x Elasticsearch cluster:

gg.handlerlist=elasticsearch
gg.handler.elasticsearch.type=elasticsearch
gg.handler.elasticsearch.ServerAddressList=localhost:9300
gg.handler.elasticsearch.clientSettingsFile=client.properties
gg.handler.elasticsearch.version=5.x
gg.classpath=/path/to/elastic/lib/*:/path/to/elastic/modules/transport-netty4/*:/path/to/elastic/modules/reindex/*

For 5.x Elasticsearch cluster with x-pack:

gg.handlerlist=elasticsearch
gg.handler.elasticsearch.type=elasticsearch
gg.handler.elasticsearch.ServerAddressList=localhost:9300
gg.handler.elasticsearch.clientSettingsFile=client.properties
gg.handler.elasticsearch.version=5.x
gg.classpath=/path/to/elastic/lib/*:/path/to/elastic/plugins/x-pack/*:/path/to/elastic/modules/transport-netty4/*:/path/to/elastic/modules/reindex/*

A sample Replicat configuration file and a sample Java Adapter properties file can be found in the following directory:

GoldenGate_install_directory/AdapterExamples/big-data/elasticsearch

For the Elasticsearch REST client:

gg.handlerlist=elasticsearch
gg.handler.elasticsearch.type=elasticsearch
gg.handler.elasticsearch.ServerAddressList=localhost:9300
gg.handler.elasticsearch.version=rest7.x
gg.classpath=/path/to/elasticsearch/lib/*:/path/to/elasticsearch/modules/reindex/*:/path/to/elasticsearch/modules/lang-mustache/*:/path/to/elasticsearch/modules/rank-eval/*

4.3.2 About the Transport Client Settings Properties File

The Elasticsearch Handler uses a Java transport client to interact with the Elasticsearch cluster. The Elasticsearch cluster may have additional plug-ins, such as Shield or X-Pack, which may require additional configuration.

The gg.handler.name.clientSettingsFile property should point to a file that has additional client settings based on the version of the Elasticsearch cluster. The Elasticsearch Handler attempts to locate and load the client settings file using the Java classpath. The Java classpath must include the directory containing the properties file.

The client properties file for Elasticsearch (without any plug-in) is:

cluster.name=Elasticsearch_cluster_name

The Shield plug-in also supports additional capabilities like SSL and IP filtering. The properties can be set in the client.properties file, see https://www.elastic.co/guide/en/shield/current/_using_elasticsearch_java_clients_with_shield.html.

The client.properties file for Elasticsearch 5.x with the X-Pack plug-in is:

cluster.name=Elasticsearch_cluster_name
xpack.security.user=x-pack_username:x-pack-password

The X-Pack plug-in also supports additional capabilities. The properties can be set in the client.properties file, see https://www.elastic.co/guide/en/elasticsearch/client/java-api/5.1/transport-client.html and https://www.elastic.co/guide/en/x-pack/current/java-clients.html.

4.4 Performance Consideration

The Elasticsearch Handler gg.handler.name.bulkWrite property is used to determine whether the source trail records should be pushed to the Elasticsearch cluster one at a time or in bulk using the bulk write API. When this property is true, the source trail operations are pushed to the Elasticsearch cluster in batches whose size can be controlled by the MAXTRANSOPS parameter in the generic Replicat parameter file. Using the bulk write API provides better performance.
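
The difference between pushing records one at a time and pushing them in batches can be pictured with a short Python sketch. It only illustrates a batch-size cap. The function name batches is hypothetical, and the way Replicat actually groups operations into transactions is more involved than this.

```python
def batches(operations, max_trans_ops):
    """Yield consecutive slices of operations with at most max_trans_ops items."""
    if max_trans_ops < 1:
        raise ValueError("max_trans_ops must be at least 1")
    for start in range(0, len(operations), max_trans_ops):
        yield operations[start:start + max_trans_ops]


operations = ["op1", "op2", "op3", "op4", "op5"]
print(list(batches(operations, 2)))
# [['op1', 'op2'], ['op3', 'op4'], ['op5']]
```

With a batch size of 1 every operation becomes its own request, which corresponds to the one-at-a-time behavior.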

Elasticsearch uses different thread pools to improve how the memory consumption of threads is managed within a node. Many of these pools also have queues associated with them, which allow pending requests to be held instead of discarded.

For bulk operations, the default queue size is 50 (in version 5.2) and 200 (in version 5.3).

To avoid bulk API errors, you must set the Replicat MAXTRANSOPS size to match the bulk thread pool queue size at a minimum. The thread_pool.bulk.queue_size configuration property can be modified in the elasticsearch.yml file.

4.5 About the Shield Plug-In Support

Elasticsearch version 5.x supports a Shield plug-in, which provides basic authentication, SSL, and IP filtering. Similar capabilities exist in the X-Pack plug-in for Elasticsearch 6.x and 7.x. The additional transport client settings can be configured in the Elasticsearch Handler using the gg.handler.name.clientSettingsFile property.

4.6 About DDL Handling

The Elasticsearch Handler does not react to any DDL records in the source trail. Any data manipulation records for a new source table result in the automatic creation of an index or type in the Elasticsearch cluster.

4.7 Troubleshooting

This section contains information to help you troubleshoot various issues.

4.7.1 Incorrect Java Classpath

The most common initial error is a classpath that does not include all the required client libraries, which results in a ClassNotFound exception in the log4j log file.

The same error can also occur when a typographic error in the gg.classpath variable prevents the classpath from being resolved.

The Elasticsearch transport client libraries do not ship with the Oracle GoldenGate for Big Data product. You should properly configure the gg.classpath property in the Java Adapter Properties file to correctly resolve the client libraries, see Setting Up and Running the Elasticsearch Handler.

4.7.2 Elasticsearch Version Mismatch

The Elasticsearch Handler gg.handler.name.version property must be set to one of the following values: 5.x, 6.x, 7.x, or REST7.x, to match the major version number of the Elasticsearch cluster. For example, gg.handler.name.version=7.x.

The following errors may occur when there is a wrong version configuration:

Error: NoNodeAvailableException[None of the configured nodes are available:]

ERROR 2017-01-30 22:35:07,240 [main] Unable to establish connection. Check handler properties and client settings configuration.

java.lang.IllegalArgumentException: unknown setting [shield.user] 

Ensure that all required plug-ins are installed and review documentation changes for any removed settings.

4.7.3 Transport Client Properties File Not Found

To resolve this exception:

ERROR 2017-01-30 22:33:10,058 [main] Unable to establish connection. Check handler properties and client settings configuration.

Verify that the gg.handler.name.clientSettingsFile configuration property correctly specifies the Elasticsearch transport client settings file name. Verify that the gg.classpath variable includes the path to the directory that contains the file, and that the path to the properties file does not end with an asterisk (*) wildcard.

4.7.4 Cluster Connection Problem

This error occurs when the Elasticsearch Handler is unable to connect to the Elasticsearch cluster:

Error: NoNodeAvailableException[None of the configured nodes are available:]

Use the following steps to debug the issue:

  1. Ensure that the Elasticsearch server process is running.

  2. Validate the cluster.name property in the client properties configuration file.

  3. Validate the authentication credentials for the X-Pack or Shield plug-in in the client properties file.

  4. Validate the gg.handler.name.ServerAddressList handler property.

4.7.5 Unsupported Truncate Operation

The following error occurs when the Elasticsearch Handler finds a TRUNCATE operation in the source trail:

oracle.goldengate.util.GGException: Elasticsearch Handler does not support the operation: TRUNCATE

This error message is written to the handler log file before the Replicat process abends. Removing the GETTRUNCATES parameter from the Replicat parameter file resolves this error.

4.7.6 Bulk Execute Errors

DEBUG [main] (ElasticSearch5DOTX.java:130) - Bulk execute status: failures:[true] buildFailureMessage:[failure in bulk execution: [0]: index [cs2cat_s1sch_n1tab], type [N1TAB], id [83], message [RemoteTransportException[[UOvac8l][127.0.0.1:9300][indices:data/write/bulk[s][p]]]; nested: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$7@43eddfb2 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@5ef5f412[Running, pool size = 4, active threads = 4, queued tasks = 50, completed tasks = 84]]];]

This error may occur when Elasticsearch runs out of resources to process the operation. You can limit the Replicat batch size using MAXTRANSOPS to match the value of the thread_pool.bulk.queue_size Elasticsearch configuration parameter.

Note:

Changes to the Elasticsearch parameter, thread_pool.bulk.queue_size, are effective only after the Elasticsearch node is restarted.

4.8 Logging

The following log messages appear in the handler log file on successful connection:

Connection to a 5.x Elasticsearch cluster:

INFO [main] (Elasticsearch5DOTX.java:38) - **BEGIN Elasticsearch client settings**
INFO [main] (Elasticsearch5DOTX.java:39) - {xpack.security.user=user1:user1_kibana, cluster.name=elasticsearch-user1-myhost, request.headers.X-Found-Cluster=elasticsearch-user1-myhost}
INFO [main] (Elasticsearch5DOTX.java:52) - Connecting to Server[myhost.us.example.com] Port[9300]
INFO [main] (Elasticsearch5DOTX.java:64) - Client node name:  _client_
INFO [main] (Elasticsearch5DOTX.java:65) - Connected nodes: [{node-myhost}{w9N25BrOSZeGsnUsogFn1A}{bIiIultVRjm0Ze57I3KChg}{myhost}{198.51.100.1:9300}]
INFO [main] (Elasticsearch5DOTX.java:66) - Filtered nodes: []
INFO [main] (Elasticsearch5DOTX.java:68) - **END Elasticsearch client settings**

4.9 Known Issues in the Elasticsearch Handler

Elasticsearch: Trying to input very large number

Very large numbers, for example 9223372036854775807 or -9223372036854775808, result in inaccurate values in the Elasticsearch document. This is an issue with the Elasticsearch server and not a limitation of the Elasticsearch Handler.

The workaround for this issue is to ingest all the number values as strings using the gg.handler.name.numberAsString=true property.
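
This chapter does not state why the values become inaccurate. The following Python sketch shows one way such a loss can arise, on the assumption that the number passes through a double-precision floating-point conversion somewhere along the way. It is an illustration of that assumption only, and it also shows why a string representation avoids the problem.

```python
largest = 9223372036854775807        # largest signed 64-bit integer

# Assumption: the value is converted to a double-precision float at some point.
as_double = float(largest)

print(int(as_double))                # 9223372036854775808
print(int(as_double) == largest)     # False

# Sent as a string, every digit is preserved.
as_string = str(largest)
print(as_string)                     # 9223372036854775807
```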

Elasticsearch: Issue with index

The Elasticsearch Handler cannot write data into the same index if more than one table has similar column names but different column data types.

Index names are always lowercase, even though the catalog, schema, and table names in the trail may be case-sensitive.