66 About Elasticsearch

This chapter provides an overview of the Elasticsearch indexing and search service as integrated with the Oracle Communications Messaging Server message stores.

Elasticsearch Indexing and Search

The Classic message store and the Cassandra message store both support the Elasticsearch indexing and search engine. Elasticsearch is a distributed search engine built on Apache Lucene. It provides full-text search services with built-in high availability, replication, horizontal scaling, and automatic load balancing.

For Elasticsearch, a message store deployment includes an Elasticsearch cluster. Multiple Classic message stores can share the same Elasticsearch cluster. Typically, all message store servers in a deployment share the same Elasticsearch cluster.

The classic message store does not enable index and search services by default. When Elasticsearch is enabled, the message store sends the message content to Elasticsearch for indexing when messages are appended to the mailboxes; the IMAP server sends search queries to Elasticsearch to perform textual searches. The message store uses Indexed Search Converter (ISC) to convert binary message content to text before indexing. Therefore, ISC must be configured on the message store server when Elasticsearch is enabled.

In the classic store, ISC uses the file system to cache the text content. In the Cassandra store, ISC stores the cached content in Cassandra.

Note:

ISC requires read access to the message store in order to convert the content of a message. When used with classic store, ISC can be co-located with other message store processes or can run on a separate server with shared filesystem access to message store data.

Enabling Elasticsearch

To enable Elasticsearch on the message store, set the following options:

store.searchengine = elastic
elasticsearch.hostlist = a space separated list of elasticsearch hosts
elasticsearch.numshards = number of shards in elasticsearch cluster
elasticsearch.numreplicas = number of replicas in elasticsearch cluster
elasticsearch.storesource = false, if the storage space is limited

where,

searchengine specifies the message store search engine type. When searchengine is set to elastic, the Elasticsearch options control IMAP search operations.

hostlist specifies a space separated list of Elasticsearch hosts. Host format is host[:port]. If port is not specified, the port number is determined based on the setting of the port Elasticsearch option.

numshards specifies the number of shards in the Elasticsearch cluster. The message store uses this value when it creates the message store index in Elasticsearch. The number of shards cannot be changed after the index is created. (Default: 5)

numreplicas specifies the number of replicas in the Elasticsearch cluster. The value is used by stored to create the Elasticsearch message store index. The message store does not update the number of replicas after the index is created, but the number can be updated manually in Elasticsearch. (Default: 1)

storesource specifies whether the Elasticsearch _source field is enabled. The _source field contains the document body that was passed at index time. If storesource is enabled, the message store creates the Elasticsearch store/message index mapping with the _source field enabled, and IMAP copy uses the _source field from Elasticsearch to index the message in the destination folder. The _source field data consumes a significant amount of disk space, so you might want to disable it if storage space is limited. Disabling the _source field removes the ability to reindex, upgrade, or repair the index from Elasticsearch. (Default: false)
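The host[:port] format described for hostlist can be illustrated with a short Python sketch. It is not part of Messaging Server: the function name parse_hostlist and the default_port parameter are hypothetical, and the value 9200 is only an assumed stand-in for whatever the port Elasticsearch option is set to. IPv6 literals are not handled.

```python
def parse_hostlist(hostlist, default_port=9200):
    """Split a space-separated host[:port] list into (host, port) tuples.

    default_port stands in for the value of the 'port' Elasticsearch option;
    9200 is an assumption made for this illustration only.
    """
    hosts = []
    for entry in hostlist.split():
        # partition() returns the text before and after the first colon
        host, _, port = entry.partition(":")
        hosts.append((host, int(port) if port else default_port))
    return hosts


print(parse_hostlist("es1.example.com es2.example.com:9201"))
# [('es1.example.com', 9200), ('es2.example.com', 9201)]
```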

For a list of Elasticsearch options, see Messaging Server Reference.

Differences Between Elasticsearch and Brute-Force IMAP Searching

This section describes how IMAP searches differ between Elasticsearch and brute-force searching.

Wildcard Search

Wildcard searching refers to using a single character, such as an asterisk (*), to represent several characters or an empty string in an email search. Wildcard searching differentiates between the following types:

Prefix wildcard search: The wildcard is used at the end of the string, for example: appl*
Suffix wildcard search: The wildcard is used at the beginning of the string, for example: *pple
Substring search: The wildcard is used at the beginning and end of the string, for example: *ppl*

A brute-force search scans the email content linearly for the exact sequence of characters provided in the search command. This implicitly covers prefix, suffix, and substring searches.

For example, consider an email body that contains the following text:

Hello World! This is a test.

All the following searches in brute-force return this email:

search body world
search body wor
search body llo
search body "lo Wor"
search body !
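The brute-force behavior above can be modeled as a plain substring test. The following Python sketch is only an illustration, not the server's implementation; the function name brute_force_search is hypothetical, and case-insensitive matching is assumed because the example term "world" matches "World".

```python
def brute_force_search(term, text):
    """Return True if the exact character sequence occurs anywhere in text.

    Case is ignored, which is an assumption based on the examples above.
    """
    return term.lower() in text.lower()


body = "Hello World! This is a test."
terms = ["world", "wor", "llo", "lo Wor", "!"]

# Every term is a literal substring of the body, so every search matches.
results = {term: brute_force_search(term, body) for term in terms}
print(results)
```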

In Elasticsearch, the default search does not include prefix, suffix, or substring search. You must set the following configuration options to enable prefix, suffix, and substring search in Elasticsearch:

To enable prefix search:

msconfig set imap.indexer.prefix_search "subject body from to cc text bcc"

To enable suffix search:

msconfig set imap.indexer.suffix_search "subject body from to cc text bcc"

To enable substring search:

msconfig set imap.indexer.substring_search "subject body from to cc text bcc"

However, in Elasticsearch, wildcarding is allowed only for single-word search terms. Elasticsearch does not support wildcarding within phrases. Thus, even with wildcarding enabled, the following search using the previous example does not return the email in Elasticsearch.

search body "lo Wor"

Also, the following search returns all email messages in that folder, because special characters such as punctuation marks are discarded during both indexing and search:

search body !

The following searches in Elasticsearch do return the example email when prefix and suffix searches are enabled:

search body world
search body wor
search body llo
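The contrast with brute-force searching can be sketched in Python as a simplified model of token-based matching. This is not Elasticsearch's analyzer or query logic: the names tokenize and token_search are hypothetical, the tokenizer only keeps runs of letters and digits, and stopword filtering is ignored. The keyword arguments stand in for the prefix_search and suffix_search options.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (simplified)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_search(term, text, prefix=False, suffix=False, substring=False):
    """Model the token-based matching described above (illustration only)."""
    doc = tokenize(text)
    query = tokenize(term)
    if not query:
        # Only punctuation was supplied: nothing is left to filter on,
        # so every message matches.
        return True
    if len(query) > 1:
        # Phrase: tokens must match exactly and in sequence, no wildcarding.
        n = len(query)
        return any(doc[i:i + n] == query for i in range(len(doc) - n + 1))
    word = query[0]
    for token in doc:
        if token == word:
            return True
        if prefix and token.startswith(word):
            return True
        if suffix and token.endswith(word):
            return True
        if substring and word in token:
            return True
    return False


body = "Hello World! This is a test."
for term in ["world", "wor", "llo", "lo Wor", "!"]:
    print(repr(term), token_search(term, body, prefix=True, suffix=True))
# 'world' True, 'wor' True, 'llo' True, 'lo Wor' False, '!' True
```

With both keyword arguments left at False, only the whole-word term "world" and the punctuation-only term match.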

Special Characters and Searching

The Elasticsearch standard tokenizer splits text fields into tokens, treating whitespace and special characters (such as punctuation) as delimiters. Delimiter characters are discarded, subject to the following rules:

Periods (dots) that are not followed by whitespace are kept as part of the token, including Internet domain names.
The "@" character is among the set of token-splitting punctuation, so email addresses are not preserved as single tokens.

This means that if an email field contains a word with a special character, such as "@", Elasticsearch splits that word around the special character and indexes it as two words.
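The splitting rules above can be approximated with a regular expression. The Python sketch below is a rough stand-in for the standard tokenizer, not the real implementation: the name tokenize is hypothetical, only ASCII letters and digits are handled, and lowercasing is folded in for convenience.

```python
import re

# A token is a run of letters/digits, optionally continued by ".more" groups,
# so a dot that is followed by whitespace (or by nothing) is dropped.
TOKEN = re.compile(r"[a-z0-9]+(?:\.[a-z0-9]+)*")


def tokenize(text):
    """Approximate the described splitting: '@', '-', '!' and spaces delimit."""
    return TOKEN.findall(text.lower())


print(tokenize("user1@example.com"))
# ['user1', 'example.com']
print(tokenize("jean-marie@oracle.com"))
# ['jean', 'marie', 'oracle.com']
print(tokenize("Hello World! This is a test."))
# ['hello', 'world', 'this', 'is', 'a', 'test']
```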

For email address wildcard search, Elasticsearch treats the email address as two words in sequence (user name and domain name). The term in the search should not mix user name and domain together when wildcarding is enabled. That is, searching on user1@example.com is not a valid search in Elasticsearch.

The following example wildcard email address searches are all valid in Elasticsearch:

Prefix search: FROM user7642@example.com:

SEARCH FROM user
SEARCH FROM u
SEARCH FROM exampl
SEARCH FROM example.c

Suffix search: TO test1@central.example.com:

SEARCH TO st1
SEARCH TO st1@central.example.com
SEARCH TO mple.com
SEARCH TO le.com
SEARCH TO om

Substring search: CC user7170@us.example.com:

SEARCH CC ser71
SEARCH CC exampl
SEARCH CC s.exampl
SEARCH CC s.example.co

The following example wildcard email address searches are invalid in Elasticsearch:

FROM jean-marie@oracle.com:

SEARCH FROM ean-ma
SEARCH FROM marie@or

Words Not Indexed by Elasticsearch

The following words are filtered out from emails by the Elasticsearch English stopwords filter. These words are not indexed by Elasticsearch, and so are not searchable.

"a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
"into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
"their", "then", "there", "these", "they", "this", "to", "was", "will",
"with"
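The effect of the stopwords filter on a search can be sketched in Python. The set below copies the list above; the names STOPWORDS and searchable_terms are hypothetical, and the filter is only a model of the behavior described, not Elasticsearch code.

```python
STOPWORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
    "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
    "their", "then", "there", "these", "they", "this", "to", "was", "will",
    "with",
})


def searchable_terms(words):
    """Return only the words that survive the stopwords filter."""
    return [word for word in words if word.lower() not in STOPWORDS]


print(searchable_terms(["This", "is", "a", "test"]))
# ['test']
```

In this model, a search for "this" or "is" alone has nothing left to match against, which is why such words are not searchable.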