72 Overview of Cassandra Message Store and Solr

This chapter provides an overview of the Oracle Communications Messaging Server Cassandra message store and of the integrated indexing and search capabilities provided by the Solr search engine.

About the Cassandra Message Store and Solr Indexing and Search

DataStax Enterprise (DSE) is the database platform hosting the Cassandra message store, and Solr is the indexing and search engine. DataStax bundles Solr with DSE, so you do not need to install Oracle Communications Indexing and Search Service as the indexing and search engine. Solr is an open source enterprise search platform from the Apache Lucene project. Because Solr stores email content in an inverted index, searches are efficient and fast compared to the linear search in Berkeley Database.
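The following Python sketch illustrates the general idea of an inverted index: each word maps to the set of messages that contain it, so a search is a single lookup rather than a scan of every message. It is a simplified model, not the Solr or Lucene implementation; the sample messages and the function names (tokenize, build_inverted_index, indexed_search) are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical sample data: message ID -> body text.
messages = {
    1: "Hello World! This is a test.",
    2: "Quarterly report attached",
    3: "Test results for the world cup",
}

def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each token to the set of message IDs whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def indexed_search(index, term):
    """Look up a whole word in the index; no message text is scanned."""
    return sorted(index.get(term.lower(), set()))

index = build_inverted_index(messages)
print(indexed_search(index, "world"))  # [1, 3]
print(indexed_search(index, "wor"))    # [] because only whole tokens are indexed
```

Note that a plain inverted index matches whole words only, which is why partial-word matching needs additional configuration in a Solr-based store.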

In Messaging Server, the Cassandra message store consists of keyspaces. A keyspace, in a NoSQL data store, is a namespace that defines data replication on nodes. (It resembles the schema concept in relational database management systems.) The Messaging Server Cassandra message store consists of the following keyspaces:

  1. msg: Contains the email message "blobs." It can be very large.

  2. mbox: Contains user and mailbox metadata. It is relatively small and has a lot of mutations.

  3. index: Contains Cassandra tables for Solr indexing.

  4. cache: Contains cache tables for converted message blobs. The data in these tables is for Solr use only, and enables the Field Input Transformer (FIT) to use cached data instead of converting large message blobs for each update operation.

In Messaging Server, when using Cassandra message store and Solr, conversion of message parts (attachments) is done in a separate Java Virtual Machine (JVM) process that runs on an application server. This process is called the Indexed Search Converter (ISC). ISC runs as a web application using embedded Jetty as the application server.

Differences in Cassandra Message Store and Classic Message Store

The Cassandra message store differs from the classic message store in the following ways:

  • The Cassandra message store uses a single maintenance queue. Classic message store uses three maintenance queues.

  • The imcheck -d and imcheck -s commands are supported only by classic message store.

  • Other imcheck command differences with classic message store:

    • imcheck -m output: The Cassandra message store does not have start offset, cache ID and cache offset.

    • imcheck -q: The RecNo is not unique in Cassandra message store.

  • When you rename a mailbox on Cassandra, the subscription list is updated automatically. IMAP has SUBSCRIBE and UNSUBSCRIBE commands, which are typically used for shared folders. There is no requirement in the IMAP specification for the subscription to recognize the new name when a shared folder is renamed. Classic message store subscriptions do not follow renames, whereas Cassandra message store does.

  • Cassandra message store folder purge is always deferred. This is optional in classic message store. Also, Cassandra message store mailbox purge and cleanup tasks are combined into one task.

  • Cassandra message store does not advertise the IMAP ACL extension (RFC 4314). Although the ACL commands are implemented, the semantics of ACLs with respect to shared folders are not fully implemented. As a result, it is inappropriate to advertise the IMAP ACL extension in this version of Cassandra message store.

  • Cassandra message store does not support the IMAP ANNOTATE capability. (The ANNOTATE extension to IMAP permits clients and servers to maintain "meta data" for messages, or individual message parts, stored in a mailbox on the server.)

  • In Cassandra message store, deleting a user does not remove that user's access rights from the ACLs of other users' folders. When a user is deleted and recreated, shared folders are still accessible.

  • User ID (puid) and folder ID (fid) are permanently unique in Cassandra message store.

  • Cassandra message store does not use message store partitions. Thus, you can no longer perform maintenance tasks such as expire, backup, and reconstruct by store partition.

  • The imexpire command cannot install expire rules by partition.

  • Cassandra message store supports only mailbox reconstruct.

  • Cassandra message store supports only Unified Configuration.

  • IMAP subscribe: Cassandra message store does not permit users to subscribe to non-existing folders.

  • Specific differences that apply to Messaging Server 8.0.x:

    • The reconstruct -x option is obsolete (its behavior is now always enabled).

    • imcheck -e output: The File type column has been removed.

    • imcheck -q output: RecNo has a queue prefix.

  • Classic message store allows use of invalid identifiers in IMAP ACLs; they are simply ignored when evaluating the ACL. Cassandra message store does not allow use of invalid identifiers in an ACL; an attempt to use an invalid identifier will result in no change to the ACL.

  • Cassandra message store supports both an external identifier and a persistent identifier for each user. Changing the persistent identifier of an existing user is not recommended. The external identifier for a user can be changed freely and is used for ACLs and shared folders instead of the persistent identifier. See the topic on User Identifiers in Messaging Server Reference, as well as the ldap_permid and ldap_extid options for more information.

  • A cross-domain SETACL command is not allowed. Thus, for a Cassandra message store, the value of the store.privatesharedfolders.restrictdomain configuration option is always 1 (to disallow regular users from sharing private folders with users in another domain). In addition, neither the default domain nor the canonical domain name can ever be changed.

Differences Between Solr and Indexing and Search Service

Table 72-1 describes the differences between Solr-based indexing and search and Oracle Communications Indexing and Search Service.

Table 72-1 Comparison of Solr and Indexing and Search Service

| Solr (Messaging Server 8.0.2) | Indexing and Search Service (Messaging Server 7.x/8.x) |
| --- | --- |
| Uses DataStax Enterprise (DSE) Max Cassandra database and integrated Solr indexing and search. | Uses Berkeley Database (BDB) and separately installed and configured Indexing and Search Service. |
| Uses persistent-unique user and folder IDs (separate from display ID). | Folder IDs are maintained in a file-system based directory structure. |
| Uses ENS as the notification service. (JMQ is no longer supported in Messaging Server 8.0.2.) | Indexing and Search Service deployment depends upon using multiple JMQ brokers. |
| Uses a single attachment converter, Apache Tika, which is included in the Messaging Server package. Tika supports what is currently supported in Indexing and Search Service. Tika is capable of supporting more attachment types in future releases. | Uses multiple attachment converters, one for each document type. |
| Uses replicated copies of the index. | Uses one copy of the index. |
| All searches are Lucene indexed. | Only some searches are Lucene indexed. |
| Because the index is replicated, when you add capacity, a rebalancing occurs without the need for a service outage. | Increasing capacity requires that you move users and take a service outage. |
| A consistency repair of data is performed at the Cassandra/Solr and application/CQL layers. | A consistency repair of data is performed by a slow IMAP scan. |
| Enables searching by custom header fields (Solr creates dynamic header fields for each email as it arrives). | Uses predefined, standard header fields. Indexing and Search Service search fails on custom header fields and falls back to IMAP-only linear search. |
| Creates dynamic flag fields at runtime. | Uses predefined, standard flag fields. Indexing and Search Service fails on custom field searches and falls back to IMAP-only linear search. |


Differences in Index and Search Features

The following index and search features (available in Indexing and Search Service) are currently not supported in Messaging Server 8.0.2 and Cassandra/Solr:

  • All searches in DataStax/Solr occur through IMAP, with no fallback search.

  • Solr does not have a RESTful search interface. (IMAP/CQL is the search interface.)

  • Oracle Convergence attachment-based search results and thumbnail view are not currently available.

Other differences between Solr and Indexing and Search Service include the following:

  • When Messaging Server 8.0.x is configured to send searches to Indexing and Search Service, the IMAP SEARCH command supports only RETURN (ALL). In Messaging Server 8.0.2, however, all the return parameters are supported.

  • In Messaging Server 8.0.2, if the ESEARCH command contains a message sequence number term and a non-selected mailbox to search, the following message appears: "Unsupported Search Term"

  • The following features are not currently supported in Messaging Server 8.0.2 when configured to use Cassandra message store:

    • Annotation Search

    • Shared folders search

Differences Between Cassandra and Classic Store IMAP Searching

This section describes the differences in IMAP searches between Cassandra message store and classic message store.

Wildcard Search

Wildcard searching refers to using a single character, such as an asterisk (*), to represent several characters or an empty string in an email search. Wildcard searching differentiates between the following types:

  • Prefix wildcard search: The wildcard is used at the end of the string, for example: appl*

  • Suffix wildcard search: The wildcard is used at the beginning of the string, for example: *pple

  • Substring search: The wildcard is used at the beginning and end of the string: *ppl*

When a search is issued in classic message store, the store linearly searches through email content to match the exact sequence of characters provided in the search command. This implicitly includes prefix, suffix, and substring search.

For example, consider an email body that contains the following text:

Hello World! This is a test.

All the following searches in classic message store return this email:

search body world
search body wor
search body llo
search body "lo Wor"
search body !
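The classic store behavior described above can be modeled as a case-insensitive substring test over the raw message text. The Python sketch below is only an illustration of that matching rule, not the store's actual code; the function name classic_body_search is hypothetical.

```python
def classic_body_search(body: str, term: str) -> bool:
    """Model of a linear search: does the exact character sequence occur in the body?

    Matching is case-insensitive, so both sides are lowercased first.
    """
    return term.lower() in body.lower()

body = "Hello World! This is a test."

# Every search term from the listing above matches, including the
# phrase fragment and the lone punctuation mark.
for term in ["world", "wor", "llo", "lo Wor", "!"]:
    print(repr(term), classic_body_search(body, term))  # True for each term
```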

Because search is Solr-based in Cassandra message store, the default search does not include prefix, suffix, and substring search. You must set the following configuration options to be able to conduct prefix, suffix, and substring search on Cassandra message store:

  • To enable prefix search:

    msconfig set imap.indexer.prefix_search "subject body from to cc text bcc"
    
  • To enable suffix search:

    msconfig set imap.indexer.suffix_search "subject body from to cc text bcc"
    
  • To enable substring search:

    msconfig set imap.indexer.substring_search "subject body from to cc text bcc"
    

However, in Cassandra message store, wildcarding is allowed only for single-word search terms. Solr does not support wildcarding within phrases. Thus, even with wildcarding enabled, the following search using the previous example does not return the email in Cassandra message store.

search body "lo Wor"

Also, the following search returns all email messages in that folder because special characters such as punctuation marks are discarded by both indexing and search:

search body !

The following searches in Cassandra message store do return the example email when prefix and suffix searches are enabled:

search body world
search body wor
search body llo
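The Cassandra message store results described in this section can be approximated with a token-based model in Python. This is a simplified sketch under stated assumptions, not the Solr query logic: the tokenizer is a rough stand-in for Solr's, stop-word filtering is ignored, multi-word terms are assumed to match only as exact consecutive words, and the function name token_search and its keyword flags are hypothetical stand-ins for the three configuration options.

```python
import re

def tokenize(text):
    """Rough stand-in for indexing: lowercase words, punctuation discarded."""
    return re.findall(r"[a-z0-9]+", text.lower())

def token_search(body, term, prefix=False, suffix=False, substring=False):
    """Return True if the search term matches the tokenized body."""
    term_tokens = tokenize(term)
    if not term_tokens:
        # The term was only punctuation, so nothing is left to restrict the search.
        return True
    doc_tokens = tokenize(body)
    if len(term_tokens) > 1:
        # Phrase: wildcarding does not apply, so require exact consecutive words.
        n = len(term_tokens)
        return any(doc_tokens[i:i + n] == term_tokens
                   for i in range(len(doc_tokens) - n + 1))
    word = term_tokens[0]
    return any(
        tok == word
        or (prefix and tok.startswith(word))
        or (suffix and tok.endswith(word))
        or (substring and word in tok)
        for tok in doc_tokens
    )

body = "Hello World! This is a test."
print(token_search(body, "wor"))                            # False (default: whole words only)
print(token_search(body, "wor", prefix=True))               # True
print(token_search(body, "llo", suffix=True))               # True
print(token_search(body, "lo Wor", prefix=True, suffix=True, substring=True))  # False
print(token_search(body, "!"))                              # True (matches every message)
```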

Special Characters and Searching

The Solr standard tokenizer splits text fields into tokens, treating whitespace and special characters (like punctuation) as delimiters. Delimiter characters are discarded, with the following exception and clarification:

  • Periods (dots) that are not followed by whitespace are kept as part of the token, including Internet domain names.

  • The "@" character is among the set of token-splitting punctuation, so email addresses are not preserved as single tokens.

This means that if an email field contains words with special characters, such as "@," Solr splits that word around the special character and indexes it as two words.

For email address wildcard search, Cassandra message store treats the email address as two words in sequence (user name and domain name). The term in the search should not mix user name and domain together when wildcarding is enabled. That is, searching on user1@example.com is not a valid search in Solr.
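To make the splitting rules concrete, the Python sketch below approximates them with a regular expression: runs of letters and digits form a token, a period is kept only when more letters or digits follow it, and every other character (including "@" and "-") acts as a delimiter. This is an approximation for illustration, not the Solr standard tokenizer itself; the pattern, the lowercasing step, and the function name approx_standard_tokens are assumptions.

```python
import re

# A token is a run of letters/digits, optionally continued by ".more" segments.
# A period followed by whitespace (or by nothing) is therefore dropped.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*")

def approx_standard_tokens(text):
    """Split text into approximate index tokens (lowercased for case-insensitive search)."""
    return [t.lower() for t in TOKEN_RE.findall(text)]

print(approx_standard_tokens("user7642@example.com"))
# ['user7642', 'example.com']
print(approx_standard_tokens("jean-marie@oracle.com"))
# ['jean', 'marie', 'oracle.com']

# With prefix search enabled, a term is compared against one token at a time:
tokens = approx_standard_tokens("user7642@example.com")
print(any(tok.startswith("example.c") for tok in tokens))  # True
```

Under this model the user name and the domain name are indexed as separate words, which is why a wildcard term should stay within one of them.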

The following example wildcard email address searches are all valid in Cassandra message store:

  • Prefix search: FROM user7642@example.com:

    SEARCH FROM user
    SEARCH FROM u
    SEARCH FROM exampl
    SEARCH FROM example.c
    
  • Suffix search: TO test1@central.example.com:

    SEARCH TO st1
    SEARCH TO st1@central.example.com
    SEARCH TO mple.com
    SEARCH TO le.com
    SEARCH TO om
    
  • Substring search: CC user7170@us.example.com:

    SEARCH CC ser71
    SEARCH CC exampl
    SEARCH CC s.exampl
    SEARCH CC s.example.co
    

The following example wildcard email address searches are invalid in Cassandra message store:

  • FROM jean-marie@oracle.com:

    SEARCH FROM ean-ma
    SEARCH FROM marie@or
    

Words Not Indexed by Solr

The following words are filtered out from emails by the Solr StopFilterFactory filter. These words are not indexed by Solr, and so are not searchable.

"a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"
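The effect of stop-word filtering can be sketched in Python as follows. The word list is the one given above; the tokenizer and the function name indexable_tokens are simplified, hypothetical stand-ins rather than the Solr filter chain.

```python
import re

# Stop words listed above; these never reach the index.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
    "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
    "their", "then", "there", "these", "they", "this", "to", "was", "will",
    "with",
}

def indexable_tokens(text):
    """Tokenize roughly (lowercase words only) and drop the stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(indexable_tokens("Hello World! This is a test."))
# ['hello', 'world', 'test']
```

In this model, a search for "this" against the example message finds nothing, because the word was removed before indexing.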