43 Public Site Search

WebCenter Sites includes a new framework for managing search indices. This framework forms the basis for searches both in the editorial interface and on the live site, that is, the side that visitors see. Searches on the live site are called public site searches, hence the name of this chapter.

This chapter introduces the search framework and explains how to use the API to build public site searches.

This chapter contains the following sections:

Section 43.1, "Overview of the Search Framework"
Section 43.2, "Index Types"
Section 43.3, "Search API Overview"
Section 43.4, "Advanced Configuration"

43.1 Overview of the Search Framework

Note:

You can skip this section if you are primarily interested in the usage of the search API.

Figure 43-1 shows how the search engine integration framework works with the rest of WebCenter Sites.

Figure 43-1 Search Engine Integration


The search framework consists of the Search API, special asset event listeners, and a polling system for queues. This framework is used in coordination with the Event Management and Queue Management frameworks. This document focuses primarily on the search framework.

The asset framework detects changes and additions to assets and fires events.
Registered listeners queue the changes, using a persistent queue implementation. A given event can be placed in one or more persistent queues. Each queue can be thought of as the source of data for a search index.
Once asset events are queued, a background process empties the queues and routes their contents to the Search API.
The Search API chooses the appropriate (configurable) search engine vendor implementation to start the indexing process.
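
The flow described above can be modeled in a few lines of Java. This is a minimal sketch for orientation only: `AssetEvent`, `IndexQueue`, `Indexer`, `onAssetEvent`, and `poll` are hypothetical names invented for this example, not WebCenter Sites classes, and the in-memory queue stands in for the persistent queue implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Illustrative model only; none of these types are WebCenter Sites classes. */
final class IndexingPipelineSketch {

    /** A change to an asset, as an event listener might record it. */
    static final class AssetEvent {
        final String assetType;
        final long assetId;
        final String action;

        AssetEvent(String assetType, long assetId, String action) {
            this.assetType = assetType;
            this.assetId = assetId;
            this.action = action;
        }
    }

    /** In-memory stand-in for a persistent queue that feeds exactly one index. */
    static final class IndexQueue {
        final String indexName;
        private final Queue<AssetEvent> pending = new ArrayDeque<>();

        IndexQueue(String indexName) {
            this.indexName = indexName;
        }

        void enqueue(AssetEvent event) {
            pending.add(event);
        }

        /** Removes and returns everything queued so far. */
        List<AssetEvent> drain() {
            List<AssetEvent> batch = new ArrayList<>(pending);
            pending.clear();
            return batch;
        }
    }

    /** Stand-in for the search engine implementation chosen by the Search API. */
    interface Indexer {
        void index(String indexName, List<AssetEvent> batch);
    }

    /**
     * Listener step: one event may land in several queues. Assumption for this
     * sketch: the "Global" queue accepts every asset type, and every other
     * queue accepts only the asset type it is named after.
     */
    static void onAssetEvent(AssetEvent event, List<IndexQueue> queues) {
        for (IndexQueue queue : queues) {
            if (queue.indexName.equals("Global") || queue.indexName.equals(event.assetType)) {
                queue.enqueue(event);
            }
        }
    }

    /** Background step: empty each queue and route its contents to the indexer. */
    static int poll(List<IndexQueue> queues, Indexer indexer) {
        int routed = 0;
        for (IndexQueue queue : queues) {
            List<AssetEvent> batch = queue.drain();
            if (!batch.isEmpty()) {
                indexer.index(queue.indexName, batch);
                routed += batch.size();
            }
        }
        return routed;
    }

    public static void main(String[] args) {
        List<IndexQueue> queues = List.of(new IndexQueue("Global"), new IndexQueue("Content_C"));
        onAssetEvent(new AssetEvent("Content_C", 1L, "update"), queues);
        poll(queues, (name, batch) -> System.out.println(name + " receives " + batch.size() + " event(s)"));
    }
}
```

In this model a single `Content_C` event reaches both the Global queue and the Content_C queue, which mirrors the statement that a given event can be queued into more than one persistent queue.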

43.2 Index Types

WebCenter Sites creates two types of indices: the Global index and the AssetType index. The Global index holds data for all asset types that are enabled for it. If you want to search for a phrase or expression across multiple asset types (for example, to build a Google-like search interface), the Global index is the more appropriate choice.

Although the Global index contains data for all fields of the index, it does not store the data in a form that is suitable for parametric searches. An AssetType index contains indexed information for a given asset type in a form that can be searched parametrically. The WebCenter Sites Admin interface supports the configuration of Asset Type searches, which includes attribute-based searches for the indexing-enabled asset types. More information about the search configuration options is available in the Oracle Fusion Middleware WebCenter Sites Administrator's Guide.

This section contains the following topics:

Section 43.2.1, "Global Index"
Section 43.2.2, "Asset Type Index"

43.2.1 Global Index

Global index is used by the Contributor interface to build a global search UI. An instance of the index also exists on the delivery server. The index on the delivery server can be used to build public site searches.

Note that only assets published to the live site after search is configured are available for searches, because data is indexed during publishing. Assets that already exist on the live site before search is configured are not reflected in the Global index until they are re-indexed on the live site.

A search index functions roughly like a database table. The Global index consists of the following fields:

Note:

The field names are case-sensitive.

Table 43-1 Fields in the Global Index

Field name	Description
`defaultSearchField`	This field contains all the data of the indexed asset. This is the field you would search for in full-text searching. The `defaultSearchField` contains index data for all attributes of the asset, including any binary field data. Data for all the attributes is merged into this single field and indexed. The index itself does not 'store' data for this field, but does allow full-text searching.
`id`	This field contains the asset id.
`AssetType`	This field contains the asset type (for example, Content_C or Product_P).
`locale`	This field contains the locale string (for example, `en_US`).
`name`	This field contains the name of the asset.
`description`	Description associated with the asset.
`subtype`	Name of the subtype (flex definition name).
`subtypeid`	ID of the subtype (flex def id).
`updateddate`	Last updated date as found at the time of indexing.
`siteid`	IDs of all sites in which this asset is available.
`startdate`	Start date field in the asset table.
`enddate`	End date field in the asset table.

43.2.2 Asset Type Index

An asset type index is created when the asset type is enabled in the Admin interface: select the Admin tab, then Search, and then Configure Asset Type Search. Once an asset type is enabled, an index is created under /shared/lucene/<Asset type name>. This index contains all attributes of the given type as fields in the index. For example, the index for Content_C contains the following fields:

Note:

The field names are case-sensitive.

Table 43-2 Fields in the Asset Type Index

Field Name	Description
`DefaultSearchField`	This field contains all the data of the indexed asset. This is the field you would search for in full text searching. The `DefaultSearchField` contains index data for all attributes of the asset, including any binary field data. Data for all the attributes is merged into this single field and indexed. The index itself does not 'store' data for this field, but does allow full text searching.
`id`	This field contains the asset id.
`AssetType`	This field contains the asset type (for example, Content_C or Product_P).
`locale`	This field contains the locale string (for example, `en_US`).
`name`	This field contains the name of the asset.
`description`	Description associated with the asset.
`subtype`	Name of the subtype (flex definition name).
`subtypeid`	ID of the subtype (flex def id).
`updateddate`	Last updated date as found at the time of indexing.
`siteid`	IDs of all sites in which this asset is available.
`startdate`	Start date field in the asset table.
`enddate`	End date field in the asset table.
`Dimension`	ID of the dimension
`Dimension-parent`	ID of the Dimension parent
`createdby`	User name that created this asset
`createddate`	Date the asset was created
`Publist`	List of site names this asset belongs to
`Relationships`	Asset IDs of related items (flex only)
`externaldoctype`	Not used
`filename`	File name used for static publishing
`flextemplateid`	ID of the flex definition (flex only)
`fw_uid`	Globally unique id of this asset
`path`	Path used for static publishing
`renderid`	Object ID of the Template asset assigned to a flex asset.
`ruleset`	XML document of the ruleset
`status`	Status associated with the asset
`subtype`	Subtype name
`subtypeid`	Subtype id (flex only)
`template`	Template name
`updatedby`	User name that last updated this asset
`updateddate`	Date of last update
`urlexternaldoc`	Not used
`urlexternaldocxml`	Not used
`FSIIAbstract`	Flex attribute
`FSIIBody`	Flex attribute
`FSIIByline`	Flex attribute
`FSIIDescriptionAttr`	Flex attribute
`FSIIHeadline`	Flex attribute
`FSIINameAttr`	Flex attribute
`FSIIPostDate`	Flex attribute
`FSIISubheadline`	Flex attribute
`FSIITemplateAttr`	Flex attribute

To visualize which fields are available in a given index, use the tool Luke, available at:

http://www.getopt.org/luke/

Once you launch the tool, use the tool's browse function to load the index by simply locating the folder that contains the index (for example: ../shared/lucene/Content_C).

43.3 Search API Overview

This section contains the following topics:

Section 43.3.1, "SearchEngine"
Section 43.3.2, "QueryExpression"
Section 43.3.3, "Configuration"

43.3.1 SearchEngine

The SearchEngine interface defines the two key functions of a search engine implementation: indexing and searching. More information about SearchEngine is available in the Oracle Fusion Middleware WebCenter Sites Java API Reference.

The source of index data is given to SearchEngine. In response to an indexing request, SearchEngine creates the search index if it does not already exist, and updates its contents based on the accessors of IndexSource.
index() works from a given IndexSource instance. Depending on the search engine's implementation details, it operates on new, modified, and deleted data coming from the IndexSource (in most search engines, modified data must first be deleted and then re-indexed).

index() also invokes the index lifecycle methods (startIndexing() and endIndexing()) on the given instance of IndexSource.
search() runs a QueryExpression against one or more indexes (IndexSources) and produces a single set of results, sorted by relevance (or by SortOrder, if one is specified and is usable across indices).
A configuration lookup interface (IndexSourceConfig) is supplied to the SearchEngine instance, which can use it to look up IndexSource properties if needed.
A QueryConverter interface is supplied to SearchEngine. This interface converts a given QueryExpression to its native form (recognizable by the specific search engine). This makes it possible to control externally the query language that the search engine uses.
SearchResult is an abstraction over what is returned from the search engine. SearchResult is an iterator over ResultRow, a subclass of IndexRow that contains relevance information. The getRelavence() method returns a double; the higher the value, the more relevant this ResultRow is for the given query.
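
The idea of searching several indexes and returning one relevance-ordered result set can be illustrated with a small self-contained model. This sketch does not use the WebCenter Sites API: `Row` and `merge` are hypothetical names, and the sketch assumes that relevance scores from different indexes are directly comparable, which a real search engine may not guarantee.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative model only; these are not WebCenter Sites classes. */
final class MergedSearchSketch {

    /** Hypothetical stand-in for a result row that carries a relevance score. */
    static final class Row {
        final String index;
        final String id;
        final double relevance;

        Row(String index, String id, double relevance) {
            this.index = index;
            this.id = id;
            this.relevance = relevance;
        }
    }

    /** Combines the hits from several indexes into one list, most relevant first. */
    static List<Row> merge(List<List<Row>> hitsPerIndex) {
        List<Row> all = new ArrayList<>();
        for (List<Row> hits : hitsPerIndex) {
            all.addAll(hits);
        }
        // Higher relevance values sort first, matching the description above.
        all.sort(Comparator.comparingDouble((Row r) -> r.relevance).reversed());
        return all;
    }

    public static void main(String[] args) {
        List<Row> merged = merge(List.of(
            List.of(new Row("Global", "101", 0.4), new Row("Global", "102", 0.9)),
            List.of(new Row("Content_C", "103", 0.7))));
        for (Row row : merged) {
            System.out.println(row.index + " " + row.id + " " + row.relevance);
        }
    }
}
```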

43.3.2 QueryExpression

The QueryExpression interface is a generic interface for defining search criteria. All search engines support native formats for building queries. The native form contains definitions of wildcards, relevance hints, etc. These tend to be very specific to each search engine.

Search engines also provide a basic query construct, which can be programmatically built (AND & OR over field matches). These can be thought of in terms of generalized programmable interfaces, although limited in power.

QueryExpression encapsulates four distinct characteristics of search engine queries:

Native text search format: Most search engines support a sophisticated native format for search, including wildcards and special hints. This is available through getStringFormat().
Conjunction and disjunction: ANDs and ORs of conditions, using the and() and or() methods.
Pagination: Using the getStartIndex() and getMaxResults() methods.
Sorting: Using getSortOrder().
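
To make these characteristics concrete, the following sketch models a query object that combines conditions and carries pagination settings. It is not the WebCenter Sites QueryExpression interface: `QuerySketch`, `field`, and `page` are hypothetical, the accessor names are borrowed from the list above only for readability, the string form is a Lucene-like notation chosen for illustration, and sorting is omitted.

```java
/** Illustrative model only; this is not the WebCenter Sites QueryExpression interface. */
final class QuerySketch {
    private final String text;   // query in a Lucene-like string form
    private int startIndex = 0;  // position of the first result to return
    private int maxResults = 10; // maximum number of results per page

    private QuerySketch(String text) {
        this.text = text;
    }

    /** Matches one field against one value; embedded quotes are not escaped in this sketch. */
    static QuerySketch field(String name, String value) {
        return new QuerySketch(name + ":\"" + value + "\"");
    }

    /** Conjunction: both conditions must match. */
    QuerySketch and(QuerySketch other) {
        return combine("AND", other);
    }

    /** Disjunction: either condition may match. */
    QuerySketch or(QuerySketch other) {
        return combine("OR", other);
    }

    private QuerySketch combine(String operator, QuerySketch other) {
        QuerySketch combined = new QuerySketch("(" + text + " " + operator + " " + other.text + ")");
        combined.startIndex = startIndex;
        combined.maxResults = maxResults;
        return combined;
    }

    /** Sets the pagination window and returns this query for chaining. */
    QuerySketch page(int startIndex, int maxResults) {
        this.startIndex = startIndex;
        this.maxResults = maxResults;
        return this;
    }

    String getStringFormat() {
        return text;
    }

    int getStartIndex() {
        return startIndex;
    }

    int getMaxResults() {
        return maxResults;
    }

    public static void main(String[] args) {
        QuerySketch query = field("name", "news")
            .and(field("locale", "en_US"))
            .page(20, 10);
        System.out.println(query.getStringFormat());
    }
}
```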

43.3.3 Configuration

Make sure that search indexing is enabled:

In the SystemEvents table, verify that SearchIndexEvent is enabled (the enabled field is set to 1). This event is configured to run constantly in the background (*:*:* */*/*); in practice, it runs about every 30 seconds.

Make sure that the asset listener is registered:

Assets are queued for processing by the search framework through asset events. The asset event listeners are registered in the AssetListener_reg table. Make sure that the following entry exists in this table; if it does not, add it.

Table 43-3 AssetListener_reg Table

ID	Listener	Blocking
1153937286234	`com.openmarket.basic.event.SearchAssetIdEventListener`	Y

IndexSourceMetaDataConfig: table that stores configuration information for IndexSource. This describes the structure and nature of the index itself. This should have a row for Global by default. Any asset type enabled for Asset type index will have an additional row in this table.

SearchEngineMetaDataConfig: stores the search engine configuration. This table should have a row for Lucene, by default.

Both tables are configured correctly by the installer and are managed through the Admin interface.

The defaults should suffice. For details on the settings stored in these tables, see Section 43.4, "Advanced Configuration".

43.4 Advanced Configuration

The following sections explain aspects of advanced configuration that may be necessary in certain circumstances. However, in most cases, the defaults should be sufficient.

This section contains the following topics:

Section 43.4.1, "Configuring Lucene Parameters"
Section 43.4.2, "Configuring the Custom AnalyzerFactory"

43.4.1 Configuring Lucene Parameters

Note:

Some of these parameters can cause significant changes in the way the index performs at run time. Refer to the Lucene documentation and rely on experimentation to determine the best settings for your site. If you do not have a compelling reason to change the defaults, it is highly advised that you keep the defaults.

In the Lucene search engine, an index can be created with a certain set of parameters that determine how the index is created and how it performs. While the Lucene default parameters are reasonable, WebCenter Sites provides administrators with a way to change them.

The IndexSourceMetaDataConfig table contains one row for each index in the system. Each row has a field named "properties" whose contents are used to configure Lucene parameters. Parameter-value pairs are separated by semicolons (';') as shown below:

param1=value1;param2=value2
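
When checking a properties value before saving it, it can help to split it into individual parameters. The following is a minimal sketch of such a check; `IndexProperties` and `parse` are hypothetical helpers written for this example and are not part of WebCenter Sites. The sketch trims whitespace as a convenience, and it is an assumption that WebCenter Sites tolerates whitespace in the same way.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper; not part of the WebCenter Sites API. */
final class IndexProperties {

    /** Splits "param1=value1;param2=value2" into an ordered name-to-value map. */
    static Map<String, String> parse(String properties) {
        Map<String, String> result = new LinkedHashMap<>();
        if (properties == null) {
            return result;
        }
        for (String pair : properties.split(";")) {
            String trimmed = pair.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate empty input and trailing semicolons
            }
            int equals = trimmed.indexOf('=');
            if (equals <= 0) {
                throw new IllegalArgumentException("Expected name=value but found: " + trimmed);
            }
            result.put(trimmed.substring(0, equals).trim(), trimmed.substring(equals + 1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        // Parameter names come from the table of supported parameters; the values are examples only.
        Map<String, String> params = parse("mergeFactor=20;maxBufferedDocs=100");
        params.forEach((name, value) -> System.out.println(name + " -> " + value));
    }
}
```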

Parameters supported by WebCenter Sites are listed in Table 43-4.

Table 43-4 Parameters Supported by WebCenter Sites

Parameter	Type	Description
`mergeFactor`	Integer	Determines how often segment indices are merged. With smaller values, less RAM is used while indexing, and searches on unoptimized indices are faster, but indexing speed is slower. With larger values, more RAM is used during indexing, and while searches on unoptimized indices are slower, indexing is faster. Thus larger values (> 10) are best for batch index creation, and smaller values (< 10) for indices that are interactively maintained. This must never be less than `2`. The default value is `10`.
`maxMergeDocs`	Integer	Determines the largest number of documents ever merged. Small values (e.g., less than 10,000) are best for interactive indexing, as this limits the length of pauses while indexing to a few seconds. Larger values are best for batched indexing and speedier searches. Defaults to the maximum integer value (2^31 - 1).
`maxBufferedDocs`	Integer	Determines the minimal number of documents required before the buffered in-memory documents are merged and a new Segment is created. Defaults to `10`.
`optimizeInterval`	Integer	Determines the time interval (in seconds) between optimize() calls. The default value is 30 seconds, which is the recommended value for most systems. If you have a large amount of data changes, set this parameter to any value within the range of 300 to 600 seconds.
`commitLockTimeout`	Long	Sets the maximum time to wait for a commit lock (in milliseconds). Defaults to `10000`.
`maxFieldLength`	Integer	Maximum number of terms that will be indexed for a single field in a document. This limits the amount of memory required for indexing, so that collections with very large files will not crash the indexing process by running out of memory. Note that this effectively truncates large documents, excluding from the index terms that occur further in the document. If you know your source documents are large, be sure to set this value high enough to accommodate the expected size. If you set it to the maximum Integer value (2^31 - 1), then the only limit is memory, but you should anticipate an OutOfMemoryError. By default, no more than 10,000 terms will be indexed for a field.
`termIndexInterval`	Integer	Sets the interval between indexed terms. Large values cause less memory to be used by IndexReader, but slow random-access to terms. Small values cause more memory to be used by an IndexReader, and speed random-access to terms. This parameter determines the amount of computation required per query term, regardless of the number of documents that contain that term. In particular, it is the maximum number of other terms that must be scanned before a term is located and its frequency and position information may be processed. In a large index with user-entered query terms, query processing time is likely to be dominated not by term lookup but rather by the processing of frequency and positional data. In a small index or when many uncommon query terms are generated (e.g., by wildcard queries) term lookup may become a dominant cost. In particular, numUniqueTerms/interval terms are read into memory by an IndexReader, and, on average, interval/2 terms must be scanned for each random term access. Default value is `128`.
`useCompoundFile`	String (must be `yes` or `no`)	Setting to turn on usage of a compound file. When on, multiple files for each segment are merged into a single file once the segment creation is finished.
`writeLockTimeout`	Long	Sets the maximum time to wait for a write lock (in milliseconds). Default value is `1000`.
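
The memory and lookup trade-off described for `termIndexInterval` can be worked through numerically. The sketch below simply applies the two expressions from that row, numUniqueTerms/interval and interval/2. `TermIndexEstimate` is a hypothetical helper, and the figure of 1,280,000 unique terms is an arbitrary value chosen for illustration.

```java
/** Hypothetical helper that applies the two estimates given for termIndexInterval. */
final class TermIndexEstimate {

    /** Approximate number of terms an IndexReader reads into memory: numUniqueTerms / interval. */
    static long termsHeldInMemory(long numUniqueTerms, int interval) {
        requirePositive(interval);
        return numUniqueTerms / interval;
    }

    /** Average number of terms scanned per random term access: interval / 2. */
    static double averageTermsScanned(int interval) {
        requirePositive(interval);
        return interval / 2.0;
    }

    private static void requirePositive(int interval) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
    }

    public static void main(String[] args) {
        long uniqueTerms = 1_280_000L; // illustrative figure, not a measurement
        for (int interval : new int[] {64, 128, 256}) {
            System.out.println("interval=" + interval
                + " inMemory=" + termsHeldInMemory(uniqueTerms, interval)
                + " avgScanned=" + averageTermsScanned(interval));
        }
    }
}
```

With the default interval of 128 and the illustrative term count, 10,000 terms are held in memory and 64 terms are scanned on average per lookup. Doubling the interval halves the first number and doubles the second.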

43.4.2 Configuring the Custom AnalyzerFactory

In Lucene, an Analyzer represents a policy for extracting index terms from text. Analyzers are used at the time of indexing as well as searching for various tasks such as removing stop words, removing white spaces, etc.

Different Analyzers exist in the Lucene repository for handling various locales. Often Analyzers are used for injecting synonyms or addressing accented characters gracefully. You can also build your own Analyzer by using any of the Lucene standard analyzers as a basis.

The WebCenter Sites Lucene implementation uses StandardAnalyzer, a general-purpose analyzer for the English language. However, WebCenter Sites supports custom Analyzers through a plugin interface, AnalyzerFactory. The configured AnalyzerFactory is used to look up the analyzer when it is required during indexing or searching. The AnalyzerFactory looks up the analyzer in the following instances:

When building the index as a whole
When parsing a query
When indexing an individual row

To plug in a custom AnalyzerFactory, implement the AnalyzerFactory interface and register your implementation. Registration is done by modifying a row in the SearchEngineMetaDataConfig table. Add the following to the properties field of the row whose Name field is set to Lucene:

AnalyzerFactory=<fully qualified class name of your custom AnalyzerFactory>