Using Public Site Search

WebCenter Sites includes a new framework for managing search indices. This framework forms the basis for searches both in the editorial interface and on the live site (that is, the visitor-facing side), hence the name public site search.

About the Search Framework

The search framework consists of the Search API, special asset event listeners, and a polling system for queues. You use this framework in coordination with the Event Management and Queue Management frameworks. This topic focuses primarily on the search framework.

Note:

You can skip this topic if you are primarily interested in the usage of the search API.

The following figure shows how the search engine integration framework works with the rest of WebCenter Sites.

Figure 69-1 Search Engine Integration

The asset framework detects changes and additions to assets and fires events.
Registered listeners queue the changes, using a persistent queue implementation. A given event can be queued into one or many persistent queues. Each queue can be thought of as the source of data for a search index.
Once asset events are queued, a background process empties the queues and routes their contents to the Search API.
The Search API chooses the appropriate (configurable) search engine vendor implementation to start the indexing process.
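
The flow above can be illustrated with a small, self-contained sketch. Every class and method name here (AssetEvent, IndexQueue, SearchPipelineSketch) is a hypothetical stand-in and not part of the WebCenter Sites API. An in-memory queue stands in for the persistent queue, and the indexing step is reduced to producing a label for each routed event.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical stand-in for an asset change notification.
final class AssetEvent {
    final String assetType;
    final long assetId;

    AssetEvent(String assetType, long assetId) {
        this.assetType = assetType;
        this.assetId = assetId;
    }
}

// Hypothetical stand-in for one persistent queue; each queue feeds one index.
final class IndexQueue {
    final String indexName;
    private final String acceptedType; // null means "accept every asset type"
    private final Queue<AssetEvent> pending = new ConcurrentLinkedQueue<>();

    IndexQueue(String indexName, String acceptedType) {
        this.indexName = indexName;
        this.acceptedType = acceptedType;
    }

    boolean accepts(AssetEvent event) {
        return acceptedType == null || acceptedType.equals(event.assetType);
    }

    void enqueue(AssetEvent event) {
        pending.add(event);
    }

    // Removes and returns everything currently queued, in arrival order.
    List<AssetEvent> drain() {
        List<AssetEvent> drained = new ArrayList<>();
        AssetEvent next;
        while ((next = pending.poll()) != null) {
            drained.add(next);
        }
        return drained;
    }
}

final class SearchPipelineSketch {
    // Listener step: one event may land in one or many queues.
    static void onAssetEvent(AssetEvent event, List<IndexQueue> queues) {
        for (IndexQueue queue : queues) {
            if (queue.accepts(event)) {
                queue.enqueue(event);
            }
        }
    }

    // Background step: empty each queue and hand its contents to the indexer.
    // Here "indexing" only records which index would receive which asset.
    static List<String> poll(List<IndexQueue> queues) {
        List<String> indexed = new ArrayList<>();
        for (IndexQueue queue : queues) {
            for (AssetEvent event : queue.drain()) {
                indexed.add(queue.indexName + ":" + event.assetType + ":" + event.assetId);
            }
        }
        return indexed;
    }

    public static void main(String[] args) {
        List<IndexQueue> queues = new ArrayList<>();
        queues.add(new IndexQueue("Global", null));
        queues.add(new IndexQueue("Content_C", "Content_C"));
        onAssetEvent(new AssetEvent("Content_C", 1L), queues);
        onAssetEvent(new AssetEvent("Product_P", 2L), queues);
        System.out.println(poll(queues));
    }
}
```

In this sketch, the Content_C event is queued twice (once for each index it feeds), while the Product_P event reaches only the Global queue.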

Index Types

Two types of indices are created in WebCenter Sites: the Global index and the AssetType index. The Global index holds all indexed data (that is, data for every asset type enabled for the Global index). To search for a phrase or expression across multiple asset types (for example, to build a Google-like search interface), the Global index is more appropriate.

While Global index contains data for all fields of the index, it does not store the data in a form that is suitable for parametric searches. An AssetType index contains indexed information for a given asset type in a manner that can be searched parametrically. The Admin interface supports the configuration of Asset Type searches, which includes attribute-based searches for the indexing-enabled asset types. See Adding Asset Types to the Search Index in Administering Oracle WebCenter Sites.

Global Index

Global index is used by the Oracle WebCenter Sites: Contributor interface to build a global search UI. An instance of the index also exists on the delivery server. The index on the delivery server can be used to build public site searches.

Only assets that are published to the live site after search is configured are available for searches, because the data is indexed during publishing. Assets that existed on the live site before search was configured are not reflected in the Global index until they are re-indexed on the live site.

A search index functions roughly like a database table.

The following table describes the fields in Global index.

Note:

The field names are case-sensitive.

Table 69-1 Fields in the Global Index

Field name	Description
defaultSearchField	Contains all the data of the indexed asset, including index data for every attribute and any binary field data. Data for all the attributes is merged into this single field and indexed. This is the field to search for full-text queries. The index itself does not store the data for this field, but it does allow full-text searching. Note: Search strings must be entered in lowercase only.
id	This contains the asset ID.
AssetType	This contains the asset type (for example, `Content_C` or `Product_P`).
locale	This contains the locale string (for example, `en_US`).
name	This contains the name of the asset.
description	Description associated with the asset.
subtype	Name of the subtype (flex definition name).
subtypeid	ID of the subtype (flex def ID).
updateddate	Last updated date as found at the time of indexing.
siteid	IDs of all sites in which this asset is available.
startdate	Start date field in the asset table.
enddate	End date field in the asset table.

Asset Type Index

An asset type index is created when indexing is enabled for that asset type in the Admin interface (select the Admin tab, then Search, and then Configure Asset Type Search). Once an asset type is enabled, an index is created under /shared/lucene/<Asset type name>. This index contains all attributes of the given type as fields in the index.

The following table describes the fields in the Content_C index.

Note:

The field names are case-sensitive.

Table 69-2 Fields in the Asset Type Index

Field Name	Description
DefaultSearchField	Contains all the data of the indexed asset, including index data for every attribute and any binary field data. Data for all the attributes is merged into this single field and indexed. This is the field to search for full-text queries. The index itself does not store the data for this field, but it does allow full-text searching. Note: Search strings must be entered in lowercase only.
id	Asset ID
AssetType	Asset type (for example, Content_C or Product_P)
locale	Locale string (for example, en_US)
name	Name of the asset
description	Description associated with the asset
subtype	Name of the subtype (flex definition name)
subtypeid	ID of the subtype (flex def ID)
updateddate	Last updated date as found at the time of indexing
siteid	IDs of all sites in which this asset is available
startdate	Start date field in the asset table
enddate	End date field in the asset table
Dimension	ID of the dimension
Dimension-parent	ID of the Dimension parent
createdby	User name that created this asset
createddate	Date the asset was created
Publist	List of site names this asset belongs to
Relationships	Asset IDs of related items (flex only)
externaldoctype	Not used
filename	File name used for static publishing
flextemplateid	ID of the flex definition (flex only)
fw_uid	Globally unique ID of this asset
path	Path used for static publishing
renderid	Object ID of the Template asset assigned to a flex asset.
ruleset	XML document of the ruleset
status	Status associated with the asset
subtype	Subtype name
subtypeid	Subtype ID (flex only)
template	Template name
updatedby	User name that last updated this asset
updateddate	Date of last update
urlexternaldoc	Not used
urlexternaldocxml	Not used
FSIIAbstract	Flex attribute
FSIIBody	Flex attribute
FSIIByline	Flex attribute
FSIIDescriptionAttr	Flex attribute
FSIIHeadline	Flex attribute
FSIINameAttr	Flex attribute
FSIIPostDate	Flex attribute
FSIISubheadline	Flex attribute
FSIITemplateAttr	Flex attribute

To visualize which fields are available in a given index, use the tool Luke, available at:

http://www.getopt.org/luke/

After you launch the tool, use the tool's browse function to load the index by simply locating the folder that contains the index (for example: ../shared/lucene/Content_C).

About Search API

When you use the Search API, you’ll work with the SearchEngine interface and the QueryExpression interface.

SearchEngine

The SearchEngine interface defines the key functions of a search engine implementation: indexing and searching. More information about SearchEngine is available in the Java API Reference for Oracle WebCenter Sites.

The source of index data is given to SearchEngine. In response to an indexing request, SearchEngine creates the search index (if it does not already exist) and updates its contents based on the IndexSource accessors.
index() works from a given IndexSource instance. Depending on the search engine's implementation details, it operates on new, modified, and deleted data coming from the IndexSource (in most search engines, modified data must be deleted first and then re-indexed).
index() also invokes the index lifecycle methods (startIndexing() and endIndexing()) on the given IndexSource instance.
search() runs a QueryExpression against one or many indexes (IndexSources) and returns a single set of results, sorted by relevance (or by SortOrder, if specified and usable across indices).
A configuration lookup interface (IndexSourceConfig) is supplied to the SearchEngine instance, which can use it to look up IndexSource properties if needed.
A QueryConverter interface is supplied to SearchEngine. This interface converts a given QueryExpression to its native form (recognizable by the specific search engine). This makes it possible to control externally the query language that the search engine uses.
SearchResult is an abstraction over what the search engine returns. SearchResult is an iterator over ResultRow, a subclass of IndexRow that contains relevance information. The getRelavence() method returns a double; the higher the value, the higher the relevance of this ResultRow for the given query.

QueryExpression

The QueryExpression interface is a generic interface for defining search criteria. All search engines support native formats for building queries. The native form contains definitions of wildcards, relevance hints, and so on. These tend to be very specific to each search engine.

Search engines also provide a basic query construct that can be built programmatically (AND and OR over field matches). These constructs can be thought of as generalized programmable interfaces, although they are limited in power.

QueryExpression encapsulates four distinct characteristics of search engine queries:

Native text search format: Most search engines support a sophisticated native format for search, including wildcards, special hints, and so on. This is available through getStringFormat().
Conjunction and disjunction: ANDs and ORs of conditions, using the and() and or() methods.
Pagination: Using the getStartIndex() and getMaxResults() methods.
Sorting: Using the getSortOrder() method.
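
To make the four characteristics concrete, the following is a minimal sketch of a query object that combines them. SketchQuery is a hypothetical stand-in class, not the QueryExpression interface itself: only the accessor names getStringFormat(), getStartIndex(), getMaxResults(), and getSortOrder() and the and() and or() method names come from the text above. Their signatures, the field(), page(), and sortBy() helpers, the default page size, and the Lucene-style "field:value" rendering are all assumptions made for illustration. Consult the Java API Reference for the real types.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical immutable query object; NOT the WebCenter Sites QueryExpression.
final class SketchQuery {
    private final String text;            // native-style string form
    private final int startIndex;         // pagination: first row to return
    private final int maxResults;         // pagination: page size
    private final List<String> sortOrder; // sorting: field names, in priority order

    private SketchQuery(String text, int startIndex, int maxResults, List<String> sortOrder) {
        this.text = text;
        this.startIndex = startIndex;
        this.maxResults = maxResults;
        this.sortOrder = sortOrder;
    }

    // Assumed helper: a single field match, with an arbitrary default page of 10 rows.
    static SketchQuery field(String field, String value) {
        return new SketchQuery(field + ":" + value, 0, 10, Collections.<String>emptyList());
    }

    // Conjunction: both conditions must match.
    SketchQuery and(SketchQuery other) {
        return new SketchQuery("(" + text + " AND " + other.text + ")", startIndex, maxResults, sortOrder);
    }

    // Disjunction: either condition may match.
    SketchQuery or(SketchQuery other) {
        return new SketchQuery("(" + text + " OR " + other.text + ")", startIndex, maxResults, sortOrder);
    }

    // Assumed helper: returns a copy with new pagination settings.
    SketchQuery page(int newStartIndex, int newMaxResults) {
        return new SketchQuery(text, newStartIndex, newMaxResults, sortOrder);
    }

    // Assumed helper: returns a copy with one more sort field appended.
    SketchQuery sortBy(String field) {
        List<String> next = new ArrayList<>(sortOrder);
        next.add(field);
        return new SketchQuery(text, startIndex, maxResults, Collections.unmodifiableList(next));
    }

    String getStringFormat() { return text; }
    int getStartIndex() { return startIndex; }
    int getMaxResults() { return maxResults; }
    List<String> getSortOrder() { return sortOrder; }

    public static void main(String[] args) {
        SketchQuery query = SketchQuery.field("name", "camera")
                .and(SketchQuery.field("locale", "en_US").or(SketchQuery.field("locale", "en_GB")))
                .page(20, 10)
                .sortBy("updateddate");
        System.out.println(query.getStringFormat());
        System.out.println(query.getStartIndex() + " " + query.getMaxResults() + " " + query.getSortOrder());
    }
}
```

Keeping the object immutable means a base condition can be reused safely when building several related queries, for example the same criteria requested with different page offsets.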

Configuring Query Expression

Ensure that search indexing is enabled:

In the SystemEvents table, verify that the SearchIndexEvent is enabled (the enabled field is set to 1). The event is configured to run constantly in the background (*:*:* */*/*); in practice, it runs about every 30 seconds.
Ensure that the asset listener is registered:

Assets are queued for processing by the search framework, using asset events. The asset events are registered in the AssetListener_reg table. Verify that the entry shown in the following table exists, and add it if it does not.

Table 69-3 AssetListener_reg Table

ID	Listener	Blocking
1153937286234	`com.openmarket.basic.event.SearchAssetIdEventListener`	Y

IndexSourceMetaDataConfig: stores configuration information for IndexSource, describing the structure and nature of the index itself. This table should have a row for Global by default. Any asset type enabled for the Asset Type index has an additional row in this table.

SearchEngineMetaDataConfig: stores the search engine configuration. This table should have a row for Lucene by default.

Both tables are configured correctly by the installer and managed through the Admin interface.

The defaults should suffice in most cases. If they do not, see Advanced Configuration.

Advanced Configuration

While in most cases you will find that the defaults are sufficient, in some use cases you may need to configure Lucene parameters and AnalyzerFactory.

Configuration of Lucene Parameters

Note:

Some parameters can cause significant changes in the way the index performs at run time. Refer to the Lucene documentation and rely on experimentation to determine the best settings for your site. It is highly advised that you keep the defaults unless you have compelling reasons to change them.

In the Lucene search engine, an index can be created with a certain set of parameters that determine how the index is created and how it performs. While the Lucene default parameters are reasonable, WebCenter Sites provides administrators with a way to change them.

The SearchEngineMetaDataConfig table contains one row per index. Each row has a field named properties whose contents are used to configure Lucene parameters. Parameter-value pairs are separated by a semicolon ( ';' ) as shown below:

param1=value1;param2=value2
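
As an illustration of this format, the following sketch parses a properties value into name-value pairs and writes it back out. The class LucenePropertiesSketch and its methods are hypothetical helpers, not part of WebCenter Sites. Trimming whitespace and ignoring empty segments (such as one left by a trailing semicolon) are assumptions of this sketch, not documented product behavior.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for the "param1=value1;param2=value2" format.
final class LucenePropertiesSketch {

    // Splits the string on ';' and then on the first '=' of each pair.
    static Map<String, String> parse(String properties) {
        Map<String, String> result = new LinkedHashMap<>(); // keeps the original order
        if (properties == null || properties.trim().isEmpty()) {
            return result;
        }
        for (String pair : properties.split(";")) {
            String trimmed = pair.trim();
            if (trimmed.isEmpty()) {
                continue; // assumption: tolerate a trailing or doubled semicolon
            }
            int equalsAt = trimmed.indexOf('=');
            if (equalsAt <= 0) {
                throw new IllegalArgumentException("Expected param=value but found: " + trimmed);
            }
            result.put(trimmed.substring(0, equalsAt).trim(), trimmed.substring(equalsAt + 1).trim());
        }
        return result;
    }

    // Joins the pairs back into a single semicolon-separated string.
    static String format(Map<String, String> parameters) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> entry : parameters.entrySet()) {
            if (out.length() > 0) {
                out.append(';');
            }
            out.append(entry.getKey()).append('=').append(entry.getValue());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> parsed = parse("mergeFactor=20;maxBufferedDocs=50");
        System.out.println(parsed);
        System.out.println(format(parsed));
    }
}
```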

This table describes the parameters supported by WebCenter Sites.

Table 69-4 Parameters Supported by WebCenter Sites

Parameter	Type	Description
`mergeFactor`	Integer	Determines how often segment indices are merged. With smaller values, less RAM is used while indexing, and searches on unoptimized indices are faster, but indexing speed is slower. With larger values, more RAM is used during indexing, and while searches on unoptimized indices are slower, indexing is faster. Thus larger values (> 10) are best for batch index creation, and smaller values (< 10) for indices that are interactively maintained. This must never be less than `2`. The default value is `10`.
`maxMergeDocs`	Integer	Determines the largest number of documents ever merged. Small values (for example, less than 10,000) are best for interactive indexing, as this limits the length of pauses while indexing to a few seconds. Larger values are best for batched indexing and speedier searches. Defaults to the maximum integer value (2^31-1).
`maxBufferedDocs`	Integer	Determines the minimum number of documents required before the buffered in-memory documents are merged and a new segment is created. Defaults to `10`.
`optimizeInterval`	Integer	Determines the time interval (in seconds) between optimize() calls. The default value is 30 seconds, which is the recommended value for most systems. To allow a large amount of data changes, set this parameter to any value within the range of 300 to 600 seconds.
`commitLockTimeout`	Long	Sets the maximum time to wait for a commit lock (in milliseconds). Defaults to `10000`.
`maxFieldLength`	Integer	Maximum number of terms that are indexed for a single field in a document. This limits the amount of memory required for indexing, so that collections with very large files will not crash the indexing process by running out of memory. Note that this effectively truncates large documents, excluding from the index terms that occur further in the document. To support large source documents, be sure to set this value high enough to accommodate the expected size. If you set it to the maximum integer value (2^31-1), then the only limit is memory, but you should anticipate an `OutOfMemoryError`. By default, no more than 10,000 terms will be indexed for a field.
`termIndexInterval`	Integer	Sets the interval between indexed terms. Large values cause less memory to be used by an IndexReader, but slow random access to terms. Small values cause more memory to be used by an IndexReader, and speed random access to terms. This parameter determines the amount of computation required per query term, regardless of the number of documents that contain that term. In particular, it is the maximum number of other terms that must be scanned before a term is located and its frequency and position information may be processed. In a large index with user-entered query terms, query processing time is likely to be dominated not by term lookup but rather by the processing of frequency and positional data. In a small index or when many uncommon query terms are generated (for example, by wildcard queries) term lookup may become a dominant cost. In particular, numUniqueTerms/interval terms are read into memory by an IndexReader, and, on average, interval/2 terms must be scanned for each random term access. Default value is `128`.
`useCompoundFile`	String (must be `yes` or `no`)	Setting to turn on usage of a compound file. When on, multiple files for each segment are merged into a single file once the segment creation is finished.
`writeLockTimeout`	Long	Sets the maximum time to wait for a write lock (in milliseconds). Default value is `1000`.
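
Before writing new values to the properties field, it can help to check them against the constraints listed in the table. The following is a minimal sketch of such a check, covering only what the table states: the Integer and Long types, the lower bound of 2 for mergeFactor, and the yes or no values for useCompoundFile. LuceneParamCheckSketch is a hypothetical helper, not a WebCenter Sites class. Ignoring unrecognized parameter names is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical validator for the parameters listed in the table above.
final class LuceneParamCheckSketch {
    private static final List<String> INT_PARAMS = Arrays.asList(
            "mergeFactor", "maxMergeDocs", "maxBufferedDocs",
            "optimizeInterval", "maxFieldLength", "termIndexInterval");
    private static final List<String> LONG_PARAMS = Arrays.asList(
            "commitLockTimeout", "writeLockTimeout");

    // Returns one message per violated constraint; an empty list means no problems found.
    static List<String> check(Map<String, String> parameters) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, String> entry : parameters.entrySet()) {
            String name = entry.getKey();
            String value = entry.getValue();
            if (INT_PARAMS.contains(name)) {
                try {
                    int number = Integer.parseInt(value);
                    if (name.equals("mergeFactor") && number < 2) {
                        problems.add("mergeFactor must never be less than 2");
                    }
                } catch (NumberFormatException e) {
                    problems.add(name + " must be an integer");
                }
            } else if (LONG_PARAMS.contains(name)) {
                try {
                    Long.parseLong(value);
                } catch (NumberFormatException e) {
                    problems.add(name + " must be a long");
                }
            } else if (name.equals("useCompoundFile")) {
                if (!value.equals("yes") && !value.equals("no")) {
                    problems.add("useCompoundFile must be yes or no");
                }
            }
            // Assumption: names not listed in the table are left unchecked.
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(check(java.util.Collections.singletonMap("mergeFactor", "1")));
        System.out.println(check(java.util.Collections.singletonMap("useCompoundFile", "yes")));
    }
}
```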

Configuration of Custom AnalyzerFactory

In Lucene, an Analyzer represents a policy for extracting index terms from text. Analyzers are used at the time of indexing and searching for various tasks such as removing stop words and removing white spaces.

Different Analyzers exist in the Lucene repository for handling various locales. Often Analyzers are used for injecting synonyms or addressing accented characters gracefully. You can also build your own Analyzer by using any of the Lucene standard analyzers as a basis.

The WebCenter Sites Lucene implementation uses StandardAnalyzer, a general-purpose analyzer for the English language. However, WebCenter Sites supports custom Analyzers through a plugin interface, AnalyzerFactory. The configured AnalyzerFactory is used to look up the analyzer when it is required during indexing or searching. The AnalyzerFactory looks up the analyzer in the following instances:

When building the index as a whole
When parsing a query
When indexing an individual row

To plug in a custom AnalyzerFactory, you have to implement and register the AnalyzerFactory interface. Registration is done by modifying a row in the SearchEngineMetaDataConfig table. Add the following to the properties field of the row whose Name field is set to Lucene.

AnalyzerFactory=<fully qualified class name of your custom AnalyzerFactory>