78 Using Public Site Search

WebCenter Sites includes a new framework for managing search indices. This framework forms the basis for searches in both the editorial interface and on the live site (the visitors' side), hence the name public site search.

78.1 About the Search Framework

The search framework consists of the Search API, special asset event listeners, and a polling system for queues. You use this framework in coordination with the Event Management and Queue Management frameworks. This topic focuses primarily on the search framework.

Note:

You can skip this topic if you are primarily interested in using the Search API.

The following figure shows how the search engine integration framework works with the rest of WebCenter Sites.

Figure 78-1 Search Engine Integration

  1. The asset framework detects changes and additions to assets and fires events.

  2. Registered listeners queue the changes, using a persistent queue implementation. A given event can be queued into one or many persistent queues. Each queue can be thought of as the source of data for a search index.

  3. Once asset events are queued, a background process empties the queue contents and routes them to the Search API.

  4. The Search API chooses the appropriate (configurable) search engine vendor implementation to start the indexing process.
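The four steps above can be sketched in a few lines of Java. This is a minimal illustration, not WebCenter Sites code: every type in it (AssetEvent, IndexQueue, Indexer) is a hypothetical stand-in, the queues are in memory rather than persistent, and the routing rule (the Global queue receives every event, a type queue receives only events for its own asset type) is a simplifying assumption.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Illustration only: every type here is a hypothetical stand-in,
// not a class from the WebCenter Sites API.
final class IndexingPipelineSketch {

    /** Step 1: an event fired when an asset is added or changed. */
    static final class AssetEvent {
        final String assetType;
        final long assetId;

        AssetEvent(String assetType, long assetId) {
            this.assetType = assetType;
            this.assetId = assetId;
        }
    }

    /** Step 2: one queue per index. In memory here; the real queues are persistent. */
    static final class IndexQueue {
        final String indexName;
        final Queue<AssetEvent> pending = new ArrayDeque<>();

        IndexQueue(String indexName) {
            this.indexName = indexName;
        }
    }

    /** Step 4: stand-in for the vendor-specific indexing call. */
    interface Indexer {
        void index(String indexName, List<AssetEvent> batch);
    }

    /** Step 2: a listener copies the event into every queue that should see it. */
    static void onAssetEvent(AssetEvent event, List<IndexQueue> queues) {
        for (IndexQueue queue : queues) {
            boolean global = queue.indexName.equals("Global"); // assumed routing rule
            if (global || queue.indexName.equals(event.assetType)) {
                queue.pending.add(event);
            }
        }
    }

    /** Step 3: a background poll empties one queue and hands the batch on. */
    static int drain(IndexQueue queue, Indexer indexer) {
        List<AssetEvent> batch = new ArrayList<>();
        while (!queue.pending.isEmpty()) {
            batch.add(queue.pending.poll());
        }
        if (!batch.isEmpty()) {
            indexer.index(queue.indexName, batch);
        }
        return batch.size();
    }

    public static void main(String[] args) {
        List<IndexQueue> queues = List.of(new IndexQueue("Global"), new IndexQueue("Content_C"));
        onAssetEvent(new AssetEvent("Content_C", 1L), queues);
        onAssetEvent(new AssetEvent("Product_P", 2L), queues);

        Indexer printer = (name, batch) ->
                System.out.println(name + " <- " + batch.size() + " event(s)");
        for (IndexQueue queue : queues) {
            drain(queue, printer);
        }
    }
}
```

The point of the sketch is the decoupling: the code that detects a change never calls the search engine directly, so a slow or unavailable index does not block editorial work.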

78.2 Index Types

Two types of indices are created in WebCenter Sites: Global index and AssetType index. Global index is the index of all data (all asset types enabled for Global index). To search for a phrase or expression in multiple asset types (such as attempting to build a Google-like search interface), Global index is more appropriate.

While Global index contains data for all fields of the index, it does not store the data in a form that is suitable for parametric searches. An AssetType index contains indexed information for a given asset type in a manner that can be searched parametrically. The Admin interface supports the configuration of Asset Type searches, which includes attribute-based searches for the indexing-enabled asset types. See Adding Asset Types to the Search Index in Administering Oracle WebCenter Sites.

78.2.1 Global Index

Global index is used by the Oracle WebCenter Sites: Contributor interface to build a global search UI. An instance of the index also exists on the delivery server. The index on the delivery server can be used to build public site searches.

Only assets that are published to the live site after search is configured are available for searches, because the data is indexed during publishing. Assets that already exist on the live site before search is configured are not reflected in the Global index until they are re-indexed on the live site.

A search index functions roughly like a database table.

The following table describes the fields in Global index.

Note:

The field names are case-sensitive.

Table 78-1 Fields in the Global Index

| Field name | Description |
|---|---|
| defaultSearchField | This contains all the data of the indexed asset. This is the field you would search for in full-text searching. It contains index data for all attributes of the asset, including any binary field data. Data for all the attributes is merged into this single field and indexed. The index itself does not 'store' data for this field, but does allow full-text searching. Note: Search strings must be entered in lowercase only, no capitals. |
| id | This contains the asset ID. |
| AssetType | This contains the asset type (Content_C/Product_P). |
| locale | This contains the locale string (for example, en_US). |
| name | This contains the name of the asset. |
| description | Description associated with the asset. |
| subtype | Name of the subtype (flex definition name). |
| subtypeid | ID of the subtype (flex def ID). |
| updateddate | Last updated date as found at the time of indexing. |
| siteid | IDs of all sites in which this asset is available. |
| startdate | Start date field in the asset table. |
| enddate | End date field in the asset table. |

78.2.2 Asset Type Index

An asset type index is created when you enable the asset type for indexing in the Admin interface: select the Admin tab, then Search, and then Configure Asset Type Search. Once an asset type is enabled, an index is created under /shared/lucene/<Asset type name>. This index contains all attributes of the given type as fields in the index.

The following table describes the fields in the Content_C index.

Note:

The field names are case-sensitive.

Table 78-2 Fields in the Asset Type Index

| Field name | Description |
|---|---|
| DefaultSearchField | This contains all the data of the indexed asset. This is the field you would search for in full-text searching. It contains index data for all attributes of the asset, including any binary field data. Data for all the attributes is merged into this single field and indexed. The index itself does not 'store' data for this field, but does allow full-text searching. Note: Search strings must be entered in lowercase only, no capitals. |
| id | This contains the asset ID. |
| AssetType | This contains the asset type (Content_C/Product_P). |
| locale | This contains the locale string (for example, en_US). |
| name | Name of the asset. |
| description | Description associated with the asset. |
| subtype | Name of the subtype (flex definition name). |
| subtypeid | ID of the subtype (flex def ID; flex only). |
| updateddate | Last updated date as found at the time of indexing. |
| siteid | IDs of all sites in which this asset is available. |
| startdate | Start date field in the asset table. |
| enddate | End date field in the asset table. |
| Dimension | ID of the dimension. |
| Dimension-parent | ID of the dimension parent. |
| createdby | Name of the user who created this asset. |
| createddate | Date the asset was created. |
| Publist | List of site names this asset belongs to. |
| Relationships | Asset IDs of related items (flex only). |
| externaldoctype | Not used. |
| filename | File name used for static publishing. |
| flextemplateid | ID of the flex definition (flex only). |
| fw_uid | Globally unique ID of this asset. |
| path | Path used for static publishing. |
| renderid | Object ID of the Template asset assigned to a flex asset. |
| ruleset | XML document of the ruleset. |
| status | Status associated with the asset. |
| template | Template name. |
| updatedby | Name of the user who last updated this asset. |
| urlexternaldoc | Not used. |
| urlexternaldocxml | Not used. |
| FSIIAbstract | Flex attribute. |
| FSIIBody | Flex attribute. |
| FSIIByline | Flex attribute. |
| FSIIDescriptionAttr | Flex attribute. |
| FSIIHeadline | Flex attribute. |
| FSIINameAttr | Flex attribute. |
| FSIIPostDate | Flex attribute. |
| FSIISubheadline | Flex attribute. |
| FSIITemplateAttr | Flex attribute. |

To visualize which fields are available in a given index, use the tool Luke, available at:

http://www.getopt.org/luke/

After you launch the tool, use the tool's browse function to load the index by simply locating the folder that contains the index (for example: ../shared/lucene/Content_C).

78.3 About Search API

When you use the Search API, you’ll work with the SearchEngine interface and the QueryExpression interface.

78.3.1 SearchEngine

The SearchEngine interface defines the two key functions of a search engine implementation: indexing and searching. More information about SearchEngine is available in the Java API Reference for Oracle WebCenter Sites.

  • The source of index data is given to SearchEngine. SearchEngine, in response to the indexing request, creates the search index (if it is not created), and updates the contents based on the IndexSource accessors.

  • index() works off of a given IndexSource instance. Depending on the search engine's implementation details, it operates on new, modified, and deleted data coming from the IndexSource (in most search engines, all that is modified must be deleted first and then re-indexed).

    index() also invokes index lifecycle methods (startIndexing() and endIndexing()) on the given instance of IndexSource.

  • search() operates on a QueryExpression against one or many indexes (IndexSources), resulting in a single set of results, sorted by their relevance (or by SortOrder, if specified and usable across indices).

  • A configuration lookup interface (IndexSourceConfig) is supplied to the SearchEngine instance, through which it can look up IndexSource properties, if needed.

  • A QueryConverter interface is supplied to SearchEngine. This interface converts a given QueryExpression to its native form (recognizable by the specific search engine). Because the converter is supplied from outside, the query language that the search engine uses can be controlled without changing the search engine implementation.

  • SearchResult is an abstraction over what is returned from the search engine. SearchResult is an iterator over ResultRow, a subclass of IndexRow that contains relevance information. The getRelevance() method returns a double; the higher the value, the more relevant this ResultRow is to the given query.

78.3.2 QueryExpression

The QueryExpression interface is a generic interface for defining search criteria. All search engines support native formats for building queries. The native form contains definitions of wildcards, relevance hints, and so on. These tend to be very specific to each search engine.

Search engines also provide a basic query construct, which can be programmatically built (AND & OR over field matches). These can be thought of in terms of generalized programmable interfaces, although limited in power.

QueryExpression encapsulates four distinct characteristics of search engine queries:

  • Native text search format: Most search engines support a very sophisticated native format for search, including wildcards, special hints, and so on. This is available through getStringFormat().

  • Conjunction and disjunction: ANDs and ORs of conditions using and() and or() methods.

  • Pagination: Using getStartIndex() and getMaxResults() methods.

  • Sorting: Using getSortOrder().
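To make these characteristics concrete, here is a minimal sketch of how conjunction, disjunction, a string form, and pagination can live in one object. QuerySketch is a hypothetical, simplified stand-in written for this example; it is not the WebCenter Sites QueryExpression interface, its match() and page() methods are invented for illustration, and it omits sorting. The string form it produces follows Lucene-style field:"value" syntax. For the real method signatures, see the Java API Reference for Oracle WebCenter Sites.

```java
// Illustration only: a simplified stand-in, NOT the WebCenter Sites
// QueryExpression interface. match() and page() are hypothetical helpers.
final class QuerySketch {
    private final String text;      // string form of the criteria
    private final int startIndex;   // pagination: index of the first row to return
    private final int maxResults;   // pagination: -1 means "no limit" in this sketch

    private QuerySketch(String text, int startIndex, int maxResults) {
        this.text = text;
        this.startIndex = startIndex;
        this.maxResults = maxResults;
    }

    /** A single field match, rendered as field:"value" with quotes escaped. */
    static QuerySketch match(String field, String value) {
        String escaped = value.replace("\\", "\\\\").replace("\"", "\\\"");
        return new QuerySketch(field + ":\"" + escaped + "\"", 0, -1);
    }

    /** Conjunction: both conditions must hold. */
    QuerySketch and(QuerySketch other) {
        return new QuerySketch("(" + text + " AND " + other.text + ")", startIndex, maxResults);
    }

    /** Disjunction: either condition may hold. */
    QuerySketch or(QuerySketch other) {
        return new QuerySketch("(" + text + " OR " + other.text + ")", startIndex, maxResults);
    }

    /** Returns a copy of this query restricted to one page of results. */
    QuerySketch page(int start, int max) {
        return new QuerySketch(text, start, max);
    }

    String getStringFormat() { return text; }
    int getStartIndex() { return startIndex; }
    int getMaxResults() { return maxResults; }

    public static void main(String[] args) {
        QuerySketch query = match("name", "Home")
                .and(match("locale", "en_US").or(match("locale", "fr_FR")))
                .page(0, 10);
        System.out.println(query.getStringFormat());
        System.out.println("rows " + query.getStartIndex() + " to "
                + (query.getStartIndex() + query.getMaxResults() - 1));
    }
}
```

Keeping pagination on the query object rather than on the result is what lets the same expression be handed to different search engines: each engine can apply the start index and maximum in whatever way its native API supports.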

78.3.3 Configuring Query Expression

  1. Ensure search indexing is enabled:

    In the SystemEvents table, verify that the SearchIndexEvent is enabled (enabled field = 1). This event is scheduled to run continuously in the background (*:*:* */*/*); in practice, it runs about every 30 seconds.

  2. Make sure Asset listener is registered:

    Assets are queued for processing by the search framework through asset events. The asset event listeners are registered in the AssetListener_reg table. Make sure the entry shown in the following table exists; add it if it does not.

Table 78-3 AssetListener_reg Table

| ID | Listener | Blocking |
|---|---|---|
| 1153937286234 | com.openmarket.basic.event.SearchAssetIdEventListener | Y |

IndexSourceMetaDataConfig: table that stores configuration information for IndexSource. This describes the structure and nature of the index itself. This should have a row for Global by default. Any asset type enabled for Asset type index has an additional row in this table.

SearchEngineMetaDataConfig: stores the search engine configuration. This table should have a row for Lucene, by default.

These are configured correctly by the installer and managed by the Admin interface.

The default values should suffice in most cases. To change them, see Advanced Configuration.

78.4 Advanced Configuration

While in most cases you will find that the defaults are sufficient, in some use cases you may need to configure Lucene parameters and AnalyzerFactory.

78.4.1 Configuration of Lucene Parameters

Note:

Some parameters can cause significant changes in the way the index performs at run time. Refer to the Lucene documentation and rely on experimentation to determine the best settings for your site. It is highly advised that you keep the defaults unless you have compelling reasons to change them.

In the Lucene search engine, an index can be created with a certain set of parameters that determine how the index is created and how it performs. While the Lucene default parameters are reasonable, WebCenter Sites provides administrators with a way to change them.

The SearchEngineMetaDataConfig table contains one row per index. Each row has a field named properties whose contents are used to configure Lucene parameters. Parameter-value pairs are separated by a semicolon ( ';' ) as shown below:

param1=value1;param2=value2
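If you edit the properties field by hand or from a script, it can help to parse and rebuild the string rather than concatenate text. The following is a minimal sketch under stated assumptions: EnginePropertiesSketch is a hypothetical helper, not part of WebCenter Sites, and trimming whitespace around names and values is a choice made here, not documented product behavior.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: a hypothetical helper for the "name=value;name=value"
// format described above. Not part of the WebCenter Sites API.
final class EnginePropertiesSketch {

    /** Splits the properties string into an ordered name-to-value map. */
    static Map<String, String> parse(String properties) {
        Map<String, String> result = new LinkedHashMap<>();
        if (properties == null) {
            return result;
        }
        for (String pair : properties.split(";")) {
            String trimmed = pair.trim(); // assumption: surrounding whitespace is not significant
            if (trimmed.isEmpty()) {
                continue; // tolerate a trailing or doubled semicolon
            }
            int equals = trimmed.indexOf('=');
            if (equals <= 0) {
                throw new IllegalArgumentException("Expected name=value but found: " + trimmed);
            }
            result.put(trimmed.substring(0, equals).trim(), trimmed.substring(equals + 1).trim());
        }
        return result;
    }

    /** Joins a map back into the semicolon-separated form. */
    static String format(Map<String, String> parameters) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> entry : parameters.entrySet()) {
            if (out.length() > 0) {
                out.append(';');
            }
            out.append(entry.getKey()).append('=').append(entry.getValue());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> parameters = parse("mergeFactor=20;maxBufferedDocs=100");
        parameters.put("useCompoundFile", "yes");
        System.out.println(format(parameters));
    }
}
```

A LinkedHashMap keeps the parameters in their original order, so a round trip through parse() and format() changes only what you changed.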

This table describes the parameters supported by WebCenter Sites.

Table 78-4 Parameters Supported by WebCenter Sites

| Parameter | Type | Description |
|---|---|---|
| mergeFactor | Integer | Determines how often segment indices are merged. With smaller values, less RAM is used while indexing, and searches on unoptimized indices are faster, but indexing speed is slower. With larger values, more RAM is used during indexing, and while searches on unoptimized indices are slower, indexing is faster. Thus larger values (> 10) are best for batch index creation, and smaller values (< 10) for indices that are interactively maintained. This must never be less than 2. The default value is 10. |
| maxMergeDocs | Integer | Determines the largest number of documents ever merged. Small values (for example, less than 10,000) are best for interactive indexing, as this limits the length of pauses while indexing to a few seconds. Larger values are best for batched indexing and speedier searches. Defaults to the maximum integer value (2^31 - 1). |
| maxBufferedDocs | Integer | Determines the minimal number of documents required before the buffered in-memory documents are merged and a new segment is created. Defaults to 10. |
| optimizeInterval | Integer | Determines the time interval (in seconds) between optimize() calls. The default value is 30 seconds, which is the recommended value for most systems. To accommodate a large volume of data changes, set this parameter to a value in the range of 300 to 600 seconds. |
| commitLockTimeout | Long | Sets the maximum time to wait for a commit lock (in milliseconds). Defaults to 10000. |
| maxFieldLength | Integer | Maximum number of terms that are indexed for a single field in a document. This limits the amount of memory required for indexing, so that collections with very large files will not crash the indexing process by running out of memory. Note that this effectively truncates large documents, excluding from the index terms that occur further in the document. To support large source documents, be sure to set this value high enough to accommodate the expected size. If you set it to the maximum Integer value (2^31 - 1), then the only limit is memory, but you should anticipate an OutOfMemoryError. By default, no more than 10,000 terms will be indexed for a field. |
| termIndexInterval | Integer | Sets the interval between indexed terms. Large values cause less memory to be used by an IndexReader, but slow random access to terms. Small values cause more memory to be used by an IndexReader, and speed random access to terms. This parameter determines the amount of computation required per query term, regardless of the number of documents that contain that term. In particular, it is the maximum number of other terms that must be scanned before a term is located and its frequency and position information may be processed. In a large index with user-entered query terms, query processing time is likely to be dominated not by term lookup but rather by the processing of frequency and positional data. In a small index or when many uncommon query terms are generated (for example, by wildcard queries), term lookup may become a dominant cost. In particular, numUniqueTerms/interval terms are read into memory by an IndexReader, and, on average, interval/2 terms must be scanned for each random term access. Default value is 128. |
| useCompoundFile | String (must be yes or no) | Setting to turn on usage of a compound file. When on, multiple files for each segment are merged into a single file once the segment creation is finished. |
| writeLockTimeout | Long | Sets the maximum time to wait for a write lock (in milliseconds). Default value is 1000. |
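The termIndexInterval trade-off described in the table is easy to estimate with the two expressions it gives: numUniqueTerms/interval terms held in memory and interval/2 terms scanned per lookup. The following sketch only does that arithmetic; TermIndexIntervalSketch and the figure of one million unique terms are hypothetical, chosen purely for illustration.

```java
// Illustration only: back-of-the-envelope arithmetic for termIndexInterval,
// using the two expressions from the parameter description.
final class TermIndexIntervalSketch {

    /** Approximate number of terms an IndexReader holds in memory. */
    static long termsHeldInMemory(long numUniqueTerms, int interval) {
        requirePositive(interval);
        return numUniqueTerms / interval;
    }

    /** Average number of terms scanned for one random term access. */
    static int averageTermsScanned(int interval) {
        requirePositive(interval);
        return interval / 2;
    }

    private static void requirePositive(int interval) {
        if (interval < 1) {
            throw new IllegalArgumentException("interval must be at least 1");
        }
    }

    public static void main(String[] args) {
        long numUniqueTerms = 1_000_000L; // hypothetical index size
        for (int interval : new int[] {32, 128, 512}) {
            System.out.println("interval=" + interval
                    + " inMemory=" + termsHeldInMemory(numUniqueTerms, interval)
                    + " avgScanned=" + averageTermsScanned(interval));
        }
    }
}
```

Each fourfold increase in the interval cuts the in-memory term count to a quarter and quadruples the average scan length, which is why the setting matters most for small indices or wildcard-heavy queries, as the description notes.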

78.4.2 Configuration of Custom AnalyzerFactory

In Lucene, an Analyzer represents a policy for extracting index terms from text. Analyzers are used at the time of indexing and searching for various tasks such as removing stop words and removing white spaces.

Different Analyzers exist in the Lucene repository for handling various locales. Often Analyzers are used for injecting synonyms or addressing accented characters gracefully. You can also build your own Analyzer by using any of the Lucene standard analyzers as a basis.

The WebCenter Sites Lucene implementation uses StandardAnalyzer, a general-purpose analyzer for the English language. However, WebCenter Sites supports custom Analyzers through a plugin interface, AnalyzerFactory. The configured AnalyzerFactory is used to look up the analyzer, when required, in the process of indexing or searching. The AnalyzerFactory looks up the analyzer in the following instances:

  • When building the index as a whole

  • When parsing a query

  • When indexing an individual row

To plug in a custom AnalyzerFactory, implement the AnalyzerFactory interface and register your implementation. Registration is done by modifying a row in the SearchEngineMetaDataConfig table. Add the following to the properties field of the row whose Name field is set to Lucene.

AnalyzerFactory=<fully qualified class name of your custom AnalyzerFactory>
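Because the properties field may already hold other semicolon-separated parameters, the AnalyzerFactory entry has to be added without disturbing them. The sketch below shows one way to compute the new field value. It is an assumption-laden illustration: AnalyzerFactoryPropertySketch is a hypothetical helper, com.example.search.MyAnalyzerFactory is a placeholder class name, and writing the value back to the table is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a hypothetical helper that computes the new value of the
// properties field. It does not touch the database.
final class AnalyzerFactoryPropertySketch {

    private static final String KEY = "AnalyzerFactory=";

    /** Keeps every existing pair except an old AnalyzerFactory entry, then appends the new one. */
    static String withAnalyzerFactory(String properties, String className) {
        List<String> pairs = new ArrayList<>();
        if (properties != null) {
            for (String pair : properties.split(";")) {
                String trimmed = pair.trim();
                if (!trimmed.isEmpty() && !trimmed.startsWith(KEY)) {
                    pairs.add(trimmed);
                }
            }
        }
        pairs.add(KEY + className);
        return String.join(";", pairs);
    }

    public static void main(String[] args) {
        // Placeholder class name; substitute the fully qualified name of your own factory.
        System.out.println(withAnalyzerFactory("mergeFactor=20", "com.example.search.MyAnalyzerFactory"));
    }
}
```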