69 Using Public Site Search
WebCenter Sites includes a new framework for managing search indices. This framework forms the basis for searches in both the editorial interface and on the live site (the visitors' side), hence the name public site search.
About the Search Framework
The search framework consists of the Search API, special asset event listeners, and a polling system for queues. You use this framework in coordination with the Event Management and Queue Management frameworks. This topic focuses primarily on the search framework.
Note:
You can skip this topic if you are primarily interested in the usage of the search API.
The following figure shows how the search engine integration framework works with the rest of WebCenter Sites.
- Asset framework detects changes/additions to assets and fires off events.
- Registered listeners queue the changes, using a persistent queue implementation. A given event can be queued into one or many persistent queues. Each queue can be thought of as the source of data for a search index.
- Once asset events are queued, a background process empties the queue contents and routes them to the Search API.
- The Search API chooses the appropriate (configurable) search engine vendor implementation to start the indexing process.
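The flow above can be sketched in a few lines of Java. This is a minimal illustration only: `AssetEvent`, `IndexQueue`, `onAssetEvent`, and `poll` are hypothetical names invented for this sketch, not WebCenter Sites classes, and the queue is held in memory, whereas the framework uses a persistent queue.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-ins: none of these names come from the WebCenter Sites API.
final class IndexingPipelineSketch {

    /** An asset change as a listener might record it. */
    static final class AssetEvent {
        final String assetType;
        final long assetId;
        final String action;

        AssetEvent(String assetType, long assetId, String action) {
            this.assetType = assetType;
            this.assetId = assetId;
            this.action = action;
        }
    }

    /** One queue per search index (in memory here; the real queue is persistent). */
    static final class IndexQueue {
        final String indexName;
        private final Queue<AssetEvent> pending = new ArrayDeque<>();

        IndexQueue(String indexName) {
            this.indexName = indexName;
        }

        void enqueue(AssetEvent event) {
            pending.add(event);
        }

        /** Returns everything queued so far and empties the queue. */
        List<AssetEvent> drain() {
            List<AssetEvent> batch = new ArrayList<>(pending);
            pending.clear();
            return batch;
        }
    }

    /** Listener step: a single event can be queued into one or many queues. */
    static void onAssetEvent(AssetEvent event, List<IndexQueue> queues) {
        for (IndexQueue queue : queues) {
            queue.enqueue(event);
        }
    }

    /** Poller step: empty each queue and describe what would be sent for indexing. */
    static List<String> poll(List<IndexQueue> queues) {
        List<String> routed = new ArrayList<>();
        for (IndexQueue queue : queues) {
            for (AssetEvent event : queue.drain()) {
                routed.add(queue.indexName + ":" + event.action + ":"
                        + event.assetType + ":" + event.assetId);
            }
        }
        return routed;
    }

    public static void main(String[] args) {
        List<IndexQueue> queues = List.of(new IndexQueue("Global"), new IndexQueue("Content_C"));
        onAssetEvent(new AssetEvent("Content_C", 42L, "update"), queues);
        poll(queues).forEach(System.out::println);
    }
}
```

In this sketch one asset update lands in two queues, which mirrors the statement that each queue acts as the data source for one search index.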
Index Types
Two types of indices are created in WebCenter Sites: the Global index and the AssetType index. The Global index is the index of all data (all asset types enabled for the Global index). To search for a phrase or expression in multiple asset types (such as when building a Google-like search interface), the Global index is more appropriate.
While the Global index contains data for all fields of the index, it does not store the data in a form that is suitable for parametric searches. An AssetType index contains indexed information for a given asset type in a manner that can be searched parametrically. The Admin interface supports the configuration of Asset Type searches, which includes attribute-based searches for the indexing-enabled asset types. See Adding Asset Types to the Search Index in Administering Oracle WebCenter Sites.
Global Index
The Global index is used by the Oracle WebCenter Sites: Contributor interface to build a global search UI. An instance of the index also exists on the delivery server. The index on the delivery server can be used to build public site searches.
Only assets that are published to the live site after search is configured are available for searches, because the data is indexed during publishing. Assets that already exist on the live site before search is configured are not reflected in the Global index until they are re-indexed on the live site.
A search index functions much like a database table.
The following table describes the fields in the Global index.
Note:
The field names are case-sensitive.
Table 69-1 Fields in the Global Index
Field name | Description |
---|---|
defaultSearchField | This contains all the data of the indexed asset. This is the field you would search for in full-text searching. This contains index data for all attributes of the asset, including any binary field data. Data for all the attributes is merged into this single field and indexed. The index itself does not 'store' data for this field, but does allow full-text searching. Note: Search strings must be entered in lowercase only, no capitals. |
id | This contains the asset ID. |
AssetType | This contains the asset type. |
locale | This contains the locale string. |
name | This contains the name of the asset. |
description | Description associated with the asset. |
subtype | Name of the subtype (flex definition name). |
subtypeid | ID of the subtype (flex def ID). |
updateddate | Last updated date as found at the time of indexing. |
siteid | IDs of all sites in which this asset is available. |
startdate | Start date field in the asset table. |
enddate | End date field in the asset table. |
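Because search strings against the full-text field must be lowercase, site code typically normalizes visitor input before building a query. The following is a minimal sketch of that step; `SearchTermNormalizer` and `normalize` are hypothetical names, and the trimming of surrounding whitespace is an added assumption rather than a documented requirement.

```java
import java.util.Locale;

// Hypothetical helper: not part of the WebCenter Sites API.
final class SearchTermNormalizer {

    /** Trims the input and lowercases it in a locale-independent way. */
    static String normalize(String userInput) {
        if (userInput == null) {
            return "";
        }
        // Locale.ROOT avoids locale-specific case rules (for example, the Turkish dotless i).
        return userInput.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalize("  WebCenter Sites "));
    }
}
```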
Asset Type Index
An asset type index is created when it is enabled from the Admin interface by selecting the Admin tab, then Search, and then Configure Asset Type Search. Once an asset type is enabled, an index is created under /shared/lucene/<Asset type name>. This index contains all attributes of the given type as fields in the index.
The following table describes the fields in the Content_C index.
Note:
The field names are case-sensitive.
Table 69-2 Fields in the Asset Type Index
Field Name | Description |
---|---|
DefaultSearchField | This contains all the data of the indexed asset. This is the field you would search for in full text searching. This contains index data for all attributes of the asset, including any binary field data. Data for all the attributes is merged into this single field and indexed. The index itself does not 'store' data for this field, but does allow full text searching. Note: Search strings must be entered in lowercase only, no capitals. |
id | This contains the asset ID |
AssetType | This contains the asset type (Content_C/Product_P) |
locale | Contains the locale string (example en_US) |
name | Name of the asset |
description | Description associated with the asset |
subtype | Name of the subtype (flex definition name) |
subtypeid | ID of the subtype (flex def ID) |
updateddate | Last updated date as found at the time of indexing. |
siteid | All site IDs this asset is available in |
startdate | Startdate field in the asset table |
enddate | Enddate field in the asset table |
Dimension | ID of the dimension |
Dimension-parent | ID of the Dimension parent |
createdby | User name that created this asset |
createddate | Date the asset was created |
Publist | List of site names this asset belongs to |
Relationships | Asset IDs of related items (flex only) |
externaldoctype | Not used |
filename | File name used for static publishing |
flextemplateid | ID of the flex definition (flex only) |
fw_uid | Globally unique ID of this asset |
path | Path used for static publishing |
renderid | Object ID of the Template asset assigned to a flex asset. |
ruleset | XML document of the ruleset |
status | Status associated with the asset |
subtype | Subtype name |
subtypeid | Subtype ID (flex only) |
template | Template name |
updatedby | User name that last updated this asset |
updateddate | Date of last update |
urlexternaldoc | Not used |
urlexternaldocxml | Not used |
FSIIAbstract | Flex attribute |
FSIIBody | Flex attribute |
FSIIByline | Flex attribute |
FSIIDescriptionAttr | Flex attribute |
FSIIHeadline | Flex attribute |
FSIINameAttr | Flex attribute |
FSIIPostDate | Flex attribute |
FSIISubheadline | Flex attribute |
FSIITemplateAttr | Flex attribute |
To visualize which fields are available in a given index, use the tool Luke, available at:
After you launch the tool, use its browse function to load the index by locating the folder that contains the index (for example: ../shared/lucene/Content_C).
About Search API
When you use the Search API, you’ll work with the SearchEngine interface and the QueryExpression interface.
SearchEngine
The SearchEngine interface defines the key functions of a search engine implementation: indexing and searching. More information about SearchEngine is available in the Java API Reference for Oracle WebCenter Sites.
- The source of index data is given to SearchEngine. SearchEngine, in response to the indexing request, creates the search index (if it is not created), and updates the contents based on the IndexSource accessors.
- index() works off of a given IndexSource instance. Depending on the search engine's implementation details, it operates on new, modified, and deleted data coming from the IndexSource (in most search engines, all that is modified must be deleted first and then re-indexed). index() also invokes index lifecycle methods (startIndexing() and endIndexing()) on the given instance of IndexSource.
- search() operates on a QueryExpression against one or many indexes (IndexSources), resulting in a single set of results, sorted by their relevance (or SortOrder, if specified and usable across indices).
- A configuration lookup interface (IndexSourceConfig) is supplied to the SearchEngine instance, from which it can look up IndexSource properties, if needed.
- A QueryConverter interface is supplied to SearchEngine. This interface converts a given QueryExpression to its native form (recognizable by the specific search engine). This makes it possible to control the query language that the search engine uses externally.
- SearchResult is an abstraction over what is returned from the search engine. SearchResult is an iterator over ResultRow, a subclass of IndexRow, that contains relevance information. The getRelavence() method returns a double; the higher the value, the higher the relevance of this ResultRow for the given query.
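To make the "single set of results, sorted by relevance" behavior concrete, the following sketch merges rows from several indexes and orders them by a relevance score. It is a simplified model under stated assumptions: `SketchRow` and `merge` are hypothetical stand-ins and do not reproduce the real SearchResult or ResultRow types, whose exact signatures are in the Java API Reference.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of relevance-ordered results; not the WebCenter Sites API.
final class RelevanceMergeSketch {

    /** A simplified result row: which index it came from, the asset, and its score. */
    static final class SketchRow {
        final String index;
        final String assetId;
        final double relevance;

        SketchRow(String index, String assetId, double relevance) {
            this.index = index;
            this.assetId = assetId;
            this.relevance = relevance;
        }
    }

    /** Combines per-index results into one list, highest relevance first. */
    static List<SketchRow> merge(List<List<SketchRow>> perIndexResults) {
        List<SketchRow> all = new ArrayList<>();
        for (List<SketchRow> rows : perIndexResults) {
            all.addAll(rows);
        }
        all.sort(Comparator.comparingDouble((SketchRow row) -> row.relevance).reversed());
        return all;
    }

    public static void main(String[] args) {
        List<SketchRow> global = List.of(new SketchRow("Global", "101", 0.42));
        List<SketchRow> content = List.of(
                new SketchRow("Content_C", "202", 0.91),
                new SketchRow("Content_C", "303", 0.17));
        for (SketchRow row : merge(List.of(global, content))) {
            System.out.println(row.assetId + " from " + row.index + " scored " + row.relevance);
        }
    }
}
```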
QueryExpression
The QueryExpression interface is a generic interface for defining search criteria. All search engines support native formats for building queries. The native form contains definitions of wildcards, relevance hints, and so on. These tend to be very specific to each search engine.
Search engines also provide a basic query construct, which can be programmatically built (AND and OR over field matches). These can be thought of in terms of generalized programmable interfaces, although limited in power.
QueryExpression encapsulates four distinct characteristics of search engine queries:
- Native text search format: Most search engines support a very sophisticated native format for search, including wild cards, special hints, and so on. This is available through getStringFormat().
- Conjunction and disjunction: ANDs and ORs of conditions using and() and or() methods.
- Pagination: Using getStartIndex() and getMaxResults() methods.
- Sorting: Using getSortOrder().
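The four characteristics can be illustrated with a small, self-contained model. This is a sketch only: `QuerySketch` is a hypothetical class, its `field:value` text and parenthesized AND/OR output are an illustrative rendering rather than the format the real QueryExpression produces, and the sort order is reduced to a plain field name.

```java
// Hypothetical model of a composable query; not the real QueryExpression interface.
final class QuerySketch {
    private final String text;
    private final int startIndex;
    private final int maxResults;
    private final String sortField;

    private QuerySketch(String text, int startIndex, int maxResults, String sortField) {
        this.text = text;
        this.startIndex = startIndex;
        this.maxResults = maxResults;
        this.sortField = sortField;
    }

    /** A single field match, with illustrative paging defaults (start 0, 10 rows). */
    static QuerySketch field(String name, String value) {
        return new QuerySketch(name + ":" + value, 0, 10, null);
    }

    /** Conjunction: both conditions must match. */
    QuerySketch and(QuerySketch other) {
        return new QuerySketch("(" + text + " AND " + other.text + ")", startIndex, maxResults, sortField);
    }

    /** Disjunction: either condition may match. */
    QuerySketch or(QuerySketch other) {
        return new QuerySketch("(" + text + " OR " + other.text + ")", startIndex, maxResults, sortField);
    }

    /** Pagination: where to start and how many rows to return. */
    QuerySketch page(int start, int max) {
        return new QuerySketch(text, start, max, sortField);
    }

    /** Sorting: the field to order results by. */
    QuerySketch sortBy(String field) {
        return new QuerySketch(text, startIndex, maxResults, field);
    }

    String getStringFormat() { return text; }
    int getStartIndex() { return startIndex; }
    int getMaxResults() { return maxResults; }
    String getSortOrder() { return sortField; }

    public static void main(String[] args) {
        QuerySketch query = field("name", "news")
                .and(field("locale", "en_us").or(field("locale", "fr_fr")))
                .page(20, 10)
                .sortBy("updateddate");
        System.out.println(query.getStringFormat());
    }
}
```

Each method returns a new object, so partial queries can be reused when composing larger ones.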
Configuring Query Expression
Table 69-3 AssetListener_reg Table
ID | Listener | Blocking |
---|---|---|
1153937286234 | | Y |
IndexSourceMetaDataConfig: stores configuration information for IndexSource. This describes the structure and nature of the index itself. This table should have a row for Global by default. Any asset type enabled for the Asset type index has an additional row in this table.
SearchEngineMetaDataConfig: stores the search engine configuration. This table should have a row for Lucene, by default.
These tables are configured correctly by the installer and managed by the Admin interface. The defaults should suffice. See Advanced Configuration.
Advanced Configuration
While in most cases you will find that the defaults are sufficient, in some use cases you may need to configure Lucene parameters and AnalyzerFactory.
Configuration of Lucene Parameters
Note:
Some parameters can cause significant changes in the way the index performs at run time. Refer to the Lucene documentation and rely on experimentation to determine the best settings for your site. It is highly advised that you keep the defaults unless you have compelling reasons to change them.
In the Lucene search engine, an index can be created with a certain set of parameters that determine how the index is created and how it performs. While the Lucene default parameters are reasonable, WebCenter Sites provides administrators with a way to change them.
The SearchEngineMetaDataConfig table contains one row per index. Each row has a field named properties whose contents are used to configure Lucene parameters. Parameter-value pairs are separated by a semicolon (';') as shown below:
param1=value1;param2=value2
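As an illustration of the format, the sketch below splits such a properties string into name-value pairs. `IndexPropertiesParser` is a hypothetical helper written for this example; how WebCenter Sites itself parses the field (for example, its handling of whitespace or malformed entries) is not described here, so the trimming and skipping rules are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: shows the param=value;param=value format, not the product's parser.
final class IndexPropertiesParser {

    /** Splits "param1=value1;param2=value2" into an ordered map. */
    static Map<String, String> parse(String properties) {
        Map<String, String> parsed = new LinkedHashMap<>();
        if (properties == null) {
            return parsed;
        }
        for (String pair : properties.split(";")) {
            String trimmed = pair.trim();
            int equals = trimmed.indexOf('=');
            if (equals <= 0) {
                continue; // assumption: ignore empty entries and entries without a name
            }
            parsed.put(trimmed.substring(0, equals).trim(), trimmed.substring(equals + 1).trim());
        }
        return parsed;
    }

    public static void main(String[] args) {
        System.out.println(parse("param1=value1;param2=value2"));
    }
}
```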
This table describes the parameters supported by WebCenter Sites.
Table 69-4 Parameters Supported by WebCenter Sites
Parameter | Type | Description |
---|---|---|
 | Integer | Determines how often segment indices are merged. With smaller values, less RAM is used while indexing, and searches on unoptimized indices are faster, but indexing speed is slower. With larger values, more RAM is used during indexing, and while searches on unoptimized indices are slower, indexing is faster. Thus larger values (> 10) are best for batch index creation, and smaller values (< 10) for indices that are interactively maintained. This must never be less than |
 | Integer | Determines the largest number of documents ever merged. Small values (for example, less than 10,000) are best for interactive indexing, as this limits the length of pauses while indexing to a few seconds. Larger values are best for batched indexing and speedier searches. Defaults to max integer value (2^31-1). |
 | Integer | Determines the minimal number of documents required before the buffered in-memory documents are merged and a new Segment is created. Defaults to |
 | Integer | Determines the time interval (in seconds) between optimize() calls. The default value is 30 seconds, which is the recommended value for most systems. To allow a large amount of data changes, set this parameter to any value within the range of 300 to 600 seconds. |
 | Long | Sets the maximum time to wait for a commit lock (in milliseconds). Defaults to |
 | Integer | Maximum number of terms that are indexed for a single field in a document. This limits the amount of memory required for indexing, so that collections with very large files will not crash the indexing process by running out of memory. Note that this effectively truncates large documents, excluding from the index terms that occur further in the document. To support large source documents, be sure to set this value high enough to accommodate the expected size. If you set it to max value of Integer (2^31-1), then the only limit is memory, but you should anticipate an By default, no more than 10,000 terms will be indexed for a field. |
 | Integer | Sets the interval between indexed terms. Large values cause less memory to be used by IndexReader, but slow random-access to terms. Small values cause more memory to be used by an IndexReader, and speed random-access to terms. This parameter determines the amount of computation required per query term, regardless of the number of documents that contain that term. In particular, it is the maximum number of other terms that must be scanned before a term is located and its frequency and position information may be processed. In a large index with user-entered query terms, query processing time is likely to be dominated not by term lookup but rather by the processing of frequency and positional data. In a small index or when many uncommon query terms are generated (for example, by wildcard queries) term lookup may become a dominant cost. In particular, numUniqueTerms/interval terms are read into memory by an IndexReader, and, on average, interval/2 terms must be scanned for each random term access. Default value is |
 | String (must be | Setting to turn on usage of a compound file. When on, multiple files for each segment are merged into a single file once the segment creation is finished. |
 | Long | Sets the maximum time to wait for a write lock. Default value is |
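The memory and lookup trade-off described for the term interval parameter comes down to two small formulas: numUniqueTerms/interval terms are held in memory, and about interval/2 terms are scanned per random term access. The sketch below works through them with illustrative numbers; the class name, the figure of 1,000,000 unique terms, and the interval values 128 and 32 are assumptions chosen for the example, not documented defaults.

```java
// Worked arithmetic for the term-interval trade-off; all inputs are illustrative.
final class TermIntervalEstimate {

    /** Terms an IndexReader reads into memory: numUniqueTerms / interval. */
    static long termsHeldInMemory(long numUniqueTerms, int interval) {
        return numUniqueTerms / interval;
    }

    /** Average terms scanned per random term access: interval / 2. */
    static double averageTermsScanned(int interval) {
        return interval / 2.0;
    }

    public static void main(String[] args) {
        long uniqueTerms = 1_000_000L; // assumed index size
        for (int interval : new int[] {32, 128}) {
            System.out.println("interval=" + interval
                    + " inMemory=" + termsHeldInMemory(uniqueTerms, interval)
                    + " avgScanned=" + averageTermsScanned(interval));
        }
    }
}
```

With these inputs, quadrupling the interval from 32 to 128 cuts the in-memory terms from 31,250 to 7,812 while raising the average scan from 16 to 64 terms.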
Configuration of Custom AnalyzerFactory
In Lucene, an Analyzer represents a policy for extracting index terms from text. Analyzers are used at the time of indexing and searching for various tasks such as removing stop words and removing white spaces.
Different Analyzers exist in the Lucene repository for handling various locales. Often Analyzers are used for injecting synonyms or addressing accented characters gracefully. You can also build your own Analyzer by using any of the Lucene standard analyzers as a basis.
The WebCenter Sites Lucene implementation uses StandardAnalyzer, a general purpose analyzer for the English language. However, WebCenter Sites supports custom Analyzers through a plugin interface, AnalyzerFactory. The configured AnalyzerFactory is used to look up the analyzer, when required, in the process of indexing or searching. The AnalyzerFactory looks up the analyzer in the following instances:
- When building the index as a whole
- When parsing a query
- When indexing an individual row
To plug in a custom AnalyzerFactory, you have to implement and register the AnalyzerFactory interface. Registration is done by modifying a row in the SearchEngineMetaDataConfig table. Add the following to the properties field of the row whose Name field is set to Lucene:
AnalyzerFactory=<fully qualified class name of your custom AnalyzerFactory>
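Because the properties field may already hold other parameters, the new entry has to be merged into the existing semicolon-separated string rather than overwrite it. The following sketch shows one way to compute the new field value before updating the row. `AnalyzerFactoryRegistration`, `withProperty`, and the class name `com.example.search.MyAnalyzerFactory` are hypothetical, and the sketch does not touch the database.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: computes the new value of the properties field only.
final class AnalyzerFactoryRegistration {

    /** Returns the properties string with key set to value, keeping other pairs. */
    static String withProperty(String properties, String key, String value) {
        Map<String, String> pairs = new LinkedHashMap<>();
        if (properties != null) {
            for (String pair : properties.split(";")) {
                int equals = pair.indexOf('=');
                if (equals > 0) {
                    pairs.put(pair.substring(0, equals).trim(), pair.substring(equals + 1).trim());
                }
            }
        }
        pairs.put(key, value); // replaces an existing entry or appends a new one
        StringJoiner joined = new StringJoiner(";");
        pairs.forEach((name, val) -> joined.add(name + "=" + val));
        return joined.toString();
    }

    public static void main(String[] args) {
        // com.example.search.MyAnalyzerFactory is a placeholder class name.
        System.out.println(withProperty("param1=value1",
                "AnalyzerFactory", "com.example.search.MyAnalyzerFactory"));
    }
}
```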