Configuring Clustering in Search Results

Real-time clustering dynamically organizes search results into groups to provide end users with different views on the top results. The documents within one group, called a cluster node, share common topics or property values. A cluster node with a large document set can be divided into child cluster nodes to build a hierarchy. Users can navigate directly to a specific cluster node. Effective real-time clustering balances clustering quality and clustering time.

Note:

The 10.1.8.2 query application is certified with Internet Explorer versions 6 and 7 and Firefox versions 1.5 and 2.x. Existing 10.1.8.1 functionality is certified on all Oracle SES-supported browsers through the classic user interface: http://host:port/search/query/search-classic.jsp

Search attributes (String, Number or Date) are used to generate a cluster tree. The attributes can be local search attributes, federated attributes that are not explicitly mapped, and Oracle SES internal attributes.

Oracle SES supports two types of cluster trees: topic and metadata. Each tree can be enabled or disabled individually. Parameters that apply to all cluster trees for the default query application can be configured on the Global Settings - Clustering Configuration page. These include the following:

  • Enable clustering: Select this option to enable clustering.

  • Maximum cluster tree depth: The maximum level of the cluster node hierarchy.

  • Maximum number of children per node: The maximum number of cluster nodes on each level. This does not apply to the miscellaneous node.

  • Minimum number of documents per node: The minimum number of the documents within one node. This does not apply to the miscellaneous node.

    Within each level of a cluster tree, documents that are not categorized into a node are placed in a special node named miscellaneous. The Minimum number of documents per node and Maximum number of children per node parameters do not apply to the miscellaneous node.
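
The interaction of these limits at a single tree level can be sketched as follows. This is only an illustration under stated assumptions, not the Oracle SES algorithm: the function name limit_children, the preference for the largest groups, and the sample data are all hypothetical.

```python
from typing import Dict, List

MISC = "miscellaneous"

def limit_children(candidates: Dict[str, List[int]],
                   max_children: int,
                   min_docs: int) -> Dict[str, List[int]]:
    """Keep the largest groups that satisfy both limits; move the rest to miscellaneous."""
    # Assumption: larger groups are preferred; ties are broken by name.
    ranked = sorted(candidates.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    kept: Dict[str, List[int]] = {}
    misc: List[int] = []
    for name, docs in ranked:
        if len(docs) >= min_docs and len(kept) < max_children:
            kept[name] = list(docs)
        else:
            misc.extend(docs)
    if misc:
        # The miscellaneous node is exempt from both limits.
        kept[MISC] = sorted(misc)
    return kept

level = limit_children(
    {"java": [1, 2, 3, 4], "jdbc": [5, 6, 7], "linux": [8, 9], "php": [10]},
    max_children=2,
    min_docs=2,
)
print(level)
```

In this sketch "linux" has enough documents but exceeds the child limit, and "php" falls below the document minimum, so both end up in the miscellaneous node.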

For customized Oracle SES applications, configure clustering with the Query Web Services API.

Topic Clustering

Topic clustering uses the most significant phrases (and optionally sentences) from documents to create relevant cluster nodes and hierarchies. The significant phrases are extracted both at query time and at crawl time. Crawl-time extraction is performed by the Secure Enterprise Search Document Summarizer, a document service included by default for search result clustering.

Configure crawl-time extraction of top phrases with document services parameters on the Global Settings - Document Services page. Create a topic clustering tree on the Global Settings - Clustering Configuration - Create Topic Clustering Tree page.

Topic clustering can be configured with one or more search attributes of String type and with the following Oracle SES internal attributes:

  • eqsnippet: The excerpt of the document with keywords in context.

  • eqtopphrases: The most frequent phrases within one document among the phrases with the same number of words.

  • eqtopsentences: The significant sentences within one document based on the significant phrases.

By default, the attributes keywords, title, eqsnippet and eqtopphrases are configured for topic clustering. Keywords, eqtopphrases, and eqtopsentences contain pre-extracted words and phrases: no additional phrase extraction is performed on these attributes.

Parameters that control query-time word and phrase extraction for the default query application can be configured on the Global Settings - Clustering Configuration page. These include the following:

Single Word Extraction 

  • Minimum occurrence: The minimum frequency for the word to be extracted.

  • Maximum number of words to extract: The maximum number of words to be extracted.

Phrase Extraction 

  • Minimum occurrence: Minimum frequency for a phrase to be extracted.

  • Maximum number of phrases to extract: Maximum number of phrases to be extracted.

  • Maximum phrase length: Maximum number of words for each phrase to be extracted.
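
To make the three phrase extraction parameters concrete, here is a minimal sketch of frequency-based phrase extraction. It assumes that a phrase is a run of two or more adjacent words and that ranking is by raw count; the actual Oracle SES extraction logic is not described here, and extract_phrases is a hypothetical name.

```python
from collections import Counter
from typing import List, Tuple

def extract_phrases(texts: List[str],
                    min_occurrence: int,
                    max_phrases: int,
                    max_phrase_length: int) -> List[Tuple[str, int]]:
    """Count word n-grams (2..max_phrase_length words) and keep the most frequent ones."""
    counts: Counter = Counter()
    for text in texts:
        words = text.lower().split()
        for n in range(2, max_phrase_length + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    # Apply the minimum occurrence, then cap the number of phrases.
    frequent = [(p, c) for p, c in counts.items() if c >= min_occurrence]
    frequent.sort(key=lambda pc: (-pc[1], pc[0]))
    return frequent[:max_phrases]

snippets = [
    "oracle data warehousing guide",
    "data warehousing tutorial",
    "stored procedure tutorial",
    "java stored procedure",
]
phrases = extract_phrases(snippets, min_occurrence=2, max_phrases=5, max_phrase_length=2)
print(phrases)
```

Raising the minimum occurrence or lowering the maximum number of phrases reduces the number of candidate topics, which trades clustering quality for clustering time.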

Topic clustering uses a phrase stopword list and a blacklist to prevent words or phrases from becoming topic cluster result nodes.

The phrase stopword list is also used by the Document Summarizer document service. The stopword file is a language-specific file containing words that should not be considered during phrase extraction. The blacklist file is a language-specific file containing words and phrases that should not appear as cluster node names.

For example, if all indexed documents include the phrase "Oracle Corporation" and it does not make sense to have a cluster node for "oracle corporation", then this phrase could be added to the blacklist.

Note:

A separate stopword list contains index stop words. This is an Oracle SES internal file for words that should not be indexed. This list is not related to phrase extraction.

Both the stopword and blacklist files are in plain text format, with each line containing one word or phrase. The phrase stopwords file name should be "phrasestopwords" followed by a period and the two-letter language code (for example, phrasestopwords.en for English). Similarly, the blacklist file name should be "blacklist" followed by a period and the two-letter language code.

By default, these files are located in the directory

ORACLE_HOME/search/lib/plugins/doc/extractor/phrasestopwords

Sample phrase stopword files for other languages are in

ORACLE_HOME/search/lib/plugins/doc/extractor/samples/phrasestopwords

If there are documents for these languages, then copy these files to

ORACLE_HOME/search/lib/plugins/doc/extractor/phrasestopwords

The order of the words or phrases in the file does not affect phrase extraction. For example, phrasestopwords.en may contain the following:

a
an
me
:
z

The blacklist.en file may contain the following:

site maps
oracle corporation
:
term of use
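
The file naming and one-entry-per-line layout described above can be read with a few lines of code. The sketch below is illustrative only: the helper names are hypothetical, and the UTF-8 encoding and case-insensitive comparison are assumptions rather than documented Oracle SES behavior.

```python
import tempfile
from pathlib import Path
from typing import List, Set

def load_word_list(directory: Path, prefix: str, language: str) -> Set[str]:
    """Read <prefix>.<language>, one word or phrase per line, into a set."""
    path = directory / f"{prefix}.{language}"
    # Assumption: files are UTF-8 and matching is case-insensitive.
    with path.open(encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}

def allowed_node_names(candidates: List[str], blacklist: Set[str]) -> List[str]:
    """Drop candidate cluster node names that appear in the blacklist."""
    return [name for name in candidates if name.lower() not in blacklist]

# A temporary directory stands in for the real phrasestopwords directory.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "blacklist.en").write_text(
        "site maps\noracle corporation\n", encoding="utf-8")
    blacklist = load_word_list(folder, "blacklist", "en")

names = allowed_node_names(["Oracle Corporation", "data warehousing"], blacklist)
print(names)
```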

Notes:

  • The stopword and blacklist files are applicable to both the default query application and the Web services API. The other parameters are applicable to the default query application only.

  • During backup and recovery operations, if you recover an instance in a new location, then the stopword directory must be updated to reflect the new location, because it is an absolute path.

Topic clustering currently works best in English. Both the document summarizer in the crawler and the clustering module in the query application use a stemmer to stem words and to merge words and phrases that share the same stems. The open source stemmer library Snowball is used for this purpose. The version included with Oracle SES supports the following languages:

  • Dutch

  • English

  • Finnish

  • French

  • German

  • Norwegian

  • Portuguese

  • Russian

  • Spanish

  • Swedish

The Egothor stemmer is included for Polish language support. The stemmer configuration is shared between the default query application and the Web Services API.

Note:

Topic clustering is not supported for Chinese and Japanese.

Metadata Clustering

Metadata clustering is performed on a single attribute of String, Date, or Number type. If there are multiple values for the same attribute in one document, then only the first value is used for clustering. By default, the entire value is passed in as is for clustering.

However, for String attributes only, a delimiter can be specified for tokenizing the attribute value. If no tokenization delimiter is entered (or if only white space is entered), then the delimiter defaults to white space. When tokenized, the single attribute value is divided into multiple segments and each segment can correspond to a hierarchy based on another delimiter called the hierarchy delimiter. White space is the default hierarchy delimiter; however, if both tokenization and hierarchy are selected, then the delimiters must be different. Parsing is done first by tokenization, and then by interpreting the hierarchy from the resulting tokens.

Create a metadata clustering tree on the Global Settings - Clustering Configuration - Create Metadata Clustering Tree page.

As an example where both tokenization and hierarchy are meaningful, a category attribute might consist of a comma-delimited list of fields, each representing a slash-separated hierarchical categorization (as in "java/j2ee/jdbc, oracle/search/connector").
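
The two-step parsing of this example value (tokenization first, then hierarchy) can be sketched as follows. The function names are hypothetical, and the nested dictionary is just one convenient way to represent the resulting tree.

```python
from typing import Dict, List

def parse_category(value: str,
                   token_delim: str = ",",
                   hierarchy_delim: str = "/") -> List[List[str]]:
    """Split a value into tokens, then split each token into a hierarchy path."""
    if token_delim == hierarchy_delim:
        raise ValueError("tokenization and hierarchy delimiters must differ")
    paths: List[List[str]] = []
    for token in value.split(token_delim):
        token = token.strip()
        if token:
            paths.append([p.strip() for p in token.split(hierarchy_delim) if p.strip()])
    return paths

def add_paths(tree: Dict[str, dict], paths: List[List[str]]) -> None:
    """Insert each path into a nested dictionary, one level per path element."""
    for path in paths:
        node = tree
        for part in path:
            node = node.setdefault(part, {})

category_paths = parse_category("java/j2ee/jdbc, oracle/search/connector")
category_tree: Dict[str, dict] = {}
add_paths(category_tree, category_paths)
print(category_paths)
print(category_tree)
```

The comma yields two tokens, and the slash turns each token into a three-level path, so the document would appear under both java > j2ee > jdbc and oracle > search > connector.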

The tokenization and hierarchy configuration is not applicable to Date or Number attributes. Metadata trees of Date type attributes use a fixed display format with year on the first level, month on the second, and day on the third. The year is sorted in descending order, and the month and day are sorted in ascending order.
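
The fixed year, month, day layout and its sort order can be illustrated with a short sketch. The function name and the use of document counts as leaf values are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List

def date_tree(dates: List[date]) -> Dict[int, Dict[int, Dict[int, int]]]:
    """Build year -> month -> day -> document count, in display order."""
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for d in dates:
        counts[d.year][d.month][d.day] += 1
    tree: Dict[int, Dict[int, Dict[int, int]]] = {}
    for year in sorted(counts, reverse=True):      # years descending
        tree[year] = {}
        for month in sorted(counts[year]):         # months ascending
            days = counts[year][month]
            tree[year][month] = {day: days[day] for day in sorted(days)}  # days ascending
    return tree

dates_tree = date_tree([
    date(2007, 3, 9),
    date(2008, 1, 15),
    date(2008, 1, 2),
    date(2007, 11, 30),
])
print(dates_tree)
```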

Metadata trees for Number type attributes are range-based, with a fixed maximum of five ranges at each level and a fixed tree depth of three. The tree depth is counted from the root node. For a range to be shown, it must satisfy the Minimum Documents Per Node parameter, which is set on the Query-time Clustering Configuration page. Empty ranges are not shown.
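
As a rough illustration of range-based clustering, the sketch below splits the values at one level into five ranges and hides ranges that are empty or too small. How Oracle SES chooses the range boundaries is not stated here; equal-width ranges are an assumption, number_ranges is a hypothetical name, and only a single tree level is shown.

```python
from typing import List, Tuple

NUM_RANGES = 5  # fixed number of ranges per level

def number_ranges(values: List[float], min_docs: int) -> List[Tuple[float, float, int]]:
    """Return (low, high, count) for each range that has enough documents."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    # Assumption: equal-width ranges; fall back to width 1.0 if all values are equal.
    width = (hi - lo) / NUM_RANGES or 1.0
    counts = [0] * NUM_RANGES
    for v in values:
        index = min(int((v - lo) / width), NUM_RANGES - 1)  # the maximum goes in the last range
        counts[index] += 1
    return [
        (lo + i * width, lo + (i + 1) * width, c)
        for i, c in enumerate(counts)
        if c > 0 and c >= min_docs
    ]

ranges = number_ranges([0, 1, 2, 10, 11, 48, 49, 50], min_docs=3)
print(ranges)
```

Here the range from 10 to 20 holds only two documents and the two middle ranges are empty, so only the first and last ranges would be displayed.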

Using Clustering

Cluster nodes filter the top results but do not change the order of the documents. When users select a cluster node, the result view is limited to the documents in that cluster node. All operations, such as sorting or paging through results, are limited to the cluster node.

The real-time clustering sidebar is hidden by default. Users can display the sidebar by clicking an arrow icon on the left-hand side of the search results page. Within the sidebar, result clusters are shown. The cluster nodes are sorted by the number of documents in each node.

Users can expand or collapse the nodes within a cluster tree without affecting the rest of the interface. If users click a cluster node, then the search results are filtered. If a cluster tree contains no child nodes, it is disabled.

Configuring Clustering in the Web Services API

Methods in the Query Web Services API provide clustering for customized Oracle SES applications. The main interface is the method doOracleOrganizedSearch, which accepts query information, grouping and sorting options, and clustering requests, and returns results according to the options in the request. A second method, doOracleFetchSearch, is used when the set of documents is known.

The input for doOracleOrganizedSearch includes the following information:

  • Query

  • TopN (the result set size used for grouping, sorting, and clustering)

  • Duplicate controls (removed, marked)

  • Data group list

  • Query and document language

  • Grouping and sorting options

  • Cluster tree configuration information (tree depth, children for each node, threshold, tree format type (JSON or XML), topic extraction configuration, and metadata clustering configuration)

  • Other query parameters (including Number startIndex, Boolean returnCount, String filterConnect, Filter[] filters)

The output is an object that contains the search result, grouping information, and the cluster tree string list. The search result list is in the order specified by the grouping and sorting option. If this is not specified, then it is sorted by the relevance score. The returned cluster tree string represents the clustering tree information: tree structure, node names, and document IDs.

Java Classes for Clustering

There are three classes to support the grouping and sorting options: GroupAttribute, SortAttribute, and GroupResult.

There are two classes to support the clustering request: ClusterConfig, which controls the clustering request, and ClusterTree, which contains the tree output.

The class OracleResultContainer is defined to wrap the search hit result, grouping result, and clustering result.

doOracleFetchSearch is used for fetching a selected list of documents identified by their document ID, federated source ID, or both.

If GroupAttribute is specified, then it is automatically added to the front of the list of sorting attributes. For example, if the query is grouped by host name and sorted by title, then the search hits are sorted by (hostname, title).

The sorting, grouping, or clustering option can be applied to this result. Sorting is based on the top N results, while grouping and clustering are based on the result window determined by (startIndex, docsRequested).
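
The effect of a group attribute on the sort order can be shown with plain dictionaries. This sketch does not use the GroupAttribute or SortAttribute classes themselves; the helper name, the handling of a repeated attribute, and the hit data are hypothetical.

```python
from typing import Dict, List, Optional

def effective_sort_keys(group_attribute: Optional[str],
                        sort_attributes: List[str]) -> List[str]:
    """Place the group attribute in front of the requested sort attributes."""
    if group_attribute is None:
        return list(sort_attributes)
    # Assumption: a group attribute that is also a sort attribute is listed only once.
    return [group_attribute] + [a for a in sort_attributes if a != group_attribute]

hits: List[Dict[str, str]] = [
    {"hostname": "b.example.com", "title": "Alpha"},
    {"hostname": "a.example.com", "title": "Zeta"},
    {"hostname": "a.example.com", "title": "Beta"},
]

keys = effective_sort_keys("hostname", ["title"])
ordered = sorted(hits, key=lambda hit: tuple(hit[k] for k in keys))
print(keys)
print([(h["hostname"], h["title"]) for h in ordered])
```

Hits from the same host stay together, and the title only decides the order within each host.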

Cluster Result XML Schema

The main XML element, node, contains the following attributes:

  • id: ID for the node. The value represents the full path with the parent node paths.

  • name: The name of the node. This is actually the topic for the node.

  • level: The cluster node level, starting from 1 for the top node.

  • size: Number of documents under (directly and indirectly) this cluster node.

  • leaf: This is "1" if the cluster node only contains documents and no child cluster nodes. Otherwise, this is "0".

  • keywords: All keywords and phrases within the cluster node.

The node element contains the document IDs in the XML text element if the node is a leaf node. The document ID in the XML file has the format docID.SES_InstanceID. If the document is from the local instance, then the SES_InstanceID is omitted.

<cluster>
   <nodeset>
      <node id="1" name="all" level="1" size="100" leaf="0" keywords="all"/>
      <node id="1.4" name="java" level="2" size="99" leaf="0" keywords="java"/>
      <node id="1.4.1" name="data warehousing" level="3" size="38" leaf="0"
         keywords="technologies bi,data warehousing,linux .net office 
            php security service"/>
      <node id="1.4.1.1" name="tutorials blogs" level="4" size="12" leaf="1"
         keywords="tutorials blogs">
         2773,8031,109,8033,806,26940,817,8024,8030,2862,8032,8028
      </node>
      <node id="1.4.1.2" name="stored procedure" level="4" size="4" leaf="1"
         keywords="stored procedure">
         4239,4243,2784,4335
      </node>
      <node id="1.4.1.3" name="miscellaneous" level="4"  size="22" leaf="1">
         4017,2836,8029,2767,1502,113814,11731,1138,392,2819,2763,1421,
         221,705,7739,2838,2749,2351,2802,1158,15751,15747
      </node>
   </nodeset>
</cluster>
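
A client could read this format with a standard XML parser. The following sketch uses a shortened, made-up tree in the same layout (the document ID 4243.2 is invented to show the instance suffix) and hypothetical helper names. It collects the document IDs of leaf nodes and derives each node's parent from its dotted id.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional, Tuple

SAMPLE = """<cluster>
   <nodeset>
      <node id="1" name="all" level="1" size="3" leaf="0" keywords="all"/>
      <node id="1.1" name="stored procedure" level="2" size="2" leaf="1"
         keywords="stored procedure">
         4239,4243.2
      </node>
      <node id="1.2" name="miscellaneous" level="2" size="1" leaf="1">
         4017
      </node>
   </nodeset>
</cluster>"""

def split_doc_id(raw: str) -> Tuple[str, Optional[str]]:
    """Split docID.SES_InstanceID; the instance part is None for local documents."""
    doc_id, _, instance_id = raw.partition(".")
    return doc_id, instance_id or None

def leaf_documents(xml_text: str) -> Dict[str, List[Tuple[str, Optional[str]]]]:
    """Map each leaf node id to its list of (docID, instanceID) pairs."""
    result: Dict[str, List[Tuple[str, Optional[str]]]] = {}
    for node in ET.fromstring(xml_text).iter("node"):
        if node.get("leaf") == "1":
            ids = [part.strip() for part in (node.text or "").split(",")]
            result[node.get("id")] = [split_doc_id(i) for i in ids if i]
    return result

def parent_id(node_id: str) -> Optional[str]:
    """The id is the full path, so the parent is everything before the last dot."""
    head, sep, _ = node_id.rpartition(".")
    return head if sep else None

docs = leaf_documents(SAMPLE)
print(docs)
```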

Cluster Result JSON Format

To integrate with AJAX applications, the cluster results can be returned in JSON format. The JSON format directly reflects the tree structure of the cluster results. Each node has either a children array, which is a list of nodes representing the direct children of that node, or, if the node is a leaf node, a docs array listing the documents in that node. Nodes in the children array may have children of their own, and so on.

Here is sample JSON output.

{"nodeset":
 
  {"id":"1",
  "name":"all",
  "level":1,
  "size":100,
  "leaf":false,
  "keywords":"all",
  "children":
     [{"id":"1.4",
     "name":"java",
     "level":2,
     "size":99,
     "leaf":false,
     "keywords":"java",
     "children":
         [{"id":"1.4.1",
         "name":"data warehousing",
         "level":3,
         "size":38,
         "leaf":false,
         "keywords":"technologies bi,data warehousing,linux .net office php security service",
         "children":
            [{"id":"1.4.1.1",
            "name":"tutorials blogs",
            "level":4,
            "size":12,
            "leaf":true,
            "keywords":"tutorials blogs",
            "docs":["2773","8031","109","8033","806","26940","817","8024","8030","2862","8032","8028"]},
            {"id":"1.4.1.2",
            "name":"stored procedure",
            "level":4,
            "size":4,
            "leaf":true,
            "keywords":"stored procedure",
            "docs":["4239","4243","2784","4335"]}]
         }]
     },
     {"id":"1.5",
     "name":"miscellaneous",
     "level":2,
     "size":1,
     "leaf":true,
     "docs":["265915"]
     }]
   }
}
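
Because the JSON output mirrors the tree, a recursive walk is enough to reach every node. The sketch below uses a shortened, made-up tree with the same keys as the sample above; the function names are hypothetical.

```python
import json
from typing import Dict, Iterator, List

SAMPLE = """{"nodeset":
  {"id":"1","name":"all","level":1,"size":3,"leaf":false,"keywords":"all",
   "children":[
     {"id":"1.1","name":"stored procedure","level":2,"size":2,"leaf":true,
      "keywords":"stored procedure","docs":["4239","4243"]},
     {"id":"1.2","name":"miscellaneous","level":2,"size":1,"leaf":true,
      "docs":["4017"]}]}}"""

def walk(node: dict) -> Iterator[dict]:
    """Yield a node and then all of its descendants, depth first."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)

def docs_by_leaf(tree_text: str) -> Dict[str, List[str]]:
    """Map each leaf node id to its docs array."""
    root = json.loads(tree_text)["nodeset"]
    return {n["id"]: n["docs"] for n in walk(root) if n["leaf"]}

leaf_docs = docs_by_leaf(SAMPLE)
root_size = json.loads(SAMPLE)["nodeset"]["size"]
print(leaf_docs)
print(sum(len(d) for d in leaf_docs.values()) == root_size)
```

Keying the result by id rather than by name avoids collisions, because a node named miscellaneous can appear at more than one level of the tree.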