Overview of Term Discovery

Term Discovery is the feature that extracts salient terms from source documents.

Term Discovery can be thought of as a two-part process:

Extracting terms from source documents (unstructured or structured) and scoring them by relevance. The scored terms are mapped to an Endeca dimension, called a Term Discovery dimension in this guide.
Presenting terms relevant to the current navigation state.

Extracting terms from documents

Term extraction is the process of tagging an Endeca record with a list of its relevant terms. A term represents a concept mentioned in the record’s source document, and is typically a noun phrase. The noun phrase consists of one or more nouns, potentially with adjoining words. A relevant term is one that is informative about a document relative to the rest of the corpus; that is, it helps distinguish the document from the others.

During the term extraction process, term variants found in documents are stemmed for comparison and aggregation, but the final representation of the term uses the dominant form (most frequent variant). Using the dominant form allows the recovery of the preferred representation (singular/plural case, capitalization) of proper nouns and brand names.
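The aggregation described above can be illustrated with a minimal Python sketch. It is not the Endeca implementation: `toy_stem` is a hypothetical stand-in for a real stemmer, and the sample variants are invented for illustration. The sketch only shows the idea of grouping variants under a shared stem and keeping the most frequent surface form.

```python
from collections import Counter, defaultdict

def toy_stem(term: str) -> str:
    """Hypothetical stemmer (assumption): lowercase each word and
    strip a trailing 's' from words longer than three characters."""
    words = term.lower().split()
    return " ".join(w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words)

def dominant_forms(variants):
    """Group variants by stem and map each stem to its most frequent variant."""
    groups = defaultdict(Counter)
    for variant in variants:
        groups[toy_stem(variant)][variant] += 1
    return {stem: counts.most_common(1)[0][0] for stem, counts in groups.items()}

# Invented sample data: variants as they might appear across documents.
observed = ["iPods", "iPods", "ipod", "iPods",
            "Digital Camera", "digital cameras", "Digital Camera"]
result = dominant_forms(observed)
print(result)
```

In this sample, the variants of each term collapse to one stem for counting, while the representation that is kept preserves the capitalization and number of the most common variant.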

Term extraction is performed by the Data Foundry via a Java manipulator pipeline component that uses the Endeca TermExtractor class. The terms are extracted into a user-specified property on the Endeca record. The property is then mapped (via a property mapper) to a dimension. Such a dimension is called a Term Discovery dimension in this guide.

Configuration information on term extraction is given in Chapter 2 (“Configuration Guidelines for Term Extraction”).

Maximum size of extracted terms

A noun phrase consists of a noun (or a sequence of nouns) with any associated modifiers. The modifiers are limited to adjectives, adjective phrases, or nouns used as adjectives.

Programmatically, each word in a noun phrase is called a token. An extracted noun phrase can have a maximum of 5 tokens; each token is limited to 200 characters. Therefore, a valid noun phrase has a maximum size of 1,000 characters.

The maximum size of a sentence in a document is 1,000 tokens (words). If the term extractor cannot determine the sentence boundaries of text in the document, it splits the text into blocks of 1,000 tokens and then performs term extraction on each block. In addition, the following WARN message is written to the term extraction log:

Sentence boundaries could not be found for text beginning
with tokens t1 t2 t3 t4 t5

where t1 through t5 are the first 5 tokens of the problematic text.
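The fallback behavior can be sketched in Python as follows. This is an illustration under stated assumptions, not the term extractor's code: the function names are hypothetical, tokens are assumed to be whitespace-separated words, and the message is built on a single line although the log may wrap it.

```python
MAX_SENTENCE_TOKENS = 1000  # limit stated in this guide

def split_into_blocks(tokens, block_size=MAX_SENTENCE_TOKENS):
    """Split a token list into consecutive blocks of at most block_size tokens."""
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]

def boundary_warning(tokens):
    """Build the WARN text from the first 5 tokens of the problematic text."""
    return ("Sentence boundaries could not be found for text beginning "
            "with tokens " + " ".join(tokens[:5]))

# Invented input: 2,500 tokens with no detectable sentence boundaries.
tokens = [f"w{i}" for i in range(2500)]
blocks = split_into_blocks(tokens)
print([len(b) for b in blocks])
print(boundary_warning(tokens))
```

With 2,500 tokens, the text is split into two full blocks and one partial block, and the warning names the first five tokens.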

The term extractor treats invalid noun phrases as follows:

If a noun phrase has more than 5 tokens, only the last 5 tokens are retained. For example, with a 7-token noun phrase, the first 2 tokens are completely ignored by the term extractor and the last 5 tokens are retained.
Tokens that are over 200 characters are ignored.
If a noun phrase includes an overly long token, that token is ignored, and the tokens before and after it are treated as separate noun phrases. For example, assume a 5-token noun phrase in which Token 3 is overly long and the others are valid. In this case, Token 3 is ignored and the term extractor returns 2 noun phrases: the first consists of Tokens 1 and 2, and the second consists of Tokens 4 and 5.
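The rules for invalid noun phrases can be summarized in a short Python sketch. The function name is hypothetical, and the order of operations (split on overly long tokens first, then keep the last 5 tokens of each piece) is an assumption, since this guide does not state how the rules combine.

```python
MAX_TOKENS = 5          # maximum tokens per noun phrase (from this guide)
MAX_TOKEN_CHARS = 200   # maximum characters per token (from this guide)

def normalize_noun_phrase(tokens):
    """Return the list of valid noun phrases derived from one candidate phrase.

    Assumed order: overly long tokens are dropped and act as split points,
    then each resulting phrase keeps only its last MAX_TOKENS tokens.
    """
    phrases, current = [], []
    for token in tokens:
        if len(token) > MAX_TOKEN_CHARS:
            if current:              # close the phrase before the long token
                phrases.append(current)
            current = []
        else:
            current.append(token)
    if current:
        phrases.append(current)
    return [phrase[-MAX_TOKENS:] for phrase in phrases]

# 7-token phrase: the first 2 tokens are dropped, the last 5 are retained.
seven = normalize_noun_phrase(["t1", "t2", "t3", "t4", "t5", "t6", "t7"])

# 5-token phrase whose third token exceeds 200 characters.
split = normalize_noun_phrase(["t1", "t2", "x" * 201, "t4", "t5"])

print(seven)
print(split)
print(MAX_TOKENS * MAX_TOKEN_CHARS)  # maximum characters in a valid noun phrase
```

The two calls reproduce the examples given above: the 7-token phrase keeps Tokens 3 through 7, and the phrase with an overly long third token yields two separate 2-token phrases.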

Presenting relevant terms

Relevant terms are the most frequent terms available in the Term Discovery dimension. These terms are returned from the documents in the current result set. All of the terms in the set belong to the same Term Discovery dimension.
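As a rough illustration of "most frequent terms in the current result set", the following Python sketch counts how many records in a result set carry each term. It does not reflect how the MDEX Engine computes refinements: the record layout, the `terms` property name, and the choice to count each term once per record are all assumptions made for the example.

```python
from collections import Counter

def relevant_terms(result_set, term_property="terms", top_n=2):
    """Return the top_n terms by number of records tagged with them."""
    counts = Counter()
    for record in result_set:
        # Count each term once per record (assumption).
        counts.update(set(record.get(term_property, [])))
    return counts.most_common(top_n)

# Invented result set for the current navigation state.
result_set = [
    {"id": 1, "terms": ["hybrid engine", "fuel economy"]},
    {"id": 2, "terms": ["hybrid engine", "fuel economy", "sunroof"]},
    {"id": 3, "terms": ["hybrid engine"]},
]
top = relevant_terms(result_set)
print(top)
```

Because the counts are taken over the current result set only, narrowing the navigation state changes which terms rank highest.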

The Term Discovery dimension must have these two attributes:

It must be a flat dimension (that is, a dimension that does not contain hierarchies).
It must not be a hidden dimension. (Configuring it as a hidden dimension will disable the Cluster Discovery feature.)

Relevant terms are returned by the Endeca MDEX Engine as dimension value refinements. Programmatically, relevant terms are DimVal objects. Therefore, application developers can use the same Endeca Presentation API methods on relevant terms that can be used on normal dimension value refinements. For example, they can be returned sorted using any ranking behavior supported for dimension value refinements.

For more information on displaying relevant terms, see the “Displaying refinements” topic.