Term Discovery is the feature that extracts salient terms from source documents.
Term extraction is the process of tagging an Endeca record with a list of its relevant terms. A term represents a concept mentioned in the record’s source document, and is typically a noun phrase. The noun phrase consists of one or more nouns, potentially with adjoining words. A relevant term is a term that bears information for a document relative to the rest of the corpus.
During the term extraction process, term variants found in documents are stemmed for comparison and aggregation, but the final representation of the term uses the dominant form (most frequent variant). Using the dominant form allows the recovery of the preferred representation (singular/plural case, capitalization) of proper nouns and brand names.
Term extraction is performed by the Data Foundry via a Java manipulator pipeline component that uses the Endeca TermExtractor class. The terms are extracted into a user-specified property on the Endeca record. The property is then mapped (via a property mapper) to a dimension. Such a dimension is called a Term Discovery dimension in this guide.
Configuration information on term extraction is given in Chapter 2 (“Configuration Guidelines for Term Extraction”).
A noun phrase consists of a noun (or a sequence of nouns) with any associated modifiers. The modifiers are limited to adjectives, adjective phrases, or nouns used as adjectives.
Programmatically, each word in a noun phrase is called a token. An extracted noun phrase can have a maximum of 5 tokens; each token is limited to 200 characters. Therefore, a valid noun phrase has a maximum size of 1,000 characters.
Sentence boundaries could not be found for text beginning with tokens t1 t2 t3 t4 t5where t1 through t5 are the first 5 tokens of the problematic text.
Relevant terms are the most frequent terms available in the Term Discovery dimension. These terms are returned from the documents in the current result set. All of the terms in the set belong to the same Term Discovery dimension.
Relevant terms are returned by the Endeca MDEX Engine as dimension value refinements. Programmatically, relevant terms are DimVal objects. Therefore, application developers can use the same Endeca Presentation API methods on relevant terms that can be used on normal dimension value refinements. For example, they can be returned sorted using any ranking behavior supported for dimension value refinements.
For more information on displaying relevant terms, see the “Displaying refinements” topic.