As previously described, ATG Search performs natural language processing during indexing and searching, using language-specific dictionaries. The data in the dictionaries drive the analysis of natural language text. ATG Search provides general-purpose data in over 30 languages.
Users can customize four aspects of dictionary data:
Index Term Definitions: These are the forms of a term that ATG Search uses to index content. Users can define new index terms or replace the definitions of existing index terms.
ATG Search performs morphological analysis (see the Morphology section in this chapter) on text to determine what terms to index and query by. A term that is not in the dictionary is not processed for morphological variants or term expansions, so it is important for customers to extend the core dictionary with terms that appear in their content. For example, if the content contains various forms of the term widget, such as widgets, but widget is not in the dictionary, ATG Search indexes widget and widgets separately. However, if widget is added as a noun index term, then widgets will morph to widget, increasing the consistency of the index and search results.
Custom terminology can include compound terms, which act as a single term to improve searches. For example, content that refers to a product called Office System is indexed by the general terms office and system. However, if this compound term is added to the dictionary, the search treats office system as a single term and retrieves only mentions of the compound, not every occurrence of the general terms.
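The effect of adding index terms and compound terms can be illustrated with a small Python sketch. This is a toy model only, not the ATG Search dictionary format or API: the names NOUN_TERMS, COMPOUND_TERMS, morph, and index_terms are hypothetical, and stripping a trailing "s" is a crude stand-in for real morphological analysis.

```python
# Toy model only: not ATG Search's dictionary format or API.
NOUN_TERMS = {"widget", "office", "system"}   # hypothetical dictionary nouns
COMPOUND_TERMS = {("office", "system")}       # hypothetical compound entries


def morph(token, dictionary):
    """Map a plural to its singular only when the singular is a known noun."""
    if token.endswith("s") and token[:-1] in dictionary:
        return token[:-1]
    return token  # unknown terms are indexed exactly as they appear


def index_terms(text, dictionary=NOUN_TERMS, compounds=COMPOUND_TERMS):
    """Return the index terms for text, merging known compounds into one term."""
    tokens = [morph(t, dictionary) for t in text.lower().split()]
    terms, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in compounds:
            terms.append(" ".join(pair))  # compound acts as a single term
            i += 2
        else:
            terms.append(tokens[i])
            i += 1
    return terms
```

With an empty dictionary, index_terms("widgets", dictionary=set()) leaves widgets as its own term; once widget is a known noun, the same call with the default dictionary yields widget. Likewise, office system collapses into one term only when the compound is present.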
Thesaurus Entries: Users can provide entries for new index terms or replace the entries for existing terms. Each entry links a main term to related terms, and each link has a strength ranging from equal to weak. A link may or may not be reciprocal; if it is, the related term has the same thesaurus entries as the main term. The main term can also have a parent term, which allows the thesaurus entries of the main term to be propagated up to the parent.
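One way to picture a thesaurus entry is as a small record holding links, reciprocity, and an optional parent. The Python sketch below is an illustration under stated assumptions, not the ATG Search data model: the Entry class, the numeric strengths (1.0 for equal, lower values for weaker links), and the related function are all hypothetical, and the handling of reciprocal links and parents is only one possible reading of the description above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entry:
    """Hypothetical thesaurus entry; not ATG Search's actual format."""
    term: str
    links: dict = field(default_factory=dict)      # related term -> strength
    reciprocal: set = field(default_factory=set)   # related terms that link back
    parent: Optional[str] = None


def related(term, entries):
    """Collect related terms for `term` under one reading of the rules above."""
    result = {}
    for e in entries.values():
        if e.term == term:
            result.update(e.links)                 # the term's own entries
        elif term in e.reciprocal:
            result[e.term] = e.links[term]         # reciprocal link back to main
            for other, strength in e.links.items():
                if other != term:
                    result.setdefault(other, strength)
        elif e.parent == term:
            for other, strength in e.links.items():
                result.setdefault(other, strength)  # propagated up to the parent
    return result


entries = {
    "car": Entry("car", {"automobile": 1.0, "vehicle": 0.3},
                 reciprocal={"automobile"}, parent="transport"),
}
```

In this sketch, automobile (a reciprocal link) sees car and vehicle, vehicle (a one-way link) sees nothing, and transport (the parent) inherits the entries of car.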
Term Normalization: Users can define variants of index terms that are normalized to a standard form. When those variants appear in the content, they are normalized to the main term.
Note the difference between equal-linked terms and equivalent terms. The former means that there are two separate terms that are strongly linked and will behave almost identically in search, except for the literalness factor. The latter means that the equivalent term is normalized into the main term, does not exist separately, and is treated as literally the same term.
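The distinction between equivalent terms and equal-linked terms can be sketched in Python. This is a conceptual toy, not ATG Search behavior or API: EQUIVALENTS, EQUAL_LINKS, build_index, and search are hypothetical names, and the boolean "literal" flag is only a simplified stand-in for the literalness factor.

```python
# Toy model only: hypothetical data, not ATG Search's dictionary format.
EQUIVALENTS = {"colour": "color"}                      # variant -> main term
EQUAL_LINKS = {"sofa": {"couch"}, "couch": {"sofa"}}   # separate, strongly linked


def normalize(token):
    """Equivalent terms are replaced by the main term and do not exist separately."""
    return EQUIVALENTS.get(token, token)


def build_index(docs):
    """Map each normalized term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index.setdefault(normalize(token), set()).add(doc_id)
    return index


def search(query_term, index):
    """Return {doc_id: is_literal_match} for a single-term query."""
    term = normalize(query_term)
    hits = {doc_id: True for doc_id in index.get(term, ())}
    for linked in EQUAL_LINKS.get(term, ()):
        for doc_id in index.get(linked, ()):
            hits.setdefault(doc_id, False)   # matched through an equal link
    return hits


docs = {1: "colour chart", 2: "color wheel", 3: "sofa", 4: "couch"}
index = build_index(docs)
```

Here colour never appears in the index, and a query for it matches documents 1 and 2 literally. A query for sofa matches document 3 literally and document 4 through the equal link, which is where a literalness factor could separate the two.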
Term Weights: Users can provide explicit weights for terms to override the automatic weighting, including very low weights that turn terms into stop-words.
ATG Search computes the weight of the index terms to use in its calculation of relevancy. By default, weight is based on a term’s frequency and the size of the indexed content. Explicit weighting overrides automatic weighting. Typically, the term weight data includes low-content terms that may or may not be frequent in the content. Terms with a weight below a certain threshold are considered stop-words and ignored during search.
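The interaction between automatic weights, explicit weights, and the stop-word threshold can be sketched as follows. The actual ATG Search weighting formula and threshold are not given here, so this Python example substitutes an IDF-style formula and a made-up threshold; STOP_THRESHOLD, automatic_weight, term_weight, and query_terms are all hypothetical names.

```python
import math

STOP_THRESHOLD = 0.1  # hypothetical value; the real threshold is not stated


def automatic_weight(doc_freq, total_docs):
    """IDF-style stand-in scaled to [0, 1]: rarer terms weigh more.

    This is an assumption, not ATG Search's actual formula. It expects
    total_docs >= 1 and 0 <= doc_freq <= total_docs.
    """
    return math.log((1 + total_docs) / (1 + doc_freq)) / math.log(1 + total_docs)


def term_weight(term, doc_freq, total_docs, explicit):
    """An explicit weight, when present, overrides the automatic weight."""
    if term in explicit:
        return explicit[term]
    return automatic_weight(doc_freq.get(term, 0), total_docs)


def query_terms(terms, doc_freq, total_docs, explicit):
    """Drop terms whose weight falls below the stop-word threshold."""
    return [t for t in terms
            if term_weight(t, doc_freq, total_docs, explicit) >= STOP_THRESHOLD]


doc_freq = {"widget": 5, "please": 2}   # documents containing each term
explicit = {"please": 0.01}             # low-content term forced to a low weight
```

In this sketch, please is rare in the content and would receive a high automatic weight, but the explicit weight of 0.01 pushes it below the threshold, so query_terms(["please", "widget"], doc_freq, 100, explicit) keeps only widget.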