Tokenization is the process of identifying the terms in input text. Terms can be words, numbers, punctuation, or other special items. ATG Search can identify complex terms such as alphanumeric expressions (F-14), numeric expressions (0x0FB5), and compound terms (New York). By treating these complex terms as single units of search, ATG Search achieves more precise results than many other search engines. For example, a typical search engine treats a search for F-14 as a search for F and/or 14, retrieving results about F alone, about 14 alone, or about F and 14 anywhere in the same document.
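The contrast can be sketched with two regular-expression tokenizers; the patterns below are illustrative stand-ins, not ATG Search's actual rules:

```python
import re

text = "Specs for the F-14 fighter"

# Naive tokenization splits on punctuation, breaking "F-14" apart.
naive = re.findall(r"\w+", text)
# → ['Specs', 'for', 'the', 'F', '14', 'fighter']

# Complex-term tokenization tries the alphanumeric-expression pattern
# first, keeping "F-14" as a single unit of search.
complex_tokens = re.findall(r"[A-Za-z]+-\d+|\w+", text)
# → ['Specs', 'for', 'the', 'F-14', 'fighter']
```

A query for F-14 against the second token stream matches only documents containing the whole expression, not every document mentioning F or 14.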

Some of this analysis is driven by the ATG Search dictionary, which administrators can extend through the Search Administration user interface. Typically, administrators add relevant product names and models to the dictionary (see the Custom Terminology section in this chapter).

The remainder of tokenization is driven by a rule base of terminology patterns. ATG Search provides a core set of general rules for handling complex terms; additional rules can be added through custom services.
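One common way to implement such a rule base is an ordered list of patterns tried at each position, most specific first. The rules below are hypothetical examples, not ATG Search's core rule set:

```python
import re

# Ordered terminology patterns: more specific rules are tried before
# general ones, so "0x0FB5" and "F-14" win over plain words and numbers.
RULES = [
    ("HEX",    re.compile(r"0x[0-9A-Fa-f]+")),   # numeric expressions
    ("ALNUM",  re.compile(r"[A-Za-z]+-\d+")),    # alphanumeric expressions
    ("WORD",   re.compile(r"[A-Za-z]+")),
    ("NUMBER", re.compile(r"\d+")),
    ("PUNCT",  re.compile(r"[^\sA-Za-z0-9]")),
]

def tokenize(text):
    """Scan left to right, emitting the first rule that matches."""
    tokens, i = [], 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        for name, pattern in RULES:
            m = pattern.match(text, i)
            if m:
                tokens.append((name, m.group()))
                i = m.end()
                break
        else:
            i += 1  # skip a character no rule can match
    return tokens

# tokenize("Error 0x0FB5 on F-14.") yields one HEX token and one
# ALNUM token rather than fragments.
```

Adding a custom rule is then a matter of inserting a new pattern at the right priority in the list.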

Note: In languages such as Japanese, which do not separate words with white space, term boundaries are not obvious. For these languages, an equivalent process called segmentation is used instead of simple tokenization to determine word boundaries.
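A minimal sketch of dictionary-driven segmentation is greedy longest match against a lexicon. The tiny lexicon here is invented for illustration; a production segmenter such as ATG Search's is considerably more sophisticated:

```python
# Toy lexicon: "Tokyo", "metropolis", "in", "to live" (illustrative only).
LEXICON = {"東京", "都", "に", "住む"}

def segment(text):
    """Greedily match the longest lexicon entry at each position."""
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try longest candidate first
            if text[i:j] in LEXICON:
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])  # unknown character stands alone
            i += 1
    return words

# segment("東京都に住む") → ['東京', '都', 'に', '住む']
```

Even this naive approach shows why segmentation depends on the dictionary: without an entry for 東京, the segmenter would fall back to single characters.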
