Oracle Language Technology analysis performs language-specific dictionary-based forms of linguistic analysis, including the following:
Segmentation: Identifying word breaks in text from languages that do not use whitespace as a word delimiter. Formerly unseparated words must be contiguous with each other and in the same property. Note that Latin-1 analysis is unsuitable for languages that do not use whitespace as a delimiter.
Tokenization: Breaking a stream of text up into words, phrases, symbols, or other meaningful elements.
Orthographic normalization: Accounting for variations in the representation of words in languages that have standardized alternatives to diacritic marks (such as "ae" or "a" for ä in German); for example, treating "Furtwaengler" and "Furtwangler" as matching terms.
Decompounding: Dividing compound word forms into their base terms; for example, dividing "Altertumswissenschaft" into "Altertums" and "Wissenschaft".
Diacritic folding: Ignoring character accents in data when indexing and searching text.
Dynamic stemming: Determining the base (uninflected) form of a word. The process is based on dictionary entries and language-specific rules.
Stop words: Common words (such as "the", "and", or "while") that have no value for searching.
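To make several of the steps above concrete, the following is a minimal Python sketch of diacritic folding, orthographic normalization, decompounding, and stop-word removal. It is an illustration only and does not reflect how OLT is implemented. The function names, the German substitution table, the stop-word set, and the two-entry decompounding dictionary are all hypothetical, and the whitespace tokenizer would not work for languages that require segmentation.

```python
from __future__ import annotations

import unicodedata

# Hypothetical sample data for illustration; these are not OLT's dictionaries.
STOP_WORDS = {"the", "and", "while"}
GERMAN_ALTERNATIVES = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}
COMPOUND_PARTS = {"altertums", "wissenschaft"}


def fold_diacritics(text: str) -> str:
    """Diacritic folding: decompose characters and drop combining accents."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def orthographic_variants(word: str) -> set[str]:
    """Orthographic normalization: lowercase forms that should match."""
    lowered = word.lower()
    expanded = "".join(GERMAN_ALTERNATIVES.get(ch, ch) for ch in lowered)
    return {lowered, expanded, fold_diacritics(lowered)}


def decompound(word: str, dictionary: set[str]) -> list[str]:
    """Decompounding: split into two parts if both are dictionary entries."""
    lowered = word.lower()
    for i in range(1, len(lowered)):
        head, tail = lowered[:i], lowered[i:]
        if head in dictionary and tail in dictionary:
            return [head, tail]
    return [lowered]


def tokenize(text: str) -> list[str]:
    """Naive tokenization: split on whitespace, strip outer punctuation."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Stop words: drop common words that have no value for searching."""
    return [tok for tok in tokens if tok not in STOP_WORDS]
```

With these definitions, `orthographic_variants("Furtwängler")` contains both `"furtwaengler"` and `"furtwangler"`, `decompound("Altertumswissenschaft", COMPOUND_PARTS)` returns `["altertums", "wissenschaft"]`, and `remove_stop_words(tokenize("The conductor, and the orchestra."))` returns `["conductor", "orchestra"]`. A real analyzer relies on full language dictionaries and rules rather than the small lookup tables assumed here.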
A single MDEX Engine can process any number of the originally supported languages whose default language analysis is OLT; for example, a single MDEX Engine can process data in Arabic, Finnish, and Hebrew. However, among the languages that were not originally supported, a single MDEX Engine can process only one language whose default analysis is OLT.
The management of the originally supported languages can be configured in the file stemming.xml.
For a complete list of the languages supported by the MDEX Engine, see MDEX Engine Supported Languages.
Note
Different releases of the MDEX Engine may include different versions of OLT. To find out which version of OLT the MDEX Engine uses, run Dgidx or Dgraph with the --version option at the command line.
Note
Only one type of language analysis can be applied to any particular record, dimension, or property.