Oracle Commerce Guided Search - Indexing Languages with a Language Analysis

Indexing Languages with a Language Analysis

A language analysis is a set of rules that the MDEX Engine applies when it indexes text in any of the languages in a particular language group. The languages in this group have common characteristics that require specialized handling by an appropriate language analysis.

Every form of language analysis provides, at a minimum, tokenization: the breaking up of compound phrases into their constituent words or characters. Language analysis can optionally include stemming, which makes it possible to match inflected word forms that share a stem; for example, to treat "family", "families", and "family's" as forms that match each other. Each language analysis includes a different set of other text management features, such as ignoring "stop words" (that is, words with little or no value for searches, such as "the") or ignoring accents (diacritic folding).
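As a rough illustration of the features named above, the following Python sketch tokenizes text, drops stop words, folds diacritics, and applies a toy stemmer. It is not the MDEX Engine's implementation: the function names, the stop-word list, and the suffix rules are all assumptions made for this example, and the stemmer handles only the "family" forms mentioned above.

```python
import re
import unicodedata

# Illustrative stop-word list only; a real analysis ships its own list.
STOP_WORDS = {"the", "a", "an", "of"}


def fold_diacritics(text):
    """Remove combining accent marks, so that 'café' becomes 'cafe'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Split text into lowercase word tokens, keeping internal apostrophes."""
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())


def naive_stem(token):
    """Toy suffix stripping, sufficient only for the family/families example."""
    if token.endswith("'s"):
        token = token[:-2]
    if token.endswith("ies"):
        return token[:-3] + "y"
    return token


def analyze(text, *, stem=True, drop_stop_words=True, fold=True):
    """Tokenize, then apply whichever optional features are enabled."""
    if fold:
        text = fold_diacritics(text)
    tokens = tokenize(text)
    if drop_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [naive_stem(t) for t in tokens]
    return tokens


print(analyze("The family's café"))  # ['family', 'cafe']
```

With stemming enabled, "family", "families", and "family's" all reduce to the same token, which is what allows them to match each other at query time.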

The MDEX Engine can apply a language analysis to individual records, to record properties, to dimension tags on records, or to all record data that an MDEX Engine processes. Only one language analysis can be applied to any one of these units at a time.

Oracle Commerce Guided Search supports two standard forms of language analysis, Latin-1 and OLT (Oracle Language Technologies), which are designed for use with different languages. Customers can create and use non-standard language analyses if neither Latin-1 nor OLT meets their requirements. For detailed information about Latin-1 and OLT language analysis, see Latin-1 and OLT Language Analysis.

How the MDEX Engine Chooses a Language Analysis

To choose a language analysis to use for processing a particular record, the MDEX Engine follows these steps:

1. If it finds a language ID code for the record, it assumes that all the record's content is in that language and applies the appropriate language analysis to the record as a whole.
2. If it does not find a language ID code for the record, it examines the properties of the record in sequence, as follows:

The MDEX Engine then analyzes the next record, beginning with Step 1.

Note

In the preceding sequence of steps, the MDEX Engine treats dimensions the same way that it treats properties.
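The selection order described above can be sketched in Python as follows. This is a simplified model, not the engine's code. The record structure, the function name `choose_languages`, and the `global_language` parameter are hypothetical. In particular, the fallback used when a record has no language ID (each property's own language ID first, then a global language ID) is an assumption made for this sketch, not a statement of the engine's exact per-property procedure.

```python
from typing import Optional


def choose_languages(record: dict, global_language: Optional[str] = None) -> dict:
    """Return a mapping of property name -> language ID for one record.

    `record` is a hypothetical structure used only in this sketch:
        {"language": "fr" or None,
         "properties": {"name": {"value": "...", "language": "de"}, ...}}
    Dimensions can be modeled as entries in "properties", because they are
    treated the same way as properties.
    """
    record_language = record.get("language")
    if record_language:
        # Step 1: a record-level ID applies to the record as a whole.
        return {name: record_language for name in record["properties"]}

    # Step 2: no record-level ID, so examine each property in sequence.
    # ASSUMPTION: use the property's own ID if present, else the global ID.
    return {
        name: prop.get("language") or global_language
        for name, prop in record["properties"].items()
    }


tagged = {"language": "fr",
          "properties": {"name": {"value": "Vélo", "language": "de"},
                         "desc": {"value": "Rapide"}}}
untagged = {"properties": {"name": {"value": "Fahrrad", "language": "de"},
                           "desc": {"value": "Fast"}}}

print(choose_languages(tagged, "en"))    # {'name': 'fr', 'desc': 'fr'}
print(choose_languages(untagged, "en"))  # {'name': 'de', 'desc': 'en'}
```

In the first call the record-level ID overrides the property-level ID, which mirrors the rule that a record's language ID is taken to describe all of its content.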

For information about how language ID codes can be associated with records, properties, and dimensions, see Assigning Language IDs globally, per record, and per property. For information about dgidx, refer to the Oracle Commerce Guided Search Administrator's Guide.

When dgidx has determined the language of the record, property, or dimension, it consults the stemming.xml file (if one exists) to determine whether it contains an entry for that language. If an entry exists, dgidx uses information in the entry to decide which language analysis to apply to the language, and which features of that analysis to use. If no stemming file exists, or one exists but does not contain an entry for a particular language, an applicable default language analysis is applied. For information about stemming.xml, see Specifying Language Analysis in stemming.xml.
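The lookup-with-fallback logic in the preceding paragraph can be sketched as follows. The sketch deliberately models the contents of stemming.xml as an already-parsed Python dictionary rather than guessing at the file's XML format. The function name `select_analysis`, the entry fields, and the `default_for` callback are hypothetical, and the sketch does not say which default analysis applies to which language.

```python
def select_analysis(language, stemming_entries, default_for):
    """Pick the analysis settings for one language.

    stemming_entries: dict of language ID -> settings (hypothetical shape),
                      or None when no stemming file exists.
    default_for:      callable returning the default settings for a language.
    """
    if stemming_entries is not None:
        entry = stemming_entries.get(language)
        if entry is not None:
            # An entry exists: it decides the analysis and its features.
            return entry
    # No stemming file, or no entry for this language: use the default.
    return default_for(language)


# Hypothetical parsed entries; the field names are illustrative only.
entries = {"de": {"analysis": "example-analysis", "stemming": True}}


def placeholder_default(language):
    """Stand-in for whatever default the engine would apply."""
    return {"analysis": "default", "language": language}


print(select_analysis("de", entries, placeholder_default))
print(select_analysis("fr", entries, placeholder_default))
print(select_analysis("de", None, placeholder_default))
```

Passing the default as a callback keeps the two fallback cases (missing file, missing entry) on a single code path, matching the way the paragraph treats them identically.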

The following figure illustrates the process by which the MDEX Engine chooses a language analysis as described above:
