Guided Search supports different collations for sorting in different languages. These include the Endeca collation, the Standard collation, and several language-specific collations. Guided Search uses the Endeca collation by default.

The Endeca Collation

The Endeca collation places each lowercase character before its uppercase counterpart. For example, the Endeca collation sorts text as follows:

0 < 1 < ... < 9 < a < A < b < B < ... < z < Z

The Endeca collation is optimized for unaccented languages and ignores accents and punctuation. For this reason, in applications that use English as their global language, the Endeca collation performs better during indexing and query processing than the Standard collation. In applications that use non-Latin scripts or Latin scripts with accents, the Endeca collation may produce unexpected results for accented characters.
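The ordering described above can be approximated with a small sort key. The following Python sketch is an illustration only, not the Endeca implementation: the function name endeca_like_key is hypothetical, and the character-by-character comparison is an assumption based on the ordering shown above. It drops accents and punctuation, then orders each character by its lowercase form, with lowercase before uppercase.

```python
import unicodedata


def endeca_like_key(text):
    """Hypothetical sort key approximating the ordering described above."""
    key = []
    # NFD decomposition splits an accented letter into base letter + combining marks.
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            continue  # ignore accents
        if unicodedata.category(ch).startswith("P"):
            continue  # ignore punctuation
        # Compare on the lowercase form first; False sorts before True,
        # so a lowercase letter precedes its uppercase version.
        key.append((ch.lower(), ch.isupper()))
    return key


words = ["Zebra", "apple", "Apple", "9lives", "zebra", "0day"]
print(sorted(words, key=endeca_like_key))
# ['0day', '9lives', 'apple', 'Apple', 'zebra', 'Zebra']
```

Because accents and punctuation are discarded, "résumé" and "resume" receive the same key under this sketch, which illustrates why accented data can sort in unexpected ways.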

The Standard Collation

The Standard collation sorts data according to the International Components for Unicode (ICU) standard for the language you specify with the --lang flag. For details about the standard collation for a particular language, see the Unicode Common Locale Data Repository at http://cldr.unicode.org/. In applications that include internationalized data, the Standard collation is typically the more appropriate choice because it accounts for character accents during sorting.
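To show what accounting for accents means in practice, the following Python sketch builds a simplified two-level sort key: base letters are compared first, and accents break ties only between otherwise identical strings. This is a rough approximation of multi-level collation, not ICU itself and not the Standard collation's actual rules for any language. The function name accent_aware_key is hypothetical.

```python
import unicodedata


def accent_aware_key(text):
    """Hypothetical two-level key: base letters first, accents as tie-breaker."""
    bases, marks = [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and bases:
            marks[-1] += ch  # attach the accent to the preceding base letter
        else:
            bases.append(ch.casefold())
            marks.append("")
    # Level 1: the unaccented string. Level 2: the accents at each position.
    return ("".join(bases), tuple(marks))


words = ["resumes", "résumé", "resume"]
print(sorted(words))                        # code point order
print(sorted(words, key=accent_aware_key))  # accent-aware order
```

Plain code point order places "résumé" after "resumes", because "é" has a higher code point than any unaccented Latin letter. The two-level key places "résumé" directly after "resume", which is the behavior readers of accented languages usually expect.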

Language-Specific Collations

In addition to the Endeca and Standard collations, dgidx and the dgraph support the following language-specific ICU collations:

The following section explains how to specify the collation that you want to use for your data.

