Supported languages

The Dgraph uses a language code to identify a language for a specific attribute.

Language codes must be specified as valid RFC-3066 language code identifiers. The supported languages and their language code identifiers are:
Afrikaans: af | Danish: da | Indonesian: id | Norwegian Bokmal: nb | Spanish, Latin American: es_lam
Albanian: sq | Divehi: nl | Italian: it | Norwegian Nynorsk: nn | Spanish, Mexican: es_mx
Amharic: am | Dutch: nl | Japanese: ja | Oriya: or | Swahili: sw
Arabic: ar | English, American: en | Kannada: kn | Persian: fa | Swedish: sv
Armenian: hy | English, British: en_GB | Kazakh, Cyrillic: kk | Persian, Dari: prs | Tagalog: tl
Assamese: as | Estonian: et | Khmer: km | Polish: pl | Tamil: ta
Azerbaijani: az | Finnish: fi | Korean: ko | Portuguese: pt | Telugu: te
Bangla: bn | French: fr | Kyrgyz: ky | Portuguese, Brazilian: pt_BR | Thai: th
Basque: eu | French, Canadian: fr_ca | Lao: lo | Punjabi: pa | Turkish: tr
Belarusian: be | Galician: gl | Latvian: lv | Romanian: ro | Turkmen: tk
Bosnian: bs | Georgian: ka | Lithuanian: lt | Russian: ru | Ukrainian: uk
Bulgarian: bg | German: de | Macedonian: mk | Serbian, Cyrillic: sr_Cyrl | Urdu: ur
Catalan: ca | Greek: el | Malay: ms | Serbian, Latin: sr_Latn | Uzbek, Cyrillic: uz
Chinese, simplified: zh_CN | Gujarati: gu | Malayalam: ml | Sinhala: si | Uzbek, Latin: uz_latin
Chinese, traditional: zh_TW | Hebrew: he | Maltese: mt | Slovak: sk | Valencian: vc
Croatian: hr | Hungarian: hu | Marathi: mr | Slovenian: sl | Vietnamese: vn
Czech: cs | Icelandic: is | Nepali: ne | Spanish: es | unknown (i.e., none of the above languages): unknown

The language codes are case-insensitive.

Note that an error is returned if you specify an invalid language code.
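The case-insensitive matching and the error on an invalid code can be illustrated with a minimal client-side sketch. This is not the Dgraph's own validation logic: the function name validate_language_code is hypothetical, the set below is only a small subset of the codes listed above, and the handling of country locale codes (described later in this topic) is deliberately left out.

```python
# Illustrative subset of the listed language codes; not the full table.
SUPPORTED_CODES = {"en", "en_GB", "es", "fr", "fr_ca", "pt", "pt_BR", "zh_CN", "unknown"}

# Lowercased lookup set so that comparisons ignore case.
_SUPPORTED_LOWER = {code.lower() for code in SUPPORTED_CODES}


def validate_language_code(code: str) -> str:
    """Return the code in lowercase, or raise ValueError if it is not listed.

    Hypothetical helper: it mimics the documented behavior (case-insensitive
    codes, error on an invalid code) on the client side only.
    """
    normalized = code.strip().lower()
    if normalized not in _SUPPORTED_LOWER:
        raise ValueError(f"invalid language code: {code!r}")
    return normalized
```

With this sketch, "PT_br" and "pt_BR" are treated as the same code, while a code that is not in the set raises an error.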

With language codes, you can tell the Dgraph the language of the text in a record search or value search query, so that it can perform language-specific operations correctly.

How country locale codes are treated

A country locale code is a combination of a language code (such as es for Spanish) and a country code (such as MX for Mexico or AR for Argentina). Thus, the es_MX country locale means Mexican Spanish while es_AR is Argentinian Spanish.

If you specify a country locale code for a Language element, the software accepts the language code part and ignores the country code. In other words, a country locale code is mapped to its language code, and only that part is used for tokenizing queries or generating search indexes. For example, specifying es_AR is the same as specifying just es. The exceptions to this rule are the codes listed above (such as pt_BR), which are used as given.

Note, however, that if you create a Dgraph attribute and specify a country locale code in the Language field, the attribute will be tagged with the country locale code, even though the country code will be ignored during indexing and querying.
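The mapping described in this section can be sketched as follows. This is an illustration under stated assumptions, not the Dgraph implementation: the names LISTED_LOCALE_CODES, effective_language, and tag_and_effective are hypothetical, and the set holds only a few of the listed codes that contain a country or script part.

```python
# A few of the listed codes that are kept whole (illustrative subset, lowercased).
LISTED_LOCALE_CODES = {"en_gb", "fr_ca", "pt_br", "zh_cn", "zh_tw"}


def effective_language(code: str) -> str:
    """Return the code assumed to be used for indexing and querying."""
    normalized = code.lower()
    if normalized in LISTED_LOCALE_CODES:
        # Listed codes are the exception: they are not reduced.
        return normalized
    # Otherwise drop the country part and keep only the language part.
    language, _, _country = normalized.partition("_")
    return language


def tag_and_effective(code: str) -> tuple[str, str]:
    """Return (tag stored on the attribute, language actually used)."""
    # The attribute keeps the full code as its tag, even though only the
    # language part is used during indexing and querying.
    return code, effective_language(code)
```

For example, tag_and_effective("es_AR") yields the tag es_AR with the effective language es, whereas pt_BR is kept as a whole code.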

Language-specific dictionaries and Dgraph database

The Dgraph has two spelling correction engines:
  • If the Language property in an attribute is set to en, then spelling correction will be handled through the English spelling engine (and its English spelling dictionary).
  • If the Language property is set to any other value, then spelling correction will use the non-English spelling engine (and its language-specific dictionaries).
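The two-way choice above can be written as a small sketch. The function name spelling_engine and the returned labels are hypothetical and only mirror the rule as stated; the comparison is made case-insensitive on the assumption that the Language property follows the same case rule as language codes in general.

```python
def spelling_engine(language: str) -> str:
    """Pick the spelling correction engine for an attribute's Language value.

    Hypothetical helper: "en" selects the English engine, and any other
    value selects the non-English engine, as described in the list above.
    """
    return "english" if language.lower() == "en" else "non-english"
```

Under this reading, spelling_engine("en") selects the English engine and spelling_engine("fr") selects the non-English engine.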

All dictionaries are generated from the data records in the Dgraph, and therefore require that the attribute definitions be tagged with a language code.

A data set's dictionary files are stored in the Dgraph database directory for that data set.

Specifying a language for a data set

When creating a data set, you can specify the language for all attributes in that data set, as follows:
  • Studio: When uploading a file via the Data Set Creation Wizard, the Advanced Settings > Language field in the Preview page lets you select a language.
  • DP CLI: The defaultLanguage property in the edp.properties configuration file sets the language.

Note that you cannot set languages on a per-attribute basis.
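For the DP CLI option, the following sketch shows how the defaultLanguage setting might be read from the text of edp.properties. It assumes the file uses the usual key=value properties syntax with # or ! comment lines; the helper name read_default_language and the sample content are hypothetical, and the sketch makes no claim about what the DP CLI does when the property is absent.

```python
from typing import Optional


def read_default_language(properties_text: str) -> Optional[str]:
    """Return the defaultLanguage value from properties-style text, if present."""
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "defaultLanguage":
            return value.strip()
    return None


# Hypothetical excerpt of an edp.properties file.
sample = "# data set language\ndefaultLanguage = fr\n"
language = read_default_language(sample)
```

The returned value applies to every attribute in the data set, since languages cannot be set per attribute.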