Using custom dictionaries

You can add a custom dictionary to supplement the default dictionary for any supported language.

A custom dictionary may be necessary if searches for terms that you know exist in your data do not produce the expected results. The custom dictionary is a UTF-8 encoded, line-oriented, tab-delimited file. Each line in the file is an entry that supplements the primary dictionary.

The generic syntax for a line in the custom dictionary file is:
COMMAND value1 value2 ...
COMMAND must be either STEM (for dictionary terms) or COMPOUND (for decompounding). The values are tab delimited, and their meaning depends on the COMMAND.
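The line format described above can be sketched in Python as a small parser. This is only an illustration of the tab-delimited layout, not part of the product: the function name parse_line and the VALID_COMMANDS set are hypothetical helpers introduced here.

```python
# Hypothetical helper: the only commands the text names are STEM and COMPOUND.
VALID_COMMANDS = {"STEM", "COMPOUND"}


def parse_line(line):
    """Split one custom dictionary line into (command, values).

    Assumes the fields are separated by single tab characters,
    as the format description states.
    """
    fields = line.rstrip("\r\n").split("\t")
    command, values = fields[0], fields[1:]
    if command not in VALID_COMMANDS:
        raise ValueError(f"unknown command: {command!r}")
    return command, values


if __name__ == "__main__":
    print(parse_line("STEM\taalglatt\tN"))
    # prints ('STEM', ['aalglatt', 'N'])
```

A check like this can catch lines that were accidentally written with spaces instead of tabs, because the whole line then ends up in the command field and is rejected.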

Dictionary terms

One use of a custom dictionary is to add new dictionary terms. (A dictionary term is also called a lemma.) Once a term is added to the dictionary, all morphological rules will apply to it. For example, adding a new noun will allow its plural form to stem to the lemma.

The generic syntax for a STEM line in the custom dictionary file is:
STEM new_term POS [,POS2 ...]
Each line beginning with STEM defines a lemma entry with the following fields:
  • new_term: a simple text string, delimited by tabs, that represents a lemma.
  • POS: a valid part of speech from the list below. At least one part of speech is required. Multiple parts of speech are delimited by commas. Note that the parts of speech are case sensitive.
You can specify the part of speech attributes by their full name or abbreviation (in parentheses):
  • noun (N) - a simple noun, like table, book, procedure
  • nounProper (propN) - a proper name, for person, place, etc., typically capitalized, like Zachary, Supidito, Susquehanna
  • verb (V) - any verb in its dictionary form, like deconstruct, upsell, skate
  • adjective (Adj) - a modifier of nouns that can typically be compared (green, greener, greenest), like fast, trenchant, pendulous
  • adverb (Adv) - any general modifier of a sentence that may modify an adjective or verb or may stand alone, like slowly, yet, perhaps
  • preposition (Prep) - a word that forms a prepositional phrase with a noun, like off, beside, from. Used for postpositions too, in languages that have postpositions of similar function.
  • punct (Punct) - any non-letter symbol that is treated as a unit by itself, like %, $, ]
  • pronoun (Pro) - any pronominal form, including personal pronouns (I, they), demonstrative pronouns (those, this), relative pronouns (who, which, wherever)
  • interrog (Wh) - an interrogative word, like who, why, when, where, how
  • determiner (Det) - words that carry grammatical information about a noun group, for example definite/indefinite, like the, a, an
  • particle (Part) - small, invariant words that convey grammatical information; also used for interjections.
  • conjunction (Conj) - conjunctions that introduce a subordinate clause, e.g. although, because, while, and conjunctions that introduce a coordinate clause, e.g. and, or, yet
  • numCardinal (Card) - cardinal numbers, like thirteen, 100, five
  • numOrdinal (Ord) - ordinal numbers, like thirteenth, 100th, fifth
For example, this German custom dictionary shows three entries. Each entry is marked with the N attribute to indicate it is a noun:
STEM aalglatt N
STEM aalglatte N
STEM aalglatter N
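As a rough sketch, STEM lines like the ones above could be generated and checked with a short Python script before the file is deployed. The helper names stem_line and write_dictionary are hypothetical, and the script assumes that the comma-delimited parts of speech form a single tab-delimited field; the text does not state this explicitly.

```python
# Full names and abbreviations, copied from the part-of-speech list above.
POS_NAMES = {
    "noun": "N", "nounProper": "propN", "verb": "V", "adjective": "Adj",
    "adverb": "Adv", "preposition": "Prep", "punct": "Punct",
    "pronoun": "Pro", "interrog": "Wh", "determiner": "Det",
    "particle": "Part", "conjunction": "Conj", "numCardinal": "Card",
    "numOrdinal": "Ord",
}
VALID_POS = set(POS_NAMES) | set(POS_NAMES.values())


def stem_line(term, *pos):
    """Build one STEM line; the comparison is case sensitive on purpose."""
    if not pos:
        raise ValueError("at least one part of speech is required")
    bad = [p for p in pos if p not in VALID_POS]
    if bad:
        raise ValueError(f"invalid part of speech: {bad}")
    # Assumption: all parts of speech share one comma-delimited field.
    return "\t".join(["STEM", term, ",".join(pos)])


def write_dictionary(path, lines):
    """Write the entries as a UTF-8 encoded, line-oriented file."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for line in lines:
            f.write(line + "\n")


if __name__ == "__main__":
    entries = [stem_line(t, "N") for t in ("aalglatt", "aalglatte", "aalglatter")]
    print(entries)
```

Because the parts of speech are case sensitive, a call such as stem_line("aalglatt", "n") raises an error instead of silently writing an entry that would not be recognized.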

Decompounding

You can manually configure a custom dictionary to define components of compound words. This can be useful if existing language dictionaries do not align with the usage of the language in a region or market, or if existing libraries have not kept up with changes to the language. A record search query for any of the components in a compound word also returns the compound as a match.

For example, the German orthography reform of 1996 introduced a standard set of rules for compound words, but these rules are not always followed. For this and similar cases, you may want to explicitly configure dictionary entries that mark the divisions within compound words.

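The generic syntax for a COMPOUND line in the custom dictionary file is similar to the STEM syntax, including the POS attributes. A COMPOUND line gives the compound word, followed by its components separated by pipe characters (|), followed by the part of speech.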

For example, to decompound the German word "Binnenschiffahrt" (which refers to transport along inland rivers), you might add two entries: one that follows the spelling introduced by the German orthography reform of 1996 ("Binnenschifffahrt") and one that reflects the earlier spelling of the word:
COMPOUND Binnenschifffahrt Binnen|Schiff|Fahrt N
COMPOUND Binnenschiffahrt Binnen|Schiff|Fahrt N

Note that the component words of a compound word must each exist in the dictionary. For the example above, this means that the dictionary must include individual entries for "binnen", "schiff", and "fahrt".
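The requirement above can be checked ahead of time with a small script. The following is a minimal sketch under several assumptions: the function missing_components is hypothetical, the set of known lemmas must be supplied by you (the script cannot inspect the default dictionary), the COMPOUND fields are assumed to be tab delimited in the order shown in the example, and the lookup is assumed to be case insensitive.

```python
def missing_components(compound_line, known_lemmas):
    """Return the components of a COMPOUND line that are not known lemmas.

    Assumed field order: COMPOUND, compound word, components, POS.
    known_lemmas is a set of lowercase strings supplied by the caller.
    """
    fields = compound_line.rstrip("\r\n").split("\t")
    if len(fields) < 4 or fields[0] != "COMPOUND":
        raise ValueError("not a complete COMPOUND line")
    components = fields[2].split("|")
    # Assumption: matching ignores case ("Fahrt" matches "fahrt").
    return [c for c in components if c.lower() not in known_lemmas]


if __name__ == "__main__":
    line = "COMPOUND\tBinnenschifffahrt\tBinnen|Schiff|Fahrt\tN"
    print(missing_components(line, {"binnen", "schiff"}))
    # prints ['Fahrt']
```

Any component that the function reports would need its own dictionary entry, for example a STEM line, before the COMPOUND entry can take effect.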