You can add a custom dictionary to supplement the default
dictionary for any supported language.
The use of a custom dictionary may be necessary if searches for terms
that you know exist in your data are not producing the expected results. The
custom dictionary is a UTF-8 encoded file that is line oriented and tab
delimited. Each line in the file represents an entry to supplement the primary
dictionary.
The generic syntax for a line in the custom dictionary file is:
COMMAND value1 value2 ...
The COMMAND should be set only to STEM (for dictionary terms) or
COMPOUND (for decompounding). Each value is tab delimited and depends on the
COMMAND.
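The line format described above can be checked with a small script before the file is deployed. The following is a minimal sketch, not part of the product: the function name parse_entry is hypothetical, and it assumes only what is stated above (tab-delimited fields, with the first field being STEM or COMPOUND).

```python
# Hypothetical helper: split one custom dictionary line into its fields.
# Assumes the documented layout: COMMAND <tab> value1 <tab> value2 ...
VALID_COMMANDS = {"STEM", "COMPOUND"}


def parse_entry(line):
    """Return (command, values) for one line of a custom dictionary file."""
    fields = line.rstrip("\r\n").split("\t")
    command, values = fields[0], fields[1:]
    if command not in VALID_COMMANDS:
        raise ValueError(f"unknown command: {command!r}")
    if not values:
        raise ValueError("entry has no values")
    return command, values
```

Calling parse_entry on a line that uses spaces instead of tabs raises ValueError, which makes a common editing mistake easy to catch.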
Dictionary terms
One use of a custom dictionary is to add new dictionary terms. (A
dictionary term is also called a lemma.) Once a term is added to the
dictionary, all morphological rules will apply to it. For example, adding a new
noun will allow its plural form to stem to the lemma.
The generic syntax for a STEM line in the custom dictionary file is:
STEM new_term POS [,POS2 ...]
Each line beginning with STEM represents a lemma entry that includes:
- new_term - a tab-delimited simple text string that represents a lemma.
- POS - a valid part of speech listed below. At least one part of speech is
required. Multiple parts of speech are delimited by a comma. Note that the
parts of speech are case sensitive.
You can specify the part of speech attributes by their full name or
abbreviation (in parentheses):
- noun (N) - a simple noun, like table, book, procedure
- nounProper (propN) - a proper name, for person, place, etc., typically
capitalized, like Zachary, Supidito, Susquehanna
- verb (V) - any verb in its dictionary form, like deconstruct, upsell, skate
- adjective (Adj) - modifiers of nouns, typically can be compared (green,
greener, greenest), like fast, trenchant, pendulous
- adverb (Adv) - any general modifier of a sentence that may modify an
adjective or verb or may stand alone, like slowly, yet, perhaps
- preposition (Prep) - a word that forms a prepositional phrase with a noun,
like off, beside, from. Used for postpositions too, in languages that have
postpositions of similar function.
- punct (Punct) - any non-letter symbol that is treated as a unit by itself,
like %, $, ]
- pronoun (Pro) - any pronominal form, including personal pronouns (I, they),
demonstrative pronouns (those, this), relative pronouns (who, which, wherever)
- interrog (Wh) - an interrogative word, like who, why, when, where, how
- determiner (Det) - words that carry grammatical information about a noun
group, for example definite/indefinite, like the, a, an
- particle (Part) - small, invariant words that convey grammatical
information; also used for interjections.
- conjunction (Conj) - conjunctions that introduce a subordinate clause, e.g.
although, because, while, and conjunctions that introduce a coordinate clause,
e.g. and, or, yet
- numCardinal (Card) - cardinal numbers, like thirteen, 100, five
- numOrdinal (Ord) - ordinal numbers, like thirteenth, 100th, fifth
For example, this German custom dictionary shows three entries. Each
entry is marked with the N attribute to indicate it is a noun:
STEM aalglatt N
STEM aalglatte N
STEM aalglatter N
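When many terms need to be added, it can help to generate STEM lines with a script that rejects unknown or miscased part of speech values. The following is a minimal sketch under stated assumptions: the names stem_line and write_dictionary are hypothetical, the part of speech table is copied from the list above, and multiple parts of speech are joined with a comma and no surrounding whitespace, which the syntax above does not spell out.

```python
# Part of speech names and abbreviations, copied from the list above.
POS_ABBREVIATIONS = {
    "noun": "N", "nounProper": "propN", "verb": "V", "adjective": "Adj",
    "adverb": "Adv", "preposition": "Prep", "punct": "Punct",
    "pronoun": "Pro", "interrog": "Wh", "determiner": "Det",
    "particle": "Part", "conjunction": "Conj", "numCardinal": "Card",
    "numOrdinal": "Ord",
}
VALID_POS = set(POS_ABBREVIATIONS) | set(POS_ABBREVIATIONS.values())


def stem_line(term, *pos):
    """Build one tab-delimited STEM line (hypothetical helper)."""
    if not term or "\t" in term:
        raise ValueError("term must be non-empty and contain no tabs")
    if not pos:
        raise ValueError("at least one part of speech is required")
    for p in pos:
        # Membership test is case sensitive, matching the rule above.
        if p not in VALID_POS:
            raise ValueError(f"invalid part of speech: {p!r}")
    # Assumption: multiple parts of speech are joined as "V,N".
    return "\t".join(["STEM", term, ",".join(pos)])


def write_dictionary(path, lines):
    """Write entries one per line as a UTF-8 encoded file."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for line in lines:
            f.write(line + "\n")
```

For instance, stem_line("upsell", "V", "N") produces a line with three tab-separated fields, while stem_line("upsell", "v") raises ValueError because the abbreviation is miscased.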
Decompounding
You can manually configure a custom dictionary to define components of
compound words. This can be useful if existing language dictionaries do not
align with the usage of the language in a region or market, or if existing
libraries have not kept up with changes to the language. A record search query
for any of the components in a compound word also returns the compound as a
match.
For example, the German orthography reform of 1996 introduced a
standard set of rules for compound words, but these rules are not always
followed. For this and similar cases, you may wish to explicitly configure
dictionary entries that mark the divisions within compound words.
The generic syntax for a COMPOUND line in the custom dictionary file is
similar to the STEM syntax, including the POS attributes.
For example, you may wish to decompound the German word
"Binnenschiffahrt" (which refers to transport along inland rivers). You might
wish to add two versions: one that adheres to the German orthography reform
standards of 1996 and one that reflects the earlier spelling of the word:
COMPOUND Binnenschifffahrt Binnen|Schiff|Fahrt N
COMPOUND Binnenschiffahrt Binnen|Schiff|Fahrt N
Note that the component words of a compound word must each exist in
the dictionary. For the example above, this means that the dictionary must
include individual entries for "binnen", "schiff", and "fahrt".
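The component requirement can be checked before a COMPOUND entry is added. The following is a minimal sketch under stated assumptions: the names compound_line and missing_components are hypothetical, the field layout (compound, components joined by "|", then the part of speech) is inferred from the example above rather than from a formal syntax, and components are compared to known terms without regard to case because the example lists them in lowercase.

```python
def compound_line(word, components, pos):
    """Build one tab-delimited COMPOUND line (layout inferred from the example)."""
    if len(components) < 2:
        raise ValueError("a compound needs at least two components")
    return "\t".join(["COMPOUND", word, "|".join(components), pos])


def missing_components(components, known_terms):
    """Return the components that have no entry among known_terms.

    Assumption: matching ignores case, so "Schiff" matches "schiff".
    known_terms is whatever list of dictionary terms you have available.
    """
    known = {term.casefold() for term in known_terms}
    return [c for c in components if c.casefold() not in known]
```

If missing_components returns a non-empty list, each listed word would first need its own dictionary entry, for example a STEM line, before the COMPOUND entry can take effect.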