Asian Language Dictionaries

The Asian language dictionaries directory contains additional dictionary word lists that are used for part-of-speech (POS) tagging and similar phrase searching. By adding words to the user dictionary files with special parameters, you can override the default segmentation.

Note: User dictionary files are available only for Chinese and Japanese languages, specifically Mandarin (user.dict_CN.utf8), Cantonese (user.dict_HK.utf8), Taiwanese (user.dict_TW.utf8), and Japanese (user.dict_JP.utf8).

You can create user dictionaries for words specific to an industry or application by adding new words, personal names, and transliterated characters of other alphabets. In addition, you can specify how existing words are segmented. For example, you may want to prevent a product name from being segmented even if it is a compound. The system performs a lookup of more than 500,000 words to determine segmentation. Using the dictionary, alias list, and keywords, you can influence how words are segmented.

Each entry in a user dictionary file must follow a specific format: the word you want to add, followed by a user dictionary part-of-speech tag (listed below) and an optional decomposition pattern (DecompPattern). The DecompPattern is a comma-delimited list of numbers, each specifying the number of characters from the word to include in the corresponding component of the string. (Use a zero (0) to indicate that a DecompPattern is not needed.)

For example, the user dictionary entry AABBCC ORGANIZATION 2,2,2 indicates that AABBCC should be decomposed into three two-character components.
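
To illustrate how a decomposition pattern maps onto a word, the following is a minimal Python sketch, not part of the product. The function names parse_entry and decompose are hypothetical, and the sketch assumes that fields are separated by whitespace and that the numbers in the pattern add up to the length of the word; neither assumption is confirmed by the file format description above.

```python
from __future__ import annotations


def decompose(word: str, pattern: str) -> list[str]:
    """Split word into components whose lengths come from a comma-delimited pattern."""
    if pattern.strip() == "0":
        # A zero means no decomposition: the word stays whole.
        return [word]
    lengths = [int(n) for n in pattern.split(",")]
    # Assumption: the component lengths must cover the word exactly.
    if sum(lengths) != len(word):
        raise ValueError(f"pattern {pattern!r} does not cover {word!r}")
    parts, start = [], 0
    for n in lengths:
        parts.append(word[start:start + n])
        start += n
    return parts


def parse_entry(line: str) -> tuple[str, str, list[str]]:
    """Parse one entry: word, POS tag, and optional DecompPattern."""
    fields = line.split()  # assumption: whitespace-separated fields
    if len(fields) not in (2, 3):
        raise ValueError(f"expected 2 or 3 fields, got {len(fields)}")
    word, tag = fields[0], fields[1].upper()  # tags are case insensitive
    pattern = fields[2] if len(fields) == 3 else "0"
    return word, tag, decompose(word, pattern)


print(parse_entry("AABBCC ORGANIZATION 2,2,2"))
```

Under these assumptions, the entry AABBCC ORGANIZATION 2,2,2 yields the components AA, BB, and CC, while a pattern of 0 leaves the word unsegmented.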

User Dictionary POS Tags for Mandarin, Cantonese, and Taiwanese (case insensitive)

  • NOUN
  • PROPER_NOUN
  • PLACE
  • PERSON
  • ORGANIZATION
  • FOREIGN_PERSON

User Dictionary POS Tags for Japanese (case insensitive)

  • NOUN
  • PROPER_NOUN
  • PLACE
  • PERSON
  • ORGANIZATION
  • GIVEN_NAME
  • SURNAME
  • FOREIGN_PLACE_NAME
  • FOREIGN_GIVEN_NAME
  • FOREIGN_SURNAME

Note: Oracle Cloud Operations must run the Keywordindexer utility before your changes to the word list files take effect. To schedule this, submit a service request.
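
Because the Chinese and Japanese files accept different tag sets, it may help to check tags before you request reindexing. The sketch below is a hypothetical pre-check in Python, not an Oracle utility. It uses only the file names and tags listed in this section; the function name is_valid_tag is an assumption made for illustration.

```python
# Tags accepted in the Mandarin, Cantonese, and Taiwanese user dictionaries.
CHINESE_TAGS = {
    "NOUN", "PROPER_NOUN", "PLACE", "PERSON", "ORGANIZATION", "FOREIGN_PERSON",
}

# Tags accepted in the Japanese user dictionary.
JAPANESE_TAGS = {
    "NOUN", "PROPER_NOUN", "PLACE", "PERSON", "ORGANIZATION",
    "GIVEN_NAME", "SURNAME", "FOREIGN_PLACE_NAME",
    "FOREIGN_GIVEN_NAME", "FOREIGN_SURNAME",
}

# Map each user dictionary file to the tag set that applies to it.
TAGS_BY_FILE = {
    "user.dict_CN.utf8": CHINESE_TAGS,
    "user.dict_HK.utf8": CHINESE_TAGS,
    "user.dict_TW.utf8": CHINESE_TAGS,
    "user.dict_JP.utf8": JAPANESE_TAGS,
}


def is_valid_tag(filename: str, tag: str) -> bool:
    """Return True if tag is listed for the given user dictionary file."""
    # Tags are case insensitive, so compare in upper case.
    return tag.upper() in TAGS_BY_FILE[filename]


print(is_valid_tag("user.dict_JP.utf8", "surname"))
print(is_valid_tag("user.dict_CN.utf8", "SURNAME"))
```

In this sketch, SURNAME is accepted for the Japanese file but rejected for the Mandarin file, which reflects the two tag lists above.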