A noun phrase consists of a noun (or a sequence of nouns) with any associated modifiers. The modifiers are limited to adjectives, adjective phrases, or nouns used as adjectives.

Programmatically, each word in a noun phrase is called a token. An extracted noun phrase can have a maximum of 5 tokens; each token is limited to 200 characters. Therefore, a valid noun phrase has a maximum size of 1,000 characters.
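The limits above can be illustrated with a minimal sketch. It assumes tokens are separated by whitespace, which the term extractor's own tokenizer may not do; the function and constant names are hypothetical and are not part of the product.

```python
# Illustrative check of the noun phrase limits described above.
# Assumption: tokens are whitespace-separated; the real tokenizer may differ.
MAX_TOKENS = 5          # maximum tokens in an extracted noun phrase
MAX_TOKEN_CHARS = 200   # maximum characters in a single token


def is_valid_noun_phrase(phrase: str) -> bool:
    """Return True if the phrase fits within the token and character limits."""
    tokens = phrase.split()
    if not tokens or len(tokens) > MAX_TOKENS:
        return False
    return all(len(token) <= MAX_TOKEN_CHARS for token in tokens)
```

Because a phrase has at most 5 tokens of at most 200 characters each, the token characters in any phrase that passes this check total no more than 1,000.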

The maximum size of a sentence in a document is 1,000 tokens (words). If the term extractor cannot determine the sentence boundaries of text in the document, it splits the text into blocks of 1,000 tokens and then performs text extraction on each block. In addition, the following WARN message is written to the term extraction log:

Sentence boundaries could not be found for text beginning
with tokens t1 t2 t3 t4 t5

where t1 through t5 are the first 5 tokens of the problematic text.
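The fallback behavior can be sketched as follows. This is an illustration only, assuming the text is already available as a list of tokens; the helper names are hypothetical and do not reflect the term extractor's internal implementation.

```python
# Illustrative sketch of the 1,000-token block fallback described above.
# Assumption: the text has already been tokenized into a list of strings.
from typing import Iterator, List

BLOCK_SIZE = 1000  # maximum sentence size, in tokens


def split_into_blocks(tokens: List[str], block_size: int = BLOCK_SIZE) -> Iterator[List[str]]:
    """Yield consecutive blocks of at most block_size tokens."""
    for start in range(0, len(tokens), block_size):
        yield tokens[start:start + block_size]


def boundary_warning(tokens: List[str]) -> str:
    """Build the WARN text, quoting the first 5 tokens of the problematic text."""
    return ("Sentence boundaries could not be found for text beginning "
            "with tokens " + " ".join(tokens[:5]))
```

For example, text of 2,500 tokens with no detectable sentence boundaries would be processed as three blocks of 1,000, 1,000, and 500 tokens.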

If you are using OLT noun grouping, the rules in the list below are applied only when the noun group is longer than 1,000 characters. A noun group of 1,000 characters or fewer is taken as is, without any modification.

The term extractor treats invalid noun phrases as follows:

