A noun phrase consists of a noun (or a sequence of nouns) with any associated modifiers. The modifiers are limited to adjectives, adjective phrases, or nouns used as adjectives.
Programmatically, each word in a noun phrase is called a token. An extracted noun phrase can have a maximum of 5 tokens; each token is limited to 200 characters. Therefore, a valid noun phrase has a maximum size of 1,000 characters.
A sentence in a document can contain at most 1,000 tokens (words). If the term extractor cannot determine the sentence boundaries of text in the document, it splits the text into blocks of 1,000 tokens and then performs extraction on each block. It also writes the following WARN message to the term extraction log:
Sentence boundaries could not be found for text beginning with tokens t1 t2 t3 t4 t5
where t1 through t5 are the first 5 tokens of the problematic text.
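The fallback described above can be sketched as follows. This is a minimal illustration, not the term extractor's actual implementation: the function name `split_into_blocks`, the logger name, and the use of Python's `logging` module are assumptions made for the example. Only the 1,000-token block size and the wording of the WARN message come from the text above.

```python
import logging
from typing import List

# Hypothetical logger name; the real term extraction log is configured elsewhere.
logger = logging.getLogger("term_extraction")

MAX_SENTENCE_TOKENS = 1000  # maximum sentence size stated in this section


def split_into_blocks(tokens: List[str],
                      block_size: int = MAX_SENTENCE_TOKENS) -> List[List[str]]:
    """Split text with no detectable sentence boundaries into fixed-size blocks.

    Logs a WARN message that names the first 5 tokens of the text, then
    returns consecutive blocks of at most `block_size` tokens.
    """
    if tokens:
        logger.warning(
            "Sentence boundaries could not be found for text "
            "beginning with tokens %s",
            " ".join(tokens[:5]),
        )
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]
```

Under these assumptions, a run of 2,500 tokens yields three blocks of 1,000, 1,000, and 500 tokens, and extraction would then run on each block separately.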
If you are using OLT noun grouping, the rules in the list below are applied only when the noun group is longer than 1,000 characters. A noun group of 1,000 characters or fewer is taken as is, without any modification.
The term extractor treats invalid noun phrases as follows:
If a noun phrase has more than 5 tokens, only the last 5 tokens are retained. For example, with a 7-token noun phrase, the first 2 tokens are completely ignored by the term extractor and the last 5 tokens are retained.
If a noun phrase includes an overly long token (more than 200 characters), that token is ignored, and the tokens before and after it are treated as separate noun phrases. For example, assume a 5-token noun phrase in which Token 3 is overly long and the others are valid. In this case, Token 3 is ignored and the term extractor returns 2 noun phrases: the first consists of Tokens 1 and 2, and the second consists of Tokens 4 and 5.
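The two rules above can be sketched in a few lines. This is an illustration under stated assumptions rather than the term extractor's own code: the function name `normalize_noun_phrase` is hypothetical, and the order in which the rules are applied (truncate to the last 5 tokens first, then split around overly long tokens) is an assumption, because the text does not say how the rules combine. The OLT noun grouping condition is not modeled.

```python
from typing import List

MAX_TOKENS = 5         # maximum tokens in an extracted noun phrase
MAX_TOKEN_CHARS = 200  # maximum characters in a single token


def normalize_noun_phrase(tokens: List[str]) -> List[List[str]]:
    """Return the valid noun phrases derived from one candidate phrase.

    Assumed order: keep only the last 5 tokens, then drop any overly long
    token and split the phrase at that position.
    """
    # Rule 1: only the last 5 tokens are retained.
    kept = tokens[-MAX_TOKENS:]

    # Rule 2: an overly long token is ignored, and the tokens on either
    # side of it become separate noun phrases.
    phrases: List[List[str]] = []
    current: List[str] = []
    for token in kept:
        if len(token) > MAX_TOKEN_CHARS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(token)
    if current:
        phrases.append(current)
    return phrases
```

With this sketch, a 7-token phrase comes back as a single phrase made of its last 5 tokens, and a 5-token phrase whose third token exceeds 200 characters comes back as two phrases: Tokens 1 and 2, and Tokens 4 and 5.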