Oracle Commerce Guided Search - Maximum size of extracted terms

A noun phrase consists of a noun (or a sequence of nouns) with any associated modifiers. The modifiers are limited to adjectives, adjective phrases, or nouns used as adjectives.

Programmatically, each word in a noun phrase is called a token. An extracted noun phrase can have a maximum of 5 tokens; each token is limited to 200 characters. Therefore, a valid noun phrase has a maximum size of 1,000 characters.
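The limits above can be expressed as a small validity check. This is only an illustrative sketch, not part of the term extractor: the constant and function names are hypothetical, and it assumes the 1,000-character figure counts token characters only (5 x 200), not any separators between tokens.

```python
# Hypothetical names; the limits themselves come from the text above.
MAX_TOKENS = 5          # maximum tokens in an extracted noun phrase
MAX_TOKEN_CHARS = 200   # maximum characters in a single token
MAX_PHRASE_CHARS = MAX_TOKENS * MAX_TOKEN_CHARS  # 5 * 200 = 1000


def is_valid_noun_phrase(tokens):
    """Return True if the token list fits within both size limits."""
    if not 0 < len(tokens) <= MAX_TOKENS:
        return False
    return all(len(token) <= MAX_TOKEN_CHARS for token in tokens)
```

For example, is_valid_noun_phrase(["large", "format", "digital", "camera"]) is True, while a 6-token phrase or a phrase containing a 201-character token is not valid.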

The maximum size of a sentence in a document is 1,000 tokens (words). If the term extractor cannot determine the sentence boundaries of text in the document, it splits the text into blocks of 1,000 tokens and then performs text extraction on each block. In addition, it writes the following WARN message to the term extraction log:

Sentence boundaries could not be found for text beginning
with tokens t1 t2 t3 t4 t5

where t1 through t5 are the first 5 tokens of the problematic text.
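The fallback behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the term extractor's actual code: the function names are hypothetical, the input is assumed to be an already tokenized list, and the warning text is reproduced on a single line because the line break in the message shown above may be a display wrap.

```python
MAX_SENTENCE_TOKENS = 1000  # sentence and block size limit from the text above


def split_into_blocks(tokens, block_size=MAX_SENTENCE_TOKENS):
    """Split a token list into consecutive blocks of at most block_size tokens."""
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]


def boundary_warning(tokens):
    """Build the WARN text, naming the first 5 tokens of the problematic text."""
    return ("Sentence boundaries could not be found for text beginning "
            "with tokens " + " ".join(tokens[:5]))
```

With 2,500 tokens and no detectable sentence boundaries, split_into_blocks returns three blocks of 1,000, 1,000, and 500 tokens, and each block would then be processed separately.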

If you are using OLT noun grouping, the rules in the list below are applied only when the noun group size is more than 1,000 characters. If the length of the noun group is less than or equal to 1,000 characters, the noun group is taken as is, without any modification.

The term extractor treats invalid noun phrases as follows:

If a noun phrase has more than 5 tokens, only the last 5 tokens are retained. For example, with a 7-token noun phrase, the term extractor ignores the first 2 tokens and retains the last 5.
Tokens that are longer than 200 characters are ignored.
If a noun phrase includes an overly long token, that token is ignored, and the tokens before it and the tokens after it are treated as separate noun phrases. For example, assume a 5-token noun phrase in which Token 3 is overly long and the others are valid. In this case, Token 3 is ignored and the term extractor returns 2 noun phrases: the first consists of Tokens 1 and 2, and the second consists of Tokens 4 and 5.
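The three rules above can be sketched in a few lines. This is a hedged illustration, not the term extractor's implementation: the function and constant names are hypothetical, and the order of operations is an assumption, since the text does not say which rule applies first. The sketch splits on overly long tokens first and then keeps the last 5 tokens of each resulting phrase.

```python
# Hypothetical names; the limits come from the rules above.
MAX_TOKENS = 5          # maximum tokens per noun phrase
MAX_TOKEN_CHARS = 200   # maximum characters per token


def normalize_noun_phrase(tokens):
    """Return the noun phrases (lists of tokens) that remain after the rules."""
    phrases = []
    current = []
    for token in tokens:
        if len(token) > MAX_TOKEN_CHARS:
            # Overly long token: ignore it and close the phrase collected so far.
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(token)
    if current:
        phrases.append(current)
    # Assumed order: truncation to the last MAX_TOKENS tokens happens after splitting.
    return [phrase[-MAX_TOKENS:] for phrase in phrases]
```

Applied to the examples in the rules, a 7-token phrase yields one phrase made of its last 5 tokens, and a 5-token phrase whose third token is overly long yields two phrases: Tokens 1 and 2, and Tokens 4 and 5.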
