CORPUS_MIN_RECS

Recommendation: Values less than 2 are generally not recommended, because they admit terms that are seen only once in the entire corpus. If clustering is used, this value MUST be set to at least 2. Note that this parameter works similarly to CORPUS_MIN_COVERAGE: terms that are seen in fewer than CORPUS_MIN_RECS records are discarded, as are terms that are seen in fewer than CORPUS_MIN_COVERAGE * (number of documents in the corpus) documents.
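The way the two minimum thresholds combine can be sketched as follows. This is a minimal illustration in Python, not the term extractor's actual implementation: the function name passes_min_thresholds and the sample counts are hypothetical, and the sketch assumes that both thresholds are compared against the number of records in which a term appears.

```python
def passes_min_thresholds(doc_freq, num_docs, min_recs=2, min_coverage=0.00005):
    """Return True if a term survives both minimum filters (illustrative only).

    doc_freq: number of records in which the term was seen (assumed meaning).
    num_docs: number of documents in the corpus.
    """
    # Discard terms seen in fewer than CORPUS_MIN_RECS records.
    if doc_freq < min_recs:
        return False
    # Discard terms seen in fewer than CORPUS_MIN_COVERAGE * num_docs records.
    if doc_freq < min_coverage * num_docs:
        return False
    return True


# Hypothetical corpus of 100,000 documents: the coverage floor is about 5 records.
print(passes_min_thresholds(1, 100_000))   # False: seen only once
print(passes_min_thresholds(4, 100_000))   # False: below the coverage floor
print(passes_min_thresholds(10, 100_000))  # True: clears both thresholds
```

On a small corpus the coverage floor drops below 2 records, so CORPUS_MIN_RECS becomes the binding constraint; on a large corpus the coverage floor dominates.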

CORPUS_MAX_RECS

Recommendation: As a general rule of thumb, you do not need to use this pass-through. If your number of records can change (for example, through partial updates), Oracle recommends that you not use CORPUS_MAX_RECS, because a fixed record count represents a different share of the corpus as the number of records changes. In this case, you may want to use the CORPUS_MAX_COVERAGE pass-through instead.

CORPUS_MIN_COVERAGE

Recommendation: The useful range is 0-1. A value of 0.00005 is a good compromise, because the term extractor will retain a term if it has been seen in at least one document out of 20,000.

The best value depends on the nature of the data set. For example, a site whose data set has a lot of topical diversity (such as news) can reduce this value and allow terms with lower coverage (however, one document out of every hundred thousand, or 0.00001, is probably the smallest reasonable value). If memory use is an issue, you should increase this value.
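To see what a given coverage value means for a corpus of a particular size, the fraction can be converted into a document count. The helper below is a hypothetical sketch (the name min_docs_for_coverage is not part of the product); it uses exact fractions so that rounding errors in floating-point arithmetic do not shift the result.

```python
import math
from fractions import Fraction


def min_docs_for_coverage(coverage, num_docs):
    """Smallest whole number of documents that reaches the coverage fraction.

    coverage is passed as a decimal string (for example "0.00005") so that
    Fraction can represent it exactly.
    """
    return math.ceil(Fraction(coverage) * num_docs)


# 0.00005 is one document in 20,000.
print(min_docs_for_coverage("0.00005", 20_000))     # 1
print(min_docs_for_coverage("0.00005", 1_000_000))  # 50
# The suggested lower bound of 0.00001 is one document in 100,000.
print(min_docs_for_coverage("0.00001", 1_000_000))  # 10
```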

CORPUS_MAX_COVERAGE

Recommendation: The useful range is 0-1. A value of 0.2 (which is 20% of the documents) is a good compromise. If a term is seen in more than one out of five documents, it is probably too broad to be useful. If the terms that are tagged onto documents seem too generic, lower this number. As with CORPUS_MIN_COVERAGE, lowering this number, even slightly, should free memory.
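The effect of lowering the maximum coverage can be illustrated with a small sketch. The function drop_broad_terms and the term counts below are hypothetical and only mimic the behavior described above; they are not the term extractor's own code or data.

```python
def drop_broad_terms(doc_freqs, num_docs, max_coverage=0.2):
    """Keep only terms seen in at most max_coverage of the documents."""
    return {
        term: df
        for term, df in doc_freqs.items()
        if df / num_docs <= max_coverage
    }


# Hypothetical document frequencies in a corpus of 1,000 documents.
doc_freqs = {"the": 950, "product": 310, "espresso": 42, "grinder": 17}

# At 0.2, terms seen in more than 200 documents are dropped.
print(sorted(drop_broad_terms(doc_freqs, 1_000)))
# ['espresso', 'grinder']

# Lowering the value to 0.03 also drops "espresso" (42 / 1,000 = 4.2%).
print(sorted(drop_broad_terms(doc_freqs, 1_000, max_coverage=0.03)))
# ['grinder']
```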

CORPUS_REGEX_KEEP

Recommendation: A useful regular expression for terms to keep is:

^\p{Alnum}[\p{Alnum}\.\-' ]+$

This retains terms that have at least two characters, start with an alphanumeric character, and include only alphanumerics, spaces, periods, dashes, and single quotes.
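Before deploying a keep expression, it can help to try it against a few sample terms. The sketch below uses Python for illustration. Python's re module does not support the \p{Alnum} class, so the pattern is rewritten with the ASCII class [A-Za-z0-9]; this substitution is an assumption, and the term extractor's own regular expression engine may treat \p{Alnum} differently (for example, for non-ASCII letters). The sample terms are hypothetical.

```python
import re

# ASCII approximation of ^\p{Alnum}[\p{Alnum}\.\-' ]+$ (assumed equivalent).
KEEP = re.compile(r"^[A-Za-z0-9][A-Za-z0-9.\-' ]+$")


def is_kept(term):
    """Return True if the term matches the keep expression."""
    return KEEP.match(term) is not None


for term in ["e-mail", "O'Brien", "new york", "U.S.", "a", "-dash", "100%"]:
    print(f"{term!r}: {is_kept(term)}")
```

In this sketch the first four terms are kept. "a" is rejected because it has only one character, "-dash" because it does not start with an alphanumeric character, and "100%" because it contains a character outside the allowed set.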

CORPUS_REGEX_SKIP

Recommendation: Use this pass-through only if you are certain of the format of the terms you want to discard.

CORPUS_MIN_INFO_GAIN and CORPUS_MAX_INFO_GAIN

Recommendation: Begin by setting CORPUS_MIN_INFO_GAIN to 0. Do not set CORPUS_MAX_INFO_GAIN initially. Tune the other term extraction pass-throughs first. Then, run a data set (or a subset) with CORPUS_DEBUG set to true, which prints the list of terms that passed all the selection criteria. You can use this information to adjust the selection criteria, which may include adjusting CORPUS_MIN_INFO_GAIN and adding the CORPUS_MAX_INFO_GAIN pass-through.

If fewer generic terms are desired, increase the value of CORPUS_MIN_INFO_GAIN in small increments (0.5 or 1.0). If more generic terms are desired, decrease this value. Use the CORPUS_DEBUG output to settle on a particular value of CORPUS_MIN_INFO_GAIN.

CORPUS_DEBUG

Recommendation: Set this pass-through to true only when you are tuning the filtering parameters; otherwise, do not use it.

