CORPUS_MIN_RECS
Recommendation: Values less than 2 are generally not recommended, because they allow terms that are seen only once in the entire corpus. If clustering is used, this value MUST be set to at least 2. Note that this parameter works similarly to CORPUS_MIN_COVERAGE: terms that are seen in fewer than CORPUS_MIN_RECS records are discarded, as are terms that are seen in fewer than CORPUS_MIN_COVERAGE * (number of documents in the corpus) documents.
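The two minimum checks described above can be sketched as follows. This is only an illustration of the stated rule, not the term extractor's actual implementation; the function name, the sample terms, and the per-term record counts are hypothetical.

```python
def passes_minimums(doc_count, corpus_size, min_recs, min_coverage):
    """Return True if a term survives both minimum-frequency checks."""
    # Discard terms seen in fewer than CORPUS_MIN_RECS records.
    if doc_count < min_recs:
        return False
    # Discard terms seen in fewer than CORPUS_MIN_COVERAGE * corpus_size records.
    if doc_count < min_coverage * corpus_size:
        return False
    return True


# Hypothetical record counts per term, in a corpus of 100,000 records.
term_doc_counts = {"endeca": 1, "guided navigation": 3, "search": 40}

kept = [
    term
    for term, count in term_doc_counts.items()
    if passes_minimums(count, corpus_size=100_000, min_recs=2, min_coverage=0.00005)
]
print(kept)
```

With these sample values, "endeca" fails the CORPUS_MIN_RECS check (seen once), "guided navigation" fails the coverage check (3 is below 0.00005 * 100,000 = 5), and only "search" is kept.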
CORPUS_MAX_RECS
Recommendation: As a general rule of thumb, this pass-through does not have to be used. If your number of records can change, for example, through partial updates, Oracle recommends that you not use CORPUS_MAX_RECS, because the statistics will change with the changed number of records. In this case, you may want to use the CORPUS_MAX_COVERAGE pass-through instead.
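To illustrate why a fixed record count becomes a moving target when the corpus changes size, the sketch below converts a count into the fraction of the corpus it represents. It assumes that CORPUS_MAX_RECS is an absolute record-count cap, by analogy with CORPUS_MIN_RECS; the value 2000 and the corpus sizes are hypothetical.

```python
def fraction_of_corpus(max_recs, corpus_size):
    """Fraction of the corpus that an absolute record-count cap represents."""
    return max_recs / corpus_size


max_recs = 2000  # hypothetical CORPUS_MAX_RECS value

# The same absolute cap corresponds to very different coverage
# once partial updates have grown the corpus.
for corpus_size in (10_000, 50_000):
    share = fraction_of_corpus(max_recs, corpus_size)
    print(f"{corpus_size} records: cap equals {share:.0%} of the corpus")
```

A coverage-based setting such as CORPUS_MAX_COVERAGE stays at the same fraction regardless of how many records are added or removed.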
CORPUS_MIN_COVERAGE
Recommendation: The useful range is 0-1. A value of 0.00005 is a good compromise, because the term extractor will retain a term if it has been seen in at least one document out of 20,000.
This value will change with the nature of the data set. For example, a site whose data set has a lot of topical diversity (such as news) can reduce this value to allow terms with lower coverage (however, one document in one hundred thousand, or 0.00001, is probably the smallest reasonable value). If memory use is an issue, you should increase this value.
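As a worked example of the arithmetic, the sketch below converts a coverage value into the minimum number of documents a term must appear in for a given corpus size. The function name is hypothetical, and the coverage is passed as a string so that `fractions.Fraction` can represent it exactly and avoid floating-point rounding at the threshold.

```python
import math
from fractions import Fraction


def min_docs_required(coverage, corpus_size):
    """Smallest document count that meets coverage * corpus_size."""
    threshold = Fraction(coverage) * corpus_size
    # A term has to occur in at least one document to exist at all.
    return max(1, math.ceil(threshold))


print(min_docs_required("0.00005", 20_000))     # 1 document in 20,000
print(min_docs_required("0.00005", 1_000_000))  # same coverage, larger corpus
print(min_docs_required("0.00001", 1_000_000))  # lower coverage for diverse data
```

For a corpus of one million documents, lowering the coverage from 0.00005 to 0.00001 drops the requirement from 50 documents to 10, which admits more rare terms at the cost of more memory.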
CORPUS_MAX_COVERAGE
Recommendation: The useful range is 0-1. A value of 0.2 (which is 20% of the documents) is a good compromise. If a term is seen in more than one out of five documents, it is probably too broad to be useful. If terms that are tagged onto documents seem too generic, reduce this number. As with CORPUS_MIN_COVERAGE, reducing this number, even slightly, should free memory.
CORPUS_REGEX_KEEP
Recommendation: A useful regular expression for terms to keep is:
^\p{Alnum}[\p{Alnum}\.\-' ]+$
This retains terms that have at least two characters, start with an alphanumeric character, and include only alphanumerics, spaces, periods, dashes, and single quotes.
Note
Each term must both match CORPUS_REGEX_KEEP and not match CORPUS_REGEX_SKIP to be retained.
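The keep/skip rule can be tried out with a short sketch. Python's `re` module does not support `\p{Alnum}`, so the keep pattern below is an ASCII translation using `[A-Za-z0-9]`, which is an assumption about the intended character set. The skip pattern (terms made up only of digits and punctuation) is purely hypothetical, as is the function name.

```python
import re

# ASCII translation of ^\p{Alnum}[\p{Alnum}\.\-' ]+$ (assumption: ASCII alphanumerics).
KEEP = re.compile(r"^[A-Za-z0-9][A-Za-z0-9.\-' ]+$")

# Hypothetical skip pattern: terms consisting only of digits, periods, dashes, spaces.
SKIP = re.compile(r"^[0-9.\- ]+$")


def is_retained(term, keep=KEEP, skip=SKIP):
    """A term is retained only if it matches keep and does not match skip."""
    if keep.fullmatch(term) is None:
        return False
    if skip is not None and skip.fullmatch(term) is not None:
        return False
    return True


for term in ["o'reilly", "wi-fi", "new york", "a", "-abc", "c++", "3.14"]:
    print(f"{term!r}: {'kept' if is_retained(term) else 'discarded'}")
```

Here "a" is discarded because the keep pattern requires at least two characters, "-abc" because it does not start with an alphanumeric, "c++" because "+" is not an allowed character, and "3.14" because it matches the hypothetical skip pattern even though it matches the keep pattern.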
CORPUS_REGEX_SKIP
Recommendation: Use this pass-through only if you are certain of the format of the terms you want to discard.
CORPUS_MIN_INFO_GAIN and CORPUS_MAX_INFO_GAIN
Recommendation: Begin by setting CORPUS_MIN_INFO_GAIN to 0. Do not set CORPUS_MAX_INFO_GAIN initially. Tune the other term extraction pass-throughs. Then, run a data set (or a subset) with CORPUS_DEBUG set to true, which will print the list of terms that passed all the selection criteria. You can use this information to adjust the selection criteria, which may include adjusting the CORPUS_MIN_INFO_GAIN and using the CORPUS_MAX_INFO_GAIN pass-throughs.
If fewer generic terms are desired, increase the value of CORPUS_MIN_INFO_GAIN in small increments (0.5 or 1.0). If more generic terms are desired, decrease this value. CORPUS_DEBUG can be used to select a particular value of CORPUS_MIN_INFO_GAIN.
CORPUS_DEBUG
Recommendation: Set this pass-through to true only when you are tuning the filtering parameters; otherwise, do not use it.