RECORD_FRACT_OF_MEDIAN sets a scoring threshold for record-level term filtering.

CAS uses a scoring method in which terms that are more frequent in a given document than across the corpus are considered more relevant and are therefore retained. Filtering occurs on a document-by-document basis; in other words, each term is considered for inclusion or exclusion separately for each document in which it occurs.

The distribution of term scores on a single record typically has very few high-scoring terms, followed by a long, gently sloped plateau of marginally informative terms, and ends with a sudden drop-off to a few uninformative terms.

The RECORD_FRACT_OF_MEDIAN value lets you set a scoring threshold for the plateau; only terms that score above this threshold are kept. RECORD_FRACT_OF_MEDIAN should be set to a value that expresses the threshold as a fraction of the median score for terms on the document.

The recommended threshold is 1.1 (i.e., 10% higher than the median), which keeps only the highly informative terms. Higher values tend to increase precision (the terms that are kept are more likely to be relevant) but decrease recall (relevant terms are more likely to be dropped). The default value of this threshold is 0.0, which allows all terms through.
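
The effect of the threshold can be illustrated with a minimal Python sketch. This is not CAS code: the function name filter_terms, the sample terms, and their scores are hypothetical, and the sketch assumes that scores are positive and that a term is kept only when its score is strictly above the threshold.

```python
from statistics import median

def filter_terms(term_scores, fract_of_median=0.0):
    """Keep terms scoring strictly above fract_of_median * median score.

    Hypothetical helper for illustration only; the strict comparison
    and positive scores are assumptions, not documented CAS behavior.
    """
    if not term_scores:
        return {}
    # The threshold is expressed as a fraction of the per-document median.
    threshold = fract_of_median * median(term_scores.values())
    return {term: score for term, score in term_scores.items()
            if score > threshold}

# Made-up scores for the terms of one document (median is 3.0).
scores = {"alpha": 9.0, "beta": 4.0, "gamma": 3.0, "delta": 2.0, "epsilon": 0.5}

kept_default = filter_terms(scores, 0.0)      # threshold 0.0: every term passes
kept_recommended = filter_terms(scores, 1.1)  # threshold of about 3.3

print(sorted(kept_default))
print(sorted(kept_recommended))
```

With these sample scores, the 0.0 setting keeps all five terms, while 1.1 keeps only "alpha" and "beta", the two terms that score more than 10% above the median.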

