The RECORD_NTERMS pass-through sets the maximum number of terms that can be tagged on a record.
You can use the RECORD_NTERMS pass-through to implement one of two strategies for limiting the number of terms that are tagged on records: set an absolute upper limit, or establish a cut-off window.
You cannot combine the two strategies. In both strategies, CAS determines which terms have the highest relevance for each record. Note that this pass-through is recommended mainly for collections that contain large documents.
To set an absolute upper limit, use the RECORD_NTERMS pass-through with only one integer value. Use this version of the pass-through when you are certain about the number of terms you want per record and can therefore set a hard limit. In this example, RECORD_NTERMS is set to a value of 8:
With this setting, CAS determines the eight most relevant terms for the record and tags the record with them.
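The effect of a hard limit can be illustrated with a minimal Python sketch. This is not CAS code: the function name `top_terms`, the `scored_terms` mapping, and the score values are hypothetical, and the sketch only assumes that each extracted term carries a numeric relevance score.

```python
import heapq


def top_terms(scored_terms, limit):
    """Return up to `limit` terms, ordered from highest to lowest score.

    `scored_terms` maps each term to a relevance score (hypothetical data;
    how CAS actually scores terms is not described here).
    """
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    best = heapq.nlargest(limit, scored_terms.items(), key=lambda item: item[1])
    return [term for term, _ in best]


scores = {"alpha": 0.9, "beta": 0.4, "gamma": 0.7, "delta": 0.1}
print(top_terms(scores, 2))  # ['alpha', 'gamma']
```

If a record yields fewer terms than the limit, the sketch simply returns all of them.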
To establish a cut-off window, use the RECORD_NTERMS pass-through with a range of two integers, which set the lower and upper limits of the window. CAS scans this window for the optimal cut-off, which is the point where term informativeness drops off most sharply. Use this strategy when you want CAS to be sensitive to actual term informativeness rather than just applying a hard limit.
You can think of the term range as providing a fuzzy neighborhood to be used instead of a hard limit. For example, instead of RECORD_NTERMS having a hard limit of 32, you can set it to a range of 24-36. This range establishes a window where a record can have a minimum of 24 terms and a maximum of 36 terms. CAS determines the optimal cut-off within that window for each record.
For example, assume that 40 terms were extracted from Record A and also from Record B:
For Record A, the optimal cut-off for the terms might be after term 26 (because of a sharp drop-off in relevancy for terms 27-40). Therefore, Record A will have 26 terms tagged onto it.
For Record B, the optimal cut-off for its set of terms might be after term 30. In this case, Record B will have 30 tagged terms.
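The Record A and Record B behavior can be approximated with the following Python sketch. It assumes that the optimal cut-off is the position inside the window with the largest score drop between adjacent terms; the actual CAS criterion may differ. The function `window_cutoff` and the synthetic score lists are hypothetical.

```python
def window_cutoff(scores, lower, upper):
    """Return how many terms to keep for one record.

    `scores` must be sorted from most to least relevant. The cut is placed
    after the term in [lower, upper] that is followed by the steepest drop.
    A cut is only considered where a next term exists to compare against.
    """
    if not 1 <= lower <= upper:
        raise ValueError("expected 1 <= lower <= upper")
    if len(scores) <= lower:
        return len(scores)  # too few terms to scan the window; keep them all
    last = min(upper, len(scores) - 1)
    # On ties, max() returns the earliest (smallest) candidate cut.
    return max(range(lower, last + 1), key=lambda cut: scores[cut - 1] - scores[cut])


# Synthetic scores for 40 terms: a gentle decline, then one sharp drop.
record_a = [1.0 - 0.01 * i for i in range(26)] + [0.3 - 0.005 * i for i in range(14)]
record_b = [1.0 - 0.01 * i for i in range(30)] + [0.2 - 0.005 * i for i in range(10)]

print(window_cutoff(record_a, 24, 36))  # 26
print(window_cutoff(record_b, 24, 36))  # 30
```

Both records share the same 24-36 window, but each ends up with a different number of tagged terms because the steepest drop falls at a different position.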
When using the range version of this pass-through, keep the following in mind:
The lowest recommended value for the lower limit is around 10. The reason is that the scores of the top terms usually differ noticeably, so with a small lower limit the largest score drop-off is likely to be found right at the lower limit. Thus, if the lower limit is less than 10, expect the range to behave like the hard-limit version of RECORD_NTERMS, which is misleading because the setting appears to allow a window.
The value for the upper limit should not be much larger than the value for the lower limit. If the difference is too large, the number of terms assigned to any particular record will be essentially random (within the cut-off window). The only way to keep the number of terms relatively consistent is to use lower and upper limits that are not too far apart.
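The two guidelines above could be checked before a value is deployed, as in this hedged Python sketch. The helper `check_record_nterms` is hypothetical, the string form "24-36" is assumed from the way the range is written in this topic, and `max_spread=16` is an arbitrary illustrative threshold because no exact figure is given for "not much larger".

```python
def check_record_nterms(value, min_lower=10, max_spread=16):
    """Parse a value such as '8' or '24-36' and return (lower, upper, warnings).

    `min_lower` reflects the "around 10" guideline; `max_spread` is an
    assumed threshold chosen only for illustration.
    """
    parts = value.strip().split("-")
    if len(parts) == 1:
        limit = int(parts[0])
        if limit < 1:
            raise ValueError("the limit must be a positive integer")
        return limit, limit, []  # hard-limit form: nothing to warn about
    if len(parts) != 2:
        raise ValueError("expected a single integer or a lower-upper range")
    lower, upper = int(parts[0]), int(parts[1])
    if lower < 1 or lower > upper:
        raise ValueError("expected 1 <= lower <= upper")
    warnings = []
    if lower < min_lower:
        warnings.append("lower limit below %d: likely to act as a hard limit" % min_lower)
    if upper - lower > max_spread:
        warnings.append("window wider than %d: term counts may vary widely" % max_spread)
    return lower, upper, warnings


print(check_record_nterms("24-36"))  # (24, 36, [])
print(len(check_record_nterms("5-60")[2]))  # 2
```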