This topic gives an overview of the term extractor workflow.
Term extraction consists of three steps, each of which is optional:
1. Candidate Term Identification: Identify all terms that are candidates for a given document and then extract those terms. This step is omitted if terms are being extracted from pre-tagged records.
2. Corpus-level Filtering: Globally filter the extracted terms to determine a controlled vocabulary for the corpus. This involves identifying terms that should be removed corpus-wide.
3. Record-level Filtering: Determine, for each record, which terms from the controlled vocabulary are most relevant to it. This involves identifying terms that should be removed from an individual record (independently of the terms removed from the entire corpus, but possibly using corpus-level information) and then removing those terms from the tags on the record.
Note that steps 2 and 3 remove terms that are judged to be of low information value, based either on their score or on their presence on the exclude list. The terms tagged on each record are the result of step 3.
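The three steps can be illustrated with a minimal Python sketch. It is not the term extractor's actual implementation: the function names, the word-length heuristic for candidates, the document-frequency score, and the thresholds are all assumptions chosen only to show how the output of each step feeds the next.

```python
from collections import Counter

def identify_candidates(text):
    """Step 1 (hypothetical heuristic): lowercase words longer than 3 characters."""
    words = [w.strip(".,;:!?()").lower() for w in text.split()]
    return {w for w in words if len(w) > 3}

def corpus_filter(tagged, exclude, max_df_ratio):
    """Step 2: build a controlled vocabulary for the whole corpus.

    Drops terms on the exclude list and terms that occur in too large a
    share of records (assumed here to signal low information value).
    """
    df = Counter(t for terms in tagged.values() for t in terms)
    n_records = len(tagged)
    vocab = {
        t for t, count in df.items()
        if t not in exclude and count / n_records <= max_df_ratio
    }
    return vocab, df

def record_filter(tagged, vocab, df, max_terms):
    """Step 3: keep, per record, the highest-scoring vocabulary terms.

    Assumed score: rarer terms rank first; ties are broken alphabetically.
    """
    result = {}
    for rec_id, terms in tagged.items():
        kept = sorted((t for t in terms if t in vocab), key=lambda t: (df[t], t))
        result[rec_id] = kept[:max_terms]
    return result

# Example corpus (made-up records).
records = {
    "r1": "Red wine from France with cherry notes",
    "r2": "White wine from Italy with citrus notes",
    "r3": "Sparkling wine from Spain",
}

# Step 1 would be skipped if the records were already pre-tagged.
tagged = {rec_id: identify_candidates(text) for rec_id, text in records.items()}
vocab, df = corpus_filter(tagged, exclude={"from", "with"}, max_df_ratio=0.9)
tags = record_filter(tagged, vocab, df, max_terms=2)
```

In this sketch, "wine" is removed corpus-wide because it appears in every record, "from" and "with" are removed by the exclude list, and each record ends up tagged with at most two of its remaining terms.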
This section provides configuration requirements for the term extraction modules. You supply the configuration parameters as pass-through name/value pairs to the CAS manipulator.