Term extraction workflow

This topics gives an overview of the workflow of the term extractor.

Term extraction consists of three steps, each of which is optional:
  1. Candidate Term Identification — Identify all terms that are candidates for a given document and then extract those terms. This step is omitted if terms are being extracted from pre-tagged records (see the “Term filtering with pre-tagged records” topic in Chapter 6).
  2. Corpus-level Filtering — Globally filter the extracted terms to determine a controlled vocabulary for the corpus. This involves identifying terms that should be removed corpus-wide (and in step 3 are removed from the set of tags on each record).
  3. Record-level Filtering — Determine, for each record, what are the most relevant terms for it from the controlled vocabulary. This involves identifying terms that should be removed from an individual record (independent of terms that should be removed from the entire corpus, but possibly using corpus-level information) and subsequently removing these terms from the tags on the record.

Note that steps 2 and 3 remove the terms that are judged (by their score or by their presence on the exclude list) to be of low information value. The tagged terms on each record are a result of step 3.

This section provides configuration requirements for the term extraction modules. You supply the configuration parameters as pass-through name/value pairs to the Java manipulator.