You add a Term Extractor manipulator to a data source on the Edit page of CAS Console. The manipulator extracts terms from a given property on each record.
Term extraction is the process of tagging a record with a list of its relevant terms. A term represents a concept mentioned in the record’s source document, and is typically a noun phrase. The noun phrase consists of one or more nouns, potentially with adjoining words. In CAS, the Term Extractor manipulator uses the following Oracle Language Technology (OLT) features to analyze a specific record property:
Sentence segmentation: Identifies breaks between sentences in text.
Tokenization: Breaks a stream of text into words, phrases, symbols, or other meaningful elements.
Lemmatization: Determines the base (uninflected) form of a word. The process is based on dictionary entries and language-specific rules.
Part-of-speech tagging: Tags a word in a text as belonging to a particular part of speech, based on its definition and its context.
Noun grouping: Identifies noun phrases in text using grammatical phrase grouping. You can use either OLT or custom technology.
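The analysis stages listed above can be illustrated with a small, self-contained sketch. This is not OLT and does not call any CAS API. The lexicon, the plural-stripping rule, and every function name below are hypothetical stand-ins chosen only to show how the stages might feed into one another.

```python
import re

# Hypothetical toy lexicon mapping base forms to part-of-speech tags.
# A real analyzer relies on full dictionaries and language-specific rules.
TOY_LEXICON = {
    "the": "DET", "a": "DET",
    "dry": "ADJ", "red": "ADJ",
    "wine": "NOUN", "cellar": "NOUN",
    "is": "VERB", "stored": "VERB",
    "in": "ADP",
}

def split_sentences(text):
    """Stage 1: break text after sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Stage 2: break a sentence into alphabetic word tokens."""
    return re.findall(r"[A-Za-z]+", sentence)

def lemmatize(word):
    """Stage 3: naive base form. Lowercase, and strip a plural "s"
    only when the remainder is a known lexicon entry."""
    lower = word.lower()
    if lower.endswith("s") and lower[:-1] in TOY_LEXICON:
        return lower[:-1]
    return lower

def tag(lemma):
    """Stage 4: look up the part of speech; unknown words get "X"."""
    return TOY_LEXICON.get(lemma, "X")

def group_noun_phrases(tagged):
    """Stage 5: collect runs of optional adjectives followed by nouns."""
    phrases, current, has_noun = [], [], False
    for lemma, pos in tagged + [("", "END")]:
        if pos == "NOUN":
            current.append(lemma)
            has_noun = True
        elif pos == "ADJ" and not has_noun:
            current.append(lemma)
        else:
            if has_noun:
                phrases.append(" ".join(current))
            current, has_noun = [], False
            if pos == "ADJ":
                current.append(lemma)
    return phrases

def extract_terms(text):
    """Run all five stages and return unique terms in first-seen order."""
    terms = []
    for sentence in split_sentences(text):
        lemmas = [lemmatize(token) for token in tokenize(sentence)]
        tagged = [(lemma, tag(lemma)) for lemma in lemmas]
        for phrase in group_noun_phrases(tagged):
            if phrase not in terms:
                terms.append(phrase)
    return terms

sample = "The dry red wine is stored in the cellar. Red wines age in cellars."
print(extract_terms(sample))
```

In this sketch, lemmatization runs before noun grouping, so "Red wines" and a later "red wine" would collapse into a single term.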
The extracted terms are written to a record property that you specify.
To add a Term Extractor manipulator to a data source:
On the Data Sources page, click a data source name to access its acquisition steps.
The Manipulator Settings page appears.
In Name, specify a unique name for the manipulator to distinguish it from other manipulators in the data source.
You can specify a name with alphanumeric characters, underscores, dashes, and periods. All other characters are invalid for a name. If you change this value, you must run a full crawl.
Specify the following required configuration parameters:
If you modify any of these parameters except for LANG, you must run a full crawl. For information on these parameters, see Minimum term extraction configuration.
The manipulator appears in the Acquisition Steps list.
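The naming rule from the Name step (alphanumeric characters, underscores, dashes, and periods only) can be checked before you enter a value. The following is a minimal sketch, not the validation that CAS Console itself performs. It assumes that "alphanumeric" means ASCII letters and digits, and the function name is hypothetical.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits only.
# CAS Console may apply different or additional checks.
VALID_NAME = re.compile(r"[A-Za-z0-9_.-]+")

def is_valid_manipulator_name(name):
    """Return True if name uses only letters, digits, underscores,
    dashes, and periods, and is not empty."""
    return VALID_NAME.fullmatch(name) is not None

print(is_valid_manipulator_name("term-extractor_1.0"))  # allowed characters only
print(is_valid_manipulator_name("term extractor"))      # contains a space
```

The check covers only the allowed characters. Uniqueness within the data source still has to be confirmed against the existing manipulators.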