A Java manipulator must be configured for the term extraction
class.
A Java manipulator component uses one or more Java modules to manipulate
source records as part of Forge's data processing. You can use multiple Java
manipulators in the pipeline.
The Java manipulator in a Term Discovery pipeline typically performs
these major term extraction tasks:
- Extracts terms (noun phrases) from source records.
- Filters out unwanted terms (based on the exclude list).
- Calculates per-record scores for the terms.
- Performs corpus-level filtering, which determines how informative a term is with respect to the other records in the corpus.
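The four tasks above can be pictured with a toy sketch. This is not the TermExtractor implementation. The class and method names are hypothetical, single words stand in for noun phrases, the per-record score is a simple relative frequency, and the corpus-level filter is a document-frequency cutoff. All of these are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not part of the Endeca API.
public class TermPipelineSketch {

    // Tasks 1-3: extract candidate terms from one record, skip excluded terms,
    // and score each remaining term by its relative frequency in the record.
    static Map<String, Double> scoreRecord(String text, Set<String> excludeList) {
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty() || excludeList.contains(token)) {
                continue; // drop empty tokens and terms on the exclude list
            }
            counts.merge(token, 1, Integer::sum);
            total++;
        }
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            scores.put(e.getKey(), e.getValue() / (double) total);
        }
        return scores;
    }

    // Task 4: corpus-level filtering. A term that occurs in more than
    // maxFraction of all records is treated as uninformative and removed.
    static List<Map<String, Double>> filterByCorpus(
            List<Map<String, Double>> perRecord, double maxFraction) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (Map<String, Double> record : perRecord) {
            for (String term : record.keySet()) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        List<Map<String, Double>> result = new ArrayList<>();
        for (Map<String, Double> record : perRecord) {
            Map<String, Double> kept = new HashMap<>();
            for (Map.Entry<String, Double> e : record.entrySet()) {
                double fraction = docFreq.get(e.getKey()) / (double) perRecord.size();
                if (fraction <= maxFraction) {
                    kept.put(e.getKey(), e.getValue());
                }
            }
            result.add(kept);
        }
        return result;
    }
}
```

In this sketch, a word that appears in every record is removed by the corpus-level step even if it scores highly within a single record, which is the sense in which that step measures a term against the rest of the corpus.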
The configuration attributes of a Java manipulator are described below.
Creating the Java manipulator
You use Developer Studio to create and configure a Java manipulator.
To create a Java manipulator:
1. From the Pipeline Diagram in Developer Studio, click New.
2. Select Java Manipulator. The New Java Manipulator editor is displayed.
3. In the Name field, enter a name. The name must be unique among the pipeline components.
4. Fill in the appropriate fields on the General, Sources, and Pass Throughs tabs. See the following sections for details on these tabs.
5. Optionally, use the Comment tab to enter a description or other comments about this component.
6. Click OK.
Configuring the General tab
Use the General tab to configure these Java property attributes:
The following is an example of the General tab:
Configuring the Sources tab
Use the Sources tab to specify which other component in the pipeline
is providing records to this Java manipulator. You can specify multiple record
sources.
A Java manipulator used for term extraction typically uses two record
sources: one for the source records and the other for the exclude list. If two
record source inputs are configured, the first input is used for source records
and the second is for the exclude list.
Note: Make sure that the record sources are configured in the proper
order, with the source of the source records listed first. If they are reversed, the
TermExtractor class in the Java manipulator will throw an exception and the
Forge process will fail. Developer Studio arranges the sources in alphabetical
order; therefore, you may have to rename them so that they are displayed in the
correct order.
In the following example, the CleanBody record manipulator supplies the
source records, while the LoadExcludeList source supplies the exclude list.
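Separately from the Developer Studio example, the ordering rule in the note can be checked before you name your components. The sketch below assumes the alphabetical ordering is case-insensitive, which the documentation does not state. The class and method names are hypothetical, as are the component names TrimBody and ZLoadExcludeList used in the usage note.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; it only models "sources are arranged alphabetically".
public class SourceOrderSketch {

    // Returns the names in alphabetical order.
    // Assumption: the comparison ignores case.
    static List<String> displayedOrder(List<String> sourceNames) {
        List<String> sorted = new ArrayList<>(sourceNames);
        sorted.sort(String.CASE_INSENSITIVE_ORDER);
        return sorted;
    }

    // True if the record source would be listed before the exclude list source.
    static boolean recordSourceIsFirst(String recordSource, String excludeListSource) {
        List<String> order = displayedOrder(List.of(excludeListSource, recordSource));
        return order.get(0).equals(recordSource);
    }
}
```

Under that assumption, `recordSourceIsFirst("CleanBody", "LoadExcludeList")` is true, so no renaming is needed. A record source named TrimBody would sort after LoadExcludeList, and renaming the exclude list source to something like ZLoadExcludeList would restore the required order.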
Configuring the Pass Throughs tab
The Pass Throughs tab is used to send configuration-specific
information to the Java classes being executed by the Java manipulator.
Descriptions of the pass-through name/value pairs for the Java classes are
provided in the "Configuration Guidelines for Term Extraction" section.
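As a rough picture of how a Java class might consume such name/value pairs, the sketch below reads them from a plain map. It does not use the Endeca adapter API. The class, the helper methods, and the pass-through names EXAMPLE_TERM_PROPERTY and EXAMPLE_MAX_TERMS are all hypothetical; the actual names are the ones listed in the configuration guidelines.

```java
import java.util.Map;

// Hypothetical illustration of reading pass-through name/value pairs.
public class PassThroughSketch {

    // Returns the trimmed value of a required pass-through,
    // or fails with a clear message if it is missing or blank.
    static String require(Map<String, String> passThroughs, String name) {
        String value = passThroughs.get(name);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing pass-through: " + name);
        }
        return value.trim();
    }

    // Returns an optional integer pass-through, or the default if it is absent or blank.
    static int optionalInt(Map<String, String> passThroughs, String name, int defaultValue) {
        String value = passThroughs.get(name);
        if (value == null || value.trim().isEmpty()) {
            return defaultValue;
        }
        return Integer.parseInt(value.trim());
    }
}
```

Because pass-through values arrive as text, a class written this way has to validate and convert them itself, so a typo in the Value field surfaces as a clear error rather than as silently wrong behavior.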
To add a pass-through name/value pair:
1. In the Java Manipulator editor, click the Pass Throughs tab.
2. In the Name field, enter the name of the pass-through.
3. In the Value field, enter the value for the named pass-through.
4. Click Add.
5. Repeat steps 2-4 as necessary.
6. Click OK.
The following is an example of a populated Pass Throughs tab: