Creating a record manipulator

Expressions in a record manipulator perform document retrieval, text extraction, language identification, record or property cleanup, and other crawling-related tasks. Each expression is evaluated against every record as it flows through the pipeline and modifies the record as necessary.

For in-depth information about the expressions that can be used in a record manipulator, see the Data Foundry Expression Reference.

At a minimum, a crawler pipeline requires a record manipulator with two expressions: one to retrieve documents (RETRIEVE_URL) and one to convert documents to text (CONVERTTOTEXT or PARSE_DOC). You can also add an optional REMOVE_EXPORTED_PROP expression to delete the temporary files that RETRIEVE_URL creates on disk.
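The flow of a record through these expressions can be pictured with a short sketch. The Python below is a conceptual model only, not the Data Foundry expression syntax: the function names merely echo the expression names, and the record layout, the property names (url, body_path, text), and the FAKE_WEB lookup table are all hypothetical stand-ins chosen for illustration. It assumes each expression is applied in order to every record.

```python
import os
import re
import tempfile

# Hypothetical stand-in for the network: maps a URL to its raw content.
FAKE_WEB = {
    "http://example.com/a.html":
        "<html><body><p>Hello, crawler.</p></body></html>",
}

def retrieve_url(record):
    """Conceptual stand-in for RETRIEVE_URL: save the document to a temp file."""
    fd, path = tempfile.mkstemp(suffix=".html")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(FAKE_WEB[record["url"]])
    # Hypothetical property that records where the file was written.
    record["body_path"] = path

def convert_to_text(record):
    """Conceptual stand-in for CONVERTTOTEXT: extract plain text from the file."""
    with open(record["body_path"], encoding="utf-8") as f:
        raw = f.read()
    without_tags = re.sub(r"<[^>]+>", " ", raw)  # crude tag stripping
    record["text"] = " ".join(without_tags.split())

def remove_exported_prop(record):
    """Conceptual stand-in for REMOVE_EXPORTED_PROP: delete the temp file."""
    path = record.pop("body_path", None)
    if path and os.path.exists(path):
        os.remove(path)

def run_manipulator(records, expressions):
    """Evaluate every expression, in order, against each record."""
    for record in records:
        for expression in expressions:
            expression(record)
        yield record

records = [{"url": "http://example.com/a.html"}]
expressions = [retrieve_url, convert_to_text, remove_exported_prop]
processed = list(run_manipulator(records, expressions))
print(processed[0]["text"])
```

In this model the order matters: the conversion step reads the file that the retrieval step wrote, and the cleanup step runs last so that the temporary file is still available when the text is extracted.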

To create a record manipulator:

  1. In the Project tab of Developer Studio, double-click Pipeline Diagram.
  2. In the Pipeline Diagram editor, click New.
  3. Select Record > Manipulator. The New Record Manipulator editor appears.
  4. In the Name text box, type a name for the record manipulator.
  5. From the Record source drop-down list, choose the name of the record adapter.
  6. Click OK to add the new record manipulator to the project.
  7. Select File > Save.
  8. If you are ready to add the expressions described in the sections below, double-click the record manipulator in your pipeline diagram. The Expression editor appears.