Terms can be extracted from either structured or unstructured source data.

Data and content acquisition is typically used to retrieve the records that the term extractor processes. In particular, crawling (through the Oracle Commerce Web Crawler or the CAS Server) is a viable source of content for term extraction.

The input must be as clean as possible and as similar to natural language as the original data allows. The following list provides some recommendations about how to pre-process unstructured text that is fed to the term extractor:

Whenever possible, the text should be cleansed in the source data. However, you can add record manipulators to the pipeline to perform pre-processing cleanup.
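
As an illustration of the kind of cleanup meant here, the following Python sketch strips markup tags, decodes HTML entities, and collapses whitespace so that the text reads more like natural language. It is not a record manipulator and does not use any Oracle Commerce API; the function name clean_text and the specific cleanup rules are assumptions chosen for the example.

```python
import html
import re

# Hypothetical cleanup rules; adjust to the noise present in your own source data.
TAG_RE = re.compile(r"<[^>]+>")   # anything that looks like a markup tag
WS_RE = re.compile(r"\s+")        # runs of whitespace, including non-breaking spaces


def clean_text(raw: str) -> str:
    """Return raw text with tags removed, entities decoded, and whitespace collapsed."""
    text = TAG_RE.sub(" ", raw)    # replace tags with a space so words do not fuse
    text = html.unescape(text)     # turn &amp;, &nbsp;, etc. into plain characters
    text = WS_RE.sub(" ", text)    # collapse newlines, tabs, and repeated spaces
    return text.strip()


if __name__ == "__main__":
    print(clean_text("<p>Red&nbsp;wine &amp; cheese</p>\n\n<br/>"))
```

The same logic could be applied in the source system before the data is acquired, or ported to whatever pre-processing step your pipeline supports.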

You can also use the CORPUS_REGEX_KEEP and CORPUS_REGEX_SKIP pass-throughs in the CAS manipulator to control which extracted terms are kept or discarded. For details on how to construct regular expressions with these pass-throughs, see the "Using regular expressions" topic in this chapter.
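
To make the keep/skip idea concrete, the following Python sketch filters a list of candidate terms with one "keep" pattern and one "skip" pattern. It only illustrates the general technique: the function filter_terms, the whole-term matching, and the rule that the skip pattern is checked first are all assumptions, and the actual matching semantics and regular expression syntax of CORPUS_REGEX_KEEP and CORPUS_REGEX_SKIP are defined by the product documentation.

```python
import re
from typing import Iterable, List, Optional


def filter_terms(
    terms: Iterable[str],
    keep: Optional[str] = None,
    skip: Optional[str] = None,
) -> List[str]:
    """Keep terms that match `keep` (if given) and do not match `skip` (if given).

    Assumption for this sketch: patterns must match the whole term,
    and the skip pattern is evaluated before the keep pattern.
    """
    keep_re = re.compile(keep) if keep else None
    skip_re = re.compile(skip) if skip else None

    kept = []
    for term in terms:
        if skip_re and skip_re.fullmatch(term):
            continue  # explicitly discarded
        if keep_re and not keep_re.fullmatch(term):
            continue  # a keep pattern exists and the term does not match it
        kept.append(term)
    return kept


if __name__ == "__main__":
    candidates = ["red wine", "2019", "wine-glass", "x"]
    # Keep lowercase multi-character terms; discard purely numeric ones.
    print(filter_terms(candidates, keep=r"[a-z][a-z -]+", skip=r"\d+"))
```

Testing patterns against a sample of extracted terms in this way can help confirm that an expression is neither too broad nor too narrow before it is configured in the pipeline.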
