Terms can be extracted from either structured or unstructured source data.
Data and content acquisition is typically used to retrieve the records processed by the term extractor. In particular, crawling (through the Oracle Commerce Web Crawler or the CAS Server) is a viable source of content for term extraction.
The input must be as clean as possible and as similar to natural language as the original data allows. The following list provides some recommendations about how to pre-process unstructured text that is fed to the term extractor:
Remove anything from the property that is sent to the term extractor that is not the main content of the document. For example, when dealing with news articles, it is a good idea to remove bylines, copyright disclaimers, and so forth. When dealing with Web pages, the task is noticeably harder, because the navigational elements, page headers and footers, guestbook links, ads, and so on all have to be removed. In such extreme cases, one suggestion is to retain long sequences of sentences with correct sentence-terminating punctuation that do not have many major HTML tags (H1, P, HR, DIV, SPAN, and so on) embedded: meaningful text is likely to satisfy this requirement, while items such as menus, ads, and page elements are likely to fail it.
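As an illustration only (a Python sketch, not an Oracle Commerce component; the function name, tag list, and minimum sentence count are assumptions chosen for the example), one way to apply this heuristic is:

import re

# Keep only the chunks of a page that look like running prose.
# Split on major block-level tags, then keep chunks that contain
# several sentences ending in ".", "!", or "?".
MAJOR_TAGS = re.compile(r'</?(?:h1|p|hr|div|span)[^>]*>', re.IGNORECASE)

def keep_prose_chunks(html, min_sentences=3):
    kept = []
    for chunk in MAJOR_TAGS.split(html):
        # Count sentence terminators followed by whitespace or end of chunk.
        sentences = re.findall(r'[.!?](?:\s|$)', chunk)
        if len(sentences) >= min_sentences:
            kept.append(chunk.strip())
    return kept

Menus, ads, and navigation chunks typically contain few or no sentence terminators and are dropped, while article body text typically passes.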
Remove anything that is not natural language text. This includes HTML tags, other markup, and long sequences of non-alpha characters (for example, long sequences of dashes used as delimiters). Links to images, URLs (which might appear in plain text, outside of HTML tags), and anything that is not grammatically correct language should be stripped. The same caveat applies to sequences of characters that are too long to be meaningful terms. The term extractor reports and skips such overlong noun phrases. However, it is useful to detect them upstream of the term extractor, because their presence might indicate sections of the documents that should be removed.
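One possible clean-up pass, sketched in Python for illustration (the patterns and the 50-character token limit are assumptions, not values used by the term extractor):

import re

def strip_non_language(text, max_token_len=50):
    # Remove HTML tags and bare URLs that appear in plain text.
    text = re.sub(r'<[^>]+>', ' ', text)
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)
    # Collapse long runs of punctuation characters (e.g. "-----" delimiters).
    text = re.sub(r'[^\w\s]{4,}', ' ', text)
    # Drop tokens that are too long to be meaningful terms.
    tokens = [t for t in text.split() if len(t) <= max_token_len]
    return ' '.join(tokens)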
Punctuation should be correctly spaced, especially when stripping HTML or adding sentence-terminating punctuation. A sentence terminator is correctly interpreted only if it is followed by a space (see the sketch after this example). For example:
Look at this.<IMG a.gif>Or this.
should not be converted to:
Look at this.Or this.
but instead to:
Look at this. Or this.
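A safe way to achieve this when stripping markup is to replace each tag with a space rather than with an empty string, and then collapse repeated whitespace. A minimal Python sketch (illustrative only):

import re

def strip_tags_preserving_spacing(text):
    # Replace each tag with a space so a sentence terminator that was
    # immediately followed by a tag is still followed by whitespace.
    text = re.sub(r'<[^>]+>', ' ', text)
    # Collapse the runs of whitespace introduced by the substitution.
    return re.sub(r'\s+', ' ', text).strip()

strip_tags_preserving_spacing('Look at this.<IMG a.gif>Or this.')
# returns 'Look at this. Or this.'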
Convert non-sentence text into sentences. If, for example, a useful section of the document is written as a list or a table (that is, separated with <LI> or <TD> tags), it is recommended to terminate such entries with periods (or semicolons, depending on context), if they are not so terminated to begin with.
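For example, assuming the list or table entries are already available as separate strings, a small Python sketch (illustrative only; the choice of terminator is an assumption):

def terminate_entries(entries, terminator='.'):
    # Append a terminator to each entry that lacks one, then join the
    # entries into a single block of sentence-like text.
    out = []
    for entry in entries:
        entry = entry.strip()
        if entry and entry[-1] not in '.!?;':
            entry += terminator
        out.append(entry)
    return ' '.join(out)

terminate_entries(['Stainless steel housing', 'Two-year warranty.'])
# returns 'Stainless steel housing. Two-year warranty.'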
Merge text fields, if needed. For example, if the document title is a separate property, it is useful to append it to the main text property (terminating it with punctuation, if possible).
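A minimal Python sketch of this merge (illustrative only; the property names and the decision to place the title after the body text are assumptions):

def merge_title_into_text(text, title):
    # Append the separate title property to the main text property,
    # terminating the title with a period if it lacks sentence punctuation.
    title = title.strip()
    if title and title[-1] not in '.!?':
        title += '.'
    return (text.strip() + ' ' + title).strip()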
Use the correct case for capitalized and non-capitalized text. If capitalization is not available, the term extractor makes a best guess, and this guess is better when dealing with lower-case text than with all upper-case text. If the document text is in all upper case, it is advisable to convert it to all lower case (or, possibly, all lower case with initial capitalization for the first word in every sentence).
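A possible conversion, sketched in Python for illustration (the sentence-boundary pattern is an assumption):

import re

def normalize_all_caps(text):
    # Lower-case the text, then capitalize the first letter of each sentence.
    text = text.lower()
    return re.sub(r'(^|[.!?]\s+)([a-z])',
                  lambda m: m.group(1) + m.group(2).upper(),
                  text)

normalize_all_caps('NEW MODEL AVAILABLE NOW. ORDER TODAY.')
# returns 'New model available now. Order today.'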
Use correct spelling, especially in informally written sources such as Web pages and blogs. For documents written in informal language, it is recommended that simple pattern replacement be done on the most frequent informal terms (for example, "u" should be replaced with "you").
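For example (an illustrative Python sketch; the replacement table is a small assumed sample, not an exhaustive list):

import re

# Assumed sample of frequent informal spellings and their standard forms.
REPLACEMENTS = {
    r'\bu\b': 'you',
    r'\br\b': 'are',
    r'\bthx\b': 'thanks',
}

def normalize_informal(text):
    # Replace frequent informal spellings, matching whole words only.
    for pattern, replacement in REPLACEMENTS.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text

normalize_informal('thx, u r the best')
# returns 'thanks, you are the best'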
Whenever possible, the text should be cleansed in the source data. However, you can add record manipulators to the pipeline to perform pre-processing cleanup.
You can also use the CORPUS_REGEX_KEEP and CORPUS_REGEX_SKIP pass-throughs in the CAS manipulator to control which extracted terms are kept or discarded. For details on how to construct regular expressions with these pass-throughs, see the Using regular expressions topic of this chapter.