This chapter contains the following functions:
The following functions are used in the Generate stage of filtering. Generation functions can generate information that goes into a resource description. In general, they either extract information from the body of the resource itself or copy information from the resource’s metadata.
The extract-full-text function extracts the complete text of the resource and adds it to the resource description.
Use the extract-full-text function with caution: it can significantly increase the size of the resource description, which bloats the database and consumes additional network bandwidth.
The parameters used with the extract-full-text function and their description are:
The maximum number of characters to extract from the resource.
Name of the schema item that will receive the full text.
Generate fn=extract-full-text
The extract-html-meta function extracts any <META> or <TITLE> information from an HTML file and adds it to the resource description. A content-type may be specified to restrict the kind of URLs that are generated.
The parameters used with the extract-html-meta function and their description are:
truncate: The maximum number of bytes to extract.
type: Optional content-type that restricts which URLs are generated. If omitted, all URLs are generated.
Generate fn=extract-html-meta truncate=255 type=text/html
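To make the effect of extract-html-meta concrete, here is a minimal Python sketch of the same idea. It is not the Robot's implementation: the names MetaTitleParser and extract_html_meta are hypothetical, and the sketch truncates by characters rather than bytes.

```python
from html.parser import HTMLParser


class MetaTitleParser(HTMLParser):
    """Collects <TITLE> text and <META name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "title":
            self._in_title = True
            self.fields.setdefault("title", "")
        elif tag == "meta":
            attr = dict(attrs)
            name, content = attr.get("name"), attr.get("content")
            if name and content is not None:
                self.fields[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.fields["title"] += data


def extract_html_meta(html, truncate=255):
    """Hypothetical helper: return title/meta values, each cut to `truncate`."""
    parser = MetaTitleParser()
    parser.feed(html)
    parser.close()
    # Assumption: this sketch counts characters, not bytes.
    return {k: v.strip()[:truncate] for k, v in parser.fields.items()}
```

Calling extract_html_meta on a page yields a dictionary keyed by "title" and by each lowercased META name, which loosely mirrors the fields that would be added to the resource description.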
The extract-html-text function extracts the first few characters of text from an HTML file, excluding the HTML tags, and adds the text to the resource description. This permits the first part of a document’s text to be included in the RD. A content-type may be specified to restrict the kind of URLs that are generated.
The parameters used with the extract-html-text function and their description are:
truncate: The maximum number of bytes to extract.
skip-headings: Set to true to ignore any HTML headers that occur in the document.
type: Optional content-type that restricts which URLs are generated. If omitted, all URLs are generated.
Generate fn=extract-html-text truncate=255 type=text/html skip-headings=true
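The behavior described for extract-html-text can be approximated with the Python sketch below. It only illustrates the idea and is not the Robot's code: TextParser and extract_html_text are hypothetical names, "headers" is assumed to mean the h1 to h6 elements, script and style content is assumed to be excluded, and truncation counts characters rather than bytes.

```python
from html.parser import HTMLParser

# Assumption: "HTML headers" refers to heading elements h1..h6.
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
# Assumption: script and style content is never part of the text.
ALWAYS_SKIPPED = {"script", "style"}


class TextParser(HTMLParser):
    def __init__(self, skip_headings):
        super().__init__()
        self.skip = set(ALWAYS_SKIPPED)
        if skip_headings:
            self.skip |= HEADING_TAGS
        self.depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.chunks.append(data)


def extract_html_text(html, truncate=255, skip_headings=False):
    """Hypothetical helper: tag-free text, whitespace-collapsed, truncated."""
    parser = TextParser(skip_headings)
    parser.feed(html)
    parser.close()
    # Joining chunks with a space keeps adjacent blocks apart; split/join
    # then collapses every run of whitespace to a single space.
    text = " ".join(" ".join(parser.chunks).split())
    return text[:truncate]
```

With skip_headings=True the heading text is dropped before truncation, so the limit applies only to body text.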
The extract-html-toc function extracts the table of contents from the HTML headers and adds it to the resource description.
The parameters used with the extract-html-toc function and their description are:
truncate: The maximum number of bytes to extract.
level: Maximum HTML header level to extract. This parameter controls the depth of the table of contents.
Robot HTML Summarizer does not generate a description and partial text for some documents, such as text/HTML, application/x-maker, or x-frame. There are three possible causes:
For HTML or text: an unclosed JavaScript tag. This is an error that you must fix in the HTML page itself.
Robot does not index the part of the HTML page that falls between stopindex and startindex.
For any file other than HTML or text, such as application/x-maker or x-frame, Robot uses a third-party converter to convert the file into HTML and then indexes the result. In some cases, the converter might not be able to generate the HTML, or it might generate an empty HTML body. In this case, Sun reports the problem to the third party for a fix or patch.
Generate fn=extract-html-toc truncate=255 level=3
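As a rough illustration of how the level parameter of extract-html-toc limits the depth of the table of contents, consider the following Python sketch. The names TocParser and extract_html_toc are hypothetical, the indented output format is an assumption, and truncation counts characters rather than bytes; the Robot's actual output may differ.

```python
from html.parser import HTMLParser


class TocParser(HTMLParser):
    def __init__(self, level):
        super().__init__()
        self.level = level
        self.current = None  # heading level being read, or None
        self.buffer = []
        self.entries = []  # list of (level, text) pairs

    def handle_starttag(self, tag, attrs):
        # Match h1..h6 only; ignore headings deeper than `level`.
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            if int(tag[1]) <= self.level:
                self.current = int(tag[1])
                self.buffer = []

    def handle_endtag(self, tag):
        if self.current is not None and tag == f"h{self.current}":
            text = " ".join("".join(self.buffer).split())
            self.entries.append((self.current, text))
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.buffer.append(data)


def extract_html_toc(html, truncate=255, level=3):
    """Hypothetical helper: one line per heading, indented by level."""
    parser = TocParser(level)
    parser.feed(html)
    parser.close()
    lines = ["  " * (lvl - 1) + text for lvl, text in parser.entries]
    return "\n".join(lines)[:truncate]
```

Lowering level shortens the result by dropping deeper headings, while truncate caps the total size regardless of depth.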
The extract-source function extracts the specified values from the given sources and adds them to the resource description.
The parameter used with the extract-source function and its description is:
src: List of source names. You can use the -> operator to define a new name for the RD attribute. For example, type->content-type takes the value of the source named type and saves it in the RD under the attribute named content-type.
Generate fn=extract-source src="md5,depth,rd-expires,rd-last-modified"
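The source list and its -> renaming operator can be modeled with a short Python sketch. This is only an illustration of the mapping described above, not the Robot's parser: parse_src and extract_source are hypothetical names, and skipping sources that have no value is an assumption.

```python
def parse_src(src):
    """Parse a src list into (source_name, rd_attribute) pairs."""
    pairs = []
    for item in src.split(","):
        item = item.strip()
        if not item:
            continue
        # "type->content-type" renames; a bare name maps to itself.
        source, arrow, target = item.partition("->")
        source = source.strip()
        target = target.strip() if arrow else source
        pairs.append((source, target or source))
    return pairs


def extract_source(src, sources):
    """Copy the named source values into a new RD-like dictionary."""
    rd = {}
    for source, attribute in parse_src(src):
        # Assumption: sources without a value are silently skipped.
        if source in sources:
            rd[attribute] = sources[source]
    return rd
```

For example, extract_source("type->content-type,depth", {"type": "text/html", "depth": 2}) stores the type value under content-type and copies depth unchanged.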
The harvest-summarizer function runs a Harvest summarizer on the resource and adds the result to the resource description.
To run Harvest summarizers, you must have $HARVEST_HOME/lib/gatherer in your path before you run the robot.
The parameter used with the harvest-summarizer function and its description is:
summarizer: Name of the summarizer program.
Generate fn=harvest-summarizer summarizer=HTML.sum
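Because Harvest summarizers require $HARVEST_HOME/lib/gatherer in the path, a quick check before starting the robot can help. The Python sketch below is one possible way to verify this; gatherer_on_path is a hypothetical helper and is not part of the Robot or of Harvest.

```python
import os


def gatherer_on_path(environ):
    """Return True if $HARVEST_HOME/lib/gatherer is listed in PATH."""
    home = environ.get("HARVEST_HOME")
    if not home:
        return False
    wanted = os.path.normpath(os.path.join(home, "lib", "gatherer"))
    entries = environ.get("PATH", "").split(os.pathsep)
    # Normalize each entry so trailing slashes do not cause a mismatch.
    return any(os.path.normpath(p) == wanted for p in entries if p)
```

Passing os.environ checks the current process environment; the function only inspects the variables and does not verify that the summarizer program itself exists.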