Sun Java System Portal Server 7.1 Administration Guide

Generation Functions

Generation functions run in the Generate stage of filtering and create the information that goes into a resource description (RD). In general, they either extract information from the body of the resource itself or copy information from the resource’s metadata.

extract-full-text

The extract-full-text function extracts the complete text of the resource and adds it to the resource description.


Note –

Use the extract-full-text function with caution. It can significantly increase the size of the resource description, which bloats the database and increases network bandwidth use.


Example

Generate fn=extract-full-text

Properties

truncate

The maximum number of characters to extract from the resource

dst

Name of the schema item that receives the full text
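
As a sketch of how the two properties might be combined to limit the size of the extracted text, the following directive caps the extraction and names a destination. The value 4096 and the schema item name full-text are illustrative assumptions, not documented defaults, and the example assumes that a schema item of that name exists.

```
Generate fn=extract-full-text truncate=4096 dst=full-text
```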

extract-html-meta

The extract-html-meta function extracts any <META> or <TITLE> information from an HTML file and adds it to the resource description. A content-type may be specified to restrict the kind of URLs that are generated.

Properties

truncate

The maximum number of bytes to extract

type

Optional property. Content type (for example, text/html) that restricts which URLs are generated. If omitted, all URLs are generated

Example

Generate fn=extract-html-meta truncate=255 type=text/html

extract-html-text

The extract-html-text function extracts the first part of the text from an HTML file, up to the truncate limit and excluding the HTML tags, and adds the text to the resource description (RD). A content-type may be specified to restrict the kind of URLs that are generated.

Properties

truncate

The maximum number of bytes to extract

skip-headings

Set to true to ignore any HTML headings that occur in the document

type

Optional property. Content type (for example, text/html) that restricts which URLs are generated. If omitted, all URLs are generated

Example

Generate fn=extract-html-text truncate=255 type=text/html skip-headings=true

extract-html-toc

The extract-html-toc function extracts a table of contents from the HTML headings and adds it to the resource description.

Properties

truncate

The maximum number of bytes to extract

level

Maximum HTML heading level to extract. This property controls the depth of the table of contents

Example

Generate fn=extract-html-toc truncate=255 level=3

extract-source

The extract-source function extracts the specified values from the given sources and adds them to the resource description.

Property

src

Comma-separated list of source names. You can use the -> operator to define a new name for the RD attribute. For example, type->content-type takes the value of the source named type and saves it in the RD under the attribute named content-type.

Example

Generate fn=extract-source src="md5,depth,rd-expires,rd-last-modified"
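
To illustrate the -> operator described above, the following variation renames one source while copying another unchanged. It is a sketch that reuses only names mentioned in this section: the value of the source named type is stored under the attribute content-type, and depth keeps its own name. Combining renamed and plain entries in one list is assumed to work as shown.

```
Generate fn=extract-source src="type->content-type,depth"
```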

harvest-summarizer

The harvest-summarizer function runs a Harvest summarizer on the resource and adds the result to the resource description.

To run Harvest summarizers, you must have $HARVEST_HOME/lib/gatherer in your path before you run the robot.

Property

summarizer

Name of the summarizer program

Example

Generate fn=harvest-summarizer summarizer=HTML.sum