Sun Java System Portal Server 6 2005Q1 Administration Guide 

Chapter 15
The Pre-defined Robot Application Functions

This chapter provides descriptions, parameter specifications, and examples of pre-defined Robot Application Functions (RAFs) in the Sun Java™ System Portal Server Search Engine. You can use these functions in the filter.conf file to create and modify filter definitions. The file filter.conf is located in the directory /var/opt/SUNWps/http-hostname-domain/portal/config.

The file filter.conf contains definitions for the enumeration and generation filters. Each of these filters invokes a set of rules which are stored in the file filterrules.conf. The filter definitions contain instructions that are specific to each filter while the filter rules contain the rules used by both filters.

To understand how filter rules are defined, examine the file filterrules.conf. Note that you typically need not manually edit this file since you create filter rules by using the administration console.

To see an example of filter definitions, you should examine the file filter.conf. You only need to edit the filter.conf file to modify the filters in a way that is not accommodated in the administration console, such as instructing the robot to enumerate some resources without generating resources for them.

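This chapter contains the following sections: Sources and Destinations, Setup Functions, Filtering Functions, Filtering Support Functions, Enumeration Functions, Generation Functions, and Shutdown Functions.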


Sources and Destinations

Most of the Robot Application Functions (RAFs) require sources of information and generate data that goes to destinations. The sources are defined within the robot itself and are not necessarily related to the fields in the resource description it ultimately generates. Destinations, on the other hand, are generally the names of fields in the resource description, as defined by the resource description server’s schema.

For details on using the administration console to determine the database schema, see Chapter 13, "Administering the Search Engine Service."

The following sections describe the different stages of the filtering process, and the sources available at those stages.

Sources Available at the Setup Stage

At the Setup stage, the filter is set up and cannot yet get information about the resource’s URL or content.

Sources Available at the MetaData Filtering Stage

At the MetaData stage, the robot has encountered a URL for a resource but has not yet downloaded the resource’s content. Information is therefore available about the URL, as well as data derived from other sources such as the filter.conf file. Information about the content of the resource is not yet available.

Table 15-1 lists the sources available to the RAFs at the MetaData phase. The table contains three columns. The first column lists the source, the second column provides a description, and the third column provides an example.

Table 15-1  Sources Available to the RAFs at the MetaData Phase 

Source              Description                                    Example
csid                Catalog Server ID                              x-catalog//budgie.siroe.com:8086/alexandria
depth               Number of links traversed from starting point  10
enumeration filter  Name of Enumeration filter                     enumeration1
generation filter   Name of Generation filter                      generation1
host                Host portion of URL                            home.siroe.com
IP                  Numeric version of host                        198.95.249.6
protocol            Access portion of the URL                      http, https, ftp, file
path                Path portion of the URL                        /, /index.html, /documents/listing.html
URL                 Complete URL                                   http://developer.siroe.com/docs/manuals/

Sources Available at the Data Stage

At the Data stage, the robot has downloaded the content of the resource at the URL, and can access data about the content, such as the description, the author, and so on.

If the resource is an HTML file, the Robot parses the <META> tags in the HTML headers. Consequently, any data contained in <META> tags is available at the Data stage.

Table 15-2 lists the sources that are available to RAFs at the Data phase, in addition to those available at the MetaData phase. The table contains three columns. The first column lists the source, the second column provides a description, and the third column provides an example.

Table 15-2  Sources Available to the RAFs at the Data Phase 

Source               Description                                                               Example
content-charset      Character set used by the resource
content-encoding     Any form of encoding
content-length       Size of the resource in bytes
content-type         MIME type of the resource                                                 text/html, image/jpeg
expires              Date the resource itself expires
last-modified        Date the resource was last modified
data in <META> tags  Any data that is provided in <META> tags in the header of HTML resources  Author, Description, Keywords

All these sources (except for the data in <META> tags) are derived from the HTTP response header returned when retrieving the resource.

Sources Available at the Enumeration, Generation, and Shutdown Stages

At the Enumeration and Generation stages, the same data sources are available as the Data stage.

At the Shutdown stage, the filter completes its filtering and shuts down. Although functions written for this stage can use the same data sources as those available at the Data stage, the shutdown functions typically restrict their operations to shutdown and cleanup activities.
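The stage-by-stage availability of sources described in this section can be summarized as a lookup. The following Python sketch is only an illustrative model of that description; the names METADATA_SOURCES, DATA_SOURCES, and available_sources are hypothetical and are not part of the robot.

```python
# Illustrative model only: the robot does not expose this API.
# Sources listed in Table 15-1 (available from the MetaData stage onward).
METADATA_SOURCES = {
    "csid", "depth", "enumeration filter", "generation filter",
    "host", "IP", "protocol", "path", "URL",
}

# Sources listed in Table 15-2 that come from the HTTP response header.
# Data in <META> tags is also available at the Data stage, but it varies
# per resource, so it is not enumerated here.
DATA_SOURCES = {
    "content-charset", "content-encoding", "content-length",
    "content-type", "expires", "last-modified",
}


def available_sources(stage):
    """Return the named sources a function can read at the given stage."""
    if stage == "Setup":
        return set()
    if stage == "MetaData":
        return set(METADATA_SOURCES)
    if stage in ("Data", "Enumerate", "Generate", "Shutdown"):
        return METADATA_SOURCES | DATA_SOURCES
    raise ValueError("unknown stage: " + stage)
```

For example, a lookup for "MetaData" does not include content-type, which explains why filters on the content type belong at the Data stage or later.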

Enable Parameter

Each function can have an enable parameter. The values can be true, false, on, or off. The administration console uses these parameters to turn certain directives on or off.

The following example enables enumeration for text/html and disables enumeration for text/plain:

# Perform the enumeration on HTML only

Enumerate enable=true fn=enumerate-urls max=1024 type=text/html

Enumerate enable=false fn=enumerate-urls-from-text max=1024 type=text/plain

Adding an enable=false parameter or an enable=off parameter has the same effect as commenting out the line. Because the administration console does not write comments, it writes an enable parameter instead.
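To illustrate how the enable parameter acts as a switch, the following Python sketch splits a directive line into its stage and parameters and then checks whether the directive is active. It is a simplified model, not the robot's parser. The function names parse_directive and is_enabled are hypothetical, and two behaviors are assumptions: a directive without an enable parameter is treated as enabled, and the value is compared without regard to case.

```python
import shlex


def parse_directive(line):
    """Split a directive line into (stage, params); return None for comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # shlex handles quoted values such as fn="generate-by-exact".
    stage, *pairs = shlex.split(line)
    params = dict(pair.split("=", 1) for pair in pairs)
    return stage, params


def is_enabled(params):
    """Assumption: a directive is active unless enable is false or off."""
    return params.get("enable", "true").lower() not in ("false", "off")
```

Applied to the two directives in the example above, the first is reported as enabled and the second as disabled, which matches the effect of commenting out the second line.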


Setup Functions

This section describes the functions that are used during the setup phase by both enumeration and generation filters. The following functions are described: filterrules-setup, setup-regex-cache, and setup-type-by-extension.

filterrules-setup

The filterrules-setup function loads the filter rules to be used by the filter. When you use this function, the logtype parameter specifies the type of log file to use. The value can be verbose, normal, or terse.

Parameters

Table 15-3 lists the parameter used with the filterrules-setup function. The table contains two columns. The first column lists the parameter, and the second column provides a description.

Table 15-3  filterrules-setup Parameters

Parameter

Description

config

Path name to the file containing the filter rules to be used by this filter.

Example

Setup fn=filterrules-setup config=./config/filterrules.conf logtype=normal

setup-regex-cache

The setup-regex-cache function initializes the cache size for the filter-by-regex and generate-by-regex functions. Use this function to specify a number other than the default of 32.

Parameters

Table 15-4 lists the parameter used with the setup-regex-cache function. The table contains two columns. The first column lists the parameter, and the second column provides a description.

Table 15-4  setup-regex-cache Parameter

Parameter

Description

cache-size

Maximum number of compiled regular expressions to be kept in the regex cache.

Example

Setup fn=setup-regex-cache cache-size=28

setup-type-by-extension

The setup-type-by-extension function configures the filter to recognize file name extensions. It must be called before the assign-type-by-extension function can be used. The file specified as a parameter must contain mappings between standard MIME content types and file extension strings.

Parameters

Table 15-5 lists the parameter used with the setup-type-by-extension function. The table contains two columns. The first column lists the parameter, and the second column provides a description.

Table 15-5  setup-type-by-extension Parameter

Parameter

Description

file

Name of the MIME types configuration file.

Example

Setup fn=setup-type-by-extension file=./config/mime.types


Filtering Functions

The following functions operate at the MetaData and Data stages to allow or deny resources based on specific criteria specified by the function and its parameters.

These functions can be used in both Enumeration and Generation filters in the file filter.conf.

Each “filter-by” function performs a comparison, then either allows or denies the resource. Allowing the resource means that processing continues to the next filtering step. Denying the resource means that processing should stop, because the resource does not meet the criteria for further enumeration or generation. The following functions are described: filter-by-exact, filter-by-max, filter-by-md5, filter-by-prefix, filter-by-regex, and filterrules-process.

filter-by-exact

The filter-by-exact function allows or denies the resource if the allow/deny string matches the source of information exactly. The keyword all matches any string.

Parameters

Table 15-6 lists the parameters used with the filter-by-exact function. The table contains two columns. The first column lists the parameter, and the second column provides a description.

Table 15-6  filter-by-exact Parameters

Parameter

Description

src

Source of information.

allow/deny

Contains a string.

Example

The following example filters out all resources whose content-type is text/plain. It allows all other resources to proceed:

Data fn=filter-by-exact src=type deny=text/plain
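The allow and deny semantics can be modeled in a few lines. The following Python sketch is an illustration only: the function name filter_by_exact and the dictionary of sources are hypothetical, and the handling of an allow string that does not match (the resource is denied) is an assumption, since the text above only states what happens on a match.

```python
def filter_by_exact(sources, src, allow=None, deny=None):
    """Return True to let the resource proceed, False to deny it."""
    value = sources.get(src)
    if deny is not None:
        # The keyword all matches any string.
        matched = deny == "all" or value == deny
        return not matched
    if allow is not None:
        # Assumption: a resource that does not match allow is denied.
        return allow == "all" or value == allow
    raise ValueError("specify either allow or deny")
```

With deny set to text/plain, a resource whose type is text/plain is denied, and a resource whose type is text/html proceeds to the next filtering step.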

filter-by-max

The filter-by-max function allows the resource if the specified information source is less than or equal to the given value. It denies the resource if the information source is greater than the specified value.

This function can be called no more than once per filter.

Parameters

Table 15-7 lists the parameters used with the filter-by-max function. The table contains two columns. The first column lists the parameter, and the second column provides a description.

Table 15-7  filter-by-max Parameters

Parameter

Description

src

Source of information. It must be one of the following: hosts, objects, or depth.

value

Specifies a value for comparison.

Example

This example allows resources whose content-length is less than 1024 K:

MetaData fn=filter-by-max src=content-length value=1024

filter-by-md5

The filter-by-md5 function only allows the first resource with a given MD5 checksum value. If the current resource’s MD5 has been seen in an earlier resource by this robot, the current resource is denied. As a result, duplication of identical resources or single resources with multiple URLs is prevented.

You can only call this function at the Data stage or later. It can be called no more than once per filter. The filter must invoke the generate-md5 function to generate an MD5 checksum before invoking filter-by-md5.

Parameters

none

Example

The following example shows the typical method of handling MD5 checksums by first generating the checksum and then filtering based on it:

Data fn=generate-md5

Data fn=filter-by-md5
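The following Python sketch models the generate-then-filter sequence to show why the order matters. It is not the robot's implementation. The class name Md5Deduplicator is hypothetical, and computing the checksum over the raw content bytes is an assumption.

```python
import hashlib


class Md5Deduplicator:
    """Remember checksums already seen and deny resources that repeat one."""

    def __init__(self):
        self._seen = set()

    def generate_md5(self, sources, content):
        # Assumption: the checksum is computed over the resource's bytes.
        sources["md5"] = hashlib.md5(content).hexdigest()

    def filter_by_md5(self, sources):
        # Raises KeyError if generate_md5 was not called first.
        digest = sources["md5"]
        if digest in self._seen:
            return False  # duplicate content: deny
        self._seen.add(digest)
        return True  # first occurrence: allow
```

Two URLs that return identical content produce the same checksum, so only the first one is allowed.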

filter-by-prefix

The filter-by-prefix function allows or denies the resource if the given information source begins with the specified prefix string. The resource doesn’t have to match completely. The keyword all matches any string.

Parameters

Table 15-8 lists the parameters used with the filter-by-prefix function. The table contains two columns. The first column lists the parameter, and the second column provides a description.

Table 15-8   filter-by-prefix Parameters

Parameter

Description

src

Source of information.

allow/deny

Contains a string for prefix comparison.

Example

The following example allows resources whose content-type is any kind of text, including text/html and text/plain:

MetaData fn=filter-by-prefix src=type allow=text

filter-by-regex

The filter-by-regex function supports regular expression pattern matching. It allows resources that match the given regular expression. The supported regular expression syntax is defined by the POSIX.1 specification. The regular expression \\* matches anything.

Parameters

Table 15-9 lists the parameters used with the filter-by-regex function. The table contains two columns. The first column lists the parameter, and the second column provides a description.

Table 15-9  filter-by-regex Parameters

Parameter

Description

src

Source of information.

allow/deny

Contains a regular expression string.

Example

The following example denies all resources from sites in the government domain:

MetaData fn=filter-by-regex src=host deny=\\*.gov

filterrules-process

The filterrules-process function processes the rules in the filterrules.conf file.

Parameters

none

Example

MetaData fn=filterrules-process


Filtering Support Functions

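The following functions are used during filtering to manipulate or generate information on the resource. The robot can then process the resource by calling filtering functions. These functions can be used in Enumeration and Generation filters in the file filter.conf. The following functions are described: assign-source, assign-type-by-extension, clear-source, convert-to-html, copy-attribute, generate-by-exact, generate-by-prefix, generate-by-regex, generate-md5, generate-rd-expires, generate-rd-last-modified, and rename-attribute.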

assign-source

The assign-source function assigns a new value to a given information source. This permits editing during the filtering process. The function can assign an explicit new value, or it can copy a value from another information source.

Parameters

Table 15-10 lists the parameters used with the assign-source function. The table contains two columns. The first column lists the parameter, and the second column provides a description.

Table 15-10  assign-source Parameters

Parameter

Description

dst

Name of the source whose value is to be changed.

value

Specifies an explicit value.

src

Information source to copy to dst.

You must specify either a value parameter or a src parameter, but not both.

Example

Data fn=assign-source dst=type src=content-type
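The rule that exactly one of value or src must be given can be expressed as a small check. The following Python sketch is illustrative only; the function name assign_source and the dictionary of sources are hypothetical.

```python
def assign_source(sources, dst, value=None, src=None):
    """Set sources[dst] from an explicit value or by copying another source."""
    if (value is None) == (src is None):
        raise ValueError("specify exactly one of value or src")
    sources[dst] = value if value is not None else sources[src]
```

Called with dst set to type and src set to content-type, the sketch copies the content type into the type source, mirroring the example above.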

assign-type-by-extension

The assign-type-by-extension function uses the resource’s file name to determine its type and assigns this type to the resource for further processing.

The setup-type-by-extension function must be called during setup before assign-type-by-extension can be used.

Parameters

Table 15-11 lists the parameter used with the assign-type-by-extension function. The table contains two columns. The first column lists the parameter, and the second column provides a description.

Table 15-11  assign-type-by-extension Parameter

Parameter

Description

src

Source of file name to compare. If you do not specify a source, the default is the resource’s path.

Example

MetaData fn=assign-type-by-extension
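As a rough illustration of the lookup this function performs, the following Python sketch maps the extension of a path to a MIME type. The MIME_BY_EXTENSION table is a hypothetical stand-in for the mappings that setup-type-by-extension loads, and storing the result in a source named type is an assumption.

```python
import posixpath

# Hypothetical mappings standing in for the MIME types configuration file.
MIME_BY_EXTENSION = {
    "html": "text/html",
    "htm": "text/html",
    "txt": "text/plain",
    "pdf": "application/pdf",
}


def assign_type_by_extension(sources, src="path"):
    """Look up the extension of sources[src] and record the matching type."""
    extension = posixpath.splitext(sources[src])[1].lstrip(".").lower()
    mime_type = MIME_BY_EXTENSION.get(extension)
    if mime_type is not None:
        # Assumption: the assigned type is stored in the "type" source.
        sources["type"] = mime_type
    return mime_type
```

A path such as /documents/listing.html yields text/html, while a path with no extension, such as /, leaves the type unassigned.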

clear-source

The clear-source function deletes the specified data source. You typically do not need to perform this function. You can create or replace a source by using the assign-source function.

Parameters

Table 15-12 lists the parameter used with the clear-source function. The table contains two columns. The first column lists the parameter, and the second column provides a description.

Table 15-12  clear-source Parameter

Parameter

Description

src

Name of source to delete.

Example

The following example deletes the path source:

MetaData fn=clear-source src=path

convert-to-html

The convert-to-html function converts the current resource into an HTML file for further processing, if its type matches a specified MIME type. The conversion filter automatically detects the type of the file it is converting.

Parameters

Table 15-13 lists the parameter used with the convert-to-html function. The table contains two columns. The first column lists the parameter, and the second column provides a description.

Table 15-13  convert-to-html Parameter

Parameter

Description

type

MIME type from which to convert.

Example

The following sequence of function calls causes the filter to convert all Adobe Acrobat PDF files, Microsoft RTF files, and FrameMaker MIF files to HTML, as well as any files whose type was not specified by the server that delivered it.

Data fn=convert-to-html type=application/pdf

Data fn=convert-to-html type=application/rtf

Data fn=convert-to-html type=application/x-mif

Data fn=convert-to-html type=unknown

copy-attribute

The copy-attribute function copies the value from one field in the resource description into another.

Parameters

Table 15-14 lists the parameters used with the copy-attribute function. The table contains two columns. The first column lists the parameter, and the second column provides a description.

Table 15-14  copy-attribute Parameters

Parameter

Description

src

Field in the resource description from which to copy.

dst

Item in the resource description into which to copy the source.

truncate

Maximum length of the source to copy.

clean

Boolean parameter indicating whether to fix truncated text (such as not leaving partial words). This parameter is false by default.

Example

Generate fn=copy-attribute \

src=partial-text dst=description truncate=200 clean=true
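The interaction between truncate and clean is easiest to see in code. The following Python sketch is one possible interpretation, not the robot's implementation: the function name copy_attribute is hypothetical, truncation is counted in characters, and clean is assumed to drop a partial trailing word.

```python
def copy_attribute(rd, src, dst, truncate=None, clean=False):
    """Copy rd[src] into rd[dst], optionally truncating the copied text."""
    text = rd[src]
    if truncate is not None and len(text) > truncate:
        cut = text[:truncate]
        if clean:
            if not text[truncate].isspace():
                # The cut landed inside a word: drop the partial word.
                head, sep, _ = cut.rpartition(" ")
                if sep:
                    cut = head
            cut = cut.rstrip()
        text = cut
    rd[dst] = text
    return text
```

For the text "Robot application functions" and truncate=10, the plain copy is "Robot appl", while the copy with clean=true is "Robot".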

generate-by-exact

The generate-by-exact function generates a source with a specified value, but only if an existing source exactly matches another value.

Parameters

Table 15-15 lists the parameters used with the generate-by-exact function. The table contains two columns. The first column lists the parameter, and the second column provides a description.

Table 15-15  generate-by-exact Parameters

Parameter

Description

dst

Name of source to generate.

value

Value to assign dst.

src

Source against which to match.

match

Value to compare to src.

Example

The following example sets the classification to Siroe if the host is www.siroe.com.

Generate fn="generate-by-exact" match="www.siroe.com:80" src="host" value="Siroe" dst="classification"

generate-by-prefix

The generate-by-prefix function generates a source with a specified value, but only if the prefix of an existing source matches another value.

Parameters

Table 15-16 lists the parameters used with the generate-by-prefix function. The table contains two columns. The first column lists the parameter, and the second column provides a description.

Table 15-16  generate-by-prefix Parameters

Parameter

Description

dst

Name of the source to generate.

value

Value to assign to dst.

src

Source against which to match.

match

Value to compare to src.

Example

The following example sets the classification to World Wide Web if the protocol prefix is http:

Generate fn="generate-by-prefix" match="http" src="protocol" value="World Wide Web" dst="classification"
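The conditional assignment performed by the generate-by functions can be modeled as follows. This Python sketch covers the prefix variant only and is illustrative; the function name generate_by_prefix and the dictionary of sources are hypothetical.

```python
def generate_by_prefix(sources, src, match, dst, value):
    """Set sources[dst] to value when sources[src] starts with match."""
    if sources.get(src, "").startswith(match):
        sources[dst] = value
        return True
    return False
```

Because the comparison is a prefix match, a match value of http also matches the https protocol. An exact comparison would not.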

generate-by-regex

The generate-by-regex function generates a source with a specified value, but only if an existing source matches a regular expression.

Parameters

Table 15-17 lists the parameters used with the generate-by-regex function. The table contains two columns. The first column lists the parameter, and the second column provides a description.

Table 15-17  generate-by-regex Parameters

Parameter

Description

dst

Name of the source to generate.

value

Value to assign to dst.

src

Source against which to match.

match

Regular expression string to compare to src.

Example

The following example sets the classification to Siroe if the host name matches the regular expression *.siroe.com. For example, resources at both developer.siroe.com and home.siroe.com will be classified as Siroe:

Generate fn="generate-by-regex" match="\\*.siroe.com" src="host" value="Siroe" dst="classification"

generate-md5

The generate-md5 function generates an MD5 checksum and adds it to the resource. You can then use the filter-by-md5 function to deny resources with duplicate MD5 checksums.

Parameters

none

Example

Data fn=generate-md5

generate-rd-expires

The generate-rd-expires function generates an expiration date and adds it to the specified source. The function uses metadata such as the HTTP header and HTML <META> tags to obtain any expiration data from the resource. If none exists, it generates an expiration date three months from the current date.

Parameters

Table 15-18 lists the parameter used with the generate-rd-expires function. The table contains two columns. The first column lists the parameter, and the second column provides a description.

Table 15-18  generate-rd-expires Parameter

Parameter

Description

dst

Name of the source. If you omit it, it defaults to rd-expires.

Example

Generate fn=generate-rd-expires
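The fallback behavior can be sketched as follows. This Python model is an illustration under stated assumptions: the function name generate_rd_expires is hypothetical, any expiration data found in the resource is assumed to be available in a source named expires, and "three months" is approximated as 90 days.

```python
from datetime import datetime, timedelta


def generate_rd_expires(sources, dst="rd-expires", now=None):
    """Store an expiration date in sources[dst] and return it."""
    now = now or datetime.now()
    expires = sources.get("expires")
    if expires is None:
        # Assumption: "three months" is approximated as 90 days.
        expires = now + timedelta(days=90)
    sources[dst] = expires
    return expires
```

A resource that already carries an expiration date keeps it, and only resources without one receive the generated default.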

generate-rd-last-modified

The generate-rd-last-modified function adds the current time to the specified source.

Parameters

Table 15-19 lists the parameter used with the generate-rd-last-modified function. The table contains two columns. The first column lists the parameter, and the second column provides a description.

Table 15-19  generate-rd-last-modified Parameter

Parameter

Description

dst

Name of the source. If you omit it, it defaults to rd-last-modified.

Example

Generate fn=generate-rd-last-modified

rename-attribute

The rename-attribute function changes the name of a field in the resource description. It is most useful in cases where, for example, extract-html-meta copies information from a <META> tag into a field, and you want to change the name of the field.

Parameters

Table 15-20 lists the parameter used with the rename-attribute function. The table contains two columns. The first column lists the parameter, and the second column provides a description.

Table 15-20  rename-attribute Parameter

Parameter

Description

src

String containing a mapping from one name to another.

Example

The following example renames an attribute from author to author-name:

Generate fn=rename-attribute src="author->author-name"
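The mapping string uses the -> operator to separate the old name from the new one. The following Python sketch shows one way such a mapping could be applied to a resource description held as a dictionary; the function name rename_attribute is hypothetical, and leaving the description unchanged when the old field is absent is an assumption.

```python
def rename_attribute(rd, mapping):
    """Rename a field in rd according to a mapping such as 'old->new'."""
    old, sep, new = mapping.partition("->")
    if not sep or not old or not new:
        raise ValueError("expected a mapping of the form old->new")
    if old in rd:
        # Assumption: a missing field is silently ignored.
        rd[new] = rd.pop(old)
```

After the call, the value that was stored under author is stored under author-name, and the author field no longer exists.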


Enumeration Functions

The following functions operate at the Enumerate stage. These functions control whether and how a robot gathers links from a given resource to use as starting points for further resource discovery. The following functions are described in this section: enumerate-urls and enumerate-urls-from-text.

enumerate-urls

The enumerate-urls function scans the resource and enumerates all URLs found in hypertext links. The results are used to spawn further resource discovery. You can specify a content-type to restrict the kind of URLs enumerated.

Parameters

Table 15-21 lists the parameters used with the enumerate-urls function. The table contains two columns. The first column lists the parameter, and the second column provides a description.

Table 15-21  enumerate-urls Parameters

Parameter

Description

max

The maximum number of URLs to spawn from a given resource. The default, if max is omitted, is 1024.

type

Content-type that restricts enumeration to those URLs that have the specified content-type. type is an optional parameter. If omitted, it will enumerate all URLs.

Example

The following example enumerates HTML URLs only, up to a maximum of 1024:

Enumerate fn=enumerate-urls type=text/html
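To make the effect of max concrete, the following Python sketch collects the targets of hypertext links from an HTML string and stops at a limit. It uses the standard html.parser module and is only a model of the behavior described above. The names LinkCollector and enumerate_urls are hypothetical, only anchor tags are examined, and the type restriction is not modeled.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from anchor tags, up to max_urls entries."""

    def __init__(self, max_urls=1024):
        super().__init__()
        self.max_urls = max_urls
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a" or len(self.urls) >= self.max_urls:
            return
        href = dict(attrs).get("href")
        if href:
            self.urls.append(href)


def enumerate_urls(html, max_urls=1024):
    """Return the link targets found in html, limited to max_urls."""
    collector = LinkCollector(max_urls)
    collector.feed(html)
    return collector.urls
```

With three links in a page and a limit of 2, only the first two targets are returned.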

enumerate-urls-from-text

The enumerate-urls-from-text function scans text resources, looking for strings matching this regular expression: URL:.*. It spawns robots to enumerate the URLs from these strings and generate further resource descriptions.

Parameters

Table 15-22 lists the parameter used with the enumerate-urls-from-text function. The table contains two columns. The first column lists the parameter, and the second column provides a description.

Table 15-22  enumerate-urls-from-text Parameter

Parameter

Description

max

The maximum number of URLs to spawn from a given resource. The default, if max is omitted, is 1024.

Example

Enumerate fn=enumerate-urls-from-text
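The pattern URL:.* can be tried out with Python's re module, in which the dot does not match a newline, so each match ends at the end of its line. The following sketch is illustrative: the function name enumerate_urls_from_text is hypothetical, and treating everything after URL: on the line as the URL is an assumption.

```python
import re

# The pattern from the text above, with a group around the part after "URL:".
URL_LINE = re.compile(r"URL:(.*)")


def enumerate_urls_from_text(text, max_urls=1024):
    """Return the strings that follow 'URL:' in text, limited to max_urls."""
    urls = [match.group(1).strip() for match in URL_LINE.finditer(text)]
    return [url for url in urls if url][:max_urls]
```

Lines of plain text that do not contain the URL: marker contribute nothing to the result.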


Generation Functions

The following functions are used in the Generate stage of filtering. Generation functions can generate information that goes into a resource description. In general, they either extract information from the body of the resource itself or copy information from the resource’s metadata. The following functions are described in this section: extract-full-text, extract-html-meta, extract-html-text, extract-html-toc, extract-source, and harvest-summarizer.

extract-full-text

The extract-full-text function extracts the complete text of the resource and adds it to the resource description.


Note

The extract-full-text function should be used with caution, because it can significantly increase the size of the resource description, causing database bloat and an overall negative impact on network bandwidth.


Parameters

Table 15-23 lists the parameters used with the extract-full-text function. The table contains two columns. The first column lists the parameter, and the second column provides a description.

Table 15-23  extract-full-text Parameters

Parameter

Description

truncate

The maximum number of characters to extract from the resource.

dst

Name of the schema item that will receive the full text.

Example

Generate fn=extract-full-text

extract-html-meta

The extract-html-meta function extracts any <META> or <TITLE> information from an HTML file and adds it to the resource description. A content-type may be specified to restrict the kind of URLs that are generated.

Parameters

Table 15-24 lists the parameters used with the extract-html-meta function. The table contains two columns. The first column lists the parameter, and the second column provides a description.

Table 15-24  extract-html-meta Parameters

Parameter

Description

truncate

The maximum number of bytes to extract.

type

Optional parameter. If omitted, it will generate all URLs.

Example

Generate fn=extract-html-meta truncate=255 type=text/html
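The following Python sketch shows the kind of extraction this function performs, using the standard html.parser module. It is a model, not the robot's implementation. The names MetaExtractor and extract_html_meta are hypothetical, field names are lowercased, and truncate is applied per field in characters rather than bytes; all three choices are assumptions.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect <META name/content> pairs and the <TITLE> text."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.fields[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.fields["title"] = self.fields.get("title", "") + data


def extract_html_meta(html, truncate=None):
    """Return the extracted fields, each cut to truncate characters if given."""
    parser = MetaExtractor()
    parser.feed(html)
    if truncate is not None:
        return {name: text[:truncate] for name, text in parser.fields.items()}
    return parser.fields
```

The resulting fields, such as author, description, and keywords, correspond to the <META> data that is listed as a source at the Data stage.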

extract-html-text

The extract-html-text function extracts the first few characters of text from an HTML file, excluding the HTML tags, and adds the text to the resource description. This permits the first part of a document’s text to be included in the RD. A content-type may be specified to restrict the kind of URLs that are generated.

Parameters

Table 15-25 lists the parameters used with the extract-html-text function. The table contains two columns. The first column lists the parameter, and the second column provides a description.

Table 15-25  extract-html-text Parameters

Parameter

Description

truncate

The maximum number of bytes to extract.

skip-headings

Set to true to ignore any HTML headers that occur in the document.

type

Optional parameter. If omitted, it will generate all URLs.

Example

Generate fn=extract-html-text truncate=255 type=text/html skip-headings=true

extract-html-toc

The extract-html-toc function extracts the table of contents from the HTML headers and adds it to the resource description.

Parameters

Table 15-26 lists the parameters used with the extract-html-toc function. The table contains two columns. The first column lists the parameter, and the second column provides a description.

Table 15-26   extract-html-toc Parameters

Parameter

Description

truncate

The maximum number of bytes to extract.

level

Maximum HTML header level to extract. This parameter controls the depth of the table of contents.

Example

Generate fn=extract-html-toc truncate=255 level=3

extract-source

The extract-source function extracts the specified values from the given sources and adds them to the resource description.

Parameters

Table 15-27 lists the parameter used with the extract-source function. The table contains two columns. The first column lists the parameter, and the second column provides a description.

Table 15-27  extract-source Parameter

Parameter

Description

src

List of source names; you can use the -> operator to define a new name for the RD attribute, for example, type->content-type would take the value of the source named type and save it in the RD under the attribute named content-type.

Example

Generate fn=extract-source src="md5,depth,rd-expires,rd-last-modified"
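The src list syntax, including the optional -> rename, can be modeled as follows. This Python sketch is illustrative only: the function name extract_source is hypothetical, sources and the resource description are held as dictionaries, and skipping a source that does not exist is an assumption.

```python
def extract_source(sources, rd, src):
    """Copy each listed source into rd, honoring 'name->attribute' renames."""
    for item in src.split(","):
        name, sep, attribute = item.strip().partition("->")
        if name in sources:
            # Without the -> operator, the attribute keeps the source name.
            rd[attribute if sep else name] = sources[name]
```

An entry such as type->content-type stores the value of the type source under the content-type attribute, while a plain entry such as md5 keeps its own name.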

harvest-summarizer

The harvest-summarizer function runs a Harvest summarizer on the resource and adds the result to the resource description.

To run Harvest summarizers, you must have $HARVEST_HOME/lib/gatherer in your path before you run the robot.

Parameters

Table 15-28 lists the parameter used with the harvest-summarizer function. The table contains two columns. The first column lists the parameter, and the second column provides a description.

Table 15-28  harvest-summarizer Parameter

Parameter

Description

summarizer

Name of the summarizer program.

Example

Generate fn=harvest-summarizer summarizer=HTML.sum


Shutdown Functions

The following function can be used during the shutdown phase by both enumeration and generation filters.

filterrules-shutdown

After the rules are run, the filterrules-shutdown function performs cleanup and shutdown tasks.

Parameters

none

Example

Shutdown fn=filterrules-shutdown





Copyright 2005 Sun Microsystems, Inc. All rights reserved.