Compass Server 3.0 Developer's Guide

Chapter 10
The Pre-defined Robot Application Functions

This chapter describes the pre-defined Netscape Robot Application Functions (RAFs), providing descriptions, parameter specifications, and examples of each one. You can use these functions in the filter.conf file to create and modify filter definitions. The file filter.conf lives in the directory compass-installdir/compass-name/config.

The file filter.conf contains definitions for the enumeration filter and the generation filter. Each of these filters can invoke a set of filter rules, which are stored in the file filterrules.conf. The general idea is that the filter definitions contain instructions that are specific to each filter, while the filter rules contain the rules used by both filters.

If you are interested, you can examine the file filterrules.conf to see how filter rules are defined. You should not need to manually edit this file, however, since you can create filter rules interactively by using the Robot page in the Compass Server Administration Interface.

You can examine the file filter.conf to see an example of filter definitions. You only need to edit the filter.conf file manually if you want to modify the filters in a way that is not accommodated in the interface, such as instructing the robot to enumerate some resources without generating resource descriptions for them.

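This chapter first covers sources and destinations and the enable parameter, then describes each of the basic categories of functions: Setup, Filtering, Filtering Support, Enumeration, Generation, and Shutdown.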

Sources and Destinations

Most of the Robot Application Functions (RAFs) require sources of information and generate data that goes to destinations. The sources are defined within the robot itself, and are not necessarily related to the fields in the resource description it ultimately generates. Destinations, on the other hand, are generally the names of fields in the resource description, as defined by the resource description server's schema.

For details of using the Compass Server Administration Interface to determine the database schema, see the Compass Server Administrator's Guide (which you can get by pressing the Manuals button in the Compass Server Administration Interface).

Sources Available at the Setup Stage

At the Setup stage, the filter is busy setting up, and cannot yet get information about the resource's URL or content.

Sources Available at the MetaData Filtering Stage

At the MetaData stage, the robot has encountered the URL for a resource but has not yet downloaded the resource's content. Information is therefore available about the URL itself, as well as data derived from other sources such as the filter.conf file. Information about the content of the resource is not yet available.

During the MetaData phase, the following sources are available to RAFs:

Source              Meaning                                        Example
csid                Catalog Server ID                              x-catalog://compass.mydomain.com:8086/alexandria
depth               Number of links traversed from starting point  5
enumeration-filter  Name of Enumeration filter
generation-filter   Name of Generation filter
host                Host portion of the URL                        home.netscape.com
ip                  Numeric version of host                        198.95.249.6
protocol            Access portion of the URL                      http, ftp, gopher
uri                 Path portion of the URL                        /, /index.html, /documents/listing.html
url                 Complete URL                                   http://developer.netscape.com/docs/manuals/
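
As a rough illustration of how the URL-derived sources relate to one another, the following Python sketch splits a URL into values resembling the protocol, host, uri, and url sources. It is not the robot's own code: the function name url_sources is hypothetical, and the sketch assumes that the host value carries no port number, which this chapter does not state.

```python
from urllib.parse import urlsplit


def url_sources(url: str) -> dict:
    """Return URL-derived values keyed by the robot's source names (illustrative only)."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,      # access portion, for example "http"
        "host": parts.hostname or "",  # host portion; port dropped (assumption)
        "uri": parts.path or "/",      # path portion; an empty path is treated as "/"
        "url": url,                    # the complete URL, unchanged
    }


for name, value in url_sources("http://developer.netscape.com/docs/manuals/").items():
    print(f"{name}: {value}")
```

The csid, depth, and ip sources are left out because they depend on the robot's configuration and run-time state rather than on the URL text alone.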

Sources Available at the Data Stage

At the Data stage, the robot has already downloaded the content of the resource at the URL, so it can access data about the content, such as the description and the author.

If the resource is an HTML file, the Robot parses the <META> tags in the HTML headers. Thus any data contained in <META> tags is available at the Data stage.

During the Data phase, the following sources are available to RAFs, in addition to those available during the MetaData phase:

Source               Meaning                                    Example
content-charset      Character set used by the resource
content-encoding     Any form of encoding
content-length       Size of the resource in bytes
content-type         MIME type of the resource                  text/html, image/jpeg
expires              Date the resource itself expires
last-modified        Date the resource was last modified
data in <META> tags  Any data that is provided in <META> tags   Author, Description, Keywords
                     in the header of HTML resources

All these sources (except for the data in <META> tags) are derived from the HTTP response header returned when retrieving the resource.

Sources Available at the Enumeration, Generation, and Shutdown Stages

At the Enumeration and Generation stages, the same data sources are available as for the Data stage.

At the Shutdown stage, the filter has done its filtering and is shutting down. Although functions written for this stage can use the same data sources as those available at the Data stage, shutdown functions usually restrict their operations to shutdown and cleanup activities.

Enable Parameter

Each function can have an enable parameter. The values can be true, false, on, or off. The Compass Server Administration Interface uses these parameters to turn certain directives on or off.

The following example enables enumeration for text/html and disables enumeration for text/plain.

#  Perform the enumeration on HTML only 
Enumerate enable=true fn=enumerate-urls max=1024 type=text/html 
Enumerate enable=false fn=enumerate-urls-from-text max=1024 type=text/plain 
Adding enable=false or enable=off has the same effect as commenting out the line. (The Compass Server Administration Interface does not write comments, so it writes an enable parameter instead.)
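
To make the effect of the enable parameter concrete, here is a small Python sketch that splits a directive line into its stage and parameters and decides whether the line is active. The helper names parse_directive and is_enabled are hypothetical, shlex stands in for whatever parser the robot really uses, and treating a missing enable parameter as enabled is an assumption based on the statement above that enable=false behaves like a commented-out line.

```python
import shlex


def parse_directive(line: str):
    """Split one directive line into (stage, params); return None for blank or comment lines."""
    tokens = shlex.split(line, comments=True)  # drops text after '#', strips quotes
    if not tokens:
        return None
    params = {}
    for token in tokens[1:]:
        name, _, value = token.partition("=")
        params[name] = value
    return tokens[0], params


def is_enabled(params: dict) -> bool:
    """A directive is active unless enable is false or off (missing means enabled: assumption)."""
    return params.get("enable", "true").lower() in ("true", "on")


lines = [
    "#  Perform the enumeration on HTML only",
    "Enumerate enable=true fn=enumerate-urls max=1024 type=text/html",
    "Enumerate enable=false fn=enumerate-urls-from-text max=1024 type=text/plain",
]
for line in lines:
    parsed = parse_directive(line)
    if parsed is not None and is_enabled(parsed[1]):
        print("active:", parsed[1]["fn"])
```

Only the enumerate-urls directive is reported as active here; the comment line and the enable=false line are both skipped, which is the equivalence the text describes.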

Setup Functions

These functions are used during the setup phase by both enumeration and generation filters.


filterrules-setup

Parameters
config is the pathname to the file containing the filter rules to be used by this filter.

logtype is the type of log file to use. The value can be verbose, normal, or terse.

Example
Setup fn=filterrules-setup config=./config/filterrules.conf logtype=normal

setup-regex-cache

This function initializes the cache size used by the filter-by-regex and generate-by-regex functions. Use this function to specify a number other than the default of 32.

Parameters
cache-size is the maximum number of compiled regular expressions to be kept in the regex cache.

Example
Setup fn=setup-regex-cache cache-size=28


setup-type-by-extension

This function configures the filter to recognize file name extensions. It must be called before the assign-type-by-extension function can be used. The file specified as a parameter must contain mappings between standard MIME content types and file extension strings.

Parameters
file is the name of the MIME types configuration file.

Example
Setup fn=setup-type-by-extension file=./config/mime.types

Filtering Functions

These functions operate at the MetaData and Data stages to allow or deny resources based on criteria specified by the function and its parameters.

These functions can be used in both Enumeration and Generation filters in the file filter.conf.

Each of these "filter-by" functions performs a comparison, then either allows or denies the resource.


filter-by-exact

This function allows or denies the resource if the allow/deny string matches the source of information exactly. The keyword "all" matches any string.

Parameters
src is the source of information.

allow/deny contains a string.

Example
This example filters out all resources whose content-type is text/plain. It allows all other resources to proceed.

Data fn=filter-by-exact src=type deny=text/plain

filter-by-max

This function allows the resource if the specified information source is less than or equal to the given value. It denies the resource if the information source is greater than the specified value.

This function can be called no more than once per filter.

Parameters
src is the source of information. It must be one of the following: hosts, objects, or depth.

value contains a value for comparison.

Example
This example allows resources whose content-length is less than 1024 K.

MetaData fn=filter-by-max src=content-length value=1024

filter-by-md5

This function allows only the first resource with a given MD5 checksum value. If the current resource's MD5 has been seen in an earlier resource by this robot, the current resource is denied. This prevents duplication of identical resources or single resources that happen to have multiple URLs associated with them.

You can only call this function at the Data stage or later. It can be called no more than once per filter. The filter must invoke the generate-md5 function to generate an MD5 checksum before invoking filter-by-md5.

Parameters
none

Example
This example shows the typical way to handle MD5 checksums, first generating the checksum and then filtering based on it.

Data fn=generate-md5
Data fn=filter-by-md5
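
The duplicate-detection idea behind generate-md5 followed by filter-by-md5 can be sketched in Python as follows. This is an illustration under assumptions, not the robot's implementation: the class name Md5Filter is hypothetical, and the sketch computes the checksum over the raw content bytes, which this chapter does not specify.

```python
import hashlib


class Md5Filter:
    """Allow only the first resource seen with a given MD5 checksum."""

    def __init__(self):
        self.seen = set()  # checksums of resources already allowed

    def allow(self, content: bytes) -> bool:
        digest = hashlib.md5(content).hexdigest()  # the generate-md5 step
        if digest in self.seen:                    # the filter-by-md5 step
            return False
        self.seen.add(digest)
        return True


md5_filter = Md5Filter()
page = b"<html><body>Same page, two URLs</body></html>"
print(md5_filter.allow(page))  # first copy of this content
print(md5_filter.allow(page))  # identical content reached through another URL
```

Because the decision depends only on the content, the same page reached through two different URLs is allowed once and denied afterwards.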

filter-by-prefix

This function allows or denies the resource if the given information source begins with the specified prefix string. The resource doesn't have to match completely. The keyword "all" matches any string.

Parameters
src is the source of information.

allow/deny contains a string for prefix comparison.

Example
This example allows resources whose content-type is any kind of text, including text/html and text/plain.

MetaData fn=filter-by-prefix src=type allow=text
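
The comparisons performed by filter-by-exact and filter-by-prefix, including the "all" keyword, can be sketched in Python as below. The function names matches_exact and matches_prefix are hypothetical, and the sketch covers only the comparison itself; how the robot treats a resource that does not match an allow string is not described here and is left out.

```python
def matches_exact(pattern: str, value: str) -> bool:
    """True if value equals pattern; the keyword "all" matches any string."""
    return pattern == "all" or value == pattern


def matches_prefix(pattern: str, value: str) -> bool:
    """True if value begins with pattern; the keyword "all" matches any string."""
    return pattern == "all" or value.startswith(pattern)


for content_type in ("text/html", "text/plain", "image/jpeg"):
    print(
        content_type,
        "exact text/plain:", matches_exact("text/plain", content_type),
        "prefix text:", matches_prefix("text", content_type),
    )
```

With deny=text/plain, filter-by-exact denies only the second type in the loop, while allow=text in filter-by-prefix matches the first two.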

filter-by-regex

This function supports regular expression pattern matching. It allows or denies resources that match the given regular expression. The supported regular expression syntax is defined by the POSIX.1 specification. The regular expression \\* matches anything.

Parameters
src is the source of information.

allow/deny contains a regular expression string.

Example
This example denies all resources from sites in the government domain.

MetaData fn=filter-by-regex src=host deny=\\*.gov

filterrules-process

This function runs the rules in the filterrules.conf file.

Parameters
none

Example
MetaData fn=filterrules-process

Filtering Support Functions

These functions can be called during filtering to manipulate or generate information on the resource that the robot can then process by calling filtering functions. These functions can be used in Enumeration and Generation filters in the file filter.conf.


assign-source

This function assigns a new value to a given information source. This permits a form of editing during the filtering process. The function can assign an explicit new value or it can copy a value from another information source.

Parameters
dst is the name of the source whose value is to be changed.

value specifies an explicit value.

src specifies an information source to copy to dst.

Note
You must specify either a value parameter or a src parameter, but not both.

Example
Data fn=assign-source dst=type src=content-type

assign-type-by-extension

This function uses the resource's file name to determine its type and assigns this type to the resource for further processing.

Note
The setup-type-by-extension function must be called during setup before assign-type-by-extension can be used.

Parameters
src is the source of the file name to compare. If you do not specify a source, the default is the resource's URI.

Example
MetaData fn=assign-type-by-extension

clear-source

This function deletes the specified data source. You should rarely need to do this. You can create or replace a source using assign-source.

Parameters
src is the name of the source to delete

Example
The following example deletes the uri source:

MetaData fn=clear-source src=uri

convert-to-html

This function converts the current resource into an HTML file for further processing if its type matches a specified MIME type. The conversion filter itself automatically detects the type of the file it is converting, only converting those it can actually convert.

Parameters
type is the MIME type to convert from

Example
The following sequence of function calls causes the filter to convert all Adobe Acrobat PDF files, Microsoft RTF files, and FrameMaker MIF files to HTML, plus any files whose type was not specified by the server that delivered them.

Data fn=convert-to-html type=application/pdf
Data fn=convert-to-html type=application/rtf
Data fn=convert-to-html type=application/x-mif
Data fn=convert-to-html type=unknown

copy-attribute

This function copies the value from one field in the resource description into another.

Parameters
src is the field in the resource description to copy from

dst is the item in the resource description to copy the source into

truncate is the maximum length of the source to copy

clean is a Boolean parameter indicating whether to clean up truncated text (for example, by not leaving partial words). The default is false.

Example
The following example copies up to 200 bytes from the partial text of the resource into the description field of the resource description, cleaning up the truncated text:

Generate fn=copy-attribute \
   src=partial-text dst=description truncate=200 clean=true
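
The interaction between truncate and clean can be pictured with the Python sketch below. It is only one plausible reading: the function name truncate_text is hypothetical, the sketch counts characters rather than bytes, and it assumes that cleaning means dropping a trailing partial word, which is the one example of cleanup this section mentions.

```python
def truncate_text(text: str, limit: int, clean: bool = False) -> str:
    """Cut text to at most limit characters; with clean, drop a trailing partial word."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    if not clean:
        return cut
    # The cut split a word if the first dropped character is not whitespace.
    if not text[limit].isspace():
        head, space, _ = cut.rpartition(" ")
        if space:  # keep the cut as-is when it holds a single unbroken word
            cut = head
    return cut.rstrip()


sample = "alpha beta gamma"
print(repr(truncate_text(sample, 8)))
print(repr(truncate_text(sample, 8, clean=True)))
```

Without clean, the result ends in the fragment "be"; with clean=true, the fragment is removed and only the complete word remains.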

generate-by-exact

This function generates a source with a specified value, but only if an existing source exactly matches another value.

Parameters
dst is the name of the source to generate.

value is the value to assign to dst.

src is the source to match against.

match is the value to compare to src.

Example
This example sets the classification to Netscape if the host is www.netscape.com.

Generate fn="generate-by-exact" match="www.netscape.com:80" src="host" value="Netscape" dst="classification"

generate-by-prefix

This function generates a source with a specified value, but only if the prefix of an existing source matches another value.

Parameters
dst is the name of the source to generate

value is the value to assign to dst

src is the source to match against

match is the value to compare to src.

Example
This example sets the classification to Compass if the protocol prefix is http.

Generate fn="generate-by-prefix" match="http" src="protocol" value="Compass" dst="classification" 
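
The conditional assignment performed by generate-by-exact and generate-by-prefix can be sketched in Python like this. The sources are modeled as a plain dictionary and the function names are hypothetical; the sketch assumes that a missing src simply results in no assignment.

```python
def generate_by_exact(sources: dict, src: str, match: str, dst: str, value: str) -> bool:
    """Set sources[dst] to value only if sources[src] equals match exactly."""
    if sources.get(src) == match:
        sources[dst] = value
        return True
    return False


def generate_by_prefix(sources: dict, src: str, match: str, dst: str, value: str) -> bool:
    """Set sources[dst] to value only if sources[src] begins with match."""
    current = sources.get(src)
    if current is not None and current.startswith(match):
        sources[dst] = value
        return True
    return False


sources = {"protocol": "http", "host": "home.netscape.com"}
generate_by_prefix(sources, "protocol", "http", "classification", "Compass")
print(sources)
```

Note that a prefix comparison against "http" also matches a value such as "https", so an exact comparison is the safer choice when only one value should qualify.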

generate-by-regex

This function generates a source with a specified value, but only if an existing source matches a regular expression.

Parameters
dst is the name of the source to generate

value is the value to assign to dst

src is the source to match against

match is the regular expression string to compare to src.

Example
This example sets the classification to Netscape if the host name matches the regular expression \\*.netscape.com. For example, resources at both developer.netscape.com and home.netscape.com will be classified as Netscape.

Generate fn="generate-by-regex" match="\\*.netscape.com" src="host" value="Netscape" dst="classification" 

generate-md5

This function generates an MD5 checksum and adds it to the resource. You can then use the filter-by-md5 function to deny resources with duplicate MD5 checksums.

Parameters
none

Example
Data fn=generate-md5


generate-rd-expires

This function generates an expiration date and adds it to the specified source. The function uses metadata such as the HTTP header and HTML META tags to obtain any expiration data from the resource. If none exists, it generates an expiration date three months from the current date.

Parameters
dst is the name of the source. If you omit it, it defaults to rd-expires.

Example
Generate fn=generate-rd-expires


generate-rd-last-modified

This function adds the current time to the specified source.

Parameters
dst is the name of the source. If you omit it, it defaults to rd-last-modified.

Example
Generate fn=generate-rd-last-modified


rename-attribute

This function changes the name of a field in the resource description. It is most useful in cases where, for example, extract-html-meta copies information from a META tag into a field, and you want to change the name of the field.

Parameters
src is a string containing a mapping from one name into another.

Example
The following example renames an attribute from author to author-name:

Generate fn=rename-attribute src="author->author-name"

Enumeration Functions

These functions operate at the Enumerate stage. Use them in the definition for Enumeration filters. They control whether and how a robot gathers links from a given resource to use as starting points for further resource discovery.


enumerate-urls

This function scans the resource and enumerates all URLs found in hypertext links. The results are used to spawn further resource discovery. You can specify a content-type to restrict the kind of URLs enumerated.

Parameters
max is the maximum number of URLs to spawn from a given resource. The default, if max is omitted, is 1024.

type specifies a content-type that restricts enumeration to those URLs that have the specified content-type. type is optional. If omitted, it will enumerate all URLs.

Example
This example enumerates HTML URLs only, up to a maximum of 1024.

Enumerate fn=enumerate-urls type=text/html

enumerate-urls-from-text

This function scans text resources, looking for strings matching this regular expression: URL:.*. It spawns robots to enumerate the URLs from these strings and generate further resource descriptions.

Parameters
max is the maximum number of URLs to spawn from a given resource. The default, if max is omitted, is 1024.

Example
Enumerate fn=enumerate-urls-from-text
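
As an informal illustration of scanning text for the URL:.* pattern and honoring the max limit, consider the Python sketch below. The function name urls_from_text is hypothetical, Python's re module stands in for the robot's regular expression engine, and trimming whitespace around each match is an assumption.

```python
import re


def urls_from_text(text: str, max_urls: int = 1024) -> list:
    """Collect the text following each "URL:" marker, up to max_urls entries."""
    found = []
    for match in re.finditer(r"URL:(.*)", text):  # '.' stops at the end of the line
        candidate = match.group(1).strip()
        if candidate:
            found.append(candidate)
        if len(found) >= max_urls:
            break
    return found


listing = (
    "Mirror list\n"
    "URL: http://home.netscape.com/\n"
    "no link on this line\n"
    "URL:http://developer.netscape.com/docs/manuals/\n"
)
print(urls_from_text(listing))
print(urls_from_text(listing, max_urls=1))
```

The second call shows the effect of a smaller max value: scanning stops as soon as the limit is reached.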

Generation Functions

These functions can be used in the Generate stage of filtering. Use them in Generation filters. They generate information that goes into a resource description. In general, they either extract information from the body of the resource itself or copy information from the resource's metadata.


extract-full-text

This function extracts the complete text of the resource and adds it to the resource description. Use this function with caution: it can significantly increase the size of the resource description, which bloats the database and consumes more network bandwidth.

Parameters
truncate is the maximum number of characters to extract from the resource.

dst is the name of the schema item that will receive the full text.

Example
Generate fn=extract-full-text


extract-html-meta

This function extracts any META or TITLE information from an HTML file and adds it to the resource description. A content-type may be specified to restrict the kind of URLs that are generated.

Parameters
truncate is the maximum number of bytes to extract.

type is optional. If omitted, it will generate all URLs.

Example
Generate fn=extract-html-meta truncate=255 type=text/html


extract-html-text

This function extracts the first few characters of text from an HTML file, excluding the HTML tags, and adds the text to the resource description. This permits the first part of a document's text to be included in the RD. A content-type may be specified to restrict the kind of URLs that are generated.

Parameters
truncate is the maximum number of bytes to extract.

skip-headings, when set to true, causes the function to ignore any HTML headers that occur in the document.

type is optional. If omitted, it will generate all URLs.

Example
Generate fn=extract-html-text truncate=255 type=text/html skip-headings=true


extract-html-toc

This function extracts the table of contents from the HTML headers and adds it to the resource description.

Parameters
truncate is the maximum number of bytes to extract.

level is the maximum HTML header level to extract. This controls the depth of the table of contents.

Example
Generate fn=extract-html-toc truncate=255 level=3


extract-source

This function extracts the specified values from the given sources and adds them to the resource description.

Parameters
src is a list of source names. You can use the -> operator to define a new name for the RD attribute. For example, type->content-type takes the value of the source named type and saves it in the RD under the attribute named content-type.

Example
Generate fn=extract-source src="md5,depth,rd-expires,rd-last-modified"
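
The way a src list with the -> operator maps sources onto RD attributes can be sketched in Python as follows. The function name extract_source and the dictionary model of sources and RD are hypothetical, and skipping sources that have no value is an assumption.

```python
def extract_source(sources: dict, src_spec: str) -> dict:
    """Copy the listed sources into a new RD dict, renaming any "name->new-name" items."""
    rd = {}
    for item in src_spec.split(","):
        name, arrow, new_name = item.strip().partition("->")
        if name in sources:  # sources without a value are skipped (assumption)
            rd[new_name if arrow else name] = sources[name]
    return rd


sources = {"md5": "9e107d9d372bb6826bd81d3542a419d6", "depth": "5", "type": "text/html"}
print(extract_source(sources, "md5,depth,type->content-type"))
```

In this run, md5 and depth keep their names, while the value of type is stored in the RD under content-type.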


harvest-summarizer

This function runs a Harvest summarizer on the resource and adds the result to the resource description.

Note
To run Harvest summarizers, you must have $HARVEST_HOME/lib/gatherer in your PATH before you run the robot.

Parameters
summarizer is the name of the summarizer program.

Example
Generate fn=harvest-summarizer summarizer=HTML.sum

Shutdown Functions

These functions can be used during the shutdown phase by both enumeration and generation filters.


filterrules-shutdown

This function performs cleanup after the rules have been run.

Parameters
none

Example
Shutdown fn=filterrules-shutdown


Last Updated: 02/07/98 20:49:12

Any sample code included above is provided for your use on an "AS IS" basis, under the Netscape License Agreement - Terms of Use