filter.conf file to create and modify filter definitions. The file filter.conf lives in the directory compass-installdir/compass-name/config.
The file filter.conf contains definitions for the enumeration filter and the generation filter. Each of these filters can invoke a set of filter rules, which are stored in the file filterrules.conf. The general idea is that the filter definitions contain instructions that are specific to each filter, while the filter rules contain the rules used by both filters.
If you are interested, you can examine the file filterrules.conf to see how filter rules are defined. You should not need to edit this file manually, however, since you can create filter rules interactively by using the Robot page in the Compass Server Administration Interface.
You can examine the file filter.conf to see an example of filter definitions. You only need to edit the filter.conf file manually if you want to modify the filters in a way that is not accommodated in the interface, such as instructing the robot to enumerate some resources without generating resource descriptions for them.
This chapter describes each of the basic categories of functions:
filter.conf file. At this stage, however, information is not available about the content of the resource.
During the MetaData phase, the following sources are available to RAFs:
At the Data stage, the robot has already downloaded the content of the resource at the URL, so it can access data about the content, such as the description and the author.
If the resource is an HTML file, the robot parses the <META> tags in the HTML headers, so any data contained in <META> tags is available at the Data stage.
During the Data phase, the following sources are available to RAFs, in addition to those available during the MetaData phase:
All these sources (except for the data in <META> tags) are derived from the HTTP response header returned when retrieving the resource.
Sources Available at the Enumeration, Generation, and Shutdown Stages
At the Enumeration and Generation stages, the same data sources are available as for the Data stage.
At the Shutdown stage, the filter has done its filtering and is shutting down. Although functions written for this stage can use the same data sources as those available at the Data stage, shutdown functions usually restrict their operations to shutdown and cleanup activities.
Enable Parameter
Each function can have an enable parameter. The values can be true, false, on, or off. The Compass Server Administration Interface uses these parameters to turn certain directives on or off.
The following example enables enumeration for text/html and disables enumeration for text/plain.
# Perform the enumeration on HTML only
Enumerate enable=true fn=enumerate-urls max=1024 type=text/html
Enumerate enable=false fn=enumerate-urls-from-text max=1024 type=text/plain
Adding an enable=false or enable=off parameter has the same effect as commenting out the line. (The Compass Server Administration Interface does not write comments, so it writes an enable parameter instead.)
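The effect of the enable parameter can be illustrated with a short Python sketch. This is not Compass Server code. The names parse_directive and is_enabled are hypothetical, and the parsing rules (whitespace-separated name=value pairs, values without embedded spaces, a directive counting as enabled when no enable parameter is present) are assumptions drawn from the examples in this chapter.

```python
def parse_directive(line):
    """Split a filter.conf-style directive into (stage, params).

    Assumed form: Stage name=value name=value ...
    Returns None for blank lines and comment lines.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    stage, *pairs = line.split()
    params = {}
    for pair in pairs:
        name, _, value = pair.partition("=")
        params[name] = value.strip('"')  # drop optional double quotes
    return stage, params


def is_enabled(params):
    """enable=false and enable=off disable a directive.

    Assumption: a directive without an enable parameter is enabled.
    """
    return params.get("enable", "true").lower() not in ("false", "off")


LINES = [
    "# Perform the enumeration on HTML only",
    "Enumerate enable=true fn=enumerate-urls max=1024 type=text/html",
    "Enumerate enable=false fn=enumerate-urls-from-text max=1024 type=text/plain",
]

for line in LINES:
    parsed = parse_directive(line)
    if parsed and is_enabled(parsed[1]):
        # Only the text/html directive reaches this point.
        print(parsed[0], parsed[1]["fn"], parsed[1]["type"])
```

In this model a disabled directive is skipped exactly as a commented-out line would be.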
Setup Functions
These functions are used during the setup phase by both enumeration and generation filters.
filterrules-setup
Parameters
config is the pathname to the file containing the filter rules to be used by this filter.
logtype is the type of log file to use. The value can be verbose, normal, or terse.
Example
Setup fn=filterrules-setup config=./config/filterrules.conf logtype=normal
setup-regex-cache
Parameters
cache-size is the maximum number of compiled regular expressions to be kept in the regex cache.
Example
Setup fn=setup-regex-cache cache-size=28
setup-type-by-extension
Parameters
file is the name of the MIME types configuration file.
Example
Setup fn=setup-type-by-extension file=./config/mime.types
Filtering Functions
These five functions operate at the MetaData and Data stages to allow or deny resources based on criteria specified by the function and its parameters.
These functions can be used in both Enumeration and Generation filters in the file filter.conf.
Each of these "filter-by" functions performs a comparison, then either allows or denies the resource.
filter-by-exact
Parameters
src is the source of information.
allow/deny contains a string.
Example
This example denies resources whose content-type is text/plain. It allows all other resources to proceed.
Data fn=filter-by-exact src=type deny=text/plain
filter-by-max
Parameters
src is the source of information. It must be one of the following: hosts, objects, or depth.
value contains a value for comparison.
Example
MetaData fn=filter-by-max src=content-length value=1024
filter-by-md5
Use the generate-md5 function to generate an MD5 checksum before invoking filter-by-md5.
Parameters
none
Example
This example shows the typical way to handle MD5 checksums, first generating the checksum and then filtering based on it.
Data fn=generate-md5
Data fn=filter-by-md5
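The two-step pattern above (generate a checksum, then filter on it) can be sketched in Python as follows. This is only an illustration of duplicate detection by checksum, not the robot's implementation. The names generate_md5 and Md5Filter are hypothetical, and keeping the checksums in an in-memory set is an assumption.

```python
import hashlib


def generate_md5(content: bytes) -> str:
    """Return the MD5 checksum of the resource content as a hex string."""
    return hashlib.md5(content).hexdigest()


class Md5Filter:
    """Deny any resource whose checksum has already been seen."""

    def __init__(self):
        self._seen = set()  # assumption: checksums are kept in memory

    def allow(self, checksum: str) -> bool:
        if checksum in self._seen:
            return False  # duplicate content: deny
        self._seen.add(checksum)
        return True


md5_filter = Md5Filter()
for body in (b"<html>same page</html>", b"<html>same page</html>", b"<html>other</html>"):
    verdict = "allow" if md5_filter.allow(generate_md5(body)) else "deny"
    print(verdict)
```

The sketch shows why the order matters: the filter step has nothing to compare until the checksum exists.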
filter-by-prefix
Parameters
src is the source of information.
allow/deny contains a string for prefix comparison.
Example
This example allows resources whose content-type is any kind of text, including text/html and text/plain.
MetaData fn=filter-by-prefix src=type allow=text
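The allow and deny behavior of the exact and prefix comparisons can be modeled with the following Python sketch. The function names are hypothetical and the code is not taken from Compass Server. It assumes that exactly one of allow or deny is given, that deny rejects matching resources and lets all others proceed, and that allow lets only matching resources proceed.

```python
def exact(value, pattern):
    """Exact comparison, as in filter-by-exact."""
    return value == pattern


def prefix(value, pattern):
    """Prefix comparison, as in filter-by-prefix."""
    return value.startswith(pattern)


def filter_by(value, compare, allow=None, deny=None):
    """Return True if the resource may proceed.

    Assumptions: exactly one of allow/deny is given; deny rejects
    matches and passes everything else; allow passes only matches.
    """
    if (allow is None) == (deny is None):
        raise ValueError("specify exactly one of allow or deny")
    if deny is not None:
        return not compare(value, deny)
    return compare(value, allow)


# src=type deny=text/plain with an exact comparison
print(filter_by("text/plain", exact, deny="text/plain"))  # False
print(filter_by("text/html", exact, deny="text/plain"))   # True

# src=type allow=text with a prefix comparison
print(filter_by("text/html", prefix, allow="text"))       # True
print(filter_by("image/gif", prefix, allow="text"))       # False
```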
filter-by-regex
\\* matches anything.
Parameters
src is the source of information.
allow/deny contains a regular expression string.
Example
MetaData fn=filter-by-regex src=host deny=\\*.gov
filterrules-process
filterrules.conf file.
Example
MetaData fn=filterrules-process
filter.conf.
assign-source
Parameters
dst is the name of the source whose value is to be changed.
value specifies an explicit value.
src specifies an information source to copy to dst.
Specify either a value parameter or a src parameter, but not both.
Example
Data fn=assign-source dst=type src=content-type
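The rule that a value parameter and a src parameter are mutually exclusive can be expressed in a few lines of Python. This is a sketch of the behavior described above, not the actual function. The name assign_source and the use of a dictionary to hold the sources are assumptions made for illustration.

```python
def assign_source(sources, dst, value=None, src=None):
    """Set sources[dst] from an explicit value or from another source.

    Exactly one of value or src must be given (hypothetical model of
    the assign-source parameters; sources is a plain dictionary).
    """
    if (value is None) == (src is None):
        raise ValueError("specify either value or src, but not both")
    sources[dst] = value if value is not None else sources[src]
    return sources


sources = {"content-type": "text/html"}

# Equivalent in spirit to: Data fn=assign-source dst=type src=content-type
assign_source(sources, dst="type", src="content-type")
print(sources["type"])  # text/html
```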
assign-type-by-extension
Parameters
src is the source of the file name to compare. If you do not specify a source, the default is the resource's URI.
Example
MetaData fn=assign-type-by-extension
clear-source
assign-source.
Parameters
src is the name of the source to delete.
Example
The following example deletes the uri source:
MetaData fn=clear-source src=uri
convert-to-html
Parameters
type is the MIME type to convert from.
Example
Data fn=convert-to-html type=application/pdf
Data fn=convert-to-html type=application/rtf
Data fn=convert-to-html type=application/x-mif
Data fn=convert-to-html type=unknown
copy-attribute
Parameters
src is the field in the resource description to copy from.
dst is the item in the resource description to copy the source into.
truncate is the maximum length of the source to copy.
clean is a Boolean parameter indicating whether to clean up truncated text (for example, by not leaving partial words). It is false by default.
Example
This example copies the partial-text field, truncated to a maximum length of 200, into the description field of the resource description, cleaning up the truncated text:
Generate fn=copy-attribute \
src=partial-text dst=description truncate=200 clean=true
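The interaction between truncate and clean can be illustrated with a small Python sketch. It is a guess at the general idea rather than the robot's algorithm. The function name copy_attribute is hypothetical, the length is assumed to be counted in characters, and "cleaning" is assumed to mean dropping a word that was cut in the middle.

```python
def copy_attribute(rd, src, dst, truncate=None, clean=False):
    """Copy rd[src] to rd[dst], optionally truncating the text.

    Assumptions: truncate counts characters; clean=True removes a
    trailing partial word and trailing whitespace after truncation.
    """
    text = rd[src]
    if truncate is not None and len(text) > truncate:
        cut = text[:truncate]
        if clean:
            if not text[truncate].isspace():
                # The cut fell inside a word: drop the partial word,
                # unless the text is one long word with no space.
                head, sep, _ = cut.rpartition(" ")
                if sep:
                    cut = head
            cut = cut.rstrip()
        text = cut
    rd[dst] = text
    return rd


rd = {"partial-text": "alpha beta gamma"}
print(copy_attribute(dict(rd), "partial-text", "description", truncate=8)["description"])
print(copy_attribute(dict(rd), "partial-text", "description", truncate=8, clean=True)["description"])
```

With truncate=8 the first call leaves the partial word "be" at the end of the copied text, while the second call, with clean=True, stops after "alpha".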
generate-by-exact
Parameters
dst is the name of the source to generate.
value is the value to assign to dst.
src is the source to match against.
match is the value to compare to src.
Example
This example sets the classification to Netscape if the host is www.netscape.com.
Generate fn="generate-by-exact" match="www.netscape.com:80" src="host" value="Netscape" dst="classification"
generate-by-prefix
Parameters
dst is the name of the source to generate.
value is the value to assign to dst.
src is the source to match against.
match is the value to compare to src.
Example
This example sets the classification to Compass if the protocol prefix is http.
Generate fn="generate-by-prefix" match="http" src="protocol" value="Compass" dst="classification"
generate-by-regex
Parameters
dst is the name of the source to generate.
value is the value to assign to dst.
src is the source to match against.
match is the regular expression string to compare to src.
Example
This example sets the classification to Netscape if the host name matches the regular expression *.netscape.com. For example, resources at both developer.netscape.com and home.netscape.com will be classified as Netscape.
Generate fn="generate-by-regex" match="\\*.netscape.com" src="host" value="Netscape" dst="classification"
generate-md5
filter-by-md5 function to deny resources with duplicate MD5 checksums.
Parameters
none
Example
Data fn=generate-md5
generate-rd-expires
Parameters
dst is the name of the source. If you omit it, it defaults to rd-expires.
Example
Generate fn=generate-rd-expires
generate-last-modified
Parameters
dst is the name of the source. If you omit it, it defaults to rd-last-modified.
Example
Generate fn=generate-last-modified
rename-attribute
Parameters
src is a string containing a mapping from one name into another.
Example
The following example renames an attribute from author to author-name:
Generate fn=rename-attribute src="author->author-name"
Enumeration Functions
These functions operate at the Enumerate stage. Use them in the definition for Enumeration filters. They control whether and how a robot gathers links from a given resource to use as starting points for further resource discovery.
enumerate-urls
Parameters
max is the maximum number of URLs to spawn from a given resource. The default, if max is omitted, is 1024.
type specifies a content-type that restricts enumeration to those URLs that have the specified content-type. type is optional. If it is omitted, the function enumerates all URLs.
Example
Enumerate fn=enumerate-urls type=text/html
enumerate-urls-from-text
URL:.*. It spawns robots to enumerate the URLs from these strings and generate further resource descriptions.
Parameters
max is the maximum number of URLs to spawn from a given resource. The default, if max is omitted, is 1024.
Example
Enumerate fn=enumerate-urls-from-text
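As a rough illustration of text-based enumeration, the Python sketch below scans a text resource for strings matching the regular expression URL:.* and collects at most max URLs. It rests on several assumptions: that the function looks for exactly this pattern, that everything after "URL:" up to the end of the line is the URL, and that max caps the number of URLs returned. The name urls_from_text is hypothetical.

```python
import re

# Assumption: a URL is whatever follows "URL:" on the same line.
URL_PATTERN = re.compile(r"URL:(.*)")


def urls_from_text(text, max_urls=1024):
    """Return up to max_urls URLs found after 'URL:' markers in text."""
    urls = []
    for match in URL_PATTERN.finditer(text):
        if len(urls) >= max_urls:
            break
        url = match.group(1).strip()
        if url:
            urls.append(url)
    return urls


sample = "Home page\nURL: http://home.netscape.com/\nno link here\nURL:http://developer.netscape.com/docs\n"
for url in urls_from_text(sample):
    print(url)
```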
extract-full-text
Parameters
truncate is the maximum number of characters to extract from the resource.
dst is the name of the schema item that will receive the full text.
Example
Generate fn=extract-full-text
extract-html-meta
Parameters
truncate is the maximum number of bytes to extract.
type is optional. If omitted, it will generate all URLs.
Example
Generate fn=extract-html-meta truncate=255 type=text/html
extract-html-text
Parameters
truncate is the maximum number of bytes to extract.
skip-headings, when set to true, ignores any HTML headings that occur in the document.
type is optional. If omitted, it will generate all URLs.
Example
Generate fn=extract-html-text truncate=255 type=text/html skip-headings=true
extract-html-toc
Parameters
truncate is the maximum number of bytes to extract.
level is the maximum HTML heading level to extract. This controls the depth of the table of contents.
Example
Generate fn=extract-html-toc truncate=255 level=3
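To show how a level parameter can control the depth of a table of contents, here is a minimal Python sketch built on the standard html.parser module. It is not the robot's extraction code. The names TocParser and extract_toc are hypothetical, and the truncate parameter is left out to keep the example short.

```python
from html.parser import HTMLParser


class TocParser(HTMLParser):
    """Collect the text of <h1>..<hN> headings, where N is level."""

    def __init__(self, level):
        super().__init__()
        self.level = level
        self.entries = []      # list of (heading level, heading text)
        self._current = None   # level of the heading being read, if any
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lowercase.
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            n = int(tag[1])
            if 1 <= n <= min(self.level, 6):
                self._current = n
                self._parts = []

    def handle_data(self, data):
        if self._current is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if self._current is not None and tag == f"h{self._current}":
            text = " ".join("".join(self._parts).split())
            self.entries.append((self._current, text))
            self._current = None


def extract_toc(html, level=3):
    parser = TocParser(level)
    parser.feed(html)
    parser.close()
    return parser.entries


page = "<h1>Intro</h1><p>text</p><h2>Setup</h2><h4>Deep</h4><H3>More <b>detail</b></H3>"
for depth, title in extract_toc(page, level=3):
    print("  " * (depth - 1) + title)
```

With level=3 the h4 heading in the sample page is ignored, so the resulting table of contents has three entries.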
extract-source
Parameters
src is a list of source names. You can use the -> operator to define a new name for the RD attribute. For example, type->content-type takes the value of the source named type and saves it in the RD under the attribute named content-type.
Example
Generate fn=extract-source src="md5,depth,rd-expires,rd-last-modified"
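The src syntax, including the -> operator, can be modeled with the following Python sketch. It is an illustration only. The names parse_source_list and extract_sources are hypothetical, and silently skipping a source that does not exist is an assumption, since the behavior in that case is not described here.

```python
def parse_source_list(src):
    """Turn 'a,b,c->d' into {'a': 'a', 'b': 'b', 'c': 'd'}.

    Keys are source names; values are the RD attribute names.
    """
    mapping = {}
    for item in src.split(","):
        item = item.strip()
        if not item:
            continue
        name, arrow, attribute = item.partition("->")
        name = name.strip()
        mapping[name] = attribute.strip() if arrow else name
    return mapping


def extract_sources(sources, src):
    """Copy the listed sources into a new RD dictionary.

    Assumption: sources that are not present are skipped.
    """
    return {
        attribute: sources[name]
        for name, attribute in parse_source_list(src).items()
        if name in sources
    }


sources = {"md5": "abc123", "depth": "2", "type": "text/html"}
rd = extract_sources(sources, "md5,depth,type->content-type")
print(rd)
```

Here the value of the type source ends up in the RD under the attribute name content-type, while md5 and depth keep their names.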
harvest-summarizer
$HARVEST_HOME/lib/gatherer in your PATH before you run the robot.
Parameters
summarizer is the name of the summarizer program.
Example
Generate fn=harvest-summarizer summarizer=HTML.sum
Shutdown fn=filterrules-shutdown
Last Updated: 02/07/98 20:49:12
Any sample code included above is provided for your use on an "AS IS" basis, under the Netscape License Agreement - Terms of Use