Starting with the seed URLs, the robot examines URLs and their associated network resources. Each resource is tested by both the enumerator and the generator. If the resource passes the enumeration test, the robot enumerates it; that is, it checks the resource for more URLs. If the resource passes the generator test, the robot generates a resource description for it to put in the Compass Server database.
Figure 8.1 How the Robot Works
Files that Define Robot Behavior
Three robot configuration files, process.conf, filter.conf, and filterrules.conf, control the behavior of the Compass Server robots. These files live in the directory compass-installdir/compass-name/config. For example, if you installed Compass Server in suitespot, and you created a Compass Server instance named nikki, the directory would be suitespot/compass-nikki/config.
process.conf defines many options for the robot, including telling it which of the filters from filter.conf to use. (For backwards compatibility with the Catalog Server, process.conf can also contain the starting points.)
In general, you do not need to edit the file process.conf. It is written by the Compass Server when you make changes in the Robot page of the Compass Server Administration Interface. However, there are a few parameters that you might want to edit manually. These parameters are discussed in Chapter 9, "Defining Parameters in Process.conf".
The two most important process parameters you can set by editing process.conf directly are enumeration-filter and generation-filter. These parameters determine which filters the robot uses for enumeration and generation. Their default values are enumeration-default and generation-default, which are the names of the filters provided by default in the filter.conf file.
All filters must be defined in the file filter.conf.
If you define your own filters in filter.conf, you must add the corresponding parameters to process.conf.
For example, if you define a new enumeration filter named my-enumerator, you would add the following parameter to process.conf:
enumeration-filter=my-enumerator
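Because the filter named in process.conf must also be defined in filter.conf, a quick consistency check can catch a mismatch before the robot runs. The following Python sketch is not part of Compass Server. It assumes that process.conf holds one key=value parameter per line (as in the example above), that lines starting with # are comments, and that each filter in filter.conf opens with a header of the form <Filter name="...">. The function names are hypothetical.

```python
import re

# Parameters in process.conf that name a filter (taken from the text above).
FILTER_PARAMETERS = ("enumeration-filter", "generation-filter")


def referenced_filters(process_conf_text):
    """Return {parameter: filter name} for the two filter parameters.

    Assumes one key=value pair per line; blank lines and # comments are skipped.
    """
    found = {}
    for line in process_conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() in FILTER_PARAMETERS:
            found[key.strip()] = value.strip()
    return found


def defined_filters(filter_conf_text):
    """Return the set of filter names declared by <Filter name="..."> headers."""
    return set(re.findall(r'<Filter\s+name="([^"]+)"\s*>', filter_conf_text))


def missing_filters(process_conf_text, filter_conf_text):
    """Return the sorted filter names that process.conf uses but filter.conf lacks."""
    defined = defined_filters(filter_conf_text)
    used = referenced_filters(process_conf_text).values()
    return sorted(name for name in used if name not in defined)
```

To use it, read both files from the instance's config directory and pass their text to missing_filters; an empty result means every referenced filter has a definition.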
The Filtering Process
Robots use filters to control which resources to process and how to process them. When the robot discovers references to resources (and resources themselves), it applies filters to each one to enumerate it (that is, examine it for more resources) and to determine whether or not to generate a resource description for it to put in the Compass Server database.
The robot starts by examining one or more starting points, or seed URLs, and applying the filters to each one. It then applies the filters to the URLs spawned by enumerating the seed URLs, and so on. (The seed URLs are defined in the filterrules.conf file.)
A filter starts by performing any required initialization operations. Then it applies comparison tests to the current resource. The goal of each test is to either allow or deny the resource. A filter also has a shutdown phase during which it performs any required cleanup operations.
If a resource is allowed, that means that it is allowed to continue passage through the filter. If a resource is denied, then the resource is rejected. No further action is taken by the filter for resources that are denied. If a resource is not denied, the robot will eventually enumerate it, attempting to discover further resources. The generator might also create a resource description for it.
Note that these operations are not necessarily linked. Some resources result in enumeration; others result in RD generation. Many result in both. For example, if the resource is an FTP directory, that resource will probably not have an RD generated for it. However, the robot might well enumerate the individual files in the FTP directory. An HTML document that contains links to other documents will probably get an RD for itself and lead to enumeration of the linked documents as well.
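The independence of enumeration and generation can be illustrated with a small Python model of the robot's loop. This is a conceptual sketch only, not the robot's actual implementation: the crawl function, its callback parameters, and the toy in-memory "web" are all hypothetical, and the two tests stand in for the enumerator and generator filters.

```python
from collections import deque


def crawl(seed_urls, fetch, should_enumerate, should_generate, extract_urls, describe):
    """Walk outward from the seed URLs, applying two independent tests per resource."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    database = []  # stands in for the resource descriptions (RDs) that get stored
    while queue:
        url = queue.popleft()
        resource = fetch(url)
        # Generation test: decide whether this resource gets an RD.
        if should_generate(url, resource):
            database.append(describe(url, resource))
        # Enumeration test: decide whether to look inside it for more URLs.
        if should_enumerate(url, resource):
            for link in extract_urls(url, resource):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return database


# Toy data: an HTML page that links to an FTP directory holding one file.
web = {
    "http://example.test/index.html": {"type": "text/html", "links": ["ftp://example.test/pub/"]},
    "ftp://example.test/pub/": {"type": "directory", "links": ["ftp://example.test/pub/a.txt"]},
    "ftp://example.test/pub/a.txt": {"type": "text/plain", "links": []},
}

rds = crawl(
    ["http://example.test/index.html"],
    fetch=web.get,
    should_enumerate=lambda url, r: r["type"] in ("text/html", "directory"),
    should_generate=lambda url, r: r["type"] != "directory",
    extract_urls=lambda url, r: r["links"],
    describe=lambda url, r: {"url": url, "type": r["type"]},
)
```

In this toy run the FTP directory is enumerated but gets no RD, the text file gets an RD but is not enumerated, and the HTML page gets both treatments.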
Stages in the Filter Process
Enumerator and generator filters each have five phases in the filtering process. Four phases are common to both: Setup, Metadata, Data, and Shutdown. If the resource makes it past the Data phase, it then enters either an Enumerate or a Generate phase, depending on whether the filter is an enumerator or a generator.
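The ordering of these phases can be sketched in Python as follows. This is a hypothetical model rather than Compass Server code: the Filter class, its callbacks, and the run helper are invented for illustration. It shows Setup and Shutdown running once, the Metadata test running before a resource is retrieved, and the Data test running after retrieval.

```python
class Filter:
    """Hypothetical model of one filter; kind is "enumerator" or "generator"."""

    def __init__(self, kind, metadata_test, data_test, action):
        self.kind = kind
        self.metadata_test = metadata_test  # decides using the URL only
        self.data_test = data_test          # decides using retrieved content
        self.action = action                # enumerate or generate step
        self.log = []                       # records which phases ran

    def setup(self):
        self.log.append("Setup")

    def shutdown(self):
        self.log.append("Shutdown")

    def process(self, url, fetch):
        self.log.append("Metadata")
        if not self.metadata_test(url):
            return None  # denied before any network retrieval
        content = fetch(url)
        self.log.append("Data")
        if not self.data_test(content):
            return None  # denied after retrieval
        self.log.append("Enumerate" if self.kind == "enumerator" else "Generate")
        return self.action(url, content)


def run(filter_, urls, fetch):
    """Run Setup once, process every URL, then run Shutdown once."""
    filter_.setup()
    try:
        return [filter_.process(url, fetch) for url in urls]
    finally:
        filter_.shutdown()
```

A resource denied in the Metadata phase is never fetched, which is why metadata tests are the cheaper place to reject resources when the URL alone is enough to decide.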
Setup
Performs initialization operations. Occurs only once in the life of the robot.
Metadata
Filters the resource based on metadata that is available about the resource. Metadata filtering occurs once per resource before the resource is retrieved over the network.
Examples of metadata include the following:
Data
Filters the resource based on data contained by the resource. Data filtering is done once per resource after it is retrieved over the network. Some data that can be used for filtering follow:
The filter.conf file contains definitions for enumeration and generation filters. This file can contain multiple filters for both enumeration and generation. (The robot knows which ones to use because they are specified by the enumeration-filter and generation-filter parameters in the file process.conf.)
Filter definitions have a well-defined structure: a header, a body, and an end. The header identifies the beginning of the filter and declares its name, for example:
<Filter name="myFilter">
The body consists of a series of filter directives that define the filter's behavior during setup, testing, enumeration or generation, and shutdown. Each directive specifies a function and, if applicable, parameters for the function. The end is marked by </Filter>. The following code shows an example filter named enumeration1.
<Filter name="enumeration1">
Setup fn=filterrules-setup config=./config/filterrules.conf
# Process the rules
MetaData fn=filterrules-process
# Filter by type and process rules again
Data fn=assign-source dst=type src=content-type
Data fn=filterrules-process
# Perform the enumeration on HTML only
Enumerate enable=true fn=enumerate-urls max=1024 type=text/html
# Cleanup
Shutdown fn=filterrules-shutdown
</Filter>
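To make the header, body, and end structure concrete, here is a minimal Python sketch that reads filter definitions into a dictionary. It is an illustration only and not the parser the robot uses. It assumes each header has the form <Filter name="...">, that each body line is a directive name followed by space-separated key=value parameters, and that lines starting with # are comments. The parse_filters name is hypothetical.

```python
import re
import shlex

HEADER = re.compile(r'<Filter\s+name="([^"]+)"\s*>')


def parse_filters(text):
    """Return {filter name: [(directive, {parameter: value}), ...]}."""
    filters = {}
    current = None  # directive list of the filter being read, if any
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        header = HEADER.fullmatch(line)
        if header:
            current = filters.setdefault(header.group(1), [])
            continue
        if line == "</Filter>":
            current = None
            continue
        if current is None:
            raise ValueError(f"directive outside a filter: {line!r}")
        # First token is the directive; the rest are key=value parameters.
        directive, *pairs = shlex.split(line)
        params = dict(pair.split("=", 1) for pair in pairs)
        current.append((directive, params))
    return filters
```

Applied to a well-formed copy of the example filter, the result maps enumeration1 to an ordered list of directives, each paired with its fn and other parameters, which makes it easy to see which function runs in which phase.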
obj.conf. Data is stored and transferred using parameter blocks, also called pblocks. If you are not already acquainted with SAFs, pblocks, and other basic aspects of NSAPI, you should consult the NSAPI Programmer's Guide before continuing with this material.
You can find the NSAPI Programmer's Guide at:
http://developer.netscape.com/docs/manuals/enterprise/nsapi/index.htm
There are six robot directives, or RAF classes, corresponding to the filtering phases and operations listed in The Filtering Process: Setup, Metadata, Data, Enumerate, Generate, and Shutdown.
Each directive has its own particular robot application functions. For example, you use filtering functions with the Metadata and Data directives, enumeration functions with the Enumerate directive, generation functions with the Generate directive, and so on. The built-in robot application functions are listed and explained in Chapter 10, The Pre-defined Robot Application Functions. You can also write your own robot application functions, as described in Chapter 11, The Pre-defined Robot Application Functions.
Writing or Modifying a Filter
In most cases, you should not need to write filters from scratch. You can create most of your filters using the Robot page in the Compass Server Administration Interface. You can then modify the filter.conf and filterrules.conf files to make any desired changes. These files live in the directory compass-installdir/compass-name/config.
However, if you want to create a more complex set of parameters than is supported by the Robot page in the interface, you will need to edit the configuration files used by the robot.
In most cases, you can probably see what you need to do by looking at the existing configuration files and the examples in Chapter 11, The Pre-defined Robot Application Functions. You will need to keep the following points in mind when writing or modifying a filter:
process.conf, and see Chapter 10, The Pre-defined Robot Application Functions, for a discussion of the robot application functions that you can use in the file filter.conf. See Chapter 11, The Pre-defined Robot Application Functions, for a discussion of how to create your own robot application functions.
Last Updated: 02/07/98 20:49:09