Compass Server 3.0 Developer's Guide


Chapter 8
How Netscape Compass Server Robots Work

This chapter describes how the Compass Server robot works, and discusses the configuration files that it uses.

What is a Robot?

A robot is an agent, a piece of software, that can traverse segments of a network generating resource descriptions as it locates and identifies appropriate resources.

The Netscape Compass Server robot uses two kinds of filters: an enumerator filter and a generator filter.

How The Robot Works

Figure 8.1 illustrates how the Netscape Compass Server robot works.

Starting with the seed URLs, the robot examines URLs and their associated network resources. Each resource is tested by both the enumerator and the generator. If the resource passes the enumeration test, the robot enumerates it; that is, it checks the resource for more URLs. If the resource passes the generator test, the robot generates a resource description for it and puts the description in the Compass Server database.

Figure 8.1    How the Robot Works
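The flow described above can be approximated with a small simulation. This is only a sketch, not the robot's actual implementation: the WEB dictionary, the two test functions, and the rules they apply are hypothetical stand-ins chosen for illustration.

```python
from collections import deque

# Hypothetical in-memory "network": URL -> (content type, outgoing links).
WEB = {
    "http://example.com/": ("text/html", ["http://example.com/a.html", "ftp://example.com/pub/"]),
    "http://example.com/a.html": ("text/html", ["http://example.com/"]),
    "ftp://example.com/pub/": ("directory", ["ftp://example.com/pub/notes.txt"]),
    "ftp://example.com/pub/notes.txt": ("text/plain", []),
}

def passes_enumerator(url, content_type):
    """Assumed enumeration test: only look inside HTML pages and directories."""
    return content_type in ("text/html", "directory")

def passes_generator(url, content_type):
    """Assumed generation test: describe documents, not directories."""
    return content_type != "directory"

def crawl(seeds):
    """Visit each reachable URL once; return (resource descriptions, visit order)."""
    queue = deque(seeds)
    seen = set(seeds)
    descriptions = []
    visited = []
    while queue:
        url = queue.popleft()
        content_type, links = WEB[url]
        visited.append(url)
        # The two tests are applied independently of each other.
        if passes_generator(url, content_type):
            descriptions.append({"url": url, "type": content_type})
        if passes_enumerator(url, content_type):
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return descriptions, visited

descriptions, visited = crawl(["http://example.com/"])
```

In this toy run the FTP directory is enumerated but gets no description, while the text file gets a description but is not enumerated.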

Files that Define Robot Behavior

Three robot configuration files, process.conf, filter.conf, and filterrules.conf, control the behavior of the Compass Server robots. These files live in the directory compass-installdir/compass-name/config. For example, if you installed Compass Server in suitespot and created a Compass Server instance named nikki, the directory would be suitespot/compass-nikki/config.

Each of these configuration files is a plain text file that you can edit with any standard text editor.

Setting Robot Process Parameters

The file process.conf defines many options for the robot, including which of the filters from filter.conf to use. (For backward compatibility with the Catalog Server, process.conf can also contain the starting points.)

In general, you do not need to edit the file process.conf. It is written by the Compass Server when you make changes in the Robot page of the Compass Server Administration Interface. However, there are a few parameters that you might want to manually edit. These parameters are discussed in Chapter 9, "Defining Parameters in Process.conf".

The two most important process parameters you can set by editing process.conf directly are enumeration-filter and generation-filter. These parameters determine which filters the robot uses for enumeration and generation. Their default values are enumeration-default and generation-default, which are the names of the filters provided by default in the filter.conf file.

All filters must be defined in the file filter.conf. If you define your own filters in filter.conf, you must add the corresponding parameters to process.conf.

For example, if you define a new enumeration filter named my-enumerator, you would add the following parameter to process.conf:

  enumeration-filter=my-enumerator
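Because a filter named in process.conf must also be defined in filter.conf, a quick consistency check can catch a mismatch before the robot runs. The following is a minimal sketch, not a Compass Server tool. It assumes that process.conf holds one key=value parameter per line, that lines starting with # are comments, and that each filter in filter.conf begins with a <Filter name="..."> header. The function names are hypothetical.

```python
import re

def defined_filters(filter_conf_text):
    """Collect the names declared by <Filter name="..."> headers."""
    return set(re.findall(r'<Filter\s+name="([^"]+)"\s*>', filter_conf_text))

def selected_filters(process_conf_text):
    """Collect enumeration-filter and generation-filter values (assumed key=value lines)."""
    selected = {}
    for line in process_conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() in ("enumeration-filter", "generation-filter"):
            selected[key.strip()] = value.strip()
    return selected

def missing_filters(process_conf_text, filter_conf_text):
    """Return the parameters whose filter name has no definition in filter.conf."""
    known = defined_filters(filter_conf_text)
    return {k: v for k, v in selected_filters(process_conf_text).items() if v not in known}

process_conf = "enumeration-filter=my-enumerator\ngeneration-filter=generation-default\n"
filter_conf = '<Filter name="my-enumerator">\n</Filter>\n'
problems = missing_filters(process_conf, filter_conf)
```

Here problems reports generation-filter, because the sample filter.conf text defines only my-enumerator.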

The Filtering Process

Robots use filters to control which resources to process and how to process them. When the robot discovers references to resources (and resources themselves), it applies filters to each one to enumerate it (that is, examine it for more resources) and to determine whether or not to generate a resource description for it to put in the Compass Server database.

The robot starts by examining one or more starting points, or seed URLs, and applying the filters to each one. It then applies the filters to the URLs spawned by enumerating the seed URLs, and so on. (The seed URLs are defined in the filterrules.conf file.)

A filter starts by performing any required initialization operations. Then it applies comparison tests to the current resource. The goal of each test is to either allow or deny the resource. A filter also has a shutdown phase during which it performs any required cleanup operations.

If a resource is allowed, that means that it is allowed to continue passage through the filter. If a resource is denied, then the resource is rejected. No further action is taken by the filter for resources that are denied. If a resource is not denied, the robot will eventually enumerate it, attempting to discover further resources. The generator might also create a resource description for it.

Note that these operations are not necessarily linked. Some resources result in enumeration; others result in RD generation. Many result in both. For example, if the resource is an FTP directory, that resource will probably not have an RD generated for it. However, the robot might well enumerate the individual files in the FTP directory. An HTML document that contains links to other documents will probably get an RD for itself and lead to enumeration of the linked documents as well.

Stages in the Filter Process

Enumerator and generator filters each have five phases in the filtering process. Four phases are common to both: Setup, Metadata, Data, and Shutdown. If the resource makes it past the Data phase, it enters either the Enumerate phase or the Generate phase, depending on whether the filter is an enumerator or a generator.

Setup

Performs initialization operations. Occurs only once in the life of the robot.

Metadata

Filters the resource based on metadata that is available about the resource. Metadata filtering occurs once per resource before the resource is retrieved over the network.

Examples of metadata include the following:

Metadata       Meaning                              Example
Complete URL   The location of a resource           http://home.netscape.com/
Protocol       The access portion of the URL        http, ftp, file
Host           The address portion of the URL       www.netscape.com:1080
IP address     Numeric version of the host          198.95.249.6:8000
URI            The path portion of the URL          /index.html
Depth          Number of links from the seed URL    5
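Most of these metadata fields can be derived from the URL alone, which is why they are available before the resource is retrieved. The sketch below uses Python's standard urllib.parse module to illustrate the split; it is not how the robot itself computes these values, and the url_metadata function and its dictionary keys are hypothetical. The IP address is left out because it requires a name lookup, and the depth must be supplied by whatever tracks the crawl.

```python
from urllib.parse import urlsplit

def url_metadata(url, depth=0):
    """Split a URL into the metadata fields listed in the table above."""
    parts = urlsplit(url)
    return {
        "url": url,
        "protocol": parts.scheme,
        "host": parts.netloc,       # host and port as written in the URL
        "uri": parts.path or "/",   # treat an empty path as the root
        "depth": depth,             # supplied by the caller, not derivable from the URL
    }

meta = url_metadata("http://www.netscape.com:1080/index.html", depth=5)
```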

Data

Filters the resource based on data contained by the resource. Data filtering is done once per resource after it is retrieved over the network. Some data that can be used for filtering follow:

Enumerate

Enumerates the current resource to find if it points to other resources to be examined.

Generate

Generates a resource description (RD) for the resource, and saves the RD in the Compass Server database.

Shutdown

Performs any needed termination operations. Occurs only once in the life of the robot.
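The order of the phases for an enumerator can be sketched as follows. This is an illustrative model only: the EnumeratorFilter class, the run function, and the resource dictionaries are hypothetical and do not correspond to any Compass Server API. The sketch shows Setup and Shutdown running once, the Metadata test running before the Data test, and a denial ending processing of that resource.

```python
class EnumeratorFilter:
    """Toy enumerator that records which phase ran, in order."""

    def __init__(self, allowed_protocols, allowed_types):
        self.allowed_protocols = allowed_protocols
        self.allowed_types = allowed_types
        self.log = []

    def setup(self):
        self.log.append("setup")

    def metadata(self, resource):
        # Runs before retrieval: only facts known from the URL are available.
        self.log.append("metadata")
        return resource["protocol"] in self.allowed_protocols

    def data(self, resource):
        # Runs after retrieval: the content type is now known.
        self.log.append("data")
        return resource["content_type"] in self.allowed_types

    def enumerate(self, resource):
        self.log.append("enumerate")
        return list(resource["links"])

    def shutdown(self):
        self.log.append("shutdown")


def run(filter_, resources):
    """Apply the phases to each resource; a denial stops processing of that resource."""
    found = []
    filter_.setup()                      # once per robot run
    for resource in resources:
        if not filter_.metadata(resource):
            continue                     # denied before retrieval
        if not filter_.data(resource):
            continue                     # denied after retrieval
        found.extend(filter_.enumerate(resource))
    filter_.shutdown()                   # once per robot run
    return found

resources = [
    {"protocol": "http", "content_type": "text/html", "links": ["http://example.com/a"]},
    {"protocol": "gopher", "content_type": "text/html", "links": ["gopher://example.com/x"]},
    {"protocol": "http", "content_type": "image/gif", "links": []},
]
flt = EnumeratorFilter({"http", "ftp"}, {"text/html"})
links = run(flt, resources)
```

A generator would follow the same shape, with a Generate step in place of the Enumerate step.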

Filter Syntax

The filter.conf file contains definitions for Enumeration and Generation filters. This file can contain multiple filters for both enumeration and generation. (The robot knows which ones to use because they are specified by the enumeration-filter and generation-filter parameters in the file process.conf.)

Filter definitions have a well-defined structure: a header, a body, and an end. The header identifies the beginning of the filter and declares its name, for example:

<Filter name="myFilter">

The body consists of a series of filter directives that define the filter's behavior during setup, testing, enumeration or generation, and shutdown. Each directive specifies a function, and if applicable, parameters for the function.

The end is marked by </Filter>. The following code shows an example filter named enumeration1.

<Filter name="enumeration1">
   Setup fn=filterrules-setup config=./config/filterrules.conf
   #  Process the rules
   MetaData fn=filterrules-process
   #  Filter by type and process rules again
   Data fn=assign-source dst=type src=content-type
   Data fn=filterrules-process
   #  Perform the enumeration on HTML only
   Enumerate enable=true fn=enumerate-urls max=1024 type=text/html
   #  Cleanup
   Shutdown fn=filterrules-shutdown
</Filter>
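To make the header, body, and end structure concrete, the following sketch reads a filter definition into a simple data structure. It is an illustration rather than the robot's own parser. It assumes a well-formed header with both quotation marks, one directive per line, # comment lines, and space-separated name=value parameters with no spaces or quotes inside values. The parse_filters function is hypothetical.

```python
import re

FILTER_TEXT = '''
<Filter name="enumeration1">
   Setup fn=filterrules-setup config=./config/filterrules.conf
   #  Process the rules
   MetaData fn=filterrules-process
   Data fn=assign-source dst=type src=content-type
   Enumerate enable=true fn=enumerate-urls max=1024 type=text/html
   Shutdown fn=filterrules-shutdown
</Filter>
'''

def parse_filters(text):
    """Return {filter name: [(directive, {parameter: value}), ...]}."""
    filters = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        header = re.fullmatch(r'<Filter\s+name="([^"]+)"\s*>', line)
        if header:
            current = filters.setdefault(header.group(1), [])
        elif line == "</Filter>":
            current = None
        elif current is not None:
            # First token is the directive; the rest are name=value pairs.
            directive, *pairs = line.split()
            params = dict(pair.split("=", 1) for pair in pairs)
            current.append((directive, params))
    return filters

filters = parse_filters(FILTER_TEXT)
```

Each directive becomes a pair of the directive name and a dictionary of its parameters, which loosely mirrors the way a directive hands a parameter block to its function.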

Filter Directives

Filter directives use Robot Application Functions (RAFs) to perform their operations. Their usage and flow of execution are similar to those of NSAPI directives and Server Application Functions (SAFs) in the file obj.conf. Data is stored and transferred using parameter blocks, also called pblocks. If you are not already acquainted with SAFs, pblocks, and other basic aspects of NSAPI, you should consult the NSAPI Programmer's Guide before continuing with this material.

You can find the NSAPI Programmer's Guide at:

http://developer.netscape.com/docs/manuals/enterprise/nsapi/index.htm

There are six robot directives, or RAF classes, corresponding to the filtering phases and operations listed in The Filtering Process: Setup, Metadata, Data, Enumerate, Generate, and Shutdown.

Each directive has its own particular robot application functions. For example, you use filtering functions with the Metadata and Data directives, enumeration functions with the Enumerate directive, generation functions with the Generate directive, and so on.

The built-in robot application functions are listed and explained in Chapter 10, The Pre-defined Robot Application Functions. You can also write your own robot application functions, as described in Chapter 11.

Writing or Modifying a Filter

In most cases, you should not need to write filters from scratch. You can create most of your filters using the Robot page in the Compass Server Administration Interface. You can then modify the filter.conf and filterrules.conf files to make any desired changes. These files live in the directory compass-installdir/compass-name/config.

However, if you want to create a more complex set of parameters than is supported by the Robot page in the interface, you will need to edit the configuration files used by the robot.

In most cases, you can probably see what you need to do by looking at the existing configuration files and the examples in Chapter 11, The Pre-defined Robot Application Functions. You will need to keep the following points in mind when writing or modifying a filter:

See Chapter 9, Defining Parameters in Process.conf, for a discussion of the parameters you can modify in the file process.conf, and see Chapter 10, The Pre-defined Robot Application Functions, for a discussion of the robot application functions that you can use in the file filter.conf. See Chapter 11 for a discussion of how to create your own robot application functions.



Last Updated: 02/07/98 20:49:09

Any sample code included above is provided for your use on an "AS IS" basis, under the Netscape License Agreement - Terms of Use