Sun Java System Portal Server 7.1 Technical Reference

Chapter 43 The Filtering Process

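This chapter contains the following sections: Overview, Stages in the Filter Process, Filter Syntax, Filter Directives, and Writing or Modifying a Filter.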

Overview

The robot uses filters to determine which resources to process and how to process them. When the robot discovers references to resources as well as the resources themselves, it applies filters to each resource in order to enumerate it and to determine whether or not to generate a resource description to store in the Search Engine database.

The robot examines one or more seed URLs, applies the filters, and then applies the filters to the URLs spawned by enumerating the seed URLs, and so on. The seed URLs are defined in the filterrules.conf file.

A filter performs any required initialization operations and applies comparison tests to the current resource. The goal of each test is to either allow or deny the resource. A filter also has a shutdown phase during which it performs any required cleanup operations.

If a resource is allowed, it continues its passage through the filter. If a resource is denied, it is rejected, and the filter takes no further action on it. If a resource is not denied, the robot eventually enumerates it in an attempt to discover further resources. The generator might also create a resource description (RD) for it.

These operations are not necessarily linked. Some resources result in enumeration; others result in RD generation. Many resources result in both enumeration and RD generation. For example, if the resource is an FTP directory, the resource typically will not have an RD generated for it. However, the robot might enumerate the individual files in the FTP directory. An HTML document that contains links to other documents can receive an RD and can lead to enumeration of the linked documents as well.
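
The loop described in this overview can be pictured as a work queue. The following Python sketch is only an illustration under stated assumptions: the in-memory LINKS map stands in for the network, and the names allowed, should_generate, and crawl are hypothetical and are not part of the robot. It shows that enumeration and RD generation are decided separately for each resource that is not denied.

```python
from collections import deque

# Hypothetical in-memory "web": each resource maps to the resources it references.
LINKS = {
    "ftp://ftp.siroe.com/pub/": ["ftp://ftp.siroe.com/pub/readme.txt"],
    "ftp://ftp.siroe.com/pub/readme.txt": [],
    "http://home.siroe.com/": ["http://home.siroe.com/a.html",
                               "http://www.example.com/"],
    "http://home.siroe.com/a.html": [],
}

def allowed(url):
    """Stand-in for the filter tests: deny anything outside siroe.com."""
    return "siroe.com" in url

def should_generate(url):
    """Stand-in rule: an FTP directory (URL ending in '/') gets no RD."""
    return not (url.startswith("ftp://") and url.endswith("/"))

def crawl(seeds):
    """Return the resources that would receive a resource description."""
    queue = deque(seeds)
    seen = set(seeds)
    described = []
    while queue:
        url = queue.popleft()
        if not allowed(url):
            continue  # denied: no enumeration and no RD
        # Enumeration: discover further resources and queue them.
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
        # RD generation is an independent decision.
        if should_generate(url):
            described.append(url)
    return described
```

In this sketch, the FTP directory is enumerated but receives no RD, while the HTML page is both enumerated and described.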

Stages in the Filter Process

Both enumerator and generator filters have five phases in the filtering process. Four of these phases are common to both: Setup, Metadata, Data, and Shutdown. If the resource makes it past the Data phase, it enters either the Enumerate or the Generate phase, depending on whether the filter is an enumerator or a generator.

The phases are as follows:

Setup

Performs initialization operations. Occurs only once in the life of the robot.

Metadata

Filters the resource based on metadata that is available about the resource. Metadata filtering occurs once per resource, before the resource is retrieved over the network. The following table lists the common metadata types with their descriptions and examples.

Table 43–1 Common Metadata Types

Metadata        Description                          Example
Complete URL    The location of a resource           http://home.siroe.com/
Protocol        The access portion of the URL        http, ftp, file
Host            The address portion of the URL       www.siroe.com
IP address      Numeric version of the host          198.95.249.6
Path            The path portion of the URL          /index.html
Depth           Number of links from the seed URL

Data

Filters the resource based on its data. Data filtering is done once per resource, after the resource is retrieved over the network. Data that can be used for filtering includes properties of the retrieved resource, such as its content type.

Enumerate

Enumerates the current resource in order to determine whether it points to other resources to be examined.

Generate

Generates a resource description (RD) for the resource and saves it in the Search Engine database.

Shutdown

Performs any needed termination operations. Occurs once in the life of the robot.
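
The ordering of these phases can be sketched in Python as follows. This is a minimal illustration, not the robot's implementation: the SketchFilter class, the run function, and the allow rules inside them are hypothetical. The metadata_for helper derives several of the metadata fields from Table 43–1 with the standard urllib.parse module; the IP address is omitted because obtaining it would require a DNS lookup.

```python
from urllib.parse import urlsplit

def metadata_for(url, depth):
    """Derive metadata that is available before the resource is retrieved."""
    parts = urlsplit(url)
    return {
        "url": url,
        "protocol": parts.scheme,
        "host": parts.hostname,
        "path": parts.path,
        "depth": depth,
    }

class SketchFilter:
    """Hypothetical enumerator filter that records the order of its phases."""

    def __init__(self):
        self.log = []

    def setup(self):
        self.log.append("setup")

    def metadata(self, meta):
        self.log.append("metadata")
        return meta["protocol"] in ("http", "ftp", "file")  # assumed rule

    def data(self, content_type):
        self.log.append("data")
        return content_type == "text/html"  # assumed rule

    def enumerate(self, url):
        self.log.append("enumerate")

    def shutdown(self):
        self.log.append("shutdown")

def run(filt, resources):
    """Drive one filter over (url, depth, content_type) tuples."""
    filt.setup()  # once in the life of the robot
    for url, depth, content_type in resources:
        if not filt.metadata(metadata_for(url, depth)):
            continue  # denied before retrieval
        # The resource would be retrieved over the network at this point.
        if not filt.data(content_type):
            continue  # denied after retrieval
        filt.enumerate(url)  # a generator filter would generate an RD instead
    filt.shutdown()  # once in the life of the robot
    return filt.log
```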

Filter Syntax

The filter.conf file contains definitions for enumeration and generation filters, and it can contain multiple filters of each kind. The robot determines which filters to use from the enumeration-filter and generation-filter parameters in the robot.conf file.

Filter definitions have a well-defined structure: a header, a body, and an end. The header identifies the beginning of the filter and declares its name; for example:


<Filter name="myFilter">

The body consists of a series of filter directives that define the filter’s behavior during setup, testing, enumeration or generation, and shutdown. Each directive specifies a function, and if applicable, parameters for the function.

The end is marked by </Filter>.

The following example shows a filter named enumeration1.


Example 43–1 Enumeration File Syntax


<Filter name="enumeration1">
Setup fn=filterrules-setup config=./config/filterrules.conf
# Process the rules
MetaData fn=filterrules-process
# Filter by type and process rules again
Data fn=assign-source dst=type src=content-type
Data fn=filterrules-process
# Perform the enumeration on HTML only
Enumerate enable=true fn=enumerate-urls max=1024 type=text/html
# Cleanup
Shutdown fn=filterrules-shutdown
</Filter>
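
To make the header, body, and end structure concrete, the following Python sketch splits a filter definition into its name and its directives. It is an illustration only, not the robot's own parser: the parse_filter function is hypothetical, and it assumes one directive per line, space-separated name=value parameters, and a properly quoted name attribute in the header.

```python
import re
import shlex

EXAMPLE = '''
<Filter name="enumeration1">
Setup fn=filterrules-setup config=./config/filterrules.conf
# Process the rules
MetaData fn=filterrules-process
Data fn=assign-source dst=type src=content-type
Data fn=filterrules-process
Enumerate enable=true fn=enumerate-urls max=1024 type=text/html
Shutdown fn=filterrules-shutdown
</Filter>
'''

HEADER = re.compile(r'<Filter\s+name="([^"]+)"\s*>')

def parse_filter(text):
    """Return (filter name, list of (directive, parameters)) for one filter."""
    name = None
    directives = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        header = HEADER.fullmatch(line)
        if header:
            name = header.group(1)
            continue
        if line == "</Filter>":
            break  # end of the filter definition
        directive, *params = shlex.split(line)
        pblock = dict(p.split("=", 1) for p in params)
        directives.append((directive, pblock))
    return name, directives

name, directives = parse_filter(EXAMPLE)
```

Each parsed directive pairs a phase (such as Setup or Data) with the function named by fn and any further parameters for that function.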

Filter Directives

Filter directives use Robot Application Functions (RAFs) to perform operations. Their use and flow of execution are similar to those of NSAPI directives and Server Application Functions (SAFs) in the obj.conf file. As with NSAPI and SAFs, data is stored and transferred using parameter blocks, also called pblocks.

There are six robot directives, or RAF classes, one for each of the filtering phases and operations: Setup, Metadata, Data, Enumerate, Generate, and Shutdown. See Stages in the Filter Process for more information on these phases.

Each directive has its own robot application functions. For example, use filtering functions with the Metadata and Data directives, enumeration functions with the Enumerate directive, generation functions with the Generate directive, and so on.

The built-in robot application functions and instructions for writing your own robot application functions are explained in the Sun Java System Portal Server 7.1 Developer's Guide.
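
As a conceptual model only, the relationship between a directive, its function, and its parameter block can be sketched in Python with dictionaries standing in for pblocks. This is not the RAF API, which is documented in the Developer's Guide. The behavior shown for assign-source is an assumption inferred from its dst and src parameters, and demo-type-check, FUNCTIONS, and apply_directives are hypothetical names introduced for this illustration.

```python
ALLOW, DENY = "allow", "deny"

def assign_source(pblock, resource):
    """Assumed behavior: copy the field named by src into the field named by dst."""
    resource[pblock["dst"]] = resource.get(pblock["src"])
    return ALLOW

def demo_type_check(pblock, resource):
    """Hypothetical filtering function: allow only the configured type."""
    return ALLOW if resource.get("type") == pblock["type"] else DENY

# Hypothetical registry that maps fn= names to functions.
FUNCTIONS = {
    "assign-source": assign_source,
    "demo-type-check": demo_type_check,
}

def apply_directives(directives, resource):
    """Run each (directive, pblock) pair in order and stop at the first deny."""
    for _directive, pblock in directives:
        function = FUNCTIONS[pblock["fn"]]
        if function(pblock, resource) == DENY:
            return DENY
    return ALLOW
```

The point of the sketch is the data flow: each function receives its own parameters plus the shared state of the current resource, and a deny result ends processing for that resource.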

Writing or Modifying a Filter

In most cases, you should not need to write filters from scratch. You can create most of your filters using the administration console. You can then modify the filter.conf and filterrules.conf files to make any desired changes. These files reside in the directory /var/opt/SUNWportal/searchservers/search1/config.

However, if you want to create a more complex set of parameters, you will need to edit the configuration files used by the robot.

Follow these points when writing or modifying a filter:

For a discussion of the parameters you can modify in the robot.conf file, the robot application functions that you can use in the filter.conf file, and how to create your own robot application functions, see the Sun Java System Portal Server 7.1 Developer's Guide.