Sun Java System Portal Server 7.2 Administration Guide

Resource Filtering Process

The robot uses filters to determine which resources to process and how to process them. When the robot discovers references to resources as well as the resources themselves, it applies filters to each resource. The filters enumerate the resource and determine whether to generate a resource description to store in the Search Server database.

The robot examines one or more starting point URLs and applies the filters to them. It then applies the filters to the URLs discovered by enumerating those starting points, and so on. The starting point URLs are defined in the filterrules.conf file.

Each enumeration and generation filter performs any required initialization operations and applies comparison tests to the current resource. The goal of each test is to allow or deny the resource. Each filter also has a shutdown phase during which it performs clean-up operations.

If a resource is allowed, then it continues its passage through the filter. The robot eventually enumerates it, attempting to discover further resources. The generator might also create a resource description for it.

If a resource is denied, it is rejected, and the filter takes no further action on it.

These operations are not necessarily linked. Some resources result in enumeration; others result in RD generation. Many resources result in both enumeration and RD generation. For example, if the resource is an FTP directory, the resource typically does not have an RD generated for it. However, the robot might enumerate the individual files in the FTP directory. An HTML document that contains links to other documents can result in an RD being generated, and can lead to enumeration of any linked documents as well.
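The flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the robot's actual implementation: the in-memory SITE table, the allowed() test, and the rule that FTP directories are enumerated but not described are all hypothetical stand-ins for the real filters.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical in-memory "site": URL -> (content type, outgoing links).
SITE = {
    "http://home.siroe.com/": ("text/html", [
        "http://home.siroe.com/a.html",
        "ftp://home.siroe.com/pub/",
        "http://other.example.com/",
    ]),
    "http://home.siroe.com/a.html": ("text/html", []),
    "ftp://home.siroe.com/pub/": ("ftp-directory", [
        "ftp://home.siroe.com/pub/readme.txt",
    ]),
    "ftp://home.siroe.com/pub/readme.txt": ("text/plain", []),
    "http://other.example.com/": ("text/html", []),
}

def allowed(url):
    """Stand-in for the allow/deny tests: accept a single host only."""
    return urlsplit(url).hostname == "home.siroe.com"

def crawl(starting_points):
    """Return (enumerated URLs, URLs that received a resource description)."""
    queue = deque(starting_points)
    seen = set(starting_points)
    enumerated, described = [], []
    while queue:
        url = queue.popleft()
        if url not in SITE or not allowed(url):
            continue  # denied: no further action for this resource
        content_type, links = SITE[url]
        # Enumeration and generation are decided independently.
        if content_type in ("text/html", "ftp-directory"):
            enumerated.append(url)
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
        if content_type != "ftp-directory":
            described.append(url)
    return enumerated, described

enumerated, described = crawl(["http://home.siroe.com/"])
```

In this sketch the FTP directory is enumerated but gets no resource description, the plain-text file gets a description but is not enumerated, and the HTML pages get both.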

The following sections describe the filter process:

Stages in the Filter Process

Both enumeration and generation filters have five phases in the filtering process.

Table 19–1 Common Metadata Types

Metadata Type    Description                                    Example

Complete URL     The location of a resource                     http://home.siroe.com/

Protocol         The access portion of the URL                  http, ftp, file

Host             The address portion of the URL                 www.siroe.com

IP address       Numeric version of the host                    198.95.249.6

PATH             The path portion of the URL                    /index.html

Depth            Number of links from the starting point URL
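As a rough illustration of how these metadata types relate to a URL, the following Python sketch splits a URL with the standard urllib.parse module. The dictionary keys and the url_metadata helper are hypothetical labels chosen for this example, not names used by the robot. The IP address is omitted because it requires a DNS lookup, and the depth is passed in by the caller because it depends on the crawl, not on the URL itself.

```python
from urllib.parse import urlsplit

def url_metadata(url, depth=0):
    """Split a URL into the metadata types listed in the table (sketch)."""
    parts = urlsplit(url)
    return {
        "complete-url": url,       # the location of the resource
        "protocol": parts.scheme,  # the access portion of the URL
        "host": parts.hostname,    # the address portion of the URL
        "path": parts.path,        # the path portion of the URL
        "depth": depth,            # links from the starting point URL
    }

meta = url_metadata("http://www.siroe.com/index.html", depth=1)
```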

Filter Syntax

The filter.conf file contains definitions for enumeration and generation filters. This file can contain multiple filters for both enumeration and generation. The filters used by the robot are specified by the enumeration-filter and generation-filter properties in the file robot.conf.

Filter definitions have a well-defined structure: a header, a body, and an end. The header identifies the beginning of the filter and declares its name, for example:


<Filter name="myFilter">

The body consists of a series of filter directives that define the filter’s behavior during setup, testing, enumeration or generation, and shutdown. Each directive specifies a function and, if applicable, properties for the function.

The end is marked by </Filter>.

Example 19–1 shows a filter named enumeration1.


Example 19–1 Enumeration File Syntax


<Filter name="enumeration1">
   Setup fn=filterrules-setup config=./config/filterrules.conf
#  Process the rules
   MetaData fn=filterrules-process
#  Filter by type and process rules again
   Data fn=assign-source dst=type src=content-type
   Data fn=filterrules-process
#  Perform the enumeration on HTML only
   Enumerate enable=true fn=enumerate-urls max=1024 type=text/html
#  Cleanup
   Shutdown fn=filterrules-shutdown
</Filter>
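To make the header, body, and end structure concrete, here is a minimal Python sketch that reads one filter definition into a name and a list of directives. It is not a parser shipped with the product. It assumes that each directive sits on one line, that comment lines start with #, and that property values contain no spaces or quotes. The parse_filter function and the SAMPLE text are hypothetical, and the sample is an abbreviated copy of the example above.

```python
import re

SAMPLE = """\
<Filter name="enumeration1">
   Setup fn=filterrules-setup config=./config/filterrules.conf
#  Process the rules
   MetaData fn=filterrules-process
   Enumerate enable=true fn=enumerate-urls max=1024 type=text/html
   Shutdown fn=filterrules-shutdown
</Filter>
"""

HEADER = re.compile(r'<Filter\s+name="([^"]+)">')

def parse_filter(text):
    """Return (filter name, [(directive, properties), ...])."""
    name = None
    directives = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "</Filter>":
            break  # end of the filter definition
        match = HEADER.fullmatch(line)
        if match:
            name = match.group(1)  # header declares the filter name
            continue
        # Body line: a directive followed by key=value properties.
        directive, *pairs = line.split()
        properties = dict(pair.split("=", 1) for pair in pairs)
        directives.append((directive, properties))
    return name, directives

name, directives = parse_filter(SAMPLE)
```

Each directive ends up as a pair such as ("Setup", {"fn": "filterrules-setup", "config": "./config/filterrules.conf"}), which mirrors the idea of a function name plus its properties.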

Filter Directives

Filter directives use robot application functions (RAFs) to perform operations. Their use and flow of execution are similar to those of NSAPI directives and server application functions (SAFs) in the Sun Java System Web Server's obj.conf file. As with NSAPI directives and SAFs, data is stored and transferred in property blocks, also called pblocks.

Six robot directives, or RAF classes, correspond to the filtering phases and operations listed in Resource Filtering Process: Setup, Metadata, Data, Enumerate, Generate, and Shutdown.

Each directive has its own robot application functions. For example, use filtering functions with the Metadata and Data directives, enumeration functions with the Enumerate directive, generation functions with the Generate directive, and so on.
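The relationship between directives, functions, and pblocks can be pictured with a small Python sketch. Everything here is hypothetical: the function names deny-non-http and copy-field are invented for illustration and are not built-in robot application functions, and a plain dictionary stands in for a pblock. The sketch assumes that each function receives its properties and the current resource, and returns True to allow the resource or False to deny it.

```python
def deny_non_http(pblock, resource):
    """Hypothetical filtering function: allow only one protocol."""
    return resource["protocol"] == pblock.get("protocol", "http")

def copy_field(pblock, resource):
    """Hypothetical function: copy one resource field to another."""
    resource[pblock["dst"]] = resource[pblock["src"]]
    return True

# Hypothetical registry mapping an fn= value to a Python callable.
FUNCTIONS = {"deny-non-http": deny_non_http, "copy-field": copy_field}

def run_directives(directives, resource):
    """Apply each (directive, pblock) in order; stop at the first denial."""
    for _directive, pblock in directives:
        func = FUNCTIONS[pblock["fn"]]
        if not func(pblock, resource):
            return False  # denied: later directives are not run
    return True

directives = [
    ("Metadata", {"fn": "deny-non-http"}),
    ("Data", {"fn": "copy-field", "dst": "type", "src": "content-type"}),
]
http_page = {"protocol": "http", "content-type": "text/html"}
ftp_file = {"protocol": "ftp", "content-type": "text/plain"}
http_allowed = run_directives(directives, http_page)
ftp_allowed = run_directives(directives, ftp_file)
```

The denied resource never reaches the second directive, which matches the rule that no further action is taken for denied resources.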

The built-in robot application functions, as well as instructions for writing your own robot application functions, are explained in the Sun Java System Portal Server 7.1 Developer's Guide.

Writing or Modifying a Filter

You can use the management console to create most of your site-definition based filters. You can then modify the filter.conf and filterrules.conf files to make any further changes. These files reside in the directory /var/opt/SUNWportal/searchservers/searchserverid/config.

To create a more complex set of properties, edit the configuration files used by the robot.

When you write or modify a filter, note the order of

You can also do the following:

For more information, see the Sun Java System Portal Server 7.1 Developer's Guide.