The robot uses filters to determine which resources to process and how to process them. When the robot discovers references to resources as well as the resources themselves, it applies filters to each resource in order to enumerate it and to determine whether or not to generate a resource description to store in the Search Engine database.
The robot examines one or more seed URLs, applies the filters, and then applies the filters to the URLs spawned by enumerating the seed URLs, and so on. The seed URLs are defined in the filterrules.conf file.
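The seed-and-spawn traversal described above can be sketched as a breadth-first walk. This is only an illustrative model, not the robot's implementation: the `links` dictionary stands in for retrieving and parsing a resource, and the `allow` predicate stands in for the whole filter chain. Both names are hypothetical.

```python
from collections import deque

def crawl(seed_urls, links, allow):
    """Filter each URL, then enumerate the URLs spawned by those that pass.

    links: hypothetical dict mapping a URL to the URLs it references.
    allow: hypothetical predicate standing in for the filter chain.
    """
    queue = deque(seed_urls)
    seen = set(seed_urls)
    processed = []
    while queue:
        url = queue.popleft()
        if not allow(url):
            continue  # denied: no further action is taken for this resource
        processed.append(url)
        for child in links.get(url, []):
            if child not in seen:  # avoid revisiting a resource
                seen.add(child)
                queue.append(child)
    return processed

links = {
    "http://home.siroe.com/": [
        "http://home.siroe.com/a.html",
        "ftp://ftp.siroe.com/pub/",
    ],
    "http://home.siroe.com/a.html": ["http://home.siroe.com/b.html"],
}
result = crawl(
    ["http://home.siroe.com/"],
    links,
    lambda url: url.startswith("http://"),  # example rule: HTTP only
)
print(result)
```

In this sketch the FTP URL is discovered but denied, so nothing it references is ever queued.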
A filter performs any required initialization operations and applies comparison tests to the current resource. The goal of each test is to either allow or deny the resource. A filter also has a shutdown phase during which it performs any required cleanup operations.
If a resource is allowed, it continues through the filter. If a resource is denied, it is rejected and the filter takes no further action on it. If a resource is not denied, the robot eventually enumerates it in an attempt to discover further resources. The generator might also create a resource description (RD) for it.
These operations are not necessarily linked. Some resources result in enumeration; others result in RD generation. Many resources result in both enumeration and RD generation. For example, if the resource is an FTP directory, the resource typically will not have an RD generated for it. However, the robot might enumerate the individual files in the FTP directory. An HTML document that contains links to other documents can receive an RD and can lead to enumeration of the linked documents as well.
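The independence of the two outcomes can be illustrated with a small sketch. The rules below are assumptions chosen only to mirror the two examples in the text (an FTP directory and an HTML document); the real decision is made by the configured filters, and the dictionary keys are hypothetical.

```python
def outcomes(resource):
    """Return (enumerate, generate_rd) for a resource.

    Assumed, simplified rules for illustration only:
    - an FTP directory is enumerated but gets no RD
    - an HTML document is enumerated and gets an RD
    - anything else gets an RD but is not enumerated
    """
    is_ftp_dir = resource["protocol"] == "ftp" and resource["path"].endswith("/")
    is_html = resource.get("content_type") == "text/html"
    enumerate_it = is_ftp_dir or is_html
    generate_rd = not is_ftp_dir
    return enumerate_it, generate_rd

ftp_dir = {"protocol": "ftp", "path": "/pub/"}
html_doc = {"protocol": "http", "path": "/index.html", "content_type": "text/html"}
pdf_doc = {"protocol": "http", "path": "/guide.pdf", "content_type": "application/pdf"}

for res in (ftp_dir, html_doc, pdf_doc):
    print(res["path"], outcomes(res))
```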
Both enumerator and generator filters run through five phases. Four phases are common to both: Setup, Metadata, Data, and Shutdown. A resource that passes the Data phase then enters either the Enumerate phase or the Generate phase, depending on whether the filter is an enumerator or a generator.
The phases are as follows:
Setup
Performs initialization operations. Occurs only once in the life of the robot.
Metadata
Filters the resource based on the metadata that is available about it. Metadata filtering occurs once per resource, before the resource is retrieved over the network. The table below lists the common metadata types with a description and an example of each.
Table 43–1 Common Metadata Types
Metadata | Description | Example |
---|---|---|
Complete URL | The location of a resource | http://home.siroe.com/ |
Protocol | The access portion of the URL | http, ftp, file |
Host | The address portion of the URL | www.siroe.com |
IP address | Numeric version of the host | 198.95.249.6 |
Path | The path portion of the URL | /index.html |
Depth | Number of links from the seed URL | 5 |
Data
Filters the resource based on its data. Data filtering is done once per resource after it is retrieved over the network. Data that can be used for filtering include:
content-type
content-length
content-encoding
content-charset
last-modified
expires
Enumerate
Enumerates the current resource in order to determine if it points to other resources to be examined.
Generate
Generates a resource description (RD) for the resource and saves it in the Search Engine database.
Shutdown
Performs any needed termination operations. Occurs once in the life of the robot.
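The ordering of the phases can be sketched as follows. This is a simplified model under stated assumptions, not the robot's code: `SketchFilter`, `metadata_for`, `run`, and the fake header table are all hypothetical names, and the allow rules (HTTP only, a depth limit, a content-type list) are examples. The point it illustrates is that metadata tests run before retrieval, while data tests run after it.

```python
from urllib.parse import urlsplit

def metadata_for(url, depth):
    """Derive the metadata that is known before retrieval (IP lookup omitted)."""
    parts = urlsplit(url)
    return {"url": url, "protocol": parts.scheme, "host": parts.hostname,
            "path": parts.path, "depth": depth}

class SketchFilter:
    """Hypothetical filter with Setup, Metadata, Data, and Shutdown phases."""
    def __init__(self, max_depth, allowed_types):
        self.max_depth = max_depth
        self.allowed_types = allowed_types
        self.log = []

    def setup(self):                      # once in the life of the robot
        self.log.append("setup")

    def metadata(self, meta):             # once per resource, before retrieval
        return meta["protocol"] == "http" and meta["depth"] <= self.max_depth

    def data(self, headers):              # once per resource, after retrieval
        return headers.get("content-type") in self.allowed_types

    def shutdown(self):                   # once in the life of the robot
        self.log.append("shutdown")

def run(flt, resources, fetch_headers):
    flt.setup()
    passed = []
    for url, depth in resources:
        if not flt.metadata(metadata_for(url, depth)):
            continue                      # denied without any network access
        if not flt.data(fetch_headers(url)):
            continue                      # denied after retrieval
        passed.append(url)                # would now be enumerated or generated
    flt.shutdown()
    return passed

FAKE_HEADERS = {
    "http://home.siroe.com/index.html": {"content-type": "text/html"},
    "http://home.siroe.com/logo.gif": {"content-type": "image/gif"},
}
fetched = []

def fetch_headers(url):
    fetched.append(url)                   # record which resources were retrieved
    return FAKE_HEADERS.get(url, {})

flt = SketchFilter(max_depth=5, allowed_types={"text/html"})
passed = run(flt, [
    ("http://home.siroe.com/index.html", 0),
    ("http://home.siroe.com/logo.gif", 1),
    ("ftp://ftp.siroe.com/pub/", 1),
    ("http://home.siroe.com/deep.html", 9),
], fetch_headers)
print(passed, fetched, flt.log)
```

Only the first two resources are retrieved in this run; the FTP URL and the too-deep URL are rejected on metadata alone.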
The filter.conf file contains the definitions of enumeration and generation filters, and it can contain multiple filters of each kind. The robot knows which filters to use because they are named by the enumeration-filter and generation-filter parameters in the robot.conf file.
Filter definitions have a well-defined structure: a header, a body, and an end. The header identifies the beginning of the filter and declares its name; for example:
<Filter name="myFilter">
The body consists of a series of filter directives that define the filter’s behavior during setup, testing, enumeration or generation, and shutdown. Each directive specifies a function, and if applicable, parameters for the function.
The end is marked by </Filter>.
The following example shows a filter named enumeration1.
<Filter name="enumeration1">
Setup fn=filterrules-setup config=./config/filterrules.conf
# Process the rules
MetaData fn=filterrules-process
# Filter by type and process rules again
Data fn=assign-source dst=type src=content-type
Data fn=filterrules-process
# Perform the enumeration on HTML only
Enumerate enable=true fn=enumerate-urls max=1024 type=text/html
# Cleanup
Shutdown fn=filterrules-shutdown
</Filter>
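To make the header, body, and end structure concrete, the following sketch reads a filter definition like the one above into a name and a list of directives. It is a minimal illustration that assumes one directive per line, `#` comment lines, and space-separated name=value parameters; it is not the parser the robot uses, and `parse_filter` is a hypothetical helper.

```python
import shlex

DIRECTIVES = {"setup", "metadata", "data", "enumerate", "generate", "shutdown"}

def parse_filter(text):
    """Return (filter_name, [(directive, params), ...]) for one definition."""
    name = None
    directives = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                          # skip blank lines and comments
        if line.startswith("<Filter"):
            # Header: declares the filter name
            name = line.split('name="', 1)[1].split('"', 1)[0]
        elif line == "</Filter>":
            break                             # End of the definition
        else:
            kind, *pairs = shlex.split(line)  # Body: directive plus parameters
            if kind.lower() not in DIRECTIVES:
                raise ValueError(f"unknown directive: {kind}")
            params = dict(pair.split("=", 1) for pair in pairs)
            directives.append((kind, params))
    return name, directives

EXAMPLE = """\
<Filter name="enumeration1">
Setup fn=filterrules-setup config=./config/filterrules.conf
# Process the rules
MetaData fn=filterrules-process
# Filter by type and process rules again
Data fn=assign-source dst=type src=content-type
Data fn=filterrules-process
# Perform the enumeration on HTML only
Enumerate enable=true fn=enumerate-urls max=1024 type=text/html
# Cleanup
Shutdown fn=filterrules-shutdown
</Filter>
"""

name, directives = parse_filter(EXAMPLE)
print(name)
for kind, params in directives:
    print(kind, params)
```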
Filter directives use Robot Application Functions (RAFs) to perform operations. Their use and flow of execution are similar to those of NSAPI directives and Server Application Functions (SAFs) in the obj.conf file. As with NSAPI directives and SAFs, data is stored and transferred in parameter blocks, also called pblocks.
There are six robot directives, or RAF classes, corresponding to the filtering phases and operations listed below. See Stages in the Filter Process for more information on these phases.
Setup
Metadata
Data
Enumerate
Generate
Shutdown
Each directive has its own robot application functions. For example, use filtering functions with the Metadata and Data directives, enumeration functions with the Enumerate directive, generation functions with the Generate directive, and so on.
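The relationship between directives, functions, and pblocks can be pictured with a small dispatch sketch. Everything here is an assumption for illustration: a pblock is modeled as a plain dictionary, the Python functions are stand-ins rather than the real RAF implementations, and `require-type` is a made-up function name that does not exist in the product.

```python
def assign_source(pblock, resource):
    """Stand-in for an assignment function: copy the src value into dst."""
    resource[pblock["dst"]] = resource.get(pblock["src"])
    return True                               # never denies the resource

def require_type(pblock, resource):
    """Hypothetical filtering function: allow only the given type."""
    return resource.get("type") == pblock["type"]

# Hypothetical registry: the fn parameter selects the function to call
FUNCTIONS = {
    "assign-source": assign_source,
    "require-type": require_type,             # not a real RAF name
}

def run_directives(directives, resource):
    """Apply each directive in order; stop at the first function that denies."""
    for _kind, pblock in directives:
        function = FUNCTIONS[pblock["fn"]]
        if not function(pblock, resource):
            return False
    return True

data_directives = [
    ("Data", {"fn": "assign-source", "dst": "type", "src": "content-type"}),
    ("Data", {"fn": "require-type", "type": "text/html"}),
]

html = {"content-type": "text/html"}
text = {"content-type": "text/plain"}
print(run_directives(data_directives, html), html)
print(run_directives(data_directives, text), text)
```

Because the second directive reads the value that the first one writes, swapping the two lines would deny every resource in this sketch, which is one reason the order of directives matters.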
The built-in robot application functions and instructions for writing your own robot application functions are explained in the Sun Java System Portal Server 7.1 Developer's Guide.
In most cases, you should not need to write filters from scratch. You can create most of your filters using the administration console. You can then modify the filter.conf and filterrules.conf files to make any desired changes. These files reside in the directory /var/opt/SUNWportal/searchservers/search1/config.
However, if you want to create a more complex set of parameters, you will need to edit the configuration files used by the robot.
Keep the following points in mind when writing or modifying a filter:
The order in which directives execute, and in particular which information is available at each phase
The order of the rules
For a discussion of the parameters you can modify in the robot.conf file, the robot application functions that you can use in the filter.conf file, and how to create your own robot application functions, see the Sun Java System Portal Server 7.1 Developer's Guide.