The robot uses filters to determine which resources to process and how to process them. When the robot discovers references to resources as well as the resources themselves, it applies filters to each resource. The filters enumerate the resource and determine whether to generate a resource description to store in the Search Server database.
The robot examines one or more starting point URLs, applies the filters, and then applies the filters to the URLs spawned by enumerating those URLs, and so on. The starting point URLs are defined in the filterrules.conf file.
Each enumeration and generation filter performs any required initialization operations and applies comparison tests to the current resource. The goal of each test is to allow or deny the resource. Each filter also has a shutdown phase during which it performs clean-up operations.
If a resource is allowed, then it continues its passage through the filter. The robot eventually enumerates it, attempting to discover further resources. The generator might also create a resource description for it.
If a resource is denied, the resource is rejected. No further action is taken by the filter for resources that are denied.
These operations are not necessarily linked. Some resources result in enumeration; others result in RD generation. Many resources result in both enumeration and RD generation. For example, if the resource is an FTP directory, the resource typically does not have an RD generated for it. However, the robot might enumerate the individual files in the FTP directory. An HTML document that contains links to other documents can result in an RD being generated, and can lead to enumeration of any linked documents as well.
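The independence of enumeration and generation can be sketched in a few lines of Python. This is a minimal model, not the robot's implementation: the `crawl` function, the `links` dictionary that stands in for the network, and the two allow/deny callables are all hypothetical names introduced for illustration.

```python
from collections import deque

def crawl(start_urls, links, may_enumerate, may_generate):
    """Walk resources breadth-first, applying both decisions to each one.

    links: dict mapping a URL to the URLs it references (stands in for the network).
    may_enumerate / may_generate: callables returning True (allow) or False (deny).
    """
    queue = deque(start_urls)
    seen = set(start_urls)
    descriptions = []  # stands in for the Search Server database
    while queue:
        url = queue.popleft()
        # Generation: decide independently whether this resource gets an RD.
        if may_generate(url):
            descriptions.append(url)
        # Enumeration: decide independently whether to follow its references.
        if may_enumerate(url):
            for child in links.get(url, []):
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
    return descriptions

# Hypothetical site: an FTP directory is enumerated but gets no RD,
# while the file inside it gets an RD but is not enumerated.
links = {
    "ftp://ftp.siroe.com/pub/": ["ftp://ftp.siroe.com/pub/readme.txt"],
}
rds = crawl(
    ["ftp://ftp.siroe.com/pub/"],
    links,
    may_enumerate=lambda u: u.endswith("/"),
    may_generate=lambda u: not u.endswith("/"),
)
print(rds)
```

In this sketch the directory is only enumerated and the file is only described, which mirrors the FTP example above.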
The following sections describe the filter process:
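Each enumeration or generation filter goes through five phases in the filtering process: the four shared phases (Setup, Metadata, Data, and Shutdown) plus either Enumerate or Generate, depending on the filter type.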
Setup – Performs initialization operations. Occurs only once in the life of the robot.
Metadata – Filters the resource based on metadata available about the resource. Metadata filtering occurs once per resource before the resource is retrieved over the network. Table 19–1 lists examples of common metadata types.
Metadata Type | Description | Example |
---|---|---|
Complete URL | The location of a resource | http://home.siroe.com/ |
Protocol | The access portion of the URL | http, ftp, file |
Host | The address portion of the URL | www.siroe.com |
IP address | Numeric version of the host | 198.95.249.6 |
Path | The path portion of the URL | /index.html |
Depth | Number of links from the starting point URL | 5 |
Data – Filters the resource based on its data. Data is filtered once per resource after the data is retrieved over the network. Data that can be used for filtering includes:
content-type
content-length
content-encoding
content-charset
last-modified
expires
Enumerate – Enumerates the current resource in order to determine whether it points to other resources to be examined.
Generate – Generates a resource description (RD) for the resource and saves it in the Search Server database.
Shutdown – Performs any needed termination operations. This process occurs once in the life of the robot.
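The difference between the Metadata and Data phases is mainly one of timing: metadata tests run before retrieval, data tests after. The Python sketch below illustrates that split under stated assumptions. The function names, the default limits, and the dictionary layout are hypothetical, and the IP address is left out because deriving it would require a DNS lookup.

```python
from urllib.parse import urlsplit

def url_metadata(url, depth):
    """Derive metadata that is available before the resource is fetched."""
    parts = urlsplit(url)
    return {
        "url": url,
        "protocol": parts.scheme,
        "host": parts.hostname,
        "path": parts.path or "/",
        "depth": depth,  # tracked by the crawler, not part of the URL
    }

def metadata_allows(meta, max_depth=5, protocols=("http", "ftp", "file")):
    """Metadata phase: a cheap test that runs before any network retrieval."""
    return meta["protocol"] in protocols and meta["depth"] <= max_depth

def data_allows(headers, allowed_types=("text/html",)):
    """Data phase: a test on values known only after retrieval."""
    return headers.get("content-type") in allowed_types

meta = url_metadata("http://home.siroe.com/index.html", depth=2)
print(meta["protocol"], meta["host"], meta["path"])
print(metadata_allows(meta))
print(data_allows({"content-type": "text/html", "content-length": "2048"}))
```

Rejecting a resource in the metadata test avoids the cost of retrieving it, which is why the cheaper checks come first.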
The filter.conf file contains definitions for enumeration and generation filters. This file can contain multiple filters for both enumeration and generation. The filters used by the robot are specified by the enumeration-filter and generation-filter properties in the file robot.conf.
Filter definitions have a well-defined structure: a header, a body, and an end. The header identifies the beginning of the filter and declares its name, for example:
<Filter name="myFilter">
The body consists of a series of filter directives that define the filter’s behavior during setup, testing, enumeration or generation, and shutdown. Each directive specifies a function and, if applicable, properties for the function.
The end is marked by </Filter>.
Example 19–1 shows a filter named enumeration1.
<Filter name="enumeration1">
   Setup fn=filterrules-setup config=./config/filterrules.conf
#  Process the rules
   MetaData fn=filterrules-process
#  Filter by type and process rules again
   Data fn=assign-source dst=type src=content-type
   Data fn=filterrules-process
#  Perform the enumeration on HTML only
   Enumerate enable=true fn=enumerate-urls max=1024 type=text/html
#  Cleanup
   Shutdown fn=filterrules-shutdown
</Filter>
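To make the header, body, and end structure concrete, the following Python sketch parses a single filter definition into its name and directives. It is an illustration only, not the robot's parser: `parse_filter` is a hypothetical helper, and it assumes that each directive sits on one line and that property values contain no spaces.

```python
import re

HEADER = re.compile(r'<Filter\s+name="([^"]+)">')

def parse_filter(text):
    """Parse one filter definition into (name, directives).

    Each directive becomes (directive_class, properties), where properties
    is a dict built from the key=value pairs on that line.
    """
    name = None
    directives = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        header = HEADER.fullmatch(line)
        if header:
            name = header.group(1)  # header: start of the filter
            continue
        if line == "</Filter>":
            break  # end of the filter
        directive, *pairs = line.split()
        props = dict(pair.split("=", 1) for pair in pairs)
        directives.append((directive, props))
    return name, directives

SAMPLE = """\
<Filter name="enumeration1">
    Setup fn=filterrules-setup config=./config/filterrules.conf
    # Process the rules
    MetaData fn=filterrules-process
    Enumerate enable=true fn=enumerate-urls max=1024 type=text/html
    Shutdown fn=filterrules-shutdown
</Filter>
"""

name, directives = parse_filter(SAMPLE)
print(name)
for directive, props in directives:
    print(directive, props["fn"])
```

Keeping the directives in a list preserves their order, which matters because the phases run in sequence.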
Filter directives use robot application functions (RAFs) to perform operations. Their use and flow of execution are similar to those of NSAPI directives and server application functions (SAFs) in the Sun Java System Web Server's obj.conf file. As with NSAPI directives and SAFs, data is stored and transferred in property blocks, also called pblocks.
Six robot directives, or RAF classes, correspond to the filtering phases and operations listed in Resource Filtering Process:
Setup
Metadata
Data
Enumerate
Generate
Shutdown
Each directive has its own robot application functions. For example, use filtering functions with the Metadata and Data directives, enumeration functions with the Enumerate directive, generation functions with the Generate directive, and so on.
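The relationship between directives, functions, and pblocks can be pictured with a small Python model. Everything here is an assumption made for illustration: a pblock is represented as a plain dictionary, `allow-html-only` is an invented function name, and the behavior given to `assign-source` (copying one property to another) is only a guess based on its `src` and `dst` properties. The real RAF interface is described in the Developer's Guide.

```python
# Hypothetical RAFs: each receives the directive's properties and a
# per-resource pblock (a dict here) and returns True (allow) or False (deny).
def assign_source(props, pblock):
    # Assumed behavior: copy the value named by src into the property dst.
    pblock[props["dst"]] = pblock.get(props["src"])
    return True

def allow_html_only(props, pblock):
    # Invented filtering function for this sketch.
    return pblock.get("type") == "text/html"

FUNCTIONS = {
    "assign-source": assign_source,
    "allow-html-only": allow_html_only,
}

def run_directives(directives, pblock):
    """Apply directives in order; stop at the first function that denies."""
    for directive_class, props in directives:
        if not FUNCTIONS[props["fn"]](props, pblock):
            return False
    return True

directives = [
    ("Data", {"fn": "assign-source", "dst": "type", "src": "content-type"}),
    ("Data", {"fn": "allow-html-only"}),
]
print(run_directives(directives, {"content-type": "text/html"}))
print(run_directives(directives, {"content-type": "image/png"}))
```

The second directive depends on the property set by the first, which shows why the order of directives within a filter matters.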
The built-in robot application functions, as well as instructions for writing your own robot application functions, are explained in the Sun Java System Portal Server 7.1 Developer's Guide.
In most cases, you can use the management console to create your site-definition-based filters. You can then modify the filter.conf and filterrules.conf files to make any further changes. These files reside in the directory /var/opt/SUNWportal/searchservers/searchserverid/config.
To create a more complex set of properties, edit the configuration files used by the robot.
When you write or modify a filter, note the order of the following:
The execution of directives, especially the information available at each phase.
The filter rules in filterrules.conf.
You can also do the following:
Modify properties in the robot.conf file.
Modify robot application functions in the filter.conf file.
Create your own robot application functions.
For more information, see the Sun Java System Portal Server 7.1 Developer's Guide.