Sun Java System Portal Server 7 Developer's Guide

Chapter 16 Search Engine Robot Overview

This chapter describes the search engine robot. It covers the following topics:

Introduction to the Search Engine Robot

A search engine robot is an agent that identifies and reports on resources in its domains; it does so by using two kinds of filters: an enumerator filter and a generator filter.

The enumerator filter locates resources by using network protocols. It tests each resource, and, if it meets the selection criteria, it is enumerated. For example, the enumerator filter can extract hypertext links from an HTML file and use the links to find additional resources.

The generator filter tests each resource to determine if a resource description (RD) should be created. If the resource passes the test, the generator creates an RD which is stored in the search engine database.
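The division of labor between the two filters can be sketched in Python. This is an illustrative model only, with hypothetical function names and rules; the actual robot filters are configured in filter.conf, not written in Python:

```python
import re

# Illustrative model of the two robot filters (hypothetical names and
# rules; the real filters are defined in filter.conf).

def enumerator_filter(resource):
    """Return the URLs found in a resource, if it passes the
    enumeration test. Here, only HTML yields further resources."""
    if resource["content-type"] != "text/html":
        return []
    # Extract hypertext links (href="...") from the HTML body.
    return re.findall(r'href="([^"]+)"', resource["data"])

def generator_filter(resource):
    """Decide whether a resource description (RD) should be created.
    Example rule: index HTML and plain-text documents only."""
    return resource["content-type"] in ("text/html", "text/plain")

page = {
    "url": "http://home.siroe.com/",
    "content-type": "text/html",
    "data": '<a href="/index.html">Home</a>',
}
print(enumerator_filter(page))  # → ['/index.html']
print(generator_filter(page))   # → True (an RD would be generated)
```

Note that the same resource can pass both tests: it is enumerated for further links and an RD is generated for it.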

How the Robot Works

Figure 16–1 illustrates how the search engine robot works. As the figure shows, the robot examines URLs and their associated network resources. Each resource is tested by both the enumerator and the generator. If the resource passes the enumeration test, the robot checks it for additional URLs. If the resource passes the generation test, the robot generates a resource description that is stored in the search engine database.
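This examine-and-test cycle amounts to a crawl loop. The following Python sketch is a simplified model with hypothetical names, using an in-memory site in place of real network retrieval; the real robot reads its starting points from filterrules.conf:

```python
from collections import deque

# A tiny in-memory "network" standing in for real resources:
# each URL maps to the links found in that resource.
SITE = {
    "http://home.siroe.com/": ["http://home.siroe.com/a.html"],
    "http://home.siroe.com/a.html": [],
}

def crawl(starting_points):
    """Simplified model of the robot's main loop (hypothetical)."""
    database = []                    # stands in for the search engine DB
    queue = deque(starting_points)
    seen = set(starting_points)
    while queue:
        url = queue.popleft()
        # Enumeration test: discover further URLs in this resource.
        for link in SITE.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
        # Generation test: decide whether to store an RD.
        if url.endswith("/") or url.endswith(".html"):
            database.append({"url": url})
    return database

rds = crawl(["http://home.siroe.com/"])
print([rd["url"] for rd in rds])
```

Each discovered URL passes through both tests, so a single resource can contribute new URLs to the queue and an RD to the database.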

Figure 16–1 How the Robot Works


Robot Configuration Files

Robot configuration files define the behavior of the search engine robots. These files reside in the /var/opt/SUNWportal/searchservers/search1/config/ directory.

robot.conf

Defines most of the operating parameters for the robot.

filter.conf

Contains all of the functions used by the Search Engine robot during the enumeration and generation filtering tasks. Including the same functions for both enumeration and generation ensures that a single rule change affects both tasks.

filterrules.conf

Contains the starting points (also referred to as starting point URLs) and rules used by the filterrules-process function.

classification.conf

Contains rules used to classify RDs generated by the robot.

Use the administration console to edit these configuration files.

The Filtering Process

The robot uses filters to determine which resources to process and how to process them. As the robot discovers resources and references to resources, it applies filters to each one to enumerate it and to determine whether to generate a resource description to store in the search engine database.

The robot examines one or more starting point URLs (defined in the filterrules.conf file), applies the filters to them, then applies the filters to the URLs spawned by enumerating those starting points, and so on.

A filter performs any required initialization operations and applies comparison tests to the current resource. The goal of each test is to either allow or deny the resource. A filter also has a shutdown phase during which it performs any required cleanup operations.

If a resource is allowed, it continues through the filter. If a resource is denied, it is rejected and the filter takes no further action on it. If a resource is not denied, the robot eventually enumerates it, attempting to discover further resources. The generator might also create a resource description for it.

Note that these operations are not necessarily linked. Some resources result in enumeration; others result in RD generation. Many resources result in both enumeration and RD generation. For example, if the resource is an FTP directory, the resource typically will not have an RD generated for it. However, the robot might enumerate the individual files in the FTP directory. An HTML document that contains links to other documents can produce a generated RD and can lead to enumeration of the linked documents as well.
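The allow/deny behavior can be sketched as a rule chain that rejects a resource at the first deny. The rule functions below are hypothetical examples; the actual rules live in filterrules.conf:

```python
# Illustrative allow/deny rule chain (hypothetical rules; the real
# rules are defined in filterrules.conf).

ALLOW, DENY = "allow", "deny"

def deny_ftp(url):
    """Example rule: skip FTP resources."""
    return DENY if url.startswith("ftp://") else ALLOW

def deny_cgi(url):
    """Example rule: skip CGI programs."""
    return DENY if "/cgi-bin/" in url else ALLOW

RULES = [deny_ftp, deny_cgi]

def passes_filter(url):
    """A resource continues through the filter until a rule denies it."""
    for rule in RULES:
        if rule(url) == DENY:
            return False   # rejected: no further action is taken
    return True            # allowed: candidate for enumeration / RD

print(passes_filter("http://home.siroe.com/index.html"))  # → True
print(passes_filter("ftp://home.siroe.com/pub/"))         # → False
```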

Stages in the Filter Process

Both enumerator and generator filters pass a resource through five phases. Four phases are common to both: Setup, Metadata, Data, and Shutdown. If the resource makes it past the Data phase, the fifth phase is either Enumerate or Generate, depending on whether the filter is an enumerator or a generator.

Setup

Performs initialization operations. Occurs only once in the life of the robot.

Metadata

Filters the resource based on metadata that is available about the resource. Metadata filtering occurs once per resource before the resource is retrieved over the network. Table 16–1 lists examples of common metadata types.

Table 16–1 Common Metadata Types

Metadata        Description                                    Example
Complete URL    The location of a resource                     http://home.siroe.com/
Protocol        The access portion of the URL                  http, ftp, file
Host            The address portion of the URL                 www.siroe.com
IP address      Numeric version of the host                    198.95.249.6
Path            The path portion of the URL                    /index.html
Depth           Number of links from the starting point URL
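Most of the metadata types above can be derived from the URL alone, which is why metadata filtering can run before the resource is retrieved over the network. A sketch using Python's standard URL parser (the function name and dictionary keys are illustrative):

```python
from urllib.parse import urlparse

def url_metadata(url, depth):
    """Derive Table 16-1-style metadata from a URL before retrieval
    (hypothetical helper; key names are illustrative)."""
    parts = urlparse(url)
    return {
        "complete-url": url,
        "protocol": parts.scheme,   # e.g. http, ftp, file
        "host": parts.netloc,       # address portion of the URL
        "path": parts.path or "/",  # path portion of the URL
        "depth": depth,             # links from the starting point URL
    }

meta = url_metadata("http://home.siroe.com/index.html", depth=1)
print(meta["protocol"], meta["host"], meta["path"])
```

Resolving the host to its IP address would require a DNS lookup, which is why it is listed separately from the URL-derived fields.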

Data

Filters the resource based on its data. Data filtering is done once per resource after it is retrieved over the network. Data that can be used for filtering includes:

  • content-type

  • content-length

  • content-encoding

  • content-charset

  • last-modified

  • expires
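Because these fields come from the retrieved resource's headers, a data-phase test runs only after the network fetch. A minimal sketch, with a hypothetical rule and threshold:

```python
def data_phase_allows(headers, max_length=1_000_000):
    """Illustrative data-phase test (hypothetical rule): allow only
    HTML or plain-text resources up to max_length bytes."""
    ctype = headers.get("content-type", "")
    clen = int(headers.get("content-length", 0))
    if not (ctype.startswith("text/html") or ctype.startswith("text/plain")):
        return False
    return clen <= max_length

print(data_phase_allows({"content-type": "text/html",
                         "content-length": "5120"}))   # → True
print(data_phase_allows({"content-type": "image/gif",
                         "content-length": "5120"}))   # → False
```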

Enumerate

Enumerates the current resource in order to determine if it points to other resources to be examined.

Generate

Generates a resource description (RD) for the resource and saves it to be added to the search engine database.

Shutdown

Performs any needed cleanup operations. Occurs only once in the life of the robot.
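The five phases above can be modeled as the lifecycle of a single filter object. This is a hypothetical Python sketch; in the actual product the phases are configured as directives in filter.conf:

```python
# Sketch of the five-phase filter lifecycle (hypothetical class; the
# real phases are directives in filter.conf).

class RobotFilter:
    def __init__(self, kind):
        assert kind in ("enumerator", "generator")
        self.kind = kind
        self.ready = False

    def setup(self):
        """Setup: occurs only once in the life of the robot."""
        self.ready = True

    def metadata_phase(self, meta):
        """Metadata: test per resource, before network retrieval."""
        return meta.get("protocol") in ("http", "ftp", "file")

    def data_phase(self, headers):
        """Data: test per resource, after network retrieval."""
        return headers.get("content-type") == "text/html"

    def final_phase(self, resource):
        """Enumerate or Generate, depending on the filter kind."""
        if self.kind == "enumerator":
            return resource.get("links", [])   # URLs to examine next
        return {"url": resource["url"]}        # the RD to be stored

    def shutdown(self):
        """Shutdown: occurs only once in the life of the robot."""
        self.ready = False

f = RobotFilter("generator")
f.setup()
resource = {"url": "http://home.siroe.com/", "links": []}
rd = None
if f.metadata_phase({"protocol": "http"}) and \
        f.data_phase({"content-type": "text/html"}):
    rd = f.final_phase(resource)
f.shutdown()
print(rd)  # → {'url': 'http://home.siroe.com/'}
```

A resource that fails the Metadata phase is never fetched, and one that fails the Data phase never reaches Enumerate or Generate, which matches the ordering described above.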