Sun Java System Portal Server 7.1 Administration Guide

Understanding the Search Server Robot

A Search Server robot is an agent that identifies and reports on resources in its domains. It does so by using two kinds of filters: an enumerator filter and a generator filter.

The enumerator filter locates resources by using network protocols. The filter tests each resource, and if the resource meets the specified criteria, it is enumerated. For example, the enumerator filter can extract hypertext links from an HTML file and use those links to find additional resources.

The generator filter tests each resource to determine whether a resource description (RD) should be created. If the resource passes the test, the generator creates an RD that is stored in the Search Server database.
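The two filters can be thought of as independent tests that the robot applies to every resource it retrieves. The following sketch illustrates that idea in Python; it is a conceptual illustration only, and the function names, attribute names, host names, and MIME types in it are hypothetical rather than part of the product.

    ALLOWED_HOSTS = {"portal.example.com"}             # hypothetical site list
    INDEXED_TYPES = {"text/html", "application/pdf"}   # hypothetical MIME types

    def enumeration_filter(resource):
        """Decide whether the robot should extract further links from this resource."""
        return (resource["content_type"] == "text/html"
                and resource["host"] in ALLOWED_HOSTS)

    def generation_filter(resource):
        """Decide whether a resource description (RD) should be created."""
        return resource["content_type"] in INDEXED_TYPES and resource["size"] > 0

    def make_resource_description(resource):
        """Build the RD that would be stored in the Search Server database."""
        return {
            "url": resource["url"],
            "title": resource.get("title", ""),
            "content-type": resource["content_type"],
        }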

The following sections describe the configuration and maintenance tasks that you might need to perform to administer the robot.

How the Robot Works

Figure 12–1 shows how the robot examines URLs and their associated network resources. Both the enumerator and the generator test each resource. If the resource passes the enumeration test, the robot checks it for additional URLs. If the resource passes the generator test, the robot generates a resource description that is stored in the Search Server database.

Figure 12–1 How the Robot Works

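Continuing the hypothetical Python sketch above, the overall process can be pictured as a queue of URLs: every resource that passes the enumeration test contributes new links to the queue, and every resource that passes the generation test produces an RD for the database. The fetch, extract_links, and database arguments below stand in for the robot's internal machinery and are not part of the product.

    from collections import deque

    def crawl(starting_point_urls, fetch, extract_links, database):
        """Conceptual robot loop: enumerate links and generate RDs."""
        queue = deque(starting_point_urls)
        seen = set(starting_point_urls)

        while queue:
            url = queue.popleft()
            resource = fetch(url)

            # Enumeration test: if the resource qualifies, follow its links.
            if enumeration_filter(resource):
                for link in extract_links(resource):
                    if link not in seen:
                        seen.add(link)
                        queue.append(link)

            # Generation test: if the resource qualifies, store an RD.
            if generation_filter(resource):
                database.store(make_resource_description(resource))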

Robot Configuration Files

Robot configuration files define the behavior of the robots. These files reside in the directory /var/opt/SUNWportal/searchservers/searchserverid/config. The following list provides a description for each of the robot configuration files.

classification.conf

Contains rules used to classify RDs generated by the robot.

filter.conf

Defines the enumeration and generation filters used by the robot.

filterrules.conf

Contains the robot's site definitions, starting point URLs, rules for filtering based on mime type, and URL patterns.

robot.conf

Defines most operating properties for the robot.

Because you can set most properties by using the Search Server Administration interface, you typically do not need to edit the robot.conf file. However, advanced users might manually edit this file to set properties that cannot be set through the interface.

Defining Sites

The robot finds resources and determines whether to add descriptions of those resources to the database. The determination of which servers to visit and what parts of those servers to index is called a site definition.

Defining the sites for the robot is one of the most important jobs of the server administrator. You need to be sure you send the robot to all the servers it needs to index, but you also need to exclude extraneous sites that can fill the database and make finding the correct information more difficult.
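To make the idea concrete, a site definition can be pictured as a set of starting point URLs together with include and exclude URL patterns. The sketch below is a hypothetical Python model of that idea; it does not show the syntax actually used in filterrules.conf, and the host names and patterns are invented.

    import re

    site = {
        "starting_points": ["http://docs.example.com/"],
        "include": [r"^http://docs\.example\.com/manuals/"],
        "exclude": [r"\.gif$", r"/archive/"],
    }

    def url_in_site(url, site):
        """True if the URL belongs to the site and should be considered for indexing."""
        if any(re.search(pattern, url) for pattern in site["exclude"]):
            return False
        return any(re.search(pattern, url) for pattern in site["include"])

    print(url_in_site("http://docs.example.com/manuals/admin.html", site))  # True
    print(url_in_site("http://docs.example.com/archive/old.html", site))    # False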

Controlling Robot Crawling

The robot extracts and follows links to the various sites selected for indexing. As the system administrator, you can control these processes through a number of settings.

See the Sun Java System Portal Server 7.1 Technical Reference for descriptions of the robot crawling attributes.
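The sketch below suggests how a few such settings might shape a crawl, for example a depth limit and a delay between requests. The setting names are hypothetical; the actual crawling attributes are the ones documented in the Technical Reference cited above.

    import time

    settings = {
        "max_depth": 5,        # how many links deep to follow from a starting point
        "request_delay": 1.0,  # seconds to wait between requests
        "max_urls": 10000,     # stop after this many URLs have been processed
    }

    def should_follow(depth, urls_processed, settings):
        """Apply depth and volume limits before following another link."""
        return depth <= settings["max_depth"] and urls_processed < settings["max_urls"]

    def polite_fetch(fetch, url, settings):
        """Pause between requests so crawling does not overload a server."""
        time.sleep(settings["request_delay"])
        return fetch(url)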

Filtering Robot Data

Filters identify a resource so that it can be excluded or included by comparing an attribute of the resource against a filter definition. The robot provides a number of predefined filters, some of which are enabled by default.

You can create new filter definitions, modify a filter definition, or enable or disable filters. See Resource Filtering Process for detailed information.
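A filter definition can be thought of as a named rule that compares one resource attribute against a pattern, either allows or denies the resource, and can be enabled or disabled independently. The sketch below models that idea in Python; the filter names, attributes, and patterns are invented and do not correspond to the product's predefined filters.

    filters = [
        {"name": "deny-images",   "attribute": "content_type", "pattern": "image/",    "action": "deny",  "enabled": True},
        {"name": "allow-html",    "attribute": "content_type", "pattern": "text/html", "action": "allow", "enabled": True},
        {"name": "deny-archives", "attribute": "url",          "pattern": "/archive/", "action": "deny",  "enabled": False},
    ]

    def resource_is_included(resource, filters):
        """Return True if the resource is included, False if a deny rule matches."""
        for rule in filters:
            if not rule["enabled"]:
                continue
            if rule["pattern"] in str(resource.get(rule["attribute"], "")):
                return rule["action"] == "allow"
        return True  # no enabled filter matched; include by default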

Using the Robot Utilities

The robot includes two debugging utilities.

Scheduling the Robot

To keep the search data timely, the robot should search and index sites regularly. Because robot crawling and indexing can consume processing resources and network bandwidth, you should schedule the robot to run during non-peak days and times. The management console allows administrators to set up a schedule to run the robot.
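A schedule of this kind amounts to a recurring off-peak window during which the robot is allowed to run. The sketch below expresses that idea in Python with an invented window of weeknights from 01:00 to 05:00; the actual schedule is configured in the management console.

    from datetime import datetime

    OFF_PEAK_DAYS = {0, 1, 2, 3, 4}   # Monday through Friday
    OFF_PEAK_HOURS = range(1, 5)      # 01:00 to 04:59

    def robot_may_run(now=None):
        """True only inside the administrator-chosen off-peak window."""
        now = now or datetime.now()
        return now.weekday() in OFF_PEAK_DAYS and now.hour in OFF_PEAK_HOURS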