A Search Server robot is an agent that identifies and reports on resources in its domains. It does so by using two kinds of filters: an enumerator filter and a generator filter.
The enumerator filter locates resources by using network protocols. It tests each resource; if the resource meets the filter's criteria, the resource is enumerated. For example, the enumerator filter can extract hypertext links from an HTML file and follow those links to find additional resources.
The generator filter tests each resource to determine whether a resource description (RD) should be created. If the resource passes the test, the generator creates an RD that is stored in the Search Server database.
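The two-filter flow can be illustrated with a short sketch. This is a conceptual illustration only, not the product's actual API: the function names, the link-extraction logic, and the set of indexable types are all invented for the example.

```python
# Conceptual sketch of the robot's two-filter flow. All names and the
# indexable-type set are illustrative, not the product's actual mechanism.
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href targets from anchor tags, as the enumerator does for HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def enumerate_resource(content_type, body):
    """Enumerator filter: return any further URLs found in the resource."""
    if content_type == "text/html":
        parser = LinkExtractor()
        parser.feed(body)
        return parser.links
    return []


def generate_rd(url, content_type):
    """Generator filter: decide whether a resource description (RD) is created."""
    indexable = {"text/html", "text/plain", "application/pdf"}  # illustrative
    if content_type in indexable:
        return {"url": url, "type": content_type}
    return None
```

In this sketch an HTML page both yields new URLs to crawl (enumeration) and produces an RD (generation), while an image would do neither.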
The following sections describe the configuration and maintenance tasks for administering the robot:
Figure 12–1 shows how the robot examines URLs and their associated network resources. Both the enumerator and the generator test each resource. If the resource passes the enumeration test, the robot checks it for additional URLs. If the resource passes the generator test, the robot generates a resource description that is stored in the Search Server database.
Robot configuration files define the behavior of the robots. These files reside in the directory /var/opt/SUNWportal/searchservers/searchserverid/config. The following list provides a description for each of the robot configuration files.
Contains rules used to classify RDs generated by the robot.
Defines the enumeration and generation filters used by the robot.
Contains the robot's site definitions, starting point URLs, rules for filtering based on mime type, and URL patterns.
Defines most operating properties for the robot.
Because you can set most properties by using the Search Server Administration interface, you typically do not need to edit the robot.conf file. However, advanced users might manually edit this file to set properties that cannot be set through the interface.
The robot finds resources and determines whether to add descriptions of those resources to the database. The determination of which servers to visit and what parts of those servers to index is called a site definition.
Defining the sites for the robot is one of the most important jobs of the server administrator. You need to be sure you send the robot to all the servers it needs to index, but you also need to exclude extraneous sites that can fill the database and make finding the correct information more difficult.
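Conceptually, a site definition combines starting points with include and exclude rules. The sketch below is illustrative only; the patterns, data layout, and matching logic are invented for the example and do not reflect the product's configuration syntax.

```python
# Illustrative site definition: starting points plus URL patterns to include
# or exclude. The real robot reads its rules from its configuration files;
# this layout and these patterns are hypothetical.
import fnmatch

SITE = {
    "start": ["http://docs.example.com/"],
    "include": ["http://docs.example.com/*"],
    "exclude": ["*/cgi-bin/*", "*.bak"],
}


def in_site(url, site=SITE):
    """Return True if the URL falls inside the site definition."""
    if any(fnmatch.fnmatch(url, pat) for pat in site["exclude"]):
        return False
    return any(fnmatch.fnmatch(url, pat) for pat in site["include"])
```

With rules like these, pages under docs.example.com are crawled, while CGI scripts, backup files, and URLs on other servers are left out, which keeps extraneous material from filling the database.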
The robot extracts and follows links to the various sites selected for indexing. As the system administrator, you can control these processes through a number of settings, including:
Starting, stopping, and scheduling the robot
Defining the sites the robot visits
Defining crawling attributes that determine how aggressively the robot crawls
Defining filters to determine the types of resources the robot indexes
Defining indexing attributes to determine what kinds of entries the robot creates in the database
See the Sun Java System Portal Server 7.1 Technical Reference for descriptions of the robot crawling attributes.
Filters identify a resource so that it can be excluded or included by comparing an attribute of the resource against a filter definition. The robot provides a number of predefined filters, some of which are enabled by default. The predefined filters are listed below; filters marked with an asterisk are enabled by default.
Archive Files*
Audio Files*
Backup Files*
Binary Files*
CGI Files*
Image Files*
Java, JavaScript, Style Sheet Files*
Log Files*
Lotus Domino Documents
Lotus Domino OpenViews
Plug-in Files
PowerPoint Files
Revision Control Files*
Source Code Files*
Spreadsheet Files
System Directories (UNIX)
System Directories (NT)
Temporary Files*
Video Files*
You can create new filter definitions, modify a filter definition, or enable or disable filters. See Resource Filtering Process for detailed information.
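Many of the predefined filters amount to excluding resources by file type. The sketch below shows the idea with a few hypothetical patterns; the actual patterns and the enable/disable mechanism live in the robot's filter configuration, and everything named here is invented for illustration.

```python
# Hypothetical extension patterns for a few of the predefined filters.
# The real patterns are defined in the robot's filter configuration.
import fnmatch

FILTERS = {
    "Archive Files": ["*.zip", "*.tar", "*.gz"],
    "Image Files": ["*.gif", "*.jpg", "*.png"],
    "Spreadsheet Files": ["*.xls"],
}
# Filters marked with an asterisk in the list above are enabled by default;
# here, Spreadsheet Files is left disabled to mirror that.
ENABLED = {"Archive Files", "Image Files"}


def excluded_by(url):
    """Return the name of the enabled filter that excludes this URL, or None."""
    for name in ENABLED:
        if any(fnmatch.fnmatch(url, pat) for pat in FILTERS[name]):
            return name
    return None
```

Disabling a filter (removing it from the enabled set in this sketch) lets resources of that type through to the generator, which is why enabling or disabling filters changes what ends up in the database.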
The robot includes two debugging tools:
Site Probe – Checks for DNS aliases, server redirects, virtual servers, and the like.
Simulator – Performs a partial simulation of robot filtering on a URL. The simulator indicates whether sites you listed would be accepted by the robot.
To keep the search data timely, the robot should search and index sites regularly. Because robot crawling and indexing can consume processing resources and network bandwidth, you should schedule the robot to run during non-peak days and times. The management console allows administrators to set up a schedule to run the robot.
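If your deployment runs the robot from a script outside the management console, a cron entry is one common way to confine runs to non-peak times. The entry below is illustrative only: the script path is a placeholder, not a documented command, and the product's supported mechanism is the schedule set in the management console.

```
# Illustrative crontab entry: start the robot at 2:00 AM every Saturday,
# a typical non-peak window. Replace /path/to/start-robot.sh with whatever
# command starts the robot in your deployment.
0 2 * * 6 /path/to/start-robot.sh
```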