NOTE: If you are upgrading from Netscape Catalog Server 1.0, you will probably want to migrate your existing Catalog Server configuration and content to the Compass Server. The migration process is explained in Migrating from Catalog Server 1.0.

As administrator, you have a great deal of control over where your robots go to locate resources and what they do when they get there. The Server Manager provides two separate areas for configuring robots: one for setting operational parameters and the other for defining robot rules.
Here are the phases in a bit more detail:
There are two kinds of tests the robot applies to each URL: site definitions and filters. A site definition determines whether the URL is part of a site that should be included in the database. This is how you limit the database to specific servers, groups of servers, or domains (such as entire companies or organizations). A filter determines whether the URL represents a type of resource you want to include in the database. You can choose, for example, to include text files, but exclude spreadsheets or binary files.
If a URL passes both the site definitions and the filters, the robot queues it up for two further kinds of processing: extracting (5) and indexing (6).
Depending on the nature of each URL provided, the robot might either generate a resource description (if the URL denotes a resource of a type it should index) or track down links to other resources. The robot maintains a database of the URLs it has already enumerated, along with checksum data for those URLs, so that it can skip resources it has already processed. This prevents recursive or infinite looping. The goal of the extraction process is to discover all the resources in the sites specified by the initial URL list and to filter out references to unwanted items. Extraction is explained in more detail in Controlling Where the Robot Goes.
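The loop-avoidance idea can be pictured with a minimal sketch. The `fetch` and `extract_links` helpers below are hypothetical placeholders, and this is not the robot's actual implementation; it only shows how remembering URLs and content checksums keeps an enumerator from revisiting the same resources.

```python
# Illustrative sketch only: avoid revisiting resources by remembering
# URLs and content checksums already seen during enumeration.
import hashlib
from collections import deque

def crawl(starting_points, fetch, extract_links):
    """fetch(url) -> document text; extract_links(text) -> list of URLs."""
    queue = deque(starting_points)
    seen_urls = set()
    seen_checksums = set()
    while queue:
        url = queue.popleft()
        if url in seen_urls:
            continue                      # URL already enumerated
        seen_urls.add(url)
        document = fetch(url)
        checksum = hashlib.md5(document.encode()).hexdigest()
        if checksum in seen_checksums:
            continue                      # same content reached by another path
        seen_checksums.add(checksum)
        queue.extend(extract_links(document))
```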
About Indexing
Indexing is the phase in which the robot takes all the information it has gathered on a resource and turns it into a resource description for inclusion in the Compass Server database. This mostly involves putting information in the proper fields, as defined by the database schema, and also generating some of the fields, such as the creation date, keywords, partial text, and so on.
You can customize some aspects of resource-description generation, as described in Controlling What the Robot Generates. There is also a great deal more detail on indexing in the Netscape Compass Server Programmer's Guide.
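As an illustration of what "putting information in the proper fields" amounts to, here is a minimal sketch. The field names and limits are hypothetical stand-ins for whatever the actual database schema defines; this is not the Compass Server's generation code.

```python
# Illustrative sketch: assembling a resource description from gathered data.
# Field names ("url", "title", "keywords", ...) are examples only.
from datetime import datetime, timezone

def make_resource_description(url, title, full_text, max_partial_text=255):
    return {
        "url": url,
        "title": title,
        "creation-date": datetime.now(timezone.utc).isoformat(),
        "keywords": sorted(set(full_text.lower().split()))[:20],
        "partial-text": full_text[:max_partial_text],
    }
```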
Starting, Stopping, and State
In many instances, you'll want to keep the robot running continuously, so that as it encounters new documents published on the net, it can enumerate those resources, generate new resource descriptions, and update the database. It is entirely possible, however, that there will be times that you don't want to run the robot, either because of the load on the server or because of the load the robot can place on other servers.
As the robot runs, it maintains a great deal of information about its current state: which sites it has visited, where it has yet to go, and various other status data. You can stop and restart the robot at will, either manually or by automated schedule, and it will resume where it left off because it maintains state information. This enables you, for example, to only run the robot at night, even if it takes several nights to complete a single robot run.
You can also specify that you want the robot to start fresh. That is, you can delete its state information, causing it to start its next run from the beginning, by enumerating from its starting points.
It is also possible to manually pause and resume the robot. That is, while the robot process is running, you can freeze its activity, saving its exact state, and later resume it exactly where it stopped. When paused, the robot finishes processing any URLs it is already working on, but will not process URLs waiting in the URL pool until it resumes. This is a more efficient way to create a short break in robot activity, as long as you don't need to shut the robot down entirely.
You can manually purge the robot's state information, causing it to start the next time from its starting points again. This is called a fresh start.
Controlling the Robot
The Compass Server robot can run as a continuous process, much like a server: whenever it is running, it can accept commands and act on them. The most important commands are those that provide the robot with the addresses of sites to visit to look for and describe resources, but there are also commands for such tasks as starting and stopping the robot, retrieving its current status, and so on.
As administrator, you have complete control over both what your robot does (configuring) and when it does it (controlling). After the initial setup, most of your robot operation will probably be automated, but you always have the option to control your robot manually.
The central location for controlling robots is the Robot Overview form. This is the default form when you choose the Robot button in the Server Manager.
About the Robot Overview
The Robot Overview has two main parts: a set of Control Buttons across the top and a Control Panel below.
Control Buttons: These provide access to the basic control functions for the robot. They let you quickly start, stop, pause, or restart your robot. Depending on the robot's current state, you'll see different buttons. For example, if the robot is not running, there will be a Start button. If it is already running, you will have buttons to stop or pause the robot.

Control Panel: This panel, below the Control Buttons, shows the relationship between the various parts of the robot. As the robot runs, the panel updates to show the current status of those parts. You can also use the diagram as a set of shortcuts to the administration forms for the different parts of the robot. Just click part of the diagram to open the appropriate form.

Using the Robot Overview form, you can perform the following tasks:
The Robot Overview form provides a useful central location for your work with the robot. From it, you can see the most useful status figures and get direct access to almost every other robot administration form.

Before you start the robot, determine whether you want a fresh start or a restart:
Starting the Robot Manually
Although many Compass Server installations will be automatic, running their robots on a periodic schedule, there might be times when you want to start the robot manually. Such times might include instances when you add a great number of resources to a web site (or entire new web sites to the database), rendering the current database out-of-date.
To start the robot manually, click Start.
The control panel updates every 30 seconds to show the current status of the robot. You can change this interval through the Administration Preferences on the Compass Settings form.
Pausing and Resuming the Robot
When the robot is running, you can pause it for a brief period. While paused, the robot process is still running, but it will not interact with other systems. That is, it will not further enumerate or generate until told to resume.
The paused robot consumes very few system resources, making it an appropriate short-term solution to CPU load problems on the Compass Server host system.
The pause and resume functions are available from the Robot Overview form.
Controlling Where the Robot Goes
The function of the Compass Server robot is to find resources and determine whether (and how) to add descriptions of those resources to the database. The determination of which servers to visit and what parts of those servers to index is called site definition.
Defining the sites for the Compass Server is one of the most important jobs of the server administrator. You need to be sure you send the robot to all the servers it needs to index, but you also need to exclude extraneous sites that can bloat the database and make it hard to find the correct information.
The site definition process is essentially one of filtering. That is, by applying rules set out by the administrator, the robot determines whether to include or exclude a particular resource. Before you run the robot for the first time, or after you change its site definitions or other parameters, you should check to ensure that it is locating the resources you want and only the ones you want.
Site definitions control whether the robot will actually visit a given server, based on its URL. You can allow or disallow visits to specific servers, groups of servers, or parts of servers. You can also determine whether to index a particular document within a site based on its content, such as its MIME type.
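A rough sketch of the idea, using hypothetical hostnames and domain suffixes (this is an illustration only, not the product's rule engine):

```python
# Illustrative sketch: accept or reject a URL based on its hostname,
# the way a site definition restricts the robot to specific servers
# or whole domains. The hosts and suffixes below are hypothetical.
from urllib.parse import urlparse

ALLOWED_HOSTS = {"docs.example.com"}          # individual servers
ALLOWED_DOMAIN_SUFFIXES = (".example.org",)   # whole domains

def site_allows(url):
    host = urlparse(url).hostname or ""
    return host in ALLOWED_HOSTS or host.endswith(ALLOWED_DOMAIN_SUFFIXES)

# site_allows("http://docs.example.com/guide.html")  -> True
# site_allows("http://www.other.com/index.html")     -> False
```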
These are the specific tasks involved in controlling where your robot goes:
Defining Sites
When the robot processes each URL in its link pool, it first passes the URL through a series of site definitions to determine whether the database should include resources from that site.
Site definitions are rules that tell the robot whether to extract links from resources and/or generate descriptions of resources from specific servers, groups of servers, or domains. You can use site definitions to limit the range of your robot by restricting which links it follows from the resources it discovers. A site definition has five parts:
These are the specific tasks involving site definitions:
Managing Site Definitions
The Compass Server maintains a list of all its site definitions. Through this list, you can manage the definitions of all sites, including editing their specific parameters, turning individual sites on or off, and creating and deleting site definitions.
Site definitions are explained in Defining Sites.
Each line on the Manage Sites form represents one of the possible site definitions the robot can use. For each site, you can do the following:
Using the buttons below the list of hostnames or domain suffixes, you can add another hostname or domain suffix. You can remove an existing hostname or domain suffix by clicking the trashcan icon next to its name.
The site definition editor automatically checks with a DNS server to ensure that it understands any aliases used by these sites. It also performs a check for virtual hosts on the server. These are both explained in more detail in Probing for Site Information.
You can perform any of the actions defined in the following table.
Be sure to set the final rule correctly. It should be a "catch all" that applies to all files that did not match any of the preceding rules.
Item | Description |
---|---|
Comment | This is a text field that describes the site to you. It is not used by the robot. |
To test your site definitions, do the following:
The simulator shows all the defined starting points.
The simulator indicates whether each URL would be accepted or rejected by the current site definitions, and lists the reasons.
Probing for Site Information
When adding new sites, it can be difficult to anticipate some of the difficulties that might arise, such as nonstandard DNS aliases, virtual servers, and server redirections. The Compass Server provides a tool that seeks out such information for you. In most cases, the Site Definition Editor takes these factors into account automatically, but the Site Probe enables you to override or augment these adjustments as needed.
To probe for site information, do the following:
Checking this option sends the output of the gethostbyname() DNS lookup directly to the form, so you can see the data on which the probe bases its checks.
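For reference, this is the kind of information such a lookup returns. The sketch below uses Python's standard socket module and a placeholder hostname; it is not the probe's own code.

```python
# Illustrative sketch: a gethostbyname-style lookup returning the
# canonical name, aliases, and addresses for a host.
import socket

canonical, aliases, addresses = socket.gethostbyname_ex("www.example.com")
print("canonical name:", canonical)
print("aliases:       ", aliases)
print("addresses:     ", addresses)
```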
This section contains several topics related to content types and filters:
For example, you might define a filter called "Source Code" that matches the file extensions .c, .cpp, and .h.
You would then apply the filter to various site definitions, having them exclude resources that match the filter called "Source Code."
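A minimal sketch of such a filter, treated purely as an illustration (the real filter is defined through the filter rules editor, not in code):

```python
# Illustrative sketch of the "Source Code" filter described above:
# exclude any resource whose path ends with a source-file extension.
SOURCE_CODE_EXTENSIONS = (".c", ".cpp", ".h")

def source_code_filter(url):
    """Return True if the URL should be excluded as source code."""
    return url.lower().endswith(SOURCE_CODE_EXTENSIONS)
```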
Managing Filters
To see a list of all the filters available to your robot, use the Manage Filters form.
The Manage Filters form shows all the defined filters and indicates which site definitions, if any, use each one. If the designation "New Site default" follows the name of the filter, it means the filter is enabled by default in newly created site definitions.
From the overview form, you can globally enable or disable any of the filters or delete them entirely. Disabling a filter means that all site definitions ignore that filter. Enabling the filter means that the site definitions that use the filter rule do apply it.
You can also create a new filter by clicking New Filter, or delete an existing filter by clicking the delete icon to its right.
Creating a New Filter
There are two ways to create a new filter:
To edit a filter, do the following:
These are the elements you can edit in the filter rules editor:
http://home.netscape.com:80/eng/server/compass/3.0/relnotes.html

The filter rule can match five different aspects of the URL in five different ways. These are the five parts:
And here are some examples of true matches:
Method | True match in this example |
---|---|
is | protocol is http |
contains | hostname contains netscape |
begins with | MIME type begins with text |
ends with | pathname ends with .html |
regular expression |
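The matching methods themselves are easy to picture in code. The sketch below is only an illustration of the five methods applied to a single value; it is not the product's rule engine.

```python
# Illustrative sketch: the five matching methods applied to one aspect
# of a URL (protocol, hostname, port, pathname, or MIME type).
import re

def matches(method, value, pattern):
    if method == "is":
        return value == pattern
    if method == "contains":
        return pattern in value
    if method == "begins with":
        return value.startswith(pattern)
    if method == "ends with":
        return value.endswith(pattern)
    if method == "regular expression":
        return re.search(pattern, value) is not None
    raise ValueError("unknown method: " + method)

# matches("is", "http", "http")                    -> True
# matches("ends with", "/relnotes.html", ".html")  -> True
```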
Adding New Conversion Types
As delivered, Netscape Compass Server provides the ability to convert a number of common file formats into HTML that the robot can use to generate a resource description for the document. Conversions from formats other than HTML are performed by filter programs provided by third party vendors or written by you.
Supported Conversions
These are the standard formats supported in this version.
Adding and Removing Conversions
If you want to index resources based on file formats other than those supported by the Compass Server's default conversions, you can add your own conversion programs (or supplement those already provided).
Note that if you create new conversions, you will probably also need to define new filters to tell the robot that these new document types are usable. Creating new filters is explained in Creating a New Filter.
Choosing Conversion Modules
The Indexing Settings form includes a list of all installed document convertors. You can choose which, if any, of these conversions you want the robot to use by checking items in the list. These and other indexing settings are explained in Setting Indexing Preferences.
Installing Conversion Modules
If you need to support file formats other than the standard ones, you can purchase additional conversion modules. Once you acquire a new module, place it in the Compass Server bin directory, located under the server root directory:

<NS_HOME>/bin/compass/bin/
Your robots automatically detect the additional conversions and use them as needed.
Similarly, you can remove existing modules if you no longer want your robots to perform a specific conversion.
Writing Custom Conversions
If you need to index resources stored in file formats not supported by the supplied conversion modules, you can write your own conversion routines and incorporate them into the robots through the robot plug-in API. This programming is outside the scope of this guide, but is explained in the Netscape Compass Server Programmer's Guide.
Scheduling the Robot
Most of the time, you will want to update your Compass Server database automatically, so you can set up your robot to start on a periodic schedule. For example, you might run your robot once a week or every night, depending on how current your information needs to be and how frequently documents change at your site.
If you need to update the database between the scheduled robot runs, you can either run the robot manually or submit resource descriptions manually.
From the Schedule Task form you can choose days and times to automatically start or stop the robot. The form indicates whether there is a currently active schedule. If there is, you can either change the schedule or deactivate it. If there is no active schedule, you can create one.
To create an active schedule, follow the directions in Scheduling Tasks.
To change the existing schedule, follow the directions for activating a schedule. When you activate the changed schedule, it will replace the previous one.
Tips on Robot Scheduling
Before scheduling your robot to run automatically, you should run it at least once manually, as described in Starting the Robot Manually. You should only start the robot automatically after you know it works properly.
Similarly, if your robot is going to run unattended for an extended period, such as when you are on vacation, you should ensure that it works correctly and that the contact information puts remote sites in touch with someone who can remedy any problems.
If you plan to run the robot continuously, we also suggest that you stop and restart it at least once per day. This gives the robot a chance to release resources and reinitialize itself, providing a kind of "sanity check" for the ongoing operation.
Using Completion Scripts
When the robot finishes processing all the URLs from all its sites, it can run a script (called a completion script) that you choose. For Unix systems, the script is a shell script. For Windows NT systems, it is a batch file.
The most common task you'll want to automate at robot completion is running the My Compass profiler, which updates user newsletters to reflect newly found resources.
Completion scripts all reside in the bin/compass/admin/bin directory under the server root directory. You choose which script will run in the Completion Actions settings.
Writing completion scripts is explained in the Netscape Compass Server Programmer's Guide.
Controlling What the Robot Generates
For each resource that passes through the robot's filters (as described in Controlling Where the Robot Goes and Controlling What the Robot Gets), the robot generates a resource description that it places in the Compass Server database. You can control the way the robot generates its resource descriptions in two ways:
The choices you make in setting up the generation of resource descriptions determine what users will see when they search the Compass Server database. For example, you can choose to index the full text of each document or only some fixed portion of the beginning of the document.
In addition, by creating effective classification rules, you can make it easier for users to locate what they want in the Compass Server by browsing in categories.
Setting Indexing Preferences
There are several common options you can set that control the way the robot generates resource descriptions. The Indexing Settings form allows you to set those options in one place.
You can set the following indexing preferences:
To set indexing preferences, do the following:
About Classification Rules
Classification rules are simpler than filter rules because they don't involve any flow-control decisions. All you need to do in classification rules is determine what criteria to use to assign specific categories to a resource as part of its resource description. A classification rule is a simple conditional statement, taking the form "if <some condition> is true, assign the resource to <a category>."
The following table explains the available elements for conditions.
As with site definitions, classification rules proceed in order. Because there is no acceptance or rejection, the robot tests all resources against all the rules. If a resource matches more than one rule, it is assigned to multiple categories, up to the maximum number defined in the Database Options settings for the server.
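A compact sketch of this behavior, with hypothetical rules and a placeholder for the Database Options limit (an illustration only, not the classification rules editor's implementation):

```python
# Illustrative sketch: every rule is tested, and a resource can be
# assigned to several categories up to a configured maximum.
MAX_CATEGORIES = 3   # stands in for the Database Options limit

RULES = [
    (lambda rd: "marketing" in rd["url"], "Marketing"),
    (lambda rd: rd["title"].lower().startswith("faq"), "Support"),
    (lambda rd: rd["url"].endswith(".pdf"), "Documents"),
]

def classify(resource_description):
    categories = [cat for test, cat in RULES if test(resource_description)]
    return categories[:MAX_CATEGORIES]
```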
Your goal should be to have each resource assigned to at least one category. In practice, that is probably not possible, but you should avoid having large numbers of unassigned resources.
Editing Classification Rules
The classification rules editor is a Java applet that runs in its own window. The window contains an editable list of classification rules and a group of fields for editing the selected rule.
To edit classification rules, do the following:
WARNING: Choosing another administration form while the Classification Rules Editor is open will close the editor without saving changes.
Checking Database Contents
There are a number of ways you can check to see that your Compass Server and its robot produce the results you want. Through the Server Manager, you can get information about the database of resource descriptions and read the log files produced by server processes. In addition, there are some maintenance and repair tasks you can schedule.
This section explains the following topics:
Part of the definition of every site is a list of one or more starting points for that site. Cumulatively, all the starting points for all the robot's sites make up the list of starting points for the robot.
Because these starting points are distributed across the definitions of different sites, it can be difficult for you to track the combined list. The Server Manager therefore provides a report of all the starting points in all the defined sites. You can access this list at any time, regardless of whether the robot is currently running.
The list of starting points includes links to the site definition that includes each starting point, allowing you easy access to the site definition editor so you can make any necessary changes.
Checking Log Files
Each process that runs in a Compass Server system creates a log file. You can view each log file separately. There are also log files dealing with access and security, which are explained in Managing Netscape Servers, along with a discussion of how to work with and analyze log files.
Log files are very useful tools for debugging your Compass Server setup, but they can also become very large. You can control the level of robot logging by changing parameters in the robot's Speed and Logfile Settings. You can also automate the process of archiving and deleting log files, as discussed in Archiving Log Files.
Checking Excluded URLs
The best way to make sure your robot is doing what you want is to look through the list of URLs it did not incorporate into the database. The Excluded URLs form provides a summary of all the URLs the robot rejected, grouped according to the reasons for rejection.
The Excluded URLs form analyzes the contents of the filter log file, grouping and sorting the URLs rejected by the robot to make it easier for you to determine whether the robot is systematically excluding documents you want to include.
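The grouping step can be pictured with a small sketch. The log line format assumed here ("reason", a tab, then the URL) is hypothetical; the actual layout of filter.log is defined by the Compass Server.

```python
# Illustrative sketch only: group rejected URLs by rejection reason.
# Assumes a hypothetical "<reason>\t<url>" line format.
from collections import defaultdict

def group_exclusions(log_path):
    by_reason = defaultdict(list)
    with open(log_path) as log:
        for line in log:
            reason, _, url = line.rstrip("\n").partition("\t")
            if url:
                by_reason[reason].append(url)
    return by_reason
```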
To view a report of excluded URLs, do the following:
When you choose one of the robot runs, the Server Manager generates a summary report, listing the reasons for exclusion and the number of resources excluded for each reason.
The Server Manager generates a report in the frame below the summary. At the top of the report is an explanation of the reason for the exclusions, followed by a list of the URLs.
Archiving Excluded URLs Reports
When you are no longer actively using Excluded URLs reports, you can archive them. Archiving the report moves the filter.log file that contains the report to a subdirectory called archive below the standard logs directory.
Archiving does not delete the files; if you want to remove them, you must do so manually.
To archive an Excluded URLs report, do the following:
All these robot parameters are written into the configuration file process.conf. You can edit the file directly with a text editor if you want, but the Server Manager provides a complete interface.
Speed
The robot's speed settings control the load on your processor, the load on your network connections, and the burden placed on sites the robot indexes. You should configure these settings so that you optimize the use of your own resources without placing undue strain on other systems. You can also vary the settings depending on whether the robot runs at times when users will be querying your system heavily.
Completion Actions
When the robot finishes processing all its sites and all URLs it discovers at those sites, it reaches a state called completion. You can configure several options for what the robot should do at that point.
By default, the robot goes idle after completing processing, without running any script commands. You can change either of those options, as described in the following table.
Logfile Settings
You can control the amount of information the robot writes to its log files through the Logfile settings. Keep in mind that the robot's log files can become quite large, so you should only generate the information you will actually use, and you should rotate the files as often as practical.
These are the log file settings:
As a rule, you should only have the robot log as much data as you will need. At the highest level of detail, the robot log can become huge in a very short period. Once you have your site definitions and filters working to your satisfaction, you should reduce the level of logging to reduce the amount of disk space required for the robot.
Standards Compliance
There are a number of Internet standards governing the behavior of robots. If you send your robot outside your local network, you should ensure that your robot obeys these protocols. However, when indexing an internal network, you might find it useful to bypass some of the external protections.
These are the standards-compliance settings:
Authentication Parameters
Authentication enables a site to restrict access by user name and password. You can set up your robots to provide a user name and password to the web servers they visit. This enables the robot to index resources at sites that require authentication by "impersonating" an authorized user.
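In effect, the robot does the HTTP equivalent of the following sketch, which uses Python's urllib with placeholder credentials; it is not the robot's own HTTP code.

```python
# Illustrative sketch: fetching a protected URL by supplying a user
# name and password with the request (HTTP Basic authentication).
import urllib.request

def fetch_with_auth(url, username, password):
    manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    manager.add_password(None, url, username, password)
    opener = urllib.request.build_opener(
        urllib.request.HTTPBasicAuthHandler(manager))
    return opener.open(url).read()
```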
These are the authentication parameters for the robot:
Proxying
Proxy servers are used to allow applications inside a security firewall to access resources outside the firewall. Most Compass Server installations will not need to use proxies. If you are running your robot on an internal network behind a firewall, you need to get the names and associated port numbers for the proxy server for each network service from your system administrator.
If possible, you should try to set up the robot so that it doesn't need to run through proxy servers. Running a robot through a proxy server can degrade the proxy service by filling its cache with too many items.
Advanced Settings
You can use the advanced settings to fine-tune the performance and behavior of your Compass Server robot.
These are the advanced setting parameters:
Troubleshooting the Robot
When you run the robot, especially for the first time after making significant changes, it is important that you be able to track down and fix problems that might occur. This section details the most common kinds of robot problems, how to find them, and how to deal with them.
The first place to look for information is almost always the Excluded URLs report (Reports|Excluded URLs).
These are the most common robot problems covered here:
To access detailed robot status information, do the following:
Getting Summary Status
You can generate a summary report that includes much of the important information from the various detailed status reports, but with more detail than the terse summary on the Robot Overview form.
To generate an executive summary of the current robot run, do the following:
If there is already a report, click Refresh to update it.
No Sites Defined
When you first install the Compass Server, you have the opportunity to specify one or more sites you want the robot to index. If you did not specify any sites, you need to do so before running the robot. This is explained in Creating Site Definitions.
You can tell whether you have sites defined by opening the Robot Overview form (choose Robot|Overview). The first piece of information to the right of the system diagram indicates the number of starting points you have defined for the robot. It is also possible that you have sites defined, but the starting points for the sites contain typographical errors or invalid URLs. Generating a list of all your starting points is explained in Checking Starting Points.
Server or Network Errors
If you have defined sites for the robot to index, but running the robot still produces no resource descriptions in the database, you should check to make sure there are no server or network errors. The most likely place for these to appear is in the error log, but you can also find information in the Excluded URLs report.
The most likely problems in this area are that the server was unreachable (due to being down), the address given for the server was incorrect (typographical error, perhaps), or the server denied access to the resource (a password might be required).
To narrow down the problem, try using the ping utility to see whether the server is reachable at all. If the server is running, try to access the URL with your web client.
DNS Aliasing Errors
A common indexing problem in the complex world of network addressing is that resources often aren't where they appear to be. In particular, many web servers use DNS aliases so that they appear to the world as, say, www.netscape.com, when in fact the server's name is something else.
Very busy web sites often have a number of servers that alternately respond to a common address. In the most common case, a group of servers all answer to the name www.somedomain.com, when their real names are www1.somedomain.com, www2.somedomain.com, and so on. The Compass Server robot has a built-in mechanism to handle this common case, called Smart Host Heuristics. If you encounter problems with sites that alias a number of servers to the same hostname, make sure your robot has Smart Host Heuristics turned on, as explained in Advanced Settings.
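If you want to check the aliasing situation yourself, a quick sketch like the following (with hypothetical hostnames) shows whether several names resolve to overlapping addresses:

```python
# Illustrative sketch: compare the addresses several hostnames resolve
# to, to see whether they are really the same machine or server pool.
import socket

def same_host(*hostnames):
    address_sets = [set(socket.gethostbyname_ex(h)[2]) for h in hostnames]
    return bool(set.intersection(*address_sets))

# same_host("www.somedomain.com", "www1.somedomain.com", "www2.somedomain.com")
```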
The Site Definition Editor form includes a button labeled "Verify Names" that checks entered names against a DNS server.
Server Redirection Errors
A problem that often looks similar to DNS Aliasing Errors is caused by server redirection. That is, clients access a certain URL, but the server for that URL redirects the request to either another part of the server or to another server entirely.
Problems can arise when such a redirection sends the robot to a host that is not defined as being part of one of its sites. You can often solve this by either defining an additional site that includes the redirection site or by indexing the redirection site instead of the original site.
One source of help on redirection is the site probe, described in Probing for Site Information. You can also use the robot rules simulator to check for redirection problems. The simulator is explained in Testing Site Definitions.
Too Many or Too Few Resources
When you first define a site, you may find that the robot generates either many more or many fewer resource descriptions than you expected. You can usually solve this with some fine adjustments.
NOTE: This does not include the special case of returning no resource descriptions, which is covered in the preceding section.
Exclusion by robots.txt:
Some sites restrict robot access to their servers based on the user agent of the robot. This is done through a file called robots.txt. The robots.txt protocol is explained in Controlling Robot Access.
In general, it is best to observe the wishes of the site manager and stay out of restricted areas. In some cases, however, it might be appropriate to circumvent the robots.txt exclusion, such as when indexing a private internal site.
To get around a robots.txt exclusion, you can either change the user agent for the robot or tell the robot not to check robots.txt at all. Both of these are explained in Standards Compliance.
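The role of the user agent in this check can be seen in a short sketch using Python's standard robot-exclusion parser; the user agent string and URLs below are hypothetical.

```python
# Illustrative sketch: a robots.txt check evaluated for a particular
# user agent, the same kind of test a well-behaved robot performs.
import urllib.robotparser

parser = urllib.robotparser.RobotFileParser()
parser.set_url("http://www.example.com/robots.txt")
parser.read()

allowed = parser.can_fetch("CompassRobot/3.0", "http://www.example.com/private/")
print("allowed to fetch:", allowed)
```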
Resources Not Assigned to Categories
If your resources are not being categorized, you probably need to refine your classification rules, which assign resources to categories. For an immediate workaround, you can use the RD Editor to assign categories for resources otherwise unclassified. After you adjust your classification rules, subsequent robot runs should classify resources correctly.
New Sites Added
Probably the most common reason for resources not being classified is that the site was added after the rules were defined, so no rule applies to those particular resources. For example, if you classify resources based on the name of the server they come from, a newly added server will probably not have a corresponding rule. Adding the appropriate rule for the new site should cause the robot to classify the resources from the server correctly in the future.
Incorrect Aliases
Another cause for resources not having categories assigned is DNS aliasing. If you write a rule for a server using an alias name, you need to make sure the robot uses that same alias for the resources from that server.
You can check which alias the robot uses for a server by using the robot rules simulator, described in Testing Site Definitions.
Performance Problems
There are a number of problems that fall into the category of performance problems. In general, they involve the robot running either too slowly or too quickly.
The determination of what constitutes a performance problem is largely subjective. You need to decide what operating parameters are acceptable for your system.
Slow Robot
A robot that runs too slowly can be an annoyance. In general, it means that the robot does not generate resource descriptions quickly enough to keep the database up-to-date for your users. You can gauge the speed of your robot as it runs by tracking the rates shown on the Robot Overview form.
The following table suggests some common approaches to speeding up the Compass Server robot.
Fast Robot
A robot that runs quickly is not a problem in itself, unless it takes up so many system resources that it gets in the way of user searches. More often, a robot can cause trouble by sending too many "hits" to a single server in a short period of time.
Most often, this kind of problem comes to light when a server administrator sees a large number of hits in an access log from the robot, which includes an email address in its User Agent string. The administrator can then contact you to ask you to hold back your robot, in which case you can increase the Server Delay parameter. The administrator might also exclude your robot from any access, using the robots.txt exclusion file explained in Controlling Robot Access.