Compass Server 3.0 Developer's Guide


Chapter 9
Defining Parameters in Process.conf

The file process.conf defines many options for the robot, including which of the filters from filter.conf it uses. (For backward compatibility with the Catalog Server, process.conf can also contain the starting points.)

In general, you do not need to edit the process.conf file directly. You can set most parameters by using the interactive options in the Robot page in the Compass Server Administration Interface.

However, advanced users may want to edit this file manually to set parameters that cannot be set through the interface.

This chapter lists the parameters that you can set in process.conf.

User-Modifiable Parameters in Process.conf

auto-proxy

auto-proxy="http://punk.mcom.com:80/"
This is the proxy setting for the robot. It can be a Netscape proxy server or a JavaScript file for automatically configuring the proxy. For more information, see:

http://home.netscape.com/eng/mozilla/2.0/relnotes/demo/proxy-live.html

bindir

bindir=path 
If you specify bindir, the robot adds it to the PATH environment variable. This gives the robot an extra search path for external programs that it runs, such as those specified by the cmd-hook parameter.

cmd-hook

cmd-hook="command-string" 
There is no default.

This specifies an external completion script to run after the robot completes one run. This must be a full path to the command name. The robot will execute this script from the compass-name/ directory.

There must be at least one RD registered for the command to run.

See How To Write Completion Scripts for information about writing completion scripts.

command-port

command-port=port_number 
The port on which the robot listens for commands from other programs, such as the Administration Interface or robot control panels.

For security reasons, the robot can accept commands only from the local host unless remote-access is set to yes.

connect-timeout

connect-timeout=seconds 
The default is 120.

This is the maximum time to allow for a network to respond to a connection request.

convert-timeout

convert-timeout=seconds 
The default is 600 seconds.

This is the maximum time to allow for document conversion.

depth

depth=integer
The default is 20. This is the number of links from the starting point (seed URLs) that the robot should examine. This parameter sets the default value for seed URLs that do not specify a depth.

A value of negative one (depth=-1) means the link depth is infinite.

email

email=user@hostname 
The default is user@domain.

This is the email address of the person who runs the robot.

This is sent along with user-agent in the HTTP request header, so that web managers can contact the people who run robots at their sites.

enable-ip

enable-ip=[true | yes | false | no] 
The default is true.

Generates an IP address for the URL in each RD that is created.

enable-rdm-probe

enable-rdm-probe=[true | false | yes | no] 
The default is yes.

This determines whether or not the robot queries each server it encounters to find out if the server supports RDM. If the server supports RDM, the robot will not attempt to enumerate the server's resources, as that server can act as its own resource description server.

enable-robots-txt

enable-robots-txt=[true | false | yes | no] 
The default is yes.

This variable determines whether or not the robot checks the robots.txt file (if available) at each site it visits.

engine-concurrent

engine-concurrent=[1..100] 
The default is 10.

The number of pre-created threads for the robot to use.

This parameter cannot be set interactively through the Compass Server Administration Interface.

enumeration-filter

enumeration-filter=enumfiltername
The default is enumeration-default.

This specifies the enumeration filter that the robot uses to determine whether or not to enumerate a resource. The value must be the name of a filter defined in the file filter.conf.

This parameter cannot be set interactively through the Compass Server Administration Interface.

generation-filter

generation-filter=genfiltername
The default is generation-default.

This specifies the generation filter that the robot uses to determine whether or not to generate a resource description for a resource. The value must be the name of a filter defined in the file filter.conf.

This parameter cannot be set interactively through the Compass Server Administration Interface.

index-after-ngenerated

index-after-ngenerated=30
This tells the robot how many minutes to collect RDs for before batching them to the Compass Server.

If you do not specify this parameter, it is set to 256 minutes.

loglevel

loglevel=[0...100] 
The default value is 2.

The values mean:

Level 0: log nothing but serious errors
Level 1: also log generate and enumerate activity
Level 2: also log retrieval activity (DEFAULT)
Level 3: also log filtering activity
Level 4: also log spawning activity
Level 5: also log retrieval progress

max-concurrent

max-concurrent=[1..100] 
The default is 8.

The maximum number of concurrent retrievals that a robot can make.

max-filesize-kb

max-filesize-kb=1024 
The maximum file size in kilobytes for files retrieved by the robot.

max-memory-per-url / max-memory

max-memory-per-url=n_bytes 
The default is 1.

The maximum memory, in bytes, used by each URL. If the URL needs more memory, the RD is saved to disk.

This parameter cannot be set interactively through the Compass Server Administration Interface.

max-working

max-working=1024 
The size of the robot working set, which is the maximum number of URLs the robot can work on at one time.

This parameter cannot be set interactively through the Compass Server Administration Interface.

onCompletion

onCompletion=[idle | loop | quit]
The default is idle.

This determines what the robot does after it has completed a run. The robot can either go into idle mode, loop back and start again, or quit.

This parameter works with the cmd-hook parameter. When the robot finishes a run, it runs the cmd-hook program and then takes the onCompletion action, as the log samples in "Monitoring the cmdHook Execution" show.

password

password=string 
The default is netscape@.

This is used for httpd authentication and ftp connection.

referer

referer=string 
If set, this string is sent in the HTTP request to identify the robot as the referer when accessing web pages.

register-user and register-password

register-user=string 
register-password=string 
These are the username and password used for registering RDs to the Compass Server database.

These parameters cannot be set interactively through the Compass Server Administration Interface.

remote-access

remote-access=[true | false | yes | no] 
The default is no.

This determines whether or not the robot can accept commands from remote hosts.

robot-statedir

robot-statedir="/newport/robot/state"
This specifies the directory where the robot saves its state. The robot uses this as a working directory for saving its internal state, including recording how many RDs have been collected, and so on.

robots-txt-refresh-rate

robots-txt-refresh-rate=seconds 
The default is 3600*24 (86400 seconds, or one day).

This is the number of seconds the robot must wait before reading a robots.txt file again.

This parameter cannot be set interactively through the Compass Server Administration Interface.

schema-name

schema-name=schema 
The default value is DOCUMENT.

This is the name for a schema.

This parameter cannot be set interactively through the Compass Server Administration Interface.

server-delay

server-delay=delay_in_seconds 
The time period between two visits to the same web site. This prevents the robot from hitting the same site too frequently.

smart-host-heuristics

smart-host-heuristics=[true | false] 
The default is true.

This enables the robot to recognize sites that rotate their DNS canonical hostnames. For example, it converts www123.netscape.com to www.netscape.com.

tmpdir

tmpdir=path
Specifies a place for the robot to create temporary files.

Use this value to set the environment variable TMPDIR (and TMP for NT).

user-agent

user-agent=Netscape-Compass-Robot/3.0
This is sent with the email address in the HTTP request to the server.

username

username=string
The default is anonymous.

This is the username of the user who runs the robot. This is used for httpd authentication and ftp connection.

Sample process.conf File

Here is a sample process.conf file. The parameters that are commented out in this case use the default values shown. The first parameter, csid, indicates the Compass Server instance that uses this file. Do not change the value of the csid parameter.

Note that this sample file also includes some parameters used by the Compass Server Administration Interface that you should not modify.

<Process csid="x-catalog://boots.nikki.com:80/jack" \
   auto-proxy="http://punk.mcom.com:80/"
   auto_serv="http://punk.mcom.com:80/"
   command-port=21445
   convert-timeout=600
   depth="-1"
   # email="user@domain"
   enable-ip=true
   enumeration-filter="enumeration-default"
   generation-filter="generation-default"
   index-after-ngenerated=30
   loglevel=2
   max-concurrent=8
   onCompletion=idle
   password=boots
   proxy-loc=server
   proxy-type=auto
   robot-state-dir="/export/home/suitespot/compass-newport/robot/state"
   server-delay=1
   smart-host-heuristics=true
   tmpdir="/export/home/suitespot/compass-newport/tmp"
   user-agent="Netscape-Compass-Robot/3.0"
   username=nikki
</Process>
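
When editing the file by hand, it can help to confirm which value a parameter currently has, and whether it is commented out. The following shell sketch assumes the one-parameter-per-line layout of the sample above. The function name conf_value and the scratch file are made up for illustration; neither is part of the Compass Server.

```shell
#!/bin/sh
# conf_value FILE NAME: print the value of NAME from a process.conf-style
# file. Hypothetical helper; it assumes one name=value pair per line and
# ignores lines where the parameter is commented out with '#'.
conf_value() {
    sed -n -e 's/[[:space:]]*$//' \
        -e "s/^[[:space:]]*$2=\"\{0,1\}\([^\"]*\)\"\{0,1\}\$/\1/p" "$1"
}

# Try it on a few lines copied from the sample file.
conf=$(mktemp)
cat > "$conf" <<'EOF'
<Process csid="x-catalog://boots.nikki.com:80/jack" \
   depth="-1"
   # email="user@domain"
   loglevel=2
   max-concurrent=8
</Process>
EOF

conf_value "$conf" depth      # prints -1
conf_value "$conf" loglevel   # prints 2
conf_value "$conf" email      # prints nothing: the line is commented out
```

An empty result means the parameter is absent or commented out, in which case the robot falls back to the default listed earlier in this chapter.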

How To Write Completion Scripts

The cmdHook parameter specifies a program to execute after the robot completes one run.

The cmdHook is provided as a way for you to extend the shutdown phase of the robot. Perhaps you want the robot to send email when it's done, start another process, analyze its own log files and write a small report, and so on.

When the robot is 'done' with one run (that is: it has no more entries in the enumeration-pool and has finished all outstanding processing) it will call the executable specified in the cmdHook setting. If the onCompletion parameter is set to idle or quit, the script is called once before the robot shuts down or goes idle. If the parameter is set to loop, the script is called each time the robot restarts. See the log samples in "Monitoring the cmdHook Execution".

The cmdHook script can be written in any language (as a Perl or shell script, a C program, and so on). If you choose to use a C program, you'll have to insert the cmdHook parameter in the process.conf file manually, as the Compass Server Administration Interface does not scan binary executables. See the notes in "Preparing Your Completion Script to Appear in the Administration Interface."

The cmdHook script is run from the robot's execution environment. This means your script will inherit any environment variables set by the robot and will not have access to environment variables that might be set by your Compass Server or the admin server. The cmdHook script will be executed from compass-installdir/compass-name/ instead of bin/compass/admin/bin. This is a common source of errors and is important to keep in mind if you are using any relative directory references (for example, perl is located in "../install/perl" instead of "../../../../install/perl", and so on).

There are two UNIX-only examples provided in bin/compass/admin/bin. They are a simple 'touch' test (cmdHook0) and an email upon completion script (cmdHook1).
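
As a rough illustration of the kind of script described in this section, a completion script could simply record when it ran and from which directory. This is a sketch, not the shipped cmdHook0 or cmdHook1, whose contents are not reproduced in this guide. The HOOK_OUT variable and the output file name are assumptions for the example; the robot does not set them.

```shell
#!/bin/sh
# Hypothetical completion script. HOOK_OUT is an illustrative variable,
# not something the robot provides; by default the record goes to the
# temporary directory.
HOOK_OUT="${HOOK_OUT:-${TMPDIR:-/tmp}/cmdhook-last-run.txt}"

# Record the time and the working directory the script was started from.
# Logging the working directory makes relative-path mistakes easy to spot.
{
    echo "completed: $(date)"
    echo "cwd: $(pwd)"
} > "$HOOK_OUT"
```

Writing to an absolute path, as here, sidesteps the relative-directory problem described above.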

Monitoring the cmdHook Execution

At the default logging level, a log message is written in robot.log when the cmdHook is run.

For example, if the onCompletion parameter is set to idle, the robot.log output would look like:

[12:45:57]  Run cmd: /usr/suitespot/compass-test/bin/compass/admin/bin/cmdHook1 
[12:45:58] Complete cmd: /usr/suitespot/compass-test/bin/compass/admin/bin/cmdHook1
[12:45:58] Robot is idle....

. . . if the onCompletion parameter is set to quit, the robot.log output would look like:

[12:45:57]  Run cmd: /usr/suitespot/compass-test/bin/compass/admin/bin/cmdHook1 
[12:45:58] Complete cmd: /usr/suitespot/compass-test/bin/compass/admin/bin/cmdHook1
[12:46:33] Workload complete.

. . . if the onCompletion parameter is set to loop, the robot.log output would look like:

[12:52:04]  Run cmd: /usr/suitespot/compass-test/bin/compass/admin/bin/cmdHook1 
[12:52:05] Complete cmd: /usr/suitespot/compass-test/bin/compass/admin/bin/cmdHook1
[12:52:05] Restart Robot.

. . . if the onCompletion parameter is set to loop, there will be an additional entry in filter.log like this:

[12:54:41] Filter log started - loop 15 
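
To check from a shell whether every hook run also finished, one option is to count the "Run cmd:" and "Complete cmd:" lines. The sketch below works on a scratch copy of the idle sample above. The real location of robot.log depends on your installation and is not assumed here.

```shell
#!/bin/sh
# Scratch copy of the sample log lines shown above for onCompletion=idle.
log=$(mktemp)
cat > "$log" <<'EOF'
[12:45:57]  Run cmd: /usr/suitespot/compass-test/bin/compass/admin/bin/cmdHook1
[12:45:58] Complete cmd: /usr/suitespot/compass-test/bin/compass/admin/bin/cmdHook1
[12:45:58] Robot is idle....
EOF

# Count hook starts and completions. A mismatch suggests a hook that
# started but has not finished.
started=$(grep -c 'Run cmd:' "$log")
finished=$(grep -c 'Complete cmd:' "$log")
echo "started=$started finished=$finished"
```

For the sample lines this prints started=1 finished=1.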

Preparing Your Completion Script to Appear in the Administration Interface

If you want your cmdHook script to show up as an option in the Compass Server Administration Interface on the "Robot -> Crawling Settings" page, you should do the following:

  1. Write your script in a run-time evaluated language. That is, don't use C or another compiled language. The Compass Server Administration Interface does not scan binary files when it looks for cmdHook scripts.

  2. Place the file in bin/compass/admin/bin.

  3. Name the file cmdHook, followed by an alphanumeric character string, for example, cmdHook0, cmdHookAlpha, or cmdHook12a.

  4. Place a description string in the file using the following format: #description="My menu choice string"

    The description assignment should be placed in a comment so it doesn't affect the execution of the script.

    Since many scripts are platform specific, if the description contains the string "(non NT)", the Compass Server Administration Interface will not list that script as an option on NT platforms. This is not particularly useful for a specific instance of a Compass Server. However, if your script will be redistributed across a wide variety of platforms it is something to keep in mind.
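
Putting the naming and description rules together, a script file might look like the following sketch. The file name cmdHookStamp, the menu text, and the STAMP_FILE variable are made up for illustration. Because this is a shell script, its description includes "(non NT)".

```shell
#!/bin/sh
#description="Record completion time (non NT)"
# The description line above is a shell comment, so it has no effect on
# execution. Hypothetical script, assumed to be saved as
# bin/compass/admin/bin/cmdHookStamp.
STAMP_FILE="${STAMP_FILE:-${TMPDIR:-/tmp}/robot-complete.stamp}"

# Overwrite the stamp file with the current date and time.
date > "$STAMP_FILE"
```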



Last Updated: 02/07/98 20:49:10

Any sample code included above is provided for your use on an "AS IS" basis, under the Netscape License Agreement - Terms of Use