The process.conf file defines many options for the robot, including which of the filters from filter.conf to use. (For backward compatibility with the Catalog Server, process.conf can also contain the starting points.)
In general, you do not need to edit the process.conf file directly. You can set most parameters by using the interactive options in the Robot page in the Compass Server Administration Interface.
However, advanced users may want to edit this file manually to set parameters that cannot be set through the interface.
This chapter lists the parameters that you can set in process.conf.
auto-proxy
auto-proxy="http://punk.mcom.com:80/"
This is the proxy setting for the robot. It can be a Netscape proxy server or a JavaScript file for automatically configuring the proxy. For more information, see:
http://home.netscape.com/eng/mozilla/2.0/relnotes/demo/proxy-live.html
bindir
bindir=path
If you specify bindir, the robot adds it to the PATH environment variable. This gives the robot an extra search path for running external programs, such as those specified by the cmd-hook parameter.
cmd-hook
cmd-hook="command-string"
There is no default. This specifies an external completion script to run after the robot completes one run. This must be the full path to the command. The robot executes this script from the compass-name/ directory.
There must be at least one RD registered for the command to run.
See How To Write Completion Scripts for information about writing completion scripts.
command-port
command-port=port_number
The socket on which the robot listens for commands from other programs, such as the Administration Interface or robot control panels.
For security reasons, the robot accepts commands only from the local host unless remote-access is set to yes.
connect-timeout
connect-timeout=seconds
The default is 120.
This is the maximum time to allow for a network to respond to a connection request.
convert-timeout
convert-timeout=seconds
The default is 600 seconds.
This is the maximum time to allow for document conversion.
depth
depth=integer
The default is 20. This is the number of links from the starting point (seed URLs) that the robot should examine. This parameter sets the default value for seed URLs that do not specify a depth.
A value of negative one (depth=-1) means the link depth is infinite.
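As a rough illustration of how a depth limit of this kind bounds a crawl, the following Python sketch walks a toy link graph breadth-first. It is not the robot's implementation: the function name crawl_order, the LINKS dictionary, and the convention that the seed itself sits at depth 0 are all assumptions made for this example.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": [],
    "c": ["d"],
    "d": [],
}

def crawl_order(seed, depth, links=LINKS):
    """Return pages reachable from seed within `depth` links (-1 = no limit)."""
    seen = {seed}
    order = []
    queue = deque([(seed, 0)])  # (page, number of links followed from the seed)
    while queue:
        page, hops = queue.popleft()
        order.append(page)
        if depth != -1 and hops >= depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, hops + 1))
    return order
```

With this toy graph, a depth of 1 visits only the seed and the pages it links to directly, while a depth of -1 visits every reachable page.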
email
email=user@hostname
The default is user@domain.
This is the email address of the person who runs the robot.
This is sent along with user-agent in the HTTP request header, so that web managers can contact the people who run robots at their sites.
enable-ip
enable-ip=[true | yes | false | no]
The default is true.
Generates an IP address for the URL in each RD that is created.
enable-rdm-probe
enable-rdm-probe=[true | false | yes | no]
The default is yes.
This determines whether or not the robot queries each server it encounters to find out if the server supports RDM. If the server supports RDM, the robot will not attempt to enumerate the server's resources, as that server can act as its own resource description server.
enable-robots-txt
enable-robots-txt=[true | false | yes | no]
The default is yes.
This variable determines whether or not the robot checks the robots.txt file (if available) at each site it visits.
engine-concurrent
engine-concurrent=[1..100]
The default is 10.
The number of pre-created threads for the robot to use.
This parameter cannot be set interactively through the Compass Server Administration Interface.
enumeration-filter
enumeration-filter=enumfiltername
The default is enumeration-default.
This specifies the enumeration filter that the robot uses to determine whether or not to enumerate a resource. The value must be the name of a filter defined in the file filter.conf.
This parameter cannot be set interactively through the Compass Server Administration Interface.
generation-filter
generation-filter=genfiltername
The default is generation-default.
This specifies the generation filter that the robot uses to determine whether or not to generate a resource description for a resource. The value must be the name of a filter defined in the file filter.conf.
This parameter cannot be set interactively through the Compass Server Administration Interface.
index-after-ngenerated
index-after-ngenerated=30
This tells the robot how many minutes to collect RDs for before batching them to the Compass Server.
If you do not specify this parameter, it is set to 256 minutes.
loglevel
loglevel=[0...100]
The default value is 2.
The values mean:
Level 0: log nothing but serious errors
Level 1: also log generate and enumerate activity
Level 2: also log retrieval activity (DEFAULT)
Level 3: also log filtering activity
Level 4: also log spawning activity
Level 5: also log retrieval progress
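The wording "also log" suggests that each level adds to the ones below it. The Python sketch below spells out that reading. The names LEVEL_ADDS and logged_at are hypothetical, and treating values above 5 the same as level 5 is an assumption, since the list above describes only levels 0 through 5.

```python
# Activities added at each level, copied from the list above.
LEVEL_ADDS = [
    "serious errors",                   # level 0
    "generate and enumerate activity",  # level 1
    "retrieval activity",               # level 2 (default)
    "filtering activity",               # level 3
    "spawning activity",                # level 4
    "retrieval progress",               # level 5
]

def logged_at(level):
    """Return everything logged at `level`, assuming levels are cumulative."""
    # Assumption: levels above 5 add nothing further; negative values act like 0.
    return LEVEL_ADDS[: max(0, level) + 1]
```

Under this reading, the default level of 2 logs serious errors, generate and enumerate activity, and retrieval activity.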
max-concurrent
max-concurrent=[1..100]
The default is 8.
The maximum number of concurrent retrievals that a robot can make.
max-filesize-kb
max-filesize-kb=1024
The maximum file size in kilobytes for files retrieved by the robot.
max-memory-per-url / max-memory
max-memory-per-url=n_bytes
The default is 1.
Maximum memory in bytes used by each URL. If the URL needs more memory, the RD is saved to disk.
This parameter cannot be set interactively through the Compass Server Administration Interface.
max-working
max-working=1024
The size of the robot working set, which is the maximum number of URLs the robot can work on at one time.
This parameter cannot be set interactively through the Compass Server Administration Interface.
onCompletion
onCompletion=[idle | loop | quit]
The default is idle.
This determines what the robot does after it has completed a run. The robot can either go into idle mode, loop back and start again, or quit.
This parameter works with the cmd-hook parameter. When the robot is done, it runs the cmd-hook program and then performs the onCompletion action.
password
password=string
The default is netscape@.
This is used for HTTP authentication and FTP connections.
referer
referer=string
If this parameter is set, its value is sent in the HTTP request to identify the robot as the referer when accessing web pages.
register-user and register-password
register-user=string
register-password=string
These are the username and password used for registering RDs with the Compass Server database.
These parameters cannot be set interactively through the Compass Server Administration Interface.
remote-access
remote-access=[true | false | yes | no]
The default is no.
This determines whether or not the robot can accept commands from remote hosts.
robot-statedir
robot-statedir="/newport/robot/state"
This specifies the directory where the robot saves its state. The robot uses this as a working directory for saving its internal state, including recording how many RDs have been collected, and so on.
robots-txt-refresh-rate
robots-txt-refresh-rate=seconds
The default is 3600*24 (86400 seconds, or one day).
This is the number of seconds the robot must wait before reading a robots.txt file again.
This parameter cannot be set interactively through the Compass Server Administration Interface.
schema-name
schema-name=schema
The default value is DOCUMENT.
This is the name of the schema.
This parameter cannot be set interactively through the Compass Server Administration Interface.
server-delay
server-delay=delay_in_seconds
The time period between two visits to the same web site. This prevents the robot from hitting the same site too frequently.
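A per-site delay of this kind can be pictured as a small bookkeeping step before each retrieval. The Python sketch below is illustrative only. The function names and the last_visit dictionary are hypothetical, and treating "the same web site" as "the same hostname" is an assumption.

```python
from urllib.parse import urlsplit

def seconds_to_wait(url, now, last_visit, server_delay):
    """Return how long to wait before fetching url so visits to one host
    are at least server_delay seconds apart."""
    host = urlsplit(url).hostname  # assumption: a "site" is identified by hostname
    previous = last_visit.get(host)
    if previous is None:
        return 0.0  # first visit to this host, no need to wait
    return max(0.0, previous + server_delay - now)

def record_visit(url, now, last_visit):
    """Remember when the host of url was last visited."""
    last_visit[urlsplit(url).hostname] = now
```

A caller would sleep for the returned number of seconds, fetch the URL, and then call record_visit with the current time.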
smart-host-heuristics
smart-host-heuristics=[true | false]
The default is true.
This enables the robot to recognize sites that rotate their DNS canonical hostnames. For example, it converts www123.netscape.com to www.netscape.com.
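The robot's actual heuristic is not documented here. As a guess at the simplest rule that reproduces the www123.netscape.com example, the Python sketch below strips trailing digits from the first label of a hostname. The name canonical_host and the rule itself are assumptions.

```python
import re

# A first label made of letters followed by digits, such as "www123".
_NUMBERED_LABEL = re.compile(r"^([a-z]+)\d+$")

def canonical_host(host):
    """Drop trailing digits from the first label of a multi-label hostname."""
    first, dot, rest = host.lower().partition(".")
    match = _NUMBERED_LABEL.match(first)
    if match and rest:
        first = match.group(1)  # keep only the alphabetic part of the label
    return first + dot + rest
```

Hostnames whose first label carries no trailing digits pass through unchanged.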
tmpdir
tmpdir=path
Specifies a place for the robot to create temporary files.
The robot uses this value to set the environment variable TMPDIR (and TMP on NT).
user-agent
user-agent=Netscape-Compass-Robot/3.0
This is sent with the email address in the HTTP request to the server.
username
username=string
The default is anonymous.
This is the username of the user who runs the robot. This is used for HTTP authentication and FTP connections.
Sample process.conf File
Note that this sample file includes some parameters used by the Compass Server Administration Interface that you should not modify.
Here is a sample process.conf file. The parameters that are commented out in this case use the default values shown. The first parameter, csid, indicates the Compass Server instance that uses this file. Do not change the value of the csid parameter.
<Process csid="x-catalog://boots.nikki.com:80/jack" \
auto-proxy="http://punk.mcom.com:80/"
auto_serv="http://punk.mcom.com:80/"
command-port=21445
convert-timeout=600
depth="-1"
# email="user@domain"
enable-ip=true
enumeration-filter="enumeration-default"
generation-filter="generation-default"
index-after-ngenerated=30
loglevel=2
max-concurrent=8
onCompletion=idle
password=boots
proxy-loc=server
proxy-type=auto
robot-state-dir="/export/home/suitespot/compass-newport/robot/state"
server-delay=1
smart-host-heuristics=true
tmpdir="/export/home/suitespot/compass-newport/tmp"
user-agent="Netscape-Compass-Robot/3.0"
username=nikki
</Process>
How To Write Completion Scripts
The cmdHook parameter specifies a program to execute after the robot completes one run.
The cmdHook is provided as a way for you to extend the shutdown phase of the robot. Perhaps you want the robot to send email when it's done, start another process, analyze its own log files and write a small report, and so on.
When the robot is 'done' with one run (that is, it has no more entries in the enumeration-pool and has finished all outstanding processing), it calls the executable specified in the cmdHook setting. If the onCompletion parameter is set to idle or quit, the script is called once before the robot shuts down or goes idle. If the parameter is set to loop, the script is called each time the robot restarts. See the log samples in "Monitoring the cmdHook Execution".
The cmdHook script can be written in any language (as a Perl or shell script, a C program, and so on). If you choose to use a C program, you'll have to insert the cmdHook parameter in the process.conf file manually, as the Compass Server Administration Interface does not scan binary executables. See the notes in "Preparing Your Completion Script to Appear in the Administration Interface."
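As one possible shape for the "write a small report" idea, here is a minimal Python sketch of a completion script. It is an illustration rather than a supplied example. LOG_PATH and REPORT_PATH are hypothetical placeholders that you would adjust for your installation, and the "Run cmd:" marker is taken from the robot.log samples in this chapter.

```python
#!/usr/bin/env python3
import os

LOG_PATH = "logs/robot.log"        # hypothetical location; adjust for your installation
REPORT_PATH = "robot-summary.txt"  # hypothetical output file

def summarize(lines):
    """Count non-empty log lines and the lines that record a cmdHook start."""
    total = 0
    hook_runs = 0
    for line in lines:
        if not line.strip():
            continue
        total += 1
        if "] Run cmd:" in line:
            hook_runs += 1
    return {"total": total, "hook_runs": hook_runs}

def write_report(counts, path=REPORT_PATH):
    """Write the counts to a small plain-text report."""
    with open(path, "w") as report:
        report.write("log lines: %d\n" % counts["total"])
        report.write("cmdHook runs: %d\n" % counts["hook_runs"])

# Only act when the log file is actually present at the assumed location.
if __name__ == "__main__" and os.path.exists(LOG_PATH):
    with open(LOG_PATH, errors="replace") as log:
        write_report(summarize(log))
```

Both paths in this sketch are relative, so they resolve against the directory from which the robot runs the script. Absolute paths avoid that dependency.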
The cmdHook script is run from the robot's execution environment. This means your script inherits any environment variables set by the robot and does not have access to environment variables that might be set by your Compass Server or the admin server. The cmdHook script is executed from compass-installdir/compass-name/ instead of bin/compass/admin/bin. This is a common source of errors and is important to keep in mind if you use any relative directory references (for example, perl is located in "../install/perl" instead of "../../../../install/perl", and so on).
There are two UNIX-only examples provided in bin/compass/admin/bin. They are a simple 'touch' test (cmdHook0) and an email-upon-completion script (cmdHook1).
Monitoring the cmdHook Execution
At the default logging level, a log message is written in robot.log when the cmdHook is run.
For example, if the onCompletion parameter is set to idle, the robot.log output would look like:
[12:45:57] Run cmd: /usr/suitespot/compass-test/bin/compass/admin/bin/cmdHook1
[12:45:58] Complete cmd: /usr/suitespot/compass-test/bin/compass/admin/bin/cmdHook1
[12:45:58] Robot is idle....
If the onCompletion parameter is set to quit, the robot.log output would look like:
[12:45:57] Run cmd: /usr/suitespot/compass-test/bin/compass/admin/bin/cmdHook1
[12:45:58] Complete cmd: /usr/suitespot/compass-test/bin/compass/admin/bin/cmdHook1
[12:46:33] Workload complete.
If the onCompletion parameter is set to loop, the robot.log output would look like:
[12:52:04] Run cmd: /usr/suitespot/compass-test/bin/compass/admin/bin/cmdHook1
[12:52:05] Complete cmd: /usr/suitespot/compass-test/bin/compass/admin/bin/cmdHook1
[12:52:05] Restart Robot.
If the onCompletion parameter is set to loop, there will be an additional entry in filter.log like this:
[12:54:41] Filter log started - loop 15
Preparing Your Completion Script to Appear in the Administration Interface
If you want your cmdHook script to show up as an option in the Compass Server Administration Interface on the "Robot -> Crawling Settings" page, you should do the following:
Place the script in bin/compass/admin/bin.
Include a description line in the script in the following form:
#description="My menu choice string"
The description assignment should be placed in a comment so it doesn't affect the execution of the script.
Since many scripts are platform specific, if the description contains the string "(non NT)", the Compass Server Administration Interface will not list that script as an option on NT platforms. This is not particularly useful for a specific instance of a Compass Server. However, if your script will be redistributed across a wide variety of platforms it is something to keep in mind.
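To make the convention concrete, the following Python sketch shows one way a tool could read the description comment and apply the "(non NT)" rule. It does not describe how the Administration Interface is actually implemented. The function name menu_label, the platform strings, and the exact pattern matched are assumptions.

```python
import re

# Matches a comment line of the form: #description="My menu choice string"
_DESCRIPTION = re.compile(r'^\s*#\s*description\s*=\s*"([^"]*)"', re.MULTILINE)

def menu_label(script_text, platform):
    """Return the menu string for a script, or None if it should not be listed."""
    match = _DESCRIPTION.search(script_text)
    if match is None:
        return None  # no description comment, so nothing to show
    label = match.group(1)
    if platform == "NT" and "(non NT)" in label:
        return None  # script is marked as unsuitable for NT
    return label
```

For a script whose description is "Send mail (non NT)", this sketch returns the label on UNIX and None on NT.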
Last Updated: 02/07/98 20:49:10
Any sample code included above is provided for your use on an "AS IS" basis, under the Netscape License Agreement - Terms of Use