Sun Java System Portal Server 7.1 Technical Reference

Part VII Search Engine Robot

Chapter 41 Overview of Search Engine Robot

A Search Engine robot is an agent that identifies and reports on resources in its domains. It does so by using two kinds of filters: an enumerator filter and a generator filter.

The enumerator filter locates resources by using network protocols. It tests each resource and enumerates the resource if it meets the proper criteria. For example, the enumerator filter can extract hypertext links from an HTML file and use the links to find additional resources.

The generator filter tests each resource to determine if a resource description (RD) should be created. If the resource passes the test, the generator creates an RD which is stored in the Search Engine database.

Figure 41–1 illustrates how the Search Engine robot works. The robot examines URLs and their associated network resources. Each resource is tested by both the enumerator and the generator. If the resource passes the enumeration test, the robot checks it for additional URLs. If the resource passes the generator test, the robot generates a resource description that is stored in the Search Engine database.
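
The interplay of the two tests can be modelled in a few lines of Python. This is a minimal sketch under stated assumptions, not the robot's implementation: the in-memory WEB table, the two test functions, and the crawl function are hypothetical names invented for illustration.

```python
from collections import deque

# Hypothetical in-memory "network": URL -> (content type, outgoing links).
WEB = {
    "http://home.siroe.com/": ("text/html", ["http://home.siroe.com/a.html",
                                             "http://home.siroe.com/logo.gif"]),
    "http://home.siroe.com/a.html": ("text/html", []),
    "http://home.siroe.com/logo.gif": ("image/gif", []),
}

def passes_enumerator(content_type):
    # Assumed test: only HTML is scanned for further links.
    return content_type == "text/html"

def passes_generator(content_type):
    # Assumed test: only text resources get a resource description (RD).
    return content_type.startswith("text/")

def crawl(seed):
    """Return the URLs for which an RD would be generated, in visit order."""
    queue, seen, rds = deque([seed]), {seed}, []
    while queue:
        url = queue.popleft()
        content_type, links = WEB[url]
        if passes_enumerator(content_type):
            # Enumeration: queue every newly discovered link.
            for link in links:
                if link in WEB and link not in seen:
                    seen.add(link)
                    queue.append(link)
        if passes_generator(content_type):
            # Generation: record that an RD would be stored for this URL.
            rds.append(url)
    return rds
```

Note that each resource is tested twice and the two outcomes are independent: a resource can be enumerated without receiving an RD, and the reverse.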

Overview

The Robot Application Functions (RAFs) in the filter.conf file can be used to create and modify filter definitions. The file filter.conf is located in the /var/opt/SUNWportal/searchservers/instanceName/config directory.

The following figure shows how the robot works.

Figure 41–1 How the Robot Works

The filter.conf file contains definitions for the enumeration and generation filters. Each of these filters invokes a set of rules which are stored in the filterrules.conf file. The filter definitions contain instructions that are specific to each filter while the filter rules contain the rules used by both filters.

To understand how filter rules are defined, examine the filterrules.conf file. Note that you typically need not manually edit this file since you can create filter rules from the administration console.

For an example of filter definitions, examine the filter.conf file. Edit the filter.conf file only to modify the filters in a way that is not accommodated in the administration console, such as instructing the robot to enumerate some resources without generating resource descriptions for them.

Chapter 42 Process Parameters

The robot.conf file defines many options for the robot, including pointing the robot to the appropriate filters in the filter.conf file. For backward compatibility with older versions, the robot.conf file can also contain the seed URLs.

The administration console is used to edit the file robot.conf. Because you can set most parameters by using the administration console, you typically do not need to edit the robot.conf file. However, advanced users might manually edit this file in order to set parameters that cannot be set through the administration console.

This chapter lists the user-modifiable parameters in the robot.conf file. The first column of the table lists the parameter, the second column provides a description of the parameter, and the third column provides an example.

User Modifiable Parameters in `robot.conf` File

Table 42–1 User Modifiable Parameters in robot.conf File


Parameter	Description	Example
auto-proxy	Specifies the proxy setting for the robot. It can be a proxy server or a JavaScript file for automatically configuring the proxy.	auto-proxy="http://proxy_server/proxy.pac"
bindir	Specifies whether the robot will add a bind directory to the PATH environment. This is an extra PATH for users to run an external program in a robot, such as those specified by cmd-hook parameter.	bindir=path
cmd-hook	Specifies an external completion script to run after the robot completes one run. This must be a full path to the command name. The robot will execute this script from the `/var/opt/SUNWportal/` directory. There is no default. There must be at least one RD registered for the command to run.	cmd-hook="command-string"
command-port	Specifies the socket that the robot listens to in order to accept commands from other programs, such as the Administration Interface or robot control panels. For security reasons, the robot can accept commands only from the local host unless remote-access is set to yes.	command-port=port_number
connect-timeout	Specifies the maximum time allowed for a network to respond to a connection request. The default is 120 seconds.	connect-timeout=seconds
convert-timeout	Specifies the maximum time allowed for document conversion. The default is 600 seconds.	convert-timeout=seconds
depth	Specifies the number of links from the seed URLs (also referred to as starting point) that the robot will examine. This parameter sets the default value for any seed URLs that do not specify a depth. The default is 10. A value of negative one (depth=-1) indicates that the link depth is infinite.	depth=integer
email	Specifies the email address of the person who runs the robot. The email address is sent with the user-agent in the HTTP request header, so that Web managers can contact the people who run robots at their sites. The default is user@domain.	email=user@hostname
enable-ip	Generates an IP address for the URL for each RD that is created. The default is true.	enable-ip=[true | yes | false | no]
enable-rdm-probe	Determines whether the robot queries each server it encounters to find out if the server supports RDM. If the server supports RDM, the robot will not attempt to enumerate the server’s resources, since that server is able to act as its own resource description server. The default is false.	enable-rdm-probe=[true | false | yes | no]
enable-robots-txt	Determines if the robot should check the robots.txt file at each site it visits, if available. The default is yes.	enable-robots-txt=[true | false | yes | no]
engine-concurrent	Specifies the number of pre-created threads for the robot to use. The default is 10. This parameter cannot be set interactively through the administration console.	engine-concurrent=[1..100]
enumeration-filter	Specifies the enumeration filter that is used by the robot to determine if a resource should be enumerated. The value must be the name of a filter defined in the file filter.conf. The default is enumeration-default. This parameter cannot be set interactively through the administration console.	enumeration-filter= enumfiltername
generation-filter	Specifies the generation filter that is used by the robot to determine if a resource description should be generated for a resource. The value must be the name of a filter defined in the file filter.conf. The default is generation-default.	generation-filter=genfiltername
index-after-ngenerated	Specifies the number of minutes that the robot should collect RDs before batching them for the Search Engine. If you do not specify this parameter, it is set to 256 minutes.	index-after-ngenerated=30
loglevel	Specifies the level of logging. Each level adds to the one before it: level 0 logs nothing but serious errors; level 1 also logs RD generation; level 2 also logs retrieval activity; level 3 also logs filtering activity; level 4 also logs spawning activity; level 5 also logs retrieval progress. The default is 1.	loglevel=[0...100]
max-connections	Specifies the maximum number of concurrent retrievals that a robot can make. The default is 8.	max-connections=[1..100]
max-filesize-kb	Specifies the maximum file size in kilobytes for files retrieved by the robot. The default is 10240.	max-filesize-kb=1024
max-memory-per-url / max-memory	Specifies the maximum memory in bytes used by each URL. If the URL needs more memory, the RD is saved to disk. The default is 64000. This parameter cannot be set interactively through the administration console.	max-memory-per-url=n_bytes
max-working	Specifies the size of the robot working set, which is the maximum number of URLs the robot can work on at one time. This parameter cannot be set interactively through the administration console.	max-working=1024
onCompletion	Determines what the robot does after it has completed a run. The robot can either go into idle mode, loop back and start again, or quit. The default is idle. This parameter works with the cmd-hook parameter. When the robot is done, it performs the onCompletion action and then runs the cmd-hook program.	onCompletion=[idle | loop | quit]
password	Specifies the password that is used for httpd authentication and ftp connection.	password=string
referer	Specifies the value sent in the HTTP request, if set, to identify the robot as the referer when accessing Web pages.	referer=string
remote-access	Determines whether the robot can accept commands from remote hosts. The default is false.	remote-access=[true | false | yes | no]
robot-state-dir	Specifies the directory where the robot saves its state. In this working directory, the robot can record the number of collected RDs and so on.	robot-state-dir="/var/opt/SUNWportal/instance/portal/robot"
server-delay	Specifies the time period between two visits to the same web site, thus preventing the robot from accessing the same site too frequently.	server-delay=delay_in_seconds
site-max-connections	Indicates the maximum number of concurrent connections that a robot can make to any one site. The default is 2.	site-max-connections=[1..100]
smart-host-heuristics	Enables the robot to change sites that are rotating their DNS canonical host names. For example, www123.siroe.com is changed to www.siroe.com. The default is false.	smart-host-heuristics=[true | false]
tmpdir	Specifies a place for the robot to create temporary files. Use this value to set the environment variable TMPDIR.	tmpdir=path
user-agent	Specifies the parameter sent with the email address in the http-request to the server.	user-agent=iPlanetRobot/4.0
username	Specifies the user name of the user who runs the robot and is used for httpd authentication and ftp connection. The default is anonymous.	username=string

The most important parameters are enumeration-filter and generation-filter, which determine the filters the robot uses for enumeration and generation. The default values for these are enumeration-default and generation-default, which are the names of the filters provided by default in the filter.conf file.

All filters must be defined in the filter.conf file. If you define your own filters in the filter.conf file, you must add any necessary parameters to the robot.conf file.

For example, if you define a new enumeration filter named my-enumerator, you would add the following parameter to the robot.conf file:

enumeration-filter=my-enumerator
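
To see how a setting such as this one overrides a default, the following Python sketch reads robot.conf-style text into a dictionary. It is an illustration only: it assumes one name=value pair per line and treats lines starting with # as comments, neither of which is confirmed here for the real file, and parse_robot_conf is a hypothetical helper. The defaults are a small subset taken from Table 42–1.

```python
# Defaults taken from Table 42-1; only a few parameters are shown.
DEFAULTS = {
    "depth": "10",
    "loglevel": "1",
    "max-connections": "8",
    "site-max-connections": "2",
    "enumeration-filter": "enumeration-default",
    "generation-filter": "generation-default",
}

def parse_robot_conf(text):
    """Parse name=value lines, assuming '#' starts a comment line."""
    settings = dict(DEFAULTS)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if not sep:
            continue  # not a name=value line; ignored in this sketch
        # Surrounding double quotes, as in cmd-hook="...", are dropped.
        settings[name.strip()] = value.strip().strip('"')
    return settings

conf = parse_robot_conf("enumeration-filter=my-enumerator\ndepth=-1\n")
```

Here conf names my-enumerator as the enumeration filter, while the generation filter keeps its default because the text does not set it.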

Chapter 43 The Filtering Process

Overview

The robot uses filters to determine which resources to process and how to process them. When the robot discovers references to resources as well as the resources themselves, it applies filters to each resource in order to enumerate it and to determine whether or not to generate a resource description to store in the Search Engine database.

The robot examines one or more seed URLs, applies the filters, and then applies the filters to the URLs spawned by enumerating the seed URLs, and so on. The seed URLs are defined in the filterrules.conf file.

A filter performs any required initialization operations and applies comparison tests to the current resource. The goal of each test is to either allow or deny the resource. A filter also has a shutdown phase during which it performs any required cleanup operations.

If a resource is allowed, it continues its passage through the filter. If a resource is denied, it is rejected, and the filter takes no further action on it. If a resource is not denied, the robot will eventually enumerate it, attempting to discover further resources. The generator might also create a resource description for it.

These operations are not necessarily linked. Some resources result in enumeration; others result in RD generation. Many resources result in both enumeration and RD generation. For example, if the resource is an FTP directory, the resource typically will not have an RD generated for it. However, the robot might enumerate the individual files in the FTP directory. An HTML document that contains links to other documents can receive an RD and can lead to enumeration of the linked documents as well.

Stages in the Filter Process

Enumerator and generator filters each pass through five phases. Four phases are common to both: Setup, Metadata, Data, and Shutdown. A resource that makes it past the Data phase then enters either the Enumerate or the Generate phase, depending on whether the filter is an enumerator or a generator.

The phases are as follows:

Setup

Performs initialization operations. Occurs only once in the life of the robot.

Metadata

Filters the resource based on metadata that is available about the resource. Metadata filtering occurs once per resource before the resource is retrieved over the network. The table below lists the common metadata types and their description.

Table 43–1 Common Metadata Types


Metadata	Description	Example
Complete URL	The location of a resource	http://home.siroe.com/
Protocol	The access portion of the URL	http, ftp, file
Host	The address portion of the URL	www.siroe.com
IP address	Numeric version of the host	198.95.249.6
PATH	The path portion of the URL	/index.html
Depth	Number of links from the seed URL	5

Data

Filters the resource based on its data. Data filtering is done once per resource after it is retrieved over the network. Data that can be used for filtering include:

content-type
content-length
content-encoding
content-charset
last-modified
expires

Enumerate

Enumerates the current resource in order to determine if it points to other resources to be examined.

Generate

Generates a resource description (RD) for the resource and saves it in the Search Engine database.

Shutdown

Performs any needed termination operations. Occurs once in the life of the robot.
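
The order of the phases can be traced with a small Python model. This is a sketch of the sequence described above, not robot code: run_filter is a hypothetical function, and each resource carries a pair of booleans that stand in for the real Metadata and Data tests.

```python
def run_filter(kind, resources, log):
    """Record the order in which the phases would run for one filter.

    kind is "Enumerate" or "Generate"; resources maps a URL to a pair
    (passes_metadata, passes_data) of booleans standing in for the real tests.
    """
    log.append("Setup")                      # once in the life of the robot
    for url, (meta_ok, data_ok) in resources.items():
        log.append(f"Metadata {url}")        # before the resource is retrieved
        if not meta_ok:
            continue                         # denied: the resource is never fetched
        log.append(f"Data {url}")            # after the resource is retrieved
        if not data_ok:
            continue
        log.append(f"{kind} {url}")          # final per-resource phase
    log.append("Shutdown")                   # once in the life of the robot
    return log

trace = run_filter("Enumerate", {"/a": (True, True), "/b": (False, True)}, [])
```

In this model a resource denied at the Metadata phase never reaches the Data phase, which mirrors the point that metadata filtering happens before the resource is retrieved over the network.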

Filter Syntax

The filter.conf file contains definitions for enumeration and generation filters. This file can contain multiple filters for both enumeration and generation. The robot determines which filters to use from the enumeration-filter and generation-filter parameters in the robot.conf file.

Filter definitions have a well-defined structure: a header, a body, and an end. The header identifies the beginning of the filter and declares its name; for example:

<Filter name="myFilter">

The body consists of a series of filter directives that define the filter’s behavior during setup, testing, enumeration or generation, and shutdown. Each directive specifies a function, and if applicable, parameters for the function.

The end is marked by </Filter>.

The following example shows a filter named enumeration1.

Example 43–1 Enumeration File Syntax

<Filter name="enumeration1">
Setup fn=filterrules-setup config=./config/filterrules.conf
# Process the rules
MetaData fn=filterrules-process
# Filter by type and process rules again
Data fn=assign-source dst=type src=content-type
Data fn=filterrules-process
# Perform the enumeration on HTML only
Enumerate enable=true fn=enumerate-urls max=1024 type=text/html
# Cleanup
Shutdown fn=filterrules-shutdown
</Filter>
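
The header, body, and end structure lends itself to a simple line-based reader. The Python sketch below is an illustration only, assuming one directive per line, name=value parameters, and # comment lines as in the example; parse_filters is a hypothetical helper and not part of the product.

```python
import re
import shlex

def parse_filters(text):
    """Return {filter name: [(directive, {parameter: value}), ...]}."""
    filters, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        header = re.fullmatch(r'<Filter\s+name="([^"]+)">', line)
        if header:
            current = filters.setdefault(header.group(1), [])
        elif line == "</Filter>":
            current = None
        elif current is not None:
            # shlex.split also strips quotes, so fn="x" and fn=x read alike.
            directive, *tokens = shlex.split(line)
            # Every token is expected to be name=value in this sketch.
            params = dict(token.split("=", 1) for token in tokens)
            current.append((directive, params))
    return filters

SAMPLE = '''
<Filter name="enumeration1">
Setup fn=filterrules-setup config=./config/filterrules.conf
# Process the rules
MetaData fn=filterrules-process
Enumerate enable=true fn=enumerate-urls max=1024 type=text/html
Shutdown fn=filterrules-shutdown
</Filter>
'''
parsed = parse_filters(SAMPLE)
```

Keeping the directives in a list preserves their order, which matters because directives run in the order in which they appear.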

Filter Directives

Filter directives use Robot Application Functions (RAFs) to perform operations. Their use and flow of execution are similar to those of NSAPI directives and Server Application Functions (SAFs) in the file obj.conf. As with NSAPI and SAFs, data is stored and transferred using parameter blocks, also called pblocks.

There are six robot directives, or RAF classes, corresponding to the filtering phases and operations listed below. See Stages in the Filter Process for more information on these phases.

Setup
Metadata
Data
Enumerate
Generate
Shutdown

Each directive has its own robot application functions. For example, use filtering functions with the Metadata and Data directives, enumeration functions with the Enumerate directive, generation functions with the Generate directive, and so on.

The built-in robot application functions and instructions for writing your own robot application functions are explained in the Sun Java System Portal Server 7.1 Developer's Guide.

Writing or Modifying a Filter

In most cases, you should not need to write filters from scratch. You can create most of your filters using the administration console. You can then modify the filter.conf and filterrules.conf files to make any desired changes. These files reside in the directory /var/opt/SUNWportal/searchservers/search1/config.

However, if you want to create a more complex set of parameters, you will need to edit the configuration files used by the robot.

Keep the following points in mind when writing or modifying a filter:

The order of execution of directives (especially the available information at each phase)
The order of rules

For a discussion of the parameters you can modify in the robot.conf file, the robot application functions that you can use in the filter.conf file, and how to create your own robot application functions, see the Sun Java System Portal Server 7.1 Developer's Guide.

Chapter 44 Robot Application Functions - Sources and Destinations

Introduction

Most of the Robot Application Functions (RAFs) require sources of information and generate data that goes to destinations. The sources are defined within the robot itself and are not necessarily related to the fields in the resource description it ultimately generates. Destinations, on the other hand, are generally the names of fields in the resource description, as defined by the resource description server’s schema.

For details on using the administration console to determine the database schema, see Sun Java System Portal Server 7.1 Administration Guide.

The following sections describe the different stages of the filtering process, and the sources available at those stages.

Setup Stage

At the Setup stage, the filter is set up and cannot yet get information about the resource’s URL or content.

MetaData Filtering Stage

At the MetaData stage, the robot has encountered a URL for a resource but has not yet downloaded the resource’s content. Information is therefore available about the URL, as well as data that is derived from other sources such as the filter.conf file. Information about the content of the resource is not yet available.

The table below lists the sources available in the RAFs at the MetaData phase and their description.

Table 44–1 Sources Available to the RAFs at the MetaData Phase


Source	Description	Example
csid	Catalog Server ID	x-catalog//budgie.siroe.com:8086/alexandria
depth	Number of links traversed from starting point	10
enumeration filter	Name of Enumeration filter	enumeration1
generation filter	Name of Generation filter	generation1
host	Host portion of URL	home.siroe.com
IP	Numeric version of host	198.95.249.6
protocol	Access portion of the URL	http, https, ftp, file
path	Path portion of the URL	/, /index.html, /documents/listing.html
URL	Complete URL	http://developer.siroe.com/docs/manuals/

Data Stage

At the Data stage, the robot has downloaded the content of the resource at the URL, and can access data about the content, such as the description, the author, and so on.

If the resource is an HTML file, the Robot parses the <META> tags in the HTML headers. Consequently, any data contained in <META> tags is available at the Data stage.

During the Data phase, the sources shown in the following table are available to RAFs, in addition to those available during the MetaData phase.

Table 44–2 Sources Available to the RAFs at the Data Phase


Source	Description	Example
content-charset	Character set used by the resource
content-encoding	Any form of encoding
content-length	Size of the resource in bytes
content-type	MIME type of the resource	text/html, image/jpeg
expires	Date the resource itself expires
last-modified	Date the resource was last modified
data in <META> tags	Any data that is provided in <META> tags in the header of HTML resources	Author, Description, Keywords

Enumeration, Generation, and Shutdown Stages

At the Enumeration and Generation stages, the same data sources are available as the Data stage.

At the Shutdown stage, the filter completes its filtering and shuts down. Although functions written for this stage can use the same data sources as those available at the Data stage, the shutdown functions typically restrict their operations to shutdown and cleanup activities.

Chapter 45 Robot Application Functions - Enable Parameter

Each function can have an enable parameter. The values can be true, false, on, or off. The administration console uses these parameters to turn certain directives on or off.

The following example enables enumeration for text/html and disables enumeration for text/plain:

Example

# Perform the enumeration on HTML only
Enumerate enable=true fn=enumerate-urls max=1024 type=text/html
Enumerate enable=false fn=enumerate-urls-from-text max=1024 type=text/plain

Adding an enable=false parameter or an enable=off parameter has the same effect as commenting out the line. Because the administration console does not write comments, it writes an enable parameter instead.
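
A short Python sketch shows the effect of the enable parameter on a list of directives. It rests on two assumptions that are not stated in this chapter: a directive without an enable parameter is treated as active, and the values are compared case-insensitively. The is_enabled helper is hypothetical.

```python
def is_enabled(params):
    """Treat a directive as active unless enable is false or off."""
    value = params.get("enable", "true").lower()
    if value not in {"true", "false", "on", "off"}:
        raise ValueError(f"unexpected enable value: {value!r}")
    return value in {"true", "on"}

directives = [
    {"fn": "enumerate-urls", "enable": "true", "type": "text/html"},
    {"fn": "enumerate-urls-from-text", "enable": "false", "type": "text/plain"},
    {"fn": "filterrules-process"},  # no enable parameter: assumed active
]
# Disabled directives are skipped, exactly as if they were commented out.
active = [d["fn"] for d in directives if is_enabled(d)]
```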

Chapter 46 Robot Application Functions - Setup Functions

This chapter describes the functions that are used during the setup phase by both enumeration and generation filters.

filterrules-setup

The filterrules-setup function prepares the filter to use the rules in a filter rules file.

Parameters

The parameters used with the filterrules-setup function and their description are:

config: Path name to the file containing the filter rules to be used by this filter.
logtype: Type of log file to use. The value can be verbose, normal, or terse.

Example

Setup fn=filterrules-setup config=./config/filterrules.conf logtype=normal

setup-regex-cache

The setup-regex-cache function initializes the cache size for the filter-by-regex and generate-by-regex functions. Use this function to specify a number other than the default of 32.

Parameters

The parameter used with the setup-regex-cache function and its description is:

cache-size: Maximum number of compiled regular expressions to be kept in the regex cache.

Example

Setup fn=setup-regex-cache cache-size=28

setup-type-by-extension

The setup-type-by-extension function configures the filter to recognize file name extensions. It must be called before the assign-type-by-extension function can be used. The file specified as a parameter must contain mappings between standard MIME content types and file extension strings.

Parameters

The parameter used with the setup-type-by-extension function and its description is:

file: Name of the MIME types configuration file.

Example

Setup fn=setup-type-by-extension file=./config/mime.types

Chapter 47 Robot Application Functions - Filtering Functions

Introduction

The functions discussed in this chapter operate at the Metadata and Data stages to allow or deny resources based on specific criteria specified by the function and its parameters.

These functions can be used in both Enumeration and Generation filters in the filter.conf file.

Each “filter-by” function performs a comparison, then either allows or denies the resource. Allowing the resource means that processing continues to the next filtering step. Denying the resource means that processing should stop, because the resource does not meet the criteria for further enumeration or generation.

filter-by-exact

The filter-by-exact function allows or denies the resource if the allow/deny string matches the source of information exactly. The keyword all matches any string.

Parameters

The parameters used with the filter-by-exact function and their description are:

src: Source of information.
allow/deny: Contains a string.

Example

The following example filters out all resources whose content-type is text/plain. It allows all other resources to proceed:

Data fn=filter-by-exact src=type deny=text/plain

filter-by-max

The filter-by-max function allows the resource if the specified information source is less than or equal to the given value. It denies the resource if the information source is greater than the specified value.

This function can be called no more than once per filter.

Parameters

The parameters used with the filter-by-max function and their description are:

src: Source of information. It must be one of the following: hosts, objects, or depth.
value: Specifies a value for comparison.

Example

This example allows resources whose content-length is less than 1024 K:

MetaData fn=filter-by-max src=content-length value=1024

filter-by-md5

The filter-by-md5 function only allows the first resource with a given MD5 checksum value. If the current resource’s MD5 has been seen in an earlier resource by this robot, the current resource is denied. As a result, duplication of identical resources or single resources with multiple URLs is prevented.

You can only call this function at the Data stage or later. It can be called no more than once per filter. The filter must invoke the generate-md5 function to generate an MD5 checksum before invoking filter-by-md5 function.

Parameters

none

Example

The following example shows the typical method of handling MD5 checksums by first generating the checksum and then filtering based on it:

Data fn=generate-md5
Data fn=filter-by-md5
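
The two-step pattern can be modelled in Python with the standard hashlib module. This is a sketch of the idea rather than the robot's code: resources are plain dictionaries, and the function names merely echo the RAF names.

```python
import hashlib

def generate_md5(resource):
    """Attach the MD5 checksum of the content to the resource (a dict)."""
    resource["md5"] = hashlib.md5(resource["content"]).hexdigest()
    return resource

def filter_by_md5(resource, seen):
    """Allow only the first resource seen with a given checksum."""
    if resource["md5"] in seen:
        return False
    seen.add(resource["md5"])
    return True

seen = set()
pages = [
    {"url": "http://www.siroe.com/", "content": b"<html>hello</html>"},
    {"url": "http://www.siroe.com/index.html", "content": b"<html>hello</html>"},
    {"url": "http://www.siroe.com/other.html", "content": b"<html>other</html>"},
]
# The second page has the same content as the first and is denied.
allowed = [p["url"] for p in pages if filter_by_md5(generate_md5(p), seen)]
```

The checksum must exist before the comparison can run, which is why the generating step comes first.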

filter-by-prefix

The filter-by-prefix function allows or denies the resource if the given information source begins with the specified prefix string. The resource doesn’t have to match completely. The keyword all matches any string.

Parameters

The parameters used with the filter-by-prefix function and their description are:

src: Source of information.
allow/deny: Contains a string for prefix comparison.

Example

The following example allows resources whose content-type is any kind of text, including text/html and text/plain:

MetaData fn=filter-by-prefix src=type allow=text
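
The exact and prefix comparisons differ only in the matching rule, as the following Python sketch shows. The semantics are an assumption drawn from the examples in this chapter: with allow, the resource passes only if the source matches; with deny, it passes only if the source does not match. The function names are hypothetical.

```python
def filter_by(match, source_value, allow=None, deny=None):
    """Return True to allow the resource, False to deny it.

    Assumed semantics: with allow=, the resource passes only if the source
    matches; with deny=, it passes only if the source does not match.
    """
    if (allow is None) == (deny is None):
        raise ValueError("specify exactly one of allow or deny")
    pattern = allow if allow is not None else deny
    # The keyword "all" matches any string.
    matched = pattern == "all" or match(source_value, pattern)
    return matched if allow is not None else not matched

def exact(value, pattern):
    return value == pattern

def prefix(value, pattern):
    return value.startswith(pattern)
```

For example, filter_by(prefix, "text/html", allow="text") allows the resource, while filter_by(exact, "text/plain", deny="text/plain") denies it.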

filter-by-regex

The filter-by-regex function supports regular expression pattern matching. It allows or denies resources that match the given regular expression. The supported regular expression syntax is defined by the POSIX.1 specification. The regular expression \\* matches anything.

Parameters

The parameters used with the filter-by-regex function and their description are:

src: Source of information.
allow/deny: Contains a regular expression string.

Example

The following example denies all resources from sites in the government domain:

MetaData fn=filter-by-regex src=host deny=\\*.gov

filterrules-process

The filterrules-process function processes the rules in the filterrules.conf file.

Parameters

none

Example

MetaData fn=filterrules-process

Chapter 48 Robot Application Functions - Filtering Support Functions

Introduction

The functions discussed in this chapter are used during filtering to manipulate or generate information on the resource. The robot can then process the resource by calling filtering functions. These functions can be used in Enumeration and Generation filters in the filter.conf file.

assign-source

The assign-source function assigns a new value to a given information source. This permits editing during the filtering process. The function can assign an explicit new value, or it can copy a value from another information source.

Parameters

The parameters used with the assign-source function and their description are:

dst: Name of the source whose value is to be changed.
value: Specifies an explicit value.
src: Information source to copy to dst.

You must specify either a value parameter or a src parameter, but not both.

Example

Data fn=assign-source dst=type src=content-type

assign-type-by-extension

The assign-type-by-extension function uses the resource’s file name to determine its type and assigns this type to the resource for further processing.

The setup-type-by-extension function must be called during setup before assign-type-by-extension function can be used.

Parameters

The parameter used with the assign-type-by-extension function and its description is:

src: Source of file name to compare. If you do not specify a source, the default is the resource’s path.

Example

MetaData fn=assign-type-by-extension
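
The lookup can be pictured with a short Python sketch. The MIME_BY_EXTENSION table is a hypothetical stand-in for the mappings that setup-type-by-extension loads from the MIME types file, and storing the result under the name type is an assumption made for illustration.

```python
import posixpath

# Hypothetical stand-in for the mappings loaded by setup-type-by-extension.
MIME_BY_EXTENSION = {"html": "text/html", "htm": "text/html",
                     "txt": "text/plain", "pdf": "application/pdf"}

def assign_type_by_extension(resource, src="path"):
    """Set resource["type"] from the extension of resource[src], if known."""
    extension = posixpath.splitext(resource[src])[1].lstrip(".").lower()
    if extension in MIME_BY_EXTENSION:
        resource["type"] = MIME_BY_EXTENSION[extension]
    return resource

doc = assign_type_by_extension({"path": "/documents/listing.html"})
```

A path without a recognized extension, such as /, leaves the resource unchanged in this sketch.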

clear-source

The clear-source function deletes the specified data source. You typically do not need to perform this function. You can create or replace a source by using the assign-source function.

Parameters

The parameter used with the clear-source function and its description is:

src: Name of source to delete.

Example

The following example deletes the path source:

MetaData fn=clear-source src=path

convert-to-html

The convert-to-html function converts the current resource into an HTML file for further processing, if its type matches a specified MIME type. The conversion filter automatically detects the type of the file it is converting.

Parameters

The parameter used with the convert-to-html function and its description is:

type: MIME type from which to convert.

Example

The following sequence of function calls causes the filter to convert all Adobe Acrobat PDF files, Microsoft RTF files, and FrameMaker MIF files to HTML, as well as any files whose type was not specified by the server that delivered it.

Data fn=convert-to-html type=application/pdf
Data fn=convert-to-html type=application/rtf
Data fn=convert-to-html type=application/x-mif
Data fn=convert-to-html type=unknown

copy-attribute

The copy-attribute function copies the value from one field in the resource description into another.

Parameters

The parameters used with the copy-attribute function and their description are:

src: Field in the resource description from which to copy.
dst: Item in the resource description into which to copy the source.
truncate: Maximum length of the source to copy.
clean: Boolean parameter indicating whether to fix truncated text (such as not leaving partial words). This parameter is false by default.

Example

Generate fn=copy-attribute \
src=partial-text dst=description truncate=200 clean=true
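
The effect of truncate and clean can be sketched in Python. The details are assumptions: the length is counted in characters, and clean is modelled as dropping a word that the truncation cuts in half, which is one reading of "not leaving partial words". The copy_attribute function here is a hypothetical model of the RAF.

```python
def copy_attribute(rd, src, dst, truncate=None, clean=False):
    """Copy rd[src] to rd[dst], optionally truncating to `truncate` characters.

    With clean=True, a word cut in half by the truncation is dropped; this is
    one guess at what "fixing truncated text" involves.
    """
    text = rd[src]
    if truncate is not None and len(text) > truncate:
        cut = text[:truncate]
        # A partial word remains if the next character is not whitespace.
        if clean and not text[truncate].isspace() and " " in cut:
            cut = cut[:cut.rfind(" ")]
        text = cut.rstrip()
    rd[dst] = text
    return rd

rd = copy_attribute({"partial-text": "Search Engine robot overview"},
                    "partial-text", "description", truncate=16, clean=True)
```

With truncate=16 the raw cut would end in the fragment "ro"; with clean=true the sketch keeps only the whole words before it.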

generate-by-exact

The generate-by-exact function generates a source with a specified value, but only if an existing source exactly matches another value.

Parameters

The parameters used with the generate-by-exact function and their description are:

dst: Name of source to generate.
value: Value to assign to dst.
src: Source against which to match.
match: Value to compare to src.

Example

The following example sets the classification to Siroe if the host is www.siroe.com.

Generate fn="generate-by-exact" match="www.siroe.com:80" src="host"
value="Siroe" dst="classification"

generate-by-prefix

The generate-by-prefix function generates a source with a specified value, but only if the prefix of an existing source matches another value.

Parameters

The parameters used with the generate-by-prefix function and their description are:

dst: Name of the source to generate.
value: Value to assign to dst.
src: Source against which to match.
match: Value to compare to src.

Example

The following example sets the classification to World Wide Web if the protocol prefix is http:

Generate fn="generate-by-prefix" match="http" src="protocol"
value="World Wide Web" dst="classification"

generate-by-regex

The generate-by-regex function generates a source with a specified value, but only if an existing source matches a regular expression.

Parameters

The parameters used with the generate-by-regex function and their description are:

dst: Name of the source to generate.
value: Value to assign to dst.
src: Source against which to match.
match: Regular expression string to compare to src.

Example

The following example sets the classification to Siroe if the host name matches the regular expression *.siroe.com. For example, resources at both developer.siroe.com and home.siroe.com will be classified as Siroe:

Generate fn="generate-by-regex" match="\\*.siroe.com"
src="host" value="Siroe" dst="classification"
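As a rough model of regular-expression matching, the following Python sketch classifies hosts under siroe.com. The function name is hypothetical, and the pattern is written in Python's own regular-expression syntax, which may differ from the syntax the robot accepts; requiring a match of the whole value is also an assumption.

```python
import re


def generate_by_regex(sources, src, match, dst, value):
    """Set sources[dst] to value only if sources[src] matches the pattern."""
    # fullmatch requires the entire source value to match (an assumption).
    if re.fullmatch(match, str(sources.get(src, ""))):
        sources[dst] = value
    return sources


pattern = r".*\.siroe\.com"  # Python equivalent of "any host under siroe.com"
for host in ("developer.siroe.com", "home.siroe.com", "www.example.org"):
    result = generate_by_regex({"host": host}, "host", pattern,
                               "classification", "Siroe")
    print(host, result.get("classification"))
```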

generate-md5

The generate-md5 function generates an MD5 checksum and adds it to the resource. You can then use the filter-by-md5 function to deny resources with duplicate MD5 checksums.

Parameters

none

Example

Data fn=generate-md5
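The idea of using a checksum to detect duplicate resources can be illustrated with Python's standard hashlib module. The helper names and the list of (URL, body) pairs are hypothetical; the sketch does not reproduce how the robot stores or compares checksums.

```python
import hashlib


def generate_md5(body: bytes) -> str:
    """Return the MD5 checksum of a resource body as a hex string."""
    return hashlib.md5(body).hexdigest()


def deny_duplicates(resources):
    """Keep only the first URL seen for each distinct checksum."""
    seen = set()
    kept = []
    for url, body in resources:
        digest = generate_md5(body)
        if digest in seen:
            continue  # same content already seen under another URL
        seen.add(digest)
        kept.append(url)
    return kept


resources = [
    ("http://www.siroe.com/a.html", b"same content"),
    ("http://www.siroe.com/mirror/a.html", b"same content"),
    ("http://www.siroe.com/b.html", b"other content"),
]
print(deny_duplicates(resources))
```

Here the mirrored copy is dropped because its body produces the same checksum as the first resource.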

generate-rd-expires

The generate-rd-expires function generates an expiration date and adds it to the specified source. The function uses metadata such as the HTTP header and HTML <META> tags to obtain any expiration data from the resource. If none exists, it generates an expiration date three months from the current date.

Parameters

The parameter used with the generate-rd-expires function and its description is:

dst: Name of the source. If you omit it, it defaults to rd-expires.

Example

Generate fn=generate-rd-expires
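The fallback behavior described above can be sketched in Python. The function name and the headers dictionary are hypothetical, only the HTTP Expires header is considered (HTML <META> tags are left out), and "three months" is approximated as 90 days, which is an assumption.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def generate_rd_expires(headers, now=None):
    """Return the resource's expiration date, or a default when none is given."""
    now = now or datetime.now(timezone.utc)
    expires = headers.get("Expires")
    if expires:
        try:
            # Parse an HTTP date such as "Wed, 21 Oct 2015 07:28:00 GMT".
            return parsedate_to_datetime(expires)
        except (TypeError, ValueError):
            pass  # unparsable value: fall through to the default
    # Assumed approximation of "three months from the current date".
    return now + timedelta(days=90)


print(generate_rd_expires({"Expires": "Wed, 21 Oct 2015 07:28:00 GMT"}))
print(generate_rd_expires({}))
```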

generate-rd-last-modified

The generate-rd-last-modified function adds the current time to the specified source.

Parameters

The parameter used with the generate-rd-last-modified function and its description is:

dst: Name of the source. If you omit it, it defaults to rd-last-modified.

Example

Generate fn=generate-rd-last-modified

rename-attribute

The rename-attribute function changes the name of a field in the resource description. It is most useful in cases where, for example, extract-html-meta copies information from a <META> tag into a field, and you want to change the name of the field.

Parameters

The parameter used with the rename-attribute function and its description is:

src: String containing a mapping from one name to another.

Example

The following example renames an attribute from author to author-name:

Generate fn=rename-attribute src="author->author-name"
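The old->new mapping string can be modeled with a few lines of Python. The function name and the dictionary-based resource description are hypothetical; the sketch assumes a single mapping per call and leaves the resource description unchanged when the old field is absent.

```python
def rename_attribute(rd, src):
    """Rename a field in the resource description using an "old->new" mapping."""
    old, sep, new = src.partition("->")
    if not sep or not old or not new:
        raise ValueError("expected a mapping of the form old->new")
    if old in rd:
        rd[new] = rd.pop(old)  # move the value under the new field name
    return rd


rd = {"author": "J. Smith", "title": "Robot Overview"}
print(rename_attribute(rd, "author->author-name"))
# {'title': 'Robot Overview', 'author-name': 'J. Smith'}
```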

Chapter 49 Robot Application Functions - Enumeration Functions

This chapter contains the following sections

Introduction

The functions discussed in this chapter operate at the Enumerate stage. These functions control whether and how the robot gathers links from a given resource to use as starting points for further resource discovery.

enumerate-urls

The enumerate-urls function scans the resource and enumerates all URLs found in hypertext links. The results are used to spawn further resource discovery. You can specify a content-type to restrict the kind of URLs enumerated.

Parameters

The parameters used with the enumerate-urls function and their description are:

max: The maximum number of URLs to spawn from a given resource. The default, if max is omitted, is 1024.
type: Optional parameter. Content type that restricts enumeration to URLs that have the specified content type. If omitted, the function enumerates all URLs.

Example

The following example enumerates HTML URLs only, up to a maximum of 1024:

Enumerate fn=enumerate-urls type=text/html
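Link enumeration with a max limit can be sketched with Python's standard html.parser module. The class and function names are hypothetical, only href attributes of anchor tags are collected, and the type restriction is left out because it would require knowing the content type of each target; this is not the robot's own parser.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags, up to max_urls."""

    def __init__(self, base_url, max_urls=1024):
        super().__init__()
        self.base_url = base_url
        self.max_urls = max_urls
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a" or len(self.urls) >= self.max_urls:
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the URL of the enclosing page.
            self.urls.append(urljoin(self.base_url, href))


def enumerate_urls(html, base_url, max_urls=1024):
    collector = LinkCollector(base_url, max_urls)
    collector.feed(html)
    return collector.urls


page = '<a href="/a.html">A</a> <a href="c.html">C</a>'
print(enumerate_urls(page, "http://www.siroe.com/docs/index.html"))
# ['http://www.siroe.com/a.html', 'http://www.siroe.com/docs/c.html']
```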

enumerate-urls-from-text

The enumerate-urls-from-text function scans text resources, looking for strings matching this regular expression: URL:.*. It spawns robots to enumerate the URLs from these strings and generate further resource descriptions.

Parameters

The parameter used with the enumerate-urls-from-text function and its description is:

max: The maximum number of URLs to spawn from a given resource. The default, if max is omitted, is 1024.

Example

Enumerate fn=enumerate-urls-from-text
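The URL:.* pattern can be tried out in Python as follows. The function name is hypothetical, and the handling of surrounding whitespace is an assumption; the sketch only shows which strings in a text resource the pattern would pick up.

```python
import re


def enumerate_urls_from_text(text, max_urls=1024):
    """Return the strings that follow "URL:" on each line, up to max_urls."""
    urls = []
    # "." does not match a newline, so each match stops at the end of its line.
    for found in re.finditer(r"URL:(.*)", text):
        url = found.group(1).strip()
        if url:
            urls.append(url)
        if len(urls) >= max_urls:
            break
    return urls


text = """See URL: http://www.siroe.com/a.html
This line has no link.
URL:http://www.siroe.com/b.html
"""
print(enumerate_urls_from_text(text))
# ['http://www.siroe.com/a.html', 'http://www.siroe.com/b.html']
```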

Chapter 50 Robot Application Functions - Generation Functions

This chapter contains the following functions:

Introduction

The following functions are used in the Generate stage of filtering. Generation functions can generate information that goes into a resource description. In general, they either extract information from the body of the resource itself or copy information from the resource’s metadata.

extract-full-text

The extract-full-text function extracts the complete text of the resource and adds it to the resource description.

Note –

The extract-full-text function should be used with caution, because it can significantly increase the size of the resource description, thus causing database bloat and overall negative impact on network bandwidth.

Parameters

The parameters used with the extract-full-text function and their description are:

truncate: The maximum number of characters to extract from the resource.
dst: Name of the schema item that will receive the full text.

Example

Generate fn=extract-full-text

extract-html-meta

The extract-html-meta function extracts any <META> or <TITLE> information from an HTML file and adds it to the resource description. A content-type may be specified to restrict the kind of URLs that are generated.

Parameters

The parameters used with the extract-html-meta function and their description are:

truncate: The maximum number of bytes to extract.
type: Optional parameter. If omitted, it will generate all URLs.

Example

Generate fn=extract-html-meta truncate=255 type=text/html

extract-html-text

The extract-html-text function extracts the first few characters of text from an HTML file, excluding the HTML tags, and adds the text to the resource description. This permits the first part of a document’s text to be included in the RD. A content-type may be specified to restrict the kind of URLs that are generated.

Parameters

The parameters used with the extract-html-text function and their description are:

truncate: The maximum number of bytes to extract.
skip-headings: Set to true to ignore any HTML headers that occur in the document.
type: Optional parameter. If omitted, it will generate all URLs.

Example

Generate fn=extract-html-text truncate=255 type=text/html skip-headings=true
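The effect of truncate and skip-headings can be approximated with Python's standard html.parser module. The class and function names are hypothetical, truncation is done in characters rather than bytes for simplicity, and skipping script and style content is an added assumption; the robot's own summarizer may behave differently.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect document text, ignoring tags and selected elements."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self, skip_headings=False):
        super().__init__()
        self.skip = {"script", "style"}
        if skip_headings:
            self.skip |= self.HEADINGS
        self.depth = 0      # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth:
            self.chunks.append(data)


def extract_html_text(html, truncate=255, skip_headings=False):
    parser = TextExtractor(skip_headings)
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())  # normalize whitespace
    return text[:truncate]


page = "<h1>Overview</h1><p>The robot <b>examines</b> URLs.</p>"
print(extract_html_text(page, truncate=255, skip_headings=True))
# The robot examines URLs.
```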

extract-html-toc

The extract-html-toc function extracts the table of contents from the HTML headers and adds it to the resource description.

Parameters

The parameters used with the extract-html-toc function and their description are:

truncate: The maximum number of bytes to extract.
level: Maximum HTML header level to extract. This parameter controls the depth of the table of contents.

The Robot HTML summarizer does not generate a description and partial text for some documents, such as text/html, application/x-maker, or x-frame documents. There are three possible causes:

For HTML or text - An unclosed JavaScript tag. This is an error that you need to fix in the HTML page itself.
The robot does not index the part of the HTML page that falls between stopindex and startindex.
For any file other than HTML or text, such as application/x-maker or x-frame - The robot uses a third-party converter to convert the file into HTML and then indexes the result. In some cases, the converter might not be able to generate the HTML, or it might generate an empty HTML body. In this case, Sun reports the problem to the third party for a fix or a patch.

Example

Generate fn=extract-html-toc truncate=255 level=3
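How the level parameter limits the depth of the table of contents can be sketched in Python. The class and function names are hypothetical, and the output format (one heading per line, indented by level) is an assumption; the format that the robot stores in the resource description is not specified here.

```python
from html.parser import HTMLParser


class TocExtractor(HTMLParser):
    """Collect the text of headings h1 through h<level>."""

    def __init__(self, level=3):
        super().__init__()
        self.wanted = {f"h{n}" for n in range(1, level + 1)}
        self.current = None   # [tag, text] while inside a wanted heading
        self.entries = []     # (heading level, heading text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.current = [tag, ""]

    def handle_data(self, data):
        if self.current:
            self.current[1] += data

    def handle_endtag(self, tag):
        if self.current and tag == self.current[0]:
            title = " ".join(self.current[1].split())
            self.entries.append((int(tag[1]), title))
            self.current = None


def extract_html_toc(html, level=3, truncate=255):
    parser = TocExtractor(level)
    parser.feed(html)
    parser.close()
    lines = ["  " * (lvl - 1) + title for lvl, title in parser.entries]
    return "\n".join(lines)[:truncate]


page = "<h1>Overview</h1><p>text</p><h2>Filter Syntax</h2><h4>Too deep</h4>"
print(extract_html_toc(page, level=3))
```

With level=3, the h4 heading is ignored, so only "Overview" and the indented "Filter Syntax" are printed.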

extract-source

The extract-source function extracts the specified values from the given sources and adds them to the resource description.

Parameters

The parameter used with the extract-source function and its description is:

src: List of source names; you can use the -> operator to define a new name for the RD attribute, for example, type->content-type would take the value of the source named type and save it in the RD under the attribute named content-type.

Example

Generate fn=extract-source src="md5,depth,rd-expires,rd-last-modified"
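The src list and its -> operator can be modeled in Python as follows. The function name and the dictionary of sources are hypothetical, and silently skipping sources that do not exist is an assumption.

```python
def extract_source(sources, src):
    """Copy the listed sources into a new resource description (RD)."""
    rd = {}
    for item in src.split(","):
        name, sep, new_name = item.strip().partition("->")
        if name in sources:
            # "type->content-type" saves the value of type as content-type.
            rd[new_name if sep else name] = sources[name]
    return rd


sources = {"type": "text/html", "md5": "5d41402abc4b2a76b9719d911017c592", "depth": 2}
print(extract_source(sources, "type->content-type,md5,rd-expires"))
# {'content-type': 'text/html', 'md5': '5d41402abc4b2a76b9719d911017c592'}
```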

harvest-summarizer

The harvest-summarizer function runs a Harvest summarizer on the resource and adds the result to the resource description.

To run Harvest summarizers, you must have $HARVEST_HOME/lib/gatherer in your path before you run the robot.

Parameters

The parameter used with the harvest-summarizer function and its description is:

summarizer: Name of the summarizer program.

Example

Generate fn=harvest-summarizer summarizer=HTML.sum

Chapter 51 Robot Application Functions - Shutdown Functions

This chapter contains the following sections:

Introduction

The function in this chapter can be used during the shutdown phase by both enumeration and generation functions.

filterrules-shutdown

After the rules are run, the filterrules-shutdown function performs cleanup and shutdown tasks.

Parameters

none

Example

Shutdown fn=filterrules-shutdown

User Modifiable Parameters in `robot.conf` File