Sun Java System Portal Server 7.1 Administration Guide

Chapter 12 Managing the Search Server Robot

This chapter describes the Sun Java System Portal Server Search Server robot and its corresponding configuration files. The chapter contains the following topics:

Understanding the Search Server Robot

A Search Server robot is an agent that identifies and reports on resources in its domains. It does so by using two kinds of filters: an enumerator filter and a generator filter.

The enumerator filter locates resources by using network protocols. The filter tests each resource, and if the resource meets the criteria, the resource is enumerated. For example, the enumerator filter can extract hypertext links from an HTML file and use the links to find additional resources.

The generator filter tests each resource to determine whether a resource description (RD) should be created. If the resource passes the test, the generator creates an RD that is stored in the Search Server database.

Configuration and maintenance tasks that you might need to perform to administer the robot are described in the following sections:

How the Robot Works

Figure 12–1 shows how the robot examines URLs and their associated network resources. Both the enumerator and the generator test each resource. If the resource passes the enumeration test, the robot checks it for additional URLs. If the resource passes the generator test, the robot generates a resource description that is stored in the Search Server database.

Figure 12–1 How the Robot Works


Robot Configuration Files

Robot configuration files define the behavior of the robots. These files reside in the directory /var/opt/SUNWportal/searchservers/searchserverid/config. The following list provides a description for each of the robot configuration files.

classification.conf

Contains rules used to classify RDs generated by the robot.

filter.conf

Defines the enumeration and generation filters used by the robot.

filterrules.conf

Contains the robot's site definitions, starting point URLs, rules for filtering based on MIME type, and URL patterns.

robot.conf

Defines most operating properties for the robot.

Because you can set most properties by using the Search Server Administration interface, you typically do not need to edit the robot.conf file. However, advanced users might manually edit this file to set properties that cannot be set through the interface.

Defining Sites

The robot finds resources and determines whether to add descriptions of those resources to the database. The determination of which servers to visit and what parts of those servers to index is called a site definition.

Defining the sites for the robot is one of the most important jobs of the server administrator. You need to be sure you send the robot to all the servers it needs to index, but you also need to exclude extraneous sites that can fill the database and make finding the correct information more difficult.

Controlling Robot Crawling

The robot extracts and follows links to the various sites selected for indexing. As the system administrator, you control these processes through a number of settings.

See the Sun Java System Portal Server 7.1 Technical Reference for descriptions of the robot crawling attributes.

Filtering Robot Data

Filters identify a resource so that the resource can be excluded or included, by comparing an attribute of the resource against a filter definition. The robot provides a number of predefined filters, some of which are enabled by default.

You can create new filter definitions, modify a filter definition, or enable or disable filters. See Resource Filtering Process for detailed information.

Using the Robot Utilities

The robot includes two debugging utilities: the simulator and the site probe.

Scheduling the Robot

To keep the search data timely, the robot should search and index sites regularly. Because robot crawling and indexing can consume processing resources and network bandwidth, you should schedule the robot to run during non-peak days and times. The management console allows administrators to set up a schedule to run the robot.

Managing the Robot

This section describes the following tasks to manage the robot:

To Start the Robot

  1. Log in to the Portal Server management console.

  2. Choose Search Servers from the menu bar. Select a search server from the list of servers.

  3. Click Robot from the menu bar, then Status and Control from the menu.

  4. Click Start.

The equivalent psadmin command is:

psadmin start-robot


Note –

For the command psadmin start-robot, the search robot does not start if no defined sites are available for the robot to crawl. In that case, the command indicates that no sites are available by displaying Starting Points: 0 defined.


To Clear the Robot Database

  1. Log in to the Portal Server management console.

  2. Select Search Servers from the menu bar, then select a search server.

  3. Select Robot from the menu bar then Status and Control.

  4. Click Clear Robot Database.

To Create a Site Definition

The robot finds resources and determines whether to add descriptions of those resources to the database. The determination of which servers to visit and what parts of those servers to index is called a site definition.

  1. Log in to the Portal Server management console.

  2. Select Search Servers from the menu bar, then select a search server.

  3. Select Robot from the menu bar, then Sites.

  4. Click New under Manage Sites and specify the configuration attributes for the site.

    For more information about the attributes, see Sites in Sun Java System Portal Server 7.1 Technical Reference.

  5. Click OK.

To Edit a Site Definition

  1. Log in to the Portal Server management console.

  2. Select Search Servers from the menu bar, then select a search server.

  3. Click Robot from the menu bar, then Sites.

  4. Click the name of the site you want to modify.

    The Edit Site dialog appears.

  5. Modify the configuration attributes as necessary.

    For more information about the attributes, see Sites in Sun Java System Portal Server 7.1 Technical Reference.

  6. Click OK to record the changes.

To Control Robot Crawling and Indexing

The robot crawls to the various sites selected for indexing. You control how the robot crawls sites by defining crawling and indexing operational properties.

  1. Log in to the Portal Server management console.

  2. Select Search Servers from the menu bar, then select a search server.

  3. Click Robot from the menu bar, then Properties.

  4. Specify the robot crawling and indexing attributes as necessary.

    For more information about the attributes, see Site Probe in Sun Java System Portal Server 7.1 Technical Reference.

  5. Click Save.

To Run the Simulator

The simulator performs a partial simulation of robot filtering on one or more listed sites.

  1. Log in to the Portal Server management console.

  2. Select Search Servers from the menu bar, then select a search server.

  3. Click Robot from the menu bar, then Utilities.

  4. Type the URL of a new site to simulate in the Add a new URL text box and click Add.

    You can also run the simulator on existing sites listed under Existing Robot sites.

  5. Click Run Simulator.

To Run the Site Probe Utility

The site probe utility checks for such information as DNS aliases, server redirects, and virtual servers.

  1. Log in to the Portal Server management console.

  2. Select Search Servers from the menu bar, then select a search server.

  3. Click Robot from the menu bar, then Utilities.

  4. Type the URL of the site to probe.

  5. (Optional) If you want the probe to return DNS information, choose Show Advanced DNS information under Site Probe.

  6. Click Run SiteProbe.

Resource Filtering Process

The robot uses filters to determine which resources to process and how to process them. When the robot discovers references to resources, as well as the resources themselves, it applies filters to each resource. The filters enumerate the resource and determine whether to generate a resource description to store in the Search Server database.

The robot examines one or more starting point URLs, applies the filters, and then applies the filters to the URLs spawned by enumerating those URLs, and so on. The starting point URLs are defined in the filterrules.conf file.

Each enumeration and generation filter performs any required initialization operations and applies comparison tests to the current resource. The goal of each test is to allow or deny the resource. Each filter also has a shutdown phase during which it performs clean-up operations.

If a resource is allowed, then it continues its passage through the filter. The robot eventually enumerates it, attempting to discover further resources. The generator might also create a resource description for it.

If a resource is denied, the resource is rejected. No further action is taken by the filter for resources that are denied.

These operations are not necessarily linked. Some resources result in enumeration; others result in RD generation. Many resources result in both enumeration and RD generation. For example, if the resource is an FTP directory, the resource typically does not have an RD generated for it. However, the robot might enumerate the individual files in the FTP directory. An HTML document that contains links to other documents can result in an RD being generated, and can lead to enumeration of any linked documents as well.

The following sections describe the filter process:

Stages in the Filter Process

Both enumeration and generation filters have five phases in the filtering process: Setup, MetaData, Data, Enumerate or Generate, and Shutdown. During these phases, the robot uses metadata about the resource, such as the common types listed in Table 12–1.

Table 12–1 Common Metadata Types

Metadata Type    Description                                    Example
Complete URL     The location of a resource                     http://home.siroe.com/
Protocol         The access portion of the URL                  http, ftp, file
Host             The address portion of the URL                 www.siroe.com
IP address       Numeric version of the host                    198.95.249.6
Path             The path portion of the URL                    /index.html
Depth            Number of links from the starting point URL

Filter Syntax

The filter.conf file contains definitions for enumeration and generation filters. This file can contain multiple filters for both enumeration and generation. The filters used by the robot are specified by the enumeration-filter and generation-filter properties in the file robot.conf.
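For example, a robot.conf file might name the filters to use. The following is a sketch only; the filter names are illustrative and must match filter definitions in filter.conf:

```
# Fragment of robot.conf (illustrative values)
enumeration-filter="enumeration1"
generation-filter="generation1"
```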

Filter definitions have a well-defined structure: a header, a body, and an end. The header identifies the beginning of the filter and declares its name, for example:


<Filter name="myFilter">

The body consists of a series of filter directives that define the filter’s behavior during setup, testing, enumeration or generation, and shutdown. Each directive specifies a function and, if applicable, properties for the function.

The end is marked by </Filter>.

Example 12–1 shows a filter named enumeration1.


Example 12–1 Enumeration File Syntax


<Filter name="enumeration1">
   Setup fn=filterrules-setup config=./config/filterrules.conf
#  Process the rules
   MetaData fn=filterrules-process
#  Filter by type and process rules again
   Data fn=assign-source dst=type src=content-type
   Data fn=filterrules-process
#  Perform the enumeration on HTML only
   Enumerate enable=true fn=enumerate-urls max=1024 type=text/html
#  Cleanup
   Shutdown fn=filterrules-shutdown
</Filter>

Filter Directives

Filter directives use robot application functions (RAFs) to perform operations. Their use and flow of execution are similar to those of NSAPI directives and server application functions (SAFs) in the Sun Java System Web Server obj.conf file. Like NSAPI and SAF, the robot stores and transfers data by using property blocks, also called pblocks.

Six robot directives, or RAF classes, correspond to the filtering phases and operations listed in Resource Filtering Process: Setup, MetaData, Data, Enumerate, Generate, and Shutdown.

Each directive has its own robot application functions. For example, you use filtering functions with the MetaData and Data directives, enumeration functions with the Enumerate directive, generation functions with the Generate directive, and so on.

The built-in robot application functions, as well as instructions for writing your own robot application functions, are explained in the Sun Java System Portal Server 7.1 Developer's Guide.

Writing or Modifying a Filter

In most cases, you can use the management console to create your site-definition-based filters. You can then modify the filter.conf and filterrules.conf files to make any further changes. These files reside in the directory /var/opt/SUNWportal/searchservers/searchserverid/config.

To create a more complex set of properties, edit the configuration files used by the robot.

When you write or modify a filter, note the order in which the directives are listed, because the robot executes the directives in that order.

You can also write your own robot application functions.

For more information, see the Sun Java System Portal Server 7.1 Developer's Guide.
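To illustrate, the functions described later in this chapter can be combined into a generation filter. The following is a sketch modeled on Example 12–1; the filter name, file paths, and property values are illustrative:

```
<Filter name="myGeneration">
   Setup fn=filterrules-setup config=./config/filterrules.conf
   Setup fn=setup-type-by-extension file=./config/mime.types
#  Apply the site definitions and filter rules
   MetaData fn=filterrules-process
#  Determine the type, then process the rules again
   Data fn=assign-source dst=type src=content-type
   Data fn=filterrules-process
#  Deny duplicate resources
   Data fn=generate-md5
   Data fn=filter-by-md5
#  Build the resource description
   Generate fn=extract-html-meta truncate=255 type=text/html
   Generate fn=generate-rd-expires
#  Cleanup
   Shutdown fn=filterrules-shutdown
</Filter>
```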

Managing Filters

The following tasks to manage robot filters are described in this section:

To Create a Filter

  1. Log in to the Portal Server management console.

  2. Select Search Servers from the menu bar, then select a search server.

  3. Select Robot from the menu bar, then Filters.

  4. Click New.

    The New Robot Filter wizard appears.

  5. Follow the instructions to create the specified filter.

    1. Type a filter name and filter description in the text box, and click Next.

    2. Specify filter definition and behavior, and click Finish.

      For more information about filter attributes, see Filters in Sun Java System Portal Server 7.1 Technical Reference.

    3. Click Close to load the new filter.

To Delete a Filter

  1. Log in to the Portal Server management console.

  2. Select Search Servers from the menu bar, then select a search server.

  3. Select Robot from the menu bar, then Filters.

  4. Select a filter.

  5. Click Delete.

  6. Click OK in the confirmation dialog box that appears.

To Edit a Filter

  1. Log in to the Portal Server management console.

  2. Select Search Servers from the menu bar, then select a search server.

  3. Select Robot from the menu bar, then Filters.

  4. Select a filter, and click Edit.

    The Edit a Filter page appears.

  5. Modify the configuration attributes as necessary.

    For more information about filter attributes, see Filters in Sun Java System Portal Server 7.1 Technical Reference.

  6. Click OK.

To Enable or Disable a Filter

  1. Log in to the Portal Server management console.

  2. Select Search Servers from the menu bar, then select a search server.

  3. Select Robot from the menu bar, then Filters.

  4. Select a filter.

    • To enable a filter, click Enable.

    • To disable a filter, click Disable.

Managing Classification Rules

Documents can be assigned to multiple categories, up to a maximum number defined in the settings. Classification rules are simpler than robot filter rules because they do not involve any flow-control decisions. In classification rules you determine what criteria to use to assign specific categories to a resource as part of its Resource Description. A classification rule is a simple conditional statement, taking the form if condition is true, assign the resource to <a category>.

To Create a Classification Rule

  1. Log in to the Portal Server management console.

  2. Select Search Servers from the menu bar, then select a search server.

  3. Select Robot from the menu bar, then Classification Rules.

  4. Select Classification Rules and click New.

    The New Classification Rule dialog box appears.

  5. Specify the configuration attributes as necessary.

    For more information about the attributes, see Manage Classification Rules in Sun Java System Portal Server 7.1 Technical Reference.

  6. Click OK.

To Edit a Classification Rule

  1. Log in to the Portal Server management console.

  2. Select Search Servers from the menu bar, then select a search server.

  3. Select Robot, then Classification Rules from the menu bar.

  4. Select a classification rule, and click Edit.

  5. Modify the attributes as necessary.

    For more information about the attributes, see Manage Classification Rules in Sun Java System Portal Server 7.1 Technical Reference.

  6. Click OK.

Sources and Destinations

Most robot application functions (RAFs) require sources of information and generate data that go to destinations. The sources are defined within the robot and are not necessarily related to the fields in the resource description that the robot ultimately generates. Destinations, on the other hand, are generally the names of fields in the resource description, as defined by the resource description server’s schema.

The following sections describe the different stages of the filtering process, and the sources available at those stages:

Sources Available at the Setup Stage

At the Setup stage, the filter is set up but cannot yet obtain information about the resource’s URL or content.

Sources Available at the MetaData Filtering Stage

At the MetaData stage, the robot encounters a URL for a resource but it has not downloaded the resource’s content. Thus information is available about the URL as well as data that is derived from other sources such as the filter.conf file. At this stage, however, information about the content of the resource is not available.

Table 12–2 Sources Available to the RAFs at the MetaData Phase

Source               Description                                     Example
csid                 Catalog server ID                               x-catalog//budgie.siroe.com:8086/alexandria
depth                Number of links traversed from starting point   10
enumeration filter   Name of enumeration filter                      enumeration1
generation filter    Name of generation filter                       generation1
host                 Host portion of URL                             home.siroe.com
IP                   Numeric version of host                         198.95.249.6
protocol             Access portion of the URL                       http, https, ftp, file
path                 Path portion of the URL                         /, /index.html, /documents/listing.html
URL                  Complete URL                                    http://developer.siroe.com/docs/manuals/

Sources Available at the Data Stage

At the Data stage, the robot has downloaded the content of the resource at the URL and can access data about the content, such as the description and the author.

If the resource is an HTML file, the Robot parses the <META> tags in the HTML headers. Consequently, any data contained in <META> tags is available at the Data stage.

During the Data phase, the following sources are available to RAFs, in addition to those available during the MetaData phase.

Table 12–3 Sources Available to the RAFs at the Data Phase

Source                Description                                                        Example
content-charset       Character set used by the resource
content-encoding      Any form of encoding
content-length        Size of the resource in bytes
content-type          MIME type of the resource                                          text/html, image/jpeg
expires               Date the resource expires
last-modified         Date the resource was last modified
data in <META> tags   Any data that is provided in <META> tags in the header of HTML     Author, Description, Keywords
                      resources

All of these sources except for the data in <META> tags are derived from the HTTP response header returned when retrieving the resource.

Sources Available at the Enumeration, Generation, and Shutdown Stages

At the Enumeration and Generation stages, the same data sources are available as in the Data stage. See Table 12–3 for information.

At the Shutdown stage, the filter completes its filtering and shuts down. Although functions written for this stage can use the same data sources as those available at the Data stage, the shutdown functions typically restrict their operations to robot shutdown and clean-up activities.

Enable Property

Each function can have an enable property. The values can be true, false, on, or off. The management console uses these properties to turn certain directives on or off.

The following example enables enumeration for text/html and disables enumeration for text/plain:


#  Perform the enumeration on HTML only
Enumerate enable=true fn=enumerate-urls max=1024 type=text/html
Enumerate enable=false fn=enumerate-urls-from-text max=1024 type=text/plain

Adding an enable=false property or an enable=off property has the same effect as commenting out the line. These properties are used because the management console does not write comments.

Setup Functions

This section describes the functions that are used during the setup phase by both enumeration and generation filters. The functions are described in the following sections:

filterrules-setup

When you use the filterrules-setup function, use the logtype log file property to specify the type of log file that the filter generates. The value can be verbose, normal, or terse.

Property

config

Path name to the file containing the filter rules to be used by this filter.

Example

Setup fn=filterrules-setup

config="/var/opt/SUNWportal/searchservers/search1/config/filterrules.conf"

setup-regex-cache

The setup-regex-cache function initializes the cache size for the filter-by-regex and generate-by-regex functions. Use this function to specify a number other than the default of 32.

Property

cache-size

Maximum number of compiled regular expressions to be kept in the regex cache.

Example

Setup fn=setup-regex-cache cache-size=28

setup-type-by-extension

The setup-type-by-extension function configures the filter to recognize file name extensions. It must be called before the assign-type-by-extension function can be used. The file specified as a property must contain mappings between standard MIME content types and file extension strings.

Property

file

Name of the MIME types configuration file

Example

Setup fn=setup-type-by-extension

file="/var/opt/SUNWportal/searchservers/search1/config/mime.types"

Filtering Functions

Filtering functions operate at the Metadata and Data stages to allow or deny resources based on specific criteria specified by the function and its properties. These functions can be used in both Enumeration and Generation filters in the file filter.conf.

Each filter-by function performs a comparison and either allows or denies the resource. Allowing the resource means that processing continues to the next filtering step. Denying the resource means that processing should stop, because the resource does not meet the criteria for further enumeration or generation.
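For example, two chained tests might first deny resources more than eight links from the starting point and then allow only text resources. This is a sketch; the values are illustrative:

```
#  Deny resources deeper than 8 links from the starting point
MetaData fn=filter-by-max src=depth value=8
#  Allow only resources whose type begins with "text"
MetaData fn=filter-by-prefix src=type allow=text
```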

filter-by-exact

The filter-by-exact function allows or denies the resource if the allow/deny string matches the source of information exactly. The keyword all matches any string.

Properties

src

Source of information

allow/deny

Contains a string

Example

The following example filters out all resources whose content-type is text/plain. It allows all other resources to proceed:

Data fn=filter-by-exact src=type deny=text/plain

filter-by-max

The filter-by-max function allows the resource if the specified information source is less than or equal to the given value. It denies the resource if the information source is greater than the specified value.

This function can be called no more than once per filter.

Properties

The following properties are used with the filter-by-max function.

src

Source of information: hosts, objects, or depth

value

Specifies a value for comparison

Example

This example allows resources whose content-length is less than or equal to 1024:

MetaData fn=filter-by-max src=content-length value=1024

filter-by-md5

The filter-by-md5 function allows only the first resource with a given MD5 checksum value. If the current resource’s MD5 has been seen in an earlier resource by this robot, the current resource is denied. The function prevents duplication of identical resources or single resources with multiple URLs.

You can only call this function at the Data stage or later. It can be called no more than once per filter. The filter must invoke the generate-md5 function to generate an MD5 checksum before invoking filter-by-md5.

Properties

None

Example

The following example shows the typical method of handling MD5 checksums by first generating the checksum and then filtering based on it:

Data fn=generate-md5

Data fn=filter-by-md5

filter-by-prefix

The filter-by-prefix function allows or denies the resource if the given information source begins with the specified prefix string. The resource doesn’t have to match completely. The keyword all matches any string.

Properties

src

Source of information

allow/deny

Contains a string for prefix comparison

Example

The following example allows resources whose content-type is any kind of text, including text/html and text/plain:

MetaData fn=filter-by-prefix src=type allow=text

filter-by-regex

The filter-by-regex function supports regular-expression pattern matching. It allows resources that match the given regular expression. The supported regular expression syntax is defined by the POSIX.1 specification. The regular expression \\* matches anything.

Properties

src

Source of information

allow/deny

Contains a regular expression string

Example

The following example denies all resources from sites in the .gov domain:

MetaData fn=filter-by-regex src=host deny=\\*.gov

filterrules-process

The filterrules-process function processes the site definition and filter rules in the filterrules.conf file.

Properties

None

Example

MetaData fn=filterrules-process

Filtering Support Functions

Support functions are used during filtering to manipulate or generate information on the resource. The robot can then process the resource by calling filtering functions. These functions can be used in enumeration and generation filters in the file filter.conf.

assign-source

The assign-source function assigns a new value to a given information source. This function permits editing during the filtering process. The function can assign an explicit new value, or it can copy a value from another information source.

Properties

dst

Name of the source whose value is to be changed

value

Specifies an explicit value

src

Information source to copy to dst

You must specify either a value property or a src property, but not both.

Example

Data fn=assign-source dst=type src=content-type
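The function can also assign an explicit value instead of copying one. In this sketch, the destination name and value are hypothetical:

```
Data fn=assign-source dst=classification value=Internal
```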

assign-type-by-extension

The assign-type-by-extension function uses the resource’s file name to determine its type and assigns this type to the resource for further processing.

The setup-type-by-extension function must be called during setup before assign-type-by-extension can be used.

Property

src

Source of the file name to compare. If you do not specify a source, the default is the resource's path.

Example

MetaData fn=assign-type-by-extension

clear-source

The clear-source function deletes the specified data source. You typically do not need to use this function. You can create or replace a source by using the assign-source function.

Property

src

Name of the source to delete

Example

The following example deletes the path source:

MetaData fn=clear-source src=path

convert-to-html

The convert-to-html function converts the current resource into an HTML file for further processing if its type matches a specified MIME type. The conversion filter automatically detects the type of the file it is converting.

Property

type

MIME type from which to convert

Example

The following sequence of function calls causes the filter to convert all Adobe Acrobat PDF files, Microsoft RTF files, and FrameMaker MIF files to HTML, as well as any files whose type the delivering server did not specify.

Data fn=convert-to-html type=application/pdf

Data fn=convert-to-html type=application/rtf

Data fn=convert-to-html type=application/x-mif

Data fn=convert-to-html type=unknown

copy-attribute

The copy-attribute function copies the value from one field in the resource description into another.

Properties

src

Field in the resource description from which to copy

dst

Item in the resource description into which to copy the source

truncate

Maximum length of the source to copy

clean

Boolean property indicating whether to clean up truncated text so that partial words are not left. This property is false by default.

Example

Generate fn=copy-attribute \
src=partial-text dst=description truncate=200 clean=true

generate-by-exact

The generate-by-exact function generates a source with a specified value, but only if an existing source exactly matches another value.

Properties

dst

Name of the source to generate

value

Value to assign dst

src

Source against which to match

match

Value to compare to src

Example

The following example sets the classification to Siroe if the host is www.siroe.com:

Generate fn="generate-by-exact" match="www.siroe.com:80" src="host" value="Siroe" dst="classification"

generate-by-prefix

The generate-by-prefix function generates a source with a specified value if the prefix of an existing source matches another value.

Properties

dst

Name of the source to generate

value

Value to assign dst

src

Source against which to match

match

Value to compare to src

Example

The following example sets the classification to World Wide Web if the protocol prefix is http:

Generate fn="generate-by-prefix" match="http" src="protocol" value="World Wide Web" dst="classification"

generate-by-regex

The generate-by-regex function generates a source with a specified value if an existing source matches a regular expression.

Properties

dst

Name of the source to generate

value

Value to assign dst

src

Source against which to match

match

Regular expression string to compare to src

Example

The following example sets the classification to Siroe if the host name matches the regular expression \\*.siroe.com. For example, resources at both developer.siroe.com and home.siroe.com are classified as Siroe:

Generate fn="generate-by-regex" match="\\*.siroe.com" src="host" value="Siroe" dst="classification"

generate-md5

The generate-md5 function generates an MD5 checksum and adds it to the resource. You can then use the filter-by-md5 function to deny resources with duplicate MD5 checksums.

Properties

None

Example

Data fn=generate-md5

generate-rd-expires

The generate-rd-expires function generates an expiration date and adds it to the specified source. The function uses metadata such as the HTTP header and HTML <META> tags to obtain any expiration data from the resource. If none exists, the function generates an expiration date three months from the current date.

Properties

dst

Name of the source. If you omit it, the source defaults to rd-expires.

Example

Generate fn=generate-rd-expires

generate-rd-last-modified

The generate-rd-last-modified function adds the current time to the specified source.

Properties

dst

Name of the source. If you omit it, the source defaults to rd-last-modified.

Example

Generate fn=generate-rd-last-modified

rename-attribute

The rename-attribute function changes the name of a field in the resource description. The function is most useful in cases where, for example, the extract-html-meta function copies information from a <META> tag into a field and you want to change the name of the field.

Property

src

String containing a mapping from one name to another

Example

The following example renames an attribute from author to author-name:

Generate fn=rename-attribute src="author->author-name"

Enumeration Functions

The following functions operate at the Enumerate stage. These functions control whether and how a robot gathers links from a given resource to use as starting points for further resource discovery.

enumerate-urls

The enumerate-urls function scans the resource and enumerates all URLs found in hypertext links. The results are used to spawn further resource discovery. You can specify a content-type to restrict the kind of URLs enumerated.

Properties

max

The maximum number of URLs to spawn from a given resource. The default is 1024.

type

Content-type that restricts enumeration to those URLs that have the specified content-type. type is an optional property. If omitted, the function enumerates all URLs.

Example

The following example enumerates HTML URLs only, up to a maximum of 1024:

Enumerate fn=enumerate-urls type=text/html

enumerate-urls-from-text

The enumerate-urls-from-text function scans a text resource, looking for strings matching the regular expression URL:.*. The function enumerates the URLs found in these strings and spawns further resource discovery and generation of resource descriptions from them.

Property

max

The maximum number of URLs to spawn from a given resource. The default, if max is omitted, is 1024

Example

Enumerate fn=enumerate-urls-from-text
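To cap the number of URLs spawned from a single resource, set max explicitly. The following rule, with an illustrative limit of 100, otherwise behaves like the example above:

Enumerate fn=enumerate-urls-from-text max=100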

Generation Functions

Generation functions are used in the Generate stage of filtering. Generation functions can create information that goes into a resource description. In general, they either extract information from the body of the resource itself or copy information from the resource’s metadata.

extract-full-text

The extract-full-text function extracts the complete text of the resource and adds it to the resource description.


Note –

Use the extract-full-text function with caution. It can significantly increase the size of the resource description, which can bloat the database and negatively affect network bandwidth.


Properties

truncate

The maximum number of characters to extract from the resource

dst

Name of the schema item that receives the full text

Example

Generate fn=extract-full-text

extract-html-meta

The extract-html-meta function extracts any <META> or <TITLE> information from an HTML file and adds it to the resource description. A content-type may be specified to restrict the kind of URLs that are generated.

Properties

truncate

The maximum number of bytes to extract

type

Optional property. If omitted, all URLs are generated

Example

Generate fn=extract-html-meta truncate=255 type=text/html

extract-html-text

The extract-html-text function extracts the first few characters of text from an HTML file, excluding the HTML tags, and adds the text to the resource description. This function permits the first part of a document’s text to be included in the RD. A content-type may be specified to restrict the kind of URLs that are generated.

Properties

truncate

The maximum number of bytes to extract

skip-headings

Set to true to ignore any HTML headers that occur in the document

type

Optional property. If omitted, all URLs are generated

Example

Generate fn=extract-html-text truncate=255 type=text/html skip-headings=true

extract-html-toc

The extract-html-toc function extracts a table of contents from the HTML headers and adds it to the resource description.

Properties

truncate

The maximum number of bytes to extract

level

Maximum HTML header level to extract. This property controls the depth of the table of contents

Example

Generate fn=extract-html-toc truncate=255 level=3

extract-source

The extract-source function extracts the specified values from the given sources and adds them to the resource description.

Property

src

Lists source names. You can use the -> operator to define a new name for the RD attribute. For example, type->content-type would take the value of the source named type and save it in the RD under the attribute named content-type.

Example

Generate fn=extract-source src="md5,depth,rd-expires,rd-last-modified"
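To rename an attribute while extracting it, include the -> operator in the src list. The following illustrative rule saves the value of the source named type under the RD attribute named content-type:

Generate fn=extract-source src="md5,type->content-type"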

harvest-summarizer

The harvest-summarizer function runs a Harvest summarizer on the resource and adds the result to the resource description.

To run Harvest summarizers, you must have $HARVEST_HOME/lib/gatherer in your path before you run the robot.

Property

summarizer

Name of the summarizer program

Example

Generate fn=harvest-summarizer summarizer=HTML.sum

Shutdown Function

The filterrules-shutdown function can be used during the shutdown phase by both enumeration and generation functions.

filterrules-shutdown

After the rules are run, the filterrules-shutdown function performs cleanup and shutdown tasks.

Properties

None

Example

Shutdown fn=filterrules-shutdown

Modifiable Properties

The robot.conf file defines many options for the robot, including pointing the robot to the appropriate filters in filter.conf. For backward compatibility with older versions, robot.conf can also contain the starting point URLs.

Because you can set most properties by using the management console, you typically do not need to edit the robot.conf file. However, advanced users might manually edit this file to set properties that cannot be set through the management console. See Sample robot.conf File for an example of this file.

Table 12–4 lists the properties you can change in the robot.conf file.

Table 12–4 User-Modifiable Properties

Property 

Description 

Example 

auto-proxy

Specifies the proxy setting for the robot. The value can be a proxy server or a JavaScript file for automatically configuring the proxy. 

auto-proxy="http://proxy_server/proxy.pac"

bindir

Specifies whether the robot adds a bin directory to the PATH environment variable. This is an extra PATH that lets users run an external program from the robot, such as a program specified by the cmd-hook property.

bindir=path

cmd-hook

Specifies an external completion script to run after the robot completes one run. This must be a full path to the command name. The robot executes this script from the /var/opt/SUNWportal/ directory.

No default is set. 

At least one RD must be registered for the command to run. 

 

cmd-hook=”command-string”

command-port

Specifies the port number that the robot listens to in order to accept commands from other programs, such as the Administration Interface or robot control panels. 

For security reasons, the robot can accept commands only from the local host unless remote-access is set to yes.

command-port=port_number

connect-timeout

Specifies the maximum time allowed for a network to respond to a connection request. 

The default is 120 seconds.

connect-timeout=seconds

convert-timeout

Specifies the maximum time allowed for document conversion. 

The default is 600 seconds.

convert-timeout=seconds

depth

Specifies the number of links from the starting point URLs that the robot examines. This property sets the default value for any starting point URLs that do not specify a depth. 

The default is 10.

A value of negative one (depth=-1) indicates that the link depth is infinite.

depth=integer

email

Specifies the email address of the person who runs the robot. 

The email address is sent with the user-agent in the HTTP request header so that Web managers can contact the people who run robots at their sites. 

The default is user@domain.

email=user@hostname

enable-ip

Generates an IP address for the URL for each RD that is created. 

The default is true.

enable-ip=[true | yes | false | no]

enable-rdm-probe

Determines whether the server supports RDM. The robot uses this property to decide whether to query each server it encounters. If the server supports RDM, the robot does not attempt to enumerate the server’s resources, because that server is able to act as its own resource description server. 

The default is false.

enable-rdm-probe=[true | false | yes | no]

enable-robots-txt

Determines whether the robot checks the robots.txt file at each site it visits, if the file is available.

The default is yes.

enable-robots-txt=[true | false | yes | no]

engine-concurrent

Specifies the number of pre-created threads for the robot to use. 

The default is 10.

You cannot use the management console to set this property interactively. 

engine-concurrent=[1..100]

enumeration-filter

Specifies the enumeration filter that the robot uses to determine whether a resource should be enumerated. The value must be the name of a filter defined in the file filter.conf.

The default is enumeration-default.

You cannot use the management console to set this property interactively. 

enumeration-filter=enumfiltername

generation-filter

Specifies the generation filter that the robot uses to determine whether a resource description should be generated for a resource. The value must be the name of a filter defined in the file filter.conf.

The default is generation-default.

You cannot use the management console to set this property interactively. 

generation-filter=genfiltername

index-after-ngenerated

Specifies the number of minutes that the robot should collect RDs before batching them for the Search Server. 

The default value is 30 minutes. 


index-after-ngenerated=30

loglevel

Specifies the logging level. The loglevel values are as follows:

  • Level 0: log nothing but serious errors

  • Level 1: also log RD generation (default)

  • Level 2: also log retrieval activity

  • Level 3: also log filtering activity

  • Level 4: also log spawning activity

  • Level 5: also log retrieval progress

    The default value is 1.


loglevel=[0...100]

max-connections

Specifies the maximum number of concurrent retrievals that a robot can make. 

The default is 8.


max-connections=[1..100]

max-filesize-kb

Specifies the maximum file size in kilobytes for files retrieved by the robot. 


max-filesize-kb=1024

max-memory-per-url / max-memory

Specifies the maximum memory in bytes used by each URL. If the URL needs more memory, the RD is saved to disk. 

The default is 64k.

You cannot use the management console to set this property interactively. 


max-memory-per-url=n_bytes

max-working

Specifies the size of the robot working set, which is the maximum number of URLs the robot can work on at one time. 

You cannot use the management console to set this property interactively. 


max-working=1024

onCompletion

Determines what the robot does after it has completed a run. The robot can either go into idle mode, loop back and start again, or quit. 

The default is idle.

This property works with the cmd-hook property. When the robot is done, it performs the action of onCompletion and then runs the cmd-hook program.


OnCompletion=[idle | loop | quit]

password

Specifies the password used for httpd authentication and ftp connection.


password=string

referer

Specifies the value sent in the HTTP Referer request header, identifying the robot as the referrer when it accesses Web pages. 


referer=string

register-user

Specifies the user name used to register RDs to the Search Server database. 

This property cannot be set interactively through the Search Server Administration Interface. 


register-user=string

register-password

Specifies the password used to register RDs to the Search Server database. 

This property cannot be set interactively through the management console. 


register-password=string

remote-access

Determines whether the robot can accept commands from remote hosts. 

The default is false.


remote-access=[true | false | yes | no]

robot-state-dir

Specifies the directory where the robot saves its state. In this working directory, the robot can record the number of collected RDs and so on. 


robot-state-dir="/var/opt/SUNWportal/searchservers/<searchserverid>/config/robot"

server-delay

Specifies the time period between two visits to the same web site, thus preventing the robot from accessing the same site too frequently. The default is 0 seconds. 


server-delay=delay_in_seconds

site-max-connections

Indicates the maximum number of concurrent connections that a robot can make to any one site. 

The default is 2.


site-max-connections=[1..100]

smart-host-heuristics

Enables the robot to recognize sites that rotate their DNS canonical host names. For example, www123.siroe.com is changed to www.siroe.com.

The default is false.


smart-host-heuristics=[true | false]

tmpdir

Specifies a place for the robot to create temporary files. 

Use this value to set the environment variable TMPDIR.


tmpdir=path

user-agent

Specifies the user-agent value sent with the email address in the HTTP request header to the server.


user-agent=SunONERobot/6.2

username

Specifies the user name of the user who runs the robot. The user name is used for httpd authentication and ftp connections.

The default is anonymous.


username=string
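Several of these properties work together. For example, the following fragment (the directory and script name are illustrative, not defaults) makes the robot quit after a single run and then invoke a post-processing script located through the extra bin directory:

onCompletion=quit
bindir=/opt/myrobot/bin
cmd-hook="/opt/myrobot/bin/postprocess.sh"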

Sample robot.conf File

This section describes a sample robot.conf file. Any commented properties in the sample use the default values shown. The first property, csid, indicates the Search Server instance that uses this file. Do not change the value of this property. See Modifiable Properties for definitions of the properties in this file.


Note –

This sample file includes some properties used by the Search Server that you should not modify. The csid property is one example.



<Process csid="x-catalog://budgie.siroe.com:80/jack" \
   auto-proxy="http://sesta.varrius.com:80/"
   auto_serv="http://sesta.varrius.com:80/"
   command-port=21445
   convert-timeout=600
   depth="-1"
   # email="user@domain"
   enable-ip=true
   enumeration-filter="enumeration-default"
   generation-filter="generation-default"
   index-after-ngenerated=30
   loglevel=2
   max-concurrent=8
   site-max-concurrent=2
   onCompletion=idle
   password=boots
   proxy-loc=server
   proxy-type=auto
   robot-state-dir="/var/opt/SUNWportal/searchservers/search1/robot"
   server-delay=1
   smart-host-heuristics=true
   tmpdir="/var/opt/SUNWportal/searchservers/search1/tmp"
   user-agent="iPlanetRobot/4.0"
   username=jack
</Process>