This chapter describes how to configure and administer the Sun Java™ System Portal Server Search Server.
This chapter contains these sections:
The Portal Server Search Server is a taxonomy and database service designed to support search and browse interfaces similar to popular Internet search servers such as Google and AltaVista. The Search Server includes a robot to discover, convert, and summarize document resources. The Portal Server Desktop includes a search user interface based on JavaServer Pages™ (JSP™). The Search Server includes administration tools for configuration editing and command-line tools for system management. Configuration settings can be defined and stored through the Portal Server management console.
The management console permits an administrator to configure a majority of the search server options, but it does not perform all the administrative functions available through the command-line interface.
Users query the search server's databases to locate resources. Individual entries in each database are called resource descriptions (RDs). A resource description provides summary information about a single resource. The database schema determines the fields of each resource description.
The search server is based on open Internet standards such as Resource Description Messages (RDM) and the Summary Object Interchange Format (SOIF) to ensure that the search server can operate in a cross-platform enterprise environment.
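For illustration, a SOIF record is a simple attribute-value structure that carries a resource summary. The following sketch is hypothetical (the object name, URL, and attribute values are invented for this example), but it shows the general shape of such a record, in which each attribute carries the byte length of its value:

```
@DOCUMENT { http://home.siroe.com/index.html
  title{16}:	Welcome to Siroe
  author{5}:	admin
  content-type{9}:	text/html
}
```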
Users interact with the search system in two ways. They can type direct queries to search the database, or they can browse through the database contents using a set of categories that you design. A hierarchy of categories is sometimes called a taxonomy. Categorizing resources is like creating a table of contents for the database.
Browsing is an optional feature in a search system. That is, you can have a perfectly useful Search system that does not include browsing by categories. You need to decide whether adding categories that users can browse is useful to the users of your index, and, if so, what kind of categories you want to create.
The resources in a Search database are assigned to categories to reduce complexity. If a large number of items are in the database, grouping related items together is helpful. Doing so allows users to quickly locate specific kinds of items, compare similar items, and choose which ones they want.
Such categorizing is common in product and service indexes. Clothing catalogs divide men’s, women’s, and children’s clothing, with each of those further subdivided for coats, shirts, shoes, and other items. An office products catalog could separate furniture from stationery, computers, and software. And advertising directories are arranged by categories of products and services.
The principles of categorical groupings in a printed index also apply to online indexes. The idea is to make it easy for users to locate resources of a certain type, so that they can choose the ones they want. No matter what the scope of the index you design, the primary concern in setting up your categories should be usability. You need to know how users use the categories. For example, if you design an index for a company with three offices in different locations, you might make your top-level categories correspond to each of the three offices. If users are more interested in, say, functional divisions that cut across the geographical boundaries, it might make more sense to categorize resources by corporate divisions.
Once the categories are defined, you must set up rules to assign resources to categories. These rules are called classification rules. If you do not define your classification rules properly, users cannot locate resources by browsing in categories. You need to avoid categorizing resources incorrectly, but you also should avoid failing to categorize documents.
Sun Java System Portal Server can support one or more search servers.
During Portal Server installation, a default search server (search1) is created. You can also create a new search server using the Create Search Server wizard.
Select Search Servers and then New from the menu bar.
The New Search Server wizard appears.
Follow the instructions and then click Finish to create the specified search server.
psadmin create-search-server in Sun Java System Portal Server 7.2 Command-Line Reference.
psadmin delete-search-server in Sun Java System Portal Server 7.2 Command-Line Reference
The search server stores its descriptions of resources in a database. A search database is an index of a document collection. Databases are created by the indexer (the rdmgr command or the search server itself). For example, by default the robot can be set up to crawl web sites; the robot then indexes whatever it finds into the default search database, where users can search for the data. The data can also be indexed into other databases.
The following are some configuration and maintenance tasks you may need to perform to administer the database:
Normally, items in your search database come from the robot. You can also import databases of existing items, either from other Portal Server Search servers, from iPlanet Web Servers or Netscape™ Enterprise Servers, or from databases generated from other sources. Importing existing databases of RDs instead of sending the robot to create them anew helps reduce the amount of network traffic. Doing so also enables large indexing efforts to be completed more quickly by breaking the effort down into smaller parts. If the central database is physically distant from the servers being indexed, it can be helpful to generate the RDs locally and periodically import the remote databases to the central database.
The search server uses import agents to import RDs from another server or from a database. An import agent is a process that retrieves a number of RDs from an external source and merges that information into a local database.
Before you can import a database, you must create an import agent. Once an agent is created, you can start the import process immediately or schedule a time to run the import process on a regular basis.
A schema determines what information your search server maintains on each resource, and in what form. The design of your schema determines two factors that affect the usability of your index:
The way users can search for resources
The ways users view resource information
The schema is a master data structure for Resource Descriptions in the database. Depending on how you define and index the fields in that data structure, users have varying degrees of access to the resources.
The schema is closely tied to the structure of the files used by the search server and its robot. You should change the data structure only by using the schema tools in the management console. Never edit the schema file directly.
You can edit the database schema of the search server to add a new schema attribute, to modify a schema attribute, or to delete attributes.
The schema includes the following attributes:
Editable – If checked, the attribute appears in the Resource Description Editor, and you can change its values.
Indexable – This attribute indicates that users can search for values in this particular field. Indexable fields may also appear in the pop-up menu in the Advanced Search screen.
Description – This attribute is a text string to use to describe the schema. You can use it for comments or annotations.
Aliases – This attribute allows you to define aliases to convert imported database schema names into your own schema.
Score Multiplier – A weighting field for scoring a particular element. Any positive value is valid.
Data Type – Defines the data type.
You might encounter discrepancies between the names used for fields in database schemas. When you import Resource Descriptions from one server to another, you cannot always guarantee that the two servers use identical names for items in their schemas. Similarly, when the robot converts HTML <meta> tags from a document into schema fields, the document controls the names.
The search server allows you to define schema aliases for your schema attributes, to map these external schema names into valid names for fields in your database.
The search server provides a report with information about the number of sites indexed and the number of resources from each in the database.
You might need to re-index the Resource Description database for the search server if you have edited the schema to add or remove an indexed field or if a disk error corrupts the index file. It may also be necessary to re-index if a discrepancy occurs between the database content and its index for any other reason, for example, a system failure during indexing.
Re-indexing a large database can take several hours. The time required to re-index the database corresponds to the number of records in the database. If you have a large database, perform re-indexing at a time when the server is not in high demand.
Expiring the database means removing Resource Descriptions that are out of date. Resource Descriptions are removed only when you run the expiration. Expired Resource Descriptions are deleted, but the database size is not decreased.
One attribute of a Resource Description is its expiration date. Your robots can set the expiration date from HTML <meta> tags or from information provided by the resource’s server. By default, Resource Descriptions expire three months after creation unless the resource specifies a different expiration date. Periodically your search server should purge expired Resource Descriptions from its database.
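For example, an HTML page can suggest its own expiration date with a standard <meta> tag in its header. The date shown below is only illustrative:

```
<head>
  <meta http-equiv="Expires" content="Fri, 01 Aug 2008 00:00:00 GMT">
</head>
```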
Purging allows you to remove the contents of the database. Disk space used for indexes is recovered, but disk space used by the main database is not recovered. Instead it is reused as new data are added to the database.
The search server allows you to put the physical files that make up each search database on multiple disks, file systems, directories, or partitions. By spreading databases across different physical or logical devices, you can create a larger database than would fit on a single device.
By default, the search server sets up the database to use only one directory. The command-line interface allows you to perform two kinds of manipulations on the database partitions:
Adding New Partitions
Moving Partitions
The search server does not perform any checking to ensure that individual partitions have space remaining. It is your responsibility to maintain adequate free space for the database.
You can add new database partitions up to a maximum of 15 total partitions.
Once you increase the number of partitions, you must delete the entire database if you want to reduce the number later.
However, using multiple partitions is not recommended as long as a single partition has enough disk space.
To change the physical location of any database partition, specify the name of the new location. Similarly, you can rename an existing partition. Use the rdmgr command to manipulate the partitions. See the Sun Java System Portal Server 7.2 Command-Line Reference for information on the psadmin command.
Use the following instruction to manage a database:
Select the Search Servers tab, then select a search server.
Click Databases, then Management from the menu bar.
Click New.
The New Database page displays.
Type the name of the new database, and click OK.
psadmin create-search-database in Sun Java System Portal Server 7.2 Command-Line Reference
Select the Search Servers tab, then select a search server.
Click Databases, then Import Agents from the menu bar.
Click New to launch the wizard.
Specify the Import Agent attributes.
For more information about the attributes, see Import Agents in Sun Java System Portal Server 7.2 Technical Reference.
Click Finish.
psadmin create-search-importagent in Sun Java System Portal Server 7.2 Command-Line Reference
Select the Search Servers tab, then select a search server.
Click Databases, then Management from the menu bar.
Select a database and click Manage Resource Descriptions.
Click New and specify the attributes.
For more information about the attributes, see Schema in Sun Java System Portal Server 7.2 Technical Reference.
Click OK.
Select the Search Servers tab, then select a search server.
Click Databases, then Management from the menu bar.
Select a database and click Manage Resource Descriptions.
Select a Resource Description to perform one of the following actions:
Edit
Edit All
Delete
For more information about the attributes, see Schema in Sun Java System Portal Server 7.2 Technical Reference.
Click Save.
The search server provides a number of reports to allow you to monitor search activity.
Select the Search Servers tab, then select a search server.
Click Reports from the menu bar.
Click on a link in the menu bar to view a specific report.
The following options are available:
Logs
Advanced Robot Reports
Popular Searches
Excluded URLs
The following tasks can be used to manage categories:
Select the Search Servers tab, then select a search server.
Select Categories, then Browse/Search from the menu bar.
Click New.
The New Search Category dialog appears.
Specify the attributes as necessary.
For more information about the attributes, see Manage Categories in Sun Java System Portal Server 7.2 Technical Reference.
Click OK.
Select the Search Servers tab, then select a search server.
Click Categories, then Browse/Search from the menu bar.
Select a category and click Edit to display the Edit Category page.
For more information about the attributes, see Manage Categories in Sun Java System Portal Server 7.2 Technical Reference.
Select the Search Servers tab, then select a search server.
Click Categories, then Autoclassify from the menu bar.
Click Run Autoclassify.
Click the Search Servers tab, then select a search server.
Click Categories, then Autoclassify from the menu bar.
Modify the attributes as necessary.
For more information about the attributes, see Sun Java System Portal Server 7.2 Technical Reference
Click Save.
This chapter describes the Sun Java™ System Portal Server Search Server robot and its corresponding configuration files. The chapter contains the following topics:
A Search Server robot is an agent that identifies and reports on resources in its domains. It does so by using two kinds of filters: an enumerator filter and a generator filter.
The enumerator filter locates resources by using network protocols. The filter tests each resource and if the resource meets the proper criteria, it is enumerated. For example, the enumerator filter can extract hypertext links from an HTML file and use the links to find additional resources.
The generator filter tests each resource to determine whether a resource description (RD) should be created. If the resource passes the test, the generator creates an RD that is stored in the Search Server database.
Configuration and maintenance tasks you might need to do to administer the robot are described in the following sections:
Figure 19–1 shows how the robot examines URLs and their associated network resources. Both the enumerator and the generator test each resource. If the resource passes the enumeration test, the robot checks it for additional URLs. If the resource passes the generator test, the robot generates a resource description that is stored in the Search Server database.
Robot configuration files define the behavior of the robots. These files reside in the directory /var/opt/SUNWportal/searchservers/searchserverid/config. The following list provides a description for each of the robot configuration files.
Contains rules used to classify RDs generated by the robot.
Defines the enumeration and generation filters used by the robot.
Contains the robot's site definitions, starting point URLs, rules for filtering based on mime type, and URL patterns.
Defines most operating properties for the robot.
Because you can set most properties by using the Search Server Administration interface, you typically do not need to edit the robot.conf file. However, advanced users might manually edit this file to set properties that cannot be set through the interface.
The robot finds resources and determines whether to add descriptions of those resources to the database. The determination of which servers to visit and what parts of those servers to index is called a site definition.
Defining the sites for the robot is one of the most important jobs of the server administrator. You need to be sure you send the robot to all the servers it needs to index, but you also need to exclude extraneous sites that can fill the database and make finding the correct information more difficult.
The robot extracts and follows links to the various sites selected for indexing. As the system administrator, you can control these processes through a number of settings, including:
Starting, stopping, and scheduling the robot
Defining the sites the robot visits
Crawling attributes that determine how aggressively it crawls
The types of resources the robot indexes by defining filters
What kind of entries the robot creates in the database by defining the indexing attributes
See the Sun Java System Portal Server 7.2 Technical Reference for descriptions of the robot crawling attributes.
Filters identify resources to be excluded or included by comparing an attribute of each resource against a filter definition. The robot provides a number of predefined filters, some of which are enabled by default. The following filters are predefined. Filters marked with an asterisk are enabled by default.
Archive Files*
Audio Files*
Backup Files*
Binary Files*
CGI Files*
Image Files*
Java, JavaScript, Style Sheet Files*
Log Files*
Lotus Domino Documents
Lotus Domino OpenViews
Plug-in Files
Power Point Files
Revision Control Files*
Source Code Files*
Spreadsheet Files
System Directories (UNIX)
System Directories (NT)
Temporary Files*
Video Files*
You can create new filter definitions, modify a filter definition, or enable or disable filters. See Resource Filtering Process for detailed information.
The robot includes two debugging tools or utilities:
Site Probe – Checks for DNS aliases, server redirects, virtual servers, and the like.
Simulator – Performs a partial simulation of robot filtering on a URL. The simulator indicates whether sites you listed would be accepted by the robot.
To keep the search data timely, the robot should search and index sites regularly. Because robot crawling and indexing can consume processing resources and network bandwidth, you should schedule the robot to run during non-peak days and times. The management console allows administrators to set up a schedule to run the robot.
This section describes the following tasks to manage the robot:
Choose Search Servers from the menu bar. Select a search server from the list of servers.
Click Robot from the menu bar, then Status and Control from the menu.
Click Start.
psadmin start-robot in Sun Java System Portal Server 7.2 Command-Line Reference
The psadmin start-robot command does not start the search robot if no sites are defined for the robot to crawl. In that case, the command displays Starting Points: 0 defined.
Select Search Servers from the menu bar, then select a search server.
Select Robot from the menu bar then Status and Control.
Click Clear Robot Database.
The robot finds resources and determines whether to add descriptions of those resources to the database. The determination of which servers to visit and what parts of those servers to index is called a site definition.
Select Search Servers from the menu bar, then select a search server.
Select Robot from the menu bar, then Sites.
Click New under Manage Sites and specify the configuration attributes for the site.
For more information about the attributes, see Sites in Sun Java System Portal Server 7.2 Technical Reference.
Click OK.
Select Search Servers from the menu bar, then select a search server.
Click Robot from the menu bar, then Sites.
Click the name of the site you want to modify.
The Edit Site dialog appears.
Modify the configuration attributes as necessary.
For more information about the attributes, see Sites in Sun Java System Portal Server 7.2 Technical Reference.
Click OK to record the changes.
The robot crawls to the various sites selected for indexing. You control how the robot crawls sites by defining crawling and indexing operational properties.
Select Search Servers from the menu bar, then select a search server.
Click Robot from the menu bar, then Properties.
Specify the robot crawling and indexing attributes as necessary.
For more information about the attributes, see Properties in Sun Java System Portal Server 7.2 Technical Reference in Sun Java System Portal Server 7.2 Technical Reference.
Click Save.
The simulator performs a partial simulation of robot filtering on one or more listed sites.
Select Search Servers from the menu bar, then select a search server.
Click Robot from the menu bar, then Utilities.
Type the URL of a new site to simulate in the Add a new URL text box and click Add.
You can also run the simulator on existing sites listed under Existing Robot sites.
Click Run Simulator.
The site probe utility checks for such information as DNS aliases, server redirects, and virtual servers.
Select Search Servers from the menu bar, then select a search server.
Click Robot from the menu bar, then Utilities.
Type the URL of the site to probe.
(Optional) If you want the probe to return DNS information, choose Show Advanced DNS information under Site Probe.
Click Run SiteProbe.
The robot uses filters to determine which resources to process and how to process them. When the robot discovers references to resources as well as the resources themselves, it applies filters to each resource. The filters enumerate the resource and determine whether to generate a resource description to store in the Search Server database.
The robot examines one or more starting point URLs, applies the filters, and then applies the filters to the URLs spawned by enumerating those URLs, and so on. The starting point URLs are defined in the filterrules.conf file.
Each enumeration and generation filter performs any required initialization operations and applies comparison tests to the current resource. The goal of each test is to allow or deny the resource. Each filter also has a shutdown phase during which it performs clean-up operations.
If a resource is allowed, then it continues its passage through the filter. The robot eventually enumerates it, attempting to discover further resources. The generator might also create a resource description for it.
If a resource is denied, the resource is rejected. No further action is taken by the filter for resources that are denied.
These operations are not necessarily linked. Some resources result in enumeration; others result in RD generation. Many resources result in both enumeration and RD generation. For example, if the resource is an FTP directory, the resource typically does not have an RD generated for it. However, the robot might enumerate the individual files in the FTP directory. An HTML document that contains links to other documents can result in an RD being generated, and can lead to enumeration of any linked documents as well.
The following sections describe the filter process:
Both enumeration and generation filters have five phases in the filtering process.
Setup – Performs initialization operations. Occurs only once in the life of the robot.
Metadata – Filters the resource based on metadata available about the resource. Metadata filtering occurs once per resource before the resource is retrieved over the network. Table 19–1 lists examples of common metadata types.
| Metadata Type | Description | Example |
|---|---|---|
| Complete URL | The location of a resource | http://home.siroe.com/ |
| Protocol | The access portion of the URL | http, ftp, file |
| Host | The address portion of the URL | www.siroe.com |
| IP address | Numeric version of the host | 198.95.249.6 |
| PATH | The path portion of the URL | /index.html |
| Depth | Number of links from the starting point URL | 5 |
Data – Filters the resource based on its data. Data is filtered once per resource after the data is retrieved over the network. Data that can be used for filtering include:
content-type
content-length
content-encoding
content-charset
last-modified
expires
Enumerate – Enumerates the current resource in order to determine whether it points to other resources to be examined.
Generate – Generates a resource description (RD) for the resource and saves it in the Search Server database.
Shutdown – Performs any needed termination operations. This process occurs once in the life of the robot.
The filter.conf file contains definitions for enumeration and generation filters. This file can contain multiple filters for both enumeration and generation. The filters used by the robot are specified by the enumeration-filter and generation-filter properties in the file robot.conf.
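As a sketch, the robot's filters are selected by name through these two properties in robot.conf. The excerpt below is illustrative only (the exact placement and quoting within robot.conf may differ in your installation); the filter names refer to filters defined in filter.conf:

```
# Illustrative excerpt from robot.conf:
# select which filter.conf definitions the robot runs
enumeration-filter="enumeration1"
generation-filter="generation1"
```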
Filter definitions have a well-defined structure: a header, a body, and an end. The header identifies the beginning of the filter and declares its name, for example:
```
<Filter name="myFilter">
```
The body consists of a series of filter directives that define the filter’s behavior during setup, testing, enumeration or generation, and shutdown. Each directive specifies a function and, if applicable, properties for the function.
The end is marked by </Filter>.
Example 19–1 shows a filter named enumeration1.
```
<Filter name="enumeration1">
    Setup fn=filterrules-setup config=./config/filterrules.conf
    # Process the rules
    MetaData fn=filterrules-process
    # Filter by type and process rules again
    Data fn=assign-source dst=type src=content-type
    Data fn=filterrules-process
    # Perform the enumeration on HTML only
    Enumerate enable=true fn=enumerate-urls max=1024 type=text/html
    # Cleanup
    Shutdown fn=filterrules-shutdown
</Filter>
```
Filter directives use robot application functions (RAFs) to perform operations. Their use and flow of execution are similar to those of NSAPI directives and server application functions (SAFs) in the Sun Java System Web Server's obj.conf file. As with NSAPI and SAF, data are stored and transferred using property blocks, also called pblocks.
Six robot directives, or RAF classes, correspond to the filtering phases and operations listed in Resource Filtering Process:
Setup
Metadata
Data
Enumerate
Generate
Shutdown
Each directive has its own robot application functions. For example, use filtering functions with the Metadata and Data directives, enumeration functions with the Enumerate directive, generation functions with the Generate directive, and so on.
The built-in robot application functions, as well as instructions for writing your own robot application functions, are explained in the Sun Java System Portal Server 7.1 Developer's Guide.
In most cases, you can use the management console to create most of your site-definition based filters. You can then modify the filter.conf and filterrules.conf files to make any further desired changes. These files reside in the directory /var/opt/SUNWportal/searchservers/searchserverid/config.
To create a more complex set of properties, edit the configuration files used by the robot.
When you write or modify a filter, note the order of
The execution of directives, especially the available information at each phase.
The filter rules in filterrules.conf.
You can also do the following:
Modify properties in robot.conf file.
Modify robot application functions in filter.conf file.
Create your own robot application functions.
For more information, see the Sun Java System Portal Server 7.1 Developer's Guide
The following tasks to manage robot filters are described in this section:
Select Search Servers from the menu bar, then select a search server.
Select Robot from the menu bar, then Filters.
Click New.
The New Robot Filter wizard appears.
Follow the instructions to create the specified filter.
Type a filter name and filter description in the text box, and click Next.
Specify filter definition and behavior, and click Finish.
For more information about filter attributes, see Filters in Sun Java System Portal Server 7.2 Technical Reference.
Click Close to load the new filter.
Select Search Servers from the menu bar, then select a search server.
Select Robot from the menu bar, then Filters.
Select a filter.
Click Delete.
Click OK in the confirmation dialog box that appears.
Select Search Servers from the menu bar, then select a search server.
Select Robot from the menu bar, then Filters.
Select a filter, and click Edit.
The Edit a Filter page appears.
Modify the configuration attributes as necessary.
For more information about filter attributes, see Filters in Sun Java System Portal Server 7.2 Technical Reference.
Click OK.
Select Search Servers from the menu bar, then select a search server.
Select Robot from the menu bar, then Filters.
Select a filter.
Documents can be assigned to multiple categories, up to a maximum number defined in the settings. Classification rules are simpler than robot filter rules because they do not involve any flow-control decisions. In classification rules you determine what criteria to use to assign specific categories to a resource as part of its Resource Description. A classification rule is a simple conditional statement, taking the form if condition is true, assign the resource to <a category>.
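Conceptually, each classification rule pairs a condition with a category. The following pseudocode is not the exact syntax of classification.conf; it only illustrates the if-then form described above, with an invented URL pattern and category name:

```
# Pseudocode only; see classification.conf for the actual rule syntax.
if url contains "/marketing/"
    then assign category "Corporate:Marketing"
```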
Select Search Servers from the menu bar, then select a search server.
Select Robot from the menu bar, then Classification Rules.
Select Classification Rules and click New.
The New Classification Rule dialog box appears.
Specify the configuration attributes as necessary.
For more information about the attributes, see Manage Classification Rules in Sun Java System Portal Server 7.2 Technical Reference.
Click OK.
Select Search Servers from the menu bar, then select a search server.
Select Robot, then Classification Rules from the menu bar.
Select a classification rule, and click Edit.
Modify the attributes as necessary.
For more information about the attributes, see Manage Classification Rules in Sun Java System Portal Server 7.2 Technical Reference.
Click OK.
Most robot application functions (RAFs) require sources of information and generate data that go to destinations. The sources are defined within the robot and are not necessarily related to the fields in the resource description that the robot ultimately generates. Destinations, on the other hand, are generally the names of fields in the resource description, as defined by the resource description server’s schema.
The following sections describe the different stages of the filtering process, and the sources available at those stages:
At the Setup stage, the filter is set up but cannot yet obtain information about the resource’s URL or content.
At the MetaData stage, the robot encounters a URL for a resource but it has not downloaded the resource’s content. Thus information is available about the URL as well as data that is derived from other sources such as the filter.conf file. At this stage, however, information about the content of the resource is not available.
Table 19–2 Sources Available to the RAFs at the MetaData Phase
| Source | Description | Example |
|---|---|---|
| csid | Catalog server ID | x-catalog//budgie.siroe.com:8086/alexandria |
| depth | Number of links traversed from starting point | 10 |
| enumeration filter | Name of enumeration filter | enumeration1 |
| generation filter | Name of generation filter | generation1 |
| host | Host portion of URL | home.siroe.com |
| IP | Numeric version of host | 198.95.249.6 |
| protocol | Access portion of the URL | http, https, ftp, file |
| path | Path portion of the URL | /, /index.html, /documents/listing.html |
| URL | Complete URL | http://developer.siroe.com/docs/manuals/ |
At the Data stage, the robot has downloaded the content of the resource at the URL and can access data about the content, such as the description and the author.
If the resource is an HTML file, the robot parses the <META> tags in the HTML headers. Consequently, any data contained in <META> tags is available at the Data stage.
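To illustrate how <META> data becomes available as sources, the following sketch uses Python's standard html.parser module to collect name/content pairs from an HTML header. This is a conceptual analogue only, not the robot's actual parser:

```python
from html.parser import HTMLParser

class MetaTagExtractor(HTMLParser):
    """Collect <META name="..." content="..."> pairs from an HTML header."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag == "meta":
            d = dict(attrs)
            if "name" in d and "content" in d:
                self.meta[d["name"].lower()] = d["content"]

html = """<html><head>
<meta name="Author" content="J. Doe">
<meta name="Keywords" content="search, robot">
<title>Example</title>
</head><body>text</body></html>"""

parser = MetaTagExtractor()
parser.feed(html)
print(parser.meta["author"])  # J. Doe
```

After the parse, fields such as Author and Keywords are available to filtering functions in the same way the table below describes.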
During the Data phase, the following sources are available to RAFs, in addition to those available during the MetaData phase.
Table 19–3 Sources Available to the RAFs at the Data Phase

| Source | Description | Example |
|---|---|---|
| content-charset | Character set used by the resource | |
| content-encoding | Any form of encoding | |
| content-length | Size of the resource in bytes | |
| content-type | MIME type of the resource | text/html, image/jpeg |
| expires | Date the resource expires | |
| last-modified | Date the resource was last modified | |
| data in <META> tags | Any data that is provided in <META> tags in the header of HTML resources | Author, Description, Keywords |
All of these sources except for the data in <META> tags are derived from the HTTP response header returned when retrieving the resource.
At the Enumeration and Generation stages, the same data sources are available as in the Data stage. See Table 19–3 for information.
At the Shutdown stage, the filter completes its filtering and shuts down. Although functions written for this stage can use the same data sources as those available at the Data stage, the shutdown functions typically restrict their operations to robot shutdown and clean-up activities.
Each function can have an enable property. The values can be true, false, on, or off. The management console uses these parameters to turn certain directives on or off.
The following example enables enumeration for text/html and disables enumeration for text/plain:
# Perform the enumeration on HTML only
Enumerate enable=true fn=enumerate-urls max=1024 type=text/html
Enumerate enable=false fn=enumerate-urls-from-text max=1024 type=text/plain
Adding an enable=false property or an enable=off property has the same effect as commenting the line. These properties are used because the management console does not write comments.
This section describes the functions that are used during the setup phase by both enumeration and generation filters. The functions are described in the following sections:
When you use the filterrules-setup function, use the logtype property to specify the type of log file. The value can be verbose, normal, or terse.
config: Path name to the file containing the filter rules to be used by this filter
Setup fn=filterrules-setup
config="/var/opt/SUNWportal/searchservers/search1/config/filterrules.conf"
The setup-regex-cache function initializes the cache size for the filter-by-regex and generate-by-regex functions. Use this function to specify a number other than the default of 32.
cache-size: Maximum number of compiled regular expressions to be kept in the regex cache
Setup fn=setup-regex-cache cache-size=28
The setup-type-by-extension function configures the filter to recognize file name extensions. It must be called before the assign-type-by-extension function can be used. The file specified as a property must contain mappings between standard MIME content types and file extension strings.
file: Name of the MIME types configuration file
Setup fn=setup-type-by-extension
file="/var/opt/SUNWportal/searchservers/search1/config/mime.types"
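The effect of an extension-to-MIME lookup can be illustrated with Python's standard mimetypes module, which serves the same purpose as the mime.types file above. This is a conceptual analogue, not the robot's implementation:

```python
import mimetypes

# Conceptual equivalent of the robot's extension-to-MIME lookup:
# mime.types maps file name extensions to standard MIME content types.
mimetypes.init()
print(mimetypes.guess_type("report.html")[0])  # text/html
print(mimetypes.guess_type("photo.jpeg")[0])   # image/jpeg
```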
Filtering functions operate at the Metadata and Data stages to allow or deny resources based on specific criteria specified by the function and its properties. These functions can be used in both Enumeration and Generation filters in the file filter.conf.
Each filter-by function performs a comparison and either allows or denies the resource. Allowing the resource means that processing continues to the next filtering step. Denying the resource means that processing should stop, because the resource does not meet the criteria for further enumeration or generation.
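The allow/deny chain can be sketched conceptually in Python. This is an illustrative model of the semantics described above, not the robot's internal code; the function and property names mirror the directives documented in this section:

```python
# Sketch of the allow/deny chain: each filter either passes the resource
# to the next step or stops processing (illustrative only).

def filter_by_exact(resource, src, deny=None, allow=None):
    """Deny (or allow) a resource when the source matches exactly."""
    value = resource.get(src, "")
    if deny is not None:
        return value != deny          # deny on exact match
    return allow == "all" or value == allow

def run_filters(resource, filters):
    """Apply filters in order; stop at the first denial."""
    for f in filters:
        if not f(resource):
            return False              # resource denied
    return True                       # resource allowed

filters = [lambda r: filter_by_exact(r, "type", deny="text/plain")]
print(run_filters({"type": "text/plain"}, filters))  # False: denied
print(run_filters({"type": "text/html"}, filters))   # True: allowed
```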
The filter-by-exact function allows or denies the resource if the allow/deny string matches the source of information exactly. The keyword all matches any string.
src: Source of information
allow or deny: Contains a string for exact comparison
The following example filters out all resources whose content-type is text/plain. It allows all other resources to proceed:
Data fn=filter-by-exact src=type deny=text/plain
The filter-by-max function allows the resource if the specified information source is less than or equal to the given value. It denies the resource if the information source is greater than the specified value.
This function can be called no more than once per filter.
The following properties are used with the filter-by-max function.
src: Source of information: hosts, objects, or depth
value: Specifies a value for comparison
This example allows resources whose content-length is less than 1024 kilobytes:
MetaData fn=filter-by-max src=content-length value=1024
The filter-by-md5 function allows only the first resource with a given MD5 checksum value. If the current resource’s MD5 has been seen in an earlier resource by this robot, the current resource is denied. The function prevents duplication of identical resources or single resources with multiple URLs.
You can only call this function at the Data stage or later. It can be called no more than once per filter. The filter must invoke the generate-md5 function to generate an MD5 checksum before invoking filter-by-md5.
None
The following example shows the typical method of handling MD5 checksums by first generating the checksum and then filtering based on it:
Data fn=generate-md5
Data fn=filter-by-md5
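The duplicate-suppression logic behind this pair of directives can be sketched as follows. This is an assumed model for illustration, not the robot's internal implementation:

```python
import hashlib

# Sketch of duplicate suppression by MD5 checksum: only the first
# resource with a given checksum is allowed (illustrative only).
seen_checksums = set()

def filter_by_md5(content: bytes) -> bool:
    """Allow only the first resource with a given MD5 checksum."""
    digest = hashlib.md5(content).hexdigest()
    if digest in seen_checksums:
        return False      # duplicate content: deny
    seen_checksums.add(digest)
    return True

print(filter_by_md5(b"<html>same page</html>"))  # True  (first copy)
print(filter_by_md5(b"<html>same page</html>"))  # False (duplicate)
```

This is why a single document reachable through multiple URLs yields only one resource description.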
The filter-by-prefix function allows or denies the resource if the given information source begins with the specified prefix string. The resource doesn’t have to match completely. The keyword all matches any string.
src: Source of information
allow or deny: Contains a string for prefix comparison
The following example allows resources whose content-type is any kind of text, including text/html and text/plain:
MetaData fn=filter-by-prefix src=type allow=text
The filter-by-regex function supports regular-expression pattern matching. It allows resources that match the given regular expression. The supported regular expression syntax is defined by the POSIX.1 specification. The regular expression \\* matches anything.
src: Source of information
allow or deny: Contains a regular expression string
The following example denies all resources from sites in the .gov domain:
MetaData fn=filter-by-regex src=host deny=\\*.gov
The filterrules-process function processes the site definition and filter rules in the filterrules.conf file.
None
MetaData fn=filterrules-process
Support functions are used during filtering to manipulate or generate information on the resource. The robot can then process the resource by calling filtering functions. These functions can be used in enumeration and generation filters in the file filter.conf.
The assign-source function assigns a new value to a given information source. This function permits editing during the filtering process. The function can assign an explicit new value, or it can copy a value from another information source.
dst: Name of the source whose value is to be changed
value: Specifies an explicit value
src: Information source to copy to dst
You must specify either a value property or a src property, but not both.
Data fn=assign-source dst=type src=content-type
The assign-type-by-extension function uses the resource’s file name to determine its type and assigns this type to the resource for further processing.
The setup-type-by-extension function must be called during setup before assign-type-by-extension can be used.
src: Source of file name to compare. If you do not specify a source, the default is the resource’s path
MetaData fn=assign-type-by-extension
The clear-source function deletes the specified data source. You typically do not need to perform this function. You can create or replace a source by using the assign-source function.
src: Name of the source to delete
The following example deletes the path source:
MetaData fn=clear-source src=path
The convert-to-html function converts the current resource into an HTML file for further processing if its type matches a specified MIME type. The conversion filter automatically detects the type of the file it is converting.
type: MIME type from which to convert
The following sequence of function calls causes the filter to convert all Adobe Acrobat PDF files, Microsoft RTF files, and FrameMaker MIF files to HTML, as well as any files whose type was not specified by the server that delivered it.
Data fn=convert-to-html type=application/pdf
Data fn=convert-to-html type=application/rtf
Data fn=convert-to-html type=application/x-mif
Data fn=convert-to-html type=unknown
The copy-attribute function copies the value from one field in the resource description into another.
src: Field in the resource description from which to copy
dst: Item in the resource description into which to copy the source
truncate: Maximum length of the source to copy
clean: Boolean property indicating whether to fix truncated text so that partial words are not left behind. This property is false by default
Generate fn=copy-attribute \
src=partial-text dst=description truncate=200 clean=true
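The clean=true behavior, truncating without leaving a partial word, can be sketched in Python. This is an illustrative model of the behavior described above, not the robot's code:

```python
def clean_truncate(text: str, limit: int) -> str:
    """Truncate to at most `limit` characters without leaving a
    partial word (illustrative model of clean=true)."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # Back up to the end of the last complete word.
    return cut.rsplit(" ", 1)[0] if " " in cut else cut

print(clean_truncate("The quick brown fox jumps", 12))  # The quick
```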
The generate-by-exact function generates a source with a specified value, but only if an existing source exactly matches another value.
dst: Name of the source to generate
value: Value to assign to dst
src: Source against which to match
match: Value to compare to src
The following example sets the classification to Siroe if the host is www.siroe.com.
Generate fn="generate-by-exact" match="www.siroe.com:80" src="host" value="Siroe" dst="classification"
The generate-by-prefix function generates a source with a specified value if the prefix of an existing source matches another value.
dst: Name of the source to generate
value: Value to assign to dst
src: Source against which to match
match: Value to compare to src
The following example sets the classification to World Wide Web if the protocol prefix is http:
Generate fn="generate-by-prefix" match="http" src="protocol" value="World Wide Web" dst="classification"
The generate-by-regex function generates a source with a specified value if an existing source matches a regular expression.
dst: Name of the source to generate
value: Value to assign to dst
src: Source against which to match
match: Regular expression string to compare to src
The following example sets the classification to siroe if the host name matches the regular expression *.siroe.com. For example, resources at both developer.siroe.com and home.siroe.com are classified as Siroe:
Generate fn="generate-by-regex" match="\\*.siroe.com" src="host" value="Siroe" dst="classification"
The generate-md5 function generates an MD5 checksum and adds it to the resource. You can then use the filter-by-md5 function to deny resources with duplicate MD5 checksums.
None
Data fn=generate-md5
The generate-rd-expires function generates an expiration date and adds it to the specified source. The function uses metadata such as the HTTP header and HTML <META> tags to obtain any expiration data from the resource. If none exists, the function generates an expiration date three months from the current date.
src: Name of the source. If you omit it, the source defaults to rd-expires
Generate fn=generate-rd-expires
The generate-rd-last-modified function adds the current time to the specified source.
src: Name of the source. If you omit it, the source defaults to rd-last-modified
Generate fn=generate-rd-last-modified
The rename-attribute function changes the name of a field in the resource description. The function is most useful in cases where, for example, the extract-html-meta function copies information from a <META> tag into a field and you want to change the name of the field.
src: String containing a mapping from one name to another
The following example renames an attribute from author to author-name:
Generate fn=rename-attribute src="author->author-name"
The following functions operate at the Enumerate stage. These functions control whether and how a robot gathers links from a given resource to use as starting points for further resource discovery.
The enumerate-urls function scans the resource and enumerates all URLs found in hypertext links. The results are used to spawn further resource discovery. You can specify a content-type to restrict the kind of URLs enumerated.
max: The maximum number of URLs to spawn from a given resource. The default is 1024
type: Content-type that restricts enumeration to URLs that have the specified content-type. type is an optional property. If omitted, the function enumerates all URLs
The following example enumerates HTML URLs only, up to a maximum of 1024:
Enumerate fn=enumerate-urls type=text/html
The enumerate-urls-from-text function scans a text resource, looking for strings that match the regular expression URL:.*. The function spawns robots to enumerate the URLs from these strings and generate further resource descriptions.
max: The maximum number of URLs to spawn from a given resource. If max is omitted, the default is 1024
Enumerate fn=enumerate-urls-from-text
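The URL: scan can be illustrated with Python's re module. For clarity the sketch matches a non-whitespace run (\S+) after the URL: marker rather than the literal .* pattern, and it is a conceptual model only, not the robot's scanner:

```python
import re

# Sketch of scanning a text resource for URL: strings; each match
# becomes a candidate URL for further enumeration (illustrative only).
text = """See URL:http://developer.siroe.com/docs/
and also URL:http://home.siroe.com/index.html for details."""

urls = re.findall(r"URL:(\S+)", text)
print(urls)
```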
Generation functions are used in the Generate stage of filtering. Generation functions can create information that goes into a resource description. In general, they either extract information from the body of the resource itself or copy information from the resource’s metadata.
The extract-full-text function extracts the complete text of the resource and adds it to the resource description.
Use the extract-full-text function with caution. It can significantly increase the size of the resource description, thus causing database bloat and overall negative impact on network bandwidth.
Generate fn=extract-full-text
truncate: The maximum number of characters to extract from the resource
dst: Name of the schema item that receives the full text
The extract-html-meta function extracts any <META> or <TITLE> information from an HTML file and adds it to the resource description. A content-type may be specified to restrict the kind of URLs that are generated.
truncate: The maximum number of bytes to extract
type: Optional property. If omitted, all URLs are generated
Generate fn=extract-html-meta truncate=255 type=text/html
The extract-html-text function extracts the first few characters of text from an HTML file, excluding the HTML tags, and adds the text to the resource description. This function permits the first part of a document’s text to be included in the RD. A content-type may be specified to restrict the kind of URLs that are generated.
truncate: The maximum number of bytes to extract
skip-headings: Set to true to ignore any HTML headers that occur in the document
type: Optional property. If omitted, all URLs are generated
Generate fn=extract-html-text truncate=255 type=text/html skip-headings=true
The extract-html-toc function extracts the table of contents from the HTML headers and adds it to the resource description.
truncate: The maximum number of bytes to extract
level: Maximum HTML header level to extract. This property controls the depth of the table of contents
Generate fn=extract-html-toc truncate=255 level=3
The extract-source function extracts the specified values from the given sources and adds them to the resource description.
src: Lists source names. You can use the -> operator to define a new name for the RD attribute. For example, type->content-type would take the value of the source named type and save it in the RD under the attribute named content-type
Generate fn=extract-source src="md5,depth,rd-expires,rd-last-modified"
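The parsing of the src list and its optional -> rename operator can be sketched in Python. This is an illustrative model of the syntax described above, not the search server's parser:

```python
# Sketch of parsing a src list such as "md5,depth,type->content-type"
# (illustrative only; attribute names come from the example above).
def parse_src(src: str) -> dict:
    """Map each source name to the RD attribute name it is saved under."""
    mapping = {}
    for item in src.split(","):
        source, _, attr = item.strip().partition("->")
        mapping[source] = attr or source  # no "->" means keep the name
    return mapping

print(parse_src("md5,depth,type->content-type"))
```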
The harvest-summarizer function runs a Harvest summarizer on the resource and adds the result to the resource description.
To run Harvest summarizers, you must have $HARVEST_HOME/lib/gatherer in your path before you run the robot.
summarizer: Name of the summarizer program
Generate fn=harvest-summarizer summarizer=HTML.sum
The filterrules-shutdown function can be used during the shutdown phase by both enumeration and generation filters.
After the rules are run, the filterrules-shutdown function performs clean up and shutdown responsibilities.
None
Shutdown fn=filterrules-shutdown
The robot.conf file defines many options for the robot, including pointing the robot to the appropriate filters in filter.conf. For backward compatibility with older versions, robot.conf can also contain the starting-point URLs.
Because you can set most properties by using the management console, you typically do not need to edit the robot.conf file. However, advanced users might manually edit this file to set properties that cannot be set through the management console. See Sample robot.conf File for an example of this file.
Table 19–4 lists the properties you can change in the robot.conf file.
Table 19–4 User-Modifiable Properties
This section describes a sample robot.conf file. Any commented properties in the sample use the default values shown. The first property, csid, indicates the Search Server instance that uses this file. Do not change the value of this property. See Modifiable Properties for definitions of the properties in this file.
This sample file includes some properties used by the Search Server that you should not modify. The csid property is one example.
<Process csid="x-catalog://budgie.siroe.com:80/jack"
    auto-proxy="http://sesta.varrius.com:80/"
    auto_serv="http://sesta.varrius.com:80/"
    command-port=21445
    convert-timeout=600
    depth="-1"
    # email="user@domain"
    enable-ip=true
    enumeration-filter="enumeration-default"
    generation-filter="generation-default"
    index-after-ngenerated=30
    loglevel=2
    max-concurrent=8
    site-max-concurrent=2
    onCompletion=idle
    password=boots
    proxy-loc=server
    proxy-type=auto
    robot-state-dir="/var/opt/SUNWportal/searchservers/search1/robot"
    server-delay=1
    smart-host-heuristics=true
    tmpdir="/var/opt/SUNWportal/searchservers/search1/tmp"
    user-agent="iPlanetRobot/4.0"
    username=jack
</Process>