Sun Java System Portal Server 7.1 Technical Reference

Chapter 4 Search Attributes: Robot

This chapter describes the attributes available for the Search Robot. The properties for the robot are quite extensive: you can select the sites to be searched, check whether a site is valid, define which types of documents are picked up, and schedule when the searches take place.

This chapter contains the following sections:

  Status and Control
  Sites
  Filters
  Properties
  Indexing
  Simulator
  Site Probe

Status and Control

The Robot Overview panel is where you can see what the robot is doing: whether it is Off, Idle, Running, or Paused, and, if it is Running, what progress it is making in the search. The panel is refreshed about every 30 seconds; the refresh rate is defined by the robot-refresh parameter in the search.conf file.
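
For example, to slow the overview refresh to once a minute you would change that parameter. The excerpt below is only a sketch: the robot-refresh parameter name comes from this guide, but the file location and exact syntax are assumptions that may differ in your installation.

    # /var/opt/SUNWportal/searchservers/search1/config/search.conf (excerpt; path assumed)
    # Refresh the Robot Overview panel every 60 seconds (assumed syntax).
    robot-refresh=60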

If the robot is Off, the panel has two buttons: Start at the top and Clear Robot Database at the bottom. If the robot is Running or Idle, the two buttons are Stop and Pause. If it is Paused, the two buttons are Stop and Resume. By clicking any of the attributes, you go to the Reports section, where you can get a detailed, up-to-the-minute report on that attribute.

The table below lists the Robot Overview attributes and their descriptions.

Table 4–1 Robot Overview Attributes

The Robot is
  Default Value: Current activity
  Description: The robot's state. The value can be Idle, Running, Paused, or Off.

Last Updated at
  Default Value: Date and time last refreshed
  Description: This page is refreshed to keep you aware of the progress the robot is making.

Starting Points
  Default Value: Number defined
  Description: Displays the sites that the robot crawls to generate resource descriptions. The robot does not index resources from disabled sites.

URL Pool
  Default Value: Number of URLs waiting
  Description: Number of URLs (Uniform Resource Locators) yet to be investigated. When you begin a search, the starting-point URLs are entered into the URL pool. As the search progresses, the robot discovers links to other URLs, which are added to the pool. After all the URLs in the pool have been processed, the pool is empty and the robot is idle. (This crawl cycle is sketched after the table.)

Extracting
  Default Value: Number of connections per second
  Description: Number of resources examined per second. Extracting is the process of discovering or locating resources, documents, or hyperlinks to be included in the database, and filtering out unwanted items.

Filtering
  Default Value: Number of URLs rejected
  Description: Total number of URLs that are excluded.

Indexing
  Default Value: Number of URLs per second
  Description: Number of resources or documents turned into resource descriptions per second. Indexing is the phase when all the information that has been gathered about a document is turned into a resource description for inclusion in the search database.

Excluded URLs
  Number URLs excluded by filters: Number of URLs that did not meet the filtering criteria.
  Number URLs excluded by errors: Number of URLs for which the robot encountered errors, such as file not found.

Resource Descriptions
  Number RDs contributed: Number of resource descriptions added to the database.
  Number Bytes of RDs contributed: Number of bytes added to the database.

General Stats
  Number URLs retrieved: Number of URLs retrieved during the run.
  Number Bytes average size of RDs: Average number of bytes per resource description.
  Time running: The amount of time, in days, hours, minutes, and seconds, that the robot has been running.
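
The counters in Table 4-1 describe the robot's crawl cycle: starting points seed the URL pool, each pooled URL is filtered, extracted, and indexed, and newly discovered links refill the pool until it is empty. The following Python sketch illustrates that cycle. It is an illustration only, not the Portal Server robot's implementation; every name in it is hypothetical, and a real robot would use an HTML parser rather than this simple regular expression.

    import re
    import urllib.request
    from collections import deque

    LINK_RE = re.compile(rb'href="(http[^"]+)"', re.IGNORECASE)

    def fetch(url):
        # Extracting: retrieve the resource. Errors such as file not found
        # are counted under "Number URLs excluded by errors".
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read()
        except Exception:
            return None

    def crawl(starting_points, exclude_filters, max_depth=10):
        pool = deque((url, 1) for url in starting_points)  # the URL pool
        seen = set(starting_points)
        rds = []                                           # stand-in for the search database
        while pool:                                        # Running while the pool is non-empty
            url, depth = pool.popleft()
            if any(f(url) for f in exclude_filters):       # Filtering: URL rejected
                continue
            body = fetch(url)
            if body is None:
                continue
            rds.append((url, body[:4096]))                 # Indexing: partial-text RD
            if depth < max_depth:                          # starting-point depth limit
                for link in LINK_RE.findall(body):
                    found = link.decode("ascii", "ignore")
                    if found not in seen:                  # discovered links join the pool
                        seen.add(found)
                        pool.append((found, depth + 1))
        return rds                                         # pool empty: the robot goes idle

For example, crawl(["http://www.sesta.com/"], [lambda u: u.endswith(".gif")]) would reject GIF images during the Filtering step, much as the Image Files filter rule does.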

Sites

When you click the Sites tab, the Manage Sites page is displayed. This page lists the site names and the status of each site (enabled or disabled) that the robot crawls to generate resource descriptions. When you select a site's checkbox, the Delete, Enable, and Disable buttons become active. Click Delete to delete a selected site; click Enable or Disable to enable or disable it. A disabled site is not searched when the robot runs.

The table below provides the attributes on the Manage Sites page and their descriptions.

Table 4–2 Manage Sites Attributes

Lock or cluster graphic
  Default Value: Status of site
  Description: Lock open means that the URL is accessible. The closed lock means that the site is a secure web server and uses SSL. The cluster means that the site is a domain.

Enabled/Disabled
  Default Value: Enabled
  Description: Choose to search this site or not when the robot is run.

You can create a new site by clicking the New button. The New Robot Site page appears, which allows you to set up a new robot site. The table below provides the attributes available on the New Robot Site page and their descriptions.

Table 4–3 New Robot Site Attributes

Type
  Default Value: URL
  Description: Select URL or Domain from the list box.

Site
  Default Value: Blank
  Description: If you have selected the Type as URL, enter the URL of the site you want to create. The URL format is: http://www.sesta.com. If you have selected the Type as Domain, enter the domain of the site you want to create. The Domain format is: *.sesta.com

Click the site name to navigate to the Edit a Site page. You can use this page to define the search site more completely: you can specify what type of server it is, redefine the depth of the search, and select what type of files to add to the search database. The attributes for URL and Domain sites are mostly the same; the Applies To entry for each attribute in the table below shows whether it applies to URL sites, Domain sites, or both.

You can verify the server name for the search site you entered. In the Server Group section, click the New button to add more servers to the server group. In the Starting Points section, click the New button to add more starting points. In the Filter Definition section, you can add or delete, and exclude or include, certain types of files, as well as change the order in which the filters for these files are applied.

The table below provides the attributes on the Edit a Site page and their descriptions.

Table 4–4 Edit a Site Attributes

Site Name
  Applies To: URL and Domain
  Default Value: The site entered, for example www.sesta.com
  Description: Name of the web site.

Server Group - Name
  Applies To: URL
  Default Value: The URL entered, for example www.sesta.com
  Description: Either a single server or part of a single server. The entry must include the full host name. If you specify just a host name, the site is limited to that host. If you provide directory information in addition to the host name, the site is defined as only that directory and any of its subdirectories.

Checkbox to select Server Group for deletion or verification
  Applies To: URL
  Default Value: Unselected
  Description: Unselected does not delete or verify the server group; Selected deletes or verifies it.

Port
  Applies To: URL and Domain
  Default Value: 80 for URL; blank for Domain
  Description: If the site you are searching uses a different port, enter it here.

Type
  Applies To: URL
  Default Value: Web Server
  Description: Web Server, File Server, FTP Server, or Secure Web Server.

Allowed Protocols
  Default Value: All selected
  Description: Checkboxes for http, file, ftp, and https.

Starting Points - Checkbox to select site for deletion
  Applies To: URL and Domain
  Default Value: Unselected
  Description: Unselected or Selected.

Starting Points - URL
  Applies To: URL and Domain
  Default Value: http://URL:80
  Description: URL or domain.

Starting Points - Depth
  Applies To: URL and Domain
  Default Value: 10
  Description: 1 (this URL only), 2 (this URL and first links), 3-10, 100, or unlimited.

Filter Definition - Checkbox to select file type for deletion
  Applies To: URL and Domain
  Default Value: Unselected
  Description: Unselected or Selected.

Filter Definitions
  Applies To: URL and Domain
  Default Value: In this order, the defaults are Archive Files; Audio Files; Backup Files; Binary Files; CGI Files; Image Files; Java, JavaScript, Style Sheet Files; Log Files; Revision Control Files; Source Code Files; Temporary Files; Video Files.
  Description: The possible choices are Archive Files; Audio Files; Backup Files; Binary Files; CGI Files; Image Files; Java, JavaScript, Style Sheet Files; Log Files; Power Point Files; Revision Control Files; Source Code Files; Temporary Files; Video Files; Spreadsheet Files; Plug-in Files; Lotus Domino Documents; Lotus Domino OpenViews; System Directories (UNIX); System Directories (NT).

DNS Translation
  Applies To: URL and Domain
  Default Value: Blank
  Description: A DNS translation modifies the URL, and the way it is crawled, by replacing a domain name or alias with a CNAME.

Description
  Applies To: URL and Domain
  Default Value: Blank
  Description: Description for the site that you created.

Destination Database
  Applies To: URL and Domain
  Default Value: Use Internal Default
  Description: Select the database that you want to use from the list box of available databases.

Domain Group - Name
  Applies To: Domain
  Default Value: The domain entered, for example *.sesta.com
  Description: Name of the domain.

Checkbox to select Domain Group for deletion
  Applies To: Domain
  Default Value: Unselected
  Description: Unselected or Selected.

Filters

Under the Filters tab is the Manage Filters page, which lists all the defined filter rules along with the status of each rule, its default value for new sites, and the sites it is used in. Each filter rule is preceded by a checkbox. To delete a filter rule, select the corresponding checkbox and click the Delete button. To create a new filter:

  1. Click the New button.

    The New Robot Filter Wizard appears. As a first step, the Specify Filter Name and Description page is displayed.

  2. Enter the filter name in the Filter Name text box.

  3. Enter the description for the filter in the Filter Description text box.

  4. Click the Next button.

    The Specify Filter Definition and Behavior page appears. This page provides the Filter Definition — Matching Rules section. The table below lists the attributes in this section and their descriptions.

  5. Click the Finish button.

Table 4–5 Filter Definition and Behavior Attributes

Filter Source
  Default Value: URL
  Description: Choose an option from the list box to specify the source of the filter. The available values are URL, protocol, host, path, and MIME type.

Filter By
  Default Value: is
  Description: Choose an option from the list box to specify how you want to filter the source. The available values are is, contains, begins with, ends with, and regular expression. (A sketch of how these rules match appears after this table.)

Filter String
  Default Value: Blank
  Description: Enter the string that defines the filter.

Filter Default
  Default Value: Selected
  Description: Assign this filter to new sites when they are created.

Filter Behavior
  Default Value: Exclude documents that match this filter when the robot runs
  Description: The default option excludes documents that match this filter when the robot runs. The other option includes documents that match this filter when the robot runs.
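
To make the matching rules concrete, the following sketch shows how a rule built from the Filter Source, Filter By, and Filter String attributes could be evaluated. The filter_matches helper is hypothetical, not the product's code, and the MIME type source is omitted because it requires the fetched document.

    import re
    from urllib.parse import urlparse

    def filter_matches(url, source, filter_by, filter_string):
        # Evaluate one matching rule from Table 4-5 (hypothetical helper).
        parts = urlparse(url)
        value = {
            "URL": url,
            "protocol": parts.scheme,
            "host": parts.hostname or "",
            "path": parts.path,
        }[source]
        if filter_by == "is":
            return value == filter_string
        if filter_by == "contains":
            return filter_string in value
        if filter_by == "begins with":
            return value.startswith(filter_string)
        if filter_by == "ends with":
            return value.endswith(filter_string)
        if filter_by == "regular expression":
            return re.search(filter_string, value) is not None
        raise ValueError("unknown Filter By choice: " + filter_by)

    # The example URL discussed with Table 4-6, http://docs.sesta.com/manual.html:
    assert filter_matches("http://docs.sesta.com/manual.html", "protocol", "is", "http")
    assert filter_matches("http://docs.sesta.com/manual.html", "host", "contains", "sesta")
    assert filter_matches("http://docs.sesta.com/manual.html", "path", "ends with", "html")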

Click a filter rule to navigate to the Edit a Filter page. The table below lists the attributes on the Edit a Filter page and their descriptions. The default values for these attributes are the same as in the previous table.

Table 4–6 Edit a Filter Attributes

Filter Name
  Description: A descriptive name that reflects the type of file the filter applies to.

Drop-down list of filter sources
  Description: URL, protocol, host, path, MIME type.

Drop-down list of positions
  Description: is, contains, begins with, ends with, regular expression.

Text box for type (directory, protocol, file extension) specifics
  Description: In this text box, list what you want to match. For example, for http://docs.sesta.com/manual.html: the protocol is http, the host contains sesta, and the file ends with html.

Filter Description
  Description: Describes the filter rule for your own reference. The robot does not use it.

Filter Default
  Description: Use this filter as one of the defaults when creating new sites. If you do not check this, you can still add the filter to a new site by editing the site on the Robot —> Sites page.

Filter Behavior
  Description: This attribute provides two options: exclude documents that match this filter when the robot runs, or include documents that match this filter when the robot runs. By default, the first option is selected.

Properties

Click the Robot —> Properties tab. The Manage Properties page appears. The settings on this page control the robot’s operational parameters and defaults. It is divided into these sections: Crawling Speed, Completion Actions, Logfile Settings, Standard Compliance, Authentication Parameters, Proxy Settings, Link Following, Advanced Settings, and Indexing Settings.

The table below lists the attributes on the Manage Properties page and their descriptions.

Table 4–7 Manage Properties Attributes

Server Delay
  Default Value: No Delay
  Description: Choices are No Delay (default), 1 second, 2 seconds, 5 seconds, 10 seconds, 30 seconds, 1 minute, and 5 minutes.

Maximum Connections - Max concurrent retrieval URLs
  Default Value: 8
  Description: Choices are 1, 2, 4, 8 (default), 10, 12, 16, and 20.

Maximum Connections per Site
  Description: Choices are (no limit), 1, 2, 4, 8, 10, 12, 16, and 20.

Send RDs to Indexing every
  Default Value: 30 minutes
  Description: Choices are 3 minutes, 5 minutes, 10 minutes, 15 minutes, 30 minutes (default), 1 hour, 2 hours, 4 hours, and 8 hours.

Script to Launch
  Default Value: nothing
  Description: The default, nothing, launches no script. For sample scripts, see the cmdHook files in the /opt/SUNWportal/samples/robot directory (for the default installation).

After Processing all URLs
  Default Value: go idle
  Description: Choices are go idle (default), shut down, and start over.

Contact Email
  Default Value: Blank
  Description: Enter your own email address.

Log Level
  Default Value: 1 Generation
  Description: Choices are 0 Errors only; 1 Generation (default); 2 Enumeration, Conversion; 3 Filtering; 4 Spawning; 5 Retrieval.

User Agent
  Default Value: SunONERobot/6.2
  Description: Version of the search server, sent as the robot's user-agent string.

Ignore robots.txt protocol
  Default Value: No
  Description: Some servers have a robots.txt file that tells robots to keep out. If this attribute is No and the robot encounters this file on a site, it does not search the site. If this attribute is Yes, the robot ignores the file and searches the site. (A sample robots.txt appears after this table.)

Perform Authentication?
  Default Value: Yes
  Description: Choices are Yes and No.

Robot Username
  Default Value: Blank
  Description: The robot uses the anonymous user name to gain access to a site.

Password
  Default Value: Blank
  Description: Frequently, a site that allows anonymous users requires an email address as a password. This address is in plain text.

Proxy Username
  Default Value: Blank
  Description: The robot uses the anonymous user name to gain access to a site.

Password
  Default Value: Blank
  Description: Frequently, a site that allows anonymous users requires an email address as a password. This address is in plain text.

Proxy Connection Type
  Default Value: Proxy - Manual Configuration
  Description: Choices are Direct Internet Connection, Proxy - Auto Configuration, and Proxy - Manual Configuration.

Auto Proxy Configuration Type
  Default Value: Local Proxy File
  Description: Choices are Local Proxy File and Remote Proxy File.

Auto Proxy Configuration Location
  Default Value: Blank
  Description: The auto proxy configuration file lists all the proxy information needed. An example of a local proxy file is robot.pac. An example of a remote proxy file is http://proxy.sesta.com:8080/proxy.pac

Manual Proxy Configuration HTTP Proxy
  Default Value: Host Name:Port
  Description: Format: server1.sesta.com:8080. These three manual configuration values are put in the robot.pac file in the /var/opt/SUNWportal/searchservers/search1/config directory. (A sample robot.pac appears after this table.)

Manual Proxy Configuration HTTPS Proxy
  Default Value: Host Name:Port
  Description: This manual configuration value is put in the robot.pac file. Format: server1.sesta.com:8080

Manual Proxy Configuration FTP Proxy
  Default Value: Host Name:Port
  Description: This manual configuration value is put in the robot.pac file. Format: server1.sesta.com:8080

Follow Links in HTML
  Default Value: Yes
  Description: Extract hyperlinks from HTML.

maximum links
  Default Value: 1024
  Description: Limits the number of links the robot can extract from any one HTML resource. As the robot searches sites and discovers links to other resources, it could otherwise end up following huge numbers of links a great distance from its original starting point.

Follow Links in Plain Text
  Default Value: No
  Description: Extract hyperlinks from plain text.

maximum links
  Default Value: 1024
  Description: Limits the number of links the robot can extract from any one text resource.

Use Cookies
  Default Value: No
  Description: If checked, the robot uses cookies when it crawls. Some sites require cookies in order to be navigated correctly. The robot keeps its cookies in a file called cookies.txt in the robot state directory. The format of cookies.txt is the same as that used by the Netscape™ Communicator browser.

Use IP as Source
  Default Value: Yes
  Description: In most cases, the robot operates only on the domain name of a resource. In some cases, you might want to filter or classify resources based on subnets by Internet Protocol (IP) address. In that case, you must explicitly allow the robot to retrieve the IP address in addition to the domain name. Retrieving IP addresses requires an extra DNS lookup, which can slow the operation of the robot; if you do not need this option, you can turn it off to improve performance.

Enable Smart Host Heuristics
  Default Value: No
  Description: If checked, the robot converts common alternate host names used by a server to a single name. This is most useful where a site has a number of servers all aliased to the same address, such as www.sesta.com, with names such as www1.sesta.com, www2.sesta.com, and so on. When you select this option, the robot internally translates all host names starting with wwwn to www, where n is any integer; the attribute operates only on host names starting with wwwn. This attribute cannot be used when CNAME resolution is off (No). (A sketch of this translation appears after this table.)

Resolve Host Names to CNAMEs
  Default Value: No
  Description: If checked, the robot validates and resolves any host name it encounters into a canonical host name, which allows it to accurately track unique RDs. If unchecked, the robot validates host names without converting them to canonical form, so you may get duplicate RDs listed under the different host names the robot finds. For example, devedge.sesta.com is an alias for developer.sesta.com. With CNAME resolution on, a URL referenced as devedge.sesta.com is listed as being found on developer.sesta.com. With CNAME resolution off, the RD retains the original reference to devedge.sesta.com. Smart Host Heuristics cannot be enabled when CNAME resolution is off (No).

Accepts Commands from any Host
  Default Value: No
  Description: Most robot control functions operate through a TCP/IP port. This attribute controls whether commands to the robot must come from the local host system (No) or can come from anywhere on the network (Yes). It is recommended that you restrict direct robot control to the local host (No); you can still administer the robot remotely through the Administration Console.

Default Starting Point Depth
  Default Value: 10
  Description: Choices are 1 (starting points only), 2 (bookmark style), 3-10, and unlimited. This is the default value for the number of levels of hyperlinks the robot traverses from any starting point. You can set the depth for any given starting point by editing the site on the Robot —> Sites page.

Work Directory
  Default Value: /var/opt/SUNWportal/searchservers/search1/tmp
  Description: Full path name of a temporary working directory the robot can use to store data. The robot retrieves the entire contents of documents into this directory, often many at a time, so this space should be large enough to hold all of them at once.

State Directory
  Default Value: /var/opt/SUNWportal/searchservers/search1/robot
  Description: Full path name of a temporary directory the robot uses to store its state information, including the list of URLs it has visited, the URL pool, and so on. This database can be quite large, so you might want to locate it in a separate partition from the Work Directory.

Page Extraction Index
  Default Value: Partial Text
  Description: Full Text uses the complete document in the resource description. Partial Text uses only the specified number of bytes in the resource description.

extract first # bytes
  Default Value: 4096
  Description: Enter the number of bytes.

Extract Table Of Contents
  Default Value: Yes
  Description: Yes includes the table of contents in the resource description.

Extract data in META tags
  Default Value: Yes
  Description: Yes includes the META tags in the resource description.

Allow No Existing Classifications
  Default Value: Yes
  Description: Yes allows resource descriptions that match none of the existing classifications to be included.

Document Converters
  Default Value: All selected
  Description: If a converter is unselected, documents of that type are not indexed. The converters are Adobe PDF; Corel Presentations; Corel Quattro Pro; FrameMaker; Lotus Ami Pro; Lotus Freelance; Lotus Word Pro; Lotus 1-2-3; Microsoft Excel; Microsoft PowerPoint; Microsoft RTF; Microsoft Word; Microsoft Works; Microsoft Write; WordPerfect; StarOffice™ Calc; StarOffice Impress; StarOffice Writer; and XyWrite.

Converter Timeout
  Default Value: 600
  Description: Time in seconds allowed for converting one document to HTML. If this time is exceeded, the URL is excluded.
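
The Ignore robots.txt protocol attribute above refers to the standard robots exclusion file. A typical robots.txt, which the robot honors when that attribute is No, looks like the following sketch (the paths are hypothetical):

    # robots.txt at the server's document root
    User-agent: *
    Disallow: /private/
    Disallow: /cgi-bin/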
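
The robot.pac file named in the proxy attributes follows the standard proxy auto-configuration format: a JavaScript file that defines a FindProxyForURL function. A minimal sketch, reusing the example proxy host from the table (the routing logic is an assumption):

    // robot.pac: route all requests through the manually configured proxy
    function FindProxyForURL(url, host) {
        return "PROXY server1.sesta.com:8080";
    }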
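
The wwwn-to-www translation performed by Smart Host Heuristics can be pictured as a simple host-name rewrite. A hypothetical sketch:

    import re

    def smart_host(host):
        # Collapse host names starting with wwwN (N an integer) to www;
        # all other host names pass through unchanged.
        return re.sub(r"^www\d+(?=\.)", "www", host)

    assert smart_host("www2.sesta.com") == "www.sesta.com"
    assert smart_host("web1.sesta.com") == "web1.sesta.com"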

Indexing

The robot searches sites and collects documents based on the filters you have selected. The documents collected are in many different formats. To make them uniform and easily readable, they are converted to a single format, HTML. This page controls some of the parts that go into each resource description.

Simulator

You can find the simulator attributes on the Robot Utilities page under the Utilities tab. The simulator is a debugging tool that performs a partial simulation of robot filtering on a URL. You can type in a new URL to check. The simulator checks the URL, DNS translations (including Smart Host Heuristics), and site redirections. It does not check the contents of the document specified by the URL, so it does not detect duplications, MIME types, network errors, permissions, and the like. The simulator indicates whether each listed site would be accepted by the robot (ACCEPTED) or not (WARNING).

The table below provides the attributes in the Simulator section of the Robot Utilities page and their descriptions.

Table 4–8 Robot Simulator Attributes

Run Simulator on
  Default Value: URLs you have already defined, plus one blank text box
  Description: You can check access to a new site by typing its URL in the blank text box. This checks to see if the new site accepts crawling. Format: http://www.sesta.com:80/

Show advanced DNS information
  Default Value: Unselected
  Description: Selected displays more information about the site.

Check for server redirects
  Default Value: Selected
  Description: Selected checks for any server redirects.

Site Probe

The Site Probe attributes are also available on the Robot Utilities page. The probe is a debugging tool that checks for DNS aliases, server redirects, and virtual servers. It returns information about a site but does not test whether the site accepts crawling.

The table below provides the Site Probe attributes and their descriptions.

Table 4–9 Robot Site Probe Attributes

Run Site Probe on
  Default Value: Blank
  Description: Type in a URL in the format http://www.sesta.com:80

Show advanced DNS information
  Default Value: Unselected
  Description: Selected displays more information about the site, including IP addresses.