Sun Java System Portal Server 6 2005Q1 Technical Reference Guide 

Chapter 5
Search Attributes: Robot

The properties for the robot cover every stage of a search: you can select the sites to be searched or crawled, check whether a site is valid, define which types of documents are picked up, and schedule when searches take place.

This chapter contains the following sections:

Overview
Sites
Filters
Crawling
Indexing
Simulator
Site Probe
Schedule

Overview

The Robot Overview panel shows what the robot is doing: whether it is Off, Idle, Running, or Paused, and, if it is Running, what progress it is making in the search. The panel is refreshed about every 30 seconds; the refresh rate is defined by the robot-refresh parameter in the search.conf file.
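
For reference, the entry would look something like the following in search.conf. The key=value form and the value of 30 (seconds) shown here are assumptions based on the default behavior described above; check your installed file for the exact syntax.

robot-refresh=30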

The two buttons at the top right change with the robot's state. If the robot is Off, the buttons are Start and Remove Status. If it is Running or Idle, they are Stop and Pause. If it is Paused, they are Stop and Resume. Selecting any of the attributes takes you to the Reports section, where you can get a detailed, up-to-the-minute report on that attribute.

Table 5-1 describes the attributes in the Robot Overview panel.

Table 5-1  Robot Overview Attributes

| Attribute | Default Value | Description |
|---|---|---|
| The Robot is | Current activity | The robot's state: Idle, Running, Paused, or Off. |
| Updated at | Date and time last refreshed | The page is refreshed regularly to keep you aware of the progress the robot is making. |
| Starting Points | Number defined | Number of sites you have selected to be searched. A site is disabled (not included in a search) on the Robot, Sites page. |
| URL Pool | Number of URLs waiting | Number of URLs yet to be investigated. When you begin a search, the starting point URLs are entered into the URL pool. As the search progresses, the robot discovers links to other URLs, which are added to the pool. After all the URLs in the pool have been processed, the URL pool is empty and the robot is idle. |
| Extracting | Number of connections per second | Number of resources examined per second. Extracting is the process of discovering or locating resources, documents, or hyperlinks to be included in the database and filtering out unwanted items. |
| Filtering | Number of URLs rejected | Total number of URLs that are excluded. |
| Indexing | Number of URLs per second | Number of resources or documents turned into resource descriptions per second. Indexing is the phase in which all the information gathered about a document is turned into a resource description for inclusion in the search database. |
| Excluded URLs | Number of URLs excluded by filters | Number of URLs that did not meet the filtering criteria. |
|  | Number of URLs excluded by errors | Number of URLs where the robot encountered errors, such as file not found. |
| Resource Descriptions | Number of RDs contributed | Number of resource descriptions added to the database. |
|  | Number of bytes of RDs contributed | Number of bytes added to the database. |
| General Stats | Number of URLs retrieved | Number of URLs retrieved during the run. |
|  | Average size of RDs in bytes | Average number of bytes per resource description. |
|  | Time running (days, hours, minutes, and seconds) | The amount of time the robot has been running. |
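
Taken together, these counters describe the stages of one crawl cycle. The following JavaScript sketch is purely illustrative, not the robot's implementation; the callback arguments (accept, extractLinks, buildRD) are hypothetical stand-ins. It shows how starting points seed a URL pool that is drained by extraction, filtering, and indexing until the robot goes idle.

// Illustrative sketch of the crawl cycle described by Table 5-1.
function crawl(startingPoints, accept, extractLinks, buildRD) {
  const pool = [...startingPoints];          // "URL Pool": URLs waiting
  const seen = new Set(startingPoints);
  const stats = { retrieved: 0, rejected: 0, rds: 0 };

  while (pool.length > 0) {                  // empty pool => robot goes idle
    const url = pool.shift();
    stats.retrieved++;                       // "General Stats": URLs retrieved

    if (!accept(url)) {                      // "Filtering": URLs rejected
      stats.rejected++;
      continue;
    }
    for (const link of extractLinks(url)) {  // "Extracting": discover hyperlinks
      if (!seen.has(link)) {
        seen.add(link);
        pool.push(link);                     // newly found URLs join the pool
      }
    }
    buildRD(url);                            // "Indexing": build a resource description
    stats.rds++;                             // "Resource Descriptions": RDs contributed
  }
  return stats;                              // all URLs processed: robot is idle
}

// Example with trivial stand-in callbacks:
crawl(["http://www.sesta.com/"], () => true, () => [], () => {});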


Sites

The initial page in this section shows what sites are available to be searched.

A site can be enabled (On) or disabled (Off) using the radio buttons. A disabled site is not searched when the robot runs. The Edit link displays a page where you can change how a search site is defined.

To delete a site, check the checkbox and select Delete.

To add a new site, select New. Enter a URL or domain in the text box and select a depth for the search. Select Create to use the default values, or select Create and Edit to go to the Edit page and define the search site with non-default values.

Table 5-2 describes the attributes in the Robot Manage Sites page.

Table 5-2  Robot Manage Sites Attributes

| Attribute | Default Value | Description |
|---|---|---|
| Lock or cluster graphic | Status of site | An open lock means the URL is accessible. A closed lock means the site is a secure web server and uses SSL. A cluster means the site is a domain. |
| On/Off | On | Choose whether this site is searched when the robot is run. |

The New Site page allows you to set up an entire site for indexing. Table 5-3 describes the attributes in the Robot New Site page.

Table 5-3  Robot New Site Attributes

| Attribute | Default Value | Description |
|---|---|---|
| New site | URL | URL format: http://www.sesta.com. Domain format: *.sesta.com |
| Depth | 10 | 1 for this URL only, 2 for this URL and first links, 3 through 10, or unlimited. The default value is set on the Robot, Crawling page. |

The Edit page is where you define the search site more completely. You can specify what type of server it is, redefine the depth of the search, and select which types of files to add to the search database. The attributes for URL and Domain sites are mostly the same; the additional URL/Domain column in Table 5-4 shows which attributes are shared and which apply to only one site type.

A number of actions are performed on this page. You can verify the server name for the search site you entered, add more servers to the server group by selecting Add in the Server Group section, and add more starting points by selecting Add in the Starting Points section. In the Filter Definition section, you can add or delete filters, exclude or include certain types of files, and change the order in which the filters for these files are applied.

Table 5-4 describes the attributes in the Sites Edit page.

Table 5-4  Robot Sites Edit Attributes

| Attribute | URL/Domain | Default Value | Description |
|---|---|---|---|
| Site Nickname | URL/D | Site entered, for example www.sesta.com |  |
| Checkbox to select site for deletion or verification | URL/D | Unchecked | Unchecked: not selected. Checked: selected. |
| Server Group - Name | URL | URL entered, for example www.sesta.com | Either a single server or part of a single server. The entry must include the full host name. If you specify just a host name, the site is limited to that host. If you provide directory information in addition to the host name, the site is defined as only that directory and any of its subdirectories. |
| Domain Suffix | D | Domain entered, for example *.sesta.com | Includes all servers within a domain, such as *.sesta.com. |
| Port | URL/D | 80 for URL; blank for Domain | If the site you are searching uses a different port, enter it here. |
| Type | URL | Web Server | Web Server, File Server, FTP Server, or Secure Web Server. |
| Allowed Protocols | D | All checkboxes checked | Checkboxes for http, file, ftp, and https. |
| Starting Points - Checkbox to select starting point for deletion | URL/D | Unchecked | Unchecked: not selected. Checked: selected. |
| Starting Points - URL | URL/D | http://URL:80 | URL or domain. |
| Starting Points - Depth | URL/D | 10 | 1 for this URL only, 2 for this URL and first links, 3 through 10, or unlimited. |
| Filter Definition - Checkbox to select file type for deletion | URL/D | Unchecked | Unchecked: not selected. Checked: selected. |
| Filter Definitions | URL/D | In this order, the defaults are: Archive Files; Audio Files; Backup Files; Binary Files; CGI Files; Image Files; Java, Javascript, Style Sheet Files; Log Files; Revision Control Files; Source Code Files; Temporary Files; Video Files. | The possible choices are: Archive Files; Audio Files; Backup Files; Binary Files; CGI Files; Image Files; Java, Javascript, Style Sheet Files; Log Files; Power Point Files; Revision Control Files; Source Code Files; Temporary Files; Video Files; Spreadsheet Files; Plug-in Files; Lotus Domino Documents; Lotus Domino OpenViews; System Directories (UNIX); System Directories (NT). |
| Comment | URL/D | Blank | Text field that describes the site to you. It is not used by the robot. |
| DNS Translation | URL | Blank | Modifies the URL and the way it is crawled by replacing a domain name or alias with a CNAME. Format: alias1->cname1,alias2->cname1 |
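
For example, using the alias and canonical host from the CNAME discussion on the Crawling page, a DNS Translation entry mapping one alias to its canonical name would read:

devedge.sesta.com->developer.sesta.com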


Filters

The initial page in this section shows all the defined filter rules and the site definitions that use them. Each filter name is preceded by a checkbox to select that document type and two radio buttons to turn the filter rule On or Off. If a checkbox is checked, the filter is selected and can be deleted. You can add a new filter by selecting New; the New Filter page is an abbreviated Edit page, requiring only a Nick Name and one rule. Selecting the Edit link takes you to a page where you define the rules for that file type, that is, what that filter does. Each rule is made up of a drop-down list of filter sources, a drop-down list to filter by, and a text box in which to enter the filter string itself (a sketch of how such rules evaluate follows Table 5-5).

Table 5-5 describes the Robot Filter Edit attributes.

Table 5-5  Robot Filter Edit Attributes

| Attribute | Default Value | Description |
|---|---|---|
| Filter Name | Prompts for a new name; displays the name of the filter you chose to edit. | A descriptive name that reflects the type of file the filter applies to. |
| Drop-down list of filter sources | URL for a new filter; displays the previously chosen source for an existing filter. | URL, protocol, host, path, or MIME type. |
| Drop-down list to filter by | is for a new filter; displays the previously chosen operator for an existing filter. For example, Binary Files ends with exe. | is, contains, begins with, ends with, or regular expression. |
| Text box for specifics (directory, protocol, file extensions) | Blank for a new filter; displays the previously entered string for an existing filter. For example, Temporary Files contains /tmp/. | List what you want to match. For example, http://docs.sesta.com/manual.html matches "protocol is http", "host contains sesta", and "file ends with html". |
| Description | Prompts for a new description; displays the previously entered description for an existing filter. | Describes the filter rule for you. The robot does not use it. |
| New Site | True (checked) for a new filter; displays the previously chosen value for an existing filter. | Use this as one of the default filters when creating new sites. If you do not check this, you can still add this filter to a new site by editing the site on the Robot, Sites page. |
| By Default | Nothing selected for a new filter; displays the default previously selected for an existing filter. | Exclude documents matching this filter, or include documents matching this filter. The selection for a new filter does not affect existing site definitions; to use a new filter on an existing site, add it by editing the site on the Robot, Sites page. |
| Deployment | Lists the sites that use this filter. |  |
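
As promised above, here is a JavaScript sketch of how a rule of the form source / filter-by / string can be evaluated against a URL. It is illustrative only, not the product's implementation, and it covers only the URL-derived sources (MIME type is omitted because determining it requires retrieving the document). The worked example is the one from Table 5-5.

// Illustrative evaluation of a filter rule (source / filter-by / string).
function matches(value, filterBy, str) {
  switch (filterBy) {
    case "is":                 return value === str;
    case "contains":           return value.includes(str);
    case "begins with":        return value.startsWith(str);
    case "ends with":          return value.endsWith(str);
    case "regular expression": return new RegExp(str).test(value);
  }
  return false;
}

// Filter sources derived from the URL itself.
function filterSource(urlString, name) {
  const u = new URL(urlString);
  return {
    "URL":      urlString,
    "protocol": u.protocol.replace(":", ""),
    "host":     u.hostname,
    "path":     u.pathname,
  }[name];
}

// The worked example from the table: http://docs.sesta.com/manual.html
const url = "http://docs.sesta.com/manual.html";
console.log(matches(filterSource(url, "protocol"), "is", "http"));    // true
console.log(matches(filterSource(url, "host"), "contains", "sesta")); // true
console.log(matches(filterSource(url, "path"), "ends with", "html")); // true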

 


Crawling

The settings on this page control the robot's operational parameters and defaults. The page is divided into these sections: Speed, Completion Actions, Logfile Settings, Standards Compliance, Authentication Parameters, Proxying, Advanced Settings, and Link Extraction.

Table 5-6 describes the Robot Crawling attributes.

Table 5-6  Robot Crawling Attributes

| Attribute | Default Value | Description |
|---|---|---|
| Server Delay | No Delay | No Delay (default), 1 second, 2 seconds, 5 seconds, 10 seconds, 30 seconds, 1 minute, or 5 minutes. |
| Maximum Connections (max concurrent retrieval URLs) | 8 | 1, 2, 4, 8 (default), 10, 12, 16, or 20. |
| Maximum Connections per Site | 2 | (no limit), 1, 2, 4, 8, 10, 12, 16, or 20. |
| Send RDs to Indexing every | 30 minutes | 3 minutes, 5 minutes, 10 minutes, 15 minutes, 30 minutes (default), 1 hour, 2 hours, 4 hours, or 8 hours. |
| Script to Launch | nothing | nothing (default). For sample scripts, see the cmdHook files in the /opt/SUNWps/samples/robot directory (for the default installation). |
| After Processing all URLs | go idle | go idle (default), shut down, or start over. |
| Contact Email | user@domain | Enter your own email address. |
| Log Level | 1 - Generation | 0 Errors only; 1 Generation (default); 2 Enumeration, Conversion; 3 Filtering; 4 Spawning; 5 Retrieval. |
| User Agent | SunJavaSystemRobot/6.0 | Version of the search server. |
| Ignore robots.txt protocol | False (unchecked) | Some servers have a robots.txt file that tells robots to stay away. If the robot encounters this file on a site and this attribute is false, it does not search the site. If this attribute is true, the robot ignores the file and searches the site. |
| Perform Authentication | Yes | Yes or No. |
| Robot Username | anonymous | The robot uses the anonymous user name to gain access to a site. |
| Password | user@domain | Frequently, a site that allows anonymous users requires an email address as a password. This address is in plain text. |
| Proxy Username | anonymous | The robot uses the anonymous user name to gain access to a site. |
| Password | user@domain | Frequently, a site that allows anonymous users requires an email address as a password. This address is in plain text. |
| Proxy Connection Type | Direct Internet Connection | Direct Internet Connection, Proxy--Auto Configuration, or Proxy--Manual Configuration. |
| Auto Proxy Configuration Type | Local Proxy File | Local Proxy File or Remote Proxy File. |
| Auto Proxy Configuration Location | Blank | The auto proxy configuration file lists all the proxy information needed. An example of a local proxy file is robot.pac. An example of a remote proxy file is http://proxy.sesta.com:8080/proxy.pac |
| Manual Configuration HTTP Proxy | Blank | Format: server1.sesta.com:8080. These three manual configuration values are put in the robot.pac file in the /var/opt/SUNWps/https-servername/portal/config directory. |
| Manual Configuration HTTPS Proxy | Blank | This manual configuration value is put in the robot.pac file. Format: server1.sesta.com:8080 |
| Manual Configuration FTP Proxy | Blank | This manual configuration value is put in the robot.pac file. Format: server1.sesta.com:8080 |
| Follow links in HTML | True (checked) | Extract hyperlinks from HTML. |
| maximum links | 1024 | Limits the number of links the robot can extract from any one HTML resource. As the robot searches sites and discovers links to other resources, it could otherwise end up following huge numbers of links a great distance from its original starting point. |
| Follow links in plain text | False (unchecked) | Extract hyperlinks from plain text. |
| maximum links | 1024 | Limits the number of links the robot can extract from any one text resource. |
| Use Cookies | False (unchecked) | If checked, the robot uses cookies when it crawls. Some sites require cookies in order to be navigated correctly. The robot keeps its cookies in a file called cookies.txt in the robot state directory; the format is the same as that used by the Netscape™ Communicator browser. |
| Use IP as Source | True (checked) | In most cases, the robot operates only on the domain name of a resource. In some cases, you might want to filter or classify resources by subnet, based on Internet Protocol (IP) address. In that case, you must explicitly allow the robot to retrieve the IP address in addition to the domain name. Retrieving IP addresses requires an extra DNS lookup, which can slow the robot, so if you do not need this option, turn it off to improve performance. |
| Smart Host Heuristics | False (unchecked) | If checked, the robot converts common alternate host names used by a server to a single name. This is most useful where a site has a number of servers all aliased to the same address, such as www.sesta.com, with names such as www1.sesta.com, www2.sesta.com, and so on. When this option is selected, the robot internally translates all host names starting with wwwn to www, where n is any integer; it operates only on host names starting with wwwn. This attribute cannot be used when CNAME resolution is off (false). |
| Resolve hostnames to CNAMEs | False (unchecked) | If checked, the robot validates and resolves any host name it encounters into a canonical host name, allowing it to accurately track unique RDs. If unchecked, the robot validates host names without converting them to canonical form, so you may get duplicate RDs listed under the different host names found by the robot. For example, devedge.sesta.com is an alias for developer.sesta.com. With CNAME resolution on, a URL referenced as devedge.sesta.com is listed as being found on developer.sesta.com; with CNAME resolution off, the RD retains the original reference to devedge.sesta.com. Smart Host Heuristics cannot be enabled when CNAME resolution is off (false). |
| Accepts commands from ANY host | False (unchecked) | Most robot control functions operate through a TCP/IP port. This attribute controls whether commands to the robot must come from the local host system (false) or can come from anywhere on the network (true). It is recommended that you restrict direct robot control to the local host (false); you can still administer the robot remotely through the Administration Console. |
| Default Starting Point Depth | 10 | 1 (starting points only), 2 (bookmark style), 3 through 10, or unlimited. The default value for the levels of hyperlinks the robot traverses from any starting point. You can set the depth for any given starting point by editing the site on the Robot, Sites page. |
| Work Directory | /var/opt/SUNWps/https-servername/portal/tmp | Full path name of a temporary working directory the robot can use to store data. The robot retrieves the entire contents of documents into this directory, often many at a time, so this space should be large enough to hold all of them at once. |
| State Directory | /var/opt/SUNWps/https-servername/portal/robot | Full path name of a temporary directory the robot uses to store its state information, including the list of URLs it has visited, the URL pool, and so on. This database can be quite large, so you might want to locate it on a separate partition from the Work Directory. |
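
The robot.pac file referenced in the proxy rows above is a standard proxy auto-configuration (PAC) file, which is a small JavaScript function named FindProxyForURL. A minimal sketch follows; the proxy host is the placeholder from the Format examples above, and the bypass rule for *.sesta.com is only an illustration, not a recommended configuration.

// Minimal proxy auto-configuration (PAC) sketch for a file such as robot.pac.
// A PAC file defines FindProxyForURL(url, host); shExpMatch is a standard
// PAC helper available in PAC environments.
function FindProxyForURL(url, host) {
  if (shExpMatch(host, "*.sesta.com")) {
    return "DIRECT";                        // crawl the local domain directly
  }
  return "PROXY server1.sesta.com:8080";    // route everything else via the proxy
}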


Indexing

The robot searches sites and collects documents based on the filters you have selected. The collected documents arrive in many different formats; to make them uniform and easily readable, they are all converted to a single format, HTML. This page controls some of the parts that go into each resource description.

Table 5-7 describes the Robot Index attributes.

Table 5-7  Robot Index Attributes

| Attribute | Default Value | Description |
|---|---|---|
| Full Text or Partial Text | Partial Text | Full uses the complete document in the resource description. Partial uses only the specified number of bytes in the resource description. |
| extract first # bytes | 4096 | Enter the number of bytes. |
| Extract Table Of Contents | True (checked) | True includes the table of contents in the resource description. |
| Extract data in META tags | True (checked) | True includes the META tags in the resource description. |
| Document Converters | All checked (true); if false, that type of document cannot be indexed. | Adobe PDF; Corel Presentations; Corel Quattro Pro; FrameMaker; Lotus Ami Pro; Lotus Freelance; Lotus Word Pro; Lotus 1-2-3; Microsoft Excel; Microsoft PowerPoint; Microsoft RTF; Microsoft Word; Microsoft Works; Microsoft Write; WordPerfect; StarOffice™ Calc; StarOffice™ Impress; StarOffice™ Writer; XyWrite. |
| Converter Timeout | 600 | Time in seconds allowed for converting one document to HTML. If this time is exceeded, the URL is excluded. |


Simulator

This page is a debugging tool that performs a partial simulation of robot filtering on a URL. You can type in a new URL to check. It checks the URL, DNS translations (including Smart Host Heuristics), and site redirections. It does not check the contents of the document specified by the URL, so it does not detect duplications, MIME types, network errors, permissions, and the like. The simulator indicates whether the listed sites would be accepted by the robot (ACCEPTED) or not (WARNING).

Table 5-8 describes the Robot Simulator properties.

Table 5-8  Robot Simulator Properties

| Attribute | Default Value | Description |
|---|---|---|
| URL | URLs you have already defined, plus one blank text box. | You can check access to a new site by typing its URL in the blank text box. This checks whether the new site accepts crawling. Format: http://www.sesta.com:80/ |
| Check for DNS aliases | True (checked) | True (checked) checks for servers aliased to the same address. |
| Check for Server Redirects (302) | True (checked) | True (checked) checks for any server redirects. |


Site Probe

This page is a debugging tool that checks for DNS aliases, server redirects, and virtual servers. It returns information about a site but does not test whether the site accepts crawling.

Table 5-9 describes the Robot Site Probe attributes.

Table 5-9  Robot Site Probe Attributes

| Attribute | Default Value | Description |
|---|---|---|
| Site | Blank | Type in a URL in the format http://www.sesta.com:80 |
| Show advanced DNS information | False (unchecked) | True (checked) displays more information about the site, including IP addresses. |


Schedule

This page is where you set up the automatic search schedule for the robot. Table 5-10 describes the Robot Schedule attributes.

Table 5-10  Robot Schedule Attributes

| Attribute | Default Value | Description |
|---|---|---|
| Start Robot Time (in hours and minutes) | 00:00 | The time at which the robot starts to search. |
| Days | None selected | Sun, Mon, Tue, Wed, Thu, Fri, or Sat. Check at least one day. |
| Stop Robot Time (in hours and minutes) | 00:00 | If you plan to run the robot continuously, it is recommended that you stop and restart it at least once per day. This gives the robot a chance to release resources and reinitialize itself. |
| Days | None selected | Sun, Mon, Tue, Wed, Thu, Fri, or Sat. |




