Sun Java System Portal Server 7.1 Technical Reference

Properties

Click the Robot -> Properties tab. The Manage Properties page appears. The settings on this page control the robot's operational parameters and defaults. The page is divided into these sections: Crawling Speed, Completion Actions, Logfile Settings, Standard Compliance, Authentication Parameters, Proxy Settings, Link Following, Advanced Settings, and Indexing Settings.

The following table describes the attributes on the Manage Properties page.

Table 4–7 Manage Properties Attributes

Attribute 

Default Value 

Description 

Server Delay 

No Delay 

No Delay (default), 1 second, 2 seconds, 5 seconds, 10 seconds, 30 seconds, 1 minute, 5 minutes. 

Maximum Connections - Max concurrent retrieval URLs 

8 

1, 2, 4, 8 (default), 10, 12, 16, 20. 

Maximum Connections per Site 

(no limit), 1, 2, 4, 8, 10, 12, 16, 20. 

Send RDs to Indexing every 

30 minutes 

3 minutes, 5 minutes, 10 minutes, 15 minutes, 30 minutes (default), 1 hour, 2 hours, 4 hours, 8 hours. 

Script to Launch 

nothing 

nothing (default). For sample files, see the cmdHook files in the /opt/SUNWportal/samples/robot directory (for the default installation).

After Processing all URLs 

go idle 

go idle (default), shut down, start over. 

Contact Email 

Blank 

Enter your own. 

Log Level 

1 Generation 

0 Errors only; 1 Generation (default); 2 Enumeration, Conversion; 3 Filtering; 4 Spawning; 5 Retrieval 

User Agent 

SunONERobot/6.2 

Version of the search server. 

Ignore robots.txt protocol 

No 

Some servers have a robots.txt file that asks robots to stay away. If your search robot encounters this file on a site and this attribute is set to No, the robot does not search the site. If this attribute is set to Yes, the robot ignores the file and searches the site.
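The effect of this attribute can be illustrated with a minimal Python sketch. It is not the robot's own implementation: the function name `may_fetch`, the sample robots.txt content, and the per-URL check performed by the standard `urllib.robotparser` module are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that asks all robots to stay out of /private/.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(url, user_agent, robots_txt, ignore_robots_txt=False):
    """Return True if a crawler with the given setting may fetch the URL."""
    if ignore_robots_txt:
        # Corresponds to "Ignore robots.txt protocol" set to Yes.
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With the attribute set to No, `may_fetch("http://www.sesta.com/private/a.html", "SunONERobot/6.2", SAMPLE_ROBOTS_TXT)` is refused, while the same call with `ignore_robots_txt=True` is allowed.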

Perform Authentication? 

Yes 

Yes, No. 

Robot Username 

Blank 

Robot uses the anonymous user name to gain access to a site. 

Password 

Blank 

Frequently a site that allows anonymous users requires an email address as a password. This address is in plain text. 

Proxy Username 

Blank 

Robot uses the anonymous user name to gain access to a site. 

Password 

Blank 

Frequently a site that allows anonymous users requires an email address as a password. This address is in plain text. 

Proxy Connection Type 

Proxy--Manual Configuration 

Direct Internet Connection, Proxy--Auto Configuration, Proxy--Manual Configuration 

Auto Proxy Configuration Type 

Local Proxy File 

Local Proxy File, Remote Proxy File 

Auto Proxy Configuration Location 

Blank 

The auto proxy has a file that lists all the proxy information needed. 

An example of a local proxy file is robot.pac. An example of a remote proxy file is http://proxy.sesta.com:8080/proxy.pac.

Manual Proxy Configuration HTTP Proxy 

Host Name:Port 

Format: server1.sesta.com:8080. These three manual configuration values are put in the robot.pac file in the /var/opt/SUNWportal/searchservers/search1/config directory. 

Manual Proxy Configuration HTTPS Proxy 

Host Name:Port 

This manual configuration value is put in the robot.pac file.

Format: server1.sesta.com:8080

Manual Proxy Configuration FTP Proxy 

Host Name:Port 

This manual configuration value is put in the robot.pac file.

Format: server1.sesta.com:8080

Follow Links in HTML 

Yes 

Extract hyperlinks from HTML 

maximum links 

1024 

Limits the number of links the robot can extract from any one HTML resource. As the robot searches sites and discovers links to other resources, it could conceivably end up following huge numbers of links a great distance from its original starting point. 
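As a rough illustration of a per-resource link limit, the following Python sketch stops collecting hyperlinks once the limit is reached. The class and function names are hypothetical, and the sketch only considers `href` values on `a` elements, which may be narrower than what the robot extracts.

```python
from html.parser import HTMLParser


class LimitedLinkExtractor(HTMLParser):
    """Collect href values from <a> tags, up to max_links per document."""

    def __init__(self, max_links=1024):
        super().__init__()
        self.max_links = max_links
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Ignore other tags, and ignore further links once the limit is hit.
        if tag != "a" or len(self.links) >= self.max_links:
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)
                break


def extract_links(html, max_links=1024):
    """Return at most max_links hyperlinks found in one HTML resource."""
    parser = LimitedLinkExtractor(max_links)
    parser.feed(html)
    parser.close()
    return parser.links
```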

Follow Links in Plain Text 

No 

Extract hyperlinks from plain text. 

maximum links 

1024 

Limits the number of links the robot can extract from any one text resource. 

Use Cookies 

No 

If checked, the robot uses cookies when it crawls. Some sites require the use of cookies in order for them to be navigated correctly. The robot keeps its cookies in a file called cookies.txt in the robot state directory. The format of cookies.txt is the same format as used by the Netscape™ Communicator browser.
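A file in the Netscape cookie format can be inspected with Python's standard `http.cookiejar` module. The sketch below is an illustration only: the helper names, the sample cookie, and the temporary file location are assumptions, and the real cookies.txt resides in the robot state directory.

```python
import http.cookiejar
import os
import tempfile


def load_cookies(path):
    """Return {name: value} for every cookie stored in a Netscape-format file."""
    jar = http.cookiejar.MozillaCookieJar()
    # Keep session and expired cookies too, so the whole file is visible.
    jar.load(path, ignore_discard=True, ignore_expires=True)
    return {cookie.name: cookie.value for cookie in jar}


def write_sample_cookie_file(directory):
    """Write a hypothetical cookies.txt with one cookie and return its path."""
    path = os.path.join(directory, "cookies.txt")
    # Tab-separated fields: domain, domain flag, path, secure, expiry, name, value.
    fields = [".sesta.com", "TRUE", "/", "FALSE", "2147483647", "session", "abc123"]
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("# Netscape HTTP Cookie File\n")
        handle.write("\t".join(fields) + "\n")
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        print(load_cookies(write_sample_cookie_file(directory)))
```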

Use IP as Source 

Yes 

In most cases, the robot operates only on the domain name of a resource. In some cases, you might want to be able to filter or classify resources based on subnets by Internet Protocol (IP) address. In that case, you must explicitly allow the robot to retrieve the IP address in addition to the domain name. Retrieving IP addresses requires an extra DNS lookup, which can slow the operation of the robot. If you do not need this option, you can turn it off to improve performance. 
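The kind of subnet test that requires an IP address can be sketched in Python as follows. The function name `in_subnet` and the injectable `resolve` parameter are hypothetical. The default resolver performs the extra DNS lookup mentioned above, so passing a stub avoids network access when experimenting.

```python
import ipaddress
import socket


def in_subnet(host, subnet, resolve=socket.gethostbyname):
    """Return True if the host's IPv4 address falls inside the subnet."""
    # The resolve call is the extra DNS lookup that can slow a crawl.
    address = ipaddress.ip_address(resolve(host))
    return address in ipaddress.ip_network(subnet)
```

For example, `in_subnet("www.sesta.com", "192.0.2.0/24", resolve=lambda h: "192.0.2.17")` evaluates to `True` because the stubbed address lies inside the given network.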

Enable Smart Host Heuristics 

No 

If checked, the robot converts common alternate host names used by a server to a single name. This is most useful in cases where a site has a number of servers all aliased to the same address, such as www.sesta.com, which often have names such as www1.sesta.com, www2.sesta.com, and so on.

When you select this option, the robot will internally translate all host names starting with wwwn to www, where n is any integer. This attribute only operates on host names starting with wwwn.

This attribute cannot be used when CNAME resolution is OFF (No). 
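The wwwn-to-www translation described for this attribute can be approximated with a regular expression. This is a minimal Python sketch under the assumption that n is one or more decimal digits directly after www; the function name is hypothetical and the robot's actual matching rules may differ.

```python
import re

# Matches a leading "www" followed by one or more digits and a dot.
_WWW_N = re.compile(r"^www\d+\.", re.IGNORECASE)


def normalize_host(host):
    """Translate host names starting with wwwN to www; leave others unchanged."""
    return _WWW_N.sub("www.", host)
```

Under this sketch, www1.sesta.com and www2.sesta.com both become www.sesta.com, while a name such as mail2.sesta.com is left alone.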

Resolve Host Names to CNAMEs 

No 

If checked, the robot validates and resolves any host name it encounters into a canonical host name. This allows the robot to accurately track unique RDs. If unchecked, the robot validates host names without converting them to the canonical form. As a result, you might get duplicate RDs listed under the different host names found by the robot. 

For example, devedge.sesta.com is an alias for developer.sesta.com. With CNAME resolution on, a URL referenced as devedge.sesta.com is listed as being found on developer.sesta.com. With CNAME resolution off, the RD retains the original reference to devedge.sesta.com.

Smart Host Heuristics cannot be enabled when CNAME resolution is OFF (No). 
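Canonical name resolution of this kind can be sketched in Python with the standard `socket.gethostbyname_ex` call, which returns the canonical host name together with its aliases and addresses. The function name and the injectable `lookup` parameter are hypothetical, and falling back to the original name on a failed lookup is an assumption of this sketch, not documented robot behavior.

```python
import socket


def canonical_host(host, lookup=socket.gethostbyname_ex):
    """Return the canonical name for a host, or the host itself on failure."""
    try:
        canonical_name, _aliases, _addresses = lookup(host)
    except OSError:
        # Resolution failed; keep the name as it was referenced.
        return host
    return canonical_name.lower()
```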

Accepts Commands from any Host 

No 

Most robot control functions operate through a TCP/IP port. This attribute controls whether commands to the robot must come from the local host system (No), or whether they can come from anywhere on the network (Yes). 

It is recommended that you restrict direct robot control to the local host (No). You can still administer the robot remotely through the Administration Console. 

Default Starting Point Depth 

10 

1 (starting points only), 2 (bookmark style), 3-10, unlimited. 

Default value for the levels of hyperlinks the robot traverses from any starting point. You can set the depth for any given starting point by editing the site on the Robot, Sites page. 
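A depth limit of this kind can be pictured as a breadth-first traversal that stops expanding links at a given level. In the Python sketch below, the `links` dictionary is a hypothetical stand-in for fetching pages, and the reading that depth 1 covers only the starting point while depth 2 adds the pages it links to is an assumption based on the value labels.

```python
from collections import deque


def crawl(start, links, max_depth=10):
    """Return URLs in visit order; max_depth=None means unlimited depth.

    links maps each URL to the URLs it references.
    """
    seen = {start}
    order = []
    queue = deque([(start, 1)])  # the starting point is at depth 1
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if max_depth is not None and depth >= max_depth:
            continue  # do not follow links beyond the configured depth
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order
```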

Work Directory 

/var/opt/SUNWportal/searchservers/search1/tmp

Full pathname of a temporary working directory the robot can use to store data. The robot retrieves the entire contents of documents into this directory, often many at a time, so this space should be large enough to handle all of those at once. 

State Directory 

/var/opt/SUNWportal/searchservers/search1/robot

Full pathname of a temporary directory the robot uses to store its state information, including the list of URLs it has visited, the URL pool, and so on. This database can be quite large, so you might want to locate it in a separate partition from the Work Directory.

Page Extraction Index 

Partial Text 

Full Text uses the complete document in the resource description. Partial Text uses only the specified number of bytes in the resource description. 

extract first # bytes 

4096 

Enter the number of bytes. 
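Partial text extraction by byte count can be sketched in Python as a simple slice. The function name is hypothetical, and the UTF-8 decoding that discards a character cut in half at the limit is an assumption of this sketch.

```python
def partial_text(document: bytes, limit: int = 4096) -> str:
    """Return the text held in the first `limit` bytes of a document."""
    # errors="ignore" drops a multi-byte character split by the cut.
    return document[:limit].decode("utf-8", errors="ignore")
```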

Extract Table Of Contents 

Yes 

Yes includes the Table of Contents in the resource description. 

Extract data in META tags 

Yes 

Yes includes the META tags in the resource description. 

Allow No Existing Classifications 

Yes 

Yes allows none of the existing classifications in the resource description. 

Document Converters 

All selected; if unselected, that type of document cannot be indexed. 

Adobe PDF 

Corel Presentations 

Corel Quattro Pro 

FrameMaker 

Lotus Ami Pro 

Lotus Freelance 

Lotus Word Pro 

Lotus 1-2-3 

Microsoft Excel 

Microsoft PowerPoint 

Microsoft RTF 

Microsoft Word 

Microsoft Works 

Microsoft Write 

WordPerfect 

StarOffice™ Calc 

StarOffice Impress 

StarOffice Writer 

XyWrite 

Converter Timeout 

600 

Time in seconds allowed for one document to be converted to HTML. If this time is exceeded, the URL is excluded.
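A conversion timeout of this kind can be sketched in Python by running the converter as a child process with a time limit. The function name and the stand-in converter command are hypothetical, and returning `None` to signal that the document should be excluded is an assumption of this sketch.

```python
import subprocess
import sys


def convert_with_timeout(command, timeout_seconds=600):
    """Run a converter command; return its output, or None on timeout or failure."""
    try:
        result = subprocess.run(
            command,
            capture_output=True,
            timeout=timeout_seconds,
            check=True,
        )
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        # The caller would exclude the URL in this case.
        return None
    return result.stdout


if __name__ == "__main__":
    # Stand-in converter: the Python interpreter printing a fragment of HTML.
    print(convert_with_timeout([sys.executable, "-c", "print('<p>ok</p>')"], 30))
```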