Server Delay
|
No Delay
|
No Delay (default), 1 second, 2 seconds, 5 seconds, 10 seconds, 30 seconds,
1 minute, 5 minutes.
|
Maximum Connections (maximum number of concurrent retrieval URLs)
|
8
|
1, 2, 4, 8 (default), 10, 12, 16, 20.
|
Maximum Connections per Site
|
2
|
(no limit), 1, 2 (default), 4, 8, 10, 12, 16, 20.
|
Send RDs to Indexing every
|
30 minutes
|
3 minutes, 5 minutes, 10 minutes, 15 minutes, 30 minutes (default),
1 hour, 2 hours, 4 hours, 8 hours.
|
Script to Launch
|
nothing
|
nothing (default). For sample files, see the cmdHook files
in the /opt/SUNWportal/samples/robot directory (for the
default installation).
|
After Processing all URLs
|
go idle
|
go idle (default), shut down, start over.
|
Contact Email
|
Blank
|
Enter your own.
|
Log Level
|
1 Generation
|
0 Errors only; 1 Generation (default); 2 Enumeration, Conversion; 3
Filtering; 4 Spawning; 5 Retrieval
|
User Agent
|
SunONERobot/6.2
|
Version of the search server.
|
Ignore robots.txt protocol
|
No
|
Some servers have a robots.txt file that tells robots
not to crawl the site. If the robot encounters this file on a site and
this attribute is set to No, it does not search the site. If this attribute is
set to Yes, the robot ignores the file and searches the site.
|
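The robots.txt check described above can be sketched with Python's standard urllib.robotparser; the rules string and URLs below are illustrative, not taken from any real site:

```python
# Minimal sketch of the robots.txt check a crawler performs.
# The rules string stands in for a fetched robots.txt file.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# With "Ignore robots.txt protocol" set to No, the robot skips
# any URL the rules disallow.
print(parser.can_fetch("SunONERobot/6.2", "http://example.com/private/a.html"))  # False
print(parser.can_fetch("SunONERobot/6.2", "http://example.com/public/a.html"))   # True
```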
Perform Authentication?
|
Yes
|
Yes (default), No.
|
Robot Username
|
Blank
|
The robot uses the anonymous user name to gain access to a site.
|
Password
|
Blank
|
Frequently a site that allows anonymous users requires an email address
as a password. This address is in plain text.
|
Proxy Username
|
Blank
|
The robot uses the anonymous user name to gain access to a site.
|
Password
|
Blank
|
Frequently a site that allows anonymous users requires an email address
as a password. This address is in plain text.
|
Proxy Connection Type
|
Proxy--Manual Configuration
|
Direct Internet Connection, Proxy--Auto Configuration, Proxy--Manual
Configuration
|
Auto Proxy Configuration Type
|
Local Proxy File
|
Local Proxy File, Remote Proxy File
|
Auto Proxy Configuration Location
|
Blank
|
The auto proxy configuration file lists all of the proxy information needed.
An example of a local proxy file is robot.pac.
An example of a remote proxy file is http://proxy.sesta.com:8080/proxy.pac
|
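A minimal sketch of what a proxy auto-configuration file such as robot.pac can contain; PAC files are JavaScript by convention, and the proxy host below reuses the example server name from this table. Real PAC files often use helper functions such as isPlainHostName, which this sketch avoids so it stays self-contained:

```javascript
// Minimal proxy auto-configuration (PAC) sketch. The browser or
// robot calls FindProxyForURL for each URL it retrieves.
function FindProxyForURL(url, host) {
    // Plain (unqualified) host names are reached directly;
    // everything else goes through the HTTP proxy.
    if (host.indexOf(".") === -1)
        return "DIRECT";
    return "PROXY server1.sesta.com:8080";
}
```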
Manual Proxy Configuration HTTP Proxy
|
Host Name:Port
|
Format: server1.sesta.com:8080. These three manual
configuration values are put in the robot.pac file in the
/var/opt/SUNWportal/searchservers/search1/config directory.
|
Manual Proxy Configuration HTTPS Proxy
|
Host Name:Port
|
This manual configuration value is put in the robot.pac file.
Format: server1.sesta.com:8080
|
Manual Proxy Configuration FTP Proxy
|
Host Name:Port
|
This manual configuration value is put in the robot.pac file.
Format: server1.sesta.com:8080
|
Follow Links in HTML
|
Yes
|
Extract hyperlinks from HTML.
|
maximum links
|
1024
|
Limits the number of links the robot can extract from any one HTML resource.
As the robot searches sites and discovers links to other resources, it could
conceivably end up following huge numbers of links a great distance from its
original starting point.
|
Follow Links in Plain Text
|
No
|
Extract hyperlinks from plain text.
|
maximum links
|
1024
|
Limits the number of links the robot can extract from any one text resource.
|
Use Cookies
|
No
|
If checked, the robot uses cookies when it crawls. Some sites require
the use of cookies in order for them to be navigated correctly. The robot
keeps its cookies in a file called cookies.txt in the
robot state directory. The format of cookies.txt is the
same format as used by the Netscape™ Communicator browser.
|
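The Netscape cookies.txt format mentioned above can be read with Python's standard http.cookiejar; the cookie domain and values below are illustrative:

```python
# Sketch: reading a Netscape-format cookies.txt of the kind the
# robot keeps in its state directory. The entry is made up.
import os
import tempfile
from http.cookiejar import MozillaCookieJar

# Fields: domain, domain-specified flag, path, secure, expires, name, value
cookie_lines = (
    "# Netscape HTTP Cookie File\n"
    ".sesta.com\tTRUE\t/\tFALSE\t2147483647\tsession\tabc123\n"
)

path = os.path.join(tempfile.mkdtemp(), "cookies.txt")
with open(path, "w") as f:
    f.write(cookie_lines)

jar = MozillaCookieJar(path)
jar.load()
for c in jar:
    print(c.domain, c.name, c.value)  # .sesta.com session abc123
```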
Use IP as Source
|
Yes
|
In most cases, the robot operates only on the domain name of a resource.
In some cases, you might want to be able to filter or classify resources based
on subnets by Internet Protocol (IP) address. In that case, you must explicitly
allow the robot to retrieve the IP address in addition to the domain name.
Retrieving IP addresses requires an extra DNS lookup, which can slow the operation
of the robot. If you do not need this option, you can turn it off to improve
performance.
|
Enable Smart Host Heuristics
|
No
|
If checked, the robot converts common alternate host names used by a
server to a single name. This is most useful in cases where a site has a number
of servers all aliased to the same address, such as www.sesta.com,
which often have names such as www1.sesta.com, www2.sesta.com, and so on.
When you select this option, the robot will internally translate all
host names starting with wwwn to www,
where n is any integer. This attribute only operates on
host names starting with wwwn.
This attribute cannot be used when CNAME resolution is OFF (No).
|
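The wwwn-to-www translation described above amounts to a simple host-name rewrite; a minimal sketch, using the example host names from this table:

```python
# Sketch of the wwwN -> www translation described above. The regex
# only touches host names that start with "www" followed by digits.
import re

def smart_host(hostname: str) -> str:
    return re.sub(r"^www\d+(?=\.)", "www", hostname)

print(smart_host("www2.sesta.com"))  # www.sesta.com
print(smart_host("web3.sesta.com"))  # unchanged: web3.sesta.com
```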
Resolve Host Names to CNAMEs
|
No
|
If checked, the robot validates and resolves any host name it encounters
into a canonical host name. This allows the robot to accurately track unique
RDs. If unchecked, the robot validates host names without converting them
to the canonical form. So you may get duplicate RDs listed with the different
host names found by the robot.
For example, devedge.sesta.com is an alias for developer.sesta.com. With CNAME resolution on, a URL referenced
as devedge.sesta.com is listed as being found on developer.sesta.com. With CNAME resolution off, the RD retains the original reference
to devedge.sesta.com.
Smart Host Heuristics cannot be enabled when CNAME resolution is OFF
(No).
|
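The CNAME resolution described above can be sketched as follows. A real robot would ask DNS for the canonical name (for example, Python's socket.gethostbyname_ex returns it as the first element of its result); here a dictionary of hypothetical alias records stands in for DNS:

```python
# Sketch of CNAME canonicalization using the alias example from
# this table. CNAME_RECORDS is hypothetical stand-in data for DNS.
CNAME_RECORDS = {"devedge.sesta.com": "developer.sesta.com"}

def canonicalize(hostname: str, records=CNAME_RECORDS) -> str:
    # Follow alias records until a canonical name is reached,
    # guarding against alias loops.
    seen = set()
    while hostname in records and hostname not in seen:
        seen.add(hostname)
        hostname = records[hostname]
    return hostname

print(canonicalize("devedge.sesta.com"))    # developer.sesta.com
print(canonicalize("developer.sesta.com"))  # already canonical
```

With resolution on, both names would yield one RD under developer.sesta.com; with it off, devedge.sesta.com would produce a separate RD.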
Accepts Commands from any Host
|
No
|
Most robot control functions operate through a TCP/IP port. This attribute
controls whether commands to the robot must come from the local host system
(No), or whether they can come from anywhere on the network (Yes).
It is recommended that you restrict direct robot control to the local
host (No). You can still administer the robot remotely through the Administration
Console.
|
Default Starting Point Depth
|
10
|
1 (starting points only), 2 (bookmark style), 3-10, unlimited.
Default value for the levels of hyperlinks the robot traverses from
any starting point. You can set the depth for any given starting point by
editing the site on the Robot, Sites page.
|
Work Directory
|
/var/opt/SUNWportal/searchservers/search1/tmp
|
Full pathname of a temporary working directory the robot can use to
store data. The robot retrieves the entire contents of documents into this
directory, often many at a time, so this space should be large enough to handle
all of those at once.
|
State Directory
|
/var/opt/SUNWportal/searchservers/search1/robot
|
Full pathname of a temporary directory the robot uses to store its state information, including the list of URLs it has visited, the URL
pool, and so on. This database can be quite large, so you might want to locate
it in a separate partition from the Work Directory.
|
Page Extraction Index
|
Partial Text
|
Full Text uses the complete document in the resource description. Partial
text only uses the specified number of bytes in the resource description.
|
extract first # bytes
|
4096
|
Enter the number of bytes.
|
Extract Table Of Contents
|
Yes
|
Yes includes the Table of Contents in the resource description.
|
Extract data in META tags
|
Yes
|
Yes includes the META tags in the resource description.
|
Allow No Existing Classifications
|
Yes
|
Yes allows resource descriptions that match none of the existing classifications.
|
Document Converters
|
All selected (default). If a converter is unselected, documents of that type cannot be indexed.
|
Adobe PDF
Corel Presentations
Corel Quattro Pro
FrameMaker
Lotus Ami Pro
Lotus Freelance
Lotus Word Pro
Lotus 1-2-3
Microsoft Excel
Microsoft PowerPoint
Microsoft RTF
Microsoft Word
Microsoft Works
Microsoft Write
WordPerfect
StarOffice™ Calc
StarOffice Impress
StarOffice Writer
XyWrite
|
Converter Timeout
|
600
|
Time in seconds allowed for one document to be converted to HTML. If
this time is exceeded, the URL is excluded.
|