Appendix F
Search Attributes
This appendix describes attributes that you can configure for the search engine through the Sun ONE Identity Server administration console.
When you select Search Properties from the Service Management View, a two-toned tabbed menu bar is displayed. This appendix is organized according to the topics or tabs on the upper portion of the menu bar.
When one of these tabs is selected, the menu bar below lists the related subtopics for the topic. The default Search page selects Server/Settings. Each subtopic uses one or more tables to explain the attributes for that subtopic. The tables are divided into three columns: Attribute, Default Value and Description. The Attribute gives the descriptive text found on the page; the Default Value provides the default value for the Attribute; and the Description explains the Attribute and its format.
Every Search Properties page gives you the Select Server attribute, as described in Table F-1.
Table F-1 Search Select Server Attribute
Select Server
    Default Value: http://servername:80/portal
    Description: Fully qualified server name of your Search server.
Server
The Server section is where you configure the preferences for your server. You select what directory to use for temporary files, what information to log and how much detail should be in the logs. The Server attributes are displayed on two pages:
Settings
This page contains the basic settings for the administration and operation of the search server.
Table F-2 Server Settings Attributes
Server Root
    Default Value: /var/opt/SUNWps/https-servernamefull/portal
    Description: Houses the configuration, log, database, and robot information files. It is also the root directory for all of the search files that are generated and updated when conducting a search. This attribute is not configurable.

Temporary Files
    Default Value: /var/opt/SUNWps/https-servernamefull/portal/tmp
    Description: Contains all temporary files used to manage a search while it runs, including newly generated resource descriptions that have not yet been added to the main database. These files are removed when the search is completed.

Document level security
    Default Value: Off
    Description: Controls who can access documents. When this setting is changed, the server must be restarted. Values:
    - Off (default) means all users have access to the RDs.
    - On means that the ReadACL field in an RD is checked to see if the user asking for the RD has permission because the user is in an acceptable organization or role, or is an acceptable individual user. The ReadACL field is set on the Database, Resource Descriptions page.
Advanced
This page contains the advanced settings for the administration and operation of the search server. Here is where you configure the log files for user queries, index maintenance, resource description management, and debugging.
Table F-3 Server Advanced Settings Attributes
Search (rdm)
    Default Value: /var/opt/SUNWps/https-servername/portal/logs/rdm.log
    Description: Logs the queries end users make of the database. You can check the Disable Search Log checkbox to suppress this logging; if you do, you cannot view the User Queries (rdm) report.

Disable Search Log
    Default Value: False (unchecked) - logging enabled
    Description: Controls use of the query log. In the report section, you can generate a report that lists the most popular queries based on this log.

Index Maintenance
    Default Value: /var/opt/SUNWps/https-servername/portal/logs/searchengine.log
    Description: Logs the transactions involving the search engine, except for the registration of resource descriptions.

RD Manager
    Default Value: /var/opt/SUNWps/https-servername/portal/logs/rdmgr.log
    Description: Logs the registration of resource descriptions from the robot or import agents into the database. You can view this log as an RD Manager (rdmgr) report.

RDM Server
    Default Value: /var/opt/SUNWps/https-servername/portal/logs/rdmserver.log
    Description: Logs debugging information on RDM transactions. The level of detail is controlled by the Log Level. You can view this log as an RDM Server (rdmsvr) report.

Log Level
    Default Value: 1
    Description: Controls the amount of detail the RDM Server log file contains. A setting of 1 (default) logs only severe errors; the other possible levels are 2, 10, 20, 50, 100, and 999. The higher the number, the more detail the RDM Server log file contains.
Robot
The properties for the robot are quite complex. You can select the sites to be searched or crawled, check to see if a site is valid, define what types of documents should be picked up, and schedule when the searches take place.
This section is organized as follows: Overview, Sites, Filters, Crawling, Indexing, Simulator, Site Probe, and Schedule.
Overview
The Robot Overview panel is where you can see what the robot is doing: if it is Off, Idle, Running, or Paused; and if it is Running, what progress it is making in the search since the panel is refreshed about every 30 seconds. The refresh rate is defined using the robot-refresh parameter in the search.conf file.
The two buttons at the top right reflect the robot's state. If the robot is Off, the buttons are Start and Remove Status. If it is Running or Idle, the buttons are Stop and Pause. If it is Paused, the buttons are Stop and Resume. By selecting any of the attributes, you go to the Reports section, where you can get a detailed up-to-the-minute report on that attribute.
Table F-4 Robot Overview Attributes
The Robot is
    Default Value: Current activity
    Description: The robot's state. Value can be Idle, Running, Paused, or Off.

Updated at date
    Default Value: Date and time last refreshed.
    Description: This page is refreshed to keep you aware of the progress the robot is making.

Starting Points
    Default Value: Number defined
    Description: Number of sites you have selected to be searched. A site can be disabled (not included in a search) on the Robot, Sites page.

URL Pool
    Default Value: Number URLs waiting
    Description: Number of URLs yet to be investigated. When you begin a search, the starting point URLs are entered into the URL pool. As the search progresses, the robot discovers links to other URLs, which are added to the pool. After all the URLs in the pool have been processed, the URL pool is empty and the robot is idle.

Extracting
    Default Value: Number connections per second
    Description: Number of resources looked at per second. Extracting is the process of discovering or locating resources, documents, or hyperlinks to be included in the database and filtering out unwanted items.

Filtering
    Default Value: Number URLs rejected
    Description: Total number of URLs that are excluded.

Indexing
    Default Value: Number URLs per second
    Description: Number of resources or documents turned into resource descriptions per second. Indexing is the phase when all the information that has been gathered on a document is turned into a resource description for inclusion in the search database.

Excluded URLs
    Default Value: Number URLs excluded by filters
    Description: Number of URLs that did not meet the filtering criteria.

    Default Value: Number URLs excluded by errors
    Description: Number of URLs where the robot encountered errors, such as file not found.

Resource Descriptions
    Default Value: Number RDs contributed
    Description: Number of resource descriptions added to the database.

    Default Value: Number Bytes of RDs contributed
    Description: Number of bytes added to the database.

General Stats
    Default Value: Number URLs retrieved
    Description: Number of URLs retrieved during the run.

    Default Value: Number Bytes average size of RDs
    Description: Average number of bytes per resource description.

    Default Value: Time in days, hours, minutes, and seconds running
    Description: The amount of time the robot has been running.
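The URL Pool counter above reflects a simple crawl loop: starting points seed the pool, the extracting phase adds newly discovered links, and the robot goes idle once the pool drains. The sketch below illustrates that loop in Python using a hypothetical in-memory link graph; the real robot fetches pages over HTTP and applies the configured filters, so this is purely an aid to reading the Overview counters.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for real HTTP fetches;
# the actual robot retrieves pages and extracts hyperlinks from them.
LINKS = {
    "http://www.sesta.com/": ["http://www.sesta.com/a", "http://www.sesta.com/b"],
    "http://www.sesta.com/a": ["http://www.sesta.com/b/c"],
}

def crawl(starting_points, max_depth=10):
    """Visit URLs breadth-first, mirroring the Overview's URL Pool counter."""
    pool = deque((url, 1) for url in starting_points)  # the "URL Pool"
    visited = []
    while pool:                          # robot goes idle once the pool drains
        url, depth = pool.popleft()
        if url in visited or depth > max_depth:
            continue                     # filtered URLs count as excluded
        visited.append(url)              # "URLs retrieved"
        for link in LINKS.get(url, []):  # extracting: discover new links
            pool.append((link, depth + 1))
    return visited
```

With max_depth=1 only the starting point itself is visited, and with max_depth=2 the starting point and its first links are visited, matching the depth choices described on the Sites pages.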
Sites
The initial page in this section shows what sites are available to be searched.
A site can be enabled (On) or disabled (Off) by using the radio buttons. A disabled site is not searched when the robot is run. The Edit link displays a page where you can change how a search site is defined.
To delete a site, check the checkbox and select Delete.
To add a new site, select New. Add a URL or Domain in the text box and select a depth for the search. Select Create to use the default values. Otherwise, select Create and Edit to select non-default values and go to the Edit page to define the search site.
Table F-5 Robot Manage Sites Attributes
Lock or cluster graphic
    Default Value: Status of site
    Description: An open lock means that the URL is accessible. A closed lock means that the site is a secure web server and uses SSL. A cluster means that the site is a domain.

On/Off
    Default Value: On
    Description: Choose whether or not to search this site when the robot is run.
The New Site page allows you to set up an entire site for indexing.
Table F-6 Robot New Site Attributes
New site
    Default Value: URL
    Description: URL format: http://www.sesta.com
    Domain format: *.sesta.com

Depth
    Default Value: 10
    Description: You have a choice of 1 (this URL only), 2 (this URL and first links), 3 through 10, or unlimited. The default value is set on the Robot, Crawling page.
The Edit page is where you define the search site more completely. You can specify what type of server it is, redefine the depth of the search, and select what types of files to add to the search database. The attributes for URL and Domain sites are mostly the same; the table below indicates which attributes apply to URL sites, which to Domain sites, and which to both.
A number of actions are performed on this page. You can verify the server name for the search site you entered. You can add more servers to the server group by selecting Add in the Server Group section, and add more starting points by selecting Add in the Starting Points section. In the Filter Definition section, you can add or delete, exclude or include certain types of files, as well as change the order in which the filters for these files are applied.
Table F-7 Robot Sites Edit Attributes
Site Nickname (URL/Domain)
    Default Value: Site entered, for example www.sesta.com
    Description: Name that is displayed on the initial page. The default is the URL or domain you entered. You can change this name here.

Checkbox to select site for deletion or verification (URL/Domain)
    Default Value: Unchecked
    Description: Unchecked - not selected. Checked - selected.

Server Group - Name (URL)
    Default Value: URL, for example www.sesta.com
    Description: Either a single server or a part of a single server. The entry must include the full host name. If you specify just a host name, the site is limited to that host. If you provide directory information in addition to the host name, the site is defined as only that directory and any of its subdirectories.

Domain Suffix (Domain)
    Default Value: Domain entered, for example *.sesta.com
    Description: Includes all servers within a domain, such as *.sesta.com.

Port (URL/Domain)
    Default Value: 80 for URL; blank for Domain
    Description: If the site you are searching uses a different port, enter it here.

Type (URL)
    Default Value: Web Server
    Description: Web Server, File Server, FTP Server, Secure Web Server.

Allowed Protocols (Domain)
    Default Value: All checkboxes checked
    Description: Checkboxes for http, file, ftp, https.

Starting Points - Checkbox to select a starting point for deletion (URL/Domain)
    Default Value: Unchecked
    Description: Unchecked - not selected. Checked - selected.

Starting Points - URL (URL/Domain)
    Default Value: http://URL:80
    Description: URL or domain.

Starting Points - Depth (URL/Domain)
    Default Value: 10
    Description: 1 (this URL only), 2 (this URL and first links), 3 through 10, or unlimited.

Filter Definition - Checkbox to select a file type for deletion (URL/Domain)
    Default Value: Unchecked
    Description: Unchecked - not selected. Checked - selected.

Filter Definitions (URL/Domain)
    Default Value: In this order, the defaults are Archive Files; Audio Files; Backup Files; Binary Files; CGI Files; Image Files; Java, Javascript, Style Sheet Files; Log Files; Revision Control Files; Source Code Files; Temporary Files; Video Files.
    Description: The possible choices are Archive Files; Audio Files; Backup Files; Binary Files; CGI Files; Image Files; Java, Javascript, Style Sheet Files; Log Files; Power Point Files; Revision Control Files; Source Code Files; Temporary Files; Video Files; Spreadsheet Files; Plug-in Files; Lotus Domino Documents; Lotus Domino OpenViews; System Directories (UNIX); System Directories (NT).

Comment (URL/Domain)
    Default Value: Blank
    Description: Text field that describes the site to you. It is not used by the robot.

DNS Translation (URL)
    Default Value: Blank
    Description: A DNS translation modifies the URL and the way it is crawled by replacing a domain name or alias with a CNAME. Format: alias1->cname1,alias2->cname1
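The DNS Translation field's alias->cname mapping amounts to a small lookup table applied to host names before crawling. The Python sketch below is illustrative only; parse_dns_translation and translate_host are hypothetical helper names, not part of the product.

```python
def parse_dns_translation(spec):
    """Parse a DNS Translation field, e.g. "alias1->cname1,alias2->cname1"."""
    table = {}
    for pair in spec.split(","):
        alias, cname = pair.split("->")
        table[alias.strip()] = cname.strip()
    return table

def translate_host(host, table):
    # Hosts without an entry are crawled under their original name.
    return table.get(host, host)
```

For example, a specification of devedge.sesta.com->developer.sesta.com (the alias pair used later in this appendix) would cause the robot to crawl the alias under its canonical name.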
Filters
The initial page in this section shows all the defined filter rules and the site definitions that use them. Each filter name is preceded by a checkbox to select that document type and two radio buttons to turn the filter rule On and Off. If a checkbox is checked, the filter is selected and can be deleted. You can add a new filter by selecting New. The New Filter page is an abbreviated Edit page, requiring only a Nick Name and one rule. Another option is selecting the Edit link, which takes you to a page where you define the rules for that file type, that is, what that filter does. Each rule is made up of a drop down list of filter sources, a drop down list of match types, and a text box in which to enter the filter string.
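A rule of this form (source, match type, string) can be pictured as a predicate over parts of a URL. The Python sketch below is purely illustrative; rule_matches is a hypothetical helper, and the MIME type source is omitted because it requires the retrieved document rather than just the URL.

```python
import re
from urllib.parse import urlparse

# One matcher per "Filter By" choice.
MATCHERS = {
    "is":          lambda value, s: value == s,
    "contains":    lambda value, s: s in value,
    "begins with": lambda value, s: value.startswith(s),
    "ends with":   lambda value, s: value.endswith(s),
    "regular expression": lambda value, s: re.search(s, value) is not None,
}

def rule_matches(url, source, match, string):
    """Evaluate one filter rule (source, match type, string) against a URL."""
    parts = urlparse(url)
    value = {"URL": url,
             "protocol": parts.scheme,
             "host": parts.hostname or "",
             "path": parts.path}[source]
    return MATCHERS[match](value, string)
```

Using the example from the table below, for http://docs.sesta.com/manual.html the protocol is http, the host contains sesta, and the path ends with html.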
Table F-8 Robot Filter Edit Attributes
Filter Name
    Default Value: Prompts for a name for a new filter; displays the name of the filter you chose to edit.
    Description: A descriptive name that reflects the type of file the filter applies to.

Drop down list of filter sources
    Default Value: URL for a new filter; displays the previously chosen source for an existing filter.
    Description: URL, protocol, host, path, MIME type.

Drop down list of match types
    Default Value: "is" for a new filter; displays the previously chosen match type for an existing filter. For example, Binary Files ends with exe.
    Description: is, contains, begins with, ends with, regular expression.

Text box for type (directory, protocol, file extension) specifics
    Default Value: Blank for a new filter; displays the previously entered string for an existing filter. For example, Temporary Files contains /tmp/.
    Description: In this text box, list what you want to match. For example, for http://docs.sesta.com/manual.html: the protocol is http, the host contains sesta, and the file ends with html.

Description
    Default Value: Prompts for a new description; displays the previously entered description for an existing filter.
    Description: Describes the filter rule for yourself. The robot does not use it.

New Site
    Default Value: True (checked) for a new filter; displays the previously chosen value for an existing filter.
    Description: Use this filter as one of the defaults when creating new sites. If you do not check this, you can still add this filter to a new site by editing the site on the Robot, Sites page.

By Default
    Default Value: Nothing selected for a new filter; displays the default selected previously for an existing filter.
    Description: Exclude documents matching this filter, or include documents matching this filter. The selection for a new filter does not affect existing site definitions. To use your new filter on an existing site, you must add it by editing the site on the Robot, Sites page.

Deployment
    Default Value: Lists the sites that use this filter.
Crawling
The settings on this page control the robot’s operational parameters and defaults. It is divided into these sections: Speed, Completion Actions, Logfile Settings, Standards Compliance, Authentication Parameters, Proxying, Advanced Settings and Link Extraction.
Table F-9 Robot Crawling Attributes
Server Delay
    Default Value: No Delay
    Description: No Delay (default), 1 second, 2 seconds, 5 seconds, 10 seconds, 30 seconds, 1 minute, 5 minutes.

Maximum Connections - Max concurrent retrieval URLs
    Default Value: 8
    Description: 1, 2, 4, 8 (default), 10, 12, 16, 20.

Maximum Connections per Site
    Default Value: 2
    Description: (no limit), 1, 2, 4, 8, 10, 12, 16, 20.

Send RDs to Indexing every
    Default Value: 30 minutes
    Description: 3 minutes, 5 minutes, 10 minutes, 15 minutes, 30 minutes (default), 1 hour, 2 hours, 4 hours, 8 hours.

Script to Launch
    Default Value: nothing (default)
    Description: For sample files, see the cmdHook files in the /opt/SUNWps/samples/robot directory (for the default installation).

After Processing all URLs
    Default Value: go idle (default)
    Description: go idle (default), shut down, start over.

Contact Email
    Default Value: user@domain
    Description: Enter your own.

Log Level
    Default Value: 1 - Generation
    Description: 0 Errors only; 1 Generation (default); 2 Enumeration, Conversion; 3 Filtering; 4 Spawning; 5 Retrieval.

User Agent
    Default Value: SunONERobot/6.0
    Description: Version of the search server.

Ignore robots.txt protocol
    Default Value: False (unchecked)
    Description: Some servers have a robots.txt file that tells robots not to crawl the site. If your search robot encounters this file on a site and this attribute is false, it does not search the site. If this attribute is true, the robot ignores the file and searches the site.

Perform Authentication
    Default Value: Yes
    Description: Yes or No.

Robot Username
    Default Value: anonymous
    Description: The robot uses the anonymous user name to gain access to a site.

Password
    Default Value: user@domain
    Description: Frequently a site that allows anonymous users requires an email address as a password. This address is in plain text.

Proxy Username
    Default Value: anonymous
    Description: The robot uses the anonymous user name to gain access to a proxy.

Password
    Default Value: user@domain
    Description: Frequently a site that allows anonymous users requires an email address as a password. This address is in plain text.

Proxy Connection Type
    Default Value: Direct Internet Connection
    Description: Direct Internet Connection, Proxy - Auto Configuration, Proxy - Manual Configuration.

Auto Proxy Configuration Type
    Default Value: Local Proxy File
    Description: Local Proxy File, Remote Proxy File.

Auto Proxy Configuration Location
    Default Value: Blank
    Description: The auto proxy has a file that lists all the proxy information needed. An example of a local proxy file is robot.pac. An example of a remote proxy file is http://proxy.sesta.com:8080/proxy.pac

Manual Configuration HTTP Proxy
    Default Value: Blank
    Description: Format: server1.sesta.com:8080. These three manual configuration values are put in the robot.pac file in the /var/opt/SUNWps/https-servername/portal/config directory.

Manual Configuration HTTPS Proxy
    Default Value: Blank
    Description: This manual configuration value is put in the robot.pac file. Format: server1.sesta.com:8080

Manual Configuration FTP Proxy
    Default Value: Blank
    Description: This manual configuration value is put in the robot.pac file. Format: server1.sesta.com:8080

Follow links in HTML
    Default Value: True (checked)
    Description: Extract hyperlinks from HTML.

maximum links
    Default Value: 1024
    Description: Limits the number of links the robot can extract from any one HTML resource. As the robot searches sites and discovers links to other resources, it could conceivably end up following huge numbers of links a great distance from its original starting point.

Follow links in plain text
    Default Value: False (unchecked)
    Description: Extract hyperlinks from plain text.

maximum links
    Default Value: 1024
    Description: Limits the number of links the robot can extract from any one text resource.

Use Cookies
    Default Value: False (unchecked)
    Description: If checked, the robot uses cookies when it crawls. Some sites require the use of cookies in order to be navigated correctly. The robot keeps its cookies in a file called cookies.txt in the robot state directory. The format of cookies.txt is the same format as used by the Netscape Communicator browser.

Use IP as Source
    Default Value: True (checked)
    Description: In most cases, the robot operates only on the domain name of a resource. In some cases, you might want to be able to filter or classify resources based on subnets by Internet Protocol (IP) address. In that case, you must explicitly allow the robot to retrieve the IP address in addition to the domain name. Retrieving IP addresses requires an extra DNS lookup, which can slow the operation of the robot. If you do not need this option, you can turn it off to improve performance.

Smart Host Heuristics
    Default Value: False (unchecked)
    Description: If checked, the robot converts common alternate host names used by a server to a single name. This is most useful in cases where a site has a number of servers all aliased to the same address, such as www.sesta.com, which often have names such as www1.sesta.com, www2.sesta.com, and so on. When you select this option, the robot internally translates all host names starting with wwwn to www, where n is any integer. This attribute operates only on host names starting with wwwn, and it cannot be used when CNAME resolution is off (false).

Resolve hostnames to CNAMEs
    Default Value: False (unchecked)
    Description: If checked, the robot validates and resolves any host name it encounters into a canonical host name, which allows the robot to accurately track unique RDs. If unchecked, the robot validates host names without converting them to the canonical form, so you may get duplicate RDs listed with the different host names found by the robot. For example, devedge.sesta.com is an alias for developer.sesta.com. With CNAME resolution on, a URL referenced as devedge.sesta.com is listed as being found on developer.sesta.com. With CNAME resolution off, the RD retains the original reference to devedge.sesta.com. Smart Host Heuristics cannot be enabled when CNAME resolution is off (false).

Accepts commands from ANY host
    Default Value: False (unchecked)
    Description: Most robot control functions operate through a TCP/IP port. This attribute controls whether commands to the robot must come from the local host system (false), or whether they can come from anywhere on the network (true). It is recommended that you restrict direct robot control to the local host (false). You can still administer the robot remotely through the Administration Console.

Default Starting Point Depth
    Default Value: 10
    Description: 1 (starting points only), 2 (bookmark style), 3 through 10, or unlimited. The default value for the levels of hyperlinks the robot traverses from any starting point. You can set the depth for any given starting point by editing the site on the Robot, Sites page.

Work Directory
    Default Value: /var/opt/SUNWps/https-servernamefull/portal/tmp
    Description: Full pathname of a temporary working directory the robot can use to store data. The robot retrieves the entire contents of documents into this directory, often many at a time, so this space should be large enough to handle all of them at once.

State Directory
    Default Value: /var/opt/SUNWps/https-servernamefull/portal/robot
    Description: Full pathname of a temporary directory the robot uses to store its state information, including the list of URLs it has visited, the URL pool, and so on. This database can be quite large, so you might want to locate it in a separate partition from the Work Directory.
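When Ignore robots.txt protocol is left unchecked, the robot honors a site's robots.txt exclusion rules before crawling. The behavior can be sketched with Python's standard robotparser module; this is illustrative only (allowed_to_crawl is a hypothetical helper, and the real robot fetches robots.txt from the site itself).

```python
from urllib import robotparser

def allowed_to_crawl(url, robots_txt_lines, user_agent="SunONERobot/6.0"):
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt_lines)  # in practice, fetched from http://host/robots.txt
    return rp.can_fetch(user_agent, url)
```

A site whose robots.txt disallows /private/ for all agents would be crawled everywhere except under that path; setting the attribute to true bypasses this check entirely.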
Indexing
The robot searches sites and collects documents based on the filters you have selected. The collected documents are in many different formats. To make them uniform and easily readable, they are converted to a single format, HTML. This page controls some of the parts that go into each resource description.
Table F-10 Robot Index Attributes
Full Text or Partial Text
    Default Value: Partial Text
    Description: Full uses the complete document in the resource description. Partial uses only the specified number of bytes in the resource description.

extract first # bytes
    Default Value: 4096
    Description: Enter the number of bytes.

Extract Table Of Contents
    Default Value: True (checked)
    Description: True includes the table of contents in the resource description.

Extract data in META tags
    Default Value: True (checked)
    Description: True includes the META tags in the resource description.

Document Converters
    Default Value: All checked (true); if a converter is unchecked (false), that type of document cannot be indexed.
    Description: Adobe PDF; Corel Presentations; Corel Quattro Pro; FrameMaker; Lotus Ami Pro; Lotus Freelance; Lotus Word Pro; Lotus 1-2-3; Microsoft Excel; Microsoft Powerpoint; Microsoft RTF; Microsoft Word; Microsoft Works; Microsoft Write; WordPerfect; StarOffice Calc; StarOffice Impress; StarOffice Writer; XyWrite.

Converter Timeout
    Default Value: 600
    Description: Time in seconds allowed for one document to be converted to HTML. If this time is exceeded, the URL is excluded.
Simulator
This page is a debugging tool that performs a partial simulation of robot filtering on a URL. You can type in a new URL to check. It checks the URL, DNS translations (including Smart Host Heuristics), and site redirections. It does not check the contents of the document specified by the URL, so it does not detect duplications, MIME types, network errors, permissions, and the like. The simulator indicates whether the listed sites would be accepted by the robot (ACCEPTED) or not (WARNING).
Table F-11 Robot Simulator Properties
URL
    Default Value: URLs you have already defined and one blank text box.
    Description: You can check access to a new site by typing its URL in the blank text box. This checks whether the new site accepts crawling. Format: http://www.sesta.com:80/

Check for DNS aliases
    Default Value: True (checked)
    Description: True (checked) checks whether a number of servers are aliased to the same address.

Check for Server Redirects (302)
    Default Value: True (checked)
    Description: True (checked) checks for any server redirects.
Site Probe
This page is a debugging tool that checks for DNS aliases, server redirects, and virtual servers. This tool returns information about a site but does not test whether the site accepts crawling.
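The DNS side of such a probe can be approximated with Python's standard socket.gethostbyname_ex call, which returns a host's canonical name, its aliases, and its IP addresses. The sketch below is illustrative only; probe is a hypothetical helper, not part of the product.

```python
import socket

def probe(hostname):
    """Return the canonical name, aliases, and IP addresses for a host."""
    canonical, aliases, addresses = socket.gethostbyname_ex(hostname)
    return {"canonical": canonical, "aliases": aliases, "addresses": addresses}
```

The addresses list corresponds to the extra detail shown when Show advanced DNS information is checked.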
Table F-12 Robot Site Probe Attributes
Site
    Default Value: Blank
    Description: Type in a URL in the format http://www.sesta.com:80

Show advanced DNS information
    Default Value: False (unchecked)
    Description: True (checked) displays more information about the site, including IP addresses.
Schedule
This page is where you set up the automatic search schedule for the robot.
Table F-13 Robot Schedule Attributes
Start Robot Time in hours and minutes
    Default Value: 00:00
    Description: The time at which the robot starts to search.

Days
    Default Value: none selected
    Description: Sun, Mon, Tue, Wed, Thu, Fri, or Sat. Check at least one day.

Stop Robot Time in hours and minutes
    Default Value: 00:00
    Description: If you plan to run the robot continuously, it is recommended that you stop and restart it at least once per day. This gives the robot a chance to release resources and reinitialize itself.

Days
    Default Value: none selected
    Description: Sun, Mon, Tue, Wed, Thu, Fri, or Sat.
Database
The Database attributes are divided as follows: Management, Import Agents, Resource Descriptions, and Schedule.

Note: To partition the database, you must use the command-line function, because doing so requires stopping the search server.
Management
The initial Management page lists the available databases. You can create a new one, or reindex, purge, or expire an existing one. Use the checkbox to select a database on which to perform an action. Use the small icons above the checkboxes to select or deselect all the databases. When you select Reindex, Purge, or Expire, a prompt displays, listing the affected database names and asking you to confirm the action. To perform the action, select OK.
You should reindex the database if you have edited the schema to add or remove an indexed field (such as author), or if a disk error has corrupted the index. You need to restart the server after you change the schema.
Because the time required to reindex the database is proportional to the number of RDs in the database, a large database should be reindexed when the server is not in high demand.
When you purge the contents of the database, disk space used for indexes will be recovered, but disk space used by the main database will not be recovered; instead, it is reused as new data is added to the database.
Expiring a database deletes all RDs that are deemed out-of-date. It does not decrease the size of the database. By default, an RD is scheduled to expire in 90 days from the time of creation.
You can also edit the database by selecting the Edit link which takes you to a page where you define the database attributes.
Table F-14 Database Management Attributes
Name
    Default Value: Default
    Description: Name for the database used by Search.

Title
    Default Value: Blank
    Description: A title for the database.

Description
    Default Value: Blank
    Description: Describe the database for yourself.
Import Agents
Import agents are the processes that bring resource descriptions from other servers or databases and merge them into your search database.
The initial Import page lists the available import agents. You can create a new one, or run, edit or delete an existing one. Use the checkbox to select an agent to delete. Use the small icons above the checkbox to select or deselect all import agents. Use the radio buttons to turn an Agent Action On or Off. To schedule the import agents, select Schedule on the lower menu bar.
If you choose to edit or modify an existing import agent or create a new one, the following attributes are displayed.
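Import agents exchange resource descriptions as SOIF (Summary Object Interchange Format) records. The Python sketch below reads one record, assuming the classic Harvest-style SOIF layout of a template line followed by attribute{size}: value lines; the template name @DOCUMENT and the helper parse_soif_record are illustrative assumptions, not the product's actual format or API.

```python
import re

# Matches attribute lines of the assumed form: name{size}: value
SOIF_ATTR = re.compile(r"^(?P<name>[\w-]+)\{(?P<size>\d+)\}:\s?(?P<value>.*)$")

def parse_soif_record(text):
    """Parse one SOIF-style record into (url, attributes)."""
    lines = text.strip().splitlines()
    # Header line assumed to look like: @DOCUMENT { http://www.sesta.com/
    url = lines[0].split("{", 1)[1].strip()
    attrs = {}
    for line in lines[1:]:
        m = SOIF_ATTR.match(line)
        if m:
            attrs[m.group("name")] = m.group("value")
    return url, attrs
```

A local file supplied to an import agent would contain a stream of such records, one per resource description.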
Table F-15 Database Import Agent Attributes
Charset
    Default Value: Blank for new
    Description: Specifies the character set of the input SOIF stream, for example ISO8859-1, UTF-8, or UTF-16. Character sets ISO8859-1 through ISO8859-15 are supported.

Import From
    Default Value: Local File
    Description: Select either Local File or Search Server (if one is enabled).

Local File Path
    Default Value: Blank for new
    Description: Gives the full path name of a local file that contains valid resource descriptions in SOIF (Summary Object Interchange Format). This can be a file on another server, as long as the path is addressable as if it were locally mounted.

Database Name
    Default Value: Default
    Description: Name of the destination database.

Remote Server
    Default Value: Blank for new
    Description: Gives the URL of the search server from which to retrieve resource descriptions. Format: http://www.sesta.com:80

Instance Name
    Default Value: Blank for new
    Description: Server instance name used by the search server. You can find this instance name in the Server Preferences for the server you are importing from. Value must be 3.01C or 3.01C SP1.

Search URI
    Default Value: Blank for new
    Description: Enter full path and file names. Use /portal/search.

Is Compass Server 3.01X?
    Default Value: False (unchecked)
    Description: Is the server you are importing from a Compass Server 3.01X?

Enable SSL
    Default Value: False (unchecked)
    Description: If this is a server-to-server transaction, select whether the servers should use the SSL (Secure Sockets Layer) protocol.

Authentication
    Default Value: None (default)
    Description: None (default) or Use User/Password. This specifies how the import agent should identify itself to the system it imports from. By default, no authentication is used. If the server you want to import from requires authentication, you can specify a user name and password for the import agent to use. Importing from 3.01C does not require authentication; importing from 3.01C SP1 does.

User
    Default Value: Blank for new or none
    Description: If you selected Use User/Password, enter a user.

Password
    Default Value: Blank for new or none
    Description: If you selected Use User/Password, enter a password (shown as *).

Content Transfer
    Default Value: Use Incremental Gathering of Full Contents (default)
    Description: Choice of Use Incremental Gathering of Full Contents (default) or Use Search Query. These specify which resource descriptions to import from the source. By default, an import agent asks for all resource descriptions added or changed since its last import from the same source. The search query specifies that the import agent should request only certain resource descriptions from the source, in much the same way that users request listings of resources from the search database. Use the Scope, View-Attributes, and View-Hits fields to specify the query.

Scope
    Default Value: Blank for new
    Description: The text of the query. The query syntax is identical to that used for end-user queries from the server.

View-Attributes
    Default Value: Blank for new
    Description: Lists which fields (not case sensitive) you want to import in each resource description, for example title and author. The default is all.

View-Hits
    Default Value: Blank for new
    Description: The maximum number of matching resource descriptions to import. If no value is specified, it defaults to 20.

Agent Description
    Default Value: Blank for new
    Description: Appears in the list of available import agents on the initial Import page. It is ignored by the program. If this field is blank, the resource description source file name or server name is used to identify the import agent. Note here if a user name and password are needed.

Newest Resource Description
    Default Value: Blank for new
    Description: The creation date of the newest resource description previously imported by this import agent. This date is used by the Use Incremental Gathering of Full Contents option to determine which resources are new and should be imported.

Network Timeout in seconds
    Default Value: Blank for new
    Description: Specifies the number of seconds the import agent allows before timing out a connection over the network. You can adjust this to allow for varying network traffic and quality.
Resource Descriptions
The initial Resource Descriptions page allows you to search the Resource Descriptions in the database. For example, you can correct a typographical error in an RD or manually assign RDs discovered by the robot to categories.
Table F-16 Resource Descriptions Attributes
Attribute
|
Default Value
|
Description
|
Search For
|
All RDs
|
All RDs, Uncategorized RDs, Categorized RDs, RDs by category, Specific RD by URL, RDs that contain
|
Text box
|
Blank
|
Enter a unique text string to identify the RDs searched for. Use with the RDs by category, Specific RD by URL, and RDs that contain attribute values.
|
Database
|
Default
|
Name of the database to search.
|
Select Category
|
|
Browse and select a category from the category tree.
|
Delete
|
|
Delete one or more selected RDs that are returned from an RD search.
|
Next
|
|
Display the next set of RDs returned from an RD search.
|
Previous
|
|
Display the previous set of RDs returned from an RD search.
|
Edit Selected
|
|
Edit the attributes of one or more RDs that are returned from an RD search.
|
Edit All
|
|
Edit the attributes of the current set of RDs that are returned from an RD search.
|
To limit the search by category, select Select Category. A Category Editor page is displayed, allowing you to specify the category from the taxonomy for the search. You can type the category in the Selected Category text box or browse the taxonomy to select it. After specifying the category, select OK to return to the RD search page.
Table F-17 Category Editor Attributes
Attribute
|
Default Value
|
Description
|
Selected Categories
|
Blank
|
Text field that displays the selected categories
|
Expand All
|
|
Expands the taxonomy so that all entries in the hierarchy display for browsing.
|
Collapse All
|
Blank
|
Collapses the taxonomy so that only categories within the first two levels of the hierarchy display for browsing.
|
Categories per page
|
25
|
Drop down list of the number of categories to display per page. Values are 25, 50, 100, 250, 500, and all.
|
A successful search displays the number of RDs found and a list box containing them. After clicking the Edit link of an RD, the following editable attributes and the partial text of the RD are displayed. All of these attributes except Classification are set to editable on the Database/Schema page.
Table F-18 Database RD Editable Attributes
Attribute
|
Default Value
|
Description
|
Author
|
Blank
|
Author(s) of the document.
|
Author e-mail
|
Blank
|
Email address to contact the Author(s) of the document.
|
Classification
|
Category name of selected RD.
|
Category name if classified; No Classification if not classified.
|
ReadACL
|
Blank
|
Related to document level security.
|
Content-Charset
|
|
Content-Charset information from HTTP Server.
|
Content-Encoding
|
Blank
|
Content-Encoding information from HTTP Server.
|
Content-Language
|
Blank
|
Content-Language information from HTTP Server.
|
Content-Length
|
Blank
|
Content-Length information from HTTP Server.
|
Content-Type
|
Blank
|
Content-Type information from HTTP Server.
|
Description
|
Description from the selected RD.
|
Description from RD.
|
Expires
|
Valid date.
|
Date on which resource description is no longer valid.
|
Full-Text
|
Blank
|
Entire contents of the document.
|
Keywords
|
Keywords, if any, from the selected RD.
|
Keywords taken from meta tags.
|
Last-Modified
|
Last modification date
|
Date when the document was last modified.
|
Partial-text
|
Partial text of the document
|
Partial selection of text from the document
|
Phone
|
Blank
|
Phone number for Author contact
|
Title
|
Title of the selected RD.
|
Title of RD
|
URL
|
Blank
|
Uniform Resource Locator for the document
|
Schema
The schema determines what information is in a resource description and what form that information is in. You can add new attributes or fields to an RD and set which ones can be edited and which ones can be indexed. When importing new RDs, you can convert schemas embedded in new RDs into your own schema.
Table F-19 Database Schema Edit Attributes
Attribute
|
Description
|
Author
|
Author(s) of the document.
|
Author-EMail
|
Email address to contact the Author(s) of the document.
|
Content-Charset
|
Content-Charset information from HTTP Server.
|
Content-Encoding
|
Content-Encoding information from HTTP Server.
|
Content-Language
|
Content-Language information from HTTP Server.
|
Content-Length
|
Content-Length information from HTTP Server.
|
Content-Type
|
Content-Type information from HTTP Server.
|
Description
|
Brief one-line description for document.
|
Expires
|
Date on which resource description is no longer valid.
|
Full-Text
|
Entire contents of the document.
|
Keywords
|
Keywords that best describe the document.
|
Last-Modified
|
Date when the document was last modified.
|
Partial-Text
|
Partial selection of text from the document.
|
Phone
|
Phone number for Author contact.
|
ReadACL
|
Used by Search servers to enforce security.
|
Title
|
Title of the document.
|
URL
|
Uniform Resource Locator for the document
|
Aliases
|
When you import new RDs, you can convert schemas embedded in them into your own schema. Use this conversion when there are discrepancies between the field names in the import database schema and the schema used for RDs in your database. For example, if imported RDs used Writer as the field for the author and your RDs used Author, the conversion would be Writer to Author, so you would enter Writer in this text box.
|
Data Type
|
Defines the data type.
|
Editable
|
If true (checked), the selected attribute (field) appears in the Database RD Editor, so you can change its values.
Description, Keywords, Title and ReadACL are editable.
|
Indexable
|
If true (checked), the selected attribute (field) can be used as a basis for indexing.
Author, Title and URL appear in the menu in the Advanced Search screen for the end user. This allows end users to search for values in those particular fields.
Author, Expires, Keywords, Last Modified, Title, URL and ReadACL can be used as the basis for indexing.
|
Score Multiplier
|
A weighting field for scoring a particular element. Any positive value is valid.
|
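The Aliases conversion and Score Multiplier described in Table F-19 can be sketched as follows. This is our illustration only, assuming a simple field-rename mapping and a per-field weighting table; the mapping contents and function names are hypothetical.

```python
# Hypothetical sketch of the Aliases conversion: incoming RDs that use
# "Writer" are rewritten to use the local schema's "Author" field.
ALIASES = {"Writer": "Author"}  # imported field name -> local schema name

def apply_aliases(rd, aliases=ALIASES):
    """Rename imported fields to their local schema names."""
    return {aliases.get(field, field): value for field, value in rd.items()}

# A Score Multiplier weights matches in a given field more heavily;
# any positive value is valid. Fields not listed default to 1.0.
MULTIPLIERS = {"Title": 2.0, "Full-Text": 1.0}

def weighted_score(hits_by_field):
    """Sum per-field hit counts, scaled by each field's multiplier."""
    return sum(MULTIPLIERS.get(f, 1.0) * n for f, n in hits_by_field.items())

imported = {"Writer": "J. Smith", "Title": "Q3 Report"}
converted = apply_aliases(imported)
```

A Title multiplier of 2.0, as above, would make a query hit in the Title field count twice as much as the same hit in the full text.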
Analysis
The Analysis page shows a sorted list of all sites and the number of resources from that site currently in the search database. Select Update Analysis to update the analysis on file.
Table F-20 Database Analysis Attributes
Attribute
|
Default Value
|
Description
|
Total number of RDs
|
Current number of RDs in database.
|
Lists current total number of resource descriptions in the database.
|
Number of servers
|
Current number of servers that the database is partitioned across.
|
The database can be partitioned and placed on a number of servers.
|
Site
|
URL or domain that the robot has successfully searched.
|
A URL or domain that has added resource descriptions to the database.
|
Number of RDs
|
Current number of RDs from that site.
|
Lists current number of RDs from that site.
|
Type
|
Type of RD
|
Resource descriptions can be of many different types, for example, http.
|
Percentage
|
Type of RD / Total number of RDs
|
Percentage of this type of document compared to the total number of resource descriptions.
|
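The Percentage column in the Analysis report is simply each type's RD count divided by the total number of RDs. A minimal sketch of that arithmetic, with hypothetical data:

```python
# Sketch of the Analysis report's Percentage arithmetic: the share of
# each RD type relative to the total number of RDs in the database.
from collections import Counter

def type_percentages(rd_types):
    """Map each RD type to its percentage of the total, e.g. 'http'."""
    total = len(rd_types)
    counts = Counter(rd_types)
    return {t: round(100.0 * n / total, 1) for t, n in counts.items()}

# Example: three http RDs and one file RD.
pcts = type_percentages(["http", "http", "http", "file"])
```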
Schedule
This page is where you set up the schedule for running the import agents.
Table F-21 Database Import Schedule Attributes
Attribute
|
Default Value
|
Description
|
Start Import Time in hours and minutes
|
00:00
|
Time that the import agent starts to import.
|
Days
|
none selected
|
Sun-Sat.
Check at least one day.
|
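The schedule rules above reduce to two checks: the start time must be a valid 24-hour HH:MM value, and at least one day must be checked. A minimal validation sketch, with names of our own choosing:

```python
# Hypothetical validation of the import schedule from Table F-21:
# Start Import Time must be HH:MM (24-hour) and at least one day
# (Sun-Sat) must be selected.
import re

def validate_schedule(start_time, days):
    if not re.fullmatch(r"([01]\d|2[0-3]):[0-5]\d", start_time):
        raise ValueError("Start Import Time must be HH:MM (24-hour)")
    if not days:
        raise ValueError("Check at least one day (Sun-Sat)")
    return True

# The default schedule of 00:00 with at least one day is valid.
ok = validate_schedule("00:00", ["Mon"])
```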
Categories
End users interact with the search database in two distinct ways: they can type direct queries to search the database, or they can browse through the database contents using a set of categories you design. You assign resources in the search database to categories to reduce its complexity; if a large number of items are in the database, it is helpful to group related items together. Your primary concern in setting up categories should be usability, so that end users can quickly locate specific kinds of items.
The search server uses a hierarchy of categories called a taxonomy. The term taxonomy in general describes any system of categories. In the context of a networked resource database such as the search server database, it describes any method you choose of categorizing network resources to facilitate retrieval.
The Categories topic is divided into the following subtopics:
Category Editor
The Category Editor page displays a listing of the categories in the taxonomy, allowing you to browse them. After browsing to a category, you can select the category link to bring up the Classification Rules Editor and set up the robot collections under specific categories.
Table F-22 Category Editor Attributes
Attribute
|
Default Value
|
Description
|
Expand All
|
|
Expands the taxonomy so that all entries in the hierarchy display for browsing.
|
Collapse All
|
|
Collapses the taxonomy so that only categories within the first two levels of the hierarchy display for browsing.
|
Reindex
|
|
Reindexes the database. If you have just created your taxonomy, you need to index the database to make category search available to your end users. If you have changed your categories, you need to reindex the database to bring it up to date. Save the category tree before you reindex the database so that the new taxonomy is loaded.
|
Categories per page
|
25
|
Drop down list of the number of categories to display per page. Values are 25, 50, 100, 250, 500, and all.
|
Name
|
Selected category
|
Displays the name of the selected category to edit
|
Description
|
Blank
|
Displays the description of the selected category.
|
Matching Rule
|
Blank
|
Displays the matching rule to use with the selected category.
|
Update
|
|
Updates the category definition.
|
Add as a child
|
|
Adds the category as a child.
|
Add as a sibling
|
|
Adds the category as a sibling.
|
Classification Rules Editor
After you set up the categories for your database, click New to set or change the rules the robot uses to assign resources to the selected categories.
Table F-23 Categories Classification Rules Editor Attributes
Attribute
|
Default Value
|
Description
|
Source
|
Author
|
The valid attributes include:
- Author
- Author-EMail
- Content-Charset
- Content-Encoding
- Content-Language
- Content-Length
- Content-Type
- Description
- Expires
- Full-Text
- Keywords
- Last-modified
- Partial-Text
- Phone
- ReadACL
- Title
- URL
- Host
- Protocol
- IP
- Path
- Type
|
Method
|
is
|
is, contains, begins with, ends with, regular expression
|
Criteria
|
Blank
|
Specifies the criteria for the rule.
|
Classification
|
Blank
|
Category in which to classify the RD if the rule conditions are met. Type the category or use the Select Category Edit page to browse to it.
|
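A classification rule combines a Source attribute, a Method, a Criteria string, and a target Classification, as listed in Table F-23. The sketch below shows one plausible way such rules could be evaluated against an RD; the evaluation logic and function names are our illustration, not the product's actual implementation.

```python
# Hypothetical evaluation of classification rules (Source, Method,
# Criteria, Classification) against a resource description.
import re

def rule_matches(rd, source, method, criteria):
    """Test one rule against the RD's value for the Source attribute."""
    value = str(rd.get(source, ""))
    if method == "is":
        return value == criteria
    if method == "contains":
        return criteria in value
    if method == "begins with":
        return value.startswith(criteria)
    if method == "ends with":
        return value.endswith(criteria)
    if method == "regular expression":
        return re.search(criteria, value) is not None
    raise ValueError(f"unknown method: {method}")

def classify(rd, rules):
    """Return the categories whose rule conditions the RD meets."""
    return [cat for (src, meth, crit, cat) in rules
            if rule_matches(rd, src, meth, crit)]

rules = [("URL", "contains", "/docs/", "Documentation"),
         ("Content-Type", "is", "text/html", "Web Pages")]
cats = classify({"URL": "http://example.com/docs/a.html",
                 "Content-Type": "text/html"}, rules)
```

An RD can satisfy more than one rule, in which case it is assigned to each matching category.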
Reports
The Reports section allows you to monitor your search server. You can see a summary of its activity: what sites were searched, what URLs were excluded and why, detailed information about URLs visited by the robot, and what your end users are interested in.
The Reports topic is divided into the following subtopics:
Starting Points
The robot will visit all the enabled sites every time it starts.
Table F-24 Reports Starting Points Attributes
Attribute
|
Default Value
|
Description
|
Enabled
|
Current value of site.
|
Yes or No.
This is set on the Robot/Sites page.
|
Starting Point
|
Chosen URL:80
|
Link brings up chosen URL.
|
in site definition
|
Chosen URL
|
Links to the Robot/Sites edit page.
|
Depth
|
Lists selected level of search.
|
1-n. Set on the Robot/Sites edit page.
|
Excluded URLs
This page shows a list of robot runs. To display why URLs were excluded, select a robot run to examine, select View Selected, then select one of the Reasons for Exclusion. A list of the URLs excluded for that reason is displayed. Duplicate and warning exclusions have been removed.
Table F-25 Reports Excluded URLs Attributes
Attribute
|
Default Value
|
Description
|
Log
|
Lists log from most recent run.
|
Lists all run logs available.
|
Count
|
Numbers
|
List of numbers with reasons for exclusion.
|
Reason for Exclusion
|
List of reasons sites have not been allowed. Each reason is linked to a list of all the URLs that were excluded for that reason.
|
Filter rules, file not found, site not allowed, protocol not allowed, errors, and duplication are some of the reasons URLs are excluded.
|
Robot Advanced Reports
This page gives you access to a number of different reports from the robot. Select a report from the drop-down list to display its information. The Refresh button retrieves the current information.
Table F-26 Reports Robot Advance Reports Attribute
Attribute
|
Default Value
|
Description
|
Advanced Robot Reports
|
Version
|
Version, DNS Cache Dump, Performance, Servers Found-All, Server Found-RDM, Status-Current Configuration, Status -Database (internal), Status-Libnet, Status -Modules, Status-Overview, URLs-ready for extraction, URLs-ready for indexing, URLs- waiting for filtering (URL pool), URLs- waiting for indexing, all reports.
|
Log Files
This page allows you to view entries or specific lines from a log file. Select a log file from the drop-down list, enter the number of lines to display, and select the View button.
Table F-27 Reports View Log Files Attributes
Attribute
|
Default Value
|
Description
|
View this logfile
|
Excluded URLs (filter)
|
Excluded URLs (filter), RD Manager (rdmgr), RDM Server (rdmsvr), Robot Activities (robot), Search Engine (searchengine), User Queries (rdm).
|
Number of lines
|
25
|
A number you can enter to display the most current entries in the log file.
|
Popular Searches
This page allows you to see what users are searching for. The most frequent searches appear first in the report.
Table F-28 Reports Popular Searches Attribute
Attribute
|
Default Value
|
Description
|
Exclude Browsing
|
False (unchecked)
|
False (unchecked) includes what categories users browse in. True (checked) excludes browsing statistics.
|