This chapter describes the attributes available in the Search Robot. The robot has an extensive set of properties: you can select the sites to be searched, check whether a site is valid, define which types of documents should be picked up, and schedule when the searches take place.
This chapter contains the following sections:
The Robot Overview panel shows what the robot is doing: whether it is Off, Idle, Running, or Paused, and, if it is Running, what progress it is making in the search. The panel is refreshed about every 30 seconds; the refresh rate is defined by the robot-refresh parameter in the search.conf file.
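The refresh interval might appear in search.conf as a line like the following. This is an illustrative sketch only; the exact syntax and surrounding entries are assumptions, so check your installation's search.conf for the real form:

```
# Robot Overview refresh interval, in seconds (illustrative sketch;
# verify the exact syntax against your installation's search.conf)
robot-refresh=30
```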
If the robot is Off, the panel shows two buttons: Start at the top and Clear Robot Database at the bottom. If the robot is Running or Idle, the two buttons are Stop and Pause. If it is Paused, the two buttons are Stop and Resume. By clicking any of the attributes, you go to the Reports section, where you can get a detailed, up-to-the-minute report on that attribute.
The table below lists the Robot Overview attributes and their descriptions.
Table 4–1 Robot Overview Attributes
Attribute | Default Value | Description
---|---|---
The Robot is | Current activity | The robot's state. The value can be Idle, Running, Paused, or Off.
Last Updated at | Date and time last refreshed | This page is refreshed to keep you aware of the progress the robot is making.
Starting Points | Number defined | Displays the sites that the robot crawls to generate resource descriptions. The robot does not index resources from disabled sites.
URL Pool | Number of URLs waiting | Number of URLs (Uniform Resource Locators) yet to be investigated. When you begin a search, the starting-point URLs are entered into the URL pool. As the search progresses, the robot discovers links to other URLs, which are added to the pool. After all the URLs in the pool have been processed, the URL pool is empty and the robot is idle.
Extracting | Number of connections per second | Number of resources examined per second. Extracting is the process of discovering or locating resources, documents, or hyperlinks to be included in the database and filtering out unwanted items.
Filtering | Number of URLs rejected | Total number of URLs that are excluded.
Indexing | Number of URLs per second | Number of resources or documents turned into resource descriptions per second. Indexing is the phase in which all the information gathered on a document is turned into a resource description for inclusion in the search database.
Excluded URLs | Number of URLs excluded by filters | Number of URLs that did not meet the filtering criteria.
 | Number of URLs excluded by errors | Number of URLs where the robot encountered errors, such as file not found.
Resource Descriptions | Number of RDs contributed | Number of resource descriptions added to the database.
 | Number of bytes of RDs contributed | Number of bytes added to the database.
General Stats | Number of URLs retrieved | Number of URLs retrieved during the run.
 | Number of bytes, average size of RDs | 
 | Time (in days, hours, minutes, and seconds) running | The amount of time the robot has been running.
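The URL pool described above behaves like a simple crawl frontier: starting points seed a queue, newly discovered links are appended, and the robot goes idle once the queue drains. The following is a minimal sketch of that behavior; the in-memory link graph and the function name are hypothetical illustrations, not the product's code:

```python
from collections import deque

# Hypothetical in-memory link graph standing in for real web pages.
LINKS = {
    "http://www.sesta.com/": ["http://www.sesta.com/a.html",
                              "http://www.sesta.com/b.html"],
    "http://www.sesta.com/a.html": ["http://www.sesta.com/b.html"],
    "http://www.sesta.com/b.html": [],
}

def run_robot(starting_points):
    """Process the URL pool until it is empty (the robot goes idle)."""
    pool = deque(starting_points)   # starting points seed the URL pool
    seen = set(starting_points)
    processed = []
    while pool:                     # robot is Running while URLs remain
        url = pool.popleft()
        processed.append(url)
        for link in LINKS.get(url, []):  # extracting: discover new links
            if link not in seen:
                seen.add(link)
                pool.append(link)        # discovered URLs join the pool
    return processed                # pool is empty: robot is now Idle

print(run_robot(["http://www.sesta.com/"]))
```

Run against this toy graph, the starting point is processed first and the two discovered pages follow, after which the pool is empty.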
When you click the Sites tab, the Manage Sites page is displayed. This page lists the site name and status (enabled or disabled) of each site that the robot crawls to generate resource descriptions. When you select a site's checkbox, the Delete, Enable, and Disable buttons become active. Click the Delete button to delete the selected site, or click the Enable or Disable button to enable or disable it. A disabled site is not searched when the robot is run.
The table below describes the attributes on the Manage Sites page.
Table 4–2 Manage Sites Attributes
Attribute | Default Value | Description
---|---|---
Lock or cluster graphic | Status of site | Lock open means that the URL is accessible. The closed lock means that the site is a secure web server and uses SSL. The cluster means that the site is a domain.
Enabled/Disabled | Enabled | Choose to search this site or not when the robot is run.
To create a new site, click the New button. The New Robot Site page appears, which allows you to set up a new robot site. The table below describes the attributes available on the New Robot Site page.
Table 4–3 New Robot Site Attributes
Attribute | Default Value | Description
---|---|---
Type | URL | Select URL or Domain from the list box.
Site | Blank | If you selected URL as the Type, enter the URL of the site you want to create, in the format http://www.sesta.com. If you selected Domain as the Type, enter the domain of the site you want to create, in the format *.sesta.com
Depth | 10 | You have a choice of 1 (this URL only), 2 (this URL and first links), 3 through 10, 100, or unlimited. The default value is set on the Robot —> Manage Properties page.
Destination Database | Use Internal Default | Select the database that you want to use from the list box of available databases.
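The Depth attribute above bounds how far the robot follows links from a starting point: 1 retrieves only that URL, 2 adds the pages it links to directly, and so on. The sketch below simulates depth-limited crawling over a tiny, hypothetical link graph; the names and graph are illustrations, not the robot's implementation:

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "start": ["first1", "first2"],
    "first1": ["second1"],
    "first2": [],
    "second1": [],
}

def crawl(start, depth):
    """Collect pages reachable from `start` within `depth` levels.

    depth=1 -> this URL only; depth=2 -> this URL and first links; etc.
    """
    seen = {start}
    frontier = deque([(start, 1)])
    while frontier:
        url, level = frontier.popleft()
        if level >= depth:
            continue  # do not follow links beyond the configured depth
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append((link, level + 1))
    return sorted(seen)

print(crawl("start", 1))  # ['start']
print(crawl("start", 2))  # ['first1', 'first2', 'start']
```

Raising the depth to 3 would also pick up the second-level page linked from first1.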
Click a site name to navigate to the Edit a Site page. You can use this page to define the search site more completely: you can specify what type of server it is, redefine the depth of the search, and select which types of files to add to the search database. The attributes for URL and Domain sites are mostly the same; the additional column in this table shows which attributes are shared and which are unique.
You can verify the server name for the search site you entered. In the Server Group section, click the New button to add more servers to the server group. In the Starting Points section, click the New button to add more starting points. In the Filter Definition section, you can add or delete filters, exclude or include certain types of files, and change the order in which the filters for these files are applied.
The table below provides the attributes and their description in the Edit a Site page.
Table 4–4 Edit a Site Attributes
Under the Filters tab, the Manage Filters page lists all the defined filter rules, the status of each filter rule, its default value for new sites, and the sites it is used in. Each filter rule is preceded by a checkbox. To delete a filter rule, select the corresponding checkbox and click the Delete button. To create a new filter:
1. Click the New button. The New Robot Filter Wizard appears, displaying the Specify Filter Name and Description page as a first step.
2. Enter the filter name in the Filter Name text box.
3. Enter the description for the filter in the Filter Description text box.
4. Click the Next button. The Specify Filter Definition and Behavior page appears, providing the Filter Definition — Matching Rules section. The table below lists the attributes in the Filter Definition and Behavior section and their descriptions.
5. Click the Finish button.
Attribute | Default Value | Description
---|---|---
Filter Source | URL | Choose an option from the list box to specify the source of the filter. The available values are: URL, protocol, host, path, and MIME type.
Filter By | is | Choose an option from the list box to specify how you want to filter the source. The available values are: is, contains, begins with, ends with, and regular expression.
Filter String | Blank | Enter the string that defines the filter.
Filter Default | Selected | Assign this filter to new sites when they are created.
Filter Behavior | Exclude documents that match this filter when the Robot runs | The default option excludes documents that match this filter when the robot runs. The other option includes documents that match this filter when the robot runs.
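The five Filter By operators above can be illustrated with a small matcher. This is a sketch of the matching semantics only, under the assumption that the operators behave as plain string and regular-expression tests; it is not the robot's own code:

```python
import re

def matches(value, op, pattern):
    """Apply one of the five Filter By operators to a source value."""
    if op == "is":
        return value == pattern
    if op == "contains":
        return pattern in value
    if op == "begins with":
        return value.startswith(pattern)
    if op == "ends with":
        return value.endswith(pattern)
    if op == "regular expression":
        return re.search(pattern, value) is not None
    raise ValueError("unknown operator: " + op)

# A filter with source URL, operator "ends with", and string ".pdf"
# would match (and, with the default behavior, exclude) PDF documents:
print(matches("http://www.sesta.com/guide.pdf", "ends with", ".pdf"))  # True
```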
Click a filter rule to navigate to the Edit a Filter page. The table below lists the attributes on the Edit a Filter page and their descriptions. The default values for these attributes are the same as those in the previous table.
Table 4–6 Edit a Filter Attributes
Attribute | Description
---|---
Filter Name | A descriptive name that reflects the type of file the filter applies to.
Drop-down list of filter sources | URL, protocol, host, path, MIME type
Drop-down list of positions | is, contains, begins with, ends with, regular expression
Text box for type specifics (directory, protocol, file extensions) | In this text box, list what you want to match. For example, given http://docs.sesta.com/manual.html: the protocol is http, the host contains sesta, and the file ends with html.
Filter Description | Describe the filter rule for yourself. The robot does not use it.
Filter Default | Use this as one of the default filters when creating new sites. If you do not check this, you can still add this filter to a new site by editing the site on the Robot, Sites page.
Filter Behavior | This attribute provides two options: Exclude documents that match this filter when the Robot runs, and Include documents that match this filter when the Robot runs. By default, the first option is selected.
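Using the example URL above, the filter sources (protocol, host, and path) can be pulled apart with Python's standard urllib.parse module. This sketch only shows which part of the URL each source refers to; it is not the robot's own parsing code:

```python
from urllib.parse import urlparse

parts = urlparse("http://docs.sesta.com/manual.html")
print(parts.scheme)   # protocol: http
print(parts.netloc)   # host: docs.sesta.com
print(parts.path)     # path: /manual.html

# The matches described in the table:
print(parts.scheme == "http")        # protocol is http    -> True
print("sesta" in parts.netloc)       # host contains sesta -> True
print(parts.path.endswith("html"))   # file ends with html -> True
```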
Click the Robot —> Properties tab. The Manage Properties page appears. The settings on this page control the robot’s operational parameters and defaults. It is divided into these sections: Crawling Speed, Completion Actions, Logfile Settings, Standard Compliance, Authentication Parameters, Proxy Settings, Link Following, Advanced Settings, and Indexing Settings.
The table below lists the attributes and their description in the Manage Properties page.
Table 4–7 Manage Properties Attributes
The robot searches sites and collects documents based on the filters you have selected. The collected documents are in many different formats; to make them uniform and easily readable, they are converted to a single format, HTML. This page controls some of the parts that go into each resource description.
You can find the simulator attributes on the Robot Utilities page under the Utilities tab. The simulator is a debugging tool that performs a partial simulation of robot filtering on a URL. You can type in a new URL to check. The simulator checks the URL, DNS translations (including Smart Host Heuristics), and site redirections. It does not check the contents of the document specified by the URL, so it does not detect duplications, MIME types, network errors, permissions, and the like. The simulator indicates whether each listed site would be accepted by the robot (ACCEPTED) or not (WARNING).
The table below describes the attributes in the Simulator section of the Robot Utilities page.
Table 4–8 Robot Simulator Attributes
Attribute | Default Value | Description
---|---|---
Run Simulator on | URLs you have already defined, plus one blank text box | You can check access to a new site by typing its URL in the blank text box. This checks whether the new site accepts crawling. Format: http://www.sesta.com:80/
Show advanced DNS information | Unselected | When selected, displays more information about the site.
Check for server redirects | Selected | When selected, checks for any server redirects.
The site probe attributes are also available on the Robot Utilities page. The site probe is a debugging tool that checks for DNS aliases, server redirects, and virtual servers. This tool returns information about a site but does not test whether the site accepts crawling.
The table below provides the site probe attributes and their descriptions.
Table 4–9 Robot Site Probe Attributes
Attribute | Default Value | Description
---|---|---
Run Site Probe on | Blank | Type in a URL in the format http://www.sesta.com:80
Show advanced DNS information | Unselected | When selected, displays more information about the site, including IP addresses.