Specify the number of crawler threads that will be spawned at run time.
Number of processors
Specify the number of central processing units (CPUs) that exist on
the server where the Ultra Search crawler will run. This setting is
used to determine the optimal number of document conversion threads
used by the system. A document conversion thread converts multi-format
documents into HTML documents for proper indexing.
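As a rough illustration of how a CPU count can drive the conversion pool size, consider the sketch below. It is not Ultra Search code, and the one-thread-per-CPU heuristic is an assumption for illustration; the product's actual sizing formula is not documented here.

```java
// Illustrative sketch: sizing a document-conversion thread pool from the
// number of CPUs. The one-thread-per-CPU rule is an assumed heuristic.
public class ConversionPoolSizer {
    // Never return fewer than one thread, even on a misreported CPU count.
    static int conversionThreads(int cpus) {
        return Math.max(1, cpus);
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("CPUs detected: " + cpus
                + ", conversion threads: " + conversionThreads(cpus));
    }
}
```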
Automatic language detection
Not all documents retrieved by the Ultra Search crawler specify the
language. For documents with no language specification, the Ultra Search
crawler attempts to detect the language automatically. Specify Yes to turn
on this feature.
Crawling depth
A web document might contain links to other web documents, which in
turn might contain more links. This setting allows you to specify the
maximum number of nested links the crawler will follow.
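The effect of the crawling depth can be pictured with a small sketch (illustrative only, not Ultra Search code; the link graph and names are hypothetical). Pages reachable only through more link hops than the configured depth are never visited:

```java
import java.util.*;

// Illustrative sketch of depth-limited link following. maxDepth is the
// maximum number of nested links followed from the starting page.
public class DepthLimitedCrawl {
    // links: page -> pages it links to (a hypothetical link graph).
    static List<String> visit(Map<String, List<String>> links,
                              String start, int maxDepth) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>(List.of(start));
        List<String> frontier = List.of(start);
        for (int depth = 0; depth <= maxDepth && !frontier.isEmpty(); depth++) {
            List<String> next = new ArrayList<>();
            for (String page : frontier) {
                visited.add(page); // "fetch and index" the page
                for (String link : links.getOrDefault(page, List.of())) {
                    if (seen.add(link)) {
                        next.add(link); // follow one more level of nesting
                    }
                }
            }
            frontier = next;
        }
        return visited;
    }
}
```

With a chain of pages a → b → c, a crawling depth of 1 visits only a and b; c is one link too deep.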
Crawler timeout threshold
Specify the crawler timeout in seconds. The crawler timeout threshold
forces a timeout when the crawler cannot access a web page within that interval.
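The sketch below shows how a per-request timeout makes a stalled fetch fail rather than hang. It is illustrative only, not the crawler's internal code; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative sketch: applying a timeout threshold to a page fetch so
// an unreachable page raises an exception instead of blocking forever.
public class FetchWithTimeout {
    static HttpURLConnection open(String page, int timeoutSeconds) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(page).openConnection();
            conn.setConnectTimeout(timeoutSeconds * 1000); // milliseconds
            conn.setReadTimeout(timeoutSeconds * 1000);
            return conn;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```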
Default character set
Specify the default character set. The crawler uses this setting when
an HTML document does not have its character set specified.
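The fallback behavior described above can be sketched as follows (illustrative only; the method name is hypothetical): use the character set the document declares when present, otherwise fall back to the configured default.

```java
import java.nio.charset.Charset;

// Illustrative sketch: pick the document's declared charset if it has
// one, otherwise fall back to the configured default character set.
public class CharsetFallback {
    static Charset effectiveCharset(String declared, Charset configuredDefault) {
        return (declared == null || declared.isEmpty())
                ? configuredDefault
                : Charset.forName(declared);
    }
}
```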
Temporary directory location and size
Specify a temporary directory and size. The crawler uses the temporary
directory for intermediate storage while gathering documents. Specify
the absolute path of the temporary directory. The size is the maximum
temporary space in megabytes that will be used by the crawler.
Logfile directory
Specify the logfile directory. The logfile directory is used for storing
the crawler logfile(s). The logfile records all crawler activity, warnings,
and error messages for a particular schedule. It includes
messages logged at startup, runtime, and at shutdown.
By default, the Ultra Search crawler prints selected
crawler activity into each schedule logfile. Selective
printing is necessary to avoid creating immensely large
logfiles (which can easily happen when crawling a large number
of documents). However, in certain
situations, it may be beneficial to configure the crawler to
print detailed activity to each schedule logfile. This is known
as verbose logging. To configure the crawler for verbose
logging, you must log in to the Oracle server using
SQL*Plus. Log in as the Ultra Search instance owner
(or any user that has been granted administrative privileges
on that instance). Once logged in, run the following commands:
- exec wk_adm.use_instance('<instance_name>');
- exec wk_crw.update_crawler_config(verbose=>1);
Database connect string
The database connect string is a standard JDBC connect string used
by the crawler when it needs to connect to the database. The format
of the connect string must be as follows: [hostname]:[port]:[SID]
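Assembling this string can be sketched in Java (the hostname, port, and SID below are placeholders, not real values). With the Oracle thin JDBC driver, this string typically forms the tail of a URL such as jdbc:oracle:thin:@hostname:port:SID.

```java
// Illustrative sketch: building the crawler's database connect string
// in the [hostname]:[port]:[SID] format. All values are placeholders.
public class ConnectString {
    static String connectString(String hostname, int port, String sid) {
        return hostname + ":" + port + ":" + sid;
    }

    public static void main(String[] args) {
        System.out.println(connectString("dbhost", 1521, "orcl"));
    }
}
```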
Use this page to view and edit remote crawler profiles. A remote crawler
profile consists of all parameters needed to run the Ultra Search crawler
on a machine other than the one hosting the Oracle Ultra Search database.
A remote crawler profile is identified by its hostname. The profile includes
the cache, log, and mail directories that the remote crawler shares
with the database machine.
To set these parameters, click on "Edit". Enter the shared directory
paths as seen by the remote crawler. You must ensure that these directories
are shared or mounted appropriately.