About the Oracle Ultra Search Crawler and Data Sources

The Oracle Ultra Search crawler is a multi-threaded Java application responsible for gathering documents from the data sources you specify during configuration. The crawler stores the documents in a local file system cache. With the cached data, Oracle Ultra Search creates the index required for querying.

Crawler Settings

Before you can use the crawler, you must set its operating parameters, such as the number of crawler threads, the crawler timeout threshold, the database connect string, and the default character set. To do so, use the Crawler Page in the administration tool.
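
The following Java sketch is purely illustrative: Ultra Search manages these settings through the administration tool, not through a class like this one, and the field names are assumptions chosen only to mirror the parameters listed above.

    // Hypothetical holder for the operating parameters described above.
    // Ultra Search itself manages these settings through the administration tool.
    public class CrawlerSettings {
        private final int numberOfThreads;      // number of crawler threads
        private final int timeoutSeconds;       // crawler timeout threshold
        private final String connectString;     // database connect string
        private final String defaultCharset;    // default character set

        public CrawlerSettings(int numberOfThreads, int timeoutSeconds,
                               String connectString, String defaultCharset) {
            this.numberOfThreads = numberOfThreads;
            this.timeoutSeconds = timeoutSeconds;
            this.connectString = connectString;
            this.defaultCharset = defaultCharset;
        }

        public int getNumberOfThreads()   { return numberOfThreads; }
        public int getTimeoutSeconds()    { return timeoutSeconds; }
        public String getConnectString()  { return connectString; }
        public String getDefaultCharset() { return defaultCharset; }
    }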

Primary Crawling Schedule

By default, the crawler is assigned a primary schedule. This schedule must be run before any other crawling schedule. The web access parameters you define, such as seed URLs, tell the crawler's primary schedule where to start. To define web access parameters, use the Web Access Page in the administration tool.

To edit and execute the primary crawling schedule, use the "Data Synchronization" subtab in the Schedules Page in the administration tool.

Crawler Data Sources

In addition to the web access parameters, you can define one or more of the following data sources:

  • web sites
  • database tables
  • files
  • emails (mailing lists)

Web Sites

You can define web sites as a data source with the http:// protocol. To define web sites as a data source, use the Web tab on the Sources Page in the administration tool.

Database Tables

You can define a database table as a data source. To define database tables as a data source, use the Table tab on the Sources Page in the administration tool.
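
If it helps to picture what crawling a table involves, the following sketch reads each row of a table as one document using JDBC. It is only a conceptual illustration; the connect string, table name, and column names are placeholders, not part of the actual Ultra Search configuration.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Conceptual sketch only: each row of a table is treated as one "document".
    // Requires an Oracle JDBC driver on the classpath; all names are placeholders.
    public class TableSourceSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:oracle:thin:@dbhost:1521:orcl", "scott", "tiger");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT doc_id, title, body FROM documents")) {
                while (rs.next()) {
                    String key   = rs.getString("doc_id");
                    String title = rs.getString("title");
                    String body  = rs.getString("body");
                    // A real crawler would hand the row off for caching and indexing;
                    // here we simply print it.
                    System.out.println(key + ": " + title + " (" + body.length() + " chars)");
                }
            }
        }
    }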

Files

You can define files as a data source with the file:// protocol. These files must be accessible by each crawler machine. To define files as a data source, use the File tab on the Sources Page in the administration tool.

Emails

You can define an email source to crawl emails sent to a specific email address. This feature is useful for crawling mailing lists: you create an IMAP email account that subscribes to the list or lists you want to index, and all messages sent to that email address (and therefore to the mailing list) are indexed. To define an email data source, use the Email tab on the Sources Page in the administration tool.
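
The sketch below uses the standard JavaMail API to show, conceptually, how messages in such an IMAP account can be retrieved. The host, account, and password are placeholders; Ultra Search performs this retrieval internally once the email data source is defined.

    import java.util.Properties;
    import javax.mail.Folder;
    import javax.mail.Message;
    import javax.mail.Session;
    import javax.mail.Store;

    // Conceptual sketch: connect to an IMAP account subscribed to a mailing list
    // and walk through its messages. Connection details are placeholders.
    public class MailingListSketch {
        public static void main(String[] args) throws Exception {
            Session session = Session.getInstance(new Properties());
            Store store = session.getStore("imap");
            store.connect("imap.example.com", "list-archive", "secret");

            Folder inbox = store.getFolder("INBOX");
            inbox.open(Folder.READ_ONLY);
            for (Message msg : inbox.getMessages()) {
                // Each message addressed to the mailing list becomes one document.
                System.out.println(msg.getSubject());
            }
            inbox.close(false);
            store.close();
        }
    }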

Synchronizing Data Sources

Synchronization schedules use the data sources you create; each schedule has one or more data sources attached to it. A synchronization schedule defines how frequently the Ultra Search index is kept up to date with the information in the associated data sources. To define a synchronization schedule, use the "Data Synchronization" subtab in the Schedules Page in the administration tool.

Crawling Process for the Primary Schedule

The first time the crawler runs, it must fetch web pages and create the Ultra Search index. The crawling process for the primary schedule can be broken into two phases:

  • Queuing and Caching Documents
  • Indexing Documents

Queuing and Caching Documents

Figures 2a and 2b illustrate an instance of the crawling cycle as a sequence of eight steps. The example uses a web data source, although the crawler can also crawl database tables, files, and mailing lists.

Figure 2a illustrates how the crawler and its crawling threads are activated. It also shows how the crawler queues hypertext links to control its navigation. This figure corresponds to Steps 1 to 5.

Figure 2b illustrates how the crawler caches web pages. This figure corresponds to Steps 6 to 8.

The steps are the following:

  1. Oracle spawns the crawler according to the primary schedule you specify with the administration tool. When crawling is initiated for the first time, the URL queue is populated with the seed URLs. Refer to Figure 2a.
  2. The crawler initiates multiple crawling threads.
  3. A crawler thread removes the next URL from the queue.
  4. The crawler thread fetches the document from the web. The document is usually an HTML file containing text and hypertext links.
  5. The crawler thread scans the HTML file for hypertext links and inserts new links into the URL queue. Duplicate links already in the document table are discarded.
  6. The crawler caches the HTML file in the local file system. Refer to Figure 2b.
  7. The crawler registers the URL in the document table.
  8. The crawler thread starts over by repeating Step 3.

Fetching a document as shown in Step 4 can be time-consuming because of network traffic or slow web sites. For maximum throughput, multiple threads fetch pages at any given time.
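
The following greatly simplified Java sketch mirrors Steps 2 through 8: a shared URL queue feeds several worker threads, each of which fetches a page, extracts its links, and records the URL. It is not the Ultra Search implementation; an in-memory set stands in for the document table, and caching to disk is omitted.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.Set;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Greatly simplified crawl loop: a shared URL queue, several worker threads,
    // and a set standing in for the document table. Not the Ultra Search code.
    public class CrawlLoopSketch {
        private static final Pattern LINK = Pattern.compile("href=\"(http[^\"]+)\"");

        public static void main(String[] args) {
            BlockingQueue<String> queue = new LinkedBlockingQueue<>();
            Set<String> documentTable = ConcurrentHashMap.newKeySet();
            queue.add("http://www.example.com/");                    // seed URL

            ExecutorService threads = Executors.newFixedThreadPool(4); // Step 2
            for (int i = 0; i < 4; i++) {
                threads.submit(() -> {
                    String url;
                    while ((url = queue.poll()) != null) {           // Step 3
                        try {
                            String html = fetch(url);               // Step 4
                            Matcher m = LINK.matcher(html);         // Step 5
                            while (m.find()) {
                                String link = m.group(1);
                                if (documentTable.add(link)) {      // discard duplicates
                                    queue.add(link);
                                }
                            }
                            // Step 6 (caching to disk) is omitted in this sketch.
                            documentTable.add(url);                 // Step 7
                        } catch (Exception e) {
                            // A real crawler would log the failure and move on.
                        }
                    }                                               // Step 8: repeat
                });
            }
            threads.shutdown();
        }

        private static String fetch(String url) throws Exception {
            StringBuilder sb = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                     new InputStreamReader(new URL(url).openStream()))) {
                String line;
                while ((line = in.readLine()) != null) sb.append(line).append('\n');
            }
            return sb.toString();
        }
    }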

Figure 2a: Queuing URLs (illustration isrch005.gif)

Figure 2b: Caching URLs (illustration isrch006.gif)

Indexing Documents

When the file system cache is full (default maximum size is 20 megabytes), document caching stops and indexing begins. In this phase, Ultra Search augments the Oracle9i Text index using the cached files referred to by the document table. Refer to Figure 3.
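
A minimal sketch of the "cache full" decision might look like the following; the cache directory path is a placeholder, and the 20 megabyte limit matches the default mentioned above.

    import java.io.File;

    // Minimal sketch of the "cache full?" check described above.
    // The cache directory path is a placeholder.
    public class CacheThresholdSketch {
        private static final long MAX_CACHE_BYTES = 20L * 1024 * 1024;  // 20 megabytes

        public static boolean cacheIsFull(File cacheDir) {
            long total = 0;
            File[] files = cacheDir.listFiles();
            if (files != null) {
                for (File f : files) {
                    total += f.length();
                }
            }
            return total >= MAX_CACHE_BYTES;
        }

        public static void main(String[] args) {
            File cacheDir = new File("/tmp/ultrasearch_cache");   // placeholder path
            if (cacheIsFull(cacheDir)) {
                System.out.println("Caching stops; indexing begins.");
            } else {
                System.out.println("Keep caching documents.");
            }
        }
    }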

Figure 3: Indexing Documents (illustration isrch004.gif)

Crawling Process for Data Synchronization

Since document content can change, the crawler must keep its index up to date. To schedule data synchronization crawling, you must define a synchronization schedule and attach data sources to it. To do so, use the Schedules Page in the administration tool.

Crawling for data synchronization differs from the initial crawl on the primary schedule: data synchronization updates only documents that have changed, whereas the primary crawl indexes all new documents.

To update changed documents, the crawler uses an internal checksum to compare new web pages with cached web pages. Changed web pages are cached and marked for re-indexing.
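
The document does not state which checksum algorithm the crawler uses, so the following sketch assumes MD5 purely for illustration. It shows how a newly fetched page can be compared against the checksum stored for the cached copy.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.Arrays;

    // Illustration of the checksum comparison described above.
    // MD5 is an assumption; the actual algorithm is not documented here.
    public class ChecksumSketch {
        public static byte[] checksum(String pageContent) throws Exception {
            MessageDigest md = MessageDigest.getInstance("MD5");
            return md.digest(pageContent.getBytes(StandardCharsets.UTF_8));
        }

        public static boolean hasChanged(String newPage, byte[] cachedChecksum) throws Exception {
            // Equal checksums mean the page is unchanged and can be discarded (Step 5);
            // otherwise the page is re-cached and marked for re-indexing.
            return !Arrays.equals(checksum(newPage), cachedChecksum);
        }
    }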

The steps involved in data synchronization are the following:

  1. Oracle spawns the crawler according to the synchronization schedule you specify with the administration tool. The URL queue is populated with the data source URLs assigned to the schedule.
  2. The crawler initiates multiple crawling threads.
  3. A crawler thread removes the next URL from the queue.
  4. The crawler thread fetches the document from the web. The page is usually an HTML file containing text and hypertext links.
  5. The crawler thread calculates a checksum for the newly retrieved page and compares it with the checksum of the cached page. If the checksums are the same, the page is discarded and the crawler returns to Step 3. Otherwise, the crawler moves to the next step.
  6. The crawler thread scans the document for hypertext links and inserts new links into the URL queue. Duplicate links already in the document table are discarded.
  7. The crawler caches the document in the local file system. Refer to Figure 2b.
  8. The crawler registers the URL in the document table.
  9. If the file system cache is full or if the URL queue is empty, web page caching stops and indexing begins. Otherwise, the crawler thread starts over by repeating Step 3.

Remote Crawler

To increase crawling performance, you can set up the Ultra Search crawler to run on one or more machines separate from your database. These machines are called remote crawlers. However, each machine must share cache, log, and mail archive directories with the database machine.

To configure a remote crawler, you must first install the Ultra Search middle-tier components module on a machine other than the database host. During installation, the remote crawler is registered with the Ultra Search system and a profile is created for it. After installing the middle-tier components module, log in to the Ultra Search administration tool and edit the remote crawler profile. You can then assign the remote crawler to a crawling schedule. To edit remote crawler profiles, use the Remote Crawler tab in the Crawler Page in the administration tool.