Oracle Ultra Search User's Guide
Release 9.0.3

Part Number B10043-01

4
Understanding the Ultra Search Crawler and Data Sources

This chapter contains the following topics:

  - Ultra Search Crawler Overview
  - Crawler Settings
  - Crawler Data Sources
  - Using Crawler Agents
  - Synchronizing Data Sources
  - Display URL and Access URL
  - Document Attributes
  - Crawling Process for the Schedule
  - Data Synchronization
  - Ultra Search Remote Crawler

Ultra Search Crawler Overview

The Ultra Search crawler is a Java process activated by your Oracle server according to a set schedule. When activated, the crawler spawns processor threads that fetch documents from various data sources. These documents are cached in the local file system. When the cache is full, the crawler indexes the cached files using Oracle Text. This index is used for querying.


Note:

An empty index is created when an Ultra Search instance is created. You can alter the index using SQL. The existing preferences, such as language-specific parameters, are defined in the $ORACLE_HOME/ultrasearch/admin/wk0pref.sql file.
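For illustration only, the following sketch shows how such an ALTER INDEX statement might be issued from Java through JDBC. The index name (MY_INSTANCE_IDX), lexer preference (MY_LEXER), connect string, and credentials are placeholders rather than actual Ultra Search names; consult wk0pref.sql for the preferences that apply to your instance.

// Illustrative only: index name, lexer preference, connect string, and credentials are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class AlterUltraSearchIndex {
    public static void main(String[] args) throws Exception {
        // Connect as the Ultra Search instance schema owner (connection details are assumptions).
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:oracle:thin:@dbhost:1521:orcl", "instance_owner", "password");
             Statement stmt = conn.createStatement()) {
            // Standard Oracle Text DDL: rebuild the index with a replacement lexer preference.
            stmt.execute("ALTER INDEX MY_INSTANCE_IDX REBUILD "
                       + "PARAMETERS ('REPLACE LEXER MY_LEXER')");
        }
    }
}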


Crawler Settings

Before you can use the crawler, you must set its operating parameters, such as the number of crawler threads, the crawler timeout threshold, the database connect string, and the default character set. Some parameters, like the log file directory and the temporary directory, have no default value, so you must set them before crawling. To do so, use the Crawler Settings Page in the administration tool.

See Also:

"Crawler Page"

Crawler Data Sources

In addition to the Web access parameters, you can define specific data sources on the Sources Page in the administration tool. You can define one or more of the following data sources:

  - Web sources
  - Table sources
  - Email sources
  - File sources
  - User-defined sources

Using Crawler Agents

To crawl and index a proprietary document repository or document management system, such as Lotus Notes or Documentum, you must define a user-defined data source and implement a crawler agent as a Java class. The agent collects document URLs and associated metadata from the proprietary document source and returns the information to the Ultra Search crawler, which enqueues it for later crawling. For more information on defining a new data source type, see the User-Defined sub-tab in Sources Page in the administration tool.
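The exact agent API is documented in the administration tool; the following sketch only illustrates the general shape of such an agent. The class and method names (NotesCrawlerAgent, DocumentInfo, collectDocuments) are hypothetical and are not part of the Ultra Search agent interface.

// A rough, hypothetical sketch; NOT the actual Ultra Search agent API.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NotesCrawlerAgent {

    /** Holder for a document URL plus its metadata (hypothetical type). */
    public static class DocumentInfo {
        public final String url;
        public final Map<String, String> attributes;

        public DocumentInfo(String url, Map<String, String> attributes) {
            this.url = url;
            this.attributes = attributes;
        }
    }

    /**
     * Collects document URLs and associated metadata from the proprietary
     * repository so that the crawler can enqueue them for later crawling.
     */
    public List<DocumentInfo> collectDocuments() {
        List<DocumentInfo> documents = new ArrayList<>();
        // A real agent would walk the repository through its native API
        // (for example, the Lotus Notes or Documentum client libraries).
        Map<String, String> attributes = new HashMap<>();
        attributes.put("Author", "jdoe");
        attributes.put("LastModifiedDate", "2002-06-01");
        documents.add(new DocumentInfo("notes://server/database/document1", attributes));
        return documents;
    }

    public static void main(String[] args) {
        for (DocumentInfo doc : new NotesCrawlerAgent().collectDocuments()) {
            System.out.println(doc.url + " " + doc.attributes);
        }
    }
}

A real agent would also report the metadata described under Document Attributes so that it can be mapped to search attributes.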

Synchronizing Data Sources

You can create synchronization schedules with one or more data sources attached to them. Synchronization schedules define how frequently the Ultra Search index is updated with the information in the associated data sources. To define a synchronization schedule, use the Schedules Page in the administration tool.

Display URL and Access URL

For some applications, the URL that is crawled is, for security reasons, different from the one seen by the end user. For example, crawling on an internal Web site inside a firewall might be done without security checking, but when the end user queries the index, a corresponding mirror URL outside the firewall must be used. This mirror URL is called the display URL.

By default, the display URL is treated as the access URL unless a separate access URL is provided. The display URL must be unique in a data source, so two different access URLs cannot have the same display URL.
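As an illustration of these rules only (the class UrlMappingSketch and its methods are hypothetical, not Ultra Search code), the following sketch maps each display URL in a data source to its access URL, defaults the access URL to the display URL when none is given, and rejects duplicate display URLs.

// Hypothetical illustration of the display URL / access URL rules; not Ultra Search code.
import java.util.LinkedHashMap;
import java.util.Map;

public class UrlMappingSketch {
    // Maps display URL -> access URL for one data source.
    private final Map<String, String> accessByDisplay = new LinkedHashMap<>();

    /** Registers a URL pair; accessUrl may be null, in which case the display URL is crawled. */
    public void addUrl(String displayUrl, String accessUrl) {
        if (accessByDisplay.containsKey(displayUrl)) {
            // Display URLs must be unique within a data source.
            throw new IllegalArgumentException("Duplicate display URL: " + displayUrl);
        }
        accessByDisplay.put(displayUrl, accessUrl != null ? accessUrl : displayUrl);
    }

    public static void main(String[] args) {
        UrlMappingSketch source = new UrlMappingSketch();
        // The crawler fetches the internal mirror; end users see the external URL.
        source.addUrl("http://www.example.com/doc1.html",
                      "http://internal.example.com/doc1.html");
        // No separate access URL: the display URL is crawled directly.
        source.addUrl("http://www.example.com/doc2.html", null);
        System.out.println(source.accessByDisplay);
    }
}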

See Also:

"Sources Page"

Document Attributes

Document attributes, or metadata, describe the properties of a document. Each data source has its own set of document attributes. Attribute values are retrieved during crawling, mapped to search attributes, and stored and indexed in the database. This lets you query documents based on their attributes. Document attributes in different data sources can be mapped to the same search attribute, so you can query documents from multiple data sources based on the same search attribute.

If the document is a Web page, the attribute can come from the HTTP header or be embedded inside the HTML in metatags. Document attributes can be used for many purposes, including document management, access control, and version control. Different data sources can have attributes with different names that are used for the same purpose; for example, "version" and "revision". They can also use the same attribute name for different purposes; for example, "language" meaning natural language in one data source but programming language in another.

Search attributes are created in three ways:

The list of values (LOV) for a search attribute can help you specify a search query. If an attribute LOV is available, then the crawler registers the LOV definition, which includes the attribute value, the attribute value display name, and its translation.
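The following sketch is illustrative only; the class, map, and attribute names are hypothetical and not part of Ultra Search. It shows the idea of mapping differently named document attributes from two data sources onto one search attribute, and an LOV entry consisting of an attribute value, a display name, and a translation.

// Hypothetical illustration of attribute mapping and an LOV entry; not Ultra Search code.
import java.util.HashMap;
import java.util.Map;

public class AttributeMappingSketch {

    /** One LOV entry: attribute value, display name, and a translated display name. */
    record LovEntry(String value, String displayName, String translatedDisplayName) { }

    public static void main(String[] args) {
        // Per-data-source mapping: document attribute name -> search attribute name.
        Map<String, String> documentumMapping = new HashMap<>();
        documentumMapping.put("revision", "Version");

        Map<String, String> webSourceMapping = new HashMap<>();
        webSourceMapping.put("version", "Version");

        // Because both sources map to the search attribute "Version",
        // one query on "Version" covers documents from both sources.
        System.out.println(documentumMapping.get("revision")
                .equals(webSourceMapping.get("version"))); // prints true

        // An LOV entry registered for the "Version" search attribute.
        LovEntry entry = new LovEntry("2", "Version 2", "Version 2 (translated)");
        System.out.println(entry);
    }
}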

Crawling Process for the Schedule

The first time the crawler runs, it must fetch Web pages, table rows, files, and so on, based on the data source. It then adds these documents to the Ultra Search index. The crawling process for the schedule is broken into two phases:

  1. Queuing and Caching Documents
  2. Indexing Documents

Queuing and Caching Documents

Figure 4-1 and Figure 4-2 illustrate an instance of the crawling cycle in a sequence of eight steps. The example uses a Web data source, although the crawler can also crawl other data source types.

Figure 4-1 illustrates how the crawler and its crawling threads are activated. It also shows how the crawler queues hypertext links to control its navigation. This figure corresponds to Steps 1 to 5.

Figure 4-2 illustrates how the crawler caches Web pages. This figure corresponds to Steps 6 to 8.

The steps are the following:

  1. Oracle spawns the crawler according to the schedule you specify with the administration tool. When crawling is initiated for the first time, the URL queue is populated with the seed URLs. See Figure 4-1.
  2. The crawler initiates multiple crawling threads.
  3. A crawler thread removes the next URL from the queue.
  4. The crawler thread fetches the document from the Web. The document is usually an HTML file containing text and hypertext links.
  5. The crawler thread scans the HTML file for hypertext links and inserts new links into the URL queue. Duplicate links already in the document table are discarded.
  6. The crawler caches the HTML file in the local file system. See Figure 4-2.
  7. The crawler registers the URL in the document table.
  8. The crawler thread starts over by repeating Step 3.

Fetching a document, as shown in Step 4, can be time-consuming because of network traffic or slow Web sites. For maximum throughput, multiple threads fetch pages at any given time.
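The following sketch illustrates, in simplified form, how the queue-driven, multithreaded fetch-and-cache loop in the steps above could be structured. It is not the Ultra Search implementation; all class and method names are hypothetical, link extraction is omitted, and error handling is reduced to skipping failed fetches.

// A simplified, hypothetical sketch of the crawl loop; NOT the Ultra Search implementation.
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class CrawlLoopSketch {
    private final BlockingQueue<String> urlQueue = new LinkedBlockingQueue<>();
    private final Set<String> documentTable = ConcurrentHashMap.newKeySet(); // stands in for wk$doc
    private final Path cacheDir;

    public CrawlLoopSketch(List<String> seedUrls, Path cacheDir) {
        this.cacheDir = cacheDir;
        urlQueue.addAll(seedUrls);                        // Step 1: queue holds the seed URLs
    }

    public void crawl(int threadCount) {
        ExecutorService pool = Executors.newFixedThreadPool(threadCount); // Step 2: crawling threads
        for (int i = 0; i < threadCount; i++) {
            pool.submit(this::crawlLoop);
        }
        pool.shutdown();
    }

    private void crawlLoop() {
        String url;
        // poll() simply stops when the queue is momentarily empty;
        // the real crawler coordinates its threads more carefully.
        while ((url = urlQueue.poll()) != null) {         // Step 3: remove the next URL
            try {
                String html = fetch(url);                 // Step 4: fetch the document
                for (String link : extractLinks(html)) {  // Step 5: scan for hypertext links
                    if (documentTable.add(link)) {        // duplicates are discarded
                        urlQueue.offer(link);
                    }
                }
                cache(url, html);                         // Step 6: cache in the local file system
                documentTable.add(url);                   // Step 7: register in the document table
            } catch (Exception e) {
                // Failed fetches are skipped in this sketch; the real crawler logs them.
            }
        }                                                 // Step 8: repeat from Step 3
    }

    private String fetch(String url) throws Exception {
        try (var in = URI.create(url).toURL().openStream()) {
            return new String(in.readAllBytes());
        }
    }

    private List<String> extractLinks(String html) {
        return List.of();                                 // link extraction omitted in this sketch
    }

    private void cache(String url, String html) throws Exception {
        Files.writeString(cacheDir.resolve(Integer.toHexString(url.hashCode())), html);
    }
}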


Note:

URLs remain visible until the next crawling run. When the crawler detects that a URL is no longer there, the URL is removed from the wk$doc table, and Oracle Text automatically marks the document as deleted, even though the index data is still there. Cleanup is done through index optimization, which can be scheduled separately.


Figure 4-1 Queuing URLs


Figure 4-2 Caching URLs


Indexing Documents

When the file system cache is full (default maximum size is 20 megabytes), document caching stops and indexing begins. In this phase, Ultra Search augments the Oracle Text index using the cached files referred to by the document table. See Figure 4-3.

Figure 4-3 Indexing Documents


Data Synchronization

After the initial crawl, a URL is crawled and indexed again only if it has changed since the last crawl. The crawler determines whether a page has changed by using the HTTP If-Modified-Since header field or the checksum of the page. URLs that no longer exist are marked as such and removed from the index.

To update changed documents, the crawler uses an internal checksum to compare new Web pages with cached Web pages. Changed Web pages are cached and marked for reindexing.
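The following sketch illustrates this check; it is not the crawler's actual implementation, and the use of MD5 as the checksum algorithm and the class name SyncCheckSketch are assumptions. It shows how the HTTP If-Modified-Since header and a content checksum can together determine whether a page must be re-cached and reindexed.

// A hypothetical sketch of the change check; NOT the crawler's actual implementation.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.security.MessageDigest;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.HexFormat;
import java.util.Map;

public class SyncCheckSketch {
    // Checksum of each cached page, keyed by URL (stands in for the crawler's internal state).
    private final Map<String, String> cachedChecksums = new HashMap<>();
    private final HttpClient client = HttpClient.newHttpClient();

    /** Returns true if the page has changed since lastCrawl and must be re-cached and reindexed. */
    public boolean hasChanged(String url, ZonedDateTime lastCrawl) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                // If-Modified-Since lets the server answer 304 for unchanged pages.
                .header("If-Modified-Since",
                        lastCrawl.withZoneSameInstant(ZoneOffset.UTC)
                                 .format(DateTimeFormatter.RFC_1123_DATE_TIME))
                .build();
        HttpResponse<byte[]> response =
                client.send(request, HttpResponse.BodyHandlers.ofByteArray());
        if (response.statusCode() == 304) {
            return false;                                  // server reports the page is unmodified
        }
        // Otherwise compare a checksum of the new content with the cached checksum.
        // MD5 is used here only as an example algorithm.
        String checksum = HexFormat.of().formatHex(
                MessageDigest.getInstance("MD5").digest(response.body()));
        String previous = cachedChecksums.put(url, checksum);
        return !checksum.equals(previous);
    }
}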

The steps involved in data synchronization are the following:

  1. Oracle spawns the crawler according to the synchronization schedule you specify with the administration tool. The URL queue is populated with the data source URLs assigned to the schedule.
  2. The crawler initiates multiple crawling threads.
  3. A crawler thread removes the next URL from the queue.
  4. The crawler thread fetches the document from the Web. The page is usually an HTML file containing text and hypertext links.
  5. The crawler thread calculates a checksum for the newly retrieved page and compares it with the checksum of the cached page. If the checksums are the same, then the page is discarded and the crawler returns to Step 3. Otherwise, the crawler moves to the next step.
  6. The crawler thread scans the document for hypertext links and inserts new links into the URL queue. Duplicate links already in the document table are discarded.
  7. The crawler caches the document in the local file system. See Figure 4-2.
  8. The crawler registers the URL in the document table.
  9. If the file system cache is full or if the URL queue is empty, then Web page caching stops and indexing begins. Otherwise, the crawler thread starts over by repeating Step 3.

Ultra Search Remote Crawler

To increase crawling performance, set up the Ultra Search crawler to run on one or more machines separate from your database. These machines are called remote crawlers. However, each machine must share cache, log, and mail archive directories with the database machine.

To configure a remote crawler, you must first install the Ultra Search middle tier components module on a machine other than the database host. During installation, the remote crawler is registered with the Ultra Search system, and a profile is created for the remote crawler. After installing the Ultra Search middle tier components module, you must log on to the Ultra Search administration tool and edit the remote crawler profile. You can then assign a remote crawler to a crawling schedule. To edit remote crawler profiles, use the Crawler Settings Page in the administration tool.


Caution:

When launching a remote crawler, the Ultra Search back end database communicates with the remote machine through Java RMI (remote method invocation). By default, RMI sends data over the network unencrypted. Using the remote crawler to perform crawling introduces a potential security risk. A malicious entity within the enterprise could steal the Ultra Search instance schema and password by listening to packets going across the network. Refrain from using the remote crawler feature if this security risk is unacceptable.