Crawler Settings
Before you can use the crawler, you must set its operating parameters,
such as the number of crawler threads, the crawler timeout threshold, the
database connect string, and the default character set. To do so, use the
Crawler Page in the administration tool.
Primary Crawling Schedule
By default, the crawler is assigned a primary schedule. This schedule
must be run before any other crawling schedule. The web access parameters
you define, such as seed URLs, tell the crawler's primary schedule where
to start. To define web access parameters, use the Web
Access Page in the administration tool.
To edit and execute the primary crawling schedule, use the "Data Synchronization"
subtab in the Schedules Page in the administration
tool.
Crawler Data Sources
In addition to the web access parameters, you can define specific data
sources. You can define one or more of the following data sources:
- web sites
- database tables
- files
- mailing lists
Web Sites
You can define web sites as a data source with the http:// protocol.
To define web sites as a data source, use the Web tab on the Sources
Page in the administration tool.
Database Tables
You can define a database table as a data source. To define database
tables as a data source, use the Table tab on the Sources
Page in the administration tool.
Files
You can define files as a data source with the file:// protocol. These
files must be accessible by each crawler machine. To define files as
a data source, use the File tab on the Sources
Page in the administration tool.
Emails
You can define an email source to crawl emails sent to a specific email
address. This feature is useful for crawling mailing lists. To do so,
you create an IMAP email account that subscribes to the list(s). All
messages addressed to the email address / mailing list are indexed.
To define an email data source, use the Email tab on the Sources
Page in the administration tool.
Synchronizing Data Sources
Data sources are used by synchronization schedules you create. A synchronization
schedule has one or more data sources attached to it. Synchronization
schedules define the frequency at which the Ultra Search index is kept
up to date with existing information in the associated data sources.
To define a synchronization schedule, use the "Data Synchronization"
subtab in the Schedules Page in the administration
tool.
Crawling Process for the Primary Schedule
The first time the crawler runs, it must fetch web pages and create
the Ultra Search index. The crawling process for the primary schedule
can be broken into two phases:
- Queuing and Caching Documents
- Indexing Documents
Queuing and Caching Documents
Figures 2a and 2b illustrate an instance of the crawling cycle in a
sequence of eight steps. The example uses a web data source, although
the crawler can crawl database tables, files, and mailing lists in
addition to a web source.
Figure 2a illustrates how the crawler and its crawling threads are
activated. It also shows how the crawler queues hypertext links to control
its navigation. This figure corresponds to Steps 1 to 5.
Figure 2b illustrates how the crawler caches web pages. This figure
corresponds to Steps 6 to 8.
The steps are the following:
1. Oracle spawns the crawler according to the primary schedule you
specify with the administration tool. When crawling is initiated for
the first time, the URL queue is populated with the seed URLs. Refer
to Figure 2a.
2. Crawler initiates multiple crawling threads.
3. Crawler thread removes the next URL in the queue.
4. Crawler thread fetches the document from the web. The document is
usually an HTML file containing text and hypertext links.
5. Crawler thread scans the HTML file for hypertext links and inserts
new links into the URL queue. Duplicate links already in the document
table are discarded.
6. Crawler caches HTML file in the local file system. Refer to Figure
2b.
7. Crawler registers URL in the document table.
8. Crawler thread starts over by repeating Step 3.
Fetching a document as shown in Step 4 can be time-consuming because
of network traffic or slow web sites. For maximum throughput, multiple
threads fetch pages at any given time.
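The queuing and caching cycle can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Ultra Search implementation: the in-memory FAKE_WEB dictionary, the fetch() helper, and the crawl() function are hypothetical names, and a dictionary stands in for the file system cache. The sketch also remembers URLs that are already queued (the "seen" set), which is an added simplification so that two threads do not queue the same link twice.

```python
import queue
import threading

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "http://example.com/": ("home", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("page a", ["http://example.com/b", "http://example.com/"]),
    "http://example.com/b": ("page b", ["http://example.com/a"]),
}


def fetch(url):
    """Stand-in for a real HTTP fetch (Step 4); returns (text, links)."""
    return FAKE_WEB.get(url, ("", []))


def crawl(seed_urls, num_threads=4):
    url_queue = queue.Queue()
    seen = set()           # URLs already queued (assumption, see lead-in)
    document_table = {}    # URL -> document id
    cache = {}             # stands in for the local file system cache
    lock = threading.Lock()

    for url in seed_urls:  # Step 1: populate the queue with the seed URLs
        seen.add(url)
        url_queue.put(url)

    def worker():
        while True:
            url = url_queue.get()              # Step 3: take the next URL
            if url is None:                    # sentinel: no more work
                url_queue.task_done()
                return
            try:
                text, links = fetch(url)       # Step 4: fetch the document
                with lock:
                    for link in links:         # Step 5: queue new links only
                        if link not in seen:
                            seen.add(link)
                            url_queue.put(link)
                    cache[url] = text          # Step 6: cache the document
                    document_table[url] = len(document_table) + 1  # Step 7
            finally:
                url_queue.task_done()          # Step 8: loop back to Step 3

    # Step 2: start several crawling threads.
    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    url_queue.join()                           # wait until every URL is processed
    for _ in threads:
        url_queue.put(None)                    # tell each worker to exit
    for t in threads:
        t.join()
    return document_table, cache
```

Calling crawl(["http://example.com/"]) visits all three hypothetical pages exactly once, regardless of which thread picks up which URL.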
Figure 2a: Queuing URLs
Figure 2b: Caching URLs
Indexing Documents
When the file system cache is full (default maximum size is 20 megabytes),
document caching stops and indexing begins. In this phase, Ultra Search
augments the Oracle9i Text index using the cached files referred to
by the document table. Refer to Figure 3.
Figure 3: Indexing Documents
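The hand-off from caching to indexing can be pictured as a size-bounded buffer that is drained into the index once it fills up. The sketch below is a rough model only: the DocumentCache class and crawl_and_index() function are hypothetical, a plain dictionary stands in for the text index, and 20 megabytes is taken to mean 20 * 1024 * 1024 bytes. It also assumes that caching resumes after each indexing pass and that any leftover documents are indexed when the input runs out.

```python
MAX_CACHE_BYTES = 20 * 1024 * 1024  # default limit from the text (assumed binary MB)


class DocumentCache:
    """Hypothetical size-bounded cache of fetched documents."""

    def __init__(self, max_bytes=MAX_CACHE_BYTES):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self.pending = {}  # URL -> raw document bytes awaiting indexing

    def add(self, url, content):
        self.pending[url] = content
        self.used_bytes += len(content)

    def is_full(self):
        return self.used_bytes >= self.max_bytes

    def drain(self):
        """Return the cached documents and empty the cache."""
        docs, self.pending, self.used_bytes = self.pending, {}, 0
        return docs


def crawl_and_index(documents, cache, index):
    """Cache (url, content) pairs; index a batch whenever the cache fills.

    Returns the number of indexing passes that were run.
    """
    passes = 0
    for url, content in documents:
        cache.add(url, content)
        if cache.is_full():              # caching stops, indexing begins
            index.update(cache.drain())
            passes += 1
    if cache.pending:                    # assumption: index the remainder too
        index.update(cache.drain())
        passes += 1
    return passes
```

With a deliberately tiny limit, for example DocumentCache(max_bytes=10), a handful of short documents is enough to trigger several indexing passes, which makes the behavior easy to observe.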
Crawling Process for Data Synchronization
Since document content can change, the crawler must keep its index
up to date. To schedule data synchronization crawling, you must define
a synchronization schedule and attach data sources to it. To do so,
use the Schedules Page in the administration
tool.
Crawling for data synchronization differs from the initial crawl
on the primary schedule. Data synchronization updates only changed
documents, whereas the primary crawl indexes new documents.
To update changed documents, the crawler uses an internal checksum
to compare new web pages with cached web pages. Changed web pages are
cached and marked for re-indexing.
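The change check can be illustrated with a small Python sketch. The text does not say which checksum the crawler uses internally, so SHA-256 is an assumption here, and the checksum() and needs_reindex() helpers and the cached_checksums dictionary are hypothetical names.

```python
import hashlib


def checksum(content):
    """Checksum of a page's raw bytes (SHA-256 is an assumption)."""
    return hashlib.sha256(content).hexdigest()


def needs_reindex(url, new_content, cached_checksums):
    """Return True if the page changed since it was last cached.

    cached_checksums maps URL -> checksum of the cached copy and is
    updated in place when a change is detected.
    """
    new_sum = checksum(new_content)
    if cached_checksums.get(url) == new_sum:
        return False                    # unchanged: discard the new copy
    cached_checksums[url] = new_sum     # changed or new: remember the new checksum
    return True
```

A page that has never been seen has no stored checksum, so the first call for a URL reports a change; a second call with identical content does not.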
The steps involved in data synchronization are the following:
1. Oracle spawns the crawler according to the synchronization schedule
you specify with the administration tool. The URL queue is populated
with the data source URLs assigned to the schedule.
2. Crawler initiates multiple crawling threads.
3. Crawler thread removes the next URL in the queue.
4. Crawler thread fetches the document from the web. The page is usually
an HTML file containing text and hypertext links.
5. Crawler thread calculates a checksum for the newly retrieved page
and compares it with the checksum of the cached page. If the checksum
is the same, the page is discarded and the crawler returns to Step 3.
Otherwise, the crawler moves to the next step.
6. Crawler thread scans the document for hypertext links and inserts
new links into the URL queue. Duplicate links already in the document
table are discarded.
7. Crawler caches the document in the local file system. Refer to Figure
2b.
8. Crawler registers URL in the document table.
9. If the file system cache is full or if the URL queue is empty, web
page caching stops and indexing begins. Otherwise, the crawler thread
starts over by repeating Step 3.
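The synchronization steps can be condensed into a single-threaded Python sketch. It is a simplified model under stated assumptions rather than the actual crawler: synchronize() and the fetch callback are hypothetical, SHA-256 stands in for the unspecified internal checksum, a dictionary stands in for the file system cache, and the document table is modeled as a mapping from URL to the checksum of the cached copy.

```python
from collections import deque
import hashlib


def synchronize(source_urls, fetch, document_table, max_cache_bytes):
    """Run one caching pass and return the documents marked for re-indexing.

    fetch(url) is a hypothetical callback returning (content_bytes, links).
    document_table maps URL -> checksum of the cached copy; updated in place.
    """
    url_queue = deque(source_urls)       # Step 1: queue the data source URLs
    cache = {}                           # changed documents awaiting indexing
    cached_bytes = 0
    # Step 9: stop when the cache is full or the queue is empty.
    while url_queue and cached_bytes < max_cache_bytes:
        url = url_queue.popleft()        # Step 3: take the next URL
        content, links = fetch(url)      # Step 4: fetch the document
        digest = hashlib.sha256(content).hexdigest()
        if document_table.get(url) == digest:
            continue                     # Step 5: unchanged, back to Step 3
        for link in links:               # Step 6: queue links not yet known
            if link not in document_table and link not in url_queue:
                url_queue.append(link)
        cache[url] = content             # Step 7: cache the changed document
        cached_bytes += len(content)
        document_table[url] = digest     # Step 8: register it
    return cache
```

Only changed or newly discovered pages end up in the returned cache; unchanged pages are skipped before their links are scanned, which mirrors Step 5.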
Remote Crawler
To increase crawling performance, you can set up the Ultra Search crawler
to run on one or more machines separate from your database. These machines
are called remote crawlers. However, each machine must share cache,
log, and mail archive directories with the database machine.
To configure a remote crawler, you must first install the Ultra Search
middle-tier components module on a machine other than the database host.
During installation, the remote crawler is registered with the
Ultra Search system and a profile is created for it.
After installing the Ultra Search middle-tier components module, you
must log in to the Ultra Search Administration Tool and edit the remote
crawler profile. You can then assign a remote crawler to a crawling
schedule. To edit remote crawler profiles, use the Remote Crawler Tab
in the Crawler Page in the administration
tool.