Oracle® Secure Enterprise Search Administrator's Guide
11g Release 2 (11.2.1)

Part Number E17332-04

Understanding the Crawling Process

The first time the crawler runs, it fetches data (Web pages, table rows, files, and so on) from the source, then adds each document to the Oracle SES index.

The Initial Crawl

This section describes the crawling process for a Web source schedule. The process is divided into two phases: queuing and caching documents, then indexing them.

Queuing and Caching Documents

The crawling cycle involves the following steps:

  1. Oracle spawns the crawler according to the schedule you specify with the Oracle SES Administration GUI. When crawling is initiated for the first time, the URL queue is populated with the seed URLs.

  2. The crawler initiates multiple crawling threads.

  3. The crawler thread removes the next URL from the queue.

  4. The crawler thread fetches the document from the Web. The document is usually an HTML file containing text and hypertext links. When the document is not in HTML format, the crawler converts the document into HTML before caching.

  5. The crawler thread scans the HTML file for hypertext links and inserts new links into the URL queue. Links that are already in the document table are discarded as duplicates.

  6. The crawler caches the HTML file.

  7. The crawler registers the URL in the URL table.

  8. The crawler thread starts over by repeating Step 3.

Fetching a document, as described in Step 4, can be time-consuming because of network traffic or slow Web sites. For maximum throughput, multiple threads fetch pages at any given time.
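The following Java sketch shows the general shape of this loop. It is an illustration only, not the Oracle SES implementation: the queue, document table, and cache types, and helper methods such as fetchAsHtml and extractLinks, are hypothetical stand-ins.

    import java.util.List;
    import java.util.Set;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ConcurrentMap;

    // Hypothetical sketch of one crawling thread (Steps 3 through 8).
    final class CrawlerThread implements Runnable {
        private final BlockingQueue<String> urlQueue;      // seeded with the seed URLs
        private final Set<String> documentTable;           // URLs already registered
        private final ConcurrentMap<String, String> cache; // URL -> cached HTML

        CrawlerThread(BlockingQueue<String> urlQueue, Set<String> documentTable,
                      ConcurrentMap<String, String> cache) {
            this.urlQueue = urlQueue;
            this.documentTable = documentTable;
            this.cache = cache;
        }

        @Override public void run() {
            String url;
            while ((url = urlQueue.poll()) != null) {      // Step 3: take the next URL
                String html = fetchAsHtml(url);            // Step 4: fetch, convert to HTML
                for (String link : extractLinks(html)) {   // Step 5: scan for links
                    if (!documentTable.contains(link)) {   // discard duplicates
                        urlQueue.offer(link);
                    }
                }
                cache.put(url, html);                      // Step 6: cache the HTML
                documentTable.add(url);                    // Step 7: register the URL
            }                                              // Step 8: back to Step 3
        }

        // Placeholders standing in for the real fetch, conversion, and parsing machinery.
        private String fetchAsHtml(String url) { return "<html></html>"; }
        private List<String> extractLinks(String html) { return List.of(); }
    }

Running several CrawlerThread instances against the same shared queue, for example through java.util.concurrent.ExecutorService, models the multiple-thread fetching described above.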

Indexing Documents

When the cache is full (default maximum size is 250 MB), the indexing process begins. At this point, the document content and any searchable attributes are pushed into the index.

When the Preserve Document Cache parameter is set to false, the crawler automatically deletes the cache after indexing the documents.
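A minimal sketch of this trigger logic, assuming hypothetical DocumentCache and Indexer abstractions (these are not Oracle SES APIs):

    import java.util.Collection;

    interface Indexer { void index(Collection<String> documents); }

    interface DocumentCache {
        long sizeInBytes();
        Collection<String> documents();
        void clear();
    }

    final class IndexingTrigger {
        static final long MAX_CACHE_BYTES = 250L * 1024 * 1024; // default maximum: 250 MB

        // When the cache fills, push document content and searchable attributes
        // into the index, then honor the Preserve Document Cache setting.
        void maybeIndex(DocumentCache cache, Indexer indexer, boolean preserveDocumentCache) {
            if (cache.sizeInBytes() >= MAX_CACHE_BYTES) {
                indexer.index(cache.documents());
                if (!preserveDocumentCache) {
                    cache.clear(); // cache deleted automatically after indexing
                }
            }
        }
    }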

Oracle SES Stoplist

Oracle SES maintains a stoplist. A stoplist is a list of words that are ignored during the indexing process. These words are known as stopwords. Stopwords are not indexed because they are deemed not useful, or even disruptive, to the performance and accuracy of indexing. The Oracle SES stoplist contains only English words, and cannot be modified.

When you run a phrase search with a stopword in the middle, the stopword is not used as a match word, but as a placeholder that matches exactly one arbitrary word in that position. For example, the word "on" is a stopword. If you search for the phrase "oracle on demand", then Oracle SES matches a document titled "oracle on demand" but not a document titled "oracle demand". If you search for the phrase "oracle on on demand", then Oracle SES matches a document titled "oracle technology on demand" but not a document titled "oracle demand" or "oracle on demand".
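The Java sketch below demonstrates this matching rule with the examples above. It is a simplified model, not the Oracle SES query engine, and the stopword set shown is only a sample.

    import java.util.List;
    import java.util.Set;

    final class PhraseMatcher {
        // A small sample; the real stoplist contains many English words.
        private static final Set<String> STOPWORDS = Set.of("on", "the", "a");

        // A stopword in the query matches any single word at that position.
        static boolean phraseMatches(List<String> query, List<String> window) {
            if (query.size() != window.size()) return false;
            for (int i = 0; i < query.size(); i++) {
                if (!STOPWORDS.contains(query.get(i))
                        && !query.get(i).equals(window.get(i))) {
                    return false;
                }
            }
            return true;
        }

        public static void main(String[] args) {
            System.out.println(phraseMatches(
                List.of("oracle", "on", "demand"),
                List.of("oracle", "on", "demand")));               // true
            System.out.println(phraseMatches(
                List.of("oracle", "on", "on", "demand"),
                List.of("oracle", "technology", "on", "demand"))); // true
            System.out.println(phraseMatches(
                List.of("oracle", "on", "demand"),
                List.of("oracle", "demand")));                     // false: no word to fill
        }
    }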

Maintenance Crawls

After the initial crawl, a URL is recrawled and reindexed only if it has changed since the last crawl. The crawler determines whether a page has changed by using the HTTP If-Modified-Since header field or the checksum of the page. URLs that no longer exist are marked and removed from the index.

To update changed documents, the crawler uses an internal checksum to compare new Web pages with cached Web pages. Changed Web pages are cached and marked for re-indexing.
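Both checks can be sketched in Java as follows. This is an illustration of the two mechanisms, not the exact SES implementation; the MD5 checksum and the helper names are assumptions.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.security.MessageDigest;
    import java.util.Arrays;

    final class ChangeDetector {
        private final HttpClient client = HttpClient.newHttpClient();

        // Returns null if the page is unchanged; otherwise the new page body.
        byte[] fetchIfChanged(String url, String lastCrawlHttpDate, byte[] cachedChecksum)
                throws Exception {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    .header("If-Modified-Since", lastCrawlHttpDate) // check 1: HTTP header
                    .build();
            HttpResponse<byte[]> response =
                    client.send(request, HttpResponse.BodyHandlers.ofByteArray());
            if (response.statusCode() == 304) {
                return null;                    // server reports: not modified
            }
            byte[] body = response.body();
            byte[] checksum = MessageDigest.getInstance("MD5").digest(body);
            return Arrays.equals(checksum, cachedChecksum)
                    ? null                      // check 2: checksum unchanged
                    : body;                     // changed: cache and mark for re-indexing
        }
    }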

Data synchronization involves the following steps:

  1. Oracle spawns the crawler according to the schedule specified in the Oracle SES Administration GUI. The URL queue is populated with the seed URLs of the source assigned to the schedule.

  2. The crawler initiates multiple crawling threads.

  3. Each crawler thread removes the next URL from the queue.

  4. Each crawler thread fetches a document from the Web. The page is usually an HTML file containing text and hypertext links. When the document is not in HTML format, the crawler converts the document into HTML before caching.

  5. Each crawler thread calculates a checksum for the newly retrieved page and compares it with the checksum of the cached page. If the checksum is the same, then the page is discarded and the crawler goes to Step 3. Otherwise, the crawler continues to the next step.

  6. The crawler thread scans the document for hypertext links and inserts new links into the URL queue. Links that are in the document table are discarded. Oracle SES does not follow links from filtered binary documents.

  7. The crawler marks the URL as accepted. The URL is crawled in future maintenance crawls.

  8. The crawler registers the URL in the document table.

  9. If the cache is full or if the URL queue is empty, then caching stops. Otherwise, the crawler thread starts over at Step 3.
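The sketch below condenses Steps 3 through 9 of this loop, highlighting what differs from the initial crawl: the checksum comparison and the stop condition. All helper methods are placeholders, not SES internals.

    import java.util.Queue;
    import java.util.zip.CRC32;

    final class MaintenanceCrawl {
        private static final long MAX_CACHE_BYTES = 250L * 1024 * 1024;
        private long cachedBytes = 0;

        void synchronize(Queue<String> urlQueue) {
            String url;
            while (cachedBytes < MAX_CACHE_BYTES             // Step 9: stop when cache is full
                    && (url = urlQueue.poll()) != null) {    // ...or the queue is empty
                byte[] page = fetch(url);                    // Step 4
                if (checksum(page) == cachedChecksum(url)) { // Step 5: unchanged page
                    continue;                                // discard; back to Step 3
                }
                // Steps 6 through 8: enqueue new links, mark the URL accepted,
                // register it, and cache the changed page for re-indexing.
                cachedBytes += page.length;
            }
        }

        private byte[] fetch(String url) { return new byte[0]; } // placeholder fetch
        private long cachedChecksum(String url) { return 0L; }   // placeholder lookup

        private long checksum(byte[] page) {
            CRC32 crc = new CRC32();
            crc.update(page);
            return crc.getValue();
        }
    }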

A maintenance crawl or a forced recrawl does not move the cache from the file system to the database, or the reverse. The cache location for a source remains the same until the cache is migrated to a different location.

Automatic Forced Recrawls

When you configure a data source, certain operations trigger an automatic forced recrawl of the data source. These operations include the following:

  • Deleting a document attribute from the data source

  • Remapping a document attribute to a different search attribute

  • Changing the crawler configuration "Index Dynamic Page" from No to Yes for a Web source

These operations set the Force Recrawl flag, but no notification is given that the crawl mode has changed.
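As a rough model of this behavior, the hypothetical configuration class below sets a force-recrawl flag whenever one of the listed operations occurs; the class and method names are invented for illustration and do not correspond to SES internals.

    // Hypothetical model: each listed operation silently sets the flag.
    final class DataSourceConfig {
        private boolean forceRecrawl = false;   // the Force Recrawl flag

        void deleteDocumentAttribute(String attribute) { forceRecrawl = true; }

        void remapDocumentAttribute(String attribute, String searchAttribute) {
            forceRecrawl = true;
        }

        void setIndexDynamicPages(boolean oldValue, boolean newValue) {
            if (!oldValue && newValue) {        // No to Yes for a Web source
                forceRecrawl = true;
            }
        }

        boolean isForceRecrawlPending() { return forceRecrawl; }
    }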