Overview of the Oracle Secure Enterprise Search Crawler

The Oracle Secure Enterprise Search (Oracle SES) crawler is a Java process activated by a set schedule. When activated, the crawler spawns processor threads that fetch documents from sources. The crawler caches the documents, and when the cache reaches the maximum batch size of 250 MB, the crawler indexes the cached files. This index is used for searching.

The document cache, called Secure Cache, is stored in Oracle Database in a compressed SecureFile LOB. Oracle Database provides excellent security and compact storage.

In the Oracle SES Administration GUI, you can create schedules with one or more sources attached to them. Schedules define the frequency at which the Oracle SES index is kept up to date with existing information in the associated sources.

Crawler URL Queue

In the process of crawling, the crawler maintains a list of URLs of the discovered documents that are fetched and indexed in an internal URL queue. The queue is persistently stored, so that crawls can be resumed after the Oracle SES instance is restarted.

Understanding Access URLs and Display URLs

A display URL is a URL string used for search result display. This is the URL used when users click the search result link. An access URL is an optional URL string used by the crawler for crawling and indexing. If it does not exist, then the crawler uses the display URL for crawling and indexing. If it does exist, then it is used by the crawler instead of the display URL for crawling. For regular Web crawling, only display URLs are available. But in some situations, the crawler needs an access URL for crawling the internal site while keeping a display URL for the external use. For every internal URL, there is an external mirrored URL.

For example, for file sources with display URLs, end users can access the original document with the H TTP or HTTPS protocols. These provide the appropriate authentication and personalization and result in better user experience.

Display URLs can be provided using the URL Rewriter API. Or, they can be generated by specifying the mapping between the prefix of the original file URL and the prefix of the display URL. Oracle SES replaces the prefix of the file URL with the prefix of the display URL.

For example, if the file URL is

file://localhost/home/operation/doc/file.doc

and the display URL is

https://webhost/client/doc/file.doc

then specify the file URL prefix as

file://localhost/home/operation

and the display URL prefix as

https://webhost/client

Modifying the Crawler Parameters

You can alter the crawler's operating parameters at two levels:

At the global level for all sources
At the source level for a particular defined source

Global parameters include the default values for language, crawling depth, and other crawling parameters, and the settings that control the crawler log and cache.

To configure the crawler:

Click the Global Settings tab.
Under Sources, click Crawler Configuration.
Make the desired changes on the Crawler Configuration page. Click Help for more information about the configuration settings.
Click Apply.

To configure the crawling parameters for a specific source:

From the Home page, click the Sources secondary tab to see a list of sources you have created.
Click the edit icon for the source whose crawler you want to configure, to display the Edit Source page.
Click the Crawling Parameters subtab.
Make the desired changes. Click Help for more information about the crawling parameters.
Click Apply.

Note that the parameter values for a particular source can override the default values set at the global level. For example, for Web sources, Oracle SES sets a default crawling depth of 2, irrespective of the crawling depth you set at the global level.

Also note that some parameters are specific to a particular source type. For example, Web sources include parameters for HTTP cookies.