7 Understanding the Oracle Ultra Search Crawler and Data Sources

This chapter contains the following topics:

Overview of the Oracle Ultra Search Crawler
Crawler Settings
Crawler Data Sources
Document Attributes
Crawling Process for the Schedule
Data Synchronization
Web Crawling Boundary Control
Oracle Ultra Search Remote Crawler
Oracle Ultra Search Crawler Status Codes

7.1 Overview of the Oracle Ultra Search Crawler

The Oracle Ultra Search crawler is a Java process activated by your Oracle server according to a set schedule. When activated, the crawler spawns processor threads that fetch documents from various data sources. These documents are cached in the local file system. When the cache is full, the crawler indexes the cached files using Oracle Text. This index is used for querying.

Note:

An empty index is created when an Oracle Ultra Search instance is created. You can alter the index using SQL. The existing preferences, such as language-specific parameters, are defined in the $ORACLE_HOME/ultrasearch/admin/wk0pref.sql file.

7.2 Crawler Settings

Before you can use the crawler, you must set its operating parameters, such as the number of crawler threads, the crawler timeout threshold, the database connect string, and the default character set. To do so, use the Crawler Settings Page in the administration tool.

See Also:

"Crawler Page"

7.3 Crawler Data Sources

In addition to the Web access parameters, you can define specific data sources on the Sources page in the administration tool. You can define one or more of the following data sources:

Web sites
Database tables
Files
Mailing lists
Oracle Application Server Portal page groups
User-defined data sources (requires crawler agent)

7.3.1 Using Crawler Agents

If you are defining a user-defined data source to crawl and index a proprietary document repository or management system, such as Lotus Notes or Documentum, then you must implement a crawler agent as a Java class. The agent collects document URLs and associated metadata from the proprietary document source and returns the information to the Oracle Ultra Search crawler, which enqueues it for later crawling. For more information on defining a new data source type, see the User-Defined sub-tab on the Sources page in the administration tool.

7.3.2 Synchronizing Data Sources

You can create synchronization schedules, each with one or more data sources attached. A synchronization schedule defines how often the Oracle Ultra Search index is brought up to date with the information in the associated data sources. To define a synchronization schedule, use the Sources page in the administration tool.

7.3.3 Display URL and Access URL

For some applications, for security reasons, the URL crawled is different from the one seen by the end user. For example, crawling on an internal Web site inside a firewall might be done without security checking, but when queried by the end user, a corresponding mirror URL outside the firewall must be used. This mirror URL is called the display URL.

By default, the display URL is treated as the access URL unless a separate access URL is provided. The display URL must be unique in a data source; so two different access URLs cannot have the same display URL.
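The relationship between display URLs and access URLs can be sketched as a small lookup structure. The following is a minimal illustration only, assuming a hypothetical DisplayUrlRegistry class that is not part of the Oracle Ultra Search API. It applies the default described above and rejects a second access URL for a display URL that is already registered.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the Oracle Ultra Search API.
final class DisplayUrlRegistry {
    // display URL -> access URL, for a single data source
    private final Map<String, String> accessByDisplay = new HashMap<>();

    /**
     * Registers a document. Pass null for accessUrl when no separate access URL
     * exists; the display URL is then used for access as well.
     * Returns false if the display URL is already bound to a different access URL.
     */
    boolean register(String displayUrl, String accessUrl) {
        String access = (accessUrl == null) ? displayUrl : accessUrl;
        String existing = accessByDisplay.putIfAbsent(displayUrl, access);
        return existing == null || existing.equals(access);
    }

    /** Returns the URL the crawler would fetch for the given display URL, or null. */
    String accessUrlFor(String displayUrl) {
        return accessByDisplay.get(displayUrl);
    }
}
```

The host names used with this class, such as an intranet host for access and a public host for display, are up to the caller; none are taken from the product.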

See Also:

"Sources Page"

7.4 Document Attributes

Document attributes, or metadata, describe the properties of a document. Each data source has its own set of document attributes. Attribute values are retrieved during the crawling process, mapped to search attributes, and then stored and indexed in the database. This lets you query documents based on their attributes. Document attributes in different data sources can be mapped to the same search attribute. Therefore, you can query documents from multiple data sources based on the same search attribute.

If the document is a Web page, the attribute can come from the HTTP header or it can be embedded inside the HTML in metatags. Document attributes can be used for many things, including document management, access control, or version control. Different data sources can have attributes with different names that are used for the same purpose; for example, "version" and "revision". They can also use the same attribute name for different purposes; for example, "language" can mean natural language in one data source but programming language in another.

Search attributes are created in three ways:

System-defined search attributes, such as title, author, description, subject, and mimetype
Search attributes created by the system administrator
Search attributes created by the crawler. (During crawling, the crawler agent maps the document attribute to a search attribute with the same name and data type. If not found, then the crawler creates a new search attribute with the same name and type as the document attribute defined in the crawler agent.)
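The third case, in which the crawler matches a document attribute by name and data type and creates a search attribute when no match exists, can be sketched as follows. This is an illustrative sketch only: the class, the DataType enum, the case-insensitive name matching, and the assumption that the system-defined attributes are strings are all hypothetical and are not taken from the product.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical names throughout; this is not the Oracle Ultra Search API.
final class SearchAttributeRegistry {
    enum DataType { STRING, NUMBER, DATE }

    // Each search attribute is identified by its name plus its data type.
    private final Set<String> attributes = new LinkedHashSet<>();

    SearchAttributeRegistry() {
        // System-defined attributes named in the text (string type is assumed).
        for (String name : new String[] {"title", "author", "description", "subject", "mimetype"}) {
            attributes.add(key(name, DataType.STRING));
        }
    }

    private static String key(String name, DataType type) {
        // Case-insensitive name matching is an assumption of this sketch.
        return name.toLowerCase(Locale.ROOT) + "/" + type;
    }

    /**
     * Maps a document attribute to a search attribute with the same name and type.
     * Returns true if no match existed and a new search attribute was created.
     */
    boolean mapDocumentAttribute(String name, DataType type) {
        return attributes.add(key(name, type));
    }

    int size() {
        return attributes.size();
    }
}
```

Because name and type together identify a search attribute in this sketch, a document attribute named "revision" of type NUMBER and another named "revision" of type STRING produce two separate search attributes.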

The list of values (LOV) for a search attribute can help you specify a search query. If an attribute LOV is available, then the crawler registers the LOV definition, which includes each attribute value, its display name, and the translation of the display name.

7.5 Crawling Process for the Schedule

The first time the crawler runs, it must fetch Web pages, table rows, files, and so on, depending on the data source. It then adds the documents to the Oracle Ultra Search index. The crawling process for the schedule is broken into two phases:

Queuing and Caching Documents
Indexing Documents

7.5.1 Queuing and Caching Documents

Figure 7-1 and Figure 7-2 illustrate an instance of the crawling cycle in a sequence of eight steps. The example uses a Web data source, although the crawler can also crawl other data source types.

Figure 7-1 illustrates how the crawler and its crawling threads are activated. It also shows how the crawler queues hypertext links to control its navigation. This figure corresponds to Steps 1 to 5.

Figure 7-2 illustrates how the crawler caches Web pages. This figure corresponds to Steps 6 to 8.

The steps are the following:

1. Oracle spawns the crawler according to the schedule you specify with the administration tool. When crawling is initiated for the first time, the URL queue is populated with the seed URLs. See Figure 7-1.
2. The crawler initiates multiple crawling threads.
3. A crawler thread removes the next URL from the queue.
4. The crawler thread fetches the document from the Web. The document is usually an HTML file containing text and hypertext links.
5. The crawler thread scans the HTML file for hypertext links and inserts new links into the URL queue. Duplicate links already in the document table are discarded.
6. The crawler caches the HTML file in the local file system. See Figure 7-2.
7. The crawler registers the URL in the document table.
8. The crawler thread starts over by repeating Step 3.

Fetching a document, as shown in Step 4, can be time-consuming because of network traffic or slow Web sites. For maximum throughput, multiple threads fetch pages at any given time.
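The queuing steps above can be sketched as a simple loop. This is a single-threaded illustration under stated assumptions, not the crawler's actual implementation: the class and method names are hypothetical, the Web is replaced by an in-memory map from each URL to the links on that page, and one set tracks both queued and registered URLs for duplicate detection.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Single-threaded sketch of Steps 3 to 8; all names are hypothetical.
final class CrawlLoopSketch {
    /**
     * @param seeds seed URLs that populate the queue (Step 1)
     * @param web   stand-in for the Web: URL -> links found in that page
     * @return URLs registered in the "document table", in crawl order
     */
    static List<String> crawl(List<String> seeds, Map<String, List<String>> web) {
        Deque<String> urlQueue = new ArrayDeque<>();
        Set<String> known = new HashSet<>();          // URLs already queued or registered
        for (String seed : seeds) {
            if (known.add(seed)) urlQueue.addLast(seed);
        }
        List<String> documentTable = new ArrayList<>();

        while (!urlQueue.isEmpty()) {
            String url = urlQueue.removeFirst();      // Step 3: take the next URL
            List<String> links = web.get(url);        // Step 4: fetch the document
            if (links == null) continue;              // fetch failed; nothing to cache
            for (String link : links) {               // Step 5: enqueue new links only
                if (known.add(link)) urlQueue.addLast(link);
            }
            // Step 6 would write the page to the file system cache here.
            documentTable.add(url);                   // Step 7: register the URL
        }                                             // Step 8: repeat from Step 3
        return documentTable;
    }
}
```

In a multithreaded version, several threads would run the loop body concurrently against a shared, thread-safe queue, so that slow fetches in Step 4 do not stall the other threads.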

Note:

URLs remain visible until the next crawling run. When the crawler detects that a URL no longer exists, it removes the URL from the wk$doc table, and Oracle Text automatically marks the document as deleted, even though the index data still exists. Cleanup is done through index optimization, which can be scheduled separately.

Figure 7-1 Queuing URLs

Figure 7-2 Caching URLs

7.5.2 Indexing Documents

When the file system cache is full (default maximum size is 20 megabytes), document caching stops and indexing begins. In this phase, Oracle Ultra Search augments the Oracle Text index using the cached files referred to by the document table. See Figure 7-3.

Figure 7-3 Indexing Documents

7.6 Data Synchronization

After the initial crawl, a page is crawled and indexed again only if it has changed since the last crawl. The crawler determines whether a page has changed by using the HTTP If-Modified-Since header field or the checksum of the page. URLs that no longer exist are marked and removed from the index.

To update changed documents, the crawler uses an internal checksum to compare new Web pages with cached Web pages. Changed Web pages are cached and marked for reindexing.

The steps involved in data synchronization are the following:

1. Oracle spawns the crawler according to the synchronization schedule you specify with the administration tool. The URL queue is populated with the data source URLs assigned to the schedule.
2. The crawler initiates multiple crawling threads.
3. Each crawler thread removes the next URL from the queue.
4. Each crawler thread fetches the document from the Web. The page is usually an HTML file containing text and hypertext links.
5. Each crawler thread calculates a checksum for the newly retrieved page and compares it with the checksum of the cached page. If the checksums are the same, then the page is discarded and the crawler thread returns to Step 3. Otherwise, the crawler thread moves to the next step.
6. Each crawler thread scans the document for hypertext links and inserts new links into the URL queue. Links that are already in the document table are discarded.
7. The crawler caches the document in the local file system. See Figure 7-2.
8. The crawler registers the URL in the document table.
9. If the file system cache is full or if the URL queue is empty, then Web page caching stops and indexing begins. Otherwise, the crawler thread starts over at Step 3.
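The checksum comparison in the steps above can be sketched as follows. The text does not name the checksum algorithm that the crawler uses, so this sketch assumes CRC-32 from the standard Java library purely for illustration. The ChangeDetector class and its in-memory map of stored checksums are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.CRC32;

// Hypothetical sketch; CRC-32 is an assumed stand-in for the crawler's internal checksum.
final class ChangeDetector {
    // URL -> checksum recorded during the previous crawl
    private final Map<String, Long> checksums = new HashMap<>();

    static long checksum(String content) {
        CRC32 crc = new CRC32();
        crc.update(content.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    /**
     * Records the checksum of the newly fetched page and returns true if the page
     * is new or differs from the previous crawl, meaning it must be cached again
     * and marked for reindexing.
     */
    boolean hasChanged(String url, String content) {
        long current = checksum(content);
        Long previous = checksums.put(url, current);
        return previous == null || previous != current;
    }
}
```

A page whose checksum is unchanged returns false, which corresponds to discarding the page and taking the next URL from the queue.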

7.7 Web Crawling Boundary Control

Oracle Ultra Search provides the following mechanisms to control the scope of crawling for a Web data source:

URL boundary rule (domain rule and path rule)
Robots.txt file and robots META tag
Crawling depth
URL Rewriter API

7.7.1 URL Boundary Rule

The URL boundary rule consists of domain rules and path rules. A domain rule specifies the set of Web sites allowed using a host name prefix or suffix. A path rule specifies the URL file path allowed or disallowed for a particular host. You can specify an inclusion or exclusion rule for both a domain rule and a path rule. Exclusion rules always override inclusion rules. Path rules are always host-specific.

For example, an inclusion domain ending with oracle.com limits the Oracle Ultra Search crawler to hosts belonging to Oracle worldwide. Any host name ending with oracle.com is crawled, but http://www.oracle.com.tw is not crawled. If you change the inclusion domain to someurl.com with a new seed http://www.someurl.com, then all oracle.com URLs are dropped by the crawler.

An exclusion domain uk.oracle.com prevents the crawler from crawling Oracle hosts in the United Kingdom. You can also include or exclude Web sites with a specific port. (By default, all ports are crawled.) You can have port inclusion or port exclusion rules for a specific host, but not both.

All URLs must pass domain rules before being checked for path rules. Path rules let you further restrict the crawling space. Path rules are host-specific, but you can specify more than one path rule for each host. For example, on the same host, you can include the path /host/doc and exclude the path /host/doc/private. Note that path rules are prefix-based.

Regular expression-based domain and path rules are not supported in the current release.

The following rules restrict the crawler to www.oracle.com and otn.oracle.com. Furthermore, on otn.oracle.com, only URLs under /products/database/ and /products/ias/, but not under /products/ias/web_cache/, are crawled.

Domain inclusion: www.oracle.com
Domain inclusion: otn.oracle.com
Path inclusion for otn.oracle.com:         /products/database/
                                           /products/ias/
Path exclusion for otn.oracle.com:         /products/ias/web_cache/
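The way these rules combine can be sketched in code. The following is a minimal illustration, not the crawler's implementation: the BoundaryRules class and its methods are hypothetical, domain rules are matched by host name suffix only, and a host with no path inclusion rule is assumed to allow every path that is not excluded.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of domain and path rule evaluation.
final class BoundaryRules {
    private final List<String> includeDomains = new ArrayList<>();
    private final List<String> excludeDomains = new ArrayList<>();
    private final Map<String, List<String>> includePaths = new HashMap<>();
    private final Map<String, List<String>> excludePaths = new HashMap<>();

    BoundaryRules includeDomain(String suffix) { includeDomains.add(suffix); return this; }
    BoundaryRules excludeDomain(String suffix) { excludeDomains.add(suffix); return this; }

    BoundaryRules includePath(String host, String prefix) {
        includePaths.computeIfAbsent(host, h -> new ArrayList<>()).add(prefix);
        return this;
    }

    BoundaryRules excludePath(String host, String prefix) {
        excludePaths.computeIfAbsent(host, h -> new ArrayList<>()).add(prefix);
        return this;
    }

    boolean allows(String url) {
        URI uri = URI.create(url);
        if (uri.getHost() == null) return false;
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        String rawPath = uri.getPath();
        String path = (rawPath == null || rawPath.isEmpty()) ? "/" : rawPath;

        // Domain rules are checked first; exclusion overrides inclusion.
        if (excludeDomains.stream().anyMatch(host::endsWith)) return false;
        if (includeDomains.stream().noneMatch(host::endsWith)) return false;

        // Path rules are host-specific and prefix-based.
        List<String> excluded = excludePaths.getOrDefault(host, Collections.emptyList());
        if (excluded.stream().anyMatch(path::startsWith)) return false;
        List<String> included = includePaths.getOrDefault(host, Collections.emptyList());
        return included.isEmpty() || included.stream().anyMatch(path::startsWith);
    }
}
```

Configured with the rules listed above, this sketch accepts http://otn.oracle.com/products/ias/index.html and rejects http://otn.oracle.com/products/ias/web_cache/index.html, because the exclusion rule is checked before the inclusion rule.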

7.7.2 robots.txt Protocol and robots META Tag

The robots.txt protocol is the webmaster's path rule for any spider or crawler that visits his or her Web site. (It is described in the document "A Standard for Robot Exclusion" at http://www.robotstxt.org/wc/norobots.html.) The following example /robots.txt file specifies that no robots should visit any URL starting with /cyberworld/map/, /tmp/, or /foo.html:

# robots.txt for http://www.acme.com/
 
User-agent: *
Disallow: /cyberworld/map/ 
Disallow: /tmp/ 
Disallow: /foo.html

By default, the Oracle Ultra Search crawler observes the robots.txt protocol, but it also allows the user to override it. If the Web site is under the user's control, a specific robots rule can be tailored for the crawler by specifying the Ultra Search crawler agent name in a "User-agent: Oracle Ultra Search" line. For example:

User-agent: Oracle Ultra Search
 
Disallow: /tmp/

The robots META tag instructs the crawler whether to index a Web page or follow the links within it. It is described in "HTML Author's Guide to the Robots META tag" (http://www.robotstxt.org/wc/meta-user.html).
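How a crawler might apply the robots.txt examples above can be sketched as follows. This is a simplified illustration, not the Oracle Ultra Search parser: the RobotsRules class is hypothetical, only User-agent and Disallow lines are handled, agent names are compared by case-insensitive equality, and a group that names the agent takes precedence over the "*" group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified robots.txt handling for illustration.
final class RobotsRules {
    /** Collects the Disallow prefixes that apply to the given agent name. */
    static List<String> disallowedFor(String robotsTxt, String agent) {
        List<String> specific = new ArrayList<>();
        List<String> wildcard = new ArrayList<>();
        boolean sawSpecific = false;
        boolean inSpecific = false;
        boolean inWildcard = false;
        boolean lastWasAgent = false;

        for (String raw : robotsTxt.split("\\R")) {
            int hash = raw.indexOf('#');                       // strip comments
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) continue;                           // blank or malformed line
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();

            if (field.equals("user-agent")) {
                if (!lastWasAgent) {                           // a new group starts here
                    inSpecific = false;
                    inWildcard = false;
                }
                if (value.equals("*")) {
                    inWildcard = true;
                } else if (value.equalsIgnoreCase(agent)) {
                    inSpecific = true;
                    sawSpecific = true;
                }
                lastWasAgent = true;
            } else {
                if (field.equals("disallow") && !value.isEmpty()) {
                    if (inSpecific) specific.add(value);
                    if (inWildcard) wildcard.add(value);
                }
                lastWasAgent = false;
            }
        }
        return sawSpecific ? specific : wildcard;
    }

    /** Disallow rules are path prefixes; a path is allowed if no prefix matches. */
    static boolean isAllowed(List<String> disallowed, String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }
}
```

With both example groups in one file, a crawler identifying itself as Oracle Ultra Search would be subject only to the /tmp/ rule in this sketch, while any other robot would be subject to the three rules in the "*" group.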

7.7.3 Crawling Depth

Crawling depth controls how deep the crawler follows a link starting from the given seed URL. Since crawling is multi-threaded, this is not a deterministic control, as there may be different routes to a particular page.

The crawling depth limit applies to all Web sites in a given Web data source.
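A depth limit can be sketched as a breadth-first traversal that records the depth at which each URL is first seen. This is an illustration under stated assumptions: the class is hypothetical, the seed URL is assigned depth 0, and the Web is replaced by an in-memory map of links.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the seed is treated as depth 0.
final class DepthLimitedCrawl {
    /** Returns each URL reached within maxDepth, with the depth at which it was first seen. */
    static Map<String, Integer> crawl(String seed, Map<String, List<String>> links, int maxDepth) {
        Map<String, Integer> depthOf = new LinkedHashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        depthOf.put(seed, 0);
        queue.add(seed);

        while (!queue.isEmpty()) {
            String url = queue.remove();
            int depth = depthOf.get(url);
            if (depth >= maxDepth) continue;          // do not follow links any deeper
            for (String next : links.getOrDefault(url, Collections.emptyList())) {
                // Only the first route to a page determines its recorded depth.
                if (depthOf.putIfAbsent(next, depth + 1) == null) queue.add(next);
            }
        }
        return depthOf;
    }
}
```

This single-threaded version always records the shortest route. With several threads, a page could first be reached through a longer route and be recorded at a greater depth, which is why the depth limit is not a deterministic control.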

7.7.4 URL Rewriter

You implement the URL rewriter API as a Java class to perform link filtering or rewriting. Links extracted from a crawled Web page are passed to this module for checking. This gives you full control over which extracted links are allowed and which are discarded. See "Oracle Ultra Search URL Rewriter API" for details.
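The general idea of link filtering and rewriting can be sketched as follows. The LinkRewriter interface below is hypothetical and is defined only for this illustration; it is not the Oracle Ultra Search URL Rewriter API, whose actual interface and method signatures are described in the referenced section. The filtering rules (discarding logout links and stripping a jsessionid path parameter) are likewise example policies, not product behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface for illustration only; see the product documentation for the real API.
interface LinkRewriter {
    /** Returns the link to enqueue, possibly rewritten, or null to discard it. */
    String rewrite(String link);
}

// Example policy: drop logout links and remove session identifiers from URLs.
final class SessionIdStripper implements LinkRewriter {
    @Override
    public String rewrite(String link) {
        if (link.contains("/logout")) return null;    // discard the link
        // Remove a ";jsessionid=..." path parameter so one page is not queued under many URLs.
        return link.replaceAll(";jsessionid=[^?#]*", "");
    }

    /** Applies a rewriter to every extracted link and keeps the links that survive. */
    static List<String> apply(LinkRewriter rewriter, List<String> extracted) {
        List<String> kept = new ArrayList<>();
        for (String link : extracted) {
            String out = rewriter.rewrite(link);
            if (out != null) kept.add(out);
        }
        return kept;
    }
}
```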

7.7.5 URL Redirection and Boundary Rule Enforcement

With regard to HTTP redirection, earlier Oracle Ultra Search releases (9.0.2, 9.0.3, and 9.2.0.4) applied the same boundary checking to a redirected URL. Thus, a redirected URL would be rejected if it was outside the boundary rule. If the redirected URL was to be crawled, you had to make sure it was covered by the boundary rule.

In 9.2.0.5, iAS 10g, and Oracle Database 10g, the redirected URL is always allowed if it is a temporary redirection (HTTP status 302 or 307). For permanent redirection (status 301), the redirected URL is still subject to boundary rules.

HTTP meta tag redirection is always checked against boundary rules.
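The HTTP redirection behavior described for 9.2.0.5 and later can be summarized as a small decision function. The RedirectPolicy class is hypothetical, and rejecting status codes that the text does not mention is an assumption of this sketch.

```java
// Hypothetical sketch of the redirection rules for 9.2.0.5 and later.
final class RedirectPolicy {
    /**
     * @param status           HTTP status code of the redirect response
     * @param targetInBoundary whether the redirect target passes the boundary rules
     * @return true if the redirected URL should be crawled
     */
    static boolean shouldFollow(int status, boolean targetInBoundary) {
        switch (status) {
            case 302:
            case 307:
                return true;                // temporary redirection: always allowed
            case 301:
                return targetInBoundary;    // permanent redirection: boundary rules apply
            default:
                return false;               // not covered by the text; rejected here by assumption
        }
    }
}
```

Meta tag redirection would take the same path as status 301 in this sketch, because it is always checked against boundary rules.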

7.8 Oracle Ultra Search Remote Crawler

To increase crawling performance, set up the Oracle Ultra Search crawler to run on one or more computers separate from your database. These computers are called remote crawlers. However, each remote crawler must share log and mail archive directories with the database computer.

To configure a remote crawler, you must first install the Oracle Ultra Search middle tier on a computer other than the database host. During installation, the remote crawler is registered with the Oracle Ultra Search system, and a profile is created for the remote crawler. After installing the Oracle Ultra Search middle tier, you must log on to the Oracle Ultra Search administration tool and edit the remote crawler profile. You can then assign a remote crawler to a crawling schedule. To edit remote crawler profiles, use the Crawler Settings page in the administration tool.

See Also:

"Using the Remote Crawler"

7.9 Oracle Ultra Search Crawler Status Codes

The crawler uses a set of codes to indicate the crawling result of the crawled URL. Besides the standard HTTP status codes, it uses its own codes for non-HTTP related situations. Only URLs with status 200 will be indexed. Table 7-1 shows these URL status codes.

Table 7-1 Oracle Ultra Search URL Status Codes

Code	Explanation
200	URL OK
400	bad request
401	authorization required
402	payment required
403	access forbidden
404	not found
405	method not allowed
406	not acceptable
407	proxy authentication required
408	request timeout
409	conflict
410	gone
414	request-URI too large
500	internal server error
501	not implemented
502	bad gateway
503	service unavailable
504	gateway timeout
505	HTTP version not supported
902	timeout reading document
903	filtering failed
904	IOEXCEPTION in processing URL
906	connection refused
907	socket bind exception
908	filter not available
909	duplicate document detected
910	duplicate document ignored
911	empty document
951	URL not indexed
952	URL crawled
953	meta tag redirection
954	HTTP redirection
955	blacklist URL
956	URL is not unique
957	sentry URL (URL as placeholder)
958	document read error
959	form login failed
1001	data type is not TEXT/HTML
1002	broken network datastream
1003	HTTP redirect location does not exist
1004	bad relative URL
1005	HTTP error
1006	error parsing HTTP header
1007	invalid URL table column name
1008	JDBC driver missing
1009	binary document reported as text document
1010	invalid display URL
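An application that reads crawl results might interpret these codes as in the following sketch. The CrawlStatus class is hypothetical and holds only a subset of Table 7-1. Treating every code of 900 or above as crawler-specific is an inference from the table, not a documented rule.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper holding a subset of Table 7-1.
final class CrawlStatus {
    private static final Map<Integer, String> EXPLANATIONS = new HashMap<>();
    static {
        EXPLANATIONS.put(200, "URL OK");
        EXPLANATIONS.put(404, "not found");
        EXPLANATIONS.put(902, "timeout reading document");
        EXPLANATIONS.put(909, "duplicate document detected");
        EXPLANATIONS.put(954, "HTTP redirection");
        EXPLANATIONS.put(1001, "data type is not TEXT/HTML");
    }

    /** Only URLs with status 200 are indexed. */
    static boolean isIndexed(int code) {
        return code == 200;
    }

    /** Assumption: codes of 900 and above are crawler-specific rather than HTTP codes. */
    static boolean isCrawlerSpecific(int code) {
        return code >= 900;
    }

    static String explain(int code) {
        return EXPLANATIONS.getOrDefault(code, "unknown status " + code);
    }
}
```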