Tuning the Crawl Performance

Your Web crawling strategy can be as simple as identifying a few well-known sites that are likely to contain links to most of the other intranet sites in your organization. You could test this by crawling these sites without indexing them. After the initial crawl, you have a good idea of the hosts that exist in your intranet. You could then define separate Web sources to facilitate crawling and indexing on individual sites.

However, the process of discovering and crawling your organization's intranet, or the Internet, is generally an interactive one characterized by periodic analysis of crawling results and modification to crawling parameters. For example, if you observe that the crawler is spending days crawling one Web host, then you might want to exclude crawling at that host or limit the crawling depth.

This section contains the most common things to consider to improve crawl performance:

Understanding the Crawler Schedule
Registering a Proxy
Checking Boundary Rules
Checking Dynamic Pages
Checking Crawler Depth
Checking Robots Rule
Checking Duplicate Documents
Checking Redirected Pages
Checking URL Looping
Increasing the Oracle Redo Log File Size
What to Do Next
Automatically Adding Datafiles

Understanding the Crawler Schedule

Schedules define the frequency at which the Oracle SES index is updated with information about each source. This section describes characteristics of the Oracle SES crawler schedule.

The Failed Schedules section on the Home - General page lists all schedules that have failed. A failed schedule is one in which the crawler encountered an irrecoverable error, such as an indexing error or a source-specific login error, and cannot proceed. A schedule can fail after only some of its documents have been collected and indexed.
The smallest granularity of the schedule interval is one hour. For example, you cannot start a schedule at 1:30 am.
If a crawl takes longer to finish than the scheduled interval, then the next crawl starts as soon as the current crawl is done. Currently, there is no option to skip the overdue run and wait for the next scheduled time instead.
When multiple sources are assigned to one schedule, the sources are crawled one by one following the order of their assignment in the schedule.
The schedule starts crawling the assigned sources in the assigned order. Only one source is crawling under a schedule at any given time. If a source crawl fails, then the rest of the sources assigned after it are not crawled. The schedule does not restart. You must either resolve the cause of the failure and resume the schedule, or remove the failed source from the schedule.
There is no automatic e-mail notification of schedule success or failure.
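
The ordering and failure behavior described in this list can be sketched in a few lines of Python. This is only an illustration of the stated rules, not Oracle SES code: the function name run_schedule, the source names, and the crawl callback are all hypothetical.

```python
from typing import Callable, Dict, List


def run_schedule(sources: List[str], crawl: Callable[[str], bool]) -> Dict[str, str]:
    """Crawl sources one at a time, in assigned order; stop at the first failure."""
    status = {name: "not crawled" for name in sources}
    for name in sources:
        if crawl(name):
            status[name] = "crawled"
        else:
            status[name] = "failed"
            break  # sources assigned after the failed one are not crawled
    return status


# Hypothetical crawl callback: pretend the source named "wiki" fails.
result = run_schedule(["intranet", "wiki", "files"], lambda name: name != "wiki")
print(result)
```

In this sketch the source assigned after the failed one stays uncrawled until the failure is resolved and the schedule is resumed, or the failed source is removed from the schedule.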

Registering a Proxy

By default, Oracle SES is configured to crawl Web sites in the intranet, so no additional configuration is required. However, to crawl Web sites on the Internet (also referred to as external Web sites), Oracle SES needs the HTTP proxy server information.

To register a proxy:

On the Global Settings page under Sources, select Proxy Settings.
Enter the proxy server name and port. Click Set Proxy.
Enter the internal host name suffix under Exceptions, so that internal Web sites do not go through the proxy server. Click Set Domain Exceptions.

To exclude the entire domain, omit http, begin with *., and use the suffix of the host name. For example, *.us.example.com or *.example.com. Entries without the *. prefix are treated as a single host. Use the IP address only when the URL crawled is also specified using the IP for the host name. They must be consistent.
If the proxy requires authentication, then enter the proxy authentication information on the Global Settings - Authentication page.
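
The exception syntax can be illustrated with a small Python matcher. It is a sketch of the rules stated above (a *. prefix covers a whole domain suffix, any other entry is a single host), not the Oracle SES implementation. The function name bypasses_proxy is hypothetical, and whether *.example.com also covers the bare host example.com is not stated in this guide; the sketch assumes it does not.

```python
from typing import Iterable


def bypasses_proxy(host: str, exceptions: Iterable[str]) -> bool:
    """Return True if the host matches a proxy exception entry."""
    host = host.lower()
    for entry in exceptions:
        entry = entry.lower()
        if entry.startswith("*."):
            # "*.example.com" covers every host name ending in ".example.com"
            if host.endswith(entry[1:]):
                return True
        elif host == entry:
            # An entry without the "*." prefix is treated as a single host
            return True
    return False


exceptions = ["*.us.example.com", "intranet.example.com"]
print(bypasses_proxy("docs.us.example.com", exceptions))
print(bypasses_proxy("www.example.com", exceptions))
```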

Checking Boundary Rules

The seed URL you enter when you create a source is turned into an inclusion rule. For example, if www.example.com is the seed URL, then Oracle SES creates an inclusion rule so that only URLs containing the string www.example.com are crawled.

However, suppose that the example Web site includes URLs starting with www.exa-mple.com or example.com (without the www). Many pages have a prefix on the site name. For example, the investor section of the site has URLs that start with investor.example.com.

Always check the inclusion rules before crawling, then check the log after crawling to see what patterns have been excluded.

In this case, you might add www.example.com, www.exa-mple.com, and investor.example.com to the inclusion rules. Or you might just add example.

To crawl outside the seed site (for example, if you are crawling text.us.oracle.com, but you want to follow links outside of text.us.oracle.com to oracle.com), consider removing the inclusion rules completely. Do so carefully. This action could lead the crawler into many, many sites.
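
To see why the seed rule alone misses part of a site, it can help to model a simple inclusion rule as a substring test. The following Python sketch assumes exactly that "contains" behavior and nothing more. The function name is_included and the sample URLs are hypothetical, and real Oracle SES boundary rules offer more than this (for example, regular expressions).

```python
from typing import List


def is_included(url: str, inclusion_rules: List[str]) -> bool:
    """Model a simple inclusion rule as 'the URL contains this string'."""
    if not inclusion_rules:
        return True  # no inclusion rules: nothing restricts the crawl
    return any(rule in url for rule in inclusion_rules)


seed_only = ["www.example.com"]
widened = ["www.example.com", "www.exa-mple.com", "investor.example.com"]

print(is_included("http://investor.example.com/q1.html", seed_only))
print(is_included("http://investor.example.com/q1.html", widened))
```

Running candidate URLs from the crawler log through a check like this is one way to confirm which patterns the current rules would exclude.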

Notes for File Sources

If no boundary rule is specified, then crawling is limited to the underlying file system access privileges. Files accessible from the specified seed file URL are crawled, subject to the default crawling depth. The depth, which is 2 by default, is set on the Global Settings - Crawler Configuration page. For example, if the seed is file://localhost/home/user_a/, then the crawl picks up all files and directories under user_a with access privileges. With the default depth of 2, it crawls documents in the directory /home/user_a/level1, but not documents in the /home/user_a/level1/level2 directory, which are at level 3.
The file URL can be in UNC (universal naming convention) format. The UNC file URL has the following format for files located within the host computer:

file://localhost///LocalComputerName/SharedFolderName

For example, specify \\stcisfcr\docs\spec.htm as file://localhost///stcisfcr/docs/spec.htm

where stcisfcr is the name of the host computer.

The string localhost is optional. You can specify the URL path without the string localhost in the URL, in which case the URL format is:

file:///LocalComputerName/SharedFolderName

For example,

file:///stcisfcr/docs/spec.htm

Note that you cannot use the UNC format to access files on other computers.
On some computers, the path or file name could contain non-ASCII and multibyte characters. URLs are always represented using the ASCII character set. Non-ASCII characters are represented using the hex representation of their UTF-8 encoding. For example, a space is encoded as %20, and a multibyte character can be encoded as %E3%81%82.

You can enter spaces in simple (not regular expression) boundary rules. Oracle SES automatically encodes these URL boundary rules. For example, Home Alone is specified internally as Home%20Alone. Oracle SES does this encoding for the following:
- File source simple boundary rules
- URL string tests
- File source seed URLs
Oracle SES does not alter regular expression rules. You must ensure that the regular expression rule specified is against the encoded file URL. Spaces are not allowed in regular expression rules.
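
When you write a regular expression rule by hand, you need the encoded form of the path. The Python standard library can produce it: urllib.parse.quote percent-encodes each character as the hex representation of its UTF-8 bytes. The helper name encode_file_url below is hypothetical, and the sketch only shows the encoding itself, not how Oracle SES applies it internally.

```python
from urllib.parse import quote


def encode_file_url(text: str) -> str:
    """Percent-encode spaces and non-ASCII characters as UTF-8 hex bytes.

    The characters "/" and ":" are left alone so that a full file URL
    keeps its structure.
    """
    return quote(text, safe="/:")


print(encode_file_url("Home Alone"))
print(encode_file_url("\u3042"))  # the multibyte character U+3042
print(encode_file_url("file://localhost/home/user a/"))
```

The first two calls produce Home%20Alone and %E3%81%82, matching the examples in this section.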

Checking Dynamic Pages

Indexing dynamic pages can generate too many URLs. From the target Web site, manually navigate through a few pages to understand what boundary rules should be set to avoid crawling of duplicate pages.

Checking Crawler Depth

Setting the crawler depth very high (or unlimited) could lead the crawler into many sites. Without boundary rules, a crawler depth of 20 probably crawls the entire World Wide Web from most locations.
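
The effect of the depth limit can be sketched with a breadth-first traversal over a small in-memory link graph. This is a generic illustration, not the Oracle SES crawler: the crawl function and the links dictionary are hypothetical, and the sketch assumes the seed is at depth 0 and that max_depth counts link hops from the seed, which may differ from how Oracle SES counts levels.

```python
from collections import deque
from typing import Dict, List, Optional, Set


def crawl(links: Dict[str, List[str]], seed: str, max_depth: Optional[int]) -> Set[str]:
    """Return every page reachable from seed within max_depth link hops.

    max_depth=None means unlimited depth.
    """
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        page, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # do not follow links out of pages at the depth limit
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen


# Hypothetical link graph: each hop moves further from the seed site.
links = {"seed": ["a", "b"], "a": ["c"], "c": ["d"], "d": ["other-site"]}

print(sorted(crawl(links, "seed", 1)))
print(sorted(crawl(links, "seed", None)))
```

With no boundary rules and no depth limit, the traversal eventually reaches other-site and anything linked from it, which is the runaway behavior this section warns about.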

Checking Robots Rule

You can control which parts of your sites can be visited by robots. If robots exclusion is enabled (the default), then the Web crawler traverses the pages based on the access policy specified in the Web server robots.txt file.

The following sample robots.txt file specifies that no robots visit any URL starting with /cyberworld/map/ or /tmp/ or /foo.html:

# robots.txt for http://www.example.com/
 
User-agent: *
Disallow: /cyberworld/map/ 
Disallow: /tmp/ 
Disallow: /foo.html

If the Web site is under your control, then you can tailor a specific robots rule for the crawler by specifying Oracle Secure Enterprise Search as the user agent. For example:

User-agent: Oracle Secure Enterprise Search
Disallow: /tmp/
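
Before a crawl, you can check how a robots.txt file treats specific URLs with the Python standard library module urllib.robotparser. The sketch below feeds it the first sample file from this section. The helper name allowed and the user agent string ExampleCrawler are hypothetical, and this generic parser is not guaranteed to interpret every file exactly as the Oracle SES crawler does.

```python
from urllib.robotparser import RobotFileParser

SAMPLE_ROBOTS_TXT = """\
# robots.txt for http://www.example.com/

User-agent: *
Disallow: /cyberworld/map/
Disallow: /tmp/
Disallow: /foo.html
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(SAMPLE_ROBOTS_TXT, "ExampleCrawler", "http://www.example.com/tmp/x.html"))
print(allowed(SAMPLE_ROBOTS_TXT, "ExampleCrawler", "http://www.example.com/index.html"))
```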

The robots meta tag can instruct the crawler whether to index a Web page and whether to follow the links within it. For example, the following tag tells the crawler neither to index the page nor to follow its links:

<meta name="robots" content="noindex,nofollow">
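
To find out which pages on a site carry such a tag, you can scan the HTML with the Python standard library. The following sketch uses html.parser to extract the noindex and nofollow directives. The class and function names are hypothetical, and it assumes the common convention that a page is indexed and followed unless the tag says otherwise.

```python
from html.parser import HTMLParser
from typing import Tuple


class RobotsMetaParser(HTMLParser):
    """Collect index/follow permissions from <meta name="robots"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.index = True
        self.follow = True

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        directives = {d.strip().lower() for d in content.split(",")}
        if "noindex" in directives:
            self.index = False
        if "nofollow" in directives:
            self.follow = False


def robots_directives(html: str) -> Tuple[bool, bool]:
    """Return (may_index, may_follow) for the given HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.index, parser.follow


print(robots_directives('<meta name="robots" content="noindex,nofollow">'))
```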

Checking Duplicate Documents

Oracle SES always removes duplicate (identical) documents. Oracle SES does not index a page that is identical to one it has already indexed. Oracle SES also does not index a page that it reached through a URL that it has already processed.

With the Web Services API, you can enable or disable near duplicate detection and removal from the result list. Near duplicate documents are similar to each other. They may or may not be identical to each other.
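
This guide does not describe how Oracle SES detects identical documents. A common general technique, shown here purely as an illustration, is to remember the URLs already processed and a digest of each document's content. The class name DuplicateFilter and the choice of SHA-256 are assumptions of this sketch, and it covers exact duplicates only, not near duplicates.

```python
import hashlib


class DuplicateFilter:
    """Reject URLs already processed and content identical to an indexed page."""

    def __init__(self) -> None:
        self.seen_urls = set()
        self.seen_digests = set()

    def should_index(self, url: str, content: bytes) -> bool:
        if url in self.seen_urls:
            return False  # reached through a URL that was already processed
        self.seen_urls.add(url)
        digest = hashlib.sha256(content).hexdigest()
        if digest in self.seen_digests:
            return False  # identical to a document that is already indexed
        self.seen_digests.add(digest)
        return True


dedup = DuplicateFilter()
print(dedup.should_index("http://example.com/a", b"same text"))
print(dedup.should_index("http://example.com/b", b"same text"))
```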

Checking Redirected Pages

When a page redirects, the crawler crawls only the page it is redirected to. For example, a Web site might have JavaScript that redirects users to another site with the same title. In such cases, only the redirected site is indexed.

Check for inclusion rules from redirects. The inclusion rules are based on the type of redirect. The EQ_TEST.EQ$URL table stores all of the URLs that have been crawled or are scheduled to be crawled. There are three kinds of redirects defined in it:

Temporary Redirect: A redirected URL is always allowed if it is a temporary redirection (HTTP status code 302 or 307). Temporary redirection indicates that, for whatever reason, the original URL should still be used in the future. Temporary redirects cannot be identified from the EQ$URL table; to find them, filter the crawler log file instead.
Permanent Redirect: For permanent redirection (HTTP status code 301), the redirected URL is subject to boundary rules. Permanent redirection means that the original URL is no longer valid and the user should start using the new (redirected) one. In EQ$URL, an HTTP permanent redirect has the status code 954.
Meta Redirect: Metatag redirection is treated as a permanent redirect. A meta redirect has status code 954 and is always checked against boundary rules.

The STATUS column of EQ_TEST.EQ$URL lists the status codes. For descriptions of the codes, refer to Appendix B, "URL Crawler Status Codes."
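
The three redirect rules reduce to a small decision function. The Python sketch below restates them and is not Oracle SES code: the function name redirect_allowed, its parameters, and the within_boundary callback (standing in for whatever boundary rules the source defines) are all hypothetical.

```python
from typing import Callable, Optional


def redirect_allowed(
    target_url: str,
    within_boundary: Callable[[str], bool],
    http_status: Optional[int] = None,
    is_meta: bool = False,
) -> bool:
    """Decide whether the crawler may follow a redirect to target_url."""
    if not is_meta and http_status in (302, 307):
        return True  # temporary redirect: always allowed
    if is_meta or http_status == 301:
        # permanent and meta redirects are checked against boundary rules
        return within_boundary(target_url)
    raise ValueError("not a redirect covered by this sketch")


# Hypothetical boundary rule: stay within example.com.
def inside(url: str) -> bool:
    return "example.com" in url


print(redirect_allowed("http://other.org/", inside, http_status=302))
print(redirect_allowed("http://other.org/", inside, http_status=301))
```

This makes it easier to see why a permanently redirected page can drop out of the index when its new location falls outside the inclusion rules, while a temporarily redirected page does not.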

Note:

Some browsers, such as Mozilla and Firefox, do not allow redirecting a page to load a network file. Microsoft Internet Explorer does not have this limitation.

Checking URL Looping

URL looping refers to the scenario where a large number of unique URLs all point to the same document. Looping sometimes occurs where a site contains a large number of pages, and each page contains links to every other page in the site. Ordinarily this is not a problem, because the crawler eventually analyzes all documents in the site. However, some Web servers attach parameters to generated URLs to track information across requests. Such Web servers might generate a large number of unique URLs that all point to the same document.

For example,

http://example.com/somedocument.html?p_origin_page=10

might refer to the same document as

http://example.com/somedocument.html?p_origin_page=13

but the p_origin_page parameter is different for each link, because the referring pages are different. If a large number of parameters are specified and if the number of referring links is large, then a single unique document could have thousands or tens of thousands of links referring to it. This is an example of how URL looping can occur.
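
When you suspect URL looping, one way to confirm it is to take URLs from the crawler log, strip the suspected tracking parameter, and count how many distinct URLs remain. The Python sketch below does this with urllib.parse. The function name strip_params is hypothetical, and this is an offline diagnostic for your own analysis, not an Oracle SES feature.

```python
from typing import Set
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_params(url: str, ignored: Set[str]) -> str:
    """Return url with the named query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


crawled = [
    "http://example.com/somedocument.html?p_origin_page=10",
    "http://example.com/somedocument.html?p_origin_page=13",
]
unique = {strip_params(url, {"p_origin_page"}) for url in crawled}
print(unique)
```

If thousands of logged URLs collapse to a handful after stripping one parameter, that host is a strong candidate for a boundary rule, exclusion, or a reduced crawling depth.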

Monitor the crawler statistics in the Oracle SES Administration GUI to determine which URLs and Web servers are being crawled the most. If you observe an inordinately large number of URL accesses to a particular site or URL, then you might do one of the following:

Exclude the Web server: This prevents the crawler from crawling any URLs at that host. (You cannot limit the exclusion to a specific port on a host.)
Reduce the crawling depth: This limits the number of levels of referred links the crawler follows. If you are observing URL looping effects on a particular host, then survey the site to estimate the depth of the leaf pages at that site. Leaf pages are pages that do not have any links to other pages. As a general guideline, add three to the leaf page depth, and set the crawling depth to this value.

Be sure to restart the crawler after altering any parameters. Your changes take effect only after restarting the crawler.

Increasing the Oracle Redo Log File Size

Oracle SES allocates 200M for the redo log during installation. 200M is sufficient to crawl a relatively large number of documents. However, if your disk has sufficient space to increase the redo log and if you are going to crawl a very large number of documents (for example, more than 300G of text), then increase the redo log file size for better crawl performance.

Note:

The biggest transaction during crawling is SYNC INDEX by Oracle Text. Check the AWR report or the V$SYSSTAT view to see the actual redo size during crawling. Roughly, 200M is sufficient to crawl up to 300G.

To increase the size of the redo log files:

Open SQL*Plus and connect as the SYSTEM user. It has the same password as EQSYS.

Issue the following SQL statement to see the current redo log status:

SELECT vl.group#, member, bytes, vl.status 
    FROM v$log vl, v$logfile vlf 
    WHERE vl.group#=vlf.group#; 
 
GROUP# MEMBER                                                 BYTES STATUS 
------ ------------------------------------------------- ---------- ---------- 
     3 /scratch/ses111/oradata/o11101/redo03.log         209715200 INACTIVE 
     2 /scratch/ses111/oradata/o11101/redo02.log         209715200 CURRENT 
     1 /scratch/ses111/oradata/o11101/redo01.log         209715200 INACTIVE

Drop the INACTIVE redo log file. For example, to drop group 3:
```
ALTER DATABASE DROP LOGFILE group 3; 
 
Database altered. 
```
Create a larger redo log file with a command like the following. If you want to change the file location, specify the new location.
```
ALTER DATABASE ADD LOGFILE '/scratch/ses111/oradata/o11101/redo03.log'
     size 400M reuse;
```

Check the status to ensure that the file was created.

SELECT vl.group#, member, bytes, vl.status 
     FROM v$log vl, v$logfile vlf 
     WHERE vl.group#=vlf.group#; 
 
GROUP# MEMBER                                                  BYTES STATUS 
------ -------------------------------------------------- ---------- ---------- 
     3 /scratch/ses111/oradata/o11101/redo03.log           419430400 UNUSED 
     2 /scratch/ses111/oradata/o11101/redo02.log           209715200 CURRENT 
     1 /scratch/ses111/oradata/o11101/redo01.log           209715200 INACTIVE

To drop a log file with a CURRENT status, issue the following ALTER statement, then check the results.

ALTER SYSTEM SWITCH LOGFILE; 
 
SELECT vl.group#, member, bytes, vl.status 
     FROM v$log vl, v$logfile vlf 
     WHERE vl.group#=vlf.group#; 
 
GROUP# MEMBER                                                  BYTES STATUS 
------ -------------------------------------------------- ---------- ---------- 
     3 /scratch/ses111/oradata/o11101/redo03.log           419430400 CURRENT 
     2 /scratch/ses111/oradata/o11101/redo02.log           209715200 ACTIVE 
     1 /scratch/ses111/oradata/o11101/redo01.log           209715200 INACTIVE

Issue the following SQL statement to change the status of Group 2 from ACTIVE to INACTIVE:

ALTER SYSTEM CHECKPOINT; 
 
SELECT vl.group#, member, bytes,  vl.status 
     FROM v$log vl, v$logfile vlf 
     WHERE vl.group#=vlf.group#; 
 
GROUP# MEMBER                                                  BYTES STATUS 
------ -------------------------------------------------- ---------- ---------- 
     3 /scratch/ses111/oradata/o11101/redo03.log           419430400 CURRENT 
     2 /scratch/ses111/oradata/o11101/redo02.log           209715200 INACTIVE 
     1 /scratch/ses111/oradata/o11101/redo01.log           209715200 INACTIVE

Repeat the drop, create, and status-check steps (steps 3, 4, and 5) for redo log groups 1 and 2.

What to Do Next

If you are still not crawling all the pages you think you should, then check which pages were crawled by doing one of the following:

Check the crawler log file
Create a search source group

To check the crawler log file:

On the Home page, click the Schedules secondary tab to display the Crawler Schedules page.
Click the Log File icon to display the log file for the source.
To obtain the location of the full log, click the Status link. The Crawler Progress Summary and Log Files by Source section displays the full path to the log file.

To create a search source group:

On the Search page, click the Source Groups subtab.
Click New to display Create New Source Group Step 1.
Enter a name, then click Proceed to Step 2.
Select a source type, then shuttle only one source from Available Sources to Assigned Sources.
Click Finish.

To search the source group:

On any page, click the Search link in the top right corner to open the Search application.
Select the group name, then issue a search term to list the matches within the source.
Select the group name, then click Browse to see a list of search groups:
- The number after the group name identifies the number of browsed documents. Click the number to browse the search results.
- Click the arrow before the group name to display a hierarchy of search results. The number of matches appears after each item in the hierarchy.

Automatically Adding Datafiles

While there are no space restrictions for user defined datafiles, Oracle Database sets a size limit of 32 GB for any datafile. As a result, Oracle SES datafiles cannot grow beyond this limit. However, Oracle SES automatically creates a new datafile at the same location when the existing datafile becomes full.