Overview of Crawler Settings

This section describes crawler settings and other mechanisms to control the scope of Web crawling:

Crawling Mode
URL Boundary Rules
Document Types
Crawling Depth
Robots Exclusion
Index Dynamic Pages
Title Fallback
Character Set Detection

See Also:

"Tuning Crawl Performance" for more detailed information on these settings and other issues affecting crawl performance

Crawling Mode

For initial planning purposes, you might want the crawler to collect URLs without indexing. After crawling is finished, examine the document URLs and status, remove unwanted documents, and start indexing. The crawling mode is set on the Home - Schedules - Edit Schedules page.

See Also:

Appendix B, "URL Crawler Status Codes"

These are the crawling mode options:

Automatically Accept All URLs for Indexing: This crawls and indexes all URLs in the source. For Web sources, it also extracts and indexes any links found in those URLs. If the URL has been crawled before, then it is reindexed only if it has changed.
Examine URLs Before Indexing: This crawls but does not index any URLs in the source. It also crawls any links found in those URLs.
Index Only: This crawls and indexes all URLs in the source. It does not extract any links from those URLs. In general, select this option for a source that has been crawled previously under "Examine URLs Before Indexing".

URL Boundary Rules

URL boundary rules limit the crawling space. When boundary rules are added, the crawler is restricted to URLs that match the indicated rules. The order in which rules are specified has no impact, but exclusion rules always override inclusion rules.

Boundary rules are set on the Home - Sources - Boundary Rules page.

Inclusion Rules

Specify an inclusion rule that a URL contains, starts with, or ends with a term. Use an asterisk (*) to represent a wildcard, for example, www.*.example.com. Simple inclusion rules are case-insensitive. For case-sensitive matching, use regular expression rules.

An inclusion rule ending with example.com limits the search to URLs ending with the string example.com. Anything ending with example.com is crawled, but http://www.example.com.tw is not crawled.
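
The matching behavior of simple rules can be sketched as follows. This is only an illustration, not the Oracle SES implementation: the function name, the rule kind names, and the assumption that a rule is tested against the whole URL string are all hypothetical.

```python
import re

def simple_rule_to_regex(kind, term):
    """Translate a simple rule into a compiled, case-insensitive regex.

    kind is "contains", "starts_with", or "ends_with" (hypothetical names).
    An asterisk in the term is treated as a wildcard for any run of characters.
    """
    # Escape each literal piece, then join the pieces with ".*" where the
    # asterisks were.
    body = ".*".join(re.escape(part) for part in term.split("*"))
    if kind == "starts_with":
        body = "^" + body
    elif kind == "ends_with":
        body = body + "$"
    elif kind != "contains":
        raise ValueError(f"unknown rule kind: {kind}")
    return re.compile(body, re.IGNORECASE)

rule = simple_rule_to_regex("ends_with", "example.com")
print(bool(rule.search("http://www.example.com")))     # True
print(bool(rule.search("http://WWW.EXAMPLE.COM")))     # True
print(bool(rule.search("http://www.example.com.tw")))  # False
```

The last line mirrors the example above: a URL ending in example.com.tw does not satisfy an "ends with example.com" rule.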

If the URL Submission functionality is enabled on the Global Settings - Query Configuration page, then URLs that are submitted by end users are added to the inclusion rules list. You can delete URLs that you do not want to index.

Oracle SES supports the regular expression syntax of the Pattern class (java.util.regex.Pattern) in Java JDK 1.4.2. Regular expression rules use special characters. The following is a summary of some basic regular expression constructs.

A caret (^) denotes the beginning of a URL and a dollar sign ($) denotes the end of a URL.
A period (.) matches any one character.
A question mark (?) matches zero or one occurrence of the character that it follows.
An asterisk (*) matches zero or more occurrences of the pattern that it follows. You can use an asterisk in the starts with, ends with, and contains rules.
A backslash (\) escapes any special characters, such as periods (\.), question marks (\?), or asterisks (\*).
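
The constructs above can be tried out with a short script. This sketch uses Python's re module as a stand-in for java.util.regex.Pattern, on the understanding that these basic constructs behave the same way in both. The URLs are made up, and testing each pattern anywhere in the URL (a search rather than a full match) is an assumption of the sketch.

```python
import re

# (construct, pattern, URL expected to match, URL expected not to match)
cases = [
    ("caret",         r"^https://",       "https://example.com/",              "http://example.com/"),
    ("dollar",        r"\.html$",         "http://example.com/a.html",         "http://example.com/a.html?x=1"),
    ("period",        r"/page.\.html",    "http://example.com/page1.html",     "http://example.com/page.html"),
    ("question mark", r"^https?://",      "http://example.com/",               "ftp://example.com/"),
    ("asterisk",      r"/docs/v1*/",      "http://example.com/docs/v111/",     "http://example.com/docs/v2/"),
    ("backslash",     r"index\.php\?id=", "http://example.com/index.php?id=7", "http://example.com/indexXphp?id=7"),
]

for name, pattern, hit, miss in cases:
    rule = re.compile(pattern)
    # Each line prints the construct name followed by "True False".
    print(name, bool(rule.search(hit)), bool(rule.search(miss)))
```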

See Also:

http://www.oracle.com/technetwork/java/index.html for a complete description in the Java documentation

Exclusion Rules

You can specify an exclusion rule that a URL contains, starts with, or ends with a term. An exclusion of uk.example.com prevents the crawling of Example hosts in the United Kingdom.
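
The precedence of exclusion over inclusion can be sketched as follows. The function is hypothetical and handles only "contains" rules. Treating exclusion terms as case-insensitive, and treating an empty inclusion list as "no restriction", are assumptions of the sketch rather than documented behavior.

```python
def is_crawlable(url, include_terms, exclude_terms):
    """Return True if url passes a set of simple "contains" boundary rules."""
    lowered = url.lower()
    # Exclusion rules always win, so they are checked first.
    if any(term.lower() in lowered for term in exclude_terms):
        return False
    # Assumption: with no inclusion rules, nothing further restricts the crawl.
    if not include_terms:
        return True
    return any(term.lower() in lowered for term in include_terms)

include = ["example.com"]
exclude = ["uk.example.com"]
print(is_crawlable("http://www.example.com/", include, exclude))  # True
print(is_crawlable("http://uk.example.com/", include, exclude))   # False
print(is_crawlable("http://www.other.org/", include, exclude))    # False
```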

Default Exclusion Rules

The crawler contains a default exclusion rule to exclude non-textual files. The following file extensions are included in the default exclusion rule.

Image: bmp, png, tif
Audio: wav, wma, mp3
Video: avi, wmv, mpeg, mpg
Binary: bin, cab, dll, dmp, ear, exe, iso, jar, scm, so, tar, war, wmv

To crawl a file with one of these extensions, update the globalBoundaryRules object using the Administration API. See the Oracle Secure Enterprise Search Administration API Guide.

Note:

Only the file name is indexed when crawling multimedia files, unless the file is crawled using a crawler plug-in that provides a richer set of attributes, such as the Image Document Service plug-in.

Examples of Inclusion and Exclusion Rules

The following example uses several regular expression constructs that are not described earlier, including range quantifiers, non-grouping parentheses, and mode switches. For a complete description, see the Java documentation.

To crawl only HTTPS URLs in the example.com and examplecorp.com domains, and to exclude files ending in .doc and .ppt:

Inclusion: URL regular expression

^https://.*\.example(?:corp){0,1}\.com

Exclusion: URL regular expression

(?i:\.doc|\.ppt)$
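
To see which URLs these two rules admit, they can be tested in a short script. This sketch uses Python's re module (3.6 or later) as a stand-in for the Java Pattern class, and it assumes that each rule is applied as a search anywhere in the URL. The sample URLs and the in_scope helper are hypothetical.

```python
import re

INCLUDE = re.compile(r"^https://.*\.example(?:corp){0,1}\.com")
EXCLUDE = re.compile(r"(?i:\.doc|\.ppt)$")

def in_scope(url):
    """A URL is in scope if it matches the inclusion rule and not the exclusion rule."""
    return bool(INCLUDE.search(url)) and not EXCLUDE.search(url)

urls = [
    "https://www.example.com/index.html",      # in scope
    "https://docs.examplecorp.com/guide.pdf",  # in scope
    "https://www.example.com/notes.docx",      # in scope: ends in .docx, not .doc
    "http://www.example.com/index.html",       # out: not HTTPS
    "https://www.example.com/report.DOC",      # out: exclusion ignores case
    "https://www.example.org/",                # out: wrong domain
    "https://example.com/",                    # out: no period before "example"
]
for url in urls:
    print(in_scope(url), url)
```

Under these assumptions, the last URL shows that the inclusion rule as written requires a host prefix such as www before example.com.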

Document Types

You can customize which document types are processed for each source. By default, PDF, Microsoft Excel, Microsoft PowerPoint, Microsoft Word, HTML and plain text are always processed.

To add or remove document types:

On the Home page, click the Sources secondary tab.
Choose a source from the list and select Edit to display the Customize Source page.
Select the Document Types subtab.

The listed document types are supported for the source type.
Move the types to process to the Processed list and the others to the Not Processed list.
Click Apply.

Keep the following in mind about graphics file formats:

For graphics format files (JPEG, JPEG 2000, GIF, TIFF, DICOM), only the file name is searchable. The crawler does not extract any metadata from graphics files or make any attempt to convert graphical text into indexable text, unless you enable a document service plug-in. See "Configuring Support for Image Metadata".

Oracle SES allows up to 1000 files in zip files and LHA files. If there are more than 1000 files, then an error is raised and the file is ignored.

See Also:

Oracle Text Reference Appendix B for supported document types

Crawling Depth

Crawling depth is the number of levels to crawl Web and file sources. A Web document can contain links to other Web documents, which can contain more links. Specify the maximum number of nested links for the crawler to follow. Crawling depth starts at 0; that is, if you specify 1, then the crawler gathers the starting (seed) URL plus any document that is linked directly from the starting URL. For file crawling, this is the number of directory levels from the starting URL.
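
The effect of the depth setting can be illustrated with a breadth-first traversal of a toy link graph. This is a sketch of the concept only: the crawl function and the links dictionary are hypothetical, and a real crawler fetches pages rather than reading links from a dictionary.

```python
from collections import deque

def crawl(seed, links, max_depth):
    """Return {url: depth} for every URL reachable within max_depth links.

    links is a toy link graph: {url: [urls it links to]}.
    """
    depths = {seed: 0}  # the seed URL is at depth 0
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if depths[url] == max_depth:
            continue  # do not follow links beyond the limit
        for target in links.get(url, []):
            if target not in depths:
                depths[target] = depths[url] + 1
                queue.append(target)
    return depths

links = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/a/deep"],
}
print(sorted(crawl("http://example.com/", links, 0)))
# ['http://example.com/']
print(sorted(crawl("http://example.com/", links, 1)))
# ['http://example.com/', 'http://example.com/a', 'http://example.com/b']
```

With a depth of 1, the page at http://example.com/a/deep is not gathered because it is two links away from the seed.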

Set the crawling depth on the Home - Sources - Crawling Parameters page.

Robots Exclusion

You can control which parts of your sites can be visited by robots. If robots exclusion is enabled (default), then the Web crawler traverses the pages based on the access policy specified in the Web server robots.txt file. The crawler also respects the page-level robot exclusion specified in HTML metatags.

For example, when a robot visits http://www.example.com/, it checks for http://www.example.com/robots.txt. If it finds it, then the crawler checks to see if it is allowed to retrieve the document. If you own the Web sites, then you can disable robots exclusions. However, when crawling other Web sites, always defer to robots.txt by enabling robots exclusion.
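
The check described above can be sketched with the robots.txt parser in the Python standard library. The robots.txt content, the user agent name example-crawler, and the honor_robots flag are hypothetical. The flag stands in for the robots exclusion setting and is not an Oracle SES parameter name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content served by the Web site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed(url, robots_txt, user_agent="example-crawler", honor_robots=True):
    """Return True if the crawler may retrieve url."""
    if not honor_robots:
        return True  # robots exclusion disabled: every URL is allowed
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(allowed("http://www.example.com/index.html", ROBOTS_TXT))           # True
print(allowed("http://www.example.com/private/report.html", ROBOTS_TXT))  # False
print(allowed("http://www.example.com/private/report.html", ROBOTS_TXT,
              honor_robots=False))                                        # True
```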

Set the robots parameter on the Home - Sources - Crawling Parameters page.

Index Dynamic Pages

By default, Oracle SES processes dynamic pages. Dynamic pages are generally served from a database application and have a URL that contains a question mark (?). Oracle SES identifies URLs with question marks as dynamic pages.

Some dynamic pages appear as multiple search results for the same page, and you might not want them all indexed. Other dynamic pages are each different and must be indexed. You must distinguish between these two kinds of dynamic pages. In general, dynamic pages that differ only in menu expansion, with no change to their content, should not be indexed.

Consider the following three URLs:

http://example.com/aboutit/network/npe/standards/naming_convention.html
 
http://example.com/aboutit/network/npe/standards/naming_convention.html?nsdnv=14z1
 
http://example.com/aboutit/network/npe/standards/naming_convention.html?nsdnv=14

The question marks (?) in the last two URLs indicate that the rest of each string consists of input parameters. The three results are essentially the same page with different side menu expansions. Ideally, the search yields only one result:

http://example.com/aboutit/network/npe/standards/naming_convention.html
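
The relationship between the three URLs can be checked with a short script. This is only an illustration of why they count as one page. The helper functions are hypothetical, and removing the query string is not something the Oracle SES setting is documented to do.

```python
from urllib.parse import urlsplit, urlunsplit

def is_dynamic(url):
    """Treat any URL that contains a question mark as a dynamic page."""
    return "?" in url

def without_query(url):
    """Return the URL with its query string and fragment removed."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

base = "http://example.com/aboutit/network/npe/standards/naming_convention.html"
urls = [base, base + "?nsdnv=14z1", base + "?nsdnv=14"]

print([is_dynamic(u) for u in urls])       # [False, True, True]
print({without_query(u) for u in urls} == {base})  # True
```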

Note:

The crawler cannot crawl and index dynamic Web pages written in JavaScript.

Set the dynamic pages parameter on the Home - Sources - Crawling Parameters page.

Title Fallback

You can override a default document title with a meaningful title if the default title is irrelevant. For example, suppose that the result list shows numerous documents with the title "Daily Memo". The documents had been created with the same template file, but the document properties had not been changed. Overriding this title in Oracle SES can help users better understand their search results.

Title fallback can be used for any source type. Oracle SES uses different logic for each document type to determine which fallback title to use. For example, for HTML documents, Oracle SES looks for the first heading, such as <h1>. For Microsoft Word documents, Oracle SES looks for text with the largest font.
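
For HTML documents, the idea can be sketched as follows. This is a minimal illustration and not the Oracle SES logic: the class and function names are hypothetical, and it assumes that the fallback applies when the title is empty or appears in a list of bad titles.

```python
from html.parser import HTMLParser

class TitleAndHeading(HTMLParser):
    """Collect the <title> text and the text of the first heading (h1 to h6)."""
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.heading = ""
        self._current = None        # "title", "heading", or None
        self._heading_done = False  # True once the first heading has closed

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._current = "title"
        elif tag in self.HEADINGS and not self._heading_done:
            self._current = "heading"

    def handle_endtag(self, tag):
        if tag == "title" and self._current == "title":
            self._current = None
        elif tag in self.HEADINGS and self._current == "heading":
            self._current = None
            self._heading_done = True

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "heading":
            self.heading += data

def effective_title(html, bad_titles):
    """Return the first heading when the title is empty or listed as bad."""
    parser = TitleAndHeading()
    parser.feed(html)
    parser.close()
    title = parser.title.strip()
    heading = parser.heading.strip()
    if (not title or title in bad_titles) and heading:
        return heading
    return title

page = ("<html><head><title>Daily Memo</title></head>"
        "<body><h1>Q3 Budget Review</h1><p>Details.</p></body></html>")
print(effective_title(page, {"Daily Memo"}))  # Q3 Budget Review
```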

If the default title was collected in the initial crawl, then the fallback title is only used after the document is reindexed during a re-crawl. Thus, if there is no change to the document, then you must force the change by setting the re-crawl policy to Process All Documents on the Home - Schedules - Edit Schedule page.

To implement title fallback, modify the crawlerSettings object using the Administration API. Set the <search:indexNullTitleFallback> element to indexForAll, and list the bad titles in the <search:badTitles> elements. See the Oracle Secure Enterprise Search Administration API Guide.

Title fallback is not currently supported in the Oracle SES Administration GUI, and by default, it is turned off.

Special Considerations with Title Fallback

With Microsoft Office documents:
- Font sizes 14 and 16 in Microsoft Word correspond to normalized font sizes 4 and 5 (respectively) in converted HTML. The Oracle SES crawler only picks up strings with normalized font size greater than 4 as the fallback title.
- Titles must contain more than five characters.

When a title is null, Oracle SES automatically indexes the fallback title for all binary documents (for example, .doc, .ppt, .pdf).
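
The two Microsoft Office constraints can be combined in a small selection function. This is a sketch under stated assumptions: the function name and the (text, normalized font size) input format are hypothetical, and breaking ties in favor of the earliest text is a choice made for the example.

```python
def pick_fallback_title(runs):
    """Pick a fallback title from (text, normalized_font_size) pairs.

    A candidate needs a normalized font size greater than 4 and more than
    five characters. The candidate with the largest font is returned.
    """
    candidates = [
        (size, text.strip())
        for text, size in runs
        if size > 4 and len(text.strip()) > 5
    ]
    if not candidates:
        return None
    # max() returns the first of several equally large items, so ties go
    # to the earliest text in the document.
    return max(candidates, key=lambda item: item[0])[1]

runs = [
    ("Memo", 6),               # large font, but too short
    ("Quarterly Results", 5),  # qualifies
    ("Heading at 14 pt", 4),   # normalized size 4 is not greater than 4
    ("Body text here", 3),
]
print(pick_fallback_title(runs))  # Quarterly Results
```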

For HTML and text documents, Oracle SES does not automatically index the fallback title. Thus, the replaced title on HTML or text documents cannot be searched with the title attribute on the Advanced Search page. To turn on indexing for HTML and text documents, modify the crawlerSettings object using the Administration API. Set the <search:indexNullTitleFallback> parameter to indexForAll.

Character Set Detection

This feature enables the crawler to automatically detect character set information for HTML, plain text, and XML files. Character set detection allows the crawler to properly cache files during crawls, index text, and display files for queries. This is important when crawling multibyte files (such as files in Japanese or Chinese).

To enable character set detection, update the crawlerSettings object using the Administration API. Set the <search:charsetDetection> parameter to true. See the Oracle Secure Enterprise Search Administration API Guide for more information about changing crawler settings.

Special Considerations with Automatic Character Set Detection

To crawl XML files for a source, be sure to add XML to the list of processed document types on the Home - Sources - Document Types page. XML files are currently treated as HTML format, and detection for XML files may not be as accurate as for other file formats.

Language Detection

With multibyte files, besides turning on character set detection, be sure to set the Default Language parameter. For example, if the files are all in Japanese, select Japanese as the default language for that source. The crawler assumes that a document is written in the default language only if automatic language detection is disabled or the crawler cannot determine the document language during crawling.

If your files are in multiple languages, then turn on the Enable Language Detection parameter. Not all documents retrieved by the crawler specify the language. For documents with no language specification, the crawler attempts to automatically detect the language. The language recognizer is trained statistically using trigram data from documents in various languages (for instance, Danish, Dutch, English, French, German, Italian, Portuguese, and Spanish). It starts with the hypothesis that the given document does not belong to any language and ultimately refutes this hypothesis for a particular language where possible. It operates on the Latin-1 alphabet and on any language with a deterministic Unicode range of characters (such as Chinese, Japanese, and Korean).
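
Detection by Unicode range can be illustrated with a very small sketch. It is not the Oracle SES recognizer: the function name is hypothetical, only a few Unicode blocks are checked, and treating text that contains Han characters but no kana or Hangul as Chinese is a simplification, because Japanese text can also consist only of Han characters.

```python
def script_language(text):
    """Guess "ja", "ko", or "zh" from Unicode ranges; None for other scripts."""
    has_han = False
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:  # Hiragana and Katakana blocks
            return "ja"
        if 0xAC00 <= cp <= 0xD7A3:  # Hangul syllables
            return "ko"
        if 0x4E00 <= cp <= 0x9FFF:  # CJK Unified Ideographs
            has_han = True
    return "zh" if has_han else None

print(script_language("これは日本語です"))  # ja
print(script_language("한국어 문서"))        # ko
print(script_language("这是中文文档"))      # zh
print(script_language("plain English"))     # None
```

Text in Latin-1 languages returns None here. For those languages, the recognizer described above relies on trigram statistics instead.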

The crawler determines the language code by checking the HTTP header content-language or the LANGUAGE column, if it is a table source. If it cannot determine the language, then it takes the following steps:

If the language recognizer is not available or if it cannot determine a language code, then the default language code is used.
If the language recognizer is available, then the output from the recognizer is used.

Oracle SES uses different lexers for space-delimited languages (such as English), Chinese, Japanese, and Korean. See the lexer object description in the Oracle Secure Enterprise Search Administration API Guide.
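
The order in which the language code is resolved can be sketched as a single function. The function and parameter names are hypothetical. Here, declared stands for the value taken from the content-language header or the LANGUAGE column, and recognizer is any callable that returns a language code or None.

```python
def resolve_language(declared, text, default_language,
                     recognizer=None, detection_enabled=True):
    """Return the language code to use for a document."""
    # 1. A language declared by the source takes priority.
    if declared:
        return declared
    # 2. Otherwise use the recognizer, if it is enabled and available.
    if detection_enabled and recognizer is not None:
        guess = recognizer(text)
        if guess:
            return guess
    # 3. Fall back to the default language.
    return default_language

def always_french(text):
    """Stand-in recognizer that always answers "fr"."""
    return "fr"

def undecided(text):
    """Stand-in recognizer that cannot decide."""
    return None

print(resolve_language("ja", "text", "en", recognizer=always_french))  # ja
print(resolve_language(None, "text", "en", recognizer=always_french))  # fr
print(resolve_language(None, "text", "en", recognizer=undecided))      # en
print(resolve_language(None, "text", "en"))                            # en
```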

The Default Language and the Enable Language Detection parameters are on the Global Settings - Crawler Configuration page (globally) and also the Home - Sources - Crawling Parameters page (for each source).

Note:

For file sources, the individual source setting for Enable Language Detection remains false regardless of the global setting. In most cases, all documents in a file source are in the same language, which should be set using the Default Language setting.

Deleting the Secure Cache

You can manage the Secure Cache either on the global level or at the data source level. The data source configuration supersedes the global configuration.

The cache is preserved by default and supports the Cached link feature in the search result page. If you do not use the Cached link, then you can delete the cache, either for specific sources or globally for all of them. Without a cache, the Cached link in a search result page returns a File not found error.

To delete the cache for all sources:

Select the Global Settings tab in the Oracle SES Administration GUI.
Choose Crawler Configuration.
Set Preserve Document Cache to No.
Click Delete Cache Now to remove the cache from all sources, except any that are currently active under an executing schedule. The cache is deleted in the background, and you do not have to wait for it to complete.
Click Apply.

To delete the cache for an individual source:

Select the Sources secondary tab on the Home page.
Click Edit for the source.
Click the Crawling Parameters subtab.
Set Preserve Document Cache to No.
Click Apply.