Overview of Crawler Settings

This section describes crawler settings and other mechanisms that control the scope of Web crawling.

See Also:

"Tuning the Crawl Performance" for more detailed information on these settings and other issues affecting crawl performance

Crawling Mode

For initial planning purposes, you might want the crawler to collect URLs without indexing them. After crawling is finished, examine the document URLs and status, remove unwanted documents, and then start indexing. The crawling mode is set on the Home - Schedules - Edit Schedules page.

Note:

If you are using a custom crawler created with the Crawler Plug-in API, then the crawling mode set here does not apply. The implemented plug-in controls the crawling mode.

These are the crawling mode options:

  • Automatically Accept All URLs for Indexing: This crawls and indexes all URLs in the source. For Web sources, it also extracts and indexes any links found in those URLs. If the URL has been crawled before, then it is reindexed only if it has changed.

  • Examine URLs Before Indexing: This crawls but does not index any URLs in the source. It also crawls any links found in those URLs.

  • Index Only: This crawls and indexes all URLs in the source. It does not extract any links from those URLs. In general, select this option for a source that has been crawled previously under "Examine URLs Before Indexing".

URL Boundary Rules

URL boundary rules limit the crawling space. When boundary rules are added, the crawler is restricted to URLs that match the indicated rules. The order in which rules are specified has no impact, but exclusion rules always override inclusion rules.

This is set on the Home - Sources - Boundary Rules page.

Inclusion Rules

Specify an inclusion rule that a URL contains, starts with, or ends with a term. Use an asterisk (*) to represent a wildcard. For example, www.*.example.com. Simple inclusion rules are case-insensitive. For case-sensitivity, use regular expression rules.

An inclusion rule ending with example.com limits the crawl to URLs that end with the string example.com. For example, http://www.example.com is crawled, but http://www.example.com.tw is not.

If the URL Submission functionality is enabled on the Global Settings - Query Configuration page, then URLs that are submitted by end users are added to the inclusion rules list. You can delete URLs that you do not want to index.

Oracle SES supports the regular expression syntax used in the Java JDK 1.4.2 Pattern class (java.util.regex.Pattern). Regular expression rules use special characters. The following is a summary of some basic regular expression constructs.

  • A caret (^) denotes the beginning of a URL and a dollar sign ($) denotes the end of a URL.

  • A period (.) matches any one character.

  • A question mark (?) matches zero or one occurrence of the character that it follows.

  • An asterisk (*) matches zero or more occurrences of the pattern that it follows. You can use an asterisk in the starts with, ends with, and contains rules.

  • A backslash (\) escapes any special characters, such as periods (\.), question marks (\?), or asterisks (\*).
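The following standalone snippet illustrates how these constructs behave under java.util.regex, the engine behind Oracle SES regular expression rules. The pattern and URLs are illustrative only:

import java.util.regex.Pattern;

public class BoundaryRuleDemo {
    public static void main(String[] args) {
        // ^ anchors the start of the URL, \. is an escaped period,
        // . matches any one character, and * repeats the element before it.
        Pattern rule = Pattern.compile("^http://www\\.example\\.com/.*\\.html$");

        // The URL starts with http://www.example.com/ and ends with .html:
        System.out.println(rule.matcher("http://www.example.com/docs/a.html").find()); // true

        // The URL ends with .htm, so the rule does not match:
        System.out.println(rule.matcher("http://www.example.com/docs/a.htm").find());  // false
    }
}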

See Also:

http://java.sun.com for a complete description in the Sun Microsystems Java documentation

Exclusion Rules

You can specify an exclusion rule that a URL contains, starts with, or ends with a term.

For example, an exclusion rule containing uk.example.com prevents the crawling of Example hosts in the United Kingdom.

Default Exclusion Rules

The crawler has a default exclusion rule that excludes non-textual files. The following file extensions are included in the default exclusion rule.

  • Image: jpg, gif, tif, bmp, png

  • Audio: wav, mp3, wma

  • Video: avi, mpg, mpeg, wmv

  • Binary: bin, exe, so, dll, iso, jar, war, ear, tar, scm, cab, dmp

To crawl files with these extensions, modify the following section in the ORACLE_HOME/search/data/config/crawler.dat file, removing the suffixes of the file types that you want to crawl from the exclusion list.

# default file name suffix exclusion list
RX_BOUNDARY (?i:(?:\.gif)|(?:\.jpg)|(?:\.jar)|(?:\.tif)|(?:\.bmp)|(?:\.war)|(?:\.ear)|(?:\.mpg)|(?:\.wmv)|(?:\.mpeg)|(?:\.scm)|(?:\.iso)|(?:\.dmp)|(?:\.dll)|(?:\.cab)|(?:\.so)|(?:\.avi)|(?:\.wav)|(?:\.mp3)|(?:\.wma)|(?:\.bin)|(?:\.exe)|(?:\.tar)|(?:\.png))$

Then add the MIMEINCLUDE parameter to the crawler.dat file, specifying the MIME types of the multimedia files to crawl. For these files, the file name is indexed as the title.

For example, to crawl audio files, remove .wav, .mp3, and .wma from the exclusion list, and add the MIMEINCLUDE parameter:

RX_BOUNDARY (?i:(?:\.gif)|(?:\.jpg)|(?:\.jar)|(?:\.tif)|(?:\.bmp)|(?:\.war)|(?:\.ear)|(?:\.mpg)|(?:\.wmv)|(?:\.mpeg)|(?:\.scm)|(?:\.iso)|(?:\.dmp)|(?:\.dll)|(?:\.cab)|(?:\.so)|(?:\.avi)|(?:\.bin)|(?:\.exe)|(?:\.tar)|(?:\.png))$
MIMEINCLUDE audio/x-wav audio/mpeg

Note:

Only the file name is indexed when crawling multimedia files, unless the file is crawled using a crawler plug-in that provides a richer set of attributes, such as the Image Document Service plug-in.

Example Using a Regular Expression

The following example uses several regular expression constructs that are not described earlier, including range quantifiers, non-grouping parentheses, and mode switches. For a complete description, see the Sun Microsystems Java documentation.

To crawl only HTTPS URLs in the example.com and examplecorp.com domains, and to exclude files ending in .doc and .ppt:

  • Inclusion: URL regular expression

    ^https://.*\.example(?:corp){0,1}\.com

  • Exclusion: URL regular expression

    (?i:\.doc|\.ppt)$
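Before deploying rules like these, you can test the patterns against sample URLs with a few lines of standalone Java. This check is an illustration only and is not part of Oracle SES:

import java.util.regex.Pattern;

public class RuleCheck {
    public static void main(String[] args) {
        // (?:corp){0,1} combines non-grouping parentheses with a range
        // quantifier: "corp" may appear zero or one time.
        Pattern include = Pattern.compile("^https://.*\\.example(?:corp){0,1}\\.com");
        // (?i: ... ) is a mode switch that makes the extensions case-insensitive.
        Pattern exclude = Pattern.compile("(?i:\\.doc|\\.ppt)$");

        String[] urls = {
            "https://www.example.com/report.html",  // included
            "https://www.examplecorp.com/plan.PPT", // matches both rules; exclusion wins
            "http://www.example.com/notes.html"     // not HTTPS, so not included
        };
        for (String url : urls) {
            boolean crawled = include.matcher(url).find()
                    && !exclude.matcher(url).find();
            System.out.println(url + " -> " + (crawled ? "crawled" : "skipped"));
        }
    }
}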

Document Types

You can customize which document types are processed for each source. By default, PDF, Microsoft Excel, Microsoft PowerPoint, Microsoft Word, HTML, and plain text documents are processed.

To add or remove document types: 

  1. On the Home page, click the Sources secondary tab.

  2. Choose a source from the list and select Edit to display the Customize Source page.

  3. Select the Document Types subtab.

    The listed document types are supported for the source type.

  4. Move the types to process to the Processed list and the others to the Not Processed list.

  5. Click Apply.

Keep in mind that for graphics file formats (JPEG, JPEG 2000, GIF, TIFF, DICOM), only the file name is searchable. The crawler does not extract any metadata from graphics files or make any attempt to convert graphical text into indexable text, unless you enable a document service plug-in. See "Configuring Support for Image Metadata".

Oracle SES allows up to 1000 files in a zip or LHA archive. If an archive contains more than 1000 files, then an error is raised and the archive is ignored. See "Crawling Zip Files Containing Non-UTF8 File Names".

See Also:

Oracle Text Reference Appendix B for supported document types

Crawling Depth

Crawling depth is the number of levels to crawl Web and file sources. A Web document can contain links to other Web documents, which can contain more links. Specify the maximum number of nested links for the crawler to follow. Crawling depth starts at 0; that is, if you specify 1, then the crawler gathers the starting (seed) URL plus any document that is linked directly from the starting URL. For file crawling, this is the number of directory levels from the starting URL.

Set the crawling depth on the Home - Sources - Crawling Parameters page.

Robots Exclusion

You can control which parts of your sites can be visited by robots. If robots exclusion is enabled (default), then the Web crawler traverses the pages based on the access policy specified in the Web server robots.txt file. The crawler also respects the page-level robot exclusion specified in HTML metatags.

For example, when a robot visits http://www.example.com/, it checks for http://www.example.com/robots.txt. If the file exists, then the crawler checks whether it is allowed to retrieve the document. If you own the Web sites, then you can disable robots exclusion. However, when crawling other Web sites, always comply with robots.txt by leaving robots exclusion enabled.
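For example, the following robots.txt entries block all compliant robots from the /cgi-bin and /private directories (the directory names are illustrative):

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/

Page-level exclusion uses the standard robots metatag in the HTML document itself:

<meta name="robots" content="noindex,nofollow">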

Set the robots parameter on the Home - Sources - Crawling Parameters page.

Index Dynamic Pages

By default, Oracle SES processes dynamic pages. Dynamic pages are generally served from a database application and have a URL that contains a question mark (?). Oracle SES identifies URLs with question marks as dynamic pages.

Some dynamic pages appear as multiple search results for the same page, and you might not want them all indexed. Other dynamic pages are genuinely distinct and must each be indexed. You must distinguish between these two kinds of dynamic pages. In general, dynamic pages that differ only in menu expansion, without any change to their content, should not be indexed.

Consider the following three URLs:

http://example.com/aboutit/network/npe/standards/naming_convention.html
 
http://example.com/aboutit/network/npe/standards/naming_convention.html?nsdnv=14z1
 
http://example.com/aboutit/network/npe/standards/naming_convention.html?nsdnv=14

The question marks (?) in the last two URLs indicate that the rest of the string consists of input parameters. The three results are essentially the same page with different side-menu expansions. Ideally, the search yields only one result:

http://example.com/aboutit/network/npe/standards/naming_convention.html
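One way to suppress such duplicates is a boundary exclusion rule that rejects URLs carrying the menu-state parameter. The parameter name nsdnv is specific to this example; substitute the parameter that your site uses:

  • Exclusion: URL regular expression

    \?nsdnv=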

Note:

The crawler cannot crawl and index dynamic Web pages written in JavaScript.

Set the dynamic pages parameter on the Home - Sources - Crawling Parameters page.

URL Rewriter API

The URL Rewriter is a user-supplied Java module that implements the Oracle SES UrlRewriter interface. The crawler uses it to filter or rewrite extracted URL links before they are put into the URL queue. The API gives you complete control over which links extracted from a Web page are kept and which are discarded.

URL filtering removes unwanted links, and URL rewriting transforms the URL link. This transformation is necessary when access URLs are used and alternate display URLs must be presented to the user in the search results.
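The following sketch shows the general shape of such a module. The interface declaration and method signature here are assumptions for illustration; consult the Oracle SES SDK documentation for the actual UrlRewriter API:

// Sketch only: this declaration stands in for the actual Oracle SES
// UrlRewriter interface, whose package and method names may differ.
interface UrlRewriter {
    String rewrite(String extractedUrl);
}

public class DocServerRewriter implements UrlRewriter {
    // Return null to discard an extracted link, or return the
    // (possibly rewritten) URL to put into the crawler's URL queue.
    public String rewrite(String extractedUrl) {
        // Filtering: discard links into the staging server.
        if (extractedUrl.contains("staging.example.com")) {
            return null;
        }
        // Rewriting: map the internal access URL to the display URL
        // that users should see in the search results.
        return extractedUrl.replace("http://internal.example.com/docs/",
                                    "http://www.example.com/docs/");
    }
}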

Set the URL rewriter on the Home - Sources - Crawling Parameters page.

Title Fallback

You can override a default document title with a meaningful title if the default title is irrelevant. For example, suppose that the result list shows numerous documents with the title "Daily Memo". The documents had been created with the same template file, but the document properties had not been changed. Overriding this title in Oracle SES can help users better understand their search results.

Title fallback can be used for any source type. Oracle SES uses different logic for each document type to determine which fallback title to use. For example, for HTML documents, Oracle SES looks for the first heading, such as <h1>. For Microsoft Word documents, Oracle SES looks for text with the largest font.

If the default title was collected in the initial crawl, then the fallback title is used only after the document is reindexed during a re-crawl. If the document has not changed, then you must force reindexing by setting the re-crawl policy to Process All Documents on the Home - Schedules - Edit Schedule page.

This feature is not currently supported in the Oracle SES Administration GUI. To override a default document title with a meaningful title, add the keyword BAD_TITLE to the ORACLE_HOME/search/data/config/crawler.dat file. For example:

BAD_TITLE Daily Memo

where Daily Memo is the title string to be overridden. The title string is case-insensitive and can contain multibyte characters in the UTF8 character set.

You can specify multiple bad titles, each one on a separate line.
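For example, the following entries mark three template-generated titles for replacement (the titles are illustrative):

BAD_TITLE Daily Memo
BAD_TITLE Untitled Document
BAD_TITLE Slide 1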

Special Considerations with Title Fallback

  • With Microsoft Office documents:

    • Font sizes 14 and 16 in Microsoft Word correspond to normalized font sizes 4 and 5, respectively, in the converted HTML. The Oracle SES crawler picks up only strings with a normalized font size greater than 4 as the fallback title.

    • Titles should contain more than five characters.

  • When a title is null, Oracle SES automatically indexes the fallback title for all binary documents (for example, .doc, .ppt, .pdf). For HTML and text documents, Oracle SES does not automatically index the fallback title, so the replaced title on HTML or text documents cannot be searched with the title attribute on the Advanced Search page. You can turn on indexing of the fallback title for HTML and text documents in the crawler.dat file; for example, set NULL_TITLE_FALLBACK_INDEX ALL.

  • The crawler.dat file is not included in the backup available on the Global Settings - Configuration Data Backup and Recovery page. Ensure that you manually back up the crawler.dat file.

Character Set Detection

This feature enables the crawler to automatically detect character set information for HTML, plain text, and XML files. Character set detection allows the crawler to properly cache files during crawls, index text, and display files for queries. This is important when crawling multibyte files (such as files in Japanese or Chinese).

This feature is not currently supported in the Oracle SES Administration GUI, and it is turned off by default. To enable automatic character set detection, add the following as a new line in the crawler configuration file ORACLE_HOME/search/data/config/crawler.dat:

AUTO_CHARSET_DETECTION

You can check whether this is turned on or off in the crawler log under the "Crawling Settings" section.

Special Considerations with Automatic Character Set Detection

  • To crawl XML files for a source, be sure to add XML to the list of processed document types on the Home - Sources - Document Types page. XML files are currently treated as HTML format, and detection for XML files may not be as accurate as for other file formats.

  • The crawler.dat file is not included in the backup available on the Global Settings - Configuration Data Backup and Recovery page. Ensure that you manually back up the crawler.dat file.

Language Detection

With multibyte files, besides turning on character set detection, be sure to set the Default Language parameter. For example, if the files are all in Japanese, then select Japanese as the default language for that source. The default language is used only when automatic language detection is disabled or the crawler cannot determine the document language during crawling.

If your files are in multiple languages, then turn on the Enable Language Detection parameter. Not all documents retrieved by the crawler specify their language. For documents with no language specification, the crawler attempts to detect the language automatically. The language recognizer is trained statistically using trigram data from documents in various languages (for instance, Danish, Dutch, English, French, German, Italian, Portuguese, and Spanish). It starts with the hypothesis that the given document does not belong to any language and then refutes this hypothesis for a particular language where possible. It operates on the Latin-1 alphabet and on any language with a deterministic Unicode range of characters (such as Chinese, Japanese, and Korean).

Oracle Text MULTI_LEXER is the only lexer used for Oracle Secure Enterprise Search.

The crawler determines the language code by checking the HTTP content-language header or, for a table source, the LANGUAGE column. If neither specifies the language, then the crawler takes the following steps:

  • If the language recognizer is available and can determine a language code, then the output from the recognizer is used.

  • If the language recognizer is not available, or if it cannot determine a language code, then the default language code is used.
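The following standalone sketch summarizes this order of precedence. It is an illustration of the logic described above, not actual Oracle SES code:

public class LanguageFallbackDemo {
    // The parameters are stand-ins for the crawler's actual inputs.
    static String determineLanguage(String explicitLang,
                                    boolean detectionEnabled,
                                    String recognizedLang,
                                    String defaultLang) {
        // 1. An explicit language wins: the content-language header,
        //    or the LANGUAGE column for a table source.
        if (explicitLang != null) return explicitLang;
        // 2. Otherwise use the recognizer's output, if it produced one.
        if (detectionEnabled && recognizedLang != null) return recognizedLang;
        // 3. Fall back to the Default Language setting.
        return defaultLang;
    }

    public static void main(String[] args) {
        System.out.println(determineLanguage(null, true, "ja", "en"));  // ja
        System.out.println(determineLanguage(null, false, null, "en")); // en
    }
}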

The Default Language and the Enable Language Detection parameters are on the Global Settings - Crawler Configuration page (globally) and also the Home - Sources - Crawling Parameters page (for each source).

Note:

For file sources, the individual source setting for Enable Language Detection remains false regardless of the global setting. In most cases, the documents in a file source are in the same language, which you specify with the Default Language setting.

Cache Directory

For sources created before Oracle SES 11g, the document cache remains in the cache directory. The documents of these sources are not stored in the Secure Cache in the database until the sources are migrated to use Secure Cache. You can manage the cache directory for these older sources the same as in earlier releases.

Deleting the Secure Cache

You can manage the Secure Cache either at the global level or at the data source level. The data source configuration supersedes the global configuration.

The cache is preserved by default and supports the Cached link feature on the search result page. If you do not use the Cached link, then you can delete the cache, either for specific sources or globally for all of them. Without a cache, the Cached link on a search result page returns a File not found error.

To delete the cache for all sources: 

  1. Select the Global Settings tab in the Oracle SES Administration GUI.

  2. Choose Crawler Configuration.

  3. Set Preserve Document Cache to No.

  4. Click Delete Cache Now to remove the cache from all sources, except any that are currently active under an executing schedule. The cache is deleted in the background, and you do not have to wait for it to complete.

  5. Click Apply.

To delete the cache for an individual source: 

  1. Select the Sources secondary tab on the Home page.

  2. Click Edit for the source.

  3. Click the Crawling Parameters subtab.

  4. Set Preserve Document Cache to No.

  5. Click Apply.