Oracle® Secure Enterprise Search Administrator's Guide
11g Release 1 (11.1.2.0.0)

Part Number E14130-04

4 Understanding Crawling

This chapter discusses the Oracle SES crawler.

Overview of the Oracle Secure Enterprise Search Crawler

The Oracle Secure Enterprise Search (Oracle SES) crawler is a Java process activated by a set schedule. When activated, the crawler spawns processor threads that fetch documents from sources. The crawler caches the documents, and when the cache reaches the maximum batch size of 250 MB, the crawler indexes the cached files. This index is used for searching.

The document cache, called Secure Cache, is stored in Oracle Database in a compressed SecureFile LOB. Oracle Database provides excellent security and compact storage.

In the Oracle SES Administration GUI, you can create schedules with one or more sources attached to them. Schedules define the frequency at which the Oracle SES index is kept up to date with existing information in the associated sources.

Crawler URL Queue

During crawling, the crawler maintains an internal URL queue: a list of the URLs of discovered documents to be fetched and indexed. The queue is stored persistently, so that crawls can resume after the Oracle SES instance restarts.

Understanding Access URLs and Display URLs

A display URL is the URL string shown in search results; it is the link that users click. An access URL is an optional URL string that the crawler uses for crawling and indexing. If no access URL is defined, then the crawler crawls and indexes the display URL. If an access URL is defined, then the crawler uses it instead of the display URL for crawling. For regular Web crawling, only display URLs are available. In some situations, however, the crawler needs an access URL to crawl an internal site while presenting a display URL for external use. In that case, every internal URL has an external mirrored URL.

For example, for file sources with display URLs, end users can access the original document over HTTP or HTTPS. These protocols provide the appropriate authentication and personalization and result in a better user experience.

Display URLs can be provided using the URL Rewriter API. Or, they can be generated by specifying the mapping between the prefix of the original file URL and the prefix of the display URL. Oracle SES replaces the prefix of the file URL with the prefix of the display URL.

For example, if the file URL is

file://localhost/home/operation/doc/file.doc

and the display URL is

https://webhost/client/doc/file.doc

then specify the file URL prefix as

file://localhost/home/operation

and the display URL prefix as

https://webhost/client
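
The prefix substitution described above can be sketched as follows. This is an illustrative example only; the class and method names are hypothetical, not part of the Oracle SES API.

```java
// Illustrative sketch of display URL generation by prefix substitution;
// the class and method names are hypothetical, not Oracle SES API.
public class DisplayUrlMapper {
    private final String fileUrlPrefix;
    private final String displayUrlPrefix;

    public DisplayUrlMapper(String fileUrlPrefix, String displayUrlPrefix) {
        this.fileUrlPrefix = fileUrlPrefix;
        this.displayUrlPrefix = displayUrlPrefix;
    }

    // Replace the file URL prefix with the display URL prefix.
    public String toDisplayUrl(String fileUrl) {
        if (fileUrl.startsWith(fileUrlPrefix)) {
            return displayUrlPrefix + fileUrl.substring(fileUrlPrefix.length());
        }
        return fileUrl; // no mapping applies
    }
}
```

With the prefixes from the example, file://localhost/home/operation/doc/file.doc maps to https://webhost/client/doc/file.doc.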

Modifying the Crawler Parameters

You can alter the crawler's operating parameters at two levels:

  • At the global level for all sources

  • At the source level for a particular defined source

Global parameters include the default values for language, crawling depth, and other crawling parameters, and the settings that control the crawler log and cache.

To configure the crawler: 

  1. Click the Global Settings tab.

  2. Under Sources, click Crawler Configuration.

  3. Make the desired changes on the Crawler Configuration page. Click Help for more information about the configuration settings.

  4. Click Apply.

To configure the crawling parameters for a specific source: 

  1. From the Home page, click the Sources secondary tab to see a list of sources you have created.

  2. Click the edit icon for the source whose crawler you want to configure, to display the Edit Source page.

  3. Click the Crawling Parameters subtab.

  4. Make the desired changes. Click Help for more information about the crawling parameters.

  5. Click Apply.

Note that the parameter values for a particular source can override the default values set at the global level. For example, for Web sources, Oracle SES sets a default crawling depth of 2, irrespective of the crawling depth you set at the global level.

Also note that some parameters are specific to a particular source type. For example, Web sources include parameters for HTTP cookies.

Overview of Crawler Settings

This section describes crawler settings and other mechanisms to control the scope of Web crawling:

See Also:

"Tuning the Crawl Performance" for more detailed information on these settings and other issues affecting crawl performance

Crawling Mode

For initial planning purposes, you might want the crawler to collect URLs without indexing. After crawling is finished, examine the document URLs and status, remove unwanted documents, and start indexing. The crawling mode is set on the Home - Schedules - Edit Schedules page.

Note:

If you are using a custom crawler created with the Crawler Plug-in API, then the crawling mode set here does not apply. The implemented plug-in controls the crawling mode.

These are the crawling mode options:

  • Automatically Accept All URLs for Indexing: This crawls and indexes all URLs in the source. For Web sources, it also extracts and indexes any links found in those URLs. If the URL has been crawled before, then it is reindexed only if it has changed.

  • Examine URLs Before Indexing: This crawls but does not index any URLs in the source. It also crawls any links found in those URLs.

  • Index Only: This crawls and indexes all URLs in the source. It does not extract any links from those URLs. In general, select this option for a source that has been crawled previously under "Examine URLs Before Indexing".

URL Boundary Rules

URL boundary rules limit the crawling space. When boundary rules are added, the crawler is restricted to URLs that match the indicated rules. The order in which rules are specified has no impact, but exclusion rules always override inclusion rules.

This is set on the Home - Sources - Boundary Rules page.

Inclusion Rules

Specify an inclusion rule that a URL contain, start with, or end with a term. Use an asterisk (*) to represent a wildcard. For example, www.*.example.com. Simple inclusion rules are case-insensitive. For case-sensitivity, use regular expression rules.

An inclusion rule ending with example.com limits the search to URLs ending with the string example.com. Anything ending with example.com is crawled, but http://www.example.com.tw is not crawled.

If the URL Submission functionality is enabled on the Global Settings - Query Configuration page, then URLs that are submitted by end users are added to the inclusion rules list. You can delete URLs that you do not want to index.

Oracle SES supports the regular expression syntax used in the Java JDK 1.4.2 Pattern class (java.util.regex.Pattern). Regular expression rules use special characters. The following is a summary of some basic regular expression constructs.

  • A caret (^) denotes the beginning of a URL and a dollar sign ($) denotes the end of a URL.

  • A period (.) matches any one character.

  • A question mark (?) matches zero or one occurrence of the preceding character.

  • An asterisk (*) matches zero or more occurrences of the preceding pattern. You can use an asterisk in the starts with, ends with, and contains rules.

  • A backslash (\) escapes any special characters, such as periods (\.), question marks (\?), or asterisks (\*).
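
The constructs above can be checked quickly with the java.util.regex.Pattern class itself, whose syntax Oracle SES supports for boundary rules. The helper class below is illustrative only.

```java
import java.util.regex.Pattern;

// Illustrative checks of the regular expression constructs listed above,
// using java.util.regex, the package whose syntax Oracle SES supports.
public class RegexConstructs {
    // True if the pattern occurs anywhere in the URL.
    static boolean find(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(find("^https", "https://example.com"));     // caret anchors the start
        System.out.println(find("\\.html$", "http://host/page.html")); // dollar anchors the end
        System.out.println(find("example\\.com", "exampleXcom"));      // escaped dot is a literal dot
    }
}
```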

See Also:

http://java.sun.com for a complete description in the Sun Microsystems Java documentation

Exclusion Rules

You can specify an exclusion rule that a URL contains, starts with, or ends with a term.

An exclusion of uk.example.com prevents the crawling of Example hosts in the United Kingdom.

Default Exclusion Rules

The crawler contains a default exclusion rule to exclude non-textual files. The following file extensions are included in the default exclusion rule.

  • Image: jpg, gif, tif, bmp, png

  • Audio: wav, mp3, wma

  • Video: avi, mpg, mpeg, wmv

  • Binary: bin, exe, so, dll, iso, jar, war, ear, tar, scm, cab, dmp

To crawl files with these extensions, modify the following section in the ORACLE_HOME/search/data/config/crawler.dat file, removing the corresponding file type suffix from the exclusion list.

# default file name suffix exclusion list
RX_BOUNDARY (?i:(?:\.jar)|(?:\.bmp)|(?:\.war)|(?:\.ear)|(?:\.mpg)|(?:\.wmv)|(?:\.mpeg)|(?:\.scm)|(?:\.iso)|(?:\.dmp)|(?:\.dll)|(?:\.cab)|(?:\.so)|(?:\.avi)|(?:\.wav)|(?:\.mp3)|(?:\.wma)|(?:\.bin)|(?:\.exe)|(?:\.iso)|(?:\.tar)|(?:\.png))$

Then add the MIMEINCLUDE parameter to the crawler.dat file to crawl those multimedia file types. When a multimedia file is crawled, the file name is indexed as the title.

For example, to crawl any audio files, remove .wav, .mp3, and .wma, and add the MIMEINCLUDE parameter:

RX_BOUNDARY (?i:(?:\.gif)|(?:\.jpg)|(?:\.jar)|(?:\.tif)|(?:\.bmp)|(?:\.war)|(?:\.ear)|(?:\.mpg)|(?:\.wmv)|(?:\.mpeg)|(?:\.scm)|(?:\.iso)|
(?:\.dmp)|(?:\.dll)|(?:\.cab)|(?:\.so)|(?:\.avi)|(?:\.bin)|(?:\.exe)|(?:\.iso)|(?:\.tar)|(?:\.png))$
MIMEINCLUDE audio/x-wav audio/mpeg 

Note:

Only the file name is indexed when crawling multimedia files, unless the file is crawled using a crawler plug-in that provides a richer set of attributes, such as the Image Document Service plug-in.

Example Using a Regular Expression

The following example uses several regular expression constructs that are not described earlier, including range quantifiers, non-grouping parentheses, and mode switches. For a complete description, see the Sun Microsystems Java documentation.

To crawl only HTTPS URLs in the example.com and examplecorp.com domains, and to exclude files ending in .doc and .ppt:

  • Inclusion: URL regular expression

    ^https://.*\.example(?:corp){0,1}\.com

  • Exclusion: URL regular expression (?i:\.doc|\.ppt)$
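
The two expressions above can be combined into a single crawl decision, as in this sketch; the class name and helper method are illustrative, not Oracle SES code.

```java
import java.util.regex.Pattern;

// Sketch of how the inclusion and exclusion expressions above behave
// when evaluated with java.util.regex; the class name is illustrative.
public class BoundaryRuleExample {
    static final Pattern INCLUDE =
        Pattern.compile("^https://.*\\.example(?:corp){0,1}\\.com");
    static final Pattern EXCLUDE = Pattern.compile("(?i:\\.doc|\\.ppt)$");

    // A URL is crawled when it matches the inclusion rule
    // and does not match the exclusion rule.
    static boolean isCrawled(String url) {
        return INCLUDE.matcher(url).find() && !EXCLUDE.matcher(url).find();
    }
}
```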

Document Types

You can customize which document types are processed for each source. By default, PDF, Microsoft Excel, Microsoft PowerPoint, Microsoft Word, HTML, and plain text documents are always processed.

To add or remove document types: 

  1. On the Home page, click the Sources secondary tab.

  2. Choose a source from the list and select Edit to display the Customize Source page.

  3. Select the Document Types subtab.

    The listed document types are supported for the source type.

  4. Move the types to process to the Processed list and the others to the Not Processed list.

  5. Click Apply.

Keep the following in mind about graphics file formats:

  • For graphics format files (JPEG, JPEG 2000, GIF, TIFF, DICOM), only the file name is searchable. The crawler does not extract any metadata from graphics files or make any attempt to convert graphical text into indexable text, unless you enable a document service plug-in. See "Configuring Support for Image Metadata".

Oracle SES allows up to 1000 files in zip files and LHA files. If there are more than 1000 files, then an error is raised and the file is ignored. See "Crawling Zip Files Containing Non-UTF8 File Names".

See Also:

Oracle Text Reference Appendix B for supported document types

Crawling Depth

Crawling depth is the number of levels to crawl Web and file sources. A Web document can contain links to other Web documents, which can contain more links. Specify the maximum number of nested links for the crawler to follow. Crawling depth starts at 0; that is, if you specify 1, then the crawler gathers the starting (seed) URL plus any document that is linked directly from the starting URL. For file crawling, this is the number of directory levels from the starting URL.

Set the crawling depth on the Home - Sources - Crawling Parameters page.
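
A minimal sketch of depth-limited traversal illustrates the counting rule (seed at depth 0). The adjacency map below is a hypothetical stand-in for "links found on a page"; none of this is Oracle SES code.

```java
import java.util.*;

// Illustrative depth-limited traversal matching the depth rule above:
// the seed is at depth 0, and a crawling depth of 1 gathers the seed
// plus pages linked directly from it. The link map is hypothetical.
public class DepthLimitedCrawl {
    static List<String> crawl(Map<String, List<String>> links,
                              String seed, int maxDepth) {
        List<String> visited = new ArrayList<>();
        Deque<String[]> queue = new ArrayDeque<>();
        queue.add(new String[] { seed, "0" });
        Set<String> seen = new HashSet<>();
        seen.add(seed);
        while (!queue.isEmpty()) {
            String[] entry = queue.poll();
            String url = entry[0];
            int depth = Integer.parseInt(entry[1]);
            visited.add(url);
            if (depth >= maxDepth) continue; // do not follow links any deeper
            for (String next : links.getOrDefault(url, List.of())) {
                if (seen.add(next)) {
                    queue.add(new String[] { next, String.valueOf(depth + 1) });
                }
            }
        }
        return visited;
    }
}
```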

Robots Exclusion

You can control which parts of your sites can be visited by robots. If robots exclusion is enabled (default), then the Web crawler traverses the pages based on the access policy specified in the Web server robots.txt file. The crawler also respects the page-level robot exclusion specified in HTML metatags.

For example, when a robot visits http://www.example.com/, it checks for http://www.example.com/robots.txt. If it finds it, then the crawler checks to see if it is allowed to retrieve the document. If you own the Web sites, then you can disable robots exclusions. However, when crawling other Web sites, always comply with robots.txt by enabling robots exclusion.

Set the robots parameter on the Home - Sources - Crawling Parameters page.

Index Dynamic Pages

By default, Oracle SES processes dynamic pages. Dynamic pages are generally served from a database application and have a URL that contains a question mark (?). Oracle SES identifies URLs with question marks as dynamic pages.

Some dynamic pages appear as multiple search results for the same page, and you might not want them all indexed. Other dynamic pages are each different and must be indexed. You must distinguish between these two kinds of dynamic pages. In general, dynamic pages that change only in menu expansion without affecting their contents should not be indexed.

Consider the following three URLs:

http://example.com/aboutit/network/npe/standards/naming_convention.html
 
http://example.com/aboutit/network/npe/standards/naming_convention.html?nsdnv=14z1
 
http://example.com/aboutit/network/npe/standards/naming_convention.html?nsdnv=14

The question marks (?) in the last two URLs indicate that the rest of each string consists of input parameters. The three results are essentially the same page with different side menu expansion. Ideally, the search yields only one result:

http://example.com/aboutit/network/npe/standards/naming_convention.html
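
The question-mark test and query-string removal can be sketched as follows, which shows why the three URLs above collapse to a single page. The class is illustrative, not Oracle SES code.

```java
// Illustrative sketch, not Oracle SES code: Oracle SES identifies a URL
// as a dynamic page when it contains a question mark. Stripping the
// query string shows why the three URLs above are the same page.
public class DynamicUrlCheck {
    static boolean isDynamic(String url) {
        return url.indexOf('?') >= 0;
    }

    // Drop everything from the question mark onward.
    static String stripQuery(String url) {
        int q = url.indexOf('?');
        return q < 0 ? url : url.substring(0, q);
    }
}
```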

Note:

The crawler cannot crawl and index dynamic Web pages written in JavaScript.

Set the dynamic pages parameter on the Home - Sources - Crawling Parameters page.

URL Rewriter API

The URL Rewriter is a user-supplied Java module that implements the Oracle SES UrlRewriter interface. The crawler uses it to filter or rewrite extracted URL links before they are put into the URL queue. The API gives you complete control over which links extracted from a Web page are kept and which are discarded.

URL filtering removes unwanted links, and URL rewriting transforms the URL link. This transformation is necessary when access URLs are used and alternate display URLs must be presented to the user in the search results.

Set the URL rewriter on the Home - Sources - Crawling Parameters page.
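
The filter-and-rewrite flow can be illustrated with the following sketch. It does not implement the actual UrlRewriter interface; see the Oracle SES API documentation for the real signature, and note that the host names below are hypothetical.

```java
// Conceptual sketch of URL filtering and rewriting; this does NOT
// implement the real Oracle SES UrlRewriter interface, and the host
// names are hypothetical.
public class RewriteSketch {
    // Return null to discard a link (filtering), or a transformed
    // link (rewriting), e.g. mapping an access URL to a display URL.
    static String process(String link) {
        if (link.endsWith(".tmp")) {
            return null; // filter: drop unwanted links
        }
        if (link.startsWith("http://internal.example.com/")) {
            // rewrite: present the external mirror to end users
            return "https://www.example.com/"
                 + link.substring("http://internal.example.com/".length());
        }
        return link;
    }
}
```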

Title Fallback

You can override a default document title with a meaningful title if the default title is irrelevant. For example, suppose that the result list shows numerous documents with the title "Daily Memo". The documents had been created with the same template file, but the document properties had not been changed. Overriding this title in Oracle SES can help users better understand their search results.

Title fallback can be used for any source type. Oracle SES uses different logic for each document type to determine which fallback title to use. For example, for HTML documents, Oracle SES looks for the first heading, such as <h1>. For Microsoft Word documents, Oracle SES looks for text with the largest font.

If the default title was collected in the initial crawl, then the fallback title is only used after the document is reindexed during a re-crawl. This means if there is no change to the document, then you must force the change by setting the re-crawl policy to Process All Documents on the Home - Schedules - Edit Schedule page.

This feature is not currently supported in the Oracle SES Administration GUI. Override a default document title with a meaningful title by adding the keyword BAD_TITLE to the ORACLE_HOME/search/data/config/crawler.dat file. For example:

BAD_TITLE Daily Memo

where Daily Memo is the title string to be overridden. The title string is case-insensitive and can contain multibyte characters in the UTF8 character set.

You can specify multiple bad titles, each one on a separate line.
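
The case-insensitive matching of bad titles can be sketched as follows; this is an illustration of the behavior described above, not the crawler's implementation.

```java
import java.util.List;

// Sketch of the case-insensitive bad-title check described above;
// the class is illustrative, not the crawler's implementation.
public class BadTitleCheck {
    static boolean isBadTitle(List<String> badTitles, String title) {
        for (String bad : badTitles) {
            if (bad.equalsIgnoreCase(title)) {
                return true; // fall back to a derived title instead
            }
        }
        return false;
    }
}
```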

Special Considerations with Title Fallback

  • With Microsoft Office documents:

    • Font sizes 14 and 16 in Microsoft Word correspond to normalized font sizes 4 and 5 (respectively) in converted HTML. The Oracle SES crawler only picks up strings with normalized font size greater than 4 as the fallback title.

    • Titles should contain more than five characters.

  • When a title is null, Oracle SES automatically indexes the fallback title for all binary documents (for example, .doc, .ppt, .pdf). For HTML and text documents, Oracle SES does not automatically index the fallback title. This means that the replaced title on HTML or text documents cannot be searched with the title attribute on the Advanced Search page. You can turn on indexing for HTML and text documents in the crawler.dat file. For example, set NULL_TITLE_FALLBACK_INDEX ALL.

  • The crawler.dat file is not included in the backup available on the Global Settings - Configuration Data Backup and Recovery page. Ensure you manually back up the crawler.dat file.

Character Set Detection

This feature enables the crawler to automatically detect character set information for HTML, plain text, and XML files. Character set detection allows the crawler to properly cache files during crawls, index text, and display files for queries. This is important when crawling multibyte files (such as files in Japanese or Chinese).

This feature is not currently supported in the Oracle SES Administration GUI, and by default, it is turned off. Enable automatic character set detection by adding a line in the crawler configuration file: ORACLE_HOME/search/data/config/crawler.dat. For example, add the following as a new line:

AUTO_CHARSET_DETECTION

You can check whether this is turned on or off in the crawler log under the "Crawling Settings" section.

Special Considerations with Automatic Character Set Detection

  • To crawl XML files for a source, be sure to add XML to the list of processed document types on the Home - Source - Document Types page. XML files are currently treated as HTML format, and detection for XML files may not be as accurate as for other file formats.

  • The crawler.dat file is not included in the backup available on the Global Settings - Configuration Data Backup and Recovery page. Ensure that you manually back up the crawler.dat file.

Language Detection

With multibyte files, besides turning on character set detection, be sure to set the Default Language parameter. For example, if the files are all in Japanese, then select Japanese as the default language for that source. If automatic language detection is disabled, or if the crawler cannot determine the document language during crawling, then the crawler assumes that the document is written in the default language.

If your files are in multiple languages, then turn on the Enable Language Detection parameter. Not all documents retrieved by the crawler specify the language. For documents with no language specification, the crawler attempts to automatically detect language. The language recognizer is trained statistically using trigram data from documents in various languages (for instance, Danish, Dutch, English, French, German, Italian, Portuguese, and Spanish). It starts with the hypothesis that the given document does not belong to any language and ultimately refutes this hypothesis for a particular language where possible. It operates on Latin-1 alphabet and any language with a deterministic Unicode range of characters (like Chinese, Japanese, Korean, and so on).

The crawler determines the language code by checking the HTTP header content-language or the LANGUAGE column, if it is a table source. If it cannot determine the language, then it takes the following steps:

  • If the language recognizer is not available or if it cannot determine a language code, then the default language code is used.

  • If the language recognizer is available, then the output from the recognizer is used.

  • Oracle Text MULTI_LEXER is the only lexer used for Oracle Secure Enterprise Search.
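
The resolution order described above can be sketched as a simple fallback chain; the method and parameter names are illustrative, and any argument may be null when that piece of information is unavailable.

```java
// Sketch of the language-resolution order described above; the names
// are illustrative, not Oracle SES code.
public class LanguageResolver {
    static String resolve(String headerLang, String recognizerLang,
                          String defaultLang) {
        if (headerLang != null) {
            return headerLang;      // content-language header or LANGUAGE column
        }
        if (recognizerLang != null) {
            return recognizerLang;  // output of the language recognizer
        }
        return defaultLang;         // fall back to the Default Language setting
    }
}
```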

The Default Language and the Enable Language Detection parameters are on the Global Settings - Crawler Configuration page (globally) and also the Home - Sources - Crawling Parameters page (for each source).

Note:

For file sources, the individual source setting for Enable Language Detection remains false regardless of the global setting. In most cases, the language of a file source is uniform and should be set with the Default Language setting.

Cache Directory

For sources created before Oracle SES 11g, the document cache remains in the cache directory. Sources are not stored in Secure Cache in the database until they are migrated to use Secure Cache. You can manage the cache directory for these older sources the same as in earlier releases.

Deleting the Secure Cache

You can manage the Secure Cache either on the global level or at the data source level. The data source configuration supersedes the global configuration.

The cache is preserved by default and supports the Cached link feature on the search result page. If you do not use the Cached link, then you can delete the cache, either for specific sources or globally for all of them. Without a cache, the Cached link on a search result page returns a File not found error.

To delete the cache for all sources: 

  1. Select the Global Settings tab in the Oracle SES Administration GUI.

  2. Choose Crawler Configuration.

  3. Set Preserve Document Cache to No.

  4. Click Delete Cache Now to remove the cache from all sources, except any that are currently active under an executing schedule. The cache is deleted in the background, and you do not have to wait for it to complete.

  5. Click Apply.

To delete the cache for an individual source: 

  1. Select the Sources secondary tab on the Home page.

  2. Click Edit for the source.

  3. Click the Crawling Parameters subtab.

  4. Set Preserve Document Cache to No.

  5. Click Apply.

Overview of XML Connector Framework

Oracle SES provides an XML connector framework to crawl any repository that provides an XML interface to its contents. The connectors for Oracle Content Server, Oracle E-Business Suite 12, and Siebel 8 use this framework.

Every document in a repository is known as an item. An item contains information about the document, such as author, access URL, last modified date, security information, status, and contents.

A set of items is known as a feed or channel. To crawl a repository, an XML document must be generated for each feed. Each feed is associated with information such as feed name, type of the feed, and number of items.

To crawl a repository with the XML connector, place data feeds in a location accessible to Oracle SES over one of the following protocols: HTTP, FTP, or file. Then generate an XML Configuration File that contains information such as feed location and feed type. Create a source with a source type that is based on this XML connector and trigger the crawl from Oracle SES to crawl the feeds.

There are two types of feeds: directory feeds and control feeds.

Guidelines for the target repository generating the XML feeds:

Example Using the XML Connector

The courses in the Oracle E-Business Suite Learning Management application can be crawled and indexed so that users can readily search the courses offered, their locations, and other details pertaining to the courses.

To crawl and index courses in Oracle E-Business Suite Learning Management: 

  1. Generate an XML feed containing the courses. Each course can be an item in the feed. The properties of the course, such as location and instructor, can be set as attributes of the item.

  2. Move the feed to a location accessible to Oracle SES through HTTP, FTP, or file protocol.

  3. Generate a control file that points to that feed.

  4. Generate a configuration file that points to the control feed. Specify the feed type as control, the URL of the control feed, and the source name in the configuration file.

  5. Create an Oracle E-Business Suite 12 source in Oracle SES, specifying in the source parameters the location of the configuration file and the user ID and password used to access the feed.

XML Configuration File

The configuration file is an XML file conforming to a set schema.

The following is an example of a configuration file to set up an XML-based source:

<rsscrawler xmlns="http://xmlns.oracle.com/search/rsscrawlerconfig">  
     <feedLocation>ftp://my.host.com/rss_feeds</feedLocation>
     <feedType>directoryFeed</feedType>
     <errorFileLocation>/tmp/errors</errorFileLocation>
     <securityType>attributeBased</securityType> 
     <sourceName>Contacts</sourceName>
     <securityAttribute name="EMPLOYEE_ID" grant="true"/> 
</rsscrawler> 

Where

  • feedLocation is one of the following:

    • URL of the directory, if the data feed is a directory feed

      This URL should be the FTP URL or the file URL of the directory where the data feeds are located. For example:

      ftp://example.domain.com/relativePathOfDirectory
      file://example.domain.com/c:\dir1\dir2\dir3
      file://example.domain.com//private/home/dir1/dir2/dir3 
      

      Use a file URL if the data feeds are available on the same computer as Oracle SES. The path specified in the URL should be the absolute path of the directory.

      Use an FTP URL to access data feeds on any other computer. The path of the directory in the URL can be absolute or relative. An absolute path is specified following the slash (/) after the host name in the URL. A relative path is specified relative to the home directory of the user account used to access the FTP feeds.

      The user ID used to crawl the source should have write permission on the directory, so that the data feeds can be deleted after the crawl.

    • URL of the control file, if the data feed is a control feed

      This URL can be HTTP, HTTPS, file, or FTP URL. For example:

      http://example.com:7777/context/control.xml
      

      The path in FTP and file protocols can be absolute or relative.

  • feedType indicates the type of feed. Valid values are directoryFeed, controlFeed, and dataFeed.

  • errorFileLocation (optional) specifies the directory where status feeds should be uploaded.

    A status feed is generated to indicate the status of the processing feed. This status feed is named data_feed_file_name.suc or data_feed_file_name.err depending on whether the processing was successful. Any errors encountered are listed in the error status feed. If a value is specified for this parameter, then the status feed is uploaded to this location. Otherwise, the status feed is uploaded to the same location as the data feed.

    The user ID used to access the data feed should have write permission on the directory.

    If feedLocation is an HTTP URL, then errorFileLocation also should be an HTTP URL, to which the status feeds are posted. If no value is specified for errorFileLocation, then the status feeds are posted to the URL given in feedLocation.

    If an error occurs while processing a feed available over file or FTP protocol, then the erroneous feed is renamed filename.prcsdErr in the same directory.

  • sourceName (optional) specifies the name of the source.

  • securityType (optional) specifies the security type. Valid values are the following:

    • noSecurity: There is no security information associated with this source at the document level. This is the default value.

    • identityBased: Identity-based security is used for documents in the feed.

    • attributeBased: Attribute-based security is used for documents in the feed. With this security model, security attributes should be specified in the securityAttribute tag, and the values for these attributes should be specified for each document.

  • securityAttribute specifies attribute-based security. One or more tags of this type should be specified, and each tag should contain the following attributes:

    • name: Name of the security attribute.

    • grant: Boolean parameter indicating whether this is a grant/deny attribute. The security attribute is considered a grant attribute if the value is true and a deny attribute if the value is false.
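
A configuration file like the example above can be read with the standard JAXP DOM API, as in this illustrative sketch; this is not how Oracle SES itself parses the file.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

// Sketch of reading configuration elements such as feedLocation and
// feedType with the standard JAXP DOM API; illustrative only.
public class ConfigReader {
    // Return the text content of the first element with the given tag,
    // or null if the element is absent or the XML is malformed.
    static String element(String xml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName(tag);
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
        } catch (Exception e) {
            return null; // malformed XML
        }
    }
}
```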

Configuring Support for Image Metadata

The Oracle SES crawler is initially set to search only text files. You can change this behavior by configuring an image document service connector to search the metadata associated with image files. Image files can contain rich metadata that provides additional information about the image itself.

The Image Document Service connector integrates Oracle Multimedia (formerly Oracle interMedia) images with Oracle SES. This connector is separate from any specific data source.

The following table identifies the metadata formats (EXIF, IPTC, XMP, DICOM) that can be extracted from each supported image format (JPEG, TIFF, GIF, JPEG 2000, DICOM).


        JPEG   TIFF   GIF    JPEG 2000   DICOM
EXIF    Yes    Yes    No     No          No
IPTC    Yes    Yes    No     No          No
XMP     Yes    Yes    Yes    Yes         No
DICOM   No     No     No     No          Yes

See Also:

Oracle Multimedia User's Guide and Oracle Multimedia Reference for more information about image metadata

Identifying the Search Attributes for Image Metadata

Image files can contain metadata in multiple formats, but not all of it is useful when performing searches. A configuration file in Oracle SES enables you to control the metadata that is searched and published to an Oracle SES Web application.

The default configuration file is named attr-config.xml. Note that if you upgraded from a previous release, then the default configuration file remains ordesima-sample.xml.

You can either modify the default configuration file or create your own file. The configuration file must be located at ORACLE_HOME/search/lib/plugins/doc/ordim/config/. Oracle recommends that you create a copy of the default configuration file before editing it. Note that the configuration file must conform to the XML schema ORACLE_HOME/search/lib/plugins/doc/ordim/xsd/ordesima.xsd.

Oracle SES indexes and searches only those image metadata tags that are defined within the metadata element (between <metadata>...</metadata>) in the configuration file. By default, the configuration file contains a set of the most commonly searched metadata tags for each of the file formats. You can add other metatags to the file based on your specific requirements.

Image files can contain metadata in multiple formats. For example, an image can contain metadata in the EXIF, XMP, and IPTC formats. An exception to this are DICOM images, which contain only DICOM metadata. Note that for IPTC and EXIF formats, Oracle Multimedia defines its own image metadata schemas. The metadata defined in the configuration file must conform to the Oracle Multimedia defined schemas.

Because different metadata formats use different tags to refer to the same attribute, it is necessary to map metatags and the search attributes they define. Table 4-1 lists some of the commonly used metatags and how they are mapped in Oracle SES.

Table 4-1 Metatag Mapping

Oracle SES Attribute Name | Oracle SES Predefined Name | EXIF Metatag     | IPTC Metatag  | XMP Metatag
------------------------- | -------------------------- | ---------------- | ------------- | --------------------------
Author                    | Author                     | Artist           | Author        | photoshop:Creator
AuthorTitle               | X                          | X                | AuthorTitle   | photoshop:AuthorsPosition
Description               | Description                | ImageDescription | Caption       | dc:Description
Title                     | Title                      | X                | ObjectName    | dc:Title
DescriptionWriter         | X                          | X                | captionWriter | photoshop:CaptionWriter
Headline1                 | Headline1                  | X                | Headline      | photoshop:Headline
Category                  | X                          | X                | Category      | photoshop:Category
Scene                     | X                          | X                | X             | Iptc4xmpCore:Scene
Publisher                 | X                          | X                | X             | dc:Publisher
Source                    | X                          | X                | Source        | photoshop:Source
Copyright                 | X                          | Copyright        | Copyright     | dc:rights
Keywords                  | Keywords                   | X                | Keyword       | dc:subject
Provider                  | X                          | X                | Credit        | photoshop:Credit
City                      | X                          | X                | City          | photoshop:City
State                     | X                          | X                | provinceState | photoshop:State
Country                   | X                          | X                | Country       | photoshop:Country
Location                  | X                          | X                | location      | Iptc4xmpCore:Location
EquipmentMake             | X                          | Make             | X             | tiff:Make
EquipmentModel            | X                          | Model            | X             | tiff:Model

An X indicates that no corresponding metatag exists in that format.

Oracle SES provides this mapping in the configuration file attr-config.xml. You can edit the file to add other metatags. Oracle recommends that you make a copy of the original configuration file before editing the settings. The configuration file defines the display name of a metatag and how it is mapped to the corresponding metadata in each of the supported formats.

This is done using the searchAttribute tag, as shown in the example below:

<searchAttribute>
 <displayName>Author</displayName>
 <metadata>
   <value format="iptc">byline/author</value>
   <value format="exif">TiffIfd/Artist</value>
   <value format="xmp">dc:creator</value>
   <value format="xmp">tiff:Artist</value>
 </metadata>
</searchAttribute>

For each search attribute, the value of displayName is an Oracle SES attribute name that is displayed in the Oracle SES web application when an Advanced Search - Attribute Selection is performed. If any of the listed attributes are detected during a crawl, then Oracle SES automatically publishes the attributes to the SES web application.

For the element value, the format attribute must be one of the supported formats: iptc, exif, xmp, or dicom.

The value defined within the element, for example, byline/author, is the XML path when the image format is IPTC, EXIF, or XMP. For DICOM, this value must be the standard tag number or value locator.

For IPTC and EXIF formats, the XML path must conform to the metadata schemas defined by Oracle Multimedia. These schemas are defined in the files ordexif.xsd and ordiptc.xsd located at ORACLE_HOME/search/lib/plugins/doc/ordim/xsd/.

You do not need to specify the root elements defined in these schemas (iptcMetadata, exifMetadata) in the configuration file. For example, you can specify byline/author as the xmlPath value of the author attribute in IPTC format. Oracle Multimedia does not define XML schemas for XMP metadata, so refer to the Adobe XMP specification for the xmlPath value.

Within the <searchAttribute> tag, you can also specify an optional <dataType> tag if the attribute carries a date or numerical value. For example,

<searchAttribute>
     <displayName>AnDateAttribute</displayName>
     <dataType>date</dataType>
     <metadata>
        ...
     </metadata>
</searchAttribute>
   

The default data type is string, so you do not have to explicitly specify a string.

Supporting XMP Metadata

Oracle SES supports both standard and custom XMP metadata searches. Because all XMP properties share the same parent elements <rdf:rdf><rdf:description>, you must specify only the actual property schema and property name in the configuration file. For example, specify photoshop:category instead of rdf:rdf/rdf:description/photoshop:category. The same rule also applies to custom XMP metadata. However, for XMP structured data, you must specify the structure element in the format parent/child 1/child 2/…child N, where child N is a leaf node. For example, Iptc4xmpCore:CreatorContactInfo/Iptc4xmpCore:CiPerson. Note that the image plug-in does not validate the metadata value for XMP metadata.

XMP metatags consist of two components separated by a colon (:). For example, in photoshop:Creator, which corresponds to the Author attribute (see Table 4-1), photoshop refers to the XMP schema namespace. Other common namespaces include dc, tiff, and Iptc4xmpCore.

Before defining any XMP metadata in the configuration file, you must ensure that the namespace is defined. For example, before defining the metadata photoshop:Creator, you must include the namespace photoshop in the configuration file. This rule applies to both the standard and custom XMP metadata namespaces. As a best practice, Oracle recommends that you define all the namespaces at the beginning of the configuration file. If the namespace defined in the configuration file is different from the one in the image, then Oracle SES cannot find the attributes associated with this namespace. You can define namespaces as shown:

<xmpNamespaces>
<namespace prefix="Iptc4xmpCore">http://iptc.org/std/Iptc4xmpCore/1.0/xmlns/</namespace>
<namespace prefix="dc">http://purl.org/dc/elements/1.1/</namespace>
<namespace prefix="photoshop">http://ns.adobe.com/photoshop/1.0/</namespace>
<namespace prefix="xmpRights">http://ns.adobe.com/xap/1.0/rights/</namespace>
<namespace prefix="tiff">http://ns.adobe.com/tiff/1.0/</namespace>
</xmpNamespaces>

Note that the Adobe XMP Specification requires that XMP namespaces end with a slash (/) or hash (#) character.

See Also:

Adobe Extensible Metadata Platform (XMP) Specification for the XMP metadata schema and a list of standard XMP namespace values:

http://www.adobe.com/devnet/xmp/pdfs/xmp_specification.pdf

Custom XMP metadata must be explicitly added to attr-config.xml. An example of a custom metadata definition is:

<xmpNamespaces>
  <namespace prefix="hm">http://www.oracle.com/ordim/hm/</namespace>
</xmpNamespaces>
<searchAttribute>
  <displayName>CardTitle</displayName>
  <metadata>
    <value format="xmp">hm:cardtitle</value>
  </metadata>
</searchAttribute>

Supporting DICOM Metatags

Oracle SES 11g supports DICOM metatags, and these metatags are available in the default configuration file attr-config.xml. Note that the configuration file ordesima-sample.xml, which is the default configuration file if you upgraded from a previous release, does not contain DICOM metatags. Therefore, you must manually add DICOM metatags to the ordesima-sample.xml file. To do this, you can copy the DICOM metatags from attr-config.xml, which is available in the same directory. You can also reference the DICOM standard and add additional DICOM tags.

DICOM metatags are either DICOM standard tags or DICOM value locators.

DICOM Standard Tags

DICOM standard tags are 8-digit hexadecimal numbers, represented in the format ggggeeee where gggg specifies the group number and eeee specifies the element number. For example, the DICOM standard tag for the attribute performing physician's name is represented using the hexadecimal value 00081050.

Note that the group number gggg must be an even value, except for 0000, 0002, 0004, and 0006, which are reserved group numbers.

The DICOM standard defines over 2000 standard tags.
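These format rules can be illustrated with a short Python sketch. This validator is hypothetical, not part of Oracle SES, and checks only the ggggeeee shape described above:

```python
RESERVED_GROUPS = {"0000", "0002", "0004", "0006"}

def is_valid_dicom_tag(tag: str) -> bool:
    """Check the ggggeeee format: 8 hex digits, even group number, group not reserved."""
    if len(tag) != 8:
        return False
    try:
        int(tag, 16)                     # all 8 digits must be hexadecimal
    except ValueError:
        return False
    group = tag[:4]
    return int(group, 16) % 2 == 0 and group not in RESERVED_GROUPS

is_valid_dicom_tag("00081050")   # performing physician's name: valid
is_valid_dicom_tag("00091050")   # odd group number: invalid
```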

The file attr-config.xml contains a list of predefined DICOM standard metatags. You can add new metatags to the file as shown in the following example:

<searchAttribute>
      <displayName>PerformingPhysicianName</displayName>
      <metadata>
        <value format="dicom">00081050</value>       
      </metadata>
</searchAttribute>

Note:

The image connector does not support SQ, UN, OW, OB, and OF data type tags. Therefore, do not define such tags in the configuration file.

See Also:

http://medical.nema.org for more information about the standard tags defined in DICOM images, and the rules for defining metatags
DICOM Value Locators

Value locators identify an attribute in the DICOM content, either at the root level or from the root level down.

A value locator contains one or more sublocators and a tag field (optional). A typical value locator is of the format:

sublocator#tag_field

Or of the format:

sublocator

Each sublocator represents a level in the tree hierarchy. DICOM value locators can include multiple sublocators, depending on the level of the attribute in the DICOM hierarchy. Multiple sublocators are separated by the dot character (.). For example, value locators can be of the format:

sublocator1.sublocator2.sublocator3#tag_field

Or of the format:

sublocator1.sublocator2.sublocator3

A tag_field is an optional string that identifies a derived value within an attribute. A tag that contains this string must be the last tag of a DICOM value locator. The default is NONE.

A sublocator consists of a tag element and can contain other optional elements. These optional elements include definer and item_num. Thus, a sublocator can be of the format:

tag

Or it can be of the format:

tag(definer)[item_num]

Table 4-2 Sub Components of a Sublocator

tag: A DICOM standard tag, represented as an 8-digit hexadecimal number.

definer: A string that identifies the organization creating the tag. For tags defined by the DICOM standard, the default value (which can be omitted) is DICOM. Note that Oracle SES supports DICOM standard tags only; it does not support private tags.

item_num: An integer that identifies a data element within an attribute, or a wildcard character ("*") that identifies all data elements within an attribute. This optional component defaults to 1, the first data element of an attribute.
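The locator grammar above (dot-separated sublocators with an optional #tag_field) can be sketched as a small parser. The function name is illustrative, not part of the Oracle SES API:

```python
def parse_value_locator(locator: str):
    """Split a DICOM value locator into its sublocators and optional tag_field."""
    body, _, tag_field = locator.partition("#")
    sublocators = body.split(".")              # one entry per level of the DICOM tree
    return sublocators, (tag_field or "NONE")  # tag_field defaults to NONE

# parse_value_locator("00100010#UnibyteFamily") -> (["00100010"], "UnibyteFamily")
# parse_value_locator("00081084.00080100")      -> (["00081084", "00080100"], "NONE")
```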


The following example shows how to add a value locator to the attr-config.xml file:

<searchAttribute>
  <displayName>PatientFamilyName</displayName>
  <metadata>
  <value format="dicom">00100010#UnibyteFamily</value>       
  </metadata>
</searchAttribute>

where UnibyteFamily is a tag_field that extracts the unibyte family name component of the person name attribute.

The following example shows how to define a value locator from the root level.

<searchAttribute>
      <displayName>AdmittingDiagnosisCode</displayName>
      <metadata>
        <value format="dicom">00081084.00080100</value>       
      </metadata>
</searchAttribute>
<searchAttribute>
      <displayName>AdmittingDiagnosis</displayName>
      <metadata>
        <value format="dicom">00081084.00080104</value>
      </metadata>
</searchAttribute>

In this example, the tag 00081084 represents the root tag Admitting Diagnoses Code Sequence. This tag includes four child tags: code value (0008,0100), coding scheme designator (0008,0102), coding scheme version (0008,0103), and code meaning (0008,0104). The example defines value locators for code value (00081084.00080100) and code meaning (00081084.00080104).

Note:

The image connector does not support SQ, UN, OW, OB, and OF data type value locators. Therefore, ensure that the last sublocator of a value locator does not specify such data types.

See Also:

Oracle Multimedia DICOM Developer's Guide for more information about DICOM value locators

Example: Adding an Attribute to the Default attr-config.xml File

To search for information about image caption writer:

  1. Open Oracle SES Administration GUI and create the DescriptionWriter attribute:

    Specify DescriptionWriter as an Oracle SES attribute name (shown on the Advanced Search - Attribute Selection page).

  2. Examine the following sources for information relevant to modifying the default attr-config.xml file:

    • Oracle Multimedia IPTC schema at ORACLE_HOME/search/lib/plugins/doc/ordim/xsd/ordiptc.xsd. The IPTC metadata for image caption writer is shown as captionWriter.

    • Adobe XMP Specification for XMP Metadata. The XMP path for this property is defined as photoshop:CaptionWriter.

    • Oracle Multimedia EXIF schema. There is no caption writer metadata in EXIF.

  3. Add the following section to attr-config.xml:

    <searchAttribute>
       <displayName>DescriptionWriter</displayName>
       <metadata>
           <value format="iptc">captionWriter</value>
           <value format="xmp">photoshop:CaptionWriter</value>
       </metadata>
    </searchAttribute>
    
  4. If the photoshop XMP namespace is not registered in the configuration file, then add the namespace element to xmpNamespaces as shown here:

    <xmpNamespaces>
       <namespace prefix="photoshop">http://ns.adobe.com/photoshop/1.0/</namespace>
       ... existing namespaces ...
    </xmpNamespaces>
    

Creating an Image Document Service Connector

A default Image Document Service connector instance is created during the installation of Oracle SES. You can configure the default connector or create a new one.

To create an Image Document Service instance: 

  1. In the Oracle SES Administration GUI, click Global Settings.

  2. Under Sources, click Document Services to display the Global Settings - Document Services page.

  3. To configure the default image service instance:

    1. Click Expand All

    2. Click Edit for the default image service instance.

    or

    To create a new image service instance:

    1. Click Create to display the Create Document Service page.

    2. For Select From Available Managers, choose ImageDocumentService. Provide a name for the instance.

  4. Provide a value for the attributes configuration file parameter.

    The default value of attributes configuration file is attr-config.xml. The file is located at ORACLE_HOME/search/lib/plugins/doc/ordim/config/, where ORACLE_HOME refers to ORACLE_BASE/seshome, the directory that stores the Oracle SES-specific components. If you create a new configuration file, then you must place it in the same default location.

  5. Click Apply.

  6. Click Document Services in the locator links to return to the Document Services page.

  7. Add the Image Document Service plug-in to either the default pipeline or a new pipeline.

To add the default Image Document Service plug-in to the default pipeline: 

  1. Under Document Service Pipelines, click Edit for the default pipeline.

  2. Move the Image Document Service instance from Available Services to Used in Pipeline.

  3. Click Apply.

To create a new pipeline for the default Image Document Service plug-in: 

  1. Under Document Service Pipelines, click Create to display the Create Document Service Pipeline page.

  2. Enter a name and description for the pipeline.

  3. Move the Image Document Service instance from Available Services to Used in Pipeline.

  4. Click Create.

Using the Image Document Service Connector

You must either create a source to use the connector or enable the connector for an existing source.

To enable the connector for an existing source: 

  1. Click Sources on the Home page.

  2. Click the Edit icon for the desired source.

  3. Click Crawling Parameters.

  4. Select the pipeline that uses the Image Document Service and enable the pipeline for this source.

  5. Click Document Types. From the Not Processed column, select the image types to search and move them to the Processed column. The following image types are supported: JPEG, JPEG 2000, GIF, TIFF, and DICOM.

Searching Image Metadata

You can search image metadata from either the Oracle SES Basic Search page or the Advanced Search - Attribute Selection page.

For Basic Search, Oracle SES searches all the metadata defined in the configuration file for each supported image document (JPEG, TIFF, GIF, JPEG 2000, and DICOM). It returns the image document if any matching metadata is found.

Advanced Search enables you to search one or more specified attributes. It also supports basic operations for date and number attributes. Oracle SES returns only those image documents that contain the specified metadata.

Note that Oracle SES does not display the Cache link for image search results.

Troubleshooting the Image Document Service Connector

If the Image Document Service Connector fails, then check the following:

  • Is the pipeline with an Image Document Service connector instance enabled for the source?

  • Are the image types added to the source?

  • For a web source, are the correct MIME types included in the HTTP server configuration file?

    For example, if you use Oracle Application Server, then check the file ORACLE_HOME/Apache/Apache/conf/mime.types. If the following media types are missing, then add them:

    MIME Type Extensions
    image/jp2 jp2
    application/dicom dcm

  • If a connection is established, and all the image files are not crawled, then check whether the recrawl policy is set to Process Documents That Have Changed. If so, change this to Process All Documents.

    To do this, go to Home - Schedules, and under Crawler Schedules, click Edit for the specific source. This opens the Edit Schedule page. Under Update Crawler Recrawl Policy, select Process All Documents.

    Note that you can change the recrawl policy back to Process Documents That Have Changed, after the crawler has finished crawling all the documents in the new source.

Overview of Attributes

Each source has its own set of document attributes. Document attributes, like metadata, describe the properties of a document. The crawler retrieves the attribute values and maps them to search attributes. This mapping lets users search documents based on their attributes. Document attributes in different sources can be mapped to the same search attribute, so users can search documents from multiple sources based on the same search attribute.

After you crawl a source, you can see the attributes for that source. Document attribute information is obtained differently depending on the source type.

See Also:

Document attributes can be used in tasks such as document management, access control, or version control. Different sources can use different attribute names for the same concept; for example, version and revision. A source can also use the same attribute name for a different concept; for example, language might mean natural language in one source and programming language in another.

Oracle SES has several default search attributes. They can be incorporated in search applications for a more detailed search and richer presentation.

Search attributes are defined in the following ways:

Attributes For Different Source Types

Table and database sources have no predefined attributes. The crawler collects attributes from columns defined during source creation. You must map the columns to the search attributes.

For Siebel 7.8 sources, specify the attributes in the query while creating the source. For Oracle E-Business Suite and Siebel 8 sources, specify the attributes in the XML data feed.

For many source types, such as OracleAS Portal, e-mail, NTFS, and Microsoft Exchange sources, the crawler picks up key attributes offered by the target systems. For other sources, such as Documentum eRoom or Lotus Notes, an Attribute list parameter is in the Home - Sources - Customize User-Defined Source page. Any attributes that you define are collected by the crawler and available for search.

Using Lists of Values for Search Attributes

The list of values (LOV) for a search attribute can help you specify a search. Global search attributes can be specified on the Global Settings - Search Attributes page. For user-defined sources where LOV information is supplied through a crawler plug-in, the crawler registers the LOV definition. Use the Oracle SES Administration GUI or the crawler plug-in to specify attribute LOVs, attribute value, attribute value display name, and its translation.

When multiple sources define the LOV for a common attribute, such as title, the user sees all the possible values for the attribute. When the user restricts search within a particular source group, only LOVs provided by the corresponding sources in the source group are shown.

LOVs can be collected automatically. The following example shows Oracle SES collecting LOV values to crawl a fictitious URL.

  1. Create a Web source with http://www.example.com as the starting URL. Do not start crawling yet.

  2. From the Global Settings - Search Attributes page, select the Attribute for Oracle SES to collect LOVs and click Manage Lov. (For example, click Manage Lov for Author.)

  3. Select Source-Specific for the created source, and click Apply.

  4. Click Update Policy.

  5. Choose Document Inspection and click Update, then click Finish.

  6. From the Home - Schedules page, start crawling the Web source. After crawling, the LOV button in the Advanced Search page shows the collected LOVs.

System-Defined Search Attributes

There are also two system-defined search attributes, Urldepth and Infosource Path.

Urldepth measures the number of levels down from the root directory. It is derived from the URL string. In general, the depth is the number of slashes, not counting the slash immediately following the host name or a trailing slash. An adjustment of -2 is made to home pages. An adjustment of +1 is made to dynamic pages, such as the example in Table 4-3 with the question mark in the URL.

Urldepth is used internally for calculating relevance ranking, because a URL with a smaller URL depth is typically more important.

Table 4-3 lists the Urldepth of some example URLs.

Table 4-3 Depth of Example URLs

URL                                                                  | Urldepth
---------------------------------------------------------------------|---------
http://example.com/portal/page/myo/Employee_Portal/MyCompany         | 4
http://example.com/portal/page/myo/Employee_Portal/MyCompany/        | 4
http://example.com/portal/page/myo/Employee_Portal/MyCompany.htm     | 4
http://example.com/finance/finhome/topstories/wall_street.html?.v=46 | 4
http://example.com/portal/page/myo/Employee_Portal/home.htm          | 2
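The Urldepth rules can be approximated in a few lines of Python. The home-page heuristic below (treating home, index, and default pages as home pages) is an assumption for illustration; the exact adjustment rule is internal to Oracle SES:

```python
from urllib.parse import urlparse

def urldepth(url: str) -> int:
    parsed = urlparse(url)
    path = parsed.path
    depth = path.count("/") - 1   # do not count the slash after the host name
    if path.endswith("/"):
        depth -= 1                # do not count a trailing slash
    if parsed.query:
        depth += 1                # +1 adjustment for dynamic pages
    leaf = path.rstrip("/").rsplit("/", 1)[-1].lower()
    if leaf.split(".")[0] in ("home", "index", "default"):  # assumed heuristic
        depth -= 2                # -2 adjustment for home pages
    return depth
```

Applied to the example URLs in Table 4-3, this sketch reproduces the listed depths.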


Infosource Path is a path representing the source of the document. This internal attribute is used in situations where documents can be browsed by their source. The Infosource Path is derived from the URL string.

For example, for this URL:

http://example.com/portal/page/myo/Employee_Portal/home.htm

The Infosource Path is:

portal/page/myo/Employee_Portal

If the document is submitted through a connector, this value can be set explicitly by using the DocumentMetadata.setSourceHierarchy API.
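As a sketch, deriving the Infosource Path from a URL string can look like the following. This is illustrative only; the internal derivation used by Oracle SES is not exposed:

```python
from urllib.parse import urlparse

def infosource_path(url: str) -> str:
    # drop the scheme and host, then drop the final path component
    path = urlparse(url).path.strip("/")
    return path.rsplit("/", 1)[0] if "/" in path else ""

infosource_path("http://example.com/portal/page/myo/Employee_Portal/home.htm")
# -> 'portal/page/myo/Employee_Portal'
```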

Understanding the Crawling Process

The first time the crawler runs, it must fetch data (Web pages, table rows, files, and so on) based on the source. It then adds each document to the Oracle SES index.

The Initial Crawl

This section describes a Web source crawling process for a schedule. It is divided into these phases:

Queuing and Caching Documents

The crawling cycle involves the following steps:

  1. Oracle spawns the crawler according to the schedule you specify with the Oracle SES Administration GUI. When crawling is initiated for the first time, the URL queue is populated with the seed URLs.

  2. The crawler initiates multiple crawling threads.

  3. The crawler thread removes the next URL in the queue.

  4. The crawler thread fetches the document from the Web. The document is usually an HTML file containing text and hypertext links. When the document is not in HTML format, the crawler converts the document into HTML before caching.

  5. The crawler thread scans the HTML file for hypertext links and inserts new links into the URL queue. Duplicate links in the document table are discarded.

  6. The crawler caches the HTML file.

  7. The crawler registers the URL in the URL table.

  8. The crawler thread starts over by repeating Step 3.

Fetching a document, as described in Step 4, can be time-consuming because of network traffic or slow Web sites. For maximum throughput, multiple threads fetch pages at any given time.
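The steps above can be sketched as a simplified, single-threaded loop. The fetch and extract_links callables are placeholders for the crawler's fetching and link-scanning logic, the cache stands in for Secure Cache, and the 250 MB limit mirrors the default maximum batch size:

```python
from collections import deque

def crawl(seed_urls, fetch, extract_links, cache, max_cache=250 * 1024 * 1024):
    queue = deque(seed_urls)            # the URL queue, seeded at startup
    seen = set(seed_urls)
    cached_bytes = 0
    while queue and cached_bytes < max_cache:
        url = queue.popleft()           # step 3: take the next URL from the queue
        html = fetch(url)               # step 4: fetch (and convert to HTML)
        if html is None:
            continue                    # fetch failure: skip this URL
        for link in extract_links(html):      # step 5: scan for hypertext links
            if link not in seen:        # duplicate links are discarded
                seen.add(link)
                queue.append(link)
        cache[url] = html               # steps 6-7: cache the file, register the URL
        cached_bytes += len(html)
```

When the loop exits with a full cache, indexing of the cached documents begins, as described in the next section.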

Indexing Documents

When the cache is full (default maximum size is 250 MB), the indexing process begins. At this point, the document content and any searchable attributes are pushed into the index.

When the Preserve Document Cache parameter is set to false, the crawler automatically deletes the cache after indexing the documents.

Oracle SES Stoplist

Oracle SES maintains a stoplist. A stoplist is a list of words that are ignored during the indexing process. These words are known as stopwords. Stopwords are not indexed because they are deemed not useful, or even disruptive, to the performance and accuracy of indexing. The Oracle SES stoplist contains only English words, and cannot be modified.

When you run a phrase search with a stopword in the middle, the stopword is not used as a match word, but it is used as a placeholder. For example, the word "on" is a stopword. If you search for the phrase "oracle on demand", then Oracle SES matches a document titled "oracle on demand" but not a document titled "oracle demand". If you search for the phrase "oracle on on demand", then Oracle SES matches a document titled "oracle technology on demand" but not a document titled "oracle demand" or "oracle on demand".
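The placeholder behavior can be illustrated with a toy matcher. This is hypothetical code: the stoplist and matching logic here are simplified stand-ins for the indexing machinery Oracle SES actually uses:

```python
STOPWORDS = {"on"}   # a tiny stand-in for the Oracle SES English stoplist

def phrase_matches(phrase: str, title: str) -> bool:
    p, t = phrase.lower().split(), title.lower().split()
    # slide the phrase across the title; a stopword slot matches any one word
    return any(
        all(w in STOPWORDS or w == t[i + j] for j, w in enumerate(p))
        for i in range(len(t) - len(p) + 1)
    )

# "oracle on demand" matches "oracle on demand" but not "oracle demand";
# "oracle on on demand" matches "oracle technology on demand"
```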

Maintenance Crawls

After the initial crawl, a URL page is recrawled and reindexed only if it has changed since the last crawl. The crawler determines whether a page has changed by using the HTTP If-Modified-Since header field or the checksum of the page. URLs that no longer exist are marked and removed from the index.

To update changed documents, the crawler uses an internal checksum to compare new Web pages with cached Web pages. Changed Web pages are cached and marked for re-indexing.
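A sketch of the comparison, using an MD5 digest purely for illustration (the crawler's internal checksum algorithm is not documented):

```python
import hashlib

def page_checksum(content: bytes) -> str:
    # any stable digest works for change detection; MD5 here is illustrative
    return hashlib.md5(content).hexdigest()

cached_sum = page_checksum(b"<html>old revision</html>")
fresh_sum = page_checksum(b"<html>new revision</html>")
needs_reindex = cached_sum != fresh_sum   # changed pages are re-cached and re-indexed
```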

Data synchronization involves the following steps:

  1. Oracle spawns the crawler according to the schedule specified in the Oracle SES Administration GUI. The URL queue is populated with the seed URLs of the source assigned to the schedule.

  2. The crawler initiates multiple crawling threads.

  3. Each crawler thread removes the next URL in the queue.

  4. Each crawler thread fetches a document from the Web. The page is usually an HTML file containing text and hypertext links. When the document is not in HTML format, the crawler converts the document into HTML before caching.

  5. Each crawler thread calculates a checksum for the newly retrieved page and compares it with the checksum of the cached page. If the checksum is the same, then the page is discarded and the crawler goes to Step 3. Otherwise, the crawler continues to the next step.

  6. The crawler thread scans the document for hypertext links and inserts new links into the URL queue. Links that are in the document table are discarded. Oracle SES does not follow links from filtered binary documents.

  7. The crawler marks the URL as accepted. The URL is crawled in future maintenance crawls.

  8. The crawler registers the URL in the document table.

  9. If the cache is full or if the URL queue is empty, then caching stops. Otherwise, the crawler thread starts over at Step 3.

A maintenance or a forced recrawl does not move a cache from the file system to the database, or the reverse. The cache location for a source remains the same until it is migrated to a different location.

Automatic Forced Recrawls

When you configure a data source, certain operations trigger an automatic forced recrawl of the data source. These operations include the following:

  • Deleting a document attribute from the data source

  • Remapping a document attribute to a different search attribute

  • Changing the crawler configuration "Index Dynamic Page" from No to Yes for a Web source.

These operations set the force recrawl flag, but no notice is given of this change in mode.

Monitoring the Crawling Process

Monitor the crawling process in the Oracle SES Administration GUI by using a combination of the following:

Crawler Statistics

The following crawler statistics are shown on the Home - Schedules - Crawler Progress Summary page. Some of them are also shown in the log file, under "Crawling results".

  • Documents to Fetch: Number of URLs in the queue waiting to be crawled. The log file uses the phrase "Documents to Process".

  • Documents Fetched: Number of documents retrieved by the crawler.

  • Document Fetch Failures: Number of documents whose contents cannot be retrieved by the crawler. This could be due to an inability to connect to the Web site, slow server response time causing time-outs, or authorization requirements. Problems encountered after successfully fetching the document are not considered here; for example, documents that are too big or duplicate documents that were ignored.

  • Documents Rejected: Number of URL links encountered but not considered for crawling. The rejection could be due to boundary rules, the robots exclusion rule, the mime type inclusion rule, the crawling depth limit, or the URL rewriter discard directive.

  • Documents Discovered: Total number of documents discovered so far. This is roughly equal to (documents to fetch) + (documents fetched) + (document fetch failures) + (documents rejected).

  • Documents Indexed: Number of documents that have been indexed or are pending indexing.

  • Documents Non-Indexable: Number of documents that cannot be indexed; for example, a file source directory or a document with robots NOINDEX metatag.

  • Document Conversion Failures: Number of document filtering errors. This is counted whenever a document cannot be converted to HTML format.

Crawler Log File

The log file records all crawler activity, warnings, and error messages for a particular schedule. It includes messages logged at startup, run time, and shutdown. Logging everything can create very large log files when crawling a large number of documents. However, in certain situations, it can be beneficial to configure the crawler to print detailed activity to each schedule log file.

On the Global Settings - Crawler Configuration page, you can select either to log everything or to log only summary information. You can also select the crawler log file directory and the language the crawler uses to generate the log file.

Note:

On UNIX-based systems, ensure that the directory permission is set to 700 if you change the log file directory. Only the user who installed the Oracle software should have access to this directory.

A new log file is created when you restart the crawler. The location of the crawler log file can be found on the Home - Schedules - Crawler Progress Summary page. The crawler maintains the past seven versions of its log file, but only the most recent log file is shown in the Oracle SES Administration GUI. You can view the other log files in the file system.

The naming convention of the log file name is ids.MMDDhhmm.log, where ids is a system-generated ID that uniquely identifies the source, MM is the month, DD is the date, hh is the launching hour in 24-hour format, and mm is the minutes.

For example, if a schedule for a source identified as i3ds23 starts at 10:00 PM on July 8, then the log file name is i3ds23.07082200.log. Each successive schedule has a unique log file name. After a source has seven log files, the oldest log file is overwritten.
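The naming convention can be reproduced with a short helper. The log_file_name function is illustrative, not an Oracle SES API:

```python
from datetime import datetime

def log_file_name(source_id: str, start: datetime) -> str:
    # ids.MMDDhhmm.log, with the hour in 24-hour format
    return f"{source_id}.{start.strftime('%m%d%H%M')}.log"

# a schedule for source i3ds23 started at 10:00 PM on July 8:
log_file_name("i3ds23", datetime(2011, 7, 8, 22, 0))  # 'i3ds23.07082200.log'
```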

Each logging message in the log file is one line, containing the following six tab delimited columns, in order:

  1. Timestamp

  2. Message level

  3. Crawler thread name

  4. Component name. It is typically the name of the executing Java class.

  5. Module name. It can be an internal Java class method name.

  6. Message
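Given the six tab-delimited columns, a log line can be parsed with a simple split. The sample line below is synthetic (component and module are empty, as they often are):

```python
def parse_log_line(line: str) -> dict:
    # six tab-delimited columns, in order
    ts, level, thread, component, module, message = line.rstrip("\n").split("\t", 5)
    return {"timestamp": ts, "level": level, "thread": thread,
            "component": component, "module": module, "message": message}

sample = "23:10:39:330\tINFO\tcrawler_2\t\t\tProcessing file://localhost/net/stawg02/"
record = parse_log_line(sample)
```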

Crawler Configuration File

The crawler configuration file is ORACLE_HOME/search/data/config/crawler.dat. Most crawler configuration tasks are controlled in the Oracle SES Administration GUI, but certain features (like title fallback, character set detection, and indexing the title of multimedia files) are controlled only by the crawler.dat file.

Note:

The crawler.dat file is not backed up with Oracle SES backup and recovery. If you edit this file, be sure to back it up manually.

Crawling Zip Files Containing Non-UTF8 File Names

The Java library used to process zip files (java.util.zip) supports only UTF8 file names for zip entries. The content of non-UTF8 file names is not indexed.

To crawl zip files containing non-UTF8 file names, change the ZIPFILE_PACKAGE parameter in crawler.dat from JDK to APACHE. The Apache library org.apache.tools.zip does not read the zip content in the same order as the JDK library, so the content displayed in the user interface could look different. Zip file titles may also differ, because the crawler uses the first file in the archive as the fallback title. Also, with the Apache library, the source default character set value is used to read the zip entry file names.

Setting the Logging Level

Specify the crawler logging level with the -Doracle.search.logLevel parameter. The defined levels are DEBUG(2), INFO(4), WARN(6), ERROR(8), and FATAL(10). The default value is 4, which means that messages of level 4 and higher are logged. DEBUG (level 2) messages are not logged by default.

For example, the following "info" message is logged at 23:10:39:330. It comes from the thread crawler_2, and the message is Processing file://localhost/net/stawg02/. The component and module names are not specified.

23:10:39:330 INFO    crawler_2      Processing file://localhost/net/stawg02/

The crawler uses a set of codes to indicate the crawling result of the crawled URL. Besides the standard HTTP status codes, it uses its own codes for non-HTTP related situations.

Parallel Query Indexing

To scale up the indexed data size while maintaining satisfactory query response time, the indexed data can be stored on independent disks so that disk I/O operations are performed in parallel. The major features of this architecture are described in the following sections.

Figure 4-1 End User Query Partitioning


Document Partition Model and Storage Areas

Storage areas are used to store the partitions when the partitioning option is enabled. See "Storage Areas" for more information.

There are two kinds of partitioning mechanisms for improving query performance: attribute-based partitioning and hash-based partitioning. Currently, Oracle SES supports only hash-based partitioning.

Hash-based partitioning uses a hash function to distribute a large set of documents into multiple partitions. A partition engine controls the partition logic at both crawl time and query time. When a large data set must be searched without conditions that prune the result, the end-user request is broken into multiple parallel sub-queries so that I/O and CPU resources can be used in parallel. After the independent query processors return the result sets of the sub-queries, a merged result set is returned to the end user.
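The crawl-time and query-time halves of this scheme can be sketched together. The code below is a simplified illustration, not the Oracle SES partition engine: a hash of the document key selects one of N partitions at crawl time, and at query time the same query fans out to every partition before the per-partition results are merged (here simply concatenated and sorted for a stable order; the real engine merges by relevance):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class HashPartitionSketch {
    // Crawl time: a hash of the document key picks one of numPartitions storage areas.
    static int partitionFor(String docKey, int numPartitions) {
        return Math.floorMod(docKey.hashCode(), numPartitions);
    }

    // Query time: fan the query out to every partition, then merge the result lists.
    static List<String> query(List<List<String>> partitions, String term) {
        return partitions.stream()
                .flatMap(p -> p.stream().filter(d -> d.contains(term)))
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        int n = 3;
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < n; i++) partitions.add(new ArrayList<>());
        for (String doc : List.of("alpha report", "beta report", "gamma memo"))
            partitions.get(partitionFor(doc, n)).add(doc);
        System.out.println(query(partitions, "report"));
        // prints: [alpha report, beta report]
    }
}
```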

Figure 4-2 shows how the mechanism works at crawl time. The documents are partitioned and stored in different storage areas. Note that the storage areas are created on separate physical disks, so that I/O operations can be performed in parallel to improve the search turnaround time.

Figure 4-2 Document Partitioning at Crawl Time


At query time, the query partition engine generates sub-queries and submits them to the storage areas, as shown in Figure 4-3.

Figure 4-3 Generation of Sub Queries at Query Time


See "Parallel Querying and Index Partitioning" for more information.

Note:

In previous releases, the base path of Oracle SES was referred to as ORACLE_HOME. In Oracle SES release 11g, the base path is referred to as ORACLE_BASE. This represents the Software Location that you specify at the time of installing Oracle SES.

ORACLE_HOME now refers to the path ORACLE_BASE/seshome.

For more information about ORACLE_BASE, see "Conventions".