3 Understanding Crawling and Searching

This chapter contains the following topics:

Overview of the Oracle Secure Enterprise Search Crawler
Overview of Crawler Settings
Overview of XML Connector Framework
Overview of Attributes
Understanding the Crawling Process
Monitoring the Crawling Process
Overview of Searching in Oracle Secure Enterprise Search

See Also:

"Tuning Crawl Performance" and "Tuning Search Performance"
The Oracle Secure Enterprise Search tutorials at http://www.oracle.com/technology/products/oses/index.html

Overview of the Oracle Secure Enterprise Search Crawler

The Oracle Secure Enterprise Search (SES) crawler is a Java process activated by a set schedule. When activated, the crawler spawns processor threads that fetch documents from sources. These documents are cached in the local file system. When the cache reaches the maximum batch size, the crawler indexes the cached files. This index is used for searching.

In the administration tool, you can create schedules with one or more sources attached to them. Schedules define the frequency at which the Oracle SES index is kept up to date with existing information in the associated sources.

Crawler URL Queue

In the process of crawling, the crawler maintains a list of URLs of the documents that are discovered and will be fetched and indexed in an internal URL queue. The queue is persistently stored, so that crawls can be resumed after the Oracle SES instance is restarted.

Understanding Access URLs and Display URLs

A display URL is a URL string used for search result display. This is the URL used when users click the search result link. An access URL is a URL string used by the crawler for crawling and indexing. An access URL is optional. If it does not exist, then the crawler uses the display URL for crawling and indexing. If it does exist, then it is used by the crawler instead of the display URL for crawling. For regular Web crawling, there are only display URLs available. But in some situations, the crawler needs an access URL for crawling the internal site while keeping a display URL for the external use. For every internal URL, there is an external mirrored one.

For example, for file sources, by defining display URLs, end users can access the original document with the H TTP or HTTPS protocols. These provide the appropriate authentication and personalization and result in better user experience.

Display URLs can be provided using the URL Rewriter API. Or, they can be generated by specifying the mapping between the prefix of the original file URL and the prefix of the display URL. Oracle SES replaces the prefix of the file URL with the prefix of the display URL. For example, if the file URL is file://localhost/home/operation/doc/file.doc and the display URL is https://webhost/client/doc/file.doc, then specify the file URL prefix to file://localhost/home/operation and the display URL prefix to https://webhost/client.

Using Crawler Plug-ins

In addition to the default source types Oracle SES provides (such as Web, file, OracleAS Portal, and so on), you can also crawl proprietary sources . This is accomplished by implementing a crawler plug-in as a Java class. The plug-in collects document URLs and associated metadata (including access privilege) and contents from the proprietary source and returns the information to the Oracle SES crawler. The crawler starts processing each document as it is collected.

See Also:

"Crawler Plug-in API"

Overview of Crawler Settings

You can alter the crawler's operating parameters, such as the crawler timeout threshold and the default character set, on the Global Settings - Crawler Configuration page in the administration tool. After a source has been created, you can define crawling parameters, such as URL boun dary rules and crawling depth, for that source by editing that source on the Home - Sources page.

This section describes crawler settings, as well as other mechanisms to control the scope of Web crawling:

Crawling Mode
URL Boundary Rules
Document Types
Crawling Depth
Robots Exclusion
Index Dynamic Pages
URL Rewriter API
Title Fallback
Character Set Detection
Cache Directory

See Also:

"Tuning Crawl Performance" for more detailed information on these settings and other issues affecting crawl performance

Crawling Mode

For initial planning purposes, you might want the crawler to collect URLs without indexing. After crawling is finished, examine the document URLs and status, remove unwanted documents, and start indexing. The crawling mode is set on the Home - Schedules - Edit Schedules page.

See Also:

Appendix D, "URL Crawler Status Codes"

Note:

If you are using a custom crawler created with the Crawler Plug-in API, then the crawling mode set here will not apply. The implemented plug-in controls the crawling mode.

These are the crawling mode options:

Automatically Accept All URLs for Indexing: This crawls and indexes all URLs in the source. For Web sources, it also extracts and indexes any links found in those URLs. If the URL has been crawled before, then it will be reindexed only if it has changed.
Examine URLs Before Indexing: This crawls but does not index any URLs in the source. It also crawls any links found in those URLs.
Index Only: This crawls and indexes all URLs in the source. It does not extract any links from those URLs. In general, select this option for a source that has been crawled previously under "Examine URLs Before Indexing".

URL Boundary Rules

URL b oundary rul es limit the crawling space. When boundary rules are added, the crawler is restricted to URLs that match the indicated rules. The order in which rules are specified has no impact, but exclusion rules always override inclusion rules.

This is set on the Home - Sources - Boundary Rules page.

Inclusion Rules

Specify an inclusion rule that a URL contain, start with, or end with a term. Use an asterisk (*) to represents a wildcard. For example, www.*.example.com. Simple inclusion rules are case-insensitive. For case-sensitivity, use regular expression rules.

An inclusion rule ending with example.com limits the search to URLs ending with the string example.com. Anything ending with example.com is crawled, but http://www.example.com.tw is not crawled.

If the URL Submission functionality is enabled on the Global Settings - Query Configuration page, then URLs that are submitted by end users are added to the inclusion rules list. You can delete URLs that you do not want to index.

Oracle SES supports the regular expression syntax used in Java JDK 1.4.2 Pattern class (java.util.regex.Pattern). Regular expression rules use special characters. The following is a summary of some basic regular expression constructs.

Use a caret (^) to denote the beginning of a URL and a dollar sign ($) to denote the end of a URL.
Use a period (.) to match any one character.
Use a question mark (?) to match zero or one occurrence of the character that it follows.
Use an asterisk (*) to match zero or more occurrences of the pattern that it follows. An asterisk can be used in the starts with, ends with, and contains rule.
Use a backslash (\) to escape any special characters, such as periods (\.), question marks (\?), or asterisks (\*).

See Also:

http://java.sun.com for a complete description on Sun Microsystems Java documentation

Exclusion Rules

You can specify an exclusion rule that a URL contains, starts with or ends with a term.

An exclusion of uk.example.com prevents the crawling of Example hosts in the United Kingdom.

Default Exclusion Rules

The crawler contains a default exclusion rule to exclude non-textual files. The following file extensions are included in the default exclusion rule.

Image: jpg, gif, tif, bmp, png
Audio: wav, mp3, wma
Video: avi, mpg, mpeg, wmv
Binary: bin, exe, so, dll, iso, jar, war, ear, tar, wmv, scm, cab, dmp

To crawl file with such extensions, modify the following section in the $ORACLE_HOME/search/data/config/crawler.dat file, removing any file type suffix from the exclusion list.

# default file name suffix exclusion list
RX_BOUNDARY (?i:(?:\.gif)|(?:\.jpg)|(?:\.jar)|(?:\.tif)|(?:\.bmp)|(?:\.war)|(?:\.ear)|(?:\.mpg)|(?:\.wmv)|(?:\.mpeg)|(?:\.scm)|(?:\.iso)|
(?:\.dmp)|(?:\.dll)|(?:\.cab)|(?:\.so)|(?:\.avi)|(?:\.wav)|(?:\.mp3)|(?:\.wma)|(?:\.bin)|(?:\.exe)|(?:\.iso)|(?:\.tar)|(?:\.png))$

Also, add the MIMEINCLUDE parameter to the crawler.dat file to include any multimedia file type you want to crawl, and the file name will be indexed as title.

For example, to crawl any audio files, remove .wav, .mp3, and .wma and add the MIMEINCLUDE line:

RX_BOUNDARY (?i:(?:\.gif)|(?:\.jpg)|(?:\.jar)|(?:\.tif)|(?:\.bmp)|(?:\.war)|(?:\.ear)|(?:\.mpg)|(?:\.wmv)|(?:\.mpeg)|(?:\.scm)|(?:\.iso)|
(?:\.dmp)|(?:\.dll)|(?:\.cab)|(?:\.so)|(?:\.avi)|(?:\.bin)|(?:\.exe)|(?:\.iso)|(?:\.tar)|(?:\.png))$
MIMEINCLUDE audio/x-wav audio/mpeg

Note:

Only the file name is indexed when crawling multimedia files, unless the file is crawled through a crawler plug-in where a more rich set of attributes can be provided.

Example Using Regular Expression

The following example uses several regular expression constructs that are not described earlier, including range quantifiers, non-grouping parentheses, and mode switches. For a complete description, see the Sun Microsystems Java documentation.

Suppose you want to crawl only HTTPS URLs in the example.com and examplecorp.com domains. Also, you want to exclude files ending in .doc and .ppt.

Inclusion: URL regular expression ^https://.*\.example(?:corp){0,1}\.com
Exclusion: URL regular expression (?i:\.doc|\.ppt)$

Boundary Rules for Web Sources

When creating a Web source, the host name of the seed is automatically added to the boundary rule. However, subsequent changes to the seed will not be reflected automatically to the rule. Remember to synchronize the boundary rule if there is any change to the seed URL. Currently, Oracle SES does not remove crawled URLs even if the original seed is removed: everything is controlled by the boundary rules.

Document Types

Customize which document types are processed for each source on the Home - Sources - Document Types page. HTML and plain text files are always crawled and indexed.

Note:

Oracle SE S allows up to 1000 files in zip files and LHA files. If there are more than 1000 files, then an error is raised and the file is ignored.

Crawling Depth

Crawling depth is the number of levels to crawl Web and file sources. A Web document can contain links to other Web documents, which can contain more links. Specify the maximum number of nested links the crawler will follow. Crawling depth starts at 0; that is, if you specify 1, then the crawler will gather the starting (seed) URL plus any document that is linked directly from the starting URL. For file crawling, this is the number of directory levels from the starting URL.

This is set on the Home - Sources - Crawling Parameters page.

Robots Exclusion

You can control which parts of your sites can be visited by robots. If robots exclusion is enabled (default), then the Web crawler traverses the pages based on the access policy specified in the Web server robots.txt file. The crawler also respects the page-level robot exclusion specified in HTML metatags.

For example, when a robot visits http://www.example.com/, it checks for http://www.example.com/robots.txt. If it finds it, then the crawler checks to see if it is allowed to retrieve the document. If you own the Web sites, then you can disable robots exclusions. However, when crawling other Web sites, always comply with robots.txt by enabling robots exclusion.

This is set on the Home - Sources - Crawling Parameters page.

Index Dynamic Pages

By default, Oracle SES will process dynamic pages. Dynamic pages are generally served from a database application and have a URL that contains a question mark (?). Oracle SES identifies URLs with question marks as dynamic pages.

Some dynamic pages appear as multiple search results for the same page, and you might not want them all indexed. Other dynamic pages are each different and need to be indexed. You must distinguish between these two kinds of dynamic pages. In general, dynamic pages that only change in menu expansion without affecting its contents should not be indexed. Consider the following three URLs:

http://itweb.oraclecorp.com/aboutit/network/npe/standards/naming_convention.html
 
http://itweb.oraclecorp.com/aboutit/network/npe/standards/naming_convention.html?nsdnv=14z1
 
http://itweb.oraclecorp.com/aboutit/network/npe/standards/naming_convention.html?nsdnv=14

The question mark ('?') in the URL indicates that the rest of the strings are input parameters. The similar results are essentially the same page with different side menu expansion. Ideally, the search should yield only one result:

http://itweb.oraclecorp.com/aboutit/network/npe/standards/naming_convention.html

Note:

The crawler cannot crawl and index dynamic Web pages written in Javascript.

This is set on the Home - Sources - Crawling Parameters page.

URL Rewriter API

The URL Rewriter is a user-supplied Java module for implementing the Oracle SES UrlRewriter interface. The crawler uses it to filter or rewrite extracted URL links before they are put into the URL queue. The API enables ultimate control over which links extracted from a Web page are allowed and which ones should be discarded.

URL filtering removes unwanted links, and URL rewriting transforms the URL link. This transformation is necessary when access URLs are used and alternate display URLs need to be presented to the user in the search results.

This is set on the Home - Sources - Crawling Parameters page.

See Also:

Title Fallback

You can override a default documen t title with a meaningful title if the default title is irrelevant. For example, suppose that the result list shows numerous documents with the title "Daily Memo". The documents had been created with the same template file, but the document properties had not been changed. Overriding this title in Oracle SES can help users better understand their search results.

Title fallback can be used for any source type. Oracle SES uses different logic for each document type to determine which fallback title to use. For example, for HTML documents, Oracle SES looks for the first heading, such as <h1>. For Microsoft Word documents, Oracle SES looks for text with the largest font.

If the default title was collected in the initial crawl, then the fallback title will only be used after the document is reindexed during a re-crawl. This means if there is no change to the document, then you must force the change by setting the re-crawl policy to Process All Documents on the Home - Schedules - Edit Schedule page.

This feature is not currently supported in the Oracle SES administration tool. Override a default document title with a meaningful title by adding the keyword BAD_TITLE to the $ORACLE_HOME/search/data/config/crawler.dat file. For example:

BAD_TITLE Daily Memo

Where Daily Memo is the title string that should be overridden. The title string is case-insensitive and can use multibyte characters in UTF8 character set.

Multiple bad titles can be specified, each one on a separate line.

Special Considerations with Title Fallback

With Microsoft Office documents:
- Font sizes 14 and 16 in Microsoft Word correspond to normalized font sizes 4 and 5 (respectively) in converted HTML. The Oracle SES crawler only picks up strings with normalized font size greater than 4 as the fallback title.
- Title should contain more than five characters.
When a title is null, Oracle SES automatically indexes the fallback title for all binary documents (for example, .doc, .ppt, .pdf). For HTML and text documents, Oracle SES does not automatically index the fallback title. This means that the replaced title on HTML or text documents cannot be searched with the title attribute on the Advanced Search page. You can turn on indexing for HTML and text documents in the crawler.dat file. (For example, set NULL_TITLE_FALLBACK_INDEX ALL)
The crawler.dat file is not included in the backup available on the Global Settings - Configuration Data Backup and Recovery page. Make sure you manually back up the crawler.dat file.

See Also:

"Crawler Configuration File"

Character Set Detection

This featu re enables the crawler to automatically detect character set information for HTML, plain text, and XML files. Character set detection allows the crawler to properly cache files during crawls, index text, and display files for queries. This is important when crawling multibyte files (such as files in Japanese or Chinese).

This feature is not currently supported in the Oracle SES administration tool, and by default, it is turned off. Enable automatic character set detection by adding a line in the crawler configuration file: $ORACLE_HOME/search/data/config/crawler.dat. For example, add the following as a new line:

AUTO_CHARSET_DETECTION

You can check whether this is turned on or off in the crawler log under the "Crawling Settings" section.

Special Considerations with Automatic Character Set Detection

To crawl XML files for a source, make sure to add XML to the list of processed document types on the Home - Source - Document Types page. XML files are currently treated as HTML format, and detection for XML files may not be as accurate as for other file formats.
The crawler.dat file is not included in the backup available on the Global Settings - Configuration Data Backup and Recovery page. Make sure you manually back up the crawler.dat file.

See Also:

"Crawler Configuration File"

Language Detection

With multibyte files, besides turning on character set detection, it is also important to set the Default Language parameter. For example, if the files are all in Japanese, select Japanese as the default language for that source. If automatic language detection is disabled, or if the crawler cannot determine the document language, then the crawler assumes that the document is written in the default language. This default language is used only if the crawler cannot determine the document language during crawling.

If your files are in more than one language, then turn on the Enable Language Detection parameter. Not all documents retrieved by the crawler specify the language. For documents with no language specification, the crawler attempts to automatically detect language. The language recognizer is trained statistically using trigram data from documents in various languages (for instance, Danish, Dutch, English, French, German, Italian, Portuguese, and Spanish). It starts with the hypothesis that the given document does not belong to any language and ultimately refutes this hypothesis for a particular language where possible. It operates on Latin-1 alphabet and any language with a deterministic Unicode range of characters (like Chinese, Japanese, Korean, and so on).

The crawler determines the language code by checking the HTTP header content-language or the LANGUAGE column, if it is a table source. If it cannot determine the language, then it takes the following steps:

If the language recognizer is not available or if it is unable to determine a language code, then the default language code is used.
If the language recognizer is available, then the output from the recognizer is used.
Multilexer is the only lexer used for Oracle Secure Enterprise Search.

The Default Language and the Enable Language Detection parameters are on the Global Settings - Crawler Configuration page (globally) and also the Home - Sources - Crawling Parameters page (for each source).

Note:

For file sources, the individual source setting for Enable Language Detection remains false regardless of the global setting. In most cases, the language for a file source should be the same, and set from, the Default Language setting.

Cache Directory

During crawling, documents are stored in the cache directory. (E-mails are stored in the e-mail archive directory.) When the size of the fetched documents in the cache directory reaches the indexing batch size, Oracle SES starts indexing.

On the Global Settings - Crawler Configuration page, you can select the cache directory location. You can also select to clear the cache after indexing. The default for this parameter is no, because the cache is necessary for the Cached link feature in the search result page. If you do not use the Cached link, then change this setting to yes to save space. You can delete the cache to remove all cache files from all sources, except the one cache file that is under crawl at this moment (that is, the cache file under the executing schedule). The deletion happens in the background, and you do not need to wait for it to complete. After the cache is deleted, users clicking the Cached link in the search result page will see a "File not found" error.

Note:

If you change the cache directory location, then move all existing cache files to the new directory. The e-mail archive directory is placed by default next to the cache directory. For example, if the cache directory is /foo/bar/cache, the e-mail archive directory is /foo/bar/mail. If you change the e-mail archive directory, then move all existing e-mail archive files to the new e-mail archive directory. If you delete the source, then also delete the cache files.

On non-Windows systems, make sure that the directory permission is set to 700 if you change the log file directory. Only the person who installed the Oracle software should be allowed access to this directory.

Each source has a subcache directory created under the cache directory. The format of the subcache directory is: i<system-generated ID>ds<source ID> as the directory name. For example, for source ID 35, all the cache files for that source are stored under [cache directory]/I1DS35/.

Overview of XML Connector Framework

Oracle SES provides an XML connector framework to crawl any repository that provides an XML interface to its contents. The connectors for Oracle Content Server, Oracle E-Business Suite 12, and Siebel 8 use this framework.

Every document in a repository is known as an item. An item contains information about the document, such as author, access URL, last modified date, security information, status, and contents.

A set of items is known as a feed (or channel). To crawl a repository, an XML document must be generated for each feed. Each feed is associated with information such as feed name, type of the feed, and number of items.

To crawl a repository with the XML connector, place data feeds in a location accessible to Oracle SES over one of the following protocols: HTTP, FTP, or file. Then generate an XML Configuration File that contains information such as feed location and feed type. Create a source with a source type that is based on this XML connector and trigger the crawl from Oracle SES to crawl the feeds.

There are two types of feeds:

Control feed: Individual feeds can be located anywhere, and a single control file is generated with links to the feeds. This control file is input to the connector through the configuration file. A link in control feed can point to another control feed. Control feed is useful when data feeds are distributed over many locations or when the data feeds are accessed over diverse protocols such as FTP and file.
Directory feed: All feeds are placed in a directory, and this directory is input to the connector through the configuration file. Directory feed is useful when the data feeds are available in a single directory.

Guidelines for the target repository generating the XML feeds:

XML feeds are generated by the target repository, and each file system has a limit on how many files it can hold. For directory feeds, the number of documents in each directory should be less than 10,000. There are two considerations:
- Feed files: The number of items in each feed file should be set such that the total number of feed files in the feed directory is kept under 10,000.
- Content files: If the feed files specify content through attachment links and the targets of these links are stored in the file system, then ensure that the targets are distributed in multiple directories so that the total number of files in each directory is kept under 10,000.
When feeds are generated real-time over HTTP, ensure that the component generating the feeds is sensitive to time out issues of feed requests. The feed served as the response for every request should be made available within this time out interval; otherwise, the request from Oracle SES will timeout. The request will be retried as many times as specified while setting up the source in Oracle SES. If all these attempts fail, then the crawler ignores this feed and proceeds with the next feed.

See Also:

Example Using the XML Connector

The courses in the Oracle E-Business Suite Learning Management (OLM) application can be crawled and indexed to readily search the courses offered, location and other details pertaining to the courses. Follow these steps to set this up:

Generate XML feed containing the courses. Each course can be an item in the feed. The properties of the course such as location and instructor can be set as attributes of the item.
Move the feed to a location accessible to SES through HTTP, FTP, or file protocol.
Generate a control file that points to that feed.
Generate a configuration file to point to this feed. Specify the feed type as control, the URL of the control feed, and the source name in the configuration file.
Create an Oracle E-Business Suite 12 source in Oracle SES, specifying in the parameters the location of the configuration file, the user ID and the password to access the feed.

XML Configuration File

The configuration file is an XML file conforming to a set schema.

The following is an example of a configuration file to set up an XML-based source:

<rsscrawler xmlns="http://xmlns.oracle.com/search/rsscrawlerconfig">  
 <feedLocation>ftp://my.host.com/rss_feeds</feedLocation>
 <feedType>directoryFeed</feedType>
 <errorFileLocation>/tmp/errors</errorFileLocation>
 <securityType>attributeBased</securityType> 
 <sourceName>Contacts</sourceName>
 <securityAttribute name="EMPLOYEE_ID" grant="true"/> 
</rsscrawler>

Where

feedLocation is one of the following:
- URL of the directory, if the data feed is a directory feed
  
  This URL should be the FTP URL or the file URL of the directory where the data feeds are located. For example:
```
ftp://host1.domain.com/relativePathOfDirectory
file://host1.domain.com/c:\dir1\dir2\dir3
file://host1.domain.com//private/home/dir1/dir2/dir3 
```
  File URL if the data feeds are available on the same computer as Oracle SES. The path specified in the URL should be the absolute path of the directory.
  
  FTP URL to access data feeds on any other computer. The path of the directory in the URL can be absolute or relative. The absolute path should be specified following the '/' after the host name in the URL. The relative path should be specified relative to the home directory of the user used to access FTP feeds.
  
  The user ID used to crawl the source should have write permissions on the directory, so that the data feeds can be deleted after crawl.
- URL of the control file, if the data feed is a control feed
  
  This URL can be HTTP, HTTPS, file, or FTP URL. For example:
```
http://host1.domain.com:port/context/control.xml
```
  The path in FTP and file protocols can be absolute or relative.
feedType indicates the type of feed. Valid values are directoryFeed, controlFeed and dataFeed.
errorFileLocation (optional) specifies the directory where status feeds should be uploaded.

A status feed is generated to indicate the status of the processing feed. This status feed is named <data feed file name>.suc or <data feed file name>.err depending on whether the processing was successful. Any errors encountered are listed in the error status feed. If a value is specified for this parameter, then the status feed is uploaded to this location. Otherwise, the status feed is uploaded to the same location as the data feed.

The user ID used to access the data feed should have write permission on the directory.

If feedLocation is an HTTP URL, then errorFileLocation also should be an HTTP URL, to which the status feeds will be posted. If no value is specified for errorFileLocation, then the status feeds are posted to the URL given in feedLocation.

If an error occurs while processing a feed available over file or FTP protocol, then the erroneous feed is renamed <filename>.prcsdErr in the same directory.
sourceName (optional) specifies the name of the source.
securityType (optional) specifies the security type. Valid values are the following:
- noSecurity: There is no security information associated with this source at the document level. This is the default value.
- identityBased: Identity-based security is used for documents in the feed.
- attributeBased: Attribute-based security is used for documents in the feed. With this security model, security attributes should be specified in the securityAttribute tag, and the values for these attributes should be specified for each document.
securityAttribute specifies attribute-based security. One or more tags of this type should be specified, and each tag should contain the following attributes:
- name: Name of the security attribute.
- grant: Boolean parameter indicating whether this is a grant/deny attribute. The security attribute is considered a grant attribute if the value is true and a deny attribute if the value is false.

See Also:

"Configuration File XSD"

Overview of Attributes

Each source has its own set of document attributes. Docume nt attributes, like metadata, describe the properties of a document. The crawler retrieves values and maps them to one of the search attributes. This mapping lets users search documents based on their attributes. Document attributes in different sources can be mapped to the same search attribute. Therefore, users can search documents from multiple sources based on the same search attribute.

Document attributes can be used for many things, including document management, access control, or version control. Different sources can have different attribute names that are used for the same idea; for example, "version" and "revision". It can also have the same attribute name for different ideas; for example, "language" as in natural language in one source but as programming language in another. Document attribute information is obtained differently depending on the source type.

See Also:

"Understanding Attributes" for information about document attributes for each source type
"Customizing the Appearance of Search Results" for a list of Oracle internal attributes

Oracle SES has several default search attributes. They can be incorporated in search applications for a more detailed search and richer presentation.

Search attributes are defined in the following ways:

System-defined search attributes, such as title, author, description, subject, and mimetype
Search attributes created by the Oracle SES administrator
Search attributes created by the crawler. (During crawling, the crawler plug-in maps the document attribute to a search attribute with the same name and data type. If not found, then the crawler creates a new search attribute with the same name and type as the document attribute defined in the crawler plug-in.)

The list of values (LOV) for a search attribute can help you specify a search. Global search attributes can be specified on the Global Settings - Search Attributes page. For user-defined sources where LOV information is supplied through a crawler plug-in, the crawler registers the LOV definition. Use the administration tool or the crawler plug-in to specify attribute LOVs, attribute value, attribute value display name, and its translation.

Note:

When multiple sources define the LOV for a common attribute, such as title, the user sees all the possible values for the attribute. When the user restricts search within a particular source group, only LOVs provided by the corresponding sources in the source group will be shown.

Example of Attribute LOV Collection

LOVs can be collected automatically. The following example shows Oracle SES collecting LOV values to crawl http://www.oracle.com.

Create a Web source with http://www.oracle.com as the starting URl. Do not start crawling yet.
From the Global Settings - Search Attributes page, select the Attribute for which you want Oracle SES collect LOVs and click Manage Lov. (For example, click Manage Lov for Author.)
Select Source-Specific for the created source, and click Apply.
Click Update Policy.
Choose Document Inspection and click Update, then click Finish.
From the Home - Schedules page, start crawling the Web source. After crawling, the LOV button in the Advanced Search page shows the collected LOVs.

Understanding the Crawling Process

The first time the crawler runs, it must fetch data (Web pages, table rows, files, and so on) based on the source. It then adds the document to the Oracle SES index.

The Initial Crawl

This section describes a Web source crawling process for a schedule. It is broken into two phases:

Queuing and Caching Documents
Indexing Documents

Queuing and Caching Documents

The steps in the crawling cycle are the following:

Oracle spawns the crawler according to the schedule you specify with the administration tool. When crawling is initiated for the first time, the URL queue is populated with the seed URLs.
The crawler initiates multiple crawling threads.
The crawler thread removes the next URL in the queue.
The crawler thread fetches the document from the Web. The document is usually an HTML file containing text and hypertext links.
The crawler thread scans the HTML file for hypertext links and inserts new links into the URL queue. Duplicate links already in the document table are discarded.
The crawler caches the HTML file in the local file system.
The crawler registers URL in the URL table.
The crawler thread starts over by repeating Step 3.

Fetching a document, as described in Step 4, can be time-consuming because of network traffic or slow Web sites. For maximum throughput, multiple threads fetch pages at any given time.

Indexing Documents

When the file system cache is full (default maximum size is 250 MB), the indexing process begins. At this point, the document content and any searchable attributes are pushed into the index. When the indexing of the documents in the batch completes, the crawler switches back to the queuing and caching mode.

Oracle SES Stoplist

Oracle SES ma intains a stoplist internally. A stoplist is a list of words that are ignored during the indexing process. These words are known as stopwords. Stopwords are not indexed because they are deemed not useful, or even disruptive, to the performance and accuracy of indexing. The Oracle SES stoplist is in English only, and it cannot be modified.

When you run a phrase search with a stopword in the middle, the stopword is not used as a match word, but it is used as a placeholder. For example, the word "on" is a stopword. If you search for the phrase "oracle on demand", then Oracle SES will match a document titled "oracle on demand" but not a document titled "oracle demand". If you search for the phrase "oracle on on demand", then Oracle SES will match a document titled "oracle technology on demand" but not a document titled "oracle demand" or "oracle on demand".

Maintenance Crawls

After the initial crawl, a URL page is only crawled and indexed if it has changed since the last crawl. The crawler determines if it has changed with the HTTP If-Modified-Since header field or with the checksum of the page. URLs that no longer exist are marked and removed from the index.

To update changed documents, the crawler uses an internal checksum to compare new Web pages with cached Web pages. Changed Web pages are cached and marked for reindexing.

The steps involved in data synchronization are the following:

Oracle spawns the crawler according to the schedule you specify with the administration tool. The URL queue is populated with the seed URLs of the source assigned to the schedule.
The crawler initiates multiple crawling threads.
Each crawler thread removes the next URL in the queue.
Each crawler thread fetches the document from the Web. The page is usually an HTML file containing text and hypertext links. When the document is not in HTML format, the crawler tries to convert the document into HTML before caching.
Each crawler thread calculates a checksum for the newly retrieved page and compares it with the checksum of the cached page. If the checksum is the same, then the page is discarded and the crawler goes to Step 3. Otherwise, the crawler moves to the next step.
Each crawler thread scans the document for hypertext links and inserts new links into the URL queue. Links that are already in the document table are discarded. (Oracle SES does not follow links from filtered binary documents.)
The crawler marks the URL as "accepted". The URL will be crawled in future maintenance crawls.
The crawler registers the URL in the document table.
If the file system cache is full or if the URL queue is empty, then Web page caching stops and indexing begins. Otherwise, the crawler thread starts over at Step 3.

Monitoring the Crawling Process

Monitor the crawling process in the administration tool by using a combination of the following:

Check the crawl progress and crawl status on the Home - Schedules page. (Click Refresh Status.)
Monitor your crawler statistics on the Home - Schedules - Crawler Progress Summary page and the Home - Statistics page.
Monitor the log file for the current schedule.

See Also:

"Tuning Crawl Performance"

Crawler Statistics

The following crawler statistics are shown on the Home - Schedules - Crawler Progress Summary page. Some of these statistics are also shown in the log file, under "Crawling results".

Documents to Fetch: Number of URLs in the queue waiting to be crawled. The log file uses the phrase "Documents to Process".
Documents Fetched: Number of documents retrieved by the crawler.
Document Fetch Failures: Number of documents whose contents cannot be retrieved by the crawler. This could be due to an inability to connect to the Web site, slow server response time causing timeouts, or authorization requirements. Problems encountered after successfully fetching the document are not considered here; for example, documents that are too big or duplicate documents that were ignored.
Documents Rejected: Number of URL links encountered but not considered for crawling. The rejection could be due to bound ary rules, the robots exclusion rule, the mime type inclusion rule, the crawling depth limit, or the URL rewriter discard directive.
Documents Discovered: Total number of documents discovered so far. This is roughly equal to (documents to fetch) + (documents fetched) + (document fetch failures) + (documents rejected).
Documents Indexed: Number of documents that have been indexed or are pending indexing.
Documents non-indexable: Number of documents that cannot be indexed; for example, a file source directory or a document with robots NOINDEX metatag.
Document Conversion Failures: Number of document filtering errors. This is counted whenever a document cannot be converted to HTML format.

Crawler Log File

The log file records all crawler activity, warnings, and error messages for a particular schedule. It includes messages logged at startup, runtime, and shutdown. Logging everything can create very large log files when crawling a large number of documents. However, in certain situations, it can be beneficial to configure the crawler to print detailed activity to each schedule log file.

On the Global Settings - Crawler Configuration page, you can select either to log everything or to log only summary information. You can also select the crawler log file directory and the language the crawler uses to generate the log file.

Note:

A new log file is created when you restart the crawler. The location of the crawler log file can be found on the Home - Schedules - Crawler Progress Summary page. The crawler maintains the past seven versions of its log file, but only the most recent log file is shown in the administration tool. You can view the other log files in the file system.

The naming convention of the log file name is ids.MMDDhhmm.log, where ids is a system-generated ID that uniquely identifies the source, MM is the month, DD is the date, hh is the launching hour in 24-hour format, and mm is the minutes.

For example, if a schedule for a source identified as i3ds23 is launched at 10 pm, July 8th, then the log file name is i3ds23.07082200.log. Each successive schedule launching will have a unique log file name. If the total number of log files for a source reaches seven, then the oldest log file is deleted.

Each logging message in the log file is one line, containing the following six tab delimited columns, in order:

Timestamp
Message level
Crawler thread name
Component name. It is in general the name of the executing Java class.
Module name. It can be internal Java class method name
Message

Crawler Configuration File

The crawler configuration file is $ORACLE_HOME/search/data/config/crawler.dat. Most crawler configuration tasks are controlled in the Oracle SES administration tool, but certain features (like title fallback, character set detection, and indexing the title of multimedia files) are controlled in the crawler.dat file.

Note:

The crawler.dat file is not backed up with Oracle SES backup and recovery. If you edit this file, make sure to back it up manually.

Crawling Zip Files Containing Non-UTF8 File Names

The Java library used to process zip files (java.util.zip) supports only UTF8 file names for zip entries. The content of non-UTF8 file names is not indexed.

To crawl zip files containing non-UTF8 file names, change the ZIPFILE_PACKAGE parameter in crawler.dat from JDK to APACHE. The Apache library (org.apache.tools.zip) does not read the zip content in the same order as the JDK library, so the content displayed in the user interface could look different. Zip file titles also may be different, because it uses the first file as the fallback title. Also, with the Apache library, the source default character set value is used to read the zip entry file name.

Setting the Logging Level

Specify the crawler logging level with the parameter Doracle.search.logLevel. The defined levels are DEBUG(2), INFO(4), WARN(6), ERROR(8), FATAL(10). The default value is 4, which means that messages of level 4 and higher will be logged. DEBUG (level=2) messages are not logged by default.

For example, the following "info" message is logged at 23:10:39330. It is from thread name crawler_2, and the message is Processing file://localhost/net/stawg02/. The component and module names are not specified.

23:10:39:330 INFO    crawler_2      Processing file://localhost/net/stawg02/

The crawler uses a set of codes to indicate the crawling result of the crawled URL. Besides the standard HTTP status codes, it uses its own codes for non-HTTP related situations.

See Also:

Appendix D, "URL Crawler Status Codes"

Smart Incremental Crawl for OracleAS Portal Sources

Oracle SES provides a smart incremental crawl for OracleAS Portal sources, designed to make re-crawls more efficient by not traversing the entire portal tree. Instead of trying to detect all content and permission changes, Portal simply tells Oracle SES what changed. During re-crawl, the Oracle SES crawler asks OracleAS Portal for list of changes since a certain date (that is, the last re-crawl date) and OracleAS Portal generates a list of added, updated, and deleted resources. To enable this feature, turn on the PORTAL_SMART_INCR_CRAWL flag in crawler.dat.

Overview of Searching in Oracle Secure Enterprise Search

A search can be submitted to Oracle SES in the following places:

The query string of the Oracle SES Web services API
The search box in the default Oracle SES query application

To get to the search page in the default query application from any page in the administration tool, click the Search link in the top right corner. This brings up the Basic Search page in a new window, with a text box to enter a search string. This section contains the following topics:

Basic Search
Submit URL
Advanced Search
Thesaurus-Based Search
Alternate Words
Restricted Search and the Browse Popup

See Also:

"Tuning Search Performance"

Basic Search

A search string can consist of one or more words. It is case-insensitive. Clicking the Search button returns all matches for that search string. You can reorder the way results are presented using the Group by and Sort by lists.

Oracle SES applies stemming to the query term. Stemming expands the term to other terms that share the same root. For example, [banks] returns documents containing the word banks, banking, or bank. Oracle SES uses stemming based on the language of query, which is determined by the language of browser in the default query application, or it is input by the caller in the query API. Implicit stemming expansion applies on individual terms in term search, in proximity search, and in attribute shortcut search for STRING attribute. Implicit stemming expansion does not apply to phrase search, and it can be turned off by enclosing the term in double quotes.

Oracle SES also performs implicit alternate words expansion. When an alternate word pair is expanded, Oracle SES shows the "do you mean ..." message. The Web Services API outputs the alternate keyword. Implicit alternate words expansion is applied only to single term and phrase search.

By default Oracle SES performs implicit alternate words expansion; that is, Oracle SES shows the "do you mean ..." message for alternate word pairs and the Web Services API outputs the alternate keyword. However, if you select the Auto-Expand option in the administration tool, then Oracle SES also automatically includes the alternate word or phrase as part of the search. For example, if the alternate word pair is "RAC" and "Real Application Clusters", then a query for "RAC" returns documents containing "RAC" or "Real Application Clusters".

See Also:

"Alternate Words"

If the search administrator has turned on spellchecker, then Oracle SES only gives suggestions for term search, phrase search, and proximity search. Spellchecker does not give suggestions for terms or phrases in attribute shortcut search.

The results can include the following links:

Cached: The cached HTML version of the document.

Links: Pages that link to and from this document.

Source Group: This link leads to Browse source groups.

Any links on top of the search text box are source groups. Clicking a source group restricts the search to that group.

The following table describes rules that apply to the search string. Text in square brackets represents characters entered into the search.

Table 3-1 Search String Rules

Rule	Description
Single term search	Enter one term finds documents that contain that term. For example, [Oracle] finds all documents that contain the word Oracle anywhere in that document. Any two searchable items (including term, phrase, attribute shortcut, and proximity search) can be written together in a query with a white space in between and the AND operator applies. The operator [&] also explicitly denotes an AND relationship. For example, [oracle text] and [oracle & text] both return documents containing oracle and text.
Phrase search ["..."]	Put quotes around a set of words to find documents that contain that exact phrase. Oracle SES does not apply implicit stemming expansion to a query phrase, but it can apply explicit term expansion to terms in a phrase. All operators except term expansion operators in a phrase are not treated as valid operators but as normal special characters. For example, [oracle "RAC performance"] returns documents containing oracle and the phrase "RAC performance". Documents containing the stemming form "RAC performances" are not returned. (There is no implicit stemming expansion on either term.) The query ["sec*re search"] returns documents with the phrase "secure search". The query ["sec^re search"] returns documents with the phrase "sec re search".
Attribute shortcut search [attribute_nume:attribute_value]	Search on attributes with an attribute name, a colon (:), and then the value to be searched. Implicit stemming is applied to the attribute value term. You can specify operators as options. When no operator is specified, Oracle SES uses Contains for STRING attributes and Equals for NUMBER and DATE attributes. For example, [DocVersion:>1] returns documents that have number attribute Docversion where attribute value is larger than 1. The query [title:"oracle text"] returns documents with the phrase "oracle text" in the title attribute. The query [oracle \| title:S*S] returns documents with the term oracle or SES in the title attribute. The query [title:^oracle] has the same effect as [title:oracle]. The contains [^] operator applies only to the STRING attribute. Equals [=] returns documents with an attribute equaling the query with case-insensitivity. For example, [title:="oracle text"] returns documents whose title equals "oracle text". It applies to all three attributes. Lessthan and narrower terms [<] return documents with an attribute value less than or earlier than the query value. For example, [DocVersion:<2] returns documents that have number attribute Docversion and where the attribute value is less than 2. They apply to all three attributes. Lessthanequals [<=] applies to NUMBER and DATE attributes. Greaterthan and broader terms [>] return documents with an attribute value greater than or later than the query value. They apply to all three attributes. Greaterthanequals [>=] applies to NUMBER and DATE attributes. For example, [price:>=10] returns documents whose price attribute value is larger than or equal to 10. The query [lastmodifieddate:>=12/23/2006] returns documents whose lastmodifieddate attribute value is on or after December 23, 2006. See Also: "Searching on Date Attributes" and "Advanced Search"
Proximity search ["..."~]	A proximity search specifies the maximum distance within which multiple terms occur. A proximity search must have the search terms in double quotes. When the maximum spanning distance is not specified, Oracle SES applies a default window of 100 terms. The maximum number is 100. When a value larger than 100 is specified, Oracle SES treats it as 100. For example, ["ses performance"~10] returns documents with terms SES and performance within any 10 terms spanning windows. The query ["ses performance"~] returns documents containing the terms SES and performance within any 100 terms spanning windows.Implicit stemming expansion is applied to each term in proximity search. Term expansion operators can be applied to terms in proximity search.
Fuzzy [...~] search	Put the operator (~) at the end of a single term returns documents that contain terms similar to the query term. For example, [hallo~] returns documents containing term hello. The query [specifi*tion~] returns documents containing the term specification. Note: If a single term enclosed in double quotes is followed by ~, then the query is not a proximity search but a fuzzy search. The query ["parformance"~] returns documents containing the term performance.
Thesaurus-based search Synonym [~...] search Narrower term [<] search Broader term [>] search	Thesaurus-based operators require that a thesaurus be loaded into Oracle SES. Put the operator [~] at the beginning of a term to return documents that contain the original query term or a synonym for it. For example, [title:~"RAC"] returns documents with RAC or the synonym real application clusters in the title. A synonym relationship is symmetric: real application clusters is a synonym of RAC, and RAC is a synonym of real application clusters. In attribute search, it applies only to the STRING attribute. The query [<"Northern California"] returns documents with the thesaurus-defined narrower term San Francisco or the original phrase Northern California. The query [product:>chair] returns documents whose product attributes contain the broader term furniture or the original term chair. Broader and narrower terms are symmetric. Specifying that furniture is a broader term of chair also implicitly specifies that chair is a narrower term of furniture. See Also: "Thesaurus-Based Search"
OR [\|] search	Use the OR [\|]operator to connect any two searchable items. For example, [oracle \| "RAC performance"~ ] returns documents with the term oracle or with the terms RAC and performance in any 100 terms spanning windows. The query [oracle \| title:SES] returns documents with the term oracle or SES in the title attribute.
Grouping ( ) search	Use parentheses ( ) to group query components together to change precedence of the binary logical operators AND and OR. The grouped query components must form a valid query. If the query string inside parentheses is not a valid query, then Oracle SES implicitly rewrites it to the closest valid query. For example, [(oracle \| database) sales] returns documents containing sales and containing either oracle or database. The query [(oracle \|) sales] returns documents containing oracle and sales. This is because [oracle \|] is not a valid query.
Wildcard matching [*] for multiple characters	Put the operator [] in the middle or end of a term for wildcard matching. It can be applied multiple times in one term. For example, [ora] finds documents that contain words beginning with ora, such as Oracle and orator. The query [title:ae] returns documents with the title containing words such as apple or ape. Multiple character wildcard expansion could result in too many results. For example, [a] could find too many results. Oracle SES throws an error to refine the queries. The wildcard operator [] is ineffective with the escape character [\] just before it. For example [Pro\c]. Wildcard matching cannot be used with Chinese or Japanese native characters.
Wildcard matching [?] for single characters	Put the operator (?) in middle or end of a term for wildcard matching for a single character. It can be applied multiple times in one term. For example, [orac?e] and [or?cl?] both return documents containing terms that replace ? with a single character, such as Oracle. The wildcard operator [?] is ineffective with the escape character [\] just before it. Wildcard matching cannot be used with Chinese or Japanese native characters.
Compulsory inclusion [+] search	Put the operator (+) at the beginning of any searchable item (including term, phrase, attribute, and proximity search) to require that the word be found in all matching documents. There should be no space between the [+] and the search term. For example, searching for [Oracle +Applications] only finds documents that contain the words Oracle and Applications. When compulsory inclusion search is used with the OR (\|) operator, the compulsory inclusion operator does not have any effect. For example, searching for [text \| +database] returns documents containing the term text or database.
Compulsory exclusion [-] search	Put the operator (-) at the beginning of any searchable item (including term, phrase, attribute, and proximity search) to require that the word not be found in all matching documents. It can be a single word or a phrase, but there should be no space between the [-] and the token. For example, [oracle –applications] returns documents containing oracle but not containing applications. The query [oracle –"application server"] returns documents containing oracle but not containing the phrase "application server". The query [oracle –title:oracle] returns documents containing oracle but with the title not containing oracle. The query [oracle –"application server"~] returns documents containing oracle but not containing application and server in any 100 terms spanning window. The compulsory exclusion query cannot be the only query. For example, the query [-oracle] raises an error. Also, the compulsory exclusion query cannot be connected with the OR [\|] operator. For example, [oracle \| -database] raises an error.
Filetype search [filetype:filetype]	Use [filetype:filetype] after the search term to limit results to that particular file type. A search can have only one filetype. No operator is allowed in filetype shortcut search. For example, [documentation filetype:pdf] returns PDF format documents for the term documentation. The "filetype" shortcut must be lowercase, but the file type name is case-insensitive; that is, [documentation filetype:PDF] returns the same documents. The following file types are supported, with their corresponding mimetype: filetype string: mimetype ps: application/postscript ppt: application/vnd.ms-powerpoint, application/x-mspowerpoint doc: application/msword xls: application/vnd.ms-excel, application/x-msexcel, application/ms-excel txt: text/plain html: text/html htm: text/html pdf: application/pdf xml: text/xml rtf: application/rtf
Site search [site:host]	Use [site:host] after the search term to limit results to that particular site. For example, [site:www.oracle.com filetype:pdf] returns documents from www.oracle.com in PDF format. The "site" shortcut must be lowercase, but the host name is case-insensitive; that is, [site:www.Oracle.com filetype:pdf] returns the same documents. Oracle SES only supports exact host matching. The query [site:*.oracle.com] does not work.
Group [sg:source group] search	Use [sg:source group] to limit results to that particular source group. For example, [sg:intranet] returns documents in the intranet source group. The "sg" shortcut must be lowercase, but the source group name is case-insensitive; that is, [sg:IntraNet] returns the same documents. In federated search, the source group names are the source groups in the local (broker) node. If the local source groups contain federated sources, then Oracle SES translates the local source group name to the federated source group name by changing the query, which is then sent to federated source for results.

Note:

Oracle incorporates KW IC (keyword in context) as part of the search result. This has a size restriction of 4k. That is, if the searched keyword appears in the first 4k of a document, then the KWIC is shown for the search result. If the keyword appears after the first 4k, then no KWIC is shown.

Browse Source Groups

Source groups are groups of sources that can be searched together. A source group consists of one or more sources, and a source can be assigned to multiple source groups. Source groups are defined on the Search - Source Groups page. Groups, or folders, are only generated for Web, e-mail, and OracleAS Portal source types.

On the basic Search page, users can browse source groups that the administrator created. Click a source group name to see the subgroups under it, or drill down further into the hierarchy by clicking a subgroup name.To view all the documents under a particular group, click the number next to the source group name. You can also perform a restricted search in the source group from this page.

The source hierarchy lets end users limit search results based on document source type. The hierarchy is generated automatically during crawl time.

Searching on Date Attributes

Date attribute values on the result list are shown in Greenwich Mean Time (GMT). For example, when you crawl a document on your local computer with a last modified date value of "Sep 13 2007 20:30:00 PDT", the Oracle SES crawler converts this date value to the corresponding GMT date value, which is "Sep 14 2007 3:30:00 GMT". These two values represent the same instance in time, but Oracle SES only displays the date (not the time or time zone). Therefore, the last modified date displayed in the result list is Sep 14 2007, and not Sep 13 2007.

A search using the lastModifiedDate attribute should be based on GMT date value. To search this document based on last modified date, enter the GMT date value as the last modified date attribute value. For example, lastModifiedDate=09/14/2004. The query lastModifiedDate=09/13/2004 will not return this document.

Changing lastModifiedDate to the Local Time Zone

To see lastModifiedDate in your local time zone, update the search.properties file located in the $ORACLE_HOME/search/webapp/config directory. Set ses.qapp.convert_timezone=true and restart the Oracle SES middle tier with searchctl restart. The browser picks up your local time zone and lastModifiedDate is converted to your local time zone before displaying search results.

Note:

For date attribute search, the format must be mm/dd/yyyy.

Submit URL

The URL submission feature lets users submit URLs to be crawled and indexed. These URLs are added to the seed URL list for a particular source and included in the crawler search space. If you allow URL submission (on the Global Settings - Query Configuration page), then you must select the Web source to which submitted URLs will be added.

Note:

This feature is disabled on the Search page if no sources have been created.

Advanced Search

Oracle SES Advanced Search lets you refine searches in the following ways:

Search Documents in a Specific Language
Search in Specific Source Groups
Search by Attribute

Search Documents in a Specific Language

Oracle SES can search documents in different languages. Specifying a language restricts searches to documents that are written in that language. Use the language list box to specify a language.

Search in Specific Source Groups

If one or more source groups are defined, then corresponding check boxes appear when you select specific categories. You can limit your search to source groups by selecting those check boxes. If no source group is selected, then all documents are searched. If you select All, (that is, all source groups present), then the documents not in the selected groups (in the default group) will not be searched.

A source group represents a collection of documents. They are created by the Oracle SES administrator.

Search by Attribute

Oracle SES includes default attributes for every instance, such as title, description, and keywords. For example, the Author attribute is mapped to the From header in e-mail documents and the Author metatag in HTML documents. Search administrators also can create custom attributes.

To require that documents matching your query have specific attributes values, select a search attribute with an operator and a value on the Advanced Search page. Date format must be entered as MM/DD/YYYY format. Click Add more attributes to enter more than four search attribute values.

Oracle SES advanced search supports search within three types of attributes: STRING, NUMBER, and DATE. The following operators are supported with these attributes:

Table 3-2 SES Attribute Search Operators and Types of Attributes

Attribute	Contains (^)	Equals (=)	Synonym (~)	Lessthan, Narrower terms (<)	Lessthanequals (<=)	Greaterthan, Broader terms (>)	Greaterthanequals (>=)
String	Yes	Yes	Yes	Yes	No	Yes	No
Number	No	Yes	No	Yes	Yes	Yes	Yes
Date	No	Yes	No	Yes	Yes	Yes	Yes

The Equals operator returns a hit only if the attribute value that you enter exactly matches the attribute value in the document. The Contains operator returns a hit if the attribute value you enter matches any of the tokens in the attribute values in the document. The token must be an exact match; partial matches are not returned.

For example, suppose you have the following four documents being indexed:

Document	Author	Number of Tokens
doc1	"scott"
doc2	"scott tiger"	2 (scott, tiger)
doc3	"scottm tiger"	2 (scottm, tiger)
doc4	"scott.tiger@oracle.com"	4 (scott, tiger, oracle, com)

An attribute restricted search for "author equals scott" would return only doc1. But an attribute restricted search for "author contains scott" would return doc1, doc2, and doc4. There is no hit for doc3 because scott is only a partial match for scottm.

Oracle SES and Oracle Text

When a query is prefixed with 'otext::', Oracle SES determines that it is Oracle Text syntax query. This is supported in the default query application and in the Web services API.

Oracle Text query syntax and Oracle SES query syntax cannot be used together in the same query.

The following table shows query syntax comparison between Oracle Text and SES.

Table 3-3 Syntax Comparison Between Oracle Text and Oracle SES

Query	Oracle Text	Oracle SES
Term	otext::secure	secure
Phrase	otext::secure search otext::{secure search}	"secure search"
Proximity search	otext::secure ; search otext::near((secure, search),10)	"secure search"~ "secure search"~10
Attribute search	otext::oracle within title otext::(oracle & text) within title N/A for numeric and date attribute	title:oracle title:oracle & title:text lastmodifieddate:10/20/2006
AND operator	otext::secure & search	secure search
OR operator	otext::secure \| search	secure \| search
ACCUM and Weight	otext::secure10, search 5	N/A
Compulsory exclusion	otext::oracle ~apps	oracle -apps
Compulsory inclusion	N/A	oracle +apps
Grouping operator	otext::(rac \| {real application clusters}) & whitepaper	(rac \| "real application clusters") whitepaper
Stemming operator	otext::$feature	N/A (implicit stemming, turned off by using double quotes)
Multiple character wildcard Single character wildcard Fuzzy expansion	otext::feat%e otext::featu_e otext::?hallo	feat*e featu?e hallo~
Soundex	otext::!smythe	N/A
Theme search	otext::about(dogs)	N/A
Synonym search Narrower term search Broader term search	otext::syn(dog) otext::NT(dog) otext::BT(dog)	~dog <dog >dog
Query template	otext::<query><textquery>123</textquery></query>	N/A
Query relaxation	otext::<query><textquery><progression><seq>secure search</seq><seq>secure;search </seq><seq>secure & search</seq></progression></textquery></query>	N/A (implicitly done)
Highlight	otext::oracle highlight:search	N/A (implicitly done)

Notes:

A highlight query highlights terms in returned documents. The highlight query can be used only at the end of an Oracle text compatible query and prefixed by the string "highlight:". When no highlight query is specified, Oracle SES chooses highlight terms from original queries.

A thesaurus must be loaded for thesaurus-based (that is, synonym, broader or narrower term) searching.

The index should be changed to index themes for 'about' to work. For more information, see the Oracle Text Reference.

See Also:

Oracle Text Reference, available on OTN

Internationalization Support

This section lists possible internationalization issues with supported operators:

Proximity Search ["secure search"~10]: The term distance definition could be different for non-whitespace delimited languages, such as Japanese. The behavior of proximity search for those languages could be different.
Implicit stemming: This is available in English, German, French, Spanish, Italian, Dutch.
Wildcard search [feat*e] or [featu?e]: The term definition is different for nonwhitespace-delimited languages, such as Japanese. The behavior of wildcard expansion for those languages is different.
Fuzzy expansion [hallo~]: This is available in English, German, French, Chinese, Japanese, Spanish, Italian, and Dutch.

Thesaurus-Based Search

A thesaurus is a list of terms or phrases with relationships specified among them, such as synonym, broader term, and narrower term. Given a query term or phrase, query expansion can be done on these relationships.

A thesaurus contains domain-specific knowledge. You can build a thesaurus, buy an industrial-specific thesaurus, or use tools to extract a thesaurus from a specific corpus of documents.

A thesaurus named "DEFAULT" must be loaded to Oracle SES for thesaurus-based query expansion. If no thesaurus is loaded or if the specified term (or phrase) cannot be found in the loaded thesaurus, then there is no query expansion. Oracle SES only returns documents containing the original term (or phrase). The default expansion level is one.

Thesaurus-based expansion operators can be applied to attributes in attribute search. Because an attribute value is either a term or a phrase, the expansion has the same effect on a term or a phrase, except that the expansion in attribute shortcut search is restricted to attribute value search.

Uploading a Thesaurus

The thesaurus must be compliant with both the ISO-2788 and ANSI Z39.19(1993) standards. Thesaurus upload can be done on the command line utility or in the administration tool.

To upload a thesaurus:

searchadminctl -p <eqsys password> thesaurus -create –configFile configXML_file_path

where password is the eqsys user password and configXML_file_path is the file path for the Oracle SES XML configuration file that contains the thesaurus definition.

The following example has the thesaurus defined in a separate file (thesaurusFile defines the full URL of the thesaurus file):

<?xml version="1.0" encoding="UTF-8" ?>
<config productVersion="10.1.8.2">
<thesauruses>
      <thesaurus name="DEFAULT">
        <thesaurusFile>http://stacd27:7777/search/query/thesaurus.txt</thesaurusFile>
      </thesaurus>
</thesauruses>
</config>

The following example has the thesaurus content defined within the XML file:

<?xml version="1.0" encoding="UTF-8" ?> 
<config productVersion="10.1.8.2"> 
  <thesauruses> 
    <thesaurus name="DEFAULT"> 
      <thesaurusFile></thesaurusFile> 
      <thesaurusContent><![CDATA[ 
dog 
     BT mammal 
     NT domestic dog 
     NT wild dog 
     SYN canine 
]]> 
      </thesaurusContent> 
    </thesaurus> 
  </thesauruses> 
</config>

where thesaurus_file_path is the full file path of a thesaurus file to be uploaded.

Only one thesaurus can be loaded to an Oracle SES instance at a time. If another thesaurus is loaded to the Oracle SES instance, then it overwrites the previous one.

Note:

The encoding of the XML file for thesaurus configuration should be UTF-8, which is the Oracle SES default language setting. Make sure the NLS_LANG environment variable is set consistent with the XML file encoding before using the command line tool.

Exporting the Loaded Thesaurus

To export the loaded thesaurus:

searchadminctl -p <eqsys password> thesaurus -export default –configFile configXML_file_path

where configXML_file_path defines the location of the exported thesaurus file that is in the Oracle SES local file system only.

searchadminctl -p <eqsys password> thesaurus -list –configFile configXML_file_path

where configXML_file_path is the Oracle SES XML configuration file that contains the exported thesaurus content in the XML file.

Removing a Loaded Thesaurus

To delete the default thesaurus:

searchadminctl -p <eqsys password> thesaurus -delete default

A thesaurus term can be added to or removed from the loaded thesaurus. The synonym, broader term, or narrower term relationship can be added to or removed from two existing thesaurus terms.

Alternate Words

Oracle SES maintains an alternate word list containing word pairs. The two words in the word pair can be used alternatively. The semantic similarity between the two words is higher than that between two synonyms. Oracle SES uses alternate words in the following places:

To provide a suggestion for a query. For example, the query [RAC] can result in the suggestion: "did you mean 'real application clusters'?"
To provide an implicit expansion based on alternate words. For example, the query [RAC] returns documents containing RAC or real application clusters.

Oracle SES provides an option for each alternate word pair to do implicit expansion for this pair.

Uploading a List of Alternate Words

The uploaded list of alternate words is stored in a text file, which contains multiple lines of the following XML segment, each representing one alternate pair.

<?xml version="1.0" encoding="UTF-8" ?>
<config productVersion="10.1.8.2">
<altWords>
 <altWord name="oes">
   <alternate>Oracle Secure Enterprise Search</alternate>
   <autoExpansion>true</autoExpansion>
 </altWord> 
</config>

If autoExpansion is not specified, then Oracle SES does not do implicit expansion.

To upload a list of alternate words:

searchadminctl -p <eqsys password> altWord -upload –configFile alternateword_file_path

where alternateword_file_path is the full file path of the XML configuration file where alternate words are stored.

Only one list of alternate words can be loaded to a SES instance at any time. If another list of alternate words is loaded to the Oracle SES instance, then it overwrites the previous one.

Export the Alternate Words List

The exported alternate word list is in a plain text file. It can be uploaded to Oracle SES without any modification.

To export a list of alternate words:

searchadminctl -p <eqsys password> altWord -list –configFile alternateword_file_path

where alternateword_file_path is the full file path of the XML configuration file where exported alternate words are stored.

To export one alternate word:

searchadminctl -p <eqsys password> altWord –export oes –configFile alternateword_file_path

If the configFile is not provided, then the results are shown in the command line window.

You can change the auto-expansion flag for each alternate word. After upgrading from a earlier releases of Oracle SES, auto-expand is turned off for all entries.

Removing Alternate Words

To remove the alternate word for "oes":

searchadminctl -p <eqsys password> altWord –delete oes

To remove all alternate words from Oracle SES:

searchadminctl -p <eqsys password> altWord –deleteAll

Adding Alternate Word Pairs

To add an alternative word for "oes":

searchadminctl -p <eqsys password> altWord –create oes –configFile alternateword_file_path

Restricted Search and the Browse Popup

On the Basic search page, users can browse source groups that the administrator created by clicking the Browse link next to the Search box. This will display a Browse popup window containing a tree view of the browse information hierarchy. Users can click an expand icon (>) to see the infosource nodes under it, or drill down further by clicking additional expand icons.

Clicking the document count number of a browse node refreshes the entire page to show the set of all documents within that infosource node. Clicking the node label causes the list of source groups above the Search box to be replaced with the message "Search within: <infosource node name>".

The infosource node in the browse tree, along with its subtree, are highlighted to indicate which node is selected to search within.

The search result set is not immediately replaced. Instead, users must click Search to perform a restricted search within the selected infosource node. Only one infosource node may be selected at a time for "search-within."

To restrict search to a set of top-level source groups (rather than an infosource node), select multiple source groups within the Browse Tree popup. The source group will have a check mark next to it, and the list of selected source group names are displayed above the Search box. Again, users must click Search to perform a restricted search within the selected source groups.