1 Introduction to Oracle Secure Enterprise Search

This chapter describes the basic components of Oracle Secure Enterprise Search: the sources, crawler, and user interfaces.

Special-Use Licensing

Oracle Secure Enterprise Search (Oracle SES) is a complete stacked application. Oracle Database 11g Release 2 (11.2.0.3) Enterprise Edition (EE) is installed with Oracle SES. Use of Oracle Database EE is restricted to storing and managing the search index, metadata, cache, and Oracle SES configuration information. Oracle WebLogic Server 10g (10.3.6.0) is included with Oracle SES. This embedded version is provided solely to run the Oracle SES user interfaces and APIs.

Use of the Oracle SES home software is restricted to supporting the Oracle SES database repository. No other databases created using the Oracle SES executables are supported. Oracle SES connectors listed on the Oracle price list may be licensed separately for use with the Oracle SES installation.

Some connectors shipped with Oracle SES require additional licensing fees. Contact Oracle sales for details.

Overview of Oracle Secure Enterprise Search

Oracle Secure Enterprise Search enables secure, high-quality, easy-to-use search across all enterprise information assets. Key features include:

  • The ability to search and locate public, private, and shared content across intranet Web servers, databases, files on local disks or file servers, IMAP e-mail, document management systems, applications, and portals

  • Highly secure crawling, indexing, and searching

  • A simple, intuitive search interface leading to an excellent user experience

  • Excellent search quality, with the most relevant items for a query shown first, even when the query spans diverse public and private data sources

  • Analytics on search results and usage patterns

  • Sub-second query performance

  • Ease of administration and maintenance, leveraging existing IT expertise

See Also:

  • Oracle Secure Enterprise Search Installation Guide for requirements, tips, and information on getting started using Oracle SES

  • Oracle Technology Network (OTN) for updated information on the known issues, code samples, and best practices:

    http://www.oracle.com/technetwork/search/oses/overview/index.html

  • Oracle Secure Enterprise Search Release Notes for version information and known issues

Source Types

A collection of information is called a source. Each source has a type that identifies where the information is stored, such as on a Web site or in a database table. Oracle SES provides several built-in source types and an architecture for adding new, custom types.

Additionally, Oracle SES provides access to more third-party data repositories than any other enterprise search engine, without requiring you to write any additional code. Although these data sources are classified as user-defined source types, they are available in the same way as the built-in source types. This guide organizes these user-defined source types into content management sources, collaboration sources, and applications sources.

Oracle SES also provides authorization cache sources for facilitating access to secure data.

Built-in Sources 

  • Web: Represents the content on a specific Web site. Web sources facilitate maintenance crawling of specific Web sites.

  • Table: Represents content in a table or view in Oracle Database.

  • File: The set of documents that can be accessed through the file system protocol.

  • E-mail: Derives content from e-mails sent to a specific e-mail address. When Oracle SES crawls an e-mail source, it collects e-mail from all folders set up in the e-mail account, including Drafts, Sent Items, and Trash e-mails.

  • Mailing list: Derives its content from e-mails sent to a specific mailing list.

  • OracleAS Portal: Lets you search across multiple OracleAS Portal repositories, such as Web pages, files on disk, and pages on other OracleAS Portal instances.

  • Federated: Enables you to share content across multiple Oracle SES instances.

Content Management Sources 

  • EMC Documentum Content Server

  • Microsoft SharePoint

  • Oracle Content Database

  • Oracle Content Server (formerly Stellent Content Server)

You may need to install client libraries and obtain a license from the vendor for some content sources to work. For example, EMC Documentum requires installation of a compatible version of Documentum Foundation Classes (DFC), which is a Java library, on the computer running Oracle SES. Oracle SES does not ship with DFC.

Collaboration Sources 

  • IBM Lotus Notes

  • Microsoft Exchange

  • Microsoft NT File Systems (NTFS)

  • Oracle Calendar

  • Oracle Collaboration Suite E-Mail

Oracle Applications Sources 

  • Database

  • Oracle E-Business Suite

  • Siebel

  • Oracle Fusion

  • Oracle WebCenter

Authorization Sources 

  • User Authorization Cache

  • Federated User Authorization Cache

See Also:

Oracle Secure Enterprise Search Release Notes for a list of supported platforms

Oracle Secure Enterprise Search Components

Oracle SES includes the following components:

Oracle Secure Enterprise Search Administration GUI

The Oracle Secure Enterprise Search Administration GUI enables you to manage and monitor Oracle SES components using a browser-based interface. These are among the tasks that you perform:

  • Define sources and crawling scope

  • Configure the search application

  • Monitor crawl progress and search quality

  • Customize search results

Oracle Secure Enterprise Search Crawler

Oracle SES uses a crawler to collect data from the sources. The Oracle SES crawler is a Java process activated by a schedule. When activated, the crawler spawns a configurable number of processor threads that fetch information from various sources and index the documents. This index is used for searching sources.
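The fetch-and-index behavior described above can be pictured with a small sketch. This is not Oracle SES code: the `fetch` and `crawl` functions, the in-memory `repository`, and the `num_threads` parameter are all hypothetical names chosen for illustration, and a simple inverted index stands in for the real search index.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(url, repository):
    """Hypothetical fetch step: read a document from an in-memory 'source'."""
    return url, repository[url]


def crawl(urls, repository, num_threads=4):
    """Fetch documents with a configurable number of worker threads,
    then index each fetched document (term -> set of document URLs)."""
    index = {}
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Worker threads fetch in parallel; results come back in input order.
        for url, text in pool.map(lambda u: fetch(u, repository), urls):
            for term in text.lower().split():
                index.setdefault(term, set()).add(url)
    return index


# Assumed sample data, not taken from the product.
repository = {
    "doc1": "Quarterly sales report",
    "doc2": "Sales forecast",
}
index = crawl(list(repository), repository, num_threads=2)
print(sorted(index["sales"]))  # ['doc1', 'doc2']
```

In this sketch the thread count plays the role of the configurable number of processor threads; the real crawler also handles scheduling, link mapping, and document filtering, which are omitted here.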

The crawler maps links and analyzes relationships. Whenever the crawler encounters embedded non-HTML or non-textual documents during crawling, it automatically detects the document type, then filters and indexes the document.

Figure 1-1 shows the crawler in relation to other Oracle SES components and a variety of data sources.

Figure 1-1 Crawler Collecting Information for Oracle SES

Oracle Secure Enterprise Search APIs

Oracle Secure Enterprise Search provides several APIs. For example, with the Web Services API, you can integrate Oracle SES search capabilities into your search application. You can also customize the default Oracle SES ranking to create a more relevant search result list for your enterprise or configure clustering for customized applications.

The Crawler Plug-in API enables you to create a custom secure crawler plug-in (or connector) to meet your requirements. The Document Service API accepts input from documents and performs some operation on it. For example, you could create a document service for auditing or to show custom metatags.

Oracle Secure Enterprise Search Features

Information in an enterprise can be spread across Web pages, databases, mail servers or other collaboration software, document repositories, file servers, and desktops. Oracle SES searches all your data through the same interface. Oracle SES is fully globalized and works with many languages including Chinese, Japanese, Korean, Arabic, and Hebrew.

This section introduces a few of the features in Oracle SES.

See Also:

Chapter 4, "Understanding Crawling" for more features relating to the crawler

Secure Search

Much of the information within an organization is publicly accessible. Anyone is allowed to view it. Therefore, it is relatively easy for a crawler to find and index that information.

However, there are other sources that are protected. These protected sources might be viewable only by certain users or groups of users. For example, while users can search in their own e-mail folders, they should not be able to search anyone else's e-mail.

For protected sources, the Oracle SES crawler indexes documents with the proper access control list. When end users perform a search, only documents that they have privileges to view are returned.
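As a rough illustration of this idea, the sketch below stores an access control list with each indexed document and filters results at query time. The `IndexedDoc` class, the `search` function, and the convention that an empty ACL means "public" are assumptions made for this example, not the Oracle SES data model.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedDoc:
    title: str
    # Principals (users or groups) allowed to view the document.
    # Assumption for this sketch: an empty set means the document is public.
    acl: set = field(default_factory=set)


def search(index, term, user, groups):
    """Return titles that match the term and that the user may view."""
    principals = {user} | set(groups)
    return [
        doc.title
        for doc in index
        if term.lower() in doc.title.lower()
        and (not doc.acl or doc.acl & principals)
    ]


index = [
    IndexedDoc("Budget memo", {"finance"}),
    IndexedDoc("Budget FAQ"),
    IndexedDoc("Budget draft", {"alice"}),
]
print(search(index, "budget", "bob", ["finance"]))  # ['Budget memo', 'Budget FAQ']
print(search(index, "budget", "alice", []))         # ['Budget FAQ', 'Budget draft']
```

The point of the sketch is that the ACL is captured at crawl time and enforced at search time, so two users issuing the same query can receive different result lists.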

Federated Search

Oracle SES can search multiple Oracle SES instances with their own document repositories and indexes. It provides a unified framework to search the different repositories that are crawled, indexed, and maintained separately.

Federated search allows a single query to run across all Oracle SES instances. It aggregates the search results to show one unified result list to the user. User credentials are passed along with the query so that each federation endpoint can authenticate the user against its own document repository.
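The flow in the preceding paragraph can be sketched as follows. The endpoint functions, the `(score, title)` result shape, and the assumption that scores from different instances are directly comparable are all hypothetical; the sketch only shows one query fanning out with the user's credentials and the results being merged into a single ranked list.

```python
import heapq


def federated_search(endpoints, query, credentials, limit=10):
    """Send one query to every endpoint and merge the hits by score.

    endpoints: mapping of instance name -> callable(query, credentials)
               returning a list of (score, title) tuples (assumed shape).
    """
    merged = []
    for name, endpoint in endpoints.items():
        # Each endpoint receives the credentials and applies its own checks.
        for score, title in endpoint(query, credentials):
            merged.append((score, title, name))
    return heapq.nlargest(limit, merged)


def hr_instance(query, credentials):
    hits = [(0.4, "Holiday calendar")]
    if credentials.get("user") == "alice":  # endpoint-side authorization
        hits.append((0.9, "Leave policy"))
    return hits


def eng_instance(query, credentials):
    return [(0.7, "Build guide")]


endpoints = {"hr": hr_instance, "eng": eng_instance}
results = federated_search(endpoints, "policy", {"user": "alice"})
print([title for _, title, _ in results])
# ['Leave policy', 'Build guide', 'Holiday calendar']
```

A real deployment would also need to normalize scores across instances and query the endpoints concurrently; both are left out to keep the sketch short.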

Figure 1-2 illustrates the federation architecture and two options for an end user to connect through a browser to Oracle SES. Option 1 allows users to connect their browsers directly to Oracle SES using the end-user graphical interface. Option 2 retrieves results from Oracle SES through Web Services after arbitrary post-processing, such as changing the look-and-feel or embedding the results in a page. For this option, the browser connects to remote applications, which connect to the Web Services API.

Figure 1-2 Federation Architecture

Extensible Crawler Plug-in Framework

Oracle SES provides an extensible crawler plug-in framework that lets you crawl and index proprietary document repositories. The Crawler Plug-in API enables you to create a custom secure crawler plug-in to meet your requirements. You can also create a custom identity plug-in and a custom authorization plug-in for crawling a data source. You can also update or delete a custom plug-in.

Oracle SES 11.2.2.2 supports an open architecture; that is, an Oracle SES application can have multiple middle tier components across multiple systems. You must ensure that the custom plug-in jar files are accessible to all the middle tiers of the Oracle SES instance.

Thus, if all the middle tiers have a shared file system, then the custom plug-in jar files must be placed at a location which is accessible to all the middle tiers. If all the middle tiers do not have a shared file system, then you must create the same directory structure across all the middle tiers and copy the custom plug-in jar files to all these locations. For example, if the path specified for a custom plug-in jar file is /app/install/plugins/custom_plugin.jar, then the same directory structure must be created for all the middle tiers.
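One way to check this requirement before registering a plug-in is a small script like the sketch below. It assumes, purely for illustration, that each middle tier's file system is reachable from the machine running the script under its own root directory (for example, a mount point); `tiers_missing_jar` and `tier_roots` are hypothetical names, not part of Oracle SES.

```python
from pathlib import Path, PurePosixPath


def tiers_missing_jar(tier_roots, jar_path):
    """Return the middle-tier roots under which jar_path does not exist.

    tier_roots: directories standing in for each middle tier's file system
                (an assumption of this sketch, e.g. local mount points).
    jar_path:   the absolute path configured for the custom plug-in jar,
                such as /app/install/plugins/custom_plugin.jar.
    """
    # Strip the leading "/" so the path can be re-rooted under each tier.
    relative = PurePosixPath(jar_path).relative_to("/")
    return [
        root
        for root in tier_roots
        if not Path(root).joinpath(*relative.parts).is_file()
    ]
```

An empty list means every middle tier has the jar at the same path; any returned root identifies a middle tier where the directory structure or the file still has to be created.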

Note:

In the earlier Oracle SES release (Oracle SES 11.1.2.2), a custom plug-in jar file had to be stored in the ses_home/search/lib/plugins directory. This is not required in Oracle SES 11.2.2.2. You can now store a custom plug-in jar file in any directory, but you must refer to it by its absolute file path in the Oracle SES application.

The following are the various custom plug-ins that need to be accessible across all the middle tiers for an Oracle SES instance:

  • Source Type plug-in

  • Authorization plug-in

  • Identity plug-in

  • Document Service plug-in

  • URL Rewriter plug-in

For example:

  • Document service for "Default summarizer Doc service" uses a stop words directory. The default stop words directory is ses_home/search/lib/plugins/doc/extractor/phrasestopwords.

  • Document Service for "ImageDocumentService" uses an attributes configuration file. Its default directory location is ses_home/search/lib/plugins/doc/ordim/config/attr-config.xml.

  • Database source type uses an XML query file.

  • Federated User Authorization Cache source type uses an XML remote cache configuration file.

Note:

The Documentum Content Server source type uses the following property files:

  • search/lib/plugins/dcs/dfc.properties

  • search/lib/plugins/dcs/dcsothers/dfc.properties

As these files are used by Documentum APIs, they must be stored in the directory ses_home/search/lib/plugins.

Oracle Secure Enterprise Search Language Model

Different components of Oracle SES, such as the crawler, query application, Administration GUI, Administration API, Query API, and ODL, support different sets of languages.

This section provides information about the languages supported by these components.

Languages Supported by Crawler

An Oracle SES data source can contain documents in different languages. For example, one document can be in English while another is in Japanese. Oracle SES associates each document with a single language.

A data source can explicitly specify a language for a document. Oracle SES recognizes only document languages that are specified using ISO 639-1 codes, and it does not recognize every language in that standard.

The Administration GUI can also be used to set one of the following document languages as the default language for a data source: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, and Turkish.

If a default language is not set for a data source using the Administration GUI, then the crawler can automatically detect a document's language by reading its metadata. The crawler can detect the following document languages automatically: Arabic, Chinese, Danish, Dutch, English, French, German, Greek, Japanese, Korean, Italian, Lao, Portuguese, Spanish, Thai, and Tibetan.

Note:

The automatic language detection feature of Oracle SES is enabled by default for all data source types except the file data source type. You can disable automatic language detection using the Administration GUI. See "Language Detection" for more information.

The crawler determines a document's language by performing the following checks in the order listed:

  • From the HTTP response header Content-Language for a Web data source

  • From the HTML Language meta tag, for example, <meta name="Language" content="en">

  • From the HTML content-language meta tag, for example, <meta http-equiv="content-language" content="fr">

  • From the LANGUAGE column of a Table data source

  • From the language specified for a crawler Plug-in

  • Using the automatic language detection feature (if this feature is enabled)

  • From the default language specified for a data source
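The ordered checks above amount to a "first value found wins" chain. The sketch below illustrates that precedence only; the function name, its parameters, and the `detect` callback are hypothetical and do not correspond to an Oracle SES API.

```python
def resolve_language(http_header=None, meta_language=None,
                     meta_content_language=None, table_column=None,
                     plugin_language=None, detect=None, source_default=None):
    """Return the first language found, following the order of the checks.

    detect: optional zero-argument callable standing in for automatic
            language detection; pass None when the feature is disabled.
    """
    explicit = [
        http_header,            # HTTP Content-Language response header
        meta_language,          # <meta name="Language" ...>
        meta_content_language,  # <meta http-equiv="content-language" ...>
        table_column,           # LANGUAGE column of a Table data source
        plugin_language,        # language specified for a crawler plug-in
    ]
    for language in explicit:
        if language:
            return language
    if detect is not None:
        detected = detect()
        if detected:
            return detected
    return source_default       # default language of the data source


print(resolve_language(meta_language="en", source_default="fr"))     # en
print(resolve_language(detect=lambda: "ja", source_default="fr"))    # ja
print(resolve_language(source_default="fr"))                         # fr
```

Note that in this ordering the data source default is only a last resort: it applies when no explicit language is present and automatic detection is disabled or finds nothing.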

Languages Supported by the Query Application and Query API

The default query application supports the following languages: Arabic, Catalan, Chinese, Czech, Danish, German, Greek, English, Spanish, Finnish, French, Hungarian, Italian, Japanese, Korean, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Swedish, Thai, and Turkish.

The default query application obtains the client-side language, territory, and character-set information from the Web browser. If the query contains no CJK (Chinese, Japanese, or Korean) characters, the language information is passed unchanged to the Query API. Otherwise, the language passed to the Query API is set to the most likely of the three CJK languages.
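The text does not say how the "most likely" CJK language is chosen, so the following sketch uses an assumed heuristic based on Unicode blocks: Hangul implies Korean, kana implies Japanese, and Han ideographs alone fall back to the browser language if it is CJK, otherwise Chinese. The function name and the heuristic are illustrative assumptions, not the product's actual logic.

```python
def query_language(query, browser_language):
    """Pick the language to pass along with a query (assumed heuristic)."""
    has_hangul = any("\uac00" <= ch <= "\ud7a3" for ch in query)  # Hangul syllables
    has_kana = any("\u3040" <= ch <= "\u30ff" for ch in query)    # Hiragana, Katakana
    has_han = any("\u4e00" <= ch <= "\u9fff" for ch in query)     # CJK ideographs

    if has_hangul:
        return "ko"
    if has_kana:
        return "ja"
    if has_han:
        # Ideographs alone are ambiguous; trust a CJK browser setting if present.
        return browser_language if browser_language in ("zh", "ja", "ko") else "zh"
    # No CJK characters: pass the browser language through unchanged.
    return browser_language


print(query_language("enterprise search", "fr"))  # fr
print(query_language("검색", "en"))                # ko
print(query_language("搜索", "en"))                # zh
```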

A query in Oracle SES can specify only one language. Depending on the Oracle SES configuration, a document that is not in the language specified in the query is either not searched at all or is assigned a lower relevancy score in the search results.

Note:

The language-specific behavior of the default query application may change in future releases of Oracle SES without notice.

The Oracle SES Query API supports different languages for different types of search, as described in Table 1-1.

Table 1-1 Languages Supported by Query API for Different Types of Search

Search Type                 Languages Supported
--------------------------  --------------------------------------------------
Stemming search             English, Dutch, French, German, Italian, Spanish
Fuzzy search                English, Dutch, French, German, Italian, Korean,
                            Spanish, Chinese, and Japanese. For Chinese and
                            Japanese, only the VGRAM search is supported. The
                            VGRAM search is insensitive to word boundaries.
Composite word search       German
Alternate spelling search   German

Languages Supported by the Administration GUI and Administration API

The Administration GUI supports the following languages: Chinese, English, French, German, Italian, Japanese, Korean, Portuguese, and Spanish.

You can translate the search attribute names to the following languages using the Administration GUI and Administration API: English, Arabic, Brazilian Portuguese, Catalan, Czech, Danish, Dutch, Finnish, French, German, Greek, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Romanian, Russian, Simplified Chinese, Slovak, Spanish, Swedish, Thai, Traditional Chinese, and Turkish.

Languages Supported by ODL Logs

Oracle SES uses Oracle Diagnostic Logging (ODL) to write warning and error messages to log files. Oracle SES supports the following languages for these messages: English, German, Spanish, French, Italian, Japanese, Korean, Brazilian Portuguese, Simplified Chinese, and Traditional Chinese.