8 Oracle Secure Enterprise Search Advanced Information

This chapter contains the following topics:

Setting Up Federated Sources
Adding Suggested Content in Search Results
Customizing the Appearance of Search Results
Configuring Clustering in Search Results
Configuring Top-N Documents and Group/Sort Attributes
Customizing the Relevancy of Search Results
Using Backup and Recovery
Understanding Attributes
Troubleshooting Sources
Tuning Crawl Performance
Tuning Search Performance
Oracle SES Command Line Tools
Turning On Debug Mode
Monitoring Oracle Secure Enterprise Search
Integrating with Google Desktop for Enterprise
Accessing Application Server Control Console on Oracle SES

Setting Up Federated Sources

Secure federated search enables searching secure content across distributed Oracle SES instances. An end user is authenticated to the Oracle SES federation broker. Along with querying the secure content in its own index, the federation broker federates the query to each federation endpoint on behalf of the authenticated end user. This mechanism requires that user identities be propagated securely between the Oracle SES instances, which is an important consideration when building a secure federated search environment. This section explains how Oracle SES performs secure federation.

Federation Trusted Entities

When performing a secure search on a federation endpoint, the federation broker must pass the identity of the logged-in user to the federation endpoint. If the endpoint instance trusts the broker instance, then the broker instance can proxy as the end user. To establish this trust relationship, the Oracle SES instances must exchange a secret. This secret is exchanged in the form of a trusted entity.

A trusted entity consists of two values: entity name and entity password. Each Oracle SES instance can have one or more trusted entities that it can use to participate in secure federated search. (A trusted entity is also referred to as a proxy user.)

An Oracle SES instance can connect to an identity management (IDM) system for managing users and groups. An IDM system can be an LDAP-compliant directory, such as Oracle Internet Directory or Active Directory.

Each trusted entity can be authenticated by either an IDM system or by the Oracle SES instance directly, independent of an IDM system. For authentication by an IDM system, check the box Use Identity Plug-in for authentication when creating a trusted entity. In this case, the entity password is not required. This is useful when there is a user configured in the IDM system that can be used for proxy authentication. Make sure that the entity name is the name of the user that exists in the IDM system and is going to be used as the proxy user.

For authentication of the proxy user by Oracle SES, clear (uncheck) the box Use Identity Plug-in for authentication when creating a trusted entity. Then use any name and password pair to create a trusted entity.

Use Authentication Attribute to specify the format of the user credential that the Oracle SES federation endpoint expects for this trusted entity during proxy authentication. This is useful when a federation broker cannot send the user identity in the default authentication format used on the federation endpoint. In that case, the identity plug-in registered on the federation endpoint must be able to map the value it receives in this attribute to the default authentication format used on the federation endpoint.

To use a trusted entity, pass the entity name and entity password as the user name and password in the Web Services API proxyLogin() call. The identity plug-in can validate the password instead of storing it. When a request is sent for proxyLogin(), Oracle SES calls the identity plug-in (which returns the call) to authenticate the entity. The proxyLogin() call must supply one of the valid trusted entities registered in the federation trusted entities.
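The trust check described above can be pictured with a small sketch. This is not the Oracle SES implementation or its Web Services API: the `TrustedEntity` class, the `proxy_login` function, and the `plugin_authenticate` callback are hypothetical names used only to illustrate how an endpoint might decide whether to let a caller act as an end user.

```python
import hmac
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class TrustedEntity:
    """Hypothetical record for one federation trusted entity."""
    name: str
    password: Optional[str] = None      # not stored when the identity plug-in authenticates
    use_identity_plugin: bool = False


def proxy_login(entities: Dict[str, TrustedEntity],
                entity_name: str,
                entity_password: str,
                end_user: str,
                plugin_authenticate: Callable[[str, str], bool]) -> Optional[str]:
    """Return end_user if the caller is a registered trusted entity, else None."""
    entity = entities.get(entity_name)
    if entity is None:
        return None                      # unknown entity: no trust relationship
    if entity.use_identity_plugin:
        # Assumption: the plug-in validates the password against the IDM system.
        ok = plugin_authenticate(entity_name, entity_password)
    else:
        ok = entity.password is not None and hmac.compare_digest(
            entity.password.encode(), entity_password.encode())
    return end_user if ok else None
```

In this sketch, a successful call simply returns the end user name that the broker asked to act as; a real endpoint would go on to run the search with that user's privileges.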

To perform secure federated search, both the broker and the endpoint instances involved in the federation must have identity plug-ins registered. The identity plug-ins may or may not talk to the same IDM system.

Note:

All user names should be unique across all Oracle SES instances. If not, then there should be a clear mapping for the users to make them unique across all IDMs involved in the secure federation.

Carefully specify the following parameters under the section Secure Federated Search when creating a federated source on the broker instance:

Remote Entity Name: This is the name of the federation trusted entity on the federation endpoint. It is provided by the administrator of the endpoint instance.
Remote Entity Password: This is the password of the federation trusted entity on the federation endpoint. It is provided by the administrator of the endpoint instance.
Search User Attribute: This attribute identifies, and is used to authenticate, a user on the federation endpoint instance. This parameter is an optional parameter, except when the broker and endpoint use different authentication attributes to identify end users. (For example, on the broker instance, an end user can be identified by user name; on the endpoint instance, the end user can be identified by e-mail address.)

The identity plug-in registered on the broker instance should be able to map the user identity to this attribute based on the authentication attribute used during the registration of the identity plug-in. If this attribute is not specified during creation of the federation source, then the user identity on the broker instance is used to search on the endpoint instance.

Note:
If these parameters are not specified during the creation of the federated source, then the federated source is treated as a public source (that is, only public content is available to the search users).
Secure Oracle HTTP Server-Oracle SES channel: Because any Oracle HTTP Server can potentially connect to the AJP13 port on the Oracle SES instances and masquerade as a specific person, either the channel between the Oracle HTTP Server and the Oracle SES instance must be SSL-enabled or the entire Oracle HTTP Server and Oracle SES instance computers must be protected by a firewall.

Notes:

In a secure federated search environment, the broker or the endpoint instance might or might not be using single sign-on (SSO). However, the Web service URL of the endpoint should not be behind SSO.
Oracle strongly recommends that you SSL-protect the channel between Oracle HTTP Server and Oracle SES for secure content. The endpoint instance should be SSL-enabled, or you should be able to access the Web service using HTTPS.

See Also:

"Tips for Using Federated Sources"

Example Creating a Federated Source

This section describes the steps for setting up a federated source that connects to Active Directory.

Activate the Active Directory identity plug-in on both the endpoint and broker instances. For example, on the Global Settings - Identity Management Setup page, enter the following:
- Parameter Name: value
- Directory URL: ldap://ad.oracle.com:389
- Directory account name: administrator@ad.oracle.com
- Directory account password: ****
- Directory subscriber: dc=ad,dc=oracle,dc=com
- Directory security protocol: none
Create federation trusted entities on the endpoint instance. For example, log in to Oracle SES on the endpoint instance, navigate to the Global Settings - Federation Trusted Entities page, and enter the following:
- Entity Name: Entity name
- Entity Password: Password
Create a federated source on the broker side. For example, log in to Oracle SES on the broker instance, navigate to the Home - Sources page, select the source type as Federated, and enter the following:
- Source Name: Sourcename1
- Web Service URL: http://endpoint.cn.oracle.com:7777/search/query/OracleSearch
- Remote Entity Name: Entity name
- Remote Entity Password: Password
To browse the federated source on broker side, create a source group and then add the federated source to the group.

Customizing Federated Sources

On the Home - Sources - Customize Federated Source page, you can change the source name, Web Service URL, remote entity name and password, and search user attribute.

This section describes the other ways you can customize a federated source:

Route Queries to the Federated Source
Set Search Restrictions
Retrieve Attributes
Map Attributes

Route Queries to the Federated Source

Enter a filter rule, which sets conditions for routing queries to the federated source, on the Home - Sources - Customize Federated Source page. Filter rules can improve scalability. If no rule is defined, then the federation agent sends all queries to the federated source to perform the search. The rules are applied only against the search query filter. They are not applied when an end user enters the attribute shortcut query.

Each rule has an attribute, a colon (:) and an expression. Rules can be based on end user properties, such as name or e-mail address, and on query information, such as document language, author, or document modified date. For example, an identity attribute could be mail or dn. A query attribute could be author or lastmodifieddate.

Multiple rules for the source are joined together with the AND and OR operators. The attribute name and the operators are not case-sensitive. For example, the following rule defines that the federated source is for English documents and for users having an e-mail address starting with A in the identity management system:

(language:en ) AND (idm::mail:a.*)

The attribute can be of type Date, String, or Number. For String attributes, the rule expression is a regular expression. Oracle SES supports the regular expression syntax used in the Java JDK 1.4.2 Pattern class (java.util.regex.Pattern). For Date and Number attributes, the expression contains an operator and a value. The operators are =, >, >=, <, and <=.

Filter Rule Examples

The following rule defines that the federated source is for documents larger than 1 MB:

content-length:>1000000

The following rule defines that the federated source is for documents last modified after 12/31/2006:

lastmodifieddate:> 12/31/2006

The following rule defines that the federated source is only for documents modified in the last week:

lastmodifieddate:> sysdate - 7

The following rule defines that the federated source is for the login name, which could be an attribute of the identity management repository:

username:test00.*
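To make the rule format concrete, the following is a minimal sketch of how a single `attribute:expression` rule could be evaluated. It is an illustration only, not how Oracle SES evaluates rules. The function names are hypothetical, Python's `re` module stands in for java.util.regex (the simple patterns shown here behave the same in both), and the sketch assumes that a String rule must match the whole value. Date values, identity-management attributes, and AND/OR combinations are not handled.

```python
import operator
import re

# Two-character operators are listed first so ">=" is not read as ">".
_OPERATORS = [(">=", operator.ge), ("<=", operator.le),
              (">", operator.gt), ("<", operator.lt), ("=", operator.eq)]


def parse_rule(rule):
    """Split 'attribute:expression' at the first colon; attribute names are not case-sensitive."""
    attribute, _, expression = rule.partition(":")
    return attribute.strip().lower(), expression.strip()


def matches_number(expression, value):
    """Evaluate a Number rule expression such as '>1000000' against a numeric value."""
    for symbol, compare in _OPERATORS:
        if expression.startswith(symbol):
            return compare(value, float(expression[len(symbol):]))
    raise ValueError("expression must start with =, >, >=, <, or <=")


def matches_string(expression, value):
    """Evaluate a String rule expression (a regular expression) against a value."""
    return re.fullmatch(expression, value) is not None
```

For example, `matches_number(">1000000", 2500000)` is true for a 2.5 MB document, and `matches_string("test00.*", "test007")` is true for the login name test007.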

Set Search Restrictions

Restrict search to a specific list of source groups on the Home - Sources - Customize Federated Source - Search Restrictions page.

Available source groups from the federated source are retrieved when the page is loaded. When Source Group Restricted Search is selected, you can move the source groups between the Not Searched and Searched lists. When Unrestricted Search is selected, all source groups on the remote instance are searched.

The Refresh Source Groups button refreshes the available source groups from the remote instance. If a source group is no longer available, then it is marked (Not Available). All newly available source groups after a refresh appear in the Not Searched list by default, and all existing source groups remain in the list they are presently in. If a remote source group is renamed, the old name will be marked (Not Available) and the new name will appear in the Not Searched list. Unavailable source groups will be persisted as long as they remain in the Searched list.
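The refresh behavior described above can be sketched as a small function. This is one reading of the text, not the product's code: the function name and the data shapes are hypothetical, and the sketch assumes that unavailable source groups are dropped from the Not Searched list because only those in the Searched list are said to be persisted.

```python
def refresh_source_groups(searched, not_searched, remote_available):
    """Return (searched, not_searched) after a refresh.

    searched, not_searched: lists of source group names currently in each list.
    remote_available: set of names the remote instance reports now.
    Entries in the returned searched list are (name, is_available) pairs so that
    a caller can render unavailable groups as "(Not Available)".
    """
    # Existing groups stay in the list they are in; unavailable ones are
    # kept (and flagged) only while they remain in the Searched list.
    new_searched = [(name, name in remote_available) for name in searched]
    new_not_searched = [name for name in not_searched if name in remote_available]

    # Newly available groups appear in the Not Searched list by default.
    known = set(searched) | set(not_searched)
    new_not_searched += sorted(remote_available - known)
    return new_searched, new_not_searched
```

A renamed remote group shows up here as two changes: the old name becomes unavailable, and the new name is added to the Not Searched list.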

If the federated source is unavailable, then the available source groups are loaded from local storage. A warning message then states that Oracle SES is unable to retrieve the available source groups from the remote instance, indicating that the available source groups may be out of date.

Note:

A federated source can be restricted to only explicitly-created source groups on the remote Oracle SES instance. For example, a federated source cannot be restricted to the Miscellaneous group on the remote Oracle SES instance.

Retrieve Attributes

Identify which attributes to retrieve from the federated source on the Home - Sources - Customize Federated Source - Attribute Retrieval page.

Available attributes from the federated source are retrieved when the page is loaded. Move search attributes to retrieve between the Not Retrieved column and the Retrieved column. Attributes that are always retrieved by Oracle SES by default are in the Retrieved list and marked (Mandatory). These attributes cannot be saved in the Not Retrieved list.

The Refresh Attributes button refreshes the available attributes from the remote instance. If an attribute is no longer available, then it is marked (Not Available). All newly available attributes after a refresh appear in the Not Retrieved list by default, and all existing attributes remain in the list they are presently in. If a remote attribute is renamed, then the old attribute name will be marked (Not Available), and the new name will appear in the Not Retrieved list. Unavailable attributes are persisted as long as they remain in the Retrieved list or are used in an explicit attribute mapping.

If the federated source is unavailable, then the available attributes are loaded from local storage. A warning message then states that Oracle SES is unable to retrieve the available attributes from the remote instance, so the available attributes may be out of date.

Map Attributes

Map local search attributes with federated search attributes on the Home - Sources - Customize Federated Source - Attribute Mapping page. For example, a local search attribute named Creator can be mapped to a remote attribute named Author. This is an explicit attribute mapping. Only one-to-one mappings between attributes of the same data type are supported.

Note:

For default Oracle SES search attributes, Oracle SES implicitly maps local attributes to remote attributes. For example, a remote attribute named Author is always mapped to the local search attribute named Author. For all other attributes, explicit mappings must be created.

Local search attributes are the available attributes on the local instance, as defined on the Global Settings - Search Attributes page. Local search attributes that are used in a mapping cannot be deleted on the Global Settings - Search Attributes page. Initially, there are no mappings.

Remote search attributes are the available attributes on the federated source. This list is retrieved when the page is loaded. If a remote attribute is mapped to a local attribute but the remote attribute is no longer available, then the remote attribute is marked (Not available). Only attribute mappings involving available remote attributes are used during queries.
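The mapping constraints in this section (one-to-one, same data type, and only available remote attributes used at query time) can be expressed as two small checks. The following sketch uses hypothetical function names and plain dictionaries for the attribute types; it does not reflect how Oracle SES stores or validates mappings.

```python
def validate_mappings(mappings, local_types, remote_types):
    """Raise ValueError unless every (local, remote) pair is one-to-one and type-compatible.

    mappings: list of (local_name, remote_name) pairs.
    local_types, remote_types: dicts from attribute name to "String", "Number", or "Date".
    """
    seen_local, seen_remote = set(), set()
    for local, remote in mappings:
        if local in seen_local or remote in seen_remote:
            raise ValueError(f"mapping {local} -> {remote} is not one-to-one")
        if local_types[local] != remote_types[remote]:
            raise ValueError(f"{local} and {remote} have different data types")
        seen_local.add(local)
        seen_remote.add(remote)


def usable_mappings(mappings, available_remote):
    """Keep only mappings whose remote attribute is still available."""
    return [(local, remote) for local, remote in mappings if remote in available_remote]
```

With a local Creator attribute and a remote Author attribute that are both of type String, `validate_mappings([("creator", "author")], ...)` passes, while mapping Creator to a Number attribute raises an error.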

Limitations Federating Release 10.1.8.2 and Prior Releases

The Oracle SES internal attributes eqtopphrases, eqtopsentences, eqdatasourcename, eqdatasourcetype and eqfedchain cannot be retrieved from versions of Oracle SES prior to 10.1.8.2. This impacts topic clustering and result list configuration. If a topic cluster tree is configured with eqtopphrases or eqtopsentences, then results from federated sources prior to 10.1.8.2 will not contain values for these attributes, and therefore will not contribute to the cluster tree. Similarly, eqdatasourcename, eqdatasourcetype and eqfedchain will have empty values when used in the advanced result list configuration XSLT, meaning that source-level result rendering cannot currently be accomplished for such sources.
There is an issue highlighting keywords if advanced configuration is enabled on the Global Settings - Configure Search Result List page. All attributes in the XSLT are escaped to be HTML safe. However, highlighted attributes like title and author cannot be easily escaped in the XSLT.

Highlighted attributes from local results contain markers "[[" and "]]" around the keywords. These are replaced with bold tags in the XSLT. A special attribute ID is passed in to tell Web Services to use these markers. This attribute ID will not be handled by versions prior to 10.1.8.2, so bold tags will always be returned and be double-escaped.

If highlighted attribute values are guaranteed to be escaped already when they are returned, then escaping in the XSLT can be disabled.

As a workaround to disable escaping, edit the XSLT. The XSLT has the following template to process highlighted attributes:
```
<xsl:template name="process-hilite-attr">
  <xsl:param name="str" />
  <xsl:call-template name="bold-template">
    <xsl:with-param name="str" select="$str" />
    <xsl:with-param name="startdelim" select="'[['" />
    <xsl:with-param name="enddelim" select="']]'" />
    <xsl:with-param name="doe" select="'no'" />
  </xsl:call-template>
</xsl:template>
```
Turn off escaping by changing the value 'no' in the following line to 'yes':
```
<xsl:with-param name="doe" select="'no'" /> 
```
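The double-escaping problem described in this section can be reproduced outside of XSLT. The following sketch uses Python's `html.escape` as a stand-in for the escaping that the XSLT performs; the function names are hypothetical, and the real transformation happens in the style sheet, not in Python.

```python
import html


def render_with_markers(value):
    """Escape the value, then turn the [[ and ]] markers into bold tags.

    This models a result from an instance that returns markers around keywords.
    """
    return html.escape(value).replace("[[", "<b>").replace("]]", "</b>")


def render_already_bolded(value):
    """Escape a value that already contains bold tags and escaped text.

    This models a result from an older endpoint that ignores the marker
    request: the tags and entities it returns are escaped a second time.
    """
    return html.escape(value)
```

With markers, `[[Oracle]] SES & more` renders as `<b>Oracle</b> SES &amp; more`. The same value returned by an older endpoint as `<b>Oracle</b> SES &amp; more` is rendered as `&lt;b&gt;Oracle&lt;/b&gt; SES &amp;amp; more`, which is why the workaround disables escaping for highlighted attributes.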

Adding Suggested Content in Search Results

Suggested content lets you display real-time data content along with the result list in the default query application. Oracle SES retrieves data from content providers and applies a style sheet to the data to generate an HTML fragment. The HTML fragment is displayed in the result list and is available through the Web Services API. For example, when an end user searches for contact information on a coworker, Oracle SES can fetch the content from the suggested content provider and return the contact information (e-mail address, phone number, and so on) for that person with the result list. Suggested content results appear in tabbed panes above the query results.

Configure suggested content on the Search - Suggested Content page in the administration tool. Enter the maximum number of suggested content results (up to 20) to be included with the Oracle SES result list. The results are rendered on a first-come, first-served basis.

Regular expressions (as supported in the Java regular expression API java.util.regex) are used to define query patterns for suggested content providers. The regular expression-based pattern matching is case-sensitive. For example, a provider with the pattern dir\s(\S+) is triggered on the query dir james but not on the query Dir James. To trigger on the query Dir James, the pattern could be defined either as [Dd][Ii][Rr]\s+(\S+) or as (?i)dir\s+(\S+). A provider with a blank query pattern is triggered on all queries.

The URL you enter for the suggested content provider can contain the following variables: $ora:q, $ora:lang, $ora:q1, ... $ora:qn and $ora:username.

$ora:q is the end user full query.
$ora:lang is the two-letter code for the browser language.
$ora:qn is the nth regular expression match group from the end user query. n starts from 1. If no nth group is matched, then the empty string replaces the variable.
$ora:username is the end user name.
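The interaction between the query pattern and the URL variables can be sketched as follows. This is an illustration under several assumptions, not the Oracle SES implementation: the function name is hypothetical, Python's `re` module stands in for java.util.regex, the pattern is assumed to match the whole query, and substituted values are assumed to be URL-encoded.

```python
import re
from urllib.parse import quote

# "q\d+" is tried before "q" so that $ora:q1 is not read as $ora:q followed by "1".
_VARIABLE = re.compile(r"\$ora:(q\d+|q|lang|username)")


def expand_provider_url(template, pattern, query, lang, username):
    """Return the provider URL for this query, or None if the provider is not triggered."""
    match = re.fullmatch(pattern, query) if pattern else None
    if pattern and match is None:
        return None                      # a blank pattern triggers on every query

    def value_for(variable):
        name = variable.group(1)
        if name == "q":
            raw = query
        elif name == "lang":
            raw = lang
        elif name == "username":
            raw = username
        else:
            n = int(name[1:])            # group numbers start from 1
            raw = ""
            if match is not None and 1 <= n <= match.re.groups:
                raw = match.group(n) or ""
        return quote(raw, safe="")

    return _VARIABLE.sub(value_for, template)
```

Because matching is case-sensitive, the pattern `dir\s(\S+)` expands the template for the query dir james but returns nothing for Dir James, whereas `(?i)dir\s+(\S+)` handles both.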

Enter an XSLT style sheet to define rules (for example, the size and style) for transforming XML content from a provider into an HTML fragment. This HTML fragment is displayed in the result list or returned over the Web Services API. If you do not enter an XSLT style sheet, then Oracle SES assumes that the suggested content provider returns HTML. If you do not enter an XSLT style sheet and the provider returns XML, then the result list displays the plain XML.

Note:

It is the administrator's responsibility to ensure that suggested content providers return valid and safe content. Corrupted or incomplete content returned by a suggested content provider can affect the formatting of the default query application results page.

There are three security options for how Oracle SES passes the end user's authentication information to the suggested content provider:

None: With this method (the default), no security policy is used.
Cookie: With this method, the end user first must be authenticated by the suggested content provider. A cookie is set for the user to maintain a session. Oracle SES must know the cookie used by the provider for authentication, and it is made available during registration of the suggested content provider. When the user enters a query, Oracle SES grabs the cookies from the user's request header and passes them to the provider. The cookie scope must be set to the common domain of the provider site and the Oracle SES site by the provider.

For example, suppose the provider site is http://provider.company.com and the Oracle SES site is http://ses.company.com. After the end user logs in to the provider site, the site could set the value of the security cookie loginCookie with domain scope .company.com. When the end user searches in Oracle SES, Oracle SES gets the loginCookie value from the end user browser and forwards it to the provider site to get the suggested content (without logging in to the provider site again). However, if the provider site is accessed as http://provider or if the Oracle SES site is accessed as http://SES, then no domain cookie is available for sharing between the two sites and this security mechanism does not work.

You can decide what happens when suggested content is available but the user is not logged in to the suggested content provider or the cookie for the provider is not available. For Unauthenticated User Action, if you select Ignore content, then content from that provider will not be displayed in the result list. If you select Display login message, then Oracle SES returns a message that there is content available from this provider but the user is not logged in. The message also provides a link to log in to that provider. Enter the link for the suggested content provider login in the Login URL field.
Service-to-Service: With this method, a one-way trusted relationship is established between Oracle SES and the suggested content provider. Any user already logged in to Oracle SES does not need to be authenticated by the provider again. The provider only authenticates the Oracle SES application and trusts the Oracle SES application to act as the end user.

The end user identity is sent from Oracle SES to the provider site in the HTTP header ORA_S2S_PROXY_USER. The trusted entity could be a proxy user configured in the identity management system used by the provider, or it could be a name-value pair.
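For the Cookie option described above, the reason short host names break cookie sharing follows from how browsers match cookie domains. The following sketch applies the usual domain-matching rule (the host equals the cookie domain or ends with a dot followed by it); the function name is hypothetical and the check is simplified.

```python
def cookie_sent_to(host, cookie_domain):
    """Return True if a cookie scoped to cookie_domain would be sent to host."""
    host = host.lower()
    domain = cookie_domain.lower().lstrip(".")
    return host == domain or host.endswith("." + domain)
```

A cookie scoped to .company.com is sent to both provider.company.com and ses.company.com, but not to a site accessed as provider or SES, so Oracle SES never sees the login cookie in that case.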

Note:
If the secured content provider needs to authenticate the end user and it sets the domain level security cookie to maintain login information after the end user logs in, then use the cookie method for form authentication. The Oracle SES end user must log in manually to the provider site, and the security cookie is stored in the browser. Oracle SES then searches on the provider for the end user without an additional login.
However, if the domain security cookie is not allowed for the provider, then the provider must support service-to-service security. The provider must allow an Oracle SES application account to search after passing HTTP basic or digest authentication. Also, if the provider has different secured content for different Oracle SES end users, then it must respect the end user security (in the HTTP header ORA_S2S_PROXY_USER) for the Oracle SES search request.

To register a provider that requires either HTTP basic or HTTP digest authentication, specify the authentication user name in the Entity Name field and specify the authentication password in the Entity Password field.
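On the provider side, service-to-service security amounts to two steps: authenticate the Oracle SES application, then trust the end user name it sends. The following is a minimal sketch for a provider that uses HTTP basic authentication. Only the ORA_S2S_PROXY_USER header name comes from this section; the function name, the plain dictionary of headers, and the exact lookup of header names are assumptions.

```python
import base64
import hmac

PROXY_USER_HEADER = "ORA_S2S_PROXY_USER"


def proxied_user(headers, entity_name, entity_password):
    """Return the end user to act as, or None if the caller is not the trusted application.

    headers: dict of request header names to values (assumed to use these exact names).
    entity_name, entity_password: the credentials registered for Oracle SES.
    """
    scheme, _, encoded = headers.get("Authorization", "").partition(" ")
    if scheme.lower() != "basic":
        return None
    try:
        decoded = base64.b64decode(encoded, validate=True).decode("utf-8")
    except ValueError:                   # covers bad base64 and bad UTF-8
        return None
    user, separator, password = decoded.partition(":")
    if not separator:
        return None
    name_ok = hmac.compare_digest(user.encode(), entity_name.encode())
    password_ok = hmac.compare_digest(password.encode(), entity_password.encode())
    if not (name_ok and password_ok):
        return None
    # The application is trusted, so accept the end user identity it asserts.
    return headers.get(PROXY_USER_HEADER)
```

A provider that serves different secured content to different users would then filter its results for the returned user name.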

Example Configuring Google OneBox for Suggested Content

Existing OneBox providers can be configured as Oracle SES suggested content providers. For example, for a Google OneBox provider, the provider URL might be http://host.company.com/apps/directory.jsp and the trigger might be dir\s(\S+). When the user query is dir james, the provider receives the request with a query string similar to the following: apiMaj=10&apiMin=1&oneboxName=app&query=james.

With a suggested content provider, set the URL template as http://host.company.com/apps/directory.jsp?apiMaj=10&apiMin=1&oneboxName=app&query=$ora:q1. The provider pattern is the same: dir\s(\S+). The XSLT used for Google OneBox can be re-used with a minor change. Look for the line:

<xsl:template name="apps">

and change that line in your template to

<xsl:template match="/OneBoxResults">

Customizing the Appearance of Search Results

You can customize the default look and feel of the search result list for the default query application on the Global Settings - Configure Search Result List page.

Note:

The new 10.1.8.2 query application is certified with Internet Explorer versions 6 and 7 and Firefox versions 1.5 and 2.x. Existing 10.1.8.1 functionality is certified on all Oracle SES-supported browsers through the classic user interface: http://<host>:<port>/search/query/search-classic.jsp

First select attributes that should appear in the XML description of result documents. The available attributes are local attributes, federated attributes, and internal attributes. Each attribute name appears only one time. That is, the name of a federated attribute with the same name as another attribute or with an explicit mapping to a local search attribute appears only once.

The following table describes Oracle SES internal attributes.

Table 8-1 Oracle SES Internal Attributes

Name	Type	Description
`eqdatasourcename`	String	The (untranslated) name of the source where the document originated. This name is local to the instance that the document came from and not the instance that is receiving the document. If a document comes from a federated source, then the value of this attribute is the name of the source on the federated instance, and not the name of the federated source on the local instance.
`eqdatasourcetype`	String	The (untranslated) type of source where the document originated. This type is local to the instance from which the document came. For example, "Federated" is not a valid value for this attribute.
`eqsnippet`	String	The excerpt or keyword in context of the document.
`eqredirecturl`	String	The redirect URL to the original document; that is, the value of the title link in the default result list.
`eqcacheurl`	String	The URL of the cached version of the document; that is, the value of the "Cached" link in the default result list.
`eqlinksurl`	String	The URL of the page containing a list of links to the document; that is, the value of the "Links" link in the default result list.
`eqcontentlength`	Number	The size of the document in bytes.
`equserquery`	String	The query string.
`eqgroupbrowseurl`	String	The URL to browse the infosource group; that is, the value of the "Source Group" link in the default result list.
`eqpathbrowseurl`	String	The URL to browse the infosource path; that is, the value of the "Path" link in the default result list.
`eqdocid`	Number	Document ID.
`eqfedid`	String	The federated source ID chain delimited by an underscore (_).
`eqfedchain`	String	The chain of instance names representing the path of a federated document. The instance names are delimited by a semi-colon (;).

Then, provide an XSLT to operate on the selected attributes. This will transform XML content into an HTML fragment to be displayed in the result list. If the XSLT is blank, then the XML is not transformed and the search results will be displayed using the default look and feel. Invalid XSLTs cannot be saved. The output of this transformation should be HTML; specify this by including the following in the XSLT:

<xsl:output method="html" />

You can provide a CSS to style the HTML generated in the XSLT. This CSS is used along with the included CSS files for the default query application. When there are conflicts between the advanced configuration CSS and the default CSS files, the advanced configuration definitions are used. Default XSLT and CSS style sheets are provided for Advanced Configuration.

XML Result Schema

To apply the XSLT, each document is converted into an XML description at query-time with the following schema:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
   <xsd:element name="result">
      <!-- Useful internal non-search attributes -->
      <xsd:element name="eqdatasourcename" type="xsd:string" maxOccurs="1" />
      <xsd:element name="eqdatasourcetype" type="xsd:string" maxOccurs="1" />
      <xsd:element name="eqsnippet" type="xsd:string" maxOccurs="1" />
      <xsd:element name="eqredirecturl" type="xsd:string" maxOccurs="1" />
      <xsd:element name="eqcacheurl" type="xsd:string" maxOccurs="1" />
      <xsd:element name="eqlinksurl" type="xsd:string" maxOccurs="1" />
      <xsd:element name="eqdisplayurl" type="xsd:string" maxOccurs="1" />
      <xsd:element name="eqcontentlength" type="xsd:integer" maxOccurs="1" />
      <xsd:element name="equserquery" type="xsd:string" maxOccurs="1" />
      <xsd:element name="eqgroupbrowseurl" type="xsd:string" maxOccurs="1" />
      <xsd:element name="eqpathbrowseurl" type="xsd:string" maxOccurs="1" />
      <xsd:element name="eqdocid" type="xsd:integer" maxOccurs="1" />
      <xsd:element name="eqfedid" type="xsd:string" maxOccurs="1" />
      <!-- Built-in search attributes -->
      <xsd:element name="author" type="xsd:string" maxOccurs="1" />
      <xsd:element name="description" type="xsd:string" maxOccurs="1" />
      <xsd:element name="headline1" type="xsd:string" maxOccurs="1" />
      <xsd:element name="headline2" type="xsd:string" maxOccurs="1" />
      <xsd:element name="host" type="xsd:string" maxOccurs="1" />
      <xsd:element name="infosource" type="xsd:string" maxOccurs="1" />
      <xsd:element name="infosourcepath" type="xsd:string" maxOccurs="1" />
      <xsd:element name="keywords" type="xsd:string" maxOccurs="1" />
      <xsd:element name="language" type="xsd:string" maxOccurs="1" />
      <xsd:element name="lastmodifieddate" type="xsd:date" maxOccurs="1" />
      <xsd:element name="mimetype" type="xsd:string" maxOccurs="1" />
      <xsd:element name="referencetext" type="xsd:string" maxOccurs="1" />
      <xsd:element name="subject" type="xsd:string" maxOccurs="1" />
      <xsd:element name="title" type="xsd:string" maxOccurs="1" />
      <xsd:element name="url" type="xsd:string" maxOccurs="1" />
      <xsd:element name="urldepth" type="xsd:integer" maxOccurs="1" />
      <!-- Custom search attributes -->
      …
   </xsd:element>
</xsd:schema>

XML has the following rules for element names:

Letters (including non-English characters), numbers, and ideograms are allowed
Limited punctuation is allowed: underscore, hyphen, and period
Names can only begin with letters, ideograms, and underscores

Custom attribute names must conform to these rules for advanced result rendering. To enforce these rules, the empty string will replace all characters that are not permitted by these rules. In addition, Oracle SES search attributes are case-insensitive, and therefore all attributes are converted to lowercase when used in XML format.
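The conversion described above can be approximated with a short function. This is a sketch of the stated rules, not the Oracle SES implementation: the function name is hypothetical, Python's Unicode character classes only approximate the XML name rules, and the handling of a name that starts with a digit, hyphen, or period (those leading characters are dropped) is an assumption.

```python
def to_xml_element_name(attribute_name):
    """Convert a search attribute name to the element name used in the result XML."""
    lowered = attribute_name.lower()     # attribute names are case-insensitive
    # Remove every character that is not a letter, digit, ideogram,
    # underscore, hyphen, or period.
    kept = [ch for ch in lowered if ch.isalnum() or ch in "_-."]
    # Names can only begin with letters, ideograms, and underscores.
    while kept and not (kept[0].isalpha() or kept[0] == "_"):
        kept.pop(0)
    return "".join(kept)
```

For example, a custom attribute named Doc Type would appear in the XML as doctype, so an XSLT that refers to it must use the converted name.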

For example, suppose the raw XML result data is as follows.

<result>
   <eqdatasourcetype>WEB</eqdatasourcetype>
   <title>Oracle Secure Enterprise Search</title>
   <url>
      http://www.oracle.com/technology/products/oses/index.html
   </url>
   <author>Anonymous</author>
   <description>
      Oracle Secure Enterprise Search 10g, a standalone product from Oracle, enables a secure, high quality, easy-to-use search across all enterprise information assets.
   </description>
</result>

The following XSLT extracts and formats the title, URL, and author for documents coming from Web sources:

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:template match="result[eqdatasourcetype='WEB']">
   <span class="title">
     <xsl:text> &quot;</xsl:text><xsl:value-of select="title" /><xsl:text>&quot;</xsl:text>
   </span>
   <span class="author">
     <xsl:text> By </xsl:text><xsl:value-of select="author" />
   </span>
   <br/>
   <span class="url">
     <a href="{normalize-space(url)}"><xsl:value-of select="normalize-space(url)" /></a>
   </span>
 </xsl:template>
</xsl:stylesheet>

A CSS style sheet for this output may be:

.title { font-weight: bold; }

.url { font-style: italic; }

These style sheets produce a final result of:

"Oracle Secure Enterprise Search" By Anonymous

http://www.oracle.com/technology/products/oses/index.html

Configuring Clustering in Search Results

Real-time clustering dynamically organizes search results into groups to provide end users with different views on the top results. Clustered documents within one group, called a cluster node, share the same common topics or property values. A cluster node with a large document set can be categorized into child cluster nodes, and a hierarchy is built. Users can navigate directly to a specific cluster node. Effective real-time clustering balances clustering quality and clustering time.

Note:

Search attributes (String, Number or Date) are used to generate a cluster tree. The attributes can be local search attributes, federated attributes that are not explicitly mapped, and Oracle SES internal attributes.

Oracle SES supports two types of cluster trees: topic and metadata. Each tree can be enabled or disabled individually. Parameters that apply to all cluster trees for the default query application can be configured on the Global Settings - Clustering Configuration page. These include the following:

Maximum cluster tree depth: The maximum level of the cluster node hierarchy.
Maximum number of children per node: The maximum number of cluster nodes on each level. This does not apply to the miscellaneous node.
Minimum number of documents per node: The minimum number of the documents within one node. This does not apply to the miscellaneous node.

Note:

Within each level of a cluster tree, documents that are not categorized into a node are placed in a special node called "miscellaneous". The Minimum number of documents per node and Maximum number of children per node parameters do not apply to the miscellaneous node.

For customized Oracle SES applications, configure clustering with the Query Web Services API.

Topic Clustering

Topic clustering uses the most significant phrases (and optionally sentences) from documents to create relevant cluster nodes and hierarchies. The significant phrases are extracted both at query time and, at crawl time, by the Secure Enterprise Search Document Summarizer, which is a document service included by default for search result clustering.

Configure crawl-time extraction of top phrases with document services parameters on the Global Settings - Document Services page. Create a topic clustering tree on the Global Settings - Clustering Configuration - Create Topic Clustering Tree page.

Topic clustering can be configured with one or more search attributes of String type, as well as with the following Oracle SES internal attributes:

eqsnippet: The excerpt of the document with keywords in context.
eqtopphrases: The most frequent phrases within one document among the phrases with the same number of words.
eqtopsentences: The significant sentences within one document based on the significant phrases.

By default, the attributes keywords, title, eqsnippet, and eqtopphrases are configured for topic clustering. The keywords, eqtopphrases, and eqtopsentences attributes contain pre-extracted words and phrases, so no additional phrase extraction is performed on them.

Parameters that control query-time word and phrase extraction for the default query application can be configured on the Global Settings - Clustering Configuration page. These include the following:

Single Word Extraction:

Minimum occurrence: The minimum frequency for the word to be extracted.
Maximum number of words to extract: The maximum number of words to be extracted.

Phrase Extraction:

Minimum occurrence: Minimum frequency for a phrase to be extracted.
Maximum number of phrases to extract: Maximum number of phrases to be extracted.
Maximum phrase length: Maximum number of words for each phrase to be extracted.
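
To make the three phrase extraction parameters concrete, the following Python sketch counts word sequences and applies the thresholds. It is a simplified illustration only: the function name, the tokenization rule, and the tie-breaking order are assumptions, and it does not reproduce the extraction algorithm that Oracle SES actually uses (for example, it does no stemming or stopword filtering).

```python
import re
from collections import Counter


def extract_phrases(text, min_occurrence=2, max_phrases=10, max_phrase_length=3):
    """Return the most frequent multi-word phrases in text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter()
    # Count every run of 2..max_phrase_length consecutive words.
    for length in range(2, max_phrase_length + 1):
        for start in range(len(tokens) - length + 1):
            counts[" ".join(tokens[start:start + length])] += 1
    # Keep phrases that reach the minimum occurrence, most frequent first
    # (ties are broken alphabetically, which is an arbitrary choice).
    frequent = [(phrase, n) for phrase, n in counts.items() if n >= min_occurrence]
    frequent.sort(key=lambda item: (-item[1], item[0]))
    # Cap the result at the maximum number of phrases to extract.
    return [phrase for phrase, _ in frequent[:max_phrases]]
```

Raising the minimum occurrence or lowering the maximum phrase length shrinks the candidate set, which is the quality versus time trade-off that these parameters expose.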

Topic clustering uses a phrase stopword list and a blacklist to prevent words or phrases from becoming topic cluster result nodes.

The phrase stopword list is also used by the Document Summarizer document service. The stopword file is a language-specific file containing words that should not be considered during phrase extraction. The blacklist file is a language-specific file containing words and phrases that should not appear as cluster node names.

For example, if all indexed documents include the phrase "Oracle Corporation" and it does not make sense to have a cluster node for "oracle corporation", then this phrase could be added to the blacklist.

Note:

There is a separate stopword list for index stop words. This is an Oracle SES internal file for words that should not be indexed. It is not related to phrase extraction.

Both the stopword and blacklist files are in plain text format, with each line containing one word or phrase. The phrase stopwords file name should be 'phrasestopwords' followed by a period and the two-letter language code (for example, phrasestopwords.en for English). Similarly, the blacklist file name should be 'blacklist' followed by a period and the two-letter language code.

By default, these files are located in the directory $ORACLE_HOME/search/lib/plugins/doc/extractor/phrasestopwords. There also are sample phrase stopword files for other languages in $ORACLE_HOME/search/lib/plugins/doc/extractor/samples/phrasestopwords. If there are documents for these languages, these files should be copied to $ORACLE_HOME/search/lib/plugins/doc/extractor/phrasestopwords.

The order of words or phrases in the file does not affect phrase extraction. For example, phrasestopwords.en may look like the following:

a
an
me
:
z

The blacklist.en file may look like the following:

site maps
oracle corporation
:
term of use
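
The file naming and one-entry-per-line format can be illustrated with a short Python sketch. The function names are hypothetical, and the case-insensitive comparison is an assumption based on the "Oracle Corporation" example above; this is not how Oracle SES itself reads the files.

```python
def word_list_file_name(kind: str, language_code: str) -> str:
    """Build a file name such as 'phrasestopwords.en' or 'blacklist.en'."""
    if kind not in ("phrasestopwords", "blacklist"):
        raise ValueError("kind must be 'phrasestopwords' or 'blacklist'")
    return f"{kind}.{language_code}"


def parse_word_list(content: str) -> set:
    """Parse plain text with one word or phrase per line; order is irrelevant."""
    return {line.strip().lower() for line in content.splitlines() if line.strip()}


def drop_blacklisted(node_names, blacklist):
    """Remove candidate cluster node names that appear in the blacklist."""
    # Assumption: names are compared case-insensitively.
    return [name for name in node_names if name.lower() not in blacklist]
```

For instance, parsing a blacklist that contains "oracle corporation" and filtering the candidates "Oracle Corporation" and "java" would leave only "java".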

Note:

The stopword and blacklist files are applicable to both the default query application and the Web services API. The other parameters are applicable to the default query application only.

Note:

During backup and recovery operations, if you recover an instance in a new location, the stopword directory must be updated to reflect the new location, since it is an absolute path.

Topic clustering currently works best in English. Both the document summarizer in the crawler and the clustering module in the query application use a stemmer to stem words and to merge words and phrases that share the same stems. The open source stemmer library Snowball is used for this purpose. The version included with Oracle SES supports the following languages:

Dutch
English
Finnish
French
German
Norwegian
Portuguese
Russian
Spanish
Swedish

The Egothor stemmer is included for Polish language support. The stemmer configuration is shared between the default query application and the Web Services API.

Note:

Topic clustering is not supported for Chinese and Japanese.

See Also:

Appendix F, "Third Party Licenses"

Metadata Clustering

Metadata clustering is performed on a single attribute of String, Date, or Number type. If there is more than one attribute value for the same attribute in one document, then only the first attribute value is used for clustering. By default, the entire value is passed in as is for clustering.

However, for String attributes only, a delimiter can be specified for tokenizing the attribute value. If no tokenization delimiter is entered (or if only whitespace is entered), then the delimiter defaults to whitespace. When tokenized, the single attribute value is divided into multiple segments and each segment can correspond to a hierarchy based on another delimiter called the hierarchy delimiter. Whitespace is the default hierarchy delimiter; however, if both tokenization and hierarchy are selected, then the delimiters must be different. Parsing is done first by tokenization, and then by interpreting the hierarchy from the resulting tokens.

Create a metadata clustering tree on the Global Settings - Clustering Configuration - Create Metadata Clustering Tree page.

As an example where both tokenization and hierarchy are meaningful, a category attribute might consist of a comma-delimited list of fields, each representing a slash-separated hierarchical categorization (as in "java/j2ee/jdbc, oracle/search/connector").
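
The two parsing passes (tokenization first, then hierarchy) can be sketched as follows in Python. The function name is hypothetical, and the sketch only mirrors the order of operations described here; it is not the Oracle SES implementation.

```python
def parse_category_value(value, token_delimiter=",", hierarchy_delimiter="/"):
    """Split an attribute value into tokens, then each token into hierarchy levels."""
    if token_delimiter == hierarchy_delimiter:
        # The two delimiters must differ when both features are used.
        raise ValueError("tokenization and hierarchy delimiters must be different")
    paths = []
    for token in value.split(token_delimiter):
        levels = [level.strip() for level in token.split(hierarchy_delimiter)]
        levels = [level for level in levels if level]
        if levels:
            paths.append(levels)
    return paths
```

Applied to the category value above, the sketch yields two paths: java, j2ee, jdbc and oracle, search, connector. Each path would correspond to one branch of the metadata tree.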

The tokenization and hierarchy configuration is not applicable to Date or Number attributes. Metadata trees of Date type attributes use a fixed display format with year on the first level, month on the second, and day on the third. The year is sorted in descending order, and the month and day are sorted in ascending order. Metadata trees for Number type attributes are range-based with a fixed number of ranges (5) and a fixed tree depth (2). Empty ranges are not shown.
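
The fixed layout of a Date tree (year, then month, then day, with years descending and months and days ascending) can be pictured with the following Python sketch. The nested-dictionary representation and the function name are assumptions chosen for illustration.

```python
from datetime import date


def build_date_tree(dates):
    """Group dates into {year: {month: [day, ...]}} in display order."""
    tree = {}
    # Years are sorted in descending order, months and days in ascending order.
    ordered = sorted(set(dates), key=lambda d: (-d.year, d.month, d.day))
    for d in ordered:
        # Dictionaries keep insertion order, so iteration follows display order.
        tree.setdefault(d.year, {}).setdefault(d.month, []).append(d.day)
    return tree
```

Number trees are not sketched because the text does not state how the five range boundaries are chosen.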

See Also:

"Configuring Clustering in the Web Services API"
"Document Service API"
"Customizing the Appearance of Search Results" for descriptions of Oracle SES internal attributes
"Oracle SES Stoplist"

Using Clustering

Cluster nodes filter the top results but do not change the order of the documents. When users select a cluster node, the result view is limited to the documents in that cluster node. All operations, such as sorting or paging through results, are limited to the cluster node.

The real-time clustering sidebar is hidden by default. Users can display the sidebar by clicking an arrow icon on the left-hand side of the search results page. Within the sidebar, result clusters are shown.

Users can expand or collapse the nodes within a cluster tree without affecting the rest of the interface. If users click a cluster node, then the search results are filtered. If a cluster tree contains no child nodes, it is disabled.

Configuring Clustering in the Web Services API

Methods in the Query Web Service API provide clustering for customized Oracle SES applications. The main interface is the method doOracleOrganizedSearch, which accepts query information, grouping and sorting options, and clustering requests. Based on the request variation, it returns the requested result. A second method, doOracleFetchSearch, is used when the set of documents is known.

The input for doOracleOrganizedSearch includes the following information:

Query
TopN (the result set size used for grouping, sorting, and clustering)
Duplicate controls (removed, marked)
Data group list
Query and document language
Grouping and sorting options
Cluster tree configuration (tree depth, children for each node, threshold, tree format type (JSON or XML), topic extraction configuration, and metadata clustering configuration)
Other query parameters (including Boolean returnCount, String filterConnect, Filter[] filters)

The output is an object that contains the search result, grouping information, and the cluster tree string list. The search result list is in the order specified by the grouping and sorting option. If this is not specified, then it is sorted by the relevance score. The returned cluster tree string represents the clustering tree information: tree structure, node names, and document IDs.

See Also:

"Search Operations"

Java Classes for Clustering

There are three classes to support the grouping and sorting options: GroupAttribute, SortAttribute, and GroupResult.

There are two classes to support the clustering request: ClusterConfig, which controls the clustering request, and ClusterTree, which contains the tree output.

The class OracleResultContainer is defined to wrap the search hit result, grouping result, and clustering result.

doOracleFetchSearch is used for fetching a selected list of documents identified by their document ID and/or federated source ID.

If GroupAttribute is specified, then it is automatically added to the top of the sorting attribute. For example, if the query is grouped by host name and sorted by title, then the search hit will be sorted by (hostname, title).
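
The effect of grouping on the sort order can be sketched in Python. Representing hits as dictionaries, the function name, and ascending order are all assumptions for illustration; the Web services API works with GroupAttribute and SortAttribute objects instead.

```python
def sort_hits(hits, sort_attributes, group_attribute=None):
    """Sort hits, placing the grouping attribute ahead of the sorting attributes."""
    keys = list(sort_attributes)
    if group_attribute is not None:
        # The grouping attribute is added to the top of the sort key.
        keys = [group_attribute] + [k for k in keys if k != group_attribute]
    return sorted(hits, key=lambda hit: tuple(hit[k] for k in keys))
```

With group_attribute set to "hostname" and sort_attributes set to ["title"], hits are ordered by host name first and by title within each host.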

The sorting, grouping, or clustering option can be applied to this result. Sorting is based on the top N result, while grouping and clustering are based on the result window determined by (startIndex, docsRequested).

Cluster Result XML Schema

The main XML element, node, contains the following attributes:

id: ID for the node. The value represents the full path with the parent node paths.
name: The name of the node. This is actually the topic for the node.
level: The cluster node level, starting from 1 for the top node.
size: Number of documents under (directly and indirectly) this cluster node.
leaf: This is "1" if the cluster node only contains documents and no child cluster nodes. Otherwise, this is "0".
keywords: All keywords and phrases within the cluster node.

The node element contains the document IDs in the XML text element if the node is a simple node. The document ID in the XML file has the format docID.SES_InstanceID. If the document is from the local instance, then the SES_InstanceID is omitted.

<cluster>
<nodeset>
<node id="1" name="all" level="1" size="100" leaf="0" keywords="all"/>
<node id="1.4" name="java" level="2" size="99" leaf="0" keywords="java"/>
<node id="1.4.1" name="data warehousing" level="3" size="38" leaf="0" keywords="technologies bi,data warehousing,linux .net office php security service"/>
<node id="1.4.1.1" name="tutorials blogs" level="4" size="12" leaf="1" keywords="tutorials blogs">
2773.,8031.,109.,8033.,806.,26940.,817.,8024.,8030.,2862.,8032.,8028.
</node>
<node id="1.4.1.2" name="stored procedure" level="4" size="4" leaf="1" keywords="stored procedure">
4239.,4243.,2784.,4335.
</node>
<node id="1.4.1.3" name="miscellaneous" level="4"  size="22" leaf="1">
4017.,2836.,8029.,2767.,1502.,113814.,11731.,1138.,392.,2819.,2763.,1421.,221.,705.,7739.,2838.,2749.,2351.,2802.,1158.,15751.,15747.
</node>
:
 
</nodeset>
</cluster>
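
A client could read this format with a standard XML parser. The following Python sketch uses a shortened, made-up sample in the same shape (the instance ID 7 is hypothetical) and assumes that the text of each leaf node is a comma-separated list of docID.SES_InstanceID entries, as described above.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample in the format described in this section.
SAMPLE = """<cluster>
<nodeset>
<node id="1" name="all" level="1" size="6" leaf="0" keywords="all"/>
<node id="1.1" name="stored procedure" level="2" size="4" leaf="1" keywords="stored procedure">
4239.,4243.,2784.,4335.
</node>
<node id="1.2" name="miscellaneous" level="2" size="2" leaf="1">
4017.,2836.7
</node>
</nodeset>
</cluster>"""


def parse_cluster_tree(xml_text):
    """Map each leaf node id to a list of (doc_id, instance_id) pairs."""
    leaves = {}
    for node in ET.fromstring(xml_text).iter("node"):
        if node.get("leaf") != "1":
            continue
        docs = []
        for entry in (node.text or "").split(","):
            entry = entry.strip()
            if entry:
                doc_id, _, instance_id = entry.partition(".")
                # An empty instance ID means the document is from the local instance.
                docs.append((doc_id, instance_id or None))
        leaves[node.get("id")] = docs
    return leaves
```

Because the node id carries the full path, the hierarchy can be rebuilt from the ids alone even though the node elements are siblings.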

Cluster Result JSON Format

To integrate with AJAX applications, the cluster results can be returned in JSON format. The JSON format directly reflects the tree structure of the cluster results. Each node has a child array, which is a list of nodes representing the direct children of that node, or a docs array representing the documents in that node if the node is a leaf node. Nodes in the child array may have children, and so on.

Here is sample JSON output.

{"nodeset":
 
  {"id":"1",
  "name":"all",
  "level":1,
  "size":100,
  "leaf":false,
  "keywords":"all",
  "children":
     [{"id":"1.4",
     "name":"java",
     "level":2,
     "size":99,
     "leaf":false,
     "keywords":"java",
     "children":
         [{"id":"1.4.1",
         "name":"data warehousing",
         "level":3,
         "size":38,
         "leaf":false,
         "keywords":"technologies bi,data warehousing,linux .net office php security service",
         "children":
            [{"id":"1.4.1.1",
            "name":"tutorials blogs",
            "level":4,
            "size":12,
            "leaf":true,
            "keywords":"tutorials blogs",
            "docs":["2773","8031","10","803","806","26940","817","8024","8030","2862","803","8028"]},
            {"id":"1.4.1.2",
            "name":"stored procedure",
            "level":4,
            "size":4,
            "leaf":true,
            "keywords":"stored procedure",
            "docs":["4239","4243","2784","4335"]}]
         }]
     },
     {"id":"1.5",
     "name":"miscellaneous",
     "level":2,
     "size":1,
     "leaf":true,
     "docs":["265915"]
     }]
   }
}
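
Because the JSON mirrors the tree, a client can walk it recursively. The Python sketch below uses a shortened, made-up sample with the same keys as the output above; the function name is hypothetical.

```python
import json


def collect_leaf_docs(node, leaves=None):
    """Walk a cluster node recursively and map leaf node ids to their docs."""
    if leaves is None:
        leaves = {}
    if node.get("leaf"):
        # Leaf nodes carry a docs array instead of a children array.
        leaves[node["id"]] = node.get("docs", [])
    for child in node.get("children", []):
        collect_leaf_docs(child, leaves)
    return leaves


# Shortened, hypothetical sample in the format shown in this section.
SAMPLE_JSON = """{"nodeset": {"id": "1", "name": "all", "level": 1, "size": 5, "leaf": false,
 "children": [
   {"id": "1.4", "name": "java", "level": 2, "size": 4, "leaf": true,
    "docs": ["4239", "4243", "2784", "4335"]},
   {"id": "1.5", "name": "miscellaneous", "level": 2, "size": 1, "leaf": true,
    "docs": ["265915"]}]}}"""

leaf_docs = collect_leaf_docs(json.loads(SAMPLE_JSON)["nodeset"])
```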

Configuring Top-N Documents and Group/Sort Attributes

Modify the search.properties file to configure the number of documents to retrieve for top-N processing and clustering and also to control the attributes available for grouping and sorting. These settings affect the default query application. The search.properties file is located in the $ORACLE_HOME/search/webapp/config directory.

The default top-N documents setting specifies the number of documents retrieved by default as part of the AJAX call for result clustering, grouping, and sorting:

ses.qapp.default_topn_docs=100

To page through a very large result set, say 500 documents, the user may view a page of results beyond the default top-N value. Suppose top-N is set to the default 100, and the user wants to view the results numbered 150-160. To provide result clustering and sorting/grouping, the browser must request 160 results. If the user views page 490-500, then the browser would be requesting 500 results through the AJAX call. This may result in reduced performance.

The maximum top-N documents setting represents a threshold above which the query application only displays a single page of results.

This mode does not provide any sorting, grouping, or result clustering. However, it lets a user view the entire result set without the costly subsequent retrievals of top-N results.

Suppose max_topn_docs is set to 200. If an end user is viewing results 30-40, then the browser would retrieve the default of 100 results. If the user views results 170-180, then the browser would request 180 documents. If the user views results above 200, then the query application would display only the current page of results. The property is set in search.properties as in the following example:

ses.qapp.max_topn_docs=300
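
The interaction between the two settings can be summarized in a small Python sketch. The function name and the use of None to signal single-page mode are assumptions; the default arguments follow the scenario described above (a default of 100 and a maximum of 200).

```python
def results_to_request(last_result_viewed, default_topn=100, max_topn=200):
    """Return how many results the browser requests, or None for single-page mode."""
    if last_result_viewed > max_topn:
        # Above the threshold only the current page is displayed,
        # without sorting, grouping, or result clustering.
        return None
    # Otherwise request at least the default top-N, or enough to cover the page.
    return max(default_topn, last_result_viewed)
```

Viewing results 30-40 requests 100 documents, viewing results 170-180 requests 180, and viewing results beyond 200 switches to single-page mode.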

The set of attributes available in the Group By and Sort By drop-down lists in the query page also can be configured in the search.properties file. The attributes available for grouping are configured by setting the ses.qapp.groupable_attrs property value, and the attributes available for sorting are configured by setting the ses.qapp.sortable_attrs property value.

The property value for either grouping or sorting is an ordered, alternating comma-delimited list of the search attribute name followed by the display name.

The following table lists the default grouping attributes:

Table 8-2 Grouping Attributes

Description	Attribute Name	Display
No grouping	`ses_none`	(none)
Source group	`ses_sourceGroup`	Source
Last modified date	`lastModifiedDate`	Date
Author	`author`	Author
File format	`mimetype`	File Format

The property value for this default set for grouping is the following:

ses.qapp.groupable_attrs=ses_none,-,ses_sourceGroup,-,lastModifiedDate,-,
        author,-,mimetype,-

The following table lists the default sorting attributes:

Table 8-3 Sorting Attributes

Description	Attribute Name	Display
Relevance	`ses_score`	Relevance
Last modified date	`lastModifiedDate`	Date
Author	`author`	Author
File format	`mimetype`	File Format
Document title	`title`	Title
URL	`infosource path`	Path
Language	`language`	Language

The property value for this default set for sorting is the following:

ses.qapp.sortable_attrs=ses_score,-,lastModifiedDate,-,author,-,
        mimetype,-,title,-,infosource path,-,language,-

To use the translated name of a search attribute for display instead of providing a fixed display name, insert a dash (-) in place of the display name. For example, if the search attribute "Test1" has translated names configured on the Global Settings page in the administration tool, then the following uses the translated names for display:

ses.qapp.sortable_attrs=ses_score,-,Test1,-,lastModifiedDate,-, ...
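
The alternating name and display-name format can be checked with a short Python sketch before editing search.properties. The function name is hypothetical, and returning None for a dash is just one way to represent "use the translated name".

```python
def parse_attribute_list(property_value):
    """Split 'name,display,name,display,...' into (name, display) pairs."""
    items = [item.strip() for item in property_value.split(",")]
    if items and items[-1] == "":
        items.pop()  # tolerate a trailing comma
    if len(items) % 2 != 0:
        raise ValueError("expected alternating attribute and display names")
    pairs = []
    for name, display in zip(items[0::2], items[1::2]):
        # A dash means: use the translated name of the attribute for display.
        pairs.append((name, None if display == "-" else display))
    return pairs
```

An odd number of entries usually means that a display name or a dash was left out, which is why the sketch rejects it.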

See Also:

"Searching on Date Attributes"

Customizing the Relevancy of Search Results

You can customize the default Oracle SES ranking to create a more relevant search result list for your enterprise. Ranking is determined by default and custom attributes. Default attributes include title, keywords, description, and others. Different weights indicate the importance of each attribute for document relevancy. For example, Oracle SES gives more weight to titles than to keywords.

To customize the relevancy of search results, you can use the Query Web Services API or ranking.xml to tune the weights of default attributes, or you can add custom attributes and set weights for those attributes.

See Also:

"Search Operations"

Customizing Relevancy in the Query Web Services API

The method for advanced search has the following signature:

public OracleSearchResult doOracleAdvancedSearch (String query,
                                         Integer startIndex,
                                         Integer docsRequested,
                                         Boolean dupRemoved,
                                         Boolean dupMarked,
                                         DataGroup groups[],
                                         String queryLang,
                                         String docLang,
                                         Boolean returnCount,
                                         String filterConnector,
                                         Filter filters[],
                                         Integer[] fetchAttributes,
                                         String searchControls)  throws Exception

The parameter searchControls accepts an XML string, which includes the filter and ranking elements.

<searchControls>
        <filter>
        </filter>
        <ranking>
        </ranking>
</searchControls>

This section contains the following topics:

Filter Element
Ranking Element

Filter Element

Filters for attribute search are passed in the filter element. All the various AND and OR conditions on the attributes are specified in the XML. For example:

<filter>
   <operator type="and">
      <operator type="or">
         <attributefilter name="xxx" type="string" operation="equals" value="ttt"/>
         <attributefilter name="yyy" type="number" operation="greaterthan" value="22"/>
         ….
      </operator>
      …
      <attributefilter name="aaa" type="number" operation="equals" value="22"/>
      ….
   </operator>
</filter>

If the parameter searchControls is null, then filters and filterConnector are used to create advanced search; otherwise, they are ignored.
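
Building the searchControls string programmatically avoids unbalanced or misspelled tags. The following Python sketch assembles a filter with nested operators; the element and attribute names are taken from the example above, while the helper names and the search attribute names (author, pagecount, revision) are hypothetical. No schema validation is performed.

```python
import xml.etree.ElementTree as ET


def attribute_filter(name, attr_type, operation, value):
    """Create one attributefilter element."""
    return ET.Element("attributefilter", {
        "name": name, "type": attr_type, "operation": operation, "value": str(value),
    })


def operator(op_type, *children):
    """Create an operator element ('and' or 'or') around child conditions."""
    element = ET.Element("operator", {"type": op_type})
    element.extend(children)
    return element


def build_search_controls(filter_root):
    """Wrap a filter condition tree in a searchControls element."""
    controls = ET.Element("searchControls")
    ET.SubElement(controls, "filter").append(filter_root)
    return ET.tostring(controls, encoding="unicode")


controls_xml = build_search_controls(
    operator("and",
             operator("or",
                      attribute_filter("author", "string", "equals", "smith"),
                      attribute_filter("pagecount", "number", "greaterthan", 22)),
             attribute_filter("revision", "number", "equals", 3)))
```

The resulting string would be passed as the searchControls argument of doOracleAdvancedSearch.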

Ranking Element

The ranking XML string is expressed as the ranking element in searchControls. The following is an example of a ranking element:

<ranking>
   <global-settings>
      <enable-all-default-factor>TRUE</enable-all-default-factor>
   </global-settings>
   <default-factor>
      <!-- default ranking factor -->
      …
   </default-factor>
   <default-factor>
      <!-- default ranking factor -->
      …
   </default-factor>
   <custom-factor>
      <!-- custom ranking factor -->
      …
   </custom-factor>
   <custom-factor>
      <!-- custom ranking factor -->
      …
   </custom-factor>
</ranking>

The following rules apply to the construction of the ranking XML string:

The whole ranking XML can be null, in which case default ranking is used.
The ranking XML contains the elements default-factor and custom-factor. Both can be null or absent at the same time.
When default-factor is null or absent and custom-factor is not null, default ranking is used together with the effect of custom-factor.
When custom-factor is null or absent, it does not have any impact on the ranking.
The ranking scheme applies only to doOracleAdvancedSearch calls that pass a non-empty query parameter.

This section contains the following topics:

Global-Settings
Default-Factor
Custom-Factor

Global-Settings

The global-settings element contains parameter settings that apply across ranking factors. It has the following sub-element:

enable-all-default-factor

The enable-all-default-factor element accepts two values: true or false. (When this element is absent, true is taken as the default value.)

When enable-all-default-factor is true, all default attributes are included in ranking queries, unless some default attributes are explicitly excluded in default-factor elements.

When enable-all-default-factor is false, all default attributes are excluded in ranking queries, unless some default attributes are explicitly included in default-factor elements.

Default-Factor

<default-factor>
   <name>title</name>
   <weight>VERY HIGH</weight>
</default-factor>

Default factor (attribute) names are case-insensitive.

When a default-factor does not appear in the ranking XML string, Oracle SES takes the default weight for this ranking factor (unless default factors are disabled by enable-all-default-factor).

Oracle SES supports the following values for the weight element: empty (Oracle SES uses the default weight), none (this attribute is not used in the ranking query), very high, high, medium, low, and very low.

The following table lists the default-factor names and weights:

Table 8-4 Oracle SES Default Attributes and Weights

Attribute	Weight
`Title`	High
`Description`	Medium
`Reftext`	High
`Keywords`	Medium
`Subject`	Low
`Author`	Medium
`H1headline`	Low
`H2headline`	Very low
`Url`	Low
`Urldepth`	High
`Language Match`	High
`Recency`	Very low
`Linkscore`	High

Custom-Factor

The custom-factor element lets you add more attributes for ranking. Any indexed search attribute can be a custom ranking attribute.

Note:

Adding custom attributes for relevancy ranking can degrade search performance.

The custom-factor element has four elements: attribute-name, attribute-type, factor-type, and weight (or match depending on the factor-type).

<custom-factor>
   <attribute-name>author manager</attribute-name>
   <attribute-type>STRING</attribute-type>
   <factor-type>QUERY_FACTOR</factor-type>
   <weight>LOW</weight>
</custom-factor>

<custom-factor>
   <attribute-name>document quality</attribute-name>
   <attribute-type>STRING</attribute-type>
   <factor-type>STATIC_FACTOR</factor-type>
   <match>
      <value>good</value>
      <weight>HIGH</weight>
   </match>
   <match>
      <value>fair</value>
      <weight>MEDIUM</weight>
   </match>
   <match>
      <value>bad</value>
      <weight>VERY LOW</weight>
   </match>
</custom-factor>

The attribute-name value is matched literally against attribute names in Oracle SES. Any indexed search attribute name can be an attribute-name value. The value of the attribute-name element is case-insensitive.
The attribute-type element defines the type of the attribute. Only the String attribute type is supported. Together, attribute-name and attribute-type identify a valid Oracle SES attribute.
For factor-type, Oracle SES supports two types of ranking for custom ranking attributes:
- QUERY_FACTOR: The attribute value is matched against query terms. A positive match boosts the document based on the specified weight. QUERY_FACTOR is a query-based ranking factor, like the default factors title and reftext. The weight element should appear for this custom ranking factor. For example, with the query "Tiger Woods", if a document has a custom attribute publisher with the value "Tiger Woods", then it could be relevant.
- STATIC_FACTOR: The attribute value is matched against fixed values specified in the custom ranking factor. (The match element should appear for this custom ranking factor.) STATIC_FACTOR is not a query-based ranking factor. The fixed values specify qualities of the documents, such as linkscore and the sources of documents. For example, assume that documents have been classified based on quality. Well-written documents are classified as "good", and poorly written documents are classified as "bad". A "good" document should be ranked higher than a "bad" document, even though both match a query. You can specify in the API that a document having "good" quality should be boosted in relevancy by a specified weight.
The match element specifies the match values and corresponding match weights when the factor-type is STATIC_FACTOR. The following XML string is an example of a match element:
```
<match>
<value>bad</value>
<weight>VERY LOW</weight>
</match>
```
The value element is used to match the corresponding attribute value of this ranking factor. Only alphanumeric characters are allowed in the attribute value. The match is case-insensitive.
The weight element has the same syntax as the weight element of a default ranking factor.
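
The structural rules above can be checked before a custom-factor element is sent to Oracle SES. The following Python sketch is a hypothetical pre-flight check, not an Oracle SES utility; it encodes only the rules stated in this section and reports problems as plain strings.

```python
import xml.etree.ElementTree as ET


def check_custom_factor(xml_text):
    """Return a list of problems found in a custom-factor element."""
    try:
        factor = ET.fromstring(xml_text)
    except ET.ParseError as error:
        return [f"not well-formed: {error}"]
    problems = []
    if not (factor.findtext("attribute-name") or "").strip():
        problems.append("attribute-name is missing")
    if (factor.findtext("attribute-type") or "").strip().upper() != "STRING":
        problems.append("attribute-type must be STRING")
    factor_type = (factor.findtext("factor-type") or "").strip()
    if factor_type == "QUERY_FACTOR":
        # Query-based factors carry a single weight element.
        if factor.find("weight") is None:
            problems.append("QUERY_FACTOR requires a weight element")
    elif factor_type == "STATIC_FACTOR":
        # Static factors carry one match element per fixed value.
        matches = factor.findall("match")
        if not matches:
            problems.append("STATIC_FACTOR requires at least one match element")
        for match in matches:
            value = (match.findtext("value") or "").strip()
            if not value.isalnum():
                problems.append(f"match value {value!r} is not alphanumeric")
            if match.find("weight") is None:
                problems.append("match is missing a weight element")
    else:
        problems.append("factor-type must be QUERY_FACTOR or STATIC_FACTOR")
    return problems
```

An empty list means that none of the checked rules were violated. An unclosed tag, such as a factor-type element that is opened twice instead of being closed, is reported as not well-formed.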

Apply Ranking Factors

The XML ranking text can be applied in two places:

As a part of the searchControls element. The ranking factors can be used as an advanced control for each query execution through the Web services method. This is called per-query ranking control.
As a separate configuration file, ranking.xml, in the directory $ORACLE_HOME/search/webapp/config. The configuration file is read and applied each time OC4J is started, and the ranking factors specified in it are applied to all queries. This is called instance-wide ranking control.

In federated search, instance-wide ranking control applies to only one instance. You must configure ranking customization for each instance separately.

If a conflict arises, the per-query ranking control specified in the Web services method overrides the settings specified in the instance-wide ranking control. The following cases can occur:

If per-query and instance-wide ranking control specify the same factor, then Oracle SES takes the factor set by per-query ranking control.
If instance-wide ranking control sets a ranking factor that per-query ranking control does not mention, then Oracle SES takes the factor set by instance-wide ranking control.
If per-query ranking control sets a ranking factor that instance-wide ranking control does not mention, then Oracle SES takes the factor set by per-query ranking control.
If instance-wide ranking control sets enable-all-default-factor as false and per-query ranking control sets enable-all-default-factor as true, then Oracle SES takes the default attributes set explicitly by instance-wide ranking control plus the attributes set by per-query ranking controls, with the latter overriding the former.
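
The first three cases amount to a dictionary merge in which per-query settings win. The Python sketch below illustrates that precedence only; the function name and the dictionary representation are assumptions, and the enable-all-default-factor case is not modeled.

```python
def effective_weights(instance_wide, per_query):
    """Combine factor weights, letting per-query settings override instance-wide ones."""
    merged = dict(instance_wide)   # start from ranking.xml settings
    merged.update(per_query)       # per-query ranking control takes precedence
    return merged
```

For example, if ranking.xml sets title to HIGH and author to LOW, and a query sets title to VERY HIGH and url to NONE, the effective weights are title VERY HIGH, author LOW, and url NONE.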

Using Backup and Recovery

The Global Settings - Configuration Data Backup and Recovery page backs up metadata that can be used to recover your configuration settings after a hardware failure. You should run a backup after making configuration data changes, such as creating or editing sources.

Note:

The actual crawled data is not backed up. To back up the index, see "Performing a Cold Backup".

When a backup is performed, Oracle SES copies the data to the binary metaData.bkp file. The location of that file is provided on the Global Settings - Configuration Data Backup and Recovery page. When the backup successfully completes, you must copy this file to a different host.

To recover, install a fresh instance of Oracle SES. When the installation completes, copy the metaData.bkp file to the location provided in the administration tool. Sources must be re-crawled to see search results.

Some notes about backup and recovery:

You must stop all running schedules before doing the backup.
Recovery must be performed on a fresh installation of the same version of Oracle SES that was backed up.
Secure search does not need to be re-enabled after recovery. If secure search is enabled in the backed-up instance, then you do not need to re-register or re-activate the identity plug-in after recovery. If a plug-in was active when the instance was backed up, then the same plug-in is activated in the recovered instance, using the same parameters.
If you have file or table sources residing on the same computer as the one running Oracle SES, and if you intend to use a different computer for recovery, then you must use the actual host name (not localhost) when creating the sources.
For database table sources, confirm that the remote tables exist.
For file sources, confirm that files and paths are valid after recovery.
During recovery, the mail archive directory settings for existing mailing list and e-mail sources are changed. After recovery, the location will be <cache-dir>/mail, which is the default for new e-mail and mailing list sources. Any customized directory locations prior to recovery will be lost.
If you recover an instance in a new location, the stopword directory must be updated to reflect the new location, since it is an absolute path. See "Topic Clustering" for more about stopword directories.

Performing a Cold Backup

As an additional precaution to minimize downtime, you can perform a cold backup to back up the actual crawled data in the Oracle SES index.

Perform the following steps to do a cold backup:

Shut down the Oracle SES instance:

% $ORACLE_HOME/search/bin/searchctl stopall

Copy all data files under the Oracle data storage location.

This location was specified during the Oracle SES installation. If your data storage location is /mnt1/oracle/ses/oradata, then copy all files under that directory. There are several ways to make a copy. For example, using the zip command:
```
% cd /mnt1/oracle/ses/
% zip -r oradata.zip oradata
```
Optional. Back up cached files.

If you retain cache files, then users can click the "cached" link in the result list. (Cached files can occupy a lot of disk space.)

The cache directory location is listed on the Global Settings - Query Configuration page. For example, if the cache directory is /mnt1/oracle/ses/cache, then run the following commands.
```
% cd /mnt1/oracle/ses
% zip -r cache.zip cache
```
Put backup files in a safe location.
To recover files from a cold backup, do the following:
1. Shut down the Oracle SES instance:
```
% $ORACLE_HOME/search/bin/searchctl stopall
```
2. Restore all backed-up files. Put all backed up files in the exact same place.
3. Start the Oracle SES instance:
```
% $ORACLE_HOME/search/bin/searchctl startall
```

Understanding Attributes

Each source has its own set of document attributes. Document attributes, like metadata, describe the properties of a document. The crawler retrieves values and maps them to one of the search attributes. This mapping lets users search documents based on their attributes. After you crawl a source, you can see the attributes for that source. Document attribute information is obtained differently depending on the source type. This section lists the attributes for each Oracle SES source type.

See Also:

"Overview of Attributes" for conceptual information about document and search attributes in Oracle SES
"Customizing the Appearance of Search Results" for a list of Oracle internal attributes
"Searching on Date Attributes"

For table and database source types, there are no predefined attributes. The crawler collects attributes from columns defined during source creation. The Oracle SES administrator must map the column to the search attributes.

For Oracle E-Business Suite and Siebel source types, attributes are specified by the user. Attributes for Oracle E-Business Suite 11i and Siebel 7.8 sources are specified in the query while creating the source. Attributes for Oracle E-Business Suite 12 and Siebel 8 sources are specified in the XML data feed. (That is, you can specify attributes in the XML data feed yourself).

For many source types (such as OracleAS Portal, e-mail, NTFS, and Microsoft Exchange sources), the crawler picks up key attributes offered by the target systems. These are listed in the following sections:

Web Source Attributes
File Source Attributes
E-mail and Mailing List Attributes
OracleAS Portal Source Attributes
Microsoft Exchange Source Attributes
NTFS Source Attributes
Oracle Calendar Attributes
Oracle Content Database Source Attributes

Note:

For all other sources, such as Documentum eRoom or Lotus Notes, there is an Attribute list parameter in the Home - Sources - Customize User-Defined Source page. Any attributes entered by users are collected by the crawler and available for search.

There are also system-defined search attributes. See "System-Defined Search Attributes".

Web Source Attributes

Title
Author
Description
Host
Keywords
Language
LastModifiedDate
Mimetype
Subject: This is mapped to "Description". If there is no description metatag in the HTML file, then it is ignored.
Headline1: The highest H tag text; for example, "Annual Report" from <H2>Annual Report</H2> when there is no H1 tag in the page.
Headline2: The second highest H tag text
Reference Text: The anchor text from another Web page that points to this page.

Additional HTML metatags can be defined to map to a String attribute on the Home - Sources - Metatag Mapping page.

File Source Attributes

Title
Author
Description
Host
Keywords
Language
LastModifiedDate
Mimetype
Subject

E-mail and Mailing List Attributes

author
title
subject
language
lastmodifieddate

OracleAS Portal Source Attributes

Table 8-5 OracleAS Portal Source Attributes

Attribute	Description
createdate	Date the document was created
creator	User name of the person who created the document
author	User-editable field in which users can specify a full name or any other value
page_path	Hierarchy path of the portal page/item in the portal tree (contains page titles)
portal_path	Hierarchy path of the portal page/item in the portal tree, used for browsing and boundary rules (contains page names). When searching OracleAS Portal 10.1.2, portal_path appears as upper case in the browse. When searching OracleAS Portal 10.1.4, portal_path appears as lower case.
title	Title of the document
description	Brief description of the document
keywords	Keywords of the document
expiredate	Expiration date of the document
host	Portal host
infosource	Path of the Portal page in the browse hierarchy
language	Language of the portal page or item
lastmodifieddate	Last modified date of the document
mimetype	Usually 'text/html' for portal
perspectives	User-created markers that can be applied to pages or items, such as 'INTERNAL ONLY', 'REVIEWED', or 'DESIGN SPEC'. For example, a Portal containing recipes could have items representing recipes with perspectives such as 'Breakfast', 'Tea', 'Contains Nuts', 'Healthy' and one particular item could have several perspectives assigned to it.
wwsbr_name_	Internal name of the portal page or item
wwsbr_charset_	Character set of the portal page or item
wwsbr_category_	Category of the portal page or item
wwsbr_updatedate_	Date the portal page or item was last updated
wwsbr_updator_	Person who last updated the page or item
wwsbr_subtype_	Subtype of the portal page/item (for example, container)
wwsbr_itemtype_	Portal item type
wwsbr_mime_type_	Mimetype of the portal page or item
wwsbr_publishdate_	Date the portal page or item was published
wwsbr_version_number_	Version number of the portal item

Microsoft Exchange Source Attributes

ReceivedTime
From
To
CC
Subject
Lastmodifieddate

NTFS Source Attributes

ACLS_
FILEDATE
Host
Language
LastModifiedDate
Mimetype
Title

Oracle Calendar Attributes

Description
Priority
Status
start date
end date
event Type
Author
Created Date
Title
Location
Dial_info
ConferenceID
ConferenceKey
Duration

Oracle Content Database Source Attributes

AUTHOR
CREATE_DATE
DESCRIPTION
FILE_NAME
LASTMODIFIEDDATE
LAST_MODIFIED_BY
TITLE
ACL_CHECKSUM: The check sum calculated over the ACL submitted for the document.
DOCUMENT_LANGUAGE: Oracle SES language code taken from the Oracle Content Database language string. For example, if Oracle Content Database uses "American", then Oracle SES submits it as "en-us".
DOCUMENT_CHARACTER_SET: The character set for the Oracle Content Database document.
MIMETYPE

Oracle SES can also search categories or customized attributes created by the user in Oracle Content Database.

You can apply categories to files and links. Categories can be divided into subcategories and can have one or more attributes. When a document in Oracle Content Database is attached to a category, you can search on the attributes of that category. (The attributes appear in the list of search attributes.)

For example, suppose you create a category named testCategory with testAttr1 and testAttr2. Document X is created and assigned the testCategory. You must assign the value to the testCategory's attributes. After crawling, testAttr1 and testAttr2 will appear in the search attribute list.

Customized attribute values can be the following types: String, Integer, Long, Double, Boolean, Date, User, Enumerated String, Enumerated Integer, and Enumerated Long.

Customized attributes of type Long, Double, Integer, Enumerated Integer, and Enumerated Long are indexed as Number attributes in Oracle SES (the display name has an "_N" suffix).

Customized attributes of type Date are indexed as Date attributes in Oracle SES (the display name has a "_D" suffix).

Customized attributes of type String, Enumerated String, and User are indexed as String attributes in Oracle SES.
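
The type mapping just described can be summarized as a small lookup. The following Python sketch is only an illustration and is not part of Oracle SES: the names SES_TYPE_BY_CUSTOM_TYPE and ses_display_name are hypothetical, and it assumes that String attributes receive no suffix, which the text does not state. Boolean is omitted because no mapping is given for it.

```python
# Illustrative lookup only; these names are hypothetical, not an Oracle SES API.
SES_TYPE_BY_CUSTOM_TYPE = {
    "Long": "Number",
    "Double": "Number",
    "Integer": "Number",
    "Enumerated Integer": "Number",
    "Enumerated Long": "Number",
    "Date": "Date",
    "String": "String",
    "Enumerated String": "String",
    "User": "String",
}

# Assumption: String attributes keep their name unchanged (no suffix).
SUFFIX_BY_SES_TYPE = {"Number": "_N", "Date": "_D", "String": ""}


def ses_display_name(attribute_name, custom_type):
    """Return the display name Oracle SES would show for a customized attribute."""
    ses_type = SES_TYPE_BY_CUSTOM_TYPE[custom_type]  # KeyError for unmapped types
    return attribute_name + SUFFIX_BY_SES_TYPE[ses_type]
```

For example, under these assumptions a customized attribute testAttr1 of type Long would be listed as testAttr1_N.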

Limitations:

The Oracle Content Database SDK has more features than the Oracle Content Database Web GUI. The Web GUI does not support the String Array, but the SDK does. If you use the SDK to build a customized admin and user GUI to support the String array type, then a customized attribute could have more than one attribute value.
If a document in Oracle Content Database is attached to a category and the attributes in that category are left blank, then when a user searches in Oracle SES (using Advanced Search), the attribute is not available in the list.

For example, create testCategory with three attributes. A document is created and assigned this category. As a test, assign one attribute the value "test" and leave another attribute blank. After crawling, the attribute that was assigned the value "test" appears in the list when you search. However, the attribute that was left blank does not appear in the list. If an attribute has a null value, it is skipped by the crawler. But if another document has the same attribute with some value, then it is indexed.

System-Defined Search Attributes

There are two system-defined search attributes, Urldepth and Infosource Path.

Urldepth measures the number of levels down from the root directory. It is derived from the URL string. In general, the depth is the number of slashes, not counting the slash immediately following the host name or the trailing slash. An adjustment of minus 2 is made to home pages. An adjustment of plus 1 is made to dynamic pages (such as the example in the following table with the question mark in the URL).

The following table lists the Urldepth of some example URLs.

URL	Urldepth
http://my.company.com/portal/page/myo/Employee_Portal/MyCompany	4
http://my.company.com/portal/page/myo/Employee_Portal/MyCompany/	4
http://my.company.com/portal/page/myo/Employee_Portal/MyCompany.htm	4
http://us.rd.foo.com/finance/finhome/topstories/wall_street.html?.v=46	4
http://my.company.com/portal/page/myo/Employee_Portal/home.htm	2
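
The counting rule can be sketched in Python to check it against the table. This is a rough illustration, not the Oracle SES implementation. In particular, the text does not say how a home page is recognized; the sketch assumes a file name of home or index, and the names HOME_PAGE_STEMS and url_depth are hypothetical.

```python
from urllib.parse import urlsplit

# Assumption: a home page is recognized by its file name. The real
# detection rule used by Oracle SES is not documented here.
HOME_PAGE_STEMS = {"home", "index"}


def url_depth(url):
    """Approximate Urldepth following the rule described in the text."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")          # the trailing slash is not counted
    depth = max(path.count("/") - 1, 0)    # nor is the slash after the host name

    last_segment = path.rsplit("/", 1)[-1]
    stem = last_segment.split(".", 1)[0].lower()
    if stem in HOME_PAGE_STEMS:
        depth -= 2                         # adjustment for home pages
    if parts.query:
        depth += 1                         # adjustment for dynamic pages
    return max(depth, 0)
```

Under these assumptions, the function reproduces the five values in the table.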

Urldepth is used internally for relevance ranking calculation under the heuristic that a URL with a smaller URL depth is more important.

Infosource Path is a path representing the source of the document. It is an internal attribute. This attribute is used in situations where documents can be browsed by their source. The Infosource Path is generally derived from the URL string. For example, in the URL just given for Urldepth, the Infosource Path is:

portal/page/myo/Employee_Portal

If the document is submitted through a connector, this value can be set explicitly by using the DocumentMetadata.setSourceHierarchy() API.

Troubleshooting Sources

This section contains the following topics:

Tips for Using Table and Database Sources
Tips for Using File Sources
Tips for Using Mailing List Sources
Tips for Using OracleAS Portal Sources
Tips for Using User-Defined Sources
Tips for Using Federated Sources

Tips for Using Table and Database Sources

Table source types and database source types are similar, in that they both crawl database tables.

This section contains the following topics:

Understanding Table Sources Versus Database Sources
Crawling Tables with Quoted Identifiers

Understanding Table Sources Versus Database Sources

This section describes the benefits and limitations of both table source types and database source types.

Note:

For performance reasons, both source types require that the KEY column be backed by an index.

Table Source Benefits

A table source does not need to contain a specific set of columns.
A table source automatically creates a display URL target. You do not need to arrange for the content to be displayed by some other mechanism.
A table source does not require JDBC connection syntax.

Table Source Limitations

To crawl non-Oracle databases as a table source, you must create a view in an Oracle database on the non-Oracle table. Then create the table source on the Oracle view. Oracle SES accesses the database using database links.
Only one table or view can be specified for each table source. If data from more than one table or view is required, then first create a single view that encompasses all required data.
Oracle SES cannot crawl tables inside the Oracle SES database.
Table column mappings cannot be applied to LOB columns.
The following data types are supported for table sources: BLOB, CLOB, CHAR, VARCHAR, VARCHAR2.
If the content column has a data type of CLOB or BLOB, and selecting from a view raises an ORA-01445 error, then creating a table source based on that view will raise the same error.

Database Source Benefits

Database sources provide additional flexibility. A database source type is built on JDBC, so you can crawl any JDBC-enabled database.
- A database source supports any SQL query with join conditions without creating a view. In some databases, creating objects may not be feasible.
- A database source supports crawling content pointed to by a URL stored in the ATTACHMENT_LINK column.
- A database source supports Info source path hierarchy and MIMETYPEs.
Database sources provide additional security. A database source provides security on the row level. It provides a third security option, ACLs Provided by Source, that is not available for table sources.

Database Source Limitations

The base table or view cannot have text columns of type BFILE or RAW.
The value of the required URL column cannot be null.

Crawling Tables with Quoted Identifiers

Database object names may be represented with a quoted identifier. A quoted identifier is case-sensitive and begins and ends with double quotation marks ("). If the database object is represented with a quoted identifier, then you must use the double quotation marks and the same case whenever you refer to that object.

When creating a table source in Oracle SES, if the table name is a quoted identifier, such as "1 (Table)", then in the Table Name field enter "1 (Table)", with the same case and double quotation marks. Similarly, if a primary key column or content column is named using a quoted identifier, then enter that name exactly as it appears in the database with double quotation marks.

See Also:

Oracle Database SQL Reference (available on Oracle Technology Network) for more information about schema object names and qualifiers

Tips for Using File Sources

This section contains the following topics:

Crawling File Sources with Non-ASCII
Crawling File Sources with Symbolic Links
Crawling File URLs
Crawling File Sources from a Network Drive

Crawling File Sources with Non-ASCII

For file sources to successfully crawl and display multibyte environments, the locale of the computer that starts the Oracle SES server must be the same as the target file system. This way, the Oracle SES crawler can "see" the multibyte files and paths.

If the locale is different in the installation environment, then Oracle SES needs to be reinstalled from the environment with the correct locale. For example, for a Korean environment, either set LC_ALL to ko_KR or set both LC_LANG and LANG to ko_KR.KSC5601. Then restart Oracle SES with searchctl restartall from either a command prompt on Windows or an xterm on UNIX.

Crawling File Sources with Symbolic Links

When crawling file sources on UNIX, the crawler will resolve any symbolic link to its true directory path and enforce the boundary rule on it. For example, suppose directory /tmp/A has two children, B and C, where C is a link to /tmp2/beta. The crawl will have the following URLs:

/tmp/A
/tmp/A/B
/tmp2/beta
/tmp/A/C

If the inclusion rule is /tmp/A, then /tmp2/beta will be excluded. The seed URL is treated as is.
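
To see why the link target is excluded, the following Python sketch resolves a path to its true location before testing it against an inclusion prefix. It only mimics the behavior described above and is not the crawler's code; the function name within_boundary is hypothetical, and the inclusion rule is assumed to be an absolute directory path that contains no symbolic links itself.

```python
import os


def within_boundary(path, inclusion_prefix):
    """Resolve symbolic links, then test the real path against the rule."""
    real_path = os.path.realpath(path)
    # commonpath equals the prefix only when real_path lies under it
    return os.path.commonpath([real_path, inclusion_prefix]) == inclusion_prefix
```

With the layout from the example, /tmp/A/B stays inside the /tmp/A rule, while /tmp/A/C resolves to /tmp2/beta and falls outside it.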

Crawling File URLs

If a file URL is to be used "as is", without going through Oracle SES to retrieve the file, then "file" in the URL should be upper case "FILE". For example, FILE://localhost/...

"As is" means that when a user clicks the search link of the document, the browser will try to use the specified file URL on the client computer to retrieve the file. Without that, Oracle SES uses this file URL on the server computer and sends the document through HTTP to the client computer.

Crawling File Sources from a Network Drive

If the files will be crawled from a network drive, then the Oracle process should be started as a user who has access to the drive.

See Also:

"Required Tasks" for instructions on how to change the user running the Oracle process.

Tips for Using Mailing List Sources

The Oracle SES crawler is IMAP4 compliant. To crawl mailing list sources, you need an IMAP e-mail account. It is recommended to create an e-mail account that is used solely for Oracle SES to crawl mailing list messages. The crawler is configured to crawl one IMAP account for all mailing list sources. Therefore, all mailing list messages to be crawled must be found in the Inbox of the e-mail account specified on this page. This e-mail account should be subscribed to all the mailing lists. New postings for all the mailing lists will be sent to this single account and subsequently crawled.
Messages deleted from the global mailing list e-mail account are not removed from the Oracle SES index. In fact, the mailing list crawler itself will delete messages from the IMAP e-mail account as it crawls. The next time the IMAP account for mailing lists is crawled, the previous messages will no longer be there. Any new messages in the account will be added to the index (and also consequently deleted from the account). This keeps the global mailing list IMAP account clean. The Oracle SES index serves as a complete archive of all the mailing list messages.

Tips for Using OracleAS Portal Sources

An OracleAS Portal source name cannot exceed 35 characters.
URL boundary rules are not enforced for URL items. A URL item is the metadata that resides on the OracleAS Portal server. Oracle SES does not touch the display URL or the boundary rules for URL items.
The portal_path attribute is used to compare boundary rules. Portal pages and items are organized in a tree structure. When a page is included or excluded, its entire subtree starting with that node is included or excluded.
If OracleAS Portal user privileges change, it is possible that content the crawler collects is not properly authorized. For example, in a Portal crawl, the user specified in the Home - Sources - Authentication page does not have privileges to see certain Portal pages. However, after privileges are granted to the user, on subsequent incremental crawls, the content still is not picked up by the crawler. Similarly, if privileges are revoked from the user, it is possible that content still is picked up by the crawler.

To be certain that Oracle SES has the correct set of documents, whenever a user's privileges change, update the crawler re-crawl policy to Process All Documents on the Home - Schedules - Edit Schedules page, and restart the crawl.
Oracle SES provides an option in the crawler.dat file to turn on smart incremental crawling for OracleAS Portal sources. This makes re-crawls more efficient by getting a list of changed pages and items directly from OracleAS Portal.

See Also:
"Smart Incremental Crawl for OracleAS Portal Sources"

Tips for Using User-Defined Sources

If a plug-in is to return file URLs to the crawler, then the file URLs must be fully qualified. For example, file://localhost/.
If a file URL is to be used "as is" without going through Oracle SES for retrieving the file, then "file" in the URL should be upper case "FILE". For example, FILE://localhost/...

See Also:

"Crawling File URLs"

Tips for Using Federated Sources

The Oracle SES federator caches the federator configuration (that is, all federation-related parameters including federated sources). As a result, any change in the configuration will take effect within five minutes.
If you entered proxy settings on the Global Settings - Proxy Settings page, then make sure to add the Web Services URL for the federated source as a proxy exception.
If the federation endpoint instance is set to secure mode 3 (require login to search secure and public content), then all documents (ACL stamped or not) are secure. For secure federated search, create a trusted entity in the federation endpoint instance, then edit the federated source with the trusted entity user name and password.
There can be consistency issues if you have configured a BIG-IP system as follows:
- You have two Oracle SES instances configured identically (same crawls, same sources, and so on) behind a BIG-IP load balancer to act as a single logical Oracle SES instance.
- You have two other Oracle SES instances configured identically along with Oracle HTTP Server and OracleAS Web Cache fronting each one and both servers behind BIG-IP. Each of these two instances federate to the logical Oracle SES instance. Web Cache is clustered between these two nodes to act as a single logical Oracle SES instance called broker instance.
When a user performs a search on the broker Oracle SES instance and tries to access the documents in the result, document access may not be consistent each time. As a workaround, make sure that the load balancer sends all the requests in one user session to the exact same node each time.

Federated Search Characteristics

Federated search can improve performance by distributing query processing on multiple computers. It can be an efficient way to scale up search service by adding a cluster of Oracle SES instances.
The federated search quality depends on the network topology and throughput of the entire federated Oracle SES environment.

Federated Search Limitations

There is a size limit of 200KB for the cached documents existing on the federation endpoint to be displayed on the Oracle SES federation broker instance.
For infosource browse, if the source hierarchies for both local and federated sources under one source group start with the same top level folder, then a sequence number is added to the folder name belonging to the federated source to distinguish the two hierarchies on the Browse page.
For federated infosource browse, a federated source should be put under an explicitly created source group.
On the Oracle SES federation broker, there is no direct access to documents on the federation endpoint through the display URL in the search result list for the following source types:
- File (local files, not UNC)
- Table
- E-mail
- Mailing list
For these source types, only the cached version of documents is accessible.

Tuning Crawl Performance

Your Web crawling strategy can be as simple as identifying a few well-known sites that are likely to contain links to most of the other intranet sites in your organization. You could test this by crawling these sites without indexing them. After the initial crawl, you have a good idea of the hosts that exist in your intranet. You could then define separate Web sources to facilitate crawling and indexing on individual sites.

However, the process of discovering and crawling your organization's intranet, or the Internet, is generally an interactive one characterized by periodic analysis of crawling results and modification to crawling parameters. For example, if you observe that the crawler is spending days crawling one Web host, then you might want to exclude crawling at that host or limit the crawling depth.

This section contains the most common things to consider to improve crawl performance:

Understanding the Crawler Schedule
Registering a Proxy
Checking Boundary Rules
Checking Dynamic Pages
Checking Crawler Depth
Checking Robots.txt Rule
Checking Duplicate Documents
Checking Redirected Pages
Checking URL Looping
Increasing the Oracle Redo Log File Size
Adding Datafiles
What to do Next

See Also:

"Monitoring the Crawling Process" for more information on crawling parameters

Understanding the Crawler Schedule

Schedules define the frequency at which the Oracle SES index is updated with information about each source. This section describes characteristics of the Oracle SES crawler schedule.

The Failed Schedules section on the Home - General page lists all schedules that have failed. A failed schedule is one in which the crawler encountered a fatal error, such as an indexing error or a source-specific login error, and cannot proceed. A failed schedule could be the result of a partial collection and indexing of documents.
The smallest granularity of the schedule interval is one hour. For example, you cannot have a schedule started at 1:30am.
If a crawl takes longer to finish than the scheduled interval, then the next crawl starts as soon as the current crawl is done. Currently, there is no option to have the scheduled time automatically pushed back to the next scheduled time.
When multiple sources are assigned to one schedule, the sources are crawled one by one following the order of their assignment in the schedule.
The schedule starts crawling the assigned sources in the assigned order. Only one source is crawling under a schedule at any given time. If a source crawl fails, then the rest of the sources assigned after it are not crawled. The schedule does not restart. You must either resolve the cause of the failure and resume the schedule, or remove the failed source from the schedule.
There is no automatic e-mail notification of schedule success or failure.

Registering a Proxy

By default, Oracle SES is configured to crawl Web sites in the intranet. In other words, crawling internal Web sites requires no additional configuration. However, to crawl Web sites on the Internet (also referred to as external Web sites), Oracle SES needs the HTTP proxy server information. Set this on the Global Settings - Proxy Settings page. (If the proxy requires authentication, then enter the proxy authentication information on the Global Settings - Authentication page.)

Because internal Web sites should not go through the proxy server, specify proxy domain exceptions if the proxy server is set. Enter the host name suffix that should not go through the proxy in the exception field. To exclude the entire domain, use the suffix of the host name without http and begin with *.; for example, *.us.example.com or *.example.com. Entries without the *. prefix are treated as a single host. An IP address can be used as an exception only when the crawled URL also identifies the host by IP address rather than by host name. In other words, they must be consistent.
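
The exception syntax can be illustrated with a short Python sketch. This is a model of the matching rule as described above, not Oracle SES code: the function name bypasses_proxy is hypothetical, and the sketch assumes that a *. entry matches only hosts under that domain, not the bare domain itself.

```python
def bypasses_proxy(host, exceptions):
    """Return True if the host matches one of the proxy exception entries."""
    host = host.lower()
    for entry in exceptions:
        entry = entry.lower()
        if entry.startswith("*."):
            # Domain entry: compare against the suffix, for example ".example.com"
            if host.endswith(entry[1:]):
                return True
        elif host == entry:
            # Entry without "*." is treated as a single host
            return True
    return False
```

For example, with the exception list ["*.us.example.com", "wiki.example.com"], the host intranet.us.example.com bypasses the proxy, while docs.wiki.example.com does not, because wiki.example.com is a single-host entry.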

Checking Boundary Rules

The seed URL you enter when you create a source is turned into an inclusion rule. For example, if www.example.com is the seed URL, then Oracle SES creates an inclusion rule that only URLs containing the string www.example.com will be crawled.

However, suppose that the example Web site includes URLs starting with www.exa-mple.com or ones that start with example.com (without the www). Many pages have a prefix on the site name. For example, the investor section of the site has URLs that start with investor.example.com.

Always check the inclusion rules before crawling, then check the log after crawling to see what patterns have been excluded.

In this case, you might add www.example.com, www.exa-mple.com, and investor.example.com to the inclusion rules. Or you might just add example.
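
The trade-off between the two choices can be seen with a simple substring check. The following Python sketch models a simple "contains" inclusion rule only; the function name is_included and the sample URLs are made up for illustration, and the sketch does not cover regular expression rules.

```python
def is_included(url, inclusion_rules):
    """A URL passes if it contains at least one of the inclusion strings."""
    return any(rule in url for rule in inclusion_rules)


specific_rules = ["www.example.com", "www.exa-mple.com", "investor.example.com"]
broad_rules = ["example"]

urls = [
    "http://www.example.com/index.html",
    "http://investor.example.com/reports/",
    "http://www.counterexample.org/",  # hypothetical unrelated site
]

for url in urls:
    print(url, is_included(url, specific_rules), is_included(url, broad_rules))
```

The broad rule is shorter, but it also admits the unrelated site in the last URL, which the three specific rules reject.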

To crawl outside the seed site (for example, if you are crawling text.us.oracle.com, but you want to follow links outside of text.us.oracle.com to oracle.com), consider removing the inclusion rules altogether. Do so carefully. This could lead the crawler into many, many sites.

Notes for File Sources

For file sources, if no boundary rule is specified, then crawling is limited to the underlying file system access privileges. Files accessible from the specified seed file URL will be crawled, subject to the default crawling depth. The depth, which is 2 by default, is set on the Global Settings - Crawler Configuration page. For example, if the seed is file://localhost/home/user_a/, then the crawl will pick up all files and directories under user_a with access privileges. It will crawl documents in the directory /home/user_a/level1, but the depth limit excludes documents in the /home/user_a/level1/level2 directory, which are at level 3.
The file URL can be of UNC (universal naming convention) format. The UNC file URL has the following format: file://localhost///<LocalComputerName>/<SharedFolderName>.

For example, \\stcisfcr\docs\spec.htm should be specified as file://localhost///stcisfcr/docs/spec.htm.
On some computers, the path or file name could contain non-ASCII and multibyte characters. URLs are always represented using the ASCII character set. Non-ASCII characters are represented using the hex representation of their UTF-8 encoding. For example, a space is encoded as %20, and a multibyte character can be encoded as %E3%81%82.

For file sources, spaces can be entered in simple (not regular expression) boundary rules. Oracle SES automatically encodes these URL boundary rules. If (Home Alone) is specified, then internally it is stored as (Home%20Alone). Oracle SES does this encoding for the following:
- File source simple boundary rules
- Test URL strings
- File source seed URLs

Note:

Oracle SES does not alter the rule if it is a regular expression rule. It is the administrator's responsibility to make sure that the regular expression rule specified is against the encoded file URL. Spaces are not allowed in regular expression rules.
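
When writing regular expression rules against encoded file URLs, it can help to produce the encoded form yourself. The following Python sketch uses the standard urllib.parse.quote function to build a UNC file URL and to percent-encode non-ASCII characters as UTF-8. The helper name unc_to_file_url is hypothetical, and the sketch illustrates the URL format described above rather than the encoding code Oracle SES uses internally.

```python
from urllib.parse import quote


def unc_to_file_url(unc_path):
    """Convert a UNC path such as \\\\host\\share\\file to a file URL."""
    # Backslashes become forward slashes; quote() leaves "/" untouched and
    # percent-encodes spaces and non-ASCII characters as UTF-8.
    return "file://localhost/" + quote(unc_path.replace("\\", "/"))


print(unc_to_file_url(r"\\stcisfcr\docs\spec.htm"))
print(quote("Home Alone"))
print(quote("\u3042"))  # a multibyte character (HIRAGANA LETTER A)
```

The first call yields file://localhost///stcisfcr/docs/spec.htm, and the other two yield Home%20Alone and %E3%81%82, matching the examples in the text.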

Checking Dynamic Pages

Indexing dynamic pages can generate an excessive number of URLs. From the target Web site, manually navigate through a few pages to understand what boundary rules should be set to avoid crawling duplicate pages.

Checking Crawler Depth

Setting the crawler depth very high (or unlimited) could lead the crawler into many sites. Without boundary rules, a depth of 20 will probably crawl the whole WWW from most locations.

Checking Robots.txt Rule

You can control which parts of your sites can be visited by robots. If robots exclusion is enabled (default), then the Web crawler traverses the pages based on the access policy specified in the Web server robots.txt file.

The following sample /robots.txt file specifies that no robots should visit any URL starting with /cyberworld/map/ or /tmp/ or /foo.html:

# robots.txt for http://www.example.com/
 
User-agent: *
Disallow: /cyberworld/map/ 
Disallow: /tmp/ 
Disallow: /foo.html
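
To check in advance which URLs a robots.txt file like this one blocks, you can evaluate it with any standards-following parser. The following Python sketch uses the standard library urllib.robotparser module. It is independent of Oracle SES and only approximates how a compliant crawler reads the file; the helper name may_crawl and the tested URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# robots.txt for http://www.example.com/

User-agent: *
Disallow: /cyberworld/map/
Disallow: /tmp/
Disallow: /foo.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(url, user_agent="*"):
    """Return True if the sample rules allow the given URL to be fetched."""
    return parser.can_fetch(user_agent, url)


print(may_crawl("http://www.example.com/index.html"))
print(may_crawl("http://www.example.com/tmp/report.html"))
```

The first URL is allowed and the second is blocked by the /tmp/ rule.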

If the Web site is under the user's control, then a specific robots rule can be tailored for the crawler by specifying the Oracle SES crawler plug-in name "User-agent: Oracle Secure Enterprise Search." For example:

User-agent: Oracle Secure Enterprise Search
 
Disallow: /tmp/

The robots meta tag can instruct the crawler whether to index a Web page and whether to follow the links within it. For example, the following tag tells the crawler to do neither:

<meta name="robots" content="noindex,nofollow">

Checking Duplicate Documents

Oracle SES always removes duplicate (identical) documents. If Oracle SES thinks a page is a duplicate to one it has seen before, then it will not index it. If the page is reached through a URL that Oracle SES has already processed, then it will not index that either.

With the Web Services API, you can enable or disable near duplicate detection and removal from the result list. Near duplicate documents are similar to each other. They may or may not be identical to each other.

Checking Redirected Pages

When a page redirects to another page, the crawler crawls only the page that is the target of the redirect. For example, a Web site might have Javascript redirecting users to another site with the same title. Only the redirected site is indexed.

Check for inclusion rules from redirects. This is based on type of redirect. There are three kinds of redirects defined in EQ$URL:

Temporary Redirect: A redirected URL is always allowed if it is a temporary redirection (HTTP status code 302 or 307). Temporary redirection is used when, for whatever reason, the original URL should still be used in the future. You cannot identify temporary redirects from the EQ$URL table; the only way to find them is to filter the log file.
Permanent Redirect: For permanent redirection (HTTP status code 301), the redirected URL is subject to boundary rules. Permanent redirection means the original URL is no longer valid and the user should start using the new (redirected) one. In EQ$URL, an HTTP permanent redirect has the status code 954.
Meta Redirect: Metatag redirection is treated as a permanent redirect. A meta redirect has the status code 954. It is always checked against boundary rules.
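The three cases above can be summarized as a small decision function. The following Python sketch is illustrative only: the function names and the boolean target_within_boundary argument are hypothetical stand-ins for the crawler's boundary rule check, and the code is not taken from Oracle SES.

```python
TEMPORARY_STATUS_CODES = {302, 307}
PERMANENT_STATUS_CODES = {301}


def classify_redirect(http_status=None, is_meta_redirect=False):
    """Return 'temporary', 'permanent', 'meta', or 'none' for a response."""
    if is_meta_redirect:
        return "meta"
    if http_status in TEMPORARY_STATUS_CODES:
        return "temporary"
    if http_status in PERMANENT_STATUS_CODES:
        return "permanent"
    return "none"


def redirect_target_allowed(kind, target_within_boundary):
    """Apply the rules described above to a redirect target.

    target_within_boundary is assumed to be the result of checking the
    redirected URL against the source's boundary rules.
    """
    if kind == "temporary":
        # Temporary redirects are always allowed.
        return True
    if kind in ("permanent", "meta"):
        # Permanent and meta redirects are subject to boundary rules.
        return target_within_boundary
    raise ValueError(f"not a redirect: {kind!r}")


print(redirect_target_allowed(classify_redirect(302), False))
print(redirect_target_allowed(classify_redirect(301), False))
print(redirect_target_allowed(classify_redirect(is_meta_redirect=True), True))
```

If a redirected page is missing from the index, this distinction suggests checking first whether the redirect is permanent or meta, and then whether the target URL falls inside the boundary rules.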

Checking URL Looping

URL looping refers to the scenario where a large number of unique URLs all point to the same document. One particularly difficult situation is where a site contains a large number of pages, and each page contains links to every other page in the site. Ordinarily this would not be a problem, because the crawler eventually analyzes all documents in the site.

However, some Web servers attach parameters to generated URLs to track information across requests. Such Web servers might generate a large number of unique URLs that all point to the same document.

For example, http://example.com/somedocument.html?p_origin_page=10 might refer to the same document as http://example.com/somedocument.html?p_origin_page=13 but the p_origin_page parameter is different for each link, because the referring pages are different. If a large number of parameters are specified and if the number of referring links is large, then a single unique document could have thousands or tens of thousands of links referring to it. This is an example of how URL looping can occur.
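When reviewing crawler statistics or logs, it can help to group URLs that differ only in a tracking parameter. The following Python sketch removes the p_origin_page parameter from the example URLs above and shows that both reduce to the same document URL. The function name and the set of parameters to ignore are assumptions for illustration; this is a diagnostic aid, not a description of how Oracle SES handles such URLs.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameters assumed to carry tracking information only (adjust per site).
TRACKING_PARAMETERS = {"p_origin_page"}


def canonical_url(url):
    """Return the URL with tracking parameters removed from the query string."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in TRACKING_PARAMETERS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


first = "http://example.com/somedocument.html?p_origin_page=10"
second = "http://example.com/somedocument.html?p_origin_page=13"

# Both URLs reduce to the same canonical form, so they name one document.
print(canonical_url(first))
print(canonical_url(first) == canonical_url(second))
```

Counting how many crawled URLs share one canonical form gives a rough measure of how severe the looping is on a given host.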

Monitor the crawler statistics in the Oracle SES administration tool to determine which URLs and Web servers are being crawled the most. If you observe an inordinately large number of URL accesses to a particular site or URL, then you might want to do one of the following:

Exclude the Web Server: This prevents the crawler from crawling any URLs at that host. (You cannot limit the exclusion to a specific port on a host.)
Reduce the Crawling Depth: This limits the number of levels of referred links the crawler will follow. If you are observing URL looping effects on a particular host, then you should take a visual survey of the site to find out an estimate of the depth of the leaf pages at that site. Leaf pages are pages that do not have any links to other pages. As a general guideline, add three to the leaf page depth, and set the crawling depth to this value.

Be sure to restart the crawler after altering any parameters. Your changes take effect only after restarting the crawler.

Increasing the Oracle Redo Log File Size

Oracle SES allocates 200M for the redo log during installation. 200M is sufficient to crawl a relatively large corpus. However, if your disk has sufficient space to increase the redo log and if you are going to crawl a very large corpus (for example, more than 300G as pure text size), then increase the redo log file size for better crawl performance.

Note:

The biggest transaction during crawling is SYNC INDEX by Oracle Text. Check the AWR report or the v$sysstat view to see the actual redo size during crawling. Roughly, 200M is sufficient to crawl up to 300G.

Launch SQL*Plus and connect as the SYSTEM user. (The password is the same as EQSYS.)

Run the following SQL statement to see the current redo log status:

SQL> SELECT vl.group#, member, bytes,  vl.status 
  2  FROM v$log vl, v$logfile vlf 
  3  WHERE vl.group#=vlf.group#; 
 
GROUP# MEMBER                                                  BYTES STATUS 
------ -------------------------------------------------- ---------- ---------- 
     3 /scratch/ses10181/oradata/o10181/redo03.log         209715200 INACTIVE 
     2 /scratch/ses10181/oradata/o10181/redo02.log         209715200 CURRENT 
     1 /scratch/ses10181/oradata/o10181/redo01.log         209715200 INACTIVE

Drop the INACTIVE redo log file. For example, to drop group 3:

SQL> ALTER DATABASE DROP LOGFILE group 3; 
 
Database altered.

Create a larger redo log file. If you want to change the file location, specify the new location.

SQL> ALTER DATABASE ADD LOGFILE '/scratch/ses10181/oradata/o10181/redo03.log' 
  2  size 400M reuse; 
 
Database altered.

Check the status to make sure the file was created.

SQL> SELECT vl.group#, member, bytes, vl.status 
  2  FROM v$log vl, v$logfile vlf 
  3  WHERE vl.group#=vlf.group#; 
 
GROUP# MEMBER                                                  BYTES STATUS 
------ -------------------------------------------------- ---------- ---------- 
     3 /scratch/ses10181/oradata/o10181/redo03.log         419430400 UNUSED 
     2 /scratch/ses10181/oradata/o10181/redo02.log         209715200 CURRENT 
     1 /scratch/ses10181/oradata/o10181/redo01.log         209715200 INACTIVE

To drop a log file with CURRENT status, run the following SQL statement:

SQL> ALTER SYSTEM SWITCH LOGFILE; 
 
System altered. 
 
SQL> SELECT vl.group#, member, bytes,  vl.status 
  2  FROM v$log vl, v$logfile vlf 
  3  WHERE vl.group#=vlf.group#; 
 
GROUP# MEMBER                                                  BYTES STATUS 
------ -------------------------------------------------- ---------- ---------- 
     3 /scratch/ses10181/oradata/o10181/redo03.log         419430400 CURRENT 
     2 /scratch/ses10181/oradata/o10181/redo02.log         209715200 ACTIVE 
     1 /scratch/ses10181/oradata/o10181/redo01.log         209715200 INACTIVE

Group 2 status changed to ACTIVE. Run the following SQL statement to change the status to INACTIVE:

SQL> ALTER SYSTEM CHECKPOINT; 
 
System altered. 
 
SQL>  SELECT vl.group#, member, bytes,  vl.status 
  2   FROM v$log vl, v$logfile vlf 
  3   WHERE vl.group#=vlf.group#; 
 
GROUP# MEMBER                                                  BYTES STATUS 
------ -------------------------------------------------- ---------- ---------- 
     3 /scratch/ses10181/oradata/o10181/redo03.log         419430400 CURRENT 
     2 /scratch/ses10181/oradata/o10181/redo02.log         209715200 INACTIVE 
     1 /scratch/ses10181/oradata/o10181/redo01.log         209715200 INACTIVE

Repeat steps 3, 4, and 5 (drop the inactive log file, create a larger one, and check the status) for redo log groups 1 and 2.
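The BYTES column of v$log reports sizes in bytes, whereas the ADD LOGFILE statement takes a size such as 400M. The following Python sketch converts the values from the sample output so that you can confirm each group was re-created at the intended size. It assumes that M means 1024 * 1024 bytes, which is consistent with the sample output (209715200 bytes for a 200M log). The helper name is hypothetical.

```python
BYTES_PER_MEGABYTE = 1024 * 1024


def bytes_to_megabytes(size_in_bytes):
    """Convert a v$log BYTES value to whole megabytes."""
    return size_in_bytes // BYTES_PER_MEGABYTE


# (group number, BYTES) pairs copied from the sample output above,
# taken after group 3 was re-created with size 400M.
sample_rows = [(3, 419430400), (2, 209715200), (1, 209715200)]

for group, size_in_bytes in sample_rows:
    print(f"group {group}: {bytes_to_megabytes(size_in_bytes)}M")
```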

Adding Datafiles

When crawling a large number of documents, the Oracle SES tablespace may not be big enough to complete the crawl. Add more datafiles to the Oracle SES tablespace to resume the crawl.

For example, the following steps add two datafiles to the OES tablespace:

Launch SQL*Plus and connect as the SYSTEM user. (The password is the same as EQSYS.)

Run the following SQL statement to see current datafile information:

SQL> SELECT FILE_NAME FROM dba_data_files WHERE tablespace_name ='OES';
 
FILE_NAME
------------------------------------------------------------------------------
/home/ses1018/oracle/product/oradata/ses1018/OES_01.dbf
/home/ses1018/oracle/product/oradata/ses1018/OES_02.dbf

Run the following SQL statement to add two datafiles:

SQL> ALTER TABLESPACE OES ADD DATAFILE
  2  '/home/ses1018/oracle/product/oradata/ses1018/OES_03.dbf' SIZE 10M
  3  AUTOEXTEND ON MAXSIZE UNLIMITED;
Tablespace altered.
 
SQL> ALTER TABLESPACE OES ADD DATAFILE 
  2  '/home/ses1018/oracle/product/oradata/ses1018/OES_04.dbf' SIZE 10M
  3  AUTOEXTEND ON MAXSIZE UNLIMITED;
Tablespace altered.

Run the following SQL statement to see current datafile information:

SQL> SELECT FILE_NAME FROM dba_data_files WHERE tablespace_name ='OES';
 
FILE_NAME
------------------------------------------------------------------------------
/home/ses1018/oracle/product/oradata/ses1018/OES_01.dbf
/home/ses1018/oracle/product/oradata/ses1018/OES_02.dbf
/home/ses1018/oracle/product/oradata/ses1018/OES_03.dbf
/home/ses1018/oracle/product/oradata/ses1018/OES_04.dbf

Note:

Oracle SES cannot add datafiles unless sufficient disk space is available.

What to do Next

If you are still not crawling all the pages you think you should, then check which pages were crawled by doing one of the following:

Check the crawler log file. (The Home - Schedules page has a link to the log file, and the Home - Schedules - Status page shows the location of the full log.)
Create a search source group. (Search - Source Groups - Create New Source Group) Put only one source in the group. From the Search page, search that group. (Click the group name on top of the search box.) Or, from the Search page, click Browse Search Groups. Click the group name for a hierarchy. You could also click the number next to the group name for a list of the pages crawled.

Tuning Search Performance

This section contains suggestions on how to improve the response time and throughput performance of Oracle SES.

The following topics cover the most common things to consider to improve search performance and search quality:

Adding Suggested Links
Optimizing the Index
Adjusting the Indexing Parameters
Checking the Search Statistics
Increasing the JVM Heap Size
Increasing the Oracle Undo Space

See Also:

"Searching on Date Attributes"

Adding Suggested Links

Suggested links let you direct users to a particular Web site for particular query keywords. For example, when users search for "Oracle Secure Enterprise Search documentation" or "Enterprise Search documentation" or "Search documentation", you could suggest http://www.oracle.com/technology.

Suggested link keywords are rules that determine which suggested links are returned (as suggestions) for a query. The rules consist of query terms and logical operators. For example, consider the rule "secure AND search". With this rule, the corresponding suggested link is returned for the query "secure enterprise search", but it is not returned for the query "secure database".
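The behavior of the AND rule in the preceding example can be illustrated with a few lines of Python. This is a deliberately simplified sketch: it handles only the AND operator on whole words, the function name is hypothetical, and it does not reproduce the rule language that Oracle SES actually evaluates.

```python
def matches_and_rule(rule, query):
    """Return True if every term of an AND rule appears in the query.

    Matching is on whole, case-insensitive words. Only AND is supported.
    """
    required_terms = [
        term.lower() for term in rule.split() if term.lower() != "and"
    ]
    query_terms = set(query.lower().split())
    return all(term in query_terms for term in required_terms)


rule = "secure AND search"
for query in ("secure enterprise search", "secure database"):
    print(query, "->", matches_and_rule(rule, query))
```

The second query fails because it contains "secure" but not "search", which is why the suggested link is not returned for it.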

The rule language used for the indexed queries supports the following operators:

Table 8-6 Suggested Link Keyword Operators

Operator	Example
AND	dog and cat
OR	dog or cat
PHRASE	dog sled
ABOUT	about(dogs)
NEAR	dog ; cat
STEM	$dog
WITHIN	dog within title
THESAURUS	SYN(dog)

Note:

Special characters (for example, '#', '$', '=', '&') should not be used in keywords.

Suggested links appear at the top of the search result list. Oracle SES can display up to two suggested links for each query.

This feature is especially useful to provide links to important Web pages that are not crawled by Oracle Secure Enterprise Search. Add or edit suggested links on the Search - Suggested Links page in the administration tool.

Optimizing the Index

Optimizing the index reduces fragmentation, and it can significantly increase the speed of searches. Schedule index optimization on a regular basis. Also, optimize the index after the crawler has made substantial updates or if fragmentation is more than 50%. Make sure index optimization is scheduled during off-peak hours. Optimization of a very large index could take several hours.

See the fragmentation level and run index optimization on the Global Settings - Index Optimization page in the administration tool. You can specify a maximum number of hours for the optimization to run, but for best performance, select to run the optimization until it finishes. This creates a more compact copy of the index, and then it switches the original index and the copy (so it requires enough space to store both the copy and the original). When optimization is finished, the original index is dropped, and the space can be reused.

Adjusting the Indexing Parameters

To improve indexing performance, try adjusting the following parameters on the Global Settings - Set Indexing Parameters page in the administration tool:

Indexing Batch Size

When the crawled data in the cache directory reaches Indexing Batch Size, Oracle SES starts indexing. The bigger the batch size, the longer it takes to start indexing each batch. Only indexed data can be searched: data in the cache cannot be searched. The default size is 250M.

Document fetching and indexing run concurrently. While indexing is running, the Oracle SES crawler continues to fetch documents and store them in the cache directory.

Indexing Memory Size

This is the upper limit of memory used for indexing before flushing the index to disk.

A large amount of memory improves indexing performance (because there is less I/O) and improves query performance (because the created index is less fragmented from the beginning -- a fragmented index can be optimized later). Set this as high as possible without causing memory paging.

A smaller amount of memory might be useful when indexing progress should be tracked or when run-time memory is scarce. The default size is 275M. In general, increasing the Indexing Memory Size parameter can reduce fragmentation.

Checking the Search Statistics

See the Home - Statistics page in the administration tool for lists of the most popular queries, failed queries, and ineffective queries. This information can lead to the following actions:

Refer users to a particular Web site for failed queries on the Search - Suggested Links page.
Fix common errors that users make in searching on the Search - Alternate Words page.
Make important documents easier to find on the Search - Relevancy Boosting page.

Relevancy Boosting

Relevancy boosting lets administrators influence the order of documents in the result list for a particular search. You might want to override the default results for the following reasons:

For a highly popular search, direct users to the best results
For a search that returns no results, direct users to some results
For a search that has no click-throughs, direct users to better results

In a search, each result is assigned a score that indicates how relevant the result is to the search; that is, how good a result it is. Sometimes there are documents that you know are highly relevant to some search. For example, your company Web site could have a home page for XML (http://example.com/XML-is-great.htm), which you want to appear high in the results of any search for "XML". You would boost the score of that home page (http://example.com/XML-is-great.htm) to 100 for an "XML" search.

There are two methods for locating URLs for relevancy boosting: locate by search or manual URL entry.

Note:

The document still has a score computed if you enter a search that is not one of the boosted queries.

Relevancy boosting, like end user searching, is case-insensitive. For example, a document with a boosted score for "Oracle" is boosted when you enter "oracle".

Increasing the JVM Heap Size

If you expect heavy load on the Oracle SES server, then configure the Java virtual machine (JVM) heap size for better performance.

The heap size is defined in the $ORACLE_HOME/search/config/searchctl.conf file. By default, the following values are given:

max_heap_size = 1024 megabytes

min_heap_size = 512 megabytes

Increase the value of these parameters appropriately. The maximum size should not exceed the physical memory size.

Then restart the middle tier with searchctl restart.

Increasing the Oracle Undo Space

Heavy query load should not coincide with heavy crawl activity, especially when there are large-scale changes on the target site. If it does, for example when the crawl needs to be scheduled around the clock, then increase the size of the Oracle undo tablespace with the UNDO_RETENTION parameter.

See Also:

Oracle Database SQL Reference and Oracle Administrator's Guide (available on Oracle Technology Network) for more information about increasing the Oracle undo space

Oracle SES Command Line Tools

The command line tool for starting and stopping the search engine is searchctl.

Note:

Users are prompted for a password when running searchctl commands on UNIX platforms. No password is required on Windows platforms. This is because Oracle SES installation on Windows requires a user with administrator privileges. When running commands to start or stop the search engine, no password is required as long as the user is a member of the administrator group.

See Also:

Startup / Shutdown lesson in the Oracle SES administration tutorial: http://st-curriculum.oracle.com/tutorial/SESAdminTutorial/index.htm

Restarting Oracle Secure Enterprise Search After Rebooting

To restart Oracle SES (for example, after rebooting the host computer), navigate to the bin directory and run searchctl startall.

Turning On Debug Mode

Debug mode is useful for troubleshooting. To turn on debug mode for the Oracle SES administration tool, update the search.properties file located in the $ORACLE_HOME/search/webapp/config directory. Set debug=true and restart the Oracle SES middle tier with searchctl restart.

To turn off debug mode when you are finished troubleshooting, set debug=false and restart the middle tier with searchctl restart.

Note:

$ORACLE_HOME represents the directory where Oracle SES was installed.

Debug information can be found in the OC4J log file: $ORACLE_HOME/oc4j/j2ee/OC4J_SEARCH/log/oc4j.log.

Monitoring Oracle Secure Enterprise Search

In a production environment, where a load balancer or other monitoring tools are used to ensure system availability, Oracle Secure Enterprise Search (SES) can also be easily monitored through the following URL: http://<host>:<port>/monitor/check.jsp. The URL should return the following message: Oracle Secure Enterprise Search instance is up.

Note:

This message is not translated to other languages, because system monitoring tools might need to byte-compare this string.

If Oracle SES is not available, then the URL returns either a connection error or the HTTP status code 503.
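A monitoring script can poll this URL and compare the response with the expected message. The following Python sketch shows one way to do so with the standard urllib.request module. It makes three assumptions that this guide does not state: a healthy instance answers with HTTP status 200, leading and trailing whitespace around the message can be ignored, and a five-second timeout is acceptable. The function names are hypothetical.

```python
import urllib.error
import urllib.request

# The untranslated message that the monitoring URL returns when the instance is up.
EXPECTED_MESSAGE = b"Oracle Secure Enterprise Search instance is up."


def is_up(status, body):
    """Byte-compare the response body with the expected message.

    Assumes status 200 for a healthy instance and ignores surrounding
    whitespace in the body.
    """
    return status == 200 and body.strip() == EXPECTED_MESSAGE


def check_instance(url, timeout=5.0):
    """Return True if the monitoring URL reports that the instance is up."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return is_up(response.status, response.read())
    except urllib.error.HTTPError:
        # For example, HTTP status code 503 when Oracle SES is not available.
        return False
    except OSError:
        # Connection errors and timeouts (URLError is a subclass of OSError).
        return False


# Example call; replace <host> and <port> with the values for your instance:
# check_instance("http://<host>:<port>/monitor/check.jsp")
```

Keeping the comparison in a separate function makes it possible to verify the check without a running instance.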

Integrating with Google Desktop for Enterprise

Oracle Secure Enterprise Search provides a plug-in (or connector) to integrate with Google Desktop for Enterprise (GDfE). You can include Google Desktop results in your Oracle SES hitlist. You can also link to Oracle SES from the GDfE interface.

See Also:

Google Desktop for Enterprise Readme at http://<host>:<port>/search/query/gdfe/gdfe_readme.html for details about how to integrate with GDfE

Accessing Application Server Control Console on Oracle SES

The Oracle Enterprise Manager 10g Application Server Control Console is a Web-based user interface that displays the current status of the Oracle SES middle tier. For example, the Home page shows a graph of the Response and Load, and the Performance page shows a graph of the Heap Usage.

The Application Server Control Console is installed and configured automatically with OC4J. Because the Oracle SES middle tier runs in the embedded standalone OC4J, the Application Server Control Console is started by default when Oracle SES is started.

To access the console, type the following URL in a Web browser:

http://<host>:<port>/em

where host and port are the host name and port running Oracle SES.

See Also:

Oracle Containers for J2EE Configuration and Administration Guide 10g (10.1.3.1.0)
The online help provided with Application Server Control Console for detailed instructions on using this interface