Beta Draft

5 Oracle Secure Enterprise Search Advanced Information

This chapter contains the following topics:

Tips for Using Table Sources
Tips for Using File Sources
Tips for Using Mailing List Sources
Tips for Using OracleAS Portal Sources
Tips for Using User-Defined Sources
Tips for Using Federated Sources
Setting Up Secure Federated Search
Tips for Using Oracle Calendar Sources
Setting Up Secure Oracle Calendar Sources
Tips for Using Oracle Content Database Sources
Setting Up Secure Oracle Content Database Sources
Tuning Crawl Performance
Tuning Search Performance
Using Backup and Recovery
Integrating with Google Desktop for Enterprise
Monitoring Oracle Secure Enterprise Search
Turning On Debug Mode
Restarting Oracle Secure Enterprise Search After Rebooting

Tips for Using Table Sources

Oracle Secu re Enterprise Search can crawl table sources in an Oracle database. To crawl non-Oracle databases, you must create a view in an Oracle database on the non-Oracle table. Then create the table source on the Oracle view. Oracle SES accesses databases using database links.

Limitations with Table Sources

Oracle SES cannot crawl tables inside the Oracle SES database.
Only one table or view can be specified for each table source. If data from more than one table or view is required, then first create a single view that encompasses all required data.
Table column mappings cannot be applied to LOB columns.
The following data types are supported for table sources: BLOB, CLOB, CHAR, VARCHAR, VARCHAR2.

Limitations with Database Links

If the text column of the base table or view is of type BLOB or CLOB, then the table must have a ROWID column. A table or view might not have a ROWID column for various reasons, including the following:
- A view is comprised of a join of one or more tables.
- A view is based on a single table using a GROUP BY clause.
The best way to know if a table or view can be safely crawled by Oracle SES is to check for the existence of the ROWID column. To do so, run the following SQL statement against that table or view using SQL*Plus: SELECT MIN(ROWID) FROM <table or view name>;
The base table or view cannot have text columns of type BFILE or RAW.

Tips for Using File Sources

This section contains the following:

Crawling File Sources with Non-ASCII
Crawling File Sources with Symbolic Links
Crawling File URLs

Crawling File Sources with Non-ASCII

For file sources to successfully crawl and display multibyte environments, the locale of the machine that starts the Oracle SES server must be the same as the target file system. This way, the Oracle SES crawler can "see" the multibyte files and paths.

If the locale is different in the installation environment, then Oracle SES should be restarted from the environment with the correct locale. For example, for a Korean environment, either set LC_ALL to ko_KR or set both LC_LANG and LANG to ko_KR.KSC5601. Then run searchctl restartall from either a command prompt on Windows or an xterm on UNIX.

Crawling File Sources with Symbolic Links

When craw ling file sources on UNIX, the crawler will resolve any symbolic link to its true directory path and enforce the boun dary rule on it. For example, suppose directory /tmp/A has two children, B and C, where C is a link to /tmp2/beta. The crawl will have the following URLs:

/tmp/A
/tmp/A/B
/tmp2/beta
/tmp/A/C

If the boundary rule is /tmp/A, then /tmp2/beta will be excluded. The seed URL is treated as is.

Crawling File URLs

If a file URL is to be used "as is", without going through Oracle SES for retrieving the file, then "file" in the URL should be upper case "FILE". For example, FILE://localhost/... "As is" means that when a user clicks on the search link of the document, the browser will try to use the specified file URL on the client machine to retrieve the file. Without that, Oracle SES uses this file URL on the server machine and sends the document through HTTP to the client machine.

Tips for Using Mailing List Sources

The Oracle SES crawler is IMAP4 compliant. To crawl mailing list sources, you need an IMAP e-mail account. It is recommended to create an e-mail account that is used solely for Oracle SES to crawl mailing list messages. The crawler is configured to crawl one IMAP account for all mailing list sources. Therefore, all mailing list messages to be crawled must be found in the Inbox of the e-mail account specified on this page. This e-mail account should be subscribed to all the mailing lists. New postings for all the mailing lists will be sent to this single account and subsequently crawled.

Messages deleted from the global mailing list e-mail account are not removed from the Oracle SES index. In fact, the mailing list crawler itself will delete messages from the IMAP e-mail account as it crawls. The next time the IMAP account for mailing lists is crawled, the previous messages will no longer be there. Any new messages in the account will be added to the index (and also consequently deleted from the account). This keeps the global mailing list IMAP account clean. The Oracle SES index serves as a complete archive of all the mailing list messages.

Tips for Using OracleAS Portal Sources

An Or acleAS Portal source name cannot exceed 35 characters.
URL bound ary rules are not enforced for URL items. A URL item is the metadata that resides on the OracleAS Portal server. Oracle SES does not touch the display URL or the boundary rules for URL items.

Tips for Using User-Defined Sources

If a plug-in is to return file URLs to the crawler, then the file URLs must be fully qualified. For example, file://localhost/.

Also, if a file URL is to be used "as is" without going through Oracle SES for retrieving the file, then "file" in the URL should be upper case "FILE". For example, FILE://localhost/...

See Also:

"Crawling File URLs"

Tips for Using Federated Sources

Oracle SES provides the capability of searching multiple Oracle SES instances with their own document repositories and indexes. It provides a unified framework to search the different document repositories that are crawled, indexed, and maintained separately. Federated search allows a single query to be run across all Oracle SES instances. It aggregates the search results to show one unified result list to the user. User credentials are passed along with the query so that each remote (that is, slave) Oracle SES application can authenticate the user against its own document repository.

Create a federated source on the Home - Sources page of the Oracle SES administration tool.

Notes:

The Oracle SES federator caches the federator configuration (that is, all federation-related parameters including federated sources). As a result, any change in the configuration will take effect within 0 to 5 minutes.
Oracle SES supports 2-tier federated search. Federation of 3-tier or more is not currently supported.

See Also:

"Setting Up Secure Federated Search" if the federated source will be searching private content

Federated Search Characteristics

Federated search can improve performance by distributing query processing on multiple machines. It can be an efficient way to scale up search service by adding a cluster of Oracle SES instances.
The federated search performance depends on the network topology and throughput of the entire federated Oracle SES environment.

Federated Search Limitations

There is a size limit of 200KB for the cached documents existing on the remote Oracle SES instance to be displayed on the master node.
For infosource browse, if the source hierarchies for both local and federated sources under one source group start with the same top level folder, then only one of the hierarchies is available for browse.
On the master federated Oracle SES instance, there is no direct access to documents on the remote Oracle SES instance through the display URL in the search result list. Only the cached version of documents is accessible. Exception: There is direct access for Web source and OracleAS Portal source documents.

Setting Up Secure Federated Search

Secure federated search enables searching secure content across distributed Oracle SES instances. An end user is authenticated to the Oracle SES master instance. Along with querying the secure content in its own index, the master instance federates the query to each of the remote (that is, slave) Oracle SES instances on behalf of the authenticated end user. This mechanism necessitates propagation of user identity between the Oracle SES instances. In building a secure federated search environment, an important consideration is the secure propagation of user identities between the SES instances. This section explains how Oracle SES performs secure federation.

Federation Trusted Entities

When performing a secure search on a remote Oracle SES instance, the master instance must pass the identity of the logged in user to the remote instance. If the remote instance trusts the master instance, then the master instance can proxy as the end user. To esta blish this trust relationship, Oracle SES instances should exchange some secret. This secret is exchanged in the form of a trusted entity. A trusted entity consists of two values: entity name and entity password. Each Oracle SES instance can have one or more trusted entities that it can use to participate in secure federated search. (A trusted entity is also referred to as a proxy user.)

Create trusted entities on the Global Settings - Federation Trusted Entities page of Oracle SES administration tool.

An Oracle SES instance can connect to an identity management (IDM) system for managing users and groups. An IDM system can be an LDAP compliant directory, such as Oracle Internet Directory or Active Directory.

Each trusted entity can be authenticated by either an IDM system or by the Oracle SES instance directly, independent of an IDM system. For authentication by an IDM system, check the box Use Identity Plug-in for authentication when creating a trusted entity. In this case, the entity password is not required. This is useful when there is a user configured in the IDM system that can be used for proxy authentication. Make sure that the entity name is the name of the user that exists in the IDM system and is going to be used as the proxy user.

For authentication of the proxy user by Oracle SES, clear (uncheck) the box Use Identity Plug-in for authentication when creating a trusted entity. Then use any name and password pair to create a trusted entity.

To perform secure federated search, both Oracle SES instances involved in the federation must have identity plug-ins registered. The identity plug-ins may or may not talk to the same IDM system. Carefully specify the following parameters under the section Secure Federated Search when creating a federated source on the master Oracle SES instance:

Remote Entity Name: This is the name of the federation trusted entity on the remote Oracle SES instance provided by the administrator of the remote Oracle SES instance.
Remote Entity Password: This is the password of the federation trusted entity on the remote Oracle SES instance provided by the administrator of the remote Oracle SES instance.

Search User Attribute: This attribute identifies, and is used to authenticate, a user on the remote Oracle SES instance. This parameter is optional, except in the case where the master and remote SES instances use different authentication attributes to identify end users. The master and remote Oracle SES instances can use different authentication attributes to identify or authenticate end users. (For example, on the master instance, an end user can be identified by user name; on the remote instance, the end user can be identified by e-mail address.)

The identity plug-in registered on the master instance should be able to map the user identity to this attribute based on the authentication attribute used during the registration of the identity plug-in. If this attribute is not specified during creation of the federation source, then the user identity on the master instance is used to search on the remote Oracle SES instance.

Note:

If these parameters are not specified during the creation of the federated source, then the federated source is treated as a public source (that is, only public content is available to the search users).

Secure Oracle HTTP Server-Oracle SES channel: Because any Oracle HTTP Server can potentially connect to the AJP13 port on the Oracle SES instances and masquerade as a specific person, either the channel between the Oracle HTTP Server and the Oracle SES instance must be SSL-enabled or the entire Oracle HTTP Server and Oracle SES instance machines must be protected by a firewall.

Notes:

In a secure federated search environment, the master instance might or might not be using single sign-on (SSO). However, the Web service should not be behind SSO. The remote Oracle SES instance cannot use SSO to protect all content, but it can use SSO to protect private content.
Oracle strongly recommends that you SSL-protect the channel between Oracle HTTP Server and Oracle SES for secure content. The remote instance should be SSL-enabled, or you should be able to access the Web service using HTTPS.

Tips for Using Oracle Calendar Sources

Oracle Calendar sources are certified with Oracle Calendar release 10.1.2.

Oracle recommends creating one source group for archived calendar data and another source group for active calendar data. One Calendar connector instance for the archived source can run less frequently, such as every week or month. This source should cover all history. A separate connector instance for the active source can run daily for only the most recent period.

Setting Up Secure Oracle Calendar Sources

The Oracle SES instance and the Oracle Calendar instance must be connected to the same Oracle Internet Directory system. Follow these steps to set up a secure Oracle Calendar source:

Activate the Oracle Internet Directory identity plug-in for the Oracle Calendar instance. This is done on the Global Settings - Identity Management Setup page in the Oracle SES administration tool.
Use the following LDIF file to create an application entity for the plug-in. (An application entity is a data structure within LDAP used to represent and keep track of software applications accessing the directory via an LDAP client.)
```
$ORACLE_HOME/bin/ldapmodify -h oidHost -p OIDPortNumber -D OIDadmin -w password -f  calPlugin.ldif
```
Where $ORACLE_HOME is the Oracle Calendar infrastructure installation and calPlugin.ldif is the current directory.

This defines the entity that will be used for the plug-in: orclapplicationcommonname=ocscalplugin,cn=oses,cn=product, cn=oraclecontext. The entity will have the password welcome1.
1. See Also:
  Appendix D, "LDIF Files" to view the calPlugin.ldif file

Create a Calendar source on the Home - Sources page.

Table 5-1 Calendar Source Parameters

Parameter	Value
Calendar server	http://host name:port
Application entity name	name
Application entity password	welcome1
OID server hostname	host name
OID server port	389
OID server SSL port	636
OID server ldapbase	dc=us,dc=oracle,dc=com
OID login attribute	uid
User query	(objectclass=ctCalUser)
Past days	30
Future days	60
Rollover	true

Tips for Using Oracle Content Database Sources

Oracle Content Database and Oracle Content Services are the same product. This section uses the product name Oracle Content Database to mean Oracle Content Database and Oracle Content Services. Oracle Conte nt Database sources are certified with Oracle Content Database release 10.2 and Oracle Content Services release 10.1.2.3.

Limitations with Oracle Content Database Sources

Oracle SES currently does not index Oracle Content Database Categories; that is, custom metadata such as ProjectNumber, Client, or Project Manager.
The administrator account used by the Oracle Content Database source must have the ContentAdministrator role on the site that is being crawled and indexed. Also, end-users searching documents in Oracle Content Database must have the GetContent and GetMetadata permissions.
By default, Oracle Content Database has a limit of three concurrent requests (simultaneous operations) for each user. However, Oracle SES has a default of five concurrent crawler threads. When crawling Oracle Content Database, only three of the five threads can successfully crawl, which causes the crawl to fail.

Workaround: For an Oracle Content Database source, change the Number of Crawler Threads on the Home - Sources - Crawling Parameters page to a value less than or equal to three.

Or, modify the Oracle Collaboration Suite configuration in Oracle Enterprise Manager to allow more than three concurrent requests. For example:
1. Access the Enterprise Manager page for the Collaboration Suite Midtier. For example: http://machine.domain:1156/.
2. Click the Oracle Collaboration Suite midtier standalone instance name. For example: ocsapps.machine.domain.
3. In the System Components table, click Content.
4. From Administration, click Node Configurations.
5. In the Node Configurations table, click HTTP_Node. For example: ocsapps.machine.domain_HTTP_Node.
6. On Properties, change the value for Maximum Concurrent Requests Per User. Enter a value larger than or equal to the number of crawling threads used by Oracle SES. This value is listed on the Global Settings - Crawler Configuration page.

Setting Up Secure Oracle Content Database Sources

The Oracle SES instance and the Oracle Content Database instance must be connected to the same Oracle Internet Directory system. The groups in Oracle Content Database must also be synchronized with Oracle Internet Directory. Follow these steps to set up a secure Oracle Content Database source:

Read "Limitations with Oracle Content Database Sources" and confirm that the number of crawler threads does not exceed the available per-user concurrent connection settings in Oracle Content Database
Activate the Secure Enterprise Search Group Agent. Oracle Content Database uses this agent to synchronize groups into Oracle Internet Directory, so that they can be used by Oracle SES.

This agent is deactivated by default. Activate it by modifying the node configuration that corresponds to the node where you want to run the agent. To do this, follow these steps:
1. Connect to the Oracle Collaboration Suite Control and go to the Content Database Home page. From a Web browser, connect to the Enterprise Manager port, which is typically http://servername:1156. Log on as ias_admin with the password provided during the Oracle Collaboration Suite or Oracle Application Server installation. Choose the correct cluster name (which will be for APPS, not INFRASTRUCTURE). The names vary depending the installation. Under System Components, choose Content.
2. In the Administration section, click Node Configurations.
3. On the Node Configurations page, click the name of the node configuration you want to change.
4. In the Servers section, click Activate/Deactivate.
5. Move the Secure Enterprise Search Group Agent from the Inactive Servers list to the Active Servers list.
6. Click OK on the Activate/Deactivate Servers page.
7. Click OK on the Edit Node page.
8. Return to the Content Database Home page and restart the node.
  
  The crawler authenticates as an administrator user who has privilege to read all contents of all folders in the Oracle Content Database repository. This uses the service-to-service mechanism of passing in a trusted application entity name and password along with the admin user name.
Activate the Oracle Internet Directory identity plug-in for the Oracle Content Database instance. This is done on the Global Settings - Identity Management Setup page in the Oracle SES administration tool.
Use the following LDIF file to create an application entity for the plug-in. (An application entity is a data structure within LDAP used to represent and keep track of software applications accessing the directory via an LDAP client.)
```
$ORACLE_HOME/bin/ldapmodify -h oidHost -p OIDPortNumber -D OIDadmin -w password -f  csPlugin.ldif
```
Where $ORACLE_HOME is the Oracle Content Database infrastructure installation and csPlugin.ldif is the current directory.

This defines the entity that will be used for the plug-in: orclapplicationcommonname=ocscsplugin, cn=ifs,cn=products,cn=oraclecontext. The entity will have the password welcome1.

See Also:
Appendix D, "LDIF Files" to view the csPlugin.ldif file

Create an Oracle Content Database source on the Home - Sources page.

Table 5-2 Content Database Source Parameters

Parameter	Value
Oracle Content Database URL	http://host name:port
Starting paths	/
Depth	-1
Oracle Content Database admin user	orcladmin
Entity name	`orclapplicationcommonname=ocscsplugin, cn=ifs,cn=products,cn=oraclecontext`
Entity password	welcome1
LDAP group base	`cn=CSGroups,cn=groups,dc=us,dc=oracle,dc=com`
Crawl only	false
Use e-mail for authorization	false

Tuning Crawl Performance

Your Web crawling strategy can be as simple as identifying a few well-known sites that are likely to contain links to most of the other intranet sites in your organization. You could test this by crawling these sites without indexing them. After the initial crawl, you have a good idea of the hosts that exist in your intranet. You could then define separate Web sources to facilitate crawling and indexing on individual sites.

However, the process of discovering and crawling your organization's intranet, or the Internet, is generally an interactive one characterized by periodic analysis of crawling results and modification to crawling parameters. For example, if you observe that the crawler is spending days crawling one Web host, then you might want to exclude crawling at that host or limit the crawling depth.

This section contains the most common things to consider to improve crawl performance:

Register a Proxy
Check Boundary Rules
Check Dynamic Pages
Check Crawler Depth
Check Robots.txt Rule
Check Duplicate Pages
Check Redirected Pages
Check URL Looping
What to do Next

See Also:

"Monitoring the Crawling Process" for more information on crawling parameters

Register a Proxy

By default, Oracle SES is configured to crawl Web sites in the intranet. In other words, crawling internal Web sites requires no additional configuration. However, to crawl Web sites on the Internet (also referred to as external Web sites), Oracle SES needs the HTTP proxy server information. See the Global Settings - Proxy Settings page.If the proxy requires authentication, then enter the proxy authentication information on the Global Settings - Authentication page.

Check Boundary Rules

The seed URL you enter when you create a source is turned into an inclusion rule. For example, if w ww.example.com is the seed URL, then Oracle SES creates an inclusion rule that only URLs containing the string www.example.com will be crawled.

However, suppose that the example Web site includes URLs starting with www.exa-mple.com or ones that start with example.com (without the www). Many pages have a prefix on the site name. For example, the investor section of the site has URLs that start with investor.example.com.

Always check the inclusion rules before crawling, then check the log after crawling to see what patterns have been excluded.

In this case, you might add www.example.com, www.exa-mple.com, and investor.example.com to the inclusion rules. Or you might just add example.

To crawl outside the seed site (for example, if you are crawling text.us.oracle.com, but you want to follow links outside of text.us.oracle.com to oracle.com), consider removing the inclusion rules altogether. Do so carefully. This could lead the crawler into many, many sites.

Notes for File Sources

For file sources, if no boundary rule is specified, then crawling is limited to the underlying file system access privileges. Files accessible from the specified seed file URL will be crawled, subject to the default crawling depth. The depth, which is 2 by default, is set on the Global Settings - Crawler Configuration page. For example, if the seed is file://localhost/home/user_a/, then the crawl will pick up all files and directories under user_a with access privileges. It will crawl any documents in the directory /home/user_a/level1 due to the depth limit. The documents in the /home/user_a/level1/level2 directory are at level 3.
The file URL can be of UNC (universal naming convention) format. The UNC file URL has the following format: file://localhost///<LocalMachineName>/<SharedFolderName>.

For example, \\stcisfcr\docs\spec.htm should be specified as file://localhost///stcisfcr/docs/spec.htm.
On some machines, the path or file name could contain non-ASCII and multibyte characters. URLs are always represented using the ASCII character set. Non-ASCII characters are represented using the hex representation of their UTF-8 encoding. For example, a space is encoded as %20, and a multibyte character can be encoded as %E3%81%82.

For file sources, spaces can be entered in simple (not regular expression) boundary rules. Oracle SES automatically encodes these URL boundary rules. If (Home Alone) is specified, then internally it is stored as (Home%20Alone). Oracle SES does this encoding for the following:
- File source simple boundary rules
- Test URL strings
- File source seed URLs

Note:

Oracle SES does not alter the rule if it is a regular expression rule. It is the administrator's responsibility to make sure that the regular expression rule specified is against the encoded file URL. Spaces are not allowed in regular expression rules.

Check Dynamic Pages

Indexing dynamic pages can generate an excessive number of URLs. From the target Web site, manually navigate through a few pages to understand what bound ary rules should be set to avoid crawling identical pages.

Check Crawler Depth

Setting the crawler depth very high (or unlimited) could lead the crawler into many sites. Without boundary rules, 20 will probably crawl the whole WWW from most locations.

Check Robots.txt Rule

You can control which parts of your sites can be visited by robots. If robots exclusion is enabled (default), then the Web crawler traverses the pages based on the access policy specified in the Web server robots.txt file.

The following sample /robots.txt file specifies that no robots should visit any URL starting with /cyberworld/map/ or /tmp/ or /foo.html:

# robots.txt for http://www.example.com/
 
User-agent: *
Disallow: /cyberworld/map/ 
Disallow: /tmp/ 
Disallow: /foo.html

If the Web site is under the user's control, then a specific robots rule can be tailored for the crawler by specifying the Oracle SES crawler plug-in name "User-agent: Oracle Secure Enterprise Search." For example:

User-agent: Oracle Secure Enterprise Search
 
Disallow: /tmp/

The robots meta tag can instruct the crawler to either index a Web page or follow the links within it. For example:

<meta name="robots" content="noindex,nofollow">

Check Duplicate Pages

If Oracle SES thinks a page is identical to one it has seen before, then it will not index it. If the page is reached through a URL that Oracle SES has already processed, then it will not index that either.

Check Redirected Pages

The crawler crawls only redirected pages. For example, a Web site might have Javascript redirecting users to another site with the same title. Only the redirected site is indexed.

Check for inclusion rules from redirects. This is based on type of redirect. There are three kinds of redirects defined in EQ$URL:

Temporary Redirect: A redirected URL is always allowed if it is a temporary redirection (HTTP status code 302, 307). Temporary redirection is used for whatever reason that the original URL should still be used in the future. It's not possible to find out temporary redirect from EQ$URL table other than filtering out the rest from the log file.
Permanent Redirect: For permanent redirection (HTTP status 301), the redirected URL is subject to boundar y rules. Permanent redirection means the original URL is no longer valid and the user should start using the new (redirected) one. In EQ$URL, HTTP permanent redirect has the status code 954
Meta Redirect: Metatag redirection is treated as a permanent redirect. Meta redirect has status code 954. This is always checked against boundary rules.

Check URL Looping

URL looping refers to the scenario where a large number of unique URLs all point to the same document. One particularly difficult situation is where a site contains a large number of pages, and each page contains links to every other page in the site. Ordinarily this would not be a problem, because the crawler eventually analyzes all documents in the site.

However, some Web servers attach parameters to generated URLs to track information across requests. Such Web servers might generate a large number of unique URLs that all point to the same document.

For example, http://example.com/somedocument.html?p_origin_page=10 might refer to the same document as http://example.com/somedocument.html?p_origin_page=13 but the p_origin_page parameter is different for each link, because the referring pages are different. If a large number of parameters are specified and if the number of referring links is large, then a single unique document could have thousands or tens of thousands of links referring to it. This is an example of how URL looping can occur.

Monitor the crawler statistics in the Oracle SES administration tool to determine which URLs and Web servers are being crawled the most. If you observe an inordinately large number of URL accesses to a particular site or URL, then you might want to do one of the following:

Exclude the Web Server: This prevents the crawler from crawling any URLs at that host. (You cannot limit the exclusion to a specific port on a host.)
Reduce the Crawling Depth: This limits the number of levels of referred links the crawler will follow. If you are observing URL looping effects on a particular host, then you should take a visual survey of the site to find out an estimate of the depth of the leaf pages at that site. Leaf pages are pages that do not have any links to other pages. As a general guideline, add three to the leaf page depth, and set the crawling depth to this value.

Be sure to restart the crawler after altering any parameters. Your changes take effect only after restarting the crawler.

What to do Next

If you are still not crawling all the pages you think you should, then check which pages were crawled by doing one of the following:

Check the crawler log file. (There's a link on the Home - Schedules page and the location of the full log on the Home - Schedules - Status page.)
Create a search source group. (Search - Source Groups - Create New Source Group) Put only one source in the group. From the Search page, search that group. (Click the group name on top of the search box.) Or, from the Search page, click Browse Search Groups. Click the group name for a hierarchy. You could also click the number next to the group name for a list of the pages crawled.

Tuning Search Performance

This section contains suggestions on how to improve the response time and throughput performance of Oracle SES.

This section contains the most common things to consider to improve search performance:

Optimize the Index
Increase the Size of the Indexing Batch Size
Check the Search Statistics
Increase the JVM Heap Size
Increase the Oracle Undo Space

Optimize the Index

Opti mizing the index reduces fragmentation, and it can significantly increase the speed of searches. Schedule index optimization on a regular basis. Also, optimize the index after the crawler has made substantial updates or if fragmentation is more than 50%. Make sure index optimization is scheduled during off-peak hours. Optimization of a very large index could take several hours.

See the fragmentation level and run index optimization on the Global Settings - Index Optimization page in the administration tool.

Increase the Size of the Indexing Batch Size

The data in the cache directory continues to accumulate until it reaches the indexing batch size. When the size is reached, the data is indexed. The bigger the batch size, the less fragmentation in the index. However, the bigger the batch size, the longer it will take to index each batch. Only indexed data can be searched: data in the cache cannot be searched.

Set the indexing batch size on the Global Settings - Crawler Configuration page in the administration tool.

Check the Search Statistics

See the Home - Statistics page in the administration tool for lists of the most popular queries, failed queries, and ineffective queries. This information can lead to the following actions:

Refer users to a particular Web site for failed queries on the Search - Suggested Links page.
Fix common errors that users make in searching on the Search - Alternate Words page.
Make important documents easier to find on the Search - Relevancy Boosting page.

Relevancy Boosting

Relevancy boosting lets administrators influence the order of documents in the result list for a particular search. You might want to override the default results for the following reasons:

For a highly popular search, direct users to the best results
For a search that returns no results, direct users to some results
For a search that has no click-throughs, direct users to better results

In a search, each result is assigned a score that indicates how relevant the result is to the search; that is, how good a result it is. Sometimes there are documents that you know are highly relevant to some search. For example, your company Web site could have a home page for XML (http://example.com/XML-is-great.htm), which you want to appear high in the results of any search for "XML". You would boost the score of that home page (http://example.com/XML-is-great.htm) to 100 for an "XML" search.

There are two methods for locating URLs for relevancy boosting: locate by search or manual URL entry.

Note:

The document still has a score computed if you enter a search that is not one of the boosted queries.

With relevancy boosting, comparison of the user's query against the boosted queries uses exact string matching. This means that the comparison is case-sensitive and space-aware. Therefore, a document with a boosted score for "Enterprise Search" is not boosted when you enter "search".

Increase the JVM Heap Size

If you expect heavy load on the Oracle SES server, then configure the J ava virtual machine (JVM) heap size for better performance.

The heap size is defined in the $ORACLE_HOME/search/config/searchctl.conf file. By default, the following values are given:

max_heap_size = 1024 megabytes

min_heap_size = 512 megabytes

Increase the value of these parameters appropriately. The max size should not exceed the physical memory size. Then restart the mid-tier with searchctl restart.

Increase the Oracle Undo Space

Heavy query load should not coincide with heavy crawl activity, especially when there are large-scale changes on the target site. If it does, for example when the crawl needs be scheduled around-the-clock, then increase the size of the Oracle undo tablespace with the UNDO_RETENTION parameter.

Using Backup and Recovery

A backup is a copy of configuration data that can be used to recover your configuration settings after a hardware failure. When a backup is performed on the Global Settings - Configuration Backup and Recovery page, Oracle SES copies the data to the binary metaData.bkp file. The location of that file is provided on the Global Settings - Configuration Data Backup and Recovery page. When the backup successfully completes, you must copy this file to a different host. You should backup after making configuration data changes, such as creating or editing sources.

Recovery can only be performed on a fresh installation. When the installation completes, copy the metaData.bkp file to the location provided in the administration tool. Sources need to be crawled again to see search results.

Some notes about backup and recovery:

You must stop all running schedules before doing the backup.
Secure search does not need to be re-enabled after recovery. If secure search is enabled in the backup instance, you do not need to re-register or re-activate the identity plug-in after recovery.

In 10.1.7, neither re-activation nor re-registration of the identity plug-in is required. If a plug-in was active when the instance was backed up, the same plug-in will be activated in the recovered instance, using the same parameters.
If you have file or table sources residing on the same machine as the one running Oracle SES, and if you intend to use a different machine for recovery, then you must use the actual host name (not localhost) when creating the sources.
For database table sources, confirm that the remote tables exist.
For file sources, confirm that files and paths are valid after recovery.
During recovery, the mail archive directory settings for existing mailing list and e-mail sources is changed. After recovery, the location will be <cache-dir>/mail, which is the default for new e-mail and mailing list sources. Any customized directory locations prior to recovery will be lost.

Integrating with Google Desktop for Enterprise

Oracle Secure Enterprise Search provides a plug-in (or connector) to integrate with Google Desktop for Enterprise (GDfE). You can include Google Desktop results in your Oracle SES hitlist. You can also link to Oracle SES from the GDfE interface.

See Also:

Google Desktop for Enterprise Readme at http://host:port/search/query/gdfe/gdfe_readme.html for details about how to integrate with GDfE

Monitoring Oracle Secure Enterprise Search

In a production environment, where a load balancer or other monitoring tools are used to ensure system availability, Oracle Secure Enterprise Search (SES) can also be easily monitored through the following URL: http://<host>:<port>/monitor/check.jsp. The URL should return the following message: Oracle Enterprise Search instance is up.

Note:

This message is not translated to other languages, because system monitoring tools might need to byte-compare this string.

If Oracle SES is not available, then the URL returns either a connection error or the HTTP status code 503.

Turning On Debug Mode

Debug mode is useful for troubleshooting purposes. To turn on debug mode for Oracle SES administration tool, update the search.properties file located in the $ORACLE_HOME/search/webapp/config directory. Set debug=true and restart the Oracle SES middle tier with searchctl restart.

To turn off debug mode when you are finished troubleshooting, set debug=false and restart the middle tier with searchctl restart.

Note:

$ORACLE_HOME represents the directory where Oracle SES was installed.

Debug information can be found in the OC4J log file: $ORACLE_HOME/oc4j/j2ee/OC4J_SEARCH/log/oc4j.log.

Restarting Oracle Secure Enterprise Search After Rebooting

The tool for starting and stopping the search engine is searchctl. To restart Oracle SES (for example, after rebooting the host machine), navigate to the bin directory and run searchctl startall.

Note:

Users are prompted for a password when running searchctl commands on UNIX platforms. No password is required on Windows platforms. This is because Oracle SES installation on Windows requires a user with administrator privileges. When running commands to start or stop the search engine, no password is required as long as the user is a member of the administrator group.

See Also:

Startup / Shutdown lesson in the Oracle SES admin tutorial: http://st-curriculum.oracle.com/tutorial/SESAdminTutorial/index.htm