12 Administering Oracle SES Instances

This chapter provides information about tuning and general management of Oracle SES instances.

Performing Backup and Recovery

Oracle SES provides multiple methods for backup and recovery of the Oracle SES data in the event of a hardware failure. The backup and recovery process depends upon the mode of the Oracle SES installation.

Backup and Recovery for Oracle SES Installed with Database and Middle Tier

For Oracle SES installed along with the database and the middle tier, backup and recovery can be performed using the standard Oracle Recovery Manager (RMAN) utility. Refer to the Oracle Database Backup and Recovery User's Guide for more information.

Backup and Recovery for Oracle SES Installed on Existing Database or Middle Tier

For Oracle SES installed on an existing database or middle tier, backup and recovery must be done according to the standard backup and recovery procedure for Oracle Fusion Middleware. Refer to the Oracle Fusion Middleware Administrator's Guide for more information.

Post-recovery Steps for Oracle SES (for All Installation Modes)

The following additional Oracle SES-specific steps must be performed after completing the recovery process described in the preceding sections.

Update the WebLogic Server Connection Pool used by Oracle SES

You must update the WebLogic Server connection pool configuration only if the database connection string of the new database (used after recovery) differs from that of the old database (used before recovery).

To update the database connection string in the WebLogic Server connection pool: 

  1. Log in to the WebLogic Server Administration Console.

  2. Under the Change Center in the left panel, click Lock & Edit.

  3. Change the URL for SearchAdminDS data source:

    1. Under the Domain Structure in the left panel, select search_domain > Services > Data Sources. The Configuration tab for the Settings for SearchAdminDS is displayed in the main panel.

    2. In the Name column of the Data Sources table, click SearchAdminDS. The General tab of the Settings for SearchAdminDS is displayed.

    3. Select the Connection Pool tab.

    4. Enter the new database connection string in the URL field.

    5. Save the changes.

  4. Repeat the above steps for these data sources: SearchQueryDS, EssInternalDS, EssDS, EssXADS, mds-ESS_MDS_DS, and mds-owsm.

  5. Click Activate Changes under the Change Center in the left panel.

Update the Database Connection String in the Credential Storage Framework (CSF) used by the Crawler

You must update the database connection string in the Credential Storage Framework (CSF) used by the crawler only if the database connection string of the new database (used after recovery) differs from that of the old database (used before recovery).

To update the database connection string in the CSF: 

  1. Connect to the system where the WebLogic Server middle tier is installed.

  2. Go to ses_home/common/bin.

  3. Start the WebLogic Server Administration Scripting Tool by executing the command wlst.sh.

  4. Run the connect() command at the wls/offline> prompt. You will be prompted to enter the following values:

    • WebLogic Server user name

    • WebLogic Server password

    • WebLogic Server URL

  5. Run the following command at the wls:domain_name/serverConfig> prompt:

    updateCred(map="oracle.search",key="SEARCH_DATABASE",user="new_database_connection_string",password="search")
    

    where new_database_connection_string is the connection string URL for the new database.

    Note:

    The password value is not used in the above command. Use search as the password value.

Update the SEARCHSYS Schema Password in the Credential Storage Framework (CSF) used by the Crawler

You must update the SEARCHSYS schema password in the Credential Storage Framework (CSF) used by the crawler only if the database schema password of the new database (used after recovery) differs from that of the old database (used before recovery).

To update the SEARCHSYS schema password in the CSF: 

  1. Connect to the system where the WebLogic Server middle tier is installed.

  2. Go to ses_home/common/bin.

  3. Start the WebLogic Server Administration Scripting Tool by executing the command wlst.sh.

  4. Run the connect() command at the wls/offline> prompt. You will be prompted to enter the following values:

    • WebLogic Server user name

    • WebLogic Server password

    • WebLogic Server URL

  5. Run the following command at the wls:domain_name/serverConfig> prompt:

    updateCred(map="oracle.apps.security",key="FUSION_APPS_ECSF_SES_ADMIN-KEY", user="searchsys",password="password")
     
    

    where password is the new password for the SEARCHSYS schema. The Oracle SES credentials for connecting to the Oracle Database are obtained from a CSF map named oracle.apps.security with a key named FUSION_APPS_ECSF_SES_ADMIN-KEY.

Update the SEARCH_TOP and the Perl Command Locations used by Enterprise Scheduler Service (ESS) to Launch the SES Crawler

You must update the location of the Perl command only if it has changed after the recovery process. The default location of the Perl command is ses_home/perl/bin/perl.

To update the Perl command location: 

  1. Connect to the system where the WebLogic Server middle tier is installed.

  2. On Windows systems, set the ESS application path by running the essSetEnv.cmd script in the ses_home/bin directory.

  3. Navigate to the ses_home/common/bin directory.

  4. Start the WebLogic Server Administration Scripting Tool by executing the command wlst.sh.

  5. Run the connect() command at the wls/offline> prompt. You will be prompted to enter the following values:

    • WebLogic Server user name

    • WebLogic Server password

    • WebLogic Server URL

  6. Run the following command at the wls:domain_name/serverConfig> prompt to set the SEARCH_TOP location:

    wls:/>essManageRuntimeConfig("SearchEss","APP", operation="add",name="SEARCH_TOP",val="search_top_location")
    

    where search_top_location is the absolute file path of the Oracle SES 11.2.2.2 home directory (that is, the ses_home directory). On Windows systems, use double backslashes (\\) as path delimiters.

  7. Run the following command at the wls:domain_name/serverConfig> prompt to set the Perl command location:

    wls:/>essManageRuntimeConfig("SearchEss","ESS", operation="add",name="PerlCommand",val="perl_location")
    

    where perl_location is the location of the Perl command in the Oracle SES 11.2.2.2 instance. On Windows systems, use double backslashes (\\) as path delimiters.

Cold Backups

As an additional precaution to minimize downtime, you can perform a cold backup to back up all the data of an Oracle SES instance. To back up Oracle SES instance data, save a copy of the oracle_base, oraInventory, and oradata directories.

To perform a cold backup: 

  1. Shut down the Oracle SES instance (middle tier as well as database server).

  2. Log in to the computer as the root user or the administrator.

  3. Copy all the files under the Oracle Database directory oracle_base, the Oracle Inventory directory oraInventory, and the Oracle data storage directory oradata.

    These locations are specified during the Oracle SES installation. There are several ways to make a copy. For example, using the tar command:

    tar cvf ses_orabase.tar  {full path to Oracle base} 
    tar cvf ses_orahome.tar  {full path to Oracle home} 
    tar cvf ses_orainv.tar   {full path to oraInventory} 
    tar cvf ses_oradat.tar   {full path to oradata}
    

    For example, if the oradata location is oracle_base/oradata, then save a copy using the command:

    tar cvf ses_oradat.tar oracle_base/oradata
    
  4. Copy the .tar files created in step 3 to a safe location.

    Note:

    You can use any compression method to perform file backup. For example, you can zip the files.

To recover files from a cold backup: 

  1. Shut down the Oracle SES instance (middle tier as well as Database server).

  2. Restore all backed-up files. First decompress (untar) the files, then move them back to their original locations.

  3. Start the Oracle SES instance.

Tuning Crawl Performance

Your Web crawling strategy can be as simple as identifying a few well-known sites that are likely to contain links to most of the other intranet sites in your organization. You could test this by crawling these sites without indexing them. After the initial crawl, you have a good idea of the hosts that exist in your intranet. You could then define separate Web sources to facilitate crawling and indexing on individual sites.

However, the process of discovering and crawling your organization's intranet, or the Internet, is generally an interactive one characterized by periodic analysis of crawling results and modification to crawling parameters. For example, if you observe that the crawler is spending days crawling one Web host, then you might want to exclude crawling at that host or limit the crawling depth.

This section contains the most common things to consider to improve crawl performance:

See Also:

"Monitoring the Crawling Process" for more information on crawling parameters

Understanding the Crawler Schedule

Schedules define the frequency at which the Oracle SES index is updated with information about each source. This section describes characteristics of the Oracle SES crawler schedule.

  • The Failed Schedules section on the Home - General page lists all schedules that have failed. A failed schedule is one in which the crawler encountered an irrecoverable error, such as an indexing error or a source-specific login error, and cannot proceed. A failed schedule can leave only a partial collection and indexing of documents.

  • The smallest granularity of the schedule interval is one hour. For example, you cannot start a schedule at 1:30 am.

  • If a crawl takes longer to finish than the scheduled interval, then it starts again when the current crawl is done. Currently, there is no option to have the scheduled time automatically pushed back to the next scheduled time.

  • When multiple sources are assigned to one schedule, the sources are crawled one by one following the order of their assignment in the schedule.

  • The schedule starts crawling the assigned sources in the assigned order. Only one source is crawling under a schedule at any given time. If a source crawl fails, then the rest of the sources assigned after it are not crawled. The schedule does not restart. You must either resolve the cause of the failure and resume the schedule, or remove the failed source from the schedule.

  • There is no automatic e-mail notification of schedule success or failure.
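The ordering and failure behavior described above can be sketched in a few lines (an illustration only, not Oracle SES code): sources run one at a time in assignment order, and a failed source stops the remaining ones.

```python
# Sketch of schedule behavior: sources are crawled sequentially in
# assignment order; a failure skips everything assigned after it.
def run_schedule(sources, crawl_ok):
    crawled = []
    for source in sources:
        if not crawl_ok(source):
            return crawled, source  # failed source; rest are not crawled
        crawled.append(source)
    return crawled, None  # all sources crawled successfully

# Hypothetical source names; "files" fails its crawl here.
print(run_schedule(["web", "files", "mail"], lambda s: s != "files"))
# (['web'], 'files')
```

After resolving the failure you would resume the schedule, or remove the failed source from it, as described above.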

For more information about documents that the crawler does not index: 

  • Browse the crawler log in the Oracle SES Administration GUI. Select the Schedules subtab from the Home page, then click the Log File icon for the schedule.

  • In the Oracle SES Administration GUI, select the Statistics subtab from the Home page. Under Crawler Statistics, choose Problematic URLs. This page lists errors encountered during the crawling process and the number of URLs that caused each error.

Registering a Proxy

By default, Oracle SES is configured to crawl Web sites in the intranet, so no additional configuration is required. However, to crawl Web sites on the Internet (also referred to as external Web sites), Oracle SES needs the HTTP proxy server information.

To register a proxy: 

  1. On the Global Settings page under Sources, select Proxy Settings.

  2. Enter the proxy server name and port. Click Set Proxy.

  3. Enter the internal host name suffix under Exceptions, so that internal Web sites do not go through the proxy server. Click Set Domain Exceptions.

    To exclude the entire domain, omit http, begin with *., and use the suffix of the host name. For example, *.us.example.com or *.example.com. Entries without the *. prefix are treated as a single host. Use the IP address only when the URL crawled is also specified using the IP for the host name. They must be consistent.

  4. If the proxy requires authentication, then enter the proxy authentication information on the Global Settings - Authentication page.
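The exception-matching behavior described in step 3 can be sketched as follows. This is a hypothetical helper, not part of Oracle SES: entries beginning with *. match any host ending in that suffix, while other entries match a single host exactly.

```python
# Hypothetical sketch of proxy domain-exception matching.
def bypasses_proxy(host, exceptions):
    """Return True if host matches any domain-exception entry."""
    for pattern in exceptions:
        if pattern.startswith("*."):
            # Suffix match: "*.us.example.com" matches "web1.us.example.com"
            # (note: it does not match the bare "us.example.com" itself)
            if host.endswith(pattern[1:]):  # keep the leading dot
                return True
        elif host == pattern:
            # Entries without the "*." prefix are treated as a single host
            return True
    return False

exceptions = ["*.us.example.com", "internalhost"]
print(bypasses_proxy("web1.us.example.com", exceptions))  # True
print(bypasses_proxy("www.oracle.com", exceptions))       # False
```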

Checking Boundary Rules

The seed URL you enter when you create a source is turned into an inclusion rule. For example, if www.example.com is the seed URL, then Oracle SES creates an inclusion rule specifying that only URLs containing the string www.example.com are crawled.

However, suppose that the example Web site includes URLs starting with www.exa-mple.com or example.com (without the www). Many pages have a prefix on the site name. For example, the investor section of the site has URLs that start with investor.example.com.

In this case, you might add www.example.com, www.exa-mple.com, and investor.example.com to the inclusion rules. Or you might just add example.

Always check the inclusion rules before crawling, then check the log after crawling to see what patterns have been excluded.

To crawl outside the seed site (for example, if you are crawling text.us.xyz.com, but you want to follow links outside of text.us.xyz.com to xyz.com), consider removing the inclusion rules completely. Do so carefully. This action could lead the crawler into many, many sites.
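A minimal sketch of how substring-style inclusion rules behave (the helper is hypothetical; Oracle SES implements this internally): a URL passes when it contains at least one inclusion string, and removing all rules makes everything pass.

```python
# Hypothetical sketch of substring inclusion rules.
def passes_inclusion(url, inclusion_rules):
    # With no rules at all, crawling is unrestricted (use with care).
    if not inclusion_rules:
        return True
    return any(rule in url for rule in inclusion_rules)

rules = ["www.example.com", "www.exa-mple.com", "investor.example.com"]
print(passes_inclusion("http://investor.example.com/q1.html", rules))  # True
print(passes_inclusion("http://example.com/about.html", rules))        # False
# A broader rule such as "example" would admit both URLs.
```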

Notes for File Sources

  • If no boundary rule is specified, then crawling is limited by the underlying file system access privileges. Files accessible from the specified seed file URL are crawled, subject to the default crawling depth. The depth, which is 2 by default, is set on the Global Settings - Crawler Configuration page. For example, if the seed is file://localhost/home/user_a/, then the crawl picks up all files and directories under user_a with access privileges. It crawls documents in the directory /home/user_a/level1, but not those in the /home/user_a/level1/level2 directory, because they are at level 3, beyond the depth limit.

  • The file URL can be in UNC (universal naming convention) format. The UNC file URL has the following format for files located within the host computer:

    file://localhost///LocalComputerName/SharedFolderName

    For example, specify \\stcisfcr\docs\spec.htm as file://localhost///stcisfcr/docs/spec.htm

    where stcisfcr is the name of the host computer.

    The string localhost is optional. You can specify the URL path without the string localhost in the URL, in which case the URL format is:

    file:///LocalComputerName/SharedFolderName

    For example,

    file:///stcisfcr/docs/spec.htm

    Note that you cannot use the UNC format to access files on other computers.

  • On some computers, the path or file name could contain non-ASCII and multibyte characters. URLs are always represented using the ASCII character set. Non-ASCII characters are represented using the hex representation of their UTF-8 encoding. For example, a space is encoded as %20, and a multibyte character can be encoded as %E3%81%82.

    You can enter spaces in simple (not regular expression) boundary rules. Oracle SES automatically encodes these URL boundary rules. For example, Home Alone is specified internally as Home%20Alone. Oracle SES does this encoding for the following:

    • File source simple boundary rules

    • URL string tests

    • File source seed URLs

    Oracle SES does not alter regular expression rules. You must ensure that the regular expression rule specified is against the encoded file URL. Spaces are not allowed in regular expression rules.
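The encoding applied to simple boundary rules corresponds to standard percent-encoding of the UTF-8 bytes, which can be reproduced with Python's urllib (the helper name and the choice of safe characters are ours):

```python
from urllib.parse import quote

# Sketch of the encoding Oracle SES applies to simple file-source
# boundary rules: non-ASCII and special characters become the hex
# representation of their UTF-8 bytes.
def encode_boundary_rule(rule):
    # Keep "/" and ":" unescaped so file URL paths stay readable
    return quote(rule, safe="/:")

print(encode_boundary_rule("Home Alone"))  # Home%20Alone
print(encode_boundary_rule("あ"))          # %E3%81%82
```

When writing a regular expression rule, remember from the note above that the pattern must match this encoded form of the URL.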

Checking Dynamic Pages

Indexing dynamic pages can generate too many URLs. From the target Web site, manually navigate through a few pages to understand what boundary rules should be set to avoid crawling of duplicate pages.

Checking Crawler Depth

Setting the crawler depth very high (or unlimited) could lead the crawler into many sites. Without boundary rules, a crawler depth of 20 probably crawls the entire World Wide Web from most locations.
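For file sources, the depth semantics described earlier (a file directly under the seed directory is at level 1, each subdirectory adds one level) can be sketched with a hypothetical helper:

```python
from pathlib import PurePosixPath

# Hypothetical helper illustrating file-source crawl levels:
# /home/user_a/doc.txt is level 1, /home/user_a/level1/doc.txt is
# level 2, and so on. A depth limit of 2 would exclude level 3.
def crawl_level(seed_dir, path):
    rel = PurePosixPath(path).relative_to(PurePosixPath(seed_dir))
    return len(rel.parts)

seed = "/home/user_a"
print(crawl_level(seed, "/home/user_a/level1/doc.txt"))         # 2
print(crawl_level(seed, "/home/user_a/level1/level2/doc.txt"))  # 3
```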

Checking Robots Rule

You can control which parts of your sites can be visited by robots. If robots exclusion is enabled (the default), then the Web crawler traverses the pages based on the access policy specified in the Web server robots.txt file.

The following sample robots.txt file specifies that no robots visit any URL starting with /cyberworld/map/ or /tmp/ or /foo.html:

# robots.txt for http://www.example.com/
 
User-agent: *
Disallow: /cyberworld/map/ 
Disallow: /tmp/ 
Disallow: /foo.html

If the Web site is under your control, then you can tailor a specific robots rule for the crawler by specifying Oracle Secure Enterprise Search as the user agent. For example:

User-agent: Oracle Secure Enterprise Search
 
Disallow: /tmp/

The robots meta tag can instruct the crawler either to index a Web page or to follow the links within it. For example:

<meta name="robots" content="noindex,nofollow">
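The sample rules above can be checked with Python's standard robots.txt parser (for illustration only; Oracle SES implements its own robots handling):

```python
from urllib.robotparser import RobotFileParser

# Parse the sample robots.txt rules shown above and test two URLs.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /cyberworld/map/",
    "Disallow: /tmp/",
    "Disallow: /foo.html",
])

print(rp.can_fetch("Oracle Secure Enterprise Search",
                   "http://www.example.com/tmp/x.html"))   # False
print(rp.can_fetch("Oracle Secure Enterprise Search",
                   "http://www.example.com/index.html"))   # True
```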

Checking Duplicate Documents

Oracle SES always removes duplicate (identical) documents. Oracle SES does not index a page that is identical to one it has already indexed. Oracle SES also does not index a page that it reached through a URL that it has already processed.

With the Web Services API, you can enable or disable near duplicate detection and removal from the result list. Near duplicate documents are similar to each other. They may or may not be identical to each other.

Checking Redirected Pages

When a page is redirected, the crawler crawls only the redirected page. For example, a Web site might have JavaScript that redirects users to another site with the same title. In such cases, only the redirected site is indexed.

Check for inclusion rules from redirects. The inclusion rules are based on the type of redirect. The EQ_TEST.EQ$URL table stores all of the URLs that have been crawled or are scheduled to be crawled. There are three kinds of redirects defined in it:

  • Temporary Redirect: A redirected URL is always allowed if it is a temporary redirection (HTTP status code 302 or 307). Temporary redirection indicates that the original URL should still be used in the future. Temporary redirects cannot be identified from the EQ$URL table; you can find them only by examining the log file.

  • Permanent Redirect: For permanent redirection (HTTP status 301), the redirected URL is subject to boundary rules. Permanent redirection means that the original URL is no longer valid and the user should start using the new (redirected) one. In EQ$URL, an HTTP permanent redirect has the status code 954.

  • Meta Redirect: Metatag redirection is treated as a permanent redirect: it also has status code 954 and is always checked against boundary rules.

The STATUS column of EQ_TEST.EQ$URL lists the status codes. For descriptions of the codes, refer to Appendix B, "URL Crawler Status Codes."

Note:

Some browsers, such as Mozilla and Firefox, do not allow redirecting a page to load a network file. Microsoft Internet Explorer does not have this limitation.

Checking URL Looping

URL looping refers to the scenario where a large number of unique URLs all point to the same document. Looping sometimes occurs where a site contains a large number of pages, and each page contains links to every other page in the site. Ordinarily this is not a problem, because the crawler eventually analyzes all documents in the site. However, some Web servers attach parameters to generated URLs to track information across requests. Such Web servers might generate a large number of unique URLs that all point to the same document.

For example,

http://example.com/somedocument.html?p_origin_page=10

might refer to the same document as

http://example.com/somedocument.html?p_origin_page=13

but the p_origin_page parameter is different for each link, because the referring pages are different. If a large number of parameters are specified and if the number of referring links is large, then a single unique document could have thousands or tens of thousands of links referring to it. This is an example of how URL looping can occur.
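One way to see why these URLs all resolve to one document is to drop the tracking parameter and compare what remains. The canonicalizer below is purely illustrative (the parameter name comes from the example above; Oracle SES does not expose such a function):

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Hypothetical canonicalizer: remove known tracking parameters so
# looping URLs collapse to a single canonical form.
TRACKING_PARAMS = {"p_origin_page"}

def canonicalize(url):
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k not in TRACKING_PARAMS]
    return urlunparse(parts._replace(query=urlencode(kept)))

a = canonicalize("http://example.com/somedocument.html?p_origin_page=10")
b = canonicalize("http://example.com/somedocument.html?p_origin_page=13")
print(a == b)  # True: both reduce to the same document URL
```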

Monitor the crawler statistics in the Oracle SES Administration GUI to determine which URLs and Web servers are being crawled the most. If you observe an inordinately large number of URL accesses to a particular site or URL, then you might do one of the following:

  • Exclude the Web server: This prevents the crawler from crawling any URLs at that host. (You cannot limit the exclusion to a specific port on a host.)

  • Reduce the crawling depth: This limits the number of levels of referred links the crawler follows. If you are observing URL looping effects on a particular host, then you should take a visual survey of the site to find out an estimate of the depth of the leaf pages at that site. Leaf pages are pages that do not have any links to other pages. As a general guideline, add three to the leaf page depth, and set the crawling depth to this value.

Be sure to restart the crawler after altering any parameters. Your changes take effect only after restarting the crawler.

Increasing the Oracle Redo Log File Size

Oracle SES allocates 200MB of disk space for the redo log during installation, which is sufficient to crawl a relatively large number of documents, such as around 300GB of text. If you are going to install the Oracle SES application on an existing Oracle Database, then make sure that at least 200MB of disk space is allocated for the redo log.

However, if your disk has sufficient space to increase the redo log, and if you are going to crawl a very large number of documents, for example, more than 300GB of text, then increase the redo log file size for better crawl performance.

Note:

The biggest transaction during crawling is SYNC INDEX by Oracle Text. Check the AWR report or the V$SYSSTAT view to see the actual redo size during crawling. Roughly, 200MB is sufficient to crawl up to 300GB.

To increase the size of the redo log files: 

  1. Open SQL*Plus and connect as the SYSTEM user. It has the same password as that of the SEARCHSYS user.

  2. Issue the following SQL statement to see the current redo log status:

    SELECT vl.group#, member, bytes, vl.status 
        FROM v$log vl, v$logfile vlf 
        WHERE vl.group#=vlf.group#; 
     
    GROUP# MEMBER                                                 BYTES STATUS 
    ------ ------------------------------------------------- ---------- ---------- 
         3 /scratch/ses111/oradata/o11101/redo03.log         209715200 INACTIVE 
         2 /scratch/ses111/oradata/o11101/redo02.log         209715200 CURRENT 
         1 /scratch/ses111/oradata/o11101/redo01.log         209715200 INACTIVE 
    
  3. Drop the INACTIVE redo log file. For example, to drop group 3:

    ALTER DATABASE DROP LOGFILE group 3; 
     
    Database altered. 
    
  4. Create a larger redo log file with a command like the following. If you want to change the file location, specify the new location.

    ALTER DATABASE ADD LOGFILE '/scratch/ses111/oradata/o11101/redo03.log' 
         size 400M reuse; 
    
  5. Check the status to ensure that the file was created.

    SELECT vl.group#, member, bytes, vl.status 
         FROM v$log vl, v$logfile vlf 
         WHERE vl.group#=vlf.group#; 
     
    GROUP# MEMBER                                                  BYTES STATUS 
    ------ -------------------------------------------------- ---------- ---------- 
         3 /scratch/ses111/oradata/o11101/redo03.log           419430400 UNUSED 
         2 /scratch/ses111/oradata/o11101/redo02.log           209715200 CURRENT 
         1 /scratch/ses111/oradata/o11101/redo01.log           209715200 INACTIVE 
    
  6. A log file cannot be dropped while its status is CURRENT. Issue the following ALTER statement to switch logging to the new file, then check the results.

    ALTER SYSTEM SWITCH LOGFILE; 
     
    SELECT vl.group#, member, bytes, vl.status 
         FROM v$log vl, v$logfile vlf 
         WHERE vl.group#=vlf.group#; 
     
    GROUP# MEMBER                                                  BYTES STATUS 
    ------ -------------------------------------------------- ---------- ---------- 
         3 /scratch/ses111/oradata/o11101/redo03.log           419430400 CURRENT 
         2 /scratch/ses111/oradata/o11101/redo02.log           209715200 ACTIVE 
         1 /scratch/ses111/oradata/o11101/redo01.log           209715200 INACTIVE 
    
  7. Issue the following SQL statement to change the status of Group 2 from ACTIVE to INACTIVE:

    ALTER SYSTEM CHECKPOINT; 
     
    SELECT vl.group#, member, bytes,  vl.status 
         FROM v$log vl, v$logfile vlf 
         WHERE vl.group#=vlf.group#; 
     
    GROUP# MEMBER                                                  BYTES STATUS 
    ------ -------------------------------------------------- ---------- ---------- 
         3 /scratch/ses111/oradata/o11101/redo03.log           419430400 CURRENT 
         2 /scratch/ses111/oradata/o11101/redo02.log           209715200 INACTIVE 
         1 /scratch/ses111/oradata/o11101/redo01.log           209715200 INACTIVE 
    
  8. Repeat steps 3, 4, and 5 for redo log groups 1 and 2.

Increasing the Crawler Heap Size

You can specify a particular crawler heap size by updating the value of the -mx parameter of the CRAWLER_EXEC_PARAMS configuration setting in the ses_home/bin/clexecutor.sh file on a Linux or UNIX system, or the ses_home\bin\clexecutor.cmd file on a Windows system. For example, to specify a heap size of 1 GB, set the -mx parameter value to -mx1024m.
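The edit amounts to a simple substitution on the CRAWLER_EXEC_PARAMS line. In the sketch below, the sample line is hypothetical (the actual contents of the clexecutor script vary by release); only the -mx replacement pattern matters:

```python
import re

# Replace an existing -mx<N>m heap setting with a new size (in MB).
def set_crawler_heap(params_line, megabytes):
    return re.sub(r"-mx\d+m", "-mx%dm" % megabytes, params_line)

# Hypothetical CRAWLER_EXEC_PARAMS line for illustration.
line = 'CRAWLER_EXEC_PARAMS="-mx512m -classpath $CLASSPATH"'
print(set_crawler_heap(line, 1024))
# CRAWLER_EXEC_PARAMS="-mx1024m -classpath $CLASSPATH"
```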

Note:

There is no need to restart the Oracle SES server after changing the heap size in the clexecutor script file; subsequent crawl processes start using the updated heap size.

Increasing the Crawler Cache Queue Size

If the system on which the Oracle SES middle tier is installed has sufficiently large memory, then you can increase the size of the crawler cache queue, an in-memory queue used by the crawler to process documents. To increase the crawler cache queue size:

  1. Export the current crawler configuration settings using the searchadmin command. For example:

    searchadmin -c http://ses_server_host:ses_server_port/search/api/admin/AdminService -p ses_admin_password export crawlerSettings -o crawlerSettings.xml
    
  2. Open the crawlerSettings.xml file and set the required values for the minCacheQueue and maxCacheQueue elements, which specify the minimum and maximum sizes (in megabytes) of the crawler cache queue. For example, the following configuration specifies a minimum crawler cache queue size of 2 MB and a maximum of 100 MB:

    <search:minCacheQueue>2</search:minCacheQueue>
    <search:maxCacheQueue>100</search:maxCacheQueue>
    

    Note:

    The maximum crawler cache queue size limit is 100 MB.

  3. Apply the changes using the searchadmin command. For example:

    searchadmin -c http://ses_server_host:ses_server_port/search/api/admin/AdminService -p ses_admin_password update crawlerSettings --UPDATE_METHOD=overwrite -i crawlerSettings.xml
    

Note:

The minimum crawler cache queue size and maximum crawler cache queue size configuration settings are also logged in the crawler log file as shown in the following example:
2014-02-19 14:26:50.754 NOTIFICATION   Main Thread   EQG-30527 Caching queue high water mark = 100 MB
2014-02-19 14:26:50.754 NOTIFICATION   Main Thread   EQG-30528 Caching queue low water mark = 2 MB
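If you prefer to edit the exported file programmatically rather than by hand, the change in step 2 can be sketched with Python's standard XML library. The namespace URI below is a placeholder, not the real one; use the URI declared in your exported crawlerSettings.xml:

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI -- substitute the one from your exported file.
NS = "http://xmlns.example.com/search"
ns = {"search": NS}

# Minimal stand-in for an exported crawlerSettings.xml fragment.
doc = ET.fromstring(
    '<crawlerSettings xmlns:search="%s">'
    '<search:minCacheQueue>1</search:minCacheQueue>'
    '<search:maxCacheQueue>50</search:maxCacheQueue>'
    '</crawlerSettings>' % NS
)

doc.find("search:minCacheQueue", ns).text = "2"
doc.find("search:maxCacheQueue", ns).text = "100"  # 100 MB is the cap

print(doc.find("search:maxCacheQueue", ns).text)  # 100
```

After saving the modified file, apply it with the searchadmin update command shown in step 3.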

Increasing the Number of Crawler Cache Threads

If the system on which the Oracle SES middle tier is installed has a large number of CPUs, then you can increase the number of caching threads used to save documents to the database. To do so, update the value of the cachingThreads element in the crawler configuration XML file exported using the searchadmin command.

For example, the following configuration setting specifies five caching threads:

<search:cachingThreads auto="false"> 
  <search:numThreads>5</search:numThreads>
</search:cachingThreads>

Note:

Setting the number of caching threads to a very high value may cause contention of Oracle SES database resources. Oracle recommends that you do not increase the number of caching threads beyond 8.

What to Do Next

If you are still not crawling all the pages you think you should, then check which pages were crawled by doing one of the following:

  • Check the crawler log file

  • Create a search source group

To check the crawler log file: 

  1. On the home page, click the Schedules secondary tab to display the Crawler Schedules page.

  2. Click the Log File icon to display the log file for the source.

  3. To obtain the location of the full log, click the Status link. The Crawler Progress Summary and Log Files by Source section displays the full path to the log file.

To create a search source group: 

  1. On the Search page, click the Source Groups subtab.

  2. Click New to display Create New Source Group Step 1.

  3. Enter a name, then click Proceed to Step 2.

  4. Select a source type, then shuttle only one source from Available Sources to Assigned Sources.

  5. Click Finish.

To search the source group: 

  1. On any page, click the Search link in the top right corner to open the query application.

  2. Select the group name, then issue a search term to list the matches within the source.

  3. Select the group name, then click Browse to see a list of search groups:

    • The number after the group name identifies the number of browsed documents. Click the number to browse the search results.

    • Click the arrow before the group name to display a hierarchy of search results. The number of matches appears after each item in the hierarchy.

Tuning Search Performance

Oracle SES provides many features that optimize search performance. This section contains suggestions on how to improve the response time and throughput of Oracle SES, and identifies the most common ways to improve search quality.

Adding Suggested Links

Suggested links enable you to direct users to a designated Web site for particular query keywords. For example, when users search for "Oracle Secure Enterprise Search documentation" or "Enterprise Search documentation" or "Search documentation", you could suggest http://www.oracle.com/technology.

Suggested link keywords are rules that determine which suggested links are returned (as suggestions) for a query. A rule can include query terms and logical operators. For example, "secure AND search". With this rule, the corresponding suggested link is returned for the query "secure enterprise search", but it is not returned for the query "secure database".

The rule language used for the indexed queries supports the following operators:

Table 12-1 Suggested Link Keyword Operators

Operator    Example
---------   -----------
ABOUT       about(dogs)
AND         dog and cat
NEAR        dog ; cat
OR          dog or cat
PHRASE      dog sled
STEM        $dog
THESAURUS   SYN(dog)


Note:

Do not use special characters, such as #, $, =, and &, in keywords.

Suggested links appear at the top of the search result list. Oracle SES can display up to two suggested links for each query.

This feature is especially useful for providing links to important Web pages that are not crawled by Oracle Secure Enterprise Search. Add or edit suggested links on the Search - Suggested Links page in the Oracle SES Administration GUI.
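To illustrate how a keyword rule gates a suggested link, here is a hypothetical evaluator for the simplest case only (terms joined by AND). The real rules are evaluated by Oracle Text and support the full operator set in Table 12-1:

```python
# Hypothetical evaluator for AND-only suggested-link rules: the rule
# matches when every term appears as a word in the query.
def rule_matches(rule, query):
    terms = [t.strip().lower() for t in rule.split(" AND ")]
    words = query.lower().split()
    return all(t in words for t in terms)

print(rule_matches("secure AND search", "secure enterprise search"))  # True
print(rule_matches("secure AND search", "secure database"))           # False
```

This reproduces the behavior described above: the "secure AND search" rule returns its suggested link for "secure enterprise search" but not for "secure database".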

Authentication and Authorization

By tuning the security filter settings, you can prevent timeouts.

To change the configuration of the security filter: 

  1. Log in to the Oracle SES Administration GUI.

  2. Click the Global Settings tab, then Query Configuration.

  3. Scroll down to Security Filter Configuration and change these settings using the guidelines provided in the Help.

    • Security Filter Lifespan

    • Authentication Timeout

    • Authorization Timeout

    • Minimum Number of Threads

    • Maximum Number of Threads

You can further tune the security filter by using the Administration API to set the <search:preserveStaleSecurityFilterOnError> and <search:securityFilterRefreshWaitTimeout> parameters in the queryConfig object. For example, these settings allow an expired security filter to be used immediately when a fresh security filter is unavailable:

<search:preserveStaleSecurityFilterOnError>true
   </search:preserveStaleSecurityFilterOnError>
<search:securityFilterRefreshWaitTimeout>0
   </search:securityFilterRefreshWaitTimeout>

The security filter settings listed in step 3 are also parameters of the queryConfig object and can be modified using the API. See the Oracle Secure Enterprise Search Administration API Guide.
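Following the searchadmin conventions shown later in this chapter, a typical session for changing these parameters exports the queryConfig object, edits the exported XML to include the elements shown above, and then updates the object. The file name query.xml is illustrative:

```
export queryConfig --OUTPUT_FILE=query.xml
update queryConfig --INPUT_FILE=query.xml --UPDATE_METHOD=overwrite
```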

Parallel Querying and Index Partitioning

Parallel querying significantly improves search performance and facilitates searches of very large data sources. You can optimize query performance of large document sources by storing the crawler index in partitions distributed across several independent disks. Oracle SES then executes parallel sub-queries automatically against the partitions. Both I/O and CPU resources are used in parallel. The parallel query architecture is based on Oracle Database partitioning and enhancements in Oracle Text.

The Parallel Query feature is automatically supported by Oracle SES when the partitioning option is enabled. You can specify partitioning only during the Oracle SES installation. Partitions can be configured in Oracle SES only using the Administration API.

The default tablespaces for Oracle SES are SEARCH_DATA, SEARCH_INDEX, and SEARCH_TEMP, which can be used for partitioning.

Note:

To make the best use of the Parallel Query feature, Oracle recommends that you run Oracle SES on a server with a 4-Core CPU, with at least 8 GB of RAM and multiple fast disk drives.

To enable partitioning: 

  1. Acquire a license for the Oracle Partitioning option for Oracle Database.

  2. During the Oracle SES software installation, select the Oracle Partitioning Option setting. Oracle Database is then installed with the partitioning feature.

  3. Activate and configure partitions using the Administration API.

To use partitioned tablespaces in Oracle SES: 

  1. Create one or more ASSM (Automatic Segment Space Management) tablespaces using a tool such as Enterprise Manager.

  2. Open a searchadmin interactive session.

    See Also:

    "Opening an Interactive Session" in Oracle Secure Enterprise Search Administration API Guide
  3. Activate the partitionConfig object.

    See Also:

    Oracle Secure Enterprise Search Administration API Guide for more information about the partitionConfig object type.
  4. Register storage areas - Update the storageArea object to register the existing tablespaces as storage areas for use by Oracle SES. To make optimum use of the parallel query feature, you should distribute partitioned tablespaces across all the physical disks.

    See Also:

    Oracle Secure Enterprise Search Administration API Guide for more information about the storageArea object type.
  5. Configure partitions - Update the partitionConfig object with the above storageArea objects, partition attributes, and partition rules. A partition typically includes multiple storage areas.

  6. Create data sources and schedule them for crawling.

Example: Adding a Tablespace and Using it in a Partition Rule

This example registers a new tablespace for use by Oracle SES:

  1. Create a new ASSM tablespace. This example uses SQL to create a tablespace named NEW_ONE:

    CREATE TABLESPACE new_one DATAFILE '/ses_storage/new_one.dbf' 
       SIZE 8G REUSE AUTOEXTEND ON NEXT 2G MAXSIZE UNLIMITED
       EXTENT MANAGEMENT LOCAL SEGMENT SPACE MANAGEMENT AUTO;
    
  2. Open a searchadmin interactive session:

    $ searchadmin --CONNECTION=http://ses_host:ses_port/search/api/admin/AdminService
    
  3. Activate the partitionConfig object:

    activate partitionConfig
    
  4. Export the XML description of the partition configuration to a file named part.xml:

    export partitionConfig --OUTPUT_FILE=part.xml
    
  5. Create an XML file named search_data.xml and describe the NEW_ONE tablespace as an Oracle SES storage area, as shown here:

    <?xml version="1.0" encoding="UTF-8"?>
    <search:config xmlns:search="http://xmlns.oracle.com/search" productVersion="11.2.2.2.0">
       <search:storageAreas>
          <search:storageArea>
             <search:name>NEW_ONE</search:name>
             <search:description>Additional storage area</search:description>
             <search:usage>PARTITION</search:usage>
          </search:storageArea>
       </search:storageAreas>
    </search:config>
    
  6. Open part.xml in a text editor and edit the <search:ruleType> and <search:storageArea> elements as shown here. This example hashes all documents into two partitions: one partition in the SEARCH_DATA tablespace, and the other partition in the NEW_ONE tablespace.

    <?xml version="1.0" encoding="UTF-8"?>
    <search:config xmlns:search="http://xmlns.oracle.com/search" productVersion="11.2.2.2.0">
       <search:partitionConfig>
          <search:partitionRules>
             <search:partitionRule>
                <search:partitionValue>EQ_DEFAULT</search:partitionValue>
                <search:valueType>META</search:valueType>
                <search:ruleType>HASH</search:ruleType>
                <search:storageArea>SEARCH_DATA,NEW_ONE</search:storageArea>
             </search:partitionRule>
          </search:partitionRules>
       </search:partitionConfig>
    </search:config>
    
  7. Register the new storage area:

    create storageArea --NAME=new_one --INPUT_FILE=search_data.xml
    
  8. Update the partition configuration:

    update partitionConfig --INPUT_FILE=part.xml --UPDATE_METHOD=overwrite
    

Limitations of Configuring Partitions

The following are limitations when configuring partitions:

  • After the partitioning option is enabled, it cannot be disabled; consequently, parallel querying cannot be disabled either.

  • A partition cannot be dropped if it is in use.

  • When a data source is moved from one partition to another, the data source must be recrawled so that it can be reindexed.

  • You can use one of the following methods for updating a partition rule:

    • Update the rule and then recrawl the data source.

    • Delete the data source, delete and recreate the rule, recreate the data source, and recrawl the data source.

Managing Index Fragmentation

Index fragmentation management allows the search engine index to be updated while Oracle SES is executing searches. This is achieved by temporarily saving index changes to an in-memory index and periodically merging them with the larger disk-based search engine index. This reduces fragmentation and leads to faster response times. Index fragmentation management is implemented automatically on Oracle SES, but it can be tuned by configuring Oracle Text, where you can turn index fragmentation management on and off, and specify the frequency of index merges.

Optimizing the index also reduces fragmentation, and it can significantly increase the speed of searches. Schedule index optimization on a regular basis. Also, optimize the index after the crawler has made substantial updates or if fragmentation is more than 50%. Verify that index optimization is scheduled during off-peak hours. Optimization of a very large index could take several hours.

You can see the fragmentation level and run index optimization on the Global Settings - Index Optimization page in the Oracle SES Administration GUI. Index optimization has these options:

Do Not Run Optimization Longer Than

Specify a maximum duration for the index optimization process. The actual time taken for optimization does not exceed this limit, but it can be shorter. A longer optimization time results in a more optimized index. In this mode, the optimization process does not require a large amount of free disk space.

Until the Optimization is Finished

Specifies that the optimization continues until it is finished. Allowing the optimization to complete creates a more compact index and supports better performance than a partial optimization.

In this mode, Oracle SES creates a temporary copy of the index. The required disk space almost equals the current index size. If sufficient free disk space is not available, then the optimization fails. Use the appropriate SQL query shown here to estimate the minimum disk requirement:

  • Oracle SES Without Partitioning

    SELECT SUM(bytes)/1048576 AS "MBytes" 
       FROM dba_segments 
       WHERE segment_name IN ('DR$EQ$DOC_PATH_IDX$I','DR$EQ$DOC_PATH_IDX$X'); 
    
  • Oracle SES With Partitioning

    SELECT SUM(sz) AS "MBytes" 
       FROM 
       ( 
          SELECT MAX(bytes)/1048576 sz FROM dba_segments 
             WHERE segment_name LIKE 'DR#EQ$DOC_PATH_IDX$%I' 
       UNION 
          SELECT MAX(bytes)/1048576 sz FROM dba_segments 
             WHERE segment_name LIKE 'DR#EQ$DOC_PATH_IDX$%X' 
       ) ; 
    

These queries return an estimate of the minimum disk space needed for optimization. Oracle SES may require more disk space than this estimate.

After the optimization is complete, Oracle SES releases the disk space consumed during the optimization. The space can be used by future crawls or any activity that consumes disk space.

Adjusting the Indexing Parameters

To improve indexing performance, adjust the following parameters on the Global Settings - Set Indexing Parameters page of the Oracle SES Administration GUI:

Indexing Batch Size

When the crawled data in the cache directory reaches Indexing Batch Size, Oracle SES starts indexing. The bigger the batch size, the longer it takes to start indexing each batch. Only indexed data can be searched: Data in the cache cannot be searched. The default size is 250M.

Document fetching and indexing run concurrently. While indexing is running, the Oracle SES crawler continues to fetch documents and store them in the cache directory.

Indexing Memory Size

This is the upper limit of memory used for indexing before flushing the index to disk.

A large amount of memory improves indexing performance because it reduces I/O. It also improves query performance because the created index is less fragmented from the beginning, while a fragmented index can be optimized later. Set this parameter as high as possible without causing memory paging.

A smaller amount of memory might be useful when indexing progress should be tracked or when run-time memory is scarce. The default size is 275M. In general, increasing the Indexing Memory Size parameter can reduce fragmentation.

Parallel Indexing Degree

The number of concurrent threads used for indexing. This parameter is disabled in the current version of Oracle SES; it is always set to 1.

Checking the Search Statistics

See the Home - Statistics page in the Oracle SES Administration GUI for lists of the most popular queries, failed queries, and ineffective queries. This information can lead to the following actions:

  • Refer users to a particular Web site for failed queries on the Search - Suggested Links page.

  • Fix common errors that users make in searching on the Search - Alternate Words page.

  • Make important documents easier to find on the Search - Relevancy Boosting page.

Every hour, Oracle SES automatically summarizes logged queries. When there are a large number of logged queries, this summarization can consume significant server resources and degrade query performance. The effect is most visible during stress tests, where several queries are executed every second. In such cases, the best solution is to disable the query statistics option.

To disable the query statistics option: 

  1. On the home page, click Global Settings, then Query Configuration.

  2. In the Query Statistics section, select No for the Enable Query Statistics option.

Cleaning Up Search Statistics Data from the Database

Oracle SES stores the query log for seven days in the Database. If you want to clean up the query log from the Database immediately, then run the following SQL commands:

SQL> TRUNCATE TABLE searchsys.eq$statistic;
SQL> TRUNCATE TABLE searchsys.eq$user_clicks;
SQL> TRUNCATE TABLE searchsys.eq$sum_click_through;
SQL> TRUNCATE TABLE searchsys.eq$sum_query_stat;
SQL> TRUNCATE TABLE searchsys.eq$sum_stat_daily;
SQL> TRUNCATE TABLE searchsys.eq$sum_stat_failed;
SQL> TRUNCATE TABLE searchsys.eq$sum_stat_ineffect;
SQL> TRUNCATE TABLE searchsys.eq$sum_stat_popular;

Load Balancing on Oracle RAC

When Oracle SES is deployed in an Oracle Real Application Clusters (Oracle RAC) environment, the usage profile is typically one of the following:

  • Small index with a large query load

  • Large index with a small-to-large query load

A third option, a small index and a small query load, typically operates on a single computer.

Configuring a Small Index

The load balancing solutions provided by Oracle RAC and the WebLogic Server are sufficient for this type of Oracle SES deployment. Most or all of the index can reside in memory or the buffer cache. You only need to set up the listeners appropriately for Oracle SES.

To set up the listeners: 

  • Provide a local listener on each Oracle RAC instance.

  • Do not configure remote listeners.

  • Oracle recommends dedicated processes over shared processes.

Oracle RAC Deployment Best Practice

When Oracle SES is deployed in an Oracle RAC environment, perform the configuration described in this section for the best query performance.

Note:

This configuration is recommended if the Automatic Workload Repository (AWR) report shows any of the following statistics:
  • The Oracle RAC statistic for the estimated interconnect traffic shows a very high value.

  • The Oracle RAC cluster wait events are in the top 5 timed foreground events.

  • The Oracle RAC cluster wait class consumes a very large amount of database time.

Three Node Oracle RAC Environment

In a three-node Oracle RAC environment, use one Oracle RAC instance for the crawler and the remaining instances for the query application. Separating the crawler and the query application onto different Oracle RAC nodes increases query performance and reduces the Oracle RAC cache fusion overhead for the query application, without compromising high availability.

To implement this configuration, create two database services: one for the crawler and one for the query application. After creating these database services, configure the database connection details for both services.

WebLogic Search Server Configuration

Oracle SES is installed in a WebLogic domain. The default settings for stuck threads can result in slow query performance even under a moderate load.

To change the search server configuration: 

  1. Log in to the WebLogic Administration Console.

  2. In the left panel under Change Center, click Lock & Edit.

  3. In the left panel under Domain Structure, expand Environment and click Servers. The Summary of Servers page is displayed in the main panel.

  4. In the Name column, click search_server1. The Settings for search_server1 page is displayed.

  5. Select the Configuration tab.

  6. Configure these settings:

    • Stuck Thread Max Time: 3600

    • Stuck Thread Timer Interval: 1800

  7. Click Save.

  8. Repeat these steps for any other search server instances, such as search_server2.

  9. In the left panel under Change Center, click Activate Changes.

Database Initialization Parameters

To support a large number of simultaneous users, you may need to increase the values of these database initialization parameters:

  • PROCESSES

  • SESSIONS

  • OPEN_CURSORS

The crawler also uses several threads, and each thread uses several database connections. You can alter the number of crawler threads on the Home - Sources - Crawling Parameters page of the Oracle SES Administration GUI.

Use the combined estimate of concurrent user processes and crawler threads for the value of PROCESSES. Then modify SESSIONS to a compatible value, typically calculated as 1.1 * PROCESSES.
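The sizing rule above can be applied directly. The following sketch is not an Oracle-provided formula; it simply applies the 1.1 factor, rounding up:

```python
def recommended_sessions(processes: int) -> int:
    """Size SESSIONS at 1.1 x PROCESSES, rounded up.

    Integer arithmetic avoids floating-point rounding surprises.
    """
    return (processes * 11 + 9) // 10

# For example, with an estimate of 750 user processes plus 50 crawler
# connections, PROCESSES=800:
print(recommended_sessions(800))  # 880
```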

You can monitor the number of open cursors using the statistics stored in the V$SESSTAT dynamic performance view. If the number of open cursors for user sessions frequently approaches the maximum, then you can increase that number.
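For example, this generic Oracle query (not specific to Oracle SES) compares the highest current open-cursor count across sessions with the OPEN_CURSORS limit:

```sql
SELECT MAX(s.value) AS highest_open_cursors,
       p.value      AS open_cursors_limit
  FROM v$sesstat s, v$statname n, v$parameter p
 WHERE s.statistic# = n.statistic#
   AND n.name = 'opened cursors current'
   AND p.name = 'open_cursors'
 GROUP BY p.value;
```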

To change the database initialization parameters: 

  1. Open SQL*Plus and log in to Oracle Database as a privileged user, such as SYSTEM.

  2. For a list of all initialization parameters and their current settings, issue this SQL*Plus command:

    show parameters
    
  3. Issue ALTER SYSTEM commands, using values appropriate for your system, to change the value of the parameters. For example, this command sets PROCESSES to 800:

    ALTER SYSTEM SET processes=800 SCOPE=spfile;
    
  4. Restart Oracle Database for the new settings to take effect.

Relevancy Boosting

Relevancy boosting lets administrators influence the order of documents in the result list for a particular search. You might want to override the default results for the following reasons:

  • For a highly popular search, direct users to the best results

  • For a search that returns no results, direct users to some results

  • For a search that has no click-throughs, direct users to better results

In a search, each result is assigned a score that indicates how relevant the result is to the search; that is, how good a result it is. Sometimes you know the documents that are highly relevant to some search. For example, your company Web site could have a home page for XML (http://example.com/XML-is-great.htm), which you want to appear high in the results of any search for XML. You would boost the score of the XML home page to 100 for an XML search.

For searches that are not among the boosted queries, the document's score is computed as usual.

Two methods can help you locate URLs for relevancy boosting: locate by search and manual URL entry.

Relevancy boosting, like end user searching, is case-insensitive. For example, a document with a boosted score for Oracle is also boosted for oracle.

Increasing the JVM Heap Size

If you expect heavy loads on the Oracle SES server, then configure the Java Virtual Machine (JVM) heap size for better performance.

You can configure the JVM heap size using the WebLogic Server JVM configuration parameters. Refer to Oracle Fusion Middleware Performance and Tuning for Oracle WebLogic Server for more information.

Note:

The following are some of the important points to consider while configuring the JVM heap size:
  • The heap size should be set to about 50% of the generally consumed heap size under normal application load.

  • The heap size should not be too large; for example, its value should not be more than 2 GB. Setting the heap size to a very large value can severely degrade application performance because of the time that the Java garbage collection (GC) process spends managing it.
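One common way to set the heap for a WebLogic managed server is through the USER_MEM_ARGS environment variable read by the WebLogic Server start scripts. This is a generic WebLogic mechanism, not an Oracle SES-specific setting, and the values shown are illustrative:

```shell
# Add to DOMAIN_HOME/bin/setDomainEnv.sh before starting the managed server.
# Equal -Xms and -Xmx avoid heap resizing pauses; 2 GB matches the upper
# bound recommended in the note above.
USER_MEM_ARGS="-Xms2048m -Xmx2048m"
export USER_MEM_ARGS
```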

Increasing the Oracle Undo Space

Heavy query load should not coincide with heavy crawl activity, especially when there are large-scale changes on the target site. If it does, such as when a crawl is scheduled around the clock, then increase the size of the Oracle undo tablespace and the UNDO_RETENTION initialization parameter.
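For example, the following statement raises the undo retention target to three hours; the value is illustrative and is specified in seconds:

```sql
ALTER SYSTEM SET UNDO_RETENTION = 10800;
```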

See Also:

Oracle Database SQL Language Reference and Oracle Database Administrator's Guide on Oracle Technology Network for more information about increasing the Oracle undo space

Optimizing Query Application Performance

If you plan to use the Oracle SES default query user interface and have an Oracle Application Server Web Cache installation, then you can use its compression utility to compress the content that Oracle SES sends over the network. For example, the utility can compress results.jsp from 980K to 72K. Compression provides the greatest benefit to users connecting over the Internet.

Use these Web cache compression rules:

/search/search?(.*)
/search/results.jsp?(.*)

OracleAS Web Cache does not benefit custom querying applications.

Using Command Line Tools

You can use the searchctl command line tool for starting and stopping the Oracle SES instance, only when both the Database and the WebLogic Server middle tier are installed as part of the Oracle SES software installation.

For more information about various searchctl command parameters, refer to the section "Using searchctl Command".

Configuring Oracle Diagnostic Logging (ODL) for Oracle SES Server

You can configure Oracle Diagnostic Logging (ODL) for an Oracle SES instance using either Oracle Enterprise Manager (EM) or the WebLogic Scripting Tool (wlst.sh).

Note:

The ODL logging configuration described here applies to the Oracle SES managed server, not to the Oracle SES crawler. For information about the Oracle SES crawler log, refer to the section "Viewing Crawler Logs".

To configure ODL using Oracle Enterprise Manager: 

  1. Log on to the Oracle Enterprise Manager UI using the following URL:

    http://host:port/em
    

    where host and port are the host name and the port for the WebLogic Server middle tier.

  2. On the Login page, enter weblogic as the user name and the Oracle SES administrator password as the password.

  3. In the left panel, from the Farm_base_domain tree view, navigate to WebLogic Domain > domain_name > ses_managed_server. The value of domain_name is base_domain by default, and the value of ses_managed_server is search_server1 by default.

  4. Click the WebLogic Server list box, and navigate to Logs > Log Configuration.

  5. Under Logger Name, navigate to the tree node Root Logger > oracle > oracle.search.

  6. Select the required logging level from the Oracle Diagnostic Logging Level list.

  7. Click Apply to save the changes.

To configure ODL using the WebLogic Scripting Tool (wlst.sh): 

  1. Connect to the Linux system where the WebLogic Server middle tier is installed.

  2. Go to MW_HOME/oracle_common/common/bin.

  3. Run the wlst.sh command.

  4. At the wls:/offline> prompt, run the following command:

    connect('weblogic_username','weblogic_password','t3://localhost:7001')
    
  5. Run the following command to set the required log level:

    setLogLevel(target="ses_managed_server",logger="oracle.search", level="ODL_log_level")
    

    The ODL_log_level can have one of the following values:

    • SEVERE

    • WARNING

    • INFO

    • CONFIG

    • FINE

    • FINER

    • FINEST

Viewing Oracle SES Server Log Files

You can troubleshoot the run-time issues related to Oracle SES by viewing the server logs associated with the Oracle SES instance.

Oracle SES server log files

Refer to the Oracle SES server log files in case you encounter problems such as:

  • Oracle SES application hangs, that is, it does not respond to user actions

  • Oracle SES Administration GUI pages do not load

  • Oracle SES application throws a database exception

The Oracle SES server log files are stored in the directory: wls_domain_home/ses_domain_name/servers/search_server1/logs.

Oracle Enterprise Scheduler server log files

Refer to the Oracle Enterprise Scheduler server log files in case you encounter problems such as:

  • Crawler fails to launch

  • Oracle SES Administration GUI pages related to schedules do not load

The Oracle Enterprise Scheduler server log files are stored in the directory: wls_domain_home/ses_domain_name/servers/ess_server1/logs.

Note:

You can view all the stuck threads on the WebLogic Console at Servers > server name > Monitoring > Threads, where server name is search_server1 for Oracle SES server and ess_server1 for Oracle Enterprise Scheduler server.

Monitoring Oracle SES Instance

In a production environment, where a load balancer or other monitoring tools are used to ensure system availability, an Oracle SES instance can be monitored using the following URL:

http://host:port/monitor/check.jsp

The page should display the following message: Oracle Secure Enterprise Search instance is up.

This message is not translated to other languages because system monitoring tools might need to byte-compare this string.

If Oracle SES is not available, then the page displays either a connection error or the HTTP status code 503.
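A monitoring probe only needs to fetch the check page and byte-compare the status string. The following sketch is hypothetical (the function names and timeout are not part of Oracle SES); the status string is taken verbatim from this section:

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

UP_MESSAGE = "Oracle Secure Enterprise Search instance is up"

def ses_is_up_response(status_code: int, body: str) -> bool:
    """Interpret a response from the /monitor/check.jsp page.

    Healthy: HTTP 200 with the exact (untranslated) status string.
    Unavailable: HTTP 503, any other status, or a connection error.
    """
    return status_code == 200 and UP_MESSAGE in body

def check_instance(url: str, timeout: float = 5.0) -> bool:
    # e.g. url = "http://host:port/monitor/check.jsp"
    try:
        with urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
            return ses_is_up_response(resp.status, body)
    except (HTTPError, URLError, OSError):
        return False
```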

Integrating Oracle SES with Google Desktop

Oracle Secure Enterprise Search provides a plug-in to integrate with Google Desktop for Enterprise (GDfE). You can include Google Desktop results in your Oracle SES hit list. You can also link to Oracle SES from the GDfE interface.

See Also:

Google Desktop for Enterprise Plug-in Readme at http://host:port/search/query/gdfe/gdfe_readme.html

Accessing Oracle WebLogic Server Administration Console

The Oracle WebLogic Server Administration Console is a Web browser-based user interface that displays the current status of the Oracle SES middle tier.

The Administration Console is installed and configured automatically with WebLogic Server and can be accessed whenever WebLogic Server is running.

To access the Oracle WebLogic Server Administration Console:  

  1. Enter the following URL in a Web browser, replacing host:port with the host name and the port for the WebLogic Administration Console:

    http://host:port/console

  2. Log in as the weblogic user with your Oracle SES administrator password.

See Also:

http://docs.oracle.com/cd/E23943_01/apirefs.1111/e13952/core/index.html

for detailed documentation related to the Oracle WebLogic Server Administration Console.