Oracle® Secure Enterprise Search Administrator's Guide
11g Release 1 (11.1.2.0.0)
Part Number E14130-04

12 Administering Oracle SES Instances

This chapter provides information about tuning and general management of Oracle SES instances.

Managing Disk Space Usage

When you install Oracle SES, various components of Oracle SES consume space on the server.

You can set a space usage quota for these components using the Admin API. Of these components, the data storage location consumes the most disk space over time. To address this, Oracle SES sets a default space quota for the data storage location during installation. The default space quota is the space initially allocated by Oracle SES during installation, plus half of the free disk space available at the time of installation. For example, if the available free disk space at installation time is 350 GB, then Oracle SES allocates 175 GB of disk space to the data storage location.

Note that you can use the storageArea API to modify the default space quota for the data storage location. Using this API, you can also remove the quota, in which case the data storage location can use all of the free disk space. However, if Oracle SES uses up the entire disk space, then the crawler fails and the Oracle SES instance crashes. It is not possible to recover from this state even by freeing up space. Hence, Oracle recommends that you do not remove the preset quota.

Oracle SES calculates the space usage of all the storage components on a periodic basis. You can define the frequency of these periodic checks; for example, you can make it an hourly, daily, or weekly task depending on your usage. Additionally, you can calculate the space usage at any time by calling the spaceCalculator Admin API.

A variety of tasks, such as crawling, index optimization, metadata backup, and auto merge, consume space. When the space usage reaches 80% of the defined space usage quota, Oracle SES raises a warning. When the usage exceeds the quota, Oracle SES raises an alert, stops space-consuming activities such as running crawler schedules, and disables other space-consuming tasks.

After you free up disk space or increase the space usage quota, you can resume all of the disrupted activities, including stopped On-Demand crawler schedules and other disabled tasks. To do this, call the task Admin API resumeAllSpaceConsumingTasks. If you do not call this API, then Oracle SES cannot restart any of the stopped activities, even after free space becomes available.

To perform space management tasks, you must use the Admin API. See Oracle Secure Enterprise Search Administration API Guide for more information.
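
The following sketch shows how such calls can look with the searchadmin command-line tool, which is documented in the Oracle Secure Enterprise Search Administration API Guide and also shown later in "Configuring a Partition". The operation names, flags, and file paths here are illustrative assumptions rather than confirmed syntax; verify them against the API guide before use.

# Export the current storageArea settings (including quotas) to an XML file,
# edit the quota values, and apply the changes. (Assumed flags, modeled on the
# "update partitionConfig" example later in this chapter.)
$ORACLE_HOME/bin/searchadmin -u eqsys -p eqsys_password export storageArea -o /tmp/storageArea.xml
$ORACLE_HOME/bin/searchadmin -u eqsys -p eqsys_password update storageArea -i /tmp/storageArea.xml

# The resumeAllSpaceConsumingTasks operation named above is invoked through the
# same tool; see the API guide for its exact object name and syntax.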

Using Backup and Recovery

The Global Settings - Configuration Data Backup and Recovery page backs up metadata that can be used to recover your configuration settings after a hardware failure. The actual crawled data is not backed up. You should run a backup after making configuration data changes, such as creating or editing sources.

When you perform a backup, Oracle SES copies the data to the binary metaData.bkp file. The location of this file is provided on the Global Settings - Configuration Data Backup and Recovery page. When the backup successfully completes, you must copy this file to a different host.

To recover the configuration, copy the metaData.bkp file back to the location provided in the Oracle SES Administration GUI when the installation completes, and then perform the recovery from the Global Settings - Configuration Data Backup and Recovery page. Sources must be re-crawled to see search results.
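
As a minimal sketch of keeping the backup off the Oracle SES host, you can copy the file with a tool such as scp. The paths and host name below are placeholders; the actual backup file location is the one shown on the Backup and Recovery page.

# Copy the backup file to another host after the backup completes.
scp /path/shown/in/admin/gui/metaData.bkp backup_host:/backups/ses/metaData.bkp

# To prepare for recovery, copy the file back to the same location.
scp backup_host:/backups/ses/metaData.bkp /path/shown/in/admin/gui/metaData.bkp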


Cold Backups

As an additional precaution to minimize downtime, you can perform a cold backup to back up all the data of an Oracle SES instance. To back up an instance, you must save a copy of the ORACLE_BASE, oraInventory, and oradata directories.

To perform a cold backup: 

  1. Shut down the Oracle SES instance:

    ORACLE_HOME/bin/searchctl stopall
    
  2. Log in to the machine as the root user or the administrator.

  3. Copy all the files under the Oracle SES base directory (ORACLE_BASE), the Oracle inventory directory (oraInventory), and the Oracle data directory (oradata).

    These locations are specified during the Oracle SES installation. There are several ways to make a copy. For example, using the tar command:

    cd / 
    tar cvf ses_orahome.tar <complete path to ORACLE_BASE>
    tar cvf ses_orainv.tar <complete path to Ora Inventory>
    tar cvf ses_oradat.tar <complete path to oradata>
    

    For example, if the oradata location is /mnt1/oracle/ses/oradata, then save a copy using the command:

    cd / 
    tar cvf ses_oradat.tar /mnt1/oracle/ses/oradata
    
  4. Back up the cached files of sources created before Oracle SES 11g. (Optional)

    If you retain cache files, then users can click the "cached" link in the result list.

    The cache directory location is listed on the Global Settings - Crawler Configuration page. For example, if the cache directory is /mnt1/oracle/ses/cache, then run the following commands.

    cd /
    tar cvf cache.tar /mnt1/oracle/ses/cache
    
  5. Save the .tar files in a safe location.

    Note:

    You can use any compression method to perform file backup. For example, you can zip the files.

To recover files from a cold backup: 

  1. Shut down the Oracle SES instance:

    ORACLE_HOME/bin/searchctl stopall
    
  2. Restore all backed-up files. To do this, untar the files, and move them back to their original locations.

  3. Start the Oracle SES instance:

    ORACLE_HOME/bin/searchctl startall
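
For the restore in step 2, a minimal sketch, assuming the archives were created from the root directory as in the backup procedure above, is to extract them from that same directory so that the files return to their original paths:

# Copy the .tar files back from their safe location first, then extract.
cd /
tar xvf ses_orahome.tar
tar xvf ses_orainv.tar
tar xvf ses_oradat.tar
tar xvf cache.tar    # only if the optional cache backup was taken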
    

Tuning the Crawl Performance

Your Web crawling strategy can be as simple as identifying a few well-known sites that are likely to contain links to most of the other intranet sites in your organization. You could test this by crawling these sites without indexing them. After the initial crawl, you have a good idea of the hosts that exist in your intranet. You could then define separate Web sources to facilitate crawling and indexing on individual sites.

However, the process of discovering and crawling your organization's intranet, or the Internet, is generally an interactive one characterized by periodic analysis of crawling results and modification to crawling parameters. For example, if you observe that the crawler is spending days crawling one Web host, then you might want to exclude crawling at that host or limit the crawling depth.

This section describes the most common things to consider when improving crawl performance.

See Also:

"Monitoring the Crawling Process" for more information on crawling parameters

Understanding the Crawler Schedule

Schedules define the frequency at which the Oracle SES index is updated with information about each source. This section describes characteristics of the Oracle SES crawler schedule.

  • The Failed Schedules section on the Home - General page lists all schedules that have failed. A failed schedule is one in which the crawler encountered an irrecoverable error, such as an indexing error or a source-specific login error, and cannot proceed. A failed schedule can result in only a partial collection and indexing of documents.

  • The smallest granularity of the schedule interval is one hour. For example, you cannot start a schedule at 1:30 am.

  • If a crawl takes longer to finish than the scheduled interval, then it starts again when the current crawl is done. Currently, there is no option to have the scheduled time automatically pushed back to the next scheduled time.

  • When multiple sources are assigned to one schedule, the sources are crawled one by one following the order of their assignment in the schedule.

  • The schedule starts crawling the assigned sources in the assigned order. Only one source is crawled under a schedule at any given time. If a source crawl fails, then the sources assigned after it are not crawled and the schedule does not restart. You must either resolve the cause of the failure and resume the schedule, or remove the failed source from the schedule.

  • There is no automatic e-mail notification of schedule success or failure.

Registering a Proxy

By default, Oracle SES is configured to crawl Web sites in the intranet, so no additional configuration is required. However, to crawl Web sites on the Internet (also referred to as external Web sites), Oracle SES needs the HTTP proxy server information.

To register a proxy: 

  1. On the Global Settings page under Sources, select Proxy Settings.

  2. Enter the proxy server name and port. Click Set Proxy.

  3. Enter the internal host name suffix under Exceptions, so that internal Web sites do not go through the proxy server. Click Set Domain Exceptions.

    To exclude the entire domain, omit http, begin with *., and use the suffix of the host name. For example, *.us.example.com or *.example.com. Entries without the *. prefix are treated as a single host. Use the IP address only when the URL crawled is also specified using the IP for the host name. They must be consistent.

  4. If the proxy requires authentication, then enter the proxy authentication information on the Global Settings - Authentication page.

Checking Boundary Rules

The seed URL that you enter when you create a source is turned into an inclusion rule. For example, if www.example.com is the seed URL, then Oracle SES creates an inclusion rule such that only URLs containing the string www.example.com are crawled.

However, suppose that the example Web site includes URLs starting with www.exa-mple.com or example.com (without the www). Many pages have a prefix on the site name. For example, the investor section of the site has URLs that start with investor.example.com.

Always check the inclusion rules before crawling, then check the log after crawling to see what patterns have been excluded.

In this case, you might add www.example.com, www.exa-mple.com, and investor.example.com to the inclusion rules. Or you might just add example.

To crawl outside the seed site (for example, if you are crawling text.us.oracle.com, but you want to follow links outside of text.us.oracle.com to oracle.com), consider removing the inclusion rules completely. Do so carefully. This action could lead the crawler into many, many sites.

Notes for File Sources

  • If no boundary rule is specified, then crawling is limited by the underlying file system access privileges. Files accessible from the specified seed file URL are crawled, subject to the default crawling depth. The depth, which is 2 by default, is set on the Global Settings - Crawler Configuration page. For example, if the seed is file://localhost/home/user_a/, then the crawl picks up all files and directories under user_a that it has privileges to access. It crawls documents in the directory /home/user_a/level1, but not documents in the /home/user_a/level1/level2 directory, because those are at level 3 and exceed the depth limit.

  • The file URL can be in UNC (universal naming convention) format. The UNC file URL has the following format for files located within the host machine:

    file://localhost///LocalComputerName/SharedFolderName

    For example, specify \\stcisfcr\docs\spec.htm as file://localhost///stcisfcr/docs/spec.htm

    where stcisfcr is the name of the host machine.

    The string localhost is optional. You can specify the URL path without the string localhost in the URL, in which case the URL format is:

    file:///LocalComputerName/SharedFolderName

    For example,

    file:///stcisfcr/docs/spec.htm

    Note that you cannot use the UNC format to access files on other machines.

  • On some computers, the path or file name could contain non-ASCII and multibyte characters. URLs are always represented using the ASCII character set. Non-ASCII characters are represented using the hex representation of their UTF-8 encoding. For example, a space is encoded as %20, and a multibyte character can be encoded as %E3%81%82.

    You can enter spaces in simple (not regular expression) boundary rules. Oracle SES automatically encodes these URL boundary rules. For example, Home Alone is specified internally as Home%20Alone. Oracle SES does this encoding for the following:

    • File source simple boundary rules

    • URL string tests

    • File source seed URLs

    Oracle SES does not alter regular expression rules. You must ensure that the regular expression rule specified is against the encoded file URL. Spaces are not allowed in regular expression rules.
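
For example, a file in a hypothetical directory named My Docs appears to the crawler under its encoded URL:

    file://localhost/home/user_a/My%20Docs/report.doc

so a regular expression rule intended to match it must be written against My%20Docs rather than My Docs.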

Checking Dynamic Pages

Indexing dynamic pages can generate too many URLs. From the target Web site, manually navigate through a few pages to understand what boundary rules should be set to avoid crawling of duplicate pages.

Checking Crawler Depth

Setting the crawler depth very high (or unlimited) could lead the crawler into many sites. Without boundary rules, a crawler depth of 20 probably crawls the entire World Wide Web from most locations.

Checking Robots Rule

You can control which parts of your sites can be visited by robots. If robots exclusion is enabled (the default), then the Web crawler traverses the pages based on the access policy specified in the Web server robots.txt file.

The following sample robots.txt file specifies that no robots visit any URL starting with /cyberworld/map/ or /tmp/ or /foo.html:

# robots.txt for http://www.example.com/
 
User-agent: *
Disallow: /cyberworld/map/ 
Disallow: /tmp/ 
Disallow: /foo.html

If the Web site is under your control, then you can tailor a specific robots rule for the crawler by specifying Oracle Secure Enterprise Search as the user agent. For example:

User-agent: Oracle Secure Enterprise Search
 
Disallow: /tmp/

The robots meta tag can instruct the crawler either to index a Web page or to follow the links within it. For example:

<meta name="robots" content="noindex,nofollow">

Checking Duplicate Documents

Oracle SES always removes duplicate (identical) documents. Oracle SES does not index a page that is identical to one it has already indexed. Oracle SES also does not index a page that it reached through a URL that it has already processed.

With the Web Services API, you can enable or disable near-duplicate detection and removal from the result list. Near-duplicate documents are very similar to each other but not necessarily identical.

Checking Redirected Pages

When a page redirects to another page, the crawler crawls and indexes only the redirect target. For example, a Web site might have JavaScript that redirects users to another site with the same title. In such cases, only the redirected site is indexed.

Check how inclusion rules apply to redirects. Whether a redirected URL is checked against the boundary rules depends on the type of redirect. The EQ_TEST.EQ$URL table stores all of the URLs that have been crawled or are scheduled to be crawled. Three kinds of redirects are recorded in it:

  • Temporary Redirect: A redirected URL is always allowed if it is a temporary redirection (HTTP status code 302 or 307). Temporary redirection indicates that the original URL should still be used in the future. Temporary redirects cannot be identified from the EQ$URL table; you can find them only by filtering the crawler log file.

  • Permanent Redirect: For permanent redirection (HTTP status 301), the redirected URL is subject to the boundary rules. Permanent redirection means that the original URL is no longer valid and users should start using the new (redirected) one. In EQ$URL, an HTTP permanent redirect has the status code 954.

  • Meta Redirect: Metatag redirection is treated as a permanent redirect and also has the status code 954. It is always checked against the boundary rules.

The STATUS column of EQ_TEST.EQ$URL lists the status codes. For descriptions of the codes, refer to Appendix B, "URL Crawler Status Codes."
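
For example, permanent and metatag redirects (status code 954) can be listed from SQL*Plus, connected as a user with access to the EQ_TEST schema such as SYSTEM, with a query along the following lines. This is a sketch; the column holding the URL text is assumed here to be named URL.

SELECT url, status
  FROM eq_test.eq$url
 WHERE status = 954;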

Note:

Some browsers, such as Mozilla and Firefox, do not allow redirecting a page to load a network file. Microsoft Internet Explorer does not have this limitation.

Checking URL Looping

URL looping refers to the scenario where a large number of unique URLs all point to the same document. Looping sometimes occurs where a site contains a large number of pages, and each page contains links to every other page in the site. Ordinarily this is not a problem, because the crawler eventually analyzes all documents in the site. However, some Web servers attach parameters to generated URLs to track information across requests. Such Web servers might generate a large number of unique URLs that all point to the same document.

For example,

http://example.com/somedocument.html?p_origin_page=10

might refer to the same document as

http://example.com/somedocument.html?p_origin_page=13

but the p_origin_page parameter is different for each link, because the referring pages are different. If a large number of parameters are specified and if the number of referring links is large, then a single unique document could have thousands or tens of thousands of links referring to it. This is an example of how URL looping can occur.

Monitor the crawler statistics in the Oracle SES Administration GUI to determine which URLs and Web servers are being crawled the most. If you observe an inordinately large number of URL accesses to a particular site or URL, then you might want to do one of the following:

  • Exclude the Web server: This prevents the crawler from crawling any URLs at that host. (You cannot limit the exclusion to a specific port on a host.)

  • Reduce the crawling depth: This limits the number of levels of referred links the crawler follows. If you are observing URL looping effects on a particular host, then you should take a visual survey of the site to find out an estimate of the depth of the leaf pages at that site. Leaf pages are pages that do not have any links to other pages. As a general guideline, add three to the leaf page depth, and set the crawling depth to this value.

Be sure to restart the crawler after altering any parameters. Your changes take effect only after restarting the crawler.

Increasing the Oracle Redo Log File Size

Oracle SES allocates 200M for the redo log during installation. 200M is sufficient to crawl a relatively large number of documents. However, if your disk has sufficient space to increase the redo log and if you are going to crawl a very large number of documents (for example, more than 300G of text), then increase the redo log file size for better crawl performance.

Note:

The biggest transaction during crawling is SYNC INDEX by Oracle Text. Check the AWR report or the V$SYSSTAT view to see the actual redo size during crawling. Roughly, 200M is sufficient to crawl up to 300G.

To increase the size of the redo log files: 

  1. Open SQL*Plus and connect as the SYSTEM user. It has the same password as EQSYS.

  2. Issue the following SQL statement to see the current redo log status:

    SELECT vl.group#, member, bytes, vl.status 
        FROM v$log vl, v$logfile vlf 
        WHERE vl.group#=vlf.group#; 
     
    GROUP# MEMBER                                                 BYTES STATUS 
    ------ ------------------------------------------------- ---------- ---------- 
         3 /scratch/ses111/oradata/o11101/redo03.log         209715200 INACTIVE 
         2 /scratch/ses111/oradata/o11101/redo02.log         209715200 CURRENT 
         1 /scratch/ses111/oradata/o11101/redo01.log         209715200 INACTIVE 
    
  3. Drop the INACTIVE redo log file. For example, to drop group 3:

    ALTER DATABASE DROP LOGFILE group 3; 
     
    Database altered. 
    
  4. Create a larger redo log file with a command like the following. If you want to change the file location, specify the new location.

    ALTER DATABASE ADD LOGFILE '/scratch/ses111/oradata/o11101/redo03.log'
         size 400M reuse; 
    
  5. Check the status to ensure that the file was created.

    SELECT vl.group#, member, bytes, vl.status 
         FROM v$log vl, v$logfile vlf 
         WHERE vl.group#=vlf.group#; 
     
    GROUP# MEMBER                                                  BYTES STATUS 
    ------ -------------------------------------------------- ---------- ---------- 
         3 /scratch/ses111/oradata/o11101/redo03.log           419430400 UNUSED 
         2 /scratch/ses111/oradata/o11101/redo02.log           209715200 CURRENT 
         1 /scratch/ses111/oradata/o11101/redo01.log           209715200 INACTIVE 
    
  6. A log file cannot be dropped while its status is CURRENT. To make it eligible to be dropped later, issue the following statement to switch the current log file, then check the results.

    ALTER SYSTEM SWITCH LOGFILE; 
     
    SELECT vl.group#, member, bytes, vl.status 
         FROM v$log vl, v$logfile vlf 
         WHERE vl.group#=vlf.group#; 
     
    GROUP# MEMBER                                                  BYTES STATUS 
    ------ -------------------------------------------------- ---------- ---------- 
         3 /scratch/ses111/oradata/o11101/redo03.log           419430400 CURRENT 
         2 /scratch/ses111/oradata/o11101/redo02.log           209715200 ACTIVE 
         1 /scratch/ses111/oradata/o11101/redo01.log           209715200 INACTIVE 
    
  7. Issue the following SQL statement to change the status of Group 2 from ACTIVE to INACTIVE:

    ALTER SYSTEM CHECKPOINT; 
     
    SELECT vl.group#, member, bytes,  vl.status 
         FROM v$log vl, v$logfile vlf 
         WHERE vl.group#=vlf.group#; 
     
    GROUP# MEMBER                                                  BYTES STATUS 
    ------ -------------------------------------------------- ---------- ---------- 
         3 /scratch/ses111/oradata/o11101/redo03.log           419430400 CURRENT 
         2 /scratch/ses111/oradata/o11101/redo02.log           209715200 INACTIVE 
         1 /scratch/ses111/oradata/o11101/redo01.log           209715200 INACTIVE 
    
  8. Repeat steps 3, 4 and 5 for redo log groups 1 and 2.

What to Do Next

If you are still not crawling all the pages you think you should, then check which pages were crawled by doing one of the following:

  • Check the crawler log file

  • Create a search source group

To check the crawler log file: 

  1. On the Home page, click the Schedules secondary tab to display the Crawler Schedules page.

  2. Click the Log File icon to display the log file for the source.

  3. To obtain the location of the full log, click the Status link. The Crawler Progress Summary and Log Files by Source section displays the full path to the log file.

To create a search source group: 

  1. On the Search page, click the Source Groups subtab.

  2. Click New to display Create New Source Group Step 1.

  3. Enter a name, then click Proceed to Step 2.

  4. Select a source type, then shuttle only one source from Available Sources to Assigned Sources.

  5. Click Finish.

To search the source group: 

  1. On any page, click the Search link in the top right corner to open the Search application.

  2. Select the group name, then issue a search term to list the matches within the source.

  3. Select the group name, then click Browse to see a list of search groups:

    • The number after the group name identifies the number of browsed documents. Click the number to browse the search results.

    • Click the arrow before the group name to display a hierarchy of search results. The number of matches appears after each item in the hierarchy.

Automatically Adding Datafiles

Although Oracle SES places no space restrictions on user-defined datafiles, Oracle Database sets a size limit of 32 GB for any datafile, so an individual Oracle SES datafile cannot grow beyond this limit. However, Oracle SES automatically creates a new datafile at the same location when the existing datafile becomes full.

Tuning Search Performance

Oracle SES provides many features that optimize search performance. This section contains suggestions on how to improve the response time and throughput of Oracle SES, and it identifies the most common ways to improve search quality.

Adding Suggested Links

Suggested links enable you to direct users to a designated Web site for particular query keywords. For example, when users search for "Oracle Secure Enterprise Search documentation" or "Enterprise Search documentation" or "Search documentation", you could suggest http://www.oracle.com/technology.

Suggested link keywords are rules that determine which suggested links are returned (as suggestions) for a query. A rule can include query terms and logical operators. For example, "secure AND search". With this rule, the corresponding suggested link is returned for the query "secure enterprise search", but it is not returned for the query "secure database".

The rule language used for the indexed queries supports the following operators:

Table 12-1 Suggested Link Keyword Operators

Operator      Example
---------     -----------
ABOUT         about(dogs)
AND           dog and cat
NEAR          dog ; cat
OR            dog or cat
PHRASE        dog sled
STEM          $dog
THESAURUS     SYN(dog)


Note:

Do not use special characters, such as #, $, =, and &, in keywords.

Suggested links appear at the top of the search result list. Oracle SES can display up to two suggested links for each query.

This feature is especially useful for providing links to important Web pages that are not crawled by Oracle Secure Enterprise Search. Add or edit suggested links on the Search - Suggested Links page in the Oracle SES Administration GUI.

Parallel Querying and Index Partitioning

Parallel querying significantly improves search performance and facilitates searches of very large data sources. The query architecture is based on Oracle Database partitioning and enhancements in Oracle Text.

To make the best use of this feature, Oracle recommends that you run Oracle SES on a server with a four-core CPU, at least 8 GB of RAM, and multiple fast disk drives.

Parallel querying is automatically implemented on Oracle SES when the partitioning option is enabled. Partitioning can only be enabled on a newly installed Oracle SES instance.

To enable partitioning: 

  1. Log in as eqsys and execute the following SQL commands:

    exec eq_adm.use_instance(1)
    exec eq_par.enable_partition
    
  2. Next, configure the partition by setting up the storage areas and partition rules. You can do this using the admin API. See Oracle Secure Enterprise Search Administration API Guide for more information.

  3. Define the data sources and start the crawl process.

Note:

Once enabled, the partitioning option cannot be disabled. Consequently, parallel querying cannot be disabled either.

Storage Areas

A storage area in Oracle SES corresponds to a physical disk. To make optimum use of the parallel querying feature, you must create as many storage areas as there are physical disks.

A storage area is a user defined object with the following attributes:

  • name

  • description (can be updated)

  • locations (Oracle SES 11g supports only a single location)

  • usage (can be SYSTEM, CRAWLER, CACHE FILE, or PARTITION)

For each location, you can provide the following details:

  • path

  • preAllocatedSpace (in MB, can be updated)

  • device (can be updated)

  • quota (in MB, can be updated)

  • currentSize (in MB). It also includes the lastRefreshDate attribute, which indicates when currentSize was last calculated.

You can create, export, update, and delete storage areas. Use the admin API to perform these operations and manage storage areas. See Oracle Secure Enterprise Search Administration API Guide for more information.

Note the following about the various operations:

  • You can create and delete only those storage areas that have the usage type set to PARTITION.

  • Only the following fields can be updated: description, preAllocatedSpace, device, and quota.

Storage Area Schema

The storage area schema is defined as follows:

<xsd:element name = "storageAreas" minOccurs = "0" maxOccurs = "1">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name = "storageArea" minOccurs = "0" maxOccurs = "unbounded">
        <xsd:complexType>
          <xsd:all>
            <xsd:element name = "name" type = "xsd:string" minOccurs = "1" maxOccurs = "1"/>
            <xsd:element name = "description" type = "xsd:string" minOccurs = "1" maxOccurs = "1"/>
            <xsd:element name = "usage" type = "xsd:string" minOccurs = "1" maxOccurs = "1"/>
            <xsd:element name = "locations" minOccurs = "1" maxOccurs = "1">
              <xsd:complexType>
                <xsd:sequence>
                  <xsd:element name = "location" minOccurs = "1" maxOccurs = "1">
                    <xsd:complexType>
                      <xsd:all>
                        <xsd:element name = "path" type = "xsd:string" minOccurs = "1" maxOccurs = "1"/>
                        <xsd:element name = "device" type = "xsd:string" minOccurs = "0" maxOccurs = "1"/>
                        <xsd:element name = "preAllocatedSpace" type = "xsd:int" minOccurs = "0" maxOccurs = "1"/>
                        <xsd:element name = "quota" type = "xsd:int" minOccurs = "0" maxOccurs = "1"/>
                        <xsd:element name = "currentSize" minOccurs = "0" maxOccurs = "1">
                          <xsd:complexType>
                            <xsd:simpleContent>
                              <xsd:extension base = "xsd:string">
                                <xsd:attribute name = "lastRefreshDate" type = "xsd:string"/>
                              </xsd:extension>
                            </xsd:simpleContent>
                          </xsd:complexType>
                        </xsd:element>
                      </xsd:all>
                    </xsd:complexType>
                  </xsd:element>
                </xsd:sequence>
              </xsd:complexType>
            </xsd:element>
          </xsd:all>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>

For example,

<search:storageArea>
 <search:name>Cache directory</search:name>
 <search:description>The path where SES stores cache files</search:description>
 <search:usage>SYSTEM</search:usage>
 <search:locations>
  <search:location>
   <search:path>/oracle/work/regress/</search:path>
   <search:device>default</search:device>
  </search:location>
 </search:locations>
</search:storageArea>
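
Assuming the XML above is saved in a file such as /tmp/storageArea.xml, the storage area could then be created with a searchadmin call of the following form, modeled on the partitionConfig example in the next section; verify the exact operation name in the Administration API Guide.

$ORACLE_HOME/bin/searchadmin -u eqsys -p eqsys_password create storageArea -i /tmp/storageArea.xml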

Configuring a Partition

Configuring a partition includes updating partition attributes, updating partition rules, and exporting configurations. A partition can typically include multiple storage areas. For example, the following command configures a hash partition over six storage areas.

$ORACLE_HOME/bin/searchadmin -u eqsys -p eqsys_password update partitionConfig -i /scratch/configHashPartition.xml

where configHashPartition.xml is:

<search:config productVersion="11.1.1.0.0">
  <search:partitionConfig>
    <search:partitionRules>
      <search:partitionRule>
        <search:partitionValue>EQ_DEFAULT</search:partitionValue>
        <search:valueType>META</search:valueType>
        <search:ruleType>HASH</search:ruleType>
        <search:ruleSetting/>
        <search:storageArea>SA1, SA2, SA3, SA4, SA5, SA6</search:storageArea>
      </search:partitionRule>
    </search:partitionRules>
  </search:partitionConfig>
</search:config>

With this partition configuration, all documents are hash partitioned and evenly distributed across storage areas SA1 through SA6.

partitionConfig Schema

The partition configuration schema is defined as follows:

<!-- Partition Configuration -->
<xsd:element name = "partitionConfig" minOccurs = "0" maxOccurs = "1">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name = "partitionRules" minOccurs = "0" maxOccurs = "1">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name = "partitionRule" minOccurs = "0" maxOccurs = "unbounded">
              <xsd:complexType>
                <xsd:all>
                  <xsd:element name = "partitionValue" type = "xsd:string" minOccurs = "1" maxOccurs = "1"/>
                  <xsd:element name = "valueType" type = "xsd:string" minOccurs = "1" maxOccurs = "1"/>
                  <xsd:element name = "ruleType" type = "xsd:string" minOccurs = "1" maxOccurs = "1"/>
                  <xsd:element name = "ruleSetting" type = "xsd:string" minOccurs = "0" maxOccurs = "1"/>
                  <xsd:element name = "storageArea" type = "xsd:string" minOccurs = "0" maxOccurs = "1"/>
                </xsd:all>
              </xsd:complexType>
            </xsd:element>
          </xsd:sequence>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>

The different elements are:

  • search:partitionConfig: Contains partition configuration rules.

  • search:partitionRules: Contains one or more partition rules.

  • search:partitionRule: Describes a partition rule. It consists of the following elements:

    • search:partitionValue: Specify the system defined special value EQ_DEFAULT.

    • search:valueType: Type of partition value. Enter META in this field.

    • search:ruleType: Type of partition rule. Enter HASH in this field.

    • search:ruleSetting: Do not specify any value.

    • search:storageArea: A comma-separated list of storage areas included in the partition.

Managing Index Fragmentation

The ideal goal for any search engine is to auto-manage index fragmentation. With semi-automatic index fragmentation management, Oracle SES comes close to achieving this goal. Some garbage collection is still required on an infrequent basis, maybe once a month. For this reason, the Oracle SES administrator still has the ability to schedule index optimizations to run during non-peak hours.

The new index fragmentation management feature is implemented on top of an enhancement in Oracle Text, which allows the search engine index to be updated while Oracle SES is executing searches. This is achieved by temporarily saving index changes to an in-memory index and periodically merging them with the larger disk-based search engine index. This reduces fragmentation, and leads to faster response times.

The new index fragmentation management is implemented automatically on Oracle SES, but it can be tuned by configuring Oracle Text, where you can turn index fragmentation management on and off, and specify the frequency of index merges.

Modifying the KEEP Pool Size

Modifying the KEEP pool size involves tuning the Oracle Database to obtain optimum benefit from the indexing option in Oracle Text.

By default, when you install Oracle SES, the indexing option, Staging Text Index, is enabled. This automatically sets up the KEEP pool of the database because the DR$EQ$DOC_PATH_IDX$G table that temporarily stages the index is stored in the KEEP pool.

By default, Oracle Database allocates 10% of the default buffer pool size to the KEEP pool. The DR$EQ$DOC_PATH_IDX$G table expands and shrinks on a real time basis depending on the volume of the indexing activity. Thus, if there is a high volume of indexing activity, then it is likely that the average size of the DR$EQ$DOC_PATH_IDX$G table is greater than the size of the KEEP pool. This can result in slower query response time. To prevent this, you can allocate more space to the KEEP pool.

Note:

Do not attempt to modify the KEEP pool size if you are not familiar with database tuning operations. Ideally, only the database administrator should modify the KEEP pool size.

Determining if the KEEP Pool Size is Sufficient

If the KEEP pool size is not sufficient, then the AWR (Automatic Workload Repository) report or the V$SEGSTAT view shows a high number of physical reads from the DR$EQ$DOC_PATH_IDX$G and DR$EQ$DOC_PATH_IDX$H segments. If you observe this, then consider increasing the KEEP pool size.
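
For example, the following query reports the physical reads for the two segments. It uses the V$SEGMENT_STATISTICS view, which exposes the same statistics as V$SEGSTAT with the object names already resolved:

SELECT object_name, statistic_name, value
  FROM v$segment_statistics
 WHERE object_name IN ('DR$EQ$DOC_PATH_IDX$G', 'DR$EQ$DOC_PATH_IDX$H')
   AND statistic_name = 'physical reads';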

Increasing the KEEP Pool Buffer Size

Use SQL*Plus to modify the size of the KEEP pool. For example, to allocate 400 MB to the pool, execute the following:

SQL> alter system set DB_KEEP_CACHE_SIZE=400M scope=both;

To determine the current KEEP pool size, query the V$SGA_DYNAMIC_COMPONENTS view. Use the following command:

SQL> select current_size  from v$sga_dynamic_components where component = 'KEEP buffer cache';

The output is similar to the following:

CURRENT_SIZE
------------
419430400

See Also:

Oracle Database Performance Tuning Guide for more information about the KEEP pool buffer.

Optimizing the Index

Optimizing the index reduces fragmentation, and it can significantly increase the speed of searches. Schedule index optimization on a regular basis. Also, optimize the index after the crawler has made substantial updates or if fragmentation is more than 50%. Verify that index optimization is scheduled during off-peak hours. Optimization of a very large index could take several hours.

You can see the fragmentation level and run index optimization on the Global Settings - Index Optimization page in the Oracle SES Administration GUI. You can specify a maximum number of hours for the optimization to run, but for best performance, run the optimization until completion. Oracle SES uses a faster optimization method and creates a more compact copy of the index when no time limit is set.

Adjusting the Indexing Parameters

To improve indexing performance, adjust the following parameters on the Global Settings - Set Indexing Parameters page of the Oracle SES Administration GUI:

Indexing Batch Size

When the crawled data in the cache directory reaches Indexing Batch Size, Oracle SES starts indexing. The bigger the batch size, the longer it takes to start indexing each batch. Only indexed data can be searched; data still in the cache cannot be searched. The default size is 250M.

Document fetching and indexing run concurrently. While indexing is running, the Oracle SES crawler continues to fetch documents and store them in the cache directory.

Indexing Memory Size

This is the upper limit of memory used for indexing before flushing the index to disk.

A large amount of memory improves indexing performance because it reduces I/O. It also improves query performance because the created index is less fragmented from the beginning, while a fragmented index can be optimized later. Set this parameter as high as possible without causing memory paging.

A smaller amount of memory might be useful when indexing progress should be tracked or when run-time memory is scarce. The default size is 275M. In general, increasing the Indexing Memory Size parameter can reduce fragmentation.

Parallel Indexing Degree

The number of concurrent threads used for indexing. This parameter is disabled in the current version of Oracle SES; it is always set to 1.

Checking the Search Statistics

See the Home - Statistics page in the Oracle SES Administration GUI for lists of the most popular queries, failed queries, and ineffective queries. This information can lead to the following actions:

  • Refer users to a particular Web site for failed queries on the Search - Suggested Links page.

  • Fix common errors that users make in searching on the Search - Alternate Words page.

  • Make important documents easier to find on the Search - Relevancy Boosting page.

Note that every hour, Oracle SES automatically summarizes logged queries. The summarizing task can consume server resources when there is a large number of logged queries, which can impact query performance. This issue is most visible during stress tests, where several queries are executed every second. The ideal solution in such cases is to disable the query statistics option.

To do this, from the Home page, click Global Settings, Query Configuration. Under Query Statistics, select No for the Enable Query Statistics option.

Relevancy Boosting

Relevancy boosting lets administrators influence the order of documents in the result list for a particular search. You might want to override the default results for the following reasons:

  • For a highly popular search, direct users to the best results

  • For a search that returns no results, direct users to some results

  • For a search that has no click-throughs, direct users to better results

In a search, each result is assigned a score that indicates how relevant the result is to the search; that is, how good a result it is. Sometimes you know the documents that are highly relevant to some search. For example, your company Web site could have a home page for XML (http://example.com/XML-is-great.htm), which you want to appear high in the results of any search for XML. You would boost the score of the XML home page to 100 for an XML search.

The document also has a score computed for searches that are not among the boosted queries.

Two methods can help you locate URLs for relevancy boosting: locate by search and manual URL entry.

Relevancy boosting, like end user searching, is case-insensitive. For example, a document with a boosted score for Oracle is boosted for oracle.

Increasing the JVM Heap Size

If you expect heavy loads on the Oracle SES server, then configure the Java Virtual Machine (JVM) heap size for better performance.

The heap size is defined in the ORACLE_HOME/search/config/searchctl.conf file. By default, the following values are given:

COMMON_MEM_ARGS = -Xmx2048m -Xms512m

Increase the value of these parameters appropriately for your system configuration. The -Xmx value should not exceed the physical memory size.
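
For example, on a host with 8 GB or more of physical memory, you might double the defaults. The values below are illustrative only, not a sizing recommendation:

COMMON_MEM_ARGS = -Xmx4096m -Xms1024m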

Then restart the middle tier:

searchctl restart

Increasing the Oracle Undo Space

Heavy query load should not coincide with heavy crawl activity, especially when there are large-scale changes on the target site. If it does, such as when a crawl is scheduled around the clock, then increase the Oracle undo retention by raising the UNDO_RETENTION parameter and, if necessary, increase the size of the undo tablespace.
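
For example, to raise the undo retention to three hours (10,800 seconds), a database administrator could issue the following from SQL*Plus; the value shown is illustrative:

ALTER SYSTEM SET UNDO_RETENTION = 10800 SCOPE=BOTH;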

See Also:

Oracle Database SQL Language Reference and Oracle Database Administrator's Guide on Oracle Technology Network for more information about increasing the Oracle undo space

Optimizing Query Application Performance

If you plan to use the Oracle SES default query user interface and have an Oracle Application Server Web Cache installation, then you can use its compression utility to compress the content that Oracle SES sends over the network. For example, the utility can compress results.jsp from 980K to 72K. Compression provides the greatest benefit to users connecting over the Internet.

Use these Web cache compression rules:

/search/search?(.*)
/search/results.jsp?(.*)

OracleAS Web Cache does not benefit custom querying applications.

Oracle SES Command Line Tools

The command line utility for starting and stopping the search engine is searchctl. You can use it on the database, the middle tier, or both.

To list the searchctl command options: 
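
One way to list them, assuming the help option is supported by searchctl in your release, is:

ORACLE_HOME/bin/searchctl help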

You are prompted for a password when running searchctl commands on UNIX platforms. No password is required on Windows platforms. This is because Oracle SES installations on Windows require users to have Administrator privileges. When running commands to start or stop the search engine, no password is required when the user is a member of the administrator group.

See Also:

Startup/Shutdown lesson in the Oracle SES administration tutorial: http://st-curriculum.oracle.com/tutorial/SESAdminTutorial/index.htm

To restart Oracle SES: 

  1. Navigate to the ORACLE_HOME/bin directory.

  2. Issue the command

    searchctl startall
    

Turning On Debug Mode

Debug mode for the Oracle SES Administration GUI is useful for troubleshooting purposes.

To turn on debug mode: 

  1. Navigate to the ORACLE_HOME/search/webapp/config directory.

  2. Edit the search.properties file and set debug=true.

  3. Restart the Oracle SES middle tier:

    searchctl restart
    

To turn off debug mode when you are finished troubleshooting, set debug=false and restart the middle tier.

Note:

Debug information can be found in the log file available at: ORACLE_HOME/search/base_domain/servers/AdminServer/logs.

Monitoring Oracle Secure Enterprise Search

In a production environment, where a load balancer or other monitoring tools are used to ensure system availability, Oracle Secure Enterprise Search (Oracle SES) can be monitored easily at the following URL:

http://host:port/monitor/check.jsp.

The page should display the following message: Oracle Secure Enterprise Search instance is up.

This message is not translated to other languages because system monitoring tools might need to byte-compare this string.

If Oracle SES is not available, then the page displays either a connection error or the HTTP status code 503.
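
A minimal availability probe for a monitoring tool, assuming the curl utility is available and host:port is replaced with the Oracle SES middle tier address, can byte-compare the returned string as described above:

#!/bin/sh
# Hypothetical monitoring probe; exits nonzero when Oracle SES is unavailable.
if curl -s "http://host:port/monitor/check.jsp" | grep -q "Oracle Secure Enterprise Search instance is up"; then
    echo "Oracle SES is up"
else
    echo "Oracle SES is down or unreachable"
    exit 1
fi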

Integrating with Google Desktop

Oracle Secure Enterprise Search provides a GDfE plug-in to integrate with Google Desktop Enterprise Edition. You can include Google Desktop results in your Oracle SES hit list. You can also link to Oracle SES from the GDfE interface.

See Also:

Google Desktop for Enterprise Plug-in Readme at http://host:port/search/query/gdfe/gdfe_readme.html

Accessing Oracle WebLogic Server Administration Console on Oracle SES

The Oracle WebLogic Server Administration Console is a Web browser-based user interface that displays the current status of the Oracle SES middle tier. For example, the Home page shows a graph of the Response and Load, and the Performance page shows a graph of the Heap Usage.

The Administration Console is installed and configured automatically with Oracle WebLogic Server. Because the Oracle SES middle tier runs in the embedded standalone Oracle WebLogic Server, the Administration Console is started by default when Oracle SES is started.

To access the Oracle WebLogic Server Administration Console:  

  1. Enter the following URL in a Web browser, replacing host:port with the host name and port for Oracle SES:

    http://host:port/console

  2. Log in as the weblogic user with your Oracle SES administrator password.

See Also:

http://download.oracle.com/docs/cd/E15523_01/wls.htm

for detailed documentation related to Oracle WebLogic Server Administration Console

Note:

In previous releases, the base path of Oracle SES was referred to as ORACLE_HOME. In Oracle SES release 11g, the base path is referred to as ORACLE_BASE. This represents the Software Location that you specify at the time of installing Oracle SES.

ORACLE_HOME now refers to the path ORACLE_BASE/seshome.

For more information about ORACLE_BASE, see "Conventions".