Oracle® Secure Enterprise Search Administrator's Guide 11g Release 1 (11.1.2.0.0) Part Number E14130-04 |
This chapter provides information about tuning and general management of Oracle SES instances. It contains the following topics:
When you install Oracle SES, various components of Oracle SES consume server space. The different components that consume space include:
Destination Path (ORACLE_BASE): The root directory or location where Oracle SES is installed.
Data storage location: The directory or location where Oracle SES stores its data. The data includes internal database files and crawler log files, among others. This location should not be within the ORACLE_HOME directory.
Cache directory location: Location or directory to store cached data. By default, the cached data is stored within the database. However, you can specify a different location to store this data using the Crawler Configuration page.
Storage areas: Users can create storage areas as needed.
Crawler log location: Location or directory to store crawler logs. By default, the crawler logs are stored within the data storage location. However, you can specify a different location to store the logs using the Crawler Configuration page.
You can set space usage quota for these components using the Admin API. Among the different components, data storage location consumes the most disk space over time. To address this, Oracle SES sets a default space quota for this component during installation. The default space quota for the data storage location is the initial space allocated by Oracle SES during installation, plus half of the available free disk space at the time of installation. For example, if at the time of installation, the available free disk space is 350 GB, then Oracle SES allocates 175 GB of disk space to data storage location.
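The default quota rule described above can be sketched as simple arithmetic. This is only an illustration of the stated formula; the function name and the 10 GB initial allocation are assumptions made for the example and are not Oracle SES values.

```python
def default_quota_gb(initial_allocation_gb, free_disk_gb):
    """Default quota = initial allocation + half of the free disk space
    available at installation time (formula taken from the text above)."""
    return initial_allocation_gb + free_disk_gb / 2


# 350 GB of free disk space contributes 175 GB to the quota.
# The 10 GB initial allocation is a hypothetical figure.
quota = default_quota_gb(initial_allocation_gb=10, free_disk_gb=350)
print(quota)  # 185.0
```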
Note that you can use the storageArea API to modify the default space quota for data storage location. Using this API, you can remove the quota, in which case the data storage location can utilize the entire free disk space. However, if Oracle SES uses up the entire disk space, then the crawler fails and the Oracle SES instance crashes. It is not possible to recover from this state even if you clear up space. Hence, Oracle recommends that you do not remove the preset quota.
Oracle SES calculates the space usage of all the storage components on a periodic basis. You can define the frequency of the periodic checks. For example you can set it to be an hourly, daily, or a weekly task depending on your usage. Additionally, you can calculate the space usage at any time by calling the spaceCalculator admin API.
A variety of tasks such as crawling, index optimization, metadata backup, and auto merge among others consume space. When the space usage reaches 80% of the defined space usage quota, Oracle SES raises a warning. When the usage exceeds the quota, Oracle SES raises an alert and performs the following operations:
It immediately stops all crawler activities. Note that some crawler activities cannot be stopped immediately. Therefore, there is likely to be a slight delay before all crawler activities are stopped.
It allows other tasks to run to completion, and then disables them.
It disables all the prescheduled crawler activities.
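The warning and alert thresholds above can be expressed as a small check. This is a sketch of the stated rules only, not the Oracle SES implementation; the function name and the returned labels are hypothetical.

```python
def space_status(used_mb, quota_mb, warn_percent=80):
    """Classify space usage against a quota.

    "alert"   : usage exceeds the quota (crawling stops, tasks are disabled)
    "warning" : usage has reached warn_percent of the quota
    "ok"      : anything below the warning threshold
    """
    if used_mb > quota_mb:
        return "alert"
    # Integer comparison avoids floating-point rounding at the boundary.
    if used_mb * 100 >= quota_mb * warn_percent:
        return "warning"
    return "ok"


print(space_status(79, 100))   # ok
print(space_status(80, 100))   # warning
print(space_status(101, 100))  # alert
```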
After you clear up the space or increase the space usage quota, you can resume all the disrupted activities including the stopped On-Demand crawler schedules and other disabled tasks. To do this, you must call the task admin API resumeAllSpaceConsumingTasks. If you do not call this API, then Oracle SES cannot restart any of the stopped activities, even if you create free space.
To perform space management tasks, you must use the Admin API. See Oracle Secure Enterprise Search Administration API Guide for more information.
The Global Settings - Configuration Data Backup and Recovery page backs up metadata that can be used to recover your configuration settings after a hardware failure. The actual crawled data is not backed up. You should run a backup after making configuration data changes, such as creating or editing sources.
When you perform a backup, Oracle SES copies the data to the binary metaData.bkp file. The location of this file is provided on the Global Settings - Configuration Data Backup and Recovery page. When the backup successfully completes, you must copy this file to a different host.
When the installation completes, copy the metaData.bkp file to the location provided in the Oracle SES Administration GUI. Sources must be re-crawled to see search results.
Notes about backup and recovery
The following configuration files are not backed up:
ORACLE_HOME/search/webapp/config/search.properties
ORACLE_HOME/search/webapp/config/search.conf
ORACLE_HOME/search/webapp/config/ranking.xml
ORACLE_HOME/search/data/config/crawler.dat
If these files are modified, then ensure that you make a backup for them. When you restore the metaData.bkp file to a new Oracle SES instance, you must restore these files as well. Otherwise, you may lose relevant configuration information and will need to change the configuration settings manually.
Skin bundles are not backed up. To back up skin bundles, you must use the exportAll operation of the Admin API. See Oracle Secure Enterprise Search Administration API Guide for more information.
If you enabled Portlet or Single Sign-On, you need to configure them again on the new instance.
If you deployed any crawler plugins or modified the topic clustering dictionary files, then ensure that you make a backup of the ORACLE_HOME/search/lib/plugin directory. In the new instance, you must deploy the files within the plugin directory.
You must stop all running schedules before doing the backup.
Recovery must be performed on a fresh installation of the same version of Oracle SES that was backed up.
Secure search does not need to be re-enabled after recovery. If secure search is enabled in the backup instance, you do not need to re-register or re-activate the identity plug-in after recovery. If a plug-in was active when the instance was backed up, then the same plug-in is activated in the recovered instance, using the same parameters.
If you have file or table sources residing on the same computer as the one running Oracle SES, and if you intend to use a different computer for recovery, then you must use the actual host name (not localhost) when creating the sources.
For database table sources, confirm that the remote tables exist.
For file sources, confirm that files and paths are valid after recovery.
During recovery, the mail archive directory settings for existing mailing list and e-mail sources are changed. After recovery, the location is cache-dir/mail, which is the default for new e-mail and mailing list sources. Any customized directory locations set prior to recovery are lost.
If you recover an instance in a new location, the stopword directory must be updated to reflect the new location, since it is an absolute path. See "Topic Clustering" for more about stopword directories.
Note:
The backup files may contain sensitive information and must be stored in a secure location.
Cold Backups
As an additional precaution to minimize downtime, you can perform a cold backup to back up all the data of an Oracle SES instance. To back up an instance, you must save a copy of the directories ORACLE_BASE, oraInventory, and oradata.
To perform a cold backup:
Shut down the Oracle SES instance:
ORACLE_HOME/bin/searchctl stopall
Log in to the machine as the root user or the administrator.
Copy all the files under the Oracle SES base directory (ORACLE_BASE), the Oracle inventory directory (oraInventory), and the Oracle data storage directory (oradata).
These locations are specified during the Oracle SES installation. There are several ways to make a copy. For example, using the tar command:
cd /
tar cvf ses_orahome.tar <complete path to ORACLE_BASE>
tar cvf ses_orainv.tar <complete path to Ora Inventory>
tar cvf ses_oradat.tar <complete path to oradata>
For example, if the oradata location is /mnt1/oracle/ses/oradata, then save a copy using the command:
cd /
tar cvf ses_oradat.tar /mnt1/oracle/ses/oradata
(Optional) Back up the cached files of sources created before Oracle SES 11g.
If you retain cache files, then users can click the "cached" link in the result list.
The cache directory location is listed on the Global Settings - Crawler Configuration page. For example, if the cache directory is /mnt1/oracle/ses/cache, then run the following commands.
cd /
tar cvf cache.tar /mnt1/oracle/ses/cache
Save the .tar files in a safe location.
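Where a scripted copy is preferred over running tar by hand, the archiving step can be sketched with the Python standard library. This is a minimal sketch under stated assumptions: the function name is hypothetical, the instance is assumed to be already shut down with searchctl stopall, and the archive stores paths relative to the directory name rather than the full path that tar cvf would record.

```python
import tarfile
from pathlib import Path


def archive_directory(source_dir, archive_path):
    """Write source_dir and everything under it into an uncompressed tar file.

    Roughly equivalent to `tar cvf archive_path source_dir`, except that
    entries are stored under the directory's own name, not its full path.
    """
    source = Path(source_dir)
    with tarfile.open(archive_path, "w") as tar:
        tar.add(source, arcname=source.name)
    return Path(archive_path)
```

For example, archive_directory("/mnt1/oracle/ses/oradata", "ses_oradat.tar") would archive the oradata location used in the example above.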
Note:
You can use any compression method to perform file backup. For example, you can zip the files.
To recover files from a cold backup:
Shut down the Oracle SES instance:
ORACLE_HOME/bin/searchctl stopall
Restore all backed-up files. To do this, untar the files, and move them back to their original locations.
Start the Oracle SES instance:
ORACLE_HOME/bin/searchctl startall
Your Web crawling strategy can be as simple as identifying a few well-known sites that are likely to contain links to most of the other intranet sites in your organization. You could test this by crawling these sites without indexing them. After the initial crawl, you have a good idea of the hosts that exist in your intranet. You could then define separate Web sources to facilitate crawling and indexing on individual sites.
However, the process of discovering and crawling your organization's intranet, or the Internet, is generally an interactive one characterized by periodic analysis of crawling results and modification to crawling parameters. For example, if you observe that the crawler is spending days crawling one Web host, then you might want to exclude crawling at that host or limit the crawling depth.
This section contains the most common things to consider to improve crawl performance:
See Also:
"Monitoring the Crawling Process" for more information on crawling parameters
Schedules define the frequency at which the Oracle SES index is updated with information about each source. This section describes characteristics of the Oracle SES crawler schedule.
The Failed Schedules section on the Home - General page lists all schedules that have failed. A failed schedule is one in which the crawler encountered an irrecoverable error, such as an indexing error or a source-specific login error, and cannot proceed. A failed schedule can leave documents only partially collected and indexed.
The smallest granularity of the schedule interval is one hour. For example, you cannot start a schedule at 1:30 am.
If a crawl takes longer to finish than the scheduled interval, then it starts again when the current crawl is done. Currently, there is no option to have the scheduled time automatically pushed back to the next scheduled time.
When multiple sources are assigned to one schedule, the sources are crawled one by one following the order of their assignment in the schedule.
The schedule starts crawling the assigned sources in the assigned order. Only one source is crawling under a schedule at any given time. If a source crawl fails, then the rest of the sources assigned after it are not crawled. The schedule does not restart. You must either resolve the cause of the failure and resume the schedule, or remove the failed source from the schedule.
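The ordering and failure behavior described above can be modeled in a few lines. This is only a sketch of the stated behavior; run_schedule and the crawl callback are hypothetical names, not part of any Oracle SES API.

```python
def run_schedule(sources, crawl):
    """Crawl sources one at a time in their assigned order.

    `crawl` is a callable returning True on success and False on failure.
    Returns (crawled, failed_source, not_crawled). On the first failure the
    remaining sources are skipped, as the schedule does not restart itself.
    """
    crawled = []
    for index, source in enumerate(sources):
        if not crawl(source):
            return crawled, source, sources[index + 1:]
        crawled.append(source)
    return crawled, None, []


# Hypothetical schedule in which the second source fails.
result = run_schedule(["web", "files", "mail"], lambda s: s != "files")
print(result)  # (['web'], 'files', ['mail'])
```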
There is no automatic e-mail notification of schedule success or failure.
By default, Oracle SES is configured to crawl Web sites in the intranet, so no additional configuration is required. However, to crawl Web sites on the Internet (also referred to as external Web sites), Oracle SES needs the HTTP proxy server information.
To register a proxy:
On the Global Settings page under Sources, select Proxy Settings.
Enter the proxy server name and port. Click Set Proxy.
Enter the internal host name suffix under Exceptions, so that internal Web sites do not go through the proxy server. Click Set Domain Exceptions.
To exclude the entire domain, omit http, begin with *., and use the suffix of the host name. For example, *.us.example.com or *.example.com. Entries without the *. prefix are treated as a single host. Use the IP address only when the URL crawled is also specified using the IP for the host name. They must be consistent.
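The exception syntax can be illustrated with a small matcher. This is a sketch of the rules as described, assuming that a *. entry matches any host ending in that suffix and that matching is case-insensitive; the function name is hypothetical and the actual Oracle SES matching logic may differ in edge cases.

```python
def bypasses_proxy(host, exceptions):
    """Return True if `host` matches a domain exception entry.

    Entries beginning with "*." match any host with that suffix.
    Entries without the prefix match a single host exactly.
    """
    host = host.lower()
    for entry in exceptions:
        entry = entry.lower()
        if entry.startswith("*."):
            if host.endswith(entry[1:]):  # compare against ".example.com"
                return True
        elif host == entry:
            return True
    return False


print(bypasses_proxy("www.us.example.com", ["*.us.example.com"]))  # True
print(bypasses_proxy("www.example.org", ["*.example.com"]))        # False
```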
If the proxy requires authentication, then enter the proxy authentication information on the Global Settings - Authentication page.
The seed URL you enter when you create a source is turned into an inclusion rule. For example, if www.example.com is the seed URL, then Oracle SES creates an inclusion rule that only URLs containing the string www.example.com are crawled.
However, suppose that the example Web site includes URLs starting with www.exa-mple.com or example.com (without the www). Many pages have a prefix on the site name. For example, the investor section of the site has URLs that start with investor.example.com.
Always check the inclusion rules before crawling, then check the log after crawling to see what patterns have been excluded.
In this case, you might add www.example.com, www.exa-mple.com, and investor.example.com to the inclusion rules. Or you might just add example.
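Reading a simple inclusion rule as a plain substring test, the effect of these choices can be checked quickly. This sketch assumes exactly that substring reading; is_included is a hypothetical helper, not an Oracle SES function.

```python
def is_included(url, inclusion_rules):
    """Return True if the URL contains at least one inclusion rule string."""
    return any(rule in url for rule in inclusion_rules)


seed_rule = ["www.example.com"]
print(is_included("http://www.example.com/index.html", seed_rule))    # True
print(is_included("http://investor.example.com/q1.html", seed_rule))  # False

broad_rule = ["example"]
print(is_included("http://investor.example.com/q1.html", broad_rule))  # True
# Under a substring reading, the hyphenated host does not contain "example",
# so it would still need a rule of its own.
print(is_included("http://www.exa-mple.com/", broad_rule))             # False
```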
To crawl outside the seed site (for example, if you are crawling text.us.oracle.com, but you want to follow links outside of text.us.oracle.com to oracle.com), consider removing the inclusion rules completely. Do so carefully. This action could lead the crawler into many, many sites.
If no boundary rule is specified, then crawling is limited to the underlying file system access privileges. Files accessible from the specified seed file URL are crawled, subject to the default crawling depth. The depth, which is 2 by default, is set on the Global Settings - Crawler Configuration page. For example, if the seed is file://localhost/home/user_a/, then the crawl picks up all files and directories under user_a with access privileges. It crawls any documents in the directory /home/user_a/level1, which is within the depth limit. The documents in the /home/user_a/level1/level2 directory are at level 3 and are therefore not crawled.
The file URL can be in UNC (universal naming convention) format. The UNC file URL has the following format for files located within the host machine:
file://localhost///LocalComputerName/SharedFolderName
For example, specify \\stcisfcr\docs\spec.htm as file://localhost///stcisfcr/docs/spec.htm, where stcisfcr is the name of the host machine.
The string localhost is optional. You can specify the URL path without the string localhost in the URL, in which case the URL format is:
file:///LocalComputerName/SharedFolderName
For example,
file:///stcisfcr/docs/spec.htm
Note that you cannot use the UNC format to access files on other machines.
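The mapping from a UNC path to the two file URL forms above is mechanical, as the following sketch shows. The helper name is hypothetical; the output formats are the ones given in this section.

```python
def unc_to_file_url(unc_path, include_localhost=True):
    """Convert a UNC path such as \\\\host\\share\\file to a file URL.

    With include_localhost=True the result uses the file://localhost/// form,
    otherwise the shorter file:/// form.
    """
    if not unc_path.startswith("\\\\"):
        raise ValueError("not a UNC path: " + unc_path)
    # Drop the leading double backslash and switch to forward slashes.
    rest = unc_path[2:].replace("\\", "/")
    prefix = "file://localhost///" if include_localhost else "file:///"
    return prefix + rest


print(unc_to_file_url(r"\\stcisfcr\docs\spec.htm"))
# file://localhost///stcisfcr/docs/spec.htm
print(unc_to_file_url(r"\\stcisfcr\docs\spec.htm", include_localhost=False))
# file:///stcisfcr/docs/spec.htm
```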
On some computers, the path or file name could contain non-ASCII and multibyte characters. URLs are always represented using the ASCII character set. Non-ASCII characters are represented using the hex representation of their UTF-8 encoding. For example, a space is encoded as %20, and a multibyte character can be encoded as %E3%81%82.
You can enter spaces in simple (not regular expression) boundary rules. Oracle SES automatically encodes these URL boundary rules. For example, Home Alone is specified internally as Home%20Alone. Oracle SES does this encoding for the following:
File source simple boundary rules
URL string tests
File source seed URLs
Oracle SES does not alter regular expression rules. You must ensure that the regular expression rule specified is against the encoded file URL. Spaces are not allowed in regular expression rules.
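When writing a regular expression rule against the encoded form, it helps to see the encoded string first. The following sketch uses the Python standard library to produce the same percent-encoding of UTF-8 bytes described above; the helper name is hypothetical and this is not the Oracle SES encoder itself.

```python
from urllib.parse import quote


def encode_path(path):
    """Percent-encode a path using the UTF-8 bytes of each character.

    The slash is left unencoded so that directory separators survive.
    """
    return quote(path, safe="/")


print(encode_path("Home Alone"))        # Home%20Alone
print(encode_path("\u3042"))            # %E3%81%82 (hiragana letter A)
print(encode_path("/docs/Home Alone"))  # /docs/Home%20Alone
```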
Indexing dynamic pages can generate too many URLs. From the target Web site, manually navigate through a few pages to understand what boundary rules should be set to avoid crawling of duplicate pages.
Setting the crawler depth very high (or unlimited) could lead the crawler into many sites. Without boundary rules, a crawler depth of 20 probably crawls the entire World Wide Web from most locations.
You can control which parts of your sites can be visited by robots. If robots exclusion is enabled (the default), then the Web crawler traverses the pages based on the access policy specified in the Web server robots.txt file.
The following sample robots.txt file specifies that no robots visit any URL starting with /cyberworld/map/ or /tmp/ or /foo.html:
# robots.txt for http://www.example.com/
User-agent: *
Disallow: /cyberworld/map/
Disallow: /tmp/
Disallow: /foo.html
If the Web site is under your control, then you can tailor a specific robots rule for the crawler by specifying Oracle Secure Enterprise Search as the user agent. For example:
User-agent: Oracle Secure Enterprise Search
Disallow: /tmp/
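To check how a robots.txt file will be interpreted before crawling, the rules can be evaluated offline. The sketch below uses the robots parser from the Python standard library against the first sample file above; it illustrates the robots exclusion convention in general and makes no claim about how the Oracle SES crawler parses the file internally.

```python
from urllib.robotparser import RobotFileParser

SAMPLE_ROBOTS_TXT = """\
# robots.txt for http://www.example.com/
User-agent: *
Disallow: /cyberworld/map/
Disallow: /tmp/
Disallow: /foo.html
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# URLs under a disallowed prefix are rejected; everything else is allowed.
for path in ("/index.html", "/tmp/report.html", "/foo.html"):
    allowed = parser.can_fetch("*", "http://www.example.com" + path)
    print(path, allowed)
```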
The robots meta tag can instruct the crawler whether to index a Web page and whether to follow the links within it. For example, the following tag tells the crawler to do neither:
<meta name="robots" content="noindex,nofollow">
Oracle SES always removes duplicate (identical) documents. Oracle SES does not index a page that is identical to one it has already indexed. Oracle SES also does not index a page that it reached through a URL that it has already processed.
With the Web Services API, you can enable or disable near duplicate detection and removal from the result list. Near duplicate documents are similar to each other. They may or may not be identical to each other.
When a page redirects, the crawler crawls only the page it is redirected to. For example, a Web site might have JavaScript that redirects users to another site with the same title. In such cases, only the redirected site is indexed.
Check for inclusion rules from redirects. The inclusion rules are based on the type of redirect. The EQ_TEST.EQ$URL table stores all of the URLs that have been crawled or are scheduled to be crawled. There are three kinds of redirects defined in it:
Temporary Redirect: A redirected URL is always allowed if it is a temporary redirection (HTTP status code 302 or 307). Temporary redirection means that the original URL should still be used in the future. The EQ$URL table does not identify temporary redirects; the only way to find them is to filter the other entries out of the log file.
Permanent Redirect: For permanent redirection (HTTP status 301), the redirected URL is subject to boundary rules. Permanent redirection means the original URL is no longer valid and the user should start using the new (redirected) one. In EQ$URL, an HTTP permanent redirect has the status code 954.
Meta Redirect: Metatag redirection is treated as a permanent redirect. A meta redirect has status code 954. It is always checked against boundary rules.
The STATUS column of EQ_TEST.EQ$URL lists the status codes. For descriptions of the codes, refer to Appendix B, "URL Crawler Status Codes."
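The redirect handling described above reduces to one decision: whether the redirected URL must pass the boundary rules. The following sketch models that decision only; the function and parameter names are hypothetical and the boundary check is supplied by the caller.

```python
TEMPORARY_STATUSES = {302, 307}
PERMANENT_STATUSES = {301}


def redirect_allowed(target_url, passes_boundary_rules,
                     http_status=None, meta_redirect=False):
    """Decide whether a redirected URL may be crawled.

    Temporary redirects (302, 307) are always allowed. Permanent redirects
    (301) and meta tag redirects are checked against the boundary rules.
    """
    if meta_redirect or http_status in PERMANENT_STATUSES:
        return passes_boundary_rules(target_url)
    if http_status in TEMPORARY_STATUSES:
        return True
    raise ValueError("not a redirect status: %r" % http_status)


# Hypothetical boundary rule: only URLs containing this host are included.
def in_bounds(url):
    return "www.example.com" in url


print(redirect_allowed("http://other.example.org/", in_bounds, http_status=302))  # True
print(redirect_allowed("http://other.example.org/", in_bounds, http_status=301))  # False
```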
Note:
Some browsers, such as Mozilla and Firefox, do not allow redirecting a page to load a network file. Microsoft Internet Explorer does not have this limitation.
URL looping refers to the scenario where a large number of unique URLs all point to the same document. Looping sometimes occurs where a site contains a large number of pages, and each page contains links to every other page in the site. Ordinarily this is not a problem, because the crawler eventually analyzes all documents in the site. However, some Web servers attach parameters to generated URLs to track information across requests. Such Web servers might generate a large number of unique URLs that all point to the same document.
For example, http://example.com/somedocument.html?p_origin_page=10 might refer to the same document as http://example.com/somedocument.html?p_origin_page=13, but the p_origin_page parameter is different for each link, because the referring pages are different. If a large number of parameters are specified and if the number of referring links is large, then a single unique document could have thousands or tens of thousands of links referring to it. This is an example of how URL looping can occur.
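When reviewing crawled URLs for looping, it can help to group them by their address with the tracking parameter removed. The sketch below is an analysis aid only, written with the Python standard library; it is not something Oracle SES does during crawling, and the helper name is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_params(url, tracking_params):
    """Return the URL with the named query parameters removed."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in tracking_params]
    return urlunsplit(parts._replace(query=urlencode(kept)))


first = strip_params("http://example.com/somedocument.html?p_origin_page=10",
                     {"p_origin_page"})
second = strip_params("http://example.com/somedocument.html?p_origin_page=13",
                      {"p_origin_page"})
print(first)            # http://example.com/somedocument.html
print(first == second)  # True
```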
Monitor the crawler statistics in the Oracle SES Administration GUI to determine which URLs and Web servers are being crawled the most. If you observe an inordinately large number of URL accesses to a particular site or URL, then you might want to do one of the following:
Exclude the Web server: This prevents the crawler from crawling any URLs at that host. (You cannot limit the exclusion to a specific port on a host.)
Reduce the crawling depth: This limits the number of levels of referred links the crawler follows. If you are observing URL looping effects on a particular host, then you should take a visual survey of the site to find out an estimate of the depth of the leaf pages at that site. Leaf pages are pages that do not have any links to other pages. As a general guideline, add three to the leaf page depth, and set the crawling depth to this value.
Be sure to restart the crawler after altering any parameters. Your changes take effect only after restarting the crawler.
Oracle SES allocates 200M for the redo log during installation. 200M is sufficient to crawl a relatively large number of documents. However, if your disk has sufficient space to increase the redo log and if you are going to crawl a very large number of documents (for example, more than 300G of text), then increase the redo log file size for better crawl performance.
Note:
The biggest transaction during crawling is SYNC INDEX by Oracle Text. Check the AWR report or the V$SYSSTAT view to see the actual redo size during crawling. Roughly, 200M is sufficient to crawl up to 300G.
To increase the size of the redo log files:
Open SQL*Plus and connect as the SYSTEM user. It has the same password as EQSYS.
Issue the following SQL statement to see the current redo log status:
SELECT vl.group#, member, bytes, vl.status
FROM v$log vl, v$logfile vlf
WHERE vl.group#=vlf.group#;

GROUP# MEMBER                                             BYTES      STATUS
------ -------------------------------------------------- ---------- ----------
     3 /scratch/ses111/oradata/o11101/redo03.log          209715200  INACTIVE
     2 /scratch/ses111/oradata/o11101/redo02.log          209715200  CURRENT
     1 /scratch/ses111/oradata/o11101/redo01.log          209715200  INACTIVE
Drop the INACTIVE redo log file. For example, to drop group 3:
ALTER DATABASE DROP LOGFILE group 3;

Database altered.
Create a larger redo log file with a command like the following. If you want to change the file location, specify the new location.
ALTER DATABASE ADD LOGFILE '/scratch/ses111/oradata/o11101/redo03.log'
size 400M reuse;
Check the status to ensure that the file was created.
SELECT vl.group#, member, bytes, vl.status
FROM v$log vl, v$logfile vlf
WHERE vl.group#=vlf.group#;

GROUP# MEMBER                                             BYTES      STATUS
------ -------------------------------------------------- ---------- ----------
     3 /scratch/ses111/oradata/o11101/redo03.log          419430400  UNUSED
     2 /scratch/ses111/oradata/o11101/redo02.log          209715200  CURRENT
     1 /scratch/ses111/oradata/o11101/redo01.log          209715200  INACTIVE
To drop a log file with a CURRENT status, issue the following ALTER statement, then check the results.
ALTER SYSTEM SWITCH LOGFILE;

SELECT vl.group#, member, bytes, vl.status
FROM v$log vl, v$logfile vlf
WHERE vl.group#=vlf.group#;

GROUP# MEMBER                                             BYTES      STATUS
------ -------------------------------------------------- ---------- ----------
     3 /scratch/ses111/oradata/o11101/redo03.log          419430400  CURRENT
     2 /scratch/ses111/oradata/o11101/redo02.log          209715200  ACTIVE
     1 /scratch/ses111/oradata/o11101/redo01.log          209715200  INACTIVE
Issue the following SQL statement to change the status of Group 2 from ACTIVE to INACTIVE:
ALTER SYSTEM CHECKPOINT;

SELECT vl.group#, member, bytes, vl.status
FROM v$log vl, v$logfile vlf
WHERE vl.group#=vlf.group#;

GROUP# MEMBER                                             BYTES      STATUS
------ -------------------------------------------------- ---------- ----------
     3 /scratch/ses111/oradata/o11101/redo03.log          419430400  CURRENT
     2 /scratch/ses111/oradata/o11101/redo02.log          209715200  INACTIVE
     1 /scratch/ses111/oradata/o11101/redo01.log          209715200  INACTIVE
Repeat steps 3, 4 and 5 for redo log groups 1 and 2.
If you are still not crawling all the pages you think you should, then check which pages were crawled by doing one of the following:
To check the crawler log file:
On the Home page, click the Schedules secondary tab to display the Crawler Schedules page.
Click the Log File icon to display the log file for the source.
To obtain the location of the full log, click the Status link. The Crawler Progress Summary and Log Files by Source section displays the full path to the log file.
To create a search source group:
On the Search page, click the Source Groups subtab.
Click New to display Create New Source Group Step 1.
Enter a name, then click Proceed to Step 2.
Select a source type, then shuttle only one source from Available Sources to Assigned Sources.
Click Finish.
To search the source group:
On any page, click the Search link in the top right corner to open the Search application.
Select the group name, then issue a search term to list the matches within the source.
Select the group name, then click Browse to see a list of search groups:
The number after the group name identifies the number of browsed documents. Click the number to browse the search results.
Click the arrow before the group name to display a hierarchy of search results. The number of matches appears after each item in the hierarchy.
While there are no space restrictions for user defined datafiles, Oracle Database sets a size limit of 32 GB for any datafile. As a result, Oracle SES datafiles cannot grow beyond this limit. However, Oracle SES automatically creates a new datafile at the same location when the existing datafile becomes full.
Oracle SES contains a lot of features that optimize the search performance. This section contains suggestions on how to improve the response time and throughput performance of Oracle SES. It identifies the most common ways to improve search quality.
See Also:
"Searching on Date Attributes"
Suggested links enable you to direct users to a designated Web site for particular query keywords. For example, when users search for "Oracle Secure Enterprise Search documentation" or "Enterprise Search documentation" or "Search documentation", you could suggest http://www.oracle.com/technology.
Suggested link keywords are rules that determine which suggested links are returned (as suggestions) for a query. A rule can include query terms and logical operators. For example, "secure AND search". With this rule, the corresponding suggested link is returned for the query "secure enterprise search", but it is not returned for the query "secure database".
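The AND example above can be reproduced with a toy matcher. This sketch handles only the AND operator over whole words and is meant to show why one query matches and the other does not; it is not the Oracle Text rule engine, and the function name is hypothetical.

```python
import re


def matches_and_rule(rule, query):
    """Return True if every term of an AND rule appears as a word in the query."""
    terms = [term.strip().lower()
             for term in re.split(r"\s+and\s+", rule, flags=re.IGNORECASE)]
    words = set(query.lower().split())
    return all(term in words for term in terms)


print(matches_and_rule("secure AND search", "secure enterprise search"))  # True
print(matches_and_rule("secure AND search", "secure database"))           # False
```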
The rule language used for the indexed queries supports the following operators:
Table 12-1 Suggested Link Keyword Operators
Operator | Example |
---|---|
ABOUT | about(dogs) |
AND | dog and cat |
NEAR | dog ; cat |
OR | dog or cat |
PHRASE | dog sled |
STEM | $dog |
THESAURUS | SYN(dog) |
Note:
Do not use special characters, such as #, $, =, and &, in keywords.
Suggested links appear at the top of the search result list. Oracle SES can display up to two suggested links for each query.
This feature is especially useful for providing links to important Web pages that are not crawled by Oracle Secure Enterprise Search. Add or edit suggested links on the Search - Suggested Links page in the Oracle SES Administration GUI.
Parallel querying significantly improves search performance and facilitates searches of very large data sources. The query architecture is based on Oracle Database partitioning and enhancements in Oracle Text.
To make the best use of this feature, Oracle recommends that you run Oracle SES on a server with a 4-Core CPU, with at least 8GB of RAM and multiple fast disk drives.
Parallel querying is automatically implemented on Oracle SES when the partitioning option is enabled. Partitioning can only be enabled on a newly installed Oracle SES instance.
To enable partitioning:
Log in as eqsys and execute the following SQL commands:
exec eq_adm.use_instance(1)
exec eq_par.enable_partition
Next, configure the partition by setting up the storage areas and partition rules. You can do this using the admin API. See Oracle Secure Enterprise Search Administration API Guide for more information.
Define the data sources and start the crawl process.
Note:
Once enabled, the partitioning option cannot be disabled. Therefore, by default, parallel querying cannot be disabled either.
A storage area in Oracle SES corresponds to a physical disk. To make optimum use of the parallel querying feature, you must create as many storage areas as there are physical disks.
A storage area is a user defined object with the following attributes:
name
description (can be updated)
locations (Oracle SES 11g supports only a single location)
usage (can be SYSTEM, CRAWLER, CACHE FILE, or PARTITION)
For each location, you can provide the following details:
path
preAllocatedSpace (in MB, can be updated)
device (can be updated)
quota (in MB, can be updated)
currentSize (in MB). It also contains the lastRefreshDate parameter, which indicates the time when currentSize was calculated.
You can create, export, update, and delete storage areas. Use the admin API to perform these operations and manage storage areas. See Oracle Secure Enterprise Search Administration API Guide for more information.
Note the following about the various operations:
Users can create and delete only those storage areas that have the usage type set to PARTITION.
Only the following fields can be updated: description, preAllocatedSpace, device, and quota.
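These two restrictions are easy to check before calling the admin API. The sketch below encodes them as plain Python checks; the helper names are hypothetical and the admin API performs its own validation regardless.

```python
UPDATABLE_FIELDS = {"description", "preAllocatedSpace", "device", "quota"}


def rejected_update_fields(changes):
    """Return the requested field names that cannot be updated, sorted."""
    return sorted(set(changes) - UPDATABLE_FIELDS)


def can_create_or_delete(usage):
    """Only storage areas with usage PARTITION can be created or deleted."""
    return usage == "PARTITION"


print(rejected_update_fields({"quota": 2048, "path": "/disk2/ses"}))  # ['path']
print(can_create_or_delete("PARTITION"))  # True
print(can_create_or_delete("SYSTEM"))     # False
```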
The storage area schema is as defined:
<xsd:element name = "storageAreas" minOccurs = "0" maxOccurs = "1">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name = "storageArea" minOccurs = "0" maxOccurs = "unbounded">
        <xsd:complexType>
          <xsd:all>
            <xsd:element name = "name" type = "xsd:string" minOccurs = "1" maxOccurs = "1" />
            <xsd:element name = "description" type = "xsd:string" minOccurs = "1" maxOccurs = "1" />
            <xsd:element name = "usage" type = "xsd:string" minOccurs = "1" maxOccurs = "1" />
            <xsd:element name = "locations" minOccurs = "1" maxOccurs = "1">
              <xsd:complexType>
                <xsd:sequence>
                  <xsd:element name = "location" minOccurs = "1" maxOccurs = "1">
                    <xsd:complexType>
                      <xsd:all>
                        <xsd:element name = "path" type = "xsd:string" minOccurs = "1" maxOccurs = "1"/>
                        <xsd:element name = "device" type = "xsd:string" minOccurs = "0" maxOccurs = "1"/>
                        <xsd:element name = "preAllocatedSpace" type = "xsd:int" minOccurs = "0" maxOccurs = "1"/>
                        <xsd:element name = "quota" type = "xsd:int" minOccurs = "0" maxOccurs = "1"/>
                        <xsd:element name = "currentSize" minOccurs = "0" maxOccurs = "1">
                          <xsd:complexType>
                            <xsd:simpleContent>
                              <xsd:extension base = "xsd:string">
                                <xsd:attribute name = "lastRefreshDate" type = "xsd:string" />
                              </xsd:extension>
                            </xsd:simpleContent>
                          </xsd:complexType>
                        </xsd:element>
                      </xsd:all>
                    </xsd:complexType>
                  </xsd:element>
                </xsd:sequence>
              </xsd:complexType>
            </xsd:element>
          </xsd:all>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>
For example,
<search:storageArea>
  <search:name>Cache directory</search:name>
  <search:description>The path where SES store cache files</search:description>
  <search:usage>SYSTEM</search:usage>
  <search:locations>
    <search:location>
      <search:path>/oracle/work/regress/</search:path>
      <search:device>default</search:device>
    </search:location>
  </search:locations>
</search:storageArea>
Configuring a partition includes updating partition attributes, updating partition rules, as well as exporting configurations. A partition can typically include multiple storage areas. For example, the following command configures a hash partition over six storage areas.
$ORACLE_HOME/bin/searchadmin -u eqsys -p eqsys_password update partitionConfig -i /scratch/configHashPartition.xml
where configHashPartition.xml is:
<search:config productVersion="11.1.1.0.0">
  <search:partitionConfig>
    <search:partitionRules>
      <search:partitionRule>
        <search:partitionValue>EQ_DEFAULT</search:partitionValue>
        <search:valueType>META</search:valueType>
        <search:ruleType>HASH</search:ruleType>
        <search:ruleSetting/>
        <search:storageArea>SA1, SA2, SA3, SA4, SA5, SA6</search:storageArea>
      </search:partitionRule>
    </search:partitionRules>
  </search:partitionConfig>
</search:config>
With this partition configuration, all documents are hash partitioned and evenly distributed across storage areas SA1 through SA6.
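As a rough illustration of what hash partitioning means here, the following Python sketch maps document keys to one of six storage areas with a stable hash. The CRC32 hash, the function names, and the example URLs are assumptions made for illustration; this text does not describe the hash function that Oracle SES uses internally.

```python
import zlib
from collections import Counter

# Storage area names taken from the example configuration.
STORAGE_AREAS = ["SA1", "SA2", "SA3", "SA4", "SA5", "SA6"]


def pick_storage_area(doc_key, areas=STORAGE_AREAS):
    """Map a document key to one storage area using a stable hash.

    CRC32 is a stand-in chosen for this sketch, not the hash used by Oracle SES.
    """
    bucket = zlib.crc32(doc_key.encode("utf-8")) % len(areas)
    return areas[bucket]


def distribution(doc_keys, areas=STORAGE_AREAS):
    """Count how many document keys land in each storage area."""
    return Counter(pick_storage_area(key, areas) for key in doc_keys)


if __name__ == "__main__":
    # Hypothetical document URLs, used only to show the spread.
    keys = [f"http://example.com/doc/{i}" for i in range(6000)]
    for area, count in sorted(distribution(keys).items()):
        print(area, count)
```

Because the mapping depends only on the key, the same document always lands in the same storage area, and a large set of keys spreads roughly evenly across the areas.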
The partition configuration schema is defined as follows:
<!-- Partition Configuration -->
<xsd:element name = "partitionConfig" minOccurs = "0" maxOccurs = "1">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name = "partitionRules" minOccurs = "0" maxOccurs = "1">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name = "partitionRule" minOccurs = "0" maxOccurs = "unbounded">
              <xsd:complexType>
                <xsd:all>
                  <xsd:element name = "partitionValue" type = "xsd:string" minOccurs = "1" maxOccurs = "1"/>
                  <xsd:element name = "valueType" type = "xsd:string" minOccurs = "1" maxOccurs = "1"/>
                  <xsd:element name = "ruleType" type = "xsd:string" minOccurs = "1" maxOccurs = "1"/>
                  <xsd:element name = "ruleSetting" type = "xsd:string" minOccurs = "0" maxOccurs = "1"/>
                  <xsd:element name = "storageArea" type = "xsd:string" minOccurs = "0" maxOccurs = "1"/>
                </xsd:all>
              </xsd:complexType>
            </xsd:element>
          </xsd:sequence>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>
The different elements are:
search:partitionConfig: Contains partition configuration rules.
search:partitionRules: Contains one or more partition rules.
search:partitionRule: Describes a partition rule. It consists of the following elements:
search:partitionValue: Specify the system-defined special value EQ_DEFAULT.
search:valueType: Type of partition value. Enter META in this field.
search:ruleType: Type of partition rule. Enter HASH in this field.
search:ruleSetting: Do not specify any value.
search:storageArea: A comma-separated list of storage areas included in the partition.
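The element descriptions above can be turned into a small generator for the input file. The following Python sketch builds a partition configuration with a single HASH rule, using the fixed values described above (EQ_DEFAULT, META, HASH, and an empty ruleSetting). The function name is hypothetical, and the sketch emits the search: prefix exactly as it appears in the configHashPartition.xml example, without adding any namespace declaration.

```python
from xml.sax.saxutils import escape


def build_hash_partition_config(storage_areas, product_version="11.1.1.0.0"):
    """Return partition configuration XML with one HASH rule.

    storage_areas: list of storage area names, for example ["SA1", "SA2"].
    product_version: copied from the example file; adjust it for your instance.
    """
    if not storage_areas:
        raise ValueError("at least one storage area is required")
    # The storageArea element holds a comma-separated list of names.
    area_list = escape(", ".join(storage_areas))
    lines = [
        f'<search:config productVersion="{product_version}">',
        "  <search:partitionConfig>",
        "    <search:partitionRules>",
        "      <search:partitionRule>",
        "        <search:partitionValue>EQ_DEFAULT</search:partitionValue>",
        "        <search:valueType>META</search:valueType>",
        "        <search:ruleType>HASH</search:ruleType>",
        "        <search:ruleSetting/>",
        f"        <search:storageArea>{area_list}</search:storageArea>",
        "      </search:partitionRule>",
        "    </search:partitionRules>",
        "  </search:partitionConfig>",
        "</search:config>",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_hash_partition_config(["SA1", "SA2", "SA3", "SA4", "SA5", "SA6"]))
```

The output can be saved to a file and passed to the update partitionConfig command with the -i option, as in the example command shown earlier.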
The ideal goal for any search engine is to auto-manage index fragmentation. With semi-automatic index fragmentation management, Oracle SES comes close to achieving this goal. Some garbage collection is still required infrequently, perhaps once a month. For this reason, the Oracle SES administrator can still schedule index optimizations to run during non-peak hours.
The new index fragmentation management feature is implemented on top of an enhancement in Oracle Text, which allows the search engine index to be updated while Oracle SES is executing searches. This is achieved by temporarily saving index changes to an in-memory index and periodically merging them with the larger disk-based search engine index. This reduces fragmentation, and leads to faster response times.
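The idea of staging changes in memory and merging them periodically can be sketched with a toy inverted index. The following Python class is a conceptual illustration only: the class name, the merge threshold, and the data structures are assumptions, and the sketch does not represent how Oracle Text implements its staging index.

```python
class StagedIndex:
    """Toy inverted index with an in-memory stage and a larger main index."""

    def __init__(self, merge_threshold=3):
        self.main = {}    # stands in for the disk-based index: term -> set of doc ids
        self.stage = {}   # in-memory staging index: term -> set of doc ids
        self.merge_threshold = merge_threshold
        self.staged_docs = 0

    def add(self, doc_id, text):
        """Record a new document in the stage; merge once the stage is full."""
        for term in set(text.lower().split()):
            self.stage.setdefault(term, set()).add(doc_id)
        self.staged_docs += 1
        if self.staged_docs >= self.merge_threshold:
            self.merge()

    def merge(self):
        """Fold all staged postings into the main index in one pass."""
        for term, doc_ids in self.stage.items():
            self.main.setdefault(term, set()).update(doc_ids)
        self.stage.clear()
        self.staged_docs = 0

    def search(self, term):
        """Search both structures, so staged documents are visible immediately."""
        term = term.lower()
        return self.main.get(term, set()) | self.stage.get(term, set())
```

Queries consult both structures, so new documents are searchable before the merge, and the main index receives changes in batches instead of one small fragment per update.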
The new index fragmentation management is implemented automatically on Oracle SES, but it can be tuned by configuring Oracle Text, where you can turn index fragmentation management on and off, and specify the frequency of index merges.
This involves tuning the Oracle Database to obtain optimum benefits of the indexing option in Oracle Text.
By default, when you install Oracle SES, the indexing option, Staging Text Index, is enabled. This automatically sets up the KEEP pool of the database because the DR$EQ$DOC_PATH_IDX$G table that temporarily stages the index is stored in the KEEP pool.
By default, Oracle Database allocates 10% of the default buffer pool size to the KEEP pool. The DR$EQ$DOC_PATH_IDX$G table expands and shrinks on a real time basis depending on the volume of the indexing activity. Thus, if there is a high volume of indexing activity, then it is likely that the average size of the DR$EQ$DOC_PATH_IDX$G table is greater than the size of the KEEP pool. This can result in slower query response time. To prevent this, you can allocate more space to the KEEP pool.
Note:
Do not attempt to modify the KEEP pool size if you are not familiar with database tuning operations. Ideally, only the database administrator should modify the KEEP pool size.
If the KEEP pool size is not sufficient, then you are likely to see high physical reads from the DR$EQ$DOC_PATH_IDX$G table, the DR$EQ$DOC_PATH_IDX$H segments, or both, in the AWR (Automatic Workload Repository) report or the V$SEGSTAT view. If you observe this, then consider increasing the KEEP pool size.
Use SQL*Plus to modify the size of the KEEP pool. For example, to allocate 400 MB to the pool, execute the following:
SQL> alter system set DB_KEEP_CACHE_SIZE=400M scope=both;
To check the current KEEP pool size, query the V$SGA_DYNAMIC_COMPONENTS view. Use the following command:
SQL> select current_size from v$sga_dynamic_components where component = 'KEEP buffer cache';
The output is similar to the following:
CURRENT_SIZE
------------
419430400
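The CURRENT_SIZE column is reported in bytes, so the value shown above corresponds to the 400M set earlier. The following Python sketch shows the conversion in both directions; the helper names are hypothetical and the sketch assumes binary multiples (1M = 1024 * 1024 bytes).

```python
# Binary multipliers for the size suffixes used in parameters such as 400M.
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def size_to_bytes(value):
    """Convert a size string such as '400M' to a number of bytes."""
    value = value.strip().upper()
    if value and value[-1] in UNITS:
        return int(value[:-1]) * UNITS[value[-1]]
    return int(value)


def bytes_to_mb(size_bytes):
    """Convert a byte count, such as CURRENT_SIZE, to megabytes."""
    return size_bytes / 1024 ** 2


if __name__ == "__main__":
    print(size_to_bytes("400M"))   # 419430400
    print(bytes_to_mb(419430400))  # 400.0
```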
See Also:
Oracle Database Performance Tuning Guide for more information about the KEEP pool buffer.
Optimizing the index reduces fragmentation, and it can significantly increase the speed of searches. Schedule index optimization on a regular basis. Also, optimize the index after the crawler has made substantial updates or if fragmentation is more than 50%. Verify that index optimization is scheduled during off-peak hours. Optimization of a very large index could take several hours.
You can see the fragmentation level and run index optimization on the Global Settings - Index Optimization page in the Oracle SES Administration GUI. You can specify a maximum number of hours for the optimization to run, but for best performance, run the optimization until completion. Oracle SES uses a faster optimization method and creates a more compact copy of the index when no time limit is set.
To improve indexing performance, adjust the following parameters on the Global Settings - Set Indexing Parameters page of the Oracle SES Administration GUI:
When the crawled data in the cache directory reaches Indexing Batch Size, Oracle SES starts indexing. The bigger the batch size, the longer it takes to start indexing each batch. Only indexed data can be searched: Data in the cache cannot be searched. The default size is 250M.
Document fetching and indexing run concurrently. While indexing is running, the Oracle SES crawler continues to fetch documents and store them in the cache directory.
Indexing Memory Size is the upper limit of memory used for indexing before the index is flushed to disk.
A large amount of memory improves indexing performance because it reduces I/O. It also improves query performance because the index is less fragmented from the start, although a fragmented index can still be optimized later. Set this parameter as high as possible without causing memory paging.
A smaller amount of memory might be useful when indexing progress should be tracked or when run-time memory is scarce. The default size is 275M. In general, increasing the Indexing Memory Size parameter can reduce fragmentation.
See the Home - Statistics page in the Oracle SES Administration GUI for lists of the most popular queries, failed queries, and ineffective queries. This information can lead to the following actions:
Refer users to a particular Web site for failed queries on the Search - Suggested Links page.
Fix common errors that users make in searching on the Search - Alternate Words page.
Make important documents easier to find on the Search - Relevancy Boosting page.
Note that every hour, Oracle SES automatically summarizes logged queries. If there are a large number of logged queries, then the summarizing task can consume significant server resources and slow query response. This effect is most visible in stress tests where several queries are executed every second. In such cases, disable the query statistics option.
To do this, from the Home page, click Global Settings, Query Configuration. Under Query Statistics, select No for the Enable Query Statistics option.
Relevancy boosting lets administrators influence the order of documents in the result list for a particular search. You might want to override the default results for the following reasons:
For a highly popular search, direct users to the best results
For a search that returns no results, direct users to some results
For a search that has no click-throughs, direct users to better results
In a search, each result is assigned a score that indicates how relevant the result is to the search; that is, how good a result it is. Sometimes you know the documents that are highly relevant to some search. For example, your company Web site could have a home page for XML (http://example.com/XML-is-great.htm), which you want to appear high in the results of any search for XML. You would boost the score of the XML home page to 100 for an XML search.
The document also has a score computed for searches that are not among the boosted queries.
Two methods can help you locate URLs for relevancy boosting: locate by search and manual URL entry.
Relevancy boosting, like end user searching, is case-insensitive. For example, a document with a boosted score for Oracle is boosted for oracle.
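The behavior described above, a fixed score for a boosted query, the computed score otherwise, and case-insensitive matching of the query, can be sketched as a simple lookup table. The class and method names in this Python sketch are hypothetical and do not correspond to an Oracle SES API.

```python
class BoostTable:
    """Toy lookup of boosted scores keyed by (query, URL)."""

    def __init__(self):
        self._boosts = {}  # (case-folded query, url) -> boosted score

    def boost(self, query, url, score):
        """Register a boosted score for a document under a given query."""
        self._boosts[(query.casefold(), url)] = score

    def score(self, query, url, computed_score):
        """Return the boosted score if one exists, else the computed score."""
        return self._boosts.get((query.casefold(), url), computed_score)


if __name__ == "__main__":
    table = BoostTable()
    xml_home = "http://example.com/XML-is-great.htm"
    table.boost("XML", xml_home, 100)
    print(table.score("xml", xml_home, 37))      # boosted, case-insensitive match
    print(table.score("parsing", xml_home, 37))  # not boosted, computed score is used
```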
If you expect heavy loads on the Oracle SES server, then configure the Java Virtual Machine (JVM) heap size for better performance.
The heap size is defined in the ORACLE_HOME/search/config/searchctl.conf file. By default, the following values are given:
COMMON_MEM_ARGS = -Xmx2048m -Xms512m
Increase the value of these parameters appropriately for your system configuration. The -Xmx value should not exceed the physical memory size.
Then restart the middle tier:
searchctl restart
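Before editing the heap settings, it can help to check the proposed values against the memory available on the server. The following Python sketch parses the -Xmx and -Xms values from an argument string and applies the rule stated above. The function names are hypothetical, and the physical memory size is passed in by the caller, because the sketch does not detect it.

```python
import re

# Multipliers for the k, m, and g suffixes accepted by -Xmx and -Xms.
_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_heap_args(mem_args):
    """Extract the -Xmx and -Xms values, in bytes, from a JVM argument string."""
    sizes = {}
    for flag, number, unit in re.findall(r"-(Xmx|Xms)(\d+)([kKmMgG]?)", mem_args):
        sizes[flag] = int(number) * _UNITS.get(unit.lower(), 1)
    return sizes


def heap_fits(mem_args, physical_memory_bytes):
    """Return True when -Xms <= -Xmx and -Xmx does not exceed physical memory."""
    sizes = parse_heap_args(mem_args)
    if "Xmx" not in sizes:
        raise ValueError("no -Xmx value found")
    return sizes.get("Xms", 0) <= sizes["Xmx"] <= physical_memory_bytes


if __name__ == "__main__":
    line = "COMMON_MEM_ARGS = -Xmx2048m -Xms512m"
    print(parse_heap_args(line))
    print(heap_fits(line, 8 * 1024 ** 3))  # server with 8 GB of physical memory
```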
Heavy query load should not coincide with heavy crawl activity, especially when there are large-scale changes on the target site. If it does, such as when a crawl is scheduled around the clock, then increase the size of the Oracle undo tablespace with the UNDO_RETENTION parameter.
See Also:
Oracle Database SQL Language Reference and Oracle Database Administrator's Guide on Oracle Technology Network for more information about increasing the Oracle undo space
If you plan to use the Oracle SES default query user interface and have an Oracle Application Server Web Cache installation, then you can use its compression utility to compress the content Oracle SES sends over the network. For example, the utility can compress results.jsp from 980K to 72K. Compression provides the greatest benefit to users connecting over the Internet.
Use these Web cache compression rules:
/search/search?(.*)
/search/results.jsp?(.*)
OracleAS Web Cache does not benefit custom querying applications.
The command line utility for starting and stopping the search engine is searchctl. You can use it on the database, the middle tier, or both.
To list the searchctl command options:
Issue the command searchctl.
You are prompted for a password when running searchctl commands on UNIX platforms. No password is required on Windows platforms. This is because Oracle SES installations on Windows require users to have Administrator privileges. When running commands to start or stop the search engine, no password is required when the user is a member of the administrator group.
See Also:
Startup/Shutdown lesson in the Oracle SES administration tutorial: http://st-curriculum.oracle.com/tutorial/SESAdminTutorial/index.htm
To restart Oracle SES:
Navigate to the ORACLE_HOME/bin directory.
Issue the command:
searchctl startall
Debug mode for the Oracle SES Administration GUI is useful for troubleshooting purposes.
To turn on debug mode:
Navigate to the ORACLE_HOME
/search/webapp/config
directory.
Edit the search.properties
file and set debug=true
.
Restart the Oracle SES middle tier:
searchctl restart
To turn off debug mode when you are finished troubleshooting, set debug=false
and restart the middle tier.
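The edit to search.properties can also be scripted. The following Python sketch rewrites the debug entry in the text of a properties file and leaves other entries untouched. The function name is hypothetical, and the sketch assumes simple key=value lines; it works on a string, so reading and writing the file and restarting the middle tier remain separate steps.

```python
def set_debug(properties_text, enabled):
    """Return properties text with debug set to true or false.

    Existing debug lines are replaced; if none exists, one is appended.
    Comment lines (starting with #) are left unchanged.
    """
    new_line = "debug=" + ("true" if enabled else "false")
    lines = properties_text.splitlines()
    found = False
    for i, line in enumerate(lines):
        if line.lstrip().startswith("#") or "=" not in line:
            continue
        key = line.split("=", 1)[0].strip()
        if key == "debug":
            lines[i] = new_line
            found = True
    if not found:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical file content used only to demonstrate the function.
    sample = "# sample settings\nsome.other.key=value\ndebug=false\n"
    print(set_debug(sample, True))
```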
In a production environment, where a load balancer or other monitoring tools are used to ensure system availability, Oracle Secure Enterprise Search (Oracle SES) can be monitored easily at the following URL:
http://host:port/monitor/check.jsp
The page should display the following message: Oracle Secure Enterprise Search instance is up.
This message is not translated to other languages because system monitoring tools might need to byte-compare this string.
If Oracle SES is not available, then the page displays either a connection error or the HTTP status code 503.
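A monitoring script can apply the checks described above. The following Python sketch requests the check page and byte-compares the response with the expected message. The function names and the timeout are assumptions, as are two details the text does not confirm: that the page returns HTTP status 200 when the instance is up, and that the message may be surrounded by other markup (the sketch tests for containment, not equality).

```python
import urllib.error
import urllib.request

# The message is not translated, so a byte comparison is safe.
EXPECTED = b"Oracle Secure Enterprise Search instance is up."


def is_up(status, body):
    """Decide availability from an HTTP status code and the response bytes."""
    # Assumption: an available instance answers with status 200.
    return status == 200 and EXPECTED in body


def check_instance(host, port, timeout=5.0):
    """Fetch /monitor/check.jsp and report whether the instance is up."""
    url = f"http://{host}:{port}/monitor/check.jsp"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return is_up(response.status, response.read())
    except (urllib.error.URLError, OSError):
        # Connection errors and HTTP errors such as 503 both mean "not available".
        return False
```

Keeping the decision in is_up, separate from the network call, makes the comparison easy to test without a running instance.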
Oracle Secure Enterprise Search provides a GDfE plug-in to integrate with Google Desktop Enterprise Edition. You can include Google Desktop results in your Oracle SES hit list. You can also link to Oracle SES from the GDfE interface.
See Also:
Google Desktop for Enterprise Plug-in Readme at http://host:port/search/query/gdfe/gdfe_readme.html
The Oracle WebLogic Server Administration Console is a Web browser-based user interface that displays the current status of the Oracle SES middle tier. For example, the Home page shows a graph of the Response and Load, and the Performance page shows a graph of the Heap Usage.
The Application Server Control Console is installed and configured automatically with WebLogic. Because the Oracle SES middle tier runs in the embedded standalone Oracle WebLogic Server, the Administration Console is started by default when Oracle SES is started.
To access the Oracle WebLogic Server Administration Console:
Enter the following URL in a Web browser, replacing host:port with the host name and port for Oracle SES:
http://host:port/console
Log in as the weblogic user with your Oracle SES administrator password.
See Also:
http://download.oracle.com/docs/cd/E15523_01/wls.htm for detailed documentation related to Oracle WebLogic Server Administration Console
Note:
In previous releases, the base path of Oracle SES was referred to as ORACLE_HOME. In Oracle SES release 11g, the base path is referred to as ORACLE_BASE. This represents the Software Location that you specify at the time of installing Oracle SES.
ORACLE_HOME now refers to the path ORACLE_BASE/seshome.
For more information about ORACLE_BASE, see "Conventions".