Oracle® Secure Enterprise Search Administrator's Guide 10g Release 1 (10.1.8.1) Part Number B32514-01
Suggested content lets you display real-time data content in the result list of the default query application. Oracle SES retrieves data from content providers and applies a style sheet to the data to generate an HTML fragment. The HTML fragment is displayed in the result list and is available through the Web Services API. For example, when an end user searches for contact information on a coworker, Oracle SES can fetch the content from the suggested content provider and return the contact information (e-mail address, phone number, and so on) for that person in the result list. Suggested content results appear under any suggested links and above the query results.
Configure suggested content on the Search - Suggested Content page in the administration tool. Enter the maximum number of suggested content results (up to 20) to be included in the Oracle SES result list. The results are rendered on a first-come, first-served basis.
Regular expressions (as supported in the Java regular expression API java.util.regex) are used to define query patterns for suggested content providers. The regular expression-based pattern matching is case-sensitive. For example, a provider with the pattern dir\s(\S+) is triggered on the query dir james but not on the query Dir James. To trigger on the query Dir James, the pattern could be defined either as [Dd][Ii][Rr]\s+(\S+) or as (?i)dir\s+(\S+). A provider with a blank query pattern is triggered on all queries.
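The case-sensitivity behavior described above can be checked with a small sketch. It uses Python's re module as a stand-in for java.util.regex (the two agree for these particular patterns). The function name triggers is hypothetical, and the use of an unanchored search is an assumption, because the guide does not state whether Oracle SES anchors the pattern to the whole query.

```python
import re

# Patterns quoted in the text above.
CASE_SENSITIVE = r"dir\s(\S+)"
CHAR_CLASSES = r"[Dd][Ii][Rr]\s+(\S+)"
INLINE_FLAG = r"(?i)dir\s+(\S+)"


def triggers(pattern, query):
    """Return True if a provider with this query pattern would trigger.

    Assumption: the pattern may match anywhere in the query (re.search).
    A blank pattern triggers on every query, as stated in the text.
    """
    return pattern == "" or re.search(pattern, query) is not None


if __name__ == "__main__":
    for pattern in (CASE_SENSITIVE, CHAR_CLASSES, INLINE_FLAG, ""):
        print(repr(pattern), triggers(pattern, "Dir James"))
```

Only the first pattern fails on Dir James; the character-class and inline-flag variants both accept it.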
The URL you enter for the suggested content provider can contain the following variables: $ora:q, $ora:lang, $ora:q1, ... $ora:qn and $ora:username.
$ora:q is the end user full query.
$ora:lang is the two-letter code for the browser language.
$ora:qn is the nth regular expression match group from the end user query. n starts from 1. If no nth group is matched, then the empty string replaces the variable.
$ora:username is the end user name.
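The variable substitution described above can be sketched as follows. This is an illustration only, not the Oracle SES implementation: the function name expand_url and its parameters are hypothetical, and percent-encoding the substituted values is an assumption that the guide does not confirm.

```python
import re
from urllib.parse import quote

# $ora:qN is listed before $ora:q so that "$ora:q1" is not read as "$ora:q".
VARIABLE = re.compile(r"\$ora:(q\d+|q|lang|username)")


def expand_url(template, query, pattern="", lang="en", username=""):
    """Fill the $ora: variables of a provider URL template for one query."""
    match = re.search(pattern, query) if pattern else None

    def value(name):
        if name == "q":
            return query
        if name == "lang":
            return lang
        if name == "username":
            return username
        n = int(name[1:])  # $ora:qn, where n starts from 1
        if match is None or n < 1 or n > match.re.groups:
            return ""  # no nth group: the empty string replaces the variable
        return match.group(n) or ""

    # Assumption: values are percent-encoded before insertion into the URL.
    return VARIABLE.sub(lambda m: quote(value(m.group(1)), safe=""), template)


if __name__ == "__main__":
    print(expand_url("http://host.company.com/apps/directory.jsp?query=$ora:q1",
                     "dir james", r"dir\s(\S+)"))
```

Matching the longer variable names first avoids the prefix collision between $ora:q and $ora:q1.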
Enter an XSLT style sheet to define rules (for example, the size and style) for transforming XML content from a provider into an HTML fragment. This HTML fragment is displayed in the result list or returned over the Web Services API. If you do not enter an XSLT style sheet, then Oracle SES assumes that the suggested content provider returns HTML. If you do not enter an XSLT style sheet and the provider returns XML, then the result list displays the plain XML.
Note:
It is the administrator's responsibility to ensure that suggested content providers return valid and safe content. Corrupted or incomplete content returned by a suggested content provider can affect the formatting of the default query application results page.
There are three security options for how Oracle SES passes the end user's authentication information to the suggested content provider:
None: With this method (the default), no security policy is used.
Cookie: With this method, the end user first must be authenticated by the suggested content provider. A cookie is set for the user to maintain a session. Oracle SES must know the cookie used by the provider for authentication, and it is made available during registration of the suggested content provider. When the user enters a query, Oracle SES grabs the cookies from the user's request header and passes them to the provider. The cookie scope must be set to the common domain of the provider site and the Oracle SES site by the provider.
For example, suppose the provider site is http://provider.company.com and the Oracle SES site is http://ses.company.com. After the end user logs in to the provider site, the site could set the value of the security cookie loginCookie with domain scope .company.com. When the end user searches in Oracle SES, Oracle SES gets the loginCookie value from the end user browser and forwards it to the provider site to get the suggested content, without the user logging in to the provider site again. However, if the provider site is accessed as http://provider or if the Oracle SES site is accessed as http://SES, then no domain cookie is available for sharing between the two sites and this security mechanism does not work.
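The domain-scope requirement in this example can be illustrated with a short sketch of the usual cookie domain-matching rule (a host matches if it equals the cookie domain or ends with a dot followed by it). The function names are hypothetical; the host names are the ones used in the example above.

```python
def domain_matches(host, cookie_domain):
    """True if a cookie scoped to cookie_domain would be sent to host."""
    host = host.lower()
    domain = cookie_domain.lower().lstrip(".")
    return host == domain or host.endswith("." + domain)


def cookie_shared(provider_host, ses_host, cookie_domain):
    """True if both sites fall inside the cookie's domain scope."""
    return (domain_matches(provider_host, cookie_domain)
            and domain_matches(ses_host, cookie_domain))


if __name__ == "__main__":
    print(cookie_shared("provider.company.com", "ses.company.com", ".company.com"))
    print(cookie_shared("provider", "ses.company.com", ".company.com"))
```

The second call shows why accessing either site by its short host name defeats the mechanism: the short name is outside the .company.com scope.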
You can decide what happens when suggested content is available but the user is not logged in to the suggested content provider or the cookie for the provider is not available. For Unauthenticated User Action, if you select Ignore content, then content from that provider will not be displayed in the result list. If you select Display login message, then Oracle SES returns a message that there is content available from this provider but the user is not logged in. The message also provides a link to log in to that provider. Enter the link for the suggested content provider login in the Login URL field.
Service-to-Service: With this method, a one-way trusted relationship is established between Oracle SES and the suggested content provider. Any user already logged in to Oracle SES does not need to be authenticated by the provider again. The provider only authenticates the Oracle SES application and trusts the Oracle SES application to act as the end user.
The end user identity is sent from Oracle SES to the provider site in the HTTP header ORA_S2S_PROXY_USER. The trusted entity could be a proxy user configured in the identity management system used by the provider, or it could be a name-value pair.
Note:
If the secured content provider needs to authenticate the end user and it sets the domain-level security cookie to maintain login information after the end user logs in, then use the cookie method for form authentication. The Oracle SES end user must log in manually to the provider site, and the security cookie is stored in the browser. Oracle SES searches on the provider for the end user without additional login.
However, if the domain security cookie is not allowed for the provider, then the provider must support service-to-service security. The provider must allow an Oracle SES application account to search after passing HTTP basic or digest authentication. Also, if the provider has different secured content for different Oracle SES end users, then it must respect the end user security (in the HTTP header ORA_S2S_PROXY_USER) for the Oracle SES search request.
To register a provider that requires either HTTP basic or HTTP digest authentication, specify the authentication user name in the Entity Name field and specify the authentication password in the Entity Password field.
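From the provider's point of view, a service-to-service request with HTTP basic authentication carries two pieces of information: the credentials of the Oracle SES application account and the end user name in ORA_S2S_PROXY_USER. The sketch below builds such a header set for testing a provider. The function name and the sample values are hypothetical, and the exact request that Oracle SES sends is not documented here, so treat this as an approximation.

```python
import base64


def s2s_headers(end_user, entity_name, entity_password):
    """Build request headers resembling a service-to-service call.

    entity_name / entity_password stand for the values entered in the
    Entity Name and Entity Password fields (HTTP basic authentication).
    """
    credentials = f"{entity_name}:{entity_password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {
        "Authorization": "Basic " + token,
        "ORA_S2S_PROXY_USER": end_user,  # end user identity, per the text
    }


if __name__ == "__main__":
    for name, value in s2s_headers("jsmith", "sesapp", "secret").items():
        print(f"{name}: {value}")
```

A provider that serves different content to different users would authenticate the application account first and then filter results by the ORA_S2S_PROXY_USER value.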
Existing OneBox providers can be configured for use as Oracle SES Suggested Content providers. For example, for a Google OneBox provider, the provider URL might be http://host.company.com/apps/directory.jsp and the trigger might be dir\s(\S+). When the user query is dir james, the provider receives the request with a query string similar to the following: apiMaj=10&apiMin=1&oneboxName=app&query=james.
With a Suggested Content provider, set the URL template as http://host.company.com/apps/directory.jsp?apiMaj=10&apiMin=1&oneboxName=app&query=$ora:q1. The provider pattern is the same: dir\s(\S+). The XSLT used for Google OneBox can be re-used with a minor change. Look for the line:
<xsl:template name="apps">
and change that line in your template to
<xsl:template match="/OneBoxResults">
A backup is a copy of configuration data that can be used to recover your configuration settings after a hardware failure. When a backup is performed, Oracle SES copies the data to the binary metaData.bkp file. The location of that file is provided on the Global Settings - Configuration Data Backup and Recovery page. When the backup completes successfully, you must copy this file to a different host. You should back up after making configuration data changes, such as creating or editing sources.
To recover, when the fresh installation completes, copy the metaData.bkp file to the location provided in the administration tool. Sources need to be crawled again to see search results.
Some notes about backup and recovery:
You must stop all running schedules before doing the backup.
Recovery must be performed on a fresh installation of the same version of Oracle SES that was backed up.
Secure search does not need to be re-enabled after recovery. If secure search is enabled in the backup instance, you do not need to re-register or re-activate the identity plug-in after recovery. If a plug-in was active when the instance was backed up, the same plug-in is activated in the recovered instance, using the same parameters.
If you have file or table sources residing on the same computer as the one running Oracle SES, and if you intend to use a different computer for recovery, then you must use the actual host name (not localhost) when creating the sources.
For database table sources, confirm that the remote tables exist.
For file sources, confirm that files and paths are valid after recovery.
During recovery, the mail archive directory settings for existing mailing list and e-mail sources are changed. After recovery, the location is <cache-dir>/mail, which is the default for new e-mail and mailing list sources. Any customized directory locations set prior to recovery are lost.
Each source has its own set of document attributes. Document attributes, like metadata, describe the properties of a document. The crawler retrieves values and maps them to one of the search attributes. This mapping lets users search documents based on their attributes. After you crawl a source, you can see the attributes for that source. Document attribute information is obtained differently depending on the source type. This section lists the attributes for each Oracle SES source type.
See Also:
"Overview of Attributes" for conceptual information about document and search attributes in Oracle SES
For table and database source types, there are no predefined attributes. The crawler collects attributes from columns defined during source creation. The Oracle SES administrator must map the columns to the search attributes.
For Oracle E-Business Suite and Siebel source types, attributes are specified by the user. Attributes for Oracle E-Business Suite 11i and Siebel 7.8 sources are specified in the query while creating the source. Attributes for Oracle E-Business Suite 12 and Siebel 8 sources are specified in the RSS data feed. (That is, you can specify attributes in the RSS data feed yourself).
For many source types (such as OracleAS Portal, e-mail, NTFS, and Microsoft Exchange sources), the crawler picks up key attributes offered by the target systems. These are listed in the following sections.
Note:
For all other sources, such as Documentum eRoom or Lotus Notes, there is an Attribute list parameter in the Home - Sources - Customize User-Defined Source page. Any attributes entered by users are collected by the crawler and available for search.
Title
Author
Description
Host
Keywords
Language
LastModifiedDate
Mimetype
Subject: This is mapped to "Description". If there is no description metatag in the HTML file, then it is ignored.
Headline1: The highest H tag text; for example, "Annual Report" from <H2>Annual Report</H2> when there is no H1 tag in the page.
Headline2: The second highest H tag text
Reference Text: The anchor text from another Web page that points to this page.
Additional HTML metatags can be defined to map to a String attribute on the Home - Sources - Metatag Mapping page.
Title
Author
Description
Host
Keywords
Language
LastModifiedDate
Mimetype
Subject
Table 6-1
Attribute | Description |
---|---|
createdate | Date the document was created |
creator | User name of the person who created the document |
author | User-editable field in which users can specify a full name or any other value |
page_path | Hierarchy path of the item/page in the portal tree |
title | Title of the document |
description | Brief description of the document |
keywords | Keywords of the document |
expiredate | Expiration date of the document |
host | Portal host |
infosource | Path of the Portal page in the browse hierarchy |
language | Language of the portal page or item |
lastmodifieddate | Last modified date of the document |
mimetype | Usually 'text/html' for portal |
perspectives | User-created markers that can be applied to pages or items, such as 'INTERNAL ONLY', 'REVIEWED', or 'DESIGN SPEC'. For example, a Portal containing recipes could have items representing recipes with perspectives such as 'Breakfast', 'Tea', 'Contains Nuts', 'Healthy' and one particular item could have several perspectives assigned to it. |
wwsbr_name_ | Internal name of the portal page or item |
wwsbr_charset_ | Character set of the portal page or item |
wwsbr_category_ | Category of the portal page or item |
wwsbr_updatedate_ | Date the portal page or item was last updated |
wwsbr_updator_ | Person who last updated the page or item |
wwsbr_subtype_ | Subtype of the portal page/item (for example, container) |
wwsbr_itemtype_ | Portal item type |
wwsbr_mime_type_ | Mimetype of the portal page or item |
wwsbr_publishdate_ | Date the portal page or item was published |
wwsbr_version_number_ | Version number of the portal item |
Title
Subject
Author
Category
Comments
Description
FileDate : LastModified Date
Description
Priority
Status
start date
end date
event Type
Author
Created Date
Title
Location
Dial_info
ConferenceID
ConferenceKey
Duration
AUTHOR
CREATE_DATE
DESCRIPTION
FILE_NAME
LASTMODIFIEDDATE
LAST_MODIFIED_BY
TITLE
ACL_CHECKSUM: The check sum calculated over the ACL submitted for the document.
DOCUMENT_LANGUAGE: Oracle SES language code taken from the Oracle Content Database language string. For example, if Oracle Content Database uses "American", then Oracle SES submits it as "en-us".
DOCUMENT_CHARACTER_SET: The character set for the Oracle Content Database document.
MIMETYPE
Oracle SES can also search categories or customized attributes created by the user in Oracle Content Database.
You can apply categories to files and links. Categories can be divided into subcategories and can have one or more attributes. When a document in Oracle Content Database is attached to a category, you can search on the attributes of that category. (The attributes appear in the list of search attributes.)
For example, suppose you create a category named testCategory with testAttr1 and testAttr2. Document X is created and assigned the testCategory. You must assign the value to the testCategory's attributes. After crawling, testAttr1 and testAttr2 will appear in the search attribute list.
Customized attribute values can be the following types: String, Integer, Long, Double, Boolean, Date, User, Enumerated String, Enumerated Integer, and Enumerated Long.
Long, Double, Integer, Enumerated Integer, and Enumerated Long customized attributes are indexed as type Number attributes in Oracle SES (display name with "_N" suffix).
Date customized attributes are indexed as type Date attributes in Oracle SES (suffix "_D").
String, Enumerated String, and User customized attributes are indexed as type String attributes in Oracle SES.
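The type mapping above can be summarized as a lookup table. This is a sketch for reference, not Oracle SES code: the names SES_TYPE, SUFFIX, and ses_attribute are hypothetical, and the absence of a suffix for String attributes is an assumption (the text mentions suffixes only for Number and Date). Boolean is omitted because the text does not say how it is indexed.

```python
# Oracle Content Database customized attribute type -> Oracle SES attribute type
SES_TYPE = {
    "Long": "Number",
    "Double": "Number",
    "Integer": "Number",
    "Enumerated Integer": "Number",
    "Enumerated Long": "Number",
    "Date": "Date",
    "String": "String",
    "Enumerated String": "String",
    "User": "String",
}

# Display-name suffix per Oracle SES type (no suffix for String is assumed).
SUFFIX = {"Number": "_N", "Date": "_D", "String": ""}


def ses_attribute(name, content_db_type):
    """Return (display name, Oracle SES type), or None if the type is not covered."""
    ses_type = SES_TYPE.get(content_db_type)
    if ses_type is None:
        return None
    return name + SUFFIX[ses_type], ses_type


if __name__ == "__main__":
    print(ses_attribute("testAttr1", "Long"))
    print(ses_attribute("testAttr2", "Date"))
```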
Limitations:
The Oracle Content Database SDK has more features than the Oracle Content Database Web GUI. The Web GUI does not support the String Array, but the SDK does. If you use the SDK to build a customized admin and user GUI to support the String array type, then a customized attribute could have more than one attribute value.
If a document in Oracle Content Database is attached to a category and the attributes in that category are left blank, then when a user searches in Oracle SES (using Advanced Search), the attribute is not available in the dropdown list.
For example, create testCategory with three attributes. A document is created and assigned this category, and values are assigned to the category's attributes. As a test, assign one attribute the value "test" and leave another attribute blank. After crawling, the attribute that was assigned the value "test" appears in the search attribute list, but the one that was left blank does not appear in the dropdown list. If an attribute has a null value, the crawler skips it. However, if another document has the same attribute with some value, then the attribute is indexed.
Table source types and database source types are similar, in that they both crawl database tables.
This section describes the benefits and limitations of both table source types and database source types.
Note:
For performance reasons, both source types require that the KEY column be backed by an index.
To crawl non-Oracle databases as a table source, you must create a view in an Oracle database on the non-Oracle table. Then create the table source on the Oracle view. Oracle SES accesses the database using database links.
Only one table or view can be specified for each table source. If data from more than one table or view is required, then first create a single view that encompasses all required data.
Oracle SES cannot crawl tables inside the Oracle SES database.
Table column mappings cannot be applied to LOB columns.
The following data types are supported for table sources: BLOB, BFILE, CLOB, CHAR, VARCHAR, VARCHAR2.
Database sources provide additional flexibility. A database source type is built on JDBC, so you can crawl any JDBC-enabled database.
A database source supports any SQL query with join conditions without creating a view. In some databases, creating objects may not be feasible.
A database source supports crawling content pointed to by a URL stored in the ATTACHMENT_LINK column.
A database source supports Info source path hierarchy and MIMETYPEs.
Database sources provide additional security. A database source provides security on the row level. It provides a third security option, ACLs Provided by Source, that is not available for table sources.
Database object names may be represented with a quoted identifier. A quoted identifier is case-sensitive and begins and ends with double quotation marks ("). If the database object is represented with a quoted identifier, then you must use the double quotation marks and the same case whenever you refer to that object.
When creating a table source in Oracle SES, if the table name is a quoted identifier, such as "1 (Table)", then in the Table Name field enter "1 (Table)", with the same case and double quotation marks. Similarly, if a primary key column or content column is named using a quoted identifier, then enter that name exactly as it appears in the database with double quotation marks.
See Also:
Oracle Database SQL Reference (available on Oracle Technology Network) for more information about schema object names and qualifiers
For file sources to successfully crawl and display multibyte environments, the locale of the computer that starts the Oracle SES server must be the same as the target file system. This way, the Oracle SES crawler can "see" the multibyte files and paths.
If the locale is different in the installation environment, then Oracle SES should be restarted from the environment with the correct locale. For example, for a Korean environment, either set LC_ALL to ko_KR or set both LC_LANG and LANG to ko_KR.KSC5601. Then run searchctl restartall from either a command prompt on Windows or an xterm on UNIX.
When crawling file sources on UNIX, the crawler will resolve any symbolic link to its true directory path and enforce the boundary rule on it. For example, suppose directory /tmp/A has two children, B and C, where C is a link to /tmp2/beta. The crawl will have the following URLs:
/tmp/A
/tmp/A/B
/tmp2/beta
/tmp/A/C
If the inclusion rule is /tmp/A, then /tmp2/beta will be excluded. The seed URL is treated as is.
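The effect of resolving a link before applying the boundary rule can be sketched without touching the file system. The links dictionary below stands in for the symbolic links on disk, and the function names are hypothetical. This illustrates the rule as described, not the crawler's actual code, and it does not model the special handling of the seed URL.

```python
def resolve(path, links):
    """Replace the longest symbolic-link prefix of path with its target."""
    for link in sorted(links, key=len, reverse=True):
        if path == link or path.startswith(link + "/"):
            return links[link] + path[len(link):]
    return path


def within_boundary(path, inclusion_prefix, links):
    """Apply the inclusion rule to the resolved (true) path."""
    real = resolve(path, links)
    return real == inclusion_prefix or real.startswith(inclusion_prefix + "/")


if __name__ == "__main__":
    links = {"/tmp/A/C": "/tmp2/beta"}  # C is a link to /tmp2/beta
    for p in ("/tmp/A", "/tmp/A/B", "/tmp/A/C"):
        print(p, "->", resolve(p, links), within_boundary(p, "/tmp/A", links))
```

Because /tmp/A/C resolves to /tmp2/beta, the inclusion rule /tmp/A no longer covers it.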
If a file URL is to be used "as is", without going through Oracle SES for retrieving the file, then "file" in the URL should be upper case "FILE". For example, FILE://localhost/...
"As is" means that when a user clicks on the search link of the document, the browser will try to use the specified file URL on the client computer to retrieve the file. Without that, Oracle SES uses this file URL on the server computer and sends the document through HTTP to the client computer.
The Oracle SES crawler is IMAP4 compliant. To crawl mailing list sources, you need an IMAP e-mail account. It is recommended to create an e-mail account that is used solely for Oracle SES to crawl mailing list messages. The crawler is configured to crawl one IMAP account for all mailing list sources. Therefore, all mailing list messages to be crawled must be found in the Inbox of the e-mail account specified on this page. This e-mail account should be subscribed to all the mailing lists. New postings for all the mailing lists will be sent to this single account and subsequently crawled.
Messages deleted from the global mailing list e-mail account are not removed from the Oracle SES index. In fact, the mailing list crawler itself will delete messages from the IMAP e-mail account as it crawls. The next time the IMAP account for mailing lists is crawled, the previous messages will no longer be there. Any new messages in the account will be added to the index (and also consequently deleted from the account). This keeps the global mailing list IMAP account clean. The Oracle SES index serves as a complete archive of all the mailing list messages.
URL boundary rules are not enforced for URL items. A URL item is the metadata that resides on the OracleAS Portal server. Oracle SES does not touch the display URL or the boundary rules for URL items.
If OracleAS Portal user privileges change, it is possible that content the crawler collects is not properly authorized. For example, in a Portal crawl, the user specified in the Home - Sources - Authentication page does not have privileges to see certain Portal pages. However, after privileges are granted to the user, on subsequent incremental crawls, the content still is not picked up by the crawler. Similarly, if privileges are revoked from the user, it is possible that content still is picked up by the crawler.
To be certain that Oracle SES has the correct set of documents, whenever a user's privileges change, update the crawler re-crawl policy to Process All Documents on the Home - Schedules - Edit Schedules page, and restart the crawl.
If a plug-in is to return file URLs to the crawler, then the file URLs must be fully qualified. For example, file://localhost/.
If a file URL is to be used "as is" without going through Oracle SES for retrieving the file, then "file" in the URL should be upper case "FILE". For example, FILE://localhost/...
See Also:
"Crawling File URLs"
The Oracle SES federator caches the federator configuration (that is, all federation-related parameters including federated sources). As a result, any change in the configuration will take effect within five minutes.
If you entered proxy settings on the Global Settings - Proxy Settings page, then make sure to add the Web Services URL for the federated source as a proxy exception.
If the federation endpoint instance is set to secure mode 3 (require login to search secure and public content), then all documents (ACL stamped or not) are secure. For secure federated search, create a trusted entity in the federation endpoint instance, then edit the federated source with the trusted entity user name and password.
There can be consistency issues if you have configured a BIG-IP system as follows:
You have two Oracle SES instances configured identically (same crawls, same sources, and so on) behind a BIG-IP load balancer to act as a single logical Oracle SES instance.
You have two other Oracle SES instances configured identically, each fronted by Oracle HTTP Server and OracleAS Web Cache, with both servers behind BIG-IP. Each of these two instances federates to the logical Oracle SES instance. Web Cache is clustered between these two nodes so that they act as a single logical Oracle SES instance, called the broker instance.
When a user performs a search on the broker Oracle SES instance and tries to access the documents in the result, document access may not be consistent each time. As a workaround, make sure that the load balancer sends all the requests in one user session to the exact same node each time.
Federated search can improve performance by distributing query processing on multiple computers. It can be an efficient way to scale up search service by adding a cluster of Oracle SES instances.
The federated search performance depends on the network topology and throughput of the entire federated Oracle SES environment.
There is a size limit of 200KB for the cached documents existing on the federation endpoint to be displayed on the Oracle SES federation broker instance.
For infosource browse, if the source hierarchies for both local and federated sources under one source group start with the same top level folder, then a sequence number is added to the folder name belonging to the federated source to distinguish the two hierarchies on the Browse page.
For federated infosource browse, a federated source should be put under an explicitly created source group.
On the Oracle SES federation broker, there is no direct access to documents on the federation endpoint through the display URL in the search result list. Only the cached version of documents is accessible. Exception: There is direct access for Web source and OracleAS Portal source documents.
See Also:
"Setting Up Federated Sources" if the federated source will be searching private content
Your Web crawling strategy can be as simple as identifying a few well-known sites that are likely to contain links to most of the other intranet sites in your organization. You could test this by crawling these sites without indexing them. After the initial crawl, you have a good idea of the hosts that exist in your intranet. You could then define separate Web sources to facilitate crawling and indexing on individual sites.
However, the process of discovering and crawling your organization's intranet, or the Internet, is generally an interactive one characterized by periodic analysis of crawling results and modification to crawling parameters. For example, if you observe that the crawler is spending days crawling one Web host, then you might want to exclude crawling at that host or limit the crawling depth.
This section contains the most common things to consider to improve crawl performance:
See Also:
"Monitoring the Crawling Process" for more information on crawling parameters
Schedules define the frequency at which the Oracle SES index is updated with information about each source. This section describes characteristics of the Oracle SES crawler schedule.
The Failed Schedules section on the Home - General page lists all schedules that have failed. Generally, a failed schedule is one in which the crawler did not collect any documents. A failed schedule also could be the result of a partial collection and indexing of documents.
The smallest granularity of the schedule interval is one hour. For example, you cannot have a schedule started at 1:30am.
If a crawl takes longer to finish than the scheduled interval, then the next crawl is started as soon as the current crawl is done. Currently, there is no option to have the scheduled time automatically pushed back to the next scheduled time.
When multiple sources are assigned to one schedule, the sources are crawled one by one following the order of their assignment in the schedule.
If a crawl fails, the schedule does not restart. You must resolve the cause of the crawl failure and resume the schedule. The rest of the pending sources are not crawled. Currently, there is no distinction between a failure that can be automatically retried versus a failure that must be fixed by the administrator.
There is no automatic e-mail notification of schedule success or failure.
By default, Oracle SES is configured to crawl Web sites in the intranet. In other words, crawling internal Web sites requires no additional configuration. However, to crawl Web sites on the Internet (also referred to as external Web sites), Oracle SES needs the HTTP proxy server information. See the Global Settings - Proxy Settings page. If the proxy requires authentication, then enter the proxy authentication information on the Global Settings - Authentication page.
The seed URL you enter when you create a source is turned into an inclusion rule. For example, if www.example.com is the seed URL, then Oracle SES creates an inclusion rule that only URLs containing the string www.example.com will be crawled.
However, suppose that the example Web site includes URLs starting with www.exa-mple.com or ones that start with example.com (without the www). Many pages have a prefix on the site name. For example, the investor section of the site has URLs that start with investor.example.com.
Always check the inclusion rules before crawling, then check the log after crawling to see what patterns have been excluded.
In this case, you might add www.example.com, www.exa-mple.com, and investor.example.com to the inclusion rules. Or you might just add example.
To crawl outside the seed site (for example, if you are crawling text.us.oracle.com, but you want to follow links outside of text.us.oracle.com to oracle.com), consider removing the inclusion rules altogether. Do so carefully. This could lead the crawler into many, many sites.
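The effect of simple "contains" inclusion rules on the example host names can be checked with a short sketch. The function name is hypothetical, and treating an empty rule list as "no restriction" is an assumption based on the advice above about removing the inclusion rules.

```python
def is_included(url, inclusion_rules):
    """True if the URL contains at least one inclusion string.

    Assumption: with no inclusion rules at all, every URL is allowed.
    """
    return not inclusion_rules or any(rule in url for rule in inclusion_rules)


if __name__ == "__main__":
    seed_only = ["www.example.com"]
    widened = ["www.example.com", "www.exa-mple.com", "investor.example.com"]
    for url in ("http://www.example.com/about.html",
                "http://investor.example.com/reports.html",
                "http://www.exa-mple.com/index.html"):
        print(url, is_included(url, seed_only), is_included(url, widened))
```

With only the rule generated from the seed URL, the investor and hyphenated hosts are excluded; the widened rule set admits all three.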
For file sources, if no boundary rule is specified, then crawling is limited to the underlying file system access privileges. Files accessible from the specified seed file URL will be crawled, subject to the default crawling depth. The depth, which is 2 by default, is set on the Global Settings - Crawler Configuration page. For example, if the seed is file://localhost/home/user_a/, then the crawl will pick up all files and directories under user_a with access privileges. With the default depth limit, it will crawl any documents in the directory /home/user_a/level1, but not the documents in the /home/user_a/level1/level2 directory, which are at level 3.
The file URL can be of UNC (universal naming convention) format. The UNC file URL has the following format: file://localhost///<LocalMachineName>/<SharedFolderName>.
For example, \\stcisfcr\docs\spec.htm should be specified as file://localhost///stcisfcr/docs/spec.htm.
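The UNC conversion above is mechanical and can be sketched as a small helper. The function name is hypothetical; the output format follows the file://localhost/// pattern given in the text.

```python
def unc_to_file_url(unc_path):
    """Convert a UNC path such as \\\\host\\share\\file to the file URL form."""
    if not unc_path.startswith("\\\\"):
        raise ValueError("not a UNC path: " + unc_path)
    # Drop the two leading backslashes and switch the separators to "/".
    return "file://localhost///" + unc_path[2:].replace("\\", "/")


if __name__ == "__main__":
    print(unc_to_file_url(r"\\stcisfcr\docs\spec.htm"))
```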
On some computers, the path or file name could contain non-ASCII and multibyte characters. URLs are always represented using the ASCII character set. Non-ASCII characters are represented using the hex representation of their UTF-8 encoding. For example, a space is encoded as %20, and a multibyte character can be encoded as %E3%81%82.
For file sources, spaces can be entered in simple (not regular expression) boundary rules. Oracle SES automatically encodes these URL boundary rules. If (Home Alone) is specified, then internally it is stored as (Home%20Alone). Oracle SES does this encoding for the following:
File source simple boundary rules
Test URL strings
File source seed URLs
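The encoding described above is standard percent-encoding of the UTF-8 bytes, which can be reproduced with Python's urllib.parse.quote. The function name and the choice of characters left unencoded are assumptions for illustration; the guide does not list exactly which characters Oracle SES leaves untouched.

```python
from urllib.parse import quote


def encode_boundary_rule(rule):
    """Percent-encode a simple boundary rule or file URL.

    Assumption: "/" and ":" are left as-is so that a full file URL
    keeps its structure; everything else non-ASCII or unsafe is encoded.
    """
    return quote(rule, safe="/:")


if __name__ == "__main__":
    print(encode_boundary_rule("Home Alone"))
    print(encode_boundary_rule("file://localhost/home/My Docs/"))
    print(encode_boundary_rule("\u3042"))  # HIRAGANA LETTER A, a multibyte character
```

When writing a regular expression boundary rule, the pattern has to target the encoded form, for example Home%20Alone rather than Home Alone.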
Note:
Oracle SES does not alter the rule if it is a regular expression rule. It is the administrator's responsibility to make sure that the regular expression rule specified is against the encoded file URL. Spaces are not allowed in regular expression rules.
Indexing dynamic pages can generate an excessive number of URLs. From the target Web site, manually navigate through a few pages to understand what boundary rules should be set to avoid crawling duplicate pages.
Setting the crawler depth very high (or unlimited) could lead the crawler into many sites. Without boundary rules, a depth of 20 will probably crawl the whole WWW from most locations.
You can control which parts of your sites can be visited by robots. If robots exclusion is enabled (default), then the Web crawler traverses the pages based on the access policy specified in the Web server robots.txt file.
The following sample /robots.txt file specifies that no robots should visit any URL starting with /cyberworld/map/ or /tmp/ or /foo.html:
# robots.txt for http://www.example.com/
User-agent: *
Disallow: /cyberworld/map/
Disallow: /tmp/
Disallow: /foo.html
If the Web site is under the user's control, then a specific robots rule can be tailored for the crawler by naming the Oracle SES crawler in the User-agent line: "User-agent: Oracle Secure Enterprise Search". For example:
User-agent: Oracle Secure Enterprise Search
Disallow: /tmp/
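To check how a robots.txt file will be interpreted before a crawl, the rules can be evaluated offline. The following sketch uses Python's standard `urllib.robotparser` module on the two sample files; it shows generic robots exclusion behavior and is not the Oracle SES crawler's own parser. The helper `make_parser` and the agent name `OtherBot` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

GENERAL_RULES = """\
# robots.txt for http://www.example.com/
User-agent: *
Disallow: /cyberworld/map/
Disallow: /tmp/
Disallow: /foo.html
"""

SES_RULES = """\
User-agent: Oracle Secure Enterprise Search
Disallow: /tmp/
"""


def make_parser(robots_txt: str) -> RobotFileParser:
    # parse() takes the file content as a list of lines; no network access.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


general = make_parser(GENERAL_RULES)
print(general.can_fetch("OtherBot", "http://www.example.com/index.html"))  # True
print(general.can_fetch("OtherBot", "http://www.example.com/tmp/a.html"))  # False

ses_only = make_parser(SES_RULES)
ses_agent = "Oracle Secure Enterprise Search"
print(ses_only.can_fetch(ses_agent, "http://www.example.com/tmp/a.html"))   # False
print(ses_only.can_fetch("OtherBot", "http://www.example.com/tmp/a.html"))  # True
```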
The robots meta tag can instruct the crawler whether to index a Web page and whether to follow the links within it. For example, the following tag tells the crawler to do neither:
<meta name="robots" content="noindex,nofollow">
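The directives in a robots meta tag can be read with any HTML parser. The sketch below uses Python's standard `html.parser` to collect them; the class and function names are hypothetical, and the way Oracle SES itself reads the tag is not documented here.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # Directives are comma-separated, for example "noindex,nofollow".
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html: str) -> set:
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
directives = robots_directives(page)
print("index page:", "noindex" not in directives)     # index page: False
print("follow links:", "nofollow" not in directives)  # follow links: False
```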
Oracle SES always removes duplicate (identical) documents. If Oracle SES determines that a page is a duplicate of one it has seen before, then it does not index it. If the page is reached through a URL that Oracle SES has already processed, then it does not index that either.
With the Web Services API, you can enable or disable near duplicate detection and removal from the result list. Near duplicate documents are similar to each other. They may or may not be identical to each other.
When a page redirects to another page, the crawler crawls only the page that is the target of the redirect. For example, a Web site might have Javascript redirecting users to another site with the same title. Only the redirected site is indexed.
Redirected URLs are checked against the inclusion (boundary) rules depending on the type of redirect. There are three kinds of redirects defined in EQ$URL:
Temporary Redirect: A redirected URL is always allowed if it is a temporary redirection (HTTP status code 302 or 307). Temporary redirection means that the original URL should still be used in the future. Temporary redirects cannot be identified directly in the EQ$URL table; the only way to find them is to filter out the other redirects from the log file.
Permanent Redirect: For permanent redirection (HTTP status code 301), the redirected URL is subject to boundary rules. Permanent redirection means the original URL is no longer valid and the user should start using the new (redirected) one. In EQ$URL, an HTTP permanent redirect has the status code 954.
Meta Redirect: Metatag redirection is treated as a permanent redirect and is always checked against boundary rules. A meta redirect has the status code 954.
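The three redirect rules above reduce to a small decision function. The following is a minimal sketch of that logic, not the crawler's implementation; `redirect_target_allowed` and `example_boundary` are hypothetical names, and the boundary rule shown is an assumed simple prefix rule.

```python
from typing import Callable

TEMPORARY_STATUSES = {302, 307}
PERMANENT_STATUSES = {301}


def redirect_target_allowed(
    http_status: int,
    target_url: str,
    within_boundary: Callable[[str], bool],
    is_meta_redirect: bool = False,
) -> bool:
    # Meta redirects are treated like permanent redirects:
    # the target must pass the boundary rules.
    if is_meta_redirect or http_status in PERMANENT_STATUSES:
        return within_boundary(target_url)
    # Temporary redirects are always allowed.
    if http_status in TEMPORARY_STATUSES:
        return True
    raise ValueError(f"not a redirect status: {http_status}")


def example_boundary(url: str) -> bool:
    # Assumed boundary rule: stay on example.com.
    return url.startswith("http://example.com/")


print(redirect_target_allowed(302, "http://other.example.org/a", example_boundary))  # True
print(redirect_target_allowed(301, "http://other.example.org/a", example_boundary))  # False
print(redirect_target_allowed(301, "http://example.com/a", example_boundary))        # True
```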
URL looping refers to the scenario where a large number of unique URLs all point to the same document. One particularly difficult situation is where a site contains a large number of pages, and each page contains links to every other page in the site. Ordinarily this would not be a problem, because the crawler eventually analyzes all documents in the site.
However, some Web servers attach parameters to generated URLs to track information across requests. Such Web servers might generate a large number of unique URLs that all point to the same document.
For example, http://example.com/somedocument.html?p_origin_page=10
might refer to the same document as http://example.com/somedocument.html?p_origin_page=13
but the p_origin_page
parameter is different for each link, because the referring pages are different. If a large number of parameters are specified and if the number of referring links is large, then a single unique document could have thousands or tens of thousands of links referring to it. This is an example of how URL looping can occur.
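When reviewing a list of crawled URLs, one way to spot this pattern is to strip the suspected tracking parameter and count how many URLs collapse to the same address. The sketch below does that with Python's standard library. It is an analysis aid under the assumption that the parameter (here `p_origin_page`) does not change the document; it does not describe anything Oracle SES does internally, and the function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_params(url: str, ignored: set) -> str:
    # Rebuild the URL without the query parameters named in `ignored`.
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


def group_by_document(urls, ignored):
    # Map each stripped URL to the list of original URLs that reduce to it.
    groups = defaultdict(list)
    for url in urls:
        groups[strip_params(url, ignored)].append(url)
    return dict(groups)


crawled = [
    "http://example.com/somedocument.html?p_origin_page=10",
    "http://example.com/somedocument.html?p_origin_page=13",
]
for document, variants in group_by_document(crawled, {"p_origin_page"}).items():
    print(document, len(variants))
# http://example.com/somedocument.html 2
```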
Monitor the crawler statistics in the Oracle SES administration tool to determine which URLs and Web servers are being crawled the most. If you observe an inordinately large number of URL accesses to a particular site or URL, then you might want to do one of the following:
Exclude the Web Server: This prevents the crawler from crawling any URLs at that host. (You cannot limit the exclusion to a specific port on a host.)
Reduce the Crawling Depth: This limits the number of levels of referred links the crawler will follow. If you are observing URL looping effects on a particular host, then browse the site to estimate the depth of its leaf pages. Leaf pages are pages that do not have any links to other pages. As a general guideline, add three to the leaf page depth, and set the crawling depth to this value.
Be sure to restart the crawler after altering any parameters. Your changes take effect only after restarting the crawler.
Oracle SES allocates 10M for the redo log during installation. If your disk has sufficient space to increase the redo log and if you are going to crawl a very large corpus (for example, more than 30G), then increase the redo log file size for better crawl performance.
Note:
The biggest transaction during crawling is SYNC INDEX by Oracle Text. Check the AWR report or the v$sysstat view to see the actual redo size during crawling. Roughly, 200M is sufficient to crawl up to 50G.
Launch SQL*Plus and connect as the SYSTEM user. (The password is the same as for EQSYS.)
Run the following SQL statement to see the current redo log status:
SQL> SELECT vl.group#, member, bytes, vl.status
  2  FROM v$log vl, v$logfile vlf
  3  WHERE vl.group#=vlf.group#;

GROUP# MEMBER                                                  BYTES STATUS
------ -------------------------------------------------- ---------- ----------
     3 /scratch/ses10181/oradata/o10181/redo03.log          10485760 INACTIVE
     2 /scratch/ses10181/oradata/o10181/redo02.log          10485760 CURRENT
     1 /scratch/ses10181/oradata/o10181/redo01.log          10485760 INACTIVE
Drop the INACTIVE
redo log file. For example, to drop group 3:
SQL> ALTER DATABASE DROP LOGFILE group 3;

Database altered.
The redo log file is dropped from the database, but the file itself still exists on the file system. Manually remove it with the file deletion command:
% rm /scratch/ses10181/oradata/o10181/redo03.log
Create a larger redo log file. If you want to change the file location, specify the new location.
SQL> alter database add logfile '/scratch/ses10181/oradata/o10181/redo03.log'
  2  size 200M;

Database altered.
Check the status to make sure the file was created.
SQL> SELECT vl.group#, member, bytes, vl.status
  2  FROM v$log vl, v$logfile vlf
  3  WHERE vl.group#=vlf.group#;

GROUP# MEMBER                                                  BYTES STATUS
------ -------------------------------------------------- ---------- ----------
     3 /scratch/ses10181/oradata/o10181/redo03.log         209715200 UNUSED
     2 /scratch/ses10181/oradata/o10181/redo02.log          10485760 CURRENT
     1 /scratch/ses10181/oradata/o10181/redo01.log          10485760 INACTIVE
To drop a log file with CURRENT
status, run the following SQL statement:
SQL> ALTER SYSTEM SWITCH LOGFILE;

System altered.

SQL> SELECT vl.group#, member, bytes, vl.status
  2  FROM v$log vl, v$logfile vlf
  3  WHERE vl.group#=vlf.group#;

GROUP# MEMBER                                                  BYTES STATUS
------ -------------------------------------------------- ---------- ----------
     3 /scratch/ses10181/oradata/o10181/redo03.log         209715200 CURRENT
     2 /scratch/ses10181/oradata/o10181/redo02.log          10485760 ACTIVE
     1 /scratch/ses10181/oradata/o10181/redo01.log          10485760 INACTIVE
Group 2 status changed to ACTIVE
. Run the following SQL statement to change the status to INACTIVE
:
SQL> ALTER SYSTEM CHECKPOINT;

System altered.

SQL> SELECT vl.group#, member, bytes, vl.status
  2  FROM v$log vl, v$logfile vlf
  3  WHERE vl.group#=vlf.group#;

GROUP# MEMBER                                                  BYTES STATUS
------ -------------------------------------------------- ---------- ----------
     3 /scratch/ses10181/oradata/o10181/redo03.log         209715200 CURRENT
     2 /scratch/ses10181/oradata/o10181/redo02.log          10485760 INACTIVE
     1 /scratch/ses10181/oradata/o10181/redo01.log          10485760 INACTIVE
Repeat steps 3, 4 and 5 (drop the redo log file, create a larger one, and check the status) for redo log groups 1 and 2.
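The procedure above depends on the status of each redo log group: an INACTIVE group can be dropped and re-created, a CURRENT group first needs a log switch, and an ACTIVE group first needs a checkpoint. The sketch below restates that decision in Python for rows copied from the v$log query. It only suggests the next statement to run; it does not connect to the database, and the function name `next_step` is hypothetical.

```python
TARGET_BYTES = 200 * 1024 * 1024  # 209715200 bytes, the 200M size used above


def next_step(status: str, size_bytes: int, target_bytes: int = TARGET_BYTES) -> str:
    """Return the next action for one redo log group."""
    if size_bytes >= target_bytes:
        return "nothing to do"
    if status == "CURRENT":
        return "ALTER SYSTEM SWITCH LOGFILE"
    if status == "ACTIVE":
        return "ALTER SYSTEM CHECKPOINT"
    # INACTIVE (or UNUSED) groups can be dropped and re-created directly.
    return "drop and re-create"


# (group#, bytes, status) rows from the first query in this procedure.
rows = [(3, 10485760, "INACTIVE"), (2, 10485760, "CURRENT"), (1, 10485760, "INACTIVE")]
for group, size_bytes, status in rows:
    print(group, next_step(status, size_bytes))
# 3 drop and re-create
# 2 ALTER SYSTEM SWITCH LOGFILE
# 1 drop and re-create
```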
If you are still not crawling all the pages you think you should, then check which pages were crawled by doing one of the following:
Check the crawler log file. (There's a link on the Home - Schedules page and the location of the full log on the Home - Schedules - Status page.)
Create a search source group. (Search - Source Groups - Create New Source Group) Put only one source in the group. From the Search page, search that group. (Click the group name on top of the search box.) Or, from the Search page, click Browse Search Groups. Click the group name for a hierarchy. You could also click the number next to the group name for a list of the pages crawled.
This section contains suggestions on how to improve the response time and throughput performance of Oracle SES.
This section contains the most common things to consider to improve search performance:
Suggested links let you direct users to a particular Web site for particular query keywords. For example, when users search for "Oracle Secure Enterprise Search documentation" or "Enterprise Search documentation" or "Search documentation", you could suggest http://www.oracle.com/technology
.
Suggested link keywords are rules that determine which suggested links are returned (as suggestions) for a query. The rules consist of query terms and logical operators. For example, "secure AND search". With this rule, the corresponding suggested link is returned for the query "secure enterprise search", but it is not returned for the query "secure database".
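To make the matching behavior concrete, here is a toy evaluator for rules that use only AND and OR on single words. It is a deliberately simplified sketch: the actual rule language supports more operators, and `rule_matches` is a hypothetical name, not an Oracle SES function.

```python
def rule_matches(rule: str, query: str) -> bool:
    query_terms = set(query.lower().split())
    # OR binds more loosely than AND here: the rule matches if any
    # OR-branch has all of its AND-terms present in the query.
    for branch in rule.lower().split(" or "):
        required = [term.strip() for term in branch.split(" and ")]
        if all(term in query_terms for term in required):
            return True
    return False


print(rule_matches("secure AND search", "secure enterprise search"))  # True
print(rule_matches("secure AND search", "secure database"))           # False
```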
The rule language used for the indexed queries supports the following operators:
Table 6-2 Suggested Link Keyword Operators
Operator | Example |
---|---|
AND | dog and cat |
OR | dog or cat |
PHRASE | dog sled |
ABOUT | about(dogs) |
NEAR | dog ; cat |
STEM | $dog |
WITHIN | dog within title |
THESAURUS | SYN(dog) |
Note:
Special characters (for example, '#', '$', '=', '&') should not be used in keywords.
Suggested links appear at the top of the search result list. This feature is especially useful to provide links to important Web pages that are not crawled by Oracle Secure Enterprise Search. Add or edit suggested links on the Search - Suggested Links page in the administration tool.
Optimizing the index reduces fragmentation, and it can significantly increase the speed of searches. Schedule index optimization on a regular basis. Also, optimize the index after the crawler has made substantial updates or if fragmentation is more than 50%. Make sure index optimization is scheduled during off-peak hours. Optimization of a very large index could take several hours.
See the fragmentation level and run index optimization on the Global Settings - Index Optimization page in the administration tool. You can specify a maximum number of hours for the optimization to run, but for best performance, select to run the optimization until it finishes. This creates a more compact copy of the index, and then it switches the original index and the copy (so it requires enough space to store both the copy and the original). When optimization is finished, the original index is dropped, and the space can be reused.
The data in the cache directory continues to accumulate until it reaches the indexing batch size. When the limit is reached, the data is indexed. The bigger the batch size, the longer it will take to index each batch. Only indexed data can be searched: data in the cache cannot be searched.
The default indexing batch size is 250M. Increasing the size up to the index memory size (275M by default) can reduce index fragmentation. However, increasing the size more than the index memory size will not reduce fragmentation. You can change the index memory size manually.
Set the indexing batch size on the Global Settings - Crawler Configuration page in the administration tool.
A large index memory setting (even hundreds of megabytes) improves the speed of indexing and reduces the fragmentation of the final indexes. However, there will be a point where it is set so high that memory paging occurs and impacts indexing speed.
Follow these steps to increase the index memory size:
Launch SQL*Plus and connect as the eqsys
user.
Run the following SQL statement to see the current indexing memory size:
SQL> SELECT par_value FROM ctx_parameters
  2  WHERE par_name = 'DEFAULT_INDEX_MEMORY';

PAR_VALUE
-----------
288358400
This is the default value for indexing memory size. The unit is bytes. (288358400 bytes = 275M bytes)
To change the default indexing memory size to 500M (524288000 bytes), run the following procedure:
SQL> begin
  2  ctxsys.ctx_adm.set_parameter('DEFAULT_INDEX_MEMORY','524288000');
  3  end;
  4  /

PL/SQL procedure successfully completed.

SQL> SELECT par_value FROM ctx_parameters
  2  WHERE par_name = 'DEFAULT_INDEX_MEMORY';

PAR_VALUE
-----------
524288000
You can specify up to 2G for DEFAULT_INDEX_MEMORY. To allocate more than 1G, you must also change MAX_INDEX_MEMORY: DEFAULT_INDEX_MEMORY cannot exceed MAX_INDEX_MEMORY, and the default value for MAX_INDEX_MEMORY is 1G. The maximum size for MAX_INDEX_MEMORY is 2,147,483,647 bytes.
SQL> begin
  2  ctxsys.ctx_adm.set_parameter('MAX_INDEX_MEMORY','2147483647');
  3  end;
  4  /

PL/SQL procedure successfully completed.

SQL> begin
  2  ctxsys.ctx_adm.set_parameter('DEFAULT_INDEX_MEMORY','2147483647');
  3  end;
  4  /

PL/SQL procedure successfully completed.
You can change the memory size any time. The next synchronized index uses this specified memory size.
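Because the parameters are set in bytes, it can help to compute and validate the value before running the procedure. The following sketch converts megabytes to the byte string and checks it against the limits stated above. The function name `index_memory_bytes` is hypothetical, and the default ceiling assumes that the 1G default for MAX_INDEX_MEMORY means 1073741824 bytes.

```python
MAX_ALLOWED_BYTES = 2_147_483_647  # upper limit for MAX_INDEX_MEMORY stated above
ASSUMED_DEFAULT_MAX = 1024 * 1024 * 1024  # assumption: the 1G default, in bytes


def index_memory_bytes(megabytes: int, max_index_memory: int = ASSUMED_DEFAULT_MAX) -> str:
    """Return the value to pass for DEFAULT_INDEX_MEMORY, as a string of bytes."""
    value = megabytes * 1024 * 1024
    if value > MAX_ALLOWED_BYTES:
        raise ValueError("value exceeds the 2,147,483,647 byte limit")
    if value > max_index_memory:
        raise ValueError("increase MAX_INDEX_MEMORY before setting this value")
    return str(value)


print(index_memory_bytes(275))  # 288358400
print(index_memory_bytes(500))  # 524288000
```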
Note:
The indexing batch size determines when the synchronized index is called. Even if DEFAULT_INDEX_MEMORY is large enough, Oracle SES does not use it if the indexing batch size is small. For example, if the indexing batch size is 10M, then the synchronized index uses memory up to 10M, even if you specify 1G for it.
See Also:
"Increasing the Indexing Batch Size"
See the Home - Statistics page in the administration tool for lists of the most popular queries, failed queries, and ineffective queries. This information can lead to the following actions:
Refer users to a particular Web site for failed queries on the Search - Suggested Links page.
Fix common errors that users make in searching on the Search - Alternate Words page.
Make important documents easier to find on the Search - Relevancy Boosting page.
Relevancy boosting lets administrators influence the order of documents in the result list for a particular search. You might want to override the default results for the following reasons:
For a highly popular search, direct users to the best results
For a search that returns no results, direct users to some results
For a search that has no click-throughs, direct users to better results
In a search, each result is assigned a score that indicates how relevant the result is to the search; that is, how good a result it is. Sometimes there are documents that you know are highly relevant to some search. For example, your company Web site could have a home page for XML (http://example.com/XML-is-great.htm), which you want to appear high in the results of any search for "XML". You would boost the score of that home page (http://example.com/XML-is-great.htm) to 100 for an "XML" search.
There are two methods for locating URLs for relevancy boosting: locate by search or manual URL entry.
Note:
The document still has a score computed if you enter a search that is not one of the boosted queries.
Relevancy boosting, like end user searching, is case-insensitive. For example, a document with a boosted score for "Oracle" is boosted when you enter "oracle".
If you expect heavy load on the Oracle SES server, then configure the Java virtual machine (JVM) heap size for better performance.
The heap size is defined in the $ORACLE_HOME/search/config/searchctl.conf
file. By default, the following values are given:
max_heap_size
= 1024 megabytes
min_heap_size
= 512 megabytes
Increase the value of these parameters appropriately. The max size should not exceed the physical memory size. Then restart the mid-tier with searchctl restart
.
Heavy query load should not coincide with heavy crawl activity, especially when there are large-scale changes on the target site. If it does, for example when the crawl needs to be scheduled around the clock, then increase the size of the Oracle undo tablespace with the UNDO_RETENTION parameter.
See Also:
Oracle Database SQL Reference and Oracle Administrator's Guide (available on Oracle Technology Network) for more information about increasing the Oracle undo space
Oracle Secure Enterprise Search provides a plug-in (or connector) to integrate with Google Desktop for Enterprise (GDfE). You can include Google Desktop results in your Oracle SES hitlist. You can also link to Oracle SES from the GDfE interface.
See Also:
Google Desktop for Enterprise Readme at http://<host>:<port>/search/query/gdfe/gdfe_readme.html for details about how to integrate with GDfE
In a production environment, where a load balancer or other monitoring tools are used to ensure system availability, Oracle Secure Enterprise Search (SES) can also be easily monitored through the following URL: http://<host>:<port>/monitor/check.jsp. The URL should return the following message: Oracle Secure Enterprise Search instance is up.
Note:
This message is not translated to other languages, because system monitoring tools might need to byte-compare this string.
If Oracle SES is not available, then the URL returns either a connection error or the HTTP status code 503.
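A monitoring script can poll the URL and compare the response with the expected message. The sketch below uses only Python's standard library. The function names are hypothetical, and treating HTTP status 200 as the success status is an assumption, since the text specifies only the message and the failure cases.

```python
import urllib.error
import urllib.request

EXPECTED_MESSAGE = "Oracle Secure Enterprise Search instance is up."


def is_up(status: int, body: str) -> bool:
    # Assumption: a healthy instance answers with HTTP 200 and the message.
    return status == 200 and EXPECTED_MESSAGE in body


def check_instance(base_url: str, timeout: float = 5.0) -> bool:
    """Poll <base_url>/monitor/check.jsp and report whether the instance is up."""
    url = base_url.rstrip("/") + "/monitor/check.jsp"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return is_up(response.status, body)
    except (urllib.error.URLError, OSError):
        # Covers connection errors and HTTP error statuses such as 503.
        return False
```

For example, `check_instance("http://<host>:<port>")` with the host and port of your installation returns True only when the expected message comes back.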
Debug mode is useful for troubleshooting purposes. To turn on debug mode for the Oracle SES administration tool, update the search.properties file located in the $ORACLE_HOME/search/webapp/config directory. Set debug=true and restart the Oracle SES middle tier with searchctl restart.
To turn off debug mode when you are finished troubleshooting, set debug=false and restart the middle tier with searchctl restart.
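If the setting is toggled often, the edit to search.properties can be scripted. The following sketch rewrites the debug line in the text of a properties file and leaves every other line untouched. The function name `set_debug` and the other keys in the sample are hypothetical, and the middle tier still has to be restarted with searchctl restart afterward.

```python
def set_debug(properties_text: str, enabled: bool) -> str:
    value = "true" if enabled else "false"
    lines = properties_text.splitlines()
    found = False
    for index, line in enumerate(lines):
        # Match only an uncommented "debug=..." entry.
        key = line.split("=", 1)[0].strip()
        if key == "debug":
            lines[index] = f"debug={value}"
            found = True
    if not found:
        lines.append(f"debug={value}")
    return "\n".join(lines) + "\n"


sample = "some.other.key=1\ndebug=false\n"  # hypothetical file content
print(set_debug(sample, True), end="")
# some.other.key=1
# debug=true
```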
Note:
$ORACLE_HOME
represents the directory where Oracle SES was installed.
Debug information can be found in the OC4J log file: $ORACLE_HOME/oc4j/j2ee/OC4J_SEARCH/log/oc4j.log
.
The Oracle Enterprise Manager 10g Application Server Control Console is a Web-based user interface that displays the current status of the Oracle SES middle tier. For example, the Home page shows a graph of the Response and Load, and the Performance page shows a graph of the Heap Usage.
The Application Server Control Console is installed and configured automatically with OC4J. Because the Oracle SES middle tier runs in the embedded standalone OC4J, the Application Server Control Console is started by default when Oracle SES is started.
To access the console, type the following URL in a Web browser:
http://<host>:<port>/em
where host
and port
are the host name and port running Oracle SES.
Log in as the oc4jadmin
user with your Oracle SES administrator password.
See Also:
Oracle Containers for J2EE Configuration and Administration Guide 10g (10.1.3.1.0)
the online help provided with Application Server Control Console for detailed instructions on using this interface
The tool for starting and stopping the search engine is searchctl
. To restart Oracle SES (for example, after rebooting the host computer), navigate to the bin
directory and run searchctl startall
.
Note:
Users are prompted for a password when running searchctl commands on UNIX platforms. No password is required on Windows platforms. This is because Oracle SES installation on Windows requires a user with administrator privileges. When running commands to start or stop the search engine, no password is required as long as the user is a member of the administrator group.
See Also:
Startup / Shutdown lesson in the Oracle SES administration tutorial: http://st-curriculum.oracle.com/tutorial/SESAdminTutorial/index.htm