Oracle® Secure Enterprise Search Administrator's Guide
10g Release 1 (10.1.8.1)

Part Number B32514-01

6 Oracle Secure Enterprise Search Advanced Information

This chapter contains the following topics:

Adding Suggested Content

Suggested content lets you display real-time data content in the result list of the default query application. Oracle SES retrieves data from content providers and applies a style sheet to the data to generate an HTML fragment. The HTML fragment is displayed in the result list and is available through the Web Services API. For example, when an end user searches for contact information on a coworker, Oracle SES can fetch the content from the suggested content provider and return the contact information (e-mail address, phone number, and so on) for that person in the result list. Suggested content results appear under any suggested links and above the query results.

Configure suggested content on the Search - Suggested Content page in the administration tool. Enter the maximum number of suggested content results (up to 20) to be included in the Oracle SES result list. The results are rendered on a first-come, first-served basis.

Regular expressions (as supported in the Java regular expression API java.util.regex) are used to define query patterns for suggested content providers. The regular expression-based pattern matching is case-sensitive. For example, a provider with the pattern dir\s(\S+) is triggered on the query dir james but not on the query Dir James. To trigger on the query Dir James, the pattern could be defined either as [Dd][Ii][Rr]\s+(\S+) or as (?i)dir\s+(\S+). A provider with a blank query pattern is triggered on all queries.
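The trigger behavior described above can be checked directly with java.util.regex. The following sketch uses the patterns and the sample queries (dir james, Dir James) from the example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TriggerDemo {
    public static void main(String[] args) {
        // Case-sensitive pattern: triggers on "dir james" but not "Dir James".
        Pattern strict = Pattern.compile("dir\\s(\\S+)");
        System.out.println(strict.matcher("dir james").matches()); // true
        System.out.println(strict.matcher("Dir James").matches()); // false

        // Case-insensitive variant using the (?i) embedded flag.
        Pattern relaxed = Pattern.compile("(?i)dir\\s+(\\S+)");
        Matcher m = relaxed.matcher("Dir James");
        if (m.matches()) {
            // group(1) captures the term passed to the provider (e.g. via $ora:q1).
            System.out.println(m.group(1)); // James
        }
    }
}
```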

The URL you enter for the suggested content provider can contain the following variables: $ora:q, $ora:lang, $ora:q1, ... $ora:qn and $ora:username.

Enter an XSLT style sheet that defines rules (for example, the size and style) for transforming XML content from a provider into an HTML fragment. This HTML fragment is displayed in the result list or returned over the Web Services API. If you do not enter an XSLT style sheet, then Oracle SES assumes that the suggested content provider returns HTML. If you do not enter an XSLT style sheet and the provider returns XML, then the result list displays the plain XML.
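For illustration, this transformation step can be reproduced with standard JAXP (javax.xml.transform). The provider XML and the style sheet below are hypothetical stand-ins, not an actual Oracle SES provider payload:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltDemo {
    // Hypothetical provider response and style sheet (illustrative names only).
    static final String XML =
        "<contact><name>James</name><email>james@example.com</email></contact>";
    static final String XSL =
        "<xsl:stylesheet version=\"1.0\" "
      + "xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"html\" omit-xml-declaration=\"yes\"/>"
      + "<xsl:template match=\"/contact\">"
      + "<b><xsl:value-of select=\"name\"/></b>: <xsl:value-of select=\"email\"/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Apply the style sheet to the provider XML, producing an HTML fragment.
    static String render() throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
        StringWriter html = new StringWriter();
        t.transform(new StreamSource(new StringReader(XML)), new StreamResult(html));
        return html.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render());
    }
}
```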

Note:

It is the administrator's responsibility to ensure that suggested content providers return valid and safe content. Corrupted or incomplete content returned by a suggested content provider can affect the formatting of the default query application results page.

There are three security options for how Oracle SES passes the end user's authentication information to the suggested content provider:

Example: Configuring Google OneBox for Suggested Content

Existing OneBox providers can be configured for use as Oracle SES Suggested Content providers. For example, for a Google OneBox provider, the provider URL might be http://host.company.com/apps/directory.jsp and the trigger might be dir\s(\S+). When the user query is dir james, the provider receives the request with a query string similar to the following: apiMaj=10&apiMin=1&oneboxName=app&query=james.

With a Suggested Content provider, set the URL template as http://host.company.com/apps/directory.jsp?apiMaj=10&apiMin=1&oneboxName=app&query=$ora:q1. The provider pattern is the same: dir\s(\S+). The XSLT used for Google OneBox can be re-used with a minor change. Look for the line:

<xsl:template name="apps">

and change that line in your template to

<xsl:template match="/OneBoxResults">

Using Backup and Recovery

A backup is a copy of configuration data that can be used to recover your configuration settings after a hardware failure. When a backup is performed, Oracle SES copies the data to the binary metaData.bkp file. The location of that file is provided on the Global Settings - Configuration Data Backup and Recovery page. When the backup successfully completes, you must copy this file to a different host. You should back up after making configuration data changes, such as creating or editing sources.

To recover the configuration, reinstall Oracle SES. When the installation completes, copy the saved metaData.bkp file to the location provided in the administration tool. Sources must be crawled again before search results are available.

Some notes about backup and recovery:

Understanding Attributes

Each source has its own set of document attributes. Document attributes, like metadata, describe the properties of a document. The crawler retrieves values and maps them to one of the search attributes. This mapping lets users search documents based on their attributes. After you crawl a source, you can see the attributes for that source. Document attribute information is obtained differently depending on the source type. This section lists the attributes for each Oracle SES source type.

See Also:

"Overview of Attributes" for conceptual information about document and search attributes in Oracle SES

For table and database source types, there are no predefined attributes. The crawler collects attributes from columns defined during source creation. The Oracle SES administrator must map the column to the search attributes.

For Oracle E-Business Suite and Siebel source types, attributes are specified by the user. Attributes for Oracle E-Business Suite 11i and Siebel 7.8 sources are specified in the query while creating the source. Attributes for Oracle E-Business Suite 12 and Siebel 8 sources are specified in the RSS data feed. (That is, you can specify attributes in the RSS data feed yourself).

For many source types (such as OracleAS Portal, e-mail, NTFS, and Microsoft Exchange sources), the crawler picks up key attributes offered by the target systems. These are listed in the following sections.

Note:

For all other sources, such as Documentum eRoom or Lotus Notes, there is an Attribute list parameter in the Home - Sources - Customize User-Defined Source page. Any attributes entered by users are collected by the crawler and available for search.

Web Source Attributes

  • Title

  • Author

  • Description

  • Host

  • Keywords

  • Language

  • LastModifiedDate

  • Mimetype

  • Subject: This is mapped to "Description". If there is no description metatag in the HTML file, then it is ignored.

  • Headline1: The highest H tag text; for example, "Annual Report" from <H2>Annual Report</H2> when there is no H1 tag in the page.

  • Headline2: The second highest H tag text

  • Reference Text: The anchor text from another Web page that points to this page.

Additional HTML metatags can be defined to map to a String attribute on the Home - Sources - Metatag Mapping page.

File Source Attributes

  • Title

  • Author

  • Description

  • Host

  • Keywords

  • Language

  • LastModifiedDate

  • Mimetype

  • Subject

E-mail Attributes

  • author

  • title

  • subject

  • language

  • lastmodifieddate

OracleAS Portal Source Attributes

Table 6-1 OracleAS Portal Source Attributes

  • createdate: Date the document was created

  • creator: User name of the person who created the document

  • author: User-editable field for specifying a full name or any other author designation

  • page_path: Hierarchy path of the item or page in the portal tree

  • title: Title of the document

  • description: Brief description of the document

  • keywords: Keywords of the document

  • expiredate: Expiration date of the document

  • host: Portal host

  • infosource: Path of the Portal page in the browse hierarchy

  • language: Language of the portal page or item

  • lastmodifieddate: Last modified date of the document

  • mimetype: Usually 'text/html' for portal

  • perspectives: User-created markers that can be applied to pages or items, such as 'INTERNAL ONLY', 'REVIEWED', or 'DESIGN SPEC'. For example, a Portal containing recipes could have items representing recipes with perspectives such as 'Breakfast', 'Tea', 'Contains Nuts', and 'Healthy', and one particular item could have several perspectives assigned to it.

  • wwsbr_name_: Internal name of the portal page or item

  • wwsbr_charset_: Character set of the portal page or item

  • wwsbr_category_: Category of the portal page or item

  • wwsbr_updatedate_: Date the portal page or item was last updated

  • wwsbr_updator_: Person who last updated the page or item

  • wwsbr_subtype_: Subtype of the portal page or item (for example, container)

  • wwsbr_itemtype_: Portal item type

  • wwsbr_mime_type_: Mimetype of the portal page or item

  • wwsbr_publishdate_: Date the portal page or item was published

  • wwsbr_version_number_: Version number of the portal item


Microsoft Exchange Source Attributes

  • ReceivedTime

  • From

  • To

  • CC

  • BCC

  • Subject

NTFS Source Attributes

  • Title

  • Subject

  • Author

  • Category

  • Comments

  • Description

  • FileDate: Last modified date

Oracle Calendar Attributes

  • Description

  • Priority

  • Status

  • start date

  • end date

  • event Type

  • Author

  • Created Date

  • Title

  • Location

  • Dial_info

  • ConferenceID

  • ConferenceKey

  • Duration

Oracle Content Database Source Attributes

  • AUTHOR

  • CREATE_DATE

  • DESCRIPTION

  • FILE_NAME

  • LASTMODIFIEDDATE

  • LAST_MODIFIED_BY

  • TITLE

  • ACL_CHECKSUM: The checksum calculated over the ACL submitted for the document.

  • DOCUMENT_LANGUAGE: Oracle SES language code taken from the Oracle Content Database language string. For example, if Oracle Content Database uses "American", then Oracle SES submits it as "en-us".

  • DOCUMENT_CHARACTER_SET: The character set for the Oracle Content Database document.

  • MIMETYPE

Oracle SES can also search categories and customized attributes created by the user in Oracle Content Database.

You can apply categories to files and links. Categories can be divided into subcategories and can have one or more attributes. When a document in Oracle Content Database is attached to a category, you can search on the attributes of that category. (The attributes appear in the list of search attributes.)

For example, suppose you create a category named testCategory with attributes testAttr1 and testAttr2. Document X is created and assigned testCategory, and values are assigned to its attributes. After crawling, testAttr1 and testAttr2 appear in the search attribute list.

Customized attribute values can be the following types: String, Integer, Long, Double, Boolean, Date, User, Enumerated String, Enumerated Integer, and Enumerated Long.

Long, Double, Integer, Enumerated Integer, and Enumerated Long customized attributes are indexed as Number attributes in Oracle SES (display name with an "_N" suffix).

Date customized attributes are indexed as Date attributes in Oracle SES (suffix "_D").

String, Enumerated String, and User customized attributes are indexed as String attributes in Oracle SES.

Limitations:

  • The Oracle Content Database SDK has more features than the Oracle Content Database Web GUI. The Web GUI does not support the String Array, but the SDK does. If you use the SDK to build a customized admin and user GUI to support the String array type, then a customized attribute could have more than one attribute value.

  • If a document in Oracle Content Database is attached to a category and the attributes in that category are left blank, then when a user searches in Oracle SES (using Advanced Search), the attribute is not available in the dropdown list.

    For example, create testCategory with three attributes. A document is created and assigned this category, and values are assigned to its attributes. As a test, assign one attribute the value "test" and leave the others blank. After crawling, the attribute that was assigned the value "test" appears in the search attribute list, but the attributes left blank do not appear in the dropdown list. If an attribute has a null value, the crawler skips it; but if another document has the same attribute with a value, then that attribute is indexed.

Troubleshooting Sources

This section contains the following topics:

Tips for Using Table and Database Sources

Table source types and database source types are similar, in that they both crawl database tables.

This section contains the following topics:

Understanding Table Sources Versus Database Sources

This section describes the benefits and limitations of both table source types and database source types.

Note:

For performance reasons, both source types require that the KEY column be backed by an index.

Table Source Benefits
  • A table source does not need to contain a specific set of columns.

  • A table source automatically creates a display URL target. You do not need to arrange for the content to be displayed by some other mechanism.

  • A table source does not require JDBC connection syntax.

Table Source Limitations
  • To crawl non-Oracle databases as a table source, you must create a view in an Oracle database on the non-Oracle table. Then create the table source on the Oracle view. Oracle SES accesses the database using database links.

  • Only one table or view can be specified for each table source. If data from more than one table or view is required, then first create a single view that encompasses all required data.

  • Oracle SES cannot crawl tables inside the Oracle SES database.

  • Table column mappings cannot be applied to LOB columns.

  • The following data types are supported for table sources: BLOB, BFILE, CLOB, CHAR, VARCHAR, VARCHAR2.

Database Source Benefits
  • Database sources provide additional flexibility. A database source type is built on JDBC, so you can crawl any JDBC-enabled database.

    • A database source supports any SQL query with join conditions without creating a view. In some databases, creating objects may not be feasible.

    • A database source supports crawling content pointed to by a URL stored in the ATTACHMENT_LINK column.

    • A database source supports Info source path hierarchy and MIMETYPEs.

  • Database sources provide additional security. A database source provides security at the row level, and it offers a third security option, ACLs Provided by Source, that is not available for table sources.

Database Source Limitations
  • The base table or view cannot have text columns of type BFILE or RAW.

  • The value of the required URL column cannot be null.

Crawling Tables with Quoted Identifiers

Database object names may be represented with a quoted identifier. A quoted identifier is case-sensitive and begins and ends with double quotation marks ("). If the database object is represented with a quoted identifier, then you must use the double quotation marks and the same case whenever you refer to that object.

When creating a table source in Oracle SES, if the table name is a quoted identifier, such as "1 (Table)", then in the Table Name field enter "1 (Table)", with the same case and double quotation marks. Similarly, if a primary key column or content column is named using a quoted identifier, then enter that name exactly as it appears in the database with double quotation marks.

See Also:

Oracle Database SQL Reference (available on Oracle Technology Network) for more information about schema object names and qualifiers

Tips for Using File Sources

This section contains the following topics:

Crawling File Sources with Non-ASCII

For Oracle SES to crawl and display file sources in multibyte environments successfully, the locale of the computer that starts the Oracle SES server must be the same as that of the target file system. This way, the Oracle SES crawler can "see" the multibyte files and paths.

If the locale is different in the installation environment, then Oracle SES should be restarted from the environment with the correct locale. For example, for a Korean environment, either set LC_ALL to ko_KR or set both LC_LANG and LANG to ko_KR.KSC5601. Then run searchctl restartall from either a command prompt on Windows or an xterm on UNIX.

Crawling File Sources with Symbolic Links

When crawling file sources on UNIX, the crawler will resolve any symbolic link to its true directory path and enforce the boundary rule on it. For example, suppose directory /tmp/A has two children, B and C, where C is a link to /tmp2/beta. The crawl will have the following URLs:

  • /tmp/A

  • /tmp/A/B

  • /tmp2/beta

  • /tmp/A/C

If the inclusion rule is /tmp/A, then /tmp2/beta will be excluded. The seed URL is treated as is.

Crawling File URLs

If a file URL is to be used "as is", without going through Oracle SES to retrieve the file, then "file" in the URL should be upper case: FILE. For example, FILE://localhost/... "As is" means that when a user clicks the search link of the document, the browser uses the specified file URL on the client computer to retrieve the file. Otherwise, Oracle SES uses the file URL on the server computer and sends the document through HTTP to the client computer.

Tips for Using Mailing List Sources

  • The Oracle SES crawler is IMAP4 compliant. To crawl mailing list sources, you need an IMAP e-mail account. It is recommended to create an e-mail account that is used solely for Oracle SES to crawl mailing list messages. The crawler is configured to crawl one IMAP account for all mailing list sources. Therefore, all mailing list messages to be crawled must be found in the Inbox of the e-mail account specified on this page. This e-mail account should be subscribed to all the mailing lists. New postings for all the mailing lists will be sent to this single account and subsequently crawled.

  • Messages deleted from the global mailing list e-mail account are not removed from the Oracle SES index. In fact, the mailing list crawler itself will delete messages from the IMAP e-mail account as it crawls. The next time the IMAP account for mailing lists is crawled, the previous messages will no longer be there. Any new messages in the account will be added to the index (and also consequently deleted from the account). This keeps the global mailing list IMAP account clean. The Oracle SES index serves as a complete archive of all the mailing list messages.

Tips for Using OracleAS Portal Sources

  • An OracleAS Portal source name cannot exceed 35 characters.

  • URL boundary rules are not enforced for URL items. A URL item is the metadata that resides on the OracleAS Portal server. Oracle SES does not touch the display URL or the boundary rules for URL items.

  • If OracleAS Portal user privileges change, it is possible that content the crawler collects is not properly authorized. For example, in a Portal crawl, the user specified in the Home - Sources - Authentication page does not have privileges to see certain Portal pages. However, after privileges are granted to the user, on subsequent incremental crawls, the content still is not picked up by the crawler. Similarly, if privileges are revoked from the user, it is possible that content still is picked up by the crawler.

    To be certain that Oracle SES has the correct set of documents, whenever a user's privileges change, update the crawler re-crawl policy to Process All Documents on the Home - Schedules - Edit Schedules page, and restart the crawl.

Tips for Using User-Defined Sources

  • If a plug-in is to return file URLs to the crawler, then the file URLs must be fully qualified. For example, file://localhost/.

  • If a file URL is to be used "as is" without going through Oracle SES for retrieving the file, then "file" in the URL should be upper case "FILE". For example, FILE://localhost/...

Tips for Using Federated Sources

  • The Oracle SES federator caches the federator configuration (that is, all federation-related parameters, including federated sources). As a result, a change in the configuration can take up to five minutes to take effect.

  • If you entered proxy settings on the Global Settings - Proxy Settings page, then make sure to add the Web Services URL for the federated source as a proxy exception.

  • If the federation endpoint instance is set to secure mode 3 (require login to search secure and public content), then all documents (ACL stamped or not) are secure. For secure federated search, create a trusted entity in the federation endpoint instance, then edit the federated source with the trusted entity user name and password.

  • There can be consistency issues if you have configured a BIG-IP system as follows:

    • You have two Oracle SES instances configured identically (same crawls, same sources, and so on) behind a BIG-IP load balancer to act as a single logical Oracle SES instance.

    • You have two other Oracle SES instances configured identically along with Oracle HTTP Server and OracleAS Web Cache fronting each one and both servers behind BIG-IP. Each of these two instances federate to the logical Oracle SES instance. Web Cache is clustered between these two nodes to act as a single logical Oracle SES instance called broker instance.

    When a user performs a search on the broker Oracle SES instance and tries to access the documents in the result, document access may not be consistent each time. As a workaround, make sure that the load balancer sends all the requests in one user session to the exact same node each time.

Federated Search Characteristics

  • Federated search can improve performance by distributing query processing on multiple computers. It can be an efficient way to scale up search service by adding a cluster of Oracle SES instances.

  • The federated search performance depends on the network topology and throughput of the entire federated Oracle SES environment.

Federated Search Limitations

  • There is a 200KB size limit on cached documents from the federation endpoint that are displayed on the Oracle SES federation broker instance.

  • For infosource browse, if the source hierarchies for both local and federated sources under one source group start with the same top level folder, then a sequence number is added to the folder name belonging to the federated source to distinguish the two hierarchies on the Browse page.

  • For federated infosource browse, a federated source should be put under an explicitly created source group.

  • On the Oracle SES federation broker, there is no direct access to documents on the federation endpoint through the display URL in the search result list. Only the cached version of documents is accessible. Exception: There is direct access for Web source and OracleAS Portal source documents.

Tuning Crawl Performance

Your Web crawling strategy can be as simple as identifying a few well-known sites that are likely to contain links to most of the other intranet sites in your organization. You could test this by crawling these sites without indexing them. After the initial crawl, you have a good idea of the hosts that exist in your intranet. You could then define separate Web sources to facilitate crawling and indexing on individual sites.

However, the process of discovering and crawling your organization's intranet, or the Internet, is generally an interactive one characterized by periodic analysis of crawling results and modification to crawling parameters. For example, if you observe that the crawler is spending days crawling one Web host, then you might want to exclude crawling at that host or limit the crawling depth.

This section contains the most common things to consider to improve crawl performance:

See Also:

"Monitoring the Crawling Process" for more information on crawling parameters

Understanding the Crawler Schedule

Schedules define the frequency at which the Oracle SES index is updated with information about each source. This section describes characteristics of the Oracle SES crawler schedule.

  • The Failed Schedules section on the Home - General page lists all schedules that have failed. Generally, a failed schedule is one in which the crawler did not collect any documents. A failed schedule also could be the result of a partial collection and indexing of documents.

  • The smallest granularity of the schedule interval is one hour. For example, you cannot have a schedule started at 1:30am.

  • If a crawl takes longer to finish than the scheduled interval, then the next crawl starts as soon as the current crawl is done. Currently, there is no option to automatically push the start back to the next scheduled time.

  • When multiple sources are assigned to one schedule, the sources are crawled one by one following the order of their assignment in the schedule.

  • If a crawl fails, the schedule does not restart. You must resolve the cause of the crawl failure and resume the schedule. The rest of the pending sources are not crawled. Currently, there is no distinction between a failure that can be automatically retried versus a failure that must be fixed by the administrator.

  • There is no automatic e-mail notification of schedule success or failure.

Registering a Proxy

By default, Oracle SES is configured to crawl Web sites in the intranet. In other words, crawling internal Web sites requires no additional configuration. However, to crawl Web sites on the Internet (also referred to as external Web sites), Oracle SES needs the HTTP proxy server information. See the Global Settings - Proxy Settings page. If the proxy requires authentication, then enter the proxy authentication information on the Global Settings - Authentication page.

Checking Boundary Rules

The seed URL you enter when you create a source is turned into an inclusion rule. For example, if www.example.com is the seed URL, then Oracle SES creates an inclusion rule specifying that only URLs containing the string www.example.com are crawled.

However, suppose that the example Web site includes URLs starting with www.exa-mple.com or with example.com (without the www). Many pages have a prefix on the site name. For example, the investor section of the site has URLs that start with investor.example.com. In this case, you might add www.example.com, www.exa-mple.com, and investor.example.com to the inclusion rules. Or you might just add example.

Always check the inclusion rules before crawling, then check the log after crawling to see what patterns have been excluded.

To crawl outside the seed site (for example, if you are crawling text.us.oracle.com, but you want to follow links outside of text.us.oracle.com to oracle.com), consider removing the inclusion rules altogether. Do so carefully. This could lead the crawler into many, many sites.

Notes for File Sources

  1. For file sources, if no boundary rule is specified, then crawling is limited by the underlying file system access privileges. Files accessible from the specified seed file URL are crawled, subject to the default crawling depth. The depth, which is 2 by default, is set on the Global Settings - Crawler Configuration page. For example, if the seed is file://localhost/home/user_a/, then the crawl picks up all files and directories under user_a to which it has access privileges. It crawls documents in the directory /home/user_a/level1, but not documents in the /home/user_a/level1/level2 directory, which are at level 3 and exceed the depth limit.

  2. The file URL can be of UNC (universal naming convention) format. The UNC file URL has the following format: file://localhost///<LocalMachineName>/<SharedFolderName>.

    For example, \\stcisfcr\docs\spec.htm should be specified as file://localhost///stcisfcr/docs/spec.htm.

  3. On some computers, the path or file name could contain non-ASCII and multibyte characters. URLs are always represented using the ASCII character set. Non-ASCII characters are represented using the hex representation of their UTF-8 encoding. For example, a space is encoded as %20, and a multibyte character can be encoded as %E3%81%82.

    For file sources, spaces can be entered in simple (not regular expression) boundary rules. Oracle SES automatically encodes these URL boundary rules. If (Home Alone) is specified, then internally it is stored as (Home%20Alone). Oracle SES does this encoding for the following:

    • File source simple boundary rules

    • Test URL strings

    • File source seed URLs

Note:

Oracle SES does not alter a rule that is a regular expression rule. It is the administrator's responsibility to make sure that the regular expression rule is specified against the encoded file URL. Spaces are not allowed in regular expression rules.
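The encoding rule described above (a space becomes %20, and each byte of a multibyte character's UTF-8 encoding becomes %XX) can be sketched as follows. This is an approximation for illustration, not the exact routine Oracle SES uses internally:

```java
import java.nio.charset.StandardCharsets;

public class FileUrlEncoder {
    // Percent-encode spaces, '%', and all non-ASCII bytes of the UTF-8
    // encoding; printable ASCII passes through unchanged.
    static String encode(String s) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            if (c > 0x20 && c < 0x7F && c != '%') {
                out.append((char) c);
            } else {
                out.append(String.format("%%%02X", c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("Home Alone")); // Home%20Alone
        System.out.println(encode("\u3042"));     // %E3%81%82 (a multibyte character)
    }
}
```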

Checking Dynamic Pages

Indexing dynamic pages can generate an excessive number of URLs. From the target Web site, manually navigate through a few pages to understand what boundary rules should be set to avoid crawling duplicate pages.

Checking Crawler Depth

Setting the crawler depth very high (or unlimited) could lead the crawler into many sites. Without boundary rules, a depth of 20 will probably crawl the entire World Wide Web from most locations.

Checking Robots.txt Rule

You can control which parts of your sites can be visited by robots. If robots exclusion is enabled (default), then the Web crawler traverses the pages based on the access policy specified in the Web server robots.txt file.

The following sample /robots.txt file specifies that no robots should visit any URL starting with /cyberworld/map/ or /tmp/ or /foo.html:

# robots.txt for http://www.example.com/
 
User-agent: *
Disallow: /cyberworld/map/ 
Disallow: /tmp/ 
Disallow: /foo.html

If the Web site is under the user's control, then a specific robots rule can be tailored for the crawler by specifying the Oracle SES crawler plug-in name "User-agent: Oracle Secure Enterprise Search." For example:

User-agent: Oracle Secure Enterprise Search
 
Disallow: /tmp/

The robots meta tag can instruct the crawler to either index a Web page or follow the links within it. For example:

<meta name="robots" content="noindex,nofollow">

Checking Duplicate Documents

Oracle SES always removes duplicate (identical) documents. If Oracle SES determines that a page is a duplicate of one it has seen before, then it does not index it. If the page is reached through a URL that Oracle SES has already processed, then it does not index that either.

With the Web Services API, you can enable or disable near duplicate detection and removal from the result list. Near duplicate documents are similar to each other. They may or may not be identical to each other.
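Identical-document removal of the kind described above is commonly implemented by fingerprinting document content. The sketch below illustrates the general technique with a content hash; it is an illustration, not Oracle SES's internal algorithm:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

public class DupFilter {
    private final Map<String, String> seen = new HashMap<>(); // digest -> first URL

    // Returns true if identical content was already seen under another URL.
    boolean isDuplicate(String url, String content) throws Exception {
        byte[] d = MessageDigest.getInstance("SHA-256")
                .digest(content.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : d) hex.append(String.format("%02x", b));
        // putIfAbsent returns the previous mapping, so non-null means duplicate.
        return seen.putIfAbsent(hex.toString(), url) != null;
    }

    public static void main(String[] args) throws Exception {
        DupFilter f = new DupFilter();
        System.out.println(f.isDuplicate("http://a.example/page", "same body")); // false
        System.out.println(f.isDuplicate("http://b.example/copy", "same body")); // true
    }
}
```

Near-duplicate detection, by contrast, requires similarity measures rather than exact hashing, which is why it is exposed as a separate, switchable feature.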

Checking Redirected Pages

When a page redirects, the crawler crawls only the redirected page. For example, a Web site might have JavaScript redirecting users to another site with the same title. Only the redirected site is indexed.

Whether a redirected URL is checked against the inclusion rules depends on the type of redirect. There are three kinds of redirects defined in the EQ$URL table:

  • Temporary Redirect: A redirected URL is always allowed if it is a temporary redirection (HTTP status code 302, 307). Temporary redirection means that, for whatever reason, the original URL should still be used in the future. Temporary redirects cannot be identified from the EQ$URL table other than by filtering out the other types from the log file.

  • Permanent Redirect: For permanent redirection (HTTP status 301), the redirected URL is subject to boundary rules. Permanent redirection means the original URL is no longer valid and the user should start using the new (redirected) one. In EQ$URL, an HTTP permanent redirect has the status code 954.

  • Meta Redirect: Metatag redirection is treated as a permanent redirect. Meta redirect has status code 954. This is always checked against boundary rules.

Checking URL Looping

URL looping refers to the scenario where a large number of unique URLs all point to the same document. One particularly difficult situation is where a site contains a large number of pages, and each page contains links to every other page in the site. Ordinarily this would not be a problem, because the crawler eventually analyzes all documents in the site.

However, some Web servers attach parameters to generated URLs to track information across requests. Such Web servers might generate a large number of unique URLs that all point to the same document.

For example, http://example.com/somedocument.html?p_origin_page=10 might refer to the same document as http://example.com/somedocument.html?p_origin_page=13 but the p_origin_page parameter is different for each link, because the referring pages are different. If a large number of parameters are specified and if the number of referring links is large, then a single unique document could have thousands or tens of thousands of links referring to it. This is an example of how URL looping can occur.
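One way to spot this pattern in a list of crawled URLs is to strip query strings and count how many distinct URLs collapse to the same base document. This is a hypothetical post-processing sketch, not an Oracle SES feature; the p_origin_page parameter is taken from the example above:

```python
# Hypothetical log-analysis sketch (not an Oracle SES feature): strip
# query parameters so that URLs differing only in tracking parameters
# such as p_origin_page collapse to one base document, then count how
# many variants point at each base.

from collections import Counter
from urllib.parse import urlsplit

def looping_candidates(urls, threshold=2):
    """Return base URLs reached through more than `threshold` variants."""
    bases = Counter(urlsplit(u)._replace(query="").geturl() for u in urls)
    return {base: n for base, n in bases.items() if n > threshold}
```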

Monitor the crawler statistics in the Oracle SES administration tool to determine which URLs and Web servers are being crawled the most. If you observe an inordinately large number of URL accesses to a particular site or URL, then you might want to do one of the following:

  • Exclude the Web Server: This prevents the crawler from crawling any URLs at that host. (You cannot limit the exclusion to a specific port on a host.)

  • Reduce the Crawling Depth: This limits the number of levels of referred links the crawler follows. If you observe URL looping effects on a particular host, then take a visual survey of the site to estimate the depth of its leaf pages. Leaf pages are pages that do not link to any other pages. As a general guideline, add three to the leaf page depth and set the crawling depth to that value.

Be sure to restart the crawler after altering any parameters. Your changes take effect only after restarting the crawler.
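The depth guideline above amounts to simple arithmetic, sketched here for clarity:

```python
# Sketch of the guideline above: set the crawling depth to the
# estimated depth of the site's leaf pages plus three.

def recommended_crawl_depth(leaf_page_depth):
    """leaf_page_depth: estimated depth of pages with no outgoing links."""
    if leaf_page_depth < 0:
        raise ValueError("depth cannot be negative")
    return leaf_page_depth + 3
```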

Increasing the Oracle Redo Log File Size

Oracle SES allocates 10M for the redo log during installation. If your disk has sufficient space to increase the redo log and if you are going to crawl a very large corpus (for example, more than 30G), then increase the redo log file size for better crawl performance.

Note:

The biggest transaction during crawling is SYNC INDEX by Oracle Text. Check the AWR report or the v$sysstat view to see the actual redo size during crawling. Roughly, 200M is sufficient to crawl up to 50G.
  1. Launch SQL*Plus and connect as the SYSTEM user. (The password is the same as the EQSYS password.)

  2. Run the following SQL statement to see the current redo log status:

    SQL> SELECT vl.group#, member, bytes,  vl.status 
      2  FROM v$log vl, v$logfile vlf 
      3  WHERE vl.group#=vlf.group#; 
     
    GROUP# MEMBER                                                  BYTES STATUS 
    ------ -------------------------------------------------- ---------- ---------- 
         3 /scratch/ses10181/oradata/o10181/redo03.log          10485760 INACTIVE 
         2 /scratch/ses10181/oradata/o10181/redo02.log          10485760 CURRENT 
         1 /scratch/ses10181/oradata/o10181/redo01.log          10485760 INACTIVE 
    
    
  3. Drop the INACTIVE redo log file. For example, to drop group 3:

    SQL> ALTER DATABASE DROP LOGFILE group 3; 
     
    Database altered. 
    
    
  4. The redo log file is dropped from the database, but the file itself still exists on the file system. Remove it manually with the file deletion command:

    % rm  /scratch/ses10181/oradata/o10181/redo03.log 
    
    
  5. Create a larger redo log file. If you want to change the file location, specify the new location.

    SQL> alter database add logfile '/scratch/ses10181/oradata/o10181/redo03.log' 
      2  size 200M; 
     
    Database altered. 
    
    
  6. Check the status to make sure the file was created.

    SQL> SELECT vl.group#, member, bytes, vl.status 
      2  FROM v$log vl, v$logfile vlf 
      3  WHERE vl.group#=vlf.group#; 
     
    GROUP# MEMBER                                                  BYTES STATUS 
    ------ -------------------------------------------------- ---------- ---------- 
         3 /scratch/ses10181/oradata/o10181/redo03.log         209715200 UNUSED 
         2 /scratch/ses10181/oradata/o10181/redo02.log          10485760 CURRENT 
         1 /scratch/ses10181/oradata/o10181/redo01.log          10485760 INACTIVE 
    
    
  7. To drop a log file with CURRENT status, run the following SQL statement:

    SQL> ALTER SYSTEM SWITCH LOGFILE; 
     
    System altered. 
     
    SQL> SELECT vl.group#, member, bytes,  vl.status 
      2  FROM v$log vl, v$logfile vlf 
      3  WHERE vl.group#=vlf.group#; 
     
    GROUP# MEMBER                                                  BYTES STATUS 
    ------ -------------------------------------------------- ---------- ---------- 
         3 /scratch/ses10181/oradata/o10181/redo03.log         209715200 CURRENT 
         2 /scratch/ses10181/oradata/o10181/redo02.log          10485760 ACTIVE 
         1 /scratch/ses10181/oradata/o10181/redo01.log          10485760 INACTIVE 
    
    
  8. Group 2 status changed to ACTIVE. Run the following SQL statement to change the status to INACTIVE:

    SQL> ALTER SYSTEM CHECKPOINT; 
     
    System altered. 
     
    SQL>  SELECT vl.group#, member, bytes,  vl.status 
      2   FROM v$log vl, v$logfile vlf 
      3   WHERE vl.group#=vlf.group#; 
     
    GROUP# MEMBER                                                  BYTES STATUS 
    ------ -------------------------------------------------- ---------- ---------- 
         3 /scratch/ses10181/oradata/o10181/redo03.log         209715200 CURRENT 
         2 /scratch/ses10181/oradata/o10181/redo02.log          10485760 INACTIVE 
         1 /scratch/ses10181/oradata/o10181/redo01.log          10485760 INACTIVE 
    
    
  9. Repeat steps 3, 4, and 5 for redo log groups 1 and 2.

What to do Next

If you are still not crawling all the pages you think you should, then check which pages were crawled by doing one of the following:

  • Check the crawler log file. (There's a link on the Home - Schedules page and the location of the full log on the Home - Schedules - Status page.)

  • Create a search source group (Search - Source Groups - Create New Source Group) containing only the source in question. From the Search page, search that group by clicking the group name above the search box. Alternatively, click Browse Search Groups on the Search page, then click the group name to see a hierarchy, or click the number next to the group name to list the pages crawled.

Tuning Search Performance

This section describes the most common things to consider to improve the response time and throughput performance of Oracle SES:

Adding Suggested Links

Suggested links let you direct users to a particular Web site for particular query keywords. For example, when users search for "Oracle Secure Enterprise Search documentation" or "Enterprise Search documentation" or "Search documentation", you could suggest http://www.oracle.com/technology.

Suggested link keywords are rules that determine which suggested links are returned (as suggestions) for a query. The rules consist of query terms and logical operators. For example, "secure AND search". With this rule, the corresponding suggested link is returned for the query "secure enterprise search", but it is not returned for the query "secure database".

The rule language used for the indexed queries supports the following operators:

Table 6-2 Suggested Link Keyword Operators

Operator    Example
---------   ----------------
AND         dog and cat
OR          dog or cat
PHRASE      dog sled
ABOUT       about(dogs)
NEAR        dog ; cat
STEM        $dog
WITHIN      dog within title
THESAURUS   SYN(dog)


Note:

Special characters (for example, '#', '$', '=', '&') should not be used in keywords.

Suggested links appear at the top of the search result list. This feature is especially useful to provide links to important Web pages that are not crawled by Oracle Secure Enterprise Search. Add or edit suggested links on the Search - Suggested Links page in the administration tool.
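The "secure AND search" example above can be modeled with a minimal evaluator. This is an illustration only; the Oracle Text rule language supports many more operators, as Table 6-2 shows:

```python
# Minimal illustration of AND-rule matching for suggested link keywords
# (the real rule language supports OR, PHRASE, NEAR, and so on; see
# Table 6-2). A rule like "secure AND search" matches a query when
# every operand appears among the query's terms.

def matches_and_rule(rule, query):
    """Return True if every AND operand of `rule` occurs in `query`."""
    terms = set(query.lower().split())
    operands = [t.strip() for t in rule.lower().split(" and ")]
    return all(op in terms for op in operands)
```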

Optimizing the Index

Optimizing the index reduces fragmentation, and it can significantly increase the speed of searches. Schedule index optimization on a regular basis. Also, optimize the index after the crawler has made substantial updates or if fragmentation is more than 50%. Make sure index optimization is scheduled during off-peak hours. Optimization of a very large index could take several hours.

See the fragmentation level and run index optimization on the Global Settings - Index Optimization page in the administration tool. You can specify a maximum number of hours for the optimization to run, but for best performance, let the optimization run until it finishes. Optimization creates a more compact copy of the index and then switches the original index with the copy, so it requires enough space to store both the original and the copy. When optimization is finished, the original index is dropped, and the space can be reused.

Increasing the Indexing Batch Size

Crawled data accumulates in the cache directory until it reaches the indexing batch size; the data is then indexed. The bigger the batch size, the longer each batch takes to index. Only indexed data can be searched: data in the cache cannot be searched.

The default indexing batch size is 250M. Increasing the size up to the index memory size (275M by default) can reduce index fragmentation. However, increasing the size more than the index memory size will not reduce fragmentation. You can change the index memory size manually.

Set the indexing batch size on the Global Settings - Crawler Configuration page in the administration tool.

Increasing the Index Memory Size

A large index memory setting (even hundreds of megabytes) improves the speed of indexing and reduces the fragmentation of the final indexes. However, there will be a point where it is set so high that memory paging occurs and impacts indexing speed.

Follow these steps to increase the index memory size:

  1. Launch SQL*Plus and connect as the eqsys user.

  2. Run the following SQL statement to see the current indexing memory size:

    SQL> SELECT par_value FROM ctx_parameters
    2  WHERE par_name = 'DEFAULT_INDEX_MEMORY';
     
    PAR_VALUE
    -----------
    288358400
    
    

    This is the default value for the indexing memory size, in bytes. (288358400 bytes = 275M)

  3. To change the default indexing memory size to 500M (524,288,000 bytes), run the following procedure:

    SQL> begin
    2  ctxsys.ctx_adm.set_parameter('DEFAULT_INDEX_MEMORY','524288000');
    3  end;
    4  /
     
    PL/SQL procedure successfully completed.
     
    SQL> SELECT par_value FROM ctx_parameters
    2  WHERE par_name = 'DEFAULT_INDEX_MEMORY';
     
    PAR_VALUE
    -----------
    524288000
    
    
  4. You can specify up to 2G for DEFAULT_INDEX_MEMORY. To allocate more than 1G, you also must change MAX_INDEX_MEMORY. DEFAULT_INDEX_MEMORY cannot exceed MAX_INDEX_MEMORY, and the default value for MAX_INDEX_MEMORY is 1G. The maximum size for MAX_INDEX_MEMORY is 2,147,483,647 bytes.

    SQL> begin
    2  ctxsys.ctx_adm.set_parameter('MAX_INDEX_MEMORY','2147483647');
    3  end;
    4  /
     
    PL/SQL procedure successfully completed.
     
    SQL> begin
    2  ctxsys.ctx_adm.set_parameter('DEFAULT_INDEX_MEMORY','2147483647');
    3  end;
    4  /
     
    PL/SQL procedure successfully completed.
    
    

    You can change the memory size at any time. The next index synchronization uses the new memory size.

Note:

The indexing batch size determines when index synchronization is invoked. Even if DEFAULT_INDEX_MEMORY is large, Oracle SES cannot use all of it if the indexing batch size is small. For example, if the indexing batch size is 10M, then index synchronization uses at most 10M of memory, even if you specify 1G for DEFAULT_INDEX_MEMORY.
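The interaction described in this note reduces to taking the smaller of the two settings, sketched here:

```python
# Sketch of the note above: the memory actually used by index
# synchronization is capped by the indexing batch size, regardless of
# how large DEFAULT_INDEX_MEMORY is set.

def sync_memory_used(batch_size_bytes, default_index_memory_bytes):
    """Effective memory used by one index synchronization."""
    return min(batch_size_bytes, default_index_memory_bytes)
```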

Checking the Search Statistics

See the Home - Statistics page in the administration tool for lists of the most popular queries, failed queries, and ineffective queries. This information can lead to the following actions:

  • Refer users to a particular Web site for failed queries on the Search - Suggested Links page.

  • Fix common errors that users make in searching on the Search - Alternate Words page.

  • Make important documents easier to find on the Search - Relevancy Boosting page.

Relevancy Boosting

Relevancy boosting lets administrators influence the order of documents in the result list for a particular search. You might want to override the default results for the following reasons:

  • For a highly popular search, direct users to the best results

  • For a search that returns no results, direct users to some results

  • For a search that has no click-throughs, direct users to better results

In a search, each result is assigned a score that indicates how relevant the result is to the search; that is, how good a result it is. Sometimes there are documents that you know are highly relevant to some search. For example, your company Web site could have a home page for XML (http://example.com/XML-is-great.htm), which you want to appear high in the results of any search for "XML". You would boost the score of that home page (http://example.com/XML-is-great.htm) to 100 for an "XML" search.

There are two methods for locating URLs for relevancy boosting: locate by search or manual URL entry.

Note:

The document still has a score computed if you enter a search that is not one of the boosted queries.

Relevancy boosting, like end user searching, is case-insensitive. For example, a document with a boosted score for "Oracle" is boosted when you enter "oracle".
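The behavior described above can be sketched as a score override keyed on the (query, URL) pair. This is an illustration of the concept, not the Oracle SES implementation; the URL and score come from the XML example above:

```python
# Illustration of relevancy boosting (not the actual SES implementation):
# an administrator-defined boost overrides the computed score for a
# specific (query, url) pair; all other queries keep the computed score.
# Query matching is case-insensitive, as described above.

BOOSTS = {
    ("xml", "http://example.com/XML-is-great.htm"): 100,
}

def final_score(query, url, computed_score):
    """Return the boosted score if one is defined, else the computed one."""
    return BOOSTS.get((query.lower(), url), computed_score)
```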

Increasing the JVM Heap Size

If you expect heavy load on the Oracle SES server, then configure the Java virtual machine (JVM) heap size for better performance.

The heap size is defined in the $ORACLE_HOME/search/config/searchctl.conf file. By default, the following values are given:

max_heap_size = 1024 megabytes

min_heap_size = 512 megabytes

Increase the values of these parameters appropriately; the maximum heap size should not exceed the physical memory size. Then restart the middle tier with searchctl restart.

Increasing the Oracle Undo Space

Heavy query load should not coincide with heavy crawl activity, especially when there are large-scale changes on the target site. If it must (for example, when the crawl needs to be scheduled around the clock), then increase the size of the Oracle undo tablespace with the UNDO_RETENTION parameter.

See Also:

Oracle Database SQL Reference and Oracle Database Administrator's Guide (available on Oracle Technology Network) for more information about increasing the Oracle undo space

Integrating with Google Desktop for Enterprise

Oracle Secure Enterprise Search provides a plug-in (or connector) to integrate with Google Desktop for Enterprise (GDfE). You can include Google Desktop results in your Oracle SES hitlist. You can also link to Oracle SES from the GDfE interface.

See Also:

Google Desktop for Enterprise Readme at http://<host>:<port>/search/query/gdfe/gdfe_readme.html for details about how to integrate with GDfE

Monitoring Oracle Secure Enterprise Search

In a production environment, where a load balancer or other monitoring tools are used to ensure system availability, Oracle Secure Enterprise Search (SES) can also be easily monitored through the following URL: http://<host>:<port>/monitor/check.jsp. The URL should return the following message: Oracle Secure Enterprise Search instance is up.

Note:

This message is not translated to other languages, because system monitoring tools might need to byte-compare this string.

If Oracle SES is not available, then the URL returns either a connection error or the HTTP status code 503.
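A monitoring probe for this URL can be sketched as follows. The host and port are placeholders; the expected body is the untranslated status string, which, as noted above, monitoring tools can byte-compare:

```python
# Sketch of a monitoring probe for http://<host>:<port>/monitor/check.jsp.
# The instance is considered up only if the untranslated status string
# appears in the response; a connection error or HTTP 503 means down.

from urllib.error import URLError
from urllib.request import urlopen

EXPECTED = b"Oracle Secure Enterprise Search instance is up."

def check_ses(url, opener=urlopen):
    """Return True if the SES monitoring URL reports the instance as up.

    `opener` is injectable so the probe can be tested without a server.
    """
    try:
        with opener(url, timeout=10) as resp:
            return EXPECTED in resp.read()
    except (URLError, OSError):
        return False

# Real usage (placeholder host and port):
# check_ses("http://ses-host:7777/monitor/check.jsp")
```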

Turning On Debug Mode

Debug mode is useful for troubleshooting. To turn on debug mode for the Oracle SES administration tool, update the search.properties file located in the $ORACLE_HOME/search/webapp/config directory. Set debug=true and restart the Oracle SES middle tier with searchctl restart.

To turn off debug mode when you are finished troubleshooting, set debug=false and restart the middle tier with searchctl restart.

Note:

$ORACLE_HOME represents the directory where Oracle SES was installed.

Debug information can be found in the OC4J log file: $ORACLE_HOME/oc4j/j2ee/OC4J_SEARCH/log/oc4j.log.
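The edit to search.properties can be scripted. The following is a sketch that assumes the file uses simple key=value lines; after writing the file, the middle tier must still be restarted with searchctl restart:

```python
# Sketch: toggle the debug flag in search.properties text (assumes
# simple key=value lines). Restart the middle tier with
# "searchctl restart" afterward for the change to take effect.

def set_debug(properties_text, enabled):
    """Return the properties text with debug=true or debug=false set."""
    value = "true" if enabled else "false"
    lines, found = [], False
    for line in properties_text.splitlines():
        if line.strip().startswith("debug="):
            lines.append("debug=" + value)   # replace the existing flag
            found = True
        else:
            lines.append(line)
    if not found:
        lines.append("debug=" + value)       # add the flag if missing
    return "\n".join(lines) + "\n"
```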

Accessing Application Server Control Console on Oracle SES

The Oracle Enterprise Manager 10g Application Server Control Console is a Web-based user interface that displays the current status of the Oracle SES middle tier. For example, the Home page shows a graph of the Response and Load, and the Performance page shows a graph of the Heap Usage.

The Application Server Control Console is installed and configured automatically with OC4J. Because the Oracle SES middle tier runs in the embedded standalone OC4J, the Application Server Control Console is started by default when Oracle SES is started.

To access the console, type the following URL in a Web browser:

http://<host>:<port>/em

where host and port are the host name and port running Oracle SES.

Log in as the oc4jadmin user with your Oracle SES administrator password.

See Also:

  • Oracle Containers for J2EE Configuration and Administration Guide 10g (10.1.3.1.0)

  • the online help provided with Application Server Control Console for detailed instructions on using this interface

Restarting Oracle Secure Enterprise Search After Rebooting

The tool for starting and stopping the search engine is searchctl. To restart Oracle SES (for example, after rebooting the host computer), navigate to the bin directory and run searchctl startall.

Note:

Users are prompted for a password when running searchctl commands on UNIX platforms. No password is required on Windows, because installing Oracle SES on Windows requires a user with administrator privileges; any member of the administrator group can start or stop the search engine without a password.

See Also:

Startup / Shutdown lesson in the Oracle SES administration tutorial: http://st-curriculum.oracle.com/tutorial/SESAdminTutorial/index.htm