Oracle Ultra Search Release Notes
Release 2 (9.0.2)
April 2002
Part No. A97355-01
This document summarizes the differences between Oracle Ultra Search in Oracle9i Application Server Release 2 (9.0.2) and its documented functionality.
To view the Ultra Search documentation, see $ORACLE_HOME/ultrasearch/doc/help/toc.htm.
For installation documents within an Ultra Search installation, see $ORACLE_HOME/ultrasearch/doc/help/install.htm.
The Ultra Search middle tier is compliant with the Oracle J2EE container (OC4J). Follow the instructions at $ORACLE_HOME/ultrasearch/doc/help/install_midtier.htm to configure the Ultra Search middle tier with OC4J.
Ultra Search requires either a JRE or a JDK on the database host where you install the Ultra Search server component. By default, JDK 1.3.1 is installed by Oracle Universal Installer (OUI) during database installation under the $ORACLE_HOME/jdk directory. If you use a different JDK, either create a soft link or copy the files from the location where you installed the JDK to $ORACLE_HOME/jdk. Ultra Search is certified with JDK 1.3.1 in this release.
This section describes Ultra Search configuration issues and their workarounds.
In addition to the configuration steps described in $ORACLE_HOME/ultrasearch/doc/help/install_midtier.htm, you must also follow these steps for the Ultra Search sample query application to function correctly.
To configure the Ultra Search sample query applications and the sample search portlet, edit the OC4J data-sources.xml file. For instructions on editing data-sources.xml, see $ORACLE_HOME/ultrasearch/doc/help/install_midtier.htm.
To configure gsearch.jsp, you must manually edit the file to change the 'username' and 'password' variable settings, and then edit ultrasearch.properties to change the connection string. The file gsearch.jsp is under the $ORACLE_HOME/ultrasearch/sample/query/9i directory.
For editing ultrasearch.properties, see $ORACLE_HOME/ultrasearch/doc/help/install_midtier.htm.
For detailed information, see $ORACLE_HOME/ultrasearch/sample/query/9i/README.html.
The Ultra Search Crawler is a Java process that runs on the server tier when launched. Therefore, Ultra Search requires you to have either JRE or JDK installed on the database host where you install the Ultra Search server component.
Stopping a schedule while it is running shuts down the crawler. Any documents that have not been crawled or indexed are processed when the crawler is restarted.
Oracle Ultra Search schedule launching uses the DBMS_JOB package. Therefore, the Oracle Ultra Search DBA must make sure that there is at least one SNP process running. In other words, the initialization parameter file for the Oracle Ultra Search database instance should set the JOB_QUEUE_PROCESSES parameter to at least 1.
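The check and the change can be sketched as follows (assuming SYSDBA access; the parameter name comes from the passage above, and the value 1 is the minimum it calls for):

```sql
-- Check the current number of job queue (SNP) processes.
SELECT name, value
  FROM v$parameter
 WHERE name = 'job_queue_processes';

-- Ensure at least one SNP process is running. JOB_QUEUE_PROCESSES is a
-- dynamic parameter, so it can be raised without restarting the instance.
ALTER SYSTEM SET JOB_QUEUE_PROCESSES = 1;
```

If the instance is started from a traditional initialization parameter file rather than a server parameter file, the ALTER SYSTEM change must also be reflected in that file to survive a restart.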
After you have configured the Oracle Ultra Search system, you can launch a crawler immediately from the Schedule screen.
View the Oracle Ultra Search crawler status by checking its state in the Oracle Ultra Search Administration Tool page.
Click the "Schedules" tab, and you see a table that lists all schedules and their state.
When a schedule is launching, it is in the "LAUNCHING" state. During the launching state, URLs to be crawled are copied into a queue. The amount of time this process takes depends on how many URLs there are to copy. In cases such as maintenance crawling of the Primary Schedule, there can be millions of URLs to copy, with the schedule staying in the "LAUNCHING" state for a very long time.
When a schedule has completed launching and the crawler has begun fetching pages, the state changes to "EXECUTING".
Oracle Ultra Search, like any other application that uses the database to store content, requires that the character set of the database be able to support the character set used at the application level. For example, if the application language is Japanese, and the database character set is US7ASCII, then any data that the application attempts to store is corrupted, because the US7ASCII character set does not support Asian languages.
The Oracle9i Globalization Support Guide provides detailed information on database character sets.
For Crawler performance tuning, see Step 5 in "Configuring the Oracle Server for Ultra Search" at $ORACLE_HOME/ultrasearch/doc/help/configure_server.htm.
This section contains suggestions on how to improve the response time of the Ultra Search query.
The database buffer cache keeps frequently accessed data read from datafiles, and efficient usage of the buffer cache can improve Ultra Search query performance. The cache size is controlled by the DB_CACHE_SIZE initialization parameter.
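As a sketch (assuming SYSDBA access; the 64M figure is purely illustrative, not a sizing recommendation):

```sql
-- Inspect the current buffer cache size.
SELECT name, value
  FROM v$parameter
 WHERE name = 'db_cache_size';

-- DB_CACHE_SIZE is dynamic in Oracle9i, within the limit set by
-- SGA_MAX_SIZE, so it can be resized without restarting the instance.
ALTER SYSTEM SET DB_CACHE_SIZE = 64M;
```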
For more information on how to tune this parameter, see Oracle9i Database Performance Tuning Guide and Reference.
Optimize the Ultra Search index after the crawler has made substantial updates. This can be done by scheduling index optimization on a regular basis. Make sure index optimization is scheduled during off-peak hours, because query performance is slow during an index optimization schedule.
For information on index optimization schedules, see the Ultra Search online documentation about the Schedules Page ($ORACLE_HOME/ultrasearch/doc/help/a_schedules.htm).
Optimize the Ultra Search index based on frequently searched tokens. Queries can be logged by turning on query statistics collection in the Administration Tool. The frequently searched tokens can then be passed to CTX_DDL.OPTIMIZE_INDEX in token mode. The Ultra Search index name is WK$DOC_PATH_IDX.
For more information on OPTIMIZE_INDEX, see Oracle Text Reference.
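The call can be sketched as follows (assuming execute privilege on CTX_DDL; 'oracle' stands in for one frequently searched token taken from the query statistics, and the call would be repeated per token):

```sql
-- Optimize the Ultra Search Text index for a single frequently
-- searched token, using Oracle Text's token optimization mode.
BEGIN
  CTX_DDL.OPTIMIZE_INDEX(
    idx_name => 'WK$DOC_PATH_IDX',
    optlevel => CTX_DDL.OPTLEVEL_TOKEN,
    token    => 'oracle');
END;
/
```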
The search response time is directly influenced by the Text query string used. Although Ultra Search provides a default mechanism to expand user input into a Text query, simpler expansions can greatly reduce search time.
For information on customizing query expansion, see the Ultra Search online documentation about Customizing the Query Syntax Expansion ($ORACLE_HOME/qsyntax.htm) and the Javadoc for the oracle.ultrasearch.query.Query interface.
This section explains in detail how Web data sources work. You should understand this section well before proceeding with crawling.
Web sources differ from other data sources in the following ways:
Take the following example:
When the default schedule is launched, all of the URLs belonging to the hosts www.foo1.com, www.foo2.com, and www.foo3.com are collected and crawled.
Subsequently, if a user-defined Web source "foo2" is defined for "www.foo2.com", then URLs under www.foo2.com no longer belong to the default Web source. They now belong to the foo2 user-defined Web source. This means that when the default schedule is launched, foo2 URLs will not be crawled. Instead, a separate schedule needs to be created to crawl the foo2 data source.
When a user-defined Web source is dropped, the URLs of the dropped source are reassigned to the default Web source. They are not deleted from the system. So, if the "foo2" data source in the previous example is dropped, then all of the foo2 URLs are reassigned to the default Web source. They are then crawled whenever the default schedule is launched.
If you need to entirely eliminate all URLs that belong to a specific host, the only way to remove them from the system is to directly issue SQL statements in a SQL*Plus session. For example:
EXEC WK_ADM.USE_INSTANCE('<instance_name>');
DELETE FROM WK$URL WHERE URL LIKE 'http://www.foo2.com%';
COMMIT;
The default Web source is a collection of URLs that are discovered from a set of seed URLs. The set is bounded by one or more inclusion patterns and zero or more exclusion patterns. Specifically, the default Web source is defined by the following:
Maintenance crawling means that a page is not processed if it has not been changed since it was last crawled. To determine whether a page has changed, the crawler checks the Last Modified time stamp and the page checksum value. The checksum is based on the contents of the page. If the page has changed, then it is parsed again for new links and indexed.
For example, when launching a schedule that has the user-defined foo2 Web source associated with it, the crawler runs through all of foo2's URLs and possibly finds only a few URLs with changed pages. From those changed pages, the crawler might discover new URLs. A significant amount of time is saved because the crawler needed to process only URLs that have changed.
A set of seed URLs can be defined as the starting points for the crawler to discover URLs. As URLs are discovered, they are assigned to a user-defined Web source, if the host name matches. Otherwise, they are assigned to the default Web source. A seed URL is useful only when it matches one of the inclusion patterns defined and is not excluded. There is no limit on the number of seeds that can be defined.
Here are two examples:
Careful planning is needed when defining the default Web source. Specifically, the user must determine whether to build it using a top-down or bottom-up approach. It is strongly recommended that the bottom-up approach be used for ease of management.
Web sites are incrementally added to the default Web source. The idea is to crawl all known important Web sites before allowing the crawler to discover unknown Web sites. The following sequence describes the process of incrementally adding Web sites:
The inclusion patterns are defined to contain an unknown number of Web sites; for example, the "oracle.com" inclusion pattern allows crawling of URLs on any Web site within Oracle Corporation. The problems with this approach are as follows:
Whenever a schedule is launched, the list of URLs to be crawled is enqueued into a queue before the crawler is run. The enqueuing of URLs differs depending on whether it is the default schedule or a schedule containing user-defined Web sources:
The default schedule implicitly crawls the default Web source. Therefore, any new seed URLs (that were added after the last default schedule ran) are enqueued first. Then, all other URLs that are assigned to the default Web source are enqueued for maintenance crawling. URLs that belong to hosts for which user-defined Web sources exist are not enqueued. For example, if foo2 was defined as a Web source for host www.foo2.com, then URLs that begin with www.foo2.com are not enqueued.
For schedules that have user-defined Web sources associated with them, URLs belonging to all of the associated user-defined Web sources are enqueued. For example, if you associate the foo2 Web source, then all URLs that belong to the host www.foo2.com are enqueued.
The path of the JDBC OCI driver is not passed to the Ultra Search crawler JVM. As a result, the crawler cannot communicate with the database, and none of the crawled documents are indexed. By default, Ultra Search uses the JDBC thin driver; therefore, no additional steps are needed for Ultra Search to work.
If you choose to use the JDBC OCI driver for crawling, then Ultra Search requires %ORACLE_HOME%/bin to be part of the NT system environment variable %PATH%. During Oracle installation, OUI makes the correct configuration to %PATH%. You must reboot the computer right after installation for this configuration to take effect.
The crawler is unable to handle log directory paths with multibyte characters.
Avoid specifying log directories that have Chinese, Japanese, or Korean characters.
File data source crawling cannot crawl directories or files with multibyte characters; for example, Chinese or Japanese.
Avoid naming files in Chinese, Japanese, or Korean, or placing them under directories with such names.
The crawler is unable to pick up files or directories whose names contain HTML reserved symbols, such as '<' or '>', when doing file data source crawling.
Rename any file or directory that uses such symbols.
Stopping a crawl does not immediately stop the crawler, even though the schedule status shows that the crawler has stopped. This happens when the crawler agent is used and the crawler is in the process of enqueuing URLs fetched from the agent. The crawler stops only after enqueuing is finished.
Currently, there is no way to stop the enqueuing other than manually killing the crawler process.
When query statistics collection is enabled, the query statistics pages (Daily summary of query statistics, Top 50 queries, Top 50 ineffective queries, Top 50 failed queries) may show text query strings like '(((WKA2X &({abc}))WITHIN S2))*2,({abc}))'. This behavior is due to 9.0.2 Java API expanding the user's query ('abc' in this case) before it is sent to the database.
In most cases, the user's query can be deciphered from the text query.
When creating a new data source type, in Step 1, for the "Agent java class name" field that specifies the crawler agent, enter just the class name without the .class extension.
If a Portal item or page is of type page link, then its display URL will be a duplicate of the actual item/page being linked to. It is possible that multiple items link to one item (likewise for pages). Given a set of items all of which have the same display URL, Ultra Search is only able to index exactly one of them.
Which item Ultra Search indexes depends on which item is reached first by the crawler. To avoid this, ensure that no page links or item links (pages or items created in Portal that point to other existing pages or items in the same Portal) are created.
In Portal, the URLs for translations of an item have the same display URL as that of the base language item. Portal users can view different translations because when users log in to Portal, the language is established as part of the browser session. However, this language negotiation process only works with browsers operated by human users. Therefore, the Ultra Search crawler receives the same display URL for the translated items. This violates the stated requirement that all display URLs presented to Ultra Search be unique. The implication is that Ultra Search cannot crawl translations of an item.
As in bug 2218987, with multiple translations, only one of the translation items or the base item itself is indexed by the crawler. The rest are rejected by the crawler because of the duplicate display URLs.
If there are translations for an item or page, then some attributes of that item/page cannot be correctly transmitted to the Ultra Search crawler. As a result, attribute queries may not work correctly for translated items.
The Ultra Search crawler can process specific Portal item types. However, Portal items of type "none" do not have display URLs. As a result, they are not revealed to Ultra Search, because anything that does not have a display URL cannot be represented in the search application results list in such a way that the user can click on it to view the item.
Portal users should embed Ultra Search portlets that are hosted on the same host as the Oracle9iAS Portal server. If Oracle Portal is installed on host A, then Ultra Search is also installed on host A. Hence, the Ultra Search provider is also hosted as a Web application on host A.
It is possible that the Ultra Search provider running on host A could be registered with a second Oracle Portal instance running on host B. However, if the Ultra Search portlet hosted on A is embedded within pages created in Portal B, then the pop-up list-of-values will not work correctly. This is because of a security bug inherent in JavaScript.
Oracle is a registered trademark, and Oracle9i is a trademark or registered trademark of Oracle Corporation. Other names may be trademarks of their respective owners.
Copyright © 2002 Oracle Corporation.
All Rights Reserved.