|Oracle® Ultra Search User's Guide
10g Release 1 (10.1)
Part Number B10731-01
This chapter contains the following sections:
The Oracle Ultra Search crawler is a powerful tool for discovering information on Web sites in an organization's intranet. This feature is especially relevant to Web crawling. The other data sources (for example, table or email data sources) are defined such that the crawler does not follow any links to other documents that you might not be aware of.
Your Web crawling strategy can be as simple as identifying a few well-known sites that are likely to contain links to most of the other intranet sites in your organization. You could test this by crawling these sites without indexing them. After the initial crawl, you have a good idea of the hosts that exist in your intranet. You could then define separate Web sources to facilitate crawling and indexing on individual sites.
However, in reality, the process of discovering and crawling your organization's intranet is an interactive one characterized by periodic analysis of crawling results and modification to crawling parameters to direct the crawling process somewhat. For example, if you observe that the crawler is spending days crawling one Web host, then you might want to exclude crawling at that host or limit the crawling depth.
Monitor the crawling process by using a combination of the following methods:
Monitoring the schedule status with the administration tool
Monitoring the real time schedule progress with the administration tool
Monitoring the crawler statistics with the administration tool
Monitoring the log file for the current schedule
URL looping refers to the scenario where, for some reason, a large number of unique URLs all point to the same document. One particularly difficult situation is where a site contains a large number of pages, and each page contains links to every other page in the site. Ordinarily, this would not be a problem, because the crawler eventually analyzes all documents in the site.
However, some Web servers attach parameters to generated URLs to track information across requests. Such Web servers might generate a large number of unique URLs that all point to the same document.
http://mycompany.com/somedocument.html?p_origin_page=10 might refer to the same document as
http://mycompany.com/somedocument.html?p_origin_page=13 but the
p_origin_page parameter is different for each link, because the referring pages are different. If a large number of parameters are specified and if the number of referring links is large, then a single unique document could have thousands or tens of thousands of links referring to it. This is an example of how URL looping can occur.
Monitor the crawler statistics in the Oracle Ultra Search administration tool to determine which URLs and Web servers are being crawled the most. If you observe an inordinately large number of URL accesses to a particular site or URL, then you might want to do one of the following:
Exclude the Web Server: This prevents the crawler from crawling any URLs at that host. (You cannot limit the exclusion to a specific port on a host.)
Reduce the Crawling Depth: This limits the number of levels of referred links the crawler will follow. If you are observing URL looping effects on a particular host, then you should take a visual survey of the site to find out an estimate of the depth of the leaf pages at that site. Leaf pages are pages that do not have any links to other pages. As a general guideline, add three to the leaf page depth, and set the crawling depth to this value.
Be sure to restart the crawler after altering any parameters in the Crawler Page. Your changes take effect only after restarting the crawler.
This section contains suggestions on how to improve the performance of the Oracle Ultra Search query. Query performance is generally affected by response time and throughput.
DB_CACHE_SIZE initialization parameter.
The database buffer cache keeps frequently accessed data read from datafiles. Efficient usage of the buffer cache can improve Oracle Ultra Search query performance. The cache size is controlled by the
DB_CACHE_SIZE initialization parameter.
See Also:Oracle Database Performance Tuning Guide for information on how to tune this parameter
Optimize the index.
Optimize the Oracle Ultra Search index after the crawler has made substantial updates. To do so, schedule index optimization on a regular basis. Make sure index optimization is scheduled during off-peak hours, because query performance is significantly degraded during index optimization.
See Also:"Index Optimization"
Optimize the index based on tokens.
Optimize the Oracle Ultra Search index by basing it on frequently searched tokens. To log queries, use the administration tool to turn on query statistics collection. The frequently searched tokens then can be passed to
OPTIMIZE_INDEX in token mode. The Oracle Ultra Search index name is
See Also:Oracle Text Reference for more information on
Simplify query expansion.
The search response time is directly influenced by the Oracle Text query string used. Although Oracle Ultra Search provides a default mechanism to expand user input into a Text query, simpler expansions can greatly reduce search time.
Size the shared pool.
The shared pool stores the library cache and the dictionary cache. The library cache stores recently run SQL and PL/SQL code. A cache miss on the data dictionary cache or library cache is more expensive than a miss on the buffer cache. For this reason, the shared pool should be sized to ensure that frequently used data is cached. The shared pool size is controlled by the
SHARED_POOL_SIZE initialization parameter.
See Also:Oracle Database Performance Tuning Guide for information on tuning this parameter
The Oracle Ultra Search middle tier connects to the database through JDBC. Because creation of a connection is an expensive operation in JDBC, a pool of open connections is used to improve the response time of queries. With Oracle Application Server, OC4J can manage the connection pool for the applications.
The minimum size, maximum size, and allocation algorithm of the pool can be specified in the
xml configuration file of OC4J.
The following is an example of a data source definition, with minimum 2 and maximum 30 open-connections. Each connection closes after 30 seconds of inactivity, and new connections are created dynamically according to load. The other caching schemes are
<data-source class="oracle.jdbc.pool.OracleConnectionCacheImpl" name="UltraSearchDS" location="jdbc/UltraSearchPooledDS" username="user" password="pass" url="jdbc:oracle:thin:@hostname:1521:oracle_sid" min-connections="2" max-connections="30" inactivity-timeout="30" > <property name="cacheScheme" value="1" /> </data-source>
Pin the query package in memory.
Pin frequently used packages in the shared memory pool. When a package is pinned, it remains in memory, no matter how full the pool gets or how frequently you access the package. You can pin packages using the supplied package
See Also:PL/SQL Packages and Types Reference
Without the Oracle Ultra Search remote crawler, you must run the Oracle Ultra Search crawler on the same host as the Oracle Database. For large data sets, you can improve performance by running the Oracle Ultra Search crawler on one or more separate hosts from the Oracle Database. The Oracle Ultra Search crawler is a pure Java application, and it communicates with the Oracle Database through JDBC. Because the Oracle Ultra Search scheduling mechanism runs within the Oracle Database, it automatically uses the database's high availability features. The Oracle Database uses one of two mechanisms to send launch requests to the remote crawler hosts. The first is Java remote method invocation (RMI). The second is Java database connectivity (JDBC). Both mechanisms establish a launching sub-system on the remote host. You can conceptualize the launching sub-system as a process that uses either RMI or JDBC to listen for launch requests. (This chapter refers to the launching sub-system as the "launcher").
Upon receipt of a launch request, the launcher spawns a new Java process. It is this process that is the actual remote crawler.
You should use JDBC-based remote crawling if you do not want the dependency on RMI (for example, because of network restrictions).
The launcher is the sub-system that listens for launch requests and launches remote crawler processes. When you register a remote crawler (either RMI-based or JDBC-based), you are actually registering the launcher with the Oracle Ultra Search backend. By registering a launcher, you make it available to be used by all Oracle Ultra Search instances within an Oracle Ultra Search installation. Thus, after registration, an administrator or an Oracle Ultra Search instance may subsequently choose to associate the launcher and assign schedules to be launched with that launcher. There is no way to restrict a launcher to specific Oracle Ultra Search instances. Once registered, all Oracle Ultra Search instances may use it. However, the launcher is only a sub-system for launching remote crawler processes. Having multiple instances use the same launcher for launching purposes poses no security problems for most customers. Both RMI and JDBC launchers are simply Java processes themselves. They are started from the command line. Oracle provides scripts for starting these launchers, described in the following section. Also, the JDBC launcher must establish JDBC connections to the Oracle Ultra Search backend database to listen for launch events. You must specify the launch user (or role) at registration time. Oracle strongly recommends that you create a new database user (or role) specifically for the purposes of launching remote crawlers. You should not use this user (or role) for any other purposes.
RMI-based remote crawling depends on the standard RMI infrastructure. Therefore, each remote crawler host must have an RMI registry and an RMI daemon running. These are started when you run the scripts to start the RMI-based launcher.
When a crawling schedule is activated, the Oracle Ultra Search scheduler launches a Java program as a separate process on the database host. This Java program is known as the ActivationClient.
This program attempts to connect to the remote crawler host through the RMI registry port specified at installation time. If successful, then the ActivationClient receives a remote reference to a Java object running on the remote host. This remote Java object is known as the ActivatableCrawlerLauncher.
The ActivationClient then instructs the ActivatableCrawlerLauncher to launch the Oracle Ultra Search crawler on the remote host. The ActivatableCrawlerLauncher launches the Oracle Ultra Search crawler as a separate Java process on the remote host.
The RMI registry and daemon ports are inflexible. Therefore, if you have other RMI services running on the same host, you will not be able to use RMI-based remote crawling. Also, you cannot run two RMI-based launchers, because they will both conflict on the RMI ports.
When a crawling schedule is activated, the Oracle Ultra Search scheduler sends a message to the launcher.
If the launcher is running and properly connected to the database as the appropriate launch user (or role), then it can receive the launch messages. Otherwise, the message times out after 30 seconds and launch failure is reported.
The launcher then deciphers the launch message and spawns an Oracle Ultra Search crawler as a separate Java process on the remote host.
The launcher maintains two permanent JDBC connections to the backend database. If either connection goes down at any time, then the JDBC launcher attempts to re-establish it. The number of attempts to re-establish connections is configurable as a command line parameter. The wait time between attempts is also configurable.
Note:The JDBC launcher can be configured to periodically trigger a "keep-alive" signal. This is useful to prevent inadvertent closing of the JDBC connections by firewalls. The time between signals is configurable with a command line parameter.
Oracle Ultra Search encrypts all RMI communication. However, the JDBC launcher uses the Oracle Thin JDBC driver. If security is a concern, then encrypt all JDBC traffic by securing the Oracle Thin JDBC driver.
See Also:Oracle Advanced Security Administrator's Guide for more information on Thin JDBC support
Each Oracle Ultra Search schedule can be associated with exactly one crawler. The crawler can run locally on the Oracle database host or on a remote host. There is no limit to the number of schedules that can be run. Similarly, there is no limit to the number of remote crawler hosts that can be run. However, each remote crawler host requires that the Oracle Ultra Search middle tier be installed on its host.
By using several remote crawler hosts and carefully allocating schedules to specific hosts, you can achieve scalability and load balancing of the entire crawling process.
Understand the cache and mail archive directories.
All remote crawlers must cache crawled data into a common file system location, that is, accessible by the backend Oracle Ultra Search database. Likewise, when crawling Email sources, all emails must be saved in a common, central location. The simplest way to achieve this is by ensuring that the cache and mail archive directories seen by the remote crawler uses are mounted through NFS to point to the cache and mail directories used by the Oracle Ultra Search backend database.
For example, your Oracle Ultra Search installation might consist of four hosts: one database server (host X) running Solaris on which the Oracle Ultra Search backend is installed; one remote crawler host (host Y1) running on Windows; one remote crawler host (host Y2) running on Solaris; and one remote crawler host (host Y3) running on Linux.
In this scenario, export the shared directories on host X using the UNIX
export command. Then use the UNIX
mount command on hosts Y2 and Y3 to mount the exported directories. For host Y1, you must purchase a third-party NFS client for Windows and use that to mount the shared directories. If host X is a Linux server, you can create Samba shares and thereby mount those shares on Windows without needing any third party software.
If for some reason there is no shared file system between the database and remote crawler hosts, you can instruct the remote crawler to transfer all cache and mail archive data across JDBC to the database host. The files are then saved locally on the database host. You can choose this option by selecting through JDBC connection for the Cache file access mode setting in the next step.
Configure the remote crawler with the administration tool.
To edit the remote crawler profile, navigate to the Crawler: Remote Crawler Profiles page and click Edit for the remote crawler profile you want to edit. Edit that profile by manually entering all mount points for the shared crawler resources that you defined.
Cache and mail archive directories. If the backend database host and remote crawler host are on a shared file system (such as NFS), select "through mounted file system" for the Cache file access mode setting. Then specify values for the following parameters:
Mount point for cache directory path as seen by the remote crawler
Mount point for mail archive path as seen by the remote crawler (if you are using the Oracle Ultra Search mailing list feature)
Otherwise, if there is no shared file system between the remote crawler host and the backend database host, you must select through JDBC connection for the Cache file access mode setting. Then, specify values for the following parameters:
Local cache directory as seen by local crawlers on the backend database host
Local mail archive directory as seen by local crawlers on the backend database host
Crawler log directory. It is not necessary that the remote crawler log directory be an NFS mount a central location accessible by the backend Oracle Ultra Search database. However, it is beneficial to do so if you want to be able to monitor all crawler logs (local as well as all remote crawlers) in one central location.
Additionally, you must specify the following crawler parameters before you can begin crawling:
number of crawler threads that the remote crawler uses for gathering documents
number of processors on the remote crawler host
initial Java heap size
maximum Java heap size
Java classpath. It is not usually necessary to specify this classpath. The classpath that remote crawler processes use is inherited from the RMI subsystem. The RMI subsystem classpath is configured by the scripts used to launch it. You will need to modify the classpath only in special circumstances where you require the classpath to be different from the RMI subsystem classpath.
Complete the crawler configuration with the administration tool.
Create schedules and data sources. Assign one or more data sources to each schedule.
Each schedule must then be assigned to a remote crawler or the local crawler. (The local crawler is the crawler that runs on the local Oracle database host itself). To assign the a schedule to a remote crawler host or the local database host, click the host name of a schedule in the Schedules page.
You can also turn off the remote crawler feature for each schedule, thereby forcing the schedule to launch a crawler on the local database host, instead of the specified remote crawler host. To turn off the remote crawler feature, click the host name of a schedule in the Synchronization Schedules page. If a remote crawler host is selected, the RMI-based remote crawler hostname (or JDBC-Based launcher name) will be displayed. Change this to the local database host in order to disable remote crawling.
Start the remote crawler launching sub-system on each remote crawler host.
Use the helper scripts in
$ORACLE_HOME/tools/remotecrawler/scripts/operating_system to do this.
If the remote crawler is running on a UNIX platform, then source the
$ORACLE_HOME/tools/remotecrawler/scripts/unix/runall.sh Bourne shell script for RMI-based remote crawling. Source
runall_jdbc.sh for JDBC-based remote crawling.
If the remote crawler is running on a Windows host, then run the
%ORACLE_HOME%\tools\remotecrawler\scripts\winnt\runall.bat file for RMI-based remote crawling. Run
runall_jdbc.bat for JDBC-based remote crawling.
For RMI-based remote crawling, the
runall scripts perform the following tasks in sequence:
define_env is invoked to define necessary environment variables.
runregistry is invoked to start up the RMI registry.
runrmid is invoked to start up the RMI daemon.
register_stub is invoked to register the necessary Java classes with the RMI subsystem.
Note:You can invoke
For JDBC-based remote crawling, the
runall_jdbc scripts perform the following tasks in sequence:
define_env is invoked to define necessary environment variables
The JDBC launcher is started with a command line Java process invocation. There are the following command line arguments for the JDBC launcher:
-name: name of launcher (that you used to register this launcher)
-url: JDBC connection URL to the backend Oracle Ultra Search database
-user: database user to connect
-password: database user password
-rw: wait time (in seconds) between attempts to re-establish JDBC connections
-ra: maximum number of attempts to re-establish JDBC connections
-kw: wait time (in milliseconds) between keep-alive signals
You must edit the contents of the
runall_jdbc script and specify the values for each parameter before running it.
Launch the remote crawler from the administration tool, and verify that it is running.
The state of the schedule is listed in the Schedules page. The remote crawler launching process takes up to 90 seconds to change state from LAUNCHING to FAILED if failure occurs.
To view the schedule status, click the crawler status in the schedules list. To view more details, especially in the event of failure, click the schedule status itself. This brings up a detailed schedule status.
The RMI-based remote crawler fails to launch if any one of the following requirements are not met:
The RMI registry is not running and listening on the port specified at installation.
The RMI daemon is not running and listening on the port specified at installation.
The necessary Java objects have not been successfully registered with each RMI registry.
The JDBC-based remote crawler fails to launch if any one of the following requirements are not met:
The JDBC launcher is not running.
The JDBC launcher is running, but the connect user (or role) specified is incorrect.
After a remote crawler is launched, verify that it is running with one or more of the following methods:
For RMI-based crawling, check for active Java processes on the remote crawler host. A simple way to confirm that remote crawler is running on the remote crawler host is to use an operating system command, such as
ps on UNIX systems. Look for active Java processes.
For JDBC-based crawling, check that the launcher is up and running and that there are no errors. When you start the JDBC-based launcher, it will output text to standard output. You may optionally redirect output to a file. Monitor this output for any errors.
Monitor the contents of the schedule log file. If the remote crawler is running successfully, then you should see the contents of the schedule log file changing periodically. The schedule log file is located in the shared log directory.
Oracle Ultra Search can crawl on one fixed node or on any node, depending on the storage access configuration of the Real Application Clusters system. PL/SQL APIs are provided to specify which node should run the crawler, if needed. For Oracle Ultra Search administration and the Oracle Ultra Search query application, you can configure the connection string to connect to any node of Real Application Clusters.
See Also:The documentation for Oracle Database Real Application Clusters
The disk of any node in a Real Application Clusters system can be shared (cluster file system) or not shared (raw disk). For Real Application Clusters on a cluster file system (CFS), the cache files generated by the crawler on any node are visible to any Oracle instance and can be indexed by any Oracle instance that performs index synchronization. If the disk is not shared, then the crawler must run on one particular Oracle instance to ensure that all cache files can be indexed.
This is due to the nature of Oracle Text indexing, where rows inserted into one table by different sessions go to the same pending queue, and whoever initiates index synchronization attempts to index all of the inserted rows. Because of this limitation, on a CFS, Oracle Ultra Search is configured to launch the crawler on any database instance. If it is not on a CFS, then Oracle Ultra Search launches the crawler on the database instance where
INSTANCE_NUMBER = 1.
The Oracle Ultra Search administrator can configure which instance runs the crawler with the following PL/SQL API:
instance_name is the name of the launching instance (or the database name if it is to be launched on any node) and
connect_url is the connect descriptor.
For connection to a single database instance, the descriptor can be in the short form "host:port:SID" or the connect descriptor (Oracle Net keyword-value pair). For example:
To connect to any database instance, the full database connect descriptor must be used. For example:
See Also:Oracle Database JDBC Developer's Guide and Reference for configuration details.
You cannot configure Oracle Ultra Search to launch the crawler on any node on a non-cluster file system.
To query on the existing launching instance configuration, use the following PL/SQL API:
WK_ADM.GET_LAUNCH_INSTANCE RETURN VARCHAR2;
This returns the name of the launching instance or the database name if any node can launch the crawler.
The Oracle Ultra Search remote crawler requires that the remote file system be mounted on the Oracle instance for indexing.
For cluster file system Real Application Clusters, the file system of the remote computer should be NFS mounted to all nodes of the system.
For non-cluster file system Real Application Clusters, the NFS mount can be limited to the specific node where the Oracle instance is serving the remote crawler. There is no advantage to mounting the remote file system to all nodes--it could lead to stale NFS handles when nodes go down. When there is a configuration change to move to a different Oracle instance, the remote file system should be NFS mounted to the new node accordingly.
The administration middle tier connects to the Oracle database with a JDBC connection specified in the ultrasearch.properties file. If the client serving node is down, then you must manually edit the ultrasearch.properties file to connect to a different Oracle instance.
Query components should fully utilize Real Application Clusters. You can specify the JDBC connection string as a database connect descriptor so that it can connect to any Oracle instance in Real Application Clusters. For example:
The connect string used by Oracle Ultra Search crawler is initialized during installation and can be changed with the
SET_LAUNCH_INSTANCE API. When there is a system configuration change, such as adding or dropping a node, the connect string is changed automatically.
Thin driver (default)
driver_type = 0
driver_type = 1
This API requires super-user privileges. The change affects all Oracle Ultra Search instances.
Note:The OCI driver requires that environment variables, such as
See Also:Oracle Database JDBC Developer's Guide and Reference for configuration details on using the OCI driver.
WK_ADM.GET_JDBC_DRIVER RETURN NUMBER;
Oracle Ultra Search crawls database tables in the local Oracle Database instance where Oracle Ultra Search is installed. Additionally, it can crawl remote databases if they have been linked to the main Oracle Database. Remote databases are linked to the main Oracle instance with database links.
See Also:Oracle Database Administrator's Guide for instructions on how to create database links
Oracle Ultra Search provides a logging mechanism to optimize crawling of table sources. Using this logging mechanism, only newly updated documents are revisited during the crawling process. If the source database is not an Oracle database, then you must perform a sequence of steps to use this feature.
Before creating log tables and log triggers, make sure that the Oracle Ultra Search instance schema has the CREATE ANY TABLE and CREATE ANY TRIGGER system privileges. For tables in Oracle databases, data definition language (DDL) statements are provided to create the following:
The log table stores changes that have occurred in the base table. The Oracle Ultra Search crawler uses the change information to figure out which rows need to be recrawled. For example, a log table generated by Oracle Ultra Search could be named
The structure of the log table conforms to the following rules:
For every primary key column of the base table, a column must be created in the log table.
There can be up to only eight primary key columns in the base table.
Each column in the log table that corresponds to a primary key column must be named Kx, where x is a number from one to eight.
Each column in the log table that corresponds to a primary key column must be of type
There must be exactly one column named mark that has type
The column named mark must have a default value F.
For example, the base table employees has the following structure:
|Column Name||Column Type|
If the primary key of the employees table comprises of the
NAME columns, then a log table
WK$LOG (whose name is generated automatically) is created with the following structure:
|Column Name||Column Type|
The SQL statement for creating the log table is as follows:
CREATE TABLE WK$LOG( K1 VARCHAR2(1000), K2 VARCHAR2(1000), MARK CHAR(1) default 'F');
Every time a row is inserted into the employees base table, the
INSERT trigger inserts a row into the log table. The row in the log table records the new values of the id and the name into the k1 and k2 columns. An F is inserted into the mark column to signal the crawler that work needs to be done for this row.
CREATE TABLE employees (id NUMBER, name VARCHAR2(10)); CREATE OR REPLACE TRIGGER wk$ins AFTER INSERT ON employees FOR EACH ROW BEGIN INSERT INTO WK$LOG(k1,k2,mark) VALUES(:new.id,:new.name,'F'); END; /
Every time a row is updated in the employees base table, the
UPDATE trigger inserts two rows into the log table. The first row in the log table records the old values of the id and the name into the k1 and k2 columns. An F is inserted into the mark column to signal the crawler that work needs to be done for this row. The second row in the log table records the new values of the id and the name into the k1 and k2 columns.
CREATE OR REPLACE TRIGGER wk$upd AFTER UPDATE ON employees FOR EACH ROW BEGIN INSERT INTO WK$LOG(k1,k2,mark) VALUES(:old.id,:old.name,'F'); INSERT INTO WK$LOG(k1,k2,mark) VALUES(:new.id,:new.name,'F'); END;/
Every time a row is deleted from the employees base table, the
DELETE trigger inserts a row into the log table. The row in the log table records the old values of the id and the name into the k1 and k2 columns. An F is inserted into the mark column to signal the crawler that work needs to be done for this row.
CREATE OR REPLACE TRIGGER wk$del AFTER DELETE ON employees FOR EACH ROW BEGIN INSERT INTO WK$LOG(k1,k2,mark) VALUES(:old.id,:old.name,'F'); END;/
For tables in non-Oracle remote databases, you must perform the following steps:
Manually create the log table. The log table must conform to the rules for log tables described earlier. Also, it must reside in the same schema and database instance as the base table.
Create three triggers that record inserts, updates, and deletes on the base table. These triggers must exhibit the same behavior as the triggers described earlier for Oracle tables.
Associate the log table. When you have completed these tasks, choose the "Enable logging mechanism (non-Oracle tables)" option during the creation of an Oracle Ultra Search table data source. By choosing that option, the Oracle Ultra Search administration tool prompts you for the name of the log table in the remote database. Oracle Ultra Search associates this log table with the base table. Oracle Ultra Search assumes that you have correctly performed steps 1 and 2.