Oracle® Ultra Search User’s Guide
10g (9.0.4)
Part No. B10896-01
4 Tuning and Performance

This chapter contains the following sections:

  • Tuning the Web Crawling Process

  • Tuning Query Performance

  • Using the Remote Crawler

  • Ultra Search on Real Application Clusters

  • Table Data Source Synchronization

4.1 Tuning the Web Crawling Process

The Ultra Search crawler is a powerful tool for discovering information on Web sites in an organization's intranet. This tuning advice is especially relevant to Web crawling, because the other data source types (for example, table or email data sources) are defined such that the crawler does not follow links to other documents that you might not be aware of.

4.1.1 Web Crawling Strategy

Your Web crawling strategy can be as simple as identifying a few well-known sites that are likely to contain links to most of the other intranet sites in your organization. You could test this by crawling these sites without indexing them. After the initial crawl, you have a good idea of the hosts that exist in your intranet. You could then define separate Web sources to facilitate crawling and indexing on individual sites.

However, in reality, the process of discovering and crawling your organization's intranet is an iterative one, characterized by periodic analysis of crawling results and modification of crawling parameters to direct the crawling process. For example, if you observe that the crawler is spending days crawling one Web host, then you might want to exclude crawling at that host or limit the crawling depth.

4.1.2 Monitoring the Crawling Process

Monitor the crawling process by using a combination of the following methods:

  • Monitoring the schedule status with the administration tool

  • Monitoring the real time schedule progress with the administration tool

  • Monitoring the crawler statistics with the administration tool

  • Monitoring the log file for the current schedule

4.1.3 URL Looping

URL looping refers to the scenario where, for some reason, a large number of unique URLs all point to the same document. One particularly difficult situation is where a site contains a large number of pages, and each page contains links to every other page in the site. Ordinarily, this would not be a problem, because the crawler eventually analyzes all documents in the site.

However, some Web servers attach parameters to generated URLs to track information across requests. Such Web servers might generate a large number of unique URLs that all point to the same document.

For example, http://mycompany.com/somedocument.html?p_origin_page=10 might refer to the same document as http://mycompany.com/somedocument.html?p_origin_page=13 but the p_origin_page parameter is different for each link, because the referring pages are different. If a large number of parameters are specified and if the number of referring links is large, then a single unique document could have thousands or tens of thousands of links referring to it. This is an example of how URL looping can occur.
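One way to mitigate this (a general technique, not an Ultra Search feature) is to canonicalize URLs before comparing them, stripping query parameters that are known to carry only tracking state. The sketch below uses the p_origin_page parameter from the example above; the class name and parameter list are hypothetical and would need to match your site's URL scheme:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Set;
import java.util.StringJoiner;

public class UrlCanonicalizer {
    // Parameters assumed to carry only tracking state; adjust for your site.
    private static final Set<String> TRACKING_PARAMS = Set.of("p_origin_page");

    /** Returns the URL with tracking query parameters removed. */
    public static String canonicalize(String url) throws URISyntaxException {
        URI uri = new URI(url);
        String query = uri.getQuery();
        if (query == null) {
            return url;                      // nothing to strip
        }
        StringJoiner kept = new StringJoiner("&");
        for (String pair : query.split("&")) {
            String name = pair.split("=", 2)[0];
            if (!TRACKING_PARAMS.contains(name)) {
                kept.add(pair);              // keep parameters that matter
            }
        }
        String cleaned = kept.length() == 0 ? null : kept.toString();
        return new URI(uri.getScheme(), uri.getAuthority(), uri.getPath(),
                       cleaned, uri.getFragment()).toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(canonicalize(
            "http://mycompany.com/somedocument.html?p_origin_page=10"));
        // prints "http://mycompany.com/somedocument.html"
    }
}
```

With such a rule in place, both example links above reduce to the same canonical URL and would be counted as a single document.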

Monitor the crawler statistics in the Ultra Search administration tool to determine which URLs and Web servers are being crawled the most. If you observe an inordinately large number of URL accesses to a particular site or URL, then you might want to do one of the following:

  • Exclude the Web Server: This prevents the crawler from crawling any URLs at that host. (You cannot limit the exclusion to a specific port on a host.)

  • Reduce the Crawling Depth: This limits the number of levels of referred links the crawler follows. If you observe URL looping effects on a particular host, then take a visual survey of the site to estimate the depth of the leaf pages at that site. Leaf pages are pages that do not have any links to other pages. As a general guideline, set the crawling depth to the leaf page depth plus three.

Be sure to restart the crawler after altering any parameters in the Crawler Page. Your changes take effect only after restarting the crawler.

4.2 Tuning Query Performance

This section contains suggestions on how to improve the performance of the Ultra Search query. Query performance is generally measured by response time and throughput. One significant factor is how the query application obtains its JDBC connections: opening a new connection for every query is expensive, so connections should be cached and reused. For example, a pooled data source entry (shown here in the format of an OC4J data-sources.xml file; the user name, password, host name, and SID are placeholders) might look like the following:

       <data-source 
               class="oracle.jdbc.pool.OracleConnectionCacheImpl" 
               name="UltraSearchDS" 
               location="jdbc/UltraSearchPooledDS" 
               username="user" 
               password="pass"
               url="jdbc:oracle:thin:@hostname:1521:oracle_sid" 
               min-connections="2"
               max-connections="30"
               inactivity-timeout="30" >
              <property name="cacheScheme" value="1" />
        </data-source>
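To see what the min-connections and max-connections settings above accomplish, the pooling behavior can be sketched generically. This is only an illustration of the caching idea with a stand-in Conn class, not Oracle's OracleConnectionCacheImpl:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.Semaphore;

public class PoolSketch {
    /** Stand-in for a physical connection; not java.sql.Connection. */
    public static class Conn {
        final int id;
        Conn(int id) { this.id = id; }
    }

    private final Deque<Conn> idle = new ArrayDeque<>();
    private final Semaphore permits;   // caps checked-out connections (max-connections)
    private int created = 0;

    public PoolSketch(int minConnections, int maxConnections) {
        permits = new Semaphore(maxConnections);
        for (int i = 0; i < minConnections; i++) {
            idle.push(new Conn(created++));      // pre-open min-connections
        }
    }

    /** Blocks once max-connections are checked out; reuses idle connections first. */
    public Conn acquire() {
        permits.acquireUninterruptibly();
        synchronized (this) {
            return idle.isEmpty() ? new Conn(created++) : idle.pop();
        }
    }

    /** Returns a connection to the cache instead of closing it. */
    public void release(Conn c) {
        synchronized (this) { idle.push(c); }
        permits.release();
    }

    public synchronized int totalCreated() { return created; }

    public static void main(String[] args) {
        PoolSketch pool = new PoolSketch(2, 30);  // mirrors the entry above
        Conn first = pool.acquire();
        pool.release(first);
        Conn second = pool.acquire();             // cache hands back the same one
        System.out.println("reused: " + (first == second));
        // prints "reused: true"
        pool.release(second);
    }
}
```

The point of the sketch is that repeated queries pay the connection-open cost at most max-connections times; after that, every acquire is a cheap reuse of an idle connection.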

4.3 Using the Remote Crawler

Without the Ultra Search remote crawler, you must run the Ultra Search crawler on the same host as the Oracle Database. For large data sets, you can improve performance by running the Ultra Search crawler on one or more separate hosts from the Oracle Database. Because the Ultra Search crawler is a pure Java application, it communicates with the Oracle Database through JDBC.

Ultra Search remote crawler instances are always launched by the Oracle Database. By launching remote crawlers from the Oracle Database, you can leverage the high-availability of the Oracle Database. The Ultra Search scheduling mechanism runs within the Oracle Database and therefore automatically uses the database's high availability features.

The Oracle Database uses Java remote method invocation (RMI) to communicate with the remote crawler hosts. Therefore, each remote host must have an RMI registry and an RMI daemon running.

  1. When a crawling schedule is activated, the Ultra Search scheduler launches a Java program as a separate process on the database host. This Java program is known as the ActivationClient.

  2. This program attempts to connect to the remote crawler host through the standard RMI registry and RMI daemon on ports 1098 and 1099. If successful, then the ActivationClient receives a remote reference to a Java object running on the remote host. This remote Java object is known as the ActivatableCrawlerLauncher.

  3. The ActivationClient then instructs the ActivatableCrawlerLauncher to launch the Ultra Search crawler on the remote host. The ActivatableCrawlerLauncher launches the Ultra Search crawler as a separate Java process on the remote host.
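The three-step handshake above can be seen in miniature in the following single-JVM sketch. The Launcher interface and the names used are hypothetical stand-ins for the ActivatableCrawlerLauncher, and the registry and the "remote" object run in one process here purely for illustration:

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

public class RmiLaunchSketch {
    // Hypothetical stand-in for the ActivatableCrawlerLauncher interface.
    public interface Launcher extends Remote {
        String launchCrawler(String schedule) throws RemoteException;
    }

    static class LauncherImpl implements Launcher {
        public String launchCrawler(String schedule) {
            // A real launcher would fork a crawler JVM here.
            return "launched crawler for " + schedule;
        }
    }

    public static String demo() throws Exception {
        // "Remote host" side: a registry listening (the document uses
        // ports 1098/1099) with an exported object bound under a name.
        Registry registry = LocateRegistry.createRegistry(1099);
        LauncherImpl impl = new LauncherImpl();
        Launcher stub = (Launcher) UnicastRemoteObject.exportObject(impl, 0);
        registry.rebind("CrawlerLauncher", stub);

        // "Database host" side: the ActivationClient's steps -- contact
        // the registry, obtain the remote reference, ask it to launch.
        Registry view = LocateRegistry.getRegistry("localhost", 1099);
        Launcher remote = (Launcher) view.lookup("CrawlerLauncher");
        String result = remote.launchCrawler("nightly-web-source");

        // Unexport both objects so the JVM can exit cleanly.
        UnicastRemoteObject.unexportObject(impl, true);
        UnicastRemoteObject.unexportObject(registry, true);
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```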


Note:

By default, RMI sends data over the network unencrypted. Using the remote crawler to perform crawling introduces a potential security risk. A malicious entity within the enterprise could steal the Ultra Search instance schema and password by listening to packets going across the network. Refrain from using the remote crawler feature if this security risk is unacceptable.

4.3.1 Scalability and Load Balancing

Each Ultra Search schedule can be associated with exactly one crawler. The crawler can run locally on the Oracle database host or on a remote host. There is no limit to the number of schedules that can be run. Similarly, there is no limit to the number of remote crawler hosts that can be run. However, the Ultra Search middle tier must be installed on each remote crawler host.

By using several remote crawler hosts and carefully allocating schedules to specific hosts, you can achieve scalability and load balancing of the entire crawling process.
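As an illustration of such an allocation (Ultra Search itself leaves the assignment of schedules to hosts to the administrator), a simple round-robin policy spreads schedules evenly across the available crawler hosts. The host and schedule names below are hypothetical:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative round-robin allocation of crawling schedules to hosts.
public class ScheduleAllocator {
    public static Map<String, List<String>> roundRobin(
            List<String> hosts, List<String> schedules) {
        Map<String, List<String>> plan = new LinkedHashMap<>();
        for (String host : hosts) {
            plan.put(host, new ArrayList<>());
        }
        // Deal schedules out like cards so no host gets more than
        // one schedule beyond any other.
        for (int i = 0; i < schedules.size(); i++) {
            plan.get(hosts.get(i % hosts.size())).add(schedules.get(i));
        }
        return plan;
    }

    public static void main(String[] args) {
        Map<String, List<String>> plan = roundRobin(
            List.of("crawler-host-1", "crawler-host-2"),
            List.of("web-src", "mail-src", "table-src"));
        plan.forEach((host, s) -> System.out.println(host + " -> " + s));
    }
}
```

In practice you would weight the assignment by the expected size of each source rather than treating all schedules as equal.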

4.3.2 Installation and Configuration Sequence

  1. Make sure that you have installed the Ultra Search backend (server component) on the Oracle Database, the Ultra Search middle tier on one or more Web server hosts, and the Ultra Search middle tier on all remote crawler hosts.

  2. Export the following common resources on the database host:

    • The temporary directory

    • The log directory

    • The mail archive directory (if you are using the Ultra Search mailing list feature)

    These resources are merely directories that must be accessible by all remote crawler instances over the network. Use whatever mechanism you want to share these resources with a remote crawler host.

    The remote crawler code is pure Java. Therefore, it is platform-independent. For example, your Ultra Search installation might consist of four hosts: one database server (host X) running Solaris on which the Ultra Search backend (server component) is installed; one remote crawler host (host Y1) running on Windows; one remote crawler host (host Y2) running on Solaris; and one remote crawler host (host Y3) running on Linux.

    In this scenario, export the shared directories on host X with the operating system's NFS export facility (for example, the share command on Solaris or exportfs on Linux). Then use the UNIX mount command on hosts Y2 and Y3 to mount the exported directories. For host Y1, you must obtain a third-party NFS client for Windows and use it to mount the shared directories. If host X is a Linux server, then you can instead create Samba shares and mount them on Windows without any third-party software.

  3. Configure the remote crawler with the administration tool.

    To edit a remote crawler profile, navigate to the Crawler: Remote Crawler Profiles page and click Edit for the profile you want to edit. Manually enter all mount points for the shared crawler resources that you defined, specifying values for the following parameters:

    • Mount point for temporary directory path as seen by the remote crawler

    • Mount point for log directory path as seen by the remote crawler

    • Mount point for mail archive path as seen by the remote crawler (if you are using the Ultra Search mailing list feature)

    Additionally, you must specify the following crawler parameters before you can begin crawling:

    • Number of crawler threads that the remote crawler uses for gathering documents

    • Number of processors on the remote crawler host

  4. Complete the crawler configuration with the administration tool.

    At a minimum, you likely need to configure the following parameters:

    • Seed URLs

    • Web proxy

    • A schedule

    Each schedule must be assigned to a remote crawler or to the local crawler. (The local crawler is the crawler that runs on the local Oracle database host itself.) To assign a schedule to a remote crawler host or to the local database host, click the host name of the schedule in the Schedules page.

    You can also turn off the remote crawler feature for each schedule, thereby forcing the schedule to launch a crawler on the local database host, instead of the specified remote crawler host. To turn off the remote crawler feature, click the host name of a schedule in the Synchronization Schedules page. If a remote crawler host is selected, then you can enable or disable the remote crawler.

  5. Start up the RMI registry and RMI daemon on each remote crawler host.

    Use the helper scripts in $ORACLE_HOME/tools/remotecrawler/scripts/operating_system to do this.

    • If the remote crawler is running on a UNIX platform, then source the $ORACLE_HOME/tools/remotecrawler/scripts/unix/runall.sh Bourne shell script.

    • If the remote crawler is running on a Windows host, then run the %ORACLE_HOME%\tools\remotecrawler\scripts\winnt\runall.bat file.

    The runall.sh and runall.bat scripts perform the following tasks in sequence:

    • define_env is invoked to define necessary environment variables

    • runregistry is invoked to start up the RMI registry

    • runrmid is invoked to start up the RMI daemon

    • register_stub is invoked to register the necessary Java classes with the RMI subsystem

    You can invoke runregistry, runrmid, and register_stub individually. However, you must first invoke define_env to define the necessary environment variables.

  6. Launch the remote crawler from the administration tool, and verify that it is running.

    The state of the schedule is listed in the Schedules page. The remote crawler launching process takes up to 90 seconds to change state from LAUNCHING to FAILED, if failure occurs.

    To view the schedule status, click the crawler status in the schedules list. To view more details, especially in the event of failure, click the schedule status itself. This brings up a detailed schedule status.

    The remote crawler fails to launch if any of the following is true:

    • The RMI registry is not running and listening on port 1099 of each remote host.

    • The RMI daemon is not running and listening on port 1098 of each remote host.

    • The necessary Java objects have not been successfully registered with each RMI registry.

After a remote crawler is launched, verify that it is running by one or more of the following methods:

  • Check for active Java processes on the remote crawler host.

    A simple way to confirm that the remote crawler is running on the remote crawler host is to use an operating system command, such as ps on UNIX systems, and look for active Java processes.

  • Monitor the contents of the schedule log file.

    If the remote crawler is running successfully, you should see the contents of the schedule log file changing periodically. The schedule log file is located in the shared log directory.

4.4 Ultra Search on Real Application Clusters

Ultra Search can crawl on one fixed node or on any node, depending on the storage access configuration of the Real Application Clusters system. PL/SQL APIs are provided to specify which node should run the crawler, if needed. For Ultra Search administration and the Ultra Search query application, you can configure the connection string to connect to any node of Real Application Clusters.


See Also:

The documentation for Oracle Database Real Application Clusters

4.4.1 Configuring Storage Access

The disk of any node in a Real Application Clusters system can be shared (cluster file system) or not shared (raw disk). For Real Application Clusters on a cluster file system (CFS), the cache files generated by the crawler on any node are visible to any Oracle instance and can be indexed by any Oracle instance that performs index synchronization. If the disk is not shared, then the crawler must run on one particular Oracle instance to ensure that all cache files can be indexed.

This is due to the nature of Oracle Text indexing, where rows inserted into one table by different sessions go to the same pending queue, and whoever initiates index synchronization attempts to index all of the inserted rows. Because of this limitation, on a CFS, Ultra Search is configured to launch the crawler on any database instance. If it is not on a CFS, then Ultra Search launches the crawler on the database instance where INSTANCE_NUMBER = 1.

The Ultra Search administrator can configure which instance runs the crawler with the following PL/SQL API:

WK_ADM.SET_LAUNCH_INSTANCE(instance_name, connect_url);

where instance_name is the name of the launching instance (or the database name if it is to be launched on any node) and connect_url is the connect descriptor.

For connection to a single database instance, the connect string can be either the short form "host:port:SID" or a full connect descriptor (Oracle Net keyword-value pairs). For example:

(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=cls02a)(PORT=3999)))
  (CONNECT_DATA=(SERVICE_NAME=acme.us.com)))

To connect to any database instance, the full database connect descriptor must be used. For example:

(DESCRIPTION=(LOAD_BALANCE=yes)
  (ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=cls02a)(PORT=3999))
                (ADDRESS=(PROTOCOL=TCP)(HOST=cls02b)(PORT=3999)))
  (CONNECT_DATA=(SERVICE_NAME=acme.us.com)))

See Also:

Oracle9i JDBC Developer's Guide and Reference for configuration details.

You cannot configure Ultra Search to launch the crawler on any node on a non-cluster file system.

To query on the existing launching instance configuration, use the following PL/SQL API:

WK_ADM.GET_LAUNCH_INSTANCE RETURN VARCHAR2;

This returns the name of the launching instance or the database name if any node can launch the crawler.

4.4.1.1 Remote Crawler File Cache

The Ultra Search remote crawler requires that the remote file system be mounted on the Oracle instance for indexing.

For cluster file system Real Application Clusters, the file system of the remote computer should be NFS mounted to all nodes of the system.

For non-cluster file system Real Application Clusters, the NFS mount can be limited to the specific node where the Oracle instance is serving the remote crawler. There is no advantage to mounting the remote file system on all nodes; doing so can lead to stale NFS handles when nodes go down. When a configuration change moves the launching instance to a different node, NFS mount the remote file system on the new node accordingly.

4.4.2 Logging on to the Oracle Instance

All components of Ultra Search use the JDBC Thin Driver with the connect string consisting of "hostname:port:SID" or the full connect descriptor as seen in tnsnames.ora.

The administration middle tier connects to the Oracle database with a JDBC connection specified in the ultrasearch.properties file. If the client serving node is down, then you must manually edit the ultrasearch.properties file to connect to a different Oracle instance.

4.4.2.1 Query Search Application for Real Application Clusters

Query components should fully utilize Real Application Clusters. You can specify the JDBC connection string as a database connect descriptor so that it can connect to any Oracle instance in Real Application Clusters. For example:

"jdbc:oracle:thin:@(DESCRIPTION=(LOAD_BALANCE=yes)
  (ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=cls02a)(PORT=3999))
                (ADDRESS=(PROTOCOL=TCP)(HOST=cls02b)(PORT=3999)))
  (CONNECT_DATA=(SERVICE_NAME=acme.us.com)))"

See Also:

Oracle9i JDBC Developer's Guide and Reference

4.4.2.2 Java Crawler

The connect string used by Ultra Search crawler is initialized during installation and can be changed with the WK_ADM.SET_LAUNCH_INSTANCE API. When there is a system configuration change, such as adding or dropping a node, the connect string is changed automatically.

4.4.2.3 Choosing a JDBC Driver

The Ultra Search administrator can optionally configure the local crawler to use the JDBC OCI driver to log on to the database. This is done with the following PL/SQL API:

WK_ADM.SET_JDBC_DRIVER(driver_type)

where driver_type is one of the following:

  • 0: JDBC thin driver (the default)

  • 1: JDBC OCI driver

This API requires super-user privileges. The change affects all Ultra Search instances.


Note:

The OCI driver requires that environment variables, such as LD_LIBRARY_PATH and NLS_LANG, be set properly on the launching database instance. The crawler inherits the environment setting from the Oracle process. Therefore, you must configure them appropriately before starting Oracle.


See Also:

Oracle9i JDBC Developer's Guide and Reference for configuration details on using the OCI driver.

The following PL/SQL API returns which JDBC driver is currently in use:

WK_ADM.GET_JDBC_DRIVER RETURN NUMBER;

4.5 Table Data Source Synchronization

Ultra Search crawls database tables in the local Oracle Database instance where Ultra Search is installed. Additionally, it can crawl remote databases that have been linked to the main Oracle Database with database links.


See Also:

Oracle9i Database Administrator's Guide for instructions on how to create database links

Ultra Search provides a logging mechanism to optimize crawling of table sources. Using this logging mechanism, only newly updated documents are revisited during the crawling process. If the source database is not an Oracle database, then you must perform a sequence of steps to use this feature.

4.5.1 Synchronizing Crawling of Oracle Databases

Before creating log tables and log triggers, make sure that the Ultra Search instance schema has the CREATE ANY TABLE and CREATE ANY TRIGGER system privileges. For tables in Oracle databases, data definition language (DDL) statements are provided to create the following:

4.5.1.1 Create Log Table

The log table stores changes that have occurred in the base table. The Ultra Search crawler uses the change information to figure out which rows need to be recrawled. For example, a log table generated by Ultra Search could be named WK$LOG.

The structure of the log table conforms to the following rules:

  1. For every primary key column of the base table, a column must be created in the log table.

  2. The base table can have at most eight primary key columns.

  3. Each column in the log table that corresponds to a primary key column must be named Kx, where x is a number from one to eight.

  4. Each column in the log table that corresponds to a primary key column must be of type VARCHAR2(1000).

  5. There must be exactly one column named mark that has type CHAR(1).

  6. The column named mark must have a default value of 'F'.

For example, the base table employees has the following structure:

Column Name   Column Type
ID            NUMBER
NAME          VARCHAR2(200)
ADDRESS       VARCHAR2(400)
TELEPHONE     VARCHAR2(10)
USERNAME      VARCHAR2(24)

If the primary key of the employees table consists of the ID and NAME columns, then a log table WK$LOG (whose name is generated automatically) is created with the following structure:

Column Name   Column Type
K1            VARCHAR2(1000)
K2            VARCHAR2(1000)
MARK          CHAR(1)

The SQL statement for creating the log table is as follows:

CREATE TABLE WK$LOG(
  K1 VARCHAR2(1000),
  K2 VARCHAR2(1000),
  MARK CHAR(1) DEFAULT 'F');

4.5.1.2 Create Log Triggers

An INSERT trigger, UPDATE trigger, and DELETE trigger are created. The Oracle trigger definitions are as follows:

4.5.1.2.1 INSERT Trigger Statement

Every time a row is inserted into the employees base table, the INSERT trigger inserts a row into the log table. The row in the log table records the new values of the id and the name into the k1 and k2 columns. An F is inserted into the mark column to signal the crawler that work needs to be done for this row.

For example:

CREATE OR REPLACE TRIGGER wk$ins
AFTER INSERT ON employees
FOR EACH ROW
BEGIN
  INSERT INTO WK$LOG(k1,k2,mark)
    VALUES(:new.id,:new.name,'F');
END;
4.5.1.2.2 UPDATE Trigger Statement

Every time a row is updated in the employees base table, the UPDATE trigger inserts two rows into the log table. The first row in the log table records the old values of the id and the name into the k1 and k2 columns. An F is inserted into the mark column to signal the crawler that work needs to be done for this row. The second row in the log table records the new values of the id and the name into the k1 and k2 columns.

For example:

CREATE OR REPLACE TRIGGER wk$upd
AFTER UPDATE ON employees
FOR EACH ROW
BEGIN
  INSERT INTO WK$LOG(k1,k2,mark)
    VALUES(:old.id,:old.name,'F');
  INSERT INTO WK$LOG(k1,k2,mark)
    VALUES(:new.id,:new.name,'F');
END;
4.5.1.2.3 DELETE Trigger

Every time a row is deleted from the employees base table, the DELETE trigger inserts a row into the log table. The row in the log table records the old values of the id and the name into the k1 and k2 columns. An F is inserted into the mark column to signal the crawler that work needs to be done for this row.

For example:

CREATE OR REPLACE TRIGGER wk$del
AFTER DELETE ON employees
FOR EACH ROW
BEGIN
  INSERT INTO WK$LOG(k1,k2,mark)
    VALUES(:old.id,:old.name,'F');
END;

4.5.2 Synchronizing Crawling of Non-Oracle Databases

For tables in non-Oracle remote databases, you must perform the following steps:

  1. Manually create the log table. The log table must conform to the rules for log tables described earlier. Also, it must reside in the same schema and database instance as the base table.

  2. Create three triggers that record inserts, updates, and deletes on the base table. These triggers must exhibit the same behavior as the triggers described earlier for Oracle tables.

  3. Associate the log table. When you have completed these tasks, choose the "Enable logging mechanism (non-Oracle tables)" option during the creation of an Ultra Search table data source. By choosing that option, the Ultra Search administration tool prompts you for the name of the log table in the remote database. Ultra Search associates this log table with the base table. Ultra Search assumes that you have correctly performed steps 1 and 2.