C Using the Remote Crawler

Without the Ultra Search remote crawler, you must run the Ultra Search crawler on the same host as the Oracle9i database. For large data sets, you can improve performance by running the Ultra Search crawler on one or more hosts separate from the Oracle9i database. Because the Ultra Search crawler is a pure Java application, it can run on any such host and communicates with the Oracle9i database through JDBC.
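
For illustration only, the following is a minimal sketch of how a pure Java program, such as the remote crawler, connects to an Oracle9i database through the JDBC thin driver. The host name, port, SID, and schema credentials shown here are placeholders; the actual connection settings are supplied by the Ultra Search configuration, not by code that you write.

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class BackendConnectionSketch {
        public static void main(String[] args) throws Exception {
            // Load the Oracle JDBC thin driver. The host, port, SID, and
            // schema credentials below are placeholders only.
            Class.forName("oracle.jdbc.driver.OracleDriver");
            Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@dbhost.example.com:1521:orcl",
                "ultrasearch_instance_schema", "password");
            System.out.println("Connected to the backend database over JDBC.");
            conn.close();
        }
    }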

Ultra Search remote crawler instances are always launched by the Oracle9i database. The Oracle9i database uses Java remote method invocation (RMI) to communicate with the remote crawler hosts. Therefore, each remote host must have an RMI registry and an RMI daemon running.

By launching remote crawlers from the Oracle9i database, you can take advantage of the high availability of the Oracle9i database. The Ultra Search scheduling mechanism runs within the Oracle9i database and therefore automatically uses the database's high availability features.


Caution:

By default, RMI sends data over the network unencrypted. Using the remote crawler to perform crawling introduces a potential security risk. A malicious entity within the enterprise could steal the Ultra Search instance schema and password by listening to packets going across the network. Refrain from using the remote crawler feature if this security risk is unacceptable.


Scalability and Load Balancing

Each Ultra Search schedule can be associated with exactly one crawler. The crawler can run locally on the Oracle9i database host or on a remote host. There is no limit to the number of schedules that can run, nor to the number of remote crawler hosts. However, the Ultra Search middle tier component must be installed on each remote crawler host.

By using several remote crawler hosts and carefully allocating schedules to specific hosts, you can achieve scalability and load balancing of the entire crawling process.

How is "Launching" Achieved?

Launching is done using Java RMI technology.

  1. When a crawling schedule is activated, the Ultra Search scheduler launches a Java program as a separate process on the database host. This Java program is known as the ActivationClient.
  2. This program attempts to connect to the remote crawler host through the standard RMI registry and RMI daemon on ports 1098 and 1099. If successful, the ActivationClient receives a remote reference to a Java object running on the remote host. This remote Java object is known as the ActivatableCrawlerLauncher.
  3. The ActivationClient then instructs the ActivatableCrawlerLauncher to launch the Ultra Search crawler on the remote host. The ActivatableCrawlerLauncher launches the Ultra Search crawler as a separate Java process on the remote host.
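
The following sketch shows, in simplified form, the kind of registry lookup performed in step 2. The host name and the binding name used here are hypothetical; the actual ActivationClient and ActivatableCrawlerLauncher classes, and the names under which they are registered, are internal to Ultra Search.

    import java.rmi.Remote;
    import java.rmi.registry.LocateRegistry;
    import java.rmi.registry.Registry;

    public class LaunchLookupSketch {
        public static void main(String[] args) throws Exception {
            // Contact the standard RMI registry on the remote crawler host
            // (port 1099). Host name and binding name are hypothetical.
            Registry registry =
                LocateRegistry.getRegistry("crawlerhost.example.com", 1099);
            Remote launcher = registry.lookup("ActivatableCrawlerLauncher");
            // At this point the ActivationClient holds a remote reference and
            // would invoke a method on it to start the crawler process on the
            // remote host.
            System.out.println("Obtained remote reference: " + launcher);
        }
    }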

Installation and Configuration Sequence

  1. Make sure that you have installed the Ultra Search server components on the Oracle9i database, the Ultra Search middle tier component on one or more Web server hosts, and the Ultra Search middle tier component on all remote crawler hosts.

    See Also:

    Chapter 2, "Installing and Configuring Ultra Search"

  2. Export the following common resources on the database host:
    • The temporary directory
    • The log directory
    • The mail archive directory (if you are using the Ultra Search mailing list feature)

    These resources are merely directories that must be accessible by all remote crawler instances over the network. Use whatever mechanism you want to share these resources with a remote crawler host.

    The remote crawler code is pure Java. Therefore, it is platform independent. For example, your Ultra Search installation might consist of four hosts: one database server (host X) running Solaris, on which the Ultra Search server component is installed; one remote crawler host (host Y1) running Windows NT; one remote crawler host (host Y2) running Solaris; and one remote crawler host (host Y3) running Linux.

    In this scenario, you export the shared directories on host X using the operating system's NFS export facility. You then use the UNIX "mount" command on hosts Y2 and Y3 to mount the exported directories. For host Y1, you must purchase a third-party NFS client for Windows NT and use it to mount the shared directories. (Note: if host X is a Linux server, then you can create Samba shares and mount those shares on Windows NT without needing any third-party software.)
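
    If you want to confirm that a remote crawler host can see the shared directories before continuing, a simple check such as the following can help. The mount point paths are placeholders; substitute the paths you used when mounting the exported directories.

        import java.io.File;

        public class MountPointCheck {
            public static void main(String[] args) {
                // Placeholder mount points as seen from the remote crawler
                // host; substitute the paths you used when mounting the
                // directories exported from the database host.
                String[] mountPoints = {
                    "/mnt/ultrasearch/temp",   // temporary directory
                    "/mnt/ultrasearch/log",    // log directory
                    "/mnt/ultrasearch/mail"    // mail archive directory (optional)
                };
                for (int i = 0; i < mountPoints.length; i++) {
                    File dir = new File(mountPoints[i]);
                    boolean ok = dir.isDirectory() && dir.canRead() && dir.canWrite();
                    System.out.println(mountPoints[i]
                        + (ok ? " is accessible" : " is NOT accessible"));
                }
            }
        }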

  3. Configure the remote crawler with the administration tool.

    Edit the remote crawler profile by manually entering the mount points for the shared crawler resources that you defined in the previous step. To do so, navigate to the "Crawler: Remote Crawler Profiles" page and click Edit for the remote crawler profile that you want to edit. Specify values for the following parameters:

    • Mount point for temporary directory path as seen by the remote crawler
    • Mount point for log directory path as seen by the remote crawler
    • Mount point for mail archive path as seen by the remote crawler (if you are using the Ultra Search mailing list feature)

    Additionally, you must specify the following crawler parameters before you can begin crawling:

    • Number of crawler threads that the remote crawler uses for gathering documents
    • Number of processors on the remote crawler host
  4. Complete the crawler configuration with the administration tool.

    At a minimum, you will likely need to configure the following parameters:

    • Seed URLs
    • Web proxy
    • A schedule

    Each schedule must be assigned to either a remote crawler or the local crawler (the local crawler is the crawler that runs on the Oracle9i database host itself). To assign a schedule to a remote crawler host or to the local database host, click the host name of the schedule on the Schedules page.

    You can also turn off the remote crawler feature for each schedule, thereby forcing the schedule to launch a crawler on the local database host instead of the specified remote crawler host. To turn off the remote crawler feature, click the host name of a schedule on the "Synchronization Schedules" page. If a remote crawler host is selected, then you can enable or disable the remote crawler.

    See Also:

    Chapter 5, "Understanding the Ultra Search Administration Tool"

  5. Start up the RMI registry and RMI daemon on each remote crawler host.

    Use the helper scripts in $ORACLE_HOME/tools/remotecrawler/scripts/<operating system> to do this.

    • If the remote crawler is running on a UNIX platform, then source the $ORACLE_HOME/tools/remotecrawler/scripts/unix/runall.sh Bourne shell script.
    • If the remote crawler is running on a Windows NT host, then run the %ORACLE_HOME%\tools\remotecrawler\scripts\winnt\runall.bat file.

    The runall.sh and runall.bat scripts perform the following tasks in sequence:

    • define_env is invoked to define necessary environment variables
    • runregistry is invoked to start up the RMI registry
    • runrmid is invoked to start up the RMI daemon
    • register_stub is invoked to register the necessary Java classes with the RMI subsystem

    You can invoke runregistry, runrmid, and register_stub individually. However, you must first invoke define_env to define the necessary environment variables.

  6. Launch the remote crawler from the administration tool, and verify that it is running.

    The state of the schedule is listed on the Schedules page. If the launch fails, the schedule can take up to 90 seconds to change state from LAUNCHING to FAILED.

    To view the schedule status, click the crawler status in the schedules list. For more detail, especially in the event of failure, click the schedule status itself to display a detailed schedule status.

    The remote crawler fails to launch if any of the following requirements is not met (a quick connectivity check is sketched after this list):

    • The RMI registry is not running and listening on port 1099 of each remote host.
    • The RMI daemon is not running and listening on port 1098 of each remote host.
    • The necessary Java objects have not been successfully registered with each RMI registry.
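
    If the launch fails, a quick connectivity check along the following lines can help determine which requirement is not met. The host name is a placeholder, and the exact names bound in the registry are internal to Ultra Search, so this sketch only confirms that the RMI registry and RMI daemon are reachable.

        import java.net.Socket;
        import java.rmi.registry.LocateRegistry;
        import java.rmi.registry.Registry;

        public class RemoteHostCheck {
            public static void main(String[] args) throws Exception {
                String host = "crawlerhost.example.com";  // placeholder host name

                // The RMI registry should accept connections on port 1099 and
                // list the names bound by register_stub (the exact names are
                // internal to Ultra Search, so only the count is reported).
                Registry registry = LocateRegistry.getRegistry(host, 1099);
                String[] names = registry.list();
                System.out.println("RMI registry reachable; "
                    + names.length + " name(s) bound.");

                // The RMI daemon (rmid) should be listening on port 1098.
                Socket rmidSocket = new Socket(host, 1098);
                System.out.println("RMI daemon port 1098 is accepting connections.");
                rmidSocket.close();
            }
        }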

After a remote crawler is launched, verify that it is running by one or more of the following methods: