Using the Remote Crawler
Without the Ultra Search remote crawler, you must run
the Ultra Search Crawler on the same host as the Oracle9i Server.
For large data sets, you can improve performance by running the
Ultra Search Crawler on one or more hosts separate from the
Oracle9i Server. Because the Ultra Search Crawler is a pure Java
application, it communicates with the Oracle9i Server via JDBC.
Ultra Search remote crawler instances are always launched by the
Oracle9i Server. The Oracle9i Server uses Java Remote Method
Invocation (RMI) to communicate with the remote crawler hosts.
Therefore, each remote host must have an RMI Registry and an RMI
Daemon up and running.
Remote crawlers are launched from the Oracle9i Server in order
to leverage the high availability of the Oracle9i Server.
The Ultra Search scheduling mechanism runs within the Oracle9i Server
and therefore automatically uses the database's high availability.
Scalability and Load Balancing
Each Ultra Search schedule can be associated with exactly one crawler.
The crawler can run locally on the Oracle9i Server host or on
a remote host. There is no limit to the number of schedules that
can be run. Similarly, there is no limit to the number of remote
crawler hosts that can be used.
However, the Ultra Search Middle-Tier Components must be installed
on each remote crawler host.
By using several remote crawler hosts and carefully allocating
schedules to specific hosts, you can achieve scalability and
load balancing of the entire crawling process.
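
As a planning aid, the allocation of schedules to hosts can be sketched in Java. This is only an illustration, not part of Ultra Search: the class name, the host names, and the per-schedule document estimates are all hypothetical, and the greedy "least-loaded host first" rule is just one reasonable heuristic. The resulting assignments would still be entered by hand in the Administration Tool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ScheduleAllocator {
    /**
     * Assigns each schedule to exactly one host, always choosing the host
     * with the smallest estimated load so far. Returns schedule -> host.
     */
    static Map<String, String> allocate(Map<String, Integer> estimatedDocs, List<String> hosts) {
        if (hosts.isEmpty()) {
            throw new IllegalArgumentException("at least one crawler host is required");
        }
        Map<String, Long> load = new LinkedHashMap<>();
        for (String host : hosts) {
            load.put(host, 0L);
        }
        // Place the largest schedules first so they spread across hosts.
        List<Map.Entry<String, Integer>> schedules = new ArrayList<>(estimatedDocs.entrySet());
        schedules.sort(Map.Entry.<String, Integer>comparingByValue().reversed());

        Map<String, String> assignment = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> schedule : schedules) {
            String target = hosts.get(0);
            for (String host : hosts) {
                if (load.get(host) < load.get(target)) {
                    target = host;
                }
            }
            assignment.put(schedule.getKey(), target);
            load.merge(target, (long) schedule.getValue(), Long::sum);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Hypothetical schedules with rough document-count estimates.
        Map<String, Integer> schedules = new LinkedHashMap<>();
        schedules.put("intranet", 50000);
        schedules.put("product-docs", 20000);
        schedules.put("mailing-lists", 15000);
        System.out.println(allocate(schedules, Arrays.asList("Y1", "Y2")));
    }
}
```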
How is this "launching" achieved?
It is done using Java Remote Method Invocation (RMI) technology.
Step 1: When a crawling schedule is activated, the Ultra Search scheduler
launches a Java program as a separate process on the database
host. We shall refer to this Java program as the ActivationClient.
Step 2: This program attempts to connect to the remote crawler
host via the standard RMI Registry (port 1099) and RMI Daemon
(port 1098). If successful, the ActivationClient receives a remote
reference to a Java object running on the remote host. This remote
Java object is known as the ActivatableCrawlerLauncher.
Step 3: The ActivationClient then instructs the ActivatableCrawlerLauncher
to launch the Ultra Search Crawler on the remote host.
The ActivatableCrawlerLauncher launches the Ultra Search Crawler as a
separate Java process on the remote host.
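
The lookup-and-invoke pattern in these steps can be illustrated with a small, self-contained Java RMI sketch. It is not the Ultra Search implementation: the CrawlerLauncher interface, the binding name, and port 41099 are assumptions made for the example, and it exports the object with UnicastRemoteObject instead of the RMI activation system and RMI Daemon, which recent JDKs no longer include. Both sides run in one JVM so the example can be run as is.

```java
import java.rmi.NotBoundException;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

final class RemoteLaunchSketch {
    /** Hypothetical remote interface standing in for the ActivatableCrawlerLauncher. */
    public interface CrawlerLauncher extends Remote {
        String launchCrawler(String scheduleName) throws RemoteException;
    }

    /** Remote-host side. A real launcher would start the crawler as a separate Java process. */
    static final class LauncherImpl implements CrawlerLauncher {
        @Override
        public String launchCrawler(String scheduleName) {
            return "crawler launched for schedule " + scheduleName;
        }
    }

    /** Database-host side (the ActivationClient role): look up the launcher and invoke it. */
    static String activate(String host, int registryPort, String scheduleName)
            throws RemoteException, NotBoundException {
        Registry registry = LocateRegistry.getRegistry(host, registryPort);
        CrawlerLauncher launcher = (CrawlerLauncher) registry.lookup("CrawlerLauncher");
        return launcher.launchCrawler(scheduleName);
    }

    /** Runs both roles in one JVM on the loopback address, then shuts the registry down. */
    static String demo(int registryPort, String scheduleName) {
        // Make stubs point at the loopback address regardless of host name resolution.
        System.setProperty("java.rmi.server.hostname", "127.0.0.1");
        try {
            Registry registry = LocateRegistry.createRegistry(registryPort);
            LauncherImpl impl = new LauncherImpl();
            try {
                Remote stub = UnicastRemoteObject.exportObject(impl, 0);
                registry.bind("CrawlerLauncher", stub);
                return activate("127.0.0.1", registryPort, scheduleName);
            } finally {
                // Unexport so the JVM can exit; ignore "was never exported".
                try {
                    UnicastRemoteObject.unexportObject(impl, true);
                } catch (RemoteException ignored) {
                    // impl was not exported because an earlier step failed
                }
                UnicastRemoteObject.unexportObject(registry, true);
            }
        } catch (Exception e) {
            throw new IllegalStateException("RMI demo failed", e);
        }
    }

    public static void main(String[] args) {
        // Port 41099 is arbitrary; the document's registry listens on 1099.
        System.out.println(demo(41099, "nightly"));
    }
}
```

In a real deployment the two roles run on different hosts, so only the activate method corresponds to what runs on the database host.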
Installation and Configuration Sequence
- Make sure that you have installed the Ultra Search Server
Component on the Oracle9i Server.
See Installing Oracle Ultra Search
- Make sure that you have installed the Ultra Search Middle-Tier
Components on one or more web server hosts.
See Installing Oracle Ultra Search
- Install the Ultra Search Middle-Tier Components on all
Remote Crawler hosts.
See Installing Oracle Ultra Search
- Export the following common resources on the database host:
- The temporary directory
- The log directory
- The mail archive directory (only if you are using the Ultra
Search mailing list feature)
These resources are merely directories that must be accessible
by all remote crawler instances over the network. You can
use whatever mechanism you wish to share these resources with
a remote crawler host.
The remote crawler code is pure Java and is therefore platform
independent. For example, your Ultra Search installation might
consist of four hosts: a database server (host X) running
Solaris, on which the Ultra Search server component is
installed; a remote crawler host (host Y1) running Windows NT;
a remote crawler host (host Y2) running Solaris; and a remote
crawler host (host Y3) running Linux.
In this scenario, you export the shared directories on host
X using the UNIX "export" command. You then use the UNIX "mount"
command on hosts Y2 and Y3 to mount the exported directories.
For host Y1, you have to purchase a third-party NFS client
for Windows NT and use that to mount the shared directories.
(Note that if host X were a Linux server, you could create Samba
shares and mount those shares on Windows NT without
needing to purchase any third-party software.)
- Configure the remote crawler via the Administration Tool.
Edit the remote crawler profile by manually entering
all mount points for the shared crawler resources that you defined
in step 4. To edit the remote crawler profile, navigate to the
"Crawler : remote crawler Profiles" page and click the edit
icon of the remote crawler profile you wish to edit.
Specify values for the following parameters:
- Mount point for temporary directory path as seen by the
Remote Crawler
- Mount point for log directory path as seen by the Remote
Crawler
- Mount point for mail archive path as seen by the Remote
Crawler (only if you are using the Ultra Search mailing
list feature)
Additionally, you must specify the following crawler parameters
before you can begin crawling:
- Number of crawler threads that the remote crawler uses
for gathering documents.
- Number of processors on the remote crawler host.
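
If you are unsure how many processors a remote crawler host has, a one-line Java check run on that host reports the number the JVM sees. This is a convenience sketch with a hypothetical class name; the value it prints is what you would then enter in the profile by hand.

```java
final class CrawlerHostInfo {
    /** Number of processors available to the JVM on this host. */
    static int processorCount() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("Processors visible to the JVM: " + processorCount());
    }
}
```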
- Complete the crawler configuration via the Administration
Tool.
Consult the Ultra Search Administration Tool documentation
for details. The minimum set of parameters that will likely
need to be configured is:
- Seed URLs
- Web Proxy
- A schedule
Note that each schedule must be assigned to a remote crawler
or the local crawler (the local crawler is the crawler that
runs on the local Oracle9i database host itself). To assign
a schedule to a remote crawler host or the local database
host, click on the hostname of a schedule in the "Synchronization
Schedules" page.
Note that you can also "turn off" the remote crawler feature
for each schedule, thereby forcing the schedule to launch
a crawler on the local database host instead of the specified
remote crawler host. To "turn off" the remote crawler
feature, click on the hostname of a schedule in the "Synchronization
Schedules" page. If a remote crawler host is selected,
you will be able to enable or disable the remote crawler.
- Start up the RMI registry and RMI daemon on each remote
crawler host.
You can use the helper scripts in $WEB_ORACLE_HOME/tools/remotecrawler/scripts/
to do so.
If the remote crawler is running on a UNIX platform, you
can source the $WEB_ORACLE_HOME/tools/remotecrawler/scripts/unix/runall.sh
Bourne shell script.
If the remote crawler is running on a Windows NT host, you
can run the corresponding runall.bat script.
The runall.sh and runall.bat scripts perform the following
tasks in sequence:
- define_env is invoked to define necessary environment
variables
- runregistry is invoked to start up the RMI Registry
- runrmid is invoked to start up the RMI Daemon
- register_stub is invoked to register the necessary Java
classes with the RMI subsystem
You can invoke runregistry, runrmid and register_stub individually.
However, you must first invoke define_env to define the necessary
environment variables.
- Launch the remote crawler from the Administration Tool
and verify that it is running.
The state of the schedule is listed in the Schedules page.
Be patient: if the launch fails, the schedule can take up to
90 seconds to change state from LAUNCHING to FAILED.
To view the schedule status, click on the crawler status
in the schedules list. To view more details, especially in
the event of failure, click on the schedule status itself.
This will bring up a detailed schedule status page.
The remote crawler will fail to launch if any of the
following conditions is true:
- The RMI registry is not running and listening on port
1099 of each remote host.
- The RMI daemon is not running and listening on port 1098
of each remote host.
- The necessary Java objects have not been successfully
registered with each RMI registry.
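
The first two conditions can be checked from the database host before launching a schedule. The following Java sketch simply tries to open a TCP connection to each port; the class name and the two-second timeout are assumptions. It only confirms that something is listening on the port, not that the necessary Java objects have been registered.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

final class RmiPortCheck {
    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Pass the remote crawler host name as the first argument.
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println("RMI registry (port 1099) listening: " + isListening(host, 1099, 2000));
        System.out.println("RMI daemon (port 1098) listening: " + isListening(host, 1098, 2000));
    }
}
```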
Once a remote crawler is launched, you can verify that it
is running by one or more of the following methods:
- Check for active Java processes on the remote crawler host.
A simple way to confirm that the remote crawler is running on
the remote crawler host is to use an operating system command
such as "ps" on UNIX systems. You should look for
active Java processes.
- Monitor the contents of the schedule log file.
If the remote crawler is running successfully,
you should see the contents of the schedule log file changing
periodically. The schedule log file is located in the
shared log directory.
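
Watching the log file for changes can also be automated. The sketch below, with a hypothetical class name and a 30-second interval chosen arbitrarily, compares the size and last-modified time of a file at two points in time. The path to the schedule log file must be supplied as an argument, because the log file name depends on your installation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class LogChangeCheck {
    /** Size and last-modified time of a file at one moment. */
    static final class Snapshot {
        final long size;
        final long modifiedMillis;

        Snapshot(long size, long modifiedMillis) {
            this.size = size;
            this.modifiedMillis = modifiedMillis;
        }
    }

    static Snapshot snapshot(Path logFile) throws IOException {
        return new Snapshot(Files.size(logFile), Files.getLastModifiedTime(logFile).toMillis());
    }

    /** True if the file grew, shrank, or was rewritten between the two snapshots. */
    static boolean changed(Snapshot before, Snapshot after) {
        return before.size != after.size || before.modifiedMillis != after.modifiedMillis;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java LogChangeCheck <path-to-schedule-log>");
            return;
        }
        Path logFile = Paths.get(args[0]);
        Snapshot before = snapshot(logFile);
        Thread.sleep(30_000); // arbitrary interval; adjust to your crawl rate
        Snapshot after = snapshot(logFile);
        System.out.println(changed(before, after)
                ? "Log file changed; the crawler appears to be active."
                : "Log file did not change in 30 seconds.");
    }
}
```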