The Web Crawler can be configured to write its output directly to
a Record Store instance, instead of to an output file on disk (the default).
This procedure assumes you are modifying a single crawl configuration in the site.xml file and not the global Web Crawler configuration in default.xml.
There are two main tasks in the configuration process. You create and configure a Record Store instance to receive the Web Crawler output. Then you configure the Web Crawler to override its default output settings and instead write to the Record Store instance.
The Record Store instance configuration requires a configuration file with two properties for Web Crawler output. The Web Crawler configuration requires the following two changes to the site.xml file:
Add three output properties to specify the host and port of the machine running the Record Store, and the instance name of the Record Store that you want to write to.
Add a plugin.includes property for the recordstore-outputter plugin. This plugin instructs the Web Crawler to write to a Record Store instance and overrides the output-endeca-record plugin, which would have instructed the Web Crawler to write to an output file.
Each of these steps is fully described below.
To configure a Web Crawler to write output to a Record Store instance:
Start the CAS Service if it is not already running.
On Windows, the CAS Service is started by default.
Using the Component Instance Manager Command-line Utility, create a new Record Store instance for the Web Crawler output.
Start a command prompt and navigate to <install path>\CAS\<version>\bin.
Run the create-component task of component-manager-cmd. Specify the -t option with an argument of RecordStore. Specify the -n option with a Record Store instance name of your choice. If necessary, specify host and port information or accept the defaults.
For example, this Windows command creates a Record Store instance named WebCrawlerOutput:
C:\Endeca\CAS\3.1.1\bin>component-manager-cmd.bat create-component -h localhost -n WebCrawlerOutput -p 8500 -t RecordStore
The command prompt displays:
Successfully created component: WebCrawlerOutput
Create a Record Store configuration file that has an idPropertyName property of Endeca.Id and changePropertyNames of Endeca.Document.Text and Endeca.Web.Last-Modified.
For example, here are the contents of a configuration file named recordstore-configuration.xml:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<recordStoreConfiguration xmlns="http://recordstore.itl.endeca.com/">
  <changePropertyNames>
    <changePropertyName>Endeca.Document.Text</changePropertyName>
    <changePropertyName>Endeca.Web.Last-Modified</changePropertyName>
  </changePropertyNames>
  <idPropertyName>Endeca.Id</idPropertyName>
</recordStoreConfiguration>
Save the Record Store configuration file. You may find it convenient to save it with the other Web Crawler configuration files.
Using the Record Store Command-line Utility, set the configuration file for the Record Store instance.
Start a command prompt and navigate to <install path>\CAS\<version>\bin.
Run the set-configuration task of recordstore-cmd. Specify the -a option with an argument of the Record Store instance name. Specify the -f option with the path to the configuration file for the Record Store instance.
For example, this Windows command sets the configuration file named recordstore-configuration.xml for the Record Store instance named WebCrawlerOutput:
C:\Endeca\CAS\3.1.1\bin>recordstore-cmd.bat set-configuration -a WebCrawlerOutput -f C:\sample\webcrawler\recordstore-configuration.xml
The command prompt displays:
Successfully set recordstore configuration.
Modify the site.xml file to include the three output properties that specify the fully qualified name of the host and the port on which the Record Store is running, and the instance name of the Record Store.
For example, this snippet specifies an instance name of WebCrawlerOutput with defaults for a Record Store running locally:
<property>
  <name>output.recordStore.host</name>
  <value>hostname.endeca.com</value>
</property>
<property>
  <name>output.recordStore.port</name>
  <value>8500</value>
</property>
<property>
  <name>output.recordStore.instanceName</name>
  <value>WebCrawlerOutput</value>
</property>
In the site.xml file, add a plugin.includes property for the recordstore-outputter plugin. This plugin instructs the Web Crawler to write to a Record Store instance.
For example:
<property>
  <name>plugin.includes</name>
  <value>lib-auth-http|auth-http-form-basic|protocol-httpclient|protocol-file|urlfilter-regex|parse-(text|html|js)|endeca-searchexport-converter-parser|urlnormalizer-(pass|regex|basic)|endeca-generator-html-basic|recordstore-outputter</value>
</property>
In the site.xml file, delete the plugin.includes property for the output-endeca-record plugin, if it exists in the file.
Optionally, you can remove the properties in the site.xml file that configure output file settings. These properties include: output.file.is-compressed, output.file.is-xml, output.file.name, and output.file.directory. Removing them is useful if you want a clean configuration file, but it is not required, because adding the recordstore-outputter plugin overrides the file output properties.
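As a quick sanity check before running a crawl, the site.xml changes in this procedure can be verified with a short script. The following Python sketch is illustrative only and is not part of the CAS tooling. The function names and the sample document are hypothetical, and the `<configuration>` root element in the sample is an assumption (the check itself looks at `<property>` elements wherever they appear).

```python
import xml.etree.ElementTree as ET

# The three output properties this procedure adds to site.xml.
REQUIRED_OUTPUT_PROPERTIES = (
    "output.recordStore.host",
    "output.recordStore.port",
    "output.recordStore.instanceName",
)


def read_properties(site_xml_text):
    """Return a list of (name, value) pairs, one per <property> element."""
    root = ET.fromstring(site_xml_text)
    pairs = []
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            value = prop.findtext("value") or ""
            pairs.append((name.strip(), value.strip()))
    return pairs


def check_record_store_output(site_xml_text):
    """Return a list of problems; an empty list means every check passed."""
    pairs = read_properties(site_xml_text)
    values = dict(pairs)
    problems = []
    for name in REQUIRED_OUTPUT_PROPERTIES:
        if not values.get(name):
            problems.append("missing or empty property: " + name)
    # Inspect every plugin.includes property, in case more than one exists.
    plugin_values = [value for name, value in pairs if name == "plugin.includes"]
    if not any("recordstore-outputter" in value for value in plugin_values):
        problems.append("plugin.includes does not list recordstore-outputter")
    if any("output-endeca-record" in value for value in plugin_values):
        problems.append("plugin.includes still lists output-endeca-record")
    return problems


# Hypothetical sample; the <configuration> root element is an assumption.
SAMPLE_SITE_XML = """
<configuration>
  <property><name>output.recordStore.host</name><value>hostname.endeca.com</value></property>
  <property><name>output.recordStore.port</name><value>8500</value></property>
  <property><name>output.recordStore.instanceName</name><value>WebCrawlerOutput</value></property>
  <property><name>plugin.includes</name><value>protocol-file|recordstore-outputter</value></property>
</configuration>
"""

if __name__ == "__main__":
    found = check_record_store_output(SAMPLE_SITE_XML)
    print("OK" if not found else "\n".join(found))
```

To check a real configuration, read the contents of your own site.xml file into a string and pass it to check_record_store_output in place of the sample.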
To confirm the Web crawl wrote output to a Record Store instance, run the list-generations task of the Record Store Command-line Utility. For the example above, this command confirms the crawl output for the WebCrawlerOutput instance:
C:\Endeca\CAS\3.1.1\bin>recordstore-cmd list-generations -a WebCrawlerOutput
ID   STATUS      CREATION TIME
1    COMPLETED   Tue Mar 03 17:40:22 EST 2009
Note
The Web Crawler does not automatically manage Record Store instances for Web crawls. For details about managing Record Store instances, see the CAS Developer's Guide.