The Web Crawler can be configured to write its output directly to a Record Store instance instead of to an output file on disk (the default). This procedure assumes you are modifying a single crawl configuration in the site.xml file and not the global Web Crawler configuration in default.xml.

There are two main tasks in the configuration process. You create and configure a Record Store instance to receive the Web Crawler output. Then you configure the Web Crawler to override its default output settings and instead write to the Record Store instance.

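The Record Store instance configuration requires a configuration file with two properties for Web Crawler output: idPropertyName and changePropertyNames. The Web Crawler configuration requires two changes to the site.xml file: adding the output properties that identify the Record Store (host, port, and instance name), and adding the recordstore-outputter plugin to the plugin.includes property.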

Each of these steps is fully described below.

To configure a Web Crawler to write output to a Record Store instance:

  1. Start the CAS Service if it is not already running.

    On Windows, the CAS Service is started by default.

  2. Using the Component Instance Manager Command-line Utility, create a new Record Store instance for the Web Crawler output.

  3. Create a Record Store configuration file that has an idPropertyName property of Endeca.Id and changePropertyNames of Endeca.Document.Text, Endeca.Web.Last-Modified.

    For example, here are the contents of a configuration file named recordstore-configuration.xml:

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <recordStoreConfiguration xmlns="http://recordstore.itl.endeca.com/">
        <changePropertyNames>
            <changePropertyName>Endeca.Document.Text</changePropertyName>
            <changePropertyName>Endeca.Web.Last-Modified</changePropertyName>
        </changePropertyNames>
        <idPropertyName>Endeca.Id</idPropertyName>
    </recordStoreConfiguration>
  4. Save the Record Store configuration file. You may find it convenient to save it with the other Web Crawler configuration files.

  5. Using the Record Store Command-line Utility, set the configuration file for the Record Store instance.

  6. Modify the site.xml file to include the three output properties that specify the fully qualified host name and the port on which the Record Store is running, and the instance name of the Record Store.

    For example, this snippet specifies a Record Store running on hostname.endeca.com at port 8500, with an instance name of WebCrawlerOutput:

    <property>
       <name>output.recordStore.host</name>
       <value>hostname.endeca.com</value>
    </property>
    <property>
       <name>output.recordStore.port</name>
       <value>8500</value>
    </property>
    <property>
       <name>output.recordStore.instanceName</name>
       <value>WebCrawlerOutput</value>
    </property>
  7. In the site.xml file, add a plugin.includes property for the recordstore-outputter plugin. This plugin instructs the Web Crawler to write to a Record Store instance.

    For example:

    <property>
       <name>plugin.includes</name>
       <value>lib-auth-http|auth-http-form-basic|protocol-httpclient|protocol-file|urlfilter-regex|parse-(text|html|js)|endeca-searchexport-converter-parser|urlnormalizer-(pass|regex|basic)|endeca-generator-html-basic|recordstore-outputter</value>
    </property>
  8. In the site.xml file, delete the plugin.includes property for the output-endeca-record plugin, if it exists in the file.

  9. Optionally, remove the properties in the site.xml file that configure output file settings. These properties are output.file.is-compressed, output.file.is-xml, output.file.name, and output.file.directory.

    Removing them keeps the configuration file clean, but it is not required, because adding the recordstore-outputter plugin overrides the file output properties.

  10. Run the Web crawl as you normally would.

To confirm the Web crawl wrote output to a Record Store instance, run the list-generations task of the Record Store Command-line Utility. For the example above, this command confirms the crawl output for the WebCrawlerOutput instance:

C:\Endeca\CAS\3.1.1\bin>recordstore-cmd list-generations -a WebCrawlerOutput
ID      STATUS          CREATION TIME
1       COMPLETED       Tue Mar 03 17:40:22 EST 2009
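
As a rough sketch, the site.xml changes from steps 6 through 8 might look like the following when taken together. The host, port, instance name, and plugin list are the example values used above. The <configuration> root element is an assumption based on the usual layout of this kind of file, so keep whatever root element and other properties your existing site.xml already contains.

```xml
<?xml version="1.0"?>
<!-- Sketch only: the <configuration> root element is an assumption.
     Keep the root element and unrelated properties of your own site.xml. -->
<configuration>
    <!-- Step 6: host, port, and instance name of the Record Store. -->
    <property>
       <name>output.recordStore.host</name>
       <value>hostname.endeca.com</value>
    </property>
    <property>
       <name>output.recordStore.port</name>
       <value>8500</value>
    </property>
    <property>
       <name>output.recordStore.instanceName</name>
       <value>WebCrawlerOutput</value>
    </property>
    <!-- Step 7: recordstore-outputter is included in the plugin list.
         Step 8: output-endeca-record does not appear in the list. -->
    <property>
       <name>plugin.includes</name>
       <value>lib-auth-http|auth-http-form-basic|protocol-httpclient|protocol-file|urlfilter-regex|parse-(text|html|js)|endeca-searchexport-converter-parser|urlnormalizer-(pass|regex|basic)|endeca-generator-html-basic|recordstore-outputter</value>
    </property>
</configuration>
```

The output.file.* properties from step 9 are omitted here for brevity. Leaving them in place would not change where the crawl output is written.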

