Activating the plugin for the Web Crawler

Oracle recommends that you modify the crawl-specific site.xml file rather than the global default.xml file, because settings in site.xml override the global settings in default.xml.

To activate the plugin for the Web Crawler:

  1. Open default.xml (located in IAS\workspace\conf\web-crawler\default) and copy the plugin.includes and plugin.excludes properties into site.xml (located in IAS\workspace\conf\web-crawler\polite-crawl or IAS\workspace\conf\web-crawler\non-polite-crawl).
  2. Add the plugin ID to the plugin.includes property in the site.xml file, as shown in this abbreviated example:
    ...
    <property>
      <name>plugin.includes</name>
      <value>filter-htmlmetatags|... | output-endeca-record</value>
      <description>
        Regular expression naming plugin directory names to include.
      </description>
    </property>
    ...
    
    Note: Each name in the value (filter-htmlmetatags in this example) must match the plugin ID set in the plugin's plugin.xml definition file.
  3. Check both configuration files (default.xml and site.xml) for the plugin.excludes property and make certain that the plugin ID is not excluded. In the following example, the empty value means that no plugin is excluded:
    ...
    <property>
      <name>plugin.excludes</name>
      <value></value>
      <description>
        Regular expression naming plugin directory names to exclude.
      </description>
    </property>
    
  4. Check the parse filtering order. If you use the parser.filters.order configuration property to specify the order in which parse filters are applied, make sure that the property value includes filter-htmlmetatags. If you do not use this property (that is, its value is empty), you can leave the property as-is.
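
The checks in steps 2 and 3 can also be scripted. The following is a minimal sketch in Python, not part of the product. It assumes that site.xml uses a Hadoop-style layout (a root element containing property elements with name and value children) and that the plugin.includes and plugin.excludes values are regular expressions matched against the whole plugin ID. The names read_property, plugin_is_active, and SAMPLE_SITE_XML are hypothetical, and the sample value lists only the two plugin IDs visible in the abbreviated example above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, shortened site.xml content used only for illustration.
# The <configuration> root element is an assumption about the file layout.
SAMPLE_SITE_XML = """
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>filter-htmlmetatags|output-endeca-record</value>
  </property>
  <property>
    <name>plugin.excludes</name>
    <value></value>
  </property>
</configuration>
"""


def read_property(xml_text, name):
    """Return the <value> text of the named <property>, or None if it is absent."""
    root = ET.fromstring(xml_text.strip())
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            return (prop.findtext("value") or "").strip()
    return None


def plugin_is_active(xml_text, plugin_id):
    """True if plugin_id matches plugin.includes and does not match plugin.excludes.

    Assumption: both values are regular expressions that must match the
    entire plugin ID. An empty or missing value matches nothing.
    """
    includes = read_property(xml_text, "plugin.includes")
    excludes = read_property(xml_text, "plugin.excludes")
    included = bool(includes) and re.fullmatch(includes, plugin_id) is not None
    excluded = bool(excludes) and re.fullmatch(excludes, plugin_id) is not None
    return included and not excluded


print(plugin_is_active(SAMPLE_SITE_XML, "filter-htmlmetatags"))
```

To check a real configuration, read the contents of the site.xml file into a string and pass it in place of SAMPLE_SITE_XML. Because site.xml overrides default.xml, running the same check against default.xml covers the "both configuration files" part of step 3.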

You can now run the Web Crawler with the new plugin.