Oracle Commerce Guided Search - Activating the plug-in for the Web Crawler

Activating the plug-in for the Web Crawler

Oracle recommends that you modify the crawl-specific site.xml file rather than the global default.xml file, because settings in site.xml override the global settings in default.xml.

To activate the plug-in for the Web Crawler:

From default.xml (located in CAS\workspace\conf\web-crawler\default), copy the plugin.includes and plugin.excludes properties to site.xml (located in CAS\workspace\conf\web-crawler\polite-crawl or CAS\workspace\conf\web-crawler\non-polite-crawl).
Add the plug-in ID to the plugin.includes property in the site.xml file, as shown in this abbreviated example:
```
...
<property>
  <name>plugin.includes</name>
  <value>filter-htmlmetatags|... | output-endeca-record</value>
  <description>
    Regular expression naming plugin directory names to include.
  </description>
</property>
...
```
Note
The plug-in name in the value (filter-htmlmetatags in this example) must match the plug-in ID as set in the plug-in’s plugin.xml definition file.

Check both configuration files (default.xml and site.xml) for the plugin.excludes property and make certain that the plug-in ID is not excluded, as in the following example:

```
...
<property>
  <name>plugin.excludes</name>
  <value></value>
  <description>
    Regular expression naming plugin directory names to exclude.
  </description>
</property>
```

Check the parse filtering order. If you use the parser.filters.order configuration property to specify the order in which parse filters are applied, make sure that filter-htmlmetatags is included in the property value. If you do not use this property (that is, it has an empty value), you can leave it as-is.
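
The plugin.includes and plugin.excludes checks in the steps above can be sketched as a small Python script. This is a minimal sketch and not an Oracle-supplied tool: the function names are hypothetical, the script assumes that each regular expression must match the whole plug-in ID, and it takes the contents of default.xml and site.xml as strings. It does not check parser.filters.order.

```python
import re
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Return {name: value} for every <property> element in a config document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def find_activation_problems(plugin_id, default_xml, site_xml):
    """Return a list of problems; an empty list means both checks passed."""
    default_props = read_properties(default_xml)
    site_props = read_properties(site_xml)
    problems = []

    # site.xml overrides default.xml, so prefer its plugin.includes when present.
    includes = site_props.get(
        "plugin.includes", default_props.get("plugin.includes", "")
    )
    # Assumption: the pattern must match the entire plug-in ID.
    if not re.fullmatch(includes, plugin_id):
        problems.append(f"{plugin_id} is not matched by plugin.includes")

    # The procedure asks for plugin.excludes to be checked in both files.
    for label, props in (("default.xml", default_props), ("site.xml", site_props)):
        excludes = props.get("plugin.excludes", "")
        # An empty plugin.excludes value excludes nothing.
        if excludes and re.fullmatch(excludes, plugin_id):
            problems.append(
                f"{plugin_id} is excluded by plugin.excludes in {label}"
            )
    return problems
```

Calling find_activation_problems("filter-htmlmetatags", default_text, site_text) with the text of the two files returns an empty list when the plug-in ID is included and not excluded. Otherwise it returns one message per failed check.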

You can now run the Web Crawler with the new plug-in.


Content Acquisition System Web Crawler Guide