Oracle recommends that you modify the crawl-specific site.xml file rather than the global default.xml file, because the settings in site.xml override the global settings in default.xml.
Use the following steps to activate the plug-in.
To activate the plug-in for the Web Crawler:
1. From default.xml (located in CAS\workspace\conf\web-crawler\default), copy the plugin.includes and plugin.excludes properties to site.xml (located in CAS\workspace\conf\web-crawler\polite-crawl or CAS\workspace\conf\web-crawler\non-polite-crawl).

2. Add the plug-in ID to the plugin.includes property in the site.xml file, as shown in this abbreviated example:

       ...
       <property>
         <name>plugin.includes</name>
         <value>filter-htmlmetatags|... | output-endeca-record</value>
         <description>
           Regular expression naming plugin directory names to include.
         </description>
       </property>
       ...

   Note: The value name (filter-htmlmetatags in this example) must refer to the plug-in ID as set in the plug-in’s plugin.xml definition file.

3. Check both configuration files (default.xml and site.xml) for the plugin.excludes property and make certain that the plug-in ID is not excluded, as in the following example:

       ...
       <property>
         <name>plugin.excludes</name>
         <value></value>
         <description>
           Regular expression naming plugin directory names to exclude.
         </description>
       </property>

4. Check the parse filtering order. If you are using the parser.filters.order configuration property to specify the order in which parse filters are applied, make sure that you include filter-htmlmetatags in the property value. If you are not using this property (i.e., it has an empty value), you can leave the property as-is.
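The include and exclude checks in the procedure above can be sanity-checked with a short script before running a crawl. The following is a minimal sketch, not part of the product: the function names (read_properties, plugin_is_active) and the sample XML strings are hypothetical, and it assumes that each property value is a regular expression matched against the whole plug-in ID, which may differ from how the Web Crawler evaluates it.

```python
import re
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Return {name: value} for every <property> element in a config document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name:
            props[name] = (prop.findtext("value") or "").strip()
    return props


def plugin_is_active(plugin_id, default_props, site_props):
    """Check that plugin_id is included and is not excluded in either file."""
    # site.xml settings override default.xml settings.
    effective = {**default_props, **site_props}
    includes = effective.get("plugin.includes", "")
    # Assumption: the value is a regex that must match the whole plug-in ID.
    included = bool(includes) and re.fullmatch(includes, plugin_id) is not None

    # The procedure asks you to check plugin.excludes in both files.
    excluded = False
    for props in (default_props, site_props):
        excludes = props.get("plugin.excludes", "")
        if excludes and re.fullmatch(excludes, plugin_id):
            excluded = True
    return included and not excluded


# Hypothetical sample contents standing in for default.xml and site.xml.
DEFAULT_XML = """<configuration>
  <property><name>plugin.includes</name><value>output-endeca-record</value></property>
  <property><name>plugin.excludes</name><value></value></property>
</configuration>"""

SITE_XML = """<configuration>
  <property><name>plugin.includes</name><value>filter-htmlmetatags|output-endeca-record</value></property>
  <property><name>plugin.excludes</name><value></value></property>
</configuration>"""

default_props = read_properties(DEFAULT_XML)
site_props = read_properties(SITE_XML)
print(plugin_is_active("filter-htmlmetatags", default_props, site_props))
```

To run the same check against real files, read the contents of default.xml and site.xml from the directories named in step 1 and pass them to read_properties in place of the sample strings.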
You can now run the Web Crawler with the new plug-in.

