Oracle recommends that you modify the crawl-specific site.xml file rather than the global default.xml file, because the site.xml settings override the global default.xml settings.
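The override relationship can be illustrated with a minimal Python sketch. It assumes both files hold flat <property> entries with <name> and <value> children, as in the examples in this section. The <configuration> root element, the function names, and the sample values are assumptions for illustration only and do not reproduce the shipped files.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Return {name: value} for every <property> element in the document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def effective_properties(default_xml, site_xml):
    """Overlay site.xml values on default.xml values; site.xml wins on conflicts."""
    merged = read_properties(default_xml)
    merged.update(read_properties(site_xml))
    return merged


# Hypothetical sample documents, not the shipped configuration files.
DEFAULT_XML = """<configuration>
  <property><name>plugin.includes</name><value>output-endeca-record</value></property>
  <property><name>plugin.excludes</name><value></value></property>
</configuration>"""

SITE_XML = """<configuration>
  <property>
    <name>plugin.includes</name>
    <value>filter-htmlmetatags|output-endeca-record</value>
  </property>
</configuration>"""

effective = effective_properties(DEFAULT_XML, SITE_XML)
print(effective["plugin.includes"])  # filter-htmlmetatags|output-endeca-record
```

In this sketch, plugin.excludes keeps its default.xml value because site.xml does not define it, while plugin.includes takes the site.xml value.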
Use the following steps to activate the plug-in.
To activate the plug-in for the Web Crawler:
1. From default.xml (located in CAS\workspace\conf\web-crawler\default), copy the plugin.includes and plugin.excludes properties to site.xml (located in CAS\workspace\conf\web-crawler\polite-crawl or CAS\workspace\conf\web-crawler\non-polite-crawl).

2. Add the plug-in ID to the plugin.includes property in the site.xml file, as shown in this abbreviated example:

   ...
   <property>
     <name>plugin.includes</name>
     <value>filter-htmlmetatags|... | output-endeca-record</value>
     <description>
       Regular expression naming plugin directory names to include.
     </description>
   </property>
   ...

   Note: The value name (filter-htmlmetatags in this example) must refer to the plug-in ID as set in the plug-in’s plugin.xml definition file.

3. Check both configuration files (default.xml and site.xml) for the plugin.excludes property and make certain that the plug-in ID is not excluded, as in the following example:

   ...
   <property>
     <name>plugin.excludes</name>
     <value></value>
     <description>
       Regular expression naming plugin directory names to exclude.
     </description>
   </property>

4. Check the parse filtering order. If you are using the parser.filters.order configuration property to specify the order in which parse filters are applied, make sure that you include filter-htmlmetatags in the property value. If you are not using this property (that is, it has an empty value), you can leave the property as is.
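The copy-and-add part of this procedure (copying the two properties into site.xml and adding the plug-in ID) can also be scripted. The following Python sketch is one possible approach, not an Oracle-supplied tool. It assumes a <configuration> root element, treats plugin.includes as a regular expression that must match the whole plug-in ID, and uses hypothetical sample documents. The function names are illustrative.

```python
import copy
import re
import xml.etree.ElementTree as ET

PLUGIN_PROPERTIES = ("plugin.includes", "plugin.excludes")


def find_property(root, name):
    """Return the <property> element whose <name> equals `name`, or None."""
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == name:
            return prop
    return None


def activate_plugin(default_xml, site_xml, plugin_id):
    """Return updated site.xml text with `plugin_id` added to plugin.includes."""
    default_root = ET.fromstring(default_xml)
    site_root = ET.fromstring(site_xml)

    # Copy the two properties from default.xml when site.xml lacks them.
    for name in PLUGIN_PROPERTIES:
        if find_property(site_root, name) is None:
            source = find_property(default_root, name)
            if source is not None:
                site_root.append(copy.deepcopy(source))

    includes = find_property(site_root, "plugin.includes")
    if includes is None:
        raise ValueError("plugin.includes not found in either document")
    value = includes.find("value")
    if value is None:
        value = ET.SubElement(includes, "value")

    # Add the plug-in ID as one more alternative unless the pattern
    # already matches it (assumption: whole-ID regular expression match).
    current = (value.text or "").strip()
    if not current:
        value.text = plugin_id
    elif re.fullmatch(current, plugin_id) is None:
        value.text = current + "|" + plugin_id

    return ET.tostring(site_root, encoding="unicode")


# Hypothetical sample documents, not the shipped configuration files.
DEFAULT_XML = """<configuration>
  <property><name>plugin.includes</name><value>output-endeca-record</value></property>
  <property><name>plugin.excludes</name><value></value></property>
</configuration>"""

SITE_XML = "<configuration></configuration>"

updated = activate_plugin(DEFAULT_XML, SITE_XML, "filter-htmlmetatags")
print(updated)
```

ElementTree discards XML comments when it parses a document, so a script like this is better run against a copy of site.xml, with the result reviewed before it replaces the original file.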
You can now run the Web Crawler with the new plug-in.
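Before starting a crawl, the settings described above can be sanity-checked with a short script. The sketch below is illustrative only. It takes the property values as plain dictionaries, assumes that plugin.includes and plugin.excludes are regular expressions matched against the whole plug-in ID, and treats parser.filters.order as free text that should mention the plug-in ID when it is non-empty. The function name and the sample values are hypothetical.

```python
import re


def check_activation(plugin_id, default_props, site_props):
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    # site.xml overrides default.xml, so prefer its plugin.includes value.
    includes = site_props.get(
        "plugin.includes", default_props.get("plugin.includes", "")
    ).strip()
    if re.fullmatch(includes, plugin_id) is None:
        problems.append("plugin.includes does not match " + plugin_id)

    # Both files are checked for an excluding pattern.
    for label, props in (("default.xml", default_props), ("site.xml", site_props)):
        excludes = props.get("plugin.excludes", "").strip()
        if excludes and re.fullmatch(excludes, plugin_id):
            problems.append(label + ": plugin.excludes matches " + plugin_id)

    # The filter order only matters when the property has a non-empty value.
    order = site_props.get(
        "parser.filters.order", default_props.get("parser.filters.order", "")
    ).strip()
    if order and plugin_id not in order:
        problems.append("parser.filters.order is set but omits " + plugin_id)

    return problems


# Hypothetical property values for illustration.
default_props = {"plugin.includes": "output-endeca-record", "plugin.excludes": ""}
site_props = {
    "plugin.includes": "filter-htmlmetatags|output-endeca-record",
    "plugin.excludes": "",
}

print(check_activation("filter-htmlmetatags", default_props, site_props))  # []
```

An empty list indicates that the plug-in ID is included, is not excluded in either file, and is not missing from a non-empty parser.filters.order value.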