URL Filter dialog box

You use this dialog box to specify the filters by which the spider includes or excludes URLs during a crawl.

Filters are expressed as wildcards or Perl regular expressions. URL filters are evaluated independently; that is, URL filter A does not influence URL filter B, and vice versa. At least one URL filter is required to allow the spider to make additional processing passes beyond the root URL.

Option Description
URL filter Enter either a wildcard filter, for example *.endeca.com, or a Perl regular expression filter, for example /.*\.html/i. Generally, use Wildcard patterns for Host filters and Regular expression patterns for URL filters.
Note: There are additional samples in the Crawler Implementations section in the Data Foundry Guide.
Type Select either Host or URL:
  • Host filters apply only to the host name portion of a URL.
  • URL filters are more flexible and can filter URLs based on whether the entire URL matches the specified pattern. For example, the spider may crawl a file system in which a directory named "presentations" contains PowerPoint documents that should not be crawled. They can be excluded using a URL exclusion filter with the pattern /.*\/presentations\/.*\.ppt/.
Action Select either Include or Exclude:
  • Include indicates that the spider crawls documents that match the URL filter.
  • Exclude indicates that the spider excludes documents that match the URL filter.
Note: A URL must pass both the inclusion and exclusion filters for the spider to queue it. In other words, a URL must match at least one inclusion filter and must not match any exclusion filter.
Pattern Select either Wildcard or Regular expression depending on the syntax of the filter you specified in the "URL filter" field.
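The queuing rule above (match at least one inclusion filter, match no exclusion filter) can be sketched as follows. This is an illustrative model only, not the spider's actual implementation; the function names and the filter representation are assumptions, and the sample patterns reuse the examples from this topic.

```python
import re
from fnmatch import fnmatch

def matches(url, pattern, kind):
    """Test one filter against a URL.

    kind is "wildcard" (shell-style glob) or "regex"
    (Perl-style regular expression, matched case-insensitively
    here to mirror the /.../i example above).
    """
    if kind == "wildcard":
        return fnmatch(url, pattern)
    return re.search(pattern, url, re.IGNORECASE) is not None

def should_queue(url, includes, excludes):
    """Apply the documented rule: a URL is queued only if it
    matches at least one inclusion filter and no exclusion filter.
    includes/excludes are lists of (pattern, kind) pairs."""
    if not any(matches(url, p, k) for p, k in includes):
        return False
    return not any(matches(url, p, k) for p, k in excludes)

# Sample filters drawn from the examples in this topic
includes = [("*.endeca.com*", "wildcard")]
excludes = [(r".*/presentations/.*\.ppt", "regex")]

print(should_queue("http://www.endeca.com/docs/index.html", includes, excludes))
print(should_queue("http://www.endeca.com/presentations/q3.ppt", includes, excludes))
```

The first URL matches the inclusion wildcard and no exclusion filter, so it is queued; the second also matches the inclusion filter but is rejected by the exclusion regular expression.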