URL Filter dialog box

You use this dialog box to specify the filters by which the spider includes or excludes URLs during a crawl.

Filters are expressed as wildcards or Perl regular expressions. URL filters are evaluated independently; that is, URL filter A does not influence URL filter B, and vice versa. At least one URL filter is required to allow the spider to make additional processing passes beyond the root URL.

Option Description
URL filter Enter either a wildcard filter, for example *.endeca.com, or a Perl regular expression filter, for example /.*\.html/i. Generally, use Wildcard patterns for Host filters and Regular expression patterns for URL filters.
Note: There are additional samples in the Crawler Implementations section in the Data Foundry Guide.
Type Select either Host or URL:
  • Host filters apply only to the host name portion of a URL.
  • URL filters are more flexible and can filter URLs based on whether the entire URL matches the specified pattern. For example, the spider may crawl a file system in which a directory named "presentations" contains PowerPoint documents that should not be crawled. They can be excluded using a URL exclusion filter with the pattern /.*\/presentations\/.*\.ppt/.
Action Select either Include or Exclude:
  • Include indicates that the spider crawls documents that match the URL filter.
  • Exclude indicates that the spider excludes documents that match the URL filter.
Note: A URL must pass both the inclusion and exclusion filters for the spider to queue it. In other words, a URL must match at least one inclusion filter and must not match any exclusion filter.
Pattern Select either Wildcard or Regular expression depending on the syntax of the filter you specified in the "URL filter" field.
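The queuing rule above (match at least one inclusion filter, match no exclusion filter) can be sketched as follows. This is an illustrative model only, not the spider's actual implementation; the function names and the filter representation are assumptions, and the sample patterns reuse the examples from this topic.

```python
import re
from fnmatch import fnmatch

def matches(url, pattern, kind):
    """Test one filter against a URL.

    kind is "wildcard" (shell-style glob) or "regex"
    (Perl-style regular expression, matched case-insensitively
    here to mirror the /.../i example above).
    """
    if kind == "wildcard":
        return fnmatch(url, pattern)
    return re.search(pattern, url, re.IGNORECASE) is not None

def should_queue(url, includes, excludes):
    """Apply the documented rule: a URL is queued only if it
    matches at least one inclusion filter and no exclusion filter.
    includes/excludes are lists of (pattern, kind) pairs."""
    if not any(matches(url, p, k) for p, k in includes):
        return False
    return not any(matches(url, p, k) for p, k in excludes)

# Sample filters drawn from the examples in this topic
includes = [("*.endeca.com*", "wildcard")]
excludes = [(r".*/presentations/.*\.ppt", "regex")]

print(should_queue("http://www.endeca.com/docs/index.html", includes, excludes))
print(should_queue("http://www.endeca.com/presentations/q3.ppt", includes, excludes))
```

The first URL matches the inclusion wildcard and no exclusion filter, so it is queued; the second also matches the inclusion filter but is rejected by the exclusion regular expression.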