Specifying Document Types

Access the Document Types page by selecting PeopleTools, Search Framework, Search Designer Activity Guide, Search Definition and selecting a Source Type of File Source. Then click the Document Types tab.

Use the Document Types page to specify the document types, expressed as MIME types, of the documents to be crawled.

Image: File Source: Document Types page

This example illustrates the fields and controls on the File Source: Document Types page. You can find definitions for the fields and controls later on this page.

Document Types to Crawl

Field or Control

Definition

Include default types

Crawl only the default document types, as defined by the search engine.

The search engine default document types for crawling are:

  • PDF

  • HTML

  • TXT (plain text)

  • Microsoft Word

  • Microsoft Excel

  • Microsoft PowerPoint

If no other document types are added to the Select grid, then the search engine considers only the default document types.

Include all types

Crawl all document types supported by the search engine.

To see this list of supported document types, expressed as MIME types, select either the Add these types or Exclude these types radio button and click the lookup button for the Document Type column.

Add these types

With Add these types selected, the system crawls the default document types plus any document type added to the Document Type grid.

Exclude these types

Exclude the specific document types added to the Document Type grid.

Assume that most of the document types supported by the Elasticsearch crawler apply to your configuration, with only a few exceptions. In this case, add those exceptions to the Document Type grid and select Exclude these types. When the search engine crawls the file location, it crawls all document types on the supported MIME list except those listed in the Document Type grid.
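The four options above can be read as set operations on MIME types. The following Python sketch is illustrative only and is not how the search engine implements the feature. The function name effective_types, the mode strings, and the mapping of the default types to specific MIME strings are assumptions made for this example. The supported list passed in is hypothetical; the actual list is the one shown by the Document Type lookup.

```python
# Default document types named in this topic, written as commonly used
# MIME types. The exact strings used by the search engine are an assumption.
DEFAULT_TYPES = frozenset({
    "application/pdf",
    "text/html",
    "text/plain",
    "application/msword",
    "application/vnd.ms-excel",
    "application/vnd.ms-powerpoint",
})


def effective_types(mode, grid=(), supported=()):
    """Return the set of MIME types crawled for one option (hypothetical helper).

    mode      -- "include_default", "include_all", "add", or "exclude"
    grid      -- MIME types entered in the Document Type grid
    supported -- every MIME type the search engine supports (hypothetical list)
    """
    grid = set(grid)
    supported = set(supported)
    if mode == "include_default":
        return set(DEFAULT_TYPES)
    if mode == "include_all":
        return supported
    if mode == "add":
        # Default types plus whatever the grid lists.
        return set(DEFAULT_TYPES) | grid
    if mode == "exclude":
        # Everything supported except what the grid lists.
        return supported - grid
    raise ValueError(f"unknown mode: {mode}")
```

For example, calling effective_types("exclude", grid={"application/xml"}, supported=all_types) with a hypothetical all_types set returns every supported type except application/xml.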

URL Boundary Rules

Field or Control

Definition

Inclusion Rules

Specify an inclusion rule that a URL must contain. For example:

*es_xml*

In this case, the search engine crawls all documents with es_xml in the URL.

Specify an inclusion rule that a URL must start with. For example:

file://localhost/ds1/product/ES_ADD/es_doc*

In this case, the search engine crawls all files starting with file://localhost/ds1/product/ES_ADD/es_doc.

Exclusion Rules

Specify an exclusion rule that a URL cannot contain. For example:

*.xml

In this case, the search engine does not crawl anything with a .xml extension.

URL boundary rules limit the crawling scope. When you add boundary rules, the crawler is restricted to URLs that match the rules you specify. You can form inclusion and exclusion rules that filter documents using begins with, ends with, contains, or regular expression patterns. Begin a regular expression rule with the character R.

Rule

Description

Begins With: file://localhost/example*

In this case, the search engine considers URLs starting with file://localhost/example.

Ends With: *.doc

In this case, the search engine considers URLs ending with .doc.

Contains: *contacts*

In this case, the search engine considers URLs containing the string contacts.

Regular Expression: R.*es_html_lvl[1-9].html

In this case, the search engine considers URLs that end with es_html_lvl followed by a single digit from 1 to 9 and the .html extension, such as es_html_lvl3.html.

When working with these rules, keep in mind:

  • Exclusion rules always override inclusion rules.

  • Separate multiple inclusion or exclusion rules with a space, or enter each rule on a new line.

  • Use an asterisk to represent a wildcard.

  • Inclusion and exclusion rules are case-insensitive.
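To make the interaction of these rules concrete, the following Python sketch evaluates a URL against inclusion and exclusion rules. It is an illustration under stated assumptions, not the crawler's actual implementation. The function names compile_rule, parse_rules, and is_crawled are hypothetical. The sketch assumes that a rule must match the whole URL, that a leading uppercase R marks a regular expression, and that a URL is crawled when no inclusion rules are given and no exclusion rule matches.

```python
import re


def compile_rule(rule):
    """Turn one boundary rule into a case-insensitive compiled pattern."""
    if rule.startswith("R"):
        # Assumption: a leading uppercase R marks a regular expression rule.
        return re.compile(rule[1:], re.IGNORECASE)
    # Otherwise each asterisk is a wildcard and all other text is literal.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(pattern, re.IGNORECASE)


def parse_rules(text):
    """Split rules on spaces or new lines and compile each one."""
    return [compile_rule(rule) for rule in text.split()]


def is_crawled(url, inclusion="", exclusion=""):
    """Return True if the URL passes the boundary rules."""
    # Exclusion rules always override inclusion rules, so test them first.
    if any(p.fullmatch(url) for p in parse_rules(exclusion)):
        return False
    include = parse_rules(inclusion)
    # Assumption: with no inclusion rules, every non-excluded URL is crawled.
    return not include or any(p.fullmatch(url) for p in include)
```

With an inclusion rule of *es_xml* and an exclusion rule of *.xml, a call such as is_crawled("file://localhost/ds1/es_xml_1.xml", "*es_xml*", "*.xml") returns False because the exclusion rule takes precedence.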