Specifying Document Types

Access the Document Types page by selecting PeopleTools, Search Framework, Search Designer Activity Guide, Search Definition and selecting a Source Type of File Source. Then click the Document Types tab.

Use the Document Types page to specify the document types, expressed as MIME types, of the documents to be crawled.

Document Types to Crawl

Field or Control	Definition
Include default types	Crawl only the default document types, as defined by the search engine. The search engine default document types for crawling are PDF, HTML, TXT (plain text), Microsoft Word, Microsoft Excel, and Microsoft PowerPoint. If no other document types are added to the Select grid, then the search engine considers only the default document types.
Include all types	Crawl all document types supported by the search engine. To see this list of supported document types, expressed as MIME types, select either the Add these types or Exclude these types radio button and click the lookup button for the Document Type column.
Add these types	In addition to the default document types, the system also crawls any document type added to the Document Type grid, with Add these types selected.
Exclude these types	Excludes the specific document types added to the Document Type grid. Use this option when most of the document types supported by the Elasticsearch crawler apply to your configuration, with only a small number of exceptions. In this case, add only the exceptions to the Document Type grid. When the search engine crawls the file location, it crawls all document types on the supported MIME list except those listed in the Document Type grid.
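
The four options above amount to simple set operations on MIME types. The following is a minimal sketch of that logic, not the search engine's implementation. The function name `effective_types`, the `supported` argument, and the MIME type strings chosen for the default formats are assumptions made for illustration; use the lookup button on the Document Type column to see the identifiers your search engine actually uses.

```python
# Assumption: illustrative MIME types for the default formats listed above;
# the search engine's actual identifiers may differ.
DEFAULT_TYPES = frozenset({
    "application/pdf",
    "text/html",
    "text/plain",
    "application/msword",
    "application/vnd.ms-excel",
    "application/vnd.ms-powerpoint",
})


def effective_types(option, grid, supported):
    """Return the set of MIME types that would be crawled (hypothetical helper).

    option    -- label of the selected radio button
    grid      -- MIME types entered in the Document Type grid
    supported -- every MIME type the search engine supports
    """
    grid = frozenset(grid)
    supported = frozenset(supported)
    if option == "Include default types":
        return DEFAULT_TYPES
    if option == "Include all types":
        return supported
    if option == "Add these types":
        # Defaults plus whatever the grid lists.
        return DEFAULT_TYPES | grid
    if option == "Exclude these types":
        # Everything supported except what the grid lists.
        return supported - grid
    raise ValueError(f"unknown option: {option!r}")
```

For example, with Exclude these types selected and `image/png` in the grid, every supported type other than `image/png` remains in scope.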

URL Boundary Rules

Field or Control	Definition
Inclusion Rules	Specify an inclusion rule that a URL must contain. For example: `*es_xml*` In this case, the search engine crawls all documents with `es_xml` in the name. Specify an inclusion rule that a URL must start with. For example: `file://localhost/ds1/product/ES_ADD/es_doc*` In this case, the search engine crawls all files starting with `file://localhost/ds1/product/ES_ADD/es_doc`.
Exclusion Rules	Specify an exclusion rule that a URL can't contain. For example: `*.xml` In this case, the search engine does not crawl anything with a `.xml` extension.

URL boundary rules limit the crawling scope. When you add boundary rules, the crawler is restricted to URLs that match the rules you specify. Inclusion and exclusion rules can filter documents using begins with, ends with, contains, or regular expression patterns. A rule that uses a regular expression must start with the character R.

Rule	Description
Begins With: `file://localhost/example*`	In this case, the search engine considers URLs starting with `file://localhost/example`.
Ends With: `*.doc`	In this case, the search engine considers URLs ending with `.doc`.
Contains: `contacts`	In this case, the search engine considers URLs containing the string `contacts`.
Regular Expression: `R.*es_html_lvl[1-9].html`	In this case, the search engine considers URLs whose file names end with `es_html_lvl` followed by a single digit from 1 to 9 and then `.html`.

When working with these rules, keep in mind:

Exclusion rules always override inclusion rules.
Multiple inclusion and exclusion rules can be separated by a space or placed on separate lines.
Use an asterisk to represent a wildcard.
Inclusion and exclusion rules are case-insensitive.
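
To make the interaction of these rules concrete, the following Python sketch evaluates a URL against inclusion and exclusion rules as described above. It is an illustration of the documented behavior, not the search engine's implementation. The helper names (`compile_rule`, `parse_rules`, `is_crawled`) are hypothetical, and the sketch assumes that a rule without an asterisk is a contains rule, that a rule starting with an uppercase R is a regular expression that must match the whole URL, and that an empty inclusion list places no restriction on the crawl.

```python
import re


def compile_rule(rule):
    """Translate one boundary rule into a case-insensitive regex."""
    if rule.startswith("R"):
        # Assumption: everything after the leading R is the regular expression.
        return re.compile(rule[1:], re.IGNORECASE)
    if "*" not in rule:
        # Assumption: a rule without wildcards is a "contains" rule.
        rule = f"*{rule}*"
    # Each asterisk matches any run of characters; all other text is literal.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body, re.IGNORECASE)


def parse_rules(text):
    """Split rules on spaces or new lines and compile each one."""
    return [compile_rule(rule) for rule in text.split()]


def is_crawled(url, inclusion, exclusion):
    """Return True if the URL falls inside the crawl boundary."""
    included = parse_rules(inclusion)
    excluded = parse_rules(exclusion)
    # Exclusion rules always override inclusion rules.
    if any(pattern.fullmatch(url) for pattern in excluded):
        return False
    # Assumption: with no inclusion rules, every non-excluded URL is in scope.
    return not included or any(pattern.fullmatch(url) for pattern in included)
```

With an inclusion rule of `file://localhost/ds1/product/ES_ADD/es_doc*` and an exclusion rule of `*.xml`, a file named `es_doc1.txt` under that path is in scope, while `es_doc1.XML` is not, because the exclusion rule wins and matching ignores case.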