[/map {"- map/map "}) [/map/topicref {"- map/topicref "}) [/map/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicmeta/navtitle {"- topic/navtitle "}) Configuration (navtitle][/map/topicref/topicmeta/linktext {"- map/linktext "}) Configuration (linktext][/map/topicref/topicmeta/shortdesc {"- map/shortdesc "}) This section provides reference information to configure the Endeca Web Crawler. (shortdesc] (topicmeta][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Configuration files (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Configuration files (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The Endeca Web Crawler uses the following set of configuration files: (shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) The default.xml file (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) The default.xml file (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The default.xml file is the main configuration file for the Endeca Web Crawler. (shortdesc] (topicmeta][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) HTTP Properties (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) HTTP Properties (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You set the HTTP transport properties in the default.xml file. 
(shortdesc] (topicmeta][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) About setting the HTTPClient cookies (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) About setting the HTTPClient cookies (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The http.cookies property sets the cookies used by the HTTPClient. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) About obeying the robots.txt file (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) About obeying the robots.txt file (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You can set the Web Crawler to either ignore or obey the robots.txt exclusion standard, as well as any META ROBOTS tags in HTML pages. 
(shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Setting the download content limit (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Setting the download content limit (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) If a crawl downloads files with a lot of content (for example, large PDF or SWF files), you may see WARN messages about pages being skipped because the content limit was exceeded. To solve this problem, increase the download content limit to a setting that allows all content to be downloaded. (shortdesc] (topicmeta] (topicref] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Authentication properties (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Authentication properties (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You set the authentication properties in the default.xml file. 
(shortdesc] (topicmeta][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) About configuring Basic authentication (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) About configuring Basic authentication (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) If a Web server uses HTTP Basic authentication to restrict access to Web sites, you can specify authentication credentials that enable the Web Crawler to access password-protected pages. The http.auth.basic property sets the credentials to be used by the HTTPClient for Basic authentication. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) About configuring Digest authentication (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) About configuring Digest authentication (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) If a Web server uses HTTP Digest authentication to restrict access to Web sites, you can use the http.auth.digest property to set the credentials used by the HTTPClient for Digest authentication. 
(shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) About configuring NTLM authentication (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) About configuring NTLM authentication (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) If a Web server uses HTTP NTLM authentication to restrict access to Web sites, you can specify authentication credentials that enable the Web Crawler to access password-protected pages. The http.auth.ntlm property sets the credentials to be used by the HTTPClient for NTLM authentication. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Configuring Form-based authentication (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Configuring Form-based authentication (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) If you are crawling sites that implement form-based authentication, you supply the credentials in a form-credentials.xml file. 
(shortdesc] (topicmeta] (topicref] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Properties for authenticated proxy support (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Properties for authenticated proxy support (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You configure authenticated proxy support in the default.xml file. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Fetcher properties (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Fetcher properties (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The fetcher is the Web Crawler component that actually fetches pages from Web sites. You set the fetcher properties in the default.xml file. (shortdesc] (topicmeta][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Use of the max delay and crawl-delay values (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Use of the max delay and crawl-delay values (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The fetcher compares the value of the fetcher.delay.max property to the value of the Crawl-Delay parameter in the robots.txt file. 
(shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Fetcher overrides in the site.xml files (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Fetcher overrides in the site.xml files (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) This topic describes overrides for the fetcher property values in the default.xml file. (shortdesc] (topicmeta] (topicref] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) URL normalization properties (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) URL normalization properties (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You can set the URL normalization properties in the default.xml file. 
(shortdesc] (topicmeta][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Types of URL normalizers (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Types of URL normalizers (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The Endeca Web Crawler has three URL normalizers: (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Normalizing the seed list (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Normalizing the seed list (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You apply normalization to the seed list with the urlnormalizer.normalize-seeds property. (shortdesc] (topicmeta] (topicref] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) MIME type properties (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) MIME type properties (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You set the MIME type mapping properties in the default.xml file. 
(shortdesc] (topicmeta][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Overriding the server text/html MIME type (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Overriding the server text/html MIME type (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) If there is confusion as to the MIME type of a given URL, the Web Crawler by default trusts the URL extension over the server MIME type. The mime.types.trust-server.text-html property is intended for crawls that may experience "text/html" MIME type resolution problems. (shortdesc] (topicmeta] (topicref] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Plugin properties (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Plugin properties (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You set the plugin properties in the default.xml file. 
(shortdesc] (topicmeta][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Default activated plugins (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Default activated plugins (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The default regular expression value for the plugin.includes property activates these plugins: (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Specifying the plugins directory (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Specifying the plugins directory (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The plugin.folders property specifies the location of the plugins directory. (shortdesc] (topicmeta] (topicref] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Parser properties (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Parser properties (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You set the parser properties in the default.xml file. 
(shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Parser filter properties (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Parser filter properties (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You can set the parser filter properties in the default.xml file. (shortdesc] (topicmeta][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Setting the order of parser filters (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Setting the order of parser filters (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The parser.filters.order property specifies the order in which the parser filters are applied. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) About defining the XPath filter expressions (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) About defining the XPath filter expressions (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The document.prune.xpath property defines the XPath expressions that will be used by the Endeca Document Prune XPath Filter (i.e., the endeca-xpath-filter plugin). 
(shortdesc] (topicmeta] (topicref] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) URL filter properties (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) URL filter properties (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You configure how the URL filter plugins are handled in the default.xml file. (shortdesc] (topicmeta][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Setting the order of URL filters (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Setting the order of URL filters (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The urlfilter.order property allows you to specify the order in which URL filters are applied. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Filtering the seed list (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Filtering the seed list (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You apply URL filters to the seeds with the urlfilter.filter-seeds property. 
(shortdesc] (topicmeta] (topicref] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Crawl scoping properties (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Crawl scoping properties (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You implement crawl scoping in the default.xml file to control which URLs are crawled. (shortdesc] (topicmeta][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) About configuring crawl scoping (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) About configuring crawl scoping (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The Web Crawler implements a basic crawl scoping scheme to accommodate crawls of multiple seeds. The crawler can scope a crawl to only visit URLs from the same host or from the same domain as a seed. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) How domain names are retrieved from URLs (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) How domain names are retrieved from URLs (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) Every domain name ends in a top-level domain (TLD) name. The TLDs are either generic names (such as com) or country codes (such as jp for Japan). 
(shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Default top-level domain names (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Default top-level domain names (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The crawlscope.top-level-domains.generic property contains the following TLD names in the default.xml configuration file: (shortdesc] (topicmeta] (topicref] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Document conversion properties (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Document conversion properties (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You set the document conversion properties in the default.xml file. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Output properties (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Output properties (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You set output properties in the default.xml file. You can configure output to either an output file (the default) or to a Record Store instance. 
(shortdesc] (topicmeta][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Gathering XHTML information (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Gathering XHTML information (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) If the output.dom.include property is set to true, the Web Crawler normalizes the content of HTML documents into XHTML and stores it in the Endeca.Document.XHTML property in the record. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Excluding record properties (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Excluding record properties (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The output.records.properties.excludes property specifies a list of record properties that you want to exclude from the records. 
(shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Extensions for additional binary output files (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Extensions for additional binary output files (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) For the output.file.binary.file-size-max property, if output has to be written to more than one output file, the name pattern of the new files is similar to this example: (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Output file overrides in site.xml files (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Output file overrides in site.xml files (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The site.xml files in the workspace/conf/web-crawler/polite-crawl and workspace/conf/web-crawler/non-polite-crawl directories contain these output file overrides. (shortdesc] (topicmeta] (topicref] (topicref] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) The site.xml file (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) The site.xml file (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The site.xml file provides override property values for the global configuration file default.xml. 
(shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) The crawl-urlfilter.txt file (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) The crawl-urlfilter.txt file (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The crawl-urlfilter.txt file provides include and exclude regular expressions for URLs. (shortdesc] (topicmeta][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Regular expression format (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Regular expression format (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The Web Crawler uses Sun’s java.util.regex package to parse and match the pattern of the regular expression. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Specifying the hosts to accept (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Specifying the hosts to accept (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You set the crawl-urlfilter.txt files to limit a crawl to a specific domain. 
(shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Order of the regular expressions (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Order of the regular expressions (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) When specifying regular expressions, make sure that you list the exclude expressions before the include expressions. The reason is that the RegexURLFilter plugin does the regex-pattern matching from top to bottom. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Excluding file formats (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Excluding file formats (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You globally exclude file formats by adding their file extensions to an exclusion line in the crawl-urlfilter.txt file. (shortdesc] (topicmeta] (topicref] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) The regex-normalize.xml file (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) The regex-normalize.xml file (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The regex-normalize.xml file provides substitutions for normalizing URLs. 
(shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) The mime-types.xml file (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) The mime-types.xml file (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The mime-types.xml file provides mappings of file extensions to MIME types. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) The parse-plugins.xml file (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) The parse-plugins.xml file (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The parse-plugins.xml file provides mappings of MIME types to parsers. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) The form-credentials.xml file (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) The form-credentials.xml file (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The form-credentials.xml file provides the credentials for sites that use form-based authentication. 
(shortdesc] (topicmeta][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) About form-based authentication (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) About form-based authentication (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The Web Crawler supports form-based authentication for both GET and POST requests. The http.auth.form.credentials.file property sets the name of the file that contains the form credentials to be used by the Web client. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Format of the credentials file (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Format of the credentials file (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The format of the form-based authentication credentials file is as follows. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Setting the timeout property (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Setting the timeout property (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You set the authentication timeout with the BasicFormAuthenticator. 
(shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Using special characters in the credentials file (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Using special characters in the credentials file (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) XML has a special set of characters that cannot be used in normal XML strings. If you need to enter any of the following special characters, you must enter them in their encoded format: (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Authentication Exceptions (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Authentication Exceptions (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The authentication framework has two Exception classes: (shortdesc] (topicmeta] (topicref] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) The log4j.properties file (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) The log4j.properties file (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You modify the log4j.properties file to change the properties for the log4j loggers. 
(shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Enabling the IAS Document Conversion Module (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Enabling the IAS Document Conversion Module (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) By default, the Web Crawler is enabled to call the IAS Document Conversion Module to convert any documents that are not text, HTML, XML, SGML, or JavaScript. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Disabling the IAS Document Conversion Module (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Disabling the IAS Document Conversion Module (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) If desired, you can disable the IAS Document Conversion Module to prevent document conversion or license warnings. You can either disable the module globally for all crawls, or you can disable the module on a per-crawl basis. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) About document conversion options (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) About document conversion options (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You can change the default behavior of the IAS Document Conversion Module by modifying JVM property names and values. 
(shortdesc] (topicmeta][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Setting document conversion options (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Setting document conversion options (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) Set the document conversion options as parameters to the JVM's -D option. (shortdesc] (topicmeta] (topicref] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Configuring Web crawls to write output to a Record Store instance (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Configuring Web crawls to write output to a Record Store instance (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The Web Crawler can be configured to write its output directly to a Record Store instance, instead of to an output file on disk (the default). This procedure describes how to modify a single crawl configuration in the site.xml file and not the global Web Crawler configuration in default.xml. (shortdesc] (topicmeta] (topicref] (topicref] (map]