[/map {"- map/map "}) [/map/topicmeta {"- map/topicmeta "}) [/map/topicmeta/author {"- topic/author "}) Oracle (author][/map/topicmeta/prodinfo {"- topic/prodinfo "}) [/map/topicmeta/prodinfo/prodname {"- topic/prodname "}) Integrator Acquisition System (prodname] [/map/topicmeta/prodinfo/vrmlist {"- topic/vrmlist "}) [/map/topicmeta/prodinfo/vrmlist/vrm {"- topic/vrm "}) (vrm] (vrmlist] (prodinfo] (topicmeta][/map/topicref {"- map/topicref "}) [/map/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicmeta/navtitle {"- topic/navtitle "}) Introduction (navtitle][/map/topicref/topicmeta/linktext {"- map/linktext "}) Introduction (linktext][/map/topicref/topicmeta/shortdesc {"- map/shortdesc "}) This section provides introductory information about the Endeca Web Crawler. (shortdesc] (topicmeta][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Web Crawler overview (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Web Crawler overview (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The Endeca Web Crawler is installed by default as part of the IAS installation. The Web Crawler gathers source data by crawling HTTP and HTTPS Web sites and writes the data in a format that is accessible to Endeca Information Discovery Integrator (either XML or a Record Store instance). 
(shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Running a sample Web crawl of oracle.com (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Running a sample Web crawl of oracle.com (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You can examine the configuration and operation of the Web Crawler by running a sample Web crawl. The sample is located in the <install path>\IAS\workspace\conf\web-crawler\polite-crawl directory. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Running a sample Web crawl that writes to a Record Store (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Running a sample Web crawl that writes to a Record Store (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) In this topic, you run a sample Web crawl that writes output to a Record Store instance instead of to a file on disk. This sample is stored in <install path>\IAS\<version>\sample\webcrawler-to-recordstore. The run-sample script runs the sample Web Crawler. (shortdesc] (topicmeta] (topicref] (topicref] [/map/topicref {"- map/topicref "}) [/map/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicmeta/navtitle {"- topic/navtitle "}) Configuration (navtitle][/map/topicref/topicmeta/linktext {"- map/linktext "}) Configuration (linktext][/map/topicref/topicmeta/shortdesc {"- map/shortdesc "}) This section provides reference information to configure the Endeca Web Crawler. 
(shortdesc] (topicmeta][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Configuration files (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Configuration files (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The Endeca Web Crawler uses the following set of configuration files: (shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) The default.xml file (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) The default.xml file (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The default.xml file is the main configuration file for the Endeca Web Crawler. (shortdesc] (topicmeta][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) HTTP Properties (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) HTTP Properties (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You set the HTTP transport properties in the default.xml file. (shortdesc] (topicmeta][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) About setting the HTTPClient cookies (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) About setting the HTTPClient cookies (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The http.cookies property sets the cookies used by the HTTPClient. 
(shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) About obeying the robots.txt file (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) About obeying the robots.txt file (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You can set the Web Crawler to either ignore or obey the robots.txt exclusion standard, as well as any META ROBOTS tags in HTML pages. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Setting the download content limit (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Setting the download content limit (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) If a crawl downloads files with a lot of content (for example, large PDF or SWF files), you may see WARN messages about pages being skipped because the content limit was exceeded. To solve this problem, increase the download content limit to a setting that allows all content to be downloaded. (shortdesc] (topicmeta] (topicref] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Authentication properties (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Authentication properties (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You set the authentication properties in the default.xml file. 
(shortdesc] (topicmeta][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) About configuring Basic authentication (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) About configuring Basic authentication (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) If a Web server uses HTTP Basic authentication to restrict access to Web sites, you can specify authentication credentials that enable the Web Crawler to access password-protected pages. The http.auth.basic property sets the credentials to be used by the HTTPClient for Basic authentication. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) About configuring Digest authentication (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) About configuring Digest authentication (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) If a Web server uses HTTP Digest authentication to restrict access to Web sites, you can use the http.auth.digest property to set the credentials used by the HTTPClient for Digest authentication. 
(shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) About configuring NTLM authentication (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) About configuring NTLM authentication (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) If a Web server uses HTTP NTLM authentication to restrict access to Web sites, you can specify authentication credentials that enable the Web Crawler to access password-protected pages. The http.auth.ntlm property sets the credentials to be used by the HTTPClient for NTLM authentication. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Configuring Form-based authentication (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Configuring Form-based authentication (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) If you are crawling sites that implement form-based authentication, you supply the credentials in a form-credentials.xml file. 
(shortdesc] (topicmeta] (topicref] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Properties for authenticated proxy support (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Properties for authenticated proxy support (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You configure authenticated proxy support in the default.xml file. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Fetcher properties (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Fetcher properties (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The fetcher is the Web Crawler component that actually fetches pages from Web sites. You set the fetcher properties in the default.xml file. (shortdesc] (topicmeta][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Use of the max delay and crawl-delay values (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Use of the max delay and crawl-delay values (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The fetcher compares the value of the fetcher.delay.max property to the value of the Crawl-Delay parameter in the robots.txt file. 
(shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Fetcher overrides in the site.xml files (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Fetcher overrides in the site.xml files (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) This topic describes overrides for the fetcher property values in the default.xml file. (shortdesc] (topicmeta] (topicref] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) URL normalization properties (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) URL normalization properties (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You can set the URL normalization properties in the default.xml file. 
(shortdesc] (topicmeta][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Types of URL normalizers (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Types of URL normalizers (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The Endeca Web Crawler has three URL normalizers: (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Normalizing the seed list (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Normalizing the seed list (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You apply normalization to the seed list with the urlnormalizer.normalize-seeds property. (shortdesc] (topicmeta] (topicref] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) MIME type properties (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) MIME type properties (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You set the MIME type mapping properties in the default.xml file. 
(shortdesc] (topicmeta][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Overriding the server text/html MIME type (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Overriding the server text/html MIME type (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) If there is confusion as to the MIME type of a given URL, the Web Crawler by default trusts the URL extension over the server MIME type. The mime.types.trust-server.text-html property is intended for crawls that may experience "text/html" MIME type resolution problems. (shortdesc] (topicmeta] (topicref] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Plugin properties (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Plugin properties (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You set the plugin properties in the default.xml file. 
(shortdesc] (topicmeta][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Default activated plugins (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Default activated plugins (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The default regular expression value for the plugin.includes property activates these plugins: (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Specifying the plugins directory (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Specifying the plugins directory (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The plugin.folders property specifies the location of the plugins directory. (shortdesc] (topicmeta] (topicref] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Parser properties (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Parser properties (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You set the parser properties in the default.xml file. 
(shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Parser filter properties (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Parser filter properties (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You can set the parser filter properties in the default.xml file. (shortdesc] (topicmeta][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Setting the order of parser filters (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Setting the order of parser filters (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The parser.filters.order property specifies the order in which the parser filters are applied. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) About defining the XPath filter expressions (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) About defining the XPath filter expressions (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The document.prune.xpath property defines the XPath expressions that will be used by the Endeca Document Prune XPath Filter (i.e., the endeca-xpath-filter plugin). 
(shortdesc] (topicmeta] (topicref] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) URL filter properties (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) URL filter properties (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You configure how the URL filter plugins are handled in the default.xml file. (shortdesc] (topicmeta][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Setting the order of URL filters (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Setting the order of URL filters (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The urlfilter.order property allows you to specify the order in which URL filters are applied. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Filtering the seed list (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Filtering the seed list (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You apply URL filters to the seeds with the urlfilter.filter-seeds property. 
(shortdesc] (topicmeta] (topicref] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Crawl scoping properties (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Crawl scoping properties (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) In the default.xml file, you implement crawl scoping to control which URLs are crawled. (shortdesc] (topicmeta][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) About configuring crawl scoping (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) About configuring crawl scoping (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The Web Crawler implements a basic crawl scoping scheme to accommodate crawls of multiple seeds. The crawler can scope a crawl to only visit URLs from the same host or from the same domain as a seed. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) How domain names are retrieved from URLs (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) How domain names are retrieved from URLs (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) Every domain name ends in a top-level domain (TLD) name. The TLDs are either generic names (such as com) or country codes (such as jp for Japan). 
(shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Default top-level domain names (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Default top-level domain names (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The crawlscope.top-level-domains.generic property contains the following TLD names in the default.xml configuration file: (shortdesc] (topicmeta] (topicref] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Document conversion properties (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Document conversion properties (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You set the document conversion properties in the default.xml file. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Output properties (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Output properties (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You set output properties in the default.xml file. You can configure output to either an output file (the default) or to a Record Store instance. 
(shortdesc] (topicmeta][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Gathering XHTML information (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Gathering XHTML information (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) If the output.dom.include property is set to true, the Web Crawler normalizes the content of HTML documents into XHTML and stores it in the Endeca.Document.XHTML property in the record. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Excluding record properties (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Excluding record properties (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The output.records.properties.excludes property specifies a list of record properties that you want to exclude from the records. 
(shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Extensions for additional binary output files (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Extensions for additional binary output files (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) For the output.file.binary.file-size-max property, if output has to be written to more than one output file, the name pattern of the new files is similar to this example: (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Output file overrides in site.xml files (navtitle][/map/topicref/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Output file overrides in site.xml files (linktext][/map/topicref/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The site.xml files in the workspace/conf/web-crawler/polite-crawl and workspace/conf/web-crawler/non-polite-crawl directories contain these output file overrides. (shortdesc] (topicmeta] (topicref] (topicref] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) The site.xml file (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) The site.xml file (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The site.xml file provides override property values for the global configuration file default.xml. 
(shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) The crawl-urlfilter.txt file (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) The crawl-urlfilter.txt file (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The crawl-urlfilter.txt file provides include and exclude regular expressions for URLs. (shortdesc] (topicmeta][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Regular expression format (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Regular expression format (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The Web Crawler implements Sun’s java.util.regex package to parse and match the pattern of the regular expression. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Specifying the hosts to accept (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Specifying the hosts to accept (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You set the crawl-urlfilter.txt files to limit a crawl to a specific domain. 
(shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Order of the regular expressions (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Order of the regular expressions (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) When specifying regular expressions, make sure that you list the exclude expressions before the include expressions. The reason is that the RegexURLFilter plugin does the regex-pattern matching from top to bottom. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Excluding file formats (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Excluding file formats (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You globally exclude file formats by adding their file extensions to an exclusion line in the crawl-urlfilter.txt file. (shortdesc] (topicmeta] (topicref] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) The regex-normalize.xml file (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) The regex-normalize.xml file (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The regex-normalize.xml file provides substitutions for normalizing URLs. 
(shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) The mime-types.xml file (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) The mime-types.xml file (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The mime-types.xml file provides mappings of file extensions to MIME types. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) The parse-plugins.xml file (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) The parse-plugins.xml file (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The parse-plugins.xml file provides mappings of MIME types to parsers. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) The form-credentials.xml file (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) The form-credentials.xml file (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The form-credentials.xml file provides the credentials for sites that use form-based authentication. 
(shortdesc] (topicmeta][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) About form-based authentication (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) About form-based authentication (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The Web Crawler supports form-based authentication for both GET and POST requests. The http.auth.form.credentials.file property sets the name of the file that contains the form credentials to be used by the Web client. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Format of the credentials file (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Format of the credentials file (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The format of the form-based authentication credentials file is as follows. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Setting the timeout property (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Setting the timeout property (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You set the authentication timeout with the BasicFormAuthenticator. 
(shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Using special characters in the credentials file (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Using special characters in the credentials file (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) XML has a special set of characters that cannot be used in normal XML strings. If you need to enter any of the following special characters, you must enter them in their encoded format: (shortdesc] (topicmeta] (topicref][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Authentication Exceptions (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Authentication Exceptions (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The authentication framework has two Exception classes: (shortdesc] (topicmeta] (topicref] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) The log4j.properties file (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) The log4j.properties file (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You modify the log4j.properties file to change the properties for the log4j loggers. 
(shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Enabling the IAS Document Conversion Module (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Enabling the IAS Document Conversion Module (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) By default, the Web Crawler is enabled to call the IAS Document Conversion Module to convert any documents that are not text, HTML, XML, SGML, or JavaScript. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Disabling the IAS Document Conversion Module (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Disabling the IAS Document Conversion Module (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) If desired, you can disable the IAS Document Conversion Module to prevent document conversion or license warnings. You can either disable the module globally for all crawls, or you can disable the module on a per-crawl basis. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) About document conversion options (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) About document conversion options (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You can change the default behavior of the IAS Document Conversion Module by modifying JVM property names and values. 
(shortdesc] (topicmeta][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Setting document conversion options (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Setting document conversion options (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) Set the document conversion options as parameters to the JVM's -D option. (shortdesc] (topicmeta] (topicref] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Configuring Web crawls to write output to a Record Store instance (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Configuring Web crawls to write output to a Record Store instance (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The Web Crawler can be configured to write its output directly to a Record Store instance, instead of to an output file on disk (the default). This procedure describes how to modify a single crawl configuration in the site.xml file and not the global Web Crawler configuration in default.xml. (shortdesc] (topicmeta] (topicref] (topicref] [/map/topicref {"- map/topicref "}) [/map/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicmeta/navtitle {"- topic/navtitle "}) Supported Crawl Types (navtitle][/map/topicref/topicmeta/linktext {"- map/linktext "}) Supported Crawl Types (linktext][/map/topicref/topicmeta/shortdesc {"- map/shortdesc "}) This section provides an overview of the full and resumable crawl types that are supported by the Endeca Web Crawler. 
(shortdesc] (topicmeta][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) About full crawls (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) About full crawls (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) This topic provides an overview of full crawls. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) About resumable crawls (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) About resumable crawls (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) This topic provides an overview of resumable crawls. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) About workspace directories and output files (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) About workspace directories and output files (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) This topic describes file output settings. By default, the workspace directory is used to store the output files of a Web crawl. For details about Record Store settings, see the Integrator Acquisition System Developer's Guide. 
(shortdesc] (topicmeta] (topicref] (topicref] [/map/topicref {"- map/topicref "}) [/map/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicmeta/navtitle {"- topic/navtitle "}) Running the Endeca Web Crawler (navtitle][/map/topicref/topicmeta/linktext {"- map/linktext "}) Running the Endeca Web Crawler (linktext][/map/topicref/topicmeta/shortdesc {"- map/shortdesc "}) This section provides information on how to run the Endeca Web Crawler, including the startup scripts and the record properties that are returned by the crawls. (shortdesc] (topicmeta][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Command-line flags for crawls (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Command-line flags for crawls (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The Endeca Web Crawler startup script has several flags to control the behavior of the crawl. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Running full crawls (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Running full crawls (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You run full crawls from the command line. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Running resumable crawls (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Running resumable crawls (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) You run a resumable crawl from the command line. 
(shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Record properties generated by a crawl (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Record properties generated by a crawl (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) During a crawl, the Endeca Web Crawler produces record properties according to a standardized naming scheme. (shortdesc] (topicmeta] (topicref] (topicref] [/map/topicref {"- map/topicref "}) [/map/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicmeta/navtitle {"- topic/navtitle "}) Running the Sample Web Crawler Plugin (navtitle][/map/topicref/topicmeta/linktext {"- map/linktext "}) Running the Sample Web Crawler Plugin (linktext][/map/topicref/topicmeta/shortdesc {"- map/shortdesc "}) This section provides instructions for running the sample Web Crawler plugin, a custom parse filter plugin that adds HTML meta tags as additional properties to the output records. (shortdesc] (topicmeta][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) About the Web Crawler plugin framework (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) About the Web Crawler plugin framework (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The Endeca Web Crawler is based on the Apache Nutch open-source project. As such, its main functionality is implemented as plugins. Its framework allows you to write your own plugins, such as plugins that extract additional content from Web pages. 
(shortdesc] (topicmeta][/map/topicref/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) How the Web Crawler processes URLs (navtitle][/map/topicref/topicref/topicref/topicmeta/linktext {"- map/linktext "}) How the Web Crawler processes URLs (linktext][/map/topicref/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) Knowing how the Web Crawler processes URLs helps you understand where a new plugin fits in, because the URL processing is accomplished by a series of plugins. (shortdesc] (topicmeta] (topicref] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) About the sample custom filter plugin (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) About the sample custom filter plugin (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) Custom filters (ParseFilter) implement content extensions. These filters can examine the contents of a page (either the raw page contents or the parsed DOM) and add additional properties to records that are produced. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Adding a custom plugin to the Endeca Web Crawler (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Adding a custom plugin to the Endeca Web Crawler (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) This topic provides an overview of how to add a custom plugin to the Endeca Web Crawler. 
(shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Opening the sample plugin project (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Opening the sample plugin project (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) For the purpose of this sample, you load the sample parse filter plugin project. If you were creating your own plugin, you would create your own Eclipse project. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Overview of the sample HTMLMetatagFilter plugin (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Overview of the sample HTMLMetatagFilter plugin (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) For the purpose of this sample, we use the source for the HTMLMetatagFilter class that is in the HTMLMetatagFilter.java source file (in the IAS\<version>\sample\custom-web-crawler-plugin\src directory). If you were writing your own plugin, you would write the code for your custom plugin. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Overview of the plugin.xml file (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Overview of the plugin.xml file (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) The plugin.xml file describes the plugin to the Web Crawler. The file resides in the plugin directory along with the JAR file. 
(shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Building the sample plugin (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Building the sample plugin (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) For the purpose of this sample, use Eclipse to build a JAR of the sample Web Crawler parse plugin. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Adding the plugin to the IAS lib directory (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Adding the plugin to the IAS lib directory (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) After you build the JAR for your custom plugin, create a directory for the plugin and copy this to the Web Crawler's plugin directory. (shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Activating the plugin for the Web Crawler (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Activating the plugin for the Web Crawler (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) Oracle recommends that you modify the crawl-specific site.xml file, rather than the global default.xml file (this is because the site.xml settings override the default.xml global settings). 
(shortdesc] (topicmeta] (topicref][/map/topicref/topicref {"- map/topicref "}) [/map/topicref/topicref/topicmeta {"- map/topicmeta "}) [/map/topicref/topicref/topicmeta/navtitle {"- topic/navtitle "}) Running the Web Crawler with a new plugin (navtitle][/map/topicref/topicref/topicmeta/linktext {"- map/linktext "}) Running the Web Crawler with a new plugin (linktext][/map/topicref/topicref/topicmeta/shortdesc {"- map/shortdesc "}) After you activate the new plugin, you can run new crawls exactly as before. (shortdesc] (topicmeta] (topicref] (topicref] (map]