The Web Crawler supports form-based authentication for both GET and POST requests.
The http.auth.form.credentials.file property sets the name of the file that contains the form credentials to be used by the Web client.
If a Web server uses HTML forms to restrict access to Web sites, you can specify authentication credentials that enable the Web Crawler to access password-protected pages.
The fields that you specify in the credentials file correspond to the fields that an interactive user fills in when prompted by the Web browser, plus any hidden or static fields that are required for a successful login. You must therefore coordinate with the server administrators, who must provide you with the security requirements for the Web sites. These requirements include all information that is used to authenticate the Web Crawler's identity and to determine that the crawler has permission to crawl the restricted pages.
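As a rough aid for gathering those fields, the following Python sketch lists the named inputs of the first form on a login page, including hidden ones. It is not part of the Web Crawler: the LoginFormFields class, the sample page, and the field names username, password, and realm are all hypothetical, and the server administrators remain the authoritative source for what a login requires.

```python
from html.parser import HTMLParser


class LoginFormFields(HTMLParser):
    """Collect the method, action, and named inputs of the first form on a page."""

    def __init__(self):
        super().__init__()
        self.method = None
        self.action = None
        self.fields = []       # (name, type, default value) for each named <input>
        self._in_form = False
        self._done = False     # set once the first form has been closed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            # HTML forms default to GET when no method attribute is given.
            self.method = (attrs.get("method") or "get").upper()
            self.action = attrs.get("action")
        elif tag == "input" and self._in_form and attrs.get("name"):
            self.fields.append((
                attrs["name"],
                (attrs.get("type") or "text").lower(),
                attrs.get("value") or "",
            ))

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


# Hypothetical login page; a real one comes from the site you plan to crawl.
SAMPLE_PAGE = """
<form method="post" action="/login">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="hidden" name="realm" value="staff">
  <input type="submit" value="Sign in">
</form>
"""

parser = LoginFormFields()
parser.feed(SAMPLE_PAGE)
print(parser.method, parser.action)
for name, kind, value in parser.fields:
    print(f"{name:10} {kind:9} {value!r}")
```

In this sample, the hidden realm field would have to be supplied along with the user name and password, even though an interactive user never sees it.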
In the Web Crawler, the authentication plugin provides a way to execute form-based login for Web crawls. The plugin implements two main authentication modes:
Pre-crawl authentication mode performs the authentication before the crawl begins. Note that if pre-crawl authentication is specified and the request times out, the Authenticator will attempt an in-crawl authentication for the retry.
In-crawl authentication mode performs the authentication as the crawl is progressing. After each page is fetched and processed, a site-specific authenticator checks the page contents and determines whether the page needs to be refetched (for example, because the crawler has been logged out). The authenticator may log in to the site if necessary.
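The difference between the two modes can be sketched in a few lines of Python. This is an illustration of the control flow only, not the plugin's implementation: FakeSite, needs_login, and crawl are hypothetical names, the "site" is an in-memory stand-in, and the timeout fallback described for pre-crawl mode is not modeled.

```python
class FakeSite:
    """Hypothetical in-memory stand-in for a form-protected Web site."""

    def __init__(self, expire_after=None):
        self.logged_in = False
        self.logins = 0
        self.fetches = 0
        self.expire_after = expire_after  # session is dropped on this fetch number

    def login(self):
        self.logged_in = True
        self.logins += 1

    def fetch(self, url):
        self.fetches += 1
        if self.expire_after is not None and self.fetches == self.expire_after:
            self.logged_in = False  # simulate the server logging the crawler out
        if self.logged_in:
            return f"content of {url}"
        return "<form action='/login'>Please sign in</form>"


def needs_login(page):
    """Site-specific check (assumed): does the page look like the login form?"""
    return "Please sign in" in page


def crawl(site, urls, pre_crawl_auth):
    """Fetch each URL, authenticating before the crawl or on demand."""
    if pre_crawl_auth:
        site.login()  # pre-crawl mode: authenticate once, before any fetch
    pages = {}
    for url in urls:
        page = site.fetch(url)
        if not pre_crawl_auth and needs_login(page):
            # in-crawl mode: log in when a page shows it is needed, then refetch
            site.login()
            page = site.fetch(url)
        pages[url] = page
    return pages


# In-crawl mode against a site that drops the session on the third fetch.
site = FakeSite(expire_after=3)
pages = crawl(site, ["/a", "/b", "/c"], pre_crawl_auth=False)
print(site.logins, site.fetches)  # 2 5
```

In this toy run, in-crawl mode costs two extra fetches but recovers from the dropped session, whereas pre-crawl mode logs in exactly once and relies on that session lasting for the whole crawl.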
The preCrawlAuth setting in the credentials file determines whether pre-crawl or in-crawl authentication is performed. If you are uncertain as to which mode to use, we recommend that you start with pre-crawl mode, as long as you expect that the authentication process will not time out. If you expect timeouts to occur, in-crawl mode is the better choice.