How the Web Crawler processes URLs

URL processing is accomplished by a series of plugins, so knowing how the Web Crawler processes URLs helps you understand where a new plugin fits in.

Each URL is processed as shown in the following diagram:

[Diagram: Flowchart of URL processing by the Web Crawler]

The processing flow is as follows:
  1. The scheduler determines which URL should be fetched (this step is not shown in the diagram).
  2. Fetch: A Protocol plugin (such as the protocol-httpclient plugin) fetches the bytes for a URL and places them in a Content object.
  3. In-crawl Auth: An Authenticator plugin can determine whether form-based authentication is required. If so, a specific login URL can be fetched, as shown in the diagram.
  4. Parse: A Parse plugin parses the content (the Content object) and generates a Parse object. It also extracts outlinks. For example, the parse-html plugin uses the Neko library to extract the DOM representation of an HTML page.
  5. Filter: ParseFilter plugins do additional processing on raw and parsed content, because these plugins have access to both the Content and Parse objects for a particular page (see the sketch after this list). For example, the endeca-xpath-filter plugin (if activated) uses XPath expressions to prune documents.
  6. Generate: A record is generated and written to the record output file. In addition, any outlinks are queued by the scheduler to be fetched.
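
Taken together, these steps form a simple fetch-parse-filter loop. The sketch below shows the shape of that loop in Java. It is illustrative only: every type in it (Scheduler, ProtocolPlugin, AuthenticatorPlugin, ParsePlugin, ParseFilterPlugin, RecordWriter, and the minimal Content and Parse classes) is a hypothetical stand-in for the Web Crawler's plugin interfaces, not an actual SDK signature.

```java
import java.util.List;

// All types here are hypothetical stand-ins for the Web Crawler's
// plugin interfaces; the real SDK signatures differ.
class Content { }                                   // raw fetched bytes
class Parse {                                       // parsed output
    List<String> getOutlinks() { return List.of(); }
}

interface Scheduler {
    boolean hasNext();
    String next();
    void queue(List<String> outlinks);
}
interface ProtocolPlugin {                          // e.g. protocol-httpclient
    Content fetch(String url);
}
interface AuthenticatorPlugin {
    boolean loginRequired(Content content);
    String loginUrl(Content content);
}
interface ParsePlugin {                             // e.g. parse-html
    Parse parse(Content content);
}
interface ParseFilterPlugin {                       // e.g. endeca-xpath-filter
    Parse filter(Content content, Parse parse);
}
interface RecordWriter {
    void write(String url, Content content, Parse parse);
}

public class CrawlLoopSketch {
    public static void crawl(Scheduler scheduler, ProtocolPlugin protocol,
                             AuthenticatorPlugin auth, ParsePlugin parser,
                             List<ParseFilterPlugin> filters, RecordWriter writer) {
        while (scheduler.hasNext()) {
            String url = scheduler.next();          // 1. scheduler picks the next URL
            Content content = protocol.fetch(url);  // 2. fetch bytes into a Content object
            if (auth.loginRequired(content)) {      // 3. in-crawl auth: fetch the login URL
                content = protocol.fetch(auth.loginUrl(content));
            }
            Parse parse = parser.parse(content);    // 4. parse; outlinks are extracted here
            for (ParseFilterPlugin f : filters) {   // 5. each filter sees Content and Parse
                parse = f.filter(content, parse);
            }
            writer.write(url, content, parse);      // 6. write a record to the output file
            scheduler.queue(parse.getOutlinks());   //    and queue outlinks for fetching
        }
    }
}
```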
In the processing flow, the sample htmlmetatags plugin would be part of step 5, because it does additional processing of the parsed content.
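
For illustration, a minimal htmlmetatags-style ParseFilter might look like the following sketch. It reads the meta elements from the DOM that the parse step produced and copies their name/content pairs into the parse metadata, so that they can end up on the generated record. The MetaTagsFilter class and the stand-in Content and Parse types are hypothetical (redeclared here so the sketch compiles on its own); the real plugin implements the Web Crawler SDK's ParseFilter extension point.

```java
import java.util.HashMap;
import java.util.Map;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Stand-in types; they are not the actual Web Crawler SDK classes.
class Content { }                                // raw fetched bytes
class Parse {
    final Map<String, String> metadata = new HashMap<>();
    Document dom;                                // DOM produced by the parse step
}

public class MetaTagsFilter {
    // A ParseFilter-style hook: it receives both the raw Content and
    // the Parse object for a page, and returns the (possibly
    // enriched) Parse object.
    public Parse filter(Content content, Parse parse) {
        if (parse.dom == null) {
            return parse;                        // nothing parsed; pass through
        }
        NodeList metas = parse.dom.getElementsByTagName("meta");
        for (int i = 0; i < metas.getLength(); i++) {
            Element meta = (Element) metas.item(i);
            String name = meta.getAttribute("name");       // "" if absent
            String value = meta.getAttribute("content");
            if (!name.isEmpty() && !value.isEmpty()) {
                // e.g. <meta name="keywords" content="..."> becomes
                // a "keywords" property on the record for this page
                parse.metadata.put(name, value);
            }
        }
        return parse;
    }
}
```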