How the Web Crawler processes URLs

Because the URL processing is accomplished by a series of plug-ins, knowing how the Web Crawler processes URLs helps you understand where a new plug-in fits in.

Each URL is processed by a thread in the following manner (a simplified sketch of this flow appears after the list):
  1. The scheduler determines which URL should be fetched (this step is not shown in the diagram).
  2. Fetch: A Protocol plug-in (such as the protocol-httpclient plug-in) fetches the bytes for a URL and places them in a Content object.
  3. In-crawl Auth: An Authenticator plug-in can determine whether form-based authentication is required. If so, a specific login URL can be fetched, as shown in the diagram.
  4. Parse: A Parse plug-in parses the content (the Content object) and generates a Parse object. It also extracts outlinks. For example, the parse-html plug-in uses the Neko library to extract the DOM representation of an HTML page.
  5. Filter: ParseFilter plug-ins perform additional processing on the raw and parsed content; these plug-ins have access to both the Content and Parse objects for a particular page. For example, the endeca-xpath-filter plug-in (if activated) uses XPath expressions to prune documents.
  6. Generate: A record is generated and written to the record output file. In addition, any outlinks are queued by the scheduler to be fetched.
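
The following sketch summarizes the flow in simplified Java. The object names (Content, Parse) and the plug-in roles (Protocol, Authenticator, Parse, ParseFilter) come from the steps above, but the interfaces, method names, and orchestration code are illustrative assumptions, not the actual Web Crawler API.

    import java.util.List;

    // Simplified stand-ins for the objects named in the steps above (illustrative only).
    interface Content {}                                  // raw bytes fetched for a URL
    interface Parse { List<String> getOutlinks(); }       // parsed text, metadata, and outlinks

    // Hypothetical plug-in interfaces; the real plug-in signatures differ.
    interface Protocol      { Content fetch(String url); }            // step 2: Fetch
    interface Authenticator { boolean loginRequired(Content c); }     // step 3: In-crawl Auth
    interface Parser        { Parse parse(Content c); }               // step 4: Parse
    interface ParseFilter   { Parse filter(Content c, Parse p); }     // step 5: Filter

    interface RecordWriter  { void write(String url, Parse parse); }  // step 6: Generate
    interface Scheduler     { void queue(List<String> outlinks); }

    class UrlProcessor {
        private final Protocol protocol;
        private final Authenticator authenticator;
        private final Parser parser;
        private final List<ParseFilter> filters;

        UrlProcessor(Protocol protocol, Authenticator authenticator,
                     Parser parser, List<ParseFilter> filters) {
            this.protocol = protocol;
            this.authenticator = authenticator;
            this.parser = parser;
            this.filters = filters;
        }

        // One worker thread runs roughly this sequence for each URL chosen by the scheduler.
        void process(String url, String loginUrl, RecordWriter out, Scheduler scheduler) {
            Content content = protocol.fetch(url);            // step 2: fetch bytes into a Content object
            if (authenticator.loginRequired(content)) {       // step 3: form-based authentication needed?
                protocol.fetch(loginUrl);                      //   fetch the login URL first
                content = protocol.fetch(url);                 //   then refetch the original URL
            }
            Parse parse = parser.parse(content);               // step 4: parse content, extract outlinks
            for (ParseFilter f : filters) {                    // step 5: each filter sees Content and Parse
                parse = f.filter(content, parse);
            }
            out.write(url, parse);                             // step 6: write the output record
            scheduler.queue(parse.getOutlinks());              //         queue outlinks for fetching
        }
    }
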
In the processing flow, the sample htmlmetatags plug-in would be part of step 5, because it does additional processing of the parsed content.
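
As a concrete illustration of a step 5 plug-in, here is a minimal ParseFilter-style sketch in the spirit of the htmlmetatags example. It reuses the hypothetical Content, Parse, and ParseFilter interfaces from the sketch above; the actual plug-in interface and metadata API of the Web Crawler differ, so treat the names here as assumptions.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Hypothetical filter that scans the raw HTML for <meta name="..." content="..."> tags.
    // A real step 5 plug-in would read the fetched bytes from the Content object and attach
    // the values it finds to the parsed output so they end up on the generated record.
    class MetaTagFilter implements ParseFilter {
        private static final Pattern META = Pattern.compile(
                "<meta\\s+name=\"([^\"]+)\"\\s+content=\"([^\"]*)\"",
                Pattern.CASE_INSENSITIVE);

        @Override
        public Parse filter(Content content, Parse parse) {
            Matcher m = META.matcher(rawHtmlOf(content));
            while (m.find()) {
                String name = m.group(1);    // e.g. "keywords"
                String value = m.group(2);   // e.g. "crawler, plug-in"
                // ...attach name/value to the parse metadata here...
            }
            return parse;
        }

        // Placeholder accessor; the real Content object exposes the fetched bytes differently.
        private String rawHtmlOf(Content content) { return ""; }
    }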