Knowing how the Web Crawler processes URLs helps you understand where a new plug-in fits in, because URL processing is accomplished by a series of plug-ins. Each URL is processed by a thread in the following manner (a sketch of the full flow follows the list):
1. The scheduler determines which URL should be fetched (this step is not shown in the diagram).
2. Fetch: A Protocol plug-in (such as the protocol-httpclient plug-in) fetches the bytes for a URL and places them in a Content object.
3. In-crawl Auth: An Authenticator plug-in can determine whether form-based authentication is required. If so, a specific login URL can be fetched, as shown in the diagram.
4. Parse: A Parse plug-in parses the content (the Content object) and generates a Parse object. It also extracts outlinks. For example, the parse-html plug-in uses the Neko library to extract the DOM representation of an HTML page.
5. Filter: ParseFilter plug-ins perform additional processing on the raw and parsed content; they have access to both the Content and Parse objects for a given page. For example, the endeca-xpath-filter plug-in (if activated) uses XPath expressions to prune documents (see the pruning illustration after this list).
6. Generate: A record is generated and written to the record output file. In addition, any outlinks are queued by the scheduler to be fetched.
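The flow can be pictured as a simple per-thread loop. The following sketch uses hypothetical Java interfaces to show how the six steps fit together; every name in it (Scheduler, Protocol, Authenticator, Parser, ParseFilter, RecordWriter, Content, Parse) is an illustrative placeholder, not the crawler's actual API, and the way a login URL is refetched in step 3 is an assumption for the example.

```java
import java.util.List;

// NOTE: all types below are illustrative placeholders for the flow
// described above, not the Web Crawler's real plug-in API.
interface Scheduler     { String next(); void queueOutlinks(List<String> outlinks); }
interface Protocol      { Content fetch(String url); }                   // step 2
interface Authenticator { boolean requiresLogin(Content c); String loginUrl(String url); } // step 3
interface Parser        { Parse parse(Content content); }                // step 4
interface ParseFilter   { Parse filter(Content content, Parse parse); }  // step 5
interface RecordWriter  { void write(String url, Content c, Parse p); }  // step 6

class Content { byte[] bytes; }                        // raw fetched bytes
class Parse {
    List<String> outlinks;                             // links found during parsing
    List<String> getOutlinks() { return outlinks; }
}

public class CrawlThread implements Runnable {
    private final Scheduler scheduler;
    private final Protocol protocol;
    private final Authenticator authenticator;
    private final Parser parser;
    private final List<ParseFilter> filters;
    private final RecordWriter recordWriter;

    CrawlThread(Scheduler s, Protocol p, Authenticator a, Parser pr,
                List<ParseFilter> f, RecordWriter w) {
        scheduler = s; protocol = p; authenticator = a;
        parser = pr; filters = f; recordWriter = w;
    }

    @Override
    public void run() {
        String url;
        while ((url = scheduler.next()) != null) {         // step 1: pick a URL
            Content content = protocol.fetch(url);         // step 2: fetch bytes
            if (authenticator.requiresLogin(content)) {    // step 3: form-based auth?
                protocol.fetch(authenticator.loginUrl(url)); // fetch the login URL
                content = protocol.fetch(url);             // assumed: retry original URL
            }
            Parse parse = parser.parse(content);           // step 4: DOM + outlinks
            for (ParseFilter filter : filters) {           // step 5: filter chain
                parse = filter.filter(content, parse);
            }
            recordWriter.write(url, content, parse);       // step 6: emit record
            scheduler.queueOutlinks(parse.getOutlinks());  // queue new URLs
        }
    }
}
```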
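To make step 5 concrete, here is a minimal illustration of XPath-based pruning in the style described for the endeca-xpath-filter plug-in, using the standard javax.xml.xpath API against a parsed DOM. The expression shown and the wiring around it are assumptions for the example; consult the plug-in's own documentation for its actual configuration.

```java
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Illustration only: removes all nodes matching an XPath expression
// from the DOM before the record is generated.
public class XPathPruner {
    public static void prune(Document dom, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList matches =
            (NodeList) xpath.evaluate(expression, dom, XPathConstants.NODESET);
        for (int i = 0; i < matches.getLength(); i++) {
            Node node = matches.item(i);
            node.getParentNode().removeChild(node);  // drop boilerplate subtree
        }
    }
}
// Example (hypothetical expression; Neko reports HTML element names in
// uppercase): prune(dom, "//DIV[@id='navigation']") removes a nav block.
```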
In this processing flow, the sample htmlmetatags plug-in would be part of step 5 (Filter), because it performs additional processing on the parsed content.
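As a rough picture of what such a filter might look like, the following sketch copies HTML meta name/content pairs into the parse metadata. It reuses the placeholder ParseFilter, Content, and Parse types from the loop sketch above, and additionally assumes hypothetical getDom() and getMetadata() accessors on Parse; the real sample plug-in's API will differ.

```java
import java.util.HashMap;
import java.util.Map;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical filter in the spirit of the sample htmlmetatags plug-in:
// copies <meta name="..." content="..."> pairs into the parse metadata.
// getDom() and getMetadata() are assumed accessors, not documented API.
public class HtmlMetaTagsFilter implements ParseFilter {
    @Override
    public Parse filter(Content content, Parse parse) {
        Document dom = parse.getDom();   // assumed: DOM stored by parse-html
        if (dom == null) {
            return parse;                // non-HTML content: nothing to add
        }
        // Neko reports HTML element names in uppercase.
        NodeList metas = dom.getElementsByTagName("META");
        Map<String, String> tags = new HashMap<>();
        for (int i = 0; i < metas.getLength(); i++) {
            Element meta = (Element) metas.item(i);
            String name = meta.getAttribute("name");
            String value = meta.getAttribute("content");
            if (!name.isEmpty() && !value.isEmpty()) {
                tags.put(name.toLowerCase(), value);  // e.g. "keywords"
            }
        }
        parse.getMetadata().putAll(tags);  // assumed mutable metadata map
        return parse;
    }
}
```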