URL processing in the Web Crawler is accomplished by a series of plug-ins, so knowing how a URL moves through that series helps you understand where a new plug-in fits in.
Each URL is processed by a thread as follows:
1. The scheduler determines which URL should be fetched (this step is not shown in the diagram).
2. Fetch: A Protocol plug-in (such as the protocol-httpclient plug-in) fetches the bytes for a URL and places them in a Content object.
3. In-crawl Auth: An Authenticator plug-in can determine whether form-based authentication is required. If so, a specific login URL can be fetched, as shown in the diagram.
4. Parse: A Parse plug-in parses the content (the Content object) and generates a Parse object. It also extracts outlinks. For example, the parse-html plug-in uses the Neko library to extract the DOM representation of an HTML page.
5. Filter: ParseFilter plug-ins do additional processing on the raw and parsed content, because these plug-ins have access to both the Content and Parse objects for a particular page. For example, the endeca-xpath-filter plug-in (if activated) uses XPath expressions to prune documents.
6. Generate: A record is generated and written to the record output file. In addition, any outlinks are queued by the scheduler to be fetched.

The sketches after this list illustrate several of these steps.
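To make the sequence concrete, here is a minimal sketch of the per-URL flow in Java. Every type in it (Content, Parse, Protocol, Parser, ParseFilter, CrawlerThread) is a simplified stand-in invented for illustration, not the crawler's actual plug-in API, and the In-crawl Auth step is omitted for brevity.

```java
// Hypothetical model of the per-URL processing flow described above.
// None of these types are the crawler's real plug-in interfaces; they
// only mirror the roles of the Protocol, Parse, and ParseFilter stages.
import java.util.List;

final class Content {                  // raw bytes fetched for a URL (step 2)
    final String url;
    final byte[] bytes;
    Content(String url, byte[] bytes) { this.url = url; this.bytes = bytes; }
}

final class Parse {                    // parsed text plus extracted outlinks (step 4)
    final String text;
    final List<String> outlinks;
    Parse(String text, List<String> outlinks) { this.text = text; this.outlinks = outlinks; }
}

interface Protocol    { Content fetch(String url); }                   // step 2
interface Parser      { Parse parse(Content content); }                // step 4
interface ParseFilter { Parse filter(Content content, Parse parse); }  // step 5

final class CrawlerThread {
    private final Protocol protocol;
    private final Parser parser;
    private final List<ParseFilter> filters;

    CrawlerThread(Protocol protocol, Parser parser, List<ParseFilter> filters) {
        this.protocol = protocol;
        this.parser = parser;
        this.filters = filters;
    }

    /** One pass of the flow: fetch, parse, filter, then return outlinks
     *  so the scheduler can queue them (step 6). */
    List<String> process(String url) {
        Content content = protocol.fetch(url);    // step 2: Fetch
        Parse parse = parser.parse(content);      // step 4: Parse
        for (ParseFilter f : filters) {           // step 5: Filter chain; each
            parse = f.filter(content, parse);     // filter sees both objects
        }
        writeRecord(url, parse);                  // step 6: Generate a record
        return parse.outlinks;                    // outlinks go back to the scheduler
    }

    private void writeRecord(String url, Parse parse) {
        System.out.println(url + " -> " + parse.text.length() + " chars");
    }
}
```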
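The Fetch step's job is essentially "bytes for a URL into a Content object." The sketch below implements the hypothetical Protocol interface from the previous sketch using the JDK's java.net.http.HttpClient; the real protocol-httpclient plug-in uses its own HTTP stack and handles much more (headers, redirects, politeness, and so on).

```java
// Hypothetical sketch of the Fetch step, reusing the Protocol and
// Content stand-ins defined in the pipeline sketch above. This is
// illustration only, not the protocol-httpclient plug-in's code.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class JdkHttpProtocol implements Protocol {
    private final HttpClient client = HttpClient.newHttpClient();

    @Override
    public Content fetch(String url) {
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<byte[]> response =
                client.send(request, HttpResponse.BodyHandlers.ofByteArray());
            return new Content(url, response.body());  // step 2: bytes into a Content
        } catch (Exception e) {
            throw new RuntimeException("fetch failed for " + url, e);
        }
    }
}
```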
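In the Filter step, the endeca-xpath-filter plug-in prunes documents with XPath expressions. The standalone sketch below demonstrates the general idea with the JDK's javax.xml.xpath API; the expression and the well-formed sample page are invented for illustration, and the plug-in's actual configuration and behavior are not shown here.

```java
// Hypothetical illustration of XPath-based pruning (the general idea
// behind a filter like endeca-xpath-filter, not its actual code).
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

final class XPathPruneExample {
    public static void main(String[] args) throws Exception {
        String html = "<html><body>"
            + "<div class=\"nav\">menu</div>"
            + "<div class=\"content\">real text</div>"
            + "</body></html>";
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(html.getBytes("UTF-8")));

        // Example expression: does the page contain a node we want to drop?
        XPath xpath = XPathFactory.newInstance().newXPath();
        Boolean hasNav = (Boolean) xpath.evaluate(
            "count(//div[@class='nav']) > 0", doc, XPathConstants.BOOLEAN);

        // A pruning filter might skip record generation for matching pages,
        // or remove the matching nodes before the Generate step.
        System.out.println(hasNav ? "prune this page" : "keep this page");
    }
}
```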
In the processing flow, the sample htmlmetatags plug-in would be part of step 5 (Filter), because it does additional processing of the parsed content.
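As a rough idea of what such a filter might involve, the standalone sketch below collects meta name/content pairs from a parsed DOM, such as the one parse-html builds with Neko. It is not the htmlmetatags sample's actual source; the page and names are invented, and the JDK XML parser stands in for Neko (which, unlike this parser, copes with real-world HTML).

```java
// Hypothetical sketch in the spirit of the sample htmlmetatags plug-in:
// read <meta> tags from the DOM that the Parse step built. A real
// ParseFilter would run logic like this against the page it receives.
import java.io.ByteArrayInputStream;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class MetaTagExtractor {
    /** Returns meta name -> content pairs found in the document. */
    static Map<String, String> extract(Document doc) {
        Map<String, String> tags = new HashMap<>();
        NodeList metas = doc.getElementsByTagName("meta");
        for (int i = 0; i < metas.getLength(); i++) {
            Element meta = (Element) metas.item(i);
            String name = meta.getAttribute("name");
            if (!name.isEmpty()) {
                tags.put(name, meta.getAttribute("content"));
            }
        }
        return tags;
    }

    // Small usage example with a well-formed page; in the crawler, the
    // Parse step (via Neko) would normally supply the DOM.
    public static void main(String[] args) throws Exception {
        String html = "<html><head>"
            + "<meta name=\"keywords\" content=\"crawler,plug-in\"/>"
            + "<meta name=\"description\" content=\"example page\"/>"
            + "</head><body/></html>";
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(html.getBytes("UTF-8")));
        System.out.println(extract(doc));  // e.g. {keywords=..., description=...}
    }
}
```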