This section provides an overview of differential crawling.
Overview of a differential crawling pipeline
Conceptually, a differential crawl is similar to a full crawl, except that it downloads only those documents that have been modified since the previous crawl.
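The spider performs this change detection internally, but the underlying idea can be sketched with conditional HTTP requests. The following Python sketch is illustrative only and is not the Endeca implementation; the differential_fetch function and the previous_metadata structure are hypothetical names, and the metadata is assumed to have been saved from the earlier crawl.

    # Conceptual sketch only -- not the Endeca spider implementation.
    # It shows how a crawler can skip unchanged documents by sending
    # conditional HTTP requests based on metadata saved from a previous crawl.
    import urllib.request
    import urllib.error

    def differential_fetch(url, previous_metadata):
        """Fetch url only if it changed since the previous crawl.

        previous_metadata maps URL -> dict with optional 'etag' and
        'last_modified' values captured during the earlier crawl.
        Returns the document body, or None if the document is unchanged.
        """
        request = urllib.request.Request(url)
        known = previous_metadata.get(url, {})
        if known.get("etag"):
            request.add_header("If-None-Match", known["etag"])
        if known.get("last_modified"):
            request.add_header("If-Modified-Since", known["last_modified"])
        try:
            with urllib.request.urlopen(request) as response:
                return response.read()          # new or changed document
        except urllib.error.HTTPError as err:
            if err.code == 304:                 # 304 Not Modified: skip the body
                return None
            raise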
About enabling differential crawling for the spider
Both the full crawl pipeline and the differential crawl pipeline must have a spider component. The main configuration difference is that the differential spider has a URL specified in the “Differential crawl URL” field of the Spider editor.
About joining previously-crawled data
Because the differential spider downloads only changed documents, the bodies of unchanged documents must be supplied by joining a previous full crawl's output back into the pipeline (Forge's output can be fed directly into a record adapter that uses the "binary" format).
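The binary record format itself is internal to Forge, so the sketch below uses JSON Lines purely as a stand-in to show the round trip: one crawl writes its records out, and a later differential pipeline reads them back in as a second input source keyed for joining. The function names and the "url" key are hypothetical.

    # Stand-in illustration of round-tripping a crawl's output; the real
    # pipeline uses Forge's "binary" record format and a record adapter.
    import json

    def write_crawl_output(records, path):
        """Persist one crawl's records (dicts of property name -> value)."""
        with open(path, "w", encoding="utf-8") as out:
            for record in records:
                out.write(json.dumps(record) + "\n")

    def read_crawl_output(path):
        """Read a previous crawl's records back in, keyed for joining.

        "url" is a hypothetical key standing in for whatever record
        property identifies a document in the real pipeline.
        """
        with open(path, encoding="utf-8") as source:
            return {record["url"]: record
                    for record in (json.loads(line) for line in source)}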
About removing invalid content
The differential crawl contains a record for each document that previously existed but has since disappeared (or is no longer valid because the spider's parameters have changed). These records have an Endeca.Document.Status property equal to “Fetch Failed” or “Fetch Aborted” and must be removed from the output.
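A minimal sketch of this filtering step, assuming records are represented as plain dictionaries and keeping the Endeca.Document.Status property name and status values quoted above:

    # Sketch of the removal step, not the record manipulator's actual
    # implementation. Records are assumed to be plain dicts.
    FAILED_STATUSES = {"Fetch Failed", "Fetch Aborted"}

    def remove_failed(records):
        """Drop records for documents that disappeared or are no longer valid."""
        return [record for record in records
                if record.get("Endeca.Document.Status") not in FAILED_STATUSES]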
Caveats for differential crawling
Because differential crawls depend heavily on each page's HTTP headers (such as content length and modification date), it is critical that the server being crawled produce accurate, differential-crawl-friendly metadata.
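One way to spot problem servers before relying on a differential crawl is to probe a few representative URLs and confirm that the change-related headers are present. The sketch below is a hypothetical diagnostic, not an Endeca tool; it checks only the standard Last-Modified, ETag, and Content-Length headers.

    # Hypothetical diagnostic: report whether a server returns the metadata
    # a differential crawl relies on.
    import urllib.request

    def check_differential_friendliness(url):
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request) as response:
            headers = response.headers
            report = {name: headers.get(name)
                      for name in ("Last-Modified", "ETag", "Content-Length")}
        missing = [name for name, value in report.items() if value is None]
        if missing:
            print(f"{url}: missing {', '.join(missing)} -- differential "
                  "crawling may re-download this document every time")
        return report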
Record adapters
The sample pipeline has two input record adapters and one output record adapter.
Differential spider
A differential crawl spider is configured in the same way as a full crawl spider, with the exception of the Differential Crawl URL field.
Record assembler
The JoinDifferentialAndFull component is a record assembler that performs a First Record join between the current and previous crawls.
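The general semantics of a first-record join are that, for each value of the join key, the result keeps the record from the earliest source that supplies it. The sketch below illustrates those semantics under the assumption that the current differential crawl is listed before the previous full crawl, so freshly fetched documents take precedence and unchanged documents fall back to their previously crawled bodies; the function and the "url" key are hypothetical.

    # Minimal sketch of first-record join semantics: for each key, keep the
    # record from the earliest input source that supplies it.
    def first_record_join(key, *sources):
        """sources are lists of record dicts; key names the join property."""
        joined = {}
        for source in sources:                  # earlier sources take priority
            for record in source:
                joined.setdefault(record[key], record)
        return list(joined.values())

    # Example with hypothetical record lists keyed by a "url" property:
    # merged = first_record_join("url", current_crawl_records, previous_crawl_records)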
Record manipulators
The pipeline has two record manipulators that are specific to differential crawling, named RemoveUnchanged and RemoveFailed. The FetchandParse record manipulator is not described in this appendix because it is identical to the record manipulator created for a full crawl.