Caveats for differential crawling

Because differential crawls depend heavily on HTTP response headers (such as Content-Length and Last-Modified), it is critical that the server being crawled produce accurate, differential crawl-friendly metadata.
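
For illustration, the sketch below shows the header-driven check at the heart of a differential fetch: a conditional GET that re-downloads a page only when the server's metadata says it has changed. It is a minimal Python sketch using the requests library; the caching scheme and function name are assumptions for the example, not part of any particular crawler.

    import requests

    def fetch_if_changed(url, cached_etag=None, cached_last_modified=None):
        """Issue a conditional GET; download the page only if the server
        reports that it has changed since the previous crawl."""
        headers = {}
        if cached_etag:
            headers["If-None-Match"] = cached_etag
        if cached_last_modified:
            headers["If-Modified-Since"] = cached_last_modified

        response = requests.get(url, headers=headers, timeout=30)

        if response.status_code == 304:
            # Server says the page is unchanged: keep the cached copy.
            return None

        # Page is new or changed: store the body and fresh validators.
        return {
            "content": response.content,
            "etag": response.headers.get("ETag"),
            "last_modified": response.headers.get("Last-Modified"),
        }

A server that reports inaccurate validators defeats this check in both directions: stale validators cause unnecessary downloads, and static ones cause missed changes.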

There are server configurations that are not differential crawl-friendly. One example is a content management system that republishes its pages nightly. Although the relevant content of most pages does not change, the metadata does, and non-critical text within the page (such as the current date) may also be updated. Nightly publishing therefore changes enough information in each document that the differential crawler treats it as new and downloads it every night, and the benefits of the differential crawl are lost.

Another example is a dynamic site whose data changes but whose metadata does not. The site may pull constantly updated information from a database while the server reports unchanged metadata with every response. Although the differential crawl should detect changes from the content alone, it is possible that some changed documents will not be downloaded.
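
Detecting changes from the content alone usually means comparing a checksum of the downloaded body against a stored one. The sketch below shows one minimal way to do that; the function names and choice of hash are assumptions, not a description of the crawler's internals.

    import hashlib

    def content_fingerprint(body: bytes) -> str:
        """Hash the page body so a change can be detected even when the
        server's metadata never changes."""
        return hashlib.sha256(body).hexdigest()

    def has_content_changed(body: bytes, cached_fingerprint) -> bool:
        # No stored fingerprint means the page was never crawled.
        if cached_fingerprint is None:
            return True
        return content_fingerprint(body) != cached_fingerprint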

The best way to determine whether you have one of these server configurations is diligent testing of the crawl results. Such testing can include temporarily adding a record manipulator to each of the full and differential source streams. The manipulator would add a property (named DifferentialStatus) to each record with a value of “Cached” (if the record comes from a previous crawl) or “Fresh” (if the record comes from the current crawl). The property could then be mapped to a dimension, with refinement statistics, showing the number of documents downloaded fresh versus served from the previous crawl's cache.
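
As a sketch only (real record manipulators are configured in your pipeline software, so the function, record shape, and stream handling below are hypothetical), the tagging and counting logic might look like this:

    def tag_differential_status(record, from_cache):
        """Hypothetical manipulator step: annotate each record with how
        it was obtained so the results can be refined later."""
        record["DifferentialStatus"] = "Cached" if from_cache else "Fresh"
        return record

    # Tag a batch of records from the differential stream.
    stream = [
        ({"url": "http://example.com/a"}, True),   # reused from a previous crawl
        ({"url": "http://example.com/b"}, False),  # downloaded in the current crawl
    ]
    tagged = [tag_differential_status(rec, cached) for rec, cached in stream]

    # Counting the values approximates the refinement statistics you would
    # see after mapping DifferentialStatus to a dimension.
    fresh = sum(1 for rec in tagged if rec["DifferentialStatus"] == "Fresh")
    print(f"Fresh: {fresh}, Cached: {len(tagged) - fresh}")

If nearly every record comes back “Fresh” on a differential crawl of a mostly static site, you likely have one of the unfriendly server configurations described above.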