About enabling differential crawling for the spider

Both full-crawl and differential-crawl pipelines must have a spider component. The main configuration difference is that the differential spider has a URL specified in the “Differential crawl URL” field of the Spider editor.

When a URL is specified for this field, instead of performing a full crawl every time the pipeline is run, the spider will only download those documents that have been modified since the last run.

The spider determines which documents have changed by maintaining a state file at the “Differential crawl URL” location. This state file contains the results of the previous pipeline run, which are compared to the results of the current run. The spider fetches each document's header information, such as size and date modified; if these values differ from what is in the state file, the entire document is downloaded. Otherwise, the document is not downloaded.
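The comparison described above can be sketched in Python. This is only an illustration of the idea, not the spider's actual implementation: the JSON state-file layout, the function name changed_urls, and the "size" and "modified" keys are all assumptions made for the example.

```python
import json
from pathlib import Path


def changed_urls(current_headers, state_path):
    """Return the URLs whose header values differ from the saved state.

    current_headers maps each URL to a dict such as
    {"size": 1024, "modified": "Tue, 02 Jan 2024 10:00:00 GMT"}.
    The state-file format used here (plain JSON) is hypothetical.
    """
    path = Path(state_path)
    # On the first run there is no state file, so every document counts as changed.
    previous = json.loads(path.read_text()) if path.exists() else {}
    changed = [
        url for url, meta in current_headers.items()
        if previous.get(url) != meta
    ]
    # Save the current results so the next run can compare against them.
    path.write_text(json.dumps(current_headers))
    return changed
```

Only the URLs returned by this hypothetical function would need a full download; all other documents would keep their header record without a body.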

Note: The crawler supports the HTTP/1.1 specification, including the Cache-Control and Pragma header fields and the If-Modified-Since header field.
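For readers unfamiliar with If-Modified-Since, the following Python sketch shows what a conditional HTTP request looks like in general. It uses only the Python standard library and does not represent the crawler's own code; the function name conditional_request and its parameters are assumptions made for the example.

```python
from email.utils import formatdate
from urllib.request import Request


def conditional_request(url, last_fetch_epoch):
    """Build a GET request that asks the server to skip unchanged documents.

    last_fetch_epoch is the time of the previous fetch in seconds since
    the Unix epoch. A server that honors the header can answer with
    "304 Not Modified" and no body if the document has not changed.
    """
    # HTTP dates use the fixed GMT format produced by formatdate(usegmt=True).
    stamp = formatdate(last_fetch_epoch, usegmt=True)
    return Request(url, headers={"If-Modified-Since": stamp})
```

The request is only constructed here, not sent. Whether a given server returns a 304 response depends on that server's own support for conditional requests.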

As a result, the output of a differential spider run contains a record for every document, but includes the bodies of only those documents that have been modified since the last spider run. If this output is used as the only record source for an MDEX Engine, data will be missing, because unchanged documents have no bodies. The solution is to join previously crawled data with the new data.
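The join can be pictured with a short Python sketch. In a real pipeline the join is configured in the pipeline itself; this example only illustrates the logic, and the record layout (dicts keyed by URL with a "body" field) and the name join_records are assumptions.

```python
def join_records(previous, current):
    """Combine the previous crawl output with a differential crawl output.

    Both arguments map a URL to a record dict. In the current output,
    unchanged documents have no body (None or missing), so their body
    is taken from the previous output instead.
    """
    joined = {}
    for url, record in current.items():
        body = record.get("body")
        if body is None and url in previous:
            # Unchanged document: reuse the body crawled earlier.
            body = previous[url].get("body")
        joined[url] = {**record, "body": body}
    return joined
```

In this sketch, documents that appear only in the previous output are dropped, on the assumption that the current run lists every document that still exists.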