Differential crawls generate the Endeca.Document.IsUnchanged property.
The value of the Endeca.Document.IsUnchanged property indicates whether the crawler considers a document to be unchanged (true) or changed (false).
At the highest level, the crawler tries to conform to the HTTP/1.1 specification when determining whether it considers a document to be changed. In general, the rules are:
- Whether a document is re-fetched is determined by a combination of header fields, including such directives as Expires, Last-Modified, Max-Age, and Cache-Control.
- Metadata for a document is fetched with the document, stored by the spider, and used by the RETRIEVE_URL expression.
- If a document sets the Cache-Control metatag to “must-revalidate”, this metadata will be re-fetched during each crawl. Otherwise, the stored version is used.
- If a document sets the Cache-Control metatag to “no-cache” or “no-store”, or if it sets the Pragma field to “no-cache”, the IsUnchanged property is set to false.
- If the current date is earlier than the date the document was fetched plus the Max-Age (that is, the stored copy has not yet expired), the IsUnchanged property is set to true.
- The Endeca software locally computes the approximate current time on the server from which the document was fetched. If the computed time is earlier than the Expires date, the IsUnchanged property is set to true. Otherwise, the spider sets the IsUnchanged property to false.
- The RETRIEVE_URL expression then compares the Last-Modified date of the remote document with that of the stored revision; if the remote modification date is not after the Last-Modified date on the local document, the IsUnchanged property is set to true.
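The rules above can be sketched as a single decision function. This is an illustration only, not the Endeca implementation: the StoredMetadata record, its field names, and the is_unchanged function are hypothetical. The order of the checks is also an assumption, since the rules do not state one. The sketch assumes the standard HTTP/1.1 reading of Max-Age (a stored copy stays current until the fetch date plus Max-Age has passed) and assumes that must-revalidate bypasses the Max-Age and Expires shortcuts.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class StoredMetadata:
    """Hypothetical record of what the spider stored at the last fetch."""
    fetched_at: datetime
    cache_control: frozenset = frozenset()   # e.g. {"no-cache", "must-revalidate"}
    pragma_no_cache: bool = False
    max_age: Optional[timedelta] = None
    expires: Optional[datetime] = None
    last_modified: Optional[datetime] = None


def is_unchanged(stored, now, remote_last_modified=None):
    """Return a guess at IsUnchanged for one document (assumed rule order)."""
    # no-cache, no-store, or Pragma: no-cache always mark the document as changed.
    if stored.pragma_no_cache or stored.cache_control & {"no-cache", "no-store"}:
        return False

    # Assumption: must-revalidate skips the freshness shortcuts below.
    if "must-revalidate" not in stored.cache_control:
        # Still inside the Max-Age window: treated as unchanged, no server check.
        if stored.max_age is not None and now < stored.fetched_at + stored.max_age:
            return True
        # Expires date not yet reached: treated as unchanged, no server check.
        if stored.expires is not None and now < stored.expires:
            return True

    # Last-Modified comparison: unchanged if the remote date is not newer.
    if remote_last_modified is not None and stored.last_modified is not None:
        return remote_last_modified <= stored.last_modified

    # Nothing indicates the stored copy is current.
    return False
```

For example, with a stored record that was fetched at noon and has a Max-Age of one hour, is_unchanged returns True for a crawl at 12:30 and False for a crawl at 14:00 when no Last-Modified dates are available.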
An important caveat: because of the way these rules are implemented, the crawler can skip re-fetching a document without ever asking the server whether the document has changed. For example, if you manually edit a document while its stored copy is still considered current under the Max-Age or Expires rules, the Endeca.Document.IsUnchanged property may remain set to true.
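A small worked timeline may make the caveat concrete. The function name and all dates below are hypothetical, and the sketch assumes that a crawl sends no request to the server while the stored copy is still inside its Max-Age window.

```python
from datetime import datetime, timedelta, timezone


def skipped_without_server_check(fetched_at, max_age, crawl_time):
    """Return True if the stored copy is still within Max-Age at crawl time.

    Assumption: within this window no request is sent to the server.
    """
    return crawl_time < fetched_at + max_age


# Illustrative values only.
fetched_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
max_age = timedelta(hours=1)
edited_at = fetched_at + timedelta(minutes=10)   # manual edit on the server

crawl_times = [
    fetched_at + timedelta(minutes=30),  # inside the Max-Age window
    fetched_at + timedelta(hours=2),     # after the window has passed
]

# An edit is missed when it happened before the crawl and the crawl was skipped.
results = [
    edited_at <= t and skipped_without_server_check(fetched_at, max_age, t)
    for t in crawl_times
]

for t, missed in zip(crawl_times, results):
    print(t.isoformat(), "edit missed:", missed)
```

Under these assumptions, the crawl 30 minutes after the fetch misses the edit, while the crawl two hours after the fetch falls outside the window and can detect it.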