Sample differential crawl pipeline

This section provides a diagram and an overview of a sample differential crawl pipeline.

The following Pipeline Diagram shows the contents of the pipeline:

The table below provides brief descriptions of the components:

Component Name	Description
PreviousCrawl	Input record adapter for records in the previous crawl.
DifferentialCrawl	Input record adapter for records in the current crawl.
FetchAndParse	Record manipulator that downloads and parses URLs discovered during crawling.
CrawlRefs	Spider component that enqueues and follows URLs discovered during crawling.
RemoveUnchanged	Record manipulator that removes any unchanged records from the differential crawl.
PreviousRecCache	Record cache for records in the previous crawl that will feed the join.
NewRecCache	Record cache for records in the current crawl that will feed the join.
JoinDifferentialAndFull	Record assembler that performs a First Record join between the differential crawl and the previous full crawl.
RemoveFailed	Record manipulator that removes any invalid records.
WriteRawRecords	Output record adapter that saves the raw records produced by the join between the differential crawl and the previous crawl, before property mapping.
MapProps	Property mapper that maps source properties into Endeca properties and dimensions.
WriteOutput	Indexer adapter that prepares output for Dgidx.
Dimensions	Dimension adapter providing the dimension source.
DimensionServer	Dimension server.
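
The record flow through RemoveUnchanged, JoinDifferentialAndFull, and RemoveFailed can be illustrated with a small sketch. This is not Forge code or an Endeca API. It is a simplified Python model that assumes a First Record join keeps, for each join key, the record from the highest-priority source that contains it, with the differential crawl ranked above the previous crawl. The property names `url`, `unchanged`, and `failed` and all function names are hypothetical and chosen only for this illustration.

```python
from typing import Dict, Iterable, List

Record = Dict[str, str]

# Hypothetical property names, used only for this illustration.
KEY = "url"
UNCHANGED = "unchanged"
FAILED = "failed"


def remove_unchanged(records: Iterable[Record]) -> List[Record]:
    """Model of RemoveUnchanged: drop differential records flagged as unchanged."""
    return [r for r in records if r.get(UNCHANGED) != "true"]


def first_record_join(*sources: Iterable[Record]) -> List[Record]:
    """Model of a First Record join.

    Sources are passed from highest to lowest priority. For each key,
    only the record from the first source that contains it is kept.
    """
    joined: Dict[str, Record] = {}
    for source in sources:
        for record in source:
            # setdefault keeps the existing (higher-priority) record, if any.
            joined.setdefault(record[KEY], record)
    return list(joined.values())


def remove_failed(records: Iterable[Record]) -> List[Record]:
    """Model of RemoveFailed: drop records flagged as failed."""
    return [r for r in records if r.get(FAILED) != "true"]


def differential_update(previous: Iterable[Record],
                        differential: Iterable[Record]) -> List[Record]:
    """Chain the three stages in the order the component table lists them."""
    changed = remove_unchanged(differential)
    joined = first_record_join(changed, previous)
    return remove_failed(joined)


if __name__ == "__main__":
    previous = [
        {"url": "a", "body": "old a"},
        {"url": "b", "body": "old b"},
    ]
    differential = [
        {"url": "a", "body": "new a"},        # changed since the last crawl
        {"url": "b", "unchanged": "true"},    # unchanged, so the old record is reused
        {"url": "c", "body": "new c"},        # newly discovered
        {"url": "d", "failed": "true"},       # fetch failed, removed after the join
    ]
    for record in differential_update(previous, differential):
        print(record)
```

In this model, an unchanged URL falls through to its record from the previous crawl because its differential record was removed before the join, while changed and newly discovered URLs take the differential record.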

Note: Although you can have two pipelines in your project (one that performs only full crawls and the other dedicated to differential crawls), it is simpler to have one pipeline that can perform both types of crawls. This is the type of pipeline used in this sample implementation.