Sample differential crawl pipeline

This section includes a diagram and overview of a sample differential crawl pipeline.

The following Pipeline Diagram shows the contents of the pipeline:

The table below provides brief descriptions of the components:

Component Name           Description
-----------------------  -----------
PreviousCrawl            Input record adapter for records in the previous crawl.
DifferentialCrawl        Input record adapter for records in the current crawl.
FetchAndParse            Record manipulator that downloads and parses URLs discovered during crawling.
CrawlRefs                Spider component that enqueues and follows URLs discovered during crawling.
RemoveUnchanged          Record manipulator that removes any unchanged records from the differential crawl.
PreviousRecCache         Record cache for records in the previous crawl that will feed the join.
NewRecCache              Record cache for records in the current crawl that will feed the join.
JoinDifferentialAndFull  Record assembler that performs a First Record join between the differential crawl and the previous full crawl.
RemoveFailed             Record manipulator that removes any invalid records.
WriteRawRecords          Output record adapter that saves the raw records, before property mapping, of the join between the differential crawl and the previous crawl.
MapProps                 Property mapper that maps source properties into Endeca properties and dimensions.
WriteOutput              Indexer adapter that prepares output for Dgidx.
Dimensions               Dimension adapter providing the dimension source.
DimensionServer          Dimension server.

Note: Although you can have two pipelines in your project (one that performs only full crawls and the other dedicated to differential crawls), it is simpler to have one pipeline that can perform both types of crawls. This is the type of pipeline used in this sample implementation.
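
To make the record flow through RemoveUnchanged, JoinDifferentialAndFull, and RemoveFailed easier to picture, the following Python sketch models those three stages on plain dictionaries. It is an illustration only, not Forge code or configuration. The field names url, unchanged, and failed are hypothetical stand-ins for whatever properties the real pipeline uses, and the sketch assumes that the differential crawl is the higher-priority source in the First Record join, so that a re-crawled page replaces its copy from the previous crawl.

```python
from typing import Dict, List

Record = Dict[str, str]


def remove_unchanged(records: List[Record]) -> List[Record]:
    # Models RemoveUnchanged. The "unchanged" flag is a hypothetical field name.
    return [r for r in records if r.get("unchanged") != "true"]


def first_record_join(sources: List[List[Record]], key: str = "url") -> List[Record]:
    # Models a first-record style join: sources are listed highest priority
    # first, and for each join key only the first record seen is kept.
    joined: Dict[str, Record] = {}
    for source in sources:
        for record in source:
            joined.setdefault(record[key], record)
    return list(joined.values())


def remove_failed(records: List[Record]) -> List[Record]:
    # Models RemoveFailed. The "failed" flag is a hypothetical field name.
    return [r for r in records if r.get("failed") != "true"]


def run_differential_update(previous: List[Record],
                            differential: List[Record]) -> List[Record]:
    changed = remove_unchanged(differential)
    # Assumption: the differential crawl has priority over the previous crawl.
    joined = first_record_join([changed, previous])
    return remove_failed(joined)


previous = [
    {"url": "a", "body": "old a"},
    {"url": "b", "body": "old b"},
]
differential = [
    {"url": "a", "body": "new a"},      # page changed since the last crawl
    {"url": "b", "unchanged": "true"},  # page not modified
    {"url": "c", "body": "new c"},      # newly discovered page
    {"url": "d", "failed": "true"},     # fetch failed
]

result = sorted(run_differential_update(previous, differential),
                key=lambda r: r["url"])
for record in result:
    print(record)
```

With this sample data, page a comes from the differential crawl, page b falls back to its copy from the previous crawl, page c is added, and page d is dropped.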