This section includes a diagram and overview of a sample differential crawl pipeline.
The Pipeline Diagram shows the contents of the pipeline, which consists of the following components:
Component Name | Description
---|---
PreviousCrawl | Input record adapter for records in the previous crawl.
DifferentialCrawl | Input record adapter for records in the current crawl.
FetchAndParse | Record manipulator that downloads and parses URLs discovered during crawling.
CrawlRefs | Spider component that enqueues and follows URLs discovered during crawling.
RemoveUnchanged | Record manipulator that removes any unchanged records from the differential crawl.
PreviousRecCache | Record cache for records in the previous crawl that will feed the join.
NewRecCache | Record cache for records in the current crawl that will feed the join.
JoinDifferentialAndFull | Record assembler that performs a First Record join between the differential crawl and the previous full crawl.
RemoveFailed | Record manipulator that removes any invalid records.
WriteRawRecords | Output record adapter that saves the raw records, before property mapping, of the join between the differential crawl and the previous crawl.
MapProps | Property mapper that maps source properties into Endeca properties and dimensions.
WriteOutput | Indexer adapter that prepares output for Dgidx.
Dimensions | Dimension adapter providing the dimension source.
DimensionServer | Dimension server.
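
The record flow through RemoveUnchanged, JoinDifferentialAndFull, and RemoveFailed can be illustrated with a minimal Python sketch. This is not Forge code or an Endeca API. The record keys `url`, `content`, `unchanged`, and `status` are hypothetical stand-ins for whatever properties the crawl actually produces, and the join assumes that a First Record join keeps, for each join key, the record from the first source that supplies one (here, the differential crawl ahead of the previous crawl).

```python
from typing import Dict, Iterable, List

Record = Dict[str, str]


def remove_unchanged(records: Iterable[Record]) -> List[Record]:
    """Drop records flagged as unchanged (hypothetical 'unchanged' key)."""
    return [r for r in records if r.get("unchanged") != "true"]


def first_record_join(sources: List[List[Record]], key: str = "url") -> List[Record]:
    """For each join key, keep the record from the first source that has one."""
    joined: Dict[str, Record] = {}
    for source in sources:  # sources are listed in priority order
        for rec in source:
            joined.setdefault(rec[key], rec)  # earlier sources win
    return list(joined.values())


def remove_failed(records: Iterable[Record]) -> List[Record]:
    """Drop records flagged as failed (hypothetical 'status' key)."""
    return [r for r in records if r.get("status") != "failed"]


def run_differential(previous: List[Record], current: List[Record]) -> List[Record]:
    changed = remove_unchanged(current)
    # Differential records take priority; unchanged URLs fall back to the
    # record saved from the previous crawl.
    joined = first_record_join([changed, previous])
    return remove_failed(joined)


previous_crawl = [
    {"url": "a", "content": "v1"},
    {"url": "b", "content": "v1"},
    {"url": "c", "content": "v1"},
]
differential_crawl = [
    {"url": "a", "unchanged": "true"},     # not modified since last crawl
    {"url": "b", "content": "v2"},         # modified page
    {"url": "c", "status": "failed"},      # fetch failed
    {"url": "d", "content": "v1"},         # newly discovered page
]

result = run_differential(previous_crawl, differential_crawl)
print(sorted(r["url"] for r in result))  # ['a', 'b', 'd']
```

In this sketch, page `a` is carried over from the previous crawl, page `b` is replaced by its updated record, page `c` is dropped because its differential record is invalid, and page `d` is added as a new record.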