About full crawls

This topic provides an overview of full crawls.

A full crawl means that the crawler processes all the pages in the seeds (except for pages that are excluded by filters). As part of the full crawl, a crawl history database containing metadata about the URLs is created in the workspace directory of the crawl.
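
The format of the crawl history database is internal to the Web Crawler and is not documented here. The following sketch only illustrates the kind of per-URL metadata such a database might hold; the file name, table name, and column names are hypothetical, and SQLite is used purely for illustration.

  import os
  import sqlite3

  # Hypothetical layout; the crawler's actual database format is internal.
  os.makedirs("workspace", exist_ok=True)
  conn = sqlite3.connect("workspace/crawl-history.db")
  conn.execute("""
      CREATE TABLE IF NOT EXISTS crawl_history (
          url     TEXT PRIMARY KEY,  -- URL discovered from the seeds
          depth   INTEGER,           -- link depth at which the URL was found
          status  TEXT,              -- 'pending' or 'complete'
          visited TEXT               -- timestamp of the visit, if any
      )
  """)
  conn.commit()
  conn.close()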

The crawl history database provides persistence, so that its history can later be used for resumable crawls. For example, if the user stops a full crawl by pressing Control-C in the command window, the crawler closes the database files before exiting. If the crawl is later resumed (via the -r flag), the resumed crawl begins with the first URL that has a status of pending.
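
The resume mechanism itself is internal to the crawler, but conceptually a resumed crawl only needs the URLs that are still marked pending. A minimal sketch, reusing the hypothetical crawl_history table from the previous example:

  import sqlite3

  def pending_urls(db_path):
      """Yield the queued URLs that a resumed crawl still needs to visit."""
      conn = sqlite3.connect(db_path)
      try:
          for (url,) in conn.execute(
              "SELECT url FROM crawl_history WHERE status = 'pending'"
          ):
              yield url
      finally:
          conn.close()

  # A crawl resumed with -r would pick up at the first pending URL:
  # next(pending_urls("workspace/crawl-history.db"))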

Workflow of a crawl

The Web Crawler handles full crawls as follows (a conceptual sketch of this workflow appears after the list):
  1. The crawler creates the crawl history database. If a previous database exists, it is overwritten.
  2. The depth of the crawl is entered in the database.
  3. From the seeds, the crawler generates a list of URLs to be visited and queues them in the database. Each URL is given a status of pending because it has not yet been visited.
  4. The crawler gets a URL from the queue, visits (and processes) the page, and changes the URL's status in the database to complete.
  5. The crawler repeats step 4 until all the queued URLs are processed.
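
The following sketch ties the steps above together as a single loop. It reuses the hypothetical crawl_history table from the earlier example, and fetch_and_extract_links is a placeholder for the crawler's actual page processing. The sketch also assumes that links discovered while processing a page are queued as pending up to the crawl depth, which is how the queue grows beyond the seeds.

  import sqlite3

  def fetch_and_extract_links(url):
      # Placeholder: fetch the page and return the links found on it.
      return []

  def run_full_crawl(db_path, seed_urls, max_depth):
      conn = sqlite3.connect(db_path)
      # Step 3: queue the URLs generated from the seeds with a status of pending.
      conn.executemany(
          "INSERT OR IGNORE INTO crawl_history (url, depth, status) "
          "VALUES (?, 0, 'pending')",
          [(url,) for url in seed_urls],
      )
      conn.commit()
      # Steps 4-5: take a pending URL, process the page, mark it complete,
      # and repeat until nothing is left in the queue.
      while True:
          row = conn.execute(
              "SELECT url, depth FROM crawl_history "
              "WHERE status = 'pending' LIMIT 1"
          ).fetchone()
          if row is None:
              break  # all queued URLs have been processed
          url, depth = row
          if depth < max_depth:
              for link in fetch_and_extract_links(url):
                  conn.execute(
                      "INSERT OR IGNORE INTO crawl_history (url, depth, status) "
                      "VALUES (?, ?, 'pending')",
                      (link, depth + 1),
                  )
          conn.execute(
              "UPDATE crawl_history SET status = 'complete' WHERE url = ?",
              (url,),
          )
          conn.commit()
      conn.close()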