Specify Content Acquisition for an External Collection

You describe the collection’s contents by specifying how the content acquisition process behaves when it collects content from the host web sites. You specify the web sites by entering one or more starting point URLs. The content acquisition process begins at these specified points, and collects documents according to the content acquisition parameters that you specify, including the following data.

  • The number of successive links from the starting point that the content acquisition process follows.

  • The types of documents to include and exclude.

  • The URLs of any sitemaps and individual documents that you want to explicitly include in the collection.

  • The URLs that the application can use to display collection documents to end users, if they are different from those that the content acquisition process uses to access the documents.

Web Site Specifications for Content Acquisition Process Fields

Field Description

How many URL levels do you want to include?

Specify the number of URL levels (crawl depth) to include in the collection. The default is 10 levels.

For example, a value of 4 specifies that the content acquisition process includes only pages up to four levels deep, counting the starting point page as the first level. Each arrow (->) represents one link:

Starting point page -> level 2 page (linked from starting point) -> level 3 page (linked from level 2) -> level 4 page (linked from level 3)
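The level counting in the example above can be sketched in a few lines of Python. This is only an illustration, not the product's crawler: the function name pages_within_depth, the in-memory link map, and the page names are hypothetical, and the sketch assumes that the starting point page counts as level 1, as the level numbering in the example suggests.

```python
from collections import deque


def pages_within_depth(links, start, max_levels):
    """Return {page: level} for every page reachable within max_levels.

    Assumption: the starting point page counts as level 1.
    `links` maps each page to the pages it links to (hypothetical data).
    """
    levels = {start: 1}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if levels[page] >= max_levels:
            continue  # deepest allowed level reached; do not follow its links
        for target in links.get(page, []):
            if target not in levels:  # keep the first (shallowest) level found
                levels[target] = levels[page] + 1
                queue.append(target)
    return levels


# Hypothetical site: a simple chain of five pages.
site = {
    "start": ["a"],
    "a": ["b"],
    "b": ["c"],
    "c": ["d"],
}
result = pages_within_depth(site, "start", 4)
```

With a limit of 4, the sketch keeps the starting page and the three pages after it, and leaves out page "d", which would sit at level 5.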

Do you have a sitemap.xml or web document url?

If you have sitemaps or web document URLs, select Yes. Add them to the Sitemap URL or Web Document URL field.

If you do not have sitemaps or web document URLs, select No. Specify where to begin acquiring data in the Starting point URLs field, and specify how deep to crawl in the URL levels field.
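For readers unfamiliar with sitemaps, the following Python sketch shows the kind of page list a sitemap.xml file carries. It assumes a file in the standard sitemaps.org format and uses only the Python standard library; the function name sitemap_urls and the example.com URLs are hypothetical, and the sketch does not describe how the content acquisition process itself reads a sitemap.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the page URLs listed in a sitemap document (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap with two pages.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/install.html</loc></url>
  <url><loc>https://example.com/docs/upgrade.html</loc></url>
</urlset>"""

urls = sitemap_urls(sample)
```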

Starting point URLs

Specify one or more top-level URLs for the collection. The content acquisition process starts at each URL, then collects documents and follows links as specified in the collection configuration.

File name patterns to include or exclude

Specify one or more optional accept or reject document patterns. Document patterns are regular expressions that logically define desired document characteristics. Enter each pattern in a separate field. Content acquisition accepts all documents by default; in most cases you do not need to specify explicit document acceptance patterns.
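As a rough illustration of how accept and reject patterns interact, the following Python sketch filters URLs with regular expressions. Several details are assumptions rather than documented product behavior: the function name is_accepted, the use of Python regular expression syntax, matching anywhere in the URL, and reject patterns taking precedence over accept patterns. The only behavior taken from the description above is that every document is accepted when no patterns are given.

```python
import re


def is_accepted(url, accept_patterns=(), reject_patterns=()):
    """Decide whether a URL passes the filters (hypothetical helper).

    Assumptions: a reject match always wins, and when accept patterns
    are given, a URL must match at least one of them.
    """
    if any(re.search(p, url) for p in reject_patterns):
        return False
    if accept_patterns:
        return any(re.search(p, url) for p in accept_patterns)
    return True  # no accept patterns: everything is accepted by default


# Hypothetical patterns: keep HTML and PDF files, drop archives and installers.
accept = [r"\.(html?|pdf)$"]
reject = [r"\.(zip|exe)$"]

keep_pdf = is_accepted("https://example.com/guide.pdf", accept, reject)
keep_exe = is_accepted("https://example.com/setup.exe", accept, reject)
keep_default = is_accepted("https://example.com/notes.txt")
```

In this sketch, keep_pdf is True, keep_exe is False, and keep_default is True because no patterns were supplied.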

Web Document URL

Specify the URLs of individual documents that you want to include in the collection but that the content acquisition process would not otherwise reach through the collection configuration.

Custom Display URL

Specify whether the user interface displays documents to end users using a different URL from the one that the content acquisition process uses to access them.