Describe a Collection

You describe a collection by providing a site map that directs the content processor to the documents you want to include in the collection. You also specify whether the content processor follows or ignores any robots.txt files or robots meta tags that it finds.

Provide a Site Map

You must use a site map file and a site map URL to direct the content processor to the documents that you want to include in the collection. Site maps are structured indexes created specifically for use by search engines. They list the URLs of the documents in your site and include important metadata about each document, such as when it was last updated and how frequently it changes.

You can specify only one site map URL for a collection. However, the primary site map file can reference child site map .xml files. You can use an existing site map, create new site maps with one of the many available tools, or create site map files manually.
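As an illustration of creating site map files manually, the following Python sketch builds one child site map and a primary site map that references it. It assumes the standard sitemaps.org protocol (urlset and sitemapindex elements); the example.com URLs, dates, and function names are hypothetical, and the sketch is not taken from the product itself.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_urlset(entries):
    """Build a child site map from (url, lastmod, changefreq) tuples."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod, changefreq in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod        # when it was last updated
        ET.SubElement(url, "changefreq").text = changefreq  # how frequently it changes
    return urlset


def build_index(child_sitemap_urls):
    """Build a primary site map that points to child site map files."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in child_sitemap_urls:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
    return index


def to_xml(element):
    """Serialize an element to an XML string with a UTF-8 declaration."""
    return ET.tostring(element, encoding="utf-8", xml_declaration=True).decode("utf-8")


if __name__ == "__main__":
    # Hypothetical documents and child site map location.
    child = build_urlset([
        ("https://www.example.com/docs/install.html", "2024-01-15", "monthly"),
        ("https://www.example.com/docs/faq.html", "2024-03-02", "weekly"),
    ])
    primary = build_index(["https://www.example.com/sitemaps/docs.xml"])
    print(to_xml(child))
    print(to_xml(primary))
```

In this arrangement, the URL of the primary file would be the single site map URL that you specify for the collection.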

Consider the following points when you create the site map file:

  • The KM crawler does NOT support the following file extensions (shown as a regular expression): \.jpg$|\.gif$|\.jpeg$|\.js$|\.png$|\.zip$|\.exe$|\.[tjr]ar$|\.tgz$|\.css$|\.tar\.gz$|\.mp[g3e4a]$|\.avi$|\.rm$|\.ram$|\.as[fx]$|\.wm[vazsf]$|\.au$|\.msi$|\.sit$|\.m4a$|\.mov$|\.cab$
  • If the site map URL is defined as "https://<hostname.domain>/xx/xx/xxxx/sitemap.xml" and documents listed in the site map have a different domain, those documents are rejected. To allow URLs with a different domain, define a pattern for them in the 'Include Pattern'.
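The two points above can be checked before you submit a site map. The following Python sketch is only an approximation of the crawler's behavior: it assumes the extension list is meant to be read as a regular expression with plain `$` anchors and `|` separators, that matching is case-sensitive, and that "different domain" means a different host name than the site map URL. The function name and example URLs are hypothetical.

```python
import re
from urllib.parse import urlparse

# Unsupported extensions from the list above, written as a standard
# regular expression (assumed reading of the documented pattern).
UNSUPPORTED_EXTENSIONS = re.compile(
    r"\.jpg$|\.gif$|\.jpeg$|\.js$|\.png$|\.zip$|\.exe$|\.[tjr]ar$|\.tgz$|"
    r"\.css$|\.tar\.gz$|\.mp[g3e4a]$|\.avi$|\.rm$|\.ram$|\.as[fx]$|"
    r"\.wm[vazsf]$|\.au$|\.msi$|\.sit$|\.m4a$|\.mov$|\.cab$"
)


def check_url(sitemap_url, doc_url):
    """Return the reasons a document URL might be rejected (empty list if none)."""
    problems = []
    doc = urlparse(doc_url)
    # Test only the path so that query strings do not hide the extension.
    if UNSUPPORTED_EXTENSIONS.search(doc.path):
        problems.append("unsupported extension")
    # Assumption: the domain check compares host names exactly.
    if doc.hostname != urlparse(sitemap_url).hostname:
        problems.append("different domain")
    return problems


if __name__ == "__main__":
    sitemap = "https://www.example.com/kb/sitemap.xml"
    for url in (
        "https://www.example.com/kb/guide.html",
        "https://www.example.com/kb/logo.png",
        "https://docs.example.org/kb/archive.tar.gz",
    ):
        print(url, check_url(sitemap, url))
```

A URL flagged only as "different domain" could still be accepted if an Include Pattern covers it, so treat that result as a prompt to review your patterns and not as a final verdict.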

Use or Ignore Robot Files and Tags

Robots.txt files and robots meta tags specify which documents and links on a site are available to the content processor. Click Yes to use them or No to ignore them.
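To see what effect a robots.txt file would have, you can evaluate it against your document URLs. The following Python sketch uses the standard library's urllib.robotparser module; the robots.txt content, URLs, and the use_robots flag (standing in for the Yes/No choice) are hypothetical, and the content processor's own rule evaluation may differ. Robots meta tags are not covered here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that blocks one directory for all agents.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_urls(robots_txt, urls, user_agent="*", use_robots=True):
    """Return the URLs that remain available under the given robots.txt rules.

    use_robots=False models choosing No: the rules are ignored entirely.
    """
    if not use_robots:
        return list(urls)
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    urls = [
        "https://www.example.com/docs/install.html",
        "https://www.example.com/private/draft.html",
    ]
    print(allowed_urls(ROBOTS_TXT, urls))                    # rules applied
    print(allowed_urls(ROBOTS_TXT, urls, use_robots=False))  # rules ignored
```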

Exclude or Include Specific Documents

You can exclude or include specific documents by entering one or more regular expression patterns. Enter each pattern in a separate field. The content processor accepts all documents by default, so in most cases you don't need to specify explicit document acceptance patterns.
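Before entering patterns, it can help to try them against a sample of URLs. The following Python sketch is a rough model only: it assumes that an exclude match always wins, that an empty include list accepts everything (consistent with the default described above), and that a pattern may match anywhere in the URL. How the content processor actually combines patterns is not stated here, so verify against the product. The function name, patterns, and URLs are hypothetical.

```python
import re


def select_documents(urls, include_patterns=(), exclude_patterns=()):
    """Filter URLs with include and exclude regular expressions.

    Assumptions: exclude takes precedence over include, and an empty
    include list means every document is accepted.
    """
    includes = [re.compile(p) for p in include_patterns]
    excludes = [re.compile(p) for p in exclude_patterns]
    selected = []
    for url in urls:
        if any(p.search(url) for p in excludes):
            continue  # explicitly excluded
        if includes and not any(p.search(url) for p in includes):
            continue  # include patterns exist, but none matched
        selected.append(url)
    return selected


if __name__ == "__main__":
    urls = [
        "https://www.example.com/docs/install.html",
        "https://www.example.com/docs/archive/old.html",
        "https://www.example.com/blog/news.html",
    ]
    # One pattern per entry, mirroring one pattern per field.
    print(select_documents(urls, exclude_patterns=[r"/archive/"]))
    print(select_documents(urls, include_patterns=[r"/docs/"], exclude_patterns=[r"/archive/"]))
```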