Importing Web Content with Web Content Crawlers

You can create a web content crawler to import content from web sites and RSS feeds.

Before you create a web content crawler, you must:

- Create a content source, if necessary, to access secured content.
- Create the folders in which you want to store the imported content.
- Create and apply any filters to the folders to control the sorting of imported content.
- Create any users and groups to which you want to grant access to the imported content.

To create a web content crawler, you must have the following rights and privileges:

- Access Administration activity right
- Create Content Crawlers activity right
- At least Edit access to the parent folder (the folder that will store the content crawler)
- At least Select access to the content web service on which this content crawler will be based
- At least Select access to the folders in which you want to store the imported content
- At least Select access to the users and groups to which you want to grant access to the imported content
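
The checklist above can be expressed as a small pre-flight check. This is only an illustrative sketch, not a portal API: the `Account` class, the object labels, and the assumed ordering of access levels (Read, then Select, then Edit, then Admin) are hypothetical names introduced here to show how "at least" requirements compare.

```python
from dataclasses import dataclass, field

# Assumed ordering of access levels, lowest to highest (hypothetical model).
ACCESS_ORDER = ["None", "Read", "Select", "Edit", "Admin"]

# Requirements copied from the checklist; the dictionary keys are labels only.
REQUIRED_RIGHTS = {"Access Administration", "Create Content Crawlers"}
REQUIRED_ACCESS = {
    "parent folder": "Edit",
    "content web service": "Select",
    "target folders": "Select",
    "users and groups": "Select",
}


@dataclass
class Account:
    """Hypothetical stand-in for the rights and access held by one user."""
    activity_rights: set = field(default_factory=set)
    access: dict = field(default_factory=dict)  # object label -> access level


def missing_prerequisites(account):
    """Return a list of human-readable gaps; an empty list means ready."""
    problems = []
    for right in sorted(REQUIRED_RIGHTS - account.activity_rights):
        problems.append(f"missing activity right: {right}")
    for obj, needed in REQUIRED_ACCESS.items():
        have = account.access.get(obj, "None")
        if ACCESS_ORDER.index(have) < ACCESS_ORDER.index(needed):
            problems.append(f"{obj}: need at least {needed}, have {have}")
    return problems
```

An account that returns an empty list from `missing_prerequisites` satisfies every item in the checklist under these assumptions.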

1. Click Administration.
2. Open the folder in which you want to store the content crawler.
3. In the Create Object drop-down list, click Content Crawler — WWW. The Choose Content Source dialog box opens.
4. Select the content source that provides access to the content you want to crawl and click OK. The Web Content Crawler Editor opens.
5. Complete the following tasks on the Main Settings page:
   - Defining Where and How Far to Crawl
   - Setting a Target Folder for Imported Content
   - Automatically Approving Imported Content
   - Granting Access to Imported Content
6. Click the Web Page Exclusions page and complete the following task:
   - Avoiding Importing Unwanted Web Content
7. Click the Target Settings page and complete the following task:
   - Specifying a Time-Out Period for a Web Content Crawler
8. Click the Document Settings page and complete the following tasks:
   - Specifying When Imported Documents Expire
   - Specifying Refresh Settings for Imported Links and Property Values
9. Click the Content Type page and complete the following task:
   - Assigning Content Types to Imported Content
10. Click the Advanced Settings page and complete the following tasks:
    - Selecting a Language for Imported Content
    - Specifying What to Do with Rejected Documents
    - Specifying What to Do On Subsequent Crawls
    - Marking Imported Content with a Content Crawler Tag
    - Specifying Maximum Threads Settings
11. Click the Set Job page and complete the following task:
    - Associating an Object with a Job
12. Click the Properties and Names page and complete the following tasks:
    - Naming and Describing an Object

      Alternatively, you can enter a name and description when you save this content crawler.
    - Localizing the Name and Description for an Object (optional)
    - Managing Object Properties (optional)
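
To make the effect of the "how far to crawl" and web page exclusion settings more concrete, the following sketch plans a depth-limited crawl over an in-memory link map. It is a conceptual illustration only and does not describe how the portal's crawler is implemented. The function name `plan_crawl`, the wildcard pattern syntax, the convention that depth 0 means the starting page only, and the choice not to follow links on excluded pages are all assumptions made for this example.

```python
from collections import deque
from fnmatch import fnmatchcase


def plan_crawl(start_url, links, max_depth, exclusions=()):
    """Return the URLs a depth-limited crawl would import, in visit order.

    links      -- dict mapping a URL to the URLs it links to (stands in
                  for fetching and parsing real pages)
    max_depth  -- number of link levels to follow from the start page
                  (assumption: 0 imports the start page only)
    exclusions -- wildcard patterns for pages to skip (assumed syntax)
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    imported = []
    while queue:
        url, depth = queue.popleft()
        if any(fnmatchcase(url, pattern) for pattern in exclusions):
            continue  # assumption: excluded pages are not imported or followed
        imported.append(url)
        if depth == max_depth:
            continue  # depth limit reached; do not follow this page's links
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return imported
```

With a link map in which the start page links to `/a` and `/private/x`, and `/a` links to `/b`, a depth of 1 with the exclusion `*/private/*` would yield only the start page and `/a`. Raising the depth to 2 would add `/b`.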

The default security for this content crawler is based on the security of the parent folder. You can change the security when you save this content crawler (on the Security tab page in the Save As dialog box), or by editing this content crawler (on the Security page of the Content Crawler Editor).

To import content, run the job you associated with this content crawler.

AquaLogic Interaction Administrator Guide