Importing Web Content with Web Content Crawlers
You can create a web content crawler to import content
from web sites and RSS feeds.
Before you create a web content crawler, you must:
- Create a content source, if necessary, to access secured content.
- Create the folders in which you want to store the imported content.
- Create and apply any filters to the folders to control the sorting
of imported content.
- Create any users and groups to which you want to grant access
to the imported content.
To create a web content crawler, you must have the following rights
and privileges:
- Access Administration activity right
- Create Content Crawlers activity right
- At least Edit access to the parent folder (the folder that will
store the content crawler)
- At least Select access to the content web service on which this
content crawler will be based
- At least Select access to the folders in which you want to store
the imported content
- At least Select access to the users and groups to which you want
to grant access to the imported content
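The requirements above can be read as a checklist. The following is a minimal sketch of that checklist in Python, not a portal API: the `UserRights` class, the `missing_requirements` function, and the object names in the usage note are all hypothetical, and the ordering of access levels (Select below Edit) is an assumption drawn from the "at least" wording in the list.

```python
from dataclasses import dataclass, field

# Assumed ordering of access levels, weakest first. Only Select and Edit
# are named in the requirements; any other levels are omitted here.
ACCESS_LEVELS = ["None", "Select", "Edit"]

# Activity rights named in the requirements list.
REQUIRED_ACTIVITY_RIGHTS = {"Access Administration", "Create Content Crawlers"}


@dataclass
class UserRights:
    """Hypothetical record of what a user holds (not a real portal type)."""
    activity_rights: set = field(default_factory=set)
    access: dict = field(default_factory=dict)  # object name -> access level


def has_at_least(actual, required):
    """True if `actual` is the same level as `required` or a stronger one."""
    return ACCESS_LEVELS.index(actual) >= ACCESS_LEVELS.index(required)


def missing_requirements(user, parent_folder, content_web_service,
                         target_folders, principals):
    """Return a list describing each requirement the user does not meet."""
    missing = []
    for right in sorted(REQUIRED_ACTIVITY_RIGHTS - user.activity_rights):
        missing.append(f"activity right: {right}")

    # Edit on the parent folder, Select on everything else.
    required_access = [(parent_folder, "Edit"), (content_web_service, "Select")]
    required_access += [(folder, "Select") for folder in target_folders]
    required_access += [(principal, "Select") for principal in principals]

    for obj, level in required_access:
        if not has_at_least(user.access.get(obj, "None"), level):
            missing.append(f"{level} access to {obj}")
    return missing
```

For example, calling `missing_requirements(user, "Crawlers", "WWW Service", ["News"], ["Everyone"])` returns an empty list only when the user holds both activity rights, Edit access to the hypothetical "Crawlers" folder, and Select access to the other three objects.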
To create a web content crawler:
- Click Administration.
- Open the folder in which you want to store the content crawler.
- In the Create Object drop-down list, click Content Crawler — WWW.
  The Choose Content Source dialog box opens.
- Select the content source that provides access to the content
  you want to crawl and click OK.
  The Web Content Crawler Editor opens.
- Complete the following tasks on the Main Settings page:
  - Defining Where and How Far to Crawl
  - Setting a Target Folder for Imported Content
  - Automatically Approving Imported Content
  - Granting Access to Imported Content
- Click the Web Page Exclusions page and complete the following task:
  - Avoiding Importing Unwanted Web Content
- Click the Target Settings page and complete the following task:
  - Specifying a Time-Out Period for a Web Content Crawler
- Click the Document Settings page and complete the following tasks:
  - Specifying When Imported Documents Expire
  - Specifying Refresh Settings for Imported Links and Property Values
- Click the Content Type page and complete the following task:
  - Assigning Content Types to Imported Content
- Click the Advanced Settings page and complete the following tasks:
  - Selecting a Language for Imported Content
  - Specifying What to Do with Rejected Documents
  - Specifying What to Do On Subsequent Crawls
  - Marking Imported Content with a Content Crawler Tag
  - Specifying Maximum Threads Settings
- Click the Set Job page and complete the following task:
- Click the Properties and Names page and complete the following
  tasks:
The default security for this content crawler is based on the security
of the parent folder. You can change the security when you save this
content crawler (on the Security tab page in the Save As dialog box) or
by editing this content crawler (on the Security page of the Content
Crawler Editor).
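The default-and-override behavior described above can be illustrated with a small sketch. It is not the portal's implementation: the `effective_security` function and the representation of security as a mapping from user or group name to access level are hypothetical. It also assumes that the parent folder's security is copied at creation time and that an explicit change replaces the default outright.

```python
from typing import Dict, Optional

# Hypothetical representation: user or group name -> access level.
Acl = Dict[str, str]


def effective_security(parent_folder_acl: Acl,
                       override: Optional[Acl] = None) -> Acl:
    """Return the security a new content crawler would end up with.

    With no override, the crawler starts from a copy of the parent
    folder's security. An override (set while saving or editing the
    crawler) is assumed to replace that default entirely.
    """
    if override is None:
        return dict(parent_folder_acl)  # copy, so later edits stay local
    return dict(override)
```

Under these assumptions, `effective_security({"Administrators": "Edit"})` yields the parent folder's entries unchanged, while passing a second mapping yields only the entries in that mapping.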
To import content, run the job you associated with this content crawler.