About Importing Content with Content Crawlers
Content crawlers enable you to import content into the portal. Web
content crawlers import content from web sites; remote content crawlers
import content from external content repositories such as a Windows NT
file system, Documentum, Microsoft Exchange, or Lotus Notes.
Crawl Providers
A crawl provider is a piece of software that tells the portal how to
access and interpret the information in an external content repository.
Oracle provides several crawl providers: World Wide Web (WWW) (installed
with the portal software), Windows NT File (included with the portal
software), Documentum, Microsoft Exchange, and Lotus Notes. If your
content resides in a custom system, such as a custom database, you can
import it by writing your own crawl provider using the IDK (a skeleton
sketch follows the notes below).
Note:
- Your portal administrator must install the crawl provider before
you can create the associated content web service. For information
on obtaining crawl providers, refer to the Oracle Technology Network
at http://www.oracle.com/technology/index.html. For information on installing crawl providers, refer to the Installation Guide for Oracle WebCenter Interaction (available
on the Oracle Technology Network at http://www.oracle.com/technology/documentation/bea.html) or the documentation that comes with your crawl provider, or contact
your portal administrator.
- To learn about developing your own crawl provider, refer to the Oracle WebCenter Interaction Web Service Development Guide, which is located on the Oracle Technology Network at http://www.oracle.com/technology/documentation/bea.html.
- For a summary of Oracle WebCenter Interaction crawl providers,
as well as guidelines on best practices for deploying content crawlers,
see the Deployment Guide for Oracle WebCenter Interaction.
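To make the division of labor concrete, the following is a minimal skeleton of the container-and-document pattern a crawl provider implements: the portal walks folder-like containers and asks for the documents inside each one, and the provider translates those calls into repository-specific requests. The interface and method names below are illustrative assumptions, not the IDK's actual API; refer to the Oracle WebCenter Interaction Web Service Development Guide for the real interfaces.

    // Hypothetical interfaces sketching the shape of a crawl provider.
    // None of these names are the IDK's real API.
    import java.util.List;

    /** A folder-like node in the external repository. */
    interface Container {
        List<Container> getChildContainers();
        List<DocumentRef> getChildDocuments();
    }

    /** A document reference carrying the metadata the portal needs to index it. */
    record DocumentRef(String location, String name, String contentType) {}

    /** The entry point the portal would call to drive a crawl. */
    interface CrawlProvider {
        // Settings such as the repository URL and credentials come from the
        // associated content source, not from the crawler itself.
        void initialize(String repositoryUrl, String user, String password);
        // Return the container the content crawler is configured to start from.
        Container attachToContainer(String startLocation);
        // Release repository connections when the crawl job finishes.
        void shutdown();
    }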
Content Web Services
Content web services
enable you to specify general settings for your external content repository,
leaving the target and security settings to be set in the associated
remote content source and remote content crawler. This allows you
to crawl multiple locations of the same content repository without
having to repeatedly specify all the settings.
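As a rough illustration of this layering (the types below are invented for the example, not part of the product), the effective configuration a crawler runs with can be pictured as three nested levels, with only the innermost level varying between crawlers:

    // Invented types illustrating the settings hierarchy described above.
    record WebServiceSettings(String providerUrl, int timeoutSeconds) {}
    record ContentSourceSettings(WebServiceSettings shared, String repositoryUrl,
                                 String user, String password) {}
    record CrawlerSettings(ContentSourceSettings source, String startFolder) {}

    public class LayeringDemo {
        public static void main(String[] args) {
            // The content web service describes the crawl provider once...
            WebServiceSettings ws =
                new WebServiceSettings("http://provider.example.com/crawler", 60);
            // ...the content source adds the target repository and credentials...
            ContentSourceSettings cs = new ContentSourceSettings(
                ws, "\\\\fileserver\\share", "svc-crawl", "secret");
            // ...and each content crawler adds only its own start location.
            CrawlerSettings marketing = new CrawlerSettings(cs, "marketing\\press");
            CrawlerSettings finance   = new CrawlerSettings(cs, "finance\\reports");
            System.out.println(marketing);
            System.out.println(finance);
        }
    }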
Content Sources
Content sources provide access to external content repositories,
enabling users to submit documents and enabling content managers to
create content crawlers that import documents into the Knowledge
Directory. Each content source
is configured to access a particular document repository with specific
authentication. For example, a content source for a secured web site
can be configured to fill out the web form necessary to gain access
to that site. Register a content source for each secured web site
or back-end repository from which content can be imported into your
portal.
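For a secured web site, "filling out the web form" amounts to posting the login form and carrying the resulting session cookie on later requests. The sketch below shows that mechanism in plain Java; the URL and form field names are assumptions for a generic form login, and the content source handles the equivalent for you through its configuration pages.

    // Illustrative form-based login; the content source performs the
    // equivalent internally. The URL and field names (j_username,
    // j_password) are hypothetical.
    import java.net.CookieManager;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class FormLoginSketch {
        public static void main(String[] args) throws Exception {
            // Keep the session cookie the site returns after a successful login.
            HttpClient client = HttpClient.newBuilder()
                    .cookieHandler(new CookieManager())
                    .build();
            String form = "j_username=crawler&j_password=secret";
            HttpRequest login = HttpRequest.newBuilder()
                    .uri(URI.create("https://intranet.example.com/login"))
                    .header("Content-Type", "application/x-www-form-urlencoded")
                    .POST(HttpRequest.BodyPublishers.ofString(form))
                    .build();
            HttpResponse<String> resp =
                client.send(login, HttpResponse.BodyHandlers.ofString());
            System.out.println("Login status: " + resp.statusCode());
            // Requests made through the same client now carry the session
            // cookie, so pages behind the login form can be fetched and crawled.
        }
    }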
Best Practices
- To facilitate maintenance, we recommend you implement several
instances of each content crawler type, configured for limited, specific
purposes.
- For file system content crawlers, you might want to implement
a content crawler that mirrors an entire file system folder hierarchy
by specifying a top-level starting point and its subfolders (a minimal
folder-walk sketch appears after this list). Although the content
in your folder structure is available on your network,
replicating this structure in the portal offers several advantages:
- Users are able to search and access the content over the web.
- Interested users can receive regular updates on new content with
snapshot queries.
- You can use default profiles to direct new users to important
folders.
However, you might find it easier to maintain controlled access,
document updates, or document expiration by creating several content
crawlers that target specific folders.
- If you plan to crawl web locations, familiarize yourself with
the pages you want to import. Often, you can find one or two pages
that contain links to everything of interest. For example, most companies
offer a list of links to their latest press releases, and most web
magazines offer a list of links to their latest articles. When you
configure your content crawler for this source, you can target these
pages and exclude others to improve the efficiency of your crawl jobs
(see the link-collection sketch after this list).
- If you know that certain content will no longer be relevant after
a date—for example, if the content is related to a fiscal year, a
project complete date, or the like—you might want to create a content
crawler specifically for the date-dependent content. When the content
is no longer relevant, you can run a job that removes all content
created by the specific content crawler.
- For remote content crawlers, you might want to limit the target
for mail content crawlers to specific user names; you might want to
limit the target for document content crawlers to specific content
types.
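The folder-walk sketch referenced in the file system best practice above: walking a top-level starting point and its subfolders is what mirroring a folder hierarchy amounts to. The share path is a hypothetical example, and the portal's file crawler performs this traversal for you.

    // Minimal folder walk: list every file under the starting point, keyed
    // by its path relative to that starting point (the relative path is what
    // becomes the portal folder structure when the hierarchy is mirrored).
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.stream.Stream;

    public class FolderMirrorSketch {
        public static void main(String[] args) throws IOException {
            Path start = Path.of("\\\\fileserver\\share\\departments"); // hypothetical share
            try (Stream<Path> tree = Files.walk(start)) {
                tree.filter(Files::isRegularFile)
                    .forEach(doc -> System.out.println(start.relativize(doc)));
            }
        }
    }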
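The link-collection sketch referenced in the web crawl best practice above: starting from a single hub page (a hypothetical press-release index here) and keeping only the links beneath it is the manual equivalent of targeting that page and excluding others. This sketch assumes the open-source jsoup HTML parser; the web content crawler has its own fetcher.

    // Collect links from one hub page and keep only those in the area of
    // interest. Uses jsoup (org.jsoup:jsoup); the URL is hypothetical.
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class HubPageLinksSketch {
        public static void main(String[] args) throws Exception {
            Document hub = Jsoup.connect("https://www.example.com/press").get();
            for (Element link : hub.select("a[href]")) {
                String target = link.absUrl("href");
                // Exclude links outside the press area to keep the crawl small.
                if (target.startsWith("https://www.example.com/press/")) {
                    System.out.println(target);
                }
            }
        }
    }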
For additional considerations and best practices, see the
Deployment Guide for Oracle WebCenter Interaction.