About Importing Content with Content Crawlers
Content crawlers enable you to import content into the portal. Web
content crawlers import content from web sites; remote content crawlers
import content from external content repositories such as a Windows NT
file system, Documentum, Microsoft Exchange, or Lotus Notes.
Crawl Providers
A crawl provider is a piece of software that tells the portal how to
access and interpret the information in an external content repository.
Oracle provides several crawl providers: World Wide Web (WWW) (installed
with the portal software), Windows NT File (included with the portal
software), Documentum, Microsoft Exchange, and Lotus Notes. If your
content resides in a custom system, such as a custom database, you can
import it by writing your own crawl provider using the IDK (a skeleton
sketch follows the notes below).
Note:
- Your portal administrator must install the crawl provider before
you can create the associated content web service. For information
on obtaining crawl providers, refer to the Oracle Technology Network
at http://www.oracle.com/technology/index.html. For information on installing crawl providers, refer to the Installation Guide for Oracle WebCenter Interaction (available
on the Oracle Technology Network at http://www.oracle.com/technology/documentation/bea.html) or the documentation that comes with your crawl provider, or contact
your portal administrator.
- To learn about developing your own crawl provider, refer to the Oracle WebCenter Interaction Web Service Development Guide, which is located on the Oracle Technology Network at http://www.oracle.com/technology/documentation/bea.html.
- For a summary of Oracle WebCenter Interaction crawl providers,
as well as guidelines on best practices for deploying content crawlers,
see the Deployment Guide for Oracle WebCenter Interaction.
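To make the division of labor concrete, the following is a minimal skeleton of the container-and-document pattern a crawl provider implements: the portal walks folder-like containers and asks for the documents inside each one, and the provider translates those calls into repository-specific requests. The interface and method names below are illustrative assumptions, not the IDK's actual API; refer to the Oracle WebCenter Interaction Web Service Development Guide for the real interfaces.

    // Hypothetical interfaces sketching the shape of a crawl provider.
    // None of these names are the IDK's real API.
    import java.util.List;

    /** A folder-like node in the external repository. */
    interface Container {
        List<Container> getChildContainers();
        List<DocumentRef> getChildDocuments();
    }

    /** A document reference carrying the metadata the portal needs to index it. */
    record DocumentRef(String location, String name, String contentType) {}

    /** The entry point the portal would call to drive a crawl. */
    interface CrawlProvider {
        // Settings such as the repository URL and credentials come from the
        // associated content source, not from the crawler itself.
        void initialize(String repositoryUrl, String user, String password);
        // Return the container the content crawler is configured to start from.
        Container attachToContainer(String startLocation);
        // Release repository connections when the crawl job finishes.
        void shutdown();
    }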
Content Web Services
Content web services
enable you to specify general settings for your external content repository,
leaving the target and security settings to be set in the associated
remote content source and remote content crawler. This allows you
to crawl multiple locations of the same content repository without
having to repeatedly specify all the settings.
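As a rough illustration of this layering (the types below are invented for the example, not part of the product), the effective configuration a crawler runs with can be pictured as three nested levels, with only the innermost level varying between crawlers:

    // Invented types illustrating the settings hierarchy described above.
    record WebServiceSettings(String providerUrl, int timeoutSeconds) {}
    record ContentSourceSettings(WebServiceSettings shared, String repositoryUrl,
                                 String user, String password) {}
    record CrawlerSettings(ContentSourceSettings source, String startFolder) {}

    public class LayeringDemo {
        public static void main(String[] args) {
            // The content web service describes the crawl provider once...
            WebServiceSettings ws =
                new WebServiceSettings("http://provider.example.com/crawler", 60);
            // ...the content source adds the target repository and credentials...
            ContentSourceSettings cs = new ContentSourceSettings(
                ws, "\\\\fileserver\\share", "svc-crawl", "secret");
            // ...and each content crawler adds only its own start location.
            CrawlerSettings marketing = new CrawlerSettings(cs, "marketing\\press");
            CrawlerSettings finance   = new CrawlerSettings(cs, "finance\\reports");
            System.out.println(marketing);
            System.out.println(finance);
        }
    }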
Content Sources
Content sources provide access to external content repositories,
enabling users to submit documents and enabling content managers to
create content crawlers that import documents into the Knowledge
Directory. Each content source
is configured to access a particular document repository with specific
authentication. For example, a content source for a secured web site
can be configured to fill out the web form necessary to gain access
to that site. Register a content source for each secured web site
or back-end repository from which content can be imported into your
portal.
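For a secured web site, "filling out the web form" amounts to posting the login form and carrying the resulting session cookie on later requests. The sketch below shows that mechanism in plain Java; the URL and form field names are assumptions for a generic form login, and the content source handles the equivalent for you through its configuration pages.

    // Illustrative form-based login; the content source performs the
    // equivalent internally. The URL and field names (j_username,
    // j_password) are hypothetical.
    import java.net.CookieManager;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class FormLoginSketch {
        public static void main(String[] args) throws Exception {
            // Keep the session cookie the site returns after a successful login.
            HttpClient client = HttpClient.newBuilder()
                    .cookieHandler(new CookieManager())
                    .build();
            String form = "j_username=crawler&j_password=secret";
            HttpRequest login = HttpRequest.newBuilder()
                    .uri(URI.create("https://intranet.example.com/login"))
                    .header("Content-Type", "application/x-www-form-urlencoded")
                    .POST(HttpRequest.BodyPublishers.ofString(form))
                    .build();
            HttpResponse<String> resp =
                client.send(login, HttpResponse.BodyHandlers.ofString());
            System.out.println("Login status: " + resp.statusCode());
            // Requests made through the same client now carry the session
            // cookie, so pages behind the login form can be fetched and crawled.
        }
    }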
Best Practices
- To facilitate maintenance, we recommend you implement several
instances of each content crawler type, configured for limited, specific
purposes.
- For file system content crawlers, you might want to implement
a content crawler that mirrors an entire file system folder hierarchy
by specifying a top-level starting point and its subfolders (a minimal
folder-walk sketch appears after this list). Although the content
in your folder structure is available on your network,
replicating this structure in the portal offers several advantages:
- Users are able to search and access the content over the web.
- Interested users can receive regular updates on new content with
snapshot queries.
- You can use default profiles to direct new users to important
folders.
However, you might find it easier to maintain controlled access,
document updates, or document expiration by creating several content
crawlers that target specific folders.
- If you plan to crawl web locations, familiarize yourself with
the pages you want to import. Often, you can find one or two pages
that contain links to everything of interest. For example, most companies
offer a list of links to their latest press releases, and most web
magazines offer a list of links to their latest articles. When you
configure your content crawler for this source, you can target these
pages and exclude others to improve the efficiency of your crawl jobs
(see the link-collection sketch after this list).
- If you know that certain content will no longer be relevant after
a date—for example, if the content is related to a fiscal year, a
project complete date, or the like—you might want to create a content
crawler specifically for the date-dependent content. When the content
is no longer relevant, you can run a job that removes all content
created by the specific content crawler.
- For remote content crawlers, you might want to limit the target
for mail content crawlers to specific user names; you might want to
limit the target for document content crawlers to specific content
types.
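The folder-walk sketch referenced in the file system best practice above: walking a top-level starting point and its subfolders is what mirroring a folder hierarchy amounts to. The share path is a hypothetical example, and the portal's file crawler performs this traversal for you.

    // Minimal folder walk: list every file under the starting point, keyed
    // by its path relative to that starting point (the relative path is what
    // becomes the portal folder structure when the hierarchy is mirrored).
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.stream.Stream;

    public class FolderMirrorSketch {
        public static void main(String[] args) throws IOException {
            Path start = Path.of("\\\\fileserver\\share\\departments"); // hypothetical share
            try (Stream<Path> tree = Files.walk(start)) {
                tree.filter(Files::isRegularFile)
                    .forEach(doc -> System.out.println(start.relativize(doc)));
            }
        }
    }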
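The link-collection sketch referenced in the web crawl best practice above: starting from a single hub page (a hypothetical press-release index here) and keeping only the links beneath it is the manual equivalent of targeting that page and excluding others. This sketch assumes the open-source jsoup HTML parser; the web content crawler has its own fetcher.

    // Collect links from one hub page and keep only those in the area of
    // interest. Uses jsoup (org.jsoup:jsoup); the URL is hypothetical.
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class HubPageLinksSketch {
        public static void main(String[] args) throws Exception {
            Document hub = Jsoup.connect("https://www.example.com/press").get();
            for (Element link : hub.select("a[href]")) {
                String target = link.absUrl("href");
                // Exclude links outside the press area to keep the crawl small.
                if (target.startsWith("https://www.example.com/press/")) {
                    System.out.println(target);
                }
            }
        }
    }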
For additional considerations and best practices, see the
Deployment Guide for Oracle WebCenter Interaction.