Oracle WebCenter Interaction Web Service Development Guide


About Content Crawlers

Content crawlers are extensible components used to import documents into the portal Directory from a back-end document repository, such as Lotus Notes, Microsoft Exchange, Documentum, or Novell. Portal users can search for and open crawled files on protected back-end systems through the portal without violating access restrictions.

The Oracle WebCenter Interaction Development Kit (IDK) allows you to create remote content crawlers and related configuration pages without parsing SOAP or accessing the portal API; you simply implement four object interfaces to access the back-end repository and retrieve files. UDDI servers are not required.

The purposes of a Content Crawler are twofold:


  1. Iterate over and catalog a hierarchical data repository. Retrieve metadata and index documents in the data repository and include them in the portal Directory and search index. Files are indexed based on metadata and full-text content.
  2. Retrieve individual documents on demand through the portal Directory, enforcing any user-level access restrictions.
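The iterate-and-catalog step (step 1) can be sketched as a depth-first walk over containers and documents. The interfaces and class names below are simplified stand-ins invented for illustration, not the actual IDK interfaces; a real implementation would implement the IDK's crawler interfaces against the back-end repository.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for an IDK document interface (hypothetical name). */
interface Document {
    String getTitle();
}

/** Stand-in for an IDK container interface: one folder in the hierarchy. */
interface Container {
    String getName();
    List<Container> getChildContainers();
    List<Document> getDocuments();
}

/** In-memory folder simulating a node in a back-end repository. */
class Folder implements Container {
    private final String name;
    private final List<Container> children = new ArrayList<>();
    private final List<Document> docs = new ArrayList<>();

    Folder(String name) { this.name = name; }
    Folder addChild(Folder child) { children.add(child); return this; }
    Folder addDoc(String title) { docs.add(() -> title); return this; }

    public String getName() { return name; }
    public List<Container> getChildContainers() { return children; }
    public List<Document> getDocuments() { return docs; }
}

public class CrawlSketch {
    /** Step 1: depth-first iteration that catalogs every document path. */
    static void catalog(Container c, String path, List<String> out) {
        for (Document d : c.getDocuments()) {
            out.add(path + "/" + d.getTitle());
        }
        for (Container child : c.getChildContainers()) {
            catalog(child, path + "/" + child.getName(), out);
        }
    }

    public static void main(String[] args) {
        Folder root = new Folder("root").addDoc("readme.txt");
        root.addChild(new Folder("reports").addDoc("q1.pdf"));

        List<String> index = new ArrayList<>();
        catalog(root, "/root", index);
        index.forEach(System.out::println);
        // Prints: /root/readme.txt and /root/reports/q1.pdf
    }
}
```

In the real IDK the portal drives this iteration itself by calling your interface implementations; the recursion above only illustrates the traversal order and the metadata (here, just a path and title) gathered per document.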

Content Crawlers are run asynchronously by the portal Automation Service. Step 1 is performed by a Content Crawler Job, which can be run on a regular schedule to pick up updated or added files. For each crawled file, the portal creates a Document object and indexes it in the Directory. Each object includes basic file information, security information, and a URL that opens the file from the back-end content repository. (No crawled files are stored on the portal server.) If the content is not contained within a file, or cannot be indexed for another reason, you must implement a servlet or ASPX page that returns an indexable version of the content to the portal.
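Such an indexing page is just an HTTP endpoint the portal indexer can fetch. A minimal sketch, using the JDK's built-in `com.sun.net.httpserver` in place of a servlet container; the `/indexable` path, the `docId` parameter, and the in-memory text store are all assumptions made for illustration:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class IndexContentPage {
    // Hypothetical mapping of document ids to an indexable plain-text rendering;
    // a real implementation would render the back-end object (e.g. a Notes document).
    static final Map<String, String> TEXT =
            Map.of("42", "Quarterly sales figures and commentary ...");

    /** Look up the indexable text for a query string such as "docId=42". */
    static String indexableText(String query) {
        String id = (query == null) ? "" : query.replaceFirst("^docId=", "");
        return TEXT.getOrDefault(id, "");
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/indexable", exchange -> {
            String body = indexableText(exchange.getRequestURI().getQuery());
            byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type",
                    "text/plain; charset=utf-8");
            if (bytes.length == 0) {
                exchange.sendResponseHeaders(404, -1); // unknown id, no body
            } else {
                exchange.sendResponseHeaders(200, bytes.length);
                try (OutputStream os = exchange.getResponseBody()) {
                    os.write(bytes);
                }
            }
        });
        server.start(); // the indexer would fetch /indexable?docId=<id>
    }
}
```

The key design point is that the endpoint returns a plain-text (or otherwise indexable) rendering of the content, not the original binary, so the portal's search index can extract full-text terms from it.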

Step 2 occurs when a user browses the Directory and opens a previously crawled document. After a file is crawled into the portal, users must be able to open it from within the portal by clicking a link; this step is called click-through. If the files are publicly accessible, click-through is simple. In many cases, however, you must provide access to documents that are behind a firewall or otherwise inaccessible from the portal interface.
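One common pattern is for the crawler to store a click-through URL on each Document object that routes the request back through your remote component, which can then enforce user-level access restrictions before streaming the file. The URL shape, the `docId` parameter, and the in-memory ACL below are assumptions for illustration only, not the IDK's actual mechanism:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Set;

public class ClickThroughSketch {
    // Hypothetical ACL: which portal users may open which back-end documents.
    static final Map<String, Set<String>> ACL =
            Map.of("42", Set.of("jsmith", "admin"));

    /** Build the click-through URL stored on the Document object at crawl time. */
    static String clickThroughUrl(String servletBase, String docId) {
        return servletBase + "?docId="
                + URLEncoder.encode(docId, StandardCharsets.UTF_8);
    }

    /** Enforce user-level access restrictions before streaming the file (step 2). */
    static boolean mayOpen(String user, String docId) {
        return ACL.getOrDefault(docId, Set.of()).contains(user);
    }

    public static void main(String[] args) {
        String url = clickThroughUrl(
                "https://crawler.example.com/openDocument", "42");
        System.out.println(url);
        System.out.println(mayOpen("jsmith", "42")); // allowed
        System.out.println(mayOpen("guest", "42"));  // denied
    }
}
```

Because the URL points at your component rather than directly at the back-end file, documents behind a firewall remain reachable to authorized portal users while access checks stay on the system that owns them.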

