Developing Portlets and Integration Web Services: Crawlers and Search Services  

Introduction: Crawlers and Search Services

Crawlers and search services are integral parts of the Plumtree Corporate Portal, providing portal users with access to content from a range of sources, inside and outside the corporate network.

Crawlers

Crawlers are extensible components used to import documents into the portal Knowledge Directory from a specific type of document repository, such as Lotus Notes, Microsoft Exchange, Documentum, or Novell. Portal users can then search for and open crawled files through the portal Knowledge Directory.

The purposes of a crawler are two-fold, as sketched in the code example that follows this list:

  1. Iterate over and catalog a hierarchical data repository, retrieving metadata and indexing documents so that they are included in the portal Knowledge Directory and search index. (Files are indexed based on metadata and full-text content.)

  2. Retrieve individual documents on demand through the portal Knowledge Directory, enforcing any user-level access restrictions.
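
The relationship between these two roles can be summarized in code. The following Java sketch is illustrative only: the interface and method names (Container, DocumentRef, DocumentProvider, and so on) are hypothetical and do not reflect the actual Plumtree IDK contracts.

```java
import java.io.InputStream;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/**
 * Illustrative-only sketch of the two crawler roles. All names here are
 * hypothetical; they are not the Plumtree IDK interfaces.
 */
public class CrawlerSketch {

    /** A folder in the hierarchical source repository. */
    interface Container {
        List<Container> getChildContainers();   // step 1: walk the hierarchy
        List<DocumentRef> getChildDocuments();  // step 1: enumerate documents
    }

    /** A pointer to one document plus the metadata the portal will index. */
    interface DocumentRef {
        String getId();
        Map<String, String> getMetadata();      // e.g. title, author, modified date
        String getClickThroughUrl();            // stored on the portal Document object
    }

    /** Step 2: fetch one document on demand, enforcing source-level security. */
    interface DocumentProvider {
        InputStream openDocument(String id, String userName) throws SecurityException;
    }

    /** Depth-first traversal used during the crawl job (step 1). */
    static void crawl(Container folder, Consumer<DocumentRef> indexCallback) {
        folder.getChildDocuments().forEach(indexCallback);
        for (Container child : folder.getChildContainers()) {
            crawl(child, indexCallback);
        }
    }
}
```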

Crawlers are run asynchronously by the Automation Server; the associated crawler service completes step 1. The Crawler Job can be run on a regular schedule to refresh any updated or added files. The portal creates a Document object for each crawled file and indexes it in the Knowledge Directory. Each object includes basic file information, security information, and a URL that opens the file from the back-end content repository. (No crawled files are stored on the Portal Server.) If the content is not contained in a file or cannot be indexed for another reason, you must implement a servlet or .aspx page that returns indexable files to the portal. For details on handling secured files, see Accessing Secured Content.
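
Where content is not file-based, one common approach is a small servlet that renders each item as text or HTML for the portal's indexer. The sketch below is a minimal example, assuming a hypothetical recordId request parameter and a repository-specific loadRecordAsText helper; it is not part of the Plumtree API.

```java
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

/**
 * Hypothetical servlet that exposes non-file content (here, a record
 * identified by a request parameter) as indexable text for the crawler.
 */
public class IndexableContentServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        String recordId = request.getParameter("recordId"); // assumed parameter name

        // Return plain text (or HTML) so the indexer can extract full-text content.
        response.setContentType("text/plain; charset=UTF-8");
        PrintWriter out = response.getWriter();
        out.println(loadRecordAsText(recordId));
    }

    /** Placeholder for repository-specific logic that renders a record as text. */
    private String loadRecordAsText(String recordId) {
        return "Indexable representation of record " + recordId;
    }
}
```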

Step 2 occurs when a user browses the Knowledge Directory and opens a previously crawled document. After a file is crawled into the portal, users must be able to access it from within the portal by clicking a link; this step is called click-through. If files are publicly accessible, click-through is simple. In many cases, however, you must provide access to documents that are behind a firewall or are otherwise inaccessible from the portal interface. For details on handling secured files, see Accessing Secured Content.
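
A click-through handler typically verifies that the requesting portal user may read the document and then streams it from the back-end repository. The following servlet is a minimal sketch under those assumptions; the docId parameter, the identity source, and the userMayRead and openFromRepository helpers are hypothetical placeholders for repository-specific logic.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

/**
 * Hypothetical click-through servlet: checks access, then streams the
 * requested document from the back-end repository through the portal.
 */
public class ClickThroughServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        String documentId = request.getParameter("docId"); // assumed parameter name
        String portalUser = request.getRemoteUser();        // assumed identity source

        if (!userMayRead(portalUser, documentId)) {
            response.sendError(HttpServletResponse.SC_FORBIDDEN);
            return;
        }

        response.setContentType("application/octet-stream");
        byte[] buffer = new byte[8192];
        try (InputStream in = openFromRepository(documentId)) {
            OutputStream out = response.getOutputStream();
            int bytesRead;
            while ((bytesRead = in.read(buffer)) != -1) {
                out.write(buffer, 0, bytesRead);
            }
        }
    }

    /** Placeholder for the repository-specific access check. */
    private boolean userMayRead(String user, String documentId) {
        return user != null;
    }

    /** Placeholder for repository-specific retrieval of the document bytes. */
    private InputStream openFromRepository(String documentId) {
        return new java.io.ByteArrayInputStream(new byte[0]);
    }
}
```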

The pages that follow provide detailed instructions on creating a custom crawler.

Search Services

You can extend Plumtree search functionality in a number of ways, including adding to the portal search index, implementing Web services to access content in other repositories, customizing the search UI, and adding portal search to remote services.

Next: Developing Custom Crawlers