Crawlers and search services are integral parts of the Plumtree Corporate Portal, providing portal users with access to content from a range of sources, inside and outside the corporate network.
Crawlers access content from an external repository and index it in the portal. Portal users can search for and open crawled files through the portal Knowledge Directory. Crawlers can be used to provide access to files on protected back-end systems without violating access restrictions. In version 5.0 and above, crawlers are implemented as remote Web services.
Search services provide access to external repositories without adding documents to the portal Knowledge Directory. Search services are especially useful for content that is updated frequently or is only accessed by a small number of portal users. In the portal, search services are called Federated Search.
Crawlers are extensible components used to import documents into the portal Knowledge Directory from a specific type of document repository, such as Lotus Notes, Microsoft Exchange, Documentum, or Novell. Portal users can then search for and open crawled files through the portal Knowledge Directory.
The purposes of a crawler are twofold:
1. Iterate over and catalog a hierarchical data repository, retrieving metadata and indexing documents so they can be included in the portal Knowledge Directory and search index. (Files are indexed based on metadata and full-text content.)
2. Retrieve individual documents on demand through the portal Knowledge Directory, enforcing any user-level access restrictions.
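In the EDK, these two responsibilities are typically handled by separate provider roles: one that walks the repository hierarchy and catalogs what it finds, and one that returns individual documents on demand. The Java sketch below illustrates the shape of that contract using hypothetical interface and type names; these are not the actual EDK crawler interfaces, which are introduced on the Developing Custom Crawlers page.

```java
// Illustrative only: hypothetical names, not the actual EDK crawler interfaces.
import java.util.List;

/** Step 1: walk the repository hierarchy and catalog what it contains. */
interface RepositoryWalker {
    List<String> listChildFolders(String folderPath);    // sub-containers to recurse into
    List<String> listChildDocuments(String folderPath);  // documents to catalog
    DocumentRecord describe(String documentPath);        // metadata used to build the Document object
}

/** Step 2: return a single document on demand, enforcing access restrictions. */
interface DocumentFetcher {
    byte[] fetch(String documentPath, String portalUser);
}

/** Minimal metadata the portal needs to build a Document object (card). */
class DocumentRecord {
    String name;
    String contentType;
    String clickThroughUrl;  // stored on the card; opens the file from the back-end repository
}
```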
Crawlers are run asynchronously by the Automation Server; the associated crawler service performs step 1. The Crawler Job can be run on a regular schedule to pick up new or updated files. The portal creates a Document object for each crawled file and indexes it in the Knowledge Directory. Each object includes basic file information, security information, and a URL that opens the file from the back-end content repository. (No crawled files are stored on the Portal Server.) If the content is not contained in a file, or cannot be indexed for another reason, you must implement a servlet or .aspx page that returns an indexable file to the portal. For details on handling secured files, see Accessing Secured Content.
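For content that is not already file-based (a Lotus Notes document or a database record, for example), the URL stored on the crawled card can point at a servlet or .aspx page that renders the item as a file the index service can parse. The Java sketch below is a minimal illustration; the docId parameter and the renderAsText helper are assumptions standing in for your repository-specific retrieval code.

```java
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

/**
 * Returns an indexable rendition of a non-file item so the portal's index
 * service can ingest it. The docId parameter and the renderAsText helper are
 * hypothetical stand-ins for repository-specific retrieval code.
 */
public class IndexableContentServlet extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        String docId = req.getParameter("docId");          // identifier embedded in the crawled URL
        String text = renderAsText(docId);

        resp.setContentType("text/plain; charset=UTF-8");  // a format the index service can parse
        resp.getWriter().print(text);
    }

    /** Hypothetical: fetch the item from the back-end repository and flatten it to plain text. */
    private String renderAsText(String docId) {
        return "Replace with repository-specific retrieval for item " + docId;
    }
}
```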
Step 2 occurs when a user browses the Knowledge Directory and opens a previously crawled document. After a file is crawled into the portal, users must be able to open it from within the portal by clicking a link; this step is called click-through. If files are publicly accessible, click-through is simple. In many cases, however, you must provide access to documents that are behind a firewall or are otherwise inaccessible from the portal interface. For details on handling secured files, see Accessing Secured Content.
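When the repository is secured, the click-through URL stored on each card typically points at a servlet or .aspx page that re-checks the requesting user's rights before streaming the file. The sketch below is illustrative only: how the portal user is identified (a forwarded header here) and the canRead and readFile helpers are assumptions about your back end; the supported patterns are described in Accessing Secured Content.

```java
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

/** Click-through handler for secured content; the back-end calls are hypothetical stubs. */
public class ClickThroughServlet extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        String docId = req.getParameter("docId");
        String user = req.getHeader("X-Portal-User");     // assumption: user identity forwarded by the portal

        if (user == null || !canRead(user, docId)) {      // re-check rights before serving the file
            resp.sendError(HttpServletResponse.SC_FORBIDDEN);
            return;
        }
        byte[] bytes = readFile(docId);
        resp.setContentType("application/octet-stream");  // use the file's real content type if known
        resp.setContentLength(bytes.length);
        resp.getOutputStream().write(bytes);
    }

    /** Hypothetical access check against the back-end repository's ACLs. */
    private boolean canRead(String user, String docId) { return false; }

    /** Hypothetical retrieval of the file bytes from the back-end repository. */
    private byte[] readFile(String docId) { return new byte[0]; }
}
```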
The following pages provide detailed instructions on creating a custom crawler:
Developing Custom Crawlers: The EDK provides object interfaces to implement custom crawlers. This page introduces the EDK's crawler interfaces and lists useful warnings and best practices.
Accessing Secured Content: The crawl is just the first step. This page explains how crawlers can provide access to secured files that have been indexed in the portal.
Deploying Custom Crawlers: After coding your crawler, you must deploy your code. This page provides instructions for Java and .NET.
Configuring Custom Crawlers: Implementing a successful crawler in the portal requires specific configuration. This page explains the necessary portal objects and provides information on key settings.
Logging and Troubleshooting: Logging is a key component of any successful crawl. This page lists logging options and includes a basic FAQ on crawler development.
Crawler Testing Checklist: This checklist summarizes key tests that should be performed on every crawler.
You can extend Plumtree search functionality in a number of ways, including adding to the portal search index, implementing Web services to access content in other repositories, customizing the search UI, and adding portal search to remote services.
As noted above, search services (called Federated Search in the portal) provide access to external repositories without adding documents to the portal Knowledge Directory, and are especially useful for content that is updated frequently or is accessed by only a small number of portal users. When the portal accesses a search service, the remote service queries the content repository and returns information about each matching file, including a URL that opens the file from the back-end content repository; this information is displayed to users in search results. (A minimal sketch of this exchange follows the list below.) The following pages provide detailed instructions on creating a custom search Web service:
Developing Custom Search Services: The EDK provides object interfaces to implement custom search services. This page introduces the EDK's search interfaces and lists useful warnings and best practices.
Deploying Custom Search Services: After coding your search service, you must deploy your code. This page provides instructions for Java and .NET.
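Conceptually, a search service receives the user's query from the portal and returns a set of result records, each with enough information to display a search hit, including the URL that opens the file from the back-end repository. The Java sketch below shows the shape of that exchange using hypothetical types; the actual EDK search interfaces are introduced on the Developing Custom Search Services page.

```java
import java.util.ArrayList;
import java.util.List;

/** One row in federated search results; the field names are illustrative. */
class FederatedResult {
    String title;
    String summary;
    String openUrl;  // opens the file from the back-end content repository
}

/** Hypothetical shape of a search service: a query goes in, result records come out. */
class ExampleSearchService {
    List<FederatedResult> search(String queryText, int maxResults) {
        List<FederatedResult> results = new ArrayList<FederatedResult>();
        // Query the back-end repository here (API call, SQL, etc.) and map each
        // hit to a FederatedResult; the portal displays these in search results.
        return results;
    }
}
```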
The Plumtree Search Framework allows you to customize Plumtree Search in a number of ways. This section summarizes recommended approaches to common customizations, describes their capabilities and limitations, and points to more complete documentation and sample code.
The Remote Search API (com.plumtree.remote.prc.search) provides a generic interface to search operations in the Plumtree portal. Using the PRC search API, you can query document (card), folder, user, and Community objects using a standard request-response model. The API allows you to add multiple constraints and filter searches by location or object type.

The portal Knowledge Directory displays links to documents in a hierarchical structure of folders and subfolders. These documents can be external or internal Web pages, Office documents, or essentially any file of interest. PRC Knowledge Directory operations allow you to query for documents and document properties, create new documents, and edit the properties of existing documents.
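The Java sketch below illustrates the request-response pattern described above: build a request, add constraints and filters, execute it, and iterate over the results. The interface and method names are placeholders, not the actual com.plumtree.remote.prc.search signatures; consult the PRC API documentation for the real classes.

```java
import java.util.List;

/**
 * Placeholder types sketching the request-response flow of a portal search;
 * they are NOT the actual com.plumtree.remote.prc.search interfaces.
 */
interface PortalSearchRequest {
    void setQuery(String text);
    void addObjectTypeFilter(String objectType);  // e.g. "document", "folder", "user", "community"
    void addFolderFilter(int folderId);           // constrain the search to a Knowledge Directory location
    void setMaxResults(int max);
    PortalSearchResponse execute();
}

interface PortalSearchResponse {
    List<PortalSearchHit> getResults();
}

interface PortalSearchHit {
    String getName();
    String getOpenUrl();
}

class SearchExample {
    /** Runs a constrained document search and prints each hit. */
    static void runSearch(PortalSearchRequest request) {
        request.setQuery("quarterly report");
        request.addObjectTypeFilter("document");  // documents are "cards" in the portal
        request.addFolderFilter(201);             // hypothetical Knowledge Directory folder ID
        request.setMaxResults(20);

        PortalSearchResponse response = request.execute();
        for (PortalSearchHit hit : response.getResults()) {
            System.out.println(hit.getName() + " -> " + hit.getOpenUrl());
        }
    }
}
```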
Next: Developing Custom Crawlers