Content Crawlers are extensible components used to import documents into the portal Knowledge Directory from back-end document repositories such as Lotus Notes, Microsoft Exchange, Documentum, and Novell. Portal users can search for and open crawled files on protected back-end systems through the portal without violating access restrictions.
The IDK allows you to create remote content crawlers and related configuration pages without parsing SOAP or accessing the portal API; you simply implement four object interfaces to access the back-end repository and retrieve files. UDDI servers are not required.
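The exact IDK interface signatures are not reproduced here; the self-contained Java sketch below only illustrates the division of labor those four roles typically cover: a provider that attaches to containers (folders), containers that enumerate their children, a provider that attaches to documents, and documents that expose metadata and a click-through URL. All interface and method names in the sketch are hypothetical stand-ins, not the IDK API.

    // Hypothetical sketch of the four crawler roles. Names and signatures are
    // illustrative stand-ins only; they are not the IDK's actual interfaces.
    import java.util.ArrayList;
    import java.util.List;

    interface ContainerProvider {            // attaches to folders in the repository
        Container attachToContainer(String location) throws Exception;
    }

    interface Container {                    // one folder: enumerates its children
        List<String> getChildContainerLocations() throws Exception;
        List<String> getChildDocumentLocations() throws Exception;
    }

    interface DocumentProvider {             // attaches to individual documents
        Document attachToDocument(String location) throws Exception;
    }

    interface Document {                     // one document: metadata and click-through URL
        String getName() throws Exception;
        String getClickThroughURL() throws Exception;
    }

    // Minimal file-system-backed example of the container role.
    class FileSystemContainer implements Container {
        private final java.io.File directory;

        FileSystemContainer(String path) {
            this.directory = new java.io.File(path);
        }

        public List<String> getChildContainerLocations() {
            return list(true);
        }

        public List<String> getChildDocumentLocations() {
            return list(false);
        }

        private List<String> list(boolean directories) {
            List<String> locations = new ArrayList<>();
            java.io.File[] children = directory.listFiles();
            if (children == null) {
                return locations;
            }
            for (java.io.File child : children) {
                if (child.isDirectory() == directories) {
                    locations.add(child.getAbsolutePath());
                }
            }
            return locations;
        }
    }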
The purposes of a Content Crawler are two-fold:
1. Crawl documents from the back-end repository into the portal Knowledge Directory so they can be indexed and searched.
2. Retrieve individual documents on demand when a user clicks a crawled document link (click-through).
Content Crawlers are run asynchronously by the ALI Automation Service; the associated Content Crawler completes step 1, and its Content Crawler Job can be run on a regular schedule to refresh updated or added files. The portal creates a Document object for each crawled file and indexes it in the Knowledge Directory. Each object includes basic file information, security information, and a URL that opens the file from the back-end content repository. (No crawled files are stored on the portal server.) If the content is not contained within a file, or cannot be indexed for another reason, you must implement a servlet/aspx page that returns indexable files to the portal.
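For instance, if the crawled content is a set of database records rather than files, the remote server can expose a page that renders each record as HTML for the portal to index. The servlet below is only a sketch under assumptions: the recordId parameter, the lookupRecord helper, and the rendered markup are hypothetical, not part of the IDK.

    // Hypothetical indexing servlet: renders a non-file record as HTML so the
    // portal can index it. The parameter name and record lookup are illustrative.
    import java.io.IOException;
    import java.io.PrintWriter;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class IndexContentServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest request, HttpServletResponse response)
                throws IOException {
            String recordId = request.getParameter("recordId");  // assumed parameter
            String body = lookupRecord(recordId);                 // assumed back-end lookup
            if (body == null) {
                response.sendError(HttpServletResponse.SC_NOT_FOUND);
                return;
            }
            response.setContentType("text/html;charset=UTF-8");
            PrintWriter out = response.getWriter();
            out.println("<html><body>" + body + "</body></html>");
        }

        // Stand-in for a query against the back-end repository.
        private String lookupRecord(String recordId) {
            return (recordId == null) ? null : "Contents of record " + recordId;
        }
    }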
Step 2 occurs when a user browses the Knowledge Directory and opens a previously crawled document. After a file is crawled into the portal, users must be able to access it from within the portal by clicking a link; this step is called click-through. If files are publicly accessible, click-through is simple, but in many cases you must provide access to documents that are behind a firewall or are otherwise inaccessible from the portal interface.
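A common way to provide that access is a click-through page on the remote server that fetches the requested file from the protected back-end and streams it to the browser. The sketch below assumes a docPath parameter and a fixed repository root, and the authorization check is only a placeholder; enforcing the back-end's real access restrictions (and guarding against path traversal) is left to the actual implementation.

    // Hypothetical click-through servlet: streams a file from the protected
    // back-end to the user's browser. The parameter, repository root, and
    // authorization check are placeholders, not a production implementation.
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class ClickThroughServlet extends HttpServlet {
        private static final File REPOSITORY_ROOT = new File("/protected/repository"); // assumed

        @Override
        protected void doGet(HttpServletRequest request, HttpServletResponse response)
                throws IOException {
            String docPath = request.getParameter("docPath");  // assumed parameter
            if (docPath == null) {
                response.sendError(HttpServletResponse.SC_BAD_REQUEST);
                return;
            }
            File file = new File(REPOSITORY_ROOT, docPath);

            // Placeholder check: confirm the requesting user may see this document
            // so the back-end's access restrictions are preserved.
            if (!file.isFile() || !isAuthorized(request, docPath)) {
                response.sendError(HttpServletResponse.SC_FORBIDDEN);
                return;
            }

            String mimeType = getServletContext().getMimeType(file.getName());
            response.setContentType(mimeType != null ? mimeType : "application/octet-stream");
            response.setContentLength((int) file.length());

            try (InputStream in = new FileInputStream(file);
                 OutputStream out = response.getOutputStream()) {
                byte[] buffer = new byte[8192];
                int bytesRead;
                while ((bytesRead = in.read(buffer)) != -1) {
                    out.write(buffer, 0, bytesRead);
                }
            }
        }

        // Stand-in for a real authorization check against the back-end system.
        private boolean isAuthorized(HttpServletRequest request, String docPath) {
            return request.getRemoteUser() != null;
        }
    }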