Content Crawlers are extensible components used to import documents into the portal Knowledge Directory from back-end document repositories such as Lotus Notes, Microsoft Exchange, Documentum, and Novell. Portal users can search for and open crawled files on protected back-end systems through the portal without violating access restrictions.
The IDK allows you to create remote content crawlers and related configuration pages without parsing SOAP or accessing the portal API; you simply implement four object interfaces to access the back-end repository and retrieve files. UDDI servers are not required.
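The exact IDK interface signatures are not reproduced here; the self-contained Java sketch below only illustrates the division of labor those four roles typically cover: a provider that attaches to containers (folders), containers that enumerate their children, a provider that attaches to documents, and documents that expose metadata and a click-through URL. All interface and method names in the sketch are hypothetical stand-ins, not the IDK API.

    // Hypothetical sketch of the four crawler roles. Names and signatures are
    // illustrative stand-ins only; they are not the IDK's actual interfaces.
    import java.util.ArrayList;
    import java.util.List;

    interface ContainerProvider {            // attaches to folders in the repository
        Container attachToContainer(String location) throws Exception;
    }

    interface Container {                    // one folder: enumerates its children
        List<String> getChildContainerLocations() throws Exception;
        List<String> getChildDocumentLocations() throws Exception;
    }

    interface DocumentProvider {             // attaches to individual documents
        Document attachToDocument(String location) throws Exception;
    }

    interface Document {                     // one document: metadata and click-through URL
        String getName() throws Exception;
        String getClickThroughURL() throws Exception;
    }

    // Minimal file-system-backed example of the container role.
    class FileSystemContainer implements Container {
        private final java.io.File directory;

        FileSystemContainer(String path) {
            this.directory = new java.io.File(path);
        }

        public List<String> getChildContainerLocations() {
            return list(true);
        }

        public List<String> getChildDocumentLocations() {
            return list(false);
        }

        private List<String> list(boolean directories) {
            List<String> locations = new ArrayList<>();
            java.io.File[] children = directory.listFiles();
            if (children == null) {
                return locations;
            }
            for (java.io.File child : children) {
                if (child.isDirectory() == directories) {
                    locations.add(child.getAbsolutePath());
                }
            }
            return locations;
        }
    }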
The purposes of a Content Crawler are two-fold:
1. Crawl documents from the back-end repository into the portal Knowledge Directory so they can be indexed and searched.
2. Retrieve individual documents on demand when a user clicks a crawled document link (click-through).
Content Crawlers are run asynchronously by the ALI Automation Service; the associated Content Crawler completes step 1, and its Content Crawler Job can be run on a regular schedule to refresh updated or added files. The portal creates a Document object for each crawled file and indexes it in the Knowledge Directory. Each object includes basic file information, security information, and a URL that opens the file from the back-end content repository. (No crawled files are stored on the portal server.) If the content is not contained within a file, or cannot be indexed for another reason, you must implement a servlet/aspx page that returns indexable files to the portal.
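For instance, if the crawled content is a set of database records rather than files, the remote server can expose a page that renders each record as HTML for the portal to index. The servlet below is only a sketch under assumptions: the recordId parameter, the lookupRecord helper, and the rendered markup are hypothetical, not part of the IDK.

    // Hypothetical indexing servlet: renders a non-file record as HTML so the
    // portal can index it. The parameter name and record lookup are illustrative.
    import java.io.IOException;
    import java.io.PrintWriter;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class IndexContentServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest request, HttpServletResponse response)
                throws IOException {
            String recordId = request.getParameter("recordId");  // assumed parameter
            String body = lookupRecord(recordId);                 // assumed back-end lookup
            if (body == null) {
                response.sendError(HttpServletResponse.SC_NOT_FOUND);
                return;
            }
            response.setContentType("text/html;charset=UTF-8");
            PrintWriter out = response.getWriter();
            out.println("<html><body>" + body + "</body></html>");
        }

        // Stand-in for a query against the back-end repository.
        private String lookupRecord(String recordId) {
            return (recordId == null) ? null : "Contents of record " + recordId;
        }
    }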
Step 2 occurs when a user browses the Knowledge Directory and opens a previously crawled document. After a file is crawled into the portal, users must be able to access it from within the portal by clicking a link; this step is called click-through. If files are publicly accessible, click-through is simple, but in many cases you must provide access to documents that are behind a firewall or are otherwise inaccessible from the portal interface.
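A common way to provide that access is a click-through page on the remote server that fetches the requested file from the protected back-end and streams it to the browser. The sketch below assumes a docPath parameter and a fixed repository root, and the authorization check is only a placeholder; enforcing the back-end's real access restrictions (and guarding against path traversal) is left to the actual implementation.

    // Hypothetical click-through servlet: streams a file from the protected
    // back-end to the user's browser. The parameter, repository root, and
    // authorization check are placeholders, not a production implementation.
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class ClickThroughServlet extends HttpServlet {
        private static final File REPOSITORY_ROOT = new File("/protected/repository"); // assumed

        @Override
        protected void doGet(HttpServletRequest request, HttpServletResponse response)
                throws IOException {
            String docPath = request.getParameter("docPath");  // assumed parameter
            if (docPath == null) {
                response.sendError(HttpServletResponse.SC_BAD_REQUEST);
                return;
            }
            File file = new File(REPOSITORY_ROOT, docPath);

            // Placeholder check: confirm the requesting user may see this document
            // so the back-end's access restrictions are preserved.
            if (!file.isFile() || !isAuthorized(request, docPath)) {
                response.sendError(HttpServletResponse.SC_FORBIDDEN);
                return;
            }

            String mimeType = getServletContext().getMimeType(file.getName());
            response.setContentType(mimeType != null ? mimeType : "application/octet-stream");
            response.setContentLength((int) file.length());

            try (InputStream in = new FileInputStream(file);
                 OutputStream out = response.getOutputStream()) {
                byte[] buffer = new byte[8192];
                int bytesRead;
                while ((bytesRead = in.read(buffer)) != -1) {
                    out.write(buffer, 0, bytesRead);
                }
            }
        }

        // Stand-in for a real authorization check against the back-end system.
        private boolean isAuthorized(HttpServletRequest request, String docPath) {
            return request.getRemoteUser() != null;
        }
    }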