IDK Interfaces for Content Crawler Development

The IDK plumtree.remote.crawler package/namespace includes four interfaces to support content crawler development: IContainerProvider, IContainer, IDocumentProvider and IDocument.

When the ALI Automation Service initiates a crawl, it issues a SOAP request to return a list of folders. It iterates over the list of folders and retrieves lists of documents with metadata. In general, the portal calls IDK interfaces in the following order. See the definitions that follow for more information.

IContainerProvider.Initialize once per thread. Use DataSourceInfo and CrawlerInfo to initialize the Container Provider (make a connection to the back-end system and create a new session). Note: This is not a true HTTP session, and sessions can get dropped. Keep a variable that can be used to ensure the session is still initialized; if it is not, throw NotInitializedException. Store the Content Source in a member variable in Initialize. Do not use direct access to the member variable; instead use a method that checks if it is null and throws a NotInitializedException.
IContainerProvider.AttachToContainer, using the starting location in the key CrawlerConstants.TAG_PATH. The key should be populated using a Service Configuration page in the Content Crawler editor. The string in TAG_PATH is service-specific; a file Content Crawler could use the UNC path to a folder, while a database Content Crawler could use the full name of a table. The following methods are not called in any specific order.
- IContainer.GetUsers and IContainer.GetGroups on that container as required. (IContainer.GetMetaData is deprecated.)
- IContainer.GetChildContainers up to the number specified in CrawlerConstants.TAG_DEPTH. (This key must be set via a Service Configuration page.)
- IContainerProvider.AttachToContainer for each ChildContainer returned.
- IContainer.GetChildDocuments, then IDocumentProvider.AttachToDocument for each ChildDocument returned.
IContainerProvider.Shutdown (this call is optional and could be blocked by exceptions or network failure).
IDocumentProvider.Initialize once per thread. Note: Sessions can get dropped. Keep a variable that can be used to ensure the session is still initialized; if it is not, throw NotInitializedException.
IDocumentProvider.AttachToDocument for each ChildDocument, then IDocument.GetDocumentSignature to see if the document has changed. If the document is new or has been modified, the following methods are called (not in any specific order).
- IDocument.GetUsers and IDocument.GetGroups on that document as required.
- IDocument.GetMetaData to get the file name, description, content type, URL, etc.
- IDocument.GetDocument to index the document (only if DocFetch is used).
IDocumentProvider.Shutdown (this call is optional and could be blocked by exceptions or network failure).
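
The session guard recommended for Initialize can be sketched as follows. This is a minimal illustration, not the IDK implementation: ContentSource, GuardedProvider, and the local NotInitializedException are hypothetical stand-ins, declared here only so the example compiles on its own (the stand-in exception is unchecked for brevity).

```java
// Hypothetical stand-in for the IDK exception of the same name.
class NotInitializedException extends RuntimeException {
    NotInitializedException(String message) { super(message); }
}

// Hypothetical holder for the settings read from DataSourceInfo in Initialize.
class ContentSource {
    final String rootPath;
    ContentSource(String rootPath) { this.rootPath = rootPath; }
}

class GuardedProvider {
    // Null until initialize() runs, and null again once the session is dropped.
    private ContentSource contentSource;

    void initialize(String rootPath) {
        contentSource = new ContentSource(rootPath);
    }

    // Every other method goes through this accessor instead of reading the field.
    private ContentSource getContentSource() {
        if (contentSource == null) {
            throw new NotInitializedException("Provider has no active session");
        }
        return contentSource;
    }

    String attachToContainer(String location) {
        return getContentSource().rootPath + "/" + location;
    }

    void shutdown() {
        contentSource = null;
    }
}
```

Because the null check lives in one accessor, a dropped session surfaces as NotInitializedException on the next call rather than as a NullPointerException somewhere deeper in the crawler code.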

The sections below provide helpful information on the interfaces used to implement a Content Crawler. For a complete listing of interfaces, classes, and methods, see the IDK API documentation.

IContainerProvider

The IContainerProvider interface allows the portal to iterate over a back-end directory structure. The portal calls IContainerProvider first in most cases. This interface provides the following methods:

Initialize allows the remote server to initialize a session and create a connection to the back-end document repository. The IDK passes in a DataSourceInfo object that contains the necessary settings associated with a Content Source object (the name of a directory in the repository and the credentials of a system user). The CrawlerInfo object contains the settings for the associated Content Crawler object in the portal. The start location of the crawl is the value stored in the key CrawlerConstants.TAG_PATH, set using a Service Configuration page.
AttachToContainer is always the next call after Initialize; the order of the remaining calls is not defined. It associates the session with the container specified in the sContainerLocation parameter; subsequent calls refer to this container until the next AttachToContainer call. The value in the sContainerLocation parameter will be the CrawlerConstants.TAG_PATH key for the initial attach, and the value specified in ChildContainer.GetLocation for subsequent attaches. Each time AttachToContainer is called, discard any state created during the previous AttachToContainer call. If multiple translations of the container are available, select the most appropriate using the Locale parameter, which can be sent as a full locale (e.g., "en-us") or in the abbreviated language-only format (e.g., "en"). Note: If the container specified does not exist, you must throw a new NoLongerExistsException to avoid an infinite loop. If the Content Crawler is configured to delete missing files, all files in the container will be removed from the portal index.
Shutdown allows the portal to clean up any unused sessions that have not yet expired. Content Crawlers are implemented on top of standard cookie-based session mechanisms, so sessions expire and resources and connections are released after an inactivity period, typically around 20 minutes. As a performance optimization, the portal might send a Shutdown message notifying the remote server to end the session immediately. No parameters are received and none are returned. Do not assume that Shutdown will be called; the call could be blocked by an exception or network failure. Remote servers must terminate sessions after an inactivity timeout but can choose to ignore the Shutdown message and keep the session alive until it times out.
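
Because the Locale parameter can arrive either as a full locale ("en-us") or as a language only ("en"), the translation lookup in AttachToContainer needs a fallback order. The sketch below shows one possible approach, assuming the available translations are known as a list of locale strings. LocaleMatcher and its method names are hypothetical and not part of the IDK.

```java
import java.util.List;
import java.util.Locale;

class LocaleMatcher {
    // Returns the best available translation for the requested locale:
    // an exact match first, then any entry with the same language, then the fallback.
    static String pick(String requested, List<String> available, String fallback) {
        if (requested == null || requested.isEmpty()) {
            return fallback;
        }
        String wanted = requested.toLowerCase(Locale.ROOT).replace('_', '-');
        for (String candidate : available) {
            if (candidate.equalsIgnoreCase(wanted)) {
                return candidate;
            }
        }
        String language = languageOf(wanted);
        for (String candidate : available) {
            if (languageOf(candidate.toLowerCase(Locale.ROOT)).equals(language)) {
                return candidate;
            }
        }
        return fallback;
    }

    // "en-us" -> "en"; a language-only value is returned unchanged.
    private static String languageOf(String locale) {
        int dash = locale.indexOf('-');
        return dash < 0 ? locale : locale.substring(0, dash);
    }
}
```

The same helper could be reused for the Locale parameter of IDocumentProvider.AttachToDocument, which accepts the same two formats.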

IContainer

The portal uses the IContainer interface to query information about back-end resource directories. This interface provides the following methods:

GetGroups and GetUsers return a list of the portal groups or users that have read access to the container. These calls are made only if the Web Service and Content Crawler objects are configured to import security. The portal batches these calls; the Content Crawler code should return all groups or users at once.
GetChildContainers returns the containers inside the current container (i.e., subfolders of a folder). The value stored in the key CrawlerConstants.TAG_DEPTH is used to determine how many times GetChildContainers is called (crawl depth). This value must be set via a Service Configuration page. If no value is stored with this key, GetChildContainers is never called; only the documents in the folder specified for the start location are crawled into the portal. Note: Setting CrawlerConstants.TAG_DEPTH to -1 could result in an infinite loop.
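
To illustrate how a crawl depth bounds the traversal, the sketch below simulates the calling side against an in-memory folder tree. It is an assumption-laden model, not the portal's actual algorithm: DepthLimitedWalk is hypothetical, depth is assumed to count levels below the start location, and values of zero or less are treated as "do not descend" (the -1 behavior mentioned above is not modeled).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DepthLimitedWalk {
    // Hypothetical in-memory repository: container location -> child container locations.
    private final Map<String, List<String>> children = new LinkedHashMap<>();

    void addContainer(String parent, String child) {
        children.computeIfAbsent(parent, k -> new ArrayList<>()).add(child);
    }

    // Visits the start container, then descends at most maxDepth levels below it.
    List<String> walk(String start, int maxDepth) {
        List<String> visited = new ArrayList<>();
        visit(start, maxDepth, visited);
        return visited;
    }

    private void visit(String location, int remaining, List<String> visited) {
        visited.add(location);          // stands in for AttachToContainer
        if (remaining <= 0) {
            return;                     // depth exhausted: child containers are not requested
        }
        List<String> kids = children.getOrDefault(location, Collections.<String>emptyList());
        for (String child : kids) {     // stands in for GetChildContainers
            visit(child, remaining - 1, visited);
        }
    }
}
```

With a depth of 0 only the start folder is visited, which matches the case where no depth value is stored and only the documents in the start location are crawled.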

GetChildDocuments returns the documents inside the current container (folder). The portal batches this call; the Content Crawler code should return all documents at once. The TypeNamespace and TypeID parameters define the Content Type for the document. TypeNamespace associates the document with a row in the Global Content Type Map, and the TypeID associates it with a particular Content Type. The value in ChildDocument.GetLocation is used in IDocumentProvider.AttachToDocument, so any information required by AttachToDocument must be included in the location string. You can describe the document using file or MIME, as shown in the example below.

ChildDocument doc = new ChildDocument();
String filename = "WordDoc.doc";

//Location is a crawler-specific string to retrieve doc, e.g., file name 
doc.setLocation(filename);

//TypeNameSpace is either FILE or MIME unless using a custom namespace (Notes, Exchange)
//NOTE: example uses getCode because setTypeNameSpace expects a String
doc.setTypeNameSpace(TypeNamespace.MIME.getCode());

//For file descriptions, TypeID is simply the document name with extension (i.e., filename)
//For MIME descriptions, set the document type or map multiple file extensions to MIME types 
doc.setTypeID("application/msword");

//DisplayName is the name to display in the KD, usually overridden in IDocument.getMetaData();
doc.setDisplayName(filename);
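
When using MIME descriptions, several file extensions usually map to one MIME type. A small lookup table is one way to do this. The MimeTypes class below is a hypothetical helper, not part of the IDK, and the extension list is only a sample to extend for your repository.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class MimeTypes {
    private static final String DEFAULT_TYPE = "application/octet-stream";
    private static final Map<String, String> BY_EXTENSION = new HashMap<>();

    static {
        // Several extensions can share one MIME type.
        BY_EXTENSION.put("doc", "application/msword");
        BY_EXTENSION.put("dot", "application/msword");
        BY_EXTENSION.put("pdf", "application/pdf");
        BY_EXTENSION.put("txt", "text/plain");
        BY_EXTENSION.put("htm", "text/html");
        BY_EXTENSION.put("html", "text/html");
    }

    // Returns the MIME type for a file name, or a generic binary type if unknown.
    static String forFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return DEFAULT_TYPE;
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(extension, DEFAULT_TYPE);
    }
}
```

In the example above, the hard-coded type could then be replaced with doc.setTypeID(MimeTypes.forFileName(filename)).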

GetMetaData (DEPRECATED) returns all metadata available in the repository about the container. The name and location are used in mirrored crawls to mirror the structure of the source repository. In most cases, the container metadata is only the name and description.

IDocumentProvider

The IDocumentProvider interface allows the portal to specify back-end documents for retrieval. In most cases, the portal calls IContainerProvider first. However, in some cases, the service is used to refresh existing documents and IDocumentProvider might be called first.

Initialize allows the remote server to initialize a session and create a connection to the back-end document repository. (For details on parameters and session state, see IContainerProvider.Initialize above.) IDocumentProvider.Initialize will be called once per thread as long as the session does not time out or get interrupted for other reasons, and AttachToDocument will be called next.

AttachToDocument is always the next call made after Initialize; the order of the remaining calls is not defined. This method "attaches" a session to the document specified in the sDocumentLocation parameter; subsequent calls refer to this document until the next AttachToDocument call. The sDocumentLocation string is the value specified in ChildDocument.GetLocation (ChildDocument is returned by IContainer.GetChildDocuments). If multiple translations of the document are available, select the most appropriate by using the Locale parameter, which can be sent as a full locale (e.g., "en-us") or in the abbreviated language-only format (e.g., "en"). When implementing this method, you can throw the following exceptions:

Exception	Description
NoLongerExistsException	The document has been moved or deleted. (The refresh agent will delete documents from the portal index only if this exception has been thrown.)
NotAvailableException	The document is temporarily unavailable.
NotInitializedException	The IDocumentProvider is in an uninitialized state.
AccessDeniedException	Access to this document is denied.
ServiceException	Propagates the exception to the portal and adds an entry to ALI Logging Spy.
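
The mapping from back-end conditions to these exceptions might look like the sketch below. It is illustrative only: the exception classes are hypothetical local stand-ins named after the IDK exceptions (and unchecked here for brevity), and DocumentAttacher with its State enum simulates a repository in memory.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for the IDK exceptions in the table above.
class CrawlException extends RuntimeException {
    CrawlException(String message) { super(message); }
}
class NoLongerExistsException extends CrawlException {
    NoLongerExistsException(String location) { super("Missing: " + location); }
}
class NotAvailableException extends CrawlException {
    NotAvailableException(String location) { super("Unavailable: " + location); }
}
class AccessDeniedException extends CrawlException {
    AccessDeniedException(String location) { super("Denied: " + location); }
}

class DocumentAttacher {
    enum State { READY, LOCKED, FORBIDDEN }

    // Simulated repository: document location -> current state.
    private final Map<String, State> repository = new HashMap<>();
    private String attachedLocation;

    void put(String location, State state) {
        repository.put(location, state);
    }

    void attachToDocument(String location) {
        attachedLocation = null;                 // forget the previously attached document
        State state = repository.get(location);
        if (state == null) {
            throw new NoLongerExistsException(location);   // moved or deleted
        }
        switch (state) {
            case LOCKED:
                throw new NotAvailableException(location);  // temporary condition
            case FORBIDDEN:
                throw new AccessDeniedException(location);
            default:
                attachedLocation = location;
        }
    }

    String attached() {
        return attachedLocation;
    }
}
```

Keeping NoLongerExistsException for documents that are really gone matters, because that is the only exception that causes the refresh agent to delete the document from the portal index.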

Shutdown allows the portal to clean up any unused sessions that have not yet expired. (For details, see IContainerProvider.Shutdown above.)

IDocument

The IDocument interface allows the portal to query information about and retrieve documents. This interface provides the following methods:

GetDocumentSignature allows the portal to determine if the document has changed and should be re-indexed and flagged as updated. It can be a version number, a last-modified date, or the CRC of the document. The IDK does not enforce any restrictions on what to use for the document signature, or provide any utilities to get the CRC of the document. This is always the first call made to IDocument; on re-crawls, if the documentSignature has not changed, no additional calls will be made.
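
Since the IDK provides no CRC utility, a signature can be computed with the standard java.util.zip.CRC32 class. The following is a minimal sketch, assuming the document content is available as a byte array. DocumentSignatures and its method names are hypothetical.

```java
import java.util.zip.CRC32;

class DocumentSignatures {
    // Returns the CRC-32 checksum of the content as a lowercase hex string.
    static String crc32(byte[] content) {
        CRC32 crc = new CRC32();
        crc.update(content, 0, content.length);
        return Long.toHexString(crc.getValue());
    }

    // True if there is no stored signature or the content no longer matches it.
    static boolean hasChanged(String storedSignature, byte[] content) {
        return storedSignature == null || !storedSignature.equals(crc32(content));
    }
}
```

A CRC requires reading the whole document on every crawl. If the repository exposes a version number or last-modified date, returning that value as the signature is usually cheaper.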

GetMetaData returns all metadata available in the repository about the document. The portal maps this data to properties based on the mappings defined for the appropriate Content Type, along with metadata returned by the associated accessor. The following field names are reserved. Additional properties can be added using the portal's Global Document Property Map; for details, see Configuring Custom Content Crawlers: Properties and Metadata. (Any properties that are not in the Global Document Property Map will be discarded.)

Field Name	Description
Name	REQUIRED. The name of the link to be displayed in the portal Knowledge Directory. Note: By default, the portal uses the name from the crawled file properties as the name of the card. To set the portal to use the Name property returned by GetMetaData, you must set the CrawlerConstants.TAG_PROPERTIES to REMOTE using the Service Configuration Interface.
Description	The description of the link to be displayed in the portal Knowledge Directory.
UseDocFetch	Whether or not to use DocFetch to retrieve the file. The default is False. If you use DocFetch, the value in the File Name field is used to retrieve the file during both indexing and click-through. If you do not use DocFetch, you must provide values for Indexing URL and Click-Through URL.
File Name (required for DocFetch)	The name of the click-through file, used for DocFetch.
Content Type (required for DocFetch)	The content type of the click-through file, used to associate the crawled document with the Global Content Type Map.
Indexing URL (public URL)	(Required if not using DocFetch.) The URL to the file that can be indexed in the portal. URLs can be relative to the Remote Server. If a file is publicly accessible via a URL, that URL can be used to access the document for both indexing and click-through. Documents that cannot be indexed must provide an additional URL at crawl-time for indexing purposes. For details on crawling secured content, see Accessing Secured Content.
Click-Through URL (public URL)	(Required if not using DocFetch.) The URL to the click-through file. URLs can be relative to the Remote Server. For details on crawling secured content, see Accessing Secured Content.
Image UUID (optional)	This parameter is only required for custom Content Types. For standard Content Types, the accessor will assign the correct image UUID.
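
The required-field rules in the table depend on whether DocFetch is used. The sketch below checks a metadata map against those rules. It is an illustration under stated assumptions: MetadataCheck is hypothetical, and the map keys are the field labels from the table above, not necessarily the constants the IDK API defines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class MetadataCheck {
    // Returns the reserved fields that are missing for the chosen retrieval mode.
    static List<String> missingFields(Map<String, String> metadata) {
        List<String> missing = new ArrayList<>();
        require(metadata, "Name", missing);

        // UseDocFetch defaults to false when absent.
        boolean useDocFetch = Boolean.parseBoolean(metadata.get("UseDocFetch"));
        if (useDocFetch) {
            require(metadata, "File Name", missing);
            require(metadata, "Content Type", missing);
        } else {
            require(metadata, "Indexing URL", missing);
            require(metadata, "Click-Through URL", missing);
        }
        return missing;
    }

    private static void require(Map<String, String> metadata, String key, List<String> missing) {
        String value = metadata.get(key);
        if (value == null || value.trim().isEmpty()) {
            missing.add(key);
        }
    }
}
```

Running a check like this before returning from GetMetaData makes configuration mistakes visible at crawl time instead of at click-through.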

GetDocument returns the path to the file if it was not provided by GetMetaData. (For public URLs, you do not need to implement GetDocument, but you must provide values for IndexingURL and ClickThroughURL in GetMetaData.) During crawl-time indexing, this file is copied to the web-accessible IndexFilePath location specified in your deployment descriptor and returned to the portal via a URL to that location. If the file is not supported for indexing by the portal, implement GetDocument to convert the document into a supported file format for indexing (e.g., text-only) and return that file during indexing. Note: To create a custom implementation of GetDocument, you must set UseDocFetch to True. When a user clicks through to the document, the display file is streamed back via the DocFetch servlet to the browser. Any necessary cleanup due to temporary file usage should be done on subsequent calls to IDocumentProvider.AttachToDocument or IDocumentProvider.Shutdown. For details on accessing secured content and files that are not accessible via a public URL, see About Content Crawler Click-Through.
GetGroups and GetUsers return a list of the groups or users with read access to the document. Each entry is an ACLEntry with a domain and group name. The portal batches these calls; the Content Crawler code should return all groups or users at once. This call is made only if the Supports importing security with each document option is checked on the Advanced Settings page of the Web Service editor.

SCI Variables for Content Crawler Properties: Content crawler properties are configured using a defined set of variables.
