Developing Portlets and Integration Web Services: Crawlers and Search Services  

Developing Custom Crawlers

The EDK allows you to create remote crawlers and related configuration pages without parsing SOAP or accessing the portal API; you simply implement four object interfaces to access the back-end repository and retrieve files. UDDI servers are not required. You can also import access restrictions during a crawl; for details, see Configuring Custom Crawlers: Importing File Security.

Note: Before you start coding, make sure to review the Best Practices at the bottom of this page.

The EDK's Plumtree.Remote.Crawler package/namespace includes the following interfaces, each described in its own section below: IContainerProvider, IContainer, IDocumentProvider, and IDocument.

When the Automation Server initiates a crawl, it issues a SOAP request to return a list of folders. It iterates over the list of folders and retrieves lists of documents with metadata. In general, the portal calls EDK interfaces in the following order. See the definitions that follow for more information. (For details on configuration, see Deploying Custom Crawlers.)

  1. IContainerProvider.Initialize, once per thread. Use DataSourceInfo and CrawlerInfo to initialize the Container Provider (i.e., make a connection to the back-end system and create a new session). Note: This is not a true HTTP session, and sessions can get dropped. Keep a variable that can be used to confirm that the session is still initialized; if it is not, throw NotInitializedException. Store the data source in a member variable in Initialize, but do not access the member variable directly; instead, use an accessor method that throws NotInitializedException if the variable is null, as shown in the sample code below.

     protected DataSourceInfo m_dbMap;

     //in Initialize (signature simplified), store the data source settings in the member variable
     public void initialize(DataSourceInfo dsInfo, CrawlerInfo crawlerInfo)
     {
       m_dbMap = dsInfo;
     }

     protected DataSourceInfo getDataInfo() throws NotInitializedException
     {
       //if the member variable is null, throw NotInitializedException to force re-initialization
       if (null == m_dbMap)
       {
         throw new NotInitializedException();
       }
       return m_dbMap;
     }

  2. IContainerProvider.AttachToContainer, using the starting location in the key CrawlerConstants.TAG_PATH. The key should be populated using a Service Configuration page in the Crawler editor. The string in TAG_PATH is crawler-specific; a file crawler could use the UNC path to a folder, while a database crawler could use the full name of a table. (For details on configuration, see Deploying Custom Crawlers.) After AttachToContainer, the IContainer methods are not called in any specific order.

  3. IContainerProvider.Shutdown (this call is optional and could be blocked by exceptions or network failure).

  4. IDocumentProvider.Initialize, once per thread. Note: Sessions can get dropped. Keep a variable that can be used to ensure the session is still initialized; if it is not, throw NotInitializedException.

  5. IDocumentProvider.AttachToDocument for each ChildDocument, then IDocument.GetDocumentSignature to see if the document has changed. If the document is new or has been modified, the remaining IDocument methods are called (not in any specific order).

  6. IDocumentProvider.Shutdown (this call is optional and could be blocked by exceptions or network failure).
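To make the sequence above concrete, the following rough skeleton shows how a container provider might organize steps 1 through 3. It is a sketch only: the class name and the simplified method signatures are assumptions made for illustration; see the EDK API documentation for the exact IContainerProvider definitions.

//Sketch only: the class name and simplified signatures are assumptions for illustration.
//The types referenced (DataSourceInfo, CrawlerInfo, IContainer, CrawlerConstants,
//NotInitializedException, NoLongerExistsException) come from the EDK crawler package.
public class SampleContainerProvider
{
    protected DataSourceInfo m_dbMap;

    //Step 1: called once per thread; store the settings and connect to the back-end system.
    public void initialize(DataSourceInfo dsInfo, CrawlerInfo crawlerInfo)
    {
        m_dbMap = dsInfo;
        //...open a session against the back-end repository...
    }

    //Step 2: attach to the folder identified by the CrawlerConstants.TAG_PATH setting.
    public IContainer attachToContainer(String location)
        throws NotInitializedException, NoLongerExistsException
    {
        DataSourceInfo dsInfo = getDataInfo(); //re-checks that the session is still initialized
        //...locate the folder named by location; throw NoLongerExistsException if it is gone...
        return null; //return an IContainer wrapping the back-end folder
    }

    //Step 3: optional; may never be called if the crawl aborts or the network drops.
    public void shutdown()
    {
        //...release the back-end connection...
    }

    //Accessor from step 1: forces re-initialization if the session was dropped.
    protected DataSourceInfo getDataInfo() throws NotInitializedException
    {
        if (null == m_dbMap)
        {
            throw new NotInitializedException();
        }
        return m_dbMap;
    }
}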

The sections below provide helpful information on the interfaces used to implement a crawler. For a complete listing of interfaces, classes, and methods, see the EDK API documentation. For sample code that illustrates how to implement the EDK Crawler and DocFetch APIs, see the Database Viewer sample application provided with the EDK.

IContainerProvider

The IContainerProvider interface allows the portal to iterate over a back-end directory structure. As noted above, the portal calls IContainerProvider first in most cases. This interface provides the Initialize, AttachToContainer, and Shutdown methods described in the call sequence above.

IContainer

The portal uses the IContainer interface to query information about back-end resource directories, including the child containers and child documents they hold. Each ChildDocument returned to the portal must have its location, type namespace, type ID, and display name set, as shown in the following examples for the FILE and MIME type namespaces.

File:

ChildDocument doc = new ChildDocument();
String filename = "WordDoc.doc";

//location param is a crawler-specific string on how to retrieve the doc
//here we will just use file name (see the API docs for more detail)

doc.setLocation(filename);

//TypeNameSpace is usually either FILE or MIME
//unless this is a custom namespace like Notes, Exchange, or Documentum
//note that we use getCode, as setTypeNameSpace expects a String.

doc.setTypeNameSpace(TypeNamespace.FILE.getCode());

//type id for the FILE TypeNamespace is the document name with extension

doc.setTypeID(filename);

//display name is the name that should appear in the knowledge directory
//this name is usually overridden in IDocument.getMetaData();
//here we will just set to the file name

doc.setDisplayName(filename);

MIME:

ChildDocument doc = new ChildDocument();
String filename = "WordDoc.doc";

//location param is a crawler-specific string on how to retrieve the doc
//here we will just use file name (see the API docs for more detail)

doc.setLocation(filename);

//TypeNameSpace is usually either FILE or MIME
//unless this is a custom namespace like Notes, Exchange, or Documentum
//note that we use getCode, as setTypeNameSpace expects a String.

doc.setTypeNameSpace(TypeNamespace.MIME.getCode());

//if you will be crawling multiple file types, this generally means
//creating a map between file extensions and MIME types
//here we just set the MIME type for Word

doc.setTypeID("application/msword");

//display name is the name that should appear in the knowledge directory
//this name is usually overridden in IDocument.getMetaData();
//here we will just set to the file name

doc.setDisplayName(filename);
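If the crawler handles multiple file types, the extension-to-MIME-type map mentioned in the comments above can be a simple lookup. A minimal sketch, continuing the MIME example (the extensions and MIME types listed are examples only):

//minimal sketch of an extension-to-MIME-type lookup; the entries below are examples only
java.util.Map mimeTypes = new java.util.HashMap();
mimeTypes.put("doc", "application/msword");
mimeTypes.put("xls", "application/vnd.ms-excel");
mimeTypes.put("pdf", "application/pdf");
mimeTypes.put("txt", "text/plain");

String extension = filename.substring(filename.lastIndexOf('.') + 1).toLowerCase();
String mimeType = (String) mimeTypes.get(extension);
if (null == mimeType)
{
    mimeType = "application/octet-stream"; //fallback for unrecognized extensions
}
doc.setTypeID(mimeType);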

IDocumentProvider

The IDocumentProvider interface allows the portal to specify back-end documents for retrieval. In most cases, the portal calls IContainerProvider first. However, in some cases, the service is used to refresh existing documents, and IDocumentProvider might be called first. IDocumentProvider methods, in particular AttachToDocument, can throw the following exceptions:

NoLongerExistsException: The document has been moved or deleted. (The refresh agent will delete documents from the portal index only if this exception has been thrown.)

NotAvailableException: The document is temporarily unavailable.

NotInitializedException: The IDocumentProvider is in an uninitialized state.

AccessDeniedException: Access to this document is denied.

ServiceException: Propagates the exception to the portal and adds an entry to PTSpy.
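For example, an AttachToDocument implementation might translate back-end failures into these exceptions so the portal can react appropriately. The sketch below assumes a simplified signature, a hypothetical repository helper and MyDocument class, and no-argument exception constructors:

//Sketch only: the simplified signature, the repository helper, and MyDocument are
//hypothetical; the exceptions are the EDK exceptions listed above.
public IDocument attachToDocument(String location)
    throws NoLongerExistsException, NotAvailableException, NotInitializedException,
           AccessDeniedException, ServiceException
{
    getDataInfo(); //throws NotInitializedException if the session was dropped
    try
    {
        //repository.exists and MyDocument are hypothetical helpers that may throw IOException
        if (!repository.exists(location))
        {
            //only this exception tells the refresh agent to remove the card from the index
            throw new NoLongerExistsException();
        }
        return new MyDocument(location);
    }
    catch (SecurityException e)
    {
        throw new AccessDeniedException(); //the back-end system refused access
    }
    catch (java.io.IOException e)
    {
        throw new NotAvailableException(); //the document is temporarily unreachable
    }
}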

IDocument

The IDocument interface allows the portal to query information about and retrieve documents; its methods include GetDocumentSignature, GetMetadata, and GetDocument (see the EDK API documentation for the complete list). The metadata returned by GetMetadata includes the following fields:

Name: REQUIRED. The name of the link to be displayed in the portal Knowledge Directory. (Note: By default, the portal uses the name from the crawled file properties as the name of the card. To make the portal use the Name property returned by GetMetadata, you must set CrawlerConstants.TAG_PROPERTIES to REMOTE using the Service Configuration Interface.)

Description: The description of the link to be displayed in the portal Knowledge Directory.

UseDocFetch: Whether or not to use DocFetch to retrieve the file. The default is False. If you use DocFetch, the value in the File Name field is used to retrieve the file during both indexing and click-through. If you do not use DocFetch, you must provide values for Indexing URL and Click-Through URL.

File Name (DocFetch): The name of the click-through file, used for DocFetch.

Content Type (DocFetch): The content type of the click-through file, used to associate the crawled document with the Global Document Type Map.

Indexing URL (public URL): Required if not using DocFetch. The URL to the file that can be indexed in the portal. URLs can be relative to the remote server. If a file is publicly accessible via a URL, that URL can be used to access the document for both indexing and click-through. Documents that cannot be indexed must provide an additional URL at crawl time for indexing purposes. For details on crawling secured content, see the next page, Accessing Secured Content.

Click-Through URL (public URL): Required if not using DocFetch. The URL to the click-through file. URLs can be relative to the remote server. For details on crawling secured content, see the next page, Accessing Secured Content.

Image UUID (optional): Required only for custom document types. For standard document types, the accessor will assign the correct image UUID.
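As a sketch of how these fields might be populated in code, a GetMetadata implementation for the DocFetch case could look roughly like the following. The DocumentMetaData class and its setter names are assumptions made for illustration; consult the EDK API documentation for the actual metadata API.

//Sketch only: DocumentMetaData and its setters are assumed names for illustration.
public DocumentMetaData getMetaData()
{
    DocumentMetaData meta = new DocumentMetaData();
    meta.setName("Quarterly Report.doc");        //REQUIRED: link name shown in the Knowledge Directory
    meta.setDescription("Q3 sales summary");     //optional description for the link
    meta.setUseDocFetch(true);                   //retrieve the file through DocFetch
    meta.setFileName("Quarterly Report.doc");    //click-through file name used by DocFetch
    meta.setContentType("application/msword");   //maps the document via the Global Document Type Map
    //if UseDocFetch were false, set the Indexing URL and Click-Through URL instead
    return meta;
}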

Best Practices

Consider the following best practices for every crawler:

Use logging extensively to provide feedback during a crawl. In some cases, the portal reports a successful crawl when there were minor errors. Use Log4J or Log4Net to track progress. For more information, see Logging and Troubleshooting.
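For example, with Log4J (the class and messages shown here are arbitrary examples):

import org.apache.log4j.Logger;

public class CrawlerLogging
{
    //arbitrary example class; in practice, log from your provider classes
    private static final Logger log = Logger.getLogger(CrawlerLogging.class);

    public void reportProgress(String location)
    {
        log.info("Attached to container: " + location);
    }

    public void reportSkippedDocument(String filename)
    {
        log.warn("Skipping document with unreadable metadata: " + filename);
    }
}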

Use relative URLs in your code to allow migration to another remote server. These URLs might be relative to different base URL endpoints.

The key difference is that the click-through URL is relative to the remote server base URL, while the indexing URL is relative to the SOAP URL. Depending on whether you have implemented your crawler in Java or .NET, the base URL endpoint for the remote server might differ from the base URL endpoint for SOAP.

For example, the Java EDK uses Axis, which implements programs as services. In Axis, the SOAP URL is the remote server base URL with "/services" attached to the end. Given the remote server base URL http://server:port/sitename, the SOAP URL would be http://server:port/sitename/services. If both click-through and indexing URLs point to the same servlet (http://server:port/sitename/customdocfetch?docId=12345), the relative URLs would be different:

 

                 Relative URL                    Resulting URL
  Indexing       ../customdocfetch?docId=12345   http://server:port/sitename/customdocfetch?docId=12345
  Click-Through  customdocfetch?docId=12345      http://server:port/sitename/customdocfetch?docId=12345

As noted above, the indexing URL is relative to the SOAP URL, so the "../" in the relative URL reorients the path from http://server:port/sitename/services to http://server:port/sitename, yielding the correct URL, http://server:port/sitename/customdocfetch?docId=12345.
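In code, the two relative URLs from the example above might be built like this (the docId value and the customdocfetch path come from the example; everything else is illustrative):

//Both strings resolve to http://server:port/sitename/customdocfetch?docId=12345,
//but each is relative to a different base (see the table above).
String docId = "12345";
String clickThroughURL = "customdocfetch?docId=" + docId;  //relative to the remote server base URL
String indexingURL = "../customdocfetch?docId=" + docId;   //relative to the SOAP URL (base URL + "/services")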

Note: In Plumtree Corporate Portal v.4.5 the click-through URL is not relative to the remote server base URL and must be absolute.

 

Do your initial implementation of IDocumentProvider and IDocFetchProvider in separate classes, but factor out some code to allow reuse of the GetDocument and GetMetaData methods. See the Viewer sample application included with the EDK for sample code.
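One way to structure that reuse, sketched with hypothetical class names:

//Sketch only: all class names here are hypothetical. A shared helper keeps the
//retrieval logic in one place so the crawler and DocFetch implementations stay in sync.
public class DocumentAccessHelper
{
    public String getFilePath(String location)
    {
        //...shared logic to locate the back-end file and return a path to it...
        return null;
    }
}

public class CrawlerDocument        //backs the crawler's IDocument implementation
{
    private final DocumentAccessHelper helper = new DocumentAccessHelper();
    //GetDocument and GetMetaData delegate to the helper
}

public class FetchDocument          //used by the IDocFetchProvider implementation
{
    private final DocumentAccessHelper helper = new DocumentAccessHelper();
    //GetDocument delegates to the same helper
}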

Do not make your code order-dependent. The portal can make the calls described above in any order, so your code must not depend on a particular call sequence.

 

If a document or container does not exist, always throw a new NoLongerExistsException. This is the only way the portal can determine if the file or folder has been deleted. Not throwing the exception could result in an infinite loop.

If there are no results, return a zero-length array, not an array of empty strings. (For example, return new ChildContainer[0];)

Check the SOAP timeout for the back-end server and calibrate your response accordingly. In version 5.0 and above, the SOAP timeout is set in the Crawler Web Service editor. In version 4.5, the SOAP timeout must be set via a Service Configuration page.

Pages that are not publicly accessible must be gatewayed. Gateway settings are configured in the Crawler Web Service editor on the HTTP Configuration page, and in the Data Source editor. You can gateway all URLs relative to the remote server or enter individual URLs and add paths to other servers to gateway additional pages. (For details, see Deploying Custom Crawlers.)

 

You must define mappings for any associated Document Types before a crawler is run. The portal uses the mappings in the Document Type definition to map the data returned by the crawler to portal properties. Properties are only stored if you configure the Document Type mapping before running the crawler. (Properties that apply to all documents are configured in the Global Document Property Map.)

 

To import security settings, the back-end repository must have an associated Authentication Source. Crawlers that import security need the user and category (domain) defined by an Authentication Source, so you must configure the Authentication Source before the crawler is run. Many repositories use the network's NT or LDAP security store; if an associated Authentication Source already exists, there is no need to create one. For details on Authentication Sources, see the portal online help. (For details on security, see Configuring Custom Crawlers: Importing File Security.)

If you use a mirrored crawl, only run it when you first import documents. Always check every directory after a mirrored crawl. After you have imported documents into the portal, it is safer to refresh your portal directory using a regular crawl with filters.

For mirrored crawls, make crawl depth as shallow as possible. Portal users want to access documents quickly, so folder structure is important. Also, the deeper the crawl, the more extensive your QA process will be.

 

Use filters to sort crawled documents into portal folders. Mirrored crawls can return inappropriate content and create unwanted directory structures; filters are a more efficient way to sort crawled documents. To use filters, choose Apply Filter of Destination Folder in the Crawler editor. For details on filters, see the portal online help.

Do not use automatic approval until you have tested a crawler. It is dangerous to use automatic approval without first checking the structure, metadata, and logs for the crawl.

 

To clear the deletion history and re-crawl documents that have been deleted from the portal, you must re-open the Crawler editor and configure the Importing Documents settings on the Advanced Settings page, as explained in Deploying Custom Crawlers.

Next: Accessing Secured Content