Developing Portlets and Integration Web Services: Crawlers and Search Services  

Developing Custom Crawlers

The EDK allows you to create remote crawlers and related configuration pages without parsing SOAP or accessing the portal API; you simply implement four object interfaces to access the back-end repository and retrieve files. UDDI servers are not required. You can also import access restrictions during a crawl; for details, see Configuring Custom Crawlers: Importing File Security.

Note: Before you start coding, make sure to review the Best Practices at the bottom of this page.

The EDK's Plumtree.Remote.Crawler package/namespace includes the following interfaces, each described in its own section below: IContainerProvider, IContainer, IDocumentProvider, and IDocument.

When the Automation Server initiates a crawl, it issues a SOAP request to return a list of folders. It iterates over the list of folders and retrieves lists of documents with metadata. In general, the portal calls EDK interfaces in the following order. See the definitions that follow for more information. (For details on configuration, see Deploying Custom Crawlers.)

  1. IContainerProvider.Initialize, once per thread. Use DataSourceInfo and CrawlerInfo to initialize the Container Provider (i.e., make a connection to the back-end system and create a new session). Note: This is not a true HTTP session, and sessions can get dropped. Keep a variable that can be used to confirm that the session is still initialized; if it is not, throw NotInitializedException. Store the data source in a member variable in Initialize, but do not access the member variable directly; instead, use an accessor method that throws NotInitializedException if the variable is null, as shown in the sample code below.

     protected DataSourceInfo m_dbMap;

     //in Initialize (signature simplified), store the data source settings in the member variable
     public void initialize(DataSourceInfo dsInfo, CrawlerInfo crawlerInfo)
     {
       m_dbMap = dsInfo;
     }

     protected DataSourceInfo getDataInfo() throws NotInitializedException
     {
       //if the member variable is null, throw NotInitializedException to force re-initialization
       if (null == m_dbMap)
       {
         throw new NotInitializedException();
       }
       return m_dbMap;
     }

  2. IContainerProvider.AttachToContainer, using the starting location in the key CrawlerConstants.TAG_PATH. The key should be populated using a Service Configuration page in the Crawler editor. The string in TAG_PATH is crawler-specific; a file crawler could use the UNC path to a folder, while a database crawler could use the full name of a table. (For details on configuration, see Deploying Custom Crawlers.) After AttachToContainer, the IContainer methods are not called in any specific order.

  3. IContainerProvider.Shutdown (this call is optional and could be blocked by exceptions or network failure).

  4. IDocumentProvider.Initialize, once per thread. Note: Sessions can get dropped. Keep a variable that can be used to ensure the session is still initialized; if it is not, throw NotInitializedException.

  5. IDocumentProvider.AttachToDocument for each ChildDocument, then IDocument.GetDocumentSignature to see if the document has changed. If the document is new or has been modified, the remaining IDocument methods are called (not in any specific order).

  6. IDocumentProvider.Shutdown (this call is optional and could be blocked by exceptions or network failure).
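To make the sequence above concrete, the following rough skeleton shows how a container provider might organize steps 1 through 3. It is a sketch only: the class name and the simplified method signatures are assumptions made for illustration; see the EDK API documentation for the exact IContainerProvider definitions.

//Sketch only: the class name and simplified signatures are assumptions for illustration.
//The types referenced (DataSourceInfo, CrawlerInfo, IContainer, CrawlerConstants,
//NotInitializedException, NoLongerExistsException) come from the EDK crawler package.
public class SampleContainerProvider
{
    protected DataSourceInfo m_dbMap;

    //Step 1: called once per thread; store the settings and connect to the back-end system.
    public void initialize(DataSourceInfo dsInfo, CrawlerInfo crawlerInfo)
    {
        m_dbMap = dsInfo;
        //...open a session against the back-end repository...
    }

    //Step 2: attach to the folder identified by the CrawlerConstants.TAG_PATH setting.
    public IContainer attachToContainer(String location)
        throws NotInitializedException, NoLongerExistsException
    {
        DataSourceInfo dsInfo = getDataInfo(); //re-checks that the session is still initialized
        //...locate the folder named by location; throw NoLongerExistsException if it is gone...
        return null; //return an IContainer wrapping the back-end folder
    }

    //Step 3: optional; may never be called if the crawl aborts or the network drops.
    public void shutdown()
    {
        //...release the back-end connection...
    }

    //Accessor from step 1: forces re-initialization if the session was dropped.
    protected DataSourceInfo getDataInfo() throws NotInitializedException
    {
        if (null == m_dbMap)
        {
            throw new NotInitializedException();
        }
        return m_dbMap;
    }
}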

The sections below provide helpful information on the interfaces used to implement a crawler. For a complete listing of interfaces, classes, and methods, see the EDK API documentation. For sample code that illustrates how to implement the EDK Crawler and DocFetch APIs, see the Database Viewer sample application provided with the EDK.

IContainerProvider

The IContainerProvider interface allows the portal to iterate over a back-end directory structure. As noted above, the portal calls IContainerProvider first in most cases. This interface provides the Initialize, AttachToContainer, and Shutdown methods described in the call sequence above.

IContainer

The portal uses the IContainer interface to query information about back-end resource directories, including the child containers and child documents they hold. Each ChildDocument returned to the portal must have its location, type namespace, type ID, and display name set, as shown in the following examples for the FILE and MIME type namespaces.

File:

ChildDocument doc = new ChildDocument();
String filename = "WordDoc.doc";

//location param is a crawler-specific string on how to retrieve the doc
//here we will just use file name (see the API docs for more detail)

doc.setLocation(filename);

//TypeNameSpace is usually either FILE or MIME
//unless this is a custom namespace like Notes, Exchange, or Documentum
//note that we use getCode, as setTypeNameSpace expects a String.

doc.setTypeNameSpace(TypeNamespace.FILE.getCode());

//type id for the FILE TypeNamespace is the document name with extension

doc.setTypeID(filename);

//display name is the name that should appear in the knowledge directory
//this name is usually overridden in IDocument.getMetaData();
//here we will just set to the file name

doc.setDisplayName(filename);

MIME:

ChildDocument doc = new ChildDocument();
String filename = "WordDoc.doc";

//location param is a crawler-specific string on how to retrieve the doc
//here we will just use file name (see the API docs for more detail)

doc.setLocation(filename);

//TypeNameSpace is usually either FILE or MIME
//unless this is a custom namespace like Notes, Exchange, or Documentum
//note that we use getCode, as setTypeNameSpace expects a String.

doc.setTypeNameSpace(TypeNamespace.MIME.getCode());

//if you will be crawling multiple file types, this generally means
//creating a map between file extensions and MIME types
//here we just set the MIME type for Word

doc.setTypeID("application/msword");

//display name is the name that should appear in the knowledge directory
//this name is usually overridden in IDocument.getMetaData();
//here we will just set to the file name

doc.setDisplayName(filename);
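If the crawler handles multiple file types, the extension-to-MIME-type map mentioned in the comments above can be a simple lookup. A minimal sketch, continuing the MIME example (the extensions and MIME types listed are examples only):

//minimal sketch of an extension-to-MIME-type lookup; the entries below are examples only
java.util.Map mimeTypes = new java.util.HashMap();
mimeTypes.put("doc", "application/msword");
mimeTypes.put("xls", "application/vnd.ms-excel");
mimeTypes.put("pdf", "application/pdf");
mimeTypes.put("txt", "text/plain");

String extension = filename.substring(filename.lastIndexOf('.') + 1).toLowerCase();
String mimeType = (String) mimeTypes.get(extension);
if (null == mimeType)
{
    mimeType = "application/octet-stream"; //fallback for unrecognized extensions
}
doc.setTypeID(mimeType);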

IDocumentProvider

The IDocumentProvider interface allows the portal to specify back-end documents for retrieval. In most cases, the portal calls IContainerProvider first. However, in some cases, the service is used to refresh existing documents, and IDocumentProvider might be called first. IDocumentProvider methods, in particular AttachToDocument, can throw the following exceptions:

NoLongerExistsException: The document has been moved or deleted. (The refresh agent will delete documents from the portal index only if this exception has been thrown.)

NotAvailableException: The document is temporarily unavailable.

NotInitializedException: The IDocumentProvider is in an uninitialized state.

AccessDeniedException: Access to this document is denied.

ServiceException: Propagates the exception to the portal and adds an entry to PTSpy.
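For example, an AttachToDocument implementation might translate back-end failures into these exceptions so the portal can react appropriately. The sketch below assumes a simplified signature, a hypothetical repository helper and MyDocument class, and no-argument exception constructors:

//Sketch only: the simplified signature, the repository helper, and MyDocument are
//hypothetical; the exceptions are the EDK exceptions listed above.
public IDocument attachToDocument(String location)
    throws NoLongerExistsException, NotAvailableException, NotInitializedException,
           AccessDeniedException, ServiceException
{
    getDataInfo(); //throws NotInitializedException if the session was dropped
    try
    {
        //repository.exists and MyDocument are hypothetical helpers that may throw IOException
        if (!repository.exists(location))
        {
            //only this exception tells the refresh agent to remove the card from the index
            throw new NoLongerExistsException();
        }
        return new MyDocument(location);
    }
    catch (SecurityException e)
    {
        throw new AccessDeniedException(); //the back-end system refused access
    }
    catch (java.io.IOException e)
    {
        throw new NotAvailableException(); //the document is temporarily unreachable
    }
}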

IDocument

The IDocument interface allows the portal to query information about and retrieve documents; its methods include GetDocumentSignature, GetMetadata, and GetDocument (see the EDK API documentation for the complete list). The metadata returned by GetMetadata includes the following fields:

Name: REQUIRED. The name of the link to be displayed in the portal Knowledge Directory. (Note: By default, the portal uses the name from the crawled file properties as the name of the card. To make the portal use the Name property returned by GetMetadata, you must set CrawlerConstants.TAG_PROPERTIES to REMOTE using the Service Configuration Interface.)

Description: The description of the link to be displayed in the portal Knowledge Directory.

UseDocFetch: Whether or not to use DocFetch to retrieve the file. The default is False. If you use DocFetch, the value in the File Name field is used to retrieve the file during both indexing and click-through. If you do not use DocFetch, you must provide values for Indexing URL and Click-Through URL.

File Name (DocFetch): The name of the click-through file, used for DocFetch.

Content Type (DocFetch): The content type of the click-through file, used to associate the crawled document with the Global Document Type Map.

Indexing URL (public URL): Required if not using DocFetch. The URL to the file that can be indexed in the portal. URLs can be relative to the remote server. If a file is publicly accessible via a URL, that URL can be used to access the document for both indexing and click-through. Documents that cannot be indexed must provide an additional URL at crawl time for indexing purposes. For details on crawling secured content, see the next page, Accessing Secured Content.

Click-Through URL (public URL): Required if not using DocFetch. The URL to the click-through file. URLs can be relative to the remote server. For details on crawling secured content, see the next page, Accessing Secured Content.

Image UUID (optional): Required only for custom document types. For standard document types, the accessor will assign the correct image UUID.
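As a sketch of how these fields might be populated in code, a GetMetadata implementation for the DocFetch case could look roughly like the following. The DocumentMetaData class and its setter names are assumptions made for illustration; consult the EDK API documentation for the actual metadata API.

//Sketch only: DocumentMetaData and its setters are assumed names for illustration.
public DocumentMetaData getMetaData()
{
    DocumentMetaData meta = new DocumentMetaData();
    meta.setName("Quarterly Report.doc");        //REQUIRED: link name shown in the Knowledge Directory
    meta.setDescription("Q3 sales summary");     //optional description for the link
    meta.setUseDocFetch(true);                   //retrieve the file through DocFetch
    meta.setFileName("Quarterly Report.doc");    //click-through file name used by DocFetch
    meta.setContentType("application/msword");   //maps the document via the Global Document Type Map
    //if UseDocFetch were false, set the Indexing URL and Click-Through URL instead
    return meta;
}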

Best Practices

Consider the following best practices for every crawler:

Use logging extensively to provide feedback during a crawl. In some cases, the portal reports a successful crawl when there were minor errors. Use Log4J or Log4Net to track progress. For more information, see Logging and Troubleshooting.
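For example, with Log4J (the class and messages shown here are arbitrary examples):

import org.apache.log4j.Logger;

public class CrawlerLogging
{
    //arbitrary example class; in practice, log from your provider classes
    private static final Logger log = Logger.getLogger(CrawlerLogging.class);

    public void reportProgress(String location)
    {
        log.info("Attached to container: " + location);
    }

    public void reportSkippedDocument(String filename)
    {
        log.warn("Skipping document with unreadable metadata: " + filename);
    }
}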

Use relative URLs in your code to allow migration to another remote server. These URLs might be relative to different base URL endpoints.

The key difference is that the click-through URL is relative to the remote server base URL, while the indexing URL is relative to the SOAP URL. Depending on whether you have implemented your crawler in Java or .NET, the base URL endpoint for the remote server might differ from the base URL endpoint for SOAP.

For example, the Java EDK uses Axis, which implements programs as services. In Axis, the SOAP URL is the remote server base URL with "/services" attached to the end. Given the remote server base URL http://server:port/sitename, the SOAP URL would be http://server:port/sitename/services. If both click-through and indexing URLs point to the same servlet (http://server:port/sitename/customdocfetch?docId=12345), the relative URLs would be different:

 

                 Relative URL                    Resulting URL
  Indexing       ../customdocfetch?docId=12345   http://server:port/sitename/customdocfetch?docId=12345
  Click-Through  customdocfetch?docId=12345      http://server:port/sitename/customdocfetch?docId=12345

As noted above, the indexing URL is relative to the SOAP URL, so the "../" in the relative URL reorients the path from http://server:port/sitename/services to http://server:port/sitename, yielding the correct URL, http://server:port/sitename/customdocfetch?docId=12345.
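In code, the two relative URLs from the example above might be built like this (the docId value and the customdocfetch path come from the example; everything else is illustrative):

//Both strings resolve to http://server:port/sitename/customdocfetch?docId=12345,
//but each is relative to a different base (see the table above).
String docId = "12345";
String clickThroughURL = "customdocfetch?docId=" + docId;  //relative to the remote server base URL
String indexingURL = "../customdocfetch?docId=" + docId;   //relative to the SOAP URL (base URL + "/services")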

Note: In Plumtree Corporate Portal v.4.5 the click-through URL is not relative to the remote server base URL and must be absolute.

 

Do your initial implementation of IDocumentProvider and IDocFetchProvider in separate classes, but factor out some code to allow reuse of the GetDocument and GetMetaData methods. See the Viewer sample application included with the EDK for sample code.
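One way to structure that reuse, sketched with hypothetical class names:

//Sketch only: all class names here are hypothetical. A shared helper keeps the
//retrieval logic in one place so the crawler and DocFetch implementations stay in sync.
public class DocumentAccessHelper
{
    public String getFilePath(String location)
    {
        //...shared logic to locate the back-end file and return a path to it...
        return null;
    }
}

public class CrawlerDocument        //backs the crawler's IDocument implementation
{
    private final DocumentAccessHelper helper = new DocumentAccessHelper();
    //GetDocument and GetMetaData delegate to the helper
}

public class FetchDocument          //used by the IDocFetchProvider implementation
{
    private final DocumentAccessHelper helper = new DocumentAccessHelper();
    //GetDocument delegates to the same helper
}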

Do not make your code order-dependent. The portal can make the calls described above in any order, so your code must not depend on a particular call sequence.

 

If a document or container does not exist, always throw a new NoLongerExistsException. This is the only way the portal can determine if the file or folder has been deleted. Not throwing the exception could result in an infinite loop.

If there are no results, return a zero-length array, not an array of empty strings. (For example, return new ChildContainer[0];)

Check the SOAP timeout for the back-end server and calibrate your response accordingly. In version 5.0 and above, the SOAP timeout is set in the Crawler Web Service editor. In version 4.5, the SOAP timeout must be set via a Service Configuration page.

Pages that are not publicly accessible must be gatewayed. Gateway settings are configured in the Crawler Web Service editor on the HTTP Configuration page, and in the Data Source editor. You can gateway all URLs relative to the remote server or enter individual URLs and add paths to other servers to gateway additional pages. (For details, see Deploying Custom Crawlers.)

 

You must define mappings for any associated Document Types before a crawler is run. The portal uses the mappings in the Document Type definition to map the data returned by the crawler to portal properties. Properties are only stored if you configure the Document Type mapping before running the crawler. (Properties that apply to all documents are configured in the Global Document Property Map.)

 

To import security settings, the back-end repository must have an associated Authentication Source. Crawlers that import security need the user and category (domain) defined by an Authentication Source, so you must configure the Authentication Source before the crawler is run. Many repositories use the network's NT or LDAP security store; if an associated Authentication Source already exists, there is no need to create one. For details on Authentication Sources, see the portal online help. (For details on security, see Configuring Custom Crawlers: Importing File Security.)

If you use a mirrored crawl, only run it when you first import documents. Always check every directory after a mirrored crawl. After you have imported documents into the portal, it is safer to refresh your portal directory using a regular crawl with filters.

For mirrored crawls, make crawl depth as shallow as possible. Portal users want to access documents quickly, so folder structure is important. Also, the deeper the crawl, the more extensive your QA process will be.

 

Use filters to sort crawled documents into portal folders. Mirrored crawls can return inappropriate content and create unwanted directory structures; filters are a more efficient way to sort crawled documents. To use filters, choose Apply Filter of Destination Folder in the Crawler editor. For details on filters, see the portal online help.

Do not use automatic approval until you have tested a crawler. It is dangerous to use automatic approval without first checking the structure, metadata, and logs for the crawl.

 

To clear the deletion history and re-crawl documents that have been deleted from the portal, you must re-open the Crawler editor and configure the Importing Documents settings on the Advanced Settings page, as explained in Deploying Custom Crawlers.

Next: Accessing Secured Content