IDK Interfaces for Content Crawler Development
The IDK plumtree.remote.crawler package/namespace includes four interfaces to support content crawler development: IContainerProvider, IContainer, IDocumentProvider, and IDocument.
When the ALI Automation Service initiates a crawl, it issues a SOAP request to return a list of folders. It then iterates over the list of folders and retrieves lists of documents with metadata. In general, the portal calls IDK interfaces in the following order. See the definitions that follow for more information.
- IContainerProvider.Initialize once per thread. Use DataSourceInfo and CrawlerInfo to initialize the Container Provider (make
a connection to the back-end system and create a new session). Note:
This is not a true HTTP session, and sessions can get dropped. Keep
a variable that can be used to ensure the session is still initialized;
if it is not, throw NotInitializedException. Store
the Content Source in a member variable in Initialize. Do not use
direct access to the member variable; instead use a method that checks
if it is null and throws a NotInitializedException.
- IContainerProvider.AttachToContainer, using the starting location in the key CrawlerConstants.TAG_PATH.
The key should be populated using a Service Configuration page in
the Content Crawler editor. The string in TAG_PATH is service-specific;
a file Content Crawler could use the UNC path to a folder, while a
database Content Crawler could use the full name of a table. The following
methods are not called in any specific order.
- IContainer.GetUsers and IContainer.GetGroups on that container as
required. (IContainer.GetMetaData is deprecated.)
- IContainer.GetChildContainers up to the number specified in CrawlerConstants.TAG_DEPTH. (This
key must be set via a Service Configuration page.)
- IContainerProvider.AttachToContainer for each ChildContainer returned.
- IContainer.GetChildDocuments, then IDocumentProvider.AttachToDocument for
each ChildDocument returned.
- IContainerProvider.Shutdown (this call is optional and could be blocked by exceptions or network
failure).
- IDocumentProvider.Initialize once per thread. Note: Sessions can get dropped. Keep a variable
that can be used to ensure the session is still initialized; if it
is not, throw NotInitializedException.
- IDocumentProvider.AttachToDocument for each ChildDocument, then IDocument.GetDocumentSignature to see if the document has changed. If the document is new or has
been modified, the following methods are called (not in any specific
order).
- IDocument.GetUsers and IDocument.GetGroups on that document as
required.
- IDocument.GetMetaData to get the file name, description, content type, URL, etc.
- IDocument.GetDocument to index the document (only if DocFetch is used).
- IDocumentProvider.Shutdown (this call is optional and could be blocked by exceptions or network
failure).
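The container-side part of the sequence above can be pictured as a depth-limited walk over folders. The following is a minimal sketch, not the portal's actual implementation: Folder, CrawlOrderSketch, and their methods are hypothetical names, no IDK types are used, and the depth semantics (0 means only the start folder is crawled) are an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory stand-in for a back-end folder; not an IDK type. */
class Folder {
    final String location;
    final List<Folder> children = new ArrayList<>();

    Folder(String location) {
        this.location = location;
    }

    Folder addChild(String childLocation) {
        Folder child = new Folder(childLocation);
        children.add(child);
        return child;
    }
}

/** Records the container-side calls a crawl of the given depth would trigger. */
class CrawlOrderSketch {

    static List<String> crawl(Folder start, int maxDepth) {
        List<String> calls = new ArrayList<>();
        calls.add("IContainerProvider.Initialize");
        walk(start, 0, maxDepth, calls);
        calls.add("IContainerProvider.Shutdown"); // optional in a real crawl
        return calls;
    }

    private static void walk(Folder folder, int depth, int maxDepth, List<String> calls) {
        calls.add("AttachToContainer " + folder.location);
        calls.add("GetChildDocuments " + folder.location);
        if (depth >= maxDepth) {
            return; // depth limit reached, so child containers are not requested
        }
        calls.add("GetChildContainers " + folder.location);
        for (Folder child : folder.children) {
            walk(child, depth + 1, maxDepth, calls);
        }
    }
}
```

Calling CrawlOrderSketch.crawl(root, 0) records an attach and a document listing for the start folder only; raising the depth to 1 adds one GetChildContainers call and one attach per subfolder.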
The sections below provide helpful information on the
interfaces used to implement a Content Crawler. For a complete listing
of interfaces, classes, and methods, see the IDK API documentation.
IContainerProvider
The IContainerProvider interface allows the portal to iterate over a back-end directory structure. The portal calls IContainerProvider first in most cases. This interface provides the following methods:
- Initialize allows the remote server to initialize a session and create a connection
to the back-end document repository. The IDK passes in a DataSourceInfo object that contains the necessary settings
associated with a Content Source object (the name of a directory in
the repository and the credentials of a system user). The CrawlerInfo object contains the settings for the associated
Content Crawler object in the portal. The start location of the crawl
is the value stored in the key CrawlerConstants.TAG_PATH,
set using a Service Configuration page.
- AttachToContainer is always the next call after Initialize; the order of the remaining
calls is not defined. It associates the session with the container
specified in the sContainerLocation parameter;
subsequent calls refer to this container until the next AttachToContainer call. The value in the sContainerLocation parameter
will be the CrawlerConstants.TAG_PATH key for the initial attach,
and the value specified in ChildContainer.GetLocation for subsequent
attaches. Each time AttachToContainer is called,
discard any state created during the previous AttachToContainer call. If multiple translations of the container are available, select
the most appropriate using the Locale parameter, which can be sent
as a full locale (e.g., "en-us") or in the abbreviated language-only
format (e.g., "en"). Note: If the container specified does not exist,
you must throw a new NoLongerExistsException to
avoid an infinite loop. If the Content Crawler is configured to delete
missing files, all files in the container will be removed from the
portal index.
- Shutdown allows the portal to clean up any unused sessions that have not
yet expired. Content Crawlers are implemented on top of standard cookie-based
session mechanisms, so sessions expire and resources and connections
are released after an inactivity period, typically around 20 minutes.
As a performance optimization, the portal might send a Shutdown message notifying the remote server to end the session immediately.
No parameters are received and none are returned. Do not assume that Shutdown will be called; the call could be blocked by an
exception or network failure. Remote servers must terminate sessions
after an inactivity timeout but can choose to ignore the Shutdown message and keep the session alive until it times
out.
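The session handling described for Initialize, AttachToContainer, and Shutdown can be sketched as follows. This is an illustration under stated assumptions, not the IDK's API: ContainerProviderSketch is a hypothetical class, the two exception classes are local stand-ins that only borrow the names of the IDK exceptions, and the back end is simulated with an in-memory set of folder paths.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Local stand-in that borrows the name of the IDK exception; not the IDK class. */
class NotInitializedException extends Exception {
    NotInitializedException(String message) { super(message); }
}

/** Local stand-in that borrows the name of the IDK exception; not the IDK class. */
class NoLongerExistsException extends Exception {
    NoLongerExistsException(String message) { super(message); }
}

/** Hypothetical sketch of the Initialize / AttachToContainer / Shutdown life cycle. */
class ContainerProviderSketch {
    private final Set<String> backEndFolders;     // simulated repository
    private Map<String, String> contentSource;    // null until initialize() runs
    private String currentContainer;              // state owned by the latest attach

    ContainerProviderSketch(Set<String> backEndFolders) {
        this.backEndFolders = new HashSet<>(backEndFolders);
    }

    void initialize(Map<String, String> dataSourceSettings) {
        contentSource = new HashMap<>(dataSourceSettings);
    }

    /** Single access point for the member variable, so a dropped session is always detected. */
    private Map<String, String> contentSource() throws NotInitializedException {
        if (contentSource == null) {
            throw new NotInitializedException("session is not initialized");
        }
        return contentSource;
    }

    void attachToContainer(String location)
            throws NotInitializedException, NoLongerExistsException {
        contentSource();            // fail fast if the session was dropped
        currentContainer = null;    // discard state left by the previous attach
        if (!backEndFolders.contains(location)) {
            throw new NoLongerExistsException(location);
        }
        currentContainer = location;
    }

    String currentContainer() {
        return currentContainer;
    }

    /** May never be called; an inactivity timeout has to release the same resources. */
    void shutdown() {
        contentSource = null;
        currentContainer = null;
    }
}
```

Routing every read of the Content Source through one checking method means a session dropped between calls surfaces as NotInitializedException rather than as a null-pointer failure deeper in the crawler.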
IContainer
The portal uses the IContainer interface to
query information about back-end resource directories. This interface
provides the following methods:
- GetGroups and GetUsers return a list of the portal groups
or users that have read access to the container. These calls are made
only if the Web Service and Content Crawler objects are configured
to import security. The portal batches these calls; the Content Crawler
code should return all groups or users at once.
- GetChildContainers returns the containers inside the current container (i.e., subfolders
of a folder). The value stored in the key CrawlerConstants.TAG_DEPTH is used to determine how many times GetChildContainers is called (crawl depth). This value must be set via a Service Configuration
page. If no value is stored with this key, GetChildContainers is never called; only the documents in the folder specified for
the start location are crawled into the portal. Note: Setting CrawlerConstants.TAG_DEPTH
to -1 could result in an infinite loop.
- GetChildDocuments returns the documents inside the current container (folder). The
portal batches this call; the Content Crawler code should return all
documents at once. The TypeNamespace and TypeID parameters
define the Content Type for the document. TypeNamespace associates the document with a row in the Global Content Type Map,
and the TypeID associates it with a particular Content Type. The value
in ChildDocument.GetLocation is used in IDocumentProvider.AttachToDocument, so any information
required by AttachToDocument must be included in
the location string. You can describe the document using file or MIME,
as shown in the example below.
ChildDocument doc = new ChildDocument();
String filename = "WordDoc.doc";
//Location is a crawler-specific string to retrieve doc, e.g., file name
doc.setLocation(filename);
//TypeNameSpace is either FILE or MIME unless using a custom namespace (Notes, Exchange)
//NOTE: example uses getCode because setTypeNameSpace expects a String
doc.setTypeNameSpace(TypeNamespace.MIME.getCode());
//For file descriptions, TypeID is simply the document name with extension (i.e., filename)
//For MIME descriptions, set the document type or map multiple file extensions to MIME types
doc.setTypeID("application/msword");
//DisplayName is the name to display in the KD, usually overridden in IDocument.getMetaData();
doc.setDisplayName(filename);
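When MIME descriptions are used, several file extensions usually have to map to one MIME type. The sketch below shows one way to derive a TypeID string from a file name. MimeTypeSketch and mimeTypeFor are hypothetical names, the table entries are examples only, and the application/octet-stream fallback is an assumption rather than documented portal behavior.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative extension-to-MIME lookup; the table contents are examples only. */
class MimeTypeSketch {
    private static final String FALLBACK = "application/octet-stream"; // assumed default
    private static final Map<String, String> MIME_BY_EXTENSION = new HashMap<>();

    static {
        MIME_BY_EXTENSION.put("doc", "application/msword");
        MIME_BY_EXTENSION.put("dot", "application/msword");
        MIME_BY_EXTENSION.put("pdf", "application/pdf");
        MIME_BY_EXTENSION.put("htm", "text/html");
        MIME_BY_EXTENSION.put("html", "text/html");
        MIME_BY_EXTENSION.put("txt", "text/plain");
    }

    /** Returns the MIME type for the file's extension, ignoring case. */
    static String mimeTypeFor(String filename) {
        int dot = filename.lastIndexOf('.');
        if (dot < 0 || dot == filename.length() - 1) {
            return FALLBACK; // no extension to look up
        }
        String extension = filename.substring(dot + 1).toLowerCase(Locale.ROOT);
        return MIME_BY_EXTENSION.getOrDefault(extension, FALLBACK);
    }
}
```

The returned string could then be passed to doc.setTypeID in place of the hard-coded "application/msword" used in the example above.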
IDocumentProvider
The IDocumentProvider interface allows the portal to specify back-end documents for retrieval. In most cases, the portal calls IContainerProvider first. However, in some cases, the service is used to refresh existing documents and IDocumentProvider might be called first.
- Initialize allows the remote server to initialize a session and create a connection
to the back-end document repository. (For details on parameters and
session state, see IContainerProvider.Initialize above.) IDocumentProvider.Initialize will be
called once per thread as long as the session does not time out or
get interrupted for other reasons, and AttachToDocument will be called next.
- AttachToDocument is always the next call made after Initialize; the order of the
remaining calls is not defined. This method 'attaches' a session to
the document specified in the sDocumentLocation parameter;
subsequent calls refer to this document until the next AttachToDocument
call. The sDocumentLocation string is the value specified in ChildDocument.GetLocation
(ChildDocument is returned by IContainer.GetChildDocuments). If multiple
translations of the document are available, select the most appropriate
by using the Locale parameter, which can be sent as a full locale
(e.g., 'en-us') or in the abbreviated language-only format (e.g.,
'en'). When implementing this method, you can throw the following
exceptions:
Exception | Description
NoLongerExistsException | The document has been moved or deleted. (The refresh agent will delete documents from the portal index only if this exception has been thrown.)
NotAvailableException | The document is temporarily unavailable.
NotInitializedException | The IDocumentProvider is in an uninitialized state.
AccessDeniedException | Access to this document is denied.
ServiceException | Propagates the exception to the portal and adds an entry to ALI Logging Spy.
- Shutdown allows the portal to clean up any unused sessions that have not
yet expired. (For details, see IContainerProvider.Shutdown above.)
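Because the Locale parameter can arrive either as a full locale or as a language-only code, an implementation that stores several translations needs a matching rule. The sketch below shows one possible rule: exact match first, then language-only match, then a caller-supplied fallback. LocaleSketch and pickTranslation are hypothetical names, and the matching order is an assumption, not something the IDK prescribes.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical helper that picks a stored translation for a requested locale. */
class LocaleSketch {

    /** Lower-cases a locale string and treats '_' like '-'. */
    private static String normalize(String locale) {
        return locale.toLowerCase(Locale.ROOT).replace('_', '-');
    }

    /** Returns the language part of a normalized locale, e.g. "en" for "en-us". */
    private static String language(String normalized) {
        int dash = normalized.indexOf('-');
        return dash < 0 ? normalized : normalized.substring(0, dash);
    }

    /**
     * Picks the available translation that best matches the requested locale:
     * an exact match wins, then the first entry with the same language,
     * otherwise the fallback is returned.
     */
    static String pickTranslation(String requested, List<String> available, String fallback) {
        String wanted = normalize(requested);
        for (String candidate : available) {
            if (normalize(candidate).equals(wanted)) {
                return candidate;
            }
        }
        String wantedLanguage = language(wanted);
        for (String candidate : available) {
            if (language(normalize(candidate)).equals(wantedLanguage)) {
                return candidate;
            }
        }
        return fallback;
    }
}
```

With translations stored for "fr-fr" and "en-us", a request for "en" resolves to "en-us", while a request for "de" falls through to whatever default the crawler supplies.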
IDocument
The IDocument interface allows the portal to
query information about and retrieve documents. This interface provides
the following methods:
- GetDocumentSignature allows the portal to determine if the document has changed and should
be re-indexed and flagged as updated. It can be a version number,
a last-modified date, or the CRC of the document. The IDK does not
enforce any restrictions on what to use for the document signature,
or provide any utilities to get the CRC of the document. This is always
the first call made to IDocument; on re-crawls, if the documentSignature
has not changed, no additional calls will be made.
- GetMetaData returns all metadata available in the repository about the document.
The portal maps this data to properties based on the mappings defined
for the appropriate Content Type, along with metadata returned by
the associated accessor. The following field names are reserved. Additional
properties can be added using the portal's Global Document Property
Map; for details, see Configuring Custom Content Crawlers: Properties
and Metadata. (Any properties that are not in the Global Document
Property Map will be discarded.)
Field Name | Description
Name | REQUIRED. The name of the link to be displayed in the portal Knowledge Directory. Note: By default, the portal uses the name from the crawled file properties as the name of the card. To set the portal to use the Name property returned by GetMetaData, you must set CrawlerConstants.TAG_PROPERTIES to REMOTE using the Service Configuration Interface.
Description | The description of the link to be displayed in the portal Knowledge Directory.
UseDocFetch | Whether or not to use DocFetch to retrieve the file. The default is False. If you use DocFetch, the value in the File Name field is used to retrieve the file during both indexing and click-through. If you do not use DocFetch, you must provide values for Indexing URL and Click-Through URL.
File Name (required for DocFetch) | The name of the click-through file, used for DocFetch.
Content Type (required for DocFetch) | The content type of the click-through file, used to associate the crawled document with the Global Content Type Map.
Indexing URL (public URL) | (Required if not using DocFetch.) The URL to the file that can be indexed in the portal. URLs can be relative to the Remote Server. If a file is publicly accessible via a URL, that URL can be used to access the document for both indexing and click-through. Documents that cannot be indexed must provide an additional URL at crawl-time for indexing purposes. For details on crawling secured content, see Accessing Secured Content.
Click-Through URL (public URL) | (Required if not using DocFetch.) The URL to the click-through file. URLs can be relative to the Remote Server. For details on crawling secured content, see Accessing Secured Content.
Image UUID (optional) | This parameter is only required for custom Content Types. For standard Content Types, the accessor will assign the correct image UUID.
- GetDocument returns the path to the file if it was not provided by GetMetaData. (For public URLs, you do not need to implement GetDocument, but you must provide values for IndexingURL and ClickThroughURL in GetMetaData.) During crawl-time indexing, this file is
copied to the web-accessible IndexFilePath location
specified in your deployment descriptor and returned to the portal
via a URL to that location. If the file is not supported for indexing
by the portal, implement GetDocument to convert
the document into a supported file format for indexing (e.g., text-only)
and return that file during indexing. Note: To create a custom implementation
of GetDocument, you must set UseDocFetch to True. When a user clicks through to the document, the display
file is streamed back via the DocFetch servlet to the browser. Any
necessary cleanup due to temporary file usage should be done on subsequent
calls to IDocumentProvider.AttachToDocument or IDocumentProvider.Shutdown. For details on accessing secured
content and files that are not accessible via a public URL, see About Content Crawler Click-Through.
- GetGroups and GetUsers return a list of the groups or
users with read access to the document. Each entry is an ACLEntry
with a domain and group name. The portal batches these calls; the
Content Crawler code should return all groups or users at once. This
call is made only if the Supports importing security with each
document option is checked on the Advanced Settings page of the
Web Service editor.
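Two of the IDocument rules above lend themselves to a small sketch: computing a document signature, and checking that the reserved metadata fields match the UseDocFetch setting. The code below is illustrative only. DocumentSketch and its methods are hypothetical names, CRC-32 is just one of the signature options the text mentions, and the map keys mirror the labels in the table above rather than the IDK's actual field constants.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.zip.CRC32;

/** Hypothetical helpers for a document signature and a reserved-field check. */
class DocumentSketch {

    /** CRC-32 of the content as a hex string, one possible document signature. */
    static String crcSignature(byte[] content) {
        CRC32 crc = new CRC32();
        crc.update(content, 0, content.length);
        return Long.toHexString(crc.getValue());
    }

    /** True when no signature was stored yet or the stored one differs. */
    static boolean needsReindex(String storedSignature, String currentSignature) {
        return storedSignature == null || !storedSignature.equals(currentSignature);
    }

    /** Lists the reserved fields that are required but missing or empty. */
    static List<String> missingFields(Map<String, String> fields) {
        List<String> missing = new ArrayList<>();
        require(fields, "Name", missing);
        // parseBoolean(null) is false, which matches the documented default of False.
        if (Boolean.parseBoolean(fields.get("UseDocFetch"))) {
            require(fields, "File Name", missing);
            require(fields, "Content Type", missing);
        } else {
            require(fields, "Indexing URL", missing);
            require(fields, "Click-Through URL", missing);
        }
        return missing;
    }

    private static void require(Map<String, String> fields, String key, List<String> missing) {
        String value = fields.get(key);
        if (value == null || value.trim().isEmpty()) {
            missing.add(key);
        }
    }
}
```

A last-modified timestamp or a repository version number would serve equally well as a signature and avoids reading the whole file; the CRC variant is shown only because it needs nothing beyond the JDK.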