Implementing Content Crawler DocFetch

Content crawler code can use DocFetch to access files that are not available via a public URL.

To use DocFetch, set three fields in the DocumentMetaData object that the content crawler returns when the portal calls IDocument.getMetaData:

UseDocFetch: Set UseDocFetch to True.
File Name: Set the File Name to the name of the file in the repository (must be unique).
Content Type: Set the Content Type to the content type for the file. The content type must be mapped to a supported Content Type in the portal.
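
As a rough illustration of the three settings listed above, the following Java sketch collects them as plain name/value pairs. The class DocFetchFieldsSketch and its buildFields method are hypothetical stand-ins written for this example. They are not part of the IDK, and how the values are actually passed into DocumentMetaData should be taken from the IDK API documentation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for illustration only. It models the three settings as
// plain name/value pairs; the real DocumentMetaData class in the IDK is not
// reproduced here and may be populated differently.
final class DocFetchFieldsSketch {

    // Returns the three DocFetch-related fields, rejecting empty values.
    static Map<String, String> buildFields(String uniqueFileName, String contentType) {
        if (uniqueFileName == null || uniqueFileName.trim().isEmpty()) {
            throw new IllegalArgumentException("File Name must be set and unique in the repository");
        }
        if (contentType == null || contentType.trim().isEmpty()) {
            throw new IllegalArgumentException("Content Type must be set");
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("UseDocFetch", "True");          // route click-through via DocFetch
        fields.put("File Name", uniqueFileName);    // name of the file in the repository
        fields.put("Content Type", contentType);    // must map to a portal Content Type
        return fields;
    }

    public static void main(String[] args) {
        // Example values are made up for the demonstration.
        System.out.println(buildFields("report-1042.pdf", "application/pdf"));
    }
}
```

The validation step is a design choice for the sketch, not an IDK requirement: failing early on an empty file name or content type is easier to diagnose than a click-through that fails later.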

When UseDocFetch is set to True, the Oracle WebCenter Interaction Development Kit (IDK) sets the ClickThroughURL stored in the Directory to the URL of the DocFetch servlet, and calls IDocument.getDocument to retrieve the file path to the indexable version of the document. When a user subsequently clicks on a link to the crawled document in the Directory, the request to the DocFetch servlet makes several calls to the already-implemented content crawler code. getDocument is called again, but this time as part of the IDocFetch interface. The file path returned is opened by the servlet and streamed back in the response.

As explained above, the content crawler must implement the getDocument method in both the Crawler.IDocument and DocFetch.IDocFetch interfaces to return the appropriate file path(s). If the repository cannot access files directly, you must serialize the binary representation to a temporary disk file and return that path. The IDocument and IDocFetch interfaces can use the same process. The Oracle WebCenter Interaction Development Kit (IDK) provides a cleanup call to delete any temporary files later.

Note: If getDocument returns a path to a file (not a URL to a publicly accessible file), the file name must be unique. Otherwise, all copies of the file are removed during cleanup, including copies that are currently in use by other users.
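
One possible way to satisfy the uniqueness requirement is to let the Java runtime generate the temporary file name. The sketch below is a minimal example under stated assumptions: TempDocumentWriterSketch, writeToUniqueTempFile, and the tempDir parameter are hypothetical names introduced here, and the code does not call any IDK class. A helper like this could be shared by both getDocument implementations.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the IDK: serializes document bytes to a
// uniquely named temporary file and returns its path.
final class TempDocumentWriterSketch {

    static Path writeToUniqueTempFile(Path tempDir, String repositoryName, byte[] content)
            throws IOException {
        Files.createDirectories(tempDir);
        // createTempFile generates a name that does not collide with existing files.
        Path file = Files.createTempFile(tempDir, "docfetch-", safeExtension(repositoryName));
        Files.write(file, content);
        return file;
    }

    // Keeps a short alphanumeric extension so the file type stays recognizable;
    // falls back to ".tmp" for anything else.
    static String safeExtension(String name) {
        int dot = name.lastIndexOf('.');
        if (dot >= 0 && dot < name.length() - 1) {
            String ext = name.substring(dot + 1);
            if (ext.matches("[A-Za-z0-9]{1,10}")) {
                return "." + ext;
            }
        }
        return ".tmp";
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("docfetch-demo");
        byte[] bytes = "example".getBytes(StandardCharsets.UTF_8);
        // Two requests for the same repository file get two distinct paths.
        System.out.println(writeToUniqueTempFile(dir, "report.pdf", bytes));
        System.out.println(writeToUniqueTempFile(dir, "report.pdf", bytes));
    }
}
```

Because each call produces its own file, removing one user's temporary copy does not affect a copy that another user is still reading.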

To use user preferences or User Information, you must configure, in the Content Crawler editor, which settings are used.

DocFetch interfaces are called in the following order. For a complete listing of interfaces, classes, and methods, see the Oracle WebCenter Interaction Development Kit (IDK) API documentation.

1. IDocFetchProvider.initialize: Use the DataSourceInfo, UserPrefs, and UserInfo returned from the portal to connect to the backend system and create a new session. The implementation should initialize in a similar manner to IDocumentProvider.initialize. IDocFetchProvider can use UserInfo and UserPrefs to perform additional authentication. The ICrawlerLog object is not available. Note: Sessions can be dropped. Keep a variable that tracks whether the session is still initialized; if it is not, throw NotInitializedException.
2. IDocFetchProvider.attachToDocument: Attach to the document using the authentication information provided (including UserPrefs and UserInfo).
   1. IDocFetch.getMetaData: The only DocumentMetaData required for click-through is the file name and content type.
   2. IDocFetch.getDocument: As noted above, the IDocFetch.getDocument method should reuse as much code as possible from the IDocument.getDocument method. The Oracle WebCenter Interaction Development Kit (IDK) looks in web.config/*.wsdd to get the file path and URL to the directory for creating temporary files.
3. IDocFetchProvider.shutdown (optional).
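
The session check described for initialize in the call order above can be sketched as follows. This is a simplified model, not the IDK interfaces: DocFetchProviderSketch and NotInitializedSketchException are hypothetical stand-ins (the latter for the IDK's NotInitializedException), and the method signatures are reduced to the minimum needed to show the guard.

```java
// Stand-in for the IDK's NotInitializedException so the sketch compiles alone.
final class NotInitializedSketchException extends RuntimeException {
    NotInitializedSketchException(String message) {
        super(message);
    }
}

// Hypothetical, simplified provider. The real interfaces take portal-supplied
// arguments (DataSourceInfo, UserPrefs, UserInfo) that are omitted here.
final class DocFetchProviderSketch {
    private boolean sessionActive;      // tracks whether initialize has run
    private String attachedDocument;    // location set by attachToDocument

    void initialize() {
        // A real implementation would connect to the backend system here.
        sessionActive = true;
    }

    void attachToDocument(String documentLocation) {
        ensureInitialized();
        attachedDocument = documentLocation;
    }

    String getDocument() {
        ensureInitialized();
        if (attachedDocument == null) {
            throw new IllegalStateException("attachToDocument was not called");
        }
        return attachedDocument;
    }

    void shutdown() {
        sessionActive = false;
        attachedDocument = null;
    }

    // Every call after initialize goes through this guard, so a dropped
    // session surfaces as an exception instead of a silent failure.
    private void ensureInitialized() {
        if (!sessionActive) {
            throw new NotInitializedSketchException("session is not initialized");
        }
    }
}
```

In this sketch, calling attachToDocument or getDocument before initialize, or after shutdown, throws the stand-in exception, which mirrors the behavior recommended in the note on dropped sessions.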

Oracle WebCenter Interaction Web Service Development Guide