If it is not trivial to return a public URL to a file in the back-end repository, you must implement code to retrieve the file. As noted in the introduction, crawlers must access a file twice.
First, the crawler must retrieve metadata and index documents in the data repository. Files are indexed based on metadata and full-text content. If the content is not accessible via a URL or cannot be indexed for another reason, you must implement a custom document fetching mechanism (e.g., a servlet or aspx page) that returns an indexable version of the file.
Second, the crawler must retrieve individual documents on demand through the portal Knowledge Directory, enforcing any user-level access restrictions. When a user retrieves a file, it must be displayed in a readable format.
One option is to use the DocFetch mechanism in the EDK. Note: DocFetch does not allow you to use multiple methods of authentication or implement custom error handling. If you cannot use public URLs and are not using DocFetch, you must implement a custom document fetching mechanism (e.g., a servlet or aspx page). If necessary, you can implement separate servlets for indexing and click-through.
As noted above, a crawler must return an indexable version of the file to be included in the portal Knowledge Directory. For files, you can stream content directly from the source directory. If the content is not in a file, create a representation with as little extraneous information as possible in a temporary file. The servlet/aspx page must return the content in an indexable format and set the content type and file name using the appropriate headers.
Any information required to retrieve the document must go on the query string of the indexing URL, including credentials (if needed).
Note: The request to the indexing servlet is a simple HTTP GET. This call is not gatewayed, so the crawler does not have access to the data source, user credentials and preferences, or anything else through the EDK.
Streaming Content: If the content is indexable, you can stream content directly from the source directory. In this approach, no temporary files are necessary.
Using Temporary Files: If the content cannot be indexed as-is, you must create a temporary file. In most cases, the resource has already been accessed in AttachToDocument, so there is no need to call the back-end system again. In this approach, no credentials need to be passed.
If the content being crawled is in a file, you can stream the content directly from the source directory. (For information on using temporary files, see the next section.) The servlet/aspx page must return the content in an indexable format and set the content type and file name using the appropriate headers. A typical custom mechanism involves the following steps (a code sketch follows the list):
In IDocument, get all the variables needed to access the document, and add them to the query string of the indexing servlet. This could be as simple as a UNC path for a file crawler, or as complicated as server name, database name, schema, table, primary key(s) and primary key value(s) for a database record. It depends entirely on the crawler and the document being crawled. Make sure all values are URLEncoded.
Add the content type to the query string.
In IDocument, add URLEncoded credentials to the query string. Keep in mind that a "+" in the credentials may arrive as a space after URL decoding, and must be turned back into a "+" in the indexing servlet.
Pass back URLs via the EDK's DocumentMetadata class that point to the servlet(s).
UseDocFetch: Set UseDocFetch to False.
IndexingURL: Set the IndexingURL to the endpoint/servlet that provides the indexable version of the file (including the query string arguments defined in steps 1-3 above).
ClickThroughURL: Set the ClickThroughURL to the endpoint/servlet that provides the path to be used when a user clicks through to view the file. During the crawl, the ClickThroughURL value is stored in the associated Knowledge Directory document. (For details, see Click-Through below.)
In the indexing servlet, get the location string and content type from the query string and parse the location string to get the path to the resource.
Obtain the resource.
Set the content type header using the content type supplied on the query string, and set the Content-Disposition header.
Stream the file (binary or text) or write out the file (text) in a try-catch block.
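The sketches below illustrate both sides of this exchange, assuming a file crawler that passes a UNC path. The base URL, servlet path, parameter names, and class names are illustrative only; consult the EDK API documentation for the exact DocumentMetadata setters used to pass the resulting URLs back to the portal.

```java
// Sketch: building the indexing URL inside IDocument.getMetaData (first steps above).
// BASE_URL, the servlet path, and the parameter names are assumptions.
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class IndexingUrlBuilder {

    private static final String BASE_URL = "http://crawlerhost:8080/myCrawler";

    public static String buildIndexingUrl(String uncPath, String contentType,
                                           String user, String password)
            throws UnsupportedEncodingException {
        // URLEncode every value placed on the query string.
        StringBuilder url = new StringBuilder(BASE_URL).append("/indexServlet?");
        url.append("location=").append(URLEncoder.encode(uncPath, "UTF-8"));
        url.append("&contentType=").append(URLEncoder.encode(contentType, "UTF-8"));
        // Credentials, if needed; see the note above about "+" handling.
        url.append("&user=").append(URLEncoder.encode(user, "UTF-8"));
        url.append("&pwd=").append(URLEncoder.encode(password, "UTF-8"));
        return url.toString();
    }
}
```

```java
// Sketch: an indexing servlet that streams the file named on the query string.
// The parameter names match the builder above; the request is not gatewayed,
// so the query string must carry everything the servlet needs.
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class IndexServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        // getParameter URL-decodes the values placed on the query string.
        String location = request.getParameter("location");
        String contentType = request.getParameter("contentType");
        String fileName = location.substring(location.lastIndexOf('\\') + 1);

        // Set the content type and file name using the appropriate headers.
        response.setContentType(contentType);
        response.setHeader("Content-Disposition",
                "inline; filename=\"" + fileName + "\"");

        InputStream in = null;
        OutputStream out = response.getOutputStream();
        try {
            in = new FileInputStream(location);
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } finally {
            if (in != null) {
                in.close();
            }
        }
    }
}
```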
As noted above, if the content is not in a file, create a representation with as little extraneous information as possible in a temporary file. The servlet/aspx page must return the content in an indexable format and set the content type and file name using the appropriate headers. A typical custom mechanism involves the following:
In IDocument, write a temporary file to a publicly accessible location (e.g., the root directory of the Web application, as shown in the code snippet below).
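A minimal sketch of this step follows. How the Web application root is located (a system property here) is an assumption; use whatever configuration mechanism your crawler Web application already provides.

```java
// Sketch: writing an indexable representation to a temporary file in the
// Web application root. The system property used to find the root is an
// assumption for illustration only.
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

public class TempFileWriter {

    public static File writeTempFile(String indexableContent) throws IOException {
        String webAppRoot = System.getProperty("crawler.webapp.root", ".");

        // Use a unique name so concurrent crawl threads do not collide.
        File tempFile = File.createTempFile("crawl", ".txt", new File(webAppRoot));

        Writer writer = new FileWriter(tempFile);
        try {
            writer.write(indexableContent);
        } finally {
            writer.close();
        }
        return tempFile;
    }
}
```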
Pass back URLs via the EDK's DocumentMetadata class that point to the servlet(s).
UseDocFetch: Set UseDocFetch to False.
IndexingURL: Set the IndexingURL to the endpoint/servlet that provides the indexable version of the file.
ClickThroughURL: Set the ClickThroughURL to the endpoint/servlet that provides the path to be used when a user clicks through to view the file. During the crawl, the ClickThroughURL value is stored in the associated Knowledge Directory document. (For details, see Click-Through below.)
Add the temporary file path to the query string, along with the content type. Make sure to URLEncode both.
In the indexing servlet, get the file path and content type from the query string. Get the file name from the file path.
Set the content type header using the content type supplied on the query string, and set the Content-Disposition header.
Stream the file (binary or text) or write out the file (text) in a try-catch block.
In the finally block, delete the file.
The simple example below indexes a text file.
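The sketch below shows one way such a servlet could look. The query string parameter names (filePath, contentType) are illustrative and must match whatever the crawler placed on the IndexingURL.

```java
// Sketch: a simple indexing servlet for the temporary-file approach. It reads
// the temporary file path and content type from the query string, streams the
// text back, and deletes the temporary file in the finally block.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class TempFileIndexServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        String filePath = request.getParameter("filePath");
        String contentType = request.getParameter("contentType");
        File file = new File(filePath);

        // Set the content type and file name using the appropriate headers.
        response.setContentType(contentType);
        response.setHeader("Content-Disposition",
                "inline; filename=\"" + file.getName() + "\"");

        BufferedReader reader = null;
        try {
            reader = new BufferedReader(new FileReader(file));
            PrintWriter out = response.getWriter();
            String line;
            while ((line = reader.readLine()) != null) {
                out.println(line);
            }
        } finally {
            if (reader != null) {
                reader.close();
            }
            // Delete the temporary file once it has been indexed.
            file.delete();
        }
    }
}
```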
After a repository is crawled and files are indexed in the portal, users must be able to access the file from within the portal by clicking a link; this is the "click-through" step. Click-through retrieves the crawled file over HTTP to be displayed to the user. To retrieve documents that are not available via a public URL, you can write your own code or use the DocFetch mechanism in the EDK. If you handle document retrieval, you can also implement custom caching or error handling. Click-through links are gatewayed, so the crawler can leverage user credentials and other preferences. (For details, see Security Options below.)
Note: The example below uses a file, but the resource could be any type of content. The click-through servlet must return the content in a readable format and set the content type and file name using the appropriate headers. If the content is not in a file, create a representation with as little extraneous information as possible in a temporary file (for example, for a database, you would retrieve the record and transform it to HTML).
Create the clickThroughServlet, and add a mapping in web.xml.
Complete the implementation of IDocument.getMetaData. Set the ClickThroughURL value to a URL constructed using the following steps:
Construct the base URL of the application using the same approach as in the index servlet (as shown in the Temporary Files sample code).
Add the servlet mapping to the clickThroughServlet.
Add any query string parameters required to access the document from the clickThroughServlet (aspx page). Remember: The click-through page will have access to data source parameters (as administrative preferences), but no access to crawler settings.
To authenticate to the back-end resource, you can use Basic Auth, User Preferences, User Info, or credentials from the data source. Below are suggestions for each (a Basic Auth sketch follows the list); security will need to be tailored to your crawler.
Use Basic Auth when the back-end resource accepts the same credentials used to log in to the portal. For example, if the portal uses AD credentials, Basic Auth could be used to access NT files.
Use (encrypted) User Preferences if the authentication source is different from the one used to log in to the portal, for example, if the portal login uses IPlanet but you need to access an NT or Documentum file.
Use (encrypted) User Info if the encrypted credentials are stored in another profile source and imported using a profile job.
Use data source credentials when there is a limited set of connections, e.g., with a database.
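As an illustration of the first option, the click-through servlet can read the gatewayed Basic Auth header and reuse those credentials against the back end. This is a sketch only; it assumes the remote server is configured to send the user's basic authentication headers (see Security Options below) and uses Apache Commons Codec for Base64 decoding, though any decoder will do.

```java
// Sketch: reusing the gatewayed Basic Auth header to authenticate against the
// back-end resource. Assumes the remote server forwards the user's basic
// authentication header.
import java.io.UnsupportedEncodingException;
import javax.servlet.http.HttpServletRequest;
import org.apache.commons.codec.binary.Base64;

public class BasicAuthHelper {

    /** Returns {user, password}, or null if no usable Basic Auth header is present. */
    public static String[] getCredentials(HttpServletRequest request)
            throws UnsupportedEncodingException {
        String header = request.getHeader("Authorization");
        if (header == null || !header.startsWith("Basic ")) {
            return null;
        }
        byte[] decoded = Base64.decodeBase64(
                header.substring("Basic ".length()).getBytes("UTF-8"));
        String credentials = new String(decoded, "UTF-8");
        int colon = credentials.indexOf(':');
        if (colon < 0) {
            return null;
        }
        return new String[] { credentials.substring(0, colon),
                              credentials.substring(colon + 1) };
    }
}
```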
Extract the parameters from the query string as required.
Display the page.
If there is already an HTML representation of the page (e.g., for Outlook or Notes), authenticate to the page. If the site is using Basic Auth, and you are using Basic Auth headers, simply redirect to that page. If the site is using Basic Auth, and you are not using Basic Auth, users must log in, unless that site and the portal are using the same SSO solution. If the site is using form-based authentication, post to the site and follow the redirect.
If there is not an HTML representation of the page, retrieve the resource and stream it out to the client as shown in the sample code below (Java). If you use a temporary file, put the code in a try-catch-finally block, and delete the file in the finally block.
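The sample below is a sketch of such a click-through servlet (Java). The class name, its assumed web.xml mapping to /clickThroughServlet, and the query string parameter names are illustrative; the location and content type are assumed to have been placed on the ClickThroughURL during the crawl.

```java
// Sketch: a click-through servlet that retrieves the back-end resource and
// streams it to the browser in a readable format. Assumed to be mapped in
// web.xml to /clickThroughServlet; parameter names are illustrative.
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class ClickThroughServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        String location = request.getParameter("location");
        String contentType = request.getParameter("contentType");
        String fileName = location.substring(
                Math.max(location.lastIndexOf('\\'), location.lastIndexOf('/')) + 1);

        InputStream in = null;
        try {
            // Fails here if the path is bad, before the response is committed,
            // so custom error handling is still possible.
            in = new FileInputStream(location);

            // "inline" lets the browser render the file; use "attachment" to force a download.
            response.setContentType(contentType);
            response.setHeader("Content-Disposition",
                    "inline; filename=\"" + fileName + "\"");

            OutputStream out = response.getOutputStream();
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } catch (IOException e) {
            if (!response.isCommitted()) {
                response.sendError(HttpServletResponse.SC_NOT_FOUND,
                        "Unable to retrieve the requested document.");
            }
        } finally {
            if (in != null) {
                in.close();
            }
            // If a temporary file was created for this request, delete it here.
        }
    }
}
```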
DocFetch is one way to retrieve files that are not accessible via a public URL. If the crawler's DocumentMetaData object sets UseDocFetch to True, the EDK sets ClickThroughURL to the URL of the DocFetch servlet and calls IDocument.GetDocument to get the file path to the indexable version of the document (this can be the same version used for click-through). The EDK copies the file at the returned path to a temporary file in a writable, Web-accessible location and returns the URL to that temporary file. When the crawler cleans up, the EDK removes this temporary file and issues a cleanup call to your code.
For DocFetch, there are three relevant fields in the DocumentMetaData object returned in the portal's call to IDocument.GetMetaData:
UseDocFetch: Set UseDocFetch to True. You must implement GetDocument to retrieve the file.
File Name: Set the File Name to the name of the file in the repository (must be unique).
Content Type: Set the Content Type to the content type for the file. The content type must be mapped to a supported Document Type in the portal. For details, see the introduction.
To use DocFetch, you must set UseDocFetch to True and implement the GetDocument method in the Crawler.IDocument and Docfetch.IDocFetch interfaces to return the appropriate file path. In most cases, the file cannot be accessed directly in the repository, so you must serialize a binary representation to a temporary disk file and return that path. The EDK provides a cleanup call so you can delete these temporary files later. The IDocument and IDocFetch implementations should use the same process.
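The EDK interface signatures are not reproduced here; the sketch below only shows the serialization step that both GetDocument implementations can share: writing the back-end content to a uniquely named temporary file and returning its path. The directory to write to is passed in as an assumption.

```java
// Sketch: the shared serialization step behind Crawler.IDocument.GetDocument and
// Docfetch.IDocFetch.GetDocument when UseDocFetch is True. The EDK interfaces
// themselves are omitted; this helper only writes the content to a uniquely
// named temporary file and returns the path GetDocument should return.
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class DocFetchHelper {

    public static String serializeToTempFile(InputStream content, String prefix,
                                              File tempDirectory) throws IOException {
        // The file name must be unique; otherwise cleanup can delete copies that
        // are still being served to other users (see the IMPORTANT note below).
        File tempFile = File.createTempFile(prefix, ".tmp", tempDirectory);

        OutputStream out = new FileOutputStream(tempFile);
        try {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = content.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } finally {
            out.close();
            content.close();
        }
        return tempFile.getAbsolutePath();
    }
}
```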
DocFetch also allows you to implement user-level access control. You can choose to pass user preferences or User Information to the crawler; this information can be used by DocFetch to authenticate with the back-end system or limit access to specific users. To use user preferences or User Information, you must configure the specific settings you need in the Crawler Web Service editor.
The Plumtree.Remote.Docfetch package/namespace includes the following interfaces:
IDocFetchProvider
IDocFetch
DocFetch interfaces are called in the following order. For a complete listing of interfaces, classes, and methods, see the EDK API documentation.
IDocFetchProvider.Initialize. Uses the DataSourceInfo, UserPrefs, and UserInfo returned from the portal to initialize the DocFetch provider (i.e., make a connection to the back-end system and create a new session). The implementation should initialize in a similar manner to IDocumentProvider.Initialize. The advantage of IDocFetchProvider is that you can use UserInfo and UserPrefs to perform additional authentication. The ICrawlerLog object is not available. Note: Sessions can get dropped. Keep a variable that can be used to ensure the session is still initialized; if it is not, throw NotInitializedException (a sketch of this check follows the list).
IDocFetchProvider.AttachToDocument using the authentication information provided (including UserPrefs and UserInfo).
IDocFetch.GetMetaData. The only DocumentMetadata fields required for click-through are the file name and content type.
IDocFetch.GetDocument. The EDK looks in web.config/*.wsdd to get the file path and URL to the directory for creating temporary files.
IDocFetchProvider.Shutdown (optional).
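The session check called out in the IDocFetchProvider.Initialize note above can be sketched as follows. The session object and the exception thrown are stand-ins: in a real provider, the session comes from your repository client library and the exception is the EDK's NotInitializedException.

```java
// Sketch: tracking whether the back-end session created in Initialize is still
// valid, and signalling the portal to re-initialize when it is not.
public class DocFetchProviderSketch {

    private Object backEndSession;   // stand-in for your repository client session

    /** Corresponds to IDocFetchProvider.Initialize: open the back-end session. */
    public void initialize(String userName, String password) {
        backEndSession = openBackEndSession(userName, password);
    }

    /** Corresponds to IDocFetchProvider.AttachToDocument: check the session first. */
    public void attachToDocument(String documentLocation) {
        checkInitialized();
        // ... locate the document in the back-end system ...
    }

    private void checkInitialized() {
        if (backEndSession == null) {
            // In a real provider, throw the EDK's NotInitializedException so the
            // portal knows to call Initialize again before retrying.
            throw new IllegalStateException("Session is no longer initialized");
        }
    }

    private Object openBackEndSession(String userName, String password) {
        // Stand-in for the repository login call.
        return new Object();
    }
}
```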
Click-through should reuse code from the already-written crawler service code. The URL to the DocFetch servlet is stored in the Knowledge Directory document during crawling, as explained earlier. When a user subsequently clicks on a crawled document, the request to the servlet makes several calls to the already-implemented crawler code. This ultimately results in GetDocument being called again, but this time as part of the docfetch.IDocFetch interface. The file path returned by GetDocument is opened by the servlet and streamed back in the response. After the streaming is done, a cleanup call is issued so you can delete any temporary resources if necessary.
IMPORTANT: If GetDocument returns a path to a file (not a URL to a publicly accessible file), the file name must be unique. Otherwise, all copies of the file are removed during cleanup, including copies that are currently in use by other users.
If you need to apply credentials to access a file, you can use any of the following options:
SSO: This approach requires that you have set up SSO on the portal and on the remote server, using the instructions of your SSO vendor. For details on importing file security in a crawl, see Configuring Custom Crawlers: Importing File Security.
Basic authentication: Set the remote server to pass the user’s basic authentication headers to the remote resource. This approach works only if both sources are using the same directory. For example, if a user logs in using an IPlanet directory, it is unlikely they will be able to access an Exchange resource. For details on importing file security in a crawl, see Configuring Custom Crawlers: Importing File Security.
Data Source credentials: This approach is generally valid only for crawling a database. Most other use cases require user-specific credentials.
User preferences via form-based authentication: You can use preferences stored in the Plumtree database to create a cookie if the resource accepts session-based authentication. User preferences generally cannot be used if the resource expects basic authentication. For example, the Notes CWS uses this approach when Notes is using session-based (i.e., cookie) authentication. You must enter all User settings and User Information required by a crawler on the Preferences page of the Crawler Web Service Editor.
Force users to log in: If the credentials you need are not available, you must redirect the user to the appropriate page and/or provide an intelligible error message. For example, the Notes CWS uses this approach when Notes is using basic authentication.