Developing Portlets and Integration Web Services: Crawlers and Search Services

Accessing Secured Content

If it is not trivial to return a public URL to a file in the back-end repository, you must implement code to retrieve the file. As noted in the introduction, crawlers must access a file twice.

  1. First, the crawler must retrieve metadata and index documents in the data repository. Files are indexed based on metadata and full-text content. If the content is not accessible via a URL or cannot be indexed for another reason, you must implement a custom document fetching mechanism (for example, a servlet or aspx page) that returns an indexable version of the file.

  2. Second, the crawler must retrieve individual documents on demand through the portal Knowledge Directory, enforcing any user-level access restrictions. When a user retrieves a file, it must be displayed in a readable format.

One option is to use the DocFetch mechanism in the EDK. Note: DocFetch does not allow you to use multiple methods of authentication or implement custom error handling. If you cannot use public URLs and are not using DocFetch, you must implement a custom document fetching mechanism (for example, a servlet or aspx page). If necessary, you can implement separate servlets for indexing and click-through.

Indexing

As noted above, a crawler must return an indexable version of each file to be included in the portal Knowledge Directory. For files, you can stream content directly from the source directory. If the content is not in a file, create a representation with as little extraneous information as possible in a temporary file. The servlet/aspx page must return the content in an indexable format and set the content type and file name using the appropriate headers.

Any information required to retrieve the document must be included in the query string of the index URL, including credentials (if needed).

Note: The request to the indexing servlet is a simple HTTP GET. This call is not gatewayed, so the indexing servlet does not have access to the data source, user credentials or preferences, or anything else available through the EDK.

Streaming Content

If the content being crawled is in a file, you can stream the content directly from the source directory. (For information on using temporary files, see the next section.) The servlet/aspx page must return the content in an indexable format and set the content type and file name using the appropriate headers. A typical custom mechanism involves the following steps (a minimal servlet sketch follows the list):

  1. In IDocument, get all the variables needed to access the document, and add them to the query string of the indexing servlet. This could be as simple as a UNC path for a file crawler, or as complicated as server name, database name, schema, table, primary key(s) and primary key value(s) for a database record. It depends entirely on the crawler and the document being crawled. Make sure all values are URLEncoded.

  2. Add the content type to the query string.

  3. In IDocument, add URLEncoded credentials to the query string. Keep in mind that URLEncoding the credentials will turn a "+" into a space, which must be turned back into a "+" in the indexing servlet.

  4. Pass back URLs via the EDK's DocumentMetadata class that point to the servlet(s).  

  5. In the indexing servlet, get the location string and content type from the query string and parse the location string to get the path to the resource.

  6. Obtain the resource.

  7. Set the content type header using the supplied content type and set the Content-Disposition header.

  8. Stream the file (binary or text) or write out the file (text) in a try-catch block.
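
The sketch below is a minimal indexing servlet illustrating steps 5 through 8. It assumes the crawler placed a URL-encoded file path in a "location" parameter and a MIME type in a "contentType" parameter; the parameter names, the servlet class name, and the assumption that the location is a plain file path are illustrative only.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.OutputStream;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class IndexServlet extends HttpServlet
{
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
        throws ServletException, IOException
    {
        //the servlet container URL-decodes query parameters automatically
        String location = request.getParameter("location");
        String contentType = request.getParameter("contentType");
        File file = new File(location);

        //set the content type and file name using the appropriate headers
        response.setContentType(contentType);
        response.setHeader("Content-Disposition", "inline; filename=" + file.getName());

        //stream the file in a try-catch block
        FileInputStream in = null;
        try
        {
            in = new FileInputStream(file);
            OutputStream out = response.getOutputStream();
            byte[] buf = new byte[40 * 1024];
            int bytesRead;
            while ((bytesRead = in.read(buf)) != -1)
            {
                out.write(buf, 0, bytesRead);
            }
        }
        catch (IOException e)
        {
            getServletContext().log("Failed to stream file for indexing: " + location, e);
        }
        finally
        {
            if (in != null) { in.close(); }
        }
    }
}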

Temporary Files

As noted above, if the content is not in a file, create a representation with as little extraneous information as possible in a temporary file. The servlet/aspx page must return the content in an indexable format and set the content type and file name using the appropriate headers. A typical custom mechanism involves the following steps (a sketch of the crawler-side steps follows the list):

  1. In IDocument, write a temporary file to a publicly accessible location (for example, the root directory of the Web application, as shown in the code snippet below).

  2. Construct the base URL of the Web application, for example:

            MessageContext context = MessageContext.getCurrentContext();
            HttpServletRequest req = (HttpServletRequest)context.getProperty(HTTPConstants.MC_HTTP_SERVLETREQUEST);
            StringBuffer buff = new StringBuffer();
            buff.append(req.getScheme()).append("://").append(req.getServerName())
                    .append(":").append(req.getServerPort()).append(req.getContextPath());
            String indexRoot = buff.toString();

  3. Pass back URLs via the EDK's DocumentMetadata class that point to the servlet(s).  

  4. Add the temporary file path to the query string, along with the content type. Make sure to URLEncode both.

  5. In the indexing servlet, get the file path and content type from the query string. Get the file name from the file path.

  6. Set the content type header using the supplied content type, and set the Content-Disposition header.

  7. Stream the file (binary or text) or write out the file (text) in a try-catch block.

  8. In the finally block, delete the file.
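
The helper below sketches the crawler side of this process (steps 1 and 4), assuming the base URL (indexRoot) was built as shown in step 2 and that the indexing servlet is mapped to /indexServlet; the class name, method name, parameter names, and temporary-file location are illustrative only.

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.net.URLEncoder;

public class IndexUrlBuilder
{
    //webAppRootDir is the file-system path of the publicly accessible location;
    //indexRoot is the base URL built as shown in step 2
    public static String buildIndexUrl(String indexableContent, String webAppRootDir,
                                       String indexRoot) throws IOException
    {
        //step 1: write the representation to a temporary file in a publicly accessible location
        File tempFile = File.createTempFile("crawldoc", ".txt", new File(webAppRootDir));
        Writer writer = new FileWriter(tempFile);
        try
        {
            writer.write(indexableContent);
        }
        finally
        {
            writer.close();
        }

        //step 4: add the URL-encoded file path and content type to the query string
        return indexRoot + "/indexServlet"
            + "?indexFile=" + URLEncoder.encode(tempFile.getAbsolutePath(), "UTF-8")
            + "&contentType=" + URLEncoder.encode("text/plain", "UTF-8");
    }
}

The resulting URL is what the crawler passes back in the DocumentMetadata object (step 3).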

The simple example below shows the servlet/aspx side, indexing a text file.

logger.Debug("Entering Index.Page_Load()");

// try to get the .tmp filename from the CWS
string indexFileName = Request[Constants.INDEX_FILE];

if (indexFileName != null)
{
    StreamReader sr = null;
    string filePath = "";
    try
    {
        filePath = HttpUtility.UrlDecode(indexFileName);
        string shortFileName = filePath.Substring(filePath.LastIndexOf("\\") + 1);

        // set the proper response headers
        Response.ContentType = "text/plain";
        Response.AddHeader("Content-Disposition", "inline; filename=" + shortFileName);

        // open the file
        sr = new StreamReader(filePath);

        // stream out the information into the response
        string line = sr.ReadLine();
        while (line != null)
        {
            Response.Output.WriteLine(line);
            line = sr.ReadLine();
        }
    }
    catch (Exception ex)
    {
        logger.Error("Exception while trying to write index file: " + ex.Message, ex);
    }
    finally
    {
        // close and delete the temporary index file even if there is an error
        if (sr != null) { sr.Close(); }
        if (!filePath.Equals("")) { File.Delete(filePath); }
    }

    // done
    return;
}
...

 

Click-Through

After a repository is crawled and files are indexed in the portal, users must be able to access the file from within the portal by clicking a link; this is the "click-through" step. Click-through retrieves the crawled file over HTTP to be displayed to the user. To retrieve documents that are not available via a public URL, you can write your own code or use the DocFetch mechanism in the EDK. If you handle document retrieval, you can also implement custom caching or error handling. Click-through links are gatewayed, so the crawler can leverage user credentials and other preferences. (For details, see Security Options below.)

Note: The example below uses a file, but the resource could be any type of content. The click-through servlet must return the content in a readable format and set the content type and file name using the appropriate headers. If the content is not in a file, create a representation with as little extraneous information as possible in a temporary file (for example, for a database, you would retrieve the record and transform it to HTML).

  1. Create the clickThroughServlet, and add a mapping in web.xml.

  2. Complete the implementation of IDocument.getMetaData. Set the ClickThroughURL value to a URL constructed using the following steps (see the sketch after this list):

    1. Construct the base URL of the application using the same approach as in the index servlet (as shown in the Temporary Files sample code).

    2. Append the servlet mapping for the clickThroughServlet.

    3. Add any query string parameters required to access the document from the clickThroughServlet (aspx page). Remember: The click-through page will have access to data source parameters (as administrative preferences), but no access to crawler settings.

  3. To authenticate to the back-end resource, you can use Basic Auth, user preferences, User Info, or credentials from the data source (see Security Options below). Security must be tailored to your crawler.

  4. Extract the parameters from the query string as required.

  5. Display the page.
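
As a sketch of step 2, the helper below shows one way to construct the ClickThroughURL, assuming the base URL was built as in the Temporary Files sample and the servlet is mapped to /clickThroughServlet in web.xml. The class and method names are illustrative; the parameter names match those read by the servlet code that follows.

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class ClickThroughUrlBuilder
{
    public static String buildClickThroughUrl(String baseURL, String filePath,
                                              String contentType, String fileName)
        throws UnsupportedEncodingException
    {
        //step 2.2: append the servlet mapping; step 2.3: append the query string parameters
        return baseURL + "/clickThroughServlet"
            + "?filePath=" + URLEncoder.encode(filePath, "UTF-8")
            + "&contentType=" + URLEncoder.encode(contentType, "UTF-8")
            + "&filename=" + URLEncoder.encode(fileName, "UTF-8");
    }
}

In the click-through servlet itself, the code below extracts these parameters, sets the response headers, and streams the file back through the gateway.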

//get the content type, passed as a query string parameter
String contentType = request.getParameter("contentType");

//if this is a file, get the file name
String filename = request.getParameter("filename");

//set the content type on the response
response.setContentType(contentType);

//set the content disposition header to tell the browser the file name
response.setHeader("Content-Disposition", "inline; filename=" + filename);

//set the header that tells the gateway to stream this through the gateway
response.setHeader("PTGW-Streaming", "Yes");

//get the content - for a file, get a file input stream based on the path (shown below)
//other repositories may simply provide an input stream

//NOTE: this code contains no error checking

String filePath = request.getParameter("filePath");
File file = new File(filePath);
FileInputStream fileStream = new FileInputStream(file);

//create a byte buffer for reading the file in 40k chunks
int BUFFER_SIZE = 40 * 1024;
byte[] buf = new byte[BUFFER_SIZE];

//write out the body, continuing until the input stream returns -1
ServletOutputStream out = response.getOutputStream();
int bytesRead;
while ((bytesRead = fileStream.read(buf)) != -1)
{
    out.write(buf, 0, bytesRead);
}

//close the input stream
fileStream.close();

Using DocFetch

DocFetch is one way to retrieve files that are not accessible via a public URL. If the crawler's DocumentMetaData object sets UseDocFetch to True, the EDK sets ClickThroughURL to the URL of the DocFetch servlet and calls IDocument.GetDocument to get the file path to the indexable version of the document (this can be the same version used for click-through). The EDK copies the file at the returned path to a temporary file in a writable, Web-accessible location and returns the URL to that temporary file. When the crawler is cleaning up, the EDK removes this temporary file and issues a cleanup call to your code.

Three fields in the DocumentMetaData object returned by the portal's call to IDocument.GetMetaData are relevant to DocFetch.

To use DocFetch, you must set UseDocFetch to True and implement the GetDocument method in the Crawler.IDocument and Docfetch.IDocFetch interfaces to return the appropriate file path. In most cases, the content in the repository is not directly accessible as a file, so you must serialize the binary representation to a temporary disk file and return that path. The EDK provides a cleanup call so these temporary files can be deleted later. The IDocument and IDocFetch implementations should use the same process.
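
The EDK method signatures are not reproduced here; the sketch below only shows the kind of temporary-file handling a GetDocument implementation typically delegates to. The class and method names are illustrative, and File.createTempFile is used to provide the unique file name required by the cleanup behavior described later in this section.

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class DocFetchHelper
{
    //serialize the document's bytes to a uniquely named temporary file and return the path
    public static String writeToTempFile(InputStream documentStream, File tempDir)
        throws IOException
    {
        //a unique name per request prevents the cleanup call from deleting a copy
        //that another user is still reading
        File tempFile = File.createTempFile("docfetch", ".bin", tempDir);
        OutputStream out = new FileOutputStream(tempFile);
        try
        {
            byte[] buf = new byte[40 * 1024];
            int bytesRead;
            while ((bytesRead = documentStream.read(buf)) != -1)
            {
                out.write(buf, 0, bytesRead);
            }
        }
        finally
        {
            out.close();
            documentStream.close();
        }
        return tempFile.getAbsolutePath();
    }
}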

DocFetch also allows you to implement user-level access control. You can choose to pass user preferences or User Information to the crawler; this information can be used by DocFetch to authenticate with the back-end system or limit access to specific users. To use user preferences or User Information, you must configure the specific settings you need in the Crawler Web Service editor.

The Plumtree.Remote.Docfetch package/namespace includes the IDocFetch and IDocFetchProvider interfaces.

DocFetch interfaces are called in the following order. For a complete listing of interfaces, classes, and methods, see the EDK API documentation.

  1. IDocFetchProvider.Initialize. Uses the DataSourceInfo, UserPrefs, and UserInfo returned from the portal to initialize the DocFetch provider (for example, make a connection to the back-end system and create a new session). The implementation should initialize in a similar manner to IDocumentProvider.Initialize. The advantage of IDocFetchProvider is that you can use UserInfo and UserPrefs to perform additional authentication. The ICrawlerLog object is not available. Note: Sessions can get dropped. Keep a variable that can be used to ensure the session is still initialized; if it is not, throw NotInitializedException.

  2. IDocFetchProvider.AttachToDocument using the authentication information provided (including UserPrefs and UserInfo).

  3. IDocFetchProvider.Shutdown (optional).

Click-through should reuse the already-written crawler service code. The URL to the DocFetch servlet is stored in the Knowledge Directory document during crawling, as explained earlier. When a user subsequently clicks a crawled document, the request to the servlet makes several calls to the already-implemented crawler code. This ultimately results in GetDocument being called again, but this time as part of the docfetch.IDocFetch interface. The file path returned by GetDocument is opened by the servlet and streamed back in the response. After the streaming is done, a cleanup call is issued so you can delete any temporary resources if necessary.

IMPORTANT: If GetDocument returns a path to a file (not a URL to a publicly accessible file), the file name must be unique. Otherwise, all copies of the file are removed during cleanup, including copies that are currently in use by other users.

Security Options

If you need to apply credentials to access a file, you can use Basic Auth, user preferences, User Info, or credentials stored in the data source, as noted in the Click-Through steps above.
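
If, for example, the back-end accepts the same credentials that are forwarded via Basic Auth, the click-through servlet can read them from the Authorization header, as in the minimal sketch below. This uses only the standard servlet API and java.util.Base64; no EDK calls are involved, and the class and method names are illustrative.

import java.io.UnsupportedEncodingException;
import java.util.Base64;
import javax.servlet.http.HttpServletRequest;

public class BasicAuthHelper
{
    //returns {username, password} from a Basic Auth header, or null if none is present
    public static String[] getCredentials(HttpServletRequest request)
        throws UnsupportedEncodingException
    {
        String auth = request.getHeader("Authorization");
        if (auth == null || !auth.startsWith("Basic "))
        {
            return null;
        }
        //the credentials are Base64-encoded as "username:password"
        byte[] decoded = Base64.getDecoder().decode(auth.substring("Basic ".length()));
        String pair = new String(decoded, "UTF-8");
        int colon = pair.indexOf(':');
        if (colon < 0)
        {
            return null;
        }
        return new String[] { pair.substring(0, colon), pair.substring(colon + 1) };
    }
}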

Next: Logging and Troubleshooting