Creating Temporary Files for Indexing

If crawled content cannot be indexed as-is, the crawler code must create a temporary file for indexing.

The following steps describe a typical custom mechanism that creates a temporary indexable file containing as little extraneous information as possible and sets the content type and file name using the appropriate headers. In most cases, the resource has already been accessed in AttachToDocument, so there is no need to call the back-end system again. This example does not use credentials. If you do not want to create temporary files, you can instead implement an indexing servlet that returns indexable content.

In IDocument, write a temporary file to a publicly accessible location (usually the root directory of the Web application as shown in the code snippet below).

// Build the base URL of this Web application from the current request.
MessageContext context = MessageContext.getCurrentContext();
HttpServletRequest req = (HttpServletRequest) context.getProperty(HTTPConstants.MC_HTTP_SERVLETREQUEST);
StringBuffer buff = new StringBuffer();
buff.append(req.getScheme()).append("://").append(req.getServerName())
    .append(':').append(req.getServerPort()).append(req.getContextPath());
String indexRoot = buff.toString();

Pass back URLs that point to the servlet(s) via the IDK's DocumentMetadata class:
- UseDocFetch: Set UseDocFetch to False.
- IndexingURL: Set the IndexingURL to the endpoint/servlet that provides the indexable version of the file, including the query string arguments defined in steps 1-3 above.
- ClickThroughURL: Set the ClickThroughURL to the endpoint/servlet that provides the path to be used when a user clicks through to view the file. During the crawl, the ClickThroughURL value is stored in the associated Knowledge Directory document.

Add the temporary file path to the query string, along with the content type. URL-encode both values.

In the indexing servlet, get the file path and content type from the query string. Get the file name from the file path.

Set the Content-Type header and the Content-Disposition header.

Stream the file (binary or text) or write out the file (text) in a try-catch block.

In the finally block, delete the file.
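As an illustration of the query string step, the following minimal Java sketch appends a URL-encoded temporary file path and content type to an indexing URL. The servlet path (/index) and the parameter names (indexFile, contentType) are hypothetical placeholders, not names defined by the IDK; substitute whatever your indexing servlet expects.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Sketch only: the servlet path and parameter names are hypothetical.
final class IndexingUrlSketch {
    static final String INDEX_SERVLET = "/index";    // hypothetical servlet mapping
    static final String PARAM_FILE = "indexFile";    // hypothetical parameter name
    static final String PARAM_TYPE = "contentType";  // hypothetical parameter name

    // URL-encodes a single query string value as UTF-8.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // Reverses encode(); shows what the indexing servlet would recover.
    static String decode(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Appends the encoded file path and content type to the indexing URL.
    static String buildIndexingUrl(String indexRoot, String tempFilePath, String contentType) {
        return indexRoot + INDEX_SERVLET
                + "?" + PARAM_FILE + "=" + encode(tempFilePath)
                + "&" + PARAM_TYPE + "=" + encode(contentType);
    }

    public static void main(String[] args) {
        String url = buildIndexingUrl("http://localhost:8080/crawler",
                "C:\\temp\\doc 1.tmp", "text/plain");
        System.out.println(url);
    }
}
```

The resulting string is the kind of value you would pass back as the IndexingURL. Encoding matters because a file path typically contains characters such as backslashes, colons, and spaces that are not safe in a query string.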

The following sample code indexes a text file.

logger.Debug("Entering Index.Page_Load()");

// try to get the .tmp filename from the Content Crawler
string indexFileName = Request[Constants.INDEX_FILE];
if (indexFileName != null)
{
    StreamReader sr = null;
    string filePath = "";
    try
    {
        filePath = HttpUtility.UrlDecode(indexFileName);
        string shortFileName = filePath.Substring(filePath.LastIndexOf('\\') + 1);

        // set the proper response headers
        Response.ContentType = "text/plain";
        Response.AddHeader("Content-Disposition", "inline; filename=" + shortFileName);

        // open the file
        sr = new StreamReader(filePath);

        // stream out the information into the response
        string line = sr.ReadLine();
        while (line != null)
        {
            Response.Output.WriteLine(line);
            line = sr.ReadLine();
        }
    }
    catch (Exception ex)
    {
        logger.Error("Exception while trying to write index file: " + ex.Message, ex);
    }
    finally
    {
        // close and delete the temporary index file even if there is an error
        if (sr != null) { sr.Close(); }
        if (!filePath.Equals("")) { File.Delete(filePath); }
    }
    // done
    return;
}
...
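
For a Java-based crawler, the same stream-then-delete pattern might look like the sketch below. It is written against a plain OutputStream so that it runs on its own; in an actual indexing servlet you would presumably pass response.getOutputStream() and first set the headers with response.setContentType(...) and response.setHeader("Content-Disposition", ...). The class and method names are hypothetical, and unlike the sample above, this version lets I/O errors propagate instead of logging them.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch only: class and method names are hypothetical.
final class TempIndexFileSketch {

    // Returns the last path segment, accepting '/' or '\' as the separator.
    static String shortFileName(String filePath) {
        int cut = Math.max(filePath.lastIndexOf('/'), filePath.lastIndexOf('\\'));
        return filePath.substring(cut + 1);
    }

    // Copies the temporary file to the output stream, then deletes the file
    // even if the copy fails.
    static void streamAndDelete(Path tempFile, OutputStream out) throws IOException {
        try {
            Files.copy(tempFile, out);
            out.flush();
        } finally {
            Files.deleteIfExists(tempFile);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the temporary file written during the crawl.
        Path temp = Files.createTempFile("index", ".tmp");
        Files.write(temp, "indexable text".getBytes(StandardCharsets.UTF_8));

        // Stand-in for the servlet response stream.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        streamAndDelete(temp, out);

        System.out.println("Served as: " + shortFileName(temp.toString()));
        System.out.println("Body: " + new String(out.toByteArray(), StandardCharsets.UTF_8));
        System.out.println("Temp file still exists: " + Files.exists(temp));
    }
}
```

Because the copy is byte-based, the same method works for binary and text files. Placing the delete in the finally block means a failed request does not leave temporary files behind in the publicly accessible location.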

AquaLogic User Interaction Development Guide
