Developing Portlets and Integration Web Services: Crawlers and Search Services  

Logging and Troubleshooting Custom Crawlers

As noted on the previous pages, logging is an important component of any successful crawler. Logging allows you to track progress and find problems. This page provides basic information on logging options and an FAQ on crawler development.

Logging

In most implementations, Log4j or Log4net is the best choice for logging. The EDK's ICrawlerLog object is more efficient and useful than PTSpy or a SOAP trace, but it only captures messages from ContainerProvider.AttachToContainer and standard exceptions.

Handling Exceptions

Crawler code should handle exceptions. With few exceptions, most calls should be wrapped in a try-catch block, and the scope of each block should be small enough to make errors easy to diagnose. In the catch block, log the error to both Log4j/Log4net and the ICrawlerLog, then re-throw the exception as a ServiceException; this causes the error to appear in the job log. Only the error message shows up there, so check the Log4j/Log4net log for the full stack trace. Two exceptions have special meaning: NotInitializedException causes the crawler to re-initialize, and NoLongerExistsException causes the resource to be removed from the Knowledge Directory (see the FAQ below). This pattern is shown in the sketch below.

To use ICrawlerLog, store it as a member variable in your implementation of IContainerProvider during initialize. To send a log message, add a line such as: m_logger.Log("enter logging message here")
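The following minimal sketch puts these pieces together for a Java crawler: the ICrawlerLog is stored during initialize, and a backend call is wrapped in the try-catch pattern described above, logging to both Log4j and the ICrawlerLog before re-throwing a ServiceException. The EDK type and method names are used as described on this page; exact signatures (for example, how initialize receives the ICrawlerLog, or the ServiceException constructor) vary by EDK version and language, so treat those details as assumptions.

import org.apache.log4j.Logger;

// Sketch only: ICrawlerLog, ServiceException, and IContainerProvider are the EDK
// types discussed on this page; how initialize() receives the ICrawlerLog and the
// ServiceException constructor shown here are assumptions.
public class MyContainerProvider /* implements IContainerProvider */ {

    private static final Logger log = Logger.getLogger(MyContainerProvider.class);
    private ICrawlerLog m_logger;

    public void initialize(ICrawlerLog crawlerLog /* plus EDK-supplied settings */) {
        // Store the crawler log as a member variable for use in later calls.
        m_logger = crawlerLog;
        m_logger.Log("initialize: container provider ready");
    }

    public void attachToContainer(String location) throws ServiceException {
        try {
            openBackendFolder(location);                                  // backend call that can fail
        } catch (Exception e) {
            log.error("attachToContainer failed for " + location, e);     // full stack trace in Log4j
            m_logger.Log("attachToContainer failed: " + e.getMessage());  // message appears in the job log
            throw new ServiceException("attachToContainer failed", e);    // fails the crawl
        }
    }

    private void openBackendFolder(String location) throws Exception {
        // Hypothetical helper representing the call into the back-end repository.
    }
}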

Note: The container provider log records messages only after attachToContainer and after exceptions; the document provider log records messages only after exceptions. For the best visibility, use Log4j/Log4net.

Viewing Logs

If you are viewing the ICrawlerLog, do not assume that every card was imported just because the job is reported as successful. "Successful" generally means there were no catastrophic failures, such as the search server not being started or the crawler being unable to attach to the start node; individual document failures do not fail a job.

If you are viewing logs created by Log4j or Log4net, see the associated documentation for configuration options. Both products let you specify a file location and a rolling log with a maximum file size; a minimal example follows.
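For example, a minimal Log4j 1.x configuration along these lines sets up a rolling file appender (the appender name, file path, and size limits are placeholders); Log4net offers an equivalent RollingFileAppender configured in XML.

# log4j.properties -- rolling file appender for a custom crawler (paths are placeholders)
log4j.rootLogger=INFO, crawlerFile
log4j.appender.crawlerFile=org.apache.log4j.RollingFileAppender
log4j.appender.crawlerFile.File=/opt/crawler/logs/crawler.log
log4j.appender.crawlerFile.MaxFileSize=5MB
log4j.appender.crawlerFile.MaxBackupIndex=5
log4j.appender.crawlerFile.layout=org.apache.log4j.PatternLayout
log4j.appender.crawlerFile.layout.ConversionPattern=%d %-5p [%c] %m%n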

If you know the location of the log file, it is not difficult to create a servlet or .aspx page that streams it to the browser, along the lines of the sketch below.
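A simple Java servlet along these lines could do this; the class name and log path are hypothetical, and an .aspx handler on the .NET side would follow the same pattern.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical servlet that streams the crawler's Log4j output file to the browser.
// Point LOG_PATH at the file configured in your Log4j appender.
public class CrawlerLogServlet extends HttpServlet {

    private static final String LOG_PATH = "/opt/crawler/logs/crawler.log"; // placeholder

    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        File logFile = new File(LOG_PATH);
        if (!logFile.exists()) {
            response.sendError(HttpServletResponse.SC_NOT_FOUND, "Log file not found");
            return;
        }
        response.setContentType("text/plain");
        response.setContentLength((int) logFile.length());
        InputStream in = new FileInputStream(logFile);
        OutputStream out = response.getOutputStream();
        try {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } finally {
            in.close();
        }
    }
}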

FAQ

This section addresses common questions regarding Crawler development.

Q. What variables do I need to set in a Service Configuration (SCI) page for the Crawler and Data Source?

A. The Crawler object should include the following properties:

In 5.0, there are no required portal variables for the Data Source object. In general, data source settings + document location = the complete path to a resource. For example, a Data Source for a database crawler could store the server and credentials, and the document location would be a string that can be parsed to construct a SQL statement that returns a record (see the sketch below).
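The sketch below illustrates that split for the database example; the location format, column names, and helper class are all hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical sketch of "data source + document location = complete path" for a
// database crawler. The Data Source supplies the server and credentials; the
// document location is a string such as "orders/1042" that is parsed into a table
// name and primary key to build a SQL statement that returns one record.
public class DatabaseRecordFetcher {

    public static String fetchRecord(String jdbcUrl, String user, String password,
                                     String documentLocation) throws SQLException {
        String[] parts = documentLocation.split("/", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("Bad document location: " + documentLocation);
        }
        String table = parts[0]; // validate against a known list of tables in real code
        String key = parts[1];

        String sql = "SELECT content FROM " + table + " WHERE id = ?";
        Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
        try {
            PreparedStatement stmt = conn.prepareStatement(sql);
            stmt.setString(1, key);
            ResultSet rs = stmt.executeQuery();
            return rs.next() ? rs.getString("content") : null;
        } finally {
            conn.close();
        }
    }
}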

Q. What should I return in GetMetaData?

A. Always return the Name and Description. If you are not using DocFetch, return the indexing URL and click-through URL. If you are using DocFetch, implement GetDocument as explained on the previous page.

Q. If I am not using DocFetch, how do I do something similar?

A. To index secured or otherwise inaccessible files, you must stream an indexable version of the file yourself, passing encrypted credentials. For click-through, you must retrieve the backend file and transform it if necessary.

Q. I don’t want to create temporary files for indexing. What are my choices?

A. If you don't want to create temporary files, implement an indexing servlet that returns indexable content directly, along the lines of the sketch below.
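A minimal sketch of such an indexing servlet follows; the parameter names, credential check, and helper methods are all hypothetical. It looks up the document identified in the query string and writes an indexable rendition straight to the response, so no temporary file is needed.

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical indexing servlet: returns an indexable (plain-text) rendition of a
// back-end document instead of writing a temporary file. The "docId" and "token"
// parameters and the helper methods are assumptions for illustration.
public class IndexingServlet extends HttpServlet {

    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        String docId = request.getParameter("docId");
        String token = request.getParameter("token"); // e.g. encrypted credentials

        if (!isAuthorized(token)) {
            response.sendError(HttpServletResponse.SC_FORBIDDEN);
            return;
        }

        response.setContentType("text/plain; charset=UTF-8");
        PrintWriter out = response.getWriter();
        // Transform the back-end document into indexable text on the fly.
        out.write(loadIndexableText(docId));
    }

    private boolean isAuthorized(String token) {
        return token != null && token.length() > 0; // placeholder credential check
    }

    private String loadIndexableText(String docId) {
        return "Indexable text for document " + docId; // placeholder content
    }
}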

Q. What happens when an exception is thrown?

A. If any exception is thrown during the initial attachToContainer, the crawl aborts. If NotInitializedException is thrown, the crawler re-initializes. If NoLongerExistsException is thrown, the resource is removed from the Knowledge Directory and the crawler continues to the next resource. If any other exception is thrown, the error is logged and the crawler continues to the next resource. Because the error information returned to the job log is very limited, use Log4j/Log4net to get the full exception information.

 

Next: Deploying Custom Crawlers