AquaLogic User Interaction Development Guide


Content Crawler Development Tips

These best practices and development tips apply to all content crawler development.

Use logging extensively to provide feedback during a crawl. In some cases, the portal reports a successful crawl even though minor errors occurred. Use Log4J or Log4Net to track progress. For more information, see Logging and Troubleshooting.
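
For example, a minimal Log4J sketch of crawl-progress logging (the class and method names below are illustrative, not part of the IDK):

    import org.apache.log4j.Logger;

    public class CrawlProgress {
        private static final Logger log = Logger.getLogger(CrawlProgress.class);

        // Hypothetical helper called once per document during a crawl.
        public void documentCrawled(String docId, int current, int total) {
            log.info("Crawled document " + docId + " (" + current + " of " + total + ")");
        }

        // Log recoverable problems explicitly; the portal may still report the crawl as successful.
        public void documentFailed(String docId, Exception e) {
            log.error("Failed to crawl document " + docId, e);
        }
    }
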
Use relative URLs in your code to allow migration to another remote server. Note that these URLs might be relative to different base URL endpoints: the click-through URL is relative to the remote server base URL, while the indexing URL is relative to the SOAP URL. Whether those two base URLs differ depends on whether you have implemented your Content Crawler in Java or .NET. For example, the Java IDK uses Axis, which implements programs as services; in Axis, the SOAP URL is the remote server base URL with '/services' appended. Given the remote server base URL http://server:port/sitename, the SOAP URL would be http://server:port/sitename/services. If both the click-through and indexing URLs point to the same servlet (http://server:port/sitename/customdocfetch?docId=12345), the relative URLs differ: the relative URL for indexing would be "../customdocfetch?docId=12345" and the relative URL for click-through would be "customdocfetch?docId=12345". (Since the indexing URL is relative to the SOAP URL, the '../' reorients the path from http://server:port/sitename/services to http://server:port/sitename, yielding the correct URL http://server:port/sitename/customdocfetch?docId=12345.)
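
The resolution can be verified with a short standalone check. This is not IDK code; the host, port, and servlet name are the illustrative values from the example above, and the trailing slashes on the base URLs matter for standard URI resolution:

    import java.net.URI;

    public class RelativeUrlCheck {
        public static void main(String[] args) {
            // Base URLs from the example above; note the trailing slashes.
            URI remoteServerBase = URI.create("http://server:8080/sitename/");
            URI soapBase         = URI.create("http://server:8080/sitename/services/");

            // Click-through URL is relative to the remote server base URL.
            System.out.println(remoteServerBase.resolve("customdocfetch?docId=12345"));
            // -> http://server:8080/sitename/customdocfetch?docId=12345

            // Indexing URL is relative to the SOAP URL, so it needs '../' to step back out of /services.
            System.out.println(soapBase.resolve("../customdocfetch?docId=12345"));
            // -> http://server:8080/sitename/customdocfetch?docId=12345
        }
    }
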
Do your initial implementation of IDocumentProvider and IDocFetchProvider in separate classes, but factor out shared code so the GetDocument and GetMetaData logic can be reused by both. See the Viewer sample application included with the IDK for sample code.
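
A minimal sketch of that factoring, assuming a file-system-backed repository purely for illustration. The helper class and its method names are not part of the IDK; both provider classes would hold an instance and call it from their GetDocument and GetMetaData implementations:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.HashMap;
    import java.util.Map;

    public class RepositoryDocumentHelper {
        private final File repositoryRoot;

        public RepositoryDocumentHelper(File repositoryRoot) {
            this.repositoryRoot = repositoryRoot;
        }

        // Shared content access: called from both providers' GetDocument implementations.
        public InputStream openDocument(String docId) throws IOException {
            return new FileInputStream(new File(repositoryRoot, docId));
        }

        // Shared metadata access: called from both providers' GetMetaData implementations.
        public Map<String, String> readMetaData(String docId) {
            File doc = new File(repositoryRoot, docId);
            Map<String, String> meta = new HashMap<String, String>();
            meta.put("Name", doc.getName());
            meta.put("LastModified", String.valueOf(doc.lastModified()));
            return meta;
        }
    }
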
Do not make your methods order-dependent. The portal can invoke the methods described above in any order, so your code must not rely on a particular calling sequence.
If a document or container no longer exists, always throw a NoLongerExistsException. This is the only way the portal can determine that the file or folder has been deleted; failing to throw the exception could result in an infinite loop.
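
A sketch of the check, assuming the Java IDK's NoLongerExistsException (imported here from com.plumtree.remote.crawler; verify the package and available constructors against the IDK Javadoc for your version). The helper and the file-based existence test are illustrative only:

    import com.plumtree.remote.crawler.NoLongerExistsException;
    import java.io.File;

    public class ExistenceGuard {
        // Hypothetical guard invoked before returning content or metadata for a document.
        static void assertStillExists(File doc) throws NoLongerExistsException {
            if (!doc.exists()) {
                // Assumed no-argument constructor; the key point is to throw this
                // exception type so the portal can mark the document as deleted.
                throw new NoLongerExistsException();
            }
        }
    }
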
If there are no results, return a zero-length array rather than an array of empty strings (for example, return new ChildContainer[0];).
Check the SOAP timeout for the back-end server and calibrate your response accordingly. In version 5.0 and above, the SOAP timeout is set in the Web Service editor. In version 4.5, the SOAP timeout must be set via a Service Configuration page.
Pages that are not publicly accessible must be gatewayed. Gateway settings are configured in the Web Service editor on the HTTP Configuration page and in the Content Source editor. You can gateway all URLs relative to the remote server, or enter individual URLs and paths to other servers to gateway additional pages. For details, see Deploying Custom Content Crawlers.
You must define mappings for any associated Content Types before a Content Crawler is run. The portal uses the mappings in the Content Type definition to map the data returned by the Content Crawler to portal properties. Properties are only stored if you configure the Content Type mapping before running the Content Crawler. (Properties that apply to all documents are configured in the Global Document Property Map.)
To import security settings, the back-end repository must have an associated Authentication Source. Content Crawlers that import security need the user and category (domain) defined by an Authentication Source; you must configure the Authentication Source before the Content Crawler is run. Many repositories use the network's NT or LDAP security store; if an associated Authentication Source already exists, there is no need to create one. For details on Authentication Sources, see the portal online help. For details on security, see Configuring Custom Content Crawlers: Importing File Security.
If you use a mirrored crawl, only run it when you first import documents. Always check every directory after a mirrored crawl. After you have imported documents into the portal, it is safer to refresh your portal directory using a regular crawl with filters.
For mirrored crawls, make crawl depth as shallow as possible. Portal users want to access documents quickly, so folder structure is important. Also, the deeper the crawl, the more extensive your QA process will be.
Use filters to sort crawled documents into portal folders. Mirrored crawls can return inappropriate content and create unnecessary directory structures. Filters are a more efficient way to sort crawled documents. To use filters, choose Apply Filter of Destination Folder in the Content Crawler editor. For details on filters, see the portal online help.
Do not use automatic approval unless you have tested a Content Crawler. It is dangerous to use automatic approval without first testing the structure, metadata and logs for a Content Crawler.
To clear the deletion history and re-crawl documents that have been deleted from the portal, you must re-open the Content Crawler editor and configure the Importing Documents settings on the Advanced Settings page, as explained in Deploying Custom Content Crawlers.

You can also import access restrictions during a crawl; for details, see Configuring Content Crawlers.

