AquaLogic User Interaction Development Guide


Configuring Content Crawlers

Implementing a successful Content Crawler in the portal requires specific configuration.

To register a Content Crawler in the portal, you must create the following administrative objects and portal components:
  • Remote Server: The Remote Server defines the base URL for the Content Crawler. Content Crawlers can use a Remote Server object or hard-coded URLs. Multiple services can share a single Remote Server object. If you will be using a Remote Server object, you must register it before registering any related Web Service objects.
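The relationship between a shared Remote Server and its Web Services can be sketched as follows. This is an illustrative example only, not portal API code; the class name and the endpoint path are assumptions.

```java
import java.net.URI;

// Illustrative sketch: a Remote Server supplies the shared base URL, and
// each Web Service object appends its own relative SOAP endpoint to it.
// The endpoint path used below is a hypothetical example, not a fixed
// portal path.
public class CrawlerEndpoints {
    private final URI baseUrl; // from the Remote Server object

    public CrawlerEndpoints(String baseUrl) {
        // Ensure a trailing slash so relative resolution appends the path.
        this.baseUrl = URI.create(baseUrl.endsWith("/") ? baseUrl : baseUrl + "/");
    }

    // Resolve a service-specific relative path against the shared base URL.
    public String resolve(String relativeEndpoint) {
        return baseUrl.resolve(relativeEndpoint).toString();
    }
}
```

Because multiple Web Service objects resolve against the same base, changing the Remote Server URL (for example, when moving the crawler to a new host) updates every dependent service at once.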
  • Web Service - Content: The Web Service object includes basic configuration settings, including the SOAP endpoints for the ContainerProvider and DocumentProvider, and Preference page URLs. Multiple Content Source or Content Crawler objects can use the same Web Service object. All remote Content Crawlers require an associated Web Service object. For information on specific settings, see the portal online help.
  • Content Source - Remote: The Content Source defines the location and access restrictions for the back-end repository. Each Web Service - Content object has one or more associated Content Source objects. The Content Source editor can include Service Configuration pages created for the Content Crawler. Multiple Content Crawler objects can use the same Remote Content Source, allowing you to crawl multiple locations of the same content repository without having to repeatedly specify all the settings. For details on specific settings, see the portal online help. For details on Service Configuration pages, see Creating Service Configuration Pages for Content Crawlers.
  • Content Crawler - Remote: Each Content Crawler has an associated Content Crawler object that defines basic settings, including destination folder and Content Type. The Content Crawler editor can include Service Configuration pages created for the Content Crawler. Refresh settings are also entered in the Content Crawler editor. For details on specific settings, see the portal online help. For details on Service Configuration pages, see Creating Service Configuration Pages for Content Crawlers.
  • Job: To run the Content Crawler, you must schedule a Job or add the Content Crawler object to an existing Job. The Content Crawler editor allows you to set a Job. For details on configuring Jobs, see the portal online help.
  • Global Content Type Map: If you are importing a proprietary file format, you might need to create a new Content Type. Content Types determine which accessor is used to index a file. Most standard file formats are supported for indexing by the portal, and in most cases the same document is returned during a crawl (for indexing) as for click-through (for display). You can create new Content Types, or map additional file extensions to an existing Content Type through the Global Content Type Map. For detailed instructions, see the portal online help or the AquaLogic Interaction Administrator Guide.
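The extension-to-type lookup that the Global Content Type Map performs can be sketched as below. The mappings, the fallback type, and the class name are illustrative assumptions; the real map is maintained in the portal editor, not in code.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a Global Content Type Map lookup: file extensions map to a
// Content Type, which in turn selects the accessor used for indexing.
// The entries and the "Document" fallback below are illustrative only.
public class ContentTypeMap {
    private final Map<String, String> byExtension = new HashMap<>();

    public ContentTypeMap() {
        byExtension.put("doc", "MS Word Document");
        byExtension.put("xls", "MS Excel Document");
        byExtension.put("pdf", "Adobe Acrobat Document");
    }

    // Map an additional file extension to an existing Content Type, as
    // the Global Content Type Map editor allows for proprietary formats.
    public void mapExtension(String ext, String contentType) {
        byExtension.put(ext.toLowerCase(), contentType);
    }

    public String lookup(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase();
        return byExtension.getOrDefault(ext, "Document"); // generic fallback
    }
}
```

For example, mapping the extension "docx" to an existing "MS Word Document" type lets the crawler index a new format without defining a new Content Type.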
  • Global Document Property Map: To map document attributes to portal Properties, you must update the Global Document Property Map before running a Content Crawler. During a crawl, file attributes are imported into the portal and stored as Properties. The relationship between file attributes and portal Properties can be defined in two places: the Content Type editor or the Global Document Property Map. Two types of metadata are returned during a crawl.
    • The crawler (also called the provider) iterates over the documents in a repository and typically retrieves only the file name, path, and size.
    • During the indexing step, the file is copied to ALI Search, where the appropriate accessor performs full-text extraction and metadata extraction. For example, for a Microsoft Office document, the portal uses the MS Office accessor to obtain additional properties, such as author, title, manager, and category.
    If there are conflicts between the two sets of metadata, the setting in CrawlerConstants.TAG_PROPERTIES determines which is stored in the database (for details, see Service Configuration Pages above).
    Note: If any properties returned by the crawler or accessor are not included in the Global Document Property Map, they are discarded. Mappings for the specific Content Type take precedence over mappings in the Global Document Property Map. The Object Created property is set by the portal and cannot be modified by code inside a Content Crawler.
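The merge behavior described above can be sketched as follows. The method and parameter names are illustrative assumptions; only the behavior mirrors the text: unmapped attributes are discarded, and a single setting (what CrawlerConstants.TAG_PROPERTIES controls) decides whether crawler-supplied or accessor-extracted values win conflicts.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch of the two-phase metadata merge: crawler-supplied attributes
// (name, path, size) and accessor-extracted attributes (author, title,
// and so on) are combined, keeping only properties present in the
// Global Document Property Map, with one side winning conflicts.
public class PropertyMerge {
    public static Map<String, String> merge(Map<String, String> fromCrawler,
                                            Map<String, String> fromAccessor,
                                            Set<String> mappedProperties,
                                            boolean crawlerWins) {
        Map<String, String> result = new HashMap<>();
        Map<String, String> loser = crawlerWins ? fromAccessor : fromCrawler;
        Map<String, String> winner = crawlerWins ? fromCrawler : fromAccessor;
        // Copy the losing side first, then overwrite with the winning side;
        // properties absent from the map are dropped entirely.
        loser.forEach((k, v) -> { if (mappedProperties.contains(k)) result.put(k, v); });
        winner.forEach((k, v) -> { if (mappedProperties.contains(k)) result.put(k, v); });
        return result;
    }
}
```

In this sketch, a Title returned by both phases resolves to the accessor's richer value unless the setting favors the crawler, and an unmapped attribute such as a raw file path never reaches the database.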
  • Global ACL Sync Map: Content Crawlers can import security settings based on the Global ACL Sync Map, which defines how the Access Control List (ACL) of the source document corresponds with ALI’s authentication groups. (An ACL consists of a list of names or groups; for each name or group, there is a corresponding list of possible permissions. The ACL returned to the portal is for read rights only.) In most cases, the Global ACL Sync Map is automatically maintained by Authentication Sources. The Authentication Source is the first step in ALI security: to import security settings in a crawl, the back-end repository must have an associated Authentication Source, because Content Crawlers that import security need the user and category (domain) defined by one. You must configure the Authentication Source before the Content Crawler is run. Many repositories use the network’s NT or LDAP security store; if an associated Authentication Source already exists, there is no need to create one. For detailed instructions, see the portal online help or the AquaLogic Interaction Administrator Guide.
    Note: Two settings are required to import security settings:
    • In the Web Service - Content editor on the Advanced Settings page, check Supports importing security with each document.
    • In the Content Crawler editor on the Main Settings page, check Import security with each document.
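The ACL translation described above can be sketched as follows. The class name, the "DOMAIN\user" entry format, and the map contents are illustrative assumptions; in the portal, the Global ACL Sync Map itself is maintained by the Authentication Source, not by crawler code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of a Global ACL Sync Map lookup: each name or group in the
// source document's ACL is translated to a portal user or group that an
// Authentication Source has already imported; entries with no mapping
// are skipped. Only read permission is conveyed, as the text notes.
public class AclSync {
    public static List<String> readers(List<String> sourceAcl,
                                       Map<String, String> syncMap) {
        List<String> portalReaders = new ArrayList<>();
        for (String entry : sourceAcl) {
            String principal = syncMap.get(entry);
            if (principal != null) {
                portalReaders.add(principal);
            }
        }
        return portalReaders;
    }
}
```

This is why the Authentication Source must run before the crawl: if a source user or group has not yet been imported, its ACL entry has no portal counterpart and the permission is silently dropped.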
