Configuring Content Crawlers
Implementing a successful content crawler in the portal
requires specific configuration.
To register a content crawler in the portal, you must create
the following administrative objects and portal components:
- Remote Server: The Remote Server defines the base
URL for the content crawler. Content crawlers can use a Remote Server
object or hard-coded URLs. Multiple services can share a single Remote
Server object. If you use a Remote Server object, you must
register it before registering any related Web Service objects.
- Web Service - Content: The Web Service object includes
basic configuration settings, including the SOAP endpoints for the
ContainerProvider and DocumentProvider, and Preference page URLs.
Multiple Content Source or Content Crawler objects can use the same
Web Service object. All remote content crawlers require an associated
Web Service object. For information on specific settings, see the
portal online help.
- Content Source - Remote: The Content Source defines
the location and access restrictions for the back-end repository.
Each Web Service - Content object has one or more associated Content
Source objects. The Content Source editor can include Service Configuration
pages created for the content crawler. Multiple Content Crawler objects
can use the same Remote Content Source, allowing you to crawl multiple
locations of the same content repository without having to repeatedly
specify all the settings. For details on specific settings, see the
portal online help. For details on Service Configuration pages, see Creating Service Configuration Pages for Content Crawlers.
- Content Crawler - Remote: Each content crawler has
an associated Content Crawler object that defines basic settings,
including destination folder and Content Type. The Content Crawler
editor can include Service Configuration pages created for the Content
Crawler. Refresh settings are also entered in the Content Crawler
editor. For details on specific settings, see the portal online help.
For details on Service Configuration pages, see Creating Service Configuration Pages for Content Crawlers.
- Job: To run the content crawler, you must schedule
a Job or add the Content Crawler object to an existing Job. The Content
Crawler editor allows you to set a Job. For details on configuring
Jobs, see the portal online help.
- Global Content Type Map: If you are importing a
proprietary file format, you might need to create a new Content Type.
Content Types are used to determine the type of accessor used to index
a file. You can create new Content Types, or map additional file extensions
to an existing Content Type using the Global Content Type Map. Most
standard file formats are supported for indexing by the portal. In
most cases, the same document is returned during a crawl (for indexing)
as for click-through (for display). For
detailed instructions, see the portal online help or the Administrator
Guide for Oracle WebCenter Interaction.
- Global Document Property Map: To map document attributes
to portal Properties, you must update the Global Document Property
Map before running a content crawler. During a crawl, file attributes
are imported into the portal and stored as Properties. The relationship
between file attributes and portal Properties can be defined in two
places: the Content Type editor or the Global Document Property Map.
Two types of metadata are returned during a crawl.
- The crawler (also called the provider) iterates over documents in a
repository and retrieves the file name, path, size, and usually nothing
else.
- During the indexing step, the file is copied to portal Search,
where the appropriate accessor executes full-text extraction
and metadata extraction. For example, for a Microsoft Office document,
the portal uses the MS Office accessor to obtain additional properties,
such as author, title, manager, and category.
If there are conflicts between the two sets of metadata, the
setting in CrawlerConstants.TAG_PROPERTIES determines which is stored
in the database (for details, see Service Configuration Pages above).
Note: Any properties returned by the crawler or accessor that are
not included in the Global Document Property Map are discarded.
Mappings for the specific Content Type have precedence over mappings
in the Global Document Property Map. The Object Created property is
set by the portal and cannot be modified by code inside a Content
Crawler.
- Global ACL Sync Map: Content crawlers can import
security settings based on the Global ACL Sync Map, which defines how
the Access Control List (ACL) of the source document corresponds with
Oracle WebCenter Interaction’s authentication groups. (An ACL consists
of a list of names or groups. For each name or group, there is a corresponding
list of possible permissions. The ACL returned to the portal is for
read rights only.) For detailed instructions, see the portal online
help or the Administrator Guide for Oracle WebCenter Interaction.
In most cases, the Global ACL Sync Map is automatically maintained
by Authentication Sources. The Authentication Source is the first
step in Oracle WebCenter Interaction security. To import security
settings in a crawl, the back-end repository must have
an associated Authentication Source. Content crawlers that import
security need the user and category (domain) defined by an Authentication
Source. You must configure the Authentication Source before the content
crawler is run. Many repositories use the network’s NT or LDAP security
store; if an associated Authentication Source already exists, there
is no need to create one.
Note: Two settings are required to import
security settings:
- In the Web Service - Content editor on the Advanced Settings
page, check Supports importing security with each document.
- In the Content Crawler editor on the Main Settings page,
check Import security with each document.
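
To make the ACL mapping described under Global ACL Sync Map more concrete, the following is a minimal Python sketch of how a sync-map lookup could translate a source document's read ACL into portal categories. It is illustrative only: the function map_read_acl, the dictionary layout, and the domain and category names are assumptions invented for this example and are not part of the portal's API. The two checkboxes from the note above are modeled as plain booleans, and skipping entries whose domain has no mapping is an assumed behavior.

```python
# Hypothetical sketch: names and data layout are assumptions, not portal API.

def map_read_acl(source_acl, acl_sync_map,
                 web_service_supports_security, crawler_imports_security):
    """Return sorted (category, name) pairs granted read access.

    source_acl is a list of (domain, name) pairs from the back-end repository.
    acl_sync_map maps a source domain to the category defined by an
    Authentication Source.
    """
    # Both settings must be enabled for security to be imported at all.
    if not (web_service_supports_security and crawler_imports_security):
        return []

    mapped = set()
    for domain, name in source_acl:
        category = acl_sync_map.get(domain)
        if category is None:
            # Assumed behavior: entries from a domain with no mapping are skipped.
            continue
        mapped.add((category, name))
    return sorted(mapped)


source_acl = [("CORP", "Engineering"), ("CORP", "jsmith"), ("LEGACY", "Guests")]
acl_sync_map = {"CORP": "CorpLDAP"}
print(map_read_acl(source_acl, acl_sync_map, True, True))
```

In this sketch the LEGACY entry is dropped because no Authentication Source category is mapped to that domain, which mirrors the requirement that the Authentication Source be configured before the content crawler is run.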
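
The property-mapping rules described under Global Document Property Map can also be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the portal's implementation: resolve_properties and all attribute and Property names are hypothetical, the boolean crawler_wins merely stands in for the conflict setting associated with CrawlerConstants.TAG_PROPERTIES (whose actual values are not modeled), and the sketch assumes that an attribute mapped in either the Content Type or the global map is kept.

```python
# Hypothetical sketch: names and data layout are assumptions, not portal API.

def resolve_properties(crawler_meta, accessor_meta,
                       content_type_map, global_map, crawler_wins):
    """Map source file attributes to portal Properties.

    crawler_meta and accessor_meta are the two metadata sets returned
    during a crawl. crawler_wins decides which set is kept on a conflict.
    """
    # Merge the two sets; the dictionary unpacked last wins on a conflict.
    if crawler_wins:
        merged = {**accessor_meta, **crawler_meta}
    else:
        merged = {**crawler_meta, **accessor_meta}

    properties = {}
    for attribute, value in merged.items():
        # Content Type mappings take precedence over the global map.
        target = content_type_map.get(attribute, global_map.get(attribute))
        if target is None:
            continue  # unmapped attributes are discarded
        properties[target] = value
    return properties


crawler_meta = {"name": "report.doc", "title": "From crawler"}
accessor_meta = {"title": "From accessor", "author": "A. Writer", "manager": "M. Boss"}
content_type_map = {"title": "Document Title"}
global_map = {"title": "Title", "author": "Author", "name": "File Name"}

print(resolve_properties(crawler_meta, accessor_meta,
                         content_type_map, global_map, crawler_wins=True))
```

Here the manager attribute is dropped because neither map mentions it, and title lands in the Document Title Property because the Content Type mapping overrides the global one.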