Implementing a successful crawler in the portal requires specific configuration. To register a crawler in the portal, you must create the following administrative objects:
Remote Server (optional): The Remote Server defines the base URL for the crawler. Crawlers can use a Remote Server object or hard-coded URLs. Multiple services can share a single Remote Server object. If you will be using a Remote Server object, you must register it before registering any related Web Service objects.
Web Service - Crawler: The Crawler Web Service object includes basic configuration settings, including the SOAP endpoints for the ContainerProvider and DocumentProvider, and Preference page URLs. Multiple Data Source or Crawler objects can use the same Web Service object. All remote crawlers require an associated Crawler Web Service. For information on key settings, see Crawler Web Service Editor below. Web crawlers (Crawler - WWW) are not covered in this guide; for details, see the portal online help.
Data Source - Remote: The Data Source defines the location and access restrictions for the back-end repository. Each Crawler Web Service object has one or more associated Remote Data Source objects. The Data Source editor can include Service Configuration pages created for the crawler. (For details, see Service Configuration Pages.) Multiple Crawler objects can use the same Remote Data Source, allowing you to crawl multiple locations of the same content repository without having to repeatedly specify all the settings. Web data sources (Data Source - WWW) are not covered in this guide; for details, see the portal online help.
Crawler - Remote: Each crawler has an associated Remote Crawler object that defines basic settings, including destination folder and Document Type. The Remote Crawler editor can include Configuration pages created for the crawler. (For details, see Service Configuration Pages.) Refresh settings are also entered in the Crawler editor. For details, see Refreshing Crawled Documents (Crawler Editor).
Job: To run the crawler, you must schedule a Job or add the Crawler object to an existing Job. The Remote Crawler editor allows you to set a Job.
In addition, crawlers use settings configured in the following portal components:
Global Document Type Map: If you are importing a proprietary file format, you might need to create a new Document Type. For details, see Global Document Type Map: Document Types and Accessors.
Global Document Property Map: To map document attributes to portal Properties, you must update the Global Document Property Map before running a crawler. For details, see Global Document Property Map: Properties and Metadata.
Global ACL Sync Map: To import security during a crawl, the back-end repository must have an associated Authentication Source. For details, see Importing File Security.
For detailed information on portal objects and editor settings, see the Portal Administrator's Guide and the portal online help. For details on deploying your code, see Deploying Custom Crawlers.
The following sections detail key settings required when you create a new Crawler Web Service.
For additional information on editor settings, see the portal online help.
By default, crawlers use the gateway to open documents. When users click a link to an associated document, they are redirected to a URL (generated from the settings in your portal configuration file) that displays the document. This allows users to view documents they might not otherwise be able to view, either due to security on the source repository or firewall restrictions.
Any pages that are not publicly accessible must be gatewayed. Gateway settings are configured on the HTTP Configuration page of the Crawler Web Service editor. To gateway all URLs relative to the remote server, enter “.” in the Gateway URL Prefixes list. You can also enter individual URLs and add paths to other servers to gateway additional pages that can be accessed from a document.
To use the network path of the document, choose Does not use the gateway to open documents in the Data Source editor. Note: If you use this setting, even users with access privileges will not be able to access documents if they are not connected to your network.
The following settings are configured on the Advanced Settings page of the Crawler Web Service editor.
Support Document Submission using file paths: Crawlers import documents using a standard file path.
Support Document Submission using Remote UI: Crawlers import documents using a remote SCI interface. This is one way to crawl in individual files or records. For an example, see the DB Record Viewer sample application.
Support mirroring the source folder structure: Crawlers import the directory structure and duplicate it in the portal Knowledge Directory.
Support importing security with each document: Crawlers import document-specific security. If this option is selected, user and group access control information is imported with the crawled files.
Send Timezone: The user’s timezone is used to localize the crawler interface.
Send Login Token: Select this option to send a login token to the remote server. A login token is required to access portal functionality via the Plumtree Remote Client (PRC). Configure the Login Token duration or leave the default of 5 minutes.
SOAP Encoding Style: Specify how information from the service is encoded. Choose RPC/Encoded if the service is written in Java. Choose Document/Literal if the service is written in .NET.
If there are administrative settings to be configured that apply to all instances of the crawler, add a Service Configuration page. The Service Configuration page URL entered on the Advanced URLs page of the Crawler Web Service editor will be included as a page in the associated Data Source editor and/or Crawler editor. (For details, see Service Configuration Pages below.)
As noted earlier, User Preferences and User Information can be used to authenticate with the back-end system or limit access to specific users. The User Preferences that should be sent to the crawler or DocFetch servlet must be configured on the Preferences page in the Crawler Web Service editor. List the unique names of all User Preferences (User settings) that should be sent.
The User Information settings that should be sent to DocFetch must be configured on the User Information page in the Crawler Web Service editor. Choose all User Information settings that should be sent to the Crawler/DocFetch. Standard settings are displayed in the top pane; add any additional User Information settings by typing them in the textbox or clicking Add Existing User Info.
Crawlers require administrative settings that should be set via a Service Configuration page, covered next.
Service Configuration (SCI) pages are integrated with portal editors and used to define settings used by a crawler. Crawlers must provide SCI pages for the Data Source and/or Crawler editors to build the preferences used by the crawler. The URL to any associated SCI page(s) must be entered on the Advanced URLs page of the Crawler Web Service editor.
For a crawler, all optional settings are in the class CrawlerConstants. Most of these values are legacy settings from 4.5. The only settings recommended for use in 5.x are the following:
TAG_PATH: The path to the container you want to crawl. Depending on the type of container, this could be a UNC path, information for a table in a database, information for a view in Lotus Notes, etc.
TAG_DEPTH: If this variable has not been included, the crawler only crawls documents in the first directory. For resources with no sub-directories, such as a database, this is fine. For a file system, it is usually best to use a SCISelectElement to let users select the crawl depth (where -1 means crawl until subcontainers return no child containers). If you do not want users to set this option, use a SCIHiddenElement for the same field. (For an illustrative sketch of these elements, see the example after this list.)
Note: The SCISelectElement must call SetStorageType(TypeStorage.STORAGE_INTEGER) to be stored correctly; otherwise the portal returns the message "wrong property type."
TAG_PROPERTIES (TAG_PROPERTIES_LOCAL/TAG_PROPERTIES_REMOTE): Determines whether the properties returned by GetMetaData or those extracted by the local accessor take precedence. Setting this variable to LOCAL causes the properties from the local accessor used to retrieve the file to override the properties returned by the crawler. Setting it to REMOTE causes the properties from GetMetaData to override the properties from local accessors.
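The sketch below shows, in Java, how these settings might be gathered on a SCI page. It is a minimal sketch, not a definitive implementation: only the names taken from this section (CrawlerConstants.TAG_PATH, CrawlerConstants.TAG_DEPTH, SCISelectElement, SCIHiddenElement, TypeStorage.STORAGE_INTEGER) come from the EDK as described here; the package paths, page and element constructors, and method casing are assumptions to be checked against the EDK API documentation.

```java
// Illustrative sketch only. The imports, constructors, and method names
// shown here are assumptions modeled on the EDK classes mentioned in this
// section; verify exact signatures against the EDK API documentation.
import com.plumtree.remote.crawler.CrawlerConstants;
import com.plumtree.remote.sci.SCIPage;
import com.plumtree.remote.sci.SCISelectElement;
import com.plumtree.remote.sci.SCITextElement;
import com.plumtree.remote.sci.TypeStorage;

public class FolderSettingsPage
{
    public SCIPage buildPage() throws Exception
    {
        SCIPage page = new SCIPage("Folder Settings", "Choose what to crawl.");

        // TAG_PATH: a UNC path, database table, Notes view, or other
        // repository-specific location string.
        SCITextElement path =
            new SCITextElement(CrawlerConstants.TAG_PATH, "Folder to crawl");
        page.addElement(path);

        // TAG_DEPTH: let the user choose how deep to crawl; -1 means crawl
        // until subcontainers return no child containers.
        SCISelectElement depth =
            new SCISelectElement(CrawlerConstants.TAG_DEPTH, "Crawl depth");
        depth.addOption("1", "Top folder only");
        depth.addOption("2", "Two levels");
        depth.addOption("-1", "All subfolders");
        // Required so the value is stored as an integer; without this the
        // portal reports "wrong property type" (SetStorageType in the .NET EDK).
        depth.setStorageType(TypeStorage.STORAGE_INTEGER);
        page.addElement(depth);

        return page;
    }
}
```

To hide the depth setting from users instead, the same field could be emitted as a SCIHiddenElement with a fixed value.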
SCI provides an easy way to write configuration pages that are integrated with portal editors. SCI wraps the portal’s XUI XML and allows you to create controls without XUI. For a complete listing of classes and methods in the Plumtree.Remote.Sci namespace, see the EDK API documentation.
The following methods must be implemented:
Initialize: Receives the namespace (indicating whether the page belongs to the Data Source or Crawler editor) and the settings as a NamedValueMap; dependent objects supply data.
GetPages: Returns a fixed-length array containing the custom pages.
GetContent: Returns the XML content for a page. The API provides a collection of helper classes for building the page (text box, select box, tree element, etc.).
The example below is a SCI page for a Data Source Editor that gets credentials for a database crawler.
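The following is a minimal sketch in Java of that shape, assuming hypothetical class and member names (SCIPage, SCITextElement, the preference keys "DBUsername" and "DBPassword") modeled on the SCI concepts above; the real types in the Plumtree.Remote.Sci (or com.plumtree.remote.sci) namespace may differ, so check the EDK API documentation before relying on these signatures.

```java
// Illustrative sketch: class, method, and preference names below are
// assumptions modeled on the Initialize/GetPages/GetContent pattern
// described above; consult the EDK API documentation for the real types.
import com.plumtree.remote.sci.NamedValueMap;
import com.plumtree.remote.sci.SCIPage;
import com.plumtree.remote.sci.SCITextElement;

public class DatabaseCredentialsEditor
{
    private NamedValueMap settings;

    // Initialize: receives the editor namespace (Data Source or Crawler)
    // and the current settings for this administrative object.
    public void initialize(String namespace, NamedValueMap settings)
    {
        this.settings = settings;
    }

    // GetPages: this service contributes one custom page to the editor.
    public SCIPage[] getPages() throws Exception
    {
        return new SCIPage[] { buildCredentialsPage() };
    }

    // GetContent for the single page: the elements added here are rendered
    // by the portal as XUI XML inside the Data Source editor.
    private SCIPage buildCredentialsPage() throws Exception
    {
        SCIPage page = new SCIPage("Database Credentials",
                "Enter the account used to connect to the database.");

        SCITextElement user = new SCITextElement("DBUsername", "User name");
        user.setValue(settings.get("DBUsername"));
        page.addElement(user);

        // The real EDK may provide a dedicated password-style element;
        // a plain text element is used here for brevity.
        SCITextElement password = new SCITextElement("DBPassword", "Password");
        page.addElement(password);

        return page;
    }
}
```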
Document Types are used to determine the type of accessor used to index a file. You can create new Document Types, or map additional file extensions to an existing Document Type using the Global Document Type Map.
Most standard file formats are supported for indexing by the portal. In most cases, the same document is returned during a crawl (for indexing) as for click-through (for display). As noted above, Document Types are used to determine the type of accessor used to index a file. The following standard file formats are supported:
File Format | Document Type
.txt, .xml, .rtf | Text Files
.html, .htm, .asp | Web Pages
.pdf | PDF Documents
.doc, .dot | MS Word Documents
.xls, .xlt | MS Excel Documents
.ppt | MS PowerPoint Documents
all other formats (.exe, .zip, etc.) | Non Indexed Files
To add a new Document Type, open the appropriate administrative folder and select Create Object... | Document Type. You can also map additional file extensions to Document Types through the Global Document Type Map. For detailed instructions, see the portal online help or the Portal Administrator’s Guide.
If the file being crawled is unusual or proprietary and there is no associated accessor, it can still be indexed in the portal. For details, see the next page, Creating Custom Crawlers.
During a crawl, file attributes are imported into the portal and stored as Properties. The relationship between file attributes and portal Properties can be defined in two places: the Document Type editor or the Global Document Property Map.
Two types of metadata are returned during a crawl.
The crawler (aka provider) iterates over documents in a repository and retrieves the file name, path, size, and usually nothing else.
During the indexing step, the file is copied to the search server, where the appropriate accessor performs full-text and metadata extraction. For example, for a Microsoft Office document, the portal uses the MS Office accessor to obtain additional properties, such as author, title, manager, category, etc.
If there are conflicts between the two sets of metadata, the setting in CrawlerConstants.TAG_PROPERTIES determines which is stored in the database (for details, see Service Configuration Pages above).
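As a plain-Java illustration of that precedence rule (conceptual only, not EDK code), conflicting values behave roughly like a map merge, with TAG_PROPERTIES deciding which side wins:

```java
import java.util.HashMap;
import java.util.Map;

// Conceptual illustration only; the portal performs this merge internally.
public class PropertyMergeExample
{
    public static void main(String[] args)
    {
        // Metadata returned by the crawler (GetMetaData) during the crawl.
        Map<String, String> crawlerProps = new HashMap<>();
        crawlerProps.put("Name", "Q3-forecast.xls");
        crawlerProps.put("Author", "jsmith");        // value from the repository

        // Metadata extracted by the accessor during indexing.
        Map<String, String> accessorProps = new HashMap<>();
        accessorProps.put("Author", "John Smith");   // value from the file itself
        accessorProps.put("Title", "Q3 Forecast");

        // TAG_PROPERTIES = REMOTE: crawler (GetMetaData) values win on conflicts.
        Map<String, String> merged = new HashMap<>(accessorProps);
        merged.putAll(crawlerProps);                 // crawler overrides accessor

        // With TAG_PROPERTIES = LOCAL, reverse the merge order so the
        // accessor's values override the crawler's.
        System.out.println(merged);                  // Author=jsmith, Title=Q3 Forecast, Name=Q3-forecast.xls
    }
}
```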
Note: If any properties returned by the crawler or accessor are not included in the Global Document Property map, they are discarded. Mappings for the specific Document Type have precedence over mappings in the Global Document Property Map. The Object Created property is set by the portal and cannot be modified by code inside a CWS.
To make custom attributes searchable in the portal, you must edit the associated Property object and select the Searchable option.
By default, the Administrators group has Admin access to all documents. You can add groups and users as necessary and configure the access privileges for each. Security for crawled documents is configured in the Crawler editor on the Main Settings page. For details on portal editors, see Deploying Custom Crawlers.
Plumtree crawlers can import security settings based on the Global ACL Sync Map, which defines how the Access Control List (ACL) of the source document corresponds with Plumtree's authentication groups. (An ACL consists of a list of names or groups. For each name or group, there is a corresponding list of possible permissions. The ACL returned to the portal is for read rights only.)
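As a rough data-structure sketch (plain Java with hypothetical type names, not the EDK API), the ACL attached to a crawled document can be pictured as a read-only list of user and group entries, each tagged with the domain that the Global ACL Sync Map resolves to an Authentication Source category:

```java
import java.util.List;

// Hypothetical types for illustration only; the actual ACL representation
// is defined by the CWS/EDK, and only read access is conveyed to the portal.
record AclEntry(String nameOrGroup, String domain, boolean isGroup) {}

record DocumentAcl(List<AclEntry> readers) {}

class AclExample
{
    // A file readable by one user and one group from the CORP domain.
    static DocumentAcl sample()
    {
        return new DocumentAcl(List.of(
                new AclEntry("jsmith", "CORP", false),
                new AclEntry("Engineering", "CORP", true)));
    }
}
```

The Global ACL Sync Map then tells the portal which Authentication Source category the source domain corresponds to, so these names resolve to imported portal users and groups.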
Two settings are required to import security settings:
In the Crawler Web Service editor on the Advanced Settings page, check Supports importing security with each document.
In the Crawler editor on the Main Settings page, check Import security with each document.
In most cases, the Global ACL Sync Map is automatically maintained by Authentication Sources. The Authentication Source is the first step in Plumtree security.
To import security settings in a crawl, the back-end repository must have an associated Authentication Source. Crawlers that import security need the user and category (domain) defined by an Authentication Source. You must configure the Authentication Source before the crawler is run. Many repositories use the network's NT or LDAP security store; if an associated Authentication Source already exists, there is no need to create one. For details on Authentication Sources, see the portal online help.
After running the crawler, check the Main Settings page of the Crawler editor. Any errors related to importing security will be displayed. If the security settings imported by the crawler are incorrect, you might need to edit the Global ACL Sync Map. To edit the Global ACL Sync Map, follow the steps below. For details, see the Portal Administrator’s Guide or the portal online help.
Log in to the portal as an administrator and navigate to Administration.
From the Select Utility drop-down list, choose Global ACL Sync Map.
Click Add Authentication Source (Windows) or Add Mapping (Solaris) and choose the authentication source you use for the resource.
Select Edit on the Domain Name field and enter the domain name of the resource.
Click Finish.
You can also use User preferences and User Information stored in the Plumtree database to authenticate with the back-end system or limit access to specific users. You must enter all User settings and User Information required by a crawler on the Preferences page of the Crawler Web Service editor. For more information on security options, see Accessing Secured Content: Security Options.
You can configure a crawler to automatically expire and/or refresh crawled documents. If you run a crawl multiple times, you can configure additional settings that control which documents are imported and how they are structured.
The following settings are on the Document Settings page of the Crawler editor:
To set documents to expire after a set interval, use the Document Expiration settings.
To set documents to be refreshed on a regular basis, use the Link and Property Refresh settings.
You can choose to check for missing documents only. Use the Broken Links settings to control how the portal handles missing documents.
If you run a crawl multiple times, you can change the crawler settings to define how files are handled by the portal. The settings below are in the Importing Documents section of the Advanced Settings page in the Crawler editor. This section only appears if you are editing an existing crawler.
To re-crawl deleted documents, check Regenerate deleted links. (By default, crawled documents deleted from the portal are not re-crawled when the crawler is run again.)
To update property mappings or document settings, check Refresh them. Refreshing documents slows down a crawler, but you must refresh documents to update property mappings or document settings. Note: Clearing the Import only new links box will result in multiple copies of each crawled document.
For more details on these options, see the portal online help.