Developing Portlets and Integration Web Services: Crawlers and Search Services  

Configuring Custom Crawlers in the Portal

Implementing a successful crawler in the portal requires specific configuration. To register a crawler in the portal, you must create the following administrative objects:

  - Crawler Web Service: defines basic configuration, including gateway settings and the preferences and User Information sent to the crawler.
  - Data Source: stores the settings used to access the back-end repository, such as credentials.
  - Crawler: defines the settings for a specific crawl, including document security and refresh behavior.
  - Job: schedules and runs the crawler.

In addition, crawlers use settings configured in the following portal components:

  - Global Document Type Map: maps file extensions to Document Types, which determine the accessor used to index a file.
  - Global Document Property Map: maps file attributes to portal Properties.
  - Global ACL Sync Map: defines how source document security maps to portal users and groups.

For detailed information on portal objects and editor settings, see the Portal Administrator's Guide and the portal online help. For details on deploying your code, see Deploying Custom Crawlers.

Crawler Web Service Editor

The following sections detail key settings required when you create a new Crawler Web Service.

For additional information on editor settings, see the portal online help.

HTTP Configuration

By default, crawlers use the gateway to open documents. When users click a link to an associated document, they are redirected to a URL (generated from the settings in your portal configuration file) that displays the document. This allows users to view documents they might not otherwise be able to view, either due to security on the source repository or firewall restrictions.

Any pages that are not publicly accessible must be gatewayed. Gateway settings are configured on the HTTP Configuration page of the Crawler Web Service editor. To gateway all URLs relative to the remote server, enter “.” in the Gateway URL Prefixes list. You can also enter individual URLs and add paths to other servers to gateway additional pages that can be accessed from a document.
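For example, to gateway all URLs relative to the remote server plus pages on two other servers that crawled documents link to, the Gateway URL Prefixes list might contain entries like the following (the host names are hypothetical):

.
http://fileserver.example.com/docs/
http://imageserver.example.com/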

To use the network path of the document, choose Does not use the gateway to open documents in the Data Source editor. Note: If you use this setting, even users with access privileges will not be able to access documents if they are not connected to your network.

Advanced Settings

The following settings are configured on the Advanced Settings page of the Crawler Web Service editor.

Advanced URLs

If there are administrative settings to be configured that apply to all instances of the crawler, add a Service Configuration page. The Service Configuration page URL entered on the Advanced URLs page of the Crawler Web Service editor will be included as a page in the associated Data Source editor and/or Crawler editor. (For details, see Service Configuration Pages below.)

Preferences

As noted earlier, User Preferences and User Information can be used to authenticate with the back-end system or limit access to specific users. The User Preferences that should be sent to the crawler or DocFetch servlet must be configured on the Preferences page in the Crawler Web Service editor. List the unique names of all User Preferences (User settings) that should be sent.

The User Information settings that should be sent to DocFetch must be configured on the User Information page in the Crawler Web Service editor. Choose all User Information settings that should be sent to the Crawler/DocFetch. Standard settings are displayed in the top pane; add any additional User Information settings by typing them in the textbox or clicking Add Existing User Info.

Crawlers require administrative settings that should be set via a Service Configuration page, covered next.

Service Configuration Pages (Data Source and Crawler Editors)

Service Configuration (SCI) pages are integrated with portal editors and define the settings used by a crawler. Crawlers must provide SCI pages for the Data Source and/or Crawler editors to build the preferences used by the crawler. The URL to any associated SCI page(s) must be entered on the Advanced URLs page of the Crawler Web Service editor.

For a crawler, all optional settings are defined in the CrawlerConstants class. Most of these values are legacy settings from version 4.5; the key setting for use in 5.x is CrawlerConstants.TAG_PROPERTIES, which determines how conflicts between crawler-provided and accessor-provided metadata are resolved (see Global Document Property Map below).

SCI provides an easy way to write configuration pages that are integrated with portal editors. SCI wraps the portal's XUI XML and allows you to create controls without XUI. For a complete listing of classes and methods in the Plumtree.Remote.Sci namespace, see the EDK API documentation.

The following methods must be implemented (each is illustrated in the example below):

  - GetContent: returns the content for the page in string form.
  - GetHelpURI: returns the help page URI for the page.
  - GetImageURI: returns the image (icon) URI for the page.
  - GetInstructions: returns the instructions for the page, displayed below the title in the editor.
  - GetTitle: returns the title for the page.
  - ValidatePage: validates the current page and throws a ValidationException to report an error.

The example below is a SCI page for a Data Source editor that collects credentials for a database crawler.

Imports System
Imports Plumtree.Remote.Sci
Imports Plumtree.Remote.Util

Namespace Plumtree.Remote.Crawler.DRV

    'Page to enter a user name and password; the first page in the Data Source editor.
    Public Class AuthPage
        Inherits AbstractPage

#Region "Constructors"

        Public Sub New(ByVal editor As AbstractEditor)
            MyBase.New(editor)
        End Sub

#End Region

#Region "Functions"

        'Gets the content for the page in string form:
        'one SciTextElement for the user name, one SciPasswordElement for the password.
        'Note the way the password is stored and the encryption used.
        Public Overrides Function GetContent(ByVal errorCode As Integer, ByVal pageInfo As NamedValueMap) As String
            Dim page As New SciPage

            Dim userElement As New SciTextElement(DRVConstants.USER_NAME, "Enter the user name to authenticate to SQL Server")
            Dim userName As String = pageInfo.Get(DRVConstants.USER_NAME)
            If Not userName Is Nothing Then
                userElement.SetValue(userName)
            End If
            userElement.SetMandatoryValidation("User name is mandatory")

            Dim passElement As New SciPasswordElement(DRVConstants.PASSWORD, "Enter the password to authenticate to SQL Server", "Confirm", "Passwords do not match")

            'Carry the stored (already encrypted) password forward in the editor settings.
            Dim password As String = pageInfo.Get(DRVConstants.ENC_PASSWORD)
            Dim settings As NamedValueMap = Me.Editor.Settings
            settings.Put(DRVConstants.ENC_PASSWORD, password)
            Editor.Settings = settings

            'Display an asterisk placeholder instead of the actual password.
            passElement.SetValue(DRVConstants.ASTERISKS)

            page.Add(userElement)
            page.Add(passElement)

            Return page.ToString
        End Function

        'Gets the help page URI for the page.
        Public Overrides Function GetHelpURI() As String
            Return ""
        End Function

        'Gets the image (icon) URI for the page. (This setting is for backward
        'compatibility; no icon is displayed in version 5.0.)
        Public Overrides Function GetImageURI() As String
            Return ""
        End Function

        'Gets the instructions for the page, displayed below the title in the editor.
        Public Overrides Function GetInstructions() As String
            Return "Enter SQL Server authentication information"
        End Function

        'Gets the title for the page.
        Public Overrides Function GetTitle() As String
            Return "SQL Server Authentication"
        End Function

        'Validates the current page; throws a ValidationException to report an error.
        Public Overrides Sub ValidatePage(ByVal pageInfo As NamedValueMap)
            'If the password is not the asterisk placeholder, the user changed it:
            'encrypt the new value and store it in the editor settings.
            Dim password As String = pageInfo.Get(DRVConstants.PASSWORD)
            If Not password.Equals(DRVConstants.ASTERISKS) Then
                Dim settings As NamedValueMap = Me.Editor.Settings
                Dim encPassword As String = Utilities.EncryptPassword(password, Me.Editor.Locale)
                settings.Put(DRVConstants.ENC_PASSWORD, encPassword)
                Editor.Settings = settings
            End If
        End Sub

#End Region

    End Class
End Namespace
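Note how the example round-trips the password: GetContent never writes the clear-text password back to the page (it displays the DRVConstants.ASTERISKS placeholder instead), and ValidatePage encrypts and stores a new value only when the user has actually replaced the placeholder. This keeps the clear-text password out of the rendered editor page while preserving the stored credential across edits.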

Global Document Type Map: Document Types and Accessors

Document Types are used to determine the type of accessor used to index a file. You can create new Document Types, or map additional file extensions to an existing Document Type using the Global Document Type Map.

Most standard file formats are supported for indexing by the portal. In most cases, the same document is returned during a crawl (for indexing) as for click-through (for display). As noted above, Document Types are used to determine the type of accessor used to index a file. The following standard file formats are supported:

File Format                               Document Type
-----------                               -------------
.txt, .xml, .rtf                          Text Files
.html, .htm, .asp                         Web Pages
.pdf                                      PDF Documents
.doc, .dot                                MS Word Documents
.xls, .xlt                                MS Excel Documents
.ppt                                      MS PowerPoint Documents
all other formats (.exe, .zip, etc.)      Non Indexed Files

To add a new Document Type, open the appropriate administrative folder and select Create Object... | Document Type. You can also map additional file extensions to Document Types through the Global Document Type Map. For detailed instructions, see the portal online help or the Portal Administrator’s Guide.
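For example, to index files with a hypothetical .log extension as plain text, you could map the .log extension to the existing Text Files Document Type in the Global Document Type Map instead of creating a new Document Type.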

If the file being crawled is unusual or proprietary and there is no associated accessor, it can still be indexed in the portal. For details, see the next page, Creating Custom Crawlers.

Global Document Property Map: Properties and Metadata

During a crawl, file attributes are imported into the portal and stored as Properties. The relationship between file attributes and portal Properties can be defined in two places: the Document Type editor or the Global Document Property Map.

Two types of metadata are returned during a crawl:

  1. The crawler (also known as the provider) iterates over documents in a repository and retrieves the file name, path, size, and usually nothing else.

  2. During the indexing step, the file is copied to the search server, where the appropriate accessor performs full-text and metadata extraction. For example, for a Microsoft Office document, the portal uses the MS Office accessor to obtain additional properties, such as author, title, manager, and category.

If there are conflicts between the two sets of metadata, the setting in CrawlerConstants.TAG_PROPERTIES determines which is stored in the database (for details, see Service Configuration Pages above).
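As a minimal sketch of how this setting could be stored, the fragment below assumes it runs inside ValidatePage of a SCI page (as in the example above). CrawlerConstants.TAG_PROPERTIES is used as the settings key; tagPropertiesValue is a hypothetical variable that would hold one of the values documented for CrawlerConstants in the EDK API documentation.

'Hypothetical fragment: record the metadata precedence in the editor settings.
'tagPropertiesValue is a placeholder for a documented CrawlerConstants value.
Dim settings As NamedValueMap = Me.Editor.Settings
settings.Put(CrawlerConstants.TAG_PROPERTIES, tagPropertiesValue)
Editor.Settings = settings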

Note: If any properties returned by the crawler or accessor are not included in the Global Document Property Map, they are discarded. Mappings for the specific Document Type take precedence over mappings in the Global Document Property Map. The Object Created property is set by the portal and cannot be modified by code inside a CWS.

To make custom attributes searchable in the portal, you must edit the associated Property object and select the Searchable option.

Importing File Security

Security for crawled documents is configured on the Main Settings page of the Crawler editor. By default, the Administrators group has Admin access to all documents; you can add groups and users as necessary and configure the access privileges for each. For details on portal editors, see Deploying Custom Crawlers.

Plumtree crawlers can import security settings based on the Global ACL Sync Map, which defines how the Access Control List (ACL) of the source document corresponds with Plumtree’s authentication groups. (An ACL consists of a list of names or groups. For each name or group, there is a corresponding list of possible permissions. The ACL returned to the portal is for read rights only.)

Two settings are required to import security: the Crawler Web Service must be configured to support importing security, and the associated Crawler must be set to import security on the Main Settings page of the Crawler editor.

In most cases, the Global ACL Sync Map is automatically maintained by Authentication Sources. The Authentication Source is the first step in Plumtree security.

To import security settings in a crawl, the back-end repository must have an associated Authentication Source. Crawlers that import security need the user and category (domain) defined by an Authentication Source. You must configure the Authentication Source before the crawler is run. Many repositories use the network’s NT or LDAP security store; if an associated Authentication Source already exists, there is no need to create one. For details on Authentication Sources, see the portal online help.

After running the crawler, check the Main Settings page of the Crawler editor, where any errors related to importing security are displayed. If the security settings imported by the crawler are incorrect, you might need to edit the Global ACL Sync Map. To do so, follow the steps below; for details, see the Portal Administrator’s Guide or the portal online help.

  1. Log in to the portal as an administrator and navigate to Administration.

  2. From the Select Utility drop-down list, choose Global ACL Sync Map.

  3. Click Add Authentication Source (Windows) or Add Mapping (Solaris) and choose the authentication source you use for the resource.

  4. Select Edit on the Domain Name field and enter the domain name of the resource.

  5. Click Finish.
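For example, suppose documents are crawled from a file share in a Windows domain named CORP (a hypothetical name) and portal users are imported through an NT Authentication Source. Adding that Authentication Source to the Global ACL Sync Map with the domain name CORP allows the portal to match the names in each source document's ACL to the corresponding imported portal users and groups.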

You can also use User Preferences and User Information stored in the Plumtree database to authenticate with the back-end system or limit access to specific users. You must enter all User Preferences and User Information settings required by a crawler on the Preferences and User Information pages of the Crawler Web Service editor. For more information on security options, see Accessing Secured Content: Security Options.

Refreshing Crawled Documents (Crawler Editor)

You can configure a crawler to automatically expire and/or refresh crawled documents. If you run a crawl multiple times, you can configure additional settings that control which documents are imported and how they are structured.

Settings that control how crawled documents expire and refresh are configured on the Document Settings page of the Crawler editor.

If you run a crawl multiple times, you can also change the crawler settings that define how files are handled by the portal. These settings are in the Importing Documents section of the Advanced Settings page in the Crawler editor; this section appears only when you are editing an existing crawler.

For more details on these options, see the portal online help.

Next: Crawler Testing Checklist