Administrator Guide


Crawling and Indexing SharePoint Items

This chapter discusses how to use SharePoint Console to crawl and index SharePoint items within your portal. This is accomplished by creating the following objects within the portal:

  1. One or more content sources using the SharePoint CWS Web service.
  2. One or more crawlers using your created content source.
  3. One or more jobs to run the crawlers.

 


Introduction

SharePoint Console allows you to crawl and index items from a Microsoft Office SharePoint Server (MOSS) or Windows SharePoint Services (WSS) site or site collection. It also allows you to crawl a list of MOSS or WSS sites specified by an RSS feed. Crawling SharePoint items into your portal requires the configuration of a content source, a crawler, and a job. Depending on your needs, you may need to create more than one content source or crawler.

A SharePoint content source is configured with authentication information and default clickthrough behavior. The authentication information is the Windows credentials necessary to access the desired SharePoint site or site collection. If multiple sites are accessible with the same authentication credentials, only one SharePoint content source is required. If sites require different authenticating credentials, create a SharePoint content source for each set of credentials.

Each SharePoint content source can have one or more SharePoint crawlers associated with it. A SharePoint crawler describes which SharePoint site or site collection is to be crawled, what to crawl on that SharePoint site, and where the crawled items should be put. Note that the crawler does not import the SharePoint items themselves, but rather indexes them within the portal.

 


Creating a SharePoint Content Source

SharePoint Console comes with a default content source, SharePoint Content Source, that can be updated with your MOSS or WSS information. To create a SharePoint content source:

  1. In portal Administration, go to the folder where you want to create the SharePoint content source.
  2. From the Create Object... drop-down, select Content Source - Remote.
  3. In the Choose Web Service dialog box, select the SharePoint CWS Web Service and click OK.
  4. In the URL Type section, select Does not use the Gateway to open documents.

     Caution: The portal cannot gateway SharePoint documents. The gateway interferes with authentication and WebDAV features of SharePoint.
  5. Under SharePoint User Settings, enter the login information for a user that has access to all of the SharePoint content that you wish to crawl into the portal. If the crawler will use a site feed, the user must have access to each of the sites specified by the site feed.

     When the crawler accesses the desired SharePoint site, it runs as this user. The user must have the following rights on the site:

     • Microsoft Office SharePoint Server 2007: Browse Directories, Manage Lists, Use Remote Interfaces
     • Windows SharePoint Services 2.0: Browse Directories, Manage List Permissions
     • Windows SharePoint Services 3.0: Browse Directories, Manage Lists, Use Remote Interfaces

     To crawl Form Template folders in MOSS 2007 and WSS 3.0, the View Application Pages and View Versions rights are also required.

     Note: The authentication information configured in the content source is used only by the crawler. These credentials are not passed to the SharePoint site when a user clicks an item in the portal. Authentication of the user to access that item is handled by MOSS or WSS.
  6. Choose Document Clickthrough Settings from the menu on the left.
  7. Select the radio button next to the default clickthrough behavior you would like for SharePoint documents. Clicking a SharePoint document can either open the document directly or take the user to the SharePoint properties page for that document. This setting affects clickthrough behavior in the Knowledge Directory, general search results, the Most Recently Used SharePoint Items portlet, and the default mode of the SharePoint Search portlet. When the Most Recently Used SharePoint Items portlet is used in a community, community preferences can be used to change the clickthrough behavior.

     Note: This setting only affects documents. All other SharePoint items open their default display page, which is the properties page.
  8. Click Finish to name and save the content source.

 


Creating SharePoint Crawlers

Once you have created one or more SharePoint content sources, you can create SharePoint crawlers:

  1. In portal Administration, go to a folder where you want to create the SharePoint crawler.
  2. From the Create Object... drop-down, select Content Crawler - Remote.
  3. In the Choose Content Source dialog box, select the content source you created.
  4. On the Main Settings page, configure the SharePoint Console crawler settings:
    Table 2-1 SharePoint Console Crawler Configuration - Main Settings

    Crawler Configuration Settings

    SharePoint Site
      Select Site URL to crawl a SharePoint site or site collection. Type the URL of the site that you wish to crawl and click Validate. This URL should end in "/".
      • The crawler validates that the URL entered is a valid SharePoint site or site collection and displays the name of the starting subsite. If the URL entered does not end in "/" but a valid site collection can be extracted from the URL, the crawler discards any extraneous trailing information from the URL and updates the URL shown in the UI accordingly.
      • The subsite name may be listed as "/" if the starting site is the root site of the site collection itself. If the starting SharePoint URL points to a subsite of the site collection, the subsite is listed.
      Select Feed to crawl a list of WSS sites specified by an RSS feed. Type the path to the site feed.
      • The path must return a valid RSS 2.0 document. For more information about crawling RSS feeds, see Crawling RSS Site Feeds.

    Crawl Depth
      Select the depth of the crawl. It is recommended that you choose Selected Site and All Subsites to simplify indexing of SharePoint sites; otherwise, you will have to create a separate crawler for each subsite. Choose Selected Site Only if you want to tightly control which SharePoint sites are indexed.

    Destination Folders

    Folder Path
      Select the folder or folders where you would like to import SharePoint content.

    Settings
      Select how you would like to import the SharePoint items. It is recommended that you choose to mirror the folder structure, approve imported documents, and import security. If you choose to mirror the folder structure, the crawler mirrors the SharePoint site structure. Importing security means that only those users who can view the source item in MOSS or WSS can see the corresponding document in the portal and search results. For more information on importing security, see the Administrator Guide for AquaLogic Interaction.

    Document Access Privileges
      Specify the access rights for the imported content.
  5. On the List Settings page, check the box next to each type of SharePoint item you want to crawl. To crawl documents attached to list items, select Crawl documents attached to list items.
  6. On the Document Settings page, specify how documents should be expired and refreshed when you run the crawler.

     You can use the Document Expiration settings, along with the Apply these settings to existing documents created by this crawler option, to delete all documents previously crawled by this crawler.

     For more information about the Document Settings page, see the AquaLogic Interaction online help.
  7. On the Content Type page, specify how content types should be assigned to the documents imported into the Knowledge Directory.

     For more information about content types, see the Administrator Guide for AquaLogic Interaction or the AquaLogic Interaction online help.
  8. On the Advanced Settings page, select settings for content language, handling of documents rejected by the crawler, handling of documents previously crawled into the Knowledge Directory, and runtime configuration. The settings available here vary depending on whether or not you chose to mirror the source folder structure.

     For more information about advanced crawler settings, see the AquaLogic Interaction online help.
  9. On the Set Job page, assign or create a job to run this crawler. You must also register the administrative folder that contains the job with the Automation Service.

     For more information about setting up crawler jobs, see the Administrator Guide for AquaLogic Interaction or the AquaLogic Interaction online help.
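The Validate behavior described for the SharePoint Site setting — discarding extraneous trailing information and ensuring the URL ends in "/" — can be approximated with a short script. This is an illustrative sketch only, not the product's actual validation logic, which also verifies the URL against the live SharePoint site:

```python
from urllib.parse import urlparse

def normalize_site_url(url: str) -> str:
    """Approximate the crawler's URL cleanup: drop query strings and
    fragments, and ensure the site URL ends in "/". Illustrative only;
    the real Validate step also contacts the SharePoint site."""
    parsed = urlparse(url)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError(f"Not an absolute URL: {url!r}")
    path = parsed.path
    if not path.endswith("/"):
        path += "/"
    # Query and fragment are treated as extraneous trailing information.
    return f"{parsed.scheme}://{parsed.netloc}{path}"

print(normalize_site_url("http://mySPServer:17938/sites/MainSiteCollection"))
# → http://mySPServer:17938/sites/MainSiteCollection/
```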

Crawling RSS Site Feeds

In addition to crawling SharePoint sites and site collections, SharePoint content crawlers can also be used to import content from a list of sites provided by an RSS feed.

The path to a site feed can be a URL (http://, file:///), a local file path on the SharePoint Console machine (c:\feed.xml), or a UNC file path on the SharePoint Console network (\\server\feeds\feed.xml). Secured sites (https://) and FTP sites (ftp://) cannot be crawled.
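The accepted path forms, and the two rejected schemes, can be distinguished with a small helper. The function below is a hypothetical illustration of these rules and is not part of SharePoint Console:

```python
from urllib.parse import urlparse

def classify_feed_path(path: str) -> str:
    """Classify a site feed path into the forms described above,
    rejecting the unsupported https:// and ftp:// schemes.
    Hypothetical helper for illustration only."""
    scheme = urlparse(path).scheme.lower()
    if scheme in ("https", "ftp"):
        raise ValueError(f"{scheme}:// site feeds cannot be crawled")
    if scheme in ("http", "file"):
        return "url"
    if path.startswith("\\\\"):
        return "unc"
    # Anything else is taken as a local file path, e.g. c:\feed.xml
    return "local"
```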

The SharePoint Console installer creates a virtual directory named SiteFeed that can be used to deploy a simple RSS feed. For example, you can put a well-formed, valid XML feed document or an .aspx file that generates a site feed into the sitefeed folder on the file system and then access it via HTTP.

The site feed document must conform to the RSS 2.0 specification. The link element of each item must contain the URL to a valid SharePoint site. The title element of an item is optional; if provided, it should be a valid Knowledge Directory folder name, as it is used as the Knowledge Directory folder name for the site during a mirroring crawl. If the title element is omitted, the folder name is retrieved from the SharePoint site for a mirroring crawl.
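These rules — an RSS 2.0 document, a mandatory link per item, an optional title — can be checked before handing a feed to the crawler. Below is a minimal sketch using Python's standard library, not a SharePoint Console tool:

```python
import xml.etree.ElementTree as ET

def check_site_feed(feed_xml: str):
    """Verify a site feed against the rules above and return
    (link, title) pairs; title is None when omitted.
    Illustrative sketch only."""
    root = ET.fromstring(feed_xml)
    if root.tag != "rss" or root.get("version") != "2.0":
        raise ValueError("Site feed must be an RSS 2.0 document")
    sites = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if not link:
            raise ValueError("Every <item> must contain a <link> to a SharePoint site")
        sites.append((link, item.findtext("title")))
    return sites
```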

Example site feed document:

<?xml version="1.0" encoding="utf-8" ?>
<rss version="2.0">
  <channel>
    <title>SharePoint Site List</title>
    <link>#</link>
    <description>List of SharePoint sites to be crawled by BEA ALI SharePoint Console</description>
    <item>
      <title>MOSS TeamSite at mySPServer:17938/sites/MainSiteCollection</title>
      <link>http://mySPServer:17938/sites/MainSiteCollection/</link>
    </item>
    <item>
      <link>http://SharePointServer:9167/</link>
    </item>
  </channel>
</rss>

