Oracle® Fusion Middleware Web Service Developer's Guide for Oracle WebCenter Interaction
10g Release 4 (10.3.3.0.0)

Part Number E14109-02

3 Content Service Development

Content services (content crawlers and federated search services) allow you to search external repositories through the portal and index external content in the portal Directory. These services allow users to access documents and other resources from multiple repositories without leaving the portal workspace.

Content Crawlers

Content crawlers are extensible components used to import documents into the portal Directory from a back-end document repository, including Lotus Notes, Microsoft Exchange, Documentum and Novell. Portal users can search for and open crawled files on protected back-end systems through the portal without violating access restrictions.

The Oracle WebCenter Interaction Development Kit (IDK) allows you to create remote content crawlers and related configuration pages without parsing SOAP or accessing the portal API; you simply implement four object interfaces to access the back-end repository and retrieve files. UDDI servers are not required.

The purpose of a Content Crawler is twofold:

  1. Iterate over and catalog a hierarchical data repository. Retrieve metadata and index documents in the data repository and include them in the portal Directory and search index. Files are indexed based on metadata and full-text content.

  2. Retrieve individual documents on demand through the portal Directory, enforcing any user-level access restrictions.

Content Crawlers are run asynchronously by the portal Automation Service; the associated Content Crawler Job completes step 1 and can be run on a regular schedule to refresh any updated or added files. The portal creates a Document object for each crawled file and indexes it in the Directory. Each object includes basic file information, security information, and a URL that opens the file from the back-end content repository. (No crawled files are stored on the portal server.) If the content is not contained within a file or cannot be indexed for another reason, you must implement a servlet/aspx page that returns an indexable file to the portal.

Step 2 occurs when a user browses the Directory and opens a previously crawled document. After a file is crawled into the portal, users must be able to access the file from within the portal by clicking a link. This step is called click-through. If files are publicly accessible, click-through is simple. In many cases, however, you must provide access to documents that are behind a firewall or are otherwise inaccessible from the portal interface.

For details, see the following sections:

Oracle WebCenter Interaction Development Kit (IDK) Interfaces for Content Crawler Development

The Oracle WebCenter Interaction Development Kit (IDK) plumtree.remote.crawler package/namespace includes four interfaces to support content crawler development: IContainerProvider, IContainer, IDocumentProvider and IDocument.

When the portal Automation Service initiates a crawl, it issues a SOAP request to return a list of folders. It iterates over the list of folders and retrieves lists of documents with metadata. In general, the portal calls Oracle WebCenter Interaction Development Kit (IDK) interfaces in the following order. See the definitions that follow for more information.

  1. IContainerProvider.initialize once per thread. Use DataSourceInfo and CrawlerInfo to initialize the Container Provider (make a connection to the back-end system and create a new session). Note: This is not a true HTTP session, and sessions can get dropped. Keep a variable that can be used to ensure the session is still initialized; if it is not, throw NotInitializedException. Store the Content Source in a member variable in Initialize. Do not use direct access to the member variable; instead use a method that checks if it is null and throws a NotInitializedException.

  2. IContainerProvider.attachToContainer, using the starting location in the key CrawlerConstants.TAG_PATH. The key should be populated using a Service Configuration page in the Content Crawler editor. The string in TAG_PATH is service-specific; a file content crawler could use the UNC path to a folder, while a database content crawler could use the full name of a table. The following methods are not called in any specific order.

    • IContainer.getUsers and IContainer.getGroups on that container as required. (IContainer.GetMetaData is deprecated.)

    • IContainer.getChildContainers up to the number specified in CrawlerConstants.TAG_DEPTH. (This key must be set via a Service Configuration page.)

    • IContainerProvider.attachToContainer for each ChildContainer returned.

    • IContainer.getChildDocuments, then IDocumentProvider.attachToDocument for each ChildDocument returned.

  3. IContainerProvider.shutdown (this call is optional and could be blocked by exceptions or network failure).

  4. IDocumentProvider.initialize once per thread. Note: Sessions can get dropped. Keep a variable that can be used to ensure the session is still initialized; if it is not, throw NotInitializedException.

  5. IDocumentProvider.attachToDocument for each ChildDocument, then IDocument.getDocumentSignature to see if the document has changed. If the document is new or has been modified, the following methods are called (not in any specific order).

    • IDocument.getUsers and IDocument.getGroups on that document as required.

    • IDocument.getMetaData to get the file name, description, content type, URL, etc.

    • IDocument.getDocument to index the document (only if DocFetch is used).

  6. IDocumentProvider.shutdown (this call is optional and could be blocked by exceptions or network failure).

The sections below provide helpful information on the interfaces used to implement a content crawler. For a complete listing of interfaces, classes, and methods, see the Oracle WebCenter Interaction Development Kit (IDK) API documentation.

IContainerProvider

The IContainerProvider interface allows the portal to iterate over a back-end directory structure. The portal calls IContainerProvider first in most cases. This interface provides the following methods:

  • initialize allows the remote server to initialize a session and create a connection to the back-end document repository. The Oracle WebCenter Interaction Development Kit (IDK) passes in a DataSourceInfo object that contains the necessary settings associated with a Content Source object (the name of a directory in the repository and the credentials of a system user). The CrawlInfo object contains the settings for the associated Content Crawler object in the portal. The start location of the crawl is the value stored in the key CrawlerConstants.TAG_PATH, set using a Service Configuration page.

  • attachToContainer is always the next call after Initialize; the order of the remaining calls is not defined. It associates the session with the container specified in the sContainerLocation parameter; subsequent calls refer to this container until the next attachToContainer call. The value in the sContainerLocation parameter will be the CrawlerConstants.TAG_PATH key for the initial attach, and the value specified in ChildContainer.GetLocation for subsequent attaches. Each time attachToContainer is called, discard any state created during the previous attachToContainer call. If multiple translations of the container are available, select the most appropriate using the Locale parameter, which can be sent as a full locale (e.g., "en-us") or in the abbreviated language-only format (e.g., "en"). Note: If the container specified does not exist, you must throw a new NoLongerExistsException to avoid an infinite loop. If the Content Crawler is configured to delete missing files, all files in the container will be removed from the portal index.

  • shutdown allows the portal to clean up any unused sessions that have not yet expired. Content Crawlers are implemented on top of standard cookie-based session mechanisms, so sessions expire and resources and connections are released after an inactivity period, typically around 20 minutes. As a performance optimization, the portal might send a Shutdown message notifying the remote server to end the session immediately. No parameters are received and none are returned. Do not assume that Shutdown will be called; the call could be blocked by an exception or network failure. Remote servers must terminate sessions after an inactivity timeout but can choose to ignore the Shutdown message and keep the session alive until it times out. A minimal sketch of the initialize and attachToContainer patterns described above follows this list.
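
The sketch below illustrates the session-guard and NoLongerExistsException patterns described above. The interface and exception names come from this chapter, but the method signatures, constructors, and helper calls are assumptions for illustration only; consult the Oracle WebCenter Interaction Development Kit (IDK) API documentation for the actual definitions.

import java.util.Locale;
// IDK imports (IContainerProvider, DataSourceInfo, CrawlInfo, exceptions) omitted; names are as used in this chapter.

public class MyContainerProvider // implements IContainerProvider (remaining methods omitted)
{
    private DataSourceInfo dataSource;   // null means the session was never initialized or was dropped

    public void initialize(DataSourceInfo dsInfo, CrawlInfo crawlInfo)
    {
        // Connect to the back-end repository and keep the Content Source settings in a member variable.
        this.dataSource = dsInfo;
    }

    // Every other method reads the Content Source through this guard instead of the field,
    // so a dropped session surfaces as a NotInitializedException and the portal re-initializes.
    private DataSourceInfo requireDataSource() throws NotInitializedException
    {
        if (dataSource == null)
        {
            throw new NotInitializedException();
        }
        return dataSource;
    }

    public void attachToContainer(String sContainerLocation, Locale locale)
        throws NotInitializedException, NoLongerExistsException
    {
        DataSourceInfo ds = requireDataSource();
        // Discard any state left over from a previous attachToContainer call, then attach.
        // If the folder no longer exists, throwing NoLongerExistsException is required;
        // otherwise the crawl can loop indefinitely on the missing container.
        if (!folderExists(ds, sContainerLocation))   // hypothetical back-end check
        {
            throw new NoLongerExistsException();
        }
    }

    private boolean folderExists(DataSourceInfo ds, String location)
    {
        // Back-end specific check (UNC path, database table, Notes view, etc.).
        return true;
    }

    // getUsers, getGroups, getChildContainers, getChildDocuments, and shutdown are omitted here.
}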

IContainer

The portal uses the IContainer interface to query information about back-end resource directories. This interface provides the following methods:

  • getGroups and getUsers return a list of the portal groups or users that have read access to the container. These calls are made only if the Web Service and Content Crawler objects are configured to import security. The portal batches these calls; the content crawler code should return all groups or users at once.

  • getChildContainers returns the containers inside the current container (i.e., subfolders of a folder). The value stored in the key CrawlerConstants.TAG_DEPTH is used to determine how many times getChildContainers is called (crawl depth). This value must be set via a Service Configuration page. If no value is stored with this key, getChildContainers is never called; only the documents in the folder specified for the start location are crawled into the portal. Note: Setting CrawlerConstants.TAG_DEPTH to -1 could result in an infinite loop.

  • getChildDocuments returns the documents inside the current container (folder). The portal batches this call; the Content Crawler code should return all documents at once. The TypeNamespace and TypeID parameters define the Content Type for the document. TypeNamespace associates the document with a row in the Global Content Type Map, and the TypeID associates it with a particular Content Type. The value in ChildDocument.getLocation is used in IDocumentProvider.attachToDocument, so any information required by attachToDocument must be included in the location string. You can describe the document using file or MIME, as shown in the example below.

    ChildDocument doc = new ChildDocument();
    String filename = "WordDoc.doc";
    
    //Location is a crawler-specific string used to retrieve the doc, e.g., the file name
    doc.setLocation(filename);
    
    //TypeNameSpace is either FILE or MIME unless using a custom namespace (Notes, Exchange)
    //NOTE: the example uses getCode because setTypeNameSpace expects a String
    doc.setTypeNameSpace(TypeNamespace.MIME.getCode());
    
    //For FILE descriptions, TypeID is simply the document name with extension (i.e., filename)
    //For MIME descriptions, set the document type or map multiple file extensions to MIME types
    doc.setTypeID("application/msword");
    
    //DisplayName is the name displayed in the Knowledge Directory, usually overridden in IDocument.getMetaData()
    doc.setDisplayName(filename);
    
    
  • getMetaData (DEPRECATED) returns all metadata available in the repository about the container. The name and location are used in mirrored crawls to mirror the structure of the source repository. In most cases, the container metadata is only the name and description.

IDocumentProvider

The IDocumentProvider interface allows the portal to specify back-end documents for retrieval. In most cases, the portal calls IContainerProvider first. However, in some cases, the service is used to refresh existing documents and IDocumentProvider might be called first.

  • initialize allows the remote server to initialize a session and create a connection to the back-end document repository. (For details on parameters and session state, see IContainerProvider.initialize above.) IDocumentProvider.initialize will be called once per thread as long as the session does not time out or get interrupted for other reasons, and attachToDocument will be called next.

  • attachToDocument is always the next call made after initialize; the order of the remaining calls is not defined. This method 'attaches' a session to the document specified in the sDocumentLocation parameter; subsequent calls refer to this document until the next attachToDocument call. The sDocumentLocation string is the value specified in ChildDocument.getLocation (ChildDocument is returned by IContainer.getChildDocuments). If multiple translations of the document are available, select the most appropriate by using the Locale parameter, which can be sent as a full locale (e.g., 'en-us') or in the abbreviated language-only format (e.g., 'en'). When implementing this method, you can throw the following exceptions:

    NoLongerExistsException: The document has been moved or deleted. (The refresh agent will delete documents from the portal index only if this exception has been thrown.)

    NotAvailableException: The document is temporarily unavailable.

    NotInitializedException: The IDocumentProvider is in an uninitialized state.

    AccessDeniedException: Access to this document is denied.

    ServiceException: Propagates the exception to the portal and adds an entry to Logging Spy.


  • shutdown allows the portal to clean up any unused sessions that have not yet expired. (For details, see IContainerProvider.shutdown above.)

IDocument

The IDocument interface allows the portal to query information about and retrieve documents. This interface provides the following methods:

  • getDocumentSignature allows the portal to determine if the document has changed and should be re-indexed and flagged as updated. It can be a version number, a last-modified date, or the CRC of the document. The Oracle WebCenter Interaction Development Kit (IDK) does not enforce any restrictions on what to use for the document signature, or provide any utilities to get the CRC of the document. This is always the first call made to IDocument; on re-crawls, if the document signature has not changed, no additional calls will be made. (A minimal sketch follows this list.)

  • getMetaData returns all metadata available in the repository about the document. The portal maps this data to properties based on the mappings defined for the appropriate Content Type, along with metadata returned by the associated accessor. The following field names are reserved. Additional properties can be added using the portal's Global Document Property Map; for details, see Configuring Custom Content Crawlers: Properties and Metadata. (Any properties that are not in the Global Document Property Map will be discarded.)

    Name: REQUIRED. The name of the link to be displayed in the portal Knowledge Directory. Note: By default, the portal uses the name from the crawled file properties as the name of the card. To make the portal use the Name property returned by getMetaData, you must set CrawlerConstants.TAG_PROPERTIES to REMOTE using the Service Configuration Interface.

    Description: The description of the link to be displayed in the portal Directory.

    UseDocFetch: Whether or not to use DocFetch to retrieve the file. The default is False. If you use DocFetch, the value in the File Name field is used to retrieve the file during both indexing and click-through. If you do not use DocFetch, you must provide values for Indexing URL and Click-Through URL.

    File Name (required for DocFetch): The name of the click-through file, used for DocFetch.

    Content Type (required for DocFetch): The content type of the click-through file, used to associate the crawled document with the Global Content Type Map.

    Indexing URL (public URL; required if not using DocFetch): The URL to the file that can be indexed in the portal. URLs can be relative to the Remote Server. If a file is publicly accessible via a URL, that URL can be used to access the document for both indexing and click-through. Documents that cannot be indexed must provide an additional URL at crawl-time for indexing purposes. For details on crawling secured content, see Accessing Secured Content.

    Click-Through URL (public URL; required if not using DocFetch): The URL to the click-through file. URLs can be relative to the Remote Server. For details on crawling secured content, see Accessing Secured Content.

    Image UUID (optional): This parameter is only required for custom Content Types. For standard Content Types, the accessor will assign the correct image UUID.


  • getDocument returns the path to the file if it was not provided by getMetaData. (For public URLs, you do not need to implement getDocument, but you must provide values for IndexingURL and ClickThroughURL in getMetaData.) During crawl-time indexing, this file is copied to the web-accessible IndexFilePath location specified in your deployment descriptor and returned to the portal via a URL to that location. If the file is not supported for indexing by the portal, implement getDocument to convert the document into a supported file format for indexing (e.g., text-only) and return that file during indexing. Note: To create a custom implementation of getDocument, you must set useDocFetch to True. When a user clicks through to the document, the display file is streamed back via the DocFetch servlet to the browser. Any necessary cleanup due to temporary file usage should be done on subsequent calls to IDocumentProvider.attachToDocument or IDocumentProvider.shutdown. For details on accessing secured content and files that are not accessible via a public URL, see Content Crawler Click-Through.

  • getGroups and getUsers return a list of the groups or users with read access to the document. Each entry is an ACLEntry with a domain and group name. The portal batches these calls; the content crawler code should return all groups or users at once. This call is made only if the Supports importing security with each document option is checked on the Advanced Settings page of the Web Service editor.
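
For illustration, the following minimal sketch returns a document signature based on the back-end file's last-modified time. The method name comes from the description above, but the exact IDK signature and the member field are assumptions.

// Sketch only: assumes getDocumentSignature() returns a String and that attachToDocument
// stored the back-end file path in a member variable.
public String getDocumentSignature() throws NotInitializedException
{
    java.io.File attached = new java.io.File(currentDocumentPath);   // hypothetical field set in attachToDocument
    // Any value that changes when the document changes will do: a version number,
    // a last-modified date, or a CRC. The last-modified time is usually enough for files.
    return String.valueOf(attached.lastModified());
}

On a re-crawl, the portal compares this string with the stored signature and skips the remaining IDocument calls if it has not changed.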

SCI Variables for Content Crawler Properties

Content crawler properties are configured using a defined set of variables.

The Content Crawler object should include the following properties. These properties can be hard-coded or configured using a Service Configuration (SCI) page. For details on SCI pages, see Creating Service Configuration Pages for Content Crawlers.

TAG_PATH: The path to the container to crawl. Depending on the type of container, this could be a URL, a UNC path, information for a table in a database, information for a view in Notes, etc.

TAG_DEPTH: The crawl depth. If this variable is not included, the content crawler only crawls documents in the first directory. This works for resources with no subdirectories, such as a database. For a file system, it is usually best to use a SCISelectElement to let users select the crawl depth (where -1 means crawl until subcontainers return no child containers). If you do not want users to set this option, use a SCIHiddenElement for the same field. Note: The SCISelectElement must call SetStorageType(TypeStorage.STORAGE_INTEGER) to be stored correctly; otherwise the portal returns the message "wrong property type."

TAG_PROPERTIES (optional): Determines whether properties from getMetaData or the local accessor are used. Setting this variable to TAG_PROPERTIES_LOCAL causes the properties returned by the local accessor used to retrieve the file to override the properties returned by the content crawler. Setting it to TAG_PROPERTIES_REMOTE causes the properties from getMetaData to override properties from local accessors.


Content Crawler Development Tips

These best practices and development tips apply to all content crawler development.

  • Use logging extensively to provide feedback during a crawl. In some cases, the portal reports a successful crawl when there were minor errors. Use Log4J or Log4Net to track progress.

  • Use relative URLs in your code to allow migration to another remote server. Note: These URLs might be relative to different base URL endpoints. The click-through URL is relative to the remote server base URL, and the indexing URL is relative to the SOAP URL. Depending on whether you have implemented your content crawler using Java or .NET, the base URL endpoint for the remote server might differ from the base URL endpoint for SOAP. For example, the Java IDK uses Axis, which implements programs as services. In Axis, the SOAP URL is the remote server base URL with '/services' attached to the end. Given the remote server base URL http://server:port/sitename, the SOAP URL would be http://server:port/sitename/services. If both click-through and indexing URLs point to the same servlet (http://server:port/sitename/customdocfetch?docId=12345), the relative URLs would be different. The relative URL for indexing would be "../customdocfetch?docId=12345" and the relative URL for click-through would be "customdocfetch?docId=12345". (Since the indexing URL is relative to the SOAP URL, the '../' reorients the path from http://server:port/sitename/services to http://server:port/sitename, yielding the correct URL to http://server:port/sitename/customdocfetch?docId=12345.)

  • Do your initial implementation of IDocumentProvider and IDocFetchProvider in separate classes, but factor out some code to allow reuse of the GetDocument and GetMetaData methods. See the Viewer sample application included with the Oracle WebCenter Interaction Development Kit (IDK) for sample code.

  • Do not make your calls order-dependent. The portal can make the above calls in any order, so your code cannot be dependent on order.

  • If a document or container does not exist, always throw a new NoLongerExistsException. This is the only way the portal can determine if the file or folder has been deleted. Not throwing the exception could result in an infinite loop.

  • If there are no results, return a zero-length array rather than an array of empty strings. (For example, return new ChildContainer[0];)

  • Check the SOAP timeout for the back-end server and calibrate your response accordingly. The SOAP timeout is set in the Web Service editor.

  • Pages that are not publicly accessible must be gatewayed. Gateway settings are configured in the Web Service editor on the HTTP Configuration page, and in the Content Source editor. You can gateway all URLs relative to the remote server or enter individual URLs and add paths to other servers to gateway additional pages.

  • You must define mappings for any associated Content Types before a content crawler is run. The portal uses the mappings in the Content Type definition to map the data returned by the content crawler to portal properties. Properties are only stored if you configure the Content Type mapping before running the content crawler. (Properties that apply to all documents are configured in the Global Document Property Map.)

  • To import security settings, the back-end repository must have an associated Authentication Source. Content crawlers that import security need the user and category (domain) defined by an Authentication Source. You must configure the Authentication Source before the content crawler is run. Many repositories use the network's NT or LDAP security store; if an associated Authentication Source already exists, there is no need to create one. For details on Authentication Sources, see the portal online help.

  • If you use a mirrored crawl, only run it when you first import documents. Always check every directory after a mirrored crawl. After you have imported documents into the portal, it is safer to refresh your portal directory using a regular crawl with filters.

  • For mirrored crawls, make crawl depth as shallow as possible. Portal users want to access documents quickly, so folder structure is important. Also, the deeper the crawl, the more extensive your QA process will be.

  • Use filters to sort crawled documents into portal folders. Mirrored crawls can return inappropriate content and create unnecessary directory structures. Filters are a more efficient way to sort crawled documents. To use filters, choose Apply Filter of Destination Folder in the Content Crawler editor. For details on filters, see the portal online help.

  • Do not use automatic approval unless you have tested a content crawler. It is dangerous to use automatic approval without first testing the structure, metadata and logs for a content crawler.

  • To re-crawl documents that have been deleted from the portal, you must clear the deletion history: re-open the Content Crawler editor and configure the Importing Documents settings on the Advanced Settings page.

You can also import access restrictions during a crawl; for details, see Configuring Content Crawlers. For more information on the configuration settings above, see the following sections:

Content Crawler Security Options

A crawler can use a range of credential types to access a secure file.

If you need to apply credentials to access a file, you can use any of the following options:

SSO: SSO must be configured in the portal and on the remote server, using the instructions of your SSO vendor.

Basic Authentication: Set the remote server to pass the user's basic authentication headers to the remote resource. Both sources must be using the same directory. For example, if a user logs in using an IPlanet directory, it is unlikely they will be able to access an Exchange resource.

Content Source credentials: Content Source credentials are generally valid only for crawling a database. Most other use cases require user-specific credentials.

User preferences via form-based authentication: Preferences stored in the portal database can be used to create a cookie if the resource accepts session-based authentication. User preferences generally cannot be used if the resource expects basic authentication. For example, the Content Service for Notes uses this approach when Notes is using session-based (cookie) authentication. You must enter all User settings and User Information required by a content crawler on the Preferences page of the Content Crawler editor.

Force users to log in: If the required credentials are not available, redirect the user to the appropriate page and/or provide an intelligible error message. For example, the Content Service for Notes uses this approach when Notes is using basic authentication.


Content Crawler Indexing

A content crawler must return an indexable version of each crawled file to be included in the portal Directory.

The crawler's servlet/aspx page must return content in an indexable format and set the content type and file name using the appropriate headers. Any information required to retrieve the document must be included in the query string of the index URL, including credentials (if necessary).

Note:

The request from the portal to the indexing servlet is a simple HTTP GET. This call is not gatewayed, so the content crawler code does not have access to the Content Source settings, user credentials and preferences, or any other information available through the Oracle WebCenter Interaction Development Kit (IDK).

For files, content can be streamed directly from the source directory. If the content is not in a file, the crawler code should create a temporary file that includes the content with as little extraneous information as possible.

For details, see the following sections:

Indexing Streaming Content

If the content being crawled is in a file, the file can be streamed directly from the source directory.

The following steps describe a typical custom mechanism to return files in an indexable format and set the content type and file name using the appropriate headers. A minimal servlet sketch follows the steps.

  1. In IDocument, get all the variables needed to access the document and add them to the query string of the indexing servlet. This could be as simple as a UNC path for a file crawler or as complicated as server name, database name, schema, table, primary key(s) and primary key value(s) for a database record. It depends entirely on the content crawler and the document being crawled. Make sure all values are URLEncoded.

  2. Add the content type to the query string.

  3. In IDocument, add URLEncoded credentials to the query string. Keep in mind that URL encoding and decoding can turn a '+' in the credentials into a space, which must be turned back into a '+' in the indexing servlet.

  4. Pass back URLs via the DocumentMetadata class that point to the servlet(s).

    • UseDocFetch: Set to False.

    • IndexingURL: Set to the endpoint/servlet that provides the indexable version of the file, including the query string arguments defined in steps 1-3 above.

    • ClickThroughURL: Set to the endpoint/servlet that provides the path to be used when a user clicks through to view the file. During the crawl, the ClickThroughURL value is stored in the associated Directory document.

  5. In the indexing servlet, get the location string and content type from the query string and parse the location string to get the path to the resource.

  6. Obtain the resource.

  7. Set the ContentType header and the Content-Disposition header.

  8. Stream the file (binary or text) or write out the file (text) in a try-catch block.
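
The sketch below (Java) illustrates steps 5 through 8 for a file-based crawler. The query string parameter names (filePath, contentType) and the servlet name are illustrative assumptions; they must match whatever your IDocument implementation put on the indexing URL.

import java.io.FileInputStream;
import java.io.IOException;
import javax.servlet.ServletOutputStream;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class IndexServlet extends HttpServlet
{
    protected void doGet(HttpServletRequest request, HttpServletResponse response) throws IOException
    {
        // Step 5: the location and content type were URL-encoded into the query string by IDocument
        String filePath = request.getParameter("filePath");        // assumed parameter name
        String contentType = request.getParameter("contentType");  // assumed parameter name
        String fileName = filePath.substring(filePath.lastIndexOf('\\') + 1);

        // Step 7: set the headers so the portal knows the type and name of the document
        response.setContentType(contentType);
        response.setHeader("Content-Disposition", "inline; filename=" + fileName);

        // Steps 6 and 8: obtain the resource and stream it out
        FileInputStream in = new FileInputStream(filePath);
        ServletOutputStream out = response.getOutputStream();
        try
        {
            byte[] buf = new byte[40 * 1024];
            int bytesRead;
            while ((bytesRead = in.read(buf)) != -1)
            {
                out.write(buf, 0, bytesRead);
            }
        }
        finally
        {
            in.close();
        }
    }
}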

Creating Temporary Files for Indexing

If crawled content cannot be indexed as-is, the crawler code must create a temporary file for indexing.

The following steps describe a typical custom mechanism to create a temporary indexable file with as little extraneous information as possible and set the content type and file name using the appropriate headers. In most cases, the resource has already been accessed in attachToDocument, so there is no need to call the back-end system again. This example does not use credentials. If you do not want to create temporary files, you can implement an indexing servlet that returns indexable content.

  1. In IDocument, write a temporary file to a publicly accessible location (usually the root directory of the web application as shown in the code snippet below).

    MessageContext context = MessageContext.getCurrentContext();
    HttpServletRequest req = (HttpServletRequest) context.getProperty(HTTPConstants.MC_HTTP_SERVLETREQUEST);
    StringBuffer buff = new StringBuffer();
    buff.append(req.getScheme()).append("://").append(req.getServerName())
        .append(':').append(req.getServerPort()).append(req.getContextPath());
    String indexRoot = buff.toString();
    
    
  2. Pass back URLs via the IDK's DocumentMetadata class that point to the servlet(s).

    • UseDocFetch: Set to False.

    • IndexingURL: Set to the endpoint/servlet that provides the indexable version of the file, including the query string arguments described in step 3 below.

    • ClickThroughURL: Set to the endpoint/servlet that provides the path to be used when a user clicks through to view the file. During the crawl, the ClickThroughURL value is stored in the associated Directory document.

  3. Add the temporary file path to the query string, along with the content type. Make sure to URLEncode both.

  4. In the indexing servlet, get the file path and content type from the query string. Get the file name from the file path.

  5. Set the ContentType header and the Content-Disposition header.

  6. Stream the file (binary or text) or write out the file (text) in a try-catch block.

  7. In the finally block, delete the file.

The following sample code indexes a text file.

logger.Debug("Entering Index.Page_Load()");

// try to get the .tmp file name from the content crawler
string indexFileName = Request[Constants.INDEX_FILE];
if (indexFileName != null)
{
    StreamReader sr = null;
    string filePath = "";
    try
    {
        filePath = HttpUtility.UrlDecode(indexFileName);
        string shortFileName = filePath.Substring(filePath.LastIndexOf('\\') + 1);

        // set the proper response headers
        Response.ContentType = "text/plain";
        Response.AddHeader("Content-Disposition", "inline; filename=" + shortFileName);

        // open the file
        sr = new StreamReader(filePath);

        // stream out the information into the response
        string line = sr.ReadLine();
        while (line != null)
        {
            Response.Output.WriteLine(line);
            line = sr.ReadLine();
        }
    }
    catch (Exception ex)
    {
        logger.Error("Exception while trying to write index file: " + ex.Message, ex);
    }
    finally
    {
        // close and delete the temporary index file even if there is an error
        if (sr != null) { sr.Close(); }
        if (!filePath.Equals("")) { File.Delete(filePath); }
    }
    // done
    return;
}
...

Content Crawler Click-Through

After a repository is crawled and files are indexed in the portal, users must be able to access the file from within the portal by clicking a link; this is the 'click-through' step.

Click-through retrieves a crawled file over HTTP to be displayed to the user. To retrieve documents that are not available via a public URL, you can write your own code or use the DocFetch mechanism in the Oracle WebCenter Interaction Development Kit (IDK). If you handle document retrieval, you can also implement custom caching or error handling. Click-through links are gatewayed, so the content crawler can leverage user credentials and other preferences.

For details, see the following sections:

Implementing Content Crawler Click-Through

The content crawler's click-through implementation must return content in a readable format and set the content type and file name using the appropriate headers.

The following example uses a file, but the crawled resource could be any type of content. If the content is not in a file, the click-through servlet should create a representation with as little extraneous information as possible in a temporary file (for example, for a database, you would retrieve the record and transform it to HTML). See Creating Temporary Files for Indexing. You can also use the Oracle WebCenter Interaction Development Kit (IDK) DocFetch mechanism to handle indexing and click-through; see Content Crawler DocFetch.

  1. Create the clickThroughServlet, and add a mapping in web.xml.

  2. Complete the implementation of IDocument.getMetaData. Set the ClickThroughURL value to a URL constructed using the following steps:

    1. Construct the base URL of the application using the same approach as in the index servlet.

    2. Add the servlet mapping to the clickThroughServlet.

    3. Add any query string parameters required to access the document from the clickThroughServlet (or aspx page). Remember: The click-through page will have access to Content Source parameters (as administrative preferences), but no access to content crawler settings.

  3. To authenticate to the back-end resource, you can use basic authentication, User Preferences, User Info, or credentials from the Content Source. Below are suggestions for each; security will need to be tailored to your content crawler.

    • Use Basic Authentication to use the same credentials used to log in to the portal. For example, if the portal uses AD credentials, Basic Auth could be used to access NT files.

    • Use (encrypted) User Preferences if the authentication source is different from the one used to log in to the portal. For example, if the portal login uses IPlanet, but you need to access an NT or Documentum file.

    • Use (encrypted) User Info if the encrypted credentials are stored in another profile source and imported using a profile job.

    • Use Content Source credentials when there are limited connections, for example with a database.

  4. Extract the parameters from the query string as required.

  5. Display the page.

    • If there is already an HTML representation of the page, authenticate to the page. If the site is using basic authentication and you are using basic authentication headers, simply redirect to that page. If the site is using basic authentication and you are not using basic authentication, users must log in unless the site and the portal are using the same SSO solution. If the site is using form-based authentication, post to the site and follow the redirect.

    • If there is not an HTML representation of the page, retrieve the resource and stream it out to the client as shown in the sample code below (Java). If you use a temporary file, put the code in a try-catch-finally block, and delete the file in the finally block.

      //get the content type, passed as a query string parameter
      String contentType = request.getParameter("contentType");
      
      //if this is a file, get the file name
      String filename = request.getParameter("filename");
      
      //set the content type on the response
      response.setContentType(contentType);
      
      //set the content disposition header to tell the browser the file name
      response.setHeader("Content-Disposition", "inline; filename=" + filename);
      
      //set the header that tells the gateway to stream this through the gateway
      response.setHeader("PTGW-Streaming", "Yes");
      
      //get the content - for a file, get a file input stream based on the path (shown below)
      //other repositories may simply provide an input stream
      //NOTE: this code contains no error checking
      String filePath = request.getParameter("filePath");
      File file = new File(filePath);
      FileInputStream fileStream = new FileInputStream(file);
      
      //create a byte buffer for reading the file in 40k chunks
      int BUFFER_SIZE = 40 * 1024;
      byte[] buf = new byte[BUFFER_SIZE];
      
      //read the file and write out the body until the input stream returns -1
      ServletOutputStream out = response.getOutputStream();
      int bytesRead;
      while ((bytesRead = fileStream.read(buf)) != -1)
      {
          out.write(buf, 0, bytesRead);
      }
      
      //close the input stream when done
      fileStream.close();
      

Content Crawler DocFetch

The Oracle WebCenter Interaction Development Kit (IDK) DocFetch mechanism is one way for a content crawler to retrieve files that are not accessible via a public URL.

If a content crawler implements DocFetch, the Oracle WebCenter Interaction Development Kit (IDK) manages the process of creating temporary files for indexing and click-through. DocFetch also allows you to implement user-level access control. You can pass user preferences or User Information to the content crawler, and this information can be used by DocFetch to authenticate with the back-end system or limit access to specific users.

Note:

DocFetch does not allow you to use multiple methods of authentication or implement custom error handling. If you cannot use public URLs and are not using DocFetch, you must implement a custom document fetching mechanism (i.e., servlet or aspx page). If necessary, you can implement separate servlets for indexing and click-through.

Implementing Content Crawler DocFetch

Content crawler code can use DocFetch to access files that are not available via a public URL.

To use DocFetch, there are three relevant fields in the DocumentMetaData object returned in the portal's call to IDocument.getMetaData:

  • UseDocFetch: Set UseDocFetch to True.

  • File Name: Set the File Name to the name of the file in the repository (must be unique).

  • Content Type: Set the Content Type to the content type for the file. The content type must be mapped to a supported Content Type in the portal.

When UseDocFetch is set to True, the Oracle WebCenter Interaction Development Kit (IDK) sets the ClickThroughURL stored in the Directory to the URL of the DocFetch servlet, and calls IDocument.getDocument to retrieve the file path to the indexable version of the document. When a user subsequently clicks on a link to the crawled document in the Directory, the request to the DocFetch servlet makes several calls to the already-implemented content crawler code. getDocument is called again, but this time as part of the IDocFetch interface. The file path returned is opened by the servlet and streamed back in the response. As explained above, the content crawler must implement the getDocument method in both the Crawler.IDocument and DocFetch.IDocFetch interfaces to return the appropriate file path(s). If the repository cannot access files directly, you must serialize the binary representation to a temporary disk file and return that path. The IDocument and IDocFetch interfaces can use the same process. The Oracle WebCenter Interaction Development Kit (IDK) provides a cleanup call to delete any temporary files later.

Note:

If getDocument returns a path to a file (not a URL to a publicly accessible file), the file name must be unique. Otherwise, all copies of the file are removed during cleanup, including copies that are currently in use by other users.

To use user preferences or User Information, you must configure the settings to be used in the Content Crawler editor. DocFetch interfaces are called in the following order. For a complete listing of interfaces, classes, and methods, see the Oracle WebCenter Interaction Development Kit (IDK) API documentation.

  1. IDocFetchProvider.initialize using the DataSourceInfo, UserPrefs and UserInfo returned from the portal to make a connection to the backend system and create a new session. The implementation should initialize in a similar manner to IDocumentProvider.initialize. IDocFetchProvider can use UserInfo and UserPrefs to perform additional authentication. The ICrawlerLog object is not available. Note: Sessions can get dropped. Keep a variable that can be used to ensure the session is still initialized; if it is not, throw NotInitializedException.

  2. IDocFetchProvider.attachToDocument using the authentication information provided (including UserPrefs and UserInfo).

    1. IDocFetch.getMetaData: The only DocumentMetadata required for click-through is the file name and content type.

    2. IDocFetch.getDocument: As noted above, the IDocFetch.getDocument method should reuse as much code as possible from the IDocument.getDocument method. The Oracle WebCenter Interaction Development Kit (IDK) looks in web.config/*.wsdd to get the file path and URL to the directory for creating temporary files. (A sketch illustrating this code sharing follows the list.)

  3. IDocFetchProvider.Shutdown (optional).
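
The sketch below illustrates the code-sharing recommendation from step 2: both getDocument implementations delegate to a single helper that serializes the back-end document to a uniquely named temporary file and returns its path. The interface names come from this chapter, but the signatures, fields, and helper methods are assumptions for illustration; consult the IDK API documentation for the actual definitions.

// Partial sketch only: signatures and helpers are assumed, not verified against the IDK.
public class CompanyDocFetch
{
    private String currentLocation;   // hypothetical field set in attachToDocument (not shown)

    // Shared by Crawler.IDocument.getDocument and DocFetch.IDocFetch.getDocument:
    // returns the path to a local file that the IDK can index or stream to the browser.
    static String fetchToTempFile(String documentLocation) throws java.io.IOException
    {
        // A unique file name matters: DocFetch cleanup deletes temporary files, and a
        // shared name could remove a copy that another user is still downloading.
        java.io.File temp = java.io.File.createTempFile("crawl-", "-" + safeName(documentLocation));
        // ...write the binary content of documentLocation into temp here...
        return temp.getAbsolutePath();
    }

    public String getDocument() throws java.io.IOException
    {
        return fetchToTempFile(currentLocation);
    }

    private static String safeName(String location)
    {
        // Keep only characters that are safe in a file name.
        return location.replaceAll("[^A-Za-z0-9.]", "_");
    }
}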

Handling Exceptions in Custom Content Crawlers

Content crawler code should handle exceptions.

Most calls should be put into a try-catch block. The scope of the try-catch block should be small enough to diagnose errors easily. In the catch block, log the error in both Log4j/Log4net as well as ICrawlerLog and then re-throw the exception as a ServiceException. This will result in the error displaying in the job log. However, only the error message shows up in the log; look at the log from Log4j/Log4net to get the full stack trace. The following exceptions have special meaning:

  • NotInitializedException means to re-initialize.

  • NoLongerExistsException means that the folder or document no longer exists, and tells the portal to delete that resource.

If any exception is thrown during the initial attachToContainer, the crawl aborts. If NotInitializedException is thrown, the content crawler re-initializes. If NoLongerExistsException is thrown, the resource is removed from the Directory, and the content crawler continues to the next resource. If other exceptions are thrown, the error is logged, and the content crawler continues to the next resource. To use ICrawlerLog, store the member variable in your implementation of IContainerProvider.initialize. To send a log message, simply add a line such as m_logger.Log("enter logging message here"). Note: Container provider log messages are written only after attachToContainer and after exceptions; document provider log messages are written only after exceptions. For more information and the best visibility, use Log4j/Log4net. A sketch of a typical catch block follows.
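
A typical catch block, following the pattern described above, might look like the sketch below. The back-end call, its exception, and the ServiceException constructor arguments are assumptions for illustration.

// Sketch only: assumes a Log4j logger, the ICrawlerLog member described above, and a
// ServiceException that accepts a message and a cause.
try
{
    childDocuments = backEnd.listDocuments(currentFolder);   // hypothetical back-end call
}
catch (RepositoryException e)                                 // hypothetical back-end exception
{
    // Log the full stack trace where it can be inspected later...
    logger.error("Failed to list documents in " + currentFolder, e);
    // ...log a short message to the job log via ICrawlerLog...
    m_logger.Log("Failed to list documents in " + currentFolder);
    // ...then rethrow as a ServiceException so the error appears in the job log.
    throw new ServiceException("Failed to list documents in " + currentFolder, e);
}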

For details on logging, see Oracle WebCenter Interaction Logging Utilities.

Deploying a Custom Content Crawler

After implementing a custom content crawler, you must deploy your code.

Java

Follow the instructions below to deploy a Java content crawler.

  1. Compile the class that implements the IDK interface and copy the entire package structure to the appropriate location in your web application (usually the \WEB-INF\classes directory).

  2. Update the web.xml file in the WEB-INF directory by adding the class to the appropriate *Impl keys. For a content crawler, add your class to ContainerProviderImpl and DocumentProviderImpl as shown below. Note: The *Impl key in web.xml must reference the fully-qualified name of both provider classes required by the service. If the service uses SCI, you must also enter the fully-qualified name of the appropriate implementation of the IAdminEditor interface in the SciImpl parameter.

    ...
    <env-entry>
    <env-entry-name>ContainerProviderImpl</env-entry-name>
    <env-entry-value>com.plumtree.remote.crawler.helloworld.CrawlContainer</env-entry-value>
    <env-entry-type>java.lang.String</env-entry-type>
    </env-entry>
    
    <env-entry>
    <env-entry-name>DocumentProviderImpl</env-entry-name>
    <env-entry-value>com.plumtree.remote.crawler.helloworld.CrawlDocument</env-entry-value>
    <env-entry-type>java.lang.String</env-entry-type>
    </env-entry>
    ...
     
    
  3. Start your application server. (In most cases, you must restart your application server after copying a file.)

  4. Test the directory by opening the following page in a Web browser: http://<hostname:port>/edk/services/<servicetype>ProviderSoapBinding (for example, http://localhost:8080/edk/services/ContainerProviderSoapBinding and http://localhost:8080/edk/services/DocumentProviderSoapBinding). The browser should display the following message: "Hi there, this is an AXIS service! Perhaps there will be a form for invoking the service here..." When you configure the Web Service object for the content crawler in the portal, enter this path as the Service Provider URL.

  5. If the content crawler uses DocFetch, you must also deploy your DocFetch code. Open the WEB-INF\web.xml file and add the fully-qualified name of your class in the DocFetchProvider initialization parameter, as shown in the code that follows.

    ...
    <servlet>
    <servlet-name>DocFetch</servlet-name>
    <servlet-class>com.plumtree.remote.docfetch.DocFetch</servlet-class>
    
    <!-- Modify the param-value below to reference your class --> 
    <init-param> 
    <param-name>DocFetchProvider</param-name> 
    <param-value>com.mycompany.MyDocFetchProvider</param-value> 
    </init-param>
    
    </servlet>
    ...
    

.NET

To deploy a .NET content crawler, add a line to the deployment file (web.config) that specifies the fully qualified name of the class. For a content crawler, enter values for the following parameters, as shown in the code that follows.

  • ContainerProviderImpl

  • DocumentProviderImpl

  • ContainerProviderAssembly

  • DocumentProviderAssembly

... 
<appSettings> 
<add key='ContainerProviderAssembly' value='CompanyStoreCWS'/> 
<add key='ContainerProviderImpl' value='Plumtree.CompanyStore.CWS.CompanyStoreContainer'/> 
<add key='DocumentProviderAssembly' value='CompanyStoreCWS'/> 
<add key='DocumentProviderImpl' value='Plumtree.CompanyStore.CWS.CompanyStoreDocument'/> 
...

If the service uses SCI, you must also enter the fully-qualified name of the appropriate implementation of the IAdminEditor interface using the SciImpl and AdminEditorAssembly parameters.

If the content crawler uses DocFetch, you must also deploy your DocFetch code. Add a line to the deployment file (web.config) that specifies the fully qualified name of your class and the associated assembly (DocFetchImpl and DocFetchAssembly). You must also add three additional parameters to the web.config deployment descriptor:

  • DocFetchURL: The URL to the DocFetch servlet or server page. This URL should be relative to the Remote Server object URL configured for the Content Crawler object in the portal to facilitate migration to another portal.

  • IndexFilePath: A writable, web-accessible directory to which the IDK can write temporary files. At crawl-time, the Oracle WebCenter Interaction Development Kit (IDK) calls IDocument.GetDocument and copies the file at the path returned to this temporary location, which is returned to the portal. These temporary files should be deleted upon completion of the crawl. (The DocFetch mechanism will clean up its own resources, but you must delete any temporary file you return from GetDocument.)

  • IndexURLPrefix: The public Web address of the IndexFilePath directory. IndexURLPrefix must be a URL accessible from the portal server.

The code below is an example of deploying DocFetch in web.config.

... 
<appSettings> 
<add key='DocFetchAssembly' value='MyDocFetch' /> 
<add key='DocFetchImpl' value='com.mycompany.MyDocFetchProvider' /> 
<add key='DocFetchURL' value='iis/docfetch.aspx'/> 
<add key='IndexFilePath' value='D:\\root\\config\\mydomain'/> 
<add key='IndexURLPrefix' value='http://yourhost/IISVirtualDirectory'/> 
...

Testing Custom Content Crawlers

These key tests should be performed on every content crawler.

All the following tests should be performed in multiple implementations of the portal.

  • Test the entire crawl depth. Confirm that documents are structured correctly in every level. Crawl depth should be as shallow as possible. If there are problems, check the filters on the target folders. If nothing is returned, check the authentication settings in the associated Content Source and Web Service - Content objects.

  • Check the document metadata. Is it stored in the appropriate properties? Does it match the metadata in the source repository? If there are problems, check the Content Type settings in the Content Crawler editor, and check the mappings for each associated Content Type.

  • Click through to crawled documents from each crawled directory. If there are problems, check the gateway settings in the Web Service - Content editor.

  • Test refreshing documents to confirm that they reflect modifications. If there are problems, make sure you are providing the correct document signature.

  • Check logs after every crawl. The log can reveal problems even if the portal reports a successful crawl.

Debugging Custom Content Crawlers

To debug custom content crawlers, use logging.

Logging is an important component of any successful content crawler. Logging allows you to track progress and find problems. In most implementations, using Log4J or Log4Net for logging is the best approach. The IDK ICrawlerLog object is more efficient and useful than Logging Spy or a SOAP trace, but it only includes standard exceptions and messages from ContainerProvider.AttachToContainer. If you are viewing the ICrawlerLog, do not assume that every card was imported if the job is successful. Successful means there were no catastrophic failures, such as portal Search not being started or being unable to attach to the start node; individual document failures will not fail a job. If you are viewing logs created by Log4net or Log4j, see the associated documentation for logging configuration options. Both products allow you to specify a file location and a rollover log with a specified file size. If you know the location of the file, it is not difficult to create a servlet/aspx page that streams the log file to the browser.

For more information, see the following sections:

Configuring Content Crawlers

Implementing a successful content crawler in the portal requires specific configuration.

To register a content crawler in the portal, you must create the following administrative objects and portal components:

  • Remote Server: The Remote Server defines the base URL for the content crawler. Content crawlers can use a Remote Server object or hard-coded URLs. Multiple services can share a single Remote Server object. If you will be using a Remote Server object, you must register it before registering any related Web Service objects.

  • Web Service - Content: The Web Service object includes basic configuration settings, including the SOAP endpoints for the ContainerProvider and DocumentProvider, and Preference page URLs. Multiple Content Source or Content Crawler objects can use the same Web Service object. All remote content crawlers require an associated Web Service object. For information on specific settings, see the portal online help.

  • Content Source - Remote: The Content Source defines the location and access restrictions for the back-end repository. Each Web Service - Content object has one or more associated Content Source objects. The Content Source editor can include Service Configuration pages created for the content crawler. Multiple Content Crawler objects can use the same Remote Content Source, allowing you to crawl multiple locations of the same content repository without having to repeatedly specify all the settings. For details on specific settings, see the portal online help. For details on Service Configuration pages, see Creating Service Configuration Pages for Content Crawlers.

  • Content Crawler - Remote: Each content crawler has an associated Content Crawler object that defines basic settings, including destination folder and Content Type. The Content Crawler editor can include Service Configuration pages created for the Content Crawler. Refresh settings are also entered in the Content Crawler editor. For details on specific settings, see the portal online help. For details on Service Configuration pages, see Creating Service Configuration Pages for Content Crawlers.

  • Job: To run the content crawler, you must schedule a Job or add the Content Crawler object to an existing Job. The Content Crawler editor allows you to set a Job. For details on configuring Jobs, see the portal online help.

  • Global Content Type Map: If you are importing a proprietary file format, you might need to create a new Content Type. Content Types are used to determine the type of accessor used to index a file. You can create new Content Types, or map additional file extensions to an existing Content Type using the Global Content Type Map. Most standard file formats are supported for indexing by the portal. In most cases, the same document is returned during a crawl (for indexing) as for click-through (for display). For detailed instructions, see the portal online help or the Administrator Guide for Oracle WebCenter Interaction.

  • Global Document Property Map: To map document attributes to portal Properties, you must update the Global Document Property Map before running a content crawler. During a crawl, file attributes are imported into the portal and stored as Properties. The relationship between file attributes and portal Properties can be defined in two places: the Content Type editor or the Global Document Property Map.

    Two types of metadata are returned during a crawl.

    • The crawler (aka provider) iterates over documents in a repository and retrieves the file name, path, size, and usually nothing else.

    • During the indexing step, the file is copied to portal Search, where the appropriate accessor executes full-text extraction and metadata extraction. For example, for a Microsoft Office document, the portal uses the MS Office accessor to obtain additional properties, such as author, title, manager, category, etc.

    If there are conflicts between the two sets of metadata, the setting in CrawlerConstants.TAG_PROPERTIES determines which is stored in the database (for details, see Service Configuration Pages above).

    Note:

    If any properties returned by the crawler or accessor are not included in the Global Document Property map, they are discarded. Mappings for the specific Content Type have precedence over mappings in the Global Document Property Map. The Object Created property is set by the portal and cannot be modified by code inside a Content Crawler.

  • Global ACL Sync Map: Content crawlers can import security settings based on the Global ACL Sync Map, which defines how the Access Control List (ACL) of the source document corresponds with Oracle WebCenter Interaction's authentication groups. (An ACL consists of a list of names or groups. For each name or group, there is a corresponding list of possible permissions. The ACL returned to the portal is for read rights only.) For detailed instructions, see the portal online help or the Administrator Guide for Oracle WebCenter Interaction.

    In most cases, the Global ACL Sync Map is automatically maintained by Authentication Sources. The Authentication Source is the first step in Oracle WebCenter Interaction security. To import security settings in a crawl, the back-end repository must have an associated Authentication Source. Content crawlers that import security need the user and category (domain) defined by an Authentication Source. You must configure the Authentication Source before the content crawler is run. Many repositories use the network's NT or LDAP security store; if an associated Authentication Source already exists, there is no need to create one.

    Note:

    Two settings are required to import security settings:

    • In the Web Service - Content editor on the Advanced Settings page, check Supports importing security with each document.

    • In the Content Crawler editor on the Main Settings page, check Import security with each document.

Creating Service Configuration Pages for Content Crawlers

Service Configuration (SCI) pages are integrated with portal editors and define the settings used by a content crawler.

Content crawlers must provide SCI pages for the Content Source and/or Content Crawler editors to build the preferences used by the content crawler. The URL to any associated SCI page(s) must be entered on the Advanced URLs page of the Web Service - Content editor. All optional settings are defined in the CrawlerConstants class; for a list, see SCI Variables for Content Crawler Properties. SCI provides an easy way to write configuration pages that are integrated with portal editors; it wraps the portal's XUI XML and allows you to create controls without writing XUI directly. For a complete listing of classes and methods in the plumtree.remote.sci namespace, see the IDK API documentation. The following methods must be implemented (a sketch of an editor class and a complete page example follow the list):

  • initialize passes in the namespace (whether the editor is for a Content Source or a Content Crawler) and the current settings as a NamedValueMap; data from dependent objects is also supplied.

  • getPages returns a fixed-length array containing the custom editor pages.

  • getContent returns the XML content for a page. The API provides a collection of helper classes to build the page (text box, select box, tree element, and so on).
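
The following is a minimal, hedged sketch of an editor class that implements initialize and getPages; the page example that follows it implements getContent. The class name SampleEditor, the exact Initialize signature, and the GetPages return type are assumptions based on the description above, not verified IDK signatures; consult the IDK API documentation for the definitive AbstractEditor members.

Imports Plumtree.Remote.Sci
Imports Plumtree.Remote.Util

Namespace Plumtree.Remote.Crawler.DRV
    'Minimal sketch of an editor class. SampleEditor, the Initialize signature, and
    'the GetPages return type are assumptions; see the IDK API documentation for the
    'actual AbstractEditor members.
    Public Class SampleEditor
        Inherits AbstractEditor

        Private _pages As AbstractPage()

        'Receives the current settings for the Content Source or Content Crawler editor.
        Public Overrides Sub Initialize(ByVal initialSettings As NamedValueMap, ByVal serviceInfo As NamedValueMap)
            Me.Settings = initialSettings
            'AuthPage is the SCI page shown in the example below.
            _pages = New AbstractPage() {New AuthPage(Me)}
        End Sub

        'Returns a fixed-length array with one entry per custom page.
        Public Overrides Function GetPages() As AbstractPage()
            Return _pages
        End Function
    End Class
End Namespace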

The example below is a SCI page for a Content Source editor that gets credentials for a database content crawler.

Imports System
Imports Plumtree.Remote.Sci
Imports Plumtree.Remote.Util
Imports System.Security.Cryptography

Namespace Plumtree.Remote.Crawler.DRV
    'Page to enter a user name and password; the first page for the DataSourceEditor.
    'DRVConstants and Utilities are helper classes defined elsewhere in this sample.
    Public Class AuthPage
        Inherits AbstractPage

#Region "Constructors"
        Public Sub New(ByVal editor As AbstractEditor)
            MyBase.New(editor)
        End Sub
#End Region

#Region "Functions"
        'Gets the content for the page in string form:
        'one SciTextElement for the user name and one SciPasswordElement for the password.
        'Note the way the password is stored and the encryption used.
        Public Overrides Function GetContent(ByVal errorCode As Integer, ByVal pageInfo As NamedValueMap) As String
            Dim page As New SciPage
            Dim userElement As New SciTextElement(DRVConstants.USER_NAME, "Enter the user name to authenticate to SQL Server")
            Dim userName As String = pageInfo.Get(DRVConstants.USER_NAME)
            If Not userName Is Nothing Then
                userElement.SetValue(userName)
            End If
            userElement.SetMandatoryValidation("User name is mandatory")

            Dim passElement As New SciPasswordElement(DRVConstants.PASSWORD, "Enter the password to authenticate to SQL Server", "Confirm", "Passwords do not match")
            'Preserve the stored (encrypted) password in the editor settings.
            Dim password As String = pageInfo.Get(DRVConstants.ENC_PASSWORD)
            Dim settings As NamedValueMap = Me.Editor.Settings
            settings.Put(DRVConstants.ENC_PASSWORD, password)
            Editor.Settings = settings
            'Display asterisks instead of the actual password value.
            passElement.SetValue(DRVConstants.ASTERISKS)

            page.Add(userElement)
            page.Add(passElement)

            Return page.ToString
        End Function

        'Gets the help page URI for the page.
        Public Overrides Function GetHelpURI() As String
            Return ""
        End Function

        'Gets the image (icon) URI for the page. (This setting is for backward compatibility; no icon is displayed in version 5.0.)
        Public Overrides Function GetImageURI() As String
            Return ""
        End Function

        'Gets the instructions for the page, displayed below the title in the editor.
        Public Overrides Function GetInstructions() As String
            Return "Enter SQL Server authentication information"
        End Function

        'Gets the title for the page.
        Public Overrides Function GetTitle() As String
            Return "SQL Server Authentication"
        End Function

        'Validates the current page and throws a ValidationException to report an error.
        'pageInfo contains the settings entered on the editor page.
        Public Overrides Sub ValidatePage(ByVal pageInfo As NamedValueMap)
            'If the password is not the asterisk placeholder, encrypt it and store it in the settings.
            Dim password As String = pageInfo.Get(DRVConstants.PASSWORD)
            If Not password.Equals(DRVConstants.ASTERISKS) Then
                Dim settings As NamedValueMap = Me.Editor.Settings
                Dim encPassword As String = Utilities.EncryptPassword(password, Me.Editor.Locale)
                settings.Put(DRVConstants.ENC_PASSWORD, encPassword)
                Editor.Settings = settings
            End If
        End Sub
#End Region

    End Class
End Namespace

Oracle WebCenter Interaction Federated Search Services

Federated Search provides access to external repositories without adding documents to the portal Directory. Federated Search is especially useful for content that is updated frequently or is accessed by only a small number of portal users.

When the portal requests a federated search service, the remote service accesses the content repository and sends information about each file to the portal. The returned information is displayed to users in search results. The results include a URL that opens the file from the back-end content repository.

For details on implementing federated search services, see the following sections:

Creating a Federated Search Service

The Oracle WebCenter Interaction Development Kit (IDK) allows you to create remote Federated Search services and related configuration pages without parsing SOAP or accessing the portal API. The Oracle WebCenter Interaction Development Kit (IDK) Search API provides an abstraction from the necessary SOAP calls; you simply implement an object interface.

The following best practices apply to every federated search service:

  • Know what to expect in response to a query. You must be ready to handle pagination and authentication if necessary.

  • Check the SOAP timeout for the back-end server and calibrate your response accordingly.

  • Use relative URLs in your code to allow migration to another remote server.

For details on implementing federated search services using the Oracle WebCenter Interaction Development Kit (IDK) Search API, see Oracle WebCenter Interaction Development Kit (IDK) Interfaces for Federated Search Service Development.

Oracle WebCenter Interaction Development Kit (IDK) Interfaces for Federated Search Service Development

The Oracle WebCenter Interaction Development Kit (IDK) plumtree.remote.search package/namespace includes a set of interfaces to support federated search service development.

These interfaces are:

  • IRemoteSearch

  • ISearchQuery

  • ISearchUser

  • ISearchContext

  • ISearchRecord

  • ISearchResult

In general, the portal calls these interfaces in the following order. See the definitions that follow for more information.

  1. IRemoteSearch.BasicSearch, using ISearchQuery, ISearchUser and ISearchContext as parameters.

  2. The ISearchResult object returned allows the federated search service to iterate through the search results and return them to the user. The service calls ISearchResult.GetSearchResultList to retrieve an ISearchRecord for each record returned; ISearchRecord allows you to get and set the title, description, file URL, and image URL returned to the portal (see the sketch that follows this list).
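
To make this call order concrete, the following is a minimal sketch of a federated search service. It is illustrative only: the class name SampleSearch, the exact BasicSearch signature, the SearchRecord concrete class, and the example URL are assumptions; the method names on ISearchQuery, ISearchResult, and ISearchRecord are those described in the sections below. Consult the IDK API documentation for the definitive signatures.

Imports System
Imports Plumtree.Remote.Search

'Minimal sketch only. SampleSearch, the BasicSearch signature, and the SearchRecord
'concrete class are assumptions; see the IDK API documentation for actual signatures.
Public Class SampleSearch
    Implements IRemoteSearch

    Public Sub BasicSearch(ByVal query As ISearchQuery, ByVal user As ISearchUser, ByVal searchContext As ISearchContext) Implements IRemoteSearch.BasicSearch
        'Values supplied by the portal that scope the back-end query.
        Dim searchString As String = query.GetSearchString()
        Dim maxReturn As Integer = query.GetMaxReturn()
        Dim numberToSkip As Integer = query.GetNumberToSkip()

        'Query the back-end repository here, honoring numberToSkip and maxReturn.
        'This sketch returns a single hard-coded record; SearchRecord is an assumed
        'concrete implementation of ISearchRecord.
        Dim record As ISearchRecord = New SearchRecord()
        record.SetTitle("Example document")  'title is the only required field
        record.SetDescription("Matched the query: " & searchString)
        record.SetOpenDocumentURL("http://backend.example.com/docs/1")  'hypothetical URL

        'Hand the records back to the portal through the ISearchResult object.
        Dim result As ISearchResult = query.GetSearchResult()
        result.SetSearchResultList(New ISearchRecord() {record})
        result.SetNumberSkipped(numberToSkip)
        result.SetTotalNumberofHits(1)
        result.SetDescriptionEncoded(False)
    End Sub
End Class

In this sketch a single record stands in for the list you would build from the back-end hits; a real service would respect maxReturn when building the list and set the total number of hits to the full count so that the portal can paginate the results.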

The sections below provide helpful information on the interfaces used to implement a federated search service. For a complete listing of interfaces, classes, and methods, see the IDK API documentation.

IRemoteSearch

The IRemoteSearch interface allows the portal to initiate a query over a back-end directory structure. BasicSearch allows you to pass in an ISearchQuery that defines the query to be performed. You can also pass in an ISearchUser and an ISearchContext for access to the PRC.

ISearchQuery

The ISearchQuery interface defines the search query to be performed by the portal. Using ISearchQuery, you can define the scope of the query and provide user preferences and user information to be used for authentication or user-level access control. SearchException allows you to provide useful error messages (for example, the specific preference type that was not found). For details, see the IDK API documentation. This interface provides the following methods:

  • GetMaxReturn determines the maximum number of records to return per page.

  • GetNumberToSkip returns the number of records to skip, that is, where the search will start (for example, at record 30).

  • GetSearchInfo returns any related administrative preferences set for the associated Federated Search object in the portal.

  • GetSearchResult returns an ISearchResult object that allows the federated search service to access the results returned by IRemoteSearch.

  • GetSearchString returns the query string passed to the portal.

  • GetUserInfo returns any User Information settings sent to the federated search service. To access User Information, you must configure the specific settings you need in the Web Service editor on the User Information page.

  • GetUserPrefs returns any user settings sent to the federated search service. To access user settings, you must configure the specific settings you need in the Web Service editor on the Preferences page.

ISearchUser

The ISearchUser interface can be used to access the current user's portal object ID and locale, and to obtain the login token for the current session with the portal to access the PRC.

ISearchContext

The ISearchContext interface can be used to access the portal UUID and SOAP service endpoint URI to implement the PRC.

ISearchResult

The ISearchResult interface allows you to retrieve the results returned from a search query and return them to the portal. The federated search service code must handle pagination; the methods in ISearchResult facilitate iteration over large numbers of search records.

  • Get/SetNumberSkipped returns the number of records that were skipped, that is, where the search started (for example, at record 30).

  • Get/SetSearchResultList returns a SearchRecord array of search results.

  • Get/SetTotalNumberofHits returns the total number of search records.

  • Is/SetDescriptionEncoded determines whether or not the description for the search results is HTML-encoded.

ISearchRecord

The ISearchRecord interface allows you to manipulate the metadata for each search record. Only the title is required.

  • Get/SetTitle returns the title for the search record (required).

  • Get/SetDescription returns the description for the search record. If the description should be HTML-encoded, use ISearchResult.SetDescriptionEncoded.

  • Get/SetOpenDocumentURL returns the URL that will retrieve the document. This URL must be accessible over the web or through the gateway. If the document is gatewayed, make sure to configure the Web Service object with the appropriate gateway URLs.

  • Get/SetImageURL returns the URL to the image that will be displayed with the search record.

Deploying a Federated Search Service

After implementing a federated search service, you must deploy your code.

Java

Follow the instructions below to deploy a Java federated search service:

  1. Compile the class that implements the Oracle WebCenter Interaction Development Kit (IDK) interface and copy the entire package structure to the appropriate location in your web application (usually the \WEB-INF\classes directory).

  2. Update the web.xml file in the WEB-INF directory by adding the class to the appropriate *Impl keys. For example, add your class to SearchImpl as shown below. Note: The *Impl key in the web.xml file must reference the fully-qualified name of the class. If the service uses SCI, you must also enter the fully-qualified name of the appropriate implementation of the IAdminEditor interface.

    ...
    <env-entry>
        <env-entry-name>SearchImpl</env-entry-name>
        <env-entry-value>com.plumtree.remote.search.helloworld.Search</env-entry-value>
        <env-entry-type>java.lang.String</env-entry-type>
    </env-entry>
    ...
  3. Start your application server. (In most cases, you must restart your application server after copying a file.)

  4. Test the directory by opening the following page in a web browser: http://<hostname:port>/idk/services/<servicetype>ProviderSoapBinding (for example, http://localhost:8080/idk/SearchSoapBinding). The browser should display the following message: "Hi there, this is an AXIS service! Perhaps there will be a form for invoking the service here..." When you configure the Web Service for the federated search service in the portal, enter this path as the Service Provider URL.

  5. If the federated search service uses a SCI page to define settings, you must also deploy the SCI code. For details on using SCI pages, see Creating Service Configuration Pages for Content Crawlers.

.NET

To deploy a .NET federated search service, add entries to the deployment file (web.config) that specify the assembly and the fully qualified name of the class used to implement federated search. For a federated search service, you must enter values for the following parameters, as shown in the code that follows.

  • SearchImpl

  • SearchAssembly

...
<appSettings>
    <add key='SearchAssembly' value='CompanyStoreSWS'/>
    <add key='SearchImpl' value='Plumtree.CompanyStore.SWS.CompanyStoreSWS'/>
...

If the federated search service uses a SCI page to define settings, you must also deploy the SCI code. For details on using SCI pages, see Creating Service Configuration Pages for Content Crawlers.