Oracle® Fusion Middleware Web Service Developer's Guide for Oracle WebCenter Interaction 10g Release 4 (10.3.3.0.0) Part Number E14109-02
Content services (content crawlers and federated search services) allow you to search external repositories through the portal and index external content in the portal Directory. These services allow users to access documents and other resources from multiple repositories without leaving the portal workspace.
Content crawlers access content from an external repository and index it in the portal. Portal users can search for and open crawled files through the portal Directory. Content Crawlers can be used to provide access to files on protected back-end systems without violating access restrictions. Content Crawlers are implemented as remote web services. For details, see Content Crawlers.
Federated search services are remote web services that search external repositories, including the web, internal company databases and document repositories. For details, see Oracle WebCenter Interaction Federated Search Services. For additional search customization options, see the Oracle WebCenter Interaction UI Customization Guide.
Content crawlers are extensible components used to import documents into the portal Directory from a back-end document repository, including Lotus Notes, Microsoft Exchange, Documentum and Novell. Portal users can search for and open crawled files on protected back-end systems through the portal without violating access restrictions.
The Oracle WebCenter Interaction Development Kit (IDK) allows you to create remote content crawlers and related configuration pages without parsing SOAP or accessing the portal API; you simply implement four object interfaces to access the back-end repository and retrieve files. UDDI servers are not required.
The purposes of a Content Crawler are two-fold:
Iterate over and catalog a hierarchical data repository. Retrieve metadata and index documents in the data repository and include them in the portal Directory and search index. Files are indexed based on metadata and full-text content.
Retrieve individual documents on demand through the portal Directory, enforcing any user-level access restrictions.
Content crawlers are run asynchronously by the portal Automation Service. Step 1 is completed by a Content Crawler Job, which can be run on a regular schedule to pick up updated or added files. The portal creates a Document object for each crawled file and indexes it in the Directory. Each object includes basic file information, security information, and a URL that opens the file from the back-end content repository. (No crawled files are stored on the portal server.) If the content is not contained within a file or cannot be indexed for another reason, you must implement a servlet/aspx page that returns indexable files to the portal.
Step 2 occurs when a user browses the Directory and opens a previously crawled document. After a file is crawled into the portal, users must be able to access the file from within the portal by clicking a link. This step is called click-through. If files are publicly accessible, click-through is simple. In many cases, however, you must provide access to documents that are behind a firewall or are otherwise inaccessible from the portal interface.
For details, see the following sections:
Oracle WebCenter Interaction Development Kit (IDK) Interfaces for Content Crawler Development: The Oracle WebCenter Interaction Development Kit (IDK) provides object interfaces to implement custom content crawlers. This section introduces the IDK's crawler interfaces and lists useful warnings and best practices.
Content Crawler Development Tips: These best practices and development tips apply to all content crawler development.
Content Crawler Indexing: Content crawlers must return an indexable version of each crawled file to be included in the portal Directory. This section provides an introduction to indexing.
Content Crawler Click-Through: The crawl is just the first step. This section explains how content crawlers can provide access to secured files that have been indexed in the portal. For instructions, see Implementing Content Crawler Click-Through.
Deploying a Custom Content Crawler: After coding your Content Crawler, you must deploy your code. These sections provide detailed instructions.
Configuring Content Crawlers: Implementing a successful Content Crawler in the portal requires specific configuration.
Debugging Custom Content Crawlers: Logging is a key component of any successful crawl. This page introduces logging options.
Testing Custom Content Crawlers: This checklist summarizes key tests that should be performed on every content crawler.
The Oracle WebCenter Interaction Development Kit (IDK) plumtree.remote.crawler package/namespace includes four interfaces to support content crawler development: IContainerProvider, IContainer, IDocumentProvider, and IDocument.
When the portal Automation Service initiates a crawl, it issues a SOAP request to return a list of folders. It iterates over the list of folders and retrieves lists of documents with metadata. In general, the portal calls Oracle WebCenter Interaction Development Kit (IDK) interfaces in the following order. See the definitions that follow for more information.
IContainerProvider.initialize, once per thread. Use DataSourceInfo and CrawlerInfo to initialize the Container Provider (make a connection to the back-end system and create a new session). Note: This is not a true HTTP session, and sessions can get dropped. Keep a variable that can be used to ensure the session is still initialized; if it is not, throw NotInitializedException. Store the Content Source in a member variable in initialize. Do not use direct access to the member variable; instead, use a method that checks if it is null and throws a NotInitializedException.
IContainerProvider.attachToContainer, using the starting location in the key CrawlerConstants.TAG_PATH. The key should be populated using a Service Configuration page in the Content Crawler editor. The string in TAG_PATH is service-specific; a file content crawler could use the UNC path to a folder, while a database content crawler could use the full name of a table. The following methods are not called in any specific order.
IContainer.getUsers and IContainer.getGroups on that container, as required. (IContainer.GetMetaData is deprecated.)
IContainer.getChildContainers, up to the number specified in CrawlerConstants.TAG_DEPTH. (This key must be set via a Service Configuration page.)
IContainerProvider.attachToContainer for each ChildContainer returned.
IContainer.getChildDocuments, then IDocumentProvider.attachToDocument for each ChildDocument returned.
IContainerProvider.shutdown (this call is optional and could be blocked by exceptions or network failure).
IDocumentProvider.initialize, once per thread. Note: Sessions can get dropped. Keep a variable that can be used to ensure the session is still initialized; if it is not, throw NotInitializedException.
IDocumentProvider.attachToDocument for each ChildDocument, then IDocument.getDocumentSignature to see if the document has changed. If the document is new or has been modified, the following methods are called (not in any specific order).
IDocument.getUsers and IDocument.getGroups on that document, as required.
IDocument.getMetaData to get the file name, description, content type, URL, etc.
IDocument.getDocument to index the document (only if DocFetch is used).
IDocumentProvider.shutdown (this call is optional and could be blocked by exceptions or network failure).
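The session-guard pattern called out in the initialize steps above (keep a variable, throw NotInitializedException if the session was dropped) can be sketched as follows. The stub classes here stand in for the IDK's types; the names and structure are assumptions for illustration, not the IDK API itself.

```java
// Minimal sketch of the session-guard pattern: never read the stored
// Content Source settings directly; go through a checked accessor that
// throws if the session was never initialized or has been dropped.
class NotInitializedException extends Exception {}

class DataSourceInfo {
    // Placeholder for the IDK's Content Source settings object.
}

class ContainerProviderSketch {
    private DataSourceInfo dataSource; // set only in initialize()

    public void initialize(DataSourceInfo info) {
        // Connection setup to the back-end repository would go here.
        this.dataSource = info;
    }

    // All other methods read the settings through this guard.
    protected DataSourceInfo getDataSource() throws NotInitializedException {
        if (dataSource == null) {
            // Session was dropped; the portal responds by re-initializing.
            throw new NotInitializedException();
        }
        return dataSource;
    }
}
```

The same guard applies to IDocumentProvider: any method body that needs the Content Source calls the checked accessor rather than touching the member variable.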
The sections below provide helpful information on the interfaces used to implement a content crawler. For a complete listing of interfaces, classes, and methods, see the Oracle WebCenter Interaction Development Kit (IDK) API documentation.
The IContainerProvider interface allows the portal to iterate over a back-end directory structure. The portal calls IContainerProvider first in most cases. This interface provides the following methods:
initialize allows the remote server to initialize a session and create a connection to the back-end document repository. The Oracle WebCenter Interaction Development Kit (IDK) passes in a DataSourceInfo object that contains the necessary settings associated with a Content Source object (the name of a directory in the repository and the credentials of a system user). The CrawlInfo object contains the settings for the associated Content Crawler object in the portal. The start location of the crawl is the value stored in the key CrawlerConstants.TAG_PATH, set using a Service Configuration page.
attachToContainer is always the next call after initialize; the order of the remaining calls is not defined. It associates the session with the container specified in the sContainerLocation parameter; subsequent calls refer to this container until the next attachToContainer call. The value in the sContainerLocation parameter will be the CrawlerConstants.TAG_PATH key for the initial attach, and the value specified in ChildContainer.GetLocation for subsequent attaches. Each time attachToContainer is called, discard any state created during the previous attachToContainer call. If multiple translations of the container are available, select the most appropriate using the Locale parameter, which can be sent as a full locale (e.g., "en-us") or in the abbreviated language-only format (e.g., "en"). Note: If the container specified does not exist, you must throw a new NoLongerExistsException to avoid an infinite loop. If the Content Crawler is configured to delete missing files, all files in the container will be removed from the portal index.
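Selecting the most appropriate translation from the Locale parameter might look like the following sketch. The helper name and the fallback policy (exact match, then language-only, then any regional variant) are assumptions for illustration, not part of the IDK.

```java
import java.util.List;

// Hedged sketch: pick the best available translation for the Locale string
// the portal sends, which may be a full locale ("en-us") or language-only ("en").
class LocaleMatcher {
    static String bestMatch(String requested, List<String> available) {
        String req = requested.toLowerCase();
        if (available.contains(req)) return req;           // exact match
        String lang = req.split("-")[0];                   // "en-us" -> "en"
        if (available.contains(lang)) return lang;         // language-only match
        for (String a : available) {                       // any regional variant
            if (a.startsWith(lang + "-")) return a;
        }
        return available.isEmpty() ? null : available.get(0); // repository default
    }
}
```

For example, a request for "en-us" against available translations "de" and "en-gb" would fall through to the regional-variant check and return "en-gb".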
shutdown allows the portal to clean up any unused sessions that have not yet expired. Content crawlers are implemented on top of standard cookie-based session mechanisms, so sessions expire and resources and connections are released after an inactivity period, typically around 20 minutes. As a performance optimization, the portal might send a Shutdown message notifying the remote server to end the session immediately. No parameters are received and none are returned. Do not assume that Shutdown will be called; the call could be blocked by an exception or network failure. Remote servers must terminate sessions after an inactivity timeout, but can choose to ignore the Shutdown message and keep the session alive until it times out.
The portal uses the IContainer interface to query information about back-end resource directories. This interface provides the following methods:
getGroups and getUsers return a list of the portal groups or users that have read access to the container. These calls are made only if the Web Service and Content Crawler objects are configured to import security. The portal batches these calls; the content crawler code should return all groups or users at once.
getChildContainers returns the containers inside the current container (i.e., subfolders of a folder). The value stored in the key CrawlerConstants.TAG_DEPTH is used to determine how many times getChildContainers is called (crawl depth). This value must be set via a Service Configuration page. If no value is stored with this key, getChildContainers is never called; only the documents in the folder specified for the start location are crawled into the portal. Note: Setting CrawlerConstants.TAG_DEPTH to -1 could result in an infinite loop.
getChildDocuments returns the documents inside the current container (folder). The portal batches this call; the content crawler code should return all documents at once. The TypeNamespace and TypeID parameters define the Content Type for the document: TypeNamespace associates the document with a row in the Global Content Type Map, and TypeID associates it with a particular Content Type. The value in ChildDocument.getLocation is used in IDocumentProvider.attachToDocument, so any information required by attachToDocument must be included in the location string. You can describe the document using FILE or MIME, as shown in the example below.
ChildDocument doc = new ChildDocument();
// Location is a crawler-specific string used to retrieve the doc, e.g., the file name
String filename = "WordDoc.doc";
doc.setLocation(filename);
// TypeNameSpace is either FILE or MIME unless using a custom namespace (Notes, Exchange)
// NOTE: the example uses getCode because setTypeNameSpace expects a String
doc.setTypeNameSpace(TypeNamespace.MIME.getCode());
// For FILE descriptions, TypeID is simply the document name with extension (i.e., filename)
// For MIME descriptions, set the document type or map multiple file extensions to MIME types
doc.setTypeID("application/msword");
// DisplayName is the name to display in the Knowledge Directory, usually overridden in IDocument.getMetaData()
doc.setDisplayName(filename);
getMetaData (DEPRECATED) returns all metadata available in the repository about the container. The name and location are used in mirrored crawls to mirror the structure of the source repository. In most cases, the container metadata is only the name and description.
The IDocumentProvider interface allows the portal to specify back-end documents for retrieval. In most cases, the portal calls IContainerProvider first. However, in some cases, the service is used to refresh existing documents, and IDocumentProvider might be called first.
initialize allows the remote server to initialize a session and create a connection to the back-end document repository. (For details on parameters and session state, see IContainerProvider.initialize above.) IDocumentProvider.initialize will be called once per thread as long as the session does not time out or get interrupted for other reasons, and attachToDocument will be called next.
attachToDocument is always the next call made after initialize; the order of the remaining calls is not defined. This method 'attaches' a session to the document specified in the sDocumentLocation parameter; subsequent calls refer to this document until the next attachToDocument call. The sDocumentLocation string is the value specified in ChildDocument.getLocation (ChildDocument is returned by IContainer.getChildDocuments). If multiple translations of the document are available, select the most appropriate using the Locale parameter, which can be sent as a full locale (e.g., 'en-us') or in the abbreviated language-only format (e.g., 'en'). When implementing this method, you can throw the following exceptions:
Exception | Description |
---|---|
NoLongerExistsException | The document has been moved or deleted. (The refresh agent will delete documents from the portal index only if this exception has been thrown.) |
NotAvailableException | The document is temporarily unavailable. |
NotInitializedException | The IDocumentProvider is in an uninitialized state. |
AccessDeniedException | Access to this document is denied. |
ServiceException | Propagates the exception to the portal and adds an entry to Logging Spy. |
shutdown allows the portal to clean up any unused sessions that have not yet expired. (For details, see IContainerProvider.shutdown above.)
The IDocument interface allows the portal to query information about and retrieve documents. This interface provides the following methods:
getDocumentSignature allows the portal to determine whether the document has changed and should be re-indexed and flagged as updated. It can be a version number, a last-modified date, or the CRC of the document. The Oracle WebCenter Interaction Development Kit (IDK) does not enforce any restrictions on what to use for the document signature, nor does it provide any utilities to get the CRC of the document. This is always the first call made to IDocument; on re-crawls, if the document signature has not changed, no additional calls will be made.
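Since the IDK leaves the signature format open, one simple, cheap choice is to combine the last-modified timestamp and file size. The sketch below illustrates that choice; the class and method names are assumptions, not IDK API.

```java
import java.io.File;

// Sketch: a document signature built from last-modified time and size.
// Any string that changes whenever the document changes will do; this
// particular format is an assumption, not an IDK requirement.
class SignatureSketch {
    static String signatureFor(long lastModifiedMillis, long sizeBytes) {
        return lastModifiedMillis + ":" + sizeBytes;
    }

    static String signatureFor(File f) {
        return signatureFor(f.lastModified(), f.length());
    }
}
```

On a re-crawl, the portal compares the returned string with the stored one; any difference triggers re-indexing.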
getMetadata returns all metadata available in the repository about the document. The portal maps this data to properties based on the mappings defined for the appropriate Content Type, along with metadata returned by the associated accessor. The following field names are reserved. Additional properties can be added using the portal's Global Document Property Map; for details, see Configuring Custom Content Crawlers: Properties and Metadata. (Any properties that are not in the Global Document Property Map will be discarded.)
Field Name | Description |
---|---|
Name | REQUIRED. The name of the link to be displayed in the portal Knowledge Directory. Note: By default, the portal uses the name from the crawled file properties as the name of the card. To set the portal to use the Name property returned by getMetadata, you must set CrawlerConstants.TAG_PROPERTIES to REMOTE using the Service Configuration Interface. |
Description | The description of the link to be displayed in the portal Directory. |
UseDocFetch | Whether or not to use DocFetch to retrieve the file. The default is False. If you use DocFetch, the value in the File Name field is used to retrieve the file during both indexing and click-through. If you do not use DocFetch, you must provide values for Indexing URL and Click-Through URL. |
File Name (required for DocFetch) | The name of the click-through file, used for DocFetch. |
Content Type (required for DocFetch) | The content type of the click-through file, used to associate the crawled document with the Global Content Type Map. |
Indexing URL (public URL) | (Required if not using DocFetch.) The URL to the file that can be indexed in the portal. URLs can be relative to the Remote Server. If a file is publicly accessible via a URL, that URL can be used to access the document for both indexing and click-through. Documents that cannot be indexed must provide an additional URL at crawl-time for indexing purposes. For details on crawling secured content, see Accessing Secured Content. |
Click-Through URL (public URL) | (Required if not using DocFetch.) The URL to the click-through file. URLs can be relative to the Remote Server. For details on crawling secured content, see Accessing Secured Content. |
Image UUID (optional) | This parameter is only required for custom Content Types. For standard Content Types, the accessor will assign the correct image UUID. |
getDocument returns the path to the file if it was not provided by getMetaData. (For public URLs, you do not need to implement getDocument, but you must provide values for IndexingURL and ClickThroughURL in getMetaData.) During crawl-time indexing, this file is copied to the web-accessible IndexFilePath location specified in your deployment descriptor and returned to the portal via a URL to that location. If the file is not supported for indexing by the portal, implement getDocument to convert the document into a supported file format for indexing (e.g., text-only) and return that file during indexing. Note: To create a custom implementation of getDocument, you must set UseDocFetch to True. When a user clicks through to the document, the display file is streamed back to the browser via the DocFetch servlet. Any necessary cleanup of temporary files should be done on subsequent calls to IDocumentProvider.attachToDocument or IDocumentProvider.shutdown. For details on accessing secured content and files that are not accessible via a public URL, see Content Crawler Click-Through.
getGroups and getUsers return a list of the groups or users with read access to the document. Each entry is an ACLEntry with a domain and group name. The portal batches these calls; the content crawler code should return all groups or users at once. This call is made only if the Supports importing security with each document option is checked on the Advanced Settings page of the Web Service editor.
Content crawler properties are configured using a defined set of variables.
The Content Crawler object should include the following properties. These properties can be hard-coded or configured using a Service Configuration (SCI) page. For details on SCI pages, see Creating Service Configuration Pages for Content Crawlers.
Variable | Property Value |
---|---|
TAG_PATH | The path to the container to crawl. Depending on the type of container, this could be a URL, a UNC path, information for a table in a database, information for a view in Notes, etc. |
TAG_DEPTH | If the variable TAG_DEPTH has not been included, the content crawler only crawls documents in the first directory. This works for resources with no subdirectories, such as a database. For a file system, it is usually best to use a SCISelectElement to let users select the crawl depth (where -1 means crawl until subcontainers return no child containers). If you do not want users to set this option, use a SCIHiddenElement for the same field. Note: The SCISelectElement must call |
TAG_PROPERTIES | (Optional.) Represents whether properties from GetMetaData or the local accessor should be used. Setting this variable to TAG_PROPERTIES_LOCAL causes the local accessor properties used to retrieve a file to override the properties returned by the content crawler. Setting the variable to TAG_PROPERTIES_REMOTE causes the properties from GetMetaData to override properties from local accessors. |
These best practices and development tips apply to all content crawler development.
Use logging extensively to provide feedback during a crawl. In some cases, the portal reports a successful crawl when there were minor errors. Use Log4J or Log4Net to track progress.
Use relative URLs in your code to allow migration to another remote server. Note: These URLs might be relative to different base URL endpoints. The click-through URL is relative to the remote server base URL, and the indexing URL is relative to the SOAP URL. Depending on whether you have implemented your content crawler using Java or .NET, the base URL endpoint for the remote server might differ from the base URL endpoint for SOAP. For example, the Java IDK uses Axis, which implements programs as services. In Axis, the SOAP URL is the remote server base URL with '/services' attached to the end. Given the remote server base URL http://server:port/sitename, the SOAP URL would be http://server:port/sitename/services. If both click-through and indexing URLs point to the same servlet (http://server:port/sitename/customdocfetch?docId=12345), the relative URLs would be different. The relative URL for indexing would be "../customdocfetch?docId=12345" and the relative URL for click-through would be "customdocfetch?docId=12345". (Since the indexing URL is relative to the SOAP URL, the '../' reorients the path from http://server:port/sitename/services to http://server:port/sitename, yielding the correct URL to http://server:port/sitename/customdocfetch?docId=12345.)
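The relative-URL arithmetic described above can be checked with the JDK's standard URI resolution. The sketch below reproduces the worked example from the text; the server name and port are placeholders, and note the trailing slashes, since URI resolution drops the last path segment of a base URL that lacks one.

```java
import java.net.URI;

// Demonstrates resolving the indexing and click-through relative URLs
// against their respective base endpoints, as described in the text.
// "server:8080" and "sitename" are placeholder values.
class RelativeUrlDemo {
    // SOAP base: remote server base with '/services' appended (Axis convention).
    static final URI SOAP_BASE   = URI.create("http://server:8080/sitename/services/");
    static final URI SERVER_BASE = URI.create("http://server:8080/sitename/");

    // Indexing URL is relative to the SOAP URL, so "../" climbs back out of /services.
    static URI indexingUrl() {
        return SOAP_BASE.resolve("../customdocfetch?docId=12345");
    }

    // Click-through URL is relative to the remote server base URL.
    static URI clickThroughUrl() {
        return SERVER_BASE.resolve("customdocfetch?docId=12345");
    }
}
```

Both calls resolve to the same absolute URL, http://server:8080/sitename/customdocfetch?docId=12345, which is exactly the equivalence the paragraph above describes.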
Do your initial implementation of IDocumentProvider and IDocFetchProvider in separate classes, but factor out some code to allow reuse of the GetDocument and GetMetaData methods. See the Viewer sample application included with the Oracle WebCenter Interaction Development Kit (IDK) for sample code.
Do not make your calls order-dependent. The portal can make the above calls in any order, so your code cannot be dependent on order.
If a document or container does not exist, always throw a new NoLongerExistsException. This is the only way the portal can determine if the file or folder has been deleted. Not throwing the exception could result in an infinite loop.
If there are no results, return a zero-length array. If your intention is to return no results, use a zero-length array, not an array with empty strings. (For example, return new ChildContainer[0];)
Check the SOAP timeout for the back-end server and calibrate your response accordingly. The SOAP timeout is set in the Web Service editor.
Pages that are not publicly accessible must be gatewayed. Gateway settings are configured in the Web Service editor on the HTTP Configuration page, and in the Content Source editor. You can gateway all URLs relative to the remote server or enter individual URLs and add paths to other servers to gateway additional pages.
You must define mappings for any associated Content Types before a content crawler is run. The portal uses the mappings in the Content Type definition to map the data returned by the content crawler to portal properties. Properties are only stored if you configure the Content Type mapping before running the content crawler. (Properties that apply to all documents are configured in the Global Document Property Map.)
To import security settings, the back-end repository must have an associated Authentication Source. Content crawlers that import security need the user and category (domain) defined by an Authentication Source. You must configure the Authentication Source before the content crawler is run. Many repositories use the network's NT or LDAP security store; if an associated Authentication Source already exists, there is no need to create one. For details on Authentication Sources, see the portal online help.
If you use a mirrored crawl, only run it when you first import documents. Always check every directory after a mirrored crawl. After you have imported documents into the portal, it is safer to refresh your portal directory using a regular crawl with filters.
For mirrored crawls, make crawl depth as shallow as possible. Portal users want to access documents quickly, so folder structure is important. Also, the deeper the crawl, the more extensive your QA process will be.
Use filters to sort crawled documents into portal folders. Mirrored crawls can return inappropriate content and create unnecessary directory structures. Filters are a more efficient way to sort crawled documents. To use filters, choose Apply Filter of Destination Folder in the Content Crawler editor. For details on filters, see the portal online help.
Do not use automatic approval unless you have tested a content crawler. It is dangerous to use automatic approval without first testing the structure, metadata and logs for a content crawler.
To clear the deletion history, you must re-open the Content Crawler editor. To re-crawl documents that have been deleted from the portal, you must re-open the Content Crawler editor and configure the Importing Documents settings on the Advanced Settings page.
You can also import access restrictions during a crawl; for details, see Configuring Content Crawlers. For more information on the configuration settings above, see the following sections:
A crawler can use a range of credential types to access a secure file.
If you need to apply credentials to access a file, you can use any of the following options:
Credential Type | Description |
---|---|
SSO | SSO must be configured in the portal and on the remote server, using the instructions of your SSO vendor. |
Basic Authentication | Set the remote server to pass the user's basic authentication headers to the remote resource. Both sources must be using the same directory. For example, if a user logs in using an iPlanet directory, it is unlikely they will be able to access an Exchange resource. |
Content Source credentials | Content Source credentials are generally valid only for crawling a database. Most other use cases require user-specific credentials. |
User preferences via form-based authentication | Preferences stored in the portal database can be used to create a cookie if the resource accepts session-based authentication. User preferences generally cannot be used if the resource expects basic authentication. For example, the Content Service for Notes uses this approach when Notes is using session-based (cookie) authentication. You must enter all User settings and User Information required by a content crawler on the Preferences page of the Content Crawler editor. |
Force users to log in | If the required credentials are not available, redirect the user to the appropriate page and/or provide an intelligible error message. For example, the Content Service for Notes uses this approach when Notes is using basic authentication. |
A content crawler must return an indexable version of each crawled file to be included in the portal Directory.
The crawler's servlet/aspx page must return content in an indexable format and set the content type and file name using the appropriate headers. Any information required to retrieve the document must be included in the query string of the index URL, including credentials (if necessary).
Note: The request from the portal to the indexing servlet is a simple HTTP GET. This call is not gatewayed, so the content crawler code does not have access to the Content Source settings, user credentials and preferences, or any other information through the Oracle WebCenter Interaction Development Kit (IDK).
For files, content can be streamed directly from the source directory. If the content is not in a file, the crawler code should create a temporary file that includes the content with as little extraneous information as possible.
For details, see the following sections:
If the content being crawled is in a file, the file can be streamed directly from the source directory.
The following steps describe a typical custom mechanism to return files in an indexable format and set the content type and file name using the appropriate headers.
In IDocument, get all the variables needed to access the document and add them to the query string of the indexing servlet. This could be as simple as a UNC path for a file crawler, or as complicated as server name, database name, schema, table, primary key(s) and primary key value(s) for a database record. It depends entirely on the content crawler and the document being crawled. Make sure all values are URLEncoded.
Add the content type to the query string.
In IDocument, add URL-encoded credentials to the query string. Keep in mind that URL encoding represents a space as '+', so a literal '+' in the credentials must be escaped as %2B; otherwise it will be decoded back to a space in the indexing servlet.
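The '+' pitfall in the step above can be seen directly with the JDK's standard encoder and decoder. This is a small illustration, not crawler code; the class name is an assumption.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Shows why a literal '+' in a credential must be encoded as %2B:
// the encoder writes a space as '+', and the decoder maps any bare
// '+' back to a space.
class CredentialEncodingDemo {
    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    static String decode(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }
}
```

Encoding "p+w d" yields "p%2Bw+d": the literal '+' becomes %2B and the space becomes '+', so the round trip is lossless; a '+' left unencoded would come back as a space.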
Pass back URLs via the DocumentMetadata class that point to the servlet(s):
UseDocFetch: Set to False.
IndexingURL: Set to the endpoint/servlet that provides the indexable version of the file, including the query string arguments defined in steps 1-3 above.
ClickThroughURL: Set to the endpoint/servlet that provides the path to be used when a user clicks through to view the file. During the crawl, the ClickThroughURL value is stored in the associated Directory document.
In the indexing servlet, get the location string and content type from the query string and parse the location string to get the path to the resource.
Obtain the resource.
Set the ContentType header and the Content-Disposition header.
Stream the file (binary or text) or write out the file (text) in a try-catch block.
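The last three steps (parse the query string, set the headers, stream the file) form the core of the indexing servlet. The sketch below isolates that core with the servlet API's response object replaced by a header map and an OutputStream, so the logic can stand alone; the names here are assumptions for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

// Sketch of the indexing servlet's core logic. In a real servlet the
// headers would be set on HttpServletResponse and 'out' would be the
// response's output stream.
class IndexResponseSketch {
    static void writeIndexResponse(Path file, String contentType,
                                   Map<String, String> headers,
                                   OutputStream out) throws IOException {
        String shortName = file.getFileName().toString();
        // Set the ContentType and Content-Disposition headers.
        headers.put("Content-Type", contentType);
        headers.put("Content-Disposition", "inline; filename=" + shortName);
        // Stream the file; try-with-resources guarantees the input
        // stream is closed even if streaming fails partway.
        try (InputStream in = Files.newInputStream(file)) {
            in.transferTo(out);
        }
    }
}
```

A servlet's doGet would parse the file path and content type from the query string, URL-decode them, and then hand off to logic like this.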
If crawled content cannot be indexed as-is, the crawler code must create a temporary file for indexing.
The following steps describe a typical custom mechanism to create a temporary indexable file with as little extraneous information as possible and set the content type and file name using the appropriate headers. In most cases, the resource has already been accessed in attachToDocument, so there is no need to call the back-end system again. This example does not use credentials. If you do not want to create temporary files, you can implement an indexing servlet that returns indexable content.
In IDocument, write a temporary file to a publicly accessible location (usually the root directory of the web application, as shown in the code snippet below).
MessageContext context = MessageContext.getCurrentContext();
HttpServletRequest req = (HttpServletRequest) context.getProperty(HTTPConstants.MC_HTTP_SERVLETREQUEST);
StringBuffer buff = new StringBuffer();
buff.append(req.getScheme()).append("://").append(req.getServerName())
    .append(':').append(req.getServerPort()).append(req.getContextPath());
String indexRoot = buff.toString();
Pass back URLs via the IDK's DocumentMetadata class that point to the servlet(s):
UseDocFetch: Set to False.
IndexingURL: Set to the endpoint/servlet that provides the indexable version of the file, including the query string arguments described in the next step.
ClickThroughURL: Set to the endpoint/servlet that provides the path to be used when a user clicks through to view the file. During the crawl, the ClickThroughURL value is stored in the associated Directory document.
Add the temporary file path to the query string, along with the content type. Make sure to URLEncode both.
In the indexing servlet, get the file path and content type from the query string. Get the file name from the file path.
Set the ContentType header and the Content-Disposition header.
Stream the file (binary or text) or write out the file (text) in a try-catch block.
In the finally block, delete the file.
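The URL-construction step above (adding the URL-encoded temporary file path and content type to the query string) can be sketched as follows. The servlet mapping name `indexServlet` and the parameter names are assumptions for illustration:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch: build an IndexingURL that points at an indexing servlet, with the
// temporary file path and content type URL-encoded on the query string.
// "indexServlet" is an assumed mapping name, not an IDK constant.
public class IndexUrlBuilder {

    public static String buildIndexingUrl(String baseUrl, String tempFilePath, String contentType) {
        return baseUrl + "/indexServlet"
                + "?indexFile=" + URLEncoder.encode(tempFilePath, StandardCharsets.UTF_8)
                + "&contentType=" + URLEncoder.encode(contentType, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(buildIndexingUrl("http://myhost:8080/crawler",
                "C:\\temp\\doc 1.txt", "text/plain"));
    }
}
```

Encoding both values ensures that spaces, backslashes, and slashes in the path or content type survive the round trip through the query string.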
The following sample code indexes a text file.
logger.Debug("Entering Index.Page_Load()");
// try to get the .tmp filename from the Content Crawler
string indexFileName = Request[Constants.INDEX_FILE];
if (indexFileName != null)
{
    StreamReader sr = null;
    string filePath = "";
    try
    {
        filePath = HttpUtility.UrlDecode(indexFileName);
        string shortFileName = filePath.Substring(filePath.LastIndexOf('\\') + 1);
        // set the proper response headers
        Response.ContentType = "text/plain";
        Response.AddHeader("Content-Disposition", "inline; filename=" + shortFileName);
        // open the file
        sr = new StreamReader(filePath);
        // stream out the information into the response
        string line = sr.ReadLine();
        while (line != null)
        {
            Response.Output.WriteLine(line);
            line = sr.ReadLine();
        }
    }
    catch (Exception ex)
    {
        logger.Error("Exception while trying to write index file: " + ex.Message, ex);
    }
    finally
    {
        // close and delete the temporary index file even if there is an error
        if (sr != null) { sr.Close(); }
        if (!filePath.Equals("")) { File.Delete(filePath); }
    }
    // done
    return;
}
...
After a repository is crawled and files are indexed in the portal, users must be able to access the file from within the portal by clicking a link; this is the 'click-through' step.
Click-through retrieves a crawled file over HTTP to be displayed to the user. To retrieve documents that are not available via a public URL, you can write your own code or use the DocFetch mechanism in the Oracle WebCenter Interaction Development Kit (IDK). If you handle document retrieval, you can also implement custom caching or error handling. Click-through links are gatewayed, so the content crawler can leverage user credentials and other preferences.
For details, see the following sections:
The content crawler's click-through implementation must return content in a readable format and set the content type and file name using the appropriate headers.
The following example uses a file, but the crawled resource could be any type of content. If the content is not in a file, the click-through servlet should create a representation with as little extraneous information as possible in a temporary file (for example, for a database, you would retrieve the record and transform it to HTML). See Creating Temporary Files for Indexing. You can also use the Oracle WebCenter Interaction Development Kit (IDK) DocFetch mechanism to handle indexing and click-through; see Content Crawler DocFetch.
Create the clickThroughServlet, and add a mapping in web.xml.
Complete the implementation of IDocument.getMetaData. Set the ClickThroughURL value to a URL constructed using the following steps:
Construct the base URL of the application using the same approach as in the index servlet.
Add the servlet mapping to the clickThroughServlet.
Add any query string parameters required to access the document from the clickThroughServlet (or aspx page). Remember: The click-through page will have access to Content Source parameters (as administrative preferences), but no access to content crawler settings.
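Steps 1-3 can be sketched in plain Java. The mapping name `clickThroughServlet` must match the entry added to web.xml; the parameter names here are illustrative:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: base URL + clickThroughServlet mapping + URL-encoded query parameters.
// The parameters shown in main() are examples, not required IDK names.
public class ClickThroughUrlSketch {

    public static String buildClickThroughUrl(String baseUrl, Map<String, String> params) {
        StringBuilder url = new StringBuilder(baseUrl).append("/clickThroughServlet");
        char sep = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            url.append(sep).append(e.getKey()).append('=')
               .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
            sep = '&';
        }
        return url.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("filePath", "\\\\server\\share\\doc 1.txt");
        params.put("contentType", "text/plain");
        System.out.println(buildClickThroughUrl("http://myhost:8080/crawler", params));
    }
}
```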
To authenticate to the back-end resource, you can use basic authentication, User Preferences, User Info, or credentials from the Content Source. Suggestions for each follow; security must be tailored to your content crawler.
Use Basic Authentication to use the same credentials used to log in to the portal. For example, if the portal uses AD credentials, Basic Auth could be used to access NT files.
Use (encrypted) User Preferences if the authentication source is different from the one used to log in to the portal, for example, if the portal login uses iPlanet but you need to access an NT or Documentum file.
Use (encrypted) User Info if the encrypted credentials are stored in another profile source and imported using a profile job.
Use Content Source credentials when there are limited connections, for example, with a database.
Extract the parameters from the query string as required.
Display the page.
If there is already an HTML representation of the page, authenticate to the page. If the site is using basic authentication and you are using basic authentication headers, simply redirect to that page. If the site is using basic authentication and you are not using basic authentication, users must log in unless the site and the portal are using the same SSO solution. If the site is using form-based authentication, post to the site and follow the redirect.
If there is not an HTML representation of the page, retrieve the resource and stream it out to the client as shown in the sample code below (Java). If you use a temporary file, put the code in a try-catch-finally block, and delete the file in the finally block.
// get the content type, passed as a query string parameter
String contentType = request.getParameter("contentType");
// if this is a file, get the file name
String filename = request.getParameter("filename");
// set the content type on the response
response.setContentType(contentType);
// set the content disposition header to tell the browser the file name
response.setHeader("Content-Disposition", "inline; filename=" + filename);
// set the header that tells the gateway to stream this through the gateway
response.setHeader("PTGW-Streaming", "Yes");
// get the content - for a file, get a file input stream based on the path (shown below)
// other repositories may simply provide an input stream
// NOTE: this code contains no error checking
String filePath = request.getParameter("filePath");
File file = new File(filePath);
FileInputStream fileStream = new FileInputStream(file);
// create a byte buffer for reading the file in 40k chunks
int BUFFER_SIZE = 40 * 1024;
byte[] buf = new byte[BUFFER_SIZE];
// start reading the file
int bytesRead = fileStream.read(buf);
ServletOutputStream out = response.getOutputStream();
// start writing out the body
out.write(buf, 0, bytesRead);
// continue writing until the input stream returns -1
while ((bytesRead = fileStream.read(buf)) != -1) {
    out.write(buf, 0, bytesRead);
}
The Oracle WebCenter Interaction Development Kit (IDK) DocFetch mechanism is one way for a content crawler to retrieve files that are not accessible via a public URL.
If a content crawler implements DocFetch, the Oracle WebCenter Interaction Development Kit (IDK) manages the process of creating temporary files for indexing and click-through. DocFetch also allows you to implement user-level access control. You can pass user preferences or User Information to the content crawler, and this information can be used by DocFetch to authenticate with the back-end system or limit access to specific users.
Note:
DocFetch does not allow you to use multiple methods of authentication or implement custom error handling. If you cannot use public URLs and are not using DocFetch, you must implement a custom document fetching mechanism (i.e., servlet or aspx page). If necessary, you can implement separate servlets for indexing and click-through.
Content crawler code can use DocFetch to access files that are not available via a public URL.
To use DocFetch, there are three relevant fields in the DocumentMetaData object returned in the portal's call to IDocument.getMetaData:
UseDocFetch: Set UseDocFetch to True.
File Name: Set the File Name to the name of the file in the repository (must be unique).
Content Type: Set the Content Type to the content type for the file. The content type must be mapped to a supported Content Type in the portal.
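A minimal sketch of setting these three fields follows. The `buildDocFetchMetadata` helper and its map-based representation are stand-ins for illustration only; the real IDK DocumentMetaData type exposes setters, and its exact API should be checked in the IDK documentation:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in illustration of the three fields a DocFetch crawler must set
// in IDocument.getMetaData; this is not the IDK DocumentMetaData API.
public class MetadataSketch {

    public static Map<String, String> buildDocFetchMetadata(String fileName, String contentType) {
        Map<String, String> meta = new HashMap<>();
        meta.put("UseDocFetch", "True");      // route click-through via the DocFetch servlet
        meta.put("FileName", fileName);       // must be unique in the repository
        meta.put("ContentType", contentType); // must map to a portal Content Type
        return meta;
    }

    public static void main(String[] args) {
        System.out.println(buildDocFetchMetadata("report-q1.doc", "application/msword"));
    }
}
```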
When UseDocFetch is set to True, the Oracle WebCenter Interaction Development Kit (IDK) sets the ClickThroughURL stored in the Directory to the URL of the DocFetch servlet, and calls IDocument.getDocument to retrieve the file path to the indexable version of the document. When a user subsequently clicks a link to the crawled document in the Directory, the request to the DocFetch servlet makes several calls to the already-implemented content crawler code. getDocument is called again, but this time as part of the IDocFetch interface. The file path returned is opened by the servlet and streamed back in the response. As explained above, the content crawler must implement the getDocument method in both the Crawler.IDocument and DocFetch.IDocFetch interfaces to return the appropriate file path(s). If the repository cannot access files directly, you must serialize the binary representation to a temporary disk file and return that path. The IDocument and IDocFetch interfaces can use the same process. The Oracle WebCenter Interaction Development Kit (IDK) provides a cleanup call to delete any temporary files later.
Note:
If getDocument returns a path to a file (not a URL to a publicly accessible file), the file name must be unique. Otherwise, all copies of the file are removed during cleanup, including copies that are currently in use by other users.
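One way to guarantee unique temporary file names, so that cleanup can never delete a copy another user is still reading, is `File.createTempFile`, which generates a name no concurrent call will repeat. A sketch, assuming the crawled content is already in memory as a string:

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.UncheckedIOException;

// Sketch: serialize crawled content to a uniquely named temporary file.
// createTempFile appends a unique suffix, so a cleanup pass cannot
// delete a copy that another user is still reading.
public class TempFileSketch {

    public static File writeTempCopy(String baseName, String content, File dir) {
        try {
            File tmp = File.createTempFile(baseName + "-", ".tmp", dir);
            try (FileWriter out = new FileWriter(tmp)) {
                out.write(content);
            }
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File dir = new File(System.getProperty("java.io.tmpdir"));
        File a = writeTempCopy("doc", "hello", dir);
        File b = writeTempCopy("doc", "hello", dir);
        System.out.println(!a.getName().equals(b.getName())); // true: names never collide
        a.delete();
        b.delete();
    }
}
```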
To use user preferences or User Information, you must configure the settings to be used in the Content Crawler editor. DocFetch interfaces are called in the following order. For a complete listing of interfaces, classes, and methods, see the Oracle WebCenter Interaction Development Kit (IDK) API documentation.
IDocFetchProvider.initialize uses the DataSourceInfo, UserPrefs, and UserInfo returned from the portal to make a connection to the back-end system and create a new session. The implementation should initialize in a similar manner to IDocumentProvider.initialize. IDocFetchProvider can use UserInfo and UserPrefs to perform additional authentication. The ICrawlerLog object is not available. Note: Sessions can get dropped. Keep a variable that can be used to ensure the session is still initialized; if it is not, throw NotInitializedException.
IDocFetchProvider.attachToDocument using the authentication information provided (including UserPrefs and UserInfo).
IDocFetch.getMetaData: The only DocumentMetadata required for click-through is the file name and content type.
IDocFetch.getDocument: As noted above, the IDocFetch.getDocument method should reuse as much code as possible from the IDocument.getDocument method. The Oracle WebCenter Interaction Development Kit (IDK) looks in web.config/*.wsdd to get the file path and URL to the directory for creating temporary files.
IDocFetchProvider.Shutdown (optional).
Content crawler code should handle exceptions.
Most calls should be put into a try-catch block. The scope of the try-catch block should be small enough to diagnose errors easily. In the catch block, log the error to both Log4j/Log4net and ICrawlerLog, and then re-throw the exception as a ServiceException. This results in the error appearing in the job log. However, only the error message shows up in the job log; check the Log4j/Log4net log for the full stack trace. The following exceptions have special meaning:
NotInitializedException means to re-initialize.
NoLongerExistsException means that the folder or document no longer exists, and tells the portal to delete that resource.
If any exception is thrown during the initial attachToContainer, the crawl aborts. If NotInitializedException is thrown, the content crawler re-initializes. If NoLongerExistsException is thrown, the resource is removed from the Directory and the content crawler continues to the next resource. If any other exception is thrown, the error is logged and the content crawler continues to the next resource. To use ICrawlerLog, store the member variable in your implementation of IContainerProvider.initialize. To send a log message, simply add the following line: m_logger.Log('enter logging message here')
Note: The container provider log sends messages only after AttachToContainer and after exceptions; the document provider log sends messages only after exceptions. For more information and the best visibility, use Log4j/Log4net.
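The control flow described above can be sketched with stand-in exception classes. These mirror the IDK exception names but are local illustrations; the simulated `crawlOne` step and its trigger strings are assumptions:

```java
import java.util.Arrays;
import java.util.List;

// Stand-ins for the IDK exception types, used only to illustrate crawl control flow.
class NotInitializedException extends Exception {}
class NoLongerExistsException extends Exception {}

public class CrawlLoopSketch {

    // Simulated per-document crawl step; "stale" and "gone" trigger the
    // special-meaning exceptions described above.
    static void crawlOne(String doc) throws NotInitializedException, NoLongerExistsException {
        if (doc.equals("stale")) throw new NotInitializedException();
        if (doc.equals("gone")) throw new NoLongerExistsException();
    }

    // NotInitializedException -> re-initialize and continue;
    // NoLongerExistsException -> resource removed from the Directory, continue;
    // any other exception -> log it, continue to the next resource.
    public static int crawl(List<String> docs) {
        int indexed = 0;
        for (String doc : docs) {
            try {
                crawlOne(doc);
                indexed++;
            } catch (NotInitializedException e) {
                // re-initialize the back-end session here, then move on
            } catch (NoLongerExistsException e) {
                // the portal deletes this resource from the Directory
            } catch (Exception e) {
                // log to Log4j/Log4net and ICrawlerLog; re-throw as
                // ServiceException in real crawler code
            }
        }
        return indexed;
    }

    public static void main(String[] args) {
        System.out.println(crawl(Arrays.asList("a", "gone", "stale", "b"))); // 2
    }
}
```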
For details on logging, see Oracle WebCenter Interaction Logging Utilities.
After implementing a custom content crawler, you must deploy your code.
Follow the instructions below to deploy a Java content crawler.
Compile the class that implements the IDK interface and copy the entire package structure to the appropriate location in your web application (usually the \WEB-INF\classes directory).
Update the web.xml file in the WEB-INF directory by adding the class to the appropriate *Impl keys. For a content crawler, add your class to ContainerProviderImpl and DocumentProviderImpl as shown below. Note: The *Impl key in web.xml must reference the fully-qualified name of both provider classes required by the service. If the service uses SCI, you must also enter the fully-qualified name of the appropriate implementation of the IAdminEditor interface in the SciImpl parameter.
...
<env-entry>
  <env-entry-name>ContainerProviderImpl</env-entry-name>
  <env-entry-value>com.plumtree.remote.crawler.helloworld.CrawlContainer</env-entry-value>
  <env-entry-type>java.lang.String</env-entry-type>
</env-entry>
<env-entry>
  <env-entry-name>DocumentProviderImpl</env-entry-name>
  <env-entry-value>com.plumtree.remote.crawler.helloworld.CrawlDocument</env-entry-value>
  <env-entry-type>java.lang.String</env-entry-type>
</env-entry>
...
Start your application server. (In most cases, you must restart your application server after copying a file.)
Test the directory by opening the following page in a Web browser: http://<hostname:port>/edk/services/<servicetype>ProviderSoapBinding (for example, http://localhost:8080/edk/services/ContainerProviderSoapBinding and http://localhost:8080/edk/services/DocumentProviderSoapBinding). The browser should display the following message: "Hi there, this is an AXIS service! Perhaps there will be a form for invoking the service here..." When you configure the Web Service object for the content crawler in the portal, enter this path as the Service Provider URL.
If the content crawler uses DocFetch, you must also deploy your DocFetch code. Open the WEB-INF\web.xml file and add the fully-qualified name of your class in the DocFetchProvider initialization parameter, as shown in the code that follows.
...
<servlet>
  <servlet-name>DocFetch</servlet-name>
  <servlet-class>com.plumtree.remote.docfetch.DocFetch</servlet-class>
  <!-- Modify the param-value below to reference your class -->
  <init-param>
    <param-name>DocFetchProvider</param-name>
    <param-value>com.mycompany.MyDocFetchProvider</param-value>
  </init-param>
</servlet>
...
To deploy a .NET content crawler, add a line to the deployment file (web.config) that specifies the fully qualified name of the class. For a content crawler, enter values for the following parameters, as shown in the code that follows.
ContainerProviderImpl
DocumentProviderImpl
ContainerProviderAssembly
DocumentProviderAssembly
...
<appSettings>
  <add key='ContainerProviderAssembly' value='CompanyStoreCWS'/>
  <add key='ContainerProviderImpl' value='Plumtree.CompanyStore.CWS.CompanyStoreContainer'/>
  <add key='DocumentProviderAssembly' value='CompanyStoreCWS'/>
  <add key='DocumentProviderImpl' value='Plumtree.CompanyStore.CWS.CompanyStoreDocument'/>
...
If the service uses SCI, you must also enter the fully-qualified name of the appropriate implementation of the IAdminEditor interface using the SciImpl and AdminEditorAssembly parameters.
If the content crawler uses DocFetch, you must also deploy your DocFetch code. Add a line to the deployment file (web.config) that specifies the fully qualified name of your class and the associated assembly (DocFetchImpl and DocFetchAssembly). You must also add three additional parameters to the web.config deployment descriptor:
DocFetchURL: The URL to the DocFetch servlet or server page. This URL should be relative to the Remote Server object URL configured for the Content Crawler object in the portal to facilitate migration to another portal.
IndexFilePath: A writable, web-accessible directory to which the IDK can write temporary files. During crawl-time, the Oracle WebCenter Interaction Development Kit (IDK) calls IDocument.GetDocument and copies the file path returned to this temporary file location, which is returned to the portal. These temporary files should be deleted upon completion of the crawl. (The DocFetch mechanism cleans up its own resources, but you must delete the temporary file you return from GetDocument.)
IndexURLPrefix: The public Web address of the IndexFilePath directory. IndexURLPrefix must be a URL accessible from the portal server.
The code below is an example of deploying DocFetch in web.config.
...
<appSettings>
  <add key='DocFetchAssembly' value='MyDocFetch' />
  <add key='DocFetchImpl' value='com.mycompany.MyDocFetchProvider' />
  <add key='DocFetchURL' value='iis/docfetch.aspx'/>
  <add key='IndexFilePath' value='D:\\root\\config\\mydomain'/>
  <add key='IndexURLPrefix' value='http://yourhost/IISVirtualDirectory'/>
...
These key tests should be performed on every content crawler.
All the following tests should be performed in multiple implementations of the portal.
Test the entire crawl depth. Confirm that documents are structured correctly in every level. Crawl depth should be as shallow as possible. If there are problems, check the filters on the target folders. If nothing is returned, check the authentication settings in the associated Content Source and Web Service - Content objects.
Check the document metadata. Is it stored in the appropriate properties? Does it match the metadata in the source repository? If there are problems, check the Content Type settings in the Content Crawler editor, and check the mappings for each associated Content Type.
Click through to crawled documents from each crawled directory. If there are problems, check the gateway settings in the Web Service - Content editor.
Test refreshing documents to confirm that they reflect modifications. If there are problems, make sure you are providing the correct document signature.
Check logs after every crawl. The log can reveal problems even if the portal reports a successful crawl.
To debug custom content crawlers, use logging.
Logging is an important component of any successful content crawler. Logging allows you to track progress and find problems. In most implementations, using Log4j or Log4net is the best approach. The IDK ICrawlerLog object is more efficient and useful than Logging Spy or a SOAP trace, but it only includes standard exceptions and messages from ContainerProvider.AttachToContainer. If you are viewing the ICrawlerLog, do not assume that every card was imported just because the job is successful. Successful means no catastrophic failures, such as portal Search not being started or being unable to attach to the start node; individual document failures do not fail a job. If you are viewing logs created by Log4net or Log4j, see the associated documentation for logging configuration options. Both products allow you to specify a file location and a rollover log with a specified file size. If you know the location of the log file, it is not difficult to create a servlet/aspx page that streams it to the browser.
For more information, see the following sections:
Implementing a successful content crawler in the portal requires specific configuration.
To register a content crawler in the portal, you must create the following administrative objects and portal components:
Remote Server: The Remote Server defines the base URL for the content crawler. Content crawlers can use a Remote Server object or hard-coded URLs. Multiple services can share a single Remote Server object. If you will be using a Remote Server object, you must register it before registering any related Web Service objects.
Web Service - Content: The Web Service object includes basic configuration settings, including the SOAP endpoints for the ContainerProvider and DocumentProvider, and Preference page URLs. Multiple Content Source or Content Crawler objects can use the same Web Service object. All remote content crawlers require an associated Web Service object. For information on specific settings, see the portal online help.
Content Source - Remote: The Content Source defines the location and access restrictions for the back-end repository. Each Web Service - Content object has one or more associated Content Source objects. The Content Source editor can include Service Configuration pages created for the content crawler. Multiple Content Crawler objects can use the same Remote Content Source, allowing you to crawl multiple locations of the same content repository without having to repeatedly specify all the settings. For details on specific settings, see the portal online help. For details on Service Configuration pages, see Creating Service Configuration Pages for Content Crawlers.
Content Crawler - Remote: Each content crawler has an associated Content Crawler object that defines basic settings, including destination folder and Content Type. The Content Crawler editor can include Service Configuration pages created for the Content Crawler. Refresh settings are also entered in the Content Crawler editor. For details on specific settings, see the portal online help. For details on Service Configuration pages, see Creating Service Configuration Pages for Content Crawlers.
Job: To run the content crawler, you must schedule a Job or add the Content Crawler object to an existing Job. The Content Crawler editor allows you to set a Job. For details on configuring Jobs, see the portal online help.
Global Content Type Map: If you are importing a proprietary file format, you might need to create a new Content Type. Content Types are used to determine the type of accessor used to index a file. You can create new Content Types, or map additional file extensions to an existing Content Type using the Global Content Type Map. Most standard file formats are supported for indexing by the portal. In most cases, the same document is returned during a crawl (for indexing) as for click-through (for display). For detailed instructions, see the portal online help or the Administrator Guide for Oracle WebCenter Interaction.
Global Document Property Map: To map document attributes to portal Properties, you must update the Global Document Property Map before running a content crawler. During a crawl, file attributes are imported into the portal and stored as Properties. The relationship between file attributes and portal Properties can be defined in two places: the Content Type editor or the Global Document Property Map.
Two types of metadata are returned during a crawl.
The crawler (also known as the provider) iterates over documents in a repository and retrieves the file name, path, and size, and usually nothing else.
During the indexing step, the file is copied to portal Search, where the appropriate accessor executes full-text extraction and metadata extraction. For example, for a Microsoft Office document, the portal uses the MS Office accessor to obtain additional properties, such as author, title, manager, and category.
If there are conflicts between the two sets of metadata, the setting in CrawlerConstants.TAG_PROPERTIES determines which is stored in the database (for details, see Service Configuration Pages above).
Note:
If any properties returned by the crawler or accessor are not included in the Global Document Property map, they are discarded. Mappings for the specific Content Type have precedence over mappings in the Global Document Property Map. The Object Created property is set by the portal and cannot be modified by code inside a Content Crawler.
Global ACL Sync Map: Content crawlers can import security settings based on the Global ACL Sync Map, which defines how the Access Control List (ACL) of the source document corresponds to Oracle WebCenter Interaction's authentication groups. (An ACL consists of a list of names or groups. For each name or group, there is a corresponding list of possible permissions. The ACL returned to the portal is for read rights only.) For detailed instructions, see the portal online help or the Administrator Guide for Oracle WebCenter Interaction.
In most cases, the Global ACL Sync Map is automatically maintained by Authentication Sources. The Authentication Source is the first step in Oracle WebCenter Interaction security. To import security settings in a crawl, the back-end repository must have an associated Authentication Source. Content crawlers that import security need the user and category (domain) defined by an Authentication Source. You must configure the Authentication Source before the content crawler is run. Many repositories use the network's NT or LDAP security store; if an associated Authentication Source already exists, there is no need to create one.
Note:
Two settings are required to import security settings:
In the Web Service - Content editor on the Advanced Settings page, check Supports importing security with each document.
In the Content Crawler editor on the Main Settings page, check Import security with each document.
Service Configuration (SCI) pages are integrated with portal editors and used to define settings used by a content crawler.
Content crawlers must provide SCI pages for the Content Source and/or Content Crawler editors to build the preferences used by the content crawler. The URL to any associated SCI page(s) must be entered on the Advanced URLs page of the Web Service - Content editor. All optional settings are in the CrawlerConstants class. For a list, see SCI Variables for Content Crawler Properties. SCI provides an easy way to write configuration pages that are integrated with portal editors. SCI wraps the portal's XUI XML and allows you to create controls without XUI. For a complete listing of classes and methods in the plumtree.remote.sci namespace, see the IDK API documentation. The following methods must be implemented:
initialize passes in the namespace (Content Source or Content Crawler) and the current settings (NamedValueMap); dependent objects supply data.
getPages returns a fixed-length array containing the custom pages.
getContent returns the XML content for a page. The API provides a collection of helper classes to build the page (text box, select box, tree element, etc.).
The example below is a SCI page for a Content Source editor that gets credentials for a database content crawler.
Imports System
Imports Plumtree.Remote.Sci
Imports Plumtree.Remote.Util
Imports System.Security.Cryptography

Namespace Plumtree.Remote.Crawler.DRV
    'Page to enter name and password - first page for DataSourceEditor
    Public Class AuthPage
        Inherits AbstractPage

#Region "Constructors"
        Public Sub New(ByVal editor As AbstractEditor)
            MyBase.New(editor)
        End Sub
#End Region

#Region "Functions"
        'Gets the content for the page in string form.
        'One TextElement for name, one PasswordElement for password.
        'Note the way that the password is stored and the encryption used.
        Public Overrides Function GetContent(ByVal errorCode As Integer, ByVal pageInfo As NamedValueMap) As String
            Dim page As New SciPage
            Dim userElement As New SciTextElement(DRVConstants.USER_NAME, "Enter the user name to authenticate to SQL Server")
            Dim userName As String = pageInfo.Get(DRVConstants.USER_NAME)
            If Not userName Is Nothing Then
                userElement.SetValue(userName)
            End If
            userElement.SetMandatoryValidation("User name is mandatory")

            Dim passElement As New SciPasswordElement(DRVConstants.PASSWORD, "Enter the password to authenticate to SQL Server", "Confirm", "Passwords do not match")
            'deal with asterisks and the like - for now, just show password
            Dim password As String = pageInfo.Get(DRVConstants.ENC_PASSWORD)
            'save the initial password?
            Dim settings As NamedValueMap = Me.Editor.Settings
            settings.Put(DRVConstants.ENC_PASSWORD, password)
            Editor.Settings = settings
            'set asterisks for the value
            passElement.SetValue(DRVConstants.ASTERISKS)

            page.Add(userElement)
            page.Add(passElement)
            Return page.ToString
        End Function

        'Gets the help page URI for the page.
        Public Overrides Function GetHelpURI() As String
            Return ""
        End Function

        'Gets the image (icon) URI for the page.
        '(This setting is for backward compatibility; no icon is displayed in version 5.0.)
        Public Overrides Function GetImageURI() As String
            Return ""
        End Function

        'Gets the instructions for the page, displayed below the title in the editor.
        Public Overrides Function GetInstructions() As String
            Return "Enter SQL Server authentication information"
        End Function

        'Gets the title for the page.
        Public Overrides Function GetTitle() As String
            Return "SQL Server Authentication"
        End Function

        'Validates the current page and throws a ValidationException to report an error.
        'Returns a NamedValueMap array of the settings entered on the editor page.
        Public Overrides Sub ValidatePage(ByVal pageInfo As NamedValueMap)
            'if the password is not asterisks, then put it into settings
            Dim password As String = pageInfo.Get(DRVConstants.PASSWORD)
            If Not password.Equals(DRVConstants.ASTERISKS) Then
                Dim settings As NamedValueMap = Me.Editor.Settings
                'encrypt this
                Dim encPassword As String = Utilities.EncryptPassword(password, Me.Editor.Locale)
                settings.Put(DRVConstants.ENC_PASSWORD, encPassword)
                Editor.Settings = settings
            End If
        End Sub
#End Region
    End Class
End Namespace
Federated Search provides access to external repositories without adding documents to the portal Directory. Federated Search is especially useful for content that is updated frequently or is accessed by only a small number of portal users.
When the portal requests a federated search service, the remote service accesses the content repository and sends information about each file to the portal. The returned information is displayed to users in search results. The results include a URL that opens the file from the back-end content repository.
For details on implementing federated search services, see the following sections:
The Oracle WebCenter Interaction Development Kit (IDK) allows you to create remote Federated Search services and related configuration pages without parsing SOAP or accessing the portal API. The Oracle WebCenter Interaction Development Kit (IDK) Search API provides an abstraction from the necessary SOAP calls; you simply implement an object interface.
The following best practices apply to every federated search service:
Know what to expect in response to a query. You must be ready to handle pagination and authentication if necessary.
Check the SOAP timeout for the back-end server and calibrate your response accordingly.
Use relative URLs in your code to allow migration to another remote server.
For details on implementing Federated Search Services using the Oracle WebCenter Interaction Development Kit (IDK) Search API, see Oracle WebCenter Interaction Development Kit (IDK) Interfaces for Federated Search Service Development.
The Oracle WebCenter Interaction Development Kit (IDK) plumtree.remote.search package/namespace includes a set of interfaces to support federated search service development.
The Oracle WebCenter Interaction Development Kit (IDK) plumtree.remote.search package/namespace includes the following interfaces:
IRemoteSearch
ISearchQuery
ISearchUser
ISearchContext
ISearchRecord
ISearchResult
In general, the portal calls these interfaces in the following order. See the definitions that follow for more information.
The portal calls IRemoteSearch.BasicSearch, using ISearchQuery, ISearchUser and ISearchContext as parameters. The ISearchResult object returned allows the federated search service to iterate through the search results and return them to the user. The service calls ISearchResult.GetSearchResultList to retrieve an ISearchRecord for each record returned. ISearchRecord allows you to retrieve and set the title, description, file URL and image URL to be returned to the portal.
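The call sequence above can be sketched with simplified stand-ins for the IDK interfaces. The real types live in the plumtree.remote.search package and have richer signatures; the interface names below carry a "Sketch" suffix, and the hard-coded repository is purely illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for the IDK search interfaces (illustrative only).
interface ISearchQuerySketch {
    String getSearchString();
}

interface ISearchRecordSketch {
    String getTitle();
}

interface ISearchResultSketch {
    List<ISearchRecordSketch> getSearchResultList();
}

// A minimal "remote search" over a hard-coded back-end repository.
class RemoteSearchSketch {
    private static final String[] REPOSITORY = {
        "Budget 2010", "Budget 2011", "Staffing plan"
    };

    // Analogous to IRemoteSearch.BasicSearch: run the query, then wrap
    // each match in a search record the portal can iterate over.
    public ISearchResultSketch basicSearch(ISearchQuerySketch query) {
        List<ISearchRecordSketch> records = new ArrayList<>();
        String needle = query.getSearchString().toLowerCase();
        for (String doc : REPOSITORY) {
            if (doc.toLowerCase().contains(needle)) {
                final String title = doc;
                records.add(() -> title); // each match becomes a record
            }
        }
        return () -> records;
    }
}
```

In real IDK code the result object would also carry pagination bookkeeping and per-record metadata (description, file URL, image URL), as described in the interface sections below.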
The sections below provide helpful information on the interfaces used to implement a federated search service. For a complete listing of interfaces, classes, and methods, see the IDK API documentation.
The IRemoteSearch interface allows the portal to initiate a query over a back-end directory structure. BasicSearch allows you to pass in an ISearchQuery that defines the query to be performed. You can also pass in an ISearchUser and ISearchContext for access to the PRC.
The ISearchQuery interface defines the search query to be performed by the portal. Using ISearchQuery, you can define the scope of the query and provide user preferences and user information to be used for authentication or user-level access control. SearchException allows you to provide useful error messages (for example, the specific preference type that was not found). For details, see the IDK API documentation. This interface provides the following methods:
GetMaxReturn determines the maximum number of records to return per page.
GetNumberToSkip returns the number of records that will be skipped, that is, where the search will start. For example, the search could start at record 30.
GetSearchInfo returns any related administrative preferences set for the associated Federated Search object in the portal.
GetSearchResult returns an ISearchResult object that allows the federated search service to access the results returned by IRemoteSearch.
GetSearchString returns the query string passed in from the portal.
GetUserInfo returns any User Information settings sent to the federated search service. To access User Information, you must configure the specific settings you need in the Web Service editor on the User Information page.
GetUserPrefs returns any user settings sent to the federated search service. To access user settings, you must configure the specific settings you need in the Web Service editor on the Preferences page.
The ISearchUser interface can be used to access the current user's portal object ID and locale, and to obtain the login token for the current session with the portal to access the PRC.
The ISearchContext interface can be used to access the portal UUID and SOAP service endpoint URI to implement the PRC.
The ISearchResult interface allows you to retrieve the results returned from a search query and return them to the portal. The federated search service code must handle pagination; the methods in the ISearchResult interface facilitate iteration over large numbers of search records.
Get/SetNumberSkipped returns the number of records that were skipped, that is, where the search started. For example, the search could start at record 30.
Get/SetSearchResultList returns a SearchRecord array of search results.
Get/SetTotalNumberofHits returns the total number of search records.
Is/SetDescriptionEncoded determines whether or not the description for the search results is HTML-encoded.
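The response-side bookkeeping these methods describe might look like the following simplified holder. This is a sketch of the contract, not the IDK's ISearchResult class; the point is that the total hit count covers all pages, so it can exceed the number of records actually returned for the current page.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified holder mirroring the ISearchResult getters/setters
// described above; field and method names are illustrative.
class SearchResultSketch {
    private int numberSkipped;
    private int totalNumberOfHits;
    private boolean descriptionEncoded;
    private final List<String> records = new ArrayList<>();

    void setNumberSkipped(int n) { numberSkipped = n; }         // where this page started
    void setTotalNumberOfHits(int n) { totalNumberOfHits = n; } // hits across ALL pages
    void setDescriptionEncoded(boolean b) { descriptionEncoded = b; }
    void addRecord(String title) { records.add(title); }

    int getNumberSkipped() { return numberSkipped; }
    int getTotalNumberOfHits() { return totalNumberOfHits; }
    boolean isDescriptionEncoded() { return descriptionEncoded; }
    List<String> getSearchResultList() { return records; }
}
```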
The ISearchRecord interface allows you to manipulate the metadata for each search record. Only the title is required.
Get/SetTitle returns the title for the search record (required).
Get/SetDescription returns the description for the search record. If the description should be HTML-encoded, use ISearchResult.SetDescriptionEncoded.
Get/SetOpenDocumentURL returns the URL that will retrieve the document. This URL must be accessible over the web or through the gateway. If the document is gatewayed, make sure to configure the Web Service object with the appropriate gateway URLs.
Get/SetImageURL returns the URL to the image that will be displayed with the search record.
After implementing a federated search service, you must deploy your code.
Follow the instructions below to deploy a Java federated search service:
Compile the class that implements the Oracle WebCenter Interaction Development Kit (IDK) interface and copy the entire package structure to the appropriate location in your web application (usually the \WEB-INF\classes directory).
Update the web.xml file in the WEB-INF directory by adding the class to the appropriate *Impl keys. For example, add your class to SearchImpl
as shown below. Note: The *Impl key in the web.xml file must reference the fully-qualified name of the class. If the service uses SCI, you must also enter the fully-qualified name of the appropriate implementation of the IAdminEditor
interface.
...
<env-entry>
    <env-entry-name>SearchImpl</env-entry-name>
    <env-entry-value>com.plumtree.remote.search.helloworld.Search</env-entry-value>
    <env-entry-type>java.lang.String</env-entry-type>
</env-entry>
...
Start your application server. (In most cases, you must restart your application server after copying a file.)
Test the directory by opening the following page in a web browser: http://<hostname:port>/idk/services/<servicetype>ProviderSoapBinding (for example, http://localhost:8080/idk/SearchSoapBinding). The browser should display the following message: "Hi there, this is an AXIS service! Perhaps there will be a form for invoking the service here..." When you configure the Web Service for the federated search service in the portal, enter this path as the Service Provider URL.
If the federated search service uses a SCI page to define settings, you must also deploy the SCI code. For details on using SCI pages, see Creating Service Configuration Pages for Content Crawlers.
To deploy a .NET federated search service, add entries to the deployment file (web.config) that specify the assembly and the fully qualified name of the class used to implement federated search. You must enter values for the following parameters, as shown in the code that follows.
SearchImpl
SearchAssembly
...
<appSettings>
    <add key='SearchAssembly' value='CompanyStoreSWS'/>
    <add key='SearchImpl' value='Plumtree.CompanyStore.SWS.CompanyStoreSWS'/>
...
If the federated search service uses a SCI page to define settings, you must also deploy the SCI code. For details on using SCI pages, see Creating Service Configuration Pages for Content Crawlers.