Overview of XML Connector Framework

Oracle SES provides an XML connector framework to crawl any repository that provides an XML interface to its contents. The connectors for Oracle Content Server, Oracle E-Business Suite 12, and Siebel 8 use this framework.

Every document in a repository is known as an item. An item contains information about the document, such as author, access URL, last modified date, security information, status, and contents.

A set of items is known as a feed or channel. To crawl a repository, an XML document must be generated for each feed. Each feed is associated with information such as feed name, type of the feed, and number of items.

To crawl a repository with the XML connector, place data feeds in a location accessible to Oracle SES over one of these protocols: HTTP, FTP, or FILE. Then generate an XML Configuration File that contains information such as feed location and feed type. Create a source with a source type that is based on this XML connector and trigger the crawl from Oracle SES to crawl the feeds.

There are two types of feeds:

Control feed: Individual feeds can be located anywhere, and a single control file is generated with links to the feeds. This control file is input to the connector through the configuration file. A link in control feed can point to another control feed. Control feed is useful when data feeds are distributed over many locations or when the data feeds are accessed over diverse protocols such as FTP and file.
Directory feed: All feeds are placed in a directory, and this directory is input to the connector through the configuration file. Directory feed is useful when the data feeds are available in a single directory.

Guidelines for the target repository generating the XML feeds:

XML feeds are generated by the target repository, and each file system has a limit on how many files it can hold. For directory feeds, the number of documents in each directory should be less than 10,000. There are two considerations:
- Feed files: The number of items in each feed file should be set such that the total number of feed files in the feed directory is kept under 10,000.
- Content files: If the feed files specify content through attachment links and the targets of these links are stored in the file system, then ensure that the targets are distributed in multiple directories so that the total number of files in each directory is kept under 10,000.
When feeds are generated real-time over HTTP, ensure that the component generating the feeds is sensitive to time-out issues of feed requests. The feed served as the response for every request should be made available within this time out interval; otherwise, the request from Oracle SES times out. The request is retried as many times as specified while setting up the source in Oracle SES. If all these attempts fail, then the crawler ignores this feed and proceeds with the next feed.

Example Using the XML Connector

The courses in the Oracle E-Business Suite Learning Management application can be crawled and indexed to readily search the courses offered, location and other details pertaining to the courses.

To crawl and index courses in Oracle E-Business Suite Learning Management:

Generate an XML feed containing the courses. Each course can be an item in the feed. The properties of the course such as location and instructor can be set as attributes of the item.
Move the feed to a location accessible to Oracle SES through HTTP, FTP, or file protocol.
Generate a control file that points to that feed.
Generate a configuration file to point to this feed. Specify the feed type as control, the URL of the control feed, and the source name in the configuration file.
Create an Oracle E-Business Suite 12 source in Oracle SES, specifying in the parameters the location of the configuration file, the user ID and the password to access the feed.

XML Configuration File

The configuration file is an XML file conforming to a set schema.

The following is an example of a configuration file to set up an XML-based source:

<rsscrawler xmlns="http://xmlns.oracle.com/search/rsscrawlerconfig">  
     <feedLocation>ftp://my.host.com/rss_feeds</feedLocation>
     <feedType>directoryFeed</feedType>
     <errorFileLocation>/tmp/errors</errorFileLocation>
     <securityType>attributeBased</securityType> 
     <sourceName>Contacts</sourceName>
     <securityAttribute name="EMPLOYEE_ID" grant="true"/> 
</rsscrawler>

Where

feedLocation is one of the following:
- URL of the directory, if the data feed is a directory feed
  
  This URL should be the FTP URL or the file URL of the directory where the data feeds are located. For example:
```
ftp://example.domain.com/relativePathOfDirectory
file://example.domain.com/c:\dir1\dir2\dir3
file://example.domain.com//private/home/dir1/dir2/dir3 
```
  File URL if the data feeds are available on the same computer as Oracle SES. The path specified in the URL should be the absolute path of the directory.
  
  FTP URL to access data feeds on any other computer. The path of the directory in the URL can be absolute or relative. The absolute path should be specified following the slash (/) after the host name in the URL. The relative path should be specified relative to the home directory of the user used to access FTP feeds.
  
  The user ID used to crawl the source should have write permissions on the directory, so that the data feeds can be deleted after crawl.
- URL of the control file, if the data feed is a control feed
  
  This URL can be HTTP, HTTPS, file, or FTP URL. For example:
```
http://example.com:7777/context/control.xml
```
  The path in FTP and file protocols can be absolute or relative.
feedType indicates the type of feed. Valid values are directoryFeed, controlFeed, and dataFeed.
errorFileLocation (optional) specifies the directory where status feeds should be uploaded.

A status feed is generated to indicate the status of the processing feed. This status feed is named data_feed_file_name.suc or data_feed_file_name.err depending on whether the processing was successful. Any errors encountered are listed in the error status feed. If a value is specified for this parameter, then the status feed is uploaded to this location. Otherwise, the status feed is uploaded to the same location as the data feed.

The user ID used to access the data feed should have write permission on the directory.

If feedLocation is an HTTP URL, then errorFileLocation also should be an HTTP URL, to which the status feeds are posted. If no value is specified for errorFileLocation, then the status feeds are posted to the URL given in feedLocation.

If an error occurs while processing a feed available over file or FTP protocol, then the erroneous feed is renamed filename.prcsdErr in the same directory.
sourceName (optional) specifies the name of the source.
securityType (optional) specifies the security type. Valid values are the following:
- noSecurity: There is no security information associated with this source at the document level. This is the default value.
- identityBased: Identity-based security is used for documents in the feed.
- attributeBased: Attribute-based security is used for documents in the feed. With this security model, security attributes should be specified in the securityAttribute tag, and the values for these attributes should be specified for each document.
securityAttribute specifies attribute-based security. One or more tags of this type should be specified, and each tag should contain the following attributes:
- name: Name of the security attribute.
- grant: Boolean parameter indicating whether this is a grant/deny attribute. The security attribute is considered a grant attribute if the value is true and a deny attribute if the value is false.