|Oracle Ultra Search Online Documentation
You can implement a crawler agent to crawl and index a proprietary document repository, such as Lotus Notes or Documentum. In Ultra Search, the proprietary repository is called a user-defined data source. The module that enables the crawler to access the data source is called a crawler agent.
The agent collects document URLs and associated metadata from the user-defined data source and returns the information to the Ultra Search crawler, which enqueues it for later crawling. The crawler agent must be implemented in Java using the Ultra Search crawler agent API.
A crawler agent does the following:
- Authenticates the crawler for accessing the data source
- Provides access to the data source document through an HTTP URL (the display URL)
- Provides the metadata of the document in the form of document attributes
- Maps the document attributes to common attribute names used by end users
- Provides a "flattened" view of the data source, so that documents are retrieved one by one in a streaming fashion
- Instructs the crawler, if necessary, to parse the URL document for standard metadata such as author and title
- Optionally provides the list of URLs that have changed since a given time stamp
- Optionally provides an access URL in addition to the display URL for the processing of the document
From the crawler's perspective, the agent retrieves the list of URLs from the target data source and saves it in the crawler queue before processing it.
Note: If the crawler is interrupted for any reason, the agent invocation process is repeated with the original last crawl time stamp. If the crawler has already finished enqueuing URLs fetched from the agent and is halfway through crawling, then the crawler only starts the agent, but does not try to fetch URLs from the agent. Instead, it finishes crawling the URLs already enqueued.
There are two kinds of crawler agents: standard agents and smart agents.
The standard agent returns the list of URLs currently existing in the data source. It does not know whether any of the URLs have been crawled before, and it relies on the crawler to find any updates to the target data source. The standard agent's interaction with the crawler is the following:
- The crawler marks all existing URLs of this data source for garbage collection, assuming they no longer exist in the target data source.
- It calls the agent to get an updated list of URLs. Every URL on the list that already exists in the URL table is marked for crawling; every new URL is inserted into the URL table and the queue.
- It deletes the URLs that are still marked for garbage collection.
- It goes through every URL marked for crawling and checks for updates.
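The mark-and-sweep bookkeeping above can be sketched as follows. This is an illustrative model of the crawler's behavior, not Ultra Search code; the class, field, and method names are invented for the example.

```java
import java.util.*;

// Illustrative sketch of the crawler's bookkeeping for a standard agent.
// None of these names come from the Ultra Search API.
public class StandardAgentCycle {

    // URL table: url -> marked-for-garbage-collection flag
    static Map<String, Boolean> urlTable = new HashMap<>();
    static List<String> crawlQueue = new ArrayList<>();

    static void refresh(List<String> agentUrls) {
        // 1. Mark every known URL for garbage collection, assuming it
        //    no longer exists in the target data source.
        urlTable.replaceAll((url, gc) -> true);

        // 2. For each URL returned by the agent: unmark existing ones
        //    (they will be rechecked for updates) and insert new ones.
        for (String url : agentUrls) {
            urlTable.put(url, false);
            crawlQueue.add(url);
        }

        // 3. Delete URLs still marked: they vanished from the data source.
        urlTable.values().removeIf(gc -> gc);
    }

    public static void main(String[] args) {
        urlTable.put("doc1", false);
        urlTable.put("doc2", false);
        refresh(Arrays.asList("doc2", "doc3")); // doc1 was deleted at the source
        System.out.println(new TreeSet<>(urlTable.keySet())); // [doc2, doc3]
    }
}
```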
The smart agent uses a modified-since time stamp (provided by the crawler) to return the list of URLs that have been updated, inserted, or deleted. The crawler only crawls URLs returned by the agent and does not recrawl existing ones. For URLs that were deleted, the crawler removes them from the URL table. If the smart agent can only return updated or inserted URLs but not deleted URLs, then deleted URLs are not detected by the crawler. In this case, you must change the recrawl policy of the crawling schedule so that the schedule periodically runs in force recrawl mode. Force recrawl mode signals the agent to return every URL in the data source.
The agent API isDeltaCrawlingCapable() tells the crawler whether the agent it invokes is a standard agent or a smart agent. The agent API startCrawling(boolean forceRecrawl, Date lastCrawlTime) lets the crawler tell the agent the last crawl time and whether the crawler is running in force recrawl mode.
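A sketch of how a crawler might branch on these two calls. Only the two method names and signatures come from the text above; the interface stub, the void return type of startCrawling, and the driver class are assumptions made for illustration.

```java
import java.util.Date;

// Minimal stand-in for the crawler agent interface; only the two methods
// discussed above are shown, with an assumed void return for startCrawling.
interface CrawlerAgent {
    boolean isDeltaCrawlingCapable();
    void startCrawling(boolean forceRecrawl, Date lastCrawlTime);
}

public class AgentDispatch {
    // Illustrative: decide whether this run is a delta (smart) crawl or a
    // full crawl (standard agent, or smart agent in force recrawl mode).
    static String plan(CrawlerAgent agent, boolean forceRecrawl, Date lastCrawlTime) {
        agent.startCrawling(forceRecrawl, lastCrawlTime);
        if (agent.isDeltaCrawlingCapable() && !forceRecrawl) {
            return "delta";   // smart agent: crawl only the returned URLs
        }
        return "full";        // standard agent, or force recrawl mode
    }

    public static void main(String[] args) {
        CrawlerAgent smart = new CrawlerAgent() {
            public boolean isDeltaCrawlingCapable() { return true; }
            public void startCrawling(boolean f, Date t) { /* no-op */ }
        };
        System.out.println(plan(smart, false, new Date())); // delta
        System.out.println(plan(smart, true, new Date()));  // full
    }
}
```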
Document Attributes and Properties
Document attributes, or metadata, describe document properties. Some attributes can be irrelevant to your application. The crawler agent creator must decide which document attributes should be extracted and saved. The agent can also be written so that the list of collected attributes is configurable. Ultra Search automatically registers attributes returned by the agent. The agent can decide which attributes to return for a document.
Data Source Type Registration
A data source type is an abstraction of a data source. You can define new data source types with the following attributes:
- Name of data source type: For example, Lotus Notes. The name cannot be more than 100 bytes.
- ID of data source type: This is automatically assigned
- Description of the data source type: This limit is 4000 bytes.
- Agent Java class name: For example, WebDbAgent. The location of this class is predefined by Ultra Search in $ORACLE_HOME/ultrasearch/lib/agent/ and cannot be changed.
- Agent JAR file name: The agent class can be stored in a Java JAR file. This JAR file must be in $ORACLE_HOME/ultrasearch/lib/agent/.
- Parameters: Parameters are the properties of a data source; for example, the seed URL and inclusion pattern for a Web data source. Define a parameter by specifying a parameter name (100 bytes maximum) and its description (4000 bytes maximum). By default, a parameter is not encrypted.
- Encryption: Whether the value of this parameter is encrypted when it is stored.
Ultra Search does not enforce how many times a parameter occurs; you cannot require that a particular parameter appear zero or more times, at least once, or exactly once.
Data Source Registration
After a data source type is defined, any instance of that data source type can be defined:
- Data source name
- Description of the data source; limited to 4000 bytes
- Data source type ID
- Default language; default is 'en' (English)
- Parameter values; for example, seed = http://www.oracle.com, depth = 8
Data Source Attribute Registration
You can add new attributes to Ultra Search by providing the attribute name and the attribute data type. The data type can be string, number, or date. Attributes with the same name but different data types can be added. Attributes returned by an agent are automatically registered if they have not been defined.
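Because an attribute is identified by its name together with its data type, the same name can legitimately be registered more than once. The sketch below models that auto-registration rule; the enum and class are invented for the example and are not Ultra Search classes.

```java
import java.util.*;

// Sketch of attribute auto-registration keyed by (name, data type), which is
// why one name can exist as both a STRING and a NUMBER attribute.
// Illustrative only: these are not Ultra Search classes.
public class AttributeRegistry {
    enum Type { STRING, NUMBER, DATE }

    private final Set<String> registered = new HashSet<>();

    // Registers the attribute if unseen; returns true when newly registered.
    boolean register(String name, Type type) {
        return registered.add(name + ":" + type);
    }

    public static void main(String[] args) {
        AttributeRegistry reg = new AttributeRegistry();
        System.out.println(reg.register("price", Type.STRING)); // true
        System.out.println(reg.register("price", Type.NUMBER)); // true, different type
        System.out.println(reg.register("price", Type.STRING)); // false, already known
    }
}
```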
User-Implemented Crawler Agent
The crawler agent has the following requirements:
- The agent must be implemented in Java
- The agent must support the Java agent APIs defined by Ultra Search
- The agent must return the URL attributes and properties
- The agent optionally can authenticate the crawler's access to the data source
- The agent must "flatten" the data source so that documents are retrieved one by one in a streaming fashion. This encapsulates the crawling logic of a specific data source inside the agent.
- The agent must decide which document attributes Ultra Search should keep. Any attribute not defined in Ultra Search is automatically registered.
- The agent can map attributes to data source properties. For example, if an attribute "ID" is the unique ID of a document, then the agent should return (document_key, 4) where "ID" has been mapped to the property "document_key" and its value is 4 for this particular document.
- If the attribute list of values (LOV) is available, then the agent returns it upon request.
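The attribute-to-property mapping in the requirements above can be sketched as follows: the agent knows that its "ID" attribute is the document's unique key, so it reports it under the property name "document_key". The mapping table and method names here are assumptions for illustration, not part of the agent API.

```java
import java.util.*;

// Illustrative sketch of mapping an agent-specific attribute name to a
// common data source property name. Not an Ultra Search class.
public class AttributeMapper {
    // Agent-specific attribute name -> common property name (assumed mapping)
    static final Map<String, String> MAPPING = Map.of("ID", "document_key");

    static Map.Entry<String, Object> toProperty(String attr, Object value) {
        String property = MAPPING.getOrDefault(attr, attr);
        return Map.entry(property, value);
    }

    public static void main(String[] args) {
        // The agent returns (document_key, 4) for attribute ID = 4.
        System.out.println(toProperty("ID", 4)); // document_key=4
    }
}
```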
Interaction between the Crawler and the Crawler Agent
The crawler crawls data sources defined by the user through the invocation of the user-supplied crawler agent. The crawler can do the following:
- Invoke the crawler agent of the defined data source
- Supply data source parameter information to the agent
- Authenticate itself with the agent if needed
- Retrieve a list of URLs, with their associated attributes and properties, that need to be crawled
- Use the URL provided by the agent to retrieve the document
- Detect inserts, updates, and deletes in the data source
- Retrieve attribute LOV data if available
The crawler agent API is a collection of methods used to implement a crawler agent. A sample implementation of a crawler agent, SampleAgent.java, is provided under $ORACLE_HOME/ultrasearch/sample/.
UrlData: The crawler agent uses this interface to populate document properties and attribute values. Ultra Search provides a basic implementation of this interface that the agent can use directly or extend if necessary. The class is DocAttributes, with a constructor that takes no arguments. The agent might decide to create a pool of UrlData objects and cycle through them during crawling. In the simplest implementation, the agent creates one DocAttributes object, repeatedly resets and repopulates its data, and returns this object.
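The single-object reuse pattern just described can be sketched as follows. This DocAttributes is a stand-in with an assumed reset/setter shape, not the real Ultra Search class.

```java
import java.util.*;

// Stand-in for the DocAttributes implementation of UrlData; the method
// names (reset, setAttribute) are assumptions for illustration.
class DocAttributes {
    private final Map<String, Object> attrs = new HashMap<>();
    void reset() { attrs.clear(); }
    void setAttribute(String name, Object value) { attrs.put(name, value); }
    Map<String, Object> attributes() { return attrs; }
}

public class StreamingAgent {
    private final DocAttributes data = new DocAttributes(); // one shared object

    // Called once per document: reset, repopulate, hand back the same object.
    DocAttributes fetchNext(String title) {
        data.reset();
        data.setAttribute("Title", title);
        return data;
    }

    public static void main(String[] args) {
        StreamingAgent agent = new StreamingAgent();
        DocAttributes a = agent.fetchNext("First");
        DocAttributes b = agent.fetchNext("Second");
        System.out.println(a == b);                       // true: same object reused
        System.out.println(b.attributes().get("Title"));  // Second
    }
}
```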
LovInfo: The crawler agent uses this interface to submit attribute LOV definitions.
DataSourceParams: The crawler agent uses this interface to read and write data source parameters.
AgentException: The crawler agent uses this exception class when an error occurs.
CrawlerAgent: This interface lets the crawler communicate with the user-defined data source. The crawler agent must implement this interface.
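Putting the pieces together, a standard agent might look like the skeleton below. Only isDeltaCrawlingCapable() and startCrawling(boolean, Date) are named in this document; the fetch loop, the nextUrl method, and the example URLs are assumptions made so the sketch is self-contained.

```java
import java.util.*;

// Skeleton of a user-implemented standard agent. Only isDeltaCrawlingCapable()
// and startCrawling(boolean, Date) are documented method names; nextUrl and
// the repository URLs are hypothetical.
public class NotesAgent {
    private Iterator<String> pending; // flattened view of the repository

    public boolean isDeltaCrawlingCapable() {
        return false; // standard agent: always returns the full URL list
    }

    public void startCrawling(boolean forceRecrawl, Date lastCrawlTime) {
        // A real agent would authenticate and query the repository here.
        pending = List.of("http://notes.example.com/doc/1",
                          "http://notes.example.com/doc/2").iterator();
    }

    // The crawler pulls documents one by one, in a streaming fashion.
    public String nextUrl() {
        return pending.hasNext() ? pending.next() : null;
    }

    public static void main(String[] args) {
        NotesAgent agent = new NotesAgent();
        agent.startCrawling(false, new Date());
        String url;
        while ((url = agent.nextUrl()) != null) {
            System.out.println(url);
        }
    }
}
```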