Sun ONE Portal Server Developer's Guide
This chapter describes the search engine robot and the application programming interface (API) used to create plug-in robot application functions (RAFs).
Note: For more information on the search engine robot, see the Sun ONE Portal Server Administrator's Guide.
This chapter covers the following topics:
Search Engine Robot Overview
A search engine robot is an agent that identifies and reports on resources in its domains; it does so by using two kinds of filters: an enumerator filter and a generator filter.
The enumerator filter locates resources by using network protocols. It tests each resource, and, if it meets the selection criteria, it is enumerated. For example, the enumerator filter can extract hypertext links from an HTML file and use the links to find additional resources.
The generator filter tests each resource to determine if a resource description (RD) should be created. If the resource passes the test, the generator creates an RD which is stored in the search engine database.
How the Robot Works
Figure 5-1 illustrates how the search engine robot works.
Figure 5-1    How the Robot Works
In Figure 5-1, the robot examines URLs and their associated network resources. Each resource is tested by both the enumerator and the generator. If the resource passes the enumeration test, the robot checks it for additional URLs. If the resource passes the generator test, the robot generates a resource description that is stored in the search engine database.
Robot Configuration Files
Robot configuration files define the behavior of the search engine robots. These files reside in the directory /var/opt/SUNWps/serverinstance/URI/config. Table 5-1 provides a description of each of the robot configuration files:
It is recommended that you use the administration user interface to edit these configuration files.
The Filtering Process
The robot uses filters to determine which resources to process and how to process them. When the robot discovers references to resources as well as the resources themselves, it applies filters to each resource in order to enumerate it and to determine whether or not to generate a resource description to store in the search engine database.
The robot examines one or more starting point URLs, applies the filters, and then applies the filters to the URLs spawned by enumerating the starting point URLs, and so on. Note that the starting point URLs are defined in the filterrules.conf file.
A filter performs any required initialization operations and applies comparison tests to the current resource. The goal of each test is to either allow or deny the resource. A filter also has a shutdown phase during which it performs any required cleanup operations.
If a resource is allowed, it continues its passage through the filter. If a resource is denied, then the resource is rejected. No further action is taken by the filter for resources that are denied. If a resource is not denied, the robot will eventually enumerate it, attempting to discover further resources. The generator might also create a resource description for it.
Note that these operations are not necessarily linked. Some resources result in enumeration; others result in RD generation. Many resources result in both enumeration and RD generation. For example, if the resource is an FTP directory, the resource typically will not have an RD generated for it. However, the robot might enumerate the individual files in the FTP directory. An HTML document that contains links to other documents can produce a generated RD and can lead to enumeration of the linked documents as well.
Stages in the Filter Process
Both enumerator and generator filters have five phases in the filtering process. Four phases are common to both: Setup, Metadata, Data, and Shutdown. A resource that makes it past the Data phase then enters either the Enumerate phase or the Generate phase, depending on whether the filter is an enumerator or a generator.
Setup
Performs initialization operations. Occurs only once in the life of the robot.
Metadata
Filters the resource based on metadata that is available about the resource. Metadata filtering occurs once per resource before the resource is retrieved over the network. Table 5-2 lists examples of common metadata types:
Data
Filters the resource based on its data. Data filtering is done once per resource after it is retrieved over the network. Data that can be used for filtering includes:
- content-type
- content-length
- content-encoding
- content-charset
- last-modified
- expires
Enumerate
Enumerates the current resource in order to determine if it points to other resources to be examined.
Generate
Generates a resource description (RD) for the resource and saves it so that it can be added to the search engine database.
Shutdown
Performs any needed termination operations. Occurs once in the life of the robot.
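The per-resource part of this sequence can be sketched in C. This is a minimal sketch under stated assumptions, not the robot's implementation: every `demo_` name, the `PHASE_` constants, and the allow/deny verdict type are hypothetical and do not come from the robot SDK. Setup and Shutdown are left out because they run once in the life of the robot rather than once per resource.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names for illustration; none of these are part of the robot API. */
typedef enum { PHASE_METADATA, PHASE_DATA, PHASE_FINAL, PHASE_COUNT } demo_phase;
typedef enum { DEMO_ALLOW, DEMO_DENY } demo_verdict;
typedef demo_verdict (*demo_test)(const char *url);

/* Example metadata test: allow only http URLs. */
demo_verdict demo_http_only(const char *url)
{
    return strncmp(url, "http://", 7) == 0 ? DEMO_ALLOW : DEMO_DENY;
}

/*
 * Runs the per-resource phases in order and stops at the first denial.
 * Returns the number of phases that allowed the resource; PHASE_COUNT means
 * the resource passed the final (Enumerate or Generate) phase.
 * A NULL test allows the resource by default.
 */
int demo_run_resource(const demo_test tests[PHASE_COUNT], const char *url)
{
    int passed = 0;
    for (int i = 0; i < PHASE_COUNT; i++) {
        if (tests[i] != NULL && tests[i](url) == DEMO_DENY)
            break;  /* a denied resource takes no further part in this filter */
        passed++;
    }
    return passed;
}
```

With `demo_http_only` installed as the metadata test, an FTP URL is rejected before any data would be retrieved, which mirrors the point that metadata filtering happens before the resource is fetched over the network.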
Robot Completion Scripts
After the robot has finished with all the entries in the enumeration-pool and has finished all outstanding processing, you can specify additional programs to execute by enabling the cmdHook parameter.
The cmdHook is provided as a way for you to extend the shutdown phase of the robot. For example, you might want the robot to send email after it completes one run, start another process, analyze its own log files, write a small report, and so on.
When the robot has finished all outstanding processing, it will call the executable specified in the cmdHook setting. If the onCompletion parameter is set to idle or quit, the script is called once before the robot shuts down or goes idle. If the parameter is set to loop, the script is called each time the robot restarts. See the log samples in Monitoring cmdHook Execution.
The cmdHook script can be written in any language (as a Perl or shell script, a C program, and so on). If you choose to use a C program, you'll have to insert the cmdHook parameter in the robot.conf file manually, as the search engine Administration Interface does not scan binary executables. See the notes in Preparing Your Completion Script to Appear in the Administration Interface.
The cmdHook script is run from the robot's execution environment. This means your script will inherit any environment variables set by the robot and will not have access to environment variables that might be set by your search server or the administration server. The cmdHook script will be executed from /var/opt/SUNWps/serverinstance/URI. This is important to keep in mind if you are using any relative directory references.
There are two examples provided in BaseDir/SUNWps/samples/bin: a simple "touch" test (cmdHook0) and a script that sends email upon completion (cmdHook1).
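For a completion program written in C, the core could look like the following. This is a minimal sketch and is not the content of either shipped sample: the function name `demo_mark_completion` and the message text are hypothetical. It shows why the working directory matters, since a relative path passed to `fopen` is resolved against the directory from which the robot runs the cmdHook.

```c
#include <stdio.h>
#include <time.h>

/*
 * Appends a timestamped line to the given file, creating the file if needed.
 * A relative path is resolved against the process's working directory.
 * Returns 0 on success and -1 on failure.
 */
int demo_mark_completion(const char *path)
{
    char stamp[32];
    time_t now = time(NULL);
    struct tm *tm = localtime(&now);
    FILE *fp = fopen(path, "a");

    if (fp == NULL)
        return -1;
    if (tm == NULL || strftime(stamp, sizeof stamp, "%Y-%m-%d %H:%M:%S", tm) == 0)
        stamp[0] = '\0';  /* fall back to an empty timestamp */
    fprintf(fp, "robot run completed at %s\n", stamp);
    return fclose(fp) == 0 ? 0 : -1;
}
```

A complete program would call this function from `main()` and return a nonzero exit status on failure. As noted above, a compiled program must be named in robot.conf by hand.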
Monitoring cmdHook Execution
At the default logging level, a log message is written in robot.log when the cmdHook is run.
For example, if the onCompletion parameter is set to idle, the robot.log output would look like the following:
[12:45:57] Run cmd: /opt/SUNWps/bin/cmdHook1
[12:45:58] Complete cmd: /opt/SUNWps/bin/cmdHook1
[12:45:58] Robot is idle....
If the onCompletion parameter is set to shutdown, the robot.log output would look like the following:
[12:45:57] Run cmd: /opt/SUNWps/bin/cmdHook1
[12:45:58] Complete cmd: /opt/SUNWps/bin/cmdHook1
[12:46:33] Workload complete.
If the onCompletion parameter is set to loop, the robot.log output would look like the following:
[12:52:04] Run cmd: /opt/SUNWps/bin/cmdHook1
[12:52:05] Complete cmd: /opt/SUNWps/bin/cmdHook1
[12:52:05] Restart Robot.
In addition, if the onCompletion parameter is set to loop, filter.log contains an entry like the following:
[12:54:41] Filter log started - loop 15
Preparing Your Completion Script to Appear in the Administration Interface
If you want your cmdHook script to display as an option on the search engine Administration page, you should follow these guidelines:
- Because the search engine Administration Interface does not scan binary files when it looks for cmdHook scripts, write your script in a run-time evaluated language instead of C or another compiled language.
- Place the file in BaseDir/SUNWps/bin.
- Name the file cmdHook, followed by an alphanumeric character string, for example, cmdHook0, cmdHookAlpha, or cmdHook12a.
- Add a description string to the file and comment it so that it does not affect the execution of the script. For example:
#description="My menu choice string"
Creating New Robot Application Functions
Typically, you might modify the behavior of a search engine robot filter by using the search engine administration interface or the predefined robot application functions (RAFs).
Note: For more information on the search engine administration interface and on the predefined robot application functions (RAFs), see the Sun ONE Portal Server Administrator's Guide.
When you want to modify the behavior of the search engine robot filters in a way that is not accommodated by the standard filter functions, you will need to create your own robot application functions (RAFs). Robot filters are defined in the file filter.conf. Filter definitions consist of filter directives, which each specify a robot application function.
This section describes the following topics:
- Robot Plug-in API Overview
- The Robot Application Function Header Files
- Writing Robot Application Functions
- Compiling and Linking your Code
- Loading Your Shared Object
- Using your New Robot Application Functions
Robot Plug-in API Overview
The robot plug-in API is a set of functions and header files that help you create your own robot application functions to use with the directives in robot configuration files. Use this API to create the built-in functions for the directives used in filter.conf (the robot filter configuration file).
When you become familiar with this API, you will be able to override, add, or customize robot functionality. For example, you will be able to create functions that use a custom database for access control, or you can create functions that create custom log files with special entries.
Most developers write RAFs in C. However, you can define the functions in any language that can build a shared library. If you use C++, you will need to modify the provided C header files so that they can be used by C++ files.
The following steps are a brief overview of the process for creating your own plug-in functions:
- Compile your code to create a shared object (.so) file.
- In the Setup directives at the top of filter.conf, you tell the robot to load your shared object file or dynamic-link library.
- Write directives that use your plug-in functions in the robot configuration file (filter.conf).
The BaseDir/SUNWps/sdk/robot/include/ directory contains all the header files you need to include when writing your plug-in functions.
The BaseDir/SUNWps/sdk/robot/examples/ directory contains sample code, the header files, and a makefile. You should familiarize yourself with the code and samples.
The Robot Application Function Header Files
This section discusses the header files needed for creating robot application functions. The following topics are described:
Header File Hierarchy
The hierarchy of robot plug-in API header files is as follows (directory names end with a slash):
robot/
    include/
        cscinfo.h
        csmem.h
        filterrules.h
        robotapi.h
        base/
            systems.h
        libcs/
            adt.h
            cs.h
            csidcf.h
            getopt.h
            log.h
            pblock.h
The robot and its header files are written in ANSI C.
Header File Contents
This section describes the header files you can include when writing your plug-in functions. This section is intended as a starting point for learning the functions included in the header files.
Most of the header files are stored in the following directories as described in Table 5-3:
Table 5-4 describes the header files in the include directory:
Table 5-5 describes the header files in the base directory:
Table 5-5    Header files in the base directory
| Header File | Description |
| --- | --- |
| systems.h | Contains functions that handle system information. |
Table 5-6 describes the header files in the libcs directory:
Writing Robot Application Functions
When you write robot application functions, you must make sure that the file that defines your robot application functions includes robotapi.h. You will also find many useful functions in csinfo.h.
All robot application functions use parameter blocks (pblocks) to receive and set parameter values. A parameter block stores parameters as name-value pairs. It is a hash table that is keyed on the name portion of each parameter it contains.
In this section, the following topics are described:
- RAF Prototype
- Writing Functions for Specific Directives
- Passing Parameters to Robot Application Functions
- Working with Parameter Blocks
- Getting Information on the Processed Resource
- Returning a Response Status Code
- Reporting Errors to the Robot Log File
- RAF Definition Example
RAF Prototype
All robot application functions have the following prototype:
int (*RobotAPIFn)(pblock *pb, CSFilter *csf, CSResource *csr);
pb is the parameter block containing the parameters for this function invocation.
csf is the pointer to an enumeration or generation filter.
csr is a data structure that contains information about the resource being processed.
Writing Functions for Specific Directives
You should write each function for a particular stage in the filtering process (setup, metadata, data, enumeration, generation, or shutdown). The function should only use the data sources that are available at the relevant stage. See the Sources and Destinations section in the Administrator's Guide for a list of the data sources available at each stage.
At the Setup stage, the filter is preparing for setup and cannot get information about the resource's URL or content.
At the Metadata stage, the robot has encountered a URL for a resource but has not downloaded the resource's content. Consequently, information is available about the URL and about data derived from other sources, such as the filter.conf file. At this stage, information is not available about the content of the resource.
At the Data stage, the robot has downloaded the content of the URL, so information is available about the content, such as the description, the author, and so on.
At the Enumeration and Generation stages, the same data sources are available as for the Data stage.
At the Shutdown stage, the filter has completed its processes and shuts down. Although functions written for this stage can use the same data sources as those available at the Data stage, shutdown functions typically restrict their operations to shutdown and clean up activities.
Passing Parameters to Robot Application Functions
You must use parameter blocks (pblocks) to pass arguments into Robot Application Functions and to extract data from them. For example, the following directive (in the filter.conf file) invokes the filter-by-exact function.
Data fn=filter-by-exact src=type deny=text/plain
The fn parameter indicates the function to invoke, which in this case is filter-by-exact. The src and deny arguments are parameters used with the function. They will be passed to the function in a parameter block, and the function should be defined to extract its parameters and their values from the parameter block.
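To make the name-value view of a directive concrete, the following sketch splits a directive line into its directive word and its parameters. It is an illustration only and is not how the robot parses filter.conf: the `demo_param` type, `DEMO_MAX_PARAMS`, and `demo_parse_directive` are hypothetical names, and quoted values or values containing spaces are not handled.

```c
#include <stdio.h>
#include <string.h>

#define DEMO_MAX_PARAMS 8

/* Hypothetical stand-in for a name-value pair; not the SDK's own type. */
typedef struct {
    char name[32];
    char value[64];
} demo_param;

/*
 * Splits a line such as
 *   "Data fn=filter-by-exact src=type deny=text/plain"
 * into the directive word and its name=value pairs.
 * Returns the number of pairs found, or -1 if the line has no directive word.
 */
int demo_parse_directive(const char *line, char *directive, size_t dirsize,
                         demo_param params[DEMO_MAX_PARAMS])
{
    char copy[256];
    char *tok;
    int count = 0;

    /* strtok modifies its input, so work on a bounded local copy. */
    strncpy(copy, line, sizeof copy - 1);
    copy[sizeof copy - 1] = '\0';

    tok = strtok(copy, " \t");
    if (tok == NULL)
        return -1;
    snprintf(directive, dirsize, "%s", tok);

    while (count < DEMO_MAX_PARAMS && (tok = strtok(NULL, " \t")) != NULL) {
        char *eq = strchr(tok, '=');
        if (eq == NULL)
            continue;  /* skip tokens that are not name=value */
        *eq = '\0';
        snprintf(params[count].name, sizeof params[count].name, "%s", tok);
        snprintf(params[count].value, sizeof params[count].value, "%s", eq + 1);
        count++;
    }
    return count;
}
```

For the directive shown above, the sketch yields the directive word Data and three pairs: fn, src, and deny. A function such as filter-by-exact would then look up src and deny by name.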
The three structures that are used to hold parameters are libcs_pb_param, libcs_pb_entry, and libcs_pblock. These structures are defined in the header file BaseDir/SUNWps/sdk/robot/include/libcs/pblock.h.
- libcs_pb_param
This structure holds a single parameter. It records the name and value of the parameter:
typedef struct {
    char *name, *value;
} libcs_pb_param;
- libcs_pb_entry
This structure creates linked lists of libcs_pb_param structures:
struct libcs_pb_entry {
    libcs_pb_param *param;
    struct libcs_pb_entry *next;
};
- libcs_pblock
Working with Parameter Blocks
A parameter block stores parameters and values as name/value pairs. There are many pre-defined functions you can use to work with parameter blocks, to extract parameter values, to change parameter values, and so on. For example, libcs_pblock_findval(paramname, returnPblock) uses the given return pblock to return the value of the named parameter in the RAF's input pblock. For an example, see RAF Definition Example.
When adding, removing, editing, and creating name-value pairs for parameters, your robot application functions can use the functions in the pblock.h header file (in BaseDir/SUNWps/sdk/robot/include/libcs/).
The names of these functions are all prefixed by libcs_.
Table 5-7 lists the parameter manipulation functions. See the BaseDir/SUNWps/sdk/robot/include/libcs/pblock.h header file for the full function signatures, including return types and arguments.
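The idea behind storing and finding name-value pairs can be sketched with structures that mirror the shape of libcs_pb_param and libcs_pb_entry. This is a minimal sketch, not the pblock.h API: all `demo_` types and functions are hypothetical, and a single linked list stands in for the hash table that a real parameter block uses (it corresponds to one hash bucket).

```c
#include <stdlib.h>
#include <string.h>

/* Stand-ins that mirror the documented structure shapes; names are hypothetical. */
typedef struct {
    char *name, *value;
} demo_pb_param;

struct demo_pb_entry {
    demo_pb_param *param;
    struct demo_pb_entry *next;
};

/* Copies a string into freshly allocated memory; returns NULL on failure. */
static char *demo_copy(const char *s)
{
    char *p = malloc(strlen(s) + 1);
    if (p != NULL)
        strcpy(p, s);
    return p;
}

/* Prepends a pair to the list. Returns the new head, or NULL on allocation
 * failure, in which case the caller still owns the old head. */
struct demo_pb_entry *demo_pb_insert(struct demo_pb_entry *head,
                                     const char *name, const char *value)
{
    struct demo_pb_entry *e = malloc(sizeof *e);
    demo_pb_param *p = malloc(sizeof *p);
    if (e == NULL || p == NULL) {
        free(e);
        free(p);
        return NULL;
    }
    p->name = demo_copy(name);
    p->value = demo_copy(value);
    if (p->name == NULL || p->value == NULL) {
        free(p->name);
        free(p->value);
        free(p);
        free(e);
        return NULL;
    }
    e->param = p;
    e->next = head;
    return e;
}

/* Returns the value stored under name, or NULL if the name is absent. */
const char *demo_pb_findval(const struct demo_pb_entry *head, const char *name)
{
    for (; head != NULL; head = head->next)
        if (strcmp(head->param->name, name) == 0)
            return head->param->value;
    return NULL;
}

/* Frees every entry together with its parameter and strings. */
void demo_pb_free(struct demo_pb_entry *head)
{
    while (head != NULL) {
        struct demo_pb_entry *next = head->next;
        free(head->param->name);
        free(head->param->value);
        free(head->param);
        free(head);
        head = next;
    }
}
```

In a real plug-in you would use the libcs_ functions from pblock.h instead of writing your own list handling. The sketch only shows what a lookup by name amounts to.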
Getting Information on the Processed Resource
As mentioned in RAF Prototype, the prototype for all robot application functions is in the following format:
int (*RobotAPIFn)(pblock *pb, CSFilter *csf, CSResource *csr);
where csr is a data structure that contains information about the resource being processed.
The CSResource structure is defined in the header file robotapi.h. This structure contains information about the resource being processed. Each resource is in SOIF syntax.
Objects in SOIF syntax have a schema name, an associated URL, and a set of attribute-value pairs.
In Code Example 5-1, the schema name is @DOCUMENT, the URL is http://developer.siroe.com/docs/manuals/htmlguid/index.htm, and the SOIF contains attribute-value pairs for title, author, and description.
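As an illustration of how a schema name, a URL, and attribute-value pairs fit together, the following sketch writes an object in the general SOIF text layout, where each attribute line carries the byte length of its value in braces. Treat the exact layout as an assumption about what the search engine produces. The `demo_avpair` type and `demo_format_soif` function are hypothetical, and real code would use the routines declared in soif.h.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical attribute-value pair used only by this sketch. */
typedef struct {
    const char *attribute;
    const char *value;
} demo_avpair;

/*
 * Writes a SOIF-style object into buf: the schema name and URL on the first
 * line, one "attribute{length}:<TAB>value" line per pair, then a closing "}".
 * Returns the number of characters written, or -1 if buf is too small.
 */
int demo_format_soif(char *buf, size_t size, const char *schema, const char *url,
                     const demo_avpair *pairs, size_t npairs)
{
    size_t used;
    int n = snprintf(buf, size, "%s { %s\n", schema, url);
    if (n < 0 || (size_t)n >= size)
        return -1;
    used = (size_t)n;

    for (size_t i = 0; i < npairs; i++) {
        n = snprintf(buf + used, size - used, "%s{%lu}:\t%s\n",
                     pairs[i].attribute,
                     (unsigned long)strlen(pairs[i].value),
                     pairs[i].value);
        if (n < 0 || (size_t)n >= size - used)
            return -1;
        used += (size_t)n;
    }

    n = snprintf(buf + used, size - used, "}\n");
    if (n < 0 || (size_t)n >= size - used)
        return -1;
    return (int)(used + (size_t)n);
}
```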
A CSResource structure has a url field, which contains the URL for the SOIF. It also has an rd field, whose value is the SOIF for the resource. Once you get the SOIF for the resource, you can use the functions for working with SOIF that are defined in BaseDir/SUNWps/sdk/rdm/include/soif.h to get more information about the resource. (The file robotapi.h includes soif.h.)
For example, the macro SOIF_Findval(soif, attribute) gets the value of the given attribute in the given SOIF. Code Example 5-2 uses this macro to print the value of the META attribute if it exists for the resource being processed:
It is recommended that you review the CSResource structure in the file robotapi.h for more information on other fields and macros. For more information about the routines to use with SOIF objects, see Chapter 7, "Using the RDM API to Access the Search Engine and Database in C."
Returning a Response Status Code
When your robot application function has finished processing, it must return a code that tells the server how to proceed with the request.
These codes are defined in the header file BaseDir/SUNWps/sdk/robot/include/robotapi.h. Table 5-8 describes the response status codes that a function can return after it has completed processing:
Reporting Errors to the Robot Log File
When problems occur, robot application functions should return an appropriate response status code (such as REQ_ABORTED), and they should also log an error in the error log file.
To use the error-logging functionality, you must include the file log.h in the sdk/robot/include/libcs directory.
After you have ensured that log.h exists in the correct place, you can use the cslog_error macro to report errors. The prototype is in the following format:
cslog_error(int n, int loglevel, char* errorMessage)
The first parameter is not currently used, although it may be used in the future. You can pass any integer.
The second parameter is the log level. When the log level is less than or equal to the log level setting in the file process.conf, the error message is written in the robot.log.
The third parameter is the error message to print, and it has the same form as the argument to the standard printf() function.
For example:
cslog_error(1, 1, ("fn=extract-html-text: Out of memory!\n"));
This invocation of cslog_error would generate the following error message in the robot log file:
[22/Jan/1998:15:57:31] 8270@0: ERROR: fn=extract-html-text: Out of memory!
For another example:
cslog_error(1, 1,
    ("<URL:%s>: Error %d (%d): %s\n",
     ep->eo->key,
     urls->server_status,
     status,
     (s = cslog_linestr(urls->error_msg))));
This invocation of cslog_error would generate the following error message in the robot log file:
[22/Mar/2002:15:57:31] 8270@0: ERROR: <URL:http://budgie.siroe.com:80/>: Error 0 (-240): Can't connect to server
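The extra pair of parentheses around the message and its arguments is a common ANSI C idiom for passing a variable-length, printf-style argument list through a macro. The sketch below shows one way such a macro can be written. It is not the definition in log.h: `demo_log_error`, `demo_log_write`, `demo_log_threshold`, and `demo_messages_written` are hypothetical names, and the threshold variable merely stands in for the log level set in process.conf.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical threshold standing in for the configured log level. */
static int demo_log_threshold = 1;
/* Counts messages actually written, so the filtering can be observed. */
static int demo_messages_written = 0;

/* printf-like helper that writes one error message to stderr. */
static void demo_log_write(const char *fmt, ...)
{
    va_list ap;
    fputs("ERROR: ", stderr);
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    demo_messages_written++;
}

/*
 * The third macro argument is a complete parenthesized argument list, so the
 * preprocessor treats it as a single argument and pastes it directly after
 * the helper's name. The first argument is accepted and ignored.
 */
#define demo_log_error(n, level, args)          \
    do {                                        \
        (void)(n);                              \
        if ((level) <= demo_log_threshold)      \
            demo_log_write args;                \
    } while (0)
```

With the threshold at 1, a call at level 1 is written and a call at level 2 is suppressed, which matches the rule that a message is logged only when its level is less than or equal to the configured log level.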
RAF Definition Example
This section shows an example definition for a robot application function.
This function copies specified source data to a multi-valued field in an RD. For example, the search engine stores category or classification information in the classification field of an RD. The copy_mv function allows the robot to get the value of an HTML <META> tag of any name and store the value in the classification field in the database. For example, using this function, you could instruct the robot to get the content of the <META NAME="topic"> tag and store it as the classification of the resource.
You would invoke this function with a directive such as the following:
Generate fn=copy_mv src=topic dst=classification
Code Example 5-3 shows a sample function definition:
Compiling and Linking your Code
You can compile your code with any ANSI C compiler. See the makefile in the BaseDir/SUNWps/sdk/robot/example directory for an example. The makefile assumes the use of gmake.
This section lists the linking options you need to use to create a shared object that the robot can be instructed to load by commands in the filter.conf configuration file. Note that you can link object files into a shared object. In Table 5-9, the compiled object files t.o and u.o are linked to form a shared object called test.so.
Table 5-9    Options for linking
| System | Compile options |
| --- | --- |
| Solaris | ld -G t.o u.o -o test.so |
Loading Your Shared Object
The robot uses the filters defined in filter.conf to filter resources that it encounters. If the file filter.conf uses your customized robot application functions, it must load the shared object that contains the functions.
To load the shared object, add a line to filter.conf:
Init fn=load-modules shlib=[path]filename.so funcs="function1,function2,...,functionN"
This initialization function opens the given shared object file and loads the functions function1, function2, and so on. You can then use the functions function1 and function2 in the robot configuration file (filter.conf). Remember to use the functions only with the directives you wrote them for, as described in the following section.
Using your New Robot Application Functions
When you have compiled and arranged for the loading of your functions, you need to provide for their execution. All functions are called as follows:
Directive fn=function [name1=value1] ... [nameN=valueN]
- Directive identifies the class of function that is being called. Functions should not be called from the wrong directive.
- fn=function identifies the function to be called using the function's unique character-string name.
These two parameters are mandatory. In addition, there may be an arbitrary number of function-specific parameters, each of which is a name-value pair.
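Conceptually, the fn= value is a key that selects a function pointer with the RAF prototype. The following sketch shows that lookup with a small static table. It is an assumption-laden illustration and not the robot's loader: the opaque `demo_` types stand in for pblock, CSFilter, and CSResource so that the code compiles without the SDK headers, and the zero return value is a placeholder rather than a real response status code.

```c
#include <stddef.h>
#include <string.h>

/* Opaque stand-ins so the sketch compiles without the SDK headers. */
typedef struct demo_pblock demo_pblock;
typedef struct demo_filter demo_filter;
typedef struct demo_resource demo_resource;

/* Same shape as the RAF prototype shown earlier in this chapter. */
typedef int (*demo_raf)(demo_pblock *pb, demo_filter *csf, demo_resource *csr);

typedef struct {
    const char *name;  /* the string expected after fn= */
    demo_raf fn;
} demo_registry_entry;

/* Placeholder plug-in body; a real function would inspect pb and csr. */
static int demo_wordcount(demo_pblock *pb, demo_filter *csf, demo_resource *csr)
{
    (void)pb;
    (void)csf;
    (void)csr;
    return 0;  /* placeholder; real status codes come from robotapi.h */
}

static const demo_registry_entry demo_registry[] = {
    { "wordcount", demo_wordcount },
};

/* Returns the function registered under name, or NULL if there is none. */
demo_raf demo_lookup(const char *name)
{
    size_t n = sizeof demo_registry / sizeof demo_registry[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(demo_registry[i].name, name) == 0)
            return demo_registry[i].fn;
    return NULL;
}
```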
You will need to specify your function in the directive for which it was written. For example, the following line uses a plug-in function called wordcount that can be used in the Data stage. This function counts the words in a resource and assigns the count to a destination specified by a parameter called dst.
Data fn=wordcount dst=word-count
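The counting step at the heart of such a function could be as simple as the sketch below. The name `demo_count_words` is hypothetical, and the sketch deliberately omits how a real plug-in would obtain the resource content and store the result under the destination named by dst, because those steps depend on the SDK.

```c
#include <ctype.h>
#include <stddef.h>

/* Counts whitespace-separated words in a NUL-terminated buffer. */
size_t demo_count_words(const char *text)
{
    size_t count = 0;
    int in_word = 0;

    for (; *text != '\0'; text++) {
        if (isspace((unsigned char)*text)) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;  /* first character of a new word */
            count++;
        }
    }
    return count;
}
```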