
Sun ONE Portal Server 6.2 Developer's Guide

Chapter 8
Search Engine Robot

This chapter describes the search engine robot and the application programming interface (API) used to create plug-in robot application functions (RAFs).


Note

For more information on the Search Engine Robot, see the Sun ONE Portal Server 6.2 Administrator’s Guide.




Search Engine Robot Overview

A search engine robot is an agent that identifies and reports on resources in its domains; it does so by using two kinds of filters: an enumerator filter and a generator filter.

The enumerator filter locates resources by using network protocols. It tests each resource and, if the resource meets the selection criteria, enumerates it. For example, the enumerator filter can extract hypertext links from an HTML file and use the links to find additional resources.

The generator filter tests each resource to determine if a resource description (RD) should be created. If the resource passes the test, the generator creates an RD which is stored in the search engine database.

How the Robot Works

Figure 8-1 illustrates how the search engine robot works. In Figure 8-1, the robot examines URLs and their associated network resources. Each resource is tested by both the enumerator and the generator. If the resource passes the enumeration test, the robot checks it for additional URLs. If the resource passes the generator test, the robot generates a resource description that is stored in the search engine database.

Figure 8-1  How the Robot Works


Robot Configuration Files

Robot configuration files define the behavior of the search engine robots. These files reside in the directory webcontainer/portal/config. The following table lists each robot configuration file and describes its purpose:

robot.conf

Defines most of the operating parameters for the robot. In addition, this file points the robot to applicable filters in the file filter.conf.

filter.conf

Contains all of the functions used by the Search Engine robot during the enumeration and generation filtering tasks. Including the same functions for both enumeration and generation ensures that a single rule change affects both tasks.

filterrules.conf

Contains the starting points (also referred to as starting point URLs) and rules used by the filterrules-process function.

classification.conf

Contains rules used to classify RDs generated by the robot.

Use the administration console to edit these configuration files.

The Filtering Process

The robot uses filters to determine which resources to process and how to process them. When the robot discovers references to resources as well as the resources themselves, it applies filters to each resource in order to enumerate it and to determine whether or not to generate a resource description to store in the search engine database.

The robot examines one or more starting point URLs, applies the filters, and then applies the filters to the URLs spawned by enumerating the starting point URLs, and so on. Note that the starting point URLs are defined in the filterrules.conf file.

A filter performs any required initialization operations and applies comparison tests to the current resource. The goal of each test is to either allow or deny the resource. A filter also has a shutdown phase during which it performs any required cleanup operations.

If a resource is allowed, it continues its passage through the filter. If a resource is denied, then the resource is rejected. No further action is taken by the filter for resources that are denied. If a resource is not denied, the robot will eventually enumerate it, attempting to discover further resources. The generator might also create a resource description for it.

Note that these operations are not necessarily linked. Some resources result in enumeration; others result in RD generation. Many resources result in both enumeration and RD generation. For example, if the resource is an FTP directory, the resource typically will not have an RD generated for it. However, the robot might enumerate the individual files in the FTP directory. An HTML document that contains links to other documents can produce a generated RD and can lead to enumeration of the linked documents as well.

Stages in the Filter Process

Both enumerator and generator filters have five phases in the filtering process. They both have four common phases, Setup, Metadata, Data, and Shutdown. If the resource makes it past the Data phase, it is either in the Enumerate or Generate phase, depending on whether the filter is an enumerator or a generator.

Setup    

Performs initialization operations. Occurs only once in the life of the robot.

Metadata    

Filters the resource based on metadata that is available about the resource. Metadata filtering occurs once per resource before the resource is retrieved over the network. Table 8-1 lists examples of common metadata types:

Table 8-1  Common Metadata Types

Metadata        Description                                   Example
Complete URL    The location of a resource                    http://home.siroe.com/
Protocol        The access portion of the URL                 http, ftp, file
Host            The address portion of the URL                www.siroe.com
IP address      Numeric version of the host                   198.95.249.6
Path            The path portion of the URL                   /index.html
Depth           Number of links from the starting point URL   5

Data    

Filters the resource based on its data. Data filtering is done once per resource after it is retrieved over the network. Data that can be used for filtering includes:

Enumerate    

Enumerates the current resource in order to determine if it points to other resources to be examined.

Generate    

Generates a resource description (RD) for the resource and saves it so that it can be added to the search engine database.

Shutdown    

Performs any needed termination operations. Occurs once in the life of the robot.
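The per-resource part of the stage sequence can be sketched in C as follows. This is only an illustration under stated assumptions: every name prefixed with demo_ or DEMO_ is hypothetical, the allow/deny tests are invented for the example, and none of it is part of the robot API.

```c
#include <string.h>

/* Hypothetical phase names mirroring the stages described above. */
typedef enum {
    DEMO_PHASE_SETUP, DEMO_PHASE_METADATA, DEMO_PHASE_DATA,
    DEMO_PHASE_ENUMERATE, DEMO_PHASE_GENERATE, DEMO_PHASE_SHUTDOWN
} demo_phase;

typedef enum { DEMO_ENUMERATOR, DEMO_GENERATOR } demo_filter_kind;

/* Hypothetical resource: the metadata is known before retrieval,
 * the content type only after the resource has been fetched. */
typedef struct {
    const char *protocol;      /* metadata */
    int depth;                 /* metadata */
    const char *content_type;  /* data */
} demo_resource;

/* Example Metadata test: allow only shallow http resources. */
static int demo_metadata_allows(const demo_resource *r)
{
    return strcmp(r->protocol, "http") == 0 && r->depth <= 5;
}

/* Example Data test: deny plain text. */
static int demo_data_allows(const demo_resource *r)
{
    return strcmp(r->content_type, "text/plain") != 0;
}

/* Returns the last per-resource phase that the resource reaches.
 * A denied resource stops at the phase that denied it. */
demo_phase demo_run_resource(demo_filter_kind kind, const demo_resource *r)
{
    if (!demo_metadata_allows(r))
        return DEMO_PHASE_METADATA;
    if (!demo_data_allows(r))
        return DEMO_PHASE_DATA;
    return kind == DEMO_ENUMERATOR ? DEMO_PHASE_ENUMERATE
                                   : DEMO_PHASE_GENERATE;
}
```

Setup and Shutdown are left out of demo_run_resource because they occur once in the life of the robot rather than once per resource.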


Robot Completion Scripts

After the robot has finished with all the entries in the enumeration-pool and has finished all outstanding processing, you can specify additional programs to execute by enabling the cmdHook parameter.

The cmdHook is provided as a way for you to extend the shutdown phase of the robot. For example, you might want the robot to send email after it completes one run, start another process, analyze its own log files, write a small report, and so on.

When the robot has finished all outstanding processing, it will call the executable specified in the cmdHook setting. If the onCompletion parameter is set to idle or quit, the script is called once before the robot shuts down or goes idle. If the parameter is set to loop, the script is called each time the robot restarts. See the log samples in Monitoring cmdHook Execution.

The cmdHook script can be written in any language (as a Perl or shell script, a C program, and so on). If you choose to use a C program, you’ll have to insert the cmdHook parameter in the robot.conf file manually, as the search engine Administration Interface does not scan binary executables.

The cmdHook script is run from the robot’s execution environment. This means your script will inherit any environment variables set by the robot and will not have access to environment variables that might be set by your search server or the administration server. The cmdHook script will be executed from /var/opt/SUNWps/serverinstance/portal. This is important to keep in mind if you are using any relative directory references.

There are two examples provided in portal-server-install-root/SUNWps/samples/robot. They are a simple ‘touch’ test (cmdHook0) and an email upon completion script (cmdHook1).

Monitoring cmdHook Execution

At the default logging level, a log message is written in robot.log when the cmdHook is run.

For example, if the onCompletion parameter is set to idle, the robot.log output would look like the following:

[12:45:57] Run cmd: /opt/SUNWps/bin/cmdHook1

[12:45:58] Complete cmd: /opt/SUNWps/bin/cmdHook1

[12:45:58] Robot is idle....

If the onCompletion parameter is set to shutdown, the robot.log output would look like the following:

[12:45:57] Run cmd: /opt/SUNWps/bin/cmdHook1

[12:45:58] Complete cmd: /opt/SUNWps/bin/cmdHook1

[12:46:33] Workload complete.

If the onCompletion parameter is set to loop, the robot.log output would look like the following:

[12:52:04] Run cmd: /opt/SUNWps/bin/cmdHook1

[12:52:05] Complete cmd: /opt/SUNWps/bin/cmdHook1

[12:52:05] Restart Robot.

If the onCompletion parameter is set to loop, there will also be an additional entry in filter.log like the following:

[12:54:41] Filter log started - loop 15

Preparing Your Completion Script to Appear in the Administration Interface

If you want your cmdHook script to display as an option on the search engine Administration page, you should follow these guidelines:

  1. Because the search engine Administration Interface does not scan binary files when it looks for cmdHook scripts, write your script in a run-time evaluated language instead of C or another compiled language.
  2. Place the file in the portal-server-install-root/SUNWps/bin directory.
  3. Name the file cmdHook, followed by an alphanumeric character string, for example, cmdHook0, cmdHookAlpha, or cmdHook12a.
  4. Add a description string to the file and comment it so that it does not affect the execution of the script. For example:
     #description="My menu choice string"


Creating New Robot Application Functions

Typically, you might modify the behavior of a search engine robot filter by using the search engine administration interface or the predefined robot application functions (RAFs).


Note

For more information on the search engine administration interface and on the predefined robot application functions (RAFs), see the Sun ONE Portal Server 6.2 Administrator’s Guide.


When you want to modify the behavior of the search engine robot filters in a way that is not accommodated by the standard filter functions, you will need to create your own robot application functions (RAFs). Robot filters are defined in the file filter.conf. Filter definitions consist of filter directives, which each specify a robot application function.


Robot Plug-in API Overview

The robot plug-in API is a set of functions and header files that help you create your own robot application functions to use with the directives in robot configuration files. Use this API to create the built-in functions for the directives used in filter.conf (the robot filter configuration file).

When you become familiar with this API, you will be able to override, add, or customize robot functionality. For example, you will be able to create functions that use a custom database for access control, or you can create functions that create custom log files with special entries.

Most developers write RAFs in C. However, you can define the functions in any language that can build a shared library. If you use C++, you will need to modify the provided C header files so that C++ files can use them.

The following steps are a brief overview of the process for creating your own plug-in functions:

  1. Compile your code to create a shared object (.so) file.
  2. In the Setup directives at the top of filter.conf, tell the robot to load your shared object file or dynamic-link library.
  3. Write directives that use your plug-in functions in the robot configuration file (filter.conf).

The portal-server-install-root/SUNWps/sdk/robot/include directory contains all the header files you need to include when writing your plug-in functions.

The portal-server-install-root/SUNWps/sdk/robot/examples directory contains sample code, the header files, and a makefile. You should familiarize yourself with the code and samples.

The Robot Application Function Header Files

This section discusses the header files needed for creating robot application functions.

Header File Hierarchy

The hierarchy of robot plug-in API header files is as follows (directory names end with a slash):

robot/
    include/
        csinfo.h
        csmem.h
        filterrules.h
        robotapi.h
        base/
            systems.h
        libcs/
            adt.h
            cs.h
            csidcf.h
            getopt.h
            log.h
            pblock.h

The robot and its header files are written in ANSI C.

Header File Contents

This section describes the header files you can include when writing your plug-in functions. This section is intended as a starting point for learning the functions included in the header files.

Most of the header files are stored in the following directories:

BaseDir/SUNWps/sdk/robot/include

Contains header files that define general purpose data structures and function prototypes.

BaseDir/SUNWps/sdk/robot/include/base

Contains the systems.h header file, which handles low-level, platform-independent functions such as memory, file, and network access.

BaseDir/SUNWps/sdk/robot/include/libcs

Contains header files of functions that handle robot and HTTP-specific functions such as handling access to configuration files and HTTP.

The following table lists the header files in the include directory and briefly describes each one:

csinfo.h

Contains functions for object typing, specifically for mapping files to MIME types.

csmem.h

Contains memory-related definitions.

robotapi.h

Contains the type definitions for structures and the return-code definitions for robot API functions.

filterrules.h

Contains type definitions for structures needed by filter rules.

The following table lists the header file in the base directory and briefly describes it:

systems.h

Contains functions that handle system information.

The following table lists the header files in the libcs directory and briefly describes each one:

adt.h

Contains type definitions and function prototypes for utilities needed by the robot, such as linked lists, queues, and hash tables.

cs.h

Contains a library of common functions used by the Search Engine.

csidcf.h

Contains configuration definitions for Search Engine.

getopt.h

Contains routines to get options from the command line, for example, command line prog -n arg1 -p arg2, to get arg1 and arg2.

log.h

Contains routines for writing information to log files.

pblock.h

Contains functions that manage parameter passing and robot internal variables. It also contains functions to get values from a user via the server.

Writing Robot Application Functions

When you write robot application functions, you must make sure that the file that defines your robot application functions includes robotapi.h. You will also find many useful functions in csinfo.h.

All robot application functions use parameter blocks (pblocks) to receive and set parameter values. A parameter block stores parameters as name-value pairs. A parameter block is a hash table that is keyed on the name portion of each parameter it contains.


RAF Prototype

All robot application functions have the following prototype:

int (*RobotAPIFn)(pblock *pb, CSFilter *csf, CSResource *csr);

pb is the parameter block containing the parameters for this function invocation.

csf is the pointer to an enumeration or generation filter.


Note

The pb parameter is read-only, and any data modification should be performed on copies of the data. Doing otherwise is unsafe in threaded server architectures and will yield unpredictable results in multiprocess server architectures.


Writing Functions for Specific Directives

You should write each function for a particular stage in the filtering process (Setup, Metadata, Data, Enumeration, Generation, or Shutdown). The function should only use the data sources that are available at the relevant stage. See the section Sources and Destinations (in the Sun ONE Portal Server 6.2 Administrator’s Guide) for a list of the data sources available at each stage.

At the Setup stage, the filter is initializing and cannot get information about the resource’s URL or content.

At the Metadata stage, the robot has encountered a URL for a resource but has not downloaded the resource’s content. Consequently, information is available about the URL and about data that is derived from other sources such as the filter.conf file. At this stage, information is not available about the content of the resource.

At the Data stage, the robot has downloaded the content of the URL, so information is available about the content, such as the description, the author, and so on.

At the Enumeration and Generation stages, the same data sources are available as for the Data stage.

At the Shutdown stage, the filter has completed its processes and shuts down. Although functions written for this stage can use the same data sources as those available at the Data stage, shutdown functions typically restrict their operations to shutdown and clean up activities.

Passing Parameters to Robot Application Functions

You must use parameter blocks (pblocks) to pass arguments into Robot Application Functions and to extract data from them. For example, the following directive (in the filter.conf file) invokes the filter-by-exact function.

Data fn=filter-by-exact src=type deny=text/plain

The fn parameter indicates the function to invoke, which in this case is filter-by-exact. The src and deny arguments are parameters used with the function. They will be passed to the function in a parameter block, and the function should be defined to extract its parameters and their values from the parameter block.
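How a function might pull its arguments out of a set of name-value pairs and act on them can be sketched as follows. This is a simplified, self-contained illustration: demo_param, demo_findval, and demo_filter_by_exact are hypothetical stand-ins, a plain array replaces the real libcs_pblock, and the logic is only a guess at what an exact-match deny test looks like, not the shipped filter-by-exact implementation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one name=value parameter of a directive. */
typedef struct {
    const char *name;
    const char *value;
} demo_param;

typedef enum { DEMO_ALLOW, DEMO_DENY } demo_verdict;

/* Looks up a parameter by name; returns NULL when it is absent. */
static const char *demo_findval(const char *name,
                                const demo_param *params, int n)
{
    for (int i = 0; i < n; i++)
        if (strcmp(params[i].name, name) == 0)
            return params[i].value;
    return NULL;
}

/* Denies the resource when the value read from the source named by
 * "src" (resolved by the caller and passed as src_value) is exactly
 * equal to the "deny" parameter. */
demo_verdict demo_filter_by_exact(const demo_param *params, int n,
                                  const char *src_value)
{
    const char *deny = demo_findval("deny", params, n);

    if (deny != NULL && src_value != NULL && strcmp(deny, src_value) == 0)
        return DEMO_DENY;
    return DEMO_ALLOW;
}
```

With the parameters from the directive above (src=type, deny=text/plain), a resource whose type is text/plain would be denied and one whose type is text/html would be allowed.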

The three structures that are used to hold parameters are libcs_pb_param, libcs_pb_entry, and libcs_pblock. These structures are defined in the header file portal-server-install-root/SUNWps/sdk/robot/include/libcs/pblock.h.

libcs_pb_param

This structure holds a single parameter. It records the name and value of the parameter:

typedef struct {
    char *name,*value;
} libcs_pb_param;

libcs_pb_entry

This structure creates linked lists of libcs_pb_param structures:

struct libcs_pb_entry {
    libcs_pb_param *param;
    struct libcs_pb_entry *next;
};

libcs_pblock

This structure is a hash-table containing an array of libcs_pb_entry structures:

typedef struct {
    int hsize;
    struct libcs_pb_entry **ht;
} libcs_pblock;
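To show how the three structures fit together, here is a minimal, self-contained sketch of a chained hash table with the same layout. All demo_ names are hypothetical, the hash function is an arbitrary choice (the one used by libcs is not described here), and the real functions in pblock.h may behave differently.

```c
#include <stdlib.h>
#include <string.h>

/* Same shapes as libcs_pb_param, libcs_pb_entry, and libcs_pblock. */
typedef struct { char *name, *value; } demo_pb_param;
struct demo_pb_entry { demo_pb_param *param; struct demo_pb_entry *next; };
typedef struct { int hsize; struct demo_pb_entry **ht; } demo_pblock;

static char *demo_strdup(const char *s)
{
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy) memcpy(copy, s, n);
    return copy;
}

/* Arbitrary string hash that selects a slot in the ht array. */
static unsigned demo_hash(const char *name, int hsize)
{
    unsigned h = 0;
    while (*name) h = h * 31u + (unsigned char)*name++;
    return h % (unsigned)hsize;
}

demo_pblock *demo_pblock_create(int hsize)
{
    demo_pblock *pb;
    if (hsize <= 0 || (pb = malloc(sizeof *pb)) == NULL) return NULL;
    pb->hsize = hsize;
    pb->ht = calloc((size_t)hsize, sizeof *pb->ht);
    if (!pb->ht) { free(pb); return NULL; }
    return pb;
}

/* Copies name and value, then pushes the entry onto its slot's list. */
int demo_pblock_nvinsert(const char *name, const char *value, demo_pblock *pb)
{
    struct demo_pb_entry *e = malloc(sizeof *e);
    demo_pb_param *p = malloc(sizeof *p);
    if (!e || !p) { free(e); free(p); return -1; }
    p->name = demo_strdup(name);
    p->value = demo_strdup(value);
    if (!p->name || !p->value) {
        free(p->name); free(p->value); free(p); free(e);
        return -1;
    }
    unsigned slot = demo_hash(name, pb->hsize);
    e->param = p;
    e->next = pb->ht[slot];
    pb->ht[slot] = e;
    return 0;
}

/* Walks the list in the name's slot; returns NULL when not found. */
const char *demo_pblock_findval(const char *name, const demo_pblock *pb)
{
    struct demo_pb_entry *e = pb->ht[demo_hash(name, pb->hsize)];
    for (; e; e = e->next)
        if (strcmp(e->param->name, name) == 0) return e->param->value;
    return NULL;
}

void demo_pblock_free(demo_pblock *pb)
{
    if (!pb) return;
    for (int i = 0; i < pb->hsize; i++) {
        struct demo_pb_entry *e = pb->ht[i];
        while (e) {
            struct demo_pb_entry *next = e->next;
            free(e->param->name);
            free(e->param->value);
            free(e->param);
            free(e);
            e = next;
        }
    }
    free(pb->ht);
    free(pb);
}
```

Because entries that hash to the same slot are chained through next, the table keeps working even when hsize is small; lookups just walk a longer list.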

Working with Parameter Blocks

A parameter block stores parameters and values as name-value pairs. There are many predefined functions you can use to work with parameter blocks, to extract parameter values, to change parameter values, and so on. For example, libcs_pblock_findval(name, pb) returns the value of the named parameter in the parameter block pb, which is typically the RAF’s input pblock. For an example, see RAF Definition Example.

When adding, removing, editing, and creating name-value pairs for parameters, your robot application functions can use the functions in the pblock.h header file (in portal-server-install-root/SUNWps/sdk/robot/include/libcs directory).

The names of these functions are all prefixed by libcs_.

The following table contains the parameter manipulation functions in the first (left) column and a description of the corresponding function in the second (right) column. See the portal-server-install-root/SUNWps/sdk/robot/include/libcs/pblock.h header file for full function signatures with return type and arguments.

libcs_param_create

Creates a parameter with the given name and value. If the name and value are not null, they are copied and placed into a new libcs_pb_param structure.

libcs_param_free

Frees a given parameter if it is non-NULL. It returns 1 if the parameter was non-NULL and 0 if it was NULL. This function is useful for error checking before using the libcs_pblock_remove function.

libcs_pblock_create

Creates a new parameter block with a hash table of a chosen size. Returns the newly allocated parameter block.

libcs_pblock_free

Frees a given parameter block and any entries inside it.

libcs_pblock_find

Finds the entry with the given name in a pblock and returns its parameter structure, otherwise returns NULL.

libcs_pblock_findval

Finds the entry with the given name in a pblock, and returns its value, otherwise returns NULL.

libcs_pblock_remove

Behaves like the libcs_pblock_find function, but in addition, it removes the entry from the pblock.

libcs_pblock_nninsert and libcs_pblock_nvinsert

These functions create a new parameter with a given name and value and insert it into a given parameter block. The libcs_pblock_nninsert function requires that the value be an integer, but the libcs_pblock_nvinsert function accepts a string.

libcs_pblock_pinsert

Inserts a parameter into a parameter block.

libcs_pblock_str2pblock

Scans the given string for parameter pairs in the format name=value or name="value", adds them to a pblock, and returns the number of parameters added.

libcs_pblock_pblock2str

Places all of the parameters in the given parameter block into the given string. Each parameter is of the form name="value" and is separated by a space from any adjacent parameter.
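The name=value and name="value" formats mentioned for libcs_pblock_str2pblock can be illustrated with a small stand-alone parser. This is only a sketch of the parsing idea: demo_pair and demo_str2pairs are hypothetical, the fixed-size buffers and the handling of malformed tokens are assumptions made for the example, and the real function stores its results in a pblock rather than an array.

```c
#include <ctype.h>
#include <stddef.h>

#define DEMO_MAX 64

typedef struct {
    char name[DEMO_MAX];
    char value[DEMO_MAX];
} demo_pair;

/* Parses whitespace-separated name=value or name="value" pairs.
 * Stores at most max pairs in out and returns the number stored.
 * Names and values longer than DEMO_MAX - 1 characters are truncated;
 * tokens without '=' or with an empty name are skipped. */
int demo_str2pairs(const char *s, demo_pair *out, int max)
{
    int count = 0;

    while (count < max) {
        size_t n = 0, v = 0;

        while (isspace((unsigned char)*s)) s++;
        if (*s == '\0') break;

        /* The name runs up to '=' (or to whitespace if '=' is missing). */
        while (*s && *s != '=' && !isspace((unsigned char)*s)) {
            if (n < DEMO_MAX - 1) out[count].name[n++] = *s;
            s++;
        }
        out[count].name[n] = '\0';
        if (*s != '=') continue;   /* no '=': skip this token */
        s++;

        if (*s == '"') {           /* quoted value may contain spaces */
            s++;
            while (*s && *s != '"') {
                if (v < DEMO_MAX - 1) out[count].value[v++] = *s;
                s++;
            }
            if (*s == '"') s++;
        } else {                   /* bare value ends at whitespace */
            while (*s && !isspace((unsigned char)*s)) {
                if (v < DEMO_MAX - 1) out[count].value[v++] = *s;
                s++;
            }
        }
        out[count].value[v] = '\0';
        if (n > 0) count++;
    }
    return count;
}
```

Quoting matters only when a value contains whitespace, which is why the name="value" form exists alongside the bare form.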

Getting Information on the Processed Resource

As mentioned in RAF Prototype, the prototype for all robot application functions is in the following format:

int (*RobotAPIFn)(pblock *pb, CSFilter *csf, CSResource *csr);

where csr is a data structure that contains information about the resource being processed.

The CSResource structure is defined in the header file robotapi.h. This structure contains information about the resource being processed. Each resource is in SOIF syntax.

Objects in SOIF syntax have a schema name, an associated URL, and a set of attribute-value pairs.

In Code Example 8-1, the schema name is @DOCUMENT, the URL is http://developer.siroe.com/docs/manuals/htmlguid/index.htm, and the SOIF contains attribute-value pairs for title, author, and description.

Code Example 8-1  SOIF Syntax Example  

@DOCUMENT{ http://developer.siroe.com/docs/manuals/htmlguid/index.htm
    title{18}: HTML Tag Reference
    author{11}: Preston Day
    description{37}: Reference to HTML tags and attributes
}
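The number in braces after each attribute name is the length of the value in bytes. The following sketch formats an object in the same layout as Code Example 8-1 to make that relationship concrete. The names demo_avpair and demo_format_soif are hypothetical, and the sketch does not use the SOIF routines from soif.h.

```c
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *attribute;
    const char *value;
} demo_avpair;

/* Writes a SOIF-style object into buf. Returns the number of characters
 * written (excluding the terminating NUL), or -1 if buf is too small. */
int demo_format_soif(char *buf, size_t size, const char *schema,
                     const char *url, const demo_avpair *pairs, int n)
{
    int used = snprintf(buf, size, "%s{ %s\n", schema, url);
    if (used < 0 || (size_t)used >= size) return -1;

    for (int i = 0; i < n; i++) {
        size_t left = size - (size_t)used;
        /* The {N} field is the byte length of the value. */
        int w = snprintf(buf + used, left, "    %s{%zu}: %s\n",
                         pairs[i].attribute,
                         strlen(pairs[i].value),
                         pairs[i].value);
        if (w < 0 || (size_t)w >= left) return -1;
        used += w;
    }

    size_t left = size - (size_t)used;
    int w = snprintf(buf + used, left, "}\n");
    if (w < 0 || (size_t)w >= left) return -1;
    return used + w;
}
```

Called with the schema name @DOCUMENT and the title and author pairs from Code Example 8-1, the function emits title{18} and author{11}, matching the lengths shown there.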

A CSResource structure has a url field, which contains the URL for the SOIF. It also has an rd field, whose value is the SOIF for the resource. Once you get the SOIF for the resource, you can use the functions for working with SOIF that are defined in the portal-server-install-root/SUNWps/sdk/rdm/include/soif.h file to get more information about the resource. (The file robotapi.h includes soif.h.)

For example, the macro SOIF_Findval(soif, attribute) gets the value of the given attribute in the given SOIF. Code Example 8-2 uses this macro to print the value of the META attribute if it exists for the resource being processed:

Code Example 8-2  SOIF_Findval Macro Example  

int my_new_raf(libcs_pblock *pb, CSFilter *csf, CSResource *csr)
{
    char *metavalue;

    /* Look up the "meta" attribute in the SOIF for this resource */
    if ((metavalue = (char *)SOIF_Findval(csr->rd, "meta")) != NULL)
        printf("The value of the META tag in the resource is %s\n", metavalue);

    /* rest of function ... */
}

It is recommended that you review the CSResource structure in the file robotapi.h for more information on other fields and macros. For more information about the routines to use with SOIF objects, see Chapter 9, "Using the RDM API to Access the Search Engine and Database in C."

Returning a Response Status Code

When your robot application function has finished processing, it must return a code that tells the server how to proceed with the request.

These codes are defined in the header file portal-server-install-root/SUNWps/sdk/robot/include/robotapi.h. The following table lists the response status codes and describes each one:

REQ_PROCEED

The function performed its task, so proceed with the request.

REQ_ABORTED

The entire request should be aborted because an error occurred.

REQ_NOACTION

The function performed no task, but proceed anyway.

REQ_EXIT

End the session and exit.

REQ_RESTART

Restart the entire request-response process.
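One way to picture how a caller could act on these codes is a loop that runs a chain of functions and stops at the first code that does not mean "keep going". The sketch below is an assumption about that control flow, not a description of the robot's internals: the DEMO_ enumerators are hypothetical placeholders (the real values live in robotapi.h), and the three sample functions exist only to exercise the loop.

```c
/* Hypothetical codes; the real definitions are in robotapi.h. */
typedef enum {
    DEMO_REQ_PROCEED,
    DEMO_REQ_ABORTED,
    DEMO_REQ_NOACTION,
    DEMO_REQ_EXIT,
    DEMO_REQ_RESTART
} demo_status;

typedef demo_status (*demo_fn)(const char *url);

/* Sample functions used to exercise the chain. */
demo_status demo_always_proceed(const char *url)
{
    (void)url;
    return DEMO_REQ_PROCEED;
}

demo_status demo_no_action(const char *url)
{
    (void)url;
    return DEMO_REQ_NOACTION;
}

/* Treats an empty URL as an error that aborts the request. */
demo_status demo_fail_on_empty(const char *url)
{
    return url[0] == '\0' ? DEMO_REQ_ABORTED : DEMO_REQ_PROCEED;
}

/* Runs the functions in order. PROCEED and NOACTION both continue
 * with the next function; any other code ends the chain and is
 * handed back to the caller. */
demo_status demo_run_chain(const demo_fn *fns, int n, const char *url)
{
    for (int i = 0; i < n; i++) {
        demo_status rc = fns[i](url);
        if (rc != DEMO_REQ_PROCEED && rc != DEMO_REQ_NOACTION)
            return rc;
    }
    return DEMO_REQ_PROCEED;
}
```

In this model the difference between REQ_PROCEED and REQ_NOACTION is informational only: both let processing continue, as the descriptions in the table suggest.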

Reporting Errors to the Robot Log File

When problems occur, robot application functions should return an appropriate response status code (such as REQ_ABORTED), and they should also log an error in the error log file.

To use the error-logging functionality, you must include the file log.h, which resides in the portal-server-install-root/SUNWps/sdk/robot/include/libcs directory.

After you have ensured that log.h exists in the correct place, you can use the cslog_error macro to report errors. The prototype is in the following format:

cslog_error(int n, int loglevel, char* errorMessage)

The first parameter is not currently used (it may be used in the future). You can pass any integer.

The second parameter is the log level. When the log level is less than or equal to the log level setting in the file process.conf, the error message is written in the robot.log.

The third parameter is the error message to print, and it has the same form as the argument to the standard printf() function.

For example:

cslog_error(1, 1, ("fn=extract-html-text: Out of memory!\n"));

This invocation of cslog_error would generate the following error message in the robot log file:

[22/Jan/1998:15:57:31] 8270@0: ERROR: fn=extract-html-text: Out of memory!

For another example:

cslog_error(1, 1,
    ("<URL:%s>: Error %d (%d): %s\n",
    ep->eo->key,
    urls->server_status,
    status,
    (s = cslog_linestr(urls->error_msg))));

This invocation of cslog_error would generate the following error message in the robot log file:

[22/Mar/2002:15:57:31] 8270@0: ERROR: <URL:http://budgie.siroe.com:80/>: Error 0 (-240): Can’t connect to server
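In both examples the message and its arguments are wrapped in an extra pair of parentheses. A common C idiom for printf-style logging macros works this way, because it lets a macro with a fixed number of parameters forward a variable-length argument list. The sketch below shows that idiom together with a log-level threshold. It is an assumption about how such a macro can be built, not the definition found in log.h; demo_log_error, demo_log_printf, demo_configured_level, and the output format are all hypothetical.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical stand-in for the configured log level. */
int demo_configured_level = 1;

/* Counts messages actually written, so the effect can be observed. */
int demo_messages_written = 0;

/* printf-style writer that the macro forwards to. */
void demo_log_printf(const char *fmt, ...)
{
    va_list ap;

    va_start(ap, fmt);
    fputs("ERROR: ", stderr);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    demo_messages_written++;
}

/* args is a complete parenthesized argument list, so
 *     demo_log_error(1, 1, ("x=%d\n", x));
 * expands to
 *     demo_log_printf ("x=%d\n", x);
 * The message is written only when level does not exceed the
 * configured level. The first parameter is ignored. */
#define demo_log_error(n, level, args)              \
    do {                                            \
        (void)(n);                                  \
        if ((level) <= demo_configured_level)       \
            demo_log_printf args;                   \
    } while (0)
```

With demo_configured_level set to 1, a call at level 1 writes its message and a call at level 2 is silently dropped.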

RAF Definition Example

This section shows an example definition for a robot application function.

This function copies specified source data to a multi-valued field in an RD. For example, the search engine stores category or classification information in the classification field of an RD. The copy_mv function allows the robot to get the value of an HTML <META> tag of any name and store the value in the classification field in the database. For example, using this function, you could instruct the robot to get the content of the <META NAME="topic"> tag and store it as the classification of the resource.

You would invoke this function with a directive such as the following:

Generate fn=copy_mv src=topic dst=classification

Code Example 8-3 shows a sample function definition:

Code Example 8-3  Robot Application Function Example  

/******** example robot application function ********/
#include <stdio.h>      /* sprintf */
#include <stdlib.h>     /* malloc, free */
#include <string.h>     /* strlen */
#include "robotapi.h"
#include "pblock.h"
#include "log.h"
#include "objlog.h"

NSAPI_PUBLIC int copy_mv(libcs_pblock *pb, CSFilter *csf,
                         CSResource *csr)
{
    char *s, *mv, *mvp;

    /* Use the libcs_pblock_findval function to get the values of the
     * "src" and "dst" parameters, which were specified by the
     * directive that invoked this function */
    char *src = libcs_pblock_findval("src", pb);
    char *dst = libcs_pblock_findval("dst", pb);

    if (!src || !dst) {
        cslog_error(1, 1,
            ("<URL:%s>: Error: No source or destination available.\n",
            csr->url));
        return REQ_PROCEED;
    }

    /* If the current document does not have a META tag whose name
     * matches the src parameter, just return, otherwise put the
     * src value in the string s */

    /* The function SOIF_Findval(soif, attribute) is defined
     * in sdk/rdm/include/soif.h. It gets the value of the
     * given attribute from the given resource.
     * The rd in the CSResource is a soif that describes the resource
     */
    if (!(s = (char *)SOIF_Findval(csr->rd, src)))
        return REQ_PROCEED;

    /* Now insert the string s into the
     * Classification field of the RD */

    /* Deal with possibility that the classification field
     * already has one or more values */
    if ((mv = libcs_pblock_findval(dst, csr->sources)) != NULL) {
        /* Allocate room for both values, the ';' separator,
         * and the terminating NUL before writing into mvp */
        mvp = malloc(strlen(mv) + strlen(s) + 2);
        if (mvp == NULL) {
            cslog_error(1, 1, ("fn=copy_mv: Out of memory!\n"));
            return REQ_ABORTED;
        }

        /* append the new value to the existing values in the
         * classification field, separated by ';' */
        sprintf(mvp, "%s;%s", mv, s);
        libcs_pblock_nvinsert(dst, mvp, csr->sources);

        /* do some clean up */
        free(mvp);
    }
    /* if no values already exist, do a simple value insert */
    else {
        libcs_pblock_nvinsert(dst, s, csr->sources);
    }

    /* We're all done. Return a status code */
    return REQ_PROCEED;
}
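The step that joins an existing multi-valued field with a new value can be isolated into a small helper that does not depend on the robot headers. The function demo_append_value below is hypothetical and is offered only as a sketch of the buffer arithmetic: the result needs room for both strings, one ';' separator, and the terminating NUL, and the buffer must be allocated before anything is written into it.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated "existing;added" string, or a copy of
 * added when existing is NULL or empty. The caller frees the result.
 * Returns NULL if memory cannot be allocated. */
char *demo_append_value(const char *existing, const char *added)
{
    size_t elen = existing ? strlen(existing) : 0;
    size_t alen = strlen(added);
    /* +1 for the ';' separator, +1 for the terminating NUL */
    char *out = malloc(elen + alen + 2);

    if (out == NULL)
        return NULL;

    if (elen > 0) {
        memcpy(out, existing, elen);
        out[elen] = ';';
        memcpy(out + elen + 1, added, alen + 1);  /* copies the NUL too */
    } else {
        memcpy(out, added, alen + 1);
    }
    return out;
}
```

For instance, appending "sports" to an existing value of "news" yields "news;sports", and appending to a missing value yields just "sports".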

Compiling and Linking your Code

You can compile your code with any ANSI C compiler. See the makefile in the portal-server-install-root/SUNWps/sdk/robot/example directory for an example. The makefile assumes the use of gmake.

This section lists the linking options you need to use to create a shared object that the robot can be instructed to load by commands in the filter.conf configuration file. Note that you can link object files into a shared object. In Table 8-2, the compiled object files t.o and u.o are linked to form a shared object called test.so.

Table 8-2  Options for Linking

System     Link options
Solaris    ld -G t.o u.o -o test.so

Loading Your Shared Object

The robot uses the filters defined in filter.conf to filter resources that it encounters. If the file filter.conf uses your customized robot application functions, it must load the shared object that contains the functions.

To load the shared object, add a line to filter.conf:

Init fn=load-modules shlib=[path]filename.so funcs="function1,function2,...,functionN"

This initialization function opens the given shared object file and loads the functions function1, function2, and so on. You can then use the functions function1 and function2 in the robot configuration file (filter.conf). Remember to use the functions only with the directives you wrote them for, as described in the following section.

Using your New Robot Application Functions

When you have compiled and arranged for the loading of your functions, you need to provide for their execution. All functions are called as follows:

Directive fn=function [name1=value1] ... [nameN=valueN]

The directive name and the fn parameter are mandatory. In addition, there may be an arbitrary number of function-specific parameters, each of which is a name-value pair.

You will need to specify your function in the directive for which it was written. For example, the following line uses a plug-in function called wordcount that can be used in the Data stage. This function counts the words in a resource and assigns the count to a destination specified by a parameter called dst.

Data fn=wordcount dst=word-count
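The counting logic at the heart of such a function might look like the following sketch. The wordcount plug-in itself is not shown in this chapter, so demo_count_words is a hypothetical helper that assumes words are runs of non-whitespace characters; a complete RAF would also need the standard prototype, a lookup of the dst parameter, and a status code return.

```c
#include <ctype.h>

/* Counts whitespace-separated words in a NUL-terminated string. */
int demo_count_words(const char *text)
{
    int count = 0;
    int in_word = 0;

    for (; *text; text++) {
        if (isspace((unsigned char)*text)) {
            in_word = 0;          /* a word, if any, has just ended */
        } else if (!in_word) {
            in_word = 1;          /* first character of a new word */
            count++;
        }
    }
    return count;
}
```

Because the function needs the content of the resource, it fits the Data stage (or a later one), where the robot has already downloaded that content.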





Copyright 2003 Sun Microsystems, Inc. All rights reserved.