SPIDER

A SPIDER element adds the capability to crawl document hierarchies on a file system or over HTTP or HTTPS.

Starting from a root URL, the spider can spool the URLs of documents to process. The record property that holds these URLs is named by the PROP_NAME attribute of the ENQUEUE_URL element. If the SPIDER accesses a host that requires basic authentication or secure authentication, you must create and configure a Key_ring.xml file with the authentication information. See the "Implementing the Endeca Crawler" appendix of the Endeca Forge Guide for details about using a spider.

Note: The Spider component is deprecated, and support for the Endeca Crawler will be removed in a future release. Use the Endeca Web Crawler for Web crawls and the Endeca CAS Server for file system crawls instead.

DTD

<!ELEMENT SPIDER
    ( COMMENT?
    , RECORD_SOURCE
    , SPIDER_INIT?
    , SPIDER_RECORD_PROC
    )
>
<!ATTLIST SPIDER
    NAME    CDATA    #REQUIRED
>

Attributes

The following section describes the SPIDER element's attributes.

NAME

A unique name for the spider.

Sub-elements

The following table provides a brief overview of the SPIDER sub-elements.

Sub-element         Brief description
COMMENT             Associates a comment with a parent element and preserves the comment when the file is rewritten. This element provides an alternative to using inline XML comments of the form <!-- ... -->.
RECORD_SOURCE       Specifies the name of a pipeline component from which this component should read records.
SPIDER_INIT         Sets initialization values for a SPIDER.
SPIDER_RECORD_PROC  Provides record processing information for the SPIDER in two sub-elements.
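
As a rough illustration of the content model in the DTD, the following sketch lists the sub-elements in the order the DTD requires, with the optional COMMENT included and the optional SPIDER_INIT omitted. The spider name MinimalSpider, the record source name LoadData, and the comment text are hypothetical. The ENQUEUE_URL usage inside SPIDER_RECORD_PROC is borrowed from the example in the Example section and has not been checked against the SPIDER_RECORD_PROC DTD.

```xml
<!-- Hypothetical sketch: names and comment text are illustrative only. -->
<SPIDER NAME="MinimalSpider">
   <!-- COMMENT is optional, but if present it must come first. -->
   <COMMENT>Crawls documents referenced by upstream records.</COMMENT>
   <!-- RECORD_SOURCE is required; LoadData is an assumed component name. -->
   <RECORD_SOURCE>LoadData</RECORD_SOURCE>
   <!-- SPIDER_INIT is optional and is omitted here. -->
   <!-- SPIDER_RECORD_PROC is required and must come last. -->
   <SPIDER_RECORD_PROC>
     <ENQUEUE_URL PROP_NAME="Endeca.Relation.References">
     </ENQUEUE_URL>
   </SPIDER_RECORD_PROC>
</SPIDER>
```

This sketch only shows which sub-elements are required and in what order. Whether a spider without a SPIDER_INIT element (and so without any ROOT_URL) is useful depends on the pipeline that feeds it.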

Example

This example shows a spider that crawls Acme Co.'s intranet and extranet.

<SPIDER NAME="CrawlAcmeCo">
   <RECORD_SOURCE>Parser</RECORD_SOURCE>
   <SPIDER_INIT>
     <IGNORE_ROBOTS.TXT/>
     <ROOT_URL>http://www.acme.com/</ROOT_URL>
     <ROOT_URL>http://intranet.acme.com/</ROOT_URL>
   </SPIDER_INIT>
   <SPIDER_RECORD_PROC>
     <!-- Get URLs from the Endeca.Relation.References property. -->
     <ENQUEUE_URL PROP_NAME="Endeca.Relation.References">
     </ENQUEUE_URL>
     <!-- Limit the crawl to the Acme Co domain: -->
     <URL_FILTER TYPE="HOST" ACTION="INCLUDE">
       *.acme.com
     </URL_FILTER>
   </SPIDER_RECORD_PROC>
</SPIDER>