A SPIDER element adds the capability to crawl document hierarchies on a file system or over HTTP or HTTPS.
Starting from a root URL, the spider spools the URLs of documents to process. The URLs are stored in a record property that the ENQUEUE_URL element names through its PROP_NAME attribute. If the SPIDER accesses a host that requires basic authentication or secure authentication, you must create and configure a Key_ring.xml file with the authentication information. See the "Implementing the Endeca Crawler" appendix of the Endeca Forge Guide for details about using a spider.
<!ELEMENT SPIDER
( COMMENT?
, RECORD_SOURCE
, SPIDER_INIT?
, SPIDER_RECORD_PROC
)
>
<!ATTLIST SPIDER
NAME CDATA #REQUIRED
>
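Read as a content model, the DTD above requires RECORD_SOURCE and SPIDER_RECORD_PROC, in that order, with an optional COMMENT first and an optional SPIDER_INIT between them. The following is a minimal sketch of a SPIDER that fits that model, not a tested pipeline. The spider name, the record source name, and the root URL are placeholders, and the children of SPIDER_INIT and SPIDER_RECORD_PROC are borrowed from the example later in this section because their own DTDs are not reproduced here.

```xml
<!-- Illustrative skeleton; all names and URLs are hypothetical. -->
<SPIDER NAME="ExampleSpider">                <!-- NAME is #REQUIRED -->
  <RECORD_SOURCE>Parser</RECORD_SOURCE>      <!-- required, exactly one -->
  <SPIDER_INIT>                              <!-- optional (SPIDER_INIT?) -->
    <ROOT_URL>http://www.example.com/</ROOT_URL>
  </SPIDER_INIT>
  <SPIDER_RECORD_PROC>                       <!-- required, must come last -->
    <ENQUEUE_URL PROP_NAME="Endeca.Relation.References"/>
  </SPIDER_RECORD_PROC>
</SPIDER>
```

Because the content model is a sequence, placing SPIDER_INIT before RECORD_SOURCE, or omitting SPIDER_RECORD_PROC, would fail validation against this DTD.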
The following section describes the SPIDER element's attributes.
NAME
A unique name for the spider.
The following table provides a brief overview of the SPIDER sub-elements.
| Sub-element | Brief description |
|---|---|
| COMMENT | Associates a comment with a parent element and preserves the comment when the file is rewritten. This element provides an alternative to using inline XML comments of the form <!-- ... -->. |
| RECORD_SOURCE | Specifies the name of a pipeline component from which this component should read records. |
| SPIDER_INIT | Sets initialization values for a SPIDER. |
| SPIDER_RECORD_PROC | Provides record processing information for the SPIDER in two sub-elements. |
This example shows a spider that crawls Acme Co.'s intranet and extranet.
<SPIDER NAME="CrawlAcmeCo">
  <RECORD_SOURCE>Parser</RECORD_SOURCE>
  <SPIDER_INIT>
    <IGNORE_ROBOTS.TXT/>
    <ROOT_URL>http://www.acme.com/</ROOT_URL>
    <ROOT_URL>http://intranet.acme.com/</ROOT_URL>
  </SPIDER_INIT>
  <SPIDER_RECORD_PROC>
    <!-- Get URLs from the Endeca.Relation.References property. -->
    <ENQUEUE_URL PROP_NAME="Endeca.Relation.References">
    </ENQUEUE_URL>
    <!-- Limit the crawl to the Acme Co domain: -->
    <URL_FILTER TYPE="HOST" ACTION="INCLUDE">
      *.acme.com
    </URL_FILTER>
  </SPIDER_RECORD_PROC>
</SPIDER>
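The example above documents itself with inline XML comments, which, according to the sub-element table, may not survive when the file is rewritten. As a hedged variant, the same spider could carry its description in a COMMENT element instead. This sketch assumes that COMMENT accepts plain text content, which the SPIDER DTD alone does not confirm, and the comment wording is invented for illustration.

```xml
<!-- Variant of the example above; the COMMENT text is hypothetical. -->
<SPIDER NAME="CrawlAcmeCo">
  <!-- COMMENT must be the first child, per the SPIDER content model. -->
  <COMMENT>Crawls the Acme Co. sites and stays within *.acme.com.</COMMENT>
  <RECORD_SOURCE>Parser</RECORD_SOURCE>
  <SPIDER_INIT>
    <IGNORE_ROBOTS.TXT/>
    <ROOT_URL>http://www.acme.com/</ROOT_URL>
    <ROOT_URL>http://intranet.acme.com/</ROOT_URL>
  </SPIDER_INIT>
  <SPIDER_RECORD_PROC>
    <ENQUEUE_URL PROP_NAME="Endeca.Relation.References"/>
    <URL_FILTER TYPE="HOST" ACTION="INCLUDE">*.acme.com</URL_FILTER>
  </SPIDER_RECORD_PROC>
</SPIDER>
```

Only one COMMENT is allowed per SPIDER under this DTD, so notes about individual sub-elements would need to go in whatever COMMENT slots those sub-elements define for themselves.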