Creating a spider

Follow the steps below to set up a spider in your Endeca Crawler pipeline.

To create a spider:

  1. In the Project tab of Developer Studio, double-click Pipeline Diagram.
  2. In the Pipeline Diagram editor, choose New > Spider. The New Spider editor displays.
  3. In the Name box, type a unique name for the spider. This should be the same name you specified as the value of URL_SOURCE when you created the record adapter.
  4. To limit the number of hops from the root URL (specified on the Root URLs tab), enter a value in the Maximum hops field. The Maximum hops value specifies the number of links that may be traversed beginning with the root URL before the spider reaches the document at a target URL. For example, if http://www.endeca.com is a root URL and it links to a document at http://www.endeca.com/news.html, then http://www.endeca.com/news.html is one hop away from the root.
  5. To limit the depth of the crawl from the root URL, enter a value in the Maximum depth field. Maximum depth is based on the number of separators in the path portion of the URL. For example, http://endeca.com has a depth of zero (no separators), whereas http://endeca.com/products/index.shtml has a depth of one: the /products/ portion of the URL constitutes one separator. (A sketch illustrating how hops and depth are counted appears after this procedure.)
  6. To specify the User-Agent HTTP header that the spider should present to Web servers, enter the desired value in the Agent name field.

    The Agent name is the name by which the spider is identified in the User-agent field of a Web server’s robots.txt file. If you provide a name, the spider adheres to the robots.txt standard. If you do not provide a name, the spider obeys only those rules in a robots.txt file whose User-agent field is “*”. (A sketch of this User-agent matching appears after this procedure.)

    Note: A robots.txt file allows Web-server administrators to identify robots, such as spiders, and to control which URLs a robot may or may not crawl on a Web server. The file specifies a robot’s User-agent name and the rules associated with that name. The crawling rules configured in robots.txt are commonly known as the robots.txt standard or, more formally, as the Robots Exclusion Standard. For more information on this standard, see http://www.robotstxt.org/wc/robots.html.
  7. To instruct the spider to ignore the robots.txt file on a Web server, check Ignore robots. When it ignores the file, the spider does not obey the robots.txt standard and proceeds with the crawl using only the parameters you configure.
  8. If you want the spider to reject cookies, check Disable Cookies. If you leave this unchecked, cookie information is added to the records during the crawl, and the spider also stores cookies and sends them to the server as it crawls. (When RETRIEVE_URL receives a Set-Cookie header as part of an HTTP response, RETRIEVE_URL can pass this value back to the server, when appropriate, to simulate a session. A sketch of this session behavior appears after this procedure.)
  9. For the full crawl described in this section, do not provide any value in the Differential Crawl URL box.
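
The following sketch, written in Python rather than in the crawler’s own configuration, illustrates how the two crawl limits differ. The url_depth function applies the separator-counting rule from step 5 (treating a final segment that contains a dot as a document name rather than a directory is an assumption made here for illustration), and the crawl generator shows a breadth-first walk that stops following links once the hop count from the root reaches a limit, as in step 4. The fetch_links callable is hypothetical and stands in for whatever the spider uses to extract links from a page.

```python
from collections import deque
from urllib.parse import urlparse


def url_depth(url):
    """Count directory separators in the path portion of a URL.

    For example, http://endeca.com has a depth of 0 (no separators),
    while http://endeca.com/products/index.shtml has a depth of 1,
    because /products/ is its only directory level.
    """
    segments = [s for s in urlparse(url).path.split("/") if s]
    # Assumption for illustration: a final segment containing "." is a
    # document name (e.g. index.shtml), not a directory, so it does not
    # add to the depth.
    if segments and "." in segments[-1]:
        segments = segments[:-1]
    return len(segments)


def crawl(root_url, fetch_links, max_hops):
    """Breadth-first walk that stops following links beyond max_hops.

    fetch_links(url) is a placeholder returning the URLs linked from a page.
    """
    seen = {root_url}
    queue = deque([(root_url, 0)])  # (url, hops from the root URL)
    while queue:
        url, hops = queue.popleft()
        yield url
        if hops >= max_hops:
            continue  # at the hop limit: keep the page, but follow no more links
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, hops + 1))
```

With http://www.endeca.com as the root, for example, the document at http://www.endeca.com/news.html is reached after one hop but has a depth of zero, since its path contains no directory separators.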
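
The interaction between the Agent name field and robots.txt described in steps 6 and 7 can be illustrated with Python’s standard robots.txt parser. This is a sketch of the Robots Exclusion Standard behavior, not the Endeca crawler itself; the agent name and URLs below are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the server's robots.txt (this is a network call).
parser = RobotFileParser("http://www.endeca.com/robots.txt")
parser.read()

agent = "my_spider"  # hypothetical value of the Agent name field
url = "http://www.endeca.com/news.html"

# With a named agent, rules under "User-agent: my_spider" apply; if
# robots.txt has no such group, the parser falls back to "User-agent: *".
print(parser.can_fetch(agent, url))

# With no agent name configured, only the generic "*" rules apply.
print(parser.can_fetch("*", url))
```

Checking Ignore robots in step 7 corresponds to skipping this check altogether and crawling every URL the other settings allow.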
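
The session behavior described in step 8 can be sketched with the Python standard-library cookie jar. This illustrates how storing and resending Set-Cookie values simulates a session; it is not the RETRIEVE_URL implementation, and the URLs are placeholders.

```python
import urllib.request
from http.cookiejar import CookieJar

# A cookie jar captures Set-Cookie headers from responses and replays
# the stored cookies on later requests to the same server.
jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# First request: any Set-Cookie header in the response is stored in the jar.
opener.open("http://www.endeca.com/")

# Later requests through the same opener send the stored cookies back,
# simulating one continuous session.  Checking Disable Cookies is the
# equivalent of crawling without the jar.
opener.open("http://www.endeca.com/news.html")

for cookie in jar:
    print(cookie.name, cookie.value)
```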