Building HTTP Spider Verity Indexes
HTTP spider indexes are similar to the indexes that the spider functionality compiles for the file system index. When using the spider index on a website, vspider starts at the home page of the site and then follows each link on that page to the next level of the site. For each page at the next level, vspider follows each link on each page. After following a link, vspider indexes all of the data on the target page.
You can specify as many websites as you want, and you can configure the depth, or number of layers of links, that vspider follows into a website and indexes.
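The traversal described above can be pictured as a depth-limited, breadth-first walk over a site's links. The following Python sketch is only an illustration and is not how vspider is implemented: the `crawl` function, the in-memory `SITE` link map, and the page names are hypothetical, and the sketch assumes that the home page itself is included in the result.

```python
from collections import deque


def crawl(links, start, max_depth):
    """Return pages in the order they are reached, following links
    at most max_depth levels away from the start page.

    links maps a page name to the list of pages it links to
    (a hypothetical stand-in for fetching and parsing real pages).
    """
    indexed = [start]          # assumption: the home page is indexed too
    seen = {start}             # avoid visiting the same page twice
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue           # depth limit reached: do not follow further links
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                indexed.append(target)
                queue.append((target, depth + 1))
    return indexed


# Hypothetical site: the home page links to a and b, and so on.
SITE = {
    "home": ["a", "b"],
    "a": ["c"],
    "b": ["c", "home"],
    "c": ["d"],
}

print(crawl(SITE, "home", 1))  # ['home', 'a', 'b']
print(crawl(SITE, "home", 2))  # ['home', 'a', 'b', 'c']
```

Each additional level adds every page linked from the previous level, which is why the amount of work can grow quickly as the depth increases.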
This section discusses how to:
Define HTTP gateway settings.
Define what to index.
Defining HTTP Gateway Settings
Access the HTTP Gateway page.
Image: HTTP Gateway page
This example illustrates the fields and controls on the HTTP Gateway page. You can find definitions for the fields and controls later on this page.
- Depth of Links to Follow
Set the level of detail that you want to index within a certain site. If you enter 1, vspider starts at the home page, follows each link on that page, indexes all of the data on the target pages, and then stops. If you enter 2, vspider also follows the links on those target pages and indexes one more level into the website.
As you increase the number, the number of links that vspider follows increases geometrically. Do not set this value too high, because it can impact performance negatively. You should not need to set this value higher than 10.
- List http://URLs to spider
Click the plus button to add multiple URLs to spider. Click the minus button to remove a URL from the list. If you forget to include the http:// (scheme) portion of the URL, the system automatically includes it.
URLs should contain only the alphanumeric characters as specified in RFC 1738. Any special character must be encoded. For example, encode a space character as %20, and encode a < as %3c. Additional examples are available.
- Stay in Domain
Select to limit spidering to a single domain. For example, suppose that you are spidering www.peoplesoft.com and you select this option. If a link points to a site outside the PeopleSoft domain (as in yahoo.com), the collection ignores the link.
- Stay in Host
Select to further limit spidering within a single server. If you select this option, the collection contains references to content only on the current web server or host. Links to content on other web servers within the domain are ignored. For example, if you are spidering www.peoplesoft.com and you select this option, you can index documents on www.peoplesoft.com, but not on www1.peoplesoft.com.
- Proxy Hostname and Proxy Port
Enter a host and port for vspider to use. Enter the same settings that you would use in your web browser if you need a proxy to access the internet.
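The URL rules and the Stay in Domain and Stay in Host options above can be illustrated with a short Python sketch. It is a rough approximation under stated assumptions, not the product's actual logic: `normalize_url`, `domain_of`, and `in_scope` are hypothetical helper names, www.example.com is a placeholder host, and treating the last two labels of a host name as its domain is a simplification.

```python
from urllib.parse import quote, urlsplit


def normalize_url(url):
    """Add a missing http:// scheme and percent-encode special characters."""
    if "://" not in url:
        url = "http://" + url
    # Leave URL delimiters and "%" (so existing escapes survive) untouched;
    # spaces, "<", ">" and similar characters are encoded.
    return quote(url, safe=":/?&=#%")


def domain_of(host):
    """Assumption: the domain is the last two labels of the host name."""
    return ".".join(host.split(".")[-2:])


def in_scope(link, start, stay_in_domain=False, stay_in_host=False):
    """Decide whether a link would be followed under the selected options."""
    link_host = urlsplit(link).hostname or ""
    start_host = urlsplit(start).hostname or ""
    if stay_in_host:
        return link_host == start_host
    if stay_in_domain:
        return domain_of(link_host) == domain_of(start_host)
    return True


start = normalize_url("www.peoplesoft.com")
print(start)  # http://www.peoplesoft.com
print(normalize_url("www.example.com/my page<1>.html"))
# http://www.example.com/my%20page%3C1%3E.html

print(in_scope("http://www1.peoplesoft.com/docs", start, stay_in_domain=True))  # True
print(in_scope("http://www1.peoplesoft.com/docs", start, stay_in_host=True))    # False
print(in_scope("http://yahoo.com/", start, stay_in_domain=True))                # False
```

Python writes the hexadecimal digits of an escape in uppercase (%3C rather than %3c); the two forms are equivalent in a URL.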
Defining What to Index
The fields on this page are documented in a previous section.