Web Access Page

Use this page to define the Ultra Search Crawler search space. The search space includes all web pages crawled by the Ultra Search Crawler. These web pages are periodically checked for validity according to a synchronization schedule.

You specify the crawler search space with the following settings:

  • Seed URLs to use as starting points for the crawling process.
  • Host Domains to include or exclude during the crawling process.
  • Document types for the crawler to fetch.

Additionally, you can specify a proxy server if the search space includes web pages that reside outside your organization's firewall.

Note: The Ultra Search Crawler observes the Robots Exclusion Protocol. The Robots Exclusion Protocol allows Web site administrators to indicate which parts of their site should not be visited by robots, such as the Ultra Search Crawler. The crawler also understands and respects robot META tags.
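As an illustration of how a robots.txt rule is evaluated, the following minimal sketch uses Python's standard urllib.robotparser module. It is not the Ultra Search implementation; the robots.txt content, the user agent name ExampleCrawler, and the example.com URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would fetch this file
# from the root of the web site before requesting any other page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleCrawler" is a made-up user agent name used only for this sketch.
print(is_allowed(ROBOTS_TXT, "ExampleCrawler", "http://www.example.com/index.html"))
print(is_allowed(ROBOTS_TXT, "ExampleCrawler", "http://www.example.com/private/a.html"))
```

With these rules, the first URL is permitted and the second is not, because its path falls under the disallowed /private/ prefix.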

Seed URLs

The Ultra Search Crawler uses seed URLs as starting points for crawling the web. You can add seed URLs to, or remove them from, the seed URL list. Ultra Search currently supports only seed URLs that use the HTTP protocol.

Domain control rules do not apply to seed URLs. For example, if a seed URL is www.yahoo.com and the only inclusion domain is oracle.com, www.yahoo.com is still crawled. Such a configuration is rarely useful, however, because every link found on that page that does not point to a host in oracle.com is discarded.

Note: If the seed URL list is empty, the crawler does not crawl anything.

To add one or more seed URLs, do the following:

  1. Enter a list of URLs separated by valid delimiters. A valid delimiter is a space, tab, newline, or comma. Each URL must begin with "http://".
  2. Click the "Add" button.
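The input rules in step 1 can be sketched as follows. This is only an illustration of the stated delimiter and prefix rules, not the page's actual validation code; the function name parse_seed_urls and the choice to reject the whole input on one bad entry are assumptions.

```python
import re


def parse_seed_urls(text: str) -> list[str]:
    """Split text on spaces, tabs, newlines, or commas and validate each URL.

    Raising an error when any entry is invalid is an assumption of this
    sketch; the behavior of the real page on bad input is not documented here.
    """
    # Runs of delimiters count as one separator; empty tokens are dropped.
    tokens = [token for token in re.split(r"[ \t\n,]+", text) if token]
    rejected = [token for token in tokens if not token.startswith("http://")]
    if rejected:
        raise ValueError(f"seed URLs must begin with http://: {rejected}")
    return tokens


# Hypothetical input mixing all four delimiters.
print(parse_seed_urls("http://www.oracle.com, http://otn.oracle.com\thttp://www.example.com\n"))
```

An entry such as www.oracle.com, without the "http://" prefix, would be rejected by this sketch.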

Proxies

Optionally, specify a proxy server for the HTTP protocol, which is currently the only protocol supported.

Domain Control

By default, all hosts are included in the Ultra Search crawling space. You can limit this space by defining inclusion or exclusion domains.

Inclusion domains

Inclusion domains limit the Ultra Search Crawler crawling space. During the crawling process, the Ultra Search Crawler crawls only hosts that belong to an inclusion domain you specify. For example, an inclusion domain of oracle.com limits the Ultra Search Crawler to hosts belonging to Oracle Corporation worldwide.

Exclusion domains

Exclusion domains allow you to further refine the crawling space. The crawling space is equal to the inclusion domain space minus the exclusion domain space. For example, an exclusion domain of uk.oracle.com prevents the crawler from crawling Oracle hosts in the United Kingdom.
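The combined effect of inclusion domains, exclusion domains, and the seed URL exemption can be sketched as follows. This is a simplified model, not the crawler's actual logic: the function names are hypothetical, and matching a host to a domain by suffix is an assumption.

```python
from urllib.parse import urlsplit


def host_in_domain(host: str, domain: str) -> bool:
    """Return True if host equals domain or is a subdomain of it.

    Suffix matching is an assumption of this sketch.
    """
    host, domain = host.lower(), domain.lower()
    return host == domain or host.endswith("." + domain)


def in_crawl_space(url, inclusions, exclusions, seeds=()):
    """Decide whether url falls inside the crawling space."""
    # Domain control rules do not apply to seed URLs.
    if url in seeds:
        return True
    host = urlsplit(url).hostname or ""
    # With no inclusion domains defined, all hosts are included by default.
    included = not inclusions or any(host_in_domain(host, d) for d in inclusions)
    excluded = any(host_in_domain(host, d) for d in exclusions)
    # Crawling space = inclusion domain space minus exclusion domain space.
    return included and not excluded


rules = {"inclusions": ["oracle.com"], "exclusions": ["uk.oracle.com"]}
print(in_crawl_space("http://www.oracle.com/index.html", **rules))
print(in_crawl_space("http://www.uk.oracle.com/index.html", **rules))
print(in_crawl_space("http://www.yahoo.com/", seeds=["http://www.yahoo.com/"], **rules))
```

Under this model, www.oracle.com is crawled, www.uk.oracle.com is not, and the seed URL www.yahoo.com is crawled even though it lies outside the inclusion domain.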

Document Types

Specify the types of documents the Ultra Search Crawler should process. The left pane lists document types that the crawler does not process. The right pane lists document types that it does process.