You specify the crawler search space with the following settings:
- Seed URLs to use as starting points for the crawling process.
- Host domains to include or exclude during the crawling process.
- Document types for the crawler to fetch.
Additionally, you can specify a proxy server if the search space includes
web pages that reside outside your organization's firewall.
Note: The Ultra Search Crawler observes the Robots Exclusion Protocol.
The Robots Exclusion Protocol allows Web site administrators to indicate
which parts of their site should not be visited by robots, such as the
Ultra Search Crawler. The crawler also understands and respects robots
META tags.
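To illustrate how a robots.txt file restricts a crawler, here is a minimal sketch that uses Python's standard urllib.robotparser module. It is not the Ultra Search implementation. The robots.txt content, the user agent name "ExampleCrawler", and the example.com URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would fetch this file
# from the root of each host before requesting other pages on that host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set that can be queried per URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleCrawler" is an assumed user agent name, used only for illustration.
allowed = parser.can_fetch("ExampleCrawler", "http://www.example.com/index.html")
blocked = parser.can_fetch("ExampleCrawler", "http://www.example.com/private/report.html")
```

In this sketch, a crawler that observes the protocol would fetch the first URL and skip the second, because the second falls under the disallowed /private/ path.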
Seed URLs
The Ultra Search Crawler uses seed URLs as starting points for
crawling the Web. You can add seed URLs to, or remove them from, the seed
URL list. Ultra Search currently supports only seed URLs that use the HTTP protocol.
Domain control rules do not apply to the seed URLs. For example, if
a seed URL is defined as www.yahoo.com and the inclusion rule is oracle.com,
www.yahoo.com is still crawled. Such a configuration is rarely useful,
however, because every link on the seed page that does not point to
oracle.com is discarded.
Note: When the seed URL list is empty, the crawler does not crawl
anything.
To add one or more seed URLs, do the following:
- Enter a list of URLs separated by valid delimiters. A valid delimiter
is a space, tab, newline, or comma. Each URL must begin with "http://".
- Click the "Add" button.
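The delimiter and prefix rules above can be sketched as a small parsing function. This only illustrates the stated input format and is not how Ultra Search itself validates the field. The function name parse_seed_urls and the choice to raise an error on invalid entries are assumptions.

```python
import re
from typing import List


def parse_seed_urls(raw: str) -> List[str]:
    """Split raw input on spaces, tabs, newlines, or commas and validate it.

    Hypothetical helper: rejects the whole input if any entry does not
    begin with "http://".
    """
    # One or more delimiters in a row count as a single separator.
    # A carriage return is treated as part of a newline.
    candidates = [token for token in re.split(r"[ \t\r\n,]+", raw) if token]

    invalid = [url for url in candidates if not url.startswith("http://")]
    if invalid:
        raise ValueError(f"Seed URLs must begin with 'http://': {invalid}")
    return candidates


seeds = parse_seed_urls("http://www.oracle.com, http://otn.oracle.com\thttp://www.example.com\n")
```

Here the three URLs are separated by a comma and space, a tab, and a trailing newline, and all three are accepted. An entry such as "ftp://ftp.example.com" would be rejected under the rule that each URL must begin with "http://".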
Proxies
Specify a proxy server for the crawler to use. Specifying a proxy server
is optional. Currently, a proxy can be specified only for the HTTP protocol.
Domain Control
By default, all hosts are included in the Ultra Search crawling space.
You can further limit this space by defining inclusion or exclusion
domains.
Inclusion domains
Inclusion domains limit the Ultra Search Crawler crawling space. During
the crawling process, the Ultra Search Crawler crawls only the hosts that
belong to an inclusion domain you specify. For example, an inclusion
domain of oracle.com limits the Ultra Search Crawler to hosts belonging
to Oracle Corporation worldwide.
Exclusion domains
Exclusion domains allow you to further refine the crawling space. The
crawling space is equal to the inclusion domain space minus
the exclusion domain space. For example, an exclusion domain of uk.oracle.com
prevents the crawler from crawling Oracle hosts in the United Kingdom.
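The rule "inclusion space minus exclusion space" can be sketched as a host check. This is an illustration under stated assumptions, not the Ultra Search implementation. The function names are hypothetical, and matching a host against a domain by suffix is an assumption, since the documentation does not spell out the matching rule. Seed URLs are not subject to this check.

```python
from typing import Iterable
from urllib.parse import urlsplit


def in_domain(host: str, domain: str) -> bool:
    """Assumed matching rule: the host equals the domain or is a subdomain of it."""
    host, domain = host.lower(), domain.lower()
    return host == domain or host.endswith("." + domain)


def in_crawl_space(url: str, inclusions: Iterable[str], exclusions: Iterable[str]) -> bool:
    """Return True if the URL's host is in (inclusion space - exclusion space)."""
    host = urlsplit(url).hostname or ""
    inclusions = list(inclusions)
    # With no inclusion domains defined, all hosts are included by default.
    included = not inclusions or any(in_domain(host, d) for d in inclusions)
    excluded = any(in_domain(host, d) for d in exclusions)
    return included and not excluded


include = ["oracle.com"]
exclude = ["uk.oracle.com"]
us_ok = in_crawl_space("http://www.oracle.com/index.html", include, exclude)
uk_ok = in_crawl_space("http://www.uk.oracle.com/index.html", include, exclude)
other_ok = in_crawl_space("http://www.yahoo.com/", include, exclude)
```

With these settings, a link to www.oracle.com is inside the crawling space, while links to www.uk.oracle.com and www.yahoo.com are outside it.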
Document Types
Specify the types of documents the Ultra Search Crawler should process.
The left pane lists document types that the crawler does not process.
The right pane lists document types that it does process.