Tuning the crawling process
Web crawling strategy
The Ultra Search crawler is a powerful tool for
discovering information on web sites in an organization's
intranet. This feature is especially relevant to web
crawling. The other data sources are well defined, so
the crawler does not follow links to other documents that
you may not be aware of.
Your web crawling strategy can be as simple as
identifying a few well-known sites that are likely to
contain links to most of the other intranet sites in your
organization. You could add these sites to the Seed URL
list and launch the Primary Schedule. After the initial
crawl, you will have a good idea of the hosts that exist
in your intranet. You could then
define web sources to facilitate maintenance crawling on
individual hosts.
In practice, however, discovering and crawling your
organization's intranet is an interactive process: you
periodically analyze the crawling results and adjust the
crawling parameters to steer the crawler.
For example, if you observe that the crawler is spending
days crawling one web host, you may want to exclude
that host from crawling or limit the crawling depth.
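As a rough illustration of this kind of analysis, the sketch below tallies crawled URLs by host so that a host dominating the crawl stands out. It assumes you have a plain list of crawled URLs available from somewhere; the function name accesses_per_host and the sample URLs are hypothetical and are not part of Ultra Search.

```python
from collections import Counter
from urllib.parse import urlsplit


def accesses_per_host(urls):
    """Count how many of the given URLs belong to each host name."""
    # urlsplit(...).hostname is lowercased and excludes any port number.
    return Counter(urlsplit(url).hostname for url in urls)


# Hypothetical sample data, for illustration only.
crawled = [
    "http://hr.mycompany.com/index.html",
    "http://hr.mycompany.com/benefits.html",
    "http://hr.mycompany.com:8080/forms.html",
    "http://it.mycompany.com/index.html",
]

counts = accesses_per_host(crawled)
busiest_host, busiest_count = counts.most_common(1)[0]
```

A host whose count is far larger than the others is a candidate for exclusion or for a lower crawling depth.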
Monitoring the crawling process
You can monitor the crawling process by using a
combination of the following methods:
URL Looping
During the process of crawling the web, the Ultra Search crawler
analyzes each newly discovered document to see if it is a duplicate of
a document that has already been crawled and indexed. If it is a
duplicate, the new document is not indexed.
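One common way to implement this kind of check is to fingerprint the content of each document and compare it against the fingerprints already seen. The following is a minimal sketch of that general technique, not a description of how Ultra Search detects duplicates internally; the class DuplicateFilter and its method names are hypothetical.

```python
import hashlib


def content_fingerprint(body: bytes) -> str:
    """Return a hex digest that identifies a document by its content."""
    return hashlib.sha256(body).hexdigest()


class DuplicateFilter:
    """Remember the fingerprint of every document seen so far."""

    def __init__(self):
        self._first_url = {}  # fingerprint -> URL where it was first seen

    def check(self, url: str, body: bytes):
        """Return the URL of an earlier identical document, or None if new."""
        fingerprint = content_fingerprint(body)
        earlier = self._first_url.get(fingerprint)
        if earlier is None:
            self._first_url[fingerprint] = url
        return earlier


dup_filter = DuplicateFilter()
first = dup_filter.check("http://mycompany.com/a.html", b"<html>same</html>")
second = dup_filter.check("http://mycompany.com/b.html", b"<html>same</html>")
```

Note that the body has to be fetched before it can be fingerprinted, so a check of this kind avoids duplicate indexing but not duplicate retrieval.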
URL looping refers to the scenario where, for some reason, a
large number of unique URLs all point to the same document. Although the
document is never indexed more than once, each of these URLs still has
to be retrieved from the web server for analysis.
One particularly difficult situation is where a site contains a
large number of pages and each page contains links to every other
page in the site. Ordinarily, this would not be a
problem, as the crawler will eventually finish
analyzing all documents in the site.
However, some web servers attach parameters to generated
URLs to track information across requests. Such web
servers might generate a large number of unique URLs which all
point to the same document. For example,
http://mycompany.com/somedocument.html?p_origin_page=10 may
refer to the same document as
http://mycompany.com/somedocument.html?p_origin_page=13
but the p_origin_page parameter differs between the two
cases because the referring pages are different. If a
large number of parameters are specified and the
number of referring links is large, a single unique document
may have thousands or tens of thousands of links all
referring to it. This scenario is one example of how URL looping may occur.
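To see whether a list of crawled URLs shows this pattern, you can group the URLs by everything except the query string and look for unusually large groups. The sketch below is a diagnostic aid only. It assumes that the query parameters do not change the content served, which holds for the p_origin_page example above but not for URLs in general; the function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Return the URL without its query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def group_by_document(urls):
    """Map each query-less URL to the list of full URLs that reduce to it."""
    groups = defaultdict(list)
    for url in urls:
        groups[strip_query(url)].append(url)
    return dict(groups)


urls = [
    "http://mycompany.com/somedocument.html?p_origin_page=10",
    "http://mycompany.com/somedocument.html?p_origin_page=13",
]
groups = group_by_document(urls)
# Both URLs fall into the single group "http://mycompany.com/somedocument.html".
```

A group containing hundreds or thousands of URLs suggests that the site is generating tracking parameters of the kind described above.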
You can monitor the crawler statistics in the Ultra Search
Administration Tool to see which URLs and web servers are being
crawled the most. If you observe an inordinately large number of URL accesses to a
particular site or URL, you might want to do one of the
following:
- Exclude the web server.
Excluding the web server prevents the crawler from
crawling any URLs at that host. Note that you cannot
limit the exclusion to a specific port on a host.
- Reduce the crawling depth.
Reducing the crawling depth limits the number of
levels of referred links that the crawler will follow.
If you observe URL looping
effects on a particular host, survey the site
visually to estimate the
depth of its leaf pages. Leaf pages are
pages that do not have any links to other pages.
As a general guideline, add 3 to the leaf page depth
and set the crawling depth to this value.
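The guideline above can be worked through on a small link map. The sketch below assumes you have recorded, by hand or otherwise, which pages each page links to; it finds the deepest leaf page by breadth-first search and adds the margin of 3. Counting the starting page as depth 0 is an assumption made here, and the names max_leaf_depth and suggested_crawl_depth are hypothetical.

```python
from collections import deque


def max_leaf_depth(links, root):
    """Return the depth of the deepest leaf page reachable from root.

    links maps each page to the list of pages it links to. A leaf page
    has no outgoing links. Depth is the smallest number of links that
    must be followed from root, with root itself at depth 0 (assumed).
    """
    depth = {root: 0}
    queue = deque([root])
    deepest = 0
    while queue:
        page = queue.popleft()
        children = links.get(page, [])
        if not children:
            deepest = max(deepest, depth[page])
        for child in children:
            if child not in depth:  # visit each page once, so cycles are safe
                depth[child] = depth[page] + 1
                queue.append(child)
    return deepest


def suggested_crawl_depth(links, root, margin=3):
    """Apply the guideline: deepest leaf page depth plus a margin of 3."""
    return max_leaf_depth(links, root) + margin


# Hypothetical site map, for illustration only.
site = {
    "/": ["/a", "/b"],
    "/a": ["/a1"],
    "/b": [],
    "/a1": [],
}
crawl_depth = suggested_crawl_depth(site, "/")  # deepest leaf is /a1 at depth 2
```

For this sample map the deepest leaf is two links from the starting page, so the suggested crawling depth is 5.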
Be sure to restart the crawler after altering any
parameters in the Crawler Settings Page.
Your changes will take effect only after restarting the crawler.