You can set the crawl-urlfilter.txt files to accept certain hosts.
The crawl-urlfilter.txt files in the configuration directories (default, polite, and non-polite) all have this line commented out:
# accept hosts in MY.DOMAIN.NAME
# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME.com/
To limit the crawl to a specific domain:
Example 4. Example of specifying hosts to accept
Uncomment the line and specify the hosts to accept:
# accept hosts within endeca.com
+^http://([a-z0-9]*\.)*endeca.com/
Then change the last lines of the file:
# include everything
+.
to replace the plus sign with a minus sign:
# exclude everything else
-.
With these two changes, hosts within the endeca.com domain will be accepted by the crawler and everything else will be excluded.
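
To check how the two edited rules treat a given URL before starting a crawl, the filter logic can be approximated outside the crawler. The sketch below is not the crawler's own code. It assumes the rules are evaluated top to bottom, that the first pattern found anywhere in the URL decides the outcome (as in Apache Nutch's regex URL filter), and that a URL matching no rule is skipped. The names RULES and is_accepted are hypothetical.

```python
import re

# The two active rules from the edited crawl-urlfilter.txt, in file order.
# "+" means accept, "-" means exclude.
RULES = [
    ("+", r"^http://([a-z0-9]*\.)*endeca.com/"),  # accept hosts within endeca.com
    ("-", r"."),                                  # exclude everything else
]


def is_accepted(url, rules=RULES):
    """Return True if the first rule whose pattern matches the URL is a '+' rule."""
    for sign, pattern in rules:
        # Assumption: a pattern matches if it is found anywhere in the URL.
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule is not crawled.
    return False


if __name__ == "__main__":
    for candidate in (
        "http://www.endeca.com/index.html",
        "http://endeca.com/",
        "http://www.example.com/",
        "https://www.endeca.com/",
    ):
        print(candidate, is_accepted(candidate))
```

Because the accept pattern begins with ^http://, an https URL on the same host falls through to the final exclude rule under these assumptions.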