The Endeca Web Crawler startup script has several flags to control the behavior of the crawl.
web-crawler Flag | Flag Argument |
---|---|
-c or --conf | The path of the configuration directory. If this flag is not specified, the workspace/conf/web-crawler/default directory is used as the default configuration directory. Note that if the flag points to a directory other than the default configuration directory (such as workspace/conf/web-crawler/polite-crawl), the files in the workspace/conf/web-crawler/default directory are read first. Optional. |
-d or --depth | A non-negative integer that specifies the maximum depth of the crawl. The crawl depth is the number of link levels (starting from the seed URL) that will be crawled (see below for details). Required for all crawl types. |
-f or --force | Deletes the output directory before the crawl runs. This flag takes no argument. Optional, but it cannot be used with resumable crawls. |
-JVM | Passes arguments on the command line to the Java Virtual Machine (JVM). If this flag is used, any arguments before it are passed to the Web Crawler and any arguments afterwards are appended to those passed to the JVM. Note that on Windows machines, the flag parameters should be quoted if they contain equals signs. Optional. |
-l or --limit | An integer that specifies an approximate maximum for the number of requests to make (that is, the maximum number of pages to fetch). This is a soft limit: when it is reached, the Web Crawler adds no more pages to the queue, but it does complete any pages still in the queue. When the limit is reached, the Web Crawler also writes a "URL limit reached, starting shutdown." message to the log file. The default is 0 (zero), which sets no limit. This flag is useful when you are setting up and testing your crawl configuration. Optional. |
-r or --resume | Resumes a full or resumable crawl that has been previously run. Optional. |
-s or --seed | The seed for the crawl. Required for full crawls, but ignored for resumable crawls. The seed can be one URL, a file containing URLs (one per line), or a directory containing *.lst files that contain URLs. For example: -s http://www.oracle.com (one URL); -s C:\Oracle\Endeca\IAS\workspace\conf\web-crawler\default\endeca.lst (an .lst file of URLs); -s C:\Oracle\Endeca\IAS\workspace\conf\web-crawler\default (a directory containing any number of .lst files). A URL must be fully qualified and include the protocol (http:// or https://), not just the domain name. If the port is not the protocol default (80 for HTTP, 443 for HTTPS), include the port number. |
-w or --working | The path of the Web Crawler workspace directory. If this flag is not used, the default name of the workspace directory is workspace and is located in the directory from which the startup script is run. Because each workspace directory must have a unique path, you must use this flag if you are starting multiple Web crawls on the same machine. Optional. |
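As a sketch, several of the flags above can be combined in a single invocation. The workspace path and seed URL here are illustrative, not taken from the product documentation:

```shell
# Depth-limited test crawl (illustrative paths).
# -w gives this crawl its own workspace directory so it can run
#    alongside other crawls on the same machine;
# -f deletes the output directory before the crawl runs
#    (not usable with resumable crawls, i.e. not with -r);
# -l 100 soft-limits the number of pages fetched while testing.
.\bin\web-crawler -w ..\myworkspace -f -d 2 -l 100 -s http://www.oracle.com
```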
The crawl depth (as set by the -d flag) specifies how many levels of page links will be followed. Each URL in the seed has a level of 0 and each link from a seed URL has a level of 1. The links from a level 1 URL have a level of 2 and so on.
Level 0: www.endeca.com is level 0 and has a link to about.html. Level 1: about.html is level 1 and its links are level 2. Level 2: contacts.html is level 2 and its links are level 3. Therefore, if you want to crawl all the level 2 pages, specify -d 2 as the flag argument.
The workspace/conf/web-crawler/default directory is the default configuration directory; it is used if you do not specify the -c flag.
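A crawl that relies entirely on the default configuration directory can therefore omit -c. This is a minimal sketch; the seed file path assumes the endeca.lst file shown in the -s flag description above:

```shell
# No -c flag, so all configuration files are read from
# workspace/conf/web-crawler/default.
.\bin\web-crawler -d 2 -s conf\web-crawler\default\endeca.lst
```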
.\bin\web-crawler -c conf\web\intsites -d 2 -s conf\web\intsites\int.lst
In this example, the crawl uses the site.xml from the intsites directory, while the rest of the files are read from the default configuration directory.
.\bin\web-crawler -d 2 -s conf\web\intsites\int.lst -JVM -Xmx2g
Keep in mind that the -JVM flag must be the last flag on the command line, because any arguments that follow it are appended to those passed to the JVM.