Command-line flags for crawls

The Endeca Web Crawler startup script has several flags to control the behavior of the crawl.

The web-crawler startup script accepts the flags described below. If you run the script without any flags, it displays usage information and exits. A combined example follows the flag descriptions.
-c or --conf The path of the configuration directory. If this flag is not specified, the workspace/conf/web-crawler/default directory is used as the default configuration directory. Note that if the flag points to a directory other than the default configuration directory (such as workspace/conf/web-crawler/polite-crawl), the files in the workspace/conf/web-crawler/default directory are read first. Optional.
-d or --depth A non-negative integer that specifies the maximum depth of the crawl. The crawl depth is the number of link levels, starting from the seed URLs, that are crawled (see Setting the crawl depth below for details). Required for all crawl types.
-f or --force Deletes the output directory before the crawl runs. Takes no argument. Optional, but cannot be used with resumable crawls.
-JVM Passes arguments on the command line to the Java Virtual Machine (JVM). If this flag is used, any arguments before it are passed to the Web Crawler and any arguments afterwards are appended to those passed to the JVM. Note that on Windows machines, the flag parameters should be quoted if they contain equals signs. Optional.
-l or --limit An integer that specifies an approximate maximum number of requests to make (that is, the maximum number of pages to fetch). The maximum is a soft limit: when it is reached, the Web Crawler stops adding pages to the queue but finishes fetching any pages already in the queue.

When the limit is reached, the Web Crawler also writes a "URL limit reached, starting shutdown" message to the log file.

The default is 0 (zero), which sets no limit. This flag is useful when you are setting up and testing your crawl configuration. Optional.

-r or --resume Resumes a full or resumable crawl that has been previously run. Optional.
-s or --seed The seed for the crawl. Required for full crawls; ignored for resumable crawls. The seed can be one URL, a file containing URLs (one per line), or a directory containing *.lst files that contain URLs.

For example:

-s http://www.oracle.com (one URL)

-s C:\Oracle\Endeca\IAS\workspace\conf\web-crawler\default\endeca.lst (an .lst file of URLs)

-s C:\Oracle\Endeca\IAS\workspace\conf\web-crawler\default (a directory containing any number of .lst files)

A seed URL must be fully qualified and include the protocol (http:// or https://), not just the domain name. If the site does not use the default port, include the port number; the default port is 80 for HTTP and 443 for HTTPS.

-w or --working The path of the Web Crawler workspace directory. If this flag is not used, the workspace directory defaults to a directory named workspace in the directory from which the startup script is run. Because each workspace directory must have a unique path, you must use this flag if you are starting multiple Web crawls on the same machine. Optional.
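
For example, the following command combines several of these flags (the seed URL, request limit, and workspace path are illustrative). It runs a depth-2 crawl, stops queueing new pages after roughly 1000 requests, writes into a dedicated workspace directory, and deletes any previous output first:
.\bin\web-crawler -d 2 -l 1000 -w C:\crawls\ws1 -f -s http://www.example.com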

Setting the crawl depth

The crawl depth (as set by the -d flag) specifies how many levels of page links will be followed. Each URL in the seed has a level of 0 and each link from a seed URL has a level of 1. The links from a level 1 URL have a level of 2 and so on.

For example, if the seed is www.endeca.com, the levels are as follows:
   Level 0: www.endeca.com is the seed and has a link to about.html.
   Level 1: about.html is level 1, and its links are level 2.
   Level 2: contacts.html is level 2, and its links are level 3.
Therefore, if you want to crawl all the level 2 pages, specify -d 2 as the flag argument.
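
By this definition, a crawl depth of 0 (-d 0) fetches only the seed URLs themselves. A depth-2 crawl of the site above could be started as follows (an illustrative invocation, with the protocol added as required for seed URLs):
.\bin\web-crawler -d 2 -s http://www.endeca.com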

Specifying the configuration directory

The workspace/conf/web-crawler/default directory is the default configuration directory; it is used whenever you do not specify the -c flag.

You can also use the -c flag to override one or more configuration files in the default configuration directory with files from another configuration directory. For example, assume you have a directory (named intsites) that has a site.xml file for a specific crawl (and no other configuration files). You would then use the -c flag to point to that directory:
.\bin\web-crawler -c conf\web\intsites -d 2 -s conf\web\intsites\int.lst
In this example, the crawl uses the site.xml from the intsites directory, while the rest of the files are read from the default configuration directory.
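
Similarly, to use the polite-crawl configuration mentioned earlier, you could point the -c flag at that directory (the paths and seed URL here are illustrative and depend on your installation layout):
.\bin\web-crawler -c workspace\conf\web-crawler\polite-crawl -d 2 -s http://www.example.com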

Specifying JVM arguments

To pass additional arguments to the Java Virtual Machine (JVM), you can use the -JVM script flag. For example, assume you want to override the default maximum heap size setting of 1024 MB that is hardcoded in the scripts with a setting of 2048 MB. The command line might be as follows:
.\bin\web-crawler -d 2 -s conf\web\intsites\int.lst -JVM -Xmx2g
Keep in mind that this flag must be the last flag on the command line, because any arguments that follow it are appended to those passed to the JVM.
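
As noted in the flag descriptions, on Windows you should quote any -JVM argument that contains an equals sign. For example, this illustrative command sets the standard Java proxy system properties (the proxy host and port are placeholders):
.\bin\web-crawler -d 2 -s conf\web\intsites\int.lst -JVM "-Dhttp.proxyHost=proxy.example.com" "-Dhttp.proxyPort=8080"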