Command-line flags for crawls
The Oracle Commerce Web Crawler startup script (web-crawler) accepts the following flags to control the behavior of a crawl. If run with no flags, the web-crawler script displays the usage information and exits. Example invocations appear after the table.
| web-crawler Flag | Flag Argument |
|---|---|
| -c or --conf | The path of the configuration directory. If this flag is not specified, the workspace/conf/web-crawler/default directory is used as the default configuration directory. Note that if the flag points to a directory other than the default configuration directory (such as workspace/conf/web-crawler/polite-crawl), the files in the workspace/conf/web-crawler/default directory are read first. Optional. |
| -d or --depth | A non-negative integer that specifies the maximum depth of the crawl. The crawl depth is the number of link levels, starting from the seed URL, that are crawled (see below for details). Required for all crawl types. |
| -f or --force | Takes no argument. Forces the output directory to be deleted before the crawl is run. Optional; cannot be used with resumable crawls. |
| -JVM | Passes subsequent command-line arguments to the Java Virtual Machine (JVM). Any arguments before this flag are passed to the Web Crawler, and any arguments after it are appended to those passed to the JVM. Note that on Windows machines, flag parameters that contain equal signs should be quoted. Optional. |
| -l or --limit | An integer that specifies an approximate maximum number of requests to make (that is, the maximum number of pages to fetch). The maximum is a soft limit: when it is reached, the Crawler adds no more pages to the queue but completes any pages still in the queue. When the limit is reached, the Web Crawler also writes a "URL limit reached, starting shutdown." message to the log file. The default is 0 (zero), which sets no limit. This flag is useful when you are setting up and testing your crawl configuration. Optional. |
| -r or --resume | Resumes a full or resumable crawl that was previously run. Optional. |
| -s or --seed | The seed for the crawl. The seed can be one URL, a file containing URLs (one per line), or a directory containing *.lst files that contain URLs. The URLs must be fully qualified, not just domain names; that is, you must specify the protocol (http:// or https://) and, if the port is not the default (80 for HTTP, 443 for HTTPS), the port number. Required for full crawls; ignored for resumable crawls. |
| -w or --working | The path of the Web Crawler workspace directory. If this flag is not used, the workspace directory defaults to a directory named workspace in the directory from which the startup script is run. Because each workspace directory must have a unique path, you must use this flag if you are starting multiple Web crawls on the same machine. Optional. |
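The following invocations sketch how these flags combine in practice. They are illustrative only: the seed URL, seed file, workspace path, and JVM heap setting are placeholder values, and the exact script name depends on your platform (for example, a .bat extension on Windows).

```
# Full crawl of an example site, two link levels deep (illustrative seed URL):
./web-crawler -d 2 -s http://www.example.com/

# Full crawl using the polite-crawl configuration, a dedicated workspace,
# and a soft limit of 500 page requests; -f deletes any previous output first:
./web-crawler -c workspace/conf/web-crawler/polite-crawl -w /tmp/crawl1 \
    -d 3 -s seeds.lst -l 500 -f

# Resume a previously run resumable crawl in the same workspace
# (-d is required for all crawl types; -s is ignored for resumed crawls):
./web-crawler -r -d 3 -w /tmp/crawl1

# Pass extra arguments to the JVM; everything after -JVM goes to the JVM
# (on Windows, quote JVM parameters that contain equal signs):
./web-crawler -d 1 -s http://www.example.com/ -JVM -Xmx2048m
```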