The Oracle Commerce Web Crawler startup script has several flags to control the behavior of the crawl.

The web-crawler startup script supports the following flags; each flag is listed below with its argument and behavior. If run with no flags, the web-crawler script displays its usage information and exits.
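
For example (a minimal sketch; the script name web-crawler.sh used in the examples below is an assumption and may differ by platform or installation):

    # Run with no flags: prints usage information and exits
    ./web-crawler.sh

    # Minimal full crawl: -d is required for all crawl types, -s for full crawls
    ./web-crawler.sh -d 2 -s http://www.example.com/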

-c or --conf

The path of the configuration directory. If this flag is not specified, the workspace/conf/web-crawler/default directory is used as the default configuration directory. Note that if the flag points to a directory other than the default configuration directory (such as workspace/conf/web-crawler/polite-crawl), the files in the workspace/conf/web-crawler/default directory are read first. Optional.
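
For example, using the polite-crawl directory mentioned above:

    # Use the polite-crawl configuration; files in workspace/conf/web-crawler/default are still read first
    ./web-crawler.sh -c workspace/conf/web-crawler/polite-crawl -d 2 -s http://www.example.com/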

-d or --depth

A non-negative integer that specifies the maximum depth of the crawl. The crawl depth is the number of link levels, starting from the seed URL, that are crawled (see below for details). Required for all crawl types.
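
For example:

    # Depth 0: no link levels beyond the seed URLs, so only the seeds themselves should be fetched
    ./web-crawler.sh -d 0 -s http://www.example.com/

    # Depth 2: follow links up to two levels away from the seed
    ./web-crawler.sh -d 2 -s http://www.example.com/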

-f or --force

Takes no argument. Forces the output directory to be deleted before the crawl is run. Optional; cannot be used with resumable crawls.
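
For example:

    # Delete any existing output directory, then run a fresh full crawl
    ./web-crawler.sh -f -d 2 -s http://www.example.com/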

-JVM

Passes command-line arguments to the Java Virtual Machine (JVM). If this flag is used, any arguments before it are passed to the Web Crawler, and any arguments after it are appended to the arguments passed to the JVM. Note that on Windows machines, flag parameters that contain equal signs should be quoted. Optional.
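
For example (the Windows script name web-crawler.bat is an assumption; -Xmx and -D are standard JVM options, not Web Crawler flags):

    # Everything after -JVM goes to the JVM; here, raise the maximum heap size
    ./web-crawler.sh -d 2 -s http://www.example.com/ -JVM -Xmx2048m

    # On Windows, quote JVM arguments that contain equal signs
    web-crawler.bat -d 2 -s http://www.example.com/ -JVM "-Dhttp.proxyHost=proxy.example.com"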

-l or --limit

An integer that specifies an approximate maximum for the number of requests to make (that is, the maximum number of pages to fetch). The maximum is a soft limit: when it is reached, the Crawler stops adding pages to the queue but completes any pages still in the queue.

When the limit is reached, the Web Crawler also writes a "URL limit reached, starting shutdown." message to the log file.

The default is 0 (zero), which sets no limit. This flag is useful when you are setting up and testing your crawl configuration. Optional.
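
For example, to test a configuration by fetching roughly 100 pages at most:

    ./web-crawler.sh -d 3 -l 100 -s http://www.example.com/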

-r or --resume

Resumes a previously run full or resumable crawl. Optional.
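
For example (-d is supplied because it is required for all crawl types; -s is omitted because it is ignored when resuming):

    ./web-crawler.sh -r -d 2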

-s or --seed

The seed for the crawl. The seed can be a single URL, a file containing URLs (one per line), or a directory containing *.lst files that contain URLs. The URLs must be fully qualified, not just domain names; that is, you must specify the protocol (http:// or https://) and, if the port is not the default (80 for HTTP, 443 for HTTPS), the port number. Required for full crawls, but ignored for resumable crawls.
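
For example (seeds.txt and the seeds directory are hypothetical names):

    # Seed with a single fully qualified URL, with a non-default port given explicitly
    ./web-crawler.sh -d 2 -s http://www.example.com:8080/

    # Seed with a file containing one URL per line
    ./web-crawler.sh -d 2 -s seeds.txt

    # Seed with a directory of *.lst files containing URLs
    ./web-crawler.sh -d 2 -s seeds/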

-w or --working

The path of the Web Crawler workspace directory. If this flag is not used, the default workspace directory is named workspace and is located in the directory from which the startup script is run. Because each workspace directory must have a unique path, you must use this flag when starting multiple Web crawls on the same machine. Optional.
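
For example (the workspace paths shown are hypothetical):

    # Two simultaneous crawls on the same machine need distinct workspace directories
    ./web-crawler.sh -d 2 -s http://site-one.example.com/ -w /tmp/crawl-one
    ./web-crawler.sh -d 2 -s http://site-two.example.com/ -w /tmp/crawl-two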

