43 Coding the Crawler Configuration File
The BaseConfigurator class, its methods, and interfaces control a crawler's site capture process. Sample code is available in the Site Capture installation for the FirstSiteII crawler.
About Controlling a Crawler
To control a crawler, you code its CrawlerConfigurator.groovy file with, at minimum, the starting URI and link extraction logic. You supply this information through the getStartUri() and createLinkExtractor() methods. You can add code to specify, for example, the number of links to be crawled, the crawl depth, and the invocation of a post-crawl event such as copying statically downloaded files to a web server's doc base.
The methods and interfaces you use are provided in the BaseConfigurator class. The default implementations can be overridden to customize and control the crawl process in a way that agrees with the structure of the target site and the data you have to collect.

The BaseConfigurator methods and a simple CrawlerConfigurator.groovy file described in the topics that follow demonstrate the usage of the required methods. Crawler customization methods are then discussed, followed by information about Site Capture's Java interfaces, including their default and custom implementations.
BaseConfigurator Methods
The CrawlerConfigurator.groovy file contains the code of the CrawlerConfigurator class. This class must extend BaseConfigurator, an abstract class that provides default implementations for the crawler.
This table lists the methods and interfaces of the BaseConfigurator class:

Table 43-1 Methods in the BaseConfigurator Class

| Method Type | Method | Notes |
|---|---|---|
| Required | getStartUri | N/A |
| Required | createLinkExtractor | Factory method in the LinkExtractor interface. (Footnote 1, Footnote 2) |
| Crawler Customization | getMaxLinks | N/A |
| Crawler Customization | getMaxCrawlDepth | N/A |
| Crawler Customization | getConnectionTimeout | N/A |
| Crawler Customization | getSocketTimeout | N/A |
| Crawler Customization | getPostExecutionCommand | N/A |
| Crawler Customization | getNumWorkers | N/A |
| Crawler Customization | getUserAgent | N/A |
| Crawler Customization | createResourceRewriter | Factory method in the ResourceRewriter interface. (Footnote 1, Footnote 2) |
Footnote 1: The listed interfaces have default implementations, described in this chapter.

Footnote 2: Site Capture provides a sample link extractor and resource rewriter, both used by the FirstSiteII sample crawler. See Writing and Deploying a Custom Link Extractor and Writing a Custom ResourceRewriter.
getStartUri
This method injects the crawler's start URI. You can configure one or more start URIs for the crawl, as long as the URIs belong to the same site. Multiple start URIs enable the crawl to begin at several points in parallel.
To provide the start URI for the www.example.com site:

```groovy
/**
 * The method is used to configure the site url which needs to be crawled.
 */
public String[] getStartUri() {
    return ["http://www.example.com/home"]; // Groovy uses brackets for an array.
}
```
To provide multiple start URIs for the site, enter a comma-separated array:

```groovy
/**
 * The method is used to configure the site url which needs to be crawled.
 */
public String[] getStartUri() {
    return ["http://www.example.com/product", "http://www.example.com/support"]; // Groovy uses brackets for an array.
}
```
createLinkExtractor
This method configures the logic for extracting links from the crawled pages. The extracted links are then traversed. A basic sample CrawlerConfigurator.groovy file that uses this method is shown in Basic Configuration File.
Two abstract methods in BaseConfigurator must be overridden in CrawlerConfigurator: getStartUri() and createLinkExtractor(). The createLinkExtractor method is a factory method in the LinkExtractor interface:
- Implement the LinkExtractor interface to create your own link extraction algorithm, for example, using an HTML parser to parse the pages and extract links for the crawler to consume.
- To extract links, use the default implementation, PatternLinkExtractor, which uses regular expressions. For example, PatternLinkExtractor can be used to extract links of the format /home/products from expressions such as <a href="/home/product">Products</a>.

  To use a regular expression for extracting links from <a href="/home/product">Products</a> on the www.example.com site:

  ```groovy
  /**
   * The method is used to define the link extraction
   * algorithm from the crawled pages.
   * PatternLinkExtractor is a regex based extractor
   * which parses the links on the web page
   * based on the pattern configured inside the constructor.
   */
  public LinkExtractor createLinkExtractor() {
      return new PatternLinkExtractor("['\"\\(](/[^\\s<'\"\\)]*)", 1);
  }
  ```
- For more information about regular expressions and PatternLinkExtractor, see Using the Default Implementation of LinkExtractor.
- For more information about implementing the LinkExtractor interface, see Writing and Deploying a Custom Link Extractor.
Basic Configuration File
In this example of a simple CrawlerConfigurator.groovy file, the required methods, getStartUri() and createLinkExtractor(), are overridden.

We also override an additional method, getMaxLinks(), setting it to return 150 so that the test run completes quickly.

The CrawlerConfigurator.groovy file is used for dependency injection; hence, its name must not be changed.
```groovy
package com.fatwire.crawler.sample

import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.regex.Pattern;

import javax.mail.internet.AddressException;
import javax.mail.internet.InternetAddress;

import com.fatwire.crawler.*;
import com.fatwire.crawler.remote.*;
import com.fatwire.crawler.remote.di.*;
import com.fatwire.crawler.impl.*;
import com.fatwire.crawler.util.FileBuilder;

import org.apache.commons.lang.SystemUtils;
import org.apache.http.HttpHost;
import org.apache.http.auth.*;
import org.apache.http.client.*;
import org.apache.http.impl.client.*;

/**
 * Configurator for the crawler.
 * This is used to inject the dependency inside the crawler
 * to control the crawling process.
 */
public class CrawlerConfigurator extends BaseConfigurator {

    public CrawlerConfigurator(GlobalConfigurator delegate) {
        super(delegate);
    }

    /**
     * The method is used to configure the site url which needs to be crawled.
     */
    public String[] getStartUri() {
        return ["http://www.fatwire.com/home"]; // Groovy uses brackets for an array.
    }

    /**
     * The method is used to define the link extraction algorithm
     * from the crawled pages.
     * PatternLinkExtractor is a regex based extractor which parses
     * the links on the web page
     * based on the pattern configured inside the constructor.
     */
    public LinkExtractor createLinkExtractor() {
        return new PatternLinkExtractor("['\"\\(](/[^\\s<'\"\\)]*)", 1);
    }

    /**
     * The method is used to control the maximum number of links
     * to be crawled as part of this crawl session.
     */
    public int getMaxLinks() {
        150;
    }
}
```
Crawler Customization Methods
In addition to the required methods, the BaseConfigurator class has methods with default implementations. You may want to override these methods to customize the crawl process in a way that agrees with the structure of the target site and the data you have to collect.
getMaxLinks
This method controls the number of links to be crawled. The number of links should be a positive integer; otherwise, the crawl scans all the links in the same domain that are reachable from the start URIs.
To specify crawling 500 links:

```groovy
/**
 * default: -1; crawler will crawl over all the links reachable from the start URI
 * @return the maximum number of links to download.
 */
public int getMaxLinks() {
    return 500;
}
```
getMaxCrawlDepth
This method controls the maximum depth to which a site is crawled. Links beyond the specified depth are ignored. The depth of the starting page is 0.

To crawl to a maximum depth of 4:

```groovy
/**
 * default: -1. Indicates infinite depth for a site.
 * @return the maximum depth to which we need to crawl the links.
 */
public int getMaxCrawlDepth() {
    return 4;
}
```
getConnectionTimeout
This method determines how long the crawler will wait to establish a connection to its target site. If a connection is not established within the specified time, the crawler will ignore the link and continue to the next link.

To set a connection timeout of 50,000 milliseconds:

```groovy
/**
 * default: 30000 ms
 * @return Connection timeout in milliseconds.
 */
public int getConnectionTimeout() {
    return 50000; // in milliseconds
}
```
getSocketTimeout
This method controls the socket timeout of the requests that the crawler makes for the links to be crawled.

To set a socket timeout of 30,000 milliseconds:

```groovy
/**
 * default: 20000 ms
 * @return Socket timeout in milliseconds.
 */
public int getSocketTimeout() {
    return 30000; // in milliseconds
}
```
getPostExecutionCommand
This method injects custom post-crawl logic and is invoked when the crawler finishes its crawl session. It must return the absolute path of the script or command, with parameters if any.

For example, getPostExecutionCommand() can be used to automate deployment to a web server's doc base by invoking a batch or shell script that copies statically captured files after the crawl session ends.
Note:

- The script or command should be present in the same location on all servers hosting Site Capture.
- Avoid downloading large archive files (exceeding 250MB) from the Site Capture interface. Use getPostExecutionCommand to copy the files from the Site Capture file system to your preferred location. Archive size can be obtained from the crawler report, on the Job Details form.
To run a batch script named copy.bat on the Site Capture server:

```groovy
/**
 * default: null.
 * @return the command string for post execution.
 * Null if there is no such command.
 */
public String getPostExecutionCommand() {
    // The file is expected to be in the C:\commands folder
    // on the computer where the Site Capture server is running.
    return "C:\\commands\\copy.bat";
}
```
getNumWorkers
This method controls the number of worker threads used for the crawl process. The ideal number of parallel threads to spawn for the crawl session depends on the architecture of the computer hosting Site Capture.

To start 10 worker threads for a crawl process:

```groovy
/**
 * default: 4.
 * @return the number of workers to start.
 * Workers concurrently download resources.
 */
public int getNumWorkers() {
    // Start 10 worker threads for the crawl process.
    return 10;
}
```
getUserAgent
This method configures the user agent that the crawler uses when it traverses the site. Use this method when the site must be rendered differently than usual, for example, as it would be on a mobile device.

To configure the Firefox 3.6.17 user agent:

```groovy
/**
 * default: publish-crawler/1.1 (http://www.fatwire.com)
 * @return the user agent identifier
 */
public String getUserAgent() {
    return "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.17) Gecko/20110420 Firefox/3.6.17";
}
```
createResourceRewriter
This method rewrites URLs inside the HTML pages that are crawled. For example, you may want to rewrite the URLs to enable static delivery of a dynamic WebCenter Sites website.
The createResourceRewriter method is a factory method in the ResourceRewriter interface:
- Implement the ResourceRewriter interface to convert dynamic URLs to static URLs, absolute URLs to relative URLs, and so on.
- You can also use the following default implementations:
  - NullResourceRewriter: Does not rewrite any of the URLs.
  - PatternResourceRewriter: Searches for a regular pattern and rewrites as specified.

  To use PatternResourceRewriter to rewrite URLs such as http://www.site.com/home.html to /home.html:

  ```groovy
  /**
   * Factory method for a ResourceRewriter.
   * default: new NullResourceRewriter();
   * @return the rewritten resource modifies the html before it is saved to disk.
   */
  public ResourceRewriter createResourceRewriter() {
      new PatternResourceRewriter("http://www.site.com/([^\\s'\"]*)", '/$1');
  }
  ```
- For more information about the default implementations, see Using the Default Implementations of ResourceRewriter.
- For more information about implementing the ResourceRewriter interface, see Writing a Custom ResourceRewriter.
createMailer
This method provides the implementation for sending email after the crawl. The createMailer method is a factory method in the Mailer interface.
- Site Capture comes with an SMTP over TLS implementation, which emails the crawler report when a static or archive capture session ends. (The crawler report is the report.txt file, described in Administering Oracle WebCenter Sites.)
- If you are using a mail server other than SMTP-TLS (such as SMTP without authentication, or POP3), you must provide your own implementation.

  To send no email:

  ```groovy
  /**
   * Factory method for a Mailer.
   * <p/>
   * default: new NullMailer().
   * @return mailer holding configuration to send an email at the end of the crawl.
   * Should not be null.
   */
  public Mailer createMailer() {
      return new NullMailer();
  }
  ```
- For more information about the default implementation, see Using the Default Implementation of Mailer.
- For more information about implementing the Mailer interface, see Writing a Custom Mailer.
getProxyHost
The getProxyHost method must be overridden if the site being crawled is behind a proxy server. You can configure the proxy server in this method.
Note: If you use getProxyHost, also use getProxyCredentials, described in getProxyCredentials.
To configure a proxy server:

```groovy
/**
 * default: null.
 * @return the host for the proxy,
 * null when there is no proxy needed
 */
public HttpHost getProxyHost() {
    // Using the HttpClient library, return an HttpHost.
    return new HttpHost("www.myproxyserver.com", 883);
}
```
getProxyCredentials
This method injects credentials for the proxy server configured in the getProxyHost method.
See getProxyHost.
To authenticate a proxy server user named sampleuser:

```groovy
/**
 * default: null.
 * example: new UsernamePasswordCredentials(username, password);
 * @return user credentials for the proxy.
 */
public Credentials getProxyCredentials() {
    // Using the HttpClient library, return Credentials.
    return new UsernamePasswordCredentials("sampleuser", "samplepassword");
}
```
Interfaces
Site Capture provides these interfaces with default implementations: LinkExtractor, ResourceRewriter, and Mailer. The topics that follow describe each interface and its implementations.
LinkExtractor
A link extractor specifies which links Site Capture traverses in a crawl session. The implementation is injected through the CrawlerConfigurator.groovy file and is called by the Site Capture framework during the crawl session to extract links from the downloaded markup.

Site Capture comes with one implementation of LinkExtractor. You can also write and deploy your own custom link extraction logic. For more information, see the following topics:
LinkExtractor Interface
This interface has only one method, extract(), which must be implemented to provide the algorithm for extracting links from downloaded markup.

```java
package com.fatwire.crawler;

import java.util.List;

import com.fatwire.crawler.url.ResourceURL;

/**
 * Extracts the links out of a WebResource.
 */
public interface LinkExtractor {
    /**
     * Parses the WebResource and finds a list of links (if possible).
     * @param resource the WebResource to inspect.
     * @return a list of links found inside the WebResource.
     */
    List<ResourceURL> extract(final WebResource resource);
}
```
Using the Default Implementation of LinkExtractor
PatternLinkExtractor is the default implementation of the LinkExtractor interface. It extracts links on the basis of a regular expression: it takes a regular expression as input and returns only the links matching that expression.

Common usage scenarios include using PatternLinkExtractor for sites with dynamic URLs and for sites with static URLs.
- Using PatternLinkExtractor for sites with dynamic URLs:

  For example, on www.example.com, the links have a pattern of /home, /support, and /cs/Satellite/. To extract and traverse such links, use PatternLinkExtractor in the following way:

  ```groovy
  /**
   * The method is used to define the link extraction algorithm
   * from the crawled pages.
   * PatternLinkExtractor is a regex based extractor which parses
   * the links on the web page
   * based on the pattern configured inside the constructor.
   */
  public LinkExtractor createLinkExtractor() {
      return new PatternLinkExtractor("['\"\\(](/[^\\s<'\"\\)]*)", 1);
  }
  ```
  The pattern `['\"\\(](/[^\\s<'\"\\)]*)` is used to extract links that:

  - start after any one of the following characters:
    - Single quote ( ' )
    - Double quote ( " )
    - Left parenthesis ( ( )
  - continue with a slash ( / ),
  - and end before the first occurrence of any one of the following characters:
    - Whitespace ( \s )
    - Less-than symbol ( < )
    - Single quote ( ' )
    - Double quote ( " )
    - Right parenthesis ( ) )

  Consider the URL inside the following markup:

  <a href='/home'>Click Me</a>

  We are interested only in extracting the /home link. This link matches the pattern because it is preceded by a single quote and delimited at the end by a single quote. The group index of 1 returns /home as the result.
- Using PatternLinkExtractor for sites with static URLs:

  For example, the markup for www.example.com has links such as:

  <a href="http://www.example.com/home/index.html">Click Me</a>

  To extract and traverse such types of links, use PatternLinkExtractor in the following way:

  ```groovy
  /**
   * The method is used to define the link extraction algorithm
   * from the crawled pages.
   * PatternLinkExtractor is a regex based extractor which parses
   * the links on the web page
   * based on the pattern configured inside the constructor.
   */
  public LinkExtractor createLinkExtractor() {
      return new PatternLinkExtractor(Pattern.compile("http://www.example.com/[^\\s<'\"]*"));
  }
  ```

  The above example instructs the crawler to extract links that start with http://www.example.com and end before the first occurrence of whitespace ( \s ), a less-than symbol ( < ), a single quote ( ' ), or a double quote ( " ).

  Note: For more details on groups and patterns, see the Java documentation for the Pattern and Matcher classes.
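To see exactly what these patterns match, you can exercise them directly with Java's standard Pattern and Matcher classes, outside of Site Capture. The following standalone sketch (not part of the Site Capture API) applies the dynamic-URL pattern with group 1 and the static-URL pattern with the full match to the sample markup shown above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternDemo {
    public static void main(String[] args) {
        // Dynamic-URL pattern: group 1 captures the path between the delimiters.
        Pattern dynamic = Pattern.compile("['\"\\(](/[^\\s<'\"\\)]*)");
        Matcher m = dynamic.matcher("<a href='/home'>Click Me</a>");
        while (m.find()) {
            System.out.println(m.group(1)); // prints: /home
        }

        // Static-URL pattern: the whole match (group 0) is the link itself.
        Pattern fixedHost = Pattern.compile("http://www.example.com/[^\\s<'\"]*");
        Matcher n = fixedHost.matcher(
                "<a href=\"http://www.example.com/home/index.html\">Click Me</a>");
        while (n.find()) {
            System.out.println(n.group()); // prints: http://www.example.com/home/index.html
        }
    }
}
```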
Writing and Deploying a Custom Link Extractor
Site Capture provides a sample link extractor (and resource rewriter) used by the FirstSiteII sample crawler to download the WebCenter Sites FirstSiteII dynamic website as a static site. For more information, see the source code for the FSIILinkExtractor class in the following folder:
<SC_INSTALL_DIR>/fw-site-capture/crawler/_sample/FirstSiteII/src
To write a custom link extractor, implement the LinkExtractor interface and deploy your class with the crawler, as in the sketch below.
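The following is a minimal sketch of such an implementation, not the FSIILinkExtractor source. It assumes that WebResource exposes the downloaded markup through a getContent() accessor and that ResourceURL can be constructed from the extracted link and its source resource; both are assumptions, so verify the actual signatures against the interface listing above and the sample source in your installation.

```java
package com.example.crawler; // hypothetical package for this sketch

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.fatwire.crawler.LinkExtractor;
import com.fatwire.crawler.WebResource;
import com.fatwire.crawler.url.ResourceURL;

/**
 * Minimal custom LinkExtractor sketch: collects href values from anchor tags.
 */
public class HrefLinkExtractor implements LinkExtractor {

    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*['\"]([^'\"]+)['\"]");

    public List<ResourceURL> extract(final WebResource resource) {
        List<ResourceURL> links = new ArrayList<ResourceURL>();
        // Assumption: WebResource exposes the downloaded markup as bytes.
        String markup = new String(resource.getContent());
        Matcher m = HREF.matcher(markup);
        while (m.find()) {
            // Assumption: ResourceURL can be built from the link and its source.
            links.add(new ResourceURL(m.group(1), resource));
        }
        return links;
    }
}
```

Compile the class against the Site Capture crawler library and deploy it with the crawler so that createLinkExtractor() in CrawlerConfigurator.groovy can return an instance of it.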
ResourceRewriter
A resource rewriter rewrites URLs inside the markup that is downloaded during the crawl session. The implementation must be injected through the CrawlerConfigurator.groovy file.
Some use cases that require a resource rewriter are:
- Crawling a dynamic site and creating a static copy.
- Converting absolute URLs to relative URLs. For example, if the markup has URLs such as http://www.example.com/abc.html, then the crawler should remove http://www.example.com from the URL, thus allowing resources to be served from the host on which the downloaded files are stored.
Site Capture comes with two implementations of ResourceRewriter. You can also create custom implementations. For more information, see the following sections:
ResourceRewriter Interface
The rewrite method rewrites URLs inside the markup that is downloaded during the crawl session.

```java
package com.fatwire.crawler;

import java.io.IOException;

/**
 * Service for rewriting a resource. The crawler will use the implementation of
 * the rewrite method to rewrite the resources that are downloaded as part of
 * the crawl session.
 */
public interface ResourceRewriter {
    /**
     * @param resource
     * @return the bytes after the rewrite.
     * @throws IOException
     */
    byte[] rewrite(WebResource resource) throws IOException;
}
```
Using the Default Implementations of ResourceRewriter
Site Capture comes with the following implementations of ResourceRewriter:
- NullResourceRewriter: Configured by default to skip the rewriting of links. If ResourceRewriter is not configured in the CrawlerConfigurator.groovy file, then NullResourceRewriter is injected by default.
- PatternResourceRewriter: Used to rewrite URLs based on a regular expression. PatternResourceRewriter takes as input a regular expression to match the links inside the markup and replaces those links with the string provided inside the constructor.

  To rewrite an absolute URL as a relative URL, from:

  <a href="http://www.example.com/about/index.html">Click Me</a>

  to:

  <a href="/about/index.html">Click Me</a>

  ```groovy
  /**
   * Factory method for a ResourceRewriter.
   * default: new NullResourceRewriter();
   * @return the rewritten resource modifies the html before it is saved to disk.
   */
  public ResourceRewriter createResourceRewriter() {
      new PatternResourceRewriter("http://www.example.com/([^\\s'\"]*)", '/$1');
  }
  ```

  PatternResourceRewriter has only one constructor, which takes a regular expression and a string replacement: PatternResourceRewriter(final String regex, final String replacement)
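The replacement semantics shown above follow standard Java regular expressions, so you can preview what a given pattern and replacement will produce with plain String.replaceAll, independent of the Site Capture classes. A standalone sketch of the regex mechanics in the example above (an illustration, not the PatternResourceRewriter class itself):

```java
public class RewriteDemo {
    public static void main(String[] args) {
        String markup =
                "<a href=\"http://www.example.com/about/index.html\">Click Me</a>";
        // Group 1 captures everything after the host; the replacement keeps it
        // as a root-relative path.
        String rewritten =
                markup.replaceAll("http://www.example.com/([^\\s'\"]*)", "/$1");
        System.out.println(rewritten); // prints: <a href="/about/index.html">Click Me</a>
    }
}
```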
Writing a Custom ResourceRewriter
Site Capture provides a sample resource rewriter (and link extractor) used by the FirstSiteII sample crawler to download WebCenter Sites' FirstSiteII dynamic website as a static site. For more information, see the source code for the FSIILinkExtractor class in the following folder:
<SC_INSTALL_DIR>/fw-site-capture/crawler/_sample/FirstSiteII/src
To write a custom resource rewriter, implement the ResourceRewriter interface and deploy your class with the crawler, as in the sketch below.
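The following is a minimal sketch of such an implementation, not the FirstSiteII source. As with the link extractor sketch earlier, the getContent() accessor on WebResource is an assumption; verify the actual API against the sample source in your installation.

```java
package com.example.crawler; // hypothetical package for this sketch

import java.io.IOException;

import com.fatwire.crawler.ResourceRewriter;
import com.fatwire.crawler.WebResource;

/**
 * Minimal custom ResourceRewriter sketch: strips the host prefix so that the
 * downloaded pages can be served relative to any document root.
 */
public class RelativizingRewriter implements ResourceRewriter {

    public byte[] rewrite(WebResource resource) throws IOException {
        // Assumption: WebResource exposes the downloaded markup as bytes.
        String markup = new String(resource.getContent());
        // Drop the host so all links become root-relative.
        String rewritten = markup.replaceAll("http://www.example.com/", "/");
        return rewritten.getBytes();
    }
}
```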
Mailer
A mailer sends email after the crawl ends. The implementation must be injected through the CrawlerConfigurator.groovy file.
Site Capture provides an SmtpTlsMailer implementation, which can be used to send the crawler report from the SMTP-TLS mail server. You can also implement the Mailer interface to provide custom logic for sending emails from a server other than SMTP-TLS (such as SMTP without authentication, or POP3). Your custom logic can also specify the email to be an object other than the crawler report. If Mailer is not configured in the CrawlerConfigurator.groovy file, then NullMailer is injected by default.
Mailer Interface
The sendMail method is automatically called if the Mailer is configured in the CrawlerConfigurator.groovy file.

```java
package com.fatwire.crawler;

import java.io.IOException;

import javax.mail.MessagingException;

/**
 * Service to send an email.
 */
public interface Mailer {
    /**
     * Sends the mail.
     *
     * @param subject
     * @param report
     * @throws MessagingException
     * @throws IOException
     */
    void sendMail(String subject, String report) throws MessagingException, IOException;
}
```
Using the Default Implementation of Mailer
Site Capture provides an SMTP-TLS server-based email implementation that sends out the crawler report when a static or archive crawl session ends. (The crawler report is the report.txt file, described in About Accessing Log Files in Administering Oracle WebCenter Sites.)
Use the default mailer by injecting it through the CrawlerConfigurator.groovy file, as shown below:

```groovy
/**
 * Factory method for a Mailer.
 * <p/>
 * default: new NullMailer().
 * @return mailer holding configuration to send an email
 * at the end of the crawl.
 * Should not be null.
 */
public Mailer createMailer() {
    try {
        // Creating a SmtpTlsMailer object.
        SmtpTlsMailer mailer = new SmtpTlsMailer();
        // Creating an internet address from which the mail should be sent.
        InternetAddress from = new InternetAddress("example@example.com");
        // Setting the from address inside the mailer object.
        mailer.setFrom(from);
        // Setting the email address of the recipient inside mailer.
        mailer.setTo(InternetAddress.parse("example@example.com"));
        // Setting the email server host to be used for email.
        // The email server should be SMTP-TLS enabled.
        mailer.setHost("smtp.gmail.com", 587);
        // Setting the credentials of the mail account.
        mailer.setCredentials("example@example.com", "examplepassword");
        return mailer;
    } catch (AddressException e) {
        log.error(e.getMessage());
    }
}
```
Writing a Custom Mailer
To write a custom mailer, implement the Mailer interface and inject your implementation through the CrawlerConfigurator.groovy file. A sketch follows.
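For example, a custom mailer for a plain SMTP server (no TLS, no authentication) can be built on the standard JavaMail API. The following is a hedged sketch; the host name and addresses are placeholders, not values from the Site Capture documentation:

```java
package com.example.crawler; // hypothetical package for this sketch

import java.io.IOException;
import java.util.Properties;

import javax.mail.Message;
import javax.mail.MessagingException;
import javax.mail.Session;
import javax.mail.Transport;
import javax.mail.internet.InternetAddress;
import javax.mail.internet.MimeMessage;

import com.fatwire.crawler.Mailer;

/**
 * Minimal custom Mailer sketch: sends the crawler report over plain SMTP
 * without authentication.
 */
public class PlainSmtpMailer implements Mailer {

    public void sendMail(String subject, String report)
            throws MessagingException, IOException {
        Properties props = new Properties();
        props.put("mail.smtp.host", "smtp.internal.example.com"); // placeholder host
        props.put("mail.smtp.port", "25");

        Session session = Session.getInstance(props);
        Message message = new MimeMessage(session);
        message.setFrom(new InternetAddress("crawler@example.com"));
        message.setRecipients(Message.RecipientType.TO,
                InternetAddress.parse("admin@example.com"));
        message.setSubject(subject);
        message.setText(report); // by default, report is the crawler report text
        Transport.send(message);
    }
}
```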
This implementation emails the crawler report (the report.txt file), given that the String report argument in the sendMail method names the crawler report by default. You can customize the logic to email objects other than the crawler report.
Summary of Methods and Interfaces
The default implementations of the methods and interfaces in the Site Capture BaseConfigurator class, which control a crawler's site capture process, are summarized here.
Methods
The methods of the Site Capture BaseConfigurator class are listed in Table 43-1. The factory methods are in the following interfaces:

- createLinkExtractor is in the LinkExtractor interface.
- createResourceRewriter is in the ResourceRewriter interface.
- createMailer is in the Mailer interface.
Interfaces
The following interfaces are used in the Site Capture BaseConfigurator class:
- LinkExtractor: Its default implementation is PatternLinkExtractor, which extracts links on the basis of a regular expression.

  Site Capture also provides a sample link extractor (and a sample resource rewriter), used by the FirstSiteII sample crawler to download WebCenter Sites' FirstSiteII dynamic website as a static site. Source code is available in the following folder:

  <SC_INSTALL_DIR>/fw-site-capture/crawler/_sample/FirstSiteII/src

  You can write and deploy your own custom link extraction logic.

- ResourceRewriter: Its default implementations are NullResourceRewriter, which skips the rewriting of links, and PatternResourceRewriter, which rewrites URLs based on a regular expression.

  Site Capture provides a sample resource rewriter (and a sample link extractor), used by the FirstSiteII sample crawler to download WebCenter Sites' FirstSiteII dynamic website as a static site. Source code is available in the following folder:

  <SC_INSTALL_DIR>/fw-site-capture/crawler/_sample/FirstSiteII/src

  You can write and deploy your own logic for rewriting URLs.

- Mailer: Its default implementation is SmtpTlsMailer, which sends the crawler report from the SMTP-TLS mail server. You can customize the logic for emailing other types of objects from other types of servers.