50 Coding the Crawler Configuration File

This chapter contains information about the BaseConfigurator class, about implementing its methods and interfaces to control a crawler's site capture process, and about sample code that is available in the Site Capture installation for the FirstSiteII crawler.

50.1 Overview of Controlling a Crawler

Controlling a crawler requires coding its CrawlerConfigurator.groovy file with at least the following information: the starting URI and link extraction logic. Both pieces of information are supplied through the getStartUri() and createLinkExtractor() methods. Additional code can be added as necessary to specify, for example, the number of links to be crawled, the crawl depth, and the invocation of a post-crawl event such as copying statically downloaded files to a web server's doc base.

The methods and interfaces you will use are provided in the BaseConfigurator class. The default implementations can be overridden to customize and control a crawl process in a way that agrees with the structure of the target site and the data you need to collect.

This chapter begins with the BaseConfigurator methods (see Section 50.2, "BaseConfigurator Methods") and a simple CrawlerConfigurator.groovy file (see Section 50.2.2, "createLinkExtractor.") to demonstrate usage of the required methods. Crawler customization methods are then discussed and followed by information about Site Capture's Java interfaces, including their default and custom implementations.

50.2 BaseConfigurator Methods

The CrawlerConfigurator.groovy file contains the code of the CrawlerConfigurator class. This class must extend BaseConfigurator, which is an abstract class that provides default implementations for the crawler. Methods and interfaces of the BaseConfigurator class are listed in Table 50-1 and described in the sections that follow. A basic sample CrawlerConfigurator.groovy file is shown in Section 50.2.2, "createLinkExtractor."

Table 50-1 Methods in the BaseConfigurator Class

Method Type            Method                                      Notes
Required               Section 50.2.1, "getStartUri"               N/A
Required               Section 50.2.2, "createLinkExtractor"       Factory method in the Section 50.12.1, "LinkExtractor" interface (see Footnotes 1 and 2)
Crawler Customization  Section 50.3.1, "getMaxLinks"               N/A
Crawler Customization  Section 50.3.2, "getMaxCrawlDepth"          N/A
Crawler Customization  Section 50.3.3, "getConnectionTimeout"      N/A
Crawler Customization  Section 50.4, "getSocketTimeout"            N/A
Crawler Customization  Section 50.5, "getPostExecutionCommand"     N/A
Crawler Customization  Section 50.6, "getNumWorkers"               N/A
Crawler Customization  Section 50.7, "getUserAgent"                N/A
Crawler Customization  Section 50.8, "createResourceRewriter"      Factory method in the Section 50.12.2, "ResourceRewriter" interface (see Footnotes 1 and 2)

Footnote 1: The listed interfaces have default implementations, described in this chapter.

Footnote 2: Site Capture provides a sample link extractor and resource rewriter, both used by the FirstSiteII sample crawler. See Section 50.12.1.3, "Writing and Deploying a Custom Link Extractor," and Section 50.12.2.3, "Writing a Custom ResourceRewriter."

Two abstract methods in BaseConfigurator must be overridden in CrawlerConfigurator. They are getStartUri() and createLinkExtractor().

50.2.1 getStartUri

This method is used to inject the crawler's start URI. You can configure one or more start URIs for the crawl, as long as the URIs belong to the same site. If you specify multiple starting points, the crawls will start in parallel.

Example

To provide the start URI for the www.example.com site:

/**
 * The method is used to configure the site url which needs to be crawled.
 */
public String[] getStartUri()
{
return ["http://www.example.com/home"]; //Groovy uses brackets for an array.
}

Example

To provide multiple start URIs for the site, enter a comma-separated array:

/**
 * The method is used to configure the site url which needs to be crawled.
 */
public String[] getStartUri()
{
return ["http://www.example.com/product","http://www.example.com/support"]; //Groovy uses brackets for an array.
}

50.2.2 createLinkExtractor

This method is used to configure the logic for extracting links from the crawled pages. The extracted links are then traversed. The createLinkExtractor method is a factory method in the LinkExtractor interface:

  • You can implement the LinkExtractor interface to create your own link extraction algorithm, for example, using an HTML parser to parse the pages and extract links for the crawler to consume.

  • You can also use the default implementation, PatternLinkExtractor, which uses regular expressions for extracting links. For example, PatternLinkExtractor can be used to extract links of the format /home/product from expressions such as <a href="/home/product">Products</a>, as shown in the sample code in "Basic Configuration File".

    Example

    To use a regular expression for extracting links from <a href="/home/product">Products</a> on the www.example.com site:

    /**
     * The method is used to define the link extraction
     * algorithm from the crawled pages.
     * PatternLinkExtractor is a regex based extractor
     * which parses the links on the web page
     * based on the pattern configured inside the constructor.
     */
    public LinkExtractor createLinkExtractor()
    {
    return new PatternLinkExtractor("['\"\\(](/[^\\s<'\"\\)]*)",1);
    }
    
  • For more information about regular expressions and PatternLinkExtractor, see Section 50.12.1.2, "Using the Default Implementation of LinkExtractor."

  • For more information about implementing the LinkExtractor interface, see Section 50.12.1.3, "Writing and Deploying a Custom Link Extractor."

Basic Configuration File

Below is an example of a simple CrawlerConfigurator.groovy file in which the required methods, getStartUri() and createLinkExtractor(), are overridden.

Note:

In the sample below, we override an additional method, getMaxLinks(), which is discussed in Section 50.3.1, "getMaxLinks." In this sample, it is set to return 150 so that the test run completes quickly.

The CrawlerConfigurator.groovy file is used for dependency injection; its name must not be changed.

package com.fatwire.crawler.sample

import java.text.DateFormat;
import java.text.SimpleDateFormat;

import java.util.regex.Pattern;

import javax.mail.internet.AddressException;
import javax.mail.internet.InternetAddress;

import com.fatwire.crawler.*;
import com.fatwire.crawler.remote.*;
import com.fatwire.crawler.remote.di.*;
import com.fatwire.crawler.impl.*;
import com.fatwire.crawler.util.FileBuilder;

import org.apache.commons.lang.SystemUtils;
import org.apache.http.HttpHost;
import org.apache.http.auth.*; 
import org.apache.http.client.*; 
import org.apache.http.impl.client.*;
/**
 * Configurator for the crawler.
 * This is used to inject the dependency inside the crawler 
 * to control the crawling process
 */

public class CrawlerConfigurator extends BaseConfigurator {

public CrawlerConfigurator(GlobalConfigurator delegate){
super(delegate);
}

/** 
 * The method is used to configure the site url which needs to be crawled.
 */
public String[] getStartUri() {
return ["http://www.fatwire.com/home"]; //Groovy uses brackets for an array.
}

/**
 * The method is used to define the link extraction algorithm 
 * from the crawled pages.
 * PatternLinkExtractor is a regex based extractor which parses 
 * the links on the web page
 * based on the pattern configured inside the constructor.
 */
public LinkExtractor createLinkExtractor() {
return new PatternLinkExtractor("['\"\\(](/[^\\s<'\"\\)]*)",1);
}

/**
 * The method is used to control the maximum number of links 
 * to be crawled as part of this crawl session.
 */
public int getMaxLinks()
{
return 150;
}
}

50.3 Crawler Customization Methods

In addition to the required methods, the BaseConfigurator class has methods with default implementations that you may want to override to customize the crawl process in a way that agrees with the structure of the target site and the data you need to collect.

50.3.1 getMaxLinks

This method is used to control the number of links to be crawled. The number of links should be a positive integer. Otherwise, the crawl will scan all the links in the same domain that are reachable from the start URI(s).

Example

To specify crawling 500 links:

/**
 * default: -1; crawler will crawl over all the links reachable from the start URI 
 * @return the maximum number of links to download.
 */
public int getMaxLinks()
{
return 500;
}

50.3.2 getMaxCrawlDepth

This method is used to control the maximum depth to which a site will be crawled. Links beyond the specified depth are ignored. The depth of the starting page is 0.

Example

To limit the crawl to a maximum depth of 4:
/**
 * default: -1. Indicates infinite depth for a site.
 * @return the maximum depth to which we need to crawl the links.
 */
public int getMaxCrawlDepth()
{
return 4;
}

50.3.3 getConnectionTimeout

This method determines how long the crawler will wait to establish a connection to its target site. If a connection is not established within the specified time, the crawler will ignore the link and continue to the next link.

Example

To set a connection timeout of 50,000 milliseconds:

/**
 * default: 30000 ms
 * @return Connection timeout in milliseconds.
 */
public int getConnectionTimeout()
{
return 50000; // in milliseconds
}

50.4 getSocketTimeout

This method controls the socket timeout of the requests that the crawler makes for each link to be crawled.

Example

To provide a socket timeout of 30,000 milliseconds:

/**
 * default: 20000 ms
 * @return Socket timeout in milliseconds.
 */
public int getSocketTimeout()
{
return 30000; // in milliseconds
}

50.5 getPostExecutionCommand

This method is used to inject custom post-crawl logic and is invoked when the crawler finishes its crawl session. The method must return the absolute path of the script or command, along with any parameters.

For example, the getPostExecutionCommand() can be used to automate deployment to a web server's doc base by invoking a batch or shell script to copy statically captured files after the crawl session ends.

Note:

  • The script or command should be present in the same location on all servers hosting Site Capture.

  • Avoid downloading large archive files (exceeding 250MB) from the Site Capture interface. Use getPostExecutionCommand to copy the files from the Site Capture file system to your preferred location. Archive size can be obtained from the crawler report, on the Job Details form.

Example

To execute a batch script named copy.bat on the Site Capture server:

/**
 * default: null.
 * @return the command string for post execution. 
 * Null if there is no such command.
 */
public String getPostExecutionCommand()
{
// The script is expected to be in the C:\commands folder
// on the computer where the Site Capture server is running.
return "C:\\commands\\copy.bat";
}
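If the script takes arguments, for example the location of the downloaded files and the web server's doc base, they can be appended to the returned command string. The sketch below is an illustration only; the script name and both paths are hypothetical and must be replaced with the locations used in your installation:

/**
 * default: null.
 * @return the command string for post execution, including its parameters.
 * Null if there is no such command.
 */
public String getPostExecutionCommand()
{
// Hypothetical deployment script and paths; substitute your own locations.
return "/opt/scripts/deploy.sh /opt/sitecapture/downloads /var/www/docbase";
}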

50.6 getNumWorkers

This method controls the number of worker threads used for the crawl process. The ideal number of parallel threads to be spawned for the crawl session depends on the architecture of the computer on which Site Capture is hosted.

Example

To start 10 worker threads for a crawl process:

/**
 * default: 4.
 * @return the number of workers to start.
 * Workers will concurrently download resources.
 */
public int getNumWorkers()
{
// Start 10 worker threads to take part in the crawl process.
return 10;
}

50.7 getUserAgent

This method is used to configure the user agent that the crawler presents when it traverses the site. Use this method when the site must be captured as it is rendered for a particular client rather than for a desktop browser (for example, when the site is rendered for a mobile device).

Example

To configure the Firefox 3.6.17 user agent:

/**
 * default: publish-crawler/1.1 (http://www.fatwire.com)
 * @return the user agent identifier
 */
public String getUserAgent()
{
return "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US;rv:1.9.2.17) Gecko/20110420 Firefox/3.6.17 ";
}
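For example, to capture the site as it is rendered for a mobile device, return a mobile browser's user agent string. The string below is a representative iPhone Safari identifier, shown only as an illustration; substitute the exact string reported by the device or browser you need to emulate:

/**
 * default: publish-crawler/1.1 (http://www.fatwire.com)
 * @return the user agent identifier
 */
public String getUserAgent()
{
// Representative mobile user agent string; replace it with the one your target device reports.
return "Mozilla/5.0 (iPhone; CPU iPhone OS 14_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Mobile/15E148 Safari/604.1";
}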

50.8 createResourceRewriter

This method is used to rewrite URLs inside the HTML pages that are crawled. For example, you may want to rewrite URLs to enable static delivery of a dynamic WebCenter Sites website.

The createResourceRewriter method is a factory method in the ResourceRewriter interface:

  • You can implement the ResourceRewriter interface to convert dynamic URLs to static URLs, convert absolute URLs to relative URLs, and so on.

  • You can also use the following default implementations:

    • NullResourceRewriter: Does not rewrite any of the URLs.

    • PatternResourceRewriter: Searches for a regular pattern and rewrites as specified.

Example

To use PatternResourceRewriter to rewrite URLs such as http://www.site.com/home.html to /home.html:

/**
 * Factory method for a ResourceRewriter.
 * default: new NullResourceRewriter();
 * @return the ResourceRewriter that modifies the HTML before it is saved to disk.
 */
public ResourceRewriter createResourceRewriter()
{
return new PatternResourceRewriter("http://www.site.com/([^\\s'\"]*)", '/$1');
}

50.9 createMailer

This method provides the implementation for sending email at the end of the crawl. The createMailer method is a factory method in the Mailer interface:

  • Site Capture comes with an SMTP over TLS implementation, which emails the crawler report when a static or archive capture session ends (the crawler report is the report.txt file, described in the Oracle Fusion Middleware WebCenter Sites Administrator's Guide).

  • If you are using a mail server other than SMTP-TLS (such as SMTP without authentication, or POP3), you will have to provide your own implementation.

Example

To send no email:

/**
 * Factory method for a Mailer.
 * <p/>
 * default: new NullMailer().
 * @return mailer holding configuration to send an email at the end of the crawl.
 * Should not be null.
 */
public Mailer createMailer()
{
return new NullMailer();
}

50.10 getProxyHost

This method must be overridden if the site being crawled is behind a proxy server. You can configure the proxy server in this method.

Note:

If you use getProxyHost, also use getProxyCredentials, described in Section 50.11, "getProxyCredentials."

Example

To configure a proxy server:

/**
 * default: null.
 * @return the host for the proxy,
 * null when there is no proxy needed
 */
public HttpHost getProxyHost()
{
// Using the HttpClient library, return an HttpHost.
return new HttpHost("www.myproxyserver.com", 883);
}

50.11 getProxyCredentials

This method is used to inject credentials for the proxy server that is configured in the getProxyHost method (see Section 50.10, "getProxyHost").

Example

To authenticate a proxy server user named sampleuser:

/**
 * default: null.
 * example: new UsernamePasswordCredentials(username, password);
 * @return user credentials for the proxy.
 */
public Credentials getProxyCredentials()
{
// Using the HttpClient library, return a Credentials object.
return new UsernamePasswordCredentials("sampleuser", "samplepassword");
}

50.12 Interfaces

Site Capture provides the following interfaces with default implementations:

50.12.1 LinkExtractor

A link extractor is used to specify which links will be traversed by Site Capture in a crawl session. The implementation is injected through the CrawlerConfigurator.groovy file. The implementation is called by the Site Capture framework during the crawl session to extract links from the markup that is downloaded as part of the crawl session.

Site Capture comes with one implementation of LinkExtractor. You can also write and deploy your own custom link extraction logic. For more information, see the following sections:

50.12.1.1 LinkExtractor Interface

This interface has only one method (extract) that needs to be implemented to provide the algorithm for extracting links from downloaded markup.

package com.fatwire.crawler;
import java.util.List;
import com.fatwire.crawler.url.ResourceURL;

/**
 * Extracts the links out of a WebResource.
 */

public interface LinkExtractor
{
/**
 * Parses the WebResource and finds a list of links (if possible).
 * @param resource the WebResource to inspect.
 * @return a list of links found inside the WebResource.
 */
List<ResourceURL> extract(final WebResource resource);
}

50.12.1.2 Using the Default Implementation of LinkExtractor

PatternLinkExtractor is the default implementation for the LinkExtractor interface. PatternLinkExtractor extracts links on the basis of a regular expression. It takes a regular expression as input and returns only links matching that regular expression.

Two common usage scenarios are described below: using PatternLinkExtractor for sites with dynamic URLs, and using it for sites with static URLs.

  • Using PatternLinkExtractor for sites with dynamic URLs:

    For example, on www.example.com, the links follow patterns such as /home, /support, and /cs/Satellite/. To extract and traverse such links, we use PatternLinkExtractor in the following way:

    /**
     * The method is used to define the link extraction algorithm 
     * from the crawled pages.
     * PatternLinkExtractor is a regex based extractor which parses 
     * the links on the web page
     * based on the pattern configured inside the constructor.
     */
    public LinkExtractor createLinkExtractor()
    {
    return new PatternLinkExtractor("['\"\\(](/[^\\s<'\"\\)]*)",1);
    }
    

    The pattern ['\"\\(](/[^\\s<'\"\\)]*) extracts links that

    • start immediately after any one of the following characters:

      • single quote ( ' )

      • double quote ( " )

      • left parenthesis ( ( )

    • begin with a forward slash ( / ),

    • and continue until any one of the following characters is reached:

      • whitespace ( \s )

      • less-than symbol ( < )

      • single quote ( ' )

      • double quote ( " )

      • right parenthesis ( ) )

    Let's consider the URL inside the following markup:

    <a href='/home'>Click Me</a>

    We are interested only in extracting the /home link. This link matches the pattern because it is enclosed in single quotes and begins with a forward slash. Because group 1 is specified, the extractor returns /home without the surrounding quotes (a standalone regex sketch at the end of this section illustrates this).

  • Using PatternLinkExtractor for sites with static URLs:

    For example, the markup for www.example.com has links such as:

    <a href="http://www.example.com/home/index.html">Click Me</a>

    To extract and traverse such types of links, we can use PatternLinkExtractor in the following way:

    /**
     * The method is used to define the link extraction algorithm 
     * from the crawled pages.
     * PatternLinkExtractor is a regex based extractor which parses 
     * the links on the web page
     * based on the pattern configured inside the constructor.
     */
    public LinkExtractor createLinkExtractor()
    {
    return new PatternLinkExtractor(Pattern.compile("http://www.example.com/[^\\s<'\"]*"));
    }
    

    The above example instructs the crawler to extract links that start with http://www.example.com and continue until any one of the following characters is reached: whitespace (\s), a less-than symbol (<), a single quote ('), or a double quote (").

    Note:

    For more details on groups and patterns, refer to the Java documentation for the Pattern and Matcher classes.
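To see how the pattern and the group index work together, the following standalone sketch uses plain java.util.regex code (independent of Site Capture; the markup string is a made-up example) to print the captured link:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Markup fragment (made-up example) containing a site-relative link.
String markup = "<a href='/home'>Click Me</a>";

// The same pattern used by the PatternLinkExtractor examples above.
Pattern pattern = Pattern.compile("['\"\\(](/[^\\s<'\"\\)]*)");
Matcher matcher = pattern.matcher(markup);

while (matcher.find())
{
// group(0) is the full match, including the leading delimiter: '/home
// group(1) is the captured link only: /home
System.out.println(matcher.group(1));
}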

50.12.1.3 Writing and Deploying a Custom Link Extractor

Note:

Site Capture provides a sample link extractor (and resource rewriter) used by the FirstSiteII sample crawler to download WebCenter Sites' FirstSiteII dynamic website as a static site. For more information, see the source code for the FSIILinkExtractor class in the following folder:

<SC_INSTALL_DIR>/fw-site-capture/crawler/_sample/FirstSiteII/src

To write a custom link extractor

  1. Create a project in your favorite Java IDE.

  2. Copy the file fw-crawler-core.jar from the <SC_INSTALL_DIR>/fw-site-capture/webapps/ROOT/WEB-INF/lib folder to your project's build path.

  3. Implement the LinkExtractor interface to provide the implementation for the extract() method. (The LinkExtractor interface is shown in Section 50.12.1.1, "LinkExtractor Interface.")

    Below is pseudo-code showing a custom implementation:

    package com.custom.crawler;
    import java.util.List;
    import com.fatwire.crawler.url.ResourceURL;
    import com.fatwire.crawler.LinkExtractor;
    import com.fatwire.crawler.WebResource;
    /**
     * Extracts the links out of a WebResource.
     */
    public class CustomLinkExtractor implements LinkExtractor
    {
    /**
     * A sample constructor for CustomLinkExtractor 
     */
    public CustomLinkExtractor(String .......)
      {
      // Initialize if there are private members.
      // User's custom logic
      }
    
    /**
     * Parses the WebResource and finds a list of links (if possible).
     * @param resource the WebResource to inspect.
     * @return a list of links found inside the WebResource.
     */
    public List<ResourceURL> extract(final WebResource resource)
      {
      // Your custom code for extraction Algorithm.
      }
    }
    
  4. Create a jar file for your custom implementation and copy it to the following folder:

    <SC_INSTALL_DIR>/fw-site-capture/webapps/ROOT/WEB-INF/lib

  5. Restart the Site Capture application server.

  6. Inject the dependency by coding the CrawlerConfigurator.groovy file to include the custom link extractor class (CustomLinkExtractor, in this example):

    /**
     * User's custom link extractor mechanism to extract the links from the
     * web resource downloaded as part of the crawl session.
     * The code below is only a pseudo code for an example. 
     * User is free to implement their own custom constructor
     * as shown in the next example.
     */
    public LinkExtractor createLinkExtractor()
    {
    return new CustomLinkExtractor("Custom Logic For Your Constructor");
    }
    

50.12.2 ResourceRewriter

A resource rewriter is used to rewrite URLs inside the markup that is downloaded during the crawl session. The implementation must be injected through the CrawlerConfigurator.groovy file.

Some use cases that will require a resource rewriter are:

  • Crawling a dynamic site and creating a static copy. For example, the FirstSiteII sample site has dynamic links. Converting FirstSiteII into a static site requires rewriting the URLs inside the downloaded markup.

  • Converting absolute URLs to relative URLs. For example, if the markup has URLs such as http://www.example.com/abc.html, the crawler should remove http://www.example.com from the URL, thus allowing resources to be served from the host on which the downloaded files are stored.

Site Capture comes with two implementations of ResourceRewriter. You can also create custom implementations. For more information, see the following sections:

50.12.2.1 ResourceRewriter Interface

The rewrite method is used to rewrite URLs inside the markup that is downloaded during the crawl session.

package com.fatwire.crawler;
import java.io.IOException;

/**
 * Service for rewriting a resource. The crawler will use the implementation for
 * rewrite method to rewrite the resources that are downloaded as part of crawl
 * session.
 */
public interface ResourceRewriter
{
/**
 * @param resource
 * @return the bytes after the rewrite.
 * @throws IOException
 */
byte[] rewrite(WebResource resource) throws IOException;

}

50.12.2.2 Using the Default Implementations of ResourceRewriter

Site Capture comes with the following implementations of ResourceRewriter:

  • NullResourceRewriter, configured by default to skip the rewriting of links. If ResourceRewriter is not configured in the CrawlerConfigurator.groovy file, NullResourceRewriter is injected by default.

  • PatternResourceRewriter, used to rewrite URLs based on a regular expression. PatternResourceRewriter takes as input a regular expression to match links inside the markup and replaces those links with the string provided in the constructor.

    Example

    To rewrite an absolute URL as a relative URL:

    From:

    <a href="http://www.example.com/about/index.html">Click Me</a>

    To:

    <a href="/about/index.html">Click Me</a>

    /**
     * Factory method for a ResourceRewriter.
     * default: new NullResourceRewriter();
     * @return the ResourceRewriter that modifies the HTML before it is saved to disk.
     */
    public ResourceRewriter createResourceRewriter()
    {
    return new PatternResourceRewriter("http://www.example.com/([^\\s'\"]*)", '/$1');
    }
    

    PatternResourceRewriter has only one constructor that takes a regular expression and a string replacement:

    PatternResourceRewriter(final String regex, final String replacement)
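    To see how the $1 backreference in the replacement string behaves, the following standalone sketch applies the same pattern and replacement with plain java.util.regex code (independent of PatternResourceRewriter's internals; the markup string is a made-up example):

    // Markup fragment (made-up example) containing an absolute URL.
    String markup = '<a href="http://www.example.com/about/index.html">Click Me</a>';

    // The same regular expression and replacement used in the example above;
    // $1 refers to the text captured by the group ([^\s'"]*).
    String rewritten = markup.replaceAll("http://www.example.com/([^\\s'\"]*)", '/$1');

    // Prints: <a href="/about/index.html">Click Me</a>
    System.out.println(rewritten);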

50.12.2.3 Writing a Custom ResourceRewriter

Note:

Site Capture provides a sample resource rewriter (and link extractor) used by the FirstSiteII sample crawler to download WebCenter Sites' FirstSiteII dynamic website as a static site. For more information, see the source code for the FSIILinkExtractor class in the following folder:

<SC_INSTALL_DIR>/fw-site-capture/crawler/_sample/FirstSiteII/src

To write a custom resource rewriter

  1. Create a project inside your favorite IDE.

  2. Copy the file fw-crawler-core.jar from the <SC_INSTALL_DIR>/fw-site-capture/webapps/ROOT/WEB-INF/lib folder to your project's build path.

  3. Implement the ResourceRewriter interface to provide the implementation for the rewrite method. (The ResourceRewriter interface is shown in Section 50.12.2.1, "ResourceRewriter Interface.")

    Below is pseudo-code showing a custom implementation:

    package com.custom.crawler;
    import java.io.IOException;
    import com.fatwire.crawler.WebResource;
    import com.fatwire.crawler.ResourceRewriter;
    
    /**
     * Rewrite the links inside the markup downloaded as part of
     * crawl session.
     */
    public class CustomResourceRewriter implements ResourceRewriter
    {
    /**
     * A sample constructor for CustomResourceRewriter
     */
    public CustomResourceRewriter(String .......)
      {
      // Initialize if there are private members.
      // User's custom logic
      }
    
    /**
     * @param resource
     * @return the bytes after the rewrite.
     * @throws IOException
     */
    public byte[] rewrite(WebResource resource) throws IOException
      {
      // Your custom code for re-writing Algorithm.
      }
    }
    
  4. Create a jar file for your custom implementation and copy it to the following folder: <SC_INSTALL_DIR>/fw-site-capture/webapps/ROOT/WEB-INF/lib

  5. Restart your Site Capture application server.

  6. Inject the dependency by coding the CrawlerConfigurator.groovy file to include the custom resource rewriter class (CustomResourceRewriter, in this example):

    /*
     * User's custom resource rewriting mechanism to rewrite the links from the
     * web resource downloaded as part of the crawl session.
     *
     * The code below is only a pseudo code for an example. 
     * User is free to implement their own custom constructor 
     * as shown in the next example.
     */
    public ResourceRewriter createResourceRewriter()
      {
      new CustomResourceRewriter("User's custom logic to initialize the things");
      }
    

50.12.3 Mailer

A mailer is used to send email after the crawl ends. The implementation must be injected through the CrawlerConfigurator.groovy file.

Site Capture provides an SmtpTlsMailer implementation, which can be used to send the crawler report through an SMTP-TLS mail server. You can also implement the Mailer interface to provide custom logic for sending emails from a server other than SMTP-TLS (such as SMTP without authentication, or POP3). Your custom logic can also email content other than the crawler report. If Mailer is not configured in the CrawlerConfigurator.groovy file, NullMailer is injected by default. For more information, see the following topics:

50.12.3.1 Mailer Interface

The sendMail method is automatically called if the Mailer is configured in the CrawlerConfigurator.groovy file.

package com.fatwire.crawler;
import java.io.IOException;
import javax.mail.MessagingException;

/**
 * Service to send an email.
 */
public interface Mailer
{
/**
 * Sends the mail.
 *
 * @param subject
 * @param report
 * @throws MessagingException
 * @throws IOException
 */
void sendMail(String subject, String report)
throws MessagingException, IOException;
}

50.12.3.2 Using the Default Implementation of Mailer

Site Capture provides an SMTP-TLS server-based email implementation that sends out the crawler report when a static or archive crawl session ends (the crawler report is the report.txt file, described in the Oracle Fusion Middleware WebCenter Sites Administrator's Guide). You can use the default mailer by injecting it through the CrawlerConfigurator.groovy file as shown below:

/**
 * Factory method for a Mailer.
 * <p/>
 * default: new NullMailer().
 * @return mailer holding configuration to send an email 
 * at the end of the crawl.
 * Should not be null.
 */
public Mailer createMailer()
{
 try
  {
  // Create an SmtpTlsMailer object.
  SmtpTlsMailer mailer = new SmtpTlsMailer();

  // Create an internet address from which the mail should be sent.
  InternetAddress from = new InternetAddress("example@example.com");

  // Set the sender address inside the mailer object.
  mailer.setFrom(from);

  // Set the email address of the recipient.
  mailer.setTo(InternetAddress.parse("example@example.com"));

  // Set the email server host to be used for email.
  // The email server should be SMTP-TLS enabled.
  mailer.setHost("smtp.gmail.com", 587);

  // Set the credentials of the mail account.
  mailer.setCredentials("example@example.com", "examplepassword");

  return mailer;
  }
 catch (AddressException e)
  {
  log.error(e.getMessage());
  }
// Fall back to sending no email if the address is invalid.
return new NullMailer();
}

50.12.3.3 Writing a Custom Mailer

To write a custom mailer

  1. Create a project inside your favorite IDE.

  2. Copy the file fw-crawler-core.jar from the <SC_INSTALL_DIR>/fw-site-capture/webapps/ROOT/WEB-INF/lib folder to your project's build path.

  3. Implement the Mailer interface with the sendMail method.

    Below is pseudo-code showing a custom implementation:

    package com.custom.crawler;
    import java.io.IOException;
    import javax.mail.MessagingException;
    import com.fatwire.crawler.Mailer;
    
    /**
     * Implements an interface to implement the logic for sending emails
     * when the crawl session has been completed.
     */
    public class CustomMailer implements Mailer
    {
    /**
     * A sample constructor for CustomMailer
     */
    public CustomMailer()
      {
      // Initialize if there are private members.
      // User's custom logic
      }
    
    /**
     * Sends the mail.
     *
     * @param subject
     * @param report
     * @throws MessagingException
     * @throws IOException
     */
    public void sendMail(String subject, String report)
    throws MessagingException, IOException
      {
      // User's custom logic to send the emails.
      }
    }
    
  4. Create a jar file for your custom implementation and copy it to the following folder: <SC_INSTALL_DIR>/fw-site-capture/webapps/ROOT/WEB-INF/lib

  5. Restart your Site Capture application server.

  6. Inject the dependency by coding the CrawlerConfigurator.groovy file to include the custom mailer class (CustomMailer, in this example):

    /**
     * Factory method for a Mailer.
     * <p/>
     * default: new NullMailer().
     * @return mailer holding configuration to send an email 
     * at the end of the crawl.
     * Should not be null.
     */
    public Mailer createMailer()
    {
    CustomMailer mailer = new CustomMailer();
    // Do any required initialization.
    return mailer;
    }

The implementation above emails the crawler report (the report.txt file, described in the Oracle Fusion Middleware WebCenter Sites Administrator's Guide), because by default the report argument passed to sendMail contains the crawler report. You can customize the logic to email objects other than the crawler report.

50.13 Summary of Methods and Interfaces

This chapter discussed methods and interfaces in Site Capture's BaseConfigurator class for controlling a crawler's site capture process. This section summarizes the methods as well as the interfaces and their default implementations.

50.13.2 Interfaces

The following interfaces are used in Site Capture's BaseConfigurator:

  • Section 50.12.1, "LinkExtractor"

    Its default implementation is PatternLinkExtractor, which extracts links on the basis of a regular expression.

    Site Capture also provides a sample link extractor (and a sample resource rewriter), used by the FirstSiteII sample crawler to download WebCenter Sites' FirstSiteII dynamic website as a static site. Source code is available in the following folder: <SC_INSTALL_DIR>/fw-site-capture/crawler/_sample/FirstSiteII/src

    You can write and deploy your own custom link extraction logic.

  • Section 50.12.2, "ResourceRewriter"

    Its default implementations are NullResourceRewriter, which skips the rewriting of links, and PatternResourceRewriter, which rewrites URLs based on a regular expression.

    Site Capture provides a sample resource rewriter (and a sample link extractor), used by the FirstSiteII sample crawler to download WebCenter Sites' FirstSiteII dynamic website as a static site. Source code is available in the following folder: <SC_INSTALL_DIR>/fw-site-capture/crawler/_sample/FirstSiteII/src

    You can write and deploy your own logic for rewriting URLs.

  • Section 50.12.3, "Mailer"

    Its default implementation is SmtpTlsMailer, which sends the crawler report through an SMTP-TLS mail server. You can customize the logic for emailing other types of objects from other types of servers.