43 Coding the Crawler Configuration File
The BaseConfigurator class, its methods, and interfaces control a crawler's site capture process. Sample code is available in the Site Capture installation for the FirstSiteII crawler.
About Controlling a Crawler
To control a crawler, you code its CrawlerConfigurator.groovy file with, at minimum, the starting URI and link extraction logic. You supply this information through the getStartUri() and createLinkExtractor() methods. You can add code to specify, for example, the number of links to be crawled, the crawl depth, and the invocation of a post-crawl event such as copying statically downloaded files to a web server's doc base.
The methods and interfaces you use are provided in the BaseConfigurator class. The default implementations can be overridden to customize and control the crawl process in a way that agrees with the structure of the target site and the data you have to collect.

The BaseConfigurator methods and a simple CrawlerConfigurator.groovy file described in the topics that follow demonstrate the usage of the required methods. Crawler customization methods are then discussed, followed by information about Site Capture's Java interfaces, including their default and custom implementations.
BaseConfigurator Methods
The CrawlerConfigurator.groovy file contains the code of the CrawlerConfigurator class. This class must extend BaseConfigurator, an abstract class that provides default implementations for the crawler.
This table lists the methods and interfaces of the BaseConfigurator class:

Table 43-1 Methods in the BaseConfigurator Class

| Method Type | Method | Notes |
|---|---|---|
| Required | getStartUri | N/A |
| Required | createLinkExtractor | Factory method in the LinkExtractor interface. (Footnote 1, Footnote 2) |
| Crawler Customization | getMaxLinks | N/A |
| Crawler Customization | getMaxCrawlDepth | N/A |
| Crawler Customization | getConnectionTimeout | N/A |
| Crawler Customization | getSocketTimeout | N/A |
| Crawler Customization | getPostExecutionCommand | N/A |
| Crawler Customization | getNumWorkers | N/A |
| Crawler Customization | getUserAgent | N/A |
| Crawler Customization | createResourceRewriter | Factory method in the ResourceRewriter interface. (Footnote 1, Footnote 2) |
Footnote 1: The listed interfaces have default implementations, described in this chapter.

Footnote 2: Site Capture provides a sample link extractor and resource rewriter, both used by the FirstSiteII sample crawler. See Writing and Deploying a Custom Link Extractor and Writing a Custom ResourceRewriter.
getStartUri
This method injects the crawler's start URI. You can configure one or more start URIs for the crawl, as long as the URIs belong to the same site. Multiple start URIs enable the crawl to begin at several points in parallel.
To provide the start URI for the www.example.com site:

```groovy
/**
 * The method is used to configure the site url which needs to be crawled.
 */
public String[] getStartUri() {
    return ["http://www.example.com/home"]; // Groovy uses brackets for an array.
}
```
To provide multiple start URIs for the site, enter a comma-separated array:

```groovy
/**
 * The method is used to configure the site url which needs to be crawled.
 */
public String[] getStartUri() {
    return ["http://www.example.com/product", "http://www.example.com/support"]; // Groovy uses brackets for an array.
}
```
createLinkExtractor
This method configures the logic for extracting links from the crawled pages. The extracted links are then traversed. A basic sample CrawlerConfigurator.groovy file that uses this method is shown in Basic Configuration File.
Two abstract methods in BaseConfigurator must be overridden in CrawlerConfigurator: getStartUri() and createLinkExtractor(). The createLinkExtractor method is a factory method in the LinkExtractor interface:
- Implement the LinkExtractor interface to create your own link extraction algorithm, for example, using an HTML parser to parse the pages and extract links for the crawler to consume.
- To extract links, use the default implementation, PatternLinkExtractor, which uses regular expressions. For example, PatternLinkExtractor can be used to extract links of the format /home/products from expressions such as <a href="/home/product">Products</a>.

  To use a regular expression for extracting links from <a href="/home/product">Products</a> on the www.example.com site:

  ```groovy
  /**
   * The method is used to define the link extraction
   * algorithm from the crawled pages.
   * PatternLinkExtractor is a regex based extractor
   * which parses the links on the web page
   * based on the pattern configured inside the constructor.
   */
  public LinkExtractor createLinkExtractor() {
      return new PatternLinkExtractor("['\"\\(](/[^\\s<'\"\\)]*)", 1);
  }
  ```
- For more information about regular expressions and PatternLinkExtractor, see Using the Default Implementation of LinkExtractor.
- For more information about implementing the LinkExtractor interface, see Writing and Deploying a Custom Link Extractor.
Basic Configuration File
In this example of a simple CrawlerConfigurator.groovy file, the required methods, getStartUri() and createLinkExtractor(), are overridden.

We also override an additional method, getMaxLinks(), setting it to return 150 so that the test run completes quickly.

The CrawlerConfigurator.groovy file is used for dependency injection; hence, its name must not be changed.
```groovy
package com.fatwire.crawler.sample

import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.regex.Pattern;

import javax.mail.internet.AddressException;
import javax.mail.internet.InternetAddress;

import com.fatwire.crawler.*;
import com.fatwire.crawler.remote.*;
import com.fatwire.crawler.remote.di.*;
import com.fatwire.crawler.impl.*;
import com.fatwire.crawler.util.FileBuilder;

import org.apache.commons.lang.SystemUtils;
import org.apache.http.HttpHost;
import org.apache.http.auth.*;
import org.apache.http.client.*;
import org.apache.http.impl.client.*;

/**
 * Configurator for the crawler.
 * This is used to inject the dependency inside the crawler
 * to control the crawling process.
 */
public class CrawlerConfigurator extends BaseConfigurator {

    public CrawlerConfigurator(GlobalConfigurator delegate) {
        super(delegate);
    }

    /**
     * The method is used to configure the site url which needs to be crawled.
     */
    public String[] getStartUri() {
        return ["http://www.fatwire.com/home"]; // Groovy uses brackets for an array.
    }

    /**
     * The method is used to define the link extraction algorithm
     * from the crawled pages.
     * PatternLinkExtractor is a regex based extractor which parses
     * the links on the web page
     * based on the pattern configured inside the constructor.
     */
    public LinkExtractor createLinkExtractor() {
        return new PatternLinkExtractor("['\"\\(](/[^\\s<'\"\\)]*)", 1);
    }

    /**
     * The method is used to control the maximum number of links
     * to be crawled as part of this crawl session.
     */
    public int getMaxLinks() {
        150;
    }
}
```
Crawler Customization Methods
In addition to the required methods, the BaseConfigurator class has methods with default implementations. You may want to override these methods to customize the crawl process in a way that agrees with the structure of the target site and the data you have to collect.
getMaxLinks
This method controls the number of links to be crawled. The number of links should be a positive integer; otherwise, the crawl scans all the links in the same domain that are reachable from the start URIs.
To specify crawling 500 links:

```groovy
/**
 * default: -1; crawler will crawl over all the links reachable from the start URI
 * @return the maximum number of links to download.
 */
public int getMaxLinks() {
    return 500;
}
```
getMaxCrawlDepth
This method controls the maximum depth to which a site is crawled. Links beyond the specified depth are ignored. The depth of the starting page is 0.

To crawl to a maximum depth of 4:

```groovy
/**
 * default: -1. Indicates infinite depth for a site.
 * @return the maximum depth to which we need to crawl the links.
 */
public int getMaxCrawlDepth() {
    return 4;
}
```
getConnectionTimeout
This method determines how long the crawler will wait to establish a connection to its target site. If a connection is not established within the specified time, the crawler will ignore the link and continue to the next link.

To set a connection timeout of 50,000 milliseconds:

```groovy
/**
 * default: 30000 ms
 * @return Connection timeout in milliseconds.
 */
public int getConnectionTimeout() {
    return 50000; // in milliseconds
}
```
getSocketTimeout
This method controls the socket timeout of the requests that the crawler makes for the links to be crawled.

To set a socket timeout of 30,000 milliseconds:

```groovy
/**
 * default: 20000 ms
 * @return Socket timeout in milliseconds.
 */
public int getSocketTimeout() {
    return 30000; // in milliseconds
}
```
getPostExecutionCommand
This method injects custom post-crawl logic and is invoked when the crawler finishes its crawl session. It must return the absolute path of the script or command, with parameters if any.

For example, getPostExecutionCommand() can be used to automate deployment to a web server's doc base by invoking a batch or shell script that copies statically captured files after the crawl session ends.
Note:

- The script or command should be present in the same location on all servers hosting Site Capture.
- Avoid downloading large archive files (exceeding 250MB) from the Site Capture interface. Use getPostExecutionCommand to copy the files from the Site Capture file system to your preferred location. Archive size can be obtained from the crawler report, on the Job Details form.
To run a batch script named copy.bat on the Site Capture server:

```groovy
/**
 * default: null.
 * @return the command string for post execution.
 * Null if there is no such command.
 */
public String getPostExecutionCommand() {
    // The file is expected to be in the C:\commands folder
    // on the computer where the Site Capture server is running.
    return "C:\\commands\\copy.bat";
}
```
getNumWorkers
This method controls the number of worker threads used for the crawl process. The ideal number of parallel threads to spawn for the crawl session depends on the architecture of the computer hosting Site Capture.

To start 10 worker threads for a crawl process:

```groovy
/**
 * default: 4.
 * @return the number of workers to start.
 * Workers concurrently download resources.
 */
public int getNumWorkers() {
    // Start 10 worker threads for the crawl process.
    return 10;
}
```
getUserAgent
This method configures the user agent that the crawler uses when it traverses the site. Use this method when the site must be rendered differently than usual, for example, as it would be on a mobile device.

To configure the Firefox 3.6.17 user agent:

```groovy
/**
 * default: publish-crawler/1.1 (http://www.fatwire.com)
 * @return the user agent identifier
 */
public String getUserAgent() {
    return "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.17) Gecko/20110420 Firefox/3.6.17";
}
```
createResourceRewriter
This method rewrites URLs inside the HTML pages that are crawled. For example, you may want to rewrite the URLs to enable static delivery of a dynamic WebCenter Sites website.
The createResourceRewriter method is a factory method in the ResourceRewriter interface:
- Implement the ResourceRewriter interface to convert dynamic URLs to static URLs, absolute URLs to relative URLs, and so on.
- You can also use the following default implementations:
  - NullResourceRewriter: Does not rewrite any of the URLs.
  - PatternResourceRewriter: Searches for a regular pattern and rewrites as specified.

  To use PatternResourceRewriter to rewrite URLs such as http://www.site.com/home.html to /home.html:

  ```groovy
  /**
   * Factory method for a ResourceRewriter.
   * default: new NullResourceRewriter();
   * @return the rewritten resource modifies the html before it is saved to disk.
   */
  public ResourceRewriter createResourceRewriter() {
      new PatternResourceRewriter("http://www.site.com/([^\\s'\"]*)", '/$1');
  }
  ```
- For more information about the default implementations, see Using the Default Implementations of ResourceRewriter.
- For more information about implementing the ResourceRewriter interface, see Writing a Custom ResourceRewriter.
createMailer
This method provides the implementation for sending email after the crawl. The createMailer method is a factory method in the Mailer interface.
- Site Capture comes with an SMTP over TLS implementation, which emails the crawler report when a static or archive capture session ends. (The crawler report is the report.txt file, described in Administering Oracle WebCenter Sites.)
- If you are using a mail server other than SMTP-TLS (such as SMTP without authentication, or POP3), you must provide your own implementation.

  To send no email:

  ```groovy
  /**
   * Factory method for a Mailer.
   * <p/>
   * default: new NullMailer().
   * @return mailer holding configuration to send an email at the end of the crawl.
   * Should not be null.
   */
  public Mailer createMailer() {
      return new NullMailer();
  }
  ```
- For more information about the default implementation, see Using the Default Implementation of Mailer.
- For more information about implementing the Mailer interface, see Writing a Custom Mailer.
getProxyHost
The getProxyHost method must be overridden if the site being crawled is behind a proxy server. You can configure the proxy server in this method.
Note: If you use getProxyHost, also use getProxyCredentials, described in getProxyCredentials.
To configure a proxy server:

```groovy
/**
 * default: null.
 * @return the host for the proxy,
 * null when there is no proxy needed
 */
public HttpHost getProxyHost() {
    // Using the HttpClient library, return an HttpHost.
    return new HttpHost("www.myproxyserver.com", 883);
}
```
getProxyCredentials
This method injects credentials for the proxy server configured in the getProxyHost method.
See getProxyHost.
To authenticate a proxy server user named sampleuser:

```groovy
/**
 * default: null.
 * example: new UsernamePasswordCredentials(username, password);
 * @return user credentials for the proxy.
 */
public Credentials getProxyCredentials() {
    // Using the HttpClient library, return Credentials.
    return new UsernamePasswordCredentials("sampleuser", "samplepassword");
}
```
Interfaces
Site Capture provides these interfaces with default implementations: LinkExtractor, ResourceRewriter, and Mailer. The topics that follow describe each interface and its implementations.
LinkExtractor
A link extractor specifies which links Site Capture traverses in a crawl session. The implementation is injected through the CrawlerConfigurator.groovy file and is called by the Site Capture framework during the crawl session to extract links from the downloaded markup.

Site Capture comes with one implementation of LinkExtractor. You can also write and deploy your own custom link extraction logic. For more information, see the following topics:
LinkExtractor Interface
This interface has only one method, extract(), which must be implemented to provide the algorithm for extracting links from downloaded markup.

```java
package com.fatwire.crawler;

import java.util.List;

import com.fatwire.crawler.url.ResourceURL;

/**
 * Extracts the links out of a WebResource.
 */
public interface LinkExtractor {
    /**
     * Parses the WebResource and finds a list of links (if possible).
     * @param resource the WebResource to inspect.
     * @return a list of links found inside the WebResource.
     */
    List<ResourceURL> extract(final WebResource resource);
}
```
Using the Default Implementation of LinkExtractor
PatternLinkExtractor is the default implementation of the LinkExtractor interface. It extracts links on the basis of a regular expression: it takes a regular expression as input and returns only the links matching that expression.

Common usage scenarios include using PatternLinkExtractor for sites with dynamic URLs and for sites with static URLs.
- Using PatternLinkExtractor for sites with dynamic URLs:

  For example, on www.example.com, the links have a pattern of /home, /support, and /cs/Satellite/. To extract and traverse such links, use PatternLinkExtractor in the following way:

  ```groovy
  /**
   * The method is used to define the link extraction algorithm
   * from the crawled pages.
   * PatternLinkExtractor is a regex based extractor which parses
   * the links on the web page
   * based on the pattern configured inside the constructor.
   */
  public LinkExtractor createLinkExtractor() {
      return new PatternLinkExtractor("['\"\\(](/[^\\s<'\"\\)]*)", 1);
  }
  ```
  The pattern `['\"\\(](/[^\\s<'\"\\)]*)` is used to extract links that:

  - start after any one of the following characters:
    - Single quote ( ' )
    - Double quote ( " )
    - Left parenthesis ( ( )
  - continue with a slash ( / ),
  - and end before the first occurrence of any one of the following characters:
    - Whitespace ( \s )
    - Less-than symbol ( < )
    - Single quote ( ' )
    - Double quote ( " )
    - Right parenthesis ( ) )

  Consider the URL inside the following markup:

  <a href='/home'>Click Me</a>

  We are interested only in extracting the /home link. This link matches the pattern because it is preceded by a single quote and delimited at the end by a single quote. The group index of 1 returns /home as the result.
- Using PatternLinkExtractor for sites with static URLs:

  For example, the markup for www.example.com has links such as:

  <a href="http://www.example.com/home/index.html">Click Me</a>

  To extract and traverse such types of links, use PatternLinkExtractor in the following way:

  ```groovy
  /**
   * The method is used to define the link extraction algorithm
   * from the crawled pages.
   * PatternLinkExtractor is a regex based extractor which parses
   * the links on the web page
   * based on the pattern configured inside the constructor.
   */
  public LinkExtractor createLinkExtractor() {
      return new PatternLinkExtractor(Pattern.compile("http://www.example.com/[^\\s<'\"]*"));
  }
  ```

  The above example instructs the crawler to extract links that start with http://www.example.com and end before the first occurrence of whitespace ( \s ), a less-than symbol ( < ), a single quote ( ' ), or a double quote ( " ).

  Note: For more details on groups and patterns, see the Java documentation for the Pattern and Matcher classes.
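To see exactly what these patterns match, you can exercise them directly with Java's standard Pattern and Matcher classes, outside of Site Capture. The following standalone sketch (not part of the Site Capture API) applies the dynamic-URL pattern with group 1 and the static-URL pattern with the full match to the sample markup shown above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternDemo {
    public static void main(String[] args) {
        // Dynamic-URL pattern: group 1 captures the path between the delimiters.
        Pattern dynamic = Pattern.compile("['\"\\(](/[^\\s<'\"\\)]*)");
        Matcher m = dynamic.matcher("<a href='/home'>Click Me</a>");
        while (m.find()) {
            System.out.println(m.group(1)); // prints: /home
        }

        // Static-URL pattern: the whole match (group 0) is the link itself.
        Pattern fixedHost = Pattern.compile("http://www.example.com/[^\\s<'\"]*");
        Matcher n = fixedHost.matcher(
                "<a href=\"http://www.example.com/home/index.html\">Click Me</a>");
        while (n.find()) {
            System.out.println(n.group()); // prints: http://www.example.com/home/index.html
        }
    }
}
```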
Writing and Deploying a Custom Link Extractor
Site Capture provides a sample link extractor (and resource rewriter) used by the FirstSiteII sample crawler to download the WebCenter Sites FirstSiteII dynamic website as a static site. For more information, see the source code for the FSIILinkExtractor class in the following folder:
<SC_INSTALL_DIR>/fw-site-capture/crawler/_sample/FirstSiteII/src
To write a custom link extractor, implement the LinkExtractor interface and deploy your class with the crawler, as in the sketch below.
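The following is a minimal sketch of such an implementation, not the FSIILinkExtractor source. It assumes that WebResource exposes the downloaded markup through a getContent() accessor and that ResourceURL can be constructed from the extracted link and its source resource; both are assumptions, so verify the actual signatures against the interface listing above and the sample source in your installation.

```java
package com.example.crawler; // hypothetical package for this sketch

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.fatwire.crawler.LinkExtractor;
import com.fatwire.crawler.WebResource;
import com.fatwire.crawler.url.ResourceURL;

/**
 * Minimal custom LinkExtractor sketch: collects href values from anchor tags.
 */
public class HrefLinkExtractor implements LinkExtractor {

    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*['\"]([^'\"]+)['\"]");

    public List<ResourceURL> extract(final WebResource resource) {
        List<ResourceURL> links = new ArrayList<ResourceURL>();
        // Assumption: WebResource exposes the downloaded markup as bytes.
        String markup = new String(resource.getContent());
        Matcher m = HREF.matcher(markup);
        while (m.find()) {
            // Assumption: ResourceURL can be built from the link and its source.
            links.add(new ResourceURL(m.group(1), resource));
        }
        return links;
    }
}
```

Compile the class against the Site Capture crawler library and deploy it with the crawler so that createLinkExtractor() in CrawlerConfigurator.groovy can return an instance of it.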
ResourceRewriter
A resource rewriter rewrites URLs inside the markup that is downloaded during the crawl session. The implementation must be injected through the CrawlerConfigurator.groovy file.
Some use cases that require a resource rewriter are:
- Crawling a dynamic site and creating a static copy.
- Converting absolute URLs to relative URLs. For example, if the markup has URLs such as http://www.example.com/abc.html, then the crawler should remove http://www.example.com from the URL, thus allowing resources to be served from the host on which the downloaded files are stored.
Site Capture comes with two implementations of ResourceRewriter. You can also create custom implementations. For more information, see the following sections:
ResourceRewriter Interface
The rewrite method rewrites URLs inside the markup that is downloaded during the crawl session.

```java
package com.fatwire.crawler;

import java.io.IOException;

/**
 * Service for rewriting a resource. The crawler will use the implementation of
 * the rewrite method to rewrite the resources that are downloaded as part of
 * the crawl session.
 */
public interface ResourceRewriter {
    /**
     * @param resource
     * @return the bytes after the rewrite.
     * @throws IOException
     */
    byte[] rewrite(WebResource resource) throws IOException;
}
```
Using the Default Implementations of ResourceRewriter
Site Capture comes with the following implementations of ResourceRewriter:
- NullResourceRewriter: Configured by default to skip the rewriting of links. If ResourceRewriter is not configured in the CrawlerConfigurator.groovy file, then NullResourceRewriter is injected by default.
- PatternResourceRewriter: Used to rewrite URLs based on a regular expression. PatternResourceRewriter takes as input a regular expression to match the links inside the markup and replaces those links with the string provided inside the constructor.

  To rewrite an absolute URL as a relative URL, from:

  <a href="http://www.example.com/about/index.html">Click Me</a>

  to:

  <a href="/about/index.html">Click Me</a>

  ```groovy
  /**
   * Factory method for a ResourceRewriter.
   * default: new NullResourceRewriter();
   * @return the rewritten resource modifies the html before it is saved to disk.
   */
  public ResourceRewriter createResourceRewriter() {
      new PatternResourceRewriter("http://www.example.com/([^\\s'\"]*)", '/$1');
  }
  ```

  PatternResourceRewriter has only one constructor, which takes a regular expression and a string replacement: PatternResourceRewriter(final String regex, final String replacement)
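The replacement semantics shown above follow standard Java regular expressions, so you can preview what a given pattern and replacement will produce with plain String.replaceAll, independent of the Site Capture classes. A standalone sketch of the regex mechanics in the example above (an illustration, not the PatternResourceRewriter class itself):

```java
public class RewriteDemo {
    public static void main(String[] args) {
        String markup =
                "<a href=\"http://www.example.com/about/index.html\">Click Me</a>";
        // Group 1 captures everything after the host; the replacement keeps it
        // as a root-relative path.
        String rewritten =
                markup.replaceAll("http://www.example.com/([^\\s'\"]*)", "/$1");
        System.out.println(rewritten); // prints: <a href="/about/index.html">Click Me</a>
    }
}
```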
Writing a Custom ResourceRewriter
Site Capture provides a sample resource rewriter (and link extractor) used by the FirstSiteII sample crawler to download WebCenter Sites' FirstSiteII dynamic website as a static site. For more information, see the source code for the FSIILinkExtractor class in the following folder:
<SC_INSTALL_DIR>/fw-site-capture/crawler/_sample/FirstSiteII/src
To write a custom resource rewriter, implement the ResourceRewriter interface and deploy your class with the crawler, as in the sketch below.
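The following is a minimal sketch of such an implementation, not the FirstSiteII source. As with the link extractor sketch earlier, the getContent() accessor on WebResource is an assumption; verify the actual API against the sample source in your installation.

```java
package com.example.crawler; // hypothetical package for this sketch

import java.io.IOException;

import com.fatwire.crawler.ResourceRewriter;
import com.fatwire.crawler.WebResource;

/**
 * Minimal custom ResourceRewriter sketch: strips the host prefix so that the
 * downloaded pages can be served relative to any document root.
 */
public class RelativizingRewriter implements ResourceRewriter {

    public byte[] rewrite(WebResource resource) throws IOException {
        // Assumption: WebResource exposes the downloaded markup as bytes.
        String markup = new String(resource.getContent());
        // Drop the host so all links become root-relative.
        String rewritten = markup.replaceAll("http://www.example.com/", "/");
        return rewritten.getBytes();
    }
}
```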
Mailer
A mailer sends email after the crawl ends. The implementation must be injected through the CrawlerConfigurator.groovy file.
Site Capture provides an SmtpTlsMailer implementation, which can be used to send the crawler report from the SMTP-TLS mail server. You can also implement the Mailer interface to provide custom logic for sending emails from a server other than SMTP-TLS (such as SMTP without authentication, or POP3). Your custom logic can also specify the email to be an object other than the crawler report. If Mailer is not configured in the CrawlerConfigurator.groovy file, then NullMailer is injected by default.
Mailer Interface
The sendMail method is automatically called if the Mailer is configured in the CrawlerConfigurator.groovy file.

```java
package com.fatwire.crawler;

import java.io.IOException;

import javax.mail.MessagingException;

/**
 * Service to send an email.
 */
public interface Mailer {
    /**
     * Sends the mail.
     *
     * @param subject
     * @param report
     * @throws MessagingException
     * @throws IOException
     */
    void sendMail(String subject, String report) throws MessagingException, IOException;
}
```
Using the Default Implementation of Mailer
Site Capture provides an SMTP-TLS server-based email implementation that sends out the crawler report when a static or archive crawl session ends. (The crawler report is the report.txt file, described in About Accessing Log Files in Administering Oracle WebCenter Sites.)
Use the default mailer by injecting it through the CrawlerConfigurator.groovy file, as shown below:

```groovy
/**
 * Factory method for a Mailer.
 * <p/>
 * default: new NullMailer().
 * @return mailer holding configuration to send an email
 * at the end of the crawl.
 * Should not be null.
 */
public Mailer createMailer() {
    try {
        // Creating a SmtpTlsMailer object.
        SmtpTlsMailer mailer = new SmtpTlsMailer();
        // Creating an internet address from which the mail should be sent.
        InternetAddress from = new InternetAddress("example@example.com");
        // Setting the from address inside the mailer object.
        mailer.setFrom(from);
        // Setting the email address of the recipient inside mailer.
        mailer.setTo(InternetAddress.parse("example@example.com"));
        // Setting the email server host to be used for email.
        // The email server should be SMTP-TLS enabled.
        mailer.setHost("smtp.gmail.com", 587);
        // Setting the credentials of the mail account.
        mailer.setCredentials("example@example.com", "examplepassword");
        return mailer;
    } catch (AddressException e) {
        log.error(e.getMessage());
    }
}
```
Writing a Custom Mailer
To write a custom mailer, implement the Mailer interface and inject your implementation through the CrawlerConfigurator.groovy file. A sketch follows.
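For example, a custom mailer for a plain SMTP server (no TLS, no authentication) can be built on the standard JavaMail API. The following is a hedged sketch; the host name and addresses are placeholders, not values from the Site Capture documentation:

```java
package com.example.crawler; // hypothetical package for this sketch

import java.io.IOException;
import java.util.Properties;

import javax.mail.Message;
import javax.mail.MessagingException;
import javax.mail.Session;
import javax.mail.Transport;
import javax.mail.internet.InternetAddress;
import javax.mail.internet.MimeMessage;

import com.fatwire.crawler.Mailer;

/**
 * Minimal custom Mailer sketch: sends the crawler report over plain SMTP
 * without authentication.
 */
public class PlainSmtpMailer implements Mailer {

    public void sendMail(String subject, String report)
            throws MessagingException, IOException {
        Properties props = new Properties();
        props.put("mail.smtp.host", "smtp.internal.example.com"); // placeholder host
        props.put("mail.smtp.port", "25");

        Session session = Session.getInstance(props);
        Message message = new MimeMessage(session);
        message.setFrom(new InternetAddress("crawler@example.com"));
        message.setRecipients(Message.RecipientType.TO,
                InternetAddress.parse("admin@example.com"));
        message.setSubject(subject);
        message.setText(report); // by default, report is the crawler report text
        Transport.send(message);
    }
}
```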
This implementation emails the crawler report (the report.txt file), given that the String report argument in the sendMail method names the crawler report by default. You can customize the logic to email objects other than the crawler report.
Summary of Methods and Interfaces
The default implementations of the methods and interfaces in the Site Capture BaseConfigurator class, which control a crawler's site capture process, are summarized here.
Methods
The methods of the Site Capture BaseConfigurator class are listed in Table 43-1. The factory methods are in the following interfaces:

- createLinkExtractor is in the LinkExtractor interface.
- createResourceRewriter is in the ResourceRewriter interface.
- createMailer is in the Mailer interface.
Interfaces
The following interfaces are used in the Site Capture BaseConfigurator class:
- LinkExtractor: Its default implementation is PatternLinkExtractor, which extracts links on the basis of a regular expression.

  Site Capture also provides a sample link extractor (and a sample resource rewriter), used by the FirstSiteII sample crawler to download WebCenter Sites' FirstSiteII dynamic website as a static site. Source code is available in the following folder:

  <SC_INSTALL_DIR>/fw-site-capture/crawler/_sample/FirstSiteII/src

  You can write and deploy your own custom link extraction logic.

- ResourceRewriter: Its default implementations are NullResourceRewriter, which skips the rewriting of links, and PatternResourceRewriter, which rewrites URLs based on a regular expression.

  Site Capture provides a sample resource rewriter (and a sample link extractor), used by the FirstSiteII sample crawler to download WebCenter Sites' FirstSiteII dynamic website as a static site. Source code is available in the following folder:

  <SC_INSTALL_DIR>/fw-site-capture/crawler/_sample/FirstSiteII/src

  You can write and deploy your own logic for rewriting URLs.

- Mailer: Its default implementation is SmtpTlsMailer, which sends the crawler report from the SMTP-TLS mail server. You can customize the logic for emailing other types of objects from other types of servers.