36 Using the Site Capture Application

Through the Site Capture application in WebCenter Sites, you can download sites in different modes and store them in different directories. You can manage and monitor Site Capture, as well as create and modify the crawlers that are run to download sites.

Topics:

36.1 Site Capture Model

You can initiate a crawl session manually from the Site Capture interface, or you can trigger one upon completion of a WebCenter Sites RealTime publishing session. In either scenario, the crawler downloads the website to disk in either static or archive mode, depending on how you choose to run the crawler.

Topics:

36.1.1 Capture Modes

When you download a site in either static or archive mode, the same files (HTML, CSS, and so on) are stored to disk, but the two modes differ in several ways. For example, statically downloaded sites are available only in the file system, whereas archived sites are available in both the file system and the Site Capture interface. Capture mode, then, determines how crawlers download sites and how you manage the results.

Table 36-1 Static Capture Mode and Archive Mode

Static Mode Archive Mode

Static mode supports rapid deployment, high availability scenarios.

Archive mode is used to maintain copies of websites on a regular basis for compliance purposes or similar reasons.

In static mode, a crawled site is stored as files ready to be served. Only the latest capture is kept (the previously stored files are overwritten).

In archive mode, all crawled sites are kept and stored as zip files (archives) in time-stamped folders. Pointers to the zip files are created in the Site Capture database.

You can initiate static crawl sessions manually from the application interface or after a publishing session. However, you can manage downloaded sites from the Site Capture file system only.

As with static sessions, you can initiate archive crawl sessions manually from the Site Capture interface or after a publishing session. However, because the zip files are referenced by pointers in the Site Capture database, you can manage them from the Site Capture interface: you can download the files, preview the archived sites, and set capture schedules.

For any capture mode, logs are generated after the crawl session to provide such information as crawled URLs, HTTP status, and network conditions. In static capture, you must obtain the logs from the file system. In archive capture, you can download them from the Site Capture interface. For any capture mode, you have the option of configuring crawlers to email reports as soon as they are generated.

36.1.2 Crawlers

Starting any type of site capture process requires you to define a crawler in the Site Capture interface. To help you get started quickly, Site Capture comes with two sample crawlers, Sample and FirstSiteII. This guide assumes the crawlers were installed during the Site Capture installation process and uses the Sample crawler primarily.

To create your own crawler, you must name the crawler (typically, after the target site), and upload a text file named CrawlerConfigurator.groovy, which controls the site capture process. You must code the groovy file with methods in the BaseConfigurator class that specify at least the starting URIs and link extraction logic for the crawler. Although the groovy file controls the site capture process, the capture mode is set outside the file.

To use a crawler for publishing-triggered site capture, you must take an additional step: name the crawler and specify its capture mode on the publishing destination definition on the source system that is integrated with Site Capture, as described in Configuring Site Capture with the Configurator in Installing and Configuring Oracle WebCenter Sites. (On every publishing destination definition, you can specify one or more crawlers, but only a single capture mode.) Information about the successful start of crawler sessions is stored in the Site Capture file system and in the log files (futuretense.txt, by default) of the source and target systems.

The exercises in this chapter cover both types of crawler scenarios: manual and publishing-triggered.

36.2 Logging in to the Site Capture Application

You access the Site Capture application by logging in to WebCenter Sites.

  1. Access WebCenter Sites at the following URL:
    http://<server>:<port>/<context>/login
    

    In the previous example, <server> is the host name or IP address of the server running WebCenter Sites, <port> is the port number of the WebCenter Sites application, and <context> is the name of the WebCenter Sites web application that is deployed on the server.

  2. Log in as a general administrator. Login credentials are case-sensitive. In this guide, we use the admin credentials.
  3. Click Login.
  4. If you are logging in for the first time, a dialog opens, prompting you to select a site and an application.

    Select the AdminSite (to which the Site Capture application is assigned by default) and select the Site Capture icon.

    The Crawlers page opens.

  5. If the default crawlers were installed with Site Capture, they are listed under the names Sample and FirstSiteII.
  6. Your next step depends on your requirements: to run the default crawlers, see Using the Default Crawlers; to define and run your own crawler, see Setting Up a Site Capture Operation.

36.3 Using the Default Crawlers

If you installed the default crawlers, Sample and FirstSiteII, with the Site Capture application, they are displayed in its interface. To define your own crawlers, see Defining a Crawler.

Topics:

36.3.1 Sample Crawler

You can use the Sample crawler to download any site. The purpose of the Sample crawler is to help you quickly download the site and to provide you with required configuration code, which you can reuse when creating your own crawlers. The Sample crawler is minimally configured with the required methods and an optional method that limits the duration of the crawl by limiting the number of links to crawl.

  • The required methods are getStartUri and createLinkExtractor (which defines the logic for extracting links from crawled pages).

  • The optional method is getMaxLinks, which specifies the number of links to crawl.

For more information about these methods, see Crawler Customization Methods in Developing with Oracle WebCenter Sites.
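
A minimal configurator built from these methods might look like the sketch below. It is illustrative only: the method names come from this chapter, but the class declaration, return types, and the PatternLinkExtractor call are assumptions made for this example. Start from the Sample crawler's own CrawlerConfigurator.groovy for the exact boilerplate, and see Crawler Customization Methods in Developing with Oracle WebCenter Sites for the supported API.

    // Illustrative CrawlerConfigurator.groovy sketch (not the shipped Sample file).
    // Method names come from this chapter; everything else is assumed for illustration.
    class CrawlerConfigurator extends BaseConfigurator {

      // Required: the URI (or URIs) where the crawl starts.
      public String[] getStartUri() {
        return ["http://www.example.com/home"];
      }

      // Required: the link-extraction logic. The extractor class and its arguments
      // are assumptions; see createLinkExtractor in Developing with Oracle
      // WebCenter Sites for the supported implementations.
      public LinkExtractor createLinkExtractor() {
        return new PatternLinkExtractor("href=\"(/[^\"]*)\"", 1);
      }

      // Optional: cap the number of links crawled to keep the run short.
      public int getMaxLinks() {
        return 150;
      }
    }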

36.3.2 FirstSiteII Crawler

The FirstSiteII crawler is used to download the dynamic FirstSiteII sample website as a static site. The purpose of the crawler is to provide you with advanced configuration code that shows how to create a custom link extractor and resource rewriter, using the LinkExtractor and ResourceRewriter interfaces. See Interfaces in Developing with Oracle WebCenter Sites.

36.3.3 Running a Default Crawler

In this section, you will run either the Sample crawler or the FirstSiteII crawler. Using the FirstSiteII crawler requires you to publish the WebCenter Sites FirstSiteII sample site.

  1. On the Crawlers page, point to a default crawler – Sample or FirstSiteII – and select Edit Configuration.

    Note:

    If the default crawlers are not listed, skip to Setting Up a Site Capture Operation to define your own crawler.

  2. Set the starting URI for the crawler by editing the crawler configuration file. For instructions, skip to step 1 in Defining a Crawler, and continue with the rest of the steps to run the crawler and manage its captured data.

36.4 Setting Up a Site Capture Operation

In this section, you will step through the process of creating and running your own crawler to understand how the Site Capture interface and file system are organized.

Topics:

36.4.1 Creating a Starter Crawler Configuration File

Before you can create a crawler, you must have a configuration file that controls the site capture process for the crawler. The fastest way to create a useful file is to copy sample code and recode, as necessary.

  1. Copy the Sample configuration file to your local computer in either of these ways:

    • Log in to the Site Capture application. If the Crawlers page lists the Sample crawler, do the following (otherwise, skip to the item directly below):

      1. Point to Sample and select Edit Configuration.

      2. Go to the Configuration File field, copy its code to a text file on your local computer, and save the file as CrawlerConfigurator.groovy.

    • Go to the Site Capture host computer and copy the CrawlerConfigurator.groovy file from <SC_INSTALL_DIR>/fw-site-capture/crawler/Sample/app/ to your local computer.

      Note:

      Every crawler is controlled by its own CrawlerConfigurator.groovy file. The file is stored in a custom folder structure. For example:

      When you define a crawler, Site Capture creates a folder bearing the name of the crawler (<crawlerName>, or Sample in our scenario) and places that folder in the following path: <SC_INSTALL_DIR>/fw-site-capture/crawler/. Within the <crawlerName> folder, Site Capture creates an /app subfolder to which it uploads the groovy file from your local computer.

      When the crawler is used for the first time in a given mode, Site Capture creates additional subfolders (in /<crawlerName>/) to store sites captured in that mode. See Managing Statically Captured Sites.

  2. Your sample groovy file specifies a sample starting URI, which you will reset when you create the crawler in the next step. (In addition to the starting URI, you can set crawl depth and similar parameters, call post-crawl commands, and implement interfaces to define logic specific to your target sites.)

    At this point, you can either customize the downloaded groovy file now, or first create the crawler and then customize its groovy file (which is editable in the Site Capture interface). To do the latter, continue to Defining a Crawler.

36.4.2 Defining a Crawler

To define a crawler:

  1. Go to the Crawlers page and click Add Crawler.

  2. On the Add Crawlers page:

    1. Name the crawler after the site to be crawled.

      Note:

      • After you save a crawler, it cannot be renamed.

      • This guide assumes that every custom crawler is named after the target site and is not used to capture any other site.

    2. Enter a description (optional). For example: "This crawler is reserved for publishing-triggered site capture" or "This crawler is reserved for scheduled captures."

    3. In the Configuration File field, browse to the groovy file that you created in Creating a Starter Crawler Configuration File.

    4. Save the new crawler.

      Your CrawlerConfigurator.groovy file is uploaded to the <SC_INSTALL_DIR>/fw-site-capture/crawler/<crawlerName>/app folder on the Site Capture host computer. You can edit the file directly in the Site Capture interface.

  3. Continue to Editing the Crawler Configuration File.

36.4.3 Editing the Crawler Configuration File

From the Site Capture interface, you can recode the entire crawler configuration file. In this example, we simply set the starting URI.

  1. On the Crawlers page, point to the crawler you just defined and select Edit Configuration.

    The Configuration File field displays the crawler's CrawlerConfigurator.groovy file, located in <SC_INSTALL_DIR>/fw-site-capture/crawler/<crawler name>/app.

  2. Set the starting URI for the crawler in the following method:
    public String[] getStartUri() {
      return ["http://www.example.com/home"];
    }

    Note:

    Note the following:

    You can set multiple starting URIs. They must belong to the same site. Enter a comma-separated array, as shown in the example below:

    public String[] getStartUri() {
      return ["http://www.example.com/product", "http://www.example.com/support"];
    }

    Your configuration file includes the createLinkExtractor method, which calls the logic for extracting the links to be crawled. The links are extracted from the markup that is downloaded during the crawl session. For additional information about this method and the extraction logic, see createLinkExtractor in Developing with Oracle WebCenter Sites.

    Your configuration file also includes the getMaxLinks() method, which specifies the number of links to crawl. Its default value is 150 to ensure a quick run (a sample override is shown after these steps). If you must stop a static capture for any reason, you have to stop the application server; archive captures can be stopped from the Site Capture interface.

    See Coding the Crawler Configuration File in Developing with Oracle WebCenter Sites.

  3. Click Save.
  4. Continue to Starting a Crawl.
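
As noted in step 2, the following is an illustrative override of the getMaxLinks() method for a larger site. The method name and its default of 150 come from this chapter; the int return type shown here is an assumption, so verify the signature in Coding the Crawler Configuration File in Developing with Oracle WebCenter Sites.

    // Illustrative only: raise the crawl limit beyond the default of 150.
    // The int return type is assumed for this sketch.
    public int getMaxLinks() {
      return 500;
    }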

36.4.4 Starting a Crawl

You can start a crawl in several ways. If you used a crawler in one mode, you can rerun it in a different mode:

36.4.4.1 Run the Crawler Manually in Static Mode

To run the crawler manually in static mode:

  1. On the Crawlers page, point to the crawler that you created and select Start Static Capture from the menu.

    The Crawlers page displays the following message when capture begins:

    "Success. Static capture started by crawler <crawlerName>."

  2. At this point, the Site Capture interface does not display any other information about the crawler or its process, nor does it make the downloaded site available to you. Instead, you will use the Site Capture file system to access the downloaded files and various logs:
    • To monitor the static capture process, look for the following files:

      • The lock file in <SC_INSTALL_DIR>/fw-site-capture/crawler/<crawlerName>/logs. The lock file is transient. It is created at the start of the static capture process to prevent the crawler from being called for an additional static capture. The lock file is deleted when the crawl session ends.

      • The crawler.log file in <SC_INSTALL_DIR>/fw-site-capture/logs/. (This file uses the term "VirtualHost" to mean "crawler.")

      • The inventory.db file in <SC_INSTALL_DIR>/fw-site-capture/crawler/<crawlerName>. This file lists the crawled URLs. The inventory.db file is used by the Site Capture system and must not be deleted or modified.

      • The audit.log, links.txt, and report.txt files are available in /fw-site-capture/crawler/<crawlerName>/logs/yyyy/mm/dd.

    • To access the downloaded files, go to <SC_INSTALL_DIR>/fw-site-capture/crawler/<crawlerName>/www.

    See Managing Statically Captured Sites.

36.4.4.2 Run the Crawler Manually in Archive Mode

To run the crawler manually in archive mode:

  1. On the Crawlers page, point to the crawler that you created and select Start Archive.

    A comment dialog opens.

  2. In the dialog, add a comment about the upcoming job:

    Note:

    You cannot add a comment after a crawler starts running.

    If you add a comment in the dialog, the comment appears in the following places:

    • Job Comment field on the Job Details page (shown in the next step).

    • Job Comment field on the Jobs page.

    • Comment field on the Archives page.

  3. Click Start Archive.

    The Job Details page opens, where you can manage the archive process in several ways. To follow this exercise, click Refresh (next to Job State) until Finished is displayed.

    • Refresh is shown as long as the job state is Scheduled or Running. Clicking Refresh updates the displayed job state. Possible job states are Scheduled, Running, Finished, Stopped, and Failed.

    • Stop Archive ends the crawler's session. Any captured resources are archived, and the job state changes from Running to Finished (click Refresh to see the change).

    • Preview is displayed when the job state is Finished. Clicking Preview displays the archived site.

    • Cancel redirects you to the Jobs page. The crawler, if running, continues to run.

  4. When the archive crawl ends, results are made available in the Site Capture interface. For example:
    • The crawler report opens on the Job Details page. The report lists the number of downloaded resources, their total size and download time, network conditions, HTTP status codes, and additional notes as necessary.

    • Click Preview on the Job Details page to render the archived site. Next to the site is the Archive ID table with archive management options, which are displayed when you point to an archive.

      Note:

      If your archived site contains links to external domains, its preview is likely to include those links, especially when the crawl depth and number of links to crawl are set to large values (in the CrawlerConfigurator.groovy file). Although the external domains can be browsed, they are not archived.

    • For a summary of pathways to various data, see About Managing Archived Sites.

36.4.4.3 Schedule the Crawler for Archive Capture

Only archive captures can be scheduled. For a given crawler, you can create multiple schedules – for example, one for capturing periodically, and another for capturing at a particular and unique time.

Note:

If you set multiple schedules, ensure that they do not overlap.

  1. Go to the Crawlers page, point to the crawler that you created, and select Schedule Archive.
  2. Click Add Schedule and make selections on all calendars: Days, Dates, Months, Hours, and Minutes.
  3. Click Save and add another schedule if necessary.

36.4.4.4 About Publishing a Site in RealTime Mode

If you configure your WebCenter Sites publishing systems to communicate with the Site Capture application, you can set up a RealTime publishing process to call one or more crawlers to capture the newly published site. For instructions, see Enabling Publishing-Triggered Site Capture.

36.4.5 About Managing Captured Data

For more information about accessing various data associated with static and archive captures, see Managing Statically Captured Sites.

Notes and Tips for Managing Crawlers and Captured Data presents a collection of notes and tips to bear in mind when managing crawlers and captured data.

36.5 Enabling Publishing-Triggered Site Capture

An administrative user can configure as many publishing destination definitions for Site Capture as necessary and can call as many crawlers as necessary. The following are the main steps for enabling publishing-triggered site capture:

36.5.1 About Integrating the Site Capture Application with Oracle WebCenter Sites

You can enable site capture after a RealTime publishing session only if the Site Capture application is first integrated with the source and target systems used in the publishing process. If Site Capture is not integrated, see Integrating Site Capture with the WebCenter Sites Publishing Process in Installing and Configuring Oracle WebCenter Sites for integration instructions, then continue with the steps below.

36.5.2 Configuring a RealTime Publishing Destination Definition for Site Capture

When configuring a publishing destination definition, you name the crawlers that are called after the publishing session. You also specify the capture mode.

  1. Go to the WebCenter Sites source system that is integrated with the Site Capture application.

    1. Create a RealTime publishing destination definition pointing to the WebCenter Sites target system that is integrated with Site Capture. See Adding a New RealTime Destination Definition.

    2. In the More Arguments section of the publishing destination definition, name the crawlers to be called after the publishing session, and set the capture mode by using the following parameters:

      • CRAWLERCONFIG: Specify the name of each crawler. If you use multiple crawlers, separate their names with a semicolon (;).

        Examples:

        For a single crawler: CRAWLERCONFIG=crawler1

        For multiple crawlers: CRAWLERCONFIG=crawler1;crawler2;crawler3

        Note:

        The crawlers that you specify here must also be configured and identically named in the Site Capture interface. Crawler names are case-sensitive.

      • CRAWLERMODE: To run an archive capture, set this parameter to dynamic. By default, static capture is enabled.

        Example: CRAWLERMODE=dynamic

        Note:

        • If CRAWLERMODE is omitted or set to a value other than dynamic, static capture starts when the publishing session ends.

        • You can set both crawler parameters in a single statement as follows: CRAWLERCONFIG=crawler1;crawler2&CRAWLERMODE=dynamic

        • While you can specify multiple crawlers, you can set only one mode. All crawlers run in that mode. To run some crawlers in a different mode, configure another publishing destination definition.

  2. Continue to the next section.

36.5.3 Matching Crawlers

Crawlers named in the publishing destination definition must exist in the Site Capture interface. Do the following:

  1. Verify that crawler names in the destination definition and Site Capture interface are identical. The names are case-sensitive.
  2. Ensure that a valid starting URI for the target site is set in each crawler configuration file. For information about navigating to the crawler configuration file, see Editing the Crawler Configuration File. For more information about writing configuration code, see Coding the Crawler Configuration File in Developing with Oracle WebCenter Sites.

36.5.4 Managing Site Capture

To manage site capture:

  1. When you have enabled publishing-triggered site capture, you are ready to publish the target site. When publishing ends, site capture begins. The crawlers capture pages in either static or archive mode, depending on how you set the CRAWLERMODE parameter in the publishing destination definition (step b in Configuring a RealTime Publishing Destination Definition for Site Capture).

  2. To monitor the site capture process:

    • For static capture, the Site Capture interface does not display any information about the session, nor does it make the captured site available to you.

      • To determine that the crawlers were called, open the futuretense.txt file on the source or target WebCenter Sites system.

        Note:

        The futuretense.txt file on the WebCenter Sites source and target systems contains crawler startup status for any type of crawl: static and archive.

      • To monitor the capture process, go to the Site Capture file system and review the files that are listed in step 2 in Run the Crawler Manually in Static Mode.

    • For dynamic capture, you can view the status of the crawl from the Site Capture interface.

      1. Go to the Crawlers page, point to the crawler, and select Jobs from the pop-up menu.

      2. On the Job Details page, click Refresh next to the Job State until you see "Finished." (Possible values for Job State are Scheduled, Running, Finished, Stopped, and Failed.) For more information about the Job Details page, see steps 3 and 4 in Run the Crawler Manually in Archive Mode.

  3. Manage the captured data.

    When the crawl session ends, you can manage the captured site and associated data as described in Managing Statically Captured Sites and About Managing Archived Sites.

36.6 Managing Statically Captured Sites

For every crawler that a user creates in the Site Capture interface, Site Capture creates an identically named folder in its file system. This custom folder, <crawlerName>, is used to organize the configuration file, captures, and logs for the crawler, as shown in Figure 36-4, while Table 36-2 describes the <crawlerName> folder and its contents.

Note:

To access static captures and logs, you must use the file system. Archive captures and logs are managed from the Site Capture interface (their location in the file system is included in this section).

Figure 36-4 Site Capture Custom Folders: <crawlerName>


Table 36-2 <crawlerName> Folder and its Contents

Folder Description

/fw-site-capture/crawler/<crawlerName>

Represents a crawler. For every crawler that a user defines in the Site Capture interface, Site Capture creates a /<crawlerName> folder. For example, if you installed the sample crawlers FirstSiteII and Sample, both crawlers are listed in the Site Capture interface, and have identically named folders in the Site Capture file system.

Note: In addition to the subfolders (described below), the <crawlerName> folder contains an inventory.db file, which lists statically crawled URLs. The file is created when the crawler takes its first static capture. Do not delete or modify inventory.db. It is used by the Site Capture system.

/fw-site-capture/crawler/<crawlerName>/app

Contains the CrawlerConfigurator.groovy file. Its code controls the crawl process. The /app folder is created when the crawler is created and saved.

/fw-site-capture/crawler/<crawlerName>/archive

The /archive folder is used strictly for archive capture. This folder contains a hierarchy of yyyy/mm/dd subfolders. The /dd subfolder stores all of the archive captures in time-stamped zip files.

The /archive folder is created when the crawler first runs in archive mode. The zip files (located in /dd) are referenced in the database and therefore made available in the Site Capture interface for you to download and display as websites.

Note: Archive captures are accessible from the Site Capture interface. Each zip file contains a URL log named __inventory.db. Do not delete or modify __inventory.db. It is used by the Site Capture system.

/fw-site-capture/crawler/<crawlerName>/www

Contains the latest statically captured site only (when the same crawler is rerun in static mode, it overwrites the previous capture). The site is stored as html, css, and other files that are readily served.

The /www folder is created when the crawler first runs in static mode.

Note: Static captures are accessible from the Site Capture file system only.

/fw-site-capture/crawler/<crawlerName>/logs/yyyy/mm/dd

Contains log files with information about crawled URLs. Log files are stored in the /dd subfolders and named as shown at the end of this description.

  • The audit.log file lists the crawled URLs with data such as timestamps, crawl depth, HTTP status, and download time.

  • The links.txt file lists the crawled URLs.

  • The report.txt file lists the overall crawl statistics, such as the number of downloaded resources, their total size, download time, and network conditions. For archive capture, this report is available in the Site Capture interface as the crawler report on the Job Details form. (Paths to the Job Details form are shown in Figure 36-6.)

Note: If the crawler captured in both static mode and archive mode, the /dd subfolders contain logs for static captures and archive captures.

The /logs folder is also used to store a transient file named lock. The file is created at the start of the static capture process to prevent the crawler from being called for additional static captures. The lock file is deleted when the crawl session ends.

The folders under logs/yyyy/mm/ contain the following logs:

  • <yyyy-mm-dd-hh-mm-ss>-audit.log

  • <yyyy-mm-dd-hh-mm-ss>-links.txt

  • <yyyy-mm-dd-hh-mm-ss>-report.txt

36.7 About Managing Archived Sites

You can manage archived sites from different forms of the Site Capture interface. The following figure shows some pathways to various information: archives, jobs, site preview, crawler report, and URL log:

Figure 36-6 Paths to Archive Information

  • For example, to preview a site, start at the Crawlers form, point to a crawler (crawlerName), select Archives from the pop-up menu (which opens the Archives form), point to an Archive ID, and select Preview from the pop-up menu.

  • Dashed lines represent multiple paths to the same option. For example, to preview a site, you can follow the Archives path, Jobs path, or Start Archive path for the crawler. To download an archive, you can follow the Archives path or the Jobs path.

  • The crawler report and URL log are marked by an asterisk (*).

36.8 Notes and Tips for Managing Crawlers and Captured Data

The following topics summarize notes and tips for managing crawlers and captured data:

36.8.1 Tips for Creating and Editing Crawlers

When creating crawlers and editing their configuration code, consider the following information:

  • Crawler names are case-sensitive.

  • Every crawler configuration file is named CrawlerConfigurator.groovy. You must not change this name.

  • You can configure a crawler to start at one or more seed URIs on a given site and to crawl one or more paths. You can use additional Java methods to set parameters such as crawl depth, call post-crawl commands, specify session timeout, and more. You can implement interfaces to define logic for extracting links, rewriting URLs, and sending email after a crawl session. See Coding the Crawler Configuration File in Developing with Oracle WebCenter Sites.

  • When a crawler is created and saved, its CrawlerConfigurator.groovy file is uploaded to the Site Capture file system and made editable in the Site Capture interface.

  • While a crawler is running a static site capture process, you cannot use it to run a second static capture process.

  • While a crawler is running an archive capture process, you can use it to run a second archive capture process. The second process is marked as Scheduled and starts after the first process ends.

36.8.2 Notes for Deleting a Crawler

If you must delete a crawler (which includes all of its captured information), do so from the Site Capture interface, not the file system. Deleting a crawler from the interface prevents broken links. For example, if a crawler ran in archive mode, deleting it from the interface removes two sets of information: the archives and logs, and the database references to those archives and logs. Deleting the crawler from the file system retains database references to archives and logs that no longer exist, thus creating broken links in the Site Capture interface.

36.8.3 Notes for Scheduling a Crawler

Only archive crawls can be scheduled.

  • When setting a crawler schedule, consider the publishing schedule for the site and avoid overlapping the two.

  • You can create multiple schedules for a single crawler – for example, one schedule to call the crawler periodically, and another schedule to call the crawler at one particular and unique time.

  • When creating multiple schedules, ensure they do not overlap.

36.8.4 About Monitoring a Static Crawl

To determine whether a static crawler session is in progress or completed, look for the crawler lock file in the <SC_INSTALL_DIR>/fw-site-capture/crawler/<crawlerName>/logs folder. The lock file is transient. It is created at the start of the static capture process to prevent the crawler from being called for an additional static capture. The lock file is deleted when the crawler session ends.

36.8.5 About Stopping a Crawl

Before running a crawler, consider the number of links to be crawled and the crawl depth, both of which determine the duration of the crawler session.

  • If you must terminate an archive crawl, use the Site Capture interface. (Select Stop Archive on the Job Details form.)

  • If you must terminate a static crawl, you must stop the application server.

36.8.6 About Downloading Archives

Avoid downloading large archive files (exceeding 250MB) from the Site Capture interface. Instead, use the getPostExecutionCommand method to copy the files from the Site Capture file system to your preferred location.

You can obtain the archive size from the crawler report, on the Job Details form. Paths to the Job Details form are shown in Figure 36-6. See getPostExecutionCommand in Developing with Oracle WebCenter Sites.
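
For example, the following is an illustrative sketch of such an override in CrawlerConfigurator.groovy. The method name comes from this chapter; the String return type, the idea that the returned value is the command line to run after the crawl, and the script path are assumptions, so confirm the supported signature in getPostExecutionCommand in Developing with Oracle WebCenter Sites before relying on it.

    // Illustrative only: run a post-crawl command that copies new archives
    // out of the Site Capture file system to shared storage.
    // The String return type and the script path are assumptions for this sketch.
    public String getPostExecutionCommand() {
      return "/opt/scripts/copy-latest-archive.sh";
    }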

36.8.7 Notes About Previewing Sites

If your archived site contains links to external domains, its preview is likely to include those links, especially when the crawl depth and number of links to crawl are set to large values (in the groovy file). Although the external domains can be browsed, they are not archived.

36.8.8 Tips for Configuring Publishing Destination Definitions

  • If you are running publishing-triggered site capture, you can set crawler parameters in a single statement on the publishing destination definition:

    CRAWLERCONFIG=crawler1;crawler2&CRAWLERMODE=dynamic

  • While you can specify multiple crawlers on a publishing destination definition, you can set one capture mode only. All crawlers run in that mode. To run some crawlers in a different mode, configure another publishing destination definition.

36.8.9 About Accessing Log Files

  • For statically captured sites, log files are available in the Site Capture file system only:

    • The inventory.db file, which lists statically crawled URLs, is located in the /fw-site-capture/crawler/<crawlerName> folder.

    Note:

    The inventory.db file is used by the Site Capture system. It must not be deleted or modified.

    • The crawler.log file is located in the <SC_INSTALL_DIR>/fw-site-capture/logs/ folder. (The crawler.log file uses the term "VirtualHost" to mean "crawler.")

  • For statically captured and archived sites, a common set of log files exists in the Site Capture file system:

    • audit.log, which lists the crawled URLs, timestamps, crawl depth, HTTP status, and download time.

    • links.txt, which lists the crawled URLs.

    • report.txt, which is the crawler report.

    The files named above are located in the following folder:

    /fw-site-capture/crawler/<crawlerName>/logs/yyyy/mm/dd

    Note:

    For archived sites, report.txt is also available in the Site Capture interface, on the Job Details form, where it is called the Crawler Report. (Paths to the Job Details form are shown in Figure 36-6.)

  • The archive process also generates a URL log for every crawl. The log is available in two places:

    • In the Site Capture file system, where it is called __inventory.db. This file is located within the zip file in the following folder:

      /fw-site-capture/crawler/<crawlerName>/archive/yyyy/mm/dd

      Note:

      The __inventory.db file is used by the Site Capture system. It must not be deleted or modified.

    • In the Site Capture interface, in the Archived URLs form (whose path is shown in Figure 36-6).

36.9 General Directory Structure

The Site Capture file system provides the framework in which Site Capture organizes custom crawlers and their captured content. The file system is created during the Site Capture installation process to store installation-related files, property files, sample crawlers, and sample code used by the FirstSiteII crawler to control its site capture process.

The following figure shows the folders most frequently accessed in Site Capture to help administrators find commonly used Site Capture information. All folders, except for <crawlerName>, are created during the Site Capture installation process. For information about <crawlerName> folders, see the following table and Custom Folders.

Figure 36-7 Site Capture File System


Table 36-3 Site Capture Frequently Accessed Folders

Folder Description

/fw-site-capture

The parent folder.

/fw-site-capture/crawler

Contains all Site Capture crawlers, each stored in its own crawler-specific folder.

/fw-site-capture/crawler/_sample

Contains the source code for the FirstSiteII sample crawler.

Note: Folder names beginning with the underscore character ("_") are not treated as crawlers. They are not displayed in the Site Capture interface.

/fw-site-capture/crawler/Sample

Represents a crawler named "Sample." This folder is created only if the "Sample" crawler is installed during the Site Capture installation process.

The Sample folder contains an /app folder, which stores the CrawlerConfigurator.groovy file specific to the "Sample" crawler. The file contains basic configuration code for capturing any dynamic site. The code demonstrates the use of required methods (such as getStartUri) in the BaseConfigurator class.

When the Sample crawler is called in static or archive mode, subfolders are created within the /Sample folder.

/fw-site-capture/logs

Contains the crawler.log file, a system log for Site Capture.

/fw-site-capture/publish-listener

Contains the following files required for installing Site Capture for publishing-triggered crawls:

  • fw-crawler-publish-listener-1.1-elements.zip

  • fw-crawler-publish-listener-1.1.jar

/fw-site-capture/Sql-Scripts

Contains the following scripts, which create database tables that are required by Site Capture to store its data:

  • crawler_db2_db.sql

  • crawler_oracle_db.sql

  • crawler_sql_server_db.sql

/fw-site-capture/webapps

Contains the ROOT/WEB-INF/ folder.

/fw-site-capture/webapps/ROOT/WEB-INF

Contains the log4j.xml file, used to customize the path to the crawler.log file.

/fw-site-capture/webapps/ROOT/WEB-INF/classes

Contains the following files:

  • sitecapture.properties file, where you can specify information for the WebCenter Sites application on which Site Capture is running. The information includes the WebCenter Sites host computer name (or IP address) and port number.

  • root-context.xml file, where you can configure the Site Capture database.

36.10 Custom Folders

A custom folder is created for every crawler that a user creates in the Site Capture interface. The custom folder, <crawlerName>, is used to organize the configuration file, captures, and logs for the crawler, as summarized in the following figure.

Figure 36-8 Site Capture Custom Folders: <crawlerName>
